
Cost of Coordination

The second cycle focuses on the cognitive costs of coordination.

Software engineering is inherently a joint activity that requires coordination, particularly during incident response, where information and expertise are shared across a distributed network. In this cycle we are researching the costs of maintaining and sustaining the productive efforts of the network as an incident evolves over time.

Coping with Complexity

The Stella report (the results from Cycle 1 of the Consortium) is available for download here.

Executive Summary

Current generation internet-facing technology platforms are complex and prone to brittle failure. Without the continuous effort of engineers to keep them running they would stop working -- many in days, most in weeks, all within a year. These platforms remain alive and functioning because workers are able to detect anomalies, diagnose their sources, remediate their effect, and repair their flaws and do so ceaselessly -- SNAFU Catching. Yet we know little about how they accomplish this vital work and even less about how to support them better in doing it.

During the past year a consortium including Etsy, IBM, IEX, and Ohio State University has explored issues around software engineering as it relates to internet-facing business platforms. Technical teams from the consortium partners met for a workshop on coping with complexity. Each team presented a technical summary of a breakdown that occurred in their shop. The other teams commented. The Ohio State team facilitated and summarized emerging themes. Six themes were identified and discussed.

  1. Capturing the value of anomalies through postmortems

  2. Blame versus sanction in the aftermath of anomalies

  3. Controlling the costs of coordination during anomaly response

  4. Supporting work through improved visualizations

  5. The strange loop quality of anomalies

  6. Dark debt

The workshop provides a model for the deep, insightful inquiry that occurs when technical groups collaborate on anomaly analysis. Spin-offs from this effort will focus on building capacity for conducting this work and creating the tooling and processes necessary to assure efficient and effective response to incidents and post-event reviews.
