Cost of Coordination
The second cycle focuses on the cognitive costs of coordination.
Software engineering is inherently a joint activity that requires coordination, particularly during incident response, where information and expertise are shared across a distributed network. In this cycle we are researching the costs of sustaining the network's productive efforts as an incident evolves over time.
Coping with Complexity
The Stella report (the results from Cycle 1 of the Consortium) is available for download here.
Current-generation internet-facing technology platforms are complex and prone to brittle failure. Without the continuous effort of engineers to keep them running, they would stop working -- many in days, most in weeks, all within a year. These platforms remain alive and functioning because workers are able to detect anomalies, diagnose their sources, remediate their effects, and repair their flaws, and do so ceaselessly -- SNAFU Catching. Yet we know little about how they accomplish this vital work and even less about how to support them better in doing it.
During the past year, a consortium including Etsy, IBM, IEX, and Ohio State University has explored issues around software engineering as it relates to internet-facing business platforms. Technical teams from the consortium partners met for a workshop on coping with complexity. Each team presented a technical summary of a breakdown that occurred in their shop, and the other teams commented. The Ohio State team facilitated and summarized emerging themes. Six themes were identified and discussed.
Capturing the value of anomalies through postmortems
Blame versus sanction in the aftermath of anomalies
Controlling the costs of coordination during anomaly response
Supporting work through improved visualizations
The strange loop quality of anomalies
Dark debt
The workshop provides a model for the deep, insightful inquiry that occurs when technical groups collaborate on anomaly analysis. Spin-offs from this effort will focus on building capacity for conducting this work and creating the tooling and processes necessary to assure efficient and effective response to incidents and post-event reviews.