Where Complex Systems Fail
A while ago I wrote a telegraphic piece entitled How Complex Systems Fail that, although written primarily for medicos, resonated with the IT community.
Complex systems do fail, of course, and sometimes in surprising, uncomfortable, and even disturbing ways. Close contact with these failures is unsettling. It reminds us of how poorly calibrated we are about how systems really work -- in contrast with how we think they work.
People who operate, troubleshoot, repair, and resuscitate complex systems are quite remarkable. They struggle to maintain a good enough understanding of their systems to accomplish this work. They often possess subtle, tacit knowledge of behaviors and characteristics of the systems that allows them to detect when the system is beginning to fail.
For the past few years a group of us have been studying how workers cope with the complexity of incidents in large, running, software-intensive systems. These are dauntingly complex systems where incidents are common. As a researcher, I am deeply impressed by how quickly, deliberately, and efficiently the workers can discover an incident's characteristics, divine its sources, and cobble together immediate responses. They typically do this without a lot of fuss and often without much recognition. But their ability to do this tells me that they know a lot about the underlying system and, in particular, where to -- and how to -- look when things are not working correctly.
We have been fortunate to be able to work with talented people from different companies on a sort of "deep dive" into this work. The SNAFUcatchers consortium has provided a forum for these people to get together and compare incidents in their systems. The Stella report (named because the meeting took place in Brooklyn during the Stella winter storm and because we all love Tennessee Williams ;-) ) captures some of the themes that came out of that meeting. Many interesting questions -- and a few hard ones -- emerged from that meeting, and we will be investigating and capitalizing on them for a long while!
I've been thinking since that meeting about how hugely capable these SNAFUcatchers are and how they got that way, and about the more general question of how they maintain that powerful knowledge and ability in the face of the continuously changing complexity of the system.
David Woods points out that the accuracy of a single person's (well, he calls it "cognitive agent's") model of a system necessarily decreases as the complexity of that system increases. At best a person's model will imperfectly represent a fraction of the system.
How can the people who operate, troubleshoot, repair, and resuscitate complex systems stay abreast of what matters given that their models of that system are necessarily inaccurate? The pattern of problems in the system is constantly changing. Simply doing a survey of the entire system is not likely to be productive. There is so much to know and so much of that is changing. [As evidence of this, consider how many 'Wikis' have been created to describe systems or parts of systems and how quickly those well-intended efforts go stale!]
The SNAFUcatcher discussions suggest that incidents are useful in this respect. An incident is the message from the underlying system about where the people who made and who operate that system are uncalibrated. Incidents point to areas where vulnerabilities lie and where further attention, inquiry, and calibration are likely to be productive.
I have come to think of incidents as system messages that take the form of untyped pointers (void *incident). They point to a region of the system where something interesting is happening. Like any untyped pointer they are informative only about location, implying nothing about what will be found there. Incidents are not meaningful in themselves -- we effectively 'cast' the type by the way we investigate and analyze them.
The SNAFUcatchers discussions described in the Stella report showed us that our colleagues are quite sensitive to the incompleteness and inaccuracy of their own system models. They use incidents as pointers to areas of the system where their models are in need of calibration. Indeed they seem to be more or less continuously recalibrating their system models using incidents.
This suggests that our approach to incident handling could be improved. We might consider treating incidents as untyped pointers to important areas of the system, areas worthy of more examination, exploration, and contemplation. This is likely to be difficult. The pointers are, after all, untyped, which means that it is up to us to devise ways to decode (might I say, decompile?) that region. It cannot be a simple task. But when it is passed to us, this void * is the most unambiguous information we can have about where our understanding of the system could be profitably improved.
Thanks to John Allspaw, Will Gallego, and Andrew Morrison for review and suggestions!
P.S. Like other analogies, this one is easy to take too far. It occurred to me that void pointers need not be politely aligned on data boundaries and that casting to an incompatible data type can generate meaningless results. Just so with our investigations of incidents: it is quite possible to misunderstand an incident rather badly, because miscasting the void pointer makes subsequent use of the 'data' worse than useless. But, as I say, one can go too far with these things!