Part 1: brittle failure, blame-and-train, and more productive reactions to failure
A recent posting by cscareerthrowaway567 is remarkable and worth reading. [Thanks to John Allspaw for pointing this out to me!] Like many such stories (and there are many), it captures three connected aspects of the modern conundrum.
Today was my first day on the job...[and]... [I] screwed up badly. I was basically given a document detailing how to setup my local development environment... [but]... those values were actually for the production database... [T]he tests add fake data, and clear existing data between test runs which basically cleared all the data from the production database. Honestly i had no idea what i did and it wasn't about 30 or so minutes after did someone actually figure out/realize what i did.
While what i had done was sinking in. The CTO told me to leave and never come back. He also informed me that apparently legal would need to get involved due to severity of the data loss. I basically offered and pleaded to let me help in someway to redeem my self and i was told that i "completely fucked everything up". [cscareerthrowaway567, reddit post, June 2, 2017]
The nearly 5000 [!] comments on this post are remarkable and well worth sampling. Commenters defended cscareerthrowaway567, offered advice, and reported similar experiences. The gist of the comment stream was that cscareerthrowaway567 was treated inappropriately, should not be considered responsible for the outcome, was better off not working for such a CTO and company, and that the company's IT is likely of very low quality.
tl;dr: Catastrophic system failures are remarkably common in IT-dependent environments. The reaction to such failures varies but is often some version of blame-and-train. There are a number of problems with blame-and-train, but perhaps the most important is that it is a form of organizational blindness that forestalls improvement.
These failures are markers of systemic brittleness, the inverse of resilience.
The blame-and-train reaction is a diversion, a red herring, and counterproductive; it increases brittleness.
There are productive reactions to failure but they are difficult to accomplish, especially when the failure has big consequences.
Brittleness vs. resilience and the nature of IT failures
IT-dependent enterprises strive for failure-free performance but routinely fail to achieve it. Celebrated failures like the 27 May 2017 British Airways debacle, and reports such as the one excerpted above, are examples of brittle failures. The susceptibility of big IT structures to such sudden, catastrophic, and unanticipated failures parallels experience in other industries, e.g. transportation and medicine, where catastrophic outcomes seem to flow from minor or even trivial sources. Several reddit commenters point out that leaving a production system exposed to unintentional destruction is tantamount to IT malpractice. But as egregious as this case appears, it is clear that many big IT collections are vulnerable to such brittle failures.
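One commonly recommended guard against exactly this class of accident is a check that refuses to run destructive test setup against anything but an explicitly whitelisted local database. The post does not describe the company's actual tooling; the sketch below is a minimal, hypothetical Python illustration (the names `ALLOWED_TEST_HOSTS` and `assert_safe_test_database` are my own, not from the post):

```python
from urllib.parse import urlparse

# Hypothetical safeguard: only these hosts may be targeted by tests
# that add fake data and clear existing data between runs.
ALLOWED_TEST_HOSTS = {"localhost", "127.0.0.1"}

def assert_safe_test_database(database_url: str) -> None:
    """Raise before any data-clearing step if the URL does not
    point at a known-safe local host."""
    host = urlparse(database_url).hostname or ""
    if host not in ALLOWED_TEST_HOSTS:
        raise RuntimeError(
            f"Refusing to run destructive tests against {host!r}; "
            "only local test databases are permitted."
        )

# A setup document that pastes production credentials into a new
# hire's environment would fail loudly here instead of silently
# wiping data.
assert_safe_test_database("postgres://localhost:5432/devdb")  # passes
try:
    assert_safe_test_database("postgres://db.prod.example.com:5432/app")
except RuntimeError as err:
    print(err)
```

The point is not this particular check but where the defense lives: in the system itself rather than in the vigilance of a first-day employee following a flawed document.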
The opposite of brittleness is resilience. There is growing interest in understanding the sources of resilience and methods for building resilient IT. Studies of resilient systems (cf. Allspaw, Dekker et al., Woods, Cook & Nemeth) reveal that deliberate goal sacrifice, resource redirection, bottleneck anticipation, and other such activities are common. Incorporating these into IT-heavy systems remains challenging. We want our IT not to fail in such brittle ways, but we are not yet sure how to create and sustain systems that achieve that end.
Blame and Train
The reported CTO reaction to the event in the case excerpted above was abrupt and dramatic but not qualitatively much different from what usually happens after accidents in IT and elsewhere. For a variety of reasons, organizational reactions to failure usually focus on an individual and an error. It's possible to write books about this (for example, this one) and there is much to be said about why this approach satisfies specific needs ('error' is actually quite useful in an anti-pattern sort of way).
The typical remediation for an isolated, individual error is usually not firing but something less severe, often some indoctrination or re-education to ensure that the particular action is not repeated. This pairing of blame on a person with remedial training is such a recurring feature of after-accident reactions that it has its own name: blame-and-train.
Blame-and-train is now recognized as worse than ineffective. This is easy to see in the reddit case above. If we take the story at face value, identifying cscareerthrowaway567 as an offender might have allowed the CTO to vent his/her spleen but it did nothing to reduce the brittleness in the IT environment. As so many commenters pointed out, the exposed production system, the poorly written guide to setting up a new computer and, perhaps most important, the utter lack of assistance from or engagement with the technical staff were systemic factors that promoted the eventual failure. Localizing the source of failure in cscareerthrowaway567 was not a useful response. No amount of training could remediate those systemic factors.
[Some will observe that summary execution of cscareerthrowaway567 can hardly be described as 'training'. In this case it is the others in the workplace who are being trained about what the response to any future problem is likely to be.]
"Blame-and-train" has been thoroughly discredited and the term is now used derisively. Even so, blame-and-train remains the most common response to failures -- especially ones with significant costs or injuries.
Why does blame-and-train remain so entrenched?
It is the easiest path to follow because it requires no other change.
Localizing the failure in an individual keeps attention away from sore spots and contentious issues in the workplace.
There is a tight relationship between "human error" and blame-and-train and there is even an academic paper on the connection.
Productive Reactions to Failure
Efforts to make reactions to failure more productive go back many decades but the current view had its origins in research that followed the nuclear power plant accident at Three Mile Island in 1979. This view, sometimes called "The New Look", treats complex system failures as highly encoded signals sent from the underlying system about itself.
The main goals of after-failure work include:
decode the signals
lay out the multiple influences and conditions that gave rise to the specific situation
identify similar situations or conditions
discover how the human operators normally detect, disarm, recover, or redirect the system trajectory towards desirable outcomes
find ways to enhance their ability to do this
In contrast to the usual blame-and-train story told after failure, this "second story" often contains valuable insights and opportunities for change that pay back the effort needed to produce it. Doing this work can be difficult and it is especially challenging after a high-value failure when control of the story is socially important [again, as noted by the reddit commentaries!].
Without knowing more about the cscareerthrowaway567 case it is hard to say what productive reactions to the failure would look like. Based on the sketch provided by cscareerthrowaway567 it seems unlikely that productive reactions will be forthcoming.
Next: How are bad outcomes usually avoided?
Postscript: Providing assistance to the people most directly affected by a catastrophic failure has long been recognized as morally and practically useful. Over the past 30 years it has become clear that those close to the failure -- including operators and other practitioners -- can sustain significant psychological damage from their proximity to a bad outcome. This injury can be magnified by being blamed -- openly or subtly -- for the outcome. In some cases the psychological trauma can be so severe as to be incapacitating. Associates, supervisors, and managers can help reduce the damage by being prepared to recommend and facilitate access to specialists outside the workplace. Many organizations have external employee resources for this purpose.