Thursday, August 20, 2009

Sweeping Out the Hump

This post on Lone Gunman got me thinking about Joel Spolsky's use of Taiichi Ohno's Five Whys, coupled with the practice of Fixing Things Two Ways, as a resilience mechanism for non-Black Swan problems. When something (e.g., a server) fails, it has to be repaired, but the underlying causes of the failure (the successive levels uncovered by the Five Whys) should be fixed as well. This strategy differs from the standard metrics-based approach in that it produces a system where the costs from causes that have already been encountered go down over time.

Spolsky gives the excellent example of server downtime in comparison to airplane crashes:

Measuring the number of minutes of downtime per year does not predict the number of minutes of downtime you'll have the next year. It reminds me of commercial aviation today: the NTSB has done such a great job of eliminating all the common causes of crashes that nowadays, each commercial crash they investigate seems to be a crazy, one-off, black-swan outlier.

As time goes on, an increasing proportion of problems derive from rare events. The high-frequency events at the hump of the frequency distribution get "swept out", along with any early-occurring rare problems from the tails. This frees up attention for the truly bizarre and unforeseeable events. Note that this doesn't work well if your Black Swans are catastrophic. If you get hit by a civilization-ending asteroid or a spontaneous business-ending event, you have little opportunity to learn from (or benefit from) the experience, but this kind of problem has thankfully been extremely rare in our collective history.
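
To make the "sweeping out" concrete, here is a rough sketch in Python. It is a toy model of my own, not anything from Spolsky's post. It assumes each cause of failure fires independently with a fixed probability per period, that a cause is removed for good the first time it fires, and that no failure is catastrophic. The rates and the "rare" threshold are made up for illustration.

```python
def expected_failures(rates, period):
    """Expected failures per cause in a given period.

    A cause with per-period probability p is still unfixed at `period`
    with probability (1 - p) ** period, so it contributes
    p * (1 - p) ** period expected failures in that period.
    """
    return [p * (1 - p) ** period for p in rates]


def rare_share(rates, period, rare_threshold):
    """Fraction of expected failures in `period` that come from rare causes."""
    contributions = expected_failures(rates, period)
    total = sum(contributions)
    rare = sum(c for p, c in zip(rates, contributions) if p < rare_threshold)
    return rare / total


# Hypothetical rates: three common causes (the hump) and fifty rare ones (the tail).
RATES = [0.5, 0.3, 0.2] + [0.001] * 50

if __name__ == "__main__":
    for period in (0, 10, 100):
        total = sum(expected_failures(RATES, period))
        share = rare_share(RATES, period, rare_threshold=0.01)
        print(f"period {period}: expected failures {total:.4f}, rare share {share:.3f}")
```

Under these assumptions the expected number of failures per period falls as the common causes are fixed, while the share of the remaining failures that come from the tail climbs toward one. The model says nothing about catastrophic events, because it assumes you always survive to fix the cause.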
