The 50-million-people bug

You’ve undoubtedly come across Murphy’s Law, which states, “What can go wrong, will go wrong.” Unfortunately, analysis of real-life bugs and system failures shows that Murphy’s Law is completely wrong: what can go wrong usually goes right. Most applications go into production with many defects still hiding in the software. Most of the time, these bugs stay silent and the application works successfully. Only occasionally is a bug actually triggered, and only then does the application misbehave or crash.

The bug discussed here, in a system that had logged over 3 million operational hours (about 340 years!), deprived 50 million people in the US and Canada of power.

There is another lesson here. Some errors in complex distributed software systems will happen, and it is close to impossible to prevent all of them. So as well as working to prevent bugs, you should place serious emphasis on detecting them when they do occur and on limiting their negative consequences.
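As a rough illustration of what "detecting" and "limiting" can look like in code, here is a minimal Python sketch. It is not taken from the system discussed above; the names `HeartbeatMonitor` and `run_with_fallback` are hypothetical and chosen only for this example. The first treats a component that has gone quiet as failed rather than healthy, and the second contains an error by switching to a degraded but safe result.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class HeartbeatMonitor:
    """Hypothetical helper: flags a component as stalled when it stops reporting in."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self.last_beat = clock()

    def beat(self) -> None:
        # Called by the monitored component each time it completes a unit of work.
        self.last_beat = self.clock()

    def is_stalled(self) -> bool:
        # Silence longer than the timeout is treated as a failure, not as "all is well".
        return self.clock() - self.last_beat > self.timeout_s


def run_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    on_error: Optional[Callable[[Exception], None]] = None,
) -> T:
    """Hypothetical helper: run primary(); on any exception, report it and use fallback()."""
    try:
        return primary()
    except Exception as exc:
        if on_error is not None:
            on_error(exc)  # make the failure visible instead of swallowing it
        return fallback()  # limit the consequences with a degraded result
```

A supervising process could poll `is_stalled()` on a timer and raise an alert or restart the component when it returns `True`. The point is only that a silent bug becomes a visible one; the right timeout and the right fallback depend entirely on the system.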
