{"id":792,"date":"2016-05-04T20:40:39","date_gmt":"2016-05-05T01:40:39","guid":{"rendered":"http:\/\/www.smbitjournal.com\/?p=792"},"modified":"2017-02-19T04:41:45","modified_gmt":"2017-02-19T09:41:45","slug":"a-public-post-mortem-of-an-outage","status":"publish","type":"post","link":"https:\/\/smbitjournal.com\/2016\/05\/a-public-post-mortem-of-an-outage\/","title":{"rendered":"A Public Post Mortem of An Outage"},"content":{"rendered":"

Many things in life have a commonly accepted “conservative” approach and a commonly accepted “risky” approach that, according to popular sentiment, should be avoided.\u00a0 In investing, for example, we often see buying government or municipal bonds as low risk and investing in equities (corporate stocks) as high risk – but the statistical numbers tell us that this is backwards: over the long run, nearly everyone loses money on bonds and makes money on stocks.\u00a0 Common “wisdom”, when put to the test, turns out to be based purely on emotions which, in turn, are based on misconceptions, and the riskiest thing in investing is using emotion to drive investing strategies.<\/p>\n

Similarly, with business risk assessments, the common approach is driven by an emotional response to danger: the perceived threat triggers panic, and panic creates a strong tendency to overcompensate for risk that has only been perceived, not measured.\u00a0 We see this commonly with small companies whose IT infrastructure generates very little revenue, or is not key to short term operations, spending large sums of money to protect against a risk that is only partially perceived and very poorly articulated.\u00a0 The mitigation process is then handled emotionally instead of intellectually, and we regularly find companies implementing bad system designs that actually increase risk rather than decrease it, while spending very large sums of money and then, since the risk was mostly imaginary, calling the project a success based on layer after layer of misconceptions: imaginary risk, imaginary risk mitigation and imaginary success.<\/p>\n

In the recent past I was involved in an all-out disaster for a small business.\u00a0 The disaster hit what was nearly a “worst case scenario.”\u00a0 Not quite, but very close.\u00a0 The emotional response at the time was strong, and once the disaster was fully under way nearly everyone was stating, and repeating, that the disaster planning had been faulty and that the issue should have been avoided.\u00a0 This is very common in any disaster situation: humans feel that there must always be someone to blame and that, if we do our jobs correctly, there should be zero risk – but this is completely incorrect.<\/p>\n

Thankfully we performed a full post mortem, as one should after any true disaster, to determine what had gone wrong, what had gone right, how we could fix the processes and decisions that had failed and how we could maintain the ones that had protected us.\u00a0 Typically, when some big systems event happens, I do not get to talk about it publicly.\u00a0 But once in a while, I do.\u00a0 It is so common to react to a disaster, any disaster, and think “oh, if we had only….”\u00a0 But you have to examine the disaster properly; there is so much to be learned about our processes and ourselves.<\/p>\n

First, some back story.\u00a0 A critical server, running in an enterprise datacenter, held several key workloads that were very important to several companies.\u00a0 It was a little over four years old and had been running in isolation for years.\u00a0 Older servers are always a bit worrisome as they approach end of life; four years is hardly end of life for an enterprise class server, but it was certainly not young, either.<\/p>\n

This was a single server without any failover mechanism.\u00a0 Backups were handled externally to an enterprise backup appliance in the same datacenter.\u00a0 A very simple system design.<\/p>\n

I won’t include all internal details as any situation like this has many complexities in planning and in operation.\u00a0 Those are best left to an internal post mortem process.<\/p>\n

When the server failed, it failed spectacularly.\u00a0 The failure was so complete that we were unable to diagnose it remotely, even with the assistance of the on site techs at the datacenter.\u00a0 Even the server vendor was unable to diagnose the issue.\u00a0 This left us in a difficult position – how do you deal with a dead server when the hardware cannot reliably be diagnosed, let alone fixed?\u00a0 We could replace drives, we could replace power supplies, we could replace the motherboard; nobody knew which, if any, would be the fix.<\/p>\n

In the end the decision was that the server, as well as the backup system, had to be relocated back to the main office where they could be triaged in person and with maximum resources.\u00a0 The system was ultimately repaired and no data was lost.\u00a0 The decision to refrain from going to backup was made because data recovery was more important than system availability.<\/p>\n

When all was said and done, the disaster was one of the most complete that could be imagined short of actual data loss.\u00a0 The outage went on for many days and consumed a lot of spare equipment, man hours and attempted fixes.\u00a0 The process was exhausting, but when it was completed the system was restored successfully.<\/p>\n

The long outage and the sense of chaos, as things were diagnosed and repair attempts were made, led to an overall feeling of failure.\u00a0 People started saying it, and saying it leads to believing it.\u00a0 Under emergency response conditions it is very easy to become excessively emotional, especially when there is very little sleep to be had.<\/p>\n

But when we stepped back and looked at the final outcome, what we found surprised nearly everyone: the triage operation and the initial risk planning had both been successful.<\/p>\n

The mayhem that happens during a triage often makes things feel much worse than they really are, but our triage handling had been superb.\u00a0 Triage is not magic; there is a discovery phase and a reaction phase.\u00a0 When we analyzed the order of events and laid them out in a timeline, we found that we had acted so well that there was almost no place where we could have shortened the time frame.\u00a0 We had done good diagnostics, engaged the right parties at the right time and gotten parts into logistical motion as soon as possible.\u00a0 Most of what appeared to have been frenetic, wasted time was actually “filler time” in which we were checking whether additional options existed, or mistakes had been made, while we waited on the needed parts for repair.\u00a0 This made things feel much worse than they really were, but all of it was the correct set of actions to have taken.<\/p>\n

From the triage and recovery perspective, the process had gone flawlessly even though the outage ended up taking many days.\u00a0 Once the disaster had happened, and had happened to the extent that it did, the recovery went remarkably smoothly.\u00a0 Nothing is absolutely perfect, but it went extremely well.\u00a0 The machine worked as intended.<\/p>\n

The far more surprising part was looking at the disaster impact.\u00a0 There are two ways to look at this.\u00a0 The first, and wiser, is the “no hindsight” approach: we take the impact cost of the disaster and the mitigation cost, apply the likelihood that the disaster would ever have happened, and determine whether the right planning decision had been made.\u00a0 This is hard to calculate because the risk factor is always a fudged number, but you can normally get accurate enough to know how good your planning was.\u00a0 The second is the 20\/20 hindsight approach – if we had known that this exact disaster was going to happen, what would we have done to prevent it?\u00a0 It is obviously unfair to remove the risk factor and judge the disaster purely on its raw cost, because we cannot know in advance what is going to go wrong, plan for only that one possibility, or spend unlimited money on something that may never happen.\u00a0 Companies often make the mistake of using the latter calculation and blaming planners for not having perfect foresight.<\/p>\n
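
To make the “no hindsight” approach concrete, here is a minimal sketch, in Python, of the kind of expected-cost comparison it describes.\u00a0 Every figure in it – the failure probability, the outage cost, the mitigation cost and the five year horizon – is a hypothetical placeholder chosen for illustration, not a real number from this incident.<\/p>\n

<pre><code>
# Hypothetical sketch of the 'no hindsight' planning comparison.
# All numbers are illustrative placeholders, not figures from this outage.

def expected_outage_cost(annual_failure_probability, outage_cost, years_in_service):
    # Expected loss from the failure scenario over the planning horizon.
    return annual_failure_probability * outage_cost * years_in_service

def evaluate_plan(annual_failure_probability, outage_cost, mitigation_cost, years_in_service):
    # Compare the expected loss of accepting the risk against the cost of mitigating it.
    expected_loss = expected_outage_cost(annual_failure_probability, outage_cost, years_in_service)
    cheaper = 'accept the risk'
    if expected_loss > mitigation_cost:
        cheaper = 'mitigate'
    return cheaper, expected_loss

# Example: a 5% annual chance of a multi-day outage costing $8,000,
# versus roughly $18,000 to buy, host and maintain a redundant server for five years.
choice, expected_loss = evaluate_plan(
    annual_failure_probability=0.05,
    outage_cost=8_000,
    mitigation_cost=18_000,
    years_in_service=5,
)
print('Expected loss if the risk is accepted: $' + format(expected_loss, ',.0f'))
print('Cheaper option on expected cost: ' + choice)
<\/code><\/pre>\n

With a genuinely low probability of failure, the expected loss stays far below the cost of a second server, which is the gamble described here.<\/p>\n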

In this case we were decently confident that we had taken the right gamble from the start.\u00a0 The system had been in place for most of a decade with zero downtime.\u00a0 The overall system cost had been low, the triage cost had been moderate and the event had been extremely unlikely.\u00a0 That we had planned well, once the risk factor was considered, did not really surprise anyone.<\/p>\n

What was surprising is that when we ran the calculations without the risk factor – even had we known that the system would fail and that an extended outage would take place – we still would have made the same decision!\u00a0 This was downright shocking.\u00a0 The cost of the extended outage was actually less than the cost of the equipment, hosting and labour needed to have built a functional risk mitigation system – in this case, a fully redundant server running in the datacenter alongside the one in production.\u00a0 In fact, accepting this extended outage had saved close to ten thousand dollars!<\/p>\n
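
As a rough illustration of that hindsight arithmetic – with hypothetical component figures, since the real internal numbers are not published here, chosen only so that the totals land near the roughly ten thousand dollar savings described above:<\/p>\n

<pre><code>
# Hindsight comparison with hypothetical placeholder figures.
# The post only states that the savings were close to ten thousand dollars;
# the component numbers below are illustrative, not the real ones.

outage_cost = 8_000                 # downtime impact, triage labour, shipping and spare parts
redundant_server_hardware = 12_000  # a second enterprise class server
extra_hosting = 4_000               # datacenter space, power and licensing over the system life
extra_admin_labour = 2_000          # building and maintaining the failover configuration

mitigation_cost = redundant_server_hardware + extra_hosting + extra_admin_labour
savings = mitigation_cost - outage_cost

print('Cost of riding out the outage:       $' + format(outage_cost, ','))
print('Cost of the redundant server design: $' + format(mitigation_cost, ','))
print('Saved by accepting the risk instead: $' + format(savings, ','))
<\/code><\/pre>\n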

This turned out to be an extreme case in which the outage was devastatingly bad, hard to predict and impossible to repair quickly, and yet still resulted in massive long term cost savings – but the lesson is an important one.\u00a0 So much emotional baggage comes with any disaster that, if we do not do a proper post mortem analysis and work to remove emotional responses from our decision making, we will often leap to large scale financial loss or place blame incorrectly even when things have gone well.\u00a0 Many companies would have looked at this disaster and reacted by overspending dramatically to prevent the same unlikely event from recurring, even with the math in front of them telling them that doing so would waste money even if that event did recur!<\/p>\n

There were other lessons to be learned from this outage.\u00a0 We learned where communications had not been ideal, where the right people were not always in the right decision making roles, where customer communications were not what they should have been, where the customer had not informed us of changes properly, and more.\u00a0 But, by and large, the lessons were that we had planned correctly, that our triage operation had worked correctly and that we had saved the customer several thousand dollars over what would have appeared to be the “conservative” approach – and that by doing a good post mortem we kept them, and us, from overreacting and turning a good decision into a bad one going forward.\u00a0 Without a post mortem we might very likely have changed our good processes, thinking that they had been bad ones.<\/p>\n

The takeaway lessons that I want to convey to you, the reader, are these: post mortems are a critical step after any disaster; traditionally “conservative” thinking is often very risky; and emotional reactions to risk often cause financial disasters larger than the technical ones they seek to protect against.<\/p>\n

","protected":false},"excerpt":{"rendered":"

Many things in life have a commonly accepted “conservative” approach and a commonly accepted “risky” approach that should be avoided, at least according to popular sentiment.\u00a0 In investing, for example, we often see buying government or municipal bonds as low risk and investing in equities (corporate stocks) as high risk – but the statistical numbers … Continue reading A Public Post Mortem of An Outage<\/span> →<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[133],"tags":[223],"class_list":["post-792","post","type-post","status-publish","format-standard","hentry","category-risk","tag-post-mortem"],"_links":{"self":[{"href":"https:\/\/smbitjournal.com\/wp-json\/wp\/v2\/posts\/792","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/smbitjournal.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/smbitjournal.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/smbitjournal.com\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/smbitjournal.com\/wp-json\/wp\/v2\/comments?post=792"}],"version-history":[{"count":2,"href":"https:\/\/smbitjournal.com\/wp-json\/wp\/v2\/posts\/792\/revisions"}],"predecessor-version":[{"id":926,"href":"https:\/\/smbitjournal.com\/wp-json\/wp\/v2\/posts\/792\/revisions\/926"}],"wp:attachment":[{"href":"https:\/\/smbitjournal.com\/wp-json\/wp\/v2\/media?parent=792"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/smbitjournal.com\/wp-json\/wp\/v2\/categories?post=792"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/smbitjournal.com\/wp-json\/wp\/v2\/tags?post=792"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}