Downtime – now that is a word that no one wants to hear. It strikes fear into the heart of businesses, executives and especially IT staff. Downtime costs money and it causes frustration.
Because downtime triggers an emotional reaction, businesses often respond to it differently than they do to traditional business factors. This emotional approach causes businesses, especially smaller businesses that often lack rational financial controls, to treat downtime as being far worse than it is. It is not uncommon to find that smaller businesses have done more financial damage to themselves by reacting to a fear of potential downtime than the feared downtime would have inflicted had it actually occurred. This is a dangerous overreaction.
The first step is to determine the cost of downtime. In IT we are often dealing with rather complex systems, and downtime comes in a variety of flavors such as loss of access, loss of performance or a complete loss of a system or systems. Determining every type of downtime and its associated costs can be rather complex, but a high level view is often enough for producing rational budgets or is, at the very least, a good starting point on a path towards understanding the business risks involved with downtime. Keep in mind that, just as spending too much to avoid downtime is bad, spending too much to calculate the costs of downtime is bad too. Don’t spend so much time and so many resources determining if you will lose money that you would have been better off just losing it. Beware of the high cost of decision making.
We can start by considering only complete system loss. What is the cost of organizational downtime for you? That is, if you had to cease all business for an hour or a day, how much money would be lost? In some cases the losses could be dramatic, as in the case of a hospital, where a day of downtime would result in a loss of faith and future customer base and could potentially result in lawsuits. But in many cases a day of downtime might have only nominal financial impact: many businesses could simply call the day a holiday, let their staff rest for the day and have people work a little harder over the next few days to make up the backlog from the lost day. It all comes down to how your business does and can operate and how well suited you are to mitigating lost time. Many businesses will only look at daily revenue figures to determine lost revenue, but this can be wildly misleading, since work that is merely delayed and made up later is not revenue lost.
Once we have a rough figure for downtime cost we can then consider downtime risk. This is very difficult to assess, as good figures on IT system reliability are nearly non-existent and organizations’ systems differ so much that industry data is very nearly useless. Here we are forced to rely on IT staff to provide an overview of risks and, hopefully, a reliable assessment of the likelihood of each individual risk. For example, in big round numbers, if we had a line of business application that ran on a server with only one hard drive, then we would expect that sometime in the next five to ten years there will be downtime associated with the loss of that drive. If we have that same server with hot swap drives in a mirrored array, then the likelihood of downtime associated with that storage system, even over ten years, is quite small. This doesn’t mean that a drive is unlikely to fail (it is likely to), but that the system will probably keep running until redundancy is restored, without end users noticing that anything has happened.
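To put "big round numbers" like these on paper, a minimal sketch such as the following can turn an assumed yearly failure chance into the chance of at least one failure over several years. The 5% annual figure and the function name are illustrative assumptions, not measured drive statistics, and the calculation assumes each year is independent of the others.

```python
def chance_of_failure(annual_probability: float, years: int) -> float:
    """Chance of at least one failure over `years`, assuming every year
    is independent and carries the same failure probability."""
    if not 0.0 <= annual_probability <= 1.0:
        raise ValueError("annual_probability must be between 0 and 1")
    if years < 0:
        raise ValueError("years must not be negative")
    # The chance of surviving every year is (1 - p) ** years;
    # the chance of at least one failure is the complement of that.
    return 1.0 - (1.0 - annual_probability) ** years


# Illustrative assumption only: a 5% chance per year that the single drive fails.
ASSUMED_ANNUAL_FAILURE = 0.05

for years in (1, 5, 10):
    chance = chance_of_failure(ASSUMED_ANNUAL_FAILURE, years)
    print(f"{years:>2} years: {chance:.0%} chance of at least one failure")
```

The point of the exercise is not precision. It is to show how a risk that looks small in any one year becomes a reasonable expectation over the life of a server.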
Our last rough estimation tool is to apply applicable business hours. Some businesses run 24×7, of course, but most do not. Is the loss of a line of business application at six in the evening equivalent to the loss of that application at ten in the morning? What about on the weekend? Are people productively using it at three on a Friday afternoon, or would losing it barely cost a thing and make for happy employees getting an extra hour or two on their weekends? Can schedules be shifted in case of a loss near lunch time? These factors, while seemingly trivial, can be significant. If downtime is limited to only two to four hours, then many businesses can mitigate nearly all of the financial impact simply by asking employees to have a little flexibility in their schedules to accommodate the outage by taking lunch early, or by leaving work early one day and working an extra hour the next.
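As a rough illustration of how much business hours matter, the sketch below computes what share of the week a system is actually in use. The nine hour day and five day week are assumed example values, and the function name is hypothetical.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours in a calendar week


def business_hours_fraction(hours_per_day: float, days_per_week: int) -> float:
    """Share of the calendar week during which the business is operating."""
    if not 0 <= hours_per_day <= 24:
        raise ValueError("hours_per_day must be between 0 and 24")
    if not 0 <= days_per_week <= 7:
        raise ValueError("days_per_week must be between 0 and 7")
    return (hours_per_day * days_per_week) / HOURS_PER_WEEK


# Assumed example: staff use the system nine hours a day, five days a week.
fraction = business_hours_fraction(9, 5)
print(f"The system is in productive use about {fraction:.0%} of the week")
```

If outages were equally likely at any hour, most of them would land outside working time for a business like this one, which is before any schedule shifting is taken into account.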
Now that we have these factors (the cost of downtime, the ability to mitigate downtime impact based on duration, and the risks of outage events) we can begin to draw a picture of what a downtime event is likely to look like. From this we can begin to derive how much money it would be worth spending to reduce the risk of such an event. For some businesses this number will be extremely high and for others it will be surprisingly low. This exercise can expose a great deal about how a business operates that may not normally be all that visible.
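One way to combine these factors is a simple expected loss estimate, sketched below. Every name and figure here is a hypothetical placeholder rather than a recommended value, and multiplying by the business hours share assumes outages are equally likely at any hour of the week, which is only a rough approximation.

```python
from dataclasses import dataclass


@dataclass
class OutageScenario:
    """One kind of outage, described in big round numbers (all hypothetical)."""
    name: str
    events_per_year: float          # e.g. 0.1 means roughly once in ten years
    hours_per_event: float          # how long the system stays down
    cost_per_business_hour: float   # money lost per working hour of downtime
    business_hours_fraction: float  # share of the week the system is in use
    unmitigated_fraction: float     # share of loss left after shifting schedules

    def expected_annual_loss(self) -> float:
        """Average yearly loss: frequency x duration x cost x both discounts."""
        return (
            self.events_per_year
            * self.hours_per_event
            * self.cost_per_business_hour
            * self.business_hours_fraction
            * self.unmitigated_fraction
        )


# Made-up inputs comparing a single-drive server with a mirrored one.
scenarios = [
    OutageScenario("single-drive server", 0.15, 8, 500, 0.27, 0.5),
    OutageScenario("mirrored-drive server", 0.01, 8, 500, 0.27, 0.5),
]
for scenario in scenarios:
    loss = scenario.expected_annual_loss()
    print(f"{scenario.name}: about ${loss:,.2f} per year")
```

The resulting figure is the most that it would make sense to spend per year on removing that particular risk entirely.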
It is important to note at this point that what we are looking at here is a loss of availability of systems, not a loss of data. We are assuming that good backups are being taken and that those backups are not compromised. Redundancy and downtime are not topics related to data loss, just availability loss. Data loss scenarios should be treated with equal or greater diligence but are a separate topic. It is a rare business that can survive catastrophic data loss, but it is common for businesses to experience, and easily survive, even substantial downtime.
There are multiple ways to stave off downtime. Redundancy is highly visible and treated almost like a buzzword, and so receives a lot of focus, but there are other means as well. Good system design is important: avoiding system complexity can heavily reduce downtime simply by removing points of unnecessary risk and fragility. Using quality hardware and software is important as well, since low end hardware that is redundant will often fail just as often as non-redundant enterprise class hardware. Having a rapid supply chain of replacement parts can be a significant factor, often seen in the form of four hour hardware vendor replacement part response contracts. The list goes on. What we will focus on is redundancy, which is where we are most likely to overspend when faced with the fear of downtime.
Now that we know the costs of failing to have adequate redundancy, we can compare this potential cost against the very real, up front cost of providing that redundancy. Some things, such as hard drives, are highly likely to fail and relatively easy and cost effective to make redundant, which turns a significant risk into a trivial one. These tend to be a first point of focus. But there are many areas of redundancy to consider, such as power supplies, network hardware, Internet connections and entire systems. Entire systems are often made redundant through modern virtualization techniques, which provide avenues for redundancy previously not accessible to many smaller businesses.
New types of redundancy, especially those made available through virtualization, are often a point where businesses will be tempted to overspend, perhaps dramatically, compared to the risks of downtime. Worse yet, in the drive to acquire the latest fads in redundancy, companies will often implement these techniques incorrectly and actually introduce greater risk and a higher likelihood of downtime compared to having done nothing at all. It is becoming increasingly common to hear of businesses spending tens or even hundreds of thousands of dollars in an attempt to mitigate a downtime monetary loss of only a few thousand dollars, and then failing in that attempt and ending up with more risk anyway.
When gauging the cost of mitigation it is critical to remember that mitigation is a guaranteed expense, whereas the risk is only a possibility. It is much like auto insurance, where you pay a guaranteed small monthly fee in order to fend off a massive, unplanned expense. The theory of risk mitigation is to spend a comparatively small amount of money now in order to reduce the risk of a large expense later, but if the cost of mitigation gets too high then it becomes better to simply accept the risks.
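The comparison described here can be sketched in a few lines. The function names and all of the dollar figures below are hypothetical, and the annualized cost deliberately ignores interest and inflation to keep the arithmetic simple.

```python
def annualized_cost(up_front: float, yearly: float, lifespan_years: float) -> float:
    """Spread a purchase evenly over its expected life and add recurring costs.

    Ignores interest and inflation; this is a rough budgeting figure only.
    """
    if lifespan_years <= 0:
        raise ValueError("lifespan_years must be positive")
    return up_front / lifespan_years + yearly


def worth_mitigating(loss_before: float, loss_after: float,
                     mitigation_per_year: float) -> bool:
    """True when the expected yearly loss avoided exceeds the guaranteed spend."""
    return (loss_before - loss_after) > mitigation_per_year


# Hypothetical case: expected yearly downtime loss falls from 3,000 to 300.
# Option A: a 50,000 redundant cluster plus 5,000 a year in support, kept five years.
cluster = annualized_cost(50_000, 5_000, 5)
print(worth_mitigating(3_000, 300, cluster))  # False

# Option B: a 400 second drive for mirroring, kept five years.
mirror = annualized_cost(400, 0, 5)
print(worth_mitigating(3_000, 300, mirror))   # True
```

With these made-up numbers the cheap mirror pays for itself many times over, while the cluster costs far more each year than the downtime it is meant to prevent.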
Systems can be assessed individually, of course. Keeping a web presence and telephone system up and running at all times is far more important than an email system where even hours of downtime are unlikely to be detectable by external clients. Paying only to protect those systems where the cost of downtime is significant is an important strategy.
Do not be surprised if you discover that, beyond some very basic redundancy (such as mirrored hard drives), a simple network design with good backups and restore plans and a good hardware support contract is all that is needed for the majority, if not all, of your systems. By lowering the complexity of your systems you make them naturally more stable and easier to manage, further reducing the cost of your IT infrastructure.