Category Archives: Risk

Do You Really Need Redundancy: The Real Cost of Downtime

Downtime – now that is a word that no one wants to hear.  It strikes fear into the heart of businesses, executives and especially IT staff.  Downtime costs money and it causes frustration.

Because downtime triggers an emotional reaction businesses are often left reacting to it differently than traditional business factors.  This emotional approach causes businesses, especially smaller businesses often lacking in rational financial controls, to treat downtime as being far worse than it is.  It is not uncommon to find that smaller businesses have actually done more financial damage to themselves reacting to a fear of potential downtime than the feared downtime would have inflicted had it actually occurred.  This is a dangerous overreaction.

The first step is to determine the cost of downtime.  In IT we are often dealing with rather complex systems and downtime comes in a variety of flavors such as loss of access, loss of performance or a complete loss of a system or systems.  Determining every type of downtime and its associated costs can be rather complex but a high level view is often enough for producing rational budgets or are, at the very least, a good starting point on a path towards understanding the business risks involved with downtime.  Keep in mind that just like spending too much to avoid downtime is bad that spending too much to calculate the costs of downtime is bad.  Don’t spend so much time and resources determining if you will lose money that you would have been better off just losing it.  Beware of the high cost of decision making.

We can start by considering only complete system loss.  What is the cost of organizational downtime for you – that is, if you had to cease all business for an hour or a day how much money is lost?  In some cases the losses could be dramatic, like in the case of a hospital where a day of downtime would result in a loss of faith and future customer base and potentially result in lawsuits.  But in many cases a day of downtime might have nominal financial impact – many businesses could simply call the day a holiday, let their staff rest for the day and have people work a little harder over the next few days to make up the backlog from the lost day.  It all comes down to how your business does and can operate and how well suited you are for mitigating lost time.  Many business will only look at daily revenue figures to determine lost revenue but this can be wildly misleading.

Once we have a rough figure for downtime cost we can then consider downtime risk.  This is very difficult to assess as good figures on IT system reliability are nearly non-existent and every organization’s systems are so unique that industry data is very nearly useless.  Here we are forced to rely on IT staff to provide an overview of risks and, hopefully, a reliable assessment of likelihoods of individual risks.  For example, in big round numbers, if we had a line of business application that ran on a server with only one hard drive then we would expect that sometime in the next five to ten years that there will be downtime associated with the loss of that drive.  If we have that same server with hot swap drives in a mirrored array then the likelihood of downtime associated with that storage system, even over ten years, is quite small.  This doesn’t mean that a drive is not likely to fail, it is, but that the system is likely to be unaffected until redundancy is restored without end users noticing that anything has happened.

Our last rough estimation tool is to apply applicable business hours.  Many businesses do not run 24×7, some do, of course, but most do not.  Is the loss of a line of business application at six in the evening equivalent to the loss of that application at ten in the morning?  What about on the weekend?  Are people productively using it at three on a Friday afternoon or would losing it barely cost a thing and make for happy employees getting an extra hour or two on their weekends?  Can schedules be shifted in case of a loss near lunch time?  These factors while seemingly trivial can be significant.  If downtime is limited to only two to four hours then many businesses can mitigate nearly all of the financial impact simply by asking employees to have a little flexibility in their schedules to accommodate the outage by taking lunch early or leaving work early one day and working an extra hour the next.

Now that we have these factors  – the cost of downtime, the ability to mitigate downtime impact based on duration and the risks of outage events we can begin to draw a picture of what a downtime event is likely to look like.  From this we can begin to derive how much money it would be worth to reduce the risk of such as event.  For some businesses this number will be extremely high and for others it will be surprisingly low.  This exercise can expose a great deal about how a business operates that may not be normally all that visible.

It is important to note at this point that what we are looking at here is a loss of availability of systems, not a loss of data.  We are assuming that good backups are being taken and that those backups are not compromised.  Redundancy and downtime are not topics related to data loss, just availability loss.  Data loss scenarios should be treated with equal or greater diligence but are a separate topic.  It is a rare business that can survive catastrophic data loss but common to experience and easily survive even substantial downtime.

There are multiple ways to stave off downtime, redundancy is highly visible and treated almost like a buzz word and so receives a lot of focus, but there are other means as well.  Good system design is important, avoiding system complexity can heavily reduce downtime simply by removing points of unnecessary risk and fragility.  Using quality hardware and software is important as well – as low end hardware that is redundant will often fail just as often as non-redundant enterprise class hardware.  Having a rapid supply chain of replacement parts can be a significant factor often seen in the form of four hour hardware vendor replacement part response contracts.  This list goes on.  What we will focus on is redundancy which is where we are most likely to overspend when faced with the fear of downtime.

Now that we know the costs of failing to have adequate redundancy we can compare this potential cost against the very real, up front cost of providing that redundancy.  Some things, such as hard drives, are highly likely to fail and relatively easy and cost effective to make redundant – taking significant risk and trivializing it.  These tend to be a first point of focus.  But there are many areas of redundancy to consider such as power supplies, network hardware, Internet connections and entire systems – often made redundant through modern virtualization techniques providing new avenues for redundancy previously not accessible to many smaller businesses.

New types of redundancy, especially those made available through virtualization, are often a point where businesses will be tempted to overspend, perhaps dramatically, compared to the risks of downtime.  Worse yet, in the drive to acquire the latest fads in redundancy companies will often implement these techniques incorrectly and actually introduce greater risk and a higher likelihood of downtime compared to having done nothing at all.  It is becoming increasingly common to hear of businesses spending tens or even hundreds of thousands of dollars in an attempt to mitigate a downtime monetary loss of only a few thousand dollars – and to then fail in that attempt and end up increasing their risk anyway.

When gauging the cost of mitigation it is critical to remember that mitigation is a guaranteed expense where risk is only a risk.  Much like auto insurance where you pay a guaranteed small monthly fee in order to fend off a massive, unplanned expense.   The theory of risk mitigation is to spend a comparatively small amount of money now in order to reduce the risk of a large expense later, but if the cost of mitigation gets too high then it becomes better to simply accept the risks.

Systems can be assessed individually, of course.  Keeping a web presence and telephone system up and running at all times is far more important than an email system where even hours of downtime are unlikely to be detectable by external clients.  Paying only to protect those systems where the cost of downtime is significant is an important strategy.

Do not be surprised if what you discover is that beyond some very basic redundancy (such as mirrored hard drives) that a simple network design with good backups and restore plans and a good hardware support contract is all that is needed for the majority, if not all, of your systems.  By lowering the complexity of your systems you make them naturally more stable and easier to manage – further reducing the cost of your IT infrastructure.

Why We Reboot Servers

A question that comes up on a pretty regular basis is whether or not servers should be routinely rebooted, such as once per week, or if they should be allowed to run for as long as possible to achieve maximum “uptime.”  To me the answer is simple – with rare exception, regular reboots are the most appropriate choice for servers.

As with any rule, there are cases when it does not apply.  For example, some businesses running critical systems have no allotment for downtime and must be available 24/7.  Obviously systems like this cannot simply be rebooted in a routine way.  However, if a system is so critical that it can never go down then this situation should trigger a red flag that this system is a point of failure and perhaps consideration for how to handle downtime, whether planned or unplanned, should be initiated.

Another exception is some AIX systems need significant uptime, greater than a few weeks, to obtain maximum efficiency as the system is self tuning and needs time to obtain usage information and to adjust itself accordingly.  This tends to be limited to large, seldom-changing database servers and similar use scenarios that are less common than other platforms.

In IT we often worship the concept of “uptime” – how long a system can run without needing to restart.  But “uptime” is not a concept that brings value to the business and IT needs to keep the business’ needs in mind at all times rather than focusing on artificial metrics.  The business is not concerned with how long a server has managed to stay online without rebooting – they only care that the server is available and ready when needed for business processing.  These are very different concepts.

For most any normal business server, there is a window when the server needs to be available for business purposes and a window when it is not needed.  These windows may be daily, weekly or monthly but it is a rare server that is actually in use around the clock without exception.

I often hear people state that because they run operating system X rather than Y that they no longer need to reboot, but this is simply not true.  There are two main reasons to reboot on a regular basis: to verify the ability of the server to reboot successfully and to apply patches that cannot be applied without rebooting.

Applying patches is why most businesses reboot.  Almost all operating systems receive regular updates that require rebooting in order to take effect.  As most patches are released for security and stability purposes, especially those requiring a reboot, the importance of applying them is rather high.  Making a server unnecessarily vulnerable just to maintain uptime is not wise.

Testing a server’s capacity to reboot successfully is what is often overlooked.  Most servers have changes applied to them on a regular basis.  Changes might be patches, new applications, configuration changes, updates or similar.  Any change introduces risk.  Just because a server is healthy immediately after a change is applied does not mean that the server nor the applications running on it will start as expected on reboot.

If the server is never rebooted then we never know if it can reboot successfully.  Over time the number of changes having been applied since the last reboot will increase.  This is very dangerous.  What we fear is a large number of changes having been made, possibly many of them undocumented, and a reboot then failing.  At that point identifying what change is causing the system to fail could be an insurmountable process.  No single change to roll back, no known path to recoverability.  This is when panic sets in.  Of course, a box that is never rebooted intentionally is more likely to reboot unintentionally – meaning the chance of a failed reboot is both more likely to occur and more likely to occur while in active use.

While regular reboots are not intended to reduce the frequency of failed reboots, in fact they actually increase the occurrence of failures, the purpose is to make those failures easily manageable from a “known change” standpoint and, more importantly, to control when those reboots occur to ensure that they happen at a time when the server is designated as being available for maintenance and is designed to be stressed so that problems are found at a time when they can be mitigated without business impact.

I have heard many a system administrator state that they avoid weekend reboots because they do not want to be stuck working on Sundays due to servers failing to come back up after rebooting.  I have been paged many a Sunday morning from a failed reboot myself, but every time I receive that call I feel a sense of relief.  I know that we just caught an issue at a time when the business is not impacted financially.  Had that server not been restarted during off hours, it might have not been discovered to be “unbootable” until it had failed during active business hours and caused a loss of revenue.

Thanks to regular weekend reboots, we can catch pending disasters safely and, thanks to knowing that we only have one week’s worth of changes to investigate, we are routinely able to fix the problems with generally little effort and great confidence that we understand what changes had been made prior to the failure.

Regular reboots are about protecting the business from outages and downtime that can be mitigated through very simple and reliable processes.