“High Availability isn’t something you buy, it’s something that you do.” – John Nicholson
Few things are more universally desired in IT than High Availability (HA) solutions. I mean really, say those words and any IT Pro will instantly say that they want that. HA for their servers, their apps, their storage and, of course, even their desktops. If there was a checkbox next to any system that simply said “HA”, why wouldn’t we check it? We would, of course. No one voluntarily wants a system that fails a lot. Failure bad, success good.
First, though, we must define HA. HA can mean many things, but at a minimum it means that the availability of the system in question is higher than “normal.” What is normal? That alone is hard enough to define. HA is a loose term, at best. In the context of its most common usage, though, which is common applications running on normal enterprise hardware, I would offer this starting point for HA discussions:
Normal or Standard Availability (SA) would be defined as the availability of a common mainline server running a common enterprise operating system and a common enterprise application in a best practices environment with enterprise support. Good examples of this might include Exchange running on Windows Server on the HP ProLiant DL380 (the most common mainline commodity server), or BIND (the DNS server) running on Red Hat Enterprise Linux on the Dell PowerEdge R730. These are just examples to be used for establishing a rough baseline. There is no great way to measure this, but with a good support contract and rapid repair or replacement in the real world, a system of this nature is believed to achieve between four and five nines of reliability (99.99% uptime or higher) when human failure is not included.
High Availability (HA) should be commonly defined as having an availability significantly higher than that of Standard Availability, with “significantly higher” meaning a minimum of one order of magnitude increase. So at least five nines of reliability, and more likely six nines (99.9999% uptime).
Low Availability (LA) would be commonly defined as having an availability significantly lower than that of Standard Availability, with “significantly,” again, meaning at least one order of magnitude. So LA would typically be assumed to be around 99% to 99.9% availability or lower.
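To make these coarse tiers more tangible, here is a minimal back-of-the-envelope sketch (in Python, using the rough tier boundaries defined above rather than any industry standard) that converts an availability percentage into the downtime it implies:

```python
# Rough downtime implied by a given availability percentage.
# Tier labels follow the working definitions above; they are illustrative only.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Expected minutes of downtime per year at a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for label, pct in [("LA (two nines)", 99.0),
                   ("SA (four nines)", 99.99),
                   ("HA (five nines)", 99.999),
                   ("HA (six nines)", 99.9999)]:
    print(f"{label:>16}: ~{downtime_minutes_per_year(pct):,.1f} minutes/year")
```

Four nines works out to roughly 53 minutes of downtime per year, or a little under five hours over five years, which is the ballpark used later in this discussion.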
Measurement here is very difficult as human, environmental and other factors play a massive role in determining the uptime of different configurations. The same gear used in one role might achieve five nines while in another fail to achieve even one. The quality of the datacenter, the skill of the support staff, the rapidity of parts replacement, the granularity of monitoring and a multitude of other factors will affect overall reliability significantly. This is not necessarily a problem for us, however. In most cases we can evaluate the portions of a system design that we control in such a way that relative reliability can be determined, so that we can at least show that one approach is going to be superior to another and make well informed decisions even if accurate failure rate models cannot be easily built.
It is important to note that, other than providing a sample baseline set of examples from which to work, there is nothing in the definitions of high availability or low availability that talks about how these levels should be achieved; that is not what the terms mean. The terms describe resulting levels of reliability in relation to the baseline and nothing else. There are many ways to achieve high availability without using commonly assumed approaches and practically unlimited ways to achieve low availability.
Of course HA can be defined at every layer. We can have HA platforms or operating systems but have fragile applications on top. So it is very important to understand at what level we are speaking at any given time. At the end of the day, a business will only care about the highly available delivery of services, regardless of how or where that is achieved. The end result is what matters, not the “under the hood” details of how it was accomplished; as always, the ends justify the means.
It is extremely common today for IT departments to become distracted by new and flashy HA tools at the platform layer and forget to look for HA higher and lower in the stack. We must ensure that we provide highly available services to the business rather than only looking at the one layer while leaving the business just as vulnerable as ever, or more so.
In the real world, though, HA is not always an option and, when it is, it comes at a cost. That cost is almost always monetary and generally comes with extra complexity as well. And as we well know, any complexity also carries additional risk, and that risk could, if we are not careful, cause an attempt to achieve HA to fail and might even leave us with LA.
Once we understand this necessary language for describing what we mean, we can begin to talk about when high availability, standard availability and even low availability may be right for us. We use such coarse categories because it is so difficult to measure system reliability that getting more detailed becomes valueless.
Conceptually, all systems come with a risk of downtime and nothing can be up all of the time; that is impossible. Reliability generally costs money, all other things being equal. So to determine what level of availability is most appropriate for a workload we must determine the cost of risk mitigation (the amount of money that it takes to change the average amount of downtime) and compare that against the cost of the downtime itself.
This gets tricky and complicated because determining the cost of downtime is difficult enough, and determining the risk of downtime is even more difficult. In many cases the cost of downtime is not a flat number, but when it is, it could be expressed as $5/minute or $20K/day or similar. An even better tool is to create a “loss impact curve” that shows how money is lost over time (within a reasonable interval).
For example, a company might easily face no loss at all for the first five minutes with slowly increasing, but small, losses until about four hours when work stops because people can no longer go to paper or whatever and then losses go from almost zero to quite large. Or some companies might take a huge loss the moment that the systems are down but the losses slowly dissipate over time. Loss might only be impactful at certain times of day. Maybe outages at night or during lunch are trivial but mid morning or mid afternoon are major. Every company’s impact, risk and ability to mitigate that risk are different, often dramatically so.
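One way to capture this is to sketch the loss impact curve as a simple function of outage duration. The breakpoints and dollar amounts below are invented purely for illustration; every business would need to build its own curve from its own numbers:

```python
def loss_impact(outage_minutes: float) -> float:
    """Hypothetical loss impact curve: dollars lost for an outage of a given length.

    Invented figures: no loss for the first five minutes, small losses while
    staff work around the outage, then sharply rising losses after about four
    hours when work stops entirely.
    """
    if outage_minutes <= 5:
        return 0.0
    if outage_minutes <= 240:
        return (outage_minutes - 5) * 2.0
    return (240 - 5) * 2.0 + (outage_minutes - 240) * 50.0

for minutes in (5, 30, 240, 480):
    print(f"{minutes:>4} minute outage: ~${loss_impact(minutes):,.0f}")
```

A company with the opposite profile, a large loss the instant systems go down that dissipates over time, would simply use a differently shaped function.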
Sometimes it comes down to the people working at the company. Will they all happily take needed bathroom, coffee, snack or even lunch breaks at the time that a system fails so that they can return to work when it is fixed? Will people go home early and come in early tomorrow to make up for a major outage? Is there machinery that is going to sit idle? Will the ability to respond to customers be impacted? Will life support systems fail? There are countless potential impacts and countless potential ways of mitigating different types of failures. All of this has to be considered. The cost of downtime might be a fraction of corporate revenues on a minute by minute basis or downtime might cause a loss of customers or faith that is more impactful than the minute by minute revenue generated.
Once we have some rough loss numbers to work with, we at least have a starting point. Even if we only know that revenue is ~$10/minute and losses are expected to be around ~$5/minute, that is a starting point of sorts. If we have a full curve or a study done with more detailed numbers, all the better. Now we need to figure out roughly what our baseline is going to be. A well maintained server, running on premises, with a good support contract and good backup and restore procedures can pretty easily achieve four nines of reliability. That means that we would experience about five hours of downtime every five years. This is actually less downtime than is generally expected of SA in most environments, and downtime can be far lower still in excellent environments like high quality datacenters with nearby parts and service.
So, based on our baseline example of about five hours every five years, we can figure out our potential risk. If we lose about ~$5/minute and we expect roughly 300 minutes of downtime every five years, we are looking at a potential loss of $1,500 every half decade.
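The arithmetic is simple enough to do on a napkin, but as a quick sanity check, using the purely illustrative figures above:

```python
# Illustrative figures from the example above.
loss_per_minute = 5.00               # ~$5 lost per minute of downtime
downtime_minutes_per_5_years = 300   # ~5 hours over five years at four nines

expected_loss = loss_per_minute * downtime_minutes_per_5_years
print(f"Expected loss over five years: ${expected_loss:,.0f}")  # $1,500
```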
That means that, at the most extreme, we could never justify spending even $1,500 to mitigate that risk; that would be financially absurd. This is true for several reasons. One of the biggest is that this is only a risk: spending $1,500 to protect against possibly losing $1,500 makes little sense, but it is a very common mistake when people do not analyze these numbers carefully.
Another major factor is that no mitigation technique is completely effective. If we manage to move our four nines system to a five nines system we would eliminate only 90% of the average downtime, moving us from $1,500 of loss to $150 of loss. If we spent $1,500 for that reduction, the total “loss” would still be $1,650 (the cost of protection is a form of financial loss). The cost of the risk mitigation combined with the anticipated remaining impact must, taken together, still be lower than the anticipated cost of the risk without mitigation, or else the mitigation itself is pointless or actively damaging.
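Expressed as a rough rule of thumb, using the same hypothetical numbers: the cost of the mitigation plus the residual expected loss must come in under the unmitigated expected loss, or the mitigation is not worth doing.

```python
# Hypothetical figures carried over from the example above.
unmitigated_loss = 1_500.0   # expected loss over five years at four nines
mitigation_cost  = 1_500.0   # up-front cost of moving to five nines
effectiveness    = 0.90      # five nines removes ~90% of expected downtime

residual_loss = unmitigated_loss * (1 - effectiveness)    # $150
total_with_mitigation = mitigation_cost + residual_loss   # $1,650

# Mitigation only makes sense if it lowers the total expected loss.
worthwhile = total_with_mitigation < unmitigated_loss
print(f"With mitigation: ${total_with_mitigation:,.0f} vs "
      f"without: ${unmitigated_loss:,.0f} -> worthwhile: {worthwhile}")
```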
Many may question why the total cost of risk mitigation must be lower and not simply equal; surely that would mean that we are at a “risk break even” point? This seems true on the surface, but because we are dealing with risk it is not the case. Risk mitigation is a certain cost: financial damage that we take up front in the hope of reducing losses tomorrow. But the risk for tomorrow is a guess, hopefully a well educated one, but only a guess. The cost today is certain. Taking on certain damage today in the hope of reducing possible damage tomorrow only makes sense when the damage today is small, the expected or possible damage tomorrow is very large and the effectiveness of the mitigation is significant.
Included in the idea of “certain cost up front” to reduce “possible cost tomorrow” is the idea of the time value of money. Even if an outage were of a known size and time, we would not spend the same money today to mitigate it tomorrow because our money is more valuable today.
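As a minimal sketch of that effect, assuming a purely illustrative 5% annual discount rate, even a loss of known size five years out is worth less than the same dollar amount spent today:

```python
# Present value of a hypothetical future loss at an assumed 5% annual discount rate.
discount_rate = 0.05
future_loss = 1_500.0   # the illustrative five-year loss figure from above
years = 5

present_value = future_loss / (1 + discount_rate) ** years
print(f"Present value of a ${future_loss:,.0f} loss {years} years out: ~${present_value:,.0f}")
# Roughly $1,175: spending $1,500 today to avoid it would lose money even if the loss were certain.
```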
In the most dramatic cases, we sometimes see IT departments demanding that tens or hundreds of thousands of dollars be spent up front to avoid maybe losing a few thousand dollars, possibly many years in the future. A strategy that we can refer to as “shooting ourselves in the face today to avoid maybe getting a headache tomorrow.”
It is included in the concept of evaluating risk mitigation, but it should be mentioned specifically that in the case of IT equipment there are many examples of attempted risk mitigation that may not be as effective as they are believed to be. For example, having two servers that sit in the same rack will potentially be very effective for mitigating the risk of host hardware failure, but will not mitigate against natural disasters, site loss, fire, most cases of electrical shock, fire suppression activation, network interruptions, most application failures, ransomware attacks or other reasonably possible disasters.
It is common for storage devices to be equipped with “dual controllers,” which gives a strong impression of high reliability. Generally, though, these controllers are inside a single chassis with shared components, and even if the components are not shared, often the firmware is shared and communications between components are complex; this often leads to failures where the failure of one component triggers the failure of another, quite frequently making them LA devices rather than the SA or HA that people expected when purchasing them. So it is critical to consider which risks a mitigation strategy actually addresses and whether the technique is likely to be effective. No technique is completely effective, there is always a chance of failure, but some strategies and techniques are more broadly effective than others and some are simply misleading or actually counterproductive. If we are not careful, we may implement costly products or techniques that actively undermine our goals.
Some techniques and products used in the pursuit of high availability are rather expensive; these might include buying redundant hardware, leasing another building, installing expensive generators or licensing special software. There are low cost techniques and software as well, but in most cases any movement toward high availability will require a correspondingly large outlay of investment capital. It is absolutely critical to keep in mind that high availability is a process; there is no way to simply buy high availability. Achieving HA requires good documentation, procedures, planning, support, equipment, engineering and more. In the systems world, HA is normally approached first from an environmental perspective, with failover power generators, redundant HVAC systems, power conditioning, air filtration, fire suppression systems and more, to ensure that the environment for the availability is there. This alone can often make further investment unnecessary, as it can deliver incredible results. Then comes HA system design, ensuring that not just one layer of a technology stack is highly available but the entire stack, allowing the critical applications, data or services to remain functional as much of the time as possible. Then comes site to site redundancy to be able to withstand floods, hurricanes, blizzards and the like. Of course there are completely different techniques as well, such as utilizing cloud computing services hosted remotely on our behalf. What matters is that high availability requires broad thinking and planning, cannot simply be purchased as a line item and is judged by whether it delivers an uptime, or likelihood of uptime, much higher than a “standard” system design would deliver.
What is often surprising, almost shocking, to many businesses, and especially to IT professionals, who rarely undertake financial risk analysis and who are constantly told that HA is a necessity for any business and that buying the latest HA products is unquestionably how their budgets should be spent, is that when the numbers are crunched and the real costs and effectiveness of risk mitigation strategies are considered, high availability has very little place in any organization, especially those that are small or have highly disparate workloads. In the small and medium business market it is almost universal to find that the cost and complexity (which in turn bring risk, mostly in the form of a lack of experience around the techniques and the risk assessment) of high availability approaches is far too great to ever offset the potential damage of the outage from which the mitigation is hoped to protect. There are exceptions, of course, and there are many businesses for which high availability solutions are absolutely sensible, but these are the exception and very far from the norm.
It is also sensible to think of the need for high availability on a workload basis and not department, company or technology wide. In a small business it is common for all workloads to share a common platform, and the need of a single workload for high availability may sweep other, less critical, workloads along with it. This is perfectly fine and a great way to offset the cost of the risk mitigation of the critical workload through ancillary benefit to the less critical workloads. In a larger organization, where a plethora of platform approaches is used for differing workloads, it is common for high availability to be applied only to workloads that are both highly critical (in terms of risk from downtime impact) and practical to mitigate (the ability to mitigate risk can vary dramatically between different types of workloads), while other workloads are left to standard techniques.
An example of a workload that may be critical and can be effectively addressed with high availability might be an online ordering system, where the latency created by multi-regional replication has little impact on the overall system but losing orders could be very financially impactful should a system fail. An example of a workload where high availability might be easy to implement but ineffectual would be an internal intranet site serving commonly asked HR questions; it would simply not be cost effective to avoid small amounts of occasional downtime for a system like this. An example of a system where risk is high but the cost or effectiveness of risk mitigation makes it impractical or even impossible might be a financial “tick” database requiring massive amounts of low latency data to be ingested, where maintaining a replica would not only be incredibly costly but could introduce latency that would undermine the ability of the system to perform adequately. Every business and workload is unique and should be evaluated carefully.
Of course high availability techniques can be applied in stages; it is not an all or nothing endeavor. It might be practical to mitigate the risk of system level failure by having application layer fault tolerance to protect against failure of system hardware, virtualization platform or storage, but for the same workload it might not be valuable to protect against the loss of a single site. If a workload only services a particular site, or is simply not valuable enough for the large investment needed to make it fail over between sites, it could easily fall “in the middle.” It is very common for workloads to implement only partial high availability solutions, often because an IT department may only be responsible for a portion of them and have no say over things like power and HVAC, but probably most commonly because some high availability techniques are seen as high visibility and easy to sell to management while others, such as high quality power and air conditioning, are not, even though they may easily provide a better bang for the buck. There are good reasons why certain techniques may be chosen over others, as they address different risk components and some risks may have a differing impact on an individual business or workload.
High availability requires careful thought as to whether it is worth considering and even more careful thought as to implementation. Building true HA systems requires a significant amount of effort and expertise and generally substantial cost. Understanding which components of HA are valuable and which are not requires not just extensive technical expertise but financial and managerial skills as well. Departments must work together extensively to truly understand how HA will impact an organization and when it will be worth the investment. It is critical to remember that the need for high availability in an organization or for a workload is anything but a foregone conclusion, and it should not be surprising in the least to find that extensive, or even casual, high availability practices turn out to be economically impractical.
In many ways this is because standard availability has reached such a state that there is continuously less and less risk to mitigate. Technology components used in a business infrastructure, most notably servers, networking gear and storage, have become so reliable that the amount of downtime we must protect against is quite low. Most of the belief in the need for knee-jerk high availability comes from a different era, when reliable hardware was unaffordable and even the most expensive equipment was rather unreliable by modern standards. The feeling of impending doom, that any device might fail at any moment, belongs to that older era, not the current one. Modern equipment, while obviously still carrying risks, is amazingly reliable.
In addition to other risks, over-investing in high availability solutions carries financial and business risks that can be substantial. It increases technical debt in the face of business uncertainty. What if the business suddenly grows or, worse, suddenly contracts, changes direction, gets purchased or goes out of business completely? The investment in high availability is already spent even if the need for its protection disappears. What if technology or location changes? Some or all of a high availability investment might be lost before it would have reached its expected end of life.
As IT practitioners, evaluating the benefits, risks and costs of technology solutions is at the core of what we do. Like everything else in business infrastructure, determining the type of risk mitigation, the value of protection and how much is financially proper is our key responsibility and cannot be glossed over or ignored. We can never simply assume that high availability is needed, nor that it can simply be skipped. It is in analysis of this nature that IT brings some of its greatest value to organizations. It is here that we have the potential to shine the most.