
Virtual Eggs and Baskets

In speaking with small business IT professionals, one of the key sources of hesitancy around deploying virtualization is what is often described as “don’t put all of your eggs in one basket.”

I can see where this concern arises.  Virtualization allows many guest operating systems to be contained in a single physical system, which means that, in the event of a hardware failure, all guest systems residing on it fail together, all at once.  This sounds bad, but perhaps it is not as bad as we would first presume.

The idea behind the eggs and baskets idiom is that we should not put all of our resources at risk at the same time.  It is generally applied to investing, encouraging investors to diversify and invest in many different companies and types of securities such as bonds, stocks, funds and commodities.  In the case of eggs (or money) we are talking about an interchangeable commodity.  One egg is as good as another.  A set of eggs is naturally redundant.

If we have a dozen eggs and we break six, we can still make an omelette, maybe a smaller one, but we can still eat.  Eating a smaller omelette is likely to be nearly as satisfying as a larger one – we are not going hungry in any case.  Putting our already redundant eggs into multiple baskets allows us to hedge our bets.  Yes, carrying two baskets means that we have less time to pay attention to either one so it increases the risk of losing some of the eggs but reduces the chances of losing all of the eggs.  In the case of eggs, a wise proposition indeed.  Likewise, a smart way to prepare for your retirement.

This theory, because it is repeated as an idiom without careful analysis or proper understanding, is then applied to unrelated areas such as server virtualization.  Servers, however, are not like eggs.  Servers, especially in smaller businesses, are rarely interchangeable commodities where having six working, instead of the usual twelve, is good enough.  Typically each server plays a unique role and all are relatively critical to the functioning of the business.  If a server is not critical then it is unlikely that it could justify the cost of its acquisition and maintenance in the first place and so it probably would not exist.  When servers are interchangeable, such as in a large, stateless web farm or compute cluster, they are configured that way as a means of expanding capacity beyond the confines of a single, physical box and so fall outside the scope of this discussion.

IT services in a business are usually, at least to some degree, a “chain dependency.”  That is, they are interdependent and the loss of a single service may impact other services either because they are technically interdependent (such as a line of business application being dependent on a database) or because they are workflow interdependent (such as an office worker needing the file server working in order to provide a file which he needs to edit with information from an email while discussing the changes over the phone or instant messenger.)  In these cases, the loss of a single key service such as email, network authentication or file services may create a disproportionate loss of working ability.  If there are ten key services and one goes down, company productivity from an IT services perspective likely drops by far more than ten percent, possibly nearing one hundred percent in extreme cases.  This is not always true; in some unique cases workers are able to “work around” a lost service effectively, but this is very uncommon.  Even if people can remain working, they are likely far less productive than usual.

When dealing with physical servers, each server represents its own point of failure.  So if we have ten servers, we have ten times the likelihood of an outage compared to having only one of those same servers.  Each server that we add brings with it its own risk.  If we assume that each server fails, on average, once per decade and that each failure has an outage factor of 0.25 – that is, it financially impacts the business for twenty five percent of revenue for, say, one day – then our total average impact over a decade is the equivalent of two and a half total site outages.  I use the concept of factors and averages here to make this easy; determining the length or impact of an average outage is not necessary as we only need to determine relative impact in order to compare the scenarios.  It is just a means of comparing the cumulative financial impact of one type of outage event to another without needing specific figures – this doesn’t help you determine what your spend should be, just relative reliability.

With virtualization we have the obvious ability to consolidate.  In this example we will assume that we can collapse all ten of these existing servers down into a single server.  When we do this we often trigger the “all our eggs in one basket” response.  But if we run some risk analysis we will see that this is usually just fear and uncertainty and not a mathematically supported risk.  If we assume the same failure rate as in the example above, our single server will, on average, incur just a single total site outage once per decade.

Compare this to the first example which did the damage equivalent to two and a half total site outages – the risk of the virtualized, consolidated solution is only forty percent that of the traditional solution.
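
To make the arithmetic above concrete, here is a minimal sketch of the comparison written as a few lines of Python.  The figures are the illustrative assumptions already stated – one failure per server per decade, with a partial outage impacting a quarter of revenue for a day – not measured data.

# Illustrative comparison of cumulative outage impact over a decade.
# Assumptions taken from the discussion above, not from measured data:
#   - each physical server fails, on average, once per decade
#   - losing one of ten physical servers impacts 25% of revenue for a day
#   - losing the single consolidated server impacts 100% of revenue for a day
failures_per_server_per_decade = 1

ten_server_impact = 10 * failures_per_server_per_decade * 0.25   # 2.5 "site-day" equivalents
one_server_impact = 1 * failures_per_server_per_decade * 1.00    # 1.0 "site-day" equivalent

print(ten_server_impact, one_server_impact)    # 2.5 versus 1.0
print(one_server_impact / ten_server_impact)   # 0.4 - forty percent of the risk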

Now keep in mind that this is based on the assumption that losing some services means a financial loss greater than the strict value of the services that were lost, which is almost always the case.  Even if the impact of a lost service is no greater than its proportional share we are only at break even and need not worry.  In rare cases the impact of losing a single system can be less than its “slice of the pie”, normally because people are flexible and can work around the failed system – for example, if instant messaging fails people simply switch to using email until instant messaging is restored – but these cases are rare and are normally isolated to a few systems out of many, with the majority of systems, say ERP, CRM and email, having disproportionately large impacts in the event of an outage.

So what we see here is that under normal circumstances moving ten services from ten servers to ten services on one server will generally lower our risk, not increase it – in direct contrast to the “eggs in a basket” theory.  And this is purely from a hardware failure perspective.  Consolidation offers several other important reliability factors, though, that can have a significant impact on our case study.

With consolidation we reduce the amount of hardware that needs to be monitored and managed by the IT department.  Fewer servers means that more time and attention can be paid to those that remain.  More attention means a better chance of catching issues early and more opportunity to keep parts on hand.  Better monitoring and maintenance leads to better reliability.

Possibly the most important factor with consolidation, however, is that there are significant cost savings and these, if approached correctly, can provide opportunities for improved reliability.  With the dramatic reduction in the total cost of servers it can be tempting to continue to keep budgets tight and attempt to purely leverage the cost savings directly.  That is understandable, and for some businesses it may be the correct approach.  But it is not the approach that I would recommend when struggling against the notion of eggs and baskets.

Instead, by applying a more moderate approach – keeping significant cost savings but still spending more, relatively speaking, on a single server – you can acquire a higher end (read: more reliable) server, use better parts, have on-site spares, etc.  The cost savings of virtualization can often be turned directly into increased reliability, further shifting the equation in favor of the single server approach.

As I stated in another article, one brick house is more likely to survive a wind storm than either one or two straw houses.  Having more of something doesn’t necessarily make it the more reliable choice.

These benefits come purely from the consolidation aspect of virtualization and not from the virtualization itself.  Virtualization also provides extended risk mitigation features of its own.  System imaging and rapid restores, as well as restores to different hardware, are major advantages of nearly any virtualization platform.  These can play an important role in a disaster recovery strategy.

Of course, all of these concepts are purely to demonstrate that single box virtualization and consolidation can beat the legacy “one app to one server” approach and still save money – showing that the example of eggs and baskets is misleading and does not apply in this scenario.    There should be little trepidation in moving from a traditional environment directly to a virtualized one based on these factors.

It should be noted that virtualization can then extend the reliability of traditional commodity hardware by providing mainframe-like failover features that are above and beyond what non-virtualized platforms are able to provide.  This moves commodity hardware more firmly into line with the larger, more expensive RISC platforms.  These features can bring an extreme level of protection but are often above and beyond what is appropriate for IT shops initially migrating from a non-failover, legacy hardware server environment.  High availability is a great feature but is often costly and very often unnecessary, especially as companies move from the relatively unreliable environments of the past to the more reliable environments of today.  Given that we have already increased reliability over what was considered necessary in the past, there is a very good chance that an extreme jump in reliability is not needed now; but due to the large drop in the cost of high availability, it is quite possible that it will be cost justified where previously it could not be.

In the same vein, virtualization is often feared because it is seen as a new, unproven technology.  This is certainly untrue, but the impression persists in the small business and commodity server space.  In reality virtualization was first introduced by IBM in the 1960s and has ever since been a mainstay of high end mainframe and RISC servers – the systems demanding the best reliability.  In the commodity server space virtualization was a larger technical challenge and took a very long time before it could be implemented efficiently enough to be effective in the real world.  But even in the commodity server space virtualization has been available since the late 1990s and so is approximately fifteen years old today, which is very far past the point of being a nascent technology – in the world of IT it is positively venerable.  Commodity platform virtualization is a mature field with several highly respected, extremely advanced vendors and products.  The use of virtualization as a standard for all or nearly all server applications is a long established and accepted “enterprise pattern” and one that can now easily be adopted by companies of any and every size.

Virtualization, perhaps counter-intuitively, is actually a very critical component of a reliability strategy.  Instead of adding risk, virtualization can almost be approached as a risk mitigation platform – a toolkit for increasing the reliability of your computing platforms through many avenues.

When No Redundancy Is More Reliable – The Myth of Redundancy

Risk is a difficult concept and it requires a lot of training, thought and analysis to properly assess a given scenario.  Often, because risk assessments are so difficult, we substitute risk analysis with simply adding basic redundancy and assume that we have appropriately mitigated risk.  But very often this is not the case.  The introduction of complexity or additional failure modes often accompanies the addition of redundancy, and these new forms of failure have the potential to add more risk than the added redundancy removes.  Storage systems are especially prone to this kind of decision making, which is unfortunate as few, if any, systems are as susceptible to failure or as important to protect.

RAID is a great example of where a lack of holistic risk thinking can lead to some strange decision making.  If we look at a not uncommon scenario we will see where the goal of protecting against drive failure can actually lead to an increase in risk even when additional redundancy is applied.  In this scenario we will consider a single array of twelve three terabyte SATA hard drives.  It is not uncommon to hear of people choosing RAID 5 for this scenario to get “maximum capacity and performance” while having “adequate protection against failure.”

The idea here is that RAID 5 protects against the loss of a single drive, which can be replaced while the array rebuilds itself before a second drive fails.  That is great in theory, but the real risks of an array of this size, thirty six terabytes of drive capacity, come not from multiple drive failures as people generally suspect but from an inability to reliably rebuild the array after a single drive failure, or from a failure of the array itself with no individual drives failing.  The risk of a second drive failing is low – not non-existent, but quite low.  Drives today are highly reliable.  Once one drive fails it does increase the likelihood of a second drive failing, which is well documented, but I don’t want that risk to distract us from the true risk – the risk of a failed resilvering operation.

What scares us during a RAID 5 resilver operation is that an unrecoverable read error (URE) can occur.  When it does, the resilver operation halts and the array is left in a useless state – all data on the array is lost.  On common SATA drives the URE rate is one error per 10^14 bits read, or roughly once every twelve terabytes of reads.  That means that a six terabyte array being resilvered has a roughly fifty percent chance of hitting a URE and failing.  A fifty percent chance of failure is insanely high.  Imagine if your car had a fifty percent chance of the wheels falling off every time that you drove it.  So with a small (by today’s standards) six terabyte RAID 5 array using 10^14 URE SATA drives, if we were to lose a single drive, we have only a fifty percent chance that the array will recover, assuming the drive is replaced immediately.  That doesn’t include the risk of a second drive failing, only the risk of a URE failure.  It also assumes that the drives are completely idle other than the resilver operation.  If the drives are busily being used for other tasks at the same time then the chances of something bad happening, either a URE or a second drive failure, begin to increase dramatically.

With a twelve terabyte array the chances of complete data loss during a resilver operation begin to approach one hundred percent – meaning that RAID 5 provides effectively no protection at all in that case.  There is always a chance of survival, but it is very low.  At six terabytes you can compare a resilver operation to a game of Russian roulette with one bullet and six chambers where you have to pull the trigger three times.  With twelve terabytes you have to pull it six times!  Those are not good odds.

But we are not talking about a twelve terabyte array.  We are talking about a thirty six terabyte array – which sounds large, but this is a size that someone could easily have at home today, let alone in a business.  Every major server manufacturer, as well as nearly all low cost storage vendors, makes sub $10,000 storage systems in this capacity range today.  Resilvering a RAID 5 array with a single drive failure on a thirty six terabyte array is like playing Russian roulette with one bullet, six chambers and pulling the trigger eighteen times!  Your data doesn’t stand much of a chance.  Add to that the incredible amount of time needed to resilver an array of that size and the risk of a second disk failing during that resilver window starts to become a rather significant threat.  I’ve seen estimates of resilver times climbing into weeks or months on some systems.  That is a long time to run degraded, unable to afford the loss of another drive.  When we are talking hours or days the risks are pretty low, but still present.  When we are talking weeks or months of continuous abuse, as resilver operations are extremely drive intensive, the failure rates climb dramatically.
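
For those who want to see where these odds come from, here is a rough sketch of the probability math.  The simple figures above come from dividing the data read during a rebuild by the roughly twelve terabytes expected between UREs; the sketch also shows a slightly more careful model that treats errors as independent events at the published rate.  Real-world URE rates vary considerably by drive, so treat these purely as illustrative numbers.

import math

URE_RATE = 1e-14    # published consumer SATA spec: one unrecoverable read error per 10^14 bits
TB_BITS = 1e12 * 8  # bits in one (decimal) terabyte

def rebuild_failure_chance(read_tb):
    """Chance of hitting at least one URE while reading read_tb terabytes."""
    expected_errors = read_tb * TB_BITS * URE_RATE
    simple = min(expected_errors, 1.0)            # the back-of-envelope estimate used above
    independent = 1 - math.exp(-expected_errors)  # assuming independent errors at the published rate
    return simple, independent

# 33 TB is roughly what the eleven surviving 3 TB drives must supply to rebuild our example array.
for tb in (6, 12, 33):
    simple, independent = rebuild_failure_chance(tb)
    print(f"{tb:>3} TB read: ~{simple:.0%} (simple estimate) / {independent:.0%} (independent-error model)")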

With an array of this size we can effectively assume that the loss of a single drive means the loss of the complete array, leaving us with no drive failure protection at all.  Now if we build an array of the same or better performance and the same or better capacity under RAID 0, which also has no protection against drive loss, we need only eleven of the same drives that we needed twelve of for our RAID 5 array.  What this means is that instead of twelve hard drives, each of which has a roughly three percent chance of annual failure, we have only eleven.  That alone makes our RAID 0 array more reliable as there are fewer drives to fail.  Not only do we have fewer drives, but there is no need to write parity blocks nor to skip over them when reading back, lowering, ever so slightly, the mechanical wear and tear on the RAID 0 array for the same utilization and giving it a very slight additional reliability edge.  The RAID 0 array of eleven drives will be identical in usable capacity to the twelve drive RAID 5 array but will have slightly better throughput and latency.  A win all around.  Plus the cost savings of not needing an additional drive.
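
A quick sketch of the drive-count arithmetic, using the same illustrative three percent annual failure rate per drive (a rough assumption; real failure rates vary by model, age and environment):

AFR = 0.03   # assumed annual failure rate per drive, as in the example above

# Chance that at least one drive in the array fails during a year.
p_raid5_12 = 1 - (1 - AFR) ** 12   # twelve drive RAID 5
p_raid0_11 = 1 - (1 - AFR) ** 11   # eleven drive RAID 0

print(f"RAID 5, twelve drives: {p_raid5_12:.1%} chance of losing a drive per year")
print(f"RAID 0, eleven drives: {p_raid0_11:.1%} chance of losing a drive per year")

# For RAID 0 any drive loss is array loss.  For a RAID 5 array of this capacity the
# argument above is that a drive loss is, in practice, also array loss because the
# rebuild is so unlikely to complete - so the array with fewer drives comes out ahead.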

So what we see here is that in large arrays (large in capacity, not in spindle count) RAID 0 actually surpasses RAID 5 in certain scenarios.  When using common SATA drives this happens at capacities experienced even by power users at home and by many small businesses.  If we move to enterprise SATA drives or SAS drives then the capacity at which this occurs becomes very high and is not a concern today, but it will be in just a few years when drive capacities grow larger still.  But this highlights how dangerous RAID 5 is at the sizes that we see today.  Everyone understands the incredible risks of RAID 0, but it can be difficult to put into perspective that RAID 5’s issues are so extreme that it might actually be less reliable than RAID 0.

That RAID 5 might be less reliable than RAID 0 in an array of this size based on resilver operations alone is just the beginning.  In a massive array like this the resilver can take so long and exact such a toll on the drives that a second drive failure starts to become a measurable risk as well.  And then there are additional risks caused by array controller errors, where a resilver algorithm can destroy an entire array even when no drive failure has occurred.  As RAID 0 (and the simple copy-based rebuilds of RAID 1 and RAID 10) do not involve parity resilver algorithms they do not suffer this additional risk.  These are hard risks to quantify, but what is important is that they are additional risks that accumulate when using a more complex system where a simpler system, without the redundancy, was more reliable from the outset.

Now that we have established that RAID 5 can be less reliable than RAID 0, I will point out the obvious dangers of RAID 0.  RAID in general is used to mitigate the risk of a single, lone hard drive failing.  We all fear a single drive simply failing and all data being lost.  RAID 0, being a large stripe of drives without any form of redundancy, takes the risk of a single drive failing and multiplies it across a number of drives, where any drive failing causes total loss of data across all drives.  So in our eleven disk example above, if any of the eleven disks fails all is lost.  It is easy to see that this is dramatically more dangerous than just using a single drive, all alone.

What I am trying to point out here is that redundancy does not mean reliability.  The fact that something is redundant, like RAID 5, provides no guarantee that it will always be more reliable than something that is not redundant.

My favourite analogy here is to look at houses in a tornado.  In one scenario we build a house of brick and mortar.  In the second scenario we build two redundant houses, each out of straw (our builders are pigs, apparently.)  When the tornado (or big bad wolf) comes along, which is more likely to leave us with a standing house?  Clearly one brick and mortar house has some significant reliability advantages over redundant straw houses.  Redundancy didn’t matter; reliability mattered in the end.

Redundancy is often misleading because it is easy to quantify but hard to qualify.  Redundancy is a black or white question: Is it redundant?  Yes or no.  Simple.  Reliability is not so simple.  Reliability is about failure rates and likelihoods.  It is about statistics and analysis.  As it is hard to quantify reliability in a meaningful way, especially when selling a project to the business people, redundancy often becomes a simple substitute for this complex concept.

The concept of using redundancy to misdirect questions of reliability also ends up applying to subsystems in very convoluted ways.  Instead of making a “system” redundant it has become common to make a highly reliable, and low cost, subsystem redundant and treat subsystem redundancy as applying to the whole system.  The most common example of this is RAID controllers in SAN products.  Rather than having a redundant SAN (meaning two SANs), manufacturers will often make redundant the one component that is not normally redundant in ordinary servers and then call the SAN redundant – meaning a SAN that contains redundancy, which is not at all the same thing.

A good analogy here would be to compare having redundant cars, meaning two complete, working cars, with having a single car and a spare water pump in the trunk in case the main one fails.  Clearly, a spare water pump is not a bad thing.  But it is also a trivial amount of protection against car failure compared to having a second car ready to go.  In one case the entire system is redundant, including the chassis.  In the other we are making just one, highly reliable component redundant inside the chassis.  It is not even on par with having a spare tire which, at least, is a car component with a higher likelihood of failure.

Just like the myth of RAID 5 reliability and the system/subsystem confusion, shared storage technologies like SANs and NAS often get treated in the same way, especially in regards to virtualization.  There is a common scenario where a virtualization project is undertaken and people instinctively panic because a single virtualization host represents a single point of failure where, if it fails, many systems will all fail at once.

Using the term “single point of failure” causes a feeling of panic and is a great means of steering a conversation.  But a SPOF, as we like to call it, while something we like to remove when possible, may not be the end of the world.  Think about our brick house.  It is a SPOF.  Our two houses of straw are not.  Yet a single breeze takes out our redundant solution faster than our reliable SPOF.  Looking for SPOFs is a great way to find points of fragility in a system, but do not feel that every SPOF must be made redundant in every scenario.  Most businesses will find their best value in having many SPOFs in place.  Our real goal is reliability at appropriate cost; redundancy, as we have seen, is no substitute for reliability – it is simply a tool that we can use to achieve reliability.

The theory that many people follow when virtualizing is to take their virtualization host and say “This host is a SPOF, so I need to have two of them and use High Availability features to allow for transparent failover!”  This is spurred by the leading virtualization vendor making their money firstly by selling expensive HA add-on products and secondly by being owned by a large storage vendor – so selling unnecessary or even dangerous additional shared storage is a big monetary win for them and could easily be the reason that they have championed the virtualization space from the beginning.  Redundant virtualization hosts with shared storage sound great but can be extremely misguided for several reasons.

The first reason is that the initial SPOF, the virtualization host, is simply replaced with a new SPOF, the shared storage.  This accomplishes nothing.  Assuming that we are using servers and shared storage of comparable quality, all we’ve done is move where the risk is, not change how big it is.  The likelihood of the storage system failing is roughly equal to the likelihood of the original server failing.  But in addition to shuffling the SPOF around like in a shell game we’ve also done something far, far worse – we have introduced chained, or cascading, failure dependencies.

In our original scenario we had a single server.  If the server stayed working we were good; if it failed we were not.  Simple.  Now we have two virtualization hosts, a single storage server (SAN, NAS, whatever) and a network connecting them together.  We have already determined that the risk of the shared storage failing is approximately equal to our total system risk in the original scenario.  But now we have the additional dependencies of the network and the two front end virtualization nodes.  Each of these components is more reliable than the fragile shared storage (anything with mechanical drives is going to be fragile), but that they are lower risk is not the issue; the issue is that the risks are combinatorial.

If any of these three components (storage, network or front end nodes) fails then everything fails.  The solution to this is to make the shared storage redundant on its own and to make the network redundant on its own.  With enough work we can overcome the fragility and risk that we introduced by adding shared storage, but the shared storage on its own is not a form of risk mitigation – it is itself a risk which must be mitigated.  The spiral of complexity begins, and the cost associated with bringing this new system up on par with the reliability of the original, single server system can be astronomic.
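
A rough sketch of why the chained dependencies matter.  The per-component availability figures below are made-up placeholders purely for illustration – the point is the multiplication, not the specific numbers:

# Assumed annual availability figures - illustrative only, not vendor data.
single_server  = 0.99    # the original standalone virtualization host
shared_storage = 0.99    # roughly as failure-prone as the original server was
network        = 0.999   # the switching connecting hosts to storage
host_a         = 0.995   # each front end virtualization node
host_b         = 0.995

# HA lets the workloads survive the loss of ONE host, so the host pair only
# fails if both hosts are down at once; everything else is a chained dependency.
host_pair = 1 - (1 - host_a) * (1 - host_b)
chained   = shared_storage * network * host_pair

print(f"single server:                   {single_server:.4f}")
print(f"redundant hosts + network + SAN: {chained:.4f}")
# Even with the redundant hosts, the chained design ends up no more available
# than the single server - all the new spend went into standing still.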

Now that we have all of this redundancy we have one more risk to worry about.  Managing all of this redundancy, all of these moving parts, requires a lot more knowledge, skill and preparation than does managing a simple, single server.  We have moved from a simple solution to a very complex one.  In my own anecdotal experience the real dangers of solutions like this come not from the hardware failing but from human error.  Not only has little been done to avoid human error causing this new system to fail but we’ve added countless points where a human might accidentally bring the entire system, redundancy and all, right down.  I’ve seen it first hand; I’ve heard the horror stories.  The more complex the system the more likely a human is going to accidentally break everything.

It is critical that, as IT professionals, we step back and look at complete systems, consider reliability and risk, and think of redundancy simply as a tool to use in the pursuit of reliability.  Redundancy itself is not a panacea.  Neither is simplicity.  Reliability is a complex problem to tackle.  Avoiding simplistic substitutes is an important first step in moving from covering up reliability issues to facing and solving them.

 

Do You Really Need Redundancy: The Real Cost of Downtime

Downtime – now that is a word that no one wants to hear.  It strikes fear into the heart of businesses, executives and especially IT staff.  Downtime costs money and it causes frustration.

Because downtime triggers an emotional reaction, businesses are often left reacting to it differently than they would to traditional business factors.  This emotional approach causes businesses, especially smaller businesses often lacking rational financial controls, to treat downtime as being far worse than it is.  It is not uncommon to find that smaller businesses have actually done more financial damage to themselves reacting to a fear of potential downtime than the feared downtime would have inflicted had it actually occurred.  This is a dangerous overreaction.

The first step is to determine the cost of downtime.  In IT we are often dealing with rather complex systems, and downtime comes in a variety of flavors such as loss of access, loss of performance or a complete loss of a system or systems.  Determining every type of downtime and its associated costs can be rather complex, but a high level view is often enough to produce rational budgets or is, at the very least, a good starting point on the path towards understanding the business risks involved with downtime.  Keep in mind that, just as spending too much to avoid downtime is bad, spending too much to calculate the cost of downtime is bad as well.  Don’t spend so much time and so many resources determining whether you will lose money that you would have been better off just losing it.  Beware the high cost of decision making.

We can start by considering only complete system loss.  What is the cost of organizational downtime for you – that is, if you had to cease all business for an hour or a day, how much money is lost?  In some cases the losses could be dramatic, as in the case of a hospital where a day of downtime would result in a loss of faith and future customer base and potentially result in lawsuits.  But in many cases a day of downtime might have nominal financial impact – many businesses could simply call the day a holiday, let their staff rest for the day and have people work a little harder over the next few days to make up the backlog from the lost day.  It all comes down to how your business does and can operate and how well suited you are for mitigating lost time.  Many businesses will only look at daily revenue figures to determine lost revenue, but this can be wildly misleading.

Once we have a rough figure for downtime cost we can then consider downtime risk.  This is very difficult to assess as good figures on IT system reliability are nearly non-existent and every organization’s systems are so unique that industry data is very nearly useless.  Here we are forced to rely on IT staff to provide an overview of risks and, hopefully, a reliable assessment of the likelihood of each individual risk.  For example, in big round numbers, if we had a line of business application that ran on a server with only one hard drive then we would expect that sometime in the next five to ten years there will be downtime associated with the loss of that drive.  If we have that same server with hot swap drives in a mirrored array then the likelihood of downtime associated with that storage system, even over ten years, is quite small.  This doesn’t mean that a drive is not likely to fail – it is – but that the system is likely to be unaffected, with redundancy restored before end users notice that anything has happened.
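
To put rough numbers on that comparison, here is a small sketch using assumed figures – a few percent annual failure rate per drive and a short unprotected window while a failed mirror rebuilds – chosen only to show the shape of the math, not as real reliability data:

AFR = 0.05          # assumed annual failure rate of a single drive (illustrative)
YEARS = 10
REBUILD_DAYS = 2    # assumed window during which a degraded mirror is unprotected

# Single drive: any failure means downtime.
p_single = 1 - (1 - AFR) ** YEARS

# Mirrored pair: downtime only if the second drive fails during the rebuild window.
# (This ignores correlated failures and controller faults, so it is optimistic.)
p_per_year = AFR * (AFR * REBUILD_DAYS / 365)
p_mirror = 1 - (1 - p_per_year) ** YEARS

print(f"single drive:  {p_single:.0%} chance of drive-related downtime in {YEARS} years")
print(f"mirrored pair: {p_mirror:.3%} chance over the same period")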

Our last rough estimation tool is to consider applicable business hours.  Many businesses do not run 24×7 – some do, of course, but most do not.  Is the loss of a line of business application at six in the evening equivalent to the loss of that application at ten in the morning?  What about on the weekend?  Are people productively using it at three on a Friday afternoon, or would losing it barely cost a thing and make for happy employees getting an extra hour or two on their weekends?  Can schedules be shifted if a loss happens near lunch time?  These factors, while seemingly trivial, can be significant.  If downtime is limited to only two to four hours then many businesses can mitigate nearly all of the financial impact simply by asking employees to have a little flexibility in their schedules to accommodate the outage by taking lunch early or leaving work early one day and working an extra hour the next.

Now that we have these factors – the cost of downtime, the ability to mitigate downtime impact based on duration and the risks of outage events – we can begin to draw a picture of what a downtime event is likely to look like.  From this we can begin to derive how much money it would be worth spending to reduce the risk of such an event.  For some businesses this number will be extremely high and for others it will be surprisingly low.  This exercise can expose a great deal about how a business operates that may not normally be all that visible.

It is important to note at this point that what we are looking at here is a loss of availability of systems, not a loss of data.  We are assuming that good backups are being taken and that those backups are not compromised.  Redundancy and downtime are not topics related to data loss, just availability loss.  Data loss scenarios should be treated with equal or greater diligence but are a separate topic.  It is a rare business that can survive catastrophic data loss, but it is common to experience, and easily survive, even substantial downtime.

There are multiple ways to stave off downtime.  Redundancy is highly visible, treated almost like a buzzword, and so receives a lot of focus, but there are other means as well.  Good system design is important; avoiding system complexity can heavily reduce downtime simply by removing points of unnecessary risk and fragility.  Using quality hardware and software is important as well – low end hardware that is redundant will often fail as frequently as non-redundant enterprise class hardware.  Having a rapid supply chain of replacement parts can be a significant factor, often seen in the form of four hour hardware vendor replacement part response contracts.  The list goes on.  What we will focus on is redundancy, which is where we are most likely to overspend when faced with the fear of downtime.

Now that we know the costs of failing to have adequate redundancy we can compare this potential cost against the very real, up front cost of providing that redundancy.  Some things, such as hard drives, are highly likely to fail and relatively easy and cost effective to make redundant – taking significant risk and trivializing it.  These tend to be a first point of focus.  But there are many areas of redundancy to consider such as power supplies, network hardware, Internet connections and entire systems – often made redundant through modern virtualization techniques providing new avenues for redundancy previously not accessible to many smaller businesses.

New types of redundancy, especially those made available through virtualization, are often a point where businesses will be tempted to overspend, perhaps dramatically, compared to the risks of downtime.  Worse yet, in the drive to acquire the latest fads in redundancy, companies will often implement these techniques incorrectly and actually introduce greater risk and a higher likelihood of downtime than if they had done nothing at all.  It is becoming increasingly common to hear of businesses spending tens or even hundreds of thousands of dollars in an attempt to mitigate a downtime monetary loss of only a few thousand dollars – and then failing in that attempt and increasing their risk anyway.

When gauging the cost of mitigation it is critical to remember that mitigation is a guaranteed expense while the risk is only a risk.  It is much like auto insurance, where you pay a guaranteed small monthly fee in order to fend off a massive, unplanned expense.  The theory of risk mitigation is to spend a comparatively small amount of money now in order to reduce the risk of a large expense later, but if the cost of mitigation gets too high then it becomes better to simply accept the risk.
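
As a final sketch, the decision boils down to comparing a guaranteed mitigation cost against an expected loss.  Every figure below is a placeholder chosen only to illustrate the comparison, not a recommendation:

# Placeholder figures for illustration only - substitute your own estimates.
downtime_cost_per_event = 5_000     # estimated impact of one outage, after schedule shuffling
events_per_decade       = 1         # estimated number of such outages over ten years
mitigation_cost         = 40_000    # up-front cost of the redundancy that would prevent them
mitigation_lifespan_yrs = 5         # how long that gear lasts before it must be replaced

expected_loss_per_decade    = downtime_cost_per_event * events_per_decade
mitigation_spend_per_decade = mitigation_cost * (10 / mitigation_lifespan_yrs)

print(f"expected downtime loss per decade: ${expected_loss_per_decade:,}")
print(f"guaranteed mitigation spend:       ${mitigation_spend_per_decade:,.0f}")
# If the guaranteed spend dwarfs the expected loss, accepting the risk is the rational choice.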

Systems can be assessed individually, of course.  Keeping a web presence and telephone system up and running at all times is far more important than keeping up an email system where even hours of downtime are unlikely to be detectable by external clients.  Paying only to protect those systems where the cost of downtime is significant is an important strategy.

Do not be surprised if what you discover is that, beyond some very basic redundancy (such as mirrored hard drives), a simple network design with good backup and restore plans and a good hardware support contract is all that is needed for the majority, if not all, of your systems.  By lowering the complexity of your systems you make them naturally more stable and easier to manage – further reducing the cost of your IT infrastructure.