Tag Archives: risk

It Worked For Me

“Well, it worked for me.”  This has become a phrase that I have heard over and over again in defense of what would logically be otherwise considered a bad idea.  These words are often spoken innocently enough without deep intent, but they often cover deep meaning that should be explored.

But it is important to understand what drives these words both psychologically as well as technically.  At a high level, what we have is the delivery of an anecdote which can be restated as such: “While the approach or selection that I have used goes against your recommendation or best practices or what have you, in my particular case the bad situation of which you have warned or advised against has not arisen and therefore I believe that I am justified in the decision that I have made.”

I will call this the “Anecdotal Dismissal of Risk” or better known as “Outcome Bias.”  Generally this phrase is used to wave off the accusation that one has either taken on unnecessary risk or taken on unnecessary financial expense or, more likely, both.  The use of an anecdote for either of these cases is, of course, completely meaningless but the speaker does so with the hope of throwing off the discussion and routing it around their case by suggesting, without saying it, that perhaps they are a special case that has not been considered or, perhaps, that “getting lucky” is a valid form of decision making.

Of course, when talking risk, we are talking about statistical risk.  If anything was a sure thing, and could be proven or disproved with an anecdote, it would not be risk but would just be a known outcome and making the wrong choice would be amazingly silly.  Anecdotes have a tiny place when using in the negative, for example: They claim that it is a billion to one chance that this would happen, but it happened to me on the third try and I know one other person that it happened to.  That’s not proof, but anecdotally it suggests that the risk figures are unlikely correct.

That case is valid, still incredibly important to realize that even negative anecdotal evidence (anecdotal evidence of something that was extremely unlikely to happen) is still anecdotal and does not suggest that the results will happen again, but at least it suggests that you were an amazing edge case.  If you know of one person that has won the lottery, that’s unlikely but doesn’t prove that the lottery is likely to be won.  If you know that every other person you know who has played the lottery has won, something is wrong with the statistics.

However, the “it worked for me” case is universally used with risk that is less than fifty percent (if it were not the whole thing would become crazy.)  Often it is about taking something four nines reliability and reducing it to three nines when attempting to raise it.  Three nines of something still means that there is only a one in one thousand chance that the bad case will arise.  This isn’t statistically likely to occur, obviously.  At least we would hope that it was obvious.  Even though, in this example, the bad case arises ten times more often than it would have it we had left well enough alone and maybe one hundred times more than how often we intended for it to arise we still expect to never see the bad outcome unless we run thousands or tens of thousands of cases and then the statistics are still based on a rather small pool.

In many cases we talk about an assumption of unnecessary risk but generally this is risk at a financial cost. What prompts this reaction a great deal of the time, in my experience, is a reaction to being demonstrated a dramatic overspending – implementing very costly solutions when a less costly one, often fractionally as expensive, may approach or, in many cases, exceed the chosen solution that is being defended.

To take the reverse, out of any one thousand people, nine hundred and ninety nine of them, doing this same thing, would be expected to have no bad outcome.  For someone to claim, then, that the risk is one part in one thousand and have one of the nine hundred and ninety nine step forward and say “the risk can’t exist because I am not the incredibly unlikely one to have had the bad thing happen to me” obviously makes no sense whatsoever when looking at the pool as a whole.  But when we are the ones who made the decision to join that pool and then came away unscathed it is an apparently natural reaction to discount the assumed outcome of even a risky choice and assume that the risk did not exist.

It is difficult to explain risk in this way but, over the years, I’ve found a really handy example to use that tends to explain business or technical risk in a way that anyone can understand.  I call it the Mother Seatbelt Example.  Try this experiment (don’t actually try it but lie to your mother and tell her that you did to see the outcome.)

Drive a car without wearing a seatbelt for a whole day while continuously speeding.  Chances are extremely good that nothing bad will happen to you (other than paying some fines.)  The chances of having a car accident and getting hurt, even while being reckless in both your driving and disregarding basic safety precautions, is extremely low.  Easily less than one in one thousand.   Now, go tell your mother what you just did and say that you feel that doing this was a smart way to drive and that you made a good decision in having done so because “it worked out for me.”  Your mother will make it very clear to you what risky decisions mean and how anecdotal evidence of expected survival outcome does not indicate good risk / reward decision making.

In many cases, “it worked for me” is an attempt at deflection.  A reaction of our amygdala in a “fight or flight” response to avoid facing what is likely a bad decision of the past.  Everyone has this reaction, it is natural, but unhealthy.  By taking this stance of avoiding critical evaluation of past decisions we make ourselves more likely to continue to repeat the same bad decision or, at the very least, continue the bad decision making process that lead to that decision.  It is only by facing critical examination and accepting that past decisions may not have been ideal that we can examine ourselves and our processes and attempt to improve them to avoid making the same mistakes again.

It is understandable that in any professional venue there is a desire to save face and appear to have made if not a good decision, at least an acceptable one and so the desire to explore logic that might undermine that impression is low.  Even moreso there is a very strong possibility that someone who is a potential recipient of the risk or cost that the bad decision created will learn of the past decision making and there is, quite often, an even stronger desire to cover up any possibility that a decision may have been made without proper exploration or due diligence.  These are understandable reactions but they are not healthy and ultimately make the decision look even poorer than it would have.  Everyone makes mistakes, everyone.  Everyone overlooks things, everyone learns new things over time.  In some cases, new evidence comes to light that was impossible to have known at the time.  There should be no shame in past decisions that are less than ideal, only in failing to examine them and learn from them allowing us as individuals as well as our organizations to grow and improve.

The phrase seems innocuous enough when said.  It sounds like a statement of success.  But we need to reflect deeper.  The risk scenario we showed above.  But what about the financial one.  When a solution is selected that carries little or no benefits, and possibly great caveats as we see in many real world cases, while being much more costly and the term “it worked for me” is used, what is really being said is “wasting money didn’t get me in trouble.”  When used in the context of a business, this is quite a statement to make.  Businesses exist to make money.  Wasting money on solutions that don’t meet the need better is a failure whether the solution functions technically or not.  Many solutions are too expensive but would not fail, choosing the right solution always involves getting the right price for the resultant situation.  That is just the nature of IT in business.

Using this phrase can sound reasonable to the irrational, defense brain.  But to outsiders looking in with rational views it actually sounds like “well, I got away with…” fill in the blank: “wasting money”, “being risky”, “not doing my due diligence”, “not doing my job”, or whatever the case may be.  And likely whatever you think should be filled in there will not be as bad as what others assume.

If you are attempted to justify past actions by saying “it worked for me” or by providing anecdotal evidence that shows nothing, stop and think carefully.  Give yourself time to calm down and evaluate your response.  Is is based on logic or irrational amygdala emotions?  Don’t be ashamed of having the reaction, everyone has it.  It cannot be escaped.  But learning how to deal with it can allow us to approach criticism and critique with an eye towards improvement rather than defense.  If we are defensive, we lose the value in peer review, which is so important to what we do as IT professionals.

Virtual Eggs and Baskets

In speaking with small business IT professionals, one of the key factors for hesitancy around deploying virtualization arises from what is described as “don’t put your eggs in one basket.”

I can see where this concern arises.  Virtualization allows for many guest operating systems to be contained in a single physical system which, in the event of a hardware failure, causes all guest systems residing on it to fail together, all at once.  This sounds bad, but perhaps it is not as bad as we would first presume.

The idea of the eggs and baskets idiom is that we should not put all of our resources at risk at the same time.  This is generally applied to investing, encouraging investors to diversify and invest in many different companies and types of securities like bonds, stocks, funds and commodities.  In the case of eggs (or money) we are talking about an interchangeable commodity.  One egg is as good as another.  A set of eggs are naturally redundant.

If we have a dozen eggs and we break six, we can still make an omelette, maybe a smaller one, but we can still eat.  Eating a smaller omelette is likely to be nearly as satisfying as a larger one – we are not going hungry in any case.  Putting our already redundant eggs into multiple baskets allows us to hedge our bets.  Yes, carrying two baskets means that we have less time to pay attention to either one so it increases the risk of losing some of the eggs but reduces the chances of losing all of the eggs.  In the case of eggs, a wise proposition indeed.  Likewise, a smart way to prepare for your retirement.

This theory, because it is repeated as an idiom without careful analysis or proper understanding, is then applied to unrelated areas such as server virtualization.  Servers, however, are not like eggs.  Servers, especially in smaller businesses, are rarely interchangeable commodities where having six working, instead of the usual twelve, is good enough.  Typically servers each play a unique role and all are relatively critical to the functioning of the business.  If a server is not critical then it is unlikely to be able to justify the cost of acquiring and maintaining itself in the first place and so would probably not exist.  When servers are interchangeable, such as in a large, stateless web farm or compute cluster, they are configured as such as a means to expanding capacity beyond the confines of a single, physical box and so fall outside the scope of this discussion.

IT services in a business are usually, at least to some degree, a “chain dependency.”  That is, they are interdependent and the loss of a single service may impact other services either because they are technically interdependent (such as a line of business application being dependent on a database) or because they are workflow interdependent (such as an office worker needing the file server working in order to provide a file which he needs to edit with information from an email while discussing the changes over the phone or instant messenger.)  In these cases, the loss of a single key service such as email, network authentication or file services may create a disproportionate loss of working ability.  If there are ten key services and one goes down, company productivity from an IT services perspective likely drops by far more than ten percent, possibly nearing one hundred percent in extreme cases.   This is not always true, in some unique cases workers are able to “work around” a lost service effectively, but this is very uncommon.  Even if people can remain working, they are likely far less productive than usual.

When dealing with physical servers, each server represents its own point of failure.  So if we have ten servers, we have ten times the likelihood of outage than if we had only one of those same servers.  Each server that we add brings with it its own risk.  If each failure has an outage factor of 2.5 – that is financially impacting the business for twenty five percent of revenue for, say, one day then our total average impact over a decade is the equivalent of two and a half total site outages.  I use the concept of factors and averages here to make this easy, determining the length of an average outage or impact of an average outage is not necessary as we only need to determine relative impact in this case to compare the scenarios.  It’s just a means of comparing cumulative outage financial impact of one event type compared to another without needing specific figures – this doesn’t help you determine what your spend should be, just relative reliability.

With virtualization we have the obvious ability to consolidate.  In this example we will assume that we can collapse all ten of these existing servers down into a single server.  When we do this we often trigger the “all our eggs in one basket” response.  But if we run some risk analysis we will see that this is usually just fear and uncertainty and not a mathematically supported risk.  If we assume the same risks as the example above our single server will, on average, incur just a single total site outage, once per decade.

Compare this to the first example which did the damage equivalent to two and a half total site outages – the risk of the virtualized, consolidated solution is only forty percent that of the traditional solution.

Now keep in mind that this is based on the assumption that losing some services means a financial loss greater than the strict value of the service that was lost, which is almost always the case.  Even if the service lost is no more than the loss of an individual service we are only at break even and need not worry.  In rare cases impact from losing a single system can be less than its “slice of the pie”, normally because people are flexible and can work around the failed system – like if instant messaging fails and people simple switch to using email until instant messaging is restore, but these cases are rare and are normally isolated to a few systems out of many with the majority of systems, say ERP, CRM and email, having disproportionally large impacts in the event of an outage.

So what we see here is that under normal circumstances moving ten services from ten servers to ten services on one server will generally lower our risk, not increase it – in direct contrast to the “eggs in a basket” theory.  And this is purely from a hardware failure perspective.  Consolidation offers several other important reliability factors, though, that can have a significant impact to our case study.

With consolidation we reduce the amount of hardware that needs to be monitored and managed by the IT department.  Fewer servers means that more time and attention can be paid to those that remain.  More attention means a better chance of catching issues early and more opportunity to keep parts on hand.  Better monitoring and maintenance leads to better reliability.

Possibly the most important factor, however, with consolidation is that there is significant cost savings and this, if approached correctly, can provide opportunities for improved reliability.  With the dramatic reduction in total cost for servers it can be tempting to continue to keep budgets tight and attempt to purely leverage the cost savings directly.   Understandable and for some businesses this may be the correct approach.  But it is not the approach that I would recommend when struggling against the notion of eggs and baskets.

Instead by applying a more moderate approach keeping significant cost savings but still spending more, relatively speaking, on a single server you can acquire a higher end (read: more reliable) server, use better parts, have on-site spares, etc.  The cost savings of virtualization can often be turned directly into increased reliability further shifting the equation in favor of the single server approach.

As I stated in another article, one brick house is more likely to survive a wind storm than either one or two straw houses.  Having more of something doesn’t necessarily make it the more reliable choice.

These benefits come purely from the consolidation aspect of virtualization and not from the virtualization itself.  Virtualization provides extended risk mitigation features separately as well.  System imaging and rapid restores, as well as restores to different hardware, are major advantages of most any virtualization platform.  This can play an important role in a disaster recovery strategy.

Of course, all of these concepts are purely to demonstrate that single box virtualization and consolidation can beat the legacy “one app to one server” approach and still save money – showing that the example of eggs and baskets is misleading and does not apply in this scenario.    There should be little trepidation in moving from a traditional environment directly to a virtualized one based on these factors.

It should be noted that virtualization can then extend the reliability of traditional commodity hardware providing mainframe-like failover features that are above and beyond what non-virtualized platforms are able to provide.  This moves commodity hardware more firmly into line with the larger, more expensive RISC platforms.  These features can bring an extreme level of protection but are often above and beyond what is appropriate for IT shops initially migrating from a non-failover, legacy hardware server environment.  High availability is a great feature but is often costly and very often unnecessary, especially as companies move from, as we have seen, relatively unreliable environments in the past to more reliable environments today.  Given that we have already increased reliability over what was considered necessary in the past there is a very good chance that an extreme jump in reliability is not needed now, but due to the large drop in the cost of high availability, it is quite possible that it will he cost justified where previously it could not be.

In the same vein, virtualization is often feared because it is seen as a new, unproven technology.  This is certainly untrue but there is an impression of this in the small business and commodity server space.  In reality, though, virtualization was first introduced by IBM in the 1960s and ever since then has been a mainstay of high end mainframe and RISC servers – those systems demanding the best reliability.  In the commodity server space virtualization was a larger technical challenge and took a very long time before it could be implemented efficiently enough to make it effective to use in the real world.  But even in the commodity server space virtualization has been available since the late 1990s and so is approximately fifteen years old today which is very far past the point of being a nascent technology – in the world of IT it is positively venerable.  Commodity platform virtualization is a mature field with several highly respected, extremely advanced vendors and products.  The use of virtualization as a standard for all or nearly all server applications is a long established and accepted “enterprise pattern” and one that now can easily be adopted by companies of any and every size.

Virtualization, perhaps counter-intuitively, is actually a very critical component of a reliability strategy.  Instead of adding risk, virtualization can almost be approached as a risk mitigation platform – a toolkit for increasing the reliability of your computing platforms through many avenues.

What Windows 8 Means for the Datacenter

Talk around Microsoft’s new upcoming desktop operating, Windows 8, centers almost completely on its dramatically departing Metro User Interface, borrowed from the Windows Phone which, in turn, borrowed it from the ill-fated Microsoft Zune.  Apparently Microsoft believes that the third time is the charm when it comes to Metro.

To me the compelling story of Windows 8 comes not in the fit and finish but the under the hood rewiring that hint at a promising new future for the platform.  In the past Microsoft has attempted shipping Windows Server OS on some alternative architectures including, for those who remember, the Digital Alpha processor and more recently the Intel Itanium.  In these previous cases, the focus was on the highest end Microsoft platforms being run on hardware above and beyond what the Windows world normally sees.

Windows 8 promises to tackle the world of multiple architectures in a completely different way – starting with the lowest end operating system and focusing on a platform that is lighter and less powerful than the typical Intel or AMD offering, the low power ARM RISC architecture with the newly named Windows RT (previously WoA, Windows on ARM.)

The ARM architecture is making its headlines as Microsoft attempts to drive deep into handheld and low power devices.  Windows RT could signal a unification between the Windows desktop codebase and the mobile smartphone codebase down the road.  Windows RT could mean strong competition from Microsoft in the handheld tablet market where the iPad dominates so completely today.  Windows RT could be a real competitor to the Android platforms.

Certainly, as it stands today, Windows RT has a lot of potential to be really interesting, if not quite disruptive, with where it will stand upon release.  But I think that the interesting story lies beneath the surface in what Windows RT can potentially mean for the datacenter.  What might Microsoft have in store for us in the future?

The datacenter today is moving in many directions.  Virtualization is one driving factor as are low power server options such as Hewlett-Packard’s Project Moonshot which is designed to bring ARM-based, low power consumption servers into high end, horizontally scaling datacenter applications.

Currently, today, the number of server operating systems available to run on ARM servers, like those coming soon from HP, are few and far between and are mostly only available from the BSD family of operating systems.  The Linux community, for example, is scrambled to assemble even a single, enterprise-supported ARM-based distribution and it appears that Ubuntu will be the first out of the gate there.  But this paucity of server operating systems on ARM leaves an obvious market gap and one that Microsoft may be well thinking of filling.

Windows Server on ARM could be a big win for Microsoft in the datacenter.  A lower cost offering broadening their platform portfolio without the need for heavy kernel reworking since they are already providing this effort for the kernel on their handheld devices.  This could be a significant push for Windows into the growingly popular green datacenter arena where ARM processors are expected to play a central role.

Microsoft has long fought to gain a foothold in the datacenter and today is as comfortable there as anyone but Windows Servers continue to play in a segregated world where email, authentication and some internal applications are housed on Windows platforms but the majority of heavy processing, web hosting, storage and other roles are almost universally given to UNIX family members.  Windows’ availability on the ARM platform could push it to the forefront of options for horizontally scaling server forms like web servers, application servers and other tasks which will rise to the top of the ARM computing pool – possibly even green high performance compute grids.

ARM might mean exciting things for the future of the Windows Server platform, probably at least one, if not two releases out.  And, likewise, Windows might mean something exciting for ARM.

When No Redundancy Is More Reliable – The Myth of Redundancy

Risk in a difficult concept and it requires a lot of training, thought and analysis to properly assess given scenarios.  Often, because risk assessments are so difficult, we substitute risk analysis with simply adding basic redundancy and assuming that we have appropriately mitigated risk.  But very often this is not the case.  The introduction of complexity or additional failure modes often accompany the addition of redundancy and these new forms of failure have the potential to add more risk than the added redundancy removes.  Storage systems are especially prone to these decision processes which is unfortunate as few, if any, systems are so susceptible to failure and more important to protect.

RAID is a great example of where a lack of holistic risk thinking can lead to some strange decision making.  If we look at a not uncommon scenario we will see where the goal of protecting against drive failure can actually lead to an increase in risk even when additional redundancy is applied.  In this scenario we will compare a twelve drive array consisting of twelve three terabyte SATA hard drives in a single array.  It is not uncommon to hear of people choosing RAID 5 for this scenario to get “maximum capacity and performance” while having “adequate protection against failure.”

The idea here is that RAID 5 protects against the loss of a single drive which can be replaced and the array will rebuild itself before a second drive fails.  That is great in theory, but the real risks of an array of this size, thirty six terabytes of drive capacity, come not from multiple drive failures as people generally suspect but from an inability to reliably rebuild the array after a single drive failure or from a failure of the array itself with no individual drives failing.  The risk of a second drive failing is low, not non-existent, but quite low.  Drives today are highly reliable. Once one drives fails it does increase the likelihood of a second drive failing, which is well documented, but I don’t want this risk to mislead us from looking at the true risks – the risk of a failed resilvering operation.

What happens that scares us during a RAID 5 resilver operation is that an unrecoverable read error (URE) can occur.  When it does the resilver operation halts and the array is left in a useless state – all data on the array is lost.  On common SATA drives the rate of URE is 10^14, or once every twelve terabytes of read operations.  That means that a six terabyte array being resilvered has a roughly fifty percent chance of hitting a URE and failing.  Fifty percent chance of failure is insanely high.  Imagine if your car had a fifty percent chance of the wheels falling off every time that you drove it.  So with a small (by today’s standards) six terabyte RAID 5 array using 10^14 URE SATA drives, if we were to lose a single drive, we have only a fifty percent chance that the array will recover assuming the drive is replaced immediately.  That doesn’t include the risk of a second drive failing, only the risk of a URE failure.  It also assumes that the drive is completely idle other than the resilver operation.  If the drives are busily being used for other tasks at the same time then the chances of something bad happening, either a URE or a second drive failure, begin to increase dramatically.

With a twelve terabyte array the chances of complete data loss during a resilver operation begin to approach one hundred percent – meaning that RAID 5 has no functionality whatsoever in that case.  There is always a chance of survival, but it is very low.  At six terabytes you can compare a resilver operation to a game of Russian roulette with one bullet and six chambers and you have to pull the trigger three times.  With twelve terabytes you have to pull it six times!  Those are not good odds.

But we are not talking about a twelve terabyte array.  We are talking about a thirty six terabyte array – which sounds large but this is a size that someone could easily have at home today, let alone in a business.  Every major server manufacturer, as well as nearly all low cost storage vendors, make sub $10,000 storage systems in this capacity range today.  Resilvering a RAID 5 array with a single drive failure on a thirty six terabyte array is like playing Russian roulette, one bullet, six chambers and pulling the trigger eighteen times!  Your data doesn’t stand much of a chance.  Add to that the incredible amount of time needed to resilver an array of that size and the risk of a second disk failing during that resilver window starts to become a rather significant threat.  I’ve seen estimates of resilver times climbing into weeks or months on some systems.  That is a long time to run without being able to lose another drive.  When we are talking hours or days the risks are pretty low, but still present.  When we are talking weeks or months of continuous abuse, as resilver operations are extremely drive intensive, the failure rates climb dramatically.

With an array of this size we can effectively assume that the loss of a single drive means the loss of the complete array leaving us with no drive failure protection at all.  Now if we look at a drive of the same or better performance with the same or better capacity under RAID 0, which also has no protection against drive loss, we need only use eleven of the same drives that we needed twelve of for our RAID 5 array.  What this means is that instead of twelve hard drives, each of which has a roughly three percent chance of annual failure, we have only eleven.  That alone makes our RAID 0 array more reliable as there are fewer drives to fail.  Not only do we have fewer drives but there is no need to write the parity block nor skip parity blocks when reading back lowering, ever so slightly, the mechanical wear and tear on the RAID 0 array for the same utilization giving it a very slight additional reliability edge.  The RAID 0 array of eleven drives will be identical in capacity to the twelve drive RAID 5 array but will have slightly better throughput and latency.  A win all around.  Plus the cost savings of not needing an additional drive.

So what we see here is that in large arrays (large in capacity, not in spindle count) that RAID 0 actually passes RAID 5 in certain scenarios.  When using common SATA drives this happens at capacities experienced even by power users at home and by many small businesses.  If we move to enterprise SATA drives or SAS drives then the capacity number where this occurs becomes very high and is not a concern today but will be in just a few years when drive capacities get larger still.  But this highlights how dangerous RAID 5 is in the sizes that we see today.  Everyone understands the incredible risks of RAID 0 but it can be difficult to put into perspective that RAID 5’s issues are so extreme that it might actually be less reliable than RAID 0.

That RAID 5 might be less reliable than RAID 0 in an array of this size based on resilver operations alone is just the beginning.  In a massive array like this the resilver time can take so long and exact such a toll on the drives that second drive failure starts to become a measurable risk as well.  And then there are additional risks caused by array controller errors that can utilize resilver algorithms to destroy an entire array even when no drive failure has occurred.  As RAID 0 (or RAID 1 or RAID 10) do not have resilver algorithms they do not suffer this additional risk.  These are hard risks to quantify but what is important is that they are additional risks that accumulate when using a more complex system when a simpler system, without the redundancy, was more reliable from the outset.

Now that we have established that RAID 5 can be less reliable than RAID 0 I will point out the obvious dangers of RAID 0.  RAID in general is used to mitigate the risk of a single, lone hard drive failing.  We all fear a single drive simply failing and all data being lost.  RAID 0, being a large stripe of drives without any form of redundancy, takes the risk of data loss of a single drive failing and multiplies it across a number of drives where any drive failing causes total loss of data to all drives.  So in our eleven disk example above, if any of the eleven disks fails all is lost.  It is clear to see where this is dramatically more dangerous than just using a single drive, all alone.

What I am trying to point out here is that redundancy does not mean reliability.  Just because something is redundant, like RAID 5, provides no guarantee that it will always be more reliable than something that is not redundant.

My favourite analogy here is to look at houses in a tornado.  In one scenario we build a house of brick and mortar.  On the second scenario we build two redundant house, east out of straw (our builders are pigs, apparently.)  When the tornado (or big bad wolf) comes along which is more likely to leave us with a standing house?  Clearing one brick and mortar house has some significant reliability advantages over redundant straw houses.  Redundancy didn’t matter, reliability mattered in the end.

Redundancy is often misleading because it is easy to quantify but hard to qualify.  Redundancy is a black or white question: Is it redundant?  Yes or no.  Simple.  Reliability is not so simple.  Reliability is about failure rates and likelihoods.  It is about statistics and analysis.  As it is hard to quantify reliability in a meaningful way, especially when selling a project to the business people, redundancy often becomes a simple substitute for this complex concept.

The concept of using redundancy to misdirect questions of reliability also ends up applying to subsystems in very convoluted ways.  Instead of making a “system” redundant it has become common to make a highly reliable, and low cost, subsystem redundant and treat subsystem redundancy as applying to the whole system.  The most common example of this is RAID controllers in SAN products.  Rather than having a redundant SAN (meaning two SANs) manufacturers will often make that one component not often redundant in normal servers redundant  and then calling the SAN redundant – meaning a SAN that contains redundancy, which is not at all the same thing.

A good analogy here would be to compare having redundant cars meaning two complete, working cars and having a single car with a spare water pump in the trunk in case the main one fails.  Clearly, a spare water pump is not a bad thing.  But it is also a trivial amount of protection against car failure compared to having a second car ready to go.  In one case the entire system is redundant, including the chassis.  In the other we are making just one, highly reliable component redundant inside the chassis.  It’s not even on par with having a spare tire which, at least, is a car component with a higher likelihood of failure.

Just like the myth of RAID 5 reliability and system/subsystem reliability, shared storage technologies like SANs and NAS often get treated in the same way, especially in regards to virtualization.  There is a common scenario where a virtualization project is undertaken and people instinctively panic because a single virtualization host represents a single point of failure where, if it fails, many systems will all fail at once.

Using the term “single point of failure” causes a panic feeling and is a great means of steering a conversation.  But a SPOF, as we like to call it, while something we like to remove when possible may not be the end of the world.  Think about our brick house.  It is a SPOF.  Our two houses of straw are not.  Yet a single breeze takes out our redundant solutions faster than our reliable SPOF.  Looking for SPOFs is a great way to find points of fragility in a system, but do not feel that every SPOF must be made redundant in every scenario.  Most businesses will find their best value having many SPOFs in place.  Our real goal is reliability at appropriate cost, redundancy, as we have seen, is no substitute for reliability, it is simply a tool that we can use to achieve reliability.

The theory that many people follow when virtualizing is that they take their virtualization host and say “This host is a SPOF, so I need to have two of them and use High Availability features to allow for transparent failover!”  This is spurred by the leading virtualization vendor making their money firstly by selling expensive HA add on products and secondly by being owned by a large storage vendor – so selling unnecessary or even dangerous additional shared storage is a big monetary win for them and could easily be the reason that they have championed the virtualization space from the beginning.  Redundant virtualization hosts with shared storage sounds great but can be extremely misguided for several reasons.

The first reason is that removing the initial SPOF, the virtualization host, is replaced with a new SPOF, the shared storage.  This accomplishes nothing.  Assuming that we are using comparable quality servers and shared storage all we’ve done is move where the risk is, not change how big it is.  The likelihood of the storage system failing is roughly equal to the likelihood of the original server failing.  But in addition to shuffling the SPOF around like in a shell game we’ve also done something far, far worse – we have introduced chained or cascading failure dependencies.

In our original scenario we had a single server.  If the server stayed working we are good, if it failed we were not.  Simple.  Now we have two virtualization hosts, a single storage server (SAN, NAS, whatever) and a network connecting them together.  We have already determined that the risk of the shared storage failing is approximately equal to our total system risk in the original scenario.  But now we have the additional dependencies of the network and the two front end virtualization nodes.  Each of these components is more reliable than the fragile shared storage (anything with mechanical drives is going to be fragile) but that they are lower risk is not the issue, the issue is that the risks are combinatorial.

If any of these three components (storage, network or the front end nodes) fail then everything fails.  The solution to this is to make the shared storage redundant on its own and to make the network redundant on its own.  With enough work we can overcome the fragility and risk that we introduced by adding shared storage but the shared storage on its own is not a form of risk mitigation but is a risk itself which must be mitigated.  The spiral of complexity begins and the cost associated with bringing this new system up on par with the reliability of the original, single server system can be astronomic.

Now that we have all of this redundancy we have one more risk to worry about.  Managing all of this redundancy, all of these moving parts, requires a lot more knowledge, skill and preparation than does managing a simple, single server.  We have moved from a simple solution to a very complex one.  In my own anecdotal experience the real dangers of solutions like this come not from the hardware failing but from human error.  Not only has little been done to avoid human error causing this new system to fail but we’ve added countless points where a human might accidentally bring the entire system, redundancy and all, right down.  I’ve seen it first hand; I’ve heard the horror stories.  The more complex the system the more likely a human is going to accidentally break everything.

It is critical that as IT professionals that we step back and look at complete systems and consider reliability and risk and think of redundancy simply as a tool to use in the pursuit of reliability.  Redundancy itself is not a panacea.  Neither is simplicity.  Reliability is a complex problem to tackle.  Avoiding simplistic replacements is an important first step in moving from covering up reliability issues to facing and solving them.