Tag Archives: raid

Hardware and Software RAID

RAID (Redundant Array of Inexpensive Disks) systems are implemented in one of two basic ways: in software or on dedicated hardware.  Both methods are very viable and have their own merits.

In the small business space, where Intel and AMD architecture systems and Windows operating systems rule, hardware RAID is so common that a great deal of confusion has arisen around software RAID.  As we will see, this is due in no small part to the wealth of scam software RAID products touted as dedicated hardware and known colloquially as “Fake RAID.”

When RAID was first developed it was used, in software, on high end enterprise servers running things like proprietary UNIX, where the systems were extremely stable and the hardware was powerful and robust, making software RAID work very well.  Early RAID was focused primarily on mirrored RAID or very simple parity RAID (like RAID 2), which had little overhead.

As the need for RAID began to spill into the smaller server space, and as parity RAID grew in popularity and demanded greater processing power, the underpowered processors of the x86 space became significantly impacted by the processing load of RAID, especially RAID 5.  This, combined with the fact that almost none of the operating systems heavily used on these platforms had software RAID implementations, led to the natural development of hardware RAID – an offload processor board (similar to a GPU for graphics) that is a complete computer in its own right, with its own CPU, memory and firmware.

Hardware RAID worked very well at solving the RAID overhead problem in the x86 server space.  As CPUs gained more power and memory became less scarce, popular x86 operating systems like Windows Server began to offer software RAID options.  Windows software RAID in particular was known as a poor implementation and was available only on server operating system versions, producing a lack of appreciation for software RAID in the community of system administrators working primarily with Windows.

Because of these historical implementations in the enterprise server space and the commodity x86 space, a natural separation arose between the two markets, supported initially by technology and later purely by ideology.  If you talk to a system administrator in the commodity space you will almost universally hear that hardware RAID is the only option.  Conversely, if you talk to a system administrator in the mainframe, RISC (Sparc, Power, ARM) or EPIC (Itanium) server space (sometimes called the UNIX server space) you will often be met with surprise, as hardware RAID isn’t even available for those classes of systems; software RAID is simply a foregone conclusion.  Neither camp seems to have real knowledge of the situation in the other, and crossover in skill sets between the two was relatively rare until recently, as enterprise UNIX platforms like Linux, Solaris and FreeBSD have become very popular and well understood on commodity hardware platforms.

To make matters more confusing for the commodity server space, a large number of vendors began selling non-RAID controller cards bundled with a “driver” that was actually software RAID, pretending that the resulting product was hardware RAID.  They did this to fill the vacuum left by the dominant operating system vendor’s lack of software RAID in its non-server products while marketing to a less technically savvy audience.  This created a large amount of confusion at best and an incredible disdain for software RAID at worst; almost universally, any product whose core function is to protect data but whose market is built upon deception and confusion will end in disaster.  Fake RAID systems routinely have issues with performance and reliability.  While, in theory, a third party software RAID package is a reasonable option, the reality of the market is that essentially all quality software RAID implementations are native components of either the operating system itself (Linux, Mac OSX, Solaris, Windows) or of the filesystem (ZFS, VxFS, BtrFS) and are provided and maintained by primary vendors, leaving little room or purpose for third party products outside of the Windows desktop space, where a few small, legitimate software RAID players do exist but are often overshadowed by the Fake RAID players.

Today there is almost no need for hardware RAID, as commodity platforms are incredibly powerful and almost always have a dramatic excess of both computational and memory resources.  Hardware RAID now competes mostly on features rather than on reducing resource load.  Selecting hardware versus software RAID in the commodity server space is almost completely a matter of preference and market momentum rather than of specific performance or features; the two approaches are essentially equal, and individual implementations matter far more than whether the approach is hardware or software.

Today hardware RAID offerings tend to be more “generic,” with rather vanilla implementations of standard RAID levels.  Hardware RAID tends to earn its value through resource utilization reduction (CPU and memory offload), the ability to “blind swap” failed drives, simplified storage management, block level storage abstracted agnostically from the operating system, fast cache close to the drives and battery or flash backed cache.  Software RAID tends to earn its value through lower power consumption, lower cost of acquisition, management integrated with the operating system, unique or advanced RAID features (such as ZFS’ RAIDZ, which does not suffer from the standard RAID 5 write hole) and generally better overall performance.  It is truly not a discussion of better or worse in the abstract but of better or worse for a very specific situation, with the most important factor often being familiarity and comfort and/or the default vendor offering.

One of the most overlooked but important differentiators between hardware and software RAID is the change in the job role associated with RAID array management.  Hardware RAID moves the handling of the array to the server administrator (the support role that works on the physical server and is stationed in the datacenter) whereas software RAID moves the handling of the array to the system administrator (the support role working on the operating system and above, who rarely sits in the datacenter.)  In the SMB market this factor might be completely overlooked, but in a Fortune 500 the difference in job role can be very significant.  In many cases with hardware RAID, disk replacements and system setup can be done without any system administrator intervention.  Datacenter server administrators can discover failed drives either through alerts or by looking for “amber lights” during walkthroughs and do replacements on the fly without needing to contact anyone or know what the server is even running.  Software RAID almost always requires the system administrator to be involved: offlining the failed disk, coordinating the replacement with the datacenter and onlining the new disk once the replacement is complete.

Because of the way that CPU offloading and performance work, and because of some advantages in the way that non-standard RAID implementations often handle parity RAID reconstruction, there is a tendency for mirrored RAID levels to favor hardware RAID and for parity RAID levels to favor software RAID.  Parity RAID is drastically more CPU intensive, so having access to the powerful central CPU resources can be a major factor in speeding up RAID calculations.  With mirrored RAID, where reconstruction is far safer than with parity RAID and where automated rebuilds are therefore more valuable, hardware RAID brings the benefit of allowing blind drive replacement very easily.

One aspect of the hardware and software RAID discussion that is extremely paradoxical is that the market that often dismisses software RAID out of hand as being inferior to hardware RAID overlaps almost completely (you can picture the Venn diagram in your head here) with the market that feels that file servers are inferior to commodity NAS appliances, yet those NAS appliances in the SMB range are almost universally based on the same software RAID implementations being casually dismissed.  So software RAID is often considered both inferior and superior simultaneously.  Some NAS devices in the SMB range, and NAS appliance software packages, that are software RAID based include: Netgear ReadyNAS, Netgear ReadyData, Buffalo Terastation, QNAP, Synology, OpenFiler, FreeNAS, Nexenta and NAS4Free.

There is truly no “always use one way or the other” with hardware and software RAID.  Even giant, six figure enterprise NAS and SAN appliances are undecided as to which to use with part of the industry going each direction.  The real answer is it depends on your specific situation – your job role separation, your technical needs, your experience, your budget, etc.  Both options are completely viable in any organization.

Hot Spare or a Hot Mess

A common approach to adding a layer of safety to RAID is to have spare drive(s) available so that replacement time for a failed drive is minimized.  The most extreme form of this is referred to as having a “hot spare”: a spare drive actually sitting in the array but unused until the array detects a drive failure, at which time the system automatically disables the failed drive and enables the hot spare, the same as if a human had popped the one drive out of the array and popped in the other, allowing a resilver operation (a rebuilding of the array) to begin as soon as possible.  This can bring the time to swap in a new drive from hours or days down to seconds and, in theory, can provide an extreme increase in safety.

First, I’d like to address what I personally feel is a mistake in the naming conventions. What we refer to as a hot spare should, I believe, actually be called a warm spare because it is sitting there ready to go but does not contain the necessary data to be used immediately.  A spare drive stored outside of the chassis, one that requires a human to step in and swap the drives manually, would be a cold spare.  To truly be a hot spare a drive should be full of data and, therefore, would be a participatory member of the RAID array in some capacity.  Red Hat has a good article on how this terminology applies to disaster recovery sites for reference.  This differentiation is important because what we call a hot spare does not already contain data and does not immediately step in to replace the failed drive but instead steps in to immediately begin the process of restoring the lost drive – a critical differentiation.

In order to keep concepts clear, from here on out I will refer to what vendors call hot spares as “warm spares.”  This will make sense in short order.

There are two main concerns with warm spares.  The first is the ineffectual nature of the warm spare in most use cases and the second is the “automated array destruction” risk.

Most people approach the warm spare concept as a means of mitigating the high risk of secondary drive failure on a parity RAID 5 array.  RAID 5 arrays protect only against the failure of a single disk within the array.  Once a single disk has failed the array is left with no form of parity and any additional drive failure results in the total loss of the array.  RAID 5 is chosen because it is very low cost for the given capacity and sacrifices reliability in order to achieve this cost effectiveness.   Because RAID 5 is therefore risky in comparison to other RAID options, such as RAID 6 or RAID 10, it is common to implement a warm spare in order to minimize the time that the array is left in a degraded state allowing the array to begin resilvering itself as quickly as possible.

The more relevant takeaway here is that warm spares are generally used as a buffer against using less reliable RAID array types as a cost saving measure.  Warm spares are dramatically more common in RAID 5 arrays, followed by RAID 6 arrays, both of which are chosen over RAID 10 for cost per capacity, not for reliability or performance.  There is one case where the warm spare idea truly does make sense for added reliability, and that is RAID 10 with a warm spare, but we will come to that.  Outside of that scenario I feel that warm spares make little sense in the real world.

We will start by examining RAID 1 with a warm spare.  RAID 1 consists of two drives, or more, in a mirror.  Adding a warm spare is nice in that if one of the mirrored pair dies the warm spare will immediately begin mirroring the remaining drive and you will be protected again in short order.  That is wonderful, except for one minor flaw: instead of sitting as a warm spare, that same drive could have been added to the RAID 1 array all along as a tertiary mirror.  In this tertiary mirror capacity the drive would have added to the overall performance of the array, giving a nearly fifty percent read performance boost with write performance staying level, and would have provided instant protection in case of a drive failure rather than “as soon as it remirrors” protection.  Basically it would have been a true “hot spare” rather than a warm spare.  So without spending a penny more the system would have had better drive array performance and better reliability, simply by having the extra drive in a hot “in the array” capacity rather than sitting warm and idle waiting for disaster to strike.

With RAID 5 we see an even more dramatic warning against the warm spare concept, and it is here that warm spares are more common than anywhere else.  RAID 5 is single parity RAID with the ability to rebuild, using the parity, any one drive in the array that fails.  This is where the real problems begin.  Unlike in RAID 1, where a remirroring operation might be quite quick, a RAID 5 resilver (rebuild) has the potential to take quite a long time.  The warm spare will not assist in protecting the array until this resilver process completes successfully, which is commonly many hours, easily days, and possibly weeks or months depending on the size of the array and how busy it is.  If we took that same warm spare drive and instead tasked it with being a member of the array carrying an additional parity stripe, we would achieve RAID 6.  The same set of drives that we have for RAID 5 plus a warm spare would create a RAID 6 array of the exact same capacity.  Again, like the RAID 1 example above, this would be much like having a hot spare, where the drive is participating in the array with live data rather than sitting idly by waiting for another drive to fail before kicking in.  In this capacity the array degrades to a RAID 5 equivalent in case of a failure but without any rebuild time, so the additional drive is useful immediately rather than only after a possibly very lengthy resilver process.  So for the same money and the same capacity, setting up the drives as RAID 6 rather than RAID 5 plus a warm spare is a complete win.

We can continue this example with RAID 6 plus a warm spare.  This one is a little less easy to define because in most RAID systems, except for the somewhat uncommon RAIDZ3 from ZFS, there is no triple parity system available one step above RAID 6 (imagine if there were a RAID 7, for example.)  If there were, the exact argument made for RAID 5 plus warm spare would apply to RAID 6 plus warm spare.  In the majority of cases RAID 6 with a warm spare must justify itself against a RAID 10 array.  RAID 10 is more performant and far more reliable than a RAID 6 array, but RAID 6 is generally chosen to save money in comparison to RAID 10, and to offset RAID 6’s fragility warm spares are sometimes employed.  In some cases, such as a small five disk RAID 6 array with a warm spare, this is dollar for dollar equivalent to a six disk RAID 10 array without a warm spare.  In larger arrays the cost benefit of RAID 6 does become apparent, but the larger the cost savings the larger the risk differential, as parity RAID systems accumulate risk with array size much more quickly than mirror based RAID systems like RAID 10 do.  Any money saved today is saved at the risk of outage or data loss tomorrow.
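
To make the capacity arithmetic behind the last two paragraphs concrete, here is a minimal sketch using the standard usable-capacity formulas ((n-1)*s for RAID 5, (n-2)*s for RAID 6, n*s/2 for RAID 10); the three terabyte drive size is an arbitrary assumption of mine:

```python
# Quick check of the capacity equivalences argued above. The 3 TB drive
# size is arbitrary; only the ratios matter.

s = 3  # terabytes per drive

# Six drives: RAID 5 across five of them plus one warm spare,
# versus RAID 6 across all six.
raid5_plus_spare = (5 - 1) * s      # the spare adds nothing until a failure
raid6_all_six = (6 - 2) * s
print(raid5_plus_spare, raid6_all_six)   # 12 12 -> same usable capacity

# Six drives: RAID 6 across five of them plus one warm spare,
# versus RAID 10 across all six.
raid6_plus_spare = (5 - 2) * s
raid10_all_six = 6 * s // 2
print(raid6_plus_spare, raid10_all_six)  # 9 9 -> same usable capacity, same drive cost
```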

Where a warm spare does come into play effectively is in a RAID 10 array.  Here a warm spare rebuild is a mirror rebuild, as in RAID 1, and carries no parity risks, and there is no logical extension of RAID above RAID 10 from which we would be trying to save money by choosing a more fragile system.  Adding a warm spare here may make sense for critical arrays because there is no more cost effective way to gain the same additional reliability.  However, RAID 10 is so reliable without a warm spare that any shop contemplating RAID 5 or RAID 6 with a warm spare would logically stop at simple RAID 10, having already surpassed the reliability it was considering settling for previously.  So only shops not considering those more fragile systems, and looking for the most robust possible option, would logically look to RAID 10 plus a warm spare as their solution.

Just for technical accuracy, RAID 10 can be expanded for better read performance and a dramatic improvement in reliability (but with a fifty percent cost increase) by moving to three disk RAID 1 mirrors in its RAID 0 stripe rather than the standard two disk RAID 1 mirrors, just as we showed in our RAID 1 example.  This is a level of reliability seldom sought in the real world but it does exist and is an option.  Normally it is curtailed by drive count limitations in physical array chassis, as well as by competing poorly against building a completely separate secondary RAID 10 array in a different chassis and then mirroring the two at a higher level, effectively creating RAID 101 – which is the effective result of common, high end storage array clusters today.

Our second concern is that of “automated array destruction.”  This applies only to the parity RAID scenarios of RAID 5 and RAID 6 (or the rare RAID 2, RAID 3, RAID 4 and RAIDZ3.)  With the warm spare concept, the idea is that when a drive fails the warm spare is automatically and instantly swapped in by the array controller and the process of resilvering the array begins immediately.  If resilvering were a completely reliable process this would obviously be highly welcome.  The reality is, sadly, quite different.

During a resilver process a parity RAID array is at risk of Unrecoverable Read Errors (UREs) cropping up.  If a URE occurs during a single parity RAID resilver (that is, RAID 2 through 5) then the resilvering process fails and the array is lost completely.  This is critical to understand because no additional drive has failed.  Had the warm spare not been present, the resilvering would not have commenced and the data would still be intact and available – just not as quickly as usual and at the small risk of secondary drive failure.  URE rates are very high with today’s large drives, and with large arrays the risks can become so high as to move from “possible” to “expected” during a standard resilvering operation.

So in many cases the warm spare itself might actually be the trigger for the loss of data rather than the savior of the data as expected.  An array that would have survived might be destroyed by the resilvering process before the human who manages it is even alerted to the first drive having failed.  Had a human been involved they could have, at the very least, taken the step to make a fresh backup of the array before kicking off the resilver knowing that the latest copy of the data would be available in case the resilver process was unsuccessful.  It would also allow the human to schedule when the resilver should begin, possibly waiting until business hours are over or the weekend has begun when the array is less likely to experience heavy load.

Dual and triple parity RAID (RAID 6 and RAIDZ3 respectively) share the URE risk as well, as they too are based on parity.  They mitigate the risk through their additional levels of parity and do so successfully for the most part.  The risk still exists, especially in very large RAID 6 arrays, but for the next several years it remains generally quite low for the majority of storage arrays, until far larger spindle-based storage media is available on the market.

The biggest problem with parity RAID and the URE risk is that the driver towards parity RAID (a willingness to face additional data integrity risk in order to lower cost) is the same driver that introduces heightened URE risk (purchasing lower cost, non-enterprise SATA hard drives.)  Shops choosing parity RAID generally do so with large, low cost SATA drives, bringing two very dangerous factors together for an explosive combination.  Using non-parity RAID 1 or RAID 10 will completely eliminate the issue, and using highly reliable enterprise SAS drives will drastically reduce the risk factor by an order of magnitude (not an expression: it is actually a change of one order of magnitude.)

Additionally during resilver operations it is possible for performance to degrade on parity systems so drastically as to equate to a long-term outage.  The resilver process, especially on large arrays, can be so intensive that end users cannot differentiate between a completely failed array and a resilvering array.  In fact, resilvering at its extreme can take so long and be so disruptive that the cost to the business can be higher than if the array had simply failed completely and a restore from backup had been done instead.  This resilver issue does not affect RAID 1 and RAID 10, again, because they are mirrored, not parity, RAID systems and their resilver process is trivial and the performance degradation of the system is minimal and short lived.  At its most extreme, a parity resilver could take weeks or months during which time the systems act as though they are offline – and at any point during this process there is the potential for the URE errors to arise as mentioned above which would end the resilver and force the restore from backup anyway.  (Typical resilvers do not take weeks but do take many hours and to take days is not at all uncommon.)
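
Some rough arithmetic shows why “many hours to days” is the normal range; the rebuild rates in this sketch are illustrative assumptions of mine, not measurements, since real rates depend on the controller or software, the RAID level and how busy the array is:

```python
# Very rough resilver-time arithmetic under assumed rebuild rates.

def resilver_hours(rebuild_tb, mb_per_sec):
    seconds = (rebuild_tb * 1e12) / (mb_per_sec * 1e6)
    return seconds / 3600

for rate in (100, 20, 5):   # MB/s: idle array, moderate load, heavy load
    print(f"rebuilding a 3 TB drive at {rate:>3} MB/s takes about "
          f"{resilver_hours(3, rate):6.1f} hours")
```

At an assumed 100 MB/s the rebuild of a single 3 TB drive takes roughly eight hours; under heavy load at an effective 5 MB/s it stretches to about a week.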

Our final overview can be broken down to the following (using the conventional term “hot spare” again): RAID 10 without a “hot spare” is almost always a better choice than RAID 6 with a “hot spare.”  RAID 6 without a “hot spare” is always better than RAID 5 with a “hot spare.”  RAID 1 with an additional mirror member is always better than RAID 1 with a “hot spare.”  So whatever RAID level with a hot spare you decide upon, simply move up one level of RAID reliability and drop the “hot spare” to maximize both performance and reliability for equal or nearly equal cost.

Warm spares, like parity RAID, had their day in the sun.  In fact it was when parity RAID still made sense for widespread use – when URE errors were unlikely and disk costs were high – that warm spare drives made sense as well.  They were well paired; when one made sense the other often did too.  What is often overlooked is that as parity RAID, especially RAID 5, has lost effectiveness it has pulled the warm spare along with it in unexpected ways.

When No Redundancy Is More Reliable – The Myth of Redundancy

Risk is a difficult concept and it requires a lot of training, thought and analysis to properly assess a given scenario.  Often, because risk assessments are so difficult, we substitute risk analysis with simply adding basic redundancy and assuming that we have appropriately mitigated risk.  But very often this is not the case.  Added complexity and additional failure modes often accompany the addition of redundancy, and these new forms of failure have the potential to add more risk than the added redundancy removes.  Storage systems are especially prone to these decision processes, which is unfortunate as few, if any, systems are as susceptible to failure or as important to protect.

RAID is a great example of where a lack of holistic risk thinking can lead to some strange decision making.  If we look at a not uncommon scenario we will see where the goal of protecting against drive failure can actually lead to an increase in risk even when additional redundancy is applied.  In this scenario we will consider a single array of twelve three terabyte SATA hard drives.  It is not uncommon to hear of people choosing RAID 5 for this scenario to get “maximum capacity and performance” while having “adequate protection against failure.”

The idea here is that RAID 5 protects against the loss of a single drive, which can be replaced and the array will rebuild itself before a second drive fails.  That is great in theory, but the real risks of an array of this size, thirty six terabytes of drive capacity, come not from multiple drive failures as people generally suspect but from an inability to reliably rebuild the array after a single drive failure, or from a failure of the array itself with no individual drives failing.  The risk of a second drive failing is low – not non-existent, but quite low.  Drives today are highly reliable.  Once one drive fails it does increase the likelihood of a second drive failing, which is well documented, but I don’t want that risk to distract us from the true risk – the risk of a failed resilvering operation.

What scares us during a RAID 5 resilver operation is that an unrecoverable read error (URE) can occur.  When it does, the resilver operation halts and the array is left in a useless state – all data on the array is lost.  On common SATA drives the URE rate is one error per 10^14 bits read, or roughly once every twelve terabytes of read operations.  That means that a six terabyte array being resilvered has a roughly fifty percent chance of hitting a URE and failing.  A fifty percent chance of failure is insanely high.  Imagine if your car had a fifty percent chance of the wheels falling off every time that you drove it.  So with a small (by today’s standards) six terabyte RAID 5 array using 10^14 URE SATA drives, if we were to lose a single drive, we have only a fifty percent chance that the array will recover, assuming the drive is replaced immediately.  That doesn’t include the risk of a second drive failing, only the risk of a URE failure.  It also assumes that the drive is completely idle other than the resilver operation.  If the drives are busily being used for other tasks at the same time then the chances of something bad happening, either a URE or a second drive failure, begin to increase dramatically.
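
A back-of-the-envelope sketch of that math follows.  It assumes independent read errors and takes the 10^14 specification at face value, which yields slightly gentler numbers than the rough linear estimate above but paints the same overall picture:

```python
# URE risk model for a parity resilver. Assumptions: every bit that must
# be read is read exactly once, errors are independent, and the consumer
# SATA spec of one URE per 10^14 bits read is taken literally.

import math

URE_RATE = 1e-14        # one unrecoverable read error per 10^14 bits
BITS_PER_TB = 8e12      # 10^12 bytes * 8 bits

def p_ure_during_resilver(read_tb):
    """Probability of hitting at least one URE while reading read_tb terabytes."""
    expected_errors = read_tb * BITS_PER_TB * URE_RATE
    return 1 - math.exp(-expected_errors)

for tb in (6, 12, 33):  # 33 TB is what survives in a 12 x 3 TB RAID 5 after one loss
    print(f"{tb:>2} TB to read -> {p_ure_during_resilver(tb):.0%} chance of a fatal URE")
```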

With a twelve terabyte array the chances of complete data loss during a resilver operation begin to approach one hundred percent – meaning that RAID 5 has no functionality whatsoever in that case.  There is always a chance of survival, but it is very low.  At six terabytes you can compare a resilver operation to a game of Russian roulette with one bullet and six chambers and you have to pull the trigger three times.  With twelve terabytes you have to pull it six times!  Those are not good odds.

But we are not talking about a twelve terabyte array.  We are talking about a thirty six terabyte array – which sounds large, but this is a size that someone could easily have at home today, let alone in a business.  Every major server manufacturer, as well as nearly all low cost storage vendors, makes sub $10,000 storage systems in this capacity range today.  Resilvering a RAID 5 array with a single drive failure on a thirty six terabyte array is like playing Russian roulette with one bullet, six chambers and pulling the trigger eighteen times!  Your data doesn’t stand much of a chance.  Add to that the incredible amount of time needed to resilver an array of that size and the risk of a second disk failing during that resilver window starts to become a rather significant threat.  I’ve seen estimates of resilver times climbing into weeks or months on some systems.  That is a long time to run without being able to lose another drive.  When we are talking hours or days the risks are pretty low, but still present.  When we are talking weeks or months of continuous abuse, as resilver operations are extremely drive intensive, the failure rates climb dramatically.

With an array of this size we can effectively assume that the loss of a single drive means the loss of the complete array, leaving us with no drive failure protection at all.  Now if we build an array of the same or better performance and the same or better capacity under RAID 0, which also has no protection against drive loss, we need only eleven of the same drives that we needed twelve of for our RAID 5 array.  What this means is that instead of twelve hard drives, each of which has a roughly three percent chance of annual failure, we have only eleven.  That alone makes our RAID 0 array more reliable, as there are fewer drives to fail.  Not only do we have fewer drives but there is no need to write the parity block nor to skip parity blocks when reading back, lowering, ever so slightly, the mechanical wear and tear on the RAID 0 array for the same utilization and giving it a very slight additional reliability edge.  The RAID 0 array of eleven drives will be identical in capacity to the twelve drive RAID 5 array but will have slightly better throughput and latency.  A win all around.  Plus the cost savings of not needing an additional drive.
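
A rough comparison, reusing the URE model from the earlier sketch plus the roughly three percent annual drive failure rate mentioned above (both assumptions, treated as independent), shows just how close the two options really are:

```python
# Annual data-loss comparison: twelve 3 TB drives in RAID 5 versus eleven
# of the same drives in RAID 0, under assumed failure and URE rates.

import math

AFR = 0.03                      # assumed per-drive annual failure rate
URE_RATE, BITS_PER_TB = 1e-14, 8e12

def p_any_drive_fails(drives):
    return 1 - (1 - AFR) ** drives

def p_rebuild_hits_ure(read_tb):
    return 1 - math.exp(-read_tb * BITS_PER_TB * URE_RATE)

raid0_loss = p_any_drive_fails(11)                            # any failure is fatal
raid5_loss = p_any_drive_fails(12) * p_rebuild_hits_ure(33)   # a failure, then a failed rebuild

print(f"RAID 0, 11 drives: {raid0_loss:.1%} annual chance of losing the array")
print(f"RAID 5, 12 drives: {raid5_loss:.1%} annual chance from UREs alone, "
      f"before counting second drive failures")
```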

So what we see here is that in large arrays (large in capacity, not in spindle count) RAID 0 actually surpasses RAID 5 in reliability in certain scenarios.  When using common SATA drives this happens at capacities experienced even by power users at home and by many small businesses.  If we move to enterprise SATA drives or SAS drives then the capacity at which this occurs becomes very high and is not a concern today, but it will be in just a few years when drive capacities get larger still.  This highlights how dangerous RAID 5 is at the sizes that we see today.  Everyone understands the incredible risks of RAID 0, but it can be difficult to put into perspective that RAID 5’s issues are so extreme that it might actually be less reliable than RAID 0.

That RAID 5 might be less reliable than RAID 0 in an array of this size based on resilver operations alone is just the beginning.  In a massive array like this the resilver time can take so long and exact such a toll on the drives that second drive failure starts to become a measurable risk as well.  And then there are additional risks caused by array controller errors that can utilize resilver algorithms to destroy an entire array even when no drive failure has occurred.  As RAID 0 (or RAID 1 or RAID 10) do not have resilver algorithms they do not suffer this additional risk.  These are hard risks to quantify but what is important is that they are additional risks that accumulate when using a more complex system when a simpler system, without the redundancy, was more reliable from the outset.

Now that we have established that RAID 5 can be less reliable than RAID 0 I will point out the obvious dangers of RAID 0.  RAID in general is used to mitigate the risk of a single, lone hard drive failing.  We all fear a single drive simply failing and all data being lost.  RAID 0, being a large stripe of drives without any form of redundancy, takes the risk of data loss of a single drive failing and multiplies it across a number of drives where any drive failing causes total loss of data to all drives.  So in our eleven disk example above, if any of the eleven disks fails all is lost.  It is clear to see where this is dramatically more dangerous than just using a single drive, all alone.

What I am trying to point out here is that redundancy does not mean reliability.  That something is redundant, like RAID 5, provides no guarantee that it will always be more reliable than something that is not redundant.

My favourite analogy here is to look at houses in a tornado.  In one scenario we build a single house of brick and mortar.  In the second scenario we build two redundant houses, each out of straw (our builders are pigs, apparently.)  When the tornado (or big bad wolf) comes along, which is more likely to leave us with a standing house?  Clearly one brick and mortar house has some significant reliability advantages over redundant straw houses.  Redundancy didn’t matter; reliability mattered in the end.

Redundancy is often misleading because it is easy to quantify but hard to qualify.  Redundancy is a black or white question: Is it redundant?  Yes or no.  Simple.  Reliability is not so simple.  Reliability is about failure rates and likelihoods.  It is about statistics and analysis.  As it is hard to quantify reliability in a meaningful way, especially when selling a project to the business people, redundancy often becomes a simple substitute for this complex concept.

The concept of using redundancy to misdirect questions of reliability also ends up applying to subsystems in very convoluted ways.  Instead of making a “system” redundant it has become common to make a highly reliable, and low cost, subsystem redundant and then treat subsystem redundancy as applying to the whole system.  The most common example of this is RAID controllers in SAN products.  Rather than having a redundant SAN (meaning two SANs), manufacturers will often make redundant that one component which is not normally redundant in ordinary servers and then call the SAN redundant – meaning a SAN that contains redundancy, which is not at all the same thing.

A good analogy here is the difference between having redundant cars, meaning two complete, working cars, and having a single car with a spare water pump in the trunk in case the main one fails.  Clearly, a spare water pump is not a bad thing.  But it is also a trivial amount of protection against car failure compared to having a second car ready to go.  In one case the entire system is redundant, including the chassis.  In the other we are making just one, highly reliable component redundant inside the chassis.  It’s not even on par with having a spare tire which, at least, is a car component with a higher likelihood of failure.

Shared storage technologies like SANs and NAS often fall prey to the same thinking as the RAID 5 reliability myth and the system/subsystem redundancy confusion, especially in regards to virtualization.  There is a common scenario where a virtualization project is undertaken and people instinctively panic because a single virtualization host represents a single point of failure where, if it fails, many systems will all fail at once.

Using the term “single point of failure” causes a panicked feeling and is a great means of steering a conversation.  But a SPOF, as we like to call it, while something we like to remove when possible, may not be the end of the world.  Think about our brick house.  It is a SPOF.  Our two houses of straw are not.  Yet a single breeze takes out our redundant solution faster than our reliable SPOF.  Looking for SPOFs is a great way to find points of fragility in a system, but do not feel that every SPOF must be made redundant in every scenario.  Most businesses will find their best value in having many SPOFs in place.  Our real goal is reliability at appropriate cost; redundancy, as we have seen, is no substitute for reliability, it is simply a tool that we can use to achieve reliability.

The theory that many people follow when virtualizing is to take their virtualization host and say “This host is a SPOF, so I need to have two of them and use High Availability features to allow for transparent failover!”  This is spurred by the leading virtualization vendor making their money firstly by selling expensive HA add-on products and secondly by being owned by a large storage vendor – so selling unnecessary or even dangerous additional shared storage is a big monetary win for them and could easily be the reason that they have championed the virtualization space from the beginning.  Redundant virtualization hosts with shared storage sound great but can be extremely misguided for several reasons.

The first reason is that the initial SPOF, the virtualization host, is simply replaced with a new SPOF, the shared storage.  This accomplishes nothing.  Assuming that we are using comparable quality servers and shared storage, all we’ve done is move where the risk is, not change how big it is.  The likelihood of the storage system failing is roughly equal to the likelihood of the original server failing.  But in addition to shuffling the SPOF around like in a shell game we’ve also done something far, far worse – we have introduced chained, or cascading, failure dependencies.

In our original scenario we had a single server.  If the server stayed working we were good; if it failed we were not.  Simple.  Now we have two virtualization hosts, a single storage server (SAN, NAS, whatever) and a network connecting them together.  We have already determined that the risk of the shared storage failing is approximately equal to our total system risk in the original scenario.  But now we have the additional dependencies of the network and the two front end virtualization nodes.  Each of these components is more reliable than the fragile shared storage (anything with mechanical drives is going to be fragile), but that they are lower risk is not the issue; the issue is that the risks are combinatorial.

If any of these three components (storage, network or the front end nodes) fail then everything fails.  The solution to this is to make the shared storage redundant on its own and to make the network redundant on its own.  With enough work we can overcome the fragility and risk that we introduced by adding shared storage but the shared storage on its own is not a form of risk mitigation but is a risk itself which must be mitigated.  The spiral of complexity begins and the cost associated with bringing this new system up on par with the reliability of the original, single server system can be astronomic.
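
A toy model makes the combinatorial point plain.  The availability figures below are illustrative assumptions of mine, not measurements; the point is only how serial dependencies multiply:

```python
# Toy availability model of the chained dependencies described above.

def serial(*components):
    """Everything in a serial chain must be up for the system to be up."""
    total = 1.0
    for availability in components:
        total *= availability
    return total

def redundant_pair(availability):
    """At least one of two identical components must be up."""
    return 1 - (1 - availability) ** 2

single_host = 0.999                    # the original "SPOF"
shared_storage = 0.999                 # roughly as fragile as the host it replaced
network_fabric = 0.9995
host_pair = redundant_pair(0.999)      # two hosts with transparent failover

clustered = serial(shared_storage, network_fabric, host_pair)

print(f"single host:             {single_host:.4%}")
print(f"HA pair + SAN + network: {clustered:.4%}")
```

Under these assumed numbers the “redundant” design actually comes out slightly less available than the single server it replaced, until the storage and network are themselves made redundant.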

Now that we have all of this redundancy we have one more risk to worry about.  Managing all of this redundancy, all of these moving parts, requires a lot more knowledge, skill and preparation than does managing a simple, single server.  We have moved from a simple solution to a very complex one.  In my own anecdotal experience the real dangers of solutions like this come not from the hardware failing but from human error.  Not only has little been done to avoid human error causing this new system to fail but we’ve added countless points where a human might accidentally bring the entire system, redundancy and all, right down.  I’ve seen it first hand; I’ve heard the horror stories.  The more complex the system the more likely a human is going to accidentally break everything.

It is critical that, as IT professionals, we step back and look at complete systems, consider reliability and risk, and think of redundancy simply as a tool to use in the pursuit of reliability.  Redundancy itself is not a panacea.  Neither is simplicity.  Reliability is a complex problem to tackle.  Avoiding simplistic replacements is an important first step in moving from covering up reliability issues to facing and solving them.

 

RAID Revisited

Back when I was a novice service tech and barely knew anything about system administration, one of the few topics that we were always expected to know cold was RAID – Redundant Array of Inexpensive Disks.  It was the answer to all of our storage woes.  With RAID we could scale our filesystems larger, get better throughput and even add redundancy, allowing us to survive the loss of a disk which, especially in those days, happened pretty regularly.  With the rise of NAS and SAN storage appliances, the skill set of getting down to the physical storage level and tweaking it to meet the needs of the system in question is rapidly disappearing.  This is not a good thing.  Just because we are offloading storage to external devices does not change the fact that we need to fundamentally understand our storage and configure it to meet the specific needs of our systems.

A misconception that seems to have entered the field over the last five to ten years is the belief that RAID somehow represents a system backup.  It does not.  RAID is a form of fault tolerance.  Backup and fault tolerance are very different concepts.  Backup is designed to allow you to recover after a disaster has occurred.  Fault tolerance is designed to lessen the chance of disaster in the first place.  Think of fault tolerance as building a fence at the top of a cliff and backup as building a hospital at the bottom of it.  You never really want to be in a situation without both a fence and a hospital, but they are definitely different things.

Once we are implementing RAID for our drives, whether locally attached or on a remote appliance like a SAN, we have four key RAID solutions from which to choose today for business: RAID 1 (mirroring), RAID 5 (striping with parity), RAID 6 (striping with double parity) and RAID 10 (mirroring with striping.)  There are others, like RAID 0, that should only be used in rare circumstances when you really understand your drive subsystem needs.  RAID 50 and 51 are used as well, but far less commonly, and are not nearly as effective.  Ten years ago RAID 1 and RAID 5 were common, but today we have more options.

Let’s step through the options and discuss some basic numbers.  In our examples we will use n to represent the number of drives in our array and we will use s to represent the size of any individual drive.  Using these we can express the usable storage space of an array making comparisons easy in terms of storage capacity.

RAID 1: In this RAID type drives are mirrored.  You have two drives and they do everything together at the same time, hence “mirroring.”  Mirroring is extremely stable as the process is so simple, but it requires you to purchase twice as many drives as you would need if you were not using RAID at all, as your second drive is dedicated to redundancy.  The benefit is that you have the assurance that every bit you write to disk is written twice for your protection.  So with RAID 1 our capacity is calculated to be (n*s/2).  RAID 1 suffers from providing minimal performance gains over non-RAID drives.  Write speeds are equivalent to a non-RAID system while read speeds are almost twice as fast in most situations, since during read operations the drives can be accessed in parallel to increase throughput.  RAID 1 is limited to two drive sets.

RAID 5: Striping with Single Parity.  In this RAID type data is written in a stripe across all drives in the array with a parity block that is distributed across all of the drives.  By doing this RAID 5 is able to use an arbitrarily sized array of three or more disks and only loses the storage capacity equivalent of a single disk to parity, although the parity is distributed and does not exist solely on any one physical disk.  RAID 5 is often used because of its cost effectiveness due to its small loss of storage capacity in large arrays.  Unlike mirroring, striping with parity requires that a calculation be performed for each write stripe across the disks and this creates some overhead.  Therefore the throughput is not always an obvious calculation and depends heavily upon the computational power of the system doing the parity calculation.  Calculating RAID 5 capacity is quite easy as it is simply ((n-1)*s).  A RAID 5 array can survive the loss of any single disk in the array.
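
The parity calculation itself is simple XOR.  A minimal sketch (a toy, not how any particular controller lays data out; real arrays work on large blocks and rotate the parity across drives) shows how a lost block is recomputed from the survivors:

```python
# Toy illustration of the single-parity idea behind RAID 5: any one
# missing block can be rebuilt by XORing the remaining blocks together.

from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]          # data blocks on three drives
parity = xor_blocks(data)                    # parity block on a fourth drive

# "Lose" the second drive and rebuild its block from the survivors.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
print("rebuilt block:", rebuilt)
```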

RAID 6: Redundant Striping with Double Parity.  RAID 6 is practically identical to RAID 5 but uses two parity blocks per stripe rather than one to allow for additional protection against disk failure.  RAID 6 is a newer member of the RAID family having been added several years after the other levels had become standardized.  RAID 6 is special in that it allows for the failure of any two drives within an array without suffering data loss.  But to accommodate the additional level of redundancy a RAID 6 array loses the storage capacity of the equivalent to two drives in the array and requires a minimum of four drives.  We can calculate the capacity of a RAID 6 array with ((n-2)*s).

RAID 10: Mirroring plus Striping.  Technically RAID 10 is a hybrid RAID type encompassing a set of RAID 1 mirrors existing in a non-parity stripe (RAID 0).  Many vendors use the term RAID 10 (or RAID 1+0) when speaking of only two drives in an array but technically that is RAID 1 as striping cannot occur until there are a minimum of four drives in the array.  With RAID 10 drives must be added in pairs so only an even number of drives can exist in an array.  RAID 10 can survive the loss of up to half of the total set of drives but a maximum loss of one from each pair.  RAID 10 does not involve a parity calculation giving it a performance advantage over RAID 5 or RAID 6 and requiring less computational power to drive the array.  RAID 10 delivers the greatest read performance of any common RAID type as all drives in the array can be used simultaneously in read operations although its write performance is much lower.  RAID 10’s capacity calculation is identical to that of RAID 1, (n*s/2).
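
Collected as code, the usable-capacity formulas from the four descriptions above look like this; the eight drive, three terabyte example is arbitrary:

```python
# Usable-capacity helpers, where n is the number of drives and s the size
# of each drive (any consistent unit).

def raid1(n, s):
    return n * s / 2        # two-drive mirror

def raid5(n, s):
    return (n - 1) * s

def raid6(n, s):
    return (n - 2) * s

def raid10(n, s):
    return n * s / 2        # striped mirrored pairs; n must be even

for name, fn in (("RAID 5", raid5), ("RAID 6", raid6), ("RAID 10", raid10)):
    print(f"{name:>7}: {fn(8, 3):.0f} TB usable from eight 3 TB drives")
```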

In today’s enterprise it is rare for an IT department to have a serious need to consider any drive configuration outside of the four mentioned here regardless of whether software or hardware RAID is being implemented.  Traditionally the largest concern in a RAID array decision was based around usable capacity.  This was because drives were expensive and small.  Today drives are so large that storage capacity is rarely an issue, at least not like it was just a few years ago, and the costs have fallen such that purchasing additional drives necessary for better drive redundancy is generally of minor concern.  When capacity is at a premium RAID 5 is a popular choice because it loses the least storage capacity compared to other array types and in large arrays the storage loss is nominal.

Today we generally have other concerns, primarily data safety and performance.  Spending a little extra to ensure data protection should be an obvious choice.  RAID 5 suffers from being able to lose only a single drive.  In an array of just three members this is only slightly more dangerous than the protection offered by RAID 1.  We could survive the loss of any one out of three drives.  Not too scary compared to losing either of two drives.  But what about a large array, say sixteen drives.  Being able to safely lose only one of sixteen drives should make us question our reliability a little more thoroughly.

This is where RAID 6 stepped in to fill the gap.  RAID 6, when used in a large array, introduces a very small loss of storage capacity and performance while providing the assurance of being able to lose any two drives.  Proponents of the striping with parity camp will often quote these numbers to assuage management that RAID 5/6 can provide adequate “bang for the buck” in storage subsystems, but there are other factors at play.

Almost entirely overlooked in discussions of RAID reliability – an all too seldom discussed topic as it is – is the question of parity computation reliability.  With RAID 1 or RAID 10 there is no “calculation” done to create a stripe with parity.  Data is simply written in a stable manner.  When a drive fails its partner picks up the load and drive performance is slightly degraded until the failed drive is replaced.  There is no rebuilding process that impacts existing drive members.  Not so with parity stripes.

RAID arrays with parity have operations that involve calculating what is and what should be on the drives.  While this calculation is very simple it provides an opportunity for things to go wrong.  An array controller that fails with RAID 1 or RAID 10 could, in theory, write bad data over the contents of the drives, but there is no process by which the controller makes drive changes on its own, so this is extremely unlikely to ever occur as there is never a “rebuild” process except in creating a mirror.

When arrays with parity perform a rebuild operation they perform a complex process by which they step through the entire contents of the array and write missing data back to the replaced drive.  In and of itself this is relatively simple and should be no cause for worry.  What I and others have seen first hand is a slightly different scenario involving disks that have lost connectivity due to loose connectors to the array.  Drives can commonly “shake” loose over time as they sit in a server especially after several years of service in an always-on system.

What can happen, in extreme scenarios, is that good data on drives can be overwritten by bad parity data when an array controller believes that one or more drives have failed in succession and been brought back online for rebuild.  In this case the drives themselves have not failed and there is no data loss.  All that is required is that the drives be reseated, in theory.  On hot swap systems the management of drive rebuilding is often automatic based on the removal and replacement of a failed drive.  So this process of losing and replacing a drive may occur without any human intervention – and a rebuilding process can begin.  During this process the drive system is at risk and should this same event occur again the drive array may, based upon the status of the drives, begin striping bad data across the drives overwriting the good filesystem.  It is one of the most depressing sights for a server administrator to see when a system with no failed drives loses an entire array due to an unnecessary rebuild operation.

In theory this type of situation should not occur and safeguards are in place to protect against it but the determination of a low level drive controller as to the status of a drive currently and previously and the quality of the data residing upon that drive is not as simple as it may seem and it is possible for mistakes to occur.  While this situation is unlikely it does happen and it adds a nearly impossible to calculate risk to RAID 5 and RAID 6 systems.  We must consider the risk of parity failure in addition to the traditional risk calculated from the number of drive losses that an array can survive out of a pool.  As drives become more reliable the significance of the parity failure risk event becomes greater.

Additionally, RAID 5 and RAID 6 parity introduces system overhead due to parity calculation which is often handled by way of dedicated RAID hardware.  This calculation introduces latency into the drive subsystem that varies dramatically by implementation both in hardware and in software making it impossible to state performance numbers of RAID levels against one another as each implementation will be unique.

Possibly the biggest problem with RAID choices today is that the ease with which metrics for storage efficiency and drive loss survivability can be obtained masks the big picture of reliability and performance, as those statistics are almost entirely unavailable.  One of the dangers of metrics is that people will focus upon factors that can be easily measured and ignore those that cannot be easily measured, regardless of their potential for impact.

While all modern RAID levels have their place it is critical that they be considered within context and with an understanding as to the entire scope of the risks.  We should work hard to shift our industry from a default of RAID 5 to a default of RAID 10.  Drives are cheap and data loss is expensive.

[Edit: In the years since this was initially written, the rise of URE (Unrecoverable Read Error) risks during a rebuild operation has shifted the primary risks for parity arrays from those listed to URE-related risks.]