Choosing RAID for Hard Drives in 2013

After many, many articles, discussions, threads, presentations, questions and posts on choosing RAID, I have finally decided to publish my 2012-2013 high-level guide to choosing RAID.  The purpose of this article is not to broadly explain or defend RAID choices but to present a concise guide to making an educated, studied decision on a RAID level that makes sense for a given purpose.

Today, four key RAID types exist for the majority of purposes: RAID 0, RAID 1, RAID 6 and RAID 10.  Each has a place where it makes the most sense.  RAID 1 and RAID 10, one simply being an application of the other, can handily be considered a single RAID type, with the only significant difference being the size of the array.  Many vendors incorrectly refer to RAID 1 as RAID 10 today because of this and, while this is clearly a semantic mistake, we will call them RAID 1/10 here to make decision making less complicated.  Together they can be considered the “mirrored RAID” family, and the differentiation between them is based solely on the number of pairs in the array: one pair is RAID 1, more than one pair is RAID 10.

RAID 0: RAID without redundancy.  RAID 0 is very fast and very fragile.  It has practically no overhead and requires the fewest hard disks to meet capacity and performance goals.  RAID 0 is well suited to situations where data is volatile (such as temporary caches), or where data is read only, solidly backed up and accessibility is not a key concern.  RAID 0 should never be used for live or critical data.

RAID 6: RAID 6 is the market standard today for parity RAID, the successor to RAID 5.  As such, RAID 6 is cost-effective in larger arrays (five drives minimum, normally six or more) where performance and reliability are secondary concerns to cost.  RAID 6 is focused on cost-effective capacity for near-line data.

RAID 1/10: Mirrored RAID provides the best speed and reliability, making it ideally suited for online data – any data where speed and reliability are of top concern.  It is the only reasonable choice for arrays of four or fewer drives where the data is non-volatile.  With rare exception, mirrored RAID should be the de facto choice for any RAID array where specific technical needs do not clearly mandate a RAID 0 or RAID 6 solution.

It is a rare circumstance where RAID 0 is required, very rare.  RAID 6 has a place in many organizations but almost never on its own.  Almost every organization should be relying on RAID 1 or 10 for its primary storage and potentially using other RAID types for special cases, such as backups, archives and caches.  It is a very, very rare business that would not have RAID 10 as the primary storage for the bulk of its systems.
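
To make the capacity trade-offs behind these recommendations concrete, here is a minimal sketch (my own illustration, not part of any standard) that computes usable capacity for a hypothetical array; the eight-drive count and 4 TB drive size are assumptions chosen only for the example.

# Minimal sketch: usable capacity for the RAID families discussed above,
# for a hypothetical array of eight 4 TB drives (an assumption for the example).

def usable_capacity_tb(level: str, drives: int, size_tb: float) -> float:
    """Usable capacity in TB for the common RAID levels discussed here."""
    if level == "RAID 0":
        return drives * size_tb          # pure striping, no redundancy
    if level == "RAID 1/10":
        return (drives // 2) * size_tb   # half of the drives hold mirror copies
    if level == "RAID 6":
        return (drives - 2) * size_tb    # two drives' worth of capacity goes to parity
    raise ValueError(f"unhandled level: {level}")

for level in ("RAID 0", "RAID 1/10", "RAID 6"):
    print(f"{level}: {usable_capacity_tb(level, 8, 4.0):.0f} TB usable")
# RAID 0: 32 TB but tolerates no failures; RAID 1/10: 16 TB, tolerates at least one
# failure (up to one per mirrored pair); RAID 6: 24 TB, tolerates any two failures.

The numbers illustrate why RAID 6 competes on cost per usable terabyte while mirrored RAID competes on speed and safety.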

Virtual Eggs and Baskets

In speaking with small business IT professionals, one of the key sources of hesitancy around deploying virtualization is the adage “don’t put all of your eggs in one basket.”

I can see where this concern arises.  Virtualization allows many guest operating systems to be contained in a single physical system which, in the event of a hardware failure, causes all of the guests residing on it to fail at once.  This sounds bad, but perhaps it is not as bad as we would first presume.

The idea of the eggs and baskets idiom is that we should not put all of our resources at risk at the same time.  This is generally applied to investing, encouraging investors to diversify and invest in many different companies and types of securities such as bonds, stocks, funds and commodities.  In the case of eggs (or money) we are talking about an interchangeable commodity.  One egg is as good as another.  A set of eggs is naturally redundant.

If we have a dozen eggs and we break six, we can still make an omelette, maybe a smaller one, but we can still eat.  Eating a smaller omelette is likely to be nearly as satisfying as a larger one – we are not going hungry in any case.  Putting our already redundant eggs into multiple baskets allows us to hedge our bets.  Yes, carrying two baskets means that we have less time to pay attention to either one so it increases the risk of losing some of the eggs but reduces the chances of losing all of the eggs.  In the case of eggs, a wise proposition indeed.  Likewise, a smart way to prepare for your retirement.

This theory, because it is repeated as an idiom without careful analysis or proper understanding, is then applied to unrelated areas such as server virtualization.  Servers, however, are not like eggs.  Servers, especially in smaller businesses, are rarely interchangeable commodities where having six working, instead of the usual twelve, is good enough.  Typically each server plays a unique role and all are relatively critical to the functioning of the business.  If a server is not critical, it is unlikely to justify its own acquisition and maintenance costs in the first place and so would probably not exist.  When servers are interchangeable, such as in a large, stateless web farm or compute cluster, they are configured that way as a means of expanding capacity beyond the confines of a single physical box and so fall outside the scope of this discussion.

IT services in a business are usually, at least to some degree, a “chain dependency.”  That is, they are interdependent, and the loss of a single service may impact other services either because they are technically interdependent (such as a line of business application depending on a database) or because they are workflow interdependent (such as an office worker needing the file server in order to retrieve a file which he needs to edit with information from an email while discussing the changes over the phone or instant messenger.)  In these cases, the loss of a single key service such as email, network authentication or file services may create a disproportionate loss of working ability.  If there are ten key services and one goes down, company productivity from an IT services perspective likely drops by far more than ten percent, possibly nearing one hundred percent in extreme cases.  This is not always true; in some unique cases workers are able to “work around” a lost service effectively, but this is very uncommon.  Even if people can remain working, they are likely far less productive than usual.

When dealing with physical servers, each server represents its own point of failure.  So if we have ten servers, we have ten times the likelihood of an outage compared to running only one of those same servers.  Each server that we add brings with it its own risk.  Assume that each server fails, on average, once per decade and that each failure carries an outage factor of 2.5 – that is, because of chain dependencies, losing one of the ten services costs the business twenty-five percent of a day’s revenue rather than its nominal ten percent share.  Ten servers then produce a total average impact over a decade equivalent to two and a half total site outages.  I use the concept of factors and averages here to keep things simple; determining the length or cost of an average outage is not necessary because we only need relative impact to compare the scenarios.  It is just a means of comparing the cumulative financial impact of one event type against another without needing specific figures – it does not tell you what your spend should be, just relative reliability.

With virtualization we have the obvious ability to consolidate.  In this example we will assume that we can collapse all ten of these existing servers down into a single server.  When we do this we often trigger the “all our eggs in one basket” response.  But if we run some risk analysis we will see that this is usually just fear and uncertainty and not a mathematically supported risk.  If we assume the same failure rate as above, our single server will, on average, incur just one failure per decade, and that failure is a single total site outage.

Compare this to the first example, which did damage equivalent to two and a half total site outages – the risk of the virtualized, consolidated solution is only forty percent that of the traditional solution.
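
The arithmetic can be checked with a small sketch; the once-per-decade failure rate and the 2.5 impact factor are the illustrative assumptions from the example above, not measured values.

# Sketch of the relative-risk comparison above, using the article's illustrative
# assumptions: each physical server fails about once per decade, and losing one
# of ten services costs 2.5 times its nominal share (25% of a day's revenue
# instead of 10%).

failures_per_server_per_decade = 1.0
servers = 10
impact_factor = 2.5                 # chain-dependency multiplier
share_per_service = 1.0 / servers   # each service is 10% of the whole

# Ten separate servers: ten failures per decade, each costing 25% of a site-day.
impact_physical = servers * failures_per_server_per_decade * share_per_service * impact_factor

# One consolidated server: one failure per decade, costing a full site-day.
impact_consolidated = 1 * failures_per_server_per_decade * 1.0

print(impact_physical, impact_consolidated, impact_consolidated / impact_physical)
# -> 2.5 1.0 0.4  (the consolidated host carries about 40% of the cumulative impact)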

Now keep in mind that this is based on the assumption that losing some services means a financial loss greater than the strict value of the service that was lost, which is almost always the case.  Even if the impact of losing a service is exactly proportional to that service’s share, we are only at break even and need not worry.  In rare cases the impact of losing a single system can be less than its “slice of the pie”, normally because people are flexible and can work around the failed system – for example, if instant messaging fails, people simply switch to using email until instant messaging is restored.  But these cases are rare and are normally isolated to a few systems out of many, with the majority of systems, say ERP, CRM and email, having disproportionately large impacts in the event of an outage.

So what we see here is that, under normal circumstances, moving ten services from ten servers to ten services on one server will generally lower our risk, not increase it – in direct contrast to the “eggs in a basket” theory.  And this is purely from a hardware failure perspective.  Consolidation offers several other important reliability factors, though, that can have a significant impact on our case study.

With consolidation we reduce the amount of hardware that needs to be monitored and managed by the IT department.  Fewer servers means that more time and attention can be paid to those that remain.  More attention means a better chance of catching issues early and more opportunity to keep parts on hand.  Better monitoring and maintenance leads to better reliability.

Possibly the most important factor with consolidation, however, is that there are significant cost savings and these, if approached correctly, can provide opportunities for improved reliability.  With the dramatic reduction in total cost for servers it can be tempting to keep budgets tight and simply take the cost savings directly.  That is understandable, and for some businesses it may be the correct approach.  But it is not the approach that I would recommend when struggling against the notion of eggs and baskets.

Instead, by taking a more moderate approach – keeping significant cost savings but still spending more, relatively speaking, on the single server – you can acquire a higher end (read: more reliable) server, use better parts, keep on-site spares, etc.  The cost savings of virtualization can often be turned directly into increased reliability, further shifting the equation in favor of the single server approach.

As I stated in another article, one brick house is more likely to survive a wind storm than either one or two straw houses.  Having more of something doesn’t necessarily make it the more reliable choice.

These benefits come purely from the consolidation aspect of virtualization and not from the virtualization itself.  Virtualization provides extended risk mitigation features separately as well.  System imaging and rapid restores, as well as restores to different hardware, are major advantages of almost any virtualization platform.  This can play an important role in a disaster recovery strategy.

Of course, all of these concepts are purely to demonstrate that single box virtualization and consolidation can beat the legacy “one app to one server” approach and still save money – showing that the eggs and baskets analogy is misleading and does not apply in this scenario.  There should be little trepidation in moving from a traditional environment directly to a virtualized one based on these factors.

It should be noted that virtualization can then extend the reliability of traditional commodity hardware by providing mainframe-like failover features that are above and beyond what non-virtualized platforms are able to provide.  This moves commodity hardware more firmly into line with the larger, more expensive RISC platforms.  These features can bring an extreme level of protection but are often more than is appropriate for IT shops initially migrating from a non-failover, legacy hardware server environment.  High availability is a great feature but is often costly and very often unnecessary, especially as companies move, as we have seen, from relatively unreliable environments in the past to more reliable environments today.  Given that we have already increased reliability over what was considered necessary in the past, there is a very good chance that an extreme jump in reliability is not needed now; but, given the large drop in the cost of high availability, it is quite possible that it will be cost justified where previously it could not be.

In the same vein, virtualization is often feared because it is seen as a new, unproven technology.  This is certainly untrue, though the impression persists in the small business and commodity server space.  In reality, virtualization was first introduced by IBM in the 1960s and has ever since been a mainstay of high end mainframe and RISC servers – those systems demanding the best reliability.  In the commodity server space virtualization was a larger technical challenge and took a very long time before it could be implemented efficiently enough to be effective in the real world.  But even in the commodity space virtualization has been available since the late 1990s and so is approximately fifteen years old today, which is very far past the point of being a nascent technology – in the world of IT it is positively venerable.  Commodity platform virtualization is a mature field with several highly respected, extremely advanced vendors and products.  The use of virtualization as a standard for all or nearly all server workloads is a long established and accepted “enterprise pattern” and one that can now easily be adopted by companies of any and every size.

Virtualization, perhaps counter-intuitively, is actually a very critical component of a reliability strategy.  Instead of adding risk, virtualization can almost be approached as a risk mitigation platform – a toolkit for increasing the reliability of your computing platforms through many avenues.

Choosing a RAID Level by Drive Count

In addition to all other factors, the number of drives available to you plays a significant role in choosing what RAID level is appropriate for you.  Ideally RAID is chosen ahead of time in conjunction with chassis and drives in a holistic approach so that the entire system is engineered for the desired purpose, but even in these cases, knowing how drive counts can affect useful RAID choices can be very helpful.

To simplify the list, RAID 0 is left off; it is a viable choice for certain niche business scenarios at any drive count, so there is no need to show it.  The list also assumes that a hot spare, if one exists, is not included in the count, as a hot spare sits “outside” of the RAID array and is not part of the array drive count.

2 Drives: RAID 1

3 Drives: RAID 1 *

4 Drives: RAID 10

5 Drives: RAID 6

6 Drives: RAID 6 or RAID 10

7 Drives: RAID 6 or RAID 7

8 Drives: RAID 6 or RAID 7 or RAID 10 **

9 Drives: RAID 6 or RAID 7

10 Drives: RAID 6 or RAID 7 or RAID 10 or RAID 60/61

11 Drives: RAID 6 or RAID 7 

12 Drives: RAID 6 or RAID 7 or RAID 10 or RAID 60/61

13 Drives: RAID 6 or RAID 7

14 Drives: RAID 6 or RAID 7 or RAID 10 or RAID 60/61 or RAID 70/71

15 Drives: RAID 6 or RAID 7 or RAID 60

16 Drives: RAID 6 or RAID 7 or RAID 10 or RAID 60/61 or RAID 70/71

17 Drives: RAID 6 or RAID 7

18 Drives: RAID 6 or RAID 7 or RAID 10 or RAID 60/61 or RAID 70/71

19 Drives: RAID 6 or RAID 7

20 Drives: RAID 6 or RAID 7 or RAID 10 or RAID 60/61 or RAID 70/71

21 Drives: RAID 6 or RAID 7 or RAID 60 or RAID 70

22 Drives: RAID 6 or RAID 7 or RAID 10 or RAID 60/61 or RAID 70/71

23 Drives: RAID 6 or RAID 7

24 Drives: RAID 6 or RAID 7 or RAID 10 or RAID 60/61 or RAID 70/71

25 Drives: RAID 6 or RAID 7 or RAID 60

………

* RAID 1 is technically viable at any drive count of two or more.  I have included it only up to three drives because using it beyond that point is generally considered absurd and is practically unheard of in the real world.  Technically, though, it would continue to provide equal write performance while increasing in read performance and reliability as more drives are added to the mirror.  For reasons of practicality it appears only twice on the list, where it would actually be useful.

** At six drives and higher, both RAID 6 and RAID 10 are viable options for even drive counts, while RAID 6 alone is a viable option for odd drive counts.

For this list I have considered only the standard RAID levels of 0, 1, 4, 5, 6 and 10.  I left 0 off of the list because it is always viable for certain use cases.  RAID 5 never appears because there is no case today in which it should be used on spindle hard drives, and as RAID 5 is an enhancement of RAID 4, RAID 4 does not appear on the list either.  Non-standard double parity RAID solutions such as NetApp’s RAID-DP and Oracle’s RAIDZ2 can be treated as derivations of RAID 6 and apply accordingly.  Oracle’s triple parity RAIDZ3 (sometimes called RAID 7) would apply at seven drives and higher but is a non-standard level and extremely rare, so it should be treated as a secondary option where it appears on the list.

More commonly, RAID 6 makes sense at six drives or more and RAID 7 at eight drives or more.

Like RAID 4 and 5, the RAID levels based on them (RAID 40, 50, 41, 51, 55, etc.) are no longer appropriate due to the failure and fragility modes of spindle-based hard drives.  Complex RAID levels based on RAID 6 and 7 (60, 61, 70, 71, etc.) have a place but are exceedingly rare, as they generally offer very little cost savings compared to RAID 10 while suffering from performance issues and increased risk.  RAID 61 and 71 are almost exclusively effective when the highest-order RAID, the mirror component, is over a network rather than local to the system.
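
As a quick reference, the drive-count list above can be encoded in a few rules.  The sketch below is my own summary of the list, assuming that the nested levels (60/61/70/71) are built from two or more equal-sized sub-arrays of at least five drives (for RAID 6 groups) or seven drives (for RAID 7 groups); it is a convenience, not a substitute for the engineering judgment discussed above.

# Sketch encoding the drive-count list above as rules (RAID 0 omitted, as in
# the list; hot spares are assumed to be excluded from the count).

def viable_raid_levels(n: int) -> list[str]:
    """Return the RAID levels from the list above for an array of n drives."""
    if n < 2:
        return []
    if n in (2, 3):
        return ["RAID 1"]           # a third drive simply extends the mirror
    if n == 4:
        return ["RAID 10"]
    levels = ["RAID 6"]             # parity RAID from five drives upward
    if n >= 7:
        levels.append("RAID 7")
    if n % 2 == 0:
        levels.append("RAID 10")
    # Nested parity needs at least two equal groups of >= 5 (RAID 60) or >= 7 (RAID 70).
    if any(n % k == 0 and n // k >= 5 for k in range(2, n // 5 + 1)):
        levels.append("RAID 60")
    if n % 2 == 0 and n // 2 >= 5:
        levels.append("RAID 61")    # a mirror of two RAID 6 arrays
    if any(n % k == 0 and n // k >= 7 for k in range(2, n // 7 + 1)):
        levels.append("RAID 70")
    if n % 2 == 0 and n // 2 >= 7:
        levels.append("RAID 71")    # a mirror of two RAID 7 arrays
    return levels

for count in (4, 7, 10, 15, 21, 24):
    print(count, viable_raid_levels(count))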

Hardware and Software RAID

RAID, Redundant Array of Inexpensive Disks, systems are implemented in one of two basic ways: software or dedicated hardware.  Both methods are very viable and have their own merits.

In the small business space, where Intel and AMD architecture systems and Windows operating systems rule, hardware RAID is so common that a lot of confusion has arisen around software RAID due, as we will see, in no small part to the wealth of scam software RAID products touted as dedicated hardware and known colloquially as “Fake RAID.”

When RAID was first developed, it was used, in software, on high end enterprise servers running systems like proprietary UNIX, where the platforms were extremely stable and the hardware was powerful and robust, making software RAID work very well.  Early RAID focused primarily on mirrored RAID or very simplistic parity RAID (like RAID 2), which had little overhead.

As the need for RAID began to spill into the smaller server space, and as parity RAID grew in popularity and required greater processing power, it became a problem that the underpowered processors in the x86 space were significantly impacted by the processing load of RAID, especially RAID 5.  This, combined with the fact that almost no operating system heavily used on these platforms had a software RAID implementation, led to the natural development of hardware RAID – an offload processor board (similar to a GPU for graphics) with its own complete computer on board: CPU, memory and firmware all of its own.

Hardware RAID worked very well at solving the RAID overhead problem in the x86 server space.  As CPUs gained more power and memory became less scarce, popular x86 operating systems like Windows Server began to offer software RAID options.  Windows software RAID in particular was known as a poor implementation and was available only on server operating system versions, leading to a lack of appreciation for software RAID in the community of system administrators working primarily with Windows.

Because of these historical implementations in the enterprise server space and the commodity x86 space, a natural separation arose between the two markets, supported initially by technology and later purely by ideology.  If you talk to a system administrator in the commodity space you will almost universally hear that hardware RAID is the only option.  Conversely, if you talk to a system administrator in the mainframe, RISC (SPARC, Power, ARM) or EPIC (Itanium) server space (sometimes called the UNIX server space) you will often be met with surprise, as hardware RAID is not available for those classes of systems – software RAID is simply a foregone conclusion.  Neither camp seems to have real knowledge of the situation in the opposite one, and crossover in skill sets between the two was relatively rare until recently, as enterprise UNIX-style platforms like Linux, Solaris and FreeBSD have become very popular and well understood on commodity hardware.

To make matters more confusing for the commodity server space, a large number of vendors began selling non-RAID controller cards along with a “driver” that was actually software RAID, pretending that the resulting product was hardware RAID – filling the vacuum left by the dominant operating system vendor’s lack of software RAID in the non-server market while targeting a less technically savvy audience.  This created a large amount of confusion at best and an incredible disdain for software RAID at worst; almost universally, any product whose core function is to protect data but whose market is built upon deception and confusion will end in disaster.  Fake RAID systems routinely have issues with performance and reliability.  While, in theory, a third party software RAID package is a reasonable option, the reality of the market is that essentially all quality software RAID implementations are native components of either the operating system itself (Linux, Mac OS X, Solaris, Windows) or of the filesystem (ZFS, VxFS, BtrFS) and are provided and maintained by primary vendors.  This leaves little room or purpose for third party products outside of the Windows desktop space, where a few small, legitimate software RAID players do exist but are often overshadowed by the Fake RAID players.

Today there is almost no need for hardware RAID, as commodity platforms are incredibly powerful and there is almost always a dramatic excess of both computational and memory resources.  Hardware RAID instead competes mostly on features rather than on reducing resource load.  Selection of hardware RAID versus software RAID in the commodity server space is almost completely a matter of preference and market momentum rather than of specific performance or features – the two approaches are essentially equal, and individual implementations matter far more when considering product options than whether the approach is hardware or software on its own.

Today, hardware RAID offerings tend to be more “generic”, with rather vanilla implementations of standard RAID levels.  Hardware RAID tends to earn its value through resource utilization reduction (CPU and memory offload), the ability to “blind swap” failed drives, simplified storage management, block level storage abstracted agnostically from the operating system, fast cache close to the drives and battery or flash backed cache.  Software RAID tends to earn its value through lower power consumption, lower cost of acquisition, integrated management with the operating system, unique or advanced RAID features (such as ZFS’ RAIDZ, which does not suffer from the standard RAID 5 write hole) and generally better overall performance.  It is truly not a discussion of better or worse in general but of better or worse for a very specific situation, with the most important factor often being familiarity and comfort and/or the default vendor offering.

One of the most overlooked but important differentiators between hardware and software RAID is the change in the job role associated with RAID array management.  Hardware RAID moves the handling of the array to the server administrator (the support role that works on the physical server and is stationed in the datacenter), whereas software RAID moves the handling of the array to the system administrator (the support role working at the operating system level and above, who rarely sits in the datacenter.)  In the SMB market this factor might be completely overlooked, but in a Fortune 500 the difference in job role can be very significant.  In many cases with hardware RAID, disk replacements and system setup can be done without the need for system administrator intervention.  Datacenter server administrators can discover failed drives either through alerts or by looking for “amber lights” during walkthroughs and do replacements on the fly without needing to contact anyone or even know what the server is running.  Software RAID almost always requires the system administrator to be involved: offlining the failed disk, coordinating the replacement with the datacenter and onlining the new disk once the replacement is complete.
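
As a concrete illustration of that job-role shift, the sketch below (assuming a Linux host using the kernel’s md software RAID driver, purely as an example) shows the sort of check that lands with the system administrator rather than the datacenter technician: parsing /proc/mdstat for a degraded array.

# Sketch (assumes Linux md software RAID): flag arrays whose member-status
# string in /proc/mdstat contains an underscore, which marks a missing or
# failed device (e.g. "[U_]" instead of "[UU]").

def degraded_md_arrays(mdstat_path: str = "/proc/mdstat") -> list[str]:
    degraded = []
    current = None
    with open(mdstat_path) as f:
        for line in f:
            if line.startswith("md"):
                current = line.split()[0]                    # e.g. "md0"
            elif current and "[" in line and "_" in line.rsplit("[", 1)[-1]:
                degraded.append(current)                     # e.g. status "[U_]"
                current = None
    return degraded

if __name__ == "__main__":
    bad = degraded_md_arrays()
    print("degraded arrays:", ", ".join(bad) if bad else "none")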

Because of the way that CPU offloading and performance work, and because of some advantages in the way that non-standard RAID implementations often handle parity RAID reconstruction, there is a tendency for mirrored RAID levels to favor hardware RAID and for parity RAID levels to favor software RAID.  Parity RAID is drastically more CPU intensive, so having access to the powerful central CPU resources can be a major factor in speeding up RAID calculations.  With mirrored RAID, on the other hand, where reconstruction is far safer than with parity RAID and where automated rebuilds are more important, hardware RAID brings the benefit of allowing blind drive replacement very easily.
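
For a sense of what those RAID calculations actually are, here is a minimal sketch of the single-parity (XOR) computation that RAID 4/5 style arrays perform on every stripe write; RAID 6 adds a second, more expensive syndrome on top of this, and this per-write work is exactly the load that hardware RAID offloads and that software RAID borrows from the host CPU.

# Sketch of XOR parity: the per-stripe calculation behind single-parity RAID.
from functools import reduce

def xor_parity(blocks: list[bytes]) -> bytes:
    """Compute the parity block for one stripe of equal-sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

stripe = [b"AAAA", b"BBBB", b"CCCC"]      # three data blocks in one stripe
parity = xor_parity(stripe)               # written alongside the data

# If one data block is lost, XOR of the survivors and the parity rebuilds it.
rebuilt = xor_parity([stripe[0], stripe[2], parity])
assert rebuilt == stripe[1]
print(rebuilt)                            # b'BBBB'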

One aspect of the hardware and software RAID discussion that is extremely paradoxical is that the market that often dismisses software RAID out of hand as inferior to hardware RAID almost completely overlaps (you can picture the Venn diagram in your head here) with the market that feels that file servers are inferior to commodity NAS appliances – yet those NAS appliances in the SMB range are almost universally based on the same software RAID implementations being casually dismissed.  So software RAID is often considered both inferior and superior simultaneously.  Some NAS devices in the SMB range, and NAS appliance software packages, that are software RAID based include: Netgear ReadyNAS, Netgear ReadyData, Buffalo TeraStation, QNAP, Synology, OpenFiler, FreeNAS, Nexenta and NAS4Free.

There is truly no “always use one or the other” answer with hardware and software RAID.  Even giant, six figure enterprise NAS and SAN appliances are undecided as to which to use, with parts of the industry going in each direction.  The real answer is that it depends on your specific situation – your job role separation, your technical needs, your experience, your budget, etc.  Both options are completely viable in any organization.
