
The Jurassic Park Effect

“If I may… Um, I’ll tell you the problem with the scientific power that you’re using here, it didn’t require any discipline to attain it. You read what others had done and you took the next step. You didn’t earn the knowledge for yourselves, so you don’t take any responsibility for it. You stood on the shoulders of geniuses to accomplish something as fast as you could, and before you even knew what you had, you patented it, and packaged it, and slapped it on a plastic lunchbox, and now …” – Dr. Ian Malcolm, Jurassic Park

When looking at building a storage server or NAS, there is a common feeling that what is needed is a “NAS operating system.”  This is an odd reaction, I find, since the term NAS means nothing more than a “fileserver with a dedicated storage interface.”  Or, in other words, just a file server with limited exposed functionality.  The reason that we choose physical NAS appliances is for the integrated support and sometimes for special, proprietary functionality (NetApp being a key example of this, offering extensive SMB and NFS integration and some really unique RAID and filesystem options; Exablox being another, offering fully managed scale out file storage and RAIN style protection.)  Using a NAS to replace a traditional file server is, for the most part, a fairly recent phenomenon and one that I have found is often driven by misconception or the impression that managing a file server, one of the most basic IT workloads, is special or hard.  File servers are generally considered the most basic form of server; traditionally they are what people meant when using the term “server” without additional description, and they are the only server role commonly integrated into the desktop (every Mac, Windows and Linux desktop can function as a file server and it is very common to do so.)

There is, of course, nothing wrong with turning to a NAS instead of a traditional file server to meet your storage needs, especially as some modern NAS options, like Exablox, offer scale out and storage options that are not available in most operating systems.  But it appears that the trend to use a NAS instead of a file server has led to some odd behaviour when IT professionals turn back to considering file servers again.  A cascading effect, I suspect, where the reasons why NAS appliances are sometimes preferred and the goal level thinking behind them are lost, but the resulting idea of “I should have a NAS” remains, so that when returning to look at file server options there is a drive to “have a NAS” regardless of whether there is a logical reason for feeling that this is necessary.

First we must consider that the general concept of a NAS is a simple one: take a traditional file server, simplify it by removing options and package it with all of the necessary hardware to make a simplified appliance, with all of the support included from the interface down to the spinning drives and everything in between.  Storage can be tricky when users need to determine RAID levels, drive types, monitor effectively, etc.  A NAS addresses this by integrating the hardware into the platform.  This makes things simple but can add risk as you have fewer support options and less ability to fix or replace things yourself.  A move from a file server to a NAS appliance is almost exclusively about support and is generally a very strong commitment to a singular vendor.  You choose the NAS approach because you want to rely on a vendor for everything.

When we move to a file server we go in the opposite direction.  A file server is a traditional enterprise server like any other.  You buy your server hardware from one vendor (HP, Dell, IBM, etc.) and your operating system from another (Microsoft, Red Hat, Suse, etc.)  You specify the parts and the configuration that you need and you have the most common computing model for all of IT.  With this model you generally are using standard, commodity parts allowing you to easily migrate between hardware vendors and between software vendors. You have “vendor redundancy” options and generally everything is done using open, standard protocols.  You get great flexibility and can manage and monitor your file server just like any other member of your server fleet, including keeping it completely virtualized.  You give up the vertical integration of the NAS in exchange for horizontal flexibility and standardization.

What is odd, therefore, is when returning to the commodity model but seeking what is colloquially known as a “NAS OS.”  Common examples of these include NAS4Free, FreeNAS and OpenFiler.  This category of products is generally nothing more than a standard operating system (often FreeBSD as it has ideal licensing, or Linux because it is well known) with a “storage interface” put onto it and no special or additional functionality that would not exist with the normal operating system.  In theory they are a “single function” operating system that does only one thing.  But this is not reality.  They are general purpose operating systems with an extra GUI management layer added on top.  One could say the same thing about most physical NAS products themselves, but those typically include custom engineering even at the storage level, special features and, most importantly, an integrated support stack and true isolation of the “generalness” of the underlying OS.  A “NAS OS” is not a simpler version of a general purpose OS; it is a more complex, yet less functional version of it.

What is additionally odd is that general OSes, with rare exception, already come with very simple, extremely well known and fully supported storage interfaces.  Nearly every variety of Windows or Linux server, for example, has included simple graphical interfaces for these functions for a very long time.  These included GUIs are often shunned by system administrators as being too “heavy and unnecessary” for a simple file server.  So it is even more unusual that adding a third party GUI, one that is not patched and tested by the OS team and not standardly known and supported, would then be desired, as this goes against the common ideals and practices of running a server.

And this is where the Jurassic Park effect comes in – the OS vendors (Red Hat, Microsoft, Oracle, FreeBSD, Suse, Canonical, et al.) are giants with amazing engineering teams, code review, testing, oversight and enterprise support ecosystems.  The “NAS OS” vendors, meanwhile, are generally very small companies, some with just one part time person, who stand on the shoulders of these giants and build something that they knew that they could but never stopped to ask if they should.  The resulting products are wholly negative compared to their pure OS counterparts: they do not make systems management easier, nor do they fill a gap in the market’s service offerings.  Solid, reliable, easy to use storage is already available; more vendors are not needed to fill this place in the market.

The logic often applied to looking at a NAS OS is that they are “easy to set up.”  This may or may not be true as easy, here, must be a relational term.  For there to be any value a NAS OS has to be easy in comparison to the standard version of the same operating system.  So in the case of FreeNAS, this would mean FreeBSD.  FreeNAS would need to be appreciably easier to set up than FreeBSD for the same, dedicated functions.  And this is easily true; setting up a NAS OS is generally pretty easy.  But this ease is only skin deep, and that is something of which IT professionals need to be quite aware.  Making something easy to set up is not a priority in IT; making something that is easy to operate and repair when there are problems is what is important.  Easy to set up is nice, but if it comes at the cost of not understanding how the system is configured and makes operational repairs more difficult it is a very, very bad thing.  NAS OS products routinely make it dangerously easy to get into production, in a storage role – almost always the most critical or nearly the most critical role of any server in an environment – a product that IT has no experience and likely no skill to maintain, operate or, most importantly, fix when something goes wrong.  We need exactly the opposite, a system that is easy to operate and fix.  That is what matters.  So we have a second case of “standing on the shoulders of giants” and building a system that we knew we could, but did not know if we should.

What exacerbates this problem is that the very people who feel the need to turn to a NAS OS to “make storage easy” are, by the very nature of the NAS OS, the exact people for whom operational support and the repair of the system is most difficult.  System administrators who are comfortable with the underlying OS would naturally not see a NAS OS as a benefit and avoid it, for the most part.  It is uniquely the people for whom it is most dangerous to run a not fully understood storage platform that are likely to attempt it.  And, of course, most NAS OS vendors earn their money, as we could predict, on post-installation support calls from customers who deployed easily, got stuck once they were in production and found themselves at the mercy of the vendor’s exorbitant support pricing.  It is in the interest of the vendors to make it easy to install and hard to fix.  Everything is working against the IT pro here.

If we take a common example and look at FreeNAS we can see how this is a poor alignment of “difficulties.”  FreeNAS is FreeBSD with an additional interface on top.  Anything that FreeNAS can do, FreeBSD can do.  There is no loss of functionality by going to FreeBSD.  When something fails, in either case, the system administrator must have a good working knowledge of FreeBSD in order to effect repairs.  There is no escaping this.  FreeBSD knowledge is common in the industry and getting outside help is relatively easy.  Using FreeNAS adds several complications, the biggest being that any and all customizations made by the FreeNAS GUI are special knowledge needed for troubleshooting on top of the knowledge already needed to operate FreeBSD.  So this is a larger knowledge set as well as more things to fail.  It is also a relatively uncommon knowledge set as FreeNAS is a niche storage product from a small vendor while FreeBSD is a major enterprise IT platform (plus all use of FreeNAS is FreeBSD use but only a tiny percentage of FreeBSD use is FreeNAS.)  So we can see that using a NAS OS just adds risk over and over again.

This same issue carries over into the communities that grow up around these products.  If you look to communities around FreeBSD, Linux or Windows for guidance and assistance you deal with large numbers of IT professionals, skilled system admins and those with business and enterprise experience.  Of course, hobbyists, the uninformed and others participate too, but these are the enterprise IT platforms and all the knowledge of the industry is available to you when implementing these products.  Compare this to the community of a NAS OS.  By its very nature, only people struggling with the administration of a standard operating system and/or storage basics would look at a NAS OS package, and so this naturally filters the membership of these communities down to exactly the people from whom we would least want to take advice.  This creates an isolated culture of misinformation and misunderstandings around storage and storage products.  Myths abound, guidance often becomes reckless and dangerous and industry best practices are ignored as if decades of accumulated experience had never happened.

A NAS OS also, commonly, introduces lags in patching and updates.  A NAS OS will almost always and almost necessarily trail its parent OS on security and stability updates and will very often follow months or years behind on major features.  In one very well known scenario, OpenFiler, the product was built on an upstream non-enterprise base (rPath Linux) which lacked community and vendor support, failed and was abandoned, leaving downstream users, including everyone on OpenFiler, without the ecosystem needed to support them.  Using a NAS OS means trusting not just the large, well known enterprise vendor that makes the base OS but trusting the NAS OS vendor as well.  And the NAS OS vendor is orders of magnitude more likely to fail than the enterprise class vendors whose base OSes they build upon.

Storage is a critical function and should not be treated carelessly, as if its criticality did not exist.  NAS OSes tempt us to install quickly and forget, hoping that nothing ever goes wrong or that we can move on to other roles or companies entirely before bad things happen.  They set us up for failure where failure is most impactful.  When a typical application server fails we can always copy the files off of its storage and start fresh.  When storage fails, data is lost and systems go down.

“John Hammond: All major theme parks have delays. When they opened Disneyland in 1956, nothing worked!

Dr. Ian Malcolm: Yeah, but, John, if The Pirates of the Caribbean breaks down, the pirates don’t eat the tourists.”

When storage fails, businesses fail.  Taking the easy route to setting up storage, ignoring the long term support needs and seeking advice from communities that have filtered out the experienced storage and systems engineers increases risk dramatically.  Sadly, the nature of a NAS OS is that the very reason that people turn to it (lack of deep technical knowledge to build the system) is the very reason they must avoid it (an even greater need for support.)  The people for whom NAS OSes are effectively safe to use, those with very deep and broad storage and systems knowledge, would rarely consider these products because for them they offer no benefits.

At the end of the day, while the concept of a NAS OS sounds wonderful, it is not a panacea: the value of a NAS does not carry over from the physical appliance world to the installed OS world, and the value of standard OSes is far too great for NAS OSes to add any real value.

“Dr. Alan Grant: Hammond, after some consideration, I’ve decided, not to endorse your park.

John Hammond: So have I.”

Practical RAID Choices for Spindle Based Arrays

A truly monumental amount of information abounds in reference to RAID storage systems, exploring topics such as risk, performance, capacity, trends, approaches and more.  While the work on this subject is nearly staggering, the information can be distilled into a handful of common, practical storage approaches that will cover nearly all use cases.  My goal here is to provide a handy guide that will allow a non-storage practitioner to approach RAID decision making in a practical and, most importantly, safe way.

For the purposes of this guide we will assume storage projects of no more than twenty five traditional drives (spinning platter drives, properly known as Winchester drives.)  These drives could commonly be SFF (2.5″) or LFF (3.5″), SATA or SAS, consumer or enterprise.  We will not tackle solid state drives as these have very different characteristics and require their own guidance.  Storage systems larger than roughly twenty five spindles should not work from standard guidance but should delve deeper into specific storage needs to ensure proper planning.

The guidance here is written for standard systems in 2015.  Over the past two decades the common approaches to RAID storage have changed dramatically; good RAID design of 1998 is very poor RAID design today.  While it is not anticipated that the key factors influencing these decisions will change enough in the future to alter these recommendations, it is very possible that they will.  The rate of change in the industry has dropped significantly since that time and these recommendations are likely to stand for a very long time, very possibly until spindle-based drive storage is no longer available or at least popular, but like all predictions these are subject to great change.

In general we use what is termed a “One Big Array” approach: a single RAID array on which all system and data partitions are created.  The need or desire to split our storage into multiple, physical arrays is mostly gone today and should only be acted on in non-general circumstances.  Only in situations where careful study of the storage needs and heavy analysis are being done should we look at array splitting.  Array splitting is far more likely to cause harm than good.  When in doubt, avoid split arrays.  The goal of this guide is general rules of thumb to allow any IT pro to build a safe and reliable storage system.  Rules of thumb do not and cannot cover every scenario; exceptions always exist.  But the idea here is to cover the vast majority of cases with tried and true approaches that are designed around modern equipment, use cases and needs while being mindful to err on the side of safety – when a choice is less than ideal it is still safe.  None of these choices is at all reckless; at worst they are overly conservative.

The first scenario we should consider is if your data does not matter.  This may sound like an odd thing to consider but it is a very important scenario.  There are many times where data saved to disk is considered ephemeral and does not need to be protected.  This is common for reconstructable data such as working space for rendering, intermediary calculation spaces or caches – situations where spending money to protect data is wasted and it would be acceptable to simply recreate lost data rather than protecting it.  It could also be a case where downtime is not a problem and data is static or nearly so, and rather than spending to reduce downtime we only worry about protecting the data via backup mechanisms so that if an array fails we simply restore it completely.  In these cases the obvious choice is RAID 0.  It is very fast, very simple and provides the most cost effective capacity.  The only downside of RAID 0 is that it is fragile and provides no protection against data loss in the case of drive failure or even a URE (which would cause data corruption, just as it would on an unprotected desktop drive.)

It should be noted that a common exception to the “One Big Array” approach arises in systems using RAID 0 for data.  There is a very good argument for keeping the OS and any application data that would be cumbersome to reinstall on a small, dedicated RAID 1 array, separate from the RAID 0 data array.  This way recovery can be very rapid: rather than completely rebuilding the entire system from scratch, we simply recreate the data.

Assuming that we have eliminated cases where the data does not require protection, we will assume for all remaining cases that the data is quite important and we want to protect it at some cost.  We will assume that protecting the data as it exists on the live storage is important, generally because we want to avoid downtime or because we want to ensure data integrity because the data on disk is not static and an array failure would also constitute data loss.  With this assumption we will continue.

If we have an array of only two disks the answer is very simple: we choose RAID 1.  There is no other option at this size, so there is no decision to be made.  In theory we should be planning our arrays holistically rather than after the number of drives is determined; the number of drives and the type of array should be chosen together, not drives purchased first and their use determined by that arbitrary number.  But two drive chassis are so common that this is worth mentioning as a case.

Likewise, with a four drive array the only real choice to consider is RAID 10.  There is no need for further evaluation.  Simply select RAID 10 and continue.

An awkward case is a three drive array.  It is very, very rare that we are limited to three drives as the only common chassis limited to three drives was the Apple Xserve, and this has been off of the market for some time, so the need to deal with decision making around three spindle arrays should be extremely unlikely.  In cases where we have three drives it is often best to seek guidance, but the most common approaches are to add a fourth drive and so choose RAID 10 or, if capacity of greater than a single drive’s worth is not needed, to put all three drives into a single triple-mirror RAID 1.

For all other cases, therefore, we are dealing with five to twenty five drives.  Since we have eliminated the situations where RAID 0 and RAID 1 would apply, we are left with all common scenarios coming down to RAID 6 and RAID 10, and these constitute the vast majority of cases.  Choosing between RAID 6 and RAID 10 becomes the biggest challenge that we will face as we must look solely at our “soft” needs of reliability, performance and capacity.

Choosing between RAID 6 and RAID 10 should not be incredibly difficult.  RAID 10 is ideal for situations where performance and safety are the priorities.  RAID 10 has much faster write performance and is safe regardless of disk type used (low cost consumer disks can still be extremely safe, even in large arrays.)  RAID 10 scales well to extremely large sizes, much larger than should be implemented using rules of thumb!  RAID 10 is the safest of all choices; it is fast and safe.  The obvious downsides are that RAID 10 has less storage capacity from the same disks and is more costly on the basis of capacity.  It must be mentioned that RAID 10 can only utilize an even number of disks, as disks are added in pairs.

RAID 6 is generally safe and fast but never as safe or as fast as RAID 10.  RAID 6 specifically suffers from write performance so is very poorly suited for workloads such as databases and heavily mixed loads like in large virtualization systems.  RAID 6 is cost effective and provides a heavy focus on available capacity compared to RAID 10.  When budgets are tight or capacity needs dominate over performance RAID 6 is an ideal choice.  Rarely is the difference in safety between RAID 10 and RAID 6 a concern except in very large systems with consumer class drives.  RAID 6 is subject to additional risk with consumer class drives that RAID 10 is not affected by which could warrant some concern around reliability in larger RAID 6 systems such as those above roughly 40TB when consumer drives are used.

In the small business space especially, the majority of systems will use RAID 10 simply because arrays rarely need to be larger than four drives.  When arrays are larger RAID 6 is the more common choice due to somewhat tight budgets and generally low concern around performance.  Both RAID 6 and RAID 10 are safe and effective solutions for nearly all usage scenarios with RAID 10 dominating when performance or extreme reliability are key and RAID 6 dominating when cost and capacity are key.  And, of course, when storage needs are highly unique or very large, such as larger than twenty five spindles in an array, remember to leverage a storage consultant as the scenario can easily become very complex.  Storage is one place where it pays to be extra diligent as so many things depend upon it, mistakes are so easy to make and the flexibility to change it after the fact is so low.
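To tie these rules of thumb together, here is a minimal sketch in Python of the decision process described above.  The function and its parameters are purely illustrative, not a real tool; it simply encodes this article’s guidance:

    def choose_raid_level(drives, data_matters=True, capacity_over_performance=False):
        """Rule-of-thumb RAID selection for spindle arrays of up to twenty five drives."""
        if not data_matters:
            return "RAID 0"    # ephemeral or fully reconstructable data only
        if drives == 2:
            return "RAID 1"    # the only option at this size
        if drives == 3:
            return "RAID 1 triple mirror, or add a fourth drive for RAID 10"
        if drives == 4:
            return "RAID 10"   # the only real choice at four drives
        if drives <= 25:
            # Five to twenty five drives: cost and capacity favour RAID 6,
            # performance and extreme reliability favour RAID 10.
            return "RAID 6" if capacity_over_performance else "RAID 10"
        return "consult a storage specialist"  # beyond rule-of-thumb territory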

Slow OS Drives, Fast Data Drives

Over the years I have found that people often err on the side of high performance, highly reliable storage for an operating system partition but choose slow, “cost effective” storage for critical data stores.  I am amazed by how often I find this occurring and now, with the advent of hypervisors, I see the same behaviour being repeated there as well – compounding the previously existing issues.

In many systems today we deal with only a single storage array shared by all components of the system.  In these cases we do not face the problem of misbalancing our storage system performance.  This is one of the big advantages of this approach and a major reason why it comes so highly recommended.  All performance is in a shared pool and the components that need the performance have access to it.

In many cases, whether in an attempt at increased performance or reliability design or out of technical necessity, I find that people are separating out their storage arrays and putting hypervisors and operating systems on one array and data on another.  But what I find shocking is that arrays dedicated to the hypervisor or operating system are often staggeringly large in capacity and extremely high in performance – often involving 15,000 RPM spindles or even solid state drives at great expense.  Almost always in RAID 1 (as per common standards from 1998.)

What needs to be understood here is that operating systems themselves have effectively no storage IO requirements.  There is a small amount, mostly for system logging, but that is about all that is needed.  Operating system partitions are almost completely static.  Required components are loaded into memory, mostly at boot time, and are not accessed again.  Even in cases where logging is needed, many times these logs are sent to a central logging system and not to the system storage area reducing or even removing that need as well.

With hypervisors this effect is even more extreme.  As hypervisors are far lighter and less robust than traditional operating systems they behave more like embedded systems and, in many cases, actually are embedded systems.  Hypervisors load into memory at system boot time and their media is almost never needed again while a system is running, except for logging on some occasions.  And because hypervisors are small in physical size, even the total amount of time needed to completely read a full hypervisor off of storage is very small, even on very slow media.

For these reasons, storage performance is of little to no consequence for operating systems and especially hypervisors.  The difference between fast storage and slow storage really only impacts system boot time, where the difference between one second and thirty seconds would rarely be noticed, if at all.  Who would perceive even several extra seconds during the startup of a system?  In most cases, startups are rare events, happening at most once a week during an automated, routine reboot in a planned maintenance window or, for systems that are only brought offline in emergencies, sometimes only once every several years.  Even the slowest conceivable storage system is far faster than necessary for this role.

Even slow storage is generally many times faster than is necessary for system logging activities.  In those rare cases where logging is very intense we have many choices of how to tackle this problem.  The most obvious and common solution here is to send logs to a drive array other than the one used by the operating system or hypervisor.  This is a very easy solution and ultimately very practical in cases where it is warranted.  The other common and highly useful solution is to simply refrain from keeping logs on the local device at all and send them to a remote log collection utility such as Splunk, Loggly or ELK.
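As a minimal sketch of that second approach, an application or script can ship its own logs to a remote collector over standard syslog using nothing but Python’s standard library.  The host name below is a placeholder for whatever collector (Splunk, ELK, etc.) is listening in your environment:

    import logging
    import logging.handlers

    # Forward log records to a remote syslog collector (placeholder address)
    # instead of writing them to the local OS drive.
    handler = logging.handlers.SysLogHandler(address=("loghost.example.com", 514))
    logger = logging.getLogger("app")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    logger.info("application started")  # travels over the network, not to local disk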

The other major concern that most people have around their operating systems and hypervisors is reliability.  It is common to focus more effort on protecting these relatively unimportant aspects of a system than on the often irreplaceable data.  However, operating systems and hypervisors are easily rebuilt from scratch using fresh installs and manual reconfiguration when necessary.  The details which could be lost are generally relatively trivial to recreate.

This does not mean that these system filesystems should not be backed up; of course they should (in most cases.)  But even if the backups fail as well, the loss of an OS partition or filesystem is rarely a tragedy, only an inconvenience.  There are ways to recover in nearly all cases without access to the original data, as long as the “data” filesystem is separate.  And because of the nature of operating systems and hypervisors, change is rare, so backups can generally be less frequent, possibly triggered manually only when updates are applied!

With many modern systems in the DevOps and Cloud computing spaces it has become very common to view operating systems and hypervisor filesystems as completely disposable since they are defined remotely via a system image or by a configuration management system.  In these cases, which are becoming more and more common, there is no need for data protection or backups as the entire system is designed to be recreated, nearly instantly, without any special interaction.  The system is entirely self-replicating.  This further trivializes the need for system filesystem protection.

Taken together, the lack of need around performance and the lack of need around protection and reliability (handled primarily through simple recreation) leave us with a system filesystem whose needs are very different from what we commonly assume.  This does not mean that we should be reckless with our storage; we still want to avoid storage failure while a system is running, and rebuilding unnecessarily is a waste of time and resources even if it does not prove to be disastrous.  So striking a careful balance is important.

It is, of course, for these reasons that including the operating system or hypervisor on the same storage array as data is now common practice.  Because there is little to no need for storage access to the system files at the same time that there is access to the data files, we get great synergy: fast boot times for the OS and no adverse impact on data access times once the system is online.  This is the primary means by which system designers today tackle the need for efficient use of storage.

When the operating system or hypervisor must be separated from the arrays holding data, which can still happen for myriad reasons, we generally seek to obtain reasonable reliability at low cost.  When using traditional storage (local disks) this means using small, slow, low cost spinning drives for operating system storage, generally in a simple RAID 1 configuration.  A real world example is the use of 5400 RPM “eco-friendly” SATA drives in the smallest sizes possible.  These draw little power and are very inexpensive to acquire.  SSDs and high speed SAS drives would be avoided as they carry a premium for protection that is irrelevant and performance that is completely wasted.

In less traditional storage it is common to use a low cost, high density SAN consolidating the low priority storage for many systems onto shared, slow arrays that are not replicated. This is only effective in larger environments that can justify the additional architectural design and can achieve enough density in the storage consolidation process to create the necessary cost savings but in larger environments this is relatively easy.  SAN boot devices can leverage very low cost arrays across many servers for cost savings.  In the virtual space this could mean a low performance datastore used for OS virtual disks and another, high performance pool, for data virtual disks.  This would have the same effect as the boot SAN strategy but in a more modern setting and could easily leverage the SAN architecture under the hood to accomplish it.

Finally, and most dramatically, it is a general rule of thumb with hypervisors to install them to SD cards or USB thumb drives rather than to traditional storage as their performance and reliability needs are even lower than those of traditional operating systems.  Normally if a drive of this nature were to fail while a system was running, the system would actually keep running without any problem as the drive is never used once the system has booted.  Only during a reboot would the issue be found and, at that time, a backup boot device could be used, such as a secondary SD card or USB stick.  This is the official recommendation for VMware vSphere, is often recommended by Microsoft representatives for Hyper-V and officially supported through Hyper-V’s OEM vendors, and is often recommended, though not as broadly supported, for Xen, XenServer and KVM systems.  Using SD cards or USB drives for hypervisor storage effectively turns a virtualization server into an embedded system.  While this may feel unnatural to system administrators who are used to thinking of traditional disks as a necessity for servers, it is important to remember that enterprise class, highly critical systems like routers and switches last decades and use this exact same strategy for the exact same reasons.

A common strategy for hypervisors in this embedded style mode with SD cards or USB drives is to have two such devices, which may actually be one SD card and one USB drive, each with a copy of the hypervisor.  If one device fails, booting to the second device is nearly as effective as a traditional RAID 1 system.  But unlike most traditional RAID 1 setups, we also have a relatively easy means of testing system updates by only updating one boot device at a time and testing the process before updating the second boot device leaving us with a reliable, well tested fall back in case a version update goes awry.  This process was actually common on large UNIX RISC systems where boot devices were often local software RAID 1 sets that supported a similar practice, especially common in AIX and Solaris circles.

It should also be noted that while this approach is the best practice for most hypervisor scenarios, there is actually no reason why it cannot be applied to full operating system filesystems too, except that it is often more work.  Some OSes, especially Linux and BSD, are very adept at being installed in an embedded fashion and can easily be adapted for installation on an SD card or USB drive with a little planning.  This approach is not at all common, but there is no technical reason why, in the right circumstances, it would not be excellent, other than the fact that an OS should almost never be installed to physical hardware rather than on top of a hypervisor.  In those cases where physical installs are necessary, this approach is extremely valid.

When designing and planning for storage systems, remember to be mindful as to what read and write patterns will really look like when a system is running.  And remember that storage has changed rather dramatically since many traditional guidelines were developed and not all of the knowledge used to develop them still applies today or applies equally.  Think about not only which storage subsystems will attempt to use storage performance but also how they will interact with each other (for example, will two subsystems never request storage access at the same time or will they conflict regularly) and whether or not their access performance is important.  General operating system functions can be exceedingly slow on a database server without negative impact; all that matters is the speed at which the database can be accessed.  Even access to application binaries is often irrelevant as they too, once loaded into memory, remain there, and only memory speed impacts ongoing performance.

None of this is meant to suggest that separating OS and data storage subsystems from each other is advised; it often is not.  I have written in the past about how consolidating these subsystems is quite frequently the best course of action and that remains true now.  But there are also many reasonable cases where splitting certain storage needs from each other makes sense, often when dealing with large scale systems where we can lower cost by dedicating high cost storage to certain needs and low cost storage to others.  It is in those cases that I want to demonstrate that operating systems and hypervisors should be considered the lowest priority in terms of both performance and reliability, except in the most extreme cases.

Practical RAID Performance

Choosing a RAID level is an exercise in balancing many factors including cost, reliability, capacity and, of course, performance.  RAID performance can be difficult to understand especially as different RAID levels use different techniques and behave rather differently from each other in some cases.  In this article I want to explore the common RAID levels of RAID 0, 5, 6 and 10 to see how performance differs between them.

For the purposes of this article, RAID 1 will be assumed to be a subset of RAID 10.   This is often a handy way to think of RAID 1 – as simply being a RAID 10 array with only a single mirrored pair member.  As RAID 1 is truly a single pair RAID 10 and behaves as such this works wonderfully for making RAID performance easy to understand as it simply maps into the RAID 10 performance curve.

There are two types of performance to look at with all storage: reading and writing.  In terms of RAID, reading is extremely easy to understand and writing is rather complex.  Read performance is effectively stable across all RAID types.  Writing, however, is not.

To make discussing performance easier we need to define a few terms as we will be working with some equations. In our discussions we will use N to represent the total number of drives, often referred to as spindles, in our array and we will use X to refer to the performance of each drive individually.  This allows us to talk in terms of relative performance as a factor of the drive performance allowing us to abstract away the RAID array and not have to think in terms of raw IOPS.  This is important as IOPS are often very hard to define but we can compare performance in a meaningful way by speaking to it in relationship to the individual drives within the array.

It is also important to remember that we are only talking about the performance of the RAID array itself, not an entire storage subsystem.  Artifacts such as memory caches and solid state caches will do amazing things to alter the overall performance of a storage subsystem, but do not fundamentally change the performance of the RAID array itself under the hood.  There is no simple formula for determining how different cache options will impact the overall performance but suffice it to say that it can be very dramatic but this depends heavily not only on the cache choices themselves but also heavily upon workload. Even the biggest, fastest, most robust cache options cannot change the long term, sustained performance of an array.

RAID is complex and many factors influence the final performance.  One is the implementation of the RAID system itself.  A poor implementation might cause latency or may fail to take advantage of the available spindles (such as having a RAID 1 array read only from a single disk instead of from both simultaneously!)  There is no easy way to account for deficiencies in specific RAID implementations so we must assume that all are working to the limits of the specification as, indeed, any enterprise RAID system will do. It is primarily hobby and consumer RAID systems that fail to do this.

Some types of RAID also have dramatic amounts of computational overhead associated with them while others do not.  Primarily parity RAID levels require heavy processing in order to handle write operations with different levels having different amounts of computation necessary for each operation.  This introduces latency, but does not curtail throughput.  This latency will vary, however, based on the implementation of the RAID level as well as on the processing capability of the system in question.  Hardware RAID will use something like a general purpose CPU (often a Power or ARM RISC processor) or a custom ASIC to handle this while software RAID hands this off to the server’s own CPU.  Often the server CPU is actually faster here but consumes system resources.  ASICs can be very fast but are expensive to produce.  This latency impacts storage performance but is very difficult to predict and can vary from nominal to dramatic.  So I will mention the relative latency impact with each RAID level but will not attempt to measure it.  In most RAID performance calculations, this latency is ignored but it is important to understand that it is present and could, depending on the configuration of the array, have a noticeable impact on a workload.

There is, it should be mentioned, a tiny performance impact to read operations due to inefficiencies in the layout of data on the disk itself.  Parity RAID requires there to be data on the disks that is useless during a healthy read operation and cannot be used to speed it up.  This actually results in reads being slightly slower.  But this impact is ridiculously small, is normally not measured and so can be ignored.

Factors such as stripe size also impact performance, of course, but as that is configurable and not an intrinsic artifact in any RAID level I will ignore it here.  It is not a factor when choosing a RAID level itself but only in configuring one once chosen.

The final factor that I want to mention is the read to write ratio of storage operations.  Some RAID arrays will be used almost purely for read operations, some almost solely for write operations but most use a blend of the two, likely something like eighty percent read and twenty percent write.  This ratio is very important in understanding the performance that you will get from your specific RAID array and understanding how each RAID level will impact you.  I refer to this as the read/write blend.

We measure storage performance primarily in IOPS.  IOPS stands for Input/Output Operations Per Second (yes, I know that the letters don’t line up well, it is what it is.)  I further use the terms RIOPS for Read IOPS, WIOPS for Write IOPS and BIOPS for Blended IOPS, which would come with a ratio such as 80/20.  Many people talk about storage performance with a single IOPS number.  When this is done they normally mean Blended IOPS at 50/50.  However, rarely does any workload run at 50/50, so that number can be extremely misleading.  Two numbers, RIOPS and WIOPS, are what is needed to understand performance, and these two together can be used to find any IOPS blend that is needed.  For example, a 50/50 blend is as simple as (RIOPS * .5) + (WIOPS * .5).  The more common 80/20 blend would be (RIOPS * .8) + (WIOPS * .2).

Now that we have established some criteria and background understanding we will delve into our RAID levels themselves and see how performance varies across them.

For all RAID levels, the Read IOPS number is calculated using NX.  This does not address the nominal overhead numbers that I mention above, of course.  This is a “best case” number, but the real world number is so close that it is very practical to simply use this formula.  Simply take the number of spindles (N) and multiply by the IOPS performance of an individual drive (X).  Keep in mind that drives often have different read and write performance, so be sure to use the drive’s Read IOPS rating or tested speed for the Read IOPS calculation and the Write IOPS rating or tested speed for the Write IOPS calculation.
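As a quick illustration, here is that read formula and the blending math from above expressed in Python.  The helper names are mine, purely for illustration:

    def read_iops(n, x):
        """Read IOPS for any standard RAID level: N spindles times X Read IOPS per drive."""
        return n * x

    def blended_iops(riops, wiops, read_fraction=0.8):
        """Blend read and write IOPS; read_fraction=0.8 gives the common 80/20 mix."""
        return riops * read_fraction + wiops * (1 - read_fraction)

    print(read_iops(8, 125))             # 1000 Read IOPS from eight 125 IOPS spindles
    print(blended_iops(1000, 250, 0.5))  # 625.0, a 50/50 blend of those numbers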

RAID 0

RAID 0 is the easiest RAID level to understand because there is effectively no overhead to worry about, no resources consumed to power it and both read and write get the full benefit of every spindle, all of the time.  So for RAID 0 our formula for write performance is very simple: NX.  RAID 0 is always the most performant RAID level.

An example would be an eight spindle RAID 0 array.  If an individual drive in the array delivers 125 IOPS then our calculation would be from N = 8 and X = 125 so 8 * 125 yielding 1,000 IOPS.  Since both read and write IOPS are the same here, it is extremely simple as we get 1K RIOPS, 1K WIOPS and 1K with any blending thereof.  Very simple.  If we didn’t know the absolute IOPS of an individual spindle we could refer to an eight spindle RAID 0 as delivering 8X Blended IOPS.

RAID 10

RAID 10 is the second simplest RAID level to calculate.  Because RAID 10 is a RAID 0 stripe of mirror sets, we have no overhead to worry about from the stripe but each mirror has to write the same data twice in order to create the mirroring.  This cuts our write performance in half compared to a RAID 0 array of the same number of drives, giving us a write performance formula of simply: NX/2 or .5NX.

It should be noted that at the same capacity, rather than the same number of spindles, RAID 10 has the same write performance as RAID 0 but double the read performance – simply because it requires twice as many spindles to match the same capacity.

So an eight spindle RAID 10 array would be N = 8 and X = 125 and our resulting calculation comes out to be (8 * 125)/2 which is 500 WIOPS or 4X WIOPS.  A 50/50 blend would result in 750 Blended IOPS (1,000 Read IOPS and 500 Write IOPS.)

This formula applies to RAID 1, RAID 10, RAID 100 and RAID 01 equally.

Uncommon options such as triple mirroring in RAID 10 would alter this write penalty.  RAID 10 with triple mirroring would be NX/3, for example.
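In code form, the mirroring penalty is just a divisor equal to the mirror depth.  A minimal sketch (the function name is illustrative):

    def mirror_write_iops(n, x, mirror_depth=2):
        """Write IOPS for mirrored RAID (1/10/100/01): every write lands on each mirror member."""
        return n * x / mirror_depth

    print(mirror_write_iops(8, 125))     # 500.0 WIOPS, the eight spindle RAID 10 example
    print(mirror_write_iops(9, 125, 3))  # 375.0 WIOPS for a nine drive triple mirrored RAID 10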

RAID 5

While RAID 5 is deprecated and should never be used in new arrays I include it here because it is a well known and commonly used RAID level and its performance needs to be understood.  RAID 5 is the most basic of the modern parity RAID levels.  RAID 2, 3 & 4 are no longer found in production systems and so we will not look into their performance here.  RAID 5, while not recommended for use today, is the foundation of other modern parity RAID levels so is important to understand.

Parity RAID adds a somewhat complicated need to verify and re-write parity with every write that goes to disk.  This means that a RAID 5 array will have to read the data, read the parity, write the data and finally write the parity.  Four operations for each effective one.  This gives us a write penalty on RAID 5 of four.  So the formula for RAID 5 write performance is NX/4.

So following the eight spindle example where the write IOPS of an individual spindle is 125 we would get the following calculation: (8 * 125)/4 or 2X Write IOPS which comes to 250 WIOPS.  In a 50/50 blend this would result in 625 Blended IOPS.

RAID 6

RAID 6, after RAID 10, is probably the most common and useful RAID level in use today.  RAID 6, however, is based off of RAID 5 and has another level of parity.  This makes it dramatically safer than RAID 5, which is very important, but also imposes a dramatic write penalty as each write operation requires the disks to read the data, read the first parity, read the second parity, write the data, write the first parity and then finally write the second parity.  This comes out to be a six times write penalty, which is pretty dramatic.  So our formula is NX/6.

Continuing our example we get (8 * 125)/6 which comes out to ~167 Write IOPS or 1.33X.  In our 50/50 blend example this is a performance of 583.5 Blended IOPS.  As you can see, parity writes cause a very rapid decrease in write performance and a noticeable drop in blended performance.

RAID 7 (aka RAID 5.3 or RAID 7.3)

RAID 7 is a somewhat non-standard RAID level with triple parity based off of the existing single parity of RAID 5 and the existing double parity of RAID 6.  The only current implementation of RAID 7 is ZFS’ RAIDZ3.  Because RAID 7 contains all of the overhead of both RAID 5 and RAID 6 plus the additional overhead of the third parity component we have a write penalty of a staggering eight times.  So our formula for finding RAID 7 write performance is NX/8.

In our example this would mean that (8 * 125)/8 would come out to 125 Write IOPS or 1X.  So with eight drives in our array we would get only the write performance of a single, stand alone drive.  That is significant overhead.  Our blended 50/50 IOPS would come out to only 562.5.
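Since the parity levels differ only in their write penalty, a single function covers RAID 5, 6 and 7 (and, with penalties of one and two, RAID 0 and RAID 10 as well.)  This is just a sketch of the formulas above:

    # Write penalty per RAID level, straight from the formulas above.
    WRITE_PENALTY = {0: 1, 10: 2, 5: 4, 6: 6, 7: 8}

    def write_iops(n, x, level):
        """Write IOPS for an array of N spindles at X Write IOPS each."""
        return n * x / WRITE_PENALTY[level]

    for level in (5, 6, 7):
        print(level, write_iops(8, 125, level))  # 250.0, ~166.7, 125.0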

Complex RAID

Complex RAID levels or Nested RAID levels such as RAID 50, 60, 61, 16, etc. can be found using the information above and breaking the RAID down into its components and applying each using the formulæ provided above.  There is no simple formula for these levels because they have varying configurations.  It is necessary to break them down into their components and apply the formulæ multiple times.

RAID 60 with twelve drives, two sets of six drives, where each drive is 150 IOPS would be done with two RAID 6s.  It would be the NX of RAID 0 where N is two (for two RAID 6 arrays) and the X is the resultant performance of each RAID 6.   Each RAID 6 set would be (6 * 150)/6.  So the full array would be 2((6 * 150)/6).  Which results in 300 Write IOPS.

The same example as above but configured as RAID 61, a mirrored pair of RAID 6 arrays, would be the same performance per RAID 6 array, but applied to the RAID 1 formula which is NX/2 (where X is the resultant performance of the each RAID array.)  So the final formula would be 2((6 * 150)/6)/2 coming to 150 Write IOPS from twelve drives.
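In code, breaking a nested level into its components simply means applying the same math twice.  A self-contained sketch of the two twelve drive examples:

    # Each six drive RAID 6 leg: (6 * 150) / 6 per the RAID 6 formula above.
    leg_wiops = (6 * 150) / 6           # 150.0 WIOPS per leg

    raid60_wiops = 2 * leg_wiops        # RAID 0 stripe across two legs: 300.0 WIOPS
    raid61_wiops = 2 * leg_wiops / 2    # RAID 1 mirror of the two legs: 150.0 WIOPS
    print(raid60_wiops, raid61_wiops)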

Performance as a Factor of Capacity

When we are producing RAID performance formulæ we think of these in terms of the number of spindles which is incredibly sensible.  This is very useful in determining the performance of a proposed array or even an existing one where measurement is not possible and allows us to compare the relative performance between different proposed options.  It is in these terms that we universally think of RAID performance.

This is not always a good approach, however, because typically we look at RAID as a factor of capacity rather than of performance or spindle count.  It would be very rare, but certainly possible, that someone would consider an eight drive RAID 6 array versus an eight drive RAID 10 array.  Once in a while this will occur due to a chassis limitation or some other, similar reason.  But typically RAID arrays are viewed from the standpoint of total array capacity (e.g. usable capacity) rather than spindle count, performance or any other factor.  It is odd, therefore, that we should then switch to viewing RAID performance as a function of spindle count.

If we change our viewpoint and pivot upon capacity as the common factor, while still assuming that individual drive capacity and performance (X) remains constant between comparators then we arrive at a completely different landscape of performance.  In doing this we see, for example, that RAID 0 is no longer the most performant RAID level and that read performance varies dramatically instead of being a constant.

Capacity is a fickle thing but we can distill it down to the number of spindles necessary to reach the desired capacity.  This makes the discussion far easier.  So our first step is to determine our spindle count needed for raw capacity.  If we need a capacity of 10TB and are using 1TB drives, we would need ten spindles, for example.  Or if we need 3.2TB and are using 600GB drives we would need six spindles.  Different than before, we will refer to our spindle count as R.  As before, performance of the individual drive is represented as X.  (R is used here to denote that this is the Raw Capacity Count, rather than the total Number of spindles.)

RAID 0 remains simple: performance is still RX as there are no additional drives.  Both read and write IOPS are simply RX.

RAID 10 has RX Write IOPS but 2RX Read IOPS.  This is dramatic.  Suddenly when viewing performance as a factor of stable capacity we find that RAID 10 has double read performance over RAID 0!

RAID 5 gets slightly trickier.  Write IOPS would be expressed as ((R + 1) * X)/4.  The Read IOPS are expressed as ((R + 1) * X).

RAID 6, as we expect, follows the pattern that RAID 5 projects.  Write IOPS for RAID 6 are ((R + 2) * X)/6.  And the Read IOPS are expressed as ((R + 2) * X).

RAID 7 falls right in line.  RAID 7 Write IOPS would be ((R + 3) * X)/8.  And the Read IOPS are ((R + 3) * X).

This vantage point changes the way that we think about performance and, when looking purely at read performance, RAID 0 becomes the slowest RAID level rather than the fastest and RAID 10 becomes the fastest for both read and write no matter what the values are for R and X!

If we take a real world example of 10 2TB drives to achieve 20TB of usable capacity with each drive having 100 IOPS of performance and assume a 50/50 blend, the resultant IOPS would be:  RAID 0 with 1,000 Blended IOPS, RAID 10 with 1,500 Blended IOPS (2,000 RIOPS / 1,000 WIOPS), RAID 5 with 687.5 Blended IOPS (1,100 RIOPS / 275 WIOPS), RAID 6 with 700 Blended IOPS (1,200 RIOPS / 200 WIOPS) and finally RAID 7 with 731.25 Blended IOPS (1,300 RIOPS / 162.5 WIOPS.)  RAID 10 is a dramatic winner here.
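Scripting this pivot makes the comparison easy to reproduce.  Here is a sketch that derives each level’s spindle count from R using the formulas above and prints the 50/50 blends for that 20TB example (the function name is illustrative):

    def capacity_view(r, x, read_fraction=0.5):
        """Blended IOPS per RAID level at a fixed usable capacity of R spindles' worth."""
        levels = {
            "RAID 0":  (r * x,       r * x),
            "RAID 10": (2 * r * x,   r * x),            # needs 2R spindles
            "RAID 5":  ((r + 1) * x, (r + 1) * x / 4),  # needs R+1 spindles
            "RAID 6":  ((r + 2) * x, (r + 2) * x / 6),  # needs R+2 spindles
            "RAID 7":  ((r + 3) * x, (r + 3) * x / 8),  # needs R+3 spindles
        }
        for name, (riops, wiops) in levels.items():
            blend = riops * read_fraction + wiops * (1 - read_fraction)
            print(f"{name}: {blend} blended ({riops} RIOPS / {wiops} WIOPS)")

    capacity_view(10, 100)  # ten spindles of raw capacity at 100 IOPS each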

Latency and System Impact with Software RAID

As I have stated earlier, RAID 0 and RAID 10 have, effectively, no system overhead to consider.  The mirroring operation requires essentially no computational effort and is, for all intents and purposes, immeasurably small.  Parity RAID does have computational overhead and this results in latency at the storage layer and system resources being consumed.  Of course, if we are using hardware RAID those resources are dedicated to the RAID array and have no function but to be consumed in this role.  If we are using software RAID, however, these are general purpose system resources (primarily CPU) that are consumed for the purposes of the RAID array processing.

The impact to even a very small system with a large amount of RAID is still very small but it can be measured and should be considered, if only lightly.  Latency and system impact are directly related to one another.

There is no simple way to state latency and system impact for different RAID levels except in this way: RAID 0 and RAID 10 have effectively no latency or impact, RAID 5 has some latency and impact, RAID 6 has roughly twice as much computational latency and impact as RAID 5 and RAID 7 has roughly triple the computational latency and impact as RAID 5.

In many cases this latency and system impact will be so small that they cannot be measured with standard system tools and as modern processors become increasingly powerful the latency and system impact will continue to diminish.  Impact has been considered negligible for RAID 5 and RAID 6 systems on even low end, commodity hardware since approximately 2001.  But it is possible on heavily loaded systems with a large amount of parity RAID activity that there could be contention between the RAID subsystem and other processes requiring system resources.

Reference: The IT Hollow – Understanding the RAID Penalty

Article originally posted to the StorageCraft Blog – RAID Performance.