Category Archives: Storage

What is RAID 100?

RAID 10 is one of the most important and commonly used RAID levels today. RAID 10 is, of course, what is known as compound or nested RAID, where one RAID level is nested within another. In the case of RAID 10, the “lowest” level of RAID, the one touching the physical drives, is RAID 1. The nomenclature of nested RAID is that the number to the left is the one touching the physical drives and each number to the right is the RAID that touches those arrays.

So RAID 10 is a number of RAID 1 (mirror) sets that are in a RAID 0 (non-parity stripe) set together. There is a certain common terminology sometimes applied, principally championed by HP, to refer to even RAID 1 as simply being a subset of RAID 10 – a RAID 10 array where the RAID 0 length is one. A quirky way to think of RAID 1, to be sure, but it actually makes many discussions and comparative calculations easier and makes sense in a practical way for most storage practitioners. Thinking of RAID 1 as a “special name” for the smallest possible RAID 10 stripe size and allowing, then, all RAID 10 permutations to exist as a calculation continuum makes sense.

Likewise, HP also refers to solitary drives attached to a RAID controller as RAID 0 sets with a stripe length of one. The application of that terminology to the RAID 10 world is actually more obvious and sensible when looked at in that light. However, neither HP nor any other vendor today applies this same naming oddity to other array types, such as RAID 5 being a subset of RAID 50 or RAID 6 being a subset of RAID 60, even though they can be thought of that way exactly as RAID 1 can be to RAID 10.

If we take that same logic and take it to the next level, figuratively and literally, we can take multiple RAID 10 arrays and stripe them together in another RAID 0. This seems odd but can make sense. The result is a stripe of RAID 10s or, to write it out, a stripe of stripes of mirrors (we generally state RAID from the top down but the nomenclature is from the bottom up.) So as this is RAID 1 on the physical drives, a stripe of those mirrors and then a stripe of those resultant arrays we get RAID 100 (R100.)

RAID 100 is, of course, rare and odd. However one extremely important RAID controller manufacturer utilizes R100 and, consequently, so does their downstream integration vendor: namely LSI and Dell.

Fortunately, because non-parity stripes introduce few behavioral oddities and have near-zero overhead and latency, this approach is really not a problem, although it can lead to a great deal of confusion. For all intents and purposes, RAID 100 behaves exactly like RAID 10 when each RAID 10 subset is identical to the others.

In theory, a RAID 100 could be made up of many disparate RAID 10 sets of varying drive types, spindle counts and speeds. In theory a RAID 10 could be made up of disparate RAID 1 sets, but this is far more limited in potential or likely variation. RAID 100 could, theoretically, do some pretty bizarre things if left unchecked. In practice, though, any RAID 100 implementation will likely, as does LSI’s, enforce standardization and require that each RAID 10 subset be as identical as the controller is capable of enforcing. So each will be effectively uniform, keeping the overall behavior the same as if the same drives were set up as RAID 10.

Because the behavior remains identical to RAID 10, there is an extremely strong tendency to avoid the confusion of calling the array RAID 100 and to simply refer to it as RAID 10. This would work fine except for the semi-necessary quirk of needing to specify the geometry of the underlying RAID 10 sets when building a RAID 100. LSI, and therefore Dell, requires that at the time of setting up a RAID 100 set you specify the underlying RAID 10 geometry, but since the array is labeled as RAID 10, this makes no sense. A bizarre situation indeed.

To further complicate matters, because of the desire to maintain a façade of using RAID 10 rather than RAID 100, proper terminology is eschewed and, instead of referring to the underlying RAID 10 members as “RAID 10 arrays” or “RAID 10 subsets,” they are simply called “spans.” Span, however, is a term used for something else in storage that does not apply properly here. Span is, in no way, a proper description for a RAID 10 set under any condition.

But if we agree to use the term span to refer to a RAID 10 subset of a RAID 100 array, we can move forward pretty easily. Whenever possible, then, we want as many spans as possible to keep the underlying RAID 10 subsets as small as possible. If we make them small enough they actually collapse into RAID 1 sets (HP’s odd RAID 10 with a stripe size of one) and our RAID 100 collapses into a RAID 10 with the middle stripe, rather than the outside stripe, being the one that disappears! Bizarre, yes, but practical.

So how do we apply this in real life? Quite easily. In a RAID 100 array we must specify a count of spans to be used. Since we desire that each span contain two physical drives so that each span is a simple RAID 1, we simply need to take the total number of drives in our RAID 100 array, which we will call N, and divide that by two. So the desired span count for a normal RAID 100 array is simply N/2. This means if you have a two drive array, you want one span. Four drives, two spans. Six drives, three spans. Twenty four drives, twelve spans. And so on.
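The span-count rule above is trivial to express in code. A minimal sketch (the function name is my own, purely illustrative):

```python
def raid100_span_count(total_drives: int) -> int:
    """Span count that makes every span a simple two-drive RAID 1 (N/2)."""
    if total_drives < 2 or total_drives % 2 != 0:
        raise ValueError("RAID 100 of two-drive mirrors needs an even drive count")
    return total_drives // 2

# The examples from the text:
for n in (2, 4, 6, 24):
    print(n, "drives ->", raid100_span_count(n), "span(s)")
```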

Do not be afraid of RAID 100. For normal users it simply requires some additional knowledge of how to select the proper number of spans. It would be ideal if this were calculated automatically and kept hidden, allowing end users to think of the arrays in terms of RAID 10; or else the arrays could be labeled consistently as RAID 100 to make it clear what the span must represent; or, of course, RAID 10 could simply be used instead of RAID 100. But given the practical state of reality, dealing with RAID 100, once it is understood, is easy.

Comparing RAID 10 and RAID 01

These two RAID levels often bring about a tremendous amount of confusion, partially because they are incorrectly used interchangeably and often simply because they are poorly understood.

First, it should be pointed out that either may be written with or without the plus sign: RAID 10 is RAID 1+0 and RAID 01 is RAID 0+1. Strangely, RAID 10 is almost never written with the plus and RAID 01 is almost never written without it. Storage engineers generally agree that the plus should not be used as it is superfluous.

Both of these RAID levels are “compound” levels made from two different, simple RAID types being combined. Both are mirror-based, non-parity compound or nested RAID. Both have essentially identical performance characteristics – nominal overhead and latency with NX read speed and (NX)/2 write speed where N is the number of drives in the array and X is the performance of an individual drive in the array.
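The performance model stated above can be sketched directly; the function names are my own, and X is treated as a single drive's throughput in arbitrary units:

```python
def mirror_stripe_read(n_drives: int, drive_speed: float) -> float:
    """Aggregate read throughput: NX (every spindle serves reads)."""
    return n_drives * drive_speed

def mirror_stripe_write(n_drives: int, drive_speed: float) -> float:
    """Aggregate write throughput: (NX)/2 (every write lands on two drives)."""
    return n_drives * drive_speed / 2

# An eight drive array of 150 MB/s drives, whether RAID 10 or RAID 01:
print(mirror_stripe_read(8, 150.0))   # 1200.0
print(mirror_stripe_write(8, 150.0))  # 600.0
```

Note that the model is identical for both levels; the difference between them is risk, not speed.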

What sets the two RAID levels apart is how they handle disk failure. The quick overview is that RAID 10 is extremely safe under nearly all reasonable scenarios. RAID 01, however, rapidly becomes quite risky as the size of the array increases.

In a RAID 10, the loss of any single drive results in the degradation of a single RAID 1 set inside of the RAID 0 stripe. The stripe level sees no degradation, only the one singular RAID 1 mirror does. All other mirrors are unaffected. This means that our only increased risk is that the one single drive is now running without redundancy and has no protection. All other mirrored sets still retain full protection. So our exposure is a single, unprotected drive – much like you would expect in a desktop machine.

Array repair in a degraded RAID 10 is the fastest possible repair scenario. Upon replacing a failed drive, all that happens is that that single mirror is rebuilt – which is a simple copy operation that happens at the RAID 1 level, beneath the RAID 0 stripe. This means that if the overall array is idle the mirroring process can proceed at full speed and the overall array has no idea that this is even happening. A disk to disk mirror is extremely fast, efficient and reliable. This is an ideal recovery scenario. Even if multiple mirrors have degradation simultaneously and are repairing simultaneously there is no additional impact as the rebuilding of one does not impact others. RAID 10 risk and repair impact both scale extremely well.

RAID 01, on the other hand, when it loses a single drive immediately loses an entire RAID 0 stripe. In a typical RAID 01 mirror there are two RAID 0 stripes. This means that half of the entire array has failed. If we are talking about an eight drive RAID 01 array, the failure of a single drive renders four drives instantly inoperable and effectively failed (hardware does not need to be replaced but the data on the drives is out of date and must be rebuilt to be useful.) So from a risk perspective, we can look at it as being a failure of the entire stripe.

What is left after a single disk has failed is nothing but a single, unprotected RAID 0 stripe. This is far more dangerous than the equivalent RAID 10 failure because instead of there being only a single, isolated hard drive at risk there is now a minimum of two disks and potentially many more at risk and each drive exposed to this risk magnifies the risk considerably.

As an example, in the smallest possible RAID 10 or 01 array we have four drives. In RAID 10 if one drive fails, our risk is that its matching partner also fails before we rebuild the array. We are only worried about that one drive, all other drives in the RAID 10 set are still protected and safe. Only this one is of concern. In a RAID 01, when the first drive fails its partner in its RAID 0 set is instantly useless and effectively failed as it is no longer operable in the array. What remains are two drives with no protection running nothing but RAID 0 and so we have the same risk that RAID 10 did, twice. Each drive has the same risk that the one drive did before. This makes our risk, in the best case scenario, much higher.

But for a more dramatic example let us look at a large twenty-four drive RAID 10 and RAID 01 array. Again with RAID 10, if one drive fails all others, except for its one partner, are still protected. The extra size of the array added almost zero additional risk. We still only fear for the failure of that one solitary drive. Contrast that with RAID 01, which would have one of its RAID 0 arrays fail, taking twelve disks out at once with the failure of one and leaving the other twelve disks in a RAID 0 without any form of protection. The chance of one of twelve drives failing is significantly higher than the chance of a single drive failing, obviously.

This is not the entire picture. The recovery of the single RAID 10 disk is fast; it is a straight copy operation from one drive to the other. It uses minimal resources and takes only as long as is required for a single drive to read and to write itself in its entirety. RAID 01 is not as lucky. Unlike RAID 10, which rebuilds only a small subset of the entire array (a subset that does not grow as the array grows – the time to recover a four drive RAID 10 or a forty drive RAID 10 after failure is identical), RAID 01 must rebuild an entire half of the parent array. In the case of the four drive array this is double the rebuild work of the RAID 10, but in the case of the twenty four drive array it is twelve times the rebuild work to be done. So RAID 01 rebuilds take longer to perform while being under significantly more risk during that time.
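The contrast above can be summarized in a rough sketch (my own illustrative functions, assuming two-way mirrors and, for RAID 01, two stripes): after one drive fails, RAID 10 exposes exactly one drive and rebuilds one drive, while RAID 01 exposes and must rebuild half the array.

```python
def raid10_after_first_failure(n_drives: int) -> dict:
    # Only the failed drive's mirror partner is unprotected; the rebuild is a
    # single disk-to-disk copy, regardless of how large the array is.
    return {"drives_at_risk": 1, "drives_to_rebuild": 1}

def raid01_after_first_failure(n_drives: int) -> dict:
    # The failed drive's whole stripe is effectively gone; the surviving
    # stripe (n/2 drives) runs as bare RAID 0, and the entire failed stripe
    # must be rebuilt.
    return {"drives_at_risk": n_drives // 2, "drives_to_rebuild": n_drives // 2}

print(raid10_after_first_failure(24))  # {'drives_at_risk': 1, 'drives_to_rebuild': 1}
print(raid01_after_first_failure(24))  # {'drives_at_risk': 12, 'drives_to_rebuild': 12}
```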

There is a rather persistent myth that RAID 01 and RAID 10 have different performance characteristics, but they do not. Both use plain striping and mirroring, which are effectively zero overhead operations that require almost no processing. Both get full read performance from every disk device attached to them and each loses half of its write performance to the mirroring operation (assuming two way mirrors, which is the only common use of either array type.) There is simply nothing to make RAID 01 or RAID 10 any faster or slower than the other. Both are extremely fast.

Because of the characteristics of the two array types, it is clear that RAID 10 is the only type, of the two, that should ever exist within a single array controller. RAID 01 is unnecessarily dangerous and carries no advantages. They use the same capacity overhead, they have the same performance, they cost the same to implement, but RAID 10 is significantly more reliable.

So why does RAID 01 even exist? Partially it exists out of ignorance or confusion. Many people, implementing their own compound RAID arrays, choose RAID 01 because they have heard the myth that it is faster and, as is generally the case with RAID, do not investigate why it would be faster and forget to look into its reliability and other factors. RAID 01 is truly only implemented on local arrays by mistake.

However, when we take RAID to the network layer, there are new factors to consider and RAID 01 can become important, as can its rare cousin RAID 61. We denote, via Network RAID Notation, where the local and where the network layers of the RAID exist. So in this case we mean RAID 0(1) or RAID 6(1). The parentheses denote that the RAID 1 mirror, the “highest” portion of the RAID stack, is over a network connection and not on the local RAID controller.

How would this look in RAID 0(1)? If you have two servers, each with a standard RAID 0 array and you want them to be synchronized together to act as a single, reliable array you could use a technology such as DRBD (on Linux) or HAST (on FreeBSD) to create a network RAID 1 array out of the local storage on each server. Obviously this has a lot of performance overhead as the RAID 1 array must be kept in sync over the high latency, low bandwidth LAN connection. RAID 0(1) is the notation for this setup. If each local RAID 0 array was replaced with a more reliable RAID 6 we would write the whole setup as RAID 6(1).
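A minimal sketch of what the RAID 0(1) setup described above might look like as a DRBD resource definition on Linux. This is illustrative only: the resource name, node names, backing device and addresses are all hypothetical, and a real deployment requires further options and matching configuration on both nodes.

```
# /etc/drbd.d/r0.res -- illustrative fragment; all names and addresses made up
resource r0 {
    protocol  C;              # synchronous replication over the network
    device    /dev/drbd0;     # the network RAID 1 device presented to the OS
    disk      /dev/md0;       # each node's local RAID 0 (or RAID 6) array
    meta-disk internal;
    on nodeA {
        address 192.168.1.10:7789;
    }
    on nodeB {
        address 192.168.1.11:7789;
    }
}
```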

Why do we accept the risk of RAID 01 when it is over a network and not when it is local? This is because of the nature of the network link. In the case of RAID 10, we rely on the low level RAID 1 portion of the RAID stack for protection and the RAID 0 sits on top. If we replicate this at a network level, as RAID 1(0), what we end up with is each host holding a single mirror representing only a portion of the data of the array. If anything were to happen to any node in the array, or if the network connection were to fail, the array would be instantly destroyed and each node would be left with useless, incomplete data. It is the high risk of node failure and of failure at the network connection level that makes RAID decisions in a network setting extremely different. This becomes a complex subject on its own.

Suffice it to say, when working with normal RAID array controllers or with local storage and software RAID, utilize RAID 10 exclusively and never RAID 01.

The Cult of ZFS

It’s pretty common for IT circles to develop a certain cult-like or “fanboy” mentality.  What causes this reaction to technologies and products I am not quite sure, but that it happens is undeniable.  One area where I never thought I would see this occur is filesystems – one of the most “under the hood” system components and one that, until recently, received virtually no attention even in decently technical circles.  Let’s face it, misunderstanding whether something comes from Active Directory versus from NTFS is nearly ubiquitous.  Filesystems are, quite simply, ignored.  Ever since Windows NT 4 was released and NTFS was the only viable option, the idea that a filesystem is not an intrinsic component of an operating system and that there might be other options for file storage has all but faded away.  That is, until recently.

The one community where, to some small degree, this did not happen was the Linux community, but even there Ext2 and its descendants so completely won mindshare that, even though alternative filesystems were widely available, they were sidelined; only XFS received any attention historically, and even it received very little.

Where some truly strange behavior has occurred, more recently, is around Oracle’s ZFS filesystem, originally developed for the Solaris operating system and the X4500 “Thumper” open storage platform (originally under the auspices of Sun, prior to the Oracle acquisition.)  At the time ZFS was released (nine years ago), competing filesystems were mostly ill prepared to handle the large disk arrays that were expected to be made over the coming years.  ZFS was designed to handle them and ushered in the age of large scale filesystems.  Like most filesystems at that time, ZFS was limited to a single operating system and so, while widely regarded as a great leap forward in filesystem design, it produced few ripples in the storage world and even fewer in the “systems” world, where even Solaris administrators generally considered it a point of interest only for quite some time, mostly choosing to stick to the tried and true UFS that they had been using for many years.

ZFS was, truly, a groundbreaking filesystem and I was, and remain, a great proponent of it.  But it is very important to understand why ZFS did what it did, what its goals are, why those goals were important and how it applies to us today.  The complexity of ZFS has led to much confusion and misunderstanding about how the filesystem works and when it is appropriate to use.

The principal goal of ZFS was to make a filesystem capable of scaling well to very large disk arrays.  At the time of its introduction, the scale of which ZFS was capable was unheard of in other filesystems, but there was no real world need for a filesystem to be able to grow that large.  By the time the need arose, many other filesystems such as NTFS, XFS, Ext3 and others had scaled to accommodate it.  ZFS certainly led the charge to larger filesystem handling but was joined by many others soon thereafter.

Because ZFS originated in the Solaris world where, as on all big iron UNIX systems, there is no hardware RAID, software RAID had to be used.  Solaris had always had software RAID available as its own subsystem.  The decision was made to build a new software RAID implementation directly into ZFS.  This would allow for simplified management via a single tool set for both the RAID layer and the filesystem.  It did not introduce any significant change or advantage to ZFS, as is often believed; it simply shifted the interface for the software RAID layer from being its own command set to being part of the ZFS command set.

ZFS’ implementation of RAID introduced variable width stripes in parity RAID levels.  This innovation closed a minor parity RAID risk known as the “write hole”.  It was very nice but came very late, as the era of reliable parity RAID was beginning to end and the write hole was already considered an unmentioned “background noise” risk of parity arrays; it was not generally considered a threat due to its elimination through the use of battery backed array caches and, at about the same time, non-volatile array caches – avoid power loss and you avoid the write hole.  ZFS needed to address this issue because, as software RAID, it is at greater risk from the write hole than hardware RAID is: there is no opportunity for a cache protected against power loss, whereas hardware RAID offers the potential for an additional layer of power protection for arrays.

The real “innovation” that ZFS inadvertently made was that instead of just implementing the usual RAID levels of 1, 5, 6 and 10 under their standard names, it “branded” these levels with its own naming conventions.  RAID 5 is known as RAIDZ.  RAID 6 is known as RAIDZ2.  RAID 1 is just known as mirroring.  And so on.  This was widely considered silly and pointlessly confusing at the time but, as it turned out, that confusion became the cornerstone of ZFS’ revival many years later.

It needs to be noted that ZFS later added the industry’s first production implementation of a RAID 7 (aka RAID 7.3) triple parity RAID system and branded it RAIDZ3.  This later addition is an important innovation for large scale arrays that need the utmost in capacity while remaining extremely safe but are willing to sacrifice performance in order to do so.  This remains a unique feature of ZFS but one that is rarely used.

In the spirit of collapsing the storage stack and using a single command set to manage all aspects of storage, the logical volume management functions were rolled into ZFS as well.  It is often mistakenly believed, in certain circles, that ZFS introduced logical volume management, but nearly all enterprise platforms, including AIX, Linux, Windows and even Solaris itself, had already had logical volume management for many years.  ZFS was not doing this to introduce a new paradigm but simply to consolidate management and wrap all three key storage layers (RAID, logical volume management and filesystem) into a single entity that would be easier to manage and could provide inherent communications up and down the stack.  There are pros and cons to this method and an industry opinion remains unformed nearly a decade later.

One of the most important aspects of this consolidation of three systems into one is that now we have a very confusing product to discuss.  ZFS is a filesystem, yes, but it is not only a filesystem.  It is a logical volume manager, but not only a logical volume manager.  People refer to ZFS as a filesystem, which is its primary function, but that it is so much more than a filesystem can be very confusing and makes comparisons against other storage systems difficult.  At the time I believe that this confusion was not foreseen.

What has resulted from this confusing merger is that ZFS is often compared to other filesystems, such as XFS or Ext4.  But this is misleading, as ZFS is a complete stack and XFS is only one layer of a stack.  ZFS would be better compared to MD (Linux software RAID) / LVM / XFS or to SmartArray (HP hardware RAID) / LVM / XFS than to XFS alone.  Otherwise it appears that ZFS is full of features that XFS lacks but, in reality, it is only a semantic victory.  Most of the features often touted by ZFS advocates did not originate with ZFS and were commonly available with the alternative filesystems long before ZFS existed.  But it is hard to compare “does your filesystem do that” when the answer is “no… my RAID or my logical volume manager does that.”  And truly, it is not ZFS the filesystem providing RAIDZ, it is ZFS the software RAID subsystem that is doing so.

In order to gracefully handle very large filesystems, data integrity features were built into ZFS which included a checksum or hash check throughout the filesystem that could leverage the inclusive software RAID to repair corrupted files.  This was seen as necessary due to the anticipated size of ZFS filesystems in the future.  Filesystem corruption is a rarely seen phenomenon but as filesystems grow in size the risk increases.  This lesser known feature of ZFS is possibly its greatest.
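The checksum-and-repair idea described above can be illustrated conceptually (this is not ZFS code; the function names and the use of SHA-256 are my own illustrative choices): every block carries a checksum, and when a copy fails verification it is rewritten from a copy that passes.

```python
import hashlib

def checksum(block: bytes) -> str:
    """Content hash stored alongside each block."""
    return hashlib.sha256(block).hexdigest()

def read_with_repair(copies: list, expected: str) -> bytes:
    """Return the first copy whose checksum verifies; repair the rest from it."""
    good = next(c for c in copies if checksum(c) == expected)
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good  # rewrite the corrupted copy from the good one
    return good

block = b"important data"
stored = checksum(block)
mirror = [b"important data", b"imp0rtant data"]  # second copy silently corrupted
data = read_with_repair(mirror, stored)
print(data == block, mirror[0] == mirror[1])  # True True
```

The point of the sketch is the interplay between layers: the checksum detects the corruption, but it is the redundancy of the (software RAID) mirror that makes the repair possible.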

ZFS also changed how filesystem checks are handled.  Because of the assumption that ZFS will be used on very large filesystems there was a genuine fear that a filesystem check on boot time could take impossibly long to complete and so an alternative strategy was found.  Instead of waiting to do a check at reboot the system would require a scrubbing process to run and perform a similar check while the system was running.  This requires more system overhead while the system is live but the system is able to recover from an unexpected restart more rapidly.  A trade off but one that is widely seen as very positive.

ZFS has powerful snapshotting capabilities in its logical volume layer and has implemented very robust caching mechanisms in its RAID layer, making ZFS an excellent choice for many use cases.  These features are not unique to ZFS but are widely available in systems older than ZFS.  They are, however, very good implementations of each and very well integrated due to ZFS’ nature.

At one time, ZFS was open source and during that era its code became a part of Apple’s Mac OSX and FreeBSD operating systems because their licenses were compatible with the ZFS license.  Linux did not get ZFS at that time due to challenges around licensing.  Had ZFS licensing allowed Linux to use it unencumbered, the Linux landscape would likely be very different today.  Mac OSX eventually dropped ZFS as it was not seen as having enough advantages to justify it in that environment.  FreeBSD clung to ZFS and, over time, it became the most popular filesystem on the platform, although UFS is still heavily used as well.  Oracle closed the source of ZFS after the Sun acquisition, leaving FreeBSD without continuing updates to its version of ZFS while Oracle continued to develop ZFS internally for Solaris.

Today Solaris continues to use the original ZFS implementation, now with several updates since its split with the open source community.  FreeBSD and others continued using ZFS in the state it was in when the code was closed, no longer having access to Oracle’s latest updates.  Eventually work to update the abandoned open source ZFS codebase was taken up and is now known as OpenZFS.  OpenZFS is still fledgling and has not yet really made its mark, but it has some potential to revitalize the ZFS platform in the open source space; at this time, however, OpenZFS still lags ZFS.

Open source development in this space over the last several years has focused more on ZFS’ new rival BtrFS, which is being developed natively on Linux and is well supported by many major operating system vendors.  BtrFS is very nascent but is making major strides toward feature parity with ZFS and, due to ZFS’ closed source nature, has the benefit of market momentum.  BtrFS was started, like ZFS, by Oracle and has been widely seen as Oracle’s view of the future – a replacement for ZFS even at Oracle.  At this time BtrFS has already, like ZFS, merged the filesystem, logical volume management and software RAID layers, implemented checksumming for filesystem integrity, scales even larger than ZFS (same absolute limit but handles more files) and offers copy on write snapshots, among other features.

ZFS, without a doubt, was an amazing filesystem in its heyday and remains a leader today.  I was a proponent of it in 2005 and I still believe heavily in it.  But it has saddened me to see the community around ZFS take on a fervor and zealotry that does it no service and makes the mention of ZFS almost seem a negative – ZFS being so universally chosen for the wrong reasons: primarily a belief that its features exist nowhere else, that its RAID is not subject to the risks and limitations that those RAID levels are always subject to, or that it was designed for a purpose (primarily performance) other than its actual one.  And when ZFS is a good choice, it is often implemented poorly based on untrue assumptions.

ZFS, of course, is not to blame.  Nor, as far as I can tell, are its corporate supporters or its open source developers.  Where ZFS seems to have gone awry is in a loose, unofficial community that has only recently come to know it, often believing it to be new or “next generation” because they have only recently discovered it.  From what I have seen this is almost never via Solaris or FreeBSD channels but almost exclusively via smaller businesses looking to use a packaged “NAS OS” like FreeNAS or NAS4Free who are not familiar with UNIX operating systems.  The use of packaged NAS OSes, primarily by IT shops that possess neither deep UNIX nor storage skills – and, consequently, little exposure to the broader world of filesystems outside of Windows and often little to no exposure to logical volume management and RAID, especially software RAID – appears to lead to a “myth” culture around ZFS, with it taking on an almost unquestionable, infallible status.

This cult-like following and general misunderstanding of ZFS leads often to misapplications of ZFS or a chain of decision making based off of bad assumptions that can lead one very much astray.

One of the most amazing changes in this space is the shift in following from hardware RAID to software RAID.  Traditionally, software RAID was a pariah in Windows administration circles without good cause – Windows administrators and small businesses, often unfamiliar with larger UNIX servers, believed that hardware RAID was ubiquitous when, in fact, larger scale systems always used software RAID.  Hardware RAID was, almost industry wide, considered a necessity and software RAID completely eschewed.  That same audience, now faced with the “Cult of ZFS” movement, reacts in exactly the opposite way, believing that hardware RAID is bad and that ZFS’ software RAID is the only viable option.  The shift is dramatic and neither extreme is justified – both hardware and software RAID, in many implementations, are very valid options, and even with ZFS the use of hardware RAID might easily be appropriate.

ZFS is often chosen because it is believed to be the highest performance option among filesystems, but this was never a key design goal of ZFS.  The features allowing it to scale so large and handle so many different aspects of storage actually make being highly performant very difficult.  ZFS, at the time of its creation, was not even expected to be as fast as the venerable UFS which ran on the same systems.  However, this is often secondary to the fact that filesystem performance is widely moot: all modern filesystems are extremely fast and filesystem speed is rarely an important factor – especially outside of massive, high end storage systems at very large scale.

An interesting study of ten filesystems on Linux produced by Phoronix in 2013 showed massive differences between filesystems by workload but no clear winners in overall performance.  What the study showed conclusively is that matching workload to filesystem is the most important choice, that ZFS falls to the slower side of all mainstream filesystems even in its more modern implementations and that choosing a filesystem for performance reasons without a very deep understanding of the workload will result in unpredictable performance – no filesystem should be chosen blindly if performance is an important factor.  Sadly, because the test was done on Linux, it lacked UFS, which is often ZFS’ key competitor, especially on Solaris and FreeBSD, and it lacked HFS+ from Mac OSX.

Moving from hardware RAID to software RAID carries additional, often unforeseen risks for shops not experienced in UNIX as well.  While ZFS allows for hot swapping, it is often forgotten that hot swap is primarily a feature of hardware, not of software.  It is also widely unknown that blind swapping (removal of hard drives without first offlining them in the operating system) is not synonymous with hot swapping.  This can lead to disasters for shops moving from a tradition of hardware RAID, which handled compatibility, hot swap and blind swapping transparently for them, to a software RAID system that requires much more planning, coordination and understanding in order to use safely.

A lesser, but still common misconception of ZFS, is that it is a clustered filesystem suitable for use on shared DAS or SAN scenarios a la OCFS, VxFS and GFS2.  ZFS is not a clustered filesystem and shares the same limitations in that space as all of its common competitors.

ZFS can be an excellent choice, but it is far from the only one.  ZFS comes with large caveats, not the least of which is the operating system limitations associated with it, and while it has many benefits, few, if any, are unique to ZFS and it is very rare that any shop will benefit from every one of them.  As with any technology, there are trade offs to be made.  One size does not fit all.  The key to knowing when ZFS is right for you is to understand what ZFS is, what is and is not unique about it, what its design goals are, how comparing a full storage stack to a pure filesystem produces misleading results and what inherent limitations are tied to it.

ZFS is a key consideration, and the common choice, when Solaris or FreeBSD is the chosen operating system.  With rare exception, the operating system should never be chosen for ZFS; instead ZFS should often, but not always, be chosen once the operating system has been chosen.  The OS should drive the filesystem choice in all but the rarest of cases; the choice of operating system is dramatically more important than the choice of filesystem.

ZFS can be used on Linux but is not considered an enterprise option there, being more of a hobby system for experimentation, as no enterprise vendor (such as Red Hat, Suse or Canonical) supports ZFS on Linux and as Linux has great alternatives already.  Someday ZFS might be promoted to a first class filesystem in Linux, but this is not expected as BtrFS has already entered the mainline kernel and been included in production releases by several major vendors.

While ZFS will be seen in the vast majority of Solaris and FreeBSD deployments, this is primarily because it has moved into the position of default filesystem and not because it is clearly the superior choice in those instances or has even been evaluated critically.  ZFS is perfectly well suited to being a general purpose filesystem where it is native and supported.

What is ZFS’ primary use case?

ZFS’ design goal and principal use case is for Solaris and FreeBSD open storage systems providing either shared storage to other servers or as massive data repositories for locally installed applications.  In these cases, ZFS’ focus on scalability and data integrity really shine.  ZFS leans heavily towards large and enterprise scale shops and generally away from applicability in the small and medium business space where Solaris and FreeBSD skills, as well as large scale storage needs, are rare.

Reference: http://www.phoronix.com/scan.php?page=article&item=linux_310_10fs&num=1


Understanding the Western Digital SATA Drive Lineup (2014)

I chose to categorize Western Digital’s SATA drive lineup for several reasons.  WD is the current market leader in spinning hard drives, which makes the categorization useful to the greatest number of people; the “color coded” line is, based on anecdotal evidence, far and away the chosen drive family of the small business market where this kind of breakdown matters most; and SATA drives retain the widest disparity of features and factors, making them far more necessary to understand well. While technically the only difference between a SAS (SCSI) drive, a SATA (ATA) drive or even a Fibre Channel (FC) drive is the communications protocol used to communicate with it, in practical terms SAS and FC drives are only made in certain, high reliability configurations and do not require the same degree of scrutiny or carry the same extreme risks as SATA drives. Understanding SATA drive offerings is the more important for practical, real world storage needs.

WD has made understanding their SATA drive lineup especially easy by adding color codes to the majority of their SATA drive offerings – those deemed to be “consumer” drives – an “E” designation on their enterprise SATA drives, and one outlier, the high performance Velociraptor drives which seek to compete with common SAS performance on SATA controllers. Altogether they have seven SATA drive families to consider, covering the gamut of drive factors. While this breakdown applies to the easy to understand WD lineup, by comparing the factors here with the offerings of other drive makers the use cases of their drives can be determined as well.

In considering SATA drives, three key factors stand out as the most crucial to consider (outside of price, of course.)

URE Rate: A URE, or Unrecoverable Read Error, is an event that happens, with some regularity, to electromechanical disk storage media where a single sector is unable to be retrieved. In a standalone drive this happens from time to time but generally only affects a single file; users typically see this as a lost file (often one they do not notice) or possibly a corrupt filesystem which may or may not easily be corrected. In healthy RAID arrays (other than RAID 0), the RAID system provides mirroring and/or parity that can cover for this sector failure and recreate the data, protecting us from URE issues. When a RAID array is in a degraded state, UREs are a potential risk again. In the worst case, a URE on a degraded parity array can cause total loss of the array (all data is lost.) So considering UREs and their implications in any drive purchase is extremely important and is the primary driver of cost differential between drives of varying types. URE ratings range from 10^14 at the low end to 10^16 at the high end, expressed as one expected error per that many bits read; the numbers are so large that they are always written in scientific notation. I will not go into an in-depth explanation of URE rates, ramifications and mitigation strategies here, but understanding URE is critical to decision making around drive purchases, especially in the large capacity, lower reliability space of SATA drives.
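The scale of the risk can be illustrated with a back-of-the-envelope model.  This sketch assumes UREs are independent and uniformly distributed (a simplification real drives do not guarantee) and uses a hypothetical four-drive RAID 5 of 3TB drives as the example array:

```python
import math

def p_ure(bytes_read, ure_rate_bits):
    """Approximate probability of hitting at least one URE while
    reading `bytes_read` bytes from a drive rated at one error per
    `ure_rate_bits` bits read (Poisson approximation, assuming
    independent, uniformly distributed errors)."""
    return 1 - math.exp(-(bytes_read * 8) / ure_rate_bits)

# Rebuilding a degraded 4-drive RAID 5 of 3TB drives means reading
# the three surviving drives in full: 9 TB.
rebuild_bytes = 9e12

for exp in (14, 15, 16):
    chance = p_ure(rebuild_bytes, 10.0 ** exp)
    print(f"URE 10^{exp}: {chance:.1%} chance of a URE during rebuild")
```

Under these assumptions a 10^14 drive faces roughly a coin-flip chance of hitting a URE during this rebuild, while a 10^16 drive faces well under one percent, which is why the URE rating dominates parity RAID planning.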

Spindle Speed: This is one of the biggest factors for most users, as spindle speed directly correlates to IOPS and throughput. While measurements of drive speed are dynamic, at best, spindle speed is the best overall way to compare two otherwise identical drives under identical load. A 15,000 RPM drive will deliver roughly double the IOPS and throughput of a 7,200 RPM drive, for example. SATA drives commonly come in 5,400 RPM and 7,200 RPM varieties, with rare high performance drives available at 10,000 RPM.
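A crude latency model shows why spindle speed tracks random IOPS so closely: average rotational latency is half a revolution, and random IOPS is roughly the inverse of seek time plus rotational latency.  The seek times below are illustrative assumptions, not vendor figures:

```python
def estimated_iops(rpm, avg_seek_ms):
    """Rough random-read IOPS: inverse of (average seek time +
    average rotational latency, i.e. the time for half a revolution)."""
    rotational_latency_ms = (60_000 / rpm) / 2
    return 1000 / (avg_seek_ms + rotational_latency_ms)

# Seek times here are assumed, typical-looking values only.
for rpm, seek_ms in ((5_400, 9.0), (7_200, 8.5), (10_000, 4.5), (15_000, 3.5)):
    print(f"{rpm:>6} RPM: ~{estimated_iops(rpm, seek_ms):.0f} IOPS")
```

Faster spindles also tend to ship with faster seek mechanisms, so the real-world gap between a 7,200 RPM and a 15,000 RPM drive comes from both terms of the model, not rotation alone.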

Error Recovery Control (ERC): Also known as TLER (Time Limited Error Recovery) in WD parlance, ERC is a feature of a drive’s firmware which allows for configurable time limits on read or write error recovery. This matters when a hard drive is used in a RAID array, as error recovery often needs to be handled at the array, rather than the drive, level. Without ERC, a drive is more likely to be incorrectly marked as failed when it has not failed. This is most dangerous in hardware based parity RAID arrays and has differing levels of effectiveness depending on individual RAID controller parameters. It is an important feature for drives intended for use in RAID arrays.
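On drives that expose the setting, the ERC timers can be inspected and adjusted with smartmontools.  A sketch, assuming the drive supports the SCT Error Recovery Control feature (values are in tenths of a second and the device path is illustrative):

```shell
# Query the current ERC (SCT Error Recovery Control) timers.
smartctl -l scterc /dev/sda

# Set read and write recovery limits to 7 seconds (70 tenths of a
# second), a common choice for drives sitting behind a RAID layer.
smartctl -l scterc,70,70 /dev/sda
```

Note that some consumer firmware ignores or refuses the command entirely, and the setting may not persist across a power cycle, so it is no substitute for buying a drive with ERC/TLER enabled from the factory.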

In addition to these key factors, WD lists many others for their drives such as cache size, number of processors, mean time between failures, etc. These tend to be far less important, especially MTBF and other reliability numbers, as these can be skewed or misinterpreted easily and rarely offer the insight into drive reliability that we expect or hope for. Cache size is not very significant for RAID arrays as the onboard cache needs to be disabled for reasons of data integrity; outside of desktop use scenarios, the size of a hard drive’s cache is generally considered irrelevant. CPU count can also be misleading, as a single CPU may be more powerful than dual CPUs if the CPUs are not identical and the efficacy of the second CPU is unknown. But WD lists this as a prominent feature of some drives and it is assumed that there is a measurable performance gain, most likely in latency reduction, through the addition of the second CPU. I do, however, continue to treat this as a trivial factor, mostly useful as a point of interest rather than as a decision factor.

The drives:

All color-coded drives (Blue, Green, Red and Black) share one common factor – they have the “consumer” URE rating of 10^14. Consumer is a poor description here but is, more or less, industry standard. A better description is “desktop class” or suitable for non-parity RAID uses. The only truly poor application of 10^14 URE drives is in parity RAID arrays and even there, they can have their place if properly understood.

Blue: WD Blue drives are the effective baseline model for the SATA lineup. They spin at the “default” 7,200 RPM, lack ERC/TLER and have a single processor. Drive cache varies between 16MB, 32MB and 64MB depending on the specific model. Blue drives are targeted at traditional desktop usage – as single drives with moderate speed characteristics, not well suited to server or RAID usage. Blue drives are what is “expected” to be found in off the shelf desktops. Blue drives have largely lost popularity and are often not available in larger sizes; Black and Green drives have mostly replaced them, at least in larger capacity scenarios.

Black: WD Black drives are a small upgrade to the Blue drives changing nothing except to upgrade from one to two processors to slightly improve performance while not being quite as cost effective. Like the Blue drives they lack ERC/TLER and spin at 7,200 RPM. All Black drives have the 64MB cache. As with the Blue drives, Black drives are most suitable for traditional desktop applications where drives are stand alone.

Green: WD Green drives, as their name nominally implies, are designed for low power consumption applications. They are most similar to Blue drives but spin at a slower 5,400 RPMs which requires less power and generates less heat. Green drives, like Blue and Black, are designed for standalone use primarily in desktops that need less drive performance than is expected in an average desktop. Green drives have proven to be very popular due to their low cost of acquisition and operation. It is assumed, as well, that Green drives are more reliable than their faster spinning counterparts due to the lower wear and tear of the slower spindles although I am not aware of any study to this effect.

Red: WD Red drives are unique in the “color coded” WD drive line up in that they offer ERC/TLER and are designed for use in small “home use” server RAID arrays and storage devices (such as NAS and SAN.) Under the hood the WD Red drives are WD Green drives, all specifications are the same including the 5,400 RPM spindle speed, but with TLER enabled in the firmware. Physically they are the same drives. WD officially recommends Red drives only for consumer applications but Red drives, due to their lower power consumption and TLER, have proven to be extremely popular in large RAID arrays, especially when used for archiving. Red drives, having URE 10^14, are dangerous to use in parity RAID arrays but are excellent for mirrored RAID arrays and truly shine at archival and similar storage needs where large capacity and low operational costs are key and storage performance is not very important.

Outside of the color coded drives, WD has three SATA drive families which are all considered enterprise. What these drives have in common is that their URE ratings are much better than those of the “consumer” color coded drives, ranging from URE 10^15 to 10^16 depending on model. The most important result of this is that these drives are far more suitable for use in parity RAID arrays (e.g. RAID 6.)

SE: SE drives are WD’s entry level enterprise SATA drives with URE 10^15 rates and 7,200 RPM spindle speeds. They have dual processors and a 64MB cache. Most importantly, SE drives have ERC/TLER enabled. SE drives are ideal for enterprise RAID arrays both mirrored and parity.

RE: RE drives are WD’s high end standard enterprise SATA drives with all specifications being identical to the SE drives but with the even better URE 10^16 rate. RE drives are the star players in WD’s RAID drive strategy being perfect for extremely large capacity arrays even when used in parity arrays. RE drives are available in both SATA and SAS configurations but with the same drive mechanics.

Velociraptor: WD’s Velociraptor is a bit of an odd member of the SATA category. With URE 10^16 and a 10,000 RPM spindle speed the Velociraptor is both highly reliable and very fast for a SATA drive competing with common, mainline SAS drives. Surprisingly, the Velociraptor has only a single processor and even more surprisingly, it lacks ERC/TLER making it questionable for use in RAID arrays. Lacking ERC, use in RAID can be considered on an implementation by implementation basis depending on how the RAID system interacts with the drive’s timing. With the excellent URE rating, Velociraptor would be an excellent choice for large, higher performance parity RAID arrays but only if the array handles the error timing in a graceful way, otherwise the risk of the array marking the drive as having failed is unacceptably high for an array as costly as this would be. It should be noted that Velociraptor drives do not come in capacities comparable to the other SATA drive offerings – they are much smaller.

Of course the final comparison that one needs to make is in price. When considering drive purchases, especially where large RAID arrays are concerned or for other bulk storage needs, the per drive cost is often a major, if not the driving, factor. The use of slower, less reliable drives in a more reliable RAID level (such as Red drives in RAID 10) versus faster, more reliable drives in a less reliable RAID level (such as RE drives in RAID 6) often provides a better blend of reliability, performance, capacity and cost. Real world drive prices play a significant factor in these decisions. These prices, unlike the drive specifications, can fluctuate from day to day and swing planning decisions in different directions but, overall, tend to remain relatively stable in comparison to one another.

At the time of this article, at the end of 2013, a quick survey of prices of 3TB drives from WD gives this approximate breakdown:

Green $120
Red $135
Black $155
SE $204
RE $265

As can be seen, the jump in price comes primarily between the consumer or desktop class drives and the enterprise drives with their better URE rates. Red and RE drives, both with ERC/TLER, sit in a price ratio of almost exactly 2:1, making it favorable, for equal capacity, to choose many more Red drives in RAID 10 over fewer RE drives in RAID 6, as an example. So comparing a number of factors, along with current real world prices, is crucial to making many buying decisions.
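Using the survey prices above, the trade-off can be worked through for one illustrative target – 12TB of usable capacity – keeping in mind that this sketch ignores performance, rebuild exposure and drive bay costs:

```python
# Per-drive prices for 3TB models from the survey above.
RED_PRICE = 135
RE_PRICE = 265

# Target: 12TB usable capacity.
# RAID 10 of Red drives: half of raw capacity is usable, so 8 drives.
raid10_drives = 8
raid10_cost = raid10_drives * RED_PRICE

# RAID 6 of RE drives: two drives' capacity lost to parity, so 6 drives.
raid6_drives = 6
raid6_cost = raid6_drives * RE_PRICE

print(f"RAID 10 (Red): {raid10_drives} drives at ${RED_PRICE} = ${raid10_cost}")
print(f"RAID 6  (RE):  {raid6_drives} drives at ${RE_PRICE} = ${raid6_cost}")
```

At these prices the Red RAID 10 array comes in roughly a third cheaper for the same usable capacity while also avoiding the degraded parity rebuild exposure discussed earlier, though it does consume two more drive bays.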

Newer drives, just being released, are starting to see reductions in onboard drive cache for exactly the reasons stated above: drives designed around RAID use have little purpose for onboard cache as it needs to be disabled for data integrity purposes.

Drive makers today are offering a wide variety of traditional spindle-based drive options to fit many different needs. Understanding these can lead to better reliability and more cost effective purchasing and will extend the usefulness of traditional drive technologies into the coming years.