All posts by Scott Alan Miller

Started in software development with Eastman Kodak in 1989 as an intern in database development (making database platforms themselves.) Began transitioning to IT in 1994 with my first mixed role in system administration.

Comparing RAID 10 and RAID 01

These two RAID levels often bring about a tremendous amount of confusion, partially because they are incorrectly used interchangeably and often simply because they are poorly understood.

First, it should be pointed out that either may be written with or without the plus sign: RAID 10 is RAID 1+0 and RAID 01 is RAID 0+1. Strangely, RAID 10 is almost never written with the plus and RAID 01 is almost never written without it. Storage engineers generally agree that the plus sign is superfluous and need not be used.

Both of these RAID levels are “compound” levels made from two different, simple RAID types being combined. Both are mirror-based, non-parity compound or nested RAID. Both have essentially identical performance characteristics – nominal overhead and latency with NX read speed and (NX)/2 write speed where N is the number of drives in the array and X is the performance of an individual drive in the array.
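To make those formulas concrete, here is a minimal sketch in Python; the eight drive count and the 150 MB/s per-drive figure are made-up values used purely for illustration.

    # Rough streaming throughput estimate for RAID 10 / RAID 01 (two-way mirrors).
    # N = number of drives, X = throughput of a single drive.
    def mirror_stripe_throughput(n_drives, drive_mbps):
        read = n_drives * drive_mbps           # NX: every spindle can serve reads
        write = (n_drives * drive_mbps) / 2    # (NX)/2: each write lands on both mirror members
        return read, write

    # Example: eight drives at a hypothetical 150 MB/s each.
    r, w = mirror_stripe_throughput(8, 150)
    print("Read: %d MB/s, Write: %d MB/s" % (r, w))   # Read: 1200 MB/s, Write: 600 MB/s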

What sets the two RAID levels apart is how they handle disk failure. The quick overview is that RAID 10 is extremely safe under nearly all reasonable scenarios. RAID 01, however, rapidly becomes quite risky as the size of the array increases.

In a RAID 10, the loss of any single drive results in the degradation of a single RAID 1 set inside of the RAID 0 stripe. The stripe level sees no degradation, only that one RAID 1 mirror does. All other mirrors are unaffected. This means that our only increased risk is that the one single drive is now running without redundancy and has no protection. All other mirrored sets still retain full protection. So our exposure is a single, unprotected drive – much like you would expect in a desktop machine.

Array repair in a degraded RAID 10 is the fastest possible repair scenario. Upon replacing a failed drive, all that happens is that the single affected mirror is rebuilt – a simple copy operation that happens at the RAID 1 level, beneath the RAID 0 stripe. This means that if the overall array is idle the mirroring process can proceed at full speed and the overall array has no idea that this is even happening. A disk to disk mirror is extremely fast, efficient and reliable. This is an ideal recovery scenario. Even if multiple mirrors are degraded and rebuilding at the same time there is no additional impact, as the rebuilding of one does not affect the others. RAID 10 risk and repair impact both scale extremely well.

RAID 01, on the other hand, when it loses a single drive immediately loses an entire RAID 0 stripe. In a typical RAID 01 mirror there are two RAID 0 stripes. This means that half of the entire array has failed. If we are talking about an eight drive RAID 01 array, the failure of a single drive renders four drives instantly inoperable and effectively failed (hardware does not need to be replaced but the data on the drives is out of date and must be rebuilt to be useful.) So from a risk perspective, we can look at it as being a failure of the entire stripe.

What is left after a single disk has failed is nothing but a single, unprotected RAID 0 stripe. This is far more dangerous than the equivalent RAID 10 failure because instead of there being only a single, isolated hard drive at risk there is now a minimum of two disks and potentially many more at risk and each drive exposed to this risk magnifies the risk considerably.

As an example, in the smallest possible RAID 10 or 01 array we have four drives. In RAID 10 if one drive fails, our risk is that its matching partner also fails before we rebuild the array. We are only worried about that one drive, all other drives in the RAID 10 set are still protected and safe. Only this one is of concern. In a RAID 01, when the first drive fails its partner in its RAID 0 set is instantly useless and effectively failed as it is no longer operable in the array. What remains are two drives with no protection running nothing but RAID 0 and so we have the same risk that RAID 10 did, twice. Each drive has the same risk that the one drive did before. This makes our risk, in the best case scenario, much higher.

But for a more dramatic example let us look at a large twenty-four drive RAID 10 and RAID 01 array. Again with RAID 10, if one drive fails all others, except for its one partner, are still protected. The extra size of the array added almost zero additional risk. We still only fear for the failure of that one solitary drive. Contrast that to RAID 01 which would have had one of its RAID 0 arrays fail taking twelve disks out at once with the failure of one leaving the other twelve disks in a RAID 0 without any form of protection. The chances of one of twelve drives failing is significantly higher than the chances of a single drive failing, obviously.

This is not the entire picture. The recovery of a single failed RAID 10 disk is fast: it is a straight copy operation from one drive to the other. It uses minimal resources and takes only as long as is required for a single drive to read and to write itself in its entirety. RAID 01 is not as lucky. RAID 10 rebuilds only a small subset of the entire array, and that subset does not grow as the array grows (the time to recover a four drive RAID 10 and a forty drive RAID 10 after a failure is identical.) RAID 01, however, must rebuild an entire half of the whole parent array. In the case of the four drive array this is double the rebuild work of the RAID 10, but in the case of the twenty four drive array it is twelve times the rebuild work. So RAID 01 rebuilds take longer to perform while the array is under significantly more risk during that time.
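To put rough numbers on both the risk window and the rebuild scope, here is a small Python sketch.  The one percent per-drive failure chance during the rebuild window is an arbitrary figure chosen only for illustration, and the model assumes drives fail independently.

    # Exposure after one drive has already failed, plus rebuild scope, for each layout.
    def raid10_loss_risk(p):
        # Only the failed drive's single mirror partner is left unprotected.
        return p

    def raid01_loss_risk(total_drives, p):
        # The surviving RAID 0 stripe holds half the drives; losing any one of them loses the array.
        exposed = total_drives // 2
        return 1 - (1 - p) ** exposed

    def rebuild_scope(total_drives):
        # Number of drives that must be rewritten to restore redundancy.
        return {"RAID 10": 1, "RAID 01": total_drives // 2}

    for n in (4, 24):
        print("%2d drives: RAID 10 risk %.4f, RAID 01 risk %.4f, rebuild scope %s" %
              (n, raid10_loss_risk(0.01), raid01_loss_risk(n, 0.01), rebuild_scope(n)))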

There is a rather persistent myth that RAID 01 and RAID 10 have different performance characteristics, but they do not. Both use plain striping and mirroring, which are effectively zero overhead operations requiring almost no processing. Both get full read performance from every disk device attached to them and each loses half of its write performance to the mirroring operation (assuming two way mirrors, which is the only common use of either array type.) There is simply nothing to make RAID 01 or RAID 10 any faster or slower than the other. Both are extremely fast.

Because of the characteristics of the two array types, it is clear that RAID 10 is the only one of the two that should ever exist within a single array controller. RAID 01 is unnecessarily dangerous and carries no advantages. They use the same capacity overhead, they have the same performance, they cost the same to implement, but RAID 10 is significantly more reliable.

So why does RAID 01 even exist? Partially it exists out of ignorance or confusion. Many people, implementing their own compound RAID arrays, choose RAID 01 because they have heard the myth that it is faster and, as is generally the case with RAID, do not investigate why it would be faster and forget to look into its reliability and other factors. RAID 01 is truly only implemented on local arrays by mistake.

However, when we take RAID to the network layer, there are new factors to consider and RAID 01 can become important, as can its rare cousin RAID 61. Using Network RAID Notation we denote where the local and where the network layers of the RAID exist, so in this case we mean RAID 0(1) or RAID 6(1). The parentheses denote that the RAID 1 mirror, the “highest” portion of the RAID stack, is over a network connection and not on the local RAID controller.

How would this look in RAID 0(1)? If you have two servers, each with a standard RAID 0 array and you want them to be synchronized together to act as a single, reliable array you could use a technology such as DRBD (on Linux) or HAST (on FreeBSD) to create a network RAID 1 array out of the local storage on each server. Obviously this has a lot of performance overhead as the RAID 1 array must be kept in sync over the high latency, low bandwidth LAN connection. RAID 0(1) is the notation for this setup. If each local RAID 0 array was replaced with a more reliable RAID 6 we would write the whole setup as RAID 6(1).

Why do we accept the risk of RAID 01 when it is over a network and not when it is local? This is because of the nature of the network link. In the case of RAID 10, we rely on the low level RAID 1 portion of the RAID stack for protection and the RAID 0 sits on top. If we replicate this on a network level such as RAID 1(0) what we end up with is each host having a single mirror representing only a portion of the data of the array. If anything were to happen to any node in the array or if the network connection was to fail the array would be instantly destroyed and each node would be left with useless, incomplete data. It is the nature of the high risk of node failure and risk at the network connection level that makes RAID decisions in a network setting extremely different. This becomes a complex subject on its own.

Suffice it to say, when working with normal RAID array controllers or with local storage and software RAID, utilize RAID 10 exclusively and never RAID 01.

It Worked For Me

“Well, it worked for me.”  This has become a phrase that I have heard over and over again in defense of what would logically be otherwise considered a bad idea.  These words are often spoken innocently enough without deep intent, but they often cover deep meaning that should be explored.

But it is important to understand what drives these words both psychologically as well as technically.  At a high level, what we have is the delivery of an anecdote which can be restated as such: “While the approach or selection that I have used goes against your recommendation or best practices or what have you, in my particular case the bad situation of which you have warned or advised against has not arisen and therefore I believe that I am justified in the decision that I have made.”

I will call this the “Anecdotal Dismissal of Risk,” better known as “Outcome Bias.”  Generally this phrase is used to wave off the accusation that one has either taken on unnecessary risk or taken on unnecessary financial expense or, more likely, both.  The use of an anecdote in either of these cases is, of course, completely meaningless, but the speaker does so with the hope of throwing off the discussion and routing it around their case by suggesting, without saying it, that perhaps they are a special case that has not been considered or, perhaps, that “getting lucky” is a valid form of decision making.

Of course, when talking risk, we are talking about statistical risk.  If anything were a sure thing, and could be proven or disproved with an anecdote, it would not be risk but would just be a known outcome and making the wrong choice would be amazingly silly.  Anecdotes have a tiny place when used in the negative, for example: “They claim that it is a billion to one chance that this would happen, but it happened to me on the third try and I know one other person that it happened to.”  That is not proof, but anecdotally it suggests that the risk figures are unlikely to be correct.

That case is valid, but it is still incredibly important to realize that even negative anecdotal evidence (anecdotal evidence of something that was extremely unlikely to happen) is still anecdotal and does not suggest that the result will happen again; at most it suggests that you were an amazing edge case.  If you know of one person that has won the lottery, that is unlikely but does not prove that the lottery is likely to be won.  If every other person you know who has played the lottery has won, something is wrong with the statistics.

However, the “it worked for me” case is universally used with risk that is less than fifty percent (if it were not, the whole thing would become crazy.)  Often it is about taking something with four nines of reliability and reducing it to three nines while attempting to raise it.  Three nines still means that there is only a one in one thousand chance that the bad case will arise.  This is not statistically likely to occur, obviously.  At least we would hope that it was obvious.  Even though, in this example, the bad case arises ten times more often than it would have if we had left well enough alone, and perhaps one hundred times more often than we intended for it to arise, we still expect to never see the bad outcome unless we run thousands or tens of thousands of cases, and even then the statistics are based on a rather small pool.

In many cases we talk about an assumption of unnecessary risk, but generally this is risk taken on at a financial cost. What prompts this reaction a great deal of the time, in my experience, is having dramatic overspending demonstrated – implementing a very costly solution when a less costly one, often a fraction of the price, may approach or, in many cases, exceed the capabilities of the chosen solution being defended.

To take the reverse, out of any one thousand people, nine hundred and ninety nine of them, doing this same thing, would be expected to have no bad outcome.  For someone to claim, then, that the risk is one part in one thousand and have one of the nine hundred and ninety nine step forward and say “the risk can’t exist because I am not the incredibly unlikely one to have had the bad thing happen to me” obviously makes no sense whatsoever when looking at the pool as a whole.  But when we are the ones who made the decision to join that pool and then came away unscathed it is an apparently natural reaction to discount the assumed outcome of even a risky choice and assume that the risk did not exist.
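A little arithmetic makes the point.  The Python sketch below uses the one in one thousand figure from above and a few made-up repetition counts to show how likely a clean personal track record is even when the risk is entirely real.

    # Probability of personally never seeing a 1-in-1000 bad outcome.
    def chance_of_no_bad_outcome(per_event_risk, events):
        return (1 - per_event_risk) ** events

    risk = 1 / 1000.0   # three nines: the bad case arises once per thousand events
    for events in (1, 10, 100):
        print("%3d events: %.1f%% chance it 'worked for me'" %
              (events, 100 * chance_of_no_bad_outcome(risk, events)))
    # Nearly everyone taking this risk walks away unscathed, yet across a pool of one
    # thousand decision makers we still expect roughly one disaster.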

It is difficult to explain risk in this way but, over the years, I’ve found a really handy example to use that tends to explain business or technical risk in a way that anyone can understand.  I call it the Mother Seatbelt Example.  Try this experiment (don’t actually try it but lie to your mother and tell her that you did to see the outcome.)

Drive a car without wearing a seatbelt for a whole day while continuously speeding.  Chances are extremely good that nothing bad will happen to you (other than paying some fines.)  The chances of having a car accident and getting hurt, even while being reckless in your driving and disregarding basic safety precautions, are extremely low.  Easily less than one in one thousand.  Now, go tell your mother what you just did and say that you feel that doing this was a smart way to drive and that you made a good decision in having done so because “it worked out for me.”  Your mother will make it very clear to you what risky decisions mean and how anecdotal evidence of the expected survival outcome does not indicate good risk / reward decision making.

In many cases, “it worked for me” is an attempt at deflection.  It is a reaction of our amygdala, a “fight or flight” response to avoid facing what is likely a bad decision of the past.  Everyone has this reaction; it is natural, but unhealthy.  By taking this stance of avoiding critical evaluation of past decisions we make ourselves more likely to continue to repeat the same bad decision or, at the very least, to continue the bad decision making process that led to that decision.  It is only by facing critical examination and accepting that past decisions may not have been ideal that we can examine ourselves and our processes and attempt to improve them to avoid making the same mistakes again.

It is understandable that in any professional venue there is a desire to save face and appear to have made, if not a good decision, at least an acceptable one, and so the desire to explore logic that might undermine that impression is low.  Even more so, there is a very strong possibility that someone who is a potential recipient of the risk or cost that the bad decision created will learn of the past decision making and there is, quite often, an even stronger desire to cover up any possibility that a decision may have been made without proper exploration or due diligence.  These are understandable reactions but they are not healthy and ultimately make the decision look even poorer than it would have otherwise.  Everyone makes mistakes, everyone.  Everyone overlooks things, everyone learns new things over time.  In some cases, new evidence comes to light that was impossible to have known at the time.  There should be no shame in past decisions that are less than ideal, only in failing to examine them and learn from them, allowing us as individuals as well as our organizations to grow and improve.

The phrase seems innocuous enough when said.  It sounds like a statement of success.  But we need to reflect deeper.  We covered the risk scenario above, but what about the financial one?  When a solution is selected that carries little or no benefit, and possibly great caveats as we see in many real world cases, while being much more costly, and the term “it worked for me” is used, what is really being said is “wasting money didn’t get me in trouble.”  When used in the context of a business, this is quite a statement to make.  Businesses exist to make money.  Wasting money on solutions that do not meet the need better is a failure whether the solution functions technically or not.  Many solutions are too expensive yet would not fail; choosing the right solution always involves getting the right price for the resulting situation.  That is just the nature of IT in business.

Using this phrase can sound reasonable to the irrational, defensive brain.  But to outsiders looking in with rational views it actually sounds like “well, I got away with…” fill in the blank: “wasting money”, “being risky”, “not doing my due diligence”, “not doing my job”, or whatever the case may be.  And likely whatever you think should be filled in there will not be as bad as what others assume.

If you are tempted to justify past actions by saying “it worked for me” or by providing anecdotal evidence that shows nothing, stop and think carefully.  Give yourself time to calm down and evaluate your response.  Is it based on logic or on irrational amygdala emotions?  Don’t be ashamed of having the reaction, everyone has it.  It cannot be escaped.  But learning how to deal with it can allow us to approach criticism and critique with an eye towards improvement rather than defense.  If we are defensive, we lose the value in peer review, which is so important to what we do as IT professionals.

Doing IT at Home: Enterprise Networking

In this fifth installment of the continuing series on Doing IT at Home I would like to focus on enterprise networking.  Many of the ways in which we can bring business class IT into our homes can be done for free; networking, sadly, is not one of those areas.  But it does not have to be as costly as you may at first think, and a good, solid, enterprise-class home network can bring many features that other IT at Home projects do not.

Implementing a real, working business class network at home lays the foundation for a lot of potential learning, experimenting, testing and growth; and compared to other, smaller and less ambitious projects this one will likely shine very brightly on a curriculum vitae.

Now we have to start by defining what we mean by “enterprise networking.”  Clearly the needs and opportunities for networking at home are not the same as they are in a real business, especially not a large one – at least without resorting to a pure lab setup, which is not our goal of bringing IT home.  Having a lab at home is excellent and I highly recommend it, but I would not recommend building a “true lab” in your home until you have truly taken advantage of the far better opportunity to treat your home as a production “living lab” environment.  Needing your “living lab” to be up and running, in use every day, changes how you view it, how you treat it and what you will take away from the experience.  A pure lab can be very abstract and it is easy to treat it in such a way that much of the educational opportunity is lost.

There are many aspects of enterprise networking that make sense to apply to our homes.  Every home is different, so I will only present some ideas, and I would love to hear what others can come up with as interesting ways to take home networking to the next level.

Firewall or Unified Threat Management (UTM):  This is the obvious starting point for upgrading any home network.  Most homes use a free multi-purpose device provided by their ISP that lacks features and security.  The firewall is the most featureful networking device that you will use in a home or in a small business and is the most important for providing basic security.  Your firewall provides the foundation of your home or small business network, so getting it in place first makes sense.

There are numerous firewall and UTM products on the market.  Even for home or SMB use you will be flush with options.  You can only practically use a single unit and you need one powerful enough to handle the throughput of a consumer WAN connection, which may be a challenge with some vendors as consumer Internet access is getting very fast and requires quite a bit of processing power, especially from a UTM solution.

Choosing a firewall will likely come down mostly to your career goals and price.  If you hope to pursue a career or certification in Cisco, Juniper or Palo Alto, for example, you will want to get devices from those vendors that allow you to do training at home.  These will be very expensive options but if that career path is your chosen one, having that gear at home will be immensely valuable not only for your learning and testing but for interviewing as well.

If you don’t have specific security or networking career goals your options are more open.  There are traditional small business firewall suppliers like Netgear ProSafe that are low cost and easy to manage.  UTM devices, like the Netgear ProSecure line, are starting to enter this market, but these are almost universally more costly.  There is the software-only approach where you provide your own hardware and build the firewall yourself.  This is very popular and has many good options for software including pfSense, SmoothWall, Untangle and VyOS.  These vary in features and complexity.  For most cases, however, I would recommend a Ubiquiti EdgeMax router, which runs Brocade Vyatta firmware.  These are less costly than most UTMs and run enterprise routing and firewall firmware – outside of needing a specific vendor’s product for networking career goals, this is the best learning, security and feature value on the market and will allow learning nearly any firewall or router skill outside of those specific to proprietary vendors.

When starting down the path of enterprise networking at home, remember to consider whether you should also begin acquiring rackmount gear rather than tabletop equipment.  Having a rack, possibly just a half or even a quarter rack or cabinet, at home can make doing home enterprise projects much easier, can make the setup much more attractive in many cases and is, like many of these projects, just that much more impressive.  Consider this when buying gear because firewalls in this category are the hardest to find in rackmount configurations, much to my chagrin.

Switch: It has become common in home networking to forgo a physical switch in favor of pure wireless solutions, and for many homes where networking is not core to function this may make perfect sense.  But for us, it likely does not.  Adding switching makes for more learning opportunities, a better showcase, far more flexibility, faster data transfers inside of the house, a greater number of connections and better reliability.  For a normal home user with few devices, most of which are mobile ones, this would be a waste, but for an IT pro at home, a real switch is practically a necessity.

Switching comes in three key varieties.  Unmanaged, or dumb, switches are all that you would find in a home or most small businesses.  They provide basic connectivity but nothing more.  This might be all that you need if you do not intend to explore deeper learning opportunities in networking.

Smart switching is a step up from an unmanaged switch.  A smart switch is often very low cost but adds additional features, normally through a web interface, that allow you to actively manage the switch, change configurations, troubleshoot, create VLANs and QoS policies, monitor, etc.  For someone looking to step up their at home network and approach networking from a higher-end small business perspective this is a great option and a very practical one for a home.

Managed switches are the most enterprise-oriented and by far the most costly.  These use SNMP and other standard protocols for remote management and monitoring and generally have the most features, although often smart switches have just as many.  Managed switches are rarely practical in a home as their benefits are around scalability, not features, but, like with everything, if learning those features is a key goal then this is another place where spending more money, not only for those features but also to get “name brand” switches like Cisco, Juniper, Brocade or HP, can be an important investment.  But if the goal is only to learn the tools and standards of managed switches and not to go down the path of learning a specific implementation then lower cost options like Netgear ProSafe might make sense.

Once we decide on unmanaged, smart or managed switches, we then have to decide on the “layer” of the switch.  This also has three options: Layer 2, Layer 2+ and Layer 3.  For home and small business use, L2 switches are the most common.  I have never seen more than an L2 in a home and rarely in a small business.  L2 switches are traditional switches that handle only Ethernet switching.  You can create VLANs on L2 switches but you cannot route traffic between the VLANs; that would require a router.  An L2+ switch adds some inter-VLAN traffic handling, allowing traffic to move between VLANs using static routes.  L3 switches have full IP handling and can run dynamic routing protocols.

So if you need to study “large” scale routing, an L3 switch is good.  This is not a common need and would be the most expensive route and would imply that you intend to purchase a lot more networking gear than just one switch.  In a home lab, this might exist, for handling the home itself, it would not.  If you want to implement VLANs in your home, perhaps one LAN, one Voice LAN, a DMZ and one Guest LAN then an L2+ switch is ideal.  If you don’t plan to study VLANing, stick to L2.
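If you go the L2+ route and plan out VLANs like these, sketching the addressing ahead of time helps.  Below is a minimal Python example using the standard ipaddress module; the VLAN IDs and the 192.168.0.0/22 supernet are arbitrary values chosen only for illustration.

    import ipaddress

    # Carve one /24 per VLAN out of a private /22 supernet (illustrative values only).
    supernet = ipaddress.ip_network("192.168.0.0/22")
    vlan_names = {10: "LAN", 20: "Voice", 30: "DMZ", 40: "Guest"}

    for vlan_id, subnet in zip(vlan_names, supernet.subnets(new_prefix=24)):
        print("VLAN %d (%s): %s, gateway %s" %
              (vlan_id, vlan_names[vlan_id], subnet, next(subnet.hosts())))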

Cabling: One aspect of home networking that is far too often overlooked is implementing a quality cabling plant inside the home.  This requires far more effort than other home networking projects and falls more into the electrician space than the IT professional space, but it is also one of the most important pieces from the home owner and end user perspective rather than the IT pro perspective.  A good, well installed cabling plant will make a home more attractive to buyers and make the value of a powerful home network even better.

If you live in an apartment, likely you do not have the option to alter the wiring in this way, unfortunately.  But for home owners, cabling the house can be a great project with a lot of long term value.  Setting up a well labeled and organized cabling plant, just like you would in a business, can be attractive, impressive and eminently useful.  With good, well labeled cabling you can provide high speed, low latency connections without the need for wireless to every corner of your home.  I have found cabling bedrooms, entertainment spaces and even the kitchen to be very valuable.  This allows for higher throughput communications to all devices as wireless congestion is relieved and wired throughput is preferred when possible.  Devices such as video game consoles, smart televisions, receivers, media appliances (a la AppleTV, Roku, Google), desktops, docking stations, stationary laptops, VoIP phones and more all can benefit from the addition of complete cabling.

Wireless Access Point: These days home networks are primarily wireless, with many homes being exclusively wireless.  Even if you follow my advice and have great wired networking you still need wireless, whether for smart phones, tablets, laptops, guest access or whatever.  A typical home network will already have some cheap, probably unreliable wireless from the outset.  But I propose at least a moderate upgrade to this as a good practice in home networking.

Enterprise Access Points have come down in price dramatically and a few vendors have even gotten them below one hundred dollars for high quality, centrally managed devices.  Good devices have high quality radios and antennae that will improve range and reliability.  Generally they will come with extra features like mapping, monitoring, a centralized management console, VLAN support, hotspot login options, multiple SSID support, etc.  Most of these features are not needed in a home network but are commonly used even in a small business, and having them at home for such a low price point makes sense.  If you own a large home, using good Access Points with centralized management can be additionally beneficial in providing whole home coverage.

Having secure guest access via the access point can be very nice in a home allowing guests to be isolated from the data and activities on the home network.  No need to share private passwords and provide access to data that is not necessary while still allowing guests to connect their mobile phones and tablets.  An ever more important feature.

If your home includes outdoor space, adding wireless projects to provide outdoor coverage could also make for a great learning project.  Outdoor access points and specialized antennas can make for a fun and very useful project.  Make yourself able to stay connected even while roaming outdoors.

Good, enterprise access points are often quite attractive as well, being designed to be wall or ceiling mounted, making it easier to put them in good placement locations to better cover your available space.

Power over Ethernet: Now that you are looking at deploying enterprise access points, and if you followed my earlier article on doing a PBX at home and have desktop or wall mount VoIP phones, you may want to consider adding PoE switching to reduce the need for electrical cables or power injectors.  A small PoE switch is not expensive and, while never really necessary, can make your home network that much more interesting and “polished.”  Many security devices take advantage of PoE, as do some project board computers that are increasingly popular today.  The value of adding PoE is ever increasing.

Network Software: Once your home is upgraded to this level, it is only natural to then bring in network management and monitoring software to leverage it even further.  This could be as simple as setting up Wireshark to look at your LAN traffic or it could mean SNMP monitors, NetFlow tools and the like.  What is available to you is highly dependent on the vendors and products that you choose, but the options are there and this is really where much of the benefit comes in regards to the ongoing educational aspects of networking.  Building the network and performing the occasional maintenance will, of course, be very good experience, but having the tools to watch the living network at work and learn from it will be key to the continuing value beyond the impressive end user experience that your household will enjoy.
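As one small taste of what this opens up, the sketch below polls a device’s system description over SNMP using the pysnmp library.  It assumes pysnmp is installed, that the device has SNMP v2c enabled, and that the 192.168.0.1 address and “public” community string are stand-ins for your own values.

    from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                              ContextData, ObjectType, ObjectIdentity, getCmd)

    # Poll sysDescr.0 from a hypothetical device at 192.168.0.1 using SNMP v2c.
    error_indication, error_status, error_index, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData("public", mpModel=1),
        UdpTransportTarget(("192.168.0.1", 161)),
        ContextData(),
        ObjectType(ObjectIdentity("SNMPv2-MIB", "sysDescr", 0))))

    if error_indication:
        print("Polling failed:", error_indication)
    else:
        for var_bind in var_binds:
            print(" = ".join(x.prettyPrint() for x in var_bind))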

The Cult of ZFS

It’s pretty common for IT circles to develop a certain cult-like or “fanboy” mentality.  What causes this reaction to technologies and products I am not quite sure, but that it happens is undeniable.  One area in which I never thought I would see this occur is filesystems – one of the most “under the hood” system components and one that, until recently, received literally no attention even in decently technical circles.  Let’s face it, misunderstanding whether something comes from Active Directory or from NTFS is nearly ubiquitous.  Filesystems are, quite simply, ignored.  Ever since Windows NT 4 was released and NTFS became the only viable option, the idea that a filesystem is not an intrinsic component of an operating system and that there might be other options for file storage has all but faded away.  That is, until recently.

The one community where, to some small degree, this did not happen was the Linux community, but even there Ext2 and its descendants so completely won mindshare that, even though alternative filesystems were widely available, they were sidelined; only XFS received any attention, historically, and even it received very little.

Where some truly strange behavior has occurred, more recently, is around Oracle’s ZFS filesystem, originally developed for the Solaris operating system and the X4500 “Thumper” open storage platform (originally under the auspices of Sun prior to the Oracle acquisition.)  At the time (nine years ago) when ZFS was released, competing filesystems were mostly ill prepared to handle the large disk arrays that were expected to be made over the coming years.  ZFS was designed to handle them and heralded the age of large scale filesystems.  Like most filesystems at that time, ZFS was limited to a single operating system and so, while widely regarded as a great leap forward in filesystem design, it produced few ripples in the storage world and even fewer in the “systems” world, where even Solaris administrators generally considered it a point of interest only for quite some time, mostly choosing to stick to the tried and true UFS that they had been using for many years.

ZFS was, truly, a groundbreaking filesystem and I was, and remain, a great proponent of it.  But it is very important to understand why ZFS did what it did, what its goals are, why those goals were important and how it applies to us today.  The complexity of ZFS has led to much confusion and misunderstanding about how the filesystem works and when it is appropriate to use.

The principal goal of ZFS was to make a filesystem capable of scaling well to very large disk arrays.  At the time of its introduction, the scale of which ZFS was capable was unheard of in other filesystems, but there was no real world need for a filesystem to be able to grow that large.  By the time that the need arose, many other filesystems such as NTFS, XFS, Ext3 and others had scaled to accommodate it.  ZFS certainly led the charge to larger filesystem handling but was joined by many others soon thereafter.

Because ZFS originated in the Solaris world where, like all big iron UNIX systems, there is no hardware RAID, software RAID had to be used.  Solaris had always had software RAID available as its own subsystem.  The decision was made to build a new software RAID implementation directly into ZFS.  This would allow for simplified management via a single tool set for both the RAID layer and the filesystem.  It did not introduce any significant change or advantage to ZFS, as is often believed, it simply shifted the interface for the software RAID layer from being its own command set to being part of the ZFS command set.

ZFS’ implementation of RAID introduced variable width stripes in parity RAID levels.  This innovation closed a minor parity RAID risk known as the “write hole”.  The innovation was very nice but came very late, as the era of reliable parity RAID was beginning to end and the write hole problem was already considered an unmentioned “background noise” risk of parity arrays; it was not generally considered a threat due to its elimination through the use of battery backed array caches and, at about the same time, non-volatile array caches – avoid power loss and you avoid the write hole.  ZFS needed to address this issue because, as software RAID, it is at greater risk from the write hole than hardware RAID is; there is no opportunity for a cache protected against power loss, whereas hardware RAID offers the potential for an additional layer of power protection for arrays.

The real “innovation” that ZFS inadvertently made was that instead of just implementing the usual RAID levels of 1, 5, 6 and 10 it “branded” these levels with its own naming conventions.  RAID 5 is known as RAIDZ.  RAID 6 is known as RAIDZ2.  RAID 1 is just known as mirroring.  And so on.  This was widely considered silly at the time and pointlessly confusing but, as it turned out, that confusion became the cornerstone of ZFS’ revival many years later.

It needs to be noted that ZFS later added the industry’s first production implementation of a RAID 7 (aka RAID 7.3) triple parity RAID system and branded it RAIDZ3.  This later addition is an important innovation for large scale arrays that need the utmost in capacity while remaining extremely safe and are willing to sacrifice performance in order to do so.  This remains a unique feature of ZFS but one that is rarely used.
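For a sense of the capacity trade-off between the parity flavors, the arithmetic is simple.  The Python sketch below assumes equal sized drives and ignores metadata and padding overhead, so the numbers are rough approximations only.

    # Approximate usable capacity for ZFS parity RAID flavors with N equal drives.
    def usable_capacity(n_drives, drive_tb, parity_drives):
        # RAIDZ = 1 drive of parity, RAIDZ2 = 2, RAIDZ3 = 3 (overhead ignored).
        return (n_drives - parity_drives) * drive_tb

    for name, parity in (("RAIDZ", 1), ("RAIDZ2", 2), ("RAIDZ3", 3)):
        print("%s on 12 x 4 TB drives: roughly %d TB usable" %
              (name, usable_capacity(12, 4, parity)))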

In the spirit of collapsing the storage stack and using a single command set to manage all aspects of storage, the logical volume management functions were rolled into ZFS as well.  In certain circles it is often mistakenly believed that ZFS introduced logical volume management, but nearly all enterprise platforms, including AIX, Linux, Windows and even Solaris itself, had already had logical volume management for many years.  ZFS was not doing this to introduce a new paradigm but simply to consolidate management and wrap all three key storage layers (RAID, logical volume management and filesystem) into a single entity that would be easier to manage and could provide inherent communications up and down the stack.  There are pros and cons to this method and an industry opinion remains unformed nearly a decade later.

One of the most important aspects of this consolidation of three systems into one is that now we have a very confusing product to discuss.  ZFS is a filesystem, yes, but it is not only a filesystem.  It is a logical volume manager, but not only a logical volume manager.  People refer to ZFS as a filesystem, which is its primary function, but that it is so much more than a filesystem can be very confusing and makes comparisons against other storage systems difficult.  At the time I believe that this confusion was not foreseen.

What has resulted from this confusing merger is that ZFS is often compared to other filesystems, such as XFS or Ext4.  But this is misleading, as ZFS is a complete stack and XFS is only one layer of a stack.  ZFS would be better compared to MD (Linux software RAID) / LVM / XFS or to SmartArray (HP hardware RAID) / LVM / XFS than to XFS alone.  Otherwise it appears that ZFS is full of features that XFS lacks but, in reality, it is only a semantic victory.  Most of the features often touted by ZFS advocates did not originate with ZFS and were commonly available with the alternative filesystems long before ZFS existed.  But it is hard to answer “does your filesystem do that” when the answer is “no… my RAID or my logical volume manager does that.”  And truly, it is not ZFS the filesystem providing RAIDZ, it is ZFS the software RAID subsystem that is doing so.

In order to gracefully handle very large filesystems, data integrity features were built into ZFS, including a checksum or hash check throughout the filesystem that can leverage the integrated software RAID to repair corrupted files.  This was seen as necessary due to the anticipated size of ZFS filesystems in the future.  Filesystem corruption is a rarely seen phenomenon but as filesystems grow in size the risk increases.  This lesser known feature of ZFS is possibly its greatest.
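The concept is easy to show in miniature.  The toy Python sketch below is not ZFS code; it simply models the idea of keeping a checksum per block and healing a bad copy from its mirror, which is the kind of repair the integrated stack makes possible (SHA-256 stands in here for whatever checksum the real filesystem uses).

    import hashlib

    def checksum(block):
        return hashlib.sha256(block).hexdigest()

    # Two mirrored copies of the same "disk" plus a checksum recorded per block.
    blocks = [b"block-0 data", b"block-1 data", b"block-2 data"]
    mirror_a = list(blocks)
    mirror_b = list(blocks)
    checksums = [checksum(b) for b in blocks]

    mirror_a[1] = b"silently corrupted"   # simulate bit rot on one copy

    for i, expected in enumerate(checksums):
        if checksum(mirror_a[i]) != expected:
            # The other copy still verifies, so heal the bad block from it.
            assert checksum(mirror_b[i]) == expected
            mirror_a[i] = mirror_b[i]
            print("block %d repaired from mirror" % i)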

ZFS also changed how filesystem checks are handled.  Because of the assumption that ZFS will be used on very large filesystems there was a genuine fear that a filesystem check at boot time could take impossibly long to complete, so an alternative strategy was found.  Instead of waiting to do a check at reboot, the system requires a scrubbing process to run and perform a similar check while the system is running.  This requires more overhead while the system is live but the system is able to recover from an unexpected restart more rapidly.  It is a trade off, but one that is widely seen as very positive.

ZFS has powerful snapshotting capabilities in its logical volume layer and has implemented very robust caching mechanisms in its RAID layer, making ZFS an excellent choice for many use cases.  These features are not unique to ZFS and are widely available in systems older than ZFS.  They are, however, very good implementations of each and very well integrated due to ZFS’ nature.

At one time, ZFS was open source and during that era its code became a part of Apple’s Mac OSX and FreeBSD operating systems because their licenses were compatible with the ZFS license.  Linux did not get ZFS at that time due to challenges around licensing.  Had ZFS licensing allowed Linux to use it unencumbered, the Linux landscape would likely be very different today.  Mac OSX eventually dropped ZFS as it was not seen as having enough advantages to justify it in that environment.  FreeBSD clung to ZFS and, over time, it became the most popular filesystem on the platform although UFS is still heavily used as well.  Oracle closed the source of ZFS after the Sun acquisition, leaving FreeBSD without continuing updates to its version of ZFS while Oracle continued to develop ZFS internally for Solaris.

Today Solaris continues to use the original ZFS implementation, now with several updates since its split from the open source community.  FreeBSD and others continued using ZFS in the state it was in when the code was closed, no longer having access to Oracle’s latest updates.  Eventually work to update the abandoned open source ZFS codebase was taken up and is now known as OpenZFS.  OpenZFS is still fledgling and has not yet really made its mark, but it has some potential to revitalize the ZFS platform in the open source space; at this time, however, OpenZFS still lags ZFS.

Open source development for the last several years in this space has focused more on ZFS’ new rival BtrFS, which is being developed natively on Linux and is well supported by many major operating system vendors.  BtrFS is very nascent but is making major strides to reach feature parity with ZFS and, due to ZFS’ closed source nature, has the benefit of market momentum along with its large aspirations.  BtrFS was started, like ZFS, by Oracle and has been widely seen as Oracle’s view of the future, a replacement for ZFS even at Oracle.  At this time BtrFS has already, like ZFS, merged the filesystem, logical volume management and software RAID layers, implemented checksumming for filesystem integrity, scales even larger than ZFS (same absolute limit but handles more files), offers copy on write snapshots, etc.

ZFS, without a doubt, was an amazing filesystem in its heyday and remains a leader today.  I was a proponent of it in 2005 and I still believe heavily in it.  But it has saddened me to see the community around ZFS take on a fervor and zealotry that does it no service and makes the mention of ZFS almost seem a negative – ZFS being so universally chosen for the wrong reasons: primarily a belief that its features exist nowhere else, that its RAID is not subject to the risks and limitations to which those RAID levels are always subject, or that it was designed for a purpose (primarily performance) other than the one it was actually designed for.  And when ZFS is a good choice, it is often implemented poorly based on untrue assumptions.

ZFS, of course, is not to blame.  Nor, as far as I can tell, are its corporate supporters or its open source developers.  Where ZFS seems to have gone awry is in a loose, unofficial community that has only recently come to know ZFS, often believing it to be new or “next generation” because they have only recently discovered it.  From what I have seen this is almost never via Solaris or FreeBSD channels but almost exclusively via smaller businesses looking to use a packaged “NAS OS” like FreeNAS or NAS4Free who are not familiar with UNIX operating systems.  The use of packaged NAS OSes, primarily by IT shops that possess neither deep UNIX nor storage skills and, consequently, little exposure to the broader world of filesystems outside of Windows and often little to no exposure to logical volume management and RAID, especially software RAID, appears to lead to a “myth” culture around ZFS, with it taking on an almost unquestionable, infallible status.

This cult-like following and general misunderstanding of ZFS leads often to misapplications of ZFS or a chain of decision making based off of bad assumptions that can lead one very much astray.

One of the most amazing changes in this space is the swing in following from hardware RAID to software RAID.  Traditionally, software RAID was a pariah in Windows administration circles without good cause – Windows administrators and small businesses, often unfamiliar with larger UNIX servers, believed that hardware RAID was ubiquitous when, in fact, larger scale systems always used software RAID.  Hardware RAID was, almost industry wide, considered a necessity and software RAID completely eschewed.  That same audience, now faced with the “Cult of ZFS” movement, reacts in exactly the opposite way, believing that hardware RAID is bad and that ZFS’ software RAID is the only viable option.  The shift is dramatic and neither extreme is valid – both hardware and software RAID, in many implementations, are very valid options, and even when using ZFS the use of hardware RAID might easily be appropriate.

ZFS is often chosen because it is believed to be the highest performance option among filesystems, but this was never a key design goal of ZFS.  The features allowing it to scale so large and handle so many different aspects of storage actually make being highly performant very difficult.  ZFS, at the time of its creation, was not even expected to be as fast as the venerable UFS which ran on the same systems.  However, this is often secondary to the fact that filesystem performance is largely moot: all modern filesystems are extremely fast and filesystem speed is rarely an important factor – especially outside of massive, high end storage systems on a very large scale.

An interesting study of ten filesystems on Linux produced by Phoronix in 2013 showed massive differences between filesystems by workload but no clear winners as far as overall performance.  What the study showed conclusively is that matching workload to filesystem is the most important choice, that ZFS falls to the slower side of all mainstream filesystems even in its more modern implementations and that choosing a filesystem for performance reasons without a very deep understanding of the workload will result in unpredictable performance – no filesystem should be chosen blindly if performance is an important factor.  Sadly, because the test was done on Linux, it lacked UFS, which is often ZFS’ key competitor especially on Solaris and FreeBSD, and it lacked HFS+ from Mac OSX.

Moving from hardware RAID to software RAID carries additional, often unforeseen risks for shops not experienced in UNIX as well.  While ZFS allows for hot swapping, it is often forgotten that hot swap is primarily a feature of hardware, not of software.  It is also widely unknown that blind swapping (removal of hard drives without first offlining them in the operating system) is not synonymous with hot swapping.  This can lead to disasters for shops moving from a tradition of hardware RAID, which handled compatibility, hot swap and blind swapping transparently for them, to a software RAID system that requires much more planning, coordination and understanding in order to be used safely.

A lesser, but still common misconception of ZFS, is that it is a clustered filesystem suitable for use on shared DAS or SAN scenarios a la OCFS, VxFS and GFS2.  ZFS is not a clustered filesystem and shares the same limitations in that space as all of its common competitors.

ZFS can be an excellent choice but it is far from the only one.  ZFS comes with large caveats, not the least of which is the operating system limitations associated with it, and while it has many benefits few, if any, are unique to ZFS and it is very rare that any shop will benefit from every one of them.  As with any technology, there are trade offs to be made.  One size does not fit all.  The key to knowing when ZFS is right for you is to understand what ZFS is, what is and is not unique about it, what its design goals are, how comparing a storage stack to a pure filesystem produces misleading results and what inherent limitations are tied to it.

ZFS is a key consideration and the common choice when Solaris or FreeBSD is the chosen operating system.  With rare exception, the operating system should never be chosen for ZFS; instead ZFS should often, but not always, be chosen once the operating system has been chosen.  The OS should drive the filesystem choice in all but the rarest of cases.  The choice of operating system is dramatically more important than the choice of filesystem.

ZFS can be used on Linux but is not considered an enterprise option there; it is more of a hobby system for experimentation, as no enterprise vendor (such as Red Hat, SUSE or Canonical) supports ZFS on Linux and Linux has great alternatives already.  Someday ZFS might be promoted to a first class filesystem in Linux but this is not expected, as BtrFS has already entered the mainline kernel and been included in production releases by several major vendors.

While ZFS will be seen in the vast majority of Solaris and FreeBSD deployments, this is primarily because it has moved into the position of default filesystem and not because it is clearly the superior choice in those instances or has even been evaluated critically.  ZFS is perfectly well suited to being a general purpose filesystem where it is native and supported.

What is ZFS’ primary use case?

ZFS’ design goal and principal use case is for Solaris and FreeBSD open storage systems providing either shared storage to other servers or as massive data repositories for locally installed applications.  In these cases, ZFS’ focus on scalability and data integrity really shine.  ZFS leans heavily towards large and enterprise scale shops and generally away from applicability in the small and medium business space where Solaris and FreeBSD skills, as well as large scale storage needs, are rare.

Reference: http://www.phoronix.com/scan.php?page=article&item=linux_310_10fs&num=1