Tag Archives: filesystem

The Cult of ZFS

It’s pretty common for IT circles to develop a certain cult-like or “fanboy” mentality.  What causes this reaction to technologies and products I am not quite sure, but that it happens is undeniable.  One area that I never thought that I would see this occur is in the area of filesystems – one of the most “under the hood” system components and one that, until recently, received literally no attention even in decently technical circles.  Let’s face it, misunderstanding when something comes from Active Directory versus from NTFS is nearly ubiquitous.  Filesystems are, quite simply, ignored.  Ever since Windows NT 4 released and NTFS was the only viable option the idea that a filesystem is not an intrinsic component of an operating system and that there might be other options for file storage has all but faded away.  That is, until recently.

The one community where, to some small degree, this did not happen was the Linux community, but even there Ext2 and its descendants so completely won mindshare that even thought they were widely available, alternative filesystems were sidelines and only XFS received any attention, historically, and even it received very little.

Where some truly strange behavior has occurred, more recently, is around Oracle’s ZFS filesystem, originally developed for the Solaris operating system and the X4500 “Thumper” open storage platform (originally under the auspices of Sun prior to the Oracle acquisition.)  At the time (nine years ago) when ZFS released, competing filesystems were mostly ill prepared to handle large disk arrays that were expected to be made over the coming years.  ZFS was designed to handle them and heralded in the age of large scale filesystems.  Like most filesystems at that time, ZFS was limited only to a single operating system and so, while widely regarded as a great leap forward in filesystem design, it produces few ripples in the storage world and even fewer in the “systems” world where even Solaris administrators generally considered it a point of interest only for quite some time, mostly choosing to stick to the tried and true UFS that they had been using for many years.

ZFS was, truly, a groundbreaking filesystem and I was, and remain, a great proponent of it.  But it is very important to understand why ZFS did what it did, what its goals are, why those goals were important and how it applies to us today.  The complexity of ZFS has lead to much confusion and misunderstanding about how the filesystem works and when it is appropriate to use.

The principle goals of ZFS were to make a filesystem capable of scaling well to very large disk arrays.  At the time of its introduction, the scale to which ZFS was capable was unheard of in other file systems but there was no real world need for a filesystem to be able to grow that large.  By the time that the need arose, many other file systems such as NTFS, XFS, Ext3 and others had scaled to accommodate the need.  ZFS certainly lead the charge to larger filesystem handling but was joined by many others soon thereafter.

Because ZFS originated in the Solaris world where, like all big iron UNIX systems, there is no hardware RAID, software RAID had to be used.  Solaris had always had software RAID available as its own subsystem.  The decision was made to build a new software RAID implementation directly into ZFS.  This would allow for simplified management via a single tool set for both the RAID layer and the filesystem.  It did not introduce any significant change or advantage to ZFS, as is often believed, it simply shifted the interface for the software RAID layer from being its own command set to being part of the ZFS command set.

ZFS’ implementation of RAID introduced variable width stripes in parity RAID levels.  This innovation closed a minor parity RAID risk known as the “write hole”.  This innovation was very nice but came very late as the era of reliable parity RAID was beginning to end and the write hole problem was already considered to be an unmentioned “background noise” risk of parity arrays as it was not generally considered a thread due to its elimination through the use of battery backed array caches and, at about the same time, non-volatile array caches – avoid power loss and you avoid the write hole.  ZFS needed to address this issue because, as software RAID, it was at greater risk to the write hole than hardware RAID is because there is no opportunity for a cache protected against power loss – hardware RAID offers the potential for an additional layer of power protection for arrays.

The real “innovation” that ZFS inadvertently made was that instead of just implementing the usual RAID levels of 1, 5, 6 and 10 they instead “branded” these levels with their own naming conventions.  RAID 5 is known as RAIDZ.  RAID 6 is known as RAIDZ2.  RAID 1 is just known as mirroring.  And so on.  This was widely considered silly at the time and pointlessly confusing but, as it turned out, that confusion because the cornerstone of ZFS’ revival many years later.

It needs to be noted that ZFS later added the industry’s first production implementation of a RAID 7 (aka RAID 7.3) triple parity RAID system and branded it RAIDZ3.  This later addition is an important innovation for large scale arrays that need the utmost in capacity while remaining extremely safe but are willing to sacrifice performance in order to do so.  This remains a unique feature of ZFS but one that is rarely used.

In the spirit of collapsing the storage stack and using a single command set to manage all aspects of storage the logical volume management functions were rolled into ZFS as well.  It is often mistakenly believed that ZFS introduced logical volume management in certain circles but nearly all enterprise platforms, including AIX, Linux, Windows and even Solaris itself, had already had logical volume management for many years.  ZFS was not doing this to introduce a new paradigm but simply to consolidate management and wrap all three key storage layers (RAID, logical volume management and filesystem) into a single entity that would be easier to manage and could provide inherent communications up and down the stack.  There are pros and cons to this method and an industry opinion remains unformed nearly a decade later.

One of the most important aspects of this consolidation of three systems into one is that now we have a very confusing product to discuss.  ZFS is a filesystem, yes, but it is not only a filesystem.  It is a logical volume manager, but not only a logical volume manager.  People refer to ZFS as a filesystem, which is its primary function, but that it is so much more than a filesystem can be very confusing and makes comparisons against other storage systems difficult.  At the time I believe that this confusion was not foreseen.

What has resulted from this confusing merger is that ZFS is often compared to other filesystems, such as XFS or Ext4.  But this is confusing as ZFS is a complete stack and XFS is only one aspect of a stack.  ZFS would be better compared to MD (Linux Software RAID) / LVM / XFS or to SmartArray (HP Hardware RAID) / LVM/ XFS than to XFS alone.  Otherwise it appears that ZFS is full of features that XFS lacks but, in reality, it is only a semantic victory.  Most of the features often touted by ZFS advocates did not originate with ZFS and were commonly available with the alternative filesystems long before ZFS existed. But it is hard to compare “does your filesystem do that” because the answer is “no…. my RAID or my logical volume manager do that.”  And truly, it is not ZFS the filesystem providing RAIDZ, it is ZFS the software RAID subsystem that is doing so.

In order to gracefully handle very large filesystems, data integrity features were built into ZFS which included a checksum or hash check throughout the filesystem that could leverage the inclusive software RAID to repair corrupted files.  This was seen as necessary due to the anticipated size of ZFS filesystems in the future.  Filesystem corruption is a rarely seen phenomenon but as filesystems grow in size the risk increases.  This lesser known feature of ZFS is possibly its greatest.

ZFS also changed how filesystem checks are handled.  Because of the assumption that ZFS will be used on very large filesystems there was a genuine fear that a filesystem check on boot time could take impossibly long to complete and so an alternative strategy was found.  Instead of waiting to do a check at reboot the system would require a scrubbing process to run and perform a similar check while the system was running.  This requires more system overhead while the system is live but the system is able to recover from an unexpected restart more rapidly.  A trade off but one that is widely seen as very positive.

ZFS has powerful snapshotting capabilities in its logical volume layer and in its RAID layer has implemented very robust caching mechanisms making ZFS an excellent choice for many use cases.  These features are not unique in ZFS but are widely available in systems older than ZFS.  They are, however, very good implementations of each and very well integrated due to ZFS’ nature.

At one time, ZFS was open source and during that era its code became a part of Apple’s Mac OSX and FreeBSD operating systesms because they were compatible with the ZFS license.  Linux did not get ZFS at that time due to challenges around licensing.  Had ZFS licensing allowed Linux to use it unencumbered the Linux landscape would likely be very different today.  Mac OSX eventually dropped ZFS as it was not seen as having enough advantages to justify it in that environment.  FreeBSD clung to ZFS and, over time, it became the most popular filesystem on the platform although UFS is still heavily used as well.  Oracle closed the source of ZFS after the Sun acquisition leaving FreeBSD without continuing updates to its version of ZFS while Oracle continued to develop ZFS internally for Solaris.

Today Solaris remains using the original ZFS implementation now with several updates since its split with the open source community.  FreeBSD  and others continued using ZFS in the state it was when the code was closed source, no longer having access to Oracle’s latest updates.  Eventually work to update the abandoned open source ZFS codebase was taken up and is now known as OpenZFS.  OpenZFS is still fledgling and has not yet really made its mark but has some potential to revitalize the ZFS platform in the open source space but at this time, OpenZFS still lags ZFS.

Open source development for the last several years in this space has focused more on ZFS’ new rival BtrFS which is being developed natively on Linux and is well supported from many major operating system vendors.  BtrFS is very nascent but is making major strides to reach feature parity with ZFS in implemented features but has large aspirations and due to ZFS’ closed source nature has the benefit of market moment.  BtrFS was started, like ZFS, by Oracle and has been widely seen as Oracle’s view of the future being a replacement for ZFS even at Oracle.  At this time BtrFS has already, like ZFS, merged the filesystem, logical volume management and software RAID layers, implemented checksumming for filesystem integrity, scales even larger than ZFS (same absolute limit but handles more files), copy on write snapshots, etc.

ZFS, without a doubt, was an amazing filesystem in its heyday and remains a leader today.  I was a proponent of it in 2005 and I still believe heavily in it.  But it has saddened me to see the community around ZFS take on a fervor and zealotry that  does it no service and makes the mention of ZFS almost seem as a negative – ZFS being so universally chosen for the wrong reasons: primarily a belief that its features exist nowhere else, that its RAID is not subject to the risks and limitations that those RAID levels are always subject to or that it was designed for a different purpose (primarily performance) other than what it was designed for.  And when ZFS is a good choice, it is often implemented poorly based on untrue assumptions.

ZFS, of course, is not to blame.  Nor, as far as I can tell, are its corporate supporters or its open source developers.  Where ZFS seems to have gone awry is in a loose, unofficial community that has only recently come to know ZFS, often believing it to be new or “next generation” because they have only recently discovered it.  From what I have seen this is almost never via Solaris or FreeBSD channels but almost exclusively smaller businesses looking to use a packaged “NAS OS” like FreeNAS or NAS4Free who are not familiar with UNIX OSes.  The use of packaged NAS OSes, primarily by IT shops that possess neither deep UNIX nor storage skills and, consequently, little exposure to the broader world of filesystems outside of Windows and often little to no exposure to logical volume management and RAID, especially software RAID at all, appears to lead to a “myth” culture around ZFS with it taking on an almost unquestionable, infallible status.

This cult-like following and general misunderstanding of ZFS leads often to misapplications of ZFS or a chain of decision making based off of bad assumptions that can lead one very much astray.

One of the most amazing changes in this space is the change in following from hardware RAID to software RAID.  Traditionally, software RAID was a pariah in Windows administration circles without good cause – Windows administrators and small businesses, often unfamiliar with larger UNIX servers believed that hardware RAID was ubiquitous when, in fact, larger scale systems always used software RAID.  Hardware RAID was, almost industry wide, considered a necessity and software RAID completely eschewed.  That same audience, now faced with the “Cult of ZFS” movement, now react in exactly the opposite way believing that hardware RAID is bad and that ZFS’ software RAID is the only viable option.  The shift is dramatic and neither approach is valid – both hardware and software RAID and both in many implementations are very valid options and even using ZFS the use of hardware RAID might easily be appropriate.

ZFS is often chosen because it is believed that it is the highest performance option in filesystems but this was never a key design goal of ZFS.  The features allowing it to scale so large and handle so many different aspects of storage actually make being highly performant very difficult.  ZFS, at the time of its creation, was not even expected to be as fast as the venerable UFS which ran on the same systems as it.  However, this is often secondary to the fact that filesystem performance is widely moot as all modern filesystems are extremely fast and filesystem speed is rarely an important factor – especially outside of massive, high end storage systems on a very large scale.

An interesting study of ten filesystems on Linux produced by Phoronix in 2013 showed massive differences in filesystems by workload but no clear winners as far as overall performance.  What the study showed conclusively is that matching workload to filesystem is the most important choice, that ZFS falls to the slower side of all mainstream filesystems even in its more modern implementations and that choosing a filesystem for performance reasons without a very deep understanding of the workload will result in unpredictable performance – no filesystem should be chosen blindly if performance is an important factor.  Sadly, because the test was done on Linux, it lacked UFS which is often ZFS’ key competitor especially on Solaris and FreeBSD and it lacked HFS+ from Mac OSX. 

Moving from hardware RAID to software RAID carries additional, often unforeseen risks to shops not experienced in UNIX as well.  While ZFS allows for hot swapping, it is often forgotten that hot swap is primarily a feature of hardware, not of software, and it is also widely unknown that blind swapping (removal of hard drives without first offlining them in the operating system) is not synonymous with hot swapping and this can lead to disasters for shops moving from a tradition of hardware RAID that handled compatibility, hot swap and blind swapping transparently for them to a software RAID system that requires much more planning, coordination and understanding of the system in order to use safely.

A lesser, but still common misconception of ZFS, is that it is a clustered filesystem suitable for use on shared DAS or SAN scenarios a la OCFS, VxFS and GFS2.  ZFS is not a clustered filesystem and shares the same limitations in that space as all of its common competitors.

ZFS can be an excellent choice but it is far from the only one.  ZFS comes with large caveats, not the least of which is the operating system limitations associated with it, and while it has many benefits few, if any, are unique to ZFS and it is very rare that any shop will benefit from every one of them.  As with any technology, there are trade offs to be made.  One size does not fit all.  The key to knowing when ZFS is right for you is to understand what ZFS is, what is and is not unique about it, what its design goals are, how comparing a storage storage to a pure filesystem produces misleading results and what inherent limitations are tied to it.

ZFS is a key consideration and the common choice when Solaris or FreeBSD is the chosen operating system.  With rare exception, the operating system should never be chosen for ZFS, but instead ZFS should be often chosen, but not always, when the operating system is chosen.  The OS should drive the filesystem choices in all but the rarest of cases.  The choice of operating system is so dramatically more important than the choice of filesystem.

ZFS can be used on Linux but is not considered an enterprise option there but more of a hobby system for experimentation as no enterprise vendor (such as Red Hat, Suse or Canonical) support ZFS on Linux and as Linux has great alternatives already.  Someday ZFS might be promoted to a first class filesystem in Linux but this is not expected as BtrFS has already entered the mainline kernel and been included in production releases by several major vendors.

While ZFS will be seen in the vast majority of Solaris and FreeBSD deployments, this is primarily because it has moved into the position of default filesystem and not because it is clearly the superior choice in those instances or has even been evaluated critically.  ZFS is perfectly well suited to being a general purpose filesystem where it is native and supported.

What is ZFS’ primary use case?

ZFS’ design goal and principal use case is for Solaris and FreeBSD open storage systems providing either shared storage to other servers or as massive data repositories for locally installed applications.  In these cases, ZFS’ focus on scalability and data integrity really shine.  ZFS leans heavily towards large and enterprise scale shops and generally away from applicability in the small and medium business space where Solaris and FreeBSD skills, as well as large scale storage needs, are rare.

Reference: http://www.phoronix.com/scan.php?page=article&item=linux_310_10fs&num=1