Tag Archives: storage

The Software RAID Inflection Point

In June, 2001 something amazing happened in the IT world: Intel released the Tualatin based Pentium IIIS 1.0 GHz processor. This was one of the first few Intel processors (IA32 architecture) to have crossed the 1 GHz clock barrier and the first of any significance. It was also special in that it had dual processor support and a double sized cache compared to its Coppermine based forerunners or its non-"S" Tualatin successor (that followed just one month behind.) The PIIIS system boards were insanely popular in their era and formed the backbone of high performance commodity servers, such as Proliant and PowerEdge, in 2001 and for the next few years, culminating in the Pentium IIIS 1.4GHz dual processor systems that were so important that they kicked off the now famous HP Proliant "G" naming convention. The Pentium III boxes were "G1".

What does any of this have to do with RAID? Well, we need to step back and look at where RAID was up until May, 2001. From the 1990s up to May, 2001, hardware RAID was the standard for the IA32 server world, which mainly included systems like Novell Netware, Windows NT 4, Windows 2000 and some Linux. Software RAID did exist for some of these systems (not Netware), but servers were always struggling for CPU and memory resources, and expending those precious resources on RAID functions was costly: applications would compete with RAID for access and the systems would often choke on the conflict. Hardware RAID solved this by adding dedicated CPU and RAM just for these functions.

RAID in the late 1990s and early 2000s was also very heavily based around parity striping, RAID 5 and, to a lesser degree, RAID 6. Disks were tiny and extremely expensive per unit of capacity, so squeezing maximum capacity out of the available disks was of utmost priority, and risks like URE (unrecoverable read error) were so trivial at those small capacities that parity RAID was very reliable, all things considered. The factors were completely different than they would be by 2009. In 2001, it was still common to see 2.1GB, 4.3GB and 9GB hard drives in enterprise servers!

Because parity RAID was the order of the day, and many drives were typically used on each server, RAID had more CPU overhead on average in 2000 than it did in 2010! So the impact of RAID on system resources was very significant.
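To make that overhead concrete, every stripe written to a parity array requires the host (or the RAID controller) to compute a parity block across all of the data blocks in the stripe. Here is a minimal Python sketch of RAID 5 style XOR parity, simplified purely for illustration:

```python
from functools import reduce

def xor_parity(data_blocks):
    """Compute the XOR parity block for one stripe of equally sized data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*data_blocks))

def rebuild_missing(surviving_blocks, parity):
    """Reconstruct a lost block: XOR the parity with every surviving block."""
    return xor_parity(surviving_blocks + [parity])

# A stripe across four data disks with 4 KiB blocks, plus its parity block.
stripe = [bytes([i]) * 4096 for i in (1, 2, 3, 4)]
parity = xor_parity(stripe)

# Lose the first disk and rebuild its block from the remaining disks plus parity.
assert rebuild_missing(stripe[1:], parity) == stripe[0]
```

On a 2001-era server, XOR loops like these ran on the same CPU and memory bus as the applications, which is exactly the kind of work that hardware RAID existed to offload.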

And that is the background. But in June, 2001 suddenly the people who had been buying very low powered IA32 systems had access to the Tualatin Pentium IIIS processors with greatly improved clock speeds, efficient dual processor support and double sized on chip caches that presented an astounding leap in system performance literally overnight. With all this new power and no corresponding change in software demands, systems that had traditionally been starved for CPU and RAM suddenly had more than they knew what to do with, especially as additional threads were available and most applications of the time were single threaded.

The system CPUs, even in the Pentium III era, were dramatically more powerful than the small CPUs on the hardware RAID controllers, which were often entry level PowerPC or MIPS chips, and the available system memory was often much larger than the hardware RAID caches. Investing in extra system memory was often far more effective and generally more advantageous. So, with free capacity available on the main system, RAID functions could, on average, be moved from the hardware RAID cards to the central system and gain performance, even while giving up the additional CPU and RAM of the hardware RAID cards. This was not true on overloaded systems, those starved for resources, and was more relevant for parity RAID, with RAID 6 benefiting the most and non-parity levels like RAID 1 and 0 benefiting the least.

But June, 2001 was the famous inflection point – before that date the average IA32 system was faster when using hardware RAID; after June, 2001 new systems purchased would, on average, be faster with software RAID. With each passing year the advantage has leaned more and more towards software RAID as underutilized CPU cores, idle threads and spare RAM have become abundant, with the only trend favoring hardware RAID being the drop in parity RAID usage as mirrored RAID took over as the standard while disk sizes increased dramatically and capacity costs dropped.

Today it has been more than fifteen years since the notion that hardware RAID would be faster was retired. The belief lingers on primarily due to the odd "Class of 1998" effect, but this has long been a myth repeated improperly by those who did not take the time to understand the original source material. Hardware RAID continues to have benefits, but performance has not been one of them for the majority of the time that we have had RAID and is not expected to ever be one again.

Logical Volume Managers

A commonly used but often overlooked or misunderstood storage tool is the Logical Volume Manager.  Logical Volume Managers, or LVMs, are a storage abstraction, encapsulation and virtualization technology used to provide a level of flexibility often otherwise unavailable.

Most commonly an LVM is used to replace traditional partitioning systems, and sometimes additional functionality is rolled into an LVM such as RAID functions.  Nearly all operating systems offer an integrated LVM product today and most have for a very long time.  LVMs have become a standard feature of both server and client side storage management.

LVMs do not necessarily offer uniform features but common features often included in an LVM are logical volumes (soft partitioning), thin provisioning, flexible physical location allocation, encryption, simple RAID functionality (commonly only mirror based RAID) and snapshots.  Essentially all LVMs offer logical volumes, snapshots and flexible allocation; these being considered fundamental LVM functions.

Popular LVMs include Logical Disk Manager on Windows 2000 Server through Windows Server 2008 R2, Storage Spaces on Windows Server 2012 and later, LVM on Linux, BtrFS on Linux, Core Storage on Mac OS X, Solaris Volume Manager on Solaris, ZFS on Solaris and FreeBSD, Vinum Volume Manager on FreeBSD, Veritas Volume Manager for most UNIX systems, LVM on AIX and many more. LVMs have been increasingly popular and standard since the late 1980s. ZFS and BtrFS are interesting as they are filesystems that implement an LVM inside of the filesystem as an integrated system.

An LVM consumes block devices (drive appearances) and creates logical volumes (often referred to as LVs) which are themselves drive appearances as well.  Because of this, an LVM can sit at any of many different places in the storage stack.  Most commonly we would expect an LVM to consume a RAID array, split one RAID array into one or more logical volumes with each logical volume having a filesystem applied to it.  But it is completely possible for an LVM to sit directly on physical storage without RAID, and it is very possible for RAID to be implemented via software on top of the logical volumes rather than beneath them.  LVMs are also very useful for combining many different storage systems into one such as combining many physical devices and/or RAID arrays into a single, abstracted entity that can then be split up into logical volumes (with single volumes potentially utilizing many different underlying storage devices.)  One standard use of an LVM is to combine many SAN LUNs (potentially from a single SAN system or potentially from several different ones) into a single volume group.
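As a purely conceptual sketch (not modeled on any particular LVM's actual API or terminology), the stack described above can be pictured in a few lines of Python: physical block devices are pooled into a volume group, and logical volumes are then carved from that pool, potentially spanning several underlying devices:

```python
from dataclasses import dataclass, field

@dataclass
class PhysicalVolume:
    """A 'drive appearance' handed to the LVM: a disk, partition, RAID array or SAN LUN."""
    name: str
    size_gb: int
    free_gb: int = 0

    def __post_init__(self):
        self.free_gb = self.size_gb

@dataclass
class VolumeGroup:
    """The pool that the LVM manages and carves logical volumes out of."""
    name: str
    pvs: list = field(default_factory=list)
    lvs: dict = field(default_factory=dict)

    def free_gb(self):
        return sum(pv.free_gb for pv in self.pvs)

    def create_lv(self, name, size_gb):
        """Allocate a logical volume, drawing space from whichever PVs have room."""
        if size_gb > self.free_gb():
            raise ValueError("not enough free space in volume group")
        allocation, remaining = [], size_gb
        for pv in self.pvs:
            take = min(pv.free_gb, remaining)
            if take:
                pv.free_gb -= take
                allocation.append((pv.name, take))
                remaining -= take
        self.lvs[name] = allocation          # one LV may span several PVs
        return allocation

# Two SAN LUNs combined into a single volume group, then split into logical volumes.
vg = VolumeGroup("vg_data", pvs=[PhysicalVolume("lun0", 500), PhysicalVolume("lun1", 500)])
print(vg.create_lv("lv_app", 300))   # [('lun0', 300)]
print(vg.create_lv("lv_db", 400))    # [('lun0', 200), ('lun1', 200)]
```

Real implementations follow the same basic pattern; on Linux, for example, pvcreate, vgcreate and lvcreate walk through exactly these physical volume, volume group and logical volume steps.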

While LVMs provide power and flexibility for working with multiple storage devices and types of storage devices while presenting a standard interface to higher layers in the storage stack, probably the most common usages are to provide flexibility where rigid partitions used to be and to provide snapshots. Traditional partitions are rigid and cannot easily be resized. Logical volumes can almost always be grown or shrunk as needed, making them tremendously more flexible.

Snapshots have become a major focus of LVM usage in the last decade, although mostly this has happened because awareness of snapshots has grown rather than because of a recent shift in availability. Commodity virtualization systems have brought snapshots from an obscure piece of storage industry knowledge into the IT mainstream. Much of how virtualization technologies tackle storage virtualization can be thought of as related to LVMs, but generally this is similar functionality offered in a different manner or simply LVM functionality passed on from a lower layer.

Today you can expect to find LVMs in use nearly everywhere, even implemented transparently on storage arrays (such as SAN equipment) to provide more flexible provisioning.  They are not just standardly available, but standardly implemented and have done much to improve the reliability and capability of modern storage.

When to Consider a SAN?

Everyone seems to want to jump into purchasing a SAN, sometimes quite passionately. SANs are, admittedly, pretty cool. They are one of the more fun and exciting, large scale hardware items that most IT professionals get a chance to have in their own shop. Often the desire to have a SAN of one's own is a matter of "keeping up with the Joneses" as using a SAN has become a bit of a status symbol – one of those last bastions of big business IT that you only see in a dedicated server closet and never in someone's home (well, almost never.) SANs are pushed heavily, advertised and sold as amazing boxes with internal redundancy making them infallible, speed that defies logic and loads of features that you never knew you needed. When speaking to IT pros designing new systems, one of the most common design aspects that I hear is "well we don't know much about our final design, but we know that we need a SAN."

In the context of this article, I use SAN in its most common sense, that is to mean a "block storage device," and not to refer to the entire storage network itself. A storage network can exist for NAS but not use a SAN block storage device at all, so for this article SAN refers exclusively to SAN as a device, not SAN as a network. SAN is a soft term used to mean multiple things at different times and can become quite confusing. A SAN configured without a network becomes DAS; DAS that is networked becomes SAN.

Let's stop for a moment. SAN is your back end storage. The need for it would be, in all cases, determined by other aspects of your architecture. If you have not yet decided upon many other pieces, you simply cannot know that a SAN is going to be needed, or even useful, in the final design. Red flags. Red flags everywhere. Imagine a Roman chariot race with the horses pushing the chariots (if you know what I mean.)

It is clear that the drive to implement a SAN is so strong that often entire projects are devised with little purpose except, it would seem, to justify the purchase of the SAN. As with any project, the first question that one must ask is "What is the business need that we are attempting to fill?" and work from there – not "We want to buy a SAN; where can we use it?" SANs are complex, and with complexity comes fragility. Very often SANs carry high cost. But the scariest aspect of a SAN is the widespread lack of deep industry knowledge concerning them. SANs pose huge technical and business risk that must be overcome to justify their use. SANs are, without a doubt, very exciting and quite useful, but that is seldom good enough to warrant the desire for one.

We refer to SANs as “the storage of last resort.”  What this means is, when picking types of storage, you hope that you can use any of the other alternatives such as local drives, DAS (Direct Attach Storage) or NAS (Network Attached Storage) rather than SAN.  Most times, other options work wonderfully.  But there are times when the business needs demand requirements that can only reasonably be met with a SAN.  When those come up, we have no choice and must use a SAN.  But generally it can be avoided in favor of simpler and normally less costly or risky options.

I find that most people looking to implement a SAN are doing so under a number of misconceptions.

The first is that SANs, by their very nature, are highly reliable. While there are certainly many SAN vendors and specific SAN products that are amazingly reliable, the same could be said about any IT product. High end servers in the price range of high end SANs are every bit as reliable as SANs. Since SANs are made from the same hardware components as normal servers, there is no magic to making them more reliable; anything that can be used to make a SAN reliable is a trickle down of server RAS (Reliability, Availability and Serviceability) technologies. NAS and DAS devices, as well as local disks, can be made just as incredibly reliable. SAN only refers to the device being used to serve block storage rather than perform some other task. A SAN is just a very simple server. SANs encompass the entire range of reliability, from mainframe-like reliability at the top end down to devices that are nothing more than external hard drives – the most unreliable devices on your network – at the bottom end. So rather than SAN meaning reliability, it actually includes some special cases of the lowest reliability you can imagine. But, for all intents and purposes, all servers share roughly equal reliability concerns. SANs gain a reputation for reliability because businesses often put extreme budgets into their SANs that they do not put into their servers, so the comparison ends up being between a relatively high end SAN and a relatively budget server.

The second is that SAN means "big" and NAS means "small." There is no such association. Both SANs and NASs can be of nearly any scale or quality. They both run the gamut, and the technology chosen gives not the slightest suggestion as to whether a device is large or not. Again, as above, a SAN can technically come "smaller" than a NAS solution due to its possible simplicity, but this is a specialty case and mostly theoretical; there are SAN products on the market in this category, but it is very rare to find them in use.

The third is that SAN and NAS are dramatically different inside the chassis.  This is certainly not the case as the majority of SAN and NAS devices today are what is called “unified storage” meaning a storage appliance that acts simultaneously as both SAN and NAS.  This highlights that the key difference between the two is not in backend technology or hardware or size or reliability but the defining difference is the protocols used to transfer storage.  SANs are block storage exposing raw block devices onto the network using protocols like fibre channel, iSCSI, SAS, ZSAN, ATA over Ethernet (AoE) or Fibre Channel over Ethernet (FCoE.)  NAS, on the other hand, uses a network file system and exposes files onto the network using application layer protocols like NFS, SMB, AFP, HTTP and FTP which then ride over TCP/IP.

The fourth is that SANs are inherently a file sharing technology. This is NAS. SAN is simply taking your block storage (hard disk subsystem) and making it remotely available over a network. The nature of networks suggests that we can attach that storage to multiple devices at once and indeed, physically, we can – just as we used to be able to physically attach multiple controllers to opposite ends of a SCSI ribbon cable with hard drives dangling in the middle. This will, under normal circumstances, destroy all of the data on the drives as the controllers, which know nothing about each other, overwrite each other's data causing near instant corruption. There are mechanisms available in special clustered filesystems and their drivers to allow for this, but this requires special knowledge and understanding that is far more technical than many people acquiring SANs are aware that they need for what they often believe is the very purpose of the SAN – a disaster so common that I probably speak to someone who has done just this almost weekly. That the SAN puts at risk the very use case most people believe it is designed to handle – not only failing to deliver the nearly magic protection sought but, to the contrary, being the very cause of the loss of data – exposes the level of risk that implementing misunderstood storage technology carries with it.
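A contrived Python sketch, standing in for no real filesystem, shows why uncoordinated shared block access destroys data: each host keeps its own private idea of which blocks are free on the shared device, and neither ever sees the other's allocations.

```python
class SharedBlockDevice:
    """Stands in for a SAN LUN that has been mapped to two hosts at once."""
    def __init__(self, block_count=8, block_size=16):
        self.blocks = [b"\x00" * block_size for _ in range(block_count)]

    def write(self, index, data):
        self.blocks[index] = data

class NonClusteredFilesystem:
    """Each host's filesystem tracks free blocks privately; it has no idea
    that another host is writing to the very same device."""
    def __init__(self, device):
        self.device = device
        self.free = list(range(len(device.blocks)))   # private free-block list

    def create_file(self, payload):
        block = self.free.pop(0)            # both hosts will happily pick block 0 first
        self.device.write(block, payload)
        return block

lun = SharedBlockDevice()
host_a = NonClusteredFilesystem(lun)
host_b = NonClusteredFilesystem(lun)

host_a.create_file(b"host A data")
host_b.create_file(b"host B data")          # silently overwrites host A's block
print(lun.blocks[0])                        # b'host B data' - host A's file is gone
```

Clustered filesystems such as VMFS, OCFS2 or GFS2 exist precisely to coordinate that bookkeeping between hosts; without one, simultaneous attachment is simply data loss waiting to happen.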

The fifth is that SANs are fast. SANs can be fast; they can also be horrifically slow. There is no intrinsic speed boost from the use of SAN technology on its own. It is actually fairly difficult for SANs to overcome the inherent bottlenecks introduced by the network on which they sit. Since other storage options such as DAS use all the same technologies as SAN but lack the bottleneck and latency of the actual network, an equivalent DAS will generally be a little faster than its SAN counterpart. SANs are generally a little faster than a hardware-identical NAS equivalent, but even this is not guaranteed. SAN and NAS behave differently and in different use cases either may be the better performer. SAN would rarely be chosen as a solution based on performance needs.

The sixth is that, simply by being a SAN, the inherent problems associated with storage choices no longer apply. A good example is the use of RAID 5. This would be considered bad practice in a server, but when working with a SAN (which in theory is far more critical than a stand alone server) careful storage subsystem planning is often eschewed based on a belief that the SAN has somehow fixed those issues or that they do not apply. It is true that some high end SANs do have risk mitigation features unlikely to be found elsewhere, but these are rare and exclusively relegated to very high end units where using fragile designs would already be uncommon. It is a dangerous, but very common, practice to take great care and consideration when planning storage for a physical server but to skip that same planning and oversight when using a SAN, based on the assumption that the SAN handles all of that internally or that it is simply no longer needed.

Having shot down many misconceptions about SAN, one may be wondering if SANs are ever appropriate. They are, of course, quite important and incredibly valuable when used correctly. The strongest points of SANs come from consolidation and special types of shared storage.

Consolidation was the historical driver bringing customers to SAN solutions. A SAN allows us to combine many filesystems into a single disk array, allowing far more efficient use of storage resources. Because SAN is block level, it is able to do this anytime that a traditional, local disk subsystem could be employed. In many servers, and even many desktops, storage space is wasted due to the necessities of growth, planning and disk capacity granularity. If we have twenty servers each with 300GB drive arrays but each only using 80GB of that capacity, we have large waste. With a SAN we could consolidate to just 1.6TB, plus a small amount necessary for overhead, and spend far less on physical disks than if each server maintained its own storage.
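The arithmetic behind that example is worth spelling out; the 20% overhead margin below is just an assumed figure for illustration, not a rule:

```python
servers = 20
provisioned_per_server_gb = 300
used_per_server_gb = 80

local_total_gb = servers * provisioned_per_server_gb    # 6000 GB purchased across the servers
actually_used_gb = servers * used_per_server_gb          # 1600 GB of real data
wasted_gb = local_total_gb - actually_used_gb            # 4400 GB sitting idle

overhead = 0.20                                          # assumed growth margin, for illustration
san_capacity_gb = actually_used_gb * (1 + overhead)      # 1920 GB to provision on the SAN

print(local_total_gb, actually_used_gb, wasted_gb, san_capacity_gb)
```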

Once we begin consolidating storage we begin to look for advanced consolidation opportunities. Having consolidated many server filesystems onto a single SAN we have the chance, if our SAN implementation supports it, to deduplicate and compress that data which, in many cases such as server filesystems, can result in significant utilization reduction. So our 1.6TB in the example above might actually end up being only 800GB or less. Suddenly our consolidation numbers are getting better and better.

To efficiently leverage consolidation it is necessary to have scale, and this is where SANs really shine – when scale, both in capacity and, more importantly, in the number of attaching nodes, becomes very large. SANs are best suited to large scale storage consolidation. This is their sweet spot and what makes them nearly ubiquitous in large enterprises and very rare in small ones.

SANs are also very important for certain types of clustering and shared storage that require single shared filesystem access. This is actually a pretty rare need outside of one special circumstance – databases. Most applications are happy to utilize any type of storage provided to them, but databases often require low level block access to be able to properly manipulate their data most effectively. Because of this they can rarely be used, or used effectively, on NAS or file servers. Providing high availability storage environments for database clusters has long been a key use case of SAN storage.

Outside of these two primary use cases, which justify the vast majority of SAN installations, SAN also provides a high level of storage flexibility, making it potentially very simple to move, grow and modify storage in a large environment without needing to deal with physical moves or complicated procurement and provisioning. Again, like consolidation, this is an artifact of large scale.

In very large environments, a SAN can also be used to provide a point of demarcation between the storage and systems engineering teams, allowing for a handoff at the network layer, generally fibre channel or iSCSI. This clear separation of duties can be critical in allowing teams to be highly segregated in companies that want highly discrete storage, network and systems teams. It allows the storage team to do nothing but focus on storage and the systems team to do nothing but focus on the systems, without any need for knowledge of the other team's implementations.

For a long time SANs also presented themselves as a convenient means to improve storage performance. This is not an intrinsic property of SAN but an outgrowth of their common use for consolidation. Similarly to virtualization when used for consolidation, shared SANs have a natural advantage in better utilization of available spindles, centralized caches and bigger hardware than the equivalent storage spread out among many individual servers. Like shared CPU resources, when the SAN is not receiving requests from multiple clients it can dedicate all of its capacity to servicing the requests of a single client, providing an average performance experience potentially far higher than what an individual server would be able to affordably achieve on its own.

Using SAN for performance is rapidly fading from favor, however, because SSD storage has become very common. As SSDs with incredibly low latency and high IOPS drop in price to the point where they are being added to stand alone servers as local cache, or even used as mainline storage, the bottleneck of the SAN's network becomes a larger and larger factor, making it increasingly difficult for the consolidation benefits of a SAN to offset the performance benefits of local SSDs. SSDs are potentially very disruptive for the shared storage market as they bring the performance advantage back towards local storage – just the latest in the ebb and flow of storage architecture design.

The most important aspect of SAN usage to remember is that SAN should not be a default starting point in storage planning. It is one of many technology choices and one that often does not fit the bill as intended, or does so only at an unnecessarily high price, whether in monetary or complexity terms. Start by defining business goals and needs. Select SAN when it solves those needs most effectively, but keep an open mind and consider the overall storage needs of the environment.

Network RAID Notation Standard (SAM RAID Notation)

As the RAID landscape becomes more complex with the emergence of network RAID, there is an important need for a more expressive, yet concise, notation system for RAID levels involving a network component.

Traditional RAID comes in single digit notation and the available levels are 0, 1, 2, 3, 4, 5, 6, 7.  Level 7 is unofficial but widely accepted as triple parity RAID (the natural extension of RAID 5 and RAID 6) and RAID 2 and RAID 3 are effectively disused today.

Nested RAID, one RAID level within another, is handled by putting single digit RAID levels together such as RAID 10, 50, 61, 100, etc.  These can alternatively be written with a plus sign separating the levels like RAID 1+0, 5+0, 6+1, 1+0+0, etc.

There are two major issues with this notation system, beyond the obvious problem that not all RAID types or extensions are accounted for by the single digit system, with many aspects of proprietary RAID systems such as ZRAID, XRAID and BeyondRAID left unaccounted for. The first is a lack of network RAID notation and the second is a lack of specific denotation of intra-RAID configuration.

Network RAID comes in two key types, synchronous and asynchronous. Synchronous network RAID operates effectively identically to its non-networked counterpart. Asynchronous network RAID functions the same but brings extra risk, as data may not be synchronized across devices at the time of a device failure. So the differences between the two need to be visible in the notation.

Synchronous RAID should be denoted with parentheses. So two local RAID 10 systems mirrored over the network (a la DRBD) would be denoted RAID 10(1). The effective RAID level for risk and capacity calculations would be the same as any RAID 101, but this informs all parties at a glance that the mirror is over a network.

Asynchronous RAID should be denoted with brackets.  So two local RAID 10 systems mirrored over the network asynchronously would be denoted as RAID 10[1] making it clear that there is a risky delay in the system.

There is an additional need to denote a different type of replication at a higher, filesystem level (a la rsync) that, while not truly related to RAID, provides a similar function for cold data and is often used in RAID discussions; I believe that storage engineers need the ability to quickly denote this as well. This asynchronous, file-system level replication can be denoted by braces. Only one notation is needed as file-system level replication is always asynchronous. So, as an example, two RAID 6 arrays synced automatically with a block-differential file system replication system would be denoted as RAID 6{1}.

To further simplify RAID notation and to shorten the obvious need to write the word “RAID” repeatedly as well as to remove ourselves from the traditional distractions of what the acronym stands for so that we can focus on the relevant replication aspects of it, a simple “R” prefix should be used.  So RAID 10 would simply be R10.  Or a purely networked mirror might be R(1).

This leaves one major aspect of RAID notation to address and that is the size of each component of the array. Often this is implied, but some RAID levels, especially those that are nested, can have complexities missed by traditional notation. Knowing the total number of drives in an array does not always denote the setup of a specific array. For example, a 24 drive R10 is assumed to be twelve pairs of mirrors in a R0 stripe. But it could be eight sets of triple mirrors in a R0 stripe. Or it could even be six quad mirrors. Or four six-way mirrors. Or three eight-way mirrors. Or two twelve-way mirrors. While most of these are extremely unlikely, there is a need to be able to notate them. For the set size we use a superscript number to denote the size of that set. Generally this is only needed for one aspect of the array, not all, as the others can be derived, but when in doubt it can be denoted explicitly.
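That ambiguity is easy to see by enumerating the mirror set sizes that evenly divide a 24 drive R10 (a quick illustrative sketch):

```python
total_drives = 24

for mirror_size in range(2, total_drives + 1):
    if total_drives % mirror_size == 0:
        sets = total_drives // mirror_size
        if sets >= 2:   # need at least two mirror sets for there to be a stripe at all
            print(f"{sets} sets of {mirror_size}-way mirrors striped together")
```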

So an R10 array using three-way mirror sets would be R1³0. Lacking the ability to write a superscript you could also write it as R1^3+0. This notation does not state the complete array size, only its configuration type. If all possible superscripts are included a full array size can be calculated using nothing more. If we have an R10 of four sets of three-way mirrors we could write it R1³0⁴, which would inform us that the entire array consists of twelve drives – or in the alternate notation R1^3+0^4.
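A small sketch of how the caret form of the notation could be parsed to recover the drive count; the notation is as described above, while the parsing code and its regular expression are simply an illustrative assumption:

```python
import re

def total_drives(notation):
    """Multiply out the set sizes in caret notation, e.g. 'R1^3+0^4' -> 12 drives.
    Every level must carry an explicit ^size for the count to be derivable."""
    levels = notation.lstrip("R").split("+")
    count = 1
    for level in levels:
        match = re.fullmatch(r"(\d)\^(\d+)", level)
        if not match:
            raise ValueError(f"set size missing for level '{level}'")
        count *= int(match.group(2))
    return count

print(total_drives("R1^3+0^4"))   # four 3-way mirror sets striped together -> 12
print(total_drives("R1^2+0^6"))   # six standard mirror pairs striped together -> 12
```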

Superscript notation of sets is only necessary when non-obvious. R10 with no other notation implies that the R1 component is mirror pairs, for example. R55 nearly always requires additional notation except when the array consists of only nine members.

One additional aspect to consider is notating array size. This is far simpler than the superscript notation and is nearly always completely adequate. It alleviates the need to write out in long form "a four drive RAID 10 array." Instead we can use a prefix for this: 4R10 would denote a four drive RAID 10 array.

So to look at our example from above, the twelve disk RAID 10 with the three-way mirror sets could be written out as 12R1³0⁴. But the use of all three numbers becomes redundant. Any one of the numbers can be dropped. Typically this would be the final one as it is the least likely to be useful. The R1 set size is useful in determining the basic risk and the leading 12 is used for capacity and performance calculations as well as chassis sizing and purchasing. The trailing four is implied by the other two numbers and effectively useless on its own. So the best way to write this would be simply 12R1³0. If that same array were to use the common mirror pair approach rather than the three-way mirror, we would simply write 12R10 to denote a twelve disk, standard RAID 10 array.
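As a final worked example, using a hypothetical 1TB drive size purely for the arithmetic:

```python
drive_size_tb = 1            # hypothetical drive size, purely for the arithmetic
total_drives = 12

# 12R1^3+0: four 3-way mirror sets striped together, one third of raw capacity usable.
mirror_width = 3
usable_three_way_tb = (total_drives // mirror_width) * drive_size_tb   # 4 TB

# 12R10: six standard mirror pairs striped together, half of raw capacity usable.
usable_pairs_tb = (total_drives // 2) * drive_size_tb                  # 6 TB

print(usable_three_way_tb, usable_pairs_tb)
```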