All posts by Scott Alan Miller

Started in software development with Eastman Kodak in 1989 as an intern in database development (making database platforms themselves.) Began transitioning to IT in 1994 with my first mixed role in system administration.

Solution Elegance

It is very easy, when working in IT, to become focused on big, complex solutions.  It seems that this is where the good solutions must lie: big solutions, lots of software, all the latest gadgets.  What we do is exciting and it is very easy to get caught up in the momentum; it’s fun to do challenging, big projects.  Hearing what other IT pros are doing, how other companies solve challenges and what vendors with large systems have to sell us all adds to the excitement.  In that excitement it is very easy to lose a sense of scope and goal, and big, over-the-top solutions to simple problems are so common that it can seem like this must simply be how IT is.

But it need not be.  Complexity is the enemy of both reliability and security.  Unnecessarily complex solutions increase cost in acquisition, implementation and maintenance while generally being slower, more fragile and possessing a larger attack surface that is harder to comprehend and protect.  Simple, or more appropriately, elegant solutions are the best approach.  This does not mean that all designs will be simple, not at all.  Complex designs are often required.  IT is hardly a field that lacks complexity.  In fact it is often said that software development may be the most complex of all human endeavors, at least of those undertaken at any scale.  A typical IT installation includes millions of lines of code, hundreds or thousands of protocols, large numbers of interconnected systems, layers of unique software configurations, more settings than any team could possibly know and only then do we add in the complexity of hundreds or thousands or hundreds of thousands of unpredictable, irrational humans trying to use these systems, each in a unique way.  IT is, without a doubt, complex.

What is important is to recognize that IT is complex and that this cannot be avoided completely, but to focus on designing and engineering solutions to be as simple, as graceful… as elegant as possible.  This design idea comes, at least in my mind, from software engineering, where complex code is seen as a mistake and simple, beautiful code that is easy to read and easy to understand is considered successful.  One of the highest accolades that can be bestowed upon a software engineer is for her code to be deemed elegant.  How apropos that this famous quote (loosely translated from French) is attributed to Blaise Pascal, after whom one of the most popular programming languages of the 1970s and 1980s was named: “I am sorry I have had to write you such a long letter, but I did not have time to write you a short one.”

It is often far easier to design complex, convoluted solutions than it is to determine what simple approach would suffice.  Whether we are in a hurry or don’t know where to begin an investigation, elegance is always a challenge.  The industry momentum is to promote the more difficult path.  It is in the interest of vendors to sell more gear, not only to make the initial sale but because they know that with more equipment comes more support dollars, and if enough new, complex equipment is sold the support needs stop increasing linearly and begin to increase geometrically as additional support is needed not just for the equipment or software itself but also for the configuration and support of system interactions and additional customization.  The financial influences behind complexity are great, and they do not stop with vendors.  IT professionals gain much job security, or the illusion of it, by managing large sets of hardware and software that are difficult to seamlessly transition to another IT professional.

Often complexity is so assumed, so expected, that the process of selecting a solution begins with great complexity as a foregone conclusion, without any consideration for the possibility that a less complex solution might suffice, or even be superior apart from the question of complexity and cost itself.  Complexity is sometimes so completely tied to certain concepts that I have actually faced incredulity at the notion that a simple solution might outperform a complex one in price, performance and reliability.

Rhetoric is easy, but what is a real world example?  The best examples that I see today mostly relate to virtualization, whether vis-à-vis storage, a cloud management layer, software or just virtualization itself.  I see quite frequently that a conversation involving just virtualization brings, for some, an instant connotation of requiring networked, shared block storage, expensive virtualization management software, many redundant virtualization nodes and complex high availability software – none of which is intrinsic to virtualization and most of which is rarely in support of, or really even in the interest of, the business for whom it will be implemented.  Rather than working from business requirements, these concepts arise predominantly from technology preconceptions.  It is simple to point to complexity and appear to be solving a problem – complexity creates a sense of comfort.  Filter many arguments down and you’ll hear “How can it not work, it’s complex?”  Complexity provides an illusion of completeness, of having solved a problem, but this commonly hides the fact that a solution may not actually be complete or even functional; the degree of complexity makes this difficult to see.  Our minds then do not easily accept that a simpler approach might be more complete and solve a problem that a complex one does not, because it feels so counter-intuitive.

A great example of this is that we resort to discussing redundancy rather than reliability.  Reliability is difficult to measure; redundancy is simple to quantify.  A brick is highly reliable, even when singular.  It does not take redundancy for a brick to be stable and robust.  Its design is simple.  You could make a supporting structure out of many redundant sticks that would not be nearly as reliable as a single brick.  If you talk in terms of reliability – the chance that the structure will not fail – it is clear that the brick is a superior choice to several sticks.  But if you say “there is no redundancy, the brick could fail and there is nothing to take its place” you sound silly.  When talking about computers and computer systems, though, we find systems that are so complex that people rarely see whether they have a brick or a stick, and so, since they cannot determine reliability, which matters, they focus on the easy to quantify redundancy, which does not.  The entire system is too complex, but seeking the simple solution, the one that directly addresses the crux of the problem to be solved, can reduce complexity and provide a far better answer in the end.
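To put rough numbers on the brick and the sticks, here is a minimal sketch; the failure probabilities are purely illustrative assumptions invented for the example, not measurements of any real component.

```python
# Illustrative only: the failure probabilities below are made up to show the
# point, not measured values for any real component.

def failure_probability_of_redundant_set(per_component_failure: float, count: int) -> float:
    """Probability that every component in an N-way redundant set fails,
    assuming independent failures (a generous assumption for the sticks)."""
    return per_component_failure ** count

brick = 0.001    # one simple, highly reliable component
stick = 0.05     # one flimsier component

sticks_redundant = failure_probability_of_redundant_set(stick, 2)  # two redundant sticks

print(f"single brick fails:        {brick:.4%}")
print(f"two redundant sticks fail: {sticks_redundant:.4%}")
# With these made-up numbers the redundant pair is still more likely to fail
# than the single brick (0.25% vs 0.1%): redundancy is not the same thing as
# reliability.
```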

This can even be seen in RAID.  Mirrored RAID is simple: just one disk or set of disks being an exact copy of another set.  It’s so simple.  Parity RAID is complex, with calculations on a variable stripe across many devices that must be encoded when written and decoded should a device fail.  Mirrored RAID lacks this complexity and solves the problem of disk reliability through simple, elegant copy operations that are highly reliable and very well understood.  Parity RAID is unnecessarily complex, making it fragile.  Yet in doing so, while undermining its own ability to solve the problem for which it was designed, it also, simultaneously, becomes seemingly more reliable based on no factor other than its own complexity.  The human mind immediately jumps to “it’s complex, therefore it is more advanced, therefore it is more reliable,” but neither progression is a logical one.  Being complex does not mean being advanced, and being advanced does not mean being reliable, but the human mind itself is complex and easily misled.
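As a minimal sketch of the difference in moving parts, compare recovering a failed “disk” from a mirror with recovering it from single parity.  The byte values here are arbitrary, and a real array of course works on fixed-size blocks with rotating parity, which only adds to the bookkeeping shown below.

```python
# Toy illustration of why mirroring is simpler than single parity.

data_disks = [0b10110010, 0b01101100, 0b11100001]   # one byte standing in for each "disk"

# Mirroring: recovery is a straight copy from the surviving mirror.
mirror = list(data_disks)
recovered_by_mirror = mirror[1]                      # "disk 1" failed? read its copy

# Single parity (RAID 5 style): parity is the XOR of the data blocks...
parity = 0
for block in data_disks:
    parity ^= block

# ...and recovery means re-XORing every surviving block plus the parity.
failed = 1
recovered_by_parity = parity
for i, block in enumerate(data_disks):
    if i != failed:
        recovered_by_parity ^= block

assert recovered_by_mirror == data_disks[1]
assert recovered_by_parity == data_disks[1]
# Both answers are correct, but the parity path depends on every remaining
# device being readable and on the encode/decode step being implemented right.
```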

There is no simple answer for finding simplicity.  Knowing that complexity is bad by its nature but unavoidable at times teaches us to be mindful; however, it does not teach us when to suspect over-complexity.  We must be vigilant, always seeking to determine if a more elegant answer exists and not accepting complexity as the correct answer simply because it is complex.  We need to question proposed solutions and question ourselves.  “Is this solution really as simple as it should be?”  “Is this complexity necessary?”  “Does this require the complexity that I had assumed?”

In most system design recommendations that I give, the first technical step that I take, after inquiring as to the business need to be solved, is to question complexity.  If complexity cannot be defended, it is probably unnecessary and actively defeating the purpose for which it was chosen.

“Is it really necessary to split those drives into many separate arrays?  If so, what is the technical justification for doing so?”

“Is shared storage really necessary for the task that you are proposing it for?”

“Does the business really justify the use of distributed high availability technologies?”

“Why are we replacing a simple system that was adequate yesterday with a dramatically more complex system tomorrow?  What has changed so that a major improvement, while remaining simple, is no longer more than enough, but instead requires orders of magnitude more complexity and more spending that wasn’t justified previously?”

These are just common examples; complexity exists in every aspect of our industry.  Look for simplicity.  Strive for elegance.  Do not accept complexity without rigorously vetting it.  Put it through the proverbial wringer.  Do not allow complexity to creep in where it is not warranted.  Do not err on the side of complexity – when in doubt, fail simply.  Oversimplifying a solution typically results in a minor failure, while making it overly complex allows for a far greater degree of failure.  The safer bet is the simpler solution.  And if a simple solution is chosen and proven inadequate, it is far easier to add complexity than it is to remove it.

The History of Array Splitting

Much of the rote knowledge of the IT field, especially that of the SMB space, arose in the very late 1990s from a variety of factors.  The biggest were that smaller and smaller businesses were suddenly rushing to computerize, Microsoft had gotten Windows NT 4 stable enough that there was a standard base for all SMB IT to center around, the Internet era had finally taken hold and Microsoft had introduced its certification and training programs that reshaped knowledge dissemination in the industry.  Put together, this created a need for new training and best practices and caused a massive burst of new thinking, writing, documentation, training, best practices, rules of thumb, etc.

For a few years nearly the entire field was trained on the same small knowledge set; many rules of thumb became de facto standards, and much of the knowledge of the time was learned by rote and passed on from mentor to intern in a cycle that moved much of the technical knowledge of 1998 into the unquestioned, set-in-stone processes of 2012.  At the time this was effective because the practices were relevant, but that was fifteen years ago; technology, economics, use cases and knowledge have changed significantly since then.

One of the best examples of this was the famous Microsoft SQL Server recommendation of RAID 1 for the operating system, RAID 5 for the database files and another RAID 1 for the logs.  This setup has endured for nearly the entire life of the product and was so well promoted that it has spread into almost all aspects of server design in the SMB space.  The use of RAID 1 for the operating system and RAID 5 for data is so pervasive that it is often simply assumed without any consideration as to why this was recommended at the time.

Let’s investigate the history and see why R1/5/1 was good in 1998 and why it should not exist today.  Keep some perspective in mind: the gap between when these recommendations first came out (as early as 1995) and today is immense.  Go back, mentally, to 1995 and think about the equivalent gap at the time.  That would have been like basing early-Internet-age recommendations on the home computing needs of the first round of Apple ][ owners!  The 8-bit home computer era was just barely getting started in 1978.  Commodore was still two years away from releasing its first home computer (the VIC-20) and would go through the entire Commodore and Commodore Amiga eras and go bankrupt and vanish, all before 1995.  The Apple ][+ was still a year away.  People were just about to start using analogue cassette drives as storage.  COBOL and Fortran were the only serious business languages in use.  Basically, the gap is incredible.  Things change.

First, we need to look at the factors that existed in the late 1990s that created the need for our historic setup.

  1. Drives were small, very small.  A large database array might have been four 2.1GB SCSI drives in an R5 array for just ~6GB of usable storage space on a single array.  The failure domain for parity RAID failure was tiny (compared to things like URE fail rates.)
  2. Drive connection technologies were parallel and slow.  The hard drives of the time were only slightly slower than drives are today but the connection technologies represented a considerable bottleneck.  It was common to split traffic to allow for reduced bus bottlenecks.
  3. SCSI drive technology was the only one used for servers.  The use of a PATA drive (called IDE at the time) in a server was unthinkable.
  4. Drives were expensive per gigabyte so cost savings was the key issue, while maintaining capacity, for effectively all businesses.
  5. Filesystems were fragile and failed more often than drives.
  6. Hardware RAID was required and only basic RAID levels of 1 and 5 were commonly available.  RAID 6 and RAID 10 were years away from being accessible to most businesses.  RAID 0 is discounted as it has no redundancy.
  7. Storage systems were rarely, if ever, shared between servers so access was almost always dedicated to a single request queue.
  8. Storage caches were tiny or did not exist making drive access limitations pass directly onto the operating system.  This meant having different arrays with different characteristics to handle different read/write or random/sequential access mixes.
  9. Drive failure was common and the principal concern of storage system design.
  10. Drive array size was often limited by physical constraints, so array splitting decisions were frequently made out of necessity, not choice.
  11. A combination of the above factors meant that RAID 1 was best for some parts of the system where small size was acceptable and access was highly sequential or write heavy and RAID 5 was best for others where capacity outweighed reliability and where access was highly random and read heavy.

In the nearly two decades since the original recommendations were released, all of these factors have changed.  In some cases the changes are cascading ones: the move from general use RAID 5 to general use RAID 10 has caused what would have been the two common array types, RAID 1 and RAID 10, to share access characteristics, so the need or desire to use one or the other depending on load type is gone.

  1. Drives are now massive.  Rather than struggling to squeeze what we need onto them, we generally have excess capacity.  Single drives over a terabyte are common, even in servers.  Failure domains for parity are massive (compared to things like URE fail rates.)
  2. Drive connections are serial and fast.  The drive connections are no longer a bottleneck.
  3. SATA is now common on servers skewing potential risks for URE in a way that did not exist previously.
  4. Capacity is now cheap, but performance and reliability are the key concerns for dollars spent.
  5. Filesystems are highly robust today and filesystem failures are “background noise” in the greater picture of array reliability.
  6. Hardware RAID and software RAID are both options today and available RAID levels include many options but, most importantly, RAID 10 is available ubiquitously.
  7. Storage systems are commonly shared making sequential access even less common.
  8. Storage caches are common and often very large.  512MB and 1GB caches are considered normal today, meaning that many arrays from 1995 would fit entirely into memory on a modern RAID controller.  With caches growing rapidly compared to storage capacity, and with the recent addition of solid state drives as L2 cache in storage over the last two years, it is not out of the question for even a small business to have databases and other performance sensitive applications running completely from cache.
  9. Drive failure is uncommon and of trivial concern to storage system design (compared to other failure types.)
  10. Drive array size is rarely limited by physical limitations.
  11. The use of RAID 1 and RAID 10 as the principal array types today means that there is no benefit to using different array levels for performance tuning.

These factors highlight why the split array system of 1995 made perfect sense at the time and why it does not make sense today.  OBR10, today’s standard, was unavailable at the time and cost prohibitive.  RAID 5 was relatively safe in 1995, but not today.  Nearly every factor involved in the decision process has changed dramatically in the last seventeen years and is going to continue to change as SSD becomes more common along with auto-tiering, even larger caches and pure SSD storage systems.
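As a rough illustration of how the parity failure domain has grown, assume the commonly quoted unrecoverable read error (URE) rate of one error per 10^14 bits read for SATA-class drives – an assumption made for the sake of the arithmetic, not a measured figure for any particular drive – and compare a 1998-sized RAID 5 rebuild to a modern one.

```python
# Rough illustration of how the parity "failure domain" grew. Assumes the
# commonly quoted URE rate of 1 in 1e14 bits and independent errors; real
# drives and arrays vary.

URE_RATE = 1e-14   # assumed probability of a URE per bit read

def p_ure_during_rebuild(surviving_drives: int, drive_size_gb: float) -> float:
    """Chance of hitting at least one URE while reading every surviving drive."""
    bits_to_read = surviving_drives * drive_size_gb * 1e9 * 8
    return 1 - (1 - URE_RATE) ** bits_to_read

# A late-1990s array: four 2.1 GB SCSI drives in RAID 5, one failed.
print(f"1998-style 4 x 2.1GB R5 rebuild: {p_ure_during_rebuild(3, 2.1):.3%}")

# A modern array: four 2 TB SATA drives in RAID 5, one failed.
print(f"Modern 4 x 2TB R5 rebuild:       {p_ure_during_rebuild(3, 2000):.1%}")
# The small array has roughly a 0.05% chance of hitting a URE during rebuild;
# the large one comes out to better than one chance in three, which is why
# single parity is treated so differently today.
```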

The change in storage design over the last two decades also highlights a danger that IT faces: a large portion of the field learns, as is common in engineering, basic “rules of thumb” or “best practices” without necessarily understanding the underlying principles that drive those decisions, making it difficult to know when not to apply those best practices or, even more importantly, when to recognize that the rule no longer applies.  Unlike traditional mechanical or civil engineering, where new advances and significant factor changes may occur once or possibly never over the course of a career, IT still changes fast enough that complete “rethinks” of basic rules of thumb are required several times through a career.  Maybe not annually, but once per decade or more is almost always necessary.

The current move from uniprocessing to multithreaded architectures is another similar, significant change requiring the IT field to completely change how system design is handled.

RAID Notation Examples

As the new Network RAID Notation Standard (SAM RAID Notation) is a bit complex, I felt that it would be useful to provide a list of common use scenarios and specific implementation examples and how they would be notated.

  • Scenario: Netgear ReadyNAS Pro 2 with XRAID mirror.  Notation: R1
  • Scenario: Two Netgear ReadyNAS Ultra units with local RAID 1 sync’d over the network using rsync.  Notation: R1{1}
  • Scenario: Two Drobo B800fs NAS devices each loaded with single parity RAID sync’d using DroboSync. Notation: R5{1}
  • Scenario: Two Drobo B800fs NAS devices each with dual parity RAID sync’d using DroboSync.  Notation: R6{1}
  • Scenario: Two Linux servers with R6 locally using DRBD Mode A or B (asynchronous.)  Notation: R6[1]
  • Scenario: Two Linux servers with R5 locally using DRBD Mode C (synchronous.)  Notation: R5(1)
  • Scenario: Three node VMware vSphere VSA cluster with local R10.  Notation: R10(1)³
  • Scenario: Windows server with two four disk R0 stripes mirrored.  Notation: 8R01
  • Scenario: Two FreeBSD servers with R10 using HAST with memsync.  Notation: R10[1]
  • Scenario: Two FreeBSD servers with R1 using HAST with sync.  Notation: R1(1)
  • Scenario: Two Windows file servers with R10 using Robocopy to synchronize file systems. Notation: R10{1}
  • Scenario: Single Netgear SC101 SAN* using ZSAN drivers on Windows with two disks. Notation: R(1)

Technology References:

HAST: http://wiki.freebsd.org/HAST

DRBD: http://www.drbd.org/users-guide/s-replication-protocols.html

DroboSync: http://www.drobo.com/solutions/for-business/drobo-sync.php

Rsync: http://rsync.samba.org/

Robocopy: http://technet.microsoft.com/en-us/library/cc733145%28v=ws.10%29.aspx

Notes:

*The Netgear SC101 SAN is interesting in that while it can hold two PATA drives internally and exposes them to the network as block devices, via the ZSAN protocol, through a single Ethernet interface, there is no internal communication between the drives, so all mirroring of the array happens in Windows, which actually sees each disk as an entirely separate SAN device, each with its own IP address.  Windows has no way to know that the two devices are related.  The RAID 1 mirroring is handled one hundred percent in software RAID on Windows and the SAN itself is always two independent PATA drives exposed raw to the network.  A very odd, but enlightening, device.

Network RAID Notation Standard (SAM RAID Notation)

As the RAID landscape becomes more complex with the emergence of network RAID, there is an important need for a more complete and concise notation system for RAID levels involving a network component.

Traditional RAID comes in single digit notation and the available levels are 0, 1, 2, 3, 4, 5, 6, 7.  Level 7 is unofficial but widely accepted as triple parity RAID (the natural extension of RAID 5 and RAID 6) and RAID 2 and RAID 3 are effectively disused today.

Nested RAID, one RAID level within another, is handled by putting single digit RAID levels together such as RAID 10, 50, 61, 100, etc.  These can alternatively be written with a plus sign separating the levels like RAID 1+0, 5+0, 6+1, 1+0+0, etc.

There are two major issues with this notation system, beyond the obvious one that not all RAID types or extensions are covered by the single digit system (many aspects of proprietary RAID systems such as ZRAID, XRAID and BeyondRAID go unaccounted for).  The first is a lack of network RAID notation and the second is a lack of specific denotation of intra-RAID configuration.

Network RAID comes in two key types, synchronous and asynchronous.  Synchronous network RAID operates effectively identically to its non-networked counterpart.  Asynchronous network RAID functions the same way but brings extra risk, as data may not be synchronized across devices at the time of a device failure.  So the differences between the two need to be visible in the notation.

Synchronous network RAID should be denoted with parentheses.  So two local RAID 10 systems mirrored synchronously over the network (a la DRBD) would be denoted RAID 10(1).  The effective RAID level for risk and capacity calculations would be the same as any RAID 101, but this informs all parties at a glance that the mirror is over a network.

Asynchronous RAID should be denoted with brackets.  So two local RAID 10 systems mirrored over the network asynchronously would be denoted as RAID 10[1] making it clear that there is a risky delay in the system.
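A minimal sketch of the write-path behavior the two delimiters are meant to capture follows; it models no particular product such as DRBD, and the class and its names are simply illustrative.

```python
# Toy model of the difference between R10(1) and R10[1]. "remote" is just an
# in-memory list standing in for the peer node; no real replication product
# is being modelled here.

class NetworkMirror:
    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.local = []       # blocks safely written locally
        self.remote = []      # blocks safely written on the peer
        self.in_flight = []   # async writes not yet on the peer

    def write(self, block):
        self.local.append(block)
        if self.synchronous:
            # (1) style: the write is not acknowledged until the peer has it.
            self.remote.append(block)
        else:
            # [1] style: acknowledge now, ship the block later.
            self.in_flight.append(block)

    def peer_flush(self):
        # What the asynchronous replicator does in the background.
        self.remote.extend(self.in_flight)
        self.in_flight.clear()

sync_mirror = NetworkMirror(synchronous=True)
async_mirror = NetworkMirror(synchronous=False)
for b in ("a", "b", "c"):
    sync_mirror.write(b)
    async_mirror.write(b)

# If the primary node dies right now, the synchronous peer has everything;
# the asynchronous peer may be missing whatever was still in flight.
print(sync_mirror.remote)      # ['a', 'b', 'c']
print(async_mirror.remote)     # []  until peer_flush() runs
```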

There is an additional need to describe a different type of replication at a higher, filesystem level (a la rsync) which, while not truly RAID, provides a similar function for cold data and is often used in RAID discussions, and I believe that storage engineers need the ability to denote this as well.  This asynchronous file-system level replication can be denoted by braces.  Only one notation is needed, as file-system level replication is always asynchronous.  So, as an example, two RAID 6 arrays synced automatically with a block-differential file system replication system would be denoted as RAID 6{1}.

To further simplify RAID notation, to avoid writing the word “RAID” repeatedly and to remove ourselves from the traditional distractions of what the acronym stands for so that we can focus on the relevant replication aspects, a simple “R” prefix should be used.  So RAID 10 would simply be R10.  Or a purely networked mirror might be R(1).

This leaves one major aspect of RAID notation to address and that is the size of each component of the array.  Often this is implied, but some RAID levels, especially those that are nested, can have complexities missed by traditional notation.  Knowing the total number of drives in an array does not always describe the setup of a specific array.  For example, a 24 drive R10 is assumed to be twelve pairs of mirrors in a R0 stripe.  But it could be eight sets of triple mirrors in a R0 stripe.  Or it could even be six quad mirrors.  Or four sext mirrors.  Or three oct mirrors.  Or two dodeca mirrors.  While most of these are extremely unlikely, there is a need to notate them.  For the set size we use a superscript number to denote the size of that set.  Generally this is only needed for one aspect of the array, not all, as the others can be derived, but when in doubt it can be denoted explicitly.

So an R10 array using three-way mirror sets would be R1³0.  Lacking the ability to write a superscript you could also write it as R1^3+0.  This notation does not state the complete array size, only its configuration type.  If all possible superscripts are included a full array size can be calculated using nothing more.  If we have an R10 of four sets of three-way mirrors we could write it R1³0⁴, which would inform us that the entire array consists of twelve drives – or in the alternate notation R1^3+0^4.

Superscript notation of sets is only necessary when non-obvious.  R10 with no other notation implies that the R1 component is mirror pairs, for example.  R55 nearly always requires additional notation except when the array consists of only nine members.

One additional aspect to consider is notating array size.  This is far simpler than the superscript notation and is nearly always completely adequate.  This alleviates the need to write in long form “a four drive RAID 10 array.”  Instead we can use a prefix for this.  4R10 would denote a four drive RAID 10 array.

So to look at our example from above, the twelve disk RAID 10 with the three-way mirror sets could be written out as 12R1³0⁴.  But the use of all three numbers becomes redundant.  Any one of the numbers can be dropped.  Typically this would be the final one as it is the least likely to be useful.  The R1 set size is useful in determining the basic risk and the leading 12 is used for capacity and performance calculations as well as chassis sizing and purchasing.  The trailing four is implied by the other two numbers and effectively useless on its own.  So the best way to write this would be simply 12R1³0.  If that same array was to use the common mirror pair approach rather than the three-way mirror we would simply write 12R10 to denote a twelve disk, standard RAID 10 array.
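As a sketch of how these pieces fit together arithmetically, the following uses the caret form of the notation (since plain code has no superscripts) to derive whichever of the three numbers has been dropped for an R1^a+0^b style array.  The function name and interface here are mine, purely for illustration, and are not part of the proposed standard.

```python
# Small calculator for the R1^mirror_width + 0^stripe_sets family. Any two of
# the three figures (prefix drive count, mirror set size, stripe set count)
# must be supplied; the third is derived, as described in the post.

def r10_layout(total_drives=None, mirror_width=None, stripe_sets=None):
    """Fill in the missing figure and report usable (non-redundant) capacity
    in whole-drive units."""
    if total_drives is None:
        total_drives = mirror_width * stripe_sets
    elif stripe_sets is None:
        stripe_sets = total_drives // mirror_width
    elif mirror_width is None:
        mirror_width = total_drives // stripe_sets

    assert mirror_width * stripe_sets == total_drives, "inconsistent notation"
    usable_drives = stripe_sets   # one drive's worth of capacity per mirror set
    return total_drives, mirror_width, stripe_sets, usable_drives

# 12R1^3+0: twelve drives in three-way mirror sets.
print(r10_layout(total_drives=12, mirror_width=3))   # (12, 3, 4, 4)

# 12R10: a standard twelve drive RAID 10 of mirror pairs.
print(r10_layout(total_drives=12, mirror_width=2))   # (12, 2, 6, 6)
```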