Much of the rote knowledge of the IT field, especially in the SMB space, arose in the very late 1990s from a combination of factors. Smaller and smaller businesses were suddenly rushing to computerize; Microsoft had made Windows NT 4 stable enough to give all of SMB IT a standard base to center around; the Internet era had finally taken hold; and Microsoft introduced the certification and training programs that reshaped knowledge dissemination in the industry. Together, these created a need for new training and best practices and caused a massive burst of new thinking, writing, documentation, training, rules of thumb and so on.
For a few years nearly the entire field was trained on the same small knowledge set. Many rules of thumb became de facto standards, and much of the knowledge of the time was learned by rote and passed from mentor to intern in a cycle that moved the technical knowledge of 1998 into the unquestioned, set-in-stone processes of 2012. At the time this was effective because the practices were relevant, but that was fifteen years ago. Technology, economics, use cases and knowledge have changed significantly since then.
One of the best examples of this was the famous Microsoft SQL Server recommendation of RAID 1 for the operating system, RAID 5 for the database files and another RAID 1 for the logs. This setup has endured for nearly the entire life of the product and was so well promoted that it has spread into almost all aspects of server design in the SMB space. The use of RAID 1 for the operating system and RAID 5 for data is so pervasive that it is often simply assumed without any consideration as to why this was recommended at the time.
Let’s investigate the history and see why R1/5/1 was good in 1998 and why it should not exist today. Keep some perspective in mind: the gap between when these recommendations first came out (as early as 1995) and today is immense. Go back, mentally, to 1995 and think about the equivalent gap at that time. It would have been like using recommendations in the early Internet age based on the home computing needs of the first round of Apple ][ owners! The 8-bit home computer era was just barely getting started in 1978. Commodore was still two years away from releasing their first home computer (the VIC-20) and would go through the entire Commodore and Commodore Amiga eras, go bankrupt and vanish, all before 1995. The Apple ][+ was still a year away. People were just about to start using analogue cassette drives as storage. COBOL and Fortran were the only serious business languages in use. Basically, the gap is incredible. Things change.
First, we need to look at the factors that existed in the late 1990s that created the need for our historic setup.
- Drives were small, very small. A large database array might have been four 2.1GB SCSI drives in an R5 array for just ~6GB of usable storage space. The failure domain for parity RAID was tiny compared to things like URE (unrecoverable read error) rates.
- Drive connection technologies were parallel and slow. The hard drives of the time were only slightly slower than drives are today but the connection technologies represented a considerable bottleneck. It was common to split traffic to allow for reduced bus bottlenecks.
- SCSI drive technology was the only one used for servers. The use of PATA (called IDE at the time) in a server was unthinkable.
- Drives were expensive per gigabyte, so saving cost while maintaining capacity was the key issue for effectively all businesses.
- Filesystems were fragile and failed more often than drives.
- Hardware RAID was required and only basic RAID levels of 1 and 5 were commonly available. RAID 6 and RAID 10 were years away from being accessible to most businesses. RAID 0 is discounted as it has no redundancy.
- Storage systems were rarely, if ever, shared between servers so access was almost always dedicated to a single request queue.
- Storage caches were tiny or did not exist, so drive access limitations passed directly on to the operating system. This meant using different arrays with different characteristics to handle different read/write or random/sequential access mixes.
- Drive failure was common and the principal concern of storage system design.
- Drive array size was often capped by physical limitations, so array splitting decisions were frequently made out of necessity, not choice.
- A combination of the above factors meant that RAID 1 was best for some parts of the system, where small size was acceptable and access was highly sequential or write heavy, and RAID 5 was best for others, where capacity outweighed reliability and access was highly random and read heavy.
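The capacity pressure behind that choice is easy to put in numbers. The following is a minimal sketch, not anything from the original recommendations: the function name `usable_gb` is hypothetical, and it assumes equal-size drives and the textbook capacity formulas for each level. It applies them to the four 2.1GB drives mentioned in the list above.

```python
def usable_gb(level: str, drives: int, drive_gb: float) -> float:
    """Usable capacity for a few standard RAID levels, assuming equal-size drives.

    Hypothetical helper for illustration only.
    """
    if level == "raid1":
        if drives != 2:
            raise ValueError("this sketch treats RAID 1 as a two-drive mirror")
        return drive_gb                      # one drive's worth; the other is a copy
    if level == "raid5":
        if drives < 3:
            raise ValueError("RAID 5 needs at least three drives")
        return (drives - 1) * drive_gb       # one drive's worth goes to parity
    if level == "raid10":
        if drives < 4 or drives % 2:
            raise ValueError("RAID 10 needs an even number of drives, four or more")
        return (drives // 2) * drive_gb      # half the drives hold mirror copies
    raise ValueError(f"unsupported level: {level}")


if __name__ == "__main__":
    # The late-1990s example array: four 2.1GB SCSI drives.
    for level in ("raid5", "raid10"):
        print(f"{level}: {usable_gb(level, 4, 2.1):.1f} GB usable")
```

On those four drives RAID 5 yields about 6.3GB against 4.2GB for a mirrored layout, which is a large difference when every gigabyte is expensive.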
In the nearly two decades since the original recommendations were released, all of these factors have changed. In some cases the changes cascade: the move from general use RAID 5 to general use RAID 10 means that the two common array types, now RAID 1 and RAID 10, share access characteristics, so the need or desire to use one or the other depending on load type is gone.
- Drives are now massive. Rather than struggling to squeeze what we need onto them, we generally have excess capacity. Single drives over a terabyte are common, even in servers. Failure domains for parity are massive compared to things like URE (unrecoverable read error) rates.
- Drive connections are serial and fast. The drive connections are no longer a bottleneck.
- SATA is now common on servers, skewing potential risks for URE in a way that did not exist previously.
- Capacity is now cheap but performance and reliability are now the key concerns for dollars spent.
- Filesystems are highly robust today and filesystem failures are “background noise” in the greater picture of array reliability.
- Hardware RAID and software RAID are both options today and available RAID levels include many options but, most importantly, RAID 10 is available ubiquitously.
- Storage systems are commonly shared making sequential access even less common.
- Storage caches are common and often very large. 512MB and 1GB caches are considered normal today, so many arrays from 1995 would fit entirely into the memory of a modern RAID controller. With caches growing rapidly compared to storage capacity, and with the addition of solid state drives as L2 cache in storage over the last two years, it is not out of the question for even a small business to have databases and other performance sensitive applications running completely from cache.
- Drive failure is uncommon and of trivial concern to storage system design compared to other failure types.
- Drive array size is rarely limited by physical limitations.
- The use of RAID 1 and RAID 10 as the principal array types today means that there is no benefit to using different array levels for performance tuning.
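The shift in parity failure domains relative to URE rates can be illustrated with a rough calculation. This is a sketch under stated assumptions, not a reliability model: it assumes a URE rate of one error per 10^14 bits read (a figure commonly quoted for consumer SATA drives, and not taken from this article), treats bit errors as independent, uses decimal gigabytes, and assumes a RAID 5 rebuild must read every surviving drive in full. The function name `ure_probability_on_rebuild` and the four 2TB drive "modern" array are hypothetical.

```python
import math


def ure_probability_on_rebuild(drives: int, drive_gb: float,
                               ure_per_bit: float = 1e-14) -> float:
    """Chance of at least one URE while rebuilding a degraded RAID 5 array.

    Assumptions (illustrative only): independent bit errors, a fixed URE
    rate per bit read, and a rebuild that reads all surviving drives fully.
    """
    surviving = drives - 1
    bits_read = surviving * drive_gb * 1e9 * 8      # decimal GB -> bits
    # 1 - (1 - p)^bits, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))


if __name__ == "__main__":
    old = ure_probability_on_rebuild(drives=4, drive_gb=2.1)      # late-1990s array
    new = ure_probability_on_rebuild(drives=4, drive_gb=2000.0)   # hypothetical 4 x 2TB
    print(f"4 x 2.1GB RAID 5 rebuild: {old:.4%} chance of a URE")
    print(f"4 x 2TB   RAID 5 rebuild: {new:.1%} chance of a URE")
```

Under those assumptions the old array lands at roughly 0.05% and the hypothetical modern one at about 38%. The exact figures depend heavily on the assumed error rate, but the difference of several orders of magnitude is the point.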
These factors highlight why the split array system of 1995 made perfect sense at the time and why it does not make sense today. OBR10 (one big RAID 10), today’s standard, was unavailable at the time and would have been cost prohibitive. RAID 5 was relatively safe in 1995, but not today. Nearly every factor involved in the decision process has changed dramatically in the last seventeen years and is going to continue to change as SSD becomes more common along with auto-tiering, even larger caches and pure SSD storage systems.
The change in storage design over the last two decades also highlights a danger that IT faces. A large portion of the field learns, as is common in engineering, basic “rules of thumb” or “best practices” without necessarily understanding the underlying principles that drive those decisions. That makes it difficult to know when not to apply those best practices or, even more importantly, to recognize when a rule no longer applies. Unlike traditional mechanical or civil engineering, where new advances and significant factor changes may occur once or possibly never over the course of a career, IT still changes fast enough that complete “rethinks” of basic rules of thumb are required several times through a career. Maybe not annually, but a rethink once per decade, or more often, is almost always necessary.
The current move from uniprocessing to multithreaded architectures is another similar, significant change requiring the IT field to completely change how system design is handled.