Tag Archives: patterns

Making the Best of Your Inverted Pyramid of Doom

The 3-2-1 or Inverted Pyramid of Doom architecture has become an IT industry pariah for many reasons. Sadly, many companies only learn about the dangers associated with this design after the components have arrived and the money has left their accounts.

Some companies are lucky and catch this mistake early enough to be able to return their purchases and start over with a proper design and decision phase prior to the acquisition of new hardware and software. This, however, is an ideal and very rare situation. At best we can normally expect restocking fees and, far more commonly, the equipment cannot be returned at all or the fees are so large as to make returning it pointless.

What most companies face is a need to “make the best” of the situation moving forward. One of the biggest concerns is that the parties involved, whether the financial stakeholders who have just spent a lot of money on the new hardware or the technical stakeholders who now look bad for having allowed this equipment to be purchased, will succumb to an emotional reaction and give in to the sunk cost fallacy. It is vital that this emotional, illogical reaction not be allowed to take hold as it will undermine critical decision making.

It must be understood that the money spent on the inverted pyramid of doom has already been spent and is gone. That the money was wasted, or how much was wasted, is irrelevant to decision making at this point. Whether the system was a gift or cost a billion dollars does not matter; that money is gone and now we have to make do with what we have. A potential “trick” here is to bring in a financial decision maker such as the CFO, explain that there is about to be an emotional reaction to money already spent and discuss the sunk cost fallacy before talking about the actual problem, so that people are aware and logical and the person trained (we hope) to best handle this kind of situation is present and ready to head off sunk cost emotions. Careful handling of a potentially emotionally-fueled reaction is important. This is not the time to attempt to cover up either the financial or the technical missteps, which is exactly what the emotional reaction pushes us toward. It is necessary for all parties to communicate and remain detached and logical in order to address the needs. Some companies handle this well; many do not and become caught trying to forge forward with bad decisions that were already made, probably in the hope that nothing bad happens and that no one remembers or notices. Fight that reaction. Everyone has it, it is the natural amygdala “fight or flight” emotional response.

Now that we are ready to fight the emotional reactions to the problem we can begin to address “where do we go from here.” The good news is that where we are is generally a position of having “too much” rather than “too little.” So we have an opportunity to be a little creative. Thankfully there are generally good options that can allow us to move in several directions.

One thing that is very important to note is that we are looking exclusively at solutions that are more reliable, not less reliable, than the intended inverted pyramid of doom architecture that we are replacing. An IPOD is a very fragile and dangerous design and we could go to great lengths demonstrating concepts like risk analysis, single points of failure, the fallacies of false redundancy, chasing redundancy instead of reliability, dependency chains, etc., but what is absolutely critical for all parties to understand is that a single server running on local storage is more reliable than the entire IPOD infrastructure would be. This is so important that it has to be said again: if a single server is “standard availability”, the IPOD is lower than that. More risky. If anyone at this stage fears a “lack of redundancy” or a “lack of complexity” in the resulting solutions we have to come back to this – nothing that we will discuss is as risky as what had already been designed and purchased. If there is any fear of risk going forward, the fear should have been greater before we improved the reliability of the design. This cannot be overstated. IPODs sell because they easily confuse those not trained in risk analysis and look reliable when, in fact, they are anything but.
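
To make the dependency chain point concrete, here is a minimal sketch of the availability arithmetic. The figures are assumptions chosen purely for illustration, not measurements of any particular hardware; the point is simply that a chain of dependencies can never be more available than its weakest link.

```python
# Minimal availability sketch. All figures below are assumed for illustration
# only; they are not vendor or measured numbers.

def chain_availability(*components):
    """Availability of a dependency chain: every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result

single_server = 0.999      # assumed: one standalone server on local storage

# In an inverted pyramid, the redundant hosts only matter if the single
# shared storage device and the switching path to it are up, so the whole
# stack is gated by that non-redundant chain.
shared_storage = 0.999     # assumed: the lone storage unit at the "point"
storage_network = 0.9995   # assumed: the path connecting hosts to storage

ipod_floor = chain_availability(shared_storage, storage_network)

print(f"Standalone server:             {single_server:.4%}")
print(f"IPOD storage dependency chain: {ipod_floor:.4%}")
# The chained dependencies come out below the single server before we even
# count the hosts themselves, which is the core of the inverted pyramid risk.
```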

Understanding the above and using a technique called “reading back,” the accepted IPOD architecture tells us that the company in question was accepting of not having high availability (or even standard availability) at the time of purchasing the IPOD. Perhaps they believed that they were getting high availability, but the architecture could not provide it, and so moving forward we have the option of “making do” with nothing more than a single server running on its own local storage. This is simple and easy and improves on nearly every aspect of the intended IPOD design. It costs less to run and maintain, is often faster and is much less complex while being slightly more reliable.

But simply dropping down to a single server and hoping to find uses for the rest of the purchased equipment “elsewhere” is likely not going to be our best option. In situations where the IPOD was meant to serve only a single workload or set of workloads, and other areas of the business have a need for equipment as well, it can be very beneficial to take the “single server” approach for the intended IPOD workload and utilize the remaining equipment elsewhere in the business.

The most common approach to take with repurposing an IPOD stack is to reconfigure the two (or more) compute nodes as full stack nodes containing their own storage. Depending on what storage has already been purchased, this step may require no purchases at all, a movement of drives between systems or, quite often, the relatively small purchase of additional hard drives for this purpose.

These nodes can then be configured into one of two high availability models. In the past a common design choice, for cost reasons, was to use an asynchronous replication model (often known as the Veeam approach) that replicates virtual machines between the nodes and allows VMs to be powered up very rapidly, keeping the downtime from the moment of compute node failure until recovery to as little as a few minutes.

Today fully synchronous fault tolerance is available so commonly for free that it has effectively replaced the asynchronous model in nearly all cases. In this model storage is replicated in real time between the compute nodes, allowing failover to happen instantly, rather than with a few minutes of delay, and with zero data loss instead of a small data loss window (i.e. an RPO of zero.)
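
As a rough illustration of the difference between the two models, here is a toy sketch. The five-minute replication interval and the recovery times are assumptions chosen only to make the comparison concrete, not figures from any particular product.

```python
# Toy comparison of the two high availability models described above.
# All timing values are assumptions for illustration only.

def recovery_profile(model: str) -> dict:
    """Worst-case recovery point (data loss) and recovery time, in minutes."""
    if model == "asynchronous":
        return {
            "rpo_min": 5.0,  # assumed replication interval; writes made since
                             # the last replication pass are lost on failure
            "rto_min": 3.0,  # assumed time to power up the replica VMs
        }
    if model == "synchronous":
        return {
            "rpo_min": 0.0,  # every acknowledged write exists on both nodes
            "rto_min": 0.0,  # failover is effectively immediate
        }
    raise ValueError(f"unknown model: {model}")

for model in ("asynchronous", "synchronous"):
    profile = recovery_profile(model)
    print(f"{model:>12}: worst-case data loss {profile['rpo_min']} min, "
          f"recovery time {profile['rto_min']} min")
```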

At this point it is common for people to react to replication with a fear of losing storage capacity to the replication, and of course that capacity cost is real. It must be understood, however, that it is this replication, missing from the original IPOD design, that provides the firm foundation for high reliability. If this replication is skipped, high availability is an unobtainable dream and individual compute nodes using local storage in a “stand alone” mode remain the most reliable option available. High availability solutions rely on replication and redundancy to build the reliability necessary to qualify as highly available.
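
To put numbers on that capacity cost, here is a small sketch. The drive counts, drive sizes and the use of local RAID 10 within each node are assumptions picked for easy arithmetic; it is the proportions that matter.

```python
# Capacity sketch for a two node, locally replicated design. The drive
# counts and sizes below are assumptions chosen for easy arithmetic.

nodes = 2
drives_per_node = 4
drive_size_tb = 2

raw_tb = nodes * drives_per_node * drive_size_tb           # 16 TB of raw disk
per_node_raid10_tb = drives_per_node * drive_size_tb // 2  # 4 TB usable per node
cluster_usable_tb = per_node_raid10_tb                     # node-to-node replication
                                                           # mirrors that capacity again

print(f"Raw capacity:    {raw_tb} TB")
print(f"Usable capacity: {cluster_usable_tb} TB")
# Half of the raw space goes to local RAID 10 and half of what remains goes
# to the node-to-node replica, but those extra copies are exactly what allow
# an entire node to fail with no outage and no data loss.
```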

This solves the question of what to do with our compute nodes but leaves us with what we can do with our external shared storage device, the single point of failure or the “point” of the inverted pyramid design. To answer this question we should start by looking at what this storage might be.

There are three common types of storage devices that would be used in an inverted pyramid design: DAS, SAN and NAS. We can lump DAS and SAN together as they are simply two forms of block storage and can be used essentially interchangeably in our discussion – they are differentiated only by the presence of switching, which can be added or removed as needed in our designs. NAS differs by being file storage rather than block storage.

In both cases, block (DAS or SAN) or file (NAS) storage, one of the most common uses for this now superfluous device is as a backup target for our new virtualization infrastructure. In many cases the device may be overkill for this task, generally with more performance and many more features than needed for a simple backup target, but good backup storage is important for any critical business infrastructure and erring on the side of overkill is not necessarily a bad thing. Businesses often attempt to skimp on their backup infrastructure and this is an opportunity to invest heavily in it without spending any extra money.

In the same vein as backup storage, the external storage device could be repurposed as archival storage or another “lower tier” of storage where high availability is not warranted. This is a less common approach, generally because every business needs a good backup system but only some have a way to leverage an archival storage tier.

Beyond these two common and universal storage models, another frequent use case for external storage devices, especially if the device is a NAS, is to leverage it in its native role as a file server separate from the virtualization infrastructure. For many businesses file serving is not as uptime critical as the core virtualization infrastructure and its backups are far easier to maintain and manage. Offloading file serving to an already purchased NAS device reduces the demands on the virtualization infrastructure both by reducing the number of VMs that need to run there and by moving what is typically one of the largest consumers of storage to a separate device, lowering both the performance and the capacity requirements of the virtualization infrastructure. Doing this potentially reduces the cost of obtaining the additional hard drives needed for local storage on the compute nodes, as we stated earlier, and so this can be a very popular way for many companies to address their repurposing needs.

Every company is unique and there are potentially many places where spare storage equipment could be effectively used, from labs to archives to tiered storage. A little creativity and thinking outside of the box can take your unique set of available equipment and your business’ unique set of needs and demands and find the best place to use this equipment where it is decoupled from the core, critical virtualization infrastructure but can still bring value to the organization. By avoiding the inverted pyramid of doom we can obtain the maximum value from the equipment that we have already invested in rather than implementing fresh technical debt that we then have to work, unnecessarily, to overcome.

Hello, 1998 Calling….

Something magic seems to have happened in the Information Technology profession somewhere around 1998.  I know, from my own memory, that the late 90s were a special time to be working in IT.  Much of the architecture and technology that we have today stems from this era.  Microsoft moved from their old DOS products to Windows NT based, modern operating systems.  Linux became mature enough to begin appearing in business.  Hardware RAID became common, riding on the coattails of Intel’s IA32 processors as they finally began to become powerful enough for many businesses to use seriously in servers.  The LAN became the business standard and all other models effectively faded away.  The Windows desktop became the one and only standard for regular computing and Windows servers were rapidly overtaking Novell as the principal player in LAN-based computing.

What I have come to realize over the last few years is that a large chunk of the communal wisdom of the industry appears to have been adopted during these formative and influential years of the IT profession and has since passed into myth.  Much like the teachings of Aristotle, who was for millennia considered the greatest thinker of all time and not to be questioned – stymieing scientific thought and providing a cornerstone for the dark ages.  A foundation of “rules of thumb” used in IT has passed from mentor to intern, from professor to student, from author to reader over the past fifteen or twenty years, many of them learned by rote and treated as infallible truths of computing without any thought going into the reasoning and logic behind the initial decisions.  In many cases so much time has come and gone that the factors behind the original decisions are lost or misunderstood, as those hoping to understand them today lack firsthand knowledge of computing from that era.

The codification of IT in the late nineties happened on an unprecedented scale, driven primarily by Microsoft’s sudden lurch from lowly desktop maker to server and LAN ecosystem powerhouse.  When Microsoft made this leap with Windows NT 4 they reinvented the industry, a changing of the guard, with an entirely new generation of SMB IT Pros being born and coming into the industry right as this shift occurred.  These were the years leading up to the Y2K bubble, with the IT industry swelling its ranks as rapidly as it could find moderately skilled, computer-interested bodies.  This meant that everything had to be scripted (steps written on paper, that is) and best practices had to be codified to allow those with less technical background and training to work.  A perfect environment for Microsoft and the never-before-seen friendliness of their NT server product.  All at once the industry was full of newcomers without historical perspective, without the training and experience, and with easy to use servers with graphical interfaces making them accessible to anyone.

Microsoft leapt at the opportunity and created a tidal wave of documentation, best practices and procedures to allow anyone to get basic systems up and running quickly, easily and, more or less, reliably.  To do this they needed broad guidelines that were applicable in nearly all common scenarios, they needed them written in clear published form and they needed to guarantee that the knowledge was being assimilated.  Microsoft Press stepped in with the official publications of the Microsoft guidelines and right on their heels Microsoft’s MCSE program came into the spotlight, totally changing the next decade of the profession.  There had been other industry certifications before the MCSE but the Windows NT 4 era and the MCP / MCSE certification systems were the game changing events of the era.  Soon everyone was being boot camped through certification, quickly memorizing Microsoft best practices and recommendations, learning them by rote and getting certified.

In the short term, the move did wonders for providing Microsoft an army of minimally skilled, but skilled nonetheless, supporters who had their own academic interests aligned with Microsoft’s corporate interest, forming a symbiotic relationship that completely defined the era.  Microsoft was popular because nearly every IT professional was trained on it and nearly every IT professional encouraged the adoption of Microsoft technologies because they had been trained and certified on it.

The rote guidelines of the era touched many aspects of computing; many are probably still unidentified to this day, so strong was the pressure that Microsoft (and others) put on the industry at the time.  Most of today’s concepts of storage and disk arrays, filesystems, system security, networking, system architecture, application design, memory, swap space tuning and countless others all arose during this era and passed, rather quickly, into lore.  At the time we were aware that these were simply rules of thumb, subject to change just as they always had been based on changes in the industry.  Microsoft, and others, tried hard to make it clear what underlying principles created the rules of thumb.  It was not their intention to create a generation having learned by rote, but it happened.

That generation went on to be the effective founding fathers of modern LAN management.  In the small and medium business space the late 1990s represented the end of the central computer and remote terminals design, the Internet became ubiquitous (providing the underpinnings for the extensive propagation of the guidelines of the day), Microsoft washed away the memory of Novell and LANtastic, Ethernet over twisted pair completely abolished all competing technologies in LAN networking, TCP/IP beat out all layer three networking competitors and more.  Intel’s IA32 processor architecture began to steal the thunder from the big RISC processors of the previous era and from the obscure sixteen and thirty-two bit processors that had been attempting to unseat Intel for generations.  The era was defining to a degree few who have come since will ever understand.  Dial up networking gave way to always-on connections.  Disparate networks that could not communicate with each other lost to the Internet and a single, global networking standard.  Vampire taps and hermaphrodite connectors gave in as RJ45 connectors took to the field.  The LAN of 1992 looked nothing like the LAN of 1995.  But today, what we use, while faster and better polished, is effectively identical to the computing landscape as it was by around 1996.

All of this momentum, whether intentional or accidental, created an unstoppable force of myth driving the industry.  Careers were built on this industry wisdom taught around the campfire at night.  One generation clings to its established beliefs, no longer knowing why it trusted those guidelines or whether they still apply, while another is taught them with little way of knowing that it is receiving distilled rules of thumb that were meant to be taught alongside background knowledge and understanding – rules designed not only for a very specific era, roughly the band from 1996 to 1999, but also, in a great many cases, for very specific implementations or products, generally Windows 95 and Windows NT 4 desktops and Windows NT 4 servers.

Today this knowledge is everywhere.  Ask enough questions and even young professionals still at university or doing a first internship are likely to have heard at least a few of the more common nuggets of conventional IT industry wisdom.  Sometimes the recommendations, applied today, are nearly benign, representing little more than inefficiency or performance waste.  In other cases they may represent a pretty extreme degree of bad practice today, carrying significant risk.

It will be interesting to see just how long the late 1990s continue to so vastly influence our industry.  Will the next generation of IT professionals finally issue a broad call for deep understanding and question the rote learning of past eras?  Will misunderstood recommendations still be commonplace in the 2020s?  At the current pace of change, it seems unlikely that the thinking of the industry will shift significantly before 2030.  IT has been attempting to move from its wild west days, with everyone distilling raw knowledge into practical terms on their own, toward large scale codification like other, similar fields such as civil or electrical engineering, but the rate of change, while tremendously slowed since the rampant pace of the 70s and 80s, still remains so high that the knowledge of one generation is nearly useless to the next and only broad patterns, approaches and thought processes have great value to be taught mentor to student.  We may easily face another twenty years of the wild west before things begin to really settle down.

One Big RAID 10 – A New Standard in Server Storage

In the late 1990s the standard rule of thumb for building a new server was to put the operating system onto its own, small, RAID 1 array and separate out applications and data into a separate RAID 5 array.  This was done for several reasons, many of which have swirled away from us, lost in the sands of time.  The main driving factors were that storage capacity was extremely expensive, disks were small, filesystems corrupted regularly and physical hard drives failed at a very high rate compared to other types of failures.  People were driven by a need to protect against physical hard drive failures, protect against filesystem corruption and acquire enough capacity to meet their needs.

Today the storage landscape has changed.  Filesystems are incredibly robust and corruption from the filesystem itself is almost unheard of and, thanks to technologies like journalling, can almost always be corrected quickly and effectively, protecting the end users from data loss.  Almost no one worries about filesystem corruption today.

Modern filesystems are also able to handle far more capacity than they could previously.  It was not uncommon in the late 1990s and early 2000s to be able to easily build a drive array larger than any single filesystem could handle.  Today that is rarely the case, as all common filesystems handle many terabytes at least and often petabytes, exabytes or more of data.

Hard drives are much more reliable than they were in the late 1990s.  Failure rates for an entire drive failing are very low, even in less expensive drives.  So low, in fact, that the primary concern for array failure (data loss across the entire RAID array) is now the array itself failing rather than the individual hard drives within it.  We no longer replace hard drives with wild abandon.  It is not unheard of for large arrays to run their entire lifespans without losing a single drive.

Capacities have scaled dramatically.  Instead of 4.3GB hard drives we are installing 3TB drives.  Roughly seven hundred times more capacity on a single spindle compared to less than fifteen years ago.

These factors come together to create a need for a dramatically different approach to server storage design and a change to the “rule of thumb” about where to start when designing storage.

The old approach can be written RAID 1 + RAID 5.  The RAID 1 space was used for the operating system while the RAID 5 space, presumably much larger, was used for data and applications.  This design split the two storage concerns, putting maximum effort into protecting the operating system (which was very hard to recover in case of disaster and on which the data relied for accessibility) by placing it on highly reliable RAID 1.  Lower cost RAID 5, while somewhat riskier, was typically chosen for data because the cost of storing data on RAID 1 was too high in most cases.  It was a tradeoff that made sense at the time.

Today, with our very different concerns, a new approach is needed, and this new approach is known as “One Big RAID 10” – meaning a single, large RAID 10 array with operating system, applications and data all stored together.  Of course, this is just what we say to make it handy; in a system without performance or capacity needs beyond a single disk we would say “One Big RAID 1”, but many people include RAID 1 in the RAID 10 group so it is just easier to say the former.

To be even handier, we abbreviate this to OBR10.

Because the cost of storage has dropped considerably and, instead of being at a premium, is typically in abundance today, because filesystems are incredibly reliable, because RAID 1 and RAID 10 share performance characteristics and because array failures not triggered by disk failure have moved from background noise to primary causes of data loss, the move to RAID 10 and to eliminating array splitting has become the new standard approach.

With RAID 10 we now have the highly available and resilient storage previously reserved for the operating system available to all of our data.  We get the benefit of mirrored RAID performance plus the benefit of extra spindles for all of our data.  We get better drive capacity utilization and performance based on that improved utilization.
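
As a rough sketch of the difference, consider six drives laid out both ways.  The drive count and size are assumptions chosen purely for illustration, and spindle count is used here as a crude stand-in for available performance.

```python
# Rough comparison of the old split design against OBR10 on the same six
# drives. Drive counts and sizes are assumptions chosen for illustration.

DRIVES = 6
DRIVE_TB = 2

# Old split: 2 drives in RAID 1 for the OS, 4 drives in RAID 5 for data.
old_layout = {
    "os_pool_tb": 1 * DRIVE_TB,    # a 2 TB mirror, largely idle once the OS is on it
    "data_pool_tb": 3 * DRIVE_TB,  # 6 TB behind single-parity RAID 5
    "data_spindles": 4,            # data I/O is confined to the four RAID 5 drives
}

# OBR10: all six drives in one mirrored, striped pool shared by OS and data.
obr10_layout = {
    "pool_tb": DRIVES * DRIVE_TB // 2,  # 6 TB in a single pool, nothing stranded
                                        # in a separate OS-only array
    "data_spindles": DRIVES,            # every workload sees all six spindles
}

print("RAID 1 + RAID 5:", old_layout)
print("OBR10:          ", obr10_layout)
```

Parity RAID would squeeze more raw usable space out of the same drives, but at the risk and performance cost described above; the win for OBR10 in this sketch is the single pooled capacity and the full spindle count behind every workload.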

Even the traditional splitting of log files normally done with databases (the infamous RAID 1 + RAID 5 + RAID 1 approach) is no longer needed because RAID 10 keeps the optimum performance characteristics across all data.  With RAID 10 we eliminate almost all of the factors that once caused us to split arrays.

The only significant factor, not yet mentioned, for which split arrays were traditionally seen as beneficial is access contention – different processes needing access to different parts of the disk at the same time, causing the drive heads to move around in a less than ideal pattern and reducing drive performance.  Contention was a big deal in the late 1990s when the old rule of thumb was developed.

Today, drive contention still exists but has been heavily mitigated by the use of large RAID caches.  In the late 90s drive caches were a few megabytes at best and often non-existent.  Today 256MB is a tiny cache and average servers are deployed with 1-2GB of cache on the RAID card alone.  Some systems are beginning to integrate additional solid state drive based caches to add a secondary cache beyond the memory cache on the controller.  These can easily add hundreds of gigabytes of extremely high speed cache that can buffer nearly any spindle operation from needing to worry about contention.  So the issue of contention has been solved in other ways over the years and this, like the other technology changes, has effectively freed us from the traditional concerns that required us to split arrays.

Like array contention, another, far less common reason for splitting arrays in the late 1990s was to improve communications bus performance because of the limitations of the era’s SCSI and ATA technologies.  These, too, have been eliminated with the move to serial communications mechanisms, SAS and SATA, in modern arrays.  We are no longer limited to the capacity of a single bus for each array and can grow much larger with much more flexibility than previously.  Bus contention has been all but eliminated.

If there is a need to split off space for protection, such as log file growth, this can be achieved through partitioning rather than through physical array splitting.  In general you will want to minimize partitioning as it increases overhead and lowers the ability of the drives to tune themselves but there are cases where it is the better approach.  But it does not require that the underlying physical storage be split as it traditionally was.  Even better than partitioning, when available, is logical volume management which makes partition-like separations without the limitations of partitions.

So at the end of the day, the new rule of thumb for server storage is “One Big RAID 10.”  No more RAID 5, no more array splitting.  It’s about reliability, performance, ease of management and moderate cost effectiveness.  Like all rules of thumb, this does not apply to every single instance, but it does apply much more broadly than the old standard ever did.  RAID 1 + RAID 5, as a standard, was always an attempt to “make do” with something undesirable and to make the best of a bad situation.  OBR10 is not like that.  The new standard is a desired standard – it is how we actually want to run, not something with which we have been “stuck”.

When designing storage for a new server, start with OBR10 and only move away from it when it specifically does not meet your technology needs.  You should never have to justify using OBR10, only justify not using it.


Why We Reboot Servers

A question that comes up on a pretty regular basis is whether or not servers should be routinely rebooted, such as once per week, or if they should be allowed to run for as long as possible to achieve maximum “uptime.”  To me the answer is simple – with rare exception, regular reboots are the most appropriate choice for servers.

As with any rule, there are cases when it does not apply.  For example, some businesses running critical systems have no allotment for downtime and must be available 24/7.  Obviously systems like this cannot simply be rebooted in a routine way.  However, if a system is so critical that it can never go down, that should raise a red flag that this system is a single point of failure, and consideration of how to handle downtime, whether planned or unplanned, should be initiated.

Another exception is that some AIX systems need significant uptime, greater than a few weeks, to reach maximum efficiency, as the system is self tuning and needs time to gather usage information and adjust itself accordingly.  This tends to be limited to large, seldom-changing database servers and similar use scenarios, which are less common than those on other platforms.

In IT we often worship the concept of “uptime” – how long a system can run without needing to restart.  But “uptime” is not a concept that brings value to the business and IT needs to keep the business’ needs in mind at all times rather than focusing on artificial metrics.  The business is not concerned with how long a server has managed to stay online without rebooting – they only care that the server is available and ready when needed for business processing.  These are very different concepts.

For most any normal business server, there is a window when the server needs to be available for business purposes and a window when it is not needed.  These windows may be daily, weekly or monthly but it is a rare server that is actually in use around the clock without exception.

I often hear people state that because they run operating system X rather than Y that they no longer need to reboot, but this is simply not true.  There are two main reasons to reboot on a regular basis: to verify the ability of the server to reboot successfully and to apply patches that cannot be applied without rebooting.

Applying patches is why most businesses reboot.  Almost all operating systems receive regular updates that require rebooting in order to take effect.  As most patches are released for security and stability purposes, especially those requiring a reboot, the importance of applying them is rather high.  Making a server unnecessarily vulnerable just to maintain uptime is not wise.

Testing a server’s capacity to reboot successfully is what is often overlooked.  Most servers have changes applied to them on a regular basis.  Changes might be patches, new applications, configuration changes, updates or similar.  Any change introduces risk.  Just because a server is healthy immediately after a change is applied does not mean that the server or the applications running on it will start as expected on reboot.

If the server is never rebooted then we never know if it can reboot successfully.  Over time the number of changes applied since the last reboot will increase.  This is very dangerous.  What we fear is a large number of changes having been made, possibly many of them undocumented, and a reboot then failing.  At that point identifying which change is causing the system to fail could be an insurmountable task.  No single change to roll back, no known path to recoverability.  This is when panic sets in.  Of course, a box that is never rebooted intentionally is more likely to reboot unintentionally – meaning a failed reboot is both more likely to occur and more likely to occur while the server is in active use.

Regular reboots are not intended to reduce the frequency of failed reboots; in fact they increase the occurrence of failures.  The purpose is to make those failures easily manageable from a “known change” standpoint and, more importantly, to control when those reboots occur so that they happen at a time when the server is designated as available for maintenance and is expected to be stressed, so that problems are found at a time when they can be mitigated without business impact.

I have heard many a system administrator state that they avoid weekend reboots because they do not want to be stuck working on Sundays due to servers failing to come back up after rebooting.  I have been paged many a Sunday morning from a failed reboot myself, but every time I receive that call I feel a sense of relief.  I know that we just caught an issue at a time when the business is not impacted financially.  Had that server not been restarted during off hours, it might not have been discovered to be “unbootable” until it failed during active business hours and caused a loss of revenue.

Thanks to regular weekend reboots, we can catch pending disasters safely and, thanks to knowing that we only have one week’s worth of changes to investigate, we are routinely able to fix the problems with generally little effort and great confidence that we understand what changes had been made prior to the failure.

Regular reboots are about protecting the business from outages and downtime that can be mitigated through very simple and reliable processes.