Tag Archives: inverted pyramid

The Emperor’s New Storage

We all know the story of the Emperor’s New Clothes.  In Hans Christian Andersen’s telling of the classic tale we have some unscrupulous cloth vendors who convince the emperor that they have clothes made from a fabric with the magical property of only being visible to people who are fit for their positions.  The emperor, not being able to see the clothes, decides to buy them because he fears people finding out that he cannot see them.  Everyone in the kingdom pretends to see them as well – all sharing the same fear.  It is a brilliant sales tactic because it puts everyone on the same team: the cloth sellers, the emperor, the people in the street all share a common goal that requires them to all maintain the same lie.  Only when a little boy who cares naught about his status in society but only about the truth points out that the emperor is naked is everyone free to admit that they don’t see the clothes either.

And this brings us to the storage market today.  Today we have storage vendors desperate to sell solutions of dubious value and buyers who often lack the confidence in their own storage knowledge to dare to question the vendors in front of management or who simply have turned to vendors to make their IT decisions on their behalf.  This has created a scenario where vendor confidence and industry uncertainty have engendered market momentum, causing the entire situation to snowball.  The effect is that using big, monolithic and expensive storage systems is so accepted today that often systems are purchased without any thought at all.  They are essentially a foregone conclusion!

It is time for someone to point at the storage buying process and declare that the emperor is, in fact, naked.

Don’t get me wrong.  I certainly do not mean to imply that modern storage solutions do not have value.  Most certainly they do.  Large SAN and NAS shared storage systems have driven much technological development and have excellent use cases.  They were not designed without value, but they do not apply to every scenario.

The idea of the inverted pyramid design, the overuse of SANs where they do not apply, came about because they are high profit margin approaches.  Manufacturers have a huge incentive to push these products and designs because they do much to generate profits.  SANs are one of the most profit-bearing products on the market.  This, in turn, incentivizes resellers to push SANs as well, both to generate profits directly through their sales and to keep their vendors happy.  This creates a large amount of market pressure by which everyone on the “sales” side of the buyer / seller equation has massive pressure to convince you, the buyer, that a SAN is absolutely necessary.  The pressure is so strong, the incentives so large, that even losing the majority of potential customers in the process is worth it, because the margin on the one customer that goes with the approach generally outweighs the loss of many others.

Resellers are not the only “in between” players with incentive to see large, complex storage architectures get deployed.  Even non-reseller consultants have an incentive to promote this approach because it is big, complex and requires, on average, far more consulting and support than do simpler system designs.  This is unlikely to be a trivial number.  Instead of a ten hour engagement, they may win a hundred hours, for example, and for consultants those hours are bread and butter.

Of course, the media has incentive to promote this, too.  The vendors provide the financial support for most media in the industry and much of the content.  Media outlets want to promote the design because it promotes their sponsors and they also want to talk about the things that people are interested in, and simple designs do not generate a lot of readership.  This is the same problem that exists with sensationalist news: the most important or relevant news is often skipped so that news that will gather viewership is shown instead.

This combination of factors is very forceful.  Companies that look to consultants, resellers and VARs, and vendors for guidance will get a unanimous push for expensive, complex and high margin storage systems.  Everyone, even the consultants who are supposed to be representing the client, has a pretty big incentive to let these complex designs get approved because there is just so much money potentially sitting on the table.  You might get paid one hour of consulting time to recommend against overspending, but might be paid hundreds of hours for implementing and supporting the final system.  That’s likely tens of thousands of dollars of difference, a lot of incentive, even for the smallest deployments.

This unification of the sales channel and even the front line of “protection” has an extreme effect.  Our only real hope, the only significant one, for someone who is not incentivized to participate in this system is the internal IT staff themselves.  And yet we find that internal staff very rarely stand up to the vendors on these recommendations, and often produce the same recommendations themselves.

There are many reasons why well intentioned internal IT staff (and even external ones) may fail to properly assess needs such as these.  There are a great many factors involved and I will highlight some of them.

  • Little information in the market.  Because no company makes money by selling you less, there is almost no market literature, discussion or material to assist in evaluating decisions.  Without direct access to another business that has made the same decision or to any consultants or vendors promoting an alternative approach, IT professionals are often left all alone.  This lack of supporting experience is enough to create the doubt that squashes dissenting voices.
  • Management often prefers flashy advertising and the word of sales people over the opinions of internal staff.  This is a hard fact, but one that is often true.  IT professionals often face the fact that management may make buying decisions without any technical input whatsoever.
  • Any bid process immediately short circuits good design.  A bid would have to include “storage” and SAN vendors can easily bid on supplying storage while there is no meaningful way for “nothing” to bid on it.  Because there is no vendor for good design, good design has no voice in a bidding or quote based approach.
  • Lack of knowledge.  Often dealing with system architecture and storage concerns is a one-off activity handled only a few times over an entire career.  Making these decisions is not just uncommon; it is often the very first time that the person has ever done it.  Even if the knowledge is there, the confidence to buck the trend often is not.
  • Inexperience in assessing risk and cost profiles.  While these things may seem like bread and butter to IT management, often the person tasked with dealing with system design in these cases will have no training and no experience in determining comparative cost and risk in complex systems such as these.  It is common that risk goes unidentified.
  • Internal staff often see this big and costly purchase as a badge of honour or a means to bragging rights.  They are excited to show off how much they were able to spend and how big their new systems are.  Everyone loves gadgets and these are often the biggest, most expensive toys that we ever touch in our industry.
  • Internal staff often have no access to work with equipment of this type, especially SANs.  Getting a large storage solution in house may allow them to improve their resume and even leverage the experience into a raise or, more likely, a new job.
  • Turning to other IT professionals who have tackled similar situations often results in the same advice as from sales people.  This is for several reasons.  All of the reasons above, of course, would have applied to them plus one very strong one – self preservation.  Any IT professional that has implemented a very costly system unnecessarily will have a lot of incentive to state that they believe that the purchase was a good one.  This may be irrational “reverse rationalization”, the human tendency to apply reasoning after the fact to a decision that lacked it when originally made; it may be because they fear that their job would be in jeopardy if it were found out what they had done; it may be because they have not assessed the value of the system after implementation; or it may even be because their factors were not the same as yours and the design was applicable to their needs.

The bottom line is that basically everyone, no matter what role they play, from vendors to sales people to those that do implementation and support to even your friends in similar job roles to strangers on Internet forums, all have big incentives to promote costly and risky storage architectures in the small and medium business space.  There is, for all intents and purposes, no one with a clear benefit for providing a counter point to this marketing and sales momentum.  And, of course, as momentum has grown the situation becomes more and more entrenched with people even citing the questioning of the status quo and asking critical questions as irrational or reckless.

As with any decision in IT, however, we have to ask “does this provide the appropriate value to meet the needs of the organization?”  Storage and system architectural design is one of the most critical and expensive decisions that we will make in a typical IT shop.  Of all of the things that we do, treating this decision as a knee-jerk, foregone conclusion, without doing due diligence and without looking to address our company’s specific goals, could be one of the most damaging mistakes that we make.

Bad decisions in this area are not readily apparent.  The same factors that lead to the initial bad decisions will also hide the fact that a bad decision was made much of the time.  If the issue is that the solution carries too much risk, there is no means to determine that better after implementation than before – such is the nature of risk.  If the system never fails we don’t know if that is normal or if we got lucky.  If it fails we don’t know if this is common or if we were one in a million.  So observation of risk from within a single implementation, or even hundreds of implementations, gives us no statistically meaningful insight.  Likewise, when evaluating wasteful expenditures, we could have caught a financial waste before the purchase just as easily as after it.  So we are left without any ability for a business to do a post mortem on their decision, nor is there an incentive, as no one involved in the process would want to risk exposing a bad decision making process.  Even companies that want to know if they have done well will almost never have a good way of determining this.
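
A rough sketch in Python of why a clean track record proves so little.  The failure probabilities below are hypothetical, chosen only for illustration, and each year is treated as independent, which is a simplifying assumption.

```python
def prob_no_failure(annual_failure_prob: float, years: int) -> float:
    """Chance that one system sees zero failures over the period,
    assuming each year is independent of the others."""
    return (1.0 - annual_failure_prob) ** years


# Hypothetical annual failure probabilities for two designs.
# The "riskier" design is three times as likely to fail in any given year.
designs = {"safer design": 0.02, "riskier design": 0.06}
years = 5

for label, p in designs.items():
    chance = prob_no_failure(p, years)
    print(f"{label}: {chance:.0%} chance of no failure in {years} years")
```

With these made-up numbers both designs will most likely run for five years without an incident, so a business looking only at its own trouble-free system cannot tell which of the two it bought.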

What makes this determination even harder is that the same architectures that are foolish and reckless for one company may be completely sensible for another.  The use of a SAN based storage system and a large number of attached hosts is a common and sensible approach to controlling costs of storage in extremely large environments.  Nearly every enterprise will utilize this design and it normally makes sense, but is used for very different reasons and goals than apply to nearly any small or medium business.  It is also, generally, implemented somewhat differently.  It is not that SANs or similar storage are bad.  What is bad is allowing market pressure, sales people and those with strong incentives to “sell” a costly solution to drive technical decision making instead of evaluating business needs, risk and cost analysis and implementing the right solution for the organization’s specific goals.

It is time that we, as an industry, recognize that the emperor is not wearing any clothes.  We need to be the innocent children who point, laugh and question why no one else has been saying anything when it is so obvious that he is naked.  The storage and architectural solutions so broadly accepted benefit far too many people and the only ones who are truly hurt by them (business owners and investors) are not in a position to understand if they do or do not meet their needs.  We need to break past the comfort provided by the socially accepted plausible deniability of not understanding, and accept culpability for not evaluating.  We must take responsibility for protecting our organizations and provide solutions that address their needs rather than the needs of the sales people.

 

For more information see: When to Consider a SAN and The Inverted Pyramid of Doom

Making the Best of Your Inverted Pyramid of Doom

The 3-2-1 or Inverted Pyramid of Doom architecture has become an IT industry pariah for many reasons. Sadly for many companies, they only learn about the dangers associated with this design after the components have arrived and the money has left the accounts.

Some companies are lucky and catch this mistake early enough to be able to return their purchases and start over with a proper design and decision phase prior to the acquisition of new hardware and software. This, however, is an ideal and very rare situation. At best we can normally expect restocking fees and, far more commonly, the equipment cannot be returned at all or the fees are so large as to make it pointless.

What most companies face is a need to “make the best” of the situation moving forward. One of the biggest concerns is that the parties involved, whether the financial stakeholders who have just spent a lot of money on the new hardware or the technical stakeholders who now look bad for having allowed this equipment to be purchased, will succumb to an emotional reaction and give in to the sunk cost fallacy. It is vital that this emotional, illogical reaction not be allowed to take hold as it will undermine critical decision making.

It must be understood that the money spent on the inverted pyramid of doom has already been spent and is gone. That the money was wasted, or how much was wasted, is irrelevant to decision making at this point. Whether the system was a gift or cost a billion dollars does not matter; that money is gone and now we have to make do with what we have. A potential “trick” here would be to bring in a financial decision maker like a CFO, explain that there is about to be an emotional reaction to money already spent, and discuss the sunk cost fallacy before talking about the actual problem. That way people are aware and logical, and the person trained (we hope) to best handle this kind of situation is there and ready to head off sunk cost emotions. Careful handling of a potentially emotionally-fueled reaction is important. This is not the time to attempt to cover up either the financial or the technical missteps, which is what the emotional reaction pushes people to do. It is necessary for all parties to communicate and remain detached and logical in order to address the needs. Some companies handle this well; many do not and become caught trying to forge forward with bad decisions that were already made, probably in the hopes that nothing bad happens and that no one remembers or notices. Fight that reaction. Everyone has it; it is the natural amygdala “fight or flight” emotional response.

Now that we are ready to fight the emotional reactions to the problem we can begin to address “where do we go from here.” The good news is that where we are is generally a position of having “too much” rather than “too little.” So we have an opportunity to be a little creative. Thankfully there are generally good options that can allow us to move in several directions.

One thing that is very important to note is that we are looking at solutions exclusively that are more reliable, not less reliable, than the intended inverted pyramid of doom architecture that we are replacing. An IPOD is a very fragile and dangerous design and we could go to great lengths demonstrating concepts like risk analysis, single points of failure, the fallacies of false redundancy, looking at redundancy instead of reliability, dependency chains, etc. but what is absolutely critical for all parties to understand is that a single server, running with local storage is more reliable than the entire IPOD infrastructure would be. This is so important that it has to be said again: if a single server is “standard availability”, the IPOD is lower than that. More risky. If anyone at this stage fears a “lack of redundancy” or a “lack of complexity” in the resulting solutions we have to come back to this – nothing that we will discuss is as risky as what had already been designed and purchased. If there is any fear of risk going forward, the fear should have been greater before we improved the reliability of the design. This cannot be overstated. IPODs sell because they easily confuse those not trained in risk analysis and look reliable when, in fact, they are anything but.
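
To make the dependency chain argument concrete, here is a minimal sketch in Python. The availability figures are invented purely for illustration, the model assumes that failures are independent, and it assumes that the shared storage device is about as reliable as any single server. Under those assumptions a chain can never be more available than its weakest required link.

```python
from math import prod


def parallel(availability: float, count: int) -> float:
    """Availability of `count` redundant devices where any one of them is enough."""
    return 1.0 - (1.0 - availability) ** count


def series(*availabilities: float) -> float:
    """Availability of a chain in which every link must be up."""
    return prod(availabilities)


# Hypothetical per-device availabilities, for illustration only.
SERVER = 0.999
SWITCH = 0.9995
STORAGE = 0.999  # assumption: comparable to a single server

single_server = SERVER
ipod = series(
    parallel(SERVER, 3),   # three redundant hosts
    parallel(SWITCH, 2),   # two redundant switches
    STORAGE,               # one shared storage device
)

print(f"single server: {single_server:.6f}")
print(f"3-2-1 stack:   {ipod:.6f}")
```

With these made-up numbers the 3-2-1 stack comes out slightly below the single server, because the redundant hosts and switches can only subtract from the availability of the one storage device that they all depend on.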

Understanding the above and using a technique called “reading back” the accepted IPOD architecture tells us that the company in question was accepting of not having high availability (or even standard availability) at the time of purchasing the IPOD. Perhaps they believed that they were getting that, but the architecture could not provide it and so moving forward we have the option of “making do” with nothing more than a single server, running on its own local storage. This is simple and easy and improves on nearly every aspect of the intended IPOD design. It costs less to run and maintain, is often faster and is much less complex while being slightly more reliable.

But simply dropping down to a single server and hoping to find uses for the rest of the purchased equipment “elsewhere” is likely not going to be our best option. Where the IPOD had been meant for only a single workload or set of workloads and other areas of the business have need for equipment as well, however, it can be very beneficial to go to the “single server” approach for the intended IPOD workload and utilize the remaining equipment elsewhere in the business.

The most common approach to take with repurposing an IPOD stack is to reconfigure the two (or more) compute nodes to be full stack nodes containing their own storage. Depending on what storage has already been purchased, this step may require no purchases at all, only a movement of drives between systems, or often the relatively small purchase of additional hard drives for this purpose.

These nodes can then be configured into one of two high availability models. In the past a common design choice, for cost reasons, was to use an asynchronous replication model (often known as the Veeam approach) that will replicate virtual machines between the nodes and allow VMs to be powered up very rapidly allowing for a downtime from the moment of compute node failure until recovery of as little as just a few minutes.

Today fully synchronous fault tolerance is so commonly available for free that it has effectively replaced the asynchronous model in nearly all cases. In this model storage is replicated in real time between the compute nodes, allowing failover to happen instantly rather than with a few minutes of delay, and with zero data loss instead of a small data loss window (i.e. an RPO of zero).
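
The difference between the two models can be sketched in a few lines of Python. This is a simplified illustration only: the class and field names are hypothetical, the interval and failover figures are made up rather than taken from any product, and the worst case data loss is approximated as one replication interval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationModel:
    name: str
    replication_interval_s: float  # 0 means each write is mirrored before it is acknowledged
    failover_time_s: float         # time from node failure until the VMs are running again

    def worst_case_data_loss_s(self) -> float:
        """Simplified worst-case RPO: everything written since the last replication."""
        return self.replication_interval_s


# Hypothetical figures, for illustration only.
asynchronous = ReplicationModel("asynchronous", replication_interval_s=15 * 60, failover_time_s=5 * 60)
synchronous = ReplicationModel("synchronous", replication_interval_s=0, failover_time_s=0)

for model in (asynchronous, synchronous):
    print(
        f"{model.name}: up to {model.worst_case_data_loss_s() / 60:.0f} min of data lost, "
        f"about {model.failover_time_s / 60:.0f} min of downtime"
    )
```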

At this point it seems to be common for people to react to replication with a fear of a loss of storage capacity caused by the replication. Of course this is true. It is necessary that it be understood that it is this replication, missing from the original IPOD design, that provides the firm foundation for high reliability. If this replication is skipped, high availability is an unobtainable dream and individual compute nodes using local storage in a “stand alone” mode are the most reliable potential option. High availability solutions rely on replication and redundancy to build the necessary reliability to qualify for high availability.
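
The capacity trade-off is simple arithmetic, sketched here in Python. The function name and the drive sizes are hypothetical, and the sketch ignores any RAID overhead inside each node.

```python
def usable_capacity_tb(raw_per_node_tb: float, nodes: int, copies: int) -> float:
    """Usable capacity when every block is stored `copies` times across the nodes."""
    if copies < 1 or copies > nodes:
        raise ValueError("copies must be between 1 and the number of nodes")
    return raw_per_node_tb * nodes / copies


# Two compute nodes with a hypothetical 8 TB of local storage each.
stand_alone = usable_capacity_tb(8, nodes=2, copies=1)  # no replication, no high availability
replicated = usable_capacity_tb(8, nodes=2, copies=2)   # every block kept on both nodes

print(f"stand alone: {stand_alone:.0f} TB usable")
print(f"replicated:  {replicated:.0f} TB usable")
```

Half of the raw capacity goes to the second copy, and that second copy is exactly what makes failover between the nodes possible.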

This solves the question of what to do with our compute nodes but leaves us with what we can do with our external shared storage device, the single point of failure or the “point” of the inverted pyramid design. To answer this question we should start by looking at what this storage might be.

There are three common types of storage devices that would be used in an inverted pyramid design: DAS, SAN and NAS. We can lump DAS and SAN together as they are both two different aspects of block storage and can be used essentially interchangeably in our discussion – they are only differentiated by the existence of switching which can be added or removed as needed in our designs. NAS differs by being file storage rather than block storage.

In both cases, block (DAS or SAN) or file (NAS) storage, one of the most common usages for this now superfluous device is as a backup target for our new virtualization infrastructure. In many cases the device may be overkill for this task, generally with more performance and many more features than needed for a simple backup target, but good backup storage is important for any critical business infrastructure and erring on the side of overkill is not necessarily a bad thing. Businesses often attempt to skimp on their backup infrastructures and this is an opportunity to invest heavily in it without spending any extra money.

In the same vein as backup storage, the external storage device could be repurposed as archival storage or another “lower tier” of storage where high availability is not warranted. This is a less common approach, generally because every business needs a good backup system but only some have a way to leverage an archival storage tier.

Beyond these two common and universal storage models, a common use case for external storage devices, especially if the device is a NAS, is to leverage it in its native role as a file server separate from the virtualization infrastructure. For many businesses file serving is not as uptime critical as the core virtualization infrastructure and backups are far easier to maintain and manage. Offloading file serving to an already purchased NAS device can reduce the demands on the virtualization infrastructure, both by reducing the number of VMs that need to be run there and by moving what is typically one of the largest users of storage to a separate device, which can lower the performance requirements of the virtualization infrastructure as well as its capacity requirements. By doing this we potentially reduce the cost of obtaining the additional hard drives needed for the local storage on the compute nodes, as mentioned earlier, and so this can be a very popular method for many companies to address the repurposing needs.

Every company is unique and there are potentially many places where spare storage equipment could be effectively used, from labs to archives to tiered storage. A little creativity and thinking outside of the box can match your unique set of available equipment to your business’ unique set of needs and demands, finding the best place to use this equipment where it is decoupled from the core, critical virtualization infrastructure but can still bring value to the organization. By avoiding the inverted pyramid of doom we can obtain the maximum value from the equipment that we have already invested in rather than implementing fresh technical debt that we then have to work to overcome unnecessarily.

The Inverted Pyramid of Doom

The 3-2-1 model of system architecture is extremely common today and almost always exactly the opposite of what a business needs, or even wants, if they were to take the time to write down their business goals rather than approaching an architecture from a technology first perspective.  Designing a solution requires starting with business requirements; otherwise we not only risk the architecture being inappropriately designed for the business, we should expect it.

The name refers to three (this is a soft point, it is often two or more) redundant virtualization host servers connected to two (or potentially more) redundant switches connected to a single storage device, normally a SAN (but DAS or NAS are valid here as well.) It’s an inverted pyramid because the part that matters, the virtualization hosts, depend completely on the network which, in turn, depends completely on the single SAN or alternative storage device. So everything rests on a single point of failure device and all of the protection and redundancy is built more and more on top of that fragile foundation. Unlike a proper pyramid with a wide, stable base and a point on top, this is built with all of the weakness at the bottom. (Often the ‘unicorn farts’ marketing model of “SANs are magic and can’t fail because of dual controllers” comes out here as people try to explain how this isn’t a single point of failure, but it is a single point of failure in every sense.)
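
The shape of the problem can be shown with a small Python sketch. The device names are hypothetical and the model is deliberately simple: the workload stays up only while every layer still has at least one working device, and the storage device is treated as a single unit.

```python
# Layers of the 3-2-1 stack, top to bottom. Device names are hypothetical.
STACK = {
    "hosts": ["host1", "host2", "host3"],
    "switches": ["switch1", "switch2"],
    "storage": ["san1"],
}


def workload_is_up(failed: set) -> bool:
    """The workload runs only if every layer still has at least one working device."""
    return all(
        any(device not in failed for device in devices)
        for devices in STACK.values()
    )


def single_points_of_failure() -> list:
    """Devices whose failure, on its own, takes the whole workload down."""
    return [
        device
        for devices in STACK.values()
        for device in devices
        if not workload_is_up({device})
    ]


print(single_points_of_failure())
```

Losing two hosts and a switch at the same time leaves the workload running in this model, while losing the one storage device stops everything, which is the inverted pyramid in miniature.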

So the solution, often called a 3-2-1 design, can also be called the “Inverted Pyramid of Doom” because it is an upside down pyramid that is too fragile to run and extremely expensive for what is delivered. So unlike many other fragile models, it is very costly, not very flexible and not as reliable as simply not doing anything beyond having a single quality server.

There are times that a 3-2-1 makes sense, but mostly these are extreme edge cases where a fragile environment is desired and high levels of shared storage with massive processing capabilities are needed – not things you would see in the SMB world and very rarely elsewhere.

The inverted pyramid looks great to people who are not aware of the entire architecture, such as managers and business people.  There are a lot of boxes, a lot of wires, and there are typically software components labeled “HA” which, to the outside observer, make it sound like the entire solution must be highly reliable.  Inverted Pyramids are popular because they offer “HA” from a marketing perspective, making everything sound wonderful, and they keep the overall cost within reason so it seems almost like a miracle – High Availability promises without the traditional costs.  The additional “redundancy” of some of the components is great for marketing.  As reliability is difficult to measure, business people and technical people alike often resort to speaking of redundancy instead of reliability as it is easy to see redundancy.  The inverted pyramid speaks well to these people as it provides redundancy without reliability.  The redundancy is not where it matters most.  It is absolutely critical to remember that redundancy is not a check box nor is redundancy a goal; it is a tool to use to obtain reliability improvements.  Improper redundancy has no value.  What good is a car with a redundant steering wheel in the trunk?  What good is a redundant aircraft if you die when the first one crashes?  What good is a redundant server if your business is down and data lost when the single SAN went up in smoke?

The inverted pyramid is one of the most obvious and ubiquitous examples of “The Emperor’s New Clothes” used in technology sales.  It meets the needs of resellers and vendors by promoting high margin sales and minimizing low margin ones, so nearly every vendor promotes it.  It is just complicated and technical enough that widespread repudiation does not occur and, under the incredible market pressure from the vast array of vendors benefiting from the architecture, it has become the status quo and is widely accepted as a great solution.  Few people stop and question if the entire architecture has any merit.  On top of that, all systems today are highly reliable compared to systems of just a decade ago, so failures are uncommon enough that few notice that they are more common than they should be, and statistical failure rates are not shared between SMBs.  The result is that the architecture thrives and has become the de facto solution set for most SMBs.

The bottom line is that the Inverted Pyramid approach makes no sense – it is far more unreliable than simpler solutions, even just a single server standing on its own, while costing many times more.  If cost is a key driver, it should be ruled out completely.  If reliability is a key driver, it should be ruled out completely.  Only if cost and reliability take very far back seats to flexibility should it even be put on the table and even then it is rare that a lower cost, more reliable solution doesn’t match it in overall flexibility within the anticipated scope of flexibility.  It is best avoided altogether.

Originally published on Spiceworks in abridged form: http://community.spiceworks.com/topic/312493-the-inverted-pyramid-of-doom