The Weakest Link: How Chained Dependencies Impact System Risk

When assessing system risk scenarios it is very easy to overlook “chained” dependencies.  We are trained to look at risk at a “node” level asking “how likely is this one thing to fail.”  But system risk is far more complicated than that.

In most systems there are some components that rely on other components. The most common place that we look at this is in the design of storage for servers, but it occurs in any system design.  Another good example is how web applications need both application hosts and database hosts in order to function.

It is easiest to explain chained dependencies with an example.  We will look at a standard virtualization design with SAN storage to understand where failure domain boundaries exist and where chained dependencies exist and what role redundancy plays in system level risk mitigation.

In a standard SAN (storage area network) design for virtualization you have virtualization hosts (which we will call the “servers” for simplicity), SAN switches (switches dedicated for the storage network) and the disk arrays themselves.  Each of these three “layers” is dependent on the others for the system, as a whole, to function.  If we had the simplest possible set with one server, one switch and one disk array we very clearly have three devices representing three distinct points of failure.  Any one of the three failing causes the entire system to fail.  No one piece is useful on its own.  This is a chained dependency and the chain is only as strong as its weakest link.

In our simplistic example, each device represents a failure domain.  We can mitigate risk by improving the reliability of each domain.  We can add a second server and implement a virtualization layer high availability or fault tolerance strategy to reduce the risk of server failure.  This improves the reliability of one failure domain but leaves two untouched and just as risky as they were before.  We can then address the switching layer by adding a redundant switch and configuring a multi-pathing strategy to handle the loss of a single switching path, reducing the risk at that layer.  Now two failure domains have been addressed.  Finally we have to address the storage failure domain which is done, similarly, by adding redundancy through a second disk array that is mirrored to the first and able to fail over transparently in the event of a failure.

Now that we have beefed up our system, we still have three failure domains in a dependency chain.  What we have done is made each “link” in the chain, each failure domain, extra resilient on its own.  But the chain still exists.  This means that the system, as a whole, is far less reliable than any single failure domain within the chain is alone.  We have made something far better than where we started, but we still have many failure domains.  These risks add up.
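
To put rough numbers on this, here is a minimal sketch in Python.  It assumes that failure domains fail independently of one another and that redundancy inside a domain works perfectly, neither of which holds in practice.  The 99% availability per device is a made-up figure chosen only for illustration, and the function names are hypothetical.

```python
from math import prod


def redundant(node, copies=2):
    """Idealized redundancy: the domain is down only if every copy is down."""
    return 1 - (1 - node) ** copies


def chain(domains):
    """A dependency chain is up only when every failure domain is up."""
    return prod(domains)


node = 0.99  # hypothetical availability of any single device

# One server, one switch, one disk array.
before = chain([node] * 3)

# Two of each, with (unrealistically) perfect failover inside each domain.
domains = [redundant(node)] * 3
after = chain(domains)

print(f"single devices chained:    {before:.6f}")
print(f"each redundant domain:     {domains[0]:.6f}")
print(f"redundant domains chained: {after:.6f}")
```

Under these assumptions the redundant chain is much better than where we started, yet the system as a whole still comes out below the availability of any one of its failure domains.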

What is difficult in determining overall risk is that we must assess the risk of each item, then determine the new risk after mitigation (through the addition of redundancy) and then find the cumulative risk of each of the failure domains together in a chain to determine the total risk of the entire system.  It is extremely difficult to determine the risk within each failure domain as the manner of risk mitigation plays a significant role.  For example a cluster of storage disk arrays that fails over too slowly may result in an overall system failure even when the storage cluster itself appears to have worked properly.  Even defining a clear failure can therefore be challenging.

It is often tempting to take a “from the top” view assessment of risk which is very dangerous, but very common for people who are not regular risk assessment practitioners.  The tendency here is to look at the risk only viewing the “top most” failure domain – generally the servers in a case like this, and ignoring any risks that sit beneath that point considering those to be “under the hood” rather than part of the risk assessment.  It is easy to ignore the more technical, less exposed and more poorly understood components like networking and storage and focus on the relatively easy to understand and heavily marketed reliability aspects of the top layer.  This “top view” means that the risks under the top level are obscured and generally ignored leading to high risk without a good understanding of why.

Understanding the concept of chained dependencies explains why complex systems, even with complex risk mitigation strategies, often result in being far more fragile than simpler systems.  In our above example, we could do several things to “collapse” the chain resulting in a more reliable system as a whole.

The most obvious component which can be collapsed is the networking failure domain.  If we were to remove the switches entirely and connect the storage directly to the servers (not always possible, of course) we would effectively eliminate one entire failure domain and remove a link from our chain.  Now instead of three links, each of which has some potential to fail, we have only two.  Simpler is better, all other things being equal.

We could, in theory, also collapse in the storage failure domain by going from external storage to using storage local to the servers themselves essentially taking us from two failure domains down to a single failure domain – the one remaining domain, of course, is carrying more complexity than it did before the collapsing, but the overall system complexity is greatly reduced.  Again, this is with all other factors remaining equal.

Another approach to consider is making single nodes more reliable on their own.  It is trendy today to look at larger systems and approach risk mitigation in that way, by adding redundant, low cost nodes to add reliability to failure domains.  But traditionally this was not the default path taken to reliability.  It was far more common in the past, as is shown in the former prevalence of mainframe and similar classed systems, to build in high degrees of reliability into a single node.  Mainframe and high end storage systems, for example, still do this today.  This can actually be an extremely effective approach but fails to address many scenarios and is generally extremely costly, often magnified by a need to have systems partially or even completely maintained by the vendor.  This tends to work out only in special niche circumstances and is not practical on a more general scope.

So in any system of this nature we have three key risk mitigation strategies to consider: improve the reliability of a single node, improve the reliability of a single domain or reduce the number of failure domains (links) in the dependency chain.  Putting these together as is prudent can help us to achieve the risk mitigation level appropriate for our business scenario.

Where the true difficulty exists, and will remain, is in the comparison of different risk mitigation strategies.  The risk of a single node can generally be estimated with some level of confidence.  A redundancy strategy within a single domain has far less ability to be estimated – some redundancy strategies are highly effective, creating extremely reliable failure domains while others can actually backfire and reduce the reliability of a domain!  The complexity that often comes with redundancy strategies is never without caveat and while it will typically pay off, it rarely carries the degree of reliability benefit that is initially expected.  Estimating the risk of a dependency chain is therefore that much more difficult as it requires a clear understanding of the risks associated with each of the failure domains individually as well as an understanding of the failure opportunity existing at the domain boundaries (like the storage failover delay failure noted earlier).
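
A small sketch can show how sensitive a redundant domain is to the quality of its redundancy.  The model below is my own simplification, not a standard formula for any product: it assumes a two node domain where the second node only helps if failover succeeds, and where the redundancy layer itself (clustering software, heartbeat links, human configuration) has its own availability.  All of the numbers and parameter names are hypothetical.

```python
def redundant_pair(node, failover_success=1.0, mechanism=1.0):
    """Availability of a two node failure domain (illustrative model only).

    node             -- availability of each node on its own
    failover_success -- chance that failover works when it is needed
    mechanism        -- availability of the redundancy layer itself, whose
                        own faults can take the whole domain down
    """
    # Up if the primary is up, or the primary is down, the secondary is up
    # and the failover actually succeeds.
    survives = node + (1 - node) * node * failover_success
    return mechanism * survives


node = 0.99  # hypothetical single node availability

ideal = redundant_pair(node)
realistic = redundant_pair(node, failover_success=0.90)
backfire = redundant_pair(node, failover_success=0.90, mechanism=0.985)

print(f"single node:            {node:.5f}")
print(f"ideal pair:             {ideal:.5f}")
print(f"imperfect failover:     {realistic:.5f}")
print(f"fragile redundancy:     {backfire:.5f}")
```

With these made-up inputs, imperfect failover gives back part of the expected gain, and a redundancy layer that is itself fragile leaves the pair less available than one node alone.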

Let’s explore the issues around determining risk in two very common approaches to the same scenario building on what we have discussed above.

Two extreme examples of the same situation we have been discussing are a single server with internal storage used to host virtual machines versus a six device “chain” with two servers using a high availability solution at the server layer, two switches with redundancy at the switching layer and two disk arrays providing high availability at the storage layer.  If we change any large factor here we can generally provide a pretty clear estimate of relative risk.  If any of the failure domains lacks reliable redundancy, for example, we can pretty clearly determine that the single server is the more reliable overall system, except in cases where an extreme amount of reliability is built into a single node, which is generally an impractical strategy financially.  But with each failure domain maintaining redundancy we are forced to compare the relative risks of intra-domain reliability (the redundant chain) vs. inter-domain reliability (the collapsed chain, single server).

With the two entirely different approaches there is no reasonable way to assess the comparative risks of the two means of risk mitigation.  It is generally accepted that the six (or more) node approach with extensive intra-domain risk mitigation is the more reliable of the two approaches and this is almost certainly, generally true.  But it is not always true and rarely does this approach outperform the single node strategy by a truly significant margin while commonly costing four to ten fold as much as the single server strategy.  That is potentially a very high cost for what is likely a small gain in reliability and a small potential risk of a loss in reliability.  Each additional piece of redundancy adds complexity that a human must implement, monitor and maintain and with complexity and human interaction comes more and more risk.  Avoiding human error can often be more important than avoiding mechanical failure.

We must also consider the cost of recovery.  If failure is to occur it is generally trivial to recover from the failure of a simple system.  An extremely complex system, having failed, may take a great degree of effort to restore to a working condition.  Complex systems also require much broader and deeper degrees of experience and confidence to maintain.
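
Recovery time feeds directly into availability.  A common steady state approximation is availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair.  The sketch below uses that approximation with invented figures, purely to show the shape of the trade-off.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady state availability from mean time between failures and repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical simple system: fails somewhat more often, restored quickly.
simple = availability(mtbf_hours=10_000, mttr_hours=4)

# Hypothetical complex system: fails less often, but takes far longer to restore.
complex_system = availability(mtbf_hours=15_000, mttr_hours=24)

print(f"simple system:  {simple:.6f}")
print(f"complex system: {complex_system:.6f}")
```

With these assumed numbers the complex system fails less often and still ends up less available, because each failure costs so much more time to recover from.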

There is no easy answer to determining the reliability of systems.  Modern information delivery systems are simply too large and too complex with too many indeterminable factors to be able to evaluate in all cases.  With a good understanding of chained dependencies, however, and an understanding of risk mitigation strategies we can take practical steps to determine rough relative risk levels, see where similar risk scenarios compare in cost, identify points of fragility, recognize failure domains and dependency chains, and appreciate how changes in system design will move us clearly towards or away from reliability.

The Inverted Pyramid of Doom

The 3-2-1 model of system architecture is extremely common today and almost always exactly the opposite of what a business needs, or even wants, if they were to take the time to write down their business goals rather than approaching an architecture from a technology first perspective.  Designing a solution requires starting with business requirements; otherwise we do not merely risk the architecture being inappropriately designed for the business, we should expect it.

The name refers to three (this is a soft point, it is often two or more) redundant virtualization host servers connected to two (or potentially more) redundant switches connected to a single storage device, normally a SAN (but DAS or NAS are valid here as well.) It’s an inverted pyramid because the part that matters, the virtualization hosts, depend completely on the network which, in turn, depends completely on the single SAN or alternative storage device. So everything rests on a single point of failure device and all of the protection and redundancy is built more and more on top of that fragile foundation. Unlike a proper pyramid with a wide, stable base and a point on top, this is built with all of the weakness at the bottom. (Often the ‘unicorn farts’ marketing model of “SANs are magic and can’t fail because of dual controllers” comes out here as people try to explain how this isn’t a single point of failure, but it is a single point of failure in every sense.)

So the solution, often called a 3-2-1 design, can also be called the “Inverted Pyramid of Doom” because it is an upside down pyramid that is too fragile to run and extremely expensive for what is delivered. So unlike many other fragile models, it is very costly, not very flexible and not as reliable as simply not doing anything beyond having a single quality server.
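
The same chain arithmetic can be applied to a 3-2-1 design.  This is only a sketch: it assumes independent failures and perfect failover in the redundant layers (which flatters the pyramid), and every availability figure is invented.  In particular it assumes the single SAN is no more reliable than a single quality server with internal storage.

```python
from math import prod


def redundant(node, copies):
    """Idealized redundancy: the layer is down only if every copy is down."""
    return 1 - (1 - node) ** copies


# Hypothetical per-device availabilities.
server, switch, san = 0.999, 0.9995, 0.999

pyramid = prod([
    redundant(server, copies=3),  # three virtualization hosts
    redundant(switch, copies=2),  # two switches
    san,                          # one storage device, no redundancy
])

single_server = server  # one quality server with local storage

print(f"3-2-1 design:  {pyramid:.8f}")
print(f"single server: {single_server:.8f}")
```

However much redundancy is stacked on top, the result can never exceed the availability of the single storage device at the bottom, and under these assumptions it lands just below the single server.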

There are times that a 3-2-1 makes sense, but mostly these are extreme edge cases where a fragile environment is desired and high levels of shared storage with massive processing capabilities are needed – not things you would see in the SMB world and very rarely elsewhere.

The inverted pyramid looks great to people who are not aware of the entire architecture, such as managers and business people.  There are a lot of boxes, a lot of wires, there are software components typically which are labeled “HA” which, to the outside observer, makes it sound like the entire solution must be highly reliable.  Inverted Pyramids are popular because they offer “HA” from a marketing perspective making everything sound wonderful and they keep the overall cost within reason so it seems almost like a miracle – High Availability promises without the traditional costs.  The additional “redundancy” of some of the components is great for marketing.  As reliability is difficult to measure, business people and technical people alike often resort to speaking of redundancy instead of reliability as it is easy to see redundancy.  The inverted pyramid speaks well to these people as it provides redundancy without reliability.  The redundancy is not where it matters most.  It is absolutely critical to remember that redundancy is not a check box nor is redundancy a goal, it is a tool to use to obtain reliability improvements.  Improper redundancy has no value.  What good is a car with a redundant steering wheel in the trunk?  What good is a redundant aircraft if you die when the first one crashes?  What good is a redundant server if your business is down and data lost when the single SAN went up in smoke?

The inverted pyramid is one of the most obvious and ubiquitous examples of “The Emperor’s New Clothes” used in technology sales.  It meets the needs of resellers and vendors by promoting high margin sales and minimizing low margin ones, and nearly every vendor promotes it because of its financial advantages to the seller.  It is just complicated and technical enough that widespread repudiation does not occur and, under incredible market pressure from the vast array of vendors benefiting from the architecture, it has become the status quo and widely accepted as a great solution.  Few people stop and question whether the entire architecture has any merit.  On top of that, all systems today are highly reliable compared to systems of just a decade ago, so failures are uncommon enough that few notice they are more common than they should be, and statistical failure rates are not shared between SMBs.  As a result the architecture thrives and has become the de facto solution set for most SMBs.

The bottom line is that the Inverted Pyramid approach makes no sense – it is far more unreliable than simpler solutions, even just a single server standing on its own, while costing many times more.  If cost is a key driver, it should be ruled out completely.  If reliability is a key driver, it should be ruled out completely.  Only if cost and reliability take very far back seats to flexibility should it even be put on the table and even then it is rare that a lower cost, more reliable solution doesn’t match it in overall flexibility within the anticipated scope of flexibility.  It is best avoided altogether.

Originally published on Spiceworks in abridged form:

Solution Elegance

It is very easy, when working in IT, to become focused on big, complex solutions.  It seems that this is where the good solutions must lie – big solutions, lots of software, all the latest gadgets.  What we do is exciting and it is very easy to get caught up in the momentum.  It’s fun to do challenging, big projects.  Hearing what other IT pros are doing, how other companies solve challenges and talking to vendors with large systems to sell to us all adds to the excitement.  It is very easy to lose a sense of scope and goal, and it is so common to see big, over the top solutions to simple problems that it seems like this must just be how IT is.

But it need not be.  Complexity is the enemy of both reliability and security.  Unnecessarily complex solutions increase cost in acquisition, in implementation and in maintenance while generally being slower, more fragile and possessing a larger attack surface that is harder to comprehend and protect.  Simple, or more appropriately, elegant solutions are the best approach.  This does not mean that all designs will be simple, not at all.  Complex designs are often required.  IT is hardly a field that has any lack of complexity.  In fact it is often believed that software development may be the most complex of all human endeavors, at least of those partaken of on any scale.  A typical IT installation includes millions of lines of code, hundreds or thousands of protocols, large numbers of interconnected systems, layers of unique software configurations, more settings than any team could possibly know and only then do we add in the complexity of hundreds or thousands or hundreds of thousands of unpredictable, irrational humans trying to use these systems, each in a unique way.  IT is, without a doubt, complex.

What is important is to recognize that IT is complex and that this cannot be avoided completely, but to focus on designing and engineering solutions to be as simple, as graceful… as elegant as possible.  This design idea comes from, at least in my mind, software engineering where complex code is seen as a mistake and simple, beautiful code that is easy to read, easy to understand is considered successful.  One of the highest accolades that can be bestowed upon a software engineer is for her code to be deemed elegant.  How apropos that this famous quote (translated poorly from French) is attributed to Blaise Pascal, after whom one of the most popular programming languages of the 1970s and 1980s was named: “I am sorry I have had to write you such a long letter, but I did not have time to write you a short one.”

It is often far easier to design complex, convoluted solutions than it is to determine what simple approach would suffice.  Whether we are in a hurry or don’t know where to begin an investigation, elegance is always a challenge.  The industry momentum is to promote the more difficult path.  It is in the interest of vendors to sell more gear, not only to make the initial sale but because they know that with more equipment comes more support dollars.  If enough new, complex equipment is sold the support needs stop increasing linearly and begin to increase geometrically, as additional support is needed not just for the equipment or software itself but also for the configuration and support of system interactions or additional customization.  The financial influences behind complexity are great, and they do not stop with vendors.  IT professionals gain much job security, or the illusion of it, by managing large sets of hardware and software that are difficult to seamlessly transition to another IT professional.

Often complexity is so assumed, so expected, that the process of selecting a solution begins with great complexity as a foregone conclusion without any consideration for the possibility that a less complex solution might suffice, or even be superior outside of the question of complexity and cost itself.  Complexity is sometimes even completely tied to certain concepts to a degree where I have actually faced incredulity at the notion that a simple solution might outperform in price, performance and reliability a complex one.

Rhetoric is easy, but what is a real world example?  The best examples that I see today are mostly related to virtualization whether vis a vis storage or a cloud management layer or software or just virtualization itself.  I see quite frequently that a conversation involving just virtualization for one person brings an instant connotation of requiring networked, shared block storage, expensive virtualization management software, many redundant virtualization nodes and complex high availability software – none of which are intrinsic to virtualization and most of which rarely serve, or are even in the interest of, the business for whom they will be implemented.  Rather than working from business requirements, these concepts arise predominantly from technology preconceptions.  It is simple to point to complexity and appear to be solving a problem – complexity creates a sense of comfort.  Filter many arguments down and you’ll hear “How can it not work, it’s complex?”  Complexity provides an illusion of completeness, of having solved a problem, but this can commonly hide the fact that a solution may not actually be complete or even functional, as the degree of complexity makes this difficult to see.  Our minds will then not easily accept a simpler approach being more complete and solving a problem when a complex one does not because it feels so counter-intuitive.

A great example of this is that we resort to discussing redundancy rather than reliability.  Reliability is difficult to measure, redundancy is simple to quantify.  A brick is highly reliable, even when singular.  It does not take redundancy for a brick to be stable and robust.  Its design is simple.  You could make a supporting structure out of many redundant sticks that would not be nearly as reliable as a single brick.  If you talk in reliability – the chance that the structure will not fail – it is clear that the brick is a superior choice to several sticks.  But if you say “but there is no redundancy, the brick could fail and there is nothing to take its place” you sound silly.  But when talking about computers and computer systems we find systems that are so complex that rarely do people see when they have a brick or a stick and so, since they cannot determine reliability which matters, they focus on the easy to quantify redundancy, which doesn’t.  The entire system is too complex, but seeking the simple solution, the one that directly addresses the crux of the problem to solve can reduce complexity and provide us a far better answer in the end.

This can even be seen in RAID.  Mirrored RAID is simple, just one disk or set of disks being an exact copy of another set.  It’s so simple.  Parity RAID is complex with calculations on a variable stripe across many devices that must be encoded when written and decoded should a device fail.  Mirrored RAID lacks this complexity and solves the problem of disk reliability through simple, elegant copy operations that are highly reliable and very well understood.  Parity RAID is unnecessarily complex making it fragile.  Yet in doing so and by undermining its own ability to solve the problem for which it was designed it also, simultaneously, becomes seemingly more reliable based on no factor other than its own complexity.  The human mind immediately jumps to “it’s complex, therefore it is more advanced, therefore it is more reliable” but neither progression is a logical one.  Complexity does not suggest that it is more advanced and being advanced does not suggest that it is reliable, but the human mind itself is complex and easily misled.
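
The difference in mechanics can be seen in a toy example.  The sketch below is not how any real RAID implementation is written; it ignores striping, rotating parity and everything else a real array does, and simply contrasts recovering a lost block from a mirror with recovering it from single XOR parity.  The names and data are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"AAAA", b"BBBB", b"CCCC"]  # three hypothetical data blocks

# Mirrored: the protection is an exact copy, and recovery is reading the copy.
mirror = [bytes(block) for block in data]
recovered_from_mirror = mirror[1]

# Parity: the protection is computed from every block on write, and recovery
# must read every surviving block plus parity and recompute the lost one.
parity = xor_blocks(data)
recovered_from_parity = xor_blocks([data[0], data[2], parity])

print(recovered_from_mirror, recovered_from_parity)
```

Both approaches get the lost block back here, but the parity path depends on every surviving device being read correctly and on the calculation being performed, where the mirror path is a plain read.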

There is no simple answer for finding simplicity.  Knowing that complexity is bad by its nature but unavoidable at times teaches us to be mindful, however it does not teach us when to suspect over-complexity.  We must be vigilant, always seeking to determine if a more elegant answer exists and not accept complexity as the correct answer simply because it is complex.  We need to question proposed solutions and question ourselves.  “Is this solution really as simple as it should be?”  “Is this complexity necessary?”  “Does this require the complexity that I had assumed?”

In most system design recommendations that I give, the first technical determination step that I normally take, after inquiring as to the business need to be solved, is to question complexity.  If complexity cannot be defended, it is probably unnecessary and actively defeating the purpose for which it was chosen.

“Is it really necessary to split those drives into many separate arrays?  If so, what is the technical justification for doing so?”

“Is shared storage really necessary for the task that you are proposing it for?”

“Does the business really justify the use of distributed high availability technologies?”

“Why are we replacing a simple system that was adequate yesterday with a dramatically more complex system tomorrow?  What has changed such that a major improvement that remains simple is not more than enough, and we instead require orders of magnitude more complexity and more spending that wasn’t justified previously?”

These are just common examples; complexity exists in every aspect of our industry.  Look for simplicity.  Strive for elegance.  Do not accept complexity without rigorously vetting it.  Put it through the proverbial wringer.  Do not allow complexity to creep in where it is not warranted.  Do not err on the side of complexity – when in doubt, fail simply.  Oversimplifying a solution typically results in a minor failure while making it overly complex allows for a far greater degree of failure.  The safer bet is with the simpler solution.  And if a simple solution is chosen and proven inadequate it is far easier to add complexity than it is to remove it.

Virtualization as a Standard Pattern

Virtualization as an enterprise concept is almost as old as business computing is itself.  The value of abstracting computing from the bare hardware was recognized very early on and almost as soon as computers had the power to manage the abstraction process, work began in implementing virtualization much as we know it today.

The earliest commonly accepted work on virtualization began in 1964 with the IBM CP-40 operating system developed for the IBM System/360 mainframe.  This was the first real foray into commercial virtualization and the code and design from this early virtualization platform has descended today into the IBM VM platform that has been used continuously since 1972 as a virtualization layer for the IBM mainframe families over the decades.  Since IBM first introduced virtualization we have seen enterprise systems adopting this pattern of hardware abstraction almost universally.  Many large scale computing systems, minicomputers and mainframes, moved to virtualization during the 1970s with the bulk of all remaining enterprise systems doing so, as the power and technology were available to them, during the 1980s and 1990s.

The only notable holdout to virtualization for enterprise computing was the Intel IA32 (aka x86) platform which lacked the advanced hardware resources necessary to implement effective virtualization until the advent of the extended AMD64 64-bit platform and even then only with specific new technology.  Once this was introduced the same high performance, highly secure virtualization was available across the board on all major platforms for business computing.

Because low cost x86 platforms lacked meaningful virtualization (outside of generally low performance software virtualization and niche high performance paravirtualization platforms) until the mid-2000s this left virtualization almost completely off of the table for the vast majority of small and medium businesses.  This has led many dedicated to the SMB space to be unaware that virtualization is a well established, mature technology set that long ago established itself as the de facto pattern for business server computing.  The use of hardware abstraction is nearly ubiquitous in enterprise computing with many of the largest, most stable platforms having no option, at least no officially supported option, for running systems “bare metal.”

There are specific niches where hardware abstraction through virtualization is not advised but these are extremely rare, especially in the SMB market.  Typical systems that should not be virtualized include latency sensitive systems (such as low latency trading platforms) and multi-server combined workloads such as HPC compute clusters where the primary goal is performance above stability and utility.  Neither of these is common to the SMB.

Virtualization offers many advantages.  Often, in the SMB where virtualization is less expected, it is assumed that virtualization’s goal is consolidation where massive scale cost savings can occur or in providing new ways to provide for high availability.  Both of these are great options that can help specific organizations and situations but neither is the underlying justification for virtualization.  We can consolidate and achieve HA through other means, if necessary.  Virtualization simply provides us with a great array of options in those specific areas.

Many of the uses of virtualization are artifacts of the ecosystem such as a potential reduction in licensing costs.  These types of advantages are not intrinsic advantages to virtualization but do exist and cannot be overlooked in a real world evaluation.  Not all benefits apply to all hypervisors or virtualization platforms but nearly all apply across the board.  Hardware abstraction is a concept, not an implementation, so how it is leveraged will vary.  Conceptually, abstracting away hardware whether at the storage layer, at the computing layer, etc. is very important as it eases management, improves reliability and speeds development.

Here are some of the benefits from virtualization.  It is important to note that outside of specific things such as consolidation and high availability nearly all of these benefits apply not only to virtualizing on a single hardware node but for a single workload on that node.

  1. Reduced human effort and impact associated with hardware changes, breaks, modifications, expansion, etc.
  2. Storage encapsulation for simplified backup / restore process, even with disparate hardware targets
  3. Snapshotting of entire system for change management protection
  4. Ease of archiving upon retirement or decommission
  5. Better monitoring capabilities, adding out of band management even on hardware platforms that don’t offer this natively
  6. Hardware agnosticism provides for no vendor lock-in as the operating systems believe the hypervisor is the hardware rather than the hardware itself
  7. Easy workload segmentation
  8. Easy consolidation while maintaining workload segmentation
  9. Greatly improved resource utilization
  10. Hardware abstraction creates a significantly realized opportunity for improved system performance and stability while lowering the demands on the operating system and driver writers for client operating systems
  11. Simplified deployment of new and varied workloads
  12. Simple transition from single platform to multi-platform hosting environments which then allow for the addition of options such as cloud deployments or high availability platform systems
  13. Redeployment of workloads to allow for easy physical scaling

In today’s computing environments, server-side workloads should be universally virtualized for these reasons.  The benefits of virtualization are extreme while the downsides are few and trivial.  The two common scenarios where virtualization still needs to be avoided are in situations where there is specialty hardware that must be used directly on the server (this has become very rare today, but does still exist from time to time) and extremely low latency systems where sub-millisecond latencies are critical.  The second of these is common only in extremely niche business situations such as low latency investment trading systems.  Systems with these requirements will also have incredible networking and geolocational requirements such as low-latency Infiniband with fiber to the trading floor of less than five miles.

Some people will point out that high performance computing clusters do not use virtualization, but this is a grey area as any form of clustering is, in fact, a form of virtualization.  It is simply that this is a “super-system” level of virtualization instead of being strictly at the system level.

It is safe to assume that in any scenario in which you should not use virtualization you will know it beyond a shadow of a doubt and will be able to empirically demonstrate why virtualization is either physically or practically impossible.  For all other cases, virtualize.  Virtualize if you have only one physical server and one physical workload and just one user.  Virtualize if you are a Fortune 100 with the most demanding workloads.  And virtualize if you are anyone in between.  Size is not a factor in virtualization; we virtualize out of a desire to have a more effective and stable computing environment both today and into the future.