When assessing system risk scenarios it is very easy to overlook “chained” dependencies. We are trained to look at risk at a “node” level, asking “how likely is this one thing to fail?” But system risk is far more complicated than that.
In most systems there are some components that rely on other components. The most common place that we look at this is in the design of storage for servers, but it occurs in any system design. Another good example is how web applications need both application hosts and database hosts in order to function.
It is easiest to explain chained dependencies with an example. We will look at a standard virtualization design with SAN storage to understand where failure domain boundaries lie, where chained dependencies arise, and what role redundancy plays in system-level risk mitigation.
In a standard SAN (storage area network) design for virtualization you have virtualization hosts (which we will call the “servers” for simplicity), SAN switches (switches dedicated for the storage network) and the disk arrays themselves. Each of these three “layers” is dependent on the others for the system, as a whole, to function. If we had the simplest possible set with one server, one switch and one disk array we very clearly have three devices representing three distinct points of failure. Any one of the three failing causes the entire system to fail. No one piece is useful on its own. This is a chained dependency and the chain is only as strong as its weakest link.
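The “weakest link” intuition can be made concrete with series availability arithmetic. This is a minimal sketch, using hypothetical 99% availability figures for each device rather than any real vendor data, to show that a chain of dependent components is always less available than its least available member.

```python
# Sketch of series ("chained") availability. The 0.99 figures are
# illustrative assumptions, not measured or vendor-supplied numbers.
def series_availability(availabilities):
    """A chain works only if every link works, so probabilities multiply."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total

# One server, one SAN switch, one disk array, each hypothetically 99% available.
chain = series_availability([0.99, 0.99, 0.99])
print(round(chain, 4))  # roughly 0.9703 -- the chain is weaker than any single link
```

Even with every device at 99%, the system as a whole lands near 97%: three modest risks compound into a noticeably larger one.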
In our simplistic example, each device represents a failure domain. We can mitigate risk by improving the reliability of each domain. We can add a second server and implement a virtualization-layer high availability or fault tolerance strategy to reduce the risk of server failure. This improves the reliability of one failure domain but leaves two untouched and just as risky as they were before. We can then address the switching layer by adding a redundant switch and configuring a multi-pathing strategy to handle the loss of a single switching path, reducing the risk at that layer. Now two failure domains have been addressed. Finally we have to address the storage failure domain, which is done, similarly, by adding redundancy through a second disk array that is mirrored to the first and able to fail over transparently in the event of a failure.
Now that we have beefed up our system, we still have three failure domains in a dependency chain. What we have done is make each “link” in the chain, each failure domain, extra resilient on its own. But the chain still exists. This means that the system, as a whole, is still less reliable than any single failure domain within the chain is alone. We have made something far better than where we started, but we still have multiple failure domains, and these risks add up.
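The “redundant links, same chain” idea can be sketched numerically: redundancy within a domain combines in parallel (the domain fails only if every member fails), while the domains themselves still combine in series. The availability figures below are illustrative assumptions only.

```python
# Sketch: redundancy within each failure domain (parallel), then the
# domains chained together (series). All numbers are hypothetical.
def parallel(a, n=2):
    """Domain survives unless every one of its n redundant members fails."""
    return 1.0 - (1.0 - a) ** n

def series(availabilities):
    """The chain works only if every domain works."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total

domain = parallel(0.99)          # each mirrored pair: 0.9999
system = series([domain] * 3)    # three redundant domains, still chained
print(round(system, 6))          # ~0.9997 -- much better, yet below any one domain
```

Each redundant pair is dramatically more available than a single device, but chaining three of them still leaves the whole system below the availability of any single domain.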
What is difficult in determining overall risk is that we must assess the risk of each item, then determine the new risk after mitigation (through the addition of redundancy) and then find the cumulative risk of each of the failure domains together in a chain to determine the total risk of the entire system. It is extremely difficult to determine the risk within each failure domain as the manner of risk mitigation plays a significant role. For example a cluster of storage disk arrays that fails over too slowly may result in an overall system failure even when the storage cluster itself appears to have worked properly. Even defining a clear failure can therefore be challenging.
It is often tempting to take a “from the top” view of risk, which is very dangerous but very common among people who are not regular risk assessment practitioners. The tendency is to assess only the “top most” failure domain – generally the servers in a case like this – and to ignore any risks that sit beneath that point, treating them as “under the hood” rather than as part of the risk assessment. It is easy to overlook the more technical, less exposed and more poorly understood components like networking and storage, and to focus instead on the relatively easy to understand and heavily marketed reliability features of the top layer. This “top view” obscures the risks beneath the top level, which are then generally ignored, leading to high risk without a good understanding of why.
Understanding the concept of chained dependencies explains why complex systems, even with complex risk mitigation strategies, often end up being far more fragile than simpler systems. In our above example, we could do several things to “collapse” the chain, resulting in a more reliable system as a whole.
The most obvious component which can be collapsed is the networking failure domain. If we were to remove the switches entirely and connect the storage directly to the servers (not always possible, of course) we would effectively eliminate one entire failure domain and remove a link from our chain. Now instead of three links, each of which has some potential to fail, we have only two. Simpler is better, all other things being equal.
We could, in theory, also collapse the storage failure domain by moving from external storage to storage local to the servers themselves, taking us from two failure domains down to one. The one remaining domain, of course, carries more complexity than it did before the collapsing, but the overall system complexity is greatly reduced. Again, this is with all other factors remaining equal.
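The effect of collapsing links can be sketched by holding the per-domain availability constant (a hypothetical 99.99% for each redundant domain) and varying only the number of links in the chain.

```python
# Sketch comparing chain lengths, assuming each remaining domain keeps
# the same illustrative availability (all other things being equal).
def series(availabilities):
    """The chain works only if every domain works."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total

a = 0.9999  # hypothetical availability of one redundant failure domain
results = {links: series([a] * links) for links in (3, 2, 1)}
for links, avail in results.items():
    print(links, round(avail, 6))
# Fewer links -> higher system availability, holding per-domain risk constant.
```

Under these assumptions, each link removed from the chain moves the system's availability closer to that of a single domain, which is the best a series chain can ever achieve.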
Another approach to consider is making single nodes more reliable on their own. It is trendy today to look at larger systems and approach risk mitigation that way, by adding redundant, low cost nodes to add reliability to failure domains. But traditionally this was not the default path taken to reliability. It was far more common in the past, as shown in the former prevalence of mainframes and similarly classed systems, to build high degrees of reliability into a single node. Mainframe and high end storage systems, for example, still do this today. This can actually be an extremely effective approach but fails to address many scenarios and is generally extremely costly, often magnified by a need to have systems partially or even completely maintained by the vendor. This tends to work out only in special niche circumstances and is not practical on a more general scope.
So in any system of this nature we have three key risk mitigation strategies to consider: improve the reliability of a single node, improve the reliability of a single domain, or reduce the number of failure domains (links) in the dependency chain. Combining these prudently can help us achieve the risk mitigation level appropriate for our business scenario.
Where the true difficulty exists, and will remain, is in the comparison of different risk mitigation strategies. The risk of a single node can generally be estimated with some level of confidence. A redundancy strategy within a single domain is far harder to estimate – some redundancy strategies are highly effective, creating extremely reliable failure domains, while others can actually backfire and reduce the reliability of a domain! The complexity that often comes with redundancy strategies is never without caveat, and while it will typically pay off, it rarely carries the degree of reliability benefit that is initially expected. Estimating the risk of a dependency chain is therefore that much more difficult, as it requires a clear understanding of the risks associated with each of the failure domains individually as well as an understanding of the failure opportunities existing at the domain boundaries (like the storage failover delay failure noted earlier.)
Let’s explore the issues around determining risk in two very common approaches to the same scenario building on what we have discussed above.
Two extreme examples of the situation we have been discussing are a single server with internal storage used to host virtual machines versus a six-device “chain”: two servers using a high availability solution at the server layer, two switches with redundancy at the switching layer and two disk arrays providing high availability at the storage layer. If we change any large factor here – if any of the failure domains lacks reliable redundancy, for example – we can pretty clearly determine that the single server is the more reliable overall system, except in cases where an extreme amount of reliability is built into a single node, which is generally an impractical strategy financially. But with each failure domain maintaining redundancy, we are forced to compare the relative risks of intra-domain reliability (the redundant chain) vs. inter-domain reliability (the collapsed chain, the single server.)
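The two extremes can be sketched side by side. The inputs here are loudly hypothetical – a single well-built server assumed at 99.9% and commodity devices assumed at 99% each – chosen only to illustrate how the comparison works, not to settle it.

```python
# Sketch of the two extremes with hypothetical availability figures:
# a single reliable server vs. a six-device chain of three redundant pairs.
def parallel(a, n=2):
    """A redundant pair fails only if both members fail."""
    return 1.0 - (1.0 - a) ** n

def series(availabilities):
    """The chain works only if every domain works."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total

single_server = 0.999                            # one node, internal storage (assumed)
redundant_chain = series([parallel(0.99)] * 3)   # servers, switches, arrays (assumed)

print(round(single_server, 6), round(redundant_chain, 6))
```

With these assumed inputs the redundant chain comes out ahead (roughly 0.9997 versus 0.999), but the margin is small relative to the added cost and complexity – and small enough that implementation quality, failover behavior and human error can easily swing the real-world outcome either way.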
With two such different approaches there is no precise way to assess the comparative risks of the two means of risk mitigation. It is generally accepted that the six (or more) node approach with extensive intra-domain risk mitigation is the more reliable of the two, and this is generally true. But it is not always true, and rarely does this approach outperform the single node strategy by a truly significant margin, while commonly costing four to ten times as much as the single server strategy. That is potentially a very high cost for what is likely a small gain in reliability and a small potential risk of a loss in reliability. Each additional piece of redundancy adds complexity that a human must implement, monitor and maintain, and with complexity and human interaction comes more and more risk. Avoiding human error can often be more important than avoiding mechanical failure.
We must also consider the cost of recovery. If failure is to occur it is generally trivial to recover from the failure of a simple system. An extremely complex system, having failed, may take a great degree of effort to restore to a working condition. Complex systems also require much broader and deeper degrees of experience and confidence to maintain.
There is no easy answer to determining the reliability of systems. Modern information delivery systems are simply too large and too complex with too many indeterminable factors to be able to evaluate in all cases. With a good understanding of chained dependencies, however, and an understanding of risk mitigation strategies we can take practical steps to determine roughly relative risk levels, see where similar risk scenarios compare in cost, identify points of fragility, recognize failure domains and dependency chains, and appreciate how changes in system design will move us clearly towards or away from reliability.