{"id":647,"date":"2014-11-01T16:38:46","date_gmt":"2014-11-01T21:38:46","guid":{"rendered":"http:\/\/www.smbitjournal.com\/?p=647"},"modified":"2017-02-19T03:41:15","modified_gmt":"2017-02-19T08:41:15","slug":"the-weakest-link-how-chained-dependencies-impact-system-risk","status":"publish","type":"post","link":"https:\/\/smbitjournal.com\/2014\/11\/the-weakest-link-how-chained-dependencies-impact-system-risk\/","title":{"rendered":"The Weakest Link: How Chained Dependencies Impact System Risk"},"content":{"rendered":"

When assessing system risk scenarios it is very easy to overlook “chained” dependencies. \u00a0We are trained to look at risk at a “node” level asking “how likely is this one thing to fail.” \u00a0But system risk is far more complicated than that.<\/p>\n

In most systems there are some components that rely on other components. The most common place that we look at this is in the design of storage for servers, but it occurs in any system design. \u00a0Another good example is how web applications need both application hosts and database hosts in order to function.<\/p>\n

It is easiest to explain chained dependencies with an example. \u00a0We will look at a standard virtualization design with SAN storage to understand where failure domain boundaries exist and where chained dependencies exist and what role redundancy plays in system level risk mitigation.<\/p>\n

In a standard SAN (storage area network) design for virtualization you have virtualization hosts (which we will call the “servers” for simplicity), SAN\u00a0switches (switches dedicated for the storage network) and the disk arrays themselves. \u00a0Each of these three “layers” is dependent on the others for the system, as a whole, to function. \u00a0If we had the simplest possible set with one server, one switch and one disk array we very clearly have three devices representing three distinct points of failure. \u00a0Any one of the three failing causes the entire system to fail. \u00a0No one piece is useful on its own. \u00a0This is a chained dependency and the chain is only as strong as its weakest link.<\/p>\n

In our simplistic example, each device represents a failure domain. \u00a0We can mitigate risk by improving the reliability of each domain. \u00a0We can add a second server and implement a virtualization-layer high availability or fault tolerance strategy to reduce the risk of server failure. \u00a0This improves the reliability of one failure domain but leaves two untouched and just as risky as they were before. \u00a0We can then address the switching layer by adding a redundant switch and configuring a multi-pathing strategy to handle the loss of a single switching path, reducing the risk at\u00a0that layer. \u00a0Now two failure domains have been addressed. \u00a0Finally we have to address the storage failure domain which is done, similarly, by adding redundancy through a second disk array that is mirrored to the first and able to fail over transparently in the event of a failure.<\/p>\n

Now that we have beefed up our system, we still have three failure domains in a dependency chain. \u00a0What we have done is made each “link” in the chain, each failure domain, extra resilient on its own. \u00a0But the chain still exists. \u00a0This means that the system, as a whole, is far less reliable than any single failure domain within the chain is alone. \u00a0We have made something far better than where we started, but we still have many failure domains. \u00a0These risks add up.<\/p>\n
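
<p>The way these risks add up can be sketched with simple arithmetic. The snippet below is a minimal illustration, assuming failures are independent and that redundancy inside a domain works perfectly, which real systems rarely achieve. The availability figure is a made-up placeholder rather than a measurement, and the function names are hypothetical.<\/p>\n

```python
from math import prod


def redundant_domain(node_availability, nodes):
    # The domain is up if at least one node is up
    # (assumes independent failures and perfect failover).
    return 1 - (1 - node_availability) ** nodes


def chain(domain_availabilities):
    # Every link must be up for the chain as a whole to be up.
    return prod(domain_availabilities)


# Placeholder figure chosen only for illustration.
node = 0.99

simple = chain([node, node, node])      # one server, one switch, one array
domain = redundant_domain(node, 2)      # a single domain with two nodes
redundant = chain([domain] * 3)         # three redundant domains in a chain

print(f'three single devices chained:    {simple:.6f}')
print(f'one redundant domain on its own: {domain:.6f}')
print(f'three redundant domains chained: {redundant:.6f}')
```

<p>Even under these generous assumptions the chained result is lower than that of any one domain within it, because the availabilities multiply.<\/p>\n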

What is difficult in determining overall risk is that we must assess the risk of each item, then determine the new risk after mitigation (through the addition of redundancy) and then find the cumulative risk of each of the failure domains together in a chain to determine the total risk of the entire system. \u00a0It is extremely difficult to determine the risk within each failure domain as the manner of risk mitigation plays a significant role. \u00a0For example a cluster of storage disk arrays that fails over too slowly may result in an overall system failure even when the storage cluster itself appears to have worked properly. \u00a0Even defining a clear failure can therefore be challenging.<\/p>\n

It is often tempting to take a “from the top” view assessment of risk which is very dangerous, but very common for people who are not regular risk assessment practitioners. \u00a0The tendency here is to look at the risk only viewing the “top most” failure domain – generally the servers in a case like this, and ignoring any risks that sit beneath that point considering those to be “under the hood” rather than part of the risk assessment. \u00a0It is easy to ignore the more technical, less exposed and more poorly understood components like networking and storage and focus on the relatively easy to understand and heavily marketed reliability aspects of the top layer. \u00a0This “top view” means that the risks under the top level are obscured and generally ignored leading to high risk without a good understanding of why.<\/p>\n

Understanding the concept of chained dependencies explains why complex systems, even with complex risk mitigation strategies, often result in being far more fragile than simpler systems. \u00a0In our above example, we could do several things to “collapse” the chain resulting in a more reliable system as a whole.<\/p>\n

The most obvious component which can be collapsed is the networking failure domain. \u00a0If we were to remove the switches entirely and connect the storage directly to the servers (not always possible, of course) we would effectively eliminate one entire failure domain and remove a link from our chain. \u00a0Now instead of three links, each of which has some potential to fail, we have only two. \u00a0Simpler is better, all other things being equal.<\/p>\n

We could, in theory, also collapse in the storage failure domain by going from external storage to using storage local to the servers themselves essentially taking us from two failure domains down to a single failure domain – the one remaining domain, of course, is carrying more complexity than it did before the collapsing, but the overall system complexity is greatly reduced. \u00a0Again, this is with all other factors remaining equal.<\/p>\n
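
<p>As a rough sketch of what collapsing the chain buys, the snippet below holds the per-domain availability fixed and varies only the number of links. It assumes independent failures and identical domains, which is the all-other-things-being-equal case and not a realistic one, since a collapsed domain carries more complexity than it did before. The figure used is a placeholder and the function name is hypothetical.<\/p>\n

```python
def chain_availability(per_domain, links):
    # Identical, independent domains in series: availabilities multiply.
    return per_domain ** links


# Placeholder figure chosen only for illustration.
per_domain = 0.999

for links in (3, 2, 1):
    failure = 1 - chain_availability(per_domain, links)
    print(f'{links} link(s): chance of failure about {failure:.4%}')
```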

Another approach to consider is making single nodes more reliable on their own. \u00a0It is trendy today to look at larger systems and approach risk mitigation in that way, by adding redundant, low cost nodes to add reliability to failure domains. \u00a0But traditionally this was not the default path taken to reliability. \u00a0It was far more common in the past, as is shown in the former prevalence of mainframe and similar classed systems, to build in high degrees of reliability into a single node. \u00a0Mainframe and high end storage systems, for example, still do this today. \u00a0This can actually be an extremely effective approach but fails to address many scenarios and is generally extremely costly, often magnified by a need to have systems partially or even completely maintained by the vendor. \u00a0This tends to work out only in special niche circumstances and is not practical on a more general scope.<\/p>\n

So in any system of this nature we have three key risk mitigation strategies to consider: improve the reliability of a single node, improve the reliability of a single domain or reduce the number of failure domains (links) in the dependency chain. \u00a0Putting these together as is prudent can help us to achieve the risk mitigation level appropriate for our business scenario.<\/p>\n

Where the true difficulty exists, and will remain, is in the comparison of different risk mitigation strategies. \u00a0The risk of a single node can generally be estimated with some level of confidence. \u00a0A redundancy strategy within a single domain has far less ability to be estimated – some redundancy strategies are highly effective, creating extremely reliable failure domains, while others can actually backfire and reduce the reliability of a domain! \u00a0The complexity that often comes with redundancy strategies is never without caveat and while it will typically pay off, it rarely carries the degree of reliability benefit that is initially expected. \u00a0Estimating the risk of a dependency chain is therefore that much more difficult as it requires a clear understanding of the risks associated with each of the failure domains individually as well as an understanding of the failure opportunity existing at the domain boundaries (like the storage failover delay failure noted earlier).<\/p>\n
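
<p>One way to see how redundancy can backfire is a toy model in which failover does not always succeed and the redundancy mechanism itself can cause an outage. This is only a sketch under stated assumptions: the model, the parameter names and every number in it are hypothetical, chosen to show the shape of the effect rather than to describe any real product.<\/p>\n

```python
def redundant_pair(node, failover_success, mechanism_failure):
    # Up when the primary is up, or when the primary is down, the
    # secondary is up, and the failover actually succeeds.
    covered = node + (1 - node) * node * failover_success
    # The clustering or failover mechanism can itself take the domain down.
    return covered * (1 - mechanism_failure)


# Placeholder figures chosen only for illustration.
node = 0.99

good = redundant_pair(node, failover_success=0.95, mechanism_failure=0.001)
bad = redundant_pair(node, failover_success=0.50, mechanism_failure=0.02)

print(f'single node:            {node:.6f}')
print(f'well-behaved redundancy: {good:.6f}')
print(f'fragile redundancy:      {bad:.6f}')
```

<p>With these made-up inputs the well-behaved pair beats the single node, while the fragile pair ends up less reliable than the single node it was meant to protect.<\/p>\n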

Let’s explore the issues around determining risk in two very common approaches to the same scenario building on what we have discussed above.<\/p>\n

Two extreme examples of the same situation we have been discussing are a single server with internal storage used to host virtual machines versus a six device “chain” with two servers using a high availability solution at the server layer, two switches with redundancy at the switching layer and two disk arrays providing high availability at the storage layer. \u00a0If we change any large factor here we can generally provide a pretty clear estimate of relative risk. \u00a0If any of the failure domains lacks reliable redundancy, for example, we can pretty clearly determine that the single server is the more reliable overall system, except in cases where an extreme amount of reliability is built into each individual node, which is generally an impractical strategy financially. \u00a0But with each failure domain maintaining redundancy we are forced to compare the relative risks of intra-domain reliability (the redundant chain) vs. inter-domain reliability (the collapsed chain, single server).<\/p>\n

With the two entirely different approaches there is no reasonable way to assess the comparative risks of the two means of risk mitigation. \u00a0It is generally accepted that the six (or more) node approach with extensive intra-domain risk mitigation is the more reliable of the two approaches and this is almost certainly true in general. \u00a0But it is not always true, and rarely does this approach outperform the single node strategy by a truly significant margin, while commonly costing four to ten times as much as the single server strategy. \u00a0That is potentially a very high cost for what is likely a small gain in reliability and a small potential risk of a loss in reliability. \u00a0Each additional piece of redundancy adds complexity that a human must implement, monitor and maintain, and with complexity and human interaction comes more and more risk. \u00a0Avoiding human error can often be more important than avoiding mechanical failure.<\/p>\n

We must also consider the cost of recovery. \u00a0If failure is to occur it is generally trivial to recover from the failure of a simple system. \u00a0An extremely complex system, having failed, may take a great degree of effort to restore to a working condition. \u00a0Complex systems also require much broader and deeper degrees of experience and confidence to maintain.<\/p>\n

There is no easy answer to determining the reliability of systems. \u00a0Modern information delivery systems are simply too large and too complex with too many indeterminable factors to be able to evaluate in all cases. \u00a0With a good understanding of chained dependencies, however, and an understanding of risk mitigation strategies we can take practical steps to roughly determine relative risk levels, see where similar risk scenarios compare in cost, identify points of fragility, recognize failure domains and dependency chains, and appreciate how changes in system design will move us clearly towards or away from reliability.<\/p>\n","protected":false},"excerpt":{"rendered":"

When assessing system risk scenarios it is very easy to overlook “chained” dependencies. \u00a0We are trained to look at risk at a “node” level asking “how likely is this one thing to fail.” \u00a0But system risk is far more complicated than that. In most systems there are some components that rely on other components. The … Continue reading The Weakest Link: How Chained Dependencies Impact System Risk<\/span> →<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[120,132,133],"tags":[189],"class_list":["post-647","post","type-post","status-publish","format-standard","hentry","category-architecture","category-best-practices","category-risk","tag-dependency-chain"],"_links":{"self":[{"href":"https:\/\/smbitjournal.com\/wp-json\/wp\/v2\/posts\/647","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/smbitjournal.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/smbitjournal.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/smbitjournal.com\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/smbitjournal.com\/wp-json\/wp\/v2\/comments?post=647"}],"version-history":[{"count":8,"href":"https:\/\/smbitjournal.com\/wp-json\/wp\/v2\/posts\/647\/revisions"}],"predecessor-version":[{"id":660,"href":"https:\/\/smbitjournal.com\/wp-json\/wp\/v2\/posts\/647\/revisions\/660"}],"wp:attachment":[{"href":"https:\/\/smbitjournal.com\/wp-json\/wp\/v2\/media?parent=647"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/smbitjournal.com\/wp-json\/wp\/v2\/categories?post=647"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/smbitjournal.com\/wp-json\/wp\/v2\/tags?post=647"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}