
The Weakest Link: How Chained Dependencies Impact System Risk

When assessing system risk scenarios it is very easy to overlook “chained” dependencies.  We are trained to look at risk at a “node” level asking “how likely is this one thing to fail.”  But system risk is far more complicated than that.

In most systems there are some components that rely on other components. The most common place that we look at this is in the design of storage for servers, but it occurs in any system design.  Another good example is how web applications need both application hosts and database hosts in order to function.

It is easiest to explain chained dependencies with an example.  We will look at a standard virtualization design with SAN storage to understand where failure domain boundaries exist and where chained dependencies exist and what role redundancy plays in system level risk mitigation.

In a standard SAN (storage area network) design for virtualization you have virtualization hosts (which we will call the “servers” for simplicity), SAN switches (switches dedicated for the storage network) and the disk arrays themselves.  Each of these three “layers” is dependent on the others for the system, as a whole, to function.  If we had the simplest possible set with one server, one switch and one disk array we very clearly have three devices representing three distinct points of failure.  Any one of the three failing causes the entire system to fail.  No one piece is useful on its own.  This is a chained dependency and the chain is only as strong as its weakest link.

In our simplistic example, each device represents a failure domain.  We can mitigate risk by improving the reliability of each domain.  We can add a second server and implement a virtualization layer high availability or fault tolerance strategy to reduce the risk of server failure.  This improves the reliability of one failure domain but leaves two untouched and just as risky as they were before.  We can then address the switching layer by adding a redundant switch and configuring a multi-pathing strategy to handle the loss of a single switching path, reducing the risk at that layer.  Now two failure domains have been addressed.  Finally we have to address the storage failure domain which is done, similarly, by adding redundancy through a second disk array that is mirrored to the first and able to fail over transparently in the event of a failure.

Now that we have beefed up our system, we still have three failure domains in a dependency chain.  What we have done is made each “link” in the chain, each failure domain, extra resilient on its own.  But the chain still exists.  This means that the system, as a whole, is far less reliable than any single failure domain within the chain is alone.  We have made something far better than where we started, but we still have many failure domains.  These risks add up.

What is difficult in determining overall risk is that we must assess the risk of each item, then determine the new risk after mitigation (through the addition of redundancy) and then find the cumulative risk of each of the failure domains together in a chain to determine the total risk of the entire system.  It is extremely difficult to determine the risk within each failure domain as the manner of risk mitigation plays a significant role.  For example a cluster of storage disk arrays that fails over too slowly may result in an overall system failure even when the storage cluster itself appears to have worked properly.  Even defining a clear failure can therefore be challenging.
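The way these risks accumulate can be sketched with a few lines of Python.  This is only a rough model under generous assumptions: failures are independent, failover inside a domain is perfect and instant, and the 99% availability figure is hypothetical, chosen purely for illustration.  The function names are likewise made up for this sketch.

```python
from math import prod


def redundant_domain(node_availability: float, nodes: int) -> float:
    """Availability of one failure domain with `nodes` redundant devices.

    Assumes independent failures and perfect, instant failover, so the
    domain is down only when every device in it is down at once.
    """
    return 1.0 - (1.0 - node_availability) ** nodes


def chain_availability(domain_availabilities: list[float]) -> float:
    """A dependency chain is up only when every domain in it is up."""
    return prod(domain_availabilities)


# Hypothetical figure, chosen only for illustration.
NODE = 0.99

# Server -> switch -> disk array, one device per domain.
simple_chain = chain_availability([NODE, NODE, NODE])

# Two devices in each of the three domains.
one_redundant_domain = redundant_domain(NODE, 2)
redundant_chain = chain_availability([one_redundant_domain] * 3)

print(f"single devices, chained:    {simple_chain:.6f}")
print(f"one redundant domain alone: {one_redundant_domain:.6f}")
print(f"redundant domains, chained: {redundant_chain:.6f}")
```

Even in this best case the chained result sits below the figure for any one redundant domain on its own, which is the point made above.  Real failover is rarely perfect, so real numbers would be less flattering than these.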

It is often tempting to take a “from the top” view assessment of risk which is very dangerous, but very common for people who are not regular risk assessment practitioners.  The tendency here is to look at the risk only viewing the “top most” failure domain – generally the servers in a case like this, and ignoring any risks that sit beneath that point considering those to be “under the hood” rather than part of the risk assessment.  It is easy to ignore the more technical, less exposed and more poorly understood components like networking and storage and focus on the relatively easy to understand and heavily marketed reliability aspects of the top layer.  This “top view” means that the risks under the top level are obscured and generally ignored leading to high risk without a good understanding of why.

Understanding the concept of chained dependencies explains why complex systems, even with complex risk mitigation strategies, often result in being far more fragile than simpler systems.  In our above example, we could do several things to “collapse” the chain resulting in a more reliable system as a whole.

The most obvious component which can be collapsed is the networking failure domain.  If we were to remove the switches entirely and connect the storage directly to the servers (not always possible, of course) we would effectively eliminate one entire failure domain and remove a link from our chain.  Now instead of three links, each of which has some potential to fail, we have only two.  Simpler is better, all other things being equal.

We could, in theory, also collapse in the storage failure domain by going from external storage to using storage local to the servers themselves essentially taking us from two failure domains down to a single failure domain – the one remaining domain, of course, is carrying more complexity than it did before the collapsing, but the overall system complexity is greatly reduced.  Again, this is with all other factors remaining equal.
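To get a feel for what collapsing the chain buys us, here is a small sketch that converts chain length into expected downtime.  It takes “all other factors remaining equal” literally by giving every failure domain the same hypothetical availability (99.99%), which ignores the extra complexity that the surviving domain takes on.  The design labels and the function name are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365


def yearly_downtime_hours(domain_availability: float, links: int) -> float:
    """Expected downtime of a chain of `links` equally reliable domains."""
    chain = domain_availability ** links
    return (1.0 - chain) * HOURS_PER_YEAR


# Hypothetical availability for each (already redundant) failure domain.
DOMAIN = 0.9999

# Number of links in the dependency chain for each design.
designs = {
    "servers + switches + arrays": 3,
    "servers + direct attached arrays": 2,
    "servers with local storage": 1,
}

downtime = {
    name: yearly_downtime_hours(DOMAIN, links)
    for name, links in designs.items()
}

for name, hours in downtime.items():
    print(f"{name}: {hours:.2f} hours per year")
```

Under these assumptions each link removed takes a proportional slice of downtime with it.  In practice the remaining domain's own availability would shift as well, so this should be read as a direction rather than a measurement.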

Another approach to consider is making single nodes more reliable on their own.  It is trendy today to look at larger systems and approach risk mitigation in that way, by adding redundant, low cost nodes to add reliability to failure domains.  But traditionally this was not the default path taken to reliability.  It was far more common in the past, as is shown in the former prevalence of mainframe and similar classed systems, to build in high degrees of reliability into a single node.  Mainframe and high end storage systems, for example, still do this today.  This can actually be an extremely effective approach but fails to address many scenarios and is generally extremely costly, often magnified by a need to have systems partially or even completely maintained by the vendor.  This tends to work out only in special niche circumstances and is not practical on a more general scope.

So in any system of this nature we have three key risk mitigation strategies to consider: improve the reliability of a single node, improve the reliability of a single domain or reduce the number of failure domains (links) in the dependency chain.  Putting these together as is prudent can help us to achieve the risk mitigation level appropriate for our business scenario.

Where the true difficulty exists, and will remain, is in the comparison of different risk mitigation strategies.  The risk of a single node can generally be estimated with some level of confidence.  A redundancy strategy within a single domain has far less ability to be estimated – some redundancy strategies are highly effective, creating extremely reliable failure domains while others can actually backfire and reduce the reliability of a domain!  The complexity that often comes with redundancy strategies is never without caveat and while it will typically pay off, it rarely carries the degree of reliability benefit that is initially expected.  Estimating the risk of a dependency chain is therefore that much more difficult as it requires a clear understanding of the risks associated with each of the failure domains individually as well as an understanding of the failure opportunity existing at the domain boundaries (like the storage failover delay failure noted earlier).

Let’s explore the issues around determining risk in two very common approaches to the same scenario building on what we have discussed above.

Two extreme examples of the same situation we have been discussing are a single server with internal storage used to host virtual machines versus a six device “chain” with two servers and using a high availability solution at the server layer, two switches with redundancy at the switching layer and two disk arrays providing high availability at the storage layer.  If we switch any large factor here we can generally provide a pretty clear estimate of relative risk – if any of the failure domains lacks reliable redundancy, for example – we can pretty clearly determine that the single server is the more reliable overall system except in cases where an extreme amount of single node reliability is assigned to a single node, which is generally an impractical strategy financially.  But with each failure domain maintaining redundancy we are forced to compare the relative risks of intra-domain reliability (the redundant chain) vs. inter-domain reliability (the collapsed chain, single server.)

With the two entirely different approaches there is no reasonable way to assess the comparative risks of the two means of risk mitigation.  It is generally accepted that the six (or more) node approach with extensive intra-domain risk mitigation is the more reliable of the two approaches and this is almost certainly, generally true.  But it is not always true and rarely does this approach outperform the single node strategy by a truly significant margin while commonly costing four to ten fold as much as the single server strategy.  That is potentially a very high cost for what is likely a small gain in reliability and a small potential risk of a loss in reliability.  Each additional piece of redundancy adds complexity that a human must implement, monitor and maintain and with complexity and human interaction comes more and more risk.  Avoiding human error can often be more important than avoiding mechanical failure.
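A rough sketch can show how sensitive this comparison is to the quality of the redundancy.  The model below is an assumption of mine, not a standard formula: each two node domain is given a chance that failover actually works when needed and a chance that the redundancy mechanism itself (or the people operating it) causes an outage.  Every number in it is hypothetical.

```python
def redundant_pair(node: float, failover_success: float,
                   complexity_risk: float) -> float:
    """Availability of a two node domain with imperfect redundancy.

    node             -- availability of each device
    failover_success -- chance that failover works when one device is down
    complexity_risk  -- chance that the redundancy mechanism itself (or the
                        humans running it) takes the domain down
    """
    both_up = node * node
    one_up = 2 * node * (1.0 - node)
    return (1.0 - complexity_risk) * (both_up + one_up * failover_success)


def six_device_chain(node: float, failover_success: float,
                     complexity_risk: float) -> float:
    """Servers, switches and arrays: three redundant pairs in a chain."""
    return redundant_pair(node, failover_success, complexity_risk) ** 3


# All figures below are hypothetical.
SINGLE_SERVER = 0.99

ideal = six_device_chain(0.99, failover_success=1.0, complexity_risk=0.0)
messy = six_device_chain(0.99, failover_success=0.9, complexity_risk=0.002)

print(f"single server:            {SINGLE_SERVER:.4f}")
print(f"six devices, ideal case:  {ideal:.4f}")
print(f"six devices, messy case:  {messy:.4f}")
```

With perfect failover the six device chain comes out ahead, but modest imperfections in failover and a small amount of complexity induced risk are enough to drop it below the single server.  Since those two inputs are exactly the ones we cannot estimate well, the comparison stays uncertain.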

We must also consider the cost of recovery.  If failure is to occur it is generally trivial to recover from the failure of a simple system.  An extremely complex system, having failed, may take a great degree of effort to restore to a working condition.  Complex systems also require much broader and deeper degrees of experience and confidence to maintain.

There is no easy answer to determining the reliability of systems.  Modern information delivery systems are simply too large and too complex with too many indeterminable factors to be able to evaluate in all cases.  With a good understanding of chained dependencies, however, and an understanding of risk mitigation strategies we can take practical steps to determine roughly relative risk levels, see where similar risk scenarios compare in cost, identify points of fragility, recognize failure domains and dependency chains,  and appreciate how changes in system design will move us clearly towards or away from reliability.

The Inverted Pyramid of Doom

The 3-2-1 model of system architecture is extremely common today and almost always exactly the opposite of what a business needs or even wants if they were to take the time to write down their business goals rather than approaching an architecture from a technology first perspective.  Designing a solution requires starting with business requirements; otherwise we should not merely fear that the architecture will be inappropriately designed for the business, we should expect it.

The name refers to three (this is a soft point, it is often two or more) redundant virtualization host servers connected to two (or potentially more) redundant switches connected to a single storage device, normally a SAN (but DAS or NAS are valid here as well.) It’s an inverted pyramid because the part that matters, the virtualization hosts, depend completely on the network which, in turn, depends completely on the single SAN or alternative storage device. So everything rests on a single point of failure device and all of the protection and redundancy is built more and more on top of that fragile foundation. Unlike a proper pyramid with a wide, stable base and a point on top, this is built with all of the weakness at the bottom. (Often the ‘unicorn farts’ marketing model of “SANs are magic and can’t fail because of dual controllers” comes out here as people try to explain how this isn’t a single point of failure, but it is a single point of failure in every sense.)

So the solution, often called a 3-2-1 design, can also be called the “Inverted Pyramid of Doom” because it is an upside down pyramid that is too fragile to run and extremely expensive for what is delivered. So unlike many other fragile models, it is very costly, not very flexible and not as reliable as simply not doing anything beyond having a single quality server.
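The shape of the problem can be sketched numerically.  The figures here are hypothetical and deliberately kind to the 3-2-1 design: every device gets the same 99% availability, failures are independent, and failover in the redundant layers is assumed to be perfect.  Treating a single quality server with local storage as having that same 99% figure is also an assumption made only for comparison.

```python
def at_least_one_up(device: float, count: int) -> float:
    """Redundant layer: up unless all `count` devices are down together.

    Assumes independent failures and perfect failover (the best case).
    """
    return 1.0 - (1.0 - device) ** count


# Hypothetical availability, applied to every device for simplicity.
DEVICE = 0.99

hosts = at_least_one_up(DEVICE, 3)     # the "3"
switches = at_least_one_up(DEVICE, 2)  # the "2"
storage = DEVICE                       # the "1": nothing to fail over to

# Every layer must be up for the system to be up.
pyramid = hosts * switches * storage

# One quality server with local storage, given the same figure.
single_server = DEVICE

print(f"hosts layer:      {hosts:.6f}")
print(f"switch layer:     {switches:.6f}")
print(f"storage layer:    {storage:.6f}")
print(f"3-2-1 as a whole: {pyramid:.6f}")
print(f"single server:    {single_server:.6f}")
```

However much redundancy is stacked on top, the whole can never be more available than the single device at the bottom, and the extra links pull it slightly below that.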

There are times that a 3-2-1 makes sense, but mostly these are extreme edge cases where a fragile environment is desired and high levels of shared storage with massive processing capabilities are needed – not things you would see in the SMB world and very rarely elsewhere.

The inverted pyramid looks great to people who are not aware of the entire architecture, such as managers and business people.  There are a lot of boxes, a lot of wires, there are software components typically which are labeled “HA” which, to the outside observer, makes it sound like the entire solution must be highly reliable.  Inverted Pyramids are popular because they offer “HA” from a marketing perspective making everything sound wonderful and they keep the overall cost within reason so it seems almost like a miracle – High Availability promises without the traditional costs.  The additional “redundancy” of some of the components is great for marketing.  As reliability is difficult to measure, business people and technical people alike often resort to speaking of redundancy instead of reliability as it is easy to see redundancy.  The inverted pyramid speaks well to these people as it provides redundancy without reliability.  The redundancy is not where it matters most.  It is absolutely critical to remember that redundancy is not a check box nor is redundancy a goal, it is a tool to use to obtain reliability improvements.  Improper redundancy has no value.  What good is a car with a redundant steering wheel in the trunk?  What good is a redundant aircraft if you die when the first one crashes?  What good is a redundant server if your business is down and data lost when the single SAN went up in smoke?

The inverted pyramid is one of the most obvious and ubiquitous examples of “The Emperor’s New Clothes” used in technology sales.  It meets the needs of resellers and vendors by promoting high margin sales and minimizing low margin ones, and nearly every vendor promotes it because of its financial advantages to the seller.  It is just complicated and technical enough that widespread repudiation does not occur, and with the incredible market pressure from the vast array of vendors benefiting from the architecture it has become the status quo; few people stop to question whether the entire architecture has any merit.  On top of that, all systems today are highly reliable compared to systems of just a decade ago, so failures are uncommon enough that few notice they are more common than they should be, and statistical failure rates are not shared between SMBs.  The result is that the architecture thrives and has become the de facto solution set for most SMBs.

The bottom line is that the Inverted Pyramid approach makes no sense – it is far more unreliable than simpler solutions, even just a single server standing on its own, while costing many times more.  If cost is a key driver, it should be ruled out completely.  If reliability is a key driver, it should be ruled out completely.  Only if cost and reliability take very far back seats to flexibility should it even be put on the table and even then it is rare that a lower cost, more reliable solution doesn’t match it in overall flexibility within the anticipated scope of flexibility.  It is best avoided altogether.

Originally published on Spiceworks in abridged form:

Virtual Eggs and Baskets

In speaking with small business IT professionals, one of the key factors for hesitancy around deploying virtualization arises from what is described as “don’t put your eggs in one basket.”

I can see where this concern arises.  Virtualization allows for many guest operating systems to be contained in a single physical system which, in the event of a hardware failure, causes all guest systems residing on it to fail together, all at once.  This sounds bad, but perhaps it is not as bad as we would first presume.

The idea of the eggs and baskets idiom is that we should not put all of our resources at risk at the same time.  This is generally applied to investing, encouraging investors to diversify and invest in many different companies and types of securities like bonds, stocks, funds and commodities.  In the case of eggs (or money) we are talking about an interchangeable commodity.  One egg is as good as another.  A set of eggs are naturally redundant.

If we have a dozen eggs and we break six, we can still make an omelette, maybe a smaller one, but we can still eat.  Eating a smaller omelette is likely to be nearly as satisfying as a larger one – we are not going hungry in any case.  Putting our already redundant eggs into multiple baskets allows us to hedge our bets.  Yes, carrying two baskets means that we have less time to pay attention to either one so it increases the risk of losing some of the eggs but reduces the chances of losing all of the eggs.  In the case of eggs, a wise proposition indeed.  Likewise, a smart way to prepare for your retirement.

This theory, because it is repeated as an idiom without careful analysis or proper understanding, is then applied to unrelated areas such as server virtualization.  Servers, however, are not like eggs.  Servers, especially in smaller businesses, are rarely interchangeable commodities where having six working, instead of the usual twelve, is good enough.  Typically servers each play a unique role and all are relatively critical to the functioning of the business.  If a server is not critical then it is unlikely to be able to justify the cost of acquiring and maintaining itself in the first place and so would probably not exist.  When servers are interchangeable, such as in a large, stateless web farm or compute cluster, they are configured as such as a means to expanding capacity beyond the confines of a single, physical box and so fall outside the scope of this discussion.

IT services in a business are usually, at least to some degree, a “chain dependency.”  That is, they are interdependent and the loss of a single service may impact other services either because they are technically interdependent (such as a line of business application being dependent on a database) or because they are workflow interdependent (such as an office worker needing the file server working in order to provide a file which he needs to edit with information from an email while discussing the changes over the phone or instant messenger.)  In these cases, the loss of a single key service such as email, network authentication or file services may create a disproportionate loss of working ability.  If there are ten key services and one goes down, company productivity from an IT services perspective likely drops by far more than ten percent, possibly nearing one hundred percent in extreme cases.   This is not always true, in some unique cases workers are able to “work around” a lost service effectively, but this is very uncommon.  Even if people can remain working, they are likely far less productive than usual.

When dealing with physical servers, each server represents its own point of failure.  So if we have ten servers, we have ten times the likelihood of an outage that we would have with only one of those same servers.  Each server that we add brings with it its own risk.  Assume, for the sake of comparison, that each server fails once per decade.  If each failure has an outage factor of 2.5 – that is, it impacts the business at two and a half times that server’s one tenth share of services, or twenty five percent of revenue for, say, one day – then our total average impact over a decade is the equivalent of two and a half total site outages.  I use the concept of factors and averages here to make this easy; determining the length of an average outage or the impact of an average outage is not necessary as we only need to determine relative impact in this case to compare the scenarios.  It’s just a means of comparing the cumulative financial impact of one event type to another without needing specific figures – this doesn’t help you determine what your spend should be, just relative reliability.

With virtualization we have the obvious ability to consolidate.  In this example we will assume that we can collapse all ten of these existing servers down into a single server.  When we do this we often trigger the “all our eggs in one basket” response.  But if we run some risk analysis we will see that this is usually just fear and uncertainty and not a mathematically supported risk.  If we assume the same risks as the example above our single server will, on average, incur just a single total site outage, once per decade.

Compare this to the first example which did the damage equivalent to two and a half total site outages – the risk of the virtualized, consolidated solution is only forty percent that of the traditional solution.
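The arithmetic behind that comparison is small enough to write out.  This sketch assumes each physical server fails once per decade and reads the outage factor as a multiplier on a server’s proportional share of services (one tenth each for ten servers), capped at a full site outage.  The function name is made up for illustration.

```python
def decade_impact(servers: int, outage_factor: float,
                  failures_per_server: float = 1.0) -> float:
    """Total impact over a decade, measured in 'total site outages'.

    Each failure costs the server's proportional share of services
    multiplied by the outage factor, capped at one full site outage.
    """
    share = 1.0 / servers
    impact_per_failure = min(1.0, share * outage_factor)
    return servers * failures_per_server * impact_per_failure


# Ten separate servers, each failure hurting 2.5 times its share.
separate = decade_impact(servers=10, outage_factor=2.5)

# One consolidated server: a failure is simply one total site outage.
consolidated = decade_impact(servers=1, outage_factor=2.5)

# Break even case: each failure costs exactly its proportional share.
break_even = decade_impact(servers=10, outage_factor=1.0)

print(f"ten servers:  {separate:.2f} site outages per decade")
print(f"one server:   {consolidated:.2f} site outages per decade")
print(f"relative risk of consolidating: {consolidated / separate:.0%}")
print(f"ten servers at factor 1.0: {break_even:.2f}")
```

Only when the outage factor drops below 1.0, meaning a lost service costs less than its proportional share, does the ten server design come out ahead in this simple model.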

Now keep in mind that this is based on the assumption that losing some services means a financial loss greater than the strict value of the service that was lost, which is almost always the case.  Even if the loss is no greater than the strict value of the individual service we are only at break even and need not worry.  In rare cases impact from losing a single system can be less than its “slice of the pie”, normally because people are flexible and can work around the failed system – like if instant messaging fails and people simply switch to using email until instant messaging is restored – but these cases are rare and are normally isolated to a few systems out of many with the majority of systems, say ERP, CRM and email, having disproportionately large impacts in the event of an outage.

So what we see here is that under normal circumstances moving ten services from ten servers to ten services on one server will generally lower our risk, not increase it – in direct contrast to the “eggs in a basket” theory.  And this is purely from a hardware failure perspective.  Consolidation offers several other important reliability factors, though, that can have a significant impact to our case study.

With consolidation we reduce the amount of hardware that needs to be monitored and managed by the IT department.  Fewer servers means that more time and attention can be paid to those that remain.  More attention means a better chance of catching issues early and more opportunity to keep parts on hand.  Better monitoring and maintenance leads to better reliability.

Possibly the most important factor, however, with consolidation is that there is significant cost savings and this, if approached correctly, can provide opportunities for improved reliability.  With the dramatic reduction in total cost for servers it can be tempting to continue to keep budgets tight and attempt to purely leverage the cost savings directly.   Understandable and for some businesses this may be the correct approach.  But it is not the approach that I would recommend when struggling against the notion of eggs and baskets.

Instead by applying a more moderate approach keeping significant cost savings but still spending more, relatively speaking, on a single server you can acquire a higher end (read: more reliable) server, use better parts, have on-site spares, etc.  The cost savings of virtualization can often be turned directly into increased reliability further shifting the equation in favor of the single server approach.

As I stated in another article, one brick house is more likely to survive a wind storm than either one or two straw houses.  Having more of something doesn’t necessarily make it the more reliable choice.

These benefits come purely from the consolidation aspect of virtualization and not from the virtualization itself.  Virtualization provides extended risk mitigation features separately as well.  System imaging and rapid restores, as well as restores to different hardware, are major advantages of most any virtualization platform.  This can play an important role in a disaster recovery strategy.

Of course, all of these concepts are purely to demonstrate that single box virtualization and consolidation can beat the legacy “one app to one server” approach and still save money – showing that the example of eggs and baskets is misleading and does not apply in this scenario.    There should be little trepidation in moving from a traditional environment directly to a virtualized one based on these factors.

It should be noted that virtualization can then extend the reliability of traditional commodity hardware providing mainframe-like failover features that are above and beyond what non-virtualized platforms are able to provide.  This moves commodity hardware more firmly into line with the larger, more expensive RISC platforms.  These features can bring an extreme level of protection but are often above and beyond what is appropriate for IT shops initially migrating from a non-failover, legacy hardware server environment.  High availability is a great feature but is often costly and very often unnecessary, especially as companies move from, as we have seen, relatively unreliable environments in the past to more reliable environments today.  Given that we have already increased reliability over what was considered necessary in the past there is a very good chance that an extreme jump in reliability is not needed now, but due to the large drop in the cost of high availability, it is quite possible that it will be cost justified where previously it could not be.

In the same vein, virtualization is often feared because it is seen as a new, unproven technology.  This is certainly untrue but there is an impression of this in the small business and commodity server space.  In reality, though, virtualization was first introduced by IBM in the 1960s and ever since then has been a mainstay of high end mainframe and RISC servers – those systems demanding the best reliability.  In the commodity server space virtualization was a larger technical challenge and took a very long time before it could be implemented efficiently enough to make it effective to use in the real world.  But even in the commodity server space virtualization has been available since the late 1990s and so is approximately fifteen years old today which is very far past the point of being a nascent technology – in the world of IT it is positively venerable.  Commodity platform virtualization is a mature field with several highly respected, extremely advanced vendors and products.  The use of virtualization as a standard for all or nearly all server applications is a long established and accepted “enterprise pattern” and one that now can easily be adopted by companies of any and every size.

Virtualization, perhaps counter-intuitively, is actually a very critical component of a reliability strategy.  Instead of adding risk, virtualization can almost be approached as a risk mitigation platform – a toolkit for increasing the reliability of your computing platforms through many avenues.

Nearly As Good Is Not Better

As IT professionals we often have to evaluate several different approaches, products or techniques.  The IT field is vast and we are faced with so many options that it can become difficult to filter out the noise and find just the options that truly make sense in our environment.

One thing that I have found repeatedly creating a stumbling block for IT professionals is that they come from a stance of traditional, legacy knowledge (a natural situation since all of our knowledge has to have come from sometime in the past) and attempt to justify new techniques or technologies in relationship to the existing, established assumptions of “normal.”  This is to be expected.

IT is a field of change, however, and it is critical that IT professionals accept change as normal and not react to it as an undermining of traditional values.  It is not uncommon for people to feel that decisions that they have made in the past will be judged by the standards of today.  They feel that because there is a better option now that their old decision is somehow invalid or inadequate.  This is not the case.  This is exacerbated in IT because decisions made in the past that have been dramatically overturned in favour of new knowledge might only be a few years old and the people who made them still doing the same job.  Change in IT is much more rapid than in most fields and we can often feel betrayed by good decisions that we have made not long ago.

This reaction puts us into a natural, defensive position that we must rationally overcome in order to make objective decisions about our systems.

One trick that I have found is to reverse questions involving assumed norms.  That is to say, if you believe that you must justify a new technique against an old one and find that, while the case is convincing, you are not totally swayed, perhaps you should try the opposite – justify the old, accepted approach versus the new one.  I will give some examples that I see in the real world regularly.

Example one, in which we consider virtualization where none existed before.  Typically someone looking to do this will look for virtualization to provide some benefit that they consider to be significant.  Generally this results in someone feeling that virtualization either doesn’t offer adequate benefits or that they must incorporate other changes and end up going dramatically overboard for what should have been a smaller decision.  Instead, attempt to justify not using virtualization.  Treat virtualization as the accepted pattern (actually, it long has been, just not in the SMB space) and try to justify going with physical servers instead.

What we find is that, normally, our minds accepted that the physical machine only had to be “nearly as good” or “acceptable” in order to be chosen even though virtualization was, in nearly all cases, “better”.  Why would we decide to use something that is not “better”?  Because we approached one as change and one as not change.  Our minds play tricks on us.

Example two, in which traditional server storage is two arrays with the operating system on one RAID 1 array and the data partition on a second RAID 5 array versus the new standard of a single RAID 10 array holding both operating system and data.  If we argue from the aspect of the traditional approach we can make decent arguments, at times, that we can make the old system adequate for our needs.  Adequate seems good enough to not change our approach.  But argue from the other direction.  If we assume RAID 10 is the established, accepted norm (again, it is today) then it is clear that it comes out as dramatically superior in nearly all scenarios.  If we try to justify why we would choose a split array with RAID 1 and RAID 5 we would quickly see that it never provides a compelling value.  So sticking with RAID 10 is a clear win.

This reversal of thinking can provide for a dramatic, eye-opening effect on decision making.  Making assumptions about starting points and forcing new ideas to significantly “unseat” incumbent thinking is dangerous.  This keeps us from moving forward.  In reality, most approaches should start from equal ground and the “best” option should win.  It is far too often that a solution is considered “adequate” when it is not the best.  Yes, a solution may very well work in a given situation but why would we ever intentionally choose a less than superior solution (we assume that cost is factored into the definition of best)?

As IT professionals attempting to solve problems for a business we should be striving to recommend and implement the best possible solutions, yet we end up making do with less than ideal ones simply because we forget to weigh the reasonable options equally against one another.  And it is important to remember that cost is inclusive in deciding when a solution is best or adequate.  The best solution is not a perfect solution but the best for the company, for the money.  But very often solutions are chosen that cost more and do less simply because they are considered the de facto starting point and the alternatives are expected to dramatically outperform them rather than simply being “better”.

Taking a fresh look at decision making can help us become better professionals.