In speaking with small business IT professionals, I find that one of the key sources of hesitancy around deploying virtualization is the old warning: “don’t put all of your eggs in one basket.”
I can see where this concern arises. Virtualization allows many guest operating systems to be contained on a single physical system, so in the event of a hardware failure all of the guests residing on it fail together, all at once. This sounds bad, but perhaps it is not as bad as we would first presume.
The idea of the eggs and baskets idiom is that we should not put all of our resources at risk at the same time. It is generally applied to investing, encouraging investors to diversify across many different companies and types of securities such as bonds, stocks, funds and commodities. In the case of eggs (or money) we are talking about an interchangeable commodity: one egg is as good as another, and a set of eggs is naturally redundant.
If we have a dozen eggs and we break six, we can still make an omelette. It may be a smaller one, but we can still eat, and a smaller omelette is likely to be nearly as satisfying as a larger one; we are not going hungry in any case. Putting our already redundant eggs into multiple baskets lets us hedge our bets. Yes, carrying two baskets means we have less attention for either one, which increases the risk of losing some of the eggs, but it reduces the chance of losing all of them. In the case of eggs, a wise proposition indeed. Likewise, a smart way to prepare for your retirement.
This theory, because it is repeated as an idiom without careful analysis or proper understanding, is then applied to unrelated areas such as server virtualization. Servers, however, are not like eggs. Servers, especially in smaller businesses, are rarely interchangeable commodities where having six working instead of the usual twelve is good enough. Typically each server plays a unique role, and all are relatively critical to the functioning of the business. If a server is not critical, it is unlikely to justify the cost of acquiring and maintaining it in the first place and so would probably not exist. When servers are interchangeable, such as in a large, stateless web farm or compute cluster, they are configured that way as a means of expanding capacity beyond the confines of a single physical box and so fall outside the scope of this discussion.
IT services in a business are usually, at least to some degree, a “chain dependency.” That is, they are interdependent, and the loss of a single service may impact other services either because they are technically interdependent (such as a line of business application depending on a database) or because they are workflow interdependent (such as an office worker needing the file server in order to open a file which he must edit with information from an email while discussing the changes over the phone or instant messenger). In these cases, the loss of a single key service such as email, network authentication or file services may create a disproportionate loss of working ability. If there are ten key services and one goes down, company productivity from an IT services perspective likely drops by far more than ten percent, possibly nearing one hundred percent in extreme cases. This is not always true; in some unique cases workers are able to “work around” a lost service effectively, but this is very uncommon. Even if people can keep working, they are likely far less productive than usual.
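To make that disproportion concrete, here is a toy model (my own illustration; the task mix and numbers are assumptions, not data from the article): if each task a worker performs needs some number of the ten key services, then one failed service blocks a task whenever that service is among the ones the task needs.

```python
# Toy model of chain dependency. Assumption (mine, for illustration): each
# task needs k distinct services out of the ten, so one failed service
# blocks the task with probability k/10.
services = 10

for k in (1, 3, 5):
    blocked = k / services  # chance the failed service is one the task needs
    print(f"tasks needing {k} of {services} services: {blocked:.0%} blocked")

# Prints 10%, 30%, 50%: as soon as tasks span multiple services, losing one
# service out of ten blocks far more than one tenth of the work.
```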
When dealing with physical servers, each server represents its own point of failure, so if we have ten servers we have ten times the likelihood of an outage that we would have with only one of those same servers. Each server that we add brings with it its own risk. Assume each server fails, on average, once per decade, and each failure has an outage factor of 2.5 – that is, it financially impacts the business for twenty-five percent of revenue for, say, one day. Our total average impact over a decade is then the equivalent of two and a half total site outages. I use the concept of factors and averages here to keep things simple; determining the length or cost of an average outage is not necessary, as we only need relative impact to compare the scenarios. It is just a means of comparing the cumulative financial impact of one event type against another without needing specific figures – it does not tell you what your spend should be, just relative reliability.
With virtualization we have the obvious ability to consolidate. In this example we will assume that we can collapse all ten of these existing servers down into a single server. When we do this we often trigger the “all our eggs in one basket” response. But if we run some risk analysis we will see that this is usually just fear and uncertainty, not a mathematically supported risk. If we assume the same risks as in the example above, our single server will, on average, incur just one total site outage per decade.
Compare this to the first example, which did damage equivalent to two and a half total site outages: the risk of the virtualized, consolidated solution is only forty percent that of the traditional solution.
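For those who want to see the arithmetic laid out, here is a minimal sketch of the comparison using the assumptions above (one failure per server per decade, each individual failure doing damage equal to a quarter of a total site outage):

```python
# Relative cumulative outage impact over a decade, per the example above.
failures_per_decade = 1.0   # average failures per server per decade
impact_per_failure = 0.25   # each failure = 25% of a total site outage

# Ten physical servers, one service each.
physical = 10 * failures_per_decade * impact_per_failure   # 2.5 site outages

# One consolidated host: a failure takes the whole site down, but there is
# only one box left to fail.
virtual = 1 * failures_per_decade * 1.0                    # 1.0 site outage

print(virtual / physical)  # 0.4 -> forty percent of the traditional risk
```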
Now keep in mind that this is based on the assumption that losing some services means a financial loss greater than the strict value of the service that was lost, which is almost always the case. Even if the impact is no greater than the value of the individual service, we are only at break even and need not worry. In rare cases the impact of losing a single system can be less than its “slice of the pie,” normally because people are flexible and can work around the failed system – if instant messaging fails, for example, people simply switch to email until it is restored – but these cases are rare and normally isolated to a few systems out of many, with the majority of systems, say ERP, CRM and email, having disproportionately large impacts in the event of an outage.
So what we see here is that, under normal circumstances, moving ten services from ten servers to ten services on one server will generally lower our risk, not increase it – in direct contrast to the “eggs in a basket” theory. And this is purely from a hardware failure perspective. Consolidation offers several other important reliability factors that can have a significant impact on our case study.
With consolidation we reduce the amount of hardware that needs to be monitored and managed by the IT department. Fewer servers means that more time and attention can be paid to those that remain. More attention means a better chance of catching issues early and more opportunity to keep parts on hand. Better monitoring and maintenance leads to better reliability.
Possibly the most important factor with consolidation, however, is the significant cost savings, which, if approached correctly, can provide opportunities for improved reliability. With the dramatic reduction in the total cost of servers it can be tempting to keep budgets tight and simply pocket the savings. That is understandable, and for some businesses it may be the correct approach. But it is not the approach that I would recommend when struggling against the notion of eggs and baskets.
Instead, by applying a more moderate approach – keeping significant cost savings but still spending more, relatively speaking, on the single server – you can acquire a higher end (read: more reliable) server, use better parts, keep on-site spares, and so on. The cost savings of virtualization can often be turned directly into increased reliability, further shifting the equation in favor of the single server approach.
As I stated in another article, one brick house is more likely to survive a wind storm than either one or two straw houses. Having more of something doesn’t necessarily make it the more reliable choice.
These benefits come purely from the consolidation aspect of virtualization, not from the virtualization itself. Virtualization also provides its own risk mitigation features. System imaging and rapid restores, including restores to different hardware, are major advantages of almost any virtualization platform and can play an important role in a disaster recovery strategy.
Of course, all of these concepts are purely to demonstrate that single box virtualization and consolidation can beat the legacy “one app to one server” approach and still save money – showing that the example of eggs and baskets is misleading and does not apply in this scenario. There should be little trepidation in moving from a traditional environment directly to a virtualized one based on these factors.
It should be noted that virtualization can then extend the reliability of traditional commodity hardware by providing mainframe-like failover features above and beyond what non-virtualized platforms can offer, moving commodity hardware more firmly into line with the larger, more expensive RISC platforms. These features can bring an extreme level of protection but are often beyond what is appropriate for IT shops initially migrating from a non-failover, legacy hardware environment. High availability is a great feature but is often costly and very often unnecessary, especially as companies move, as we have seen, from relatively unreliable environments in the past to more reliable environments today. Given that we have already increased reliability over what was considered necessary in the past, there is a very good chance that an extreme jump in reliability is not needed now; but given the large drop in the cost of high availability, it is quite possible that it will be cost justified where previously it could not be.
In the same vein, virtualization is often feared because it is seen as a new, unproven technology. This is certainly untrue, though the impression persists in the small business and commodity server space. In reality, virtualization was first introduced by IBM in the 1960s and has been a mainstay of high end mainframe and RISC servers – the systems demanding the best reliability – ever since. In the commodity server space virtualization was a larger technical challenge and took a long time to become efficient enough for real-world use. But even there it has been available since the late 1990s, making it roughly fifteen years old today – far past the point of being a nascent technology; in the world of IT it is positively venerable. Commodity platform virtualization is a mature field with several highly respected, extremely advanced vendors and products. The use of virtualization as a standard for all or nearly all server applications is a long established and accepted “enterprise pattern,” one that can now easily be adopted by companies of any and every size.
Virtualization, perhaps counter-intuitively, is actually a very critical component of a reliability strategy. Instead of adding risk, virtualization can almost be approached as a risk mitigation platform – a toolkit for increasing the reliability of your computing platforms through many avenues.
I’ve read through this post a few times, and I am wondering if I’m missing an important part of your “10 servers to 1” scenario. In this example, are you assuming that these are the *only* 10 physical servers that exist, and then consolidating them down so there’s just 1 for the whole organization? Surely that can’t be the case.
Yes, that is exactly what I mean: taking ten single points of failure and reducing them to one single point of failure, assuming that the ten boxes are individual systems rather than clusters. Nearly all SMBs have traditionally run this way, with each system being a single point of failure.
Given that situation, reducing all ten to one is a no brainer. It saves a fortune in cost and makes the overall system far less risky. Having ten systems, any one of which is just as likely to fail as the others, is incredibly risky compared to having just one for a normal spread of workloads.
It depends on how the systems are used, but any amount of system interdependence means that integration and consolidation are hugely beneficial. If you are going to have outages of systems, you want them to overlap as much as possible rather than being spread out. It is about reducing the impact of the outage.
Some businesses can run well with, for example, email down. But more often than not, having email, AD or ERP systems offline means that the overall impact is far greater than the loss of 10% of the computing infrastructure would suggest.
Is the discussion framed to be only about hardware failures on the server(s) themselves? What happens when you have infrastructure or network failures? Let’s say ABC Corp has 2 geographic locations, each location has domain controllers and file servers with DFSR.
If there’s a major network outage in Site A, Site B is still running because they have their own on-premise servers. And if there’s a natural disaster in Site B (flood, hurricane, etc) then Site A doesn’t suffer because again, they had their own servers.
Then in that case you simply extrapolate the discussion appropriately. In the example we collapsed ten single points of failure into one single point of failure. In your example (two sites with clustered resources), let’s assume we still have ten systems, five at each site, with each pair forming a cluster (most SMBs do not have an infrastructure like this, but it is entirely possible – this could also be same-site clustering). You are then not looking at ten single points of failure but at five clustered points of failure, and we would collapse those five into a single clustered point of failure. Of course this requires two pieces of resultant hardware rather than one, but the improved reliability from consolidation remains. Collapsing multiple points of fragility into a single point of equal or lesser fragility results in lower cost and lower overall risk.
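As a rough sketch of why that collapse works (my own extension of the example; the one percent node failure chance and the independence of failures are illustrative assumptions only):

```python
# If a single node is down with probability p at any given moment, an
# active/passive two-node cluster is down only when both nodes are down
# at once: roughly p**2 when failures are independent.
p = 0.01                                     # assumed per-node downtime chance

cluster_down = p ** 2                        # one two-node cluster
five_clusters = 1 - (1 - cluster_down) ** 5  # at least one of five clusters down

print(five_clusters)  # ~0.0005: five separate clustered points of failure
print(cluster_down)   # ~0.0001: one consolidated clustered point of failure
```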