When No Redundancy Is More Reliable – The Myth of Redundancy

Risk is a difficult concept and it requires a lot of training, thought and analysis to properly assess given scenarios. Often, because risk assessments are so difficult, we substitute risk analysis with simply adding basic redundancy and assuming that we have appropriately mitigated risk. But very often this is not the case. The addition of redundancy often introduces complexity and new failure modes, and these new forms of failure have the potential to add more risk than the added redundancy removes. Storage systems are especially prone to this kind of decision making, which is unfortunate as few, if any, systems are as susceptible to failure or as important to protect.

RAID is a great example of where a lack of holistic risk thinking can lead to some strange decision making. If we look at a not uncommon scenario we will see how the goal of protecting against drive failure can actually lead to an increase in risk even when additional redundancy is applied. In this scenario we will consider a single array of twelve three terabyte SATA hard drives. It is not uncommon to hear of people choosing RAID 5 for this scenario to get “maximum capacity and performance” while having “adequate protection against failure.”

The idea here is that RAID 5 protects against the loss of a single drive, which can be replaced while the array rebuilds itself before a second drive fails. That is great in theory, but the real risks of an array of this size, thirty six terabytes of drive capacity, come not from multiple drive failures as people generally suspect but from an inability to reliably rebuild the array after a single drive failure, or from a failure of the array itself with no individual drive failing. The risk of a second drive failing is low; not non-existent, but quite low. Drives today are highly reliable. Once one drive fails it does increase the likelihood of a second drive failing, which is well documented, but I don’t want that risk to distract us from the true risk – the risk of a failed resilvering operation.

What scares us during a RAID 5 resilver operation is that an unrecoverable read error (URE) can occur. When it does the resilver operation halts and the array is left in a useless state – all data on the array is lost. Common SATA drives are rated for one URE per 10^14 bits read, or roughly once every twelve terabytes of reads. That means that a six terabyte array being resilvered has a roughly fifty percent chance of hitting a URE and failing. A fifty percent chance of failure is insanely high. Imagine if your car had a fifty percent chance of the wheels falling off every time that you drove it. So with a small (by today’s standards) six terabyte RAID 5 array using 10^14 URE SATA drives, if we were to lose a single drive, we have only about a fifty percent chance that the array will recover, and that assumes the drive is replaced immediately. That doesn’t include the risk of a second drive failing, only the risk of a URE failure. It also assumes that the drives are completely idle other than the resilver operation. If the drives are busily being used for other tasks at the same time then the chances of something bad happening, either a URE or a second drive failure, begin to increase dramatically.

With a twelve terabyte array the chances of complete data loss during a resilver operation begin to approach one hundred percent – meaning that RAID 5 provides effectively no protection at all in that case. There is always a chance of survival, but it is very low. At six terabytes you can compare a resilver operation to a game of Russian roulette with one bullet and six chambers where you have to pull the trigger three times. With twelve terabytes you have to pull it six times! Those are not good odds.

But we are not talking about a twelve terabyte array. We are talking about a thirty six terabyte array – which sounds large, but this is a size that someone could easily have at home today, let alone in a business. Every major server manufacturer, as well as nearly every low cost storage vendor, makes sub $10,000 storage systems in this capacity range today. Resilvering a RAID 5 array after a single drive failure at thirty six terabytes is like playing Russian roulette with one bullet, six chambers and pulling the trigger eighteen times! Your data doesn’t stand much of a chance. Add to that the incredible amount of time needed to resilver an array of that size and the risk of a second disk failing during that resilver window starts to become a rather significant threat. I’ve seen estimates of resilver times climbing into weeks or months on some systems. That is a long time to run without being able to lose another drive. When we are talking hours or days the risks are pretty low, but still present. When we are talking weeks or months of continuous abuse, as resilver operations are extremely drive intensive, the failure rates climb dramatically.
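
To put rough numbers on this, here is a quick back-of-the-envelope sketch in Python (my own illustration, not a formal model). It assumes the consumer SATA specification of one URE per 10^14 bits read and that a rebuild must read every remaining bit cleanly; the 33 TB case stands in for the eleven surviving 3 TB drives in the example above.

    # A rough sketch (not a formal model): probability that a RAID 5 resilver
    # hits at least one URE, assuming one error per 10^14 bits read and that the
    # rebuild must successfully read every bit on the surviving drives.
    URE_RATE = 1e-14  # errors per bit read (common consumer SATA specification)

    def resilver_failure_chance(terabytes_to_read):
        bits = terabytes_to_read * 1e12 * 8       # terabytes -> bits
        return 1 - (1 - URE_RATE) ** bits         # chance that at least one bit fails to read

    # 33 TB is what the eleven surviving 3 TB drives must supply in the example above
    for tb in (6, 12, 33):
        print(f"{tb:>3} TB re-read -> ~{resilver_failure_chance(tb):.0%} chance of a failed rebuild")

The exponential math comes out a little gentler than the linear rule of thumb used above, but the trend is the same: by the time tens of terabytes must be re-read, a clean rebuild is the exception rather than the rule.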

With an array of this size we can effectively assume that the loss of a single drive means the loss of the complete array, leaving us with no drive failure protection at all. Now if we look at an array of the same or better performance and capacity under RAID 0, which also has no protection against drive loss, we need only eleven of the drives that we needed twelve of for our RAID 5 array. What this means is that instead of twelve hard drives, each of which has a roughly three percent chance of annual failure, we have only eleven. That alone makes our RAID 0 array more reliable, as there are fewer drives to fail. Not only do we have fewer drives, but there is no need to write parity blocks or skip over them when reading back, lowering, ever so slightly, the mechanical wear and tear on the RAID 0 array for the same utilization and giving it a very slight additional reliability edge. The RAID 0 array of eleven drives will be identical in capacity to the twelve drive RAID 5 array but will have slightly better throughput and latency. A win all around, plus the cost savings of not needing an additional drive.
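
As a quick illustration of the “fewer drives to fail” point, here is a sketch using the roughly three percent annual failure rate assumed above and treating drive failures as independent:

    # Sketch using the ~3% annual drive failure rate assumed above,
    # treating drive failures as independent.
    ANNUAL_DRIVE_FAILURE = 0.03

    def chance_any_drive_fails(drive_count):
        return 1 - (1 - ANNUAL_DRIVE_FAILURE) ** drive_count

    print(f"12-drive RAID 5: ~{chance_any_drive_fails(12):.1%} chance of a drive failure per year")
    print(f"11-drive RAID 0: ~{chance_any_drive_fails(11):.1%} chance of a drive failure per year")

The gap is small, but it runs in RAID 0’s favor, and since a single drive loss is being treated as fatal for both arrays at this capacity, the simpler array comes out slightly ahead.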

So what we see here is that in large arrays (large in capacity, not in spindle count) RAID 0 actually surpasses RAID 5 in certain scenarios. When using common SATA drives this happens at capacities experienced even by power users at home and by many small businesses. If we move to enterprise SATA drives or SAS drives then the capacity at which this occurs becomes very high and is not a concern today, but it will be in just a few years when drive capacities get larger still. This highlights how dangerous RAID 5 is at the sizes that we see today. Everyone understands the incredible risks of RAID 0, but it can be difficult to put into perspective that RAID 5’s issues are so extreme that it might actually be less reliable than RAID 0.

That RAID 5 might be less reliable than RAID 0 in an array of this size based on resilver operations alone is just the beginning. In a massive array like this the resilver time can take so long and exact such a toll on the drives that a second drive failure starts to become a measurable risk as well. And then there are additional risks caused by array controller errors that can use the resilver algorithms to destroy an entire array even when no drive has failed. As RAID 0 (and mirror-based RAID 1 or RAID 10) does not rely on parity resilver algorithms, it does not suffer this additional risk. These are hard risks to quantify, but what is important is that they are additional risks that accumulate when using a more complex system where a simpler system, without the redundancy, was more reliable from the outset.

Now that we have established that RAID 5 can be less reliable than RAID 0, I will point out the obvious dangers of RAID 0. RAID in general is used to mitigate the risk of a single, lone hard drive failing. We all fear a single drive simply failing and all data being lost. RAID 0, being a large stripe of drives without any form of redundancy, takes the risk of data loss from a single drive failing and multiplies it across a number of drives, where any one drive failing causes the total loss of data on all of them. So in our eleven disk example above, if any of the eleven disks fails, all is lost. It is easy to see that this is dramatically more dangerous than just using a single drive, all alone.

What I am trying to point out here is that redundancy does not mean reliability. The fact that something is redundant, like RAID 5, provides no guarantee that it will always be more reliable than something that is not redundant.

My favourite analogy here is to look at houses in a tornado. In one scenario we build a house of brick and mortar. In the second scenario we build two redundant houses, each out of straw (our builders are pigs, apparently.) When the tornado (or big bad wolf) comes along, which is more likely to leave us with a standing house? Clearly one brick and mortar house has some significant reliability advantages over redundant straw houses. Redundancy didn’t matter; reliability mattered in the end.

Redundancy is often misleading because it is easy to quantify but hard to qualify.  Redundancy is a black or white question: Is it redundant?  Yes or no.  Simple.  Reliability is not so simple.  Reliability is about failure rates and likelihoods.  It is about statistics and analysis.  As it is hard to quantify reliability in a meaningful way, especially when selling a project to the business people, redundancy often becomes a simple substitute for this complex concept.

The concept of using redundancy to misdirect questions of reliability also ends up applying to subsystems in very convoluted ways. Instead of making a “system” redundant it has become common to make a highly reliable, and low cost, subsystem redundant and treat subsystem redundancy as applying to the whole system. The most common example of this is RAID controllers in SAN products. Rather than having a redundant SAN (meaning two SANs), manufacturers will often take that one component, which is not normally redundant even in ordinary servers, make it redundant and then call the SAN itself redundant – meaning a SAN that contains redundancy, which is not at all the same thing.

A good analogy here would be to compare having redundant cars, meaning two complete, working cars, with having a single car and a spare water pump in the trunk in case the main one fails. Clearly, a spare water pump is not a bad thing. But it is also a trivial amount of protection against car failure compared to having a second car ready to go. In one case the entire system is redundant, including the chassis. In the other we are making just one, highly reliable component redundant inside the chassis. It’s not even on par with having a spare tire which, at least, is a car component with a higher likelihood of failure.

Just like the myths of RAID 5 reliability and of system versus subsystem redundancy, shared storage technologies like SAN and NAS often get treated the same way, especially in regards to virtualization. There is a common scenario where a virtualization project is undertaken and people instinctively panic because a single virtualization host represents a single point of failure where, if it fails, many systems will all fail at once.

Using the term “single point of failure” causes a feeling of panic and is a great means of steering a conversation. But a SPOF, as we like to call it, while something we like to remove when possible, may not be the end of the world. Think about our brick house. It is a SPOF. Our two houses of straw are not. Yet a single breeze takes out our redundant solution faster than our reliable SPOF. Looking for SPOFs is a great way to find points of fragility in a system, but do not feel that every SPOF must be made redundant in every scenario. Most businesses will find their best value having many SPOFs in place. Our real goal is reliability at an appropriate cost; redundancy, as we have seen, is no substitute for reliability, it is simply a tool that we can use to achieve it.

The theory that many people follow when virtualizing is that they take their virtualization host and say “This host is a SPOF, so I need to have two of them and use High Availability features to allow for transparent failover!” This is spurred by the leading virtualization vendor making their money firstly by selling expensive HA add-on products and secondly by being owned by a large storage vendor – so selling unnecessary or even dangerous additional shared storage is a big monetary win for them and could easily be the reason that they have championed the virtualization space from the beginning. Redundant virtualization hosts with shared storage sound great but can be extremely misguided for several reasons.

The first reason is that the initial SPOF, the virtualization host, is simply replaced with a new SPOF, the shared storage. This accomplishes nothing. Assuming that we are using comparable quality servers and shared storage, all we’ve done is move where the risk is, not change how big it is. The likelihood of the storage system failing is roughly equal to the likelihood of the original server failing. But in addition to shuffling the SPOF around like in a shell game we’ve also done something far, far worse – we have introduced chained, or cascading, failure dependencies.

In our original scenario we had a single server. If the server stayed working we were good; if it failed we were not. Simple. Now we have two virtualization hosts, a single storage server (SAN, NAS, whatever) and a network connecting them together. We have already determined that the risk of the shared storage failing is approximately equal to our total system risk in the original scenario. But now we have the additional dependencies of the network and the two front end virtualization nodes. Each of these components is more reliable than the fragile shared storage (anything with mechanical drives is going to be fragile), but that they are lower risk is not the issue; the issue is that the risks are combinatorial.

If any of these three components (storage, network or the front end nodes) fails then everything fails. The solution to this is to make the shared storage redundant on its own and to make the network redundant on its own. With enough work we can overcome the fragility and risk that we introduced by adding shared storage, but the shared storage on its own is not a form of risk mitigation; it is a risk itself which must be mitigated. The spiral of complexity begins and the cost associated with bringing this new system up to par with the reliability of the original, single server system can be astronomic.
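
To see why the chained dependencies matter, consider a sketch with purely hypothetical yearly failure probabilities (the exact figures are illustrative, not measurements):

    # Sketch: everything in the chain must be up for the workload to be up.
    # The yearly failure probabilities below are hypothetical, for illustration only.
    def chance_of_outage(component_failure_chances):
        p_all_up = 1.0
        for p_fail in component_failure_chances:
            p_all_up *= (1 - p_fail)
        return 1 - p_all_up

    single_server = chance_of_outage([0.05])            # one server, ~5% per year
    chained = chance_of_outage([0.05,                   # shared storage (as risky as the old server)
                                0.02,                   # storage network
                                0.01])                  # redundant host layer (small, but not zero)
    print(f"Single server:              ~{single_server:.1%} chance of an outage per year")
    print(f"Hosts + network + storage:  ~{chained:.1%} chance of an outage per year")

Even though the network and the redundant host layer are individually more reliable than the storage, stacking them in series pushes the overall outage risk well above that of the original single server.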

Now that we have all of this redundancy we have one more risk to worry about.  Managing all of this redundancy, all of these moving parts, requires a lot more knowledge, skill and preparation than does managing a simple, single server.  We have moved from a simple solution to a very complex one.  In my own anecdotal experience the real dangers of solutions like this come not from the hardware failing but from human error.  Not only has little been done to avoid human error causing this new system to fail but we’ve added countless points where a human might accidentally bring the entire system, redundancy and all, right down.  I’ve seen it first hand; I’ve heard the horror stories.  The more complex the system the more likely a human is going to accidentally break everything.

It is critical that, as IT professionals, we step back and look at complete systems, consider reliability and risk, and think of redundancy simply as a tool to use in the pursuit of reliability. Redundancy itself is not a panacea. Neither is simplicity. Reliability is a complex problem to tackle. Avoiding simplistic replacements is an important first step in moving from covering up reliability issues to facing and solving them.

 

Choosing an Open Storage Operating System

It is becoming increasingly common to forgo traditional, proprietary storage devices, both NAS and SAN, and instead use off the shelf hardware with a storage operating system installed on it for what many call “do it yourself” storage servers. This, of course, is a misnomer, since no one calls a normal file server “do it yourself” just because you installed Windows yourself. Storage has a lot of myth and legend swirling around it and people often panic when they think of installing Windows and calling it a NAS rather than calling it a file server. So, if it makes you feel better, use terms like file server or storage server rather than NAS and SAN – problem solved. This is a part of the “open storage” movement – moving storage systems from proprietary to standard.

Choosing the right operating system for a storage server is important and not always easy. I work extensively in this space and people often ask me what I recommend; the recommendations vary based on the scenario and can often seem confusing. But the factors are actually relatively easy if you just know the limitations that create the choices and paths in the decision tree.

Before choosing an OS we must stop and consider what our needs are going to be. Some areas that need to be considered are capacity, performance, ease of administration, budget, connection technology and clustering. There are two main categories of systems that we will consider as well: standard operating systems and storage appliance operating systems. The standard operating systems are Windows, Linux, Solaris and FreeBSD. The storage appliance operating systems are FreeNAS, OpenFiler and NexentaStor. There are others in both categories but these are the main players currently.

The first decision to be made is whether or not you or your organization is comfortable supporting a normal operating system in a storage server role. If you are looking at NAS then simply ask yourself if you could administer a file server. Administering a block storage server (SAN) is a little more complex or, at least, unusual, so this might induce a small amount of concern, but it is really in line with other administration tasks. If the answer is yes, that using normal operating system tools and interfaces is acceptable to you, then simply rule out the “appliance” category right away. The appliance approach adds complexity and slows development and support cycles, so unless it is necessary it is undesirable.

Storage appliance operating systems exist only to provide a pre-packaged, “easy to use” view into running a storage server. In concept this is nice, but there are real problems with this method. The biggest problems come from the packaging process, which pulls you a step away from the enterprise OS vendors themselves, making your system more fragile, further behind in updates and features and less secure than the traditional OS counterparts. It also leaves you at the mercy of a very small company for OEM-level support when something goes wrong rather than with a large enterprise vendor with a massive user base and community. The appliancization process also strips features and options from the systems by necessity. In the end, you lose.

Appliances are nice because you get a convenient web interface from which “anyone” can administer your storage. At least in theory. But in reality there are two concerns. The first is that there is always a need to drop into the operating system itself and fix things every once in a while. The custom web interface of the appliance makes this dramatically harder than normal, so at the time when you most need the appliance nature of the system it is not there for you. The second is that making something as critical as storage available for “anyone” to work on is a terrifying thought. There are few pieces of your infrastructure where you want more experience, planning and care taken than in storage. Making the system harder to use is not always a bad thing.

If you are in need of an appliance system then primarily you are looking at FreeNAS and OpenFiler. NexentaStor offers a compelling product but it is not available in a free version and the cost can be onerous. The freely downloadable version appears to be free for the first 18TB of raw storage, but the license states otherwise, which makes it rarely the popular choice. (The cost of NexentaStor is high enough that purchasing a fully supported Solaris system would be less costly and would provide full support from the original vendor rather than from Nexenta, which is essentially repackaging old versions of Solaris and ZFS. More modern code and updates are available less expensively from the original source.)

FreeNAS, outside of clustering, is the storage platform of choice in an appliancized package. It has the much touted ZFS filesystem, which gives it flexibility and ease of use lacking in OpenFiler and other Linux-based alternatives. It also has a working iSCSI implementation, so you can use FreeNAS safely as either a NAS or a SAN. Support for FreeNAS appears to be increasing, with new developments being made regularly and features being retained. FreeNAS offers a large range of features and supported protocols. It is believed that clustering will be coming to FreeNAS in the future as well, as this has recently been added to the underlying FreeBSD operating system. If so, FreeNAS will completely eliminate the need for OpenFiler in the marketplace. FreeNAS is completely free.

OpenFiler lacks a reliable iSCSI SAN implementation (unless you pay a fortune to have that part of the system replaced with a working component) and is far more out of date than its competitors, but it does offer full block-level real-time replication, allowing it to operate in a clustered mode for reliability. The issue here is that the handy web interface of the NAS appliance does not address this scenario, and if you want to do this you will need to get your hands dirty on the command line, very dirty indeed. This is expert level work, and anyone capable of even considering a project to make OpenFiler into a reliable cluster will be just as comfortable, and likely far more comfortable, building the entire cluster from scratch on their Linux distribution of choice. OpenFiler is built on the rather unpopular, and now completely discontinued, rPath Linux using the Conary packaging system, both of which are niche players, to say the least, in the Linux world. You’ll find little rPath support from other administrators, and many packages and features that you may wish to have access to are unavailable. OpenFiler’s singular advantage of any significance is the availability of DRBD for clustering which, as stated above, is nonsensical. Support for OpenFiler appears to be waning, with new features being non-existent and, in fact, key features like AFP support having been dropped rather than new features being added. OpenFiler is free, but key features, like reliable iSCSI, are not. Recent reports from OpenFiler users are that even non-iSCSI storage has become unstable in the latest release and losing data is a regular occurrence. OpenFiler remains very popular in the mindshare of this industry segment but should be avoided completely.

If you do not need to have your storage operating system appliancized then you are left with more and better choices, but a far more complex decision tree.    Unlike the appliance OS market which is filled with potholes (NexentaStor has surprise costs, OpenFiler appears to support iSCSI but causes data loss, features get removed from new versions) all four operating systems mentioned here are extremely robust and feature rich.  Three of them have OEM vendor support which can be a major deciding factor and all have great third party support options far broader than what is available for the appliance market.

The first decision is whether or not Windows only features, notably NTFS ACLs, are needed.  It is common for new NAS users to be surprised when the SMB protocol does not provide all of the granular filesystem control that they are used to in Windows.  This is because those controls are actually handled by the filesystem, not the network protocol, and Windows alone provides these via NTFS.  So if that granular Windows file control is needed, Windows is your only option.

The other three entrants, Linux, Solaris and FreeBSD, all share basic capabilities with the notable exception of clustering.  All have good software RAID, all have powerful and robust filesystems, all have powerful logical volume management and all provide a variety of NAS and SAN connection options.  Many versions of Linux and FreeBSD are available completely freely.  Solaris, while free for testing, is not available for free for production use.

The biggest differentiator between these three OS options is clustering. Linux has had DRBD for a long time now, a robust block-level replication technology used for storage clustering. FreeBSD has recently (as of 9.0) added HAST to serve the same purpose. So, in theory, FreeBSD has the same clustering options as Linux, but HAST is much newer and much less well known. Solaris lacks storage clustering in the base OS and requires commercial add-ons to handle this at this time.

Solaris and FreeBSD share the powerful and battle tested ZFS filesystem. ZFS is extremely powerful and flexible and has long been the key selling point of these platforms. Linux’s support for filesystems is more convoluted. Nearly any Linux distribution (we care principally about RHEL/CentOS, Oracle Unbreakable Linux, Suse/OpenSuse and Ubuntu here) supports EXT4, which is powerful and fast but lacks some of the really nice ZFS features. However, Linux is rapidly adopting BtrFS, which is very competitive with ZFS but is nascent and currently only available in the Suse and Oracle Linux distros. We expect to see it from the others soon for production use, but at this time it is still experimental.

Outside of clustering, likely the choice of OS of these three will come down primarily to experience and comfort.  Solaris is generally known for providing the best throughput and FreeBSD the worst.  But all three are quite close.  Once BtrFS is widely available and stable on Linux, Linux will likely become the de facto choice as it has been in the past.

Without external influence, my recommendation for a storage platform is FreeBSD and then Linux, with Solaris eliminated on the basis that rarely is anyone looking for commercial support and so it is ruled out automatically. This is based almost entirely on the availability of copy-on-write filesystems and assumes no clustering, which is the common case. If clustering is needed then it is Linux first, then FreeBSD, with Solaris ruled out again.
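
The decision tree described in this section can be summed up in a few lines. The sketch below is only an illustration of that logic as laid out here; the function name and flags are hypothetical, not a real tool:

    # Sketch of the storage OS decision tree described above.
    def recommend_storage_os(need_ntfs_acls, want_appliance_gui, need_clustering,
                             want_commercial_os_support):
        if need_ntfs_acls:
            return ["Windows"]                          # granular Windows ACLs require NTFS
        if want_appliance_gui:
            return ["FreeNAS"]                          # appliance route; OpenFiler best avoided
        if need_clustering:
            order = ["Linux (DRBD)", "FreeBSD (HAST)"]  # mature clustering first
        else:
            order = ["FreeBSD (ZFS)", "Linux"]          # copy-on-write filesystem first
        if want_commercial_os_support:
            order.append("Solaris")                     # only in the picture with paid support
        return order

    print(recommend_storage_os(False, False, False, False))  # ['FreeBSD (ZFS)', 'Linux']

Real deployments will of course weigh more factors than four booleans, but this captures the hard delimiters discussed above.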

Linux and FreeBSD are rapidly approaching each other in functionality.  As BtrFS matures on Linux and HAST matures on FreeBSD they seem to be meeting in the middle with the choice being little more than a toss up.

There is no single, simple answer. Choosing a storage OS is all about balancing myriad factors: performance, resources, features, support, stability and more. There are a few factors that can be used to rule out many contenders and knowing these hard delimiters is key. Knowing exactly how you plan to use the system and which factors are important to you is essential in weeding through the available options.

Even once you pick a platform there are many decisions to make. Some platforms include multiple filesystems. There is SAN and NAS. There are multiple SAN and NAS protocols. There is network bonding (or teaming, in the Windows world). There is multipathing. There are snapshots, volumes and RAID. The list goes on and on.

 

The True Cost of Printing

Of all of the things that are handled by your technology support department, printing is likely the one that you think about the least.  Printing isn’t fancy or exciting or a competitive advantage.  It is a lingering item from an age without portable reading devices, from an era before monitors.  Printers are going to be around for a long time to come, I do not wish to imply that they are not, but there is a lot to be considered when it comes to printers and much of that consideration can be easily overlooked.

When considering the cost of printing we often calculate the cost of the printer itself along with the consumables: paper and ink. These things alone rack up a pretty serious per-page cost for an average business. Planning for an appropriate lifespan and duty cycle for a printer is critical to keeping printing cost effective. And do not forget the cost of parts replacement as well as stockpiled ink and paper. These may seem minor, but printers often cause an investment in inventory that is never recovered. When the printer dies, supplies for that printer are often useless.
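
As a rough illustration of how those pieces turn into a per-page figure, here is a sketch with entirely hypothetical numbers:

    # Sketch of a per-page cost from purchase price and consumables.
    # Every figure here is hypothetical and varies widely by printer and usage.
    printer_price = 200.00           # dollars
    expected_lifetime_pages = 10_000
    ink_cost_per_page = 0.05         # dollars
    paper_cost_per_page = 0.01       # dollars

    per_page = (printer_price / expected_lifetime_pages
                + ink_cost_per_page + paper_cost_per_page)
    print(f"Approximate cost per page: ${per_page:.3f}")  # ~$0.080 before any labor

Labor, as discussed next, is what pushes the real number well past this.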

The big, hidden cost of printing is none of these things. The big cost is in supporting the printers, both upfront with the initial deployment and even more so in continuing support. This is especially true in a smaller shop where the trend is to use many small printers rather than fewer large ones. The cost of deploying and supporting a five thousand dollar central office printer is no higher than, and possibly lower than, the cost of deploying a two hundred dollar desktop inkjet. The bigger the printer, the better the driver quality and vendor support that can usually be expected, making normal support tasks easier and more reliable.

At a minimum, rolling out a new desktop printer is going to take half an hour. Realistically it is far more likely to take closer to an hour. Go ahead, count up the time: time to deliver the printer to the station, time to unpack the printer, time to physically set it up, time to plug it in, time to install drivers and software, time to configure it and time to print a test page. If it were a one time race, you could probably do these steps pretty quickly. But printer support is not a production line and rarely, if ever, do you have someone performing these exact steps in a rapidly repeatable manner. More likely installing a printer is a “one off” activity that requires learning the new printer, tracking down the current driver and troubleshooting potential issues.

An hour to deploy a two hundred dollar printer could add fifty percent to the cost of the printer quite easily.  There are a lot of factors that can cause this number to skyrocket from a long travel distance between receiving location and the desk to missing cables to incompatible drivers.  Any given printer could take the better part of a day to deploy when things go wrong.  We are not even considering “disruption time” – that time in which the person receiving the printer is unable to work since someone is setting up a printer at their workstation.
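
A quick sketch of that arithmetic, using a hypothetical fully loaded labor rate (both the rate and the hours are illustrative):

    # Sketch: how deployment labor inflates the real cost of a desktop printer.
    # The labor rate and hours are hypothetical figures, for illustration only.
    PRINTER_PRICE = 200.00   # dollars
    LABOR_RATE = 100.00      # dollars per fully loaded technician hour

    for hours in (0.5, 1, 4):  # best case, typical, "things went wrong"
        labor = hours * LABOR_RATE
        print(f"{hours:>3} h of labor adds ${labor:,.0f} "
              f"({labor / PRINTER_PRICE:.0%} of the printer's purchase price)")

At a hundred dollars an hour, the typical one hour install alone adds fifty percent to the purchase price, and a bad day adds a multiple of it.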

Now that the printer has been set up and is, presumably, working just fine, we need to consider the ongoing cost of printer support. It is not uncommon for a printer to sit, undisturbed, for years chugging along just fine. But printers have a surprisingly high breakage rate caused by the nature of ink, the nature of paper, a propensity for printers to be reassigned to different physical locations, and the machines to which they are attached being changed or updated, introducing driver breakage. Add these things together and the ongoing support cost of a printer can be surprisingly high.

I recently witnessed the support of a company with a handful of high profile printers.  In a run of documentation, physical cabling and driver issues the printers were averaging between four and eight hours of technician time, per printer, to set up correctly.  Calculate out the per hour cost for that support and those printers, likely already costly, just became outrageously expensive.

I regularly hear of shops that decide to re-purpose printers and spend many times the cost of the printers in labor hours as older printers are massaged into working with newer computer setups or vice versa. Driver incompatibility or unavailability is far more common than people realize.

Printers have the additional complication of being used in many different modes: directly attached to a workstation, directly attached and shared, directly attached to a print server, directly attached to the network or attached to a print server over the network. While this complexity hardly creates roadblocks, it does significantly slow work done on printers in a majority of businesses.

Printers, by their nature, are very difficult to support remotely.  Getting a print driver installed remotely is easy.  Knowing that something has printed successfully is something completely different.  Considering that printer support should be one of the lower cost support tasks this need for physical on-site presence for nearly every printer support task dramatically increases the cost of support if only because it increases the time to perform a task and receive appropriate feedback.

When we take these costs and combine them with the volume of printing normally performed by a printer we can start to acquire a picture of what printing is really costing. The value of centralized printing suddenly takes on a new level of significance when seen through the eyes of support rather than through the eyes of purchasing. Even beyond centralizing printing when possible, it is important to eliminate unnecessary printing.

Good planning, strategic purchasing and a holistic approach can mitigate the potential for surprise costs in printing.

 

Just Because You Can…

I see this concept appear in discussions surrounding virtualization all of the time. This is a broader, more general concept, but virtualization is the “hot, new technology” facing many IT organizations and seems to be the space where we currently see the “just because you can, doesn’t mean you should” problems rearing their ugly heads most prominently. As with everything in IT, it is critical that all technical decisions be put into a business context so that we understand why we choose to do what we do and do not blindly make our decisions based on popular deployment methodologies or, worse, myths.

Virtualization itself, I should point out, is something that I feel should be a default decision today for those working in the x64 computing space, with systems being deployed sans virtualization only when a clear and obvious necessity exists such as specific hardware needs, latency sensitive applications, etc. Barring any specific need, virtualization is free to implement from many vendors and offers many benefits both today and in future-proofing the environment.

That being said, what I often see today is companies deploying virtualization not as a best practice but as a panacea for all perceived IT problems. This it certainly is not. Virtualization is a very important tool to have in the IT toolbox and one that we will reach for very often, but it does not solve every problem and should be treated like every other tool that we possess and used only when appropriate.

I see several things recurring when virtualization discussions come up. Many companies today are moving towards virtualization not because they have identified a business need but because it is the currently trending topic and people feel that if they do not implement virtualization they will somehow be left behind or miss out on some mythical functionality. This is generally good in that it is increasing virtualization adoption, but it is bad because good IT and business decision making processes are being bypassed. What often happens is that in the wave of virtualization hype IT departments feel that not only do they have to implement virtualization itself but that they must do so in ways that may not be appropriate for their business.

There are four things that I often see tied to virtualization, often accepted as virtualization requirements, whether or not they make sense in a given business environment.  These are server consolidation, blade servers, SAN storage and high availability or live failover.

Consolidation is so often vaunted as the benefit of virtualization that I think most IT departments forget that there are other important reasons for implementing it. Clearly consolidation is a great benefit for nearly all deployments (mileage may vary, of course) and is nearly always achievable simply through better utilization of existing resources. It is a pretty rare company running more than a single physical server that cannot shave some amount of cost through limited consolidation, and it is not uncommon to see datacenter footprints decimated in larger organizations.

In extreme cases, though, it is not necessary to abandon virtualization projects just because consolidation proves to be out of the question.  These cases exist for companies with high utilization systems and little budget for a preemptive consolidation investment.  But these shops can still virtualize “in place” systems on a one to one basis to gain other benefits of virtualization today and look to consolidate when hardware needs to be replaced tomorrow or when larger, more powerful servers become more cost effective in the future.  It is important to not rule out virtualization just because its most heralded benefit may not apply at the current time in your environment.

Blade servers are often seen as the obvious choice for virtualization environments. Blades may play better in a standard virtualization environment than they do with more traditional computational workloads, but this is both highly disputable and not necessarily relevant. Being a good scenario for blades themselves does not make it a good scenario for a business. Just because blades perform better than normal when used in this way does not imply that they perform better than traditional servers – only that they have potentially closed the gap.

Blades need to be evaluated using the same harsh criteria when virtualizing as when not and, very often, they will continue to fail to provide the long term business value needed to choose them over the more flexible alternatives. Blades remain far from a necessity for virtualization and are often, in my opinion, a very poor choice indeed.

One of the most common misconceptions is that by moving to virtualization one must also move to shared storage such as a SAN. This mindset is the obvious reaction to the desire to also achieve other benefits of virtualization which, if they don’t strictly require a SAN, benefit greatly from one. The ability to load balance or fail over between systems is heavily facilitated by having a shared storage backend. It is a myth that this is a hard requirement, but replicated local storage brings its own complexities and limitations.

But shared storage is far from a necessity of virtualization itself and, like everything, needs to be evaluated on its own.  If virtualization makes sense for your environment but you need no features that require SAN, then virtualize without shared storage.  There are many cases where local storage backed virtualization is an ideal deployment scenario.  There is no need to dismiss this approach without first giving it serious consideration.

The last major assumed necessary feature of virtualization is system level high availability or instant failover for your operating system.  Without a doubt, high availability at the system layer is a phenomenal benefit that virtualization brings us.  However, few companies needed high availability at this level prior to implementing virtualization and the price tag of the necessary infrastructure and software to do it with virtualization is often so high as to make it too expensive to justify.

High availability systems are complex and often overkill.  It is a very rare business system that requires transparent failover for even the most critical systems and those companies with that requirement would almost certainly already have failover processes in place.  I see companies moving towards high availability all of the time when looking at virtualization simply because a vendor saw an opportunity to dramatically oversell the original requirements.  The cost of high availability is seldom justified by the potential loss of revenue from the associated reduction in downtime.  With non-highly available virtualization, downtime for a failed hardware device might be measured in minutes if backups are handled well.  This means that high availability has to justify its cost in potentially eliminating just a few minutes of unplanned downtime per year minus any additional risks assumed by the added system complexity.  Even in the biggest organizations this is seldom justified on any large scale and in a more moderately sized company it is uncommon altogether.  But today we find many small businesses implementing high availability systems at extreme cost on systems that could easily suffer multi-day outages with minimal financial loss simply because the marketing literature promoted the concept.
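
A simple back-of-the-envelope test of that justification, using entirely hypothetical figures, looks like this:

    # Sketch: weigh HA cost against the downtime it could plausibly remove.
    # All numbers are hypothetical, for illustration only.
    ha_annual_cost = 20_000.00         # extra licensing, shared storage, support per year
    downtime_cost_per_hour = 2_000.00  # revenue and productivity lost per hour of outage
    avoidable_downtime_hours = 2.0     # unplanned hours per year HA might realistically prevent

    avoided_loss = downtime_cost_per_hour * avoidable_downtime_hours
    print(f"Downtime cost avoided:  ${avoided_loss:,.0f} per year")
    print(f"HA infrastructure cost: ${ha_annual_cost:,.0f} per year")
    print("Worth considering" if avoided_loss > ha_annual_cost else "Hard to justify on downtime alone")

With numbers like these the HA investment fails its own justification; only when the avoidable downtime is genuinely expensive does the math begin to turn around.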

Like anything, virtualization and all of the associated possibilities that it brings to the table need to be evaluated individually in the context of the organization considering them.  If the individual feature does not make sense for your business do not assume that you have to purchase or implement that feature.  Many organizations virtualize but use only a few, if any, of these “assumed” features.  Don’t look at virtualization as a black box, look at the parts and consider them like you would consider any other technology project.

What often happens is a snowball effect where one feature, likely high availability, is assumed to be necessary without the proper business assessment being performed. Then a shared storage system, often assumed to be required for high availability, is added as another assumed cost. Even if the high availability features are never purchased, the decision to use a SAN might already have been made and never revisited after the plan changes. It is very common, in my experience, to find projects of this nature with sometimes more than fifty percent of the total expenditure going to products that the purchaser is unable to even describe the reason for having purchased.

This concept does not stop at virtualization.  Extend it to everything that you do.  Keep IT in perspective of the business and don’t assume that going with one technology automatically assumes that you must adopt other technologies that are popularly associated with it.
