Category Archives: Risk

When to Consider High Availability?

“High Availability isn’t something you buy, it’s something that you do.”  – John Nicholson

Few things are more universally desired in IT than High Availability (HA) solutions.  I mean really, say those words and any IT Pro will instantly say that they want that.  HA for their servers, their apps, their storage and, of course, even their desktops.  If there was a checkbox next to any system that simply said “HA”, why wouldn’t we check it?  We would, of course.  No one voluntarily wants a system that fails a lot.  Failure bad, success good.

First, though, we must define HA.  HA can mean many things, but at a minimum it means that the availability of the system in question is higher than “normal.”  What is normal?  That alone is hard enough to define, and HA is a loose term at best.  In the context of its most common usage, though, which is common applications running on normal enterprise hardware, I would offer this starting point for HA discussions:

Normal or Standard Availability (SA) would be defined as the availability of a common mainline server running a common enterprise operating system and a common enterprise application in a best practices environment with enterprise support.  Good examples of this might include Exchange on Windows Server on the HP ProLiant DL380 (the most common mainline commodity server), or BIND (the DNS server) on Red Hat Enterprise Linux on the Dell PowerEdge R730.  These are just examples used to establish a rough baseline.  There is no great way to measure this, but with a good support contract and rapid repair or replacement in the real world, a system of this nature is believed to achieve between four and five nines of reliability (99.99% uptime or higher) when human failure is not included.

High Availability (HA) should be commonly defined as having an availability significantly higher than that of Standard Availability, with “significantly higher” meaning a minimum of one order of magnitude of improvement.  So at least five nines of reliability, and more likely six nines (99.9999% uptime).

Low Availability (LA) would be commonly defined as having an availability significantly lower than that of Standard Availability, with “significantly” again meaning at least one order of magnitude.  So LA would typically be assumed to be around 99% to 99.9% availability or lower.
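
To make these bands concrete, it helps to convert an availability percentage into expected downtime.  The little sketch below (Python, with purely illustrative figures) simply computes the unavailable fraction of a five year period:

# Rough conversion from an availability percentage to expected downtime.
# The figures are illustrative; real-world availability is very hard to measure.

MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes(availability_pct: float, years: float = 1.0) -> float:
    """Expected downtime, in minutes, over the given number of years."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * MINUTES_PER_YEAR * years

for label, pct in [("LA (99.9%)", 99.9),
                   ("SA (99.99%)", 99.99),
                   ("HA (99.9999%)", 99.9999)]:
    print(f"{label}: ~{downtime_minutes(pct, years=5):,.0f} minutes over five years")

At four nines this works out to a bit over four hours of downtime across five years, which lines up with the rough “five hours every five years” baseline used later in this article.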

Measurement here is very difficult, as human, environmental and other factors play a massive role in determining the uptime of different configurations.  The same gear used in one role might achieve five nines while in another it fails to achieve even one.  The quality of the datacenter, the skill of the support staff, the rapidity of parts replacement, the granularity of monitoring and a multitude of other factors will affect overall reliability significantly.  This is not necessarily a problem for us, however.  In most cases we can evaluate the portions of a system design that we control in such a way that relative reliability can be determined.  At the very least we can show that one approach is going to be superior to another, which allows well informed decision making even when accurate failure rate models cannot easily be built.

It is important to note that, other than providing a sample baseline from which to work, nothing in the definitions of high availability or low availability talks about how these levels should be achieved – that is not what the terms mean.  The terms describe resultant levels of reliability in relation to the baseline and nothing else.  There are many ways to achieve high availability without using the commonly assumed approaches, and practically unlimited ways to achieve low availability.

Of course, HA can be defined at every layer.  We can have an HA platform or OS but fragile applications on top.  So it is very important to understand at what level we are speaking at any given time.  At the end of the day, a business only cares about the highly available delivery of services, regardless of how or where that is achieved.  The end result is what matters, not the “under the hood” details of how it was accomplished; as always, the ends justify the means.

It is extremely common today for IT departments to become distracted by new and flashy HA tools at the platform layer and forget to look for HA higher and lower in the stack.  Unless the whole stack is considered we may provide “HA” at one layer while leaving the business services just as vulnerable as ever, or more so.

In the real world, though, HA is not always an option and, when it is, it comes at a cost.  That cost is almost always monetary and generally comes with extra complexity as well.  And as we well know, any complexity carries additional risk, and that risk could, if we are not careful, cause an attempt at HA to fail and might even leave us with LA.

Once we understand this necessary language for describing what we mean, we can begin to talk about when high availability, standard availability and even low availability may be right for us.  We use these broad categories because system reliability is so difficult to measure that anything more detailed becomes valueless.

Conceptually, all systems come with a risk of downtime; nothing can be up all of the time, that is impossible.  Reliability costs money, generally, all other things being equal.  So to determine what level of availability is most appropriate for a workload we must determine the cost of risk mitigation (the amount of money that it takes to change the average amount of downtime) and compare that against the cost of the downtime itself.

This gets tricky and complicated because determining the cost of downtime is difficult enough, and determining the risk of downtime is even more difficult.  In some cases the cost of downtime is a flat number, expressed as something like $5/minute or $20K/day, but in many cases it is not.  An even better tool is to create a “loss impact curve” that shows how money is lost over time (within a reasonable interval).

For example, a company might easily face no loss at all for the first five minutes, with slowly increasing but small losses until about four hours in, when work stops because people can no longer fall back to paper, and losses jump from almost zero to quite large.  Or a company might take a huge loss the moment that the systems go down but see the losses slowly dissipate over time.  Loss might only be impactful at certain times of day: maybe outages at night or during lunch are trivial while mid-morning or mid-afternoon outages are major.  Every company’s impact, risk and ability to mitigate that risk are different, often dramatically so.
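
As a purely illustrative sketch of such a curve (the thresholds and dollar figures below are invented for the example above, not taken from any real business), a loss impact curve can be modeled as a simple piecewise function of outage duration:

# Hypothetical loss impact curve: no loss for the first 5 minutes, small
# creeping losses until the 4 hour mark, then steep losses once staff can
# no longer work around the outage.  All numbers are illustrative assumptions.

def loss_impact(outage_minutes: float) -> float:
    """Estimated cumulative loss, in dollars, for an outage of a given length."""
    if outage_minutes > 240:
        # $50/minute once people can no longer go to paper
        return (240 - 5) * 2 + (outage_minutes - 240) * 50
    if outage_minutes > 5:
        return (outage_minutes - 5) * 2   # $2/minute of minor disruption
    return 0.0

for minutes in (5, 60, 240, 480):
    print(f"{minutes:>4} minute outage -> ${loss_impact(minutes):,.0f}")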

Sometimes it comes down to the people working at the company.  Will they all happily take needed bathroom, coffee, snack or even lunch breaks at the time that a system fails so that they can return to work when it is fixed?  Will people go home early and come in early tomorrow to make up for a major outage?  Is there machinery that is going to sit idle?  Will the ability to respond to customers be impacted?  Will life support systems fail?  There are countless potential impacts and countless potential ways of mitigating different types of failures.  All of this has to be considered.  The cost of downtime might be a fraction of corporate revenues on a minute by minute basis or downtime might cause a loss of customers or faith that is more impactful than the minute by minute revenue generated.

Once we have some rough loss numbers to deal with, we at least have a starting point.  Even if we only know that revenue is ~$10/minute and losses are expected to be around ~$5/minute, we have a starting point of sorts.  If we have a full curve or a study with more detailed numbers, all the better.  Now we need to figure out roughly what our baseline is going to be.  A well maintained server, running on premises, with a good support contract and good backup and restore procedures can pretty easily achieve four nines of reliability.  That means we would expect roughly five hours of downtime every five years.  This sits at the high end of the downtime we defined for SA; in excellent environments, like high quality datacenters with nearby parts and service, expected downtime will be far lower, which makes this a conservative baseline.

So, based on our baseline example of about five hours every five years we can figure out our potential risk.  If we lose ~$5/minute and we expect roughly 300 minutes of downtime every five years, we are looking at a potential loss of $1,500 every half decade.

That means that, even at the most extreme, we could never justify spending $1,500 to mitigate that risk; that would be financially absurd.  This is true for several reasons.  One of the biggest is that this is only a risk: spending a certain $1,500 to protect against possibly losing $1,500 makes little sense, but it is a very common mistake when people do not analyze these numbers carefully.

The biggest factor is that no mitigation technique is completely effective.  If we manage to move our four nines system to a five nines system we would eliminate only 90% of the average downtime, moving us from $1,500 of expected loss to $150.  If we spent $1,500 for that reduction, the total “loss” would still be $1,650 (the cost of protection is a form of financial loss).  The cost of the risk mitigation combined with the anticipated remaining impact must, taken together, still be lower than the anticipated cost of the risk without mitigation, or else the mitigation itself is pointless or actively damaging.
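
Putting the figures from this example together (the same illustrative numbers used above: $5/minute of loss, roughly 300 minutes of downtime per five years, a 90% reduction from the proposed mitigation and a hypothetical $1,500 mitigation cost), the whole comparison fits in a few lines:

# Compare the unmitigated risk against the mitigation cost plus residual risk.
loss_per_minute = 5.0
expected_downtime_minutes = 300      # per five years at roughly four nines
mitigation_cost = 1_500.0            # hypothetical spend to reach five nines
mitigation_effectiveness = 0.90      # fraction of downtime eliminated

unmitigated_loss = loss_per_minute * expected_downtime_minutes
residual_loss = unmitigated_loss * (1 - mitigation_effectiveness)
total_with_mitigation = mitigation_cost + residual_loss

print(f"Expected loss without mitigation: ${unmitigated_loss:,.0f}")        # $1,500
print(f"Mitigation cost plus residual:    ${total_with_mitigation:,.0f}")   # $1,650
# The mitigation only makes sense when the second number is clearly lower
# than the first; here it is higher, so the "protection" destroys value.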

Many may question why the total cost of risk mitigation must be lower and not simply equal; surely that would be a “risk break even” point?  This seems true on the surface, but because we are dealing with risk it is not the case.  Risk mitigation is a certain cost: financial damage that we take up front in the hope of reducing losses tomorrow.  But the risk for tomorrow is a guess, hopefully a well educated one, but only a guess.  The cost today is certain.  Taking on certain damage today in the hope of reducing possible damage tomorrow only makes sense when the damage today is small, the expected or possible damage tomorrow is very large and the effectiveness of the mitigation is significant.

Included in the idea of a “certain cost up front” to reduce a “possible cost tomorrow” is the time value of money.  Even if an outage were of a known size and time, we would not spend the same money today to mitigate it tomorrow, because our money is more valuable today.
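
A minimal discounting sketch (assuming, purely for illustration, a 5% annual discount rate and a loss expected about five years out) shows why a dollar of certain spending today outweighs a dollar of possible loss later:

# Present value of a possible future loss, discounted at an assumed rate.
# The 5% rate and five year horizon are illustrative assumptions only.

def present_value(future_amount: float, annual_rate: float, years: float) -> float:
    return future_amount / ((1 + annual_rate) ** years)

expected_future_loss = 1_500.0   # the illustrative five year loss from above
pv = present_value(expected_future_loss, annual_rate=0.05, years=5)
print(f"${expected_future_loss:,.0f} of loss five years out is worth ~${pv:,.0f} today")
# Roughly $1,175 today, so even a "break even" $1,500 spent now is already a net loss.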

In the most dramatic cases, we sometimes see IT departments demanding tens or hundreds of thousands of dollars be spent up front to avoid losing a few thousand dollars, maybe, sometime maybe many years in the future.  A strategy that we can refer to as “shooting ourselves in the face today to avoid maybe getting a headache tomorrow.”

This is included in the concept of evaluating risk mitigation, but it should be mentioned specifically that with IT equipment there are many examples of attempted risk mitigation that may not be as effective as believed.  For example, having two servers that sit in the same rack may be very effective for mitigating the risk of host hardware failure, but will not mitigate against natural disasters, site loss, fire, most electrical events, fire suppression activation, network interruptions, most application failures, ransomware attacks or other reasonably possible disasters.

It is common for storage devices to be equipped with “dual controllers,” which gives a strong impression of high reliability.  Generally, though, these controllers live inside a single chassis with shared components, and even when the components are not shared, the firmware often is, and communications between the components are complex.  This frequently leads to failures where the failure of one component triggers the failure of another – making these quite often LA devices rather than the SA or HA that people expected when purchasing them.  So it is critical to consider which risks a mitigation strategy actually addresses and whether the technique is likely to be effective.  No technique is completely effective, there is always a chance of failure, but some strategies and techniques are more broadly effective than others and some are simply misleading or actively counterproductive.  If we are not careful, we may implement costly products or techniques that undermine our goals.

Some techniques and products used in the pursuit of high availability are rather expensive: buying redundant hardware, leasing another building, installing generators or licensing special software.  There are low cost techniques and software as well, but in most cases any movement towards high availability will require a respectively large outlay of investment capital.  It is absolutely critical to keep in mind that high availability is a process; there is no way to simply buy it.  Achieving HA requires good documentation, procedures, planning, support, equipment, engineering and more.  In the systems world, HA is normally approached first from an environmental perspective, with failover power generators, redundant HVAC, power conditioning, air filtration, fire suppression and more, to ensure that the environment for availability is there.  This alone can deliver incredible results and often makes further investment unnecessary.  Then comes HA system design, ensuring that not just one layer of a technology stack is highly available but that the entire stack is, allowing the critical applications, data or services to remain functional as much of the time as possible.  Then comes site to site redundancy, to withstand floods, hurricanes, blizzards and the like.  Of course there are completely different techniques as well, such as utilizing cloud computing services hosted remotely on our behalf.  What matters is that high availability requires broad thinking and planning, cannot simply be purchased as a line item and is judged by the resulting uptime, or likelihood of uptime, being much higher than a “standard” system design would deliver.

What is often surprising, almost shocking, to many businesses, and especially to IT professionals, who rarely undertake financial risk analysis and who are constantly told that HA is a necessity and that buying the latest HA products is unquestionably how their budgets should be spent, is that when the numbers are crunched and the real costs and effectiveness of risk mitigation strategies are considered, high availability has very little place in any organization, especially those that are small or have highly disparate workloads.  In the small and medium business market it is almost universal to find that the cost and complexity (which in turn brings risk, mostly in the form of a lack of experience with the techniques and with risk assessment) of high availability approaches is far too high to ever offset the potential damage of the outage from which the mitigation is hoped to protect.  There are exceptions, of course, and there are many businesses for which high availability solutions are absolutely sensible, but these are the exception and very far from the norm.

It is also sensible to think of high availability needs on a per workload basis, not department, company or technology wide.  In a small business it is common for all workloads to share a common platform, and the need of a single workload for high availability may sweep other, less critical, workloads along with it.  This is perfectly fine and a great way to offset the cost of the risk mitigation of the critical workload through ancillary benefit to the less critical ones.  In a larger organization, where a plethora of platform approaches are used for differing workloads, it is common for high availability to be applied only to workloads that are both highly critical (in terms of downtime impact) and practically mitigatable (the ability to mitigate risk can vary dramatically between types of workloads), with other workloads left to standard techniques.

Examples of workloads that may be critical and can be effectively addressed with high availability might include an online ordering system, where the latency created by multi-regional replication has little impact on the overall system but losing orders could be very financially impactful should a system fail.  An example of a workload where high availability might be easy to implement but ineffectual would be an internal intranet site serving commonly asked HR questions; it would simply not be cost effective to avoid small amounts of occasional downtime for a system like this.  An example of a system where the risk is high but the cost or effectiveness of mitigation makes it impractical, or even impossible, might be a financial “tick” database requiring massive amounts of low latency data to be ingested; maintaining a replica would not only be incredibly costly but could introduce latency that would undermine the ability of the system to perform adequately.  Every business and workload is unique and should be evaluated carefully.

Of course high availability techniques can be applied in stages; it is not an all or nothing endeavor.  It might be practical to mitigate the risk of system level failure with application layer fault tolerance that protects against the failure of server hardware, the virtualization platform or storage, while for the same workload it might not be valuable to protect against the loss of a single site.  If a workload only services a particular site, or is simply not valuable enough to justify the large investment needed to fail between sites, it can easily fall “in the middle.”  It is very common for workloads to implement only partial high availability solutions, often because an IT department is only responsible for a portion of the stack and has no say over things like power and HVAC, but probably most often because some high availability techniques are seen as high visibility and easy to sell to management while others, such as high quality power and air conditioning, are not, even though they may easily provide a better bang for the buck.  There are good reasons why certain techniques may be chosen over others, as they address different risk components and some risks have differing impacts on an individual business or workload.

High availability requires careful thought as to whether it is worth considering and even more careful thought as to implementation.  Building true HA systems requires a significant amount of effort and expertise and generally substantial cost.  Understanding which components of HA are valuable and which are not requires not just extensive technical expertise but financial and managerial skills as well.  Departments must work together extensively to truly understand how HA will impact an organization and when it will be worth the investment.  It must be remembered that the need for high availability in an organization or for a workload is anything but a foregone conclusion, and it should not be surprising in the least to find that extensive, or even casual, high availability practices turn out to be economically impractical.

In many ways this is because standard availability has reached such a state that there is less and less risk to mitigate.  The technology components used in business infrastructure, most notably servers, networking gear and storage, have become so reliable that the amount of downtime that we must protect against is quite low.  Most of the belief in the need for knee-jerk high availability comes from a different era, when reliable hardware was unaffordable and even the most expensive equipment was rather unreliable by modern standards.  The feeling of impending doom, that any device might fail at any moment, belongs to that older era, not the current one.  Modern equipment, while obviously still carrying risks, is amazingly reliable.

In addition to other risks, over-investing in high availability solutions carries financial and business risks that can be substantial.  It increases technical debt in the face of business uncertainty.  What if the business suddenly grows, or worse, suddenly contracts, changes direction, gets purchased or goes out of business completely?  The investment in high availability is already spent even if the need for its protection disappears.  What if the technology or the location changes?  Some or all of a high availability investment might be lost long before its expected end of life.

As IT practitioners, evaluating the benefits, risks and costs of technology solutions is at the core of what we do.  Like everything else in business infrastructure, determining the type of risk mitigation, the value of protection and how much is financially proper is our key responsibility and cannot be glossed over or ignored.  We can never simply assume that high availability is needed, nor that it can simply be skipped.  It is in analysis of this nature that IT brings some of its greatest value to organizations.  It is here that we have the potential to shine the most.


Disaster Recovery Planning with Existing Platform Equipment

Disaster Recovery planning is always difficult; there are so many factors and “what ifs” to consider, and investing too much in the recovery solution can itself become a bit of a disaster.  A factor that is often overlooked in DR planning is that, in the event of a disaster, you are generally able and very willing to make compromises where needed, because a disaster has already happened.  It is triage time, not business as usual.

Many people immediately imagine that if you need capacity and performance of X for your live, production systems then you will need X for your disaster recovery systems as well.  In the real world this is rarely true.  In the event of a disaster you can, with rare exception, work with lower performance and limit availability to just the more critical systems, and many maintenance operations, which often include archiving systems, can be suspended until full production is restored.  This means that your disaster recovery system can often be much smaller than your primary production systems.

Disaster recovery systems are not investments in productivity, they are hedges against failure, and they need to be seen in that light.  Because of this it is a common and effective strategy to approach DR system needs from the perspective of being “adequate” to maintain business activities, while not necessarily enough to do so comfortably or transparently.  If a full scale disaster hits and staff have to deal with sluggish file retrieval, slower than normal databases or holding off on a deep BI analysis run until the high performance production systems are restored, few people will complain.  Most workers, and certainly most business decision makers, can be very understanding that a system is in a failed state and that they may need to carry on as best they can until full capacity is restored.

With this approach in mind, it can be an effective strategy to re-purpose older platforms for use at Disaster Recovery sites when new platforms are purchased and implemented for primary production usage.  This creates a low cost and easily planned “DR pipeline” where the DR site always has the capacity of your last refresh which, in most DR scenarios, is more than adequate.  This can be a great way to make use of equipment that might otherwise either be scrapped outright or tempt its way back into production deployment by invoking the “sunk cost” emotional response that, in general, we want to avoid.

The sunk cost fallacy is a difficult one to avoid.  Already owning equipment makes it very easy to feel that deploying it again, outside of the designs and specifications of the newly implemented system, is useful or good.  There are cases where this might be true, but most likely it is not.  Yet just as we do not want to become emotionally attached to equipment simply because we have already paid for it, we also do not want to ignore the value in the equipment that we already own.  This is where a planned pipeline into a Disaster Recovery role can leverage what we have already invested in a really great way.  This is likely very useful equipment with a lot of value left in it, if we just know how to use it properly to meet our existing needs.

A strong production to disaster recovery platform migration planning process can be a great way to lower budgetary spending while getting excellent disaster recovery results.

A Public Post Mortem of An Outage

Many things in life have a commonly accepted “conservative” approach and a commonly accepted “risky” approach that should be avoided, at least according to popular sentiment.  In investing, for example, we often see buying government or municipal bonds as low risk and investing in equities (corporate stocks) as high risk – but the statistical numbers tell us that this is backwards and that nearly everyone loses money on bonds and makes money on stocks.  Common “wisdom,” when put to the test, turns out to be based purely on emotions which, in turn, are based on misconceptions, and the riskiest thing in investing is using emotion to drive investing strategies.

Similarly, with business risk assessments, the common approach is to feel an emotional response to danger, which triggers panic and a strong tendency to overcompensate for perceived risk.  We see this commonly with small companies, whose IT infrastructure generates very little revenue or is not very key to short term operations, spending large sums of money to protect against a risk that is only partially perceived and very poorly articulated.  This often becomes so dramatic that the mitigation process is handled emotionally instead of intellectually, and we regularly find companies implementing bad system designs that actually increase risk rather than decreasing it, while spending very large sums of money and then, since the risk was mostly imaginary, calling the project a success based on layer after layer of misconceptions: imaginary risk, imaginary risk mitigation and imaginary success.

In the recent past I got to be involved in an all-out disaster for a small business.  The disaster hit what was nearly a “worst case scenario.”  Not quite, but very close.  The emotional response at the time was strong, and once the disaster was fully under way it was common for nearly everyone to state and repeat that the disaster planning had been faulty and that the issue should have been avoided.  This is very common in any disaster situation; humans feel that there should always be someone to blame and that, if we do our jobs correctly, there should be zero risk scenarios.  This is completely incorrect.

Thankfully we performed a full post mortem, as one should do after any true disaster, to determine what had gone wrong, what had gone right, how we could fix processes and decisions that had failed and how we could maintain the ones that had protected us.  Typically, when some big systems event happens, I do not get to talk about it publicly.  But once in a while, I do.  It is so common to react to a disaster, to any disaster, and think “oh, if we had only…”  But you have to examine the disaster.  There is so much to be learned about processes and ourselves.

First, some back story.  A critical server, running in an enterprise datacenter, held several key workloads that are very important to several companies.  It was a little over four years old and had been running in isolation for years.  Older servers are always a bit worrisome as they approach end of life.  Four years is hardly end of life for an enterprise class server, but it was certainly not young, either.

This was a single server without any failover mechanism.  Backups were handled externally to an enterprise backup appliance in the same datacenter.  A very simple system design.

I won’t include all internal details as any situation like this has many complexities in planning and in operation.  Those are best left to an internal post mortem process.

When the server failed, it failed spectacularly.  The failure was so complete that we were unable to diagnose it remotely, even with the assistance of the on site techs at the datacenter.  Even the server vendor was unable to diagnose the issue.  This left us in a difficult position: how do you deal with a dead server when the hardware cannot reliably be fixed?  We could replace drives, we could replace power supplies, we could replace the motherboard.  Who knew what might be the fix.

In the end the decision was that the server, as well as the backup system, had to be relocated back to the main office where they could be triaged in person and with maximum resources.  The system ultimately proved repairable and no data was lost.  The decision to refrain from going to backup was made because data recovery was more important than system availability.

When all was said and done the disaster was one of the most complete that could be imagined without experiencing actual data loss.  The outage went on for many days and a lot of spare equipment, man hours and attempted fixes were used.  The process was exhausting but when completed the system was restored successfully.

The long outage and sense of chaos as things were diagnosed and repair attempts were made led to an overall feeling of failure.  People started saying it, and that led to people believing it.  Under emergency response conditions it is very easy to become excessively emotional, especially when there is very little sleep to be had.

But when we stepped back and looked at the final outcome, what we found surprised nearly everyone: the triage operation, and the initial risk planning had been successful.

The mayhem that happens during a triage often makes things feel much worse than they really are, but our triage handling had been superb.  Triage does not mean magic; there is a discovery phase and a reaction phase.  When we analyzed the order of events and laid them out in a timeline we found that we had acted so well that there was almost no place where we could have shortened the time frame.  We had done good diagnostics, engaged the right parties at the right time and gotten parts into logistical motion as soon as possible, and most of what appeared to have been frenetic, wasted time was actually “filler time” where we were attempting to determine whether additional options existed or mistakes had been made while we were waiting on the needed parts.  This made things feel much worse than they were, but it was the correct set of actions to have taken.

From the triage and recovery perspective, the process had gone flawlessly even though the outage ended up taking many days.  Once the disaster had happened and had happened to the incredible extent that it did, the recovery actually went incredibly smoothly.  Nothing is absolutely perfect, but it went extremely well.  The machine worked as intended.

The far more surprising part was looking at the disaster impact.  There are two ways to look at this.  The first is the wiser one, the “no hindsight” approach: we look at the disaster, the impact cost of the disaster and the mitigation cost, apply the likelihood that the disaster would happen at all, and determine whether the right planning decision had been made.  This is hard to calculate because the risk factor is always a fudged number, but you can normally get accurate enough to know how good your planning was.  The second is the 20/20 hindsight approach: if we had known that this disaster was going to happen, what would we have done to prevent it?  It is obviously completely unfair to remove the risk factor and look at what the disaster cost in raw numbers, because we cannot know what is going to go wrong and plan only for that one possibility, or spend unlimited money on something that may never happen.  Companies often make the mistake of using the latter calculation and blaming planners for not having perfect foresight.
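
To illustrate the two calculations (every figure below is a made-up placeholder; the real numbers stay in the internal post mortem), the comparison looks roughly like this:

# Hypothetical comparison of the "no hindsight" and "hindsight" views.
# All dollar amounts and the probability are placeholder assumptions.

mitigation_cost = 15_000.0        # e.g. a second server, hosting and labour
outage_cost = 5_000.0             # what the actual extended outage cost
probability_of_disaster = 0.10    # guessed likelihood over the system's life

# No hindsight: weight the outage cost by how likely it ever was.
expected_loss = probability_of_disaster * outage_cost
print(f"Expected loss without mitigation: ${expected_loss:,.0f}")
print(f"Cost of the mitigation:           ${mitigation_cost:,.0f}")

# Hindsight: even knowing the disaster would happen, compare raw costs.
print(f"Raw outage cost vs mitigation:    ${outage_cost:,.0f} vs ${mitigation_cost:,.0f}")
# In the case described here even the hindsight comparison favored accepting
# the outage, which is what made the post mortem result so surprising.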

In this case, we were fairly confident that we had taken the right gamble from the start.  The system had been in place for years with zero downtime, the overall system cost had been low, the triage cost had been moderate and the event had been extremely unlikely.  That the planning looked good once the risk factor was considered did not generally surprise anyone.

What was surprising is that when we ran the calculations without the risk factor, even had we known that the system would fail and that an extended outage would take place, we still would have made the same decision!  This was downright shocking.  The cost of the extended outage was actually less than the cost of the equipment, hosting and labour needed to build a functional risk mitigation system – in this case, a fully redundant server in the datacenter alongside the one in production.  In fact, accepting this extended outage had saved close to ten thousand dollars!

This turned out to be an extreme case where the outage was devastatingly bad, hard to predict and unable to be repaired quickly, yet still resulted in massive long term cost savings, but the lesson is an important one.  There is so much emotional baggage that comes with any disaster that, if we do not do a proper post mortem analysis and work to remove emotion from our decision making, we will often leap to large scale financial loss or place blame incorrectly even when things have gone well.  Many companies would have looked at this disaster and reacted by overspending dramatically to prevent the same unlikely event from recurring, even with the math in front of them showing that doing so would waste money even if that event did recur!

There were other lessons to be learned from this outage.  We learned where communications had not been ideal, where the right people were not always in the right decision making spot, where customer communications were not what they should have been, where the customer had not informed us of changes properly, and more.  But, by and large, the lessons were that we had planned correctly, that our triage operation had worked correctly and that we had saved the customer several thousand dollars over what would have appeared to be the “conservative” approach.  By doing a good post mortem we managed to keep them, and us, from overreacting and turning a good decision into a bad one going forward.  Without a post mortem we might very likely have changed our good processes, thinking that they had been bad ones.

The takeaway lessons here that I want to convey to you, the reader, are that post mortems are a critical step in any disaster, traditional conservative thinking is often very risky and emotional reactions to risk often cause financial disasters larger than the technical ones that they seek to protect against.


The Jurassic Park Effect

“If I may… Um, I’ll tell you the problem with the scientific power that you’re using here, it didn’t require any discipline to attain it. You read what others had done and you took the next step. You didn’t earn the knowledge for yourselves, so you don’t take any responsibility for it. You stood on the shoulders of geniuses to accomplish something as fast as you could, and before you even knew what you had, you patented it, and packaged it, and slapped it on a plastic lunchbox, and now …” – Dr. Ian Malcolm, Jurassic Park

When looking at building a storage server or NAS, there is a common feeling that what is needed is a “NAS operating system.”  This is an odd reaction, I find, since the term NAS means nothing more than a “fileserver with a dedicated storage interface,” or, in other words, just a file server with limited exposed functionality.  The reasons that we choose physical NAS appliances are the integrated support and, sometimes, special proprietary functionality (NetApp being a key example, offering extensive SMB and NFS integration and some really unique RAID and filesystem options, or Exablox offering fully managed scale out file storage and RAIN style protection).  Using a NAS to replace a traditional file server is, for the most part, a fairly recent phenomenon and one that I have found is often driven by misconception or the impression that managing a file server, one of the most basic IT workloads, is special or hard.  File servers are generally considered the most basic form of server; traditionally they are what people meant when using the term “server” unless additional description was added, and they are the only form commonly integrated into the desktop (every Mac, Windows and Linux desktop can function as a file server and it is very common to do so).

There is, of course, nothing wrong with turning to a NAS instead of a traditional file server to meet your storage needs, especially as some modern NAS options, like Exablox, offer scale out and storage options that are not available in most operating systems.  But it appears that the trend to use a NAS instead of a file server has led to some odd behaviour when IT professionals turn back to considering file servers again.  A cascading effect, I suspect, where the reasons why a NAS is sometimes preferred and the goal level thinking are lost, and the resulting idea of “I should have a NAS” remains, so that when returning to look at file server options there is a drive to “have a NAS” regardless of whether there is a logical reason for feeling that this is necessary.

First we must consider that the general concept of a NAS is a simple one: take a traditional file server, simplify it by removing options and package it with all of the necessary hardware to make a simplified appliance, with all of the support included from the interface down to the spinning drives and everything in between.  Storage can be tricky when users need to determine RAID levels, choose drive types, monitor effectively, etc.  A NAS addresses this by integrating the hardware into the platform.  This makes things simple but can add risk, as you have fewer support options and less ability to fix or replace things yourself.  A move from a file server to a NAS appliance is almost exclusively about support and is generally a very strong commitment to a single vendor.  You choose the NAS approach because you want to rely on one vendor for everything.

When we move to a file server we go in the opposite direction.  A file server is a traditional enterprise server like any other.  You buy your server hardware from one vendor (HP, Dell, IBM, etc.) and your operating system from another (Microsoft, Red Hat, Suse, etc.)  You specify the parts and the configuration that you need and you have the most common computing model for all of IT.  With this model you generally are using standard, commodity parts allowing you to easily migrate between hardware vendors and between software vendors. You have “vendor redundancy” options and generally everything is done using open, standard protocols.  You get great flexibility and can manage and monitor your file server just like any other member of your server fleet, including keeping it completely virtualized.  You give up the vertical integration of the NAS in exchange for horizontal flexibility and standardization.

What is odd, therefore, is returning to the commodity model but seeking what is colloquially known as a NAS OS.  Common examples of these include NAS4Free, FreeNAS and OpenFiler.  This category of products is generally nothing more than a standard operating system (often FreeBSD, as it has ideal licensing, or Linux because it is well known) with a “storage interface” put onto it and no special or additional functionality that would not exist with the normal operating system.  In theory they are a “single function” operating system that does only one thing.  But this is not reality.  They are general purpose operating systems with an extra GUI management layer added on top.  One could say the same thing about most physical NAS products themselves, but those typically include custom engineering even at the storage level, special features and, most importantly, an integrated support stack and true isolation from the “generalness” of the underlying OS.  A “NAS OS” is not a simpler version of a general purpose OS; it is a more complex, yet less functional, version of it.

What is additionally odd is that general OSes, with rare exception, already come with very simple, extremely well known and fully supported storage interfaces.  Nearly every variety of Windows or Linux server, for example, has included simple graphical interfaces for these functions for a very long time.  These included GUIs are often shunned by system administrators as too “heavy and unnecessary” for a simple file server.  So it is even more unusual that adding a third party GUI, one that is not patched and tested by the OS team and not standardly known and supported, would then be desired, as this goes against the common ideals and practices of running a server.

And this is where the Jurassic Park effect comes in.  The OS vendors (Red Hat, Microsoft, Oracle, FreeBSD, Suse, Canonical, et al.) are giants with amazing engineering teams, code review, testing, oversight and enterprise support ecosystems, while the “NAS OS” vendors are generally very small companies, some with just one part time person, who stand on the shoulders of these giants and build something that they knew they could but never stopped to ask if they should.  The resulting products are wholly negative compared to their pure OS counterparts: they do not make systems management easier, nor do they fill a gap in the market’s service offerings.  Solid, reliable, easy to use storage is already available; more vendors are not needed to fill this place in the market.

The logic often applied to a NAS OS is that it is “easy to set up.”  This may or may not be true, as easy, here, must be a relative term.  For there to be any value, a NAS OS has to be easy in comparison to the standard version of the same operating system.  So in the case of FreeNAS, this would mean FreeBSD: FreeNAS would need to be appreciably easier to set up than FreeBSD for the same, dedicated functions.  And this is easily true, setting up a NAS OS is generally pretty easy.  But this ease is only a veneer, and one of which IT professionals need to be quite wary.  Making something easy to set up is not a priority in IT; making something that is easy to operate and repair when there are problems is what is important.  Easy to set up is nice, but if it comes at the cost of not understanding how the system is configured and makes operational repairs more difficult, it is a very, very bad thing.  NAS OS products routinely make it dangerously easy to put into production, in a storage role, which is almost always among the most critical roles of any server in an environment, a product that IT has no experience or likely skill to maintain, operate or, most importantly, fix when something goes wrong.  We need exactly the opposite: a system that is easy to operate and fix.  That is what matters.  So we have a second case of “standing on the shoulders of giants” and building a system that we knew we could, but did not know if we should.

What exacerbates this problem is that the very people who feel the need to turn to a NAS OS to “make storage easy” are, by the very nature of the NAS OS, the exact people for whom operational support and repair of the system is most difficult.  System administrators who are comfortable with the underlying OS would naturally not see a NAS OS as a benefit and, for the most part, avoid it.  It is uniquely the people for whom it is most dangerous to run a not fully understood storage platform who are likely to attempt it.  And, of course, most NAS OS vendors earn their money, as we could predict, on post-installation support calls from customers who deployed, got stuck once they were in production and found themselves at the mercy of the vendor’s exorbitant support pricing.  It is in the interest of the vendors to make it easy to install and hard to fix.  Everything is working against the IT pro here.

If we take a common example and look at FreeNAS we can see how this is a poor alignment of “difficulties.”  FreeNAS is FreeBSD with an additional interface on top.  Anything that FreeNAS can do, FreeBSD can do.  There is no loss of functionality in going to FreeBSD.  When something fails, in either case, the system administrator must have a good working knowledge of FreeBSD in order to effect repairs.  There is no escaping this.  FreeBSD knowledge is common in the industry and getting outside help is relatively easy.  Using FreeNAS adds several complications, the biggest being that any and all customizations made by the FreeNAS GUI are special knowledge needed for troubleshooting, on top of the knowledge already needed to operate FreeBSD.  So this is a larger knowledge set as well as more things to fail.  It is also a relatively uncommon knowledge set, as FreeNAS is a niche storage product from a small vendor while FreeBSD is a major enterprise IT platform (all use of FreeNAS is FreeBSD use, but only a tiny percentage of FreeBSD use is FreeNAS).  So we can see that using a NAS OS just adds risk over and over again.

This same issue carries over into the communities that grow up around these products.  If you look to communities around FreeBSD, Linux or Windows for guidance and assistance you deal with large numbers of IT professionals, skilled system admins and people with business and enterprise experience.  Of course hobbyists, the uninformed and others participate too, but these are the enterprise IT platforms and all the knowledge of the industry is available to you when implementing these products.  Compare this to the community of a NAS OS.  By its very nature, only people struggling with the administration of a standard operating system and/or storage basics would look at a NAS OS package, and so the membership of these communities is naturally filtered to include the very people whose advice we would do best to avoid.  This creates an isolated culture of misinformation and misunderstanding around storage and storage products.  Myths abound, guidance often becomes reckless and dangerous and industry best practices are ignored as if decades of accumulated experience had never happened.

A NAS OS also commonly introduces lags in patching and updates.  A NAS OS will almost always, and almost necessarily, trail its parent OS on security and stability updates, and will very often follow months or years behind on major features.  In one very well known case, OpenFiler, the product was built on an upstream non-enterprise base (rPath Linux) which lacked community and vendor support, failed and was abandoned, leaving downstream users, including everyone on OpenFiler, stranded without the ecosystem needed to support them.  Using a NAS OS means trusting not just the large, well known enterprise vendor that makes the base OS but the NAS OS vendor as well, and the NAS OS vendor is orders of magnitude more likely to fail, even when basing their products on enterprise class base OSes.

Storage is a critical function and should not be treated carelessly or as if that criticality did not exist.  NAS OSes tempt us to install quickly and forget, hoping that nothing ever goes wrong or that we can move on to other roles or companies entirely before bad things happen.  They set us up for failure where failure is most impactful.  When a typical application server fails we can always copy the files off of its storage and start fresh.  When storage fails, data is lost and systems go down.

“John Hammond: All major theme parks have delays. When they opened Disneyland in 1956, nothing worked!

Dr. Ian Malcolm: Yeah, but, John, if The Pirates of the Caribbean breaks down, the pirates don’t eat the tourists.”

When storage fails, businesses fail.  Taking the easy route to setting up storage, ignoring the long term support needs and seeking advice from communities that have filtered out the experienced storage and systems engineers increases risk dramatically.  Sadly, the nature of a NAS OS is that the very reason people turn to it (a lack of deep technical knowledge to build the systems) is the very reason they must avoid it (an even greater need for support).  The people for whom NAS OSes are effectively safe to use, those with very deep and broad storage and systems knowledge, would rarely consider these products because for them they offer no benefits.

At the end of the day, while the concept of a NAS OS sounds wonderful, it is not a panacea.  The value of a NAS does not carry over from the physical appliance world to the installed OS world, and the value of standard OSes is far too great for NAS OSes to add real value.

“Dr. Alan Grant: Hammond, after some consideration, I’ve decided, not to endorse your park.

John Hammond: So have I.”