All posts by Scott Alan Miller

Started in software development with Eastman Kodak in 1989 as an intern in database development (making database platforms themselves.) Began transitioning to IT in 1994 with my first mixed role in system administration.

Disaster Recovery Planning with Existing Platform Equipment

Disaster Recovery planning is always difficult: there are so many factors and “what ifs” to consider, and investing too much in the recovery solution can itself become a bit of a disaster.  A factor that is often overlooked in DR planning is that, in the event of a disaster, you are generally able and very willing to make compromises where needed because a disaster has already happened.  It is triage time, not business as usual.

Many people immediately imagine that if you need capacity and performance of X for your live, production systems then you will need X for your disaster recovery systems as well.  In the real world, however, this is rarely true.  In the event of a disaster you can, with rare exception, work with lower performance and limit system availability to just the more critical systems, and many maintenance operations, which often include archiving systems, can be suspended until full production is restored.  This means that your disaster recovery system can often be much smaller than your primary production systems.
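
As a rough sketch of why the DR target can be so much smaller, consider a toy capacity calculation.  The workload names, capacity units and acceptable performance factor below are illustrative assumptions only, not figures from any real environment:

```python
# Rough illustration of DR sizing, using assumed numbers only.
# Production hosts every workload at full performance; the DR target only
# needs the critical subset, and can run it at reduced performance.

production_workloads = {          # assumed workload -> required capacity units
    "erp": 40,
    "file_services": 25,
    "bi_analytics": 60,
    "archiving": 30,
}
critical_in_dr = {"erp", "file_services"}   # assumed: only these must run during a disaster
dr_performance_factor = 0.6                 # assumed: 60% of normal performance is acceptable

production_capacity = sum(production_workloads.values())
dr_capacity = sum(units * dr_performance_factor
                  for name, units in production_workloads.items()
                  if name in critical_in_dr)

print(f"Production capacity: {production_capacity} units")   # 155 units
print(f"DR capacity needed:  {dr_capacity:.0f} units")        # 39 units
```

Under these assumptions the DR site needs roughly a quarter of the production capacity, which is exactly the gap that a “last refresh” platform can often fill.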

Disaster recovery systems are not investments in productivity, they are hedges against failure, and need to be seen in that light.  Because of this it is a common and effective strategy to approach the DR system needs more from a perspective of being “adequate” to maintain business activities while not enough to necessarily do so comfortably or transparently.  If a full scale disaster hits and staff have to deal with sluggish file retrieval, slower than normal databases or hold off on a deep BI analysis run until the high performance production systems are restored, few people will complain.  Most workers, and certainly most business decision makers, can be very understanding that a system is in a failed state and that they may need to help carry on as best as they can until full capacity is restored.

With this approach in mind, it can be an effective strategy to re-purpose older platforms for use at Disaster Recovery sites when new platforms are purchased and implemented for primary production usage.  This can create a low cost and easily planned around “DR pipeline” where the DR site always has the capacity of your “last refresh” which, in most DR scenarios, is more than adequate.  This can be a great way to make use of equipment that otherwise might either be scrapped outright or might tempt us into production re-deployment by invoking a “sunk cost” emotional response that, in general, we want to avoid.

The sunk cost fallacy is a difficult one to avoid.  Already owning equipment makes it very easy to feel that deploying it again, outside of the designs and specifications of the newly implemented system, is useful or good.  There are cases where this might be true, but most likely it is not.  But just as we don’t want to become overly emotionally attached to equipment just because we have already paid for it, we also don’t want to ignore the value in the existing equipment that we already own.  This is where a planned pipeline into a Disaster Recovery role can, in many cases, leverage what we have already invested in a really effective way.  We do have to remember that this is likely very useful equipment with a lot of value left in it, if we just know how to use it properly to meet our existing needs.

A strong production to disaster recovery platform migration planning process can be a great way to lower budgetary spending while getting excellent disaster recovery results.

Understanding Bias

I often write about the importance of alignment in goals between IT and vendors and how critical it is to avoid taking advice from those whom you are not paying for that advice, because that makes them salespeople.  Basically, it is the importance of getting advice and guidance from a buyer’s agent rather than directly from the seller’s agent.  This leads to questions about bias; clearly the idea is that a salesperson is biased in a way that is likely unfavourable to you.  But it should be obvious that all people are biased.

This is true: all people have bias.  We cannot seek to escape or remove all bias; that is simply impossible.  In fact, in many ways, when we seek advice – whether it be from a paid consultant whose job it is to present us with a good option, from IT itself doing the same, or from a friend giving feedback on products that they have tested – it is actually their biases that we are seeking!

What we need to do is strive to understand the biases and motivations of the people from whom we receive advice, be self-reflective enough to understand our own biases, have a good knowledge of which biases are good for us and attempt to get advice from people who have a general bias-alignment with us.

Biases come in many forms.  We can have good and bad biases, strong and weak ones.

The biggest biases typically come externally in the form of monetary or near-monetary compensation for bias.  This might be someone being paid as a salesperson to promote the products that they have available to sell; commission structures take this to an even more acute level.  Someone paid to do sales may face two of the strongest biases: monetary (they get money if they make the sale) and ethical (they made an agreement to sell this product if possible and they are ethically bound to try to do so.)  These are the standard biases of the “seller’s agent” or salesperson.

On the other hand, a consultant is paid by the buyer or customer, is a buyer’s agent and has the same monetary and ethical biases, but in favour of the buyer rather than against them.  (I use the terms buyer and customer here mostly interchangeably to represent the business or IT department, the ones receiving advice or guidance on what to do or buy.)  These biases are pretty evident and easy to control, and I have covered them before – never get advice from the seller’s agent, always get your advice from the buyer’s agent.

If we assume that these big biases, those of alignment, are covered we still have a large degree of bias from our buyer’s agent that we need to uncover and understand.

One of the most common biases is the bias towards familiarity.  This is not a bad bias, but we must be aware of it and of how it colours recommendations.  This bias can run very deep and affect decision making in ways that we may not understand without investigation.  At the highest level, the idea is simply that almost anyone is going to favour, possibly unintentionally, solutions and products with which they have familiarity, and the stronger that familiarity, the stronger the bias towards those products will often be.

This may seem obvious but it is a bias that is commonly overlooked.  People turning to consultants will often seek advice from someone with a very small set of experiences, a set from which the resulting recommendations are likely to be drawn.  In a way, this is effectively the buyer preselecting the desired outcome and choosing a consultant who will deliver it.  An example of this would be choosing a network engineer to design a solution when that engineer only knows one product line; naturally the engineer will almost certainly design a solution from that product line.  In choosing someone with limited experience in that area we are, for all intents and purposes, directing the results by picking based on a strong bias.  This happens extremely often in IT, presumably because those hiring consultants base the decision on what they think are foregone conclusions about what the resulting advice will be, forgetting to step back and get advice at a higher level.

Of course, as with many things, there is also an offsetting bias to the familiarity bias: the exploration bias.  While we tend to be strongly biased towards things that we know, there is also a bias towards the unknown and the opportunity to explore and learn.  This bias tends to be much weaker than the familiarity bias, but it is far from trivial in many IT practitioners.  It is a bias that should not be ignored and it is important for helping broaden the potential scope of advice from a single consultant.

Of course there are more biases that stem from familiarity.  There is a natural, strong bias towards companies whose products we have found to be good, whose support has been good or with whom we interact well.  We tend to be strongly biased against companies with whom we have experienced product, support or interaction issues.  These, of course, are highly valuable biases that we specifically want consultants to bring with them.

One of the worst biases, however, and one that affects everyone, is marketing bias.  Companies with large or well made marketing campaigns, or that align with industry marketing campaigns, can induce a large amount of bias that is not based on anything valuable to the end user.  Similarly, market share is an almost valueless and often negative factor (large companies often charge more for equal products – e.g. you “pay for the name”) but can be a strong bias, one often brought to the table by the customer.  Customers commonly either directly control this bias by demanding that only well marketed, seemingly popular or large vendor promoted recommendations be made, or fail to react properly to apparent alternative solutions: both reactions heavily influence what a consultant is willing to recommend.  This is the “no one ever got fired for buying IBM” bias of the 1980s, and it is often an amazingly costly one and difficult to overcome.  Of course it applies much more broadly than only to IBM and does not primarily pertain to them today, but the term became famous during IBM’s heyday in IT.

Of course the main bias that we seek is the bias towards “what is the best option for the customer.”  This is, itself, a bias: one that we hope, when combined with other positive biases, overpowers the influence of negative biases.  And likewise there is a prestige bias, a desire to produce advice that is so good that it increases the respect for the consultant.

Biases come in many different types and are both the value in advice and the danger in it.  Leveraging bias requires an understanding of the major biases that are, or are likely to be, at play in any specific instance, as well as having empathy for the people who give advice.  If you take the time to learn what their financial, ethical, experiential and objective biases are, you can understand their role far better and you can better filter their advice based on that knowledge.

Take the time to consider the biases of the people from whom you get advice.  You likely already know many of the biases that affect them significantly and may be able to guess at more of them.  Everyone has different biases and all people react to them differently.  What is a strong bias for one person is a weak one for someone else.  Consider talking to your consultants about their biases; they should be open to this conversation (and if not, be extra cautious) and hopefully have thought about it themselves, even if not in depth or in the same terms.

The people from whom you get advice should have biases that strongly align favourably towards you and your goals.


Decision Point: VDI and Terminal Services

Two basic concepts vie for prominence, if technologies care about prominence, when it comes to remote graphical desktop interfaces: VDI (virtual desktop infrastructure) and terminal services.  The idea of both is simple: put the resources and processing on a server and have end users access the graphical interface remotely over a network.  What separates VDI and TS fundamentally is that with terminal services the remote server is a one-to-many experience, with many users getting their desktops from a single operating system image, while with VDI each user gets a dedicated server all of their own (presumably virtualized) with no sharing of the individual operating system resources.

There is a certain amount of assumption, coming partially from the naming conventions, that VDI implies a desktop operating system rather than a server one, but this should not be taken as a given.  In fact, outside of the Windows world there truly is no separation between desktop and server operating systems, so having such a distinction at the technology level would not make sense.  It is important to remember, however, that Microsoft defines VDI licensing by the use of different OS license options and that most VDI is used for Windows operating systems.  So while VDI does not imply a desktop OS, in a practical sense it is important to keep in mind that on the technical side there is no distinction and on the Microsoft licensing side there are heavy distinctions.

Of the two, VDI is the newer concept.  Terminal services have been around for decades, are well known and are anything but exciting or flashy today.  Terminal services predate Windows, are common to nearly every operating system family and are so common in the UNIX world that they are often used without note.  Terminal services are the GUI continuation of the old “green screen” terminals that have been in use since the “olden days” of computing.  In the old days the terminals were often serial-connected VT100 terminals and today we use TCP/IP networking and protocols capable of carrying graphics, but the concept remains the same: many users on a single server.

With VDI we accomplish the same goals but do so giving each user all of their own resources.  Their OS is completely their own, not shared with anyone.  This means that there is all of the overhead of memory management, CPU management, process tables, copies of libraries and such for every individual user.  That is a lot of overhead.  Consider all of the resources that an idle graphical desktop requires just to boot up and wait for the user – it can be quite a bit.  Newer Windows operating systems have been getting leaner and more efficient, probably to make them more viable on VDI infrastructures, but the overhead remains a significant factor.  VDI was not really possible until virtualization made it a reality so in any practical sense it is a new use of technology and is often misunderstood.
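
To get a feel for the scale of that overhead, here is a minimal sketch in Python.  The user count, per-instance OS footprint and per-user application working set are made-up assumptions purely for the sake of the arithmetic, not measurements, and, as discussed further below, memory deduplication and similar techniques narrow this gap considerably on modern VDI platforms.

```python
# Rough, illustrative comparison of memory overhead: VDI vs. terminal services.
# All figures are assumptions for the sake of the arithmetic, not measurements.

USERS = 50
OS_OVERHEAD_GB = 2.0      # assumed idle OS footprint per running OS instance
PER_USER_APPS_GB = 1.5    # assumed working set of each user's applications

# VDI: every user runs a full OS instance plus their applications.
vdi_total = USERS * (OS_OVERHEAD_GB + PER_USER_APPS_GB)

# Terminal services: one shared OS instance, then per-user application working sets.
ts_total = OS_OVERHEAD_GB + USERS * PER_USER_APPS_GB

print(f"VDI:               {vdi_total:.0f} GB")   # 175 GB
print(f"Terminal services:  {ts_total:.0f} GB")   # 77 GB
```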

What we face now is, when deciding on a remote computational infrastructure, choosing between these two architectural ideas.  Of course, it should be noted, these two can co-exist very easily and it would often be appropriate to do this.  In smaller shops it would be very easy for the two to co-exist on the same physical platform, in fact.  There are many factors here that we need to consider and this decision process can actually be rather complicated.

One of the biggest factors that we must consider is software compatibility.  This is the largest driver of the move to VDI rather than terminal services.  In the Windows world it is not uncommon for applications to require things such as a desktop operating system signature (refusing to run on server OS variants), single user environments, users with administrator level privileges, users running under specific accounts or library requirements that conflict with other packages.  Because of these issues, many companies look to VDI to mimic the way individual desktops work, where these issues were easily overlooked because each user was running in a discrete environment.  VDI brings this same functionality to the remote access world, allowing problem child applications to be catered to as needed.  Isolation of the OS adds a layer of protection.

This driving factor essentially does not exist outside of the Windows world and is primarily why VDI has never taken hold in any other environment.  While easily achievable with Linux or FreeBSD, for example, VDI has little purpose or value in those cases.

A major concern with VDI is the extreme overhead necessary to manage many redundant operating systems, each with its own duplicated processes, storage and memory.  In the early days this made VDI incredibly inefficient.  More recently, however, advanced VDI systems, primarily centered around virtualization platforms and storage, have addressed many of these issues by deduplicating memory and storage, using common master boot files and other techniques.  In fact, contrary to most assumptions, it can even be the case that VDI outperforms traditional terminal services for Windows, due to the hypervisor platform being able to handle memory management and task switching more efficiently than Windows itself (a phenomenon first observed in the early 2000s when, in some cases, Windows would run faster when virtualized on top of Linux because memory management could be partially handed off to the more efficient Linux system underneath.)  This is definitely not always the case, but the improvements in VDI handling have come so far that the two are often quite close.  Again, however, this is a factor making VDI more attractive in the Windows world but not in the non-Windows world, where native OS task management is typically more efficient and VDI would remain unnecessary overhead.

Another area where VDI has consistently been shown to be more capable than terminal services is in graphically rich, rendering-heavy environments such as CAD and video editing.  The same areas that still lean heavily towards dedicated hardware tend to move to VDI rather than terminal services because of heavy investment in GPU capabilities within the VDI solutions.  This is not a universal scenario, but for situations where heavy graphical rendering needs to take place it is worth investigating the possibility that VDI will perform significantly better.

Because of how VDI is managed, it is often reserved for very large deployments where the scale, in the number of end users included in the solution, can be used to overcome some of the cost of implementation.  Terminal services, however, due to their more scalable cost, can often be implemented for smaller environments or subsets of users more cost effectively.  Neither is common for a very small environment of only a few users, although a strange phenomenon of manually managed VDI can make VDI more effective than terminal services for an exceptionally tiny number of users, perhaps fewer than ten, where VDI is treated more like a handful of individual servers than as a unified VDI environment.

With only the rarest of exceptions, and in no small part because of the licensing overhead created by the Windows desktop ecosystem in a virtualized setting, the de facto starting position for remote access end user systems is terminal server technology, turning to the more complicated and more costly VDI solutions only when terminal services prove unable to meet the technical requirements of the scenario.  For all intents and purposes, VDI is a fall back, brute force method to make end user virtualization work where the preferred methods have come up short.

A Public Post Mortem of An Outage

Many things in life have a commonly accepted “conservative” approach and a commonly accepted “risky” approach that should be avoided, at least according to popular sentiment.  In investing, for example, we often see buying government or municipal bonds as low risk and investing in equities (corporate stocks) as high risk – but the statistical numbers tell us that this is backwards: over the long term nearly everyone loses money on bonds and makes money on stocks.  Common “wisdom”, when put to the test, turns out to be based purely on emotions which, in turn, are based on misconceptions, and the riskiest thing in investing is using emotion to drive investing strategies.

Similarly, with business risk assessments, the common approach is to feel an emotional response to danger; this triggers a panic response and creates a strong tendency to overcompensate for perceived risk.  We see this commonly with small companies whose IT infrastructure generates very little revenue, or is not very key to short term operations, spending large sums of money to protect against a risk that is only partially perceived and very poorly articulated.  This often becomes so dramatic that the mitigation process is handled emotionally instead of intellectually, and we regularly find companies implementing bad system designs that actually increase risk rather than decreasing it, while spending very large sums of money and then, since the risk was mostly imaginary, calling the project a success based on layer after layer of misconceptions: imaginary risk, imaginary risk mitigation and imaginary success.

In the recent past I got to be involved in an all-out disaster for a small business.  The disaster hit what was nearly a “worst case scenario.”  Not quite, but very close.  The emotional response to the disaster at the time was strong, and once the disaster was fully under way it was common for nearly everyone to state and repeat that the disaster planning had been faulty and that the issue should have been avoided.  This is very common in any disaster situation; humans feel that there should always be someone to blame and that there would be zero risk scenarios if we all did our jobs correctly, but this is completely incorrect.

Thankfully we performed a full post mortem, as one should do after any true disaster, to determine what had gone wrong, what had gone right, how we could fix processes and decisions that had failed and how we could maintain ones that had protected us.  Typically, when some big systems event happens, I do not get to talk about it publicly.  But once in a while, I do.  It is so common to react to a disaster, to any disaster, and think “oh, if we had only….”  But you have to examine the disaster.  There is so much to be learned about processes and ourselves.

First, some back story.  A critical server, running in an enterprise datacenter, held several key workloads that were very important to several companies.  It was a little over four years old and had been running in isolation for many years.  Older servers are always a bit worrisome as they approach end of life.  Four years is hardly end of life for an enterprise class server, but it was certainly not young, either.

This was a single server without any failover mechanism.  Backups were handled externally to an enterprise backup appliance in the same datacenter.  A very simple system design.

I won’t include all internal details as any situation like this has many complexities in planning and in operation.  Those are best left to an internal post mortem process.

When the server failed, it failed spectacularly.  The failure was so complete that we were unable to diagnose it remotely, even with the assistance of the on site techs at the datacenter.  Even the server vendor was unable to diagnose the issue.  This left us in a difficult position: how do you deal with a dead server when the hardware cannot reliably be fixed?  We could replace drives, we could replace power supplies, we could replace the motherboard.  Who knew what might be the fix.

In the end the decision was made that the server, as well as the backup system, had to be relocated back to the main office where they could be triaged in person and with maximum resources.  Ultimately the system was able to be repaired and no data was lost.  The decision to refrain from going to backup was made because data recovery was more important than system availability.

When all was said and done, the disaster was one of the most complete that could be imagined without experiencing actual data loss.  The outage went on for many days and consumed a lot of spare equipment, man hours and attempted fixes.  The process was exhausting, but when completed the system was restored successfully.

The long outage and the sense of chaos as things were diagnosed and repair attempts were made led to an overall feeling of failure.  People started saying it, and that led to people believing it.  Under an emergency response condition it is very easy to become excessively emotional, especially when there is very little sleep to be had.

But when we stepped back and looked at the final outcome, what we found surprised nearly everyone: both the triage operation and the initial risk planning had been successful.

The mayhem that happens during a triage often makes things feel much worse than they really are, but our triage handling had been superb.  Triage doesn’t mean magic; there is a discovery phase and a reaction phase.  When we analyzed the order of events and laid them out in a time line we found that we had acted so well that there was almost no place where we could have shortened the time frame.  We had done good diagnostics, engaged the right parties at the right time, gotten parts into logistical motion as soon as possible, and most of what appeared to have been frenetic, wasted time was actually “filler time” where we were attempting to determine whether additional options existed or mistakes had been made while we were waiting on the needed parts for repair.  This made things feel much worse than they really were, but all of it was the correct set of actions to have taken.

From the triage and recovery perspective, the process had gone flawlessly even though the outage ended up taking many days.  Once the disaster had happened and had happened to the incredible extent that it did, the recovery actually went incredibly smoothly.  Nothing is absolutely perfect, but it went extremely well.  The machine worked as intended.

The far more surprising part was looking at the disaster impact.  There are two ways to look at this.  One is the wiser one, the “no hindsight” approach.  Here we look at the disaster, the impact cost of the disaster and the mitigation cost, apply the likelihood that the disaster would have happened and determine whether the right planning decision had been made.  This is hard to calculate because the risk factor is always a fudged number, but you can normally get accurate enough to know how good your planning was.  The second way is the 20/20 hindsight approach: what if we had known that this disaster was going to happen, what would we have done to prevent it?  It is obviously completely unfair to remove the risk factor and see what the disaster cost in raw numbers, because we cannot know what is going to go wrong and plan only for that one possibility, or spend unlimited money on something that we don’t actually know will happen.  Companies often make the mistake of using the latter calculation and blaming planners for not having perfect foresight.
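
To make the two calculations concrete, here is a minimal sketch of the arithmetic.  The probability and cost figures are illustrative assumptions only; the real numbers from this incident stay internal.

```python
# Minimal sketch of the "no hindsight" risk calculation.
# All numbers are illustrative assumptions, not the real figures from this incident.

annual_failure_probability = 0.05    # assumed chance of a failure this severe in a given year
outage_cost = 12_000.0               # assumed total cost of the multi-day outage (labour, lost work)
mitigation_cost_per_year = 8_000.0   # assumed yearly cost of a redundant server, hosting and upkeep

# Expected yearly loss if we simply accept the risk.
expected_loss = annual_failure_probability * outage_cost   # 600

if expected_loss < mitigation_cost_per_year:
    print(f"Accept the risk: expected loss {expected_loss:,.0f} < mitigation {mitigation_cost_per_year:,.0f}")
else:
    print(f"Mitigate: expected loss {expected_loss:,.0f} >= mitigation {mitigation_cost_per_year:,.0f}")

# The 20/20 hindsight version drops the probability entirely and compares raw costs;
# in the incident described here even that comparison favoured accepting the risk.
```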

In this case, we were decently confident that we had taken the right gamble from the start.  The system had been in place for most of a decade with zero downtime.  The overall system cost had been low, the triage cost had been moderate and the event had been extremely unlikely.  That we had done good planning, once the risk factor was considered, was not generally surprising to anyone.

What was surprising is that when we ran the calculations without the risk factor, even had we known that the system would fail and that an extended outage would take place, we still would have made the same decision!  This was downright shocking.  The cost of the extended outage was actually less than the cost of the equipment, hosting and labour needed to have built a functional risk mitigation system – in this case, a fully redundant server sitting in the datacenter alongside the one that was in production.  In fact, by accepting this extended outage we had saved close to ten thousand dollars!

This turned out to be an extreme case where the outage was devastatingly bad, hard to predict and unable to be repaired quickly, yet still resulted in massive long term cost savings, but the lesson is an important one.  There is so much emotional baggage that comes with any disaster that, if we do not do proper post mortem analysis and work to remove emotional responses from our decision making, we will often leap to large scale financial losses or place blame incorrectly even when things have gone well.  Many companies would have looked at this disaster and reacted by overspending dramatically to prevent the same unlikely event from recurring in the future, even with the math in front of them telling them that doing so would waste money even if that event did recur!

There were other lessons to be learned from this outage.  We learned where communications had not been ideal, where the right people were not always in the right decision making spots, where customer communications were not what they should have been, where the customer had not informed us of changes properly and more.  But, by and large, the lessons were that we had planned correctly and that our triage operation had worked correctly.  We had saved the customer several thousand dollars over what would have appeared to be the “conservative” approach, and by doing a good post mortem we managed to keep them, and us, from overreacting and turning a good decision into a bad one going forward.  Without a post mortem we might very well have changed our good processes thinking that they had been bad ones.

The takeaway lessons here that I want to convey to you, the reader, are that post mortems are a critical step in any disaster, that traditional conservative thinking is often very risky and that emotional reactions to risk often cause financial disasters larger than the technical ones that they seek to protect against.