Decision Point: VDI and Terminal Services

Two basic concepts vie for prominence, if technologies care about prominence, when it comes to remote graphical desktop interfaces: VDI (virtual desktop infrastructure) and terminal services (TS).  The idea behind both is simple: put the resources and processing on a server and have end users access the graphical interface remotely over a network.  What fundamentally separates the two is how that remote server is shared.  With terminal services, many users receive their desktops from a single shared operating system image, a one-to-many arrangement.  With VDI, each user gets a dedicated (typically virtualized) operating system instance all of their own, with no sharing of individual operating system resources.

There is a certain amount of assumption, partially stemming from the naming, that VDI implies a desktop operating system rather than a server one, but this is not a requirement of the technology.  In fact, outside of the Windows world there is truly no separation between desktop and server operating systems, so a distinction at the technology level would not make sense.  It is important to remember, however, that Microsoft defines VDI licensing around different OS license options and most VDI is deployed for Windows operating systems.  So while the technology itself makes no distinction, on the Microsoft licensing side the distinctions are heavy and, in practice, important to keep in mind.

Of the two, VDI is the newer concept.  Terminal services have been around for decades, are well known and are anything but exciting or flashy today.  They predate Windows, are common to nearly every operating system family and are so routine in the UNIX world that they are often used without note.  Terminal services are the GUI continuation of the old “green screen” terminals used since the earliest days of computing.  Back then the terminals were often serial-connected VT100s; today we use TCP/IP networking and protocols capable of carrying graphics, but the concept remains the same: many users on a single server.

With VDI we accomplish the same goals but give each user all of their own resources.  Their OS is completely their own, not shared with anyone.  This means all of the overhead of memory management, CPU scheduling, process tables, copies of libraries and so forth is duplicated for every individual user, and that is a lot of overhead.  Consider all of the resources that an idle graphical desktop requires just to boot up and wait for the user: it can be quite a bit.  Newer Windows operating systems have been getting leaner and more efficient, probably to make them more viable on VDI infrastructures, but the overhead remains a significant factor.  VDI was not really practical until virtualization made it so; in any practical sense it is a new use of technology and is often misunderstood.
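To get a feel for the scale of that duplication, here is a back-of-the-envelope sketch.  Every figure in it (the user count, per-guest idle memory, per-session memory) is a hypothetical assumption chosen only to illustrate the shape of the comparison, not a measurement of any real platform.

```python
# Rough comparison of idle memory overhead for VDI versus terminal services.
# Every figure here is an illustrative assumption, not a benchmark.

users = 100
vdi_idle_os_gb = 2.0       # assumed RAM consumed by each idle desktop OS guest
ts_session_gb = 0.3        # assumed RAM an idle user session adds on a shared TS host
ts_base_os_gb = 4.0        # assumed RAM for the single shared terminal server OS

vdi_total_gb = users * vdi_idle_os_gb
ts_total_gb = ts_base_os_gb + users * ts_session_gb

print(f"VDI idle overhead for {users} users: {vdi_total_gb:.0f} GB")
print(f"TS idle overhead for {users} users:  {ts_total_gb:.0f} GB")
```

Even before deduplication techniques are considered, the per-guest duplication dominates the shared-session model in this kind of rough accounting.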

What we face now, when deciding on a remote computational infrastructure, is choosing between these two architectural ideas.  Of course, it should be noted that the two can co-exist very easily and it is often appropriate for them to do so.  In smaller shops, in fact, it would be very easy for them to co-exist on the same physical platform.  There are many factors to consider and the decision process can actually be rather complicated.

One of the biggest factors that we must consider is software compatibility.  This is the largest driver of the move to VDI rather than terminal services.  In the Windows world it is not uncommon for applications to require things such as a desktop operating system signature (refusing to run on server OS variants), single-user environments, users with administrator-level privileges, users running under specific accounts, or library requirements that conflict with other packages.  Because of these issues, many companies look to VDI to mimic the way individual desktops work, where such problems were easily overlooked because each user ran in a discrete environment.  VDI brings that same behavior to the remote access world, allowing problem-child applications to be catered to as needed, and the isolation of each OS adds a layer of protection.

This driving factor essentially does not exist outside of the Windows world and is primarily why VDI has never taken hold in any other environment.  While easily achievable with Linux or FreeBSD, for example, VDI has little purpose or value in those cases.

A major concern with VDI is the extreme overhead necessary to manage many redundant operating systems, each with its own duplicated processes, storage and memory.  In the early days this made VDI incredibly inefficient.  More recently, however, advanced VDI systems, primarily centered around virtualization platforms and storage, have addressed many of these issues by deduplicating memory and storage, using common master images and other techniques.  In fact, contrary to most assumptions, VDI can even outperform traditional terminal services for Windows because the hypervisor platform can handle memory management and task switching more efficiently than Windows itself.  (A similar phenomenon was first observed in the early 2000s, when Windows would sometimes run faster virtualized on top of Linux because memory management could be partially handed off to the more efficient Linux system underneath.)  This is definitely not always the case, but the improvements in VDI handling have come so far that the two are often quite close.  Again, however, this is a factor making VDI more attractive in the Windows world; it matters far less in the non-Windows world, where native OS task management is typically more efficient and VDI would remain unnecessary overhead.

Another area where VDI has consistently proven more capable than terminal services is graphically rich environments such as CAD and video editing.  The same workloads that still lean heavily towards dedicated hardware tend to move to VDI rather than terminal services because of the heavy investment in GPU capabilities within VDI solutions.  This is not universal, but for situations where heavy graphical rendering needs to take place it is worth investigating whether VDI will perform significantly better.

Because of how VDI is managed, it is often reserved for very large deployments where the scale, in the number of end users included in the solution, can be used to offset some of the cost of implementation.  Terminal services, however, due to their more scalable cost, can often be implemented for smaller environments or subsets of users more cost effectively.  Neither is common for a very small environment of only a few users, although in the strange case of manually managed VDI, where VDI is treated more like a set of individual servers than as a unified VDI environment, VDI can actually be more effective than terminal services for an exceptionally tiny number of users, perhaps fewer than ten.
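The scale effect can be sketched as a simple amortization exercise.  The fixed and per-user costs below are hypothetical placeholders, not quotes for any real product; the point is only that a large fixed implementation cost favors large user counts.

```python
# Hypothetical per-user cost model: TS with a lower fixed cost versus VDI with a
# higher fixed cost. All dollar figures are placeholder assumptions.

def per_user_cost(fixed_cost: float, cost_per_user: float, users: int) -> float:
    """Amortize the fixed implementation cost over the users, then add the per-user cost."""
    return fixed_cost / users + cost_per_user

for users in (5, 25, 100, 500):
    ts = per_user_cost(fixed_cost=10_000, cost_per_user=120, users=users)
    vdi = per_user_cost(fixed_cost=60_000, cost_per_user=80, users=users)
    print(f"{users:>4} users: TS ${ts:,.0f}/user   VDI ${vdi:,.0f}/user")
```

Under assumptions like these, terminal services win easily at small scale while VDI only begins to compete once the user count grows large enough to absorb its implementation cost.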

With only the rarest of exceptions, primarily due to the licensing overhead created by the Windows desktop ecosystem in a virtualized setting, the de facto starting position for remote access to end user systems is terminal server technology, turning to the more complicated and more costly VDI solutions only when terminal services prove unable to meet the technical requirements of the scenario.  For all intents and purposes, VDI is a fall-back, brute force method to make end user virtualization work where the preferred approach has come up short.
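That “terminal services first” posture can be summarized as a tiny decision sketch.  The function name, inputs and rules below are hypothetical simplifications of the factors discussed above, not a complete evaluation, which would also weigh cost, scale and licensing.

```python
# A sketch of the "terminal services first" decision heuristic described above.
# The inputs and their names are hypothetical simplifications.

def choose_remote_desktop_platform(apps_require_dedicated_os: bool,
                                   needs_heavy_gpu_rendering: bool) -> str:
    """Default to terminal services; fall back to VDI only when TS cannot meet requirements."""
    if apps_require_dedicated_os:
        return "VDI"  # problem-child apps needing isolation, admin rights or a desktop OS
    if needs_heavy_gpu_rendering:
        return "VDI"  # GPU-backed VDI tends to handle CAD and video workloads better
    return "Terminal Services"

print(choose_remote_desktop_platform(False, False))  # Terminal Services
print(choose_remote_desktop_platform(True, False))   # VDI
```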

A Public Post Mortem of An Outage

Many things in life have a commonly accepted “conservative” approach and a commonly accepted “risky” approach that should be avoided, at least according to popular sentiment.  In investing, for example, we often see buying government or municipal bonds as low risk and investing in equities (corporate stocks) as high risk, but the statistical numbers tell us that this is backwards: nearly everyone loses money on bonds and makes money on stocks.  Common “wisdom”, when put to the test, turns out to be based purely on emotions which, in turn, are based on misconceptions, and the riskiest thing in investing is using emotion to drive investing strategies.

Similarly, with business risk assessments, the common approach is to feel an emotional response to danger; this triggers a panic response and creates a strong tendency to overcompensate for perceived risk.  We commonly see small companies, whose IT infrastructure generates very little revenue or is not very key to short term operations, spending large sums of money to protect against a risk that is only partially perceived and very poorly articulated.  This often becomes so dramatic that the mitigation process is handled emotionally instead of intellectually, and we regularly find companies implementing bad system designs that actually increase risk rather than decreasing it, while spending very large sums of money and then, since the risk was mostly imaginary, calling the project a success based on layer after layer of misconceptions: imaginary risk, imaginary risk mitigation and imaginary success.

In the recent past I got to be involved in an all-out disaster for a small business.  The disaster hit what was nearly a “worst case scenario.”  Not quite, but very close.  The emotional response at the time was strong, and once the disaster was fully under way it was common for nearly everyone to state and repeat that the disaster planning had been faulty and that the issue should have been avoided.  This is very common in any disaster situation; humans feel that there should always be someone to blame and that zero-risk scenarios should exist if we do our jobs correctly, but this is completely incorrect.

Thankfully we performed a full post mortem, as one should do after any true disaster, to determine what had gone wrong, what had gone right, how we could fix processes and decisions that had failed and how we could maintain ones that had protected us.  Typically, when some big systems event happens, I do not get to talk about it publicly.  But once in a while, I do.  It is so common to react to a disaster, any disaster, and think “oh, if we had only…”  But you have to examine the disaster.  There is so much to be learned about processes and ourselves.

First, some back story.  A critical server, running in an enterprise datacenter, held several key workloads that were very important to several companies.  It was a little over four years old and had been running in isolation for years.  Older servers are always a bit worrisome as they approach end of life.  Four years is hardly end of life for an enterprise class server, but it was certainly not young, either.

This was a single server without any failover mechanism.  Backups were handled externally to an enterprise backup appliance in the same datacenter.  It was a very simple system design.

I won’t include all internal details as any situation like this has many complexities in planning and in operation.  Those are best left to an internal post mortem process.

When the server failed, it failed spectacularly.  The failure was so complete that we were unable to diagnose it remotely, even with the assistance of the on-site techs at the datacenter.  Even the server vendor was unable to diagnose the issue.  This left us in a difficult position: how do you deal with a dead server when the hardware cannot reliably be fixed?  We could replace drives, we could replace power supplies, we could replace the motherboard.  Who knew what might be the fix.

In the end the decision was made that the server, as well as the backup system, had to be relocated back to the main office where they could be triaged in person and with maximum resources.  Ultimately the system was able to be repaired and no data was lost.  The decision to refrain from going to backup was made because data recovery was more important than system availability.

When all was said and done, the disaster was one of the most complete that could be imagined without experiencing actual data loss.  The outage went on for many days and consumed a lot of spare equipment, man hours and attempted fixes.  The process was exhausting, but when completed the system was restored successfully.

The long outage and the sense of chaos as things were diagnosed and repair attempts were made led to an overall feeling of failure.  People started saying it, and that leads to people believing it.  Under emergency response conditions it is very easy to become excessively emotional, especially when there is very little sleep to be had.

But when we stepped back and looked at the final outcome, what we found surprised nearly everyone: the triage operation and the initial risk planning had both been successful.

The mayhem that happens during a triage often makes things feel much worse than they really are, but our triage handling had been superb.  Triage doesn’t mean magic; there is a discovery phase and a reaction phase.  When we analyzed the order of events and laid them out in a timeline, we found that we had acted so well that there was almost no place where we could have shortened the time frame.  We had done good diagnostics, engaged the right parties at the right time and gotten parts into logistical motion as soon as possible, and most of what appeared to have been frenetic, wasted time was actually “filler time” in which we were attempting to determine whether additional options existed or mistakes had been made while we waited on the needed parts for the repair.  This made things feel much worse than they really were, but all of it was the correct set of actions to have taken.

From the triage and recovery perspective, the process had gone flawlessly even though the outage ended up taking many days.  Once the disaster had happened, and to the extent that it did, the recovery went incredibly smoothly.  Nothing is absolutely perfect, but it went extremely well.  The machine worked as intended.

The far more surprising part was looking at the disaster impact.  There are two ways to look at this.  One is the wiser one, the “no hindsight” approach.  Here we look at the disaster, the impact cost of the disaster and the mitigation cost, apply the likelihood that the disaster would have happened, and determine whether the right planning decision had been made.  This is hard to calculate because the risk factor is always a fudged number, but you can normally get accurate enough to know how good your planning was.  The second way is the 20/20 hindsight approach: if we had known that this disaster was going to happen, what would we have done to prevent it?  It is obviously completely unfair to remove the risk factor and see what the disaster cost in raw numbers, because we cannot know what is going to go wrong and plan for only that one possibility, or spend unlimited money on something that may never happen.  Companies often make the mistake of using the latter calculation and blaming planners for not having perfect foresight.

In this case, we were decently confident that we had taken the right gamble from the start.  The system had been in place for most of a decade with zero downtime.  The overall system cost had been low, the triage cost had been moderate and the event had been extremely unlikely.  That we had done good planning, once the risk factor was considered, was not generally surprising to anyone.

What was surprising was that when we ran the calculations without the risk factor, even had we known that the system would fail and that an extended outage would take place, we still would have made the same decision!  This was downright shocking.  The cost of the extended outage was actually less than the cost of the equipment, hosting and labour needed to build a functional risk mitigation system, which in this case would have meant keeping a fully redundant server in the datacenter alongside the one in production.  In fact, by accepting this extended outage we had saved close to ten thousand dollars!
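The shape of both calculations is simple enough to sketch.  The dollar figures and failure probability below are hypothetical placeholders, not the actual internal numbers, but they show how the “no hindsight” and “20/20 hindsight” views are computed.

```python
# Sketch of the two evaluation methods described above. Every number is a
# hypothetical placeholder, not a figure from the actual incident.

outage_cost = 15_000               # assumed business impact of the multi-day outage
triage_cost = 5_000                # assumed labour, shipping and parts for the recovery
mitigation_cost = 30_000           # assumed cost of a redundant server, hosting and labour
annual_failure_probability = 0.05  # assumed chance of this class of failure per year
years_in_service = 8

# "No hindsight" view: expected loss over the system's life versus mitigation cost.
expected_loss = annual_failure_probability * years_in_service * (outage_cost + triage_cost)
print(f"Expected loss without mitigation: ${expected_loss:,.0f}")
print(f"Cost of mitigation:               ${mitigation_cost:,.0f}")

# "20/20 hindsight" view: even knowing the failure would occur, compare raw costs.
actual_loss = outage_cost + triage_cost
print(f"Actual loss incurred:             ${actual_loss:,.0f}")
print(f"Savings from accepting the risk:  ${mitigation_cost - actual_loss:,.0f}")
```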

This turned out to be an extreme case where the outage was devastatingly bad, hard to predict and unable to be repaired quickly, and yet still resulted in massive long term cost savings, but the lesson is an important one.  There is so much emotional baggage that comes with any disaster that if we do not do a proper post mortem analysis and work to remove emotional responses from our decision making, we will often leap to large scale financial loss or place blame incorrectly even when things have gone well.  Many companies would have looked at this disaster and reacted by overspending dramatically to prevent the same unlikely event from recurring, even when they had the math in front of them telling them that doing so would waste money even if that event did recur!

There were other lessons to be learned from this outage.  We learned where communications had not been ideal, where the right people were not always in the right decision making spots, where customer communications were not what they should have been, where the customer had not informed us of changes properly, and more.  But, by and large, the lessons were that we had planned correctly, that our triage operation had worked correctly and that we had saved the customer several thousand dollars over what would have appeared to be the “conservative” approach; by doing a good post mortem we managed to keep them, and us, from overreacting and turning a good decision into a bad one going forward.  Without a post mortem we might very likely have changed our good processes, thinking that they had been bad ones.

The takeaway lessons that I want to convey to you, the reader, are that post mortems are a critical step after any disaster, that traditional “conservative” thinking is often very risky and that emotional reactions to risk often cause financial disasters larger than the technical ones they seek to protect against.

 

The Physical Interaction Considerations of VDI

VDI (Virtual Desktop Infrastructure) is different from traditional server virtualization because, unlike servers, which provide their services exclusively over a network, desktops are a point of physical interaction with end users.  There is no escaping the need for physical equipment that the end users will actually touch.  Keyboards, mice, touchscreens, monitors, speakers… these things cannot be virtualized.

Because of this, VDI involves much more complicated decision making and planning than virtualizing servers does, and its physical requirements can be met by a wide variety of solutions.

Traditionally we have met the physical interaction needs of VDI and terminal services through the use of thin clients.  Thin clients sit on the network and utilize the same protocols and techniques we would use for normal remote graphical access, such as NX, ICA, RDP and VNC.  A thin client runs a full operating system, but one that is very lean and has a singular purpose: managing connections to other machines.  The idea of the thin client is to keep all processing remote and have only the components necessary to handle networking and local interaction on the local hardware.  Thin clients are relatively low cost, consume little power, are easy to maintain, are reliable and have very long lifespans.  But they do not cost so little as to be of no concern: prices are typically half to three quarters the cost of a traditional desktop, and while thin clients tend to last up to twice as long in the field, this is neither a trivial initial acquisition cost nor a trivial long term investment.
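Using only those rough ratios, an annualized comparison looks something like the sketch below.  The purchase price and lifespans are hypothetical assumptions for illustration; real quotes vary widely.

```python
# Annualized endpoint cost using the rough ratios mentioned above. The desktop
# price and lifespans are illustrative assumptions only.

desktop_price = 800.0
desktop_lifespan_years = 4
thin_client_price = desktop_price * 0.625                 # "half to three quarters" of a desktop
thin_client_lifespan_years = desktop_lifespan_years * 2   # "up to twice as long" in the field

desktop_per_year = desktop_price / desktop_lifespan_years
thin_client_per_year = thin_client_price / thin_client_lifespan_years

print(f"Desktop:     ${desktop_per_year:,.2f} per year")
print(f"Thin client: ${thin_client_per_year:,.2f} per year")
```

Even under favorable assumptions like these, the thin client is cheaper per year but far from free, which is exactly the point made above.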

Because of the remaining high cost of traditional thin clients, a more modern replacement, the zero client, has arisen to address it.  “Zero client” is not a strict term; it is really just a class of thin client that removes traditional CPU-based processing in favor of dedicated, very low cost remote graphical processing, essentially nothing more than a display adapter attached to a network.  Doing so reduces power needs, management needs and manufacturing costs, allowing for a much lower cost endpoint device.  Zero clients offer fewer potential features than thin clients, which can often run their own apps, such as a web browser, locally; since a zero client has no local processing this is often a good thing rather than a bad one.  Supporting zero clients is also a newer breed of remote graphical protocols often associated with them, such as PCoIP.

Of course, going in the other direction, we can use full fat clients (e.g. traditional desktops and laptops) as our access devices.  This generally only makes sense if the desktops are remnants of a previous infrastructure being repurposed as remote graphical access points, or if the infrastructure is a hybrid and users use the desktops for some purposes and the VDI or terminal services for others.  In some cases where thin clients are desired and fat clients are available at low cost, such as older off-lease units, fat clients can still make financial sense, but those use cases are limited.  It is extremely common to use existing fat clients during a transition phase and then migrate to thin or zero clients once a desktop refresh point has been reached, or on a machine by machine basis as the machines require maintenance.

Today other options exist, such as using phones, tablets and other mobile devices as remote access points, but these are generally special cases rather than the norm due to the lack of good input devices.  Use cases do exist, though, and you will see this from time to time.  As devices such as Android-based desktops become more common on the market we may find this becoming more standard, and we may even see some rather unexpected situations, such as advanced desktop phones running Android being used as a phone and a thin client device at once.  The more likely situation is that convertible cell phones that can double as lightweight desktop devices when docked will become popular thin client choices.

The last hardware consideration is that of BYOD, or “Bring Your Own Device.”  When moving to VDI and/or terminal services infrastructures, the ability to leverage employee-owned devices becomes very attractive.  There are legal and logistical complications when employees supply their own access devices, but there are huge benefits as well, such as happier employees, lower costs and more flexibility.  The use of remote graphical displays, rather than exposing data directly, vastly reduces security risk and changes how we can approach accessing and exposing internal systems.

When looking at VDI it is easy to become caught up in the move of processing resources from local devices to the server and to overlook that hardware costs remain, and generally remain quite significant, on a per user “on the desktop” level.  Pricing out VDI is not as simple as weighing the cost of a VDI server against the cost of the desktops it replaces.  A cost reduction per desktop must be determined; it can easily be significant, but it can just as easily be fairly trivial.  The cost of desktops or desktop replacement hardware will continue to be a large part of the per user IT budget even with VDI solutions.
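A sketch of that per-user math, with purely hypothetical figures, shows why the endpoint remains a large share of the budget even after the processing moves to the server.

```python
# Hypothetical per-user VDI cost breakdown showing that endpoint hardware remains
# a significant share of the total. All figures are placeholder assumptions.

users = 100
backend_cost = 50_000          # assumed shared servers, storage and hypervisor costs
per_user_licensing = 150       # assumed per-user OS and VDI licensing
endpoint_per_user = 400        # assumed thin or zero client on each desk

backend_per_user = backend_cost / users
total_per_user = backend_per_user + per_user_licensing + endpoint_per_user

print(f"Back-end per user: ${backend_per_user:,.0f}")
print(f"Endpoint per user: ${endpoint_per_user:,.0f}")
print(f"Total per user:    ${total_per_user:,.0f}")
print(f"Endpoint share:    {endpoint_per_user / total_per_user:.0%}")
```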

Business: The Context of IT

I would estimate that the vast majority of people working in the IT field come to it out of an interest in or even a passion for computers. Working in IT lets them play with many big, fast, powerful computers, networks, storage devices and more.  It’s fun.  It’s exciting.  We tend to love gadgets and technical toys.  We love overseeing the roaring server room or datacenter.  This is, almost universally, true of IT people everywhere in the industry.

Because of this somewhat unnatural means by which people are introduced to IT as a career we are left with some issues that are not exactly unique to IT but that are, at the very least, relatively extreme in it.  Primarily the issue that we face, as an industry and especially within the SMB portion of the industry, is a lack of business context within our view of IT.

IT exists only within a business context; this is crucial for understanding all aspects of IT.  Without a business to support, IT would not be IT at all, it would just be “playing with computers.”  Other departments that are directly tied to business support, such as finance, accounting, human resources and legal, have far more typical business involvement and less “inwardly focused interest,” so they tend not to lose focus on their role of supporting the business in everything that they do.  But IT is often so far removed from the business itself, at least mentally, that it is easy to begin to think that IT exists for its own sake.  It does not.

More so than nearly any other department, IT is, and must be, an integral part of the business.  IT has some of the deepest and broadest insight into the business and is invaluable as a partner to management in this respect.  Everything that happens in IT must be considered within the context of, and with regard to the needs of, the business.

Of course there are roles within IT, as within any department, that can function essentially without understanding the context of the business being supported.  Job roles that are highly structured and rely on procedure rather than decision making can often get away without even knowing what the business does, let alone considering its needs.  But once any IT role becomes an advisory or decision making one, the business is the core focus.  In reality, the business is the only focus.  IT is an enabler of the business; if it is not enabling the business, what is it doing?  Because of this we must remain ever cognizant of the business reasoning behind all decision making and planning.

This cannot be overstated: The primary role of IT is a business one, not a technical one.

IT needs to think about the business at every turn.  Every decision should be made with a keen sense of how it impacts the business in efficiency, cost effectiveness and so on.  It is so easy, especially when working with IT staff from other companies, to lose this perspective and begin to think that there are stock answers, that there are accepted “it should be done this way” approaches, or that IT should dictate what is best for the business from an IT perspective.

These concepts become especially poignant when we talk about areas of risk.  The common IT perspective is to think of risk as something that must be overcome, but the business perspective is to balance risk against the cost of mitigation.  If left to run on their own without oversight, most IT departments would see the business as so critical that any amount of money should be spent on a “better” IT infrastructure to make sure that downtime could never happen.  But this is completely wrong.  “Better” should never be equated with uptime; it should mean “what best serves the goals of the business.”  Perhaps that is uptime, perhaps it is a lowering of capital expenses: it depends on the unique business scenario.  Often what is best for the business is not what is perceived as being best for IT.

Concepts such as “the business cannot go down” or “cost is no object” have no place in a business, and therefore cannot have a place in IT.  Every business has a threshold beyond which it is more cost effective to accept downtime than to prevent it.  No IT project treats cost as no object; in a business, cost is always an object.

What IT needs to do is learn to think differently.  The needs of the business should be at the forefront of IT’s ideas of what is good and what is applicable.  The notion that there is a “proper or best level of protection” for a system should never even occur to IT decision makers.  Instead, IT should immediately think about the cost of downtime and the cost of risk mitigation, and make decisions based on value to the business.

Thinking about “business first” or really “business only” can be a struggle for IT staff that come to IT from a technology perspective instead of from a business one, but it is a critical skill and will fundamentally change the approach and effectiveness of an IT department.

Businesses need to look for IT staff, in decision making and guidance roles, who have a firm understanding of and interest in business and who can consistently keep their IT work within that perspective.
