Virtualize Domain Controllers

One would think that the idea of virtualizing Active Directory Domain Controllers would not be a topic needing discussion, and yet I find that the question arises regularly as to whether or not AD DCs should be virtualized.  In theory, there is no need to ask this question at all: general industry guidance already tells us that every workload that can be virtualized should be, and AD presents no special case that would create an exception to this long-standing rule.

Oddly, however, people regularly seek out clarification on this one particular workload, and if you go looking for bad advice, someone is sure to provide it.  Tons of people post advice recommending physical servers for Active Directory, but rarely, if ever, with any explanation as to why they would recommend violating best practices at all, let alone for such a mundane and well known workload.

As to why people implementing AD DCs decide that it warrants specific investigation around virtualization when no other workload does, I cannot answer.  But after many years of research into this phenomenon I do have some insight into the source of the reckless advice around physical deployments.

The first mistake comes from a general misunderstanding of what virtualization even is.  This is sadly incredibly common and people quite often think that virtualization means consolidation, which of course it does not.  So they take that mistake and then apply the false logic that consolidation means consolidating two AD DCs onto the same physical host.  It also requires the leap to assuming that there will always be two or more AD DCs, but this is also a common belief.  So three large mistakes in logic come together to produce some very bad advice that, if you dig into the recommendations, you can normally trace back to these misunderstandings.  This seems to be the root of the majority of the bad advice.

Other causes are sometimes misunderstandings of actual best practices, such as the phrase “If you have two AD DCs, each needs to be on a separate physical host.”  This statement is telling us that two physically disparate machines need to be used in this scenario, which is absolutely correct.  But it does not imply that either of them should lack a hypervisor, only that two different hosts are needed.  The wording of this kind of advice is often hard to parse if you do not already understand that under no circumstance is a non-virtual workload acceptable.  If you read the recommendation with that understanding, its meaning is clear and, hopefully, obvious.  Sadly, that recommendation often gets repeated out of context so the underlying meaning can easily get lost.

Long ago, as in around a decade ago, some virtualization platforms had issues around timing and system clocks that could play havoc with clustered database systems like Active Directory.  This was a legitimate issue at the time but it was solved long ago, as it had to be for many different workloads.  A perception was created that AD might need special treatment, however, and it seems to linger on even though it has been a generation or two in IT terms since this was an issue and it should have long since been forgotten.

Another myth leading to bad advice is rooted in the fact that AD DCs, like other clustered databases, should not be snapshotted when running in a clustered mode, as restoring only one node of the cluster in that manner will easily create database corruption.  This is, however, a general aspect of storage and databases and is not related to virtualization at all.  The same caution is necessary for physical AD DCs just the same.  That snapshots are associated with virtualization is another myth; virtualization implies no such storage artefact.

Still other myths arise from a belief that virtualization must rely on Active Directory itself in order to function and that AD therefore has to run without virtualization.  This is a complete myth and nonsensical.  There is no such circular requirement.

Sadly, some areas of technology have given rise to large scale myths, often many of them, that surround them and can make it difficult to figure out the truth.  Virtualization is just complex enough that many people attempt to learn only how to use it, by rote, rather than what it is conceptually, giving rise to sometimes wild misconceptions that are so far afield that it can be hard to recognize that that is really what we are seeing.  And in a case like this, misconceptions around virtualization, history, clustered databases, high availability techniques, storage and more add up layer upon layer, making it hard to figure out how so many things can come together around one deployment question.

At the end of the day, few workloads are as ideally suited to virtualization as Active Directory Domain Controllers are.  There is no case where the idea of using a physical bare metal operating system deployment for a DC should be considered – virtualize every time.

When a Backup Is Not A Backup

Conceptually the idea of “backup” has become a murky area within IT.  Everyone seems to have their own concepts of what a backup is and how they expect it to behave.  This can be dangerous when the person supplying backup and the person consuming backup have a mismatch in expectations.  I see this happen every day even with traditional backup mechanisms.  With new types of backups appearing on a regular basis the opportunities for miscommunications and loss of data become much more pronounced.

By traditional backups I refer to the long-established world of tape-based backups with a grandfather-father-son rotational strategy in place, just to set the stage for the discussion.  New backups might include system images, disk-based backups, continuous backups and backups to “the cloud” or online backups.  The world of backups is evolving rapidly and now is when misunderstandings begin to put corporate data resources at risk.

So what exactly is a “backup”?  The concept sounds simple, but what do we really mean when we use the term?  Do we mean the ability to restore a system after it has failed?  The ability to roll back to an earlier version of a file?  Perhaps archiving of data when the original no longer exists?  How long do which files get kept?  Does this apply only to file data or are emails and databases included too?  Do we only need to restore in case of system failure or do we need the ability to restore granular data as well?  Do we need only one copy or do we need copies of every version of a file?

Now, with the additional risks posed by things like ransomware, we have even more concerns than ever before, and ideas around not just versioning but potentially unlimited versioning and air gapping between systems and backups have become a concern where before they generally were not.

Many organizations, especially smaller ones, often choose to approach backups a bit differently from enterprises and often eschew true backups completely.  Instead they “take backups” but then delete the original files.  And instead of keeping many copies of the files that have been “backed up” they opt to keep only a single copy (or multiple versions that are co-dependent on each other).  This means that what they have is not really a backup, but rather an archive.  If the one disk or tape on which the file is stored becomes damaged, the file is lost completely.

The term backup implies that there are at least two copies of some piece of data that do not rely on each other.  An archive does not imply this; it only implies that we have moved data from production to another system, presumably one that is lower cost and likely much lower in performance and harder to retrieve from.  Archived data implies no redundancy, unlike the term backup.

If we “take a backup” and then proceed to delete the original data we no longer have a backup; the file that is stored in the “backup system”, whether this is on disk, a tape in a vault or whatever, turns into an archive of the original data rather than a backup of it.  It is now our source file, rather than being a copy.  This is some of the magic of digital media: copies are clones rather than mimics, so the archival copy is legitimately the original in every sense.

This may seem pedantic but it truly is not.  If a business is paying for backups, it likely assumes that that cost is going towards having some redundancy, not just a single copy of data.  And if there are regulations requiring that backups be kept for compliance reasons, having only an archival copy is a clear violation of that requirement.  Having two systems fail and being unable to retrieve data is an edge case that any compliance regime must accept.  But having an archival system fail where a backup was required but was not kept is not an acceptable scenario.

For this reason, and many more, concepts like the 3-2-1 backup methodology make sense because this approach guarantees that backups are kept within the backup system and originals do not need to be kept on production.  In some ways of thinking, this approach could be thought of as merging archiving and backups into a single system which adds much clarity to the design.
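To make the 3-2-1 idea concrete, here is a minimal sketch in Python that checks a hand-maintained inventory of copies against the rule of three copies, two media types and one offsite copy.  The Copy record, the check_321 helper and the example locations are all illustrative assumptions, not part of any real backup product.

```python
# Minimal sketch of a 3-2-1 check against a hand-maintained inventory of
# copies. The Copy record and check_321 helper are illustrative only and
# not part of any real backup product or API.
from dataclasses import dataclass
from typing import List

@dataclass
class Copy:
    location: str   # e.g. "production", "onsite-nas", "offsite-vault"
    media: str      # e.g. "disk", "tape", "object-storage"
    offsite: bool   # stored outside the production failure domain?

def check_321(copies: List[Copy]) -> bool:
    """True if there are at least 3 copies, on at least 2 media types,
    with at least 1 copy held offsite."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite

inventory = [
    Copy("production", "disk", False),
    Copy("onsite-nas", "disk", False),
    Copy("offsite-vault", "tape", True),
]
print(check_321(inventory))  # True: 3 copies, 2 media types, 1 offsite
```

The point of the check is simply that the production copy is only one of several independent copies; remove the offsite tape from the inventory above and the rule is no longer satisfied.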

Whatever backup system works for you, be cognizant that backups mean independent copies and that, in many ways, independent copies that do not share failure domains have become nearly a requirement for all backups today.

Hiring IT: Speed Matters

After decades of IT hiring, something that I have learned is that companies serious about hiring top talent always make hiring decisions very quickly.  They may spend months or even years looking for someone that is a right fit for the organization, but once they have found them they take action immediately.

This happens for many reasons.  But mostly it comes down to wanting to secure resources once they have been identified.  Finding good people is an expensive and time consuming process.  Once you have found someone that is considered to be the right fit for the need and the organization, there is a strong necessity to reduce risk by securing them as quickly as possible.  A delay in making an offer presents an opportunity for that resource to receive another offer or decide to go in a different direction.  Months of seeking a good candidate, only to lose them because of a delay of a few hours or days in making an offer is a ridiculous way to lose money.

Delays in hiring suggest either that the situation has not yet been decided upon or that the process has not been given priority and that other decisions or actions inside of the company are seen as more important than the decisions around staffing.  And, of course, it may be true that other things are more important.

Other factors being more important are exactly the kinds of things that potential candidates worry about.   Legitimate priorities might include huge disasters in the company, things that are not a good sign in general.  Or worse, maybe the company just doesn’t see acquiring the best talent as being important and delays are caused by vacations, parties, normal work or not even being sure that they want to hire anyone at all.

It is extremely common for companies to go through hiring rounds just to “see what is out there.”  This doesn’t necessarily mean that they will not consider hiring someone if the right person does come along, but it easily means that the hiring is not fully approved or funded and might not even be possible.  Candidates go through this regularly; a great interview might result in no further action, so they know better than to sit around waiting on positions, even ones that seem very likely.  The risks are too high and, if a different, good opportunity comes along, they will normally move ahead with that.  Few things signal that a job offer is not forthcoming, or that a job is not an ideal one, more clearly than delays in the hiring process.

Candidates, especially senior ones, know that good jobs hire quickly.  So if the offer has not arrived promptly it is often assumed that offer(s) are being made to other candidates or that something else is wrong.  In either situation, candidates know to move on.

If hiring is to be a true priority in an organization, it must be prioritized.  This should go without saying, but good hiring slips through the cracks more often than not.  It is far too often seen as a background activity; one that is approached casually and haphazardly.  It is no wonder that so many organizations waste countless hours of time on unnecessary candidate searches and interviews and untold time attempting to fill positions when, for all intents and purposes, they are turning away their best options all the while.

When to Consider High Availability?

“High Availability isn’t something you buy, it’s something that you do.”  – John Nicholson

Few things are more universally desired in IT than High Availability (HA) solutions.  I mean really, say those words and any IT Pro will instantly say that they want that.  HA for their servers, their apps, their storage and, of course, even their desktops.  If there was a checkbox next to any system that simply said “HA”, why wouldn’t we check it?  We would, of course.  No one voluntarily wants a system that fails a lot.  Failure bad, success good.

First, though, we must define HA.  HA can mean many things.  At a minimum, HA must mean that the availability of the system in question is higher than “normal”.  What is normal?  That alone is hard enough to define.  HA is a loose term, at best.  In the context of its most common usage, though, which is common applications running on normal enterprise hardware, I would offer this starting point for HA discussions:

Normal or Standard Availability (SA) would be defined as the availability of a common mainline server running a common enterprise operating system and a common enterprise application in a best practices environment with enterprise support.  Good examples of this might include Exchange running on Windows Server on an HP ProLiant DL380 (the most common mainline commodity server), or BIND (the DNS server) running on Red Hat Enterprise Linux on a Dell PowerEdge R730.  These are just examples to be used for establishing a rough baseline.  There is no great way to measure this, but with a good support contract and rapid repair or replacement in the real world, a system of this nature is believed to achieve between four and five nines of reliability (99.99% uptime or higher) when human failure is not included.

High Availability (HA) should be commonly defined as having an availability significantly higher than that of Standard Availability.  Significantly higher should mean a minimum of one order of magnitude of increase, so at least five nines of reliability and more likely six nines (99.9999% uptime).

Low Availability (LA) would be commonly defined as having an availability significantly lower than that of Standard Availability with significantly, again, meaning at least one order of magnitude.  So LA would typically be assumed to be around 99% to 99.9% or lower availability.
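For reference, the arithmetic behind these “nines” is straightforward.  The short Python sketch below simply converts an availability percentage into the downtime per year it allows, with the LA/SA/HA labels being the rough groupings described above rather than any formal standard.

```python
# Convert an availability percentage into the downtime per year it allows.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(availability_pct: float) -> float:
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

levels = [
    ("two nines (LA)", 99.0),
    ("three nines (LA)", 99.9),
    ("four nines (SA)", 99.99),
    ("five nines (SA/HA)", 99.999),
    ("six nines (HA)", 99.9999),
]
for label, pct in levels:
    minutes = downtime_minutes_per_year(pct)
    print(f"{label:20s} {pct:>9.4f}%  ->  {minutes:8.1f} minutes/year")
```

Running this shows roughly 5,260 minutes per year at two nines, 53 minutes per year at four nines and about half a minute per year at six nines, which is why each additional nine is an order of magnitude improvement.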

Measurement here is very difficult as human, environmental and other factors play a massive role in determining the uptime of different configurations.  The same gear used in one role might achieve five nines while in another fail to achieve even one.  The quality of the datacenter, skill of the support staff, rapidity of parts replacement, granularity of monitoring and a multitude of other factors will affect the overall reliability significantly.  This is not necessarily a problem for us, however.  In most cases we can evaluate the portions of a system design that we control in such a way that relative reliability can be determined, so that at least we can show that one approach is going to be superior to another and leverage well informed decision making even if accurate failure rate models cannot be easily built.

It is important to note that, other than providing a sample baseline set of examples from which to work, there is nothing in the definitions of high availability or low availability that talks about how these levels should be achieved – that is not what the terms mean.  The terms describe resulting levels of reliability in relation to the baseline and nothing else.  There are many ways to achieve high availability without using commonly assumed approaches and practically unlimited ways to achieve low availability.

Of course HA can be defined at every layer.  We can have HA platforms or OS but have fragile applications on top.  So it is very important to understand at what level we are speaking at any given time.  At the end of the day, a business will only care about the high availability delivery of services regardless of how it is achieved, or where.  The end result is what matters not the “under the hood” details of how it was accomplished or, as always, the ends justify the means.

It is extremely common today for IT departments to become distracted by new and flashy HA tools at the platform layer and forget to look for HA higher and lower in the stack to ensure that we provide highly available services to the business, rather than only looking at the one layer while leaving the business just as vulnerable as ever, or more so.

In the real world, though, HA is not always an option and, when it is, it comes at a cost.  That cost is almost always monetary and generally comes with extra complexity as well.  And as we well know, any complexity also carries additional risk, and that risk could, if we are not careful, cause an attempt to achieve HA to fail outright and might even leave us with Low Availability instead.

Once we understand this necessary language for describing what we mean, we can begin to talk about when high availability, standard availability and even low availability may be right for us.  We use this high level of granularity because it is so difficult to measure system reliability that getting too detailed becomes valueless.

Conceptually, all systems come with risk of downtime; nothing can be up all of the time, that is impossible.  Reliability generally costs money, all other things being equal.  So to determine what level of availability is most appropriate for a workload we must determine the cost of risk mitigation (the amount of money that it takes to change the average amount of downtime) and compare that against the cost of the downtime itself.

This gets tricky and complicated because determining the cost of downtime is difficult enough, and determining the risk of downtime is even more difficult.  In many cases the cost of downtime is not a flat number, though it might be.  This cost could be expressed as $5/minute or $20K/day or similar.  But an even better tool would be to create a “loss impact curve” that shows how money is lost over time (within a reasonable interval).

For example, a company might easily face no loss at all for the first five minutes with slowly increasing, but small, losses until about four hours when work stops because people can no longer go to paper or whatever and then losses go from almost zero to quite large.  Or some companies might take a huge loss the moment that the systems are down but the losses slowly dissipate over time.  Loss might only be impactful at certain times of day.  Maybe outages at night or during lunch are trivial but mid morning or mid afternoon are major.  Every company’s impact, risk and ability to mitigate that risk are different, often dramatically so.
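For those who like to see this expressed concretely, below is a small Python sketch of one possible loss impact curve, loosely following the first shape described above.  Every threshold and dollar figure in it is hypothetical and exists only to illustrate how such a curve might be modeled.

```python
# Hypothetical loss impact curve, loosely following the example above:
# no loss for the first five minutes, a slow trickle of losses up to about
# four hours, then a steep climb once work stops entirely. Every figure
# here is made up purely for illustration.
def loss_at(minutes_down: float) -> float:
    if minutes_down <= 5:
        return 0.0                              # short blips are absorbed
    if minutes_down <= 240:
        return (minutes_down - 5) * 2.0         # slow trickle: ~$2/minute
    return 470.0 + (minutes_down - 240) * 50.0  # work stops: ~$50/minute

for m in (5, 30, 120, 240, 300, 480):
    print(f"{m:4d} minutes down -> ${loss_at(m):,.0f} estimated loss")
```

A company with the opposite profile, where the big hit lands immediately and dissipates over time, would simply have a curve shaped the other way around; the value is in writing the shape down at all.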

Sometimes it comes down to the people working at the company.  Will they all happily take needed bathroom, coffee, snack or even lunch breaks at the time that a system fails so that they can return to work when it is fixed?  Will people go home early and come in early tomorrow to make up for a major outage?  Is there machinery that is going to sit idle?  Will the ability to respond to customers be impacted?  Will life support systems fail?  There are countless potential impacts and countless potential ways of mitigating different types of failures.  All of this has to be considered.  The cost of downtime might be a fraction of corporate revenues on a minute by minute basis or downtime might cause a loss of customers or faith that is more impactful than the minute by minute revenue generated.

Once we have some rough loss numbers to deal with we at least have a starting point.  Even if we only know that revenue is ~$10/minute and losses are expected to be around ~$5/minute we have a starting point of sorts.  If we have a full curve or a study done with some more detailed numbers, all the better.  Now we need to figure out roughly what our baseline is going to be.  A well maintained server, running on premises, with a good support contract and good backup and restore procedures can pretty easily achieve four nines of reliability.  That means that we would experience about five hours of downtime every five years.  This is actually less than the generally expected downtime of SA in most environments and potentially far less than expected levels in excellent environments like high quality datacenters with nearby parts and service.

So, based on our baseline example of about five hours every five years, we can figure out our potential risk.  If we lose about ~$5/minute and we expect roughly 300 minutes of downtime every five years, we are looking at a potential loss of $1,500 every half decade.

That means that, at the most extreme, we could never justify spending even $1,500 to mitigate that risk; that would be financially absurd.  This is so for several reasons.  One of the biggest is that this is only a risk: spending $1,500 to protect against the possibility of losing $1,500 makes little sense, but it is a very common mistake to make when people do not analyze these numbers carefully.

The biggest factor is that no mitigation technique is completely effective.  If we manage to move our four nines system to a five nines system we would remove only 90% of the average downtime, moving us from $1,500 of loss to $150 of loss.  If we spent $1,500 for that reduction, the total “loss” would still be $1,650 (the cost of protection is a form of financial loss).  The cost of the risk mitigation combined with the anticipated remaining impact, taken together, must still be lower than the anticipated cost of the risk without mitigation, or else the mitigation itself is pointless or actively damaging.
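Laid out as a worked comparison, using the same illustrative figures from the text ($5 per minute of loss, roughly 300 minutes of expected downtime per five years, and a $1,500 mitigation removing about 90% of that downtime), the arithmetic looks like this:

```python
# Worked version of the comparison above, using the same illustrative
# figures from the text: $5/minute of loss, roughly 300 minutes of expected
# downtime per five years, and a $1,500 mitigation removing ~90% of it.
loss_per_minute = 5.0        # dollars lost per minute of downtime
expected_downtime = 300.0    # expected minutes down over five years (~four nines)
mitigation_cost = 1500.0     # up-front spend to reach roughly five nines
mitigation_effect = 0.90     # fraction of expected downtime removed

unmitigated_loss = loss_per_minute * expected_downtime          # $1,500
residual_loss = unmitigated_loss * (1 - mitigation_effect)      # $150
total_with_mitigation = mitigation_cost + residual_loss         # $1,650

print(f"Expected loss with no mitigation: ${unmitigated_loss:,.0f}")
print(f"Mitigation cost plus residual:    ${total_with_mitigation:,.0f}")
# The mitigation only pays off if the second number is clearly lower than
# the first; here it is not, so the spend would be actively damaging.
```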

Many may question why the total cost of risk mitigation must be lower and not simply equal, as surely that would mean we are at a “risk break even” point?  This seems true on the surface, but because we are dealing with risk this is not the case.  Risk mitigation is a certain cost: financial damage that we take up front in the hope of reducing losses tomorrow.  But the risk for tomorrow is a guess, hopefully a well educated one, but only a guess.  The cost today is certain.  Taking on certain damage today in the hope of reducing possible damage tomorrow only makes sense when the damage today is small, the expected or possible damage tomorrow is very large and the effectiveness of mitigation is significant.

Included in the idea of a “certain cost up front” to reduce a “possible cost tomorrow” is the time value of money.  Even if an outage were of a known size and time, we would not spend the same money today to mitigate it tomorrow because our money is more valuable today.

In the most dramatic cases, we sometimes see IT departments demanding that tens or hundreds of thousands of dollars be spent up front to avoid maybe losing a few thousand dollars, sometime possibly many years in the future.  A strategy that we can refer to as “shooting ourselves in the face today to avoid maybe getting a headache tomorrow.”

It is included in the concept of evaluating the risk mitigation but it should be mentioned specifically that in the case of IT equipment there are many examples of attempted risk mitigation that may not be as effective as they are believed to be.  For example, having two servers that sit in the same rack will potentially be very effective for mitigating the risk of host hardware failure, but will not mitigate against natural disasters, site loss, fire, most cases of electrical shock, fire suppression activation, network interruptions, most application failure, ransomware attack or other reasonably possible disasters.

It is common for storage devices to be equipped with “dual controllers,” which gives a strong impression of high reliability, but generally these controllers are inside a single chassis with shared components and, even if the components are not shared, often the firmware is shared and communications between components are complex, often leading to failures where the failure of one component triggers the failure of another – making them quite frequently LA devices rather than the SA or HA that people expected when purchasing them.  So it is very critical to consider which risks a mitigation strategy will actually mitigate and whether the mitigation technique is likely to be effective.  No technique is completely effective, there is always a chance for failure, but some strategies and techniques are more broadly effective than others and some are simply misleading or actually counterproductive.  If we are not careful, we may implement costly products or techniques that actively undermine our goals.

Some techniques and products used in the pursuit of high availability are rather expensive, which might include buying redundant hardware, leasing another building, installing expensive generators or licensing special software.  There are low cost techniques and software as well, but in most cases any movement towards high availability will result in a relatively large outlay of investment capital in order to achieve it.  It is absolutely critical to keep in mind that high availability is a process; there is no way to simply buy high availability.  Achieving HA requires good documentation, procedures, planning, support, equipment, engineering and more.  In the systems world, HA is normally approached first from an environmental perspective with failover power generators, redundant HVAC systems, power conditioning, air filtration, fire suppression systems and more to ensure that the environment for the availability is there.  This alone can often make further investment unnecessary as it can deliver incredible results.  Then comes HA system design, ensuring that not just one layer of a technology stack is highly available but that the entire stack is, allowing the critical applications, data or services to remain functional for as much time as possible.  Then comes site to site redundancy to be able to withstand floods, hurricanes, blizzards, etc.  Of course there are completely different techniques as well, such as utilizing cloud computing services hosted remotely on our behalf.  What matters is that high availability requires broad thinking and planning, cannot simply be purchased as a line item and is judged by whether the resulting uptime, or likelihood of uptime, is much higher than a “standard” system design would deliver.

What is often surprising, almost shocking, to many businesses and especially to IT professionals, who rarely undertake financial risk analysis and who are constantly being told that HA is a necessity for any business and that buying the latest HA products is unquestionably how their budgets should be spent, is that when the numbers are crunched and the reality of the costs and effectiveness of risk mitigation strategies is considered, high availability has very little place in any organization, especially those that are small or have highly disparate workloads.  In the small and medium business market it is almost universal to find that the cost and complexity (which in turn brings risk, mostly in the form of a lack of experience around techniques and risk assessment) of high availability approaches is far too costly to ever offset the potential damage of the outage from which the mitigation is hoped to protect.  There are exceptions, of course, and there are many businesses for which high availability solutions are absolutely sensible, but these are the exception and very far from being the norm.

It is also sensible to assess the need for high availability on a workload basis and not department, company or technology wide.  In a small business it is common for all workloads to share a common platform and the need of a single workload for high availability may sweep other, less critical, workloads along with it.  This is perfectly fine and a great way to offset the cost of the risk mitigation of the critical workload through ancillary benefit to the less critical workloads.  In a larger organization where there is a plethora of platform approaches used for differing workloads it is common for only certain workloads, those that are both highly critical (in terms of risk from downtime impact) and whose risk can be practically mitigated (the ability to mitigate risk can vary dramatically between different types of workloads), to have high availability applied to them and for other workloads to be left to standard techniques.

Examples of workloads that may be critical and can be effectively addressed with high availability might be an online ordering system where the latency created by multi-regional replication has little impact on the overall system but losing orders could be very financially impactful should a system fail.  An example of a workload where high availability might be easy to implement but ineffectual would be an internal intranet site serving commonly asked HR questions; it would simply not be cost effective to avoid small amounts of occasional downtime for a system like this.  An example of a system where risk is high but the cost or effectiveness of risk mitigation makes it impractical or even impossible might be a financial “tick” database requiring massive amounts of low latency data to be ingested and the ability to maintain a replica would not only be incredibly costly but could introduce latency that would undermine the ability of the system to perform adequately.  Every business and workload is unique and should be evaluated carefully.

Of course high availability techniques can be actioned in stages; it is not an all or nothing endeavor.  It might be practical to mitigate the risk of system level failure by having application layer fault tolerance to protect against failure of system hardware, virtualization platform or storage.  But for the same workload it might not be valuable to protect against the loss of a single site.  If a workload only services a particular site or is simply not valuable enough for the large investment needed to make it fail over between sites, it could easily fall “in the middle.”  It is very common for workloads to implement only partial high availability solutions, often because an IT department may only be responsible for a portion of them and have no say over things like power support and HVAC, but probably most commonly because some high availability techniques are seen as high visibility and easy to sell to management while others, such as high quality power and air conditioning, often are not, even though they may easily provide a better bang for the buck.  There are good reasons why certain techniques may be chosen and not others, as they affect different risk components and some risks may have a differing impact on an individual business or workload.

High availability requires careful thought as to whether it is worth considering and even more careful thought as to implementation.  Building true HA systems requires a significant amount of effort and expertise and generally substantial cost.  Understanding which components of HA are valuable and which are not requires not just extensive technical expertise but financial and managerial skills as well.  Departments must work together extensively to truly understand how HA will impact an organization and when it will be worth the investment.  It is critical that it be remembered that the need for high availability in an organization or for a workload is anything but a foregone conclusion and it should not be surprising in the least to find that extensive high availability or even casual high availability practices turn out to be economically impractical.

In many ways this is because standard availability has reached such a state that there is continuously less and less risk to mitigate.  Technology components used in a business infrastructure, most notably servers, networking gear and storage, have become so reliable that the amount of downtime that we must protect against is quite low.  Most of the belief in the need for knee jerk high availability comes from a different era when reliable hardware was unaffordable and even the most expensive equipment was rather unreliable by modern standards.  This feeling of impending doom that any device might fail at any moment comes from an older era, not the current one.  Modern equipment, while obviously still carrying risks, is amazingly reliable.

In addition to other risks, over-investing in high availability solutions carries financial and business risks that can be substantial.  It increases technical debt in the face of business uncertainty.  What if the business suddenly grows, or worse, what if it suddenly contracts, changes direction, gets purchased or goes out of business completely?  The investment in the high availability is already spent even if the need for its protection disappears.  What if technology or location change?  Some or all of a high availability investment might be lost before it would have been at its expected end of life.

As IT practitioners, evaluating the benefits, risks and costs of technology solutions is at the core of what we do.  Like everything else in business infrastructure, determining the type of risk mitigation, the value of protection and how much is financially proper is our key responsibility and cannot be glossed over or ignored.  We can never simply assume that high availability is needed, nor that it can simply be skipped.  It is in analysis of this nature that IT brings some of its greatest value to organizations.  It is here that we have the potential to shine the most.
