Category Archives: Best Practices

Avoiding Local Service Providers

Inflammatory article titles aside, choosing a technology service provider wholly or even partially because they happen to be located geographically near you is almost always a very bad idea.  Knowledge-based services are difficult enough to find at all; finding the best combination of skills, experience and price is harder still when artificial and unnecessary constraints limit the field of potential candidates.

With the rare exception of major global market cities like New York City and London, it is nearly impossible to find a full range of Information Technology skills in a single locality, at least not combined with great depth of experience and breadth.  This is true of nearly all highly technical industries: expertise tends to concentrate in a handful of localities around the world, and the remaining skills are scattered rather unpredictably, often because the people in highest demand can command the salaries and locations they desire and live where they want to, not where they have to.

IT, more than nearly any other field, gains little from being geographically near the business that it supports.  Enterprise IT departments, even when located locally to their associated businesses and working in offices on premises, are often kept isolated in separate buildings, away from both the businesses that they support and the physical systems on which they work.  It is actually very rare for enterprise server admins to ever physically see their servers, or for network admins to see their switches and routers.  This becomes even less likely for roles like database administrators, software developers and others who have even less association with devices that have any physical component.

Adding a locality requirement when looking for consulting talent (and in many cases even internal IT staff) introduces an artificial constraint that eliminates nearly the entire field of talented people, while encouraging work to be done on site even when that makes no sense.  Working on site often brings a large increase in cost and a loss of productivity due to interruptions, lack of resources, a poor work environment, travel and the like.  Working with exclusively or predominantly remote resources encourages a healthy investment in efficient working conditions that generally pays off very well.  It is also important to keep in mind that a remote service company does not necessarily do its work remotely.  In many cases remote work will make sense; in others it will not.

Location-agnostic workers have many advantages.  By not being tied to a specific location you gain far more flexibility in skill level (allowing you to pursue the absolute best people), in cost (allowing you to hire people living in low-cost areas), in offering flexibility itself as an incentive, and in getting broader skill sets, larger staff and so on.  Choosing purely local services simply limits you in many ways.

Companies that are not based locally are not necessarily unable to provide local resources.  Many work with local companies or individuals to maintain a local presence.  In many cases this is simply what we call local “hands,” analogous to how most enterprises work internally: centrally or remotely based IT staff, with physical “hands” existing only at locations with physical equipment to be serviced.  Where specific expertise needs to be co-located with physical equipment or people, it is common for companies either to staff locally when the resource is needed on a very regular basis or to have specific resources travel to the location when needed.  These techniques are generally far more effective, and can easily be more cost effective, than attempting to hire a firm whose staff coincidentally happens to be in the right location already.

As time marches forward, needs change as well.  Companies that work only locally can find themselves facing new challenges when they expand to other regions or locations.  Do they choose vendors and partners only where they were originally located?  Where they are moving or expanding to?  Local for each location separately?  The idea of working only with local resources is nearly exclusive to the smallest of businesses.  Typically, as businesses grow, the concept of local begins to change in interesting ways.

Locality and jurisdiction are different things.  In many cases it may be necessary to work with businesses located in the same state or country as your own for legal or financial reasons, and this can often make sense.  Small companies especially may not be prepared to tackle the complexities of working with a foreign firm.  Larger companies may find these boundaries worth ignoring as well.  But the idea that location should be ignored should not be taken to mean that jurisdiction, by extension, should also be ignored.  Jurisdiction still plays a significant role, one that some IT service providers or other vendors may be able to navigate on your behalf, allowing you to work with a vendor within your jurisdiction while getting the benefits of support from another.

As with many artificial constraints, not only do we generally eliminate the most ideal vendor candidates, but we also risk “informing” the remaining candidate pool that we care more about locality than about quality of service or other important factors.  This can lead to a situation where a vendor, especially in a smaller market, feels locked in to you as a customer and sees no need to perform to a market-standard level or to price competitively (as there is no true competition given the constraints), or worse.  A vendor who feels that they have a trapped customer is unlikely to be a good vendor long term.

Of course we don’t want to avoid companies simply because they are local to our own business, but we should not give them undue preference for this reason either.  There is no denying that some work benefits from being done in person.  But we must be careful not to extend that to work that does not, nor should we confuse the location of a vendor with the location(s) where they do, or are willing to do, business.

In the extreme, all IT work can, in theory, be done completely remotely; only the bench work (physical “hands”) aspects of IT need an on-premises presence.  That is an extreme, and of course there are reasons to have IT on site.  Working with a vendor to determine how service can best be provided, whether locally, remotely or in combination, can be very beneficial.

In a broader context, the most important concept here is to avoid adding artificial or unnecessary constraints to the vendor selection process.  Assuming that a local vendor will be able or willing to deliver value that a non-local vendor cannot or will not is just one way that we might bring assumption or prejudice to a process like this.  There is every possibility that the local company will do the best possible job and be the best, most viable vendor long term, but the chances are far higher that you will find the right partner for your business elsewhere.  It’s a big world, and IT, more than nearly any other field, is becoming a large, flat playing field.

The Jurassic Park Effect

“If I may… Um, I’ll tell you the problem with the scientific power that you’re using here, it didn’t require any discipline to attain it. You read what others had done and you took the next step. You didn’t earn the knowledge for yourselves, so you don’t take any responsibility for it. You stood on the shoulders of geniuses to accomplish something as fast as you could, and before you even knew what you had, you patented it, and packaged it, and slapped it on a plastic lunchbox, and now …” – Dr. Ian Malcolm, Jurassic Park

When looking at building a storage server or NAS, there is a common feeling that what is needed is a “NAS operating system.”  This is an odd reaction, I find, since the term NAS means nothing more than a “file server with a dedicated storage interface,” or, in other words, just a file server with limited exposed functionality.  The reason that we choose physical NAS appliances is the integrated support and sometimes special, proprietary functionality (NetApp is a key example, offering extensive SMB and NFS integration and some really unique RAID and filesystem options; Exablox offers fully managed scale-out file storage and RAIN-style protection.)  Using a NAS to replace a traditional file server is, for the most part, a fairly recent phenomenon, and one that I have found is often driven by misconception or the impression that managing a file server, one of the most basic IT workloads, is special or hard.  File servers are generally considered the most basic form of server: traditionally they are what people meant by the term “server” unless additional description was added, and they are the only form commonly integrated into the desktop (every Mac, Windows and Linux desktop can function as a file server, and it is very common for them to do so.)

There is, of course, nothing wrong with turning to a NAS instead of a traditional file server to meet your storage needs, especially as some modern NAS options, like Exablox, offer scale-out and storage options that are not available in most operating systems.  But it appears that the trend toward using a NAS instead of a file server has led to some odd behaviour when IT professionals turn back to considering file servers.  It is a cascading effect, I suspect: the reasons why a NAS is sometimes preferred, and the goal-level thinking behind them, are lost, while the resulting idea of “I should have a NAS” remains, so that when returning to look at file server options there is a drive to “have a NAS” whether or not there is a logical reason for feeling that this is necessary.

First we must consider that the general concept of a NAS is a simple one: take a traditional file server, simplify it by removing options, and package it with all of the necessary hardware to make a simplified appliance with support included from the interface down to the spinning drives and everything in between.  Storage can be tricky when users need to determine RAID levels and drive types, monitor effectively, and so on; a NAS addresses this by integrating the hardware into the platform.  This makes things simple but can add risk, as you have fewer support options and less ability to fix or replace things yourself.  A move from a file server to a NAS appliance is almost exclusively about support and is generally a very strong commitment to a single vendor.  You choose the NAS approach because you want to rely on one vendor for everything.

When we move to a file server we go in the opposite direction.  A file server is a traditional enterprise server like any other.  You buy your server hardware from one vendor (HP, Dell, IBM, etc.) and your operating system from another (Microsoft, Red Hat, SUSE, etc.)  You specify the parts and the configuration that you need, and you have the most common computing model in all of IT.  With this model you are generally using standard, commodity parts, allowing you to easily migrate between hardware vendors and between software vendors.  You have “vendor redundancy” options, and generally everything is done using open, standard protocols.  You get great flexibility and can manage and monitor your file server just like any other member of your server fleet, including keeping it completely virtualized.  You give up the vertical integration of the NAS in exchange for horizontal flexibility and standardization.

What is odd, therefore, is returning to the commodity model but seeking what is colloquially known as a NAS OS.  Common examples include NAS4Free, FreeNAS and OpenFiler.  This category of products is generally nothing more than a standard operating system (often FreeBSD because of its ideal licensing, or Linux because it is well known) with a “storage interface” put onto it and no special or additional functionality that would not exist with the normal operating system.  In theory they are “single function” operating systems that do only one thing.  But this is not reality.  They are general purpose operating systems with an extra GUI management layer added on top.  One could say the same thing about most physical NAS products themselves, but those typically include custom engineering even at the storage level, special features and, most importantly, an integrated support stack and true isolation of the “generalness” of the underlying OS.  A “NAS OS” is not a simpler version of a general purpose OS; it is a more complex, yet less functional, version of it.

What is additionally odd is that general OSes, with rare exception, already come with very simple, extremely well known and fully supported storage interfaces.  Nearly every variety of Windows or Linux server, for example, has included simple graphical interfaces for these functions for a very long time.  These included GUIs are often shunned by system administrators as too “heavy and unnecessary” for a simple file server.  So it is even more unusual that adding a third-party GUI, one that is not patched and tested by the OS team and not widely known and supported, would be desired, as this goes against the common ideals and practices of running a server.

And this is where the Jurassic Park effect comes in: the OS vendors (Red Hat, Microsoft, Oracle, FreeBSD, SUSE, Canonical, et al.) are giants with amazing engineering teams, code review, testing, oversight and enterprise support ecosystems, while the “NAS OS” vendors are generally very small companies, some with just one part-time person, who stand on the shoulders of these giants and build something they knew they could without ever stopping to ask if they should.  The resulting products are wholly negative compared to their pure OS counterparts: they do not make systems management easier, nor do they fill a gap in the market’s service offerings.  Solid, reliable, easy to use storage is already available; more vendors are not needed to fill this place in the market.

The logic often applied to a NAS OS is that it is “easy to set up.”  This may or may not be true, as “easy,” here, must be a relative term.  For there to be any value, a NAS OS has to be easy in comparison to the standard version of the same operating system.  In the case of FreeNAS, this would mean FreeBSD: FreeNAS would need to be appreciably easier to set up than FreeBSD for the same, dedicated functions.  And this is easily true; setting up a NAS OS is generally pretty easy.  But this ease is deceptive, and it is something of which IT professionals need to be quite aware.  Making something easy to set up is not the priority in IT; making something that is easy to operate and repair when there are problems is what is important.  Easy to set up is nice, but if it comes at the cost of not understanding how the system is configured and makes operational repairs more difficult, it is a very, very bad thing.  NAS OS products routinely make it dangerously easy to put into production, for a storage role that is almost always the most critical or nearly the most critical role of any server in an environment, a product that IT has no experience or likely skill to maintain, operate or, most importantly, fix when something goes wrong.  We need exactly the opposite: a system that is easy to operate and fix.  That is what matters.  So we have a second case of “standing on the shoulders of giants” and building a system that we knew we could, without knowing if we should.

What exacerbates this problem is that the very people who feel the need to turn to a NAS OS to “make storage easy” are, by the very nature of the NAS OS, exactly the people for whom operational support and repair of the system are most difficult.  System administrators who are comfortable with the underlying OS would naturally not see a NAS OS as a benefit and, for the most part, avoid it.  It is uniquely the people for whom running a not fully understood storage platform is most dangerous who are likely to attempt it.  And, of course, most NAS OS vendors earn their money, as we could predict, on post-installation support calls from customers who deployed, got stuck once they were in production, and are now at the mercy of the vendor’s exorbitant support pricing.  It is in the interest of the vendors to make the product easy to install and hard to fix.  Everything is working against the IT pro here.

If we take a common example and look at FreeNAS, we can see how this is a poor alignment of “difficulties.”  FreeNAS is FreeBSD with an additional interface on top.  Anything that FreeNAS can do, FreeBSD can do.  There is no loss of functionality in going to FreeBSD.  When something fails, in either case, the system administrator must have a good working knowledge of FreeBSD in order to effect repairs.  There is no escaping this.  FreeBSD knowledge is common in the industry and getting outside help is relatively easy.  Using FreeNAS adds several complications, the biggest being that any and all customizations made by the FreeNAS GUI are special knowledge needed for troubleshooting on top of the knowledge already needed to operate FreeBSD.  So this is a larger knowledge set as well as more things to fail.  It is also a relatively uncommon knowledge set, as FreeNAS is a niche storage product from a small vendor while FreeBSD is a major enterprise IT platform (and all use of FreeNAS is FreeBSD use, but only a tiny percentage of FreeBSD use is FreeNAS.)  So we can see that using a NAS OS just adds risk over and over again.

This same issue carries over into the communities that grow up around these products.  If you look to communities around FreeBSD, Linux or Windows for guidance and assistance, you deal with large numbers of IT professionals, skilled system admins and people with business and enterprise experience.  Of course hobbyists, the uninformed and others participate too, but these are the enterprise IT platforms, and all the knowledge of the industry is available to you when implementing these products.  Compare this to the community of a NAS OS.  By its very nature, only people struggling with the administration of a standard operating system and/or with storage basics would look at a NAS OS package, so this naturally filters the membership of those communities down to exactly the people from whom we would do best to avoid taking advice.  This creates an isolated culture of misinformation and misunderstanding around storage and storage products.  Myths abound, guidance often becomes reckless and dangerous, and industry best practices are ignored as if decades of accumulated experience had never happened.

A NAS OS also commonly introduces lags in patching and updates.  A NAS OS will almost always, and almost necessarily, trail its parent OS on security and stability updates, and will very often follow months or years behind on major features.  In one very well known case, OpenFiler, the product was built on an upstream non-enterprise base (rPath Linux) which lacked community and vendor support, failed and was abandoned, leaving downstream users, including everyone on OpenFiler, without the ecosystem needed to support them.  Using a NAS OS means trusting not just the large, well known enterprise vendor that makes the base OS but the NAS OS vendor as well, and the NAS OS vendor is orders of magnitude more likely to fail than the enterprise-class vendors on whose OSes they base their products.

Storage is a critical function and should not be treated carelessly or as if its criticality did not exist.  NAS OSes tempt us to install quickly and forget, hoping that nothing ever goes wrong or that we can move on to other roles or companies before bad things happen.  This sets us up for failure exactly where failure is most impactful.  When a typical application server fails we can always copy the files off of its storage and start fresh.  When storage fails, data is lost and systems go down.

“John Hammond: All major theme parks have delays. When they opened Disneyland in 1956, nothing worked!

Dr. Ian Malcolm: Yeah, but, John, if The Pirates of the Caribbean breaks down, the pirates don’t eat the tourists.”

When storage fails, businesses fail.  Taking the easy route to setting up storage, ignoring long term support needs and seeking advice from communities that have filtered out the experienced storage and systems engineers increases risk dramatically.  Sadly, the nature of a NAS OS is that the very reason people turn to it (a lack of deep technical knowledge to build the systems) is the very reason they must avoid it (an even greater need for support.)  The people for whom NAS OSes are effectively safe to use, those with very deep and broad storage and systems knowledge, would rarely consider these products because they offer them no benefits.

At the end of the day, while the concept of a NAS OS sounds wonderful, it is not a panacea.  The value of a NAS does not carry over from the physical appliance world to the installed OS world, and the value of standard OSes is far too great for NAS OSes to add any real value.

“Dr. Alan Grant: Hammond, after some consideration, I’ve decided, not to endorse your park.

John Hammond: So have I.”

Virtualizing Even a Single Server

In conversations about virtualization, I very commonly find consolidation, which in the context of server virtualization refers to putting multiple formerly physical workloads onto a single physical box with separation handled by virtual machine boundaries, treated as the core tenet and fundamental feature of virtualization.  Without a doubt, workload consolidation represents an amazing opportunity with virtualization, but it is extremely important that the value of virtualization and the value of consolidation not be confused.  Too often I have found consolidation viewed as the key value of virtualization and the primary justification for it, but this is not the case.  Consolidation is a bonus feature and should never be needed to justify virtualization.  Virtualization should be a nearly foregone conclusion, while consolidation must be evaluated and in many cases would not be used.  That workloads should not be consolidated should never lead to the belief that those workloads should not be virtual.  I would like to explore the virtualization decision space to see how we should be looking at this.

Virtualization should be thought of as hardware abstraction as that is truly what it is, in a practical sense.  Virtualization encapsulates the hardware and presents a predictable, pristine hardware set to guest operating systems.  This may sound like it adds complication but, in reality, it actually simplifies a lot of things both for the makers of operating systems and drivers as well as for IT practitioners designing systems.  It is because computers, computer peripherals and operating systems are such complex beasts that this additional layer actually ends up removing complexity from the system by creating standard interfaces.  From standardization comes simplicity.

This exact same concept of presenting a standard, virtual machine to a software layer exists in other areas of computing as well, such as with how many programming languages are implemented.  This is a very mature and reliable computing model.

Hardware abstraction and the stability that it brings alone are reason enough to standardize on virtualization across the board but the practical nature of hardware abstraction as implemented by all enterprise virtualization products available to us today brings us even more important features.  To be sure, most benefits of virtualization can be found in some other way but rarely as completely, reliably, simply or freely as from virtualization.

The biggest set of additional features typically comes from the abstraction of storage and memory, which allows the storage, or even the entire running state, of a virtual machine to be snapshotted; that is, an image of the running system can be taken and stored in a file.  This leads to many very important capabilities, such as the ability to take a system snapshot before installing new software, changing configurations or patching, allowing for extremely rapid rollback should anything go wrong.  This seemingly minor feature can bring great peace of mind and overall system reliability.  It also makes testing features, rolling back and repeating tests very easy in non-production environments.
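To make that snapshot-and-rollback workflow concrete, here is a minimal sketch in Python using the libvirt bindings.  It assumes a KVM host reachable at qemu:///system and a hypothetical guest named app01; the guest name and snapshot name are illustrative only, and other hypervisors expose equivalent operations through their own tooling.

import libvirt

PRE_PATCH_SNAPSHOT = """
<domainsnapshot>
  <name>pre-patch</name>
  <description>State captured before applying patches</description>
</domainsnapshot>
"""

conn = libvirt.open("qemu:///system")          # connect to the local KVM hypervisor
dom = conn.lookupByName("app01")               # the guest about to be patched

dom.snapshotCreateXML(PRE_PATCH_SNAPSHOT, 0)   # capture the guest's current state

# ... install patches, change configuration, test the application ...

# If anything goes wrong, roll the guest back to the saved state:
snap = dom.snapshotLookupByName("pre-patch", 0)
dom.revertToSnapshot(snap, 0)

conn.close()

The important point is not the specific API but that the rollback happens below the guest operating system, so it works the same way regardless of what was changed inside the VM.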

The ability to snapshot from the abstraction layer also enables “image-based backups”: backups taken via the snapshot mechanism at the block device layer rather than from within the operating system’s file system layer.  This allows for operating-system-agnostic backup mechanisms and backups that include the entire system storage pool all at once.  Image backups allow for what were traditionally known as “bare metal restores,” where the entire system can be restored to a fully running state, easily and very quickly, without additional interaction.  Not all hypervisor makers include this capability, or include it to equal levels, so while it is conceptually a major feature, the extent to which it exists or is licensed must be considered on a case by case basis (notably, Hyper-V includes this fully, XenServer includes it partially and VMware vSphere only includes it with non-free license levels.)  When available, image-based backups allow for extremely rapid recovery at speeds unthinkable with other backup methodologies.  Restoring systems in minutes, from disaster to recovery, is possible!

The ability to treat virtual machines as files (at least when not actively running) provides additional benefits related to the backup benefits listed above, namely the ability to rapidly and easily migrate between physical hosts and even to move between disparate hardware.  Traditionally, hardware upgrades or replacements meant a complicated migration process fraught with peril.  With modern virtualization, moving from existing hardware to new hardware can be a reliable, non-destructive process with safe fallback options and little or possibly even zero downtime!  Tasks that are uncommon but were very risky previously can often become trivial today.

Often this is the true benefit of virtualization and abstraction mechanisms: not, necessarily, improving the day to day operation of a system, but reducing risk and providing flexibility and options in the future, preparing for unknowns that are either unpredictable or simply ignored in most situations.  Rarely is such planning done at all, much to the chagrin of IT departments left with difficult and dangerous upgrades that could have been easily mitigated.

Many features of virtualization apply only to special scenarios.  Many virtualization products include live migration tools for moving running workloads between hosts, or possibly even between storage devices, without downtime.  High availability and fault tolerance options are often available, allowing some workloads to rapidly or even transparently recover from hardware failure, moving from failed hardware to redundant hardware without user intervention.  While these are more of a niche benefit, and certainly not reasons why “nearly all workloads” should be virtual, they are worth noting as a primary example of features that are often available and can be added later if a need arises, as long as virtualization is used from the beginning.  Otherwise a migration to virtualization would be needed before such features could be leveraged.

Virtualization products typically come with extensive additional features that only matter in certain cases.  A great many of them fall into a large pool of “in case of future need.”  Possibly the biggest of all of these is the concept of consolidation, as I mentioned at the beginning of this article.  Like other advanced features such as high availability, consolidation is not a core value of virtualization but is often confused for one.  Workloads that do not intend to leverage high availability or consolidation should, without a doubt, still be virtualized.  But these features are so potentially valuable as future options, even in scenarios where they will not be used today, that they are worth mentioning regardless.

Consolidation can be extremely valuable, and it is easy to understand why so many people simply assume it will be used.  Its availability, once an infrastructure is in place, is a key point of flexibility for handling the unknowns of future workloads.  Even when consolidation is completely unneeded today, there is a very good chance, even in the smallest of companies, that it will be useful at some unknown time in the future.  Virtualization provides us with a hedge against the unknown by preparing our systems for maximum flexibility.  One of the most important aspects of any IT decision is managing and reducing risk.  Virtualization does this.

Virtualization is about stability, flexibility, standardization, manageability and following best practices.  Every major enterprise virtualization product is available, at least in some form, for free today.  Any purchase would, of course, require a careful analysis of value versus expenditure.  However, with excellent enterprise options currently available for free from all four key product lines in this space (Xen, KVM, Hyper-V and VMware vSphere), we need make no such analysis.  We need only show that the implementation is a non-negative.

What makes the decision easy is that when we consider the nominal case, the bare minimum that all enterprise virtualization provides (zero cost, abstraction, encapsulation and the storage-based benefits), we find a small benefit in effectively all cases, no measurable downsides and a very large potential benefit in flexibility and hedging against future needs.  This leaves us with a clear win and a simple decision: virtualization, being free and with essentially no downsides on its own, should be used in any case where it can be (which, at this point, is essentially all workloads.)  Additional, non-core features like consolidation and high availability should be evaluated separately, and only after the decision to virtualize has already been made.  A lack of need for those extended features in no way suggests that virtualization should not be chosen on its own merits.

This is simply an explanation of existing industry best practices which have been to virtualize all potential workloads for many years.  This is not new nor a change of direction.  Just the fact that across the board virtualization has been an industry best practice for nearly a decade shows what a proven and accepted methodology this is.  There will always be workloads that, for one reason or another, simply cannot be virtualized, but these should be very few and far between and should prompt a deep review to find out why this is the case.

When deciding whether or not to virtualize, the approach should always be to treat virtualization as a foregone conclusion and vary from this only if a solid, defensible technical reason makes it impossible.  Nearly all arguments against virtualization come from a position of misunderstanding, with a belief that consolidation, high availability, external storage, licensing cost and other loosely related or unrelated concepts are somehow intrinsic to virtualization.  They are not, and they should not be included in a virtual versus physical deployment decision.  They are separate and should be evaluated as separate options.

It is worth noting that, because consolidation is not part of our decision matrix for establishing the base value of virtualization, all of the reasons we are using apply equally to one-to-one deployments (a single virtual machine on a single physical device) and to consolidated workloads (multiple virtual machines on a single physical device.)  There is no situation in which a workload is “too small” to be virtualized.  If anything, it is the opposite: only for the largest workloads, typically those with extreme latency sensitivity, does a niche case for non-virtualization still exist, and even these cases are rapidly disappearing as virtualization latency and total workload capacities improve.  These cases are so rare and vanishing so quickly that even taking the time to mention them is probably unwise, as it suggests that exceptions based on capacity needs are common enough to evaluate for, which they are not, especially in the SMB market.  The smaller the workload, the more ideal it is for virtualization.  This is only to reinforce that small businesses, with singular workloads, are the most ideal case for virtualization across the board rather than an exception to best practices, not to suggest that larger businesses should be looking for exceptions themselves.

On DevOps and Snowflakes

One can hardly swing a proverbial cat in IT these days without hearing people talking about DevOps.  DevOps is the hot new topic in the industry, picking up where the talk of cloud left off, and to hear people talk about it one might believe that traditional systems administration is already dead and buried.

First we must talk about what we mean by DevOps.  This can be confusing because, like cloud, an older term is often being stolen to mean something different or, at best, related to something that already existed.  Traditional DevOps was the merging of developer and operational roles.  In the 1960s through the 1990s, this was the standard way of running systems.  In this world the people who wrote the software were generally the same ones who deployed and maintained it.  Hence the merging of “developer” and “operations”, operations being a semi-standard term for the role of system administrator.  These roles were not commonly separated until the rise of the “IT Department” in the 1990s and the 2000s.  Since then, the return to the merging of the two roles has started to rise in popularity again primarily because of the way that the two can operate together with great value in many modern, hosted, web application situations.

Where DevOps is talked about today, it is usually not a strict merging of developers and operations staff but a modification of the operations role, with a much heavier focus on coding, not on the application itself but on defining application infrastructure as code as a natural extension of cloud architectures.  This can be rather confusing at first.  What is important to note is that traditional DevOps is not what is commonly occurring today; instead we have a new “fake” DevOps in which developers remain developers and operations remains operations, but operations has evolved into a new “code heavy” role that continues to focus on managing servers running code provided by the developers.

What is significant today is that the role of the system administrator has begun to diverge into two related, but significantly different roles, one of which is improperly called DevOps by most of the industry today (most of the industry being too young to remember when DevOps was the norm, not the exception and certainly not something new and novel.)  I refer to these two aspects of the system administrator role here as the DevOps and the Snowflake approaches.

I use the term Snowflake to refer to traditional architectures for systems because each individual server can be seen as a “unique Snowflake.”  They are all different, at least insofar as they are not somehow managed in such a way as to keep them identical.  This doesn’t mean that they have to be all unique, just that they retain the potential to be.  In traditional environments a system administrator will log into each server individually to work on them.  Some amount of scripting is common to ease administration tasks but at its core the role involves a lot of time working on individual systems.

Easing the administration of Snowflake architectures often involves attempting to minimize differences between systems in reasonable ways.  This generally starts with things like choosing a single standard operating system and version (Windows Server 2012 R2 or Red Hat Enterprise Linux 7, for example) rather than allowing every server installation to be a different OS or version.  This may seem basic, but many shops lack even this level of standardization today.

A common next step is creating a standard deployment methodology or a gold master image used to build all systems, so that the base operating system and all base packages, often including system customizations, monitoring packages, security packages, authentication configuration and similar modifications, are standard and deployed uniformly.  This provides a common starting point for all systems to minimize divergence.  But technically it only ensures a standard starting point; over time, divergence in configuration must be anticipated.

Beyond these steps, Snowflake environments typically use custom, bespoke administration scripts or management tools to maintain some standardization between systems over time.  The more commonalities that exist between systems the easier they are to maintain and troubleshoot and the less knowledge is needed by the administration staff.  More standardization means fewer surprises, fewer unknowns and much better testing capabilities.

In a single system administrator environment with good practices and tooling, Snowflake environments can take on a high degree of standardization.  But in environments with many system administrators, especially those supported around the clock from many regions, and with a large number of systems, standardization, even with very diligent practices, can become very difficult.  And that is even before we tackle the obvious issues surrounding the fact that different packages and possibly package versions are needed on systems that perform different roles.

The DevOps approach grows organically out of the cloud architecture model.  Cloud architecture is designed around automatically created and automatically destroyed, broadly identical systems (at least in groups) that are controlled through a programmatic interface or API.  This model lends itself, quite obviously, to being controlled centrally through a management system rather than through the manual efforts of a system administrator.  Manual administration is effectively impossible and completely impractical under this model.  Individual systems are not unique like in the Snowflake model and any divergence will create serious issues.

The idea that has emerged from the cloud architecture world is that systems architecture should be defined centrally, “in code,” rather than on the servers themselves.  This sounds confusing at first but makes a lot of sense when we look at it more deeply.  To support this model, a new type of systems management tool has begun to emerge; it has yet to take on a standard name but is often called a systems automation tool, DevOps framework, IT automation tool or simply an “infrastructure as code” tool.  Common toolsets in this realm include Puppet, Chef, CFEngine and SaltStack.

The idea behind these automation toolsets is that a central service is used to manage and control all systems.  This central authority manages individual servers by way of code-based descriptions of how each system should look and behave.  In the Chef world these are called “recipes,” to be cute, but the analogy works well.  Each system’s code might include information such as which packages and package versions should be installed, which system configurations should be modified and which files should be copied to the box.  In many cases decisions about these deployments or modifications are handled through potentially complex logic, hence the need for actual code rather than something more simplistic such as markup or templates.  Systems are then grouped by role and managed as groups.  The “web server” role might tell a set of systems to install Apache and PHP and configure memory to swap very little.  The “SQL Server” role might install MS SQL Server and the special backup tools used only for that application, and configure memory tuning as desired for a pool of SQL Server machines.  These are just examples.  Typically an organization would have a great many roles, some generic such as “web server” and others much more specific, supporting very specific applications.  Roles can generally be layered, so a system might be both a “web server” and a “java server,” getting the combined needs of both met.
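The following is a minimal, hypothetical sketch of that idea in Python, not the API of Chef, Puppet or any other real framework: roles declare packages and settings, a node applies whatever roles are assigned to it, and layered roles simply combine their declarations.  The role names, package names and the use of yum and sysctl here are illustrative assumptions.

import subprocess

# Hypothetical role definitions: each role declares packages and kernel settings.
ROLES = {
    "web_server": {
        "packages": ["httpd", "php"],
        "sysctl": {"vm.swappiness": "10"},      # swap very little
    },
    "java_server": {
        "packages": ["java-11-openjdk"],
        "sysctl": {},
    },
}

def apply_role(role_name):
    """Converge this node toward the declared state for one role."""
    role = ROLES[role_name]
    for package in role["packages"]:
        subprocess.run(["yum", "install", "-y", package], check=True)
    for key, value in role["sysctl"].items():
        subprocess.run(["sysctl", "-w", "%s=%s" % (key, value)], check=True)

# Roles layer: a node assigned both roles receives the combined state.
for assigned_role in ("web_server", "java_server"):
    apply_role(assigned_role)

A real framework adds the central authority, idempotence, reporting and much richer logic, but the core pattern of describing the desired state in code and converging nodes toward it is the same.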

These standard definitions mean that systems, once designated as belonging to one role or another, can “build themselves” automatically.  A new system might be created by an administrator requesting it, or a capacity monitoring system might decide that additional capacity is needed for a role and spawn new server instances automatically, without any human intervention whatsoever.  At the time the system is requested, by a human or automatically, the role is designated and the system will, by way of the automation framework, transform itself into a fully configured and up to date “node.”  No human system administration intervention is required.  The process is fast, simple and, most importantly, completely repeatable.

Defining systems in code has some non-obvious consequences.  One is that backups of complete systems are no longer needed.  Why back up a system that you can recreate, with minimal effort, almost instantly?  Local data, such as that from database systems, would still need to be backed up, but only the data itself, not the entire system.  This can greatly reduce strain on backup infrastructure and make restore processes faster and more reliable.

The amount of documentation needed for systems already defined in code is minimal.  In Snowflake environments the system administrator needs to maintain documentation specific to every host, and to maintain it manually; this is very time consuming and error prone.  Systems defined by way of central code need little to no documentation, and what documentation there is can be handled at the group level rather than the individual node level.

Testing systems that are defined in code is easy as well.  You can create a system via code, test it and know that when you move that definition into production, the production system will be created exactly and repeatably as it was in testing.  In Snowflake environments it is very common to have testing practices that attempt this through manual effort, which tends to be sloppy and not exactly repeatable, and very often politics dictates that it is faster to mimic repeatability than to actually strive for it.  Code-defined systems bypass these problems, making testing far more valuable.

Beyond needing to define how many nodes should exist within each role, the system can reprovision an entire architecture, from scratch, automatically.  Rebuilding after a disaster or bringing up a secondary site can be done quickly and easily.  Moving between locally hosted systems and remotely hosted ones, including those from companies like Amazon, Microsoft, IBM, Rackspace and others, is also extremely easy.

Of course, in the DevOps world there is great value in using cloud architectures to enable the most extreme level of automation, but cloud architectures are not necessary to leverage these types of tools.  A code-defined architecture could also be applied partially, with manual administration alongside it in a hybrid approach, but this is rarely recommended on individual systems.  When both approaches are mandated, it is generally far better to have two environments, one managed as Snowflakes and one managed as DevOps; this makes for a far better hybridization.  I have seen this work extremely well in an enterprise environment with scores of thousands of “Snowflake” servers, each very unique, alongside a dedicated environment of ten thousand nodes managed in a DevOps manner because all of those nodes were to be identical and interchangeable, using one of two possible configurations.  The hybridization was very effective.

The DevOps approach, however, comes with major caveats as well.  The skill set necessary to manage systems this way is far greater than that needed for traditional systems administration: at a minimum, all traditional systems administration knowledge is still needed, plus solid programming knowledge, typically of modern languages like Python and Ruby, plus knowledge of the specific frameworks in question.  This extended knowledge requirement means that DevOps practitioners are not only rare but expensive as well.  It also means that university education, already falling far short of preparing either systems administrators or developers for the professional world, is now farther still from preparing graduates to work under a DevOps model.

System administrators working in each of these two camps have a tendency to see all systems as needing to fit into their own mold. New DevOps practitioners often believe that Snowflake systems are legacy and need to be updated.  Snowflake (traditional) admins tend to see the “infrastructure as code” movement as silly, filled with unnecessary overhead, overly complicated and very niche.

The reality is that both approaches have a tremendous amount of merit and both are going to remain extremely viable.  They make sense for very different workloads, and large organizations, I suspect, will commonly see both in place via some form of hybridization.  In the SMB market, where there are typically only a tiny number of servers, no scaling leverage to justify cloud architectures and a high disparity between systems, I suspect that DevOps will remain outside the norm almost indefinitely, as the overhead and additional skills necessary to make it function are impractical or even impossible to acquire.  Larger organizations have to look at their workloads.  Many traditional workloads, and much traditional software, are not well suited to the DevOps approach, especially cloud automation, and will require either hybridization or an impractically high level of coding on a per-system basis, making the DevOps model impossible to justify.  But workloads built on web architectures, or that scale horizontally extremely well, will benefit heavily from the DevOps model at scale.  This could apply to large enterprise companies or to smaller companies, likely those producing hosted applications for external consumption.

This difference in approach means that, in the United States for example, most of the country is made up of companies that will remain focused on the Snowflake management model, while some east coast companies could evaluate the DevOps model effectively and begin to move in that direction.  But on the west coast, where more modern architectures and a much larger focus on hosted applications and applications for external consumption are the driving economic factors, DevOps is already moving from newcomer to mature, established normalcy.  DevOps and Snowflake approaches will likely remain heavily segregated by region in this way, just as IT in general sees different skill sets migrate to different regions.  It would not be surprising to see DevOps begin to take hold in markets such as Austin, where traditional IT has performed rather poorly.

Neither approach is better or worse; they are two different approaches servicing two very different ways of provisioning systems and two different fundamental needs of those systems.  With the rise of cloud architectures and the DevOps model, however, it is critically important that existing system administrators understand what the DevOps model means and when it applies, so that they can correctly evaluate their own workloads and unique needs.  A large portion of the traditional Snowflake system administration world will migrate, over time, to the DevOps model.  We are very far from reaching a steady state in the industry as to the balance of these two models.

Originally published on the StorageCraft Blog.