On DevOps and Snowflakes

One can hardly swing a proverbial cat in IT these days without hearing people talking about DevOps.  DevOps is the hot new topic in the industry picking up from where the talk of cloud left off and to hear people talk about it one might believe that traditional systems administration is already dead and buried.

First we must talk about what we mean by DevOps.  This can be confusing because, like cloud, an older term is often being stolen to mean something different or, at best, related to something that already existed.  Traditional DevOps was the merging of developer and operational roles.  In the 1960s through the 1990s, this was the standard way of running systems.  In this world the people who wrote the software were generally the same ones who deployed and maintained it.  Hence the merging of “developer” and “operations”, operations being a semi-standard term for the role of system administrator.  These roles were not commonly separated until the rise of the “IT Department” in the 1990s and the 2000s.  Since then, the return to the merging of the two roles has started to rise in popularity again primarily because of the way that the two can operate together with great value in many modern, hosted, web application situations.

DevOps as it is often discussed today is not a strict merging of the developers and the operations staff.  Instead it is a modification of the operations role with a much heavier focus on coding: not coding the application itself but defining application infrastructures as code, a natural extension of cloud architectures.  This can be rather confusing at first.  What is important to note is that traditional DevOps is not what is commonly occurring today.  What we see instead is a new “fake” DevOps where developers remain developers and operations remains operations, but operations has evolved into a new “code heavy” role that continues to focus on managing servers running code provided by the developers.

What is significant today is that the role of the system administrator has begun to diverge into two related, but significantly different, roles, one of which is improperly called DevOps by most of the industry today (most of the industry being too young to remember when DevOps was the norm, not the exception, and certainly not something new and novel).  I refer to these two aspects of the system administrator role here as the DevOps and the Snowflake approaches.

I use the term Snowflake to refer to traditional architectures for systems because each individual server can be seen as a “unique Snowflake.”  They are all different, at least insofar as they are not somehow managed in such a way as to keep them identical.  This doesn’t mean that they have to be all unique, just that they retain the potential to be.  In traditional environments a system administrator will log into each server individually to work on them.  Some amount of scripting is common to ease administration tasks but at its core the role involves a lot of time working on individual systems.

Easing administration of Snowflake architectures often involves attempts to minimize differences between systems in reasonable ways.  This generally starts with things like choosing a single standard operating system and version (such as Windows Server 2012 R2 or Red Hat Enterprise Linux 7) rather than allowing every server installation to be a different OS or version.  This standardization may seem basic but many shops lack it even today.

A next step is commonly creating a standard deployment methodology or a gold master image that is used for making all systems so that the base operating system and all base packages, often including system customization, monitoring packages, security packages, authentication configuration and similar modifications, are standard and deployed uniformly.  This provides a common starting point for all systems to minimize divergence.  But technically it only ensures a standard starting point, and over time divergence in configuration must be anticipated.

Beyond these steps, Snowflake environments typically use custom, bespoke administration scripts or management tools to maintain some standardization between systems over time.  The more commonalities that exist between systems the easier they are to maintain and troubleshoot and the less knowledge is needed by the administration staff.  More standardization means fewer surprises, fewer unknowns and much better testing capabilities.

In a single system administrator environment with good practices and tooling, Snowflake environments can take on a high degree of standardization.  But in environments with many system administrators, especially those supported around the clock from many regions, and with a large number of systems, standardization, even with very diligent practices, can become very difficult.  And that is even before we tackle the obvious issues surrounding the fact that different packages and possibly package versions are needed on systems that perform different roles.

The DevOps approach grows organically out of the cloud architecture model.  Cloud architecture is designed around automatically created and automatically destroyed, broadly identical systems (at least in groups) that are controlled through a programmatic interface or API.  This model lends itself, quite obviously, to being controlled centrally through a management system rather than through the manual efforts of a system administrator.  Manual administration is effectively impossible and completely impractical under this model.  Individual systems are not unique as they are in the Snowflake model and any divergence will create serious issues.

The idea that has emerged from the cloud architecture world is that systems architecture should be defined centrally “in code” rather than on the servers themselves.  This sounds confusing at first but makes a lot of sense when we look at it more deeply.  In order to support this model a new type of systems management tool has begun to emerge.  It has yet to take on a really standard name but is often called a systems automation tool, DevOps framework, IT automation tool or simply an “infrastructure as code” tool.  Common toolsets in this realm include Puppet, Chef, CFEngine and SaltStack.

The idea behind these automation toolsets is that a central service is used to manage and control all systems.  This central authority manages individual servers by way of code-based descriptions of how the system should look and behave.  In the Chef world, these are called “recipes” to be cute but the analogy works well.  Each system’s code might include information such as a list of which packages and package versions should be installed, what system configurations should be modified and files to be copied to the box.  In many cases decisions about these deployments or modifications are handled through potentially complex logic and hence the need for actual code rather than something more simplistic such as markup or templates.  Systems are then grouped by role and managed as groups.  The “web server” role might tell a set of systems to install Apache and PHP and configure memory to swap very little.  The “SQL Server” role might install MS SQL Server and special backup tools only used for that application and configure memory to be tuned as desired for a pool of SQL Server machines.  These are just examples.  Typically an organization would have a great many roles, some may be generic such as “web server” and others much more specific to support very specific applications.  Roles can generally be layered, so a system might be both a “web server” and a “java server” getting the combined needs of both met.
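The layering of roles described above can be sketched in a few lines of plain Python.  This is only an illustration and not the syntax of Chef, Puppet or any other tool, each of which has its own language.  The `Role` class, the `resolve` function, the package versions and the swappiness value are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Role:
    """A hypothetical role: the packages and settings a group of systems shares."""
    name: str
    packages: Dict[str, str] = field(default_factory=dict)   # package -> version
    settings: Dict[str, str] = field(default_factory=dict)   # setting -> value


def resolve(roles: List[Role]) -> Role:
    """Layer roles in order; a later role overrides an earlier one on conflict."""
    merged = Role(name="+".join(r.name for r in roles))
    for role in roles:
        merged.packages.update(role.packages)
        merged.settings.update(role.settings)
    return merged


# Illustrative roles only; the versions and values are made up for the sketch.
WEB = Role("web server", {"apache": "2.4", "php": "7.4"}, {"vm.swappiness": "10"})
JAVA = Role("java server", {"openjdk": "11"})

# A system that is both a "web server" and a "java server".
node = resolve([WEB, JAVA])
```

Here `node` carries the combined package list of both roles along with the web server's memory setting.  A real framework would add ordering rules and conflict handling well beyond this simple "last role wins" merge.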

These standard definitions mean that systems, once designated as belonging to one role or another, can “build themselves” automatically.  A new system might be created by an administrator requesting a system or a capacity monitoring system might decide that additional capacity is needed for a role and spawn new server instances automatically without any human intervention whatsoever.  At the time that the system is requested, by a human or automatically, the role is designated and the system will, by way of the automation framework, transform itself into a fully configured and up to date “node.”  No human system administration intervention required.  The process is fast, simple and, most importantly, completely repeatable.
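One way to picture a system “building itself” is as a comparison between the state its role demands and the state it is actually in, followed by whatever actions close the gap.  The sketch below assumes a system can be described as a simple mapping of package names to versions, which is a large simplification.  The functions `plan` and `converge` are hypothetical, and `converge` only simulates applying changes in memory rather than touching a real server.

```python
from typing import Dict, List


def plan(desired: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """List the actions needed to bring `current` in line with `desired`."""
    actions = []
    for pkg, version in sorted(desired.items()):
        if pkg not in current:
            actions.append(f"install {pkg}={version}")
        elif current[pkg] != version:
            actions.append(f"upgrade {pkg} {current[pkg]} -> {version}")
    return actions


def converge(desired: Dict[str, str], current: Dict[str, str]) -> Dict[str, str]:
    """Simulate applying the plan; returns the new state without mutating input."""
    new_state = dict(current)
    new_state.update(desired)
    return new_state


# Illustrative data only: a freshly created node with one outdated package.
desired = {"apache": "2.4", "php": "7.4"}
fresh_node = {"php": "5.6"}

first_run = plan(desired, fresh_node)
second_run = plan(desired, converge(desired, fresh_node))
```

The first run yields two actions (install `apache`, upgrade `php`) and the second run yields none.  That is the repeatability property: running the same definition again against an already correct node changes nothing.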

Defining systems in code has some non-obvious consequences.  One is that backups of complete systems are no longer needed.  Why back up a system that you can recreate, with minimum effort, almost instantly?  Local data from database systems would need to be backed up but only the database data, not the entire system.  This can greatly reduce strain on backup infrastructures and make restore processes faster and more reliable.

The amount of documentation needed for systems already defined in code is very minimal.  In Snowflake environments the system administrator needs to maintain documentation specific to every host and maintain that documentation manually. This is very time consuming and error prone.   Systems defined by way of central code need little to no documentation and the documentation can be handled at a group level, not the individual node level.

Testing systems that are defined in code is easy to do as well.  You can create a system via code, test it and know that when you move that definition into production the production system will be created repeatably, exactly as it was created in testing.  In Snowflake environments it is very common to have testing practices that attempt to do this, but they rely on manual effort and are prone to being sloppy and not exactly repeatable.  Very often politics will dictate that it is faster to mimic repeatability than to actually strive for it.  Code defined systems bypass these problems, making testing far more valuable.

Outside of needing to define a number of nodes to exist within each role, the system can reprovision an entire architecture, from scratch, automatically.  Rebuilding after a disaster or bringing up a secondary site can be very quickly and easily done.  Also moving between locally hosted systems and remotely hosted systems including those from companies like Amazon, Microsoft, IBM, Rackspace and others is extremely easy.

Of course, in the DevOps world there is great value in using cloud architectures to enable the most extreme level of automation, but cloud architectures are not necessary to leverage these types of tools.  And, of course, a code defined architecture could be used partially alongside manual administration for a hybrid approach, but this is rarely recommended on individual systems.  It is generally far better to have two environments, one that is managed as Snowflakes and one that is managed as DevOps, when the two approaches are mandated.  This makes for a far better hybridization.  I have seen this work extremely well in an enterprise environment with scores of thousands of “Snowflake” servers, each very unique, alongside a dedicated environment of ten thousand nodes that was managed in a DevOps manner because all of the nodes were to be identical and interchangeable using one of two possible configurations.  Hybridization was very effective.

The DevOps approach, however, comes with major caveats as well.  The skill sets necessary to manage a system in this way are far greater than those needed for traditional systems administration.  At a minimum, all traditional systems administration knowledge is still needed, plus solid programming knowledge, typically of modern languages like Python and Ruby, and knowledge of the specific frameworks in question as well.  This extended knowledge base requirement means that DevOps practitioners are not only rare but expensive too.  It also means that university education, already far short of preparing either systems administrators or developers for the professional world, is now farther still from preparing graduates to work under a DevOps model.

System administrators working in each of these two camps have a tendency to see all systems as needing to fit into their own mold. New DevOps practitioners often believe that Snowflake systems are legacy and need to be updated.  Snowflake (traditional) admins tend to see the “infrastructure as code” movement as silly, filled with unnecessary overhead, overly complicated and very niche.

The reality is that both approaches have a tremendous amount of merit and both are going to remain extremely viable.  Both make sense for very different workloads and large organizations, I suspect, will commonly see both in place via some form of hybridization.  In the SMB market where there are typically only a tiny number of servers, no scaling leverage to justify cloud architectures and a high disparity between systems, I suspect that DevOps will remain almost indefinitely outside of the norm as the overhead and additional skills necessary to make it function are impractical or even impossible to acquire.  Larger organizations have to look at their workloads.  Many traditional workloads and much of traditional software is not well suited to the DevOps approach, especially cloud automation, and will either require hybridization or an impractically high level of coding on a per system basis making the DevOps model impossible to justify.  But workloads built on web architectures or that can scale horizontally extremely well will benefit heavily from the DevOps model at scale.  This could apply to large enterprise companies or smaller companies likely producing hosted applications for external consumption.

This difference in approach means that, in the United States for example, most of the US is comprised of companies that will remain focused on the Snowflake management model while some east coast companies could evaluate the DevOps model effectively and begin to move in that direction.  But on the west coast where more modern architectures and a much larger focus on hosted applications and applications for external consumption are the driving economic factors, DevOps is already moving from newcomer to mature, established normalcy.  DevOps and Snowflake approaches will likely remain heavily segregated by regions in this way just as IT, in general, sees different skill sets migrate to different regions.  It would not be surprising to see DevOps begin to take hold in markets such as Austin where traditional IT has performed rather poorly.

Neither approach is better or worse; they are two different approaches servicing two very different ways of provisioning systems and two different fundamental needs of those systems.  With the rise of cloud architectures and the DevOps model, however, it is critically important that existing system administrators understand what the DevOps model means and when it applies so that they can correctly evaluate their own workloads and unique needs.  A large portion of the traditional Snowflake system administration world will be migrating, over time, to the DevOps model.  We are very far from reaching a steady state in the industry as to the balance of these two models.

Originally published on the StorageCraft Blog.

Why We Reboot Servers

A question that comes up on a pretty regular basis is whether or not servers should be routinely rebooted, such as once per week, or if they should be allowed to run for as long as possible to achieve maximum “uptime.”  To me the answer is simple – with rare exception, regular reboots are the most appropriate choice for servers.

As with any rule, there are cases when it does not apply.  For example, some businesses running critical systems have no allotment for downtime and must be available 24/7.  Obviously systems like this cannot simply be rebooted in a routine way.  However, if a system is so critical that it can never go down then this situation should trigger a red flag that this system is a point of failure and perhaps consideration for how to handle downtime, whether planned or unplanned, should be initiated.

Another exception is that some AIX systems need significant uptime, greater than a few weeks, to obtain maximum efficiency, as the system is self tuning and needs time to obtain usage information and to adjust itself accordingly.  This tends to be limited to large, seldom-changing database servers and similar use scenarios that are less common than on other platforms.

In IT we often worship the concept of “uptime” – how long a system can run without needing to restart.  But “uptime” is not a concept that brings value to the business and IT needs to keep the business’ needs in mind at all times rather than focusing on artificial metrics.  The business is not concerned with how long a server has managed to stay online without rebooting – they only care that the server is available and ready when needed for business processing.  These are very different concepts.

For most any normal business server, there is a window when the server needs to be available for business purposes and a window when it is not needed.  These windows may be daily, weekly or monthly but it is a rare server that is actually in use around the clock without exception.
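The idea of a window can be made concrete with a small check that a scheduled reboot job might run before doing anything.  This is a minimal sketch that assumes a single weekly window.  The function name `in_maintenance_window` and the default of Sunday from 02:00 to 06:00 are hypothetical choices for illustration, not a recommendation for any particular business.

```python
from datetime import datetime, time


def in_maintenance_window(
    now: datetime,
    weekday: int = 6,            # Monday is 0, so 6 is Sunday
    start: time = time(2, 0),    # window opens at 02:00 (assumed)
    end: time = time(6, 0),      # window closes at 06:00 (assumed)
) -> bool:
    """Return True only when `now` falls inside the weekly maintenance window."""
    return now.weekday() == weekday and start <= now.time() < end


# 7 January 2024 was a Sunday; 8 January 2024 was a Monday.
sunday_early = in_maintenance_window(datetime(2024, 1, 7, 3, 0))
monday_early = in_maintenance_window(datetime(2024, 1, 8, 3, 0))
```

A reboot script guarded this way simply refuses to act outside the agreed window, which keeps planned restarts away from the hours when the business needs the server.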

I often hear people state that because they run operating system X rather than Y they no longer need to reboot, but this is simply not true.  There are two main reasons to reboot on a regular basis: to verify the ability of the server to reboot successfully and to apply patches that cannot be applied without rebooting.

Applying patches is why most businesses reboot.  Almost all operating systems receive regular updates that require rebooting in order to take effect.  As most patches are released for security and stability purposes, especially those requiring a reboot, the importance of applying them is rather high.  Making a server unnecessarily vulnerable just to maintain uptime is not wise.

Testing a server’s capacity to reboot successfully is what is often overlooked.  Most servers have changes applied to them on a regular basis.  Changes might be patches, new applications, configuration changes, updates or similar.  Any change introduces risk.  Just because a server is healthy immediately after a change is applied does not mean that the server or the applications running on it will start as expected on reboot.

If the server is never rebooted then we never know if it can reboot successfully.  Over time the number of changes applied since the last reboot will increase.  This is very dangerous.  What we fear is a large number of changes having been made, possibly many of them undocumented, and a reboot then failing.  At that point identifying which change is causing the system to fail could be an insurmountable task.  No single change to roll back, no known path to recoverability.  This is when panic sets in.  Of course, a box that is never rebooted intentionally is more likely to reboot unintentionally, meaning that a failed reboot is both more likely to occur and more likely to occur while the system is in active use.

Regular reboots are not intended to reduce the frequency of failed reboots; in fact, they actually increase the occurrence of failures.  The purpose is to make those failures easily manageable from a “known change” standpoint and, more importantly, to control when those reboots occur.  They should happen at a time when the server is designated as being available for maintenance and is meant to be stressed, so that problems are found when they can be mitigated without business impact.

I have heard many a system administrator state that they avoid weekend reboots because they do not want to be stuck working on Sundays due to servers failing to come back up after rebooting.  I have been paged many a Sunday morning from a failed reboot myself, but every time I receive that call I feel a sense of relief.  I know that we just caught an issue at a time when the business is not impacted financially.  Had that server not been restarted during off hours, it might not have been discovered to be “unbootable” until it had failed during active business hours and caused a loss of revenue.

Thanks to regular weekend reboots, we can catch pending disasters safely and, thanks to knowing that we only have one week’s worth of changes to investigate, we are routinely able to fix the problems with generally little effort and great confidence that we understand what changes had been made prior to the failure.
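The “one week’s worth of changes” argument is easy to see with a toy change log.  The sketch below assumes changes are recorded as simple pairs of timestamp and description, which many shops will not have in such a tidy form.  The function `changes_since_reboot` and all of the sample entries are hypothetical.

```python
from datetime import datetime
from typing import List, Tuple

Change = Tuple[datetime, str]  # (when applied, what was changed)


def changes_since_reboot(changes: List[Change], last_reboot: datetime) -> List[str]:
    """Return, oldest first, every change applied after the last successful reboot."""
    return [desc for when, desc in sorted(changes) if when > last_reboot]


# Made-up change log for illustration.
log: List[Change] = [
    (datetime(2024, 1, 2, 10, 0), "kernel patch"),
    (datetime(2024, 1, 10, 9, 0), "new monitoring agent"),
    (datetime(2024, 1, 18, 14, 0), "firewall rule change"),
]

# Rebooted on the most recent Sunday versus not rebooted since before the log began.
weekly = changes_since_reboot(log, datetime(2024, 1, 14, 3, 0))
never = changes_since_reboot(log, datetime(2023, 12, 1, 3, 0))
```

With a weekly reboot there is a single suspect to investigate.  Without one, every entry in the log is a candidate, and that assumes every change was logged in the first place.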

Regular reboots are about protecting the business from outages and downtime that can be mitigated through very simple and reliable processes.