Category Archives: Best Practices

The Weakest Link: How Chained Dependencies Impact System Risk

When assessing system risk scenarios it is very easy to overlook “chained” dependencies.  We are trained to look at risk at a “node” level asking “how likely is this one thing to fail.”  But system risk is far more complicated than that.

In most systems there are some components that rely on other components. The most common place that we look at this is in the design of storage for servers, but it occurs in any system design.  Another good example is how web applications need both application hosts and database hosts in order to function.

It is easiest to explain chained dependencies with an example. We will look at a standard virtualization design with SAN storage to understand where failure domain boundaries and chained dependencies exist, and what role redundancy plays in system-level risk mitigation.

In a standard SAN (storage area network) design for virtualization you have virtualization hosts (which we will call the “servers” for simplicity), SAN switches (switches dedicated to the storage network) and the disk arrays themselves. Each of these three “layers” is dependent on the others for the system, as a whole, to function. If we have the simplest possible setup, with one server, one switch and one disk array, we very clearly have three devices representing three distinct points of failure. Any one of the three failing causes the entire system to fail. No one piece is useful on its own. This is a chained dependency, and the chain is only as strong as its weakest link.
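To put rough numbers on this, here is a minimal sketch of the arithmetic (the 99% availability figures are assumed purely for illustration, not taken from any real device data):

    # Assumed, illustrative availability of each single device in the chain.
    server = 0.99
    switch = 0.99
    disk_array = 0.99

    # In a chained dependency every link must work, so availabilities multiply.
    system = server * switch * disk_array
    print(f"Chained system availability: {system:.4f}")  # ~0.9703

Even with each device at an assumed 99%, the chain as a whole drops to roughly 97% – less reliable than any one of its links.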

In our simplistic example, each device represents a failure domain. We can mitigate risk by improving the reliability of each domain. We can add a second server and implement a virtualization-layer high availability or fault tolerance strategy to reduce the risk of server failure. This improves the reliability of one failure domain but leaves two untouched and just as risky as they were before. We can then address the switching layer by adding a redundant switch and configuring a multi-pathing strategy to handle the loss of a single switching path, reducing the risk at that layer. Now two failure domains have been addressed. Finally we have to address the storage failure domain, which is done, similarly, by adding redundancy through a second disk array that is mirrored to the first and able to fail over transparently in the event of a failure.
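Extending the same illustrative sketch (again, all figures are assumed), redundancy within a domain helps because the domain now fails only when both of its devices fail at once – though this simple arithmetic quietly assumes independent failures and perfect, instantaneous failover:

    # Assumed availability of one device and of a redundant pair of them.
    single = 0.99
    redundant_domain = 1 - (1 - single) ** 2   # domain fails only if both devices fail
    print(f"Redundant domain availability: {redundant_domain:.4f}")  # ~0.9999

    # Chaining three such domains still multiplies the remaining risks together.
    system = redundant_domain ** 3
    print(f"Chained system availability: {system:.4f}")  # ~0.9997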

Now that we have beefed up our system, we still have three failure domains in a dependency chain.  What we have done is made each “link” in the chain, each failure domain, extra resilient on its own.  But the chain still exists.  This means that the system, as a whole, is far less reliable than any single failure domain within the chain is alone.  We have made something far better than where we started, but we still have many failure domains.  These risks add up.

What is difficult in determining overall risk is that we must assess the risk of each item, then determine the new risk after mitigation (through the addition of redundancy) and then find the cumulative risk of all of the failure domains together in a chain to determine the total risk of the entire system. It is extremely difficult to determine the risk within each failure domain, as the manner of risk mitigation plays a significant role. For example, a cluster of storage disk arrays that fails over too slowly may result in an overall system failure even when the storage cluster itself appears to have worked properly. Even defining a clear failure can therefore be challenging.
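As a trivial, hedged illustration of that kind of boundary failure (the timings below are invented for the example), a “successful” failover can still take down the chain if it takes longer than the layer above it is willing to wait:

    # Assumed, illustrative timings in seconds.
    storage_failover_time = 45       # how long the array cluster takes to fail over
    hypervisor_storage_timeout = 30  # how long the layer above tolerates lost storage

    if storage_failover_time > hypervisor_storage_timeout:
        # The storage domain "worked" by its own definition, yet the system still failed.
        print("Domain boundary failure: workloads crash despite a clean storage failover")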

It is often tempting to take a “from the top” view of risk, which is very dangerous but very common among people who are not regular risk assessment practitioners. The tendency here is to look only at the “top most” failure domain – generally the servers in a case like this – and to ignore any risks that sit beneath that point, considering those to be “under the hood” rather than part of the risk assessment. It is easy to ignore the more technical, less exposed and more poorly understood components like networking and storage and focus on the relatively easy to understand and heavily marketed reliability aspects of the top layer. This “top view” means that the risks under the top level are obscured and generally ignored, leading to high risk without a good understanding of why.

Understanding the concept of chained dependencies explains why complex systems, even with complex risk mitigation strategies, often end up being far more fragile than simpler systems. In our above example, we could do several things to “collapse” the chain, resulting in a more reliable system as a whole.

The most obvious component that can be collapsed is the networking failure domain. If we were to remove the switches entirely and connect the storage directly to the servers (not always possible, of course) we would effectively eliminate one entire failure domain and remove a link from our chain. Now instead of three links, each of which has some potential to fail, we have only two. Simpler is better, all other things being equal.

We could, in theory, also collapse the storage failure domain by going from external storage to storage local to the servers themselves, essentially taking us from two failure domains down to a single one. The one remaining domain, of course, carries more complexity than it did before the collapsing, but the overall system complexity is greatly reduced. Again, this is with all other factors remaining equal.
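Keeping the same assumed 99.99% per-domain figure from the earlier sketch, the effect of collapsing links shows up directly in the arithmetic:

    # Assumed availability of one well-mitigated failure domain.
    domain = 0.9999

    three_link_chain = domain ** 3  # servers + switches + storage
    two_link_chain = domain ** 2    # storage attached directly to the servers
    single_domain = domain          # storage local to the servers themselves

    print(f"Three links: {three_link_chain:.6f}")  # ~0.999700
    print(f"Two links:   {two_link_chain:.6f}")    # ~0.999800
    print(f"One domain:  {single_domain:.6f}")     # 0.999900

All other factors being equal (which, of course, they rarely are), every link removed from the chain removes one more opportunity for the system as a whole to fail.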

Another approach to consider is making single nodes more reliable on their own. It is trendy today to look at larger systems and approach risk mitigation that way, by adding redundant, low cost nodes to add reliability to failure domains. But traditionally this was not the default path taken to reliability. It was far more common in the past, as shown in the former prevalence of mainframes and similarly classed systems, to build high degrees of reliability into a single node. Mainframe and high end storage systems, for example, still do this today. This can actually be an extremely effective approach but fails to address many scenarios and is generally extremely costly, often magnified by a need to have systems partially or even completely maintained by the vendor. This tends to work out only in special niche circumstances and is not practical on a more general scope.

So in any system of this nature we have three key risk mitigation strategies to consider: improve the reliability of a single node, improve the reliability of a single domain or reduce the number of failure domains (links) in the dependency chain. Putting these together, as is prudent, can help us achieve the risk mitigation level appropriate for our business scenario.

Where the true difficulty exists, and will remain, is in the comparison of different risk mitigation strategies. The risk of a single node can generally be estimated with some level of confidence. A redundancy strategy within a single domain is far harder to estimate – some redundancy strategies are highly effective, creating extremely reliable failure domains, while others can actually backfire and reduce the reliability of a domain! The complexity that often comes with redundancy strategies is never without caveat and, while it will typically pay off, it rarely carries the degree of reliability benefit that is initially expected. Estimating the risk of a dependency chain is therefore that much more difficult, as it requires a clear understanding of the risks associated with each of the failure domains individually as well as an understanding of the failure opportunities existing at the domain boundaries (like the storage failover delay failure noted earlier.)

Let’s explore the issues around determining risk in two very common approaches to the same scenario, building on what we have discussed above.

Two extreme examples of the same situation we have been discussing are a single server with internal storage used to host virtual machines versus a six device “chain” with two servers using a high availability solution at the server layer, two switches with redundancy at the switching layer and two disk arrays providing high availability at the storage layer. If we change any large factor here we can generally provide a pretty clear estimate of relative risk – if any of the failure domains lacks reliable redundancy, for example, we can pretty clearly determine that the single server is the more reliable overall system, except in cases where an extreme amount of reliability is built into a single node, which is generally an impractical strategy financially. But with each failure domain maintaining redundancy we are forced to compare the relative risks of intra-domain reliability (the redundant chain) vs. inter-domain reliability (the collapsed chain, the single server.)
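There is no way to pin these numbers down in practice, but a hedged sketch (every figure below is an assumption chosen only to show the shape of the comparison, not a measurement) makes the trade-off visible:

    # Assumed availabilities – illustrative only.
    single_server = 0.999         # one quality server with local storage
    redundant_domain = 0.9999     # each redundant layer, assuming ideal failover behavior

    ideal_chain = redundant_domain ** 3
    # If complexity or human error degrades even one layer slightly, the picture flips.
    degraded_chain = (redundant_domain ** 2) * 0.998

    print(f"Single server:            {single_server:.4f}")   # 0.9990
    print(f"Ideal redundant chain:    {ideal_chain:.4f}")     # ~0.9997
    print(f"Degraded redundant chain: {degraded_chain:.4f}")  # ~0.9978

Under these assumed numbers the redundant chain wins by only a fraction of a percent when everything behaves, and loses outright if a single layer misbehaves – which is exactly why this comparison is so hard to make with any confidence.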

With the two entirely different approaches there is no reasonable way to precisely assess the comparative risks of the two means of risk mitigation. It is generally accepted that the six (or more) node approach with extensive intra-domain risk mitigation is the more reliable of the two, and this is almost certainly true in general. But it is not always true, and rarely does this approach outperform the single node strategy by a truly significant margin, while commonly costing four to ten times as much as the single server strategy. That is potentially a very high cost for what is likely a small gain in reliability and a small potential risk of a loss in reliability. Each additional piece of redundancy adds complexity that a human must implement, monitor and maintain, and with complexity and human interaction comes more and more risk. Avoiding human error can often be more important than avoiding mechanical failure.

We must also consider the cost of recovery.  If failure is to occur it is generally trivial to recover from the failure of a simple system.  An extremely complex system, having failed, may take a great degree of effort to restore to a working condition.  Complex systems also require much broader and deeper degrees of experience and confidence to maintain.

There is no easy answer to determining the reliability of systems. Modern information delivery systems are simply too large and too complex, with too many indeterminable factors, to be evaluated in all cases. With a good understanding of chained dependencies, however, and an understanding of risk mitigation strategies, we can take practical steps to determine rough relative risk levels, see how similar risk scenarios compare in cost, identify points of fragility, recognize failure domains and dependency chains, and appreciate how changes in system design will move us clearly towards or away from reliability.

Choosing Software Versions for Deployment

Something that I see discussed very often in IT circles is “which version of software should I install?” This could apply to a database, an application, firmware or, probably most often, an operating system, and with the upcoming end of support life for Windows XP the topic has reached a fever pitch.

There are effectively two sides to this discussion. One side believes that the latest and, presumably, greatest software should always be used. The other believes that software needs time to mature, taking a “wait and see” approach, or even considers each version to be a different product rather than part of a continuum of development.

Both approaches have their merits and neither should exist completely without the other. Updating software willy-nilly is not wise, and avoiding patches and updates without reason is not wise either. Careful consideration of the factors and empathy for the software development process are important to keep in mind when making these decisions.

First, there are two completely different scenarios to consider. One is the updating of current, existing software. The assumption here is that the current state of things is “working,” with the accepted possibility that “working” might include a security exposure that has been discovered and requires updating in order to close. The other scenario is a new deployment where there is nothing currently and we are starting from scratch.

Let’s start with the second case, as it is far easier to provide guidance on.

In the case of new software deployments (or new operating systems), always use the current, most recent version of the software unless there is a clearly known technology limitation preventing it – such as known bugs or software incompatibilities.

Software is not like other types of products, especially not in today’s world of online patch releases and updates. I assume that the mentality that old versions of software might be preferable to current ones comes from a combination of physical products (watches, cars, dishes, furniture, wine) where a specific year or model might be superior to a newer model for various reasons and from legacy software delivery modes where finished software products were just “thrown over the wall” and the final state was, quite simply, the final state without any reasonable opportunities for updates, patches or fixes. Neither of these cases applies to modern business software (with only the rarest of exceptions.)

Software development is roughly a continuum. Normal development processes have new software being built on top of old software either directly (by creating updates to an existing code base) or indirectly (by rebuilding based on knowledge gained from having built a previous version of the software.) The idea is that each subsequent version of software is superior to the one preceding it. This is not guaranteed, of course – there are such things as regression errors and just plain bad development – but by and large software improves over time, especially when we are talking about enterprise class software used in businesses and under active development. New software is not just the next phase of the old software; it also represents, in nearly all cases, the current state of patches, bug fixes, updates and, when necessary, changes in approach or technique. New software, coming from quality shops, is almost exclusively better than old software. Software evolves and matures.

Beyond the quality of software itself, there is the concept of investing in the future. Software is not something that can sit on the shelf forever. It needs to stay, to some degree, up to date or it stops functioning because the platform that it runs on changes, some new artifact comes to light, security holes are discovered or needs change. Installing old software means that there is an investment in the past, an investment in installing, learning, using and supporting old technology. This is called “technical debt.” This old technology might last for years or even decades, but old software loses value over time and becomes increasingly expensive to support both for the vendors, if they continue to support it, and for the end users, who have to support it.

The same concept of technical debt applies to the software vendors in question. There is a very large cost in creating software and especially in maintaining multiple versions of that software. Software vendors have a lot of incentive to reduce support for older versions to focus resources on current software releases (this is a major reason why SaaS deployments are so popular – the vendor controls the available versions and can eliminate legacy versions through updates.) If customers require support for old versions, the cost must be absorbed somewhere, and often it is absorbed both as a monetary impact to all customers and as a decrease in focus on the new product, as development teams must be split between patching old versions and developing the new. The more effort that must go into old versions, the less effort that can be put into new improvements.

Within the framework of what I have already said, it is important to talk about code maturity. Often code maturity is stated as a reason for deploying “old code”, but I think that this is an IT misunderstanding of software development processes. If we think about a released line of code, just because it is released and in use does not really make it more mature. Code does not change in the wild, it just sits there. Its maturity is “locked” on the day that it is released. If it is patched, then yes, it would “mature” post release. A later version of the same software, based on the same code base but more up to date, is truly the more “mature” code, as it has been reviewed, updated, tested, etc. to a greater degree than the early release of the same code.

This is counterintuitive compared to, say, a car, where each release is a fresh thing with new opportunities for mechanical problems and different reliability concerns – where waiting a few years gives you a chance to see what reliability issues get uncovered. Software is not like this. So the concept of wanting more mature software would push you to deploy the “latest and greatest” rather than the “tried and true.”

If we think of software version numbers rather like ages, this comes through. Linux 3.1 is much older, in terms of how long the code has been maturing, than Linux 2.4 – it has roughly a decade of additional development behind it.

Let’s use a real world example that is very relevant today. You are in a shop about to install your first server(s). Windows Server 2012 R2 has just been released. Should you install Windows Server 2008, Server 2008 R2 (2009), Server 2012 or Server 2012 R2 (late 2013)?

To many shops, this sounds like we are talking about somewhere between two and four entirely different products, each probably with its own reasons for being chosen. This, by and large, is untrue. Each newer version is simply an upgrade, update, patch and feature increase on the previous one. Each one, in turn, is more advanced and mature than the one preceding it. Each new version benefits from the work done on the original release of its predecessor as well as the bug fixes, patches and feature additions done in the interim between the original release and the successor release. Each new release is, in reality, a “minor release” of the one before it. If we look at the kernel revision numbers, instead of the marketing names of the releases, it might make more sense.

Windows Server 2008 was Windows NT 6.0. Windows Server 2008 R2 was Windows NT 6.1, obviously a minor revision or even a “patch” of the previous release. Windows Server 2012 was Windows NT 6.2 and our current Windows Server 2012 R2 is Windows NT 6.3. If we were to use the revision numbers instead of the marketing names, it sounds almost crazy to intentionally install an old, less mature, less updated and less patched version. We want the latest updates, the latest bug fixes and the latest security issues to have been addressed.

For new software deployments, the newer the software installed, the better the opportunity to leverage the latest features and the more time before inevitable obsolescence takes its toll. All software ages, so installing newer software gives the best chance that the software will last for the longest time. It provides the best flexibility for the unknown future.

Following this line of thinking might lead us to feel that deploying pre-release or even beta software would make sense as well. And while there might be specific cases where this does make sense, such as in “test groups” to check out software before releasing it to the company at large, in general it does not. The nature of pre-release software is that it is not supported and may contain code which never will be supported. Using such code in isolation can be beneficial, but for general use it is not advised.  There are important processes that are followed between preview or beta releases and final releases of code no matter what maturity level the overall product is at.

That brings us to the other situation, the one in which we are updating existing software. This, of course, is a completely different scenario to a fresh install and there are many, many more factors involved.

One of the biggest factors for most situations is that of licensing. Updating software regularly may incur licensing fees that need to be factored in to the benefits and cost equation. Some products, like most open source software, do not have this cost and can be updated as soon as new versions are available.

The other really large factor in updating software is the human effort cost of updating – unlike a fresh installation, where the effort of installing is effectively a break-even between old software and new. In reality, new software tends to be easier to install than old software simply due to improvements and advancements. Maintaining a single version of software for a decade means that resources were not dedicated, during that time, to upgrade processes. Upgrading annually during that time means that resources were used ten times to enact separate upgrades. That makes updating much harder to cost justify. But there is more than just the effort of the update process itself; there is also the continuous training needed for end users who will be forced to experience more changes, more often, through constant upgrades.

This might make updating software sound like a negative, but it is not. It is simply an equation where each side needs to be weighed. Regular updates often mean small, incremental changes rather than large leaps allowing end users to adapt more naturally. Regular updates mean that update processes are often easier and more predictable. Regular updates mean that technical debt is always managed and the benefits of the newer versions which may be features, efficiencies or security improvements, are available sooner allowing them to be leveraged for a longer period of time.

Taking what we have learned from the two scenarios above, however, there is another important takeaway to be found here. Once the decision to perform an update has been made, the question is often “to what version do we update?” In reality, every update that is more than a standard patching process is really like a miniature “new software” buying decision, and the logic behind why we “always” install the newest available version when doing a fresh install also applies here. So when performing an update, we almost always should be updating as far as we can – hopefully to the current version.

To apply the Microsoft example again, we can take an organization that has Windows XP deployed today. The business decides to invest in an update cycle to a newer version, not just continued patching. There are several versions of the Windows desktop platform that are still under active support from Microsoft: Windows Vista, Windows 7, Windows 8 and Windows 8.1. Updating to one of the less current versions results in less time before that version’s end of life, which increases organizational risk; using older versions means continued investment in already old technologies, which means an increase in technical debt and less access to new features which may prove to be beneficial once available. In this particular example, the newer versions are also considered to be more secure and require fewer hardware resources.

Every business needs to find the right balance for its existing software update cycles. Every business and every software package is different. Enterprise software like Microsoft Windows, Microsoft Office or an Oracle Database follows these models very well. Small software projects and those falling near the bespoke range may have a more dynamic and unpredictable release cycle but generally will still follow most of these rules. Consider applying empathy to the software development process to understand how you and your software vendor can best partner to deliver the greatest value, and combine that with your need to reduce technical debt to leverage your software investment in the best possible way for your organization.

But the rules of thumb are relatively easy:

When deploying new or updating, shoot for the latest reasonable version of software.  Use any deployment opportunity to eliminate technical debt as much as possible.

When software already exists, weigh factors such as human effort, licensing costs, environmental consistency and compatibility testing against benefits in features, performance and technical debt.

Hello, 1998 Calling….

Something magic seems to have happened in the Information Technology profession somewhere around 1998.  I know, from my own memory, that the late 90s were a special time to be working in IT.  Much of the architecture and technology that we have today stems from this era.  Microsoft moved from their old DOS products to Windows NT based, modern operating systems.  Linux became mature enough to begin appearing in business.  Hardware RAID became common, riding on the coattails of Intel’s IA32 processors as they finally became powerful enough for many businesses to use seriously in servers.  The LAN became the business standard and all other models effectively faded away.  The Windows desktop became the one and only standard for regular computing and Windows servers were rapidly overtaking Novell as the principal player in LAN-based computing.

What I have come to realize over the last few years is that a large chunk of the communal wisdom of the industry appears to have been adopted during these formative and influential years of the IT profession and has since passed into myth – much like the teachings of Aristotle, who was for millennia considered to be the greatest thinker of all time and not to be questioned, stymieing scientific thought and providing a cornerstone for the dark ages.  A foundation of “rules of thumb” used in IT has passed from mentor to intern, from professor to student, from author to reader over the past fifteen or twenty years, many of them being learned by rote and treated as infallible truths of computing without any thought going into the reasoning and logic behind the initial decisions.  In many cases so much time has come and gone that the factors behind the original decisions are lost or misunderstood, as those hoping to understand them today lack firsthand knowledge of computing from that era.

The codification of IT in the late nineties happened on an unprecedented scale, driven primarily by Microsoft’s sudden lurch from lowly desktop maker to server and LAN ecosystem powerhouse.  When Microsoft made this leap with Windows NT 4 they reinvented the industry – a changing of the guard, with an entirely new generation of SMB IT Pros being born and coming into the industry right as this shift occurred.  These were the years leading up to the Y2K bubble, with the IT industry swelling its ranks as rapidly as it could find moderately skilled, computer-interested bodies.  This meant that everything had to be scripted (steps written on paper, that is) and best practices had to be codified to allow those with less technical background and training to work.  A perfect environment for Microsoft and its NT server product, with its “never before seen” level of friendliness.  All at once the industry was full of newcomers without historical perspective, without the training and experience, and with easy to use servers with graphical interfaces making them accessible to anyone.

Microsoft leapt at the opportunity and created a tidal wave of documentation, best practices and procedures to allow anyone to get basic systems up and running quickly, easily and, more or less, reliably.  To do this they needed broad guidelines that were applicable in nearly all common scenarios, they needed them written in clear published form and they needed to guarantee that the knowledge was being assimilated.  Microsoft Press stepped in with the official publications of the Microsoft guidelines and, right on their heels, Microsoft’s MCSE program came into the spotlight, totally changing the next decade of the profession.  There had been other industry certifications before the MCSE, but the Windows NT 4 era and the MCP / MCSE certification systems were the game changing events of the era.  Soon everyone was getting boot camped through certification, quickly memorizing Microsoft best practices and recommendations, learning them by rote and getting certified.

In the short term, the move did wonders for providing Microsoft an army of minimally skilled, but skilled nonetheless, supporters who had their own academic interests aligned with Microsoft’s corporate interest, forming a symbiotic relationship that completely defined the era.  Microsoft was popular because nearly every IT professional was trained on it, and nearly every IT professional encouraged the adoption of Microsoft technologies because they had been trained and certified on it.

The rote guidelines of the era touched many aspects of computing – many are probably still unidentified to this day, so strong was the pressure that Microsoft (and others) put on the industry at the time.  Most of today’s concepts of storage and disk arrays, filesystems, system security, networking, system architecture, application design, memory, swap space tuning and countless others all arose during this era and passed, rather quickly, into lore.  At the time we were aware that these were simply rules of thumb, subject to change just as they always had been based on changes in the industry.  Microsoft, and others, tried hard to make it clear what underlying principles created the rules of thumb.  It was not their intention to create a generation having learned by rote, but it happened.

That generation went on to be the effective founding fathers of modern LAN management.  In the small and medium business space the late 1990s represented the end of the central computer and remote terminals design, the Internet became ubiquitous (providing the underpinnings for the extensive propagation of the guidelines of the day), Microsoft washed away the memory of Novell and LANtastic, Ethernet over twisted pair completely abolished all competing technologies in LAN networking, TCP/IP beat out all layer three networking competitors and more.  Intel’s IA32 processor architecture began to steal the thunder from the big RISC processors of the previous era and from the obscure sixteen and thirty two bit processors that had been attempting to unseat Intel for generations.  The era was defining to a degree that few who have come since will ever understand.  Dial up networking gave way to always-on connections.  Disparate networks that could not communicate with each other lost to the Internet and a single, global networking standard.  Vampire taps and hermaphrodite connectors gave in as RJ45 connectors took the field.  The LAN of 1992 looked nothing like the LAN of 1995.  But today, what we use, while faster and better polished, is effectively identical to the computing landscape as it was by around 1996.

All of this momentum, whether intentional or accidental, created an unstoppable force of myth driving the industry.  Careers were built on this industry wisdom taught around the campfire at night.  One generation clings to its established beliefs, no longer knowing why it trusted those guidelines or whether they still apply.  Another generation is being taught them with little way of knowing that these are distilled rules of thumb meant to be taught alongside background knowledge and understanding – rules designed not only for a very specific era, roughly the band from 1996 to 1999, but also, in a great many cases, for very specific implementations or products, generally Windows 95 and Windows NT 4 desktops and Windows NT 4 servers.

Today this knowledge is everywhere.  Ask enough questions and even young professionals still at university or doing a first internship are likely to have heard at least a few of the more common nuggets of conventional IT industry wisdom.  Sometimes the recommendations, applied today, are nearly benign, representing little more than inefficiency or performance waste.  In other cases they may represent pretty extreme degrees of bad practice today, carrying significant risk.

It will be interesting to see just how long the late 1990s continue to so vastly influence our industry.  Will the next generation of IT professionals finally issue a broad call for deep understanding and question the rote learning of past eras?  Will misunderstood recommendations still be commonplace in the 2020s?  At the current pace, it seems unlikely that the thinking of the industry will change significantly prior to 2030.  IT has been attempting to move from its wild west days, with everyone distilling raw knowledge into practical terms on their own, to large scale codification like other, similar fields such as civil or electrical engineering.  But the rate of change, while tremendously slowed since the rampant pace of the 70s and 80s, still remains so high that the knowledge of one generation is nearly useless to the next and only broad patterns, approaches and thought processes have great value to be taught mentor to student.  We may easily face another twenty years of the wild west before things begin to really settle down.

The Smallest IT Department

Working with small businesses means working with small IT shops.  It is very common to find the “one man” shows and I am often in discussions about how to handle environments so small.  There is no easy answer.  Unlike most company departments or job roles, IT is almost always an “around the clock” job that services the fundamental “plumbing” of the business – the infrastructure on which everything else depends.  Normal departments like finance, human resources, legal, management or marketing tend to knock off at the end of the day, leave an hour early on Fridays, go completely offline during the weekend, take normal vacations with little or no office contact, require little ongoing education or training once they are established and almost never have to worry about being expected to spend their nights or weekends doing their work to avoid interrupting others – but this is exactly how IT departments need to function.  IT staffs don’t reminisce about that “one time” that things were so bad at work that they had to work through the whole weekend or a full overnight and still work the next day, or had to give up their family vacation because the company made no allowance for it operationally – that is simply day to day life for many people in IT.  What other departments often feel is completely unacceptable is just normal practice in IT.  But that doesn’t mean that it works well; IT departments are often driven into the ground and little consideration is given to their long term viability or success.

With rare exception, IT departments have needs that are different from normal departments – based primarily on what businesses demand from them: high reliability, continuous availability, deep business knowledge of all departments, the ability to train others, knowledge of broad and disparate technologies, business skills, financial skills, procurement skills, travel, experience across technologies and industries, efficiency, and currency on the latest technologies, trends, architectures, techniques, threats and products arriving daily.  They are expected not only to use all of that skill and experience in a support role but also to be productive engineers and customer service representatives and to present and defend recommendations to management that often pushes back or provides erratic or emotional support of infrastructural needs.  Quite literally, no single person can possibly fill those shoes, and one who could would demand a salary higher than the revenue of most small businesses.

How do larger businesses handle this daunting task?  They do so with large IT departments filled with people who specialize in specific tasks, generalists who glue specialists together, dedicated support people who don’t need to do engineering, engineers who don’t get support interruptions, tiered support roles to filter tasks by difficulty, mentors to train newcomers, career pipelines, on call schedules or follow the sun support desks and internal education systems.  The number of challenges presented to a lone IT professional or very small IT department is nearly insurmountable, forcing corners to be cut nearly everywhere, often dangerously.  There is neither the time nor the resources for tiny IT departments to handle the scope of the job thrown at them.  Even if the job is whittled down to a very specific role, SMB IT professionals are often faced with decision making for which they cannot be prepared.  For example, a simple server failure might be seen as just another “hardware upgrade” task because the overworked, under-scoped IT professional isn’t being given the latitude to flag management about an arising opportunity for strategic roadmap execution – maybe a complete departure from previous plans due to a late breaking technology change, a chance to consolidate systems for cost savings, or a tactical upgrade or change of platform that might deliver unrealized features.

Having worked both in the trenches and in management, I believe that there are two thresholds that need to be considered.  One is the minimum functional IT department size.  That is, the minimal size at which an internal IT department can complete basic job functions using internal staff.  To clarify, “internal staff” can be a rather ambiguous term.  Internal here means dedicated or effectively dedicated staff.  These people can be employees or contractors.  But at a minimum, with the exception of very rare companies that don’t operate during full business hours or other niche scenarios, it takes at least three IT professionals on an IT team to functionally operate as an IT department.

With three people there is an opportunity for peer review, which is very critical in a technical field that is complex at the best of times and a swirling quagmire of unknown requirements, continuous change and insurmountable complexity at the worst of times.  Like any technical field, IT professionals need peers to talk to, to oversee their work, to check their ideas against and to keep them from entering the SMB IT Bubble.  Three is an important number.  Two people have a natural tendency to become adversarial, with one carrying the weight of recommendation to management and one living in their shadow – typically the one with the greater soft skills or business skills gaining the ear of management while the one with the greater technical acumen loses their voice if management isn’t careful to intentionally include them.  As with maritime chronometers, it is critical to have three because three can form a quorum.  Two simply have an argument.

IT is an “around the clock” endeavor.  During the day there are continuous needs from IT end users and the continuous potential for an outage or other disaster, plus meetings, design sessions, planning and documentation.  In the evenings and on weekends there is all of the system maintenance that cannot, or at least should not, be done while the business is up and running.  This is often an extensive amount of work – not an occasional missed happy hour but a regular workload that eliminates dinner and family time.  Then come the emergency calls and outages that happen any time, day or night.  And there is the watching of email – even if nothing is wrong, it is commonplace for IT to be involved in company business twelve to sixteen hours a day, and weekends too, even in very small companies.  Even the most dedicated IT professional will face rapid burnout in an environment such as this without a service rotation to provide necessary rest and work/life balance.

This comes before the considerations for the unforeseeable sick days, emergency leave or even just holidays or vacation.  If there are not enough people left behind to cover the business as usual tasks plus the unforeseeables, then vacations or even sick days become nearly, if not totally, impossible.  Skipping vacations for a year or two is theoretically possible but it is not healthy and doesn’t provide for a sustainable department.

Then there is training and education.  IT is a demanding field.  Running your own IT department suggests a desire to control the level of skill and availability granted to the company.  To maintain truly useful IT staff, time and resources for continuous education are critical.  IT pros at any stage in their career need time to engage in discussions and forums, attend classes and training, participate in user groups, go to conferences and even just sit down and read books and web sites on the latest products, techniques and technologies.  If IT professionals are not given the chance to not just maintain but grow their skills, they will stagnate, gradually become useless technically and likely fall into depression.  A one or two man shop, even in the smallest of organizations, cannot support the free time necessary for serious educational opportunities.

Lastly, and far more critical than it seems at first, is the need to handle request queues.  If issues arise within a business at a continuous, average rate of just enough per day to require eight hours per day to service them, it may seem like only one person would be necessary to handle the queue that this workload would generate.  In an ideal world, perhaps that is true.  In the real world, requests come in at varying degrees of priority and often at very inopportune moments, so that even a business that has taken on the expense of having dedicated, internal IT cannot have the “instant response time” that it often hopes for, because its IT professional is busy on an existing task.  The idea of instant response is based on the assumption that the IT resource is sitting idle, watching the ticket queue or waiting by the phone at all times.  That is not realistic.
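A minimal queueing sketch shows why (the ticket rates and handling times below are assumptions for the example; real queues are far messier).  Using the standard M/M/c waiting-time formula, a lone person running at nearly full utilization leaves requests waiting for hours, while a small team handles the identical workload almost immediately:

    from math import factorial

    def mean_wait_hours(arrival_rate, service_rate, staff):
        """Average time a request waits before work starts (M/M/c / Erlang C model)."""
        load = arrival_rate / service_rate      # offered load (average busy people needed)
        utilization = load / staff
        if utilization >= 1:
            return float("inf")                 # the queue grows without bound
        partial_sum = sum(load ** k / factorial(k) for k in range(staff))
        top = load ** staff / factorial(staff)
        p_wait = top / ((1 - utilization) * partial_sum + top)
        return p_wait / (staff * service_rate - arrival_rate)

    # Assumed: seven one-hour tickets arriving over an eight-hour day (just shy of a full load).
    arrivals_per_hour = 7 / 8
    tickets_per_hour_per_person = 1.0
    for staff in (1, 2, 3):
        wait = mean_wait_hours(arrivals_per_hour, tickets_per_hour_per_person, staff)
        print(f"{staff} person(s): average wait before work begins ~{wait * 60:.0f} minutes")

With these assumed numbers the single person leaves the average request waiting around seven hours, two people cut that to roughly a quarter of an hour, and three people to a couple of minutes – it is the queueing math, not any lack of effort, that defeats the one person shop.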

In large enterprises, to handle the response time concerns of critical environments, surplus IT resources are maintained so that only in the direst of emergencies would all of them be called upon at once to deal with high criticality issues.  There is always someone left behind to deal with another pressing issue should one arise.  This not only allows for low latency response to any important customer need but also provides spare time for projects, learning and the necessary mental downtime needed for the abstract processing of troubleshooting, without which IT professionals in a support role will lose efficiency even if other work does not force them to multitask.

In small shops there is little to be done.  There is a lack of scale to allow for excess IT resource capacity to be sitting in the wings just waiting for issues to arise.  Having three people is, in my opinion, an absolute minimum to allow for the handling of most cases of this nature if the business is small enough.  By having three people there is, we hope, some chance of avoiding continuous re-prioritization of requests, inefficient multi-tasking and context switching.

In larger organizations there is also a separation of duties between administration or support job roles and engineering job roles.  One job is event driven, sitting “idle” waiting for a customer request and then reacting as quickly as possible.  The other is focused on projects and working towards overall efficiency.  These are two very different aspects of IT that are nearly impossible for a single person to tackle simultaneously.  With a three person shop these roles can exist in many cases, even if the roles are temporarily assigned as needed and are not permanent aspects of title or function.

With only three people an IT department still lacks the size and scale necessary to provide a healthy professional growth and training environment internally.  There are not enough rungs on the ladder for IT employees to move up, and only turnover, unlikely to happen in the top slot, allows for any upward mobility – forcing good candidates to leave rapidly for the sake of their careers, leaving good shops with continuous turnover and training and lesser shops with dramatically inferior staff.  There is no simple solution for small organizations.  IT is a broad field with a great many steps on the ladder from helpdesk to CIO.  Top IT organizations have thousands or, in the most extreme cases, hundreds of thousands of IT professionals in a single organization.  These environments naturally have a great degree of both upward and lateral mobility, peer interaction and review, vendor resources, mentoring, lead oversight, career guidance and development, and opportunities to explore new ideas and paths that often don’t exist in SMBs of any size.

To maintain a truly healthy IT department takes a much larger pool of resources.  Likely one hundred or more IT professionals would be required to provide adequate internal peerage, growth and opportunity to begin to provide for career needs, rather than “job needs.”  Realistically, the SMB market cannot bear this at an individual business scale and must accept that the nature of SMB IT is to have high turnover of the best resources and to work with other businesses, typically ones that are not directly competitive, to share or exchange resources.  In the enterprise space, even in the largest businesses, this is often very common – friendly exchanges of IT staff to allow for career advancement, often with no penalties for returning later in their career for different positions at the original company.

Given this bleak picture of SMB IT staff scaling needs, what is the answer?  The reality is that there is no easy one.  SMB IT sits at a serious disadvantage to its enterprise counterparts and, at some scale – especially falling below three dedicated IT staff members – becomes too small to allow for a sustainable work environment in all but the most extreme cases.

In smaller organizations, one answer is turning to consulting, outsourcing and/or managed service providers who are willing and able to work either in the role of internal staff or as a hybrid with existing internal staff to provide an effectively larger IT organization shared between many businesses.  Another is simply investing more heavily in IT resources or using other departments as part time IT to handle helpdesk or other high demand roles, but this tends to be very ineffective, as IT duties tend to overwhelm any other job role.  A more theoretical approach is to form a partnership with another one or two businesses to share in house IT in a closed environment.  This last approach is very difficult and problematic and generally works only when the businesses in question heavily share both technology and geographic location.

More important than providing a simple answer is the realization that IT professionals need a team on which to work in order to thrive, and will perform far better on a healthy team than they will alone.  How this is accomplished depends on the unique needs of any given business.  But the efficacy and viability of the one or two “man” IT shop, for even the smallest businesses, is questionable.  Some businesses are lucky enough to find themselves in a situation where this can work for a few years, but they often live day to day at a high degree of risk and almost always face high turnover, with their entire IT department – a key underpinning of the workings of their entire business – leaving at once, without the benefits of the staggered turnover that a three person or larger shop at least has an opportunity to provide.  With a single person shop there is no handover of knowledge from predecessors, no training and often no opportunity to seek an adequate replacement before the original IT professional is gone, leaving at best an abrupt handover and at worst a long period of time with no IT support at all and no in house skills necessary to interview and locate a successor.