All posts by Scott Alan Miller

Started in software development with Eastman Kodak in 1989 as an intern in database development (making database platforms themselves.) Began transitioning to IT in 1994 with my first mixed role in system administration.

It’s a Field, Not a Road

Over the years I have become aware of a tendency in the Information Technology arena to hold strong expectations about exactly how much someone should know about certain technologies based on their job title and length of time working in IT.  Of course, someone’s current job title and experience level should give you some, if only a little, insight into what they are doing on the job today, but it should rarely give you much insight into what they have done in the past or how they got to where they are today.

There are some abundantly common “paths” through IT, especially in the small and medium business markets, which help to stereotype the advancement of an IT professional over time.  The most common path goes something like this: high school, four-year college degree, one or two basic certifications from CompTIA, entry level helpdesk job, better helpdesk job, deskside support job, basic Microsoft certification, system administrator or IT manager position.  This path is common enough that many people who have taken it simply assume that everyone else in the IT world has done so as well, and this assumption creates a lot of problems in many different areas.

First of all, it must be stated that there is no standard path in IT, not even remotely.  Too often IT professionals, applying their own experiences to their view of other people, see IT as a road when it truly is a field (pun only partially intended.)  IT has no defined entry point into the industry nor exit point from it.  IT is a massive field made up of many different sub-disciplines that follow little, if any, linear progression from one to another.  There are far more lateral moves in IT than there are ladders to climb.

Beyond the completely untrue assumption that specific education and certification requirements exist in order to enter IT, the widely held belief that helpdesk positions are the only entry level IT positions, and that they are only stepping stone jobs, is completely unfounded.  Many, likely most, IT professionals do not enter the field through helpdesk, call centers, or even deskside support, and probably not through any type of Windows-centric support at all.  While end user focused, helpdesk remains only a small percentage of all IT careers and one through which only a portion of IT professionals will pass.  Windows-centric support is one of the most important foci within IT and clearly the most visible to end users and those outside of IT; this high level of visibility can be misleading, however.  It is equally true that helpdesk, call center, deskside support and the like are not exclusively stepping stone jobs; they are career options in their own right.  It is unfortunate that such a high percentage of IT professionals view such positions as inappropriate career goals, because it is widely recognized that a lack of skilled and dedicated people in those specific positions is often what causes the most friction between end users and IT departments.

I have found, on several occasions, hiring managers who refused to hire anyone who was truly interested in helpdesk or deskside support as a career and who enjoyed working with customers, and who only wanted to hire someone who looked down on those roles as necessary evils to be passed through as quickly as possible en route to a more “rewarding” career destination.  I find this sad on many levels.  It implies that the hiring manager lacks empathy for other professionals and does not consider their individual desires or strengths.  It implies that the company in question is institutionalizing a system by which people are hired not to do something that they love or are good at, but only because they are willing to do a job role that they don’t want in the hopes of eventually doing one that they do.  This rules out anyone actually qualified for the desired job, since those people will go straight into those positions.  It almost guarantees, as well, that end user support will be poor, as no one is hired who is specifically good at or interested in that role.  The hiring manager clearly sees end user support as not being a priority, and the entire idea is that anyone going into that role will “succeed” by moving out of it as quickly as possible, leaving end users with a lack of continuity as well as a never-ending cycle of churn.  Believing that IT is a road and not a field has tangible, negative consequences.

Seeing IT careers as a direct path from point A to point B creates an inappropriate set of expectations as well.  It is not uncommon at all for someone to say that anyone with five years of experience in IT must know how to <insert somewhat common Windows desktop task here> based on nothing but their length of time working in IT, completely ignoring the possibility that they have never worked on Windows or in a role that would perform that task.  While Windows is common, many people working in IT have never held those roles and there is no reason to expect that a specific task like that would be known automatically.  This goes beyond the already problematic attitude that the tasks one personally did in a specific job role are the same tasks that everyone in that job role has done.  This is, of course, completely untrue.  A Windows system admin at one company and a Windows system admin at another company, or even just in another department, may do similar tasks or completely different ones.  Even a decade in those roles may produce almost completely unique experiences and skills.  There is just so much potential in IT for doing different things that we cannot make specific task assumptions.

This assumptive process carries over to certifications and education as well.  While many fields succumb to the cliché that anyone above a certain level must have a college education, it is far less common for that assumption to be true in IT.  Few fields find university training to be as optional as IT does, and remembering that alternative means of entering the field exist is critical.  Many of the best and brightest enter IT directly and not through an educational channel.  These candidates are often years ahead of their “educated” counterparts and often represent the most passionate, driven and capable pool of talent; and they are almost certainly the most capable of self-motivation and self-education, both extremely important traits in IT.

Similarly, I was recently introduced, for the first time, to assumptions about certifications.  Certifications are specific to job roles; none applies broadly to all roles, and none makes sense for someone to hold if a higher certification is held or if they never passed through that specific point in that specific job role.  The example that came up was a hiring manager who actually believed that anyone with ten years of experience would be expected to have both an A+ and a Network+ certification.  Both are entry level certifications and not relevant to the vast majority of IT careers (the A+ especially has little broad applicability, while the Network+ is much more general but still effectively entry level.)  While it would not be surprising to find these held by a ten year IT veteran, it would make no sense whatsoever to use them as filtering agents by which candidates are ruled out for lacking them.  This is completely ridiculous.  Those certs are designed only to show rudimentary knowledge in specific IT career paths.  Anyone who has passed that point in their career without needing them would never go back and spend time and money earning entry level certifications while already at a career mid-point.  Once you have a PhD, you don’t go back and get another associate’s degree just to show that you could have done it; the PhD is enough to demonstrate the ability to earn an entry level degree.  And many people with a significant history in the field will have passed the career point where those certs made sense years before the certs even existed (the Network+, for example, did not exist until I had been in IT for more than a decade already!)

I am particularly sensitive to this issue both because I spent several years as a career counselor helping to put IT professionals on a path to career growth and development, and because I myself did not take what is considered to be a conventional path into IT.  I was lucky enough to have interned in software development during my middle and high school years and was offered a position in UNIX support right out of high school.  I never passed through any Windows-centric roles, nor did I ever work on a helpdesk or do deskside support outside of a small amount of work in high end UNIX research labs.  My career took me in many different directions, but almost none followed the paths that so many hiring managers expect.  Attempting to predict the path that one’s career will take in the future is impossible.  Equally, attempting to determine what path must have been taken to have reached a current location is also impossible.  There are simply too many ways to get from point A to point B.

Embracing uniqueness in IT is important.  We all bring different strengths and weaknesses, different ideas and priorities, different goals and different things that we enjoy or despise doing.  The job that one person sees as a necessary evil another will love doing, and that passion for the role will show.  The passionate, career-focused helpdesk professional will bring an entirely different joie de vivre to the job than will someone who feels trapped doing an undesirable job until another opportunity comes along.  This doesn’t mean that the latter will not work hard and try their best, but there is little that can be done to compete with someone passionate about a specific role.

It is also very easy, when we look at IT as a singular path, to forget that individual roles, such as helpdesk, actually have progressions within the role itself.  Often many steps exist within specific roles.  In the case of a helpdesk it is common to refer to these as L0 through L3, and helpdesk team lead and helpdesk manager positions are common as well.  An entire career can be had just within the helpdesk sub-discipline of IT.  There is nothing wrong with entering IT directly into the role type that interests you.  There is also nothing wrong with achieving a place in your career where you are happy to stay.  Everyone has an ideal position, a career position where they both excel at what they do and are happy doing indefinitely.  In most fields, people actually strive to achieve this type of position somewhat early in their careers.  In IT, it is strangely uncommon.

There is a large amount of social pressure within IT to have “ambition” pushing you towards more and more challenging positions within the field.  This is partially because IT is such an enormous, dynamic field that most people really do enter wherever opportunity presents itself and then attempt to maneuver themselves into positions that they find interesting over a period of many years.  This creates, to some degree, a culture of continuous change and advancement expectations.  This is not entirely bad, but it often marginalizes or even penalizes people who manage to find their desired positions, especially if this happens early in their careers, and even more so if it happens in a role which many people see as a “stepping stone” role such as helpdesk or deskside support.  This is not good for individuals, for businesses or for the field in general.  It pushes people into roles where they are not happy and not well suited in order to satisfy social pressures rather than career aspirations or business needs.

Ambition is not necessarily a good thing.  It certainly is not a bad thing.  But too often hiring managers look for ambition when it is not in anyone’s interest.  Hiring someone young or inexperienced in the hopes that they grow over time and move into more and more advanced roles is an admirable goal and can work out great.  But avoiding hiring someone perfectly suited for a role because they will want to stay where they are well suited and where they excel makes no sense at all.  In an ideal world, everyone would be hired directly into the perfect position for them and no one would ever need to change jobs.  This is best for both the employees and the employer. It is rarely possible, but certainly should not be avoided when the opportunity presents itself.

Creating stereotypes and using them to judge IT professionals has negative consequences for everyone.  It increases stress, reduces career satisfaction, decreases performance and lowers the quality of IT service delivery while making it more expensive to provide.  It is imperative that we accept IT as a field, not as a road, and that we also accept that IT professionals are individuals with different goals, different career motivations and different ambitions.  Variety and diversity in IT are far more important than they are in most fields because IT is so large and requires so many different perspectives to perform optimally.  Unlike a road that travels a single, predictable path, a field allows you to wander in many directions and arrive at many different destinations.

The Home Line

In many years of working with the small and medium business markets I have noticed that the majority of SMB IT shops tend toward one of two extremes: they massively overspend, attempting to operate like huge companies by adopting costly technologies that are pointless and unnecessary at the SMB scale, or they spend almost nothing and run technology that is completely inadequate for their needs.  Of course the best answer is somewhere in between – finding the right technologies, the right investments, for the business at hand; some companies manage to work in that space, but far too many go to one of the two extremes.

A tool that I have learned to use over the years is classifying the behavior of a business against the decision making that I would use in a residential setting – specifically my own home.  To be sure, I run my home more like a business than does the average IT professional, but I think that it still makes a very important point.  As an IT professional, I understand the value of the technologies that I deploy, I understand where investing time and effort will pay off, and I understand the long term costs of different options.  So where I make judgement calls at home is very telling.  My home does not have the financial value of a functional business, nor does it have the security concerns, nor the need to scale (my family will never grow in user base size, no matter how financially successful it is), so when comparing my home to a business, my home should, in theory, set the absolute lowest possible bar in regards to the financial benefit of technology investment.  That is to say, the weighing of options for an actual, functional business should always lean towards equal or greater investment in performance, safety, reliability and ease of management than my home receives.  My home should be no more “enterprise” or “business class” than any real business.

One could argue, of course, that I make poor financial decisions in my home and over-invest there for myriad reasons and, of course, there is merit to that concern.  But realistically there are broad standards that IT professionals mostly agree upon as good guidelines, and while many do not follow these at home, whether through a need to cut costs, a lack of IT needs at home or, as is often the case, a lack of buy in from critical stakeholders (e.g. a spouse), most agree as to which ones make sense, when they make sense and why.  The general guidelines as to which technologies at which price points set the absolute minimum bar are by and large accepted, and they constitute what I refer to as the “home line” – the line below which a business cannot argue that it is acting like a business but is, at best, acting like a consumer, hobbyist or worse.  A true business should never fall below the home line; doing so would mean that it considers the value of its information technology investment in the business to be lower than what I consider my investment at home to be.

This adds a further complication.  At home there is little cost to the implementation of technologies.  But in a business, all of the time spent working on technology, and supporting less than ideal decisions, is costly.  Either costly in direct dollars spent, often because IT support is being provided by a third party on a contractual basis, or costly because time and effort are being expended on basic technology support that could be used elsewhere – the cost of lost opportunity.  Neither of these takes into account things like the cost of downtime, data loss or data breach, which are generally the more significant costs that we have to consider.

The cost of the IT support involved is a significant factor.  For a business, there should be a powerful leaning towards technologies that are robust and reliable, with a lower total cost of ownership or a clear return on investment.  In a home there is more room for spending time tweaking products to get them to work, putting up with products that fail often or require lots of manual support, or using products that lack powerful remote management options or centralized controls for user and system management.

It is also important to look at the IT expenditures of any business and ask if the IT support is warranted in light of those investments.  If a business is unwilling to invest in its IT infrastructure an amount equivalent to what I would invest in the same infrastructure for home use, why would it be willing to maintain an IT staff, at great expense, to support that infrastructure?  This is a strange expenditure mismatch, but one that commonly arises.  A business which has little need of full time IT support will often readily hire a full time IT employee but be unwilling to invest in the technology infrastructure that said employee is intended to support.  There seems to be a correlation between businesses that underspend on infrastructure and those that overspend on support – however, a simple reason for that could be that staff in that situation are the most vocal.  Businesses with adequate staff and investment have little reason for staff to complain, and those with no staff have no one to do the complaining.

For businesses making these kinds of tradeoffs, with only the rarest of exceptions, it would make far better financial and business sense not to have full time IT support in house, and instead to move to occasional outside assistance or a managed services agreement at a fraction of the cost of a full time person and invest a portion of the difference into the actual infrastructure.  This should provide far more IT functionality for less money and at lower risk.

I find that the home line is an all around handy tool.  Just a rough gauge for explaining to business people where their decisions fall in relation to other businesses or, in this case, non-businesses.  It is easy to say that someone is “not running their business like a business” but this adds weight and clarity to that sentiment.  That a business is not investing like another business up the street may not matter at all.  But if they are not putting as much into their business as the person that they are asking for advice puts into their home, that has a tendency to get their attention.  Even if, at this point, the decisions to improve the business infrastructure become primarily driven by emotion, the outcome can be very positive.

Comparing one business to another can result in simple excuses like “they are not as thrifty” or “that is a larger business” or “that is a kind of business that needs more computers.”  It is rarely useful for business people or IT people to do that kind of comparison.  But comparing to a single user or single family at home is a much more tangible comparison.  Owners and managers tend to take a certain pride in their businesses, and having it be widely seen that they value their own company lower than a single household is non-trivial.  Most owners or CEOs would be ashamed if their company’s technology investment did not exceed that of an individual IT professional, let alone that individual’s needs plus the needs of the entire business that they oversee.  Few people want to think of their entire company as having less business value than an individual.

This all, of course, brings up the obvious question: what are some of the things that I use at home on my network?  I will provide some quick examples.

I do not use ISP supplied networking equipment, for many reasons.  I use a business class router and firewall unit that has neither integrated wireless nor a switch.  I have a separate switch to handle the physical cabling plant of the house.  I use a dedicated, managed wireless access point.  I have CAT5e or CAT6 professionally wired into the walls of the house so that wireless is only used when needed, with wired as the default for more robust and reliable networking (most rooms have many network drops for flexibility and to support multimedia systems.)  I use a centrally managed anti-virus solution, I monitor my patch management and I never run under an administrator level account.  I have a business class NAS device with large capacity drives and RAID for storing media and backups in the house.  I have a backup service.  I use enterprise class cloud storage and applications.  My operating systems are all completely up to date.  I use large, moderate quality monitors and have a minimum of two per desktop.  I use desktops for stationary work and laptops for mobile work.  I have remote access solutions for every machine so that I can access anything from anywhere at any time.  I have all of my equipment on UPS.  I have even been known to rackmount the equipment in the house to keep things neater and easier to manage.  All of the cables in the attic are carefully strung on J-hooks to keep them neat.  I have VoIP telephony with extensions for different family members.  All of my computers are commercial grade, not consumer.

My home is more than just my residential network; it is an example of how easy and practical it is to do infrastructure well, even on a small scale.  It pays for itself in reliability, and often the cost of the components that I use is far less than that of the consumer equipment often used by small businesses, because I research what I purchase carefully rather than buying whatever strikes my fancy in the moment at a consumer electronics store.  It is not uncommon for me to spend half as much for quality equipment as many small businesses spend for consumer grade equipment.

Look at the businesses that you support or even, in fact, your own business.  Are you keeping ahead of the “home line?”  Are you setting the bar for the quality of your business infrastructure high enough?

Originally published on the StorageCraft Blog.

Should IT Embrace Subscription Licensing?

With big name, traditionally boxed products like Microsoft Office and Adobe’s Creative Suite turning to new subscription licensing models, we in IT have to look into this model and determine if and when it is right for our businesses.  In some cases, like with MS Office, we can choose between boxed products, volume license deals or subscription licenses.  This is very flexible and allows us to consider many alternatives.  With Adobe, however, non-subscription options have been dropped; if we want to use their product line, subscription pricing is our only option.  This will be more and more of a trend as we move forward, and something that the whole industry must face and understand.  It cannot be avoided easily.

First we should understand why subscription models are good for the vendors.  Many people, especially in IT, assume that subscriptions are designed to extract higher fees from customers, and certainly any given vendor may raise prices in conjunction with changing models, but fundamentally subscription pricing is purely a licensing approach and does not imply an increase in cost.  It may, potentially, even mean a decrease.

Software vendors like subscription pricing for three key reasons.

The first is license management.  With traditional software purchases it was trivially easy for customers to install multiple copies of software, perhaps accidentally, causing a loss of revenue when software was used but not licensed.  License management was traditionally complicated and expensive for all parties involved.  Moving to subscription models makes it very easy to clearly communicate licensing requirements and to enforce policies.

For customers purchasing software, this change is actually beneficial as it lowers the overall cost of software by helping to eliminate illegitimate uses.  By lowering the piracy rate, the cost that needs to be passed on to legitimate businesses can be lowered.  Whether this turns into lower cost for customers or higher margins for vendors, it is a benefit to all of the legitimate parties involved.

The second is eliminating legacy versions from support.  In traditional software and support models, customers might use old versions of software for many years, resulting in many different versions requiring support simultaneously.  Often this would mean that support teams would need extensive training for a long tail of legacy customers, or separate support groups would be needed for different software versions.  This was extremely expensive, as support is a key cost in software development.  Likewise, development teams would be split, with most resources focusing on developing or fixing the current software version while some developers were forced to spend time patching and maintaining legacy versions that were no longer being sold.  These costs were often enormous and meant that great energy was being spent to support customers who were not investing in new software, at the expense of resources for improving the software and support for the best customers.  The move to subscription licensing generally eliminates support needs for legacy versions, as all customers move to the latest versions all of the time.

Again, this is a move that greatly benefits both the vendor and good customers.  It is only sometimes a negative for customers who relied on being “expensive to maintain” customers, using old software for a long time rather than updating.  But commonly even those customers benefit from not running old software, even if this is not how they would operate if they had their druthers.  The benefit to the vendor and to “good” customers is very large; the penalty to customers that were formerly not profitable is generally very small.

The third reason, which is really a combination of the above, is that customers who previously depended on buying a single version of a product and continuing to use it for a very long time, likely many years past the end of support, are effectively eliminated.  These customers, lacking a means to buy in this traditional manner, are normally either lost as customers (which is not a financial loss, as they were not very profitable) or they convert to higher profit customers, even if begrudgingly.  This makes vendors very happy – separating the wheat from the chaff, so to speak: cutting loose customers that were not making them money and creating more customers that are.

Now that we have seen why vendors like this model, and why we are likely to see more and more of it in the future as large, leading vendors both demonstrate the financial value of the change and condition customers to think in terms of subscription license models, we will look at why IT departments and businesses should consider embracing this model for their own reasons.

To the business itself, subscription licensing offers some significant value, especially to finance departments.  Through moving to subscription licensing we are generally able to move from capital expenses (capex) to operational expenses (opex), which is generally seen as favorable.  But the value of subscriptions is far larger than that.  Subscription pricing gives cost predictability.  A finance department can accurately predict its costs over time and is rarely surprised, whereas in the old approach software was largely forgotten until some need required an old package to be updated and suddenly a very large invoice would be forthcoming with potentially very little warning (often followed by large re-training expenses due to the possibly large gap in software versions.)  With subscription pricing, costs normally fluctuate fluidly with employee count.  As new employees are hired, the finance department can predict exactly how much they will cost.  And when employees leave, subscriptions can be discontinued and cost reduced.  Only software that is truly used is purchased.  The need to overbuy to account for fluctuations or predicted growth no longer exists.  Subscription licensing also leverages the time-value of money, allowing businesses to hold onto their funds for as long as possible and to pay only for what they use as they use it.

For IT the benefits are even greater.  IT should benefit from having a better relationship with finance and human resources as the costs and needs of incoming or outgoing users are better understood.  This eliminates some of the friction between these departments, which is always beneficial.

IT also benefits from the effective enforcement of best practices.  It is common for IT departments to struggle to convince businesses to invest in newer versions of software, which often results in support issues, unnecessary complexity and less than happy users.  With subscription pricing, IT is constantly supplied with the latest software for users, which, in nearly all cases, is an enormous benefit both to IT and to the users of the software.  This eliminates much of the friction that IT experiences with the business and with management by making updates an external mandate and no longer something that IT or the users must request.

IT benefits from easier license management on their end as well.  It is generally far easier to determine license availability and need.  Audits are unnecessary because the licensing process is generally handled (though nothing technically requires this) via an authentication mechanism with the vendor, which means that unless specific effort is taken to violate licensing (cracking software or some other extreme measure), licensing accidents are unlikely and easy to correct.

IT may also benefit from being better able to handle complex licensing situations, such as providing a higher feature set level for one user and not for another.  Licenses can often be purchased at a minimum level and upgraded if more needs are discovered.  The ability to easily customize per user and over time means that IT can deliver more value with less effort.

Many of the objections to subscription licensing are not actually objections to subscription licensing itself.  Often it is a perception of higher cost.  This is, of course, difficult to prove since any given company may choose to charge whatever they want for different license options.  Microsoft offers both subscription and non-subscription license options for some of their key products, such as MS Office.  This gives us a chance to see how they view the cost differences and benefits, and to compare the options so that we can find the most cost effective option for our own business.  By keeping both models, Microsoft can be audited by their customers to keep the costs of each model in line.  However, by offering both they also lose many of the benefits that pure subscription models bring, such as needing to support only a single version at a time.

Adobe, on the other hand, made the switch from traditional licensing to subscription licensing basically all at once and appears to have decided to raise prices at the same time.  This is very misleading: Adobe actually raised the price; it is not the subscription model creating the price increase.  The benefits of subscription pricing are benefits of the model.  The pricing decisions of any given vendor are a separate matter and must be evaluated in the same way that any pricing evaluation is done.

The other complaint that I have heard many times is an inability to “own” software.  This is a natural reaction, but one that IT and business units should not have.  In a business setting software is not owned by people and we should have no emotional ties to it.  Software is just another tool for completing our work, and whatever gives us the best ability to do that, at the best price, is what we want.  From a purely business perspective, owning software is irrelevant.  The desire to own things is a human reaction that is not conducive to good business thinking.  It is also very valuable to point out that IT should never have this mental reaction to owning software – it is the business, not the IT department or the IT professionals, that owns the software used in the business.  IT is simply selecting, deploying, configuring and managing the software on behalf of the business that it supports.

Overall, I truly believe that subscription licensing models are good, in general, for nearly everyone involved.  They benefit vendors by enabling them to be more viable and profitable, while making it easier for IT departments to deliver better value to their users, often while enforcing many best practices that businesses would otherwise be tempted to avoid.  The improved profitability may also encourage vendors to pursue niche software titles that would previously have been unaffordable to create and support.  Vendors, IT and end users are nearly universal winners, while businesses face the only real grey area, where pricing may or may not be beneficial to them in this model.

Originally posted on the StorageCraft Blog.

The Weakest Link: How Chained Dependencies Impact System Risk

When assessing system risk scenarios it is very easy to overlook “chained” dependencies.  We are trained to look at risk at a “node” level, asking “how likely is this one thing to fail?”  But system risk is far more complicated than that.

In most systems there are some components that rely on other components. The most common place that we look at this is in the design of storage for servers, but it occurs in any system design.  Another good example is how web applications need both application hosts and database hosts in order to function.

It is easiest to explain chained dependencies with an example.  We will look at a standard virtualization design with SAN storage to understand where failure domain boundaries and chained dependencies exist, and what role redundancy plays in system level risk mitigation.

In a standard SAN (storage area network) design for virtualization you have virtualization hosts (which we will call the “servers” for simplicity), SAN switches (switches dedicated to the storage network) and the disk arrays themselves.  Each of these three “layers” is dependent on the others for the system, as a whole, to function.  If we have the simplest possible set, with one server, one switch and one disk array, we very clearly have three devices representing three distinct points of failure.  Any one of the three failing causes the entire system to fail.  No one piece is useful on its own.  This is a chained dependency, and the chain is only as strong as its weakest link.

In our simplistic example, each device represents a failure domain.  We can mitigate risk by improving the reliability of each domain.  We can add a second server and implement a virtualization layer high availability or fault tolerance strategy to reduce the risk of server failure.  This improves the reliability of one failure domain but leaves two untouched and just as risky as they were before.  We can then address the switching layer by adding a redundant switch and configuring a multi-pathing strategy to handle the loss of a single switching path, reducing the risk at that layer.  Now two failure domains have been addressed.  Finally we have to address the storage failure domain, which is done, similarly, by adding redundancy through a second disk array that is mirrored to the first and able to fail over transparently in the event of a failure.

Now that we have beefed up our system, we still have three failure domains in a dependency chain.  What we have done is made each “link” in the chain, each failure domain, extra resilient on its own.  But the chain still exists.  This means that the system, as a whole, is far less reliable than any single failure domain within the chain is alone.  We have made something far better than where we started, but we still have many failure domains.  These risks add up.
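To make this concrete, here is a minimal sketch of the arithmetic in Python, using purely illustrative availability figures (assumed numbers for demonstration, not measurements of any real hardware).  A chain of serial dependencies multiplies availabilities together, while redundancy within a domain multiplies failure probabilities together.

def redundant(availability, copies=2):
    # Assumed model: a redundant failure domain fails only if every copy fails.
    return 1 - (1 - availability) ** copies

def chain(*domains):
    # A dependency chain works only if every link works.
    result = 1.0
    for a in domains:
        result *= a
    return result

device = 0.99                          # assumed availability of one device
print(chain(device, device, device))   # simple chain: ~0.9703
domain = redundant(device)             # one redundant domain: 0.9999
print(chain(domain, domain, domain))   # redundant chain: ~0.9997

Even with every domain made redundant, the chain as a whole remains less reliable than any single link within it, which is exactly the point: the chain still exists.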

What is difficult in determining overall risk is that we must assess the risk of each item, then determine the new risk after mitigation (through the addition of redundancy) and then find the cumulative risk of all of the failure domains together in a chain to determine the total risk of the entire system.  It is extremely difficult to determine the risk within each failure domain as the manner of risk mitigation plays a significant role.  For example, a cluster of storage disk arrays that fails over too slowly may cause an overall system failure even when the storage cluster itself appears to have worked properly.  Even defining a clear failure can therefore be challenging.

It is often tempting to take a “from the top” view of risk, which is very dangerous but very common for people who are not regular risk assessment practitioners.  The tendency here is to assess risk by looking only at the “top most” failure domain – generally the servers in a case like this – while ignoring any risks that sit beneath that point, considering those to be “under the hood” rather than part of the risk assessment.  It is easy to ignore the more technical, less exposed and more poorly understood components like networking and storage and focus on the relatively easy to understand and heavily marketed reliability aspects of the top layer.  This “top view” means that the risks under the top level are obscured and generally ignored, leading to high risk without a good understanding of why.

Understanding the concept of chained dependencies explains why complex systems, even with complex risk mitigation strategies, often turn out to be far more fragile than simpler systems.  In our example above, we could do several things to “collapse” the chain, resulting in a more reliable system as a whole.

The most obvious component which can be collapsed is the networking failure domain.  If we were to remove the switches entirely and connect the storage directly to the servers (not always possible, of course) we would effectively eliminate one entire failure domain and remove a link from our chain.  Now instead of three links, each of which has some potential to fail, we have only two.  Simpler is better, all other things being equal.

We could, in theory, also collapse the storage failure domain by going from external storage to storage local to the servers themselves, taking us from two failure domains down to a single one – the one remaining domain, of course, carries more complexity than it did before the collapsing, but the overall system complexity is greatly reduced.  Again, this is with all other factors remaining equal.
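Continuing with the same illustrative figures as in the earlier sketch (again, assumed numbers only), the benefit of removing links shows up directly in the arithmetic:

device = 0.99        # assumed availability per device
print(device ** 3)   # three links: ~0.9703
print(device ** 2)   # two links:   ~0.9801
print(device ** 1)   # one link:     0.99

Each link removed takes a multiplier out of the chain, which is why collapsing the chain can compete with adding redundancy inside it.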

Another approach to consider is making single nodes more reliable on their own.  It is trendy today to look at larger systems and approach risk mitigation that way, adding redundant, low cost nodes to improve the reliability of failure domains.  But traditionally this was not the default path to reliability.  It was far more common in the past, as shown in the former prevalence of mainframes and similarly classed systems, to build a high degree of reliability into a single node.  Mainframe and high end storage systems, for example, still do this today.  This can actually be an extremely effective approach, but it fails to address many scenarios and is generally extremely costly, often magnified by a need to have systems partially or even completely maintained by the vendor.  This tends to work out only in special niche circumstances and is not practical on a more general scope.

So in any system of this nature we have three key risk mitigation strategies to consider: improve the reliability of a single node, improve the reliability of a single domain or reduce the number of failure domains (links) in the dependency chain.  Putting these together as is prudent can help us to achieve the risk mitigation level appropriate for our business scenario.

Where the true difficulty exists, and will remain, is in the comparison of different risk mitigation strategies.  The risk of a single node can generally be estimated with some level of confidence.  A redundancy strategy within a single domain is far harder to estimate – some redundancy strategies are highly effective, creating extremely reliable failure domains, while others can actually backfire and reduce the reliability of a domain!  The complexity that often comes with redundancy strategies is never without caveat and, while it will typically pay off, it rarely carries the degree of reliability benefit that is initially expected.  Estimating the risk of a dependency chain is therefore that much more difficult, as it requires a clear understanding of the risks associated with each of the failure domains individually as well as an understanding of the failure opportunities at the domain boundaries (like the storage failover delay failure noted earlier.)

Let’s explore the issues around determining risk in two very common approaches to the same scenario, building on what we have discussed above.

Two extreme examples of the same situation we have been discussing are a single server with internal storage used to host virtual machines, versus a six device “chain” with two servers using a high availability solution at the server layer, two switches with redundancy at the switching layer and two disk arrays providing high availability at the storage layer.  If we change any large factor here we can generally provide a pretty clear estimate of relative risk – if any of the failure domains lacks reliable redundancy, for example, we can pretty clearly determine that the single server is the more reliable overall system, except in cases where an extreme amount of reliability is built into a single node, which is generally an impractical strategy financially.  But with each failure domain maintaining redundancy we are forced to compare the relative risks of intra-domain reliability (the redundant chain) vs. inter-domain reliability (the collapsed chain, the single server.)

With the two entirely different approaches there is no reasonable way to precisely assess the comparative risks of the two means of risk mitigation.  It is generally accepted that the six (or more) node approach with extensive intra-domain risk mitigation is the more reliable of the two, and this is almost certainly, generally true.  But it is not always true, and rarely does this approach outperform the single node strategy by a truly significant margin, while commonly costing four to ten times as much as the single server strategy.  That is potentially a very high cost for what is likely a small gain in reliability and a small potential risk of a loss in reliability.  Each additional piece of redundancy adds complexity that a human must implement, monitor and maintain, and with complexity and human interaction comes more and more risk.  Avoiding human error can often be more important than avoiding mechanical failure.
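One way to see how the redundant chain can fall short of its expected margin is to model imperfect failover.  This sketch extends the earlier illustrative numbers with an assumed probability that failover actually succeeds; every figure here is a made-up assumption for demonstration, not a claim about any real product:

def domain_with_failover(availability, failover_success=0.95):
    # Assumed model: the domain works if the primary works, or if the
    # primary fails, failover succeeds, and the secondary works.
    return availability + (1 - availability) * failover_success * availability

device = 0.99                           # assumed per-device availability
domain = domain_with_failover(device)   # ~0.9994 per redundant domain
print(domain ** 3)                      # redundant six device chain: ~0.9982
print(0.999)                            # one well-built single server: 0.999

Under these assumptions the elaborate chain lands slightly below the single reliable server, illustrating how redundancy that does not fail over dependably can erase, or even invert, the expected gain.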

We must also consider the cost of recovery.  If failure is to occur it is generally trivial to recover from the failure of a simple system.  An extremely complex system, having failed, may take a great degree of effort to restore to a working condition.  Complex systems also require much broader and deeper degrees of experience and confidence to maintain.

There is no easy answer to determining the reliability of systems.  Modern information delivery systems are simply too large and too complex, with too many indeterminable factors, to be evaluated fully in all cases.  With a good understanding of chained dependencies, however, and an understanding of risk mitigation strategies, we can take practical steps to determine roughly relative risk levels, see how similar risk scenarios compare in cost, identify points of fragility, recognize failure domains and dependency chains, and appreciate how changes in system design will move us clearly towards or away from reliability.