All posts by Scott Alan Miller

Started in software development with Eastman Kodak in 1989 as an intern in database development (making database platforms themselves.) Began transitioning to IT in 1994 with my first mixed role in system administration.

What Do I Do Now? Planning for Design Changes

Quite often I am faced with talking to people about their system designs, plans and architectures.  And many times that discussion happens too late and designs are either already implemented or they are partially implemented.  This can be very frustrating if the design in progress has been deemed to not be ideal for the situation.

I understand the feeling of frustration that will come from a situation like this but it is something that we in IT must face on a very regular basis and managing this reaction constructively is a key IT skill.  We must become masters of this situation both technically as well as emotionally.  We should not be crippled by it, it is a natural situation that every IT professional will experience on a regular basis.  It should not be discouraging or crippling but it is very understandable that it can feel that way.

One key reason that we experience this so often is because IT is a massive field with a great number of variables to be considered in every situation.  It is also a highly creative field where there can be numerous, viable approaches to any given problem.  That there is even a single “best” option is rarely true.  Normally there any many competitive options.  Sometimes these are very closely related, sometimes these options are drastically different making them very difficult to compare meaningfully.

Another key reason is that factors change.  This could be that new techniques or information come to light, new products are released, products are updated, prices change or business needs change near to or even during the decision making and design processes.  This rate of change is not something that we, as IT professionals, can hope to ever control.  It is something that we must accept and deal with as best as we can.

Another thing that I often see missed is that a solution that was ideal when made may not be ideal if the same decision was being made today.  This does not, in any way, constitute a deficiency in the original design yet I have seen many people react to it as if it did.  The most common scenario that I run into where I see people exhibit this behaviour is in the aversion to the use of RAID 5 in modern storage design, RAID 6 and RAID 10 being the popular alternatives for good reason.  But this RAID 5 aversion, common since about 2009, did not exist always and from the middle of the 1990s until nearly the end of the 2000s RAID 5 was not only viable, it was very commonly the best solution for the given business and technical needs (the increase in aversion to it was mostly gradual, not sudden.)  However many people see RAID 5 as understandably poor as an option today but apply this new aversion to systems designed and implemented long ago, sometimes close to two decades ago.  This makes no sense and is purely an emotional reaction.  RAID 5 being the best choice for a scenario in 2002 in no way implies that it will still be the best choice in 2015.  But likewise, RAID 5 being a poor choice in 2015 for a scenario in no way belittles or negates the fact that it was very often a great choice several years ago.

I have been asked many times what to do once less than ideal design decisions have been made.  “What do I do now?”

Learning what to do when perfection is no longer an option (as if it ever really was, all IT is about compromises) is a very important skill.  The first things that we must tackle are the emotional problems as these will undermine everything else.  We must do our best to step back, accept the situation and act rationally.  The last thing that we want to do is take a non-ideal situation and make things worse by attempting to reverse justify bad decisions or panicking.

Accepting that no design is perfect, that there is no way to always get things completely right and that dealing with this is just part of working IT is the first step.  Step back, breathe deep.  It isn’t that bad.  This is not a unique situation.  Every IT pro doing design goes through this all of the time.  You should try your best to make the best decisions possible but you must also accept that that can rarely be done – no one has access to enough resources to really be able to do that.  We work with what we have.  So here we are.  What’s next?

Next is to assess the situation.  Where are we now?  In many cases the implementation is done and there is nothing more to do.  The situation is not ideal, but is it bad?  Very often the biggest mistake that I see people facing of an all ready implemented design is that it is too costly – typically “better” solutions are not better because they are faster or more reliable but are better because they are cheaper, easier or faster to have implemented.  That’s an unfortunate situation but hardly a crippling one.  Whatever time or money was spent must have been an acceptable amount at the time and must have been approved.  The best that we can do, right now, is learn from the decision process and attempt to avoid the overspending in the future.  It does not mean that the existing solution will not work or even not work amazingly well.  It is simply that it may not have been a perfect choice given the business needs, primarily financial, involved.

There are situations where a design that has been implemented does not adequately meet the stated business requirements.  This is thankfully less common, in my experience, as it is a much more difficult situation.  In this case we need to make some modifications in order to fulfill our business needs.  This may prove to be expensive or complex.  But things may not be as bad as what they seem.  Often reactions to this are misleading and the situation can often be salvaged.

The first step once we are in a position where we have implemented a solution that fails to meet business needs is to reassess the business needs.  This is not to imply that we should fudge the needs to massage them into being whatever our system is able to fulfill, not at all.  But it is a good time to go back and see if the originally stated needs are truly valid or if they were simply not vetted well enough or, even more likely, to go and see if the business needs have changed during the time that the implementation took place.  It may be that the implemented solution does, in fact, meet actual business needs even if they were originally misstated or because the needs have changed over time.   Or it might be that business needs have changed so dramatically that even perfect planning would originally have fallen short of the existing needs and the fact that the implemented solution does not perform as expected is of minor consequence.   I have been very surprised just how often this verification of business needs has turned a solution believed to be inadequate into an “overkill” solution that actually cost more than necessary simply because no one pushed back on overstatements of business needs or no one questioned financial value to certain technology investments.

The second step is to create a new technology baseline.  This is a very important step to assist in preventing IT from falling into the drop of the sunk cost fallacy.  It is extremely common for anyone, this is not unique to IT in any way, to look at the time and money spent on a project and assume that continuing down the original path, no matter how foolish it is, is the way to go because so many resources have been expended on that path already.  But this makes no sense, of course, how you got to your current state is irrelevant.  What is relevant is assessing the current needs of the department and company and taking stock of the currently available solutions, technologies and resources.  Given the current state, the best course forward can be determined.  Any consideration given to the effort expended to get to the current state is only misleading.

A good example of the sunk cost fallacy is in the game of chess.  With each move it is important to assess all available moves, risks and strategies again because what moves were used to get to the current state have no bearing on what moves make sense going forward.  If the world’s greatest chess player or an amazing computer chess algorithm was to be brought in mid-game they would not require any knowledge as to how the current state had come to be – they would simply assess the current state and create a strategy based upon it.

This is the same as we should be behaving in IT.  Our current state is our current state.  It does not matter for strategic planning what unfolded to get us into that state.  We only care about those decisions and costs when doing a post mortem process in order to determine where decision making may have failed in order to learn from it.  Learning about ourselves and our processes is very important.  But that is a very different task from doing strategic planning for the current initiative.

The unfortunate thing here is that we must begin our planning processes again but this time with, we assume, more to work with.    But this cannot be avoided.  In the worst cases, budgets are no longer available and there are no resources to fix the flawed design and achieve the necessary business goals.  Compromises sometimes are necessary.  Making do with what we have is sometimes that best that we can do. But, in the vast majority of cases it would seem, some combination of additional budget or creative reuse of existing products can be adequate to remedy the situation.

Once we have reached a state in which we have addressed our short falls, whether simply by accepting that we have over spent, under-delivered or have adjusted to meet needs we have an opportunity to go back and investigate our decision making processes.  It is by doing this that we hope to grow as individuals and, if at all possible, on an organizational level to learn from our mistakes, or determine if there even were mistakes.  Every company and every individual makes mistakes.  What separates us is the ability to learn from them and avoid those same mistakes in the future.  Growth comes primarily from experiencing pain in this way and while often unpleasant to face it is here that we have the best opportunity to create real, lasting value.  Do not push off or skip this opportunity for review whether it be a harsh, personal review that you do yourself or a formal, organizational review run by people trained to do so or something in-between.  The sooner that the decision processes are evaluated the fresher the memory will be and the sooner the course correction can take effect.

The final step that we can do is to begin the decision process to design a replacement for the current implementation as soon as possible, once the review of the decision process is complete.  This does not necessarily mean that we should intend to spend money or change our designs in the near future.  Not at all.  But by being extremely pro-active in design making we can attempt to avoid the problems of the past by giving ourselves additional time for planning, more time for requirements gathering and documentation, better insight into changes in requirements over time by regularly revisiting those requirements to see if they remain stable or if they are changing, more opportunity to get management and peer buy in and investment in the decision and better understanding of the problem domain so that we are better equipped to alter the intended design or know when to scrap it and start over before implementing it the next time.  It also could, potentially, give us a better chance of codifying organizational knowledge that could be passed on to a successor should you yourself not be in the position of decision making or implementation when the next cycle comes around.

With good, rational processes and a good understanding of the steps that need to be taken in a case of less than ideal systems design or implementation we can recover from missteps and not only, in most cases, recover from them in the short term but we can insulate the organization from the same mistakes in the future.

Better IT Hiring: Contract To Hire

Information Technology workers are bombarded with “Contract to Hire” positions, often daily.  There are reasons why this method of hiring and working is fundamentally wrong and while workers immediately identify these positions as bad choices to make, few really take the time to move beyond emotional reaction to understand why this working method is so flawed and, more importantly, few companies take the time to explore why using tactics such as this undermine their staffing goals.

To begin we must understand that there are two basic types of technology workers: consultants (also called contractors) and permanent employees (commonly known as the FTEs.)  Nearly all IT workers fall into a desire to be one of these two categories. Neither is better or worse, they are simply two different approaches to employment engagements and represent differences in personality, career goals, life situations and so forth.  Workers do not always get to work they way that they desire, but basically all IT workers seek to be in either one camp or the other.

Understanding the desires and motivations of IT workers seeking to be full time employees is generally very easy to do.  Employees, in theory, have good salaries, stable work situations, comfort, continuity, benefits, vacations, protection and so forth.  At least this is how it seems, whether these aspects are real or just illusionary can be debated elsewhere.  What is important is that most people understand why people want to be employees, but the opposite is rarely true.  Many people lack the empathy for those seeking to not be employees.

Understanding professional or intentional consultants can be difficult.  Consultants live a less settled life but generally earn higher salaries and advance in their careers faster, see more diverse environments, get a better chance to learn and grow, are pushed harder and have more flexibility.  There are many factors which can make consulting or contracting intentionally a sensible decision.  Intentional contracting is very often favored by younger professionals looking to grow quickly and gain experience that they otherwise could not obtain.

What makes this matter more confusing is that the majority of workers in IT wish to work as full time employees but a great many end up settling for contract positions to hold them over until a desired full time position can be acquired.  This situation arises so commonly that a great many people both inside and outside of the industry and on both sides of the interview table may mistakenly believe that all cases are this way and that consulting is a lower or lesser form of employment.  This is completely wrong.  In many cases consulting is highly desired and contractors can benefit greatly for their choice of engagement methodology.  I, myself, spent most of my early career, around fifteen years, seeking only to work as a contractor and had little desire to land a permanent post.  I wanted rapid advancement, opportunities to learn, chances to travel and variety.

It is not uncommon at all for the desired mode of employment to change over time.  It is most common for contractors to seek to move to full employment at some point in their careers. Contracting is often exhausting and harder to sustain over a long career.  But certainly full time employees sometimes chose to move into a more mobile and adventurous contracting mode as well.  And many choose to only work one style or the other for the entirety of their careers.

Understanding these two models is key.  What does not fit into this model is the concept of a Contract to Hire.  This hiring methodology starts by hiring someone willing to work a contract position and then, sometimes after a set period of time and sometimes after an indefinite period of time, either promises to make a second determination to see if said team member should be “converted” into an employee, or let go.  This does not work well when we attempt to match it up against the two types of workers.  Neither type is a “want to start as one thing and then do another”.  Possibly somewhere there is an IT worker who would like to work as a contractor for four months and then become an employee, getting benefits but only after a four month delay, but I am not aware of such a person and it is reasonable to assume that if there is such a person he is unique and already has done this process and would not want to do it again.

This leaves us with two resulting models to match into this situation.  The first is the more common model of an IT worker seeking permanent employment and being offered a Contract to Hire position.  For this worker the situation is not ideal, the first four months represent a likely jarring and complex situation and a scary one that lacks the benefits and stability that is needed and the second decision point as to whether to offer the conversion is frightening.  The worker must behave and plan as if there was no conversion and must be actively seeking other opportunities during the contract period, opportunities that are pure employment from the beginning.  If there was any certainty of a position becoming a full employment one then there would be no contract period at all.  The risk is exceptionally high to the employee that no conversion will be offered.  In fact, it is almost unheard of in the industry for this to happen.

It must be noted that, for most IT professionals, the idea that a Contract to Hire will truly offer a conversion at the end of the contract duration is so unlikely that it is generally assumed that the enticement of the conversion process is purely a fake one and that there is no possibility of it happening at all.  And for reasons we will discover here it is obvious why companies would not honestly expect to attempt this process.  The term Contract to Hire spells almost certain unemployment for IT workers going down that path.  The “to Hire” portion is almost universally nothing more than a marketing ploy and a very dishonest one.

The other model that we must consider is the model of the contract-desiring employee accepting a Contract to Hire position.  In this model we have the better outcome for both parties.  The worker is happy with the contract arrangement and the company is able to employ someone who is happy to be there and not seeking something that they likely will be unable to get.  In cases where the company was less than forthcoming about the fact that the “to Hire” conversion would never be considered this might actually even work out well, but is far less likely to do so long term and in repeating engagements than if both parties were up front and honest about their intentions on a regular basis.  Even for professional contractors seeing the “to Hire” addendum is a red flag that something is amiss.

The results for a company, however, when obtaining an intentional contractor via a Contract to Hire posting is risky.  For one, contractors are highly volatile and are skilled and trained at finding other positions.  They are generally well prepared to leave a position the moment that the original contract is done.

One reason that the term Contract to Hire is used is so that companies can easily “string along” someone desiring a conversion to a full time position by dangling the conversion like a carrot and prolonging contract situations indefinitely.  Intentional contractors will see no carrot in this arrangement and will be, normally, prepared to leave immediately upon completion of their contract time and can leave without any notice as they simply need not renew their contract leaving the company in a lurch of their own making.

Even in scenarios where an intentional contractor is offered a conversion at the end of a contract period there is the very real possibility that they will simply turn down the conversion.  Just as the company maintains the right to not offer the conversion, the IT worker maintains an equal right to not agree to offered terms.  The conversion process is completely optional by both parties.  This, too, can leave the company in a tight position if they were banking on the assumption that all IT workers were highly desirous of permanent employment positions.

This may be the better situation, however.  Potentially even worse is an intentional contractor accepting a permanent employment position when they were not actually desiring an arrangement of that type.  They are likely to find the position to be something that they do not enjoy, or else they would have been seeking such an arrangement already, and will be easily tempted to leave for greener pastures very soon – defeating the purpose of having hired them in the first place.

The idea behind the Contract to Hire movement is the mistaken belief by companies that companies hold all of the cards and that IT workers are all desperate for work and thankful to find any job that they can.  This, combined with the incorrect assumption that nearly all IT workers truly want stable, traditional employment as a full time employee combines to make a very bad hiring situation.

Based on this, a great many companies attempt to leverage the Contract to Hire term in order to lure more and better IT workers to apply based on false promises or poor matching of employment values.  It is seen as a means of lowering cost, testing out potential employees, hedging bets against future head count needs, etc.

In a market where there is a massive over supply of IT workers a tactic such as this may actually pay off.  In the real world, however, IT workers are in very short supply and everyone is aware of the game that companies play and what this term truly means.

It might be assumed that IT workers would still consider taking Contract to Hire because they are willing to take on some risk and hope to convince the employer that conversion, in their case, would be worth while.  And certainly some companies do this process and for some people it has worked out well.  However, it should be noted, that any contract position offers the potential of a conversion offer and in positions where the to “Contract to Hire” is not used, conversions are actually quite common, or at least offers for conversion.  It is specifically when a potential future conversion is offered like a carrot that the conversions become exceptionally rare.  There is no need for an honest company and a quality workplace to mention “to Hire” when bringing on contractors.

What happens, however, is more complex and requires study.  In general the best workers in any field are those that are already employed.  It goes without saying that the better you are, the more likely you are to be employed.  This does not mean that great people never change jobs or find themselves unemployed but the better you are the more time you will average not seeking employment from a position of being unemployed and the worse you are the more likely you are to be unemployed non-voluntarily.  That may seem obvious, but when you combine that with other information that we have, something is amiss.  A Contract to Hire position can never, effectively, entice currently working people in any way.  A great offer of true, full time employment with better pay and benefits might entice someone to give up an existing position for a better one, that happens every day.  But good people generally have good jobs and are not going to give up the positions that they have, the safety and stability to join an unknown situation that only offers a short term contract with an almost certain no chance conversion carrot.  It just is not going to happen.

Likewise when good IT workers are unemployed they are not very likely to be in a position of desperation and even then are very unlikely to even talk to a position listing as Contract to Hire (or contract at all) as most people want full time employment and good IT people will generally be far too busy turning down offers to waste time looking at Contract to Hire positions.  Good IT workers are flooded with employment opportunities and being able to quickly filter out those that are not serious is a necessity.  The words “Contract to Hire” are one of the best low hanging fruits of this filtering process.  You do not need to see what company it is, what region it is in, what the position is or what experience they expect.  The position is not what you are looking for, move along, nothing to see here.

The idea that employers seem to have is the belief that everyone, employed and unemployed IT workers alike, are desperate and thankful for any possible job opening.  This is completely flawed.  Most of the industry is doing very well and there is no way to fill all of the existing job openings that we have today, IT workers are in demand.  Certainly there is always a certain segment of the IT worker population that is desperate for work for one reason or another – personal situations, geographic ties, over staffed technology specialization or, most commonly, not being very competitive.

What Contract to Hire positions do is filter out the best people.  They effectively filter out every currently employed IT worker completely.  In demand skills groups (like Linux, storage, cloud and virtualization) will be sorted out too, they are too able to find work anywhere to consider poor offerings.  Highly skilled individuals, even when out of work, will self filter as they are looking for something good, not looking for just anything that comes along.

At the end of the day, the only people in any number seriously considering Contract to Hire positions, often even to the point of being the only ones even willing to respond to postings, are the truly desperate.  Only the group that either has so little experience that they do not realize how foolish the concept is or, more commonly by far, those that are long out of work and have few prospects and feel that the incredible risks and low quality of work associated with Contract to Hire is acceptable.

This hiring problem begins a vicious loop of low quality, if one did not already exist. But most likely issues with quality already will exist before a company considers a Contract to Hire tactic.  Once good people begin to avoid a company, and this will happen even if only some positions are Contract to Hire, – because the quality of the hiring process is exposed, the quality of those able to be hired will begin to decline.  The worse it gets, the harder to turn the ship around.  Good people attract good people.  Good IT workers want to work with great IT workers to mentor them, to train them and to provide places where they can advance by doing a good job.  Good people do not seek to work in a shop staffed by the desperate.  Both because working only with desperate people is depressing and the quality of work is very poor, but also because once a shop gains a poor reputation it is very hard to shake and good people will be very wary of having their own reputation tarnished by having worked in such a place.

Contact to Hire tactics signal desperation and a willingness to admit defeat on the part of an employer.  Once a company sinks to this level with their hiring they are no longer focusing on building great teams, acquiring amazing talent or providing a wonderful work environment.  Contract to Hire is not always something that every IT professional can avoid all of the time.  All of us have times when we have to accept something less than ideal.  But it is important for all parties involved to understand their options and just what it means when a company moves into this mode.  Contract to Hire is not a tactic for vetting potential hires, it simply does not work that way.  Contract to Hire causes companies to be vetted and filtered out of consideration by the bulk of potential candidates without those metrics ever being made available to hiring firms.  Potential candidates simply ignore them and write them off, sometimes noting who is hiring this way and avoiding them even when other options come along in the future.

As a company, if you desire to have a great IT department and hire good people, do not allow Contract to Hire to ever be associated with your firm.  Hire full time employees and hire intentional contractors, but do not play games with dangling false carrots hoping that contractors will change their personalities or that full time employees will take huge personal risks for no reason, that is simply not how the real world works.

On DevOps and Snowflakes

One can hardly swing a proverbial cat in IT these days without hearing people talking about DevOps.  DevOps is the hot new topic in the industry picking up from where the talk of cloud left off and to hear people talk about it one might believe that traditional systems administration is already dead and buried.

First we must talk about what we mean by DevOps.  This can be confusing because, like cloud, an older term is often being stolen to mean something different or, at best, related to something that already existed.  Traditional DevOps was the merging of developer and operational roles.  In the 1960s through the 1990s, this was the standard way of running systems.  In this world the people who wrote the software were generally the same ones who deployed and maintained it.  Hence the merging of “developer” and “operations”, operations being a semi-standard term for the role of system administrator.  These roles were not commonly separated until the rise of the “IT Department” in the 1990s and the 2000s.  Since then, the return to the merging of the two roles has started to rise in popularity again primarily because of the way that the two can operate together with great value in many modern, hosted, web application situations.

Where DevOps is often talked about today is not as a strict merging of the developers and the operations staff but as a modification to the operations staff with a much higher focus on coding not the application itself but in defining application infrastructures as code as a natural extension of cloud architectures.  This can be rather confusing at first.  What is important to note is that traditional DevOps is not what is commonly occurring today but a new “fake” DevOps where developers remain developers and operations remains operations but operations has evolved into a new “code heavy” role that continues to focus on managing servers running code provided by the developers.

What is significant today is that the role of the system administrator has begun to diverge into two related, but significantly different roles, one of which is improperly called DevOps by most of the industry today (most of the industry being too young to remember when DevOps was the norm, not the exception and certainly not something new and novel.)  I refer to these two aspects of the system administrator role here as the DevOps and the Snowflake approaches.

I use the term Snowflake to refer to traditional architectures for systems because each individual server can be seen as a “unique Snowflake.”  They are all different, at least insofar as they are not somehow managed in such a way as to keep them identical.  This doesn’t mean that they have to be all unique, just that they retain the potential to be.  In traditional environments a system administrator will log into each server individually to work on them.  Some amount of scripting is common to ease administration tasks but at its core the role involves a lot of time working on individual systems.

Easing administration of Snowflake architectures often involved attempts to minimize differences between systems in reasonable ways.  This generally starts with things like choosing a single standard operating system and version (Windows 2012 R2 or Red Hat Enterprise Linux 7) rather than allowing every server installation to be a different OS or version.  This standardization may seem basic but many shops lack this standardization even today.

A next step is commonly creating a standard deployment methodology or a gold master image that is used for making all systems so that the base operating system and all base packages, often including system customization, monitoring packages, security packages, authentication configuration and similar modifications are standard and deployed uniformly.  This provides a common starting point for all systems to minimize divergence.  But technically they only ensure a standard starting point and over time divergence in configuration must be anticipated.

Beyond these steps, Snowflake environments typically use custom, bespoke administration scripts or management tools to maintain some standardization between systems over time.  The more commonalities that exist between systems the easier they are to maintain and troubleshoot and the less knowledge is needed by the administration staff.  More standardization means fewer surprises, fewer unknowns and much better testing capabilities.

In a single system administrator environment with good practices and tooling, Snowflake environments can take on a high degree of standardization.  But in environments with many system administrators, especially those supported around the clock from many regions, and with a large number of systems, standardization, even with very diligent practices, can become very difficult.  And that is even before we tackle the obvious issues surrounding the fact that different packages and possibly package versions are needed on systems that perform different roles.

The DevOps approach grows organically out of the cloud architecture model.  Cloud architecture is designed around automatically created and automatically destroyed, broadly identical systems (at least in groups) that are controlled through a programmatic interface or API.  This model lends itself, quite obviously, to being controlled centrally through a management system rather than through the manual efforts of a system administrator.  Manual administration is effectively impossible and completely impractical under this model.  Individual systems are not unique like in the Snowflake model and any divergence will create serious issues.

The idea that has emerged from the cloud architecture world is one that systems architecture should be defined centrally “in code” rather than on the servers themselves.  This sounds confusing at first but makes a lot of sense when we look at it more deeply.  In order to support this model a new type of systems management tool that has yet to take on a really standard name but is often called a systems automation tool, DevOps framework, IT automation tool or simply “infrastructure as code” tool has begun to emerge.  Common toolsets in this realm include Puppet, Chef, CFEngine and SaltStack.

The idea behind these automation toolsets is that a central service is used to manage and control all systems.  This central authority manages individual servers by way of code-based descriptions of how the system should look and behave.  In the Chef world, these are called “recipes” to be cute but the analogy works well.  Each system’s code might include information such as a list of which packages and package versions should be installed, what system configurations should be modified and files to be copied to the box.  In many cases decisions about these deployments or modifications are handled through potentially complex logic and hence the need for actual code rather than something more simplistic such as markup or templates.  Systems are then grouped by role and managed as groups.  The “web server” role might tell a set of systems to install Apache and PHP and configure memory to swap very little.  The “SQL Server” role might install MS SQL Server and special backup tools only used for that application and configure memory to be tuned as desired for a pool of SQL Server machines.  These are just examples.  Typically an organization would have a great many roles, some may be generic such as “web server” and others much more specific to support very specific applications.  Roles can generally be layered, so a system might be both a “web server” and a “java server” getting the combined needs of both met.

These standard definitions mean that systems, once designated as belonging to one role or another, can “build themselves” automatically.  A new system might be created by an administrator requesting a system or a capacity monitoring system might decide that additional capacity is needed for a role and spawn new server instances automatically without any human intervention whatsoever.  At the time that the system is requested, by a human or automatically, the role is designated and the system will, by way of the automation framework, transform itself into a fully configured and up to date “node.”  No human system administration intervention required.  The process is fast, simple and, most importantly, completely repeatable.

Defining systems in code has some non-obvious consequences.  One is that backups of complete systems are no longer needed.  Why backup a system that you can recreate, with minimum effort, almost instantly?  Local data from database systems would need to be backed up but only the database data, not the entire system.  This can greatly reduce strain on backup infrastructures and make restore processes faster and more reliable.

The amount of documentation needed for systems already defined in code is very minimal.  In Snowflake environments the system administrator needs to maintain documentation specific to every host and maintain that documentation manually. This is very time consuming and error prone.   Systems defined by way of central code need little to no documentation and the documentation can be handled at a group level, not the individual node level.

Testing systems that are defined in code is easy to do as well.  You can create a system via code, test it and know that when you move that definition into production that the production system will be created repeatably exactly as it was created in testing.  In Snowflake environments it is very common to have testing practices that attempt to do this but do so through manual efforts and are prone to being sloppy and not exactly repeatable and very often politics will dictate that it is faster to mimic repeatability than to actually strive for it.  Code defined systems bypass these problems making testing far more valuable.

Outside of needing to define a number of nodes to exist within each role, the system can reprovision an entire architecture, from scratch, automatically.  Rebuilding after a disaster or bringing up a secondary site can be very quickly and easily done.  Also moving between locally hosted systems and remotely hosted systems including those from companies like Amazon, Microsoft, IBM, Rackspace and others is extremely easy.

Of course, in the DevOps world there is a great value to using cloud architectures to enable the most extreme level of automation but using cloud architectures is unnecessary to leverage these types of tools.  And, of course, having a code defined architecture could be used partially while manual administration could be implemented too for a hybrid approach but this is rarely recommended on individual systems.  It is generally far better to have two environments, one that is managed as Snowflakes and one that is managed as DevOps when the two approaches are mandated.  This makes  a far better hybridization.  I have seen this work extremely well in an enterprise environment with more scores of thousands of “Snowflake” servers each very unique but with a dedicated environment of ten thousands nodes that was managed in a DevOps manner because all of the nodes were to be identical and interchangeable using one of two possible configurations.  Hybridization was very effective.

The DevOps approach, however, comes with major caveats as well.  The skill sets necessary to manage a system in this way are far greater than those needed for traditional systems administration as, at a minimum, all traditional systems administration knowledge is still needed plus solid programming knowledge typically of modern languages like Python and Ruby and knowledge of the specific frameworks in question as well.  This extended knowledge base requirement means that DevOps practitioners are not only rare but expensive too.  It also means that university education, already far short of preparing either systems administrators or developers for the professional world are now farther still from preparing graduates to work under a DevOps model.

System administrators working in each of these two camps have a tendency to see all systems as needing to fit into their own mold. New DevOps practitioners often believe that Snowflake systems are legacy and need to be updated.  Snowflake (traditional) admins tend to see the “infrastructure as code” movement as silly, filled with unnecessary overhead, overly complicated and very niche.

The reality is that both approaches have a tremendous amount of merit and both are going to remain extremely viable.  Both make sense for very different workloads and large organizations, I suspect, will commonly see both in place via some form of hybridization.  In the SMB market where there are typically only a tiny number of servers, no scaling leverage to justify cloud architectures and a high disparity between systems, I suspect that DevOps will remain almost indefinitely outside of the norm as the overhead and additional skills necessary to make it function are impractical or even impossible to acquire.  Larger organizations have to look at their workloads.  Many traditional workloads and much of traditional software is not well suited to the DevOps approach, especially cloud automation, and will either require hybridization or an impractically high level of coding on a per system basis making the DevOps model impossible to justify.  But workloads built on web architectures or that can scale horizontally extremely well will benefit heavily from the DevOps model at scale.  This could apply to large enterprise companies or smaller companies likely producing hosted applications for external consumption.

This difference in approach means that, in the United States for example, most of the US is comprised of companies that will remain focused on the Snowflake management model while some east coast companies could evaluate the DevOps model effectively and begin to move in that direction.  But on the west coast where more modern architectures and a much larger focus on hosted applications and applications for external consumption are the driving economic factors, DevOps is already moving from newcomer to mature, established normalcy.  DevOps and Snowflake approaches will likely remain heavily segregated by regions in this way just as IT, in general, sees different skill sets migrate to different regions.  It would not be surprising to see DevOps begin to take hold in markets such as Austin where traditional IT has performed rather poorly.

Neither approach is better or worse, they are two different approaches servicing two very different ways of provisioning systems and two different fundamental needs of those systems.  With the rise of cloud architectures and the DevOps model, however, it is critically important that existing system administrators understand what the DevOps model means and when it applies so that they can correctly evaluate their own workloads and unique needs.  A large portion of the traditional Snowflake system administration world will be migrating, over time, to the DevOps model.  We are very far from reaching a steady state in the industry as to the balance of these two models.

Originally published on the StorageCraft Blog.

Practical RAID Performance

Choosing a RAID level is an exercise in balancing many factors including cost, reliability, capacity and, of course, performance.  RAID performance can be difficult to understand especially as different RAID levels use different techniques and behave rather differently from each other in some cases.  In this article I want to explore the common RAID levels of RAID 0, 5, 6 and 10 to see how performance differs between them.

For the purposes of this article, RAID 1 will be assumed to be a subset of RAID 10.   This is often a handy way to think of RAID 1 – as simply being a RAID 10 array with only a single mirrored pair member.  As RAID 1 is truly a single pair RAID 10 and behaves as such this works wonderfully for making RAID performance easy to understand as it simply maps into the RAID 10 performance curve.

There are two types of performance to look at with all storage: reading and writing.  In terms of RAID reading is extremely easy and writing is rather complex.  Read performance is effectively stable across all RAID types.  Writing, however, is not.

To make discussing performance easier we need to define a few terms as we will be working with some equations. In our discussions we will use N to represent the total number of drives, often referred to as spindles, in our array and we will use X to refer to the performance of each drive individually.  This allows us to talk in terms of relative performance as a factor of the drive performance allowing us to abstract away the RAID array and not have to think in terms of raw IOPS.  This is important as IOPS are often very hard to define but we can compare performance in a meaningful way by speaking to it in relationship to the individual drives within the array.

It is also important to remember that we are only talking about the performance of the RAID array itself, not an entire storage subsystem.  Artifacts such as memory caches and solid state caches will do amazing things to alter the overall performance of a storage subsystem, but do not fundamentally change the performance of the RAID array itself under the hood.  There is no simple formula for determining how different cache options will impact the overall performance but suffice it to say that it can be very dramatic but this depends heavily not only on the cache choices themselves but also heavily upon workload. Even the biggest, fastest, most robust cache options cannot change the long term, sustained performance of an array.

RAID is complex and many factors influence the final performance.  One is the implementation of the RAID system itself.  A poor implementation might cause latency or may fail to take advantage of the available spindles (such as having a RAID 1 array read only from a single disk instead of from both simultaneously!)  There is no easy way to account for deficiencies in specific RAID implementations so we must assume that all are working to the limits of the specification as, indeed, any enterprise RAID system will do. It is primarily hobby and consumer RAID systems that fail to do this.

Some types of RAID also have dramatic amounts of computational overhead associated with them while others do not.  Primarily parity RAID levels require heavy processing in order to handle write operations with different levels having different amounts of computation necessary for each operation.  This introduces latency, but does not curtail throughput.  This latency will vary, however, based on the implementation of the RAID level as well as on the processing capability of the system in question.  Hardware RAID will use something like a general purpose CPU (often a Power or ARM RISC processor) or a custom ASIC to handle this while software RAID hands this off to the server’s own CPU.  Often the server CPU is actually faster here but consumes system resources.  ASICs can be very fast but are expensive to produce.  This latency impacts storage performance but is very difficult to predict and can vary from nominal to dramatic.  So I will mention the relative latency impact with each RAID level but will not attempt to measure it.  In most RAID performance calculations, this latency is ignored but it is important to understand that it is present and could, depending on the configuration of the array, have a noticeable impact on a workload.

There is, it should be mentioned, a tiny performance impact to read operations due to efficiencies in the layout of data on the disk itself.  Parity RAID requires there to be data on the disks that is useless during a healthy read operation but cannot be used to speed it up.  The actually results in it being slightly slower.  But this impact is ridiculously small and is normally not measured and so can be ignored.

Factors such as stripe size also impact performance, of course, but as that is configurable and not an intrinsic artifact in any RAID level I will ignore it here.  It is not a factor when choosing a RAID level itself but only in configuring one once chosen.

The final factor that I want to mention is the read to write ratio of storage operations.  Some RAID arrays will be used almost purely for read operations, some almost solely for write operations but most use a blend of the two, likely something like eighty percent read and twenty percent write.  This ratio is very important in understanding the performance that you will get from your specific RAID array and understanding how each RAID level will impact you.  I refer to this as the read/write blend.

We measure storage performance primarily in IOPS.  IOPS stands for Input/Output Operations Per Second (yes, I know that the letters don’t line up well, it is what it is.)  I further use the terms RIOPS for Read IOPS, WIOPS for Write IOPS and BIOPS for Blended IOPS which would come with a ration 80/20 or whatever.  Many people talk about storage performance with a single IOPS number.  When this is done they normally mean Blended IOPS at 50/50.  However, rarely does any workload run at 50/50 so that number can be extremely misleading.  Two numbers, RIOPS and WIOPS is what is needed to understand performance and these two together can be used to find any IOPS Blend that is needed.   For example, a 50/50 blend is as simple as (RIOPS * .5) + (WIOPS * .5).  The more common 80/20 blend would be (RIOPS * .8) + (WIOPS * .2).

Now that we have established some criteria and background understanding we will delve into our RAID levels themselves and see how performance varies across them.

For all RAID levels, the Read IOPS number is calculated using NX.  This does not address the nominal overhead numbers that I mention above, of course.  This is a “best case” number but the real world number is so close that it is very practical to simply use this formula.  Since take the number of spindles (N) and multiple by the IOPS performance of an individual drive (X).  Keep in mind that drives often have different read and write performance so be sure to use the drives Read IOPS rating or tested speed for the Read IOPS calculation and the Write IOPS rate or tested speed for the Write IOPS calculation.

RAID 0

RAID 0 is the easiest RAID level to understand because there is effectively no overhead to worry about, no resources consumed to power it and both read and write get the full benefit of every spindle, all of the time.  So for RAID 0 our formula for write performance is very simple: NX.  RAID 0 is always the most performant RAID level.

An example would be an eight spindle RAID 0 array.  If an individual drive in the array delivers 125 IOPS then our calculation would be from N = 8 and X = 125 so 8 * 125 yielding 1,000 IOPS.  Since both read and write IOPS are the same here, it is extremely simple as we get 1K RIOPS, 1K WIOPS and 1K with any blending thereof.  Very simple.  If we didn’t know the absolute IOPS of an individual spindle we could refer to an eight spindle RAID 0 as delivering 8X Blended IOPS.

RAID 10

RAID 10 has the second simplest RAID level to calculate.  Because RAID 10 is a RAID 0 stripe of mirror sets, we have no overhead to worry about from the stripe but each mirror has to write the same data twice in order to create the mirroring.  This cuts our write performance in half compared to a RAID 0 array of the same number of drives.  Giving us a write performance formula of simply: NX/2  or .5NX.

It should be noted that at the same capacity, rather than the same number of spindles, RAID 10 has the same write performance as RAID 0 but double the read performance – simply because it requires twice as many spindles to match the same capacity.

So an eight spindle RAID 10 array would be N = 8 and X = 125 and our resulting calculation comes out to be (8 * 125)/2 which is 500 WIOPS or 4X WIOPS.  A 50/50 blend would result in 750 Blended IOPS (1,000 Read IOPS and 500 Write IOPS.)

This formula applies to RAID 1, RAID 10, RAID 100 and RAID 01 equally.

Uncommon options such as triple mirroring in RAID 10 would alter this write penalty.  RAID 10 with triple mirroring would be NX/3, for example.

RAID 5

While RAID 5 is deprecated and should never be used in new arrays I include it here because it is a well known and commonly used RAID level and its performance needs to be understood.  RAID 5 is the most basic of the modern parity RAID levels.  RAID 2, 3 & 4 are no longer found in production systems and so we will not look into their performance here.  RAID 5, while not recommended for use today, is the foundation of other modern parity RAID levels so is important to understand.

Parity RAID adds a somewhat complicated need to verify and re-write parity with every write that goes to disk.  This means that a RAID 5 array will have to read the data, read the parity, write the data and finally write the parity.  Four operations for each effective one.  This gives us a write penalty on RAID 5 of four.  So the formula for RAID 5 write performance is NX/4.

So following the eight spindle example where the write IOPS of an individual spindle is 125 we would get the following calculation: (8 * 125)/4 or 2X Write IOPS which comes to 250 WIOPS.  In a 50/50 blend this would result in 625 Blended IOPS.

RAID 6

RAID 6, after RAID 10, is probably the most common and useful RAID level in use today.  RAID 6, however, is based off of RAID 5 and has another level of parity.  This makes it dramatically safer than RAID 5, which is very important, but also imposes a dramatic write penalty as each write operation requires the disks to read the data, read the first parity, read the second parity, write the data, write the first parity and then finally write the second parity.  This comes out to be a six times write penalty, which is pretty dramatic.  So our formula is NX/6.

Continuing our example we get (8 * 125)/6 which comes out to ~167 Write IOPS or 1.33X.  In our 50/50 blend example this is a performance of  583.5 Blended IOPS.  As you can see, parity writes cause a very rapid decrease in write performance and a noticeable drop in blended performance.

RAID 7 (aka RAID 5.3 or RAID 7.3)

RAID 7 is a somewhat non-standard RAID level with triple parity based off of the existing single parity of RAID 5 and the existing double parity of RAID 6.  The only current implementation of RAID 7 is ZFS’ RAIDZ3.  Because RAID 7 contains all of the overhead of both RAID 5 and RAID 6 plus the additional overhead of the third parity component we have a write penalty of a staggering eight times.  So our formula for finding RAID 7 write performance is NX/8.

In our example this would mean that (8 * 125)/8 would come out to 125 Write IOPS or 1X.  So with eight drives in our array we would get only the write performance of a single, stand alone drive.  That is significant overhead.  Our blended 50/50 IOPS would come out to only 562.5.

Complex RAID

Complex RAID levels or Nested RAID levels such as RAID 50, 60, 61, 16, etc. can be found using the information above and breaking the RAID down into its components and applying each using the formulæ provided above.  There is no simple formula for these levels because they have varying configurations.  It is necessary to break them down into their components and apply the formulæ multiple times.

RAID 60 with twelve drives, two sets of six drives, where each drive is 150 IOPS would be done with two RAID 6s.  It would be the NX of RAID 0 where N is two (for two RAID 6 arrays) and the X is the resultant performance of each RAID 6.   Each RAID 6 set would be (6 * 150)/6.  So the full array would be 2((6 * 150)/6).  Which results in 300 Write IOPS.

The same example as above but configured as RAID 61, a mirrored pair of RAID 6 arrays, would be the same performance per RAID 6 array, but applied to the RAID 1 formula which is NX/2 (where X is the resultant performance of the each RAID array.)  So the final formula would be 2((6 * 150)/6)/2 coming to 150 Write IOPS from twelve drives.

Performance as a Factor of Capacity

When we are producing RAID performance formulæ we think of these in terms of the number of spindles which is incredibly sensible.  This is very useful in determining the performance of a proposed array or even an existing one where measurement is not possible and allows us to compare the relative performance between different proposed options.  It is in these terms that we universally think of RAID performance.

This is not always a good approach, however, because typically we look at RAID as a factor of capacity rather than of performance or spindle count.  It would be very rare, but certainly possible, that someone would consider an eight drive RAID 6 array versus an eight drive RAID 10 array.  Once in a while this will occur due to a chassis limitation or some other, similar reason.  But typically RAID arrays are viewed from the standpoint of total array capacity (e.g. usable capacity) rather than spindle count, performance or any other factor.  It is odd, therefore, that we should then switch to viewing RAID performance as a function of spindle count.

If we change our viewpoint and pivot upon capacity as the common factor, while still assuming that individual drive capacity and performance (X) remains constant between comparators then we arrive at a completely different landscape of performance.  In doing this we see, for example, that RAID 0 is no longer the most performant RAID level and that read performance varies dramatically instead of being a constant.

Capacity is a fickle thing but we can distill it out to the number of spindles necessary to reach desired capacity.  This makes this discussion far easier.  So our first step is to determine our spindle count needed for raw capacity.  If we need a capacity of 10TB and are using 1TB drives, we would need ten spindles, for example.  Or if we need 3.2TB and are using 600GB drives we would need six spindles.  We will, different than before, refer to our spindle count as R.  As before, performance of the individual drive is represented as X.  (R is used here to denote that this is the Raw Capacity Count, rather that the total Number of spindles.)

RAID 0 remains simple, performance is still RX as there are no additional drives.  Both read and write IOPS are simply NX.

RAID 10 has RX Write IOPS but 2RX Read IOPS.  This is dramatic.  Suddenly when viewing performance as a factor of stable capacity we find that RAID 10 has double read performance over RAID 0!

RAID 5 gets slightly trickier.  Write IOPS would be expressed as ((R + 1) * X)/4.  The Read IOPS are expressed as ((R +1) * X).

RAID 6, as we expect, follows the pattern that RAID 5 projects.  Write IOPS for RAID 6 are ((R + 2) * X)/6.  And the Read IOPS are expressed as ((R + 2) * X).

RAID 7 falls right in line.  RAID 7 Write IOPS would be ((R + 3) * X)/8.  And the Read IOPS are ((R + 3) * X).

This vantage point changes the way that we think about performance and, when looking purely at read performance, RAID 0 becomes the slowest RAID level rather than the fastest and RAID 10 becomes the fastest for both read and write no matter what the values are for R and X!

If we take a real world example of 10 2TB drives to achieve 20TB of usable capacity with each drive having 100 IOPS of performance and assume a 50/50 blend, the resultant IOPS would be:  RAID 0 with 1,000 Blended IOPS, RAID 10 with 1,500 Blended IOPS (2,000 RIOPS / 1,000 WIOPS), RAID 5 with 687.5 Blended IOPS (1,100 RIOPS / 275 WIOPS), RAID 6 with 700 Blended IOPS (1,200 RIOPS / 200 WIOPS) and finally RAID 7 with 731.25 Blended IOPS (1,300 RIOPS / 162.5 WIOPS.)  RAID 10 is a dramatic winner here.

Latency and System Impact with Software RAID

As I have stated earlier, RAID 0 and RAID 10 have, effectively, no system overhead to consider.  The mirroring operation requires essentially no computational effort and is, for all intents and purposes, immeasurably small.  Parity RAID does have computational overhead and this results in latency at the storage layer and system resources being consumed.  Of course, if we are using hardware RAID those resources are dedicated to the RAID array and have no function but to be consumed in this role.  If we are using software RAID, however, these are general purpose system resources (primarily CPU) that are consumed for the purposes of the RAID array processing.

The impact to even a very small system with a large amount of RAID is still very small but it can be measured and should be considered, if only lightly.  Latency and system impact are directly related to one another.

There is no simple way to state latency and system impact for different RAID levels except in this way: RAID 0 and RAID 10 have effectively no latency or impact, RAID 5 has some latency and impact, RAID 6 has roughly twice as much computational latency and impact as RAID 5 and RAID 7 has roughly triple the computational latency and impact as RAID 5.

In many cases this latency and system impact will be so small that they cannot be measured with standard system tools and as modern processors become increasingly powerful the latency and system impact will continue to diminish.  Impact has been considered negligible for RAID 5 and RAID 6 systems on even low end, commodity hardware since approximately 2001.  But it is possible on heavily loaded systems with a large amount of parity RAID activity that there could be contention between the RAID subsystem and other processes requiring system resources.

Reference: The IT Hollow – Understanding the RAID Penalty

Article originally posted to the StorageCraft Blog – RAID Performance.