Practical RAID Choices for Spindle Based Arrays

A truly monumental amount of information exists on RAID storage systems, exploring topics such as risk, performance, capacity, trends, approaches and more. While the work on this subject is nearly staggering, it can be distilled into a handful of common, practical storage approaches that cover nearly all use cases. My goal here is to provide a handy guide that allows a non-storage practitioner to approach RAID decision making in a practical and, most importantly, safe way.

For the purposes of this guide we will assume storage projects of no more than twenty-five traditional drives (spinning platter drives, properly known as Winchester drives.) These drives are commonly SFF (2.5″) or LFF (3.5″), SATA or SAS, consumer or enterprise. We will not tackle solid state drives as these have very different characteristics and require their own guidance. Storage systems larger than roughly twenty-five spindles should not rely on standard guidance but instead delve deeper into specific storage needs to ensure proper planning.

The guidance here is written for standard systems in 2015. Over the past two decades the common approaches to RAID storage have changed dramatically, and while the key factors that influence these decisions are not anticipated to change enough in the future to alter these recommendations, it is very possible that they will. Good RAID design from 1998 is very poor RAID design today. The rate of change in the industry has dropped significantly since that time and these recommendations are likely to stand for a very long time, very possibly until spindle-based drive storage is no longer available or at least popular, but like all predictions these are subject to great change.

In general we use what is termed a “One Big Array” approach: a single RAID array on which all system and data partitions are created. The need or desire to split our storage into multiple physical arrays is mostly gone today and should only be done in non-general circumstances. Only in situations where careful study of the storage needs and heavy analysis are being done should we look at array splitting; it is far more likely to cause harm than good. When in doubt, avoid split arrays. The goal of this guide is general rules of thumb that allow any IT Pro to build a safe and reliable storage system. Rules of thumb do not and cannot cover every scenario; exceptions always exist. But the idea here is to cover the vast majority of cases with tried and true approaches that are designed around modern equipment, use cases and needs while erring on the side of safety: when a choice is less than ideal it is still safe. None of these choices is at all reckless; at worst they are overly conservative.

The first scenario we should consider is the one where your data does not matter. This may sound like an odd thing to consider but it is a very important scenario. There are many times where data saved to disk is considered ephemeral and does not need to be protected. This is common for reconstructable data such as working space for rendering, intermediary calculation spaces or caches: situations where spending money to protect data is wasted and it would be acceptable to simply recreate lost data rather than protecting it. It could also be a case where downtime is not a problem, the data is static or nearly so, and rather than spending to reduce downtime we only protect the data via backup mechanisms so that if an array fails we simply restore it completely. In these cases the obvious choice is RAID 0. It is very fast, very simple and provides the most cost effective capacity. The only downside of RAID 0 is that it is fragile and provides no protection against data loss in case of drive failure or even a URE (which would cause data corruption just as it would on a desktop drive.)

It should be noted that a common exception to the “One Big Array” approach is a system using RAID 0 for data. There is a very good argument for keeping the OS and application data, which would be cumbersome to reinstall in case of array loss, on a small RAID 1 array separate from the RAID 0 data array. This way recovery can be very rapid: rather than completely rebuilding the entire system from scratch, we simply recreate the data.

Assuming that we have eliminated cases where the data does not require protection, we will assume for all remaining cases that the data is quite important and we want to protect it at some cost. We will assume that protecting the data as it exists on the live storage is important, generally because we want to avoid downtime or because the data on disk is not static and an array failure would also constitute data loss. With this assumption we will continue.

If we have an array of only two disks the answer is very simple: we choose RAID 1. There is no other option at this size, so there is no decision to be made. In theory we should plan our arrays holistically rather than after the number of drives is determined; the number of drives and the type of array should be chosen together, not drives purchased and then a use determined for that arbitrary number. But two-drive chassis are so common that the case is worth mentioning.

Likewise, with a four drive array the only real choice to consider is RAID 10.  There is no need for further evaluation.  Simply select RAID 10 and continue.

An awkward case is a three drive array. It is very, very rare that we are limited to three drives; the only common chassis limited to three drives was the Apple Xserve, which has been off the market for some time, so the need to make decisions around three spindle arrays should be extremely unlikely. In cases where we have three drives it is often best to seek guidance, but the most common approaches are to add a fourth drive and so choose RAID 10 or, if capacity greater than a single drive’s worth is not needed, to put all three drives into a single triple-mirror RAID 1.

For all other cases, therefore, we are dealing with five to twenty-five drives. Since we have eliminated the situations where RAID 0 and RAID 1 would apply, all common scenarios come down to RAID 6 and RAID 10, and these constitute the vast majority of cases. Choosing between RAID 6 and RAID 10 becomes the biggest challenge that we will face, as we must look solely at our “soft” needs of reliability, performance and capacity.

Choosing between RAID 6 and RAID 10 should not be incredibly difficult. RAID 10 is ideal for situations where performance and safety are the priorities. RAID 10 has much faster write performance and is safe regardless of disk type used (low cost consumer disks can still be extremely safe, even in large arrays.) RAID 10 scales well to extremely large sizes, much larger than should be implemented using rules of thumb! RAID 10 is the safest of all choices; it is fast and safe. The obvious downsides are that RAID 10 has less storage capacity from the same disks and is more costly on the basis of capacity. It must also be mentioned that RAID 10 can only utilize an even number of disks; disks are added in pairs.

RAID 6 is generally safe and fast, but never as safe or as fast as RAID 10. RAID 6 specifically suffers from poor write performance, so it is poorly suited for workloads such as databases and heavily mixed loads like those in large virtualization systems. RAID 6 is cost effective and provides a heavy focus on available capacity compared to RAID 10. When budgets are tight or capacity needs dominate over performance, RAID 6 is an ideal choice. Rarely is the difference in safety between RAID 10 and RAID 6 a concern except in very large systems with consumer class drives. RAID 6 is subject to additional risk with consumer class drives that RAID 10 is not affected by, which could warrant some concern around reliability in larger RAID 6 systems, such as those above roughly 40TB, when consumer drives are used.
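
To make the capacity trade-off concrete, here is a minimal sketch of the usable-capacity arithmetic for the two levels; the twelve-drive, 4TB figures are assumed example values, not taken from any particular system.

```python
# Usable capacity rule of thumb: RAID 10 keeps half of the raw capacity
# (mirrored pairs), while RAID 6 gives up two drives' worth to parity
# regardless of array size. The figures below are assumed examples.
drives = 12      # assumed example: twelve spindles
size_tb = 4      # assumed example: 4TB per drive

raid10_usable = (drives // 2) * size_tb   # 6 mirrored pairs   -> 24TB usable
raid6_usable = (drives - 2) * size_tb     # 10 data + 2 parity -> 40TB usable

print(f"RAID 10: {raid10_usable}TB usable, RAID 6: {raid6_usable}TB usable")
```

The same twelve drives yield roughly two thirds more usable capacity under RAID 6, which is exactly why it wins when capacity and cost dominate.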

In the small business space especially, the majority of systems will use RAID 10 simply because arrays rarely need to be larger than four drives. When arrays are larger, RAID 6 is the more common choice due to somewhat tight budgets and generally low concern around performance. Both RAID 6 and RAID 10 are safe and effective solutions for nearly all usage scenarios, with RAID 10 dominating when performance or extreme reliability are key and RAID 6 dominating when cost and capacity are key. And, of course, when storage needs are highly unique or very large, such as arrays larger than twenty-five spindles, remember to leverage a storage consultant as the scenario can easily become very complex. Storage is one place where it pays to be extra diligent: so many things depend upon it, mistakes are easy to make and the flexibility to change it after the fact is low.
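
The rules of thumb above can be collapsed into a short decision sketch. This is only an illustration of the guidance in this guide; the function name and its flags are assumptions made for the example, not part of any real tool, and real decisions should still weigh reliability, performance and capacity together.

```python
def suggest_raid_level(drives: int, data_matters: bool = True,
                       performance_critical: bool = False) -> str:
    """Rule-of-thumb RAID selection for small spindle arrays, per the guide above."""
    if drives > 25:
        return "Engage a storage consultant; rules of thumb no longer apply"
    if not data_matters:
        return "RAID 0 (ephemeral or fully reconstructable data only)"
    if drives == 2:
        return "RAID 1"
    if drives == 3:
        return "Add a fourth drive for RAID 10, or use a triple-mirror RAID 1"
    if drives == 4:
        return "RAID 10"
    # Five to twenty-five drives: the RAID 10 versus RAID 6 decision rests on
    # "soft" needs: performance and safety versus capacity and cost.
    if performance_critical:
        return "RAID 10 (performance and safety prioritized)"
    return "RAID 6 (capacity and cost prioritized)"


print(suggest_raid_level(6, performance_critical=True))   # RAID 10 ...
print(suggest_raid_level(12))                              # RAID 6 ...
```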

Slow OS Drives, Fast Data Drives

Over the years I have found that people often err on the side of high performance, highly reliable data storage for an operating system partition but choose slow, “cost effective” storage for critical data stores.  I am amazed by how often I find this occurring and now, with the advent of hypervisors, I see the same behaviour being repeated there as well – compounding the previously existing issues.

In many systems today we deal with only a single storage array shared by all components of the system.  In these cases we do not face the problem of misbalancing our storage system performance.  This is one of the big advantages of this approach and a major reason why it comes so highly recommended.  All performance is in a shared pool and the components that need the performance have access to it.

In many cases, whether in an attempt at increased performance or reliability design or out of technical necessity, I find that people are separating out their storage arrays and putting hypervisors and operating systems on one array and data on another.  But what I find shocking is that arrays dedicated to the hypervisor or operating system are often staggeringly large in capacity and extremely high in performance – often involving 15,000 RPM spindles or even solid state drives at great expense.  Almost always in RAID 1 (as per common standards from 1998.)

What needs to be understood here is that operating systems themselves have effectively no storage IO requirements. There is a small amount, mostly for system logging, but that is about all that is needed. Operating system partitions are almost completely static. Required components are loaded into memory, mostly at boot time, and are not accessed again. Even in cases where logging is needed, these logs are often sent to a central logging system rather than to the system storage area, reducing or even removing that need as well.
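
One simple way to confirm this on a Linux host is to watch the kernel's per-device IO counters after the system has been up for a while. The sketch below is an illustration under stated assumptions: it assumes Linux (which exposes /proc/diskstats) and uses "sda" as a placeholder for whichever device actually holds the OS filesystem.

```python
# Minimal sketch: sample /proc/diskstats twice to see how little IO the OS
# device really does once the system is up. Linux only; "sda" is a placeholder.
import time


def io_counters(device: str):
    """Return (reads completed, writes completed) for a block device."""
    with open("/proc/diskstats") as stats:
        for line in stats:
            fields = line.split()
            if fields[2] == device:
                # After major, minor and the device name, the first field is
                # reads completed and the fifth is writes completed.
                return int(fields[3]), int(fields[7])
    raise ValueError(f"device {device!r} not found in /proc/diskstats")


reads_before, writes_before = io_counters("sda")
time.sleep(60)
reads_after, writes_after = io_counters("sda")
print(f"reads/min: {reads_after - reads_before}, "
      f"writes/min: {writes_after - writes_before}")
```

On a typical server whose data lives elsewhere, the counts for the OS device settle to nearly zero between log flushes, which is exactly the behaviour described above.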

With hypervisors this effect is even more extreme. As hypervisors are far lighter and less robust than traditional operating systems they behave more like embedded systems and, in many cases, actually are embedded systems. Hypervisors load into memory at system boot time and their media is almost never needed again while a system is running, except for logging on some occasions. And because hypervisors are physically small, the total time needed to read a complete hypervisor off of storage is very short, even on very slow media.

For these reasons, storage performance is of little to no consequence for operating systems and especially hypervisors. The difference between fast storage and slow storage really only impacts system boot time, where the difference between one second and thirty seconds would rarely be noticed, if at all. When would anyone perceive even several extra seconds during the startup of a system? In most cases startups are rare events, happening at most once a week during an automated, routine reboot in a planned maintenance window or, very rarely, sometimes only once every several years for systems that are only brought offline in emergencies. Even the slowest conceivable storage system is far faster than necessary for this role.

Even slow storage is generally many times faster than is necessary for system logging activities.  In those rare cases where logging is very intense we have many choices of how to tackle this problem.  The most obvious and common solution here is to send logs to a drive array other than the one used by the operating system or hypervisor.  This is a very easy solution and ultimately very practical in cases where it is warranted.  The other common and highly useful solution is to simply refrain from keeping logs on the local device at all and send them to a remote log collection utility such as Splunk, Loggly or ELK.
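
As a minimal sketch of the remote logging idea (not specific to Splunk, Loggly or ELK), Python's standard library can ship application logs to a remote syslog-style collector so that nothing is written to the local OS array. The hostname below is a placeholder; a real deployment would point at whatever ingestion endpoint the chosen tool provides, typically over TCP or TLS rather than plain UDP.

```python
# Minimal sketch: send application logs over the network instead of to the
# local OS disk. "logs.example.com" is a placeholder collector address.
import logging
import logging.handlers

handler = logging.handlers.SysLogHandler(address=("logs.example.com", 514))
handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))

log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("This message travels to the remote collector, not the local OS array")
```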

The other major concern that most people have around their operating systems and hypervisors is reliability. It is common to focus more effort on protecting these relatively unimportant aspects of a system than on the often irreplaceable data. However, operating systems and hypervisors are easily rebuilt from scratch using fresh installs and manual reconfiguration when necessary. The details which could be lost are generally trivial to recreate.

This does not mean that these system filesystems should not be backed up; of course they should (in most cases.) But even if the backups fail as well, the loss of an OS partition or filesystem rarely spells tragedy, only inconvenience. There are ways to recover in nearly all cases without access to the original data, as long as the “data” filesystem is separate. And because of the nature of operating systems and hypervisors, change is rare, so backups can generally be less frequent, possibly triggered manually only when updates are applied!

With many modern systems in the DevOps and Cloud computing spaces it has become very common to view operating systems and hypervisor filesystems as completely disposable since they are defined remotely via a system image or by a configuration management system.  In these cases, which are becoming more and more common, there is no need for data protection or backups as the entire system is designed to be recreated, nearly instantly, without any special interaction.  The system is entirely self-replicating.  This further trivializes the need for system filesystem protection.

Taken together, the minimal performance needs and the fact that protection and reliability are handled primarily through simple recreation leave us with a system filesystem whose needs are very different from what we commonly assume. This does not mean that we should be reckless with our storage; we still want to avoid storage failure while a system is running, and rebuilding unnecessarily is a waste of time and resources even if it does not prove disastrous. So striking a careful balance is important.

It is, of course, for these reasons that including the operating system or hypervisor on the same storage array as the data is now common practice. There is little to no need for access to the system files at the same time that the data files are being accessed, so we get great synergy: fast boot times for the OS and no adverse impact on data access times once the system is online. This is the primary means by which system designers today tackle the need for efficient use of storage.

When the operating system or hypervisor must be separated from the arrays holding data, which can still happen for myriad reasons, we generally seek reasonable reliability at low cost. When using traditional storage (local disks) this means using small, slow, low cost spinning drives for operating system storage, generally in a simple RAID 1 configuration. A real world example is the use of 5400 RPM “eco-friendly” SATA drives in the smallest sizes possible. These draw little power and are very inexpensive to acquire. SSDs and high speed SAS drives should be avoided as they carry a premium for protection that is irrelevant and performance that is completely wasted.

In less traditional storage it is common to use a low cost, high density SAN to consolidate the low priority storage for many systems onto shared, slow arrays that are not replicated. This is only effective in environments large enough to justify the additional architectural design and to achieve enough density in the consolidation process to create the necessary cost savings, but in larger environments this is relatively easy. SAN boot devices can leverage very low cost arrays across many servers for cost savings. In the virtual space this could mean a low performance datastore used for OS virtual disks and another, high performance, pool for data virtual disks. This would have the same effect as the boot SAN strategy but in a more modern setting and could easily leverage the SAN architecture under the hood to accomplish it.

Finally, and most dramatically, it is a general rule of thumb with hypervisors to install them to SD cards or USB thumb drives rather than to traditional storage, as their performance and reliability needs are even lower than those of traditional operating systems. Normally if a drive of this nature were to fail while a system was running, the system would actually keep running without any problem as the drive is never used once the system has booted. It would only be during a reboot that an issue would be found and, at that time, a backup boot device could be used, such as a secondary SD card or USB stick. This is the official recommendation for VMware vSphere, is often recommended by Microsoft representatives for Hyper-V and officially supported through Hyper-V’s OEM vendors, and is often recommended, though not so broadly supported, for Xen, XenServer and KVM systems. Using SD cards or USB drives for hypervisor storage effectively turns a virtualization server into an embedded system. While this may feel unnatural to system administrators who are used to thinking of traditional disks as a necessity for servers, it is important to remember that enterprise class, highly critical systems like routers and switches last decades and use this exact same strategy for the exact same reasons.

A common strategy for hypervisors in this embedded style mode with SD cards or USB drives is to have two such devices, which may actually be one SD card and one USB drive, each with a copy of the hypervisor. If one device fails, booting to the second device is nearly as effective as a traditional RAID 1 system. But unlike most traditional RAID 1 setups, we also have a relatively easy means of testing system updates by updating only one boot device at a time and testing the process before updating the second, leaving us with a reliable, well tested fallback in case a version update goes awry. This process was actually common on large UNIX RISC systems, where boot devices were often local software RAID 1 sets that supported a similar practice, especially in AIX and Solaris circles.

It should also be noted that while this approach is the best practice for most hypervisor scenarios, there is actually no reason why it cannot be applied to full operating system filesystems too, except that it is often more work. Some OSes, especially Linux and BSD, are very adept at being installed in an embedded fashion and can easily be adapted for installation on an SD card or USB drive with a little planning. This approach is not at all common, but there is no technical reason why, in the right circumstances, it would not be an excellent one, aside from the fact that an OS should almost never be installed to physical hardware rather than on top of a hypervisor. In those cases where physical installs are necessary, this approach is extremely valid.

When designing and planning for storage systems, remember to be mindful as to what read and write patterns will really look like when a system is running. And remember that storage has changed rather dramatically since many traditional guidelines were developed, and not all of the knowledge used to develop them still applies today or applies equally. Think about not only which storage subsystems will attempt to use storage performance but also how they will interact with each other (for example, do two systems never request storage access at the same time or will they conflict regularly) and whether or not their access performance is important. General operating system functions can be exceedingly slow on a database server without negative impact; all that matters is the speed at which the database can be accessed. Even access to application binaries is often irrelevant as they too, once loaded into memory, remain there and only memory speed impacts ongoing performance.

None of this is meant to suggest that separating OS and data storage subsystems from each other is advised; it often is not. I have written in the past about how consolidating these subsystems is quite frequently the best course of action and that remains true now. But there are also many reasonable cases where splitting certain storage needs from each other makes sense, often when dealing with large scale systems where we can lower cost by dedicating high cost storage to certain needs and low cost storage to others, and it is in those cases that I want to demonstrate that operating systems and hypervisors should be considered the lowest priority in terms of both performance and reliability, except in the most extreme cases.

What Do I Do Now? Planning for Design Changes

Quite often I find myself talking to people about their system designs, plans and architectures. And many times that discussion happens too late, when designs are either already implemented or partially implemented. This can be very frustrating if the design in progress has been deemed not to be ideal for the situation.

I understand the feeling of frustration that comes from a situation like this, but it is something that we in IT must face on a very regular basis, and managing this reaction constructively is a key IT skill. We must become masters of this situation both technically and emotionally. It is a natural situation that every IT professional will experience regularly; it should not be discouraging or crippling, but it is very understandable that it can feel that way.

One key reason that we experience this so often is because IT is a massive field with a great number of variables to be considered in every situation. It is also a highly creative field where there can be numerous viable approaches to any given problem. That there is even a single “best” option is rarely true. Normally there are many competitive options. Sometimes these are very closely related; sometimes they are drastically different, making them very difficult to compare meaningfully.

Another key reason is that factors change.  This could be that new techniques or information come to light, new products are released, products are updated, prices change or business needs change near to or even during the decision making and design processes.  This rate of change is not something that we, as IT professionals, can hope to ever control.  It is something that we must accept and deal with as best as we can.

Another thing that I often see missed is that a solution that was ideal when chosen may not be ideal if the same decision were being made today. This does not, in any way, constitute a deficiency in the original design, yet I have seen many people react to it as if it did. The most common scenario where I see this behaviour is the aversion to the use of RAID 5 in modern storage design, with RAID 6 and RAID 10 being the popular alternatives for good reason. But this RAID 5 aversion, common since about 2009, did not always exist; from the middle of the 1990s until nearly the end of the 2000s RAID 5 was not only viable, it was very commonly the best solution for the given business and technical needs (the increase in aversion to it was mostly gradual, not sudden.) However, many people who understandably see RAID 5 as a poor option today apply this new aversion to systems designed and implemented long ago, sometimes close to two decades ago. This makes no sense and is purely an emotional reaction. RAID 5 being the best choice for a scenario in 2002 in no way implies that it will still be the best choice in 2015. But likewise, RAID 5 being a poor choice for a scenario in 2015 in no way belittles or negates the fact that it was very often a great choice several years ago.

I have been asked many times what to do once less than ideal design decisions have been made.  “What do I do now?”

Learning what to do when perfection is no longer an option (as if it ever really was, all IT is about compromises) is a very important skill.  The first things that we must tackle are the emotional problems as these will undermine everything else.  We must do our best to step back, accept the situation and act rationally.  The last thing that we want to do is take a non-ideal situation and make things worse by attempting to reverse justify bad decisions or panicking.

Accepting that no design is perfect, that there is no way to always get things completely right and that dealing with this is just part of working in IT is the first step. Step back, breathe deep. It isn’t that bad. This is not a unique situation. Every IT pro doing design goes through this all of the time. You should try your best to make the best decisions possible, but you must also accept that that can rarely be done; no one has access to enough resources to really be able to do that. We work with what we have. So here we are. What’s next?

Next is to assess the situation. Where are we now? In many cases the implementation is done and there is nothing more to do. The situation is not ideal, but is it bad? Very often the biggest mistake that I see with an already implemented design is that it was too costly; typically the “better” solutions are not better because they are faster or more reliable but because they would have been cheaper, easier or faster to implement. That is an unfortunate situation but hardly a crippling one. Whatever time or money was spent must have been an acceptable amount at the time and must have been approved. The best that we can do, right now, is learn from the decision process and attempt to avoid the overspending in the future. It does not mean that the existing solution will not work, or even not work amazingly well. It is simply that it may not have been a perfect choice given the business needs, primarily financial, involved.

There are situations where a design that has been implemented does not adequately meet the stated business requirements. This is thankfully less common, in my experience, as it is a much more difficult situation. In this case we need to make some modifications in order to fulfill our business needs. This may prove to be expensive or complex. But things may not be as bad as they seem. Initial reactions to this are often misleading and the situation can frequently be salvaged.

The first step once we are in a position where we have implemented a solution that fails to meet business needs is to reassess the business needs. This is not to imply that we should fudge the needs to massage them into whatever our system is able to fulfill, not at all. But it is a good time to go back and see if the originally stated needs are truly valid, or if they were simply not vetted well enough or, even more likely, whether the business needs changed while the implementation took place. It may be that the implemented solution does, in fact, meet actual business needs, either because they were originally misstated or because the needs have changed over time. Or it might be that business needs have changed so dramatically that even perfect planning would have fallen short of the existing needs and the fact that the implemented solution does not perform as expected is of minor consequence. I have been very surprised just how often this verification of business needs has turned a solution believed to be inadequate into an “overkill” solution that actually cost more than necessary, simply because no one pushed back on overstatements of business needs or questioned the financial value of certain technology investments.

The second step is to create a new technology baseline. This is a very important step to assist in preventing IT from falling into the trap of the sunk cost fallacy. It is extremely common for anyone, and this is not unique to IT in any way, to look at the time and money spent on a project and assume that continuing down the original path, no matter how foolish it is, is the way to go because so many resources have been expended on that path already. But this makes no sense; how you got to your current state is irrelevant. What is relevant is assessing the current needs of the department and company and taking stock of the currently available solutions, technologies and resources. Given the current state, the best course forward can be determined. Any consideration given to the effort expended to get to the current state is only misleading.

A good example of the sunk cost fallacy is in the game of chess. With each move it is important to assess all available moves, risks and strategies again, because the moves used to get to the current state have no bearing on what moves make sense going forward. If the world’s greatest chess player or an amazing computer chess algorithm were to be brought in mid-game they would not require any knowledge as to how the current state had come to be; they would simply assess the current state and create a strategy based upon it.

This is how we should behave in IT. Our current state is our current state. It does not matter for strategic planning what unfolded to get us into that state. We only care about those decisions and costs when doing a post mortem to determine where decision making may have failed, so that we can learn from it. Learning about ourselves and our processes is very important. But that is a very different task from doing strategic planning for the current initiative.

The unfortunate thing here is that we must begin our planning process again, though this time, we assume, with more to work with. But this cannot be avoided. In the worst cases, budgets are no longer available and there are no resources to fix the flawed design and achieve the necessary business goals. Compromises are sometimes necessary. Making do with what we have is sometimes the best that we can do. But in the vast majority of cases, it would seem, some combination of additional budget or creative reuse of existing products can be adequate to remedy the situation.

Once we have reached a state in which we have addressed our shortfalls, whether simply by accepting that we have overspent or under-delivered or by having adjusted to meet needs, we have an opportunity to go back and investigate our decision making processes. It is by doing this that we hope to grow as individuals and, if at all possible, on an organizational level, to learn from our mistakes, or to determine if there even were mistakes. Every company and every individual makes mistakes. What separates us is the ability to learn from them and avoid the same mistakes in the future. Growth comes primarily from experiencing pain in this way, and while it is often unpleasant to face, it is here that we have the best opportunity to create real, lasting value. Do not push off or skip this opportunity for review, whether it be a harsh, personal review that you do yourself, a formal, organizational review run by people trained to do so, or something in-between. The sooner the decision processes are evaluated, the fresher the memory will be and the sooner the course correction can take effect.

The final step is to begin the decision process for designing a replacement for the current implementation as soon as possible, once the review of the decision process is complete. This does not necessarily mean that we should intend to spend money or change our designs in the near future, not at all. But by being extremely proactive in decision making we can attempt to avoid the problems of the past: we give ourselves additional time for planning, more time for requirements gathering and documentation, better insight into changes in requirements over time by regularly revisiting them to see if they remain stable or are changing, more opportunity to get management and peer buy-in and investment in the decision, and a better understanding of the problem domain so that we are better equipped to alter the intended design, or know when to scrap it and start over, before implementing it the next time. It could also give us a better chance of codifying organizational knowledge that can be passed on to a successor should you not be in the position of decision making or implementation when the next cycle comes around.

With good, rational processes and a good understanding of the steps that need to be taken in a case of less than ideal system design or implementation, we can recover from missteps, in most cases in the short term, and insulate the organization from the same mistakes in the future.

Better IT Hiring: Contract To Hire

Information Technology workers are bombarded with “Contract to Hire” positions, often daily. There are reasons why this method of hiring and working is fundamentally wrong, and while workers immediately identify these positions as bad choices, few really take the time to move beyond the emotional reaction to understand why this working method is so flawed and, more importantly, few companies take the time to explore why using tactics such as this undermines their staffing goals.

To begin we must understand that there are two basic types of technology workers: consultants (also called contractors) and permanent employees (commonly known as FTEs.) Nearly all IT workers desire to be in one of these two categories. Neither is better or worse; they are simply two different approaches to employment engagements and represent differences in personality, career goals, life situations and so forth. Workers do not always get to work the way that they desire, but essentially all IT workers seek to be in either one camp or the other.

Understanding the desires and motivations of IT workers seeking to be full time employees is generally very easy. Employees, in theory, have good salaries, stable work situations, comfort, continuity, benefits, vacations, protection and so forth. At least this is how it seems; whether these aspects are real or just illusory can be debated elsewhere. What is important is that most people understand why people want to be employees, but the opposite is rarely true. Many people lack empathy for those seeking not to be employees.

Understanding professional or intentional consultants can be difficult.  Consultants live a less settled life but generally earn higher salaries and advance in their careers faster, see more diverse environments, get a better chance to learn and grow, are pushed harder and have more flexibility.  There are many factors which can make consulting or contracting intentionally a sensible decision.  Intentional contracting is very often favored by younger professionals looking to grow quickly and gain experience that they otherwise could not obtain.

What makes this matter more confusing is that the majority of workers in IT wish to work as full time employees, but a great many end up settling for contract positions to hold them over until a desired full time position can be acquired. This situation arises so commonly that a great many people, both inside and outside of the industry and on both sides of the interview table, may mistakenly believe that all cases are this way and that consulting is a lower or lesser form of employment. This is completely wrong. In many cases consulting is highly desired and contractors can benefit greatly from their choice of engagement methodology. I, myself, spent most of my early career, around fifteen years, seeking only to work as a contractor and had little desire to land a permanent post. I wanted rapid advancement, opportunities to learn, chances to travel and variety.

It is not uncommon at all for the desired mode of employment to change over time. It is most common for contractors to seek to move to full employment at some point in their careers; contracting is often exhausting and harder to sustain over a long career. But certainly full time employees sometimes choose to move into a more mobile and adventurous contracting mode as well. And many choose to work only one style or the other for the entirety of their careers.

Understanding these two models is key. What does not fit into this model is the concept of a Contract to Hire. This hiring methodology starts by hiring someone willing to work a contract position and then, sometimes after a set period of time and sometimes after an indefinite one, promises a second determination as to whether said team member should be “converted” into an employee or let go. This does not match up well against the two types of workers; neither type wants to start as one thing and then become another. Possibly somewhere there is an IT worker who would like to work as a contractor for four months and then become an employee, getting benefits only after a four month delay, but I am not aware of such a person, and it is reasonable to assume that if such a person exists he is unique, has already done this process and would not want to do it again.

This leaves us with two resulting models to match to this situation. The first is the more common model of an IT worker seeking permanent employment and being offered a Contract to Hire position. For this worker the situation is not ideal: the first four months represent a jarring, complex and scary situation that lacks the benefits and stability that are needed, and the second decision point as to whether a conversion will be offered is frightening. The worker must behave and plan as if there will be no conversion and must actively seek other opportunities during the contract period, opportunities that are pure employment from the beginning. If there were any certainty of the position becoming a full employment one then there would be no contract period at all. The risk to the worker that no conversion will be offered is exceptionally high; in fact, conversions are almost unheard of in the industry.

It must be noted that, for most IT professionals, the chance that a Contract to Hire will truly offer a conversion at the end of the contract duration is seen as so unlikely that the enticement of the conversion is generally assumed to be fake, with no possibility of it happening at all. And for reasons we will discover here, it is obvious why companies would not honestly expect to attempt this process. The term Contract to Hire spells almost certain unemployment for IT workers going down that path. The “to Hire” portion is almost universally nothing more than a marketing ploy, and a very dishonest one.

The other model that we must consider is the model of the contract-desiring employee accepting a Contract to Hire position.  In this model we have the better outcome for both parties.  The worker is happy with the contract arrangement and the company is able to employ someone who is happy to be there and not seeking something that they likely will be unable to get.  In cases where the company was less than forthcoming about the fact that the “to Hire” conversion would never be considered this might actually even work out well, but is far less likely to do so long term and in repeating engagements than if both parties were up front and honest about their intentions on a regular basis.  Even for professional contractors seeing the “to Hire” addendum is a red flag that something is amiss.

The result for a company, however, of obtaining an intentional contractor via a Contract to Hire posting is risky. For one, contractors are highly mobile and are skilled and practiced at finding other positions. They are generally well prepared to leave a position the moment that the original contract is done.

One reason that the term Contract to Hire is used is that companies can easily “string along” someone desiring a conversion to a full time position by dangling the conversion like a carrot and prolonging the contract indefinitely. Intentional contractors will see no carrot in this arrangement and will normally be prepared to leave immediately upon completion of their contract; they can leave without any notice, as they simply need not renew their contract, leaving the company in a lurch of its own making.

Even in scenarios where an intentional contractor is offered a conversion at the end of a contract period there is the very real possibility that they will simply turn down the conversion.  Just as the company maintains the right to not offer the conversion, the IT worker maintains an equal right to not agree to offered terms.  The conversion process is completely optional by both parties.  This, too, can leave the company in a tight position if they were banking on the assumption that all IT workers were highly desirous of permanent employment positions.

This may be the better situation, however. Potentially even worse is an intentional contractor accepting a permanent employment position when they did not actually desire an arrangement of that type. They are likely to find the position to be something that they do not enjoy, or else they would have been seeking such an arrangement already, and will be easily tempted to leave for greener pastures very soon, defeating the purpose of having hired them in the first place.

The idea behind the Contract to Hire movement is the mistaken belief that companies hold all of the cards and that IT workers are all desperate for work and thankful to find any job that they can. This, combined with the incorrect assumption that nearly all IT workers truly want stable, traditional employment as full time employees, makes for a very bad hiring situation.

Based on this, a great many companies attempt to leverage the Contract to Hire term in order to lure more and better IT workers to apply based on false promises or poor matching of employment values.  It is seen as a means of lowering cost, testing out potential employees, hedging bets against future head count needs, etc.

In a market where there is a massive oversupply of IT workers a tactic such as this might actually pay off. In the real world, however, IT workers are in very short supply and everyone is aware of the game that companies play and what this term truly means.

It might be assumed that IT workers would still consider taking a Contract to Hire position because they are willing to take on some risk and hope to convince the employer that conversion, in their case, would be worthwhile. And certainly some companies do run this process, and for some people it has worked out well. However, it should be noted that any contract position offers the potential of a conversion offer, and in positions where the term “Contract to Hire” is not used, conversions, or at least offers of conversion, are actually quite common. It is specifically when a potential future conversion is dangled like a carrot that conversions become exceptionally rare. There is no need for an honest company and a quality workplace to mention “to Hire” when bringing on contractors.

What happens, however, is more complex and requires study. In general the best workers in any field are those that are already employed. It goes without saying that the better you are, the more likely you are to be employed. This does not mean that great people never change jobs or find themselves unemployed, but the better you are the less time you will spend seeking employment from a position of being unemployed, and the worse you are the more likely you are to be unemployed involuntarily. That may seem obvious, but when you combine it with other information that we have, something is amiss. A Contract to Hire position can never, effectively, entice currently working people in any way. A great offer of true, full time employment with better pay and benefits might entice someone to give up an existing position for a better one; that happens every day. But good people generally have good jobs and are not going to give up the positions, safety and stability that they have to join an unknown situation that only offers a short term contract with an almost certainly no-chance conversion carrot. It just is not going to happen.

Likewise, when good IT workers are unemployed they are not very likely to be in a position of desperation, and even then are very unlikely to even look at a position listed as Contract to Hire (or contract at all), as most people want full time employment and good IT people will generally be far too busy turning down offers to waste time looking at Contract to Hire positions. Good IT workers are flooded with employment opportunities and being able to quickly filter out those that are not serious is a necessity. The words “Contract to Hire” are one of the best low hanging fruits of this filtering process. You do not need to see what company it is, what region it is in, what the position is or what experience they expect. The position is not what you are looking for; move along, nothing to see here.

The idea that employers seem to have is the belief that everyone, employed and unemployed IT workers alike, is desperate and thankful for any possible job opening. This is completely flawed. Most of the industry is doing very well and there is no way to fill all of the existing job openings that we have today; IT workers are in demand. Certainly there is always a certain segment of the IT worker population that is desperate for work for one reason or another: personal situations, geographic ties, an overstaffed technology specialization or, most commonly, not being very competitive.

What Contract to Hire positions do is filter out the best people. They effectively filter out every currently employed IT worker completely. In-demand skill groups (like Linux, storage, cloud and virtualization) will be filtered out too; they are able to find work anywhere and need not consider poor offerings. Highly skilled individuals, even when out of work, will self-filter as they are looking for something good, not just anything that comes along.

At the end of the day, the only people seriously considering Contract to Hire positions in any number, often even to the point of being the only ones willing to respond to the postings, are the truly desperate: those with so little experience that they do not realize how foolish the concept is or, more commonly by far, those that are long out of work, have few prospects and feel that the incredible risks and low quality of work associated with Contract to Hire are acceptable.

This hiring problem begins a vicious loop of low quality, if one did not already exist; most likely, quality issues already existed before a company considered a Contract to Hire tactic. Once good people begin to avoid a company, and this will happen even if only some positions are Contract to Hire because the quality of the hiring process is exposed, the quality of those able to be hired will begin to decline. The worse it gets, the harder it is to turn the ship around. Good people attract good people. Good IT workers want to work with great IT workers to mentor them, to train them and to provide places where they can advance by doing a good job. Good people do not seek to work in a shop staffed by the desperate, both because working only with desperate people is depressing and the quality of work is very poor, and because once a shop gains a poor reputation it is very hard to shake and good people will be wary of having their own reputation tarnished by having worked in such a place.

Contract to Hire tactics signal desperation and a willingness to admit defeat on the part of an employer. Once a company sinks to this level with its hiring it is no longer focusing on building great teams, acquiring amazing talent or providing a wonderful work environment. Contract to Hire is not something that every IT professional can avoid all of the time; all of us have times when we have to accept something less than ideal. But it is important for all parties involved to understand their options and just what it means when a company moves into this mode. Contract to Hire is not a tactic for vetting potential hires; it simply does not work that way. Instead it causes companies to be vetted and filtered out of consideration by the bulk of potential candidates, without those metrics ever being made available to the hiring firms. Potential candidates simply ignore them and write them off, sometimes noting who is hiring this way and avoiding them even when other options come along in the future.

As a company, if you desire to have a great IT department and hire good people, do not allow Contract to Hire to ever be associated with your firm. Hire full time employees and hire intentional contractors, but do not play games by dangling false carrots in the hope that contractors will change their personalities or that full time employees will take huge personal risks for no reason; that is simply not how the real world works.
