IT Roles: Productivity and Availability

As IT managers we must deal with two very different types of technical professionals. These two types are separated not by their personality types or working styles but by the very nature of their job roles. Understanding the unique needs of these two job types is critical to effectively managing technical workers, yet few IT departments truly take the time to understand and appreciate the nuances inherent to these two different roles.

The first type, and by far the best understood, I will call the “engineer.” This engineering role encompasses a massive array of job functions including software developers, designers, architects, systems engineers, network engineers and anyone else whose primary function is to creatively design or implement new systems of any sort. The term engineer is a loose one but is relatively meaningful.

The second type of technology worker can be generically referred to as the “support” role. Support professions might include helpdesk, systems administration, desktop support, network monitoring, command center and so on. What separates support professionals from engineering professionals is that they are not tasked with creative processes involving new designs or implementations; instead they work with existing systems, ensuring that they run properly and fixing them quickly when something goes wrong.

It goes without saying that no real-world human is likely to fall completely into only one category, but almost all job functions in IT lean very heavily toward one or the other. It is safe to assume that almost any role will be exceptionally weighted toward one of the two; it is very rare for a single position to be split evenly between them.

Where this identification of roles comes into play is in knowing how to measure and manage technical staff. Measuring and managing engineers, at a very high level, is quite well understood. The concept of productivity is simple and meaningful for engineering roles. The goal of managing an engineering person or team is to allow and encourage that role to output as much creative design or implementation as possible. The concept of quality exists as well, of course, but we can still think about engineering roles in relatively concrete terms such as the number of functions written, the number of deployment packages produced or the size of the network designed. Metrics are a fuzzy thing, but we at least have a good idea of what efficiency means for an engineer even if we cannot necessarily measure it accurately.

Support roles do not have this same concept. Sure, you could use an artificial metric such as “tickets closed” to measure productivity in a support role, but that would be very misleading. One ticket could be trivial and the next a large research challenge. In many cases there may be no tickets for a long stretch and then many arrive at once that cannot all be serviced simultaneously. Productivity is likely to be sporadic and unsustainable and, ultimately, not at all meaningful to measure.

Engineering positions earn their keep by producing output effectively over a rather long period of time, often spanning months or even years for large projects. The goal with engineering positions, therefore, is to provide an environment that encourages sustainable productivity. It is well known that engineers often gain productivity by working shortened or alternative hours, taking regular vacations and so forth. Not only does this often increase productivity, it often greatly increases the quality of the output as well.

Support positions earn their bread and butter by “being there” when needed. If a support person is working at maximum efficiency, the natural implication is that there is a continuous backlog of support issues awaiting the team’s attention: a standing queue of people waiting for help. A standing queue also means that support personnel are continuously taking work off the stack rather than responding to live items, either ignoring high priority issues or being regularly interrupted. The result is constant context switching, which significantly reduces the team’s ability to handle the very queue whose existence created the appearance of productivity in the first place.

Support roles are “event driven.” I like this terminology because I think it most accurately describes the mode in which nearly all support professionals work. Whether an event is generated by a phone call, an instant message, an email or a ticket, it is an “event” that kicks off the transition of the support person from idle to action or, in some cases, from a low priority item to a high priority one. One way or another, an event represents a “context switch” for the support professional. Without an event there is nothing for a support professional to do. Even if the “event” is represented by a ticket queue or an email backlog, it is still a form of event.
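To make the event model concrete, here is a minimal sketch in Python, purely illustrative and with invented channel names and priorities, of how events from different channels might be ordered. An interrupting channel such as the telephone jumps ahead of every queued item no matter when it arrived.

    import heapq

    # Illustrative model of an event driven support queue. Lower
    # priority numbers are handled first; the arrival counter breaks
    # ties so equal-priority events run first come, first served.
    support_queue = []
    arrival = 0

    def raise_event(priority, description):
        global arrival
        heapq.heappush(support_queue, (priority, arrival, description))
        arrival += 1

    raise_event(3, "ticket: printer offline")
    raise_event(3, "email: password reset request")
    raise_event(1, "phone: caller demands immediate help")  # preempts the queue

    while support_queue:
        priority, _, event = heapq.heappop(support_queue)
        print(f"context switch -> handling (p{priority}): {event}")

The point of the sketch is the last event: an interrupting channel is serviced ahead of work that was already waiting, which is exactly the context switching problem the telephone discussion below describes.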

Having a truly efficient support desk requires careful management of the event process. A never ending queue of support issues is exhausting for support professionals, and it also means that no staff are ever in an “idle” state awaiting high priority items. Because of this, high priority items either are not addressed as quickly as they should be or in-process items are neglected.

Understanding the event driven nature of support staff is critical to understanding how to approach the management of these teams. There are no simple answers, and metrics for support staff are often even more meaningless than those for engineering staff, so use them with extreme caution. But by empathizing with the support role we can begin to see where our role as support managers fits into the bigger picture of supporting and promoting the support team members.

The most important concept, from my experience, is managing the flow of interrupts going to the support team. Support teams often handle a number of different avenues for support, such as email and telephone. Restricting and funneling events to appropriate channels is critical.

The problem with telephones is that they are aggressive and demand an immediate context switch whether the recipient is idle or currently supporting the most critical production outage in corporate history. The person calling is assuming that their immediate need outweighs the current needs of whomever the support person is helping at that moment. Telephones cause this problem everywhere they are used.

Think about the last time you were at a pizza parlor placing your order at the counter. You waited in line patiently as each person was served. You did the right thing. You arrived at the front of the queue and began to place your order when the phone rang. The person taking your order put you on “hold” even though you were standing right there, picked up the phone, took the caller’s order, hung up and returned to you. What this says is that the person calling, being the “squeaky wheel,” is more important to the restaurant than the people actually in the restaurant. This same effect happens on many support desks: in-process work is interrupted by calls going to a group line or directly to the support person. This is, at best, inefficient and, at worst, may disrupt support processes for highly critical issues.

So when thinking about how to manage IT professionals, think about the purpose of their role.  The goal of an engineer is productivity.  The goal of a support professional is availability.

Why We Reboot Servers

A question that comes up on a pretty regular basis is whether servers should be routinely rebooted, perhaps once per week, or allowed to run for as long as possible to achieve maximum “uptime.” To me the answer is simple: with rare exception, regular reboots are the most appropriate choice for servers.

As with any rule, there are cases where it does not apply. For example, some businesses running critical systems have no allotment for downtime and must be available 24/7. Obviously systems like this cannot simply be rebooted routinely. However, if a system is so critical that it can never go down, that should raise a red flag: the system is a single point of failure, and consideration should be given to how downtime, whether planned or unplanned, will be handled.

Another exception is that some AIX systems need significant uptime, often more than a few weeks, to reach maximum efficiency, because the operating system is self-tuning and needs time to gather usage information and adjust itself accordingly. This tends to be limited to large, seldom-changing database servers and similar, relatively uncommon scenarios.

In IT we often worship the concept of “uptime,” how long a system can run without needing to restart. But uptime is not a concept that brings value to the business, and IT needs to keep the business’ needs in mind at all times rather than focusing on artificial metrics. The business is not concerned with how long a server has managed to stay online without rebooting; it cares only that the server is available and ready when needed for business processing. These are very different concepts.

For almost any normal business server, there is a window when the server needs to be available for business purposes and a window when it is not needed. These windows may be daily, weekly or monthly, but it is a rare server that is actually in use around the clock without exception.

I often hear people state that because they run operating system X rather than Y they no longer need to reboot, but this is simply not true. There are two main reasons to reboot on a regular basis: to verify the ability of the server to reboot successfully and to apply patches that cannot take effect without a reboot.

Applying patches is why most businesses reboot. Almost all operating systems receive regular updates that require rebooting in order to take effect. As most patches, especially those requiring a reboot, are released for security and stability purposes, the importance of applying them is rather high. Leaving a server unnecessarily vulnerable just to maintain uptime is not wise.
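As a concrete example, Debian and Ubuntu systems write a flag file when an installed patch requires a restart. The short Python sketch below, illustrative only and specific to that family of distributions, checks for the flag; other platforms signal the same condition through different tools.

    #!/usr/bin/env python3
    # Minimal sketch: report whether a Debian/Ubuntu host has pending
    # patches that require a reboot. The flag files are created by the
    # distribution's update tooling; this check does not apply to
    # other operating systems.
    from pathlib import Path

    flag = Path("/var/run/reboot-required")
    packages = Path("/var/run/reboot-required.pkgs")

    if flag.exists():
        print("Reboot required to finish applying patches.")
        if packages.exists():
            print("Packages responsible:")
            print(packages.read_text(), end="")
    else:
        print("No reboot currently required.")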

Testing a server’s capacity to reboot successfully is what is often overlooked. Most servers have changes applied to them on a regular basis. Changes might be patches, new applications, configuration changes, updates or similar. Any change introduces risk. Just because a server is healthy immediately after a change is applied does not mean that the server or the applications running on it will start as expected on reboot.

If the server is never rebooted, we never know whether it can reboot successfully. Over time the number of changes applied since the last reboot will increase. This is very dangerous. What we fear is a large number of changes having been made, possibly many of them undocumented, followed by a failed reboot. At that point, identifying which change is causing the system to fail could be a nearly insurmountable task: no single change to roll back, no known path to recovery. This is when panic sets in. Of course, a box that is never rebooted intentionally is more likely to reboot unintentionally, meaning a failed reboot is both more likely to occur and more likely to occur while the system is in active use.

Regular reboots are not intended to reduce the frequency of failed reboots; in fact, they increase the total occurrence of failures. Their purpose is to make those failures easily manageable from a “known change” standpoint and, more importantly, to control when reboots occur so that they happen at a time when the server is designated as available for maintenance and intended to be stressed, allowing problems to be found when they can be mitigated without business impact.
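In practice, a scheduled reboot can be as simple as a cron entry. The line below is a hypothetical example for a Linux host using the root crontab, with the day and hour chosen to match an assumed Sunday early-morning maintenance window; adjust both to fit your own window.

    # Hypothetical root crontab entry: reboot at 03:00 every Sunday,
    # inside the agreed maintenance window.
    # fields: minute hour day-of-month month day-of-week
    0 3 * * 0 /sbin/shutdown -r now "Scheduled weekly maintenance reboot"

A reboot that fails at Sunday 03:00 still pages someone, but it pages them at a time when the problem can be fixed before the business opens.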

I have heard many a system administrator state that they avoid weekend reboots because they do not want to be stuck working on Sundays due to servers failing to come back up. I have been paged many a Sunday morning over a failed reboot myself, but every time I receive that call I feel a sense of relief. I know that we just caught an issue at a time when the business is not impacted financially. Had that server not been restarted during off hours, it might not have been discovered to be “unbootable” until it failed during active business hours and caused a loss of revenue.

Thanks to regular weekend reboots, we can catch pending disasters safely and, knowing that we have only one week’s worth of changes to investigate, we are routinely able to fix the problems with little effort and great confidence that we understand what changes were made prior to the failure.

Regular reboots are about protecting the business from outages and downtime that can be mitigated through very simple and reliable processes.

IT in a Bubble

It is an old story in SMB IT: IT managers who get their start young, stay with a single company, work their way through the ranks and become venerable IT managers who have never worked outside of their current environment. Just like the “good old days” when people stuck with a single company for their entire careers, this sounds like a wonderful thing. But IT has long rewarded “job hoppers,” those technically minded folk who move from shop to shop every few years. The lack of direct upward mobility within individual shops has encouraged this process: incremental promotions could usually be found only between companies, seldom within a single one.

Some people support and some dispute the idea that there is value, or significant value, to be had by changing companies. The idea is that by moving between environments you glean techniques, procedures, processes and general experience that you then bring to your next position; you are a cumulative product of all of your past environments. This concept, I believe, has some merit, more so in technology than in other fields.

In technology fields, I believe that moving between jobs after a reasonable amount of time generally provides much better value than staying put. The reason for this is relatively simple: most small businesses lack an ecosystem of support and training for IT professionals. It is well known that IT professionals working in small shops lack the interaction with peers and vendors generally accepted as necessary for healthy professional development, interaction which is common in enterprise shops.

An IT professional, after spending many years effectively all alone in a small shop, tends to feel isolated, lacking the professional interaction that most specialists enjoy. Most small professional or artisan shops have a number of specialists who work together, share research and experience, and are encouraged to work with competitors and vendors, attend trade events, receive training and so on. Few fields share IT’s odd dispersion of professionals, with only one or two people working together at any given company and little to no interaction with the outside world or with peers at other companies.

This isolation, left unchecked, can lead to “IT insanity.” An IT professional working in a vacuum with little to no technical or professional feedback will lose the ability to assess themselves against other professionals. As often the sole provider of technology guidance and policy for years or even decades, a lone IT professional can easily drift off course, losing contact with and course correction from the larger IT field, with the only guidance coming through the filtered world of vendors attempting to sell expensive products and services.

IT professionals suffering from “IT insanity” will often be found implementing bizarre, nonsensical policies that would never be tolerated in a shop with a strong peer-review mechanism, purchasing incredibly overpriced solutions for simple problems and working either completely with or completely without mainstream technologies, depending mostly on individual personality. This is caused partly by an increasing reliance on a single, established skill set, as the lack of environmental change encourages continued dependence on existing skills and procedures.

IT insanity commonly arises in shops that have only a single IT professional, or in shops with a strict hierarchy and no movement in the management ranks, where fresh ideas and experience from younger professionals never feed up to the managers and instead established practices and “because I said so” policies are forced down the chain to the technologists actually implementing solutions.

This is not to say that all is lost; there are steps that can be taken to avoid this scenario. The first is to consider outsourcing IT: any shop small enough to face this dilemma should seriously consider whether having full time, dedicated internal staff makes sense in its environment. Looking for fresh blood is another option; bringing in IT professionals from other shops, and even other industries, can work wonders. In extreme cases, some shops will even trade staff back and forth, keeping their existing employees while still managing to “mix things up.”

Short of drastic measures such as changing employees entirely, non-IT organizations need to think seriously about the professional health of their staff and look for opportunities for peer interaction. IT professionals need continuous professional interaction for many reasons, and organizations need to actively support and promote this behavior. Sending staff to training, seminars, peer groups, conventions and shows, or even out as volunteers to non-profit and community activities where they can provide IT support in an alternative environment, can do wonders for getting them out of the office, face to face with alternative viewpoints, and hands on with different technologies than they see in their day-to-day work.

IT managers need opportunities to explore different solution sets and to learn what others are doing in order to best be able to offer objective, broad-based decision making value to their own organizations.

IT Managers and the Value of Decision Making

When I was new to IT, I remember people using the phrase “No one ever got fired for buying IBM.” At the time I was young and did not think much about what the phrase implies. Recently I heard it again, except this time it was “No one ever gets fired for buying Cisco,” and soon thereafter I heard it applied to virtualization and VMware. This time I stopped to think about what exactly I was being told.

At face value, the statement comes as little more than an observation, but the intent runs much deeper.  The statement is used as a justification for a decision that has been made and implies that the decision was made not because the product or vendor in question was the best choice but because it was the choice that was believed to have the least risk involved for the decision maker.  Not the least risk or most value for the organization – least risk to the decision maker.

This implies one of two possibilities. The first is that the decision maker in question, presumably an IT manager, feels that due diligence and careful analysis are not recognized or rewarded by the organization: that vendor marketing aimed at non-IT management has convinced management that those products and services are superior, without consideration for functionality, cost, reliability or service.

The second possibility is that the IT decision maker believes they can get away without performing the cost, risk and functionality analysis that would be deemed proper for deciding between competing options, and that by picking a popular option, well known in the marketplace, they will be shielded from serious inquiry into their process, simply delivering what sounds like a plausible solution with minimal effort on their part.

As IT managers, one of the most crucial job functions we perform is identifying, evaluating and recommending products and solutions to our organizations. The fact that phrases like these are used so commonly suggests that a large percentage of IT managers and advisers forgo the difficult and laborious process of researching products and solutions, banking instead on an easy decision that is likely to seem reasonable to management, regardless of whether it is a viable solution, let alone the best one for the organization. The result is that a very expensive product is often chosen when a less expensive or less well known option might have worked as well or better; in extreme cases, a product recommended this way may not even meet the needs of the organization at all.

IT lives and dies by the decision making value that it brings to the organization. We hate to admit it, but finding people who can fix desktops is not that hard, and the economic value of someone who can fix anything wrong on a desktop, versus simply rebuilding it, is small. If we eliminate quality decision analysis from the IT manager’s skill set, what value does he or she bring to the company?
