
Solution Elegance

It is very easy, when working in IT, to become focused on big, complex solutions.  It seems that this is where the good solutions must lie – big solutions, lots of software, all the latest gadgets.  What we do is exciting and it is very easy to get caught up in the momentum.  It’s fun to do challenging, big projects.  Hearing what other IT pros are doing, how other companies solve challenges and talking to vendors with large systems to sell all adds to the excitement.  It is very easy to lose a sense of scope and goal, and big, over the top solutions to simple problems are so common that it seems like this must just be how IT is.

But it need not be.  Complexity is the enemy of both reliability and security.  Unnecessarily complex solutions increase cost in acquisition, implementation and maintenance while generally being slower, more fragile and possessing a larger attack surface that is harder to comprehend and protect.  Simple, or more appropriately, elegant solutions are the best approach.  This does not mean that all designs will be simple, not at all.  Complex designs are often required.  IT is hardly a field that has any lack of complexity.  In fact it is often believed that software development may be the most complex of all human endeavors, at least of those partaken of on any scale.  A typical IT installation includes millions of lines of code, hundreds or thousands of protocols, large numbers of interconnected systems, layers of unique software configurations and more settings than any team could possibly know, and only then do we add in the complexity of hundreds or thousands or hundreds of thousands of unpredictable, irrational humans trying to use these systems, each in a unique way.  IT is, without a doubt, complex.

What is important is to recognize that IT is complex and that this cannot be avoided completely, but to focus on designing and engineering solutions to be as simple, as graceful… as elegant as possible.  This design idea comes from, at least in my mind, software engineering, where complex code is seen as a mistake and simple, beautiful code that is easy to read and easy to understand is considered successful.  One of the highest accolades that can be bestowed upon a software engineer is for her code to be deemed elegant.  How apropos that this famous quote (translated poorly from French) is attributed to Blaise Pascal, after whom one of the most popular programming languages of the 1970s and 1980s was named: “I am sorry I have had to write you such a long letter, but I did not have time to write you a short one.”

It is often far easier to design complex, convoluted solutions than it is to determine what simple approach would suffice.  Whether we are in a hurry or don’t know where to begin an investigation, elegance is always a challenge.  The industry momentum is to promote the more difficult path.  It is in the interest of vendors to sell more gear, not only to make the initial sale but because they know that with more equipment comes more support dollars, and if enough new, complex equipment is sold the support needs stop increasing linearly and begin to increase geometrically as additional support is needed not just for the equipment or software itself but also for the configuration and support of system interactions and additional customization.  The financial influences behind complexity are great, and they do not stop with vendors.  IT professionals gain much job security, or the illusion of it, by managing large sets of hardware and software that are difficult to seamlessly transition to another IT professional.

Often complexity is so assumed, so expected, that the process of selecting a solution begins with great complexity as a foregone conclusion, without any consideration for the possibility that a less complex solution might suffice or even be superior outside of the question of complexity and cost itself.  Complexity is sometimes so completely tied to certain concepts that I have actually faced incredulity at the notion that a simple solution might outperform a complex one in price, performance and reliability.

Rhetoric is easy, but what is a real world example?  The best examples that I see today are mostly related to virtualization, whether vis-à-vis storage, a cloud management layer, software or just virtualization itself.  I see quite frequently that a conversation involving just virtualization brings, for some people, an instant connotation of requiring networked, shared block storage, expensive virtualization management software, many redundant virtualization nodes and complex high availability software – none of which are intrinsic to virtualization and most of which are rarely in support of, or really even in the interest of, the business for which they will be implemented.  Rather than working from business requirements, these concepts arise predominantly from technology preconceptions.  It is simple to point to complexity and appear to be solving a problem – complexity creates a sense of comfort.  Filter many arguments down and you’ll hear “How can it not work, it’s complex?”  Complexity provides an illusion of completeness, of having solved a problem, but this can commonly hide the fact that a solution may not actually be complete or even functional; the degree of complexity makes this difficult to see.  Our minds then do not easily accept that a simpler approach might be more complete and solve a problem when a complex one does not, because it feels so counter-intuitive.

A great example of this is that we resort to discussing redundancy rather than reliability.  Reliability is difficult to measure; redundancy is simple to quantify.  A brick is highly reliable, even when singular.  It does not take redundancy for a brick to be stable and robust.  Its design is simple.  You could make a supporting structure out of many redundant sticks that would not be nearly as reliable as a single brick.  If you talk in terms of reliability – the chance that the structure will not fail – it is clear that the brick is a superior choice to several sticks.  But if you say “there is no redundancy, the brick could fail and there is nothing to take its place,” you sound silly.  When talking about computers and computer systems, though, we find systems so complex that people rarely see whether they have a brick or a stick, and so, since they cannot determine reliability, which matters, they focus on the easy-to-quantify redundancy, which doesn’t.  The entire system is too complex, but seeking the simple solution, the one that directly addresses the crux of the problem to be solved, can reduce complexity and provide a far better answer in the end.

This can even be seen in RAID.  Mirrored RAID is simple: just one disk or set of disks being an exact copy of another set.  It’s so simple.  Parity RAID is complex, with calculations on a variable stripe across many devices that must be encoded when written and decoded should a device fail.  Mirrored RAID lacks this complexity and solves the problem of disk reliability through simple, elegant copy operations that are highly reliable and very well understood.  Parity RAID is unnecessarily complex, making it fragile.  Yet in undermining its own ability to solve the problem for which it was designed it also, simultaneously, becomes seemingly more reliable based on no factor other than its own complexity.  The human mind immediately jumps to “it’s complex, therefore it is more advanced, therefore it is more reliable,” but neither progression is a logical one.  Complexity does not mean that something is more advanced, and being advanced does not mean that it is reliable, but the human mind itself is complex and easily misled.
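
To put rough numbers behind this, consider a toy comparison (a sketch of my own, not a rigorous model) of a two-disk mirror against a four-disk RAID 5 array, assuming independent disk failures and an illustrative 3% annual failure rate per disk.  It ignores rebuild stress, URE exposure and controller complexity, the very factors described above that make parity RAID fragile in practice, yet even this simplified arithmetic shows that the more complex option is not the more reliable one.

  # Toy model: annual chance of losing the array outright, assuming
  # independent disk failures.  Rebuild windows, UREs and parity/controller
  # complexity are ignored; they only disadvantage parity RAID further.

  def mirror_loss(p, disks=2):
      # A mirror is lost only if every copy fails.
      return p ** disks

  def raid5_loss(p, disks=4):
      # RAID 5 survives zero or one failure; two or more failures lose the array.
      survive = (1 - p) ** disks + disks * p * (1 - p) ** (disks - 1)
      return 1 - survive

  p = 0.03  # illustrative 3% annual failure probability per disk
  print(f"2-disk mirror loss: {mirror_loss(p):.5f}")  # ~0.00090
  print(f"4-disk RAID 5 loss: {raid5_loss(p):.5f}")   # ~0.00519

Under these assumptions the simple mirror is several times less likely to lose data than the parity array, before any of parity RAID’s operational fragility is even considered.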

There is no simple answer for finding simplicity.  Knowing that complexity is bad by its nature but unavoidable at times teaches us to be mindful; it does not, however, teach us when to suspect over-complexity.  We must be vigilant, always seeking to determine whether a more elegant answer exists, and not accept complexity as the correct answer simply because it is complex.  We need to question proposed solutions and question ourselves.  “Is this solution really as simple as it should be?”  “Is this complexity necessary?”  “Does this require the complexity that I had assumed?”

In most system design recommendations that I give, the first technical determination step that I take, after inquiring as to the business need to be solved, is to question complexity.  If complexity cannot be defended, it is probably unnecessary and actively defeating the purpose for which it was chosen.

“Is it really necessary to split those drives into many separate arrays?  If so, what is the technical justification for doing so?”

“Is shared storage really necessary for the task that you are proposing it for?”

“Does the business really justify the use of distributed high availability technologies?”

“Why are we replacing a simple system that was adequate yesterday with a dramatically more complex system tomorrow?  What has changed so that a major improvement, while remaining simple, is not more than enough, and orders of magnitude more complexity and spending that weren’t justified previously now are?”

These are just common examples; complexity exists in every aspect of our industry.  Look for simplicity.  Strive for elegance.  Do not accept complexity without rigorously vetting it.  Put it through the proverbial wringer.  Do not allow complexity to creep in where it is not warranted.  Do not err on the side of complexity – when in doubt, fail simply.  Oversimplifying a solution typically results in a minor failure, while making it overly complex allows for a far greater degree of failure.  The safer bet is with the simpler solution.  And if a simple solution is chosen and proven inadequate, it is far easier to add complexity than it is to remove it.

Virtualization as a Standard Pattern

Virtualization as an enterprise concept is almost as old as business computing itself.  The value of abstracting computing from the bare hardware was recognized very early on, and almost as soon as computers had the power to manage the abstraction process, work began on implementing virtualization much as we know it today.

The earliest commonly accepted work on virtualization began in 1964 with the IBM CP-40 operating system, developed for the IBM System/360 mainframe.  This was the first real foray into commercial virtualization, and the code and design from this early virtualization platform have descended into today’s IBM VM platform, which has been used continuously since 1972 as a virtualization layer for the IBM mainframe families.  Since IBM first introduced virtualization we have seen enterprise systems adopt this pattern of hardware abstraction almost universally.  Many large scale computing systems, minicomputers and mainframes, moved to virtualization during the 1970s, with the bulk of all remaining enterprise systems doing so, as the power and technology became available to them, during the 1980s and 1990s.

The only notable holdout to virtualization in enterprise computing was the Intel IA32 (aka x86) platform, which lacked the hardware capabilities necessary to implement effective virtualization until the advent of the extended AMD64 64-bit platform, and even then only with specific new technology.  Once this was introduced, the same high performance, highly secure virtualization was available across the board on all major platforms for business computing.

Because low cost x86 platforms lacked meaningful virtualization (outside of generally low performance software virtualization and niche high performance paravirtualization platforms) until the mid-2000s, virtualization was almost completely off the table for the vast majority of small and medium businesses.  This has led many dedicated to the SMB space to be unaware that virtualization is a well established, mature technology set that long ago established itself as the de facto pattern for business server computing.  The use of hardware abstraction is nearly ubiquitous in enterprise computing, with many of the largest, most stable platforms having no option, at least no officially supported option, for running systems “bare metal.”

There are specific niches where hardware abstraction through virtualization is not advised, but these are extremely rare, especially in the SMB market.  Typical systems needing not to be virtualized include latency sensitive systems (such as low latency trading platforms) and multi-server combined workloads such as HPC compute clusters, where the primary goal is performance above stability and utility.  Neither of these is common to the SMB.

Virtualization offers many advantages.  Often, in the SMB, where virtualization is less expected, it is assumed that virtualization’s goal is consolidation, where massive scale cost savings can occur, or providing new ways to achieve high availability.  Both of these are great options that can help specific organizations and situations, but neither is the underlying justification for virtualization.  We can consolidate and achieve HA through other means, if necessary.  Virtualization simply provides us with a great array of options in those specific areas.

Many of the uses of virtualization are artifacts of the ecosystem, such as a potential reduction in licensing costs.  These types of advantages are not intrinsic to virtualization, but they do exist and cannot be overlooked in a real world evaluation.  Not all benefits apply to all hypervisors or virtualization platforms, but nearly all apply across the board.  Hardware abstraction is a concept, not an implementation, so how it is leveraged will vary.  Conceptually, abstracting away hardware, whether at the storage layer, at the computing layer or elsewhere, is very important as it eases management, improves reliability and speeds development.

Here are some of the benefits of virtualization.  It is important to note that, outside of specific items such as consolidation and high availability, nearly all of these benefits apply not only to virtualizing many workloads on a single hardware node but even to a single workload on that node.

  1. Reduced human effort and impact associated with hardware changes, breaks, modifications, expansion, etc.
  2. Storage encapsulation for simplified backup / restore process, even with disparate hardware targets
  3. Snapshotting of entire system for change management protection (see the sketch after this list)
  4. Ease of archiving upon retirement or decommission
  5. Better monitoring capabilities, adding out of band management even on hardware platforms that don’t offer this natively
  6. Hardware agnosticism provides for no vendor lock-in as the operating systems believe the hypervisor is the hardware rather than the hardware itself
  7. Easy workload segmentation
  8. Easy consolidation while maintaining workload segmentation
  9. Greatly improved resource utilization
  10. Hardware abstraction creates a significant, and often realized, opportunity for improved system performance and stability while lowering the demands on operating system and driver writers for client operating systems
  11. Simplified deployment of new and varied workloads
  12. Simple transition from single platform to multi-platform hosting environments which then allow for the addition of options such as cloud deployments or high availability platform systems
  13. Redeployment of workloads to allow for easy physical scaling
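
To make the snapshotting benefit above (item 3) concrete, here is a minimal sketch, assuming a KVM host with the libvirt Python bindings installed and qcow2-backed guest storage; the guest name “example-vm” and the snapshot details are hypothetical.  It captures the state of a virtual machine immediately before a planned change so that the whole system can be rolled back if the change goes badly, something far more cumbersome to arrange on a bare metal install.

  # Minimal sketch: snapshot a libvirt/KVM guest before a planned change.
  # Assumes the libvirt Python bindings and qcow2 disks; "example-vm" is a
  # hypothetical guest name.
  import libvirt

  SNAPSHOT_XML = """
  <domainsnapshot>
    <name>pre-change</name>
    <description>Taken before applying a planned configuration change</description>
  </domainsnapshot>
  """

  conn = libvirt.open("qemu:///system")     # connect to the local hypervisor
  dom = conn.lookupByName("example-vm")     # hypothetical guest name
  snap = dom.snapshotCreateXML(SNAPSHOT_XML, 0)
  print("Created snapshot:", snap.getName())
  # If the change fails, roll back with: dom.revertToSnapshot(snap, 0)
  conn.close()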

In today’s computing environments, server-side workloads should be universally virtualized for these reasons.  The benefits of virtualization are extreme while the downsides are few and trivial.  The two common scenarios where virtualization still needs to be avoided are situations where specialty hardware must be used directly on the server (very rare today, but it does still come up from time to time) and extremely low latency systems where sub-millisecond latencies are critical.  The second of these is common only in extremely niche business situations such as low latency investment trading systems.  Systems with these requirements will also have extreme networking and geolocational requirements, such as low-latency InfiniBand with a fiber run to the trading floor of less than five miles.

Some people will point out that high performance computing clusters do not use virtualization, but this is a grey area as any form of clustering is, in fact, a form of virtualization.  It is simply that this is a “super-system” level of virtualization instead of being strictly at the system level.

It is safe to assume that in any scenario in which you should not use virtualization, you will know it beyond a shadow of a doubt and will be able to empirically demonstrate why virtualization is either physically or practically impossible.  For all other cases, virtualize.  Virtualize if you have only one physical server, one physical workload and just one user.  Virtualize if you are a Fortune 100 with the most demanding workloads.  And virtualize if you are anyone in between.  Size is not a factor in virtualization; we virtualize out of a desire to have a more effective and stable computing environment both today and into the future.


Nearly As Good Is Not Better

As IT professionals we often have to evaluate several different approaches, products or techniques.  The IT field is vast and we are faced with so many options that it can become difficult to filter out the noise and find just the options that truly make sense in our environment.

One thing that I have found repeatedly creating a stumbling block for IT professionals is that they come from a stance of traditional, legacy knowledge (a natural situation, since all of our knowledge has to have come from sometime in the past) and attempt to justify new techniques or technologies in relation to the existing, established assumptions of “normal.”  This is to be expected.

IT is a field of change, however, and it is critical that IT professionals accept change as normal and not react to it as an undermining of traditional values.  It is not uncommon for people to feel that decisions that they have made in the past will be judged by the standards of today.  They feel that because there is a better option now, their old decision is somehow invalid or inadequate.  This is not the case.  This is exacerbated in IT because decisions made in the past that have been dramatically overturned in favour of new knowledge might be only a few years old, with the people who made them still doing the same job.  Change in IT is much more rapid than in most fields and we can often feel betrayed by good decisions that we made not long ago.

This reaction puts us into a natural, defensive position that we must rationally overcome in order to make objective decisions about our systems.

One trick that I have found is to reverse questions involving assumed norms.  That is to say, if you believe that you must justify a new technique against an old one and find that, while convincing, the case does not totally sway you, perhaps you should try the opposite – justify the old, accepted approach versus the new one.  I will give some examples that I see in the real world regularly.

Example one, in which we consider virtualization where none existed before.  Typically someone looking to do this will look for virtualization to provide some benefit that they consider to be significant.  Generally this results in someone feeling either that virtualization doesn’t offer adequate benefits or that they must incorporate other changes, ending up dramatically overboard for what should have been a smaller decision.  Instead, attempt to justify not using virtualization.  Treat virtualization as the accepted pattern (actually, it long has been, just not in the SMB space) and try to justify going with physical servers instead.

What we find is that, normally, our minds accepted that the physical machine only had to be “nearly as good” or “acceptable” in order to be chosen, even though virtualization was, in nearly all cases, “better”.  Why would we decide to use something that is not “better”?  Because we approached one as change and one as not change.  Our minds play tricks on us.

Example two, in which traditional server storage is two arrays, with the operating system on one RAID 1 array and the data partition on a second RAID 5 array, versus the new standard of a single RAID 10 array holding both operating system and data.  If we argue from the aspect of the traditional approach we can make decent arguments, at times, that we can make the old system adequate for our needs.  Adequate seems good enough to not change our approach.  But argue from the other direction.  If we assume RAID 10 is the established, accepted norm (again, it is today) then it is clear that it comes out as dramatically superior in nearly all scenarios.  If we try to justify why we would choose a split array with RAID 1 and RAID 5, we quickly see that it never provides a compelling value.  So sticking with RAID 10 is a clear win.
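
One way to see this is to enumerate, for a hypothetical six-drive server, how often a second drive failure destroys an array under each layout.  This is a toy sketch of my own, assuming uniformly random failures and ignoring rebuild time and URE exposure (both of which further favour RAID 10), but it illustrates why the split design struggles to justify itself.

  # Toy enumeration for six drives: what fraction of two-drive failure
  # combinations destroys an array under each layout?  Assumes uniformly
  # random, independent failures.
  from itertools import combinations

  def fatal_fraction(is_fatal):
      pairs = list(combinations(range(6), 2))
      return sum(1 for pair in pairs if is_fatal(set(pair))) / len(pairs)

  # Legacy split: drives 0-1 are a RAID 1 pair, drives 2-5 are a RAID 5 set.
  def split_fatal(failed):
      return failed == {0, 1} or len(failed & {2, 3, 4, 5}) >= 2

  # Single RAID 10: mirror pairs (0,1), (2,3), (4,5).
  def raid10_fatal(failed):
      return failed in ({0, 1}, {2, 3}, {4, 5})

  print(f"RAID 1 + RAID 5 split: {fatal_fraction(split_fatal):.0%} of double failures fatal")
  print(f"Single RAID 10:        {fatal_fraction(raid10_fatal):.0%} of double failures fatal")

Under these assumptions roughly half of all double failures destroy an array in the split layout, against one in five for the single RAID 10, and the RAID 10 rebuild is a simple mirror copy rather than a full parity reconstruction.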

This reversal of thinking can provide a dramatic, eye-opening effect on decision making.  Making assumptions about starting points and forcing new ideas to significantly “unseat” incumbent thinking is dangerous.  It keeps us from moving forward.  In reality, most approaches should start from equal ground and the “best” option should win.  It is far too often that a solution is considered “adequate” when it is not the best.  Yes, a solution may very well work in a given situation, but why would we ever intentionally choose a less than superior solution (assuming that cost is factored into the definition of best)?

As IT professionals attempting to solve problems for a business, we should be striving to recommend and implement the best possible solutions, yet we often make do with less than ideal ones simply because we forget to consider the reasonable options equally against one another.  And it is important to remember that cost is included in deciding when a solution is best or adequate.  The best solution is not a perfect solution but the best for the company, for the money.  But very often solutions are chosen that cost more and do less simply because they are considered the de facto starting point and the alternatives are expected to dramatically outperform them rather than simply being “better”.

Taking a fresh look at decision making can help us become better professionals.

Patching in a Small Environment

In enterprise IT shops, system patching is a complicated process involving large numbers of test systems which mirror production systems, so that each new patch arriving from operating system and software vendors can be tested in a real world environment to see how it interacts with the hardware and software combinations in use in the organization.  In an ideal world, every shop would have a managed patching process that immediately responded to newly published patches, testing instantly and applying as soon as a patch was deemed safe and applicable.  But the world is not an ideal one, and in real life we have to make do with limited resources: physical, temporal and financial.

Patches are generally released for a few key reasons: security, stability, performance and, occasionally, to supply new features.  Except for the addition of new features, which is normally handled through a different release process, patches represent a fix to a known issue.  This is not an “if it is not broken, don’t fix it” scenario but an “it is broken and has not completely failed yet” scenario, which demands attention – the sooner the better.  Taking a “sit back and wait” approach to patches is unwise, as the existence of a new patch means that malicious hackers have a “fix” to analyze, and even if an exploit did not exist previously, it will very shortly.  The release of the patch itself can be the trigger for the immediate need for said patch.

This patch ecosystem creates a need for a “patch quickly” mentality.  Patches should never sit; they need to be applied promptly, often as soon as they are released and tested.  Waiting to patch can mean running with critical security bugs or keeping systems unnecessarily unreliable.

Small IT shops rarely, if ever, have test environments, whether for servers, networking equipment or even desktops.  This is not ideal but, realistically, even if those environments were available, few small shops have the excess human IT resources to run those tests in a timely manner.

This is not as bleak as it sounds.  The testing done for most patches is redundant with the testing already done by the vendor.  Vendors cannot possibly test every hardware and software interaction that could ever happen with their products, but they generally test wide ranges of permutations and look at areas where interactions are most likely.  It is rare for a major vendor to cripple their own software with bad patches.  Yes, it does happen, and having good backups and rollback plans is important, but in day to day operations patching is a relatively safe process that is far more important to do promptly than it is to wait for opportunities that may or may not occur.

Like any system change, patches are best applied in frequent, small doses.  If patches are applied promptly then normally only one or a few patches must be applied at the same time.  For operating systems you may still have to deal with multiple patches at once, especially if patching only weekly, but seldom must you patch dozens or hundreds of files at one time when working in this manner.  Done like this, it is vastly easier to evaluate patches for adverse effects and to roll back if a patch process goes badly.

The worst scenario for a small business lacking a proper patch testing workflow is to wait on patches.  Waiting means that systems go without needed care for long periods of time, and when patches are finally applied it is often in large, bulk patch processes.  Applying many patches at once increases the chances that something will go wrong and, when it does, identifying which patch(es) are at fault and producing a path to remediation can be much more difficult.

Delayed patching is a process that provides little or no advantage to either IT or the business but carries substantial risk to security, stability and performance.  The best practice for patching in a small environment is either to allow systems to patch themselves as quickly as possible or to schedule a regular patching process, perhaps weekly, during a time when the business is most prepared for patching to fail and patch remediation to be handled.  Whether you choose to patch automatically or simply to do so regularly through a manual process, patch often and promptly for best results.
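
For shops taking the scheduled route, a small wrapper script run weekly from cron can be enough.  The sketch below is illustrative only and assumes a Debian or Ubuntu style system with apt-get, run with root privileges; the log path is arbitrary.  It refreshes the package metadata, applies all pending updates non-interactively and records the output so that a failed patch run is easy to spot and investigate.

  # Minimal sketch of a weekly patch job for a Debian/Ubuntu style system.
  # Assumes root privileges (e.g. run from root's crontab); log path is arbitrary.
  import datetime
  import subprocess

  LOG_FILE = "/var/log/weekly-patch.log"

  def run(cmd, log):
      log.write(f"\n$ {' '.join(cmd)}\n")
      log.flush()  # keep our notes in order with the command output
      return subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT).returncode

  with open(LOG_FILE, "a") as log:
      log.write(f"\n==== Patch run {datetime.datetime.now().isoformat()} ====\n")
      # Refresh package metadata, then apply all pending updates non-interactively.
      if run(["apt-get", "update"], log) == 0:
          run(["apt-get", "-y", "upgrade"], log)
      else:
          log.write("apt-get update failed; skipping upgrade this week\n")

A single cron entry, for example early on a Sunday morning, then keeps the cadence regular and places any remediation work at the time the business can best absorb it.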