How I Learned To Stop Worrying and Love BYOD

Bring Your Own Device (or BYOD) is one of those hot topics this year that seems to have every IT department worried.  What does BYOD mean for the future of IT?  People have already begun to call it the consumerization of IT, and IT professionals everywhere are terrified that the traditional role of IT is ending and that BYOD is shifting all control into the hands of the end users.

Is this really the case?  In a world where security and control of data are becoming increasingly regulated and scrutinized, and as the public takes a growing interest in how companies secure their data, it is safe to assume that the IT field is not moving towards a loss of control.  And, in my experience, BYOD means exactly the opposite.

There is no ignoring the fact that BYOD signals many changes and demands that IT departments rethink traditional approaches.  But is that such a bad thing?  The old model was one of a network castle.  The firewalls were the moat and all of our devices, from servers to desktops, sat huddled together inside the castle courtyard talking freely one to another.  One of the greatest fears was that one of those desktops would become “compromised” and unleash a fifth column attack from within the castle, where there were practically no defenses to speak of.

The old model created a quagmire of issues and required complicated workarounds in order to accommodate modern changes in computing environments.  When businesses existed in only a single location, or when they regularly purchased leased lines connecting all of their offices, the model worked rather well.  Once workers began to need to work remotely, whether at home or on the road, the model became difficult to support and the concept of the VPN was introduced in order to extend the castle wherever it was needed.  VPNs changed how companies could physically exist but did so without addressing some fundamental issues with the architecture of a traditional IT infrastructure.

The solution to this infrastructure problem has been coming for a long time now: the movement towards web applications, “cloud services”, hosted applications, Software as a Service and the other new ways in which people think about applications.  Slowly we started exposing applications to the “outside”.  We started simply with email, then basic web applications, and slowly more and more components of business infrastructure started to be exposed externally without requiring the use of a VPN.

The advent of smartphones accelerated this process as certain applications, email and calendaring being the biggest drivers, absolutely demanded extension to these mobile devices.  For the most part, IT departments did not even see a significant shift occurring.  Instead it was little pinholes, small changes, as more and more of the tools used in the business became available without connecting to the VPN, without sitting inside the office.

Today a new business might legitimately ask its CIO: “Why do we even need a LAN?  What benefit do we get from everyone sitting on a single, physical network?”  There are still plenty of good reasons why a LAN might be needed.  But it is a valuable question to ask and the answer might surprise you.  I was asked this myself and the answer was that we didn’t need a LAN: every app was available through its own secure channel, without a need for VPNs or a local network.

Where LANs continue to shine brightest is in desktop management.  If you need to lock down and control the actual end user equipment then the LAN is, for now, the best tool for the job.  This too will change in time.  But this is where BYOD becomes the secret weapon of the IT department.

BYOD, while creating its own raft of obvious complications, especially around the end user support expected after decades of total IT control of end user devices, offers the opportunity to eliminate the LAN, to pull back the walls of the castle so that they surround only the core infrastructure where no end user ever need venture, and to drop the support of end user devices solidly into the lap of the end users themselves.  With modern LAN-less application publishing strategies (this includes web apps, remote desktop technologies and others) end user devices are effectively thin clients, often providing no more processing capacity than is necessary to display the application.  They are a window into the infrastructure, not a gateway.  They look at the servers; they aren’t sitting inside the castle with them.

Thinking of end user devices as view panels or windows rather than computing devices is the key to making BYOD an advantage to the IT department rather than its bane.  Of course, this plays into the usual ebb and flow of fat and thin clients over the history of computing.  The tide will change again, but for now, this is our current opportunity.  End users want the illusion of control and the reality of picking the device that is best suited to their needs – which are almost strictly physical needs, whether of fashion or function.  IT departments want the reality of control and should be happy to allow end users to pick their own devices.  Everyone can win.

The key, of course, is eliminating legacy applications or finding workarounds for them.  Technological approaches such as VDI, terminal servers or even racks of datacenter-housed desktops provide fallback strategies that can be accessed from nearly any device, while “view” layer technologies like HTML 5 provide elegant, modern options for exposing applications, shifting display-related processing to the end user device and standardizing on a protocol that is likely to be ubiquitous in the very near future.  The technologies are there today.

With the corporate network shrunk down to only the infrastructure servers and associated networking gear, IT departments suddenly have the potential for greater control and more flexibility while giving up little.  End users are happy, IT is happy.  BYOD is an opportunity for IT to exert greater control and tighter security, all while giving the impression of being approachable and flexible.

The Windows Desktop Cycle

Microsoft has been bringing out desktop operating environments for decades now, and those of us who have been in the industry long enough are aware of a pattern, perhaps an unofficial one, that Microsoft uses in bringing new technologies to market – a pattern that those without enough exposure to its releases over the years may have missed.  The release cycle for new Windows products is a very slow one, with many years between releases, which makes the pattern very difficult to see if you have not been directly exposed to it for decades.  Researching the products in retrospect, especially with the public’s reaction to them in juxtaposition, is very difficult.

What is important is that Windows comes out in a flip-flop fashion, with every other release being a “long term support, heavily stable” release and the alternate releases being “new technology preview” releases.  This is not to say that any particular release is good or bad, but that one release is based around introducing a new system to the public and the next is a more polished release, with fewer changes than its predecessor, focused on long term adoption.

The goal of this release pattern should be obvious.  Whenever major changes come to such a widely used platform the average user, even the average IT professional, tends to resist the change and be unhappy with it.  But after a while the new look, feel and features start to feel natural.  Then a slightly updated, slightly more polished version of the same features can be released and the general public feels like Microsoft has “learned its lesson”, appreciating the same features that they disliked a few years before.  This approach works wonders in Microsoft’s mixed consumer and business world: home users adopt the latest and greatest at home through OEM licenses bundled with the computers that they buy, and businesses can, and usually do, wait for the “every other” cycle, rolling out only the more mature of the two releases to users who have already lived through the pain of the changes at home.

Outside of the Windows world you can witness the same sort of adoption with the much maligned MS Office 2007 and MS Office 2010.  The former was universally hated because of the then new Ribbon interface.  The latter was much loved, mostly because people had already adapted to the Ribbon interface and now appreciated it, but also because Microsoft had had time to learn from the 2007 release and improve the Ribbon by 2010.

This pattern started long ago and can be seen happening, to some degree, even in the DOS-based Windows era (the Windows family starting from the very beginning and running up through Windows ME.)  Of the more recent family members, Windows 3 was the preview, Windows 3.1 the long term release, Windows 95 the preview, Windows 98 the long term release and Windows ME the preview.  Each of the previews had comparatively poor reception due to the introduction of new ideas and interfaces.  Each of the long term releases outlived its counterpart preview release on the market and was widely loved.  It is a successful pattern.

In the modern era of Windows NT, starting with Windows NT 3.1 in 1993, the overarching pattern continued, with NT 3.1 itself being the “preview” member of the new Windows NT family.  Just one year later Windows NT 3.5 was released and was popular for its time.  Windows NT 3.51 then came out and provided the first support for interoperability with Windows 95 from the DOS family, which was released just a few months after NT 3.51 itself.  Then the stable, long term Windows NT 4 was released in 1996 and dominated the Windows world for the next half decade.  Windows NT 4 leveraged both the cycle from the Windows NT family and the cycle from the DOS/Windows family to great effect.

When Windows 2000 was released in 2000 it was a dramatic shift for the Windows NT family and was poorly received.  The changes, both to the desktop and to the coinciding Server product with its introduction of Active Directory, were massive and disruptive.  Windows 2000 was the quintessential preview release.  It took just one year before Windows XP replaced it on the desktop.  Windows XP, per its place in the cycle, turned out to be the quintessential long term release, making even Windows NT 4 look short lived.  Windows XP expanded very little on Windows 2000 Workstation but brought additional polish and no significant changes, making it exactly what businesses, and most home users, were looking for as their main operating system for a very long time.

When Microsoft was ready to disrupt the desktop again with new changes, like the additional security of UAC, it did so in Windows Vista.  Vista, like Windows 2000, was not well received and was possibly the most hated Windows release of all time.  But Vista did its job perfectly.  Shortly after the release of Windows Vista came the nominally different Windows 7, with some minor UAC changes and some improved polish, and it was very well received.  Vista paved the way so that Windows 7 could be loved and used for many years.

Now we stand on the verge of the Windows 8 release.  Like Vista, 2000, Office 2007 and Windows 95, Windows 8 represents a dramatic departure for the platform and already, before even being released, it has generated massive amounts of bad press and animosity.  If we study the history of the platform, though, we would have expected this of the Windows 8 release regardless of what changes were going to be announced.  Windows 8 is the “preview” release.  We know that a new operating system, perhaps called Windows 9, is at most two years away and will bring a slightly tweaked, more polished version of Windows 8 that end users will love, and the issues with Windows 8, like those of its predecessors, will soon be forgotten.  The cycle is well established and very successful.  There is very little chance that it will be changing anytime soon.

Hot Spare or a Hot Mess

A common approach to adding a layer of safety to RAID is to have spare drives available so that the replacement time for a failed drive is minimized.  The most extreme form of this is referred to as having a “hot spare” – a spare drive actually sitting in the array but unused until the array detects a drive failure, at which time the system automatically disables the failed drive and enables the hot spare, just as if a human had popped the one drive out of the array and popped in the other, allowing a resilver operation (a rebuilding of the array) to begin as soon as possible.  This can bring the time to swap in a new drive down from hours or days to seconds and, in theory, can provide an extreme increase in safety.

First, I’d like to address what I personally feel is a mistake in the naming conventions.  What we refer to as a hot spare should, I believe, actually be called a warm spare because it is sitting there ready to go but does not contain the necessary data to be used immediately.  A spare drive stored outside of the chassis, one that requires a human to step in and swap the drives manually, would be a cold spare.  To truly be a hot spare a drive should be full of data and, therefore, would be a participatory member of the RAID array in some capacity.  Red Hat has a good article on how this terminology applies to disaster recovery sites, for reference.  The distinction is important because what we call a hot spare does not already contain data and does not immediately step in to replace the failed drive; instead it steps in to immediately begin the process of restoring the lost drive – a critical difference.

In order to keep concepts clear, from here on out I will refer to what vendors call hot spares as “warm spares.”  This will make sense in short order.

There are two main concerns with warm spares.  The first is the ineffectual nature of the warm spare in most use cases and the second is the “automated array destruction” risk.

Most people approach the warm spare concept as a means of mitigating the high risk of secondary drive failure in a parity RAID 5 array.  RAID 5 arrays protect only against the failure of a single disk within the array.  Once a single disk has failed the array is left with no form of parity and any additional drive failure results in the total loss of the array.  RAID 5 is chosen because it is very low cost for the given capacity, sacrificing reliability in order to achieve this cost effectiveness.  Because RAID 5 is therefore risky in comparison to other RAID options, such as RAID 6 or RAID 10, it is common to implement a warm spare in order to minimize the time that the array is left in a degraded state, allowing the array to begin resilvering itself as quickly as possible.

The more relevant takeaway here is that warm spares are generally used as a buffer against less reliable RAID array types chosen as a cost saving measure.  Warm spares are dramatically more common in RAID 5 arrays, followed by RAID 6 arrays, both of which are chosen over RAID 10 due to cost for capacity, not for reliability or performance.  There is one case where the warm spare idea truly does make sense for added reliability, and that is RAID 10 with a warm spare, but we will come to that.  Outside of that scenario I feel that warm spares make little sense in the real world.

We will start by examining RAID 1 with a warm spare.  RAID 1 consists of two drives, or more, in a mirror.  Adding a warm spare is nice in that if one of the mirrored pair dies the warm spare will immediately begin mirroring the remaining drive and you will be protected again in short order.  That is wonderful – except for one minor flaw: instead of serving as a warm spare, that same drive could have been added to the RAID 1 array all along as a tertiary mirror.  In this tertiary mirror capacity the drive would have added to the overall performance of the array, giving a nearly fifty percent read performance boost with write performance staying level, and would have provided instant protection in case of a drive failure rather than “as soon as it remirrors” protection.  Basically it would have been a true “hot spare” rather than a warm spare.  So without spending a penny more the system would have had better drive array performance and better reliability simply by having the extra drive in a hot “in the array” capacity rather than sitting warm and idle waiting for disaster to strike.
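
To see why the tertiary mirror wins on performance, here is a minimal sketch in Python, assuming reads are distributed evenly across mirror members and that every drive delivers the same throughput (the 180 MB/s per-drive figure is hypothetical, purely for illustration):

# A minimal sketch: reads spread evenly across identical mirror members.
def mirror_read_throughput(members: int, drive_mbps: float = 180.0) -> float:
    """Aggregate read throughput of a RAID 1 mirror with `members` drives.

    Every member holds a full copy of the data, so reads can be spread
    across all of them; writes stay at roughly single-drive speed.
    """
    return members * drive_mbps

two_way = mirror_read_throughput(2)    # mirrored pair, warm spare idle: 360 MB/s
three_way = mirror_read_throughput(3)  # same drives as a triple mirror: 540 MB/s
print(f"Read boost from the third mirror: {three_way / two_way - 1:.0%}")  # 50%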

With RAID 5 we see an even more dramatic warning against the warm spare concept, and here it is more common than anywhere else.  RAID 5 is single parity RAID with the ability to rebuild, using the parity, any single drive in the array that fails.  This is where the real problems begin.  Unlike in RAID 1, where a remirroring operation might be quite quick, a RAID 5 resilver (rebuild) has the potential to take quite a long time.  The warm spare will not assist in protecting the array until this resilver process completes successfully – commonly many hours, easily days, and possibly weeks or months depending on the size of the array and how busy it is.  If we took that same warm spare drive and instead tasked it with being a member of the array carrying an additional parity stripe we would have RAID 6.  The same set of drives that we have for RAID 5 plus a warm spare would create a RAID 6 array of the exact same capacity.  Again, like the RAID 1 example above, this is much like having a hot spare: the drive participates in the array with live data rather than sitting idly by, waiting for another drive to fail before kicking in to begin the process of taking over.  In this capacity the array degrades to a RAID 5 equivalent in case of a failure but without any rebuild time, so the additional drive is useful immediately rather than only after a possibly very lengthy resilver process.  So for the same money and the same capacity, the choice of setting up the drives as RAID 6 rather than RAID 5 plus a warm spare is a complete win.
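
The capacity math is easy to verify.  Below is a minimal sketch under idealized assumptions (raw capacities, no formatting overhead, hypothetical 4 TB drives); it also checks the five disk RAID 6 plus warm spare versus six disk RAID 10 comparison that comes up in the next paragraph:

# A minimal sketch of idealized usable capacity per layout.
DRIVE_TB = 4

def usable_tb(layout: str, drives: int, spares: int = 0) -> int:
    """RAID 5 loses one drive to parity, RAID 6 loses two,
    RAID 10 loses half; spare drives hold no data."""
    data = drives - spares
    if layout == "raid5":
        return (data - 1) * DRIVE_TB
    if layout == "raid6":
        return (data - 2) * DRIVE_TB
    if layout == "raid10":
        return (data // 2) * DRIVE_TB
    raise ValueError(f"unknown layout: {layout}")

# Six drives either way: identical usable capacity, but in RAID 6 both
# "extra" drives carry live parity instead of one sitting idle.
print(usable_tb("raid5", drives=6, spares=1))  # 16 TB
print(usable_tb("raid6", drives=6))            # 16 TB

# Five disk RAID 6 plus a warm spare versus six disk RAID 10:
print(usable_tb("raid6", drives=6, spares=1))  # 12 TB
print(usable_tb("raid10", drives=6))           # 12 TB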

We can continue this example with RAID 6 plus a warm spare.  This one is a little harder to pin down because in most RAID systems, except for the somewhat uncommon RAIDZ3 from ZFS, there is no triple parity system available one step above RAID 6 (imagine if there were a RAID 7, for example.)  If there were, the exact argument made for RAID 5 plus warm spare would apply to RAID 6 plus warm spare.  In the majority of cases RAID 6 with a warm spare must justify itself against a RAID 10 array.  RAID 10 is more performant and far more reliable than a RAID 6 array, but RAID 6 is generally chosen to save money in comparison to RAID 10.  To offset RAID 6’s fragility, warm spares are sometimes employed.  In some cases, such as a small five disk RAID 6 array with a warm spare, this is dollar for dollar equivalent to a six disk RAID 10 array without a warm spare.  In larger arrays the cost benefit of RAID 6 does become apparent, but the larger the cost savings the larger the risk differential, as parity RAID systems increase risk with array size much more quickly than do mirror based RAID systems like RAID 10.  Any money saved today is saved at the risk of outage or data loss tomorrow.

Where a warm spare comes into play effectively is in a RAID 10 array.  Here a warm spare rebuild is a mirror rebuild, like in RAID 1, which does not carry parity risks, and there is no logical extension RAID system above RAID 10 from which we are trying to save money by going with a more fragile system.  Here adding a warm spare may make sense for critical arrays because there is no more cost effective way to gain the same additional reliability.  However, RAID 10 is so reliable without a warm spare that any shop contemplating RAID 5 or RAID 6 with a warm spare would logically stop at simple RAID 10, having already surpassed the reliability it was considering settling for previously.  So only shops not considering those more fragile systems, and looking for the most robust possible option, would logically look to RAID 10 plus a warm spare as their solution.

Just for technical accuracy: RAID 10 can be expanded for better read performance and a dramatic improvement in reliability (but with a fifty percent cost increase) by moving to three disk RAID 1 mirrors in its RAID 0 stripe rather than the standard two disk RAID 1 mirrors, just as we showed in our RAID 1 example.  This is a level of reliability seldom sought in the real world but it can exist and is an option.  Normally this is curtailed by drive count limitations in physical array chassis, as well as by competing poorly against building a completely separate secondary RAID 10 array in a different chassis and then mirroring the two at a higher level, effectively creating RAID 101 – which is the effective result of common, high end storage array clusters today.

Our second concern is that of “automated array destruction.”  This applies only to the parity RAID scenarios of RAID 5 and RAID 6 (or the rare RAID 2, RAID 3, RAID 4 and RAIDZ3.)  With the warm spare concept, the idea is that when a drive fails the warm spare is automatically and instantly swapped in by the array controller and the process of resilvering the array begins immediately.  If resilvering were a completely reliable process this would obviously be highly welcome.  The reality is, sadly, quite different.

During a resilver a parity RAID array is at risk of Unrecoverable Read Errors (UREs) cropping up.  If a URE occurs in a single parity RAID resilver (that is, RAID 2 through RAID 5) then the resilvering process fails and the array is lost completely.  This is critical to understand because no additional drive has failed.  So if the warm spare had not been present then the resilvering would not have commenced and the data would still be intact and available – just not as quickly as usual, and at the small risk of secondary drive failure.  URE rates are very high with today’s large drives, and with large arrays the risks can become so high as to move from “possible” to “expected” during a standard resilvering operation.
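
A rough worked example shows why.  This minimal sketch treats every bit read as an independent trial; the 1-in-10^14 (consumer SATA) and 1-in-10^15 (enterprise SAS) URE rates are typical datasheet figures, and the array size is hypothetical:

# A minimal sketch of the chance of hitting at least one URE while reading
# an entire degraded array during a resilver.  Treats each bit read as an
# independent trial; real failure modes are messier.
import math

def p_ure(read_tb: float, ure_rate_bits: float) -> float:
    """Probability of at least one URE while reading `read_tb` terabytes."""
    bits = read_tb * 8e12  # 1 TB = 8 * 10^12 bits
    return 1 - math.exp(-bits / ure_rate_bits)

# A five drive RAID 5 of 3 TB drives must read all 12 TB of the survivors
# to resilver; one URE anywhere in that read kills the whole array.
print(f"Consumer SATA (1 in 10^14):  {p_ure(12, 1e14):.0%}")  # ~62%
print(f"Enterprise SAS (1 in 10^15): {p_ure(12, 1e15):.0%}")  # ~9%

On consumer drives, losing the array during the resilver is closer to a coin flip than to a freak accident.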

So in many cases the warm spare itself might actually be the trigger for the loss of the data rather than its savior as expected.  An array that would have survived might be destroyed by the resilvering process before the human who manages it is even alerted that the first drive has failed.  Had a human been involved they could have, at the very least, taken the step of making a fresh backup of the array before kicking off the resilver, knowing that the latest copy of the data would be available in case the resilver process was unsuccessful.  It would also allow the human to schedule when the resilver should begin, possibly waiting until business hours are over or the weekend has begun, when the array is less likely to experience heavy load.

Dual and triple parity RAID (RAID 6 and RAIDZ3 respectively) share the URE risks as well, as they too are based on parity.  They mitigate this risk through the additional levels of parity and do so successfully for the most part.  The risk still exists, especially in very large RAID 6 arrays, but for the next several years the risk remains generally quite low for the majority of storage arrays, until far larger spindle-based storage media is available on the market.

The biggest problem with parity RAID and the URE risk is that the driver towards parity RAID (a willingness to face additional data integrity risks in order to lower cost) is the same driver that introduces heightened URE risk (purchasing lower cost, non-enterprise SATA hard drives.)  Shops choosing parity RAID generally do so with large, low cost SATA drives, bringing two very dangerous factors together in an explosive combination.  Using non-parity RAID 1 or RAID 10 will completely eliminate the issue, and using highly reliable enterprise SAS drives will drastically reduce the risk factor by an order of magnitude (not an expression – it is actually a change of one order of magnitude.)

Additionally, during resilver operations it is possible for performance on parity systems to degrade so drastically as to equate to a long-term outage.  The resilver process, especially on large arrays, can be so intensive that end users cannot differentiate between a completely failed array and a resilvering array.  In fact, resilvering at its extreme can take so long and be so disruptive that the cost to the business can be higher than if the array had simply failed completely and a restore from backup had been done instead.  This resilver issue does not affect RAID 1 and RAID 10 because, again, they are mirrored, not parity, RAID systems: their resilver process is trivial and the performance degradation is minimal and short lived.  At its most extreme a parity resilver could take weeks or months, during which time the systems act as though they are offline – and at any point during this process there is the potential for the URE errors mentioned above to arise, which would end the resilver and force the restore from backup anyway.  (Typical resilvers do not take weeks, but they do take many hours and taking days is not at all uncommon.)
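
The duration is simple to ballpark.  A minimal sketch, assuming the rebuild streams at a hypothetical effective rate (parity rebuilds under production load often crawl far below raw drive speed):

# A minimal sketch of resilver duration; the effective rebuild rates
# below are hypothetical ballpark figures, not vendor specifications.
def resilver_hours(array_tb: float, rebuild_mbps: float) -> float:
    """Hours to process `array_tb` terabytes at `rebuild_mbps` MB/s."""
    return (array_tb * 1e6) / rebuild_mbps / 3600  # 1 TB = 10^6 MB

print(f"{resilver_hours(12, 100):.0f} hours")  # lightly loaded array: ~33 hours
print(f"{resilver_hours(12, 10):.0f} hours")   # busy array: ~333 hours, about two weeks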

Our final overview can be broken down to the following (using the conventional term “hot spare” again): RAID 10 without a “hot spare” is almost always a better choice than RAID 6 with one.  RAID 6 without a “hot spare” is always better than RAID 5 with one.  RAID 1 with an additional mirror member is always better than RAID 1 with a “hot spare.”  So whatever RAID level with a hot spare you decide upon, simply move up one level of RAID reliability and drop the “hot spare” to maximize both performance and reliability at an equal or nearly equal cost.
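
As a minimal sketch, that rule of thumb condenses to a simple lookup (the labels are my own shorthand, not vendor terminology):

# Each "RAID plus hot spare" plan mapped to the alternative that buys equal
# or better capacity, performance and reliability for similar money.
BETTER_THAN_HOT_SPARE = {
    "RAID 1 + hot spare":  "RAID 1 triple mirror (same drives, no spare)",
    "RAID 5 + hot spare":  "RAID 6 (same drives, same capacity, no spare)",
    "RAID 6 + hot spare":  "RAID 10 (similar cost, far more reliable)",
    "RAID 10 + hot spare": "RAID 10 + hot spare (the one sensible case)",
}

for planned, better in BETTER_THAN_HOT_SPARE.items():
    print(f"{planned:22} -> {better}")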

Warm spares, like parity RAID, had their day in the sun.  In fact, it was when parity RAID still made sense for widespread use – when UREs were unlikely and disk costs were high – that warm spare drives made sense as well.  They were well paired; when one made sense the other often did too.  What is often overlooked is that as parity RAID, especially RAID 5, has lost effectiveness it has pulled the warm spare along with it in unexpected ways.

What Windows 8 Means for the Datacenter

Talk around Microsoft’s upcoming desktop operating system, Windows 8, centers almost completely on its dramatically different Metro user interface, borrowed from the Windows Phone which, in turn, borrowed it from the ill-fated Microsoft Zune.  Apparently Microsoft believes that the third time is the charm when it comes to Metro.

To me the compelling story of Windows 8 comes not in the fit and finish but in the under-the-hood rewiring that hints at a promising new future for the platform.  In the past Microsoft has attempted shipping Windows Server on alternative architectures including, for those who remember, the Digital Alpha processor and, more recently, the Intel Itanium.  In those previous cases the focus was on the highest end Microsoft platforms being run on hardware above and beyond what the Windows world normally sees.

Windows 8 promises to tackle the world of multiple architectures in a completely different way – starting with the lowest end operating system and focusing on a platform that is lighter and less powerful than the typical Intel or AMD offering: the low power ARM RISC architecture, with the newly named Windows RT (previously WoA, Windows on ARM.)

The ARM architecture is making headlines as Microsoft attempts to drive deep into handheld and low power devices.  Windows RT could signal a unification between the Windows desktop codebase and the mobile smartphone codebase down the road.  Windows RT could mean strong competition from Microsoft in the handheld tablet market where the iPad dominates so completely today.  Windows RT could be a real competitor to the Android platforms.

Certainly, as it stands today, Windows RT has a lot of potential to be really interesting, if not quite disruptive, upon release.  But I think that the more interesting story lies beneath the surface, in what Windows RT can potentially mean for the datacenter.  What might Microsoft have in store for us in the future?

The datacenter today is moving in many directions.  Virtualization is one driving factor, as are low power server options such as Hewlett-Packard’s Project Moonshot, which is designed to bring ARM-based, low power consumption servers into high end, horizontally scaling datacenter applications.

Today the server operating systems available to run on ARM servers, like those coming soon from HP, are few and far between, and mostly come from the BSD family of operating systems.  The Linux community, for example, is scrambling to assemble even a single, enterprise-supported ARM-based distribution, and it appears that Ubuntu will be the first out of the gate there.  But this paucity of server operating systems on ARM leaves an obvious market gap, and one that Microsoft may well be thinking of filling.

Windows Server on ARM could be a big win for Microsoft in the datacenter: a lower cost offering that broadens their platform portfolio without the need for heavy kernel reworking, since they are already making this effort for the kernel on their handheld devices.  This could be a significant push for Windows into the increasingly popular green datacenter arena where ARM processors are expected to play a central role.

Microsoft has long fought to gain a foothold in the datacenter and today is as comfortable there as anyone, but Windows Servers continue to play in a segregated world where email, authentication and some internal applications are housed on Windows platforms while the majority of heavy processing, web hosting, storage and other roles are almost universally given to UNIX family members.  Windows’ availability on the ARM platform could push it to the forefront of options for horizontally scaling server farms such as web servers, application servers and other tasks which will rise to the top of the ARM computing pool – possibly even green high performance compute grids.

ARM might mean exciting things for the future of the Windows Server platform, probably at least one, if not two, releases out.  And, likewise, Windows might mean something exciting for ARM.