Dreaded Array Confusion

Dreaded Array Confusion, or DAC, is a term given to a group of RAID array failure types that are effectively impossible to diagnose but share a common trait: the array fails completely, with total data loss, even though no drive has failed.  It is hypothesized that three key causes account for the majority of DAC:

Software or Firmware Bugs: While dramatic bugs in RAID behavior are rare today, they are always possible, especially with more complicated array types such as parity RAID where reconstructive calculations must be performed on the array.  A bug in RAID software or firmware (depending on whether we are talking about software or hardware RAID) could manifest itself in any number of ways, including the accidental destruction of the array.  Firmware issues could occur in the drives themselves as well.

Hardware Failure: Failures in hardware such as processors, memory or controllers can have dramatic effects on a RAID array.  Memory errors especially could easily result in total array loss.  This is thought to be the least common cause of DAC.

Drive Shake: In this scenario an individual drive vibrates loose, disconnects from the backplane and later reseats itself, triggering a resilvering event.  If this were to happen to multiple drives during a resilver cycle, or if a URE were encountered during a resilver, we would see total array loss on parity arrays, potentially without any hardware ever actually failing.
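
As a rough back-of-the-envelope sketch of why a URE during a resilver is so dangerous to parity arrays, assume a commonly quoted consumer URE rating of one error per 10^14 bits read and a hypothetical six drive, 4 TB single parity array; the numbers are illustrative only:

    # Rough estimate of hitting an unrecoverable read error (URE) during a
    # single parity (RAID 5) rebuild.  The URE rating, drive size and drive
    # count below are illustrative assumptions, not measurements.
    URE_PER_BIT = 1e-14        # commonly quoted consumer drive rating
    DRIVE_TB = 4               # assumed capacity of each drive, in TB
    SURVIVING_DRIVES = 5       # drives that must be read in full to rebuild

    bits_read = SURVIVING_DRIVES * DRIVE_TB * 1e12 * 8

    # Treat each bit read as an independent trial; on a parity array a URE
    # at this point typically aborts the rebuild and the array is lost.
    p_ure = 1 - (1 - URE_PER_BIT) ** bits_read

    print(f"Bits read during rebuild: {bits_read:.2e}")
    print(f"Chance of at least one URE: {p_ure:.0%}")

With those assumed numbers the chance comes out to roughly eighty percent, which is why the URE scenario is treated as a parity-specific risk; on a mirror the same URE during a rebuild typically costs a sector or a file rather than the whole array.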

Because of the nature of DAC, and because it is not an issue with RAID itself but with the components that support it, we are in a very difficult position when attempting to identify or quantify the risk.  No one knows how likely DAC is to happen, and while we know that DAC is a more significant threat on parity RAID systems, we do not know by how much.  Anecdotal evidence suggests the risk on mirrored RAID is immeasurably low, while on parity RAID it may rise above the background noise in a risk analysis.  Of the failure modes, software bugs and drive shake both present much higher risk to systems running on parity RAID because URE risk only impacts parity arrays and the software necessary for parity is far more complex than the software needed for mirroring.  Parity RAID is simply more fragile and carries more types of risk, exposing it to DAC in more ways than mirrored RAID.

Because DAC encompasses a number of possibilities and because it is effectively impossible to identify after it has occurred, there is little means by which any data can be collected on it.  Since DAC was identified as a risk, many people have come forth, predominantly in the Spiceworks community, to provide anecdotal eyewitness accounts of DAC array failures.  The nature of end user IT is that statistics, especially on nebulous concepts like DAC which are not widely known, are not gathered and cannot be.  DAC arises in shops all over the world where a system administrator returns to the office to find a server with all data gone and no hardware having failed.  The data is already lost.  Diagnostics will not likely be run and logs will not exist; even if the issue could be identified, to whom would it be reported, and even if reported, how would we quantify how often it happens versus how often it does not, or how often it happens but goes unreported?  Sadly, all I know is that having identified and somewhat publicized the risk and its symptoms, many people suddenly came forth acknowledging that they had seen DAC first hand as well and had had no idea what had happened.

If my anecdotal studies are any indicator, it would seem that DAC actually poses a sizable risk to parity arrays, with failures occurring in an appreciable percentage of arrays, though the accuracy and size of the cross section of that data collection were tiny.  It was originally thought that DAC was so rare that, theoretically, you would be unable to find anyone who had ever observed it, but this does not appear to be the case.  I am already aware of many people who have experienced it.

We are forced, by the nature of the industry, to accept DAC as a potential risk, list it as an unknown “minor” risk in risk evaluations and be prepared for it, even though we cannot calculate against it.  But knowing that it is a risk and understanding why it can happen are important when evaluating risk and risk mitigation.

[Anecdotal evidence suggests that DAC is almost always exclusive to hardware RAID implementations of single parity RAID arrays on SCSI controllers.]

The Inverted Pyramid of Doom

The 3-2-1 model of system architecture is extremely common today and almost always exactly the opposite of what a business needs, or even wants, if it takes the time to write down its business goals rather than approaching an architecture from a technology-first perspective.  Designing a solution requires starting with the business requirements; otherwise we do not merely risk the architecture being inappropriate for the business, we should expect it to be.

The name refers to three (this is a soft point, it is often two or more) redundant virtualization host servers connected to two (or potentially more) redundant switches connected to a single storage device, normally a SAN (but DAS or NAS are valid here as well.) It’s an inverted pyramid because the part that matters, the virtualization hosts, depend completely on the network which, in turn, depends completely on the single SAN or alternative storage device. So everything rests on a single point of failure device and all of the protection and redundancy is built more and more on top of that fragile foundation. Unlike a proper pyramid with a wide, stable base and a point on top, this is built with all of the weakness at the bottom. (Often the ‘unicorn farts’ marketing model of “SANs are magic and can’t fail because of dual controllers” comes out here as people try to explain how this isn’t a single point of failure, but it is a single point of failure in every sense.)
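
To see why all of the weakness sits at the bottom, a minimal availability sketch helps; the per-layer availability figures below are invented purely for illustration and are not measurements of any real hardware:

    # Series/parallel availability sketch for the 3-2-1 layout.  Redundant
    # hosts and switches sit behind a single storage device, so the whole
    # stack is only up when every layer is up.  All figures are assumed.
    def layer(per_unit_availability, units):
        # A redundant layer is down only if every unit in it is down.
        return 1 - (1 - per_unit_availability) ** units

    hosts = layer(0.99, 3)      # three redundant virtualization hosts
    switches = layer(0.99, 2)   # two redundant switches
    storage = 0.999             # the single SAN/DAS/NAS device

    whole_stack = hosts * switches * storage   # all layers in series

    print(f"Host layer:    {hosts:.6f}")
    print(f"Switch layer:  {switches:.6f}")
    print(f"Storage layer: {storage:.6f}")
    print(f"Whole stack:   {whole_stack:.6f}")

No matter how much redundancy is piled onto the hosts and switches, the stack as a whole can never be more available than the single storage device underneath it.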

So the solution, often called a 3-2-1 design, can also be called the “Inverted Pyramid of Doom” because it is an upside down pyramid that is too fragile to run and extremely expensive for what is delivered.  Unlike many other fragile models, it is very costly, not very flexible and not as reliable as simply not doing anything beyond having a single quality server.

There are times that a 3-2-1 makes sense, but mostly these are extreme edge cases where a fragile environment is desired and high levels of shared storage with massive processing capabilities are needed – not things you would see in the SMB world and very rarely elsewhere.

The inverted pyramid looks great to people who are not aware of the entire architecture, such as managers and business people.  There are a lot of boxes, a lot of wires, and there are typically software components labeled “HA” which, to the outside observer, make it sound like the entire solution must be highly reliable.  Inverted Pyramids are popular because they offer “HA” from a marketing perspective, making everything sound wonderful, and they keep the overall cost within reason so it seems almost like a miracle: High Availability promises without the traditional costs.  The additional “redundancy” of some of the components is great for marketing.  As reliability is difficult to measure, business people and technical people alike often resort to speaking of redundancy instead of reliability, because redundancy is easy to see.  The inverted pyramid speaks well to these people as it provides redundancy without reliability.  The redundancy is not where it matters most.  It is absolutely critical to remember that redundancy is not a check box, nor is redundancy a goal; it is a tool used to obtain reliability improvements.  Improper redundancy has no value.  What good is a car with a redundant steering wheel in the trunk?  What good is a redundant aircraft if you die when the first one crashes?  What good is a redundant server if your business is down and data lost when the single SAN goes up in smoke?

The inverted pyramid is one of the most obvious and ubiquitous examples of “The Emperor’s New Clothes” used in technology sales.  It meets the needs of resellers and vendors by promoting high margin sales and minimizing low margin ones, and because nearly every vendor promotes it for its financial advantages to the seller, it has become widely accepted as a great solution.  It is just complicated and technical enough that widespread repudiation does not occur, and under the incredible market pressure from the vast array of vendors benefiting from the architecture it has become the status quo; few people stop to question whether the entire architecture has any merit.  That, combined with the fact that all systems today are highly reliable compared to systems of just a decade ago, making failures uncommon enough that few notice they are more common than they should be, and the fact that statistical failure rates are not shared between SMBs, means that the architecture thrives and has become the de facto solution set for most SMBs.

The bottom line is that the Inverted Pyramid approach makes no sense – it is far more unreliable than simpler solutions, even just a single server standing on its own, while costing many times more.  If cost is a key driver, it should be ruled out completely.  If reliability is a key driver, it should be ruled out completely.  Only if cost and reliability take very far back seats to flexibility should it even be put on the table and even then it is rare that a lower cost, more reliable solution doesn’t match it in overall flexibility within the anticipated scope of flexibility.  It is best avoided altogether.

Originally published on Spiceworks in abridged form: http://community.spiceworks.com/topic/312493-the-inverted-pyramid-of-doom

When to Consider a Private Cloud?

The idea of running a private cloud, hosted or on premise, for a single company is rapidly becoming a commonplace one.  More and more businesses are learning of cloud computing and seeing that running their own cloud platform is both feasible and potentially valuable to the business.  But due to a general lack of cloud knowledge, it is becoming more and more common for clouds to be recommended when they do not suit the needs of the business at all, having been mistaken for traditional virtualization management systems.

A cloud is a special type of virtualization platform that fills a unique niche.  Cloud computing takes traditional virtualization and layers onto it automated scaling and provisioning that allow for rapid, horizontal scaling of applications.  This is not a normal business need.  Cloud also lends itself to, and is often tied to, self-service resource provisioning, but this alone does not make something a cloud nor justify the move to a cloud platform; it could, however, be an added incentive.  What makes cloud interesting is the ability to provide self-service portals to end users and the ability for applications to provision new instances of themselves.  These are the critical aspects that set a cloud platform apart from traditional virtualization.

A cloud does not, however, imply features such as simplified whole-domain system management from a single pane of glass, large scale consolidation, easy migration between hardware systems, rapid provisioning of new systems, virtualization, high availability, resource over-commitment and so on.  These features are all available in other ways, primarily through or on top of standard platform virtualization (VMware vSphere, Microsoft’s HyperV, Xen, et al.).  It is not that these features cannot be made available in a private cloud, but they are not aspects of the cloud itself; they belong to the underlying virtualization platform.  The cloud layer sits above these and simply passes through the benefits of the underlying layers.

Often cloud is approached because of a misunderstanding that many of the features commonly associated with private clouds are not available in some other, simpler form.  This is rarely the case.  Normal virtualization platforms, most commonly VMware’s vSphere and Microsoft’s HyperV, offer all of these options.  They can be used to make robust clusters of physical servers, managed from a single interface, with incredibly high reliability and rapid provisioning of new systems that require minimal specialty knowledge from the IT department and maintain traditional business workflows.  Most times, when I am speaking with businesses that believe that they may be interested in pursuing the ownership of their own cloud, the features that they really want are not cloud features at all.

The term “cloud” has simply become so popular recently that people begin to assume that features important to nearly everyone must be attributed to it to explain its sudden surge in importance, but this is simply not the case.  Cloud remains, and will remain, a predominantly niche solution appropriate for only a very small number of companies to own themselves.  The use of public clouds, or of hosted services delivered from cloud platforms, will become, and indeed has already become, nearly ubiquitous.  But ownership of a private cloud for the use of a single company is a long way from being a business need for most businesses or business units and, in many cases, I suspect, never will become one.

Private clouds shine in two key areas.  The first is a business that needs a large number of temporary or ad hoc systems “spun up” on a regular basis.  This often occurs with large development teams and application testing groups, especially if these groups target multiple operating systems.  The ability to rapidly provision temporary testing or lab systems can be very advantageous, and cloud computing naturally exposes provisioning tools that allow business customers to create, manage and destroy their own system instances, with built-in chargeback mechanisms we would expect, which can be very beneficial to corporate efficiency as the interaction between the IT department and the end users becomes nearly frictionless for this transaction.  Responsibility for maintaining the cloud as a whole can easily be segregated from the responsibilities of maintaining individual systems.  Though seldom used in this manner for production workloads, this allows the self-service approach that many business units desperately seek today.  It is impractical on a small scale due to the overhead of creating and maintaining the cloud platform itself, but on a large scale it can be hugely productive.  In addition to the technical advantages, this aspect of cloud computing can serve as a model for thinking of IT as an internal service provider and departments as customers.  We have long discussed IT and other business units in these terms, but we rarely truly think of them in this way.

The second area where cloud computing really comes into its own, and the one for which the concept was originally developed, is handling automatic provisioning for horizontally scaling applications, that is, application workloads that can increase their capacity by spawning new instances of themselves.  On a small scale, many web applications, due to their stateless nature, do this within a single system by spawning new worker threads to handle additional connections.  An Apache web server might start with eight listeners ready to service requests, but as those threads become exhausted it automatically starts new threads to handle additional incoming connections so that it can scale within the confines of a single server.  Applied to cloud computing, that same application, sensing that thread exhaustion is approaching on a system-wide level (or based on other metrics such as a lack of free memory or a loss of performance), would use an API exposed by the cloud computing platform to signal the cloud management system to provision a new copy of the system that called it, essentially cloning itself on the fly.  In a matter of seconds, a new virtual server, identical to the first, would be up and running and joining its parent in servicing incoming requests.  This child or clone system would likewise spawn new threads internally as needed and, if it too sensed exhaustion, would call the cloud platform to create yet another new system to handle even more threads.  In this way the application can grow itself almost infinitely (within the hardware limits of the entire cloud platform) as needed, on the fly, automatically.  Then, as workloads die down and individual systems become idle, each system can signal to the cloud management system that it is no longer needed, and it will be powered off and destroyed; as it was simply a stateless clone, this frees capacity for other applications and workloads that may need to take advantage of it.
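
As a minimal sketch of that self-provisioning loop, assuming a hypothetical cloud API client and purely illustrative thresholds (no specific platform’s API is implied), the logic looks roughly like this:

    # Sketch of the self-provisioning loop described above.  CloudClient and
    # its methods are hypothetical stand-ins for whatever API the cloud
    # management platform exposes; the thresholds are illustrative.
    import time


    class CloudClient:
        """Hypothetical client for the cloud platform's provisioning API."""

        def clone_self(self, template_id: str) -> str:
            # A real platform call would provision a new instance from the
            # template and return its identifier.
            raise NotImplementedError

        def destroy_self(self, instance_id: str) -> None:
            # A real platform call would power off and destroy the instance.
            raise NotImplementedError


    def busy_worker_ratio() -> float:
        """Fraction of this system's worker threads currently busy (stub)."""
        return 0.5


    def autoscale(cloud: CloudClient, instance_id: str, template_id: str) -> None:
        SCALE_UP_AT = 0.9    # nearing worker exhaustion: clone ourselves
        SCALE_DOWN_AT = 0.1  # nearly idle: retire this stateless clone

        while True:
            load = busy_worker_ratio()
            if load >= SCALE_UP_AT:
                # Ask the cloud platform for an identical copy of this system.
                new_id = cloud.clone_self(template_id)
                print(f"spawned sibling instance {new_id}")
            elif load <= SCALE_DOWN_AT:
                # Signal that this clone is no longer needed and exit.
                cloud.destroy_self(instance_id)
                return
            time.sleep(30)

Each new clone runs the same loop, which is what lets the application grow and shrink itself without anyone provisioning machines by hand.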

As we can see, cloud computing is massively powerful, especially with the bulk of today’s public and enterprise applications being written in a stateless manner in order to take advantage of web protocols and end user interfaces.  Web applications are especially adept at leveraging cloud computing’s scalability model, and most large scale web applications leverage this elastic expanding and contracting of capacity today.  Many new NoSQL models are beginning to emerge that signal that databases, in addition to application front end processing nodes, may soon benefit from similar models on a large scale.  This can certainly be leveraged for internal applications as well as publicly facing ones; however, internal applications rarely need to scale beyond a single system, so it is quite rare to find private clouds being leveraged in this way.

The dangers around cloud computing come in the form of additional complexity above and beyond normal virtualization.  There is the potential for complex storage needed to support the platform, and there are more layers to learn and maintain.  Cloud computing’s ability to rapidly create and destroy systems can tempt users into treating cloud resources as if they were persistent systems (which they can be made to be), which can result in data loss when users receive behavior very different from what is traditional and expected.  Possibly the biggest cloud concern is a human one: the increased likelihood of uncontrolled system sprawl as end users wildly spin up more and more new systems which, being created by end users rather than IT, are probably not tightly controlled and monitored, leaving systems in a rogue and oft forgotten state.  This can lead to a maintenance and security nightmare as systems go unpatched and uncared for, increasing risk and draining resources.  Most worrisome is the possibility that systems will be created, forgotten and potentially left without proper licensing.  Tracking and reporting on auto provisioned systems carries process risk caused by the huge shift in how systems are created.  IT departments are accustomed to the heavy licensing processes necessary to maintain compliance, but with cloud computing there is a potential for this process to be exposed to the business units in a way they are not at all equipped to handle.  There are accommodations for the licensing needs of cloud computing, but this is extra complexity and management that must be addressed.  Allowing systems to exist without direct IT department oversight clearly carries risk of a potentially unforeseen nature.

Private cloud ownership brings many exciting possibilities, but it is clear that these benefits and opportunities are not for everyone.  They cater to larger businesses, to those with good process control, to companies running especially adapted applications that are capable of taking advantage of the system-level elasticity of the resources and those needing large scale ad hoc system creation and destruction provided, as a service, for end users to self-provision.  Most large enterprises will find limited use for cloud computing in house.  Smaller organizations will rarely find cloud computing to be advantageous in the near future, if ever.

Stick to IT, Don’t Become Another Department

I see this very regularly; it seems to be a huge temptation for IT departments to overstep IT’s bounds and take on the roles and responsibilities of other company departments.  In the SMB this might be far more common because there isn’t a clear demarcation of IT versus other departments, job roles are often shared, there aren’t good policies and procedures, there aren’t people doing those other jobs, and so on.  And there is always the possibility that these cross-domain responsibilities are truly assigned to IT.  But nine times out of ten, this is not the case.

I believe that this behaviour stems from a few things:

  1. People tend to work in IT because they are “smarter” about, or at least “more interested” in, most things than average people, so we tend to carry a lot of general knowledge that allows us to act as a competent member of any department (IT can do HR’s job in a pinch; is the reverse commonly true?)
  2. IT tends to get thrown whatever work other departments don’t want to do and can get away with handing off (can you print this for us? can you fix my microwave? the fuse has blown!  have you any experience with sprinklers?) So we get into this mindset from other departments’ behaviors towards us.
  3. We have a broad view into the organization as a whole, moreso than almost any other department.
  4. We tend to be passionate about doing things the “right way” – which is often based on technical excellence or industry common practice but may not account for the specific business needs nor unique factors.

Put together, these and other factors make us tend to want to get involved in anything and everything in and around the businesses that we serve.  Questions around involvement in other departments’ activities come up regularly.  To establish just how skewed our thinking about this behavior tends to be: we see IT people asking other IT people what their responsibility is rather than talking to their own business’ management, who are the ones actually making that decision.  This isn’t about best practice, it is about following your own company’s rules.

Some examples of places where IT people like to jump in and try to be other departments:

  • “People are surfing Facebook at work, I have to stop them.” – Do you? Is this a business decision or is IT just making HR or security decisions for those departments? IT bringing this up as a topic is great, but decisions about enforcing personal work habits should probably be left to the business owner, manager or a designated department such as HR, legal or security.
  • Spying on end users, capturing passwords, etc. – Did the legal department ask you to do this? If not, don’t take on legal and security responsibilities, especially ones that might carry fines or even  jail time in your local jurisdiction!  We risk turning the tables from suspecting someone else to being the culprits ourselves.
  • Pressuring the business about fire hazards, safety issues (that are not your own), etc. – See something, say something. Awesome. Don’t be the cause of bad behaviour yourself. But if the business isn’t concerned about these things once reported, unless it is a legal issue that you need to turn over to the police, don’t feel that this is IT’s job. The janitor doesn’t feel this way, HR doesn’t feel this way, IT shouldn’t either. If the business decides to not care, you shouldn’t either. (Example was AJ talking about stringing surge protectors together.)
  • The business can’t be down! – IT loves this one. This might be us pushing for high availability clusters or just overbuilt servers or who knows what. The reality is, this is 100% a financial decision that the accounting, financial and CFO teams should be making. IT has no idea how much the business can be or can’t be down – we just know how much it costs to mitigate how much risk. We feed data to the financial people who come back with the risk/reward evaluation. IT shouldn’t be making financial decisions on any scale.

I could go on and on. HR, finance, security, facilities management, legal – we want to get involved in all of these job roles. But is it our responsibility to do so? Maybe it is in your case, but normally, it is not. We take on personal and professional risk in order to push our ideas and opinions on businesses that often aren’t interested in our input (in those areas.)

Step back and look at your relationship to the business. Are you making suggestions and decisions that line up with your role within the business and with the business’ unique needs? Keep perspective. It is so easy to get caught up in IT doing things the “right” way that we forget that the business might not share our opinions of what is right and wrong for them – and we aren’t in IT just for the sake of being in IT, but for the purpose of supporting the business.

[Reprinted from a post in Spiceworks, January 8, 2013]
