Tag Archives: design

What Do I Do Now? Planning for Design Changes

Quite often I am faced with talking to people about their system designs, plans and architectures.  And many times that discussion happens too late and designs are either already implemented or they are partially implemented.  This can be very frustrating if the design in progress has been deemed to not be ideal for the situation.

I understand the feeling of frustration that will come from a situation like this but it is something that we in IT must face on a very regular basis and managing this reaction constructively is a key IT skill.  We must become masters of this situation both technically as well as emotionally.  We should not be crippled by it, it is a natural situation that every IT professional will experience on a regular basis.  It should not be discouraging or crippling but it is very understandable that it can feel that way.

One key reason that we experience this so often is because IT is a massive field with a great number of variables to be considered in every situation.  It is also a highly creative field where there can be numerous, viable approaches to any given problem.  That there is even a single “best” option is rarely true.  Normally there any many competitive options.  Sometimes these are very closely related, sometimes these options are drastically different making them very difficult to compare meaningfully.

Another key reason is that factors change.  This could be that new techniques or information come to light, new products are released, products are updated, prices change or business needs change near to or even during the decision making and design processes.  This rate of change is not something that we, as IT professionals, can hope to ever control.  It is something that we must accept and deal with as best as we can.

Another thing that I often see missed is that a solution that was ideal when made may not be ideal if the same decision was being made today.  This does not, in any way, constitute a deficiency in the original design yet I have seen many people react to it as if it did.  The most common scenario that I run into where I see people exhibit this behaviour is in the aversion to the use of RAID 5 in modern storage design, RAID 6 and RAID 10 being the popular alternatives for good reason.  But this RAID 5 aversion, common since about 2009, did not exist always and from the middle of the 1990s until nearly the end of the 2000s RAID 5 was not only viable, it was very commonly the best solution for the given business and technical needs (the increase in aversion to it was mostly gradual, not sudden.)  However many people see RAID 5 as understandably poor as an option today but apply this new aversion to systems designed and implemented long ago, sometimes close to two decades ago.  This makes no sense and is purely an emotional reaction.  RAID 5 being the best choice for a scenario in 2002 in no way implies that it will still be the best choice in 2015.  But likewise, RAID 5 being a poor choice in 2015 for a scenario in no way belittles or negates the fact that it was very often a great choice several years ago.

I have been asked many times what to do once less than ideal design decisions have been made.  “What do I do now?”

Learning what to do when perfection is no longer an option (as if it ever really was, all IT is about compromises) is a very important skill.  The first things that we must tackle are the emotional problems as these will undermine everything else.  We must do our best to step back, accept the situation and act rationally.  The last thing that we want to do is take a non-ideal situation and make things worse by attempting to reverse justify bad decisions or panicking.

Accepting that no design is perfect, that there is no way to always get things completely right and that dealing with this is just part of working IT is the first step.  Step back, breathe deep.  It isn’t that bad.  This is not a unique situation.  Every IT pro doing design goes through this all of the time.  You should try your best to make the best decisions possible but you must also accept that that can rarely be done – no one has access to enough resources to really be able to do that.  We work with what we have.  So here we are.  What’s next?

Next is to assess the situation.  Where are we now?  In many cases the implementation is done and there is nothing more to do.  The situation is not ideal, but is it bad?  Very often the biggest mistake that I see people facing of an all ready implemented design is that it is too costly – typically “better” solutions are not better because they are faster or more reliable but are better because they are cheaper, easier or faster to have implemented.  That’s an unfortunate situation but hardly a crippling one.  Whatever time or money was spent must have been an acceptable amount at the time and must have been approved.  The best that we can do, right now, is learn from the decision process and attempt to avoid the overspending in the future.  It does not mean that the existing solution will not work or even not work amazingly well.  It is simply that it may not have been a perfect choice given the business needs, primarily financial, involved.

There are situations where a design that has been implemented does not adequately meet the stated business requirements.  This is thankfully less common, in my experience, as it is a much more difficult situation.  In this case we need to make some modifications in order to fulfill our business needs.  This may prove to be expensive or complex.  But things may not be as bad as what they seem.  Often reactions to this are misleading and the situation can often be salvaged.

The first step once we are in a position where we have implemented a solution that fails to meet business needs is to reassess the business needs.  This is not to imply that we should fudge the needs to massage them into being whatever our system is able to fulfill, not at all.  But it is a good time to go back and see if the originally stated needs are truly valid or if they were simply not vetted well enough or, even more likely, to go and see if the business needs have changed during the time that the implementation took place.  It may be that the implemented solution does, in fact, meet actual business needs even if they were originally misstated or because the needs have changed over time.   Or it might be that business needs have changed so dramatically that even perfect planning would originally have fallen short of the existing needs and the fact that the implemented solution does not perform as expected is of minor consequence.   I have been very surprised just how often this verification of business needs has turned a solution believed to be inadequate into an “overkill” solution that actually cost more than necessary simply because no one pushed back on overstatements of business needs or no one questioned financial value to certain technology investments.

The second step is to create a new technology baseline.  This is a very important step to assist in preventing IT from falling into the drop of the sunk cost fallacy.  It is extremely common for anyone, this is not unique to IT in any way, to look at the time and money spent on a project and assume that continuing down the original path, no matter how foolish it is, is the way to go because so many resources have been expended on that path already.  But this makes no sense, of course, how you got to your current state is irrelevant.  What is relevant is assessing the current needs of the department and company and taking stock of the currently available solutions, technologies and resources.  Given the current state, the best course forward can be determined.  Any consideration given to the effort expended to get to the current state is only misleading.

A good example of the sunk cost fallacy is in the game of chess.  With each move it is important to assess all available moves, risks and strategies again because what moves were used to get to the current state have no bearing on what moves make sense going forward.  If the world’s greatest chess player or an amazing computer chess algorithm was to be brought in mid-game they would not require any knowledge as to how the current state had come to be – they would simply assess the current state and create a strategy based upon it.

This is the same as we should be behaving in IT.  Our current state is our current state.  It does not matter for strategic planning what unfolded to get us into that state.  We only care about those decisions and costs when doing a post mortem process in order to determine where decision making may have failed in order to learn from it.  Learning about ourselves and our processes is very important.  But that is a very different task from doing strategic planning for the current initiative.

The unfortunate thing here is that we must begin our planning processes again but this time with, we assume, more to work with.    But this cannot be avoided.  In the worst cases, budgets are no longer available and there are no resources to fix the flawed design and achieve the necessary business goals.  Compromises sometimes are necessary.  Making do with what we have is sometimes that best that we can do. But, in the vast majority of cases it would seem, some combination of additional budget or creative reuse of existing products can be adequate to remedy the situation.

Once we have reached a state in which we have addressed our short falls, whether simply by accepting that we have over spent, under-delivered or have adjusted to meet needs we have an opportunity to go back and investigate our decision making processes.  It is by doing this that we hope to grow as individuals and, if at all possible, on an organizational level to learn from our mistakes, or determine if there even were mistakes.  Every company and every individual makes mistakes.  What separates us is the ability to learn from them and avoid those same mistakes in the future.  Growth comes primarily from experiencing pain in this way and while often unpleasant to face it is here that we have the best opportunity to create real, lasting value.  Do not push off or skip this opportunity for review whether it be a harsh, personal review that you do yourself or a formal, organizational review run by people trained to do so or something in-between.  The sooner that the decision processes are evaluated the fresher the memory will be and the sooner the course correction can take effect.

The final step that we can do is to begin the decision process to design a replacement for the current implementation as soon as possible, once the review of the decision process is complete.  This does not necessarily mean that we should intend to spend money or change our designs in the near future.  Not at all.  But by being extremely pro-active in design making we can attempt to avoid the problems of the past by giving ourselves additional time for planning, more time for requirements gathering and documentation, better insight into changes in requirements over time by regularly revisiting those requirements to see if they remain stable or if they are changing, more opportunity to get management and peer buy in and investment in the decision and better understanding of the problem domain so that we are better equipped to alter the intended design or know when to scrap it and start over before implementing it the next time.  It also could, potentially, give us a better chance of codifying organizational knowledge that could be passed on to a successor should you yourself not be in the position of decision making or implementation when the next cycle comes around.

With good, rational processes and a good understanding of the steps that need to be taken in a case of less than ideal systems design or implementation we can recover from missteps and not only, in most cases, recover from them in the short term but we can insulate the organization from the same mistakes in the future.

The Inverted Pyramid of Doom

The 3-2-1 model of system architecture is extremely common today and almost always exactly the opposite of what a business needs or even wants if they were to take the time to write down their business goals rather than approaching an architecture from a technology first perspective.  Designing a solution requires starting with business requirements, otherwise we not only risk the architecture being inappropriately designed for the business but rather expect it.

The name refers to three (this is a soft point, it is often two or more) redundant virtualization host servers connected to two (or potentially more) redundant switches connected to a single storage device, normally a SAN (but DAS or NAS are valid here as well.) It’s an inverted pyramid because the part that matters, the virtualization hosts, depend completely on the network which, in turn, depends completely on the single SAN or alternative storage device. So everything rests on a single point of failure device and all of the protection and redundancy is built more and more on top of that fragile foundation. Unlike a proper pyramid with a wide, stable base and a point on top, this is built with all of the weakness at the bottom. (Often the ‘unicorn farts’ marketing model of “SANs are magic and can’t fail because of dual controllers” comes out here as people try to explain how this isn’t a single point of failure, but it is a single point of failure in every sense.)

So the solution, often called a 3-2-1 design, can also be called the “Inverted Pyramid of Doom” because it is an upside down pyramid that is too fragile to run and extremely expensive for what is delivered. So unlike many other fragile models, it is very costly, not very flexible and not as reliable as simply not doing anything beyond having a single quality server.

There are times that a 3-2-1 makes sense, but mostly these are extreme edge cases where a fragile environment is desired and high levels of shared storage with massive processing capabilities are needed – not things you would see in the SMB world and very rarely elsewhere.

The inverted pyramid looks great to people who are not aware of the entire architecture, such as managers and business people.  There are a lot of boxes, a lot of wires, there are software components typically which are labeled “HA” which, to the outside observer, makes it sounds like the entire solution must be highly reliable.  Inverted Pyramids are popular because they offer “HA” from a marketing perspective making everything sound wonderful and they keep the overall cost within reason so it seems almost like a miracle – High Availability promises without the traditional costs.  The additional “redundancy” of some of the components is great for marketing.  As reliability is difficult to measure, business people and technical people alike often resort to speaking of redundancy instead of reliability as it is easy to see redundancy.  The inverted pyramid speaks well to these people as it provides redundancy without reliability.  The redundancy is not where it matters most.  It is absolutely critical to remember that redundancy is not a check box nor is redundancy a goal, it is a tool to use to obtain reliability improvements.  Improper redundancy has no value.  What good is a car with a redundant steering wheel in the trunk?  What good is a redundant aircraft if you die when the first one crashes?  What good is a redundant sever if your business is down and data lost when the single SAN went up in smoke?

The inverted pyramid is one of the most obvious and ubiquitous examples of “The Emperor’s New Clothes” used in technology sales.  Because it meets the needs of the resellers and vendors by promoting high margin sales and minimizing low margin ones and because nearly every vendor promotes it because of its financial advantages to the seller it has become widely accepted as a great solution because it is just complicated and technical enough that widespread repudiation does not occur and the incredible market pressure from the vast array of vendors benefiting from the architecture it has become the status quo and few people stop and question if the entire architecture has any merit.  That, combined with the fact that all systems today are highly reliable compared to systems of just a decade ago causing failures to be uncommon enough that the fact that they are more common that they should be and statistical failure rates are not shared between SMBs, means that the architecture thrives and has become the de facto solution set for most SMBs.

The bottom line is that the Inverted Pyramid approach makes no sense – it is far more unreliable than simpler solutions, even just a single server standing on its own, while costing many times more.  If cost is a key driver, it should be ruled out completely.  If reliability is a key driver, it should be ruled out completely.  Only if cost and reliability take very far back seats to flexibility should it even be put on the table and even then it is rare that a lower cost, more reliable solution doesn’t match it in overall flexibility within the anticipated scope of flexibility.  It is best avoided altogether.

Originally published on Spiceworks in abridged form: http://community.spiceworks.com/topic/312493-the-inverted-pyramid-of-doom