Category Archives: Architecture

The Commoditization of Architecture

I often talk about the moving “commodity line,” a line that affects essentially all technology, including systems designs.  When any new technology emerges it starts out highly proprietary, complex and expensive.  Over time the technology moves toward openness and simplicity and becomes inexpensive.  At some point a given technology goes so far in that direction that it crosses the “commodity” line, moving from being unique and a differentiator to being a commodity accessible to essentially everyone.

Systems architecture is no different from other technologies in this regard; it is simply a larger, less easily defined topic.  If we look at systems architecture, especially over the last few decades, we can easily see servers, storage and complete systems moving from the highly proprietary toward the commodity.  Systems were complex and are becoming simple; they were expensive and are becoming inexpensive; they were proprietary and they are becoming open.

Traditionally we dealt with operating systems running directly on bare metal hardware.  Then virtualization came along and abstracted this away.  Virtualization gave us many of the building blocks for systems commoditization.  Virtualization itself commoditized very quickly; today the market is flush with free, open, enterprise-class hypervisors and toolsets, and virtualization was effectively fully commoditized even several years ago.

Storage moved in a similar manner.  First there was independent local storage.  Then the SAN revolution of the 1990s brought us power through storage abstraction and consolidation.  Then the replicated local storage movement took that complex and expensive abstraction to a more reliable, more open and simpler state.

Now we are witnessing this same movement in the orchestration and management layers of virtualization and storage.  Hyperconvergence is currently taking the majority of systems architectural components and merging them into a cohesive, intelligent singularity that reduces the human understanding and labour required while improving system reliability, durability and performance.  The entirety of the systems architecture space is moving, quite rapidly, toward commoditization.  It is not fully commoditized yet, but the shift is very much in motion.

As in any space, it takes a long time for commoditization to permeate the market.  Just because systems have become commoditized does not mean that non-commodity remnants will not remain in use for a long time to come or that niche proprietary (non-commodity) aspects will not linger on.  Today, for example, systems architecture commoditization is largely limited to the SMB market space, as there are effective upper bounds on hyperconvergence scale that have yet to be tackled; over time, they will be.

What we are witnessing today is a movement from complex to simple within the overall architecture space, and we will continue to witness this for several years as the commodity technologies mature, expand, prove themselves and become well known.  The technologies that we can tell will become commodity have emerged, but the space has not yet commoditized.  It is an interesting moment: we have what appears to be a very clear vision of the future, some scope in which to realize its benefits today, a majority of systems and thinking that still reside in the legacy proprietary realm, and a mostly clear path forward as an industry, in technology focus as well as in education, that will allow us to commoditize more quickly.

Many feel that systems are becoming overly complex, but the opposite is true.  Virtualization, modern storage systems, cloud and hyperconverged orchestration layers are all coming together to commoditize first the individual architectural components and then architectural design as a whole.  The move toward simplicity, openness and effectiveness is happening, is visible and is moving at a very healthy pace.  The future of systems architecture is one that will clearly free IT professionals to spend less time thinking about systems design and more time thinking about how to drive competitive advantage for their organizations.

Understanding the Role of the Dell VRTX

Dell’s VRTX is one of those devices that is just sexy, as IT hardware goes. It strikes a chord and drives IT professionals nearly wild. It looks cool, it has an incredible amount of power, it can be rack mounted or placed under a desk, it is quiet – so quiet that it can be run right in the middle of an open office space. It’s just really cool, and nearly every IT professional wants one – even if they have no idea why.

The problem with the VRTX is that it is generally misunderstood, and the misunderstandings around the device and the architecture used within it have led to a nearly continuous stream of proposals to use it where it is least suited. The device truly is awesome and has excellent use cases, but it is very important to understand what those use cases are and are not, as this is a very specialized piece of hardware.

First, we need to determine what the VRTX “is”. The Dell VRTX is primarily a blade enclosure, more or less like any blade system. But unlike traditional blade enclosures, which typically hold six to ten blades, the VRTX holds only four. So it is a “baby” blade enclosure. Because it is a true blade system, the Dell VRTX carries the normal caveats of any blade enclosure. However, due to its small size, the probability of it being fully used, and retired, effectively makes it quite a bit more reasonable to consider than traditional, larger blade enclosures. So an understanding of its blade nature is important in evaluating it for your organization’s needs.

Along with the blade component, the VRTX also includes a DAS (Direct Attached Storage) system attached via SAS to the blades. This storage array offers either twelve large form factor (3.5”) or twenty-five small form factor (2.5”) hard drives attached by way of either one or two PERC8 hardware RAID controllers. This included, large-scale, shared external storage array inside the blade enclosure is what makes the VRTX unit truly unique.

So all four blades share the single DAS unit for storage. The four blades constitute 2U of the VRTX enclosure and the DAS unit another 2U for a total enclosure size of 4U.

Of course, as with any blade system, there is no requirement that you fully populate the VRTX initially, or ever. The system can be used with any number of blades from one to four, as needed. But the value of a blade enclosure, especially a small one such as this, depends heavily on it being completely populated, or nearly so, in order to be cost viable.

Architecturally, the VRTX represents a highly compact, single-chassis Inverted Pyramid of Doom (the “traditional” 3-2-1 architectural design) built following, more or less, the best approaches for that type of system. The biggest advantages here are that the use of a solid DAS is mandated and cannot be altered, and that all connections between the DAS and the compute nodes are hard-wired internally, giving the highest potential reliability for a shared external storage system with the least opportunity for human error. By using DAS instead of SAN, our 3-2-1 has its “2” layer removed, resulting in a far better inverted pyramid structure. What we are left with is a 4-1 inverted pyramid design.

The overall profile of the VRTX is one of massive compute capability, far outstripping the computational needs of a typical SMB, all in a single chassis. The smallest blade option is a dual processor module and the largest is quad processor, meaning that when populated we have a minimum of eight Intel Xeon processors over four nodes and a maximum of sixteen Intel Xeon processors over four nodes. This is truly a mammoth computational system in a small package. But it is critical to understand that all of this horsepower shares a single storage array, is not highly available and cannot be made so. This is a system designed for processing power, not as a reliable infrastructure component.

It should also be noted that Dell experienced reliability issues with the redundant PERC8 hardware RAID controller setup and had to pull it from the market for some time. As with nearly all storage systems in this category, which includes many DAS and SAN devices, redundant controllers are commonly the cause of storage outages rather than the preventers of such. Redundancy of RAID controllers is rarely a valuable addition and should never be looked on as a panacea to storage reliability concerns.

Given that the VRTX is compute heavy and reliability weak, what are its designated use cases? Where does it make the most sense to consider deploying this unit?

There are three extremely common deployment scenarios today where large compute and shared “fragile” storage often fit. Of course there may be many special cases and those should be evaluated individually based on the power, cost and reliability profiles of the VRTX relative to other options. But by and large the big three use cases where we would want to see the VRTX deployed would be:

Enterprise Remote Office and Branch Office (ROBO): This use case is based around the concept of the VRTX being a single device, easily deployable with nothing to do but “plug it in,” delivering a “reliable enough” but very powerful platform for remote offices. Not every remote or branch office needs the kind of horsepower that a VRTX can provide, and some require high availability, which it does not have. But large ROBOs are often ideally suited to this architectural profile due to the ease of remote management and the common ability to use remote access to a central office or datacenter as a means of providing failover and reliability in the event of a major disaster, either to IT itself (such as a total failure of the VRTX) or to the ROBO itself (fire, flood, etc.)

A VRTX in this scenario can easily be the sole IT device, outside of networking equipment, powering an entire ROBO of hundreds or potentially even thousands of users. And the ability to do nearly all maintenance in a non-disruptive way, which if properly designed is trivial to provide with a VRTX, can be quite significant to a ROBO.

The concept of this being solely an “enterprise” ROBO solution rather than an SMB one is simply that the total scale of the VRTX is larger than the typical needs of an SMB as a whole, let alone the needs of just one of its remote offices. The VRTX is simply too “big” for the typical needs of an SMB, while not being specifically focused on SMB needs either.

Virtual Desktop Infrastructure (VDI): VDI generally requires a large amount of compute power, non-disruptive updates and shared storage, which is perfect for the VRTX. Of course this only makes sense in shops that need at least three, if not four, nodes of compute power to leverage the blade chassis nature of the VRTX. But for companies looking for eight to sixteen CPUs worth of VDI power the VRTX can be a slam dunk. Possibly no use case is more appropriate for the VRTX than as a single, modular VDI system.

Big Data: Not many SMBs look to do big data processing today (Hadoop, Apache Spark, etc.), but a VRTX can be an ideal platform for doing heavy processing in a small business that does not need to scale its data processing beyond this point. For larger enterprises needing a much larger scale of processing the VRTX would not be well suited; what makes it exceptionally valuable is matching its size to the organization’s need. Of course other kinds of computationally heavy processing, such as Monte Carlo simulations, would also work well on this platform (a small illustrative sketch follows below).
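To make the “computationally heavy” profile concrete, below is a minimal, purely illustrative Python sketch of the kind of embarrassingly parallel work (here, a Monte Carlo estimate of pi) that such a platform handles well. The worker count of 16 is an arbitrary assumption meant to echo a fully populated chassis, not a tuning recommendation; the point is that this class of job scales across many cores while barely touching shared storage.

# Hypothetical, illustrative sketch: CPU-bound, embarrassingly parallel work
# (a Monte Carlo estimate of pi) spread across worker processes.
import random
from multiprocessing import Pool

def sample_batch(n: int) -> int:
    """Count how many of n random points land inside the unit quarter-circle."""
    hits = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(total_samples: int = 4_000_000, workers: int = 16) -> float:
    """Split the sampling across 'workers' processes, roughly one per core."""
    per_worker = total_samples // workers
    with Pool(processes=workers) as pool:
        hits = sum(pool.map(sample_batch, [per_worker] * workers))
    return 4.0 * hits / (per_worker * workers)

if __name__ == "__main__":
    print(estimate_pi())  # approaches 3.14159... as the sample count grows

Workloads with this shape, lots of independent computation and very little shared state, are exactly where a compute-heavy, storage-fragile box earns its keep.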

Now that we know where the VRTX is well suited, where does it not fit well?

The VRTX is very poorly suited to general computing use, in both the SMB and the enterprise sectors. In the enterprise the VRTX represents a fully contained, but non-scaling, stack which would be unwieldy and expensive in a large infrastructure.

In the SMB the VRTX is dramatic overkill on the computational side while falling short, generally in reliability, on the storage side. Most SMBs, when scaling past a single computation node, are seeking both flexible scalability and higher than typical reliability. Often it is the desire for high availability alone that drives SMBs past a single computation node, considering the incredible capacity available in a single node today. So moving to an inverted pyramid architecture would be counterproductive to the needs of the typical SMB. The VRTX is simply too big, too rigid and lacking the reliability profile desired by SMBs. The SMB is really the last market where I would expect the VRTX to be deployed, as the general computing needs that drive the SMB are the furthest thing from an appropriate use case for this device.

The VRTX is an amazing piece of equipment, well designed for several niche use cases, but it is not designed to replace standard servers, such as the Dell PowerEdge R730, in the typical scenarios for which they are the ideal equipment. General-use equipment exists as the industry standard and best seller for a reason; niche equipment also exists for a reason. Be sure to understand why the equipment you are considering makes sense for your environment; new and interesting is not enough to justify moving to special-case gear.

Making the Best of Your Inverted Pyramid of Doom

The 3-2-1 or Inverted Pyramid of Doom architecture has become an IT industry pariah for many reasons. Sadly for many companies, they only learn about the dangers associated with this design after the components have arrived and the money has left the accounts.

Some companies are lucky and catch this mistake early enough to be able to return their purchases and start over with a proper design and decision phase prior to the acquisition of new hardware and software. This, however, is an ideal and very rare situation. At best we can normally expect restocking fees and, far more commonly, the equipment cannot be returned at all or the fees are so large as to make it pointless.

What most companies face is a need to “make the best” of the situation moving forward. One of the biggest concerns is that the involved parties, whether the financial stakeholders who have just spent a lot of money on the new hardware or the technical stakeholders who now look bad for having allowed this equipment to be purchased, will succumb to an emotional reaction and give in to the sunk cost fallacy. It is vital that this emotional, illogical reaction not be allowed to take hold, as it will undermine critical decision making.

It must be understood that the money spent on the inverted pyramid of doom has already been spent and is gone. That the money was wasted, or how much was wasted, is irrelevant to decision making at this point. Whether the system was a gift or cost a billion dollars does not matter; that money is gone and now we have to make do with what we have. A potential “trick” here is to bring in a financial decision maker such as the CFO, explain that there is about to be an emotional reaction to money already spent, and discuss the sunk cost fallacy before talking about the actual problem, so that people are aware and logical and the person trained (we hope) to best handle this kind of situation is there and ready to head off sunk cost emotions. Careful handling of a potentially emotionally-fueled reaction is important. This is not the time to attempt to cover up either the financial or the technical missteps, which is what the emotional reaction pushes us toward. It is necessary for all parties to communicate and remain detached and logical in order to address the needs. Some companies handle this well; many do not, and become caught trying to forge ahead with bad decisions already made, probably in the hope that nothing bad happens and that no one remembers or notices. Fight that reaction. Everyone has it; it is the natural amygdala “fight or flight” response.

Now that we are ready to fight the emotional reactions to the problem we can begin to address “where do we go from here.” The good news is that where we are is generally a position of having “too much” rather than “too little.” So we have an opportunity to be a little creative. Thankfully there are generally good options that can allow us to move in several directions.

One thing that is very important to note is that we are looking exclusively at solutions that are more reliable, not less reliable, than the intended inverted pyramid of doom architecture that we are replacing. An IPOD is a very fragile and dangerous design. We could go to great lengths demonstrating concepts like risk analysis, single points of failure, the fallacies of false redundancy, the difference between redundancy and reliability, dependency chains and so on, but what is absolutely critical for all parties to understand is that a single server, running with local storage, is more reliable than the entire IPOD infrastructure would be. This is so important that it has to be said again: if a single server is “standard availability”, the IPOD is lower than that. More risky. If anyone at this stage fears a “lack of redundancy” or a “lack of complexity” in the resulting solutions we have to come back to this – nothing that we will discuss is as risky as what had already been designed and purchased. If there is any fear of risk going forward, the fear should have been greater before we improved the reliability of the design. This cannot be overstated. IPODs sell because they easily confuse those not trained in risk analysis and look reliable when, in fact, they are anything but.
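To make that risk comparison concrete, here is a rough back-of-the-envelope availability model, a minimal Python sketch using entirely made-up availability figures, showing why a chain of serial dependencies (redundant hosts, then a switch, then a single storage unit) ends up less available than one plain server on local storage, no matter how redundant the host layer is.

# Illustrative only: assumed, made-up availability numbers.
# Serial dependencies multiply; a redundant layer combines as 1 - prod(1 - A).

def serial(*availabilities: float) -> float:
    """Availability of a chain where every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def redundant(*availabilities: float) -> float:
    """Availability of a layer that survives as long as any one member is up."""
    downtime = 1.0
    for a in availabilities:
        downtime *= (1.0 - a)
    return 1.0 - downtime

SERVER = 0.999    # hypothetical: a single commodity server with local storage
SWITCH = 0.9995   # hypothetical: a single storage switch
SAN    = 0.999    # hypothetical: a single, non-replicated SAN/DAS unit

single_server = SERVER
# 3-2-1 IPOD: redundant hosts, but every VM still depends on the switch and SAN.
ipod = serial(redundant(SERVER, SERVER, SERVER), SWITCH, SAN)

print(f"single server: {single_server:.5f}")  # 0.99900
print(f"3-2-1 IPOD:    {ipod:.5f}")           # ~0.99850, lower than one server

However the exact numbers are chosen, the structure of the math does not change: every serial dependency sitting below the redundant layer drags the whole stack’s availability down.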

Understanding the above, and using a technique called “reading back,” the accepted IPOD architecture tells us that the company in question was accepting of not having high availability (or even standard availability) at the time of the purchase. Perhaps they believed that they were getting it, but the architecture could not provide it, and so moving forward we have the option of “making do” with nothing more than a single server running on its own local storage. This is simple and easy and improves on nearly every aspect of the intended IPOD design. It costs less to run and maintain, is often faster, and is much less complex while being slightly more reliable.

But simply dropping down to a single server and hoping to find uses for the rest of the purchased equipment “elsewhere” is likely not going to be our best option. In situations where the IPOD was meant to serve only a single workload or set of workloads, and other areas of the business have a need for equipment as well, it can be very beneficial to take the “single server” approach for the intended IPOD workload and utilize the remaining equipment elsewhere in the business.

The most common approach to repurposing an IPOD stack is to reconfigure the two (or more) compute nodes as full-stack nodes containing their own storage. Depending on what storage has already been purchased, this step may require no purchases at all, a movement of drives between systems, or, often, the relatively small purchase of additional hard drives.

These nodes can then be configured into one of two high availability models. In the past a common design choice, for cost reasons, was an asynchronous replication model (often known as the Veeam approach) that replicates virtual machines between the nodes and allows VMs to be powered up very rapidly, keeping the downtime from the moment of compute node failure until recovery to as little as a few minutes.

Today fully synchronous replication is so commonly available for free that it has effectively replaced the asynchronous model in nearly all cases. In this model storage is replicated in real time between the compute nodes, allowing failover to happen almost instantly rather than with a few minutes of delay, and with zero data loss instead of a small data loss window (i.e., an RPO of zero.)
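The difference between the two models comes down to when a write is acknowledged. Below is a toy Python sketch, not any vendor’s actual implementation, that contrasts them purely in terms of RPO: how much acknowledged data is at risk if the primary node dies before the next replication event.

# Toy model for illustration only.

class AsyncReplicatedVolume:
    """Asynchronous: writes are acknowledged locally and shipped on a schedule."""
    def __init__(self):
        self.local, self.remote = [], []

    def write(self, block):
        self.local.append(block)          # acknowledged to the VM immediately;
                                          # the remote copy lags until the next job
    def replication_job(self):
        self.remote = list(self.local)    # runs every few minutes; RPO == job interval

    def data_at_risk(self):
        return len(self.local) - len(self.remote)


class SyncReplicatedVolume:
    """Synchronous: a write is acknowledged only once both nodes hold it."""
    def __init__(self):
        self.local, self.remote = [], []

    def write(self, block):
        self.remote.append(block)         # mirror first (adds some write latency)...
        self.local.append(block)          # ...then acknowledge; RPO is zero

    def data_at_risk(self):
        return len(self.local) - len(self.remote)


a, s = AsyncReplicatedVolume(), SyncReplicatedVolume()
for i in range(100):
    a.write(i)
    s.write(i)
print(a.data_at_risk())  # up to 100 blocks lost if the node dies between jobs
print(s.data_at_risk())  # 0: zero data loss window

The trade-off, as the comments suggest, is that the synchronous model pays for its zero RPO with added write latency and, as discussed next, with the storage capacity consumed by keeping a full second copy.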

At this point it is common for people to react to replication with a fear of the storage capacity it consumes, and of course that capacity is indeed consumed: with two nodes mirroring each other, usable capacity is half of the raw total. But it must be understood that it is this replication, missing from the original IPOD design, that provides the firm foundation for high reliability. If this replication is skipped, high availability is an unobtainable dream, and individual compute nodes using local storage in a “stand alone” mode become the most reliable option available. High availability solutions rely on replication and redundancy to build the reliability necessary to qualify as highly available.

This solves the question of what to do with our compute nodes but leaves us with what we can do with our external shared storage device, the single point of failure or the “point” of the inverted pyramid design. To answer this question we should start by looking at what this storage might be.

There are three common types of storage devices that would be used in an inverted pyramid design: DAS, SAN and NAS. We can lump DAS and SAN together as they are simply two forms of block storage and can be used essentially interchangeably in this discussion; they are differentiated only by the presence of switching, which can be added or removed as needed in our designs. NAS differs by being file storage rather than block storage.

In either case, block (DAS or SAN) or file (NAS), one of the most common uses for this now superfluous device is as a backup target for the new virtualization infrastructure. In many cases the device may be overkill for this task, generally offering more performance and many more features than a simple backup target needs, but good backup storage is important for any critical business infrastructure and erring on the side of overkill is not necessarily a bad thing. Businesses often attempt to skimp on their backup infrastructure, and this is an opportunity to invest heavily in it without spending any extra money.
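As a hypothetical illustration of how simple this repurposing can be, the sketch below copies nightly VM exports onto a share mounted from the old storage device. The paths and the .qcow2 extension are assumptions for the example, not a reference to any particular backup product.

# Hypothetical sketch: use the repurposed NAS/SAN as a plain backup target.
import datetime
import pathlib
import shutil

EXPORT_DIR = pathlib.Path("/var/exports/vms")        # assumed local export location
BACKUP_MOUNT = pathlib.Path("/mnt/repurposed-nas")   # the old shared storage, now a mount

def nightly_backup() -> None:
    """Copy today's VM exports into a dated folder on the backup mount."""
    target = BACKUP_MOUNT / datetime.date.today().isoformat()
    target.mkdir(parents=True, exist_ok=True)
    for export in EXPORT_DIR.glob("*.qcow2"):
        shutil.copy2(export, target / export.name)   # simple full copy, no dedupe

if __name__ == "__main__":
    nightly_backup()

A real backup design would add retention, verification and offsite copies, but even this level of use puts the otherwise superfluous device to meaningful work.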

In the same vein as backup storage, the external storage device could be repurposed as archival storage or another “lower tier” of storage where high availability is not warranted. This is a less common approach, generally because every business needs a good backup system but only some have a way to leverage an archival storage tier.

Beyond these two common and universal storage roles, another frequent use case for external storage devices, especially if the device is a NAS, is to leverage it in its native role as a file server separate from the virtualization infrastructure. For many businesses file serving is not as uptime-critical as the core virtualization infrastructure, and its backups are far easier to maintain and manage. Offloading file serving to an already purchased NAS reduces demands on the virtualization infrastructure in two ways: it reduces the number of VMs that need to run there, and it moves what is typically one of the largest consumers of storage to a separate device, lowering both the performance and the capacity requirements of the virtualization platform. By doing this we potentially reduce the cost of the additional hard drives needed for local storage on the compute nodes, as mentioned earlier, which makes this a very popular repurposing method for many companies.

Every company is unique and there are potentially many places where spare storage equipment could be put to effective use, from labs to archives to tiered storage. A little creativity and out-of-the-box thinking can match your unique set of available equipment to your business’ unique set of needs and find the best place to use this equipment, decoupled from the core, critical virtualization infrastructure but still bringing value to the organization. By avoiding the inverted pyramid of doom we obtain the maximum value from the equipment we have already invested in, rather than creating fresh technical debt that we must then work, unnecessarily, to overcome.

What Do I Do Now? Planning for Design Changes

Quite often I am faced with talking to people about their system designs, plans and architectures.  And many times that discussion happens too late and designs are either already implemented or they are partially implemented.  This can be very frustrating if the design in progress has been deemed to not be ideal for the situation.

I understand the feeling of frustration that comes from a situation like this, but it is something that we in IT must face on a very regular basis, and managing this reaction constructively is a key IT skill.  We must become masters of this situation both technically and emotionally.  We should not be crippled by it; it is a natural situation that every IT professional will experience regularly.  It should not be discouraging, though it is very understandable that it can feel that way.

One key reason that we experience this so often is that IT is a massive field with a great number of variables to consider in every situation.  It is also a highly creative field in which there can be numerous viable approaches to any given problem.  That there is even a single “best” option is rarely true.  Normally there are many competing options.  Sometimes these are very closely related; sometimes they are drastically different, making them very difficult to compare meaningfully.

Another key reason is that factors change.  New techniques or information come to light, new products are released, existing products are updated, prices change or business needs shift near to, or even during, the decision making and design processes.  This rate of change is not something that we, as IT professionals, can ever hope to control.  It is something that we must accept and deal with as best we can.

Another thing that I often see missed is that a solution that was ideal when chosen may not be ideal if the same decision were being made today.  This does not, in any way, constitute a deficiency in the original design, yet I have seen many people react as if it did.  The most common scenario where I see this behaviour is in the aversion to the use of RAID 5 in modern storage design, with RAID 6 and RAID 10 being the popular alternatives for good reason.  But this RAID 5 aversion, common since about 2009, did not always exist; from the middle of the 1990s until nearly the end of the 2000s RAID 5 was not only viable, it was very commonly the best solution for the given business and technical needs (the growth of the aversion was mostly gradual, not sudden.)  Many people understandably see RAID 5 as a poor option today but apply this new aversion to systems designed and implemented long ago, sometimes close to two decades ago.  This makes no sense and is purely an emotional reaction.  RAID 5 being the best choice for a scenario in 2002 in no way implies that it will still be the best choice in 2015.  Likewise, RAID 5 being a poor choice for a scenario in 2015 in no way belittles or negates the fact that it was very often a great choice several years ago.

I have been asked many times what to do once less than ideal design decisions have been made.  “What do I do now?”

Learning what to do when perfection is no longer an option (as if it ever really was, all IT is about compromises) is a very important skill.  The first things that we must tackle are the emotional problems as these will undermine everything else.  We must do our best to step back, accept the situation and act rationally.  The last thing that we want to do is take a non-ideal situation and make things worse by attempting to reverse justify bad decisions or panicking.

Accepting that no design is perfect, that there is no way to always get things completely right and that dealing with this is just part of working in IT is the first step.  Step back, breathe deep.  It isn’t that bad.  This is not a unique situation; every IT pro doing design goes through this all the time.  You should try your best to make the best decisions possible, but you must also accept that this can rarely be fully done – no one has access to enough resources to really do so.  We work with what we have.  So here we are.  What’s next?

Next is to assess the situation.  Where are we now?  In many cases the implementation is done and there is nothing more to do.  The situation is not ideal, but is it bad?  Very often the biggest problem that I see with an already implemented design is that it was too costly – typically the “better” solutions are not better because they are faster or more reliable but because they would have been cheaper, easier or faster to implement.  That is an unfortunate situation but hardly a crippling one.  Whatever time or money was spent must have been an acceptable amount at the time and must have been approved.  The best that we can do, right now, is learn from the decision process and attempt to avoid overspending in the future.  It does not mean that the existing solution will not work, or even work amazingly well.  It is simply that it may not have been the perfect choice given the business needs, primarily financial, involved.

There are situations where a design that has been implemented does not adequately meet the stated business requirements.  This is thankfully less common, in my experience, as it is a much more difficult situation.  In this case we need to make some modifications in order to fulfill our business needs.  This may prove to be expensive or complex.  But things may not be as bad as they seem.  Reactions to this are often misleading and the situation can frequently be salvaged.

The first step, once we are in a position where we have implemented a solution that fails to meet business needs, is to reassess those business needs.  This is not to imply that we should fudge the needs to massage them into whatever our system is able to fulfill, not at all.  But it is a good time to go back and see whether the originally stated needs are truly valid, whether they were simply not vetted well enough or, even more likely, whether the business needs changed during the time that the implementation took place.  It may be that the implemented solution does, in fact, meet the actual business needs even though they were originally misstated or have changed over time.  Or it might be that business needs have changed so dramatically that even perfect planning would have fallen short of the current needs, making the fact that the implemented solution does not perform as expected of minor consequence.  I have been very surprised just how often this verification of business needs has turned a solution believed to be inadequate into an “overkill” solution that actually cost more than necessary, simply because no one pushed back on overstatements of business needs or questioned the financial value of certain technology investments.

The second step is to create a new technology baseline.  This is a very important step in preventing IT from falling into the trap of the sunk cost fallacy.  It is extremely common for anyone, and this is not unique to IT in any way, to look at the time and money spent on a project and assume that continuing down the original path, no matter how foolish, is the way to go because so many resources have already been expended on that path.  But this makes no sense; how you got to your current state is irrelevant.  What is relevant is assessing the current needs of the department and the company and taking stock of the currently available solutions, technologies and resources.  Given the current state, the best course forward can be determined.  Any consideration given to the effort expended to get to the current state can only mislead.

A good example of the sunk cost fallacy is the game of chess.  With each move it is important to assess all available moves, risks and strategies anew, because the moves used to reach the current state have no bearing on what moves make sense going forward.  If the world’s greatest chess player or an amazing computer chess algorithm were brought in mid-game, they would not require any knowledge of how the current state came to be – they would simply assess the current state and create a strategy based upon it.

This is how we should behave in IT.  Our current state is our current state.  It does not matter for strategic planning what unfolded to get us into that state.  We only care about those decisions and costs when doing a post mortem to determine where decision making may have failed, so that we can learn from it.  Learning about ourselves and our processes is very important, but that is a very different task from doing strategic planning for the current initiative.

The unfortunate thing here is that we must begin our planning processes again, though this time with, we assume, more to work with.  But this cannot be avoided.  In the worst cases, budgets are no longer available and there are no resources to fix the flawed design and achieve the necessary business goals.  Compromises are sometimes necessary; making do with what we have is sometimes the best that we can do.  But in the vast majority of cases, it would seem, some combination of additional budget and creative reuse of existing products is adequate to remedy the situation.

Once we have reached a state in which we have addressed our shortfalls, whether by simply accepting that we overspent or under-delivered or by adjusting to meet the needs, we have an opportunity to go back and investigate our decision making processes.  It is by doing this that we hope to grow as individuals and, if at all possible, as an organization: to learn from our mistakes, or to determine whether there even were mistakes.  Every company and every individual makes mistakes.  What separates us is the ability to learn from them and avoid the same mistakes in the future.  Growth comes primarily from experiencing pain in this way, and while it is often unpleasant to face, it is here that we have the best opportunity to create real, lasting value.  Do not push off or skip this opportunity for review, whether it be a harsh personal review that you do yourself, a formal organizational review run by people trained to do so, or something in between.  The sooner the decision processes are evaluated, the fresher the memory will be and the sooner the course correction can take effect.

The final step is to begin the decision process for designing a replacement for the current implementation as soon as possible, once the review of the decision process is complete.  This does not necessarily mean that we should intend to spend money or change our designs in the near future.  Not at all.  But by being extremely proactive in design making we can attempt to avoid the problems of the past: we give ourselves additional time for planning, more time for requirements gathering and documentation, better insight into changes in requirements over time by regularly revisiting them to see whether they remain stable, more opportunity to get management and peer buy-in and investment in the decision, and a better understanding of the problem domain so that we are better equipped to alter the intended design, or know when to scrap it and start over, before implementing it the next time.  It could also give us a better chance of codifying organizational knowledge that can be passed on to a successor should you not be in the position of decision making or implementation when the next cycle comes around.

With good, rational processes and a solid understanding of the steps to take when systems design or implementation has been less than ideal, we can not only, in most cases, recover from missteps in the short term but also insulate the organization from the same mistakes in the future.