
Understanding Technical Debt

From Wikipedia: “Technical debt (also known as design debt or code debt) is a concept in programming that reflects the extra development work that arises when code that is easy to implement in the short run is used instead of applying the best overall solution.

Technical debt can be compared to monetary debt. If technical debt is not repaid, it can accumulate ‘interest’, making it harder to implement changes later on. Unaddressed technical debt increases software entropy. Technical debt is not necessarily a bad thing, and sometimes (e.g., as a proof-of-concept) technical debt is required to move projects forward. On the other hand, some experts claim that the “technical debt” metaphor tends to minimize the impact, which results in insufficient prioritization of the necessary work to correct it.”

The concept of technical debt comes from the software engineering world, but it applies just as much to the world of IT and business infrastructure.  As in software engineering, we design our systems and our networks, and taking shortcuts in those designs, which includes working with less than ideal designs, incorporating existing hardware and other poor design practices, produces technical debt.  One of the more significant forms of this comes from investing in the “past” rather than in the “future” and is quite often triggered by the sunk cost fallacy (a.k.a. throwing good money after bad).

It is easy to see this happening in businesses every day.  New plans are made for the future, but before they are implemented, investments are made in keeping an old system design working, making it work better, expanding it or the like.  This investment then either turns into a nearly immediate financial loss or, more often, becomes an incentive not to invest in the future designs as quickly, as thoroughly or, possibly, at all.  The investment in the past can become crippling in the worst cases.

This happens in numerous ways and is generally unintentional.  Often investments are needed to keep an existing system running properly and, under normal conditions, would simply be made.  But in a situation where a future change is needed or potentially planned, this investment can be problematic.  Better cost analysis and triage planning can remedy this in many cases, though.

In a non-technical example, imagine owning an older car that has served well but is due for retirement in three months.  In three months you plan to buy a new car because the old one is no longer cost effective due to continuous maintenance needs, lower efficiency and so forth.  But before your three month plan to buy a new car comes around, the old car suffers a minor failure and now requires a significant investment to keep it running.  Putting money into the old car would be a new investment in the technical debt.  Rather than spending a large amount of money to make an old car run for a few months, moving up the timetable to buy the new one is obviously drastically more financially sound.  With cars, we see this easily (in most cases).  We save money, potentially a lot of it, by buying the new car sooner.  If we were to invest heavily in the old one, we would either lose that investment in a few months or risk changing the solid financial plan for the purchase of a new car that was already made.  Both cases are bad financially.
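
To make the arithmetic concrete, here is a minimal sketch.  Every figure in it is a hypothetical assumption chosen purely for illustration, not a number from the example above:

```python
# Hypothetical figures for illustration only: repair the old car and keep it
# three more months versus moving the new-car purchase up to today.  The new
# car is bought in both paths; only the timing differs.
repair_cost = 1500          # assumed one-time repair to keep the old car running
old_monthly_upkeep = 200    # assumed extra fuel/maintenance cost of the old car per month
new_car_price = 20000       # assumed price of the new car (paid in either path)
months_early = 3            # how far the planned purchase would be moved up

# Path A: repair now, keep paying the old car's upkeep, buy new on the original schedule.
path_a = repair_cost + old_monthly_upkeep * months_early + new_car_price

# Path B: skip the repair and buy the new car today.
path_b = new_car_price

print(f"Repair and wait: ${path_a:,}")   # $22,100 under these assumptions
print(f"Buy new now:     ${path_b:,}")   # $20,000 under these assumptions
```

Under these assumed numbers, everything spent on the repair and the old car’s upkeep is money invested in the past and is lost almost immediately when the planned purchase happens anyway.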

IT works the same way.  Spending a large sum of money to maintain an old email system six months before a planned migration to a hosted email system would likely be very foolish.  The investment is either lost nearly immediately when the old system is decommissioned, or it undermines our good planning processes and leads us not to migrate as planned, doing a sub-par job for our businesses because we allowed technical debt, rather than proper planning, to drive our decision making.

Often a poor triage operation, or triage authority resting with the wrong players, is the factor that causes emergency technical debt investments rather than rapid, future-looking investments.  This is only one area where major improvements may address issues, but it is a major one.  This can also be mitigated, in some cases, through “what if” planning: having investment plans in place contingent on common or expected emergencies, which may be as simple as capacity expansion needs due to growth arising before systems planning comes into play.

Another great example of common technical debt is server storage capacity expansion.  This is a scenario that I see with some frequency and it demonstrates technical debt well.  It is common for a company to purchase servers that lack large internal storage capacity.  Either immediately or sometime down the road, more capacity is needed.  If this happens immediately, we can see that the server as purchased was a form of technical debt arising from improper design and obviously represents a flaw in the planning and purchasing process.

But a more common example is needing to expand storage two or three years after a server has been purchased.  Common expansion choices include adding an external storage array to attach to the server or modifying the server to accept more local storage.  Both of these approaches tend to be large investments in an already old server, a server that is easily forty percent or more of the way through its useful lifespan.  In many cases the same or only slightly higher investment in a completely new server can deliver new hardware, faster CPUs, more RAM, the needed storage purpose designed and built, an aligned and refreshed support lifespan, a smaller datacenter footprint, lower power consumption, newer technologies and features, better vendor relationships and more, all while retaining the original server to reuse, retire or resell.  One way spends money supporting the past; the other often spends comparable money on the future.
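
A rough way to frame that comparison is cost per remaining month of supported life rather than the sticker price alone.  The sketch below uses entirely hypothetical prices and lifespans, not vendor quotes:

```python
# Hypothetical figures for illustration only: expanding a three-year-old server
# versus replacing it, compared on cost per remaining month of useful life.
def cost_per_month(spend: float, months_of_life_remaining: int) -> float:
    """Spread a one-time spend over the months of useful life it actually buys."""
    return spend / months_of_life_remaining

# Path A: add an external storage shelf to a server with roughly 2 years of life left.
expand_old = cost_per_month(spend=8000, months_of_life_remaining=24)

# Path B: buy a new, purpose-sized server with a fresh ~5 year support lifespan.
replace = cost_per_month(spend=11000, months_of_life_remaining=60)

print(f"Expand old server: ${expand_old:,.0f} per useful month")  # ~$333/month
print(f"Replace server:    ${replace:,.0f} per useful month")     # ~$183/month
```

Under these assumed numbers the replacement costs somewhat more up front but far less per month of useful life, before even counting newer hardware, lower power draw and the option to reuse or resell the old box.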

Technical debt is a crippling factor for many businesses.  It increases the cost of IT, sometimes significantly, and can lead to high levels of risk through a lack of planning and spending that is largely emergency driven.

 

What Do I Do Now? Planning for Design Changes

Quite often I am faced with talking to people about their system designs, plans and architectures.  And many times that discussion happens too late, when designs are either already implemented or partially implemented.  This can be very frustrating if the design in progress has been deemed not to be ideal for the situation.

I understand the frustration that comes from a situation like this, but it is something that we in IT must face on a very regular basis, and managing this reaction constructively is a key IT skill.  We must become masters of this situation both technically and emotionally.  We should not be crippled by it; it is a natural situation that every IT professional will experience regularly.  It should not be discouraging, but it is very understandable that it can feel that way.

One key reason that we experience this so often is that IT is a massive field with a great number of variables to be considered in every situation.  It is also a highly creative field where there can be numerous, viable approaches to any given problem.  That there is even a single “best” option is rarely true.  Normally there are many competitive options.  Sometimes these are very closely related; sometimes they are drastically different, making them very difficult to compare meaningfully.

Another key reason is that factors change.  This could be that new techniques or information come to light, new products are released, products are updated, prices change or business needs change near to or even during the decision making and design processes.  This rate of change is not something that we, as IT professionals, can hope to ever control.  It is something that we must accept and deal with as best as we can.

Another thing that I often see missed is that a solution that was ideal when chosen may not be ideal if the same decision were being made today.  This does not, in any way, constitute a deficiency in the original design, yet I have seen many people react to it as if it did.  The most common scenario where I see people exhibit this behaviour is in the aversion to the use of RAID 5 in modern storage design, with RAID 6 and RAID 10 being the popular alternatives for good reason.  But this RAID 5 aversion, common since about 2009, did not always exist; from the middle of the 1990s until nearly the end of the 2000s RAID 5 was not only viable, it was very commonly the best solution for the given business and technical needs (the growth in aversion to it was mostly gradual, not sudden).  Yet many people who understandably see RAID 5 as a poor option today apply this new aversion to systems designed and implemented long ago, sometimes close to two decades ago.  This makes no sense and is purely an emotional reaction.  RAID 5 being the best choice for a scenario in 2002 in no way implies that it will still be the best choice in 2015.  But likewise, RAID 5 being a poor choice for a scenario in 2015 in no way belittles or negates the fact that it was very often a great choice several years ago.
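
Much of that shift tracks drive capacity growth against unrecoverable read error (URE) rates.  As a hedged illustration only, here is a small sketch of the rebuild math; the URE rate and drive sizes are assumptions chosen for the example, not figures from this article:

```python
# Illustrative only: chance of hitting at least one unrecoverable read error (URE)
# while rebuilding a degraded RAID 5 array, which requires reading every surviving
# drive in full.  Assumes a URE rate of about 1 per 10^14 bits read (a commonly
# cited spec for consumer SATA drives) and an independent-read model.
import math

URE_RATE_BITS = 1e14  # assumed: roughly one unrecoverable read error per 10^14 bits read

def rebuild_ure_probability(drive_tb: float, drives_read: int) -> float:
    """Approximate probability of at least one URE during a full RAID 5 rebuild."""
    bits_to_read = drive_tb * 1e12 * 8 * drives_read
    return 1 - math.exp(-bits_to_read / URE_RATE_BITS)

# Hypothetical 4-drive RAID 5 arrays with one failed drive, so three are read in full.
print(f"3 x 72 GB survivors (2002-era): {rebuild_ure_probability(0.072, 3):.1%}")  # ~1.7%
print(f"3 x 4 TB survivors (2015-era):  {rebuild_ure_probability(4.0, 3):.1%}")    # ~62%
```

Under those assumptions the same parity scheme that carried only a small rebuild risk on 72 GB drives becomes close to a coin flip on multi-terabyte drives, which is consistent with the gradual change in attitude described above.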

I have been asked many times what to do once less than ideal design decisions have been made.  “What do I do now?”

Learning what to do when perfection is no longer an option (as if it ever really was; all IT is about compromises) is a very important skill.  The first things that we must tackle are the emotional problems, as these will undermine everything else.  We must do our best to step back, accept the situation and act rationally.  The last thing that we want to do is take a non-ideal situation and make things worse by attempting to reverse justify bad decisions or by panicking.

Accepting that no design is perfect, that there is no way to always get things completely right and that dealing with this is just part of working IT is the first step.  Step back, breathe deep.  It isn’t that bad.  This is not a unique situation.  Every IT pro doing design goes through this all of the time.  You should try your best to make the best decisions possible but you must also accept that that can rarely be done – no one has access to enough resources to really be able to do that.  We work with what we have.  So here we are.  What’s next?

Next is to assess the situation.  Where are we now?  In many cases the implementation is done and there is nothing more to do.  The situation is not ideal, but is it bad?  Very often the biggest problem that I see people facing with an already implemented design is that it was too costly – typically the “better” solutions are not better because they are faster or more reliable but because they are cheaper, easier or faster to have implemented.  That’s an unfortunate situation but hardly a crippling one.  Whatever time or money was spent must have been an acceptable amount at the time and must have been approved.  The best that we can do, right now, is learn from the decision process and attempt to avoid the overspending in the future.  It does not mean that the existing solution will not work or even that it will not work amazingly well.  It is simply that it may not have been a perfect choice given the business needs, primarily financial, involved.

There are situations where a design that has been implemented does not adequately meet the stated business requirements.  This is thankfully less common, in my experience, as it is a much more difficult situation.  In this case we need to make some modifications in order to fulfill our business needs.  This may prove to be expensive or complex.  But things may not be as bad as they seem.  Often the initial reaction is misleading and the situation can be salvaged.

The first step, once we are in a position where we have implemented a solution that fails to meet business needs, is to reassess the business needs.  This is not to imply that we should fudge the needs to massage them into being whatever our system is able to fulfill, not at all.  But it is a good time to go back and see whether the originally stated needs are truly valid, whether they were simply not vetted well enough or, even more likely, whether the business needs have changed during the time that the implementation took place.  It may be that the implemented solution does, in fact, meet actual business needs, either because they were originally misstated or because the needs have changed over time.  Or it might be that business needs have changed so dramatically that even perfect planning would originally have fallen short of the existing needs, and the fact that the implemented solution does not perform as expected is of minor consequence.  I have been very surprised just how often this verification of business needs has turned a solution believed to be inadequate into an “overkill” solution that actually cost more than necessary, simply because no one pushed back on overstatements of business needs or questioned the financial value of certain technology investments.

The second step is to create a new technology baseline.  This is a very important step to help prevent IT from falling into the trap of the sunk cost fallacy.  It is extremely common for anyone, and this is not unique to IT in any way, to look at the time and money spent on a project and assume that continuing down the original path, no matter how foolish it is, is the way to go because so many resources have been expended on that path already.  But this makes no sense, of course; how you got to your current state is irrelevant.  What is relevant is assessing the current needs of the department and company and taking stock of the currently available solutions, technologies and resources.  Given the current state, the best course forward can be determined.  Any consideration given to the effort expended to get to the current state is only misleading.

A good example of the sunk cost fallacy is in the game of chess.  With each move it is important to assess all available moves, risks and strategies again, because the moves used to get to the current state have no bearing on what moves make sense going forward.  If the world’s greatest chess player or an amazing computer chess algorithm were to be brought in mid-game, they would not require any knowledge of how the current state had come to be – they would simply assess the current state and create a strategy based upon it.

This is how we should behave in IT.  Our current state is our current state.  It does not matter for strategic planning what unfolded to get us into that state.  We only care about those decisions and costs when doing a post mortem to determine where decision making may have failed, so that we can learn from it.  Learning about ourselves and our processes is very important.  But that is a very different task from doing strategic planning for the current initiative.

The unfortunate thing here is that we must begin our planning processes again, but this time with, we assume, more to work with.  This cannot be avoided.  In the worst cases, budgets are no longer available and there are no resources to fix the flawed design and achieve the necessary business goals.  Compromises sometimes are necessary.  Making do with what we have is sometimes the best that we can do.  But, in the vast majority of cases it would seem, some combination of additional budget or creative reuse of existing products can be adequate to remedy the situation.

Once we have reached a state in which we have addressed our shortfalls, whether simply by accepting that we have overspent or under-delivered or by adjusting to meet needs, we have an opportunity to go back and investigate our decision making processes.  It is by doing this that we hope to grow as individuals and, if at all possible, at an organizational level, to learn from our mistakes, or to determine whether there even were mistakes.  Every company and every individual makes mistakes.  What separates us is the ability to learn from them and avoid those same mistakes in the future.  Growth comes primarily from experiencing pain in this way and, while often unpleasant to face, it is here that we have the best opportunity to create real, lasting value.  Do not push off or skip this opportunity for review, whether it be a harsh, personal review that you do yourself, a formal, organizational review run by people trained to do so or something in between.  The sooner that the decision processes are evaluated, the fresher the memory will be and the sooner the course correction can take effect.

The final step that we can take is to begin the decision process to design a replacement for the current implementation as soon as possible, once the review of the decision process is complete.  This does not necessarily mean that we should intend to spend money or change our designs in the near future.  Not at all.  But by being extremely pro-active in decision making we can attempt to avoid the problems of the past by giving ourselves additional time for planning, more time for requirements gathering and documentation, better insight into changes in requirements over time by regularly revisiting those requirements to see whether they remain stable or are changing, more opportunity to get management and peer buy-in and investment in the decision, and a better understanding of the problem domain so that we are better equipped to alter the intended design, or know when to scrap it and start over, before implementing it the next time.  It could also, potentially, give us a better chance of codifying organizational knowledge that could be passed on to a successor, should you not be in the position of decision making or implementation when the next cycle comes around.

With good, rational processes and a good understanding of the steps that need to be taken in a case of less than ideal systems design or implementation, we can, in most cases, not only recover from missteps in the short term but also insulate the organization from the same mistakes in the future.