Tag Archives: vendor

You Can’t Virtualize That!

We hear this all the time in IT: a vendor tells us that a system cannot be virtualized.  The reasons given are numerous.  On the IT side, we are always shocked that a vendor would make such an outrageous claim; and often we are just as shocked that a customer (or manager) believes them.  Vendors have worked hard to perfect this sales pitch over the years, and I think that it is important to dissect it.

The root cause of the problem is that vendors are almost always seeking ways to lower their own costs while increasing profits from customers.  This drives a lot of what would otherwise be seen as odd behaviour.

One thing that many, many vendors attempt to do is limit the scenarios under which their product will be supported.  By doing this, they position themselves to simply not provide support – support is expensive and hard to deliver reliably.  This is a common strategy.  In some cases it is so aggressive that no acceptable, production deployment scenario even exists.

A very common means of doing this is to support only operating systems that are themselves no longer supported, de facto deprecating the vendor’s own software (today, for example, this would mean supporting only Windows XP and earlier).  Another example is supporting only products that are not licensed for the use case (such as requiring that a desktop product like Windows 10 be used as a server).  And one of the most common cases is forbidding virtualization.

These scenarios put customers into difficult positions: on one hand they have industry best practices, standard deployment guidelines, and in-house tooling and policies to adhere to; on the other hand they have vendors forbidding proper system design, planning and management.  These needs are at odds with one another.

Of course, no one expects every vendor to support every potential scenario.  Limits must be applied.  But there is a giant chasm between supporting reasonable, well deployed systems and actively requiring unacceptably bad deployments.  We hope that our vendors will behave as business partners and share a common interest in our success or, at the very least, the success of their product, and not directly seek to undermine both.  We would hope that, at a very minimum, best effort support would be provided for any reasonable deployment scenario and that guaranteed support would likely be offered for properly engineered, best practice scenarios.

Imagine a world where driving the speed limit and wearing a seatbelt would violate your car warranty and that you would only get support if you drove recklessly and unprotected!

Some important things need to be understood about virtualization.  The first is that virtualization is a long-standing industry best practice and is expected in any production deployment of services.  Virtualization is in no way new; even in the small business market it has been in the best practice category for well over a decade, and for many decades in the enterprise space.  We are long past the point where running systems non-virtualized is considered acceptable, and that includes legacy deployments that have been in place for a long time.

There are, of course, always rare exceptions to nearly any rule.  Some systems need access to very special case hardware and virtualization may not be possible, although with modern hardware passthrough this is almost unheard of today.  And some ultra low latency systems cannot be virtualized, but these are normally limited to the biggest international investment banks and most aggressive hedge funds, and even the majority of those traditional use cases have been eliminated by improvements in virtualization, making such situations rare.  But the bottom line is: if you can’t virtualize, you should be sad that you cannot, and you will know clearly why it is impossible in your situation.  In all other cases, your server needs to be virtual.

Is It Not Important?

If a vendor does not allow you to follow standard best practices for healthy deployments, what does this say about the vendor’s opinion of their own product?  If we were talking about any other deployment, we would immediately question why we were deploying a system so poorly if we plan to depend on it.  If our vendor forces us to behave this way, we should react in the same manner – if the vendor doesn’t take their product as seriously as we take the least of our IT services, why should we?

This is an “impedance mismatch”, as we say in engineering circles, between our needs (production systems) and how the vendor making that system appears to treat them (hobby or entertainment systems).  If we need to depend on this product for our businesses, we need a vendor that is on board and understands business needs – one with a production mindset.  If the product is not business targeted or business ready, we need to be aware of that.  We need to question why we feel we should be using a service in production, on which we depend and require support, that is not intended to be used in that manner.

Is It Supported?  Is It Being Tested?

Something that is often overlooked by customers is whether the necessary support resources for a product are even in place.  It’s not uncommon for the team that supports a product to become lean, or even disappear, while the company keeps selling the product in the hope of milking it for as much as it can, banking on either muddling through problems or simply refunding customer funds should the vendor be caught in a situation where it is simply unable to provide support.

Most software contracts state that the maximum damages that can be extracted from the vendor are limited to the cost of the product, or the amount spent to purchase it.  In a case such as this, the vendor takes on no risk from offering a product that it cannot support – even if charging a premium for support.  If the customer manages to use the product, great – the vendor gets paid.  If the customer cannot and the vendor cannot support it, the vendor only loses money that it would never have gotten otherwise.  The customer takes on all the risk, not the vendor.

This suggests, of course, that there is little or no continuing testing of the product either, and this should be of additional concern.  Just because the product runs does not mean that it will continue to run.  Getting up and running with an unsupported, or worse, unsupportable, product means that you are depending more and more over time on a product with a likely decreasing level of potential support, slowly getting worse even as the need for support and the dependency on the software would be expected to increase.

If a proprietary product is deployed in production, and the decision is made to forgo best practice deployments in order to accommodate support demands, how can this fit in a decision matrix? Should this imply that proper support does not exist? Again, as before, this implies a mismatch in our needs.

 

Is It Still Being Developed?

If the deployment requirements of the software follow old, out-of-date practices, or require software or designs that are not reasonably current, then we have to question whether the product is still being actively developed.  In some cases we can determine this by watching the software release cycle for some time, but not always.  There is a reasonable fear that the product may be dead, with no remaining development team working on it.  The code may simply be old technical debt being sold in the hope of making a few last dollars off an abandoned code base.  This is actually far more common than is often believed.

Smaller software shops often manage to develop an initial software package and get it on the market and available for sale, but cannot afford to retain or restaff their development team after the initial release(s).  This is, in fact, a very common scenario.  It leaves customers with a product that can be expected to become less and less viable over time, with deployment scenarios becoming increasingly risky and data increasingly hard to extricate.

 

How Can It Be Supported If the Platform Is Not Supported?

A common paradox of some more extreme situations is software that, in order to qualify as “supported”, requires other software that is either out of support or was never supported for the intended use case.  Common examples are requiring that a server system run on top of a desktop operating system, or requiring versions of operating systems, databases or other components that are no longer supported at all.  This last scenario is scarily common.  In a situation like this, one has to ask if there can ever be a deployment where the software can be considered “supported”.  If part of the stack is always out of support, then the whole stack is unsupported; there would always be a reason that support could be denied, no matter what.  The very reasoning that would demand we avoid best practices would equally rule out choosing the software itself in the first place.

Are Industry Skills and Knowledge Lacking?

Perhaps the issue we face with software support problems of this nature is that the team(s) creating the software simply do not know how good software is made and/or how good systems are deployed.  This is among the most reasonable and understandable explanations for what drives us to this situation.  But, like the other hypotheses, it leaves us concerned about the quality of the software and whether support is truly available.  If we can’t trust the vendor to properly handle the most visible parts of the system, why would we turn to them as our experts for the parts that we cannot verify?

The Big Problem

The big, overarching problem with software that demands questionable deployment and maintenance practices in exchange for unlocking otherwise withheld support is not, as we typically assume, a question of overall software quality, but one of viable support and development practices.  That these issues suggest a significant concern for long term support should make us strongly question why we are choosing these packages in the first place while expecting strong support from them when, from the onset, we have very visible and very serious concerns.

There are, of course, cases where no other software products exist to fill a need, or none with any more reasonable viability.  This situation should be extremely rare and, where it exists, should be seen as a major market opportunity for a vendor looking to enter that particular space.

From a business perspective, it is imperative that technical infrastructure best practices not be ignored in exchange for blind, or nearly blind, adherence to vendor requirements that, in any other instance, would be considered reckless or unprofessional.  Why do we so often neglect to require excellence from core products on which our businesses depend in this way?  It puts our businesses at risk, not just from the action itself, but vastly more so from the risks implied by the existence of such a requirement.

Buyers and Sellers Agents in IT

When dealing with real estate purchases, we have discrete roles defined legally as to when a real estate agent represents the seller or when they represent the buyer.  Each party gets clear documentation as to how they are being represented.  In both cases, the agent is bound by honesty and ethical limitations, but beyond that their obligations are to their represented party.

Outside of the real estate world, most of us do not deal with buyer’s agents very often.  Seller’s agents, on the other hand, are everywhere; we just call them salespeople.  We deal with them at many stores and they are especially evident when we go to buy something large, like a car.

In business, buyer’s agents are actually pretty common and actually come in some interesting and unspoken forms.  Rarely does anyone actually talk about buyer’s agents in business terms, mostly because we are not talking about buying objects but about buying solutions, services or designs.  Identifying buyer’s and seller’s agents alone can become confusing and, often, companies may not even recognize when a transaction of this nature is taking place.

We mostly see the engagement of sellers – they are the vendors with products and services that they want us to purchase.  We can pretty readily identify the seller’s agents involved.  These include primarily the staff of the vendor itself and the salespeople of the resellers – “salespeople” here including pre-sales engineering and any “technical” resource compensated by means of the sale rather than explicitly engaged and remunerated to represent your own interests; “resellers” being a blanket term for any company compensated for selling products, services or ideas that it does not itself produce, which commonly includes value added resellers and stores.  The seller’s side is easy.  Are they making money by somehow getting me to buy something?  If so… seller’s agent.

Buyer’s agents are more difficult to recognize – so much so that it is common for businesses to forget to engage them, overlook them or confuse seller’s agents for them.  Sadly, outside of real estate, the strict codes of conduct and legal oversight do not exist, and ensuring that a seller’s agent is not mistakenly engaged where a buyer’s agent should be is purely up to the organization engaging said parties.

Buyer’s agents come in many forms but the most common, yet hardest to recognize, is the IT department or staff itself.  This may seem like a strange thought, but the IT department acts as a technical representative of the business and, because it is not the business itself directly, an emotional stopgap that can reduce the effects of marketing and sales tactics while helping to ensure that technical needs are met.  The IT team is the most important buyer’s agent in the IT supply chain and the last line of defense for companies to ensure that they are engaging well and getting the services, products and advice that they need.

Commonly, IT departments will engage consulting services to aid in decision making.  The paid consulting firm is the most identifiable buyer’s agent in the process and the one most often skipped (or a seller’s agent is mistaken for the consultant).  A consultant is hired by, paid by and has an ethical responsibility to represent the buyer.  Consultants have an additional air gap that helps to separate them from the emotional responses common to the business itself.  The business and its internal IT staff are easily motivated by having “cool solutions” or expensive “toys”, or can easily be made to panic through good marketing, but consultants have many advantages.

Consultants have the advantage that they are often specialists in the area in question or at least spend their time dealing with many vendors, resellers, products, ideas and customer needs.  They can more easily take a broad view of needs and bring a different type of experience to the decision table.

Consultants are not the ones who, at the end of the day, get to “own” the products, services or solutions in question and are generally judged on their ability to aid the business effectively.  Because of this they have a distinct advantage in being more emotionally distant and therefore more objective in deciding on recommendations.  The coolest, newest solutions have little effect on them while cost effectiveness and business viability do.  More importantly, consultants and internal IT working together provide an important balancing of biases, experience and business understandings that combine the broad experience across many vendors and customers of the one, and the deep understanding of the individual business of the other.

One can actually think of the Buyer’s and Seller’s Agent system as a “stack”.  When a business needs to acquire new services or products, or to get advice, the ideal and full stack would look something like this: Business > IT Department > ITSP/Consultants <> Value Added Reseller < Distributor < Vendor.  The <> denotes the reflection point between the buyer’s side and the seller’s side.  Of course, many transactions will not and should not involve the entire stack.  But this visualization can be effective in understanding how these pieces are “designed” to interface with each other.  The business should ideally get its final options from IT (IT can be outsourced, of course), IT should interface through an ITSP consultant in many cases, and so forth.  An important part of the process is keeping actors on the left side of the stack (or the bottom) from having direct contact with those high up in the stack (or on the right), because this can short-circuit the protections that the system provides, allowing vendors or sales staff to influence the business without the buyer’s agents being able to vet the information.

Identifying, understanding and leveraging the buyer’s and seller’s agent system is important to getting good, solid advice and sales for any business and is widely applicable far outside of IT.

Avoiding Local Service Providers

Inflammatory article titles aside, the idea of choosing a technology service provider based, wholly or partially, on the fact that they are located geographically near you is almost always a very bad idea.  Knowledge based services are difficult enough to find at all, let alone finding the best potential skills, experience and price while introducing artificial and unnecessary constraints that limit the field of potential candidates.

With the rare exception of major global market cities like New York City and London, it is nearly impossible to find a full range of Information Technology skills in a single locality, at least not in conjunction with a great degree of experience and breadth.  This is true of nearly all highly technical industries – expertise tends to concentrate in a handful of localities around the world, and the remaining skills are scattered in a rather unpredictable manner, often because the people in highest demand can command the salary and location they desire and live where they want to, not where they have to.

IT, more than nearly any other field, has little value in being geographically near to the business that it is supporting.  Enterprise IT departments, even when located locally to their associated businesses and working in an office on premises are often kept isolated in different buildings away from both the businesses that they are supporting and the physical systems on which they work.  It is actually very rare that enterprise server admins would physically ever see their servers or network admins see their switches and routers.  This becomes even less likely when we start talking about roles like database administrators, software developers and others who have even less association with devices that have any physical component.

Adding in a local limitation when looking for consulting talent (and in many cases even internal IT staff) adds an artificial constraint that eliminates nearly the entire possible field of talented people while encouraging people to work on site even for work for which it makes no sense.  Often working on site causes a large increase in cost and loss of productivity due to interruptions, lack of resources, poor work environment, travel or similar.  Working with exclusively or predominantly remote resources encourages a healthy investment in efficient working conditions that generally pay off very well.  But it is important to keep in mind that just because a service company is remote does not imply that the work that they will do will be remote.  In many cases this will make sense, but in others it will not.

Location agnostic workers have many advantages.  By not being tied to a specific location you get far more flexibility as to skill level (allowing you to pursue the absolute best people) or cost (allowing you to hire people living in low cost areas), or can simply offer flexibility as an incentive, or get broader skill sets, larger staff, etc.  Choosing purely local services simply limits you in many ways.

Companies that are not based locally are not necessarily unable to provide local resources.  Many companies work with local resources, either local companies or individuals, to allow them to have a local presence.  In many cases this is simply what we call local “hands” and is analogous to how most enterprises work internally with centrally or remotely based IT staff and physical “hands” existing only at locations with physical equipment to be serviced.  In cases where specific expertise needs to be located with physical equipment or people it is common for companies to either staff locally in cases where the resource is needed on a very regular basis or to have specific resources travel to the location when needed.  These techniques are generally far more effective than attempting to hire firms with the needed staff already coincidentally located in the best location.  This can easily be more cost effective than working with a full staff that is already local.

As time marches forward, needs change as well.  Companies that work local-only can find themselves facing new challenges when they expand to other regions or locations.  Do they choose vendors and partners only where they were originally located?  Or where they are moving or expanding to?  Do they choose local for each location separately?  The idea of working with local resources only is nearly exclusive to the smallest of businesses.  Typically, as businesses grow, the concept of local begins to change in interesting ways.

Locality and jurisdiction may represent different things.  In many cases it may be necessary to work with businesses located in the same state or country as your business for legal or financial logistical reasons, and this can often make sense.  Small companies especially may not be prepared to tackle the complexities of working with a foreign firm.  Larger companies may find these boundaries worth ignoring as well.  But the idea that location should be ignored should not be taken to mean that jurisdiction, by extension, should also be ignored.  Jurisdiction still plays a significant role – one that some IT service providers or other vendors may be able to navigate on your behalf, allowing you to work with a vendor within your jurisdiction while getting the benefit of support from another.

As with many artificial constraint situations, not only do we generally eliminate the most ideal vendor candidates, but we also risk “informing” the remaining vendor candidate pool that we care more about locality than quality of service or other important factors.  This can lead to a situation where the vendor, especially in a smaller market, feels that they have a lock-in on you as the customer and do not need to perform to a market standard level, price competitively (as there is no true competition given the constraints), or worse.  A vendor that feels it has a trapped customer is unlikely to perform as a good vendor long term.

Of course we don’t want to avoid companies simply because they are local to our own businesses, but we should not be giving undue preference to companies for this reason either.  Some work has advantages to being done in person, there is no denying this.  But we must be careful not to extend this to rules and needs that do not have this advantage nor should we confuse the location of a vendor with the location(s) where they do or are willing to do business.

In extreme cases, all IT work can, in theory, be done completely remotely and only bench work (the physical remote hands) aspects of IT need an on premises presence.  This is extreme and of course there are reasons to have IT on site.  Working with a vendor to determine how best service can be provided, whether locally, remotely or a combination of the two can be very beneficial.

In a broader context, the most important concept here is to avoid adding artificial or unnecessary constraints to the vendor selection process.  Assuming that a local vendor will be able or willing to deliver value that a non-local vendor cannot is just one way that we might bring assumption or prejudice to a process such as this.  There is every possibility that the local company will do the best possible job and be the best, most viable vendor long term – but the chances are far higher that you will find the right partner for your business elsewhere.  It’s a big world, and in IT more than nearly any other field it is becoming a large, flat playing field.

Explaining the Lack of Large Scale Studies in IT

IT practitioners ask for these every day and yet none exist – large scale risk and performance studies for IT hardware and software.  This covers a wide array of possibilities, but common examples are failure rate comparisons between different server models, hard drives, operating systems, RAID array types, desktops, laptops, you name it.  And yet, regardless of the high demand for such data, there is none available.  How can this be?

Not all cases are the same, of course, but by and large there are three significant factors that keep this type of data from entering the field: the high cost of conducting a study, the long time scale necessary, and a lack of incentive to produce and/or share the data with other companies.

Cost is by far the largest factor.  If the cost of large scale studies could be overcome, all other factors could have solutions found for them.  But sadly the nature of a large scale study is that it will be costly.  As an example we can look at server reliability rates.

In order to determine failure rates on a server we need a large number of servers from which to collect data.  This may seem like an extreme example, but server failure rates are among the most commonly requested large scale study figures, so the example is an important one.  We would need perhaps a few hundred servers for a very small study, but to get statistically significant data we would likely need thousands.  If we assume that a single server costs five thousand dollars, which would be a relatively entry level server, we are looking at easily twenty five million dollars of equipment!  And that is just enough to do a somewhat small scale test (just five thousand servers) of a rather low cost device.  If we were to talk about enterprise servers we would easily jump to thirty or even fifty thousand dollars per server, taking the cost to as much as a quarter of a billion dollars.
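The arithmetic here can be sketched out in a few lines; the fleet size and unit prices are the illustrative assumptions used above, not real vendor quotes:

```python
# Back-of-the-envelope hardware cost for a server reliability study.
# The fleet size and unit prices are illustrative assumptions from the
# article, not actual pricing.

def study_hardware_cost(server_count: int, unit_price: int) -> int:
    """Total purchase cost of the study fleet, hardware only."""
    return server_count * unit_price

entry_level = study_hardware_cost(5_000, 5_000)    # 5,000 entry-level servers
enterprise = study_hardware_cost(5_000, 50_000)    # 5,000 enterprise servers

print(f"Entry-level fleet: ${entry_level:,}")   # $25,000,000
print(f"Enterprise fleet:  ${enterprise:,}")    # $250,000,000
```

And this is hardware only, for a single configuration of a single model; facilities, sensors and labor all come on top.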

Now that cost, of course, is for testing a single configuration of a single model server.  Presumably for a study to be meaningful we would need many different models of servers.  Perhaps several from each vendor to compare different lines and features.  Perhaps many different vendors.  It is easy to see how quickly the cost of a study becomes impossibly large.

This is just the beginning of the cost, however.  A good study requires carefully controlled environments on par with the best datacenters to isolate environmental issues as much as possible.  This means highly reliable power, cooling, airflow, humidity control, and vibration and dust control.  Facilities like this are very expensive – which is why many companies do not pay for them, even for valuable production workloads.  In a large study this cost could easily exceed the cost of the equipment itself over the course of the study.

Then, of course, we must address the need for special sensors and testing.  What exactly constitutes a failure?  Even in production systems there is often dispute on this.  Is a hard drive failing in an array a failure, even if the array does not fail?  Is a predictive failure a failure?  If dealing with drive failure in a study, how do you factor in human components, such as drive replacements that may not be done in a uniform way?  There are ways to handle all of this, but they add complication and skew the results away from real world data toward contrived study data.  Establishing study guidelines that are applicable and useful to end users is much harder than it seems.

And the biggest cost: manual labor.  Maintaining an environment for a large study takes human capital that may equal the cost of the study itself.  It takes a large number of people to maintain a study environment, run the study, monitor it and collect the data.  All in all, the costs are, quite simply, generally impossible to bear.

Of course we could greatly scale back the test, run only a handful of servers and only two or three models, but the value of the test rapidly drops and risks ending up with results that no one can use while still having spent a large sum of money.

The second insurmountable problem is time.  Most things need to be tested for failure rates over time, and as IT equipment is generally designed to work reliably for decades, collecting data on failure rates requires many years.  Mean Time to Failure numbers are only so valuable; Mean Time Between Failures, along with failure types, modes and statistics on those failures, is very important for a study to be useful.  What this means is that for a study to be truly useful it must run for a very long time, creating greater and greater cost.

But that is not the biggest problem.  The far larger issue is that by the time a study has run long enough to generate useful failure numbers, even if those numbers were coming out “live” as they happened, it would already be too late.  The equipment in question would already be aging and nearing replacement time in the production marketplace by the time the study produced truly useful early results.  Production equipment is often purchased for only a three to five year total lifespan; getting results even one year into this span would have little value.  And new products may replace those in the study even faster than the products age naturally, making the study valuable only in a historic context, without any use in a production decision role – the results would be too old to be useful by the time they were available.

The final major factor is a lack of incentive to provide existing data to those who need it.  Few sources of data exist, and nearly all are incomplete, existing so that large vendors can measure their own equipment quality and failure rates.  These are rarely collected in controlled environments and often involve data gathered from the field.  In many cases this data may even be private to customers and not legally shareable regardless.

But vendors who collect data do not collect it in an even, monitored way, so sharing that data could be very detrimental to them because there is no assurance that equal data from their competitors would exist.  Uncontrolled statistics like these would offer no true benefit to the market, nor to the vendors who hold them, so vendors are heavily incentivized to keep such data under tight wraps.

The rare exceptions are hardware studies from vendors such as Google and Backblaze, which run large numbers of consumer class hard drives in relatively controlled environments and collect failure rates for their own purposes.  They face little or no risk from competitors leveraging that data but do gain public relations value from releasing it, and so will occasionally publish a hardware reliability study on a limited scale.  These studies are hungrily devoured by the industry even though they generally contain relatively little value: their data is old and collected under unknown conditions and thresholds, and often is not statistically meaningful for product comparison.  At best, they contain general industry wide statistical trends that are somewhat useful for predicting future reliability paths.

Most other companies large enough to have internal reliability statistics have them on a narrow range of equipment and consider that information to be proprietary, a potential risk if divulged (it would give out important details of architectural implementations) and a competitive advantage.  So for these reasons they are not shared.

I have actually been fortunate enough to have been involved in and run a large scale storage reliability study, conducted somewhat informally but very valuably, on over ten thousand enterprise servers over eight years – eighty thousand server years of observation, a rare opportunity.  What that study primarily showed is that even on a set so large we were unable to observe a single failure!  The lack of failures was, itself, very valuable.  But we were unable to produce any standard statistic like Mean Time to Failure.  To produce the kind of data that people expect, we know that we would have needed hundreds of thousands of server years, at a minimum, to get any kind of statistical significance, and we cannot reliably state that even that would have been enough.  Perhaps millions of server years would have been necessary.  There is no way to truly know.
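The statistical dead end described here can be illustrated with the standard zero-failure confidence bound (the “rule of three”): when no failures are observed, the best available statistic is an upper bound on the failure rate, not a Mean Time to Failure.  This sketch reuses the eighty thousand server-year figure from the study purely as an illustration; it is not the actual analysis that was performed.

```python
import math

def failure_rate_upper_bound(unit_years: float, confidence: float = 0.95) -> float:
    """Upper confidence bound on the annual failure rate when zero
    failures are observed across `unit_years` of exposure.  Exact form
    of the 'rule of three': rate <= -ln(1 - confidence) / exposure."""
    return -math.log(1.0 - confidence) / unit_years

exposure = 10_000 * 8          # 10,000 servers observed for 8 years
bound = failure_rate_upper_bound(exposure)

print(f"95% upper bound on annual failure rate: {bound:.2e}")
print(f"Implied MTBF lower bound: {1 / bound:,.0f} server-years")
```

The bound only tightens linearly with exposure, which is why even eighty thousand server years cannot yield a meaningful failure statistic when failures are this rare.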

Where this leaves us is that large scale studies in IT simply do not exist and likely never will.  Where they do appear, they will be isolated and almost certainly crippled by the necessities of reality.  There is no means of monetizing studies on the scale necessary to be useful – mostly because failure rates of enterprise gear are so low while the equipment is so expensive – so third party firms can never cover the cost of providing this research.  As an industry we must accept that this type of data does not exist and actively pursue alternatives.  It is surprising that so many people in the field expect this type of data to be available when it never has been historically.

Our only real options, considering this vacuum, are to collect what anecdotal evidence exists (a very dangerous thing to do which requires careful consideration of context) and the application of logic to assess reliability approaches and techniques.  This is a broad situation where observation necessarily fails us and only logic and intuition can be used to fill the resulting gap in knowledge.