IT practitioners ask for these every day, and yet none exist: large scale risk and performance studies of IT hardware and software. The possibilities cover a wide array, but common examples are failure rate comparisons between server models, hard drives, operating systems, RAID array types, desktops, laptops, you name it. Despite the high demand for such data, none is available. How can this be?
Not all cases are the same, of course, but by and large three significant factors keep this type of data from reaching the field: the high cost of conducting a study, the long time scale a study requires, and the lack of incentive to produce or share the data with other companies.
Cost is by far the largest factor. If the cost of large scale studies could be overcome, solutions could be found for the other factors. But sadly the nature of a large scale study is that it will be costly. As an example we can look at server reliability rates.
In order to determine failure rates for a server we need a large number of servers from which to collect data. This may seem like an extreme example, but server failure rates are among the most commonly requested large scale study figures, so the example is an important one. A few hundred servers might suffice for a very small study, but to get statistically significant data we would likely need thousands. If we assume that a single server costs five thousand dollars, a relatively entry level machine, then five thousand servers represent easily twenty five million dollars of equipment! And that is only enough for a somewhat small scale test of a rather low cost device. If we were talking about enterprise servers we would easily jump to thirty or even fifty thousand dollars per server, pushing the cost toward a quarter of a billion dollars.
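To make the scale concrete, here is a minimal back-of-the-envelope sketch, in Python, of the equipment-only cost. The server counts and unit prices are simply the illustrative figures used above, not real quotes, and the calculation ignores facilities, power and labor entirely.

```python
# Rough equipment-only cost of a server reliability study.
# Counts and unit prices are illustrative assumptions, not vendor pricing.

def study_equipment_cost(server_count: int, unit_price: int) -> int:
    """Equipment cost before facilities, power, or labor."""
    return server_count * unit_price

entry_level = study_equipment_cost(5_000, 5_000)    # ~$25 million
enterprise  = study_equipment_cost(5_000, 50_000)   # ~$250 million

print(f"Entry level servers: ${entry_level:,}")
print(f"Enterprise servers:  ${enterprise:,}")
```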
Now that cost, of course, is for testing a single configuration of a single model server. Presumably for a study to be meaningful we would need many different models of servers. Perhaps several from each vendor to compare different lines and features. Perhaps many different vendors. It is easy to see how quickly the cost of a study becomes impossibly large.
This is just the beginning of the cost, however. A good study requires carefully controlled environments on par with the best datacenters in order to isolate environmental issues as much as possible. This means highly reliable power, cooling, airflow, humidity control, and vibration and dust control. Facilities of this quality are very expensive, which is why many companies do not pay for them even for valuable production workloads. Over the course of a large study this cost could easily exceed the cost of the equipment itself.
Then, of course, we must address the need for special sensors and testing. What exactly constitutes a failure? Even in production systems there is often dispute about this. Is a hard drive failing in an array a failure, even if the array does not fail? Is a predictive failure a failure? If dealing with drive failure in a study, how do you factor in human components such as drive replacement, which may not be done in a uniform way? There are ways to handle all of this, but they add complication and skew the results away from real world data toward data contrived for the study. Establishing study guidelines that are applicable and useful to end users is much harder than it seems.
And then there is the biggest cost: manual labor. Maintaining an environment for a large study requires human capital that may rival the cost of the rest of the study. It takes a large number of people to maintain the study environment, run the study, monitor it and collect the data. All in all, the costs generally make such a study simply impossible.
Of course we could greatly scale back the test, running only a handful of servers and only two or three models, but the value of the test rapidly drops and we risk ending up with results that no one can use while still having spent a large sum of money.
The second insurmountable problem is time. Most things need to be tested for failure rates over time, and since IT equipment is generally designed to work reliably for decades, collecting data on failure rates requires many years. Mean Time to Failure numbers are only so valuable on their own; Mean Time Between Failures, along with failure types, modes and the statistics around those failures, is what makes a study useful. What this means is that a truly useful study must run for a very long time, creating greater and greater cost.
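For a rough sense of why so much time is needed, the sketch below computes the expected number of failures a study would actually observe. The fleet size and annualized failure rates (AFRs) are hypothetical values chosen purely to illustrate the arithmetic: when events are this rare, only long durations produce enough of them to say anything about failure modes or compare models.

```python
# Expected failures a study would observe, for a range of assumed
# annualized failure rates. All AFR values here are hypothetical.

fleet_size = 5_000  # servers in the study

for afr in (0.001, 0.0001):          # 0.1% and 0.01% per server per year
    for years in (1, 5, 10):
        expected = fleet_size * afr * years
        print(f"AFR {afr:.2%}, {years:2d} yr: ~{expected:.1f} expected failures")
```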
But that is not the biggest problem. The far larger issue is that by the time a study has run long enough to generate useful failure numbers, even if those numbers were coming out “live” as they happened, it would already be too late. The equipment in question would already be aging and nearing replacement in the production marketplace by the time the study produced truly useful early results. Production equipment is often purchased for only a three to five year lifespan; getting results even one year into that span would have little value. And new products may replace those in the study even more rapidly than the products age naturally, making the study valuable only as a historical record with no use in production decision making: the results would be too old to be useful by the time they were available.
The final major factor is a lack of incentive to provide existing data to those who need it. A few sources of data do exist, but nearly all are incomplete and were built by large vendors to measure their own equipment quality and failure rates. These are rarely collected in controlled environments and often consist of data gathered from the field. In many cases this data may even be private to customers and cannot legally be shared regardless.
But vendors who collect data do not collect it in a uniform, monitored way, so sharing that data could be very detrimental to them because there is no assurance that comparable data from their competitors would exist. Uncontrolled statistics like these would offer no true benefit to the market nor to the vendors who hold them, so vendors are heavily incentivized to keep such data under tight wraps.
The rare exceptions are hardware studies from companies such as Google and Backblaze, which operate large numbers of consumer class hard drives in relatively controlled environments and collect failure rates for their own purposes. They have little or no risk of their competitors leveraging that data, but do gain public relations value from publishing it, and so will occasionally release a hardware reliability study on a limited scale. These studies are hungrily devoured by the industry even though they generally contain relatively little value: the data is old, gathered under unknown conditions and failure thresholds, and often not statistically meaningful for product comparison. At best they show general industry wide statistical trends that are somewhat useful for predicting future reliability.
Most other companies large enough to have internal reliability statistics have them only for a narrow range of equipment and consider that information to be proprietary, a potential risk if divulged (it would reveal important details of their architectural implementations) and a competitive advantage. For these reasons it is not shared.
I have actually been fortunate enough to have been involved in, and to have run, a large scale storage reliability study that was conducted somewhat informally but very valuably on over ten thousand enterprise servers over eight years, resulting in eighty thousand server years of observation, a rare opportunity. What that study concluded, however, was that while it was extremely valuable, what it primarily showed was that even on a set so large we were unable to observe a single failure! The lack of failures was, itself, very valuable. But we were unable to produce any standard statistic such as Mean Time to Failure. To produce the kind of data that people expect we know that we would have needed hundreds of thousands of server years, at a minimum, to reach any kind of statistical significance, and we cannot reliably state that even that would have been enough. Perhaps millions of server years would have been necessary. There is no way to truly know.
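One way to see why even eighty thousand failure-free server years cannot yield an MTTF figure is the statistical “rule of three”: if you assume independent, exponentially distributed failures (an assumption of mine, not part of the original study), observing zero failures in N unit-years only lets you claim, with roughly 95% confidence, that the failure rate is below about 3/N. The sketch below simply applies that rule to the figures above; it bounds the rate but estimates nothing.

```python
# Rule of three: with zero observed failures in N unit-years, the 95%
# upper confidence bound on the failure rate is roughly 3 / N
# (assuming independent, exponentially distributed failures).

server_years = 80_000
upper_rate = 3 / server_years   # failures per server-year, upper bound
lower_mttf = 1 / upper_rate     # implied lower bound on MTTF, in years

print(f"95% upper bound on failure rate: {upper_rate:.1e} per server-year")
print(f"Implied MTTF lower bound:        {lower_mttf:,.0f} years")
```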
Where this leaves us is that large scale studies in IT simply do not exist and likely never will. Those that do appear will be isolated and almost certainly crippled by the necessities of reality. There is no means of monetizing studies on the scale necessary to be useful, mostly because failure rates of enterprise gear are so low while the equipment is so expensive, so third party firms can never cover the cost of providing this research. As an industry we must accept that this type of data does not exist and actively pursue alternatives to it. It is surprising that so many people in the field expect this type of data to be available when it never has been historically.
Our only real options, given this vacuum, are to collect what anecdotal evidence exists (a dangerous thing to do, requiring careful consideration of context) and to apply logic in assessing reliability approaches and techniques. This is a broad situation in which observation necessarily fails us and only logic and intuition can fill the resulting gap in knowledge.