Explaining the Lack of Large Scale Studies in IT

IT practitioners ask for these every day, and yet none exist: large scale risk and performance studies for IT hardware and software.  This covers a wide array of possibilities, but common examples are failure rates for different server models, hard drives, operating systems, RAID array types, desktops, laptops, you name it.  Despite the high demand for such data, none is available.  How can this be?

Not all cases are the same, of course, but by and large there are three significant factors that keep this type of data from entering the field: the high cost of conducting a study, the long time scale necessary for a study, and a lack of incentive to produce and/or share this data with other companies.

Cost is by far the largest factor.  If the cost of large scale studies could be overcome, solutions could be found for all of the other factors.  But sadly the nature of a large scale study is that it will be costly.  As an example we can look at server reliability rates.

To determine failure rates on a server we need a large number of servers from which to collect data.  This may seem like an extreme example, but server failure rates are one of the most commonly requested large scale study figures, so the example is an important one.  We might need a few hundred servers for a very small study, but to get statistically significant data we would likely need thousands.  If we assume that a single server costs five thousand dollars, which would be a relatively entry level server, a fleet of five thousand of them represents easily twenty five million dollars of equipment!  And that is just enough for a somewhat small scale test of a rather low cost device.  If we were talking about enterprise servers we would easily jump to thirty or even fifty thousand dollars per server, taking the cost to a quarter of a billion dollars.

Now that cost, of course, is for testing a single configuration of a single model server.  Presumably for a study to be meaningful we would need many different models of servers.  Perhaps several from each vendor to compare different lines and features.  Perhaps many different vendors.  It is easy to see how quickly the cost of a study becomes impossibly large.
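To put the arithmetic in one place, here is a minimal back-of-the-envelope sketch in Python (the fleet sizes, unit prices and model counts are the illustrative figures used above, not quotes or measured data):

    # Rough hardware-only cost of a server reliability study.
    # All figures are illustrative assumptions, not vendor pricing.

    def hardware_cost(servers_per_model: int, unit_price: int, models: int = 1) -> int:
        """Purchase cost for the test fleet across all models under study."""
        return servers_per_model * unit_price * models

    print(hardware_cost(5_000, 5_000))        # single entry-level model:  $25,000,000
    print(hardware_cost(5_000, 50_000))       # single enterprise model:   $250,000,000
    print(hardware_cost(5_000, 5_000, 10))    # ten entry-level models:    $250,000,000

Even the cheapest realistic configuration lands in the tens of millions of dollars for hardware alone.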

This is just the beginning of the cost, however.  A good study is going to require carefully controlled environments on par with the best datacenters to isolate environmental issues as much as possible.  This means highly reliable power, cooling, airflow, humidity control, and vibration and dust control.  Facilities of this caliber are very expensive, which is why many companies do not pay for them even for valuable production workloads.  In a large study this cost could easily exceed the cost of the equipment itself over the course of the study.

Then, of course, we must address the need for special sensors and testing procedures.  What exactly constitutes a failure?  Even in production systems there is often dispute about this.  Is a hard drive failing in an array a failure, even if the array does not fail?  Is a predictive failure a failure?  If dealing with drive failure in a study, how do you factor in human components, such as drive replacement, which may not be done in a uniform way?  There are ways to handle this, but they add complication and skew the studies away from real world data toward data contrived for the study.  Establishing study guidelines that are applicable and useful to end users is much harder than it seems.

And then there is the biggest cost of all: manual labor.  Maintaining an environment for a large study takes human capital that may equal the cost of the equipment itself.  It takes a large number of people to maintain the study environment, run the study, monitor it and collect the data.  All in all, the costs are generally, quite simply, impossible to bear.

Of course we could greatly scale back the test, running only a handful of servers and only two or three models, but the value of the test drops rapidly and we risk ending up with results that no one can use while still having spent a large sum of money.

The second insurmountable problem is time.  Most things need to be tested for failure rates over time, and because IT equipment is generally designed to work reliably for decades, collecting data on failure rates requires many years.  Mean Time to Failure numbers are only so valuable; Mean Time Between Failures, along with failure types, modes and the statistics around those failures, is very important for a study to be useful.  What this means is that for a study to be truly useful it must run for a very long time, creating greater and greater cost.
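As a rough illustration of how slowly useful data accumulates, here is a minimal sketch (the 0.5% annualized failure rate and the per-model fleet size are assumptions chosen for illustration, not measurements):

    # Expected failures accumulate slowly when equipment is reliable.
    # The 0.5% annualized failure rate (AFR) is an assumption for illustration.

    def expected_failures(fleet_size: int, afr: float, years: int) -> float:
        """Expected failure count for one model over the study period."""
        return fleet_size * afr * years

    for years in (1, 3, 5, 10):
        print(f"{years:>2} years: ~{expected_failures(1_000, 0.005, years):.0f} expected failures per model")

At roughly five failures per thousand servers per year, it takes a decade of observation just to accumulate fifty failures for a single model, which is still a thin basis for failure mode statistics.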

But that is not the biggest problem.  The far larger issue is that by the time a study has run long enough to generate useful failure numbers, even if those numbers were coming out “live” as they happened, it would already be too late.  The equipment in question would already be aging and nearing replacement in the production marketplace by the time the study produced truly useful early results.  Production equipment is often purchased for only a three to five year lifespan; getting results even one year into that span would have little value.  And new products may replace those in the study even faster than the products age naturally, making the study valuable only as history, with no use in production purchasing decisions.  The results would be too old to be useful by the time they were available.

The final major factor is a lack of incentive to provide existing data to those who need it.  A few sources of data do exist, but nearly all are incomplete and were created so that large vendors could measure their own equipment quality, failure rates and the like.  These data sets are rarely collected in controlled environments and often come from the field.  In many cases the data may even be private to customers and not legally shareable regardless.

Vendors who collect data do not collect it in an even, monitored way, so sharing it could be very detrimental to them, because there is no assurance that comparable data from their competitors would ever appear.  Uncontrolled statistics like these offer no true benefit to the market, nor to the vendors who hold them, so vendors are heavily incentivized to keep such data under tight wraps.

The rare exceptions are hardware studies from vendors such as Google and Backblaze, which have large numbers of consumer class hard drives in relatively controlled environments and collect failure rates for their own purposes.  They face little or no risk of competitors leveraging that data, and there is public relations value in sharing it, so occasionally they release a hardware reliability study on a limited scale.  These studies are hungrily devoured by the industry even though they generally contain relatively little value: the data is old and gathered under unknown conditions and thresholds, often lacks statistical significance for product comparison and, at best, offers general industry wide trends that are somewhat useful for predicting future reliability.

Most other companies large enough to have internal reliability statistics have them only for a narrow range of equipment and consider that information proprietary, a potential risk if divulged (it would reveal important details of their architectural implementations) and a competitive advantage.  For these reasons it is not shared.

I have actually been fortunate enough to have been involved in, and to have run, a large scale storage reliability test, conducted somewhat informally but very valuably, on over ten thousand enterprise servers over eight years, resulting in eighty thousand server years of study, a rare opportunity.  What that study primarily showed, however, was that even on a set so large we were unable to observe a single failure!  The lack of failures was, in itself, very valuable, but we were unable to produce any standard statistic such as Mean Time to Failure.  To produce the kind of data that people expect we would have needed hundreds of thousands of server years, at a minimum, to get any kind of statistical significance, and we cannot reliably state that even that would have been enough.  Perhaps millions of server years would have been necessary.  There is no way to truly know.
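For those wondering what zero failures can actually tell you, here is a minimal sketch of the standard "rule of three" confidence bound, applied to the eighty thousand server years from the study above (the result is an upper limit only; no MTTF point estimate or model comparison is possible from it):

    import math

    # With zero failures observed over T device-years, the one-sided 95% upper
    # confidence bound on the annual failure rate is -ln(0.05) / T, commonly
    # approximated by the "rule of three" as 3 / T.

    exposure = 80_000  # server-years of observation, zero failures seen

    exact = -math.log(0.05) / exposure
    rule_of_three = 3 / exposure

    print(f"95% upper bound on annual failure rate: {exact:.2e} per server-year")
    print(f"Rule-of-three approximation:            {rule_of_three:.2e} per server-year")

That is a ceiling of roughly one failure per 27,000 server years: it says the hardware is very reliable, but it gives nothing that could be compared between models or turned into a Mean Time to Failure figure.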

Where this leaves us is that large scale studies in IT simply do not exist and likely never will.  Where they do appear they will be isolated and almost certainly crippled by the necessities of reality.  There is no means of monetizing studies on the scale necessary to be useful, mostly because failure rates of enterprise gear are so low while the equipment is so expensive, so third party firms can never cover the cost of providing this research.  As an industry we must accept that this type of data does not exist and actively pursue alternatives to it.  It is surprising that so many people in the field expect this type of data to be available when it never has been.

Our only real options, given this vacuum, are to collect what anecdotal evidence exists (a dangerous thing to do, requiring careful consideration of context) and to apply logic to assess reliability approaches and techniques.  This is a broad situation in which observation necessarily fails us and only logic and intuition can fill the resulting gap in knowledge.

10 thoughts on “Explaining the Lack of Large Scale Studies in IT”

  1. >”lack of incentive to produce and/or share this data with other companies.”

    No doubt every sufficiently big IT company nowadays collects this kind of data in order to minimize cost and/or risk.

    But Backblaze is still the only company to publish its findings in detail, because it believes that this kind of publicity gives it an advantage over its competition.

    Not even Google published the manufacturers and models behind its failure analysis!

    And the great majority of other companies are convinced that this kind of valuable know-how serves them best if it is not available to their competitors.

    That is the short truth; no longer text is needed.

  2. And even BackBlaze doesn’t have enough data to be truly useful. They have limited failure information on a very small subset of drives. The data is the best that we have, in most cases, but it is far less than you would want for a meaningful ability to compare between vendors, generations, and even products within a vendor.

  3. @Scott Alan Miller April 2:

    Could you please explain how you came to that conclusion?

    I felt safe enough to base my recent (small scale) buying decisions on the published Backblaze data.

    My own experience has shown unsatisfactory reliability of WD Green drives, but with much too small a sample that data is not representative.

    For all those who do not have data from an installed base of a few thousand drives, imho Backblaze gives valuable hints.

  4. I agree, Backblaze gives us good hints and I am very thankful to them for supplying the data that they do. But their sample sizes are very low and cover only a handful of different drives. Even BB is not large enough, nor does it have a strong interest in, carrying many, many different types of drives to produce broad comparisons. They only use a few different drives, and some of those are in tiny quantities. To get good numbers we would need at least as many drives as they have in their largest pools, but across several different drive models and vendors. BB has the best data on the market, but it is poor as statistical data goes. That’s my point: even the best isn’t very good when it comes to large IT studies.

    I’ve used Green drives too with great results, but my pool is tiny and not statistically significant. Are you using them standalone or in storage pools (like RAID)? If the latter, look to WD Red drives instead; they are WD Green drives with modified firmware to be more reliable in that setting.

  5. The latest reliability report from Backblaze is based on more than 40,000 drives, and I do not have the slightest doubt that with these numbers the results are reliable enough to draw clear conclusions.

    So I no longer expect you to prove why their sample sizes should be too low.

  6. The total number of drives is not the deciding factor; what matters is how many different types of drives there are and how many drives of each. 40,000 is not a large number when we are talking about a statistical sample set. If that were testing just four models, that would be an average of 10K drives of each, and the numbers get much smaller very quickly. But the distribution is nowhere near even: a few models are present in huge quantity and others in very small numbers. The total sample size is misleading, because 40K sounds good until you look at the distribution and the individual samples.

    Doing drive and array reliability statistics requires very large numbers. For my own statistics I used over 32,000 drives at a time over 8 years, more than 60,000 drives in total, and I can assure you the results were not statistically significant at all given the rarity of drive and array failures. I could produce a few meaningful statistics about a single factor, but I did not have a sample size large enough to produce comparative numbers at all.

    So while you conclude that a number like 40K is simply enough to be meaningful based on the raw count, I see 40K as unable to produce enough data to compare any meaningful number of options while overcoming background noise from the sampling process. Statistically, when looking at very low failure rates and needing not a single number but a comparison across many different models, types and options, 40K is far, far too small to tell us anything more than some very general figures, and we can only trust those so far because the data is primarily about one or two specific drive models rather than drives in general.
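    To give a feel for the scale of the problem, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions (the failure rates, 5% significance level and 80% power are assumptions for illustration, not Backblaze figures):

        import math

        # Approximate drives needed PER MODEL to distinguish two annual failure
        # rates with a two-sided test (normal approximation). All inputs below
        # are illustrative assumptions, not measured values.

        def drives_per_model(p1: float, p2: float) -> int:
            z_alpha, z_beta = 1.96, 0.84       # 5% significance, 80% power
            variance = p1 * (1 - p1) + p2 * (1 - p2)
            return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

        print(drives_per_model(0.02, 0.03))    # 2% vs 3% AFR: ~3,800 drives of EACH model
        print(drives_per_model(0.01, 0.015))   # 1% vs 1.5% AFR: ~7,700 drives of EACH model

    Multiply numbers like those across every model and vendor you want in the comparison, and a 40K drive fleet that is unevenly split across models runs out very quickly.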

  7. Mr. Scott, in case you really do have trustworthy data on 60K drives over 8 years, why have you not (yet?) published it?

    I would be only too glad to tell you what conclusions can be drawn from it.

  8. I have repeatedly published that data; it is discussed regularly. There is no official publication from the investment bank where the study was done because banks do not do this; a bank has no financial reason to publish data for the IT industry. The reasons for this were covered in the article. Pretty much any company capable of doing such a study has no financial incentive to provide it to others, has legal barriers to doing so, or sees the data as a competitive advantage not to be shared with the competition. Even in a case like mine, where the data is benign and non-competitive, no company will voluntarily take on the risk and cost of publishing a study that has no benefit to them.

  9. I’m not even sure to what you are referring. You complained that I’ve not published but did not check that I had. I provided the information again.

    Today Backblaze just released some new drive info. It’s handy, but it ultimately proves my point: the data is a small set (only 4,500 drives) and comes three years after the drives went out of production. So while it is interesting to know that Seagate had a higher failure rate three years ago, it does not give us usable information to change our buying patterns today. Nor does it offer comparisons to other models and makers during that same window.

    https://www.backblaze.com/blog/3tb-hard-drive-failure/
