{"id":736,"date":"2015-03-31T04:59:24","date_gmt":"2015-03-31T09:59:24","guid":{"rendered":"http:\/\/www.smbitjournal.com\/?p=736"},"modified":"2017-04-24T15:04:52","modified_gmt":"2017-04-24T20:04:52","slug":"explaining-the-lack-of-large-scale-studies-in-it","status":"publish","type":"post","link":"https:\/\/smbitjournal.com\/2015\/03\/explaining-the-lack-of-large-scale-studies-in-it\/","title":{"rendered":"Explaining the Lack of Large Scale Studies in IT"},"content":{"rendered":"
IT practitioners ask for these every day and yet none exist – large scale risk and performance studies for IT hardware and software. This covers a wide array of possibilities, but common examples are failure rates between different server models, hard drives, operating systems, RAID array types, desktops, laptops, you name it. And yet, despite the high demand for such data, none is available. How can this be?
Not all cases are the same, of course, but by and large there are three significant factors that keep this type of data from entering the field: the high cost of conducting a study, the long time scale necessary for a study, and a lack of incentive to produce and/or share the data with other companies.
Cost is by far the largest factor. If the cost of large scale studies could be overcome, solutions could be found for every other factor. But sadly, by its nature, a large scale study is costly. As an example we can look at server reliability rates.
In order to determine failure rates on a server we need a large number of servers from which to collect data. This may seem like an extreme example, but server failure rates are one of the most commonly requested large scale study figures, so the example is an important one. We would need perhaps a few hundred servers for a very small study, but to get statistically significant data we would likely need thousands of servers. If we assume that a single server costs five thousand dollars, which would be a relatively entry level server, we are looking at easily twenty five million dollars of equipment! And that is just enough to do a somewhat small scale test (just five thousand servers) of a rather low cost device. If we were talking about enterprise servers we would easily jump to thirty or even fifty thousand dollars per server, taking the cost to a quarter of a billion dollars.
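To make the arithmetic concrete, here is a minimal sketch of the hardware cost alone. The fleet size of five thousand servers and the per-unit prices are the illustrative figures used above, not data from any actual study.

```python
# Back-of-the-envelope hardware cost for a hypothetical reliability study.
# Fleet size and unit prices are illustrative assumptions from the text above.

FLEET_SIZE = 5_000  # servers in the study

unit_prices = {
    "entry level server": 5_000,   # USD per unit (assumed)
    "enterprise server": 50_000,   # USD per unit (assumed)
}

for model, price in unit_prices.items():
    total = FLEET_SIZE * price
    print(f"{model}: {FLEET_SIZE:,} units x ${price:,} each = ${total:,}")

# entry level server: 5,000 units x $5,000 each = $25,000,000
# enterprise server: 5,000 units x $50,000 each = $250,000,000
```

And this covers only the equipment itself; the facility, instrumentation and staffing costs discussed below come on top of it.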
That cost, of course, is for testing a single configuration of a single server model. Presumably, for a study to be meaningful, we would need many different models of servers – perhaps several from each vendor to compare different lines and features, and perhaps many different vendors. It is easy to see how quickly the cost of a study becomes impossibly large.
This is just the beginning of the cost, however. A good study requires carefully controlled environments on par with the best datacenters in order to isolate environmental issues as much as possible. This means highly reliable power, cooling, airflow, humidity control, vibration and dust control. Good facilities like this are very expensive, which is why many companies do not pay for them even for valuable production workloads. In a large study this cost could easily exceed the cost of the equipment itself over the course of the study.
Then, of course, we must address the need for special sensors and testing. What exactly constitutes a failure? Even in production systems there is often dispute on this. Is a hard drive failing in an array a failure, even if the array does not fail? Is a predictive failure a failure? If dealing with drive failure in a study, how do you factor in human components such as drive replacement, which may not be done in a uniform way? There are ways to handle this, but they add complication and skew the results away from real world data toward contrived study data. Establishing study guidelines that are applicable and useful to end users is much harder than it seems.
And then there is the biggest cost of all: manual labor. Maintaining an environment for a large study takes human capital which may equal the cost of the study itself. It takes a large number of people to maintain a study environment, run the study itself, monitor it and collect the data. All in all, the costs generally make such a study simply impossible to undertake.
Of course we could greatly scale back the test, running only a handful of servers and only two or three models, but the value of the test rapidly drops and we risk ending up with results that no one can use while still having spent a large sum of money.
The second insurmountable problem is time. Most things need to be tested for failure rates over time, and as equipment in IT is generally designed to work reliably for decades, collecting data on failure rates requires many years. Mean Time to Failure numbers are only so valuable; Mean Time Between Failures, along with failure types, modes and the statistics around those failures, is what makes a study truly useful. What this means is that a truly useful study must run for a very long time, creating greater and greater cost.
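As a rough illustration of why time matters so much, consider how slowly a failure-rate estimate tightens even for a large fleet. The fleet size and the assumed annual failure rate below are purely hypothetical, chosen only to show the trend.

```python
import math

# Sketch: how long must a study run before a failure-rate estimate stabilizes?
# Fleet size and the assumed "true" annual failure rate (AFR) are hypothetical.

FLEET_SIZE = 5_000
ASSUMED_AFR = 0.001  # 0.1% of units fail per year (assumption only)

for years in (1, 3, 5, 10):
    device_years = FLEET_SIZE * years
    expected_failures = ASSUMED_AFR * device_years
    # Treating failures as roughly Poisson, the relative uncertainty of the
    # estimated rate shrinks only as 1 / sqrt(number of observed failures).
    relative_error = 1 / math.sqrt(expected_failures)
    print(f"{years:2d} years: {device_years:,} server-years, "
          f"~{expected_failures:.0f} expected failures, "
          f"rate known only to within about +/-{relative_error:.0%}")
```

Even after a decade a fleet of this size pins the rate down only loosely, and by then the hardware in question is already obsolete – which is the next problem.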
But that is not the biggest problem. The far larger issue is that by the time a study has had enough time to generate useful failure numbers, even if those numbers were coming out “live” as they happened, it would already be too late. The equipment in question would already be aging and nearing replacement in the production marketplace by the time the study produced truly useful early results. Production equipment is often purchased for a total lifespan of only three to five years; getting results even one year into that span would have little value. And new products may replace those in the study even faster than the products age naturally, making the study valuable only as history, with no use in guiding production purchasing decisions – the results would be too old to be useful by the time that they were available.
The final major factor is a lack of incentive to provide existing data to those who need it. A few sources of data do exist, but nearly all are incomplete and exist so that large vendors can measure the quality and failure rates of their own equipment. These are rarely gathered in controlled environments and often involve data collected from the field. In many cases this data may even be private to customers and not legally shareable regardless.
But vendors who collect data do not collect it in an even, monitored way, so sharing it could be very detrimental to them because there is no assurance that equal data from their competitors would exist. Uncontrolled statistics like these would offer no true benefit to the market, nor to the vendors who hold them, so vendors are heavily incentivized to keep such data under tight wraps.
The rare exceptions are hardware studies from firms such as Google and Backblaze, who have large numbers of consumer class hard drives in relatively controlled environments and collect failure rates for their own purposes. They face little or no risk from competitors leveraging that data, and there is public relations value in releasing it, so occasionally they will publish a reliability study on a limited scale. These studies are hungrily devoured by the industry even though they generally contain relatively little value: the data is old, the conditions and failure thresholds are unknown, and the results often lack the statistical significance needed for product comparisons. At best they capture general, industry wide trends that are somewhat useful for predicting future reliability.
Most other companies large enough to have internal reliability statistics have them on a narrow range of equipment and consider that information to be proprietary, a potential risk if divulged (it would give out important details of architectural implementations) and a competitive advantage. So for these reasons they are not shared.
I have actually been fortunate enough to have been involved in and run a large scale storage reliability test, conducted somewhat informally but very valuably, on over ten thousand enterprise servers over eight years – eighty thousand server years of study, a rare opportunity. What that study primarily showed is that even on a set so large we were unable to observe a single failure! The lack of failures was, itself, very valuable, but we were unable to produce any standard statistic like Mean Time to Failure. To produce the kind of data that people expect, we know that we would have needed hundreds of thousands of server years at a minimum to get any kind of statistical significance, and we cannot reliably state that even that would have been enough. Perhaps millions of server years would have been necessary. There is no way to truly know.
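One standard way to see what zero failures can and cannot tell us is the “rule of three” approximation: with no failures observed, the roughly 95% upper confidence bound on the failure rate is about three divided by the total exposure. The sketch below applies it to the eighty thousand server years mentioned above; everything else is the textbook approximation, not output from that study.

```python
# Sketch: what can zero observed failures tell us? With no failures seen, the
# "rule of three" gives an approximate 95% upper confidence bound on the rate:
# upper bound ~= 3 / total exposure. The 80,000 server-year exposure comes from
# the informal study described above; the rest is a generic approximation.

server_years = 80_000
failures_observed = 0  # no server failures were seen in the study

upper_bound_afr = 3 / server_years   # ~95% upper bound, failures per server-year
mttf_lower_bound = server_years / 3  # implied lower bound on MTTF, in years

print(f"Annual failure rate: at most about {upper_bound_afr:.4%} per server-year")
print(f"Mean Time to Failure: at least roughly {mttf_lower_bound:,.0f} years")
```

A bound like this is genuinely useful – it says the gear is very reliable – but it is not the Mean Time to Failure figure that people ask for, and tightening it meaningfully would require the hundreds of thousands or millions of server years mentioned above.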
Where this leaves us is that large scale studies in IT simply do not exist and likely never will. Where they do appear, they will be isolated and almost certainly crippled by the necessities of reality. There is no means of monetizing studies on the scale necessary to be useful, mostly because failure rates of enterprise gear are so low while the equipment is so expensive, so third party firms can never cover the cost of providing this research. As an industry we must accept that this type of data does not exist and actively pursue alternatives to having access to it. It is surprising that so many people in the field expect this type of data to be available when it never has been historically.
Our only real options, considering this vacuum, are to collect what anecdotal evidence exists (a very dangerous thing to do, requiring careful consideration of context) and to apply logic in assessing reliability approaches and techniques. This is a broad situation where observation necessarily fails us and only logic and intuition can fill the resulting gap in knowledge.