
What Do We Know About Storage?

Investigation: Is Your SSD More Reliable Than A Hard Drive?

SSDs are a relatively new technology (at least compared to hard drives, which are almost 60 years old). It’s understandable that we would compare the new kid on the block against the tried and true. But what do we really know about hard drives? Two important studies shed some light. Back in 2007, Google published a study on the reliability of 100 000 consumer PATA and SATA drives used in its data centre. Similarly, Dr. Bianca Schroeder and her adviser, Dr. Garth Gibson, calculated the replacement rates of over 100 000 drives used at some of the largest national labs. The difference is that their data also covers enterprise SCSI, SATA, and Fibre Channel drives.

If you haven’t read either paper, we highly recommend at least reading the second study. It won best paper at the File and Storage Technologies (FAST ’07) conference. For those not interested in poring over academic papers, we’ll also summarize.

MTBF Rating

You remember what MTBF means (here's a hint: we covered it on page four of OCZ's Vertex 3: Second-Generation SandForce For The Masses), right? Let’s use the Seagate Barracuda 7200.7 as an example. It has a 600 000-hour MTBF rating. In any large population, we'd expect half of these drives to fail within the first 600 000 hours of operation. Assuming failures are evenly distributed over that period, a population of 600 000 drives would see roughly one failure every hour. That works out to an annualized failure rate (AFR) of about 1.44%.
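As a quick sanity check, here is a minimal sketch of that conversion in Python, assuming the drive is powered on around the clock; the small gap between these figures and the 1.44% above comes down to rounding and the power-on hours a vendor assumes per year.

```python
import math

MTBF_HOURS = 600_000        # datasheet rating for the Barracuda 7200.7
HOURS_PER_YEAR = 8_760      # assumes the drive is powered on 24/7

# Simple approximation: expected failures per drive-year
afr_simple = HOURS_PER_YEAR / MTBF_HOURS

# Exponential model often used for this conversion: AFR = 1 - e^(-t/MTBF)
afr_exp = 1 - math.exp(-HOURS_PER_YEAR / MTBF_HOURS)

print(f"approximate AFR:  {afr_simple:.2%}")   # ~1.46%
print(f"exponential AFR:  {afr_exp:.2%}")      # ~1.45%
```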

But that’s not what Google or Dr. Schroeder found, because failures do not necessarily equal disk replacements. That is why Dr. Schroeder measured the annualized replacement rate (ARR). This is based on the number of actual disks replaced, according to service logs.

While the datasheet AFRs are between 0.58% and 0.88%, the observed ARRs range from 0.5% to as high as 13.5%. In other words, depending on the data set and drive type, the observed replacement rates were up to 15 times higher than the datasheet AFRs.
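To make the AFR/ARR distinction concrete, here is a minimal sketch of how a replacement rate falls out of service logs; the log entries and field layout are hypothetical, but both studies work from records of this kind rather than from vendor failure criteria.

```python
# Hypothetical service-log entries: (drive_id, years_in_service, was_replaced)
service_log = [
    ("disk-0001", 3.0, False),
    ("disk-0002", 2.5, True),
    ("disk-0003", 1.0, False),
    # ... thousands more entries in the real data sets
]

drive_years = sum(years for _, years, _ in service_log)
replacements = sum(1 for _, _, replaced in service_log if replaced)

# ARR: replacements per 100 drive-years of operation
arr = 100 * replacements / drive_years
print(f"ARR = {arr:.1f}% over {drive_years:.1f} drive-years")
```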

Drive makers define failures differently than we do, and it’s no surprise that their definition overstates drive reliability. Typically, an MTBF rating is based on accelerated life testing, return unit data, or a pool of tested drives. Vendor return data continues to be highly suspect, though. As Google states, “we have observed… situations where a drive tester consistently ‘green lights’ a unit that invariably fails in the field.”

Drive Failure Over Time

Most people assume that the failure rate of a hard drive looks like a bathtub curve. Many drives fail early on due to a phenomenon referred to as infant mortality. After that initial period, you expect to see low failure rates. At the other end, there’s a steady rise as drives finally wear out. Neither study found that assumption to be true. Instead, both found that drive failures steadily increase with age.

Enterprise Drive Reliability

When you compare the two studies, you realize that the Cheetah, despite its 1 000 000-hour MTBF rating, behaves more like a drive with a datasheet MTBF of 300 000 hours. This means that “enterprise” and “consumer” drives have pretty much the same annualized failure rate, especially when you are comparing drives of similar capacity. According to Val Bercovici, director of technical strategy at NetApp, "…how storage arrays handle the respective drive type failures is what continues to perpetuate the customer perception that more expensive drives should be more reliable. One of the storage industry’s dirty secrets is that most enterprise and consumer drives are made up of largely the same components. However, their external interfaces (FC, SCSI, SAS, or SATA), and most importantly their respective firmware design priorities/resulting goals play a huge role in determining enterprise versus consumer drive behaviour in the real world."

Data Safety and RAID

Dr. Schroeder’s study covers enterprise drives deployed in large RAID systems at some of the biggest high-performance computing labs. Typically, we assume that data is safer in properly chosen RAID modes, but the study found something quite surprising:

“The distribution of time between disk replacements exhibits decreasing hazard rates, that is, the expected remaining time until the next disk was replaced grows with the time it has been since the last disk replacement.”

This means that the failure of one drive in an array increases the likelihood that another drive will fail soon after. Conversely, the more time that passes since the last failure, the more time is expected to pass until the next one. Of course, this has implications for the RAID reconstruction process. After the first failure, another drive is four times more likely to fail within the same hour; even over a 10-hour window, a subsequent failure remains twice as likely as normal.
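As a rough illustration of what that correlation does to a rebuild window, here is a back-of-the-envelope sketch; the baseline replacement rate, array size, and rebuild time are assumed numbers, and only the elevated-risk multiplier is taken from the study.

```python
# Back-of-the-envelope estimate of a second failure during a RAID rebuild.
# Assumed figures throughout -- only the correlation multiplier comes from the study.
BASELINE_ARR = 0.03          # 3% annualized replacement rate per drive (assumed)
DRIVES_REMAINING = 7         # surviving drives in a hypothetical 8-drive array
REBUILD_HOURS = 10           # assumed rebuild window
CORRELATION_FACTOR = 2       # ~2x elevated risk within 10 hours of a failure

hourly_rate = BASELINE_ARR / 8760                       # per-drive failures per hour
independent = DRIVES_REMAINING * hourly_rate * REBUILD_HOURS
correlated = independent * CORRELATION_FACTOR

print(f"naive (independent) estimate: {independent:.4%}")
print(f"with observed correlation:    {correlated:.4%}")
```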

Temperature

One of the stranger conclusions comes from Google’s paper. The researchers took temperature readings from SMART—the self-monitoring, analysis, and reporting technology built into most hard drives—and they found that a higher operating temperature did not correlate with a higher failure rate. Temperature does seem to affect older drives, but the effect is minor.

Is SMART Really Smart?

The short answer is no. SMART was designed to catch disk errors early enough for you to back up your data. But according to Google, more than one-third of all failed drives did not trigger a SMART alert. This isn’t a huge surprise, as many industry insiders have suspected as much for years. It turns out that SMART is really optimized to catch mechanical failures, yet much of a disk is electronic. That’s why behavioural and situational problems, like power failure, go unnoticed, while data integrity issues are caught. If you’re relying on SMART to warn you of an impending failure, you need to plan for an additional layer of redundancy if you want to ensure the safety of your data.
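None of this means SMART data is worthless; if you want to keep an eye on it anyway, smartctl (from the smartmontools package) is the usual tool. Here is a minimal sketch that shells out to it; the device path is only an example, and for the reasons above a passing result is weak reassurance at best.

```python
import subprocess

# Query the drive's overall SMART health with smartctl (smartmontools).
# /dev/sda is just an example device; adjust for your system.
result = subprocess.run(
    ["smartctl", "-H", "/dev/sda"],
    capture_output=True, text=True
)

print(result.stdout)

# Per Google's data, over a third of failed drives never tripped a SMART alert,
# so treat a passing result as weak evidence, not a guarantee.
if "PASSED" not in result.stdout:
    print("SMART health check did not pass -- back up now.")
```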

Now let's see how SSDs stack up against hard drives.

Comments
  • AlexIsAlex, 29 July 2011 15:25
    The 'drive completely dead, data unrecoverable' failure mode is not the worst; I can restore yesterday's image and lose, at most, a day's data (acceptable for my usage - obv. tailor backup frequency etc. to what's acceptable to you).

    The worst is what happened to my last SSD. For weeks I thought the problems I was seeing were software issues: the occasional crash, the odd SxS error in the event log, a game failing Steam file validation, an old email showing half garbled. Eventually, I managed to diagnose the problem.

    Old, untouched, files on the SSD were being corrupted at a very low rate (a few bytes per GB, I'd estimate). A file could be written and verified after writing, but days later might fail a checksum test when read. Without any error notification, SMART or otherwise, to indicate that the data read was anything other than perfect.

    Now that was a problem. Who knows when the last backup image without any corruption was? How can you even tell? The vast majority of files will be fine, but some will be backed up corrupt, and may have been for some time. With much manual effort I eventually did recover everything important, but my new backup regime involves checksumming everything on the SSD weekly. If something has changed data but not changed timestamp, this time I'm going to get some red flags!

    I can't say for certain that this failure mode is SSD specific, but it happened on my first SSD, and never on any of my spinners. Not enough data to be statistically significant, but enough to make me cautious.
  • Anonymous, 31 July 2011 19:26
    Can second the findings with regard to OCZ Vertex 2 drives. Mine has just gone and without any warning - all data lost after a year of light use. OCZ are completely useless in helping to fix it. It's like they know that their SSDs fail a lot and aren't at all surprised. Have gone onto Intel 320 SSD based on the hardware.fr findings.
  • dyvim, 1 August 2011 16:34
    Thanks Andrew, that's an interesting article even for a layman operating a single SSD ^^
    So far my OCZ Vertex 2 is doing fine, but then failure is always only a probability. System drives shouldn't be used to store important data in my eyes anyways.
    If not having mechanical parts doesn't really lower the percentage of dying drives, that only means that backup is just as important (and as often forgotten) as it always was.
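For anyone who wants to copy the weekly audit AlexIsAlex describes above, here is a minimal sketch of the idea; the root path and manifest name are placeholders. It records a hash and modification time for every file, then flags anything whose contents changed while its timestamp did not.

```python
import hashlib, json, os

MANIFEST = "checksums.json"   # placeholder name for the stored manifest
ROOT = "/data"                # placeholder path to the drive being audited

def scan(root):
    """Return {path: (sha256 hex digest, mtime)} for every file under root."""
    out = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            out[path] = (digest.hexdigest(), os.path.getmtime(path))
    return out

current = scan(ROOT)

if os.path.exists(MANIFEST):
    with open(MANIFEST) as f:
        previous = {k: tuple(v) for k, v in json.load(f).items()}
    for path, (digest, mtime) in current.items():
        old = previous.get(path)
        # Contents changed but the timestamp did not: the silent-corruption red flag.
        if old and old[0] != digest and old[1] == mtime:
            print(f"possible silent corruption: {path}")

with open(MANIFEST, "w") as f:
    json.dump(current, f)
```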