Using RAID-5 Means the Sky is Falling!

Why disk URE rate does not guarantee rebuild failure.

Editorial article by Olin Coles

Today’s appointment brought me out to a small but reliable business, where I’m finishing the hard drive upgrades for their cold storage backup system. It was an early morning drive into the city, with enough ice on the roads to contribute toward the more than 30,000 fatal accidents that occur each year [1]. The backup appliance I’m servicing has received 6TB desktop hard disks to replace an old set with a fraction of the capacity, so rebuilding the array has taken considerable time.

Their primary storage spans eight disks in a RAID-10 set, which gets archived to the server backup appliance for long-term retention. That backup appliance has a unique cartridge system that safely holds three disks in a redundant array. Later this evening, when the project is finally finished, I’ll count myself lucky for surviving the treacherous roadways and lethal cold, but I won’t give a second thought to the risks I took by using RAID-5 on their cold storage devices.

You might not agree, but there are people out there who believe we should not be driving, because the statistics indicate it’s clearly a dangerous activity. Nearly every driver in America will be involved in an auto accident at some point in their life [3], and some of those accidents will cause serious injury or death. The rest of us, the more than 96% of licensed drivers not involved in an accident this year, will drive to our destinations unharmed. Sure, a statistical risk exists, but it’s not an absolute guarantee that I’ll be killed on the drive home. The sky is not falling.

Every year, no matter where you live, it gets cold in the winter. Falling temperatures can lead to hypothermia and, for a very small portion of the population, death. Winter temperatures in the Sierra Nevada can chill you to the bone, and more than 1,500 people succumb to hypothermia annually [2]. When I walked from the parking lot to the client’s office this morning it was extremely cold, but just because there is a statistical risk of hypothermia does not mean I’m surely going to freeze to death. The sky is still not falling.

For some strange reason, people seem to think everything changes when we talk about hard disk drives, and that a statistical possibility becomes absolute certainty. Manufacturers conduct abbreviated testing on hard disk components [4], sampling a set number of drives to determine a relative mean time between failure (MTBF) or an unrecoverable read error (URE) rate. Nevertheless, there are people using fear tactics who claim that redundant arrays of large-capacity disks, such as the 6TB hard disks I used in those RAID-5 sets, are risky business. Some even go so far as to say RAID-5 will stop working in a particular year, reminiscent of pre-apocalypse Y2K.

In reality, most hard disks seldom see operating temperatures below the chill of a server room or beyond the warmth of rack space, and most disks will not commit a URE that crashes a RAID-5 rebuild. While it is agreed that better parity schemes exist, the exception is not the rule. My customer could have kept cold storage data on individual removable drives, with no redundancy at all. In fact, most organizations already use a single removable disk or cloud container for their nightly backup routine. My customer chose a special backup appliance that fits three disks into a single cartridge, further protecting archived data and proving RAID-5 still has business applications.

But if the opinion of an Internet personality vocal on storage technology is to be revered as gospel truth [5], then we must forgo these large-capacity disks because they’re all purported to carry an “almost certain” unrecoverable read error rate… something to the tune of one error per 10^14 bits. A guaranteed URE, you ask? Well, it’s not as certain as freezing to death or being killed in an accident, or even both of these statistics combined, but according to the often-cited and seldom-verified test methodology, your hard drive will fail to read a sector roughly once every 12TB of data. Such a failure could happen as a RAID-5 array is being rebuilt, striking a sector on the parity disk with a guaranteed URE at exactly 100,000,000,000,000 bits – unless it doesn’t.
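Treating the spec as arithmetic rather than prophecy makes the point concrete. Here is a minimal sketch in Python, under the assumption (which the spec sheets never actually justify) that 10^14 means a constant, independent per-bit error probability of 10^-14:

```python
import math

# Sketch only: reads the 10^14 spec as "one unrecoverable read error per
# 10^14 bits, on average" and models each bit as an independent trial.
# The manufacturers' actual test methodology is not public, so this
# per-bit model is itself an assumption.

URE_RATE = 1e-14  # assumed errors per bit read (consumer-class spec)

# Average data read between errors under this model: 10^14 bits -> TB
tb_between_errors = 1e14 / 8 / 1e12
print(tb_between_errors, "TB between UREs on average")  # 12.5 TB

def p_at_least_one_ure(terabytes_read, rate=URE_RATE):
    """Probability of one or more UREs while reading this much data,
    via the Poisson approximation 1 - exp(-expected_errors)."""
    expected_errors = terabytes_read * 1e12 * 8 * rate
    return 1 - math.exp(-expected_errors)

# Reading ~12 TB to rebuild a degraded array of 6 TB disks:
print(round(p_at_least_one_ure(12.0), 3))  # 0.617
```

Even taken at face value, the model yields roughly a 62% chance of hitting a URE somewhere across a 12TB read: odds weighted against you, not a guaranteed failure at bit 100,000,000,000,000.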

Some writers build their reputation by making audacious claims that create controversy, done solely to propel traffic onto the website they write for. Common sense and real-world experience be damned; let the lack of evidence to the contrary and the use of complex math help prove their confusing point! After all, it’s not like anybody knows exactly how any particular manufacturer came up with a 10^14 error rate, which arbitrarily changes from time to time, or where people can find these clearly documented test procedures. You’re not supposed to question the numbers – you’re just supposed to believe what the manufacturer tells you, and accept that regardless of capacity per disk or the number of drives involved, after reading 12TB you will experience an unrecoverable read error. Oh, and that RAID-5 also stopped working in 2009 – except that it didn’t.

We all survived Y2K unscathed, and not surprisingly the end of RAID-5 did not actually happen as predicted. That same author later wrote a follow-up article [6], and instead of admitting defeat he doubled down, claiming RAID-5 was as doomed as ever because URE rates remained the same in the largest-capacity drives. Never mind the countless real-world scenarios where RAID-5 continues to be used with great success well into 2015; that’s not important. People forget that the 10^14-bit URE rate is not an absolute; it’s a predictive failure specification, measured for a single disk based on an unknown test sample size. It’s also a marketing ploy, since nearly all consumer desktop hard drives receive the same error rate while enterprise drives magically receive a 10^15-bit URE rate… an entire order of magnitude greater reliability, all without quantified explanation.
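That claimed order-of-magnitude gap between consumer and enterprise drives can be plugged into the same assumed per-bit model, to see what the spec sheets would predict for a hypothetical 12TB rebuild if taken literally:

```python
import math

def rebuild_survives(terabytes_read, rate):
    """Probability of reading the whole set with zero UREs, assuming a
    constant, independent per-bit error rate (Poisson model of the spec)."""
    return math.exp(-terabytes_read * 1e12 * 8 * rate)

# The same 12 TB rebuild at the two commonly quoted spec rates:
for label, rate in (("consumer 10^-14", 1e-14), ("enterprise 10^-15", 1e-15)):
    print(label, round(rebuild_survives(12.0, rate), 3))
# consumer 10^-14 0.383
# enterprise 10^-15 0.908
```

If consumer drives truly behaved like the 10^14 spec, roughly six in ten large RAID-5 rebuilds would fail on a URE; the routine successes administrators report suggest the quoted figure is a conservative floor, not a measured behavior.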

It’s possible that people who claim the sky will fall have failed to envision a future with solid state storage, or have misinterpreted a suggested error rate as a predictable mechanical function. Both are likely, yet neither prevented an entire subculture from embracing the notion that RAID-5 does not work, or that every desktop hard disk will suffer a read failure precisely at the 10^14th bit. All that is necessary to disprove this is the successful rebuild of a RAID-5 set of 12TB or greater capacity, as many primary and backup storage administrators have done countless times.

As we approach an era where solid state drive products reach multi-terabyte capacity with built-in error checking and data management technologies, the argument for unrecoverable errors and subsequent RAID rebuild failures becomes even less valid. It’s foolish to claim a proven technology will fail at some far-off future date [7], when that future brings dramatic, unpredictable improvements with every product cycle. If the sky really is falling, next time they’ll just have to shout louder and use proven math.

  1. NHTSA, 2012: http://www-nrd.nhtsa.dot.gov/Pubs/811856.pdf
  2. CDC, 2010: http://www.cdc.gov/mmwr/preview/mmwrhtml/mm6151a6.htm
  3. Karen Aho, 2011: http://www.carinsurance.com/Articles/How-many-accidents.aspx
  4. Adrian Kingsley-Hughes, 2007: http://www.zdnet.com/article/making-sense-of-mean-time-to-failure-mttf
  5. Robin Harris, 2007: http://www.zdnet.com/article/why-raid-5-stops-working-in-2009
  6. Robin Harris, 2013: http://www.zdnet.com/article/has-raid5-stopped-working
  7. Robin Harris, 2010: http://storagemojo.com/2010/02/27/does-raid-6-stops-working-in-2019

11 comments


  1. KGregory

    Outstanding article. Admittedly, I also was among those that allowed themselves to be influenced by ‘doom hype’ relating to RAID-5. So I left off using RAID-5 and stuck with RAID-6 or RAID-10 depending on needs.

    The only upside I see is that HD pricing is fantastic considering the storage offered. Therefore one can opt for any of the various RAID configurations without the ‘heavy cost/analysis’ that dominated such decisions decades ago.

    My actual complaint these days is the pitiful innovation in RAID controllers, across hardware, firmware and software. It’s very stagnant, at least to me.

    Thanks again for the article.

  2. James Culvet

    Kudos to benchmarkreviews for this editorial. Eight years later people are still quoting Robin Harris’ fabulously fictional article and designing infrastructure around his predictions. I’m happy to see someone point out the obvious mistakes in his assumptions, and remind us that hard drive makers pull the strings. In spite of its damaged reputation, RAID5 is still a good choice for backup and archive storage. I use RAID6 for primary storage, which I see he now thinks will stop working in 2019.

  3. Duane

    For small environments such as you describe, you are correct. However, working in an enterprise environment, the law of averages does catch up. I administer several hundred RAIDed storage arrays, and just about every one of them has lost a disk at some point during its lifespan. In several cases more than one drive has failed in the same RAID group, causing loss of critical data.
    Yes, your 3-disk cartridge may well work for years with no issues, but one URE during a rebuild and you will have data loss.
    Given an MTBF of 1,000,000 hours, having 10,000 spindles means a failure about once every 100 hours.
    I don’t know how important your data is, but for mine, I want RAID6 with a hot spare. The larger drives exacerbate the issue, as rebuilds take much longer. If rebuild times start getting up into the 100+ hour range, which they do for very large arrays, you may not have time for the replacement drive to fully kick in before the next error. It is all statistics.
    You have to determine how “important” your data is. It’s expensive, but I know of environments that run RAID 61 (RAID 6 on mirrored pairs) and replicate all data to a minimum of one other system, usually off site.

  4. David

    Not all drives have a 10^14 URE rate.
    Consumer SATA drives: 10^14
    Nearline enterprise: 10^15
    Enterprise SAS/FC: 10^16
    Some enterprise SAS SSDs: 10^17

    1. Olin Coles

      The 10^14 URE rate referenced in this article revolves around my introductory statement: “backup appliance I’m servicing has received 6TB desktop hard disks”. Later in the article it is mentioned that enterprise drives carry a 10^15 URE rate, which I presume you noticed while reading.

  5. Ian Hall

    The likelihood of my house burning down is statistically negligible. I don’t even know anyone who has had a major house fire. So by Olin’s argument I don’t need to insure against loss by fire. Yes, many people have successfully rebuilt RAID 5 arrays after a single drive failure and not lost data. But some people have not, and have lost data due to a URE during rebuild. I know more than one installation that has suffered this, and have heard reports of others.
    For small arrays of small (by today’s standards) disks, or SSD arrays, fine, use RAID 5; but for large arrays of large-capacity spinning disks, RAID 5 is dangerous. You may be OK, but you may not. Stick your head in the sand if you like, but I have fire insurance, and I think carefully about what is the appropriate RAID level for valuable corporate data.

  6. Chris O.

    The analogy doesn’t really work, unless the author was saying that the appliance was relying on a RAID-0 or JBOD volume. RAID-5 does provide some measure of insurance. RAID-5 + HS provides more. RAID-6 or RAID-6 + HS provides more still. I think the point of the article was just to call attention to people making decisions based on false or misleading quantitative figures.

    As another commenter pointed out, it comes down to the value of the data involved, and the need for continuous uptime. Is the incremental cost of purchasing an additional drive to go from RAID-5 to RAID-5 + HS justified by the likelihood of a URE during the time that it will take to order and receive a replacement drive, plus execute the rebuild? Is the incremental cost associated with getting a RAID-6-capable controller justified by the likelihood of a URE during a RAID-5 rebuild? There are dozens of other cost/benefit questions that can be posed to determine exactly what type of setup is appropriate.

    It is ironic though… plenty of companies I work with get so fixated on arguments like this, that they really don’t take into account all of the failure vectors that can affect them in similar ways. For example, they will have agonized over the redundancy in their array configuration, but they will not have a spare RAID card that can read the set type, because controllers can fail too – not just the drives. Similarly, they may not have a replacement PSU on the shelf for their drive chassis, or a spare FC card, or any number of other items that can fail.

  7. Greg schulz

    Nice article/post,

    URE/BER are real; however, the sky is not falling as some would have you believe.

    There are many different legacy, current, and emerging advanced RAID approaches, including parity- and RAID 2-derived erasure codes and FEC, among others.

    While some simply don’t like certain RAID levels for various reasons, others simply follow what they hear, or what they like to hear, and then amplify it. When something is repeated enough it can be taken for legend, and in some cases that is what has happened with RAID in general for some, and RAID 5 for others.

    It’s not that RAID 5 or any RAID level is bad (granted, there are some bad implementations); rather, it comes down to the decisions and configurations, as well as the types of drives and devices, and related choices of what to use when, where, why and how.

    For those who are truly focused on and concerned about URE/BER and related issues, I encourage them to take a step or two back, take a timeout from attacking those who don’t agree with them or their manifestos, theses or premises, and look at the bigger picture. For example, if really concerned about URE/BER, how about adding DIF and related end-to-end data integrity topics into the conversation? If the concern is really about devices failing and exposure, how about adding replication and dispersal protection to the conversation vs. focusing on just an approach or two?

    There is much more to data protection including RAID and everything is not the same, hence different approaches for various environments, some need RAID 0, others RAID 1, some RAID 10 while others RAID 4, 5, 6, 7 or other variations not to mention FEC and Erasure Coding along with good backups in various combinations.

    Here are some additional perspectives which I have included a link to the above post:

    Revisiting RAID storage remains relevant and resources
    http://storageioblog.com/revisiting-raid-remains-relevant-resources-context-matters/

    Ok, nuff said, for now, cheers gs

  8. patrick motley

    Why even risk it using RAID 5? RAID 6 or RAID 10 is the way to go with better odds.

    1. Olin Coles

      Although it’s clear you didn’t read the article, your question is still easy to answer: because not every storage scenario requires more than three disks. This scenario used an eight-disk RAID-10 array for primary storage (you must have missed that), and a backup system that uses removable RAID-5 cartridges. This is a small business that couldn’t budget for much more, and so double-parity for removable backup media just wasn’t necessary. So what kind of redundancy/parity do your backup media have?

  9. Arron

    Having installed a number of RAIDs starting in 1989 (a 49-disk SCSI array with a military budget), I use RAID6 IF it is important (i.e. critical). I have had fewer than half a dozen RAID5 rebuild failures over the years, so for most SMEs, RAID5 is fine – as part of my brief they always had a complete system disk backup somewhere off site plus a sound data backup arrangement. As a result, none of the RAID5 “fails” caused more than a minor inconvenience. Ditto the whole concept of “hot swap drives” – very few SMEs actually need that degree of redundancy.

    Another fine article Olin, well done.

    Arron
