What Happens When ECC Fails to Correct SSD Data?
Grandma falls out of the family photo.
By Kent Smith of LSI
Most users have no idea that reading electronic information from a data storage medium like a hard disk drive (HDD) or solid state drive (SSD) is plagued with read errors. For this reason error correction codes (ECC) are used to fix the random bit errors that arise during the reading process before the incorrect data is returned to the user. But the error correction codes can only handle so many errors at one time. If data errors exceed the ECC limits, the data goes uncorrected and is lost forever. More recent ECC algorithms like the LSI SHIELD error correction technology go a lot farther to protect user data than prior solutions.
What happens to the data when the ECC fails? If the ECC fails, only a backup protection mechanism will recover the data. There are three alternatives. First, users should always back up their critical data since ECC failure and other threats can destroy data or render it inaccessible such as natural disasters (earthquakes, tornadoes, flooding etc.) that cause heavy damage to buildings and their contents, lightning overloading that can burn up a computer without adequate electrical protection, and of course computer theft. Any backup system should be either automated or at least run consistently if it is manual. Industry reports cite that less than 10% of computer users back up their data. That is not very comforting.
The second solution is to employ a RAID (Redundant Array of Independent Disks) array that uses multiple storage devices with one or more of the drives acting as a parity device to provide redundancy. That way if one drive fails, the redundant drive provides enough parity information to restore the original data. This type of system is very common in enterprise environments—a work computer—but hardly used in home systems or laptop PC.
Is the third solution simple, automatic, and operable in a single-drive environment? Yes. Yes. And Yes. LSI® SandForce® flash and SSD controllers have a feature called RAISE™ data protection that meets all of these needs. Introduced in 2009 with the first SandForce controller, RAISE technology stands for Redundant Array of Independent Silicon Elements. It sounds like RAID, and acts something like RAID, but protects data using a single drive. With RAISE technology, the individual flash die act like the drives in a RAID array. The original RAISE level 1 technology protects against single page and block failures in the flash. These types of failures are beyond the protection of the ECC, but RAISE technology can recover the data.
With the introduction of the SF3700 this month, RAISE technology now offers more flexibility to deliver greater data protection. With the original RAISE level 1, the space of a full flash die had to be allocated solely to protect user data. In small-capacity configurations, like 64GB, RAISE level 1 required too much over provisioning and therefore had to be disabled or, with RAISE left on, its available capacity reduced to 60GB or 55GB. With a new enhancement to the SF3700, no such tradeoff is necessary. The new Fractional RAISE option for this first level of protection uses only a small portion of a die to protect user data in even the smallest configurations and preserve over provisioning (OP). This is important because, as I explained in my blog titled Gassing up your SSD, the more space you allocate for OP, the lower the write amplification, which translates to higher performance during writes and longer endurance of the flash memory.
Stronger data protection with RAISE level 2 A new RAISE level 2 capability offers even stronger data protection, safeguarding against multiple, simultaneous page and block failures, as well as a full die failure. If a die fails, the SandForce controller recovers the user data. RAISE level 2 includes Auto-Reallocation that can be set up to automatically redistribute and protect user data in the event of a subsequent die failure. Because the option to protect against a second die failure would reduce the available OP area, the RAISE level 2 feature can be set up to simply drop back to RAISE level 1 protection without sacrificing any OP space. .
Another new capability is an additional (9th) flash channel that enables the manufacturer to populate an extra flash package with one die that enables full RAISE level 1 protection while maintaining maximum user data capacity such as 64GB, 128GB, 256GB, etc. Without the 9th channel option, the SSD capacity would be forced to sacrifice a few GBs of capacity (reducing available user capacity to 60GB, 120GB, 240GB, respectively) because RAISE requires extra storage space.
Although all these new features cannot protect against the would-be thief or catastrophic drive failures from electrical surges or natural disasters, the probability of those events is much lower than a simple ECC failure. That’s why you would be best suited to have an SSD with RAISE technology to automatically protect against the more common ECC failures and then make a backup copy of your system at least periodically to protect your data against those far more serious events.