Boffins say flash disk demands new RAID designs

Solid state disks (SSDs) are wonderfully fast, but every time you write to them, the semiconductors involved degrade just a little, a property that means you swap them into established storage rigs at your peril. That's the conclusion of a new paper, “Stochastic Analysis on RAID Reliability for Solid-State Drives” from a pair …

COMMENTS

  1. Gordan

    "you can't assume SSDs are always as reliable as spinning rust, especially when you use them in the same way."

    Does that last statement take into account just how woefully unreliable spinning rust is?

    1. Shades

      Do you mean the 10 year old 200GB 3.5" HD I've got in a USB/NAS box? What about the 17 year old 17GB 3.5" HD I've got that occasionally gets spun up to store small files? Or maybe you're on about the 20 year old 120MB 2.5" HD that lives inside my Amiga A1200? Do you mean those woefully unreliable lumps of spinning rust?

      Naturally usage patterns and environment affect lifetime, but to make such a sweeping statement about spinning HDs is silly, especially when SSDs have their own longevity problems.

      1. Matt 21

        Perhaps

        he's talking about my 6 year old Intel SSD which is still going strong and a pair of 3-5 year old spinning rust disks which failed in the last six months?

        On the other hand I don't think we can say that SSDs last longer, mostly the opposite, although they react better to being dropped whilst being used... well in my experience :-)

      2. This post has been deleted by its author

      3. Gordan

        @Shades

        You speak of disks from a somewhat more reliable era. My observation of the reliability track record of 1TB+ disks is that they, quite frankly, suck. Across my server estate (granted, only about 35 1TB+ disks, but enough to get the gist of the reliability you can expect) I've seen a failure rate of nearly 10% per year over three years - and the worst model I have has seen about 75% fail over three years; yes, some really are that dire.

        Of course, some makes/models are better than others, but the trend is pretty poor, both in terms of the number of complete disk failures and in terms of the disk defects (bad sectors) that arise. And bear in mind that every time you get a bad sector you are liable to lose that data.

        The primary cause of flash failure is wear - and that is predictable and trackable. Spinning rust can suffer an electro-mechanical failure at any time, and often does with little or no notice. Early warning and predictability are important if you value your data.

        1. Shades

          @Gordan

          Thanks for your polite and informative reply (a rarity on El Reg these days, especially during school holidays... I suppose it also helps that this isn't a topic about Apple/iPhone or Android!)

          It's a shame we can't use multiple icons, otherwise you'd have a beer too!

  2. Filippo Silver badge

    Couldn't wear-levelling actually work against you in a RAID environment, by causing all drives to fail at nearly the same time?

    1. Paul Crawford Silver badge

      Maybe - can anyone here say how good the SMART equivalent is for flash drives at telling the system about the underlying health/error rate of the storage cells that are being degraded by writes?

      1. lloyd_atkinson

        There isn't a separate SMART equivalent; SSDs still use SMART itself. My understanding, though, is that some RAID controllers don't allow the OS to query an individual drive for its SMART data.
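
        For what it's worth, here is a minimal sketch of what pulling wear-related attributes out of smartctl (smartmontools) can look like. The attribute names are vendor-specific, the "-d megaraid,N" pass-through only applies to certain controllers, and the device paths are just examples:

          # Rough sketch: read SSD wear attributes via smartmontools' smartctl.
          # Attribute names vary by vendor; drives behind some RAID controllers
          # need a pass-through option such as "-d megaraid,N".
          import subprocess

          WEAR_ATTRS = {"Media_Wearout_Indicator", "Wear_Leveling_Count", "SSD_Life_Left"}

          def wear_report(device, extra_args=()):
              """Return {attribute_name: normalised_value} for wear-related attributes."""
              out = subprocess.run(
                  ["smartctl", "-A", *extra_args, device],
                  capture_output=True, text=True, check=False,
              ).stdout
              report = {}
              for line in out.splitlines():
                  fields = line.split()
                  # SMART attribute rows look like: ID# NAME FLAG VALUE WORST THRESH ...
                  if len(fields) >= 4 and fields[0].isdigit() and fields[1] in WEAR_ATTRS:
                      report[fields[1]] = int(fields[3])
              return report

          if __name__ == "__main__":
              print(wear_report("/dev/sda"))                        # direct-attached drive
              print(wear_report("/dev/sda", ["-d", "megaraid,0"]))  # drive behind a MegaRAID HBA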

    2. Nigel 11

      Wear-levelling works against you?

      "Couldn't wear-levelling actually work against you in a RAID environment, by causing all drives to fail at nearly the same time?"

      Not exactly. Without wear-levelling a small subset of blocks would get hammered and the "disk" would fail a lot faster. RAID or no RAID.

      However, there's a difference in failure mode between an SSD array, where all devices will predictably degrade to an unacceptable state at about the same time, and an HD array, where the future failure of each device is pretty much unpredictable unless you are suffering from common-mode failure (i.e. a batch of bad disks). In any case HDD failure is linked firstly to hours in service, and then to seek activity. The volume of data written to an HDD does very little that could cause earlier failure.

      Parity RAID nearly doubles or trebles the number of write actions. With JBOD, M writes are spread over N disks. With RAID-5, 2M writes are spread over (N+1) disks. With RAID-6, 3M writes are spread over (N+2) disks. This will reduce the SSD life expectancy in that environment by ~2x or ~3x (a worked sketch follows at the end of this post).

      With SSDs you might want to define a new sort of array that puts differing amounts of parity on different drives, so each SSD in the array experiences a different level of write activity and wears out at a different rate. Of course, there would be a penalty w.r.t. the performance of such an array.
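
      A small worked version of the arithmetic above, under the same simplifying model (every logical write costs one data-block write plus one or two parity-block writes; the disk count and write volume below are made up):

        # Worked version of the model above: every logical write costs one data-block
        # write, plus one parity write under RAID-5 or two under RAID-6. Real arrays
        # complicate this with read-modify-write and full-stripe optimisations.
        def writes_per_device(m_logical_writes, n_data_disks, layout="jbod"):
            if layout == "jbod":
                return m_logical_writes / n_data_disks
            if layout == "raid5":
                return 2 * m_logical_writes / (n_data_disks + 1)
            if layout == "raid6":
                return 3 * m_logical_writes / (n_data_disks + 2)
            raise ValueError(layout)

        m, n = 1_000_000, 8                     # made-up write volume and disk count
        base = writes_per_device(m, n)
        for layout in ("raid5", "raid6"):
            ratio = writes_per_device(m, n, layout) / base
            print(layout, round(ratio, 2), "x the per-device wear of JBOD")
        # -> about 1.78x and 2.4x for 8 disks, approaching 2x and 3x as N grows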

      1. Gordan

        Re: Wear-levelling works against you?

        @Nigel 11

        I think you are exaggerating a worst-case scenario based on small writes with big stripes. If your writes are small, you should be using small RAID chunks.

  3. IO-IO

    That's why Pure Storage have a bespoke flash-aware RAID implementation.

    The real secret for treating flash well is to coalesce as many writes as possible.

    1. Suricou Raven

      Nice idea, with a small problem: that means caching a lot of writes in RAM for some time, which means lots of nice fun data corruption in the event of an interruption in service. It leaves your data in quite a fragile state - better make sure your UPSs have good batteries and hope nothing breaks down.

      1. Anonymous Coward

        Nothing new

        Mirrored battery-backed cache, or mirrored cache with de-staging to disk, fixes those issues in enterprise platforms. To be honest, pretty much all modern disk arrays will attempt to coalesce writes into a full stripe to avoid the RAID 5 write hole anyway. But regardless of the methodology used, cache is finite and at some point you need to de-stage to permanent media, so there's a fine balance between the ingest rate and the flush-to-disk rate; overstep that and your high-speed array's performance will tank. The problem with all these SSD-specific optimizations is that they target a transitory technology: all the good work being done now by the niche SSD vendors to work around flash's inherent flaws will likely be a waste of time and money in the very near future as replacement technologies emerge.
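
        As a loose illustration of why full-stripe coalescing matters (this is not any particular vendor's algorithm; the stripe width and block counts are invented for the example):

          # Loose illustration (not any vendor's algorithm): device I/O for N small
          # updates issued one at a time (read-modify-write per block) versus the same
          # N blocks held in cache and flushed as full RAID-5 stripes.
          def read_modify_write(n_blocks):
              # per block: read old data + old parity, then write new data + new parity
              return 2 * n_blocks, 2 * n_blocks          # (reads, writes)

          def full_stripe_flush(n_blocks, stripe_width):
              # coalesced blocks fill whole stripes: no reads, one parity write per stripe
              stripes = -(-n_blocks // stripe_width)     # ceiling division
              return 0, n_blocks + stripes               # (reads, writes)

          for n in (8, 64, 512):
              print(n, "blocks:", read_modify_write(n), "vs", full_stripe_flush(n, stripe_width=8))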

  4. bobbles31

    I thought that the whole point of RAID was the expectation that a drive would fail and the system could carry on.

    Surely the question becomes one of performance improvement versus maintenance cost. Given that you are going to be replacing drives more frequently, is the increased maintenance cost less than the increased revenue from having faster drives? If so, use SSDs; if not, don't. I would hazard a guess that across the industry there are relatively few applications that would see a net benefit.
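
    One way to frame that trade-off is a simple break-even check; every figure below is hypothetical and only shows the shape of the comparison:

      # Toy break-even check for the trade-off described above; every figure is
      # hypothetical - plug in your own replacement costs and revenue estimates.
      def ssd_worth_it(extra_revenue_per_year, ssd_cost, ssd_life_years,
                       hdd_cost, hdd_life_years, drives):
          ssd_spend = drives * ssd_cost / ssd_life_years   # annualised replacement cost
          hdd_spend = drives * hdd_cost / hdd_life_years
          return extra_revenue_per_year > (ssd_spend - hdd_spend)

      print(ssd_worth_it(extra_revenue_per_year=5_000, ssd_cost=400, ssd_life_years=3,
                         hdd_cost=150, hdd_life_years=5, drives=24))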

    1. Anonymous Coward

      "Given that you are going to be replacing drives more frequently"

      Why do you reckon that? Any device has an expected life (and a not very useful MTBF), but I don't think it is a given that an SSD would need to be replaced more frequently than an HDD in similar duty scenarios. Certainly you can posit a situation (e.g. write caching) where the finite write life of an SSD puts it at a disadvantage, but the reverse is also true: repeated reads on an HDD use up the anticipated life of the hardware, whereas reads off an SSD make little difference to its limited write endurance.

      Moreover, write endurance is only really an issue for write-heavy use of an SSD, and there's any number of options: the vendors favour predictive replacement ($$$), you could use premium-grade SLC NAND ($$), or for write caching you could even use a virtual disk in DRAM (home brew?).

      1. Nigel 11

        @Ledswinger good point

        Very good point. For read-mostly data SSDs should be much more reliable than HDDs, as well as faster.

  5. John Robson Silver badge

    RaidZ

    Does the variable stripe size not make this a much better approach than hardware RAID anyway?

  6. Alan Brown Silver badge

    spinning rust and other stuff

    For starters: given current disk ECC factors, you're statistically highly likely to get at least one piece of data corruption for every 8TB read (see the back-of-the-envelope arithmetic at the end of this post).

    Secondly: 2TB and larger drives have a terrible failure rate. I'd gotten used to seeing drives lasting 6-8 years and now they're down in the 3-4 year range again. Manufacturers dropped their warranty to 12 months for sound financial reasons.

    Thirdly: all disks lose stuff. All arrays drop their bundle and all users wipe something they shouldn't have. That's why you have BACKUPS - in a safe place offline so j random skiddie can't delete backup.tar.gz (I've seen this happen to a 200,000 user ISP who got careless with their userspace website.)

    Fourthly: RAID5 is a dead duck. If you use it with disk sizes above about 200GB it's highly likely you'll encounter another error whilst rebuilding the array and have to resort to... BACKUPS!

    RAID6 is better, but given the failure rate of 2-4TB drives and the time taken to rebuild an array, it's running too close to the wind for my liking - and that's speaking as the admin of about 500TB of live storage.

    Yes, of course RAID-on-SSD needs to be done better, but "classic" RAID itself has serious issues, especially when it comes to transparent detection AND CORRECTION of silent data errors, along with more prosaic things such as the time taken offline to fsck a filesystem (which can add a day to the boot time of some of my systems). There are better ways of building arrays at these kinds of scales and reliabilities (e.g. ZFS RAID-Z3). That's why large storage systems have premium price tags attached, and trying to do things with overextended cheaper gear frequently results in the difference being eaten up (and then some) in labour costs.
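
    As a back-of-the-envelope check on the first and fourth points, here is the usual unrecoverable-read-error arithmetic, assuming the commonly quoted consumer-drive spec of one URE per 10^14 bits read (drive classes differ, so treat the outputs as illustrative):

      # Back-of-the-envelope URE arithmetic, assuming the often-quoted consumer-drive
      # spec of one unrecoverable read error per 1e14 bits read (enterprise drives
      # quote better figures, so treat the outputs as illustrative).
      import math

      URE_PER_BIT = 1e-14
      TB = 1e12                                         # bytes

      def p_at_least_one_ure(bytes_read):
          return 1 - math.exp(-URE_PER_BIT * 8 * bytes_read)   # ~ 1 - (1 - 1e-14)**bits

      print("reading 8 TB:", round(p_at_least_one_ure(8 * TB), 2))
      # A RAID-5 rebuild means re-reading every surviving drive end to end.
      for size_tb in (0.2, 2, 4):
          surviving = 7                                 # e.g. an 8-drive RAID-5 set
          p = p_at_least_one_ure(surviving * size_tb * TB)
          print(f"rebuilding 8x {size_tb} TB RAID-5:", round(p, 2))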

    1. Nigel 11

      Re: spinning rust and other stuff

      "Secondly: 2TB and larger drives have a terrible failure rate. I'd gotten used to seeing drives lasting 6-8 years and now they're down in the 3-4 year range again. Manufacturers dropped their warranty to 12 months for sound financial reasons."

      So what do you say about my twelve 2TB disks that haven't experienced a single failure since they were installed three years ago? And they're not even server-grade ones. Could be luck, but if the MTBF really is three years then half of them should have failed by now, and the odds of NONE having failed are 1 in 2^12 (arithmetic sketched at the end of this post).

      I think you're generalizing from bad luck. I keep telling people that the big risk is common-mode failure caused by bad batches of disks. If you buy all your disks at once, then if one is of low reliability, the other N are likely the same. If you are running mirrored pairs, then you should buy half your disks from one manufacturer and the other half from a different manufacturer, and pair them heterogeneously. That minimizes the chance of losing the whole array to one bad batch of disks.

      I suspect manufacturers cut the warranty because disks were getting so cheap that the warranty replacement process was becoming too expensive compared to the disks themselves, even if the percentage failing in warranty has not changed.
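
      The arithmetic behind that one-in-2^12 figure, together with the stricter reading where three years is an MTBF for an exponentially distributed lifetime (either way, zero failures out of twelve would be a long shot if the drives really were that unreliable):

        # Nigel's arithmetic: if each drive only has even odds of surviving three
        # years, the chance that all twelve make it is 0.5**12. Reading the three-year
        # figure as an exponential MTBF makes zero failures rarer still.
        import math

        n_drives = 12
        print("three years as median life: 1 in", 2 ** n_drives)                          # 1 in 4096
        p_survive_mtbf = math.exp(-1)               # P(survive t = MTBF), exponential lifetime
        print("three years as MTBF:        1 in", round(1 / p_survive_mtbf ** n_drives))  # ~1 in 163,000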

  7. Alan Brown Silver badge

    BTW

    The issue of treating SSDs like spinning rust is precisely why the Flash-Friendly File System (F2FS) was developed. There are several parallel development paths being worked on.

  8. JoeT

    This seems oddly familiar

    Ah, a non-peer-reviewed arXiv paper on RAID on SSDs... where have I seen this before? Could it be the 2010 peer-reviewed paper from MS Research (http://research.microsoft.com/pubs/119126/euro093-balakrishnan.pdf), or maybe the 2009 peer-reviewed paper from UC Santa Cruz (http://www.ssrc.ucsc.edu/media/papers/greenan-hotdep09.pdf)? Can't really be sure.

    What I can be sure of is that the conclusion that SSDs don't play well with traditional RAID algorithms is at least four years old, and that these authors are just johnny-come-latelies.

This topic is closed for new posts.