Facebook SSD failure study pinpoints mid-life burnout rate trough

Facebook engineers and Carnegie Mellon researchers have looked into SSD failure patterns and found surprising temperature and data contiguity results in the first large-scale SSD failure study. In a paper (PDF) entitled A Large-Scale Study of Flash Memory Failures in the Field, they looked at SSDs used by Facebook over a four-year period, with many millions of days of usage.

  1. This post has been deleted by its author

    1. vogon00

      Have to agree with this... anyone else noticed the recent increase in 'oh, FFS' obvious typos etc.?

      1. Crazy Operations Guy

        I prefer obvious typos over small, insidious little errors in the technical data...

      2. M Couchman

        Not only that, but when I post a comment about bad proof-reading, it usually gets deleted. They don't like it up 'em.

  2. Destroy All Monsters Silver badge
    Paris Hilton

    The first diagram seems to indicate that about 20% of the drives of any manufacturer have high error rates (the rightmost part of the cumulative function).

    The second diagram I don't understand. Shouldn't there be "time" on the x axis?

    1. Anonymous Coward
      Anonymous Coward

      They've labelled x "usage", which is probably a passable colloquialism for whatever it is they're actually trying to express... data written or similar, one would imagine, despite the earlier implication that time is usage. I'm not sure time is really their forte...

      "In a paper (PDF) entitled A Large-Scale Study of Flash Memory Failures in the Field they looked at SSDs used by Facebook over a four year period, with many millions of days of usage."

      Erm, 1461 != "many millions"

      1. Anonymous Coward
        Anonymous Coward

        Erm

        1461 x lots and lots of SSDs = "many millions"

        Like how you can use more than one kWh in an hour.

      2. bitpushr

        If you have 685 SSD drives, and they've all been running for 4 years, that's 685*4*365 days' worth of data you've gathered.

        685*4*365 = 1,000,100 days' worth of SSD drive service.

        1. Anonymous Coward
          Anonymous Coward

          > 685*4*365 = 1,000,100 days' worth of SSD drive service.

          + 685 = 1,000,785

          You forgot to account for the leap year.
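
For anyone who wants to reproduce the sum, the arithmetic in this sub-thread works out as follows; note that the 685-drive figure is the commenter's illustrative example, not a number taken from the paper.

```python
# Reproducing the drive-days arithmetic from this sub-thread.
# NOTE: the 685-drive count is the commenter's illustrative figure,
# not a number taken from the Facebook/CMU paper.
drives = 685
years = 4
days_per_year = 365

drive_days = drives * years * days_per_year   # 685 * 4 * 365 = 1,000,100
leap_day_correction = drives                  # one extra day per drive for the leap year
total = drive_days + leap_day_correction      # 1,000,785

print(f"{drive_days:,} + {leap_day_correction:,} = {total:,} drive-days of service")
```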

  3. Anonymous Coward
    Anonymous Coward

    Well who would have guessed a bathtub failure rate?

    Just about every manufactured item exhibits that pattern of failure.

    1. Anonymous Coward
      Happy

      >Just about every manufactured item exhibits that pattern of failure.

      Bathtubs have a "bathtub curve" failure pattern too. Problems with poor installation, shipping or manufacturing at the start; hard water and fatigue-related cracking taking their toll after a few years.

  4. Mage Silver badge
    Devil

    Bathtub curve

    So just a different shape, but the same issue as most other electronic products in existence.

    I expect CFLs follow a similar curve. Early failures due to electronics and manufacturing defects. Later failures due to cooking electrolytics, gradual degradation of phosphors (brightness halves every x thousand hours), and abrupt failure due to loss of electrode emission if turned on and off every day.

    Of course the failure modes of SSDs are not the same. But I think they have been well understood for many years. This isn't news or ground-breaking research, but then it's a press release from an exploitative content-free / free-content advert network that wants to build a walled garden.

    1. Brewster's Angle Grinder Silver badge

      Re: Bathtub curve

      It's called a "bathtub curve" for a reason: it looks like a bathtub. (The failure rate starts high, drops off and stays low for a prolonged period, and then picks up again.) So the second graph is not what I would expect. And the rest of the research sounds new, as well.

      1. OliverJ

        Re: Bathtub curve

        I beg to disagree. The second graph looks like a bathtub curve with QA doing a better job compared to the good ol' days, when we had cold solder joints which tended to fail after a few hours of "burn in".
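
For readers who haven't met the term, the bathtub shape being debated above is often modelled as overlapping Weibull hazards. Below is a minimal sketch under that assumption, with invented parameters that are not fitted to the Facebook data.

```python
# Toy bathtub-shaped hazard rate: the sum of three Weibull hazard terms.
# Infant mortality (shape < 1), a constant random-failure floor (shape = 1),
# and wear-out (shape > 1). Parameters are made up for illustration only.

def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t):
    infant  = weibull_hazard(t, shape=0.5, scale=200.0)   # early failures, decreasing
    random_ = weibull_hazard(t, shape=1.0, scale=5000.0)  # constant background rate
    wearout = weibull_hazard(t, shape=5.0, scale=2000.0)  # late-life wear-out, increasing
    return infant + random_ + wearout

# Hazard starts high, drops to a long flat floor, then climbs again late in life.
for day in (1, 10, 100, 500, 1000, 1500, 2000):
    print(f"day {day:5d}: hazard ~ {bathtub_hazard(day):.5f}")
```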

  5. DuncanL

    Watering trough

    That "hero" picture must win some sort of award for "most tenuous link to the actual story"!

    1. Anonymous Coward
      Anonymous Coward

      Re: Watering trough

      I'm glad The Reg labelled it for us though. I thought it was a photo of a flock of pterodactyls.

      1. Little Mouse
        Boffin

        Re: Watering trough

        Kudos to Curiosity though.

        I'd say that photo well and truly answers the "Water on Mars" question.

  6. Anonymous Coward
    Anonymous Coward

    I would speculate that the reason it doesn't look like a traditional bathtub curve is that the initial failures are being caused by issues like atom migration, which wouldn't be picked up in a quick test before the drive leaves the factory. Atom migration is slow, but the damage would be cumulative, so you'd see a rise in failures over weeks. The manufacturing process would be designed to stop this from happening, though, so I could imagine reaching a state where there were few early-death cases left to find and the failure rate would then drop.

  7. This post has been deleted by its author

    1. asdf
      Trollface

      can't resist

      >The total number of errors per SSD is highly skewed, with a small fraction of SSDs accounting for a majority of the errors.

      OCZ? I keed.

      >So once a drive starts to show signs of failing, swap it out immediately. Funny, I've been doing that for 30 years...

      Except that with SSDs, from what I have heard, there are often no signs of physical failure like with spinning rust; they just shit the bed suddenly (controller takes a dump, etc.). Much more of an all-or-nothing type of device.

  8. AbortRetryFail
    Joke

    Damn you, Weibull!

    When come back, bring pie.

  9. tony72

    Anything for end users?

    Is data write throttling, or the avoidance of sparsely allocated data, something that end-users have any control over, or is this stuff under the control of SSD firmware/drivers/OS?

    1. Fan of Mr. Obvious

      Re: Anything for end users?

      For the end user you just buy cheaper spinning disks if you want to write at slower speeds :P

    2. Gerhard Mack

      Re: Anything for end users?

      These stats are for Facebook running these things in a very loaded server environment and do not correspond at all to desktop style loads.

  10. Innocent-Bystander*

    Something Useful from FB

    Look at Facebook putting out something that isn't a complete waste of time! I must have entered the Twilight Zone.

    So the takeaway from this message is: put my desktop's SSD beside a fan for longer life and only fill it to about 50-70% capacity to let the wear balancing algorithm do its magic.

    1. Gene Cash Silver badge

      Re: Something Useful from FB

      I think it was mostly CMU researchers, with the FB engineers going "oh really? that's interesting. put my name on it."

    2. John Brown (no body) Silver badge

      Re: Something Useful from FB

      "So the takeaway from this message is: put my desktop's SSD beside a fan for longer life and only fill it to about 50-70% capacity to let the wear balancing algorithm do its magic."

      Yes, so in effect nothing new there really.

  11. razorfishsl

    For every 10 °C cooler you run, you basically double the life of a component...

    This has been known for over 40 years...
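
Taken at face value, the rule of thumb above is a rough Arrhenius-style approximation rather than anything measured in the study; under that assumption the scaling works out like this:

```python
# Rough illustration of the "10 °C rule" quoted above: treat expected life as
# doubling for every 10 °C drop in operating temperature. This is a folk
# simplification of the Arrhenius relationship, not a figure from the paper.

def life_multiplier(delta_t_celsius: float) -> float:
    """Relative life vs. a baseline temperature, for a drop of delta_t_celsius."""
    return 2 ** (delta_t_celsius / 10.0)

for drop in (0, 5, 10, 20, 30):
    print(f"run {drop:2d} °C cooler -> ~{life_multiplier(drop):.2f}x the life")
```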

    Let us also be aware that no manufacturer is going to ship crap drives to a big outfit... they save that crap for the general public, so we can surmise this data is badly weighted.

    Next up:

    "Non-contiguously-allocated data leads to higher SSD failure rates"

    how would they know that?

    Unless they take the chips off the SSD and look at the actual data storage in the chip rather than through the controller...

    Just because you ASK the controller where it stuck the data is NO indication of WHERE it actually is on the chip surface or device; there is a mapping relationship in between.

    So whilst the controller mapping may say:

    "I stuck the data in blocks 5, 200, 70000"

    the data on the chip may be at physical locations 1, 2, 3.
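
To make that indirection concrete, here is a toy sketch of a flash translation layer doing exactly this remapping; the ToyFTL class is entirely hypothetical and ignores wear levelling, garbage collection and bad-block handling.

```python
# Minimal sketch of the logical-to-physical mapping described above: the host
# writes logical blocks 5, 200 and 70000, but the flash translation layer (FTL)
# is free to place them at adjacent physical locations 1, 2, 3.
# This ToyFTL is hypothetical and hugely simplified compared with a real controller.

class ToyFTL:
    def __init__(self):
        self.l2p = {}            # logical block address -> physical block address
        self.next_physical = 1   # naive "write wherever is next" allocator

    def write(self, logical_block: int) -> int:
        self.l2p[logical_block] = self.next_physical
        self.next_physical += 1
        return self.l2p[logical_block]

ftl = ToyFTL()
for lba in (5, 200, 70000):
    print(f"logical block {lba:6d} -> physical block {ftl.write(lba)}")
# Logically scattered data ends up physically contiguous (and vice versa),
# which is why the host-visible layout says little about layout on the die.
```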

    Also there is SIGNIFICANT research into read/write disturbance, which has shown that if data is written in contiguous blocks next to each other on the chip surface, it seriously stresses the silicon and causes corruption of data in the surrounding areas.

    Basically the reading/writing causes parts of the chip die to become electrically offset from the read/write amplifiers, resulting in the bit-pair boundaries being incorrectly recognised...

    Or to put it another way, the playing field is no longer level due to charge buildup in a concentrated area, which incorrectly sets the 'floor' for the boundary recovery of bit pairs compared to other areas of the silicon. Like playing football uphill.

  12. Anonymous Coward
    Anonymous Coward

    For avoidance of contiguous data we are going to need a 'disc fragmentor' utility. Anyone?

  13. Anonymous Coward
    Anonymous Coward

    LOOK at the drives

    This all relates to PCIe drives from a few years ago, of relatively large sizes... so, yes, not really much bearing on the 64 GB SATA drive you bought back then.
