Boffin finds formula for four-year-five-nines disk arrays

Forty-five disk drives, ten parity drives, and 33 spare disks: that's the optimum array size to protect data for four years with no service visits, according to a study published at Arxiv. The problem the study addresses is that the world's rush towards hyperscale data centres puts an awful lot of disks in one place, and the …

  1. Anonymous Coward

    The concept of spares needs to go

    All those spindles provide additional performance. Use them. Arrays are virtual these days anyway, so as drives fail and rebuilds occur you lose spindles (performance declines) and you lose available capacity (not a big deal until it gets too close to what you're actually using).

    It may not be cost-effective to send out a guy to replace a single drive, but surely it is still cost-effective to have them replace dozens of drives if you need a mid-life performance/capacity kicker until the array has fully depreciated.

    1. theOtherJT Silver badge

      Re: The concept of spares needs to go

      I completely agree. Distributed scale-out storage across many nodes is clearly the way we're going. Let the software handle where any given data block is written / read from and just keep feeding it disks and CPU cycles as necessary.

      Breaks down a bit if you need to really slam a _lot_ of data down on the disks very fast because you end up IO bound by the speed of the network interface(s), but come on - in that case you're probably using some sort of flash storage on the local box anyway.

      1. Anonymous Coward

        Re: The concept of spares needs to go

        Four downvotes and not one comment as to why they think I'm wrong? Did the fanboys take a wrong turn on the way to the article about Apple's record quarter?

    2. Rebecca M

      Re: The concept of spares needs to go

      "All those spindles provide additional performance. Use them. Arrays are virtual these days anyway, so as drives fail and rebuilds occur you lose spindles (performance declines) and you lose available capacity (not a big deal until it gets too close to what you're actually using)."

      How do you use that performance? The kind of medium-scale array that is studied here will have no problem saturating a couple of 10GbE links even with relatively slow drives and dumb controllers. If you make the reasonable assumption that a mid-range array is tied to a mid-range network, where is that performance going to go?
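
      To put rough numbers on that - a back-of-the-envelope sketch in Python, where the per-drive and per-link figures are my own assumptions rather than anything from the study:

      ```python
      # Back-of-the-envelope: aggregate spindle bandwidth vs. a mid-range network.
      # Figures are illustrative assumptions, not measurements from the paper.
      drives = 55                        # 45 data + 10 parity in the working set
      per_drive_mb_s = 150               # assumed sustained throughput per drive
      links = 2                          # assumed front end: 2 x 10GbE
      per_link_mb_s = 10_000 / 8 * 0.9   # ~1250 MB/s raw, ~90% of that usable

      spindle_mb_s = drives * per_drive_mb_s
      network_mb_s = links * per_link_mb_s

      print(f"spindles: ~{spindle_mb_s:,.0f} MB/s, network: ~{network_mb_s:,.0f} MB/s")
      print(f"the spindles outrun the front end by ~{spindle_mb_s / network_mb_s:.1f}x")
      ```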

      I'd sooner have the spares in place and spun down when not in use. Less power, less cooling, less noise and the opportunity to force asymmetric wear on each drive, so that come the end of the array's life you don't get clumps of failures in quick succession according to what batch they were from.

      "Four downvotes and not one comment as to why they think I'm wrong? Did the fanboys take a wrong turn on the way to the article about Apple's record quarter?"

      I can't speak for everyone else, but for me there comes a point where a comment is so far removed from real-world experience that it simply isn't worth commenting on in the first instance.

    3. Solmyr ibn Wali Barad

      Re: The concept of spares needs to go

      Good suggestion - unless it gets taken literally, which is probably the cause of the downvotes. It's not an either/or proposition. There is no actual need to forget about the concept of spares; it is a valid concept with many usage scenarios, and it will remain viable.

      Distributed, massively parallel storage is also a valid approach, with or without dedicated spares. XIV (bought by IBM a few years back) was pretty much founded on this idea, and they did indeed put all spares to work. Zillions of ZFS-based setups can be configured whichever way you please. GPFS (or whatever it's called these days) too.

  2. Anonymous Coward

    Oops...

    "an array has 45 data disks, 10 data disks, and 33 spare disks"

    I'm assuming that was meant to read 10 parity disks, as it did at the beginning of the article.

    1. Richard Chirgwin (Written by Reg staff)

      Re: Oops...

      Oops indeed - thanks, I have fixed this.

      RC

  3. Robert Helpmann??
    Childcatcher

    Not terribly surprising

    Hardware and software usually contribute less to the cost of ownership of a system than the support staff needed to maintain it, at least in my experience. I would have liked to see a broader sample of disks for comparison, though, as altering variables just a little might result in much different outcomes. For example, HDD reliability varies greatly by manufacturer, and SSDs are missing entirely from this study. Also, much of the reliability data made available by drive vendors does not count failed drives that are replaced under warranty, which is perhaps why Backblaze's data was the only set used.

  4. Sampler

    I'm no storage king

    So this is a genuine question, dedicated parity drives?

    My understanding was RAID6 spreads the parity across the storage drives as dedicated parity drives end up dying sooner due to the high usage of the parity drive compared to the storage.

    Or are they talking simply in terms of storage, where the parity will be effectively written across all drives and the figures are just in terms of storage lost to parity (i.e. ten drives' worth from fifty-five disks, leaving forty-five disks' worth of space)?

    1. A Non e-mouse Silver badge

      Re: I'm no storage king

      I think you're confusing RAID 4, 5 & 6.

      RAID 4 has a dedicated parity disc.

      RAID 5 has 1 extra disc for storing parity data, but that parity data is spread across all the discs in the RAID group.

      RAID 6 has 2 extra discs for storing parity data. Again, that parity data is spread across all the discs.
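
      A minimal Python sketch of the single-parity idea (this is the RAID 4/5 principle only; RAID 6 adds a second, differently computed syndrome that plain XOR can't show):

      ```python
      # Single-parity (RAID 4/5 style) protection in miniature: parity is the XOR
      # of the data blocks, so any one lost block can be rebuilt from the rest.
      from functools import reduce

      def xor_blocks(blocks):
          """Byte-wise XOR of equal-length blocks."""
          return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

      stripe = [b"disc0blk", b"disc1blk", b"disc2blk"]   # one stripe, three data discs
      parity = xor_blocks(stripe)                        # the "extra disc's worth"

      # Disc 1 dies: rebuild its block from the survivors plus the parity block.
      rebuilt = xor_blocks([stripe[0], stripe[2], parity])
      assert rebuilt == stripe[1]
      ```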

    2. John Tserkezis

      Re: I'm no storage king

      "My understanding was RAID6 spreads the parity across the storage drives as dedicated parity drives end up dying sooner due to the high usage of the parity drive compared to the storage."

      Not quite. The 45 data drives and the 10 parity drives are actually part of the entire working set (55 drives). They're numbered like that so you can more easily determine how much space you have to work with (45 drives). Your regular data AND the parity data is spread evenly across the 55-drive set. Losing any one drive (or two for RAID6) results in only slightly slower data transfer operations, but otherwise your users may not even notice. Hopefully, the administrators do though...

      The spares, depending on configuration, may be powered up or not, but either way they do nothing until a drive in the working set (whichever drive that is) fails and a spare is called upon to take its place. Running out of spares is the same either way (whether they're all in use due to failures or you didn't have any spares at all), except that you then need manual intervention to swap the faulty drives for new ones.
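
      As a rough feel for how long a pool like that lasts, a toy Monte Carlo sketch - the 5% annual failure rate is an assumption picked for illustration, not the figure the paper uses:

      ```python
      # Toy simulation: 55 working drives draw on a pool of 33 spares as they fail.
      # The AFR below is an assumed illustrative figure, not the paper's input data.
      import random

      WORKING, SPARES, YEARS, AFR, TRIALS = 55, 33, 4, 0.05, 2_000
      MONTHLY_P = AFR / 12

      def pool_exhausted() -> bool:
          spares = SPARES
          for _month in range(YEARS * 12):
              failures = sum(random.random() < MONTHLY_P for _ in range(WORKING))
              spares -= failures          # each failed drive consumes one spare
              if spares < 0:
                  return True             # no spares left: manual intervention needed
          return False

      p = sum(pool_exhausted() for _ in range(TRIALS)) / TRIALS
      print(f"P(33 spares exhausted within {YEARS} years) ~ {p:.4f}")
      ```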

  5. Fazal Majid

    Theory and practice

    Typical academic paper making simplistic and very optimistic assumptions about failure modes. In my experience about one third to one half of storage faults are Byzantine, i.e. the drive doesn't just go down, it is actively attempting to sabotage your array by sending interference down the bus (especially on buses where this is theoretically impossible, like FC-AL) or exhibiting all sorts of other crippling behavior. Something like that will still require physical intervention.

    Here is an excellent introduction to the subject:

    http://dtrace.org/blogs/wesolows/2014/02/20/on-disk-failure/

    And of course John Gall's immortal classics about systems thinking.

    1. theOtherJT Silver badge

      Re: Theory and practice

      Or the controller card lets go.

      Or the power goes out to one of your racks and then the storage array's bios has a shit fit and refuses to come back up.

      Or one of the ram banks in the array is acting up leading to constant re-writes as the checksums fail and now the controller thinks there's something wrong with the disks and starts removing them from service.

      Or ONE of the network interfaces on the box goes down so the replication traffic to the other boxes in the cluster stops, but the outside world still thinks the disks are accessible for a while before STONITH kicks in and kills it, but it's too late by then and now you have a 14 hour rebuild on your hands when you bring the thing back up again.

      Or the firmware on the controller card said "JBOD" but wasn't really JBOD, and was still writing some sort of header to every disk, so when the drive fails and the spare fires up ZFS refuses to accept it as a replacement because there's some data on there already, and you wouldn't want to risk overwriting it, would you?

      You get 5 9's uptime by having a person on site who actually checks for shit like this. HA storage is _weird_

      1. yoganmahew

        Re: Theory and practice

        Absolutely. The six-sigma events three days in a row during the financial crisis seem to have taught modelling boffins very little.

        For me, 5 9s is a result of speed of recovery from failure, more so than of preventing failure itself. Statistically 'impossible' (yes, they're just unlikely) events happen quite often. Not having to wait for an engineer to travel with a spare part is what distinguishes a quick recovery from a slow one...

        1. Destroy All Monsters Silver badge

          Re: Theory and practice

          "The six-sigma events three days in a row during the financial crisis"

          These have as much to do with "reliability" and sigma-whatever (which is something that comes from manufacturing too, AFAIK) as hoping to survive repeated attempts at playing Russian roulette. Just saying.

          "Yes M'lord, we never managed to reach reliability significantly above 3 clicks in this game."

    2. Solmyr ibn Wali Barad

      Re: Theory and practice

      "the drive doesn't just go down, it is actively attempting to sabotage your array by sending interfere down the bus (specially on buses where this is theoretically impossible like FC-AL)"

      Oh yes, a misbehaving drive can cause lots of grief. Even in a modern SAS fabric.

      Original FC-AL, by the way, was very susceptible to bad acts. It's not even a bus, but an arbitrated loop. Which already says a lot about it. Loop devices must play nice with each other - unique loop ID, obligation to forward the traffic to other members, and in cases of conflict they must obey the elected arbiter. Hah. Like that's going to happen. Only when everybody is in a really good mood.

      Hub variants of FC-AL did away with big loops. Physical connections moved a bit towards star topology, whereas logical setup still emulated a loop. Small hubs were built into the drive enclosures, so it became possible to disconnect an offending drive without its consent, and set the slot circuits to bypass (so that loop traffic would be forwarded). Definitely better than original FC-AL, but disturbances are still quite possible.

      Many thanks for the link!

  6. Duncan Macdonald

    On site support

    If you have a large data centre then the cost of one person on site who can swap disks is not going to add much to the costs. (That person could even double as one of the security guards - a high level of ability is not needed to swap disk drives.)

    1. phuzz Silver badge

      Re: On site support

      I wouldn't trust any of the data-center security guards I've met to swap a disk. Don't get me wrong, some of them are bloody good guards, but I've never met one who was interested in what they were guarding.

      1. Anonymous Coward

        Re: On site support

        If they found someone local willing to pay for disks, I bet they'd get swapped quickly.

        (I worked my way through college doing security, I met some dubious characters on the job)

  7. A Non e-mouse Silver badge

    Costs

    "the cost of calling someone to replace a dead drive far outweighs the price of the disk"

    Someone's making the wrong comparison. You need to look at the cost of replacing the disc versus the value of the data on the disc. I suspect the disc is tiny in value, compared to that of the data it holds.

    1. CraPo

      Re: Costs

      " I suspect the disc is tiny in value, compared to that of the data it holds."

      Cat videos?

    2. Patrick R

      Re: Costs

      When the disc gets replaced, it's first unusable, then it's gone, and so is the data on it. Where do you see value?

    3. the spectacularly refined chap

      Re: Costs

      "Someone's making the wrong comparison. You need to look at the cost of replacing the disc versus the value of the data on the disc. I suspect the disc is tiny in value, compared to that of the data it holds."

      No, that is the wrong comparison. If you have data that you can't afford to lose on one device (or even one array), that is your problem - if you have a backup of the data on a drive, the value of the data on the dead one is meaningless.

      However, that still isn't the point they are making. It is being taken as read that the data must be protected and in that sense your point is the very opening premise of the study. They are not arguing over whether data should be protected but the most cost effective way of assuring that.

      Having said that, I'm still not convinced the comparison is valid. I'll admit my experience is at the lower end of the scale, only going up to a few tens of terabytes, but there the cost of the drives is usually around half of even the capital cost of the array. You have semi-fixed costs such as computer smarts and software on top, but the extra costs per unit are not inconsiderable either: physical enclosures, controllers and power supplies, which inevitably scale with the number of drives.

  8. John Robson Silver badge

    Assuming no batch failure modes

    Because that's a good assumption.

    A decade ago I learnt to mix'n'match batches in RAID arrays, and preferably to mix'n'match manufacturers...

    Batch failures are common, even if not due to a fault - as they all see the same lifecycle they all tend to fail together, or at least fail during the rebuild, when, having been brought to within 1% of its lifespan, the disk is then thrashed for dozens of hours to get all the data read as fast as possible.

    1. Anonymous Coward

      Re: Assuming no batch failure modes

      Agreed. We once lost 9 drives, one after another; the others were failing faster than the new ones were coming online.

      Thank <deity> for decent backups.

  9. This post has been deleted by its author

  10. Anonymous Coward

    Something new every day

    Except that my employer has been doing this for years. The drives are protected by RAID-6 and there are a bunch of automatically assigned spares in the box. Maybe not 33 spares for 45 drives, but enough that visits to replenish the spare pool are rare. When the pool of spares is getting low, the box calls home to the service centre and a service guy is dispatched to the site on a non-urgent basis.
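
    The logic amounts to something like the sketch below; the threshold and the notify_service_centre hook are made-up placeholders, not our vendor's actual interface:

    ```python
    # Sketch of the spare-pool call-home check. The threshold and the
    # notify_service_centre hook are hypothetical placeholders, not any
    # vendor's real API.
    SPARE_THRESHOLD = 4                       # assumed replenishment trigger

    def notify_service_centre(severity: str, message: str) -> None:
        print(f"[{severity}] {message}")      # stand-in for the real call-home path

    def check_spare_pool(spares_remaining: int) -> None:
        if spares_remaining <= SPARE_THRESHOLD:
            notify_service_centre(
                severity="non-urgent",
                message=f"only {spares_remaining} spares left, please replenish",
            )

    check_spare_pool(spares_remaining=3)      # would raise a non-urgent ticket
    ```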

    It looks like academia has just caught up to the idea.

    1. Jonathan Richards 1
      Boffin

      Re: Something new every day

      Not so much just caught up to the idea, as actually quantifying it with real-world failure numbers, and working out an optimum with a bit of maths. Just a bit more precise than your "bunch of". See icon!

      1. Tom 38

        Re: Something new every day

        It's real-world failure numbers for a specific type of load.

        If your real world load is not the same as theirs, I'm not sure you can tell too much from this.

        Personally, I think their entire premise is bogus - "How many disks do you need to plug into a server so you can just leave it for 4 years?" is not a question that needs answering, because the opex of providing someone to support your boxes is dwarfed by the cost of specifying an array of that size (in terms of extra initial cost, extra PDUs, extra rack space).

        They haven't even eliminated the person to maintain the server - every server needs an admin or two, even if you don't have to go put disks in it occasionally.

    2. Destroy All Monsters Silver badge
      Thumb Down

      Re: Something new every day

      "Except that my employer has been doing this for years."

      No he hasn't.

      You are confusing "I'm gonna do something along these lines like a rabid monkey with some fast guesses" with optimization.

      1. Anonymous Coward

        Re: Something new every day

        You are confusing "I'm gonna do something along these lines like a rabid monkey with some fast guesses" with optimization.

        Setting your spare level at 73% of your data drives is certainly not optimisation. What I am talking about is enterprise-size storage servers, not some little box of commodity drives stuck under somebody's desk. Failure rates are fairly well known; obviously the author of the original article must have used them in his calculations. You put in enough spares for an optimal service interval: it is a whole lot cheaper to have an entry-level tech visit a site, say, once a year to replace the used spares than it is to spare the box for the whole of its life.

        This is the reality of commercial practice, not some academic theory.

  11. Anonymous Coward

    what about enclosure failure?

    The answer to drive failures (and enclosure failures, which weren't mentioned) is to have a distributed grid with data spread pseudo-randomly over the enclosures, and then have all drives contain data, parity and hot-spare capacity.

    In that instance, the loss of a single drive brings all drives in the system, not just those in a single RAID set, into the rebuild operation, and depending on the RAID level you have used, can enable some extraordinarily quick rebuild times per TB.

    Sounds fanciful?

    IBM's XIV does this and has done so for years. It's a grid architecture and it has the capability to rebuild complete parity following a 4TB drive failure in under 1 hr on a fully configured system (more than 24 times faster than stated in the article).
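
    For anyone wondering how that works in principle, here is a small Python sketch of declustered, pseudo-random placement. It's the general idea only - not XIV's (or anyone else's) actual algorithm - and mirrored chunks stand in for whatever parity scheme is really used:

    ```python
    # Pseudo-random ("declustered") placement: every chunk lands on a hash-chosen
    # set of drives, so one drive's loss spreads rebuild reads over the whole grid.
    import hashlib

    DRIVES, COPIES = 60, 2            # mirrored chunks for simplicity

    def place(chunk_id: int) -> list[int]:
        """Deterministically map a chunk to COPIES distinct drives."""
        chosen, salt = [], 0
        while len(chosen) < COPIES:
            digest = hashlib.sha256(f"{chunk_id}:{salt}".encode()).digest()
            drive = int.from_bytes(digest[:4], "big") % DRIVES
            if drive not in chosen:
                chosen.append(drive)
            salt += 1
        return chosen

    # Drive 7 dies: the surviving copies of its chunks are scattered everywhere,
    # so (nearly) every remaining drive shares the rebuild read load.
    failed, sources = 7, set()
    for chunk in range(100_000):
        drives = place(chunk)
        if failed in drives:
            sources.update(d for d in drives if d != failed)
    print(f"rebuild of drive {failed} reads from {len(sources)} other drives")
    ```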

    1. Jan 0 Silver badge

      Re: what about enclosure failure?

      This is the proposal that DougS made in the first post. I don't understand why it's accumulating downvotes.

    2. theOtherJT Silver badge

      Re: what about enclosure failure?

      This is pretty much how CEPH works too.

      You specify a set of nodes, tell it how much parity to data you want, and let it get on with it. Lose a disk? The data on that disk is recalculated from parity (or just from redundant copies if you're doing what amounts to RAID 10) and written to other disks across the set. Lose a node? Give the other nodes a moment to decide that it's actually gone and isn't coming back, rather than this being some sort of transient network issue, and the same happens but on a larger scale.

      Half a dozen nodes with a dozen disks each and this becomes really very robust and very VERY fast, due to all those spindles being up and spinning all the time. You can also just throw more nodes at it when you want to increase capacity - which is just lovely.

    3. Fuzz

      Re: what about enclosure failure?

      HP's EVA did this as well: spares were just space reserved at the end of each drive. That way none of your spindles was unused, and if your array was at 50% capacity you could lose 50% of your disks (not all at the same time) without ever having to replace a drive.

      I don't understand why all arrays don't work this way.

  12. Christoph

    "the cost of calling someone to replace a dead drive far outweighs the price of the disk"

    So why use an expensive fleshy? It's a standard box in a standard slot with standard connections, arranged in a standard rack in known positions. Just have a juke-box type arm swing in and do the replacement.

    1. Alan Brown Silver badge

      "It's a standard box in a standard slot with standard connections, arranged in a standard rack in known positions."

      As is the location of the hard drive. You may as well just add an interface to that slot and forget about the robot. It's not like tapes where the complexity is in the tape drive and the cartridge is a simple unit.

      Telcos have been doing "periodic maintenance" for decades, it's a known quantity.

      There's an assumption being made that all the storage is in the same enclosure or even in the same datacentre. If it's that critical you don't do things like that.

  13. David Roberts

    Some weird assumptions

    Firstly the apparent assumption that arrays don't carry spares.

    I worked with RAID5 arrays in the '90s and there was always at least one hot spare.

    Secondly (as already pointed out), using the cost of replacing a single disc vs. leaving the array untouched for 4 years, with no apparent consideration of someone popping in once a month to replace failed drives as a bulk process.

    1. Adam 1

      Re: Some weird assumptions

      Plus the assumption that you run a data centre but would have to call a guy in to replace the drive?

    2. Anonymous Coward

      Re: Some weird assumptions

      It doesn't seem to reflect the experience, which I'm sure many here share, that cluster failures do happen and that there are SPOFs for this in the array chassis ('s?) - PSUs, backplanes, controllers.

      The IBM HPFS, or whatever its current label is, from the blurb seems to address most of the problems by spreading data and redundant recovery info amorphously across all the drives, even the "spares", so that the worst aspects of other disk array organisations, i.e. the performance hit and time taken for a reconfigure, are not showstoppers, and neither are multiple failures, even simultaneous ones. I'm sure this comes at ££s though.

      Urgent replacement of a drive rather than routine visits will usually be driven by SLA, and so the meatware time/cost compared to the loss of revenue or potential contracted penalties seems not to have been considered.

      Or the "customer mobility" concerns.

      Funny, that: real-world conditions.

    3. Destroy All Monsters Silver badge

      Re: Some weird assumptions

      "I worked with RAID5 arrays in the '90s and there was always at least one hot spare."

      Yes, and?

      "Secondly (as already pointed out), using the cost of replacing a single disc vs. leaving the array untouched for 4 years, with no apparent consideration of someone popping in once a month to replace failed drives as a bulk process."

      "Chief. About this disk array down in Antarctica? Can you have PFY pass by for a fast repair once a month?"

  14. Anonymous Coward

    Real estate costs

    If you find a suitable cooling solution it's possible to pack disk pods densely with no aisles and reduce your real estate costs. Floor loading might become an issue, though.

    1. Tom 38

      Re: Real estate costs

      A suitable cooling system means a DC that has enough cooling and power per rack to give you what you are asking for. DCs are designed with a specific wattage per rack.

      Since everyone wants more power and cooling, if you want more than the average, your DC provider is going to ream you for it.

  15. Anonymous Coward

    I wonder...

    ...how Google-large storage arrays deal with it.

    Do they have a dude driving a minivan and pushing a trolley filled with spare drives roaming about all datacenters, one per week? By the time he finishes the last datacenter, 4 years will have passed since he visited datacenter one.

    It is like lawn-mowing... or eventually it will come to that.

  16. Otto is a bear.

    Not very green either

    Running 33 hot spares and failed drives for 4 years. I don't know if it's still true, but the longer you leave a drive idle, the less likely it is to start up when you need it.

    There are also extra space costs for having 33 drives doing nothing.

  17. Anonymous Coward

    RAID failure cases are so ugly

    And that's before you even take controller failures into account.

    I really liked Figure 2 in the paper that shows how non-orthogonal the failure cases can be. By that I mean that with RAID you can't just give a fraction of tolerated disk failures, but have to consider clusters of worst-case scenarios. Their figure isn't for standard RAID, but you can still see how their scheme carries over the non-orthogonality of current RAID implementations.

    I dabble in using Rabin's Information Dispersal Algorithm as an alternative to RAID. I've only recently added some introductory information to my repo to show what it is and how it can be better than RAID. In fact, one of the points I made was that failure analysis is a cinch with IDA. Since IDA doesn't distinguish between data and parity, your redundancy level is a simple fraction and if you know the failure rates of individual disks it's very straightforward to calculate the probability of failure of the cluster as a whole. You can obviously go nuts and apply a Poisson arrival model or use prior probabilities to examine reliability over time if you want to, but it's not necessary.
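
    To show what I mean about the failure analysis being a cinch, here's a quick Python sketch of the k-of-n calculation. It assumes independent failures, and the 45-of-55 split and 5% per-disk failure probability are just illustrative numbers:

    ```python
    # P(cluster loss) for a k-of-n scheme: the data is gone only when fewer than
    # k of the n shares survive, i.e. when more than n - k disks fail.
    from math import comb

    def cluster_failure_prob(n: int, k: int, p: float) -> float:
        """Probability that more than n - k of n disks fail (per-disk prob p)."""
        return sum(comb(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(n - k + 1, n + 1))

    # e.g. 55 shares, any 45 reconstruct, 5% per-disk failure over the period
    print(f"{cluster_failure_prob(55, 45, 0.05):.3e}")
    ```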

    I'm constantly amazed at the number of researchers that persist with the mentality that XOR-based data redundancy systems (i.e., all RAID systems) are the way to go, and that non-orthogonal failure cases are acceptable. I get that XOR is cheap, but so is any other O(n) algorithm (like IDA) if it's done in hardware. And it's not even the case that XOR-based systems have to have non-orthogonal failure cases. There's a thing called "Online Codes" invented by Petar Maymounkov that follows from previous work on Raptor codes. It uses two layers of XOR and gives asymptotic (probabilistic) guarantees about the recoverability of the data with a given number of erasures. It might not be well-suited to use in storage systems (or it might, if anyone bothered to look into the maths), but it at least shows that orthogonality is possible. (I'm also in the process of implementing this in my repo, since it should be good for multicasting a file across my network/storage array so that individual nodes can then do the IDA bit, along with allowing for other nested/hybrid IDA setups.)

    I'm actually reminded, as I write this, of the whole debate around neural networks back when the "Perceptron" was discovered. Research into the whole field basically stalled for quite a few years because it was proven that the Perceptron couldn't encode an XOR rule. It wasn't until multi-layer neural networks were invented that this particular problem was overcome and progress started to be made again. I wonder if there isn't a similar artificial plateau effect happening these days with RAID systems?

  18. Jim O'Reilly
    Holmes

    Painting the Titanic!

    This would have been an interesting topic 15 years ago, when RAID was our only data integrity option. Today, with erasure codes and replication, the thesis of the article is badly off base (except for the idea that drive replacement is passé) and essentially irrelevant.

    We can get the benefit of no-repair storage arrays using erasure codes. This spreads the drives over a set of appliances. Typical configurations protect against a 6-drive loss, which allows plenty of time for rebuilds, which can happen anywhere in the storage pool (with, say, Ceph or other modern storage software). There is no need for huge numbers of dedicated spares up front, since adding new boxes of drives is the solution for sparing. Replication is not quite as good, typically providing 2-drive failure protection, but again recovery is to spread the data over available space or onto empty drives in a new appliance.
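
    The capacity argument is easy to see with a quick sketch; the 10+6 layout below is an assumed example of a 6-drive-loss configuration, not a quote from any particular product:

    ```python
    # Raw capacity consumed per unit of usable data: erasure coding vs. replication.
    def overhead(data_shards: int, coding_shards: int) -> float:
        return (data_shards + coding_shards) / data_shards

    print(f"EC 10+6 (survives 6 losses):  {overhead(10, 6):.2f}x raw per usable TB")
    print(f"3-way replication (2 losses): {overhead(1, 2):.2f}x raw per usable TB")
    ```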

    Object storage doesn't require disk-level recovery - objects can be spread over existing free space on many drives.

  19. J.G.Harston Silver badge

    All well and good....

    until some PHB says "we're running out of space, just use some of those spare drives"

  20. a_milan

    Finally

    ... a scientific proof that XIV is the right idea.

  21. Anonymous Coward
    Happy

    Of course this all falls down....

    ...when you own the data centre and work just across the corridor.
