IBM builds biggest-ever disk for secret customer

Flash may be one cutting edge of storage action, but big data is causing developments at the other side of the storage pond, with IBM developing a 120 petabyte 200,000-disk array. The mighty drive is being developed for a secret supercomputer-using customer "for detailed simulations of real-world phenomena" according to MIT's …

COMMENTS

This topic is closed for new posts.
  1. Anonymous Coward
    Anonymous Coward

    YAY!

    They received my order for the porn storage farm!

    I was getting worried by the silence for a while there.

    1. Solomon Grundy

      Hahahahahaha

      That is all!

      1. Marvin the Martian
        Unhappy

        But is it portable?

        That seems to be of little use if at home --- it's out on the cabin in the moors without connection and only sheep around that it would come into its own.

        1. Anonymous Coward
          Anonymous Coward

          Re: But is it portable?

          Ah, I can see why you would find this a problem.

          I however am half Welsh, so the sheep will be more than enough stimulation for me.

  2. chr0m4t1c

    I wonder what the MTBF is for the drives?

    A quick fag-packet calculation suggests to me that if the MTBF is 3 years, you're looking at 180+ drive failures every day (or about 7-8 an hour).

    That's gonna keep someone in gainful employment.

    Still, if your machine crashes you can go on holiday for 6 months while fsck runs.
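
The fag-packet calculation above can be reproduced in a few lines. This is a minimal sketch that assumes a constant failure rate across the whole fleet and a steady state in which failed drives are replaced; the function name is made up for illustration.

```python
def expected_failures_per_day(num_drives, mtbf_years):
    """Steady-state failures per day for a fleet, assuming a constant failure rate."""
    mtbf_days = mtbf_years * 365
    return num_drives / mtbf_days

# Figures from the comment: 200,000 drives, a guessed 3-year MTBF.
per_day = expected_failures_per_day(200_000, 3)
per_hour = per_day / 24
print(f"{per_day:.0f} failures per day, {per_hour:.1f} per hour")
```

With those inputs the result is a little over 180 failures a day, or between 7 and 8 an hour, which matches the estimate in the comment.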

    1. Version 1.0 Silver badge
      Devil

      Coming to ebay soon?

      MTBF is just a sadistic statistic in this situation and it's not really relevant in any practical sense at this scale. I would expect that they are tracking run-time and replacing the drives before the end-of-life arrives - which begs the question:

      Will the "used" drives start appearing on ebay in about two years?

      1. Steven Jones

        MTBF and operational lifetimes

        MTBF absolutely is critical when designing large storage arrays. It's the key number (along with mean time to recover - MTTR) that tells you the likelihood of a double failure within any one RAID set. It is the size of the RAID set, the number of failures it can withstand and the total number of RAID sets that matter on a 200,000-disk storage array (using RAID in its most general sense of storage redundancy). Note that MTTR on RAID sets including modern, very large disks can be measured in tens of hours. That is one of the reasons multi-level protection is becoming more important (the other being unrecoverable read errors, short of complete device failure, that prevent RAID rebuilds).

        There is a big question over just how trustworthy MTBF figures are. Google did a study a few years back demonstrating that failure rates are not random. They tend to be associated with particular batches, models and manufacturers (annoyingly they wouldn't identify the bad ones). Also, they found that failure predictors, including S.M.A.R.T. stats correlated very poorly with actual failures. I've had experience of that myself where very high and statistically extremely improbable failure rates were observed on a subset of disks over a month. That speaks of non-random failure modes. Those devices exhibited no warnings of any failures.

        Note that some arrays make the definition of what constitutes a RAID set an extremely slippery concept. Also, the concept of RAID sets and the consequences of common-mode failures compromising redundancy can make this a very tricky analysis.

        Also, I wish people would stop thinking MTBF has anything directly to do with the lifespan of a device. The MTBF is simply the average number of total operational hours that might be expected between failures for a large(ish) population of similar devices. The MTBF figure only applies to devices within given (and not well publicised) working lifespans. We have hard drives now with MTBFs approaching 100 years. However, nobody in their right minds believes these devices will actually run for 100 years before they fail - a decade would be good.

        In general, very large storage arrays have to be self-healing with dynamic spares and (within the operational lifespan) will rely on a mixture of re-active and pre-emptive device swapping, but I don't know of storage suppliers who swap drives out at fixed lifespan intervals (although I have known manufacturers swap out batches of suspect drives where excessive early failures have been detected in order to pre-empt catastrophic failures).

        Needless to say, manufacturers are less than forthcoming about this...
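
The MTBF and MTTR reasoning above can be sketched numerically. This is a simplified model, not anything a storage vendor publishes: it assumes independent failures at a constant rate (exactly the assumption the comment warns is shaky because of bad batches), and the set size, MTBF and MTTR values are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760

def prob_second_failure(set_size, mtbf_hours, mttr_hours):
    """Chance that at least one surviving disk in the set fails during the rebuild."""
    surviving = set_size - 1
    return 1.0 - math.exp(-surviving * mttr_hours / mtbf_hours)

def annualized_failure_rate(mtbf_hours):
    """Chance that a single disk fails within one year of operation."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

# Hypothetical inputs, not taken from any spec sheet.
p_double = prob_second_failure(set_size=10, mtbf_hours=1_000_000, mttr_hours=24)
afr = annualized_failure_rate(1_000_000)
print(f"second failure during rebuild: {p_double:.6f}")
print(f"annualized failure rate: {afr:.4f}")
```

Under these assumptions a 1,000,000-hour MTBF corresponds to an annual failure rate of just under 1%, which shows why the figure describes a rate across a population and not the lifespan of any one drive.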

      2. Lance 3

        Re: Coming to ebay soon?

        If you see 200,000 of them, then the answer is yes. Why bother tracking the drives; all were bought and put into service at the same time, so in 2 years it will essentially be a new array and if they wanted more storage, that is easy to do at that point as well.

    2. Ian Yates
      Mushroom

      Homogenous

      And these are 200,000 drives from the same manufacturer and, presumably, there won't be too many different manufacturing batches involved (unless they've been stock-piling drives for years) - so you could potentially see a situation where 50+ drives die in a very short time frame...

      Although, I'm sure they've thought it all through and we're just missing some info.

      /else: boom!

  3. Piro Silver badge
    FAIL

    Loving the..

    Paraphrased section in the middle.

    What a silly thing to say. Shit, even if they did manage to make it last a million years (haha, I doubt we'll even be around by then), it would be smaller than the flash drive you get free in a box of cereal.

    1. John Macintyre
      Joke

      last a million years

      probably comes with a limited one year warranty though...

  4. Disco-Legend-Zeke
    Pint

    Room For...

    ...every word spoken on Earth, and the transcript.

    Who would need that?

    1. Mike Powers
      Big Brother

      Good point

      Nuke simulations need RAM; I can't see how caching to a hundred-Petabyte disk is going to be do-able with any kind of responsiveness.

    2. Guido Esperanto

      /title

      GCHQ, FBI, CIA, MI5/6, Europol

      All nicely packaged in a fully reported access db :D....with service provision from the likes of EDS.

  5. lee harvey osmond

    It's Microsoft I tell you

    for installing a very early build of the next release of Windows

  6. Ken Hagan Gold badge
    Mushroom

    Million years

    If you are going to make claims like that, you need to factor in "rare" external risks such as Yellowstone blowing the west coast away, the Canaries washing the east coast away, or taking a direct hit from a 1km meteorite. (See icon for illustration.)

    Call me cynical, but I'm guessing that these discs probably *can't* take a multi-gigaton direct hit.

  7. Anonymous Coward
    Boffin

    120 petabytes is all well and good...

    ...but how many MP3s can it hold?

    1. A. Coatsworth Silver badge
      Pirate

      MP3

      What's El Reg's standard unit for storage capacity?

      If it is not the MP3, it should be...

      1. Elmer Phud

        mp3?

        Mp3 -- but what bitrate?

        Then we'll have a new Reg standard and a multiple for mp3 storage in micro-IBMs

        1. Anonymous Coward
          Coat

          If you go by the fine print on various device packages...

          ...it's 4-minute songs encoded at 128kbps. I'm actually a bit surprised they chose 128k; if the satellite radio people can fob off 64k swishy sloshy swish shit as "CD quality" - or maybe just "digital quality" which I suppose is technically true - then why not double the number of songs?

          Anybody else remember when that Creative Labs (IIRC) thing came out with the monstrous hard drive, and everyone else was spluttering about putting 10 songs in your 32mb device? And then there was Steve Jobs, who said, "Hmm, that thing is ugly, and the documentation and interface look unprofessional... This looks like a job for Jobs! Muahahahahahaha!"

          I actually saw a headline reading, "President To Give Speech On Jobs", and thought, geez, it's not THAT big of a deal..

          1. Marvin the Martian
            Facepalm

            "Surprised by 128kbps choice"?

            Well Aldi's been listening to you: http://www.aldi.co.uk/uk/html/offers/special_buys3_20542.htm?WT.mc_id=2011-09-02-09-08 -- they take music of unspecified length at 64kbps

  8. Mondo the Magnificent
    Coat

    Now to...

    ....hook this here array up to a Windows Server.. and run Scandisk on the beast.

    It should be done in about 15 years time....

    1. Anonymous Coward
      Boffin

      Strangely enough

      I was involved today with running a full mmfsck on a multi-terabyte GPFS filesystem. What was interesting was that it took almost exactly 90 minutes to check about 230TB, which was 90% full.

      If you say that this is about 150TB an hour, extrapolating this would mean that checking 120PB would take a shade over 34 days. And this is the checking rate for Power 6 hardware, and that assumes it is configured as a single GPFS filesystem (unlikely).

      I suspect that this article is about a Power 7 IH installation like the now defunct Blue Waters project. Everything from the wider racks to the water cooling would suggest this, although BlueGene/Q also has both of these attributes (but Lustre is the preferred filesystem for that system type).
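
The extrapolation in the comment above works out as follows. This sketch assumes the check rate scales linearly with capacity and that a petabyte is 1024 terabytes, which is the convention that reproduces the "shade over 34 days" figure; the variable names are illustrative.

```python
TB_PER_PB = 1024  # binary convention; an assumption that matches the quoted result

checked_tb = 230
elapsed_minutes = 90

# Measured rate, which the comment rounds down to 150 TB an hour.
measured_tb_per_hour = checked_tb / (elapsed_minutes / 60)
rounded_tb_per_hour = 150

array_tb = 120 * TB_PER_PB
hours_needed = array_tb / rounded_tb_per_hour
days_needed = hours_needed / 24
print(f"measured rate: {measured_tb_per_hour:.1f} TB/hour")
print(f"time for 120 PB at 150 TB/hour: {days_needed:.1f} days")
```

Using decimal petabytes instead (1000 TB each) gives a little over 33 days, so the conclusion is not sensitive to the convention.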

      1. Anonymous Coward
        Unhappy

        Re: Strangely enough

        And in 10 years, all of that power will be available in your iPad 20, and will be used to render more detailed fruit in Farmville.

      2. Anonymous Coward
        Flame

        Not GPFS as we know it

        Doubt this is GPFS as it currently stands - IBM has been working on PERCS to produce a system that can handle storage arrays bigger than this one. http://www.almaden.ibm.com/storagesystems/projects/perseus/

        Also the detail says this is one GPFS filesystem - so all under one namespace. Hence the insanely large amount of metadata required.

        I doubt Lustre would scale this far; it has only had distributed metadata support added in the last year, whereas GPFS was architected from the ground up to be distributed.

        1. Anonymous Coward
          Anonymous Coward

          Re: Not GPFS as we know it

          If it is part of PERCS, IBM has been working on a concept called Declustered RAID, codename Perseus, which runs software RAID within the GPFS layer. It uses a combination of 8+3 parity with Reed-Solomon encoding and track mirroring to spread data across the maximum number of spindles for performance while still maintaining good data resilience.

          I am extrapolating here, but I believe that the same technology is being deployed in their SONAS devices.

          All I can say is that I hope it will work as well as we are being told, because it will be a nightmare otherwise, as you will never be able to work out where data is actually being stored!

      3. Kebabbert

        @A.C

        So you did a filesystem check of 230TB in 90 minutes? Then you check 2.55 TB/minute = 42.5 GB/sec.

        Say that a disk checks 50MB/sec in practice. Then you need 850 disks to achieve 42.5 GB/sec.

        Did you really have 850 disks in racks? How many racks did you have with disks? Holey Moley! Entire rooms were full of disks? How many rooms?

        According to any SAS Enterprise disk spec sheet, such a disk encounters 1 irrecoverable error for every 10^16 bits read. So, if you have enough bits, you will face irrecoverable bit errors. Bit rot, and such stuff. So if you have 850 disks, then you have a lot of bit rot and bits flipped at random. That is why you use ECC RAM, because bits are flipped at random in RAM. The same thing happens on disks: bits flip at random. And guess what: such errors are sometimes not even detectable. Hardware RAID cannot detect, nor repair, such errors. The more disks you have, the more bit rot there will be, and then you need to protect against bit rot.
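
The arithmetic in this comment can be checked with a short sketch. It assumes decimal units (1 TB = 1000 GB), takes the 50 MB/sec per-disk rate and the 1-in-10^16 error rate straight from the comment, and supposes that every one of the 230 TB is actually read from disk.

```python
tb_checked = 230
elapsed_seconds = 90 * 60

# Aggregate throughput if all 230 TB were read in 90 minutes.
gb_per_sec = tb_checked * 1000 / elapsed_seconds

# Disks needed at the comment's assumed 50 MB/sec per disk.
mb_per_sec_per_disk = 50
disks_needed = gb_per_sec * 1000 / mb_per_sec_per_disk

# Expected unrecoverable read errors for one full pass over the data,
# at the quoted rate of 1 error per 10^16 bits read.
bits_read = tb_checked * 1e12 * 8
expected_errors = bits_read / 1e16

print(f"{gb_per_sec:.1f} GB/sec, {disks_needed:.0f} disks")
print(f"expected unrecoverable errors per full read: {expected_errors:.3f}")
```

At that error rate a single full read of 230 TB would be expected to hit an unrecoverable error roughly once in every five or six passes, so the concern is a matter of scale and repetition more than of any one scan.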

        1. Ilgaz

          He isn't doing surface scan

          he isn't doing chkdsk /r , he is just checking metadata and it takes that long.

          While on it, doing a full SMART test on the drive is a way better idea than chkdsk /r these days. If needed, of course.

          1. This post has been deleted by its author

          2. Anonymous Coward
            Boffin

            @Ilgaz

            If someone would add SAS support to an AIX port of smartmontools, then I would quite happily use it. Unfortunately, IBM has yet to add a SMART tool to the AIX toolset, and although I can compile the latest smartmond, it will not recognise SAS disks.

            The AIX error daemon is very good at picking up errors, but unfortunately, IBM no longer ship a sense data analysing tool as part of AIX, so you have to engage their hardware support if the human-readable diagnostic message does not give enough information, especially for some of the Temporary Hardware type errors.

            I am looking at porting this myself, but I fear that I have a steep learning curve ahead of me, because my knowledge of SCSI and the SAS transport layer has been largely at an academic level so far, and I am far happier writing C than C++.

            1. Ilgaz

              The author seems to be a nice guy

              I think the lag in AIX support has similar reasons to OS X's lack of directly compilable ports of software, e.g. why fink and macports exist.

              Lack of access to hardware (not everyone can run AIX let alone have root) and older libs (unlike linux).

              So, perhaps you AIX guys can offer some testing or (in case of IBM) actual hardware to test.

              I have read smartctl man pages before and therefore I guess the above reasons.

        2. Anonymous Coward
          Anonymous Coward

          @Kebabbert

          The geometry of this particular filesystem is 10 racks each of 12 disk drawers, each with 10 3.5" 300GB SAS disks, so a total of 1200 disks. Each rack has 2 Power 6 520 servers, each with 12 SAS RAID adapters contained in external expansion drawers.

          Each drawer of 10 disks is connected to two SAS RAID array cards running in HA mode, with each card in a different system for redundancy. Each set of 10 disks is arranged as an 8+2 RAID 6 array.

          The 120 individual RAID arrays are bound together by GPFS into a single filesystem.

          This is a standard layout for disk in Power 6 IH node (P6 575) deployments, so there are quite a few sites like this around the world.

          BTW, we lose about 1 spindle a month due to hardware failure in this particular storage cluster (actually, about 3 a month in this, its sister [we have more than one], and a number of smaller clusters). In total in the HPCs, we have in excess of 3000 disks providing application storage.
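
The geometry described above can be tallied in a short sketch. All counts come from the comment; the usable-capacity figure is a raw estimate that assumes decimal units and ignores filesystem overhead and any spare capacity.

```python
racks = 10
drawers_per_rack = 12
disks_per_drawer = 10
disk_gb = 300

servers_per_rack = 2
adapters_per_server = 12

# Each drawer forms one 8+2 RAID 6 array.
data_disks_per_array = 8

total_disks = racks * drawers_per_rack * disks_per_drawer
raid_arrays = racks * drawers_per_rack
servers = racks * servers_per_rack
adapters = servers * adapters_per_server

# Two adapters (one per server) serve each drawer in HA mode.
adapters_per_drawer = adapters // raid_arrays

usable_tb = raid_arrays * data_disks_per_array * disk_gb / 1000
print(total_disks, raid_arrays, servers, adapters, adapters_per_drawer, usable_tb)
```

The counts are self-consistent: 240 adapters across 120 drawers gives the two adapters per drawer that the HA arrangement needs.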

        3. This post has been deleted by its author

        4. Anonymous Coward
          Anonymous Coward

          @Kebabbert (again)

          Because of the distributed nature of GPFS, which from your post about bit rot, you obviously don't know about, there is no single system that has all of the storage attached to it, nor is it in a single RAID set, nor are all the disks involved in single block reads.

          In this case, there are 20 systems, each with a primary control of 6 raid arrays. So even if we were actually reading every block, (using your figure) the 42.5 GB/sec comes down to 2.125 GB/sec per system, or about 354 MB/sec per RAID adapter. Assuming just the 8 disks in each RAID set, this is then 44MB/sec per spindle, which would (just about) be in the realms of the possible. But as has been pointed out, mmfsck just checks the meta-data, and even that runs across all of the systems in the storage cluster.

          I get the impression that you've never really worked on very large systems.
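
The per-spindle breakdown above can be reproduced as a sketch. It takes the 42.5 GB/sec aggregate figure from the earlier comment and divides it evenly across systems, arrays and data disks, which assumes a perfectly balanced load.

```python
aggregate_gb_per_sec = 42.5
systems = 20
arrays_per_system = 6
data_disks_per_array = 8

gb_per_system = aggregate_gb_per_sec / systems
mb_per_array = gb_per_system * 1000 / arrays_per_system
mb_per_spindle = mb_per_array / data_disks_per_array

print(f"{gb_per_system:.3f} GB/sec per system")
print(f"{mb_per_array:.0f} MB/sec per RAID array")
print(f"{mb_per_spindle:.0f} MB/sec per spindle")
```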

          1. Kebabbert

            @A.C - data corruption

            "...there is no single system that has all of the storage attached to it, nor is it in a single RAID set, nor are all the disks involved in single block reads..."

            I have never assumed all disks are in one single raid system. I just wondered how many disks you had.

            Regarding my post about bit rot, I wonder if bit rot is taken care of, or is bit rot just ignored? The hardware alone is not capable of handling bit rot. You need to have a lot of checksums, which decreases performance significantly.

            .

            .

            "...But as has been pointed out, mmfsck just checks the meta-data, and even that runs across all of the systems in the storage cluster..."

            So the actual data is never checked. So when bits start to flip spontaneously, you have no way of detecting that. Even less, correct the corrupted bits. Maybe you should start to think about Silent Corruption. The more disks you have, the more corruption, and silent corruption, you will face. I hope you are at least using ECC RAM? If not, you should start to use ECC RAM. I really recommend it. You obviously don't care about corrupted bits on disks, so I would not be surprised if you don't care about corrupted bits in RAM either.

            .

            .

            "...I get the impression that you've never really worked on very large systems..."

            This is true. I have never worked on a very large system, but I am still allowed to ask questions, right?

            I get the impression you don't know too much about data corruption.

            1. Anonymous Coward
              Anonymous Coward

              re: Data corruption

              Kebabbert. What you have said appears to have been lifted almost exactly from the marketing spiel for ZFS, so I'm really not sure that your credentials in data corruption are that good.

              Strangely, bit rot on disks, although acknowledged as possible, does not appear to register as a big concern in most sysadmins' thoughts. Maybe it should, but it's not a hot topic.

              Spending some time looking into how Reed-Solomon block encoding is applied in RAID 6, I am reassured that even though single bit errors become a likelihood in large datastores, in order to actually cause a non-recoverable data loss, they have to occur in clusters (looking at a typical R-S encoding strategy, more than 16 in a 255 byte symbol-block if I read Wikipedia correctly), which even for large filestores is quite improbable, although my maths is too rusty to work out the statistics correctly.

              Regular reading and re-writing of the data (data scrubbing) is regarded as the best way of preventing gradual degradation in this case, and most modern RAID systems will do this automatically.

              Of course, the failure of multiple disks in a RAID set is a problem because the disks are of similar age (probably more so than bit-rot), which is why our RAID sets are RAID 6 with 8+2 parity, allowing (in theory) two disks to fail without data loss (and the Perseus implementation will allow 8+3), but more than two disks failing in a set would probably challenge most filesystems.

              BTW. I am primarily a sysadmin. I'm not really even an IT architect. I really don't have to understand in great detail how error-correction works as long as I trust the people who do the design. The model IBM uses to deploy large clusters involves some of the best people in the industry, and having designed a layout, they tend to stick to it, so I believe that most of the bases are covered.

  9. Lorddraco

    when it is mechanical .. it will fail

    as long as there are moving parts ... it will fail ...

    common disk failure is batch problem .. anyone in the storage industry knows this nightmare batch problem....

  10. John F***ing Stepp

    Brings back memories*

    Of walking along the racks checking what vacuum tubes (valves) weren't glowing.

    Back when we all had electrical heat, one 12v filament at a time.

    *Also one really bad pun.

  11. Anonymous Coward
    Big Brother

    NSA

    Nuke labs? Weather forecasting? Please. The customer is obvious, and the application is data warehousing of communications intercepts.

  12. Anonymous Coward
    Headmaster

    Wide racks

    Aren't EMC's "DMX" arrays in wide racks already?

    We* had to take the $%^&! things out of their shipping boxes before they'd fit in our lift, anyway - and ordinary 42RU 19" racks fit the lift in their shipping boxes.

    * By "we" I of course mean 'the horny-handed over-muscled lads from the shipping company'. Perish the thought that we soft-skinned pudgy** bespectacled ICT folks should soil our hands with this kind of labour.

    ** If we're so into 'Agile' development, how come all our developers are 170cm and 120kg?

    1. Marvin the Martian
      Meh

      **I'll hazard a guess.

      Inbreeding?

  13. Anonymous Coward
    Devil

    Can't wait..

    So how much power would be wasted just by identifying all drives in the array?

  14. Antoine Dubuc
    Big Brother

    Social Simulation

    Some universities are doing research in Social Simulation... ahhhh... behold the scaffolding of Isaac Asimov's PsychoHistory!

  15. alwarming
    Facepalm

    On paper probably IBM offered a deal too good for "them" to resist....

    But "they" will have to wait till the charges for software AMC, parts AMC and the whole solution add up.

  16. Lance 3

    A lot of lights

    That would be a lot of disk activity lights.

    "Soldier: Those lights are blinking out of sequence.

    Murdock: Make them blink in sequence."

    "Buck Murdock: Oh, cut the bleeding heart crap, will ya? We've all got our switches, lights, and knobs to deal with, Striker. I mean, down here there are literally hundreds and thousands of blinking, beeping, and flashing lights, blinking and beeping and flashing - they're *flashing* and they're *beeping*. I can't stand it anymore! They're *blinking* and *beeping* and *flashing*! Why doesn't somebody pull the plug!"

  17. Bernd Felsche

    Client: UEA for Climate Modelling?

    So that they can run lots of "scenarios" and keep a copy online as "proof"

    200,000 HDD at idle each consuming around 4W. And about 3 times that at peak.

    What's the cost of a 2 MW UPS?

    And of course 2 MW of airconditioning.

    Rack space requirement isn't enormous. 20 SFF drives fit across the width with a 2U height. So you only need a single 20,000-U rack cabinet. ;-)
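
The power and rack-space estimates above can be checked with a short sketch. The per-drive wattage, the threefold peak multiplier and the 20-drives-per-2U density are the comment's own figures, taken here at face value.

```python
drives = 200_000
idle_watts_per_drive = 4
peak_multiplier = 3

idle_kw = drives * idle_watts_per_drive / 1000
peak_kw = idle_kw * peak_multiplier

# 20 small-form-factor drives across the width of a 2U enclosure.
drives_per_enclosure = 20
rack_units_per_enclosure = 2
rack_units = drives // drives_per_enclosure * rack_units_per_enclosure

print(f"idle: {idle_kw:.0f} kW, peak: {peak_kw:.0f} kW, rack units: {rack_units}")
```

This covers the drives alone; controllers, servers and cooling overhead would add to the electrical load.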

  18. Anonymous Coward
    Anonymous Coward

    are you sure they're disks?

    Wasn't there a massive flash drive purchase last week? Of this same size?

  19. Toastan Buttar
    Linux

    The secret customer is...

    John Hammond, Isla Nublar.

  20. Ken 16 Silver badge
    Terminator

    in 20 years

    only the cheapest mobile phones will have 120PB
