IBM builds biggest-ever disk for secret customer

Flash may be one cutting edge of storage action, but big data is causing developments at the other side of the storage pond, with IBM developing a 120 petabyte 200,000-disk array. The mighty drive is being developed for a secret supercomputer-using customer "for detailed simulations of real-world phenomena" according to MIT's …

COMMENTS

This topic is closed for new posts.

Thursday 1st September 2011 11:39 GMT Anonymous Coward

YAY!

They received my order for the porn storage farm!

I was getting worried by the silence for a while there.

23 0
1. Thursday 1st September 2011 12:17 GMT Solomon Grundy
  
  Hahahahahaha
  
  That is all!
  
  1 0
  1. Friday 2nd September 2011 08:31 GMT Marvin the Martian
    
    But is it portable?
    
    That seems to be of little use if at home --- it's out on the cabin in the moors without connection and only sheep around that it would come into its own.
    
    0 0
    1. Friday 2nd September 2011 14:14 GMT Anonymous Coward
      
      Re: But is it portable?
      
      Ah, I can see why you would find this a problem.
      
      I, however, am half Welsh, so the sheep will be more than enough stimulation for me.
      
      0 0
Thursday 1st September 2011 12:18 GMT chr0m4t1c

I wonder what the MTBF is for the drives?

A quick fag-packet calculation suggests to me that if the MTBF is 3 years, you're looking at 180+ drive failures every day (or about 7-8 an hour).

That's gonna keep someone in gainful employment.

Still, if your machine crashes you can go on holiday for 6 months while fsck runs.
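
For anyone who wants to check the fag-packet sum above, here is a minimal sketch in Python. It assumes a constant failure rate of 1/MTBF per drive (no batch effects, no wear-out), and the function name `expected_failures_per_day` is made up for illustration.

```python
def expected_failures_per_day(drive_count: int, mtbf_years: float) -> float:
    """Expected drive failures per day, assuming each drive fails at a
    constant rate of 1/MTBF (an assumption, not a vendor figure)."""
    mtbf_days = mtbf_years * 365
    return drive_count / mtbf_days


# Figures from the comment: 200,000 drives and a guessed 3-year MTBF.
per_day = expected_failures_per_day(200_000, 3)
per_hour = per_day / 24
print(f"{per_day:.0f} failures/day, {per_hour:.1f} failures/hour")
```

With those inputs the sum comes out at roughly 183 a day, or 7.6 an hour, which matches the "180+" and "7-8 an hour" estimate.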

3 0
1. Thursday 1st September 2011 12:58 GMT Version 1.0
  
  Coming to ebay soon?
  
  MTBF is just a sadistic statistic in this situation and it's not really relevant in any practical sense at this scale. I would expect that they are tracking run-time and replacing the drives before the end-of-life arrives - which begs the question:
  
  Will the "used" drives start appearing on ebay in about two years?
  
  1 2
  1. Thursday 1st September 2011 15:02 GMT Steven Jones
    
    MTBF and operational lifetimes
    
    MTBF absolutely is critical when designing large storage arrays. It's the key number (along with mean time to recover - MTTR) that tells you the likelihood of a double failure within any one RAID set. It is the size of the RAID set, the number of failures it can withstand and the total number of RAID sets that matter on a 200,000-disk storage array (using RAID in its most general sense of storage redundancy). Note that MTTR on RAID sets including modern, very large disks can be measured in 10s of hours. That is one of the reasons multi-level protection is becoming more important (the other being unrecoverable read errors, short of complete device failure, that prevent RAID rebuilds).
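
To make the MTBF/MTTR point concrete, here is a rough sketch of the chance of a second failure in a RAID set while it is rebuilding. It assumes independent drives with exponentially distributed failures, which is optimistic if failures cluster by batch. The 10-drive set, 1,000,000-hour MTBF and 24-hour MTTR are illustrative values, not figures from IBM, and the function name is hypothetical.

```python
import math


def second_failure_probability(set_size: int, mtbf_hours: float, mttr_hours: float) -> float:
    """Probability that at least one surviving drive in a RAID set fails
    during the rebuild window, assuming independent exponential failures."""
    surviving_drives = set_size - 1
    # Each surviving drive fails at rate 1/MTBF, so the combined rate is
    # surviving_drives / MTBF; the exposure window is the MTTR.
    return 1.0 - math.exp(-surviving_drives * mttr_hours / mtbf_hours)


# Illustrative (assumed) values: 10-drive set, 1,000,000 h MTBF, 24 h rebuild.
p = second_failure_probability(set_size=10, mtbf_hours=1_000_000, mttr_hours=24)
print(f"Chance of a second failure during one rebuild: {p:.6f}")
```

A set that tolerates two failures (RAID 6 style) would need a third failure inside the window, which this sketch does not model.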
    
    There is a big question over just how trustworthy MTBF figures are. Google did a study a few years back demonstrating that failure rates are not random. They tend to be associated with particular batches, models and manufacturers (annoyingly they wouldn't identify the bad ones). Also, they found that failure predictors, including S.M.A.R.T. stats correlated very poorly with actual failures. I've had experience of that myself where very high and statistically extremely improbable failure rates were observed on a subset of disks over a month. That speaks of non-random failure modes. Those devices exhibited no warnings of any failures.
    
    Note that some arrays make the definition of what constitutes a RAID set an extremely slippery concept. Also, the concept of RAID sets and the consequences of common-mode failures compromising redundancy can make this a very tricky analysis.
    
    Also, I wish people would stop thinking MTBF has anything directly to do with the lifespan of a device. The MTBF is simply the average number of total operational hours that might be expected between failures for a large(ish) population of similar devices. The MTBF figure only applies to devices within given (and not well publicised) working lifespans. We have hard drives now with MTBFs approaching 100 years. However, nobody in their right minds believes these devices will actually run for 100 years before they fail - a decade would be good.
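
One way to see why a 100-year MTBF says nothing about lifespan is to turn it into an annual failure rate for a population. A minimal sketch, assuming a constant failure rate within the working life (the usual exponential model, which is an assumption here); `annualized_failure_rate` is a made-up name.

```python
import math


def annualized_failure_rate(mtbf_years: float) -> float:
    """Fraction of a large drive population expected to fail in one year,
    assuming a constant failure rate of 1/MTBF during the working life."""
    return 1.0 - math.exp(-1.0 / mtbf_years)


afr = annualized_failure_rate(100)
print(f"MTBF of 100 years -> about {afr:.2%} of drives failing per year")
print(f"In a 200,000-drive array that is about {200_000 * afr:.0f} drives a year")
```

So a "100-year" drive population still loses about one drive in a hundred every year, and the figure stops applying once the drives age out of their rated working life.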
    
    In general, very large storage arrays have to be self-healing with dynamic spares and (within the operational lifespan) will rely on a mixture of re-active and pre-emptive device swapping, but I don't know of storage suppliers who swap drives out at fixed lifespan intervals (although I have known manufacturers swap out batches of suspect drives where excessive early failures have been detected in order to pre-empt catastrophic failures).
    
    Needless to say, manufacturers are less than forthcoming about this...
    
    6 0
  2. Friday 2nd September 2011 02:08 GMT Lance 3
    
    Re: Coming to ebay soon?
    
    If you see 200,000 of them, then the answer is yes. Why bother tracking the drives; all were bought and put into service at the same time, so in 2 years it will essentially be a new array and if they wanted more storage, that is easy to do at that point as well.
    
    0 0
2. Thursday 1st September 2011 12:58 GMT Ian Yates
  
  Homogenous
  
  And these are 200,000 drives from the same manufacturer and, presumably, there won't be too many different manufacturing batches involved (unless they've been stock-piling drives for years) - so you could potentially see a situation where 50+ drives die in a very short time frame...
  
  Although, I'm sure they've thought it all through and we're just missing some info.
  
  /else: boom!
  
  0 0
Thursday 1st September 2011 12:30 GMT Piro

Loving the..

Paraphrased section in the middle.

What a silly thing to say. Shit, even if they did manage to make it last a million years (haha, I doubt we'll even be around by then), it would be smaller than the flash drive you get free in a box of cereal.

2 0
1. Thursday 1st September 2011 13:20 GMT John Macintyre
  
  last a million years
  
  probably comes with a limited one year warranty though...
  
  1 0
Thursday 1st September 2011 12:58 GMT Disco-Legend-Zeke

Room For...

...every word spoken on Earth, and the transcript.

Who would need that?

3 0
1. Thursday 1st September 2011 19:10 GMT Mike Powers
  
  Good point
  
  Nuke simulations need RAM; I can't see how caching to a hundred-Petabyte disk is going to be do-able with any kind of responsiveness.
  
  1 0
2. Thursday 1st September 2011 22:55 GMT Guido Esperanto
  
  /title
  
  GCHQ, FBI, CIA, MI5/6, Europol
  
  All nicely packaged in a fully reported access db :D....with service provision from the likes of EDS.
  
  2 0
Thursday 1st September 2011 13:20 GMT lee harvey osmond

It's Microsoft I tell you

for installing a very early build of the next release of Windows

4 0
Thursday 1st September 2011 13:54 GMT Ken Hagan

Million years

If you are going to make claims like that, you need to factor in "rare" external risks such as Yellowstone blowing the west coast away, the Canaries washing the east coast away, or taking a direct hit from a 1km meteorite. (See icon for illustration.)

Call me cynical, but I'm guessing that these discs probably *can't* take a multi-gigaton direct hit.

3 0
Thursday 1st September 2011 14:26 GMT Anonymous Coward

120 petabytes is all well and good...

...but how many MP3s can it hold?

0 0
1. Thursday 1st September 2011 17:14 GMT A. Coatsworth
  
  MP3
  
  What's El Reg's standard unit for storage capacity?
  
  If it is not the MP3, it should be...
  
  0 0
  1. Thursday 1st September 2011 17:24 GMT Elmer Phud
    
    mp3?
    
    Mp3 -- but what bitrate?
    
    Then we'll have a new Reg standard and a multiple for mp3 storage in micro-IBMs
    
    0 0
    1. Thursday 1st September 2011 19:09 GMT Anonymous Coward
      
      If you go by the fine print on various device packages...
      
      ...it's 4-minute songs encoded at 128kbps. I'm actually a bit surprised they chose 128k; if the satellite radio people can fob off 64k swishy sloshy swish shit as "CD quality" - or maybe just "digital quality" which I suppose is technically true - then why not double the number of songs?
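
Taking the fine-print definition at face value, the MP3 capacity of the array can be sketched as follows. This assumes decimal units (1 PB = 10^15 bytes, 1 kbps = 1,000 bits per second) and ignores filesystem and tag overhead; `songs_that_fit` is a hypothetical helper.

```python
def songs_that_fit(capacity_bytes: int, song_seconds: int = 240, bitrate_bps: int = 128_000) -> int:
    """Number of constant-bitrate songs that fit in capacity_bytes."""
    song_bytes = song_seconds * bitrate_bps // 8  # bits to bytes
    return capacity_bytes // song_bytes


CAPACITY_BYTES = 120 * 10**15  # 120 PB, decimal

print(f"{songs_that_fit(CAPACITY_BYTES):,} songs at 128 kbps")
print(f"{songs_that_fit(CAPACITY_BYTES, bitrate_bps=64_000):,} songs at 64 kbps")
```

Halving the bitrate to 64 kbps doubles the count, as the comment suggests.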
      
      Anybody else remember when that Creative Labs (IIRC) thing came out with the monstrous hard drive, and everyone else was spluttering about putting 10 songs in your 32mb device? And then there was Steve Jobs, who said, "Hmm, that thing is ugly, and the documentation and interface look unprofessional... This looks like a job for Jobs! Muahahahahahaha!"
      
      I actually saw a headline reading, "President To Give Speech On Jobs", and thought, geez, it's not THAT big of a deal..
      
      2 0
      1. Friday 2nd September 2011 10:34 GMT Marvin the Martian
        
        "Surprised by 128kbps choice"?
        
        Well Aldi's been listening to you: http://www.aldi.co.uk/uk/html/offers/special_buys3_20542.htm?WT.mc_id=2011-09-02-09-08 -- they take music of unspecified length at 64kbps
        
        0 0
Thursday 1st September 2011 15:38 GMT Mondo the Magnificent

Now to...

....hook this here array up to a Windows Server.. and run Scandisk on the beast.

It should be done in about 15 years time....

0 0
1. Thursday 1st September 2011 17:23 GMT Anonymous Coward
  
  Strangely enough
  
  I was involved today with running a full mmfsck on a multi-terabyte GPFS filesystem. What was interesting was that it took almost exactly 90 minutes to check about 230TB, which was 90% full.
  
  If you say that this is about 150TB an hour, extrapolating this would mean that checking 120PB would take a shade over 34 days. And this is the checking rate for Power 6 hardware, and that assumes it is configured as a single GPFS filesystem (unlikely).
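
The extrapolation above can be checked with a short sketch. It assumes the check time scales linearly with capacity and that 1 PB is counted as 1,024 TB, which is the convention that reproduces the "shade over 34 days" figure; `fsck_days` is an illustrative name.

```python
def fsck_days(capacity_pb: float, rate_tb_per_hour: float, tb_per_pb: int = 1024) -> float:
    """Days needed to check capacity_pb at a fixed rate, assuming linear scaling."""
    hours = capacity_pb * tb_per_pb / rate_tb_per_hour
    return hours / 24


rounded_rate = 150          # TB/hour, the rounded figure used in the comment
measured_rate = 230 / 1.5   # TB/hour, from 230 TB in 90 minutes

print(f"At 150 TB/h:      {fsck_days(120, rounded_rate):.1f} days")
print(f"At measured rate: {fsck_days(120, measured_rate):.1f} days")
```

Using the unrounded rate, or decimal petabytes, brings the answer down to a little over 33 days, so the estimate is not sensitive to the rounding.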
  
  I suspect that this article is about a Power 7 IH installation like the now defunct Blue Waters project. Everything from the wider racks to the water cooling would suggest this, although BlueGene/Q also has both of these attributes (but Lustre is the preferred filesystem for that system type).
  
  1 0
  1. Thursday 1st September 2011 22:55 GMT Anonymous Coward
    
    Re: Strangely enough
    
    And in 10 years, all of that power will be available in your iPad 20, and will be used to render more detailed fruit in Farmville.
    
    0 0
  2. Thursday 1st September 2011 22:55 GMT Anonymous Coward
    
    Not GPFS as we know it
    
    Doubt this is GPFS as it currently stands - IBM has been working on PERCS to produce a system that can handle storage arrays bigger than this one. http://www.almaden.ibm.com/storagesystems/projects/perseus/
    
    Also the detail says this is one GPFS filesystem - so all under one namespace. Hence the insanely large amount of metadata required.
    
    I doubt Lustre would scale this far; it has only had distributed metadata support added in the last year, whereas GPFS was architected from the ground up to be distributed.
    
    0 0
    1. Friday 2nd September 2011 11:10 GMT Anonymous Coward
      
      Re: Not GPFS as we know it
      
      If it is part of PERCS, IBM has been working on a concept called Declustered RAID, codename Perseus, which runs software RAID within the GPFS layer. It uses a combination of 8+3 parity with Reed-Solomon encoding and track mirroring to spread data across the maximum number of spindles for performance while still maintaining good data resilience.
      
      I am extrapolating here, but I believe that the same technology is being deployed in their SONAS devices.
      
      All I can say is that I hope it will work as well as we are being told, because it will be a nightmare otherwise, as you will never be able to work out where data is actually being stored!
      
      0 0
  3. Thursday 1st September 2011 22:55 GMT Kebabbert
    
    @A.C
    
    So you did a filesystem check of 230TB in 90 minutes? Then you check 2.55 TB/minute = 42.5 GB/sec.
    
    Say that a disk checks 50MB/sec in practice. Then you need 850 disks to achieve 42.5 GB/sec.
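
The arithmetic so far, as a sketch. It assumes every byte of the 230TB is actually read (a full surface scan), decimal units, and the guessed 50MB/sec per disk; `required_disks` is a made-up helper.

```python
def required_disks(data_tb: float, minutes: float, per_disk_mb_s: float) -> float:
    """Disks needed to read data_tb in the given time, if every byte is
    read and each disk sustains per_disk_mb_s (decimal units)."""
    aggregate_mb_s = data_tb * 1_000_000 / (minutes * 60)
    return aggregate_mb_s / per_disk_mb_s


aggregate_gb_s = 230 * 1000 / (90 * 60)
print(f"Aggregate rate: {aggregate_gb_s:.1f} GB/sec")
print(f"Disks needed at 50 MB/sec each: {required_disks(230, 90, 50):.0f}")
```

Without intermediate rounding the figures come out very slightly higher than 42.5 GB/sec and 850 disks, but the order of magnitude is the same.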
    
    Did you really have 850 disks in racks? How many racks did you have with disks? Holey Moley! Entire rooms were full of disks? How many rooms?
    
    According to any SAS Enterprise disk spec sheet, such a disk encounters 1 irrecoverable error for every 10^16 bits read. So, if you have enough bits, you will face irrecoverable bit errors. Bit rot, and such stuff. So if you have 850 disks, then you have a lot of bit rot and bits flipped at random. That is why you use ECC RAM, because bits are flipped at random in RAM. The same thing happens on disks: bits flip at random. And guess what: such errors are sometimes not even detectable. Hardware RAID can neither detect nor repair such errors. The more disks you have, the more bit rot there will be, and then you need to protect against bit rot.
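
To put a number on the spec-sheet figure, here is a sketch of the expected count of unrecoverable read errors for a given amount of data read. It assumes the quoted rate of 1 error per 10^16 bits applies uniformly and independently to every bit, which is a simplification; the function name is hypothetical.

```python
def expected_unrecoverable_errors(bytes_read: int, bits_per_error: int = 10**16) -> float:
    """Expected unrecoverable read errors when reading bytes_read,
    assuming a uniform rate of one error per bits_per_error bits."""
    return bytes_read * 8 / bits_per_error


TB = 10**12
PB = 10**15

print(f"Reading 230 TB once: {expected_unrecoverable_errors(230 * TB):.3f} expected errors")
print(f"Reading 120 PB once: {expected_unrecoverable_errors(120 * PB):.0f} expected errors")
```

On that assumption a single pass over 230TB would rarely hit an error, while a single pass over the full 120PB would be expected to hit dozens.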
    
    0 0
    1. Friday 2nd September 2011 09:36 GMT Ilgaz
      
      He isn't doing surface scan
      
      he isn't doing chkdsk /r , he is just checking metadata and it takes that long.
      
      While on it, running a full SMART self-test on the drive is a far better idea than chkdsk /r these days. If needed, of course.
      
      0 0
      1. This post has been deleted by its author
      2. Friday 2nd September 2011 14:04 GMT Anonymous Coward
        
        @llgaz
        
        If someone would add SAS support to an AIX port of smartmontools, then I would quite happily use it. Unfortunately, IBM has yet to add a SMART tool to the AIX toolset, and although I can compile the latest smartmond, it will not recognise SAS disks.
        
        The AIX error daemon is very good at picking up errors, but unfortunately, IBM no longer ship a sense data analysing tool as part of AIX, so you have to engage their hardware support if the human-readable diagnostic message does not give enough information, especially for some of the Temporary Hardware type errors.
        
        I am looking at porting this myself, but I fear that I have a steep learning curve ahead of me, because my knowledge of SCSI and the SAS transport layer has been largely at an academic level so far, and I am far happier writing C than C++.
        
        0 0
        
        Friday 2nd September 2011 20:49 GMT Ilgaz
        
        The author seems to be a nice guy
        
        I think the lack of AIX support has similar reasons to OS X's lack of directly compilable ports of software, e.g. why fink and macports exist.
        
        Lack of access to hardware (not everyone can run AIX let alone have root) and older libs (unlike linux).
        
        So, perhaps you AIX guys can offer some testing or (in case of IBM) actual hardware to test.
        
        I have read smartctl man pages before and therefore I guess the above reasons.
        
        0 0
    2. Friday 2nd September 2011 11:10 GMT Anonymous Coward
      
      @kebabert
      
      The geometry of this particular filesystem is 10 racks each of 12 disk drawers, each with 10 3.5" 300GB SAS disks, so a total of 1200 disks. Each rack has 2 Power 6 520 servers, each with 12 SAS RAID adapters contained in external expansion drawers.
      
      Each drawer of 10 disks is connected to two SAS RAID array cards running in HA mode, with each card in a different system for redundancy. Each set of 10 disks is arranged as an 8+2 RAID 6 array.
      
      The 120 individual RAID arrays are bound together by GPFS into a single filesystem.
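
The layout described above can be written down as a small sketch. All numbers are taken from this post; the class and property names are invented for illustration, and the usable figure is raw data capacity in decimal units before any filesystem overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageCluster:
    racks: int = 10
    drawers_per_rack: int = 12
    disks_per_drawer: int = 10
    data_disks_per_array: int = 8   # 8+2 RAID 6: 8 data, 2 parity
    disk_gb: int = 300

    @property
    def raid_arrays(self) -> int:
        # One RAID 6 array per drawer of 10 disks.
        return self.racks * self.drawers_per_rack

    @property
    def total_disks(self) -> int:
        return self.raid_arrays * self.disks_per_drawer

    @property
    def usable_tb(self) -> float:
        # Only the data disks contribute to usable capacity.
        return self.raid_arrays * self.data_disks_per_array * self.disk_gb / 1000


cluster = StorageCluster()
print(cluster.total_disks, "disks in", cluster.raid_arrays, "RAID arrays,",
      cluster.usable_tb, "TB of data capacity")
```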
      
      This is a standard layout for disk in Power 6 IH node (P6 575) deployments, so there are quite a few sites like this around the world.
      
      BTW, we lose about 1 spindle a month due to hardware failure in this particular storage cluster (actually, about 3 a month in this, its sister [we have more than one], and a number of smaller clusters). In total in the HPCs, we have in excess of 3000 disks providing application storage.
      
      0 0
    3. This post has been deleted by its author
    4. Friday 2nd September 2011 14:14 GMT Anonymous Coward
      
      @Kebabbert (again)
      
      Because of the distributed nature of GPFS, which from your post about bit rot, you obviously don't know about, there is no single system that has all of the storage attached to it, nor is it in a single RAID set, nor are all the disks involved in single block reads.
      
      In this case, there are 20 systems, each with a primary control of 6 raid arrays. So even if we were actually reading every block, (using your figure) the 42.5 GB/sec comes down to 2.125 GB/sec per system, or about 354 MB/sec per RAID adapter. Assuming just the 8 disks in each RAID set, this is then 44MB/sec per spindle, which would (just about) be in the realms of the possible. But as has been pointed out, mmfsck just checks the meta-data, and even that runs across all of the systems in the storage cluster.
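
The breakdown above, as a sketch. It takes the 42.5 GB/sec figure as given, spreads it evenly over the 20 systems, 6 arrays per system and 8 data disks per array, and uses decimal units; `throughput_breakdown` is a made-up name.

```python
def throughput_breakdown(total_gb_s: float, systems: int,
                         arrays_per_system: int, data_disks_per_array: int):
    """Split an aggregate read rate evenly down to a single spindle.
    Returns (GB/s per system, MB/s per RAID adapter, MB/s per spindle)."""
    per_system_gb_s = total_gb_s / systems
    per_adapter_mb_s = per_system_gb_s * 1000 / arrays_per_system
    per_spindle_mb_s = per_adapter_mb_s / data_disks_per_array
    return per_system_gb_s, per_adapter_mb_s, per_spindle_mb_s


system_gb_s, adapter_mb_s, spindle_mb_s = throughput_breakdown(42.5, 20, 6, 8)
print(f"{system_gb_s:.3f} GB/sec per system, "
      f"{adapter_mb_s:.0f} MB/sec per adapter, "
      f"{spindle_mb_s:.0f} MB/sec per spindle")
```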
      
      I get the impression that you've never really worked on very large systems.
      
      0 0
      1. Monday 5th September 2011 07:47 GMT Kebabbert
        
        @A.C - data corruption
        
        "...there is no single system that has all of the storage attached to it, nor is it in a single RAID set, nor are all the disks involved in single block reads..."
        
        I have never assumed all disks are in one single raid system. I just wondered how many disks you had.
        
        Regarding my post about bit rot, I wonder if bit rot is taken care of, or is it just ignored? The hardware alone is not capable of handling bit rot. You need to have a lot of checksums, which decreases performance significantly.
        
        "...But as has been pointed out, mmfsck just checks the meta-data, and even that runs across all of the systems in the storage cluster..."
        
        So the actual data is never checked. So when bits start to flip spontaneously, you have no way of detecting that, let alone correcting the corrupted bits. Maybe you should start to think about silent corruption. The more disks you have, the more corruption, and silent corruption, you will face. I hope you are at least using ECC RAM? If not, you should start to use ECC RAM. I really recommend it. You obviously don't care about corrupted bits on disks, so I would not be surprised if you don't care about corrupted bits in RAM either.
        
        "...I get the impression that you've never really worked on very large systems..."
        
        This is true. I have never worked on a very large system, but I am still allowed to ask questions, right?
        
        I get the impression you don't know too much about data corruption.
        
        0 0
        
        Tuesday 6th September 2011 09:16 GMT Anonymous Coward
        
        re: Data corruption
        
        Kebabbert. What you have said appears to have been lifted almost exactly from the marketing spiel for ZFS, so I'm really not sure that your credentials in data corruption are that good.
        
        Strangely, bit rot on disks, although acknowledged as possible, does not appear to register as a big concern in most sysadmins' thoughts. Maybe it should, but it's not a hot topic.
        
        Spending some time looking into how Reed Solomon block encoding is applied in RAID 6, I am reassured that even though single bit errors become a likelihood in large datastores, in order to actually cause a non-recoverable data loss, they have to occur in clusters (looking at a typical R-S encoding strategy, more than 16 in a 255 byte symbol-block if I read Wikipedia correctly), which even for large filestores is quite improbable, although my maths is too rusty to work out the statistics correctly.
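
For reference, the "16 in a 255 byte block" figure matches the textbook Reed-Solomon RS(255, 223) code, which can correct up to half as many symbol errors as it has parity symbols. A minimal sketch of that rule follows; the choice of RS(255, 223) is an assumption made for illustration, and an actual RAID 6 controller may use quite different parameters.

```python
def correctable_symbol_errors(block_symbols: int, data_symbols: int) -> int:
    """Maximum number of symbol errors at unknown positions that a
    Reed-Solomon code RS(n, k) can correct: floor((n - k) / 2)."""
    parity_symbols = block_symbols - data_symbols
    return parity_symbols // 2


# Assumed example: the textbook RS(255, 223) code with 8-bit symbols.
t = correctable_symbol_errors(255, 223)
print(f"RS(255, 223) corrects up to {t} symbol errors per 255-byte block")
```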
        
        Regular reading and re-writing of the data (data scrubbing) is regarded as the best way of preventing gradual degradation in this case, and most modern RAID systems will do this automatically.
        
        Of course, failure of multiple disks in a RAID set is a problem because of similarly aged disks (probably more so than bit rot), which is why our RAID sets are RAID 6 with 8+2 parity, allowing (in theory) two disks to fail without data loss (and the Perseus implementation will allow 8+3), but more than two disks failing in a set would probably challenge most filesystems.
        
        BTW. I am primarily a sysadmin. I'm not really even an IT architect. I really don't have to understand in great detail how error-correction works as long as I trust the people who do the design. The model IBM uses to deploy large clusters involves some of the best people in the industry, and having designed a layout, they tend to stick to it, so I believe that most of the bases are covered.
        
        0 0
Thursday 1st September 2011 17:23 GMT Lorddraco

when it is mechanical .. it will fail

as long as there are moving parts ... it will fail ...

common disk failure is a batch problem .. anyone in the storage industry knows this nightmare batch problem....

0 0
Thursday 1st September 2011 17:23 GMT John F***ing Stepp

Brings back memories*

Of walking along the racks checking what vacuum tubes (valves) weren't glowing.

Back when we all had electrical heat, one 12v filament at a time.

*Also one really bad pun.

1 0
Thursday 1st September 2011 19:17 GMT Anonymous Coward

NSA

Nuke labs? Weather forecasting? Please. The customer is obvious, and the application is data warehousing of communications intercepts.

0 0
Thursday 1st September 2011 22:55 GMT Anonymous Coward

Wide racks

Aren't EMC's "DMX" arrays in wide racks already?

We* had to take the $%^&! things out of their shipping boxes before they'd fit in our lift, anyway - and ordinary 42RU 19" racks fit the lift in their shipping boxes.

* By "we" I of course mean 'the horny-handed over-muscled lads from the shipping company'. Perish the thought that we soft-skinned pudgy** bespectacled ICT folks should soil our hands with this kind of labour.

** If we're so into 'Agile' development, how come all our developers are 170cm and 120kg?

1 0
1. Friday 2nd September 2011 09:34 GMT Marvin the Martian
  
  **I'll hazard a guess.
  
  Inbreeding?
  
  0 0
Thursday 1st September 2011 22:55 GMT Anonymous Coward

Can't wait..

So how much power would be wasted just by identifying all drives in the array?

0 0
Friday 2nd September 2011 00:33 GMT Antoine Dubuc

Social Simulation

Some universities are doing research in Social Simulation... ahhhh... behold the scaffolding of Isaac Asimov's PsychoHistory!

0 0
Friday 2nd September 2011 02:08 GMT alwarming

On paper probably IBM offered a deal too good for "them" to resist....

But "they" will have to wait till the charges for software AMC, parts AMC and the whole solution add up.

0 0
Friday 2nd September 2011 02:08 GMT Lance 3

A lot of lights

That would be a lot of disk activity lights.

"Soldier: Those lights are blinking out of sequence.

Murdock: Make them blink in sequence."

"Buck Murdock: Oh, cut the bleeding heart crap, will ya? We've all got our switches, lights, and knobs to deal with, Striker. I mean, down here there are literally hundreds and thousands of blinking, beeping, and flashing lights, blinking and beeping and flashing - they're *flashing* and they're *beeping*. I can't stand it anymore! They're *blinking* and *beeping* and *flashing*! Why doesn't somebody pull the plug!"

1 0
Friday 2nd September 2011 07:24 GMT Bernd Felsche

Client: UEA for Climate Modelling?

So that they can run lots of "scenarios" and keep a copy online as "proof"

200,000 HDD at idle each consuming around 4W. And about 3 times that at peak.

What's the cost of a 2 MW UPS?

And of course 2 MW of airconditioning.

Rack space requirement isn't enormous. 20 SFF drives fit across the width with a 2U height. So you only need a single 20,000-U rack cabinet. ;-)
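
The power and rack-space sums above, as a sketch. It uses only the figures in this post (4W idle, three times that at peak, 20 small-form-factor drives per 2U) and counts the drives alone, not controllers, servers or cooling; the function names are invented.

```python
import math


def drive_power_kw(drive_count: int, idle_watts: float, peak_multiplier: float):
    """Return (idle kW, peak kW) for the drives alone."""
    idle_kw = drive_count * idle_watts / 1000
    return idle_kw, idle_kw * peak_multiplier


def rack_units(drive_count: int, drives_per_shelf: int = 20, shelf_height_u: int = 2) -> int:
    """Rack units needed if each shelf holds drives_per_shelf drives."""
    return math.ceil(drive_count / drives_per_shelf) * shelf_height_u


idle_kw, peak_kw = drive_power_kw(200_000, idle_watts=4, peak_multiplier=3)
print(f"Idle: {idle_kw:.0f} kW, peak: {peak_kw:.0f} kW")
print(f"Rack space: {rack_units(200_000):,} U")
```

On these figures the peak draw comes out somewhat above 2 MW, so the UPS would need a little more headroom than the post suggests.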

0 0
Friday 2nd September 2011 07:24 GMT Anonymous Coward

are you sure they're disks?

Wasn't there a massive flash drive purchase last week? Of this same size?

0 0
Friday 2nd September 2011 12:01 GMT Toastan Buttar

The secret customer is...

John Hammond, Isla Nublar.

0 0
Saturday 3rd September 2011 16:01 GMT Ken 16

in 20 years

only the cheapest mobile phones will have 120PB

0 0
