YAY!
They received my order for the porn storage farm!
I was getting worried by the silence for a while there.
Flash may be one cutting edge of storage action, but big data is causing developments at the other side of the storage pond, with IBM developing a 120 petabyte 200,000-disk array. The mighty drive is being developed for a secret supercomputer-using customer "for detailed simulations of real-world phenomena" according to MIT's …
A quick fag-packet calculation suggests to me that if the MTBF is 3 years, you're looking at 180+ drive failures every day (or about 7-8 an hour).
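For anyone who wants to check the arithmetic, here is a minimal sketch of that calculation. It assumes a constant failure rate of one failure per drive per MTBF period and a 365-day year; the variable names are just for illustration.

```python
# Back-of-envelope failure rate for a 200,000-disk array.
# Assumption: each drive fails on average once per MTBF period,
# and failures are spread evenly over time.
DRIVES = 200_000
MTBF_YEARS = 3

failures_per_day = DRIVES / (MTBF_YEARS * 365)
failures_per_hour = failures_per_day / 24

print(f"{failures_per_day:.0f} failures per day")
print(f"{failures_per_hour:.1f} failures per hour")
```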
That's gonna keep someone in gainful employment.
Still, if your machine crashes you can go on holiday for 6 months while fsck runs.
MTBF is just a sadistic statistic in this situation and it's not really relevant in any practical sense at this scale. I would expect that they are tracking run-time and replacing the drives before the end-of-life arrives - which begs the question:
Will the "used" drives start appearing on ebay in about two years?
MTBF absolutely is critical when designing large storage arrays. It's the key number (along with mean time to recover, MTTR) that tells you the likelihood of a double failure within any one RAID set. It is the size of the RAID set, the number of failures it can withstand and the total number of RAID sets that matter on a 200,000-disk storage device (using RAID in its most general sense of storage redundancy). Note that MTTR on RAID sets including modern, very large disks can be measured in tens of hours. That is one of the reasons multi-level protection is becoming more important, the other being unrecoverable read errors, short of complete device failure, that prevent RAID rebuilds.
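As a rough illustration of how MTBF and MTTR combine, here is a minimal sketch. It assumes independent, exponentially distributed disk lifetimes, which is a textbook simplification and not a vendor model, and the example figures (a 10-disk set, a 3-year MTBF, a 24-hour rebuild) are assumptions picked only to show the shape of the calculation.

```python
import math

def p_additional_failures(set_size, mtbf_hours, mttr_hours, extra_failures=1):
    """Probability that at least `extra_failures` of the surviving disks
    fail during one rebuild window of length `mttr_hours`.

    Assumption: disks fail independently with exponential lifetimes.
    """
    survivors = set_size - 1
    # Chance that one given surviving disk fails within the rebuild window.
    p_one = 1.0 - math.exp(-mttr_hours / mtbf_hours)
    # Binomial probability of seeing fewer than `extra_failures` failures.
    p_fewer = sum(
        math.comb(survivors, k) * p_one**k * (1.0 - p_one) ** (survivors - k)
        for k in range(extra_failures)
    )
    return 1.0 - p_fewer

# Hypothetical example figures, not taken from any spec sheet.
MTBF_HOURS = 3 * 8760
p_second = p_additional_failures(10, MTBF_HOURS, 24, extra_failures=1)
p_third = p_additional_failures(10, MTBF_HOURS, 24, extra_failures=2)
print(f"second failure during rebuild: {p_second:.4%}")
print(f"two more failures during rebuild: {p_third:.6%}")
```

The per-set probability then has to be scaled by the number of RAID sets and the number of rebuilds per year, which is where the size of the array starts to bite.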
There is a big question over just how trustworthy MTBF figures are. Google did a study a few years back demonstrating that failure rates are not random. They tend to be associated with particular batches, models and manufacturers (annoyingly they wouldn't identify the bad ones). Also, they found that failure predictors, including S.M.A.R.T. stats correlated very poorly with actual failures. I've had experience of that myself where very high and statistically extremely improbable failure rates were observed on a subset of disks over a month. That speaks of non-random failure modes. Those devices exhibited no warnings of any failures.
Note that some arrays make the definition of what constitutes a RAID set an extremely slippery concept. Also, the concept of RAID sets and the consequences of common-mode failures compromising redundancy can make this a very tricky analysis.
Also, I wish people would stop thinking MTBF has anything directly to do with the lifespan of a device. The MTBF is simply the average number of total operational hours that might be expected between failures for a large(ish) population of similar devices. The MTBF figure only applies to devices within given (and not well publicised) working lifespans. We have hard drives now with MTBFs approaching 100 years. However, nobody in their right minds believes these devices will actually run for 100 years before they fail - a decade would be good.
In general, very large storage arrays have to be self-healing with dynamic spares and (within the operational lifespan) will rely on a mixture of re-active and pre-emptive device swapping, but I don't know of storage suppliers who swap drives out at fixed lifespan intervals (although I have known manufacturers swap out batches of suspect drives where excessive early failures have been detected in order to pre-empt catastrophic failures).
Needless to say, manufacturers are less than forthcoming about this...
And these are 200,000 drives from the same manufacturer and, presumably, there won't be too many different manufacturing batches involved (unless they've been stock-piling drives for years) - so you could potentially see a situation where 50+ drives die in a very short time frame...
Although, I'm sure they've thought it all through and we're just missing some info.
/else: boom!
If you are going to make claims like that, you need to factor in "rare" external risks such as Yellowstone blowing the west coast away, the Canaries washing the east coast away, or taking a direct hit from a 1km meteorite. (See icon for illustration.)
Call me cynical, but I'm guessing that these discs probably *can't* take a multi-gigaton direct hit.
...it's 4-minute songs encoded at 128kbps. I'm actually a bit surprised they chose 128k; if the satellite radio people can fob off 64k swishy sloshy swish shit as "CD quality" - or maybe just "digital quality" which I suppose is technically true - then why not double the number of songs?
Anybody else remember when that Creative Labs (IIRC) thing came out with the monstrous hard drive, and everyone else was spluttering about putting 10 songs in your 32mb device? And then there was Steve Jobs, who said, "Hmm, that thing is ugly, and the documentation and interface look unprofessional... This looks like a job for Jobs! Muahahahahahaha!"
I actually saw a headline reading, "President To Give Speech On Jobs", and thought, geez, it's not THAT big of a deal..
I was involved today with running a full mmfsck on a multi-terabyte GPFS filesystem. What was interesting was that it took almost exactly 90 minutes to check about 230TB, which was 90% full.
If you say that this is about 150TB an hour, extrapolating this would mean that checking 120PB would take a shade over 34 days. And this is the checking rate for Power 6 hardware, and that assumes it is configured as a single GPFS filesystem (unlikely).
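The extrapolation can be sketched as follows. It assumes the check rate scales linearly with capacity, uses the rounded 150TB an hour figure, and takes 1PB as 1024TB, which is what gives the 34-day answer.

```python
# Linear extrapolation of the mmfsck run to a 120PB filesystem.
# Assumptions: check time scales linearly with capacity, 1 PB = 1024 TB.
CHECKED_TB = 230
CHECK_HOURS = 1.5

measured_rate = CHECKED_TB / CHECK_HOURS   # roughly 153 TB an hour
ROUNDED_RATE = 150                         # figure used in the post

TARGET_TB = 120 * 1024
hours = TARGET_TB / ROUNDED_RATE
days = hours / 24

print(f"measured rate: {measured_rate:.0f} TB/hour")
print(f"120PB at {ROUNDED_RATE} TB/hour: {days:.1f} days")
```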
I suspect that this article is about a Power 7 IH installation like the now defunct Blue Waters project. Everything from the wider racks to the water cooling would suggest this, although BlueGene/Q also has both of these attributes (but Lustre is the preferred filesystem for that system type).
Doubt this is GPFS as it currently stands - IBM has been working on PERCS to produce a system that can handle storage arrays bigger than this one. http://www.almaden.ibm.com/storagesystems/projects/perseus/
Also the detail says this is one GPFS filesystem - so all under one namespace. Hence the insanely large amount of metadata required.
I doubt Lustre would scale this far; it has only had distributed metadata support added in the last year, whereas GPFS was architected from the ground up to be distributed.
If it is part of PERCS, IBM has been working on a concept called Declustered RAID, codename Perseus, which runs software RAID within the GPFS layer. It uses a combination of 8+3 parity with Reed-Solomon encoding and track mirroring to spread data across the maximum number of spindles for performance while still maintaining good data resilience.
I am extrapolating here, but I believe that the same technology is being deployed in their SONAS devices.
All I can say is that I hope it will work as well as we are being told, because it will be a nightmare otherwise, as you will never be able to work out where data is actually being stored!
So you did a filesystem check of 230TB in 90 minutes? Then you check 2.55 TB/minute = 42.5 GB/sec.
Say that a disk checks 50MB/sec in practice. Then you need 850 disks to achieve 42.5 GB/sec.
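Spelled out as a quick sketch, using decimal units (1TB = 1000GB) and the assumed 50MB/sec per disk from above. Without rounding the intermediate figures the answer comes out a couple of disks higher than 850.

```python
# Throughput needed to read 230TB in 90 minutes, and the disks that implies.
# Assumptions: decimal units, and every disk sustains 50 MB/sec.
CHECKED_TB = 230
MINUTES = 90
DISK_MB_PER_SEC = 50

tb_per_minute = CHECKED_TB / MINUTES
gb_per_sec = tb_per_minute * 1000 / 60
disks_needed = gb_per_sec * 1000 / DISK_MB_PER_SEC

print(f"{tb_per_minute:.2f} TB/minute")
print(f"{gb_per_sec:.1f} GB/sec")
print(f"about {disks_needed:.0f} disks")
```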
Did you really have 850 disks in racks? How many racks did you have with disks? Holey Moley! Entire rooms were full of disks? How many rooms?
According to any SAS Enterprise disk spec sheet, such a disk encounters 1 irrecoverable error for every 10^16 bits read. So, if you have enough bits, you will face irrecoverable bit errors. Bit rot, and such stuff. So if you have 850 disks, then you have a lot of bit rot and bits flipped at random. That is why you use ECC RAM, because bits are flipped at random in RAM. The same thing happens on disks: bits flip at random. And guess what: such errors are sometimes not even detectable. Hardware RAID cannot detect or repair such errors. The more disks you have, the more bit rot there will be, and then you need to protect against bit rot.
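To put a number on "enough bits", here is a minimal sketch that applies the 1 in 10^16 figure to one full read of the data. It assumes decimal units and that the spec-sheet rate applies uniformly to every bit read; the function name is made up for the example.

```python
# Expected irrecoverable read errors for one full pass over the data,
# assuming 1 error per 10^16 bits read and decimal units.
BITS_PER_ERROR = 10**16

def expected_read_errors(bytes_read):
    """Expected number of irrecoverable errors when reading `bytes_read` bytes."""
    return bytes_read * 8 / BITS_PER_ERROR

# 230TB filesystem that is 90% full.
errors_230tb = expected_read_errors(0.9 * 230e12)
# The full 120PB array.
errors_120pb = expected_read_errors(120e15)

print(f"230TB at 90% full: {errors_230tb:.2f} expected errors per full read")
print(f"120PB: {errors_120pb:.0f} expected errors per full read")
```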
If someone would add SAS support to an AIX port of smartmontools, then I would quite happily use it. Unfortunately, IBM has yet to add a SMART tool to the AIX toolset, and although I can compile the latest smartmond, it will not recognise SAS disks.
The AIX error daemon is very good at picking up errors, but unfortunately, IBM no longer ship a sense data analysing tool as part of AIX, so you have to engage their hardware support if the human-readable diagnostic message does not give enough information, especially for some of the Temporary Hardware type errors.
I am looking at porting this myself, but I fear that I have a steep learning curve ahead of me, because my knowledge of SCSI and the SAS transport layer has been largely at an academic level so far, and I am far happier writing C than C++.
I think the lag in AIX support has similar causes to OS X's lack of directly compilable ports of software, which is why fink and macports exist.
Lack of access to hardware (not everyone can run AIX, let alone have root) and older libs (unlike Linux).
So, perhaps you AIX guys can offer some testing or (in case of IBM) actual hardware to test.
I have read the smartctl man pages before, which is why I am guessing at the above reasons.
The geometry of this particular filesystem is 10 racks each of 12 disk drawers, each with 10 3.5" 300GB SAS disks, so a total of 1200 disks. Each rack has 2 Power 6 520 servers, each with 12 SAS RAID adapters contained in external expansion drawers.
Each drawer of 10 disks is connected to two SAS RAID array cards running in HA mode, with each card in a different system for redundancy. Each set of 10 disks is arranged as an 8+2 RAID 6 array.
The 120 individual RAID arrays are bound together by GPFS into a single filesystem.
This is a standard layout for disk in Power 6 IH node (P6 575) deployments, so there are quite a few sites like this around the world.
BTW, we lose about 1 spindle a month due to hardware failure in this particular storage cluster (actually, about 3 a month across this, its sister [we have more than one], and a number of smaller clusters). In total in the HPCs, we have in excess of 3000 disks providing application storage.
Because of the distributed nature of GPFS (which, from your post about bit rot, you obviously don't know about), there is no single system that has all of the storage attached to it, nor is it in a single RAID set, nor are all the disks involved in single block reads.
In this case, there are 20 systems, each with primary control of 6 RAID arrays. So even if we were actually reading every block, (using your figure) the 42.5 GB/sec comes down to 2.125 GB/sec per system, or about 354 MB/sec per RAID adapter. Assuming just the 8 data disks in each RAID set, this is then 44MB/sec per spindle, which would (just about) be in the realms of the possible. But as has been pointed out, mmfsck just checks the meta-data, and even that runs across all of the systems in the storage cluster.
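The breakdown is easy to reproduce. This sketch just divides the 42.5 GB/sec figure evenly across the layout described above, using decimal units, and assumes one adapter has primary control of each array.

```python
# Spreading 42.5 GB/sec evenly across the storage cluster.
# Assumptions: perfectly even load, decimal units (1 GB = 1000 MB).
TOTAL_GB_PER_SEC = 42.5
SYSTEMS = 20
ARRAYS_PER_SYSTEM = 6      # one RAID adapter has primary control of each array
DATA_DISKS_PER_ARRAY = 8   # the "8" in 8+2 RAID 6

gb_per_system = TOTAL_GB_PER_SEC / SYSTEMS
mb_per_adapter = gb_per_system * 1000 / ARRAYS_PER_SYSTEM
mb_per_spindle = mb_per_adapter / DATA_DISKS_PER_ARRAY

print(f"{gb_per_system:.3f} GB/sec per system")
print(f"{mb_per_adapter:.0f} MB/sec per RAID adapter")
print(f"{mb_per_spindle:.0f} MB/sec per spindle")
```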
I get the impression that you've never really worked on very large systems.
"...there is no single system that has all of the storage attached to it, nor is it in a single RAID set, nor are all the disks involved in single block reads..."
I have never assumed all disks are in one single raid system. I just wondered how many disks you had.
Regarding my post about bit rot, I wonder if bit rot is taken care of, or is bit rot just ignored? The hardware alone is not capable of handling bit rot. You need to have a lot of checksums, which decreases performance significantly.
.
.
"...But as has been pointed out, mmfsck just checks the meta-data, and even that runs across all of the systems in the storage cluster..."
So the actual data is never checked. So when bits start to flip spontaneously, you have no way of detecting that, let alone correcting the corrupted bits. Maybe you should start to think about silent corruption. The more disks you have, the more corruption, and silent corruption, you will face. I hope you are at least using ECC RAM? If not, you should start to use ECC RAM. I really recommend it. You obviously don't care about corrupted bits on disks, so I would not be surprised if you don't care about corrupted bits in RAM either.
.
.
"...I get the impression that you've never really worked on very large systems..."
This is true. I have never worked on a very large system, but I am still allowed to ask questions, right?
I get the impression you don't know too much about data corruption.
Kebabbert. What you have said appears to have been lifted almost exactly from the marketing spiel for ZFS, so I'm really not sure that your credentials in data corruption are that good.
Strangely, bit rot on disks, although acknowledged as possible, does not appear to register as a big concern in most sysadmins' thoughts. Maybe it should, but it's not a hot topic.
Having spent some time looking into how Reed-Solomon block encoding is applied in RAID 6, I am reassured that even though single-bit errors become a likelihood in large datastores, in order to actually cause a non-recoverable data loss they have to occur in clusters (looking at a typical R-S encoding strategy, more than 16 in a 255-byte symbol block, if I read Wikipedia correctly), which even for large filestores is quite improbable, although my maths is too rusty to work out the statistics correctly.
Regular reading and re-writing of the data (data scrubbing) is regarded as the best way of preventing gradual degradation in this case, and most modern RAID systems will do this automatically.
Of course, the failure of multiple disks in a RAID set is a problem because of similarly aged disks (probably more so than bit rot), which is why our RAID sets are RAID 6 with 8+2 parity, allowing (in theory) two disks to fail without data loss (and the Perseus implementation will allow 8+3), but more than two disks failing in a set would probably challenge most filesystems.
BTW. I am primarily a sysadmin. I'm not really even an IT architect. I really don't have to understand in great detail how error-correction works as long as I trust the people who do the design. The model IBM uses to deploy large clusters involves some of the best people in the industry, and having designed a layout, they tend to stick to it, so I believe that most of the bases are covered.
Aren't EMC's "DMX" arrays in wide racks already?
We* had to take the $%^&! things out of their shipping boxes before they'd fit in our lift, anyway - and ordinary 42RU 19" racks fit the lift in their shipping boxes.
* By "we" I of course mean 'the horny-handed over-muscled lads from the shipping company'. Perish the thought that we soft-skinned pudgy** bespectacled ICT folks should soil our hands with this kind of labour.
** If we're so into 'Agile' development, how come all our developers are 170cm and 120kg?
That would be a lot of disk activity lights.
"Soldier: Those lights are blinking out of sequence.
Murdock: Make them blink in sequence."
"Buck Murdock: Oh, cut the bleeding heart crap, will ya? We've all got our switches, lights, and knobs to deal with, Striker. I mean, down here there are literally hundreds and thousands of blinking, beeping, and flashing lights, blinking and beeping and flashing - they're *flashing* and they're *beeping*. I can't stand it anymore! They're *blinking* and *beeping* and *flashing*! Why doesn't somebody pull the plug!"
So that they can run lots of "scenarios" and keep a copy online as "proof"
200,000 HDD at idle each consuming around 4W. And about 3 times that at peak.
What's the cost of a 2 MW UPS?
And of course 2 MW of airconditioning.
Rack space requirement isn't enormous. 20 SFF drives fit across the width with a 2U height. So you only need a single 20,000-U rack cabinet. ;-)
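The sums behind those figures, as a rough sketch. It takes the 4W idle figure, the "3 times at peak" multiplier and the 20-drives-per-2U packing straight from above, and counts the drives only, so controllers, servers and cooling overhead are ignored.

```python
# Power draw and rack space for 200,000 drives (drives only).
# Assumptions: 4 W idle per drive, 3x that at peak, 20 SFF drives per 2U shelf.
DRIVES = 200_000
IDLE_WATTS = 4
PEAK_FACTOR = 3
DRIVES_PER_SHELF = 20
SHELF_HEIGHT_U = 2

idle_mw = DRIVES * IDLE_WATTS / 1_000_000
peak_mw = idle_mw * PEAK_FACTOR
rack_units = DRIVES // DRIVES_PER_SHELF * SHELF_HEIGHT_U

print(f"idle: {idle_mw:.1f} MW, peak: {peak_mw:.1f} MW")
print(f"rack space: {rack_units} U")
```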