Epic Math fail
"72TB disk we get (7,200/80) = 900 seconds (a quarter of an hour, 15 minutes)."
Assuming it's 72GB, not TB, it's still 72,000/80 = 9,000s = 150min.
Even without doing the maths in your head, you should know that copying 72GB takes more than 15min.
RAID rebuilds are too slow with 4, 6 and 8 and 10TB drives. So erasure coding is coming in to shorten the failed disk data rebuild time and also use less capacity overhead. What happens when SSDs are used instead of disk drives? Will RAID be more or less acceptable than with disk drives? Intuitively we expect a flash array to …
72Gbytes in 15 mins? No problem. That's three of my users' profiles copied across a network.
Even at gigabit that's doable in 10 minutes, overhead included (http://techinternets.com/copy_calc?do), which kind of correlates with real-world profile copies / restores in my day-to-day experience of such things.
And I assure you my storage can saturate my Gbit network without any problems at all from a single storage array (hell, SATA is 6Gbps to each disk; on a RAID you can easily saturate a SAS link without even trying). What kind of hardware? I'm in a small prep school.
What kind of junk do you think people run that you can't copy 72GB in 15 mins? Hell, you could do it to/from a USB 2.0 drive in a little over 20 minutes, and you know the USB 2.0 bus can't keep up with even a single SATA disk, right?
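If anyone wants to check the arithmetic themselves, it only takes a couple of lines (the 80 MB/s sustained rate is the figure the article worked from; real links and disks vary):

```python
def transfer_time_s(size_mb, rate_mb_s=80):
    """Seconds to move size_mb megabytes at rate_mb_s MB/s."""
    return size_mb / rate_mb_s

# 72 GB at 80 MB/s
print(transfer_time_s(72_000))      # 900 s, i.e. 15 minutes
# 72 TB at 80 MB/s
print(transfer_time_s(72_000_000))  # 900,000 s, i.e. roughly 10.4 days
```

So 72GB in 15 minutes is exactly what 80 MB/s gives you; it's 72TB that blows out to days.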
Systems like 3PAR distribute RAID across many disks, so you could have dozens or even hundreds of disks rebuilding data in parallel, meaning faster recovery times and lower latency. (I think XIV does something similar, though last I read they were limited to RAID 1 only. XIO did something similar too: they distributed RAID over individual platters from what I recall, so you could have a platter fail and keep using the remaining platters in the disk, with systems sold with enough capacity that this was transparent. That's one way they were able to go so long without any disk replacements.)
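A rough sketch of why that parallelism matters (the capacity and the 80 MB/s per-disk rebuild rate here are illustrative assumptions, ignoring controller and network limits):

```python
def rebuild_hours(failed_capacity_gb, per_disk_mb_s, participating_disks):
    """Rough rebuild time when `participating_disks` drives each contribute
    `per_disk_mb_s` of rebuild bandwidth (toy model, no contention)."""
    total_mb = failed_capacity_gb * 1000
    aggregate_mb_s = per_disk_mb_s * participating_disks
    return total_mb / aggregate_mb_s / 3600

# Classic RAID: one hot spare absorbs the whole rebuild at ~80 MB/s
print(rebuild_hours(4000, 80, 1))    # ~13.9 hours for a 4 TB disk
# Distributed RAID: 100 disks each take a slice of the work
print(rebuild_hours(4000, 80, 100))  # ~0.14 hours (about 8 minutes)
```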
Enterprise-class arrays often protect against an entire shelf of disks failing as well (for 3PAR this is the default), so you can lose many drives (as long as they are the "right" drives). Enterprise systems go above and beyond even that and typically only rebuild the data that is actually written, reducing rebuild times even further depending on how full your system is. I wrote about this more in depth six years ago (http://www.techopsguys.com/2010/08/13/do-you-really-need-raid-6/).
Also, most/all enterprise systems proactively fail disks before they completely die. That certainly doesn't catch all failures, but I would wager it accounts for a decent amount of them.
SSDs seem to be optimized more for IOPS than for throughput, short of NVMe SSDs at least. From what I have read, SSDs aren't a million times faster on sequential operations (like a RAID rebuild) the way they are with random operations.
HP at one point said something like 90%+ of the SSDs they had sold were still in operation in the field. Results like this are what helped them decide to offer a 5-year unconditional warranty on all of the SSDs on 3PAR (maybe other platforms too).
My company's first all-flash array was a 3PAR 7450, which came online on 2014-11-04 according to the system. From an endurance perspective (it initially used the 2TB SSDs; we have since added 4TB as well), the system is reporting the 2TB SSDs have 98% of endurance left, and the 4TBs installed late last year still have 100%. These are what are otherwise designed to be "read intensive" SSDs, though HP (and others now too) support any workload running on them (HP doesn't market them as read intensive), no restrictions.
The article sort of explains what various RAID levels imply, whilst saying nothing about wtf erasure codes are. Is RAID now so old-fangled that it needs to be explained, whereas erasure codes are like DevOps, their value and implementation self-evident (preferably after a conference or two)?
In particular the article says nothing (that I could see) about the fact that an array rebuild could be bottlenecked in principle either by IO performance or (depending on implementation) by CPU performance.
My limited knowledge of erasure codes (about five minutes old, found in
suggests that erasure codes are quite possibly more likely to be CPU-bound than their RAID equivalents (again, depending on implementation).
Odd indeed, as the article is aiming for ceteris paribus but can't quite get there. "All other things being held constant" just isn't possible, or plausible for that matter. I've been studying this exact question for months now, as I have a lot of content (data and archives) that requires active presence yet near-100% protection against data corruption and hardware failure. Rebuild times are right near the top of my concerns.
What hardware the RAID runs on determines whether SSDs are useful, let alone preferred. Simple optimizations one way or another can throw off any comparison, which negates straight drop-in replacement testing in the same yada-yada.
whilst saying nothing about wtf erasure codes are
Also, the article mentions "RAID-vs-erasure code rebuild times" but doesn't examine them. Perhaps in a follow-up article? Erasure codes (like Cauchy-Reed-Solomon) are mathematically optimal (both in terms of bandwidth and storage space), so they will always be at least as good as the equivalent RAID scheme (with the same number of erasures).
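For the curious, the simplest erasure code is the XOR parity RAID 5 already uses; Reed-Solomon schemes generalize it to survive multiple erasures. A toy single-erasure sketch of my own (not any vendor's implementation):

```python
from functools import reduce

def xor_parity(blocks):
    """Parity block: byte-wise XOR of all data blocks (RAID-5-style,
    the simplest erasure code, tolerating one lost block)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b'AAAA', b'BBBB', b'CCCC']
parity = xor_parity(data)

# Lose block 1; recover it by XOR-ing the survivors with the parity
recovered = xor_parity([data[0], data[2], parity])
print(recovered)  # b'BBBB'
```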
I have another gripe about the maths in the article. When measuring overhead, surely it is the difference between raw capacity and usable capacity expressed as a fraction of usable capacity? Or, in other words, how much extra storage would I need to add to "raidify" my setup? That surely is the only sensible definition of "overhead".
Your first example (6 drives in RAID 5) has the correct overhead figure: for a usable capacity of 5 drives, add one more to make it RAID-5, which is an overhead of 1 drive in 5 or 20%.
You start going wrong from there. With 10 drives in RAID 5, the usable capacity is 9 drives, so the overhead is 1/9 or 11.1%.
In the RAID 6 example, you say that a 4-disk system has 50% overhead when it's actually 100%: RAID 6 tolerates 2 failures, so that's your original 2 disks plus 2 for redundancy, giving 2/2 = 100%. Likewise for 10 drives in RAID 6: you have to add 2 drives to raidify an 8-disk array, so the overhead is 2/8 = 25% (not 20%).
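The two definitions in play can be written down explicitly (a sketch; the function names are mine):

```python
def overhead_vs_usable(total, parity):
    """Extra disks needed to 'raidify', as a fraction of usable disks."""
    return parity / (total - parity)

def overhead_vs_raw(total, parity):
    """Parity disks as a fraction of raw (total) disks."""
    return parity / total

# RAID 6 on 4 disks: 100% by the first definition, 50% by the second
print(overhead_vs_usable(4, 2), overhead_vs_raw(4, 2))    # 1.0 0.5
# RAID 6 on 10 disks: 25% vs 20%
print(overhead_vs_usable(10, 2), overhead_vs_raw(10, 2))  # 0.25 0.2
```

The article's figures match the second definition; the point above is that only the first answers "how much extra storage do I need to add?".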
A RAID 6 group of 14 disks, with 4TB 7,200rpm disks in it, takes about 72 hours under medium load to rebuild, depending on how you have it set up and what controller you are using (Dell/NetApp/IBM/LSI/3460 60-drive 4U jobby).
The problem with spinning disk is that the heads take ages to move. So if you have to serve real data during a rebuild, your throughput per disk drops from 100+ MB a second to tens of MB.
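A back-of-envelope version of that effect (the throughput figures are illustrative assumptions, not measurements):

```python
def hours_to_rewrite(disk_tb, mb_per_s):
    """Hours to sequentially rewrite one disk at the given sustained rate."""
    return disk_tb * 1_000_000 / mb_per_s / 3600

# Idle array, ~120 MB/s per disk
print(hours_to_rewrite(4, 120))  # ~9.3 hours
# Serving real I/O, seeks drop it to ~15 MB/s
print(hours_to_rewrite(4, 15))   # ~74 hours
```

Which lands right in the neighbourhood of the 72-hour rebuild quoted above.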
But that's because it's got to blindly copy the data from all disks to rebuild a disk *image*. If your RAID/EC scheme is content-aware, there is no need to rebuild the zeros from the unused part of the disk. That's partially how XIO and GPFS do their super-fast rebuilds: because they are vaguely aware of where the data is, they can just rebuild the parts that matter. Crucially, they can in some cases pull good data from the dying disk.
Hard drives should not be used for latency sensitive storage any longer, but only for bulk storage. Therefore you should be using big RAID sets, like 14+2. Then you don't care how long a rebuild takes, because you aren't going to have three drives fail during that time span unless you are incredibly unlucky (and if you "win" that particular lottery, that's what backups are for)
Your title was good but you should have expanded that thought: RAID5 rebuild times are way, way, way too long for the large HDDs that are current. I've seen a 2TB RAID5 take 72 hours to rebuild on a fast desktop computer.
I'd go with RAID 1+0 - fast throughput, fast rebuild - 'cuz HDDs are relatively cheap. From what I've seen, after the first minutes or so, when it's rebuilding the first few tracks, a RAID 10 that is rebuilding is indistinguishable from one that isn't. That short rebuild time reduces the chances that the failure of a second drive (a specific one of the three that are left) will destroy the data.
For NAS applications, HDDs are usually OK, as long as their average throughput is substantially greater than the LAN transfer speeds. SATA HDDs can usually output at least 3Gb/sec, three times faster than 1000BaseT; a striped RAID should double that.
To that I say, so what? Use RAID6 and don't worry about a 72 hour rebuild time. What difference does it make how long it takes, so long as you don't have two more failures during that time? Back in the day you cared how long it took because it caused increased latency for regular I/O. It still will, but since your latency sensitive stuff is on SSD now, you don't really care if the nominal 10 ms response time becomes 15 ms for a few days. All your hot data / hot blocks should be on flash.
I've seen a 2TB RAID5 take 72 hours to rebuild on a fast desktop computer.
1) What the heck? That's pathetic performance. I just replaced a 3TB drive in a 12TB-usable / 18TB-raw double-parity ZFS RAID-Z2 array (90% full) and it resilvered overnight. Consumer-grade 7,200rpm SATA drives.
Granted I wouldn't use RAID-Z2 nowadays. I've used RAID-Z3 (triple parity) in newer builds.
2) Who uses old fashioned RAID any more anyway when there is ZFS?
" Therefore you should be using big RAID sets, like 14+2"
"Then you don't care how long a rebuild takes, because you aren't going to have three drives fail during that time span unless you are incredibly unlucky "
I've been unlucky on a number of occasions. Vendors such as HP tend to supply the same model of drive _FROM THE SAME BATCH_ for their raid arrays.
Even without that, there's a ~2% chance of array loss during rebuild with RAID 6 - that's too high for a number of applications. RAID-Z3 reduces that to under 0.01%.
Real world experience of 1TB RAID5/RAID6 rebuilds on rusty media is more like 12 hours. RAID rebuilds don't involve sequential disk operations.
Therefore you should be using big RAID sets, like 14+2.
But if you're going to be using such a large number of disks, it makes more sense to use an erasure code. I assume that 14 + 2 means that you have 16 disks and you can tolerate 2 failures. You might think that three near-simultaneous failures will happen infrequently enough that you can ignore them, but I guess you've heard people talking about waiting ages for a bus and then two coming at once. It's all down to the Poisson distribution: (independent) rare events can and do happen in clusters. You might say there's more chance of winning the lottery and being struck by lightning, but things like that do happen.
I found a calculator tool and it told me that for a 16-disk setup (16 x 5TB), with 10Mb/s available for rebuilding, an MTTF of 3 years and a resupply time of 7 days (no hot spares), the chance of data loss is 1 in 37.6 per year for a RAID 6 array.
A once-in-40-years chance might not sound too bad although that is only for one array. If you're in a data centre with 40 arrays, you can expect around one such failure per year.
Anyway, to actually get to the point: you use the Poisson distribution to calculate the likelihood that a certain number of independent disk failures won't happen in the window when you're rebuilding the system. The more disks you add, the higher the probability that these rare coincidences will happen. The best mitigation is to increase the redundancy level, so that if instead of a 14+2 scheme you used a 12+4 one (i.e., an erasure code), you're (roughly) exponentially less likely to suffer a catastrophic failure.
Add in the fact that Poisson arrival rates are only an assumption, and that clusters of disk drive failures can happen more frequently than the model suggests (e.g., bad batches from a single manufacturer), and it makes even more sense to use an erasure code for arrays with many more disks than the standard RAID setups (more than 4-8 disks).
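That "exponentially less likely" claim is easy to put in numbers (a sketch; the 3% annual failure rate and 3-day rebuild window are my assumptions, purely for illustration):

```python
import math

def p_at_least(k, n_disks, annual_fail_rate, window_days):
    """Poisson probability of at least k independent disk failures
    among n_disks during a rebuild window."""
    lam = n_disks * annual_fail_rate * window_days / 365
    p_less = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1 - p_less

# 16 disks, 3% AFR, 3-day rebuild window
print(p_at_least(3, 16, 0.03, 3))  # chance a 14+2 scheme is overwhelmed (3+ failures)
print(p_at_least(5, 16, 0.03, 3))  # chance a 12+4 scheme is overwhelmed (5+ failures)
```

Each extra unit of redundancy multiplies in roughly another factor of the (small) per-window failure rate, which is where the exponential improvement comes from.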
"Add in the fact that Poisson arrival rates are only an assumption, and that clusters of disk drive failures can happen more frequently than the model suggests"
Like when the power goes off and then an hour later you try to power up an array from cold that has been spinning for 4 years?
Maybe Nimble, who moved/are moving to triple parity? But several top AFAs have their own schemes; RAID 5/6 aren't even part of the discussion. Pure has RAID-3D, "better than dual-parity", and XtremIO has XDP, a ppt you can google and toggle through 'cause it ain't simple to grasp, able to handle "up to 5 failed SSDs per brick". A much better direction on SSD rebuild might have been comparing and contrasting all the major AFAs and how they line up.
Disclosure - Nimble Employee
We did indeed move to triple parity with the 2.0 release of Nimble OS (now at 3.2). It is more advanced than a simple triple parity. This really comes into play because SSDs don't fail in the same way as HDDs. Rule #1 of storage is don't lose the data and the old RAID 5/6/10 just don't cut it with SSDs (or large HDDs).
Umesh, one of the Nimble founders, explains it much better than I can at: https://www.nimblestorage.com/blog/the-reliability-of-flash-drives/
Not all AFAs are the same.
It is a challenge to manage RAID for many. In fact, I saw an Enterprise storage vendor that sells a 2- or 3-day service to configure the RAID levels on that large whiz-bang frame you just bought. What are you suggesting? Instead of creating RAID at the frame, ZFS RAID? All sorts of options, including triple parity. But you are still "managing RAID."

I'll do you one better: now and going forward, no more managing of RAID, or you have a handicapped storage offering. A number of solutions just work; you don't have to fiddle with RAID arrays. The aforementioned XtremIO and Pure. Infinidat comes to mind. That work for you? I guess a ZFS admin would be rather bummed, because they would want JBODs so they could do the RAIDing. There's no such thing as a JBOD with those three.

But back to your post-RAID point. Sure... two or three years from now RAID discussions will be fewer and fewer, as most vendors move on from that, except for certain Enterprise storage vendors that face the daunting task of a re-write or an all-new code base to move in this same direction.
Disclosure - Nimble Employee
Nimble offers triple parity plus data protection without traditional RAID to manage. The Nimble OS manages protection/capacity automatically and does not require configuring the 'back end' either at initial setup or adding capacity. We lack the concept of RAID groups, aggregates etc. that traditional architectures are built on, making storage management application centric rather than about setting up the 'back end'.
Odd that the DC P3608 is included in the comparisons. That is a dual PCIe SSD that would never be used in a RAID 5 configuration.
Rebuild time is also contingent on compute availability. In a software RAID scenario, if there is a lot of overhead from host tasks, it will take longer. An offloaded hardware RAID controller has limited compute - there are only a few watts of power there - and of course this would not apply to the PCIe SSDs that are included in the chart.
One should >>never<< use any form of "erasure codes" on SSDs because of the massive write amplification involved. Whatever the endurance of the SSD is, erasure coding will take away 60-80%. We used to call it the "RAID-5(6) write penalty".
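The penalty referred to here is the classic textbook read-modify-write arithmetic for small random writes (a sketch of the general rule, not any particular vendor's implementation; full-stripe writes avoid it):

```python
def small_write_ios(parity_disks):
    """Read-modify-write cost of one small host write: read the old data
    and old parity blocks, then write the new data and new parity."""
    reads = 1 + parity_disks
    writes = 1 + parity_disks
    return reads + writes

print(small_write_ios(1))  # RAID-5: 4 I/Os per host write (2 of them writes)
print(small_write_ios(2))  # RAID-6: 6 I/Os per host write (3 of them writes)
```

On flash it's the write half that matters for endurance: 2-3 physical writes per host write is where claims of losing half or more of an SSD's rated endurance come from.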
Also, lots of vendor B.S. out there (Pivot3 for example) hyping their RAID-5/6 implementations while re-naming them as "Erasure Coding". Ohhhhhh...sounds so much sexier than "RAID-5".