When Intel releases its Sandy Bridge-based two-socket "Romley" platform in the middle of next year, its "Patsburg" platform controller hub (PCH) will include support for serial attached SCSI (SAS). By putting SAS support on the motherboard, Intel is embracing what it formerly shunned: software RAID. "I'll plead guilty. We stood …
Safe write caching
One of the traditional downsides to software raid that even affects RAID1, which doesn't suffer the CPU overhead, is the need to log writes unless there is some form of safe write caching. Are these new boxes going to provide battery backed up write cache?
Ehh, not necessary
"""Are these new boxes going to provide battery backed up write cache?"""
A write cache isn't strictly necessary at all, it's just a performance boost. Modern drives pretty much all flush their internal write caches to disk on a sync() command issued from the application, so for RAID you need to wait for each disk to come back with a completed sync before you consider the raid-level sync to be completed.
But a battery backed write cache is more or less essential for high performance.
And I'll believe those raid5 numbers when I see them. I've personally pulled 400MB/s from a degraded 6 disk raid5 (so 5 disks at the time) on a 3ware card, and software raid5 would be eating through quite a lot of CPU. Since that 400MB/s was cpu limited in the application, software raid would certainly have slowed it down.
Are these new boxes going to provide battery backed up write cache?
No. It's software raid. Check the last line of the article.
Ehh, not necessary
If you write to a raid device you need to write to at least 2 physical drives (hdd/ssd doesn't matter). Since you won't be able to get the two devices into sync the 2 writes will occur at slightly different times. A system failure will result in inconsistent data on either the mirror or the parity device. So you need to have some mechanism to avoid this.
You can journal the writes through an intent log, but this requires an additional physical write to occur before the real data write can be allowed to proceed. OK, you can have a cache of recently accessed areas and trade cache size against recovery speed, but it still doesn't help random writes and fast sequential writes will always tend to saturate this cache.
Or you can hold the un-acknowledged write in the cache pending completion. In order to be able to use this approach you need a protected cache.
The claim was that SW RAID could have equal or better performance than HW RAID, so I would expect the logged write approach to be unrealistic.
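To make the intent-log trade-off above concrete, here is a minimal sketch of a two-way mirror that records an intent before each write. The class `MirroredPair`, its `crash_after` parameter and the `log_writes` counter are hypothetical names invented for illustration; this is a toy in-memory model under those assumptions, not the on-disk format of any real RAID stack.

```python
class MirroredPair:
    """Toy two-way mirror with a write-intent log (illustrative model only)."""

    def __init__(self, nblocks):
        self.disks = [[b""] * nblocks, [b""] * nblocks]
        self.intent = set()    # block numbers with a write in flight
        self.log_writes = 0    # extra physical writes caused by the log

    def write(self, block, data, crash_after=None):
        # Step 1: persist the intent before touching either copy.
        self.intent.add(block)
        self.log_writes += 1
        # Step 2: write each copy; the two writes never land at the same instant.
        for i, disk in enumerate(self.disks):
            disk[block] = data
            if crash_after == i:
                return False   # simulated failure between the two writes
        # Step 3: both copies agree, so the intent can be cleared.
        self.intent.discard(block)
        return True

    def recover(self):
        # After a crash only the logged blocks need resyncing, not the whole mirror.
        resynced = sorted(self.intent)
        for block in resynced:
            self.disks[1][block] = self.disks[0][block]
        self.intent.clear()
        return resynced


mirror = MirroredPair(4)
mirror.write(0, b"old")                    # completes normally
mirror.write(2, b"new", crash_after=0)     # only the first copy gets written
inconsistent = mirror.disks[0][2] != mirror.disks[1][2]
resynced_blocks = mirror.recover()
```

The `log_writes` counter is the cost the comment complains about: every data write pays for one additional log write before it may proceed.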
That is what UPS-es are for
Even if the controller has battery backed cache, the disk enclosure does not.
That is what UPS-es and more importantly UPS software on servers is for. If you lose power you should shut down in an orderly manner and not rely on gimmicks. After all, even if the controller has battery backed up cache, the disk usually does not and it has some cache as well. Additionally, a RAID controller is not a SAN. It has no filesystem or application awareness and there are plenty of scenarios where it is better to abort a half-finished atomic write so that the data is not left in an inconsistent state.
As far as software RAID having higher performance in non-degraded mode, that has been the case at least on Linux for the last 10 years. Hello Intel and welcome to year 2000.
The downside, however, has always been the lack of or very risky hot-plug. So one had to interrupt service prior to a Software RAID rebuild in order not to risk losing data from the server going belly up. That is why people run hardware RAID - nearly all enclosure+controller combinations are hot-plug and allow rebuild and resize on the fly. In other words, it is not speed, it is the maintainability and reliability which is the key criteria for this market.
I've done a RAID-5 reconstruct on a Linux software MD raid array of six SATA drives (on an AMD motherboard chipset). All the drives maxed out at about 150Mb/s (reading from 5, writing XOR to the 6th, with the checksum data write rotating around all the disks in the usual way). That was on a system with a 45W Athlon II AM2+ CPU, and it had plenty of CPU cycles to spare while this was going on.
Kind of puts that 3Ware card in perspective.
On most workloads it's academic anyway. The disks spend most of their time seeking. The I/O data rate is nowhere near the maximum sustainable rate. Reconstruct (on an otherwise idle system) is the most I/O and checksum-calculation it'll ever do. BTW Linux reconstruct is smart, it'll throttle the reconstruction IO rate if the system gets busy, rather than slow the real work down unacceptably.
A 3Ware horror story from long ago. I used to use them, back in the days when hardware RAID really was faster. One day an engineer connected the wrong cables to the drives and the controller trashed the array. Shouldn't have happened. But I do know for sure, on Linux you don't have to bother labelling the cables, because Linux reads the disks to ID the array components. It doesn't matter in the slightest if the disks have been shuffled between shutdown and startup.
BTW Yes, it'll boot off shuffled disks, as long as /boot is a 6-way mirror partition and you've remembered to duplicate the MBR onto all 6 disks.
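The reconstruct described above (read from five disks, write the XOR to the sixth) can be sketched in a few lines. This is an illustrative model with made-up four-byte blocks, not the md driver's code; the helper names `xor_blocks` and `rebuild_missing` are assumptions.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks):
    """In a RAID-5 stripe, any one missing block is the XOR of all the others."""
    return xor_blocks(surviving_blocks)


# One stripe across six disks: five data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD", b"EEEE"]
parity = xor_blocks(data)
stripe = data + [parity]

# Pretend disk 2 died: read the five survivors and regenerate its block.
lost = 2
survivors = stripe[:lost] + stripe[lost + 1:]
rebuilt = rebuild_missing(survivors)
```

The same function rebuilds a lost parity block, which is why the rotating parity placement makes no difference to the reconstruct logic.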
Re: Anton - That is what UPS-es are for
Regarding the battery cache on the RAID controller - the disk cache doesn't matter, on restart the controller will ensure that all the writes are completed and that all disks are properly synced. That's the point... in a hard crash situation you maintain the integrity of the RAID mirror/set and as you put it, it is not application or filesystem aware. It doesn't make me breakfast either ;)
That is not "a gimmick" in my book. It is a very useful and, in my experience, reliable layer of resiliency. There are all sorts of situations (a couple I outlined in a reply to N2) where a server can go hard down. One I didn't mention was motherboard failure which, in the Intel world, isn't exactly a rare event.
As far as battery backups go, in a data center hosting thousands of servers you just don't see the type of UPS that you're talking about. The only time I have ever seen an APC hooked to a server with the interface properly set up to take the server down cleanly was in SOHOs. In an enterprise scale data center the assumption is that the power will never go down, and it is exceedingly rare to see it happen (I have, however, seen it happen). The only company who, AFAIK, has addressed this issue is Google - who, according to the rumors I've heard, build batteries into their servers - everyone else, again AFAIK and in my experience, understands that this is a risk and plans accordingly (which may include ignoring the problem and using software RAID anyway).
You hit the nail on the head with your last comment: "it is the maintainability and reliability which is the key criteria for this market" - but the reliability issue is bigger than just hot swap.
*Disclaimer, I'm speaking mainly of Wintel server builds - if you're using a JFS on Linux/Solaris I'm sure the equation is different.
In a datacenter you still have a UPS
It is just a UPS for the _WHOLE_ datacenter. Or rack lineup, or whatever. Once again with proper software to initiate an orderly shutdown if the diesel (or natural gas) generator runs out of fuel or cannot startup in time.
MGE (nowadays APC) has had the software and clients to hook up your whole server lineup for ages now. So while you rely on "power never going down" you still have the means of shutting down in an orderly manner when it does as well as shutting down rack line ups for power maintenance.
As far as Google having built in UPSes, I am having trouble believing the rumour. This is the mother of all fire hazards and all H&S hazards. A demo of what happens when you allow UPSes and batteries in your datacenter lineup is the fire at the Uni of Twente a few years back. It looked like 3rd World War.
I don't really disagree with anything you said. I have not, however, seen the UPS-OS integration in the data centers I have worked in. I could see this making sense in a highly standardized, homogeneous environment - ours are anything but that.
At the end of the day, battery backed up cache is just an extra layer of protection for your RAID mirrors/sets for hard crash scenarios. Some people may think it's unnecessary - all I can say is that from my experience it's a standard architectural component, and IMO for good reason.
RAID in the enterprise is about data integrity. I really don't see WTF they're droning on about... one hard crash or power failure and you could corrupt your volumes.
Compared to read/write cached RAID controllers, the performance is still crap too unless proven otherwise (which I seriously doubt will ever happen).
That said, if you're running SSD's the equation could be different (which is not something they should have said if that's what they meant). Intel makes SSD's after all. The above statements are in regards to old school HDD architectures. I don't have enough experience with SSD's yet to say one way or another, but the performance should be comparable. The question is if they can handle a hard crash without crapping out the volumes.
In the enterprise?
What about the UPS units and the diesels with fuel for a month? I know generators are an expensive facility, but an APC with some shutdown-type software set to trigger at battery = 10% is essential, especially when the grid spikes frequently.
As for SSDs, I'm sure they will become commonplace eventually but the technology is a tad expensive & new.
Re: Power failure
Yes, backup power should always be available and functional but the potential for a hard failure due to power issues still exists. In the data centers I've worked with the battery backups do not hook into the OS layer like you can do with an APC - we use REALLY big battery systems that cover multiple servers/racks and, in my experience, if there is an extended power outage the servers have to be shut down manually. Also, with the big data centers I've worked with, we'll have a few days (not a month) worth of diesel stored on site and contracts in place to truck in fuel to keep it going indefinitely. During hurricanes we've had servers go down hard because the backup generators and the batteries drop out because the facility had to be evacuated before they could power everything off and the refuel trucks couldn't get in because of mandatory evacuations.
We had an IBM P-Series go down hard with a power supply issue recently. One of the redundant power supplies failed and it cooked off the backup supply before they could get it replaced. For some god-awful reason those boxes keep their local storage on software mirrors (which run insanely slow BTW) and I have seen both power supply and hard drive failures cause volume corruptions in this configuration.
I'll just put it this way, the clients I've dealt with over the years have little tolerance for outages of any kind and they get very pissed when their outages are due to architectural shortcuts. The situations I described above are extremely rare, no disagreement, but the expectation on both sides is that our architecture should be up to a minimum standard of resiliency and software mirroring, with old school HDDs at least, quite frankly, does not meet that standard.
In hindsight at least, an extended outage requiring a volume restore vs. the purchase of a real RAID controller is usually a given. We don't ask the client if they want it, we just put it into the build. The one time I've seen it (the P-Series boxes I described earlier) is as an exception to the rule... and one that we try to avoid like the plague.
This means more fakeraids. STOP THIS NONSENSE. It's as idiotic as the Winmodems of 10 years ago. It's bad, and should not be encouraged!
Down with WinDevices
> This means more fakeraids. STOP THIS NONSENSE.
If you are going to buy into software RAID then just back away and let the OS handle it. Just support enough channels so that the Operating System has something to work with.
Just support more SATA channels in system chipsets across the board.
Make 6 standard and 10 common.
RAID or not, a cheap board with lots of SATA ports can be very handy.
Six is standard, and you can do 6 x 8 with LUNs
Six IS standard, isn't it? At least in the chipsets ... some cheapo motherboards save money by connecting up fewer of them.
I've never tried them, but I've read about SATA LUN boards. It's a little board that fans out one SATA cable to four or eight disks. No bus interface. SATA is logically like the old SCSI bus, and can support many disks on a single port, if your O/S understands what it's seeing. I read about building a Linux system with 48 SATA disks connected to a standard motherboard once.
It's funny, since USB3 is a big pile of CPU hogging rubbish compared to Firewire, yet when people talk about proper DMA controllers and doing things correctly they get shot down on here.
Most PC clone hardware is cheap mass produced junk. Most motherboards still have a BIOS that should have been consigned to the dustbin when PnP came out.
Other than the fact that...
...SATA drives are pretty unreliable, Windoze in all flavors is bug ridden and crashes frequently, the whole scheme sounds like complete marketing B.S..
RAID software quality from Intel etc
Several issues come to mind.
Historically, Intel has had soft-RAID "support" in several generations of their ICH's - on top of SATA HBA's, up to six drive channels. A few years ago it was called the "Application Accelerator", then it was renamed to "Matrix Storage". I don't know for sure if there's ever been a RAID5/XOR Accelerator in there, or if the RAID feature consisted of some ability to change PCI ID's of the SATA HBA at runtime + dedicated BIOS support (= RAID support consisting of software and PR/advertising, on top of a little chipset hack). Based on the vague response in the article, I'd guess that there's still no RAID5 XOR (let alone RAID6 Reed-Solomon) acceleration in the PCH hardware - what they said means that they're looking at the performance and trying to squeeze out as much as possible out of the software side. Looks like not much is new here on the software part (RAID BIOS + drivers) - the only news is SAS support (how many HBA channels?), which gives you access to some swift and reliable spindles (the desktop-grade SATA spindles are neither), if the ports support multi-lane operation they could be used for external attachment to entry-level HW RAID boxes, and if the claim about expander support is true, you could also attach a beefy JBOD enclosure with many individual drives (unless the setup gets plagued by some expander/HBA/drive compatibility issues, which are not uncommon even with the current "discrete" SAS setups). I'm wondering about "enclosure management" - something rather new to Intel soft-RAID, but otherwise a VERY useful feature (especially the per-drive failure LED's are nice to have).
The one safe claim about Intel on-chip SATA soft-raid has always been "lack of comfort" (lack of features). The Intel drivers + management software, from Application Accelerator to Matrix Storage, has been so spartan that it was not much use, especially in critical situations (drive fails and you need to replace it). I've seen worse (onboard HPT/JMicron I believe), but you can also certainly do much more with a pure-SW RAID stack - take Promise, Adaptec HostRAID or even the LSI soft-RAID for example. It's just that the vanilla Intel implementation has always lacked features (not sure about bugs/reliability, never used it in practice). Probably as a consequence, some motherboard vendors used to supply (and still do supply) their Intel ICH-R-based boards with a 3rd-party RAID BIOS option ROM (and OS drivers). I've seen Adaptec HostRAID and the LSI soft-stack. Some motherboards even give you a choice in the BIOS setup, which soft-stack you prefer: e.g., Intel Matrix Storage or Adaptec HostRAID. Again, based on one note in the article, this practice is likely to continue. I just wish Intel did something to improve the quality of their own vanilla software.
One specific chapter is Linux (FOSS) support. As the commercial software-RAID stacks contain all the "intellectual property" in software, they are very unlikely to get open-sourced. And there's not much point in writing an open-source driver from scratch on top of a reverse-engineered on-disk format. There have been such attempts in history and they led pretty much nowhere. Any tiny change in the vendor's closed-source firmware / on-disk format would "break" the open driver. And the open-source volunteers will never be able to write plausible management utils from scratch (unless supported by the respective RAID vendor). Linux and FreeBSD nowadays contain pretty good native soft-RAID stacks and historically the natural tendency has been to work on the native stacks and ignore the proprietary soft-RAID stacks. The Linux/BSD native soft-RAID stacks can run quite fine on top of any Intel ICH, whether it has the -R suffix or not :-)
People who are happy to use a soft-RAID hardly ever care about battery-backed write-back cache. Maybe the data is just not worth the additional money, or maybe it's easy to arrange regular backup in other ways - so that the theoretical risk of a dirty server crash becomes a non-issue. Power outages can be handled by a UPS. It's always a tradeoff between your demands and budget.
As far as performance is concerned:
Parity-less soft-RAIDs are not limited by the host CPU's number-crunching performance (XOR/RS). If you omit the possibility of sub-prime soft RAID stack implementation, the only potential bottleneck that remains is bus throughput: the link from north bridge to south bridge, and the SATA/SAS HBA itself. Some Intel ICH's on-chip SATA HBA's used to behave as if two drives shared a virtual SATA channel (just like IDE master+slave) in the old days - not sure about the modern-day AHCI incarnations. Also the HubLink used to be just 256 MBps thick. Nowadays the DMI is 1 GBps+ (full duplex), which is plenty good enough for 6 modern rotating drives, even if you only care about sequential throughput. Based on practical tests, one thing's for sure: Intel's ICH on-chip SATA HBA's have always been the best performers around in their class - the competition was worse, sometimes much worse.
As for parity-based RAID levels (5, 6, their derivatives and others): a good indicator may be the Linux native MD RAID's boot messages. When booting, the Linux MD driver "benchmarks" the (potentially various) number-crunching subsystems available, such as the inherent x86 ALU XOR vs. MMX/SSE XOR, or several software algorithm implementations, and picks the one which is best. On basic desktop CPU's today (Core2), the fastest benchmark usually says something like 3 GBps, and that's for a single CPU core. I recall practical numbers like 80 MBps RAID5 sequential writing on a Pentium III @ 350 MHz in the old days.
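The "benchmark several implementations at boot and pick the best" idea described above can be illustrated in user space. The sketch below is only an analogy: it times two pure-Python XOR routines, whereas the kernel compares ALU, MMX and SSE code paths. The function names, buffer size and round count are assumptions chosen for illustration.

```python
import time


def xor_bytewise(a, b):
    """XOR two buffers one byte at a time."""
    return bytes(x ^ y for x, y in zip(a, b))


def xor_bigint(a, b):
    """XOR two buffers by treating each as one large integer."""
    n = int.from_bytes(a, "little") ^ int.from_bytes(b, "little")
    return n.to_bytes(len(a), "little")


def pick_fastest(candidates, size=64 * 1024, rounds=3):
    """Time every candidate on the same buffers and return the quickest one."""
    a = bytes(range(256)) * (size // 256)
    b = bytes(reversed(range(256))) * (size // 256)
    timings = {}
    for name, func in candidates.items():
        best = float("inf")
        for _ in range(rounds):
            start = time.perf_counter()
            func(a, b)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    winner = min(timings, key=timings.get)
    return winner, timings


CANDIDATES = {"bytewise": xor_bytewise, "bigint": xor_bigint}
winner, timings = pick_fastest(CANDIDATES)
```

Which routine wins depends on the machine and interpreter, which is exactly why the choice is made by measurement rather than hard-coded.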
The higher-end internal RAID cards, containing an IOP348 CPU at ~1GHz, tend to be limited to around 1 GBps when _not_ crunching the data with XOR (appears to be a PCI-e x8 bus limit). They're slower when number-crunching.
In reality, for many types of load I would expect the practical limit to be set by the spindles' seeking capability - i.e., for loads that consist of smaller transactions and random seeking. A desktop SATA drive can do about 60-75 random seeks per second, enterprise drives can do up to about 150. SSD's are much faster.
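A quick back-of-envelope calculation shows why seek-bound loads sit so far below the sequential figures discussed earlier. The seek rates are the ones quoted above; the 4 KiB transaction size is an assumption picked purely for illustration.

```python
def seek_bound_throughput(seeks_per_second, transfer_bytes):
    """Bytes per second when every transaction costs one random seek."""
    return seeks_per_second * transfer_bytes


TRANSFER = 4 * 1024  # assumed transaction size: 4 KiB

desktop = seek_bound_throughput(75, TRANSFER)      # desktop SATA, upper figure
enterprise = seek_bound_throughput(150, TRANSFER)  # enterprise drive

print(f"desktop:    {desktop / 1e6:.2f} MB/s")
print(f"enterprise: {enterprise / 1e6:.2f} MB/s")
```

Under that assumption a single spindle delivers well under 1 MB/s, so neither the host CPU nor the bus is anywhere near being the limit for this kind of load.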
The one thing I've recently been wondering about is this: where did Intel get their SAS HBA subsystem from? Already the IOP348 contains an 8-way SAS HBA. Now the Sandy Bridge PCH should also contain some channels. Are they the same architecture? Are they not? Is that Intel's in-house design? Or, is it an "IP core" purchased from some incumbent in the SCSI/SAS chipmaking business? (LSI Fusion MPT or Agilent/Avago/PMC Tachyon come to mind.) The LSI-based HBA's tend to be compatible with everything around. Most complaints about SAS incompatibility that I've noticed tend to involve an Intel IOP348 CPU (on boards e.g. from Areca or Adaptec) combined with a particular expander brand or drive model / firmware version... Sometimes it was about SATA drives hooked up over a SAS expander etc. The situation gets hazy with other less-known vendors (Broadcom or Vitesse come to mind) producing their own RoC's with on-chip HBA's...
You guys sound so 2005
You guys have no idea how ZFS works or how it's been shown repeatedly to be as reliable as hardware battery-backed RAID, do you? RAID-Z trounces RAID-5/6 for speed as well. Solution: Get drives with NVRAM caches or put your drives on an alternate power supply that keeps them running beyond a host down, for 1/10 the price of a hardware RAID NVRAM cache. Things have come a long way since software RAID meant running RAID-5 in flaky, slow Windows drivers.
Hit the nail on the head
Most people running Intel chips are windozers. Anyone playing ZFS is likely to be on solaris and either at home on intel-like hardware (where SAS drives are going to be too expensive) or at work on a 'real' Sunracle box.
People in "enterprise" class environment running an OS capable of ZFS on Intel hardware are probably as common as hen's teeth.
Hence we have another solution to a non-problem.
But, maybe in the future it will be a good thing.
A little more common than previously thought
@max allan "People in "enterprise" class environment running an OS capable of ZFS on Intel hardware are probably as common as hen's teeth."
Well I can't speak for all companies but we have a metric ass ton of Intel machines running ZFS and all new hardware purchases have been Intel with some kind of solaris on it.
The real problem stopping ZFS from being more mainstream is now solaris' licensing and the fact that you need a license to run solaris in an enterprise fashion (thanks oracle).
Personally I'm more excited with btrfs than ZFS, mainly because of the licensing issues. Btrfs is not there yet but it is very very promising performance wise.
RAID-Z is it
RAIDZ is the only solution addressing bit rot (bit flipping).
Use anything else if you don't care about your data.
Where is the evil Larry icon? Alien will do I suppose.
Ever tried ZFS?
Give ZFS a couple of hardware threads to work on and you'll forget why you needed a raid controller.
You do really want to use ECC RAM
I've seen what slightly faulty RAM can do to a filesystem, and it wasn't pretty. The same on RAID-5 could be far worse, if a disk failed and you tried to reconstruct through failing RAM.
So any RAID-5 system should use ECC RAM. This ought to be a selling point for hardware RAID controller manufacturers ... but I'm not aware of any of the low-end ones (4-8 disk) that actually use ECC!
Intel don't support ECC on any "desktop" boards and CPUs, you have to buy an expensive server board and Xeon. If you want to do it at home, build your own system with an AMD CPU, ECC RAM, and a motherboard that supports ECC (Most, not all, ASUS ones do). The 45 Watt TDP low-end CPUs are plenty fast enough for a fileserver box, and run cooler and quieter.
Intel - if you are serious about Atom servers in the datacentre, they need ECC support!!
Software RAID rules
This isn't news to anyone that runs Linux. Software RAID has been trouncing RAID controllers for several years. The overhead is negligible and you get lots of nice features like dynamic resizing and re-shaping of arrays (powerfail-safe ... not an experiment I've tried!) It's only Windoze that needs the band-aids of hardware RAID or FakeRAID, because it's crap.
In fact the CPU overhead has been unimportant since Pentium-4 days. The penalty was having to do two reads and a write per RAID-write along a PCI bus, compared to just one write for a hardware RAID controller. That penalty went away with modern multi-SATA chipsets and PCI-express. You can max out six HDDs on an intel ICHx doing a RAID rebuild, the disks are the bottleneck.
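The extra reads behind a software RAID-5 write come from the usual read-modify-write parity update: to change one data block, the host first reads the old data and the old parity, then computes the new parity from them. A minimal sketch of that update, with made-up two-byte blocks and a hypothetical helper name `small_write_parity`:

```python
def xor(a, b):
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write_parity(old_data, old_parity, new_data):
    """New parity = old parity XOR old data XOR new data."""
    return xor(xor(old_parity, old_data), new_data)


# A three-block stripe and its parity, computed the long way.
stripe = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = bytes(2)
for block in stripe:
    parity = xor(parity, block)

# Overwrite block 1 using only the old block and the old parity.
new_block = b"\xaa\x55"
updated_parity = small_write_parity(stripe[1], parity, new_block)
stripe[1] = new_block

# Cross-check against recomputing parity over the whole stripe.
full_parity = bytes(2)
for block in stripe:
    full_parity = xor(full_parity, block)
```

The shortcut touches only the block being written and the parity block, but the two reads it needs have to cross the host bus, which is the penalty described above.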
The future trend will be to remove RAID from the disk block driver or controller, and embed it into the filesystem, along with data-checksumming so that the filesystem can detect (and possibly repair) various sorts of higher-level errors. It also becomes possible to decide on a per-file basis whether to use no RAID, mirror-RAID, RAID-5 or -6. BTRFS is coming. (Agreed, ZFS was there first, shame about the licensing).
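What "detect (and possibly repair)" could look like for a mirrored block is sketched below: the filesystem keeps a checksum separately from the data, verifies each copy on read, and rewrites any copy that fails. This is an illustrative model only, not the BTRFS or ZFS implementation; SHA-256 and the function names are assumptions.

```python
import hashlib


def checksum(data):
    """Checksum stored by the filesystem, separately from the data block."""
    return hashlib.sha256(data).digest()


def read_with_repair(copies, expected):
    """Return the first copy matching the checksum and heal any bad copies.

    `copies` is a mutable list holding one block per mirror leg.
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("all copies fail the checksum; nothing to repair from")
    repaired = []
    for i, copy in enumerate(copies):
        if copy != good:
            copies[i] = good      # rewrite the damaged leg from the good one
            repaired.append(i)
    return good, repaired


original = b"important data"
stored_sum = checksum(original)
legs = [b"important dat\x00", original]   # leg 0 has silently rotted
data, healed = read_with_repair(legs, stored_sum)
```

A plain block-level mirror has no way to tell which of two differing copies is correct; the separately stored checksum is what makes the repair decision possible.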
Operating System RAID Support
Windows in all of its NT flavors has had built in software RAID for quite some time now, since at least the NT 4.0 days. NTFS and the built in disk management framework support JBOD, RAID0 and RAID1 at the very least. The few times I've run things that way, it's been plenty reliable enough if limited. (So far as I know, you can't actually boot Windows from such volumes as there is no way to set them up before the OS is installed.)
FakeRAID is another matter entirely and is universally evil no matter the OS.
Don't know where you got the idea that hardware RAID is a band-aid of any sort. As far as I'm concerned, if you want RAID done right AND you need to offload some/all of the overhead to something else other than the CPU, a hardware RAID controller is exactly what you want. Otherwise you will probably get along fine with whatever software options your OS offers...
Traditionally it's the stepmother who's evil, and the stepchild is the oppressed one. I guess this is par for the course for someone who does FUD for a living.
what is this 'eating crow'? i thought it was a euphemism for some sort of deviant sexual behaviour and was hoping to find an explanation or playmobil reconstruction here.
paris icon because i saw a film where she ate something that might have been a crow.
All this has me wondering...
... how many ZFS-related patents Oracle has, and how soon they will bury BTRFS with threats of legal action when it starts to become popular?
Just because Schwartz was okay with giving away Sun technology doesn't mean Larry will be...
You mean, like everyone running Linux ran out and purchased a SCO license? Not.
They'd have to say what it is that they claim a patent on. (SCO tried not to, basically because they were bluffing, and the FUD didn't work). Then it's extremely likely that someone will show the prior art, or the community will have replaced the patented bit with some other sort of wheel in the next release.
Oracle created btrfs
So I don't know where you see legal action coming into play here. It's still hosted on oss.oracle.com! However, zfs was one of the giant reasons they bought Sun, so they'll probably either incorporate it into a btrfs2 or phase out their direct support for btrfs and force a community fork, at which point it'll start getting left behind, like xfs. Hopefully the community cares enough to do something great with it.
"FakeRAID is another matter entirely and is universally evil no matter the OS."
Except you seem to be making a distinction between FakeRAID and software RAID that doesn't even exist except for Windows... in Linux, some util will read just enough data from the FakeRAID controller, or from the disks, to figure out the RAID config, then Linux's standard, well-implemented, and fast "md" RAID support will take over.
So.... personally, I've been a fan of software RAID for quite a while... the RAID cards I've used always seem to be "fragile" in terms of needing everything to be set up JUST so before they work, it always made me a bit nervous that something would get slightly off and the whole array would dump, when compared to the software RAID which was much more forgiving.
Still not admitting the truth
"Referring to software-RAID I/O performance of years past, she said: "Historically, it was slower. It just was. And it's not anymore. And in fact in a lot of cases it's a higher-performing solution.""
Software raid has ALWAYS been faster than hardware raid, unless the hardware RAID is caching to (hopefully battery backed) RAM.
Avoid RAID-5 and RAID-6
There is a whole site with lots of technical articles by sysadmins, dedicated to explain how bad Raid5 is.
More research on Raid:
"Detecting and recovering from data corruption requires protection techniques beyond those provided by the disk drive. In fact, basic protection schemes such as RAID  may also be unable to detect these problems.
As we discuss later, checksums do not protect against all forms of corruption"
Research shows that Raid6 is also bad:
"The paper explains that the best RAID-6 can do is use probabilistic methods to distinguish between single and dual-disk corruption, eg. "there are 95% chances it is single-disk corruption so I am going to fix it assuming that, but there are 5% chances I am going to actually corrupt more data, I just can't tell". I wouldn't want to rely on a RAID controller that takes gambles :-)"
But ZFS is safe against these problems:
My two cents
No one has yet mentioned custom implementations of RAID. If a hardware RAID card dies you need an identical one (make and model) to be able to access your data. This might be problematic if your card is a few years old.
Whereas with software you don't - take your HD to another PC with the same OS (motherboard etc can all be different).