Diary of a server failure

I recently experienced a major server failure. This article is my post-mortem. First, the environment in question: I have various ESXi servers using 300GB Velociraptors in RAID 5 as local storage. One server that hosted 27 user VMs had two disks in a six-disk RAID 5 fail simultaneously. When I rebooted the server the RAID …

COMMENTS

  1. Diskcrash

    And your other mistakes were..............

    Server failures are annoying and upsetting, but losing data is soul-destroying, so one area not to cut costs is where your data lives.

    One thing never to do is run SATA drives in anything other than RAID-10 or RAID-6. Their reliability is just not good enough, and with capacities at 2TB and soon 3TB that is too much data to lose, so if you must use SATA drives then use RAID-6.

    Next, don't use Western Digital drives. Sorry, but these are not appropriate for business or enterprise-level use. These are great drives for home use and for non-critical situations, but until WDC gets serious about providing real MTBF information I can't recommend them.

    By this I mean that WDC will state that their drives have an MTBF of 1.4 million hours, but if you go to their website and do a bit of digging you get this:

    http://wdc.custhelp.com/cgi-bin/wdc.cfg/php/enduser/std_adp.php?p_faqid=665&p_created=1035836712&p_sid=eRPhj*gk&p_accessibility=0&p_redirect=&p_srch=1&p_lva=&p_sp=cF9zcmNoPTEmcF9zb3J0X2J5PSZwX2dyaWRzb3J0PSZwX3Jvd19jbnQ9MTcsMTcmcF9wcm9kcz0mcF9jYXRzPSZwX3B2PSZwX2N2PSZwX3BhZ2U9MSZwX3NlYXJjaF90ZXh0PW10YmY!&p_li=&p_topview=1

    What that says is that while WDC will state an MTBF, they don't actually know what the MTBF of their drives is. They create their number by combining how long a component should last with how many drives get returned in a year, and this counts the ones that are sitting in my mother's and grandmother's computers. That is madness. A proper MTBF measures not only the failure rate but also tells you what duty cycle was used to arrive at that value; then and only then are you able to make an informed decision about the suitability of a drive for its use.

    I am not saying that WDC drives are bad; I'm just saying I wouldn't recommend them for use in a VMware system or anything with a heavy load, and I certainly wouldn't ever use them without RAID-10 or RAID-6 in place.

    You saved a few hundred on buying the drives and then spent three days of hell trying to recover your data, and for me that isn't much of a cost saving in the end.
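
    As a rough illustration of why the duty cycle matters, a quoted MTBF can be turned into an annualized failure rate (AFR). This is only a sketch: it assumes a constant failure rate (an exponential model), which WDC does not claim, and the function name and the 8-hours-a-day desktop figure are made up for the example.

```python
import math

HOURS_PER_YEAR_24X7 = 24 * 365      # 8,760 power-on hours for an always-on server
HOURS_PER_YEAR_DESKTOP = 8 * 365    # assumption: a home PC powered on 8 hours a day


def annualized_failure_rate(mtbf_hours, power_on_hours_per_year):
    """Probability that a drive fails within one year of use.

    Assumes a constant failure rate, so the result is only as good as
    that assumption and as the MTBF figure fed into it.
    """
    return 1.0 - math.exp(-power_on_hours_per_year / mtbf_hours)


QUOTED_MTBF_HOURS = 1_400_000  # the figure quoted in the comment above

afr_server = annualized_failure_rate(QUOTED_MTBF_HOURS, HOURS_PER_YEAR_24X7)
afr_desktop = annualized_failure_rate(QUOTED_MTBF_HOURS, HOURS_PER_YEAR_DESKTOP)

print(f"AFR at 24x7 duty cycle:      {afr_server:.2%}")
print(f"AFR at 8 hours a day:        {afr_desktop:.2%}")
```

    On those assumptions a 1.4 million hour MTBF works out to roughly 0.6% of drives failing per year at 24x7. The same MTBF gives a much smaller yearly figure at a desktop duty cycle, which is why a number without its duty cycle says little about server use.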

    1. Trevor_Pott Gold badge

      @Diskcrash

      Can I have your budget? It sounds large.

      1. jake Silver badge

        @Trevor_Pott

        "Can I have your budget?"

        No. IMO, at this point in your career you will only waste it.

        You don't seem to be able to get the point of cost/benefit analysis across to management.

        Suggestion: Go back to school. Get an MBA. Learn to talk to manglement. Easiest bit of schooling I ever slept thru' ... and holding that particular degree makes it easier to land contracts with Fortune 500s, to boot ;-)

        I know you consider me antagonistic, Trevor, but I'm not ... Sharing the hard knocks of over a third of a century in this business isn't a place I gloss over. I've made my mistakes, I've got the bruises to prove it, and I'm not afraid of laughing at myself. Unfortunately, few are willing to listen to the lessons of the past ...

        1. Trevor_Pott Gold badge

          @jake

          I don’t think you’re antagonistic, jake. I think you’re arrogant in your assumptions. I can forgive a lot of the commenters on El Reg for having Commenter Disease, but not you. For someone with your years of experience, you should be perfectly capable of understanding that the world as it applies to you does not apply to everyone else.

          I read your comment as saying that a “proper” sysadmin who is in my situation would either sweet-talk my bosses into providing me more funding, or get a different job because the one I am in isn’t good enough. There are two problems with this comment. The first is that there quite literally is no more funding to be had. No matter how many degrees you have, how good a con artist, salesman, businessman or smooth talker you are…you quite simply cannot get access to what isn’t available. I certainly will not be going in depth into my Company’s finances in full view of the internet…but I am one of the few in this company who knows where all the dollars end up. Suffice it to say that they are being spent where they need to be spent and there really isn’t anything more to be had for IT.

          The second assumption, that I should get an MBA and be essentially “just like you” is rubbish. I’d rather be boiled. I got into IT because I like FIXING things. I don’t like project management. I don’t really like management – though I’ll do either if required. Thanks to having shrinks for parents I am fairly good at manipulating, coercing, cajoling and coddling people…I simply choose not to. I prefer machines. I don’t want your career. In fact, as time has progressed and I have lived my life…I have discovered I want less and less to do with IT in general. I prefer writing. I compose music. Oddly enough, I get a thrill out of taking hardware and software of various types and pushing them to their absolute limits. In the 80s and early 90s I would have been called a “hacker.” Not because I spend my time penetrating other people’s computer systems, but because I like to tinker with things and figure out how they work.

          The key here is that (shock and horror,) I have no real ambition “to be rich.” I am (most of the time) content with being a middle-class largely blue-collar schmoe. I make enough to keep me happy right now. 5-10% more than I make would provide enough to save luxuriously for retirement. You and I simply have different values, jake. I want nothing more than to largely be left alone to tinker. Periodically, I like to share my thoughts and experiences before returning to my man-cave. I want a different job, it’s true. I just don’t want /your/ job.

          The job I really want is pretty rare. I want a job in what amounts to “practical application research and development of IT systems.” I want a job wherein I get to take off-the-shelf components and do something with them that hasn’t been quite done before. “I wonder if this can do X.” That guy – whoever he is – that decided that cookie-tray servers were a good idea? Dreamt them up, built a prototype, tested them and refined the process? That’s the job I want. Figure out how to bodge 48 disks reliably into a case meant for 8? Hey, that sounds like an absolute BLAST! Working where I am is never going to make me rich. It’s frustrating and it’s constraining and I get laughed at by people on the internet for not being an MBA working for a fortune 500. It’s still the closest thing I’ve ever found to being the guy I described above.

          How does this reflect in my writing, my articles and threads like this? It means I direct myself not at the guy who is gunning for the job at the fortune 500 and running a fantastic network with all the right parts in all the right slots with the right budget. I write what I know: trying to do the nearly impossible with a virtually non-existent budget and usually a whole bunch of mismatched equipment that was purchased slowly, a piece at a time over the course of years.

          Look at some of these comments in this thread. There’s a guy somewhere here who makes some radical assumptions like “you’re lucky ESXi installed on that third computer.” Talk about Commenter Disease! That “third computer” was a diskless spare box designed to be swapped in place of a failed ESXi box. He takes the fact that ESXi didn’t install on my PROTOTYPE WINDOWS FILESERVER and extends this logic to assume that I simply didn’t have known adaptable spares. This is a shining example of where you and I butt heads. When I am writing an article such as the one I just wrote, I am trying to convey a narrow slice of an infinitely complex puzzle and bodge the whole thing into 500 words. This comment alone is longer than I would be allowed to write my articles! You extrapolate an awful lot from what is available and make some very big (and largely incorrect) assumptions in doing so.

          It would take me days to properly explain my environment to you. We quite simply don’t have the money to do things “by the book,” but that doesn’t mean we don’t have backups upon backups and dozens of layers of redundancy. Everything on this network is designed in such a way that it can pull double or triple duty if necessary. There are spares for all critical components and I even go so far as to ensure that my personal computers (and personal computers sold to family members/etc.) use standard-model parts. If the day ever comes that I have burned through all of my spares and absolutely need a replacement bit of kit on an emergency basis, I’ll know where to go to find it.

          I am not saying I do everything perfectly, even taking my limited resources into account. Far from it! I have much yet to learn. In this exact case, I did something neat: I saved a failed RAID 5 by using a different-but-related RAID controller. I futzed with the servers for a few frustrating (but ultimately very fun) hours, and then I poked a transfer window two or three times over the course of a weekend. I didn’t have to go through the hassle of restoring anything from backups. Restoring from backups would have taken about the same amount of time but been far more work.

          I learned something new and figured I would pass it along to whomever my musings might help. I did so knowing full well that the comments thread would be nothing but eleventeen squillion commenters with “let’s make a bunch of completely invalid assumptions and then lay into an author/commenter/random individual for making the mistakes we only assume they made” Commenter Disease. I can even forgive them that.

          You though? A worldly management type with decades of technical experience should be beyond that by now; you should know the world is rarely as simple as it is presented in a 500 word bit of text.

          1. jake Silver badge

            Wow. 1200+ words of denial.

            1: Re-read mine. Not arrogance, just a suggestion based on a lifetime's experience.

            b) I wasn't saying "become management", I was suggesting "learn how they think".

            iii - The funding exists. Trust me. But it'll go to managerial bonuses, unless you can figure out a way to redirect it.

            quatro] Tinkering isn't a job description, unless you are independently wealthy (for various values of wealthy ... I knew a couple Rom families in England who were poorer than me, financially, but a hell of a lot richer from a family standpoint. I took notes & run my Ranch from a similar standpoint.)

            five} Lose the angst. It's not doing you any good, and it's wasting a lot of your time. Life's too short.

            1. Trevor_Pott Gold badge
              Unhappy

              @jake

              Wow. I cannot believe I was so very deeply wrong about you. You truly do have the very worst form of Commenter’s Disease there is. This statement: “iii - The funding exists. Trust me. But it'll go to managerial bonuses, unless you can figure out a way to redirect it.” Quite simply means you have no effing clue, and aren’t interested in even trying to extract said clue from what someone else writes. You actually are incapable of comprehending that the world does periodically function in a manner that is non-cognate with your personal beliefs and experiences. I will note this and move on. I am deeply disappointed that I was this wrong about you.

              As to angst, well…I know the internet is terrible for conveying subtlety. You are mistaking angst for frustration. They are very different concepts that I, at least, deal with differently. Given the commenter’s disease present here, however, I won’t bother trying to explain.

              If at some point in the future I ever find myself in a situation remotely like yours, I will look you up. You are an intelligent individual with a great deal of experience to share. Unfortunately, the disparities between our professional and personal lives are so great that we are unable to communicate remotely effectively. I lack the skill to convey my situation in a manner capable of overcoming your commenter’s disease; a sad failing I freely admit.

              For now, I will simply wish you a good day, sir. Good luck in your future endeavours.

              1. jake Silver badge

                Commenter’s disease?

                Re: iii; I learned printing during and right out of high school. Small volume, high quality shop. We did four-color process + 1 (usually black or varnish) (including custom color mixing), foil stamping, hand-set lead type, embossing, die cutting, and miscellaneous other bindery work. To this day, I have a Heidelberg Windmill & other printing bits & pieces in a roof-to-floor 32x32 "cube" in my machine shop.

                At my first print shop, we installed an early computer based system to monitor who was doing what, and for what customer, in order to better estimate jobs. The boss discovered that I had a Clue about computers ... and I became the sysadmin. After about a year of the boss babbling about "profit sharing", with me having a grasp of what a lying twat he was (based on me having access to the books), I quit & purchased the above mentioned Windmill (and a small Polar paper cutter, but that's another story).

                In a nutshell, if your Boss is driving a new Mercedes, and you are driving a fifth-hand Escort, you are being had if you are working for a small shop.

                As for angst ... When you are banging out 1200 words in this kind of context where 150 or so will do, there is a lot of emotion behind it. From my perspective it is angst. I could be wrong. Wouldn't be the first time.

                Not sure what you mean by "commenter’s disease" ... I read ElReg when I'm stuck in the office (tonight, and probably for the next couple weeks, I'm monitoring a mare who is threatening to foal about a month early ... I only need about 4 hours sleep a day, so I take the night-shift ... one of the Hands will spell me about 4AM). Sometimes I comment on what I read here. I feel no need to do so, but I like to share in the hopes that I can educate, make people think, or get someone to laugh.

                You, on the other hand, seem to feel a need to over-comment on the threads generated by your articles. Perhaps Orlowski has a point ...

                I'll leave it there ... If I sense an article is one of yours, I won't bother reading it. Shame really, because I think you can write and have promise in this field ...

                I think I've offered before, but if I haven't ... If you're ever in ElReg's San Francisco office, drop me a line. I'll buy lunch, if I can get away. I kinda suspect we'd have a good conversation over a beer or two :-)

                1. Trevor_Pott Gold badge

                  The issue here..

                  ...is perceptions like "your boss driving a new Mercedes, etc." My Boss drives a 2003 GMC Jimmy. I drive a 2005 Scion XB. He lives in a nicer house, but he bought during a housing bust - I bought during a boom. You are - wrongly - transposing your views/experiences to others. Sure, my boss makes more than I do; but not a heck of a lot more and he earns every penny through additional responsibility and a great deal of hard work.

                  For every complaint I could lodge against the place I work – and the folk who run it – I cannot say that they eat cake whilst the proles beg for crusts of bread. My boss makes mistakes – we all do including myself – but I will unreservedly say he’s a good person.

                  Regarding the angst bit…you are wrong. There is no angst over any of this, merely frustration. Regardless of the number of words – 150, 1200 or otherwise – I don’t seem capable of providing adequate context. This is doubly frustrating for me; as a sysadmin it means having someone analyse and comment on my professional capability whilst starting from incorrect assumptions.

                  As a writer, I seem to lack the ability to convey information in such a manner as to be capable of correcting those false assumptions. This generates no more angst than a website install requiring a mod_security alteration that I don’t know off the top of my head. It generates more frustration however, because I can Google the mod_security alteration. I am as yet not experienced enough to know which syntax to enter into which search engine to alter misperceptions.

                  As to “commenting too much on my own articles,” you’re probably right. I made the mistake of assuming that certain commenters – yourself among them – were willing and able to absorb additional facts that might then alter extant misperceptions. Unlike many of the other authors here on El Reg, I started as a commenter first. Long debates moderated by her excellence Madam Bee are not foreign to me.

                  Make no mistake, I welcome criticism and suggestions. I have a deep respect for the staff and commenters on El Reg. In many of my articles there have been excellent suggestions…several of which I have tested and which have made their way into my production environment. I wrote an article here about how I got lucky and recovered a RAID 5. A half dozen people came out of the woodwork and proclaimed “RAIDX is dead! Long live ZFS!” I haven’t had much opportunity to work with ZFS in the past 18 months or so, but these comments have inspired me to go forth and set up a test lab to see exactly what it is I am missing.

                  Where it all falls down for me – personally and professionally – are the circular arguments. I have absolutely no idea how to deal with them. The religious argument is a great example:

                  10 The bible is infallible.

                  20 How do you know it’s infallible?

                  30 Because the Bible is the word of God.

                  40 How can you be sure it’s the word of God?

                  50 Because the Bible tells us so.

                  60 Why believe the Bible?

                  70 GOTO 10

                  I am unable to deal with those arguments. I do not know how to “win” them. When trapped in them, I know of no graceful way out of them. Whilst I can deal with technical, political or religious arguments about many topics, I have this personal failing when it comes to circular reasoning. When you and I have an argument along the lines of:

                  10 X is terrible, you should Y.

                  20 I had no funds for Y, I had no choice but to X.

                  30 There is always money available!

                  40 I promise you there is no way there was funding to Y, I had to X.

                  50 GOTO 10

                  I expect experience will give me a greater chance of seeing these sorts of logic loops and avoiding them. I hope experience grants me the ability to at some point learn how to gracefully exit these sorts of pointless conversations. Until that point in my individual development however, I fear circular reasoning loops will continue to be my personal kryptonite.

          2. Peter Mc Aulay
            Thumb Up

            Re: @jake

            I feel your pain. For what it's worth, I think saving a hardware RAID array by getting it to work with a different controller is cool too. (I once got the "opportunity" to test whether a SCSI HBA from a Compaq would work in a Sun, for slightly similar reasons. For the record, it did.)

            I find that different jobs (companies) lead to very different approaches to sysadminning - things you would do in a large enterprise simply won't fly in an SME, MBA or not. In one job you have to improvise with duct tape and repurpose the tin cans from a previous project, in another you can pressure a project manager for more budget, or brow-beat the architect into fixing a broken spec. It all depends.

            1. Trevor_Pott Gold badge
              Pint

              @Peter Mc Aulay

              It's a relief to know that there are some folks on these boards that do indeed understand. Yes - the approach to running an SME shop with two tin cans (one of which is on loan) and a string is exceptionally different from running a shop where you can do magical things like source all your gear from a Tier 1. The interesting part is that I am held to a 99.99999% SLA by one of the two primary shareholders. Concepts like “well, just cluster everything and then get four-hour service plans from Tier 1s!” display a shocking ignorance of what my world actually looks like.

              I’m lucky though…it all ends in mid 2012. For the very first time we get to refresh our servers all at once, and do it properly. This year I got to do the desktops: Out with the 11-year-old systems that were falling apart to a brand new deployment of Wyse clients. (Hurray!) 2012 brings the server refresh…and a move from my world to one where whitepapers might actually apply! It’s things like “the money exists, trust me” that absolutely floor me. No degree – MBA or not – makes that assertion true.

              Still, the commenter’s disease of most commenters thrown aside…I don’t write my articles for the folks with MBAs or working in places where buying “new gear for a specific job” is ever an option. El Reg has plenty of readers who don’t fall into that category. In my city alone El Reg is the wild favourite of all the sysadmins working for the various charities. Several low-IT-budget SME admins are also part of the local gang. Not to say the folks running the University departments aren’t also regulars…but they simply play in a different world than I do.
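
              For a sense of scale, the 99.99999 SLA mentioned above (read here as 99.99999% availability, which is an assumption about the unit) can be converted into a yearly downtime allowance. This is a minimal sketch using a 365-day year; the function name is made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 seconds in a non-leap year


def allowed_downtime_seconds(availability_percent):
    """Downtime per year permitted by an availability target given in percent."""
    return (100.0 - availability_percent) / 100.0 * SECONDS_PER_YEAR


seven_nines = allowed_downtime_seconds(99.99999)
four_nines = allowed_downtime_seconds(99.99)

print(f"99.99999% allows about {seven_nines:.1f} seconds of downtime a year")
print(f"99.99%    allows about {four_nines / 60:.1f} minutes of downtime a year")
```

              On that reading the budget is a little over three seconds a year, which is less than a single server reboot.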

              There are lots of articles on El Reg that talk about “EMC storage arrays” and “VMWare’s latest super-deluxe ultra-edition management software that requires you to pay in the form of pureed virgin soul.” There aren’t so many aimed at the guy working for the local charity who is putting together donations from a dozen different businesses, most of which don’t match, barely work and for which he doesn’t have spare parts.

              I’ll see if I can get the brass to rename my blog. “Sysadmin blog” is obviously going to cause nothing but continual commenter’s disease issues with the types of folk who think that all sysadmins face the exact same challenges. Maybe I can get them to rename it “Two cans and some borrowed string Blog.” Has a ring to it, I think!

              In any case, it should be pointed out that despite my frustrations as regards rampant commenter’s disease, there is a lot of gold in this thread if you aren’t a “two cans and some borrowed string” kind of sysadmin. Most of the commenters here are – as usual – dead bang on rights. El Reg really does have a bright crowd answering the call of the comments section.

    2. Chris Mellor 1

      WD drive MTBF

      Hello Diskcrash,

      I thought the posted comment about WD's odd MTBF measurement was brilliant. Would you like to contribute your thoughts on where and when to use 2.5-inch vs 3.5-inch HDDs to a Register story?

      Cheers,

      Chris Mellor (cmellor@sitpub.com)

      1. Diskcrash

        2.5 inch drives are a good option to consider if capacity isn't your main criterion

        2.5 inch drives are attractive for a number of reasons; the main one is that their size is a counter to the "fat" drive syndrome. If you consider that drive manufacturers use the same disk drive design within a family and scale it up by adding platters, you can see that 250GB, 500GB and 1TB drives have the same fundamental performance. The use of 1, 2 or 4 disk platters does have a more direct effect on the cost of the disk drive.

        There are four basic aspects to keep in mind when considering a disk drive, and the importance of each will vary depending upon your priorities for that drive: performance, reliability, capacity and cost.

        When it comes to performance, drives have two basic measurements: megabytes per second (MB/s) and I/Os per second (IOPs). They are somewhat inversely linked, so you can get high MB/s or high IOPs but not both at the same time from the same drive.

        The max MB/s a drive can provide is generally much easier to attain, as that can be done with large-block sequential reads and/or writes. IOPs are harder, but for a disk drive it still isn't that hard to reach an individual drive's maximum.

        Going back to the 250GB/500GB/1TB drives, if you were to look at the pricing it would scale at something better than 1 to 1. By that I mean the 1TB drive isn't 4 times the cost of the 250GB model. You might see pricing along the lines of £100/£175/£300 (costs are made up for illustration purposes and to make the math easier). So if you wanted to purchase 10TB of capacity these disks would cost you:

        40 250GB x £100 = £4,000

        20 500GB x £175 = £3,500

        10 1TB x £300 = £3,000

        So the larger drives give you a better cost per megabyte of capacity. But remember the performance aspect of drives: how do the cost savings affect that? If you were to assume that this imaginary drive family could sustain 50MB/s or 200 IOPs per drive, you would then have the following:

        40 250GB x 200iops = 8000 IOPs (2000MB/s)

        20 500GB x 200iops = 4000 IOPs (1000MB/s)

        10 1TB x 200iops = 2000 IOPs (500MB/s)

        This means that if your servers and your applications need to sustain 4000 IOPs and you purchase 1TB drives, you need to purchase at least 20 of them even if you only need 10TB of capacity. This is the fat drive effect: as a drive's capacity gets larger its performance does not scale up with that capacity, and the resulting tendency to buy fewer drives reduces the aggregate performance available.
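
        The sizing rule above (buy enough drives to satisfy whichever is larger, the capacity requirement or the IOPs requirement) can be sketched as follows. The prices, the 200 IOPs per drive and the 10TB / 4000 IOPs targets are the made-up illustration figures from this comment; the function and variable names are hypothetical.

```python
import math


def drives_needed(capacity_gb_required, iops_required, drive_capacity_gb, drive_iops):
    """Number of drives needed to meet BOTH the capacity and the IOPs target."""
    by_capacity = math.ceil(capacity_gb_required / drive_capacity_gb)
    by_iops = math.ceil(iops_required / drive_iops)
    return max(by_capacity, by_iops)


# Imaginary drive family: name -> (capacity in GB, price in pounds)
DRIVE_FAMILY = {
    "250GB": (250, 100),
    "500GB": (500, 175),
    "1TB": (1000, 300),
}
IOPS_PER_DRIVE = 200      # same for every model in the family
CAPACITY_NEEDED_GB = 10_000  # 10TB, counted as 1TB = 1000GB as in the figures above
IOPS_NEEDED = 4000

results = {}
for name, (capacity_gb, price) in DRIVE_FAMILY.items():
    count = drives_needed(CAPACITY_NEEDED_GB, IOPS_NEEDED, capacity_gb, IOPS_PER_DRIVE)
    results[name] = (count, count * price)
    print(f"{name}: {count} drives, total cost {count * price} pounds")
```

        With those made-up numbers, meeting 4000 IOPs on 1TB drives takes 20 drives at £6,000, against 20 of the 500GB model at £3,500, so the cheapest cost per megabyte is no longer the cheapest way to meet the requirement.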

        So what does all that have to do with 2.5 inch drives? Well, quite simply, they don't store as much, so people will be buying more of them. High performance enterprise level 2.5 inch SAS drives will max out at about 900GB for the next 6 months or so, whereas 3TB enterprise drives will be on the market early in 2011.

        My personal feeling is that you should always consider any component of a data centre in the context of what you want it to do. Sometimes cheap drives that you can toss into the bin are perfectly fine, but other times your reputation and possibly your job depend upon doing more than just choosing the cheapest supplier, and in those cases a bit more thought should be put into it. Ideally you would want to have the ability to use both 2.5 inch and 3.5 inch drives of varying capacity and performance levels to give you this flexibility, though probably in the end you will only go with one or two of the possible options out there to simplify the service and support of your systems.

        I think 2.5 inch disks are great, and so far I've not seen anything to indicate that their reliability differs, at least not significantly, from that of a 3.5 inch disk.

        It was interesting to see a recent report that SSD drives seem to be having a field failure rate similar to that of an HDD though possibly that might improve over time as their technology is further refined and defined.

  2. Anonymous Coward
    Boffin

    my title is broke !

    "Take home lesson? If your system can break in a given way, take the time to research exactly how you’ll deal with it when it does"

    so, you need to research every which way your server can break and have a plan in place in case it does...

    there is only a finite amount of resources you can reasonably set aside, but it is an impossible task to make sure you have a plan for every eventuality... best you can really do is double or even triple redundancy....

  3. Lee Dowling Silver badge
    FAIL

    Duh

    1) If your rebuild-from-backup time is 3 days, why don't you invest in something minor (like a 1TB eSATA- (or even USB-) connected disk like I can buy from any Maplin's shop for about £50-100) that will *not* provide any extra security or redundancy but *will* provide a restore time on the order of minutes rather than days? Permanently connect it, have backups write to it as an external medium in the usual backup procedure. It won't save you in a fire, but it'll cut your restore times by orders of magnitude when it comes to most hardware failures. Especially handy if you know your drives have been troublesome lately. Or a nice, cheap NAS box is the perfect "local restore" backup for things like this - you barely need to do *anything* to make it work and Gigabit restore speeds certainly sound better than waiting days for backup tapes to be found.

    2) If ESXi was so finicky about what it would install on, you're purely lucky that it installed at all. What would you have done if it didn't like the only card that could see your RAID, so that no matter WHAT machine you used for restore it wouldn't install?

    3) If your RAID5 rebuild times are that high, that's why RAID6 was invented. At the cost of an extra disk, it's a MASSIVE reassurance when you have long RAID restore times (and even a double-failed array can be read raw off the disk and reconstructed, assuming point 4 below applies).

    4) Your RAID was highly dependent on the chipset implementation given to you. This in itself is a *fabulous* argument for Software RAID (which has closed the gap tremendously in terms of speed) or for using a RAID setup that has a well-documented layout (and thus can be loaded in any machine with something like Linux "md" driver, at least enough for data recovery). If the chipset had changed, or the format had been upgraded, or newer chips didn't come with the backwards compatibility, you would be stuffed again.

    You were damn lucky, basically. And all for something that an extra hard-drive (either in the RAID or as an external backup device) would have turned into a mere afternoon job without panic.

    1. Anonymous Coward
      WTF?

      Disk backup good plan

      2. ESXi isn't finicky, it has a very specific hardware compatibility list, no luck involved. But servers drop off it with every revision, so ESXi 4.0 and 4.1 will have different lists.

      3. Not all PCIe RAID cards do RAID6, but yeah, aim for one that does. Author may not have had the option if his servers are old.

      4. ESXi doesn't do software RAID. But yeah, for DAS software RAID every time if you can.

      5. A bunch of DAS based ESXi servers is just strange in 2010. iSCSI and NAS boxes that support it are cheap, and Iomega NAS/iSCSI devices are on the compat list now that EMC own them.

  4. Orv Silver badge
    Pint

    RAID 6

    I mostly run RAID 6 these days. My reasoning: With RAID 5 I had to have a hotspare anyway, if I didn't want to come in at 3 am in the event of a drive failure; once I have another drive spinning, it makes more sense to just go RAID 6 and actually use that additional spindle.
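
    That trade-off can be checked with simple arithmetic. The sketch below assumes a six-bay box with 300GB disks (illustrative figures, not taken from this comment) and uses hypothetical function names; it compares RAID 5 plus one hot spare against RAID 6 across the same number of spindles.

```python
# Disks' worth of capacity consumed by parity for each RAID level
PARITY_DISKS = {"raid5": 1, "raid6": 2}


def usable_capacity_gb(level, total_disks, disk_gb, hot_spares=0):
    """Usable capacity once parity and idle hot spares are subtracted."""
    data_disks = total_disks - hot_spares - PARITY_DISKS[level]
    return data_disks * disk_gb


def simultaneous_failures_tolerated(level):
    """Disk failures the array survives before any rebuild has completed."""
    return PARITY_DISKS[level]


TOTAL_DISKS = 6   # assumption: six drive bays
DISK_GB = 300     # assumption: 300GB per disk

raid5_with_spare = usable_capacity_gb("raid5", TOTAL_DISKS, DISK_GB, hot_spares=1)
raid6_no_spare = usable_capacity_gb("raid6", TOTAL_DISKS, DISK_GB)

print(f"RAID 5 + hot spare: {raid5_with_spare}GB usable, "
      f"survives {simultaneous_failures_tolerated('raid5')} simultaneous failure(s)")
print(f"RAID 6, no spare:   {raid6_no_spare}GB usable, "
      f"survives {simultaneous_failures_tolerated('raid6')} simultaneous failure(s)")
```

    Both layouts give the same usable capacity from the same number of disks, but only RAID 6 survives two disks failing at once, since a hot spare helps only after its rebuild has finished.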

    1. Snapper

      RAID 6 eclipses RAID 5

      With nTb SATA drives in RAID boxes now, RAID 5 really should come with warning stickers. The rebuild from one failed drive is obviously going to put a lot of sustained strain on the remaining drives, and the likelihood of a second failure is hockey-sticking.

      I manage several servers in graphic studios, so big files and lots of them. Deadlines are being cut finer and finer as the Studios shed staff, and fewer people are available to do the work. Any stoppage or slow-down could mean financial disaster, but budgets are still getting massacred.

      I usually put in two RAID 6 units to the main servers. One backs up the other, but in addition the folder of current work gets a daily full copy to five separate folders on the second RAID. The same current data is also backed up to five daily folders on two NAS/FireWire drives on the network and is also backed up off-site overnight. This ensures that copies of the current files can be accessed one way or the other and takes into account deleted or corrupt files, with a rising time requirement for recovery as you go down the chain.

      Not 100%, but as I said, the budgets just ain't there to do more, keep it simple AND have a good chance of getting the data back quickly.

  5. jake Silver badge

    This is why some of us ...

    ... insist on having proper off-line backup solutions, and tested "known good" recovery systems in place. Experience is a painful teacher.

  6. AdamWill

    RAID rebuild times

    useful tips for improving RAID rebuild times:

    http://www.cyberciti.biz/tips/linux-raid-increase-resync-rebuild-speed.html

  7. graeme leggett Silver badge

    had a similar case

    when two drives apparently went West at the same time.

    You say RAID5 was the wrong choice but didn't say what the right choice should have been.

    1. Trevor_Pott Gold badge

      @Graeme Leggett

      RAID 1 for small stuff. Disk capacities are enormous now, and RAID 1 is the quickest rebuild.

      RAID 6 for anything you might have previously used RAID 5 for: it's the new "best compromise."

      RAID 0+1 for Speed.

      That would be my take for RAIDing on a budget…

      1. Anonymous Coward
        Anonymous Coward

        Of course....

        You could also ask what the expected IO profile of the application(s) is going to be and make a judgement call based on that.

        Just choosing RAID levels on resync times is like choosing a car because the spark plugs are easy to change - hardly the whole story.

  8. Anton Ivanov
    Flame

    It's another bit of history repeating

    Multiple disk failures in a RAID5 or RAID1 are classic. The probability, especially for RAID5, is much higher than you would expect. The reason for this is that a failure of one disk stresses the rest of the disk subsystem and a pending failure shows up immediately.

    Here RAID becomes your enemy. In most cases a RAID failure of a disk is a failure of a specific sector, not a whole disk failure. The disk should be marked as suspicious, alert raised, 00 written forcing S.M.A.R.T. relocation (that is what the spec does), 00, ff, 55, aa written to the same sector as a test pattern and after that the sector recovered from the other disks in the set.

    However, I have yet to see a single OS or simple hardware RAID set to do that (most SANs do it internally with extra bells and whistles). At best you get SMART tests and SMART monitoring by firmware. Usually you get nil and I have yet to see a sysadmin who runs SMART tests on RAID array components. So when a failure comes you run a very high chance of cascaded failure because the reconstruction will encounter one or more of the "relocation pending" sectors on the other disks.

    As long as no OS does what I noted above (which is bloody trivial) your only choice is to do regular SMART checks on all disks in an array and promptly plug in a hot spare if you do not like something. The optimal solution would have been to hit any SMART pendings with a 00, but that is usually not an option because you cannot compute where they are so you have to nuke whole partitions in software RAID or whole disks. Just yesterday I had to take out from the array, nuke it with 00, and put back into the array one of the volumes which were part of the "/var" RAID set on my home server. Unpleasant, but necessary as there was no way to compute exactly where the bloody pending sectors were.
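
    The "regular SMART checks" part can be scripted. A minimal sketch, assuming you already have the attribute table that `smartctl -A` prints saved as text: the watch list and the sample table below are illustrative (made up, modelled on the usual column layout), and real output varies by drive and smartmontools version.

```python
import re

# Attributes that commonly point at sectors waiting to be remapped (assumed watch list).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def pending_counts(smart_text: str) -> dict:
    """Extract raw values of the watched attributes from smartctl -A style text."""
    counts = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Rows look like: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            match = re.match(r"\d+", fields[9])
            if match:
                counts[fields[1]] = int(match.group())
    return counts


def is_suspect(counts: dict) -> bool:
    """Flag the disk if any watched counter is non-zero."""
    return any(value > 0 for value in counts.values())


# Illustrative sample only, not captured from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""

counts = pending_counts(SAMPLE)
print(counts, "suspect" if is_suspect(counts) else "ok")
```

    Run against every member of an array on a schedule, something like this at least tells you which disk to swap before a rebuild trips over the pending sectors. It does not tell you where they are, which is the problem described above.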

    1. gratou
      Stop

      ZFS

      Why ZFS is not mentioned when there is a discussion about raid is beyond me.

      ZFS and RAIDZ beat any raid config for data safety, ease of maintenance, and ease of moving the array to a different box.

      RAIDx is dead afaiac.

  9. NemoWho
    FAIL

    Another Valued Lesson...

    ...on the horseshit sense of security one has in running RAID5. Post-crash successful rebuilds? We've heard of them...

  10. Jeff 11
    Unhappy

    Another sysadmin humbled by RAID5 failure

    Was there a conscious reason why you chose RAID5 with vastly expensive disks over RAID10 with (more) inexpensive ones, perhaps aside from reduced space and power consumption? If you were using them for VMs then that's potentially a lot of writes to swapspace, which along with database commits is one of the worst applications I can envisage for that I/O arrangement.

    Unless you've got relatively small volumes where the chance of two single bit errors is low, I'd avoid it like the plague. I believe the peculiarity of RAID5 is that it gets less reliable as you add more disks, whereas RAID10 gets MORE reliable, due to the decreasing probability of two disks on the same spindle failing at the same time.
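
    To put rough numbers on that intuition, here is a minimal sketch under a deliberately naive model: once one disk has died, every surviving disk fails independently with the same probability `p` before the rebuild finishes. Correlated same-batch failures would make the real figures worse. The function names are made up for illustration.

```python
def raid5_loss_after_first_failure(disks: int, p: float) -> float:
    """RAID 5 loses the array if ANY of the remaining disks fails during rebuild."""
    return 1.0 - (1.0 - p) ** (disks - 1)


def raid10_loss_after_first_failure(p: float) -> float:
    """RAID 10 loses data only if the dead disk's mirror partner fails."""
    return p


def raid10_second_failure_is_fatal(disks: int) -> float:
    """Chance that a second failure, hitting a random survivor, is the mirror partner."""
    return 1.0 / (disks - 1)


p = 0.01  # assumed per-disk failure probability during one rebuild window
for n in (4, 6, 8, 12):
    print(
        n,
        round(raid5_loss_after_first_failure(n, p), 4),
        raid10_loss_after_first_failure(p),
        round(raid10_second_failure_is_fatal(n), 3),
    )
```

    Under this model the RAID 5 risk climbs with every disk added, the RAID 10 risk per incident stays flat, and the chance that a second failure lands on the one fatal disk shrinks as the array grows.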

    1. Trevor_Pott Gold badge

      "Why RAID 5"

      That has a very long story. Thanks for asking rather than assuming, though! The real answer was that we didn't specify "RAID 5" when building these servers. Originally, the servers had 8 drive bays: 2x 250GB Seagate ES.2s for the OS and 2x Velociraptors for the VMs. Both were RAID 1.

      We left the rest of the bays open so that we could expand capacity later…when we got more money to do so. (You really have to understand that money does not flow here like it does for many of the other commenters. People cavalierly toss about suggestions of putting 15K SAS drives in my servers…but I had to scrimp and sacrifice to put my data on separate disks from my OS in the first pair of this model VM server.)

      When we bought our third server (and along with it FINALLY a physical backup unit in case the motherboard went on any of our now three production copies of this model of VM server) we were in a position of doing rather well, cash-wise. I was able to purchase enough disks to fill all the slots in all three servers. It would be enough to get us the VM capacity we so very desperately needed. With the following caveats:

      1) I only had enough money if I didn’t toss the existing 4 Velociraptors. That meant all my new drives either had to be the same or I had to start divvying up the arrays.

      2) The only way I would get enough space out of the existing drives whilst still having redundancy of any variety was RAID 5.

      This meant extending our extant two Velociraptor RAID 1s into RAID 5s, and putting a RAID 5 in the third server. That was about 8 months ago. We now have 6 of these systems in service with two physical spares. We have 20% surplus capacity across this model of VM server…so I will be able to take the hit to reduce to RAID 6. (We do that this coming Tuesday, as a matter of fact.)

      They aren’t our only VM servers…I have a fleet of 12 others in the field (two active, one physical spare per city for four cities.) They have half the cores, half the RAM and run only a RAID 1 of Velociraptors each. There are a smattering of other VM servers too…but they are all test bench stuff as they are one offs that I don’t have replacement parts for.

      So the “why RAID 5” is a legacy item: it’s from days not too long ago when we absolutely needed those very last gigabytes and had no more dollars to spend. Not that we have many dollars now…but I have been making very careful purchases with every dollar I can get my hands on.

      I am very eagerly awaiting the First True Server Refresh in 2012 (we finally have this budgeted as a company-wide Major Project!) This refresh will see SANs in each city. If I have my way, SANs running SAS drives in RAID 10. It might be sad, but when I dream the dreams that I dream, I dream dreams of SANs…

  11. Nate Amsden

    shouldn't be that bad

    rebuild times for a 5+1 RAID 5 array with 300GB 10k RPM SATA disks should be pretty quick. People start to see issues when you're talking 1TB+ on 7200RPM SATA disks. If it is too slow perhaps it's an issue with the controller or something.

    Since you suffered two simultaneous faults on disks that another server says are OK, then perhaps you have issues elsewhere: controller, firmware, cables, power supply? I assume you have good UPSs backing your systems..

    The first set of arrays at my last company that they bought before I started there on the SATA side were mixtures of I think 250GB and 500GB RAID arrays, all of them SATA-I(old old) and RAID 5 *12+1* (stupid stupid). ~250 disks were SATA (another ~250 were 10k FC, also RAID 5 12+1). SATA rebuild times in that case could take up to 36 hours at the most. FC rebuild times typically 4-6 hours(146GB disks)

    not fun! distributed RAID on a nice SAN for me..no more futzing with stupid cheap crap RAID controllers(or worse, software RAID)

  12. Anonymous Coward
    Alert

    No s**t Sherlock

    "RAID 5 was a bad choice."

    Why do you think random failure never seems random?

    "Most arrays contain members of roughly the same age and drive generation; chances of a second drive failing during a rebuild are high."

    So you order/make a server; what do you bet all the disks are from the same batch? How many people would source their disks from different suppliers to avoid this? Wonder if you could buy a storage array and ask if all the disks were from different batches/factories/makers :-)

    1. Trevor_Pott Gold badge

      @AC

      Well, our servers are specced by us...but built by the local distie. (Supercom.) The servers are actually usually quite good kit and Supercom are fantastic to work with. I do indeed get to specify "please make sure they aren't all the same batch." They will even go out of their way to dig up slightly older or fresh-off-the-boat-new disks to mix-and-match what goes into an array for me. Maybe all server makers won’t…but mine does. I love them for it.

      Still, there is only so much variability you can get doing that. When you want 6 drives of the same model for your array…they are going to be relatively close together. No perfect happy solution, I’m afraid…

  13. Destroy All Monsters Silver badge
    Megaphone

    RAID 5? In 2010?

    http://www.miracleas.com/BAARF/BAARF2.html

    http://www.miracleas.com/BAARF/RAID5_versus_RAID10.txt

  14. batfastad
    Jobs Horns

    RAID 6

    I always tend to go for RAID 6 over RAID 5 for that very reason as it can survive 2 simultaneous drive failures. The more drives in the array, the more drives there are to potentially fail.

    RAID rebuild takes a fair amount of time.

    With a half-decent RAID card the performance differential of RAID 6 over 5 shouldn't be that much worse. Might not even be worse, I think the bottleneck could be the I/O operations on the drives.

    What is worrying though is how the LSI card refused to process the metadata on its drives. We've got a 3Ware 9690SA card (3Ware are now part of LSI) and it's superb.

    Is there any sort of basic interoperability or standards for the RAID array metadata between vendors?

    Would be reassuring if there was. But I guess the parity/stripe algorithms are an area where vendors can get performance gains over each other so unlikely to open that stuff up.

  15. William Boyle

    RAID != backup

    It seems a lot of people, IT sysadmins in particular, have the attitude that because they have a RAID, they don't need to be quite so careful to keep their data backed up to off-line storage. In reality, it is quite the opposite. RAIDs tend to be configured with drives from the same manufacturing batch, which means that when there is a drive failure, there is likely to be more than one drive going south for the winter. Bang! That was nice data while it lasted! :-(

    So, after 30 years in the business of designing, building, and deploying large-scale enterprise critical systems for major manufacturers, my opinion is this.

    1. Use RAID 5/10 for read-mostly performance. If you want write-mostly performance, go to the fastest devices you can find with the widest I/O channels and skip the RAID cruft.

    2. Keep all critical data backed up on a regular (daily) basis.

    3. Keep database transaction logs on devices apart from the actual data storage.

    4. Make bit-image copies of system drives both before and after you do significant OS or application code updates/upgrades. If things go south with the system/boot devices, you can restore to the last known good configuration. If the update/upgrade fails, then you can revert to the image that existed just before you updated.

    5. Plan for failure. Hardware fails. Software has bugs. This is life - live with it! But don't assume because something works today that it will work tomorrow.

    I spent years designing and implementing the MES used to run most 300mm semiconductor fabs around the world. One of our major design elements was that the system be resilient to failure. NOT to be fault-tolerant, but fault-resilient. Provide means for automatic fail-over when software fails, computers fail, disc drives fail, controllers fail, networks fail. As a result, these systems run 24x365 with 6-sigma+ reliability. That's a bit less than 1 hour per year of unplanned down time (at 6-sigma reliability). FWIW, 1 hour of down time in a 300mm semiconductor fab is several million USD. So, let's just say that 6 sigma is not good enough... :-)

  16. magnawave
    Grenade

    Scrub it... regularly

    Yup RAID6 or mirrors are smart, but there is one thing to remember - especially with using larger capacity or SATA drives in any RAID: DO regular routine parity / mirror block checks!

    One of the (behind the scenes) features you see in the enterprise arrays (and Linux software arrays from the major distros nowadays) is a background parity raid scrubbing feature or at least a periodic scan during off hours. Extra parity bits (RAID6) or mirrors are definitely a good thing - but there is absolutely no replacement for occasional background reading of all blocks on your arrays. I can't begin to explain the number of times I've seen small raid rebuilds "fail" due to a bad block that accumulated over time but wasn't scrubbed when there was another disk to rebuild from. Data can be corrupted due to noise / vibration / or just bad luck. You don't want to find out about parity block 45698 being bad when the rebuild kicks off due to a real disk kicking the can.
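
    For anyone wondering what a parity scrub actually compares, here is a toy sketch for single-parity (RAID 5 style) stripes held in memory. It is not how any particular controller or the md driver implements it, and a real scrub also has to read every block off the platters, which is what surfaces unreadable sectors. The names `xor_blocks` and `scrub` are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, as single-parity RAID does."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def scrub(stripes):
    """Return indexes of stripes whose stored parity no longer matches the data.

    Each stripe is a (data_blocks, parity_block) pair.
    """
    return [i for i, (data, parity) in enumerate(stripes) if xor_blocks(data) != parity]


good = [b"\x01\x02", b"\x10\x20", b"\xaa\x55"]
stripes = [(good, xor_blocks(good)), (good, xor_blocks(good))]

# Simulate one data block in stripe 1 silently rotting while its parity stays as written.
rotted = [good[0], b"\x10\x21", good[2]]
stripes[1] = (rotted, stripes[1][1])

print(scrub(stripes))  # indexes of inconsistent stripes
```

    Found during a routine scrub, a mismatch like this can still be repaired from the redundancy that is left. Found halfway through a rebuild with a disk already gone, it cannot.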

  17. EXAFLOPS'R'US
    Boffin

    Redundancy is the issue here

    When building servers and networks, one of your first goals is to remove any single point of failure from your design. If you are able to accomplish that by also having live offsite replication on the other side of the country/world then that's a really nice thing, too. You could also look at a VSAN solution with replication. There are open-source products like OpenFiler and some other more expensive, but worth it, VSAN products like Lefthand SAN.

    Once that is accomplished, you need to have data snapshots in place (backups, archives) and a DR plan that works backwards from a recovery time window. You may have a failover server offsite, or if the MTTR is long enough and budget low, a hardware purchase agreement in place to supply you with a replacement server within e.g. 48 hours.

    There are some wonderful backup solutions available, such as ShadowProtect and Veeam, which work well with VMware ESXi, although you'll need to be running at least VMware vSphere Essentials because of a Veeam licensing restriction. Disks are cheap.

    Data is expensive.

    Good solutions don't have to cost the earth, either.

  18. Anonymous Coward
    Anonymous Coward

    How about design a decent solution...?

    Sometimes it pays to open the wallet.

    Firstly, as already pointed out, you're using SATA. If you're desperate to use local storage SAS is faster and the disks more reliable (RAID 6/10 is an alternative, but controller overheads can be higher than RAID 5).

    Second, when speccing the kit, use VMware's compatibility matrix. Also, try off the shelf kit like Dell etc. Sometimes, he who buys cheap gets his ass kicked unnecessarily.

    Third, why did you not look at an NFS/iSCSI NAS (plenty of cheap, but reliable kit out there) to house your VMs?

    Fourth, a decent backup solution. Nuff said.

  19. ScottXJ6

    ZFS FTW Again...

    Big problem here is proprietary raid format preventing the moving of data from one system to another due to the controller lock-in.

    ZFS being software raid means you're not locked in to any particular controller/system.

    With ZFS, because the file system and volume manager are combined (not separate like a RAID card with a file system on top), rebuild times are proportional to the amount of data on the disk. Rebuilds are also top down from the file system structure meaning most important data is rebuilt first (and thus available first). Not having to wait for every block to be re-done before even the probably good blocks are available...

    admittedly it wouldn't have helped against the WD uptime problem of 49 days, but it would have been easier to restore, wouldn't have needed a 'rebuild', and would have been back online much quicker.

    my passively cooled Atom-based 6TB raidz home server (400 quid) could pull 1TB over the lan in just over 4 hours.

    Can scrub it once per week as well to ensure no data loss or data corruption that other filesystems wouldn't even notice.

    oh and dedupe, and tiered caching to SSD, and so on...

    It's painful to get emails from IT about lack of space on a 1TB file server when storage is now so cheap and accessible.

  20. Peter Kay

    Sounds more like an architecture failure to me..

    You don't state why you had problems re-installing ESXi on another system. It sounds like you didn't have either a spare chassis, or another server that the disks could be taken from and directly transplanted to another server. Was it the case of an unprotected boot disk but an alive array? In that case the boot drive should be protected with at least RAID 1.

    I may be missing something, but I don't see how you recovered it in the first place as RAID 5 can only survive a loss of one drive, unless it's both a hot spare and one of the drives that fail.

    In any case, I don't see why you chose fast SATA disks when 15K SAS disks are only marginally more expensive. Nor do I think that RAID 5 is necessarily the wrong choice, provided the rebuild times in case of a complete failure are acceptable and the cost/benefit of the risk of failure is appropriate.

    I don't understand why the transfer time of VMs using vSphere is valid. Either the system is working and the transfer time of a day is likely irrelevant, or a rebuild is required, or ESXi is dead but the array is alive but suspect - in that case it's just as appropriate to use an external disk to transfer the data.

    1. Trevor_Pott Gold badge

      @Peter Kay

      I had trouble installing ESXi on the system with the Intel RAID card simply because it was never designed to be an ESXi server! It was a prototype Windows file server...in all honesty it was a Big Collection Of Storage Space that served as a "focal point" for backups across the company. All the backups were collected onto this system, then written to removable media. I was testing a new chassis (24x SAS hotswap darling,) a new Motherboard, new RAID card and a new SAS expander. It was literally in early prototype state.

      Remember that the RAID 5 didn't actually have any disk failures! The drives merely dropped out of the array due to that wretched TLER bug. (For the record: I loathe Velociraptors. I wish I had the cash for proper SAS drives, but at the time it was "use the Velociraptors, or we make you use 7200rpm Seagate ES drives." I had zero choice in the matter.) That means the drives were actually fine…but after 49 days and change they simply stop responding to commands. The RAID card can’t see them any more and so thinks that they have dropped from the array. Power the server physically off and then power them on…*poof!* drives are back up and doing fine.

      So in this case, when the server came up the LSI controller read the metadata on these two drives and saw that they should be part of a 6-disk RAID 5. When it looked for other members of that array it found four other disks…all of which believed they were members of an array which had dropped two disks! By sheer fluke the Intel controller was able to pick up all six disks as a single array…apparently ignoring the metadata mismatch that the TLER error caused.

      As to not choosing SAS drives…it simply wasn’t an option. Most commenters in this thread behave as though I could have simply had a tantrum and money would have appeared…but that quite honestly wasn’t the case. I was lucky (HAH!) to not be stuck with 7200rpm Seagate ESes. Things will be different in 2012. Then I finally get to do something like a “bulk replace” of my entire server fleet. For the first time I can do it properly: a SAN with some front end VM servers, some physical servers for critical tasks and proper identical parts (with spares) from a single vendor giving us a sexy warranty. The company I work for has never been in the position before to do so. Seven years ago they had four computers and one server. The growth has been in fits and starts and quite literally at the very limit of the budget each time.

      The transfer time issue is this: Only that bloody Intel controller would talk to those six disks as an array. If I shoved the disks back into the LSI 1078 (any of the many 1078s I have) it would see them as two arrays. If I wanted to get the VMs off (which I did, because restoring from backups is a pain in the ass,) then I had to shove the Intel controller into a system which could boot ESXi (not the Windows prototype it was originally located in) and then pull the VMs off. Understand that nothing about the array was suspect! The drives were not DEAD. They had dropped out of the array due to the TLER error and nothing more. The data was 100% intact, the only question was; how to get at it?

      Once I had put one of the spare ESXi computers back together (I had it apart for a testbed project) I was able to toss the Intel card into it and it saw the array just fine. I shoved a new set of disks into the original ESXi box with the 1078 in RAID 6. Copy the VMs from the spare ESXi box with the Intel controller in it to a file server and then from there back up to the original ESXi box with its new array.

      This is why the transfer time is important: getting the array back up and shoved in a box that would read it doesn’t take long. Pulling the VMs off and then uploading them again does. Fortunately, I don’t have to do anything but periodically poke the computer to make sure the transfer hasn’t failed. That’s a hell of a lot less work than restoring everything from backups would have been.

  21. Anonymous Coward
    Anonymous Coward

    Good post, good comments

    But "50 days since I last power cycled the servers"? Wtf? I only power cycle my servers if I do a hardware upgrade.

  22. Anonymous Coward
    Anonymous Coward

    You need more focus on the overall picture

    The solution is to build proper redundancy into your overall infrastructure. Never get yourself into the situation where you need a specific physical server to be running for a particular job to get done, because that server will die and you'll be in firefighting hell.

    Always make it such that the VMs you need can be started up and deployed on another server within a short period of time.

    Then don't bother faffing about with RAID controllers which aren't working properly. That's what support contracts are for. Get the hardware engineer to deal with it, then rebuild the machine at your leisure when its hardware is working properly again.

  23. Anonymous Coward
    Anonymous Coward

    And why all these problems with RAID?

    In my experience RAID does its job well. When I've had disk failures in RAID 1, RAID 5 and RAID 10 arrays I've simply done what you'd expect, i.e. popped in the spare disk and things have continued fine.

    I will say, though, that my experience of SMART is that it's pretty useless in predicting disk failures or even telling me when one has occurred. Instead, for HP servers, I've used hpacucli connected to the check_hparray plugin in Nagios which does the trick nicely.

  24. Gordan

    To sum up

    A few points:

    1) Hardware RAID is way too 20th century. Most controllers still use totally inadequate 200-ish MHz DSP for calculating checksums and they become the bottleneck on RAID5 and RAID6 to the point where the performance is considerably higher with software RAID. CPUs are plenty fast enough to handle this better than more expensive dedicated hardware. Software RAID is also more flexible and more recoverable - you have a lot more control over what you can do when things go wrong.

    2) Checksummed file system level RAID (ZFS and BTRFS) is better still because it can leverage extra file system level knowledge to handle partial failures much more gracefully. IMO, the days of hardware RAID are rapidly expiring due to the nosediving reliability of disks.

    3) RAID rebuild speeds are limited to the write speed of a single drive. Be it RAID 1, 5, 6, or 10. So on a 300GB Velociraptor in RAID5, your rebuild time, assuming about 100MB/s (conservative), should be about 50 minutes. Anything more than this indicates a bottleneck on the RAID card (see 1) ).

    4) You say you have a 6-disk RAID5 array of 300GB Velociraptors. That means 1.5TB of usable space. Are you really saying that your backup solution takes 3 days to recover 1.5TB? Assuming that your 1.5TB array was 100% full (unlikely) that gives a restore speed of roughly 5-6MB/second. It sounds questionable whether such a backup solution is fit for purpose for the way you rely on it.

    5) RAID5/6 is inappropriate for high-performance applications because the random write performance is typically the same as that of a single disk (can be even slower). The RAID card may gloss over this slightly if most of your writes are in infrequent batches smaller than the cache. Given the low performance of such a setup you are likely to be better off with something like a 4-disk RAID10 with bigger disks, which will also give you better reliability, at the expense of rebuild time being longer (limited by the speed of the disk being rebuilt). RAID 1x will also have the benefit of letting you mix drives from 2 different manufacturers.

    6) Disks are unreliable. Horribly so. Google's analysis a couple of years back showed that the manufacturer specified reliability figures (MTBF and bit error rates) are overstated on the side of optimism by about an order of magnitude. It also showed that when disks of the same make/model are used in an array, probability of a 2nd disk failing once the 1st has failed goes up by a factor of 10x. Probability of a 3rd disk failing goes up by a factor of 100x. This makes RAID questionable if disks are all the same make and model. The situation gets even worse if the disks are from the same batch. If you can't have FS level RAID for some reason, the 2nd most sensible thing is RAID10 with disks from 2 different manufacturers mixed so that no mirrored pair is on the same make/model of a disk. Not to mention the performance improvement you will see out of it over parity based RAIDs.
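
    The back-of-envelope sums in 3) and 4) can be checked in a few lines. This is only a sketch: it assumes decimal units (1GB = 1000MB), a rebuild that is purely limited by one drive's sequential write speed, and a restore that ran flat out for the whole three days. The function names are made up for illustration.

```python
def seconds_to_copy(size_gb: float, mb_per_s: float) -> float:
    """Time to stream size_gb at a sustained rate, using 1GB = 1000MB."""
    return size_gb * 1000 / mb_per_s


def average_mb_per_s(size_gb: float, seconds: float) -> float:
    """Average throughput needed to move size_gb in the given time."""
    return size_gb * 1000 / seconds


# Point 3: rebuilding one 300GB disk at about 100MB/s.
rebuild_minutes = seconds_to_copy(300, 100) / 60
print(f"rebuild: {rebuild_minutes:.0f} minutes")

# Point 4: restoring a completely full 1.5TB array over 3 days.
restore_rate = average_mb_per_s(1500, 3 * 24 * 3600)
print(f"restore: {restore_rate:.1f} MB/s")
```

    If the array was less than full, the effective restore rate was lower still.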

  25. There's a bee in my bot net

    Wow you really put yourself out there... (for a kicking)

    It sounds like you are building your own servers (please tell me I'm wrong).

    Buy off the shelf with hardware support (4hr response on site isn't very expensive). Don't use SATA unless it's for a server providing an ancillary service that no one will really notice too much when it disappears. And I personally would never add more than 4 (Hot) SATA drives to a raid 5.

    Try vMA and Ghetto VCB to back up your VMs to a (low cost) NAS with big disks. If you are really paranoid, periodically back up the NAS with cheap massive USB disks...

    Worst case is if your array is trashed you can boot ESXi from flash and run the VMs (albeit a bit slower) from the NAS (using NFS to mount it as a datastore or most provide iSCSI these days so you could do it that way) while you rebuild the array...

    vMA and Ghetto VCB can back up a 300GB VM in 15mins...

  26. Anonymous Coward
    FAIL

    Software RAID is still bad

    Professionals do not use software RAID at this scale for 2 important reasons: there is no BBU to the cache (PC PSUs fail far more than RAID cards even if you do have a UPS) and with any large number of disks you will saturate the PCI bus. Software RAID has its place in servers with just a few disks but it is certainly not a general replacement for hardware RAID.

    Plus there is a simple answer to the MTBF problem. Don't use 2TB or 3TB drives in RAID sets, use more 1TB disks.

    1. Gordan

      Re: Software RAID is still bad

      Your argument doesn't withstand scrutiny.

      First of all, on-board SATA controllers built into the south-bridge have more bandwidth than the disks can use up. On the older 3x/4x Core2 Intel chipsets and the AMD 790 series, I have never managed to saturate them - they provide 4 ports and easily sustain over 450MB/s.

      Additional PCIe SATA cards, typically with 2-4 ports, have up to 1GB of bandwidth available on the PCIe bus even if wired into a PCIe x1 slot.

      If anything, that means that with software RAID and no-frills SATA cards you can have more PCI bandwidth available than with a single RAID card. Granted, it depends on how many PCIe lanes the RAID card has going to it, but at best it's a non-issue. And on a RAID card you'll bottleneck on checksumming on RAID5/6 anyway, long before you use up the PCIe/disk bandwidth.

      You don't need battery backup to the caches in terms of safety, just make sure your hardware correctly obeys barriers. Most hardware does nowadays, including cheap consumer grade SATA disks.

      Hardware RAID got deprecated some years ago, IMO. It lacks both performance and flexibility compared to software RAID today.

      As for MTBF - size of disks is less relevant than the redundancy arrangements. The fundamental problem is that probability of a disk failing soon after another similar disk failed increases dramatically.

  27. ScottXJ6

    lol

    At this scale? he is talking about a handful of 300gb drives, hardly big scale? most geeks home fileservers are many times larger...

    Battery backed cache is needed because of the raid 5 write hole issue. What happens if your raid card goes down? bet you can't hotswap the data in the cache to another card!
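
    For anyone who hasn't met the write hole, a toy illustration: one stripe with two data blocks and XOR parity, where power is lost after the new data hits the disk but before the matching parity does. This is a sketch of the failure mode only, not of how any real controller orders its writes, and the names are made up.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def rebuild_missing(surviving: bytes, parity: bytes) -> bytes:
    """Reconstruct the lost block of a two-data-block stripe from what is left."""
    return xor(surviving, parity)


# A consistent stripe: two data blocks and their parity.
d0, d1 = b"\x11", b"\x22"
parity = xor(d0, d1)

# Update d0, then lose power before the new parity is written.
d0_new = b"\x99"       # reached the disk
stale_parity = parity  # never updated

# Later the disk holding d1 dies and the array rebuilds it from the rest.
rebuilt_d1 = rebuild_missing(d0_new, stale_parity)
print(rebuilt_d1 == d1)  # the block that was never touched comes back wrong
```

    A battery backed cache closes the gap by holding the unfinished parity write across the power loss, which is why it stops helping the moment the card itself is what died.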

    PCI bus saturation is an issue which should be considered when speccing your server... Again, ZFS can utilise any sata ports from any controller on any busses and assuming the system can keep up, the performance is accrued across all devices. software raid is superior in this respect in a way that traditional hardware raid cannot begin to match. It's not like you can 'SLI' 2 identical raidcards to get twice the throughput for 1 array.

    ZFS really is the future.

    1. Gordan

      Re: lol

      "Its not like you can 'SLI' 2 identical raidcards to get twice the throughput for 1 array."

      Indeed, I know what you mean - near impossible to do with multiple parity RAID cards.

      But - actually, I did once upon a time have a pair of IDE RAID cards that allowed you to do just that. They were of the fake/BIOS RAID type, and I cannot remember off the top of my head if they were based on the Promise or HPT chipset, but the BIOS on the first card detected and initialized all similar cards in the system and presented you with the disks from all the cards to use in one big RAID array. It could do this precisely because it was fake/BIOS/software RAID.

  28. Rob Brady
    Alert

    Data isn't data unless it's in (at least) two places (preferably far apart!)

    Agree fully with the other "Old Timers" comments on here.

    RAID should NEVER be considered as backup.

    Skimping a few bucks on the RIGHT hardware to balance the cost of downtime and potential data loss is always false economy.

    El Cheapo "Fake RAID" cards should be avoided like the plague, with the possible exception of ICH SB onboard RAID 1 for boot drives ONLY in small systems. Disks created by that are generally highly portable to another box if the motherboard dies. Also handy to get around the braindead ESX/XenServer installer decisions.

    Battery backup in a RAID controller is nice, but even nicer are dual power supplies to an A & B rail (fully independent power). If you're not in a data center then two separate UPSes will do at a pinch. It may be overkill for a home NAS serving up movies, but if your job is on the line then make a strong case for it. If you're overridden, get it in writing so if things do go south then you know you'll still be employed.. ;-)

    On the subject of backups, if you even half care about protecting your data you must have at minimum "Offline" backups in a fireproof safe, or preferably at a remote site (LVM with snapshots is great for this).

    The classic error is to assume that just because you have two copies of data on the LAN you're golden. While unlikely, a decent fire or lightning strike will quickly show you the error of your ways.

    Lightning's rare, but I've had to deal with the aftermath of a big strike 30ft from a rack of servers. Induced current travelled over the Cat5, power lines and in a whole host of other "Interesting" ways. It succeeded in destroying every hard disk at the site - both in the server rack and in every PC on the LAN, along with pretty much everything else IT related. The 20x disks in the SAN were actually turned into magnets! Yup, you could pick up screws with them...

    All in all, about 30TB of data was blown away. Once new hardware was sourced it took less than a day to have everything back up and running, with no luck involved.

    There's a reason it's called "Disaster Recovery PLANNING"... ;-)

    Unfortunately, it's not until they've had their fingers burned at least once that most people start to pay attention to this unglamorous part of keeping systems running. In my case it was cutting corners when building a big RAID6 array (many moons ago). Being young and inexperienced at the time, I overlooked the small matter that if you have 8 IDE disks on a 4 port controller, when the drive that fails is the Master, you'll lose the slave as well... Ooops.

    Glad this worked out for you, but consider yourself bloody lucky you dodged a bullet here. It's easy for everyone to lecture over a rookie mistake, but I'm still astonished how many "Pros" have no idea about what it really takes to avoid data loss.

    Hopefully this story and subsequent comments will enable at least one reader to learn from the mistakes of others, but I'm sure there also will be more than one that scoffed but gets bitten later by something similar.

  29. Trevor_Pott Gold badge

    @All the "RAID != backup" comments.

    For the record...I do have perfectly valid backups. Recovering the array in question was not a matter of "oh crap...if I lose that data I am dead!" Recovering from backups is a fairly long and tedious process that I was not particularly amused by.

    Recovering the array on the other hand was

    a) far more intellectually interesting

    b) potentially much faster.

    If it makes anyone feel any better, one of the elements left out of this particular article was that in the background, whilst I fiddled with the array, backups were unpacking to a secondary server just in case I needed to make use of them. In my case, recovering the array was faster than recovering from backups, which would have involved transferring and differentiating template VMs followed by reloading the latest backups to them.

    That all said, I would like to reinforce that RAID != Backup!!! RAID is a convenience, nothing more. It is a method of helping to provide uptime or raw speed. I should also add that Replication != Backup!!! Replication is again nothing more than a convenience. It is an uptime tool. (If it is offsite replication it can be a disaster recovery tool.)

    Backups provide more than simply the ability to recover from a disaster such as failed RAID. Properly done, they provide the ability to recover from human error: “oops I deleted this file!” Replication in many cases will simply replicate the deletion. RAID won’t help you out of that pickle at all.
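
    To make the replication point concrete, here is a toy sketch, assuming a "server" is nothing more than a dictionary of file names. The file names and the `replicate` helper are invented for illustration and do not model any particular replication product.

```python
import copy

# Hypothetical primary server contents.
primary = {"report.doc": "v1", "budget.xls": "v1"}

# Hypothetical nightly backup: a point-in-time copy kept separately.
backup = copy.deepcopy(primary)


def replicate(source):
    """Mirror the source's current state, deletions included."""
    return dict(source)


replica = replicate(primary)

# Human error: "oops I deleted this file!"
del primary["report.doc"]

# The next replication pass faithfully copies the mistake.
replica = replicate(primary)

in_replica = "report.doc" in replica
in_backup = "report.doc" in backup

# Only the point-in-time backup can bring the file back.
primary["report.doc"] = backup["report.doc"]
```

    After the deletion, the replica no longer has the file but the backup still does, which is the whole difference between an uptime tool and a recovery tool.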

    Recovering an array as described in this article should never, EVER be your only option! Please view such measures as conveniences only!

  30. Kevin Davidson
    Alert

    RAID5 no longer fit for purpose

    1st choice RAIDZ2 or at least RAIDZ. Yes, with a Z - ZFS.

    2nd choice RAID10

    3rd choice RAID6, but that's assuming you're not running any databases, in which case you only get the first two choices.
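
    As a rough way to compare those choices, here is a sketch of the capacity and redundancy arithmetic for the six 300GB disks from the article. It assumes idealised layouts: filesystem and metadata overhead are ignored, RAIDZ and RAIDZ2 are treated as single and double parity, and RAID10 is counted by its guaranteed worst case (one failure), even though it can survive more if the failures land in different mirrors. The function name `raid_summary` is made up for this example.

```python
def raid_summary(level, disks, size_gb):
    """Return (usable_gb, guaranteed_failures_survived) for a simple array."""
    if level in ("raid5", "raidz"):
        # Single parity: one disk's worth of capacity goes to parity.
        return (disks - 1) * size_gb, 1
    if level in ("raid6", "raidz2"):
        # Double parity: two disks' worth of capacity goes to parity.
        return (disks - 2) * size_gb, 2
    if level == "raid10":
        # Striped mirrors: half the disks hold copies.
        return (disks // 2) * size_gb, 1
    raise ValueError(f"unknown level: {level}")


for level in ("raidz2", "raidz", "raid10", "raid6", "raid5"):
    usable, survives = raid_summary(level, disks=6, size_gb=300)
    print(f"{level:>6}: {usable} GB usable, survives any {survives} disk failure(s)")
```

    On those assumptions the double-parity options give up one more disk of capacity than RAID5 in exchange for surviving any two failures, which is exactly the case that bit the author.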

    SMART, BTW, is pish. It will very rarely warn you of impending failures.

  31. Anonymous Coward
    FAIL

    Outages

    I do not know how you guys run your shop, but we *KNOW* why an outage occurs. If it takes one person working on it for a week, we will know why/how it happened. If it was a hardware problem we will know why, usually the same day. Software is slightly different, as it takes man-hours to go through the dump, and at the end we can get pretty specific about what went wrong *AND* how to fix it. Yes, there are a misc 1 or 2 a year that defy classification (I would like 0, but I know after going through dumps for 40 years that sometimes it is just not possible). I personally spent 80 hours++ going through a dump and trying to come up with a final answer. The answer was a cross between hardware and software. The device was malfunctioning and giving errors that the OS had never heard of before and did not know how to handle. The base issue was twofold: the hardware itself, and the hardware generating an error that the OS was not able to handle. BTW, I sent the dump off to the hardware manufacturer and they fessed up and sent out a patch that would stop that from occurring.

    The real hard ones (hardware/software/firmware) are a dickens to shoot. We had a new version of sort and every once in a while it would go into a never-ending wait. I had so many traces running trying to trap the issue that at one time the system became overloaded and stopped tracing. That bug had me going around in circles for weeks. I could not trust what I was seeing in memory.

    Every week we had a status update and all the resolutions and issues had to be talked about. One time our boss's boss stopped in and he could follow the conversation. Luckily, even if you are bleeding edge, someone out there has found the problem (most of the time) before you did, and the software people are either working on a fix or are aware of it, and they do fix the problems (at least IBM did).

    1. Trevor_Pott Gold badge

      @AC

      That is a very rational way to run a shop. In all honesty, we don't run that tight...but I do try to adhere to the principles you espouse as much as is possible. The reality of that situation is that some hardware issues I am unqualified to troubleshoot. (I am not an electrical engineer.) Others...well, we sometimes take the easy route out. ("We only have five of these in the field, they are 2/3 of the way through their life and we've RMAed three. Pull the whole line and we'll replace them.")

      I do take the time to throw things on the bench and test the dickens out of them whenever I can. It's the reason I have spare parts for everything; more often than not, if I can get the originals back in hand I can find out how they died and often repair them.

      Overall, I think taking the time to properly investigate failures is important; sometimes failures are preventable simply by making small changes to the operating environment. (Reducing vibration, temperature deltas, etc.) It’s an important practice.

