Hundreds of websites go titsup in Prime Hosting disk meltdown

Hundreds of UK-hosted websites and email accounts fell offline when a disk array failed at web biz Prime Hosting. As many as 860 customers are still waiting for a fix more than 48 hours after the storage unit went titsup. The downtime at the Manchester-based hosting reseller began at 5am on 31 July, and two days later some …

COMMENTS

This topic is closed for new posts.
  1. Mako

    "[P]romised that it has had a team working solidly for 36 hours without sleep in order to minimise the impact."

    Knowing how goofy and error-prone I become after about 24 hours without sleep, that doesn't exactly fill me with feelings of confidence.

    And it also gives me the impression that this is yet another company that thinks working people like pit-ponies is not only acceptable, it's laudable.

    1. Rameses Niblick the Third (KKWWMT)
      Thumb Up

      Definitely, definitely this. One upvote is just not enough

    2. LarsG

      This pretty much tells you to keep a local backup and not rely on a third party to keep your data.

  2. Wize

    They stick on old backups to start with...

    ...then are slowly replacing them with newer backups.

    What if a customer places an order while the old one is up and the database gets splatted with the newer backup?

  3. Lord Voldemortgage

    Have some sympathy for them

    Some.

    Drives do sometimes go in batches.

    But if I was a customer I would want to be asked first before an old version of a site was brought on line - I mean there might be ordering systems with old pricing / stock figures or anything on there; in those circumstances better a holding page with an apology than a working site feeding through garbage.

  4. Steve Evans

    Tut tut...

    RAID is about availability, it is *NOT* a backup solution.

    /end lecture.

    1. Alex Rose
      WTF?

      Re: Tut tut...

      I've read the article again and I still can't see the bit where anybody claims that RAID is a backup solution.

      1. jockmcthingiemibobb
        FAIL

        Re: Tut tut...

        Restoring the data from a 3 month old backup would kinda imply they were relying on RAID as their backup solution

        1. Anonymous Coward
          Anonymous Coward

          Re: Tut tut...

          No it wouldn't, it would imply that they had migrated to a new array and the old one was still there. No-one said they restored the data from 3 months ago.

    2. Anonymous Coward
      Anonymous Coward

      Re: Tut tut...

      Reading the article is about comprehension of a story, not just reading the first line and jumping to a conclusion about what you expect to have happened.

      /end lecture.

  5. Trygve Henriksen
    FAIL

    This is CRAP!

    Any server system with a smidgeon of professionalism built into it will warn you when a drive becomes borderline.

    Having 3 fail in one RAID6 array is... mindboggling...

    Exactly how many drives do they have in each array, anyway?

    Restoring old site backups to get the VMs up faster?

    This is CRAP!

    I'm guessing that what they brought back is the LAST FULL BACKUP of the failed array, and that they're now busy restoring Differential or Incrementals from after that.*

    They should at least have the brains to keep the systems offline until they've restored everything, as it may otherwise result in lost orders and whatnot.

    (What if someone browses to a webshop on one of those sites, orders something using a CC and they then restore over the transaction details? )

    * Cheap bastards probably used incremental backups, too, instead of Differential, to save money...

    1. Anonymous Coward
      Anonymous Coward

      Re: This is CRAP!

      Rather than jumping to the "This is all crap" conclusion, consider:

      The array probably had all its drives purchased at the same time, which vastly increases the likelihood of drives failing in fairly quick succession.

      If the array is new, it's entirely possible that the array it was replacing is still kicking around, awaiting decommissioning. If it became apparent that the existing array was completely dead, it may have been a case of just zoning the old LUNs to the servers and away you go, with old data. This would also back up the first point. A recovery from tape could then take place to update the old data, and the new array could be recommissioned when everything has settled down.

      1. DJ Smiley
        Facepalm

        Re: This is CRAP!

        Or they don't understand data scrubbing and checking for data failures on the devices themselves, rather than trusting the RAID controller, which is going "Yes yes, it's all fine, don't worry about those blocks I've just moved because they failed, it's really OK I promise you!"

        1. Trygve Henriksen

          Re: This is CRAP!

          Error logs from array systems are there for a reason, which many unfortunately never bother to read.

          With a 'new' system, that should be checked DAILY.

          Automated emails from the system?

          Sure, but I wouldn't trust them. Too many systems between the originator and me.

          (sucks if the email warning of a problem with a RAID gets lost because the email storage is on the glitching array... Or, someone changes the IP of the SMTP server and the array box doesn't understand DNS. )

          1. Nigel 11

            Re: This is CRAP!

            You really need to data-scrub, and watch the SMART statistics for the drives themselves, and act proactively. If the number of reallocated blocks starts increasing, replace that drive BEFORE the array is in peril. Sometimes drives do turn into bricks just like that, but in my experience and that of Google, an increasing rate of bad block reallocations after the array is first built is a warning not to be ignored.

            If I were ever running a big-data centre, I'd insist on buying a few disk drives monthly, and from different manufacturers, so I could assemble new RAID arrays from disks no two of which were likely to be from the same manufacturing batch. A RAID-6 made out of drives with consecutive serial numbers is horribly vulnerable to all the drives containing the same faulty component that will fail within a month. I'd also want to burn in a new array for a month or longer before putting it into service. If a new drive is going to turn into a brick, it most commonly does so in its first few weeks (aka the bathtub curve).
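
            To make the "watch the reallocated blocks" idea concrete, here is a minimal, cron-able sketch, assuming smartmontools is installed and root access; the device list, state file and alert action are placeholders rather than anyone's production script:

            #!/usr/bin/env python
            # Rough sketch: compare each drive's SMART Reallocated_Sector_Ct
            # against the value recorded on the previous run and complain if
            # it has grown. Device list, state file and alerting are made up.
            import json, os, subprocess

            DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc"]
            STATE_FILE = "/var/tmp/realloc_counts.json"

            def reallocated_count(dev):
                out = subprocess.check_output(["smartctl", "-A", dev],
                                              universal_newlines=True)
                for line in out.splitlines():
                    fields = line.split()
                    if len(fields) >= 10 and fields[1] == "Reallocated_Sector_Ct":
                        return int(fields[9])      # raw value, last column
                return None

            previous = {}
            if os.path.exists(STATE_FILE):
                with open(STATE_FILE) as f:
                    previous = json.load(f)

            current = {}
            for dev in DEVICES:
                count = reallocated_count(dev)
                if count is None:
                    continue
                current[dev] = count
                if count > previous.get(dev, 0):
                    # Real script: email someone and schedule a proactive swap.
                    print("WARNING: %s reallocations rose from %s to %d"
                          % (dev, previous.get(dev, "unknown"), count))

            with open(STATE_FILE, "w") as f:
                json.dump(current, f)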

            1. Anonymous Coward
              Anonymous Coward

              Re: This is CRAP!

              "...If I were ever running a big-data centre, I'd insist on buying a few disk drives monthly, and from different manufacturers, so I could assemble new RAID arrays from disks no two of which were likely to be from the same manufacturing batch...."

              It doesn't work like that: you get the disks you get when you buy an array. The array manufacturers spend ages testing that the firmware is compatible with the existing disks, and that the disks are reliable and perform to spec with the array and the array controller. There is far more chance of a failure caused by bad firmware or incompatibilities between disks and array/controller than of a mechanical one. You will also be very hard pushed to find an array supplier who will support random disks being inserted into their array.

              The best bet is to have a healthy number of online spares for automatic rebuild, and an array that phones home for more.

      2. chops

        Re: This is CRAP!

        Maybe it was hardware common to the disks, like a backplane or cable, which failed.

        Prime seem to have had a problem with their SAN for a while - it's been blamed for some slow (to non-existent) server responses over the past couple of weeks. I'm not even sure it's a reliable design of SAN (I believe it was 'home baked', from what support staff told me <last time this happened!>).

        Not much surprises me with Prime any longer, they don't usually appear to show a great deal of care or understanding about the importance of DNS, data or, come to it, their customers.

    2. Anonymous Coward
      Anonymous Coward

      > Having 3 fail in one RAID6 array is... mindboggling...

      Not very mindboggling here: we lost a very large number of drives simultaneously when some muppet contractor managed to set off the fire suppression system whilst attempting routine "maintenance". Service induced failure is a lot more common than you might think in all sorts of areas.

      Most likely they had an old copy of the data sitting on disk storage somewhere and brought that back on line as a quick fix whilst the tape system is recovering the backups. I can't remember the last time I saw a full backup done or imagine quite how long it would take. These days all our stuff is incremental into a library, and we just restore from the library without having to worry about when individual files were backed up.

      1. Trygve Henriksen

        Re: Muppet doing maintenance...

        Nothing can protect you against 'Acts of Dead Meat', unfortunately...

        (Well, mirroring the array to another similar box in another location might... )

        Full backups are important. They really are.

        If for nothing else, they're really handy for off-site storage...

        (To protect against flooding, fire, sabotage, theft... )

        1. Anonymous Coward
          Anonymous Coward

          Re: Muppet doing maintenance...

          Replication is great, but like RAID, not a panacea.

          Also, it's highly likely that the customers don't want to pay for that level of data security; it's very expensive, more than double the cost, because you have to pay for the datalink as well as the extra servers and disks.

    3. Fatman
      FAIL

      Re: ....warn you when a drive becomes borderline.

      Perhaps it did!!!

      (Now to get my damagement bashing in; and this is just speculation, mind you.)

      Perhaps the warning signs were there, but damagement, in its quest for ever increasing profits, decided to hold off replacing the drives. Could it be that they did not want their quarterly bonuses to take a `hit`??? The spreadsheet jockeys could not find a line item for replacement drives.

      Icon that says it all.

      1. Captain Scarlet Silver badge
        Meh

        Re: ....warn you when a drive becomes borderline.

        I'm sure most "Enterprise" drives (if they used them) have 3-5 year warranties, and the majority of manufacturers will replace them if certain parameters have been reached.

  6. Dave 62
    Happy

    At least they have recent backups.

  7. theloon
    FAIL

    no sleep? Umm, go home

    The last thing anyone needs is exhausted people working on problems... Not reassuring.

  8. Anonymous Coward
    Anonymous Coward

    Batch

    Dear me, I learned in the late 1990s through practical experience that you NEVER put a RAID together with disks from the same batch. A guarantee of disaster if they start popping off in quick succession..

    1. Colin Bull 1

      Re: Batch

      It is not trivial to avoid using the same batch in a RAID unless using RAID10.

      I bet they wish they had joined this group ..

      http://www.miracleas.com/BAARF/BAARF2.html?40,51

      It might be old but it is still applicable

      1. Destroy All Monsters Silver badge

        Re: Batch

        OH SO TRUE.

        Of course, buy new hardware and the disks will have contiguous serial numbers. Order a replacement for the failed one, and the next one will fail while the new one you got will ALSO fail.

  9. BryanM
    FAIL

    Lady Bracknell

    To paraphrase Oscar Wilde...

    "To lose one disk, Mr Smith, may be regarded as a misfortune; to lose three looks like carelessness."?

    After 3 disk failures I'd be checking the RAID controllers and stuff to ensure it's not something other than a disk issue. Unless you tell me it's software RAID that is, then I'll just laugh at you.

    1. LinkOfHyrule
      Coat

      Re: Lady Bracknell

      Where's that quote from? The Ballad of RAIDing Gaol?

    2. TeeCee Gold badge

      Re: Lady Bracknell

      Nope. Disk #1 fails. A new disk is inserted and the array starts to rebuild. The act of rebuilding stresses the living shit out of the other disks, including accessing areas of them that haven't been looked at since Jesus was a lad[1] (e.g. parity stripes for O/S files on the failed disk that were written during a server installation early in the array's life and never touched since). Disks #2 and #3 turn up their toes....

      Of the three disks I have had fail in my own gear, two failed during full backup cycles and one in a RAID rebuild. It's heavy use of the entire disk that shines a glaring light on problems. This is also why anyone relying on incremental backups and thus not ensuring that the entire disk structure is kosher on a regular basis is asking for it.

      Mixing batches of disks is unlikely to help, except in the unlikely case where a particular batch has a manufacturing defect. In such cases, they'll usually start dropping like flies at commissioning time anyway. What will help is ensuring that your RAID array is populated with disks with significantly different numbers of service hours on them, but since arrays tend to be commissioned in one go with new disks, this very rarely happens.

      [1] This is why monitoring the SMART stats makes no odds. SMART only records errors when they are seen in normal operation, it does not proactively scan the entire surface looking for 'em.

  10. Anonymous Coward
    Anonymous Coward

    More fun if the card goes

    RAID error reporting is one thing, but if the RAID card itself goes there is not even a hint of the impending doom.

    Can't help thinking RAID is another one of those "Many beasts with one name" technologies that would benefit from some rigorous standards.

    Drives from one controller will often not talk to later versions of the same controller (or have I just been unlucky?)

    AC just because many people in IT seem to think they have it all covered and "unknown unknowns" could never happen to them; pointing fingers and being smart may distract us from the discipline required.

    1. DJ Smiley

      Re: More fun if the card goes

      We had a RAID controller from Dhell do this - it went to write-through mode, as it should if it encounters errors; except instead of actually writing the data through (albeit slowly) it decided any writes could be silently ignored and dropped.

      People saying H/W RAID is better than software RAID have either never dealt with dodgy RAID controllers, or are thinking of that joke of RAID that comes built into motherboards and not mdadm.

      1. Anonymous Coward
        Anonymous Coward

        Re: More fun if the card goes

        Really? In my experience, people who say that software RAID is better than hardware RAID are OS engineers, who think that they somehow automatically know about either local or SAN attached storage infrastructure.

        Software RAID, after all, still goes through disk controller chips, often the same one for multiple drives.

    2. Anonymous Coward
      Anonymous Coward

      > people in IT seem to think they have it all covered

      Mmm, some of the loud shouters come across to me as being rather inexperienced and naive. If you manage to stick around long enough in this flakey industry you see all sorts of weird stuff.

      1. Kev K
        Devil

        Re: > people in IT seem to think they have it all covered

        " If you manage to stick around long enough in this flakey industry you see all sorts of weird stuff."

        This with huge great bells on it. $hit WILL happen.

    3. Nigel 11

      Re: More fun if the card goes

      Drives from one controller will often not talk to later versions of the same controller (or have I just been unlucky?)

      No, that's one of the several reasons that these days I refuse to countenance hardware RAID controllers.

      Another is the case where the manufacturer of your RAID controller goes out of business and the only place you can get a (maybe!) compatible replacement is eBay. And then there's the time you find out the hard way that if you swap two drives by mistake, it immediately scrambles all your data beyond retrieval. And if there's a hardware RAID card that uses ECC RAM, I've yet to see it.

      Use Linux software RAID. Modern CPUs can crunch XORs on one out of four or more cores much faster than SATA drives can deliver data. And auto-assembly from shuffled drives does work! You do of course have a UPS, and you have of course tested that UPS-initiated low-battery shutdown does actually work before putting it in production.

      (Enterprise RAID systems with sixteen-up drives may be less bad, and in any case it's a bit hard to interface more than 12 drives to a regular server PC. It's little 4-8 drive hardware RAID controllers that I won't touch with a bargepole.)
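
      As a small illustration of keeping an eye on Linux software RAID (the alerting here is just a print, and nothing below comes from the article): /proc/mdstat shows each array's member status as e.g. [UUUU], with an underscore for a missing or failed disk, so a few lines of Python run from cron can flag a degraded md array:

      #!/usr/bin/env python
      # Minimal sketch: flag degraded Linux md (software RAID) arrays by
      # parsing /proc/mdstat. An "_" in the [UU_U] status string means a
      # member is missing or failed. Intended for a cron job.
      import re

      def degraded_arrays(path="/proc/mdstat"):
          bad, current = [], None
          with open(path) as f:
              for line in f:
                  m = re.match(r"^(md\d+)\s*:", line)
                  if m:
                      current = m.group(1)               # e.g. "md0"
                      continue
                  status = re.search(r"\[([U_]+)\]\s*$", line.rstrip())
                  if current and status and "_" in status.group(1):
                      bad.append((current, status.group(1)))
          return bad

      if __name__ == "__main__":
          for name, status in degraded_arrays():
              print("WARNING: %s is degraded: [%s]" % (name, status))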

      1. Anonymous Coward
        Anonymous Coward

        Re: More fun if the card goes

        @Nigel 11 - I think we're talking about significantly different systems here. To me a RAID array is something that is free-standing and has hundreds of disks. The only locally attached arrays that I've used recently are made by HP (nee Compaq, nee DEC) and are 2U racks full of 2.5" disks, with controllers that have battery-backed write cache - with error-correcting RAM.

        If your primary concern when buying an array is "will this company go bust", don't buy it. However, rest assured a proper, enterprise (or SME) class RAID controller/array is way faster and more reliable than software, and it also won't knacker your disks if you put them in in the wrong order. It certainly won't lose cached writes when there's a power failure, which software RAID will.

  11. Johnny Quest
    FAIL

    Prime Hosting has apologised to punters...

    "Prime Hosting has apologised to punters"

    Um... no, no they have not. Not a single apology.

    Their site has no information about downtime on it anywhere, their ticket support system is being completely ignored, and their phone lines (which might be back up now) were down for all of yesterday with a recorded message basically saying "We know there are issues, go away".

    Their Twitter feed is the only thing with any information on, and that is remarkably lacking.

    1. Anonymous Coward
      Anonymous Coward

      Re: Prime Hosting has apologised to punters...

      >their phone lines (which might be back up now) were down for all of yesterday with a recorded message basically saying "We know there are issues, go away"

      And you think calling them is going to make them suddenly get things back to normal?

      If there is information on Twitter then I would assume that is the current state of the problem; if you don't think so, then complain later.

      What do you want? A bit-by-bit commentary on Twitter and a fully manned telephone ops room, or maybe a dedicated line for you to constantly ask "what's going on? when will my really, really important web pages be available, don't you know there are people out there who haven't seen a picture of my pussy for more than ten minutes?", while one guy tries to get the thing back to its previous state?

      1. Johnny Quest
        Holmes

        Re: Prime Hosting has apologised to punters...

        Those are some nice logical leaps you've made there. True genius in the works.

        Actually, despite you trying to make it sound completely ridiculous, a bit-by-bit commentary isn't exactly out of the question. It's not that unheard of for there to be people employed by a company who aren't experts in data recovery and server migration. Maybe those people not involved in that side of the issue could take a few minutes to at least keep some worried customers updated?

        Regardless, what I'd expect is the bare minimum of customer support:

        1) At least one mention that there is a known issue on their website;

        2) For their single point of support to be working (their ticket system was offline all day). These shared servers that are down aren't their only hosting business.

        3) Less than 9 hours between Twitter posts on the day the majority of sites went down.

        4) To maybe ask customers whether they would like an unusably old backup in place before doing so;

        5) To maybe let customers who do have an already unusably old backup in place know that a more recent one can be provided, so there's not a need to panic (12+ hours between putting some backups in place and then sending out a Tweet).

        That is not much to ask, seeing as they have just destroyed a good number of businesses (not mine, chillax before you start worrying about my transexual cat photo enterprise).

      2. Johnny Quest
        Facepalm

        Re: Prime Hosting has apologised to punters...

        Oh, and I forgot the most important one:

        A FUCKING APOLOGY!

        Telling The Register that they're sorry isn't quite the same as telling the hundreds of customers. I think some of Prime's customers might not be regular Reg readers.

  12. Anonymous Coward
    Anonymous Coward

    disks do fail

    I added some more memory to a DELL R610 last month (one of our Hyper-V hosts) and restarted the server to find that 2 of the 4 disks that make up a RAID 10 volume had failed. S*it happens - lucky it was RAID 10, so it wasn't much of an issue.

  13. Anonymous Coward
    Anonymous Coward

    Oh lovely lovely RAID...

    One of those techs people think is the holy grail to save you having to spend lots of dosh on a duplicate or clustered system. If anyone wants to save themselves from getting into that hellhole of a situation: always have backups of backups and duplicates of duplicate systems.

    Where I'm working at the moment, we have 6 duplicate servers hosting all our websites, with all the elements and the database hosted on a clustered group of servers. Even the local hard disks on the servers are RAIDed for performance reasons on top of availability. It would take a lot to go wrong for us to go fully tits up.

    1. Anonymous Coward
      Anonymous Coward

      Re: Oh lovely lovely RAID...

      I take it all these servers are spread across different physical locations, redundant power and networks within each location, and diverse power and data into the data centres?

      1. Anonymous Coward
        Anonymous Coward

        Re: Oh lovely lovely RAID...

        Oh deffo! If you're going to halve the risk, you might as well go all the way down the chain. Even down to split redundant switches using teamed NIC cards. =D

        1. Anonymous Coward
          Anonymous Coward

          Re: Oh lovely lovely RAID...

          It's something that should be on your checklist when choosing a hosting ISP:

          I chose one that had redundant data centres and took the subject seriously.

          Not the cheapest but you get what you pay for.

  14. Anonymous Coward
    Anonymous Coward

    It's ok....

    ...the customers have their own backups as well, don't they? You know, just in case the site goes utterly tits up / goes bankrupt / gets closed down by the police / you want to move hosts.

    oh....

  15. Chris Long
    Unhappy

    Irony

    Whilst looking in vain on their website for any scrap of information as to what the fudge had happened to my sites, I enjoyed* the irony of finding this press release:

    http://www.primehosting.co.uk/news/Recruiting_Again

    How nice of them, I thought, to be blowing their own trumpets whilst quite literally in the middle of the biggest clusterfudge a hosting company could hope to experience. Surely, I wondered, the PR person pimping this press release could instead be informing customers as to when their sites might re-appear? But apparently not.

    * did not enjoy

  16. This post has been deleted by its author

  17. Wensleydale Cheese

    And for those of you using hosting ISPs

    Do you take regular backups of your sites?

    I certainly do and can restore the lot reasonably quickly. I have tested that too.

    This article does present a scenario I hadn't thought of though, namely that of the ISP restoring older backups over whatever I might have already restored.

  18. Alan Brown Silver badge

    Backups

    Restoring the last full and then overlaying incrementals is old school and likely to result in a clusterfuck at the end of the day - files which were deleted end up reappearing and directory trees which were moved around show up in both locations.

    To get around this you need a database containing a complete file list at any given point in time. Luckily at least one backup package (Bacula) does this and can use full+diff+incrementals to restore an exact image at any given backup WITHOUT needing to shag around with intermediate steps.

    As for full backups taking too long: if they do, then use synthetic full backups (existing F+D+I backups are used to create a new full backup). Once you have a database containing a full list of files at any given point in time, this is trivial. (Bacula can do this too.)
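
    For anyone who hasn't met the idea, a toy sketch of why a per-job catalogue makes point-in-time restores clean. This is only an illustration of the principle, not Bacula's actual data model: keep, for every backup job, the full list of paths that existed plus the data that job actually stored; restoring to a given job then keeps only the paths that job saw, so deleted files can't reappear.

    # Toy model of catalogue-based restore (illustration only, not Bacula's
    # real format). Each job records "seen" (every path present at backup
    # time) and "stored" (the file data that job actually captured).
    def restore_set(jobs, upto):
        """jobs: list of {'seen': set, 'stored': dict}, oldest first.
           upto: index of the full/diff/incremental job to restore to."""
        wanted = jobs[upto]["seen"]
        result = {}
        for job in jobs[:upto + 1]:            # full -> diff -> incrementals
            for path, data in job["stored"].items():
                if path in wanted:             # skip files deleted by 'upto'
                    result[path] = data
        return result

    # Tiny worked example: 'b' is deleted between the full and the
    # incremental, so it does not reappear in the restored set.
    full = {"seen": {"a", "b"}, "stored": {"a": "a-v1", "b": "b-v1"}}
    incr = {"seen": {"a"}, "stored": {"a": "a-v2"}}
    print(restore_set([full, incr], upto=1))   # {'a': 'a-v2'}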

    Having lost 40TB disk arrays due to simultaneous drive failures, I appreciate the speed in restoration.

    If things are really THAT business critical then it's entirely not silly to build a RAID array of RAID arrays (RAID 51 or 61 or 55 or 66), or use cross-site replication and put up with the wastage - but that's not an excuse to get lax about backups.

    1. Anonymous Coward
      Anonymous Coward

      Re: Backups

      "Restoring the last full and then overlaying incrementals is old school and likely to result in a clusterfuck at the end of the day - files which were deleted end up reappearing and directory trees which were moved around show up in both locations."

      Wrong. Modern backup software has move and delete detection.

      If you think that I'm even going to entertain backing up enterprise data with a product which has been around ten years, yet somehow has practically no agents, you've got another think coming. I've been working in enterprise backup and recovery/storage for about 17 years and have only once heard of anyone using Bacula in that time - and he used it at home.

  19. Andy Farley
    WTF?

    Not to step into libel territory

    But we had a double disk failure on an HP SAN, a "one in several million" chance. Unfortunately this was caused by the controller firmware, so it could have happened again at any time. I wonder if their firmware was fully patched?

    Luckily we'd built in proper redundancy (it was a pension company) and had bitwise VM backup to other-site machines so we were down for 20 minutes with very little data loss.

    Of course, they ran off the same SANs, bought at the same time, which meant squeaky bum time while the SANs were replaced and upgraded - four weeks later, as we had to wait for HP to test the patch.

    1. Gordan
      Boffin

      Re: Not to step into libel territory

      Double disk failure is not "one in several million" chance.

      Here's a trick I call "maths".

      Disks like most of the 1TB SATA ones in my arrays have an unrecoverable error rate of 10^-14 per bit read. That's an unrecoverable error approximately every 12TB of reads.

      Say you have an array of 6+1 such disks in RAID5. You have a disk failure. To reconstruct the missing disk you have to read all of the content of the 6 remaining disks, which is 6TB. That means that, probabilistically speaking, you have a whopping 50% chance of suffering an unrecoverable error during the recovery operation and losing data (whether the array will panic or attempt to reconstruct with only mildly corrupted data depends on the implementation, and I wouldn't want to have to make a guess about what might happen).

      Even if you are running such disks mirrored, the probability of an error during rebuild is ~ 8%, which is uncomfortably high. Now up that from 1TB disks to 4TB disks, and probability of failure during rebuilding the mirror goes up to 32%. If you're not worried - you should be.
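
      For anyone who wants to poke at the figures, the back-of-envelope version in a few lines of Python, using the same assumptions as above: a quoted rate of one unrecoverable error per 10^14 bits, and every surviving bit read during the rebuild. The simple expected-error count gives the percentages quoted; the strict probability of at least one error comes out a little lower, but in the same ballpark.

      # Back-of-envelope URE risk during a rebuild, assuming the quoted rate
      # of 1 unrecoverable error per 1e14 bits read and that every bit on the
      # surviving disks has to be read.
      def rebuild_risk(tb_to_read, ure_per_bit=1e-14):
          bits = tb_to_read * 1e12 * 8                  # decimal TB -> bits
          expected = bits * ure_per_bit                 # figure quoted above
          p_at_least_one = 1 - (1 - ure_per_bit) ** bits
          return expected, p_at_least_one

      for tb in (6, 1, 4):
          expected, p = rebuild_risk(tb)
          print("%dTB read: %.2f expected UREs, P(>=1) = %.0f%%"
                % (tb, expected, p * 100))
      # 6TB (6+1 x 1TB RAID5 rebuild): 0.48 expected, ~38% chance of an error
      # 1TB mirror rebuild:            0.08 expected, ~8%
      # 4TB mirror rebuild:            0.32 expected, ~27%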

      With modern disks and their expected failure rates, the probability of failure during an array rebuild is very high, and extra precautions should always be taken, both by upping the redundancy level and higher level mirroring/replication.

      1. Tim Wolfe-Barry
        Stop

        Re: Not to step into libel territory

        Thanks for this - I was desperately trying to find the references before posting; I think the 1st place I saw this was here, back in 2007: http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/

        The critical parts are summarized (below), but basically the upshot is that the more and bigger disks you have, the GREATER rather than LESSER the likelihood of a failure during rebuild...

        ========

        Data safety under RAID 5?

        . . . a key application of the exponential assumption is in estimating the time until data loss in a RAID system. This time depends on the probability of a second disk failure during reconstruction, a process which typically lasts on the order of a few hours. The . . . exponential distribution greatly underestimates the probability of a second failure . . . . the probability of seeing two drives in the cluster fail within one hour is four times larger under the real data . . . .

        Independence of drive failures in an array?

        The distribution of time between disk replacements exhibits decreasing hazard rates, that is, the expected remaining time until the next disk was replaced grows with the time it has been since the last disk replacement.

        Translation: one array drive failure means a much higher likelihood of another drive failure. The longer since the last failure, the longer to the next failure. Magic!

        1. Gordan

          Re: Not to step into libel territory

          Indeed, that article is pretty much spot on - and it dates back to 2007, when the biggest disks were 4-5x smaller than they are today. The problem has grown substantially since then.

  20. Anonymous Coward
    FAIL

    EPIC fail - if you're hosting a site with user data...

    Any sites storing user activity - such as e-commerce - will be boned. User IDs, invoice IDs - what a god-awful mess that will be.

    So, assuming those types of sites decide they have to keep running with 3-month-old data - which would be crazy for an e-commerce site - when the most recent data is restored, it'll wipe out any database changes once again - double whammy.

    These types of sites will have no choice but to go into maintenance mode until recent data is restored.

    Yes, hard drives fail and shit happens, but Prime Hosting *HAVE* to warn *ALL* their affected customers prior to restoring the recent update.

    Messy and a sysadmin's worst nightmare - my god, there must've been a lot of swearing, sweating and shaking going on when the final drive failed.

    My heart goes out to the poor guys who have to fix this mess - it's a thankless task.

  21. simon newton
    Facepalm

    erm, whut?

    When a RAID6 array busts a drive, hot spare. When a second drive heads south during the rebuild, I would immediately power down, pull each physical drive and image them, byte by byte, directly to another drive or suitable storage place (a rough sketch of that imaging step is at the end of this post). While that's going on, I replace each drive data cable in the array housing and check each drive power connector for acceptable voltage and current. I personally feel it's not a bright idea to let the array carry on rebuilding after a second disk failure so soon after the first. Uptime and availability are important to the customers, but making all of their sites go offline anyway, then become months old in an instant, tends to hit harder.

    "A bit of downtime can be sweetened away,

    not much you can do with a broken array"

    P.S. Don't use RAID3/4/5/6 in mission-critical systems, and FFS keep a good local and off-site nightly snapshot backup with weekly and monthly rotations. Disk space is pocket change in the scheme of things.
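
    A rough sketch of the imaging step mentioned above, as an illustration rather than a recipe: drive dd from Python with conv=noerror,sync so a bad read is padded with zeros instead of killing the copy (GNU ddrescue is the better tool if it's to hand). Device names and the destination are placeholders.

    # Sketch: image each suspect drive block-for-block before any further
    # rebuild attempts, then work only on the images. Placeholders throughout.
    import subprocess

    SUSPECT_DRIVES = ["/dev/sdb", "/dev/sdc", "/dev/sdd"]
    DEST_DIR = "/mnt/rescue"                 # somewhere with enough space

    for dev in SUSPECT_DRIVES:
        image = "%s/%s.img" % (DEST_DIR, dev.split("/")[-1])
        print("imaging %s -> %s" % (dev, image))
        # conv=noerror,sync: skip unreadable blocks and pad them with zeros
        # so the image stays the same size and offsets still line up.
        subprocess.check_call(["dd", "if=" + dev, "of=" + image,
                               "bs=1M", "conv=noerror,sync"])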

  22. Duffaboy
    Facepalm

    Likely Scenario

    Walking round datacenters you see failed drives blinking away in distress all the time (it's the red lights that make them stick out, you know). Yet nobody takes any notice; I do my bit to point them out on my repair visits. Some have probably been like that for weeks, as these centers are usually staffed by tape-swapping guys.

    Seriously, it's unlikely that 3 drives went all at once.

  23. Volker Hett

    rsync and cron

    I love my very small shell script which syncs my website and mail server every hour!
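
    Not that script, obviously, but a minimal sketch of the same idea for anyone who wants a starting point; the hostname and paths are made up, and it's meant to be run hourly from cron. Note that --delete makes the local copy mirror deletions too, so keep rotations or snapshots if you want history.

    #!/usr/bin/env python
    # Minimal hourly pull of website and mail directories over SSH via rsync.
    # Hostname and paths are placeholders.
    import subprocess, sys

    JOBS = [
        ("user@host.example.com:/var/www/", "/backup/www/"),
        ("user@host.example.com:/var/mail/", "/backup/mail/"),
    ]

    for src, dst in JOBS:
        # -a: preserve perms/times, -z: compress, --delete: mirror deletions
        rc = subprocess.call(["rsync", "-az", "--delete", "-e", "ssh", src, dst])
        if rc != 0:
            sys.stderr.write("rsync failed for %s (exit %d)\n" % (src, rc))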

  24. Gordan
    Boffin

    SMART + ZFS/RAIDZ[23] + WRV + LsyncD + CopyFS

    To quote a fantastic film: "It's not whether you're paranoid - it's whether you're paranoid _enough_."

    Disks are atrociously, mind-bogglingly unreliable. This is just a fact of life in 2012. Plan accordingly.

    The way I protect my data from loss involves:

    1) Monitoring SMART attributes of disks in cacti

    1.1) Actually making a point of checking this monitoring data at least daily

    2) Running short SMART checks daily

    3) Running long SMART tests weekly

    4) Running zpool scrubs weekly (items 2-4 are sketched at the end of this post)

    5) Having Write-Read-Verify enabled on all disks that support it. Sadly, very few do (mainly Seagates). I wrote a patch for hdparm to add this feature, which was rolled into the release some months ago; you may want to look into upgrading to the latest hdparm and using it if your disks support WRV.

    6) Running lsyncd on everything to monitor all files and copy them to the warm-spare server, and to the backup server after each close following a write.

    7) The backup server target location runs on CopyFS backed by ZFS with dedupe and compression enabled, so every version of a file that ever existed can be preserved (a weekly cron job prunes the most ancient, most churned-over files; dedupe and compression keep data growth relatively minimal).

    Needless to say, the backup server is not at the same site as the primary and the warm spare.

    Despite the aforementioned precautions, I have still had occurrences in the past of enough disks failing in a single array to hose the whole pool. But the warm spare and the near-real-time versioned backup server have always kept me out of serious trouble.

    Disks are cheap and unreliable. Data is expensive and irreplaceable. Act accordingly.
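
    As promised above, a rough sketch of items 2-4 as a single cron-driven script, to be run daily as root; the disk list, pool name and the day chosen for the weekly pass are placeholders:

    #!/usr/bin/env python
    # Daily: start a short SMART self-test on each disk.
    # Weekly (one chosen day): also start long self-tests and a zpool scrub.
    import datetime, subprocess

    DISKS = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]
    POOL = "tank"
    WEEKLY_DAY = 6                      # Sunday (Monday == 0)

    for dev in DISKS:
        subprocess.call(["smartctl", "-t", "short", dev])

    if datetime.date.today().weekday() == WEEKLY_DAY:
        for dev in DISKS:
            subprocess.call(["smartctl", "-t", "long", dev])
        subprocess.call(["zpool", "scrub", POOL])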

  25. HappyC
    WTF?

    So many experts

    Having read through all the comments and debates over what is and what is not good practice, what can happen and what can't, how many disks can fail at once and how many can't, and what the staff at Prime did or did not do, I have a conclusion:

    I wasn't there, I wasn't present when it happened so anything I say would simply be conjecture. For all I know it could have been that elusive second gunman from behind a grassy knoll or Elvis leaving the building that caused the failure.

    How many of us, though, can say hand on heart that we have never had an unexpected failure or a corruption that came out of the blue? I know I can't.

  26. Anonymous Coward
    Anonymous Coward

    happened before

    A similar problem happened last year - October 2011. They lost a lot of data and my sites went down for 2 days.

    I run an e-commerce site - orders are placed every day. So when Prime Hosting decided to restore an old version of my store (without me knowing) I then had a massive problem - new customers making orders on a 6-month-old database. The DB was also missing 6 months' worth of orders.

    What made it even worse is that a DB from a month ago was then restored over the top - so the new orders that were made on the old DB were lost, along with any new customers who had signed up.

    We had the PayPal details - but not the order contents. To sum it all up: a nightmare scenario, and VERY VERY embarrassing for me and my customers.

    I used to work in a support job - so I feel for the poor guys having to sort the mess out BUT

    Q - Is any of the hardware monitored?

    Q - Were there alarms generated by the system monitoring software, and does anyone react to those alarms?

    Q - If PH were aware of this nightmare scenario possibility, did they have a process to respond and react accordingly?

    I have local weekly backups that I take myself - I had assumed that failover disks / monitoring were in place, as their website says:

    "We have invested heavily in virtualising our hosting infrastructure, this ensures high availability. For example, if a node were to fail, our system automatically fails over to another node within one minute. We can achieve this because our storage is centralised, we're utilising the latest RAID 6 ISCSI SAN's for maximum performance and flexibility. The node servers are running the latest Core i7 Intel CPUs with 12GB of RAM. We also monitor individual node workload to ensure an equal balance is maintained across the cluster."

    Has this sort of problem happened anywhere else? - please reply

    PS - I am now looking for a new and better host - and willing to pay for it - recommendations welcome

  27. Anonymous Coward
    Anonymous Coward

    NullMan

    I am also gutted - many of my sites were restored to November 2011 with no recent backups. Lots of money has been lost as a result of the meltdown at Prime Hosting... Lots of time has been lost too, not to mention search engine rank drops. I have been so stressed out over the whole process.

    In some respects I also feel sorry for Prime Hosting, but it is very embarrassing telling clients I didn't know what the problem was.

    I too have had clients who have LOST current orders, newly added products, order history, invoices etc.

    All in all I've been very disappointed with Prime Hosting lately... I also have had many emails bounce back because Prime Hosting keep getting blacklisted.

  28. Anonymous Coward
    Anonymous Coward

    Prime Hosting shared servers get continually blacklisted. Very embarrassing trying to explain that to website owners I have just built a site for. Your choices are to re-route through Gmail using DNS or buy a dedicated IP address. Both options are a bit OTT - but they are the only solutions, other than moving host. And any new host may have the same problem.
