Hundreds of websites go titsup in Prime Hosting disk meltdown

Hundreds of UK-hosted websites and email accounts fell offline when a disk array failed at web biz Prime Hosting. As many as 860 customers are still waiting for a fix more than 48 hours after the storage unit went titsup. The downtime at the Manchester-based hosting reseller began at 5am on 31 July, and two days later some …

COMMENTS

This topic is closed for new posts.


  1. Alan Brown Silver badge

    Backups

    Restoring the last full and then overlaying incrementals is old school and likely to result in a clusterfuck at the end of the day - files which were deleted end up reappearing and directory trees which were moved around show up in both locations.

    To get around this you need a database containing a complete file list at any given point in time. Luckily at least one backup package (Bacula) does this and can use Full+Diff+Incrementals to restore an exact image at any given backup WITHOUT needing to shag around with intermediate steps.
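
    A toy sketch of the principle (illustrative only - not Bacula's actual catalogue schema): if each job records the complete list of paths that existed when it ran, the restore set for any point in time is just the newest copy of each path still present in that job's list, so deleted or moved files can't resurface.

    # Toy model: each backup job stores the complete list of paths that
    # existed when it ran ("present"), plus the data it actually wrote to
    # its own volume ("wrote" - everything for a Full, changed files only
    # for Diff/Incr).

    def restore_plan(jobs, target_index):
        """Return {path: volume} for the exact file set at jobs[target_index]."""
        wanted = jobs[target_index]["present"]   # authoritative point-in-time list
        plan = {}
        for job in jobs[: target_index + 1]:     # oldest -> newest
            for path, volume in job["wrote"].items():
                if path in wanted:               # deleted/moved paths are skipped
                    plan[path] = volume          # later jobs supersede earlier copies
        return plan

    # A file removed between the full and the incremental does not reappear,
    # because it is absent from the target job's "present" list.
    full = {"present": {"/a", "/b"}, "wrote": {"/a": "Vol-Full", "/b": "Vol-Full"}}
    incr = {"present": {"/a"},       "wrote": {"/a": "Vol-Incr"}}
    assert restore_plan([full, incr], 1) == {"/a": "Vol-Incr"}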

    As for full backups taking too long: if they do, then use synthetic full backups (existing Full+Diff+Incr backups are used to create a new full backup). Once you have a database containing a full list of files at any given point in time, this is trivial. (Bacula can do this too.)

    Having lost 40TB disk arrays to simultaneous drive failures, I appreciate speed of restoration.

    If things are really THAT business critical then it's entirely not silly to build a RAID array of RAID arrays (RAID 51 or 61 or 55 or 66), or use cross-site replication and put up with the wastage - but that's not an excuse to get lax about backups.

    1. Anonymous Coward
      Anonymous Coward

      Re: Backups

      "Restoring the last full and then overlaying incrementals is old school and likely to result in a clusterfuck at the end of the day - files which were deleted end up reappearing and directory trees which were moved around show up in both locations."

      Wrong. Modern backup software has move and delete detection.

      If you think that I'm even going to entertain backing up enterprise data with a product which has been around for ten years yet somehow has practically no agents, you've got another think coming. I've been working in enterprise backup and recovery/storage for about 17 years and have only once heard of anyone using Bacula in that time, and he used it at home.

  2. Andy Farley
    WTF?

    Not to step into libel territory

    But we had a double disk failure on an HP SAN, a "one in several million" chance. Unfortunately this was caused by the controller firmware, so it could have happened again at any time. I wonder if their firmware was fully patched?

    Luckily we'd built in proper redundancy (it was a pension company) and had bitwise VM backup to other-site machines so we were down for 20 minutes with very little data loss.

    Of course, they ran off the same SANs, bought at the same time, which meant squeaky bum time while the SANs were replaced and upgraded - four weeks later, as we had to wait for HP to test the patch.

    1. Gordan
      Boffin

      Re: Not to step into libel territory

      Double disk failure is not a "one in several million" chance.

      Here's a trick I call "maths".

      Disks like most of the 1TB SATA ones in my arrays have an unrecoverable error rate of 10^-14 per bit read. That's an unrecoverable error approximately every 12TB of reads.

      Say you have an array of 6+1 such disks in RAID5. You have a disk failure. To reconstruct the missing disk you have to read all of the content of the 6 remaining disks, which is 6TB. That means that, probabilistically speaking, you have a whopping 50% chance of suffering an unrecoverable error during the recovery operation and losing data (whether the array will panic or attempt to reconstruct with only mildly corrupted data depends on the implementation, and I wouldn't want to have to make a guess about what might happen).

      Even if you are running such disks mirrored, the probability of an error during rebuild is ~8%, which is uncomfortably high. Now up that from 1TB disks to 4TB disks, and the probability of failure during rebuilding the mirror goes up to 32%. If you're not worried - you should be.

      With modern disks and their expected failure rates, the probability of failure during an array rebuild is very high, and extra precautions should always be taken, both by upping the redundancy level and by adding higher-level mirroring/replication.
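
      For anyone who wants to check those figures, a quick back-of-the-envelope sketch (assuming the quoted 10^-14-per-bit error rate and independent errors):

      # Back-of-the-envelope check of the figures above, assuming a URE
      # rate of 1 error per 1e14 bits read and independent errors.

      def p_ure(bytes_read, ure_per_bit=1e-14):
          """Probability of at least one unrecoverable error while reading bytes_read."""
          return 1 - (1 - ure_per_bit) ** (bytes_read * 8)

      TB = 10 ** 12
      print(f"6+1 RAID5 rebuild, 1TB disks (read 6TB): {p_ure(6 * TB):.0%}")  # ~38%
      print(f"1TB mirror rebuild (read 1TB):           {p_ure(1 * TB):.1%}")  # ~7.7%
      print(f"4TB mirror rebuild (read 4TB):           {p_ure(4 * TB):.0%}")  # ~27%
      # The naive linear estimate (bits read x 1e-14) gives the 50% and 32%
      # quoted above; the exact figures are a little lower, but the
      # conclusion - an uncomfortably likely failure during rebuild - stands.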

      1. Tim Wolfe-Barry
        Stop

        Re: Not to step into libel territory

        Thanks for this - I was desperately trying to find the references before posting; I think the 1st place I saw this was here, back in 2007: http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/

        The critical parts are summarized (below), but basically the upshot is that the more and bigger disks you have, the GREATER rather than LESSER the likelihood of a failure during rebuild...

        ========

        Data safety under RAID 5?

        . . . a key application of the exponential assumption is in estimating the time until data loss in a RAID system. This time depends on the probability of a second disk failure during reconstruction, a process which typically lasts on the order of a few hours. The . . . exponential distribution greatly underestimates the probability of a second failure . . . . the probability of seeing two drives in the cluster fail within one hour is four times larger under the real data . . . .

        Independence of drive failures in an array?

        The distribution of time between disk replacements exhibits decreasing hazard rates, that is, the expected remaining time until the next disk was replaced grows with the time it has been since the last disk replacement.

        Translation: one array drive failure means a much higher likelihood of another drive failure. The longer since the last failure, the longer to the next failure. Magic!

        1. Gordan

          Re: Not to step into libel territory

          Indeed, that article is pretty much spot on - and it dates back to 2007, when the biggest disks were 4-5x smaller than they are today. The problem has grown substantially since then.

  3. Anonymous Coward
    FAIL

    EPIC fail - if you're hosting a site with user data...

    Any sites storing user activity - such as e-commerce - will be boned. User IDs, invoice IDs - what a god-awful mess that will be.

    So, assuming those types of sites decide they have to keep running with 3-month-old data - which would be crazy for an e-commerce site - when the most recent data is restored, it'll wipe out any database changes once again - double whammy.

    These types of sites will have no choice but to go into maintenance mode until recent data is restored.

    Yes, hard drives fail, shit happens, but Prime Hosting *HAVE* to warn *ALL* their affected customers prior to restoring the more recent data.

    Messy and a sysadmin's worst nightmare - my god, there must've been a lot of swearing, sweating and shaking going on when the final drive failed.

    My heart goes out to the poor guys who have to fix this mess - it's a thankless task.

  4. simon newton
    Facepalm

    erm, whut?

    When a RAID6 array busts a drive, the hot spare kicks in. When a second drive heads south during the rebuild, I would immediately power down, pull each physical drive and image them, byte by byte, directly to another drive or suitable storage place. While that's going on, I'd replace each drive data cable in the array housing and check each drive power connector for acceptable voltage and current. I personally feel it's not a bright idea to let the array carry on rebuilding after a second disk failure in quick succession after the first. Uptime and availability are important to the customers, but taking all of their sites offline anyway, and then making them months old in an instant, tends to hit harder.
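
    For illustration, the shape of that byte-by-byte imaging step looks something like the sketch below - a crude ddrescue-alike with placeholder device/output paths; in practice you'd reach for dd or GNU ddrescue rather than roll your own.

    # Crude sketch of byte-by-byte imaging of a pulled drive, zero-filling
    # unreadable regions instead of dying.  Device and output paths are
    # placeholders; for real recovery work use dd conv=noerror,sync or,
    # better, GNU ddrescue.
    import os

    def image_disk(device="/dev/sdX", output="/mnt/safe/sdX.img", chunk=1 << 20):
        bad = 0
        offset = 0
        src = os.open(device, os.O_RDONLY)
        try:
            with open(output, "wb") as dst:
                while True:
                    try:
                        data = os.pread(src, chunk, offset)
                    except OSError:
                        data = b"\x00" * chunk   # unreadable: zero-fill and move on
                        bad += 1                 # (a real tool would bisect the bad region)
                    if not data:
                        break                    # end of device
                    dst.write(data)
                    offset += len(data)
        finally:
            os.close(src)
        print(f"imaged {offset} bytes, {bad} unreadable chunk(s)")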

    "A bit of downtime can be sweetened away,

    not much you can do with a broken array"

    P.S. Don't use RAID 3/4/5/6 in mission-critical systems, and FFS keep good local and offsite nightly snapshot backups with weekly and monthly rotations. Disk space is pocket change in the scheme of things.

  5. Duffaboy
    Facepalm

    Likely Scenario

    Walking round datacenters you see failed drives blinking away in distress all the time (it's the red lights that make them stick out, you know). Yet nobody takes any notice; I do my bit to point them out on my repair visits. Some have probably been like that for weeks, as these centers are usually staffed by tape-swapping guys.

    Seriously, it's unlikely that 3 drives went all at once.

  6. Volker Hett

    rsync and cron

    I love my very small shell script which syncs my website and mail server every hour!
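
    Presumably something along these lines, dropped into an hourly cron entry (the host and paths below are placeholders, not the poster's actual script):

    # Hourly off-site sync in the same spirit - e.g. run from cron as
    #   0 * * * * /usr/bin/python3 /usr/local/bin/hourly_sync.py
    import subprocess

    JOBS = [
        ("/var/www/",   "backup@offsite.example.com:/backups/www/"),
        ("/var/vmail/", "backup@offsite.example.com:/backups/vmail/"),
    ]

    for src, dst in JOBS:
        # -a preserve attributes, -z compress in transit, --delete mirror deletions
        subprocess.run(["rsync", "-az", "--delete", src, dst], check=True)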

  7. Gordan
    Boffin

    SMART + ZFS/RAIDZ[23] + WRV + LsyncD + CopyFS

    To quote a fantastic film: "It's not whether you're paranoid - it's whether you're paranoid _enough_."

    Disks are atrociously, mind-bogglingly unreliable. This is just a fact of life in 2012. Plan accordingly.

    The way I protect my data from loss involves:

    1) Monitoring SMART attributes of disks in cacti

    1.1) Actually making a point of checking this monitoring data at least daily

    2) Running short SMART checks daily

    3) Running long SMART tests weekly

    4) Running zpool scrubs weekly (a scheduling sketch for points 2-4 follows this list)

    5) Having Write-Read-Verify enabled on all disks that support it. Sadly, very few do (mainly Seagates). I wrote a patch for hdparm to add this feature, which was rolled into the release some months ago; you may want to look into upgrading to the latest hdparm and using it if your disks support WRV.

    6) Running lsyncd on everything to monitor all files and copy them to the warm-spare server, and to the backup server after each close following a write.

    7) The backup server target location runs on CopyFS backed by ZFS with dedupe and compression enabled, so every version of a file that ever existed can be preserved (a weekly cron job prunes the most ancient, most churned-over files; dedupe and compression keep data growth relatively minimal).
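
    As mentioned in point 4, here is a sketch of one way to drive points 2-4 from a single daily cron job (device names and the pool name are placeholders, and unlike the setup above it doesn't feed SMART attributes into cacti):

    # Daily cron driver for the scheduled SMART self-tests and zpool scrub.
    import datetime
    import subprocess

    DISKS = ["/dev/sda", "/dev/sdb", "/dev/sdc"]
    POOL = "tank"

    def run(cmd):
        subprocess.run(cmd, check=True)

    today = datetime.date.today()
    for disk in DISKS:
        if today.weekday() == 6:                  # Sundays: weekly long self-test
            run(["smartctl", "-t", "long", disk])
        else:                                     # other days: daily short self-test
            run(["smartctl", "-t", "short", disk])
    if today.weekday() == 6:
        run(["zpool", "scrub", POOL])             # weekly scrub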

    Needless to say, the backup server is not at the same site as the primary and the warm spare.

    Despite the precautions mentioned above, I have still had occasions in the past where enough disks failed in a single array to hose the whole pool. But the warm spare and the near-real-time versioned backup server have always kept me out of serious trouble.

    Disks are cheap and unreliable. Data is expensive and irreplaceable. Act accordingly.

  8. HappyC
    WTF?

    So many experts

    Having read through all the comments and debates over what is and what is not good practice, what can happen and what can't, how many disks can fail at once and how many can't, and what the staff at Prime did or did not do, I have a conclusion:

    I wasn't there, I wasn't present when it happened so anything I say would simply be conjecture. For all I know it could have been that elusive second gunman from behind a grassy knoll or Elvis leaving the building that caused the failure.

    How many of us, though, can say hand on heart that we have never had an unexpected failure or a corruption that came out of the blue? I know I can't.

  9. Anonymous Coward
    Anonymous Coward

    happened before

    A similar problem happened last year, in October 2011. They lost a lot of data and my sites went down for 2 days.

    I run an e-commerce site - orders are placed every day. So when Prime Hosting decided to restore an old version of my store (without me knowing) I then had a massive problem - new customers making orders on a 6-month-old database. The DB was also missing 6 months' worth of orders.

    What made it even worse is that the DB from a month ago was then restored over the top, so the new orders that were made on the old DB were lost - along with any new customers who had signed up.

    We had the PayPal details but not the order contents. To sum it all up: a nightmare scenario, and VERY VERY embarrassing for me and my customers.

    I used to work in a support job - so I feel for the poor guys having to sort the mess out BUT

    Q - Is any of the hardware monitored ?

    Q - Were there alarms generated by the system monitoring software, and does anyone react to those alarms?

    Q - If PH were aware of this nightmare scenario possibility - did they have a process to respond and react accordingly ?

    I have local weekly backups that I take myself - I had assumed that failover disks / monitoring were in place, as their website says:

    "We have invested heavily in virtualising our hosting infrastructure, this ensures high availability. For example, if a node were to fail, our system automatically fails over to another node within one minute. We can achieve this because our storage is centralised, we're utilising the latest RAID 6 ISCSI SAN's for maximum performance and flexibility. The node servers are running the latest Core i7 Intel CPUs with 12GB of RAM. We also monitor individual node workload to ensure an equal balance is maintained across the cluster."

    Has this sort of problem happened anywhere else? Please reply.

    PS - I am now looking for a new and better host - and willing to pay for it - recommendations welcome

  10. Anonymous Coward
    Anonymous Coward

    NullMan

    I am also gutted; many of my sites were restored to November 2011 with no recent backups. Lots of money has been lost as a result of the meltdown at Prime Hosting... Lots of time has been lost, not to mention search engine rank drops. I have been so stressed out over the whole process.

    In some respects I also feel sorry for Prime Hosting, but it is very embarrassing telling clients I didn't know what the problem was.

    I too have had clients who have LOST current orders, newly added products, order history, invoices etc.

    All in all I've been very disappointed with Prime Hosting lately... I also have had many emails bounce back because Prime Hosting keep getting blacklisted.

  11. Anonymous Coward
    Anonymous Coward

    Prime Hosting shared servers get continually blacklisted. Very embarrassing trying to explain that to website owners I have just built a site for. Your choices are to re-route through Gmail using DNS or buy a dedicated IP address. Both options are a bit OTT - but they are the only solution, other than moving host. And any new host may have the same problem.


