Hundreds of websites go titsup in Prime Hosting disk meltdown

Hundreds of UK-hosted websites and email accounts fell offline when a disk array failed at web biz Prime Hosting. As many as 860 customers are still waiting for a fix more than 48 hours after the storage unit went titsup. The downtime at the Manchester-based hosting reseller began at 5am on 31 July, and two days later some …

COMMENTS

This topic is closed for new posts.

Thursday 2nd August 2012 11:57 GMT Mako

"[P]romised that it has had a team working solidly for 36 hours without sleep in order to minimise the impact."

Knowing how goofy and error-prone I become after about 24 hours without sleep, that doesn't exactly fill me with feelings of confidence.

And it also gives me the impression that this is yet another company that thinks working people like pit-ponies is not only acceptable, it's laudable.

78 0
1. Thursday 2nd August 2012 12:05 GMT Rameses Niblick the Third (KKWWMT)
  
  Definitely, definitely this. One upvote is just not enough
  
  3 0
2. Thursday 2nd August 2012 18:10 GMT LarsG
  
  This pretty much tells you to keep a local backup and not rely on a third party to keep your data.
  
  2 0
Thursday 2nd August 2012 11:57 GMT Wize

They stick on old backups to start with...

...then are slowly replacing them with newer backups.

What if a customer places an order while the old one is up and the database gets splatted with the newer backup?

3 0
Thursday 2nd August 2012 11:58 GMT Lord Voldemortgage

Have some sympathy for them

Some.

Drives do sometimes go in batches.

But if I was a customer I would want to be asked first before an old version of a site was brought on line - I mean there might be ordering systems with old pricing / stock figures or anything on there; in those circumstances better a holding page with an apology than a working site feeding through garbage.

8 0
Thursday 2nd August 2012 12:09 GMT Steve Evans

Tut tut...

RAID is about availability, it is *NOT* a backup solution.

/end lecture.

7 7
1. Thursday 2nd August 2012 12:11 GMT Alex Rose
  
  Re: Tut tut...
  
  I've read the article again and I still can't see the bit where anybody claims that RAID is a backup solution.
  
  6 0
  1. Thursday 2nd August 2012 20:47 GMT jockmcthingiemibobb
    
    Re: Tut tut...
    
    Restoring the data from a 3 month old backup would kinda imply they were relying on RAID as their backup solution
    
    2 1
    1. Friday 3rd August 2012 12:08 GMT Anonymous Coward
      
      Re: Tut tut...
      
      No it wouldn't, it would imply that they had migrated to a new array and the old one was still there. No-one said they restored the data from 3 months ago.
      
      0 1
2. Thursday 2nd August 2012 12:27 GMT Anonymous Coward
  
  Re: Tut tut...
  
  Reading the article is about comprehension of a story, not just reading the first line and jumping to a conclusion about what you expect to have happened.
  
  /end lecture.
  
  1 0
Thursday 2nd August 2012 12:11 GMT Trygve Henriksen

This is CRAP!

Any server system with a smidgeon of professionality built into it will warn you when a drive becomes borderline.

Having 3 fail in one RAID6 array is... mindboggling...

Exactly how many drives do they have in each array, anyway?

Restoring old site backups to get the VMs up faster?

This is CRAP!

I'm guessing that what they brought back is the LAST FULL BACKUP of the failed array, and that they're now busy restoring Differential or Incrementals from after that.*

They should at least have the brains to keep the systems offline until they've restored everything, as it may otherwise result in lost orders and whatnot.

(What if someone browses to a webshop on one of those sites, orders something using a CC and they then restore over the transaction details? )

* Cheap bastards probably used incremental backups, too, instead of Differential, to save money...

6 5
1. Thursday 2nd August 2012 12:33 GMT Anonymous Coward
  
  Re: This is CRAP!
  
  Rather than jumping to the "This is all crap" conclusion, consider:
  
  The array probably had all its drives purchased at the same time, this vastly increases the likelihood of drives failing in fairly quick succession.
  
  If the array is new, it's entirely possible that the array that it was replacing is still kicking around, awaiting decommissioning. If it became apparent that the existing array is completely dead, it may have been a case of just zoning the old LUNs to the servers and away you go, with old data. This would also back up the first point. A recovery from tape could then take place to update the old data and the new array could be recommissioned when everything has settled down.
  
  4 0
  1. Thursday 2nd August 2012 12:45 GMT DJ Smiley
    
    Re: This is CRAP!
    
    Or they don't understand data scrubbing and checking for data failures on the devices themselves, rather than trusting the RAID controller, which is going "Yes yes it's all fine, don't worry about those blocks I've just moved because they failed, it's really OK I promise you!"
    
    1 0
    1. Thursday 2nd August 2012 13:00 GMT Trygve Henriksen
      
      Re: This is CRAP!
      
      Error logs from array systems are there for a reason, which many unfortunately never bother to read.
      
      With a 'new' system, that should be checked DAILY.
      
      Automated emails from the system?
      
      Sure, but I wouldn't trust them. Too many systems between the originator and me.
      
      (sucks if the email warning of a problem with a RAID gets lost because the email storage is on the glitching array... Or, someone changes the IP of the SMTP server and the array box doesn't understand DNS. )
      
      0 0
      1. Thursday 2nd August 2012 16:33 GMT Nigel 11
        
        Re: This is CRAP!
        
        You really need to data-scrub, and watch the SMART statistics for the drives themselves, and act pro-actively. If the number of reallocated blocks starts increasing, replace that drive BEFORE the array is in peril. Sometimes drives do turn into bricks just like that, but in my experience and that of Google, an increasing rate of bad block reallocations after the array is first built is a warning not to be ignored.
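A minimal sketch of the "watch the reallocated block count" advice above, assuming you already collect one reading per drive per day by some means. The drive names, the sample data and the three-sample window are made up for illustration; this is not any particular monitoring tool's logic.

```python
from typing import Sequence


def reallocation_is_growing(samples: Sequence[int], window: int = 3) -> bool:
    """Return True if the reallocated-sector count rose within the last `window` samples.

    `samples` is a hypothetical series of daily readings for one drive, oldest first.
    """
    recent = list(samples[-window:])
    if len(recent) < 2:
        return False  # not enough history to see a trend
    return recent[-1] > recent[0]


def drives_to_replace(history: dict[str, Sequence[int]], window: int = 3) -> list[str]:
    """List the drives whose reallocated-sector count is still climbing."""
    return sorted(
        name for name, samples in history.items()
        if reallocation_is_growing(samples, window)
    )


if __name__ == "__main__":
    # Made-up readings: sdb keeps reallocating, sda and sdc are stable.
    history = {
        "sda": [0, 0, 0, 0],
        "sdb": [2, 2, 5, 9],
        "sdc": [4, 4, 4, 4],
    }
    print(drives_to_replace(history))
```

A drive with a non-zero but flat count (sdc here) is not flagged; the point made above is the increasing rate, not the absolute number.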
        
        If I were ever running a big-data centre, I'd insist on buying a few disk drives monthly, and from different manufacturers, so I could assemble new RAID arrays from disks no two of which were likely to be from the same manufacturing batch. A RAID-6 made out of drives with consecutive serial numbers is horribly vulnerable to all the drives containing the same faulty component that will fail within a month. I'd also want to burn in a new array for a month or longer before putting it into service. If a new drive is going to turn into a brick, it most commonly does so in its first few weeks (aka the bathtub curve).
        
        5 0
        
        Thursday 2nd August 2012 19:03 GMT Anonymous Coward
        
        Re: This is CRAP!
        
        "...If I were ever running a big-data centre, I'd insist on buying a few disk drives monthly, and from different manufacturers, so I could assemble new RAID arrays from disks no two of which were likely to be from the same manufacturing batch...."
        
        It doesn't work like that, you get the disks you get when you buy an array. The array manufacturers spend ages testing that the firmware is compatible with the existing disks, that the disks are reliable and perform to spec, with the array and the array controller. There is far more chance of a failure caused by bad firmware or incompatibilities between disks and array/controller than by mechanical failure. You will also be very hard pushed to find an array supplier who will support random disks being inserted into their array.
        
        The best bet, is to have a healthy amount of online spares, for automatic rebuild and an array that phones home for more.
        
        0 0
  2. Friday 3rd August 2012 13:57 GMT chops
    
    Re: This is CRAP!
    
    maybe it was hardware common to the disks, like a backplane or cable which failed.
    
    Prime seem to have had a problem with their SAN for a while - it's been blamed for some slow (to non existent) server responses over the past couple of weeks. I'm not even sure it's a reliable design of SAN (I believe it was 'home baked' from what support staff told me <last time this happened!>).
    
    Not much surprises me with Prime any longer, they don't usually appear to show a great deal of care or understanding about the importance of DNS, data or, come to it, their customers.
    
    0 0
2. Thursday 2nd August 2012 12:44 GMT Anonymous Coward
  
  > Having 3 fail in one RAID6 array is... mindboggling...
  
  Not very mindboggling here: we lost a very large number of drives simultaneously when some muppet contractor managed to set off the fire suppression system whilst attempting routine "maintenance". Service induced failure is a lot more common than you might think in all sorts of areas.
  
  Most likely they had an old copy of the data sitting on disk storage somewhere and brought that back on line as a quick fix whilst the tape system is recovering the backups. I can't remember the last time I saw a full backup done or imagine quite how long it would take. These days all our stuff is incremental into a library, and we just restore from the library without having to worry about when individual files were backed up.
  
  1 0
  1. Thursday 2nd August 2012 13:10 GMT Trygve Henriksen
    
    Re: Mupped doing maintenance...
    
    Nothing can protect you against 'Acts of Dead Meat', unfortunately...
    
    (Well, mirroring the array to another similar box in another location might... )
    
    Full backups are important. They really are.
    
    If for nothing else, they're really handy for off-site storage...
    
    (To protect against flooding, fire, sabotage, theft... )
    
    0 0
    1. Thursday 2nd August 2012 15:56 GMT Anonymous Coward
      
      Re: Mupped doing maintenance...
      
      Replication is great, but like RAID, not a panacea.
      
      Also, it's highly likely that the customers don't want to pay for that level of data security, it's very expensive, more than double the costs because you have to pay for the datalink as well as the extra servers and disk.
      
      0 0
3. Thursday 2nd August 2012 21:49 GMT Fatman
  
  Re: ....warn you when a drive becomes borderline.
  
  Perhaps it did!!!
  
  (Now to get my damagement bashing in; and this is just speculation, mind you.)
  
  Perhaps the warning signs were there, but damagement, in its quest for ever increasing profits, decided to hold off replacing the drives. Could it be that they did not want their quarterly bonuses to take a `hit`??? The spreadsheet jockeys could not find a line item for replacement drives.
  
  Icon that says it all.
  
  0 1
  1. Friday 3rd August 2012 12:42 GMT Captain Scarlet
    
    Re: ....warn you when a drive becomes borderline.
    
    I'm sure most "Enterprise" drives (if they used them) have 3-5 year warranties, and the majority of manufacturers will replace them if certain parameters have been reached.
    
    0 0
Thursday 2nd August 2012 12:14 GMT Dave 62

At least they have recent backups.

0 0
Thursday 2nd August 2012 12:30 GMT theloon

no sleep? Umm, go home

last thing anyone needs are exhausted people working on problems..... Not reassuring..

3 0
Thursday 2nd August 2012 12:34 GMT Anonymous Coward

Batch

Dear me, I learned in the late 1990s through practical experience that you NEVER put a RAID together with disks from the same batch. A guarantee of disaster if they start popping off in quick succession..

0 0
1. Thursday 2nd August 2012 12:44 GMT Colin Bull 1
  
  Re: Batch
  
  It is not trivial to avoid using the same batch in a RAID unless using RAID10.
  
  I bet they wish they had joined this group ..
  
  http://www.miracleas.com/BAARF/BAARF2.html?40,51
  
  It might be old but it is still applicable
  
  1 0
  1. Thursday 2nd August 2012 13:10 GMT Destroy All Monsters
    
    Re: Batch
    
    OH SO TRUE.
    
    Of course, buy new hardware, the disks will have contiguous serial numbers. Order a replacement for the failed one, the next one will fail while the new one you got will ALSO fail.
    
    1 0
Thursday 2nd August 2012 12:42 GMT BryanM

Lady Bracknell

To paraphrase Oscar Wilde...

"To lose one disk, Mr Smith, may be regarded as a misfortune; to lose three looks like carelessness."?

After 3 disk failures I'd be checking the RAID controllers and stuff to ensure it's not something other than a disk issue. Unless you tell me it's software RAID that is, then I'll just laugh at you.

6 1
1. Thursday 2nd August 2012 15:28 GMT LinkOfHyrule
  
  Re: Lady Bracknell
  
  Where's that quote from? The Ballad of RAIDing Gaol?
  
  1 0
2. Friday 3rd August 2012 07:05 GMT TeeCee
  
  Re: Lady Bracknell
  
  Nope. Disk #1 fails. A new disk is inserted and the array starts to rebuild. The act of rebuilding stresses the living shit out of the other disks, including accessing areas of them that haven't been looked at since Jesus was a lad[1] (e.g. parity stripes for O/S files on the failed disk that were written during a server installation early in the array's life and never touched since). Disks #2 and #3 turn their toes....
  
  Of the three disks I have had fail in my own gear, two failed during full backup cycles and one in a RAID rebuild. It's heavy use of the entire disk that shines a glaring light on problems. This is also why anyone relying on incremental backups and thus not ensuring that the entire disk structure is kosher on a regular basis is asking for it.
  
  Mixing batches of disks is unlikely to help, except in the unlikely case where a particular batch has a manufacturing defect. In such cases, they'll usually start dropping like flies at commissioning time anyway. What will help is ensuring that your RAID array is populated with disks with significantly different numbers of service hours on them, but since arrays tend to be commissioned in one go with new disks, this very rarely happens.
  
  [1] This is why monitoring the SMART stats makes no odds. SMART only records errors when they are seen in normal operation, it does not proactively scan the entire surface looking for 'em.
  
  0 0
Thursday 2nd August 2012 12:44 GMT Anonymous Coward

More fun if the card goes

Raid error reporting is one thing but if the raid card itself goes there is not even a hint of the impending doom.

Can't help thinking RAID is another one of those "Many beasts with one name" technologies that would benefit from some rigorous standards.

Drives from one controller will often not talk to later versions of the same controller (or have I just been unlucky)

AC just because many people in IT seem to think they have it all covered and "unknown unknowns" could never happen to them, pointing fingers and being smart may distract us from the discipline required.

3 0
1. Thursday 2nd August 2012 12:47 GMT DJ Smiley
  
  Re: More fun if the card goes
  
  We had a raid controller from Dhell do this - it went to write through mode as it should if it encounters errors; except instead of actually writing the data through (albeit slowly) it decided any writes could be silently ignored and dropped.
  
  People saying H/W raid is better than software raid have either never dealt with dodgy raid controllers, or are thinking of that joke of raid that comes built into motherboards and not mdadm.
  
  3 2
  1. Thursday 2nd August 2012 13:39 GMT Anonymous Coward
    
    Re: More fun if the card goes
    
    Really? In my experience, people who say that software RAID is better than hardware RAID are OS engineers, who think that they somehow automatically know about either local or SAN attached storage infrastructure.
    
    Software RAID, after all, still goes through disk controller chips, often the same one for multiple drives.
    
    0 0
2. Thursday 2nd August 2012 12:49 GMT Anonymous Coward
  
  > people in IT seem to think they have it all covered
  
  Mmm, some of the loud shouters come across to me as being rather inexperienced and naive. If you manage to stick around long enough in this flakey industry you see all sorts of weird stuff.
  
  3 0
  1. Saturday 4th August 2012 04:43 GMT Kev K
    
    Re: > people in IT seem to think they have it all covered
    
    " If you manage to stick around long enough in this flakey industry you see all sorts of weird stuff."
    
    This with huge great bells on it. $hit WILL happen.
    
    0 0
3. Thursday 2nd August 2012 17:02 GMT Nigel 11
  
  Re: More fun if the card goes
  
  Drives from one controller will often not talk to later versions of the same controller (or have I just been unlucky)
  
  No, that's one of the several reasons that these days I refuse to countenance hardware RAID controllers.
  
  Another is the case where the manufacturer of your RAID controller goes out of business and the only place you can get a (maybe!) compatible replacement is E-bay. And then there's the time you find out the hard way that if you swap two drives by mistake, it immediately scrambles all your data beyond retrieval. And if there's a hardware RAID card that uses ECC RAM, I've yet to see it.
  
  Use Linux software RAID. Modern CPUs can crunch XORs on one out of four or more cores much faster than SATA drives can deliver data. And auto-assembly from shuffled drives does work! You do of course have a UPS, and you have of course tested that UPS-initiated low-battery shutdown does actually work before putting it in production.
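For anyone wondering what "crunching XORs" amounts to, here is a toy sketch of single-parity (RAID 5 style) reconstruction. It only illustrates the arithmetic; it is not how Linux md is implemented, and the function names are invented for the example.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must be the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover the single missing data block from the survivors plus parity."""
    return xor_blocks(surviving + [parity])


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]        # three data "disks"
    parity = xor_blocks(data)                  # the parity "disk"
    recovered = rebuild_missing([data[0], data[2]], parity)
    print(recovered == data[1])
```

Note that the rebuild has to read every surviving block, which is why a rebuild hammers the remaining disks.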
  
  (Enterprise RAID systems with sixteen-up drives may be less bad, and in any case it's a bit hard to interface more than 12 drives to a regular server PC. It's little 4-8 drive hardware RAID controllers that I won't touch with a bargepole.)
  
  2 1
  1. Thursday 2nd August 2012 19:12 GMT Anonymous Coward
    
    Re: More fun if the card goes
    
    @Nigel 11 - I think we're talking about significantly different systems here. To me a RAID array is something that is free standing and has hundreds of disks. The only locally attached arrays that I've used recently are made by HP (nee Compaq, nee DEC) and are 2u racks full of 2.4" disks, with controllers that have battery backed write cache - with error correcting RAM.
    
    If your primary concern when buying an array is "will this company go bust", don't buy it. However, rest assured a proper, enterprise (or SME) class RAID controller/Array is way faster and more reliable than software, it also won't knacker your disks if you put them in, in the wrong order. It certainly won't lose writes if cached, when there's a power failure, which software RAID will.
    
    0 0
Thursday 2nd August 2012 13:12 GMT Johnny Quest

Prime Hosting has apologised to punters...

"Prime Hosting has apologised to punters"

Um... no, no they have not. Not a single apology.

Their site has no information about downtime on it anywhere, their ticket support system is being completely ignored, their phone lines (which might be back up now) were down for all of yesterday with a recorded message basically saying "We know there are issues, go away".

Their Twitter feed is the only thing with any information on, and that is remarkably lacking.

0 1
1. Thursday 2nd August 2012 18:17 GMT Anonymous Coward
  
  Re: Prime Hosting has apologised to punters...
  
  >their phone lines (which might be back up now) were down for all of yesterday with a recorded message basically saying "We know there are issues, go away"
  
  And you think calling them is going to make them suddenly get things back to normal?
  
  If there is information on twitter then I would assume that is the current state of the problem, if you don't think so then complain later..
  
  What do you want? A bit by bit commentary on twitter and a fully manned telephone ops room or maybe a dedicated line for you to constantly ask "what's going on? when will my really, really important web pages be available, don't you know there are people out there who haven't seen a picture of my pussy for more than ten minutes?", while one guy tries to get the thing back to its previous state?
  
  0 2
  1. Friday 3rd August 2012 08:22 GMT Johnny Quest
    
    Re: Prime Hosting has apologised to punters...
    
    Those are some nice logical leaps you've made there. True genius in the works.
    
    Actually, despite you trying to make it sound completely ridiculous, a bit by bit commentary isn't exactly out of the question. It's not that unheard of for there to be people employed by a company that aren't experts in data recovery and server migration. Maybe those people not involved in that side of the issue could take a few minutes to at least keep some worried customers updated?
    
    Regardless, what I'd expect is the bare minimum of customer support:
    
    1) At least one mention that there is a known issue on their website;
    
    2) For their single point of support to be working (their ticket system was offline all day ). These shared servers that are down aren't their only hosting business.
    
    3) Less than 9 hours between Twitter posts on the day the majority of sites went down.
    
    4) To maybe ask customers whether they would like an unusably old backup in place before doing so;
    
    5) To maybe let customers who do have an already unusably old backup in place that they can provide a more recent one, so there's not a need to panic (12+ hours between putting some backups in place and then sending out a Tweet).
    
    That is not much to ask, seeing as they have just destroyed a good number of businesses (not mine, chillax before you start worrying about my transexual cat photo enterprise).
    
    1 0
  2. Friday 3rd August 2012 08:25 GMT Johnny Quest
    
    Re: Prime Hosting has apologised to punters...
    
    Oh, and I forgot the most important one:
    
    A FUCKING APOLOGY!
    
    Telling The Register that they're sorry isn't quite the same as telling the hundreds of customers. I think some of Prime's customers might not be regular Reg readers.
    
    0 0
Thursday 2nd August 2012 13:12 GMT Anonymous Coward

disk do fail

I added some more memory to a DELL R610 last month (one of our Hyper-v hosts) and restarted the server to find that 2 of the 4 disks that make up a RAID 10 volume had failed. S*it happens; luckily it was RAID 10 so it wasn't much of an issue.

0 0
Thursday 2nd August 2012 13:14 GMT Anonymous Coward

Oh lovely lovely RAID...

One of those techs people think is the holy grail to save you having to spend lots of dosh on a duplicate or clustered system. If anyone wants to save themselves from getting into the hell hole of a situation; always have backup on backups and duplicate on duplicate systems.

Where I'm working at the moment; we have 6 duplicate servers hosting all our websites and all the elements and database hosted on a clustered group of servers. Even the local hard-disks on the servers are RAIDed for performance reasons on top of availability. It would take a lot to go wrong for us to go fully tits up.

2 0
1. Thursday 2nd August 2012 13:32 GMT Anonymous Coward
  
  Re: Oh lovely lovely RAID...
  
  I take it all these servers are spread across different physical locations, redundant power and networks within each location, and diverse power and data into the data centres?
  
  0 0
  1. Thursday 2nd August 2012 14:18 GMT Anonymous Coward
    
    Re: Oh lovely lovely RAID...
    
    Oh deffo! If you're going to halve the risk, you might as well go all the way down the chain. Even down to split redundant switches using teamed NIC cards. =D
    
    0 0
    1. Thursday 2nd August 2012 15:11 GMT Anonymous Coward
      
      Re: Oh lovely lovely RAID...
      
      It's something that should be on your checklist when choosing a hosting ISP:
      
      I chose one that had redundant data centres and took the subject seriously.
      
      Not the cheapest but you get what you pay for.
      
      0 0
Thursday 2nd August 2012 13:24 GMT Anonymous Coward

It's ok....

...the customers have their own back ups as well don't they? You know just in case the site goes utterly tits up / bankrupt / get closed down by the police / you want to move hosts.

oh....

1 0
Thursday 2nd August 2012 14:08 GMT Chris Long

Irony

Whilst looking in vain on their website for any scrap of information as to what the fudge had happened to my sites, I enjoyed* the irony of finding this press release:

http://www.primehosting.co.uk/news/Recruiting_Again

How nice of them, I thought, to be blowing their own trumpets whilst quite literally in the middle of the biggest clusterfudge a hosting company could hope to experience. Surely, I wondered, the PR person pimping this press release could instead be informing customers as to when their sites might re-appear? But apparently not.

* did not enjoy

0 0
This post has been deleted by its author
Thursday 2nd August 2012 15:08 GMT Wensleydale Cheese

And for those of you using hosting ISPs

Do you take regular backups of your sites?

I certainly do and can restore the lot reasonably quickly. I have tested that too.

This article does present a scenario I hadn't thought of though, namely that of the ISP restoring older backups over whatever I might have already restored.

0 0
Thursday 2nd August 2012 15:16 GMT Alan Brown

Backups

Restoring the last full and then overlaying incrementals is old school and likely to result in a clusterfuck at the end of the day - files which were deleted end up reappearing and directory trees which were moved around show up in both locations.

To get around this you need a database containing a complete file list at any given point in time. Luckily at least one backup package (Bacula) does this and can use Full+Diff+incrementals to restore an exact image at any given backup WITHOUT needing to shag around with intermediate steps.

As for full backups taking too long: If they do, then use synthetic full backups (Existing F+D+I backups are used to create a new full backup.). Once you have a database containing a full list of files at any given point in time this is trivial. (Bacula can do this too)
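A toy sketch of why the file list matters, assuming each backup is just a mapping of path to content and the catalogue is the set of paths that existed at the target time. All the names here are hypothetical and this is not a description of how Bacula (or any other package) works internally.

```python
def restore(backups: list[dict[str, str]], catalogue: set[str]) -> dict[str, str]:
    """Overlay a full backup and later incrementals, then prune to the catalogue.

    backups:   oldest first; each maps path -> content saved in that run.
    catalogue: every path that existed at the point in time being restored.
    """
    tree: dict[str, str] = {}
    for backup in backups:
        tree.update(backup)  # newer copies replace older ones
    # Without this pruning step, deleted or moved files would reappear.
    return {path: data for path, data in tree.items() if path in catalogue}


if __name__ == "__main__":
    full = {"a.txt": "v1", "old/b.txt": "v1"}
    incremental = {"a.txt": "v2", "new/b.txt": "v1"}  # b.txt was moved
    catalogue = {"a.txt", "new/b.txt"}
    print(restore([full, incremental], catalogue))
```

With the same catalogue, a synthetic full is just this merged and pruned result written out as a new backup.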

Having lost 40Tb disk arrays due to simultaneous drive failures I appreciate the speed in restoration.

If things are really THAT business critical then it's entirely not silly to build a RAID array of RAID arrays (RAID 51 or 61 or 55 or 66), or use cross-site replication and put up with the wastage - but that's not an excuse to get lax about backups.

1 1
1. Friday 3rd August 2012 12:22 GMT Anonymous Coward
  
  Re: Backups
  
  "Restoring the last full and then overlaying incrementals is old school and likely to result in a clusterfuck at the end of the day - files which were deleted end up reappearing and directory trees which were moved around show up in both locations."
  
  Wrong. Modern backup software has move and delete detection.
  
  If you think that I'm even going to entertain backing up enterprise data with a product which has been round ten years, yet somehow has practically no agents, you've got another think coming. I've been working in enterprise backup and recovery/storage for about 17 years and have only once heard of anyone using bacula in that time and he used it at home.
  
  0 0
Thursday 2nd August 2012 15:40 GMT Andy Farley

Not to step into libel territory

But we had a double disk failure on a HP SAN, a "one in several million" chance. Unfortunately this was caused by the controller firmware so could have happened again at any time. I wonder if their firmware was fully patched?

Luckily we'd built in proper redundancy (it was a pension company) and had bitwise VM backup to other-site machines so we were down for 20 minutes with very little data loss.

Of course, they ran off the same SANs, bought at the same time, which meant squeaky bum time while the SANs were replaced and upgraded - four weeks later as we had to wait for HP to test the patch.

0 0
1. Friday 3rd August 2012 02:03 GMT Gordan
  
  Re: Not to step into libel territory
  
  Double disk failure is not "one in several million" chance.
  
  Here's a trick I call "maths".
  
  Disks like most of the 1TB SATA ones in my arrays have an unrecoverable error rate of 10^-14. That's an unrecoverable error approximately every 12TB of reads.
  
  Say you have an array of 6+1 such disks in RAID5. You have a disk failure. To reconstruct the missing disk you have to read all of the content of the 6 remaining disks, which is 6TB. That means that probabilistically speaking, you have a whopping 50% chance of suffering an unrecoverable error during the recovery operation and losing data (whether the array will panic or attempt to reconstruct with only mildly corrupted data depends on the implementation and I wouldn't want to have to make a guess about what might happen).
  
  Even if you are running such disks mirrored, the probability of an error during rebuild is ~ 8%, which is uncomfortably high. Now up that from 1TB disks to 4TB disks, and probability of failure during rebuilding the mirror goes up to 32%. If you're not worried - you should be.
  
  With modern disks and their expected failure rates, the probability of failure during an array rebuild is very high, and extra precautions should always be taken, both by upping the redundancy level and higher level mirroring/replication.
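The arithmetic above can be checked with a few lines, assuming the quoted rate is 10^-14 per bit read, decimal terabytes, and independent errors (a simplification; real failures cluster). The function names are made up for the sketch.

```python
import math

URE_PER_BIT = 1e-14        # unrecoverable read error rate quoted above
BITS_PER_TB = 8 * 10**12   # decimal terabytes


def expected_errors(tb_read: float) -> float:
    """Expected number of unrecoverable errors when reading `tb_read` terabytes."""
    return tb_read * BITS_PER_TB * URE_PER_BIT


def p_at_least_one_error(tb_read: float) -> float:
    """Chance of one or more errors, treating bit errors as independent (Poisson)."""
    return -math.expm1(-expected_errors(tb_read))


if __name__ == "__main__":
    for label, tb in [("6+1 RAID5 rebuild", 6), ("1TB mirror", 1), ("4TB mirror", 4)]:
        print(f"{label}: expected errors {expected_errors(tb):.2f}, "
              f"P(at least one) {p_at_least_one_error(tb):.1%}")
```

The expected error counts (0.48, 0.08, 0.32) are where the 50%, 8% and 32% figures come from. Under the independence assumption the chance of at least one error is a little lower, roughly 38%, 8% and 27%, which does not change the conclusion.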
  
  0 0
  1. Friday 3rd August 2012 09:00 GMT Tim Wolfe-Barry
    
    Re: Not to step into libel territory
    
    Thanks for this - I was desperately trying to find the references before posting; I think the 1st place I saw this was here, back in 2007: http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/
    
    The critical parts are summarized (below), but basically the upshot is that the more and bigger disks you have, the GREATER rather than LESSER the likelihood of a failure during rebuild...
    
    ========
    
    Data safety under RAID 5?
    
    . . . a key application of the exponential assumption is in estimating the time until data loss in a RAID system. This time depends on the probability of a second disk failure during reconstruction, a process which typically lasts on the order of a few hours. The . . . exponential distribution greatly underestimates the probability of a second failure . . . . the probability of seeing two drives in the cluster fail within one hour is four times larger under the real data . . . .
    
    Independence of drive failures in an array?
    
    The distribution of time between disk replacements exhibits decreasing hazard rates, that is, the expected remaining time until the next disk was replaced grows with the time it has been since the last disk replacement.
    
    Translation: one array drive failure means a much higher likelihood of another drive failure. The longer since the last failure, the longer to the next failure. Magic!
    
    1 0
    1. Friday 3rd August 2012 17:08 GMT Gordan
      
      Re: Not to step into libel territory
      
      Indeed, that article is pretty much spot on - and it dates back to 2007, when the biggest disks were 4-5x smaller than they are today. The problem has grown substantially since then.
      
      0 0
Thursday 2nd August 2012 18:09 GMT Anonymous Coward

EPIC fail - if you're hosting a site with user data...

Any sites storing user activity - such as e-commerce - will be boned. User Id's, invoice Id's - what a god awful mess that will be.

So, assuming those types of sites decide they have to keep running with 3 month old data - which would be crazy for an ecommerce site - when the most recent data is restored, it'll wipe out any database changes once again - double whammy.

These types of sites will have no choice but to go into maintenance mode until recent data is restored.

Yes, hard drives fail, shit happens, but Prime Hosting *HAVE* to warn *ALL* their affected customers prior to restoring the recent update.

Messy and a sysadmin's worst nightmare - my god, there must've been a lot of swearing, sweating and shaking going on when the final drive failed.

My heart goes out to the poor guys who have to fix this mess - it's a thankless task.

1 0
Thursday 2nd August 2012 20:08 GMT simon newton

erm, whut?

When a RAID6 array busts a drive: hot spare. When a second drive heads south during the rebuild, I would immediately power down, pull each physical drive and image them, byte by byte, directly to another drive or suitable storage place. While that's going on, I replace each drive data cable in the array housing and check each drive power connector for acceptable voltage and current. I personally feel it's not a bright idea to let the array carry on rebuilding after a second disk failure in quick succession to the first. Uptime and availability are important to the customers, but making all of their sites go offline anyway, then get months old in an instant, tends to hit harder.

"A bit of downtime can be sweetened away,

not much you can do with a broken array"

P.S. Don't use RAID3/4/5/6 in mission critical systems, and FFS keep good local and offsite nightly snapshot backups with weekly and monthly rotations. Disk space is pocket change in the scheme of things.
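
The nightly, weekly and monthly rotation could be sketched as a retention rule like this. The retention counts (7 nightlies, 4 weeklies, 12 monthlies) and the choice of Sunday and the 1st of the month as rotation days are assumptions for illustration, not anything specified above.

```python
from datetime import date, timedelta

def snapshots_to_keep(snapshot_dates, today, nightly=7, weekly=4, monthly=12):
    """Return the set of snapshot dates to retain under a simple rotation.

    Assumed policy (hypothetical): keep the last `nightly` days, the last
    `weekly` Sunday snapshots, and the last `monthly` first-of-month snapshots.
    """
    dates = sorted((d for d in snapshot_dates if d <= today), reverse=True)
    keep = set()
    # Nightly tier: anything taken within the last `nightly` days.
    keep.update(d for d in dates if (today - d).days < nightly)
    # Weekly tier: most recent Sunday snapshots (weekday() == 6 is Sunday).
    keep.update([d for d in dates if d.weekday() == 6][:weekly])
    # Monthly tier: most recent first-of-month snapshots.
    keep.update([d for d in dates if d.day == 1][:monthly])
    return keep

if __name__ == "__main__":
    today = date(2012, 8, 2)
    all_snaps = [today - timedelta(days=n) for n in range(120)]
    kept = snapshots_to_keep(all_snaps, today)
    print(f"keeping {len(kept)} of {len(all_snaps)} nightly snapshots")
```

Anything not in the returned set is a candidate for pruning, so storage grows with the number of tiers rather than with the number of nights.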

2 1
Thursday 2nd August 2012 21:35 GMT Duffaboy

Likely Scenario

Walking round datacenters you see failed drives blinking away in distress all the time (it's the red lights that make them stick out, you know). Yet nobody takes any notice; I do my bit to point them out on my repair visits. Some have probably been like that for weeks, as these centers are usually staffed just by tape-swapping guys.

Seriously, it's unlikely that 3 drives went all at once.

3 0
Thursday 2nd August 2012 23:47 GMT Volker Hett

rsync and cron

I love my very small shell script which syncs my website and mail server every hour!
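
A rough equivalent of that kind of hourly job, written in Python rather than shell purely for illustration. The source paths, the `backup.example.net` host and the script location are hypothetical; `-a`, `-z` and `--delete` are standard rsync options.

```python
import subprocess
import sys

# Hypothetical source -> destination pairs; adjust to your own layout.
SYNC_JOBS = [
    ("/var/www/", "backup@backup.example.net:/srv/mirror/www/"),
    ("/var/mail/", "backup@backup.example.net:/srv/mirror/mail/"),
]

def rsync_command(source, destination):
    """Build an rsync invocation that mirrors source onto destination."""
    # -a preserves permissions and times, -z compresses in transit,
    # --delete removes files on the mirror that were deleted at the source.
    return ["rsync", "-az", "--delete", source, destination]

def run_all(jobs=SYNC_JOBS, dry_run=True):
    """Run each sync job; return a list of (command, exit code) pairs.

    With dry_run=True nothing is executed and every exit code is 0.
    """
    results = []
    for source, destination in jobs:
        cmd = rsync_command(source, destination)
        code = 0 if dry_run else subprocess.run(cmd).returncode
        results.append((cmd, code))
    return results

if __name__ == "__main__":
    # Example crontab entry to run this at the top of every hour:
    #   0 * * * * /usr/bin/python3 /usr/local/bin/hourly_sync.py --run
    results = run_all(dry_run="--run" not in sys.argv)
    for cmd, code in results:
        print(code, " ".join(cmd))
    if any(code != 0 for _, code in results):
        sys.exit(1)
```

Note that `--delete` mirrors deletions too, so a copy like this is a warm mirror rather than a versioned backup: a file wiped at the source is wiped on the copy within the hour.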

0 0
Friday 3rd August 2012 01:48 GMT Gordan

SMART + ZFS/RAIDZ[23] + WRV + LsyncD + CopyFS

To quote a fantastic film: "It's not whether you're paranoid - it's whether you're paranoid _enough_."

Disks are atrociously, mind-bogglingly unreliable. This is just a fact of life in 2012. Plan accordingly.

The way I protect my data from loss involves:

1) Monitoring SMART attributes of disks in cacti

1.1) Actually making a point of checking this monitoring data at least daily

2) Running short SMART checks daily

3) Running long SMART tests weekly

4) Running zpool scrubs weekly

5) Having Write-Read-Verify enabled on all disks that support it. Sadly, very few do (mainly Seagates). I wrote a patch for hdparm to add this feature, which was rolled into the release some months ago, so you may want to look into upgrading to the latest hdparm and using it if your disks support WRV.

6) Running lsyncd on everything to monitor all files and copy them to the warm-spare server, and to the backup server after each close following a write.

7) The backup server target location runs on CopyFS backed by ZFS with dedupe and compression enabled, so every version of a file that ever existed can be preserved (a weekly cron job prunes the most ancient, most churned-over files; dedupe and compression keep data growth relatively minimal).

Needless to say, the backup server is not at the same site as the primary and the warm spare.
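
The daily and weekly checks in items 2 to 4 could be driven by something like the following sketch. The device names, the pool name `tank` and the choice of Sunday for the weekly jobs are assumptions; `smartctl -t short`, `smartctl -t long` and `zpool scrub` are the standard invocations.

```python
from datetime import date

DISKS = ["/dev/sda", "/dev/sdb"]   # hypothetical device names
POOL = "tank"                      # hypothetical ZFS pool name
WEEKLY_DAY = 6                     # Sunday, per date.weekday()

def checks_due(today, disks=DISKS, pool=POOL):
    """Return the list of commands (as argv lists) to run on `today`."""
    weekly = today.weekday() == WEEKLY_DAY
    commands = []
    for disk in disks:
        # Long self-test on the weekly day, short self-test on all other days.
        test = "long" if weekly else "short"
        commands.append(["smartctl", "-t", test, disk])
    if weekly:
        # The pool scrub is only scheduled on the weekly day.
        commands.append(["zpool", "scrub", pool])
    return commands

if __name__ == "__main__":
    # Print today's commands rather than running them.
    for cmd in checks_due(date.today()):
        print(" ".join(cmd))
```

This only decides what to run; acting on the results (item 1.1) still needs someone to look at them.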

Despite the precautions mentioned above, I have still had occurrences in the past of enough disks failing in a single array to hose the whole pool. But the warm spare and the near-real-time versioned backup server have always kept me out of serious trouble.

Disks are cheap and unreliable. Data is expensive and irreplaceable. Act accordingly.

1 0
Monday 6th August 2012 08:12 GMT HappyC

So many experts

Having read through all the comments and debates over what is and what is not good practice, what can happen and what can't, how many disks can fail at once and how many can't, and what the staff at Prime did or did not do, I have a conclusion:-

I wasn't there, I wasn't present when it happened so anything I say would simply be conjecture. For all I know it could have been that elusive second gunman from behind a grassy knoll or Elvis leaving the building that caused the failure.

How many of us, though, can say hand on heart that we have never had an unexpected failure or a corruption that comes out of the blue? I know I can't.

0 0
Wednesday 8th August 2012 12:27 GMT Anonymous Coward

happened before

A similar problem happened last year - October 2011. They lost a lot of data and my sites went down for 2 days.

I run an ecommerce site - orders are placed every day. So when PrimeHosting decided to restore an old version of my store (without me knowing) I then had a massive problem - new customers making orders on a 6 month old database. The DB was also missing 6 months' worth of orders.

What makes it even worse is that the DB from a month ago was then restored over the top - so the new orders that were made on the old DB were lost, along with any new customers who had signed up.

We had the PayPal details - but not the order contents - to sum it all up - a nightmare scenario and VERY VERY embarrassing for me and my customers.
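
One way to catch this kind of silent overwrite is to diff the live data against the backup before restoring over it. A minimal sketch, assuming orders can be exported as dictionaries keyed by order ID; the field names and sample orders are made up for illustration.

```python
def orders_at_risk(live_orders, backup_orders):
    """Return (lost, changed): IDs of live orders that restoring
    `backup_orders` over the top would destroy (absent from the backup)
    or silently replace with different content (same ID, different data)."""
    lost = sorted(oid for oid in live_orders if oid not in backup_orders)
    changed = sorted(
        oid for oid in live_orders
        if oid in backup_orders and live_orders[oid] != backup_orders[oid]
    )
    return lost, changed

if __name__ == "__main__":
    # Live site: an old copy of the store that has since taken new orders.
    live = {
        101: {"customer": "a", "total": 20.0},
        102: {"customer": "b", "total": 35.5},
        103: {"customer": "new-1", "total": 12.0},
        104: {"customer": "new-2", "total": 8.0},
    }
    # Newer backup about to be restored; ID 103 there is a different order.
    backup = {
        101: {"customer": "a", "total": 20.0},
        102: {"customer": "b", "total": 35.5},
        103: {"customer": "c", "total": 99.0},
    }
    lost, changed = orders_at_risk(live, backup)
    print("would be lost:", lost)
    print("would be replaced by different data:", changed)
```

If either list is non-empty, those rows need exporting and reconciling by hand before the restore goes ahead.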

I used to work in a support job - so I feel for the poor guys having to sort the mess out BUT

Q - Is any of the hardware monitored ?

Q - Were there alarms generated by the system monitoring software - does any one react to those alarms ?

Q - If PH were aware of this nightmare scenario possibility - did they have a process to respond and react accordingly ?

I have local weekly backups that I take myself - I had assumed that failover disks / monitoring were in place, as their website says:

"We have invested heavily in virtualising our hosting infrastructure, this ensures high availability. For example, if a node were to fail, our system automatically fails over to another node within one minute. We can achieve this because our storage is centralised, we're utilising the latest RAID 6 ISCSI SAN's for maximum performance and flexibility. The node servers are running the latest Core i7 Intel CPUs with 12GB of RAM. We also monitor individual node workload to ensure an equal balance is maintained across the cluster."

Has this sort of problem happened anywhere else ? - please reply

ps - I am now looking for a new and better host - and willing to pay for it - recommendations welcome

0 0
Thursday 9th August 2012 08:26 GMT Anonymous Coward

NullMan

I am also gutted, many of my sites were restored to November 2011 with no recent backups. Lots of money has been lost as a result of the melt-down of Prime Hosting... Lots of time has been lost and not to mention search engine rank drops. I have been so stressed out over the whole process.

In some respects I also feel sorry for Prime Hosting, but it is very embarrassing telling clients I didn't know what the problem was.

I too have had clients who have LOST current orders, products added and lost order history, invoices etc.

All in all I've been very disappointed with Prime Hosting lately... I also have had many emails bounce back because Prime Hosting keep getting blacklisted.

0 0
Wednesday 15th August 2012 15:34 GMT Anonymous Coward

Prime Hosting shared servers get continually blacklisted. Very embarrassing trying to explain that to website owners I have just built a site for. Your choices are to re-route through Gmail using DNS or buy a dedicated IP address. Both options are a bit OTT - but they are the only solutions, other than moving host. And any new host may have the same problem.

0 0

This topic is closed for new posts.

The Register Biting the hand that feeds IT

Situation Publishing

Copyright. All rights reserved © 1998–2024