Typo on the last line ...
"on a fixed term contract with King's until the middle of next year" playing solitaire.
More than a month after the catastrophic incident that brought King's College London's entire IT system down, the head of infrastructure services, Russell Frostick, is being replaced. The change was announced by the university's CIO, Nick Leake, in an internal communication seen by The Register, although it is not clear …
"TTR for 250TB "
If you haven't factored in how many tape drives you need running simultaneously in the event of a disaster, then you haven't got your DR plan sorted properly.
($HINT: you aren't going to be doing restorations from a single LTO6, or even 3-4 of them and it's worth making that point when PHB asks why the library needs so many drives when backups only tie up a couple of them.)
"the problem is TTR for 250TB of data is a tad longer than even a half baked DR..."
But it shouldn't have been that much longer. The backup / restore system should be designed so it meets the required RTO (Recovery Time Objective) - whatever that might have been for an array failure.
For instance, an LTO-7 drive can handle about 300MB of uncompressed data a second, so let's say 500MB/second compressed. That's roughly 1GB every 2 seconds, or about 30 minutes per TB. So with a two-drive LTO-7 library you should be able to restore about 4TB an hour, or a bit under 3 days to restore 250TB from scratch. And that's not allowing for the de-duplication / disk caching that most modern backup systems would also use...
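That arithmetic is easy to sanity-check. A minimal sketch, assuming the commenter's illustrative figures (500MB/s per drive with compression, decimal terabytes as tape vendors quote them) rather than any real vendor spec:

```python
# Rough restore-time estimator for a tape library (illustrative figures only).
# The throughput and compression numbers are assumptions from the comment above,
# not vendor specifications.

def restore_hours(data_tb, drives, mb_per_sec_per_drive=500):
    """Hours to stream data_tb terabytes through `drives` parallel tape drives."""
    total_mb = data_tb * 1_000_000           # 1 TB = 1,000,000 MB (decimal, as tape vendors quote)
    seconds = total_mb / (mb_per_sec_per_drive * drives)
    return seconds / 3600

# The scenario above: 250 TB through a 2-drive LTO-7 library at ~500 MB/s per drive.
hours = restore_hours(250, drives=2)
print(f"{hours:.0f} hours (~{hours / 24:.1f} days)")   # → 69 hours (~2.9 days)
```

Which is where "a bit under 3 days" comes from, and also shows why doubling the drive count halves the RTO.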
...two more to go.
It was very good of them to give the names of those responsible - the CIO Nick Leake and the next down the line, Gareth Wright. Both those named are responsible for the problem. The CIO should have seen that the infrastructure was adequate for the task, and Gareth Wright should have made sure that the mirror servers' fail-over actually worked.
Now the question is when do they get their P45s?
Now suppose that both of them have spent the last umpteen years campaigning for better infrastructure, highlighting that the college is desperately vulnerable to failures, but have been ignored by management up the line who would rather spend the cash on executive bonuses. Would you still want to give them P45s? It may be that the choices they faced were all appalling, and they had to pick what appeared to be the least risky option. How do you know you would have done any better? Until you know the whole story you know nothing, which is the trouble with social media witch-hunts. Plus, of course, the other thing to consider is whether, with the money and the job circumstances on offer, they'll actually get anyone who could do a better job.
"Now suppose that both of them have spent the last umpteen years campaigning for better infrastructure,"
The fact is they haven't. Can't comment on Gareth, but Nick is way out of touch with reality and how things work.
KCL has a healthy IT budget, that gets squandered on one disastrous outsourcing deal after the other. Guess who signs them all...
Now the question is when do they get their P45s?
They're working on the basis that the human sacrifice of Frostick (by the sound of it only a contractor anyway) will placate the gods and let them off the hook. He gets paid to the end of his fixed-term contract, and thus nobody is really punished for a crime that is utterly unforgivable.
I'm with you. The buck moves up the line, besmirching all along that line, and both Wright and Leake should be forced to walk the plank. But that requires some governance of KCL, and I'm not sure that institutions like academia or NHS Trusts have any proper governance or transparency.
Great: you are on the end of a phone, trying to start your career, fixing something seriously fubar-ed, and you've probably never even seen the place or met the people you're supposed to help. If that isn't going to put you off a life in IT, I don't know what is.
I certainly have, but I've never lost anybody's data but my own.
This outage is not a technical issue, it is a major organizational failure. It is not the underlings who should pay the price, the board itself should be hauled over the coals and explanations should be given as to how the board allowed the situation to become so fragile.
"This outage is not a technical issue, it is a major organizational failure. "
On a par with the vice-provost of a certain other London university making pronouncements about the direction of IT there, to the complete surprise of the people actually in charge of the stuff.
They were even more surprised when told they had better bloody well make it happen like the VP wanted, despite it not being the best path for the university.
"Some of us make sure there are backups and we know that RAID6 is required for SATA disks."
Nothing to do with SATA and everything to do with size. I've had TB-scale raid6 arrays of UW-scsi disks go titsup during a rebuild cycle and been very glad of the nightly backups.
FWIW, even RAID6 is not good enough once you get past 10TB or so. Whilst there's no "Official" term for 3 drive redundancy (other than RAID-Z3 for ZFS), people are referring to it as RAID7
And of course none of your RAID counts for the pimple on a baboon's arse if someone goes "rm -rf" in the wrong location - which is a more important reason why backups matter than the risk of hardware failure, and why "backups" are _NOT_ extra storage arrays attached to your main data store.
Hint: If your data isn't in 2 distinct locations then it's not backed up, and a replicated server, or another array on the same box is not "2 locations" for this purpose.
Then there're the issues of:
Ensuring that what you back up can actually be restored (no brainer).
That you're backing up what you're supposed to be backing up ("ooops, you never asked for that dataset to be added to the backups", or "You want a restoration. Of a dataset that you refused to sign off on being backed up because that cost too much. I suggest you try a seance")
That you keep them around long enough ("What do you mean you want a file restored from 3 years ago? We only keep them 12 months!" - this happens regularly despite telling people the time limits from the outset)
AND that what's being backed up isn't random garbage thanks to some memory error scribbling all over the storage 6 months ago ("Well, we restored it, but it's random gibberish. Looks to have been like that on the original disks for the last 2 years" - real world example, 1999)
So yeah, RAID and backups are important, but so is testing that everything works/is correct.
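The "test that restores actually work" point can be sketched in a few lines: back something up, restore it, and prove byte-for-byte integrity with a checksum. This is a minimal illustration - the local file copy stands in for a real backup tool, and the paths are invented for the example:

```python
# Minimal restore-verification sketch. A backup isn't tested until a restore has
# been performed and compared against the source; here shutil.copy stands in for
# the backup + restore round-trip of a real tool.
import hashlib
import os
import shutil
import tempfile

def sha256(path):
    """Checksum a file in chunks so large files don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "data.bin")
    restored = os.path.join(d, "restored.bin")
    with open(original, "wb") as f:
        f.write(os.urandom(1_000_000))       # 1 MB of stand-in "production" data
    shutil.copy(original, restored)          # stand-in for backup + restore
    assert sha256(original) == sha256(restored), "restore does not match source!"
    print("restore verified")
```

It also quietly covers the 1999 horror story above: if the source data was already gibberish when it was backed up, a checksum of the restore only proves you faithfully preserved gibberish - which is why verification has to start from known-good data.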
I wouldn't be at all surprised to find that KCL's ISD have been fighting for a decent budget for years. The attitude from academics is that they know it all and everything can be done with desktop-PC-class hardware. A couple of £2k bills for restoring fried HDDs that weren't backed up, or £500k bills for rerunning data analyses they didn't think were important, don't even seem to drive the lesson home.
(Finding out that the data "we don't need to backup because we can re-download it" is either no longer available or will take 3 months to stream in doesn't seem to sink in either.)
The one that still hurts to think about was a server failure at a satellite site. A motherboard/controller problem took the server out; I managed to get about *half* the system back with a replacement box, but the rest of the data was trashed, no matter what games I played with disk arrays and alternate systems. Because I had the main server up again (and the users partially operational), the centralised "Enterprise" class backup system refused to restore the data to any other system, and it took over a week to get the rest of it back down the wire.
"Nothing to do with SATA and everything to do with size."
No, it's far more to do with SATA than the size - the RAID set size is only a linear relationship to risk of failure. The hard Bit Error Rates (BER) on SATA disks are far higher than on Fibre Channel, SAS, or SSD disks. Therefore the chances of a hard error during a RAID set rebuild are much much higher.
"even RAID6 is not good enough once you get past 10TB"
Nope - again wrong - RAID 6 will keep you reasonably safe well past that. For instance, RAID 6 would be fine for most uses with 4TB disks in a 14+2 set, which gives a per-RAID-set size of 64TB including parity... and a hard failure probability over ten years of 0.03304% or better.
MTBF: manufacturer spec, enterprise (1.2M hours)
Non-recoverable error rate (SATA): 1 in 10^15 bits
Drive capacity: 4TB
Sector size: 4KB
Quantity of disks: 16
Volume rebuild speed: 15MB/s
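A back-of-envelope version of the arithmetic behind those parameters - the raw URE exposure when rebuilding one failed drive in a 16-disk set. This is a simplified model (independent bit errors at the vendor-quoted rate), not the full MTBF calculation that produced the 0.03304% figure:

```python
# Back-of-envelope URE exposure during a RAID rebuild, using the parameters
# quoted above. Assumes independent bit errors at the vendor-quoted rate,
# which is a deliberate simplification.
import math

URE_RATE = 1e-15    # 1 unrecoverable read error per 10^15 bits (enterprise SATA spec)
DRIVE_TB = 4        # 4 TB drives
SURVIVORS = 15      # disks that must be read to rebuild one failed drive in a 16-disk set

bits_read = SURVIVORS * DRIVE_TB * 1e12 * 8      # total bits streamed during the rebuild
p_ure = 1 - math.exp(-URE_RATE * bits_read)      # P(at least one URE during the rebuild)
print(f"{p_ure:.1%} chance of hitting a URE while rebuilding")
```

That prints roughly a 38% chance of at least one URE during the rebuild read. The reason RAID 6 survives this is that a lone URE during a single-drive rebuild is covered by the second parity; under RAID 5 the same event means a dead array, which is exactly why RAID 5 at these capacities is indefensible.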
"The hard Bit Error Rates (BER) on SATA disks are far higher than on Fibre Channel, SAS, or SSD disks."
You clearly don't operate large (as in hundreds of TB large) data sites.
1: Those figures are highly optimistic at best, and most large SAS drives quote the same 1 in 10^14 that consumer drives do. (Real world experience says it's about that figure on all drives, even the ones claiming 1 in 10^15.)
2: Something else will invariably bite you on the arse (MSA1000 array controllers, as a for instance)
3: This has NOTHING whatsoever to do with BER, which is about silent ECC failures that the raid controller will detect, flag and correct (but only ZFS will write the fix back to the drive and verify it's ok)
Once you get into the 6+TB array range there's around a 2-5% chance of _catastrophic_ failure(*) during the rebuild cycle when a drive dies.
(*) Catastrophic: As in - "2 more drives going tits up"
Factoring in the rebuild time of a 14+2 large raidset plus the factor that makers invariably supply them from the same batch (and usually sequential serial numbers), once you get into the 6-10TB range that chance is significantly higher.
In the case of the aforementioned UW-scsi arrays, they were 14+2 arrays of about 1TB apiece on MSA1000 controllers handling about 5TB apiece. In the 7 years operating these arrays we had 2 total loss events during rebuilds and those both happened in the first 3 years of operation. (We also had ~35 drive losses caused by the controller losing contact with the drive, but the drive subsequently testing OK - apparently a known issue. Gee thanks HP)
"You clearly don't operate large (as in hundreds of TB large) data sites."
No, I work on a site with 1000s of TB (PBs)...
"Those figures are highly optimistic at best"
The fact is that overall SATA drives are FAR less reliable than most other drive types. Your personal experience does not reflect numerous studies that back this up.
"Something else will invariably bite you on the arse (MSA1000 array controllers, as a for instance)"
I use more things like EMC VMAX and HP 3PAR that are fault tolerant throughout. We would not use such a crap single point of failure device.
"This has NOTHING whatsoever to do with BER"
It has everything to do with the actual BER. For instance "We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors." https://static.googleusercontent.com/media/research.google.com/en//archive/disk_failures.pdf
"Once you get into the 6+TB array range there's around a 2-5% chance of _catastrophic_ failure(*) during the rebuild cycle when a drive dies"
Utter dross - completely invented numbers. I have hundreds of disks running in much larger RAID 6 sets with zero hard failures ever, and plenty of replaced disks over time... the chances of losing another 2 disks during rebuild are pretty small - I gave an actual calculation above...
"In the 7 years operating these arrays we had 2 total loss events"
These are crappy low end and not fully fault tolerant arrays. Not an enterprise grade solution.
"We also had ~35 drive losses caused by the controller losing contact with the drive"
Quite. Your issues are not due to normal drive failures under RAID 6.
"The hard Bit Error Rates (BER) on SATA disks are far higher than on Fibre Channel, SAS, or SSD disks."
Why should that be the case?
Excepting SSDs, this sounds like the manufacturers "inject" some problems to drive customers to the "reliably pricey" solution.
The old adage applies - don't have a backup solution, have a recovery solution.
And, er, you know, test it.
I've inherited a platform in my new job, and let's just say that I've rebuilt the backup infrastructure from the ground up for that exact reason... now we have backups that recover, imagine such a thing!
Anon because you know damned well why.
I'd like an off-site backup too, to the true Clouds, the Heavenly Ones - where your data are safe from failures, cockups, hackers and Satan himself. I asked priests, but beyond something about an expensive subscription, I couldn't get any more details about the transfer technology, especially about a restore before Doomsday.
Seems like somebody went the big-and-expensive route, and got nothing but trouble for it.
Moral of the story - sometimes the most expensive backup system is not worth it, and you'll be better off with a cheaper system - which can be duplicated/replicated easily since you have more funds available to purchase backup units with.
Should one cheap unit fail, you'll be assured of data integrity on the other extra unit(s).
Not so with a top-of-the-range wallet-ripping backup unit, where you can only afford one, and when it dies, you have a battle with inn-sewer-ants to get them to actually pay out so that you can get another one...
"Seems like somebody went the big-and-expensive route, and got nothing but trouble for it."
Those routes use (at least) raid6 on arrays. If they'd gone down those routes we would be reading stories about stupidly expensive systems, not broken ones.
This is a clusterfuck on multiple levels.
Whilst mucho blame can be dropped on the lap of management, whoever signed off on using RAID5 needs to go too. Part of systems management is the ability to stand up to idiots trying to do stuff which will hurt the organisation (vs themselves - if it's only them, then let them fuck themselves over if they really insist).
$HINT: If PHB insists on using raid5 ("Because we bought a system with 12TB of disk in it and I want my 12TB goddamit"), then make sure that you explain why this is a really bad idea - in writing - and get an explicit instruction to do it anyway - in writing. This usually acts as a "persuader" to not be stupid and a "get your arse out of jail free" card when it invariably breaks.
The "shortsightedness" of management policy is merely shown up even more by the lack of working backups and DR plans and this is the part where heads need to roll at the top.
Look to see the culprits looking for new jobs shortly, albeit with glowing references (it's cheaper to do this than to spend ages in court fighting them, and is why such references should always be cross-checked).
King's has a job advert for a Business Continuity Manager in their Strategy, Planning & Assurance Directorate, for a person who has a qualification in business continuity or emergency management.
Duties include analysis of business critical functions; revision of incident response plans; and running simulation exercises to test their effectiveness.
Salary is £49,772 to £57,674 plus £2,623 London Weighting Allowance per annum.
But it is a temporary / fixed term position - Obviously not a long term solution....
Joke alert - The university takes the safety of its people and the protection of its business very seriously and the Business Continuity Manager will play a pivotal role in the development of a strategy to ensure that day-to-day operations support these aims.
A serious Business Continuity Manager isn't going to apply for a temporary position and the pay is too low. This role is vital to the business/organisation and yet peanuts are on offer.