"London mayor IT outrage"
I really need a new pair of glasses...
King's College London suffered its seventh consecutive day of IT woes today. According to our sources in Blighty's capital, this was down to a redundant array of inexpensive disks (RAID), which was running virtualised systems, failing during a hardware upgrade. As KCL officials note, their IT systems department has been working …
My understanding is that we don't have traditional backups. They have one expensive array that acts as both the data server and the destination for backups (snapshots). I suspect the reason they are taking so long to restore systems is that they are trying to piece things together from the mess created by a RAID controller gone crazy.
RAID6 or equivalent is required
For now, but even that won't be adequate soon, apparently.
"For critical data, I'm now only using RAID 10"
I can't tell if you're joking or not, so no offense intended if you were.
Just in case you're not, read the links posted - RAID 10 is WORSE than RAID 5. If you lose one disk, the remaining disk in that mirror has to produce every single block, without error, to keep your data alive or to rebuild the RAID. If there is one single URE on the remaining disk, your RAID is considered borked and you lose your weekend (at the very least). RAID 6 or erasure coding are currently considered the safe ways to store your important data, and even then, have a second copy somewhere. Preferably using object storage too, so you only kill one thing in a failure. It all comes down to the probability of recovering after a disk loss, and beyond a certain RAID set size, RAID 5 and RAID 10 are more likely to lose the lot than to recover.
> RAID 10 is WORSE than RAID 5. If you lose one disk the remaining disk has to produce every single block, without error, to keep your data alive or to rebuild the RAID. If there is one single URE on the remaining disk your RAID is considered borked and you lose your weekend (at the very least)
RAID10 isn't parity-based like 5 or 6 and thus isn't subject to UREs in the same fashion. Rebuilding a RAID10 stripe just clones block-for-block from one side of the RAID1 pair to the other - that's a remirror rather than a rebuild. Even if there is a block read error on one of the drives, a bad single block in RAID10 isn't the end of the world (and if you've got a checksumming filesystem on top of that, it'll be corrected anyway), but with parity-based RAID you've got no way of calculating the new parity from bogus data, so your array is toast.
Remember that, during a parity RAID rebuild, the entire array has to be re-read, parity calculated and re-written to disk - so the bigger your array, the bigger the amount that is read and written, and the longer the rebuild time. RAID10 just needs to clone the contents of one disk to another, so no matter the size of your array, it's basically a sequential read of one disk going to a sequential write of another, instead of the slower and more random read-modify-write of parity RAIDs.
In a nutshell: as a rule of thumb RAID5|6 rebuild times scale up with the size of the array, RAID10 rebuild times scale with the size of the individual disks.
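The nutshell above can be put in rough numbers. A back-of-envelope sketch (my own, not from the thread): assume the commonly spec'd URE rate of 1 per 10^14 bits read, and illustrative 2 TB disks.

```python
import math

URE_RATE = 1e-14          # unrecoverable read errors per bit read (typical SATA spec)
DISK_TB = 2.0             # illustrative disk size
BITS_PER_TB = 8e12

def p_clean_read(tb_read):
    """Probability of reading tb_read terabytes with zero UREs."""
    bits = tb_read * BITS_PER_TB
    # (1 - p)^n ~= exp(-n * p) for tiny per-bit error probability p
    return math.exp(-bits * URE_RATE)

# Parity RAID (RAID5): every surviving disk must be read in full to rebuild,
# so the amount read grows with the array.
for n_disks in (4, 8, 16):
    surviving_tb = (n_disks - 1) * DISK_TB
    print(f"RAID5, {n_disks} disks: read {surviving_tb:.0f} TB, "
          f"P(no URE) = {p_clean_read(surviving_tb):.2f}")

# RAID10 remirror: only the surviving half of one mirror pair is read,
# regardless of how many pairs are in the array.
print(f"RAID10 remirror: read {DISK_TB:.0f} TB, "
      f"P(no URE) = {p_clean_read(DISK_TB):.2f}")
```

This deliberately treats any URE as fatal to the rebuild, which - as noted above - is pessimistic for RAID10, where a bad block costs you one block rather than the whole remirror. The point it illustrates is the scaling: the RAID5 figure worsens as you add disks, the RAID10 figure doesn't.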
"For critical data, I'm now only using RAID 10"
That's very expensive on disks / slots though - so not ideal for many deployments. Most commonly in disk arrays these days SATA storage uses RAID 6 (or RAID DP), and SSD / FC uses RAID 5.
High end arrays also often have additional inbuilt error correction / redundancy striped across the RAID sets - for instance 3PAR does this...
Why RAID 6 stops working in 2019
WTF am I reading?
The problem with RAID 5 is that disk drives have read errors. SATA drives are commonly specified with an unrecoverable read error rate (URE) of one error per 10^14 bits read. Which means that roughly once every 12 terabytes read, the disk will not be able to read a sector.
So... are there any that are lower? Hint. Not SCSI, which are the same drives with a changed controller.
Twelve terabytes is also roughly what a rebuild has to read. When a drive fails in a 7-drive, 2 TB SATA RAID 5, you'll have 6 remaining 2 TB drives - 12 TB that must be read in full to reconstruct the data - so it is very likely the controller will see a URE. At that point the RAID reconstruction stops.
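To put a number on "very likely", here is a quick Poisson-style sketch of that exact scenario (my own illustration, assuming the spec'd 10^14-bit URE rate):

```python
import math

URE_PER_BIT = 1e-14            # spec'd unrecoverable read error rate
BITS_PER_TB = 8e12

# The scenario above: one drive dies in a 7-drive RAID5 of 2 TB disks,
# so 6 surviving drives (12 TB) must be read in full to reconstruct.
tb_to_read = 6 * 2.0
expected_ures = tb_to_read * BITS_PER_TB * URE_PER_BIT
p_at_least_one = 1 - math.exp(-expected_ures)   # Poisson approximation

print(f"Expected UREs during rebuild: {expected_ures:.2f}")   # → 0.96
print(f"P(at least one URE): {p_at_least_one:.0%}")           # → 62%
```

So on the spec sheet alone, better-than-even odds that this rebuild dies. (Real drives often do better than their spec'd URE rate, so treat this as a worst-case bound rather than a prediction.)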
I seriously hope that RAID reconstruction does NOT stop (aka. throwing the baby out with the acid bath), as there is a very nonzero probability that the smoked sector is not even being used.
With one exception: Western Digital's Caviar Green, model WD20EADS, is spec'd at 10^15, unlike Seagate's 2 TB ST32000542AS or Hitachi's Deskstar 7K2000
Oh...
"I seriously hope that RAID reconstruction does NOT stop....as there is a very nonzero probability that the smoked sector is not even being used."
Modern arrays don't generally try to rebuild sectors with no data on them. If the array does hit a hard error on rebuild, I wouldn't want it to just pretend everything is OK! In my experience arrays will go into a fault condition in this case and will indeed stop rebuilding...
(Reposting my comment from first article. Any KCL staff/students should feel free to pass this info on to the College governance. IMO something this big this looks like a strategic and management failure and not something that can be blamed on lowly tech staff.)
It is amazing what you can find on Google.
KCL spent £875,000 on kit in 2015 to expand their existing HP solution and to provide a duplicate at a second site:
http://ted.europa.eu/udl?uri=TED:NOTICE:290801-2015:TEXT:EN:HTML
http://ted.europa.eu/udl?uri=TED:NOTICE:28836-2015:TEXT:EN:HTML
Quote:
"The platforms are fit-for-purpose and serviceable, but are lacking an integrated backup storage for the converged system storage (3PAR StoreServ). "
Quote:
"King's College London is about to migrate much of its central ICT infrastructure to a new shared data centre, and the opportunity is being taken to extend DR and BC facilities wherever possible to provide additional resilience in support of the university's business. Maximum resilience and most cost-effective cover is provided by replicating as closely as possible the existing converged platform, which is designed and supported by Hewlett Packard, who have exclusive rights in the existing platform."
What this means...
These "Voluntary ex ante transparency notices" mean KCL directly awarded the contracts to HP and had to own up, after the fact, to failing to go to tender with this juicy chunk of taxpayers' cash (a legal requirement).
Did the contract for the *original* HP system (the one that has failed) go to public tender, as demanded by law? If so, I can find no record of it in the usual places.
As the link above shows, the contract for the business continuity and disaster recovery unit was awarded in January 2015. If they've had the hardware since Q1 2015, then why are they not able to failover their most important student facing and administrative systems to the other site? Perhaps these expensive systems have been sitting there uselessly depreciating because the IT management had other priorities...
One such strange priority (compared to keeping vital systems up) may have been the preparation of the grand opening of a new service centre...in Cornwall (of all places!):
http://www.kcl.ac.uk/newsevents/news/newsrecords/2015/August/KingsCollegeLondoncreatesskilledjobsatnewITCentreinCornwall.aspx
https://twitter.com/kingsitsystems/status/634047199991726080
(doubt they are smiling now)
https://ksc.ac.uk/
Bootnote:
Seemingly that service centre is run as a private company:
https://beta.companieshouse.gov.uk/company/02714181
So...cheaper staff; no public sector employment contracts or pensions; management jollies to the seaside. Sweet! What could possibly go wrong...
"King's College London is about to migrate much of its central ICT infrastructure to a new shared data centre, and the opportunity is being taken to extend DR and BC facilities wherever possible to provide additional resilience in support of the university's business. Maximum resilience and most cost-effective cover is provided by replicating as closely as possible the existing converged platform, which is designed and supported by Hewlett Packard, who have exclusive rights in the existing platform."
Off site and outsourced to Giant Computer Company. SLA probably not thoroughly double checked.
I think I see the problem...
Yes, it is true that the framework agreements are pre-tendered for various classes of equipment and services. However, it is clear from the above links that the contracts for the newer HP systems were directly awarded without such a process. As for the original HP system, I suggest a Freedom of Information request would get to the bottom of that question. My intuition: there was no mini-competition. You can easily put in a FOIA request here: https://www.whatdotheyknow.com/
Or maybe someone from KCL can give us that scoop? Did they follow law and procedure when buying the original HP kit containing the failed 3Par or did some IT director just go out and buy it?
"detemine the root causes of the problem"
Insufficient VM replicas.
Oh! You mean why that particular storage failed?! That's not what I meant.
The whole point of virtualising your infrastructure like this is that you DO NOT have to rely on one storage, machine, datacentre or whatever else to stay up.
Where are your independent replicas? Your warm-spare hypervisors? Your secondary cluster machines to move those VMs to?
Hardware upgrade failing a RAID - yes, agreed, nasty.
But you seem to have NO OTHER RAID, or indeed any practical hypervisor or storage replica around - certainly not one with a vaguely recent copy of the data, it appears.
What is the point of putting your stuff on VMs and then running them from one bunch of hardware? By now you should have been able to - at worst case - restore your backup to anything capable of acting as hypervisor (e.g. a machine from PC World if it really comes to it, but more reassuringly your backup server cluster?) and carried on as if nothing had happened. Alright, maybe an IP change here or a tweak there, or running off a local drive somewhere temporarily while your storage is being rebuilt.
But, hell, being down for SEVEN WHOLE DAYS on virtualised infrastructure that includes your telephony and all kinds of other stuff? That's just ridiculous.
"What is the point of putting your stuff on VMs and then running them from one bunch of hardware? "
Not having to buy many bunches of hardware, each specced out to peak usage and hence idle 99% of the time.
You are absolutely correct that virtualization allows for the recovery options you mentioned.
However, you completely ignore the fact that virtualization was originally and still is most often sold not as a recovery solution but as a cost-cutting solution.
For public entities required to jump through hoops for every penny spent, and then still criticized by moronic taxpayers for any expense with more than three digits to the left of the decimal point, no matter how well spent, the natural tendency is to cut costs rather than to optimize. The net result is what you see here.
What makes you assume they are underfunded? There is no evidence they are and plenty of evidence to the contrary. They are Russell group and have a good proportion of foreign (i.e. full-cost paying) students. The links above and their swollen IT leadership org imply they have plenty of dough, just not spent effectively.
It is far too easy to plead poverty in public sector IT without evidence. I know from experience the cash that is often wasted by IT senior management pursuing fads, empire building and hidden agendas, whilst neglecting solid technical foundations. Students and academics should demand that the KCL Council conduct an independent investigation into this fiasco. Maybe they could engage the BCS or similar.
"Not having to buy many bunches of hardware, each specced out to peak usage and hence idle 99% of the time."
Nope. Then you'd just consolidate your servers' functions onto one server.
You virtualise to remove the dependency on the underlying hardware to provide portability, and isolate it from the other machines also running on the same hardware.
Otherwise you'd container, or consolidate, or something else.
Or you could use all that spare run time doing something useful... like bioinformatics or some other processor-hungry number crunching, which you sell on to your research departments at bargain-basement pricing, on the understanding that in the event of a failure your cycles are toast.
You can bet tech staff at KCL have been asking for off-site standby systems for years. If it hasn't been implemented by now, in my opinion, it can only be due to strategic management failure at the most senior levels of the IT department.
With a CIO, five IT directors and fourteen other senior staff drawing nice salaries, they clearly have enough cash:
http://www.kcl.ac.uk/it/about/it-org-chart-lt-july-2016-revised.pdf
They obviously prefer structures, silos and processes to actually getting critical technical work done.
The questions that need to be asked are: who are they, where did they come from, and what are they doing? Clearly they have not understood the way the academic environment works or is funded. Why was the 3PAR system implemented over a NetApp system? Did implementation of the new data centre take priority over a solid backup system? Surely before implementing remote data centres you make sure the heart of your infrastructure is rock solid: storage, backups, network, servers, security...
As a mobile engineer I get called in on the odd server job to unmanned and unchecked server halls. What strikes me as strange is walking into server halls to be greeted by an array of flashing amber or even red disks. Yet nobody takes note.
Now, considering that these servers probably belong to a large number of customers and I am only there for one particular customer's server, unrelated to the other distressed servers, who is actually monitoring them?
If it's anything like most managed rack / DC / colo offerings I end up working with (against), it will often be sheer bureaucracy preventing the drives being replaced, not a lack of alerting. The bureaucracy of raising a ticket with the global service desk, often using some weird Excel macro form; the global service desk routing it to the right NOC; an admin shipping a replacement drive; the NOC finding the replacement drive in the loading bay after a bit; needing a new ticket to schedule the work; needing a new ticket for engineer access; replacing the drive; shipping the drive back; the drive was never shipped back; etc.
When I'm at a company with a facility in or around London I always prefer actually going there. It's a train ticket and a taxi, it's an afternoon out of the office, but it's done in a day. Not two weeks of "to me, to you" with an overworked, underpaid on-site NOC.
"As is our normal practice, there will be a full review once normal services are restored. The review will confirm the root cause(s) of the problem," cameth the mea culpa.
What? No "Lessons will be learned"?
On top of which "As is our normal practice..." makes it sound as though this is not exactly a rare occurrence.