"London mayor IT outrage"
I really need a new pair of glasses...
King's College London suffered its seventh consecutive day of IT woes today. According to our sources in Blighty's capital, this was down to a redundant array of inexpensive disks (RAID) hosting virtualised systems failing during a hardware upgrade. As KCL officials note, their IT systems department has been working …
My understanding is that we don't have traditional backups. They have one expensive array that acts as the data server and also the destination for backups (snapshots). I suspect the reason they are taking so long to restore systems is that they are trying to piece things together from the mess created by a RAID controller gone crazy.
This post has been deleted by its author
This post has been deleted by its author
RAID6 or equivalent is required
For now, but even that won't be adequate soon, apparently.
"For critical data, I'm now only using RAID 10"
I can't tell if you're joking or not, so no offense intended if you were.
Just in case you're not, read the links posted - RAID 10 is WORSE than RAID 5. If you lose one disk the remaining disk has to produce every single block, without error, to keep your data alive or to rebuild the RAID. If there is one single URE on the remaining disk your RAID is considered borked and you lose your weekend (at the very least). RAID 6 or Erasure Coding are the currently considered safe ways to store your important data, and even then, have a second copy somewhere. Preferably using object storage too, so you only kill one thing in a failure. It's all based on the probability of being able to recover after a disk loss, and with RAID 5 and 10 the probability is higher for a total loss of information than for recovery for a given RAID set size.
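The probability argument can be sketched numerically. A minimal sketch, assuming the spec-sheet URE rate is a true, independent per-bit error probability (a big simplification, since real errors cluster); the function name and figures are mine, for illustration:

```python
import math

def p_clean_rebuild(data_tb, ure_rate=1e-14):
    """Probability of reading `data_tb` terabytes without a single URE.

    ure_rate is errors per bit read: 1e-14 is a typical consumer SATA
    spec, while enterprise drives are often quoted at 1e-15.
    """
    bits = data_tb * 1e12 * 8
    # log1p avoids losing precision when the per-bit survival
    # probability (1 - 1e-14) is compounded ~1e14 times
    return math.exp(bits * math.log1p(-ure_rate))

# Surviving data a degraded array must read back error-free:
for tb in (2, 6, 12, 24):
    print(f"{tb:>3} TB read -> {p_clean_rebuild(tb):.1%} chance of no URE")
```

At the quoted 1-in-10^14 rate this works out to roughly an 85% chance of a clean re-read of 2 TB, but only about 38% for the 12 TB a big degraded SATA set must read back, which is the heart of the "RAID 5 is dead" argument.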
> RAID 10 is WORSE than RAID 5. If you lose one disk the remaining disk has to produce every single block, without error, to keep your data alive or to rebuild the RAID. If there is one single URE on the remaining disk your RAID is considered borked and you lose your weekend (at the very least)
RAID10 isn't parity-based like 5 or 6 and thus isn't subject to UREs in the same fashion. Rebuilding a RAID10 stripe just clones block-for-block from one side of the RAID1 pair to the other: that's a remirror rather than a rebuild. Even if there is a block read error from one of the drives, a flipped bit in a single block on RAID10 isn't the end of the world (and if you've got a checksumming filesystem on top of that it'll be corrected anyway), but with parity-based RAID you've got no way of calculating the new parity from bogus data, so your array is toast.
Remember that, during a parity RAID rebuild, the entire array has to be re-read, parity calculated and re-written to disc, so the bigger your array, the bigger the amount that has to be read and written and the longer the rebuild time. RAID10 just needs to clone the contents of one disc to another, so no matter the size of your array it's basically a sequential read of one disk going to a sequential write of another, instead of the slower and more random read-modify-write of parity RAIDs.
In a nutshell: as a rule of thumb RAID5|6 rebuild times scale up with the size of the array, RAID10 rebuild times scale with the size of the individual disks.
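That rule of thumb can be made concrete with a back-of-the-envelope sketch (illustrative only; the function and numbers are mine, and real controllers optimise around unallocated space):

```python
def rebuild_io_tb(raid, n_disks, disk_tb):
    """Rough data volume (TB read) to recover one failed disk.

    Parity RAID must re-read every surviving member to recompute the
    lost disk; a mirror only re-reads the failed disk's partner.
    Ignores metadata, partial-stripe tricks and thin provisioning.
    """
    if raid in ("raid5", "raid6"):
        return (n_disks - 1) * disk_tb   # read all survivors, write one disk
    if raid == "raid10":
        return disk_tb                   # sequential copy of the mirror half
    raise ValueError(raid)

# Doubling the member count doubles a RAID5 rebuild's reads,
# but leaves a RAID10 remirror unchanged:
print(rebuild_io_tb("raid5", 8, 2))    # reads 14 TB
print(rebuild_io_tb("raid5", 16, 2))   # reads 30 TB
print(rebuild_io_tb("raid10", 16, 2))  # reads 2 TB
```

The mirror remirror is also purely sequential, so its wall-clock time is roughly the disk's streaming rate times its capacity, regardless of array width.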
"For critical data, I'm now only using RAID 10"
That's very expensive on disks / slots though - so not ideal for many deployments. Most commonly in disk arrays these days SATA storage uses RAID 6 (or RAID DP), and SSD / FC uses RAID 5.
High end arrays also often have additional inbuilt error correction / redundancy striped across the RAID sets - for instance 3PAR does this...
Why RAID 6 stops working in 2019
WTF am I reading?
The problem with RAID 5 is that disk drives have read errors. SATA drives are commonly specified with an unrecoverable read error rate (URE) of 1 in 10^14 bits. Which means that, roughly once every 12 terabytes read, the disk will not be able to read a sector.
So... are there any that are lower? Hint. Not SCSI, which are the same drives with a changed controller.
Twelve terabytes is about 24 billion 512-byte sectors. When a drive fails in a 7 drive, 2 TB SATA disk RAID 5, you'll have 6 remaining 2 TB drives, 12 TB that must be re-read in full. As the RAID controller is reconstructing the data it is very likely it will see a URE. At that point the RAID reconstruction stops.
I seriously hope that RAID reconstruction does NOT stop (aka. throwing the baby out with the acid bath), as there is a very nonzero probability that the smoked sector is not even being used.
With one exception: Western Digital's Caviar Green, model WD20EADS, is spec'd at 10^15, unlike Seagate's 2 TB ST32000542AS or Hitachi's Deskstar 7K2000
Oh...
"I seriously hope that RAID reconstruction does NOT stop....as there is a very nonzero probability that the smoked sector is not even being used."
Modern arrays don't generally try and rebuild sectors without any data on. If the array does hit a hard error on rebuild, I wouldn't want it to just pretend everything is OK! In my experience arrays will go into a fault condition in this case and will indeed stop rebuilding...
(Reposting my comment from first article. Any KCL staff/students should feel free to pass this info on to the College governance. IMO something this big this looks like a strategic and management failure and not something that can be blamed on lowly tech staff.)
It is amazing what you can find on Google.
KCL spent £875,000 on kit in 2015 to expand their existing HP solution and to provide a duplicate at a second site:
http://ted.europa.eu/udl?uri=TED:NOTICE:290801-2015:TEXT:EN:HTML
http://ted.europa.eu/udl?uri=TED:NOTICE:28836-2015:TEXT:EN:HTML
Quote:
"The platforms are fit-for-purpose and serviceable, but are lacking an integrated backup storage for the converged system storage (3PAR StoreServ). "
Quote:
"King's College London is about to migrate much of its central ICT infrastructure to a new shared data centre, and the opportunity is being taken to extend DR and BC facilities wherever possible to provide additional resilience in support of the university's business. Maximum resilience and most cost-effective cover is provided by replicating as closely as possible the existing converged platform, which is designed and supported by Hewlett Packard, who have exclusive rights in the existing platform."
What this means...
These "Voluntary ex ante transparency notices" mean KCL directly awarded the contracts to HP and had to own up, after the fact, to failing to go to tender with this juicy chunk of taxpayers' cash (a legal requirement).
Did the contract for the *original* HP system (the one that has failed) go to public tender, as demanded by law? If so, I can find no record of it in the usual places.
As the link above shows, the contract for the business continuity and disaster recovery unit was awarded in January 2015. If they've had the hardware since Q1 2015, then why are they not able to fail over their most important student-facing and administrative systems to the other site? Perhaps these expensive systems have been sitting there uselessly depreciating because the IT management had other priorities...
One such strange priority (compared to keeping vital systems up) may have been the preparation of the grand opening of a new service centre...in Cornwall (of all places!):
http://www.kcl.ac.uk/newsevents/news/newsrecords/2015/August/KingsCollegeLondoncreatesskilledjobsatnewITCentreinCornwall.aspx
https://twitter.com/kingsitsystems/status/634047199991726080
(doubt they are smiling now)
https://ksc.ac.uk/
Bootnote:
Seemingly that service centre is run as a private company:
https://beta.companieshouse.gov.uk/company/02714181
So...cheaper staff; no public sector employment contracts or pensions; management jollies to the seaside. Sweet! What could possibly go wrong...
"King's College London is about to migrate much of its central ICT infrastructure to a new shared data centre, and the opportunity is being taken to extend DR and BC facilities wherever possible to provide additional resilience in support of the university's business. Maximum resilience and most cost-effective cover is provided by replicating as closely as possible the existing converged platform, which is designed and supported by Hewlett Packard, who have exclusive rights in the existing platform."
Off site and outsourced to Giant Computer Company. SLA probably not thoroughly double checked.
I think I see the problem...
Yes, it is true that the framework agreements are pre-tendered for various classes of equipment and services. However, it is clear from the above links that the contracts for the newer HP systems were directly awarded without such a process. As for the original HP system, I suggest a Freedom of Information request would get to the bottom of that question. My intuition: there was no mini-competition. You can easily put in a FOIA request here: https://www.whatdotheyknow.com/
Or maybe someone from KCL can give us that scoop? Did they follow law and procedure when buying the original HP kit containing the failed 3Par or did some IT director just go out and buy it?
"detemine the root causes of the problem"
Insufficient VM replicas.
Oh! You mean why that particular storage failed?! I didn't.
The whole point of virtualising your infrastructure like this is that you DO NOT have to rely on one storage, machine, datacentre or whatever else to stay up.
Where are your independent replicas? Your warm-spare hypervisors? Your secondary cluster machines to move those VMs to?
Hardware upgrade failing a RAID - yes, agreed, nasty.
But you seem to have NO OTHER RAID, or indeed any practical hypervisor or storage replica, certainly not one with a vaguely recent copy of the data, it appears.
What is the point of putting your stuff on VMs and then running them from one bunch of hardware? By now you should have been able to - at worst case - restore your backup to anything capable of acting as hypervisor (e.g. a machine from PC World if it really comes to it, but more reassuringly your backup server cluster?) and carried on as if nothing had happened. Alright, maybe an IP change here or a tweak there, or running off a local drive somewhere temporarily while your storage is being rebuilt.
But, hell, being down for SEVEN WHOLE DAYS on virtualised infrastructure that includes your telephony and all kinds of other stuff? That's just ridiculous.
"What is the point of putting your stuff on VMs and then running them from one bunch of hardware? "
Not having to buy many bunches of hardware, each specced out to peak usage and hence idle 99% of the time.
You are absolutely correct that virtualization allows for the recovery options you mentioned.
However, you completely ignore the fact that virtualization was originally and still is most often sold not as a recovery solution but as a cost-cutting solution.
For public entities required to jump through hoops for every penny spent, and then still criticised by moronic taxpayers for any expense with more than three digits to the left of the decimal, no matter how well-spent, the natural tendency is to cut costs rather than to optimise. The net result is what you see here.
What makes you assume they are underfunded? There is no evidence they are and plenty of evidence to the contrary. They are Russell group and have a good proportion of foreign (i.e. full-cost paying) students. The links above and their swollen IT leadership org imply they have plenty of dough, just not spent effectively.
It is far too easy to plead poverty in public sector IT without evidence. I know from experience the cash that is often wasted by IT senior management pursuing fads, empire building and hidden agendas, whilst neglecting solid technical foundations. Students and academics should demand that the KCL Council conduct an independent investigation into this fiasco. Maybe they could engage the BCS or similar.
"Not having to buy many bunches of hardware, each specced out to peak usage and hence idle 99% of the time."
Nope. Then you'd just consolidate your servers' functions onto one server.
You virtualise to remove the dependency on the underlying hardware to provide portability, and isolate it from the other machines also running on the same hardware.
Otherwise you'd container, or consolidate, or something else.
Or you could use all that spare run time doing something useful... like bioinformatics or some other processor-hungry number crunching, which you sell on to your research departments at bargain-basement pricing, on the understanding that in the event of a failure your cycles are toast.
This post has been deleted by its author
This post has been deleted by its author
You can bet tech staff at KCL have been asking for off-site standby systems for years. If it hasn't been implemented by now, in my opinion, it can only be due to strategic management failure at the most senior levels of the IT department.
With a CIO, five IT directors and fourteen other senior staff drawing nice salaries, they clearly have enough cash:
http://www.kcl.ac.uk/it/about/it-org-chart-lt-july-2016-revised.pdf
They obviously prefer structures, silos and processes to actually getting critical technical work done.
The questions that need to be asked are: who are they, where did they come from, and what are they doing? Clearly they have not understood how the academic environment works or is funded. Why was the 3PAR system implemented over a NetApp system? Did implementation of the new data centre take priority over a solid backup system? Surely before implementing remote data centres you make sure the heart of your infrastructure is rock solid: storage, backups, network, servers, security...
As a mobile engineer I get called out on the odd server job to unmanned and unchecked server halls. What strikes me as strange is walking into a server hall to be greeted by an array of flashing amber or even red disks. Yet nobody takes note.
Now, considering that these servers probably belong to a large number of customers and I am only going to one particular customer's server, unrelated to the other distressed servers, who is actually monitoring them?
If it's anything like most managed racks / DC / colo offerings I end up working with (against), it will often be sheer bureaucracy preventing the drives being replaced, not alerting. The bureaucracy of raising a ticket with the global service desk, often using some weird Excel macro form, the service desk routing it to the right NOC, an admin shipping a replacement drive, the NOC finding the replacement drive in the loading bay after a bit, needing a new ticket to schedule the work, needing a new ticket for engineer access, replacing the drive, shipping the drive back, the drive was never shipped back, etc.
When I'm at a company with a facility in or around London I always prefer actually going there. It's a train ticket and a taxi, it's an afternoon out of the office, but it's done in a day. Not two weeks of "to me to you" with an overworked, underpaid on-site NOC.
This post has been deleted by its author
This post has been deleted by its author
This post has been deleted by its author
This post has been deleted by its author
"As is our normal practice, there will be a full review once normal services are restored. The review will confirm the root cause(s) of the problem," cameth the mea culpa.
What? No "Lessons will be learned"?
On top of which "As is our normal practice..." makes it sound as though this is not exactly a rare occurrence.
Closely followed by a redundant array of IT management...
This isn't a small business, they can easily afford a system architect to design them something that HP (or whoever) can then provide. There's no way there should be a single box marked "all the software" in the network diagram.
I'm a Postgraduate Research student at KCL. Most of my work is self-directed, and relies on these online infrastructures being up and running in order for me to fulfil my duties as a researcher. Not only can I not navigate through the website and click on important information about upcoming workshops, skills sessions, seminars, or contact administrators--I also cannot access any journal subscriptions, find the location of online resources, check materials out of the library, use the printing facilities, book rooms for upcoming conferences, liaise with administrative support, or submit documents for review via the internal grading services. I've asked for a tuition reimbursement for this week, but the admin was expectedly evasive, asking me to "be patient" and bear with them. This is an institutional failure on a massive scale--it's not IT's fault, but it certainly points to the signal-to-noise ratio in the administrative channels that need a thorough overhaul--particularly where they connect with the highest levels of office at KCL.
> it's not IT's fault
So you blame management for demanding too much from IT, being tightwads with the IT budget, and relying on IT for EVERYTHING?
Yup, sounds like academic IT... "just dumb code monkeys" who don't get no respect. (I read an article to that effect in the Chronicle of Higher Ed several years back whilst waiting to interview for my first, and last, academic IT job)
You're putting words into my mouth; I encourage you not to make such blatant assumptions. I suspect that IT knew this was a concerning issue and was capable of fixing it, but that the administration (of which IT is also part) failed to see the significance or spend the time directing the financial resources to IT to fix it. Perhaps it was one of those massive projects that everyone is always getting around to doing, but that never actually gets done until a serious malfunction forces immediate attention and time to be spent on repairs. I have worked internally at KCL and other universities, and office politics can clog up progress more often than sheer ineptitude. I am really disappointed with the updates, support service, and communication that the highest levels of admin have provided during these outages--it speaks to a system fundamentally disconnected from the real-world, material needs of its staff and students.
> You're putting words into my mouth
Just trying to clarify what you meant by that hand-wavy wall of text. Obviously you've spent way too much time in politically-correct university hell. :)
> Perhaps it was one of those massive projects that everyone is always getting around to doing, but never actually gets done until a serious malfunction causes immediate attention and time to be spent on repairs.
Yeah. That's always the way.
Savvy IT managers and BOFHs will occasionally "let" unsupportable systems fail to force the issue before it gets this bad.
I disagree. If this is hardware failure in a single system and a single site (it appears it is) and they've had the resources and time to implement multi-system or multi-site redundancy (it appears they have), then the fault lies squarely with the KCL IT department leadership. It couldn't really be more clear-cut. Their twee Twitter account does not excuse this (perhaps they should have employed an apprentice storage admin instead of a social media coordinator?).
If you try to spread the blame to an abstraction such as "administrative channels", no one will learn.
... When I was at university, I heard that Warwick Uni had a Harris H1000 supermini that was used by undergraduates. However someone found that there was a serious bug in the OS JCL interpreter, basically if you entered the command "SREG $A=$B+$C+$D" (I.e. add three variables together and store the result in a fourth variable) the OS crashed and needed a hard reboot.
Apparently it was amazing how often this happened just before coursework had to be handed in!
I was there and heard the same story though never experienced it. Also jobs were submitted by undergrads on wads of punched cards (two elastic bands were the equivalent of raid5 and the smart ones drew a diagonal line in felt tip on the edge of the stack for fast recovery). Anyway the story was that you could stick several copies of the offending instruction in your stack and each one would cause a crash. Apparently there was no way to remove cards from the machine once they had got that far in.
Ahh yes, punched cards. Never had to worry about them myself; my year was the first that used CRTs for all our work. However I do remember being in the computer centre once and seeing a 2nd year break down and cry when she dropped her course work and ended up with punched cards scattered across half the room.
Those were the days <sigh>
Beer logo, because I was a student then!
This post has been deleted by its author
It's all well and good having all these resilience technologies, but you still need to monitor the damn thing for the situation when a component fails and actually do something about it.
I'm wondering when the last test of their resilience and DR processes was as well, since that should have been the belt-and-braces proof that the design actually works.
On the bright side though, given that it's educational, at least this will be a "learning experience" for them when they do the next set of purchasing and they can get it right that time around.
[tongue-in-cheek-mode on]
Remind me again why on-prem is better than the cloud? Something about reliability wasn't it? And an ability to shout at people to fix things quickly? And no risk of data loss?
So that's a stupid and juvenile thing to say, but the comments here would have been full if they'd have been an AWS/Azure/GCP customer.
"King's College London is about to migrate much of its central ICT infrastructure to a new shared data centre, and the opportunity is being taken to extend DR and BC facilities wherever possible to provide additional resilience in support of the university's business. Maximum resilience and most cost-effective cover is provided by replicating as closely as possible the existing converged platform, which is designed and supported by Hewlett Packard, who have exclusive rights in the existing platform."
(posted by another commentard above. see full post above for dates and links)
I've highlighted the relevant bits.
To be fair, I don't know if the new centre is on site or not, but for sure it is covered by HP and not the college staff at this time.
This post has been deleted by its author
In a previous life, I designed storage / backup solutions for environments like this. Virtualised or purely physical wouldn't make a lot of difference if the whole infrastructure was based on a single, shared storage array. The SLA in place would determine the amount of availability / redundancy needed here. For the systems described, maybe an 8-hour outage would be "affordable"... which should easily be achieved by a traditional backup / restore mechanism.
Of course, if the common storage failed completely, was out of support... and no replacement could be easily sourced... then I could believe a 7-day outage, awaiting hardware for the restore ;) This is a management issue, not an IT issue.
Jc
You shouldn't be upgrading a production server online anyway. You need an off-line mirror. You upgrade them alternately and give them both the same input and check their output is the same. Then, after your defined soaking-in period, you swap the primary and the secondary around. Rinse and repeat. OK, that's probably overkill, but it's amazing just how business critical some IT is nowadays.
This post has been deleted by its author
I haven't looked at the KCL org chart (linked somewhere above) but if it's like other academic institutions I know, job titles are irrelevant, there's never more than one competent sysadmin on staff at any given time, and they never stay more than a year. So nobody even knows of the existence of 50-90% of the systems.
Our campus-wide VOIP was on that? Who knew!!? bwahahaha
Yes. Which is why you generally try to source your drives from at least a couple of different suppliers so at least they are from different batches. Ideally, you will speculatively replace disks with new ones, bought later, at various times to spread the age/firmware versions around a bit so multiple failures are less likely.
Of course, when you buy in a "managed solution" none of this will happen. A company like HP will simply deliver a "system" at the cheapest possible cost to them, so odds are all the HDDs are from the same batch. Ditto for the rest of the hardware modules. On the other hand, you might get a weird mish-mash of versions/models/firmware built from whatever was to hand in the warehouse at the time, which might be a whole other world of hurt if there are, for example, multiple RAID cards with different firmware versions.
This post has been deleted by its author
Back in the day best practice would be to mirror all drives on different shelves with drives from different batch and different power supply for each shelf.
The definitive guide to RAID is at http://www.baarf.dk/BAARF/BAARF2.html
About 25 years ago I witnessed a classic RAID 5, 5 disk failure. Every day for a week one drive went down and was replaced the next day. At the end of the week the recovery had not quite caught up. Spectacular.
"About 25 years ago I witnessed a classic RAID 5, 5 disk failure. Every day for a week one drive went down and was replaced the next day. At the end of the week the recovery had not quite caught up. Spectacular."
Worst one I ever came across was a small office, single server, RAID card failed. The factory-installed firmware version on the new RAID card was incapable of booting on that server model without the RAID card BIOS being updated. There were no other MCA-slot machines in the office. Took the card back, got the BIOS updated elsewhere, returned to site, and someone had since "helpfully" pulled the drives from the system without powering it down. Now, I can't be sure the disks were recoverable before, but they definitely weren't after that.
Luckily for me, only the hardware repair was my problem. IIRC they spent the following week rebuilding and restoring what would likely have been just a slower, degraded array while it auto-rebuilt.
This post has been deleted by its author
From what I understand, the ideal (and seldom practised) method of procurement is to source disks from different batches.
The mean time to fail on identically manufactured disks, given that in a RAID they're typically all going to be spinning for the same hours, will be very similar across a batch.
Definitely agree. Unfortunately, that also means you need to stagger the acquisitions, which means planning ahead which is becoming something of an exotic science these days.
When I decided to go for a home NAS, I first spent four months buying one 3TB every month, to make as sure as I could that not all disks would be from the same batch. On the last month, I bought the 4th disk and the Synology station that would make them all useful.
I do not see that most management types would be able to have that much patience.
Last time I was at College/Uni (it was a shared course); my laptop picked up a virus that wiped 70% of my data (and porn), despite up to date AV etc.
My tutors pooh-poohed the excuse, but did grudgingly give me a few extra days to turn in my coursework.
Sadly, it was over a month before they could read it, as a few days later the entire college system crashed from the same virus, and took the rest of the term to get back up again.
On another note - in response to something said in an earlier post: management don't value ANY technical staff, of any discipline; more than once I was over-ruled due to complaints by CLEANING STAFF, and stopped from doing my job properly.
Comms cables were fed in with MV power cables because they wouldn't pay for dedicated trunking, fire breaks weren't installed in vertical trunking to save time and money, and shop-floor staff were allowed to plug 3 kW heaters into multisockets that were stuck into outlets only meant for powering the IT equipment - and not backed up with cabling that could handle a 3 kW load safely.
All this cost cutting saved the company a few thousand per year, but cost them millions as the IT systems went down regularly when a 3 kW heater overloaded the system (3 electrical fires in the 18 months I was there, as some enterprising soul bypassed the breaker to stop it tripping out), MV interference corrupted data streams and damaged backup attempts, and a fire in the lift shaft burnt through 7 floors of comms cables, destroying 22 MILES of cabling inside the shaft.
Oh yeah, putting the cables in the lift shaft also broke the law.
All this was, OF COURSE, the fault of the technical department, and when I refused to sign off on some insanely unsafe office equipment ordered by one of the chinless wonders on the top floor, they fired me.
What does former London Mayor Boris Johnson have to say on this? Could this be one of those sneaky Jihadist ICT cyber attacks under the auspices of new London Mayor Sadiq Khan? Remember that attacks on infrastructure like Stuxnet in Iran have been announced to be retaliated against. Also the Pentagon has been rumoured to be commencing a cyber offensive against China and Russia. Watching all this, it's nothing less than to be expected that a new job ad asks for 'detrimental' sysadmins. Just my two pennies here.
This post has been deleted by its author