And so we enter day seven of King's College London major IT outage

King's College London suffered its seventh consecutive day of IT woes today. According to our sources in Blighty's capital, this was down to the failure, during a hardware upgrade, of a redundant array of inexpensive disks (RAID) that was running virtualised systems. As KCL officials note, their IT systems department has been working …

  1. Anonymous Coward
    Anonymous Coward

    "London mayor IT outrage"

    I really need a new pair of glasses...

  2. David Austin

    Oops

    Sounds like a case of Good Backups, No Disaster Recovery.

    That's going to be one heck of a post-mortem.

    1. Anonymous Coward
      Anonymous Coward

      Re: Oops

      One of the reasons it's taking so long is that with the catering services bookings being out of commission, there's a distinct lack of coffee at the IT crisis meetings.

    2. Anonymous Coward
      Anonymous Coward

      Re: Oops

      My understanding is that we don't have traditional backups. They have one expensive array that acts as the data server and also the destination for backups (snapshots). I suspect the reason they are taking so long to restore systems is that they are trying to piece things together from the mess created by a RAID controller gone crazy.

      1. This post has been deleted by its author

      2. This post has been deleted by its author

  3. TheVogon

    "What happens when a one-disk-failure-tolerant RAID fails"

    Someone should get fired if they were SATA disks. RAID6 or equivalent is required.

    http://deliveryimages.acm.org/10.1145/1680000/1670144/leventhal1.png

    1. Warm Braw

      "RAID6 or equivalent is required"

      For now, but even that won't be adequate soon, apparently.

      1. A Non e-mouse Silver badge

        For critical data, I'm now only using RAID 10. For stuff that doesn't matter too much if I lose it, RAID 5.

        1. Lusty

          "For critical data, I'm now only using RAID 10"

          I can't tell if you're joking or not, so no offense intended if you were.

          Just in case you're not, read the links posted - RAID 10 is WORSE than RAID 5. If you lose one disk the remaining disk has to produce every single block, without error, to keep your data alive or to rebuild the RAID. If there is one single URE on the remaining disk your RAID is considered borked and you lose your weekend (at the very least). RAID 6 or erasure coding are currently considered the safe ways to store your important data, and even then, have a second copy somewhere. Preferably using object storage too, so you only kill one thing in a failure. It's all based on the probability of being able to recover after a disk loss, and for a given RAID set size, with RAID 5 and 10 the probability of total loss is higher than the probability of recovery.
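
          For anyone who wants the arithmetic behind that, here's a rough back-of-envelope sketch in Python (my illustrative figures, assuming the 1-in-10^14-bits consumer SATA URE rate quoted elsewhere in this thread, and a 7 x 2 TB set):

              # Rough sketch of the probability argument above, using illustrative
              # figures rather than any vendor's spec sheet: the chance of hitting at
              # least one unrecoverable read error (URE) while re-reading the
              # survivors of a degraded RAID set, assuming 1 URE per 10^14 bits.

              URE_PER_BIT = 1e-14
              BITS_PER_TB = 8e12          # decimal terabytes, as drive vendors count

              def p_at_least_one_ure(tb_to_read):
                  """Probability of at least one URE while reading tb_to_read terabytes."""
                  return 1 - (1 - URE_PER_BIT) ** (tb_to_read * BITS_PER_TB)

              # Degraded 7 x 2 TB RAID 5: the remaining 12 TB must be read back cleanly.
              print(p_at_least_one_ure(12))    # ~0.62 -- worse than a coin toss

              # RAID 6 can absorb that single URE using its second parity, which is
              # why one dead disk plus one URE is still survivable there.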

          1. Anonymous Coward
            Anonymous Coward

            > RAID 10 is WORSE than RAID 5. If you lose one disk the remaining disk has to produce every single block, without error, to keep your data alive or to rebuild the RAID. If there is one single URE on the remaining disk your RAID is considered borked and you lose your weekend (at the very least)

            RAID10 isn't parity-based like 5 or 6 and thus isn't subject to UREs in the same fashion. Rebuilding a RAID10 stripe just clones block-for-block from one side of the RAID1 to the other - that's a remirror rather than a rebuild. Even if there is a block read error when reading from one of the drives, a flipped bit in a single block on RAID10 isn't the end of the world (and if you've got a checksumming filesystem on top of that it'll be corrected anyway), but with parity-based RAID you've got no way of calculating the new parity from bogus data, so your array is toast.

            Remember that, during a parity RAID rebuild, the entire array has to be re-read, parity calculated and re-written to disc - so the bigger your array, the bigger the amount that has to be read and written, and the longer the rebuild time. RAID10 just needs to clone the contents of one disc to another, so no matter the size of your array, it's basically a sequential read of one disk going to a sequential write of another, instead of the slower and more random read-modify-write of parity RAIDs.

            In a nutshell: as a rule of thumb RAID5|6 rebuild times scale up with the size of the array, RAID10 rebuild times scale with the size of the individual disks.
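
            To put hypothetical numbers on that rule of thumb (mine, not measurements from any particular array), here's the same 7 x 2 TB example in Python:

                # Illustrative only: how much data must be moved to recover one failed
                # 2 TB disk under each layout (made-up but typical sizes).

                disk_tb = 2
                disks_in_set = 7

                # RAID 10 remirror: read the surviving partner, write the replacement.
                raid10_read_tb = disk_tb                        # 2 TB, whatever the array size

                # Parity RAID rebuild: every surviving member is read so the missing
                # disk can be recalculated from data + parity.
                parity_read_tb = (disks_in_set - 1) * disk_tb   # 12 TB for a 7-disk set

                print(raid10_read_tb, parity_read_tb)           # 2 vs 12 terabytes of reads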

        2. TheVogon

          "For critical data, I'm now only using RAID 10"

          That's very expensive on disks / slots though - so not ideal for many deployments. Most commonly in disk arrays these days SATA storage uses RAID 6 (or RAID DP), and SSD / FC uses RAID 5.

          High end arrays also often have additional inbuilt error correction / redundancy striped across the RAID sets - for instance 3PAR does this...

          1. Lusty

            "for instance 3PAR does this..."

            Oh the irony.

      2. TheVogon

        "For now, but even that won't be adequate soon, apparently."

        That refers to SATA drives. By 2019 most new deployments will be on solid state disks, and long rebuild times / risk of double or triple failures are less of an issue...

        1. Destroy All Monsters Silver badge

          ...or more of an issue.

          Also, looks like the moderator is pretty frisky..

          1. TheVogon

            " .or more of an issue."

            We already know that's likely not the case as enterprise class SSD disks have much lower Bit Error Rates than SATA...

      3. Destroy All Monsters Silver badge
        Facepalm

        "Why RAID 6 stops working in 2019"

        WTF am I reading?

        "The problem with RAID 5 is that disk drives have read errors. SATA drives are commonly specified with an unrecoverable read error rate (URE) of 10^14. Which means that once every 200,000,000 sectors, the disk will not be able to read a sector."

        So... are there any that are lower? Hint. Not SCSI, which are the same drives with a changed controller.

        "2 hundred million sectors is about 12 terabytes. When a drive fails in a 7 drive, 2 TB SATA disk RAID 5, you'll have 6 remaining 2 TB drives. As the RAID controller is reconstructing the data it is very likely it will see an URE. At that point the RAID reconstruction stops."

        I seriously hope that RAID reconstruction does NOT stop (aka. throwing the baby out with the acid bath), as there is a very nonzero probability that the smoked sector is not even being used.

        "With one exception: Western Digital's Caviar Green, model WD20EADS, is spec'd at 10^15, unlike Seagate's 2 TB ST32000542AS or Hitachi's Deskstar 7K2000"

        Oh...
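
        A quick sanity check on those quoted figures (my arithmetic, not the article's):

            # 1 URE per 10^14 bits works out to roughly one unreadable sector every
            # ~12.5 TB read -- which is about 24 billion 512-byte sectors, not the
            # 200 million the quoted piece claims.

            URE_BITS = 1e14
            SECTOR_BYTES = 512

            bytes_per_ure = URE_BITS / 8
            print(bytes_per_ure / 1e12)           # ~12.5 TB between expected UREs
            print(bytes_per_ure / SECTOR_BYTES)   # ~2.4e10 sectors, i.e. ~24 billion

            # Re-reading the 12 TB left in a degraded 7 x 2 TB RAID 5 is still
            # uncomfortably close to one expected URE per rebuild, mind.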

        1. Anonymous Coward
          Anonymous Coward

          RAID 60 with near-instantaneous duplication to a warm stand-by or RAID 10+ operating a triple mirror (the 3rd mirror being used for the backup creation).

        2. TheVogon

          "I seriously hope that RAID reconstruction does NOT stop....as there is a very nonzero probability that the smoked sector is not even being used."

          Modern arrays don't generally try to rebuild sectors with no data on them. If the array does hit a hard error on rebuild, I wouldn't want it to just pretend everything is OK! In my experience arrays will go into a fault condition in this case and will indeed stop rebuilding...

        3. John Brown (no body) Silver badge

          "With one exception: Western Digital's Caviar Green, model WD20EADS, is spec'd at 10^15,"

          Oh, that's handy. That's what's in my home server.

  4. Anonymous Coward
    Anonymous Coward

    Some interesting links...

    (Reposting my comment from the first article. Any KCL staff/students should feel free to pass this info on to the College governance. IMO something this big looks like a strategic and management failure and not something that can be blamed on lowly tech staff.)

    It is amazing what you can find on Google.

    KCL spent £875,000 on kit in 2015 to expand their existing HP solution and to provide a duplicate at a second site:

    http://ted.europa.eu/udl?uri=TED:NOTICE:290801-2015:TEXT:EN:HTML

    http://ted.europa.eu/udl?uri=TED:NOTICE:28836-2015:TEXT:EN:HTML

    Quote:

    "The platforms are fit-for-purpose and serviceable, but are lacking an integrated backup storage for the converged system storage (3PAR StoreServ). "

    Quote:

    "King's College London is about to migrate much of its central ICT infrastructure to a new shared data centre, and the opportunity is being taken to extend DR and BC facilities wherever possible to provide additional resilience in support of the university's business. Maximum resilience and most cost-effective cover is provided by replicating as closely as possible the existing converged platform, which is designed and supported by Hewlett Packard, who have exclusive rights in the existing platform."

    What this means...

    These "Voluntary ex ante transparency notices" means KCL directly awarded the contracts to HP and had to own up, after the fact, for failing to go to tender with this juicy chunk of taxpayers' cash (a legal requirement).

    Did the contract for the *original* HP system (the one that has failed) go to public tender, as demanded by law? If so, I can find no record of it in the usual places.

    As the link above shows, the contract for the business continuity and disaster recovery unit was awarded in January 2015. If they've had the hardware since Q1 2015, then why are they not able to failover their most important student facing and administrative systems to the other site? Perhaps these expensive systems have been sitting there uselessly depreciating because the IT management had other priorities...

    One such strange priority (compared to keeping vital systems up) may have been the preparation of the grand opening of a new service centre...in Cornwall (of all places!):

    http://www.kcl.ac.uk/newsevents/news/newsrecords/2015/August/KingsCollegeLondoncreatesskilledjobsatnewITCentreinCornwall.aspx

    https://twitter.com/kingsitsystems/status/634047199991726080

    (doubt they are smiling now)

    https://ksc.ac.uk/

    Bootnote:

    Seemingly that service centre is run as a private company:

    https://beta.companieshouse.gov.uk/company/02714181

    So...cheaper staff; no public sector employment contracts or pensions; management jollies to the seaside. Sweet! What could possibly go wrong...

    1. Anonymous Coward
      Anonymous Coward

      Re: Some interesting links...

      "So...cheaper staff; no public sector employment contracts or pensions; management jollies to the seaside. Sweet! What could possibly go wrong..."

      Apparently everything!

      As you say, where is the mirror failover system to these essential services?

    2. ecofeco Silver badge

      Re: Some interesting links...

      "King's College London is about to migrate much of its central ICT infrastructure to a new shared data centre, and the opportunity is being taken to extend DR and BC facilities wherever possible to provide additional resilience in support of the university's business. Maximum resilience and most cost-effective cover is provided by replicating as closely as possible the existing converged platform, which is designed and supported by Hewlett Packard, who have exclusive rights in the existing platform."

      Off site and outsourced to Giant Computer Company. SLA probably not thoroughly double checked.

      I think I see the problem...

    3. stevebp

      Re: Some interesting links...

      If KCL are using HP under the SUPC framework, they may not need to go to tender as long as they have held a "mini-competition".

      1. Anonymous Coward
        Anonymous Coward

        Re: Some interesting links...

        Yes, it is true that the framework agreements are pre-tendered for various classes of equipment and services. However, it is clear from the above links that the contracts for the newer HP systems were directly awarded without such a process. As for the original HP system, I suggest a Freedom of Information request would get to the bottom of that question. My intuition: there was no mini-competition. You can easily put in a FOIA request here: https://www.whatdotheyknow.com/

        Or maybe someone from KCL can give us that scoop? Did they follow law and procedure when buying the original HP kit containing the failed 3Par or did some IT director just go out and buy it?

  5. Lee D Silver badge

    "detemine the root causes of the problem"

    Insufficient VM replicas.

    Oh! You mean why that particular storage failed?! I didn't.

    The whole point of virtualising your infrastructure like this is that you DO NOT have to rely on one storage, machine, datacentre or whatever else to stay up.

    Where are your independent replicas? Your warm-spare hypervisors? Your secondary cluster machines to move those VMs to?

    Hardware upgrade failing a RAID - yes, agreed, nasty.

    But you seem to have NO OTHER RAID around, or indeed any practical hypervisor or storage replica - certainly not one with a vaguely recent copy of the data, it appears.

    What is the point of putting your stuff on VMs and then running them from one bunch of hardware? By now you should have been able to - at worst case - restore your backup to anything capable of acting as hypervisor (e.g. a machine from PC World if it really comes to it, but more reassuringly your backup server cluster?) and carried on as if nothing had happened. Alright, maybe an IP change here or a tweak there, or running off a local drive somewhere temporarily while your storage is being rebuilt.

    But, hell, being down for SEVEN WHOLE DAYS on virtualised infrastructure that includes your telephony and all kinds of other stuff? That's just ridiculous.
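
    For the sake of argument, the sort of replica sanity check I'd expect to be running looks something like this in Python (entirely hypothetical VM names and RPO - obviously not KCL's actual estate):

        # Minimal sketch: flag every production VM that has no off-array replica,
        # or whose replica is too stale to fail over to. Everything here is
        # invented for illustration.

        from datetime import timedelta

        MAX_REPLICA_AGE = timedelta(hours=4)    # assumed recovery point objective

        vms = [
            {"name": "telephony-01",      "replica_site": "site-b", "replica_age": timedelta(minutes=30)},
            {"name": "student-records",   "replica_site": None,     "replica_age": None},
            {"name": "catering-bookings", "replica_site": "site-b", "replica_age": timedelta(days=9)},
        ]

        for vm in vms:
            if vm["replica_site"] is None:
                print(f"{vm['name']}: NO replica anywhere - the one array is a single point of failure")
            elif vm["replica_age"] > MAX_REPLICA_AGE:
                print(f"{vm['name']}: replica at {vm['replica_site']} is {vm['replica_age']} old - useless for failover")
            else:
                print(f"{vm['name']}: OK - can be brought back up at {vm['replica_site']}")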

    1. Steve Knox

      "What is the point of putting your stuff on VMs and then running them from one bunch of hardware? "

      Not having to buy many bunches of hardware, each specced out to peak usage and hence idle 99% of the time.

      You are absolutely correct that virtualization allows for the recovery options you mentioned.

      However, you completely ignore the fact that virtualization was originally and still is most often sold not as a recovery solution but as a cost-cutting solution.

      For public entities required to jump through hoops for every penny spent, and then still criticized by moronic taxpayers for any expense with more than three digits to the left of the decimal, no matter how well spent, the natural tendency is to cut costs rather than to optimize. The net result is what you see here.

      1. Anonymous Coward
        Anonymous Coward

        Why are you assuming they are underfunded?

        What makes you assume they are underfunded? There is no evidence they are, and plenty of evidence to the contrary. They are Russell Group and have a good proportion of foreign (i.e. full-cost-paying) students. The links above and their swollen IT leadership org imply they have plenty of dough, just not spent effectively.

        It is far too easy to plead poverty in public sector IT without evidence. I know from experience the cash that is often wasted by IT senior management pursuing fads, empire building and hidden agendas, whilst neglecting solid technical foundations. Students and academics should demand that the KCL Council conduct an independent investigation into this fiasco. Maybe they could engage the BCS or similar.

      2. Lee D Silver badge

        "Not having to buy many bunches of hardware, each specced out to peak usage and hence idle 99% of the time."

        Nope. Then you'd just consolidate your servers' functions onto one server.

        You virtualise to remove the dependency on the underlying hardware, to provide portability, and to isolate the machine from the others running on the same hardware.

        Otherwise you'd containerise, or consolidate, or something else.

        1. TRT Silver badge

          or you could use all that spare run time doing something useful... like bioinformatics or some other processor-hungry number crunching, which you sell on to your research departments at bargain-basement pricing, on the understanding that in the event of a failure your cycles are toast.

          1. Korev Silver badge

            Bioinformatics is an ideal workload as each programme almost always runs on a single node with no MPI.

            My work's backup VM farm runs test/dev servers normally. If a $BADTHING happens then the backed up prod servers would be spun up instead.

        2. This post has been deleted by its author

        3. This post has been deleted by its author

    2. Anonymous Coward
      Anonymous Coward

      Management Fail

      You can bet tech staff at KCL have been asking for off-site standby systems for years. If it hasn't been implemented by now, in my opinion, it can only be due to strategic management failure at the most senior levels of the IT department.

      With a CIO, five IT directors and fourteen other senior staff drawing nice salaries, they clearly have enough cash:

      http://www.kcl.ac.uk/it/about/it-org-chart-lt-july-2016-revised.pdf

      They obviously prefer structures, silos and processes to actually getting critical technical work done.

      1. Anonymous Coward
        Anonymous Coward

        Re: Management Fail or management cover up

        The questions that need to be asked are: who are they, where did they come from, and what are they doing? Clearly they have not understood the way the academic environment works, or how it is funded. Why was the 3PAR system implemented over a NetApp system? Did implementation of the new data centre take priority over a solid backup system? Surely before implementing remote data centres you make sure the heart of your infrastructure is rock solid: storage, backups, network, servers, security...

  6. Anonymous Coward
    Anonymous Coward

    See it all the time

    As a mobile engineer I get called in on the odd server job to unmanned and unchecked server halls. What strikes me as strange is walking into server halls to be greeted by an array of flashing amber or even red disks. Yet nobody takes note.

    Now considering that these servers probably belong to a large number of customers, and I am only going to one particular customer's server unrelated to the other distressed servers, who is actually monitoring them?

    1. Nolveys

      Re: See it all the time

      "What strikes me as strange is walking into server halls to be greeted by an array of flashing amber or even red disks."

      We were able to reduce maintenance costs by 72% by having all our technicians wear human-sized horse blinders during working hours.

    2. batfastad

      Re: See it all the time

      If it's anything like most of the managed racks / DC / colo offerings I end up working with (against), it will often be sheer bureaucracy preventing the drives being replaced, not a lack of alerting. The bureaucracy of raising a ticket with the global service desk, often using some weird Excel macro form, the global service desk routing it to the right NOC, an admin shipping a replacement drive, the NOC finding the replacement drive in the loading bay after a bit, needing a new ticket to schedule the work, needing a new ticket for engineer access, replacing the drive, shipping the drive back, the drive was never shipped back, etc.

      When I'm at a company with a facility in or around London I always prefer actually going there. It's a train ticket and a taxi, it's an afternoon out of the office, but it's done in a day. Not two weeks of "to me, to you" with an overworked, underpaid on-site NOC.

      1. This post has been deleted by its author

      2. This post has been deleted by its author

    3. Anonymous Coward
      Anonymous Coward

      Re: See it all the time

      I've seen that too - in my own company's data center.

      And they don't keep spare disks on hand - they have to be ordered.

      They do tend to have rather spectacular system outages.

      1. TheVogon

        Re: See it all the time

        "And they don't keep spare disks on hand - they have to be ordered."

        That's OK if you have hot spares in your arrays. Otherwise you should really keep some onsite spare disks (and replace your stock via a warranty / maintenance claim each time one fails)...

    4. This post has been deleted by its author

    5. This post has been deleted by its author

  7. Steve Davies 3 Silver badge

    Meanwhile on a tropical island

    The salesman who sold this POS orders another piña colada.

    1. TheVogon

      Re: Meanwhile on a tropical island

      "The salesman who sold this POS"

      The salesman just sells what the customer / architect designs and orders.....

  8. Commswonk

    Something wrong here...

    "As is our normal practice, there will be a full review once normal services are restored. The review will confirm the root cause(s) of the problem," cameth the mea culpa.

    What? No "Lessons will be learned"?

    On top of which "As is our normal practice..." makes it sound as though this is not exactly a rare occurrence.

    1. Anonymous Coward
      Anonymous Coward

      Re: Something wrong here...

      "The review will confirm the root cause(s) of the problem"

      Sounds suspiciously as if someone had already made up their mind.

  9. Anonymous Coward
    Anonymous Coward

    Obviously, KCL has found a way to sort out the economic difficulties being visited on us through NeoLiberalism by the lizard people and I need more tinfoil. Where did I put that?

  10. Anonymous Coward
    Anonymous Coward

    Christ almighty, WTF

    Phones and everything, all running on the same mainframe VM host or SAN.

    What they need is a Redundant Array of Independent Servers...
