And so we enter day seven of King's College London major IT outage

King's College London suffered its seventh consecutive day of IT woes today. According to our sources in Blighty's capital, this was down to the failure, during a hardware upgrade, of a redundant array of inexpensive disks (RAID) that was running virtualised systems. As KCL officials note, their IT systems department has been working …

  1. Anonymous Coward

    "London mayor IT outrage"

    I really need a new pair of glasses...

  2. David Austin

    Oops

    Sounds like a case of Good Backups, No Disaster Recovery.

    That's going to be one heck of a post-mortem.

    1. Anonymous Coward

      Re: Oops

      One of the reasons it's taking so long is that with the catering services bookings being out of commission, there's a distinct lack of coffee at the IT crisis meetings.

    2. Anonymous Coward

      Re: Oops

      My understanding is that we don't have traditional backups. They have one expensive array that acts as the data server and also the destination for backups (snapshots). I suspect the reason they are taking so long to restore systems is that they are trying to piece things together from the mess created by a RAID controller gone crazy.

      1. This post has been deleted by its author

      2. This post has been deleted by its author

  3. TheVogon

    "What happens when a one-disk-failure-tolerant RAID fails"

    Someone should get fired if they were SATA disks. RAID6 or equivalent is required.

    http://deliveryimages.acm.org/10.1145/1680000/1670144/leventhal1.png

    1. Warm Braw

      RAID6 or equivalent is required

      For now, but even that won't be adequate soon, apparently.

      1. A Non e-mouse Silver badge

        For critical data, I'm now only using RAID 10. For stuff that doesn't matter too much if I lose it, RAID 5.

        1. Lusty

          "For critical data, I'm now only using RAID 10"

          I can't tell if you're joking or not, so no offense intended if you were.

          Just in case you're not, read the links posted - RAID 10 is WORSE than RAID 5. If you lose one disk the remaining disk has to produce every single block, without error, to keep your data alive or to rebuild the RAID. If there is one single URE on the remaining disk your RAID is considered borked and you lose your weekend (at the very least). RAID 6 or erasure coding are currently considered the safe ways to store your important data, and even then, have a second copy somewhere. Preferably using object storage too, so you only kill one thing in a failure. It's all based on the probability of being able to recover after a disk loss, and for a given RAID set size the probability of total loss of information with RAID 5 and 10 is higher than the probability of recovery.
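
          The arithmetic behind that "probability of being able to recover" point, as a rough sketch - assuming the commonly quoted 10^14-bit URE spec for consumer SATA drives and a rebuild that has to read every surviving bit back cleanly; real arrays, hot spares and filesystems complicate this considerably:

          URE_PER_BIT = 1e-14  # the oft-quoted spec for consumer SATA drives

          def clean_read_probability(terabytes):
              # Chance of reading this much data back with zero unrecoverable read errors.
              bits = terabytes * 1e12 * 8
              return (1.0 - URE_PER_BIT) ** bits

          # How much surviving data a rebuild has to re-read depends on the layout and
          # the size of the RAID set; these volumes are purely illustrative.
          for tb in (2, 6, 12, 24):
              print(f"{tb:>2} TB re-read with no URE: {clean_read_probability(tb):.0%}")

          On those assumptions a 12 TB re-read - six surviving 2 TB drives - only completes cleanly a bit under 40% of the time, which is the weekend-losing scenario described above.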

          1. Anonymous Coward

            > RAID 10 is WORSE than RAID 5. If you lose one disk the remaining disk has to produce every single block, without error, to keep your data alive or to rebuild the RAID. If there is one single URE on the remaining disk your RAID is considered borked and you lose your weekend (at the very least)

            RAID10 isn't parity-based like 5 or 6 and thus isn't subject to UREs in the same fashion. Rebuilding a RAID10 stripe just clones block-for-block from one side of the RAID1 to another - that's a remirror rather than a rebuild. Even if there is a block read error on one of the drives, a flipped bit in a single block on RAID10 isn't the end of the world (and if you've got a checksumming filesystem on top of that it'll be corrected anyway), but with parity-based RAID you've got no way of calculating the new parity from bogus data, so your array is toast.

            Remember that, during a parity RAID rebuild, the entire array has to be re-read, parity calculated and re-written to disc - so the bigger your array, the bigger the amount that has to be read and written and the longer the rebuild time. RAID10 just needs to clone the contents of one disc to another, so no matter the size of your array it's basically a sequential read of one disk going to a sequential write of another, instead of the slower and more random read-modify-write of parity RAIDs.

            In a nutshell: as a rule of thumb RAID5|6 rebuild times scale up with the size of the array, RAID10 rebuild times scale with the size of the individual disks.
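
            A toy illustration of that rule of thumb - just the volume of data each layout has to read back to recover a single failed disk, ignoring controller speed, load and everything else that matters in practice:

            def parity_rebuild_read_tb(disks_in_set, disk_tb):
                # RAID 5/6: every surviving member of the set is re-read to reconstruct the dead one.
                return (disks_in_set - 1) * disk_tb

            def mirror_remirror_read_tb(disk_tb):
                # RAID 10: one sequential pass over the dead disk's mirror partner.
                return disk_tb

            for n in (4, 8, 16):
                print(f"{n:>2} x 2 TB set: parity rebuild reads {parity_rebuild_read_tb(n, 2)} TB,"
                      f" remirror reads {mirror_remirror_read_tb(2)} TB")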

        2. TheVogon

          "For critical data, I'm now only using RAID 10"

          That's very expensive on disks / slots though - so not ideal for many deployments. Most commonly in disk arrays these days SATA storage uses RAID 6 (or RAID DP), and SSD / FC uses RAID 5.

          High end arrays also often have additional inbuilt error correction / redundancy striped across the RAID sets - for instance 3PAR does this...

          1. Lusty

            "for instance 3PAR does this..."

            Oh the irony.

      2. TheVogon

        "For now, but even that won't be adequate soon, apparently."

        That refers to SATA drives. By 2019 most new deployments will be on solid state disks, and long rebuild times / risk of double or triple failures are less of an issue...

        1. Destroy All Monsters Silver badge

          ...or more of an issue.

          Also, looks like the moderator is pretty frisky..

          1. TheVogon

            "...or more of an issue."

            We already know that's likely not the case as enterprise class SSD disks have much lower Bit Error Rates than SATA...

      3. Destroy All Monsters Silver badge
        Facepalm

        Why RAID 6 stops working in 2019

        WTF am I reading?

        "The problem with RAID 5 is that disk drives have read errors. SATA drives are commonly specified with an unrecoverable read error rate (URE) of 10^14. Which means that once every 200,000,000 sectors, the disk will not be able to read a sector."

        So... are there any that are lower? Hint. Not SCSI, which are the same drives with a changed controller.

        "2 hundred million sectors is about 12 terabytes. When a drive fails in a 7 drive, 2 TB SATA disk RAID 5, you’ll have 6 remaining 2 TB drives. As the RAID controller is reconstructing the data it is very likely it will see an URE. At that point the RAID reconstruction stops."

        I seriously hope that RAID reconstruction does NOT stop (aka. throwing the baby out with the acid bath), as there is a very nonzero probability that the smoked sector is not even being used.

        "With one exception: Western Digital's Caviar Green, model WD20EADS, is spec'd at 10^15, unlike Seagate's 2 TB ST32000542AS or Hitachi's Deskstar 7K2000"

        Oh...
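
        For anyone checking the quoted figures, a quick sanity-check sketch, assuming 512-byte sectors (arithmetic only, not a claim about any particular drive):

        URE_INTERVAL_BITS = 1e14  # one unrecoverable read error per 10^14 bits read, per the quoted spec

        bytes_per_error = URE_INTERVAL_BITS / 8
        print(f"Expected data read per URE: {bytes_per_error / 1e12:.1f} TB")   # ~12.5 TB
        print(f"...which is {bytes_per_error / 512:.2e} 512-byte sectors")      # ~2.4e10 sectors

        # Re-reading the six surviving 2 TB drives of the 7-drive RAID 5 example:
        bits_reread = 6 * 2e12 * 8
        print(f"Expected UREs during that rebuild: {bits_reread / URE_INTERVAL_BITS:.2f}")  # ~0.96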

        1. Anonymous Coward

          RAID 60 with near-instantaneous duplication to a warm stand-by or RAID 10+ operating a triple mirror (the 3rd mirror being used for the backup creation).

        2. TheVogon

          "I seriously hope that RAID reconstruction does NOT stop....as there is a very nonzero probability that the smoked sector is not even being used."

          Modern arrays don't generally try to rebuild sectors without any data on them. If the array does hit a hard error on rebuild, I wouldn't want it to just pretend everything is OK! In my experience arrays will go into a fault condition in this case and will indeed stop rebuilding...

        3. John Brown (no body) Silver badge

          "With one exception: Western Digital's Caviar Green, model WD20EADS, is spec'd at 10^15,"

          Oh, that's handy. That's what's in my home server.

  4. Anonymous Coward

    Some interesting links...

    (Reposting my comment from the first article. Any KCL staff/students should feel free to pass this info on to the College governance. IMO something this big looks like a strategic and management failure, not something that can be blamed on lowly tech staff.)

    It is amazing what you can find on Google.

    KCL spent £875,000 on kit in 2015 to expand their existing HP solution and to provide a duplicate at a second site:

    http://ted.europa.eu/udl?uri=TED:NOTICE:290801-2015:TEXT:EN:HTML

    http://ted.europa.eu/udl?uri=TED:NOTICE:28836-2015:TEXT:EN:HTML

    Quote:

    "The platforms are fit-for-purpose and serviceable, but are lacking an integrated backup storage for the converged system storage (3PAR StoreServ). "

    Quote:

    "King's College London is about to migrate much of its central ICT infrastructure to a new shared data centre, and the opportunity is being taken to extend DR and BC facilities wherever possible to provide additional resilience in support of the university's business. Maximum resilience and most cost-effective cover is provided by replicating as closely as possible the existing converged platform, which is designed and supported by Hewlett Packard, who have exclusive rights in the existing platform."

    What this means...

    These "Voluntary ex ante transparency notices" mean KCL directly awarded the contracts to HP and had to own up, after the fact, to failing to go to tender with this juicy chunk of taxpayers' cash (a legal requirement).

    Did the contract for the *original* HP system (the one that has failed) go to public tender, as demanded by law? If so, I can find no record of it in the usual places.

    As the link above shows, the contract for the business continuity and disaster recovery unit was awarded in January 2015. If they've had the hardware since Q1 2015, then why are they not able to fail over their most important student-facing and administrative systems to the other site? Perhaps these expensive systems have been sitting there uselessly depreciating because the IT management had other priorities...

    One such strange priority (compared to keeping vital systems up) may have been the preparation of the grand opening of a new service centre...in Cornwall (of all places!):

    http://www.kcl.ac.uk/newsevents/news/newsrecords/2015/August/KingsCollegeLondoncreatesskilledjobsatnewITCentreinCornwall.aspx

    https://twitter.com/kingsitsystems/status/634047199991726080

    (doubt they are smiling now)

    https://ksc.ac.uk/

    Bootnote:

    Seemingly that service centre is run as a private company:

    https://beta.companieshouse.gov.uk/company/02714181

    So...cheaper staff; no public sector employment contracts or pensions; management jollies to the seaside. Sweet! What could possibly go wrong...

    1. Anonymous Coward

      Re: Some interesting links...

      "So...cheaper staff; no public sector employment contracts or pensions; management jollies to the seaside. Sweet! What could possibly go wrong..."

      Apparently everything!

      As you say, where is the mirror failover system to these essential services?

    2. ecofeco Silver badge

      Re: Some interesting links...

      "King's College London is about to migrate much of its central ICT infrastructure to a new shared data centre, and the opportunity is being taken to extend DR and BC facilities wherever possible to provide additional resilience in support of the university's business. Maximum resilience and most cost-effective cover is provided by replicating as closely as possible the existing converged platform, which is designed and supported by Hewlett Packard, who have exclusive rights in the existing platform."

      Off site and outsourced to Giant Computer Company. SLA probably not thoroughly double checked.

      I think I see the problem...

    3. stevebp

      Re: Some interesting links...

      If KCL are using HP under the SUPC framework, they may not need to go to tender as long as they have held a "mini-competition".

      1. Anonymous Coward

        Re: Some interesting links...

        Yes, it is true that the framework agreements are pre-tendered for various classes of equipment and services. However, it is clear from the above links that the contracts for the newer HP systems were directly awarded without such a process. As for the original HP system, I suggest a Freedom of Information request would get to the bottom of that question. My intuition: there was no mini-competition. You can easily put in a FOIA request here: https://www.whatdotheyknow.com/

        Or maybe someone from KCL can give us that scoop? Did they follow law and procedure when buying the original HP kit containing the failed 3Par or did some IT director just go out and buy it?

  5. Lee D Silver badge

    "determine the root causes of the problem"

    Insufficient VM replicas.

    Oh! You mean why that particular storage failed?! I didn't.

    The whole point of virtualising your infrastructure like this is that you DO NOT have to rely on one storage, machine, datacentre or whatever else to stay up.

    Where are your independent replicas? Your warm-spare hypervisors? Your secondary cluster machines to move those VMs to?

    Hardware upgrade failing a RAID - yes, agreed, nasty.

    But you seem to have NO OTHER RAID, or indeed any practical hypervisor or storage replica, around - certainly not one with a vaguely recent copy of the data, it appears.

    What is the point of putting your stuff on VMs and then running them from one bunch of hardware? By now you should have been able to - at worst - restore your backup to anything capable of acting as a hypervisor (e.g. a machine from PC World if it really comes to it, but more reassuringly your backup server cluster?) and carry on as if nothing had happened. Alright, maybe an IP change here or a tweak there, or running off a local drive somewhere temporarily while your storage is being rebuilt.

    But, hell, being down for SEVEN WHOLE DAYS on virtualised infrastructure that includes your telephony and all kinds of other stuff? That's just ridiculous.

    1. Steve Knox

      "What is the point of putting your stuff on VMs and then running them from one bunch of hardware? "

      Not having to buy many bunches of hardware, each specced out to peak usage and hence idle 99% of the time.

      You are absolutely correct that virtualization allows for the recovery options you mentioned.

      However, you completely ignore the fact that virtualization was originally and still is most often sold not as a recovery solution but as a cost-cutting solution.

      For public entities required to jump through hoops for every penny spent, and then still criticized by moronic taxpayers for any expense with more than three digits to the left of the decimal, no matter how well-spent, the natural tendency is to cut costs rather than to optimize. The net result is what you see here.

      1. Anonymous Coward

        Why are you assuming they are underfunded?

        What makes you assume they are underfunded? There is no evidence they are and plenty of evidence to the contrary. They are Russell group and have a good proportion of foreign (i.e. full-cost paying) students. The links above and their swollen IT leadership org imply they have plenty of dough, just not spent effectively.

        It is far too easy to plead poverty in public sector IT without evidence. I know from experience the cash that is often wasted by IT senior management pursuing fads, empire building and hidden agendas, whilst neglecting solid technical foundations. Students and academics should demand that the KCL Council conduct an independent investigation into this fiasco. Maybe they could engage the BCS or similar.

      2. Lee D Silver badge

        "Not having to buy many bunches of hardware, each specced out to peak usage and hence idle 99% of the time."

        Nope. Then you'd just consolidate your servers' functions onto one server.

        You virtualise to remove the dependency on the underlying hardware, to provide portability, and to isolate each machine from the others also running on the same hardware.

        Otherwise you'd containerise, or consolidate, or something else.

        1. TRT Silver badge

          or you could use all that spare run time doing something useful... like bioinformatics or some other processor-hungry number crunching, which you sell on to your research departments at bargain-basement pricing, on the understanding that in the event of a failure your cycles are toast.

          1. Korev Silver badge

            Bioinformatics is an ideal workload as each programme almost always runs on a single node with no MPI.

            My work's backup VM farm runs test/dev servers normally. If a $BADTHING happens then the backed up prod servers would be spun up instead.

        2. This post has been deleted by its author

        3. This post has been deleted by its author

    2. Anonymous Coward

      Management Fail

      You can bet tech staff at KCL have been asking for off-site standby systems for years. If it hasn't been implemented by now, in my opinion, it can only be due to strategic management failure at the most senior levels of the IT department.

      With a CIO, five IT directors and fourteen other senior staff drawing nice salaries, they clearly have enough cash:

      http://www.kcl.ac.uk/it/about/it-org-chart-lt-july-2016-revised.pdf

      They obviously prefer structures, silos and processes to actually getting critical technical work done.

      1. Anonymous Coward

        Re: Management Fail or management cover up

        The questions that need to be asked are: who are they, where did they come from, and what are they doing? Clearly they have not understood the way the academic environment works or is funded. Why was the 3PAR system implemented over a NetApp system? Did implementation of the new data centre take priority over a solid backup system? Surely before implementing remote data centres you make sure the heart of your infrastructure is rock solid: storage, backups, network, servers, security...

  6. Anonymous Coward

    See it all the time

    As a mobile engineer I get called out on the odd server job to unmanned and unchecked server halls. What strikes me as strange is walking into a server hall and being greeted by an array of flashing amber or even red disks. Yet nobody takes note.

    Now, considering that these servers probably belong to a large number of customers and I am only going to one particular customer's server, unrelated to the other distressed servers, who is actually monitoring them?

    1. Nolveys

      Re: See it all the time

      What strikes me as strange is on walking into server halls to be greeted to an array of Flashing Amber or even red Disks.

      We were able to reduce maintenance costs by 72% by having all our technicians wear human-sized horse blinders during working hours.

    2. batfastad

      Re: See it all the time

      If it's anything like most managed rack DC/colo offerings I end up working with (against), it will often be sheer bureaucracy preventing the drives being replaced, not alerting. Bureaucracy of raising a ticket with global service desk, often using some weird Excel macro form, global service desk routing it to the right NOC, admin shipping a replacement drive, NOC finding the replacement drive in the loading bay after a bit, needing a new ticket to schedule the work, needing a new ticket for engineer access, replacing the drive, shipping the drive back, the drive was never shipped back, etc.

      When I'm at a company with a facility in or around London I always prefer actually going there. It's a train ticket and a taxi, it's an afternoon out of the office, but it's done in a day. Not two weeks of "to me, to you" with an overworked, underpaid on-site NOC.

      1. This post has been deleted by its author

      2. This post has been deleted by its author

    3. Anonymous Coward

      Re: See it all the time

      I've seen that too - in my own company's data center.

      And they don't keep spare disks on hand - they have to be ordered.

      They do tend to have rather spectacular system outages.

      1. TheVogon

        Re: See it all the time

        "And they don't keep spare disks on hand - they have to be ordered."

        That's OK if you have hot spares in your arrays. Otherwise you should really keep some onsite spare disks (and replace your stock via a warranty / maintenance claim each time one fails)...

    4. This post has been deleted by its author

    5. This post has been deleted by its author

  7. Steve Davies 3 Silver badge

    Meanwhile on a tropical island

    The salesman who sold this POS orders another piña colada.

    1. TheVogon

      Re: Meanwhile on a tropical island

      "The salesman who sold this POS"

      The salesman just sells what the customer / architect designs and orders.....

  8. Commswonk

    Something wrong here...

    "As is our normal practice, there will be a full review once normal services are restored. The review will confirm the root cause(s) of the problem," cameth the mea culpa.

    What? No "Lessons will be learned"?

    On top of which "As is our normal practice..." makes it sound as though this is not exactly a rare occurrence.

    1. Anonymous Coward

      Re: Something wrong here...

      "The review will confirm the root cause(s) of the problem"

      Sounds suspiciously as if someone had already made up their mind.

  9. Anonymous Coward

    Obviously, KCL has found a way to sort out the economic difficulties being visited on us through NeoLiberalism by the lizard people and I need more tinfoil. Where did I put that?

  10. Anonymous Coward

    Christ almighty, WTF

    Phones and everything, all running on the same mainframe VM host or SAN.

    What they need is a Redundant Array of Independent Servers...

    1. lukewarmdog

      Re: Christ almighty, WTF

      Closely followed by a redundant array of IT management...

      This isn't a small business; they can easily afford a system architect to design them something that HP (or whoever) can then provide. There's no way there should be a single box marked "all the software" in the network diagram.

  11. CyborgQueen

    It's bad.

    I'm a Postgraduate Research student at KCL. Most of my work is self-directed, and relies on these online infrastructures being up and running in order for me to fulfill my duties as a researcher. Not only can I not navigate through the website and click on important information about upcoming workshops, skills sessions and seminars, or contact administrators--I also cannot access any journal subscriptions, find the location of online resources, check materials out of the library, use the printing facilities, book rooms for upcoming conferences, liaise with administrative support, or submit documents for review via the internal grading services. I've asked for a tuition reimbursement for this week, but the admin was expectedly evasive, asking me to "be patient" and bear with them. This is an institutional failure on a massive scale--it's not IT's fault, but it certainly points to the poor signal-to-noise ratio in the administrative channels, which need a thorough overhaul--particularly where they connect with the highest levels of office at KCL.

    1. Anonymous Coward

      Re: It's bad.

      > it's not IT's fault

      So you blame management for demanding too much from IT, being tightwads with the IT budget, and relying on IT for EVERYTHING?

      Yup, sounds like academic IT... "just dumb code monkeys" who don't get no respect. (I read an article to that effect in the Chronicle of Higher Ed several years back whilst waiting to interview for my first, and last, academic IT job)

      1. CyborgQueen

        Re: It's bad.

        You're putting words into my mouth; I encourage you not to make such blatant assumptions. I suspect that IT knew this was a concerning issue and was capable of fixing it, but that the administration (of which IT is also part) failed to see the significance or spend the time directing the financial resources to IT to fix it. Perhaps it was one of those massive projects that everyone is always getting around to doing, but never actually gets done until a serious malfunction causes immediate attention and time to be spent on repairs. I have worked internally at KCL and other universities, and office politics can clog up progress more often than sheer ineptitude. I am really disappointed with the updates, support service, and communication that the highest levels of admin have provided during these outages--it speaks to a system fundamentally disconnected from the real-world, material needs of its staff and students.

        1. eionmac

          Re: It's bad.

          KCL is not alone in "office politics can clog up progress more often than sheer ineptitude. "

        2. Anonymous Coward

          Re: It's bad.

          > You're putting words into my mouth

          Just trying to clarify what you meant by that hand-wavy wall of text. Obviously you've spent way too much time in politically-correct university hell. :)

          > Perhaps it was one of those massive projects that everyone is always getting around to doing, but never actually gets done until a serious malfunction causes immediate attention and time to be spent on repairs.

          Yeah. That's always the way.

          Savvy IT managers and BOFHs will occasionally "let" unsupportable systems fail to force the issue before it gets this bad.

    2. Anonymous Coward

      Re: It's bad.

      I disagree. If this is a hardware failure in a single system at a single site (it appears it is) and they have had the resources and time to implement multi-system or multi-site redundancy (it appears they have), then the fault lies squarely with the KCL IT department leadership. It couldn't really be more clear-cut. Their twee Twitter account does not excuse this (perhaps they should have employed an apprentice storage admin instead of a social media coordinator?).

      If you try to spread the blame to an abstraction such as "administrative channels", no one will learn.

      1. TRT Silver badge

        Re: It's bad.

        It's not all bad. I understand they've got Skype for Business working again. So at least they can video conference with their Cornish support teams now.

    3. Guildencrantz

      Contractual issues re: It's bad.

      Wouldn't it be good if there were some organisation paid to represent students' interests in recovering from the college for breaches of contract? Oh wait - the students union is paid to do that. Will it? As if.

  12. Anonymous Coward

    Many years ago...

    ... When I was at university, I heard that Warwick Uni had a Harris H1000 supermini that was used by undergraduates. However, someone found that there was a serious bug in the OS JCL interpreter: basically, if you entered the command "SREG $A=$B+$C+$D" (i.e. add three variables together and store the result in a fourth variable) the OS crashed and needed a hard reboot.

    Apparently it was amazing how often this happened just before coursework had to be handed in!

    1. Natalie Gritpants

      Re: Many years ago...

      I was there and heard the same story, though never experienced it. Also, jobs were submitted by undergrads on wads of punched cards (two elastic bands were the equivalent of RAID 5, and the smart ones drew a diagonal line in felt tip on the edge of the stack for fast recovery). Anyway, the story was that you could stick several copies of the offending instruction in your stack and each one would cause a crash. Apparently there was no way to remove cards from the machine once they had got that far in.

      1. Anonymous Coward
        Pint

        Re: Many years ago...

        Ahh yes, punched cards. Never had to worry about them myself; my year was the first that used CRTs for all our work. However I do remember being in the computer centre once and seeing a 2nd year break down and cry when she dropped her course work and ended up with punched cards scattered across half the room.

        Those were the days <sigh>

        Beer logo, because I was a student then!

    2. Anonymous Coward

      Re: Many years ago...

      "So I tied an onion to my belt, which was the style at the time. Now, to take the ferry cost a nickel, and in those days, nickels had pictures of bumblebees on 'em..."

    3. This post has been deleted by its author

  13. TheVogon

    "Insufficient VM replicas."

    I would go for poor infrastructure design and / or failed / untested implementation as the most likely general cause. Followed by inadequate backups / DR facilities and procedures if it takes a week + to restore services....

  14. Dwarf

    Monitoring and testing

    It's all well and good having all these resilience technologies, but you still need to monitor the damn thing for the situation when a component fails, and actually do something about it.

    I'm wondering when the last test of their resilience and DR processes was as well, since that should have been the belt-and-braces proof that the design actually works.

    On the bright side though, given that it's educational, at least this will be a "learning experience" for them when they do the next set of purchasing, and they can get it right that time around.

  15. Adam 52 Silver badge

    [tongue-in-cheek-mode on]

    Remind me again why on-prem is better than the cloud? Something about reliability wasn't it? And an ability to shout at people to fix things quickly? And no risk of data loss?

    So that's a stupid and juvenile thing to say, but the comments here would have been full if they'd been an AWS/Azure/GCP customer.

    1. ecofeco Silver badge

      "King's College London is about to migrate much of its central ICT infrastructure to a new shared data centre, and the opportunity is being taken to extend DR and BC facilities wherever possible to provide additional resilience in support of the university's business. Maximum resilience and most cost-effective cover is provided by replicating as closely as possible the existing converged platform, which is designed and supported by Hewlett Packard, who have exclusive rights in the existing platform."

      (posted by another commentard above. see full post above for dates and links)

      I've highlighted the relevant bits.

      To be fair, I don't know if the new center is on site or not, but for sure it is covered by HP and not the college staff at this time.

      1. This post has been deleted by its author

  16. John Jc

    In a previous life, I designed storage / backup solutions for environments like this. Virtualised or purely physical wouldn't make a lot of difference if the whole infrastructure was based on a single, shared storage array. The SLA in place would determine the amount of availability / redundancy needed here. For the systems described, maybe an 8-hour outage would be "affordable"... which should easily be achieved by a traditional backup / restore mechanism.

    Of course, if the common storage failed completely, was out of support... and no replacement could be easily sourced... then I could believe a 7-day outage, awaiting hardware for the restore ;) This is a management issue, not an IT issue.

    Jc

  17. Anonymous South African Coward Bronze badge

    Who does a hardware upgrade on a production server without backing up critical data?

    Mind be boggling.

    1. Anonymous Coward

      I think we do. I'm not sure; it is unclear what either the IT suppliers or the guys in the branch offices are doing, and I have not seen any report fixed on serious paper by a hot roller about what's going on.

      Not King's College but Anon anyway.

      1. TRT Silver badge

        You shouldn't be upgrading a production server online anyway. You need an off-line mirror. You upgrade them alternately and give them both the same input and check their output is the same. Then, after your defined soaking-in period, you swap the primary and the secondary around. Rinse and repeat. OK, that's probably overkill, but it's amazing just how business critical some IT is nowadays.
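
        A minimal sketch of that "same input, compare output, then swap" soak check - the hostnames and endpoints here are made up for illustration, and the actual promotion (DNS or VIP swap) plus anything involving writes is deliberately left out:

        import urllib.request

        PRIMARY = "http://primary.example.internal"   # hypothetical current production box
        STANDBY = "http://standby.example.internal"   # hypothetical freshly upgraded mirror
        READ_ONLY_PATHS = ["/health", "/roombooking/today"]  # sample idempotent reads

        def fetch(base, path):
            # Return status code and body so both are part of the comparison.
            with urllib.request.urlopen(base + path, timeout=10) as resp:
                return resp.status, resp.read()

        mismatches = 0
        for path in READ_ONLY_PATHS:
            if fetch(PRIMARY, path) != fetch(STANDBY, path):
                mismatches += 1
                print("MISMATCH:", path)

        # Only think about swapping primary and secondary once a full soak period
        # has passed with no mismatches (and nothing nasty in the logs).
        print("soak check:", "clean" if mismatches == 0 else f"{mismatches} mismatch(es)")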

        1. This post has been deleted by its author

  18. Frank N. Stein

    Sys Admins

    Sys Admins aren't monitoring. If they were, and someone had a plan in place to restore services in a similar situation, this wouldn't have happened. Will manager heads roll? Probably not.

  19. Anonymous Coward

    Sys Admins

    Sys Admins aren't monitoring. If they were, and someone had a plan in place to restore services in a similar situation, this wouldn't have happened. Will manager heads roll? Probably not.

    1. Anonymous Coward

      Re: Sys Admins

      I haven't looked at the KCL org chart (linked somewhere above) but if it's like other academic institutions I know, job titles are irrelevant, there's never more than one competent sysadmin on staff at any given time, and they never stay more than a year. So nobody even knows of the existence of 50-90% of the systems.

      Our campus-wide VOIP was on that? Who knew!!? bwahahaha

      1. TRT Silver badge

        Re: Sys Admins

        80% of the telephony isn't VoIP, it's POTS. But even that failed. Something to do with a computerised exchange, you know, mapping extension numbers to switched-circuit hardware cards.

  20. Anonymous Coward

    My Laptop

    I run my company's VMs on my laptop so they'll miss me when I take it with me on vacation.

    Bonus Plus: Everything just starts working when I come back.

    Bonus Bonus Plus: Just one disk failure short of being as redundant as the KC systems.

  21. ecofeco Silver badge

    Yes, I'm going to say it...

    So how's that cloud thing working for you?

    1. TRT Silver badge

      Re: Yes, I'm going to say it...

      Office365 has been relatively unaffected. But Office365 can't and shouldn't do everything that a university needs, administration-wise, in order to function.

  22. heyrick Silver badge

    "Among the many services affected are telephony, internal websites, shared drives, room booking, payroll, student records, purchasing, catering services bookings and more."

    Isn't this a bit...putting all the eggs in one basket?

  23. Anonymous Coward

    I'm curious.

    When you build a RAID setup you buy your drives at the same time (same make, model and firmware), so would that not increase the possibility of two drives failing at the same time, because you try to get them as close to identical as possible?

    1. John Brown (no body) Silver badge

      Yes. Which is why you generally try to source your drives from at least a couple of different suppliers so at least they are from different batches. Ideally, you will speculatively replace disks with new ones, bought later, at various times to spread the age/firmware versions around a bit so multiple failures are less likely.

      Of course, when you buy in a "managed solution" none of this will happen. A company like HP will simply deliver a "system" at the cheapest possible cost to them, so odds are all the HDDs are from the same batch. Ditto for the rest of the hardware modules. On the other hand, you might get a weird mish-mash of versions/models/firmware built from whatever was to hand in the warehouse at the time, which might be a whole other world of hurt if there are, for example, multiple RAID cards with different firmware versions.

      1. This post has been deleted by its author

  24. Colin Bull 1

    I'm curious

    Back in the day, best practice would be to mirror all drives on different shelves, with drives from different batches and a different power supply for each shelf.

    The definitive guide to RAID is at http://www.baarf.dk/BAARF/BAARF2.html

    About 25 years ago I witnessed a classic RAID 5, 5 disk failure. Every day for a week one drive went down and was replaced the next day. At the end of the week the recovery had not quite caught up. Spectacular.

    1. John Brown (no body) Silver badge

      Re: I'm curious

      "About 25 years ago I witnessed a classic RAID 5, 5 disk failure. Every day for a week one drive went down and was replaced the next day. At the end of the week the recovery had not quite caught up. Spectacular."

      Worst one I ever came across was a small office, single server, RAID card failed. The factory-installed firmware version on the new RAID card was incapable of booting on that server model without the RAID card BIOS being updated. There were no other MCA slot machines in the office. Took the card back, got the BIOS updated elsewhere, returned to site, and someone had since "helpfully" pulled the drives from the system without powering it down. Now, I can't be sure if the disks were recoverable before that, but they definitely weren't afterwards.

      Luckily for me, only the hardware repair was my problem. IIRC they spent the following week rebuilding and restoring what would likely have been just a slower, degraded array while it auto-rebuilt.

      1. This post has been deleted by its author

  25. Scaffa

    From what I understand, the ideal (and seldom practiced) method of procurement is to source disks from different batches.

    The mean time to failure on identically manufactured disks, given that in a RAID they're typically all going to be spinning for the same hours, will be very similar across a batch.

    1. Pascal Monett Silver badge

      Definitely agree. Unfortunately, that also means you need to stagger the acquisitions, which means planning ahead, which is becoming something of an exotic science these days.

      When I decided to go for a home NAS, I first spent four months buying one 3TB drive every month, to make as sure as I could that not all the disks would be from the same batch. In the last month, I bought the 4th disk and the Synology station that would make them all useful.

      I do not see that most management types would be able to have that much patience.

    2. Anonymous Coward

      If you look at drive failure stats you'll see that there is no such behavior. Buying from separate batches is not necessary whatsoever, and this kind of wisdom from armchair commenters is not very helpful. Leave the job to the professionals, who base decisions on knowledge and fact.

  26. Anonymous Coward

    Colleges, HUH! What are they good for?

    Last time I was at College/Uni (it was a shared course), my laptop picked up a virus that wiped 70% of my data (and porn), despite up-to-date AV etc.

    My tutors pooh-poohed the excuse, but did grudgingly give me a few extra days to turn in my coursework.

    Sadly, it was over a month before they could read it, as a few days later the entire college system crashed from the same virus, and took the rest of the term to get back up again.

    On another note - in response to something said in an earlier post: management don't value ANY technical staff, of any discipline; more than once I was overruled due to complaints by CLEANING STAFF, and stopped from doing my job properly.

    Comms cables were fed in with MV power cables because they wouldn't pay for dedicated trunking, fire breaks weren't installed in vertical trunking to save time and money, and shop floor staff were allowed to plug 3 kW heaters into multisockets that were stuck into outlets only meant for powering the IT equipment - and not backed by cabling that could handle a 3 kW load safely.

    All this cost cutting saved the company a few thousand per year, but cost them millions as the IT systems went down regularly when a 3 kW heater overloaded the system (3 electrical fires in the 18 months I was there, as some enterprising soul bypassed the breaker to stop it tripping out), MV interference corrupted data streams and damaged backup attempts, and a fire in the lift shaft burnt through 7 floors of comms cables, destroying 22 MILES of cabling inside the shaft.

    Oh yeah, putting the cables in the lift shaft also broke the law.

    All this was, OF COURSE, the fault of the technical department, and when I refused to sign off on some insanely unsafe office equipment ordered by one of the chinless wonders on the top floor, they fired me.

  27. rmstock

    is this a Mayoral issue as well?

    What does former London Mayor Boris Johnson have to say on this? Could this be one of these sneaky Jihadist ICT cyber attacks under the auspices of new London Mayor Sadiq Khan? Remember it has been announced that attacks on infrastructure, like Stuxnet in Iran, will be retaliated against. The Pentagon has also been rumored to be commencing a cyber offensive against China and Russia. Watching all this, it's nothing less than to be expected that a new job ad asks for 'detrimental' sysadmins. Just my two pennies here.

  28. Anonymous Coward

    oh dear

    The only BC we have here is a calculator!

  29. ROIdude

    RAID

    Amazingly, some HCI manufacturers are not only utilizing RAID, they're even bragging about it:

    ---

    8/25/16, 10:40 AM

    #HyperConverged 380 has RAID which means no system downtime! #HPEDare2Compare #VMWorld hpe.to/6015BNqfl pic.twitter.com/F9bKUfHI8r

  30. This post has been deleted by its author
