KCL out(r)age continues: Two weeks TITSUP, two weeks to go

We are two weeks into the outage issues at King's College London, and a communiqué from IT has warned staff that those issues won't be completely resolved for at least a fortnight more. As of this morning, KCL's internal website and its software distribution system are still down, while library services and (vitally) payroll …

  1. Rich 11

    And then we were wondering whether it was redundant to talk about redundancy in a redundancy system.

    This comment is redundant.

  2. Anonymous Coward

    "The greatest lesson to be learned will surely be about failure tolerance, responsibility for which must fall to an executive manager."

    Who will pass it down to someone further down the management chain who actually said "We need this" with regard to external backups or cross-site duplication but was told "It's not in the budget", only to get sacked for bringing it up.

    My money is on a failed Peoplesoft upgrade due to the inclusion of payroll.

    1. Erroneous Howard

      Of course that NEVER happens in IT.......EVER. Like complaining about something for 7 years as being well past end of life, then getting asked why you never warned anybody once it finally goes wrong.

      1. Stumpy

        That's why it's always worth remembering to get all of that stuff in writing, so you have an audit trail when the shit really does hit the fan.

      2. Anonymous South African Coward Bronze badge

        Happened to me in my previous job. Manglement finding new excuses as to why they can't get new kit etc etc etc... AND there is an audit trail (emails etc).

        1. TRT Silver badge

          Ah... so...

          there might be a few more redundancies coming? Or sackings at least.

          1. This post has been deleted by its author

          2. Mark 85

            Re: Ah... so...

            Probably down the food chain... maybe near the bottom of it for starters. Then the Manager who decided NOT to set up a better backup system will come up with the brilliant idea to outsource and offshore to <cough> "keep this from ever happening again" <cough>. Then will follow a lot of redundancies.

          3. ZiggyZiggy

            Re: Ah... so...

            No chance - it's higher education - mention the R or S word and you'll have the unions in full action!

            (On a positive note, the best days I had working in HE IT were when the unions went on strike... a very quiet and peaceful day in the office with a selection of my finest colleagues - good days!)

            1. Destroy All Monsters Silver badge
              Trollface

              Re: Ah... so...

              Serial incompetence?

              I think this can be fixed with the appropriate terminator on the RS-232 cable.

              1. Peter Gathercole Silver badge

                Re: Ah... so...

                RS232 did not need termination.

                It's possible that if you were attempting to drive it further than the stated maximum distances, you could find matching the impedance would help, but within spec, it was just point-to-point without termination.

            2. Anonymous Coward

              Re: Ah... so...

              You seem to be confusing now with the 1970s. Plenty of universities are doing this. I for one work in one of them.

    2. Anonymous Coward

      "My money is on a failed Peoplesoft upgrade "

      Mine is on serial incompetence...

      1. Anonymous Coward

        Serial Incompetence.

        I think if the storage kit is as old as it seems, then it is more likely to be parallel incompetence.

      2. TRT Silver badge

        Mine is on serial incompetence...

        Serial's too slow. I think this was parallel incompetence.

  3. TRT Silver badge

    On the bright side...

    A year or two down the line and KCL will have the most failure resilient, attack hardened, double-quadruply redundant IT system in any UK university. OK, there might not be any IT budget left...

    1. Anonymous Coward

      Re: On the bright side...

      Nope. With the dysfunctional management in typical higher education, there will be a lot of meetings about it, some strongly worded emails and then a very watered down plan which doesn't really prevent anything, and probably introduces more issues.

      * AC as I worked in HE. Went through a similar outage. Had a similar outcome.

      1. Anonymous Coward

        Re: On the bright side...

        "With the dysfunctional management in typical higher education, there will be alot of meetings about it, some strongly worded emails and then a very watered down plan which doesn't really prevent anything, and probably introduces more issues."

        Some of this is down to management, some of this is down to the way purchasing in HE, and the public sector in general, works. Because you're spending public money you have to prove that you are getting value for money, generally by going through a tender process using an approved framework for the tender format and then sending it to a list of University group consortium approved suppliers.

        The problem with this process, despite what the people in charge may say, is that 9 times out of 10 it results in you purchasing the cheapest possible solution.

        This happens because the qualitative differences between the suppliers and their tender responses are often not well reflected in the paperwork, and therefore management will rule options out based on them appearing to be the same as other solutions but at a greater cost.

        Ruling out a supplier, or a particular solution, because of your personal experience from another workplace or their reputation among people you've spoken to is not allowed/frowned upon.

        Between how the tender process rules are supposed to work, and overzealous attempts by management to adhere to them in order to avoid future problems with audits, the tender process rarely results in the best value purchases.

        AC, because I work in HE.

        1. Alan Brown Silver badge

          Re: On the bright side...

          "The problem with this process is, despite what the people in charge may say, is that 9 times out of 10 it results in you purchasing the cheapest possible solution"

          But not necessarily at the cheapest price.

          The point about tendering processes and all the other guff is not to get the best possible deal but so that you can show you've followed procedures when the auditors show up. Saving money has little to do with it.

          Just remember when you're flagging a serious problem, to keep your audit trail somewhere where it can't mysteriously disappear.

          Not AC. I work in HE (but not at KCL)

        2. Anonymous Coward

          Re: On the bright side...

          a) What makes you think they followed the proper tendering processes for the HP kit? This is a key question that KCL IT senior management need to answer IMO.

          b) I disagree with the thrust of your comments. If you know what you're doing, you can get good value and fit-for-purpose equipment using the public sector tendering or framework processes. You just have to set your evaluation criteria appropriately so that up-front cost is only one proportionate factor to be judged. Of course this requires competence and good domain-specific knowledge. If you don't have these, please get someone in who does, before you waste yet more taxpayers' money.
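
          For what it's worth, "set your evaluation criteria appropriately" boils down to something like the sketch below - a hypothetical weighted scoring matrix (weights and bids are made up purely for illustration) in which up-front cost is only one factor among several:

          ```python
          # Illustrative only: hypothetical weighted tender scoring where up-front
          # cost is one proportionate factor rather than the deciding one.
          CRITERIA = {                 # weights must sum to 1.0
              "up_front_cost": 0.25,
              "whole_life_cost": 0.20,
              "technical_fit": 0.25,
              "support_and_sla": 0.20,
              "references": 0.10,
          }

          def score_bid(scores):
              """Combine per-criterion scores (0-10) into a single weighted total."""
              return sum(CRITERIA[name] * scores.get(name, 0.0) for name in CRITERIA)

          # A dearer bid can still win if it is better on everything else.
          cheap_bid = {"up_front_cost": 9, "whole_life_cost": 5, "technical_fit": 4,
                       "support_and_sla": 3, "references": 5}
          dear_bid = {"up_front_cost": 5, "whole_life_cost": 8, "technical_fit": 9,
                      "support_and_sla": 9, "references": 8}
          print(round(score_bid(cheap_bid), 2), round(score_bid(dear_bid), 2))  # dearer bid scores higher
          ```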

          1. Anonymous Coward

            Re: On the bright side...

            "If you know what you're doing, you can get good value and fit-for-purpose equipment using the public sector tendering or framework processes"

            "If you know what you're doing" is the key point there.

            As is the case in many places, not just public sector, the management who are making the ultimate spending decision often don't know what they're doing. They either don't understand the paperwork and management issues because they're techies not bureaucrats or they don't understand the technical issues because they're bureaucrats not techies.

            It's extremely rare to have someone making the decision who understands both sides. Unfortunately, most of the time they're bureaucrats who don't understand the technical issues, which wouldn't be so bad, but some of them compound this by hearing what their technical staff have to say and then completely ignoring it, going with the lowest price option through ignorance or fear of auditors or both.

            1. Anonymous Coward

              Re: On the bright side...

              "As is the case in many places, not just public sector, the management who are making the ultimate spending decision often don't know what they're doing."

              Then it sounds like we agree. You can't blame the public sector procurement processes, you have to blame the people willing to waste taxpayers' money. I don't know about you, but when I don't understand something I ask an expert. If my decision is worth ££££ of taxpayers' money, I'll even pay a small amount up-front for expert advice to save a large amount later.

            2. Alan Brown Silver badge

              Re: On the bright side...

              > "If you know what you're doing" is the key point there.

              And aren't overruled.

              I rejected every tender in a project about 15 years ago because none of them would work (the person issuing the tender had slashed about 25k off the figure and the people we wanted to tender all walked away).

              I got overruled. Stuff got purchased and installed. Worked ok at first (first 6 months) but then as the load cranked up it started breaking spectacularly. Vendor (HP) abandoned us, etc.

              It wasn't a particularly happy period - and I copped a lot of the flak for stuff breaking, even though I'd said from the outset that it wasn't up to the projected loads we were planning.

              1. Anonymous Coward

                Re: On the bright side...

                I was involved at a low level in a procurement project for an invoice processing system a few years ago, when I was a minion in a finance department.

                The system had been specified to deal with x number of invoices a year - stated in the spec by the system integrators as being 24x7. Unfortunately, we didn't work 24 hours a day, we worked 7x5. I pointed out this significant discrepancy - it's about 20% of the capacity - and was promptly told to go away as I knew nothing.

                Needless to say, the system was seriously overloaded from the first day, (we're talking scanning something at 9am and maybe it made it through four hours later) and I left soon after it was installed.

                Last I heard, four years on, it was still not coping...

        3. Alan Brown Silver badge

          Re: On the bright side...

          "Ruling out a supplier, or a particular solution, because of your personal experience from another workplace or their reputation among people you've spoken to is not allowed/frowned upon."

          Funny.....

          We use that kind of data a lot and have specific weighting for it.

          Which is why Suse (amongst others) will never be allowed to cross our threshold again.

        4. PNGuinn
          FAIL

          Re: On the bright side...

          "a tender process using an approved framework for the tender format and then sending it to a list of University group consortium approved suppliers."

          I.e. the "Usual Suspects"??

          1. TRT Silver badge

            Re: On the bright side...

            There will be a number of CIOs one could recruit if you don't mind ex-GDS.

          2. Anonymous Coward

            Re: On the bright side...

            Absolutely... the same names came up all the time - and if you went out of your way to avoid them, they would just partner with your preferred supplier so you ended up dealing with them anyway!

            I think they usually saw HE as easy pickings/a dumping ground for anything they had left at the end of the year. The usual discount on list price for HE was/is around 60-70%, so whilst the sales pitch was particularly polished, the enthusiasm for being helpful vanished after contract signing.

            AC - ex HE, and whilst it was a particularly frustrating industry to work in, the pace of life did appeal so I might consider going back :)

        5. Anonymous Coward

          Re: On the bright side...

          The issues don't just depend on the tender process but on the managers finding appropriate solutions to the problems faced... evaluating systems and asking the right questions, i.e. justifying why you have chosen the system you did. That prevents, to some extent, purchasing systems that were recommended by buddies and backhanders.

        6. hmv

          Re: On the bright side...

          If you end up with something cheap and nasty, then you haven't written the tender specification properly. Yes you have to justify why you're not using the cheapest response, but you _can_ do that (I've done it).

      2. Anonymous Coward

        Re: On the bright side...

        Nonsense, they had reliability and redundancy back in the day.

        Quick summary. When I was at KCL (2004-2011), the IT system was maintained by the postgrads, along with the IT department and the odd comp sci prof (when they were not too busy).

        It ran a mixture of FreeBSD, Solaris and Linux machines, and it worked for years without a hitch, with leaving grads handing the reins to the new grads, which would then hack on the system further and keep it going.

        It wasn't the prettiest (The webmail interface didn't use javascript and had no whizz bang features) but damn it worked and worked and worked.

        During my third year there, some PHB in the new management decided we needed to scrap everything, and contracted a third party company to replace the entire infrastructure with a Windows-based system. AD, Exchange, SharePoint, the whole shebang. It was the "New way forward", "everything integrated", etc....

        Out went the grads (who were getting useful real world experience in infra work and programming), the professors could no longer bend the system to their will, and 95% of the IT department was made redundant. All the Unix boxen were replaced with shiny new Windows servers, with a third party contract that managed the system. Needless to say the IT costs must have gone up a bomb as well, but I am sure the cash went to the "right pockets".

        Also, in the last year or so I was there, the new system was plagued by instability and outages, causing much disruption and frustration. I ended up chatting to my professors through non uni systems (like gmail). It was a complete cock up, but upper management reassured everyone that it was just initial teething issues.

        By the time I left the system was still having outages, but less so (it would go 2-3 weeks without a problem). I never understood their rationale. You have an entire department of pretty damn skilled comp sci students chomping at the bit to put their skills to use, helpful faculty and an experienced IT department, and what do you do?

        You hand it over to a third party, and lock everyone at the uni out of the system, turning them into plain third party users, and introduce a black box system that is neither as good nor as reliable as the old one (but it was flashy, with ajax and all that crap on the OWS webmail), all handled by a company who presumably made more money the more support tickets had to be dealt with.

        Doesn't surprise me that eventually the whole mess collapsed, you can only balance plates so long before it all crashes down.

        1. Anonymous Coward

          Re: On the bright side...

          Nice story, but like most stories of a lost golden age, it is mostly mythical. They were maintaining a bunch of obsolete systems that most of the College didn't use and all the real I.T. was being done in the academic Schools (now Faculties) and departments. You are also conveniently missing out the bit where the (crap) central Unix-based email system went down for a week... That final outrage caused by your beloved setup going bang was what led *directly* to the (also crap) outsourced Microsoft Exchange system and hilariously disastrous Global Desktop (SunRay thin clients streaming Windows sessions from the North of England - yum!).

          But that's all ancient history... The on-site Windows servers, just before this screwed HP system was put in, were on VMware vSphere and NetApp storage I believe and were pretty solid. The only outages I can remember were due to long power cuts (no money for off-site replica due to money being wasted on aforementioned outsourced failures one assumes). Not sure why they replaced that seemingly reliable setup with this HP system. It would have made more sense to spend the money on off-site or cloud replicas.

          Also worth noting that the (Microsoft!) Office 365 system has remained available throughout this current fiasco. The current management cabal can't take any credit for that though, as the decision to move to O365 for email was taken before they started, by the previous mob (a good decision by them for all their many faults).

          1. Alan Brown Silver badge

            Re: On the bright side...

            > Also worth noting that the (Microsoft!) Office 365 system has remained available throughout this current fiasco.

            o365 WEB interfaces have been amazingly reliable.

            o365 SMTP/IMAP is rather less so. Some of us prefer to handle our mail in a mail client.

          2. Anonymous Coward

            Re: On the bright side...

            The original system may not have been perfect. Let's be real, you are never going to get five nines of reliability in a university environment. It isn't how the system works, nor do they have that kind of budget or the need to pay for it.

            However, how is replacing a system which had one outage in years of use with a system that had outages consistently, roughly every month, an improvement? Especially when the previous system was (mostly) open source and could have been improved upon. I mean, the current infra is having a two-week-long outage so far, and unlike the aforementioned "one week" outage from before, this has actually hit the public news. Should we label it "crap" and throw it all away and get something new again?

            I did know that there was faculty-level infrastructure outside the core Unix system, but I also knew that it all tied into the core Unix system itself (which is why the outage you mentioned, when it happened, affected everyone). Do you know what caused that week-long outage? I don't remember if I ever found out.

            Good to see you are more up to date with the situation, and that things have moved on from that awful crap they foisted on us back then.

            Still, for the amount of time and money they have thrown at the infrastructure, I think they would have done better if they had just put the same money, time and effort into updating the existing Unix-based infrastructure. Its design was solid, but it needed some dedicated resources for updating all the software to the latest versions. Not to mention that at the end they would still have an open system that would allow modifications, upgrades and tweaks to anything you wanted, and would have been a good educational tool for the students to boot.

            1. Anonymous Coward

              Re: On the bright side...

              "Especially when the previous system was (mostly) open source, and it could have been improved upon"

              It was improved upon. They ditched it for something more usable and commercially supported...

              1. Destroy All Monsters Silver badge
                Headmaster

                Re: On the bright side...

                IT system was maintained by the postgrads, along with the IT department and the odd comp sci prof (when they were not too busy).

                For me that spells "utter disaster area" (first hand experience of disappearing mountpoints on Solaris ... where is the backup ... owwww!)

                The postgrads are barely able to sysop or even program their way out of a paper bag (do they acquire knowledge by osmosis from the humming infrastructure?) and are busy working on their PhDs or teaching duties (as they should be), and the profs are far away from the nuts and bolts in scienceland (as they should be, as that's what they are getting paid for) and are wont to take utterly stupid decisions based on too little experience and perceived relative status.

                Better to have a dedicated section of people that deal with the machine crap on a daily basis but that liaise continually with the people for whom they are running everything. You might even have postgrads working on both teams, why not.

                The decision to go into Windows is orthogonal to this, everyone is free to open their veins while taking a warm bath after all.

          3. hoola Silver badge

            Re: On the bright side...

            Whilst HP feature here, this could be substituted by any other major vendor. HE in particular suffers from moronic tendering processes, unreasonable requirements for the tender and then the equally unreasonable actual requirements.

            One of the huge problems in these types of institutions is what an earlier post states: the IT techies have evolved into the business as postgraduates. Yes, they may be very clever and can make the tech do all sorts of funky stuff, but often that is way off-piste when it comes to reality. The theory may be good, but then it is taken to the limit for some very sound theoretical reason and is totally rubbish when a problem strikes and everything comes home to roost.

            Management will be following the current trend of "if you are a manager you can manage anything" and have limited understanding and no control of the high-tech bull that is being fed to them.

            The KCL outage is extreme, but similar things will have happened in these types of places all over the UK and been buried deep in the recycling.

          4. Ian 55

            Re: On the bright side...

            "hilariously disastrous Global Desktop (SunRay thin clients streaming Windows sessions from the North of England - yum!)"

            What is it with people who think that this sort of thing is a good idea? I hope your lot were able to turn the animations off in Microsoft Office - mine weren't, so every pixel change involved in opening every menu went over the not very fast network.

            The best bit was that there was an ISDN link as fallback for when the broadband failed. You could barely start up a PC in the eight hours.

      3. Anonymous Coward

        Re: On the bright side...

        (AC as also in HE...)

        And if anything like my institution, you actually have to kill someone to get fired...

        Proper DR/failover tests just aren't done enough here, due to the perceived risk of running the test and a complete lack of central governance to tell the respective owners of the thousand and one systems that exist that they will be doing it, like it or not.

    2. Vince

      Re: On the bright side...

      Nah, my experience is that people don't learn lessons and just assume lightning will never strike twice.

  4. Doctor Syntax Silver badge

    A few simple lessons:

    RAID is not backup.

    Incremental backups only bridge the gaps between full backups.

    Fully backup before you do even the most routine maintenance to your RAID.

    A good sysadmin/DBA is paranoid. Technical competence comes a close second but paranoia comes first.
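
    To spell out the second lesson: restoring to a point in time needs the last full backup before it plus an unbroken chain of every incremental taken since. A minimal sketch (hypothetical structures, purely illustrative):

    ```python
    # "Incrementals only bridge the gaps": lose one link after the last full
    # backup and the chain - and the restore - breaks.
    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class Backup:
        taken: datetime
        kind: str  # "full" or "incremental"

    def restore_chain(backups, target):
        """Return the backups needed, in order, to restore to `target`."""
        usable = sorted((b for b in backups if b.taken <= target), key=lambda b: b.taken)
        fulls = [b for b in usable if b.kind == "full"]
        if not fulls:
            raise RuntimeError("No full backup before target - nothing to restore from")
        last_full = fulls[-1]
        incrementals = [b for b in usable
                        if b.kind == "incremental" and b.taken > last_full.taken]
        return [last_full] + incrementals
    ```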

    1. Dr Who

      You've totally hit the nail on the head. I'd upvote multiple times if I could. This failure has nothing to do with technology and everything to do with system administration practices.

      And yes RAID is not backup. Nor is cross-site synchronisation.

    2. Anonymous South African Coward Bronze badge

      "Fully backup before you do even the most routine maintenance to your RAID.

      A good sysadmin/DBA is paranoid. Technical competence comes a close second but paranoia comes first."

      As a rule I would always perform a FULL backup of the server should the RAID show a degraded status and need a disk replacement. Never failed me.

      But I heard tales of woe from more than one person that they simply slapped a new HDD into their RAID system, and it borked itself halfway through the rebuild process.

      1. DNTP

        Re: slap in a new HD, watch it bork

        This was me with our main bioinformatics machine four months ago. Fortunately it was under a full service contract and was getting regular backups. All of the following items were performed by the vendor's service engineers or by me under their direction.

        First service action: Replace degraded disk, attempt rebuild, borks.

        Second action: Source and replace RAID controller, all disks, goes titsup.

        Third action: Replace entire server under warranty.

        I am an ex-IT generalist, currently a geneticist and technical scientist, I do not think of myself as ignorant about computer systems, but as far as I know RAIDs are sinister and mysterious forces of nature that cannot be understood by typical mortals.

        1. Alan Brown Silver badge

          Re: slap in a new HD, watch it bork

          "Source and replace RAID controller"

          Raid controllers are best avoided these days.

          Seriously, you're better off with software RAID or ZFS - at least that way

          1: You know you can shove the drives in anything and they'll still be readable.

          2: The processing power on a raid card is peanuts compared to that of an average desktop CPU these days (raid doesn't even make them get warm), let alone a server.

          Your more general mistake was not having someone herding the boxes who knows how to configure for actual requirements in the first place and kick them if they misbehave. The moment the vendors get you doing trained monkey jobs, it's time to get someone onsite.
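
          On the software RAID point, the "notice the first dead disk" part can be as simple as the rough sketch below, which assumes Linux md arrays and /proc/mdstat and is no substitute for proper monitoring:

          ```python
          # Rough sketch: flag any Linux md array whose member list shows a hole,
          # e.g. "[U_]" instead of "[UU]", so a dead disk doesn't sit unnoticed.
          import re
          import sys

          def degraded_arrays(mdstat_text):
              """Return names of md arrays whose status shows a missing member."""
              bad, current = [], None
              for line in mdstat_text.splitlines():
                  m = re.match(r"^(md\d+)\s*:", line)
                  if m:
                      current = m.group(1)
                  # a healthy two-disk mirror reads "[UU]"; "_" marks a failed slot
                  if current and re.search(r"\[[U_]*_[U_]*\]", line):
                      bad.append(current)
                      current = None
              return bad

          if __name__ == "__main__":
              with open("/proc/mdstat") as f:
                  degraded = degraded_arrays(f.read())
              if degraded:
                  print("DEGRADED:", ", ".join(degraded))
                  sys.exit(1)
              print("all md arrays healthy")
          ```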

          1. TheVogon

            Re: slap in a new HD, watch it bork

            "Raid controllers are best avoided these days.

            Seriously, you're better off with software RAID or ZFS - at least that way"

            Just LOL I bet you don't work in mission critical. NO ONE uses software RAID these days in production.

            "You know you can shove the drives in anything and they'll still be readable."

            Well, no you can't - you now have to shove them in something running a specific OS that understands the disk config. And what about protecting your boot volume?

            "The processing power on a raid card is peanuts compared to that of an average desktop CPU these days (raid doesn't even make them get warm), let alone a server."

            It really isn't. Large disk array controllers often have things like several multicore Xeons in them these days...

            1. This post has been deleted by its author

            2. Anonymous Coward

              Re: slap in a new HD, watch it bork

              "Just LOL I bet you don't work in mission critical. NO ONE uses software RAID these days in production."

              Yes they do. It's not the 1990s any more when CPU cycles were precious and hardware RAID had some real benefit. I've never had a software RAID array fail that wasn't straightforward to recover.

              "Well, no you can't - you now have to shove them in something running a specific OS that understands the disk config. And what about protecting your boot volume?"

              Software RAID doesn't require special drivers so it's dead easy to fix on alternate hardware should the need arise. OS version doesn't matter, at least with the 'nix systems I deal with. You don't need hardware RAID to reliably protect a boot volume either. You also get more flexibility to choose drives you want and aren't locked to a single vendor, and you can fine tune the array right down to how you need it to run.

              I say this as someone who is a sysadmin with 20 years' experience and who once reconstructed a failed RAID 5 array and recovered the entire contents (minus a single corrupted file) by hand. (From a double drive failure on a Compaq SmartArray.)

      2. Alan Brown Silver badge

        "But I heard tales of woe from more than one person that they simply slapped in a new HDD into their RAID system, and it borked itself halfway through the rebuild process."

        Those would be the same people running multi-TB RAID5 setups. Even RAID6 is risky once you get past about 10TB.

        I've had to pick up the pieces from both cases - and for systems that I don't back up because they were in someone's "private server estate" that "doesn't have critical data on it and doesn't need backups".

        The howls when you tell them that their 30-100TB of "critical data" is gone are something to behold, as are the ones when they don't howl at first but then find out that rebuilding the data from sources across the Internet will take several months.

        "Oh, you can't reformat this as a raid6+hotspare (or raidz3), we absolute need the full capacity of all the drives"

        Really? After just losing a raid5 array because you didn't notice one drive had died and another went toes up?

        Welcome to UK HE computing. As with most things British: Good idea. Bad design. Lousy execution. Inability to learn lessons.

        1. TheVogon

          "Even RAID6 is risky once you get past about 10TB."

          Nope. Risk for RAID 6 is still close to zero even in the worst case of, say, 14+2 using multi-terabyte SATA disks. See http://pics.aboutnetapp.ru/hds_raid_5_and_raid_6_risk_of_data_loss_probability.jpg

      3. Doctor Syntax Silver badge

        "But I heard tales of woe from more than one person that they simply slapped in a new HDD into their RAID system, and it borked itself halfway through the rebuild process."

        I also heard of someone losing a disk from a mirrored system during a system move. They put in a new disk and re-silvered their mirror. From the faulty disk.

        1. Mayhem

          I also heard of someone losing a disk from a mirrored system during a system move. They put in a new disk and re-silvered their mirror. From the faulty disk.

          Yep, my previous workplace had a minion do that after I left. Then he overwrote the copy. Then he broke the offsite version. *Then* he confessed to having had some problems.

      4. gryphon

        RAID Firmware

        Thankfully I never got burned by it but I saw a frightening message in an HP update notification for Smart Array once.

        It boiled down to the following if I remember correctly:-

        You have a RAID1 mirror pair and a disk fails

        You replace the failed disk, the mirror rebuilds and gives a completion message and all appears to be right with the world.

        Except it never actually completed properly and is lying to you: you are actually running on one disk, and who knows what it is actually mirroring.

        I really pity the poor admin that got burned by that one before they released the update. Thanks so much HP.

    3. Alan Brown Silver badge

      "Incremental backups only bridge the gaps between full backups."

      Also, there are many different kinds of backup software and only a handful of them are any good - many of the commercial packages costing up to 30k a shot are complete piles of fetid dingo kidneys.

      One of the major hurdles with getting backup systems in place is the cost of the hardware and the entrenched attitude in HE that everything can be done with a Windows desktop PC.

      Unsurprisingly this doesn't usually change until AFTER someone loses critical data (I've had a number of "I need this system restored" demands for things that we don't backup because the demander was unwilling to pay for the service).

      Somewhat more surprisingly, there's often strong resistance to forking out for the appropriate kit/software even AFTER such events. "You got it fixed, why do you need all this expensive stuff now?"

    4. Nate Amsden

      RAID absolutely is a backup against failed drives. Stupid tired argument. You can say full backups are no good unless they are off site and, depending on region, on another tectonic plate. Then you can go farther and say it's not a backup unless it's been fully restored and validated (and even farther to say on a regular basis).

      All depends what you are protecting against.

      In this case it appears as if a high end 3PAR is the likely cause, based on what I've read on what the VS3 is. I had a vaguely similar event happen (no system upgrade involved) to me 6 years ago on a T400. Took a week to recover fully (end users not impacted, mainly because a ton of data was on an Exanet cluster and they went bust earlier in that year; the 3PAR was back up in 6 hrs). Support was awesome though and made me a more loyal customer as a result.

      Certainly an unfortunate situation with lots of data loss, and backups didn't cover everything ("Can you restore X?" "Sorry, you never requested X be backed up, and everyone knew we had a targeted backup strategy due to budget and time constraints"). Fortunately most of the lost data was non-critical.

      [Only other similar issue i was involved with was a double controller failure on an emc array which ran another company's oracle dbs. 35 hrs of downtime for that then 1 to 3 outages a month for the next year or so to recover corruption that oracle encountered along the way. I wasn't responsible for that array.]

      After all of that, and getting a new revised budget for DR (unrelated to the incident), the VP decided to can the DR project because he needed the money for some other project he had massively under-budgeted for. I left before the last clusterfuck could get off the ground; from what I heard it had 18 months of outages and pain.

      1. Doctor Syntax Silver badge

        "RAID absolutely is a backup against failed drives."

        It's not a backup against flood, fire, theft, accidental deletion, ransomware...

        It's not a backup.

        1. Nate Amsden

          I see you stopped reading my post pretty quick, as I covered all of those other factors. You have to ask yourself what are you protecting against? Then solve for that.

          Very often you will find when you want to protect everything against every situation the organization will not shell out the $$ to cover even a fraction of what you may want to protect (whether it is $$ for hardware or $$ for staffing to do it, test it etc).

          1. Doctor Syntax Silver badge

            "You have to ask yourself what are you protecting against? Then solve for that."

            Complete loss of data storage and processing power. How it's caused is immaterial.

          2. Richard 12 Silver badge

            RAID doesn't really protect against failed drives. It gives you breathing room to get the backup recovery process in order.

            RAID-5 theoretically protects against a single failed drive.

            However, in the real world this isn't true, as a second drive will probably fail during the rebuild - they are a similar age and have had a similar amount of usage, and the RAID rebuild is likely to be the most intensive work they've ever done.

            Assume it rebuilds ok - you've dodged a bullet, but what happens when the next drive fails? All except one are now very old...

            So when the drives are large, the rebuild takes so long that the probability of a second failure during rebuild quickly gets above 50%, and a third is rather probable.

            So backup. Backup means you can port to a new RAID, and you have a way of recovering when the rebuild fails.
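
            A back-of-the-envelope way to see why the rebuild itself is the dangerous bit: the chance of hitting at least one unrecoverable read error (URE) while re-reading every surviving drive in full. The figures below are illustrative assumptions only - real drives often beat their spec sheet:

            ```python
            # Sketch: probability of >=1 URE during a RAID-5 rebuild, using a
            # Poisson approximation and an assumed consumer-class URE spec.
            import math

            def p_rebuild_hits_ure(disks, disk_tb, ure_per_bit=1e-14):
                """P(at least one URE) while reading the (disks - 1) surviving drives."""
                bits_read = (disks - 1) * disk_tb * 1e12 * 8      # TB -> bits
                return 1.0 - math.exp(-ure_per_bit * bits_read)   # 1 - P(clean pass)

            # e.g. eight 4 TB drives at the 1-in-1e14 consumer spec:
            print(f"{p_rebuild_hits_ure(8, 4.0):.0%}")            # ~89% with these assumptions
            ```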

        2. Nate Amsden

          A lot of the posts here imply to me "system admins" in general working with small data sets in fairly simple environments. It's easy to protect a small amount of data; obviously as complexity and data sets go up, the amount of work required to back things up right goes up as well.

          An extreme example to be sure, but I recall going to an AT&T disaster recovery conference in Seattle probably about 2009 time frame. At the massive scale AT&T was at they still had stuff to learn.

          Specifically they covered two scenarios that bit them after the 9/11/2001 attacks in NYC.

          First was that they had never planned for a scenario where all flights in the U.S. were grounded. They had the people and equipment but could not get them to the locations in a timely manner.

          Second was that when they set up a new site to handle AT&T network traffic after the WTC was destroyed, probably a few blocks away, they had big signs up advertising that they were AT&T there, and they realized maybe it wasn't a good idea to advertise the fact that they were there so publicly.

          One company I was at had a "disaster recovery plan" which they knew wouldn't work from day 1 (as in 100% sure there was no way in hell it would ever work), but they signed the contract with the vendor anyway, just to show their customer base that "yes, they had a DR plan" (the part where "does it actually work" fortunately wasn't part of the contracts). They paid the DR vendor something like $30k/mo to keep 200 or so servers in semi trucks on call in the event they would need them -- knowing full well they had no place to connect those servers if they had to make that call.

          A lot of the comments here incorrectly portray the process of true data protection as something that is pretty simple. It is not, and if you can't understand that then well I don't have more time to try to explain.

    5. David Harper 1

      "Fully backup before you do even the most routine maintenance to your RAID.

      A good sysadmin/DBA is paranoid. Technical competence comes a close second but paranoia comes first."

      As a paranoid DBA, I approve this message.

  5. Anonymous Coward

    I would just like to know what software update would cause total data corruption like that? Was it an HP 3par one? If so, how come a multitude of customers aren't affected?

    1. Korev Silver badge

      I know of a university (outside the UK) recovering hundreds of TB after HSM corrupted some of their files whilst migrating from an old system to a new one. As far as I know they did everything correctly, just their storage ****ed up. I'm told $VENDOR is being very proactive in helping them get their data back from tape!

  6. Nick Ryan Silver badge

    I'm rather puzzled... if there's something that was evidently this critical then why was it stored (from the description) on a RAID-1 array consisting of just two disks? Eeek.

    Also, why wasn't the backup reverted to rather faster than two weeks? Yes, reverting to old data is a pain particularly when followed by a likely merge but it should be less painful than two weeks downtime. Also, why wasn't the backup period somewhat shorter - if the data is this critical then the question should have been asked "how much (time of) data are you prepared to lose?"

    1. TRT Silver badge

      I think the various virtual machines were backed up a machine at a time according to a strategy particular to that machine and its function. So, for example, and the strategy may well in actuality be different, HR, finance and student records systems would be backed up daily in their entirety, whereas the shared drives and old, semi-retired webpages on an image of an ancient server would be backed up incrementally with a full image every month or so.

      I also think the RAID was many more disks than just two. Probably 24 disks arranged as a whole load of RAID 5 volumes, but with the volumes spread out across the whole 24, which would effectively make it a very bad system indeed. I can't believe anyone would do that, TBH. Not being au fait with the system itself, I can only guess.

      1. Anonymous Coward

        As correctly stated, the PS3 works in a different manner. Having read the original mathematical paper that underlies the system, I could never understand why they used RAID discs in the first place... redundancy over redundancy... The other issue lies with the "array manager's database" when it wrongly assumes that a volume does not exist and it does not appear in the tables: "a ghost volume".

        This could easily cause the system to corrupt itself on any upgrade. You'll probably say this is not possible, but I have seen it happen.

        As for backups I am a firm believer in having backups first before splashing out on new redundant data centres...

        Why restore from a two-week-old backup? Because either the problem was not realised, or the backup capacity was highly under-resourced by managers who did not understand how to manage backups...

        Should heads roll? Heads should only roll if there is negligence... this includes not listening to your employees when they highlight problems...

    2. Anonymous Coward

      More to the point, if this was so critical where were the failover mirror servers? If they didn't have mirror servers in a different server room/building why not?

      At this point in the game it is time for the top IT management to be given their P45s and permanently shown the door.

    3. Alan Brown Silver badge

      "Also, why wasn't the backup reverted to rather faster than two weeks? "

      Maybe because the backups were corrupted.

      I know of one case from 1999 (a telephone exchange) where what was being backed up turned out to be random garbage and the organisation had to revert to 18-month old backups, then replay every single transaction that had gone through since that point (logged separately).

      It took 3 months to get fixed - and in that case, being a telephone exchange with 20,000 people on it, all sorts of odd shit was going on with people's phone service (starting with 3 days of "dead lines")

      1. Anonymous Coward

        Speaking as the Backup Manager for one of the UK's larger universities, and given the description of the fault (sounds like a HW-level cross-site replication issue), the problem would have been immediately spotted, so backup corruption (more specifically data corruption) is unlikely to be the issue. There's never enough money for backups; we still have an issue at our institution, where backups are still being done to tape for massive amounts of research data. Data growth is greatly exceeding technological growth in raw speed (throughput), so cloud-based disk snapshotting is the only feasible route as we don't have a capable archiving system available. Costs are skyrocketing; for the past 3 years I've seen an annual data growth rate of 100%. The numbers don't add up.
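
        A crude sketch of the "numbers don't add up" arithmetic - data doubling yearly against backup throughput that improves far more slowly. Every figure below is a made-up assumption, purely to show the divergence:

        ```python
        # Sketch: hours needed for one full backup pass as data doubles yearly
        # while throughput grows ~20% a year (both growth figures are assumptions).
        def full_backup_hours(years, data_tb=500.0, tput_mb_s=300.0,
                              data_growth=2.0, tput_growth=1.2):
            data_bytes = data_tb * 1e12 * data_growth ** years
            bytes_per_hour = tput_mb_s * 1e6 * tput_growth ** years * 3600
            return data_bytes / bytes_per_hour

        for year in range(6):
            print(f"year {year}: {full_backup_hours(year):6.0f} h for a full pass")
        # year 0 is already ~460 h; by year 5 it is roughly 13x worse - hence
        # snapshots/replication rather than ever more full passes to tape.
        ```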

    4. ctx189

      The article was badly worded, but it obviously wasn't a 2-disk mirror.

      "Also, why wasn't the backup reverted to rather faster than two weeks" likely running on incremental backups for far too long + a potentially narrow download pipe.

      ""Also, why wasn't the backup period somewhat shorter - if the data is this critical then the question should have been asked "how much (time of) data are you prepared to lose?" - backups should at the very least have a 24 hour restore point, I've not read anything contrary to that.

  7. Stevie

    Bah!

    Redundant to speak of a redundant array of inexpensive disks array at any rate.

    1. Z80

      Re: Bah!

      RAID array - what's wrong with that? Everyone where I work (Department of Redundancy Department) calls them that.

      1. Destroy All Monsters Silver badge

        Re: Bah!

        I thought it was "Redundant Array of Independent Disks" because inexpensive stuff? BAH!!

        1. Stevie

          Re: I thought it was "Redundant Array of Independent Disks"

          Ah, my textbooks must be older than yours. When I first saw the tech it was "inexpensive disks".

          IT, synonymous with RETCON since, well, forever.

  8. Andy The Hat Silver badge

    Upgrade ...

    I did the routine upgrade that I'd tested before I applied it and the raid melted down ... so I restored from the backup I made before I did the upgrade ...doh!

  9. Anonymous Coward

    Optional

    All these people carrying out a FULL backup.... I take it you also tested that the restore process worked?

    I've said it before, I'll say it again. No one needs a backup plan. You start with a recovery plan and then develop processes accordingly.
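
    In that spirit, even a spot check beats nothing: pull a random sample of files back out of the backup copy and compare checksums against the live originals. A minimal sketch (hypothetical paths; a real restore test should go to clean kit):

    ```python
    # Sketch: verify a restore by checksumming a random sample of restored files
    # against the live tree. Failures mean the "backup" isn't a recovery plan.
    import hashlib
    import random
    from pathlib import Path

    def sha256(path):
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def spot_check_restore(live_root, restored_root, sample=20):
        """Return relative paths whose restored copy is missing or differs."""
        live_root, restored_root = Path(live_root), Path(restored_root)
        candidates = [p for p in live_root.rglob("*") if p.is_file()]
        failures = []
        for p in random.sample(candidates, min(sample, len(candidates))):
            rel = p.relative_to(live_root)
            restored = restored_root / rel
            if not restored.exists() or sha256(restored) != sha256(p):
                failures.append(rel)
        return failures
    ```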

    1. Anonymous Coward

      Re: Optional

      Around here, the "awww, shits" challenge the restore process regularly. "But I only, accidentally, deleted it last month!" That's the non-technical users. When it's us techies, the "awww, fucks" tend to be a higher challenge in restores.

    2. Doctor Syntax Silver badge

      Re: Optional

      "All these people carrying out a FULL backup.... I take it you also tested that the restore process worked?"

      Being a paranoid sysadmin, this goes with the territory.

    3. ctx189

      Re: Optional

      You should start with a "service recovery plan"; think of the large number of servers underpinning a service. Having a decent CMDB is essential in making sense of the complexity, but it can only capture dependencies.

  10. Robert Carnegie Silver badge

    "Sponsored: How do you pick the right cloud for the right job?"

  11. Harry the Bastard

    lost opportunity for the subhead...

    Now IT bricks it: A 'Series of Unfortunate Events'

    1. John G Imrie

      Re: lost opportunity for the subhead...

      Where's Lemony Snicket when you need him?

      1. Destroy All Monsters Silver badge
        Windows

        Re: lost opportunity for the subhead...

        Did anyone think of "Magnolia" and the final frog rain?

        Bad Karma Release, I say!

  12. Lee D Silver badge

    Overthinking

    Meanwhile, the problem could have been solved with a single off-site backup that wasn't reliant on fanciful de-dupe or whatever technology.

    You know, like a copy of the VHDs of the virtual machines. Flung onto a cheap NAS, or - god forbid - a tape.

    Even if it was just once-a-week, and not "The" backup method, you could have been up and running for most VMs in a matter of hours in such a circumstance.
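
    Something along these lines would do it - a minimal sketch assuming the disk images sit in one directory and the cheap NAS is mounted locally (paths are hypothetical, and ideally the VMs are quiesced or snapshotted first):

    ```python
    # Sketch: once a week, copy every VM disk image into a dated folder on the
    # NAS and prune old sets. Crude, but it restores in hours, not weeks.
    import shutil
    from datetime import date
    from pathlib import Path

    VM_DIR = Path("/var/lib/libvirt/images")    # assumption: where the images live
    NAS_DIR = Path("/mnt/cheap-nas/vm-copies")  # assumption: NFS/SMB mount of the NAS
    KEEP_SETS = 4                               # keep roughly a month of weekly copies

    def weekly_copy():
        dest = NAS_DIR / date.today().isoformat()
        dest.mkdir(parents=True, exist_ok=True)
        for image in VM_DIR.iterdir():
            if image.is_file():
                shutil.copy2(image, dest / image.name)
        # prune the oldest weekly sets beyond the retention window
        sets = sorted(p for p in NAS_DIR.iterdir() if p.is_dir())
        for old in sets[:-KEEP_SETS]:
            shutil.rmtree(old)
        return dest
    ```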

    But by being overly-complex, your recovery process is now an absolute nightmare involving stitching arrays back together and hoping your backups weren't corrupted and so on.

    Two weeks is head-roll time, as far as I'm concerned. Sure, let them fix it. But be planning their replacement staff and prepping the pink slips.

    1. Dabooka

      Re: Overthinking

      Certainly head roll time, but it's not two weeks - it'll be nearer to four!

      Shocking

      1. Nate Amsden

        Re: Overthinking

        Time will tell for them if they prioritize this failure as something that can be protected against in the future or if the cost and risks involved mean at the end of the day perhaps they do little or nothing. (I've seen both myself)

    2. Nate Amsden

      Re: Overthinking

      Speaking as someone who has been involved in similar situations on multiple occasions at multiple organizations, more often than not the fault lies as high as the CFO or CEO who signs off on the budgets.

      Good luck giving them the pink slips?

      1. Anonymous Coward

        Re: Overthinking

        Hang on - "Speaking as someone who has been involved in similar situations on multiple occasions at multiple organizations" - ummm, are you sure that it wasn't something to do with you? :p

        AC - because you might be big and scary

    3. ctx189

      Re: Overthinking

      "You know, like a copy of the VHDs of the virtual machines. Flung onto a cheap NAS, or - god forbid - a tape." - I doubt it's the VM environment that's affected, it'll be the SAN storage. I doubt you have any experience dealing with PB size backups, virtualisation products don't work well with "big data", the data is always stored independently via CIFS or NFS.

  13. ecofeco Silver badge
    FAIL

    Setting the new standard for FAIL

    Congratulations KCL. ----------------------------->>>>

    1. Doctor Syntax Silver badge

      Re: Setting the new standard for FAIL

      No. They did that years ago if the legend I heard in the '60s was true. According to that, after the war we were offered the whole of Somerset House, not just the east wing. And the offer was turned down.

      1. Anonymous Coward

        Re: Setting the new standard for FAIL

        KCL also blithely disposed of the *very first* computer worth the name:

        "On Mr Gravatt applying to the Board of Works, it was stated that the Difference Engine itself had been placed in the Kensington Museum because the authorities of King's College had declined receiving it,"

        -- Charles Babbage: http://digitalriffs.blogspot.co.uk/2014/01/charles-babbage-and-kings-college-london.html

        And thus King's College London's IT endeavours have been cursed ever since.

        1. Doctor Syntax Silver badge

          Re: Setting the new standard for FAIL

          "KCL also blithely disposed of the *very first* computer worth the name"

          What can you expect? It was digital & KCL was into analogue - the Wheatstone bridge.

  14. Destroy All Monsters Silver badge
    Paris Hilton

    Do not understand

    "mitigate against this happening again"

    while

    "major system outage and some data loss due to a series of extremely unlikely events"

    Does not compute.

    Does King's College have a department of logic?

  15. Milton

    Undercarriages and why we love KISS

    Not for the first time, and prickled particularly by El Reg's humorous reference to the redundancy of redundancies, I wonder whether yet again an IT system has been snared not by incompetence or laziness or misapplied intentions, but by unnecessary complexity.

    Airliner undercarriages are required to be extremely robust, reliable, fail-safe and are engineered to large performance tolerances. They have to handle not just a nice smooth touchdown but also the unexpectedly violent hard landing of last-second wind-shear, or a piloting error. And yet somewhere in the world they go wrong every month or so.

    Their design teaches us that you can keep adding bits—couplers, moment arms, bearings, springs, shock absorbers, kitchen sink—each component specifically intended to improve performance, reliability or comfort, but that eventually the complexity of the whole thing actually begins to reduce its reliability. There are too many things to go wrong.

    It's just a thought and not a great analogy, but for me this all adds to the notion that sometimes we are adding too much to our systems in the name of safety and security and reliability, when we should step back and consider subtracting instead. A particular problem for institutions is the infestation of corporates' salescreatures and the blandishments of marketurds, who have so many ways to persuade you that your system is unsafe, and a mountain of lies to convince you to buy their product ... you know the one, it's in the back of the cupboard, still shrink-wrapped under the corpses of spiders from 2007.

    You really cannot beat the authority of an experienced technically-savvy manager with a cynical eye, skin long since thickened against the lies of sales and marketing, to stand a good way back and look at your system. The Axe of Pragmatism can often make things cheaper, better and simpler.

    1. Destroy All Monsters Silver badge
      Pint

      Re: Undercarriages and why we love KISS

      Beer for this man or woman (unless Muslim/Jewish, in which case maybe sweetened tea is the beverage of choice)

    2. ctx189

      Re: Undercarriages and why we love KISS

      Should we stop abstracting IT then? If we followed your logic then ... where would we be? Should programmers still write in Assembler? Civilisation advances because we do not and cannot expect a single person to know everything. Is there a line that should be drawn? ... Probably not; it hasn't worked so far. If this were a flaw in the underlying technology (not ruling that out .... not enough information yet) then many thousands of institutions would have been hit. Part of the reason we have independent backups is to guard ourselves from a screwup in a specific piece of technology.
