And then we were wondering whether it was redundant to talk about redundancy in a redundancy system.
This comment is redundant.
We are two weeks into the outage issues at King's College London, and a communiqué from IT has warned staff that those issues won't be completely resolved for at least a fortnight more. As of this morning, KCL's internal website and its software distribution system are still down, while library services and (vitally) payroll …
"The greatest lesson to be learned will surely be about failure tolerance, responsibility for which must fall to an executive manager."
Who will pass it down to someone further down the management chain, someone who actually said "We need this" with regard to external backups or cross-site duplication but was told "It's not in the budget", only to get sacked for bringing it up.
My money is on a failed Peoplesoft upgrade due to the inclusion of payroll.
Probably down the food chain... maybe near the bottom of it for starters. Then the manager who decided NOT to set up a better backup system will come up with the brilliant idea to outsource and offshore to <cough> "keep this from ever happening again". <cough> Then will follow a lot of redundancies.
No chance - it's higher education - mention the R or S word and you'll have the unions in full action!
(On a positive note, the best days I had working in HE IT were when the unions went on strike... a very quiet and peaceful day in the office with a selection of my finest colleagues - good days!)
Nope. With the dysfunctional management in typical higher education, there will be a lot of meetings about it, some strongly worded emails and then a very watered down plan which doesn't really prevent anything, and probably introduces more issues.
* AC as I worked in HE. Went through a similar outage. Had a similar outcome.
"With the dysfunctional management in typical higher education, there will be a lot of meetings about it, some strongly worded emails and then a very watered down plan which doesn't really prevent anything, and probably introduces more issues."
Some of this is down to management, some of this is down to the way purchasing in HE, and public sector in general works. Because you're spending public money you have to prove that you are getting value for money, generally by going through a tender process using an approved framework for the tender format and then sending it to a list of University group consortium approved suppliers.
The problem with this process, despite what the people in charge may say, is that 9 times out of 10 it results in you purchasing the cheapest possible solution.
This happens because the qualitative differences between the suppliers and their tender responses are often not well reflected in the paperwork, and therefore management will rule options out based on them appearing to be the same as other solutions but for a greater cost.
Ruling out a supplier, or a particular solution, because of your personal experience from another workplace or their reputation among people you've spoken to is not allowed/frowned upon.
Between how the tender process rules are supposed to work, and overzealous attempts by management to adhere to them in order to avoid future problems with audits, the tender process rarely results in the best value purchases.
AC, because I work in HE.
"The problem with this process, despite what the people in charge may say, is that 9 times out of 10 it results in you purchasing the cheapest possible solution"
But not necessarily at the cheapest price.
The point about tendering processes and all the other guff is not to get the best possible deal but so that you can show you've followed procedures when the auditors show up. Saving money has little to do with it.
Just remember when you're flagging a serious problem, to keep your audit trail somewhere where it can't mysteriously disappear.
Not AC. I work in HE (but not at KCL)
a) What makes you think they followed the proper tendering processes for the HP kit? This is a key question that KCL IT senior management need to answer IMO.
b) I disagree with the thrust of your comments. If you know what you're doing, you can get good value and fit-for-purpose equipment using the public sector tendering or framework processes. You just have to set your evaluation criteria appropriately so that up-front cost is only one proportionate factor to be judged. Of course this requires competence and good domain-specific knowledge. If you don't have these, please get someone in who does, before you waste yet more taxpayers' money.
"If you know what you're doing, you can get good value and fit-for-purpose equipment using the public sector tendering or framework processes"
"If you know what you're doing" is the key point there.
As is the case in many places, not just public sector, the management who are making the ultimate spending decision often don't know what they're doing. They either don't understand the paperwork and management issues because they're techies not bureaucrats or they don't understand the technical issues because they're bureaucrats not techies.
It's extremely rare to have someone making the decision who understands both sides and unfortunately most of the time it's that they're bureaucrats who don't understand the technical issues, which wouldn't be so bad but some of them compound this by hearing what their technical staff have to say and then completely ignoring it and going with the lowest price option, either through ignorance or fear of auditors or both.
"As is the case in many places, not just public sector, the management who are making the ultimate spending decision often don't know what they're doing."
Then it sounds like we agree. You can't blame the public sector procurement processes, you have to blame the people willing to waste taxpayers' money. I don't know about you, but when I don't understand something I ask an expert. If my decision is worth ££££ of taxpayers' money, I'll even pay a small amount up-front for expert advice to save a large amount later.
> "If you know what you're doing" is the key point there.
And aren't overruled.
I rejected every tender in a project about 15 years ago because none of them would work (the person issuing the tender had slashed about 25k off the figure and the people we wanted to tender all walked away.)
I got overruled. Stuff got purchased and installed. Worked ok at first (first 6 months) but then as the load cranked up it started breaking spectacularly. Vendor (HP) abandoned us, etc.
It wasn't a particularly happy period - and I copped a lot of the flak for stuff breaking, even though I'd said from the outset that it wasn't up to the projected loads we were planning.
I was involved at a low level in a procurement project for an invoice processing system a few years ago, when I was a minion in a finance department.
The system had been specified to deal with x number of invoices a year - stated in the spec by the system integrators as being 24x7. Unfortunately, we didn't work 24 hours a day, we worked 7x5. I pointed out this significant discrepancy - it's about 20% of the capacity - and was promptly told to go away as I knew nothing.
Needless to say, the system was seriously overloaded from the first day, (we're talking scanning something at 9am and maybe it made it through four hours later) and I left soon after it was installed.
Last I heard, four years on, it was still not coping...
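For anyone who wants to check the 24x7 versus 7x5 arithmetic in the post above, here is a rough sketch. It reads "7x5" as 7 hours a day, 5 days a week, which is the reading that gives the quoted "about 20%". The annual invoice volume is entirely made up, since the post only says "x number".

```python
# Rough check of the 24x7 vs 7x5 discrepancy described above.
# Assumption: "7x5" means 7 hours a day, 5 days a week.
HOURS_PER_WEEK_24X7 = 24 * 7  # 168 hours, what the spec assumed
HOURS_PER_WEEK_7X5 = 7 * 5    # 35 hours, what the department actually worked


def working_fraction(actual_hours: float, assumed_hours: float) -> float:
    """Share of the assumed operating time that is really available."""
    return actual_hours / assumed_hours


def required_hourly_rate(annual_volume: int, hours_per_week: float, weeks: int = 52) -> float:
    """Invoices per hour needed to clear annual_volume in the available hours."""
    return annual_volume / (hours_per_week * weeks)


ANNUAL_INVOICES = 500_000  # hypothetical volume, for illustration only

fraction = working_fraction(HOURS_PER_WEEK_7X5, HOURS_PER_WEEK_24X7)
rate_spec = required_hourly_rate(ANNUAL_INVOICES, HOURS_PER_WEEK_24X7)
rate_real = required_hourly_rate(ANNUAL_INVOICES, HOURS_PER_WEEK_7X5)

print(f"Working time as a share of 24x7: {fraction:.1%}")
print(f"Throughput needed on the 24x7 assumption: {rate_spec:.1f} invoices/hour")
print(f"Throughput needed in real working hours: {rate_real:.1f} invoices/hour")
print(f"Shortfall factor: {rate_real / rate_spec:.1f}x")
```

Whatever the real volume was, the ratio does not depend on it: a system sized for 168 hours a week has to run nearly five times faster to clear the same work in 35.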
"Ruling out a supplier, or a particular solution, because of your personal experience from another workplace or their reputation among people you've spoken to is not allowed/frowned upon."
Funny.....
We use that kind of data a lot and have specific weighting for it.
Which is why Suse (amongst others) will never be allowed to cross our threshold again.
Absolutely... the same names came up all the time - and if you went out of your way to avoid them, they would just partner with your preferred supplier so you ended up dealing with them anyway!
I think they usually saw HE as easy pickings/a dumping ground for anything they had left at the end of the year. The usual discount on list price for HE was/is around 60-70%, so whilst the sales pitch was particularly polished, the enthusiasm for being helpful vanished after contract signing.
AC - ex HE, and whilst it was a particularly frustrating industry to work in, the pace of life did appeal so I might consider going back :)
The issues don't just depend on the tender process but on the managers finding appropriate solutions to the problems faced... evaluating systems and asking the right questions, i.e. justifying why you have chosen the system you did. That prevents, to some extent, purchasing systems that were recommended by buddies and backhanders.
Nonsense, they had reliability and redundancy back in the day.
Quick summary. When I was at KCL (2004-2011), the IT system was maintained by the postgrads, along with the IT department and the odd comp sci prof (when they were not too busy).
It ran a mixture of FreeBSD, Solaris and Linux machines, and it worked for years without a hitch, with leaving grads handing the reins to the new grads, which would then hack on the system further and keep it going.
It wasn't the prettiest (The webmail interface didn't use javascript and had no whizz bang features) but damn it worked and worked and worked.
During my third year there, some PHB in the new management decided we needed to scrap everything, and contracted a third party company to replace the entire infrastructure with a Windows based system. AD, Exchange, sharepoint, the whole shebang. It was the "New way forward", "everything integrated" etc....
Out went the grads (who were getting useful real world experience in infra work and programming), the professors could no longer bend the system to their will, and 95% of the IT department was made redundant. All the Unix boxen were replaced with shiny new Windows servers, with a third party contract that managed the system. Needless to say the IT costs must have gone up a bomb as well, but I am sure the cash went to the "right pockets".
Also, in the last year or so I was there, the new system was plagued by instability and outages, causing much disruption and frustration. I ended up chatting to my professors through non uni systems (like gmail). It was a complete cock up, but upper management reassured everyone that it was just initial teething issues.
By the time I left the system was still having outages, but less so (it would go 2-3 weeks without a problem). I never understood their rationale. You have an entire department of pretty damn skilled comp sci students chomping at the bit to put their skills to use, helpful faculty and an experienced IT department, and what do you do?
You hand it over to a third party, and lock everyone at the uni out of the system, turning them into plain third party users, and introduce a black box system that is neither as good nor as reliable as the old one (but it was flashy, with ajax and all that crap on the OWS webmail), all handled by a company who presumably made more money the more support tickets had to be dealt with.
Doesn't surprise me that eventually the whole mess collapsed, you can only balance plates so long before it all crashes down.
Nice story, but like most stories of a lost golden age, it is mostly mythical. They were maintaining a bunch of obsolete systems that most of the College didn't use and all the real I.T. was being done in the academic Schools (now Faculties) and departments. You are also conveniently missing out the bit where the (crap) central Unix-based email system went down for a week... That final outrage caused by your beloved setup going bang was what led *directly* to the (also crap) outsourced Microsoft Exchange system and hilariously disastrous Global Desktop (SunRay thin clients streaming Windows sessions from the North of England - yum!).
But that's all ancient history... The on-site Windows servers, just before this screwed HP system was put in, were on VMware vSphere and NetApp storage I believe and were pretty solid. The only outages I can remember were due to long power cuts (no money for off-site replica due to money being wasted on aforementioned outsourced failures one assumes). Not sure why they replaced that seemingly reliable setup with this HP system. It would have made more sense to spend the money on off-site or cloud replicas.
Also worth noting that the (Microsoft!) Office 365 system has remained available throughout this current fiasco. The current management cabal can't take any credit for that though, as the decision to move to O365 for email was taken before they started by the previous mob (a good decision by them for all their many faults).
The original system may not have been perfect. Let's be real, you are never going to get 5 nines reliability in a university environment. It isn't how the system works, nor do they have that kind of budget or need to pay for it.
However how is replacing a system which had one outage in years of use with a system that had outages consistently roughly every month an improvement? Especially when the previous system was (mostly) open source, and it could have been improved upon. I mean, the current infra is having a 2 week long outage so far, and unlike the mentioned "one week" outage from before, this has actually hit the public news. Should we label it "crap" and throw it all away and get something new again?
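To put numbers on "5 nines" and on the one-week and two-week outages being compared here, a back-of-envelope sketch. It assumes a plain 365-day year and counts a single outage against a whole year of service, which is a simplification.

```python
# Back-of-envelope availability arithmetic.
# Assumptions: a 365-day year, and one outage measured against one year.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability: float) -> float:
    """Downtime per year permitted by a given availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_after_outage(outage_days: float, period_days: float = 365.0) -> float:
    """Availability over the period if the only downtime is one outage."""
    return 1.0 - outage_days / period_days


five_nines = allowed_downtime_minutes(0.99999)
one_week = availability_after_outage(7)
two_weeks = availability_after_outage(14)

print(f"Five nines allows {five_nines:.2f} minutes of downtime a year")
print(f"One week down in a year:  {one_week:.2%} available")
print(f"Two weeks down in a year: {two_weeks:.2%} available")
```

Five nines works out at a little over five minutes a year, so neither the old week-long outage nor the current fortnight is anywhere near it. Both land somewhere between one and two nines.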
I did know that there was faculty-level infrastructure outside the core Unix system, but I also knew that it all tied into the core Unix system itself (which is why the outage you mentioned, when it happened, affected everyone). Do you know what caused that week-long outage? I don't remember if I ever found out.
Good to see you are more up to date with the situation, and that things have moved on from that awful crap they foisted on us back then.
Still, for the amount of time and money they have thrown at the infrastructure, I still think they would have done better if they had just put the same money, time and effort into updating the existing Unix-based infrastructure. Its design was solid, but it needed some dedicated resources for updating all the software to the latest versions. Not to mention that at the end they would still have an open system that would allow modifications, upgrades and tweaks to anything you wanted, and would have been a good educational tool for the students to boot.
"IT system was maintained by the postgrads, along with the IT department and the odd comp sci prof (when they were not too busy)."
For me that spells "utter disaster area" (first hand experience of disappearing mountpoints on Solaris ... where is the backup ... owwww!)
The postgrads are barely able to sysop or even program their way out of a paper bag (do they acquire knowledge by osmosis with humming infrastructure?) and are busy working on their PhDs or teaching duties (as they should be), and the profs are far away from the nuts and bolts in scienceland (as they should be, as that's what they are getting paid for) and are wont to take utterly stupid decisions based on too little experience and perceived relative status.
Better to have a dedicated section of people that deal with the machine crap on a daily basis but that liaise continually with the people for whom they are running everything. You might even have postgrads working on both teams, why not.
The decision to go into Windows is orthogonal to this, everyone is free to open their veins while taking a warm bath after all.
Whilst HP feature here, this could be substituted by any other major vendor. HE in particular suffers from moronic tendering processes, unreasonable requirements for the tender and then the equally unreasonable actual requirements.
One of the huge problems in these types of institutions is what an earlier post states: the IT techies have evolved into the business as postgraduates. Yes, they may be very clever and can make the tech do all sorts of funky stuff, but often that is way off-piste when it comes to reality. The theory may be good, but then it is taken to the limit for some very sound theoretical reason and is totally rubbish when a problem strikes and everything comes home to roost.
Management will be following the current trend of "if you are a manager you can manage anything" and have limited understanding and no control of the high-tech bull that is being fed to them.
The KCL outage is extreme but similar things will have happened in these types of places all over the UK and be buried deep in recycling.
"hilariously disastrous Global Desktop (SunRay thin clients streaming Windows sessions from the North of England - yum!)"
What is it with people who think that this sort of thing is a good idea? I hope your lot were able to turn the animations off in Microsoft Office - mine weren't, so every pixel change involved in opening every menu went over the not very fast network.
The best bit was that there was an ISDN link as fallback for when the broadband failed. You could barely start up a PC in the eight hours.
(AC as also in HE...)
And if anything like my institution, you actually have to kill someone to get fired...
Proper DR/failover tests just aren't done enough here, due to the perceived risk of running the test and a complete lack of central governance in telling the respective owners of the thousand and one systems that exist that they will be doing it, like it or not.
"Fully backup before you do even the most routine maintenance to your RAID.
A good sysadmin/DBA is paranoid. Technical competence comes a close second but paranoia comes first."
As a rule I would always perform a FULL backup of the server should the RAID show a degraded status, and need a disk replacement. Never failed me.
But I heard tales of woe from more than one person that they simply slapped in a new HDD into their RAID system, and it borked itself halfway through the rebuild process.
This was me with our main bioinformatics machine four months ago. Fortunately it was under a full service contract and was getting regular backups. All of the following items were performed by the vendor's service engineers or by me under their direction.
First service action: Replace degraded disk, attempt rebuild, borks.
Second action: Source and replace RAID controller, all disks, goes titsup.
Third action: Replace entire server under warranty.
I am an ex-IT generalist, currently a geneticist and technical scientist, I do not think of myself as ignorant about computer systems, but as far as I know RAIDs are sinister and mysterious forces of nature that cannot be understood by typical mortals.
"Source and replace RAID controller"
Raid controllers are best avoided these days.
Seriously, you're better off with software RAID or ZFS - at least that way
1: You know you can shove the drives in anything and they'll still be readable.
2: The processing power on a raid card is peanuts compared to that of an average desktop CPU these days (raid doesn't even make them get warm), let alone a server.
Your more general mistake was not having someone herding the boxes who knows how to configure for actual requirements in the first place and kick them if they misbehave. The moment the vendors get you doing trained monkey jobs, it's time to get someone onsite.
"Raid controllers are best avoided these days.
Seriously, you're better off with software RAID or ZFS - at least that way"
Just LOL I bet you don't work in mission critical. NO ONE uses software RAID these days in production.
"You know you can shove the drives in anything and they'll still be readable."
Well, no you can't - you now have to shove them in something running a specific OS that understands the disk config. And what about protecting your boot volume?
"The processing power on a raid card is peanuts compared to that of an average desktop CPU these days (raid doesn't even make them get warm), let alone a server."
It really isn't. Large disk array controllers often have things like several multicore Xeons in them these days...
"Just LOL I bet you don't work in mission critical. NO ONE uses software RAID these days in production."
Yes they do. It's not the 1990s any more when CPU cycles were precious and hardware RAID had some real benefit. I've never had a software RAID array fail that wasn't straightforward to recover.
"Well, no you can't - you now have to shove them in something running a specific OS that understands the disk config. And what about protecting your boot volume?"
Software RAID doesn't require special drivers so it's dead easy to fix on alternate hardware should the need arise. OS version doesn't matter, at least with the 'nix systems I deal with. You don't need hardware RAID to reliably protect a boot volume either. You also get more flexibility to choose drives you want and aren't locked to a single vendor, and you can fine tune the array right down to how you need it to run.
I say this as someone who is a sysadmin with 20 years experience and who once reconstructed a failed RAID 5 array and recovered the entire contents (minus a single corrupted file) by hand. (From a double drive failure on a Compaq SmartArray).
"But I heard tales of woe from more than one person that they simply slapped in a new HDD into their RAID system, and it borked itself halfway through the rebuild process."
Those would be the same people running multi-TB RAID5 setups. Even RAID6 is risky once you get past about 10TB.
I've had to pick up the pieces from both cases - and for systems that I don't backup because they were in someone's "private server estate" that "doesn't have critical data on it and doesn't need backups"
The howls when you tell them that their 30-100TB of "critical data" is gone are something to behold, as are the ones when they don't howl at first but then find out that rebuilding the data from sources across the Internet will take several months.
"Oh, you can't reformat this as a raid6+hotspare (or raidz3), we absolutely need the full capacity of all the drives"
Really? After just losing a raid5 array because you didn't notice one drive had died and another went toes up?
Welcome to UK HE computing. As with most things British: Good idea. Bad design. Lousy execution. Inability to learn lessons.
"But I heard tales of woe from more than one person that they simply slapped in a new HDD into their RAID system, and it borked itself halfway through the rebuild process."
I also heard of someone losing a disk from a mirrored system during a system move. They put in a new disk and re-silvered their mirror. From the faulty disk.
"I also heard of someone losing a disk from a mirrored system during a system move. They put in a new disk and re-silvered their mirror. From the faulty disk."
Yep, my previous workplace had a minion do that after I left. Then he overwrote the copy. Then he broke the offsite version. *Then* he confessed to having had some problems.
Thankfully I never got burned by it but I saw a frightening message in an HP update notification for Smart Array once.
It boiled down to the following if I remember correctly:-
You have a RAID1 mirror pair and a disk fails
You replace the failed disk, the mirror rebuilds and gives a completion message and all appears to be right with the world.
Except it never actually completed properly and is lying to you, you are actually running on one disk and who knows what it is actually mirroring.
I really pity the poor admin that got burned by that one before they released the update. Thanks so much HP.
"Incremental backups only bridge the gaps between full backups."
Also, there are many different kinds of backup software and only a handful of them are any good - many of the commercial packages costing up to 30k a shot are complete piles of fetid dingo kidneys.
One of the major hurdles with getting backup systems in place is the cost of the hardware and the entrenched attitude in HE that everything can be done with a Windows desktop PC.
Unsurprisingly this doesn't usually change until AFTER someone loses critical data (I've had a number of "I need this system restored" demands for things that we don't backup because the demander was unwilling to pay for the service).
Somewhat more surprisingly there's often strong resistance to forking out for the appropriate kit/software even AFTER such events. "You got it fixed, why do you need all this expensive stuff now?"
RAID absolutely is a backup against failed drives. Stupid tired argument. You can say full backups are no good unless they are off site, and depending on region on another tectonic plate. Then you can go farther and say it's not a backup unless it's been fully restored and validated (and even farther to say on a regular basis).
All depends what you are protecting against.
In this case it appears as if a high-end 3PAR is the likely cause, based on what I've read on what the VS3 is. I had a vaguely similar event happen (no system upgrade involved) to me 6 years ago on a T400. Took a week to recover fully (end users not impacted, mainly because a ton of data was on an Exanet cluster and they went bust earlier in that year. 3PAR was back up in 6 hrs). Support was awesome though and made me a more loyal customer as a result.
Certainly an unfortunate situation with lots of data loss, and backups didn't cover everything ("Can you restore X?" "Sorry, you never requested X be backed up, and everyone knew we had a targeted backup strategy due to budget and time constraints"). Fortunately most of the lost data was non-critical.
[Only other similar issue I was involved with was a double controller failure on an EMC array which ran another company's Oracle DBs. 35 hrs of downtime for that, then 1 to 3 outages a month for the next year or so to recover corruption that Oracle encountered along the way. I wasn't responsible for that array.]
After all of that, and getting a new revised budget for DR (unrelated to the incident), the VP decided to can the DR project because he needed the money for some other project he had massively under-budgeted for. I left before the last clusterfuck could get off the ground, which had 18 months of outages and pain from what I heard.
I see you stopped reading my post pretty quick, as I covered all of those other factors. You have to ask yourself what are you protecting against? Then solve for that.
Very often you will find when you want to protect everything against every situation the organization will not shell out the $$ to cover even a fraction of what you may want to protect (whether it is $$ for hardware or $$ for staffing to do it, test it etc).
RAID doesn't really protect against failed drives. It gives you breathing room to get the backup recovery process in order.
RAID-5 theoretically protects against a single failed drive.
However, in the real world this isn't true, as a second drive will probably fail during the rebuild - they are a similar age and have had a similar amount of usage, and the RAID rebuild is likely to be the most intensive work they've ever done.
Assume it rebuilds ok - you've dodged a bullet, but what happens when the next drive fails? All except one are now very old...
So when the drives are large, the rebuild takes so long that the probability of a second failure during rebuild quickly gets above 50%, and a third is rather probable.
So backup. Backup means you can port to a new RAID, and you have a way of recovering when the rebuild fails.
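One common back-of-envelope argument behind claims like the one above is based on unrecoverable read errors (UREs) rather than whole-drive failures, so treat this as a related sketch and not the poster's own reasoning. It assumes a spec-sheet URE rate of 1 in 10^14 bits (a figure often quoted for consumer drives), independent errors, and that a RAID 5 rebuild has to read every surviving disk in full. The disk counts and sizes are hypothetical.

```python
import math


def rebuild_ure_probability(surviving_disks: int, disk_tb: float,
                            ure_per_bit: float = 1e-14) -> float:
    """Chance of at least one unrecoverable read error during a RAID 5 rebuild.

    Assumes every surviving disk is read in full and errors are independent.
    The default URE rate is an assumed spec-sheet figure, not a measurement.
    """
    bits_read = surviving_disks * disk_tb * 1e12 * 8
    # 1 - (1 - p)^n, computed via log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))


# Hypothetical arrays: 4-disk RAID 5 with one failed disk, so 3 are read.
p_small = rebuild_ure_probability(surviving_disks=3, disk_tb=1.0)
p_large = rebuild_ure_probability(surviving_disks=3, disk_tb=4.0)

print(f"3 x 1 TB read during rebuild: {p_small:.1%} chance of a URE")
print(f"3 x 4 TB read during rebuild: {p_large:.1%} chance of a URE")
```

Under those assumptions, reading 12 TB gives roughly a 62% chance of hitting at least one URE, which is how the "above 50%" figures usually get derived. Real drives often do better than their spec sheets, so the point is the trend with capacity, not the exact percentage.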
A lot of the posts here imply to me "system admins" in general working with small data sets in fairly simple environments. It's easy to protect a small amount of data; obviously as complexity and data sets go up, the amount of work required to back things up right goes up as well.
An extreme example to be sure, but I recall going to an AT&T disaster recovery conference in Seattle probably about 2009 time frame. At the massive scale AT&T was at they still had stuff to learn.
Specifically they covered two scenarios that bit them after the 9/11/2001 attacks in NYC.
First was they had never planned for a scenario where all flights in the U.S. were grounded. They had the people and equipment but could not get them to the locations in a timely manner.
Second was when they set up a new site after the WTC was destroyed to handle AT&T network traffic, probably a few blocks away. They had big signs up advertising they were AT&T there, and they realized maybe it wasn't a good idea to advertise the fact that they were there so publicly.
One company I was at had a "disaster recovery plan" which they knew wouldn't work from day 1 (as in 100% sure there was no way in hell it would ever work), but they signed the contract with the vendor anyway just to show to their customer base that "yes they had a DR plan" (the part where "does it actually work" fortunately wasn't part of the contracts). They paid the DR vendor something like $30k/mo to keep 200 or so servers in semi trucks on call in the event they would need them -- knowing full well they had no place to connect those servers if they had to make that call.
A lot of the comments here incorrectly portray the process of true data protection as something that is pretty simple. It is not, and if you can't understand that then well I don't have more time to try to explain.
I know of a university (outside the UK) recovering hundreds of TB after HSM corrupted some of their files whilst migrating from an old system to a new one. As far as I know they did everything correctly, just their storage ****ed up. I'm told $VENDOR is being very proactive in helping them get their data back from tape!
I'm rather puzzled... if there's something that was evidently this critical then why was it stored (from the description) on a RAID-1 array consisting of just two disks? Eeek.
Also, why wasn't the backup reverted to rather faster than two weeks? Yes, reverting to old data is a pain particularly when followed by a likely merge but it should be less painful than two weeks downtime. Also, why wasn't the backup period somewhat shorter - if the data is this critical then the question should have been asked "how much (time of) data are you prepared to lose?"
I think the various virtual machines were backed up a machine at a time according to a strategy particular to that machine and its function. So, for example, and the strategy may well in actuality be different, HR, finance and student records systems would be backed up daily in their entirety, whereas the shared drives and old, semi-retired webpages on an image of an ancient server would be backed up incrementally with a full image every month or so.
I also think the RAID was many more disks than just two. Probably 24 disks arranged as a whole load of RAID 5 volumes, but the volumes spread out across the whole 24, which would effectively make it a very bad system indeed. I can't believe anyone would do that, TBH. Not being au fait with the system itself, I can only guess.
As correctly stated, the PS3 works in a different manner. Having read the original mathematical paper that underlies the system, I could never understand why they used RAID discs in the first place... redundancy over redundancy... The other issue lies with the "array managers database": when it wrongly assumes that a volume does not exist, the volume does not appear in the tables (a "ghost volume").
This could easily cause the system to corrupt itself on any upgrade. You will probably say this is not possible, but I have seen this happen.
As for backups I am a firm believer in having backups first before splashing out on new redundant data centres...
Why restore from two weeks backup? Because either the problem was not realised or the backup capacity was highly under-resourced by managers who did not understand how to manage backup...
Should heads roll? Heads should only roll if there is negligence... this includes not listening to your employees when they highlight problems...
"Also, why wasn't the backup reverted to rather faster than two weeks? "
Maybe because the backups were corrupted.
I know of one case from 1999 (a telephone exchange) where what was being backed up turned out to be random garbage and the organisation had to revert to 18-month old backups, then replay every single transaction that had gone through since that point (logged separately).
It took 3 months to get fixed - and in that case, being a telephone exchange with 20,000 people on it, all sorts of odd shit was going on with people's phone service (starting with 3 days of "dead lines")
Speaking as the Backup Manager for one of the UK's larger universities, and given the description of the fault (sounds like a HW-level cross-site replication issue), the problem would have been spotted immediately, so backup corruption (more specifically data corruption) is unlikely to be the issue. There's never enough money for backups. We still have an issue at our institution, where backups are still being done to tape for massive amounts of research data. Data growth is greatly exceeding technological growth in raw speed (throughput), so cloud-based disk snapshotting is the only feasible route as we don't have a capable archiving system available. Costs are skyrocketing: for the past 3 years I've seen an annual data growth rate of 100%. The numbers don't add up.
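To put rough numbers on that: 100% annual growth means the data doubles every year, so after three years there is eight times as much to back up. A minimal sketch of the arithmetic follows. The starting size and the sustained backup rate are made-up figures for illustration, not anything from KCL or any real institution, and the function names are hypothetical.

```python
def projected_size_tb(initial_tb: float, annual_growth: float, years: int) -> float:
    """Compound growth: data size after `years` at a fixed annual growth rate."""
    return initial_tb * (1.0 + annual_growth) ** years


def full_backup_hours(size_tb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to stream `size_tb` terabytes at a sustained rate (decimal units)."""
    seconds = (size_tb * 1_000_000) / throughput_mb_per_s
    return seconds / 3600.0


# Hypothetical figures: 500 TB today, 1000 MB/s sustained aggregate backup rate.
START_TB = 500.0
RATE_MB_S = 1000.0

for year in range(4):
    size = projected_size_tb(START_TB, 1.0, year)  # 1.0 means 100% growth per year
    hours = full_backup_hours(size, RATE_MB_S)
    print(f"year {year}: {size:7.0f} TB, full backup takes about {hours:6.1f} h")
```

If the throughput stays flat while the data doubles, the time for a full backup doubles each year as well, which is the squeeze described above.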
It obviously wasn't a 2 disk mirror. The article was badly worded but it obviously wasn't a 2 disk mirror.
"Also, why wasn't the backup reverted to rather faster than two weeks" - likely running on incremental backups for far too long, plus a potentially narrow download pipe.
"Also, why wasn't the backup period somewhat shorter - if the data is this critical then the question should have been asked "how much (time of) data are you prepared to lose?"" - backups should at the very least have a 24-hour restore point; I've not read anything to the contrary.
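As a back-of-envelope illustration of why a long incremental chain plus a narrow pipe hurts: a restore has to pull the last full backup and every incremental taken since. The sketch below only counts transfer time, and every figure in it is a made-up assumption; it ignores the time to replay the incrementals, verify them, and rebuild the storage underneath.

```python
from typing import Sequence


def restore_hours(full_gb: float, incrementals_gb: Sequence[float],
                  link_mbit_per_s: float) -> float:
    """Hours to pull a full backup plus every incremental in the chain over one link."""
    total_gb = full_gb + sum(incrementals_gb)
    total_megabits = total_gb * 1000 * 8  # decimal GB -> megabits
    return total_megabits / link_mbit_per_s / 3600.0


# Hypothetical: a 20 TB full backup, 90 daily incrementals of 300 GB each,
# restored over a 1 Gbit/s link that is assumed to run at full rate throughout.
chain = [300.0] * 90
print(f"transfer alone: about {restore_hours(20_000.0, chain, 1000.0):.0f} hours")
```

Taking full backups more often shortens the chain, and a fatter pipe shrinks the whole figure; either one would have to be budgeted for in advance.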
Meanwhile, the problem could have been solved with a single off-site backup that wasn't reliant on fanciful de-dupe or whatever technology.
You know, like a copy of the VHDs of the virtual machines. Flung onto a cheap NAS, or - god forbid - a tape.
Even if it was just once-a-week, and not "The" backup method, you could have been up and running for most VMs in a matter of hours in such a circumstance.
But by being overly-complex, your recovery process is now an absolute nightmare involving stitching arrays back together and hoping your backups weren't corrupted and so on.
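For what it's worth, the "dumb copy" approach really can be this small. Below is a minimal sketch in Python that copies one disk image to a second location and checks the copy against a SHA-256 digest of the source. The function names and any paths you feed it are hypothetical, and it assumes the VM is shut down or snapshotted first so the image is consistent. Note that reading the copy straight back may be served from cache, so a later re-check is more convincing than this immediate one.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large images never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source: Path, dest_dir: Path) -> Path:
    """Copy `source` into `dest_dir`, then confirm both files hash identically."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / source.name
    shutil.copy2(source, dest)  # copies contents plus timestamps where possible
    if sha256_of(source) != sha256_of(dest):
        dest.unlink()  # do not leave a known-bad copy lying around
        raise IOError(f"checksum mismatch copying {source} to {dest}")
    return dest
```

Calling `copy_and_verify(Path("some-vm.vhd"), Path("/mnt/cheap-nas/weekly"))` (both paths invented here) once a week would give exactly the kind of independent, technology-free restore point described above.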
Two weeks is head-roll time, as far as I'm concerned. Sure, let them fix it. But be planning their replacement staff and prepping the pink slips.
"You know, like a copy of the VHDs of the virtual machines. Flung onto a cheap NAS, or - god forbid - a tape." - I doubt it's the VM environment that's affected; it'll be the SAN storage. I doubt you have any experience dealing with PB-size backups. Virtualisation products don't work well with "big data"; the data is always stored independently via CIFS or NFS.
KCL also blithely disposed of the *very first* computer worth the name:
"On Mr Gravatt applying to the Board of Works, it was stated that the Difference Engine itself had been placed in the Kensington Museum because the authorities of King's College had declined receiving it,"
-- Charles Babbage: http://digitalriffs.blogspot.co.uk/2014/01/charles-babbage-and-kings-college-london.html
And thus King's College London's IT endeavours have been cursed ever since.
Not for the first time, and prickled particularly by El Reg's humorous reference to the redundancy of redundancies, I wonder whether yet again an IT system has been snared not by incompetence or laziness or misapplied intentions, but by unnecessary complexity.
Airliner undercarriages are required to be extremely robust, reliable, fail-safe and are engineered to large performance tolerances. They have to handle not just a nice smooth touchdown but also the unexpectedly violent hard landing of last-second wind-shear, or a piloting error. And yet somewhere in the world they go wrong every month or so.
Their design teaches us that you can keep adding bits—couplers, moment arms, bearings, springs, shock absorbers, kitchen sink—each component specifically intended to improve performance, reliability or comfort, but that eventually the complexity of the whole thing actually begins to reduce its reliability. There are too many things to go wrong.
It's just a thought and not a great analogy, but for me this all adds to the notion that sometimes we are adding too much to our systems in the name of safety and security and reliability, when we should step back and consider subtracting instead. A particular problem for institutions is the infestation of corporates' salescreatures and the blandishments of marketurds, who have so many ways to persuade you that your system is unsafe and a mountain of lies to convince you to buy their product ... you know the one, it's in the back of the cupboard, still shrink-wrapped under the corpses of spiders from 2007.
You really cannot beat the authority of an experienced technically-savvy manager with a cynical eye, skin long since thickened against the lies of sales and marketing, to stand a good way back and look at your system. The Axe of Pragmatism can often make things cheaper, better and simpler.
Should we stop abstracting IT then? If we followed your logic then ... where would we be? Should programmers still write in Assembler? Civilisation advances because we do not and cannot expect a single person to know everything. Is there a line that should be drawn?... probably not, hasn't worked so far. If this were a flaw in the underlying technology (not ruling that out .... not enough information yet) then many thousands of institutions would have been hit. Part of the reason we have independent backups is to guard ourselves against a screwup in a specific piece of technology.