Didn't a certain MichaEl eDell just acquire EMC? WTF is going on? I'm starting to feel sorry for him!
Victoria University in Wellington, New Zealand, experienced two days of Total Inability To Support Usual Performance (TITSUP) after failures in its EMC equipment. An email sent by senior IT staff at the University reports that “the storage systems that manages all our servers have experienced multiple failures over the past 2 …
Is it possible to build this type of redundancy/failover when systems get this large? If so, is the time taken to switch between the two quick enough to keep downtime to an acceptable level, at a cost which is affordable?
This isn't meant to knock DR plans/redundancy/fault tolerance; it's more a question of whether, as systems get bigger and storage more complicated, it's affordable to split resources across multiple vendors to prevent incidents like this. It'll be interesting to see what the RCA is for this fault: was it an EMC failure exacerbated by other faults, or just a single failure somewhere which took out a lot more?
Universities (and most corporates) are not always awash with cash, so they have to accept a level of risk. If your DR/BC plan is essentially restore from backup, then a single array failure can take loads of services offline for the time it takes to restore from backups, assuming you have, or can make, capacity available to restore them to...
NOT everyone wants to spend lots of dosh on dual live systems....
If you haven't got live capacity to restore to (e.g. by killing Test/Dev) then you may be waiting for EMC to ship a replacement...
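To put rough numbers on that "time it takes to restore" point, here is a back-of-envelope sketch. The data volume and throughput figures are illustrative assumptions, not anything from the article:

```python
# Back-of-envelope restore-time estimate after a total array failure.
# All numbers below are illustrative assumptions, not figures from the outage.

def restore_hours(data_tb: float, throughput_mb_s: float) -> float:
    """Hours to stream data_tb terabytes back at a sustained MB/s rate."""
    seconds = (data_tb * 1_000_000) / throughput_mb_s  # 1 TB ~= 1,000,000 MB
    return seconds / 3600

# A hypothetical mid-size university SAN: say 100 TB of live data,
# restored over a backup network sustaining ~500 MB/s.
print(f"{restore_hours(100, 500):.1f} hours")  # roughly 55.6 hours
```

Which is comfortably over two days before you even start bringing services back up, and assumes the restore target already exists.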
"At a guess, this looks like a major failure of a storage area network, likely including products beyond EMC's given that printing and internet access were also impacted"
I guess whoever wrote this doesn't realise that print and Active Directory servers more than likely run from a SAN as well...
It is entirely possible it wasn't a hardware failure, or anything to do with the SAN at all: a network misconfiguration, a fibre channel switch change, or even something at the virtualisation layer could cause issues for a wide range of systems.
Maybe they do have a replicated storage setup and something changed which flooded the network, effectively an internal DoS. A combination of all the above can lead to that situation.
Everyone is quick to blame the vendor, or to suggest that the University didn't spend enough, but with big, complex infrastructure a small configuration issue can quickly snowball into something huge.
"The storage systems"
All storage systems?
"Sign-on to networks was slow, Internet connections went down and even printing was problematic."
Because of a storage failure?
"We've also contacted the executive who sent the email below. He appears to have flicked it to the University's communications team and hasn't offered any of the detail we requested about the nature of the outage or the EMC products involved."
Then... with what information are we working?
Well, the lack of facts at hand makes it anyone's guess as to exactly what happened. That said, storage networks have a lot of moving parts and a failure in the networking part could easily disable access to the storage part. If the outage was due to a planned upgrade or maintenance, then there should have been a roll-back procedure in order to recover. While you cannot rule out human error, you expect that the people involved in operating and managing the storage network are adequately trained and experienced. The vendors involved along with the university will likely issue a "post mortem" when the facts surrounding the outage are understood. Then the guilty can be charged.
What was the failure?
Was it capacity? What was the university IT operations team doing before this happened? Did they have any kind of monitoring on before services degraded? Do they have some kind of failover or spare capacity in their architecture? Or was it that some beancounter could not be convinced fast enough to add/upgrade IT infra, so let's blame the vendor for the failure?
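For what it's worth, "any kind of monitoring" doesn't have to be elaborate. A minimal capacity-alert sketch might look like this; the thresholds and pool names are invented for illustration, and a real array would expose this data via its own CLI or API:

```python
# Minimal capacity check of the sort a storage team might run before
# things degrade. Thresholds and pool names are invented for illustration.

WARN_PCT = 80   # warn when a pool passes 80% full (assumed threshold)
CRIT_PCT = 90   # page someone at 90% (assumed threshold)

def check_pools(pools: dict) -> list:
    """pools maps name -> (used_tb, total_tb); returns alert messages."""
    alerts = []
    for name, (used, total) in pools.items():
        pct = 100 * used / total
        if pct >= CRIT_PCT:
            alerts.append(f"CRITICAL: {name} at {pct:.0f}% of {total} TB")
        elif pct >= WARN_PCT:
            alerts.append(f"WARNING: {name} at {pct:.0f}% of {total} TB")
    return alerts

# Hypothetical pools: one healthy, one nearly full.
print(check_pools({"vmfs-prod": (45.0, 60.0), "sql-prod": (19.0, 20.0)}))
```

If something that simple was in place and alerting, capacity probably wasn't the story; if it wasn't, the beancounter theory gets more plausible.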
Or the systems just inexplicably stopped working! Just collapsed like a pile of ...
As you say, "the more EMC tried to fix it, the worse it got".
I am sure you would love to blame EMC (oh, the big corporate behemoth, who do they think they are?), but let's get some more facts before you stick the knife in.
Biting the hand that feeds IT © 1998–2019