Tieto, a prominent Swedish IT service supplier, had an EMC Array go titsup on 25 November, causing five days of chaos at the Motor Vehicle Inspectorate, the Sollentuna and Nacka municipalities, the City of Stockholm's schools' website and intranet, the National Board of Health and other prominent sites. The debacle (in Swedish) …
That old chestnut.
"...despite having a 99.8 per cent uptime agreement..."
I've been saying this for some time now. You may very well have a beautifully worded SLA, guaranteeing availability, which gives you a nice, warm feeling of security.
For some reason, there are many in the business who don't seem to realise that it is a legal document and nothing more. Its *only* purpose is to be used as a weapon *when*, not if, that catastrophic failure occurs.
It's marketing blurb. Potential losses could bankrupt both the provider and their hardware suppliers.
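For perspective on what that "99.8 per cent uptime agreement" actually buys you, here's a back-of-the-envelope sketch (the function name and figures are mine, purely illustrative):

```python
# What does an uptime percentage actually permit per year?
HOURS_PER_YEAR = 365 * 24  # 8760

def allowed_downtime_hours(uptime_pct: float) -> float:
    """Downtime per year still within an uptime-percentage SLA."""
    return HOURS_PER_YEAR * (1 - uptime_pct / 100)

print(allowed_downtime_hours(99.8))    # ~17.5 hours/year
print(allowed_downtime_hours(99.999))  # ~0.09 hours/year ("five nines")
```

So even a fully honoured 99.8% SLA allows roughly 17.5 hours of downtime a year; five days of chaos blows through that many times over, and all the SLA gives you afterwards is a stick for the lawyers.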
It seems the Swedes can't stop getting screwed by the Americans. First Saab, and now this!
99.8 on VNX?
Yes, and who selected VNX as the mission-critical-grade storage platform? I doubt EMC would even have recommended VNX for this application. Legacy Clariion dual controller, RAID 5-6 with some EMC software over the top of it.
Waiting on more details.
There are two old sayings that come to mind - "a workman is only as good as his tools", and "you need a computer to make a mess, but you need a beancounter to make a real disaster!" At the moment, it looks like someone at EMC or their supplying partner made a mistake on the tools selected for the job or how they were implemented/upgraded, but it also looks like someone at the customer end hadn't really thought through their DR process. "What happens if I lose my storage device and can't recover from my primary backups?" should be a pretty standard question when designing a proper, belts-and-braces system with DR failover. Do I detect the foul odour of beancounters cutting costs.....?
Rise of the virtual machines
means you can fail lots of customers at once.
Yup, like the situation recently with a huge company that doesn't like its staff to wear shirts that are the same colour as its logo. The hosting in their European data centre became overloaded, and various odd things started happening. This affected all the clients in the data centre, and the suggestion from said company was for clients to start manually replacing VM instances with ones in the US data centre. Not much good when the increased latency to the US data centre from our customer sites is considerable.
Double disk failure on the flash tier/cache, no configured hot spare
That is the rumour on the street.
@Double disk failure
The failure of more than one item is far, far more likely than the usual calculations for redundancy assume!
That is based on "independent random failures" and doesn't allow for an external influence (PSU surge, over-temp, etc.) stressing multiple items, or a bad batch of some component causing much higher failure rates. Also, the strain of a RAID rebuild on a traditional HDD's head servo can provoke it to croak before redundancy is restored, though that should not matter for flash.
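A toy model makes the point. Even under the textbook independence assumption the rebuild window is exposed, and a modest correlation factor (bad batch, shared PSU, rebuild stress) multiplies the risk. All the numbers here (12 drives, 2% annual failure rate, 24-hour rebuild, 10x stress factor) are illustrative assumptions of mine, not vendor figures:

```python
def p_second_failure(n_drives: int, afr: float, rebuild_hours: float,
                     correlation_factor: float = 1.0) -> float:
    """Probability that at least one of the surviving drives also fails
    during the rebuild window. correlation_factor > 1 models shared
    stress breaking the 'independent random failures' assumption."""
    p_hour = (afr / 8760) * correlation_factor  # per-drive hourly failure prob
    survive_all = (1 - p_hour) ** ((n_drives - 1) * rebuild_hours)
    return 1 - survive_all

independent = p_second_failure(12, afr=0.02, rebuild_hours=24)
stressed = p_second_failure(12, afr=0.02, rebuild_hours=24,
                            correlation_factor=10)
print(independent, stressed)
```

With these made-up inputs the correlated case comes out roughly an order of magnitude worse than the independent one, which is the whole point: the redundancy sums on the datasheet assume a world that real arrays don't live in.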
On a personal note, we have a Sun/Oracle 'open storage' system configured with dual redundancy and tested it for our acceptance by pulling two HDDs from a RAID set, and it failed. More than once, and in one case trashing some files (at least ZFS told us which ones!). Years and several firmware revisions later, Oracle has not attempted to find the actual cause, and assures us that because they have not replicated it recently it must somehow have been fixed by other code revisions.
Can someone remind me of who said "if you don't have 3 copies of your data, you don't really have your data"?
I read somewhere
that the increase in drive capacity has not been accompanied by a decrease in error rate, so with a big enough rebuild you are *likely* to experience another problem.
3 copies of data
Can someone remind me of who said "if you don't have 3 copies of your data, you don't really have your data"?
Moses. There were actually six stone tablets. Only two were displayed.
Not Even Allowed to Test
Several years ago I was involved with the introduction of a then well known maker's hardware and worked on the acceptance test schedules. These were taken from the documents used for other suppliers.
The maker in question ran from the negotiations straight to his board level contacts claiming that we 'just wanted to break his equipment'. Our point was that others saw no problem and passed the test in question without issue.
The maker in question is no longer in business, their competitors are still trading, is there a conclusion lurking there?
Learning from each other's mistakes?
I understand that from a business perspective the exact reasons for the titsup may not be good to disclose. Companies automatically go into face-save mode and make statements and quote statistics proving how rare such things are etc.
The problem I see with that is the lack of information sharing. They use off the shelf products in a configuration probably quite similar to other people. If they were to say "Hey guys, we had X and Y in Z config but when X did A and Y did B then Z went pop!" then we all could learn about nasty gotchas.
Yes, I know the problems with that include, but are not limited to; Trade Secrets or maybe having a stupid setup you don't want to admit to.
Oh, OK. I'm just a dreamer who lives in a fantasy world of people helping each other out.
""" then we all could learn about nasty gotchas."""
It is a common engineering "secret" that the information provided by a failure is often much more valuable than that of a success. Success is what happens every single day on the assembly line and is blurted out in volumes of BUMF by marketing and sales; a failure is rare and exceptional.
So we engineers generally keep failures close to our chest and secretly hope that a major competitor will also hit that particular landmine but in a bigger way - or even that someone will blab at the trade-show piss-up so we also get time to short the stock (and offer consultancy at exorbitant rates).
Reminds me of a song
I've looked at clouds from both sides now,
From up and down, and still somehow
It's cloud illusions I recall.
I really don't know clouds at all.
That's midrange storage for you
What do you expect when you ask a midrange storage device to deliver enterprise-class reliability?
You get what you pay for..
Stupid IT people
EMC, like any other major company, creates SLAs relevant to a customer's infrastructure. There are some really stupid people on here.
I have recently been involved with a VNX migration from old CX kit. We also had a big outage related to the FAST Cache. Luckily for us, we were able to go back quickly to the old kit without big impact. The early versions of the VNX code (FLARE) are abysmal. I wonder if they were running an old version of FLARE?!
wrong storage for the job
Really, a VNX for multi-tenancy? You get what you pay for. The entire VNX line is tier 2. I would never think of running mission-critical multi-tenant workloads on one of them.
Backups are useless
If you don't test restores, regularly.
Buying from the lowest bidder
EMC makes a good product, or a compilation of products put together and repackaged into a good product. That being said, I definitely would not tout 99.8% uptime without offsite replication and RPO/RTO clearly defined. Also, it looks like a true local D2D was not involved, other than some simple COW snapshots, which only solve a few problems at best.
Stepping back though, this article writes a pretty good story. EMC probably undercut everyone on price, cut out the D2D and replication portion of the solution, and somebody chose to "save money" instead of buying the right solution. Case closed.
Buying from the lowest bidder
Agree, that is how all of the VNX (and Clariion before it) mid-range stuff gets into enterprise, mission critical workloads. People call EMC because they think EMC = storage. EMC shows them the latest Symmetrix systems, they ask for the price, they ask for a less costly alternative, they buy Clariion/VNX and they are amazed when their low/mid range storage works like low/mid-range storage... but it has an EMC logo on it! It seems that whoever was in charge of DR configuration/testing messed this up as well.
It's still the responsibility of the people who designed the solution.
If you make a solution using entry level components and/or midrange components and you promise highend features, availability etc. you can't blame it on the hardware vendor. At least not totally.
You have to know and understand the building blocks you use.
Not able to read the Networker Backups? I doubt it. Networker may have its problems but a simple Windows backup and recovery isn't going to tax it. There are three possibilities:
1) The tape broke, or there was some other sort of hardware fault
2) The backups weren't, in fact being taken.
3) (And this is the one my money is on) They upgraded from 2003 to 2008 R2, but didn't do any checking for recovery, other than a basic file restore. The company didn't realise that you need to have a system image (I can't remember its name, exactly) to install at build time, and now can't do a registry restore. IIRC this was fixed in later versions of the Networker client.
That is the problem with fly by night hosting/cloud providers
Unless the hosting/cloud provider has a huge reputation on the line, e.g. IBM, they often buy the cheapest gear, or gear that their customers would never have selected for these workloads, with the least costly HA/DR setups. As every dollar/krona/euro/pound spent on resiliency comes directly off their bottom line, it is easier just to promise 99.9% uptime without actually having any reason to believe you will be able to achieve it, and bank on customers not taking you to court over the SLAs. It is not like Tieto had a brand to protect. They will probably just declare bankruptcy and move next door with a new name.
"""It is not like Tieto had a brand to protect. They will probably just declare bankruptcy and move next door with a new name"""
Tieto is a huge company, similar to Ericsson. They do not just go bust overnight and it is possible that they are so interconnected with Swedish/Finnish infrastructure that they simply can't - some deal will be forced through, should the occasion merit.
However, I do know from personal experience that they sacked a lot of their best talent during 2007. Maybe what's left at Tieto is scores of Java hackers straight from university: cheap, but with no systems understanding at all, who shun any learning in that direction because the purity of their Java belief system puts them well above the world as it is.
It's not all bad, though: I, and many others, have earned so much money cleaning up after those guys, it's almost embarrassing ....
FAST2 issues - double SSD failures
I have firsthand knowledge of a couple of incidents too, when using FAST2 on the VNX. Double disk failure on the tier-0 SSDs and you lose the array. Even if your restores work, we're still talking days... And much longer than that if it isn't all virtual servers.
Word on the street is: You don't install a VNX with FAST2 enabled anymore. Too risky. Happened too many times to be just a stroke of bad luck.