Did a Hitachi Data Systems USP-V array controller failure cause the Barclays ATM outage yesterday? Yesterday, to its great embarrassment, Barclays' ATM network in the south of England crashed at 1pm, together with a lot of its online banking facilities. Functions were not restored until 4.30pm or later, and thousands of people …
Is this the same Barclays that....
.... are cutting IT Staff like crazy and planning on outsourcing the rest? This incident shows that IT is core to their modern business, yet they still want to trim, outsource and lose control over it. Brilliant!
Bet they don't sack whoever was responsible for this bit of well-thought-out non-redundancy though.
even on the mainframe, has no one ever heard of multipathing IO??
Paris, because she multipaths the IO like a maniac
HDS arrays are indestructible. If it was an HDS, it either got unplugged or someone blew it up.
Failure to understand systems reliability and availability
If it is true that a single piece of hardware caused this outage then the fault is not that there was a lack of redundant hardware but in the design of the software system that relied on single pieces of hardware in a single building. Buying overpriced tin (Jurassic Park storage, I'm looking at you HDS and EMC) and thinking that it will shore up your badly written, implemented and operated software stack is a fool's errand. This is the route to completely unmaintainable systems and a stupid waste of money for which people should be fired.
If that's true then it beggars belief that they didn't have a mirror.
"A lot of due diligence is happening at the moment"
Translation: We're all working out who we can pin the blame on
One cut too far
Just look at the list of related stories at the end of the page. Explains it all really.
I live in South Wales and couldn't access my Barclays account via an in-store (bank) cash machine yesterday. After multiple attempts I withdrew money using my Cahoot card.
Today I went to use the machine, which promptly retained my card as I went to check my balance... Right now I'm calling their Debit Card Service, which was engaged twice (with an actual engaged tone!), and I finally managed to get into a queueing system. Hopefully I will get to sort it all out...
Even if the HDS hardware did fail (which I very much doubt due to its built-in resilience), what about the mirror array? Only a clown outfit would not mirror all their critical systems across multiple arrays. A TrueCopy license suddenly looks a lot cheaper in hindsight, eh?
Lessons learned from 9/11
"What drive array was this? One that was involved in storing data relevant to cash machine operations and online banking? Also, given that the Gloucester data centre has a history of computing system failures (see here, here, and here) why wasn't there an adequate fallback mechanism in place?"
I'm sorry, but anyone designing a system to handle financial information that needs 24/7 access should have designed a system that had Enterprise Replication so that a failure of one complete system would mean that the secondary could pick up the slack.
I referenced 9/11 because after 9/11 many financial institutions in the US created two separate, connected data centers so that if something happened to one, the second site could continue to provide data.
Of course there was a certain unnamed global bank that has operations in Chicago where during the X-mas holidays one year, a piece of dark fiber was accidentally cut and the failover failed. Of course it came out that they failed to test the failover... (So the story goes...)
Maybe banks and other companies will learn that you're better off insourcing support for your critical infrastructure and hiring smaller teams of better and brighter people.
A flame because bean counters don't grok this until they get nailed, and it's not their ass on the line when things fail.
Data Centre? Gloucester?
That'll teach em for using FastHosts.
It's interesting that, of the three history items the article references as hardware failures, one is dated 1998 and another 1999. Hardly current.
I worked at Barclays for many years this century, and there were a small handful of errors that made the press. Generally the systems there are capable of handling single component failures without impact to customers. So things failed, but the service was restored so quickly that they were not noticed. What all the support staff were afraid of was a whole site failure.
If such a failure occurred, chances are the loudest sound would be the resignation letters hitting the managers' desks, as the staff would rather resign than attempt to pick up the pieces. It was often said by those in the know that after a lot of hard work and bickering between the internal businesses about who would get their systems back first, the DR site(s) would carry the load, but that the chances of a successful fail-back were so close to zero that it was unlikely to happen.
Of course, things may have changed, after all, they have a lot of good people. Um.
Oh wait, I forgot...
BTW. Most of the IBM SP/2s in the picture have gone, as they have been obsolete for many years. The picture probably matches the referenced articles.
... now where's my design, ah, yes, erm..
Storage guy to Problem Manager: " I told the Project Manager that we should have done a full DR test and failed over the Primary to Secondary. " Hmm says PM. Storage Guy continues, "..but they told me that the system is too important to fail-over and the Customer would never give us a window to do it in..." And so the story goes.
has no one ever heard of multipathing IO??
Todd, multipath does nothing to improve availability or data consistency on storage. It only affects performance.
If data got corrupted, as it seems it did, even their replicas got corrupted in the same way. In this case, only a restore will save them.
Paris: Cause she loves to perform a restore when everything is down...
Hardware Failure ???
It is said that it is a HW failure and then "HP/EDS have the maintenance contract for the affected system", so I'd say EDS smeggin' up the clustering and blaming it on the hardware.
There are so many layers in a cluster that it is likely that no one picked up the signs because no one ever read the logs on a daily basis.
Then again there is always room for the eejit left to their own devices in a data centre unplugging things to plug their own.
This sounds familiar
I remember working with a different storage vendor on different hardware. We had a crash (caused when their hardware engineer accidentally knocked the main power switch off when doing an inventory of the drives) and found that the backup procedures, put in place by the storage vendor, didn't back up all the data we needed. A crucial bit of the database didn't get saved. It said it was, and if you were prepared to wipe out the DB and restore to a cold backup taken at the weekend it would work, but the roll forward would always fail.
As it was their hardware, with their software and their backup procedure, we made it their problem. It took them four weeks, but they did come up with a solution.
It was a change to the manual that said that data of the type we had was not covered in the backup process. We can't fix your data so we will accept it was a documentation failure.
Strangely, that vendor's hardware was scrapped at the next upgrade and they forfeited the right to bid for that contract.
I have been working on tags since before they came out. You still have multiple CHIP pairs on the front end, so not many single points of failure exist. If you have both paths running across 1 CHIP pair (1 controller board) you can have this failure. If you have multiple paths running across multiple boards (FICON or FC), you should be able to avoid this type of fault completely. Smells of a lazy config to me.
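A minimal sketch of the kind of config check described above: given a list of host paths and the front-end board each one lands on, flag any volume whose paths all share one board. The volume names, board names and the `single_board_volumes` helper are made up for illustration and do not reflect any real HDS tooling or output format.

```python
from collections import defaultdict

def single_board_volumes(paths):
    """Return volumes whose paths all terminate on the same front-end board.

    `paths` is an iterable of (volume, board) pairs. Both names are
    hypothetical labels, not real array identifiers.
    """
    boards_by_volume = defaultdict(set)
    for volume, board in paths:
        boards_by_volume[volume].add(board)
    # A volume with fewer than two distinct boards loses all access
    # if that one board fails.
    return sorted(v for v, boards in boards_by_volume.items() if len(boards) < 2)

# Illustrative path table: ATM_DB has two paths, but both use one board.
paths = [
    ("ATM_DB", "CHIP-1A"),
    ("ATM_DB", "CHIP-1A"),
    ("WEB_DB", "CHIP-1A"),
    ("WEB_DB", "CHIP-2B"),
]
exposed = single_board_volumes(paths)
print(exposed)
```

Here only `ATM_DB` is flagged: it has two paths, but they are not independent because they share a board.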
Queue lots of people who think they know about disk making comments. Did I say queue, it appears they already have...
Re: Lessons learned from 9/11
"I referenced 9/11 because after 9/11 many financial institutions in the US created two separate data centers connected so that if something happened to one, the second site could continue to provide data."
The company I used to work for - Intranet, Inc. - was used by a good many banks. When the first attack on the Twin Towers happened in the early 90s, one of our clients there failed over seamlessly to their backup site (in NJ at the time, I believe). Nary a hiccough. So proper solutions to this problem predate 9/11; in fact, it was a hot selling point of the product/system (the funds-transfer backbone) before I joined in '91.
All running at the time on VAXen and (Open)VMS. They did refactor the whole shebang to move to AIX when it was clear DEC was going to implode. But I have lost track of the technical details in the intervening years, just as I have lost track of whom they have bought or been bought by. But I do take credit for securing the domain name for them, back in the last of the days when you just had to send a letter asking for it: intranet.com
Was it an HDS array - do you even know? Was it a USP V or something older... who knows? There are so many ways this failure could have happened and so little detail provided that talking about it is simply pointless supposition by folks trying to demonstrate their technical prowess in IT systems resilience.
Barclays will do root cause analysis (or the finger pointing exercise as some call it), but we'll more than likely never hear that part of the story. All we need to know is that a Barclays IT system failed, and that's Barclays' fault - how they address it and who they blame is something we'll never know.
Now stop swinging your big techie dicks around and get on with some real work!
darn new fangled infrastructure
They should have bought a mainframe - you'd be surprised what you can get for £50K these days.
I called the number given to me by my branch, and over an hour (yes, HOUR) later I get through to an Indian woman who promptly tells me I need to contact my branch, quickly gives me the branch phone number and hangs up without even a 'bye'!
So, a local number: I call it and let it ring for 5 mins. A quick search on Google reveals the number is a 0870 number which happens to be the same for all branches...
I call it and the first thing it asks me for is a 16 digit card number (the very card the cash machine had retained).
Eventually I get put through to a queueing system and another Indian lady answers, who then has to put me through to another department. Fortunately I got put through to a very nice English girl who went out of her way to help me and ordered me a new card. She admitted the card was swallowed because of the network problems they suffered on Tuesday and that a large number of other people also had their cards retained.
I now have to wait 5-7 days for a replacement card, which is another inconvenience, as the problem of me not being able to access my account for 1 afternoon has now expanded into a whole week!
I can't see much sensible analysis here or in the main article.
For instance, when I worked for another vendor we had a major failure once on a storage array that had caused a set of web transactions to be lost. The program essentially was a web application to the outside world that took details and allocated a unique identifying number.
The array had failed and we were hauled in to explain to the Customer's Senior Management why our equipment was such crap.
Superficially it was our fault. A controller had failed and the whole enterprise storage system went down. It was back up and running within 10 minutes of the failure.
Analysis definitely showed that a controller had failed and then the whole storage system shut down, but the full sequence was:
1. 18 hours previously, a controller had started to fail. The appropriate warnings were put out to operator consoles etc. No one acted on the warning. This was government, so no phone home capability.
2. The controller automatically switched over to the redundant controller. And proceeded on.
3. Unfortunately, this controller also started to fail, notifying the consoles etc. No one acted AGAIN.
4. This controller then failed over. In the 18 hours since the original failure, the other board could have been replaced, but it hadn't been. A logic error in the code had the now sole running controller fail over to the already failed original controller. The storage then crashed.
5. The (onsite) engineer had the storage system back up and running 10 minutes later with the spare parts that were on site!
6. The customer was furious that the system had issued a bunch of receipt numbers to customers but no data was recorded for those receipts.
7. If you are running a database with two phase commit etc, how do you lose a transaction in a well designed application?
Turns out said customers were using some of that new fangled web .NET stuff. They were having problems getting performance out of the gazillion 1RU rack boxes used for the system. So they cached the transactions in memory and trickle fed them to database storage. It allowed customers to get their special receipt number without waiting for the data to be in the database. They were effectively giving receipts for uncommitted data when the user hit the submit button.
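To make the contrast concrete, here is a minimal sketch of the safer ordering: write the row, commit, and only then hand back the receipt number. It uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in for the real database; the `submissions` table and the `submit` function are invented for illustration, and this is a plain single-database commit, not two phase commit.

```python
import sqlite3

def submit(conn, details):
    """Store the submission, commit it, and only then return the receipt number."""
    # `with conn:` commits if the block succeeds and rolls back if it raises,
    # so a receipt is never returned for uncommitted data.
    with conn:
        cur = conn.execute(
            "INSERT INTO submissions (details) VALUES (?)", (details,)
        )
        receipt = cur.lastrowid
    return receipt

# In-memory database as a stand-in; a real system would use durable storage.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE submissions ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, details TEXT NOT NULL)"
)

receipt = submit(conn, "example form data")
stored = conn.execute(
    "SELECT details FROM submissions WHERE id = ?", (receipt,)
).fetchone()
print(receipt, stored)
```

The cost is that the user waits for the commit, which is exactly the latency the in-memory cache was hiding.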
Who is at fault here?
1. Storage Vendor - there was a bug in the failover code - so yes.
2. Operations - they didn't have proper operating processes for the system - good process would have utterly avoided the problem
3. Solution Architects - for designing a Bozo solution with no integrity - there were heaps of other kit failures that would have had the same result
4. Customer upper management - for not having any kind of clue about either the solution or adequate operating procedure, and throwing good money after a bad idea
5. Controller provider - perhaps it was a bad batch? They had been running for a long time prior to this fault.
6. Dinky Field Service contracts? Not likely in this case, the engineer was actually there and had parts available. He was never notified of the problem by Ops.
7. All of the above
Just saying it's the HDS kit that failed is like blaming Boeing for the crash when the pilot was drunk and the airline had outsourced maintenance to Peru (no offence to Peruvian Aircraft Maintainers or Pilots intended here).
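For the failover bug in item 1 of the list above, the missing guard is simple to state: never fail over to a peer that is itself marked failed. A minimal sketch, assuming a hypothetical health table keyed by controller name; this is not the vendor's actual code, and `pick_failover_target` and `NoHealthyController` are made-up names.

```python
class NoHealthyController(Exception):
    """Raised when there is no healthy peer to fail over to."""

def pick_failover_target(active, health):
    """Return a healthy controller other than `active`, or raise.

    `health` maps controller name -> True (healthy) / False (failed).
    """
    candidates = [name for name, ok in health.items() if ok and name != active]
    if not candidates:
        # Better to stay on the degraded controller and raise an alarm
        # than to switch to a board already known to be dead.
        raise NoHealthyController(f"no healthy peer for {active}")
    return candidates[0]

# First failure: ctl0 is failing, ctl1 is healthy, so failover is allowed.
target_after_first_failure = pick_failover_target(
    "ctl0", {"ctl0": False, "ctl1": True}
)
print(target_after_first_failure)
```

With both controllers marked failed, the same call raises `NoHealthyController` instead of bouncing back to the dead board.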
single point of failure
Haven't seen anyone mention in the comments that the virtualization controller in the USP is a single point of failure in that system. I don't think it impacts the disks you have on the local array itself, but if you put other arrays behind the USP then the connection to those back end arrays goes through a single point of failure. The article mentioned this to some degree and HDS came out with a way to improve availability of this controller by buying a 2nd array and clustering them. Why you couldn't just get another controller and put it into an existing USP I don't know, maybe that would be too cost effective or something.
I didn't know about this particular single point of failure on the USP, which is interesting because when HDS was pitching their stuff to us last year I asked them about their 100% uptime guarantee and they said they still had it on their high end USP. I wonder if that guarantee extended to those with virtualized storage behind a USP or not.
"Claus Mikkelsen, HDS' chief technology officer of storage architectures, said Hitachi Data Systems has focused on improving the availability of the single controller, but found some customers were still nervous about putting all of their data eggs behind one Hitachi Data Systems controller basket. "This eliminates the virtualization controller as a single point of access failure," he said."
As someone who works within part of the Barclays group in IT I can assure everyone that this is just an isolated incident. Isolated in the sense that there was even an attempt at DR capability that is.
Barclays have their own datacentre (it's big and green [in colour]). Do you really think that FastHosts will be comfortable with several hundred racks of non-Wintel systems with god-knows how much storage, running in isolated networks?
Redundancy, redundancy, redundancy...
and again I say, redundancy.
Yes, it's costly at the level of components, systems, networks, and data centers.
But look at what the costs will be now for this 3-hour outage:
1. SLA penalties,
2. loss of goodwill,
3. customers and partners now demanding much higher investment in facilities, and
4. loss of business from customers who just decide not to even approach Barclays.