'Mainframe blowout' knackered millions of RBS, NatWest accounts
A hardware fault in one of the Royal Bank of Scotland Group's mainframes prevented millions of customers from accessing their accounts last night. A spokesman said an unspecified system failure was to blame after folks were unable to log into online banking, use cash machines or make payments at the tills for three hours on …
Re: I reckon the other source had it spot on
"throw in a couple of 3rd parties and you've got them all pointing fingers at each other as well to add into the mix."
"RBS - Data Consultant - Accenture"
I reckon whoever wrote that never actually worked in a real IT environment ...
Re: I reckon the other source had it spot on
You, sir, have just summed up my job with scary accuracy!
(Are you in my team....?)
Re: I reckon the other source had it spot on
So all, technicians are incredibly overworked but still infallible, whilst all managers are lazy and incompetent. Procedures are completely unnecessary. Just get rid of all managers and procedures and everything will be fantastic.
Re: I reckon the other source had it spot on
"So all, technicians are incredibly overworked but still infallible, whilst all managers are lazy and incompetent. Procedures are completely unnecessary. Just get rid of all managers and procedures and everything will be fantastic."
And irate operators really do poison their bosses with halon. Oh come on, that has got to be the best rant I've seen in a while. Possibly it might have come from person experience, as far as you know.
Wherever it came from, I think Mike Smith needs to be hired as Simon Travaglia's ghost writer for when he's off sick and a new BOFH episode needs writing up. That was awesome.
Re: I reckon the other source had it spot on
Personal experience, even. And too late to use the edit button too.
Damn my fat fingers.
(Maybe that's what someone said shortly after the start of the RBS outage?)
Re: I reckon the other source had it spot on
"Highly available mainframe plex's like RBS run active/active across multiple sites."
Exactly, no one should have had to call anyone. The mainframe should have moved to the second active system in real time. The only call that would have been made is system calling IBM to tell an engineer to come and replace whatever was broken.
Re: I reckon the other source had it spot on
"Bunch of experts in the hardware, OS, software, Network, Storage and Backup get on call to discuss, chaired by a trained professional recovery manager."
With update meetings every half hour, which is why you also need team leads and project managers, so there is somebody to go to the meeting and say "NO, the techs are still working on it."
Re: I reckon the other source had it spot on @Mike Smith
Wow - having worked at RBS, this is not far from what has happened on some recovery calls I was involved in. Systems were down and some manager would actually request a change be raised BEFORE fixing the issue - anybody who knows the RBS change system *coughInfomancough* knows it is not the quickest system in the world.
Have a pint for reminding me what I escaped from.
Re: I reckon the other source had it spot on
Wot no down votes? So are there no managers who frequent el Reg? Or maybe, just maybe, could the techies be right?
[Coat -- system's buggered so I'm off down the pub.]
Re: I reckon the other source had it spot on
@ andy mcandy
Just for clarification (I wrote clarifiction first, is that a word ? speeling chucker didn't barf at it), it is Mike Smith's post you are referring to, if not which one please ?
"In theory, the banking group’s disaster-recovery procedures should have kicked in straight away without a glitch in critical services."
Easy to be judgemental, but a DR failover is normally a controlled failover which takes a number of hours; you need to be 100% sure the data is in a consistent state to be able to switch across. I've seen failovers that have gone wrong and it's a million times worse being left halfway between both!
It's unlikely any system is truly active/active across all of its parts.
In theory...
Of course, in theory reality and theory are identical. In reality, they are not.
"Its unlikely any system is truly active/active across all of its parts"
That is the beauty of the System z though. It is not a distributed system which requires 8 HA software layers from 8 different vendors, none of which are aware of each other, to be perfectly in sync. It is truly active/active across all of its parts because there are not that many parts, or third party parts to sync. Parallel sysplex, IBM mainframe HA, manages the whole process.
banking group’s disaster-recovery procedures
This was the bank where the disaster-recovery procedures went disastrously wrong.
I'm going to at least consider the possibility that this time they were told they COULD NOT start a disaster-recovery procedure until everything was turned off and backed up.
Microsoft let their own cert expire and it blew up their cloud
Just because it's a small avoidable failure, doesn't mean it can't have a catastrophic effect
"procedures should have kicked in straight away without a glitch in critical services"
These would be the disaster-recovery procedures that RBS said were given a good seeing to after last year's balls up so this kind of thing would never happen again.
(I know a hardware problem and a batch problem aren't related, but the procedures that they follow if something happens which brings down the bank's service probably are.)
Re: "procedures should have kicked in straight away without a glitch in critical services"
Disaster recovery doesn't help a damn if (a) your data is replicated immediately to DR and is buggered or (b) it's quicker to recover in the same site than flip to remote site.
The assumption some people have that DR is a simple flick to a remote site and everything springs to life in 5 minutes is horribly, horribly flawed in so many ways.
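The two cases above, (a) and (b), boil down to a decision that somebody has to make before flipping sites. A minimal sketch, assuming three made-up inputs; the function name and its parameters are purely illustrative and not anything RBS or a DR product actually uses:

```python
def choose_recovery(replica_consistent, local_recovery_minutes, failover_minutes):
    """Pick a recovery path from three hypothetical inputs."""
    if not replica_consistent:
        # Case (a): the bad data was replicated too, so failing over
        # just moves the problem to the other site.
        return "recover locally"
    if local_recovery_minutes <= failover_minutes:
        # Case (b): fixing it in place is at least as quick as flipping sites.
        return "recover locally"
    return "fail over to DR site"


print(choose_recovery(False, 240, 60))
print(choose_recovery(True, 45, 180))
print(choose_recovery(True, 300, 120))
```

Only the last call ends in a failover, and even that assumes somebody has already proved the replica is consistent, which is the slow part.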
Re: "procedures should have kicked in straight away without a glitch in critical services"
"The assumption some people have that DR is a simple flick to a remote site and everything springs to life in 5 minutes is horribly, horribly flawed in so many ways."
Indeed, though vendors often like to give the impression that this is how their solutions work.
Alternatively...
...mainframe is duplicated across 2 towers. Or more.
Part 1: Tower 1 taken out for patching/maintenance/a laugh/upgrade
Part 2: At the same time, somebody yanks the 3 phase on tower 2 by mistake. As it was night time, it was probably to plug in a hoover (yes, I know...it's a joke). Or the main breaker blows. Or the entire DC where Tower 2 resides goes on holiday to the Bermuda Triangle.
Part 3: Royal shitstorm getting Tower 2 back online after unclean shutdown, and everything rolled back, then rolled forward again. It was probably noticed "in milliseconds" when it went down, but "switching it on and off again" doesn't really work on a transactional Z. It takes time.
Part 4: Parallel recovery is to cancel the work on Tower 1 and get it back online, while rolling forward all the "in transit" stuff from Tower 2.
Part 5: Meanwhile CA7 guys run about cancelling/rescheduling batch jobs.
Edit: Part 6: All the secondary services are restarted - mainly the application server layer, and any rollbacks are replayed now the back end is back.
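The "rolled back, then rolled forward" bit in Part 3 can be pictured with a toy transaction log. This is only a sketch of the general idea, not how z/OS or any real database does it; the log format and the names `LogRecord` and `recover` are invented for illustration:

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    txn: str           # transaction id (hypothetical format)
    kind: str          # "update" or "commit"
    account: str = ""
    delta: int = 0


def recover(checkpoint, log):
    """Rebuild balances after an unclean shutdown.

    Updates belonging to transactions with a commit record are replayed
    (rolled forward); updates from in-flight transactions are dropped
    (rolled back).
    """
    committed = {r.txn for r in log if r.kind == "commit"}
    balances = dict(checkpoint)
    for r in log:
        if r.kind == "update" and r.txn in committed:
            balances[r.account] = balances.get(r.account, 0) + r.delta
    return balances


checkpoint = {"alice": 100, "bob": 50}
log = [
    LogRecord("t1", "update", "alice", -30),
    LogRecord("t1", "update", "bob", 30),
    LogRecord("t1", "commit"),
    LogRecord("t2", "update", "alice", -500),  # power yanked before commit
]
recovered = recover(checkpoint, log)
print(recovered)
```

Transaction t1 survives and t2 vanishes as if it never started. Doing that across a real log is why it takes time.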
Tandem
I don't know much about these things but I recall many years ago a relative working in banking IT who mentioned that some of the banks used Tandem hardware that provided a continuous and automatic redundancy, but the machines cost a lot more. As I said, I know nothing about mainframes but these days do all mainframes provide such features or not? If not, is that why we get such incidents? Any insights most welcome.
Re: Tandem
Just found this on Wiki, partly explains my question, Tandem no longer seem to exist, but perhaps proves my point about cheaper options and that use of Tandem-like solutions should cope with these failures and not require a committee to have a conference call about the failure?
"Tandem Computers, Inc. was the dominant manufacturer of fault-tolerant computer systems for ATM networks, banks, stock exchanges, telephone switching centers, and other similar commercial transaction processing applications requiring maximum uptime and zero data loss. The company was founded in 1974 and remained independent until 1997. It is now a server division within Hewlett Packard.
Tandem's NonStop systems use a number of independent identical processors and redundant storage devices and controllers to provide automatic high-speed "failover" in the case of a hardware or software failure.
To contain the scope of failures and of corrupted data, these multi-computer systems have no shared central components, not even main memory. Conventional multi-computer systems all use shared memories and work directly on shared data objects. Instead, NonStop processors cooperate by exchanging messages across a reliable fabric, and software takes periodic snapshots for possible rollback of program memory state."
Re: Tandem
Actually, Tandem (or HP NonStop as it is now called) is still in wide use. At the institution I work at (UK) it's the backend to the ATM network. I went to a seminar about it and was told their last outage was in 1991!
Re: Tandem
Tandem hardware was Fault Tolerant, not Highly Available. There were other players, like Stratus and Sun, in that area.
FT hardware duplicates the systems inside a single box, perhaps three CPUs, three disk controllers, three network cards, etc. They all do the same work on the same data, and if they get different results there's a majority vote to decide who's right.
It provides excellent protection against actual hardware failure, like a CPU or memory chip dying, but offers no protection at all against an external event, or operator error. Just like using RAID with disks, which protects against disk failure, but it isn't a replacement for having a backup if someone deletes the wrong file by mistake.
It is expensive, you're paying for three systems but getting the performance of one, and given the reliability of most systems these days it isn't used much outside the aviation/space/nuclear/medical world, where even the time to switchover to a backup can be fatal. There's a reason that none of the companies who made FT systems managed to survive as independent entities.
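The majority vote described above is easy to sketch in software, even though real FT boxes do it in hardware. A minimal illustration, assuming three replicas that are just Python functions; `majority_vote`, `run_redundant`, `healthy` and `faulty` are made-up names:

```python
from collections import Counter


def majority_vote(results):
    """Return the value reported by more than half of the replicas.

    Raises RuntimeError when no value has a strict majority.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among replicas")
    return value


def run_redundant(replicas, *args):
    """Run the same work on every replica and vote on the answer."""
    return majority_vote([replica(*args) for replica in replicas])


def healthy(balance, amount):
    return balance - amount


def faulty(balance, amount):
    return balance - amount + 1  # simulated hardware fault


voted = run_redundant([healthy, faulty, healthy], 100, 30)
print(voted)
```

The faulty replica is outvoted two to one. Note that the vote does nothing for the RAID-style problem mentioned above: if all three replicas are told to do the wrong thing, they agree on the wrong answer.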
Re: Tandem
Yeah, but the main thing about Tandem/HP NonStop systems is every CPU is duplicated, all memory is duplicated, and for every operation if the two results don't match the (dual)CPU in question STOPS. It's very keen on stopping; it's only a huge mound of failover software and redundant power and duplication that makes a *system* very keen on continuing; individual parts stop quite readily.
Of course, the intended market is OLTP, so the goal is to make sure that the decrement to your bank balance is the right answer; if two paired hardware CPUs and their memory give different answers, that pair of CPUs stops and a whole 'nother hardware set attempts the same transaction.
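The stop-on-mismatch behaviour is different from a majority vote, and a small sketch shows why. This assumes each "CPU" is just a function and each pair is a tuple of two of them; `PairMismatch`, `run_on_pair` and `debit` are hypothetical names, not NonStop APIs:

```python
class PairMismatch(Exception):
    """Raised when the two halves of a lockstep pair disagree."""


def run_on_pair(pair, balance, amount):
    """Run the operation on both CPUs of a pair; stop the pair on disagreement."""
    cpu_a, cpu_b = pair
    a, b = cpu_a(balance, amount), cpu_b(balance, amount)
    if a != b:
        raise PairMismatch(f"results differ: {a} != {b}")
    return a


def debit(pairs, balance, amount):
    """Try each pair in turn; a pair that disagrees is skipped, not trusted."""
    for pair in pairs:
        try:
            return run_on_pair(pair, balance, amount)
        except PairMismatch:
            continue  # this pair has stopped; fail over to the next one
    raise RuntimeError("every pair stopped")


def good(balance, amount):
    return balance - amount


def bad(balance, amount):
    return balance - amount - 1  # simulated hardware fault


new_balance = debit([(good, bad), (good, good)], 100, 30)
print(new_balance)
```

With only two results there is no way to tell which one is right, so the pair gives up rather than guess and the transaction is retried elsewhere.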
Re: Tandem
"As I said, I know nothing about mainframes but these days do all mainframes provide such features or not? If not, is that why we get such incidents?"
Yes, IBM mainframe's parallel sysplex is the gold standard in HA. Basically the system upon which all other clusters have been based. The systems read/write I/O in parallel so both system A and B (or more than two if you choose) have perfect data integrity and can process I/O in parallel. If one of those systems goes away, the others continue handling I/O with no disruptions. There is also a geographically dispersed parallel sysplex option which can provide out of region DR, in case the data center blows up or something, at wire speed with log shipping which is also active/active, but it takes a few seconds, literally, before the I/O on the wire is written and the DR site takes over. In theory, we should never get such incidents, but, like anything, people can misimplement the HA solution... which seems to have happened here.
Ok my two pence
The zSeries is not some mickey mouse piece of hardware; this is built to have downtimes measured in minutes per year.
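For a sense of scale, "minutes per year" translates into availability percentages like this. A rough sketch of the arithmetic only; the helper names are made up, and the three-hour figure is the outage length quoted in the article:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_minutes_per_year(availability_percent):
    """Downtime allowed per year at a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def availability_percent(downtime_minutes):
    """Availability over a year, given total downtime in minutes."""
    return 100 * (1 - downtime_minutes / MINUTES_PER_YEAR)


for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.1f} min/year")

three_hour_outage = availability_percent(3 * 60)
print(f"one 3-hour outage -> {three_hour_outage:.3f}% for the year")
```

"Five nines" leaves a little over five minutes a year, so a single three-hour outage uses up decades of that allowance.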
In the rush to get back to profits, using slash and burn tactics, did they kick out too many British-based banking staff?
You Hit the Nail
RBS had to "retain talent" in the trading rooms and pay them several 100k of bonus per year. And let them bet the entire bank to get the short term results for obtaining said bonuses. On the long run it crashed the bank and "in order to become profitable again", experienced British engineers and specialists were replaced by Indians with 1/10th of wage and 1/100th of experience/skill/actual value.
But you know what ? That is the whole purpose of modern finance - suck the host white until it is dead, then leave the carcass for the next host. IT people are considered part of the host organism.
Grab yourself a history book and see how that played out between 1929 and 1945.
Picture of firebombed city.
Prevented millions from accessing their accounts?
I had no idea that their customers were such night owls. More likely that thousands were affected by the outage, no?
Re: Prevented millions from accessing their accounts?
9pm at night with no card transactions and no ATM. Anyone out for dinner or drinks using RBS was knackered.
Certainly going to have an effect on many people.
Re: Prevented millions from accessing their accounts?
9pm at night with no card transactions and no ATM. Anyone out for dinner or drinks using RBS was knackered.
That would limit it to politicians and Traders, as everyone else in the country is too skint to eat out midweek. Then again, the politicians probably wouldn't be paying for it anyway - it's all on "expenses". So it's just the traders then. No biggie.
It does seem odd that the auxiliary systems didn't take the load whilst the primary was out of action.
High Availability and Resilience
This should NOT have been about DR or backups. This should have been handled as part of any high-availability, RESILIENT cluster system design. I've designed and architected HA on IBM SP2 supercomputer clusters and can well attest that it works - our "system test" was walking the floor of the data centre randomly pulling drive controller cables and CPU boards out of their sockets, while having the core systems still running processes without failing! And that was 10+ years ago - I find it appalling that a live banking system would not be engineered to have the same degree of _resilience_. Don't talk in terms of how many minutes of downtime it will have per year - it should be engineered to survive the failure of x number of disks, y number of controllers, and z number of processors (within a chassis/partition/etc.) before the service fails. For a live, financial system, those should be the metrics that are quoted, not reliability alone.
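Quoting resilience as "x disks, y controllers, z processors" could look something like the sketch below. The component counts are invented purely for illustration, and `Component` and `survives` are hypothetical names:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    installed: int  # how many are fitted
    required: int   # how many must be working for service to continue

    @property
    def tolerated_failures(self):
        return self.installed - self.required


def survives(design, failures):
    """True if every component type still meets its minimum after the failures."""
    return all(
        failures.get(name, 0) <= part.tolerated_failures
        for name, part in design.items()
    )


# Hypothetical numbers, not taken from any real system.
design = {
    "disks": Component(installed=8, required=6),
    "controllers": Component(installed=4, required=2),
    "processors": Component(installed=6, required=4),
}

budget = {name: part.tolerated_failures for name, part in design.items()}
print(budget)
print(survives(design, {"disks": 2, "controllers": 1}))
print(survives(design, {"processors": 3}))
```

The `budget` dictionary is the metric being asked for, and the cable-pulling "system test" amounts to checking `survives` against the real kit.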
Re: High Availability and Resilience
Exactly, this is an HA design or implementation issue.
Re: High Availability and Resilience
Definitely an HA issue and it should be automated/orchestrated.
Tandem
There would be a Tandem line between the mainframe and the ATM network. If it goes down or out of sync it can take some coordination restarts to bring it back up.
Common problem
It's that 16K RAM pack on the back. If you wobble it, you can lose all your data.
Just Joshing...
The Government have said they will sell their holding in RBS when the stock price reaches a certain level. What if someone decided they didn't like that idea? This little incident will put that sale in doubt.
After the last mainframe blow out - one would have thought the place would have been running a bit better - I take it other banks aren't experiencing similar outages?
A mainframe hardware fault !
"A hardware fault in one of the Royal Bank of Scotland Group's mainframes prevented millions of customers from accessing their accounts last night.
Assuming this were the case, they must have multiple redundant systems, mustn't they? On the other hand, maybe someone ran the wrong backup procedure ... again ! ! !
"This fault may have been something as simple as a corrupted hard drive, broken disk controller or interconnecting hardware."
No, mainframes have multiple harddrives, disk controllers and error detection and correction circuits ...
Re: A mainframe hardware fault !
"No, mainframes have multiple harddrives, disk controllers and error detection and correction circuits ..."
Yes, and they are clustered systems, so if one system bombs, even with all the fault tolerant architecture, another system in the cluster (or parallel sysplex in mainframe vernacular) should pick up the load. As with any cluster, you may take a performance hit, but it should never just go down.
Banks Fail Again
Unless their data centre was a smoking hole in the ground, outages of live systems are unacceptable.
Even if their data centre was nuked, the bank should have continued running its live services from an alternate location, with minimal "down time".
The bank is paid very handsomely by its customers for services, and "off lining" several billion pounds of the UK economy for 3 hours is completely unacceptable.
Whilst normally I personally think less legislation is a "good thing", HMG really needs to kick the regulator to remind them that deciding whether an organisation is "fit and proper" to hold a banking licence should include whether it can actually deliver the service reliably.
Re: Banks Fail Again
I think it's unacceptable that society has got to a point that computers are *that* big and *that* important and the job *can't* be done by humans.
Probable key factor in the outage - "IBM mainframes don't fail!"
I have had (stupid) people say to me statements like "Here, this is our DR plan, but don't worry about reading it, we have an IBM mainframe and IBM told us it will never fail" (note - IBM are VERY careful not to make that legally binding statement in their sales pitch, but they are happy to leave you with the impression). I have called directors at three in the morning to tell them the bizz is swimming in the brown stuff because we have had a mainframe stop/pop/do-the-unexpected, and after a few moments of bewildered silence at the other end of the line you get those immortal words: "But it's a mainframe....?" Half the problem is people are so lulled by the IBM sales pitch they just don't stop to think ANYTHING MANMADE IS FALLIBLE, so when something does go wrong there is an inertia due to an inability to accept the simple fact stuff breaks, whether it has an IBM badge or not. I bet half the delay in solving the RBS outage was simply down to people getting past that inertia.
Re: Probable key factor in the outage - "IBM mainframes don't fail!"
"Half the problem is people are so lulled by the IBM sales pitch they just don't stop to think ANYTHING MANMADE IS FALLIBLE"
Yes, anything manmade is fallible. IBM mainframe is fault tolerant, redundant hardware which can be dynamically used in the case of a component failure, but it is also a clustered system, parallel sysplex. Parallel sysplex is in place specifically because the systems might fail for whatever reason, e.g. hw failure, software error, data center blows up. I/O is processed in parallel across multiple systems so if one is unavailable, the other mainframes can immediately pick up the I/O. The IBM coupling facilities which make it possible for server time protocols to work in parallel are brilliant. No system hardware failure should ever take down a mainframe environment, unless you implemented parallel sysplex incorrectly. It is like having Oracle RAC implemented incorrectly and blaming the outage on a single server failure. If RAC is implemented correctly, a server failure should not matter. I highly doubt any IBM rep told anyone the mainframe is infallible and never goes down at a hardware level, if for no other reason than they wanted to sell parallel sysplex software.
"I have called directors at three in the morning to tell them the bizz is swimming in the brown stuff because we have had a mainframe stop/pop/do-the-unexpected, and after a few moments of bewildered silence at the other end of the line you get those immortal words: "But it's a mainframe....?"
I doubt that ever happened, but, if it did, they asked the right question. If properly implemented, that should never happen. Much like if you were to call your Director at three in the morning to tell them that the RAC cluster is down because a server failed, they would say "But it's a RAC cluster.....?"
Re: Probable key factor in the outage - "IBM mainframes don't fail!"
".....Yes, anything manmade is fallible....." But - don't tell me - IBM mainframes are made by The Gods, right?
"..... IBM mainframe is fault tolerant, redundant hardware....." Ignoring my own experience, this story goes to show you are completely and wilfully blind!
"......No system hardware failure should ever take down a mainframe...." So the event never happened, it was all just a fairy tale, right? You know I mentioned stupid people earlier that said stuff like "forget DR, it's on an IBM mainframe", well please take a bow, Mr Stupid.
Re: Probable key factor in the outage - "IBM mainframes don't fail!"
> If properly implemented, that should never happen.
I hope to God you're never implementing systems I have to rely on.
Let me guess, your code also has lots of:
/* We can never get here */
return;
> Much like if you were to call your Director at three in the morning to tell them that the RAC cluster is down because a server failed, they would say "But it's a RAC cluster.....?"
And we all know that RAC clusters never, ever, go down. It's amazing that Oracle even bothers to sell support for them isn't it?
Re: Probable key factor in the outage - "IBM mainframes don't fail!"
"I have called directors at three in the morning to tell them the bizz is swimming in the brown stuff because we have had a mainframe stop/pop/do-the-unexpected"
There was a time when the Director would have called you to ask why they had to get the news of a fault from IBM and not their own IT organisation...
Either IBM customer service has gone downhill or they've decided it's better business to be friends with the IT organisation.
Re: Roland6 Re: Probable key factor in the outage - "IBM mainframes don't fail!"
".....There was time when the Director would of called you to ask why they had to get the news of a fault from IBM and not their own IT organisation..." If you're implying the typical IBM response was to worry about cuddling up to senior management rather than fixing the problem then nothing has changed. But you should also know it is the first rule of BOFHdom that you should always know more than those above you. Dial-home services and the like should always have the BOFH as contact so you are in control of the flow of information uphill, so as to make sure that when the brown stuff comes rolling downhill it is not on your side. Your role has probably already been short-listed for being outsourced if you haven't mastered such basics.
Re: ...was simply down to people getting past that inertia.
A well placed kick with a steel toed boot might help!
