
'Mainframe blowout' knackered millions of RBS, NatWest accounts

A hardware fault in one of the Royal Bank of Scotland Group's mainframes prevented millions of customers from accessing their accounts last night. A spokesman said an unspecified system failure was to blame after folks were unable to log into online banking, use cash machines or make payments at the tills for three hours on …

COMMENTS

This topic is closed for new posts.


Nice to see RBS not using ITIL. *cough*

1
0
Anonymous Coward

Crap Bonuses = Only Crap Staff Stay = Crap Services...

3
1
Anonymous Coward

Fuck off. Like the people getting paid millions in bonuses have got anything to do with the actual day to day running of the bank's IT services. Apart from taking the credit when things don't go wrong, obv...

1
0
Silver badge

Is it better that it is a new failure rather than a repeat of an old one?

5
0
Anonymous Coward

Yes.

Next question.

2
0
jai
Silver badge

to add some detail, if it's the same one, that would suggest that no one did any due diligence, or lessons learned, or root cause analysis or any of a dozen other service delivery buzzwords that all basically mean, "wtf happened and how do we make sure it doesn't happen again?"

you'll always get new issues that break things. such is the way of IT. no system is 100% perfect. you just have to put in as much monitoring/alerting and as many backup systems as you can afford, to ensure the impact of any outage on your business-critical systems is as small as possible.
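The cheapest version of that monitoring/alerting is a loop that probes each business-critical endpoint and shouts when one stops answering. A minimal sketch in Python follows; the hostnames, ports and alert() hook are invented for illustration and reflect nothing about RBS's actual estate.

import socket

# Hypothetical endpoints standing in for "business critical systems".
SERVICES = {
    "online-banking": ("banking.example.internal", 443),
    "payments-gateway": ("payments.example.internal", 8443),
}

def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def alert(service):
    """Stand-in for real alerting (pager, SNMP trap, ops bridge...)."""
    print(f"ALERT: {service} is unreachable")

if __name__ == "__main__":
    # A real setup would run this from cron or a monitoring daemon, not just once.
    for name, (host, port) in SERVICES.items():
        if not is_reachable(host, port):
            alert(name)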

1
0
Bronze badge
Coat

New failure better than repeat failure?

Yep, they're innovating.

4
0
Anonymous Coward

This failure is *not* related to the previous one.

Oh no.

We are capable of lots of different failures.

2
0
FAIL

I doubt it

I work somewhere that has a much smaller IT department and a much smaller IT budget than RBS, but it would take the failure of multiple hardware devices to knock a key service out. What kind of Mickey Mouse setup do they have that a hardware failure can take down their core services for hours?

9
3

Re: I doubt it

Indeed. My point re ITIL...

It beggars belief that this has happened. Oh wait... no, actually it doesn't. I've seen a simple electrical fault take down a Tier Two DC... it hit six companies hosting live revenue-generating services. One company nearly went tits up.

Of course at RBS I expect a director to be promoted to a VP type position for this cock up.

7
0
Anonymous Coward

Re: I doubt it

It would take the failure of multiple pieces of hardware to take down an IBM zServer, but that doesn't mean it can't happen. The only thing you can be sure of with a system is that it will eventually fail.

To accuse them of running a "Mickey Mouse" operation suggests that you've no idea how big or complex the IT setup at RBS is. I believe they currently have the largest "footprint" of zServers in Europe, and that's without even mentioning the vast amount of other hardware on a globally distributed network.

Small IT = Easy.

Big IT = Exponentially more complicated.

14
6

Re: I doubt it

Incidentally, this could be another argument that RBS is just too big.

... and why the fuck are they allowed to have BOTH a banking licence and limited liability? ... mutter mutter .... moan ...

3
1
Silver badge

Re: I doubt it

One where the PHB will only release funds for repairs when there is an actual service failure.

0
0
Silver badge

Re: I doubt it

> It would take failure of multiple pieces of hardware to take down an IBM zServer, that doesn't mean it can't happen.

It also assumes someone noticed the first failure. I remember our DEC service bod (it was a while ago :) ) complaining about a customer who'd had a total cluster outage after a disk controller failed. Customer was ranting & raving about the useless "highly available" hardware they'd spent so much money on.

Investigation showed that one of the redundant controllers had failed three months before, but none of the system admins had been checking logs or monitoring things. The spare controller took over without a glitch, no-one noticed, and it was only when that one failed too that the system finally went down.
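The moral there is that "still up" and "still redundant" are different states, and only the first one gets noticed for free. A toy Python sketch of the check that would have flagged the dead controller months earlier; the status format and component names are assumptions, nothing DEC- or RBS-specific:

EXPECTED_CONTROLLERS = 2   # how many redundant controllers the cluster should have

def healthy_controllers(status_lines):
    """Count controllers reporting OK in a status dump (format is an assumption)."""
    return sum(1 for line in status_lines if line.strip().endswith("OK"))

def check_redundancy(status_lines):
    up = healthy_controllers(status_lines)
    if up == 0:
        return "CRITICAL: no working controllers - this is the outage"
    if up < EXPECTED_CONTROLLERS:
        # Service still runs, which is exactly why nobody notices without a check.
        return f"WARNING: running on {up}/{EXPECTED_CONTROLLERS} controllers - replace the dead one before the spare goes too"
    return "OK: fully redundant"

if __name__ == "__main__":
    print(check_redundancy(["ctrl-A: FAILED", "ctrl-B: OK"]))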

29
0
jai
Silver badge

Re: I doubt it

i was once told an anecdote. the major investment bank the teller worked for failed its overnight payment processing repeatedly, every night. Eventually they determined it was happening at exactly the same time each night. so they upgraded the ram/disks. patched the software. replaced the whole server. nothing helped.

Finally, head of IT decides enough is enough, takes a chair and a book to the data centre, sits in front of the server all night long to see what it's doing.

and at the time when the batch had failed every night previously, the door to the server room opens, the janitor comes in with a hoover, looks for a spare power socket, finds none, so unplugs the nearest one and plugs in the hoover. yes, you guessed it, the plug that was unplugged was the power lead for the server in question.

just because you're a big firm doesn't mean you don't get taken out by the simplest and stupidest of things

1
6
WTF?

Re: I doubt it

And no-one noticed the reboots in the syslog? No-one noticed the uptime looked a bit funny when they ran top (or equivalent)? No-one came in the next morning and wondered why their remote logins had dropped?

I call shenanigans.

6
0
Bronze badge
Mushroom

Re: I doubt it

You mean demoted? Director is higher than VP.

1
2
Bronze badge
Facepalm

Re: I doubt it

Recycled urban legend methinks. I heard the one about a server stuck under a desk with the janitor etc.

3
0
Anonymous Coward

Re: I doubt it

"You mean demoted? Director is higher than VP."

Well, in a sane world that might be true, but one very large (70,000+) UK company I worked for had:

minions

me

associate director

director

VP

Head of dept

Head of Research

Board level directors

0
0
Anonymous Coward

Urban Legend ...

"i was once told an anecdote .. the door to the server room opens, the janitor comes in with a hoover, looks for a spare power socket, finds none, so unplugs the nearest and plugs in the hoover. yes, you guessed it, plug that was unplugged was the power lead for the server in question."

I read something similar, only it was set in the ICU of a hospital and what they unplugged was the ventilator ...

2
0
Gold badge
Childcatcher

Re: I doubt it

"... and why the fuck are they allowed to have BOTH a banking licence and limited liability? ... mutter mutter .... moan ..."

You forgot that UK banks have "preferred creditor" status, so are one of the first in line if a company is declared bankrupt. Because it's to protect the widows, orphans and other children (hence my icon).

Which can happen when a bank asks them to repay their overdraft immediately, for example.

1
0

This post has been deleted by its author

Stop

Re: I doubt it

That was an episode of Monk.

1
0
Silver badge
Facepalm

Re: Jai Re: I doubt it

"......and at the time when the batch failed every night previously, the door to the server room opens, the janitor comes in with a hoover, looks for a spare power socket, finds none, so unplugs the nearest and plugs in the hoover. yes, you guessed it, plug that was unplugged was the power lead for the server in question....." Yeah, and I have some prime real estate in Florida if you're interested. A Hoover or the like would be on a normal three-pin, whilst a proper server would be on a C13/19 or 16/32A commando plug. It would also have probably at least two PSUs so two power leads, unplugging one would not kill it.

3
1
FAIL

Re: I doubt it

Not to mention that all the kit in a DC is powered by sockets *inside the racks*.

1
0
Anonymous Coward

Re: I doubt it

"I work somewhere that has a much smaller IT department and a much smaller IT budget than RBS, but it would take the failure of multiple hardware devices to knock a key service out."

Yes, and we are talking about a mainframe. It is near impossible to knock a mainframe offline with a simple "hardware failure." Those systems are about 14-way redundant in the first place, so it isn't as though an OSA or some other component corrupted and knocked the mainframe offline. Even if the data center flooded or the system disappeared by magic, almost all of these mega-mainframes have a Parallel Sysplex/HyperSwap configuration, which is a bulletproof HA design. If system A falls off the map, the secondary system picks up the I/O in real time, so why didn't that happen? I am interested to hear the details.
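For anyone who hasn't met Parallel Sysplex/HyperSwap, the idea being described is synchronous mirroring with transparent takeover: writes go to every copy, and losing one copy never surfaces to the caller. A deliberately toy sketch of that idea in Python, nothing z/OS-shaped about it and all names invented:

class Copy:
    """One synchronously mirrored copy of the data (a stand-in, not real storage)."""
    def __init__(self, name, alive=True):
        self.name, self.alive, self.records = name, alive, []

    def write(self, record):
        if not self.alive:
            raise IOError(f"{self.name} is not responding")
        self.records.append(record)

class MirroredStore:
    """Writes go to every live copy; one dead copy is tolerated transparently."""
    def __init__(self, *copies):
        self.copies = list(copies)

    def write(self, record):
        survivors = 0
        for copy in self.copies:
            try:
                copy.write(record)
                survivors += 1
            except IOError:
                pass  # the "HyperSwap moment": carry on with the remaining copy
        if survivors == 0:
            raise IOError("all copies down - now you have an outage")

if __name__ == "__main__":
    site_a, site_b = Copy("site-A", alive=False), Copy("site-B")
    store = MirroredStore(site_a, site_b)
    store.write("payment #1")   # lands on site-B, the caller never notices
    print(site_b.records)       # ['payment #1']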

3
0
Anonymous Coward

Re: I doubt it

"It would take failure of multiple pieces of hardware to take down an IBM zServer, that doesn't mean it can't happen. The only thing you can be sure of with a system is that the system will eventually fail."

A hardware failure taking down a System z is supremely unlikely. I think the real-world mean time between mainframe outages is something like once in 50 years. Even if it was a hardware failure (which would mean a series of hardware component failures all at the same time in the system), IBM has an HA solution in a class of its own: Geographically Dispersed Parallel Sysplex. You can intentionally blow up a mainframe, or the entire data center, in that HA design and it will be functionally transparent to the end user. A system might fail, but the environment never should.

3
1

I've said it before and witnessed it again this week

Walking through a data center only two days ago I noticed a failed drive on a SAN box. No alerts, and no one doing a physical check every morning and afternoon either, so I'm not surprised.

1
0

Re: I doubt it

We used to have a Stratus. They worked on the principle that SysOps *do* forget to look at logs; the irony being that if all the components are individually reliable, then humans, being humans, won't worry about it so much.

So Stratus machines phoned home when a part died and that meant an engineer turning up with a new bit before the local SysOps had noticed it had died.

That's not a cheap way of doing things of course, but at some level that's how you do critical systems. When a critical component fails the system should attract the attention of the operators.

That leads me back to seeing this as yet another failure of IT management at RBS.

If the part failed, then there should have been an alert of such a nature that the Ops could not miss it. A manager might not write that himself, but his job is to make sure someone does.

The Ops should be motivated, trained and managed to act rapidly and efficiently. Again this is a management responsibility.

All hardware fails; all you can do is buy a lower probability of system failure. So the job of senior IT management at RBS is not, as they seem to think, playing golf and greasing up to other members of the "management team", but delivering a service that actually works.

No hardware component can be trusted. I once had to deal with a scummy issue where a cable that lived in a duct just started refusing to pass signals along. The dust on the duct showed it had not been touched or even chewed by rats, it had just stopped. Never did find out why.
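For the curious, a crude sketch of the Stratus-style phone-home pattern mentioned above. The endpoint, payload and status format are invented, and real kit does this in firmware rather than a script, but the shape is the same: the machine reports its own dead part instead of waiting for someone to read the logs.

import json
import urllib.request

SERVICE_DESK_URL = "https://vendor.example.com/phone-home"   # hypothetical endpoint

def report_failure(component, serial):
    """Tell the vendor a part has died; fall back to a local log line if that fails."""
    payload = json.dumps({"component": component, "serial": serial}).encode()
    req = urllib.request.Request(SERVICE_DESK_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    try:
        urllib.request.urlopen(req, timeout=10)
    except OSError:   # urllib's URLError is an OSError subclass
        print(f"could not phone home - page the ops desk instead: {component} failed")

def scan_components(statuses):
    """statuses: dict of component name -> 'OK' / 'FAILED' (format is an assumption)."""
    for name, state in statuses.items():
        if state != "OK":
            report_failure(name, serial="ABC123")   # serial made up for the demo

if __name__ == "__main__":
    scan_components({"psu-1": "OK", "disk-controller-0": "FAILED"})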

3
0
Facepalm

Re: Urban Legend ...

Although I imagine the story has grown in the telling, the "handy power socket" is certainly something I've experienced first-hand in end-user situations. I remember in one office having to go round labelling the appropriate dirty-power sockets "vacuum cleaners only" in order to try and prevent the staff plugging IT equipment into them and thus leaving the cleaner grabbing any handy socket on the clean-power system for the vacuum cleaner...

0
0
Mushroom

Re: I doubt it

The biggest failure point on a Z is the idiots running loose around it.

I asked a console operator on an Amdahl 470 what the button labelled IPL was for. He said "IDIOTS PUSH LOAD".

1
0

Re: Urban Legend ...

“I read something similar, only it was set in the ICU of a hospital and what they unplugged was the ventilator ...”

Yup - famous Urban Legend. The hospital setting dates back to a South African newspaper "story" in 1996, but the UL itself goes back much further.

http://www.snopes.com/horrors/freakish/cleaner.asp

2
0
Bronze badge

Re: so unplugs the nearest and plugs in the hoover.

WROK PALCE prevents that one by using LOCKING plugs on ALL of its servers. (For those of you on the "other" side of the pond, locking plugs are completely incompatible with standard US power cords.)

0
0
Rob
Bronze badge
Coat

Sounds like...

... the first error was the PFY and this one is the BOFH.

Give me my coat quick before someone lumbers out of the halon mist brandishing a cattle prod.

2
0
Silver badge

I thought one of the features of a mainframe...

...was umpteen levels of redundancy? One CPU "cartridge" goes pop? Fine. Rip it out of the backplane and stuff another one in, when you've got one to stuff in there.

Dual (or more) PSUs, RAID arrays.. and yet this happens. Oh well. Wonder what RBS's SLAs say about this?

They do have SLAs for those likely-hired-from-someone-probably-IBM machines, don't they?

2
1
Anonymous Coward

Re: I thought one of the features of a mainframe...

There are umpteen levels of redundancy, but that doesn't mean outages don't happen on occasion.

0
1
Silver badge

Re: I thought one of the features of a mainframe...

Multiple hardware components are fine as long as it is a discreet hardware failure.

Firmware, microcode or whatever you want to call it can also fail, and even when you're running allegedly different versions at different sites they could have the same inherent fault.

The only true way to have resilience is for the resilient components to be made by different vendors using different components (which is what LINX/Telehouse has with Juniper, Cisco, Foundry and others for their network cores). IBM mainframes don't work this way.

2
3
Headmaster

...so long as it is a DISCRETE hardware failure...

I won't do my usual sigh - this one is a bit more subtle than your/you're.

Discrete - separate

Discreet - circumspect.

To remember, the e's are discrete.

4
1
Anonymous Coward

Re: I thought one of the features of a mainframe...

"The only true way to have resilience is for the resilient components to be made by different vendors using different components (which is what Linx/Telehouse has with Jupiter, Cisco, Foundry and others for their network cores). IBM mainframes don't work this way"

Yeah, I suppose that is true... although you are more likely to have constant integration issues with many vendors in the environment, even if you are protected against the blue-moon event of a system-wide fault spreading across the environment. By protecting yourself against the possible, but extremely unlikely, big problem, you guarantee yourself a myriad of smaller problems all the time.

1
1
Anonymous Coward

Re: I thought one of the features of a mainframe...

Except for one thing: the RBS mainframe is not "a mainframe", it's a cluster of (14 was the last number I heard) mainframes, all with multiple CPUs. This failure probably is not a single point of failure; it's a total system failure of the IT hardware and the processes used to manage it.

0
0
Bronze badge
FAIL

I reckon the other source had it spot on

"the bank’s IT procedures will in some way require system administrators to understand a problem before they start flipping switches."

Naturally. However, let's not forget the best-of-breed world-class fault resolution protocol that's been implemented to ensure a right-first-time customer-centric outcome.

That protocol means that a flustercluck of management has to be summoned to an immediate conference call. That takes time - dragging them out of bed, out of the pub, out of the brothel gentlemen's club and so on.

Next, they have to dial into the conference call. They wait while everyone joins. Then the fun begins:

Manager 1: "Ok what's this about?"

Operator: "The mainframe's shat itself, we need to fail over NOW. Can you give the OK, please?"

Manager 2: "Hang on a minute. What's the problem exactly?"

Operator: "Disk controller's died."

Manager 3: "Well, can't you fix it?"

Operator: "Engineer's on his way, but this is a live system. We need to fail over NOW."

Manager 4: "All right, all right. Let's not get excited. Why can't we just switch it off and switch it on again? That's what you IT Crowd people do, isn't it?"

Operator: "Nggggg!"

Manager 1: "I beg your pardon?"

Operator: (after deep breath): "We can't just switch it off and on again. Part of it's broken. Can I fail it over now, please?"

Manager 2: "Well, where's your change request?"

Operator: "I've just called you to report a major failure. I haven't got time to do paperwork!"

Manager 3: "Well, I'm not sure we should agree to this. There are processes we have to follow."

Manager 4: "Indeed. We need to have a properly documented change request, impact assessment from all stakeholders and a timeframe for implementation AND a backout plan. Maybe you should get all that together and we'll reconvene in the morning?"

Operator: "For the last bloody time, the mainframe's dead. This is an emergency!"

Manager 1: "Well, I'm not sure of the urgency, but if it means so much to you..."

Manager 2: "Tell you what. Do the change, write it up IN FULL and we'll review it in the morning. But it's up to you to make sure you get it right, OK"

Operator: "Fine, thanks."

<click>

Manager 3: "He's gone. Was anyone taking minutes?"

Manager 4: "No. What a surprise. These techie types just live on a different planet."

Manager 1: "Well, I'm off to bed now. I'll remember this when his next appraisal's due. Broken mainframe indeed. Good night."

Manager 2: "Yeah, night."

Manager 3: "Night."

Manager 4: "Night."

87
1
Anonymous Coward

Re: I reckon the other source had it spot on

@Mike - that may well be what you think happens, but I've experienced financial services IT recovery management and it's a lot more along the lines of:

Bunch of experts in the hardware, OS, software, Network, Storage and Backup get on call to discuss, chaired by a trained professional recovery manager.

You tend to get panicky engineers who identified the problem saying a disk controller has died, and we must change it now, NOW, do you hear?

The recovery manager will typically ask: "Why did it fail? What are the risks of putting another one in? Do we have scheduled maintenance running at the moment? Has there been a software update? Can someone confirm that going to DR is an option? Are we certain that we understand what we're seeing? What is the likelihood of the remaining disk controller failing?"

The last thing you want to do is failover to DR at the flick of a switch, because that may well make things worse. Let me assure you, this isn't the sort of situation where people bugger off back to bed before it's fixed and expect to have a job in the morning.
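Those questions are, in effect, a gate in front of the failover switch. A toy Python encoding of that gate, with the questions lifted from the comment above and the approval logic invented:

PRE_FAILOVER_CHECKS = [
    "Why did it fail?",
    "What are the risks of putting another one in?",
    "Is scheduled maintenance running at the moment?",
    "Has there been a software update?",
    "Is going to DR confirmed as an option?",
    "Are we certain we understand what we're seeing?",
    "How likely is the remaining disk controller to fail?",
]

def approve_failover(answers):
    """answers maps each question to a tuple of (answer_text, is_blocker)."""
    missing = [q for q in PRE_FAILOVER_CHECKS if q not in answers]
    if missing:
        return False, "unanswered: " + "; ".join(missing)
    blockers = [q for q, (_, is_blocker) in answers.items() if is_blocker]
    if blockers:
        return False, "blocked by: " + "; ".join(blockers)
    return True, "approved - fail over, then write it up"

if __name__ == "__main__":
    answers = {q: ("discussed on the call", False) for q in PRE_FAILOVER_CHECKS}
    print(approve_failover(answers))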

6
4
Anonymous Coward

Re: I reckon the other source had it spot on

http://www.emptylemon.co.uk/jobs/view/345349

throw in a couple of 3rd parties and you've got them all pointing fingers at each other as well to add into the mix.

6
0

Re: I reckon the other source had it spot on

AC 15:24 - this.

Not only financial services btw.

I'm not sure I miss those midnight calls... in some ways it's quite fun to sort shit out, but on the flip side the pressure to get it right first time is immense.

However, it's not just flipping the bit... it's also very much about understanding the impact of that decision. If you fail over an entire DC you need to really be able to explain why...

2
0
Bronze badge

Re: I reckon the other source had it spot on

"Bunch of experts in the hardware, OS, software, Network, Storage and Backup get on call to discuss, chaired by a trained professional recovery manager."

Well, quite. That's exactly what should happen. Been there myself, admittedly not in financial services.

I've seen it done properly, and it's precisely as you describe.

And I've seen it done appallingly, with calls derailed by people who knew next to nothing about the problem, but still insisted on adding value by not keeping their traps shut.

I guess I'm just too old and cynical these days :-)

14
0
Silver badge

Re: I reckon the other source had it spot on

Which is how it might work if you have manual intervention required.

Highly available mainframe plexes like RBS's run active/active across multiple sites.

3
0
Thumb Up

Re: I reckon the other source had it spot on

That's so good, and accurate, that I have just printed it out and stuck it on the kitchen wall as a reminder to the ever-expanding mass of PHBs I have the (mis)fortune of working with/for/alongside.

Made my day :)

2
0
Anonymous Coward

Re: I reckon the other source had it spot on

> The last thing you want to do is failover to DR at the flick of a switch, because that may well make things worse.

I spend so much time trying to convince customers of that, and many of them still won't get past "but we need automatic failover to the DR site". We refuse to do it, the field staff cobble something together with a script, and it all ends in tears.

5
0
Anonymous Coward

Re: I reckon the other source had it spot on

> Which is how it might work if you have manual intervention required.

For DR you should have manual intervention required.

For simple HA, when the sites are close enough to be managed by the same staff, have guaranteed independent redundant networking links, etc., then yes, you can do automatic failover.

For proper DR, with sites far enough apart that a disaster at one doesn't touch the other, you have far more to deal with than just the IT stuff, and there you must have a person in the loop. How often have you watched TV coverage of a disaster where even the emergency services don't know what the true situation is for hours (9/11 or Fukushima, anyone)? Having the IT stuff switching over by itself while you're still trying to figure out what the hell has happened will almost always just make the disaster worse.

For example, ever switched over to another call center, when all the staff there are sleeping obliviously in their beds? Detected a site failure which hasn't happened, due to a network fault, and switched the working site off? There is a reason that the job of trained business continuity manager exists. We aren't at the stage where (s)he can be replaced by an expert system yet, let alone by a dumb one.
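A sketch of the policy being argued for, purely to make the distinction concrete: local HA failover may be automatic, while DR failover refuses to run until a human confirms the disaster is real. All names here are made up.

def fail_over(target, kind, human_confirmation=False):
    """kind is 'local-ha' (automatic is fine) or 'dr' (a person must say yes)."""
    if kind == "local-ha":
        return f"automatic failover to {target}"
    if kind == "dr":
        if not human_confirmation:
            raise RuntimeError("DR failover refused: the business continuity "
                               "manager has not confirmed the disaster is real")
        return f"DR failover to {target}, confirmed by a human"
    raise ValueError(f"unknown failover kind: {kind}")

if __name__ == "__main__":
    print(fail_over("plex-B", "local-ha"))                       # automatic, fine
    print(fail_over("dr-site", "dr", human_confirmation=True))   # deliberate, fine
    # fail_over("dr-site", "dr")  # raises: nobody has checked what actually happened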

4
0

This post has been deleted by its author
