The five-day mainframe bank system meltdown at National Australia Bank (NAB) was due to a corrupted file on an IBM mainframe system that was being upgraded. It's reported that staff attempted a mainframe upgrade on Wednesday 25 November, and this failed to complete. It was reversed and this was when, it appears, ongoing …
Why am I not surprised...
"...with some customer accounts having multiple incorrect debits applied..."
No mention of "...with some customer accounts having multiple incorrect CREDITS applied..."
Banks seem the same the whole world over - when they screw up it's never in our favour.
It's often in our favour: you hear many stories of ATMs being mis-loaded with 20s rather than 10s, and the odd story about people being given hundreds of thousands of pounds due to keying errors.
The probable cause of credits not being made is that it's a multi-step process which required several tos and fros of requests, confirmations, counter requests etc. This takes time and with a knackered system it's highly unlikely that you'd get past the first stage.
Also - do you expect to keep any money that's accidentally credited to your account? In the same way that you'd expect a refund if it were accidentally taken from your account?
Do you think ...
that anyone who had money mistakenly paid into his account is complaining loudly, or quietly hoping that nobody will notice?
that's why we test the change & backout in as close to live as possible, the day before we do any changes.
don't want to cock about with people's money!
Outsourcing .. NAB got what they paid for
Good article Chris
But just for further clarification,
the NAB's ITO "Wave 1" outsourcing arrangement with "Satyam" (or "Mahindra" as they re-brand themselves now) is still in place.
Last year Satyam was barred from NAB's ITO Wave 2 outsourcing program after the corruption scandal.
There were supposedly to be discussions on how to bring maintenance and support for legacy apps back in-house.
But it appears too many PHBs at the bank preferred to play the white Raj with subservient low-wage Indians on dodgy 457 visas stoking their egos.
I expect a whole line of heads to roll, but what's the betting the company IT processes/culture was just as much to blame as some poor sysadmin/team that just followed the recognized upgrade path? And people complain that systems never ever get upgraded. Upgrades are always expected to work smoothly, and it must be someone's fault when things go up in smoke. It just doesn't work like that. Computers are messy, and even if you do absolutely everything right things can still break. Yes, it's important to have a good roll back strategy, but you need two or three or four in the real world.
dev, test, prod now and forever amen....
i learned that from a crusty old admin who is probably retired now....
am having trouble believing this patch went fine onto their dev and test clusters, only to fail on the live cluster.
ah yes, mainframe clusters are expensive.... so are outages........
so I'm saying, they cannot have followed dev, test prod and therefore that's a FAIL, not an accident, or a poor maligned sysadmin, or any other kind of voodoo gremlin dust-bunnies, no, it's a simple case of failing to follow a simple, decades-old methodology.
Having worked as a sysadmin at a bank, I'm all too aware that sometimes shit happens (even following the supposed foolproof Dev -> Test -> Prod methodology).
I once spent 6+ hours in a series of meetings because someone (not me) made a typo that caused a 3 hour production outage after the change went flawlessly in dev and test. No-one was ever in danger of losing their job, but 6+ hours of listening to expensive idiots discussing how to prevent sysadmins from ever making a typo again was mind-numbingly painful (unsurprisingly their chosen solution, to double the already onerous change documentation, only increased the chances of a typo being made).
"The bank faces class action lawsuits, and it is very likely that a senior IT management head or two will have to be lopped off to appease customers."
I'm a NAB customer, but I sincerely hope they only get the chop if they actually deserve it.
Clue for you
Whenever we talk about billions, one must be extremely paranoid.
Prepare and plan for a typo too, for example use scripting to the max.
One of the practices I put in place where I work is to have all commands that are going to be issued on live systems approved by a separate technical resource from the same department as the person specifying the command.
During the implementation commands are cut'n'pasted onto the command line. I do not expect any of my team to have to think when it's 3am and they've been going for hours. All thinking should be done in business hours.
Even if you think you've got a simple, single command fix, you still forward it to another member of staff (usually the on-call person) for the once-over. We've had far fewer outages caused by implementations since this simple double check was put in place.
There is a chance that not everyone bothers with this, but if someone cocks up having issued a non-checked out command, there would be serious arse-kickings involved.
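The double check described above could be backed by a small gate that only clears a command whose exact text was signed off by someone other than its author. This is a minimal sketch in Python, not the poster's actual tooling: the `Approval` record, its field names and the helper functions are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Approval:
    """One reviewed command. All field names here are hypothetical."""
    command: str
    author: str
    reviewer: str


def fingerprint(command: str) -> str:
    """Hash the exact command text so any later edit or typo changes the digest."""
    return hashlib.sha256(command.encode("utf-8")).hexdigest()


def build_approved_set(approvals: list[Approval]) -> set[str]:
    """Keep only commands that were checked by someone other than their author."""
    return {
        fingerprint(a.command)
        for a in approvals
        if a.reviewer and a.reviewer != a.author
    }


def is_cleared(command: str, approved: set[str]) -> bool:
    """True only if this exact text was approved; no stripping or normalising."""
    return fingerprint(command) in approved
```

Comparing the exact text is deliberate: a command retyped at 3am with a stray character no longer matches what was reviewed, which is the case the cut'n'paste rule is meant to catch.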
A corrupt file?
One "file"? They seem to have a lot of scrambled eggs in that basket. We had a client that used to be with the NAB who wanted help with their old batch format. It was mostly ASCII but about 16 characters in each record were EBCDIC. The banks in Oz (only some of which can verify CVC at all) haven't figured out that their high fraud rates are because criminals have managed to work their way in with bribes in India.
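For anyone who hasn't met a mixed-encoding batch record like the one described, here is a rough Python sketch of decoding one. The field offset, the 16-byte length as an exact figure, and the `cp037` code page are all assumptions for illustration; the real NAB layout isn't given here.

```python
# Assumed layout (hypothetical): a fixed-position EBCDIC field inside an
# otherwise ASCII record. The real offsets and code page are not known.
EBCDIC_START = 10
EBCDIC_LEN = 16
EBCDIC_CODEC = "cp037"  # assumption: one common EBCDIC code page


def decode_record(raw: bytes) -> str:
    """Decode one record, switching codec for the embedded EBCDIC field."""
    end = EBCDIC_START + EBCDIC_LEN
    if len(raw) < end:
        raise ValueError("record shorter than the assumed layout")
    head = raw[:EBCDIC_START].decode("ascii")
    middle = raw[EBCDIC_START:end].decode(EBCDIC_CODEC)
    tail = raw[end:].decode("ascii")
    return head + middle + tail
```

Decoding the ASCII parts strictly means a record with the EBCDIC field in the wrong place raises an error instead of quietly producing garbage.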
"Experience is directly proportional to the amount of equipment ruined"
What I'm trying to say is that those techs that were involved in this farce will now have the experience and (most likely) won't do the same thing again. If you sack them and get replacements, they'll be doomed to repeat their predecessors' mistakes.
(no amount of RTFM can compare to first hand experience)
Pick your poison
"If you sack them and get replacements, they'll be doomed to repeat their predecessors' mistakes."
Techs responsible are in Australia. The replacements will be in India. Thus, doomed.
Techs responsible are in India. Turnover rate huge there. Thus, doomed.
Cut to the chase - they're doomed to repeat....
The brain cells of "institutional memory" are people. Why don't companies ever remember this? Oh yeah, they forget...
THE SYSTEM DID IT!!!!!!!!!!!!!!!!!!
It's about time the bankers started shouting this ad nauseam. I personally think they are a little late in trying to blame their retarded practices on their IT departments.
Can you smell that? Yes, it's the scent of desperation...
Entropy tends to a maximum.
Now guys, remember the second law of thermodynamics - "entropy tends to a maximum" or to put it another way - "Things get worse."
You can do all the testing that's practically possible, and run it in development and testing systems before putting it live, but it can still go bad. You need to be prepared for that to happen in spite of all your efforts, and to be able to back out to the state that last worked happily with minimum outage.
And IF that all goes wrong, YOU NEED CALM, CAPABLE, EXPERIENCED TECHNICAL STAFF WHO UNDERSTAND YOUR SYSTEMS, WHO CAN THINK AND COLLABORATE TO SAVE YOUR BACKSIDE.
Would you travel as a passenger in a plane that was being flown by a cut price pilot who had never flown that kind of plane before? Just remember that pilot who crash landed with his passengers and crew on the Hudson River in New York.
"that's why we test the change & backout in as close to live as possible, the day before we do any changes." Good, you'll go far.
"they cannot have followed dev, test prod and therefore that's a FAIL, not an accident," Uh ohh, you need more experience of the real world. It just ain't as easy as you think it is.
I was hoping for a "thanks for the tip-off" line.
Neoc - I am not sure who you sent the tip to - but thanks!
As a general rule, we rarely give thanks in print for tip-offs for serious stories - on grounds that people are more likely to lose their jobs.
Nah, I meant that I was hoping to have been the one to tip El Reg off to the story, but that obviously I hadn't been. ^_^
the ones to be chopped
should be the "C-level" ones who control the budgets for backup systems, and offshoring. Not the staff who have to work with what they've got.
(said by someone who hates our own IT group with a passion...)
So another big bonus for the CEO again this year...
As is the tradition in the Australian Banking Criminal Syndicate, sorry, "Industry", the CEOs and CFOs will all award themselves some extra bonus this year to compensate for the stress induced by this failure.
My god, some of them must have choked on their champagne and caviar morning tea when they heard about the problem!
Of course they will crucify some mid-level IT sod for this, it couldn't POSSIBLY be in any way related to cutbacks in IT funding by the senior management, while they raise their own bonuses (it's the Australian way, you know), take a few bribes from offshoring tender applicants, put their money into untouchable, untaxable foreign accounts...
Always remember the wisdom of Vizzini - "Australia is peopled entirely by criminals"
This was an OS upgrade? I'd expect them to take each mainframe image out of service in turn, upgrade, validate and when signed off, bring back in. Actually the same would go for an application upgrade, *especially* an application upgrade. Unless they signed off and then realised they had problems then this kind of problem should be completely avoidable.
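The one-image-at-a-time approach described above might look roughly like this in Python. It is only a sketch of the idea: the hook names (`drain`, `upgrade`, `validate`, `restore`, `rejoin`) are hypothetical callables supplied by the caller, not anything NAB or IBM is known to use.

```python
from typing import Callable, Iterable


def rolling_upgrade(
    images: Iterable[str],
    drain: Callable[[str], None],
    upgrade: Callable[[str], None],
    validate: Callable[[str], bool],
    restore: Callable[[str], None],
    rejoin: Callable[[str], None],
) -> list[str]:
    """Upgrade images one at a time; stop at the first one that fails validation.

    Returns the names of the images that were upgraded and returned to service.
    All callables are hypothetical hooks supplied by the caller.
    """
    done: list[str] = []
    for image in images:
        drain(image)            # take this image out of service
        upgrade(image)
        if not validate(image):
            restore(image)      # back out this image only
            rejoin(image)       # return it to service on the old level
            break               # leave the remaining images untouched
        rejoin(image)
        done.append(image)
    return done
```

The point of stopping at the first failure is that the images not yet touched keep carrying the load on the old level, so a bad upgrade costs capacity, not the whole service.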