IBM employee sparks massive bank outage

Last Monday, one of Singapore's largest banks suffered a seven-hour IT outage that took down everything from back-office services to ATMs. This Tuesday, the flawed component was identified: an IBM employee. "We take full responsibility for this incident," wrote DBS Group Holdings CEO Piyush Gupta in a statement. A laudably …

COMMENTS

This topic is closed for new posts.

Tuesday 13th July 2010 19:15 GMT Will Godfrey

Could be a blessing

If they sack him/her then I'm sure there's a book + film in the offing.

0 0
Tuesday 13th July 2010 19:16 GMT Combat Wombat

I know what was done...

Squawk box reports that NAS is having issues

IBM monkey pulls the wrong faulty hot swap disks from a Raid 5

IBM monkey replaces correct broken disk but RAID is borked

IBM monkey runs chkdsk /fix on the broken volume

IBM monkey notices chkdsk breaking the volume more, panics and hits the reset button.

IBM monkey reboots NAS, and the chkdsk restarts, pooches the volume even more..

Someone who actually KNOWS what they are doing is called around 6am, and it takes the rest of the time to fix the issue, restore backups, and see to it that someone meets with a "terrible accident"

The last hour would have been devoted to sourcing a big enough bag of lime, a shovel, and a roll of carpet.

11 1
1. Wednesday 14th July 2010 09:22 GMT Gordon is not a Moron
  
  They need to employ the BoFH
  
  as he'd have ready supplies of lime, carpet and shovels
  
  0 0
  1. Wednesday 14th July 2010 16:24 GMT Combat Wombat
    
    Yes but...
    
    I can see the senior IBM guy going through the supplies at a steady rate, given the caliber of people IBM handle for day to day operations work.
    
    Sadly with my above example.. I was the guy called in at 6am, for that exact situation for another outsourcing company, who will remain nameless.
    
    0 0
Tuesday 13th July 2010 19:16 GMT Pete 2

There, but for the grace of god

... goes pretty much every major company in the world.

The biggest failure in IT is that anyone with root has the power, or bad luck, to place the company they work for in exactly this situation. The only surprise is that this sort of thing doesn't happen more often - or maybe just that it isn't reported more often.

Until systems are built robust enough to survive the onslaught of a trainee with the manual held upside-down, we really can't call what we do a "profession".

6 1
1. Tuesday 13th July 2010 21:13 GMT Anonymous Coward
  
  Couldn't agree more...
  
  and with every large company on the face of the Earth sucking out all the cash for executive bonuses in the multi-millions instead of on training we'll see a lot more of this. The last of the folks who know what they are doing, had sufficient training to work on complex systems, are starting/have started to retire.
  
  Fun times ahead, wonder if companies will be able to sue retired executives for bad business practices after they've retired or moved on. You know, once it becomes apparent to everyone that they have ruined the companies that paid them.
  
  4 0
2. Tuesday 13th July 2010 22:01 GMT Matt Bryant
  
  RE: There, but for the grace of god
  
  ".....The biggest failure in IT is that anyone with root has the power, or bad luck, to place the company they work for in exactly this situation....." Yeah, so comforting to just blame the sysadmin, but the truth is this is a management failure, as just about every "laugh-at-the-silly-admin-that-pulled-the-wrong-disk" situation actually resolves down to. Why? Because it is management that selects the admin and gives them that root access. You wouldn't give a novice driver the keys to your Ferrari, would you? If you did, and they bent it, wouldn't you feel just a bit to blame for putting them in the driving seat?
  
  This wouldn't have been some architect-level tech genius, this was probably the junior admin if they were doing the overnight shift. Read the article - the admin thought he was using a good procedure, the fact he didn't know it was a wrong procedure highlights several possible management failings:
  
  1/ They hired an incompetent admin that didn't have the up-to-date training he claimed to have (i.e., he lied on his CV), which means their selection process was flawed (probably because they didn't include a skilled sysadmin in the selection team, who would have spotted the "exaggerations", just used HR drones).
  
  2/ The bank introduced new kit but IBM didn't do the requisite staff training, either because they didn't check their staff's skillsets; or IBM decided to save a few pennies and just told the sysadmin to "self-train on the job"; or IBM actually didn't know what the new kit required, and hence couldn't provide a correctly skilled resource, probably because it was another vendor's kit.
  
  3/ IBM management didn't assign a competent technical project manager or technical team leader who should have looked at the new kit when it was introduced, reviewed any new procedures, updated the sysadmin procedures and planned any additional training to get their skillset right.
  
  So, blame the sysadmin if it makes you feel better, but it was incompetent management that put that incorrectly prepared sysadmin at the console.
  
  8 1
  1. Wednesday 14th July 2010 11:59 GMT HighlightAll
    
    It might have been a coffee thing
    
    Could have been a talented admin but incorrectly fuelled. Where can you get good coffee at 3 in the morning?
    
    It's not always management's fault. My blood/caffeine alarm has just gone off...
    
    0 0
Tuesday 13th July 2010 21:13 GMT Phil Rigby

@Combat, @Pete

I don't know about you but I wouldn't have a trainee working on my system at 3am. They'd be working on it when I'm there to supervise. And I'm senior, so I don't work at 3am :) Hell I don't even work at 3pm.

3 0
1. Wednesday 14th July 2010 09:12 GMT Geoff Campbell
  
  3am is the Fail
  
  I've run 24 hour operations, and I used to insist that we subject all new procedures to "the 3am test". No-one is at their best at this time, 2am or 4am are much, much better bets, so we would make sure all procedures were simple enough to be followed at 3am, and any procedure that we had the choice over timing would very definitely not be done at 3am.
  
  I have no idea *why* 3am is such a problem, but observably it most certainly is. Something to do with biorhythms, I think.
  
  GJC
  
  0 0
  1. Thursday 15th July 2010 10:14 GMT Matt Bryant
    
    RE: 3am is the Fail
    
    So true! It always used to make me laugh that project managers would insist on scheduling work on 24 x 7 bizz crit systems for 2am, on the specious idea that the systems would be least busy then and it was therefore less risky! The problem is that your best staff and your vendor's best staff are also likely to be busy sleeping at that time. I used to be a real pain and call everyone involved in such early morning changes every fifteen-odd minutes just to make sure they were still awake, because you'd be surprised at the number of times I caught employees asleep at the console whilst they waited for someone else to complete some part of a change. The best one we had was when one sleepy admin rebooted the production billing server rather than a stand-by one, and it took ten long minutes before anyone realised! That's ten minutes of skilled people looking at screens and just not registering what was actually appearing on them, simply because they had hit that 3am low.
    
    1 1
2. Wednesday 14th July 2010 18:34 GMT Combat Wombat
  
  Well Phil
  
  I'd agree with you, but by your comment you have never had to work with the drooling mouth breathers who make up most of IBM's operations staff.
  
  It wasn't a trainee... more than likely an average monkey who was reading from a script.
  
  IBM hires people who need a script to get dressed in the morning.
  
  0 0
Tuesday 13th July 2010 22:01 GMT John G Imrie

an "outdated procedure" was used to initiate the repair,

So not the Grunt's fault then, but the management for not updating the procedure or not notifying the Grunt of the update.

5 0
Tuesday 13th July 2010 22:08 GMT John Loy

Employee

I doubt he was fired. He is now the most experienced employee in what not to do. This was a very expensive training course for IBM.

2 0
Wednesday 14th July 2010 07:17 GMT David Cuthbert

Fragile process

If an error in a "routine maintenance operation" causes this kind of outage, then I'd say it's a larger problem in the process rather than training. (Not that the place I work at is immune to these kinds of process failures, mind you, but at least we identify them for what they are.)

1 0
Wednesday 14th July 2010 07:17 GMT Winkypop

Singapore? Massive problem?

Does anyone know where Nick Leeson works these days?

0 0
Wednesday 14th July 2010 07:17 GMT John Tserkezis

an "outdated procedure" was used to initiate the repair,

"So not the Grunt's fault then, but the management for not updating the procedure or not notifying the Grunt of the update"

Yes, and you can bet the same management will take the credit for "bringing the system back to life", then give themselves a payrise to justify that.

1 0
Wednesday 14th July 2010 07:17 GMT Dagg

As an ex-IBMer

I've seen this happen a few times. The IBM method is get rid of anyone who has any skills (i.e. expensive) and replace them with trained monkeys that just blindly follow scripts. The monkeys are not able to tell if the processes in the scripts are valid or correct as they have never had the training to even have a basic understanding of what they are attempting to do.

I suggest to any company looking at outsourcing their IT support to IBM, DON'T DO IT! Ask Air New Zealand about what IBM IT Support did to them. I think IBM had them shut down for about a day.

3 0
Wednesday 14th July 2010 07:17 GMT nnwin

What if it happened in your company?

Greetings from Singapore! I'd like to pose a question to Reg readers to gain some insight into management practices under different cultural context.

If such kind of massive failure happens in your country/company (as a service provider), what will happen next? I mean, to the unfortunate (or clueless?) sysadm/engineer, Head of IT, COO, and/or even CEO. And where should the buck stop?

Over here in Asia, they will quietly reprimand the engineer and his/her supervisor, or even let them go. But the buck would usually stop there. Appreciate your comments. Thanks.

0 0
1. Wednesday 14th July 2010 09:22 GMT Pete 2
  
  what happens? not much
  
  The few occasions where I've seen cockups turning into big problems (dba dropping tables on a production database, coffee cup knocked over into the main router, someone changing root password and instantly forgetting the new one) the person involved has received admonitions from their peers/boss ("you PLONKER", etc.), but career-wise except for the coffee issue, they were regarded as "blips" in otherwise good work records.
  
  The coffee-knocker left shortly afterwards of their own free will.
  
  The conclusion was that these accidents could have happened to anyone and that everyone makes a mistake now and again. While this is true, and universally recognised, the underlying problem with our industry is that this is accepted and few, if any companies feel the situation needs to be, or can be improved. You do get point solutions to specific (costly) errors after the fact, but all the processes in the world: BS5750, ISO9000, ITIL don't seem to account for figner trouble and the IT systems themselves are designed to be so brittle that a simple error can kill them.
  
  0 0
Wednesday 14th July 2010 07:17 GMT Norman Inglethwaite

Medal

Give the grunt a medal .. the data not lost will be unrecoverable by the bank.

0 0
Wednesday 14th July 2010 09:05 GMT Field Marshal Von Krakenfart

the IBM field circus

Well, this is what happens when you give an important support contract to a company whose CEO's only aim is to boost his own earnings by increasing the company share price, when sales are falling, by buying its own stock and sacking trained and experienced personnel.

I too have seen the IBM field circus fuck up and bend a pin on a chip after 'servicing' the mainframe and as a result the machine could not do simple maths.

Another place I worked in used to get a significant portion of their hardware from Amdahl, just to remind Big Blue that there were alternatives....

0 0
Wednesday 14th July 2010 09:12 GMT Anonymous Coward

IBM is a no blame culture

That was the first thing I was told when I joined.

The "but" they don't mention is 'unless something goes wrong'

0 0
Wednesday 14th July 2010 09:44 GMT Anonymous Coward

I wonder what really happened ..

Given that there is no actual technical detail in this `communiqué', I hazard a guess that a failed update is what really happened. The rest is just pseudo-technical-sounding management waffle.

keyword waffle:: score +12: bankwide disaster recovery command centre, cascading failure, complete system outage, complications during the machine restart, error messages, flawed component, IBM employee, intermittent failures, outdated procedure, procedural error, routine maintenance operation, technical command function ..

0 0
Wednesday 14th July 2010 09:46 GMT Hieronymus Coward

Robust infrastructure?

While I can't comment on the abilities of the tech that started the 'cascade failure', this does bring back memories of the email storage saga at Plusnet a while back.

They were replacing the storage arrays with shiny new ones and a tech (presumably a system architect or senior admin) had a console window open to both the old and the new array and issued a format command in the wrong console. Cue ensuing shit storm including the revelation that Plusnet don't (or didn't) back up the arrays coupled with lots of (failed) attempts at data recovery and the vendor admitting that the array was 'too new' to work with their recovery tools.

Human error (from the admin perspective) is always going to be a factor in managing complex systems. When systems are robust enough to survive human error we won't need sys admins any more, just someone to turn up and set the stuff up.

That said there are always times when you have some kit that just doesn't want to play nice.

0 0
1. Wednesday 14th July 2010 18:34 GMT staggers
  
  Why it's called human error
  
  Doesn't have to be a complex system either. Just has to involve humans. Or, in the following, humans + beer.
  
  I was once present in a recording studio in the 70s when a famous English band was doing an album. In those days the big tapes used would hold about 30 minutes of music, so you would need two for an album.
  
  You can guess what's coming. They were reviewing tracks on the first tape, and then all went to the pub, leaving the Tape Operator to sort out a new tape so they could start recording more stuff. The tape was so expensive that it was normal to reuse it, wiping it first, of course.
  
  They came back from lunch, worked for many hours (this in the days when an album took months to record because it took that long to get them all there at the same time), and decided they wanted to listen to something on the first tape again.
  
  The tape op went to the cupboard to get the first tape.........
  
  The look on his face was a picture. I didn't know skin could go that colour. They forgave him. There are references to the incident on the credits for the eventual album. More than one, actually.
  
  SO easy to avoid, but it wasn't avoided.
  
  0 0
Wednesday 14th July 2010 12:14 GMT Anonymous Coward

DBS ought to get used to it....

we have.

0 0
Wednesday 14th July 2010 14:05 GMT fch

procedural error ...

... there it is again, the "insight" that all errors are avoidable by following the correct procedures.

I'm sure IBM Global Services updates, extends, enhances, adapts its procedures all the time and, most importantly, creates new ones to cover whichever areas are found in root-cause-analysis (a standardized procedure, of course) to be lacking procedural coverage.

All so to make sure errors don't happen. Which they obviously can't, if the procedures are being followed.

Why is the belief so widespread that the human factor can be eliminated by creating ever more / ever more detailed procedures ?

Seems natural that the more procedures there are, the more likely it is for some poor tired soul working at 3:00am to follow one of them that happens to be inappropriate for the situation ...

My condolences to the poor employee who followed the wrong procedure.

My kudos to the manager who publicly admitted that making errors is human.

0 0
Thursday 15th July 2010 00:08 GMT irrelevant

been there...

Or rather, I've been the poor sod who received the panic telephone call at 11:30pm from the technical director of what is now a major player in the UK mobile phone industry, but was then still only a few dozen employees in size. Seems some lower minion had been preparing to restore some test data from tape, and had done a rm -r * in the wrong folder, and wiped out the entire live accounts and sales systems... Luckily enough important stuff had been backed up earlier that evening, so the damage was recoverable from... The owners are now multi-millionaires. I sometimes wonder just how well they would have done had they actually lost it all...
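The "rm -r * in the wrong folder" failure above is the kind of slip a small guard can catch. What follows is only a minimal sketch, not what that company ran: the function name `safe_rmtree` and the idea of a designated scratch root are assumptions made for illustration. It refuses any recursive delete whose resolved path is not strictly inside the allowed root.

```python
import shutil
from pathlib import Path


def safe_rmtree(target: Path, allowed_root: Path) -> None:
    """Recursively delete `target` only if it lies strictly inside `allowed_root`.

    Both paths are resolved first, so relative paths and symlinks cannot
    be used to step outside the allowed root by accident.
    """
    target = Path(target).resolve()
    allowed_root = Path(allowed_root).resolve()

    # Reject the root itself and anything that is not a descendant of it.
    if target == allowed_root or allowed_root not in target.parents:
        raise ValueError(f"refusing to delete {target}: not inside {allowed_root}")

    shutil.rmtree(target)
```

A restore script would call `safe_rmtree(Path("restore_area"), Path("/scratch"))` (both paths hypothetical) instead of shelling out to a wildcard delete; a call made from the live directory raises instead of wiping it.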

0 0
Thursday 15th July 2010 19:22 GMT iangotts

Not a process issue, but a process ADOPTION issue

It is the little things that trip us up. So this problem is NOT a process issue. It is a process ADOPTION issue. And why are processes not adopted (i.e. understood and used) by staff?

Simple. Most process documentation is aimed at IT building better systems, not at engaging end users so they understand what job they have to do, with links to the supporting forms, policies, work instructions, documents and systems. The process application (such as Nimbus Control) which can support this requirement is very different from traditional IT process modeling tools.

Ironically, IBM has bought loads of IT process modelling/management tools: System Architect, Holosofx, Lombardi - all of which are aimed at process automation - which are all now some part of Websphere.

So Nimbus works alongside IBM in clients - but unfortunately Nimbus Control is not deployed internally by IBM. It could have been. Cognos had a very effective deployment of Nimbus Control to document and drive their product implementation processes. But when Cognos was acquired by IBM, Nimbus was a casualty of the M&A "synergistic savings".

Suddenly, it doesn't sound like much of a saving when you look at the likely compensation claim by DBS Bank.

Sour grapes or sound thinking???

For a longer discussion on process adoption (with a less cynical tone) read my blog http://bit.ly/dsQiUI

0 0
Friday 16th July 2010 14:39 GMT Anonymous Coward

The Swiss Cheese Model - the holes lined up and we got a 7 hour outage

I prefer to fly on an aircraft with 2 pilots.

My point:

2 people checking ("propose and confirm") is generally a tad better than one.

But it costs money.
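A rough sketch of "propose and confirm" as code, assuming a hypothetical `ChangeRequest` class and made-up operator names: a change cannot run until someone other than the proposer has confirmed it. This illustrates the idea only and says nothing about how any bank or vendor actually enforces it.

```python
from typing import Callable, Optional


class ChangeRequest:
    """A change that needs a second person to confirm it before it can run."""

    def __init__(self, command: str, proposer: str) -> None:
        self.command = command
        self.proposer = proposer
        self.confirmed_by: Optional[str] = None

    def confirm(self, reviewer: str) -> None:
        # The whole point is a second pair of eyes, so self-approval is rejected.
        if reviewer == self.proposer:
            raise PermissionError("proposer cannot confirm their own change")
        self.confirmed_by = reviewer

    def execute(self, runner: Callable[[str], str]) -> str:
        # `runner` is whatever actually performs the command; it is passed in
        # so the approval logic stays separate from the execution mechanism.
        if self.confirmed_by is None:
            raise PermissionError("change has not been confirmed")
        return runner(self.command)
```

The cost mentioned above shows up directly: every `execute` now needs two people awake at the same time.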

0 0
Sunday 25th July 2010 15:24 GMT Anonymous Coward

Highlights an issue I am often accused of "whinging about",..

in that DR detection and recovery is not done in pleasant conditions or in a fresh state of mind. I use scripts and dashboards to highlight current status of components, and a load of scripted utilities to bring up services and test each response, reporting back when anything awry is found. Saved my bacon more than once.
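For anyone wondering what such a status script might look like, here is a minimal sketch. The function names and component names are hypothetical and the checks are stand-ins, not the actual utilities described above. Each component gets a check function, and a check that blows up is reported rather than allowed to abort the whole run.

```python
from typing import Callable, Dict, List, Tuple


def run_checks(checks: Dict[str, Callable[[], bool]]) -> List[Tuple[str, str]]:
    """Run every check and return (component, status) pairs in order.

    A check that raises is recorded as "ERROR: ..." so that one broken
    probe cannot hide the state of the remaining components.
    """
    report = []
    for name, check in checks.items():
        try:
            status = "OK" if check() else "FAIL"
        except Exception as exc:
            status = f"ERROR: {exc}"
        report.append((name, status))
    return report


def anything_awry(report: List[Tuple[str, str]]) -> List[str]:
    """Return the names of all components whose status is not OK."""
    return [name for name, status in report if status != "OK"]
```

At 3am the useful part is the last function: an empty list means go back to bed, anything else is the list of things to look at first.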

0 0


The Register Biting the hand that feeds IT


Copyright. All rights reserved © 1998–2024