The Register® — Biting the hand that feeds IT

Furse should not resign, she should be sacked

Dodgy Geezer

Um..conflict of interests..? 

Paris Hilton

This comment: "Furse should not resign, she should be sacked.." would come better from someone who is not interested in replacing senior executives at high cost, such as "Dominic Connor,...a headhunter."

Perhaps when he was "Dominic Connor, developing trading systems", he would have recommended that Furse stay, but employ someone different to develop the trading system?

Paris Icon for obvious reasons...

Dan Wilkinson

Wrong wrong wrong 

Thumb Down

This article irritated me greatly. Now I can't comment on what happens at the very top level of such a large business, but as someone who has recently jumped ship from being a "geek like you" to a service continuity manager, I have to take issue, and especially with this bit in particular:

"If the DR site was working, why didn’t it take over? Can the LSE put paid to the rumour that they were running exactly the same software for both live and standby? If you are Clara Furse reading this, here’s a hint, two copies of the same software will probably crash at the same time, given the same inputs. That’s why grownups use multiple versions.”

Did Accenture tell you that? Did it sound like a luxury to the media beancounters you appointed? What a load of rubbish. Are you seriously suggesting that the reason you have a DR system, is merely in case the software crashes? And that the way to recover from a software crash is to have some "different" software on your DR system? If so, how different? A newer version? How about an older version? Should my production systems use Oracle on unix, and the DR system DB2 on mainframe?

No, the DR system should be identical to your production system in terms of it's design and functional behaviour wherever possible. OK, depending on the size of your business and the importance of the system in question you may have lower capacity, or other limitations as part of your design, but these differences should be carefully considered, and may be chosen for reasons of cost, complexity or any number of other reasons, but the nutshell is that what you choose to make available in a DR situation has to mirror the functionality, if not the specification. You can't make arbitrary changes to second-guess potential future problems.

To deliberately choose to run a different system in case there is some sort of bug, or error that could cause the design in question to fail is both impossible to do effectively, and not a function of service continuity, but of system design. This is important!

I would like to see what happens when there is a problem in terms of there being a real "disaster" (fire/flood/power cut/sysadmin-gone-mental), whereby the failover to DR systems would have worked, only didn't because the software was different. Your comment "but it’s obvious from this event that if the pathetically vulnerable St. Paul’s site is taken out, we can have no confidence in when the market will be back on line." misses the point that it's entirely possible that the DR site could have worked perfectly, if the cause of the fault had actually been the primary site being "taken out" rather than suffering from potentially poor design, or unexpected "input" (whatever you mean by that).

I note that there hasn't been much press coverage in sufficient detail for me to understand what actually went wrong, but I know that if it were my systems, I would not want you to be working to fix it: not only do you misunderstand the concept of what service continuity is and how to effect it, but your "holier-than-thou" attitude would surely waste people's time and misdirect their efforts. You may understand when you become one of us "grownups", stop publishing these childish rants and behave like a (fully) responsible journalist...

Ian Michael Gumby

What a crock! 

Coat

While I do agree that the senior execs of the exchange should be held accountable, there is no reason for this type of failure to happen.

Depending on how much money you want to spend in creating redundant hardware, you can achieve the 6 9's of uptime.

Using IBM's IDS as a database in a configuration of local failover and then remote failover will give you the protection you will need.

It's unfortunate that IBM has the database technology, the hardware technology, yet not the marketing sense to go to the exchanges and pitch the idea.

Of course even if they were, the cost would be high and it would have to be justified against the risk. (Redundancy upon redundancy upon redundancy will give you a fairly high level of uptime, with planned down time allowed.)
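For a rough feel for the numbers behind "6 9's" and stacked redundancy, here is a minimal sketch. The 99% per-unit figure and the function names are illustrative assumptions, not anything measured at an exchange, and the redundancy formula assumes the units fail independently, which is precisely what the same-software-on-both-sites argument in this thread calls into question.

```python
# Illustrative availability arithmetic; every figure here is an assumption.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_seconds_per_year(nines: int) -> float:
    """Unplanned downtime allowed per year at an availability of N nines."""
    unavailability = 10.0 ** (-nines)
    return SECONDS_PER_YEAR * unavailability


def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant units, ASSUMING independent failures."""
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        print(f"{n} nines: {downtime_seconds_per_year(n):.1f} seconds/year")
    for k in (1, 2, 3):
        print(f"{k} unit(s) at 99% each: {parallel_availability(0.99, k):.6f}")
```

Six nines leaves roughly half a minute of unplanned downtime a year. On paper, three independent 99% units get you there, but only if nothing takes all three out at once.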

Peter

Bottom line is that you're mentioning Windows.. 

I thought so. Here's a scary one for you: somewhere in the 90s (around the time of NT 4.0) it became "fashionable" to use Windows for process control systems as well. Fashion equates to "not being decided by logic" and today we are reaping the benefit of that idea: a mad scramble to get those systems at least a BIT secure.

Your observation is thus no surprise: again a case of people who have no clue dictating to those who have, instead of letting those people do what they're paid for. But it's exactly that sort of micromanagement and "politics ueber all" attitude that made me leave that whole scene.

There is a small flaw in your argument, though. The lot at the top isn't entirely useless. Their job is to get clients in, which they do. The mistake they made was to make decisions outside their competence as well - QED..

Nomen Publicus

resign! 

IT Angle

The LSE lives and dies as a network service. If the boss cannot understand that then it is time to go.

Anonymous Coward

Beyond sacked 

I think we should consider bringing back flogging.

Whoever runs the LSE needs to be top of their game across the board. That is no position for someone who is a one trick pony, they have to be in the top 5% of all the elements the LSE touches in operation.

It is a top job, and there are people who are masters of many, choose one of those.

wayne tavitt

A title is required. 

"Britain’s most important industry" - you're 'aving a larf, aincha.

Rodolfo

Good, but why it's on The Register and not the FT? 

Pirate

Spot on. The decision not to invest enough in IT had a business impact. So why not the front page of the FT asking for her head, but instead El Reg on a weekend?

PS. anyway the board should be removed for the disastrous merger with the most useless stock exchange of the planet, the Milan Stock Exchange, hardly a world-class destination with blue-chip public companies.

Skull because "off with her head"

Anonymous Coward

Britain's most self-important industry???? 

Thumb Down

That brought us endowment mortgage shortfalls, Northern Rock and the credit crunch, foreign-owned essential utilities, and the like... managers get the bonuses, customers carry the risk. Hmmm, trebles all round, unless your savings/pension just went down the tubes.

That aside, wasn't there an OS missing from the author's list of OSes of choice for reliable disaster tolerant systems, potentially with split site capability? Possibly even two OSes? Still, clueless headhunters are nothing new are they. Anyway, more by luck than by design, HP own the two relevant OSes these days, but once upon a time the world knew a little bit more about VMS and whatever the Tandem Nonstop OS is officially called, and indeed until relatively recently those two were the foundation for most of the world's most successful stock exchanges. Then the LSE decided that a desktop-heritage OS was good enough for them, and obviously purely by coincidence, here we are today.

That being said, HP do have an interesting and relatively recent disaster tolerance video: "We demonstrated that IT services continued to be available for all of our operating system environments— HP-UX, Microsoft® Windows® Server 2003, Red Hat Enterprise Linux, NonStop OS, and OpenVMS." Have a look if you have a few minutes, and if you're associated with Ms Furse, send her this link:

http://www.hp.com/go/DisasterProof

Wrt dissimilar redundancy: don't diss it too much, it's got a few projects done which wouldn't have got approved without it. Stuff that flies, or goes bang, or must never go bang (nuclear power bang), springs to mind. It's not often seen in traditional business IT, but that's largely because dissimilar redundancy done right is *expensive*, and traditional IT beancounters usually prefer "cheap" to "done right".

Anonymous Coward

I remember when 

the LSE ran on resilient systems like VMS; mind you, that was when their IT was done in-house rather than by those nice people from Andersen (as it was then). So 20+ years ago (ye gods, that long...) they had remote site recovery but a seriously non-trendy platform.

Mind you, many of the international stock exchanges that don't fall over are still using stuff like VMS, but that'll just be coincidence.

Chancellor Dorkon

Correct Blame Placement: MICROSOFT 

Really, what else needs to be said?

Sava Zxivanovich

MiFID & Reliability 

MiFID will solve all problems for customers. This event will just force them to consider using more than one exchange.

Re reliability - you have to use two or three different systems in order to avoid common cause error/crash/fault/disaster!

Ronan Quirke

Missed opportunity 

There was an opportunity here to have an intelligent discussion as to why the stock exchange should have some more clued in tech leadership. Instead what we have is the author blaming Accenture because an ex employee is the CIO. Brilliant.

And yeah, a DR running a different version of the software, because that is what a DR system is for....

Oh hang on a second. I also see David Lester previously worked for Thomson Financial, they should carry the can somehow too right?

I came across the following article when doing a bit of online research myself:

http://www.watersnews.com/public/showPage.html?page=812732

Gives some more interesting insights than the author's rant about how things were great in his day and everyone should be fired etc.

Anonymous Coward

If it's so important, why sack one person. 

Thumb Up

I have a slightly different view via the safety embedded world, concerning some of the comments about why not having independent hardware and software redundancy/system integrity designed in:

You don't tend to get too many planes falling out of the sky due to software failure, and no redundant/backup capability.

Then of course I guess the LSE (and a few others whose entire business is dependent on the computers/networks working) don't have effectively three levels of hardware (different silicon/routing) running concurrently using three different software implementations developed by separate teams working in isolation (only the system spec is common).

Is this expensive? Sure. But the news headlines are likely to be far worse if an Airbus decides to fall out of the sky (plus they'll stop selling planes and go bust if it happens often enough) than if some poor inside trader can't ditch his XL Group shares.

Maybe decent system integrity is too expensive for finance types.

Maybe the potential losses to the LSE through a system failure haven't seemed to be large enough for them to actively avoid the lack of system redundancy issue.

And maybe the odd sacking of a CEO of an information driven business, who doesn't have Tech Director/CIO on the board, to encourage the others, isn't a bad idea for the general health of GB plc.

Anonymous Coward

blame the computer -- really? 

You all assume the computer or its software is to blame.

I say the breakdown was most likely 'as designed'. There was huge trading in one particular american company as I understand it, and certain market makers needed the trading to halt. So the trading halted...

Think about that for a moment. The timing was probably everything but a coincidence. Follow the money.

Destroy All Monsters

@Dan Wilkinson 

"No, the DR system should be identical to your production system in terms of it's [sic] design and functional behaviour wherever possible."

Thank you for stating the obviously correct way of doing things; I was getting unsure of myself. Yes, use the same system for production/failover by all means, and if you can afford it do "Independent Verification and Validation" of the design and implementation. Otherwise staffing and configuration management problems will be too much - not to talk about inter-system synchronization issues, coordination, testing etc.

I don't think that "independent implementation" has ever been shown to reduce downtime. It may be useful in very specialized and controlled environments like those three Space Shuttle computers voting on results, yielding continuous uptime with majority rule, but even there the benefits of independent implementations were doubtful according to some stats I can't find again.
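For anyone who hasn't met the voting idea, a toy sketch of majority rule over independently written implementations follows. The three `mid_price_*` functions and the deliberately planted bug are hypothetical, invented purely for illustration. This is not how any flight computer or exchange actually does it.

```python
from collections import Counter
from typing import Callable, Sequence


def mid_price_a(bid: int, ask: int) -> int:
    """Hypothetical team A: floor of the average, in ticks."""
    return (bid + ask) // 2


def mid_price_b(bid: int, ask: int) -> int:
    """Hypothetical team B: same spec, different arithmetic."""
    return bid + (ask - bid) // 2


def mid_price_c(bid: int, ask: int) -> int:
    """Hypothetical team C: planted bug, divides by the spread (crashes at zero)."""
    spread = ask - bid
    return bid + (spread * spread) // (2 * spread)


def majority_vote(channels: Sequence[Callable[[int, int], int]], bid: int, ask: int) -> int:
    """Run every channel and return the value a strict majority agrees on."""
    results = []
    for channel in channels:
        try:
            results.append(channel(bid, ask))
        except Exception:
            continue  # a crashed channel simply loses its vote
    if not results:
        raise RuntimeError("all channels failed")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(channels):
        raise RuntimeError("no majority among channels")
    return value


if __name__ == "__main__":
    channels = [mid_price_a, mid_price_b, mid_price_c]
    print(majority_vote(channels, 100, 103))  # all three channels agree
    print(majority_vote(channels, 100, 100))  # channel C crashes, A and B outvote it
```

The scheme only helps when the faults differ between channels. If all three teams misread the shared spec in the same way, the vote is unanimous and wrong, which may be why the measured benefit is disputed.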

Anomalous Cowherd

@ Danny 

Thumb Down

Completely agree - the concept of deliberately running a different version of the codebase on the DR system is bizarre. A bug is fixed, a new version released - if I understand this article correctly the author would recommend running a version /known to be buggy/ on the DR system? I'm aware heterogeneity is broadly a good thing, but not in the middle of a disaster thanks very much. I want an identical handover, no surprises or gotchas.

Anyway when was the last time software caused something like this? Hardware, that's where it always goes wrong. OK, except for the metric/imperial mixup on the Mars Climate Orbiter. But generally it's hardware...

n

warning* 

Coat

*systems may go down as well as up.

Anonymous Coward

RE: Wrong wrong wrong 

Boffin

The reason you have different versions is that otherwise, when the DR comes up, it will be fed the same scenario as the live site, and so you take down the recovery site as well.
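A toy sketch of that failure mode, under loud assumptions: `engine_v2` and `engine_v1` are made-up stand-ins for a live release and an older one, and the zero-quantity "poison" message is invented. Whether a real older version would cope any better is exactly the point other commenters dispute.

```python
from typing import Callable, Iterable


def engine_v2(qty: int) -> int:
    """Hypothetical live release: divides by the quantity, so qty == 0 crashes it."""
    return 1_000_000 // qty


def engine_v1(qty: int) -> int:
    """Hypothetical older release: rejects qty == 0 instead of crashing."""
    return 0 if qty == 0 else 1_000_000 // qty


def failover_survives(
    live: Callable[[int], int],
    standby: Callable[[int], int],
    feed: Iterable[int],
) -> bool:
    """Process the feed on live; on a crash, replay the same message on standby."""
    active, failed_over = live, False
    for msg in feed:
        try:
            active(msg)
        except Exception:
            if failed_over:
                return False  # the standby has already taken over and died too
            active, failed_over = standby, True
            try:
                active(msg)  # the standby is fed the very same input
            except Exception:
                return False
    return True


if __name__ == "__main__":
    feed = [5, 0, 7]
    print(failover_survives(engine_v2, engine_v2, feed))  # identical software
    print(failover_survives(engine_v2, engine_v1, feed))  # dissimilar standby
```

The identical pair goes down together on the same message. The dissimilar pair survives here only because the older version happens not to share the fault, and nothing guarantees that in general.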

Cisco systems are a good example of this, sometimes after a failure they get stuck in a loop electing a new leader, then automatically back off to avoid creating the loop and downtime grows each cycle.

I think you have confused service with systems.

"To deliberately choose to run a different system in case there is some sort of bug, or error that could cause the design in question to fail is both impossible to do effectively, and not a function of service continuity, but of system design. This is important!"

For example, Juniper or Cisco can operate with what the LSE are doing and it gives me great comfort to have the two systems interleaving and bypassing each other. On failure of one the other can still take over all the way to Linx for example or can be repaired while keeping the service running. The system is what lets people work, the service is what they provide to others.

blair

Hear hear 

Thumb Up

Having worked on a project that is very close to the LSE's new trading platform I would have to agree with most of Dominic's views.

Accenture have the ability to convince senior executives that they know what they are doing because of their branding, thus forcing unnecessary hires of testing, support and development staff with little or no experience. Moreover, Accenture insisted on using .Net to develop most of the feed handlers. This made it easier to find developers but also had the effect of reducing performance and blighting the order matching system with .Net garbage collection issues. Yes, I know the system is supposed to be horizontally scalable, but what’s the point if you can’t scale up because your non-horizontally scalable parts won’t without huge cost? Fortunately the LSE has been severing ties with Accenture over the last 2 or 3 years, although this has just meant buying back staff from Accenture they had previously sold to them.

I must admit I never met Ms Furse during my stint at the LSE. Nevertheless, it seems a touch arrogant for her to appear to ignore competition from the likes of Turquoise and Chi-X. Looking at the share price now, perhaps a takeover back in 2005/6 might not have been a bad idea?

One slight error, Dominic, is that the LSE’s primary site is not at St Paul’s. The two main data centres are elsewhere, with St Paul’s acting as part of a Quorum.

Anonymous Coward

Seems fair comment 

Happy

The DR should be held back in line with a working version of production. When the production is updated and has settled, then the DR should be patched. In that scenario the DR is useful as a failover in ANY circumstance, not just "act of god" type problems. Having a business critical system down for more than a few hours? No work being done? Seems to qualify as a disaster to me! You pay for DR, use it! Ideally use DR as an offsite storage and offline reporting and processing setup, then you know it works when you need to fire it up in anger.

Sounds like the DR was not up to spec and I can imagine the cry went up much like places I have worked in, "Sorry but DR won't take the load and you never allocated any money or time to get it up to spec, it was all a token effort to please the auditors. I would concentrate on getting production back up ASAP!"

I am not entirely sure of the full spread of kit used at LSE but from the discussion I have had with colleagues we are under the impression that it was a showcase setup for a single platform technology. No one in their right mind uses one platform for business critical systems, the trendy term "best of breed" is investigated and you try to ensure that you get a useful spread and mix of tech that plays well together. If something goes belly up, you only have to look in one place, not all over the bloody place to try to find the glitch. You never run all the same make of router at your gateways, if one goes down or gets compromised, you know the other will keep you going to buy you time to find out what to fix. Run IIS to Oracle on Sun, or Apache on Sun to DB2 on IBM, whatever it takes to buy time if one goes out, especially when O/S patching, on ANY platform, is involved!

Anonymous Coward

At last - someone else pointing out the way things are. 

Pirate

I've seen so many things messed up because the people who make the IT decisions are not qualified to make them. I could rant on about how reliable IBM's VM/MVS/VSE systems are, or how the NS here has tried to move off VMS for many years, spent millions on outsourcing it (a bankrupt concept) got nothing for it and have decided they need to do it in-house etc. etc. etc. but it'd just be a rant so I'm off to write something trendy... what's next... oh drat - a Delphi 3 program. Sigh.

PS El-Reg web-site designer: I can only see 25 chars in the title box - it scrolls even though there's lots of white-space on the rhs. And insert std. rant about the font-size and fixed-width here - we really should have an icon for it.

Sceptical Bastard

Oh, *that* LSE ! 

And here's me wondering WTF this had to do with the London School of Economics!

(I'd use the Paris icon if we still had the proper one)

Anonymous Coward

Loads a Bull 

Stop

Loads a bull! It is not networks or Cisco, stupid head hunter. You must have been a bad techie so had to move on to be a head hunter.

Anonymous Coward

Dont use IBM DR 

Coat

My DR site had a 12 hour outage a few weeks ago: IBM's Sampson House switched to generators, due to local leccy issues. Then the gennys overheated and shut down, so no n+1. Then they couldn't handle the load so the load was dropped.

End result whole building down for 12 hours. They should be sacked along with "Furse"

Anonymous Coward

Yes, important. 

The Stock Market's money movements generate more money than any of our manufacturing sectors, and the LSE handles more money than any building conglomerate in the UK. Yes, they're important to the current state of affairs. As you can see by the effect of their actions.

Whether this is a Good Thing (tm) is eminently debateable.

Wayland Sothcott

It was not a screwup 

I think the writer is missing the point. The stockmarket was deliberately switched off because of Fannie Mae and Freddie Mac.

Nick Hill

Same for a lot of companies 

Coat

Lots of senior execs don't understand that they rely totally on IT now.

I work for an Insurance company - and we are seriously creaking at the seams. Low morale and lack of balls by the IT Director (and understanding) means the business just doesn't understand what will happen when the systems fail. Who gets blamed? The poor person running that system day in day out that's been screaming for investment!

This Icon looks like a pick-pocket....

Arthur McGiven

Clubable 

The real problem with British management on all levels is that it runs on the basis of knowing the right people. That is how they get the jobs and also how they do the jobs.

The nearer one gets to the ranks of the great and the good, the more this rule applies, and, most disastrously for our future, the more one tends to regard technology/engineering as something that belongs round at the tradesman's entrance.

n

taking the fail out of failover... 

Alert

She should apply for a job with IR.

From IR's wiki:

"Significantly, several IT initiatives are being phased in to better handle ticketing, freight, rolling stock (wagons), terminals, and rail traffic, including the use of Global Positioning System (GPS) and Microsoft (MS) Windows Vista for train tracking in real time."

all aboard the 6.45 "meatgrinder express"!

MarmiteToast

Unimpressed 

Thumb Down

I cannot agree with Dan Wilkinson more, this article is geek sensationalism, written by someone with a poor understanding of the issues present.

Alex

Um.. What???! 

Flame

hang on.. You aren't seriously suggesting that you would use a different system at your DR site are you?? You HAVE to use the SAME systems at your DR site to provide continuity!!

Now, if you were saying that testing of new software needs to be done off the live systems, then yes you are right (naturally) but have written it extremely poorly. Testing of new software should be done off network (if you have the means) and at the very least on a test system.

Do not, under ANY circumstances, try to say that you should be using differing systems between your live and DR site...

Am baffled by this, but can only say that you have (hopefully!!!) mis-written what you mean.

Anonymous Coward

Why does everyone think "DR" means "site loss"? 

Unhappy

It means data corruption from stupid programmers, infrastructure lockout from bugs in firmware, negotiation problems or expired licenses. Plus you never have the tools to conclusively prove what the problem is nor the leverage to force the vendors to investigate (They're like plumbers - "sorry mate, not our problem").

And there is rarely a "smoking gun" for the problem - could be OS, could be database, could be application. Had a 3 DAY outage on SAP caused by a bug in the update software, we had to hire our old consultants back as SAP refused to acknowledge it was their problem. (It was)

RotaCyclic

Redundant Systems 

Dan Wilkinson wrote:

"To deliberately choose to run a different system in case there is some sort of bug, or error that could cause the design in question to fail is both impossible to do effectively, and not a function of service continuity, but of system design. This is important!"

Rubbish. They do this all the time with military systems, particularly so for aircraft flight critical systems both civilian and military.

Given how much money is involved in the deals on the LSE, it would have been prudent to have designed the system with this capability.

Ferry Boat

Somewhere in the City 

I agree with Dan Wilkinson. You don't have two different versions. How do you know how they should differ? How do you know a previous version would cope any better? How do you test them? They should be in line. Having been through a similar situation on a switch to DR we modified the inputs to the system to prevent the error. In a volume case, reducing the volume and increasing the time.

@Sava Zxivanovich

It would be interesting to know how much trade was switched to other exchanges. I know MiFID theoretically makes this possible but how many firms have everything in place to actually achieve it?

RotaCyclic

IT Managers 

Nick Hill wrote:

"IT Director (and understanding) means the business just doesnt understand what will happen when the systems fail. Who gets blamed, the poor person running that system day in day out that's been screaming for investment!"

Unfortunately, a number of IT managers or IT project managers are IT managers because they don't have the technical expertise to do the techy work, so they become managers, it's easier for them. I've worked for a few like this.

I currently work for a muppet (who admits he's not technical) who gets involved in every nook and cranny of the company (we're a small company), even giving advice on how to investigate and debug technical problems about which he knows nothing, but he's the boss, so everyone does exactly what they're told.

So they're not really up to understanding, however, they should employ decent technical design authorities that can do the system architecture well.

Anonymous Coward

Rant but... 

Paris Hilton

Even though I cannot agree with most of what the author says, (I do agree with Dan Wilkinson), there is an underlying truth about the author's rant.

People making big decisions don't have the foggiest more often than not, and having someone who's not tech savvy making decisions is bound for disaster.

About Accenture I can say that if you deal with them ask for people from their group who belong to Avanade or similar and you'll get what you pay for, (probably even more).

Remember:

No matter what any vendor will tell you, it is you who are responsible, people in charge should always know what they are dealing with.

Now where is Paris Hilton when you need her? She probably could run the place better than Furse. :P

Jason Clery

@AC 

"the LSE ran on resilient systems like VMS"

Pah...

I remember when the LSE was run on paper and an abacus.

Ash

@Alex 

I'm not in the sector, but I think I get what's happened.

While you quite rightly say that the same system should be mirrored to the DR site, I believe the point the author was making was that there doesn't need to be the same kit. You can run one on Cisco routers and switches, and one on HP switches. One system can be run on Winblows, the other on Linux.

The input and output would be the same for both systems; just the underlying infrastructure would differ, to prevent any issues with the OS / switch kit causing the failure of both sites.

Again, I Am Not A Disaster Recovery Specialist; This is pure speculation.

Anonymous Coward

The Ignorance of City Institutions 

Pirate

I used to work for a City Institution which shall remain nameless. The IT department used to be a relatively small percentage (20%) of the workforce, but grew to be something like 50%. Every single member of the board, and every member of the Executive Committee, was a banker or financial expert. Not one knew the first thing about IT.

The CEO nearly had puppies at a company meeting in which one perspicacious employee described the company as a "software house", despite the fact that they were utterly reliant on IT and spent most of their time writing software, all without any hint of good architectural practice. The consequence was that they allegedly wasted a sum not unadjacent to £100M, mostly on consultants, in creating a new set of strategic systems that would never fly (not least because security was apparently viewed as an add-on rather than a deeply ingrained part of the system).

And the arrogance of the City rolls on, undiminished ...

RotaCyclic

Defence vs. Banking 

Anonymous Coward wrote:

"Then of course I guess the LSE (and a few others whose entire business is dependent on the computers/networks working) don't have effectively three levels of hardware( different silicon/routing) running concurrently using three different software implementations developed by seperate teams working in isolation (only the system spec is common)."

I did embedded systems work years ago, and those that can do it, make better programmers, systems people, IT people than those that haven't.

I've chatted to recent computer science graduates and they're not even taught assembly language programming ( I understand why, and undergrads these days can't be taught everything) but those that can program at a low level have a much better understanding of computing all round.

Having worked in the defence sector and in Investment Banking, I can state my experience is that the technical people in defence are better technically than those in the banking sector.

Lee

Disaster Recovery 

Is all about services.

You don't need identical hardware nor identical software to implement DR.

As long as your disaster recovered services provide the desired results everyone's a winner.

Of course if you use different hardware / software you eliminate the chance that a hardware / software bug brings your DR solution down too, but requires more work in being able to ensure that you don't get compatibility type problems.

It seems technically bad that the LSE was out for so long, but without the entire facts I can only summarise the impact as follows:

No-one died, no-one was seriously injured, my mortgage has not gone up (or down), petrol is still bastard expensive but not mofo expensive and I had a good night's sleep last night.

Dan Wilkinson

@ Rotacyclic (& others) 

I think you misunderstood my comment, you appear to agree wholeheartedly in your own words.

I know that certain industries and companies have the requirement and indeed the sense to use complex methods to ensure that component failure cannot affect the system as a whole. They may make use of straightforward redundancy, or the disparate redundancy mentioned by a few other posters where (to take a previously mentioned idea for example) you don't only have Cisco routers etc, but a mixture from separate manufacturers that provide the same functionality. As you say yourself, for some areas, maybe including the LSE, "it would be prudent to have the system designed with this capability". Exactly what I said; that is part of system design, not Disaster Recovery. Maybe I should have used the "DR" wording, rather than Service Continuity.

The point is that this "requirement" is built into the system as a whole, and not only as part of your DR/SC requirements. It is THE system design. If you need this level of protection, then BOTH your production site, AND your DR site will have a mixture of (again, for example) say Cisco/HP/IBM switches. Your design used a mixed environment, and your DR site should mirror that EXACTLY. It's no use using Cisco at your production site, and HP at your DR site - that is poor design.

Your DR systems are there to replicate your production environment in the event of its failure, they are not there to provide redundancy that should be present in production in the first place if it is so important.

Anonymous Coward

I'm amazed that they bought into Windows full stop 

It's fine for smaller systems, but when we want serious performance (and we're talking millions of transactions per hour here) the argument is between Sunfire E25Ks and IBM P595s. Wintel boxes lack the error correction and redundancy of these machines, while simply not scaling to anything like the performance of these boxes.

Lots of small boxes are great in theory, but in practice it's extra complexity to go wrong and more problems in replicating the environment at your CoB site.

Alex

@Ash 

Ah-so. Indeed, I agree. We use physical hardware with new Cisco kit at the live site, and a virtualised environment with slightly older Cisco kit at the DR site. We use the DR site as a high availability system too, so if server hardware were to drop, the system at the DR site would be called into action.

I'm not 100% sure about mixing OSs and flavours of packages, as there is a big risk.... IMHO

I suppose, the way you've described is very plausible. Just not sure about the author, but I'll pend passing judgment.

Mark

@Lee 

Paris Hilton

If there was such a lack of issue at the outage, why are the executives paid a lot? Only people with important jobs get paid lots and if your system can be missed for a day and nothing untoward happen, why not save a few million each year and pay this bunch of monkeys peanuts?

Anonymous Coward

I have seen it all before 

Boffin

I was once involved in troubleshooting a network outage at the LSE when I worked for a company that supported them. The "consultant" at LSE had called us to say that the network was in meltdown. One of my junior colleagues had taken the call and the LSE chap was screaming so loudly that I could hear him through the phone earpiece even though I was about 5 meters away. I took the phone from my colleague, introduced myself as a senior engineer and asked what was the matter. He said that the whole network was down and kept screaming "when is an engineer going to get here". I told him that an engineer was on his way but the fellow just kept on panicking and asking for the ETA of the engineer.

I then told him "what you need to do is calm down and then take a walk around the building inspecting all of the key components of the network". He agreed to do so and hung up, and 10 minutes later my colleague got a call to say we should cancel the engineer visit as he had "fixed the problem". He did not leave any explanation as to what the problem was; he just stated that he had managed to fix it. I could not rest until I knew what had been done to fix the issue, so I called him to find out. He said that he had walked into a comms room and found a network device that was continuously rebooting. He switched it off and everything started working. He took the credit for fixing the problem and did not say thank you for our assistance. I am therefore not surprised to hear about outages on their network if they hire people like this.

George Capehart

What GRC? 

Paris Hilton

Just one more example of the total lack of awareness of governance and operational risk management in business. And financial services seems to lead the pack in spite of all of the regulatory activity directed at it. The Peter Principle is alive and well at the C*O and Board levels . . .

Anonymous Coward

IT engineering chasm 

The argument about the impossibility of running differently designed and implemented main and DR systems seems to be between IT people and engineers. This obviously is possible but expensive to develop, test and maintain. Every project I work on has a risk analysis where we estimate the probability and impact of every failure we can think of. We then design measures to make the probability or impact acceptable.

Design faults are a likely cause of failure and the DR system will probably also fail given the same inputs. The only way to reduce this risk is to have independently developed systems. Running different versions of the same system as in the article IS a strange idea; it gives little or no protection, as an as yet undiscovered bug is probably in both.

It may be that this was considered and the risk/cost trade-off was deemed acceptable. Financial organisations seem to accept the risk of a major catastrophic financial system failure every 50-100 years. It would be strange if they designed computer systems with higher resilience.

Adrian Waterworth

Anyone surprised? 

Stop

Anyone who has worked on large-scale IT projects in the public or commercial sectors will have seen this time and time again. Senior management largely drawn from the ranks of marketing, sales and accountancy. If you're lucky (very lucky!) some of them might know enough and be honest enough to realise that they need advice and expertise from the technical staff, but that's not particularly common.

Of course, the IT industry itself is partly to blame. Almost all major projects are ridiculously oversold on a slippery mixture of snake oil and bullshit. That's why massive schedule and cost overruns are the norm rather than the exception. Unfortunately, as long as there's even one major supplier out there who will promise the world at half price by next Wednesday, everyone has to play the same game. So you end up with a bunch of salesmen and accountants having the wool pulled firmly over their eyes by another bunch of salesmen and accountants while the poor buggers who actually have to design and implement their badly-specified and insane pipe-dreams look on in a mixture of despair, resignation and mute fury.

Been there, seen too much of it. That's why I left "big IT" a couple of years ago - it finally reached the stage where I couldn't ignore my moral and ethical misgivings about the whole thing. Now I just wait for the day when something goes sufficiently wrong somewhere that someone big finally says they've had enough and sues one or more of their suppliers into near oblivion.

OK, so that isn't particularly likely, but it's going to need something on that scale to make the IT business grow up and get its collective act together.

(P.S. Nice to see that comments are working now. I originally tried to post this one sometime on Sunday, but in spite of all the above waffle, the comment system still insisted that I had only submitted a title and no actual comment text. Oops! On an article about IT cockups too...)
