Furse should not resign, she should be sacked

The farce of the London Stock Exchange not only crashing but failing to get its systems up and running again should surprise no one. Well, no one except LSE boss Clara Furse, who demonstrates little understanding that technology is crucial to her business. I’ve worked for members of the London Stock Exchange and everyone …

COMMENTS

This topic is closed for new posts.
  1. Dodgy Geezer Silver badge
    Paris Hilton

    Um..conflict of interests..?

    This comment: "Furse should not resign, she should be sacked.." would come better from someone who is not interested in replacing senior executives at high cost, such as "Dominic Connor,...a headhunter."

    Perhaps when he was "Dominic Connor, developing trading systems", he would have recommended that Furse stay, but employ someone different to develop the trading system?

    Paris Icon for obvious reasons...

  2. Dan Wilkinson
    Thumb Down

    Wrong wrong wrong

    This article irritated me greatly. Now I can't comment on what happens at the very top level of such a large business, but as someone who has recently jumped ship from being a "geek like you" to a service continuity manager, I have to take issue, especially with this bit:

    "If the DR site was working, why didn’t it take over? Can the LSE put paid to the rumour that they were running exactly the same software for both live and standby? If you are Clara Furse reading this, here’s a hint, two copies of the same software will probably crash at the same time, given the same inputs. That’s why grownups use multiple versions.”

    Did Accenture tell you that? Did it sound like a luxury to the media beancounters you appointed? What a load of rubbish. Are you seriously suggesting that the reason you have a DR system is merely in case the software crashes? And that the way to recover from a software crash is to have some "different" software on your DR system? If so, how different? A newer version? How about an older version? Should my production systems use Oracle on Unix, and the DR system DB2 on a mainframe?

    No, the DR system should be identical to your production system in terms of its design and functional behaviour wherever possible. OK, depending on the size of your business and the importance of the system in question, you may have lower capacity or other limitations as part of your design. These differences should be carefully considered, and may be chosen for reasons of cost, complexity or any number of other things, but in a nutshell, what you choose to make available in a DR situation has to mirror the functionality, if not the specification. You can't make arbitrary changes to second-guess potential future problems.

    To deliberately choose to run a different system in case there is some sort of bug, or error that could cause the design in question to fail is both impossible to do effectively, and not a function of service continuity, but of system design. This is important!

    I would like to see what happens when there is a real "disaster" (fire/flood/power cut/sysadmin-gone-mental), where the failover to DR systems would have worked, only it didn't because the software was different. Your comment "but it’s obvious from this event that if the pathetically vulnerable St. Paul’s site is taken out, we can have no confidence in when the market will be back on line." misses the point that it's entirely possible the DR site could have worked perfectly, if the cause of the fault had actually been the primary site being "taken out" rather than suffering from potentially poor design, or unexpected "input" (whatever you mean by that).

    I note that there hasn't been much press coverage in sufficient detail for me to understand what actually went wrong, but I know that if these were my systems, I would not want you working to fix them. Not only do you misunderstand the concept of what service continuity is and how to effect it, but your "holier-than-thou" attitude would surely waste people's time and misdirect their efforts. You may understand when you become one of us "grownups"; stop publishing these childish rants and behave like a (fully) responsible journalist...

  3. Ian Michael Gumby
    Coat

    What a crock!

    While I do agree that the senior execs of the exchange should be held accountable, there is no reason for this type of failure to happen.

    Depending on how much money you want to spend on redundant hardware, you can achieve six 9s of uptime.

    Using IBM's IDS as a database in a configuration of local failover and then remote failover will give you the protection you will need.
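
    (For illustration, a minimal sketch of that local-then-remote escalation order, in Python. The names are invented; this is not the actual IDS/HDR interface, just the shape of the idea.)

    # Hypothetical failover escalation: try the local standby first,
    # then the remote (DR) standby. Not a real IDS/HDR API - a toy model.
    class Node:
        def __init__(self, name, healthy=True):
            self.name = name
            self.healthy = healthy

        def heartbeat_ok(self):
            return self.healthy

    def choose_active(primary, local_standby, remote_standby):
        """Escalate in order: primary -> local standby -> remote standby."""
        for node in (primary, local_standby, remote_standby):
            if node.heartbeat_ok():
                return node
        raise RuntimeError("total outage: no healthy node left")

    # Example: primary and local standby both dead, remote DR takes over.
    primary = Node("primary", healthy=False)
    local = Node("local-standby", healthy=False)
    remote = Node("remote-DR")
    print("serving from:", choose_active(primary, local, remote).name)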

    It's unfortunate that IBM has the database technology and the hardware technology, yet not the marketing sense to go to the exchanges and pitch the idea.

    Of course, even if they did, the cost would be high and it would have to be justified against the risk. (Redundancy upon redundancy upon redundancy will give you a fairly high level of uptime, with planned downtime allowed.)

  4. Peter Gold badge

    Bottom line is that you're mentioning Windows..

    I thought so. Here's a scary one for you: somewhere in the 90s (around the time of NT 4.0) it became "fashionable" to use Windows for process control systems as well. Fashion equates to "not being decided by logic" and today we are reaping the benefit of that idea: a mad scramble to get those systems at least a BIT secure.

    Your observation is thus no surprise: again a case of people who have no clue dictating to those who do, instead of letting those people do what they're paid for. But it's exactly that sort of micromanagement and "politics über alles" attitude that made me leave that whole scene.

    There is a small flaw in your argument, though. The lot at the top isn't entirely useless. Their job is to get clients in, which they do. The mistake they made was to make decisions outside their competence as well - QED..

  5. Nomen Publicus
    IT Angle

    resign!

    The LSE lives and dies as a network service. If the boss cannot understand that then it is time to go.

  6. Anonymous Coward
    Anonymous Coward

    Beyond sacked

    I think we should consider bringing back flogging.

    Whoever runs the LSE needs to be top of their game across the board. That is no position for a one-trick pony; they have to be in the top 5% of all the elements the LSE touches in operation.

    It is a top job, and there are people who are masters of many trades; choose one of those.

  7. Anonymous Coward
    Anonymous Coward

    A title is required.

    "Britain’s most important industry" - you're 'aving a larf, aincha.

  8. Rodolfo
    Pirate

    Good, but why is it on The Register and not the FT?

    Spot on. The decision not to invest enough in IT had a business impact. So why is the front page of the FT not asking for her head, instead of El Reg on a weekend?

    PS. Anyway, the board should be removed for the disastrous merger with the most useless stock exchange on the planet, the Milan Stock Exchange, hardly a world-class destination for blue-chip public companies.

    Skull because "off with her head"

  9. Anonymous Coward
    Thumb Down

    Britain's most self-important industry????

    That brought us endowment mortgage shortfalls, Northern Rock and the credit crunch, foreign-owned essential utilities, and the like... managers get the bonuses, customers carry the risk. Hmmm, trebles all round, unless your savings/pension just went down the tubes.

    That aside, wasn't there an OS missing from the author's list of OSes of choice for reliable, disaster-tolerant systems, potentially with split-site capability? Possibly even two OSes? Still, clueless headhunters are nothing new, are they. Anyway, more by luck than by design, HP own the two relevant OSes these days, but once upon a time the world knew a little bit more about VMS and whatever the Tandem NonStop OS is officially called, and indeed until relatively recently those two were the foundation for most of the world's most successful stock exchanges. Then the LSE decided that a desktop-heritage OS was good enough for them, and obviously purely by coincidence, here we are today.

    That being said, HP do have an interesting and relatively recent disaster tolerance video: "We demonstrated that IT services continued to be available for all of our operating system environments— HP-UX, Microsoft® Windows® Server 2003, Red Hat Enterprise Linux, NonStop OS, and OpenVMS." Have a look if you have a few minutes, and if you're associated with Ms Furse, send her this link:

    http://www.hp.com/go/DisasterProof

    Wrt dissimilar redundancy: don't diss it too much, it's got a few projects done which wouldn't have got approved without it. Stuff that flies, or goes bang, or must never go bang (nuclear power bang), springs to mind. It's not often seen in traditional business IT, but that's largely because dissimilar redundancy done right is *expensive*, and traditional IT beancounters usually prefer "cheap" to "done right".

  10. Anonymous Coward
    Anonymous Coward

    I remember when

    the LSE ran on resilient systems like VMS; mind you, that was when their IT was done in-house rather than by those nice people from Andersen (as it was then). So 20+ years ago (ye gods, that long...) they had remote-site recovery but a seriously non-trendy platform.

    Mind you, many of the international stock exchanges that don't fall over are still using stuff like VMS, but that'll just be coincidence.

  11. Chancellor Dorkon

    Correct Blame Placement: MICROSOFT

    Really, what else needs to be said?

  12. Sava Zxivanovich

    MiFID & Reliability

    MiFID will solve all problems for customers. This event will just force them to consider using more than one exchange.

    Re reliability - you have to use two or three different systems in order to avoid a common-cause error/crash/fault/disaster!

  13. Ronan Quirke

    Missed opportunity

    There was an opportunity here to have an intelligent discussion as to why the stock exchange should have some more clued-in tech leadership. Instead what we have is the author blaming Accenture because an ex-employee is the CIO. Brilliant.

    And yeah, a DR running a different version of the software, because that is what a DR system is for....

    Oh, hang on a second. I also see David Lester previously worked for Thomson Financial; they should carry the can somehow too, right?

    I came across the following article when doing a bit of online research myself:

    http://www.watersnews.com/public/showPage.html?page=812732

    It gives some more interesting insights than the author's rant about how things were great in his day and everyone should be fired, etc.

  14. Anonymous Coward
    Thumb Up

    If it's so important, why sack one person.

    I have a slightly different view, via the safety-critical embedded world, concerning some of the comments about why independent hardware and software redundancy/system integrity wasn't designed in:

    You don't tend to get too many planes falling out of the sky due to software failure with no redundant/backup capability.

    Then of course I guess the LSE (and a few others whose entire business is dependent on the computers/networks working) don't have effectively three levels of hardware (different silicon/routing) running concurrently, using three different software implementations developed by separate teams working in isolation (only the system spec is common).
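
    (A toy illustration of that triple arrangement, in Python: three independently written implementations of one small "spec" - here, a midpoint price - with a majority vote masking a fault in any single one. All names and numbers are invented.)

    from collections import Counter

    def impl_a(bid, ask):          # team A's implementation
        return (bid + ask) / 2

    def impl_b(bid, ask):          # team B: different arithmetic, same spec
        return bid + (ask - bid) / 2

    def impl_c(bid, ask):          # team C: deliberately buggy for the demo
        return (bid + ask) / 2 + 1

    def vote(bid, ask, implementations):
        """Return the value at least two implementations agree on."""
        results = [round(f(bid, ask), 9) for f in implementations]
        value, count = Counter(results).most_common(1)[0]
        if count < 2:
            raise RuntimeError("no majority: implementations disagree")
        return value

    print(vote(100.0, 101.0, [impl_a, impl_b, impl_c]))  # 100.5; C is outvoted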

    Is this expensive? Sure. But the news headlines are likely to be far worse if an Airbus decides to fall out of the sky (plus they'll stop selling planes and go bust if it happens often enough) than if some poor inside trader can't ditch his XL Group shares.

    Maybe decent system integrity is too expensive for finance types.

    Maybe the potential losses to the LSE through a system failure haven't seemed large enough for them to actively address the lack of system redundancy.

    And maybe the odd sacking of a CEO of an information-driven business, one which doesn't have a Tech Director/CIO on the board, to encourage the others, isn't a bad idea for the general health of GB plc.

  15. Anonymous Coward
    Anonymous Coward

    blame the computer -- really?

    You all assume the computer or its software is to blame.

    I say the breakdown was most likely 'as designed'. There was huge trading in one particular American company as I understand it, and certain market makers needed the trading to halt. So the trading halted...

    Think about that for a moment. The timing was probably everything but a coincidence. Follow the money.

  16. Destroy All Monsters Silver badge

    @Dan Wilkinson

    "No, the DR system should be identical to your production system in terms of it's [sic] design and functional behaviour wherever possible."

    Thank you for stating the obviously correct way of doing things; I was getting unsure of myself. Yes, use the same system for production/failover by all means, and if you can afford it, do "Independent Verification and Validation" of the design and implementation. Otherwise staffing and configuration management problems will be too much - not to mention inter-system synchronization issues, coordination, testing, etc.

    I don't think that "independent implementation" has ever been shown to reduce downtime. It may be useful in very specialized and controlled environments, like the Space Shuttle's redundant computers voting on results, yielding continuous uptime under majority rule, but even there the benefit of independent implementations was doubtful, according to some stats I can't find again.

  17. Anomalous Cowherd Silver badge
    Thumb Down

    @ Danny

    Completely agree - the concept of deliberately running a different version of the codebase on the DR system is bizarre. A bug is fixed, a new version released - if I understand this article correctly, the author would recommend running a version /known to be buggy/ on the DR system? I'm aware heterogeneity is broadly a good thing, but not in the middle of a disaster, thanks very much. I want an identical handover, no surprises or gotchas.

    Anyway, when was the last time software caused something like this? Hardware, that's where it always goes wrong. OK, except for the metric/imperial mixup on the Mars Climate Orbiter. But generally it's hardware...

  18. n
    Coat

    warning*

    *systems may go down as well as up.

  19. Anonymous Coward
    Boffin

    RE: Wrong wrong wrong

    The reason you have different versions is that when the DR site comes up it will be fed the same scenario as the live site; with identical software, you take down the recovery site as well.
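
    (To make that concrete, a deliberately simplified Python sketch: an identical primary and standby die on the same "poison" input, while an independently written standby survives it. Pure illustration - nothing here resembles real exchange code.)

    def parser_v1(msg):
        # the code path running on BOTH primary and identical standby:
        # it chokes on a zero quantity
        price, qty = msg.split(",")
        return float(price) / int(qty)          # ZeroDivisionError on "100,0"

    def parser_v2(msg):
        # independently written standby: handles the same edge case
        price, qty = msg.split(",")
        return None if int(qty) == 0 else float(price) / int(qty)

    poison = "100,0"
    for name, parser in [("primary (v1)", parser_v1),
                         ("identical standby (v1)", parser_v1),
                         ("diverse standby (v2)", parser_v2)]:
        try:
            print(name, "->", parser(poison))
        except ZeroDivisionError:
            print(name, "-> crashed on the same input")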

    Cisco systems are a good example of this: sometimes after a failure they get stuck in a loop electing a new leader, then automatically back off to avoid creating the loop, and the downtime grows with each cycle.

    I think you have confused service with systems.

    "To deliberately choose to run a different system in case there is some sort of bug, or error that could cause the design in question to fail is both impossible to do effectively, and not a function of service continuity, but of system design. This is important!"

    For example, Juniper or Cisco kit can operate with what the LSE are doing, and it gives me great comfort to have the two systems interleaving and bypassing each other. On failure of one, the other can still take over all the way to LINX, for example, or can be repaired while keeping the service running. The system is what lets people work; the service is what they provide to others.

  20. blair
    Thumb Up

    Hear hear

    Having worked on a project that is very close to the LSE's new trading platform, I would have to agree with most of Dominic's views.

    Accenture have the ability to convince senior executives that they know what they are doing because of their branding, thus forcing unnecessary hires of testing, support and development staff with little or no experience. Moreover, Accenture insisted on using .Net to develop most of the feed handlers. This made it easier to find developers but also had the effect of reducing performance and blighting the order matching system with .Net garbage collection issues. Yes, I know the system is supposed to be horizontally scalable, but what's the point if you can't scale up, since your non-horizontally-scalable parts won't without huge cost. Fortunately the LSE has been severing ties with Accenture over the last 2 or 3 years, although this has just meant buying back staff from Accenture that they had previously sold to them.

    I must admit I never met Ms Furse during my stint at the LSE. Nevertheless, it seems a touch arrogant for her to appear to ignore competition from the likes of Turquoise and Chi-X. Looking at the share price now, perhaps a takeover back in 2005/6 might not have been a bad idea?

    One slight error, Dominic: the LSE's primary site is not at St Paul's. The two main data centres are elsewhere, with the St Paul's site acting as part of a quorum.

  21. Anonymous Coward
    Happy

    Seems fair comment

    The DR should be held back in line with a working version of production. When production is updated and has settled, then the DR should be patched. In that scenario the DR is useful as a failover in ANY circumstance, not just "act of god" type problems. Having a business-critical system down for more than a few hours? No work being done? Seems to qualify as a disaster to me! You pay for DR, use it! Ideally use DR as an offsite storage and offline reporting and processing setup; then you know it works when you need to fire it up in anger.
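
    (Something like this toy promotion policy, say - the soak period is an invented figure, not anything the LSE used.)

    SOAK_DAYS = 14  # hypothetical settling period before DR gets patched

    def dr_target_version(prod_history, today):
        """prod_history: list of (version, deploy_day) in deployment order.
        Returns the newest production version live for at least SOAK_DAYS."""
        settled = [v for v, day in prod_history if today - day >= SOAK_DAYS]
        return settled[-1] if settled else None

    history = [("4.1", 0), ("4.2", 30), ("4.3", 38)]   # day numbers
    print(dr_target_version(history, today=40))        # "4.1": 4.2 hasn't settled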

    Sounds like the DR was not up to spec, and I can imagine the cry went up much like at places I have worked: "Sorry, but DR won't take the load and you never allocated any money or time to get it up to spec; it was all a token effort to please the auditors. I would concentrate on getting production back up ASAP!"

    I am not entirely sure of the full spread of kit used at the LSE, but from the discussions I have had with colleagues we are under the impression that it was a showcase setup for a single platform technology. No one in their right mind uses one platform for business-critical systems; the trendy term is "best of breed" - you investigate and try to ensure that you get a useful spread and mix of tech that plays well together. If something goes belly up, you only have to look in one place, not all over the bloody place, to try to find the glitch. You never run all the same make of router at your gateways; if one goes down or gets compromised, you know the other will keep you going and buy you time to find out what to fix. Run IIS to Oracle on Sun, or Apache on Sun to DB2 on IBM, whatever it takes to buy time if one goes out, especially when O/S patching, on ANY platform, is involved!

  22. Anonymous Coward
    Pirate

    At last - someone else pointing out the way things are.

    I've seen so many things messed up because the people who make the IT decisions are not qualified to make them. I could rant on about how reliable IBM's VM/MVS/VSE systems are, or how the NS here has tried to move off VMS for many years, spent millions on outsourcing it (a bankrupt concept), got nothing for it and has decided they need to do it in-house etc. etc. etc., but it'd just be a rant, so I'm off to write something trendy... what's next... oh drat - a Delphi 3 program. Sigh.

    PS El-Reg web-site designer: I can only see 25 chars in the title box - it scrolls even though there's lots of white-space on the rhs. And insert std. rant about the font-size and fixed-width here - we really should have an icon for it.

  23. Sceptical Bastard

    Oh, *that* LSE !

    And here's me wondering WTF this had to do with the London School of Economics!

    (I'd use the Paris icon if we still had the proper one)

  24. Anonymous Coward
    Stop

    Loads a Bull

    Loads a bull! It is not networks or Cisco, stupid headhunter. You must have been a bad techie, so you had to move on to be a headhunter.

  25. Anonymous Coward
    Coat

    Don't use IBM DR

    My DR site had a 12-hour outage a few weeks ago: IBM's Sampson House switched to generators due to local leccy issues. Then the gennies overheated and shut down, so no N+1. Then they couldn't handle the load, so the load was dropped.

    End result: whole building down for 12 hours. They should be sacked along with "Furse".

  26. Anonymous Coward
    Anonymous Coward

    Yes, important.

    The Stock Market's money movements generate more money than any of our manufacturing sectors, and the LSE handles more money than any building conglomerate in the UK. Yes, they're important to the current state of affairs. As you can see by the effect of their actions.

    Whether this is a Good Thing (tm) is eminently debatable.

  27. Wayland Sothcott

    It was not a screwup

    I think the writer is missing the point. The stock market was deliberately switched off because of Fannie Mae and Freddie Mac.

  28. Nick Hill
    Coat

    Same for a lot of companies

    Lots of senior execs don't understand that they rely totally on IT now.

    I work for an insurance company - and we are seriously creaking at the seams. Low morale and lack of balls (and understanding) by the IT Director means the business just doesn't understand what will happen when the systems fail. Who gets blamed? The poor person running that system day in, day out, who's been screaming for investment!

    This Icon looks like a pick-pocket....

  29. Arthur McGiven

    Clubable

    The real problem with British management on all levels is that it runs on the basis of knowing the right people. That is how they get the jobs and also how they do the jobs.

    The nearer one gets to the ranks of the great and the good, the more this rule applies, and, most disastrously for our future, the more one tends to regard technology/engineering as something that belongs round at the tradesman's entrance.

  30. n
    Alert

    taking the fail out of failover...

    She should apply for a job with IR.

    From IR's wiki:

    "Significantly, several IT initiatives are being phased in to better handle ticketing, freight, rolling stock (wagons), terminals, and rail traffic, including the use of Global Positioning System (GPS) and Microsoft (MS) Windows Vista for train tracking in real time."

    all aboard the 6.45 "meatgrinder express"!

  31. MarmiteToast
    Thumb Down

    Unimpressed

    I cannot agree with Dan Wilkinson more. This article is geek sensationalism, written by someone with a poor understanding of the issues at hand.

  32. Alex
    Flame

    Um.. What???!

    hang on.. You aren't seriously suggesting that you would use a different system at your DR site are you?? You HAVE to use the SAME systems at your DR site to provide continuity!!

    Now, if you were saying that testing of new software needs to be done off the live systems, then yes you are right (naturally) but have written it extremely poorly. Testing of new software should be done off network (if you have the means) and at the very least on a test system.

    Do not, under ANY circumstances, try to say that you should be using differing systems between your live and DR site...

    Am baffled by this, but can only say that you have (hopefully!!!) mis-written what you mean.

  33. Anonymous Coward
    Unhappy

    Why does everyone think "DR" means "site loss"?

    It means data corruption from stupid programmers, infrastructure lockout from bugs in firmware, negotiation problems or expired licenses. Plus you never have the tools to conclusively prove what the problem is, nor the leverage to force the vendors to investigate (they're like plumbers: "sorry mate, not our problem").

    And there is rarely a "smoking gun" for the problem - could be the OS, could be the database, could be the application. We had a 3 DAY outage on SAP caused by a bug in the update software; we had to hire our old consultants back as SAP refused to acknowledge it was their problem. (It was.)

  34. Anonymous Coward
    Anonymous Coward

    Redundant Systems

    Dan Wilkinson wrote:

    "To deliberately choose to run a different system in case there is some sort of bug, or error that could cause the design in question to fail is both impossible to do effectively, and not a function of service continuity, but of system design. This is important!"

    Rubbish. They do this all the time with military systems, particularly for flight-critical aircraft systems, both civilian and military.

    Given how much money is involved in the deals on the LSE, it would have been prudent to have designed the system with this capability.

  35. Ferry Boat

    Somewhere in the City

    I agree with Dan Wilkinson. You don't have two different versions. How do you know how they should differ? How do you know a previous version would cope any better? How do you test them? They should be in line. Having been through a similar situation, on a switch to DR we modified the inputs to the system to prevent the error - in a volume case, reducing the volume and increasing the time.
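
    (For example, a rough sketch of that throttled replay in Python - the batch size and pause are illustrative knobs only.)

    import time

    def replay_throttled(messages, batch_size=10, pause_s=0.1, process=print):
        """Feed messages to the recovered system in small batches with
        pauses, so the volume condition that hurt the primary isn't
        reproduced on the DR system."""
        for i in range(0, len(messages), batch_size):
            for msg in messages[i:i + batch_size]:
                process(msg)
            time.sleep(pause_s)

    replay_throttled([f"order-{n}" for n in range(25)], batch_size=5, pause_s=0.05)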

    @Sava Zxivanovich

    It would be interesting to know how much trade was switched to other exchanges. I know MiFID theoretically makes this possible but how many firms have everything in place to actually achieve it?

  36. Anonymous Coward
    Anonymous Coward

    IT Managers

    Nick Hill wrote:

    "IT Director (and understanding) means the business just doesnt understand what will happen when the systems fail. Who gets blamed, the poor person running that system day in day out that's been screaming for investment!"

    Unfortunately, a number of IT managers or IT project managers are IT managers because they don't have the technical expertise to do the techy work, so they become managers; it's easier for them. I've worked for a few like this.

    I currently work for a muppet (who admits he's not technical) who gets involved in every nook and cranny of the company (we're a small company), even giving advice on how to investigate and debug technical problems he knows nothing about. But he's the boss, so everyone does exactly what they're told.

    So they're not really up to understanding it; they should, however, employ decent technical design authorities who can do the system architecture well.

  37. Anonymous Coward
    Paris Hilton

    Rant but...

    Even though I cannot agree with most of what the author says (I do agree with Dan Wilkinson), there is an underlying truth to the author's rant.

    More often than not, the people making the big decisions don't have the foggiest, and having someone who's not tech-savvy making the decisions is bound to end in disaster.

    About Accenture, I can say that if you deal with them, ask for people from their group who belong to Avanade or similar and you'll get what you pay for (probably even more).

    Remember:

    No matter what any vendor tells you, it is you who are responsible; people in charge should always know what they are dealing with.

    Now where is Paris Hilton when you need her? She probably could run the place better than Furse. :P

  38. Jason Clery

    @AC

    "the LSE ran on resilient systems like VMS"

    Pah...

    I remember when the LSE was run on paper and an abacus.

  39. Ash

    @Alex

    I'm not in the sector, but I think I get what's happened.

    While you quite rightly say that the same system should be mirrored to the DR site, I believe the point the author was making was that it doesn't need to be the same kit. You can run one on Cisco routers and switches, and one on HP switches. One system can be run on Winblows, the other on Linux.

    The input and output would be the same for both systems; just the underlying infrastructure would differ, to prevent any issue with the OS / switch kit causing the failure of both sites.

    Again, I Am Not A Disaster Recovery Specialist; This is pure speculation.

  40. Anonymous Coward
    Pirate

    The Ignorance of City Institutions

    I used to work for a City Institution which shall remain nameless. The IT department used to be a relatively small percentage (20%) of the workforce, but grew to be something like 50%. Every single member of the board, and every member of the Executive Committee, was a banker or financial expert. Not one knew the first thing about IT.

    The CEO nearly had puppies at a company meeting in which one perspicacious employee described the company as a "software house", despite the fact that they were utterly reliant on IT and spent most of their time writing software, all without any hint of good architectural practice. The consequence was that they allegedly wasted a sum not unadjacent to £100M, mostly on consultants, in creating a new set of strategic systems that would never fly (not least because security was apparently viewed as an add-on rather than a deeply ingrained part of the system).

    And the arrogance of the City rolls on, undiminished ...

  41. Anonymous Coward
    Anonymous Coward

    Defence vs. Banking

    Anonymous Coward wrote:

    "Then of course I guess the LSE (and a few others whose entire business is dependent on the computers/networks working) don't have effectively three levels of hardware( different silicon/routing) running concurrently using three different software implementations developed by seperate teams working in isolation (only the system spec is common)."

    I did embedded systems work years ago, and those who can do it make better programmers, systems people and IT people than those who haven't.

    I've chatted to recent computer science graduates and they're not even taught assembly language programming (I understand why; undergrads these days can't be taught everything), but those who can program at a low level have a much better understanding of computing all round.

    Having worked in the defence sector and in Investment Banking, I can state my experience is that the technical people in defence are better technically than those in the banking sector.

  42. Anonymous Coward
    Anonymous Coward

    Disaster Recovery

    Is all about services.

    You don't need identical hardware nor identical software to implement DR.

    As long as your disaster recovered services provide the desired results everyone's a winner.

    Of course, if you use different hardware/software you reduce the chance that a hardware/software bug brings your DR solution down too, but it requires more work to ensure that you don't get compatibility-type problems.

    It seems technically bad that the LSE was out for so long, but without the full facts I can only summarise the impact as follows:

    No-one died, no-one was seriously injured, my mortgage has not gone up (or down), petrol is still bastard expensive but not mofo expensive and I had a good night's sleep last night.

  43. Dan Wilkinson

    @ Rotacyclic (& others)

    I think you misunderstood my comment; you appear to agree wholeheartedly, in your own words.

    I know that certain industries and companies have the requirement, and indeed the sense, to use complex methods to ensure that component failure cannot affect the system as a whole. They may make use of straightforward redundancy, or the disparate redundancy mentioned by a few other posters, where (to take a previously mentioned idea, for example) you don't only have Cisco routers etc., but a mixture from separate manufacturers that provide the same functionality. As you say yourself, for some areas, maybe including the LSE, "it would be prudent to have the system designed with this capability". Exactly what I said: that is part of system design, not disaster recovery. Maybe I should have used the "DR" wording, rather than Service Continuity.

    The point is that this "requirement" is built into the system as a whole, and not only as part of your DR/SC requirements. It is THE system design. If you need this level of protection, then BOTH your production site AND your DR site will have a mixture of (again, for example) Cisco/HP/IBM switches. Your design used a mixed environment, and your DR site should mirror that EXACTLY. It's no use using Cisco at your production site and HP at your DR site - that is poor design.

    Your DR systems are there to replicate your production environment in the event of its failure; they are not there to provide redundancy that should be present in production in the first place, if it is so important.

  44. Anonymous Coward
    Anonymous Coward

    I'm amazed that they bought into Windows full stop

    It's fine for smaller systems, but when we want serious performance (and we're talking millions of transactions per hour here) the argument is between Sun Fire E25Ks and IBM P595s. Wintel boxes lack the error correction and redundancy of these machines, while simply not scaling to anything like their performance.

    Lots of small boxes are great in theory, but in practice it's extra complexity to go wrong and more problems in replicating the environment at your CoB site.

  45. Alex

    @Ash

    Ah-so. Indeed, I agree. We use physical hardware with new Cisco kit at the live site, and a virtualised environment with slightly older Cisco kit at the DR site. We use the DR site as a high availability system too, so if server hardware were to drop, the system at the DR site would be called into action.

    I'm not 100% sure about mixing OSes and flavours of packages, as there is a big risk.... IMHO

    I suppose the way you've described it is very plausible. I'm just not sure about the author, but I'll hold off passing judgment.

  46. Mark
    Paris Hilton

    @Lee

    If there was so little at issue in the outage, why are the executives paid a lot? Only people with important jobs get paid lots, and if your system can be missed for a day with nothing untoward happening, why not save a few million each year and pay this bunch of monkeys peanuts?

  47. Anonymous Coward
    Boffin

    I have seen it all before

    I was once involved in troubleshooting a network outage at the LSE when I worked for a company that supported them. The "consultant" at the LSE had called us to say that the network was in meltdown. One of my junior colleagues had taken the call, and the LSE chap was screaming so loudly that I could hear him through the phone earpiece even though I was about 5 metres away. I took the phone from my colleague, introduced myself as a senior engineer and asked what was the matter. He said that the whole network was down and kept screaming "when is an engineer going to get here?". I told him that an engineer was on his way, but the fellow just kept on panicking and asking for the ETA of the engineer.

    I then told him: "What you need to do is calm down and then take a walk around the building, inspecting all of the key components of the network." He agreed to do so, so he hung up, and 10 minutes later my colleague got a call to say we should cancel the engineer visit as he had "fixed the problem". He did not leave any explanation as to what the problem was; he just stated that he had managed to fix it. I could not rest until I knew what had been done, so I called him to find out. He said that he had walked into a comms room and found a network device that was continuously rebooting. He switched it off and everything started working. He took the credit for fixing the problem and did not say thank you for our assistance. I am therefore not surprised to hear about outages on their network if they hire people like this.

  48. George Capehart
    Paris Hilton

    What GRC?

    Just one more example of the total lack of awareness of governance and operational risk management in business. And financial services seems to lead the pack in spite of all of the regulatory activity directed at it. The Peter Principle is alive and well at the C*O and Board levels . . .

  49. Anonymous Coward
    Anonymous Coward

    IT engineering chasm

    The disagreement about the impossibility of running differently designed and implemented main and DR systems seems to be between IT people and engineers. This obviously is possible, but expensive to develop, test and maintain. Every project I work on has a risk analysis where we estimate the probability and impact of every failure we can think of. We then design measures to make the probability or impact acceptable.
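
    (A toy version of that risk sheet, with invented numbers, just to show the arithmetic: rank failure modes by probability times impact and flag anything above your acceptable expected loss.)

    failure_modes = [
        # (name, probability per year, impact in pounds) - all made up
        ("site loss (fire/flood)",       0.01, 50_000_000),
        ("common software bug",          0.20, 20_000_000),
        ("hardware fault, failover ok",  0.50,    100_000),
    ]
    THRESHOLD = 1_000_000  # hypothetical acceptable expected annual loss

    for name, p, impact in sorted(failure_modes, key=lambda m: -m[1] * m[2]):
        expected = p * impact
        verdict = "needs a measure" if expected > THRESHOLD else "acceptable"
        print(f"{name:30s} expected loss {expected:>12,.0f}  {verdict}")

    (On these made-up numbers the common software bug dominates, which is rather the point.)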

    Design faults are a likely cause of failure, and the DR system will probably also fail given the same inputs. The only way to reduce this risk is to have independently developed systems. Running different versions of the same system, as in the article, IS a strange idea; it gives little or no protection, as a yet undiscovered bug is probably in both.

    It may be that this was considered and the risk/cost trade-off was deemed acceptable. Financial organisations seem to accept risks of major catastrophic financial system failures every 50-100 years. It would be strange if they designed computer systems with higher resiliency.

  50. Adrian Waterworth
    Stop

    Anyone surprised?

    Anyone who has worked on large-scale IT projects in the public or commercial sectors will have seen this time and time again. Senior management largely drawn from the ranks of marketing, sales and accountancy. If you're lucky (very lucky!) some of them might know enough and be honest enough to realise that they need advice and expertise from the technical staff, but that's not particularly common.

    Of course, the IT industry itself is partly to blame. Almost all major projects are ridiculously oversold on a slippery mixture of snake oil and bullshit. That's why massive schedule and cost overruns are the norm rather than the exception. Unfortunately, as long as there's even one major supplier out there who will promise the world at half price by next Wednesday, everyone has to play the same game. So you end up with a bunch of salesmen and accountants having the wool pulled firmly over their eyes by another bunch of salesmen and accountants while the poor buggers who actually have to design and implement their badly-specified and insane pipe-dreams look on in a mixture of despair, resignation and mute fury.

    Been there, seen too much of it. That's why I left "big IT" a couple of years ago - it finally reached the stage where I couldn't ignore my moral and ethical misgivings about the whole thing. Now I just wait for the day when something goes sufficiently wrong somewhere that someone big finally says they've had enough and sues one or more of their suppliers into near oblivion.

    OK, so that isn't particularly likely, but it's going to need something on that scale to make the IT business grow up and get its collective act together.

    (P.S. Nice to see that comments are working now. I originally tried to post this one sometime on Sunday, but in spite of all the above waffle, the comment system still insisted that I had only submitted a title and no actual comment text. Oops! On an article about IT cockups too...)

  51. Dominic (The Pimp) Connor

    The author responds

    Yes, I think a DR site should cope with s/w bugs. The "D" in DR stands for "Disaster", be that s/w, h/w, fire, flood or hiring Accenture.

    I apologize for not including everyone's favourite O/S in my list. I would have included them all, but it would have been boring. As it happens, I myself chose HP-UX. I suppose I'll get slagged for that as well. My *point* was that nothing is 100% trustworthy; it is the quality of people and their management that makes it good enough, or not.

    Since there was no IT on the board, we already have a major point of failure. Perhaps the new incumbent should learn to play golf so his betters might meet him occasionally?

    As more than one person has said, systems that are life-critical often use the multiple-version approach, allowing them to get past show-stopper bugs quickly. I even know of bank systems where the version compiled with debug code left in is the "hot" standby.

    There's a tough judgment call on whether you fail over to a different version or not. But it ought to be available, no matter how much an accountant from the media thinks the money could better be spent on more auditors from his previous employer.
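
    (Spelled out as a sketch, that judgment call looks something like this - the classification rule is invented, and in reality a human makes the final decision.)

    def pick_standby(failure_kind, identical_standby, diverse_standby):
        """Site/hardware disasters: the identical standby is fine, the cause
        was physical. Software-looking failures: the same build would
        probably die the same way, so offer the differently built standby."""
        if failure_kind in ("site_loss", "hardware", "power"):
            return identical_standby
        if failure_kind in ("software_crash", "poison_input"):
            return diverse_standby
        raise ValueError("unclassified failure: escalate to a human")

    print(pick_standby("software_crash", "standby-same-build", "standby-alt-build"))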

    As for conflict of interest...

    I think it is safe to assume that I will not be asked to find the replacement for whoever Ms. Furse offers up as a scapegoat. Yes, I know some people who could do the job, but frankly I can get more money sending them to a firm run by grown ups.

    I like the conspiracy idea, and yes it does fit the facts worryingly well. The period where only certain firms could trade is particularly interesting to me. I'm not saying that anyone behaved improperly, but the traders on the sweet side of that are probably the happiest people in the City at the moment. No, that's not hard.

    The FSA *ought* to investigate the LSE, and quite properly they won't comment. But let us not kid ourselves. The FSA does not have the ability to do this, and any "investigation" will have the objective of "restoring confidence". The report will talk of "lessons learned" and "new opportunities to move forward", since they would never dream of upsetting their golfing chums at the LSE.

    I know what people at the FSA earn, and you can't get good people for that; indeed, a former CEO of theirs went on the record to say so. They will get in PwC, who will do a Hutton-style report for them instead.

  52. Martin Smith
    IT Angle

    IT needs a military style Special Forces

    Generally the larger the client, the larger the service provider/consultancy hired. Large organisations are inherently inefficient. More people, more bureaucracy, more politics, more screw-ups, more waste.

    In military operations, have you ever wondered why the SAS/SBS are so bloody good at what they do? Small teams of highly competent, highly skilled individuals, trained to work efficiently as a unit, while at the same time able to function independently if cut-off from the chain of command to complete the mission if necessary.

    UK IT sorely needs an equivalent of the Special Forces.

    You gather the best, send them on an SAS-style IT assessment course, weed out the dross by putting them through various scenarios to test their abilities in all relevant areas, including leadership and business knowledge, and induct the top 1% into your elite unit.

    When a major outage occurs, like the one at the LSE, where critical national interests are affected, the IT special forces units are sent in to perform 'surgery'.

    They are impartial, effective, apolitical and brutally frank, seeking only to achieve mission goals and get out. If they recommend that a person/team/service provider be fired, they are, without question.

    When not 'deployed' they are constantly training, exhausting all scenarios and learning the latest techniques and technologies.

  53. Anonymous Coward
    Thumb Down

    It's not about "favourite" OSes

    Look, sunshine, it's not about any individual's "favourite" OSes. It's about using appropriate tools and materials for any given job. If you were in the business of building bridges and insisted on cast iron for all of them, and not wood, or concrete, or any of the other possibilities, you'd not be in the business for long. The same should go for systems houses who only have one material in stock and one tool in their toolkit.

    For many years the stock markets of the world have relied on various combinations of DEC VMS and Tandem NonStop kit, and mostly they've done OK. Your list of (your favourite?) OSes included OS/2 (irrelevant to this market sector as well as all the rest), but you chose not to mention VMS or NonStop, the two most relevant OSes in this picture. Consequently you FAIL.

  54. amanfromMars Silver badge

    Re IT needs a military style Special Forces

    By Martin Smith Posted Tuesday 16th September 2008 09:04 GMT.

    Do you Suppose there are such ESPecialised Forces, Martin, already dDeeply Embedded and Deployed. Virgin HyperRadioProActive. ...... ManIQ Creative, in an AI Program?

    Which is Matrix Nexus Territory .... in AESThetan Lands with Pirate Satellites in Ninja Robes.

    IT used to Call it the Great Game. ....... but IT aint Simply a Game when IT is Real.

    Play Responsibly [Truthfully at All Times] and Wisdom will Accrue Swift Beta IntelAIgents for All Future Needs.

    And all that Largesse from Something so Simply Started and Maintained. Hugs and KISSes PreDate Power and are Probably Original Source Specification Drivers..... and that Makes for Interesting Array of Live XSSXXXXual Fields for Trialling and Trailling. The Old Double Trap and Shock to the Head ReVamp in Living Tribute to Charles de Menezes. An Innocent Taken and Lost in Paranoia/Civil Madness.

    Anbody Know any Good Shrinks? Or is Big Picture Thinking Out Loud a Simply Better Beta Plan/Platform for AI Creative Therapy.

    Or is that too Crazies into Cuckoo Nests/Clock Network Orange Enabling for Mainstream Monitoring and Mentoring...... Popular Manic OverSight which is a Pivotal Position with Definite Thoroughly Deserved Consequence in Dire Performance?

  55. Dominic (The Pimp) Connor

    The writer responds (again)

    I fear the Anonymous Coward who seems bent on attacking me over OS choice is missing the point big time.

    I've worked on Tandem gear as well as Stratus, and yes, they did deliver good reliability, but I've had both die on me. I will admit up front the Stratus issue was my fault, underlining the real issue: you can't buy reliability shrink-wrapped, you have to go deep. Nothing substitutes for the quality of the staff and their management.

    You also failed to read my piece; please do so again. OS/2 was part of the LSE strategy for some years. Gone now, of course, but you don't see a lot of VMS around these days either. Things change.

    I will type the next bit slowly because you failed to get it the first two times I wrote it.

    I did not list all OSes because there is a long list. If you want to have a pissing contest on who can name the largest number of OSes, you may beat me, or not. I don't care. My point, which again I will type slowly, is that no OS / HW combination will give you the necessary level of reliability unless you set them up right, program them properly, and test and retest to ensure you identify failure modes. Then test some more.

