difficult to track down?
Spanning tree strikes again...
University College London hospitals trust (UCLH) has launched an investigation after a network glitch led to the closure of A&E to blue light traffic. The problem also led to cancellations of operations. The trust was last month forced to halt a number of services, including the cancellation of 50 per cent of its operations, …
Perhaps the highly paid external "IT experts" from Logica (oh...quelle surprise!) were too busy examining the insides of their eyelids to check their e-mail.
I mean these "expert" private sector consultants wouldn't have installed such a network without the most basic of SNMP monitoring tools. Shirley!
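And "the most basic" really does mean basic. A toy sketch of the idea (device names, interface names, and statuses all invented for illustration; a real setup would poll each switch's ifOperStatus over SNMP with pysnmp, Nagios, or the like, rather than read a dict):

```python
# Toy stand-in for "the most basic of SNMP monitoring tools".
# In real life you'd walk IF-MIB's ifOperStatus (OID 1.3.6.1.2.1.2.2.1.8)
# on each device; here the poll results arrive as a plain dict so the
# sketch runs anywhere.

def find_down_ports(poll_results):
    """Return (device, interface, status) triples for anything not 'up'.

    poll_results maps device name -> {interface: status}, where status
    mimics the values an SNMP walk of ifOperStatus would return.
    """
    alerts = []
    for device, interfaces in sorted(poll_results.items()):
        for ifname, status in sorted(interfaces.items()):
            if status != "up":
                alerts.append((device, ifname, status))
    return alerts

polled = {
    "core-sw1": {"Gi0/1": "up", "Gi0/2": "down"},
    "edge-sw7": {"Fa0/1": "up"},
}
alerts = find_down_ports(polled)  # [("core-sw1", "Gi0/2", "down")]
```

Run that every few minutes and page someone on a non-empty result, and you already know a switch has gone pop before the ambulances start queuing.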
Oh please... the spanning tree protocol is there to _help_. Many people think it's a bad thing, but it's not. Many people configure their network wrong and blame spanning-tree for the ensuing problems.
You could be right about this being a spanning-tree related problem, i.e. a problem created or exacerbated by wrong STP configuration, but STP is not at fault. The problem lies with the (highly paid?) incompetent "consultants" that design and operate the network.
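For anyone who hasn't met it: STP's whole job is to take a cabled-in loop and block the redundant leg so frames can't circulate forever. A deliberately simplified sketch of that effect (no bridge priorities, port costs, or BPDU exchange; the root bridge is just handed in, where real STP elects one):

```python
from collections import deque

def spanning_tree(links, root):
    """Return (kept, blocked) links, STP-style, for an undirected topology.

    Real STP elects a root bridge and exchanges BPDUs; here we take
    `root` as given and do a plain BFS, which has the same net effect:
    redundant links are left out ("blocked") so frames can't loop.
    """
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    kept, seen, queue = [], {root}, deque([root])
    while queue:
        node = queue.popleft()
        for nbr in sorted(adj[node]):
            if nbr not in seen:
                seen.add(nbr)
                kept.append((node, nbr))
                queue.append(nbr)
    blocked = [l for l in links
               if l not in kept and (l[1], l[0]) not in kept]
    return kept, blocked

# Three switches cabled in a triangle: a broadcast storm waiting to
# happen, unless one leg is blocked.
links = [("sw1", "sw2"), ("sw2", "sw3"), ("sw1", "sw3")]
kept, blocked = spanning_tree(links, "sw1")
# kept = [("sw1", "sw2"), ("sw1", "sw3")], blocked = [("sw2", "sw3")]
```

The blocked leg isn't wasted: if a kept link dies, the protocol unblocks it. Which is exactly the resilience people throw away when they misconfigure it and then blame the protocol.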
It sure does sound like they need a redesign: A network with a single point of failure, where said SPoF can disable the whole network and where fault isolation takes several hours. Amateurs! :-)
"A full investigation into the network design and components is being undertaken to verify if there are any design issues to be addressed."
There is clearly a nasty single point of failure here. I am going to stick my neck out and suggest that it isn't the only switch which could have gone pop (as they had to systematically close off the network).
Is this stuff not monitored?
It probably wasn't a consultant who designed the network. In fact, the network was likely never designed at all. A room had computers put in, and so cables were run to it. As a switch fills up, another one is added on to expand capacity. If anyone mentions redundancy, it probably goes something like this:
"We should probably get a second switch for resilience."
"What do you mean?"
"Well if this switch breaks..."
<irate>"Why would it break? Have you recommended the wrong thing? We have paid enough for it, why would it fail?"
<techie mentally weighs the likelihood of this ending well> "...no, it's fine. Forget it"
Management is obsessed with avoiding blame. If they hear that there is any risk *at all* in doing something, they simply won't do it. If it will cost money, they won't do it for fear of blame over the budgets.
Again, IT is seen as a cost-centre, rather than a system that enables people to do their jobs. So everything is done cheap, crap and quiet.
You might very well be right about the current operator (Logica) probably not being directly responsible for the initial sorry state of the network. But _anyone_ assuming responsibility for a network _has_ to observe some kind of "due diligence". If Logica (or whoever) is willing to take the money, they implicitly also accept taking the blame.
Call me a fool but I wouldn't classify having to cancel all operations and divert ambulances to other (possibly farther away) hospitals as 'business as usual'. Their 'business continuity plan' patently wasn't anything of the sort. Perish the possibility someone in management will take responsibility though, some lowly network techy will walk the plank but nothing else will change.
As for the NHS and consultancy, just wait till Dishy David and Curious George get through privatising it...
Jon
"Call me a fool but I wouldn't classify having to cancel all operations and divert ambulances to other (possibly farther away) hospitals as 'business as usual'."
[fake innocence of an MBA]
What do you mean? The management were still there carrying on with their usual business. There may have been fewer patients to bother about, but nothing that stopped any of the normal business of the managers. The same thing can happen with bad weather.
This is obviously why all NHS IT jobs state that you must have previous NHS experience.
Lord forbid that they risk actually getting somebody on staff with a track record of successful delivery and operation of mission-critical systems in geographically dispersed locations, vendor management, testing, etc.
How many NHS medical facilities do you think we have that have two separate landlines at opposite ends of the building from different suppliers? (*)
(*) as in something better than having all the staff at a hospital locked out because THE (as in single) authentication server was offline, and wouldn't be fixed until Monday morning (this hospital has a minor injuries unit, plus a neurological ward)
What would it take to make the repair process *that* slow?
1) No up-to-date logical network map linked to a physical location map, so nobody is sure how the data gets from A to B.
2) No remote monitoring of critical network devices, so someone has to go out there and *look* at a front panel (possibly reporting back what they see to someone in a network admin office). Not a good idea when hospitals tend to be big and hardware tends to be stuffed in locked cupboards blocked by heavy equipment or the odd dying patient on a trolley.
3) No on site spares to do the replacement.
4) A network vulnerable to *multiple* single point failures (which *might* have been picked up if 1 had existed and a competent person had looked at it), so you know the *whole* network's down but you don't know why.
5) Written authority required from some senior management type who *absolutely* must sign off any drastic action (although they won't understand what it does even if *explained* to them) who is naturally out of contact, probably at a conference on improving network reliability.
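Point 4 is the one that a competent person with the map from point 1 could have caught in minutes, because finding every single point of failure in a network map is a textbook graph exercise (articulation points). A rough sketch, with switch names invented for illustration:

```python
def single_points_of_failure(links):
    """Find nodes whose failure splits the network (articulation points).

    Standard DFS low-link method on an undirected graph. Switch names
    used below are made up for illustration.
    """
    adj = {}
    for a, b in links:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    disc, low, spofs = {}, {}, set()
    timer = [0]

    def dfs(node, parent):
        disc[node] = low[node] = timer[0]
        timer[0] += 1
        children = 0
        for nbr in adj[node]:
            if nbr == parent:
                continue
            if nbr in disc:                      # back-edge: part of a loop
                low[node] = min(low[node], disc[nbr])
            else:
                children += 1
                dfs(nbr, node)
                low[node] = min(low[node], low[nbr])
                # Subtree under nbr has no path back above node,
                # so losing node strands that subtree.
                if parent is not None and low[nbr] >= disc[node]:
                    spofs.add(node)
        if parent is None and children > 1:
            spofs.add(node)

    for node in adj:
        if node not in disc:
            dfs(node, None)
    return spofs

# A daisy-chained network: every switch in the middle is a SPoF.
links = [("core", "ward-sw"), ("ward-sw", "a-and-e-sw"),
         ("a-and-e-sw", "theatre-sw")]
spofs = single_points_of_failure(links)  # {"ward-sw", "a-and-e-sw"}
```

Add one link to close the chain into a ring and the set comes back empty, which is rather the point of paying for redundancy.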
I'm not a network admin, so I'm sure you guys who do this for your day job (one or two of whom I'll bet work in the NHS) can find plenty more ways to turn what I would think should be no more than a one-hour task (and that includes getting the replacement box to where it has to be and swapping the plugs) into a *minimum* of a ten-hour job (going by when they say normal ambulance service resumed).