University College London hospitals trust (UCLH) has launched an investigation after a network glitch led to the closure of A&E to blue light traffic. The problem also led to cancellations of operations. The trust was last month forced to halt a number of services, including the cancellation of 50 per cent of its operations, due …
difficult to track down?
Spanning tree strikes again...
Indeed: Statseeker or JFFNMS anyone?
Perhaps the highly paid external "IT experts" from Logica (oh...quelle surprise!) were too busy examining the insides of their eyelids to check their e-mail.
I mean these "expert" private sector consultants wouldn't have installed such a network without the most basic of SNMP monitoring tools. Shirley!
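To be fair, a full SNMP poller needs an agent and a MIB to walk, but the core loop those monitoring tools (Statseeker, JFFNMS, take your pick) are built around is trivial. A toy sketch in Python, using a plain TCP connect to the management port as a stand-in for a real SNMP sysUpTime poll — the hostnames, addresses and ports below are all invented:

```python
# Crude stand-in for what Statseeker/JFFNMS-style polling boils down to:
# periodically check each management address and shout when one stops
# answering. Real tools use SNMP (sysUpTime, ifOperStatus); this toy
# version just attempts a TCP connection to the management port.
import socket

def is_reachable(host, port, timeout=1.0):
    """True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical inventory -- a real deployment would pull this from a CMDB.
switches = {
    "core-sw-1":  ("192.0.2.10", 22),
    "ae-edge-sw": ("192.0.2.11", 22),
}

for name, (addr, port) in switches.items():
    status = "up" if is_reachable(addr, port) else "DOWN - raise an alert"
    print(f"{name}: {status}")
```

Wrap that in a cron job and an email and you already have more monitoring than the article suggests UCLH had.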
Yes. As an Ex-Logica (Logician? Ech...) guy, that sounds about right. And any failures are jotted down in a 'lessons learned' document, which gets lost - er, archived - in the black hole. I mean, the Logica network.
@spanning tree problems
Oh please... the spanning tree protocol is there to _help_. Many people think it's a bad thing, but it's not. Many people configure their network wrong and blame spanning-tree for the ensuing problems.
You could be right about this being a spanning-tree related problem, i.e. a problem created or exacerbated by wrong STP configuration, but STP itself is not at fault. The problem lies with the (highly paid?) incompetent "consultants" who design and operate the network.
It sure does sound like they need a redesign: A network with a single point of failure, where said SPoF can disable the whole network and where fault isolation takes several hours. Amateurs! :-)
that was my point...
... configure spanning tree correctly and it's great. Configure it incorrectly (or not at all) and heaven help you when it goes wrong.
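For what it's worth, "configured correctly" on a Cisco shop mostly means deciding where the root bridge lives instead of letting the election pick some random access switch, and guarding the edge ports. A sketch of the usual hygiene — VLAN range and interface numbers are made up:

```
! On the intended core switch: win the root election deliberately
spanning-tree mode rapid-pvst
spanning-tree vlan 1-4094 root primary
!
! On access ports facing PCs and printers: skip listening/learning,
! and err the port if a switch (i.e. a BPDU) ever shows up on it
interface range GigabitEthernet1/0/1 - 48
 spanning-tree portfast
 spanning-tree bpduguard enable
```

Do none of that and, as you say, heaven help you.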
Paris, because she has a CCNA.
"A full investigation into the network design and components is being undertaken to verify if there are any design issues to be addressed."
There is clearly a nasty single point of failure here. I am going to stick my neck out and suggest that it isn't the only switch which could have gone pop (as they had to systematically close off the network).
Is this stuff not monitored?
Who the hell designed the network resilience? Christ, the NHS pay through the nose for consultancy fees, yet they still get crap service.
Having worked in the NHS...
it probably wasn't a consultant who designed the network. In fact, the network was likely never designed at all. A room had computers put in, and so cables were run to it. As a switch fills up, another one is added on to expand capacity. If anyone mentions redundancy, it probably goes something like this:
"We should probably get a second switch for resilience."
"What do you mean?"
"Well if this switch breaks..."
<irate>"Why would it break? Have you recommended the wrong thing? We have paid enough for it, why would it fail?"
<techie mentally weighs the likelihood of this ending well> "...no, it's fine. Forget it"
Management is obsessed with avoiding blame. If they hear that there is any risk *at all* in doing something, they simply won't do it. If it will cost money, they won't do it for fear of blame over the budgets.
Again, IT is seen as a cost-centre, rather than a system that enables people to do their jobs. So everything is done cheap, crap and quiet.
@Peter Jones 2
You might very well be right that the current operator (Logica) probably isn't directly responsible for the initial sorry state of the network. But _anyone_ assuming responsibility for a network _has_ to observe some kind of "due diligence". If Logica (or whoever) is willing to take the money, they implicitly also accept the blame.
Strange use of English...
Call me a fool but I wouldn't classify having to cancel all operations and divert ambulances to other (possibly farther away) hospitals as 'business as usual'. Their 'business continuity plan' patently wasn't anything of the sort. Perish the thought that anyone in management will take responsibility, though; some lowly network techie will walk the plank and nothing else will change.
As for the NHS and consultancy, just wait till Dishy David and Curious George get through privatising it...
business as usual
"Call me a fool but I wouldn't classify having to cancel all operations and divert ambulances to other (possibly farther away) hospitals as 'business as usual'."
[fake innocence of an MBA]
What do you mean? The management were still there, carrying on with their usual business. There may have been fewer patients to bother about, but nothing that stopped any of the normal business of the managers. The same thing can happen with bad weather.
business as usual
Even better: with fewer emergencies disrupting the smooth operation of the Hospital, performance targets will be easier to hit and Management Bonuses secured.
Nothing that stopped any of the normal business of the managers
Unless they couldn't summon a porter to carry the golf clubs to the Bentley
Someone made redundant when they outsourced to Logica?
Someone may have left something unpleasant in the stored configuration which only came into effect when the device was rebooted. Spanning tree or VTP settings perhaps.
Well ffs... the people taking over the responsibility for the network (and taking the money) are obligated to look for problems. I only know Cisco and cannot speak wisely about other vendors, but everything is in the configuration. You can't "hide" stuff as such.
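On Cisco kit, at least, that due-diligence step is cheap: everything a reboot will load is sitting in the startup configuration, so "something unpleasant in the stored configuration" should fall out of a startup-vs-running diff. A sketch, assuming IOS with the Config Diff feature available:

```
! Compare what the box is running now with what it will boot into
show archive config differences nvram:startup-config system:running-config
!
! And audit the obvious reboot-sensitive bits directly
show spanning-tree root
show vtp status
```

Anyone taking over a network and not running something like that on day one has no business invoicing for it.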
24 hours to fix?
Pretty quick for Logica I guess.
Maybe they will think more carefully about which bits are critical systems in future, and not outsource everything to a bunch of monkeys.
Excellence in NHS IT Delivery and Operation
This is obviously why all NHS IT jobs state that you must have previous NHS experience.
Lord forbid that they risk actually getting somebody on staff with a track record of successful delivery and operation of mission-critical systems in geographically dispersed locations, vendor management, testing, etc.
How many NHS medical facilities do you think we have that have two separate landlines at opposite ends of the building from different suppliers? (*)
(*) as in something better than having all the staff at a hospital locked out because THE (as in single) authentication server was offline, and wouldn't be fixed until Monday morning (this hospital has a minor injuries unit, plus a neurological ward)
24 hours to fix a network.
What would it take to make the repair process *that* slow?
1) No up-to-date logical network map linked to a physical location map, so no one is sure how the data gets from A to B.
2) No remote monitoring of critical network devices, so someone has to go out there and *look* at a front panel (possibly reporting back what they see to someone in a network admin office). Not a good idea when hospitals tend to be big and hardware tends to be stuffed in locked cupboards blocked by heavy equipment or the odd dying patient on a trolley.
3) No on site spares to do the replacement.
4) A network vulnerable to *multiple* single points of failure (which *might* have been picked up if 1 had existed and a competent person had looked at it), so you know the *whole* network's down but you don't know why.
5) Written authority required from some senior management type who *absolutely* must sign off any drastic action (although they won't understand what it does even if *explained* to them) who is naturally out of contact, probably at a conference on improving network reliability.
I'm not a network admin, so I'm sure you guys who do this for your day job (one or two of whom, I'll bet, work in the NHS) can find plenty more ways to turn what I would think should be no more than a one-hour task (that includes getting the replacement box to where it has to be and swapping the plugs) into a *minimum* of a ten-hour job (when they say normal ambulance service resumed).
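Point 4 isn't even hard to check in advance: if someone had so much as a rough adjacency list of the switch topology, finding single points of failure is a textbook graph exercise (articulation points). A sketch in Python — the device names and topology are invented, and any real map would come from discovery, not a hand-typed dict:

```python
# Find switches whose failure partitions the network: the articulation
# points of the topology graph (classic DFS low-link method).

def articulation_points(graph):
    """Return the set of nodes whose removal disconnects the graph.

    graph: dict mapping node -> list of neighbour nodes (undirected).
    """
    visited, disc, low, parent, aps = set(), {}, {}, {}, set()
    timer = [0]

    def dfs(u):
        visited.add(u)
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for v in graph[u]:
            if v not in visited:
                parent[v] = u
                children += 1
                dfs(v)
                low[u] = min(low[u], low[v])
                # Root of the DFS tree: cut vertex iff it has >1 subtree
                if parent.get(u) is None and children > 1:
                    aps.add(u)
                # Non-root: cut vertex iff some subtree can't bypass it
                if parent.get(u) is not None and low[v] >= disc[u]:
                    aps.add(u)
            elif v != parent.get(u):
                low[u] = min(low[u], disc[v])

    for node in graph:
        if node not in visited:
            dfs(node)
    return aps

# Hypothetical hospital topology: everything hangs off one core switch.
topology = {
    "core-sw":   ["ward-a-sw", "ward-b-sw", "ae-sw"],
    "ward-a-sw": ["core-sw"],
    "ward-b-sw": ["core-sw"],
    "ae-sw":     ["core-sw"],
}
print(sorted(articulation_points(topology)))  # -> ['core-sw']
```

Every name that comes out of that list is a box that takes the whole site down when it goes pop, which is exactly the review nobody at UCLH appears to have done.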