How UK air traffic control system was caught asleep on the job

A big outage that struck Britain's air traffic control system on Saturday was due to a technical fault with a touch screen interface provided by Frequentis, The Register has learned. On Saturday 7 December, during the run-up to one of the busiest times of the year for the UK's airports, controllers at NATS (National Air …

COMMENTS

This topic is closed for new posts.
  1. Dr Who

    One million lines of code

    That's an interesting defence. "Look mate, this system is huge. It cost loads and loads of money. It's so complicated that my head spins just thinking about it. So, when it fails I want it to fail big! No trivial little glitches that nobody even notices for me - oh no. Ask the banks, they understand. If you've paid for serious software then you want to see serious failures. I want my money's worth."

    1. Peter Simpson 1
      Facepalm

      Re: One million lines of code

      Frequentis: a division of CGI Federal?

      You'd think, when updating software in something as big and important as the touch screen interface on an air traffic control system, that the software company concerned might do a bit more than the normal "yup, looks OK to me and it's out the door" variety of testing...

      ...but, apparently, you'd be wrong.

      1. Anonymous Coward
        Anonymous Coward

        Re: One million lines of code

        They may well have used the same rigorous testing procedures as RBS.

      2. bigtimehustler

        Re: One million lines of code

        Did anyone say they did actually do an upgrade? They seem to be suggesting it's worked fine for years and just stopped randomly, rather than suggesting an upgrade went wrong?

      3. PassingStrange

        Re: One million lines of code

        I was a software tester on one of IBM's flagship mainframe products for 20 years. And I can tell you, anyone who knows a way of catching every glitch in complex software before it goes live (yes, even the ones that bring the customer's business-critical systems crashing to a very visible, embarrassing and expensive halt), that is simple and robust enough to be used in practice in widely diverse environments and by development teams using different approaches and tools - AND simple enough to be understood by non-technical management, so that they won't simply throw out what works two years later in favour of the latest "flavour of the month", in order to be perceived to be "managing" - REALLY needs to get themselves some good marketing and legal support, because they're in line to make a LOT of money.

        1. john 103

          Re: One million lines of code

          RE: PassingStrange

          One tip might be to use more than one sentence when writing the Functional Spec!

    2. Anonymous Coward
      Anonymous Coward

      Re: One million lines of code

      So, it's twice as good as this: http://dilbert.com/fast/2003-08-26/

      1. Yet Another Anonymous coward Silver badge

        Re: One million lines of code

        Or http://dilbert.com/strips/comic/1996-01-31/

        Or more likely http://dilbert.com/strips/comic/1996-02-01/

        1. Nifty Silver badge

          Re: One million lines of code

          My vote goes to

          http://dilbert.com/strips/comic/1996-01-31

    3. Linker3000
      FAIL

      Re: One million lines of code

      Reminds me of my time working for a large UK veterinary group - one of our patient/client management apps couldn't work out birth dates or ages correctly - you'd register a new pet and enter the birth date given by the customer and the software would fill in the approximate age field (or you could fill in the approximate age and it would work out the rough birth date).

      Trouble was, you'd enter something like 'Poochy', born 9-12-2005, and according to the software, the poor dog would be something like 402 years old!

      After some email correspondence with the support team, their conclusion was 'date handling is complex'.

      It was never fixed. We eventually kicked out the system for a myriad of reasons.
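
      For what it's worth, the age-from-birth-date calculation that app kept getting wrong is only a few lines. A minimal sketch in Python (the function name and example dates are mine, not from the vet app):

        from datetime import date

        def approx_age_years(birth_date, today=None):
            # Whole years between birth_date and today, never negative
            today = today or date.today()
            years = today.year - birth_date.year
            # Knock one off if the birthday hasn't happened yet this year
            if (today.month, today.day) < (birth_date.month, birth_date.day):
                years -= 1
            return max(years, 0)

        # 'Poochy', born 9 December 2005, registered in December 2013 -> 8, not 402
        print(approx_age_years(date(2005, 12, 9), date(2013, 12, 12)))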

      1. phil dude
        FAIL

        Re: One million lines of code

        unless the code is in one big text file, that is a rubbish excuse.

        Code quality or difficulty is not a simple linear metric....

        P.

      2. dssf

        Re: One million lines of code... Sigh...

        What "comPetent" "veteran" "software designer" doesn't vet date problems? Sounds like that "developer" did not want to force the users to put dates in specific date fields and use those fields in boilerplate reports and letters components.

        I despise developers who allow users to enter any old random shit in any field the user chooses, just because "data type entry enforcement slows us down" is the sort of excuse power-wielding users/buyers will sometimes bandy about.

        (In the early 90's, I once temped at a famous "memory leak detector" software developer. Two others and I had to trudge through MS Access to clean up data for future sales types (replacing the ones who must've been binned, I sometimes wondered) and future investors - the departed lot had entered phone numbers in conversation fields, addresses partially in phone number fields, states in the city field, and so on. It was horrendous, mind-boggling, and blood-pressure-boiling, and I'd only been at home playing with Lotus Approach for under two years or so.

        It was supposed to be a 1-week contract, or maybe it was 4 or 5 days, but after two days of that shit, and the ratty interface cobbled by some wannabe or unfortunate Access interface putter-togetherer, it seemed to me it would take 3 weeks or longer to comb through multiple thousands of records and make the data all-right and alright. I recommended Lotus Approach for this task, not to replace Access, but to work from a non-programmatic, data-entry-clerk-on-limited-time basis. I convinced the in-charge developer that I knew what I was doing, what my limitations were, and what I'd done at home with Approach. He permitted me to install it and give him a quick run-through of my plan, and after a few minutes, green-lit it. We did the task in around 3.5 or 4 days total rather than the two weeks or so it was clearly becoming, as we kept uncovering more and more SHIT entered by uncaring, clueless, reckless sales/marketing people who obviously did not see a need to revisit and make sense of the data.

        Similarly, I did this at 2 or 3 other South Bay offices, each staffed with fewer than 20 people, NONE of whom wanted nor had time to do this "drudgery" type of work. Unfortunately for my ego, Approach usually never stayed around and took hold. Ditto in the mid 2000s when I was again at (larger) firms with maybe 5,000 employees and tens of thousands of records, one problem being sifting out related, duplicate, and possibly fraudulent multiple (dozens of) entries of employees for pay-enhancement purposes. At least, that is sometimes what seemed to keep popping up in my face as I kept relating sites to employees. I was not privy to SSN information, so I was not able to fully settle my suspicions. Still, in that industry, it would not be uncommon for employees to be related to each other by 2-4 people. Unfortunately, some relatives had very similar or identical middle names.)

        1. Anonymous Coward
          Anonymous Coward

          @ dssf: ...allow users to enter any old random shit in any field...

          The best example (that I can think of) of the effect that a) bad design of input forms and b) indifferent and/or stupid data capturers can have, has to be the South African eNatis/AARTO system.

          The bad design came when the postal address field (on the form that motorists have to complete) was put below the residential address field. Since many people have their mail delivered to their residential address, the residential address was completed fully, and the postal address filled in as "As above".

          I swear, they must have taken people off the street whose only skill needed to be the ability to match up letters on the form with letters on the keyboard, and simultaneously be able to enter said characters in the corresponding field on the screen.

          The upshot (I guess by now you know where this is heading...) was that more than sixty percent of traffic fines were (probably still are) mailed to "As above".

          The system was supposed to have been implemented country-wide in 2009 or 2010 (can't be bothered to check), but is still seriously hobbled and running as a pilot project in Gauteng only.

          For the record: whilst it was enacted in 1998, with the intention of full deployment in the early 2000's, it is still not operational.

          See here: http://en.wikipedia.org/wiki/Administrative_Adjudication_of_Road_Traffic_Offences_Act,_1998

          And on fines going astray: http://www.arrivealive.co.za/mobile/news.asp?NID=1766

          http://ezinearticles.com/?Why-Most-AARTO-Traffic-Fines-Issued-Go-Astray&id=4996024

        2. Anonymous Coward
          Anonymous Coward

          Re: One million lines of code... Sigh...

          I'm doing a CRM migration at the moment and I feel your pain! In the new system we have built there is a big red button that disables an advisor when they leave the company. That means any historic orders handled by that advisor still maintain the record of being handled by that user, but no new orders can be assigned to them.

          However, that is too easy for the end users. Instead, if Steve Smith leaves the company they will just edit his name to Steve DONOTUSELEFTTHECOMPANY. So when a customer logs in to look at their old orders they can see it has been handled by Steve DONOTUSELEFTTHECOMPANY. And Steve DONOTUSELEFTTHECOMPANY gets assigned new orders too. And that is just the tip of the iceberg... *sigh*
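
          The "big red button" described above is the usual soft-disable pattern: keep the historic link, block new assignments. A rough sketch of the idea (class and field names invented for illustration, not from the actual CRM):

            class Advisor:
                def __init__(self, name):
                    self.name = name
                    self.active = True      # the "big red button" clears this; the name stays intact

            def assign_order(order, advisor):
                # New orders are refused for disabled advisors; old orders keep the original name
                if not advisor.active:
                    raise ValueError(advisor.name + " has left the company - pick an active advisor")
                order["advisor"] = advisor.name

            steve = Advisor("Steve Smith")
            steve.active = False            # disable, rather than renaming to DONOTUSE...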

    4. Number6

      Re: One million lines of code

      Thinking about it, I would much rather it failed big-time and was really obvious about it than having a subtle little bug somewhere that allowed aircraft to collide. I'd also like to add the feature that it only fails on days when I'm not due to fly in the following week.

    5. Anonymous Coward
      Anonymous Coward

      "Ladies and gentlemen, please keep your seat belts fastened in case we have to engage in some violent manoeuvres to avoid on coming aircraft. There has been a little glitch in Air Traffic Control that I am assured will be fixed shortly. If any of you have concer..............."

      "For fcuk sake! turn left! TURN LEFT!"

      1. imanidiot Silver badge

        Wrong way

        AC: "Ladies and gentlemen, please keep your seat belts fastened in case we have to engage in some violent manoeuvres to avoid on coming aircraft. There has been a little glitch in Air Traffic Control that I am assured will be fixed shortly. If any of you have concer..............."

        "For fcuk sake! turn left! TURN LEFT!"

        If you were to ever find yourself in that situation you should turn RIGHT! As the other sod heading for you should also be doing. (Them's the laws of the sky and the agreement between all airmen.) Unless of course you find there is no other option, and pray to god the other guy doesn't turn right instead.

        1. I. Aproveofitspendingonspecificprojects

          Re: Wrong way

          If you were to ever find yourself in that situation you should turn RIGHT! As the other sod heading for you should also be doing. (Them's the laws of the sky and the agreement between all airmen)

          Sounds singularly UStanii.

          Don't you / they / whoever, know that people naturally break left?*

          It was always thus until the US forces made us change the direction of engine rotation in WW2.

          Bloody stupid pondjumpers.

          *If you don't believe me just look at the muddy tracks outside university entrances where our young lords and masters are being trained to lead us. Immature (usually males) always take a short cut to the left over once green and pleasant landscapes. (Some extremely stupid bright young things will do it too. (IKYN!))

    6. Anonymous Coward
      Anonymous Coward

      Re: One million lines of code

      What we don't know is how many times the resilience HAS worked, i.e. the primary system could have failed hundreds of times over the years, with the resilience kicking in perfectly every time until now.

      We just don't get to hear about those occasions as there's no impact.

  2. Dodgy Geezer Silver badge

    Surely...

    ...we should be talking to Security Service and GCHQ about this?

    After all, they have justified a sizable chunk of their budget by saying that they will now be the authority responsible for defending the UK's Critical Infrastructure. Which means looking after its Confidentiality, Integrity and AVAILABILITY.

    They took the money - now's the time to ask them what they did with it.

    And no getting off with "I'm afraid that's classified information..."...

    1. theblackhand
      Coat

      Re: Surely...

      "Which means looking after its Confidentiality, Integrity and AVAILABILITY"

      Or CIA for short....

      My coat's the one with the roll of tinfoil in the pocket for when I need extra layers on my hat.

    2. Anonymous Coward
      Anonymous Coward

      Re: Surely...

      They took the money - now's the time to ask them what they did with it.

      Ah well you see... underwater fibre channel taps are very expensive, not to mention the storage systems which we had to purchase to store copies of all your packets on, have you seen how much a couple of petabytes of enterprise storage costs these days? And trust me you really, really don't want to know about all the electricity we have to burn processing every word of every email, just in case someone wrote a dodgy word in one... then there's the staff costs... but I won't bore you with them, I'm sure you have no interest in how much clever paranoid fascists cost to run...

  3. Ol'Peculier
    Mushroom

    Odd.

    There are systems in place to hand ATC over to the military, and this has happened in the past.

    Why not this time?

    1. Sammy Smalls

      Re: Odd.

      At a guess, because the military wouldn't even have 80% of NATS' capacity? The assumption being that if the military is in charge, something more fundamental is wrong, flights to Ibiza are less of a priority and the capacity wouldn't be needed.

    2. Chris Evans

      Re: Odd.

      Whilst I'm sure they have the capability to do ATC, I'm sure they don't have the necessary capacity at the drop of a hat to cope with the same volume of traffic. The limitation may be personnel! I doubt air travellers/taxpayers want to pay for 100+ people to sit around at the correct locations doing nothing 364 days a year. Then there is the number of extra workstations needed; they won't be cheap.

      But they do need an alternate backup system in place for as many parts of the system as possible.

      1. collinsl Bronze badge

        Re: Odd.

        As someone who used to work for NATS I can't say very much, but what I can say is that we provide the infrastructure and radar services for the military air controllers these days, so they had the same problems we did.

    3. Sir Sham Cad

      Re: Odd.

      Well, reading a few posts later than yours, The_H appears to have answered your question. It looks as if MIL ATC was also being handled at Swanwick from that day onwards, which just utterly removes any fallback to MIL ATC in case of any future issues. Perhaps that decision might be coming up for a rethink!

    4. Anonymous Coward
      Anonymous Coward

      Re: Why Not The Military?

      ATC the military way: a fighter flies alongside with the pilot pointing down.

  4. Anonymous Coward
    Anonymous Coward

    Scary

    This is making me wonder about just how safe air travel is.

    As far as I understood it air traffic control was meant to be able to fall back to operating completely manually, with bits of card for each aircraft and the like... if that's not true WTF are they going to do in the event of a major technology failure... as opposed to just not being able to log some staff members into their system, which is what this amounts to.

    1. DragonLord

      Re: Scary

      I think that this is a different system to the one they can operate by cards. The one they can operate by cards being the tracking, prioritising, and routing of planes on the ground and in the air. The system that crashed being the one that enables them to tell the other traffic control areas that they've got a plane entering their airspace.

      1. Anonymous Coward
        Anonymous Coward

        Re: Scary

        The system that crashed being the one that enables them to tell the other traffic control areas that they've got a plane entering their airspace.

        So when they're operating on cards, they can't handover aircraft to overlapping control areas... jesus fuck... excuse my French, but that's not very reassuring.

        1. frank ly

          Re: Scary

          The thing about the fall-back card system and military control is that they are for emergencies. All the systems are fail-safe in that they do control airspace safely. It's just that the military and card scribbling/pushing can't handle anything like the normal volume of traffic required. Also, nowadays especially, large and rich companies lose a lot of money if civilian ATC falls over, hence the political pressure being brought to bear.

        2. An0n C0w4rd

          Re: Scary

          The fall-back system to flight strips (the "cards") will probably also fall back to manually looking up and dialling the controller you have to hand the flight over to when it runs off the end of your RADAR screen. i.e. instead of a 20% reduction in capacity it's probably closer to 50% because of the added workload.

          The manual dial system is what worked for decades before all these fancy computers came in and cocked everything up.

        3. codeusirae
          Facepalm

          The system that crashed ..

          'One of the key changes involves improving the warning messages that flash on the air traffic controllers' screens when an aircraft moves out of their area of control and responsibility. The aim is for a warning to flash on the display to remind the controllers to ensure that they have completed all their co-ordination checks before an aircraft leaves their screen and becomes the responsibility of others.

          "There is a quirk over whether it flashes or not," says Chisholm. "We want it to work in 100% of cases".

          It is important to fix this problem because the Swanwick system, unlike the current manual process, supports the automated transfer of aircraft from one air space sector to another.

          Currently at the London Air Traffic Control Centre, when controllers relinquish responsibility for an aircraft, they confirm this by phoning the appropriate new controller. This will not happen under the new automated procedures at Swanwick.'

          1. Anonymous Coward
            Anonymous Coward

            Re: The system that crashed ..

            The aim is for a warning to flash on the display to remind the controllers to ensure that they have completed all their co-ordination checks before an aircraft leaves their screen and becomes the responsibility of others.

            "There is a quirk over whether it flashes or not," says Chisholm. "We want it to work in 100% of cases".

            So one minute after flight 666 has declared an in-flight emergency and requested an emergency landing, the air traffic controller is going to lose sight of the aircraft he is marshalling out of 666's way, as his screen is filled with flashing messages...

            Maybe I'll take the train.

          2. collinsl Bronze badge

            Re: The system that crashed ..

            That article is from when the centre was being built 10 years ago, and the system envisaged there was implemented successfully some years ago.

    2. Dominion

      Re: Scary

      It's probably no more scary than to sit at home waiting for the rest of the crap IT systems in this country to fall apart in a heap.

    3. MrXavia
      Facepalm

      Re: Scary

      Do you really think they could cope with 80% of capacity with paper cards?

      YES they can fall back, but really, 80% is pretty impressive with a faulty system...

      1. Hoe

        Re: Scary

        But surely we still need an enquiry about it?

        Just think David my mate Lord Billy Milksalot can chair it for just £500,000 of the Tax payers money (plus my 50k referral fee of course).

        Thanks

        I.M.Athief (MP).

    4. This post has been deleted by its author

  5. The_H

    Is there more to the story than this?

    PPRUNE (a pilots' forum) carries this interesting NOTAM (Notice to Airmen)

    AT 0100 ON 07 DEC 2013 SCOTTISH AIR TRAFFIC CONTROL CENTRE MILITARY(SCOTTISH MIL)(SCATCC(MIL)) WILL TRANSITION TO SWANWICK AND ASSUME THE TITLE LONDON AIR TRAFFIC CONTROL CENTRE MILITARY (LONDON MIL)(LATCC(MIL)) NORTH, THERE WILL BE NO CHANGES TO SERVICE PROVISION ARRANGEMENTS OR INITIAL CONTACT FREQ. ALL LATCC(MIL) SECTORS WILL ASSUME THE VOICE CALL SIGN 'SWANWICK MILITARY'. A SINGLE UK FLIGHT PLAN ADDRESS, EGZYOATT WILL BE USED FOR ALL OPERATIONAL AIR TRAFFIC FLIGHT PLANS.

    In other words - this "phone system failure" just happened to coincide with the day that all military air traffic control transferred to Swanwick. Hmmm.

    1. This post has been deleted by its author

  6. Anonymous Coward
    Anonymous Coward

    99.9998% Availability.....

    One 14 hour failure in 11 years, that's what 99.9998% availability ?

    What's the betting Frequentis won't even have to pay any SLA money... ;-)

    1. Anonymous Coward
      Anonymous Coward

      Re: 99.9998% Availability.....

      If they get picky about it they delivered approximately 90% of the required capability for the 14 hours of faults - which works out at 84 minutes of lost capacity-time in 11 years. As epic failures of critical systems go that's not at all bad.
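
      Rough arithmetic, assuming 11 years of continuous operation: counted as a full 14-hour outage it is nearer 99.985% than 99.9998%; counted as the 84 minutes of lost capacity-time it is about 99.9985%. A quick check in Python:

        HOURS_PER_YEAR = 365.25 * 24
        total_hours = 11 * HOURS_PER_YEAR          # roughly 96,400 hours in service

        full_outage = 14                            # treat the whole 14 hours as downtime
        capacity_loss = 0.10 * full_outage          # 10% of capability lost for 14h = 1.4h = 84 min

        print(1 - full_outage / total_hours)        # ~0.99985   -> about 99.985% availability
        print(1 - capacity_loss / total_hours)      # ~0.999985  -> about 99.9985% capacity-weighted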

      1. Will Godfrey Silver badge

        Re: 99.9998% Availability.....

        Length of failure time is less important than how badly everything can screw up during the failure.

        1. Anonymous Coward
          Anonymous Coward

          Re: 99.9998% Availability.....

          With mission critical systems the backup solution isn't supposed to be a copy of the primary system. It's supposed to be developed separately, updated separately, managed separately. This way a bug in the primary system isn't replicated to the backup.

          There is a difference between the primary system availability and a Disaster Recovery system. DR is used when the primary system and its resilience completely fail. The problem here is the DR system and processes didn't work. That system would not be provided by Frequentis. Probably it's owned more by NATS themselves (the original card shuffle systems...).
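
          As a toy illustration of the "developed separately" point: dissimilar redundancy usually means two independent implementations of the same check whose outputs are cross-compared, so a single coding bug can't silently take out both paths. A sketch only (nothing to do with how NATS or Frequentis actually build theirs):

            def separation_ok_a(p1, p2, min_nm=5.0):
                # Implementation A: straight Euclidean distance
                return ((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2) ** 0.5 >= min_nm

            def separation_ok_b(p1, p2, min_nm=5.0):
                # Implementation B: written independently, compares squared distances
                dx, dy = p1[0] - p2[0], p1[1] - p2[1]
                return dx * dx + dy * dy >= min_nm * min_nm

            def separation_ok(p1, p2):
                a, b = separation_ok_a(p1, p2), separation_ok_b(p1, p2)
                if a != b:
                    raise RuntimeError("independent implementations disagree - fail loudly")
                return a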

          1. Anonymous Coward
            Anonymous Coward

            Re: 99.9998% Availability.....

            "With mission critical systems the backpu solution isn't supposed to be a copy of the primary system. It's supposed to be developed separately, updated separately,managed separately. This way a bug in the primary system isn't replicated to the backup."

            I've heard the theory.

            Who's seen any recent real examples ?

            Or is "It's too expensive to do dissimilar redundancy, we'll just do one and test it properly" followed by "We can't test it properly, it's too expensive, just ship it" the universal refrain in recent decades?

  7. Anonymous Coward
    Anonymous Coward

    Let (s)he who has failproof software cast the first stone.

    For all the griping and complaining can someone point me to any piece of software that has worked without a major cock-up for 12 years ? (Ok I admit the solitaire game bundled with windows seems rather resilient :-))

    I've seen military C3I systems crash less than gracefully in the middle of live operations, I've seen signalling software that control rail traffic lock up at rush hour and the list goes on with telcos, power grids, chemical factories and nuclear plants.....

    Software will crash - it's a fact of life - and I think the guys at NATS did a rather good mitigation job handling 90% of the workload in a crippled situation (and I'm sure they do have procedures to handle things with pins on a paper chart and analogue telephone lines should the whole IT infrastructure go kaput)

    To answer a previous comment - when a mission critical system crashes you actually prefer it to crash in a big and obvious way and "get your money's worth"... There would be nothing worse or more dangerous than an "elusive minor bug"! Imagine a bug that would randomly omit to show some flights on an ATC controller's screen - now that is a scary scenario...

    1. Anonymous Coward
      Anonymous Coward

      Re: Let (s)he who has failproof software cast the first stone.

      > (Ok I admit the solitaire game bundled with windows seems rather resilient :-))

      If it is so resilient then why is it on version 5.1?

      1. Number6

        Re: Let (s)he who has failproof software cast the first stone.

        If it is so resilient then why is it on version 5.1?

        Several answers to that - one is that it's resilient now because the bugs have been fixed, another is that it's been introducing features (rather than bugs) with each new release.

    2. Tromos

      Re: Let (s)he who has failproof software cast the first stone.

      There have been several complaints that MS Solitaire frequently deals repeated hands, so maybe not quite as resilient as it first appears.

  8. Thomas Whipp

    Redundancy

    "But we can't help but agree with exasperated folk stranded at airports over the weekend who - quite reasonably - asked why such a failure could have happened in the first place with a critical system. Redundancy, much?"

    The fault sounds very much like a configuration problem; if caught at implementation it's usually a case of revert to prior state... but once it's been in use for a period it's the sort of thing that, in the middle of a safety critical system, can be immensely hard to back out (do you really want them to shut down all phones on the air traffic system???).

    Redundancy means having a duplicate system - hardware wise that's easy, software wise do you really mean they should maintain a completely parallel system with distinct config at all times? I appreciate the trite answer to this could be yes - but in a real situation (which has to interact with external parties) that can quickly become utterly pointless.

    Frankly I'm quite impressed that they managed 80% throughput in the circumstances - I'd guess the contingency plan became a lot of post-it notes very quickly.

    1. JohnG

      Re: Redundancy

      "Redundancy means having a duplicate system"

      ....or perhaps a set of procedures to use regular telephones when the all singing and dancing system is down.

      It seems incredible that they apparently had no alternative procedures to follow and just left everything in night mode.

      1. NomNomNom

        Re: Redundancy

        Redundancy is where they fly twice as many planes in the sky as are needed so that when half of them crash it's okay.

      2. Anonymous Coward
        Anonymous Coward

        Re: Redundancy

        Err I think we might be getting a bit confused here.

        Of course NATS has fallback phone exchanges and procedures to follow when the system is in degraded mode.....

        And it is these very procedures that allowed them to handle a large bit of their workload.....

        What apparently failed is a system that provides them on-screen direct access to all the "people" they might need to contact - there might be half a dozen of those at any time for any given flight and they change dynamically as the plane progresses: numbers for the ATC of the adjacent region where the plane may be directed, numbers of the airports where you may direct the plane to land in case of incident, etc....

        Without this functionality it's back to looking up the relevant numbers and dialling them - this takes quite a bit more time... and the controller can't safely juggle as many planes as he usually would....

    2. Number6

      Re: Redundancy

      Redundancy means having a duplicate system - hardware wise that's easy, software wise do you really mean they should maintain a completely parallel system with distinct config at all times? I appreciate the trite answer to this could be yes - but in a real situation (which has to interact with external parties) that can quickly become utterly pointless.

      Don't modern telephone exchanges do this? Run two sets of hardware each tracking the entire exchange context, so that one half can have an upgrade applied without losing calls. Presumably the upgraded half then reloads context from the running half and at some point is given control so that the other half can also be upgraded.
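
      That is roughly the classic active/standby pattern: the standby half reloads live state from the running half before it takes control, so an upgrade never drops in-progress calls. A very simplified sketch of the handover (not any particular vendor's design):

        class ExchangeHalf:
            def __init__(self, name, version):
                self.name, self.version = name, version
                self.calls = {}                      # call-id -> call state

        def rolling_upgrade(active, standby, new_version):
            # Upgrade the idle half, copy live call context across, then swap roles;
            # the formerly active half can now be upgraded the same way.
            standby.version = new_version
            standby.calls = dict(active.calls)
            return standby, active                   # (new active, new standby)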

      1. Vladimir Plouzhnikov

        Re: Redundancy

        They had redundancy alright - the backup system is called "phonebook"; the only problem is that it's rather slow compared to the primary system.

      2. Anonymous Coward
        Anonymous Coward

        Re: Redundancy

        Yes, telephone exchanges have been able to hot-upgrade (and hot-patch) since the early 90's without even dropping calls in progress. Further, they can roll back software releases without dropping calls in progress too.

        It was largely driven by threat of damages (mainly in the US) if a 999/112/911 call was dropped as a result of an intentional maintenance action.

      3. I. Aproveofitspendingonspecificprojects

        Re: Redundancy

        Or is that GCHQ raid-ing?

    3. Anonymous Coward
      Anonymous Coward

      Re: Redundancy

      This.

      I'm fed up hearing about "where are the redundant systems?" when redundancy wouldn't help. In most cases, redundancy just gives you the option to "turn it off and turn it on again" on the contingency system. In a lot of cases, that's vastly complex (e.g. restarting a core set of databases with the knock on effect of required application restarts) and a localised recovery (restart/nudge the failing application component) is quicker. In some cases, you can't simply fail over the faulted system, you have to fail over a lot of related systems which may be working normally and you don't want to fiddle with. I admit there are redundancy/failover designs which can run without any noticeable service impact.

      Further, many systems run contingency/DR as "no data loss is acceptable" which is generally implemented by some kind of synchronous data replication (Veritas VVR, EMC SRDF, Oracle Dataguard etc). If your core problem is data related, I don't care how quickly you can fail over to backup systems, the fault is still there. Recovering that issue can be simple, it can be a complete nightmare.

      So, unless you know that the root cause was a server failure/crash, don't start whinging about lack of recovery/contingency systems.

    4. Anonymous Coward
      Anonymous Coward

      Re: Redundancy

      Yes.

      Network Rail signalling systems need to be just as reliable. They usually have three parallel systems controlling the critical functions, allowing for a double fault across separate systems.

      To be fair, the NATS SLAs likely say it's fine to run at reduced capacity occasionally rather than spend an extra 5 billion to make the downtimes even rarer, adding 50 quid to everyone's ticket prices.

  9. Anonymous Coward
    Anonymous Coward

    So many questions...

    If anybody from NATS feels like posting anon - surely this system was recognised as critical/had a failover/disaster recovery/whatever it takes up to the hilt etc etc ?

    (I know, I know - I work in the sort of place where we have occasionally war gamed this kind of thing - all I want to make sure of is I'm on holiday in a galaxy far far away if we ever actually have to invoke it for real.)

  10. A Non e-mouse Silver badge
    Pint

    Aren't we being a bit too negative here?

    There's no denying that there was a problem with a computer system at NATS at the weekend, and that it caused problems for the travelling public.

    BUT....

    Using backup procedures, they managed to perform 80% of their normal workload. I'd say that's pretty darn good!

    Beers all round for everyone who undoubtedly worked their butts off to keep things running.

  11. Anonymous Coward
    Facepalm

    Was expecting something more technical

    Is it cisco kit ?

    Is it sip ?

    Asterisk ?

    Hipath ?

    Two bean cans and a piece of string ?

    For gods sake not Lync ?

    Come on Reg journos, time to get out of the pub and get the full story.

    1. Anonymous Coward
      Anonymous Coward

      Re: Was expecting something more technical

      "For gods sake not Lync "

      The name of the vendor in question seems to be Frequentis. Some of their stuff runs on Linux, some of their stuff runs on Windows. Not sure what applies in this incident.

      "Come on Reg journos, time to get out of the pub and get the full story"

      Time to get the Swanwick staff *into* the pub to get the full story, shurely?

      I suspect the story on PPRUNE, that the routine changeover became non-routine when the handling of the military traffic arrived, may be close.

      Given that a vendor has been named, I assume it isn't the ATC staff having a problem handling the additional workload from military traffic.

    2. Ralph Online

      Re: Was expecting something more technical

      I doubt that it was anything so, so COTS as those!

      From what I can see it's probably a Frequentis VCS 3020X system that was at fault.

      One of these systems http://tinyurl.com/oplnucd

      And I think relatively widely deployed? Although possibly each system has to be customized?

      1. Anonymous Coward
        Thumb Up

        Re: Was expecting something more technical

        Thanks for the info Ralph. +1

  12. Anonymous Coward
    Anonymous Coward

    There is much redundancy in the system at Swanwick, from the radar feeds and screens, the radio telephony (R/T) system and of course multiple phone lines. At each console there is one radar screen, one touch screen phone panel, one r/t panel and one system information screen. The control of a "sector" (which may be a combination of multiple sub-sectors "bandboxed" together) consists of 3 consoles set up together. If any one of these fails then one of the other consoles, usually used by other staff managing the sector can be used, until a new suite of 3 consoles can be configured.

    To manage the varying demands of traffic, the sectors are combined ("bandboxed") and split as necessary to even out the workload, with many sectors combined overnight when workload is low. In order to split out, the sectors to be split off are "elected" on another set of consoles by another group of controllers. When this happens the phone panels are dynamically loaded as appropriate to the sectors elected on the new consoles. It is this process of loading the appropriate telephone panels which I think failed (I don't know for sure, I wasn't there).

    Without direct access lines to the surrounding sectors (both within the same operations room and external agencies) certain conditions for the expeditious transfer of traffic cannot be met - the rules (see CAP493, Manual of Air Traffic Services Part 1) say that in order to carry out "silent radar handovers" direct radar-to-radar telephone lines *must* be operational. If they are not, then aircraft need to be positioned further apart to transfer to the next sector, or telephone calls made to hand over each aircraft individually (normally electronic communication advises the next sector or agency of the aircraft details). This is a safety net so that controllers can rapidly talk to each other in case of emergency. Looking up direct dial numbers for multiple agencies (and the multiple sectors at those agencies) would take up a lot of time, and would also be a problem from those agencies dialling in the other way - part of the advantage of the direct dial buttons means that you can quickly see who is calling and be ahead of the game as you answer the telephone.

    When the sectors could not be split properly, I assume they had to remain as larger bandboxes, meaning that controllers had to handle large areas. One of the limitations that a controller has on the number of aircraft they can work is purely R/T loading - if there is so much talking that aircraft can't check in, or the controller can't deal with all of the aircraft on frequency, then the system falls apart. This would be one of the main reasons that restrictions are placed on the amount of traffic that can be worked. It's also more difficult to control lots of aircraft when you're viewing the radar picture at a large range - when traffic gets complex, a smaller range really helps to see what is going on in your sector! European traffic always gets hit harder than that from further afield, as much of the traffic from the US, say, would already be in the air and therefore is already on its way. The European traffic therefore has to be held in order that sectors don't get overloaded.

    If this particular fault had happened before Swanwick Area Control went electronic two years ago, I suspect the restrictions would have been far more severe - in an electronic world controllers can handle more traffic due to the help given by the system in assessing and resolving conflictions. The "contingency" was that the system still worked, but with reduced capacity - 80% doesn't seem too bad when there is a major systems failure!
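
    The "elect a sector, load its phone panel" step described above boils down to a lookup from sector to its direct-access lines; if that load fails you are back to the phone book. A loose sketch of the idea only - the sector names and line data below are invented, not real NATS configuration:

      # Invented sector-to-direct-line data, purely for illustration
      DIRECT_LINES = {
          "SECTOR_A": ["adjacent centre 1", "adjacent centre 2", "diversion airfield"],
          "SECTOR_B": ["adjacent centre 3", "military controller"],
      }

      def elect_sectors(console, sectors):
          # Build the direct-access phone panel for the sectors elected on this console.
          panel = []
          for sector in sectors:
              lines = DIRECT_LINES.get(sector)
              if lines is None:
                  # The failure mode described above: the panel doesn't load and
                  # controllers are back to looking numbers up and dialling manually.
                  raise RuntimeError("no direct-line data for " + sector)
              panel.extend(lines)
          console["phone_panel"] = panel
          return console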

    1. Bronek Kozicki
      Pint

      Awesome explanation, thanks!

      1. Anonymous Coward
        Anonymous Coward

        Glad it was helpful, not sure I can really say any more I'm afraid.

    2. Anonymous Coward
      Anonymous Coward

      @Anonymous 19:28

      *ding* *ding* Winner! Well-explained, no wah-wah drama, just the plain facts.

      Just the way it should be.

    3. Anonymous Coward
      Anonymous Coward

      Much redundancy at Swanwick?

      "There is much redundancy in the system at Swanwick, from the radar feeds and screens, the radio telephony (R/T) system and of course multiple phone lines"

      Obviously not enough redundancy ...

      1. Lottie

        Re:...Obviously not enough redundancy ...

        "Oh, the fools! If only they'd built it with 6000 and one hulls!"

        </fry>

  13. Mr. A

    Pressure

    14 hours to fix the largest air traffic control communication system in Europe sounds pretty good going to me. Those lads must've been crapping themselves!

    Not to mention that air traffic control managed to get 80% of flights in/out the door without it... In this country we're too quick to point the finger rather than simply say well done for sorting it out.

    1. A Non e-mouse Silver badge
      Pint

      Re: Pressure

      14 hours to fix the largest air traffic control communication system in Europe sounds pretty good going to me.

      Without taking it off-line!

  14. zb

    About time El Reg noticed this story.

  15. j1mb0b

    First of all, thank you to AC for posting the most sensible response (to date) on here about what the actual problem was. Those who are interested further would be well advised to check out the ATC forums at www.pprune.org.

    However, given that the fault almost certainly appears to be due to the transfer of Scottish ATC Mil traffic down to Swanwick, there are 2 questions that come to mind:

    1) Why did the testing not pick up this fault;

    2) Why was the fix that allowed them to transfer to "day" mode not implemented sooner?

    I appreciate there are perfectly good answers to these two questions. However, I'd still like to know what they are.

    1. RTodd

      Software testing, particularly regression testing of new releases, is a rather blunt instrument. Testing generally only finds what it's looking for, except what it stumbles across by accident or by not following the script.

      They said it would be fixed by about 18:30 and it was fixed by 19:00. That's either amazing fault management or dumb luck.

    2. Anonymous Coward
      Anonymous Coward

      "given that the fault almost certainly appears to be due to the transfer of Scottish ATC Mil traffic down to Swanwick"

      given that if the nice Mr Salmond gets his way this will be a permanent thing, as no-one with any common sense would leave any UK military resources north of the border after a "bye bye" vote (and no UK politician with more than one brain cell would allow any new UK military spending north of the border until that vote is out of the way), what's the contingency plan? :)

  16. Florida1920
    Black Helicopters

    Then fix the problem

    They should have blamed the problem on NSA and GCHQ.

  17. RTodd

    Realtime most of the time or a few panels short of a day shift ?

    Never found a proper system to replace the Ericsson panels at LATCC.

  18. Nifty Silver badge

    backup system

    We need a spare airport on the Thames Estuary to land everything on the next time the control system goes pear-shaped.

  19. veeguy

    Everyone could use some extra cash this time of year...

    Relax. Computers are good for *so* much more than just watching airplanes. It was simply the ATC BOFH doing some Bitcoin mining for extra Xmas cash.

  20. Skoorb

    I would actually like to know more

    Whilst everyone else is going mental about how useless NATS are, I would honestly like to know about what happened, why, what systems are in place to deal with things like this and what changes will be made. Incident response on large critical systems like this always makes for a very interesting case study to improve your own professional practice.

  21. Anonymous Coward
    Anonymous Coward

    The issue is congestion

    I don't think that what happened here says much about NATS or the systems: if they managed to deliver at 90% and without safety compromise they did their job well.

    The impact in the SE of England shows how overloaded the airports and airspace are there. That overloading means even a minor issue snowballs and knocks on to many flights. Solution: fewer planes or more capacity, or live with increasing disruption.

    1. Anonymous Coward
      Anonymous Coward

      Re: The issue is congestion

      "The impact in the SE of England shows how overloaded the airports and airspace is there"

      And, therefore, it shows how utterly insane it would be to add significant additional *runway* capacity in the SouthEast without adding additional *airspace* capacity in the area too. And afaict from outside there's no way to add additional airspace (not without significant compromises to safety and/or horrid effects when there's a hiccup).

      So the "we need another runway" game is about something other than capacity, and the airlines and airports are fully aware of that, but they hope that Joe Public won't catch on to what their real game is.

      What *is* their real game?

      1. Anonymous Coward
        Anonymous Coward

        Re: The issue is congestion

        Not necessarily true, a lot of the airspace in the South East is taken up by aircraft in holding patterns waiting to make an approach (particularly to Heathrow with its 4 holds). If another runway were available, then perhaps fewer aircraft would need to hold and therefore free up some space. Plans are already in place to slow aircraft down from further away to soak up some of the delay of holding, while still providing a constant stream of aircraft to make sure the existing runway is utilised to its full capacity; this could also free up extra space...

        The last couple of days with delays due to the fog show that it is about capacity. The fog means aircraft can't see very well on the ground and have to taxi more slowly, which means they vacate the runway more slowly and controllers have to space them out more on approach to land. The fog doesn't really affect anything in the air, but this extra spacing and the delays and cancellations entailed do show that the runways at Heathrow are at max capacity for much of the day. An extra runway would help to give some breathing room.

        Not that I'm saying that an extra runway at Heathrow is the right solution to the problem!

  22. cortland

    Proofs? What proofs?

    I've seen software engineers going nuts (nuttier than normal, at least) verifying programming merely as complex as flight control firmware, so I can sympathize with designers and debuggers. Not a software engineer myself, I've nevertheless seen published papers speculating that verification may not even be theoretically possible after some degree of complexity.

    The Wikipedia article may be of some interest: http://en.wikipedia.org/wiki/Formal_verification

  23. Gareth Wright

    System got stuck....

    does it take 14 hours to turn it off and on again?
