UK air traffic mega cockup: BOTH server channels failed - report

The IT cockup at the National Air Traffic Services (NATS) that grounded hundreds of flights in December occurred because both of its System Flight Server (SFS) channels went down, an independent report has revealed. "The disruption on 12 December 2014 arose because – for the first time in the history of the SFS – both channels …

  1. Anonymous Coward

    "both channels failed at the same time"

    Not a "failure" in the normal sense of the word; according to the later part of the article, both channels correctly and identically responded to bad data, in the way they were programmed to.

    In other words: it was human error in the way the system was programmed, configured or (insufficiently) tested.

    The article doesn't explain why it suddenly happened at that particular point in time though.

    1. Robert Simmons

      Re: "both channels failed at the same time"

      The article doesn't, but the linked CAA document does. It was triggered by someone issuing a "Sign Off" command from an unused workstation. This command was then replayed on the second SFS, causing a repetition of the same fault which took down the first SFS.

    2. Dave Pickles

      Re: "both channels failed at the same time"

      Para 4.2 explains that a recent system change to include more military functions was one of the triggers. Another was the design shortcoming which allowed an operator to 'fat-finger' the wrong button when signing-off, thereby consuming an extra session.

    3. Velv

      Re: "both channels failed at the same time"

      Which is one of the arguments for having your Business Continuity operation provided by an alternative supplier who hasn't seen what has been implemented by supplier 1.

      The London Internet Exchange does it with two entirely different sets of infrastructure providing the twin service.

      Not cheap, extremely difficult to integrate, and potentially susceptible to other failures, but there's a balance to be struck where something is mission critical.

      1. Anonymous Coward

        Re: "both channels failed at the same time" @Velv

        Yes, only an independent (and extremely well tested) disaster recovery system will do. Of course, what we're talking of is expensive, and requires round the clock monitoring/maintenance (especially where database synchronisation is concerned). Getting customers to provide the proper levels of effort to manage a DR setup is difficult (as is getting them to cough up for a proper managed service).

        1. Anonymous Coward

          Re: "both channels failed at the same time" @Velv

          So you've got two independent systems. One has got a bug triggered by a config change etc. Now one system is telling you one thing and the other is telling you something else. How do you automatically know which one to believe?

          It seems to me the only advantage of having the two "independent" systems is that once a human can correctly decide which one is working correctly then they can just disable the "bad" system so you get, in theory, a quicker fix.

          On the other hand if the error is due to a human inputting the wrong data or a mistake in the requirements you've won nothing at all.

          1. Anonymous Coward

            Re: "both channels failed at the same time" @Velv

            "So you've got two independent systems. One has got a bug triggered by a config change etc. Now one system is telling you one thing and the other is telling you something else. How do you automatically know which one to believe?"

            You don't. That's the point of the human factor; having people monitoring constantly. One of the earliest things we test, for example, is what happens when communications between the two sites go down? How does one system know if the other has gone TITSUP when that happens? Having it make that assumption can cause a real disaster, if keeping data in sync is critical*. Only a human can make that determination (and that may involve phone calls or even checking news reports). These things all have to be planned for.

            * Bearing in mind a one way sync, from master to standby, is far easier than both ways.
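
            To make that concrete, here's a minimal sketch in Python (all names invented; nothing to do with the actual SFS or any real DR product) of a standby that escalates to a human instead of assuming the peer is dead:

            ```python
            import time

            def alert_operators(message):
                # Placeholder: in reality this would page the on-call team
                # via the monitoring system, not print to a console.
                print(f"ALERT: {message}")

            def watch_peer(last_heartbeat, max_age_seconds=10):
                """Run periodically on the standby site. Deliberately does NOT
                fail over on its own: from here, a dead link and a dead peer
                look identical, so the decision goes to a human."""
                age = time.time() - last_heartbeat
                if age > max_age_seconds:
                    alert_operators(
                        f"no heartbeat from primary for {age:.0f}s - confirm "
                        "peer state out-of-band (phone, console) before failover"
                    )
            ```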

          2. gnasher729 Silver badge

            Re: "both channels failed at the same time" @Velv

            "How do you automatically know which one to believe?"

            Simples. The first computer system didn't give out any wrong information. It detected that something was wrong, and it was programmed to shut down rather than risk giving out wrong information. Because both systems contained the identical bug, both shut down in exactly the same way. With different systems, the first system would have shut down, but the second would have continued to work. You believe the one that still works.

  2. Annihilator

    aka

    "Sh1t happens"

    A solid reminder that no matter what you do, there will always be a path of execution not covered by test scripts, and that you should have non-IT contingency plans - which, in this case, they did.

    1. Anonymous Coward

      "...there will always be a path of execution not covered by test scripts"

      I'm assuming you mean combinations of paths here? As this is a critical system, it would (should?) have been subject to MC/DC and full code coverage testing.

      Sounds more like a "there will never be a need for more than 640K RAM" type of decision - and then demand finally climbed past the capacity limit...
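
      For anyone unfamiliar with MC/DC: every condition in a decision must be shown to independently flip the outcome. A toy illustration in Python (nothing to do with the real SFS code):

      ```python
      def decision(a, b, c):
          # A toy decision with three conditions.
          return (a and b) or c

      # An MC/DC test set needs each condition to independently change the
      # outcome while the others are held fixed. Four vectors suffice here
      # (n+1 tests for n conditions):
      tests = [
          (True,  True,  False),  # -> True
          (False, True,  False),  # -> False: flipping 'a' alone changed it
          (True,  False, False),  # -> False: flipping 'b' alone changed it
          (True,  False, True),   # -> True:  flipping 'c' alone changed it
      ]
      for a, b, c in tests:
          print(a, b, c, "->", decision(a, b, c))
      ```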

      1. Destroy All Monsters Silver badge

        Re: "...there will always be a path of execution not covered by test scripts"

        "have been subject to MC/DC and full code coverage testing"

        Show that this kind of dead chicken waving would have detected the error. MC/DC testing is voodoo from a prior age at the best of times. Indeed, this paper says: "We believe that the rigor of the MC/DC metric is highly sensitive to the structure of the implementation and can therefore be misleading as a test adequacy criterion.". In other words: Your MC/DC testing value depends on how you write your program, i.e. on syntax. Ouch.

        Theorem provers are what should be used today.

        1. Anonymous Coward

          Re: "...there will always be a path of execution not covered by test scripts"

          When lives depended on it, and in my case they always did, mere testing methodologies were unacceptable, whatever the framework. Every system I designed had software as a component and I always used formal verification. Extensive verification with thoroughly checked results. Life in a federal prison didn't appeal to me. "The prospect of being hanged in a fortnight concentrates the mind wonderfully."

          One nice thing was that I could hand the various providers of compilers and other packages reports on their defects. Hardware already had formal methods applied, thankfully, although I paid close attention to deviations between the desired state in the engineering manuals and the actual state in the field. Those systems are still in use more than thirty years later.

          Edit: Thank you most kindly for the paper.

  3. ravenviz Silver badge

    The cockup resulted in 120 flights being cancelled and 500 flights being delayed for 45 minutes, and affected 10,000 passengers in total.

    No-one died.

    1. Gordon 10

      That won't stop the PHBs doing something dumb for show.

  4. Hans Neeson-Bumpsadese Silver badge

    Passenger count

    "120 flights being cancelled and 500 flights being delayed for 45 minutes, and affected 10,000 passengers in total."

    That works out at an average of 16 pax per flight, so 10,000 total pax affected seems a bit on the low side

    1. Tom Wood

      Re: Passenger count

      Many will be cargo flights, or light aircraft.

    2. Yet Another Anonymous coward Silver badge

      Re: Passenger count

      Only 10,000 passengers were affected enough to get compensation claims - the others don't count.

  5. Tom Wood

    So the cause

    was a hard-coded limit on the number of "things" in the system. But instead of being hard-coded in one place it was hard-coded to different values in two places. Recent changes meant the lower of the two limits was exceeded for the first time ever, and the higher limit wasn't.

    Sounds like a fairly basic software test should have caught this issue. If your requirement is "the system shall support up to X things connected" then a decent test would check what happens with X-1, X and X+1 things connected, to make sure the limit had been programmed correctly (and with the correct use of <, <=, == etc).

    But, you know, it had worked OK since the '90s, so why would anyone need to test it?
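
    For what it's worth, the boundary test described above is only a few lines. A sketch in Python, with a made-up can_activate() standing in for whatever the SFS actually does:

    ```python
    MAX_THINGS = 193  # the limit, hard-coded in exactly ONE place

    def can_activate(active_count):
        # Hypothetical stand-in: may one more "thing" become active?
        return active_count < MAX_THINGS

    def test_boundary():
        assert can_activate(MAX_THINGS - 1)      # X-1: must succeed
        assert not can_activate(MAX_THINGS)      # X:   refused cleanly, no crash
        assert not can_activate(MAX_THINGS + 1)  # X+1: still refused gracefully

    test_boundary()
    ```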

  6. Velv

    Reminds me of the Grebulons in Mostly Harmless who lost their stored memories when the backup computer was placed in the same hole left by the original computer :)

    1. TRT Silver badge

      IIRC they became hopelessly obsessed with astrology. Would have been about as much good as UK ATC that day.

      And on an astrology/airport note... NCIS had an episode where the team arrived at a check-in desk and announced themselves as LEOs.

      "I'm Sagittarius", came the reply.

      "Law Enforcement Officers. New on the job are you?"

  7. Gordon 10

    I wonder if

    This was running on TPF?

    The OS and the age are right.

    Would explain why there were different limits in different parts of the system not easily visible to each other.

    Real programmers write OS390 assembler in 4k blocks.

    1. Alistair

      Re: I wonder if

      "Real programmers write OS390 assembler in 4k blocks."

      Nope, not any more. There aint no one gonna pay me NEAR enough to go back there.

      1. bazza Silver badge

        Re: I wonder if

        "Nope, not any more. There aint no one gonna pay me NEAR enough to go back there."

        Just out of interest, just how much would it take? $1000 per day/hour/minute? First born child? Fifteen glamorous assistants whose only job is to make your life pleasurable in every possible way to offset the hell that is OS390?

        I ask purely because your answer will help those of us not initiated into the ways of OS390 understand exactly how ghastly it is...

      2. Anonymous Coward

        Re: I wonder if

        They could certainly pay me enough and I still love assemblers of all varieties. Only the part about living in Britain, were it a requirement, would give me pause.

        1. bazza Silver badge

          Re: I wonder if

          "Only the part in living in Britain, were it a requirement, would give me pause."

          Booo!

          1. Solmyr ibn Wali Barad

            Re: I wonder if

            Why boo? He may have his life set up in a way that means he doesn't fancy moving anywhere.

            1. yoganmahew

              Re: I wonder if

              "Real programmers write OS390 assembler in 4k blocks."

              Yes I do... you can have 8k now if you go baseless... 64 bits too...

              I doubt it's on TPF; I can't see that they'd have that need for speed.

              As to code test coverage - nice one. Learnt that one in a book, did you? Try taking a forty-year-old system, figuring out the different data types that can be fed into it and the number of combinations needed to ensure 100% coverage. Grains of sand on a beach doesn't even come close.

  8. raving angry loony

    Skimped on IT?

    Far be it from me to disagree with NATS' claim that they haven't skimped on IT, but this kind of failure seems to be exactly the kind of thing you'd expect when proper design and testing isn't scheduled or funded. If the description of the problem is even vaguely accurate (which is a big "if", admittedly) then, given the severity of the results, I'd expect any attempt to set the wrong values to have been caught by a correctly designed and tested system. That it wasn't makes me wonder what other shortcuts they've taken.

    1. Alan Brown Silver badge

      Re: Skimped on IT?

      "this kind of failure seems to be exactly the kind of thing you'd expect when proper design and testing isn't scheduled or funded."

      They spent shitloads on the system. The problem is their idea of resilient was "two identical systems" - that's redundancy, not resiliency.

      Testing is always a bit hard for stuff of this complexity. No matter how idiotproof you make it, there's always a smarter idiot to break it - which is a good argument for not having 2 identical systems.

      On the same general theme, this is the same reason you shouldn't make a raid array out of identical drives from the same manufacturer's batch, even though most raid vendors do exactly that.

      1. Yet Another Anonymous coward Silver badge

        Re: Skimped on IT?

        Their idea of resilient was a perfectly safe manual system, which they switched to and which worked, albeit at reduced capacity.

        Obviously they could have spent more to reduce the risk of this very rare event - in the same way that we always build 3 identical motorways in parallel so we can switch traffic to a backup in the much more common occurrence of an RTA.

      2. Anonymous Coward

        Re: Skimped on IT?

        It was a hard design limit and should have resulted in notification that the design limit, wherever found, was about to be exceeded. If you have a stepwise function, that's the absolute least you should do. Hell, Y2K didn't affect any systems I designed, for that very reason: rollover was designed around, as I couldn't ensure the damn machine or OS would handle it in a tolerant fashion. (Never, ever, accept externally applied values without validation, even if it's just a sanity check.)
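
        A minimal sketch of that principle in Python (limit value and names are illustrative, not taken from the report): validate the input, warn as the limit approaches, and refuse cleanly at the limit.

        ```python
        LIMIT = 193          # hard design limit (illustrative value)
        WARN_FRACTION = 0.9  # notify once 90% of the limit is consumed

        def admit(active, requested):
            # Never accept externally applied values without validation.
            if not isinstance(requested, int) or requested < 1:
                raise ValueError(f"rejecting nonsense input: {requested!r}")
            if active + requested > LIMIT:
                raise RuntimeError(f"refused: would exceed design limit of {LIMIT}")
            if active + requested >= WARN_FRACTION * LIMIT:
                print(f"WARNING: {active + requested}/{LIMIT} consumed - "
                      "review capacity before the limit is hit")
            return active + requested
        ```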

      3. Tom 13

        Re: always a smarter idiot to break

        I've always preferred the phrase:

        nothing is ever foolproof because fools are so ingenious

        I have no idea from whom I stole it.

  9. Anonymous Coward

    Just "193 Atomic Functions"?

    Why so small a number...was this app written in COBOL?

    Sounds like the solution is to modernise on to COTS...maybe 1 x64 server could replace that mainframe???

    1. Bill 21

      Re: Just "193 Atomic Functions"?

      I'd assume it failed somehow when they tried it with 194, but just kept running when they used 193. Magic numbers are like that.

    2. vagabondo

      Re: Just "193 Atomic Functions"?

      I do not think that they are talking about database transactions here.

      From the article:

      "All of the operational roles performed within the London Area Control have a unique identifier known as an Atomic Function".

      Together with the mention of "signing off" and an unused workstation, I understood that "active Atomic Functions" was related to the number of ATCs logged in. Other information about operator mis-keying while logging out might point to a poorly programmed log-out sequence that permits the operator to be apparently logged out without releasing their "Atomic Function" token.

      Just my guess.
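
      If that guess is right, the bug would look something like this sketch (Python; every name here is invented - the real SFS internals are not public):

      ```python
      class TokenRegistry:
          # Tracks active "Atomic Function" tokens against a hard limit.
          def __init__(self, limit):
              self.limit = limit
              self.active = set()

          def acquire(self, name):
              if len(self.active) >= self.limit:
                  raise RuntimeError("Atomic Function limit reached")
              self.active.add(name)
              return name

          def release(self, name):
              self.active.discard(name)

      class Workstation:
          def __init__(self, registry):
              self.registry = registry
              self.token = None

          def sign_on(self, function_name):
              self.token = self.registry.acquire(function_name)

          def sign_off(self):
              # The fix: release on *every* exit path. The guessed bug is an
              # early return on a mis-keyed command that skips this release,
              # leaving the token consumed by a dead session.
              if self.token is not None:
                  self.registry.release(self.token)
                  self.token = None
      ```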

  10. Anonymous Coward

    There was an online system with three devices to handle traffic that had been working smoothly for some time. New connections were sequentially distributed to be roughly even across all three.

    Then one day all three went down within minutes of each other - an "impossible" scenario. They had all hit the upper limit of connections - which was well beyond the theoretical maximum the overall system could ever generate. Any one device was configured to handle more than that theoretical maximum if necessary.

    What had happened was that someone had decided to set up a test ping script - just to time the making of a connection. Unfortunately their script had a fault and never disconnected. Over several days its number of connections built up evenly over all three devices - before the coup de grace found the upper limit bug on near enough successive pings.
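
    In modern dress, the faulty script would look something like this (a Python sketch; host and port are placeholders):

    ```python
    import socket
    import time

    # Times the making of a connection, once a minute... but never closes it.
    while True:
        start = time.monotonic()
        s = socket.create_connection(("device.example.net", 7), timeout=5)
        print(f"connected in {time.monotonic() - start:.3f}s")
        # BUG: s.close() is missing, so one connection leaks per probe and
        # they accumulate evenly across the load-shared devices for days.
        time.sleep(60)
    ```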

    1. Crazy Operations Guy

      What kind of OS were you using that it didn't immediately close ICMP packets that it had responded to? Any modern OS would have closed those connections pretty quickly as part of basic Denial-of-Service mitigation. At the very least you should have installed a firewall in front of a machine like this to block connections from machines that capitalize on connections like that (Assuming whatever crap OS you were using wasn't capable of doing such itself).

      No wonder you posted anonymously - you were trying to hide your shame after such an embarrassing failure.

      1. Anonymous Coward

        "What kind of OS were you using that it didn't immediately close ICMP packets that it had responded to? "

        That was before TCP/IP and the internet. It was a private country-wide network with components from various suppliers - large and small. My role was to trouble-shoot unexpected system problems. Any embarrassment was with the customer, for putting an untested activity on their live system.

        I have a whole storybook of unexpected problems from the pioneering days of on-line systems. That experience is what the current IT industry is built on.

        1. Alan Brown Silver badge

          "What kind of OS were you using that it didn't immediately close ICMP packets that it had responded to? "

          I've seen plenty of "ping" scripts which open TCP/IP connections in order to test response time, not just round-trip time.

  11. Ken Moorhouse Silver badge

    Failure to report reason for shut-down

    In the scenario that actually occurred, there must be some kind of third system that adjudicates when there is a failure in order to switch across to the other, even if (and it should be the case that) the adjudicator is at least partially human-driven.

    Did the first system not report to the adjudicator to say "I am shutting down because..."? If so then a human being should have realized that the first system could "poison" the second system (which is what happened in practice).

    Having worked on systems with resilience built into them I appreciate that it is incredibly difficult to design for all eventualities: but what struck me when reading the report was that scant mention was made of the decision-making at the point where the first system decided to take a lie-down.
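
    Very roughly, the decision-making could look like this (a Python sketch with invented logic; the point is only that the shutdown reason must feed the failover decision rather than being discarded):

    ```python
    from dataclasses import dataclass

    @dataclass
    class ShutdownReport:
        cause: str         # e.g. "hardware" or "software"
        last_command: str  # the command in flight when the system stopped

    def adjudicate(report, journal):
        # A hardware fault won't repeat on the standby; replay is safe.
        if report.cause == "hardware":
            return "replay journal and fail over"
        # A software fault triggered by a journalled command *will* repeat.
        if report.last_command in journal:
            return "HOLD: human must review the journal before replay"
        return "replay journal and fail over"
    ```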

    1. Pascal Monett Silver badge

      "a human being should have realized"

      That, right there, is why mistakes keep happening.

    2. Solmyr ibn Wali Barad

      Re: Failure to report reason for shut-down

      "there must be some kind of third system that adjudicates when there is a failure"

      Yes, but who'll be watching the watcher? And if you get around it by making three systems equal, then one day you'll be looking for a minority report. With Tom Cruise and ginormous touchscreens involved. Har har.

  12. Crazy Operations Guy

    Limits on Atomic Functions

    So wait, they are running *at* the limit? I figured that something that is meant to guide billions of dollars in aircraft and tens of thousands of lives would have a ridiculous amount of extra resources available. I would think that the system should have a capacity of 512 Atomic Functions across the redundant pair - 256 each - so that if one fails, the survivor still has 256 - 193 = 63 free to cushion bugs (roughly a 33% buffer).

    My company is spinning down their s/390's, so should I be sending them over to NATS rather than just scrap 'em and sell the parts?

  13. Anonymous Coward

    So, just to be clear, the Reg's early assertion on the day that this was caused by Windows devices was just utter and complete b*llocks then?

    Well what a surprise.

  14. Roj Blake Silver badge

    Old Kit

    Even when Swanwick went live, a large chunk of its kit was so old that they had to buy in second-user machines.

    1. Warm Braw

      Re: Old Kit

      This is the thing I don't really get. Air Traffic Control isn't a uniquely South-of-England problem. There must be many instances of software around the world performing similar functions with very similar sorts of input. How did we end up with something quite so bespoke and apparently temperamental?

      1. Pascal Monett Silver badge

        Re: temperamental?

        One failure in 30 years of operation is hardly temperamental.

        1. Warm Braw

          Re: temperamental?

          It rather depends on what you mean by failure.

      2. Anonymous Coward

        Re: Old Kit

        These sorts of systems "evolve" way beyond the original design; no two are ever alike. Sure, you could replace the old system with a nice new one that's tried and tested elsewhere, but who's going to shut down S. English airspace for a month during commissioning? Any new system has to fit in with the oddities of the old; there's rarely a chance to start again from scratch.

      3. Alan Brown Silver badge

        Re: Old Kit

        " How did we end up with something quite so bespoke and apparently temperamental?"

        Just about every installation of this kind is bespoke. Cookie-cutter techniques have never been applied - and a large part of that has to do with "national pride" and "vested interests".

        1. Tom 13

          Re: Old Kit

          Also, a lot of "If it ain't broke don't fix it" coupled with tight budgets. I recall doing some temp work at the time of Y2K. My job was to go around to airports, plug in a floppy disk, run the scan, and put in another floppy disk to which the data was written. The systems we were scanning were hideously old by non-airport business standards, but were state of the art for them. Granted I was on the opposite side of the pond, but that's one commonality I expect the two systems have regardless of national pride or vested interests.

    2. kmac499

      Re: Old Kit

      For the very good reason that old kit has already had years in the wild, so its firmware and reliability were well understood.

      A bit like when BMW made F1 engines out of old road-going engine blocks. The previous high mileage had worked wonders on stress relief in the castings.

  15. John Smith 19 Gold badge

    So magic number on *two* systems didn't match up.

    A couple of points

    Non-matching numbers is not an IT problem; that's administration or source code control.

    OTOH if that number can be updated by human input that should be most carefully controlled.

    BTW is anyone thinking "Hmm, 193. Not 255 or 128?"

    Just so random.
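
    The source-control fix is old and boring: define the number once, and make any subsystem that carries its own copy cross-check it at startup. A sketch (module and names invented):

    ```python
    # limits.py - the one authoritative definition
    MAX_ATOMIC_FUNCTIONS = 193

    # Any subsystem built with a local copy verifies it on startup, so a
    # stale duplicate fails loudly at boot instead of silently years later:
    LOCAL_BUILD_LIMIT = 193
    assert LOCAL_BUILD_LIMIT == MAX_ATOMIC_FUNCTIONS, \
        "configured limit mismatch between subsystems - refusing to start"
    ```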

    1. F0ul

      Re: So magic number on *two* systems didn't match up.

      It's not random at all -

      11000001 (128 + 64 + 1).

      Seems quite obvious to me why 193 is the chosen number!
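
      (The arithmetic does check out:)

      ```python
      assert 0b11000001 == 128 + 64 + 1 == 193
      ```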

  16. John Smith 19 Gold badge

    Now if this is a CMM5 installation

    They'll analyse the development process and find why it wasn't picked up in development.

    Build a tool to find all the other places the code does not match.

    Verify there are reasons why they don't match or, if not, which (if either) is the correct value.

    Update the system

    Run the changed modules through regression testing (including the revised tests to find the new faults).

    Will it happen?

    Who knows?

  17. Tom Cooke

    So surprised nobody has mentioned the Ramans. Or have we worn that reference out in posts about redundancy by now?

  18. Nash

    I just hope....

    ...that when the sys-admins announced, "both channels have gone down".....someone from the other side of the office replied....."Surely you can't be serious!?"

    to much hilarity in the command centre, which then proceeded to use further Airplane! quotes throughout the day, just to lighten the mood.

    I do hope that happened.

    1. Tom 13

      Re: I just hope....

      Don't call me Shirley!

  19. Anonymous Coward

    A bit harsh

    Not really a cockup and certainly not a "Mega" cockup.

    1. Tom 13

      Re: A bit harsh

      Not at all. It adversely impacted a lot of people, and not just in the UK.

      And yes, you're quite welcome to apply the adjectives to the next US airport that makes a similar mistake. I have no illusions that it won't happen.

  20. Ken Moorhouse Silver badge

    The system that shut itself down poisoned the standby through replayed commands

    "Why the design philosophy was automatically to replay commands on SFS failure; whilst this is appropriate in some circumstances, on this occasion, repeating the command to enter Watching Mode led to the double failure of the SFS."

    As I see it, it was these journalled commands that caused the trigger to be repeated on the second system, bringing that down too. Presumably, if the first system had given notification of the reason why it was shutting down, then the list of commands to be journalled through could have been inspected prior to replaying them, in order to find out why failover occurred, thus sparing the second system a similar fate.

    It seems that this opportunity was not taken, presumably because time is of the essence in replaying commands; otherwise they may no longer be relevant. A flaw with systems such as this is that a human being has to make a very quick decision as to whether the shutdown occurred for hardware reasons, or software reasons, or something else. The report served up by the shut-down process may have indicated a fault that sounded hardware-related, and therefore a decision was taken to let the journal run through without any breakpoints set.
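
    A breakpoint-on-replay mechanism need not be elaborate. A rough Python sketch (all names invented; the real SFS replay mechanism is not public):

    ```python
    class ReplayHalted(Exception):
        pass

    def replay_journal(journal, standby, suspect_kind):
        # Replay journalled commands onto the standby, but stop before any
        # command of the kind implicated in the primary's shutdown, so a
        # human can inspect it first.
        for cmd in journal:
            if cmd["kind"] == suspect_kind:
                raise ReplayHalted(f"breakpoint before suspect command: {cmd}")
            standby.apply(cmd)
    ```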
