REVEALED: Titsup flight plan mainframe borks UK air traffic control

London's airspace was effectively shut down on Friday afternoon after a flight data server fell over, the National Air Traffic Services (NATS) has confirmed to The Register after multiple sources gave us specific details of the cockup. Hundreds of flights were cancelled or diverted after NATS was forced to restrict the …

  1. John Smith 19 Gold badge
    Unhappy

    There's nothing like state of the art hardware

    And this is nothing like state of the art hardware.

    That said they seem to have handled it well enough in the circumstances.

    I hope they will find a root cause to this and figure out what's up with their networking.

    1. Daggerchild Silver badge

      Re: There's nothing like state of the art hardware

      "And this is nothing like state of the art hardware"

      GOOD! Because I like my bedrock tried and tested, ta.

    2. Gordon 10 Silver badge

      Re: There's nothing like state of the art hardware

      @JS19

      OS/370 and its descendants were running mission critical worldwide apps when you were in nappies boy. Give me known failure modes over unknown any day of the week.

      1. John Smith 19 Gold badge
        Happy

        Re: There's nothing like state of the art hardware

        "OS/370 and its descendants were running mission critical worldwide apps when you were in nappies boy. "

        Could you say that in an "Emperor Palpatine" voice?

        For the more humor impaired in the audience I should say I absolutely agree. I wonder if this is the one they got off the FAA in the States, and whether it still has valves in it, as their last one is reputed to have had.

        Reading the story and the comments 2 things intrigue me.

        1) It looks like it was a "bug" in the data that borked the primary, then it did the secondary, which tried to switch back to the primary. So what kind of data can't be sanity checked before it's passed into the system (and of course will checking be added to the code now)?

        2) I did not know a Jovial compiler for IBM mainframes even existed. Historically it's been for deep embedded systems like aircraft flight computers, ECM systems, radars etc.

        1. Anonymous Coward
          Anonymous Coward

          Re: There's nothing like state of the art hardware

          As I'm sure JS19 already appreciates, there's also more than just a recompile/rebuild involved in moving an 'interesting' application from one hardware and network architecture to a different one. Testing the end product once rebuilt can be quite a challenge too.

          Doesn't mean it can't be done, doesn't mean it isn't sometimes a good idea, does mean that 'the usual suspects' may be inappropriate delivery partners, or out of their depth, or both.

        2. John Smith 19 Gold badge
          Coat

          Jovial mainframe compilers.

          Looking a bit further I found IBM did supply a JOVIAL compiler under their "Type III" license.

          IOW It was freeware.

          No guarantee supplied. Use at owners risk.

          Possibly not what you're looking for to crunch the code for your mission (and life) critical ATC app.

          The commercial ones were hosted on it as cross compilers for things like the 1750A and Zilog 8002 (apparently the F16 was a design win for this puppy. Defense con-tractors. Crazy).

          My recollection of JOVIAL was it was common on DEC boxes but as cross compilers to deep embedded kit.

          1. Anonymous Coward
            Anonymous Coward

            Re: Jovial mainframe compilers.

            "My recollection of JOVIAL was it was common on DEC boxes but as cross compilers to deep embedded kit."

            As someone who has been around DEC kit being used for safety related stuff in UK for three decades (either as development host, or target platform (eg European Air Traffic Control), or both) I can safely say that although I encountered several instances of Coral and a handful of RTL/2, I came across writeups of the language but never came across a real Jovial, host or target, and I'm not aware that any of my US colleagues did either. [Iirc, a few of the Coral applications started with 'BEGIN CORAL; BEGIN CODE;' followed by in-line assembler for the remainder of the application]

            Jovial seems to be a bit niche, like its hardware compatriot, MIL1750.

            As for Z8002: my employers chose Z8002 in preference to M68K. I left shortly after that, before the sh*t really started to hit the distribution mechanism. They are still suffering for that mistake, maybe three decades later: every update for the ones still in service seems to risk requiring the code and data to be re-laid-out, as already referred to with PDP11 overlays, except without the support infrastructure provided on a PDP11 OS.

            1. John Smith 19 Gold badge
              Coat

              Re: Jovial mainframe compilers.

              "As someone who has been around DEC kit being used for safety related stuff in UK for three decades (either as development host, or target platform (eg European Air Traffic Control), or both) I can safely say that although I encountered several instances of Coral and a handful of RTL/2, I came across writeups of the language but never came across a real Jovial, host or target, and I'm not aware that any of my US colleagues did either. [Iirc, a few of the Coral applications started with 'BEGIN CORAL; BEGIN CODE;' followed by in-line assembler for the remainder of the application]"

              A more complete list of JOVIAL apps (and development hosts) can be found here:

              http://progopedia.com/implementation/jovial/

              1. Anonymous Coward
                Anonymous Coward

                Re: Jovial mainframe compilers.

                Thanks for the pointer. Other than the aforementioned NATS software, the list is almost exclusively projects from/for the US military, which would account for why it wasn't real visible over this side.

        3. Byham

          Re: There's nothing like state of the art hardware

          The old IBM 9020D was a 6 machine cluster with 3 compute elements and 3 Input Output Elements. When one of the processors hit an out of range condition or error, it would stop the other processors, give them a start point in the program and all the environment variables, and all of the processors would run the same program to the same point. If only one of them failed, then the processor took itself off line as a hardware fault. If they all got the error, then it was a software fault. The entire core was dumped (as a box of hex printout!) and the system did a Startover where it dumped all the recent input messages. Controllers received a message saying STARTOVER - all messages after TIME should be reentered. In a well tested real time system most errors are caused by a timing fault or a bad input message. By throwing out the messages and restarting from a checkpoint say 7 seconds before the crash both of these problems go away. If Scroggins puts in the bad message again, the same result could occur, but this time the Data System Specialist will note that the same input message preceded the previous startover and would have a one way exchange with Scroggins about his message.

          The idea that you can just pass the same broken input to an identical backup machine is bound to fail: the systems will cycle in failovers (been there, done that).
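
          A minimal sketch of the scheme described above, in Python rather than anything the 9020D ran. Everything here is an assumption made for illustration: the function names, the message strings, treating "some but not all units fail" as a hardware fault, and the 7 second window taken from the example in the text.

```python
from typing import Callable, List, Tuple

Processor = Callable[[str], None]  # hypothetical: raises an exception on error


def classify_fault(processors: List[Processor], message: str) -> str:
    """Replay one message on every processor and compare the outcomes."""
    failed = []
    for index, run in enumerate(processors):
        try:
            run(message)
        except Exception:
            failed.append(index)
    if not failed:
        return "ok"
    if len(failed) == len(processors):
        return "software"  # every unit trips on the same input
    return "hardware"      # only some units fail: take them off line


def startover(log: List[Tuple[float, str]], crash_time: float,
              window: float = 7.0) -> Tuple[List[Tuple[float, str]], List[str]]:
    """Split the input log at the checkpoint `window` seconds before the crash.

    Returns (entries kept, messages discarded for manual re-entry).
    """
    checkpoint = crash_time - window
    kept = [(t, m) for t, m in log if t <= checkpoint]
    discarded = [m for t, m in log if t > checkpoint]
    return kept, discarded


def healthy(message: str) -> None:
    # Stand-in for the real program: one specific input triggers the error.
    if message == "BAD":
        raise ValueError("out of range condition")


def broken_unit(message: str) -> None:
    # Stand-in for a processor with a hardware fault: fails on any input.
    raise RuntimeError("hardware fault")


print(classify_fault([healthy, healthy, healthy], "BAD"))
print(classify_fault([healthy, broken_unit, healthy], "FPL A"))

log = [(90.0, "FPL A"), (95.0, "FPL B"), (99.0, "BAD")]
print(startover(log, crash_time=100.0))
```

          The point of the second function is that the bad message is among the ones thrown away, so the restarted system does not see it again unless somebody types it back in.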

      2. Byham

        Re: There's nothing like state of the art hardware

        OS370??

        The software is based on MVT and OS360 - if you look at the software from the current DSS position it is still in 80 column card format.

    3. Fe26Mg12

      Re: There's nothing like state of the art hardware

      The only value to your comment is to highlight how little you know about Mainframe class hardware.

    4. swschrad

      article: two root causes, and one's easily fixed

      the harder one: comms failure.

      the easy one: borked flight plan submissions. answer: spit it back in the pilot's face, like we do in the US. fix your problem, then resubmit. anybody who ever put a card deck in the pigeonhole and got a barfed printout back from JCL with no program on it should understand that. I am advised the US flyboy system says where the problem was when a flight plan is rejected. the stack of NOTAMs tacked on the wall (or that's how it worked back when I found these things out) needs to be read again to avoid hitting the next issue.

  2. fowler

    Is the Register sure this is an IBM S390

    I work with mainframes and run some pretty old applications including one over 30 years old but it all executes on in-support Z series boxes running Z/OS. I would be surprised if you could actually get parts for an S390 now.

    1. Jack of Shadows Silver badge

      Re: Is the Register sure this is an IBM S390

      One of my "prize possessions" just happens to be an IBM token ring card.

      1. harmjschoonhoven

        Re: Is the Register sure this is an IBM S390

        One of my "prize possessions" just happens to be an (1) IBM valve. Picked it up on a scrapyard as a kid, should have taken the bucket full of them.

    2. Voland's right hand Silver badge

      Re: Is the Register sure this is an IBM S390

      I heard the same from a couple of blokes on the plane yesterday - it is a 15-year-old IBM mainframe rebadged by Lockheed.

      I ended up on a plane brought in "manually" half-way from a divert to Charles De Gaulle. Funnily enough, the people guiding it took a considerably more optimal path - the 320 cut in across Croydon and leveled onto approach somewhere over south London instead of taking the usual lumbering scenic route over all of London.

      The pandemonium at LHR was complete - the few planes coming in to land on manual guidance could not be unloaded. The few planes being unloaded could not get their luggage off the plane because the luggage transport was full of bags for the planes scheduled to depart. You name it.

      In any case - as most mainframe based systems it looks like it has an over-reliance on the mainframe never failing and no true primary-to-backup fallback. Mainframes fail very very rarely, however once they fail, you pay for the fact that the system was designed without system level resilience. Just like in this case.

      1. Roland6 Silver badge

        Re: Is the Register sure this is an IBM S390

        as most mainframe based systems it looks like it has an over-reliance on the mainframe never failing and no true primary-to-backup fallback.

        My thought, from the very little that has been published, is that the problem was not so much in the primary-to-backup fallback as in two things failing at once (system and network link) and, as I've also seen with many business continuity solutions, in insufficient attention being paid to the restoration of normal operations (fallback-to-primary).

  3. Puffin

    Hmmmm

    I wonder why they originally announced the problem as a terrorist threat...

    1. Doctor Syntax Silver badge

      Re: Hmmmm

      Standard operating procedure.

    2. Anonymous Coward
      Anonymous Coward

      Re: Hmmmm

      Let me venture a guess - because that allows using any means necessary to deal with complaining passengers.

  4. ScottME

    Properly engineered systems!

    IBM mainframes are boring and predictable. Exactly what you want for safety-critical infrastructure. Who cares how "old" it is - it gets the job done, with uptimes in years. Rather surprised a bad flight plan can cause problems though.

    1. Anonymous Coward
      Anonymous Coward

      Re: Properly engineered systems!

      "Rather surprised a bad flight plan can cause problems though."

      That does smell of totally inadequate software testing somewhere along the line, doesn't it? Which is sadly not unusual, given that software testing is boring, unglamorous and rarely given adequate resources.

      But this is one of NATS's primary systems, and NATS is a billion pound a year business; it seems inexcusable that user input can bork it.

      1. J J Carter Silver badge

        Re: Properly engineered systems!

        I'm thinking the data issue was because some eco-loon decided the carbon-footprint of each flight needs to be shown as it was bodged onto the data feed

      2. RW
        Boffin

        Re: Properly engineered systems!

        Can that system handle Unicode text? It's old enough that Unicode may be implemented either not at all or only in a rudimentary fashion. Have a pilot submit a flight plan with notes in any language written with characters outside the usual 256-character font set up, and kaboom? And there are a lot of such languages, among them Russian, Polish, Greek, Turkish, Georgian, Chinese, Japanese, Hindi, Thai, and a host of others.

        Hmmmm.

        Just to test El Reg's own system: ΣЩՊਊฒႪおナ两
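
        A quick way to see what "outside the usual 256-character set" means in practice. This is only a sketch: it assumes an EBCDIC code page such as cp037 (one of the codecs shipped with Python), and nothing in the article says which encoding the NATS system actually uses. The function name and the sample strings are made up.

```python
def fits_code_page(text: str, code_page: str = "cp037") -> bool:
    """Return True if every character can be encoded in the given code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


# Plain upper case Latin text fits in a single-byte EBCDIC code page.
print(fits_code_page("EGLL DCT LAM"))

# Greek, Cyrillic and Armenian letters have no slot in cp037.
print(fits_code_page("ΣЩՊ"))

# In EBCDIC the letter A is 0xC1, not the 0x41 of ASCII.
print("A".encode("cp037").hex())
```

        A front end that ran a check like this could bounce the message before it ever reached the old code, rather than finding out the hard way.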

        1. Anonymous Coward
          Anonymous Coward

          Re: Properly engineered systems!

          "Can that system handle Unicode text?"

          IF THE SYSTEM REALLY DATES BACK TO THE 1960S IT IS ENTIRELY POSSIBLE IT CANNOT DO LOWERCASE LETTERS NEVER MIND THE JOYS OF UNICODE.

          1. Black Betty

            Extended Binary Coded Decimal Interchange Code

            All the way to the D before Google auto-complete offered up EBCDIC.

          2. Mike Pellatt

            Re: Properly engineered systems!

            And show me the IATA (??) airport codes containing Unicode....

    2. tirk
      Facepalm

      Re: Properly engineered systems!

      ...can still fail if implemented badly. I remember being at a client's new datacentre many years ago where they had half a dozen or so S370s. There was a power failure affecting the whole site, but the UPS, followed by the on-site generator, kept the mainframes running. Unfortunately, some genius had wired the mainframes' consoles into the normal ring main, rather than the protected circuit, so whilst the 370s continued running, not a lot could be done with them! Doh!

    3. werdsmith Silver badge

      Re: Properly engineered systems!

      Considering CAA document CAP 694 for Flight Planning is 120 pages, I'm amazed there's no validation of the input data.

      It borks the mainframe and causes it to failover to another mainframe which is borking on the same data?

      As usual on The Register there will be someone who can explain why that's OK... but FFS!

      Get rid of the S390s and put in a couple of Raspberry Pis.

      1. Byham

        Re: Properly engineered systems!

        Every input is validated by specific programs for each input type, and any fault not only in syntax but also in logic is returned either to the inputting person or to operators as a 'referred reject' to be sorted out. The system is extremely resilient to input errors.

        The original design of the system when it had a startover was that it dumped all the input messages in the message input queue for the previous minute and told controllers to re-enter them. This stopped the cycling of fail overs that are bound to happen by sending the same broken message to an identical machine with identical software. That was part of the move to Swanwick and rehosting the old NAS host software from the IBM360's into a simple mainframe as a virtual machine (we said that wouldn't work at the time).

        However, I challenge anyone in the commercial world to have the same availability as NATS is getting from the Jovial/BAL software which is in the 99.999% or better range.
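
        For anyone wondering what that figure means in wall clock terms, the arithmetic is short. The sketch below assumes a 365 day year and counts any outage as downtime; the function name is made up.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365 day year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes per year a system may be down at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.2f} minutes/year")
```

        So 99.999% leaves a little over five minutes a year, against nearly nine hours at 99.9%.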

  5. Borg.King

    User submissions need pre-check

    We take customer data into our systems, and also have suffered from poor data causing issues in the past. These issues are now mitigated by having submission systems to check and reject any data not conforming to the required format or standards.

    I would have thought it a fairly easy procedure to verify flight plans are good at the point of submission, since this should not be a time or mission critical point in the process.
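
    Roughly the sort of thing I mean, as a minimal sketch. The field names and patterns are invented for illustration and are nowhere near the real flight plan format, which is far richer; the only point is that a submission is checked and bounced with a reason before it goes anywhere near the core system.

```python
import re
from typing import Dict, List

# Hypothetical, heavily simplified rules; not the real flight plan format.
RULES = {
    "callsign": re.compile(r"[A-Z0-9]{2,7}"),
    "departure": re.compile(r"[A-Z]{4}"),      # four-letter airport code
    "destination": re.compile(r"[A-Z]{4}"),
    "off_block_time": re.compile(r"([01][0-9]|2[0-3])[0-5][0-9]"),  # HHMM
}


def check_plan(plan: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the plan is accepted."""
    errors = []
    for field, pattern in RULES.items():
        value = plan.get(field, "")
        if not pattern.fullmatch(value):
            errors.append(f"{field}: invalid value {value!r}")
    return errors


good = {"callsign": "BAW123", "departure": "EGLL",
        "destination": "LFPG", "off_block_time": "1430"}
bad = {"callsign": "BAW123", "departure": "EGLL",
       "destination": "PARIS", "off_block_time": "2560"}

print(check_plan(good))
print(check_plan(bad))
```

    The rejected plan comes back with the offending fields named, so the submitter can fix it and resubmit.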

    1. jong

      Re: User submissions need pre-check

      oh it's in the right format alright.

      It's just they're heading to somewhere different to where they said they would.

      How do you fix that in your submission systems ?

      1. Yet Another Anonymous coward Silver badge

        Re: User submissions need pre-check

        >It's just they're heading to somewhere different to where they said they would.

        >How do you fix that in your submission systems ?

        Lasers (flying sharks optional)

        1. Paul Crawford Silver badge

          Re: User submissions need pre-check

          I want a flying shark, even without the laser it would be a cool thing!

          Oh and while I am dreaming, a castle or island lair so I can have a moat for said flying sharks to frolic.

          1. Stoneshop Silver badge
            Black Helicopters

            Re: User submissions need pre-check

            I want a flying shark,

            The guys who built Orville the CatCopter are working on one (sorry, I don't have pics available)

          2. Anonymous Coward
            Anonymous Coward

            Re: User submissions need pre-check

            Oh and while I am dreaming, a castle or island lair so I can have a moat for said flying sharks to frolic.

            You don't need a moat if they're flying..

      2. Anonymous Coward
        Anonymous Coward

        Re: User submissions need pre-check

        How do you fix that in your submission systems ?

        In this day and age the fix is often AAA or fighter jet scramble.

      3. John Smith 19 Gold badge
        Facepalm

        Re: User submissions need pre-check

        "It's just they're heading to somewhere different to where they said they would."

        Isn't that an alarm, not a systems crash type event?

      4. Matt Bryant Silver badge
        Facepalm

        Re: Jong Re: User submissions need pre-check

        ".....It's just they're heading to somewhere different to where they said they would.

        How do you fix that in your submission systems ?" It's called a data quality check - you check the submitted data against a historic record of activity and flag and reject anything out-of-scope. An example might be checking the flight number and the associated historic destination with the destination entered, or simply doing a check that the destination is within the safe flight range of the aircraft's fuel load. Such checks are common in financial systems and are used to detect fraud ("why is Mr X's credit card being used to buy a smartphone in Singapore when his purchase record shows he is in New York?").

        IMHO, someone cut some corners on the code (one bad data entry screwed the whole system?!?), but what's even more unacceptable was the cluster bouncing - no-one at IBM heard of non-automatic failback? TBH, re-write it for a distributed cluster and put the lot on a dozen Linux servers, then spend the savings on some real testing.
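
        A minimal sketch of the two checks mentioned above, for the avoidance of doubt. The flight numbers, distances and history table are made up, and a real check would use far better data than a dictionary; this only shows the shape of "flag and reject anything out-of-scope".

```python
from typing import Dict, List, Set


def quality_flags(flight_number: str, destination: str,
                  route_km: float, max_range_km: float,
                  history: Dict[str, Set[str]]) -> List[str]:
    """Return the reasons a submission looks out-of-scope (empty if none)."""
    flags = []
    usual = history.get(flight_number, set())
    # Check 1: does this flight number normally go to this destination?
    if usual and destination not in usual:
        flags.append(f"{destination} not in history for {flight_number}")
    # Check 2: is the destination within range for the fuel load?
    if route_km > max_range_km:
        flags.append(f"route {route_km:.0f} km exceeds range {max_range_km:.0f} km")
    return flags


# Hypothetical historic record: flight number -> destinations seen before.
history = {"XY123": {"LFPG"}}

print(quality_flags("XY123", "LFPG", 350, 5000, history))
print(quality_flags("XY123", "KJFK", 5500, 5000, history))
```

        Anything that comes back with flags gets referred to a human instead of being fed to the system.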

    2. billse10

      Re: User submissions need pre-check

      Has that Bobby Tables guy been filing flight plans again?

  6. Gordon 10 Silver badge

    Software release issue.

    The airline industry is riddled with legacy apps running on OS/370 and its descendants. When I were a lad you weren't a man until you triggered at least a Ctrl-3 core dump on Prod.

    A friend of mine once took out ticketing for all of Italy for 8hrs with a particularly buggy piece of assembler.

    If it is OS/390 I wonder if it's an ALCS/TPF relative?

    1. Anonymous Coward
      Anonymous Coward

      There's legacy, and there's legacy

      That well known journal of record for the IT sector, the Daily Telegraph, reports that the software in question was written in Jovial, which if true might explain why long-standing bugs still haven't been corrected - there probably aren't many offshorers with Jovial experience, and people with Jovial experience are so old that they want to be paid a decent rate, either for fixing the code or for training someone to fix the code.

      1. Anonymous Coward
        Anonymous Coward

        Re: There's legacy, and there's legacy

        Did I get downvoted for not including a link to the Telegraph article, or for some other unstated reason?

        Anyway, here's the link, let's see if the El Reg revamp makes plaintext clickable if it's a URL:

        http://www.telegraph.co.uk/news/aviation/11291495/UK-flights-chaos-Air-traffic-control-computers-using-software-from-the-1960s.html [edit: apparently no autoclickability. Someone else's software is seriously outdated :(]

        Extract:

        "A consultant who has worked for Nats said it knew its software needed to be replaced a decade ago but will be relying on the 1960s programmes for another two years.

        Martyn Thomas, Visiting Professor of Software Engineering at the University of Oxford, said: “The National Airspace System that performs flight data processing was originally written for American airspace in the late 1960s.

        “It wasn’t designed to cope with the volume of air traffic we have today, or to interface with modern computer software.”

        Prof Thomas said the NAS system was written using a now defunct computer language called Jovial, meaning Nats has to train programmers in Jovial just to maintain the antiquated software."

        [continues]

        1. Roland6 Silver badge

          Re: There's legacy, and there's legacy

          Prof Thomas said the NAS system was written using a now defunct computer language called Jovial,

          According to wikipedia ( http://en.wikipedia.org/wiki/JOVIAL ), to say JOVIAL is now defunct is overstating things. But I've not kept abreast of recent developments, so does anyone know what is now being used instead of JOVIAL? (I'm a little surprised the wikipedia article doesn't mention this, so presume the obvious candidate - ADA, isn't quite so obvious or universally used).

          1. John Smith 19 Gold badge
            Coat

            Re: There's legacy, and there's legacy

            "But I've not kept abreast of recent developments, so does any one know what is now being used instead of JOVIAL? (I'm a little surprised the wikipedia article doesn't mention this, so presume the obvious candidate - ADA, isn't quite so obvious or universally used)."

            JOVIAL was big for real time control apps. IIRC it did the software for the B52, B1 and F15 at least (off the top of my head). The USN (being the USN) had something else (CSL?, something with a C in it)

            I guess the UK equivalent were things like CORAL66 and RTL2 (ICI's in house computer language. No that's not a typo).

            In theory Ada was meant to be the cure for this babel of DoD languages (including most of the assembler). But writing a full Ada compiler is a not trivial exercise and the DoD has a lot of odd hardware knocking about, and getting conversion tools to convert old-bonkers-software-originally-running-on-valve-processors has turned out to be a tad expensive.

            The big surprise (for me) was having a Jovial compiler for an S/390 (or rather an S/360 as it would have been then). AFAIK when it's IBM mainframe and it's real time it was assembler (which is how NASA got theirs to deal with the Apollo programme).

            Yes that's an anorak.

            1. Anonymous Coward
              Anonymous Coward

              Re: There's legacy, and there's legacy

              "But writing a full Ada compiler is a not trivial exercise "

              These days gcc and gnat mean that in general 'only' the code generating bits need to be target specific, other kind and clever people have done most of the rest in a target-independent fashion, and you can have it (source included) for free.

              You'd still have to be 'a bit special' to want to do your own compiler, but at least one UK aerospace company apparently have done it:

              http://gcc.gnu.org/wiki/cauldron2012?action=AttachFile&do=get&target=petergarbett1958.pdf

              1. Anonymous IV

                Re: There's legacy, and there's legacy

                I suppose most people know that JOVIAL is an acronym for "Jules' Own Version of the International Algorithmic Language", named in the heady days of the 1960s when such whimsy was quite acceptable.

                In these more enlightened tennies, who would ever dream of giving a version of an operating system a ridiculous name such as Flatulent Ferret or Mangy Mongoose? It just couldn't happen, could it...

                1. RW
                  Headmaster

                  Re: There's legacy, and there's legacy

                  A reminder that the "International Algorithmic Language" referred to is Algol, but whether Algol-60 or Algol-59 I do not know.

                  When I worked for Burroughs back in the day, I was once shown the two file drawers containing the punch card source of a Jovial compiler for Burroughs' "large systems". It was never finished. (Burroughs had significant aeronautic expertise.)

                  1. Roland6 Silver badge

                    Re: There's legacy, and there's legacy

                    A reminder that the "International Algorithmic Language" referred to is Algol

                    Algol-58 was effectively a rename of IAL. There is a good piece on the circumstances prevailing in the late 50's that led Jules Schwartz to define JOVIAL in the 1978 article: http://jovial.com/documents/p203-schwartz-jovial.pdf

                    1. Pigeon

                      Algol-58

                      What is this deflationary usage. I remember Algol 57?...

                      As far as I know, Algol60 was the first definition of the Algol language. I did a lot of work in Algol68, the revised edition. Maybe you meant 68, not 58.

                      What is Algol59? Really, the mists of time are floating round me

              2. John Smith 19 Gold badge
                Unhappy

                Re: There's legacy, and there's legacy

                "These days gcc and gnat mean that in general 'only' the code generating bits need to be target specific, other kind and clever people have done most of the rest in a target-independent fashion, and you can have it (source included) for free."

                Unfortunately the problem is not that you have a good compiler (which is maybe 1/3 of the problem). You need versions of the Ada standard packages and some version of the defined Ada development environment, ideally tools using the DIANA intermediate language.

                But you're still not done.

                You have to prove it. That's where you need a certified Ada validation suite from someone like NIST or BSI to prove that what your compiler does (and does not) compile meets the Ada standard exactly.

                Do I have to say you won't find one of these on the shelves at PC World?

                It's about giving the customer the certainty that the customer's code will do exactly what the standard says it will do (although whether they realize exactly what that is is another matter).

                I know. It's anal, it's bureaucratic, it's slow but it's how they roll.

                And honestly if you're sitting in one of those metal tubes in the sky would you really have it any other way?

            2. Phil O'Sophical Silver badge

              Re: There's legacy, and there's legacy

              I guess the UK equivalent were things like CORAL66 and RTL2 (ICI's in house computer language. No that's not a typo).

              That takes me back. At least some of the System X telephone kit was programmed in a variant of CORAL66 called PO (Post Office) CORAL, I remember learning it but never used it in anger.

              RTL2 was another matter. It seemed to incorporate the worst features of C and Pascal, with none of their redeeming characteristics, and a buggy compiler to boot. I remember compiling RTL2 to PDP11 assembler, and then having to get the overlays right to fit it all into 48K. Building got easier when the output was M68K, the compiler was no less buggy though.

              Programming today is so boring :)

              1. John Smith 19 Gold badge
                Unhappy

                Re: There's legacy, and there's legacy

                "RTL2 was another matter. It seemed to incorporate the worst features of C and Pascal, with none of their redeeming characteristics, and a buggy compiler to boot. I remember compiling RTL2 to PDP11 assembler, and then having to get the overlays right to fit it all into 48K. Building got easier when the output was M68K, the compiler was no less buggy though."

                AFAIK RTL/2 predates C and is around the same age as Pascal.

                One of Unix's lesser appreciated gifts to the world was putting YACC and lex into the hands of anyone who wanted them. Suddenly if you wanted a compiler (and were prepared to invest a relatively small amount of time) you could have it.

                Before that if you wanted a compiler it was fire up the assembler, and prepare for pain. I would suspect that RTL/2 (like early C) didn't really have a formal "standard" and at any given moment they either hacked the compiler to match the (desired) behavior or hacked the standard to formalize what the compiler could do (without massive surgery to its structure).

                With "hilarious" consequences all round.

                Keep in mind that the "classic" PDP 11s did not have memory management hardware (IIRC that came with the 11/780s and the VAX's ).

                BTW the British Teletext systems ran on PDP 11's running RTL/2 code before being retired in a C rewrite.

                1. Anonymous Coward
                  Anonymous Coward

                  Re: There's legacy, and there's legacy

                  "the "classic" PDP 11s did not have memory management hardware (IIRC that came with the 11/780s and the VAX's )."

                  Depends on exactly what is meant by "memory management".

                  If you meant "demand paged virtual memory" (seems unlikely?) you're right.

                  If you meant "a setup where physical memory address and program-visible logical address can be different, and where some regions of memory can be readwrite while others can be readonly, and some memory regions can be executed as instructions whilst others can only be used for data"

                  that would be incorrect for the volume-market PDP11s after the first few families.

                  "the British Teletext systems ran on PDP 11's running RTL/2 code before being retired in a C rewrite."

                  PDP11s were also used for some of the Swanwick Air Traffic Control systems; not sure what language or what OS (not Windows, not Linux, therefore must be mainframe, right?)

                  http://www.theregister.co.uk/2009/02/12/iris/

                  "Ada Compiler Validation"

                  All you get from passing the ACV tests is the knowledge that some version of the ACV tests were passed using some version of the compiler and some version of the runtime.

                  It is far from being a guarantee that the same compiler and same runtime will always generate correct code from valid input. For example one easy to describe problem I have seen first hand is where a change in the source code results in a branch in the object code being too 'distant'. On a good day with a decent compiler that particular error is detectable at compile time rather than being hidden till runtime conditions provoke it.

                  I have seen other more complex compiler errors, such as where the compiler in some circumstances incorrectly initialises relocatable data (ie incorrectly initialises pointers). These errors were not detected at compile time, they led to unexpected behaviour (crashes) at run time.

                  Neither of the two errors described above are specific to the Ada gcc; other languages could have provoked the same problems.

                  Generic Ada is far too complicated for real critical work. Arguably MISRA C would be preferable in many cases. The currently widely accepted compromise answer to this is using an Ada subset (e.g. Ravenscar is a decade and a half old) but Ravenscar does little or nothing to address code-generation issues with compilers.

                  1. John Smith 19 Gold badge
                    Holmes

                    Re: There's legacy, and there's legacy

                    "All you get from passing the ACV tests is the knowledge that some version of the ACV tests were passed using some version of the compiler and some version of the runtime."

                    Ever wondered why embedded dev teams who do life threatening mission critical code are very reluctant to change their tool chain, including new releases?

                    1. Anonymous Coward
                      Anonymous Coward

                      Re: There's legacy, and there's legacy

                      "Ever wondered why embedded dev teams who do life threatening mission critical code are very reluctant to change their tool chain, including new releases?"

                      Indeed.

                      gcc 3 (.3? .4?) is still the toolchain of choice at the outfit I'm most familiar with. IT want it run on Windows, and the reference environment was Windows XP; not sure what's happened now that IT have got XP-phobia. It has been shown that the generated code can vary between underlying OSes with nominally identical compilers if different versions of the OS are used.

                      However, my main point was that the ACV tests do not demonstrate that the compiler(etc) is correct. They demonstrate that the compiler(etc) passes a particular set of tests at a given point in time. As you rightly point out, everything beyond that is a gamble.

          2. J J Carter Silver badge

            Re: There's legacy, and there's legacy

            We need Object JOVIAL!

          3. P. Lee Silver badge
            Joke

            Re: There's legacy, and there's legacy

            re: Jovial. How can a language be called "defunct" if it's still being used?

            But my main point: Why isn't NATS using the Cloud? If nothing else, the latency should be low.

        2. launcap Silver badge
          Happy

          Re: There's legacy, and there's legacy

          > meaning Nats has to train programmers in Jovial just to maintain the antiquated software

          Could be worse, could be using TPF.. (scary thought - for a while, and may well still be, every input into the Galileo reservations system went through some of my code - and it was code explicitly made non-reentrant by use of spin locks as the low-level routines it was using were not reentrant themselves)

          I was quite proud that I didn't manage a ctrl-1 or ctrl-3 catastrophic with that code..

          (And my wife worked in the comms team - the people who took the insanely variable data from airlines and sanitised it so that it could be understood by mere mortals - someone from SwissAir wrote an interpreted database-driven data parser language that took pseudo-BAL instructions and ran them on the fly.)

          Cool stuff - and even in the early 90's we had code that first saw the light of day in the mid to late 60s.

    2. Anonymous Coward
      Anonymous Coward

      Re: Software release issue.

      No idea what systems are like in more mission critical places... but our tills at work crash with a CTRL-V/C/X etc. It's really hard to remember NOT to paste with the keyboard, though strangely mouse pasting works.

      That's if we don't include the GUI errors putting text all over the place. Systems with no recovery options should the PC/GUI exit/crash half way/just after completing. Did not have time to press the "ok" button? Tough, the system has now just dumped the entire work job in the bin with no option to recover, and oh, it was one sent to you by an automated-only system, so no option to rebuild it.

      :(

  7. PNGuinn
    Mushroom

    Bad flight plan?

    "Invariably someone puts a flight plan wrong and it borks the system"

    Seems that this is a fairly regular event then.

    "If the same data goes into the backup server it will sometimes fall over on same processing problems, and start switching back and forth with the main server. When we get a switchover then the first thing that's usually done is to shut down the backup processor."

    So the data doesn't ALWAYS bork the system then.

    And the fix is to disable the backup system. Which is presumably there just for laughs??? WTF???

    If this info El Reg has gotten hold of is correct then there is something more than rotten with both the software and the operating procedures here.

    I wonder what other mischief - accidental or deliberate - could be done with a "bad" data entry?

    See Icon.

    1. Paul Hovnanian Silver badge
      Windows

      Re: Bad flight plan?

      "Invariably someone puts a flight plan wrong and it borks the system"

      You are lucky someone didn't try to fly a U2 through your airspace. It appears you may have hired the same people to write your ATC code.

      Welcome to the club.

    2. xyz

      Re: Bad flight plan?

      "Invariably someone puts a flight plan wrong and it borks the system"

      I'm trying to get my head 'round this.... no "Your data are in error, please resubmit?" No, "This field only accepts numbers?" Please tell me someone didn't submit data with an apostrophe in it and that's what caused the connection failure.

      Are they saying this was user error which brought the whole thing to its knees?

      1. Anonymous Coward
        Anonymous Coward

        Re: Bad flight plan?

        "Are they saying this was user error which brought the whole thing to its knees?"

        No. Bad data is not "a user error", it is to be expected. If NATS' IT bods can't do input validation they shouldn't be allowed anywhere near a computer. I was writing code for dual-redundant systems with failover thirty years ago, and it was a given that all user data entered would be validated by the code.
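        As a rough illustration of the kind of up-front validation being described, here is a minimal sketch. The field names, patterns and ranges are all made up for the example (real flight plan formats are far richer), so treat it as the shape of the idea rather than anything NATS actually runs.

```python
import re
from dataclasses import dataclass

# Hypothetical field rules, chosen only for illustration.
CALLSIGN_RE = re.compile(r"[A-Z0-9]{2,7}")
AIRPORT_RE = re.compile(r"[A-Z]{4}")


@dataclass(frozen=True)
class FlightPlan:
    callsign: str
    departure: str
    destination: str
    flight_level: int


def validate_flight_plan(raw):
    """Return (plan, errors); plan is None whenever any check fails."""
    errors = []

    callsign = str(raw.get("callsign", "")).strip().upper()
    if not CALLSIGN_RE.fullmatch(callsign):
        errors.append("callsign: expected 2-7 letters or digits")

    departure = str(raw.get("departure", "")).strip().upper()
    destination = str(raw.get("destination", "")).strip().upper()
    for name, value in (("departure", departure), ("destination", destination)):
        if not AIRPORT_RE.fullmatch(value):
            errors.append(f"{name}: expected a 4-letter code")

    flight_level = None
    try:
        flight_level = int(raw.get("flight_level"))
    except (TypeError, ValueError):
        errors.append("flight_level: expected a number")
    else:
        if not 0 <= flight_level <= 600:
            errors.append("flight_level: out of range 0-600")

    if errors:
        # Reject the record here so it never reaches the processing system.
        return None, errors
    return FlightPlan(callsign, departure, destination, flight_level), []
```

        The point of returning a list of errors rather than raising is that the submitter gets the "please resubmit" message while the downstream system never sees the bad record.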

      2. Alan Brown Silver badge

        Re: Bad flight plan?

        If the system is susceptible to bad input, how hard would it be to validate everything before it gets that far?

        If builders built houses the way programmers built programs, the first woodpecker to come along would destroy civilization.

    3. Daggerchild Silver badge

      Re: Bad flight plan?

      Reminds me of a friend describing a city share trader financial system.

      The client submits transactions as a CSV. On receipt the CSV will be *complete bollocks* with values in the wrong columns, newlines in the middle of the data, and any other human mistake you can imagine.

      The aim of the game is to salvage and survive this data to try and keep the client alive and happy, taking their money, while not going insane yourself.

  8. batfastad

    MP

    The MP says "Disruption on this scale is simply unacceptable".

    WTF? The Transport Secretary would prefer everything just carried on as normal during a major systems failure? Yeah just carry on launching passengers into the sky, it'll be fine!

    Honestly, where do they get these people!

    1. Yet Another Anonymous coward Silver badge

      Re: MP

      Well only a small proportion of them will have voted for the minister in question, and if he has a safe seat even that wouldn't matter

    2. Ossi

      Re: MP

      "WTF? The Transport Secretary would prefer everything just carried on as normal during a major systems failure?"

      I don't think he said that, did he? I think he'd prefer it that the system didn't go down in the first place. Wouldn't you? Sure, it's not the cleverest or calmest of responses, I'll grant you. But it's probably better not to over-react to an over-reaction.

    3. YetAnotherLocksmith

      Re: MP

      That was pretty much my first thought too!

      "Disruption on this scale is simply unacceptable." Bloody idiot. The disruption is completely acceptable. That's like saying "This water damage is totally unacceptable" after a fairly major fire.

      1. P. Lee Silver badge

        Re: MP

        +1

        Plan for failures and make them non-lethal. I'll take Jovial on an s/390 and the occasional bout of organised delays over a new and no doubt expensive system written in .NET thank-you.

    4. Tim Almond

      Re: MP

      Of course, if people had then died, he'd have been complaining that NATS had been irresponsible.

      In the grand scheme of government screwing up, this hardly even registers with me. I've had delays of a couple of hours waiting for a train and no-one at the station seemed particularly bothered.

  9. publius

    Damn, these guys are good

    This (ATC) is an horrendously complex machine, with a whole lot of moving parts. I say "well done" to the controllers and to the grunts who wrote the code. Comes down to the human factor in the end, and the humans are more than up to the challenge - pilots and controllers. I'll get pounded for this here, but I just love it when the machines fail and the humans step in. I know, I'm old and outdated, but I still believe in carbon units!

    1. Paul Crawford Silver badge

      Re: Damn, these guys are good

      You could argue: Never have a system that you can't manually work around for the time when (not "if") it goes tits-up.

      Massive inconvenience, true, but no one died so that is a pretty good outcome.

  10. Grease Monkey

    Back in 1981 I was in the US during the air traffic controllers strike. We flew from Wichita to Dallas Fort Worth then out to the UK. No ATC, no 21st century technology. Also no significant delays. Mostly just pilots using their brains and eyes and talking to each other.

    Apparently this is progress. A computer goes down and flights that don't even cross that computer's controlled airspace are affected.

    Question #1: What's the business continuity plan?

    Apparent answer: What's a business continuity plan?

    1. RegGuy1

      Apparently this is progress

      Yeah, it's called having more than one aeroplane in the sky at any one time.

      BTW, did they have aircraft in 1981? I know they had (sky)trains. [For a while -- BA bar stewards, but that's another story.]

    2. Decius

      I'm sure that the people who were controlling traffic at ICT and DFW during the strike will be surprised to hear that your pilot didn't contact them.

  11. Peter Prof Fox

    So the power-outage was bollocks

    Why was the problem described as power-supply failure? Presumably as 'any old rubbish' to fob-off the media (and ultimately me) rather than admit they didn't have a clue what the issue was.

    Just like we have 'Streisand effect' for one sort of foot-in-mouth so we need a tag for 'we know it's a shambles, safety is our first priority, our engineers are working round the clock', and so on. When will these people learn that their reputation depends on us having confidence and is undermined by bullshit.

  12. A Non e-mouse Silver badge

    So the flight data server stores where the aircraft intends to go (the flight plan), whilst the flight server has the radar data (where the aircraft is right now).

    The author says that if the flight data server goes offline for eight minutes safety features kick in as aircraft can travel a long way in that time.

    But the flight data server only holds where the plane intends to go. The flight server continues to know, from the radar, where the plane is. So why this shutdown after eight minutes? I think we need a bit more information.

    1. Alex Brett

      The difficulty is you need to know what other aircraft are expected in order to properly plan deconfliction - e.g. the radar for a particular sector might have 3 aircraft all nicely separated vertically / horizontally with no problems, but because you couldn't track what was coming, you suddenly find you have 10 more arrive at once all on course to meet at the same point in the sky - there's a limit to how quickly you can get them all onto different headings / altitudes. If you knew in advance then you can get them sorted in other sectors prior to being handed over.

      There's also the problem that if you have to start asking each aircraft where it's going, that's a lot of time on the already busy radio taken up with the back and forth...
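      The planning point above (knowing what is expected before it shows up on radar) can be sketched as a simple count of planned sector entries per time window. Everything here is an assumption made for illustration: the `(callsign, sector, eta_minute)` tuples, the ten-minute window and the capacity figure are invented, and real flow management is much more involved.

```python
from collections import Counter


def overloaded_slots(planned_entries, capacity, window_minutes=10):
    """Flag sector/time windows where more aircraft are planned than capacity.

    planned_entries: iterable of (callsign, sector, eta_minute) tuples,
    a hypothetical flattened view of filed flight plans.
    Returns a sorted list of (sector, window_start_minute, planned_count).
    """
    counts = Counter(
        (sector, eta // window_minutes) for _, sector, eta in planned_entries
    )
    return sorted(
        (sector, slot * window_minutes, n)
        for (sector, slot), n in counts.items()
        if n > capacity
    )
```

      Without the flight plan data there is nothing to feed into a check like this, which is why losing it forces a cut in traffic even though radar still shows where everyone is right now.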

  13. A Non e-mouse Silver badge

    Mirrored systems

    The problem with lock-step mirrored systems is how you avoid the entire system going down when both halves are fed the same bad data.
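    A toy simulation of that failure mode, and of one possible mitigation (setting the offending input aside instead of replaying it on the other half). This is only a sketch under assumptions: `run_pair` and its arguments are hypothetical, the "replicas" are just names in a list, and a crash is modelled as the handler raising `ValueError`.

```python
def run_pair(messages, handler, replay_failed_input):
    """Simulate a primary/standby pair that both run the same handler code.

    Returns (replicas still healthy, messages set aside in quarantine).
    """
    healthy = ["primary", "standby"]  # active replica is first in the list
    quarantined = []
    for msg in messages:
        pending = True
        while pending and healthy:
            try:
                handler(msg)
                pending = False
            except ValueError:
                healthy.pop(0)  # the active replica falls over
                if not replay_failed_input:
                    # Park the message for a human to inspect rather than
                    # feeding it to the replica that has just taken over.
                    quarantined.append(msg)
                    pending = False
        if not healthy:
            break  # nothing left to process further messages
    return healthy, quarantined
```

    With replay enabled, a single bad message walks through both halves and leaves nothing running; with quarantine, the standby survives at the cost of one unprocessed record.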

    1. Paul Crawford Silver badge

      Re: Mirrored systems

      We have had some experience of fail-over systems and it is much harder to make it work properly than you imagine at first. You have a few rather tricky issues to address:

      1) On what conditions do you fail over? Total loss of one system is obvious (power off, kernel panic, etc) but what do you do if some part is down and others look OK? What exactly are the thresholds for action?

      2) If you go for something more useful than total outage, how do you make sure it's not triggered by a temporary condition (flood of data requests, etc) that might push system load up higher than normal, but is in fact an acceptable short term condition?

      3) When failing over, how do you ensure data completeness and integrity? If, for example, one hard disk on a NAS fails you could end up with partly written files and may not be sure of what the clients think was successfully written.

      4) How do you avoid the "split brain" problem when one system takes over from what it thinks is a failed mirror, but that mirror is still doing stuff with shared resources? If you go for powering down the failed system (AKA "shoot it in the head", zombie apocalypse style) to be damned sure it's not meddling with shared stuff, how do you then avoid the risk of mutually assured destruction if both lose the heartbeat link and more or less simultaneously kill the other?
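      For questions 1 and 2, one common pattern is to act only after several consecutive failed health checks, so a brief load spike does not trigger a failover. A minimal sketch follows; the class name and the threshold of three are assumptions for the example, and it says nothing about questions 3 and 4.

```python
class FailoverMonitor:
    """Request failover only after N health checks in a row have failed."""

    def __init__(self, misses_required=3):
        self.misses_required = misses_required
        self.consecutive_misses = 0

    def observe(self, heartbeat_ok):
        """Feed one health-check result; return True when failover should start."""
        if heartbeat_ok:
            # Any good check resets the count, so short blips are tolerated.
            self.consecutive_misses = 0
            return False
        self.consecutive_misses += 1
        return self.consecutive_misses >= self.misses_required
```

      Picking the threshold is the hard part: too low and you flap on transient load, too high and you sit on a dead system for longer than you would like.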

      1. Roland6 Silver badge

        Re: Mirrored systems @Paul Crawford

        Suspect these systems will have the added complication of "fail-safe"; but not necessarily the further complication of n-way voting.

        But we are talking about designing software in a way that the typical application programmer has no concept of.

      2. Anonymous Coward
        Anonymous Coward

        Re: Mirrored systems

        These are all good questions.

        They've also been answered since the days of NonStop from Tandem and VMS from DEC, to name but two. 1980s, maybe? Those two are both still around, despite current owner HP's best efforts to ignore them, and both will inevitably be moving off IA64 onto x86-64 before too long.

        Availability Digest has good writeups of practical examples:

        http://www.availabilitydigest.com/

      3. Anonymous Coward
        Anonymous Coward

        Re: Mirrored systems

        We have had some experience of fail-over systems and it is much harder to make it work properly than you imagine at first.

        Ain't that the truth. Too many people design HA solutions on the assumption that the HA part will still work flawlessly even when everything else is going wrong. They usually learn the hard way.

      4. Alan Brown Silver badge

        Re: Mirrored systems

        "If you go for powering down the failed system (AKA "shoot it in the head", zombie apocalypse style) to be damned sure it's not meddling with shared stuff, how do you then avoid the risk of mutually assured destruction if both lose the heartbeat link and more or less simultaneously kill the other?"

        That one's simple (I have to deal with STONITH systems on a daily basis) - set differing delay periods on each node, based on prime numbers. That way you can be sure that one will always win in such a situation instead of a Mexican standoff result.
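        A toy model of why differing delays break the tie. It assumes exactly two nodes that lose the heartbeat at the same instant, and that a node which has been fenced can no longer fire; the function name and the delay values are hypothetical. The sketch only depends on the delays being different, not on them being prime.

```python
def fencing_race(delays):
    """Return the node that survives a simultaneous heartbeat loss, or None.

    delays: mapping of exactly two node names to the number of seconds each
    waits before fencing (powering off) its peer.
    """
    (node_a, delay_a), (node_b, delay_b) = delays.items()
    if delay_a == delay_b:
        # Both fire at the same moment: mutual destruction.
        return None
    # The node with the shorter delay fences first, so the other never fires.
    return node_a if delay_a < delay_b else node_b
```

        For example, `fencing_race({"node1": 2, "node2": 3})` gives `"node1"`, while equal delays give `None`.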

    2. Alan Brown Silver badge

      Re: Mirrored systems

      Mirrored systems only protect against hardware failure.

      Proper redundant systems use parallel setups using different architectures and different languages to ensure this kind of thing doesn't happen.
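      A very small-scale illustration of the idea: two independently written routines compute the same result and a comparator refuses to continue if they disagree. This is a single-language toy (an integer square root, chosen arbitrarily), so it only hints at what dissimilar hardware and languages would buy you; all the function names are made up for the example.

```python
import math


def isqrt_library(n):
    """Channel A: rely on the standard library."""
    return math.isqrt(n)


def isqrt_bisect(n):
    """Channel B: independent implementation using bisection."""
    lo, hi = 0, n + 1  # invariant: lo*lo <= n < hi*hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid
    return lo


def cross_checked(n, channels=(isqrt_library, isqrt_bisect)):
    """Run every channel and fail loudly if the answers differ."""
    results = [channel(n) for channel in channels]
    if len(set(results)) != 1:
        raise RuntimeError(f"channels disagree: {results}")
    return results[0]
```

      A bug in one channel shows up as a disagreement instead of silently taking out both halves, though a wrong specification would still fool both.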

      1. Anonymous Coward
        Anonymous Coward

        Re: Mirrored systems

        "Proper redundant systems use parallel setups using different architectures and different languages to ensure this kind of thing doesn't happen."

        I hear this story from time to time.

        I understand the theory. I also understand that it costs more than most alternatives (it also neglects the opportunity for the spec being wrong, but put that to one side).

        So, are there any real world examples, from recent history, of places where dissimilar redundancy is used in a production system?

        1. Alan Brown Silver badge

          Re: Mirrored systems

          "So, are there any real world examples, from recent history, of places where dissimilar redundancy is used in a production system?"

          Most Boeing and Airbus fly-by-wire airliners have some form of dissimilar redundancy onboard.

          The ISS has dissimilar redundancy throughout.

          A lot of military aircraft use dissimilar redundancy in critical parts (such as automatic carrier landing controllers)

          Check chapter 28 of the Avionics Handbook.

          Yes, it's a bugger to spec out and yes it's expensive, and yes it doesn't cover all bases, but there are cases where the expense is worth it.

          1. Anonymous Coward
            Anonymous Coward

            Re: Mirrored systems

            "Most Boeing and Airbus fly-by-wire airliners have some form of dissimilar redundancy onboard."

            Thanks for that Avionics Handbook reference.

            Chapter 28 is an interesting read, but in the 2001 version I found [1], there was plenty of theoretical stuff I already knew a bit about, without (afaict) it providing any examples of actual production systems using dissimilar redundancy [0]. So, citation(s) still welcome, thanks.

            There are further chapters of the Avionics Handbook covering the 777, the A340, and other examples, but the copy I found no longer wants to play ball, so I don't know if those chapters describe real examples of dissimilar redundancy. I'd be delighted if they did.

            Aircraft that use UK-designed engine control systems, including engines such as RB211, Trent, etc definitely do NOT use dissimilar redundancy in the engine controls, and never have done so in any production system (to the best of my knowledge).

            Anyway you can have all the dissimilar redundancy you like in the avionics cabinets, but if the dual-redundant (two channels in one box, identical hw and sw) thing controlling the engines decides it wants the day off, and the other engines join it (e.g. due to a data-dependent fault), you're looking at another Gimli Glider (if you're lucky). No mechanical backup for the last decade or more either.

            Don't get me wrong, dissimilar redundancy is a great concept, and as engineer and as occasional passenger I'd be delighted to see more of it. I just don't yet know of a single instance in commercial avionics where it is actually used, and I assume that's down to cost. I would love to know of a real live example or two.

            All input gratefully received.

            [0] Meaning dissimilar redundancy of computer systems. Critical sensors, such as airspeed, are required to be multiply redundant with dissimilar types. AF447 proves even that isn't enough on its own sometimes.

            [1] http://www.davi.ws/avionics/TheAvionicsHandbook_Cap_28.pdf

  14. frank ly Silver badge

    Failure rates

    "... the machine had never had a hardware failure before so software was more likely."

    Is that actually a reasonable assumption? If a machine has never had a hardware failure, then doesn't the probability of failure increase as time goes by?

    1. 142

      Re: Failure rates

      I guess so, unless it's the same symptoms as a known software issue, which appears to be the case here. According to the article anyway.

    2. martin burns
      Facepalm

      Re: Failure rates

      yes, just like the probability of a dice throwing yet another 6 changes, because Lady Luck is offended at the run and sends her magic elves to come in and add phlogiston weighting....

      No, wait, you're talking bollocks.

      1. frank ly Silver badge

        @martin burns Re: Failure rates

        Have you ever tried useful/helpful explanations? I know they can be difficult and need some thought but ..... oh.

      2. David Black

        Re: Failure rates

        Technically because of the limited lifetime of physical components and their probability to fail characterised by MTBF then the longer you have gone without a failure, the more likely one becomes. As someone who was stuck on my own with a small child to amuse from 3pm to 11pm at the boarding gate (the shittiest wasteland of an airport) I'd love to know who to strangle.

        Also, given the massive increase in electronic devices, why do airports have so few power sockets anywhere? There was only 1 (until the revolting passengers also started unplugging the seemingly pointless flight information screens) outlet for a room designed to hold about 400 people.

      3. Anonymous Coward
        Anonymous Coward

        Re: Failure rates

        The chances of any given set of numbers coming up are independent of *when* the dice are thrown. Time is not relevant.

        The chances of any given [electronic] hardware failing at any given point are *not* generally independent of the time the hardware has been in use. Time is generally relevant.

        Bathtub curves do not apply to dice. Bathtub curves do apply to electronic equipment. Readers can look it up in the usual way.

        Bollocks indeed.
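        To make the dice-versus-hardware contrast concrete, here is a minimal sketch that builds a bathtub-shaped failure rate out of Weibull hazard terms, which is one common way to model it. The shape and scale parameters are invented purely to get the right picture and do not describe any real equipment; `t` is in arbitrary time units and must be greater than zero.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate of a Weibull distribution at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Sum of early-life, random and wear-out failure rates (made-up parameters)."""
    infant_mortality = weibull_hazard(t, shape=0.5, scale=10.0)  # falls with age
    random_failures = 0.01                                       # constant
    wear_out = weibull_hazard(t, shape=4.0, scale=20.0)          # rises with age
    return infant_mortality + random_failures + wear_out


def die_hazard(throw_number):
    """Chance of a six on any given throw: the same every time."""
    return 1 / 6
```

        Evaluating `bathtub_hazard` at an early, a middle and a late time gives high, low, then high again, while `die_hazard` never moves. Whether a particular machine is far enough along the curve for age to matter is a separate question.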

        1. Bakana

          Re: Failure rates

          Mainframe hardware has, for many years, included Self Diagnosing circuit cards which constantly test themselves and Duplicate Fail Over hardware in Each CPU.

          This is WHY Mainframe Failures are so rare.

          When a failing circuit card is detected, the OS passes control to the Fail Over card and sends E-Mail notices both to the System Operator and to IBM's Service Facilities. There have even been instances where the IBM Hardware engineer arrived at a Client site with the repair parts Before the System Ops got around to Reading their E-Mail.

          Systems of this nature perform so well that the Worldwide Average for Unscheduled Mainframe Downtime is less than 5 Minutes per year. This averages real problems where downtime lasting several Hours bangs against systems that have operated for the entire Life of the installation without Ever experiencing downtime.

          1. Alan Brown Silver badge

            Re: Failure rates

            "This is WHY Mainframe Failures are so rare."

            All the hardware protection in the world won't protect you against faulty software.

  15. Ossi

    Can I just say what a nicely written report that was - objective, informative and concise. The best report about this incident that I've read. Nice work.

  16. Christoph Silver badge

    "Invariably someone puts a flight plan wrong and it borks the system"

    "If the flight data processing system is down for more than eight minutes, the flight server alerts controllers that the data it is getting is stale."

    So if one flight is well off its declared route, it can mean that it's eight minutes before the controllers go to checking the radar directly.

    What happens if a second flight goes wild within that eight minute window? Does it get noticed?

    Could terrorists exploit this by hijacking two aircraft? Use the first to crash the system by flying wildly, then the second has eight minutes before the RAF get notified.

    1. David Black

      I suspect it's not a great discussion to participate in on a public forum. Though I too am somewhat curious, it's more likely to have me wearing orange and eating my dinner through my arse than generate intelligent public scrutiny of how a system's vulnerability could be harnessed in a dangerous way and what could be done to mitigate it.

      1. PNGuinn
        Flame

        Which neatly describes why the terrorists have already won.

        Only the inside exspurts who know (or think they know) are allowed to comment. Everyone else is a terrorist. Trust us WE know best - Our freedom is important to us.

      2. Bloakey1

        <snip>

        " it's more likely to have me wearing orange and eating my dinner through my arse"

        <snip>

        You sir get an up vote and I will hold on the cheese and crackers at the end of the meal.

        A partially deaf old man was asked by the doctor if he could have a stool sample, a urine sample and a sperm sample.

        The old man said to his wife "what did he say?"

        His wife said "they want your underpants".

    2. Fonant

      The problem seems to be not so much knowing where the planes are, or what height they're at, or what their current course is (provided by radar and other systems) but more not knowing what their planned route is, which affects their location ten or more minutes in the future (provided by the flight plan database). Modern ATC is mostly a queuing system, that sends planes to bits of the sky as needed to maximise efficiency.

      The flight plan database allows for planning in advance, as described in the article, which makes it easier to squeeze as many planes as possible into the limited airspace. The problem these days is that the airspace is already 98% used, so without future flight plan knowledge you have to start reducing the number of planes in the air. Otherwise several planes could plan to be in the same place in half an hour's time without ATC knowing.

      Since terrorists won't submit their flight plans honestly if they intend to crash the plane somewhere sensitive, this is all irrelevant to the 9/11 type of terrorist attack.

      1. Alan Brown Silver badge

        "The problem these days is that the airspace is already 98% used"

        Not really. Air corridors are heavily used but they account for less than 10% of the actual airspace available and things are setup that way because antiquated computer systems can't handle the complexity of myriad "best route" paths.

        Sorting this was one of the (failed) objectives of the last great North American ATC rebuild.

        Effectively, current ATC systems _increase_ the danger, because they push most traffic into geographically constrained areas for "convenience" reasons instead of letting it spread out.

        It's another one of "those things" which could be alleviated for substantially less than the cost of bailing out a few banks and which would pay economic dividends in short order. However it's not sexy, which means that getting governmental approval to invest the necessary sums of money simply won't happen until the entire thing comes crashing down around our ears (and a crash investment means poor design choices).

  17. Anonymous Coward
    Anonymous Coward

    Flying cars

    If there are problems now, what about when flying personal transports (flying cars) are introduced - if ever.

  18. Ralph Online
    Black Helicopters

    ATC running on a mainframe? In this day and age that's ridiculous! Surely it's most appropriate to run these applications "in the Cloud" ;-)

  19. MotionCompensation

    Disruptive technology

    This is proof that mainframes are disruptive technology.

  20. J J Carter Silver badge

    This would be a good opportunity to examine whether a central ATC is optimal when transponders + computers in each aircraft could, like a swarm of birds, create a self-organising complex system.

    1. John Smith 19 Gold badge
      Unhappy

      "This would be a good opportunity to examine whether a central ATC is optimal when transponders + computers in each aircraft could, like a swarm of birds, create a self-organising complex system."

      The trouble with all such brilliant ideas such as yours is they fail to account for all the stuff that's in the sky that does not have the room/power/aerial to mount a part of your "swarm."

      That's better tracked from a central site (or rather a series of "central" sites, usually called "airports") and reported as a proverbial "unidentified flying object."

      1. Bloakey1

        <snip>

        "That's better tracked from a central site (or rather a series of "central" sites, usually called "airports") and reported as a proverbial "unidentified flying object.""

        Surely you mean an Extraordinary Rendition Flight.

        1. John Smith 19 Gold badge
          Big Brother

          "Surely you mean an Extraordinary Rendition Flight."

          Surely, citizen you know no such flight has ever entered UK airspace.

    2. Alan Brown Silver badge

      They can swarm in the sky as much as they want - they can only land or take off from any given strip one at a time and commercial ATC is as much about ensuring that aircraft don't leave the ground until the receiving end can be guaranteed to take them as it is about keeping aircraft from bumping into each other whilst on the way.

  21. hairydog

    Perhaps the software that installed the dreadful new El Reg page layout can be persuaded to go tits up.

    OK, if you want to improve the back end code, but why such a terrible user interface? The old one was fine: the new one is terrible.

  22. Anonymous Coward
    Anonymous Coward

    To explain what went on, in a nutshell

    ATC at large airports¹ rely on a number of assistance systems--largely electronic--to organise and monitor traffic, which allows for increased airspace capacity and therefore a more economical/profitable operation for more or less everyone concerned.

    If those systems become unavailable, as appears to have been the case yesterday in London FIR, ATC reverts to a different set of procedures which are not as efficient and result in a reduction of capacity in the affected airspace.

    ¹ For a very loose definition of "airport".

  23. werdsmith Silver badge

    Who Do We Believe?

    Interesting how the television news are giving a completely different explanation of the problem than that given here on The Register.

    1. Bakana

      Re: Who Do We Believe?

      My personal experience with a failure of this sort: Mainframe communicating with a Network when an Interrupt occurred and Communication was Lost turned out to be good old fashioned Human ID10T error.

      It took quite a while to track down because Management.

      Within Minutes after the interrupted server was back Up & Running, it was confirmed that both Mainframe AND the entire Network were Up and functioning perfectly, but Communication was Not Happening.

      Much Unproductive and Unhelpful arguing and Finger Pointing later, it was finally determined that the Mainframe URL for the network was Missing from the DNS Server.

      A bit after that, someone revealed that the DNS Server had been Rebooted during the outage. Finally, someone Checked the Backups for the DNS Server and discovered that the Mainframe URL had been entered Manually, many months earlier and Never made "Permanent".

      For all those months, the DNS Server had operated as if the Mainframe connection was a Temporary connection that would Go Away as soon as the "Session" was complete.

      6 Months Later, the Same Problem occurred AGAIN.

      The Same ID10T had reset the URL in the Exact Same manner and not managed to rub together the two brain cells that should have Fixed the problem the First Time.
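
      A toy model of that failure mode, for anyone who hasn't been bitten by it. This assumes a hypothetical name server that keeps manually added records only in memory unless they are explicitly saved; the class, method and host names are invented for illustration and are not any real DNS server's API.

```python
class ToyNameServer:
    """Hypothetical name server: runtime records vs. records saved to config."""

    def __init__(self):
        self.saved = {}    # stands in for the on-disk config; survives a reboot
        self.runtime = {}  # in-memory records; lost on reboot

    def add(self, name, address, permanent=False):
        # A manual entry always works immediately...
        self.runtime[name] = address
        # ...but only survives a reboot if it is also written to the config.
        if permanent:
            self.saved[name] = address

    def reboot(self):
        # After a reboot only the saved records come back.
        self.runtime = dict(self.saved)

    def resolve(self, name):
        return self.runtime.get(name)


dns = ToyNameServer()
dns.add("mainframe.example", "10.0.0.5")  # entered manually, never made permanent
before = dns.resolve("mainframe.example")
dns.reboot()
after = dns.resolve("mainframe.example")  # the record has silently gone
```

      Everything looks healthy until the first reboot, which is why this sort of thing can sit unnoticed for months.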

      1. John Smith 19 Gold badge
        Unhappy

        Re: Who Do We Believe?

        "For all those months, the DNS Server had operated as if the Mainframe connection was a Temporary connection that would Go Away as soon as the "Session" was complete.

        6 Months Later, the Same Problem occurred AGAIN.

        The Same ID10T had reset the URL in the Exact Same manner and not managed to rub together the two brain cells that should have Fixed the problem the First Time."

        "Bodge" coders (and their counterparts "bodge" sysadmins) can "fix" anything.

        They're just not very good at the critical thinking needed to ensure it stays fixed, by doing the necessary spade work.

        By their work, so will you know them.

        Unfortunately.

      2. Joseph Eoff

        Re: Who Do We Believe?

        What's with the random capitalization?

        1. Anonymous Coward
          Anonymous Coward

          Re: Who Do We Believe?

          > What's with the random capitalization?

          i Was wonDeriNg toO.

  24. JCcanuk

    in these time

    It is impressive that this incident was only 15 minutes. The sheer volume of an airport this size... wow! As for the complaints and negative comments, those customers and government officials should be grateful to the teams tasked with the logistics of getting this confirmed and back up so effectively. Every day we are inundated with the 'newest' terrorist threats on the 'western' world. So if there is an inconvenience of a few hours or even a few days - deal with it! It's better than the alternative - loss of life or worse the loss of your loved ones. Be thankful these securities are in place and hope that they do what they are supposed to when the time comes!

    1. Anonymous Coward
      Anonymous Coward

      Re: in these time

      You forgot to mention "think of the children".

      Apologies if your post was an attempt at irony.

      Hopefully your second post will be better.

      Yes folks, (s)he signed up specially to post this.

      1. Anonymous Coward
        Anonymous Coward

        Re: in these time

        > You forgot to mention "think of the children".

        Jimmy Savile always did.

  25. Anonymous Coward
    Anonymous Coward

    Tech specs

    I was involved in the ATC software for the Swanwick centre at the very early days (prototyping, and implementation, in early 90's), and I recall the following on the radar display side of things:

    - a lot of low level software inherited from IBM which handled failovers and configurations of the sector suites

    - a rather large quantity of IBM RS/6000s

    - token ring LAN married to ethernet on the radar processing side

    - Ada for system software

    - X-Windows for the radar displays (with C or C++ IIRC)

    - vast Sony 2000x2000 pixel Trinitrons

    I'd love to know if there is anything of that lot left, maybe just the screens...

    1. Anonymous Coward
      Anonymous Coward

      Re: Tech specs

      Don't know about the hardware of which you speak, but the PDP11s and associated radar displays from Swanwick ended up working again at The National Museum of Computing.

      I don't think they'd have been running Ada or X; there's only so much you can fit in the directly addressable 32K words, or even in the maximum possible 4MB, even with overlays for code, virtual arrays for data, and memory mapping that allows chunks of your addressable 64KB to be moved around the potentially available 4MB.

      http://www.theregister.co.uk/2009/02/12/iris/

      1. Paul 75

        Re: Tech specs

        Ah, that was probably the radar-processing side of things (making sense of the radar inputs, churning out combined data to the LAN). Those would not run the "modern" stuff like Ada or C++, but were based on established proprietary products. The RS/6000s ran AIX, a flavour of Unix.

        Wow, I have to check out the National Museum of Computing, maybe something I worked on actually had historic value (rather than being thrown away without ever being released, like most things...)

  26. Anonymous Coward
    Anonymous Coward

    What really happened.

    The true problem here was that the system was limited to ~150 ATC workstations, due to prior hardware capacity limitations. Due to an influx of additional users, the next workstation started on the system pushed it over this limit and triggered an error condition, which shut the system down. This was the 'one line of code in 4 million' that the NATS CEO was referencing. They removed that condition check, as the hardware is substantially more capable now than when the code was originally written, and then restarted the systems.
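
    Purely to illustrate the kind of check described, and emphatically not the actual NATS code: a sketch in which every name is hypothetical and the limit of 150 is just the "~150" figure above. It contrasts treating the limit as a fatal error with simply refusing the extra workstation.

```python
MAX_WORKSTATIONS = 150  # hypothetical, taken from the "~150" figure above


class CapacityExceeded(Exception):
    """Raised when the limit check is treated as a fatal error condition."""


class ToyFlightServer:
    def __init__(self, limit=MAX_WORKSTATIONS):
        self.limit = limit
        self.active = set()

    def sign_on_fatal(self, workstation_id):
        # The behaviour described: one workstation too many raises an
        # error that, if nothing handles it, takes the whole system down.
        if workstation_id not in self.active and len(self.active) >= self.limit:
            raise CapacityExceeded(workstation_id)
        self.active.add(workstation_id)

    def sign_on_graceful(self, workstation_id):
        # An alternative: turn the newcomer away and keep serving the rest.
        if workstation_id not in self.active and len(self.active) >= self.limit:
            return False
        self.active.add(workstation_id)
        return True


server = ToyFlightServer(limit=2)
server.sign_on_fatal("ws-1")
server.sign_on_fatal("ws-2")
accepted = server.sign_on_graceful("ws-3")  # refused, but nothing falls over
```

    Removing the check outright, as described, is a third option, and only a safe one if the hardware really can take the extra load.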

Biting the hand that feeds IT © 1998–2019