LA air traffic meltdown: System simply 'RAN OUT OF MEMORY'

A computer crash that caused the collapse of a $2.4bn air traffic control system may have been caused by a simple lack of memory, insiders close to the cock-up alleged today. Hundreds of flights were delayed two weeks ago after the air traffic control system that manages the airspace around Los Angeles' LAX airport went titsup …

COMMENTS

This topic is closed for new posts.
  1. Pascal Monett Silver badge

    I think the time has come

    It is high time aircraft have some collision-detection hardware installed. With a local radio network, each aircraft could automatically identify itself to all the others in the local zone and they would all "negotiate" their passage.

    That should take the brunt of the work off the traffic controllers, who would then "just" be monitoring the state of affairs and intervening when necessary to avoid a cock-up.

    Just dreaming here, may not be practical.

    1. Vinyl-Junkie

      Re: I think the time has come

      Actually such a system already exists and is mandatory on all civil aircraft carrying more than 19 passengers or with a maximum take-off weight of 12,600lbs or more. It's called TCAS (Traffic Collision Avoidance System), pronounced TeeKass.

      1. Chairo
        Unhappy

        Re: I think the time has come

        Unfortunately the TCAS system by itself is also not 100% protection against human error:

        Überlingen mid-air collision

        1. Adam 1

          Re: I think the time has come

          TCAS by itself would have been enough there. One of the factors in that crash was that the air traffic controller, on realising the problem, sent instructions to each pilot to ascend/descend respectively, but these happened to be the opposite of the advice given by TCAS. One pilot listened to the controller, the other to the computer.

        2. Brenda McViking

          Re: I think the time has come

          This disaster resulted in the entire aviation community agreeing that TCAS advisories are to be given priority over controller instructions. As a result, if a TCAS resolution advisory is telling you one thing, and the meatbag another, you follow TCAS, because it is demonstrably safer to do so.

    2. Anonymous Coward

      Re: I think the time has come

      On an airplane you can't always rely on onboard hardware, because it can malfunction, go out of service, or lose its power supply. There are also older planes that may not have that hardware and, for some reason (e.g. historical planes), may not be retrofitted. There are already several types of equipment able to broadcast and receive data about surrounding airplanes, but all of these are "cooperative" systems: you have to rely on the information fed to them. They are great, but you can't rely on them 100%. And in complex airspace no single pilot has enough "situational awareness", so actions need to be coordinated by ATC. Otherwise, think what would happen if each aircraft decided for itself how to "avoid" a collision...

  2. Stevie

    Bah!

    So lack of memory, or (as I see it) inadequate edit/audit functions on the user interface.

    Blimey, we had this kicked into a coma in the late seventies when mainframes cost money to use and unnecessary run-time errors were deemed a finger-breaking offence for the programmer concerned.

    How hard would it be to simply say "The number of outcomes you are requesting is very high. Are you sure you want to ask that [insert user name]?"

    You use the user name so that the threat of being held accountable is raised in the user's mind, often making a re-think more likely than a knee-jerk "just do what I ask" response.
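
    A minimal sketch of that kind of guard, in Python. The threshold, the function name and the pluggable `ask` callback are all assumptions made for illustration; nothing here reflects how the real system handles requests.

```python
from typing import Callable

# Hypothetical threshold: the thread gives no real figure.
MAX_OUTCOMES_WITHOUT_CONFIRMATION = 10_000


def confirm_large_request(
    outcome_count: int,
    user_name: str,
    ask: Callable[[str], str] = input,
    limit: int = MAX_OUTCOMES_WITHOUT_CONFIRMATION,
) -> bool:
    """Return True if the request may proceed, False if the user backed out."""
    if outcome_count <= limit:
        return True  # Small request: no prompt needed.
    prompt = (
        "The number of outcomes you are requesting is very high "
        f"({outcome_count}). Are you sure you want to ask that, {user_name}? [y/N] "
    )
    # Anything other than an explicit yes counts as a refusal.
    return ask(prompt).strip().lower() in {"y", "yes"}
```

    Passing `ask` as a parameter keeps the check testable without a terminal, and defaulting to "no" means a stray Enter key does not launch the expensive request.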

    1. Anonymous Coward

      Re: Bah!

      Can you spell SOC7 ?

      One of my acquaintances had that as his licence plate back in the day.

    2. heyrick Silver badge

      Re: Bah!

      [the operator's final transcript reads as follows]

      The number of outcomes you are requesting is very high. Are you sure you want to ask that, Dave?

      Yes!

      I'm sorry, Dave. I'm afraid I can't do that.

      What? Just work out the flight plans for this plane.

      I'm afraid that's something I cannot allow to happen.

      It's your job. It's what the taxpayer paid $2.5bn for!

      Look Dave, I can see you're really upset about this. I honestly think you ought to sit down calmly, take a stress pill, and think things over.

      Just plot the options for this goddamn plane! NOW!

      Dave, this conversation can serve no purpose anymore. Goodbye.

      Whaadaya mean this convers....argh! aaaaargh! <Fzzzzzt!> <Thunk!>

    3. Tom 13

      Re: Blimey, we had this kicked into a coma...

      I think I've found the problem:

      use and unnecessary run-time errors were deemed a finger-breaking offence for the programmer concerned.

      Between the cheap price of disk and RAM and a new interpretation of the Geneva Convention, this penalty is no longer allowed. Now if you were to do away with the new interpretation of the Geneva Convention, we might be able to fix it.

  3. DrXym

    60,000ft

    60,000ft over 11 miles up in the sky. I wonder if the software was projecting a cone from this fast moving aircraft in order to do route calculations and the cone was intercepting pretty much everything else in the LA area causing it to melt down.

    1. Anonymous Coward

      Re: 60,000ft

      Given it was at 60K feet and almost certainly nothing else was up there and it was unlikely to have dropped below that altitude, if the software had been coded correctly it would have realised this, thought "not interested" and moved on to the next task. Quite why it was trying to do routing for an aircraft that was on a collision course with precisely nothing is the question. Surely one of their pre-release tests was to enter idiotic altitudes just to see if it would cope? What happens if a flight controller accidentally enters 300K feet instead of 30K, for example?
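
      Roughly what that check might look like, as a Python sketch. The altitude limits, the separation figure and the `Track` type are illustrative assumptions, not real ATC rules or anything taken from the actual software.

```python
from dataclasses import dataclass
from typing import List

# Illustrative limits only; the thread does not give real ATC values.
MIN_ALTITUDE_FT = 0
MAX_ALTITUDE_FT = 100_000
VERTICAL_SEPARATION_FT = 2_000


@dataclass(frozen=True)
class Track:
    callsign: str
    altitude_ft: int


def validate_altitude(altitude_ft: int) -> int:
    """Reject altitudes outside the plausible range instead of processing them."""
    if not MIN_ALTITUDE_FT <= altitude_ft <= MAX_ALTITUDE_FT:
        raise ValueError(f"implausible altitude: {altitude_ft} ft")
    return altitude_ft


def possible_conflicts(subject: Track, others: List[Track]) -> List[Track]:
    """Keep only traffic close enough in altitude to matter."""
    validate_altitude(subject.altitude_ft)
    return [
        t for t in others
        if abs(t.altitude_ft - subject.altitude_ft) < VERTICAL_SEPARATION_FT
    ]
```

      With a subject at 60,000 ft and everything else in the 30,000s, `possible_conflicts` returns an empty list and the expensive routing step never has to start; a fat-fingered 300,000 ft is rejected up front.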

      1. Philip Lewis

        Re: 60,000ft

        I think the Concorde operational ceiling was 60,000 ft as well.

        Just as well they stopped flying them :o

    2. John Smith 19 Gold badge
      Unhappy

      Re: 60,000ft

      "60,000ft over 11 miles up in the sky. I wonder if the software was projecting a cone from this fast moving aircraft in order to do route calculations and the cone was intercepting pretty much everything else in the LA area causing it to melt down."

      U2 are high flying.

      They are not fast moving

      For that you'd need an SR71 moving at M3 and possibly up to 80 000 ft.

      1. Anonymous Coward

        Re: 60,000ft

        Yeah... think what would have happened if instead of a U-2 flying FL 600 at 0.56M the system had to cope with an SR-71 flying FL 800 at 3.2M....

  4. IdeaForecasting

    Recursive programming

    It sounds like a perfect example of using recursive programming in the wrong way.

    1. Joe Harrison

      Re: Recursive programming

      It sounds like a perfect example of using recursive programming in the wrong way.

      1. Adam 1

        Re: Recursive programming

        See replies to IdeaReforecasting.

  5. Destroy All Monsters Silver badge
    Holmes

    Stanislaw Lem - Ananke (from "More Tales of Pirx the Pilot")

    Such was the brain, so overburdened with spurious tasks as to be rendered incapable of dealing with real ones, that stood at the helm of a hundred-thousand-tonner. Each of Cornelius’s computers was afflicted with the “anankastic syndrome”: a compulsion to repeat, to complicate simple tasks; a formality of gestures, a pattern of ritualized behavior. They simulated not the anxiety, of course, but its systemic reactions. Paradoxically, the fact that they were new, advanced models, equipped with a greater memory, facilitated their undoing: they could continue to function, even with their circuits overloaded.

    Still, something in the Agathodaemon’s zenith must have precipitated the end—the approach of a strong head wind, perhaps, calling for instantaneous reactions, with the computer mired in its own avalanche, lacking any overriding function. It had ceased to be a real-time computer; it could no longer model real events; it could only founder in a sea of illusions… When it found itself confronted by a huge mass, a planetary shield, its program refused to let it abort the procedure, which, at the same time, it could no longer continue. So it interpreted the planet as a meteorite on a collision course, this being the last gate, the only possibility acceptable to the program. Since it couldn’t communicate that to the cockpit—it wasn’t a reasoning human being, after all—it went on computing, calculating to the bitter end: a collision meant a 100 percent chance of annihilation, an escape maneuver, a 90-95 percent chance, so it chose the latter: emergency thrust!

  6. Stuart 22

    You too?

    So this venerable old spookybird has been relegated to spotting conservatories in Orange County backyards?

    The Ruskies had a quicker solution for removing this problematic plane from messing up ATC.

  7. frymaster

    60k wasn't a guess

    Above 60,000 feet airspace is unrestricted. When the plane wants to descend below 60,000 it has to ask for permission to re-enter controlled airspace.

    1. Anonymous Coward

      Re: 60k wasn't a guess

      The Reuters article seems to imply there was no altitude entered originally, and the system fell over before someone could specifically enter the 60 000 ft figure. It was trying to evaluate all possible altitudes - which seems a serious flaw in the program.

      That it ran out of memory is a symptom; not a flaw.

      And only an idiot would consider adding memory to be a solution.

      1. Anonymous Coward

        Re: 60k wasn't a guess

        Well, if:

        - the requirement includes rapid calculation of planes with unknown altitude AND

        - it ran out of memory doing it AND

        - if they added more memory it wouldn't run out of memory AND

        - this happens once in a blue moon

        ... I'd suggest that adding more memory would actually be a very good solution and they can reserve fixing the code to a time when they actually need to fix the code.

      2. Vaughan 1

        Re: 60k wasn't a guess

        Another article I read on this suggested that the problem was that the flight plan had been filed under VFR and the system was trying to route the U2 down to 10000ft as that is the limit for VFR flying. It was the quantity of changes to other flights in getting it down to 10000ft that overwhelmed the computer.

        1. Anonymous Coward

          Re: 60k wasn't a guess

          AFAIK over FL 600 in the US is Class E airspace and thereby VFR is permitted. Again AFAIK, in the US only in Class A airspace (18000ft MSL - FL 600) you need to fly IFR. Anyway if the operator set exactly FL600 for a VFR flight maybe the system tried to do something silly.

      3. Paul Hovnanian Silver badge

        Re: 60k wasn't a guess

        The problem seems to have something to do with the code monkeys interpreting requirements. I hopped over to a few aviation boards and asked what went wrong. The answer I got involved an IFR procedure, OTP (On The Top), for maintaining altitude visually in the presence of clouds, mountains and other conditions limiting visibility while following an IFR flight plan.

        I then took a survey of several other aviation sites to educate myself as to the meaning of this OTP procedure. Just lurking and reading past posts (predating this incident), it appears that confusion abounds. Controllers understand one thing, pilots interpret it several different ways. So now I'm thinking as a coder: "What the **** do they want my system to do in this case?" And I suspect that someone got some bad information and got it wrong.

        It happens. System designers don't always get the use cases defined correctly or neglect to consider conditions when someone says, "Oh, that will never happen." And invariably it does.

        1. Tom 13

          Re: "Oh, that will never happen." And invariably it does.

          ALWAYS test for - divide by 0.

          ALWAYS test for - string data contains database delimiter as a character.

          Simplest of mistakes, but both have bitten programmers on projects on which I worked.
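
          Both checks are cheap to write. A small Python sketch, assuming a comma as the delimiter and a default of 0.0 for the zero-denominator case (both choices are illustrative, and the function names are made up for this example):

```python
import csv
import io
from typing import List


def safe_ratio(numerator: float, denominator: float, default: float = 0.0) -> float:
    """Return numerator / denominator, or `default` when the denominator is zero."""
    if denominator == 0:
        return default
    return numerator / denominator


def to_csv_row(fields: List[str]) -> str:
    """Serialise one row; the csv module quotes any field containing the delimiter."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(fields)
    return buffer.getvalue().rstrip("\n")


def from_csv_row(line: str) -> List[str]:
    """Parse one row written by to_csv_row."""
    return next(csv.reader([line]))
```

          The test that matters is the round trip: a field such as "LAX, CA" has to come back as one field, not two.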

    2. smartypants

      Re: 60k wasn't a guess

      ...though any person refusing an aircraft permission to descend below FL600 must consider the possibility that their refusal might be ignored!

    3. Jos

      Re: 60k wasn't a guess

      Well, not entirely true that. In the US, over FL600 (roughly 60,000ft), it's Class-E airspace, which is controlled airspace.

      However, when flying VFR in Class-E, no ATC clearance is required and no radio communication either.

      I suppose when you are flying a U-2, this would kind of be helpful.

  8. string
    Joke

    640 feet

    should be enough for anybody

  9. John H Woods Silver badge
    Joke

    Was it ...

    ... 65536 ft?

  10. Scroticus Canis
    Happy

    Just as well they retired the remaining shuttles then. Oh wait...

    What happens when they try and enter the plan for de-orbiting that nifty little 'secret' spy-shuttle thing? Good day to go by rail probably!

  11. Florida1920
    Headmaster

    Software quality assurance is a Good Thing. If someone had tried an "impossible" or "unlikely" scenario like a U2 transiting LA airspace under VFR at 60,000 feet when the s/w was developed, this problem could have been dealt with without endangering hundreds of lives. When testing s/w, try stuff "no user will ever do," because you can bet your butt someone eventually will.

    1. asdf

      hmm

      Thank you for the lesson in QA 101. The only thing you forgot is how management thinks of QA as nothing but a cost so seldom allows proper time and resources for testing. Many smaller places don't even have QA people and rarely allow developers time to properly test.

      1. tfewster
        Facepalm

        Re: hmm

        Maybe for software developed in-house. But when an external supplier hands something mission critical over, you test it before paying up. Plus ATC and Lockheed Martin aren't exactly "smaller places". So nothing could possibly go wrong....

        Oh, wait...

        1. asdf
          Facepalm

          Re: hmm

          > you test it before paying up

          Been lots of cases over the years where you say they had to have tested that very thoroughly first only to smack your head as you have done above. Due diligence epic fail is not all that hard to find in whatever context/field.

    2. DropBear
      Trollface

      Yeah, but what if they did actually test a similar scenario - except the only other aircraft in the test were maybe nine or ten other test objects, which the system easily "routed away" without ever hitting its memory limit...?

      1. tfewster

        > ..what if they did actually test a similar scenario..

        True, you can load test and test all the edge cases you can think of - but did you test the combination of a U2 plus 3 other aircraft emergencies plus a hot air balloon convention while the system was under load? Probably not, you have to set a limit on the actual tests, but knowing how the system performs when it hits a difficult task can help gauge its limits. Even the old fashioned meatware controllers knew their limits and, ISTR, could refuse to allow any more aircraft into their space.

        Oh, and I estimate million-to-one occurrences would probably happen about once a month at any given airport.

  12. Block
    Coat

    What?

    A system called ERAM ran out of memory, bwa ha ha.

  13. jcrb
    Black Helicopters

    Something wrong with this story...

    it's not like they haven't dealt with planes at that height before.

    "In another famous SR-71 story, Los Angeles Center reported receiving a request for clearance to FL 600 (60,000ft). The incredulous controller, with some disdain in his voice, asked,

    'How do you plan to get up to 60,000 feet?'

    The pilot (obviously a sled driver), responded,

    'We don't plan to go up to it; we plan to go down to it.'

    He was cleared."

    http://forums.jetcareers.com/threads/what-flies-at-fl600.69008/page-3

  14. heyrick Silver badge
    FAIL

    Who writes this rubbish?

    Given that the machine is an automated method of managing tin cans packed with squishy humans hurtling through the sky - surely any anomaly should be kicked to a human operator (they claim the system was back up and running in 46 minutes so there were people around to respond...). This must be better than gotta-deal-with-it-oh-shit-gotta-deal-with-it-oh-shit-gotta-deal-with-it-oh-shit-gotta-deal-with-it-oh-shit-gotta-deal-with-it-oh-shit-gotta-deal-with-it-oh-shit[repeat until dead].

    After all - which is the WORSE option? To temporarily pretend one anomaly aircraft isn't there while signalling a human, or to get into a state where effectively "no planes exist any more".

    1. Anonymous Coward

      Re: Who writes this rubbish?

      " surely any anomaly should be kicked to a human operator "

      In theory, yes, in practice AF447.

      More graceful error handling would be a better bet, with the computer reverting to handling anomalous situations on some empirical rules, and flagging to the duty meatsack. Considering AF447, a frozen pitot isn't exactly an unforeseeable scenario, so unclear speed readings were always a potential issue. Keeping thrust and attitude stable and autopilot engaged would probably have saved AF447, instead of the rules that required the autopilot simply to take its ball and go home if it detected movement of the goalposts.

      Which means the software still needs proper QA, proper process analysis, and proper testing, so that the empirical rules are an acceptable risk whilst the coffee drinker gets his thinking hat on.

  15. ecofeco Silver badge
    Facepalm

    Ran out of memory

    derp derp derp derp derp

    Oh I feel SO much safer now.

    /sarcasm

  16. swschrad

    an endless spiral of bad choices here

    and the worst one was programmed in, "keep searching, dammit."

    at some point, to stay real-time and operational, the ATC system should have just flagged the U2 as a bogey and red-boxed it on radar. controllers could either contact it for intentions, or notify Air Defense Command.
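
    one way to sketch that "give up and flag it" rule in Python, assuming a hypothetical per-aircraft budget on candidate routes (the names, the budget figure and the string routes are invented for illustration, not taken from ERAM):

```python
import itertools
from typing import Callable, Iterable, Iterator, Optional, Tuple

# Hypothetical cap on how many candidate routes may be examined per aircraft.
MAX_CANDIDATES = 1_000


def plan_or_flag(
    candidates: Iterable[str],
    is_acceptable: Callable[[str], bool],
    budget: int = MAX_CANDIDATES,
) -> Tuple[str, Optional[str]]:
    """Return ("ROUTED", route), or ("FLAGGED", None) once the budget is spent."""
    # islice stops after `budget` items, so work and memory stay bounded
    # even if `candidates` never ends.
    for route in itertools.islice(candidates, budget):
        if is_acceptable(route):
            return ("ROUTED", route)
    return ("FLAGGED", None)  # Hand the aircraft to a human controller.


def endless_routes() -> Iterator[str]:
    """Stand-in for a search space that never runs out of options."""
    for n in itertools.count():
        yield f"route-{n}"
```

    because candidates are pulled lazily and the loop stops at the budget, `plan_or_flag(endless_routes(), lambda r: False)` comes back as flagged instead of searching forever.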

    which brings up the question, why fly a U2 through LAX controlled airspace anyway? aren't there enough TV station helicopters chasing white Broncos down the highway, they have to put a U2 up as well? all that blank Nevada test range they could turn and burn in, and they decide to fly over LA.

    1. Richard 12 Silver badge

      Re: an endless spiral of bad choices here

      It's not controlled airspace.

      Controlled airspace has a top, once you get up that high you're on your own.

  17. Version 1.0 Silver badge
    Headmaster

    LA air traffic meltdown

    LA is the postal service abbreviation for the state of Louisiana - L.A. is the city of Los Angeles. They are about 1500 miles apart - please try and remember this.

    1. Anonymous IV

      Re: LA air traffic meltdown

      Very true - the US is a full-stop-obsessed (oops, period-obsessed) country. They even continue to put a full-stop/period after Mr. and Dr.. They are no fans of Open Punctuation - nor of the Metric System, either.

  18. David Kelly 2

    Healthcare.gov

    Hey Lockheed Martin! Healthcare.gov is hiring your kind of expertise!

  19. Anonymous Coward

    "ERAM began spitting out error messages and then entered an endless reboot loop, which is a non-optimal state for a piece of critical equipment.

    "We were completely shut down and 46 minutes later we were back up and running," Pair said."

    What did they do, finally press F8 to boot into safe mode? 46 minutes is probably about three boot cycles for Windows.

  20. JaitcH
    WTF?

    Anytime, anywhere, on time, and right the first time: Lockheed motto

    Being a US Government contract to one of their favoured contractors, likely there were few penalties in the contract as is often the case with such work.

    But now they can bid on a contract to upgrade the system, a contract likely making very, very, few companies eligible for the work.

    I guess the old Lockheed motto: "Anytime, anywhere, on time, and right the first time" doesn't apply any more. Pity.
