"Have you tried turning it off and on again?"
"Are you sure it's plugged in?"
A computer crash that caused the collapse of a $2.4bn air traffic control system may have been caused by a simple lack of memory, insiders close to the cock-up alleged today. Hundreds of flights were delayed two weeks ago after the air traffic control system that manages the airspace around Los Angeles' LAX airport went titsup …
>because it is computing things for numbers between zero and infinity, no amount of memory will be enough.
Should have used functional programming with lazy evaluation.
"I'm sorry we don't keep track of program state, so we don't actually know where the planes are. But we can prove the code is formally sound."
Let me guess (this is a very educated guess by the way - I have seen this idiocy one time too many). Some moron in his infinite wisdom has used a realtime OS for the flight planning as a whole. It did not run out of memory per se, the combined "alloc more memory" + compute exceeded the realtime constraints on the path computation task.
If you do that in an RTOS you get a BOOM - a reboot from the global system watchdog at scheduler level.
There is a gazillion ways of triggering it and this is a demonstration why some stuff should not just be done on realtime OS-es and given to vendors that will stick a realtime OS into it out of principle.
The only place that needs RT in the whole system is the realtime collision avoidance, which can be standalone; the rest has no need for RT whatsoever. There may be _HOURS_ before the flight plan is punched in and the actual time it needs to be executed. Doing that realtime on realtime OS under realtime scheduler constraints is beyond idiotic (I would bet 100 green ones that this is what was shipped here - the name of the vendors speaks for itself).
"There may be _HOURS_ before the flight plan is punched in and the actual time it needs to be executed."
Um, I don't think this is about flight plan stuff _before_ take-off. This is about controlling that the craft is not going to crash into anything else right now.
I agree that pre-flight flight plan control could very well be farmed out to a mainframe that would happily control its validity without resorting to real-time constraints. But when you have a hundred flights over your head at that instant and need to integrate a new object and control its parameters, you need the result straight away, not in ten minutes.
Plus, I believe that flight control has a tendency of reassigning altitudes to ensure that collisions do not occur - that is not something that a pre-flight check can take into account.
"There is a gazillion ways of triggering it and this is a demonstration why some stuff should not just be done on realtime OS-es and given to vendors that will stick a realtime OS into it out of principle."
Sounds to me like it had got stuck in an endless loop anyway and would have eventually crashed regardless of what system it was on. But hey, perhaps it could have been written in Java - I'm sure the garbage collection would have coped, right?
My uninformed guess would be that flight plans are simulated as the points in 4D space (3 physical + time) that the planes go through, to check for collisions, and that the root cause was a 'minor upgrade' long ago to allow for busier airspace - halve the collision avoidance time interval and you double the number of time points tracked.
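To put a number on the "doubles the number of time points" part, here is a minimal sketch. It is a guess built on a guess: the straight-line flight model, the `trace_plan` name and the one-hour duration are all made up for illustration and say nothing about how the real system works.

```python
from typing import List, Tuple

# (x, y, z, t): three physical coordinates plus time in seconds.
Point4D = Tuple[float, float, float, int]

def trace_plan(start: Tuple[float, float, float],
               velocity: Tuple[float, float, float],
               duration_s: int,
               step_s: int) -> List[Point4D]:
    """Sample a straight-line flight (hypothetical model) every step_s seconds."""
    points = []
    for t in range(0, duration_s + 1, step_s):
        points.append((start[0] + velocity[0] * t,
                       start[1] + velocity[1] * t,
                       start[2] + velocity[2] * t,
                       t))
    return points

coarse = trace_plan((0.0, 0.0, 10000.0), (200.0, 0.0, 0.0), 3600, 10)
fine = trace_plan((0.0, 0.0, 10000.0), (200.0, 0.0, 0.0), 3600, 5)
print(len(coarse), len(fine))  # 361 721
```

Halving the step takes one plan from 361 stored points to 721, and that is before multiplying by the number of plans being compared against each other.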
My guess would be that the max altitude was never tested when changing time dimension, and not updated with a lower limit. I don't think this would use a real-time OS because you'd want to serialise each plan tracing.
my guess is [1] every busy airspace (like Heathrow) is regression testing altitude [2] game designers are offering smartphones with PhysX GPU as an upgrade to the mainframe.
That's what they want you to think... :-)
Forget the logo, I'd prefer the content of some of those jumpsuits...
Did I read that right - that the system crashed because an operator entered a value that was outside limits that the system could handle and the system didn't flag this up? And, worse still because there was no altitude on the flight plan, the operator just 'guessed' what this value might be?!
> the system crashed because an operator entered a value that was outside limits
You aren't thinking that if you enter a flightplan with an altitude of 2^16 feet, you might just get an integer overflow, are you?
After all, nobody could fly that high, so we'd never need to test it, right?
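For anyone who hasn't watched this particular wraparound happen, a minimal sketch. Nothing in the article confirms the altitude field was 16 bits wide; that is the assumption being joked about here, and the function names are invented for the example.

```python
def store_as_uint16(feet: int) -> int:
    """Simulate an unsigned 16-bit field: values wrap modulo 2**16."""
    return feet & 0xFFFF

def store_as_int16(feet: int) -> int:
    """Simulate a signed 16-bit field with two's complement wraparound."""
    wrapped = feet & 0xFFFF
    return wrapped - 0x10000 if wrapped >= 0x8000 else wrapped

print(store_as_uint16(65535))  # 65535, the largest value that fits
print(store_as_uint16(65536))  # 0, i.e. 2^16 feet silently becomes ground level
print(store_as_int16(40000))   # -25536, a signed field gives up even sooner
```

Python integers never overflow on their own, so the masking stands in for what a fixed-width field would do in a language like C or Ada.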
And for every other air traffic controller in the world they make a habit of checking the secondary radar returns from the aircraft to make sure its altitude is ok. Failing that they radio it and ask. In all circumstances a human brain is keeping an eye on the airspace and making sure everything is safe. Including the air traffic that doesn't really file flight plans or generate large radar returns like gliders.
Meanwhile somewhere in the US some moron decides that computers should do this job which means that finding a parameter that the computer cannot handle in an airspace environment was all but inevitable. No doubt the air traffic controller at LAX knew the altitude of the U2 and knew it was not a problem for them, but entered the altitude into the system because they are meant to otherwise the system does not know where its logged aircraft is at. And so the mayhem began!
I have a mental image of the controllers yelling at the computer, "Noooo, stop doing that. Shut up you f****** thing! Oh f*** it, turn it off."
Never send a computer to do a human's job.
So it's a design flaw. The question: "What is the range of altitudes that planes under our watch could possibly be flying at?" was not evaluated competently.
(Is it possible that U2s and such were left out of the plans for some reason?)
We have surely all seen this happen: coders are given incomplete specs by their client/boss/paymaster, and somewhere down the line all hell breaks loose because The Thing You Failed To Allow For(TM) happens. And it's still somehow the coder that gets it in the neck, often as not.
I think these days a fair percentage of my time spent on architecture/design is on trying to make things extensible to allow for adding routines to deal with TTYFTAF, e.g. taking quoted "tolerance ranges" and doubling them in both directions (well not really, but making sure I know what'll happen if I'm suddenly told I need to double them later).
If I'm lucky enough to have some working knowledge of the industry/sector/thing I'm doing it for, that helps a bit with the intuition to know when something's been left out, but it also invites arrogance/complacency on my part so one still has to be cautious, and always ask the spec provider "are you sure that's all of it?" as often as you possibly can.
A negative altitude is completely possible, it just means that the plane is on the ground at an airport below sea-level (there are several of these around the world and in the US). While it might not be -20,000 ft, -2 is still a negative. ATCs do track aircraft on the ground since the worst aviation accident in history occurred because proper tracking of planes on the ground wasn't done.
I figure that for a project this big, they should have just used signed 64-bit integers for the altitude. Why not have the system be able to track craft approaching Neptune? Given government projects, this abomination will either be replaced tomorrow or still be in place long after the sun collapses and intergalactic travel is commonplace.
Even if the system was designed with a limited altitude range in mind, it still should be able to cope with input outside that range, e.g. by flagging an error in the input. My very first job as a programmer was to write a (half) decent UI for a DOS image processing package written mostly in Pascal. The previous programmer's effort used READ and READLN to get floating point values from the (mainly Dutch) users, which resulted in frequent crashes when users entered 0,23 instead of 0.23. I wrote a simple parser that only assumed it was getting a string of characters, tried to parse it, and flagged syntax and other errors to the user. Not rocket science, but simply going back to basics: does the string of characters entered as input meet the preconditions of the code that is going to use that data, if so, use it, if not, flag an error. This very basic approach ensured that medics could use the program without swearing at the computer several times each day.
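The original was Pascal and long gone, but the idea looks roughly like this in Python. This is a sketch of the approach, not the actual code: the `parse_fraction` name and the 0..1 range are stand-ins chosen for the example.

```python
def parse_fraction(text: str, low: float = 0.0, high: float = 1.0):
    """Return (value, None) on success or (None, error_message) on failure.

    Accepts a decimal comma as well as a decimal point, and never raises.
    """
    cleaned = text.strip().replace(",", ".")
    try:
        value = float(cleaned)
    except ValueError:
        return None, f"'{text}' is not a number"
    # The chained comparison is False for NaN, so NaN is rejected here too.
    if not (low <= value <= high):
        return None, f"{value} is outside the range {low}..{high}"
    return value, None

print(parse_fraction("0,23"))   # (0.23, None)
print(parse_fraction("0.23"))   # (0.23, None)
print(parse_fraction("lots"))   # (None, "'lots' is not a number")
```

The caller gets either a usable value or a message to show the user, and the program keeps running in both cases.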
No, no. You have to have the exact right sequences:
1) Order up the blame assessment project. Figure 6 months and 12 heads as the parameters for this one.
2) Order the mitigation for the current system. At 6 weeks and underfunded with still incomplete specs the patch will still fail. Leading to
3) Order the replacement of the current abomination. Kick off the first planning meeting. Meanwhile, kick #2 in the ass because until we get a replacement we need the other one doing the best it can.
4) After two years of planning the replacement, determine the estimated cost is not within budget. Cancel plan, do forensic accounting and find people to blame. Go back to step 3.
Thus we arrive at the point where intergalactic travel is commonplace, the abomination has been ordered replaced but is still operating and not accepting U2 flights, which are now being tracked on a glass table with crayons and little model planes.
There was a design flaw but that didn't cause this problem. As described, it was a coding-time bounds-checking failure. The code should reject parameters it doesn't handle. If you don't have a full spec and decide to use 16-bit integers, you reject anything outside that range during input validation. That would have left the U2 unmanaged, but the rest of the system would have been stable. Hopefully, the feedback would have been sent back to the UI that the value was too high and the operator could have tried lower values until he found one that worked, which is still likely to be way above all the other traffic.
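Something along these lines, as a rough sketch. The unsigned 16-bit limit, the callsign and the function names are all hypothetical; the point is only that the bad value is turned away at the door and the stored plans are never touched.

```python
# Hypothetical limits: assume altitude is stored as an unsigned 16-bit count of feet.
ALTITUDE_MIN_FT = 0
ALTITUDE_MAX_FT = 0xFFFF  # 65535

class AltitudeOutOfRange(ValueError):
    """Raised when a filed altitude does not fit the assumed storage range."""

def validate_altitude_ft(feet: int) -> int:
    """Return feet unchanged if it fits, otherwise raise AltitudeOutOfRange."""
    if not ALTITUDE_MIN_FT <= feet <= ALTITUDE_MAX_FT:
        raise AltitudeOutOfRange(
            f"{feet} ft is outside {ALTITUDE_MIN_FT}..{ALTITUDE_MAX_FT} ft"
        )
    return feet

def file_plan(plans: dict, callsign: str, feet: int) -> str:
    """Store the plan only if the altitude validates; report back either way."""
    try:
        checked = validate_altitude_ft(feet)
    except AltitudeOutOfRange as err:
        return f"rejected: {err}"
    plans[callsign] = checked
    return "accepted"

plans = {}
print(file_plan(plans, "HIGH1", 70000))  # rejected, plans stays empty
print(file_plan(plans, "HIGH1", 60000))  # accepted on the operator's second try
print(plans)
```

The rejected entry costs the operator a retry, not the whole system.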
One hopes the radar tracking routine is a little more robust.
It's good practice, I guess, to have to go back to the old standby routine, sans system support. I certainly wouldn't like to have to cope with that myself in such a pressured scenario as this, but, in many walks of life it's not a bad thing to demonstrate, once in a while, that 'all the balls (airplanes), can be kept in the air' without crashing and without the lovely computer machines buzzing in the background.
Hats off to the folk who did that in this instance. Note to self: - don't forget worry-beads when packing for the hols.