Boeing 787 software bug can shut down planes' generators IN FLIGHT

The US Federal Aviation Administration (FAA) has issued a new airworthiness directive (PDF) for Boeing's 787 because a software bug shuts down the plane's electricity generators every 248 days. “We have been advised by Boeing of an issue identified during laboratory testing,” the directive says. That issue sees “The software …

  1. Anonymous Coward

    Is this the (in)famous

    Screamliner with the "exploding flaming battery issue" that required ripping out the defective unit with crowbars to prevent the whole plane being written off?

    (makes note never to fly on it)

    Seriously, a quadruple failure of avionics would make even "Macgyver" think twice about boarding that flying death trap!

    1. anothercynic Silver badge

      Re: Is this the (in)famous

      Errr no.

      There *never* was a case of, in your words, "ripping out the defective unit with crowbars".

      *eyeroll*

    2. Anonymous Coward

      Re: Is this the (in)famous

      Maybe related to the infamous Crucial SSD which shut down after 5,184 hours? (But that's 216 days.)

      http://forum.crucial.com/t5/Crucial-SSDs/M4-firmware-0309-is-now-available/td-p/80286

      At least there was a firmware patch for that.

      Presuming they can't patch the bug in the 787 because that would invalidate the certification of the software...?

      1. ST Dog

        Re: Is this the (in)famous

        The article plainly states they are working on a fix:

        Boeing is working on a fix and the FAA says “Once this software is developed, approved, and available, we might consider additional rule making.”

        Yes, it takes a bit longer due to the airworthiness approval, but it won't take long.

        The current "fix" is to cycle them off every 120 days.

        1. pompurin

          Re: Is this the (in)famous

          Because I'm such a geek and had to see if this was as easy as I thought:

          Signed 32-bit upper limit: 2^31 = 2,147,483,648

          Seconds in a day * 100 = 8,640,000

          248 days = 2,142,720,000

          So it looks to me like their counter ticks every 0.01s :)
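
The arithmetic above can be checked for a few candidate tick sizes. This is only a sketch: the signed 32-bit counter and the tick rates are assumptions inferred from the 248-day figure, not documented details of the GCU software, and `days_until_overflow` is a made-up helper name.

```python
# Hypothetical model: a signed 32-bit tick counter. The tick sizes tried here
# are guesses; nothing in the directive states the real counter resolution.
INT32_MAX = 2**31 - 1
SECONDS_PER_DAY = 86_400


def days_until_overflow(ticks_per_second: int, max_count: int = INT32_MAX) -> float:
    """Days of continuous uptime before the counter would exceed max_count."""
    return max_count / ticks_per_second / SECONDS_PER_DAY


for label, rate in [("1 s", 1), ("0.1 s", 10), ("0.01 s", 100), ("0.001 s", 1000)]:
    print(f"tick = {label:>7}: overflow after {days_until_overflow(rate):10.2f} days")
```

Of the four tick sizes tried, only the 0.01 s tick lands in the 248-day range, which is consistent with the guess above.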

          1. Anonymous Coward

            Re: Is this the (in)famous

            Saw the same kind of bug on the glass cockpit of <Recent superjumbo plane>: a warning that stayed on the screen instead of disappearing after 5s if it occurred roughly 1h11 after the last restart of the display unit.

            I'll let you do the math ;)

            1. Anonymous Coward

              Re: Is this the (in)famous

              Downvoted by a lazy commentard I guess...

              Quite easy: it was a 32-bit timer with microsecond resolution.

              Worse is that the underlying OS actually uses 64-bit timers, but the application software layer assumed only the least significant word was useful...
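
For anyone who did not do the math: a 32-bit microsecond counter wraps after 2**32 microseconds, which is a little under 72 minutes. The sketch below is a hypothetical reconstruction of how truncating a 64-bit timestamp to its low word could leave a 5 s warning stuck on screen. The function names and the exact comparison are assumptions, not the actual display-unit code.

```python
MASK32 = 0xFFFFFFFF
TIMEOUT_US = 5_000_000  # the 5 s display timeout mentioned above

# A 32-bit microsecond counter wraps after 2**32 microseconds.
WRAP_MINUTES = 2**32 / 1_000_000 / 60


def low_word(timestamp_us: int) -> int:
    """Keep only the least significant 32 bits of a 64-bit timestamp."""
    return timestamp_us & MASK32


def expired_truncated(shown_at_us: int, now_us: int) -> bool:
    """Hypothetical buggy check: 'now' is truncated to 32 bits, so it can
    never reach a deadline that falls beyond the 32-bit range."""
    deadline = low_word(shown_at_us) + TIMEOUT_US
    return low_word(now_us) >= deadline


def expired_full_width(shown_at_us: int, now_us: int) -> bool:
    """Same check using the full 64-bit timestamps the OS already provides."""
    return now_us - shown_at_us >= TIMEOUT_US


# A warning raised 2 s before the wrap, checked 6 s later.
shown = 2**32 - 2_000_000
now = shown + 6_000_000
print(f"wrap after {WRAP_MINUTES:.2f} minutes")
print("truncated check expired: ", expired_truncated(shown, now))
print("full-width check expired:", expired_full_width(shown, now))
```

In this model the truncated check works for any warning raised well before the wrap and only misbehaves in the last five seconds of each 32-bit cycle, which would fit a fault tied to a specific time after restart.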

            2. anonymous boring coward Silver badge

              Re: Is this the (in)famous

              Have an upvote to compensate for some moron's downvote.

  2. Kanhef

    So...

    How often do planes operate, without shutting down, for more than eight months at a time?

    1. Simon Sharwood, Reg APAC Editor (Written by Reg staff)

      Re: So...

      At a guess, often enough for Boeing to tell the FAA and the FAA to issue a new directive.

      1. Anonymous Coward

        Re: So...

        Power could be left on 24/7, but even in that unlikely event the aircraft would still be powered down eventually for scheduled maintenance. So to encounter this in the real world you need an operator who basically doesn't operate his aircraft, such that it doesn't need maintenance for 8 months, yet still leaves it powered up continuously. The fact that 787s have been in service for three and a half years now, and more than 250 of them are flying without this fault being encountered, shows how improbable this is. Of course it needs fixing - continually eliminating risk is how aviation has become as safe as it is - but in practice this one is no big deal.

        1. Cliff

          Re: So...

          I agree it's less of an issue than it sounds, but that's 250 craft over 3.5 years for a bug that shows up every 5/7ths of a year. It's around a potential 1000 cycles in total at most, dramatically less if they've been powered down out of service for, say, battery replacement work.
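
A back-of-the-envelope version of that estimate, as a sketch only. It assumes every one of the 250 aircraft was powered continuously for the full 3.5 years, which overstates the real exposure because the fleet was delivered gradually; the variable names are illustrative.

```python
# Upper bound on how many 248-day windows the fleet could have completed.
FLEET_SIZE = 250          # figure quoted in the thread
YEARS_IN_SERVICE = 3.5    # figure quoted in the thread
DAYS_PER_YEAR = 365.25
OVERFLOW_DAYS = 248

windows_per_aircraft = YEARS_IN_SERVICE * DAYS_PER_YEAR / OVERFLOW_DAYS
fleet_upper_bound = FLEET_SIZE * windows_per_aircraft

print(f"windows per aircraft: {windows_per_aircraft:.2f}")
print(f"fleet upper bound:    {fleet_upper_bound:.0f}")
```

The bound comes out in the low thousands at most, the same order of magnitude as the figure above, and any routine power-down resets the clock for that aircraft.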

        2. Graham Triggs

          Re: So...

          It's the generator control unit, not a control unit powered by the generator.

          Which suggests it may be receiving power (from a battery?) without the generators switched on. So the possibility of encountering it may be higher than you are assuming.

          But practically, there will be safety checks and maintenance, which likely have resulted in GCUs being reset inside 248 days, so no real problem. Especially as now people know to consciously do it.

          You never know though, we may have come alarmingly close to a catastrophe.

        3. Anonymous Coward

          Re: "in practice this one is no big deal."

          "in practice this one is no big deal."

          The big deal is that this is a problem that could and should have been identified and eliminated before entry into service, but it wasn't.

          Time-dependent arithmetic overflows don't come out of nowhere and aren't new technology (even Windows 95 had them). Checking the design and implementation for things like arithmetic overflows isn't exactly rocket science either.

          Not finding such a readily identifiable problem until over three years after the relevant kit has been in service does not say what you think it does - it says that proper risk assessment, design and test practices have not been followed. Can't imagine how that happened...

          1. Destroy All Monsters Silver badge

            Re: "in practice this one is no big deal."

            The big deal is that this is a problem that could and should have been identified and eliminated before entry into service, but it wasn't.

            This is probably coming from some dude who codes websites in PHP and gets his requirements from a 5-second chat in doorframes...

            1. werdsmith Silver badge

              Re: "in practice this one is no big deal."

              This is probably coming from some dude who codes websites in PHP and gets his requirements from a 5-second chat in doorframes...

              Hmmm, superciliousness. Not pretty.

              Regardless of who it comes from, it's still a valid point.

              1. Destroy All Monsters Silver badge

                Re: "in practice this one is no big deal."

                it's still a valid point

                It's also a valid point to ask why the real world doesn't work as I want it to. Only idiots and politicians make it a talking point though.

            2. Anonymous Coward

              Re: "dude who codes websites in PHP"

              "this is a problem that could and should have been identified and eliminated before entry into service, but it wasn't." (me)

              "This is probably coming from some dude who codes websites in PHP and gets his requirements from a 5-second chat in doorframes..." (you)

              Guess again.

              Remember DEFSTD 00/55? I used to.

              Know much about 178C (as well as B)? I used to.

              Got any safety critical code flying? I still do; stuff I wrote in the 1980s that is still out there.

              I barely know what PHP is except it seems to be popular as a security vulnerability exploitation vector.

              Even so, if some presentation layer person made the comment I made about it being a defect that could and should have been detected and fixed before entry into service, does the comment's origin make the comment wrong? It's not the messenger that matters, it's the message.

              1. Yag

                Re: "dude who codes websites in PHP"

                Know much about 178C (as well as B)? I used to.

                You mean DO-178C? I understand that you are a bit angry, but claiming you "used to" know a 4-year-old standard that is not yet fully deployed is a bit exaggerated...

            3. anonymous boring coward Silver badge

              Re: "in practice this one is no big deal."

              Have to disagree.

              The important point here is that some software is deemed worthy of massive certification efforts, and some is obviously not considered as important. In reality almost all software may cause a serious accident.

              Compare the military Airbus that crashed recently, that had the wrong engine parameters "flashed" to the engine management software. A case of being lax in one area for no particular reason other than that no-one has bothered to identify it as important.

              This is a recurring theme in air disasters.

          2. Anonymous Coward

            Re: "in practice this one is no big deal."

            "Time-dependent arithmetic overflows don't come out of nowhere and aren't new technology (even Windows 95 had them). Checking the design and implementation for things like arithmetic overflows isn't exactly rocket science either."

            But it is aerospace system engineering

            1. An ominous cow heard

              Re: not rocket science

              ""Checking the design and implementation for things like arithmetic overflows isn't exactly rocket science either." (me)

              But it is aerospace system engineering" (you)

              It sure is, and one might hope that its importance was widely recognised. Does experience generally match the theory?

              In the "systems engineering" department I'm most familiar with, the importance was recognised in name but not in reality. There was one actual systems engineering graduate and a variety of "systems engineers" from other backgrounds. Good people to have on the team, but not necessarily real "systems engineers".

              If the department was desperate enough for manpower, almost anybody (of whatever background) from almost anywhere in the company could become a "systems engineer" overnight.

              Is that really an approach that the industry should be adopting?

            2. a_yank_lurker

              Re: "in practice this one is no big deal."

              AE = Almost Engineering

          3. Stumpy

            Re: "in practice this one is no big deal."

            "it says that proper risk assessment, design and test practices have not been followed. Can't imagine how that happened..."

            ... sounds like the aircraft was designed and built using Agile...

          4. JLV

            Re: "in practice this one is no big deal."

            Correct me if I am wrong in this but as far as I understand, airplanes are very safe because all the flight control software is written independently from specs in triplicate. Then all 3 control programs run concurrently with a majority poll deciding what happens if one program is in disagreement with the rest. So, in theory and pretty much in practice, you can ride out any one bug.

            But... this was not on the flight control software, so there would have been no triple redundancy around. Granted, the likelihood is pretty low of this flaw coming into play, but you still have a critical component that is not triple redundant so finding such a flaw is scarier than flight control software proper, isn't it?

            The fact that it wasn't likely to be a problem in this particular instance is not a reassurance to the QA and redundancy factors for secondary but critical programs.

            On the other hand, the fact that they did find the bug during lab testing is a good sign.
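
The majority-poll idea described above can be sketched in a few lines. This is a toy illustration, not how any certified flight control system is implemented: real voters compare analogue values within tolerances, and `majority_vote` is a hypothetical name.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(outputs: Sequence[int]) -> Optional[int]:
    """Return the value a strict majority of channels agree on, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


# One faulty channel is outvoted by the two healthy ones.
print(majority_vote([42, 42, 17]))
# Three-way disagreement: no majority, so the voter reports no valid output.
print(majority_vote([1, 2, 3]))
# Same bug in every channel: the voter sees unanimous agreement on a wrong value.
print(majority_vote([-1, -1, -1]))
```

The last case is the one that matters for a timer overflow: voting masks independent faults, but a defect shared by every channel passes straight through.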

            1. Anonymous Coward

              Re: dissimilar redundancy

              "all the flight control software is written independently from specs in triplicate. "

              Relatively frequently mentioned, extremely rarely provided with definitive references of where it actually happens in production systems.

              (And afaik it doesn't apply to the engine control systems anyway. Still, what could possibly go

            2. Alan Brown Silver badge

              Re: "in practice this one is no big deal."

              "all the flight control software is written independently from specs in triplicate."

              Used to be. Nowadays they use three copies of the same software, rather than multiple implementations.

      2. Trigonoceps occipitalis

        Re: So...

        Won't Boeing be running a test rig?

      3. JustNiz

        Re: So...

        Your paranoia is excessive.

        Boeing and every other aircraft manufacturer automatically notify the FAA of any and every potential fault they detect, no matter how unlikely in real life. Even if all 4 GCUs failed simultaneously, you do realise the plane also has batteries, right?

        1. Alan Brown Silver badge

          Re: So...

          "you do realise the plane also has batteries right?"

          You do realise the batteries have enough capacity to last about 10 seconds if all 4 GCUs go down, don't you?

          They're designed to stabilise switching breaks, not carry full load.

      4. This post has been deleted by its author

      5. RobHib

        @Simon Sharwood - Re: So... Some thoughts.

        Reckon that's so but also there are broader issues here; essentially they're issues that emerge from complexity.

        To the point: the complexity of modern airliners like the Boeing 787 and Airbus A380 is such that it's just not possible in any practicable sense to cover every functional mode of operation, design limitation and failure mode let alone properly evaluate all their relevant parameters [design limitations/omissions, failure severity, event probability etc.] through rigorous state analysis and similar techniques.

        Anyone familiar with state analysis will know that it's nigh on impossible to cover every aspect (design limitations, failure modes etc.) of a system as 'simple' as a domestic VCR let alone one as complex as a modern jet airliner–even a VCR's complexity is such that the computational problems are enormous. Just defining the parameters for such tests alone is problematic.

        I'm not saying that these modern airliners aren't reliable, clearly they are, but putting an exact measure on 'reliability' just isn't possible with today's state-of-the-art. The fact is we've still to rely on the best expertise that's available and this ultimately boils down to the combined expertise and experience of the engineers, designers and manufacturer's wherewithal etc.–not to mention bean-counters and budgets.

        Let me give you an example: the well-publicized Qantas QF32 A380 [2010-11-04] engine failure. The Rolls-Royce T/900s each generate about 20 terabytes of monitoring data per hour, yet this was 'insufficient' to give any forewarning of the failure. Moreover, after the failure, despite the many hundreds of thousands of sensors on the A380, the pilots still had insufficient (or perhaps inappropriate) monitoring to determine what had failed well enough to safely navigate and land the plane.

        Sufficient data was only gathered after a passenger reported damage to the wing and a pilot visually inspected the damage from the passenger's seat. With all of the A380's sophisticated monitoring, human intervention (a human sensor) was still necessary.

        The issues that arise are complex and many but the essential ones are reasonably clear: we now know the exploding engine cut sensor and control lines thus cutting off essential data to the pilots. The question is why this eventuality wasn't allowed for in the original design (given that engines have previously failed/exploded and cut control lines long before this incident). Also, why didn't a state analysis pick up this issue beforehand in the early design phase?

        Moreover, given the long history of control cable/hydraulics failures (by being severed) and leading to crashes [e.g.: UA FLT 232, (1989); AA FLT 96, (1972)], one has to speculate why in such a modern aircraft the few truly critical circuits weren't also backed up by wireless links (powered at sensor source). Same goes for why there were no iPhone-sized camera 'sensors' in critical places–for pilots to view the engines etc. (as we all know from our phones, this is pretty trivial these days).

        Similarly, Airbus designers appear not to have taken into account the overwhelming levels of error messages generated in the cockpit by the computer-based information system. It was essentially useless, as the huge amount of data presented forced the crew to process the data manually and in a time of great stress and with very limited time. The pilots reported that at no time during their training had they ever had to experience this level of data overload–had it not been for the extremely professional crew the craft could have been lost. The problem with the status/fault monitoring is nothing less than a very significant ergonomic design failure. (It's a damning indictment, as there's seemingly no reason why this problem should not have been foreseen.)

        Effectively, the design parameters, in sum, were passed to be an 'acceptable' risk but in practice they were not.

        There's no doubt the QF32 incident raises serious concerns, and both the regulators and Airbus need to be put under the spotlight for a series of problems and events that compounded to considerably more than could ever just be attributed to force majeure alone. That said, what is even more key is that this incident clearly shows that our current understanding of complex systems is very limited. State analysis etc. as applied to large complex designs such as modern aircraft has a long way to go before it can be considered mature engineering. Designers need to heed this fact.

        In my opinion, the chain of events that led to the QF32 incident is a brilliant encapsulated example of the same kinds of problems we all too often see in large computer systems, Windows, the internet etc. Perhaps we computer types should spend more time examining them.

        1. Anonymous Coward

          Re: RobHib's thoughts.

          Nice writeup. Can I add a few thoughts?

          *) Complexity vs confidence

          Complexity is a challenge, but you can go wrong without that much complexity if you're overconfident. If you're overconfident, all kinds of things can go wrong. (Other factors besides overconfidence are also possible).

          The words of Charles Haddon Cave to the conference on the 25th anniversary of the Piper Alpha disaster are interesting. Haddon Cave led the Nimrod inquiry, and other similar ones. He's not an engineer, but he sure has a clue on how things go wrong:

          http://www.oilandgasuk.co.uk/templates/asset-relay.cfm?frmAssetFileID=3251 (20page transcript)

          https://www.youtube.com/watch?v=y99_lhFFCsk (50minute speech, few visuals)

          *) Statistics of small numbers

          Lots of aspects of safety-critical system design involve failure rates along the lines of "1 in 10**6 hours" or similar. You can't demonstrate compliance to this by testing, not in any reasonable timescale or over any reasonable fleet size. What does that leave? Modelling? OK, but...
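
To put a rough number on why testing alone can't demonstrate such a rate, here is a sketch under a loudly stated assumption: failures follow a constant-rate (exponential) model, so the chance of seeing zero failures in T hours is exp(-rate * T). The function name and the 95% confidence level are illustrative choices, not anything taken from a certification standard.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days


def zero_failure_test_hours(target_rate_per_hour: float, confidence: float) -> float:
    """Failure-free test hours needed to claim the true rate is at or below
    target_rate_per_hour with the given confidence (exponential model)."""
    return -math.log(1.0 - confidence) / target_rate_per_hour


hours = zero_failure_test_hours(1e-6, 0.95)
print(f"{hours:,.0f} failure-free hours")
print(f"{hours / HOURS_PER_YEAR:,.0f} years on a single test rig")
```

Under these assumptions a single rig would need roughly three million failure-free hours, which is centuries of running, and the requirement scales inversely with the target rate.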

          You mention QF32 (in which it was a miracle there were no casualties). Around that time, RR actually had three "uncontained engine failures" during a year or so (QF32, one on a test rig in Derby, and one I forget). (NB this is not a dig at RR, GE doubtless have their challenges too).

          A UEF is an incident in which high energy debris escapes from the engine - typically, a piece of something that's been rotating at very high speed. It's pretty unlikely that this will happen - in fact it's NOT supposed to be able to happen. If it does happen, whatever's in the firing line won't survive. But the design is supposed to prevent it happening at all, so is there any reason to cater for it happening?

          Was the three indicative of a trend, or just an unfortunate coincidence? How would anyone know? As far as I'm aware there haven't been any other UEFs on that engine family since then, does that mean everything's now OK? What might happen in the next 12 months?

          *) Drowning in data, searching for information

          The "too many alarm messages" issue was known about elsewhere, literally decades ago. Any half competent SCADA vendor or user would understand the problem, and how it might be mitigated. But it seems the aircraft engine vendors think they're the experts on all they need, and thus they don't always look to see what they can learn from other parts of industry.

          *) Terabytes per hour

          I don't know where this figure comes from but for a flight engine it's many orders of magnitude too high, and even for a fully instrumented engine test rig it's way too high. Not to worry.

          1. RobHib

            @A.C. – Re: RobHib's thoughts.

            A.C., thanks. Short of time to elaborate on your comments here except to say I agree with them. Will reply in more detail later. Also, Charles Haddon Cave video is fascinating, recommended viewing, and has echoes of similar things/incidents that I've experienced (but luckily with fewer repercussions).

            In the meantime, with reference to TB/h 'I don't know where this figure comes from...', here's one immediate reference, there's others too: http://www.trilliumsoftware.com/success/_pdf/DataIQ_Fall12-Nigel-Article.pdf.

            1. Anonymous Coward

              Re: @A.C. – RobHib's thoughts.

              Glad you enjoyed the CHC stuff. I await your further thoughts with interest.

              Re: http://www.trilliumsoftware.com/success/_pdf/DataIQ_Fall12-Nigel-Article.pdf

              "the A380 Airbus is viewed as the world’s smartest plane as it is said to run on over one billion lines of code. It contains a large number of sensors and microprocessors, each monitoring and reporting on the health of its systems. For instance, each of its four engines generates 20 terabytes of data on its performance in flight every hour."

              It's far from a definitive source. Others welcome. I'm sensing a severe case of marketing/Chinese whispers here.

              Based on what I know about engines, sensors, controls, and connectivity, it is almost infinitely improbable that 20TB an hour is recorded in this picture. It's barely plausible that 20TB an hour is *processed* (without being recorded). Even allowing for massive compression (which would be theoretically easy, but practically might be a challenge, due to the adverse environment alongside an aircraft engine, though a DSP might do the job).

              NB what follows is not intended as personal criticism. You have been told something, and taken it at face value (as have others). But it's garbage (at worst) or widely misunderstood (being generous). That's OK, there's a lot of that about. I happen to see the implausibility in this case because it's an area I used to know a bit about.

              We can take a look at this in a couple of ways: e.g. how much data can we get out of the engine control+monitoring system, as limited by its connectivity. And how much data is plausibly going into that system, based on sensor count and sample rate, stuff like that. Back of the envelope, scientific wild assed guess, not intended to be strictly accurate.

              *) Data out of the engine control+monitoring

              Assume there's no recording system which can survive engine mounting so we need to get the data out over a network. It'll be ARINC664 or AFDX or whatever. Think full duplex ethernet, with some design guidelines to make it better suited to time-critical and safety-critical applications.

              Assume 100MBit/s (couldn't find a definitive statement). Make that 10Mbyte/s. Use ALL of it for this recording stuff. So, 10Mbyte/s, in the 3600 seconds in an hour.

              Thus: 3.6 GB an hour that we can get from engine to elsewhere.

              Use a handful of them if you wish, use Gbit if you wish, you can't plausibly stretch that to Terabytes an hour, whatever the "big data" people say. Apart from anything else, a computer on the engine connected to those sensors has to *send* the data. That capability just isn't there right now. It might be there in three or five years time. Maybe.

              *) Data into the engine

              The main output of the engine control is a fuel flow control value. It is based largely on the pilot's lever angle. A raft of other important factors which figure into the calculations or must be measured for other reasons include various rotational speeds, various pressures and temperatures, and so on. Various sensors are duplicated or triplicated. Various outputs are monitored for safety/integrity reasons. There's lots more stuff too, to do with things like thrust reverser interlocks and other important stuff. This'll do for now.

              Let's for the sake of argument assume 500 sensors on the engine. That may be a bit of an overestimate; the nice people in RR, GE, etc will confirm or otherwise, but it doesn't actually matter much here. Also, it doesn't vary hugely from engine to engine. Let's assume that on average we're sampling every 20ms (some will be more frequent, others less). Let's assume 4 bytes per sample (again, generous). 500 sensors at 4 bytes = 2kbytes/sample. At 50 samples/second that is 100kbytes/second. 3600 seconds per hour:

              Thus: 360 MB per hour of raw unprocessed sensor data.
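
The sensor-side estimate, written out so the assumptions can be tweaked. Every input is the guess stated above (500 sensors, 4 bytes per sample, 50 samples per second); none of them is a manufacturer figure.

```python
# Back-of-the-envelope raw sensor data rate, using the guesses from the post.
SENSORS = 500
BYTES_PER_SAMPLE = 4
SAMPLES_PER_SECOND = 50      # one sample every 20 ms
SECONDS_PER_HOUR = 3600

bytes_per_second = SENSORS * BYTES_PER_SAMPLE * SAMPLES_PER_SECOND
bytes_per_hour = bytes_per_second * SECONDS_PER_HOUR

CLAIMED_BYTES_PER_HOUR = 20 * 10**12  # the 20 TB/hour claim
ratio = CLAIMED_BYTES_PER_HOUR / bytes_per_hour

print(f"{bytes_per_second / 1e3:.0f} kB/s, {bytes_per_hour / 1e6:.0f} MB/hour")
print(f"claimed figure is about {ratio:,.0f} times larger")
```

Even if the sensor count or sample rate were off by a factor of a hundred, the result would still be far short of terabytes per hour.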

              Maybe there's advanced engine health monitoring to consider as well. One day. Today, that's for test purposes only (you can do lots of engine health monitoring with just the existing signals in the control unit).

              So two entirely independent approaches both suggest there's **very very roughly** a Gbyte an hour of raw data. Certainly nowhere near terabytes an hour per engine.

              Corrections and clarifications welcome.

              Some wag on slashdot already questioned the Terabytes/hour data rate and suggested that it'd easily get to terabytes an hour if you put it into XML (text dataname, text timestamp, text datavalue, text units, etc).

              I still struggle to see how it gets to terabytes an hour.

              1. Anonymous Coward

                Re: @A.C. – RobHib's thoughts.

                NB There is at least one arithmetic mistake in my earlier post. The conclusions still stand.

                [Hint: 10MByte/s: what's that in an hour?]
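
Working the hint through, with the same rounded 10 MByte/s link figure used in the earlier post (an assumption there, and here). The second half turns the question around and asks what sustained link speed a 20 TB/hour stream would need.

```python
SECONDS_PER_HOUR = 3600

# Link-limited estimate: 100 Mbit/s, rounded down to 10 MByte/s of payload.
link_bytes_per_second = 10 * 10**6
link_gb_per_hour = link_bytes_per_second * SECONDS_PER_HOUR / 1e9

# Sustained rate needed to move 20 TB in one hour.
claimed_bytes_per_hour = 20 * 10**12
required_gbit_per_second = claimed_bytes_per_hour * 8 / SECONDS_PER_HOUR / 1e9

print(f"link limit:   {link_gb_per_hour:.0f} GB/hour")
print(f"20 TB/hour needs about {required_gbit_per_second:.1f} Gbit/s sustained")
```

So the corrected link figure is tens of gigabytes per hour rather than a few, and the conclusion still stands: the claimed rate would need a link hundreds of times faster than the one assumed.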

        2. bep

          Re: @Simon Sharwood - So... Some thoughts.

          I wondered for some time why passenger aircraft didn't have cameras monitoring the trailing edge of the wing. Then I thought I read that such cameras had been implemented, but it seems not. This seems like a basic safety feature that should be mandatory. The other question is, is it time for the reintroduction of a flight engineer to monitor all this feedback information that is overwhelming the pilots?

          1. Anonymous Coward

            Re: flight engineer

            "is it time for the reintroduction of a flight engineer to monitor all this feedback information that is overwhelming the pilots?"

            It's a very fair question.

            What seems to have happened over the years is that the pilot's routine job has been 'de-skilled' (?not quite the right word, maybe 'de-stressed'?) by cockpit automation. But in non-routine circumstances where the automation isn't useful, the automation-dependent crew may struggle, potentially leading to a serious incident. There are plenty of recorded and analysed instances where this has happened. A distressing number of them involve casualties, often fatalities (e.g. failure to reach the runway, in good visual flying conditions, when part of the airport's landing guidance system is undergoing planned and notified maintenance: Asiana Flight 214, 2013, San Francisco).

            [Fwiw there is a somewhat similar syndrome in high speed railways]

            The engine damage and resulting connectivity loss on QF32 didn't even involve casualties afaik. One of a number of reasons for that was that the dice on that day had rolled in such a way that it was a high level training+assessment flight and there were several extra very experienced staff in the cockpit who were able to go through checklists, attempt to read between the lines of status messages, and so on. Without that, the result might have been different.

            Interesting question. I have no answer for it. Maybe the NTSB, FAA, etc need to give it some thought.

        3. Alan Brown Silver badge

          Re: @Simon Sharwood - So... Some thoughts.

          "given that engines have previously failed/exploded and cut control lines long before this incident"

          Hydraulic fuses now standard (after the Sioux City DC-10 crash)

          Multiple wing routes for the redundant systems (after various incidents of losing all systems with one cut)

          As for the engine: Uncontained compressor failures are vanishingly rare vs fan disc failures (which are designed and tested for). In any case, I doubt that there's an engine casing material in existence which could be made strong enough to hold it all together without imposing unacceptable weight penalties.

          You can make civil jetliners strong enough to withstand any scenario, but the problem is they'd probably be so heavy they'd never make it off the runway. Every risk and mitigation has to be balanced against probabilities.

          Thankfully, fixing software doesn't add weight. However, the reality is that the level of checking of such things is far less stringent than that applied to the mechanical side of the operation.

    2. Mark 85

      Re: So...

      At most airports, when they land and shut down the engines, ground power is connected. If this is like many aircraft, all electronics including the GCUs would have power. I'm guessing the solution is to disconnect the batteries and ground power every 247 days. Truly a cold reboot.

      Edit: I'm not sure if the 787 has a breaker for each GCU to completely remove power from the unit. As I recall, some A/C do and some don't.

      1. Roland6 Silver badge

        Re: So...

        I'm guessing the solution is to disconnect the batteries and ground power every 247 days.

        The directive requires operators to perform an electrical power deactivation at intervals not to exceed 120 days. So it would seem that planes are not routinely electrically powered down.

        It would be interesting to know how Boeing discovered this fault, I suspect it may have arisen from a review of reliability/fault data and someone spotting a common theme...

        1. ST Dog

          Re: So...

          Said they found it in the lab.

          I'm guessing that's the integration lab, where they would have a mockup with actual generators and other systems up and running 24-7 for months on end doing long-running tests.

          Despite other posts to the contrary, I wouldn't expect this to have been caught earlier.

          They just don't do a lot of tests spanning 100s of days.

          The software designers expected the unit to be shut down more often.

          Bad specs/requirements that didn't convey that the unit needed to stay up that long (or even longer).

        2. Alan Brown Silver badge

          Re: So...

          "It would be interesting to know how Boeing discovered this fault"

          Most likely the static test article (which wouldn't normally be powered down) exhibited it.

          I wouldn't be at all surprised to find that Boeing have more of these hiding away - and not just on the screamliner.

      2. Nick L

        GCU...

        GCU puts me in mind of Iain M Banks' incredible craft the size of small cities... Bit of a way to go to get there! (Yes, I realise GCU is generator control unit )

        1. Deryk Barker

          Re: GCU...

          James Blish?

          Banks was just 8 years old when the final volume of Cities in Flight was published in 1962.

      3. Charles Manning

        Have the breakers been disabled?

        "I'm not sure if the 787 has a breaker for each GCU to completely remove power from the unit. As I recall, some A/C do and some don't."

        Since the German pilot crashed the plane recently, there has been a scramble to disable stuff so that pilots can't turn things off when they want. I wonder if preventing GCUs fits into that?

    3. Test Man

      Re: So...

      "How often do planes operate, without shutting down, for more than eight months at a time?"

      Think about it - when do you ever power down your car? Never. Switching the ignition to "0" isn't turning it off in the same way pressing the power button on your TV isn't actually turning it off either. The same with planes.

      1. Byham

        Re: So...Garbage from Non-Aviators

        Aircraft when they land for a night stop will always shut off the generators and APU and go onto ground power. Aircraft also have a series of standard maintenance checks that are mandatory the A checks after ~ 200-300 flight hours and a B check every 6 months with the aircraft in a hangar for 2 - 3 days. The generators will be off and the maintenance includes disconnecting and checking batteries.

        This is not a flight safety problem

        However, it is a program standards and testing problem that should not be there and obviously a counter overflow of some sort that Boeing felt it should report. It shows up a poor program test if counters can overflow and haven't got a standard handler to reset them. Both the programmer team and the test team should be in front of the leather top desk :-)

        1. Yet Another Anonymous coward Silver badge

          Re: So...Garbage from Non-Aviators

          They shut off the generators but does that necessarily shut off the power to the computer controlling the generator?

          Are they sure that a particular check disconnects all the many redundant power supplies to the GCU? Is this documented in the check?

        2. Fatman
          WTF?

          Re: So...Garbage from Non-Aviators

          Both the programmer team and the test team should be in front of the leather top desk :-) taken out behind the woodshed, and...(you fill in the blank)

          FTFY

        3. Yag

          Re: So...Garbage from Non-Aviators

          It shows up a poor program test if counters can overflow and haven't got a standard handler to reset them. Both the programmer team and the test team should be in front of the leather top desk :-)

          Unfortunately, this is a bit more complicated.

          The standard in such systems is to use Requirement Based Tests. If a tester adds such a robustness test with no requirement to justify it, he'll be in front of the leather top desk.

          And, of course, those kinds of requirements are often "forgotten" by the SW designers...

        4. Alan Brown Silver badge

          Re: So...Garbage from Non-Aviators

          "Both the programmer team and the test team should be in front of the leather top desk"

          They should be stretched out on it, with the red hot pokers warmed up and ready to go.

      2. Stork Silver badge

        Re: So...

        My current Accord, every 4 years. That is how long the battery seems to last ;-)

      3. Keven E.
        Pint

        Re: So...

        "..when do you ever power down your car? Never. Switching the ignition to "0" isn't turning it off in the same way pressing the power button on your TV isn't actually turning it off either."

        My '66 Dodge has a broken dash clock, and I've checked for leccy "leaks" at the battery terminals... so I'm pretty sure nothing is "on".... but I'm sure the maintenance staff get to watch *inflight movies while scrubbing the barnacles off the hull.

    4. Richard Jones 1
      Go

      Re: So...

      Since it is a fly by wire aircraft I imagine that shutting down the engines and the APU would not de-power the entire craft unless all the breakers were deliberately tripped or some other full power system shut down was ordered. Retaining some systems on standby battery power and thus running could aid subsequent power up operations. Having the GCU in a standby mode might allow the GCU to manage the engine restart more easily.

      Having watched people try to systems check parked aircraft after they have been left idle, even with more traditional types of avionics the state of the batteries appears to be the first issue they address as they start system checks.

      Now that the fault is identified, and since most people would not really want to be flying round in a brick that could drop at any minute, fixing or at least managing it does become mandatory.

      Certification routines appear to need to evolve to catch up with the changing nature of aircraft systems and no doubt they will following this snafu. I wonder what other such traps still exist as yet untraced?

      1. Anonymous Coward
        Anonymous Coward

        imagine == assume?

        "I imagine that shutting down the engines and the APU would not de-power the entire craft "

        "Entire craft" is a fairly ill-defined term, but you'd assume wrong anyway, so long as we're ignoring the Ram Air Turbine, which isn't deployed in normal operation, only when engines and APU are unable to provide electricity.

        The 787 is an electric aircraft. No bleed air is used, and even the hydraulics are electrically powered.

        There's around a megawatt of electricity generation capacity (the two main engines each with 2 * 250kW generators, plus rather less than that on the APU).

        The 787 has a lithium battery. Lots of people have read about it (that's not an assumption). It provides a few seconds of power, mainly to cover any dip as a changeover between power sources occurs.

        http://www.boeing.com/commercial/aeromagazine/articles/2012_q3/2/

        http://www.flightglobal.com/features/787dreamliner/systems/

    5. Alan Brown Silver badge

      Re: So...

      Cold starting an airliner is a long-drawn out process.

      In general the control electronics is only turned off during heavy maintenance (D checks) or if the thing's going to be parked away from ground power long enough that leaving it running would flatten the batteries.

      1. Anonymous Coward
        Anonymous Coward

        Re: "Cold starting an airliner is a long-drawn out process."

        "In general the control electronics is only turned off during heavy maintenance (D checks) or if the thing's going to be parked away from ground power long enough that leaving it running would flatten the batteries."

        It's also been reported in the early-service days of the 787 that a wide scale power off/power on cycle generates unusually large quantities of status messages from various bits of "intelligent" kit when things are powered up again. Has that been sorted now?

        All these messages have to be processed and preferably analysed in some fashion in case buried in the noise is something that actually matters.

        So there has been an understandable reluctance to power things down wholesale.

        1. Justthefacts Silver badge

          Re: "Cold starting an airliner is a long-drawn out process."

          That's interesting.....but pretty damning for aerospace software dev doctrine.....

          In every other industry, power cycling software is known to be a good thing (for the stated reason of poor test coverage at long uptimes). The aerospace industry has chosen NOT to, because the Standard Operating Process is to write lots of status and checks (good....). And then someone needs to write code to verify huge amounts of status, and they didn't size their kit to do it in a sensible time.

          So, the safest end to end SOP isn't followed because the time taken is commercially unviable.

          That's a f* up of systems engineering and management on an EPIC scale

          You can laugh at Agile if you like, but Agile would have prevented this by commercial challenge before it got out into the wild.

          To be clear, Agile wouldn't have stopped the bug, it would have ensured that power cycling every flight was commercially viable, and prevented this and potentially loads of others from being an issue

          1. Roland6 Silver badge

            Re: "Cold starting an airliner is a long-drawn out process."

            In every other industry, power cycling software is known to be a good thing

            Well yes and no, having worked on safety critical systems in the past, one of the challenges was the system was expected to be up and running for the full duration of its operational life: 20 plus years! However, due to power outages, I doubt any of the systems we actually deployed into the field actually lasted more than a few years without getting an unplanned hard reset. Given the absence of press reports over the last 30 years, whatever faults these systems may have had, I can be reasonably sure that no lives were endangered or lost as a result of system failure...

            1. Alan Brown Silver badge

              Re: "Cold starting an airliner is a long-drawn out process."

              "However, due to power outages, I doubt any of the systems we actually deployed into the field actually lasted more than a few years without getting an unplanned hard reset."

              I bet you still didn't skimp and elected to put in a hardware watchdog 'just in case'

  3. RIBrsiq

    This is obviously an exotic usage of the term "fail safe" that I wasn't previously aware of.

    1. Measurer

      Airbus definition of 'fail safe'....

      ...was obviously not pilfered by the NSA, via the German Security Service. The yanks do not know what 'fail safe' means, as I have repeatedly been reminded of every time I speak to a U.S engineer about safety systems on machinery.

      1. Anonymous Coward
        Anonymous Coward

        Re: Airbus definition of 'fail safe'....

        This wouldn't be the same Airbus whose software carried out uncommanded pitch down movements in a Qantas A330 in 2008, resulting in serious injury and an emergency landing, would it?

        1. Anonymous Coward
          Anonymous Coward

          Re: Airbus definition of 'fail safe'....

          No one can have absolute confidence that a particular system is free from defects but the type of problems are different. Pointing fingers between Boeing and Airbus is not productive and both companies' products are very safe. Very safe does not mean completely free of any issues or defects. It means the probability of serious consequences from a design flaw is very low, not zero.

          The ability for a variable to overflow and in doing so cause significant issues is a straightforward software implementation defect which is reasonably foreseeable and testable. It is worrying that this sort of issue gets through to production, and there should be, and I'm sure will be, an analysis of not just this fault but the overall software design and development process, and whether there are any wider shortcomings that should be addressed.

          The uncommanded pitch down event was caused by a faulty ADIRU feeding incorrect information to the software with a quite specific and unanticipated pattern of incorrect samples. The software had data from multiple ADIRUs and was intended to be able to operate without problems even if information from one was incorrect but, given the particular nature of the faulty information, ended up averaging good and bad data. Now this is still a design issue but I think the second problem is harder to predict and analyse and test. This would have also triggered an examination of the development process but probably focussing much more on system design than software development.
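As a generic illustration of why averaging good and bad data is fragile, here is a minimal C sketch. It is not the Airbus algorithm, which is not described in this thread; the function names `mean3` and `median3` and the sample values are made up for the example. It only shows that a plain average of three readings is dragged far off by one bad input, while a median-of-three vote is not.

```c
#include <stdio.h>

/* Hypothetical helper: plain average of three redundant readings. */
static double mean3(double a, double b, double c) {
    return (a + b + c) / 3.0;
}

/* Hypothetical helper: median of three readings (simple voting scheme). */
static double median3(double a, double b, double c) {
    double t;
    if (a > b) { t = a; a = b; b = t; }  /* now a <= b */
    if (b > c) { t = b; b = c; c = t; }  /* now c is the largest */
    if (a > b) { t = a; a = b; b = t; }  /* now a <= b <= c */
    return b;
}

/* Print both results for two plausible readings and one spike. */
static void demo_voting(void) {
    printf("mean   = %.1f\n", mean3(1.0, 2.0, 48.0));
    printf("median = %.1f\n", median3(1.0, 2.0, 48.0));
}
```

Calling `demo_voting()` shows the average landing nowhere near either plausible reading, whereas the median simply ignores the spike. Real flight control voting logic is far more involved than this.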

          1. Anonymous Coward
            Anonymous Coward

            Re: Airbus definition of 'fail safe'....

            "Now this is still a design issue but I think the second problem is harder to predict and analyse and test."

            So in other words if it's a problem in a Boeing it's a case of "The idiots didn't test it well enough" but if it's a problem in an Airbus it's "Oh but it's very hard to test for weird edge cases, they do their best".

            Partisan? Much?

            1. Richard 12 Silver badge

              Re: Airbus definition of 'fail safe'....

              No.

              A timer overflow is so obvious and predictable that you can even work out exactly when it will occur to the individual tick.

              A mistake in a flight control algorithm that gives unwanted results when fed by a particular mix of wrong and right values is an incredibly hard thing to predict.

              One is a failure to count.

              The other is an inability to allow for and test all possible circumstances.

    2. Rusty 1

      Similarly "near miss", as in "yes, he nearly missed me."

    3. Anonymous Coward
      Joke

      Must be a typo

      I think they meant 'fall safe'.

    4. Yag
      Mushroom

      For the US definition of "Fail safe"...

      see the movie.

  4. Malcolm Weir Silver badge

    "And presumably also turning the 787 into a brick with no power for its fly-by-wire systems, lighting, climate control or in-flight movies."

    The fly-by-wire systems will be fine, because the RAT will pop out and produce power. The pilots will then, presumably, try to cycle the GCU's one at a time, and all will become well again.

    It's a problem, but not a massive one.

    1. bri

      This assumes the problem is understood, which would be a tall order before this notification and that there is sufficient height to lose while solving the problem. Would be a bummer when on final approach.

      1. Pookietoo

        re: This assumes the problem is understood

        I don't think it takes much understanding to fix the problem: power goes down, auxiliary kicks in, pilots/engineer restart failed generator controllers, everything's fine again. Assuming it doesn't happen at a critical stage in the flight, of course.

    2. AnonymousCoward

      RAT trap

      Really?

      Even the best trained pilots are going to break into a sweat when much of the power goes off and the last ditch RAT deploys. Assuming they sort it all out, is the RAT retractable so they can proceed to land without ripping it off the bottom of the hull?

      1. S4qFBxkFFg

        Re: RAT trap

        I'd assume the designers thought of that - either the clearance is enough that it won't, or it should detach in a non-dangerous (but probably expensive) manner.

        edit: just had a look at some photos - it seems to be located in the middle of the fuselage (on the bottom) and doesn't appear to be longer than the main gear.

      2. John H Woods Silver badge

        Re: RAT trap

        "Even the best trained pilots are going to break into a sweat when much of the power goes off and the last ditch RAT deploys. " -- AC

        True; the AF447 pilots flew their perfectly airworthy plane into the ocean just because their pitot tubes froze, and they lost an instrument (airspeed) that they could easily have managed without, if they hadn't made such a big and fatal deal about it.

    3. rmv
      Stop

      RATs

      "The fly-by-wire systems will be fine, because the RAT will pop out and produce power."

      "Last June, the FAA approved an exemption to allow the 787-9 to enter service on schedule despite a substandard reliability record on the GCU for the RAT. The agency approved the exemption because it was deemed extremely improbable that all six power generators on board could fail at the same time."

      From the flightglobal article (http://www.flightglobal.com/news/articles/faa-orders-new-787-electrical-fix-to-prevent-power-failure-411794/)

    4. Clani

      Will the RAT keep the "in-flight movies" going however ;-)

    5. Anonymous Coward
      Anonymous Coward

      I think regular RAT deployment would be bad PR for Boeing.

      It's a last resort, you don't want to be relying on it.

      Add on the Human Factors of stress on aircrew and you're just writing the script for the next episode of Air Crash Investigation.

      1. Bronek Kozicki
        Unhappy

        I would be more concerned about climate control than anything else. An aircraft spends most of its time at high altitude, where as a glider it would give pilots lots of time to recover, but with air pressure falling fast they might not be able to do much after a short time.

    6. hplasm
      Happy

      "The fly-by-wire systems will be fine-

      ... because the RAT will pop out and leave the sinking plane."

      FTFY

    7. Alan Brown Silver badge

      "The fly-by-wire systems will be fine, because the RAT will pop out and produce power. The pilots will then, presumably, try to cycle the GCU's one at a time, and all will become well again."

      Even if the RAT didn't pop out, the aircraft isn't suddenly going to fall out of the sky. There are some funny things sticking out of the fuselage called "wings" which serve to provide something called "lift" and when the engines stop there's this thing called "gliding"

      I know the Reg has a red banner but do we really need to be subjected to the kind of writing which wouldn't be out of place in the Daily Mail or News of the World?

      1. Malcolm Weir Silver badge

        @Alan Brown...

        Well, quite. And the "Gasp, Horror!" notion that if the GCU's happened during final approach DISASTER would happen...

        ... although if the GCU's packed up on final, the avionics would still work (battery and RAT deployment), and the chaps flying the thing would take emergency action which would appear, to the untrained eye, exactly the same as the non-emergency action: they'd fly the thing onto the runway which was neatly lined up in front of them (because they are on final).

        To be honest, the worst phase of flight for the GCU's to fail would likely be cruise, because the air conditioning would pack up (it's electric in the 787), so the SOP would be to lose altitude to 10,000ft or thereabouts so the passengers can breathe once the oxygen generators pack up. Since you're obviously operating on 240 minute ETOPS, then by definition you may be up to 4 hours from a landing field. But one would expect a certain amount of dialog between the crew and the maintenance base during those four hours...

        1. chris 17 Silver badge

          Re:

          Etops is when at least 1 engine is operational. The plane will not glide for 4 hours.

      2. anonymous boring coward Silver badge

        I don't know how well the totally fly-by-wire (electrical wire) plane would fly with no electricity?

        My guess is, not too well. At least not after the smoothing battery is exhausted.

  5. iansn

    Sounds like an old Oracle 8 DB bug

    1. Knewbie

      from memory

      A vanilla install of Windows NT (tm) with no patch or network connection ( nor keyboard or mouse, btw) would self crash after 248 days...

      Mr Gates? I have boeing on the line about that extended support contract you spoke about ... ;)

      1. Fatman
        Joke

        Re: from memory

        A vanilla install of Windows NT (tm) with no patch or network connection ( nor keyboard or mouse, btw) would self crash after 248 days...

        So THAT is what those GCU's are running on????

        You would think a company like Boeing could use a newer version of the O/S???

        </sarcasm>

        1. anonymous boring coward Silver badge

          Re: from memory

          On the other hand, any newer MS OS is guaranteed to crash for unknowable reasons long before 248 days. I'd take NT, thank you very much.

      2. Anonymous Coward
        Anonymous Coward

        Re: from memory

        Solaris 2.5 crashed after 248 days as well. Others?

  6. Pavlov's obedient mutt

    refuel...

    They shut down to refuel. They shutdown during cleaning. When cargo doors are open.. during inspections, etc.

    This is a "oh look, in theory, if we keep it running this long..." bug.

    1. John Sager

      Re: refuel...

      Doubt it. Even if there is no ground power, big aircraft like that have an Auxiliary Power Unit - a small jet engine in the tail connected to an alternator. So quite a few bits of aircraft electronics probably stay powered up for months. Having said that it's probably not as long as 248 days though.

    2. Anonymous Coward
      Anonymous Coward

      Re: refuel...

      No they don't.

    3. Bah Humbug

      Re: refuel...

      Have you ever been on a plane? They're often refuelling the plane as you board, but have lights on so you can find your seat - those lights need power. Ditto for the cleaning crew - they need to be able to see those piles of rubbish the passengers on the inbound left, etc. etc.

      1. Pavlov's obedient mutt

        Re: refuel...

        have I? no - not often.. no more than 128 segments in the last 14 months.

        FYI - Emirates has the HIGHEST average aircraft utilisation time at 13.7hrs / day... Gee, wonder what happens in the remaining 10.3hrs?

        So let's ask this question - other than that one Thomas Cook charter you took with your mates to Benidorm 3 years ago, have you actually had anything to do with the airline industry?

    4. uncle sjohie

      Re: refuel...

      Actually, the 787 we flew from Cuba back to Amsterdam, was being refueled at the stopover in Mexico, with all the passengers on board and the lights on. We were told to unfasten the seatbelts, and stay in our seats, while there was a stewardess positioned at every exit. So it wasn't completely shutdown.

    5. Anonymous Coward
      Anonymous Coward

      All depends what 'shutdown' means?

      The definition of "shut down" is a bit vague. Even Flight magazine don't have one:

      http://www.flightglobal.com/news/articles/faa-orders-new-787-electrical-fix-to-prevent-power-failure-411794/

      Does it mean "main engines shut down'? That seems unlikely, as the odds of main engines NOT being shut down for months at a time seem very small (maybe even smaller than the odds of two speed probes failing identically at the same time, AF447 style).

      Does it mean "main engine generators and APU generators shut down"?

      Does it mean "main engines, APU, and ground power all shut down"?

      Does it mean "main engines, APU, ground power, and lithium battery all shut down or unavailable"?

      Or some other set of circumstances not listed above?

      And perhaps a more fundamental question: what level of certification is needed for a Generator Control Unit, and what went wrong from the design and certification process that allowed an arithmetic overflow to be designed in rather than designed out? It seems a fair bet that overflow is what happens to trigger the entry in to "fail safe mode", although the exact details are not yet clear.

      It also seems a bit strange that "fail safe mode" in this case seems to mean "fail unsafe" (as in, "generate no power").

      Have these guys been taking lessons from the [redacted] School of Safety Critical Software Design?

      http://www.eetimes.com/document.asp?doc_id=1319966

      1. Roland6 Silver badge

        Re: All depends what 'shutdown' means?

        +1 for the EE Times article, a good update on the various "car accelerated without warning" incidents that have been reported over several decades.

  7. Anonymous Coward
    Anonymous Coward

    Good old Scottie

    "The more they overthink the plumbing, the easier it is to stop up the drain."

  8. Anonymous Coward
    Joke

    See...

    ...should of used NT4 to do the software, it would sort itself out about every 30 days.

    1. Anonymous Coward
      Anonymous Coward

      Re: See...

      Windows 95

      49.7 days

      should have

  9. mathew42

    Last year I was sitting on a plane preparing for take-off. We were informed that the co-pilot's microphone wasn't working. After 10 minutes, the solution was to power cycle the aircraft. 15 minutes later we were on our way.

    Not as frustrating as the previous incident where the pilot's window wouldn't open. Apparently it is their emergency exit and we couldn't take off with it stuck. After an hour of banging coming from the front, we transferred to another plane. I didn't feel as bad as the person next to me who watched the flight she was originally booked on depart.

    1. Chris Miller

      Not just planes

      A few years ago I was on a brand new TGV returning from Strasbourg. Suddenly we started to slow down and coasted to a halt. We sat there for about 20 minutes and then all the lights and aircon went out. Five minutes later the power came back on and we resumed our journey without incident.

      I always assumed the SNCF helpdesk asked: "Have you tried turning it off and then back on again?".

      1. Arnold Lieberman

        Re: Not just planes

        Happens on C2C line quite regularly... train is at a station, lights go off then on and the scrolling display shows "Electrostar" for a minute.

        I was told by a neighbour some years ago when the new rolling stock was commissioned that the onboard computers were running on Windows 95. Maybe they've upgraded to ME by now...

        1. paulf
          Gimp

          Re: Not just planes

          Also happens when a dual electric Electrostar set swaps between OHLE and juice rail on the West London line near North Pole Junction (between Shepherd's Bush and Willesden Jcn). The train usually stops for the swap and seems to reboot briefly when the onward power source is engaged. It used to knock out the PIS but that seemed to have been fixed last time I used it.

          Eurostar sets used to support three power sources until the juice rail pick up shoes were removed. I understand those sets just coast between different electrical systems as if it was going through a neutral section.

          1. John Robson Silver badge

            Re: Not just planes

            Presumably with compromised braking?

            1. John Robson Silver badge

              Re: Not just planes

              Not quite sure why the downvote - I have always been told that much of the train's braking is active - dumping power back into the grid. Presumably at least that entire system is compromised...

              Overall braking is probably maintained by conversion of brake pads into dust - but it's like a RAID array with only one drive left in a mirror - working, but compromised.

              1. paulf
                Boffin

                Re: Not just planes

                Eurostar 373/1 trains have three braking systems only one of which is Rheostatic braking. This is different to Regenerative where energy is returned to the OHLE.

                http://en.wikipedia.org/wiki/British_Rail_Class_373#Braking_systems

                Assuming they use air brakes for the other systems, these apply the brakes by reducing the air pressure in the train pipe. So, put simply, an extended lack of power supply to run the loco air compressor to maintain pressure in the train pipe would mean the train stops and can't release its brakes rather than the other way around.

                Providing the locos have a large enough air supply reservoir for the air braking system there's no reason why braking should be compromised while it coasts through a few hundred metres of neutral section. It has been known in recent times for electric trains to coast for a couple of miles through sections with failed OHLE, which allows train services to be maintained while waiting for repairs to be completed in a possession that night.

        2. Alan Brown Silver badge

          Re: Not just planes

          "I was told by a neighbour some years ago when the new rolling stock was commissioned that the onboard computers were running on Windows 95. "

          FWIW the Docklands light railway is running on PDP11 microvaxes.

          The entertainment system (signboards) might be running some consumer OS, but the safety critical systems are highly unlikely to be doing so. Too many liabilities.

  10. Clani

    Common Millisecond Counter Issue

    Sounds like a 32-bit signed integer counter counting milliseconds rolling over, and the system not coping with the counter value being less than a previous reading

    2,147,483,648 ms = 248.5513481481481 days

    A common counter issue; I have seen it in lots of devices

    1. Simon Harris

      Re: Common Millisecond Counter Issue

      almost - but I make the counter tick interval 10ms instead of 1ms if it's a signed value, 5ms if it's unsigned.
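The arithmetic behind that correction can be checked with a short C sketch. The helper name `days_until_overflow` is made up for this illustration, and nothing here is taken from the actual GCU code; it just converts "number of counter bits" and "tick period" into days before the counter runs out of range.

```c
#include <stdint.h>

/* Hypothetical helper: days until a counter with `bits` usable bits
 * overflows, if it is incremented once every `tick_ms` milliseconds.
 * Use bits = 31 for a signed 32-bit counter, 32 for an unsigned one. */
static double days_until_overflow(unsigned bits, double tick_ms) {
    const double ms_per_day = 1000.0 * 60.0 * 60.0 * 24.0;
    double ticks = (double)(UINT64_C(1) << bits);  /* 2^bits ticks */
    return ticks * tick_ms / ms_per_day;
}
```

With these assumptions, `days_until_overflow(31, 10.0)` and `days_until_overflow(32, 5.0)` both come to roughly 248.55 days, while `days_until_overflow(31, 1.0)` is only about 24.86 days and `days_until_overflow(32, 1.0)` is about 49.7 days, the figure quoted elsewhere in the thread for Windows 95.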

      1. Alan Brown Silver badge

        Re: Common Millisecond Counter Issue

        "I make the counter tick interval 10ms"

        Which OS runs on 10ms ticks?

        https://en.wikipedia.org/wiki/System_time

        I really hope not.

    2. Anonymous Coward
      Anonymous Coward

      Re: Common Millisecond Counter Issue

      Well spotted, certainly sounds plausible.

      Now readers need to think about why the GCU suppliers didn't spot (and didn't properly handle) a predictable integer overflow condition.

      The 787 is based around more use of electrics than on previous aircraft (and less use of hydraulics and compressed air). Hence the lithium batteries they've been having so much fun with.

      I don't know what level of DO178 criticality/certification was supposed to apply to the Generator Control Unit(s) but it sounds like they missed it anyways.

      1. Simon Harris

        Re: Common Millisecond Counter Issue

        "Now readers need to think about why the GCU suppliers didn't spot (and didn't properly handle) a predictable integer overflow condition."

        Maybe they had planned on the batteries always exploding before 248 days were up.

      2. Roland6 Silver badge

        Re: Common Millisecond Counter Issue

        >Now readers need to think about why the GCU suppliers didn't spot (and didn't properly handle) a predictable integer overflow condition.

        Well, there are several reasons why this may not have been spotted. Firstly, 249 days is a long time to have something on continuous test. Secondly, given the fun and games you can have with ASM and 'C', there is no reason to suppose that the actual counter was not declared as unsigned and that only a single test and hence a single conditional ASM instruction, treated the counter as signed.

        Also from memory the x86 instruction set for example, doesn't make a big difference between the Jump conditional (unsigned) and Jump conditional (signed) variants and hence unless you are paying real attention to the code you are reading, this is easily missed.
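A small C sketch of how a single signed test on an otherwise unsigned counter could misbehave. This is purely illustrative: the function names are invented, and the cast to `int32_t` is implementation-defined in C for values above `INT32_MAX` (the usual two's complement compilers simply reinterpret the bits, which is what this sketch assumes).

```c
#include <stdbool.h>
#include <stdint.h>

/* Intended check: has the unsigned tick counter reached the limit? */
static bool reached_unsigned(uint32_t counter, uint32_t limit) {
    return counter >= limit;
}

/* The slip: the same check done with signed comparison. Once the
 * counter passes 2^31 - 1 it looks negative, so the test flips.
 * Assumes two's complement conversion (implementation-defined). */
static bool reached_signed(uint32_t counter, uint32_t limit) {
    return (int32_t)counter >= (int32_t)limit;
}
```

For a counter value of `0x80000000` and a limit of 1000, the unsigned version says the limit was reached and the signed version says it was not, even though nothing else in the code changed.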

        1. Mike 16

          Re: Common Millisecond Counter Issue

          If my guess is correct, this is a combination of two problems. The first is using code like:

          timeout = current_time + delta;

          while ( current_time < timeout ) yield();

          instead of

          time_snap = current_time;

          while ((current_time - time_snap) < delta) yield();

          and the other is inadvertent use of signed arithmetic in there somewhere, which makes it worse.
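
          For anyone who wants to try the two forms side by side, here is a small self-contained C sketch. It assumes a 32-bit unsigned tick counter; the function names are mine, and the loop bodies are reduced to "are we still waiting?" predicates so they can be called with arbitrary values.

          ```c
          #include <stdbool.h>
          #include <stdint.h>

          /* First form: precompute an absolute deadline. If start + delta wraps
           * past UINT32_MAX, the deadline lands below `now` and the wait ends
           * immediately. */
          static bool waiting_absolute(uint32_t now, uint32_t start, uint32_t delta)
          {
              uint32_t timeout = start + delta;   /* may wrap */
              return now < timeout;
          }

          /* Second form: compare elapsed time. Unsigned subtraction is modulo 2^32,
           * so (now - start) stays correct across a single wrap of the counter. */
          static bool waiting_elapsed(uint32_t now, uint32_t start, uint32_t delta)
          {
              return (uint32_t)(now - start) < delta;
          }
          ```

          Starting 16 ticks before the wrap with a delta of 100, the first form stops waiting at once, while the second keeps waiting until 100 ticks have really gone by.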

          When I started messing with Linux device drivers, I noticed the first problem in the timer code. In a Roadrunner moment, as I was asking my boss if this could still be the case, a Win95 machine next to his desk crashed in a suspiciously similar way. (the 16-bit Windows manifestation had a different deadly duration because 16bit, and roughly 18Hz) When I were a lad, the danger of the first sort of code was drilled into us. Apparently the Windows _and_ Linux folks didn't take that class, or slept through it. Now they work for Boeing.

          BTW: Last time I looked, the Linux timer code had grown a dense mat of "hair" around that code, rather than fixing the underlying cause. I suspect the reason for the first implementation was the desire to save stack space (need to store both time_snap and delta), combined with ignorance of the problem.

          The reason for not actually fixing it seems to be "But we've always done it this way". I haven't checked in the last six years, but would be delighted to hear they finally bit the bullet.

          The usual bandaid for this sort of thing these days is just make all the time variables bigger. I can imagine Lister facing a shutdown of this sort and wondering why the coders of Holly didn't anticipate 64-bit wrap.

          1. Anonymous Coward
            Anonymous Coward

            Re: "Now they work for Boeing."

            Maybe you're not aware of this, but almost every major subcomponent of the 787 is subcontracted out. The engines go to GE or Rolls-Royce (who have just appointed the ex-CEO of ARM as the new CEO of RR).

            The engine driven generators and the associated control units? Not sure. GE might do their own, RR probably won't.

            The amount of engineering and design (and test?) which Boeing themselves do on the 787 is relatively small.

            There was a lovely whitepaper on the perils of outsourcing for a company like Boeing. Written by a Boeing employee, but not one from HQ. 16 pages, too long to tweet (sorry). But the oversimplified summary isn't:

            you can outsource the design and manufacture but you still have the responsibility, financially and in 'image' terms. You still have to carry all the costs, and the outsourcers have you by the short and curlies. They know you depend on them.

            Find it here:

            "Outsourced profits: the cornerstone of successful subcontracting. Author: Dr L J Hart-Smith"

            http://seattletimes.nwsource.com/ABPub/2011/02/04/2014130646.pdf

            Summarised in a reasonable way at

            http://www.zdnet.com/article/the-hard-lessons-of-boeings-787-outsourcing/ (from smartplanet)

        2. Anonymous Coward
          Anonymous Coward

          Re: "249 days is a long-time to have something on continuous test. "

          "249 days is a long-time to have something on continuous test. "

          Sure is. Which is why people who know safety critical software, and who produce guidelines related thereto, understand that normal operational testing is only one part of a multi-prong strategy for risk and defect reduction.

          One such aspect was code inspection by engineers with a clue (sometimes, expensive people with a clue, frequently engineers who are nowadays unpopular with Management, because they spot things that need fixing, and modern Management frequently like to ship stuff earlier rather than ship less-defective stuff).

          Another aspect was automated in-target testing along "test what you ship, ship what you test" lines, but with mechanisms that allow variables to be set before entry to a piece of code, and results checked on exit. So you could very quickly check what happened when you add 1 to 32767 (to quote a very simple example). No need to wait 249 days to see your overflow.

          We don't yet know *exactly* what happened. I'll be interested to see how it could have been avoided. But I'll be astounded if it genuinely only shows up after 249 days and couldn't have been foreseen and therefore prevented at design/code/test time.
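
          A rough C sketch of the "set the variable before entry, check the result on exit" idea. Everything here is hypothetical: a 16-bit counter stands in for the real one, and saturating at the maximum is just one possible policy, chosen so there is something to check.

          ```c
          #include <stdint.h>

          /* Hypothetical 16-bit uptime counter, standing in for the unit under test. */
          static int16_t uptime_ticks;

          /* Test hook: force the counter to any value before the code runs. */
          static void set_uptime_for_test(int16_t value)
          {
              uptime_ticks = value;
          }

          /* Code under test: advance one tick, saturating instead of overflowing. */
          static int16_t tick(void)
          {
              if (uptime_ticks < INT16_MAX) {
                  uptime_ticks++;
              }
              return uptime_ticks;
          }
          ```

          Preset the counter to 32766, call tick() twice, and the behaviour at 32767 is checked in microseconds rather than after months of soak testing.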

          1. Roland6 Silver badge

            Re: "249 days is a long-time to have something on continuous test. "

            We don't yet know *exactly* what happened. I'll be interested to see how it could have been avoided. But I'll be astounded if it genuinely only shows up after 249 days and couldn't have been foreseen and therefore prevented at design/code/test time.

            Following the discussion about "Fail Safe" and the little information released it does seem that insufficient consideration was given to just what was the right fail safe mode for a GCU to go into.

      3. Destroy All Monsters Silver badge

        Re: Common Millisecond Counter Issue

        I don't know what level of DO178 criticality/certification was supposed to apply to the Generator Control Unit(s) but it sounds like they missed it anyways.

        If you think having a stack of paper on your desk and following the recommendations and ticking all the boxes on the discrepancy item sheet precludes you from missing obvious-in-retrospect problems, you have another thing coming.

        This won't be the last error.

  11. Nick L

    "just hanging in the sky in much the same way that bricks don't"

    Turning an aircraft into a brick? Reminds me of Douglas Adams, whose description of the Vogon Space Fleet was memorable. "The ships hung in the sky in much the same way that bricks don't."

    Still smirk thinking of that.

    1. DocJames

      Re: "just hanging in the sky in much the same way that bricks don't"

      That was the reference intended.

  12. iranu

    Auxiliary Power Unit says

    Ram Air Turbine

  13. kmac499

    A couple of Avro Vulcans were lost due to complete electrical failures. They were fitted with an emergency battery pack but that failed as well. They weren't exactly fly by wire but once the electrically operated servos were dead there was no flight control.

    1. Chris Miller

      Even worse, the pilots had bang seats, but the rest of the crew did not.

    2. hugo tyson
      Coat

      Avro Vulcan

      You're right, Vulcans weren't fly-by-wire at all; it's pushrods and levers, but the large force to move the control surfaces is electro-hydraulic: each PFC actuator is a self-contained unit taking 200V AC electric power in, to hydraulic pump, to hydraulic piston making a large force on the external hardware - but controlled by a mechanical lever input. The mixing of elevator and aileron signalling to the elevons, and the artificial feel is all (electric-powered) mechanical. Essentially it uses electric instead of hydraulic power to do all the power-assisting, but the control connections are very traditional linkages.

      Not sure, but I think the batteries were only 28V for running other control systems, not enough to power the PFCs, enough to operate the RAT or AAPU to work around generator or bus failures.

      So complete electrical failure left all the controls locked. Not good at all, especially for the 3 guys in the back. :-(

      1. Anonymous Coward
        Anonymous Coward

        Re: Avro Vulcan

        Except for the fact that the Vulcan had a manually deployed RAT in the wing which would have solved the electrical failure issue (my father was involved in the development/testing of one of the platforms that the Vulcan carried so for a while he was one of the 3 guys in the back).

  14. Will Godfrey Silver badge
    Unhappy

    Have you tried turning it off and on again?

    Sorry, someone had to say it :)

    It sounds like a pretty crappy bit of programming. I'm no expert by any means but I learned how to deal with integer overflows (assuming that's what it is) on a BBC model B.

  15. Alan J. Wylie

    signed 32 bit counter in centi-seconds

    0x7FFFFFFF = 2147483647

    2147483647 / 60 / 60 / 24 / 100 = 248.551
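
    The same sum as a small C helper, for anyone who wants to try other widths or tick rates. The function name is made up; the only assumption is a counter that starts at zero and counts up at a fixed rate.

    ```c
    #include <stdint.h>

    /* Days until a counter reaches max_count when it ticks at tick_hz. */
    static double days_until_max(uint64_t max_count, double tick_hz)
    {
        const double seconds_per_day = 60.0 * 60.0 * 24.0;
        return (double)max_count / tick_hz / seconds_per_day;
    }
    ```

    days_until_max(INT32_MAX, 100.0) gives roughly 248.55 days, matching the figure in the directive, while the same signed counter ticking at 1 kHz would last only about 24.9 days.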

  16. Anonymous Coward
    Anonymous Coward

    Time-Aware computing

    NIST recently put out Technote 1867 regarding this very issue, it's called Time-Aware Applications, Computers, and Communication Systems (TAACCS). This is a very big deal and affects pretty much all computers today. If you recall, during the Gulf War, this is what caused the Patriot missile system to fail, the accuracy was inversely correlated to the uptime of the fire controller.

    To the people that say "Just reset it" you should understand that there is very little machinery made today that is ever truly powered down. Between your mains power you always have at least one level of batteries, but in a complex system like a plane there are many more levels, down to supercaps or 10-year lithiums on the controller board - and if the embedded software is written correctly, it probably saves its state to flash, including this counter - so no, short of an EMP, you can't really be sure the machines are actually dead.

    And even so, power-cycling should never, never be the "solution" to a problem like this. You can't reboot while you're over the Atlantic. And not sure why they use the term FailSafe, this is more correctly referred to as FailFast.

    1. Pookietoo

      Re: You can't reboot while you're over the Atlantic.

      Rather that than on approach.

  17. Anonymous Coward
    Anonymous Coward

    SOS, DD

    Garbage in = Garbage out

    When will they learn?

  18. John Styles

    Remember...

    Many years ago there was a project at RSRE (now presumably part of [spit] QinetiQ if this bit exists in any meaningful way which I very much doubt) that developed an allegedly formally (mathematically) verified processor, the VIPER, and a language, NEWSPEAK (NB not the language that appears on Wikipedia with that name). One of the ideas was that

    i := i + 1

    (or whatever the syntax was) was illegal, because the ranges of the input values and output values are by definition different, which it doesn't allow, i.e. you would need an explicit IF to handle the overflow case.
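
    This is not NEWSPEAK, but a rough C analogue of the idea might look like the sketch below: the increment cannot be written without the caller seeing, and having to deal with, the overflow case. The function name is hypothetical.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    /* Increment *i only if the result is representable; report whether it was. */
    static bool checked_increment(int32_t *i)
    {
        if (*i == INT32_MAX) {
            return false;   /* caller must handle the overflow case explicitly */
        }
        *i += 1;
        return true;
    }
    ```

    C will not force anyone to look at the return value, of course, which is rather the point of building the rule into the language instead.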

    Search for VIPER NEWSPEAK on Google Books to find a 1985 New Scientist article.

    My recollection is that they tried to licence it commercially, it turned out to have bugs and one of the licensees tried to sue the MoD but went bust before it went to court.

  19. Canecutter
    WTF?

    Something I don't understand

    What I don't understand is what is the purpose of the GCU.

    From what is described, I see a black box of some sort with a finite counter in it, that is capable of shutting down the generator when that counter overflows.

    Given that a generator is a reasonably straightforward thing, what kind of improvement can one have in mind attaching to it a box that shuts it down for something as predictable and maybe trivial as an overflowing finite counter?

    Help me. I really am trying to understand.

    1. Paul Hovnanian Silver badge

      Re: Something I don't understand

      It contains the voltage regulator, generator field and generator main breaker control plus a lot of protection and monitoring functions.

      As with practically all modern digital control systems, anything requiring a time delay, interval, scheduling future events, etc. uses a system clock to determine when the next task is to be run. At first glance, this would appear to be a simple implementation. Schedule event at Time = Now + Interval. But there's that nasty limitation of all microprocessors in that time is stored in a register or memory location with a finite upper bound. So when the timer reaches that, it rolls over to zero again (much like a mechanical odometer). So all timing functions must be written to handle this discontinuity in their logic.
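
      One common way to handle the rollover, sketched in C under the assumption of a free-running 32-bit unsigned tick counter (the function names are invented, and this is not a claim about how the GCU does it): let the scheduled time roll over too, and judge "due or not" by the distance between the two values rather than by which is numerically larger.

      ```c
      #include <stdbool.h>
      #include <stdint.h>

      /* Schedule an event: Time = Now + Interval, allowed to roll over. */
      static uint32_t schedule(uint32_t now, uint32_t interval)
      {
          return now + interval;
      }

      /* True once `now` has reached `due`. Valid as long as the two are never
       * more than 2^31 - 1 ticks apart: a modulo-2^32 difference in the lower
       * half of the range means `now` is at or past `due`. */
      static bool is_due(uint32_t now, uint32_t due)
      {
          return (uint32_t)(now - due) < 0x80000000u;
      }
      ```

      The price is that intervals must stay under half the counter range, which is usually an easy constraint to state and to test.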

      What shocks me about the 787 power system controls (sorry about that), is that the real time controls and event scheduling routines appear not to be based on some stable and tested software libraries, where such goofs have been caught and fixed early in their development. These are the sorts of goofs that any competent embedded s/w designer should be aware of. But better yet, this level of code is something that an application developer should never have to write from scratch.

      This reminds me of an anecdote from my days at Boeing*. I was reviewing the credentials of several candidates for a job which involved the maintenance of a large package of (mainly) Perl code that moved documents around between various systems. One guy submitted a Perl app he had written in his previous job that implemented an FTP session to do just this sort of thing. It was well written, neatly formatted and showed that he had a good understanding of Perl syntax and programming. But it was dozens of pages of an 'expect' like program that called a Unix command-line ftp client. So, during the interview, I asked him if he had ever heard of CPAN. "No", was his reply. "So, you've never seen the Net::FTP module?" "No" again. Net::FTP could do in a dozen lines what he had done in that many pages of code, leaving me to wonder just how 'good' a developer he was.

      *Boeing most probably didn't write the GCU code. That's a trail that runs back through several layers of h/w and s/w vendors.

      1. Canecutter

        Re: Something I don't understand

        Thanks, Paul Hovnanian.

        Based on your description, it is not clear that functions like output regulation, protection and monitoring need to become disabled if the GCU software should crash owing to the overflow of a finite counter. Is it that the generator is still able to produce output, possibly at some reduced level, if the GCU is disabled, or is it that the generator is completely offline and unable to carry any load?

        According to the FAA directive quoted in the article, "causing that GCU to go into failsafe mode", does failsafe mode mean no generator output? That would be rather strange for a safety critical system like an aircraft electrical power system, and not something I am accustomed to hearing about aircraft systems.

        1. Paul Hovnanian Silver badge

          Re: Something I don't understand

          "it is not clear that functions like output regulation, protection and monitoring need to become disabled if the GCU software should crash owing to the overflow of a finite counter."

          All of these functions are implemented using digital signal processing techniques. Sampling, filtering and other functions with any kind of time variable will depend on the system clock, timers and event queues. If the clock becomes untrusted, continued operation of the generator can result in a hazardous condition. So a watchdog circuit trips the generator field off, preventing it from producing power and disconnects it from the system. The system design assumes a fault on a single generator channel. So another generator could be switched over to pick up the load. But since this failure mode can affect all channels nearly simultaneously, there is no source left to fall back on.

  20. Anonymous Coward
    Anonymous Coward

    Wonder if this has anything to do with Malaysia plane that disappeared into the Indian ocean?

    1. Vic

      Wonder if this has anything to do with Malaysia plane that disappeared into the Indian ocean?

      No. That was a completely different aircraft type.

      Vic.

  21. Anonymous Coward
    Anonymous Coward

    Just reboot it.....

    I was due to board a 787 a few weeks ago which (according to the announcements) had developed a system fault. The plane was still on the ground so the engineers were called in. In the end (after about 45 minutes) they announced that it had been a power problem and that to fix it they had 'rebooted the plane'.

    Cue the (nervous) comments from people about turning it off and on again.

    Fortunately the flight was uneventful after that.

  22. wolfetone Silver badge

    Two Things

    1) In development, surely there would have been a time when the engineers looked at the lack of generators being on and thought "Hmmm, this is very strange. I wonder why this is happening?" - but did nothing OR didn't notice it. Equally as bad.

    2) For the FAA to issue the directive this means that a Dreamliner must have already experienced this issue.

    1. Malcolm Weir Silver badge

      Re: Two Things

      @Wolfetone: no, for the FAA to issue the directive it means that Boeing has identified the issue. The FAA, not being complete idiots, don't wait until something goes wrong before they issue AD's.

      Boeing: Hey, if you X, Y, and Z the wings fall off and the plane plummets to the ground

      FAA: Well, that hasn't happened yet, so we'll just hold this on file until it does. Thanks all the same.

      1. wolfetone Silver badge

        Re: Two Things

        Suppose the FAA have learnt their lesson from the tombstone regulations then.

  23. JimNim

    Possibly due to NetBSD Variable Change?

    This sounds oddly similar to other software issues I've encountered... NetBSD v5.0 changed the variable "hardclock_ticks" from an unsigned 64-bit integer to a signed 32-bit integer. The counter increments by 1 for every 10ms that pass, and when it reaches a value of 2147483647 (~248.55 days) the next increment sets the "negative" bit in the signed integer. If code doesn't properly handle the possibility of a negative value, then you get a software crash... which I'm sure would be just wonderful for an airliner.

  24. Keven E.

    Perhaps I missed something...

    ... one can't easily tell where within the 248 days one currently *is?

    1. DrM

      Re: Perhaps I missed something...

      Agreed. They should be man-handled like a typical crap wireless router, rebooted every few days *before* it crashes.

  25. Bert.Douglas

    I did better than this in embedded controllers 20 years ago

    It appears that there is a 32-bit signed integer, incrementing at 100 hertz.

    2^31/100/24/60/60 => 248.55134814814815 days

    It is possible to code timer handling that allows wrap-around. Generally, you are checking elapsed time since some previous event. It is kind of tricky to get the details right, but it takes about the same number of instructions. You make convenient helper functions, and do not allow any other code to directly touch the timer value. Then for testing, you reduce the number of effective bits in the timer forcing frequent wrap-arounds.
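
    A minimal C sketch of that approach, with hypothetical names throughout: the raw counter is private to a handful of helpers, and a compile-time TIMER_BITS setting (assumed to be between 1 and 32) narrows the effective width so that test builds wrap every few hundred ticks.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    /* Effective timer width: 32 in production, something small such as 8 in
     * test builds so that wrap-around happens constantly. */
    #ifndef TIMER_BITS
    #define TIMER_BITS 8
    #endif
    #define TIMER_MASK (0xFFFFFFFFu >> (32 - TIMER_BITS))

    static uint32_t raw_ticks;   /* only the helpers below touch this */

    /* Called from the periodic tick source. */
    static void timer_tick(void)
    {
        raw_ticks = (raw_ticks + 1u) & TIMER_MASK;
    }

    /* Snapshot of the current time, to be passed back to the helpers below. */
    static uint32_t timer_now(void)
    {
        return raw_ticks;
    }

    /* Ticks elapsed since an earlier snapshot, correct across one wrap. */
    static uint32_t timer_elapsed(uint32_t since)
    {
        return (raw_ticks - since) & TIMER_MASK;
    }

    static bool timer_expired(uint32_t since, uint32_t interval)
    {
        return timer_elapsed(since) >= interval;
    }
    ```

    With TIMER_BITS set to 8 the counter wraps every 256 ticks, so any caller that mishandles wrap-around fails within seconds on the bench instead of after 248 days in service.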

  26. JLV
    Angel

    Let he who has never sinned cast the first stone

    >Both the programmer team and the test team should be ....

    Yes, I think that is funny and witty as well. But I wouldn't want it actually happening.

    I realize that this is a critical system, don't get me wrong. But don't expect things from others that you would not expect from yourself.

    They should review their procedures to avoid it happening again. The aviation industry's IT has a laudable quality record by the standards of our profession at large.

    Excessive penalties in an industry will not attract the kind of people I want to entrust my life to when flying. Fire someone if incompetence and negligence were involved, otherwise just fix it and learn from it, don't play scapegoats.

    A big thank you to the QA team for finding this.

    1. ecofeco Silver badge

      Re: Let he who has never sinned cast the first stone

      Critical system? It's life and death. "Critical system" is like calling a tsunami just another tide.

      1. ecofeco Silver badge

        Re: Let he who has never sinned cast the first stone

        From the FAA themselves:

        "In that case, the FAA said, “all four GCUs will go into fail safe mode at the same time, resulting in a loss of all AC electrical power regardless of flight phase.”

        “Loss of all AC electrical power can result in loss of control of the airplane,” it said."

        And I was given 3 thumbs down? Just... wow. I did not think the world could become even more stupid.

  27. Herby

    Weird flaws not likely to be encountered

    I discovered one of these on the operating system I worked on in the 70's. Turns out that the calculation of 'leap year' was done when the time was entered on the console (before CMOS clocks). The problem would happen if one entered the time in December of a non-leap year (like 1975) and if the machine ran continuously till February 29th. Since the determination of leap year was done in 1975, and 1976 was a leap year, it wouldn't register.

    Of course the likelihood of the machine being up for that long was sooo remote, they didn't consider it a bug. The remedy is to just enter the date some time in January, and go from there. There was really no need to reboot. Sadly the machine was sold for scrap in 1983 after a bankruptcy.
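
    For illustration only, here is the obvious alternative in C: work out the leap-year status whenever it is needed, from whatever the current year is, instead of caching it when the clock is set. This says nothing about how that 1970s system was structured; the function names are invented.

    ```c
    #include <stdbool.h>

    /* Gregorian leap-year rule, evaluated for whatever year is current. */
    static bool is_leap_year(int year)
    {
        return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
    }

    /* Days in February, recomputed per call rather than cached at clock-set time. */
    static int days_in_february(int year)
    {
        return is_leap_year(year) ? 29 : 28;
    }
    ```

    A machine started in December 1975 then gets 29 days in February 1976 without anyone having to re-enter the date in January.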

  28. DrM

    Lion

    "And presumably also turning the 787 into a brick with no power for its fly-by-wire systems, lighting, climate control or in-flight movies."

    I may be wrong, but I'm pretty sure they have a big battery pack backup, Lion I think.

  29. Dawud

    Just like Psy on YouTube

    Think of some stupid timer/counter using a 32 bit signed integer and incrementing every .01 seconds...how many days before you get overflow?

  30. Francis Vaughan

    Patriot and requirements

    As noted earlier, the bug has a great deal in common with the Patriot Missile failure. What is important is to note that the Patriot software wasn't in error. (It wasn't a clock counter wrap, but rather accumulating error in the clock.) The mistake was way back in the system requirements, where the specifications called for an agile system that could be rapidly deployed and moved as needed. The requirements called for a system that could remain stable for about four days. Nobody thought that there would be semi-permanent emplacements set up to protect military bases. So nobody added a time span to the requirements.

    So, how far back in the system requirements analysis for the GCU was there an explicit expectation for how long the system would stay powered up for? These are the places where issues slip between the cracks, not some poor programmer who was asleep at the wheel. With Boeing outsourcing so much of the systems, it isn't hard to see how hard it is to keep things like this under control. As the 787 is the first airliner to have such a massive reliance on electrical control, it isn't hard to see how traditional expectations of system up-time would influence the analysis done by many engineers.

    I bet an analysis of how this bug came into existence has vastly more to do with the difficulties of requirements across many contractors, and much less to do with "obvious" coding errors.
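
    To give a feel for how a clock error accumulates with uptime, here is a small C sketch. It follows the commonly cited account of the Patriot problem, in which elapsed time was counted in tenths of a second and scaled by a truncated binary approximation of 1/10; the 23 fractional bits and all the names are assumptions for illustration, not taken from the actual system.

    ```c
    #include <stdint.h>

    /* Assumed layout: 1/10 truncated to 23 fractional binary digits. */
    #define FRAC_BITS 23

    /* Difference, in seconds, between true elapsed time and the truncated
     * estimate after `ticks` tenth-of-a-second ticks. */
    static double clock_drift_seconds(uint32_t ticks)
    {
        const double scale = (double)(1u << FRAC_BITS);
        const double tenth_truncated = (double)(uint32_t)(0.1 * scale) / scale;
        return (double)ticks * (0.1 - tenth_truncated);
    }
    ```

    Under those assumptions the error per tick is tiny, but after 100 hours of continuous running it adds up to about a third of a second, which is a long time for something tracking a fast-moving target.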

    1. Roland6 Silver badge

      Re: Patriot and requirements

      Re: Clock accuracy and stability.

      I think part of the problem is that we have got used to the pace of technology evolution where most things are over specified and over engineered for the job - eg. a light bulb with a full computer on-board just so that the light can be controlled over the Internet from a browser app. Hence we simply take things off-the-shelf, thinking that they are largely the same. So it would not have surprised me that the requirements omitted key details because they were regarded as unnecessary pedantic detail.

      So in the case of Patriot (based on the information in your comment), it would not surprise me to find that there were no explicit clock accuracy and stability requirements, only the requirement for rapid deployment. Hence leaving the door open to the use of lower accuracy (but probably more robust) clock components.

      So yes I would agree the failing at Boeing may well be traced back to the requirements and the lack of design authorities obsessing over details. Let's hope findings get published...

    2. Anonymous Coward
      Anonymous Coward

      Re: Patriot and requirements

      "I bet an analysis of how this bug came into existence has vastly more to do with the difficulties of requirements across many contractors, and much less to do with "obvious" coding errors."

      It's trivially easy for a designer/coder/tester to ask "what if the timer/counter overflows".

      And if they did, some paperpusher somewhere will have either deliberately ignored the question or ignored the implications, in the interests of meeting the deadlines/budgets.

      It could and should still have been prevented. It wasn't. Lessons will be learned (etc).

  31. hugo tyson
    Mushroom

    Misplaced caution?

    I agree it looks like a 100Hz ticker reaching 2^31 and going negative. But I think (without going off into comp.risks) there might be something else going on.

    Problematic code can fail at overflow one of two ways: looping forever (very bad) or exiting prematurely (softer). It's not hard to get it right, of course, but if, say, the API changes (!) after you wrote your timer code or something like that it can end up wrong. So, for ultra-cautious safety-critical stuff, how do you make sure? You add asserts. Lots of asserts. Sounds like that's what we have here, as it goes into a fail-"safe" shutdown.

    Problem is, if your code would do the soft failure - premature exit from the loop - then the assert makes it less reliable. Because what would otherwise be a single spurious shorter timeout, perhaps no worse than noise in whatever it is you're measuring/updating periodically, perhaps completely harmless, the assert failure turns into a complete shutdown.

    In other words, assert - and thus shutdown - if you see insanity in your inputs, or detect an actual failure signal, for example. Asserting for anything less might help find bugs if you test it enough, but when deployed, you want something weaker, that logs an unexpected condition, but doesn't panic the system. IMHO.
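
    A minimal C sketch of the "log it, don't panic" alternative, assuming a periodic task that can tolerate one short timeout. The names are hypothetical and a bare counter stands in for a real fault log.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    static unsigned anomaly_count;   /* stands in for a real fault log */

    /* Log-and-continue check: record the anomaly but keep the unit running. */
    static bool soft_check(bool condition)
    {
        if (!condition) {
            anomaly_count++;
        }
        return condition;
    }

    /* Hypothetical periodic task: a timeout that appears to have ended early
     * is logged, and the caller simply waits for the next period. */
    static bool period_complete(uint32_t elapsed, uint32_t expected)
    {
        return soft_check(elapsed >= expected);
    }
    ```

    A hard assert() in the same place would turn one spurious short period at wrap-around into a shutdown; here it becomes one log entry and the unit carries on.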

    1. Mike 16

      Re: Misplaced caution?

      In the Linux case I previously mentioned, the failure mode was that every timer in the system would "fire" immediately, and keep firing, essentially locking up the scheduler. My boss had already experienced it, because we used a 1ms clock tick, rather than the default 10ms.

      As for Asserts, didn't that Ariane rocket blow up for essentially that reason?

  32. Mpeler
    Facepalm

    Patch Tuesday?

    Oh great. Now we have patch Tuesday for planes.....

    Maybe it's time to stop adding so much automation and let us humans do some work...

    (and, AManFromMars - I don't want to hear from you lot.....) :)

  33. ecofeco Silver badge

    Is Boeing a new word for "facepalm"?

    WTF Boeing? Seriously. WTF? (don't think you're off the hook either, Lockheed)

  34. Francis Vaughan

    More lessons - Ariane 5

    I'm still going to bet the problem isn't a simple coding error.

    People are assuming that the GCU was coded from scratch. It probably wasn't. The real time control executive was quite possibly an off the shelf, already flight qualified and certified system. A great thing to use. But again - who was responsible for the requirements - and especially understanding that the aircraft systems might need to stay powered up for nearly a year?

    In a real time control system you have a constraint of CPU cycles. You don't burn them without reason. It may be perfectly reasonable, and well reasoned, that the timer will be coded with no wrap. What do you do if it does wrap? It is difficult, to say the least, to cope with time that goes backwards. So as Hugo Tyson notes above - you have more, not less, problems.

    In a hard real time control system you can't simply throw an exception. Who catches it, and what does it mean? Indeed - everyone is assuming that the clock wrap wasn't caught - it could easily have been caught and it was the catching of the clock wrapping that caused the shutdown.

    This is where it gets messy. And brings us to the first ever flight of Ariane 5. The flight control software was derived from the Ariane 4, and was a known solid bit of code. But it needed modification to cope with the changes in design. A piece of effectively dead (unneeded) code, that was otherwise benign, was driven into an unusual state by the new vehicle's higher than expected horizontal velocity, and threw an exception. Nobody caught it. Exit $400m worth of rocket in a very spectacular failure. The failure was in a perfectly good piece of code that the changed requirements didn't pick up as needing to be addressed and tested - because it was not needed for the new vehicle.

    Writing error free code is easy. It is getting the precise requirements and integration of that code that is really hard. Spotting that the clock could wrap isn't the hard part. It is very unlikely that the clock wrapping wasn't known. It is very likely that a clear understanding of the environment the code would see itself in was not fully addressed along the chain of requirements analysis from the early design briefs of the plane, all the way down to the contractor responsible for coding it. This chain can fail in many many ways, and is a vastly harder thing to manage and get right than simply coding a counter, or indeed even a quite complex bit of software.
