
This author has written 747 posts for Larvatus Prodeo.


33 responses to “Why QF72 developed a mind of its own”

  1. BilB

    Good report, Robert. Thanks.

  2. Desipis

    the real question is why this problem wasn’t picked up before an A330 ever flew.

    a) Cost. The only way to be sure to catch everything is to prove the validity of your software and you can’t do that with testing. It also relies on people not making poor assumptions (see point b).

    b) Some idiot thought he could predict the behavior of failed hardware. Either the output matches or it doesn’t. If the design allows for the redundant components to behave differently, they’re not actually redundant. Sounds more like a case of a software engineer trying to hide a hardware engineer’s screw up.

  3. David Rubie

    More like a failure of imagination – not that uncommon in software development. While intermittent failures are well understood in avionics, the complexity of this particular fall-back plan, converging with a faulty unit whose failure could be measured in fractions of a second, gave rise to a situation that was simply untested and (perhaps) untestable.

    I assume that like most avionics systems, the pilot is assumed to be awake and supposed to be involved in doing the override. Which he did. So, in a sense, the system worked (the pilot intervened) even though the passengers ended up injured and frightened. That’s a whole lot better than dead in a smoking hole though.

    About the only conclusion you can draw is that pilots aren’t going anywhere for the foreseeable future on passenger aircraft.

  4. Jenny

    Fascinating discussion, thanks Robert.

    A few years ago I read a book called Airframe by Michael Crichton (who has written some of the most loathsome crap of all time). This book also had loathsome crappiness embedded in it, but the investigation into the reasons for a malfunctioning aeroplane was heady stuff and had some similarities to Robert’s discussion of QF72’s problems. Worth a read.

  5. Chris

    Thanks for the very interesting post, Robert. I’d agree with David that it’s likely to be a failure of imagination on the part of the developer and the tester. Really good coverage for software testing is a lot harder than people imagine. Testing that a system behaves with expected input is the easy part; working out what invalid data to send to a system and under what circumstances it could occur is much much harder.
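
    One way to sketch the “invalid data” side of testing is randomised fault injection: feed a unit two plausible readings plus one deliberately bad one, many times over, and check a property that should always hold. Everything below is an assumption made for illustration. `select_value`, the list of bad values and the numeric ranges are hypothetical and are not taken from any real avionics code.

```python
import math
import random

def select_value(readings):
    """Hypothetical unit under test: pick one value from three redundant readings."""
    # Discard NaN and infinities, then take the median (or the mean of two survivors).
    finite = sorted(r for r in readings if math.isfinite(r))
    if len(finite) == 3:
        return finite[1]
    if len(finite) == 2:
        return (finite[0] + finite[1]) / 2
    raise ValueError("too few usable readings")

# Known nasty values; an assumption, not an exhaustive list of real failure modes.
BAD_VALUES = [0.0, -1e9, 1e9, float("inf"), float("-inf"), float("nan")]

def random_fault(rng):
    """Return either a known nasty value or a random spike."""
    if rng.random() < 0.5:
        return rng.choice(BAD_VALUES)
    return rng.uniform(-1000.0, 1000.0)

def fuzz(trials=10_000, seed=0):
    """Inject one faulty reading per trial and check the output stays plausible."""
    rng = random.Random(seed)
    for _ in range(trials):
        healthy = [rng.uniform(-10.0, 30.0) for _ in range(2)]
        readings = healthy + [random_fault(rng)]
        rng.shuffle(readings)
        out = select_value(readings)
        # Property: a single fault must never push the output outside the healthy pair.
        assert min(healthy) <= out <= max(healthy), (readings, out)
    return trials

if __name__ == "__main__":
    print(fuzz(), "trials passed")
```

    A run like this only samples the space of bad inputs; it cannot show that no failing case exists.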

  6. Desipis

    …working out what invalid data to send to a system and under what circumstances it could occur is much much harder.

    Given the domain of that problem is more or less infinite, I’m not sure imagination is what’s needed.

  7. David Allen

    I write software for gaming machines. The software is complicated stuff: interactions between coin and banknote hardware, buttons, lights and the game itself. Even after extensive testing and many years of use in the field, problems can arise. Players are always the best testers. So, no one dies but money is at risk (trust me, that’s critical in gaming), so it’s embarrassing but sort of cool when the players find a loophole in code I thought was tight. Usually it’s a bizarre sequence of unlikely events that was never thought of. After all, the normal events are well tested but there is an almost infinite number of combinations of events. No one can reasonably find them all.

    The biggest problem for me is familiarity. The more you know the code the more confident you become until it becomes very hard to doubt the code. Finding bugs in new code is easy but finding the last one in old code is hard.

  8. Colonel of Truth

    Thanks – most interesting. Also see Ben Sandilands for additional comment.

  9. MikeM

    This is the second case of a large airliner being destabilised due to an obscure software error in handling ADIRU failures. By an extraordinary coincidence the first one happened on almost the same route but travelling in the opposite direction.

    On 1 August 2005, shortly after departing from Perth, Australia, bound for
    Kuala Lumpur, Malaysia, a Boeing B777-200 passenger aircraft suffered a
    flight upset while climbing through 38,000 feet. It began when the aircraft
    spontaneously pitched sharply upward, reaching 41,000 feet and activating
    stall warnings. After pilots regained control they returned to Perth.

    The incident was triggered by a second accelerometer failure in the
    aircraft’s air data inertial reference unit (ADIRU). This unit is designed
    to be highly redundant and fault-tolerant but the first failed
    accelerometer’s failure mode was not one that had been anticipated during
    unit design and development. (It had been assumed that a failure would
    always result in zero voltage output, but this failed device was producing a
    high output value.) The twin failures exposed a latent software fault, which
    resulted in the unit feeding incorrect aircraft acceleration data to other
    flight control systems.

    Boeing B777-200 aircraft first entered service in 1995 and this is the first
    reported instance of the particular software fault, which was apparently
    present in the unit’s original design, affecting operation of an aircraft.
    The incident highlights the fact that software testing can never eliminate
    all risk.

    The Australian Transport Safety Bureau’s investigation report is here.
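
    To make the failure-mode assumption in the quoted summary concrete, here is a minimal sketch contrasting a check that assumes “a failed sensor reads zero” with a two-sided plausibility check. The voltage window and the function names are hypothetical; nothing here is taken from the actual ADIRU design.

```python
def failed_assuming_zero(volts):
    """Failure test built on the assumption that a dead sensor outputs zero volts."""
    return volts == 0.0

def failed_range_check(volts, low=0.5, high=4.5):
    """Failure test that rejects anything outside a plausible window.

    The 0.5 V to 4.5 V window is an invented example value.
    NaN also fails, because any comparison with NaN is False.
    """
    return not (low <= volts <= high)

# A sensor stuck at a high output, like the failure mode described in the quote.
stuck_high = 5.0

print(failed_assuming_zero(stuck_high))  # False: the fault goes unnoticed
print(failed_range_check(stuck_high))    # True: the fault is flagged
```

    The second check is still only as good as its window: a sensor stuck at a plausible value would pass both.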

  10. Robert Merkel

    Some interesting comments, everyone.

    Testing is hard, but airliner manufacturers can afford the time and money to do a great deal of it.

    However, to me this seems like the kind of thing that should have been picked up in the review process. In a code/design review, a developer stands up in front of their peers and presents his proposed solution to a problem, and the assembled gallery attempts to poke holes in it. As a rule, there’s only one thing software developers like doing better than coming up with good solutions to problems – it’s showing that they’re smarter than their colleagues by spotting flaws in their own solution.

    In this case, the smart-arse in the gallery need only have made one request to the designer – “Prove to me that no failure in a single ADIRU can cause an adverse event”. At which point, the meeting would have broken up, the designer would have spent a couple of hours trying to prove it, given up, and concluded an alternative solution was required.

  11. Peter Kemp

    Nothing quite like a Cessna with rudder pedals to the metal and an aileron/elevator yoke connected by cable.

    Technology was once so simple. (And the backup for the elevator control was the elevator trim tab!)

    *Sighs*

    Robert, I guess you’re still working on the software for the “Infinite Improbability Drive”? :-)

    http://en.wikipedia.org/wiki/Starship_Billion_Year_Bunker

  12. David Rubie

    Robert Merkel wrote:

    In a code/design review, a developer stands up in front of their peers and presents his proposed solution to a problem, and the assembled gallery attempts to poke holes in it. As a rule, there’s only one thing software developers like doing better than coming up with good solutions to problems – it’s showing that they’re smarter than their colleagues by spotting flaws in their own solution.

    Sadly, there’s one thing developers hate with a passion and that’s meetings. Especially if they are under a deadline and see the code review process as a blame game by management. Also, if you are in a team with decidedly unclever developers and testers, you are out of luck as they never think up smart enough criticisms.

  13. Peter

    The Turkish airline crash in Amsterdam is reported (by the Dutch authorities) to have been caused by a sudden change in a radio altimeter reading. I was surprised that this was not detected as a fault – particularly at such a critical point in the landing sequence.
    Do you have any information on this?

  14. Peter

    Reference for the report referred to by me @13:
    BBC Report.

  15. Robert Merkel

    Peter: no hard information, but people are wondering about it on the RISKS forum.

    Again, a fault in a single component shouldn’t cause a plane to crash.

  16. Craig Mc

    The biggest problem for me is familiarity. The more you know the code the more confident you become until it becomes very hard to doubt the code. Finding bugs in new code is easy but finding the last one in old code is hard.

    It’s only buggy code that people ever get to work on. There’s the rub. Ultra-reliable code is usually ultra-ignored for years (and then often lost!). Out of sight, out of mind.

    I’ve come to the conclusion that preventative maintenance is almost like a fail-safe timer in software development. It periodically makes sure the software’s eco-system is viable, without schedule interference, and forces people to re-evaluate the code and tests. It’s something of a paradox: ultra-reliable code is the most dangerous code of all, because it’s the least understood. This is the software that ends up being used outside its design constraints because it just seems to work.

    In this case the software engineer probably asked the hardware guy what the possible failure modes were for that module, and just coded for them. In this case he should have coded beyond those cases, but extra code is also extra opportunity for bugs. In another situation, redundant logic may introduce yet another bug, not to mention definite cost.

  17. Craig Mc

    However, to me this seems like the kind of thing that should have been picked up in the review process.

    We slavishly (mindlessly in my opinion) use a system as a guide for determining review hours (as well as expected bugs/phase). Its utility is dubious, but it’s the best we’ve got. The major problem with it is that review hours usually wash downstream (we all deliver on the same integration date) to the point where it’s even difficult to get a meeting room, let alone people’s attention. Reviews are boring – people often bring in their own work (they have their own schedule pressures to deal with), or simply fade to black from review fatigue.

    Mind you, we’re not doing mission-critical, lives-at-stake stuff, but human nature doesn’t change across industries. I know, because I’ve seen fundamental, domain-specific assumptions slip through unquestioned during my time in aerospace/military projects.

    Which is another major problem in software. Software engineers are constantly working outside their knowledge domains. The world’s best engineer doesn’t become a trading system, x-ray, or rocketry expert during the course of a single project, and we rarely do the same project twice.

  18. Peter

    Re @13, 14: I should have Googled first – I found this from Boeing. It seems the throttle control *does* use only the one altimeter, which is scary. Boeing suggests the pilots should have noticed the discrepancies and read the manual. During approach?

  19. Robert Merkel

    In this case the software engineer probably asked the hardware guy what the possible failure modes were for that module, and just coded for them. In this case he should have coded beyond those cases, but extra code is also extra opportunity for bugs. In another situation, redundant logic may introduce yet another bug, not to mention definite cost.

    My take was that the system was rather over-complicated in the first place, which made it difficult to reason about possible failure modes. With the “use the median value” error tolerance method, it’s dead simple to see that one failure can’t break the system. The error-handling method for the AOA code, however, is the kind of thing you’d need to use model checking to prove that the protocol is robust.
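
    A minimal sketch of the “use the median value” idea, assuming three redundant readings of the same quantity. The function name and the sample numbers are hypothetical, and this is not a reconstruction of the Airbus code.

```python
from statistics import median

def vote(readings):
    """Return the median of an odd number (at least three) of redundant readings."""
    if len(readings) < 3 or len(readings) % 2 == 0:
        raise ValueError("need an odd number of readings, at least three")
    return median(readings)

# Hypothetical angle-of-attack readings in degrees.
healthy = [2.1, 2.3, 2.2]
one_faulty = [2.1, 2.3, 50.6]   # third unit has failed high

print(vote(healthy))      # 2.2
print(vote(one_faulty))   # 2.3, still one of the healthy values
```

    With three inputs the median always lies between the two healthy readings, whatever value the faulty one takes, which is why the single-failure argument is easy to make.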

  20. HuggyBunny

    Perhaps part of the problem lies in the specialisation that exists in all technical fields.
    Software engineers often have no real knowledge of simple physics. For example, the software written by one very large engineering firm attempted to swing a 1000 tonne dredge through 180 degrees in 5 seconds. The point is that no amount of testing would have detected this, as the original algorithm was wrong. Sure, this reflects a very poorly structured engineering effort, but I suspect it is more common than we want to know.
    Huggy

  21. David Irving (no relation)

    Robert, I’ve been writing software for close to 20 years, and I don’t recall anything having been peer-reviewed. Fortunately, none of it would have got someone killed, but still.

  22. Gregg Sporar

    “The major problem with it is that review hours usually wash downstream (we all deliver on the same integration date) to the point where it’s even difficult to get a meeting room, let alone people’s attention. Reviews are boring – people often bring in their own work (they have their own schedule pressures to deal with), or simply fade to black from review fatigue.”

    Ouch. All the more reason to review early and review often, and to consider doing the review outside of a meeting. Studies have shown that trying to review too much code at once does not work as well as reviewing in smaller chunks. More details here on best practices for peer code review.

  23. Craig Mc

    Gregg: Thanks for the link. Don’t get me started about our ghastly process! At least these days we have code reviews – when I arrived there was nothing. There were a huge number of bugs of course, and a burdensome, missing-the-point process was the response. In this case it’s just a crutch for people who don’t understand software development basics.

    But it’s a job.

  24. Jack Love

    Reading this after the AF 447 loss in the Atlantic, and realising that an ADIRU failure was reported by ACARS before a/c loss makes me wonder if this fate didn’t befall the AF A332, leading to an uncontrollable situation with the a/c at FL350.

  25. Nabakov

    I think you’re getting a bit muddled here Jack. AF 447 was an Air France A332. One’s a flight number and the other is an aircraft model.

    Also an ADIRU failure does not make you suddenly vanish off radar screens. For starters it is actually part of ADIRS and if that failed, the crew would still be aloft and right on the horn to everyone, anyone.

    Next you’ll be telling us Speedbird 781 (Yes, Yoke Peter) broke up in mid air because of heavy handed use of the soda siphon.

  26. Nabakov

    Actually Jack, I do see your original point now and I think we’re all getting a bit muddled with the use of flight and model prefixes. Perhaps more me than you now. My apologies.

    However the vanishing off radar screens without communication is a bit of a puzzler. I suspect either a massive cockup all around or foul play. Doesn’t happen otherwise these days.

    Still willing though to be talked into the overpressurised soda siphon theory for Yoke Peter.

  27. Andrew Reynolds

    Nabs,
    I know it is early but I think you muddled his wording. It looks good to me – he said “the AF A332” – not, as you seem to think, confusing it with “the AF447” referred to in the first sentence.
    BTW – the use of “FL350” to show altitude also indicates that “Jack” is one of the following:
    1. a pretentious tosser who has flown a lot;
    2. someone who has watched “air crash investigations” a bit too much; or
    3. someone who knows what they are talking about and got caught in the jargon.
    .
    For the uninitiated – FL 350 means they were flying at 35,000 feet, normally meaning they would be on a track somewhere between due north and nearly due west – which was the case here. The fact I know this shows that I fall into category 2 – or perhaps 1.

  28. Nabakov

    Yes I got it Andy and have already apologised.

    I also have halgf a bottle of 12 year Gelnmorangie Qunita Ruban in me. And an (expired) pilot’s license. Well not in me physically.

    What’s your excuse for being a plonker too at this hour?

  29. Nabakov

    “halgf” is old school Chaucer for “half”. Well it is now.

  30. Andrew Reynolds

    Oops – I knew something was bugging me as I wrote that. Checked Wiki and, if this is right, there was something interesting about the flight level. According to the wiki, if it was at FL 350 it should have been westbound – not eastbound as it was.

  31. Andrew Reynolds

    Marking papers. Why did I agree to do this?

  32. Nabakov

    Um yes, I was about to point out flight levels are not really indicative of direction. And that “due north and nearly due west” is more of a nonsensical guideline than an actual flight plan – especially on that flight path.

    But then given my own fuck up here I thought diversion was the better part of vector.

  33. Nabakov

    “Marking papers. Why did I agree to do this?”

    Just mark ‘em by readability. The time will fly faster and it’s not like anyone cares about spotting talent through half year papers.

    Or as Jonathan Hemlock observed in “The Eiger Sanction” (written by a blase right wing Canadian academic) always make a point of promoting inferior talents. Less competition down the track.
