On October 7th last year, several passengers and crew were seriously injured when Qantas Flight 72 decided to enter a severe dive – one dramatic enough to throw them around the cabin. The initial culprit was identified as a faulty ADIRU unit – a gadget that provides information from the plane’s sensors to both the pilots and, crucially, the flight control system. Courtesy of the interim report of the Australian Transport Safety Bureau, it’s now a bit clearer what happened. What follows is a somewhat technical examination of the issue.
Extensive testing has still not explained the ADIRU failure. The theory that it was the low-frequency, high-power radio transmissions from the Learmonth Naval Communication station hasn’t been completely ruled out, but it was always pretty dubious: the units were certified to operate correctly in the presence of much stronger radio signals than result from the Learmonth transmissions, planes have been operating for many years in Learmonth’s vicinity without incidents, and similar emissions from other high-power, low-frequency transmitters haven’t been reported to cause issues. The faulty ADIRU unit was also tested to see whether radio interference could reproduce the effects seen in-flight, with no success.
But as I noted at the time, the specific reason the ADIRU failed was far less important than why a single faulty component was able to cause a plane to misbehave to such an extent. The interim report explains the problem in some detail.
The Airbus A330, like all Airbus airliners and more recent Boeing models, is fly-by-wire – the pilot’s stick movements, and the sensor data, are fed into flight control computers that determine what the bits of the aircraft that move – ailerons, flaps, rudder, and tail – actually do. Some of that sensor data comes through the ADIRU units; each ADIRU processes the data from a different set of sensors, and the three ADIRUs feed their data to the flight control computer (which in itself has redundant backups, but they don’t come into play here).
But what if one of the ADIRU units starts malfunctioning, and providing “rubbish” information to the flight computer?
For most pieces of information, if an ADIRU or the attached sensor breaks, the system will notice that the value is radically different to the other redundant sensors and ignore the rubbish value. But for angle of attack, this approach isn’t ideal, because there are situations where you’d expect, in normal operation, for different sensors to report different readings. So a different method was used for the angle of attack data.
In a nutshell, if an ADIRU generates obviously incorrect angle of attack data for an instant, the flight control computer uses the last known good value it had, over a period of 1.2 seconds. If the ADIRU misbehaves continuously for a second or more, the flight computer concludes “Hang on, you’re faulty”, and will ignore anything it says for the rest of the flight.
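The cross-check used for most values can be sketched roughly as follows. This is only an illustration of the general idea, not the Airbus implementation: the function name `vote`, the `tolerance` parameter and the choice to average the agreeing readings are all assumptions made for the example.

```python
from statistics import median


def vote(readings, tolerance):
    """Combine three redundant readings, ignoring any obvious outlier.

    Hypothetical sketch: a reading is kept only if it lies within
    `tolerance` of the median of all three readings.
    """
    if len(readings) != 3:
        raise ValueError("this sketch assumes exactly three redundant sources")
    mid = median(readings)
    # With three readings the median is itself one of the readings,
    # so at least one value always survives this filter.
    agreeing = [r for r in readings if abs(r - mid) <= tolerance]
    return sum(agreeing) / len(agreeing)


# One source reports rubbish; the other two outvote it.
print(vote([250.0, 251.0, 900.0], tolerance=5.0))  # prints 250.5
```

The scheme works only when healthy sensors are expected to agree closely, which is exactly the assumption that breaks down for angle of attack.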
But there’s another, rather diabolical possibility. What if the ADIRU (or angle of attack sensor) goes haywire for, say, half a second, starts working again for half a second, then misbehaves for another half-second or so? The ADIRU doesn’t misbehave long enough for the flight computer to disconnect it. But it can’t keep using the old value. So it calculates a new value based on the misbehaving ADIRU.
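A toy model makes the timing problem easier to see. The sketch below follows the simplified description given here rather than the real flight control code. The 1.2 second hold and the one second rejection limit come from the text; the ten samples per second rate, the 5 degree threshold, the function name `filter_aoa` and the rule that the hold is armed only at the start of a run of bad data are all assumptions made for illustration.

```python
HOLD_TICKS = 12    # 1.2 s hold, at an assumed 10 samples per second
REJECT_TICKS = 10  # "a second or more" of continuous bad data
THRESHOLD = 5.0    # assumed: a jump this large (degrees) is "obviously incorrect"


def filter_aoa(samples, initial=2.0):
    """Return the value the hypothetical flight computer uses at each tick.

    None means the source has been rejected for the rest of the flight.
    """
    last_good = initial
    hold_left = 0   # ticks remaining in the current hold period
    bad_run = 0     # length of the current run of suspicious samples
    rejected = False
    used = []
    for value in samples:
        if rejected:
            used.append(None)
            continue
        suspicious = abs(value - last_good) > THRESHOLD
        if suspicious:
            if bad_run == 0 and hold_left == 0:
                hold_left = HOLD_TICKS  # start of a spike: arm the hold
            bad_run += 1
        else:
            bad_run = 0
        if bad_run >= REJECT_TICKS:
            rejected = True
            used.append(None)
            continue
        if hold_left > 0:
            hold_left -= 1
            used.append(last_good)  # keep using the last known good value
        else:
            # The flaw: once the hold has run out, the current sample is
            # accepted without checking whether it is still suspicious.
            last_good = value
            used.append(value)
    return used


# Half a second bad, half a second good, half a second bad.
intermittent = [50.0] * 5 + [2.0] * 5 + [50.0] * 5
print(filter_aoa(intermittent)[12])  # prints 50.0: the bad value gets through
```

In this model a continuous fault (`[50.0] * 20`) is rejected after a second and a single short spike is masked completely. Only the intermittent pattern slips a bad value past the filter, because the second burst of rubbish is still in progress when the hold from the first burst expires.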
The end result, unfortunately, was a plane that thought it was pointing its nose somewhere towards the Moon when it was actually flying straight and level. The dive was its attempt to correct itself.
Nasty as it was, A330s are not in danger of crashing due to this design flaw, if I understand the report correctly. The automatic system that pushed the nose down only operates at cruising speeds and at high altitudes. Nevertheless, it is a serious flaw, and the report indicates that Airbus will be modifying the flight control software so as to avoid this situation repeating itself.
From my professional perspective (as an academic who specializes in testing computer software), the real question is why this problem wasn’t picked up before an A330 ever flew. Aircraft manufacturers take design checking and testing more seriously than just about anyone. The report doesn’t go into this question. To be fair, the A330 flight control software was written nearly 20 years ago, so the software quality assurance procedures used then were probably considerably less advanced than those used today. But Airbus (and Boeing) will undoubtedly be thinking hard about their review and testing procedures, to figure out if the same design flaw would be picked up today – before it made it into an aircraft carrying paying passengers.
ELSEWHERE: Ben Sandilands at Crikey aviation blog Plane Talking concentrates on the interference issue. As noted, I think the interference issue is probably a red herring here.




Good report, Robert. Thanks.
a) Cost. The only way to be sure to catch everything is to prove the validity of your software and you can’t do that with testing. It also relies on people not making poor assumptions (see point b).
b) Some idiot thought he could predict the behavior of failed hardware. Either the output matches or it doesn’t. If the design allows for the redundant components to behave differently, they’re not actually redundant. Sounds more like a case of a software engineer trying to hide a hardware engineer’s screw up.
More like failure of imagination – not that uncommon in software development. While intermittent failures are well understood in avionics, the complexity of this particular fall-back plan converging with a faulty unit whose failure could be measured in fractions of a second gives rise to a situation that was simply untested and (perhaps) untestable.
I assume that like most avionics systems, the pilot is assumed to be awake and supposed to be involved in doing the override. Which he did. So, in a sense, the system worked (the pilot intervened) even though the passengers ended up injured and frightened. That’s a whole lot better than dead in a smoking hole though.
About the only conclusion you can draw is that pilots aren’t going anywhere for the foreseeable future on passenger aircraft.
Fascinating discussion, thanks Robert.
A few years ago I read a book called Airframe by Michael Crichton (who has written some of the most loathsome crap of all time). This book also had loathsome crappiness embedded in it, but the investigation into the reasons for a malfunctioning aeroplane was heady stuff and had some similarities to Robert’s discussion of QF72’s problems. Worth a read.
Thanks for the very interesting post Robert. I’d agree with David that it’s likely to be a failure of imagination on the part of the developer and the tester. Really good coverage for software testing is a lot harder than people imagine. Testing that a system behaves with expected input is the easy part; working out what invalid data to send to a system and under what circumstances it could occur is much, much harder.
Given the domain of that problem is more or less infinite, I’m not sure imagination is what’s needed.
I write software for gaming machines. The software is complicated stuff, interactions between coin and banknote hardware, buttons, lights and the game itself. Even after extensive testing and many years of use in the field problems can arise. Players are always the best testers. So, no one dies but money is at risk (trust me, that’s critical in gaming) so it’s embarrassing but sort of cool when the players find a loophole in code I thought was tight. Usually it’s a bizarre sequence of unlikely events that was never thought of. After all the normal events are well tested but there is an almost infinite number of combinations of events. No one can reasonably find them all.
The biggest problem for me is familiarity. The more you know the code the more confident you become until it becomes very hard to doubt the code. Finding bugs in new code is easy but finding the last one in old code is hard.
Thanks – most interesting. Also see Ben Sandilands for additional comment.
This is the second case of a large airliner being destabilised due to an obscure software error in handling ADIRU failures. By an extraordinary coincidence the first one happened on almost the same route but travelling in the opposite direction.
On 1 August 2005, shortly after departing from Perth, Australia, bound for Kuala Lumpur, Malaysia, a Boeing B777-200 passenger aircraft suffered a flight upset while climbing through 38,000 feet. It began when the aircraft spontaneously pitched sharply upward, reaching 41,000 feet and activating stall warnings. After pilots regained control they returned to Perth.
The incident was triggered by a second accelerometer failure in the aircraft’s air data inertial reference unit (ADIRU). This unit is designed to be highly redundant and fault-tolerant but the first failed accelerometer’s failure mode was not one that had been anticipated during unit design and development. (It had been assumed that a failure would always result in zero voltage output, but this failed device was producing a high output value.) The twin failures exposed a latent software fault, which resulted in the unit feeding incorrect aircraft acceleration data to other flight control systems.
Boeing B777-200 aircraft first entered service in 1995 and this is the first reported instance of the particular software fault, which was apparently present in the unit’s original design, affecting operation of an aircraft. The incident highlights the fact that software testing can never eliminate all risk.
The Australian Transport Safety Bureau’s investigation report is here.
—
Some interesting comments, everyone.
Testing is hard, but airliner manufacturers can afford the time and money to do a great deal of it.
However, to me this seems like the kind of thing that should have been picked up in the review process. In a code/design review, a developer stands up in front of their peers and presents their proposed solution to a problem, and the assembled gallery attempts to poke holes in it. As a rule, there’s only one thing software developers like doing better than coming up with good solutions to problems – it’s showing that they’re smarter than their colleagues by spotting flaws in their colleagues’ solutions.
In this case, the smart-arse in the gallery need only have made one request to the designer – “Prove to me that no failure in a single ADIRU can cause an adverse event”. At which point, the meeting would have broken up, the designer would have spent a couple of hours trying to prove it, given up, and concluded an alternative solution was required.
Nothing quite like a Cessna with rudder pedals to the metal and an aileron/elevator yoke connected by cable.
Technology was once so simple. (And the backup for the elevator control was the elevator trim tab!)
*Sighs*
Robert, I guess you’re still working on the software for the “Infinite Improbability Drive”?
http://en.wikipedia.org/wiki/Starship_Billion_Year_Bunker
Robert Merkel wrote:
Sadly, there’s one thing developers hate with a passion and that’s meetings. Especially if they are under a deadline and see the code review process as a blame game by management. Also, if you are in a team with decidedly unclever developers and testers, you are out of luck as they never think up smart enough criticisms.
The Turkish airline crash in Amsterdam is reported (by the Dutch authorities) to have been caused by a sudden change in a radio altimeter reading. I was surprised that this was not detected as a fault – particularly at such a critical point in the landing sequence.
Do you have any information on this?
Reference for the report referred to by me @13:
BBC Report.
Peter: no hard information, but people are wondering about it on the RISKS forum.
Again, a fault in a single component shouldn’t cause a plane to crash.
It’s only buggy code that people ever get to work on. There’s the rub. Ultra-reliable code is usually ultra-ignored for years (and then often lost!). Out of sight, out of mind.
I’ve come to the conclusion that preventative maintenance is almost like a fail-safe timer in software development. It periodically makes sure the software’s eco-system is viable, without schedule interference, and forces people to re-evaluate the code and tests. It’s something of a paradox: ultra-reliable code is the most dangerous code of all, because it’s the least understood. This is the software that ends up being used outside its design constraints because it just seems to work.
In this case the software engineer probably asked the hardware guy what the possible failure modes were for that module, and just coded for them. In this case he should have coded beyond those cases, but extra code is also extra opportunity for bugs. In another situation, redundant logic may introduce yet another bug, not to mention definite cost.
We slavishly (mindlessly in my opinion) use a system as a guide for determining review hours (as well as expected bugs/phase). Its utility is dubious, but it’s the best we’ve got. The major problem with it is that review hours usually wash downstream (we all deliver on the same integration date) to the point where it’s even difficult to get a meeting room, let alone people’s attention. Reviews are boring – people often bring in their own work (they have their own schedule pressures to deal with), or simply fade to black from review fatigue.
Mind you, we’re not doing mission-critical, lives-at-stake stuff, but human nature doesn’t change across industries. I know, because I’ve seen fundamental, domain-specific assumptions slip through unquestioned during my time in aerospace/military projects.
Which is another major problem in software. Software engineers are constantly working outside their knowledge domains. The world’s best engineer doesn’t become a trading system, x-ray, or rocketry expert during the course of a single project, and we rarely do the same project twice.
Re @13, 14: I should have Googled first – I found this from Boeing. It seems the throttle control *does* use only the one altimeter, which is scary. Boeing suggests the pilots should have noticed the discrepancies and read the manual. During approach?
My take was that the system was rather over-complicated in the first place, which made it difficult to reason about possible failure modes. With the “use the median value” error tolerance method, it’s dead simple to see that one failure can’t break the system. The error-handling method for the AOA code, however, is the kind of thing you’d need to use model checking to prove that the protocol is robust.
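To illustrate why the median approach is so easy to reason about, here is a small brute-force check of the property “one faulty source cannot drag the output outside the range of the healthy readings”. It is a sketch only: the function name `single_fault_is_masked` and the particular fault values tried are made up for the example, and a handful of test values is no substitute for a proof or a real model checker.

```python
from statistics import median


def single_fault_is_masked(good_values, fault_values):
    """Check that corrupting any ONE of three readings keeps the median
    inside the range spanned by the healthy readings.

    Hypothetical helper: it tries each fault value in each of the three
    positions in turn.
    """
    lo, hi = min(good_values), max(good_values)
    for position in range(3):
        for bad in fault_values:
            readings = list(good_values)
            readings[position] = bad  # corrupt exactly one source
            if not lo <= median(readings) <= hi:
                return False
    return True


faults = [-1000.0, 0.0, 50.0, float("inf")]
print(single_fault_is_masked([2.0, 2.1, 2.2], faults))  # prints True
```

The argument behind the result fits in one sentence: two of the three readings are always healthy, so the middle value can never lie outside them. A scheme with timers and remembered values has no such one-line argument, which is why it needs heavier machinery to check.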
Perhaps part of the problem lies in the specialisation that exists in all technical fields.
Software engineers often have no real knowledge of simple physics. For example, the software written by one very large engineering firm attempted to swing a 1000 tonne dredge through 180 degrees in 5 seconds. Point is, no amount of testing would have detected this as the original algorithm was wrong. Sure, this reflects a very poorly structured engineering effort, but I suspect it is more common than we want to know.
Huggy
Robert, I’ve been writing software for close to 20 years, and I don’t recall anything having been peer-reviewed. Fortunately, none of it would have got someone killed, but still.
Ouch. All the more reason to review early and review often, and to consider doing the review outside of a meeting. Studies have shown that trying to review too much code at once does not work as well as reviewing in smaller chunks. More details here on best practices for peer code review.
Gregg: Thanks for the link. Don’t get me started about our ghastly process! At least these days we have code reviews – when I arrived there was nothing. There were a huge number of bugs of course, and a burdensome, missing-the-point process was the response. In this case it’s just a crutch for people who don’t understand software development basics.
But it’s a job.
Reading this after the AF 447 loss in the Atlantic, and realising that an ADIRU failure was reported by ACARS before a/c loss makes me wonder if this fate didn’t befall the AF A332, leading to an uncontrollable situation with the a/c at FL350.
I think you’re getting a bit muddled here Jack. AF 447 was an Air France A332. One’s a flight number and the other is an aircraft model.
Also an ADIRU failure does not make you suddenly vanish off radar screens. For starters it is actually part of ADIRS and if that failed, the crew would still be aloft and right on the horn to everyone, anyone.
Next you’ll be telling us Speedbird 781 (Yes, Yoke Peter) broke up in mid air because of heavy handed use of the soda siphon.
Actually Jack, I do see your original point now and I think we’re all getting a bit muddled with the use of flight and model prefixes. Perhaps more me than you now. My apologies.
However the vanishing off radar screens without communication is a bit of a puzzler. I suspect either a massive cockup all around or foul play. Doesn’t happen otherwise these days.
Still willing though to be talked into the overpressurised soda siphon theory for Yoke Peter.
Nabs,
I know it is early but I think you muddled his wording. It looks good to me – he said “the AF 332” – not, as you seem to think, confusing it with “the AF447” referred to in the first sentence.
BTW – the use of “FL350” to show altitude also indicates that “Jack” is one of the following:
1. a pretentious tosser who has flown a lot;
2. someone who has watched “air crash investigations” a bit too much; or
3. someone who knows what they are talking about and got caught in the jargon.
.
For the uninitiated – FL 350 means they were flying at 35,000 feet, normally meaning they would be on a track somewhere between due north and nearly due west – which was the case here. The fact I know this shows that I fall into category 2 – or perhaps 1.
Yes I got it Andy and have already apologised.
I also have halgf a bottle of 12 year Gelnmorangie Qunita Ruban in me. And an (expired) pilot’s license. Well not in me physically.
What’s your excuse for being a plonker too at this hour?
“halgf” is old school Chaucer for “half”. Well it is now.
Oops – I knew something was bugging me as I wrote that. Checked Wiki and, if this is right, there was something interesting about the flight level. According to the wiki, if it was at FL 350 it should have been westbound – not eastbound as it was.
Marking papers. Why did I agree to do this?
Um yes, I was about to point out flight levels are not really indicative of direction. And that “due north and nearly due west” is more of a nonsensical guideline than an actual flight plan – especially on that flight path.
But then given my own fuck up here I thought diversion was the better part of vector.
“Marking papers. Why did I agree to do this?”
Just mark ‘em by readability. The time will fly faster and it’s not like anyone cares about spotting talent through half year papers.
Or as Jonathan Hemlock observed in “The Eiger Sanction” (written by a blase right wing Canadian academic) always make a point of promoting inferior talents. Less competition down the track.