Whenever a plane crash occurs, the Australian Transport Safety Bureau consults the relevant experts and issues a public report on what went wrong. When an unusual death occurs, in any circumstance, a coronial inquest is held. And, at the moment, the Bushfire Royal Commission is trawling through the events of Black Saturday.
Bertrand Meyer, a rather well-known name in my field, makes a simple observation. Computer systems undergo failures which, while only rarely resulting in loss of life, routinely result in the loss of large amounts of money. But the reasons for the failure almost always remain confidential, making it very difficult to learn from these mistakes in the manner of plane crash investigations. So why not mandate investigations of large software failures, with the results available to everyone so we can learn from the mistakes:
Progress in software engineering will come from many sources. Research is critical, including on topics which today appear exotic. But if anyone is looking for one practical, low-tech idea that has an iron-clad guarantee of improving software engineering, here it is: pass a law that requires extensive professional analysis of any large software failure.
The details are not so hard to refine. The initiative would probably have to start at the national level; any industrialized country could be the pioneer. (Or what about Europe as whole?) The law would have to define what constitutes a “large” failure; for example it could be any failure that may be software-related and has resulted in loss either of human life or of property beyond a certain threshold, say $50 million. In the latter case, to avoid accusations of government meddling in private matters, the law could initially be limited to cases involving public money; when it has shown its value, it could then be extended to private failures as well. Even with some limitations, such a law would have a tremendous effect. Only with a thorough investigation of software projects gone wrong can we help the majority of projects to go right.
This might seem a little bit arcane, but software failures were estimated to cost 0.6% of US GDP, and would be comparable in magnitude here. Reducing that significantly could pay for a hell of a lot of useful things.
In any case, Meyer’s proposal makes sense to me, in principle. Stephen Conroy, can you perhaps take some time out from filtering the internets to look into this, please?




Have you tried turning your computer off and back on again?
That’s a very interesting idea, Robert. I remember we studied the Therac-25 case and the Patriot missile targeting error when I was at uni, and mandating the investigation of failures that cause financial damage as well as those that cost lives would certainly be useful.
Maybe they could, as their first case, look at the catastrophic backup errors that allegedly led those ute emails to disappear.
First up the Year 2K bug?
Interesting and informative post to one like me who is totally ignorant of the context.
A couple of questions from general principles.
Who would be the target of such investigations?
The company/dept who use the programme?
The company, or whoever, that develops/retails the programme?
Some other body?
Seems like that would cover some pretty powerful organizations.
There may be some high level resistance.
Who would do the investigating, what resources/powers is it envisaged would be necessary that such a body should have?
What would be the possible likely outcomes of ‘successful’ investigations?
As I said, interesting concept, certainly .6% of GDP is a lot.
FX Holden @4: “First up the Year 2K bug?”
Get over it. Everybody else has.
What would be a more interesting read for me would be independent reports on the many multimillion dollar systems that are built but never get implemented and the real reasons for that outcome.
I dare say this has long happened unofficially between academic institutions and businesses. That’s probably the right level.
It’s not appropriate for every business to behave like it’s CMM level 5, or be treated like it should by a remote government agency. They (and their shareholders) are usually the best judge of risk/damage vs cost/time to market.
I’ve met Meyer myself at an Ada gig in the early 90s. I’ve tried to instil Design By Contract ideas into code wherever I’ve worked, and I guess he’s the inspiration for that.
@8: “I dare say this has long happened unofficially between academic institutions and businesses.”
Case studies in systems failure certainly feature in worthwhile courses in risk management.
I’m keen to see the result of such an investigation, the remedies suggested and the effect of their implementation. Having worked in IT and seen a variety of initiatives like this I’m cautious that this could easily become just another layer of CYA. The problem is the “good for smart people, bad for dumb ones” holes in any system. Any organisation is likely to end up full of people who excel at buck-passing, blamestorming and credit stealing. How does this suggestion make it hard for those people to screw up the process?
I have taken part in project post-mortems that have been incredibly valuable, both to me personally and to the organisation in general. But the other 90% of the time the post-mortem has failed, usually early on. My feeling is that the smart people know they have made mistakes, the rest fear they might have, and everybody fears the consequences of their errors being pointed out. It takes quite skilled management to get the first post mortem through, and a solid history of good results from them before skeptics can be brought on board. And it only takes one determined nay-sayer to screw the whole system up. Blamestorming is particularly toxic in this situation.
Any organization faithfully following the ITIL process will manage system failure of one sort or another in a way that aims to ensure lessons are learned and the required changes made to avoid repetition.
However, we need to be clear about the difference between design failures and operational failures of implemented systems. The principles of software engineering, properly employed throughout the entire development cycle, should guard against design failure.
However, at the end of the day, regardless of any failure mitigation strategies in the design, Murphy’s Law tells us that if the running system requires any measure of human intervention, then the probability of failure is greater than zero and contingency plans need to be in place to deal with it when it occurs. That is a function of a business’s ongoing risk management processes.
Failure to implement a system is another risk, and it is incumbent on the responsible project manager to have a detailed plan to cover that very real risk.
Hannah’s Dad: Even if the 0.6% of GDP claim is accurate (and it’s just one study, albeit probably a reasonably authoritative study), some fraction of that is the unavoidable consequence of using IT systems. And this measure, useful though it is likely to be, is only a small contribution. And some apt observations about the likely high-level resistance.
The outcome is a growing collection of case studies of catastrophic errors, including analysis of the design and construction practices that led to them. We can then start teaching people not to make those mistakes.
Socratease@7: Yes, but such investigations raise additional highly problematic issues, to determine a) whether and b) when such investigations should take place. Do you run postmortems on systems that are still being implemented?
Craig Mc: I daresay it doesn’t happen nearly often enough, and Meyer (who’s been around a lot longer than I have) says it doesn’t happen nearly often enough. Lots of stories float through the ether, but not enough of them are collected systematically to allow things to be learned.
Socratease@12: yes, those lessons are learned internally. But they’re not passed on more broadly.
Great post, Robert. I’m sure that a great deal of corporate reluctance/opposition will be displayed to any such proposal, but I’m fairly sure that the aircraft corporations and airline operators weren’t very keen on the mandated Air Crash Investigation sharing results publicly either, so that’s not a particularly good argument against the idea purely on the grounds that the commercial operators won’t like it.
Is there any operational opposition in the IT industry to more transparency about disastrous outcomes being openly shared?
RM @12: “Yes, but such investigations raise additional highly problematic issues, to determine a) whether and b) when such investigations should take place.”
If the taxpaying public’s money is involved the answer is YES every time, and the relevant Auditor General needs to be involved. As for the non-government sector, then it’s up to the corporate governance of the companies involved and, in the case of public companies, the shareholders ought to be asking difficult questions of their boards.
RM @12: “Do you run postmortems on systems that are still being implemented?”
No. By definition, a postmortem is dealing with a “death” (read failure: either a failure to implement, or an operational failure of a serious or critical nature). However, a software development project that is missing key milestones needs to be dealt with asap, and that is the role of the steering group which has oversight of the overall conduct of the project.
All recognized systems development methodologies include a post implementation review (PIR) phase which ought to be chaired by the system sponsor and/or chief client not too long after the system goes live to tick the boxes against the project’s objectives and to assess the performance of the responsible project manager. Again, it is a function of corporate governance to ensure that PIRs are held and written up.
RM @ 12: “yes, those lessons are learned internally. But they’re not passed on more broadly.”
Well, I think they are. The lessons learned are incorporated in internationally recognized IT shop management methodologies such as ITIL, by upskilling IT shops via the CMM model mentioned earlier (which is based on a process of continuous improvement), but again by using adequate risk management techniques at every step of the process.
Risk management recognizes that, despite everybody’s very best intentions, shit happens and we need to cover that eventuality.
Having implemented GST in a national retailer — a project of mission critical importance that spanned more than 12 months and tied up the majority of the IT department — I can tell you that the first item of business on the weekly project management meeting was updating the risk management plan — a document of such size that it required a ring binder to itself.
I remember a study either at NASA or JPL that found 90% of bugs were caused by incorrect or ambiguous requirements documents, so testing the output of systems against flawed requirements definitions won’t pick anything up. It’d probably be worse if those specifying needs weren’t engineers.
Another root cause is that too often it’s great to bring in a project at minimal cost, and the ProjMangler gets brownie points, but then leaves the costs of support/maintenance to a different cost centre post deployment.
Root cause analyses will therefore leave many red faces amongst the PHBs, and therefore don’t occur. Proximal cause analyses typically hit the little guy who has merely tipped over something that was already unstable.
There’s another great rule of thumb on this:
For every problem picked up while still “on paper” that costs $1 to fix, then the cost of fixing the same problem will be
* $10 once someone has started codecutting.
* $100 once it is being used within the organization that developed it.
* $1000 once it is being used OUTSIDE the organization that developed it.
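The rule of thumb above is plain arithmetic, so a small sketch can make the escalation concrete. The stage labels and the `cost_to_fix` function are made up for illustration; the multipliers are simply the ones in the list, not measured data.

```python
# Multipliers taken from the rule of thumb above: each stage costs ten
# times more than the one before. Stage names are illustrative labels only.
STAGE_MULTIPLIER = {
    "on_paper": 1,
    "coding_started": 10,
    "in_use_internally": 100,
    "in_use_externally": 1000,
}


def cost_to_fix(paper_cost, stage):
    """Scale the cost of an 'on paper' fix to the stage where the problem is found."""
    return paper_cost * STAGE_MULTIPLIER[stage]


if __name__ == "__main__":
    # Show how a $1 paper fix grows at each later stage.
    for stage in STAGE_MULTIPLIER:
        print(f"{stage}: ${cost_to_fix(1, stage)}")
```

On these numbers, a requirements problem that would have cost $50 to fix on paper costs $50,000 once customers are using the system.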
FXH @ 4 – hey the 2038 bug is my retirement plan
tigtog @ 13 – Is one of the features of the aircraft investigations that the information gained during the investigation can’t be used in legal action against those that may be at fault? A similar approach may help get more acceptance.
I suspect investigations into IT failures would end up being very, very expensive to carry out properly, especially if you don’t have the willing, enthusiastic cooperation of those involved.
All excellent points at 16, and also some of the key reasons that commercial systems are dead on arrival, with solving the wrong problem being top of the list.
The outside observer usually assumes that a system failure is the fault of some technical boffin on the development side of the bargain, blissfully unaware of the client-side issues that a systems development manager has to deal with on a daily basis, including: inadequate specifications; functionality creep without the chance to re-estimate and reschedule; insufficient commitment to the project by senior management (manifest by non-attendance at project status meetings); project politics, especially inter-departmental; changes of key client personnel; inability or unwillingness to provide knowledgeable end users to assist in the testing (QA) phase … the list goes on.
To Socratease@18:
One name forgotten: Fred Brooks
One book forgotten: The Mythical Man Month
From the guy who led the development of one of the major IBM mainframe operating systems, the book is the only project management text that says something other than “I’m {intensifier} fantastic”. Instead, it admits, “We did X. It cost us $Y blegabucks. Dumb move. Won’t do it again”
To DB @19: I haven’t forgotten it. I used to quote him frequently, especially his famous Law “putting more people on an already late project only makes it later”.
Seeing as you have raised the issue of effort and estimating, as a software development manager, possibly the most difficult part of the job, especially at project initiation (irrational exuberance) time, was managing the client’s delivery expectations while holding firm to the standards associated with delivering a quality product that fully meets the specifications.
Time and again I would have to insist that, as the person ultimately responsible for delivering the outcome, I must be the one who is responsible for the effort estimate. If a client wanted time (and hence effort) cut from the schedule, I would point to their functional specification and ask them which parts of the functionality they wanted to remove. No answer came the stern reply.
Having lost the development effort argument, the client would almost always then point to the testing time and try to reduce that. Again, I would not budge on that as it was the result of my detailed discussion of the specifications with the developers and usually my adding a contingency for their natural tendency to understate the test plan development time. Once agreed with the developer as a delivery contract between him/her and me, that estimate was then my estimate and nobody else’s and not negotiable for the job of work as outlined in the specification.
As stated by an earlier poster, the overwhelming temptation is to give in and be seen to be co-operative, thereby sowing the seeds of project disaster. For those who have not experienced this path, I list the seven stages of mismanaged projects:
1: Uncritical acceptance
2: Wild enthusiasm
3: Dejected disillusionment
4: Total confusion
5: Search for the guilty
6: Punishment of the innocent
7: Promotion of nonparticipants
Making software is design rather than implementation. Even what we call implementation is closer to giving a design to the magic pixies inside the computer than it is to stacking bricks or welding steel.
“we can use the time saved by not doing the design stage for extra testing to make up for that, let’s just start cutting code”.
“there’s never time to do it properly but there’s always time to do it over” (in both senses of the term).
One problem with software is that it’s less engineering than we’d like. IMO it’s closer to architecture than civil engineering. If we screw it up the client often gets software that looks quite like what they asked for, they just can’t use it. Corner cut your architects and you still get a house, you just can’t (or don’t want to) live in it. Which describes as much of our national housing stock as our software, to the disappointment of the people who design both.
I like to reduce discussions to the wee triangular examples most people can grasp. “Cheap, on time, reliable – pick any two” sort of thing. Making those concrete is relatively easy, but making the whole process concrete is much harder. Trying to explain to clients that their behaviour is a major determinant of the size of the uncertainty in the schedule is a waste of time and annoys the pig (to misquote Heinlein). But again, having worked in the construction industry the same thing applies. It’s just that to a much greater extent the parallel to software is the design stage not the building stage. Once the plans are in the hands of the builders even the most loosely coupled client will understand that changes are not going to be easy, just due to the number of steps that they’ve gone through to get to that point and it’s relatively easy to get them to grasp that most of those steps have to be gone through again with the changed plan… quantity surveying and council approval being the ones to start with IME. Neither really exist for software, so we’re left talking about project plans and wiffling the magic pixies.
Erm. Good idea, but won’t it run up against intellectual property laws and such? A lot of businesses rely on proprietary software to maintain an edge, and won’t be okay with the details of such becoming common knowledge. The banks in particular would have a screaming fit about it.
to Socratease @ 20 I believe there is an 8th step…
8. Hide the evidence
Socratease @ 20
It’s actually:
6 Persecution of the innocent
7 Rewarding of the bystanders
An excellent source of information on this topic is Peter G Neumann’s The Risks Digest: On Risks To The Public In Computers And Related Systems.
IT disasters are becoming more like aviation ones, in the sense that as the field has become more professional, simple shortcomings become less frequent, and the residue tend to be a consequence of several failures, no one of which by itself would have resulted in failure.
An interesting example, analysed by the Australian Transport Safety Bureau, was the in-flight upset of a B777 aircraft north-west of Perth on 1 August 2005. The investigation highlighted a software bug in one of the aircraft’s control systems. It had been present in the original program code, but only caused a problem after an accelerometer failed in a way that had not been anticipated, and then a second one failed.
Chris @ 17
One hopes that the 2038 bug will trigger rather less hype and wasted expenditure than the Y2K one did. If so it may not be that great as a superannuation policy.
Socratease – wish I’d had more project managers like you in places I’ve worked in. Related to staffing of projects, bad managers also underestimate the huge differences in productivity in programmers (order of magnitude between average and really good) and how easy it is for team members to contribute negative work. I think they also underestimate just how much more productive a small highly skilled team is compared to even a larger team of average ones.
moz @ 21:
There are also projects that spend so much time on design without sufficient knowledge of the practicalities of implementation that they end up giving nightmare tasks to the implementors. This is more common in really large projects where the architects/designers are different people to the implementers. People should be forced to implement what they design.
I strongly support early prototyping, but with the awareness that it will be thrown away and done much better the next time with the knowledge gained from the first pass. You have to build that time into your schedule.
Also I think you need to alter your development methodology based on the programmers that you have. A team of very talented people will thrive in a totally different environment than a team of people you’ve just pulled off the street.
Is this serendipity or what, Robert. BigPond crashed this morning – quote – “we’re having trouble with Bigpond this morning” from extremely rude Telstra operator I rang on some fancy mobile this morning. It’s back up again now. (I’m here, aren’t I?)
But there’s a silver cloud to every lining. Instead of ordering books on line I can’t afford I bought myself a copy of the BBC’s adaptation of War and Peace. So, if youse don’t hear from me for about a week …
Paul @ 26 – what happens at a lot of large companies that have been around for many years is that they accumulate many different IT systems, all of different ages, none of which were designed initially to talk to each other. So not surprisingly reliability falls as the number and types of systems increase. Occasionally some brave person decides to do a consolidation, and it’s then that you can get your big disaster stories. But the real fault lies many years before, when the new systems were acquired. Sometimes it’s unavoidable when companies merge though.
People really really don’t like downtime on the IT systems they use – they expect them up 24/7. So to avoid customer disappointment it can end up being a bit like a mechanic trying to diagnose faults and do maintenance on your car while you’re driving it.
Chris @ 17, I think the adoption of 64-bit processors (and operating systems) has just blown a large hole in your retirement scheme. (I don’t care anyway – I reckon I’ll be dead, or at least senile, by then. Also I’m sure we won’t still be using UNIX … )
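For anyone wondering where the date comes from, here is a minimal sketch. It assumes the clock is a signed 32-bit count of seconds since 1 January 1970 UTC, which is the traditional UNIX `time_t`; the helper names are made up for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: time is stored as a signed 32-bit number of seconds since the epoch.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def wrap_int32(value):
    """Wrap an integer the way a signed 32-bit counter overflows."""
    return (value - INT32_MIN) % 2**32 + INT32_MIN


def to_utc(seconds):
    """Convert seconds since the epoch to a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


if __name__ == "__main__":
    print("last good second:", to_utc(INT32_MAX))
    print("one tick later:  ", to_utc(wrap_int32(INT32_MAX + 1)))
```

The last representable second is 03:14:07 UTC on 19 January 2038; one tick later the counter wraps round to 13 December 1901. A signed 64-bit counter pushes the same overflow out by roughly 292 billion years, which is the hole in the retirement scheme.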
MikeM @ 24, you weren’t paying attention. There was very little hype and even less wasted expenditure. If we hadn’t spent all that time and money on Y2K it would have been a disaster.
Moz @ 21: “Trying to explain to clients that their behaviour is a major determinant of the size of the uncertainty in the schedule is a waste of time and annoys the pig (to misquote Heinlein).”
This all gets down to how the shop operates and the scope of its work. If it rigorously follows a methodology then it has a chance of getting the process right. In my experience, unless there is a project charter signed off at the outset, then it’s going to be a rough ride ahead.
The charter identifies the project’s objectives, KPIs etc; the roles and responsibilities of all stakeholders; the high level delivery schedule (with the important rider that the detailed schedule will be provided as an early deliverable of Phase 1); the methodology that will be employed and the key phases and deliverables; the budget and accounting process.
When a project initiative that has been assessed for feasibility and costed gets a green light from senior business management and is incorporated into the organisation’s operational/capital budget, then the formal process kicks off with production of the charter and the initial meeting for walkthrough and sign off, with the project sponsor being the last to sign under all of the other names. Until that has happened, we do not have a project and when it does happen, all concerned know that’s precisely how the project will be managed.
So, there is no question of “annoying the pig”. If there is a pig on the project, then he/she is identified as a risk and the sponsor needs to replace him/her with a team player.
Chris @25: “Related to staffing of projects, bad managers also underestimate the huge differences in productivity in programmers (order of magnitude between average and really good) and how easy it is for team members to contribute negative work.”
A key role of any manager is the professional development of their staff and a good performance review and bonus process allocates KPI points to the manager’s performance in that area.
It is crucial to understand where each programmer’s strengths and weaknesses lie in order to use their strengths and attend to the weaknesses. It’s important to keep raising the bar in their work challenges/opportunities while ensuring there is a safety net under them and that is in the form of a work buddy acting as a mentor.
Having come through the programming ranks myself, I know how bad it is to have an estimate handed to you without your input into that. My own approach is to create an atmosphere where estimates are discussed calmly and rationally, with the clear understanding that the estimate we publish is one we are committing to and we will probably not be given a chance to vary it. Consequently, sufficient (not excessive) contingency for estimating error needs to be considered seriously up front.
When it comes to project delivery or go live date, I have been keenly aware that the first date mentioned by any member of the IT team will be cast in bronze, screwed to the board room wall and have a spotlight shone on it for the duration of the project. Therefore, as the manager responsible for delivery, I reserve the right to name that date, and only after due estimating process has been done.
For security related incidents there are organisations like the various CERTs, such as AusCERT. They disseminate security and other bugfix alerts to their members.
A lot of open source software projects maintain online bug databases. A popular one is Bugzilla (from the Mozilla stable). It’s the nature of open source software to be transparent.
There used to be a world of difference between aviation and software development. I think that CASA and air traffic controllers are becoming less open than they were 30 years ago as aviation organisations seek to hide the fact that safety standards have slipped as costs have been reduced by cutting corners.
30 years ago aviators were tightly regulated: the pilots had to be credentialled, aeronautical engineers studied at university or tech, and CASA used to publish accident reports for all pilots to read, fondly known as the “Crash Comics”. One entry read: “I cleared the runway, realised I had left the key in the house, went back to the house to collect the key, jumped in the plane without rechecking the runway and hit a stray ram, buckling the undercarriage and making takeoff impossible.”
Anyone could learn programming and call themselves a programmer, systems analyst or project manager, and they still can. So you see the bright young things making the same mistakes today as the previous generation made, because it’s not mandatory to learn about previous stuffups.
And in fact large organisations use tried and true teams to develop new systems, which leads one to the uncomfortable feeling that the new system is just a transaction-master tape update circa 1967 executed without the tape drives. I fear the new electronic health system under development will run transaction files against the database once a day because the development team comes from the banks and insurance houses. I don’t mean the ATM programmers, I mean the bank programmers that ran Hogan, an American system with 800 programs to control the banking chamber, complete with 50 state flags and 10 province flags and mortgage values limited to less than $9,999.
Socratease @30: I know one old manager who couldn’t program, so when he was given a programming estimate he shaved 20% off the finish time. It was a common practice. When you read the old programming documentation you can see that this offensive, crappily designed little program with meaningless variable names and idiosyncratic coding took 2 days to write and test 40 years ago and has cost a lifetime of pain annually ever since. Why wasn’t it properly written the first time?
Chris @21: “There are also projects that spend so much time on design without sufficient knowledge of the practicalities of implementation that they end up giving nightmare tasks to the implementors. This is more common in really large projects where the architects/designers are different people to the implementers. People should be forced to implement what they design.”
Really large projects should only ever be managed by people with wide and deep experience of all areas of the development cycle, and the project’s charter should ensure that the implementation manager is a stakeholder and must have a sign off on the design to ensure his/her concerns are addressed.
Chris @21: “I strongly support early prototyping, but with the awareness that it will be thrown away and done much better the next time with the knowledge gained from the first pass. You have to build that time into your schedule.”
Prototyping can be very useful but the resulting prototype must never, ever, ever be handed over to the client on any basis. If it’s out of your hands you are in big trouble. When demonstrating a prototype I take pains to describe it as like a façade on a movie set: it’s been knocked up quickly and roughly to look like something real, but there is nothing substantial behind it and it is not a product.
Chris @ 21: “Also I think you need to alter your development methodology based on the programmers that you have. A team of very talented people will thrive in a totally different environment than a team of people you’ve just pulled off the street.”
I don’t agree. The methodology is there for a reason: to deliver a quality product. Accordingly, you need to take the same steps regardless of who’s doing the work. If the staff involved are new to the business area and/or technology, then stage/phase estimates need to be adjusted in light of that.
On a related issue, I hate the term “fast-tracking” when it is taken to mean skipping key steps. My definition of fast tracking is to put the fastest/most experienced developers on the team as they will spend the least time on each task.
Socratease @ 34 and Chris @21
Designers should be implementers – when the designers have a different background, skill set and knowledge level than the implementers then you have the recipe for disaster. You separate out the functions when you need to demonstrate lack of collusion for banking systems.
Prototypes: if the prototype works and can cope with production volumes, why rewrite, unless you have to prove that the system is incorruptible?
The methodology should be selected/adapted to the skill sets of the developers, the development tools in use and the system being built, but that shouldn’t be an excuse to skip steps or downplay the difficulty of certain steps like recreating the production environment in testing.
Billie @ 33: “I know one old manager who couldn’t program so when he was given a programming estimate he shave 20% off the finish time.”
Yes, that’s an old game of mutual mistrust where the team then routinely adds X% knowing that the manager will cut X%.
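The X% game has a wrinkle that is easy to miss: padding by X% and then cutting X% does not get you back to the original figure. A minimal sketch of the arithmetic follows; the function names and the 100-day estimate are made up for illustration.

```python
def after_pad_and_cut(estimate, pad, cut):
    """Estimate after the team pads it by `pad` and the manager cuts it by `cut`."""
    return estimate * (1 + pad) * (1 - cut)


def pad_to_cancel(cut):
    """Padding fraction needed so that a cut of `cut` restores the real estimate."""
    return cut / (1 - cut)


if __name__ == "__main__":
    # Pad 20%, cut 20%: the team ends up short of its real estimate.
    print(after_pad_and_cut(100, 0.20, 0.20))
    # Padding needed to survive a 20% cut intact.
    print(pad_to_cancel(0.20))
```

With a real estimate of 100 days, adding 20% and then losing 20% leaves about 96 days. To survive a 20% cut intact the team has to pad by 25%, so the two sides drift further apart each round.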
Billie @ 35: “You separate out the functions when you need to demonstrate lack of collusion for banking systems.”
Separation is understandable in certain secure situations, but it does not mean that the implementer cannot insist on their implementation standards being incorporated into the design. It is the responsibility of the project manager to ensure that the agreed methodology is followed and appropriate standards are adhered to in the conduct of the project, especially when such Chinese walls are mandated.
Billie @ 35: “Prototypes if the prototype works and can cope with production volumes why rewrite – unless you have to prove that the system is incorruptible.”
By definition, a prototype is NOT a product. It is not supported, nor is it supportable.
If you are talking about iterative development where you are constructing an end product that the end user will use “in anger” so to speak, then that is not prototyping in my book, it’s rapid application development and it requires a limited budget of time and resources, very close management to ensure that the terms of agreement are being adhered to, and a clear understanding in writing as to what use the resulting application will be put to and what level of ongoing support will be given.
In my experience many projects fall down because developers gild the lily, trying to deliver 110%, in competition to be the best.
The user often doesn’t know exactly what they want, because the system being developed is such a big change that the ramifications are hard to imagine. So often a system that just meets requirements in phase 1, and is then enhanced in following iterations, is easier to develop.
I worked on a successful project that came in on time and on budget, delivered the basic requirements and still operates robustly 25 years later. Thus I am not a fan of large teams of developers working with Chinese walls separating project stages. I think they are expensive and cumbersome, and the competing empires fail to cooperate to deliver workable systems.