Awful Software Engineering Failures And How To Prevent Them

First-hand accounts of some of the most awful engineering failures, and general principles to avoid them.
Last updated on Sep 17, 2024

This post is based on a Hacker News thread where many people shared their own horror stories. I try to give concrete examples of failures and corresponding solutions as much as possible to illustrate the general principles. I'll probably extend and reorganize it in the future. Keep in mind that the following only applies to important systems, i.e. the ones that would put you in a bad situation if you mess them up. It's equally important to know where it's fine to move fast and break things; the distinction is typically end-user impact vs internal impact.

General advice

Minimize the number of manual steps. Humans make mistakes: in the Gimli Glider story, two technicians, the First Officer and the Captain all failed to correctly calculate the amount of fuel required for the flight, causing the plane to take off with a half-empty tank. Automation does not exempt you from manual verification, though, especially when you push non-trivial changes or modify the automation itself. Did you test the script that checks if the CI ran tests? 🤡

Warn the user about potentially dangerous actions, for example when deploying a version that did not pass all tests. Think about situations where you've been asked for confirmation.
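
As a sketch of what such a confirmation can look like, here is a minimal Python example (the action and target names are made up) that forces the operator to retype the target before a risky operation runs:

```python
import sys

def confirm_dangerous_action(action: str, target: str) -> bool:
    """Ask the operator to retype the target name before proceeding."""
    print(f"You are about to {action} on '{target}'. This may be destructive.")
    answer = input(f"Type '{target}' to confirm: ")
    return answer.strip() == target

if __name__ == "__main__":
    # Hypothetical example: deploying a build that did not pass all tests to prod.
    if not confirm_dangerous_action("deploy an untested build", "prod"):
        print("Aborted.")
        sys.exit(1)
    print("Proceeding...")  # the real deployment step would go here
```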

Build things that are safe by default.

Deny/ignore by default and use allow lists. This one comes from the security domain but can be applied to many other things. For example, if you want to apply something to a subset of users, it's much easier to know who is affected when you explicitly list those users than when you list all the other users that must be excluded.
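
For illustration, a minimal sketch of an allow-list rollout in Python (the user ids and feature name are hypothetical):

```python
# Deny by default: the feature is off for everyone except users explicitly listed.
BETA_ALLOW_LIST = {"alice", "bob"}

def new_billing_flow_enabled(user_id: str) -> bool:
    return user_id in BETA_ALLOW_LIST

print(new_billing_flow_enabled("alice"))  # True: explicitly allowed
print(new_billing_flow_enabled("carol"))  # False: unknown users are denied by default
```

The inverse approach, a deny list, silently opts in every user you forgot to list.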

Design systems such that they cannot be plugged in incorrectly, for example by using named parameters instead of positional arguments, since it's easy to get the order of arguments wrong. Also be careful with short options: people might expect -r to run the operation recursively, not to be the short version of --remove.
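
A minimal Python sketch of the named-parameter idea (the function and its flags are made up): the `*` makes every argument keyword-only, so callers cannot silently swap them.

```python
def delete_files(*, path: str, recursive: bool = False, dry_run: bool = True) -> None:
    """Keyword-only arguments: calls must spell out what each value means."""
    print(f"delete path={path!r} recursive={recursive} dry_run={dry_run}")

delete_files(path="/tmp/cache", recursive=True, dry_run=False)  # unambiguous
# delete_files("/tmp/cache", True)  # TypeError: positional arguments are rejected
```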

Fail early to avoid inconsistent states. This can happen when you run commands in sequence and some of them succeed while others fail. Figure out the dependencies and make sure that the script stops on the first unexpected error.
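
A minimal Python sketch of the fail-early behaviour (the commands are placeholders): `subprocess.run(check=True)` raises on the first non-zero exit code, so later steps never run on top of a half-applied state.

```python
import subprocess
import sys

STEPS = [
    ["echo", "build"],
    ["echo", "migrate database"],
    ["echo", "deploy"],
]

for step in STEPS:
    try:
        subprocess.run(step, check=True)
    except subprocess.CalledProcessError as exc:
        print(f"Step {step} failed with exit code {exc.returncode}, aborting.", file=sys.stderr)
        sys.exit(exc.returncode)
```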

Apply the principle of least privilege. This one is related to the Deny by default point.

Defense in depth. Don’t assume that your first line of defense is impenetrable. The story of Qantas Flight 32 says a lot about how far this principle can go: fragments of one of the A380’s intermediate-pressure (IP) turbine discs passed through the wing and belly of the aircraft, causing malfunctions in dozens of critical systems. The plane landed successfully nonetheless. Software engineering examples include:

Don’t leave dangerous operations / code in the wild. If it modifies a critical system it probably deserves proper documentation and official tooling.

Know your Single Points Of Failure. They can be specific hosts or services, internal or external. Once the corresponding stakeholders are aware of them, it’s easier to be extra-cautious when interacting with them or to try to remediate them.

Set up quotas, especially on resources that cost extra money. You don’t want to provision GCP instances in an infinite loop until your company runs out of money. If you’re an investment fund, you don’t want to send market orders in an infinite loop until your clients don’t have enough money to pay the transaction fees.
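
As a sketch, here is a client-side cap around a hypothetical `provision_instance()` helper; real quotas should of course also be enforced on the provider side:

```python
MAX_INSTANCES = 20  # illustrative cap

def provision_instance(index: int) -> None:
    print(f"provisioning instance {index}")  # placeholder for the real API call

def provision_fleet(requested: int) -> None:
    if requested > MAX_INSTANCES:
        raise ValueError(f"Refusing to provision {requested} instances (cap is {MAX_INSTANCES}).")
    for i in range(requested):
        provision_instance(i)

provision_fleet(5)       # fine
# provision_fleet(5000)  # raises instead of silently burning money
```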

Set up monitoring and alerting. You don’t want to hear the bad news from your users.
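
A minimal sketch of a liveness check (the URL is a placeholder, and a real setup would page on-call through your monitoring stack rather than print):

```python
import urllib.request

HEALTH_URL = "https://example.com/healthz"  # illustrative endpoint

def is_healthy(url: str, timeout: float = 5.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        return False

if not is_healthy(HEALTH_URL):
    print(f"ALERT: {HEALTH_URL} is unhealthy")  # in practice: PagerDuty, email, Slack, ...
```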

Fight the normalization of deviance. Work quality can degrade quite fast, and ad-hoc bumpy procedures have a tendency to stick around. Examples include getting used to: overtime, failing tests in the CI, running the same process so often that you validate prompts without even reading them, or working with a dev and a prod environment side by side.

Avoid interacting with critical systems on Friday evening. Also think about the worst case scenario and how long it would take to recover.

Check your own state. Many errors happen because people are tired, under too much pressure, in a hurry or have whatever other issue that makes them prone to mistakes.

Don’t blame people. Take those failures as opportunities to improve the robustness of your systems. Make sure that the whole team learns from those mistakes.

Have a good understanding of what happened before trying to mitigate an issue.

Make sure your mitigation steps are safe. For example, you might want to roll back your application to a previous version, but that version might not be compatible with the current database schema, because you changed a foreign key, moved a column to another table or whatever.
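
A minimal sketch of a compatibility gate before rolling back, assuming (hypothetically) that each release records the schema version it expects and the database exposes its current migration version:

```python
CURRENT_DB_SCHEMA_VERSION = 42  # e.g. read from a migrations table

RELEASE_SCHEMA_REQUIREMENTS = {
    "v1.8.0": 41,  # built against the pre-migration schema
    "v1.9.0": 42,
}

def can_roll_back_to(release: str) -> bool:
    # A release built against another schema may break on the current one
    # (renamed foreign key, moved column, ...), so refuse the rollback.
    return RELEASE_SCHEMA_REQUIREMENTS[release] == CURRENT_DB_SCHEMA_VERSION

print(can_roll_back_to("v1.9.0"))  # True
print(can_roll_back_to("v1.8.0"))  # False: revert the migration first
```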

Once an issue is mitigated, make sure to properly identify its root cause by conducting a post-mortem. Make sure to execute the follow-up tasks.

Warn others

Unplug the system yourself. Test your system’s reliability by voluntarily shutting down some parts. It’s the only way to be truly confident about the system’s robustness.

Run recovery drills. They’re the only way to be confident during a real disaster recovery. You’ll also get a better sense of the time, resources and money it takes to conduct such a recovery.

Wiping the prod database

Prevent

Anticipate

Recover
If possible, run the following often and automatically. Make sure to also manually check that each step runs as expected at regular intervals.

Messing up a critical host

Recover

Other interesting stories you can learn from: