Awful Software Engineering Failures And How To Prevent Them

Accounts of some of the most awful engineering failures, and general principles for avoiding them.
Last updated on Dec 28, 2023

This post is based on a Hacker News thread where many people shared their own horror stories. I try to provide concrete examples of failures and corresponding solutions as much as possible to illustrate the general principles. I'll probably extend and reorganize it in the future.


General advice

Minimize the number of manual steps when interacting with sensitive systems. Humans make mistakes: in the Gimli Glider incident, among other mistakes, two technicians, the First Officer, and the Captain all failed to correctly calculate the amount of fuel required for the flight. Automation does not exempt you from manual verification, especially when you push non-trivial changes or modify the automation itself.
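As a toy illustration of what such automation can look like (the helper and the numbers are purely illustrative, not the actual flight figures), a small script can own the unit conversion once instead of several people redoing it by hand:

```python
LITRES_TO_KG = 0.80  # approximate density of jet fuel in kg per litre

def fuel_to_load_kg(required_kg: float, on_board_litres: float) -> float:
    """Compute how much fuel to add, keeping every quantity in explicit units."""
    on_board_kg = on_board_litres * LITRES_TO_KG
    return max(required_kg - on_board_kg, 0.0)

# Illustrative numbers only: the point is that the conversion is written once,
# reviewed once, and never done from memory by a tired crew.
print(fuel_to_load_kg(required_kg=22_300, on_board_litres=7_682))
```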

Warn the user about potentially dangerous actions, for example when deploying a version that did not pass all tests. By default apt asks for confirmation before installing or removing packages.
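As a minimal sketch of such a warning (the deploy function and its arguments are hypothetical), a deployment script could require an explicit confirmation when the version being shipped did not pass all tests:

```python
import sys

def confirm(prompt: str) -> bool:
    """Ask for an explicit yes/no answer, defaulting to no."""
    answer = input(f"{prompt} [y/N] ").strip().lower()
    return answer in ("y", "yes")

def deploy(version: str, tests_passed: bool) -> None:
    if not tests_passed:
        # Warn loudly and make the dangerous path opt-in.
        print(f"WARNING: version {version} did not pass all tests.")
        if not confirm("Deploy anyway?"):
            print("Aborting deployment.")
            sys.exit(1)
    print(f"Deploying {version}...")  # the actual deployment logic would go here

if __name__ == "__main__":
    deploy("1.4.2", tests_passed=False)
```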

Build things that are safe by default.

Developers love to poke around and explore tools by running them, without spending too much time reading documentation or code. That's one reason to build tools that are safe by default and to add safeguards that prevent bad situations from happening unintentionally. For example, deny access to tables containing personal user data by default, so that developers don't create privacy incidents by mistake.

Deny by default and use allow lists. This one comes from the security domain but can be applied to many other things. For example, have groups with associated allowed commands (the finer-grained the better), so that an account might be able to delete some rows but not entire tables.
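A small sketch of the last two points together (roles and table names are invented): every table is unreadable unless it appears on the role's allow list, so tables with personal data require an explicit grant.

```python
# Tables each role may read; anything not listed, including personal data, is denied.
READABLE_TABLES = {
    "analyst": {"orders", "products"},
    "support": {"orders", "products", "users"},  # "users" granted explicitly, not by default
}

def can_read(role: str, table: str) -> bool:
    """Deny by default: unknown roles and unlisted tables are rejected."""
    return table in READABLE_TABLES.get(role, set())

assert can_read("support", "users")
assert not can_read("analyst", "users")  # personal data is not readable by default
assert not can_read("intern", "orders")  # unknown role, denied
```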

Design systems such that they cannot be plugged incorrectly. For example, add a mandatory _prod suffix to all prod databases. A physical-world example: the Hoover nozzle and Hoover ring.
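A sketch of the software equivalent (the database names are invented): refuse to run unless the target name matches the expected environment, so the tool simply cannot be plugged into the wrong database.

```python
def check_target(db_name: str, environment: str) -> None:
    """Refuse to connect a tool to a database from the wrong environment."""
    if environment == "prod" and not db_name.endswith("_prod"):
        raise ValueError(f"{db_name!r} does not look like a prod database")
    if environment != "prod" and db_name.endswith("_prod"):
        raise ValueError(f"refusing to touch prod database {db_name!r} from {environment}")

check_target("orders_prod", "prod")        # fine
check_target("orders_staging", "staging")  # fine
# check_target("orders_prod", "staging")   # would raise: wrong plug, wrong socket
```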

Fail early to avoid inconsistent states. Inconsistencies can appear when you run commands in sequence and some of them succeed while others fail. Figure out the dependencies between steps and make sure that the script stops at the first unexpected error.
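A minimal sketch of the fail-early rule, assuming three hypothetical shell scripts with an implicit dependency between them: the loop stops at the first failure instead of leaving the system half-migrated.

```python
import subprocess
import sys

# Each step assumes the previous one succeeded.
STEPS = [
    ["backup_db.sh"],
    ["run_migrations.sh"],
    ["restart_service.sh"],
]

for step in STEPS:
    try:
        # check=True raises on a non-zero exit code, so we never keep going after a failed step.
        subprocess.run(step, check=True)
    except (subprocess.CalledProcessError, FileNotFoundError) as exc:
        print(f"Step {step} failed: {exc}. Stopping before the state becomes inconsistent.")
        sys.exit(1)
```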

Apply the least privilege principle. This one is related to the deny-by-default point.

Defense in depth. Don't assume that your first line of defense is impenetrable. The story of Qantas Flight 32 says a lot about how far this principle can go: fragments of an IP turbine disc passed through the wing and belly of the A380, causing malfunctions in dozens of critical systems. The plane landed successfully nonetheless. Software engineering examples include:

Don’t leave dangerous operations / code in the wild. If it modifies a critical system, it probably deserves proper documentation and official tooling.

Know your Single Points Of Failure. They can be specific hosts or services, internal or external. Once the corresponding stakeholders are aware of them, it's easier to be extra cautious when interacting with them, or to try to remediate them.

Set up quotas, especially on resources that cost extra money. You don’t want to provision GCP instances in an infinite loop until your company runs out of money. If you’re an investment fund, you don’t want to send market orders in an infinite loop until your clients can no longer pay the transaction fees.
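A hedged illustration of a self-imposed quota (the helpers are hypothetical placeholders for real cloud API calls): provisioning fails loudly once a hard cap is reached, even if the calling loop is buggy.

```python
MAX_INSTANCES = 20  # hard cap chosen up front, far below anything that could ruin you

def count_running_instances() -> int:
    # Hypothetical: a real tool would query the cloud provider's API here.
    return 3

def provision_instance() -> None:
    if count_running_instances() >= MAX_INSTANCES:
        # Fail loudly instead of silently burning money in an infinite loop.
        raise RuntimeError(f"quota of {MAX_INSTANCES} instances reached, refusing to provision more")
    print("Provisioning one instance...")  # the real provisioning call would go here

provision_instance()
```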

Set up monitoring and alerting. You don’t want to hear the bad news from your users.
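A toy sketch, with a made-up threshold and a made-up notification helper: an automated check that pages the team before a user has to report the problem.

```python
ERROR_RATE_THRESHOLD = 0.05  # alert above 5% errors; pick a value that matches your SLOs

def notify_on_call(message: str) -> None:
    # Hypothetical: a real setup would call your paging or chat system here.
    print(f"ALERT: {message}")

def check_error_rate(errors: int, requests: int) -> None:
    rate = errors / requests if requests else 0.0
    if rate > ERROR_RATE_THRESHOLD:
        notify_on_call(f"error rate at {rate:.1%} over the last 5 minutes")

check_error_rate(errors=37, requests=500)  # 7.4%, so the alert fires
```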

Fight the normalization of deviance. Work quality can degrade quite fast, and ad hoc, bumpy procedures have a tendency to stick around. Some examples include getting used to: overtime, failing tests in CI, running the same process so often that you validate prompts without even reading them, or working with a dev and a prod environment at the same time.

Avoid interacting with critical systems on Friday evening. Also think about the worst case scenario and how long it would take to recover.

Check your own state. Many errors happen because people are tired, under too much pressure, in a hurry or have whatever other issue that makes them prone to mistakes.

Don’t blame people. Take those failures as opportunities to improve the robustness of your systems. Make sure that the whole team learns from those mistakes.

Have a good understanding of what happened before trying to mitigate an issue.

Make sure your mitigation steps are safe. For example, you might want to roll back your application to a previous version, but that version might not be compatible with the current database schema, because you changed a foreign key, moved a column to another table, or whatever.
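A sketch of the kind of guard that makes such a rollback safer (the versions and the compatibility table are invented): refuse to roll back to an application version that cannot read the current database schema.

```python
# Oldest application version known to work with each schema version (invented data).
MIN_APP_VERSION_FOR_SCHEMA = {
    12: "3.4.0",
    13: "3.6.0",  # schema 13 moved a column, so 3.5.x and older can no longer read it
}

def can_rollback_to(app_version: str, schema_version: int) -> bool:
    minimum = MIN_APP_VERSION_FOR_SCHEMA[schema_version]
    # Plain string comparison only works because the versions share one format;
    # a real tool would parse versions properly.
    return app_version >= minimum

assert can_rollback_to("3.6.1", 13)
assert not can_rollback_to("3.5.2", 13)  # would break on the new schema
```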

Once an issue is mitigated, make sure to properly identify its root cause by conducting a post-mortem. Make sure to execute the follow-up tasks.

Warn others

Unplug the system yourself. Test your system’s reliability by voluntarily shutting down some of its parts. It’s the only way to be truly confident in its robustness.
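A deliberately naive sketch (the service inventory and the stop_service helper are hypothetical): pick one redundant component the system is supposed to survive losing, warn the team, and actually take it down.

```python
import random

# Redundant components you believe the system can survive losing.
CANDIDATE_SERVICES = ["cache-1", "cache-2", "worker-3"]

def stop_service(name: str) -> None:
    # Hypothetical: a real experiment would call your orchestrator or cloud API here.
    print(f"Stopping {name}...")

def run_chaos_experiment() -> None:
    """Unplug one component on purpose, during working hours, with the team warned."""
    victim = random.choice(CANDIDATE_SERVICES)
    print(f"Chaos experiment: taking {victim} down, watch the dashboards.")
    stop_service(victim)

run_chaos_experiment()
```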

Run recovery drills. Practicing is the only way to be confident during a real disaster recovery. You’ll also get a better sense of the time, resources, and money such a recovery costs.

Wiping the prod database

Prevent

Anticipate

Recover

If possible, run the following often and automatically. Make sure to also manually check that each step runs as expected at regular intervals.

Messing up a critical host

Recover

Other interesting stories you can learn from: