Chaos Engineering
Deliberately inject failure to discover weaknesses before users do. How Netflix's Chaos Monkey became an engineering discipline — and how to practice it safely.
Video Lesson
A video lesson for this topic is in development. The library articles and mission exercises cover the same material in the meantime.
What is chaos engineering?
Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. In plain terms: you deliberately break things to find out what breaks — before users find out for you.
The key word is discipline. Chaos engineering is not randomly breaking things. It is structured experimentation with defined hypotheses, controlled blast radius, and systematic analysis of results.
Every system will fail. The question is whether it fails in a controlled experiment where you are prepared, or in a production incident where users are affected. Chaos engineering shifts discovery from the second scenario to the first.
The origin: Netflix Chaos Monkey
In 2010, Netflix was migrating to AWS and needed confidence that their systems could handle instance failures. They built Chaos Monkey: a tool that randomly terminates virtual machine instances in production during business hours, forcing the engineering teams to build services that could survive the loss of any individual instance.
This was counterintuitive: introduce failures deliberately, in production, during business hours. But the reasoning was sound — if failures will happen anyway, better to introduce them when the team is awake and prepared than to discover them at 3am.
·Chaos Monkey: terminates random VM instances (instance level)
·Chaos Gorilla: simulates failure of an entire AWS availability zone (zone level)
·Latency Monkey: introduces artificial delays in RESTful client-server communication (network level)
·Chaos Kong: simulates failure of an entire AWS region (region level)
The Simian Army expanded to include tools for security auditing, conformity checking, latency injection, and cleanup of unused resources. Netflix open-sourced much of this tooling, and the discipline became known as chaos engineering.
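To make the idea concrete, here is a minimal Chaos-Monkey-style sketch, not Netflix's actual implementation. The instance list, the `terminate` callback, and the opt-in flag are all hypothetical stand-ins for what would be a cloud API call in a real setup; the business-hours guard mirrors the "terminate during business hours, while the team is awake" rule described above.

```python
import random
from datetime import datetime, time

def chaos_monkey_step(instances, terminate, now=None, opted_in=True):
    """One Chaos-Monkey-style round: terminate a single random instance,
    but only for opted-in groups, on weekdays, during business hours."""
    now = now or datetime.now()
    in_window = now.weekday() < 5 and time(9) <= now.time() <= time(17)
    if not (opted_in and in_window and instances):
        return None  # do nothing outside the agreed window
    victim = random.choice(instances)
    terminate(victim)  # a real tool would call the cloud provider's API here
    return victim
```

The point of the guard clauses is the discipline: termination is random, but the conditions under which it may happen are not.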
Principles of chaos engineering
The Principles of Chaos Engineering (principlesofchaos.org) defines five principles:
Build a hypothesis around steady state
Define what 'normal' looks like with a measurable metric — requests per second, error rate, SLA compliance. The experiment tests whether steady state survives the perturbation.
Vary real-world events
Inject failures that mirror reality: instance crashes, network partitions, disk full, dependency timeouts. Artificial failures reveal artificial weaknesses.
Run experiments in production
Staging environments do not have the same traffic patterns, scale, or configuration as production. The system you care about is production.
Automate experiments continuously
Manual chaos experiments run once a quarter find a different set of weaknesses than automated experiments running continuously. Automate to find regressions.
Minimize blast radius
Start small. Limit the percentage of users or traffic affected. Have a kill switch ready. Increase scope as confidence grows.
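The first and last principles can be sketched as a tiny experiment harness. This is an illustrative skeleton, assuming a hypothetical `measure` function that returns the steady-state metric (here, error rate), plus `inject` and `rollback` callbacks for the fault; the abort-and-rollback path acts as the kill switch.

```python
def run_experiment(measure, inject, rollback, steady_threshold=0.01):
    """Hypothesis: the error rate stays below `steady_threshold` while the
    fault is active. Roll back the fault no matter what happens."""
    baseline = measure()
    if baseline >= steady_threshold:
        raise RuntimeError("not in steady state; do not start the experiment")
    inject()  # perturb the system (e.g. kill an instance, add latency)
    try:
        observed = measure()
        return observed < steady_threshold  # did the hypothesis hold?
    finally:
        rollback()  # kill switch: always undo the fault, pass or fail
```

Refusing to start when the system is already outside steady state is as important as the rollback: an experiment on an unhealthy system produces noise, not evidence.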
GameDays
A GameDay is a planned, coordinated chaos exercise where a team deliberately injects failure into their systems — or simulates it via tabletop discussion — to practice incident response. The goal is to build muscle memory before the real incident.
1. Define steady state: what does normal look like? Define measurable success metrics.
2. Form hypothesis: "If X fails, the system will still serve N req/s."
3. Minimize blast radius: start in staging, with a small % of traffic and a kill switch ready.
4. Run experiment: inject the failure. Observe.
5. Analyze: did steady state hold? If not, what broke?
6. Improve: fix the weakness. Repeat.
Live GameDay
Actually inject failures into a staging or production environment. The most realistic — and riskiest. Requires mature observability and practiced runbooks first.
Tabletop exercise
Walk through a hypothetical incident scenario. "The database is down. What do you do? Who do you call? What is the runbook?" No actual failure required. Good starting point.
Getting started
Most teams are not ready to run chaos experiments in production on day one. The path to chaos engineering requires prerequisites:
Prerequisites
·Observability — you can see what is happening
·Runbooks — you know what to do when things break
·Fast rollback — you can undo changes in minutes
·Blameless culture — experiments are safe to run
Start here
·Tabletop exercises for common failure scenarios
·Kill switch tests: "does our feature flag work?"
·Dependency injection: "what if the cache is empty?"
·Staging environment chaos only
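The "dependency injection" starting point from the list above can be as small as a unit test. A hypothetical example: `get_recommendations` and its static fallback are made-up names, but the pattern is real — pass in an empty cache and assert the system degrades gracefully instead of crashing.

```python
def get_recommendations(user_id, cache, fallback):
    """Serve from the cache; degrade to a static fallback when it is empty."""
    return cache.get(user_id) or fallback

def test_empty_cache_degrades_gracefully():
    # Chaos question from the list above: "what if the cache is empty?"
    result = get_recommendations("u1", cache={}, fallback=["popular-1"])
    assert result == ["popular-1"]
```

Tests like this are chaos experiments in miniature: a hypothesis (the fallback serves), a controlled fault (empty cache), and zero production risk.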
Mature practice
·Automated chaos in production on small traffic %
·Continuous experiments as part of CI/CD pipeline
·Region-level failure simulations
·Chaos engineering on shared infrastructure
Further reading
Chaos Engineering — Rosenthal, Jones et al.
The book. Principles, case studies, and the tooling landscape. Written by the Netflix engineers who created the discipline.
Principles of Chaos Engineering
principlesofchaos.org. The five principles, with commentary. The authoritative definition of the discipline.
Netflix Tech Blog
netflixtechblog.com. Original posts on Chaos Monkey, the Simian Army, and the FIT (Fault Injection Testing) platform.
Gremlin — Chaos Engineering Guide
gremlin.com/chaos-engineering. Practical guide to getting started. Tool-agnostic, covers blast radius, steady-state definition, and GameDays.