
CL-02 · PRACTICE · Third Way: Continual Learning

Chaos Engineering

Deliberately inject failure to discover weaknesses before users do. How Netflix's Chaos Monkey became an engineering discipline — and how to practice it safely.

Sources: Chaos Engineering (Rosenthal et al.) · Netflix Tech Blog · The DevOps Handbook


01

What is chaos engineering?

Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. In plain terms: you deliberately break things to find out what breaks — before users find out for you.

The key word is discipline. Chaos engineering is not randomly breaking things. It is structured experimentation with defined hypotheses, controlled blast radius, and systematic analysis of results.

Every system will fail. The question is whether it fails in a controlled experiment where you are prepared, or in a production incident where users are affected. Chaos engineering shifts discovery from the second scenario to the first.

02

The origin: Netflix Chaos Monkey

In 2010, Netflix was migrating to AWS and needed confidence that its systems could handle instance failures. They built Chaos Monkey: a tool that randomly terminates virtual machine instances in production during business hours, forcing engineering teams to build services that could survive the loss of any individual instance.

This was counterintuitive: introduce failures deliberately, in production, during business hours. But the reasoning was sound — if failures will happen anyway, better to introduce them when the team is awake and prepared than to discover them at 3am.
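The core of Chaos Monkey is simple: pick one unprotected instance at random and terminate it. A minimal sketch of that loop in Python, where the instance IDs, the opt-out set, and the `terminate` callable are all hypothetical stand-ins for a real cloud API call:

```python
import random

def pick_victim(instances, protected):
    """Choose one random instance, skipping any that are opted out."""
    candidates = [i for i in instances if i not in protected]
    return random.choice(candidates) if candidates else None

def unleash(instances, protected, terminate, dry_run=True):
    """Terminate one random unprotected instance, Chaos Monkey style.

    dry_run defaults to True: a chaos tool should be safe by default
    and require explicit opt-in before it actually kills anything.
    """
    victim = pick_victim(instances, protected)
    if victim is None:
        return None
    if not dry_run:
        terminate(victim)  # in a real tool, a cloud provider API call
    return victim
```

The dry-run default mirrors the blast-radius principle discussed below: destructive behavior is opt-in, not opt-out.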

·Chaos Monkey: terminates random VM instances (instance level)

·Chaos Gorilla: simulates failure of an entire AWS availability zone (zone level)

·Latency Monkey: introduces artificial delays in RESTful client-server communication (network level)

·Chaos Kong: simulates failure of an entire AWS region (region level)

The Simian Army expanded to include tools for security, conformance, latency, and janitor cleanup. Netflix open-sourced much of this tooling, and the discipline became known as chaos engineering.
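Latency Monkey's idea translates directly to application code: wrap a client call so that a fraction of requests see an artificial delay. A sketch using only the standard library (the probability and delay values are illustrative, and `sleep` is injectable so tests need not actually wait):

```python
import functools
import random
import time

def inject_latency(delay_s, probability=0.1, sleep=time.sleep):
    """Decorator: add artificial delay to a fraction of calls."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                sleep(delay_s)  # the injected fault: extra latency
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

Wrapping a downstream client call with this decorator answers the question "do our timeouts and fallbacks actually fire when a dependency slows down?"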

03

Principles of chaos engineering

The Principles of Chaos Engineering (principlesofchaos.org) defines five principles:

Build a hypothesis around steady state

Define what 'normal' looks like with a measurable metric — requests per second, error rate, SLA compliance. The experiment tests whether steady state survives the perturbation.

Vary real-world events

Inject failures that mirror reality: instance crashes, network partitions, disk full, dependency timeouts. Artificial failures reveal artificial weaknesses.

Run experiments in production

Staging environments do not have the same traffic patterns, scale, or configuration as production. The system you care about is production.

Automate experiments continuously

A manual chaos experiment run once a quarter finds only a snapshot of weaknesses; automated experiments running continuously catch regressions as the system changes. Automate to find regressions.

Minimize blast radius

Start small. Limit the percentage of users or traffic affected. Have a kill switch ready. Increase scope as confidence grows.
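The five principles compose into a small experiment loop: state the steady-state hypothesis as a tolerance band around a baseline metric, inject the failure, measure, and always roll back. A hedged sketch, where `inject`, `rollback`, and `measure` are placeholders for real tooling:

```python
def run_experiment(inject, rollback, measure, baseline, tolerance=0.05):
    """Inject a failure, check the steady-state hypothesis, always roll back.

    Hypothesis: the measured metric stays within `tolerance` of `baseline`
    (e.g. requests per second, error rate) despite the perturbation.
    """
    inject()
    try:
        metric = measure()
        return abs(metric - baseline) <= tolerance * baseline
    finally:
        rollback()  # kill switch: the perturbation never outlives the experiment
```

Putting `rollback()` in a `finally` block encodes the blast-radius principle: even if measurement throws, the injected fault is undone.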

04

GameDays

A GameDay is a planned, coordinated chaos exercise where a team deliberately injects failure into their systems — or simulates it via tabletop discussion — to practice incident response. The goal is to build muscle memory before the real incident.

01 · Define steady state: What does normal look like? Define measurable success metrics.

02 · Form hypothesis: "If X fails, the system will still serve N req/s."

03 · Minimize blast radius: Start in staging. Small % of traffic. Kill switch ready.

04 · Run experiment: Inject the failure. Observe.

05 · Analyze: Did steady state hold? If not, what broke?

06 · Improve: Fix the weakness. Repeat.

Live GameDay

Actually inject failures into a staging or production environment. The most realistic — and riskiest. Requires mature observability and practiced runbooks first.

Tabletop exercise

Walk through a hypothetical incident scenario. "The database is down. What do you do? Who do you call? What is the runbook?" No actual failure required. Good starting point.

05

Getting started

Most teams are not ready to run chaos experiments in production on day one. The path to chaos engineering requires prerequisites:

Prerequisites

·Observability — you can see what is happening

·Runbooks — you know what to do when things break

·Fast rollback — you can undo changes in minutes

·Blameless culture — experiments are safe to run

Start here

·Tabletop exercises for common failure scenarios

·Kill switch tests: "does our feature flag work?"

·Dependency injection: "what if the cache is empty?"

·Staging environment chaos only
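The "does our feature flag work?" kill-switch test is worth automating: flip the flag off and assert the system falls back to the degraded-but-safe path. A minimal sketch with an in-memory flag store standing in for a real flag service (the flag name and functions are illustrative):

```python
class FeatureFlags:
    """In-memory flag store standing in for a real flag service."""
    def __init__(self, flags):
        self.flags = dict(flags)

    def enabled(self, name):
        return self.flags.get(name, False)

    def kill(self, name):
        self.flags[name] = False  # the kill switch

def recommendations(flags, personalized, fallback):
    """Serve the risky personalized path only while its flag is on."""
    if flags.enabled("personalized-recs"):
        return personalized()
    return fallback()  # degraded but safe default
```

Running this as a routine test answers the kill-switch question before an incident does: the flag flips, and traffic lands on the fallback.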

Mature practice

·Automated chaos in production on small traffic %

·Continuous experiments as part of CI/CD pipeline

·Region-level failure simulations

·Chaos engineering on shared infrastructure

06

Further reading

Chaos Engineering — Rosenthal, Jones et al.

The book. Principles, case studies, and the tooling landscape. Written by the Netflix engineers who created the discipline.

Principles of Chaos Engineering

principlesofchaos.org. The five principles, with commentary. The authoritative definition of the discipline.

Netflix Tech Blog

netflixtechblog.com. Original posts on Chaos Monkey, the Simian Army, and the FIT (Fault Injection Testing) platform.

Gremlin — Chaos Engineering Guide

gremlin.com/chaos-engineering. Practical guide to getting started. Tool-agnostic, covers blast radius, steady-state definition, and GameDays.