Incident Review
The feedback loop after failure. How to turn production incidents into systemic improvements — and why how you review matters as much as what you review.
What is an incident review?
An incident review (also called a post-incident review or post-mortem) is a structured process for learning from production failures. After an incident is resolved, the team gathers to understand what happened, why it happened, and how to prevent it — or detect it faster — next time.
The goal is not to assign blame or find the guilty party. The goal is to improve the system. Incidents are free lessons in where your system is fragile. A team that does not review incidents throws those lessons away.
John Allspaw and Paul Hammond's 2009 talk "10+ Deploys Per Day" at Flickr helped launch the DevOps movement; Allspaw later popularized blameless post-mortems at Etsy. The insight: engineers make reasonable decisions given the information they had at the time. If the system allowed those decisions to cause harm, fix the system.
Blameless vs blame
A blame culture treats incidents as individual failures — someone made a mistake, and they need to be held accountable. A blameless culture treats incidents as system failures — the system created conditions where a human error could cause an outage.
Blame culture outcome
✗ Engineers hide mistakes
✗ Problems stay hidden until they cause crises
✗ The root cause is recorded as 'human error'
✗ Nothing systemic changes
✗ The same incident recurs
Blameless culture outcome
✓ Problems surface early
✓ Root causes are investigated fully
✓ System design improves
✓ Engineers feel safe admitting mistakes
✓ Incidents teach the whole team
The five whys
The five whys is a root cause analysis technique from Toyota: ask "why?" repeatedly until you reach a systemic cause rather than a proximate one. The number five is a rule of thumb — keep asking until you reach a cause that can actually be fixed.
Why did the payment service fail?
The database connection pool was exhausted.
Why was the connection pool exhausted?
A slow query held connections open for 40 seconds.
Why did the query run so slowly?
A new index was missing after the migration.
Why was the index missing?
The migration script did not include the CREATE INDEX statement.
Why was the missing index not caught before production?
The staging database is smaller — the query was fast enough not to trigger alerts.
Root cause
The staging environment does not match production data volume. Slow queries are invisible until production. Fix: add staging data volume checks to the deployment pipeline, and add index creation to the migration review checklist.
Notice that the five whys lead from a symptom (payment failure) to a systemic root cause (environment parity). The fix is not "be more careful" — it is a process change that removes the conditions that allowed the failure, so the same incident cannot quietly recur.
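The "staging data volume check" from the fix above could take many forms; here is one minimal sketch, assuming a pipeline step that can query row counts per table in both environments (the function name, threshold, and data are illustrative, not from the article):

```python
# Hypothetical deployment-pipeline check (illustrative): fail the deploy
# if staging tables hold far less data than production, so slow queries
# surface before release rather than after.

PARITY_THRESHOLD = 0.10  # staging must hold at least 10% of production rows

def check_data_parity(prod_counts, staging_counts, threshold=PARITY_THRESHOLD):
    """Return the tables whose staging row count falls below
    threshold * production row count."""
    underfilled = []
    for table, prod_rows in prod_counts.items():
        staging_rows = staging_counts.get(table, 0)
        if prod_rows > 0 and staging_rows / prod_rows < threshold:
            underfilled.append(table)
    return underfilled

# Example: the payments table is nearly empty in staging, so the check flags it.
prod = {"payments": 40_000_000, "users": 2_000_000}
staging = {"payments": 50_000, "users": 500_000}
print(check_data_parity(prod, staging))  # ['payments']
```

A check like this turns the root cause into a pipeline gate: instead of hoping engineers remember environment parity, the deploy fails loudly when staging drifts too far from production.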
The incident review format
Incident summary
One paragraph: what happened, when, for how long, and what the user impact was. Written for engineers who were not involved.
Timeline
Chronological sequence of events: first signal, detection, escalation, mitigation, resolution. Exact timestamps. Include the detection-to-response gap.
Contributing factors
Not blame-seeking — factor-finding. What conditions made this incident possible? Absent monitoring, missing documentation, high cognitive load, unclear ownership?
What went well
What helped you detect and resolve the incident faster? Good oncall runbooks? Fast rollback mechanism? Monitoring that worked? Reinforce these.
Action items
Concrete tasks with owners and due dates. Each action item removes a contributing factor. Not 'be more careful' — 'add alert for X' or 'add index migration check to pipeline'.
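One way to capture the five sections above is a short shared template. This is a sketch, not a prescribed format — the headings mirror the structure described here, and all timestamps and field values are placeholders:

```markdown
# Post-mortem: <incident title>    Severity: <level>    Date: <date>

## Incident summary
One paragraph: what happened, when, for how long, user impact.
Written for engineers who were not involved.

## Timeline (all times UTC)
- <14:02> — first signal (alert fires)
- <14:10> — on-call engineer acknowledges (detection-to-response gap: 8 min)
- <14:35> — mitigation applied
- <15:20> — full resolution

## Contributing factors
- <condition that made the incident possible>

## What went well
- <detection or mitigation mechanism that worked>

## Action items
| Action                          | Owner  | Due date |
|---------------------------------|--------|----------|
| <concrete, systemic change>     | <name> | <date>   |
```

Keeping the format fixed makes post-mortems comparable across incidents, which is what makes the trend analysis described below possible.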
Learning from incidents
A post-mortem document that lives in a shared folder and is never read again is waste. The organizational value of incident review comes from distributing the learning — making it accessible to engineers who were not involved.
Internal publication
Publish every post-mortem to a shared internal wiki. Searchable by affected service, root cause category, or date.
Review in team meetings
Spend 10 minutes reviewing recent incidents in the team meeting. What happened across the org? What should everyone know?
Trend analysis
Quarterly review of incident patterns. What root causes recur? Where is investment in reliability most needed?
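If post-mortems record a root-cause category in a consistent field, the quarterly trend review can start from a simple tally. A minimal sketch — the category names and incident data here are invented for illustration:

```python
# Illustrative quarterly trend analysis: count post-mortems by root-cause
# category to see where reliability investment would pay off most.
from collections import Counter

incidents = [
    {"service": "payments", "root_cause": "environment-parity"},
    {"service": "search",   "root_cause": "missing-alert"},
    {"service": "payments", "root_cause": "environment-parity"},
    {"service": "auth",     "root_cause": "config-change"},
    {"service": "payments", "root_cause": "missing-alert"},
]

by_cause = Counter(i["root_cause"] for i in incidents)
for cause, count in by_cause.most_common():
    print(f"{cause}: {count}")
# Recurring categories (here: environment-parity, missing-alert) point to
# systemic fixes that one-off action items have not yet addressed.
```

Even this crude count answers the two questions the trend review asks: which root causes recur, and which services keep appearing.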
Further reading
DevOps Handbook — Part IV
The Second Way: Feedback. Chapters 24–26: creating learning from production telemetry, blameless post-mortems, and review cultures.
Google SRE Book — Chapter 15
Postmortem Culture: Learning from Failure. The Google blameless postmortem format with examples.
John Allspaw — Blameless PostMortems
codeascraft.com. The original Etsy blog post that popularized blameless post-mortems in the DevOps community.
Sidney Dekker — The Field Guide to Human Error
The cognitive science behind why 'human error' is not a root cause — and why system design is the real lever.