Incident Review
The feedback loop after failure. How to turn production incidents into systemic improvements — and why how you review matters as much as what you review.
What is an incident review?
An incident review (also called a post-incident review or post-mortem) is a structured process for learning from production failures. After an incident is resolved, the team gathers to understand what happened, why it happened, and how to prevent it — or detect it faster — next time.
The goal is not to assign blame or find the guilty party. The goal is to improve the system. Incidents are free lessons in where your system is fragile. A team that does not review incidents throws those lessons away.
John Allspaw and Paul Hammond's 2009 talk "10+ Deploys Per Day" at Flickr helped launch the DevOps movement; Allspaw later popularized blameless post-mortems at Etsy. The insight: engineers make reasonable decisions given the information they had at the time. If the system allowed those decisions to cause harm, fix the system.
Blameless vs blame
A blame culture treats incidents as individual failures — someone made a mistake, and they need to be held accountable. A blameless culture treats incidents as system failures — the system created conditions where a human error could cause an outage.
Blame culture outcome
✗ Engineers hide mistakes
✗ Problems stay hidden until they cause crises
✗ The root cause is recorded as 'human error'
✗ Nothing systemic changes
✗ The same incident recurs
Blameless culture outcome
✓ Problems surface early
✓ Root causes are investigated fully
✓ System design improves
✓ Engineers feel safe admitting mistakes
✓ Incidents teach the whole team
The five whys
The five whys is a root cause analysis technique from Toyota: ask "why?" repeatedly until you reach a systemic cause rather than a proximate one. The number five is a rule of thumb — keep asking until you reach a cause that can actually be fixed.
Why did the payment service fail?
The database connection pool was exhausted.
Why was the connection pool exhausted?
A slow query held connections open for 40 seconds.
Why did the query run so slowly?
A new index was missing after the migration.
Why was the index missing?
The migration script did not include the CREATE INDEX statement.
Why was the missing index not caught before production?
The staging database is smaller — the query was fast enough not to trigger alerts.
Root cause
The staging environment does not match production data volume. Slow queries are invisible until production. Fix: add staging data volume checks to the deployment pipeline, and add index creation to the migration review checklist.
Notice that the five whys lead from a symptom (payment failure) to a systemic root cause (environment parity). The fix is not "be more careful" — it is a process change that removes the conditions that allowed the failure, so the same incident cannot quietly recur.
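The "staging data volume check" from the fix above could take many forms; here is one minimal sketch, assuming a pipeline step that can query row counts per table in both environments (the function name, threshold, and data are illustrative, not from the article):

```python
# Hypothetical deployment-pipeline check (illustrative): fail the deploy
# if staging tables hold far less data than production, so slow queries
# surface before release rather than after.

PARITY_THRESHOLD = 0.10  # staging must hold at least 10% of production rows

def check_data_parity(prod_counts, staging_counts, threshold=PARITY_THRESHOLD):
    """Return the tables whose staging row count falls below
    threshold * production row count."""
    underfilled = []
    for table, prod_rows in prod_counts.items():
        staging_rows = staging_counts.get(table, 0)
        if prod_rows > 0 and staging_rows / prod_rows < threshold:
            underfilled.append(table)
    return underfilled

# Example: the payments table is nearly empty in staging, so the check flags it.
prod = {"payments": 40_000_000, "users": 2_000_000}
staging = {"payments": 50_000, "users": 500_000}
print(check_data_parity(prod, staging))  # ['payments']
```

A check like this turns the root cause into a pipeline gate: instead of hoping engineers remember environment parity, the deploy fails loudly when staging drifts too far from production.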
The incident review format
Incident summary
One paragraph: what happened, when, for how long, and what the user impact was. Written for engineers who were not involved.
Timeline
Chronological sequence of events: first signal, detection, escalation, mitigation, resolution. Exact timestamps. Include the detection-to-response gap.
Contributing factors
Not blame-seeking — factor-finding. What conditions made this incident possible? Absent monitoring, missing documentation, high cognitive load, unclear ownership?
What went well
What helped you detect and resolve the incident faster? Good oncall runbooks? Fast rollback mechanism? Monitoring that worked? Reinforce these.
Action items
Concrete tasks with owners and due dates. Each action item removes a contributing factor. Not 'be more careful' — 'add alert for X' or 'add index migration check to pipeline'.
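One way to capture the five sections above is a short shared template. This is a sketch, not a prescribed format — the headings mirror the structure described here, and all timestamps and field values are placeholders:

```markdown
# Post-mortem: <incident title>    Severity: <level>    Date: <date>

## Incident summary
One paragraph: what happened, when, for how long, user impact.
Written for engineers who were not involved.

## Timeline (all times UTC)
- <14:02> — first signal (alert fires)
- <14:10> — on-call engineer acknowledges (detection-to-response gap: 8 min)
- <14:35> — mitigation applied
- <15:20> — full resolution

## Contributing factors
- <condition that made the incident possible>

## What went well
- <detection or mitigation mechanism that worked>

## Action items
| Action                          | Owner  | Due date |
|---------------------------------|--------|----------|
| <concrete, systemic change>     | <name> | <date>   |
```

Keeping the format fixed makes post-mortems comparable across incidents, which is what makes the trend analysis described below possible.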
Learning from incidents
A post-mortem document that lives in a shared folder and is never read again is waste. The organizational value of incident review comes from distributing the learning — making it accessible to engineers who were not involved.
Internal publication
Publish every post-mortem to a shared internal wiki. Searchable by affected service, root cause category, or date.
Review in team meetings
Spend 10 minutes reviewing recent incidents in the team meeting. What happened across the org? What should everyone know?
Trend analysis
Quarterly review of incident patterns. What root causes recur? Where is investment in reliability most needed?
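If post-mortems record a root-cause category in a consistent field, the quarterly trend review can start from a simple tally. A minimal sketch — the category names and incident data here are invented for illustration:

```python
# Illustrative quarterly trend analysis: count post-mortems by root-cause
# category to see where reliability investment would pay off most.
from collections import Counter

incidents = [
    {"service": "payments", "root_cause": "environment-parity"},
    {"service": "search",   "root_cause": "missing-alert"},
    {"service": "payments", "root_cause": "environment-parity"},
    {"service": "auth",     "root_cause": "config-change"},
    {"service": "payments", "root_cause": "missing-alert"},
]

by_cause = Counter(i["root_cause"] for i in incidents)
for cause, count in by_cause.most_common():
    print(f"{cause}: {count}")
# Recurring categories (here: environment-parity, missing-alert) point to
# systemic fixes that one-off action items have not yet addressed.
```

Even this crude count answers the two questions the trend review asks: which root causes recur, and which services keep appearing.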
Further reading
DevOps Handbook — Part IV
The Second Way: Feedback. Chapters 24–26: creating learning from production telemetry, blameless post-mortems, and review cultures.
Google SRE Book — Chapter 15
Postmortem Culture: Learning from Failure. The Google blameless postmortem format with examples.
John Allspaw — Blameless PostMortems
codeascraft.com. The original Etsy blog post that popularized blameless post-mortems in the DevOps community.
Sidney Dekker — The Field Guide to Human Error
The cognitive science behind why 'human error' is not a root cause — and why system design is the real lever.