Monitoring and Alerting
Know before your users do. What to monitor, how to alert without crying wolf, and how to build an on-call culture that does not burn people out.
Monitoring vs observability
Monitoring means checking known failure conditions — watching dashboards and thresholds you defined in advance. Observability means understanding system behavior, including failures you did not anticipate. Monitoring tells you something is wrong. Observability helps you understand why.
Both are necessary. The goal of monitoring is to surface symptoms fast enough that the on-call engineer is notified before users are meaningfully impacted. The goal of observability is to make root cause analysis tractable once you are investigating an incident.
A well-monitored system has no surprises — alerts fire before users notice. An observable system has no mysteries — engineers can always answer "what is happening and why?"
What to monitor: the USE method
Brendan Gregg's USE method provides a systematic approach for identifying performance bottlenecks. For every resource in your system, measure three things: Utilization (how busy is it?), Saturation (is it overloaded?), and Errors (is it failing?).
| Resource | Utilization | Saturation | Errors |
| --- | --- | --- | --- |
| CPU | % time busy | run queue length | hardware errors |
| Memory | % in use | swap usage, OOM events | parity / ECC errors |
| Network | % bandwidth used | packet queue depth | dropped packets |
| Disk I/O | % time servicing I/O | queue length | read/write errors |
| DB connections | % pool in use | queue waiting for conn | connection timeouts |
The USE method is most useful for infrastructure resources. For user-facing services, pair it with the four golden signals (latency, traffic, errors, saturation) to get a complete picture.
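As a sketch of how a USE pass might be automated, the check below evaluates each resource's utilization, saturation, and errors against thresholds. The resource names, threshold values, and function name are illustrative assumptions, not part of the method itself:

```python
# Hypothetical USE-method check: for each resource, compare utilization,
# saturation, and errors against illustrative (assumed) thresholds.

THRESHOLDS = {
    # resource: (max utilization %, max saturation, max errors)
    "cpu": (90.0, 5, 0),              # saturation = run queue length
    "memory": (85.0, 0, 0),           # saturation = swap/OOM events
    "db_connections": (80.0, 10, 0),  # saturation = waiters for a conn
}

def use_check(resource, utilization, saturation, errors):
    """Return a list of USE findings for one resource (empty = healthy)."""
    max_u, max_s, max_e = THRESHOLDS[resource]
    findings = []
    if utilization > max_u:
        findings.append(f"{resource}: utilization {utilization:.0f}% > {max_u:.0f}%")
    if saturation > max_s:
        findings.append(f"{resource}: saturation {saturation} > {max_s}")
    if errors > max_e:
        findings.append(f"{resource}: {errors} errors")
    return findings

# A busy-but-not-saturated CPU flags only the utilization dimension.
print(use_check("cpu", utilization=95.0, saturation=2, errors=0))
```

The point of the triple is that each dimension fails independently: a resource can be 95% utilized with no queue, or error-free but deeply saturated.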
Alerting principles
Alert fatigue is the failure mode of monitoring. When alerts fire too often, on-call engineers begin to ignore them — and when a real incident occurs, it gets missed. Good alerts are rare, actionable, and unambiguous.
Alert on symptoms, not causes
✓ Do this: Error rate > 1% for 5 minutes
✗ Not this: DB query p99 > 500ms
Users experience symptoms. Causes are for investigation, not alerting.
Every alert requires action
✓ Do this: Alert fires → runbook exists → on-call acts
✗ Not this: Alert fires → team ignores it
An alert with no action is noise. Noise trains people to ignore alerts.
Set thresholds on percentiles
✓ Do this: p99 latency > 2s
✗ Not this: Average latency > 500ms
Averages hide tail latency. Your slowest 1% of users matter.
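A quick illustration of how an average hides the tail, using synthetic latencies (the numbers are made up for the demonstration):

```python
import math

# Synthetic sample: most requests are fast, a small slow tail takes 5s.
latencies_ms = [100] * 980 + [5000] * 20

def percentile(values, p):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

avg = sum(latencies_ms) / len(latencies_ms)
p99 = percentile(latencies_ms, 99)
print(f"average: {avg:.0f} ms")  # 198 ms -- looks healthy
print(f"p99: {p99} ms")          # 5000 ms -- the tail is badly degraded
```

A threshold of "average > 500ms" never fires on this sample, while "p99 > 2s" pages immediately.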
Use multi-window alerts
✓ Do this: Error rate elevated for 5+ of last 10 min
✗ Not this: Any single spike triggers page
Transient spikes cause alert fatigue. Sustained problems require response.
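One simple way to implement the multi-window rule is a sliding window of per-minute health samples; the alert fires only when enough recent minutes were unhealthy. The class name, threshold, and 5-of-10 parameters here mirror the example above but are otherwise illustrative:

```python
from collections import deque

class MultiWindowAlert:
    """Page only if the error rate is elevated in >= 5 of the last 10 minutes."""

    def __init__(self, threshold=0.01, window=10, required=5):
        self.threshold = threshold           # e.g. 1% error rate
        self.required = required             # unhealthy minutes needed to page
        self.samples = deque(maxlen=window)  # one bool per minute, oldest evicted

    def record_minute(self, error_rate):
        """Record one minute's error rate; return True if the alert should fire."""
        self.samples.append(error_rate > self.threshold)
        return sum(self.samples) >= self.required

alert = MultiWindowAlert()
# A single transient spike does not page...
print(alert.record_minute(0.30))  # False
# ...but a sustained elevation does.
for _ in range(4):
    alert.record_minute(0.05)
print(alert.record_minute(0.05))  # True
```

Because old samples fall off the deque, the alert also clears itself once the error rate recovers, without any extra reset logic.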
On-call culture
On-call is not a punishment. It is a responsibility distributed across a team. Good on-call culture has three properties: sustainability (engineers are not burned out), effectiveness (incidents are resolved quickly), and learning (every incident improves the system).
Escalation paths
Every alert has a clear owner and an escalation path. If the first responder cannot resolve within 30 minutes, they know exactly who to call.
Runbooks
Documented step-by-step procedures for common incident types. The on-call engineer should not be improvising at 2am. Runbooks encode institutional knowledge.
Handoffs
Rotation schedules, handoff summaries, and explicit transfer of context. On-call is not a marathon — it is a relay.
If on-call engineers are being paged more than 2–3 times per shift, the alert thresholds are wrong or the system has an unresolved reliability problem. Either way, it must be fixed — not tolerated.
Dashboards and visualization
A good dashboard answers specific questions at a glance. The most common mistake is building a dashboard that displays everything — which means it communicates nothing. Good dashboards have a clear audience and a clear purpose.
Service health dashboard
For: On-call engineer
Four golden signals for the service. Red/green. Single page. Actionable at a glance.
Business metrics dashboard
For: Product + leadership
Conversion rate, revenue, user counts. Business impact of technical decisions visible.
Capacity planning dashboard
For: Platform team
Resource trends over weeks/months. Saturation curves. When do we need to scale?
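For the service health dashboard, the four golden signals can be derived from a window of request records. This is a sketch under assumptions: the record fields (`status`, `latency_ms`), the function name, and the idea of dividing traffic by a provisioned capacity to approximate saturation are all illustrative, not from the source:

```python
import math

def golden_signals(requests, window_seconds, capacity_rps):
    """Compute the four golden signals from request records.

    Each request is a dict like {"status": 200, "latency_ms": 42}.
    capacity_rps is an assumed provisioned capacity, used for saturation.
    """
    n = len(requests)
    latencies = sorted(r["latency_ms"] for r in requests)
    p99 = latencies[math.ceil(0.99 * n) - 1] if n else 0
    traffic_rps = n / window_seconds
    error_rate = sum(r["status"] >= 500 for r in requests) / n if n else 0.0
    return {
        "latency_p99_ms": p99,              # latency
        "traffic_rps": traffic_rps,         # traffic
        "error_rate": error_rate,           # errors
        "saturation": traffic_rps / capacity_rps,  # saturation
    }

# 100 requests over 10 seconds: 98 fast successes, 2 slow server errors.
reqs = [{"status": 200, "latency_ms": 40}] * 98 + [{"status": 500, "latency_ms": 900}] * 2
print(golden_signals(reqs, window_seconds=10, capacity_rps=20))
```

Each value maps directly onto one red/green tile of the single-page view described above.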
Further reading
Google SRE Book — Chapter 6
Monitoring Distributed Systems. The canonical reference for alert design, dashboard principles, and on-call culture.
DevOps Handbook — Chapter 23
Create Proactive Telemetry to Enable Rapid Detection and Recovery. Linking monitoring to incident response.
The Art of Monitoring — Turnbull
Practical guide to building monitoring infrastructure. Metrics, logging, alerting, and visualization end to end.
Brendan Gregg — USE Method
brendangregg.com/usemethod.html. The full USE method reference including resource-specific checklists.