Monitoring and Alerting
Know before your users do. What to monitor, how to alert without crying wolf, and how to build an on-call culture that does not burn people out.
Monitoring vs observability
Monitoring means checking known failure conditions — watching dashboards and thresholds you defined in advance. Observability means understanding system behavior, including failures you did not anticipate. Monitoring tells you something is wrong. Observability helps you understand why.
Both are necessary. The goal of monitoring is to surface symptoms fast enough that the on-call engineer is notified before users are meaningfully impacted. The goal of observability is to make root cause analysis tractable once you are investigating an incident.
A well-monitored system has no surprises — alerts fire before users notice. An observable system has no mysteries — engineers can always answer "what is happening and why?"
What to monitor: the USE method
Brendan Gregg's USE method provides a systematic approach for identifying performance bottlenecks. For every resource in your system, measure three things: Utilization (how busy is it?), Saturation (is it overloaded?), and Errors (is it failing?).
| Resource | Utilization | Saturation | Errors |
| --- | --- | --- | --- |
| CPU | % time busy | run queue length | hardware errors |
| Memory | % in use | swap usage, OOM events | parity / ECC errors |
| Network | % bandwidth used | packet queue depth | dropped packets |
| Disk I/O | % time servicing I/O | queue length | read/write errors |
| DB connections | % pool in use | queue waiting for conn | connection timeouts |
The USE method is most useful for infrastructure resources. For user-facing services, pair it with the four golden signals (latency, traffic, errors, saturation) to get a complete picture.
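As a sketch of how a USE pass might be automated, the check below evaluates each resource's utilization, saturation, and errors against thresholds. The resource names, threshold values, and function name are illustrative assumptions, not part of the method itself:

```python
# Hypothetical USE-method check: for each resource, compare utilization,
# saturation, and errors against illustrative (assumed) thresholds.

THRESHOLDS = {
    # resource: (max utilization %, max saturation, max errors)
    "cpu": (90.0, 5, 0),              # saturation = run queue length
    "memory": (85.0, 0, 0),           # saturation = swap/OOM events
    "db_connections": (80.0, 10, 0),  # saturation = waiters for a conn
}

def use_check(resource, utilization, saturation, errors):
    """Return a list of USE findings for one resource (empty = healthy)."""
    max_u, max_s, max_e = THRESHOLDS[resource]
    findings = []
    if utilization > max_u:
        findings.append(f"{resource}: utilization {utilization:.0f}% > {max_u:.0f}%")
    if saturation > max_s:
        findings.append(f"{resource}: saturation {saturation} > {max_s}")
    if errors > max_e:
        findings.append(f"{resource}: {errors} errors")
    return findings

# A busy-but-not-saturated CPU flags only the utilization dimension.
print(use_check("cpu", utilization=95.0, saturation=2, errors=0))
```

The point of the triple is that each dimension fails independently: a resource can be 95% utilized with no queue, or error-free but deeply saturated.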
Alerting principles
Alert fatigue is the failure mode of monitoring. When alerts fire too often, on-call engineers begin to ignore them — and when a real incident occurs, it gets missed. Good alerts are rare, actionable, and unambiguous.
Alert on symptoms, not causes
✓ Do this: Error rate > 1% for 5 minutes
✗ Not this: DB query p99 > 500ms
Users experience symptoms. Causes are for investigation, not alerting.
Every alert requires action
✓ Do this: Alert fires → runbook exists → on-call acts
✗ Not this: Alert fires → team ignores it
An alert with no action is noise. Noise trains people to ignore alerts.
Set thresholds on percentiles
✓ Do this: p99 latency > 2s
✗ Not this: Average latency > 500ms
Averages hide tail latency. Your slowest 1% of users matter.
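A quick illustration of how an average hides the tail, using synthetic latencies (the numbers are made up for the demonstration):

```python
import math

# Synthetic sample: most requests are fast, a small slow tail takes 5s.
latencies_ms = [100] * 980 + [5000] * 20

def percentile(values, p):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

avg = sum(latencies_ms) / len(latencies_ms)
p99 = percentile(latencies_ms, 99)
print(f"average: {avg:.0f} ms")  # 198 ms -- looks healthy
print(f"p99: {p99} ms")          # 5000 ms -- the tail is badly degraded
```

A threshold of "average > 500ms" never fires on this sample, while "p99 > 2s" pages immediately.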
Use multi-window alerts
✓ Do this: Error rate elevated for 5+ of last 10 min
✗ Not this: Any single spike triggers page
Transient spikes cause alert fatigue. Sustained problems require response.
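One simple way to implement the multi-window rule is a sliding window of per-minute health samples; the alert fires only when enough recent minutes were unhealthy. The class name, threshold, and 5-of-10 parameters here mirror the example above but are otherwise illustrative:

```python
from collections import deque

class MultiWindowAlert:
    """Page only if the error rate is elevated in >= 5 of the last 10 minutes."""

    def __init__(self, threshold=0.01, window=10, required=5):
        self.threshold = threshold           # e.g. 1% error rate
        self.required = required             # unhealthy minutes needed to page
        self.samples = deque(maxlen=window)  # one bool per minute, oldest evicted

    def record_minute(self, error_rate):
        """Record one minute's error rate; return True if the alert should fire."""
        self.samples.append(error_rate > self.threshold)
        return sum(self.samples) >= self.required

alert = MultiWindowAlert()
# A single transient spike does not page...
print(alert.record_minute(0.30))  # False
# ...but a sustained elevation does.
for _ in range(4):
    alert.record_minute(0.05)
print(alert.record_minute(0.05))  # True
```

Because old samples fall off the deque, the alert also clears itself once the error rate recovers, without any extra reset logic.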
On-call culture
On-call is not a punishment. It is a responsibility distributed across a team. Good on-call culture has three properties: sustainability (engineers are not burned out), effectiveness (incidents are resolved quickly), and learning (every incident improves the system).
Escalation paths
Every alert has a clear owner and an escalation path. If the first responder cannot resolve within 30 minutes, they know exactly who to call.
Runbooks
Documented step-by-step procedures for common incident types. The on-call engineer should not be improvising at 2am. Runbooks encode institutional knowledge.
Handoffs
Rotation schedules, handoff summaries, and explicit transfer of context. On-call is not a marathon — it is a relay.
If on-call engineers are being paged more than 2–3 times per shift, the alert thresholds are wrong or the system has an unresolved reliability problem. Either way, it must be fixed — not tolerated.
Dashboards and visualization
A good dashboard answers specific questions at a glance. The most common mistake is building a dashboard that displays everything — which means it communicates nothing. Good dashboards have a clear audience and a clear purpose.
Service health dashboard
For: On-call engineer
Four golden signals for the service. Red/green. Single page. Actionable at a glance.
Business metrics dashboard
For: Product + leadership
Conversion rate, revenue, user counts. Business impact of technical decisions visible.
Capacity planning dashboard
For: Platform team
Resource trends over weeks/months. Saturation curves. When do we need to scale?
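For the service health dashboard, the four golden signals can be derived from a window of request records. This is a sketch under assumptions: the record fields (`status`, `latency_ms`), the function name, and the idea of dividing traffic by a provisioned capacity to approximate saturation are all illustrative, not from the source:

```python
import math

def golden_signals(requests, window_seconds, capacity_rps):
    """Compute the four golden signals from request records.

    Each request is a dict like {"status": 200, "latency_ms": 42}.
    capacity_rps is an assumed provisioned capacity, used for saturation.
    """
    n = len(requests)
    latencies = sorted(r["latency_ms"] for r in requests)
    p99 = latencies[math.ceil(0.99 * n) - 1] if n else 0
    traffic_rps = n / window_seconds
    error_rate = sum(r["status"] >= 500 for r in requests) / n if n else 0.0
    return {
        "latency_p99_ms": p99,              # latency
        "traffic_rps": traffic_rps,         # traffic
        "error_rate": error_rate,           # errors
        "saturation": traffic_rps / capacity_rps,  # saturation
    }

# 100 requests over 10 seconds: 98 fast successes, 2 slow server errors.
reqs = [{"status": 200, "latency_ms": 40}] * 98 + [{"status": 500, "latency_ms": 900}] * 2
print(golden_signals(reqs, window_seconds=10, capacity_rps=20))
```

Each value maps directly onto one red/green tile of the single-page view described above.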
Further reading
Google SRE Book — Chapter 6
Monitoring Distributed Systems. The canonical reference for alert design, dashboard principles, and on-call culture.
DevOps Handbook — Chapter 23
Create Proactive Telemetry to Enable Rapid Detection and Recovery. Linking monitoring to incident response.
The Art of Monitoring — Turnbull
Practical guide to building monitoring infrastructure. Metrics, logging, alerting, and visualization end to end.
Brendan Gregg — USE Method
brendangregg.com/usemethod.html. The full USE method reference including resource-specific checklists.