Telemetry and Observability
You cannot improve what you cannot see. The three pillars of telemetry, the four golden signals, and how to instrument your application to know what is happening in production.
What is telemetry?
Telemetry is the collection of data from a running system — automatically, in real time, at scale. It is how a production system communicates its internal state to the humans and tools responsible for it. Without telemetry, production is a black box.
The three pillars of observability — metrics, logs, and traces — provide complementary views of system behavior:
Metrics
Numeric measurements over time. CPU usage, request rate, error count. Aggregatable and efficient to store.
Request rate: 2,400 req/s
Error rate: 0.3%
p99 latency: 240ms
CPU utilization: 67%
Logs
Timestamped records of discrete events. Verbose and queryable. Best for debugging specific incidents.
2024-01-15 14:23:01 ERROR db connection timeout
Payment failed: card declined
User 4821 logged in
Traces
Records of a single request's journey across services. Reveals latency attribution and dependency failures.
API → Auth (12ms) → DB (180ms) → Cache (2ms)
Total: 194ms — bottleneck: DB query
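The trace example above already contains the arithmetic a tracing tool performs: sum the hops for total latency, take the largest for the bottleneck. A minimal sketch, with span records invented to mirror the example:

```javascript
// Invented span records mirroring the example trace above.
const spans = [
  { service: 'Auth', ms: 12 },
  { service: 'DB', ms: 180 },
  { service: 'Cache', ms: 2 },
];

// Total request latency is the sum of the hops...
const total = spans.reduce((sum, s) => sum + s.ms, 0);
// ...and the bottleneck is simply the slowest span.
const slowest = spans.reduce((a, b) => (b.ms > a.ms ? b : a));
console.log(`Total: ${total}ms — bottleneck: ${slowest.service}`);
// → Total: 194ms — bottleneck: DB
```

In real systems spans nest and overlap, so the total comes from the root span rather than a sum; this flattened view matches the sequential example above.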
Observability is not the same as having telemetry. Observability means you can ask arbitrary questions about your system's behavior and get answers — even for failure modes you did not anticipate. Telemetry is the prerequisite.
Observability vs monitoring
Monitoring is reactive: you define known failure modes in advance and set alerts to detect them. Observability is proactive: you instrument your system so thoroughly that you can explore its behavior to understand failures you did not anticipate.
Monitoring (reactive)
· Checks known failure conditions
· Alert fires when threshold crossed
· You must anticipate every failure mode
· "Is the system up?"
Observability (proactive)
· Explores unknown failure patterns
· Ask questions of rich telemetry data
· Handles failures you did not predict
· "Why is the system behaving this way?"
In practice, you need both. Monitoring catches the known problems fast. Observability lets you investigate the unknown ones. The goal is to make production legible — not just alarmed.
The four golden signals
Google's Site Reliability Engineering book identifies four signals that together characterize the health of any service. If you can only instrument four things, these are the four:
Latency
How long does it take to serve a request? Distinguish the latency of successful requests from that of failed ones: fast errors drag averages down, and slow errors hide real performance problems.
p50: 45ms / p99: 240ms / p999: 1.2s
Traffic
How much demand is the system receiving? Use traffic to normalize the other signals and to detect anomalies: an error count means little without the request count behind it.
2,400 requests/second; 18GB/hr data ingestion
Errors
What rate of requests fail? Explicit failures (500s) and implicit failures (wrong content, degraded responses). Both matter.
0.3% HTTP 5xx rate; 2.1% checkout timeout rate
Saturation
How full is the service? The most constrained resource — CPU, memory, disk, queue depth. Predict saturation before it causes failure.
DB connection pool: 87% utilized; disk: 72% full
These signals work together. High latency with normal traffic suggests a slow query or upstream dependency. High error rate with normal latency suggests a logic bug. High saturation predicts future problems before they become user-visible failures.
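Percentile figures like the p50/p99/p999 above are computed from raw latency samples. A nearest-rank sketch (the `percentile` helper and the sample values are ours, not a library API):

```javascript
// Nearest-rank percentile over raw latency samples (helper name is ours).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil(p * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// Invented latency samples, in milliseconds.
const latenciesMs = [12, 45, 47, 51, 60, 88, 240, 1200];
console.log(percentile(latenciesMs, 0.5));  // → 51   (p50)
console.log(percentile(latenciesMs, 0.99)); // → 1200 (p99)
```

Averages hide the tail entirely: the mean of these samples is about 218ms, yet half of requests finish in 51ms or less while the slowest takes 1.2s. This is why the golden signals use percentiles.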
Instrumentation
Instrumentation is the act of adding telemetry to your application code. There are two approaches:
Automatic instrumentation
Libraries and agents that inject telemetry into common frameworks automatically. Zero code changes for standard metrics: HTTP handlers, database calls, external requests. Start here.
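For a Node.js service, automatic instrumentation can be a one-line change to how the process starts. A sketch using OpenTelemetry's zero-code setup; the service name and `app.js` entry point are assumptions:

```shell
# Install the OpenTelemetry auto-instrumentation bundle for Node.js
npm install --save @opentelemetry/api @opentelemetry/auto-instrumentations-node

# Start the app with instrumentation preloaded; no application code changes.
# OTEL_SERVICE_NAME and app.js are assumptions for this sketch.
OTEL_SERVICE_NAME=checkout-api \
node --require @opentelemetry/auto-instrumentations-node/register app.js
```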
Custom instrumentation
Application-specific metrics and traces added by the developer. Essential for business metrics: checkout completion rate, payment success, user-facing error rates. Automatic instrumentation cannot see these.
// Custom instrumentation — Node.js example (OpenTelemetry metrics API)
const { metrics } = require('@opentelemetry/api');
const meter = metrics.getMeter('checkout-service');
const checkoutCounter = meter.createCounter('checkout_attempts');
const paymentDuration = meter.createHistogram('payment_duration_ms');
// In the checkout handler:
checkoutCounter.add(1, { status: 'initiated' });
paymentDuration.record(elapsed, { provider: 'stripe' }); // elapsed: ms measured around the payment call
Telemetry at Nexus Corp
After Mission 04, Nexus Corp has automated deployments — but no visibility into what happens after deploy. The next step is instrumentation: making the production system legible. What does Nexus Corp need to measure?
Latency
p50, p95, p99 API response time
Catch slow endpoints before users complain
Traffic
Requests per second, active users
Normalize other signals; detect traffic spikes
Errors
HTTP 5xx rate, payment failure rate
The most direct signal of user-facing problems
Saturation
DB connection pool usage, memory utilization
Predict capacity problems before they cause failures
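Once those measurements exist, checking them becomes mechanical. A sketch with invented numbers; the thresholds are hypothetical, not Nexus Corp policy:

```javascript
// Invented snapshot of the four golden signals for Nexus Corp.
const signals = {
  p99LatencyMs: 240,       // latency
  requestsPerSec: 2400,    // traffic: normalizes the others, not alerted on directly
  errorRate: 0.003,        // errors: 0.3% HTTP 5xx
  dbPoolUtilization: 0.87, // saturation: 87% of the connection pool in use
};

// Hypothetical thresholds; real ones come from SLOs and load testing.
const alerts = [];
if (signals.p99LatencyMs > 500) alerts.push('latency');
if (signals.errorRate > 0.01) alerts.push('errors');
if (signals.dbPoolUtilization > 0.80) alerts.push('saturation');

console.log(alerts); // → [ 'saturation' ]
```

The only alert here is saturation: nothing is user-visible yet, but the connection pool is nearly full. That is the predictive value of the fourth signal.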
Deploying without observability is like driving at night with the headlights off. The deployment pipeline gives you confidence that the code is correct at test time. Telemetry gives you confidence that it is correct at runtime — in production, under real load, with real users.
Further reading
Google SRE Book — Chapter 6
Monitoring Distributed Systems. The four golden signals. The definitive treatment from the team that invented them.
DevOps Handbook — Chapter 21
Enable and Practice Telemetry to Create Organizational Learning. Full coverage of instrumentation patterns.
Observability Engineering — Majors, Fong-Jones, Miranda
The comprehensive guide to observability. Structured events, high-cardinality data, and the shift from monitoring to exploration.
OpenTelemetry Documentation
The vendor-neutral standard for telemetry instrumentation. Auto-instrumentation, SDKs, and collector architecture.