Telemetry and Observability
You cannot improve what you cannot see. The three pillars of telemetry, the four golden signals, and how to instrument your application to know what is happening in production.
What is telemetry?
Telemetry is the collection of data from a running system — automatically, in real time, at scale. It is how a production system communicates its internal state to the humans and tools responsible for it. Without telemetry, production is a black box.
The three pillars of observability — metrics, logs, and traces — provide complementary views of system behavior:
Metrics
Numeric measurements over time. CPU usage, request rate, error count. Aggregatable and efficient to store.
Request rate: 2,400 req/s
Error rate: 0.3%
p99 latency: 240ms
CPU utilization: 67%
Logs
Timestamped records of discrete events. Verbose and queryable. Best for debugging specific incidents.
2024-01-15 14:23:01 ERROR db connection timeout
Payment failed: card declined
User 4821 logged in
Traces
Records of a single request's journey across services. Reveals latency attribution and dependency failures.
API → Auth (12ms) → DB (180ms) → Cache (2ms)
Total: 194ms — bottleneck: DB query
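The trace example above already contains the arithmetic a tracing tool performs: sum the hops for total latency, take the largest for the bottleneck. A minimal sketch, with span records invented to mirror the example:

```javascript
// Invented span records mirroring the example trace above.
const spans = [
  { service: 'Auth', ms: 12 },
  { service: 'DB', ms: 180 },
  { service: 'Cache', ms: 2 },
];

// Total request latency is the sum of the hops...
const total = spans.reduce((sum, s) => sum + s.ms, 0);
// ...and the bottleneck is simply the slowest span.
const slowest = spans.reduce((a, b) => (b.ms > a.ms ? b : a));
console.log(`Total: ${total}ms — bottleneck: ${slowest.service}`);
// → Total: 194ms — bottleneck: DB
```

In real systems spans nest and overlap, so the total comes from the root span rather than a sum; this flattened view matches the sequential example above.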
Observability is not the same as having telemetry. Observability means you can ask arbitrary questions about your system's behavior and get answers — even for failure modes you did not anticipate. Telemetry is the prerequisite.
Observability vs monitoring
Monitoring is reactive: you define known failure modes in advance and set alerts to detect them. Observability is proactive: you instrument your system so thoroughly that you can explore its behavior to understand failures you did not anticipate.
Monitoring (reactive)
· Checks known failure conditions
· Alert fires when threshold crossed
· You must anticipate every failure mode
· "Is the system up?"
Observability (proactive)
· Explores unknown failure patterns
· Ask questions of rich telemetry data
· Handles failures you did not predict
· "Why is the system behaving this way?"
In practice, you need both. Monitoring catches the known problems fast. Observability lets you investigate the unknown ones. The goal is to make production legible — not just alarmed.
The four golden signals
Google's Site Reliability Engineering book identifies four signals that together characterize the health of any service. If you can only instrument four things, these are the four:
Latency
How long does it take to serve a request? Distinguish the latency of successful requests from that of failed ones: fast errors drag averages down, and slow errors hide real performance problems.
p50: 45ms / p99: 240ms / p999: 1.2s
Traffic
How much demand is the system receiving? Use traffic to normalize the other signals and to detect anomalies: an error count means little without the request count behind it.
2,400 requests/second; 18GB/hr data ingestion
Errors
What rate of requests fail? Explicit failures (500s) and implicit failures (wrong content, degraded responses). Both matter.
0.3% HTTP 5xx rate; 2.1% checkout timeout rate
Saturation
How full is the service? The most constrained resource — CPU, memory, disk, queue depth. Predict saturation before it causes failure.
DB connection pool: 87% utilized; disk: 72% full
These signals work together. High latency with normal traffic suggests a slow query or upstream dependency. High error rate with normal latency suggests a logic bug. High saturation predicts future problems before they become user-visible failures.
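Percentile figures like the p50/p99/p999 above are computed from raw latency samples. A nearest-rank sketch (the `percentile` helper and the sample values are ours, not a library API):

```javascript
// Nearest-rank percentile over raw latency samples (helper name is ours).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil(p * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// Invented latency samples, in milliseconds.
const latenciesMs = [12, 45, 47, 51, 60, 88, 240, 1200];
console.log(percentile(latenciesMs, 0.5));  // → 51   (p50)
console.log(percentile(latenciesMs, 0.99)); // → 1200 (p99)
```

Averages hide the tail entirely: the mean of these samples is about 218ms, yet half of requests finish in 51ms or less while the slowest takes 1.2s. This is why the golden signals use percentiles.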
Instrumentation
Instrumentation is the act of adding telemetry to your application code. There are two approaches:
Automatic instrumentation
Libraries and agents that inject telemetry into common frameworks automatically. Zero code changes for standard metrics: HTTP handlers, database calls, external requests. Start here.
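For a Node.js service, automatic instrumentation can be a one-line change to how the process starts. A sketch using OpenTelemetry's zero-code setup; the service name and `app.js` entry point are assumptions:

```shell
# Install the OpenTelemetry auto-instrumentation bundle for Node.js
npm install --save @opentelemetry/api @opentelemetry/auto-instrumentations-node

# Start the app with instrumentation preloaded; no application code changes.
# OTEL_SERVICE_NAME and app.js are assumptions for this sketch.
OTEL_SERVICE_NAME=checkout-api \
node --require @opentelemetry/auto-instrumentations-node/register app.js
```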
Custom instrumentation
Application-specific metrics and traces added by the developer. Essential for business metrics: checkout completion rate, payment success, user-facing error rates. Automatic instrumentation cannot see these.
// Custom instrumentation — Node.js example (OpenTelemetry metrics API)
const { metrics } = require('@opentelemetry/api');
const meter = metrics.getMeter('checkout-service');
const checkoutCounter = meter.createCounter('checkout_attempts');
const paymentDuration = meter.createHistogram('payment_duration_ms');
// In the checkout handler:
checkoutCounter.add(1, { status: 'initiated' });
paymentDuration.record(elapsed, { provider: 'stripe' }); // elapsed: ms measured around the payment call
Telemetry at Nexus Corp
After Mission 04, Nexus Corp has automated deployments — but no visibility into what happens after deploy. The next step is instrumentation: making the production system legible. What does Nexus Corp need to measure?
Latency
p50, p95, p99 API response time
Catch slow endpoints before users complain
Traffic
Requests per second, active users
Normalize other signals; detect traffic spikes
Errors
HTTP 5xx rate, payment failure rate
The most direct signal of user-facing problems
Saturation
DB connection pool usage, memory utilization
Predict capacity problems before they cause failures
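Once those measurements exist, checking them becomes mechanical. A sketch with invented numbers; the thresholds are hypothetical, not Nexus Corp policy:

```javascript
// Invented snapshot of the four golden signals for Nexus Corp.
const signals = {
  p99LatencyMs: 240,       // latency
  requestsPerSec: 2400,    // traffic: normalizes the others, not alerted on directly
  errorRate: 0.003,        // errors: 0.3% HTTP 5xx
  dbPoolUtilization: 0.87, // saturation: 87% of the connection pool in use
};

// Hypothetical thresholds; real ones come from SLOs and load testing.
const alerts = [];
if (signals.p99LatencyMs > 500) alerts.push('latency');
if (signals.errorRate > 0.01) alerts.push('errors');
if (signals.dbPoolUtilization > 0.80) alerts.push('saturation');

console.log(alerts); // → [ 'saturation' ]
```

The only alert here is saturation: nothing is user-visible yet, but the connection pool is nearly full. That is the predictive value of the fourth signal.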
Deploying without observability is like driving at night with the headlights off. The deployment pipeline gives you confidence that the code is correct at test time. Telemetry gives you confidence that it is correct at runtime — in production, under real load, with real users.
Further reading
Google SRE Book — Chapter 6
Monitoring Distributed Systems. The four golden signals. The definitive treatment from the team that invented them.
DevOps Handbook — Chapter 21
Enable and Practice Telemetry to Create Organizational Learning. Full coverage of instrumentation patterns.
Observability Engineering — Majors, Fong-Jones, Miranda
The comprehensive guide to observability. Structured events, high-cardinality data, and the shift from monitoring to exploration.
OpenTelemetry Documentation
The vendor-neutral standard for telemetry instrumentation. Auto-instrumentation, SDKs, and collector architecture.