Observability

Debugging distributed software using structured JSON logging, Prometheus metric aggregations, and W3C trace context propagation across service boundaries.

Intermediate | Infrastructure | Chapter: Infrastructure | 15 min read

Monitoring vs Observability

In production systems, engineers make a clear distinction between monitoring a system and making it observable:

  • Monitoring: Focuses on reporting symptoms. It answers the question, "Is the system broken?" It tracks key performance indicators (such as CPU utilization, memory consumption, or HTTP error counts) and triggers alerts when values cross predefined thresholds. Monitoring is ideal for catching known failure modes.
  • Observability: Focuses on understanding system states. It answers the question, "Why is the system broken?" By collecting rich telemetry data, observability allows engineers to debug novel system behaviors, trace rare concurrency bugs, and inspect unknown failure modes without deploying new code.

A system is observable if you can infer its internal state solely by examining its external outputs (logs, metrics, and traces).


Structured Logging

Traditional logging emits lines of unstructured text, which are difficult to parse programmatically. With millions of log lines spread across multiple microservices, finding the entries that belong to a single failure means relying on fragile text searches and hand-written regular expressions.

Structured logging solves this by formatting log entries as structured data, typically one JSON object per line. Every entry contains key-value pairs that carry its context:

json
{
  "timestamp": "2026-06-11T18:04:12Z",
  "level": "error",
  "message": "database query timeout",
  "service": "billing-service",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "query_duration_ms": 5000,
  "user_id": 48291
}

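By including attributes like trace_id and user_id, log aggregators (such as Elasticsearch or Grafana Loki) can filter millions of events by field value rather than by free-text search. A single query on trace_id, for example, returns every log line a request produced, which lets you correlate billing errors with database performance and user accounts.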
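As a rough illustration of how an application might emit entries like the one above, here is a minimal sketch using only Python's standard logging module. The JsonFormatter class, the build_logger helper, and the list of context keys are assumptions made for this example, not part of any particular logging library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Hypothetical set of context keys copied from the `extra=` argument.
    CONTEXT_KEYS = ("service", "trace_id", "span_id", "query_duration_ms", "user_id")

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Only include context fields the caller actually supplied.
        for key in self.CONTEXT_KEYS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


logger = build_logger("billing")
logger.error(
    "database query timeout",
    extra={"service": "billing-service", "query_duration_ms": 5000, "user_id": 48291},
)
```

Passing context through extra keeps the message text constant, so the aggregator can group identical failures while still filtering on the individual fields.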


Metrics Aggregation

Metrics are numerical values aggregated over time. They are highly structured, efficient to store, and ideal for creating real-time dashboards and alerts. Prometheus-style telemetry libraries define four core metric types:

  • Counters: Monotonically increasing values that reset to zero only when the process restarts. Use counters for cumulative totals (for instance, the total number of HTTP requests, database transactions, or cache misses); the query layer derives event rates from the change between samples.
  • Gauges: Numerical values that can go up and down. Use gauges to represent instantaneous states (such as current memory utilization, active thread pool size, or queue depth).
  • Histograms: Sets of buckets that count how many observations fall at or below each configured upper bound. Use histograms to estimate percentiles (like p95 or p99 latencies), which show the distribution of query times rather than simple averages. The estimate is only as precise as the bucket boundaries allow.
  • Summaries: Similar to histograms, but they compute configurable quantiles directly on the client side over a sliding time window.
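The first three types can be sketched in a few lines of plain Python. This is an illustrative model only: the class names and the quantile method are assumptions for this example and do not reproduce any client library's API. The percentile estimate interpolates linearly inside a bucket and assumes observations are non-negative.

```python
import bisect
import math


class Counter:
    """Cumulative total that can only go up."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Instantaneous value that can go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations per bucket, keyed by inclusive upper bound."""

    def __init__(self, upper_bounds):
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)  # per-bucket, not cumulative
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # First bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value
        self.count += 1

    def quantile(self, q):
        """Estimate the q-quantile by linear interpolation within a bucket."""
        if self.count == 0:
            return math.nan
        rank = q * self.count
        cumulative, lower = 0, 0.0
        for upper, n in zip(self.bounds, self.counts):
            if n and cumulative + n >= rank:
                if math.isinf(upper):
                    return lower  # open-ended bucket: cannot interpolate
                return lower + (upper - lower) * (rank - cumulative) / n
            cumulative += n
            lower = upper
        return lower


latency = Histogram([0.1, 0.5, 1.0])  # bucket bounds in seconds (hypothetical)
for seconds in [0.05] * 5 + [0.3] * 4 + [0.8]:
    latency.observe(seconds)
print("estimated p95 latency:", latency.quantile(0.95))
```

Because only the bucket counts are stored, histograms from many instances can be summed before the percentile is estimated, which is not possible with quantiles precomputed on each client.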

Distributed Tracing and Context Propagation

In a microservices architecture, a single user request might traverse dozens of distinct services, network load balancers, and databases. If a request fails or is slow, logging alone cannot show where the bottleneck lies.

Distributed tracing reconstructs the request path. It utilizes two main abstractions:

  • Span: The basic unit of work. A span represents a single operation with a start time, duration, and status (e.g. an HTTP request, a database query, or an encryption operation).
  • Trace: A collection of spans that form a directed acyclic graph, representing the end-to-end lifecycle of a request.
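To make the two abstractions concrete, the sketch below models a span as a small record and rebuilds the parent-child structure of one trace from a flat list of spans. The field names, IDs, and timings are hypothetical values chosen for illustration; real tracing systems record more attributes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    trace_id: str             # shared by every span in the same trace
    span_id: str              # unique within the trace
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    duration_ms: int
    status: str = "ok"


def build_tree(spans: List[Span]) -> Tuple[Optional[Span], Dict[str, List[Span]]]:
    """Return the root span and a map from span_id to its child spans."""
    children: Dict[str, List[Span]] = {}
    root = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children.setdefault(span.parent_id, []).append(span)
    return root, children


def render(span: Span, children: Dict[str, List[Span]], depth: int = 0) -> List[str]:
    """Render the trace as indented lines, children ordered by start time."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms} ms)"]
    for child in sorted(children.get(span.span_id, []), key=lambda s: s.start_ms):
        lines.extend(render(child, children, depth + 1))
    return lines


# Spans arrive from different services in arbitrary order.
SPANS = [
    Span("t1", "d4", "c3", "SQL Query", 30, 50),
    Span("t1", "b2", "a1", "Service A", 5, 100),
    Span("t1", "a1", None, "API Gateway", 0, 120),
    Span("t1", "c3", "b2", "Service B", 20, 70),
]

root, children = build_tree(SPANS)
print("\n".join(render(root, children)))
```

Reading the indented output from top to bottom shows which operation contains which, so the slowest leaf span points at the likely bottleneck.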

To link spans across network boundaries, services use context propagation. When service A calls service B, it injects trace metadata into the outgoing HTTP headers using the W3C traceparent header, which carries a version, the trace ID, the caller's span ID (the parent ID), and trace flags. Service B extracts this header, links its new child span to the parent span, and passes the updated context downstream.
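The inject and extract steps can be sketched as follows. The header layout (version 00, a 32-character hex trace ID, a 16-character hex parent ID, and 2-character hex flags) follows the W3C Trace Context format; the function names and the plain-dict header representation are assumptions for this example, and the sketch expects lowercase header keys.

```python
import re
import secrets

# version "00" - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_trace_id() -> str:
    return secrets.token_hex(16)  # 16 random bytes -> 32 hex characters


def new_span_id() -> str:
    return secrets.token_hex(8)  # 8 random bytes -> 16 hex characters


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Caller side: write the current span's context into outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Callee side: parse the incoming context, or return None if invalid."""
    match = TRACEPARENT_RE.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


# Service A: start a trace and call service B.
outgoing = {}
inject(outgoing, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")

# Service B: continue the same trace with a new child span.
ctx = extract(outgoing)
child_span_id = new_span_id()
downstream = {}
inject(downstream, ctx["trace_id"], child_span_id, ctx["sampled"])
print(outgoing["traceparent"])
print(downstream["traceparent"])
```

The trace ID stays constant along the whole call chain, while the parent ID field changes at every hop to the span ID of the immediate caller.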

[Figure: Distributed Trace Span Hierarchy. Service call graph: API Gateway, then Service A (HTTP), then Service B (HTTP), then Database (SQL). Trace timeline: Span 1: API Gateway (parent); Span 2: Service A (child); Span 3: Service B; Span 4: SQL Query.]

Push vs Pull Collection Models

Telemetry frameworks use two models for collecting data:

  • Pull Model (e.g. Prometheus): The monitoring server periodically scrapes metrics from endpoints exposed by target application instances (typically over a path like /metrics). This model simplifies application logic and protects instances from being overloaded by telemetry traffic, but it requires service discovery to locate target containers in elastic environments.
  • Push Model (e.g. OpenTelemetry): Application instances actively push metrics, logs, and traces to a centralized collector. This model is ideal for short-lived serverless tasks (which exit before a scraper can run) and simplifies routing through firewalls.
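The difference between the two models shows up in who initiates the connection. The sketch below contrasts them using only the standard library: a handler that answers scrapes of /metrics in the Prometheus text format, and a helper that prepares a POST to a collector. The metric names, the METRICS dictionary, and the JSON push payload are hypothetical stand-ins; in particular, the payload is not the OTLP wire format.

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler

# Hypothetical in-process metric values for this sketch.
METRICS = {"http_requests_total": 1027, "queue_depth": 4}


def render_exposition(metrics: dict) -> str:
    """Render metrics as `name value` lines, one sample per line."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


class MetricsHandler(BaseHTTPRequestHandler):
    """Pull model: the application waits for the server to scrape /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_exposition(METRICS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request access logging


def build_push_request(collector_url: str, metrics: dict) -> urllib.request.Request:
    """Push model: the application initiates a POST to the collector."""
    return urllib.request.Request(
        collector_url,
        data=json.dumps(metrics).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def push(collector_url: str, metrics: dict) -> int:
    """Send the metrics and return the collector's HTTP status code."""
    with urllib.request.urlopen(build_push_request(collector_url, metrics), timeout=5) as resp:
        return resp.status
```

To try the pull side, serve the handler with http.server.HTTPServer(("127.0.0.1", 8000), MetricsHandler).serve_forever() and request /metrics; the port is an arbitrary choice. A short-lived job would instead call push once just before exiting.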

Modern standards (like OpenTelemetry) decouple collection from storage, allowing you to ingest data using push protocols and forward it to backend storage engines using whichever mechanism they require.


Further Reading

Core Literature References

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

by Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. Section 2: Dapper's Distributed Tracing Model, pp. 2-7