Change Data Capture

How systems listen to a database's internal changelog to react to every insert, update, and delete without polling the database directly.

IntermediateDatabasesChapter: Database Systems10 min read

The Polling Problem

Suppose you have a users table in your primary Postgres database, and you need to keep an Elasticsearch search index in sync. The naive approach is to poll: every 10 seconds, query for rows updated in the last 10 seconds and re-index them.

This works until it doesn't. You miss updates that happen between polls. Deletes are invisible. You add load to your database continuously. And at high volume, a 10-second lag is often too much.

Change Data Capture (CDC) solves this by reading the database's own internal record of every write, the same log the database already uses internally for crash recovery.


How Databases Record Writes

Before any write commits, most production databases append it to a Write-Ahead Log (WAL). The WAL is append-only and sequential, which is why it is fast. If the database crashes mid-write, it replays the WAL on restart to recover. Replication between a primary and its replicas also works by streaming this log.

CDC tools tap into this same stream. Instead of querying the table, they read the log and emit an event for every change: an insert with the new row's data, an update with before and after values, or a delete with the deleted row's identifier.

CDC Pipeline Postgres WAL (write-ahead log) Debezium reads WAL, emits events Kafka change event stream Elasticsearch Data warehouse Cache eviction No polling. Every insert, update, and delete flows through as an event. Consumers are decoupled — adding a new one doesn't change the database or Debezium config.

What a CDC Event Looks Like

Each event carries the operation type and the full row data before and after the change. A Debezium event for an updated user might look like:

json
{
  "op": "u",
  "before": { "id": 42, "email": "alice@old.com" },
  "after":  { "id": 42, "email": "alice@new.com" },
  "source": { "table": "users", "lsn": 190234 }
}

Consumers subscribe to the topic for their table and react however they need: reindex the document, invalidate a cache key, update a replica in another region.


Common Use Cases

Search index sync: Keep Elasticsearch or Algolia in sync with your database without dual-writes or polling. Every row change flows into the search index automatically.

Data warehouse replication: Stream production database changes into a data warehouse (Snowflake, BigQuery) for analytics. No nightly batch jobs.

Cache invalidation: When a row updates, immediately evict the corresponding cache key. More reliable than TTL-based expiry.

Audit logs: Record every change to sensitive data (user records, payment data) in a tamper-evident log, without modifying the application code.

Cross-service sync: In a microservices architecture, one service's database changes can trigger updates in another service's own database, without the two services coupling directly to each other.


The Transactional Outbox Pattern

A related pattern worth knowing: sometimes you want to write to your database and publish a message to a queue atomically. If the service crashes between the two, you end up in an inconsistent state.

The Transactional Outbox pattern solves this by writing the message into an outbox table in the same database transaction as your main write. A CDC consumer then reads the outbox table and publishes to the message queue. Since everything lives in the same database transaction, it is either all committed or all rolled back.


Tools

Debezium is the most widely used open-source CDC connector, built on top of Kafka Connect. AWS DMS (Database Migration Service) supports CDC for migrations. PlanetScale's Vitess has built-in CDC. For Postgres specifically, the pg_logical extension and pgoutput plugin power most CDC implementations.


Further Reading