Asynchronous Background Workers

How background worker pools and durable job queues offload long-running tasks from the synchronous request-response path.

Intermediate | Reliability & Scale | Chapter: Reliability & Scalability | 12 min read

The Problem: Blocking the HTTP Request Lifecycle

When a client initiates an HTTP request, they expect a prompt response. However, many backend systems must perform operations that are slow or resource-heavy. Examples include:

  • Generating complex PDF invoices or reports.
  • Compressing images or transcoding video files.
  • Sending transactional emails or push notifications.
  • Calling slow third-party APIs.

If you run these operations synchronously inside the HTTP request lifecycle, the client must wait for them to finish. Each one ties up a server request thread, increases response latency, and exposes the client to connection timeouts. During a traffic spike, the server can exhaust its available threads, at which point new requests queue up or are rejected.


The Solution: The Job Queue Pattern

The job queue pattern decouples the synchronous request-response path from asynchronous task execution.

Instead of processing an expensive task immediately, the web server serializes the task details (for example, as a JSON payload) and publishes them as a job to a durable message broker (such as RabbitMQ, Amazon SQS, or Redis with persistence enabled). The server then returns HTTP 202 Accepted right away, which tells the client that the task has been queued for later processing, not that it has completed.

Downstream, separate processes called background workers continuously poll or subscribe to the message broker, fetch tasks, and execute them asynchronously.

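[Figure: The client sends an HTTP POST to the web server, which returns 202 Accepted and enqueues the job in the job queue. Worker A and Worker B consume jobs from the queue. Failed jobs are retried with exponential backoff and jitter, and repeated failures go to the DLQ.]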
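The flow above can be sketched in a few lines. This is a minimal, single-process illustration, not a production design: an in-memory queue.Queue stands in for the durable broker, and the names handle_post_report, worker_loop, and the "render_report" job type are hypothetical.

```python
import json
import queue
import threading
import uuid

# In-memory stand-in for a durable broker (assumption: a real system
# would publish to Redis, RabbitMQ, or Amazon SQS instead).
job_queue = queue.Queue()
results = {}  # job_id -> result, filled in by workers


def handle_post_report(payload):
    """Hypothetical request handler: enqueue the job and return at once."""
    job_id = str(uuid.uuid4())
    # Serialize the task details so they could cross a process boundary.
    job_queue.put(json.dumps({"id": job_id, "type": "render_report", "args": payload}))
    return 202, {"job_id": job_id}  # 202 Accepted: queued, not yet processed


def worker_loop():
    """Background worker: fetch jobs and run them outside the request path."""
    while True:
        raw = job_queue.get()
        if raw is None:  # sentinel used in this sketch to stop the worker
            job_queue.task_done()
            break
        job = json.loads(raw)
        # Placeholder for the slow work (PDF rendering, transcoding, ...).
        results[job["id"]] = "rendered report for invoice %s" % job["args"]["invoice"]
        job_queue.task_done()


worker = threading.Thread(target=worker_loop, daemon=True)
worker.start()

status, body = handle_post_report({"invoice": 1042})
job_queue.join()     # wait for the worker only so the demo can read the result
job_queue.put(None)  # stop the worker
worker.join()
```

The handler returns before any work is done. The returned job_id is what a client could later use to ask for the result, although the status endpoint for that is outside this sketch.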

Concurrency Safety and Worker Design

To achieve high throughput, background workers generally run inside concurrent threads, goroutines, or separate operating system processes. However, managing concurrency requires careful resource constraints:

  • Worker Prefetch Limits: If a worker node fetches too many jobs from the broker into local memory at once, it can run out of memory or starve other idle workers of tasks. Setting a prefetch count (e.g. via RabbitMQ basic.qos) ensures workers only retrieve jobs they have the active capacity to process.
  • Database Connection Pool Allocation: A common mistake is configuring 50 worker threads but setting the database connection pool limit to 10. When all workers need a connection at the same time, up to 40 of them block waiting for one, so the effective parallelism is 10, not 50.
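The pool-sizing mismatch can be observed with a small simulation. The sketch below assumes a semaphore is a fair stand-in for a connection pool and uses the 50-thread, 10-connection figures from the text; run_job and the 10 ms sleep are hypothetical placeholders for a real query.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

WORKER_THREADS = 50  # figures taken from the example above
DB_POOL_SIZE = 10

pool_slots = threading.BoundedSemaphore(DB_POOL_SIZE)  # stand-in for a connection pool
state_lock = threading.Lock()
in_use = 0
peak_in_use = 0


def run_job(job_id):
    """Hypothetical job that needs one database connection while it runs."""
    global in_use, peak_in_use
    with pool_slots:  # blocks while all connections are checked out
        with state_lock:
            in_use += 1
            peak_in_use = max(peak_in_use, in_use)
        time.sleep(0.01)  # placeholder for the query
        with state_lock:
            in_use -= 1
    return job_id


with ThreadPoolExecutor(max_workers=WORKER_THREADS) as executor:
    completed = list(executor.map(run_job, range(200)))

print("peak connections in use:", peak_in_use, "with", WORKER_THREADS, "threads")
```

The peak never exceeds the pool size, however many threads are configured. In practice this suggests sizing the pool and the worker count together, or keeping each connection checked out only for the duration of the query.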

Delivery Guarantees: At-Least-Once vs At-Most-Once

When a worker pulls a task from the queue, what happens if the worker crashes mid-execution?

  • At-Most-Once: The broker deletes the job immediately upon sending it to the worker. If the worker crashes, the job is lost forever. This is suitable only for non-critical, ephemeral tasks like log forwarding.
  • At-Least-Once: The worker must explicitly send an acknowledgment (ACK) back to the broker after successfully processing the task. If the worker crashes or fails to respond within a visibility timeout, the broker re-enqueues the job to be picked up by another worker.

Because At-Least-Once delivery can result in duplicate executions (for example, if a worker processes a task but crashes right before sending the ACK), tasks must be idempotent. This means running the same task multiple times must result in the same state as running it once.
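One common way to make a handler idempotent is to attach a unique key to each job and skip keys that were already processed. The sketch below is illustrative: the idempotency_key field, the apply_credit handler, and the in-memory set are assumptions. A real system would keep the keys in a durable store and record the key in the same transaction as the state change.

```python
processed_keys = set()  # assumption: a real system keeps this in a durable store
balances = {"acct-1": 100}


def apply_credit(job):
    """Idempotent handler: a redelivered job leaves the state unchanged."""
    key = job["idempotency_key"]
    if key in processed_keys:
        return "skipped duplicate"
    balances[job["account"]] += job["amount"]
    processed_keys.add(key)
    return "applied"


job = {"idempotency_key": "credit-7f3a", "account": "acct-1", "amount": 25}
first = apply_credit(job)   # normal delivery
second = apply_credit(job)  # redelivery after a lost ACK
```

After both calls the balance is 125, the same as after a single delivery.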


Fault Tolerance: Backoffs, Jitter, and Dead Letter Queues

If a task fails due to a transient issue, such as a database query timeout or a third-party API outage, it should not be discarded. Instead, it must be retried safely:

  • Exponential Backoff: Successive retry attempts are spaced out by doubling the wait interval (e.g. 1s, 2s, 4s, 8s). This prevents the workers from overwhelming struggling downstream dependencies.
  • Jitter: Adding random noise (jitter) to the backoff delay prevents all failing tasks from retrying at the exact same millisecond. This prevents synchronized retry storms.
  • Dead Letter Queues (DLQ): If a job keeps failing and exceeds its maximum retry threshold (e.g. 5 attempts), the cause is probably permanent, such as a corrupted payload or a code bug, and further retries will not help. The system routes these jobs to a separate queue called a Dead Letter Queue. This isolates bad payloads and lets developers inspect them manually without blocking the main queues.
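The three mechanisms above combine naturally in one retry loop. The sketch below uses the "full jitter" variant, which draws each delay uniformly between zero and the exponential ceiling instead of adding noise to a fixed delay. The 30-second cap, the function names, and the in-memory dead-letter list are assumptions made for illustration.

```python
import random
import time

MAX_ATTEMPTS = 5   # retry threshold from the example above
BASE_DELAY = 1.0   # seconds; yields ceilings of 1s, 2s, 4s, 8s
MAX_DELAY = 30.0   # assumed cap so delays do not grow without bound

dead_letter_queue = []  # in-memory stand-in for a real DLQ


def backoff_delay(attempt, rng=random):
    """Delay before the retry that follows failed attempt `attempt` (1-based)."""
    ceiling = min(MAX_DELAY, BASE_DELAY * 2 ** (attempt - 1))
    return rng.uniform(0, ceiling)  # full jitter


def process_with_retries(job, handler, sleep=time.sleep):
    """Run handler(job); retry on failure, then dead-letter the job."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return handler(job)
        except Exception as exc:
            if attempt == MAX_ATTEMPTS:
                dead_letter_queue.append({"job": job, "error": repr(exc)})
                return None
            sleep(backoff_delay(attempt))


def always_fails(job):
    """Hypothetical handler that simulates a permanent failure."""
    raise ValueError("corrupted payload")


delays = []  # capture the delays instead of actually sleeping
process_with_retries({"id": 1}, always_fails, sleep=delays.append)
```

This sketch retries on every exception for brevity. A real worker would usually retry only errors it considers transient and send known-permanent ones to the DLQ immediately.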

Distributed Locks and Resource Coordination

If multiple worker processes are pulling tasks from a queue, you may need to ensure that certain tasks do not run concurrently. For example, you should not run two parallel billing jobs for the same user.

To prevent this, workers use distributed locks (such as Redis-based Redlock) to serialize execution:

  1. A worker retrieves a job for User X.
  2. The worker attempts to acquire a lock for the key lock:user_id:X, with an expiry so that a crashed worker cannot hold the lock forever.
  3. If it fails to acquire the lock, the worker releases the job back to the queue to try again later.
  4. If it succeeds, it processes the job, releases the lock, and acknowledges the job.
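The steps above can be sketched with an in-memory lock store that imitates the semantics of the Redis command SET key token NX PX ttl on a single node. Everything here is a stand-in: InMemoryLockStore and handle_billing_job are hypothetical names, the 30-second expiry is an assumed value, and the store is neither thread-safe nor distributed, so it only illustrates the control flow.

```python
import time
import uuid


class InMemoryLockStore:
    """Stand-in for Redis; acquire mimics `SET key token NX PX ttl`."""

    def __init__(self):
        self._locks = {}  # key -> (token, expiry timestamp)

    def acquire(self, key, ttl_seconds, now=None):
        now = time.monotonic() if now is None else now
        held = self._locks.get(key)
        if held is not None and held[1] > now:
            return None  # another worker holds an unexpired lock
        token = str(uuid.uuid4())
        self._locks[key] = (token, now + ttl_seconds)
        return token

    def release(self, key, token):
        # Release only if we still own the lock; it may have expired and
        # been taken by another worker in the meantime.
        held = self._locks.get(key)
        if held is not None and held[0] == token:
            del self._locks[key]
            return True
        return False


def handle_billing_job(store, job, requeue, process):
    """Steps 2 to 4 above for a job that was already fetched."""
    key = "lock:user_id:%s" % job["user_id"]
    token = store.acquire(key, ttl_seconds=30)
    if token is None:
        requeue(job)  # step 3: try again later
        return "requeued"
    try:
        process(job)  # step 4: do the work while holding the lock
    finally:
        store.release(key, token)
    return "processed"  # the caller would ACK the job at this point


store = InMemoryLockStore()
requeued, billed = [], []
held_by_other = store.acquire("lock:user_id:X", ttl_seconds=30)
outcome_blocked = handle_billing_job(store, {"user_id": "X"}, requeued.append, billed.append)
store.release("lock:user_id:X", held_by_other)
outcome_free = handle_billing_job(store, {"user_id": "X"}, requeued.append, billed.append)
```

The random token matters: it prevents a worker whose lock has already expired from deleting a lock that now belongs to someone else. Because a lock can expire while the job is still running, the job itself should still be idempotent.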

Monitoring and Observability Metrics

Maintaining a healthy background system requires tracking key performance metrics:

  • Queue Depth: The number of pending tasks in the queue. A steadily growing queue indicates that your workers cannot keep up with the incoming volume, signaling a need to scale out the worker pool.
  • Processing Latency: The duration between when a task is enqueued and when it finishes executing. High latency hurts the user experience when users are waiting on a background result (like an emailed verification code).
  • Failure Rate: The ratio of failed tasks to total tasks. A sudden spike often points to network connectivity issues, database lock contention, or a bad code deployment.
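As a rough illustration, the three metrics can be computed from per-job timestamps. The JobRecord type and its fields are hypothetical, and two choices here are assumptions: latency is averaged over successful jobs only, and the failure rate is taken over finished jobs, not all jobs ever enqueued.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobRecord:
    enqueued_at: float                   # seconds, from any consistent clock
    finished_at: Optional[float] = None  # None while the job is still pending
    failed: bool = False


def queue_depth(records):
    """Number of jobs that have not finished yet."""
    return sum(1 for r in records if r.finished_at is None)


def mean_processing_latency(records):
    """Average enqueue-to-finish time of successful jobs, or None if there are none."""
    done = [r for r in records if r.finished_at is not None and not r.failed]
    if not done:
        return None
    return sum(r.finished_at - r.enqueued_at for r in done) / len(done)


def failure_rate(records):
    """Failed jobs divided by finished jobs (0.0 when nothing has finished)."""
    finished = [r for r in records if r.finished_at is not None]
    if not finished:
        return 0.0
    return sum(1 for r in finished if r.failed) / len(finished)


records = [
    JobRecord(enqueued_at=0.0, finished_at=2.0),
    JobRecord(enqueued_at=1.0, finished_at=5.0),
    JobRecord(enqueued_at=2.0, finished_at=3.0, failed=True),
    JobRecord(enqueued_at=4.0),
]
```

For these sample records the queue depth is 1, the mean latency is 3.0 seconds, and the failure rate is one in three. A production system would usually emit these values to a metrics backend and track percentiles as well as the mean.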

Further Reading

Core Literature References

Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions

by Gregor Hohpe & Bobby Woolf — Chapter 5: Messaging Systems, Chapter 6: Consumer Patterns, pp. 220-312