Unix I/O Multiplexing

Discover how epoll and kqueue enable high-concurrency servers to handle millions of connections using event-driven notification loops.

AdvancedFoundationsChapter: Foundations15 min read

The Evolution of I/O Handling

To appreciate the necessity of modern input/output multiplexing, consider how operating systems historically managed network connections. When a program reads from a network socket, it makes a request to the kernel. If no data has arrived from the network, the calling thread can either wait or immediately return an error. This split defines the two primary categories of input/output:

  • Blocking I/O: The thread that calls read() or write() is put to sleep by the kernel. The thread yields its CPU execution time and remains suspended until data is copied into the kernel network buffer. This model is simple to program but scales poorly, demanding one dedicated thread per active socket connection.
  • Non-blocking I/O: The calling thread configures the socket file descriptor to return immediately when an I/O operation cannot be completed. The socket returns a specific error code, such as EAGAIN or EWOULDBLOCK, instead of putting the thread to sleep.

While non-blocking sockets prevent threads from freezing, managing hundreds of them introduces a polling problem. If a user application continuously loops over all open sockets in user space, asking each socket if data has arrived, it wastes CPU cycles on redundant system calls.

This challenge led to the creation of I/O multiplexing: a model where the kernel provides a single system call that blocks a single thread, monitoring many file descriptors at once and waking the thread only when at least one descriptor is ready for operations.


The Bottlenecks of select and poll

The earliest Unix attempts to multiplex file descriptors resulted in the select() and poll() system calls. Both mechanisms require the user process to supply a list of file descriptors it wishes to monitor.

  • select(): Uses a fixed-size bit array (called fd_set) representing file descriptor numbers. The maximum size of this array is hardcoded in the kernel, historically limited to FD_SETSIZE (usually 1,024).
  • poll(): Avoids the hardcoded array limit by using a dynamically sized array of structures (struct pollfd).

Despite their historical utility, both system calls suffer from severe architectural limitations that restrict their scalability in modern, high-throughput systems:

  1. O(N) User-Kernel Copying: Every time select() or poll() is invoked, the user program must construct and copy the entire array of monitored descriptors from user space to kernel space. The kernel processes the interest list, updates status fields, and copies the entire array back to user space.
  2. O(N) Kernel Scanning: Inside the kernel, the CPU must iterate through every single registered file descriptor to verify its readiness state.
  3. O(N) User-Space Filtering: Once the system call returns, it does not specify which file descriptors are active. It only returns the count of ready descriptors. The user application must run its own loop over the entire descriptor set to search for active channels.

As the number of concurrent connections increases, the CPU spends the majority of its execution time iterating through inactive connections, degrading performance and increasing latency.


Event-Driven Notifications with epoll and kqueue

To overcome the performance limits of select() and poll(), modern operating systems introduced event-driven subsystems: epoll in Linux and kqueue in BSD/macOS. These systems decouple the registration of interest from the actual waiting phase, achieving O(1) time complexity relative to the size of the monitored set.

Instead of copying the interest list on every invocation, these interfaces establish a persistent state container inside the kernel. The application registers its interest once, and updates the container only when connections open or close.

Unix I/O Multiplexing: The epoll Architecture USER SPACE KERNEL SPACE User Application Event Loop 1. epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) · 2. ready_count = epoll_wait(epfd, events, ...) Registers FDs to monitor, then blocks efficiently until events arrive Interest List (Red-Black Tree) O(log N) Search, Insert, Delete FD 4 FD 3 FD 6 FD 7 Ready List (Doubly Linked List) O(1) Retrievable Events FD 4 FD 6 epoll_ctl ADD/DEL epoll_wait delivers Network Card / Hardware Interrupt Packet arrival triggers IRQ callback Kernel driver pushes associated FD into Ready List

Under the hood, Linux epoll manages the tracked set with two primary data structures:

  • Red-Black Tree (Interest List): When an application adds, modifies, or removes a file descriptor using epoll_ctl(), the kernel updates a red-black tree keyed by the descriptor's numeric ID. This structure enables stable O(log N) search, insertion, and deletion times, helping the kernel scale to millions of active sockets.
  • Doubly Linked List (Ready List): When I/O occurs on a monitored socket, the hardware device driver fires an interrupt. The kernel's network stack processes the incoming bytes and triggers a callback routine. This callback retrieves the corresponding node from the interest tree and appends it to a doubly linked list of "ready" descriptors.

When the application invokes epoll_wait(), the system call does not search the entire tree. Instead, it checks the ready list. If the list is empty, the calling thread is put to sleep. If the list contains elements, the kernel copies only the active descriptors into the application's buffer. The system call returns immediately, completing in O(1) time regardless of the size of the monitored interest list.


Level-Triggered vs Edge-Triggered Modes

The epoll system operates in one of two event-delivery modes, which determine when and how the kernel notifies the application:

Level-Triggered (LT) Mode

This is the default mode, resembling the classic behavior of poll(). In Level-Triggered mode, the kernel treats the ready list as a state representation. If a file descriptor contains unread data, the kernel will continue to report it as "ready" every time epoll_wait() is called.

While LT mode is forgiving of programming omissions, it carries performance overhead. If your application reads only a portion of the incoming network buffer during one event loop iteration, the next call to epoll_wait() will immediately wake the thread again.

Edge-Triggered (ET) Mode

In Edge-Triggered mode, the kernel notifies the application only when a state transition occurs on the descriptor, such as when new data arrives at the network interface. If the application does not consume all the data from the system's buffer, the kernel will not notify the application again for that event, even if data remains in the queue.

To prevent silent stalls, edge-triggered engines must adhere to two implementation rules:

  1. Non-blocking FDs: All monitored file descriptors must be configured as non-blocking.
  2. Exhaustive Read Loops: When notified of an input event, the application must read from the socket in a loop until the system call returns -1 with errno set to EAGAIN or EWOULDBLOCK. This ensures the kernel's internal buffer is empty before the thread goes back to sleep.

Concurrency Patterns: Reactor vs Proactor

Engineers utilize two architectural patterns to orchestrate high-performance network events: the Reactor pattern and the Proactor pattern.

text
                  ┌────────────────────────────────────────┐
                  │                Reactor                 │
                  │  * Synchronous Event Loop              │
                  │  * Notifies when socket is READY       │
                  │  * Handler performs the actual read()  │
                  └────────────────────────────────────────┘

                  ┌────────────────────────────────────────┐
                  │                Proactor                │
                  │  * Asynchronous Completion Loop        │
                  │  * Notifies when read is COMPLETE      │
                  │  * Kernel performs read() into buffer  │
                  └────────────────────────────────────────┘

The Reactor pattern uses synchronous demultiplexing. The event loop blocks on epoll_wait() or kqueue(). When a socket becomes ready, the loop dispatches the active descriptor to an application-level handler. The handler then performs the synchronous read() or write() operation. Systems like Node.js, NGINX, and Netty are built on the Reactor model.

The Proactor pattern uses asynchronous completion. The application initiates an I/O operation and registers a callback, passing a user-space memory buffer directly to the OS. The kernel performs the read or write operation in the background, writing directly to the application's memory buffer. Once the copy completes, the kernel notifies the application's event loop that the task is done. The Proactor pattern is natively supported on Windows via I/O Completion Ports (IOCP) and is increasingly popular on Linux via the modern io_uring interface.


Practical System Limits

As you scale connection densities toward hundreds of thousands of concurrent sockets, you will hit configuration limits.

The first limit is the Process-Level Descriptor Limit. Operating systems enforce a boundary on the number of file descriptors a single process can open. This boundary is controlled by the shell limit configuration ulimit -n. The second is the System-Wide Descriptor Limit, which determines the maximum number of file descriptors allowed across the entire operating system, configurable via sysctl fs.file-max.

If your application exceeds these boundaries, socket creation system calls like accept() or socket() will fail, returning EMFILE or ENFILE errors.

Furthermore, if the rate of incoming events exceeds the application's capacity to process them, the CPU will spend more time switching contexts and running kernel interrupt handlers than executing application code. This state is known as context-switching thrashing, and it can be mitigated by grouping connections into event loops pinned to dedicated CPU cores.


Further Reading

Code Examples

Core Literature References

Unix Network Programming, Volume 1: The Sockets Networking API

by W. Richard Stevens, Bill Fenner, and Andrew M. Rudoff — Chapter 6: I/O Multiplexing: The select and poll Functions, pp. 151-180

View source

The Linux Programming Interface

by Michael Kerrisk — Chapter 63: Alternative I/O Models, pp. 1325-1360

View source