Network Sockets: TCP & UDP

Understand the operating system network stack, socket API system calls, TCP transmission controls, connection queues, UDP semantics, and event-driven I/O multiplexing.

User Space vs Kernel Space Network Buffers

A network socket is the fundamental operating system abstraction through which user-space software sends and receives data over the network. When your application sends or receives data, it does not interact directly with the physical network interface card (NIC). Instead, it reads from and writes to intermediate buffers allocated in kernel memory.

This split can be compared to a mailbox system in a large office building. Tenants (user-space applications) do not fetch letters from the mail carrier's truck. The building mail clerk (the operating system kernel) retrieves letters, sorts them, and places them into individual pigeonholes (kernel socket buffers). Tenants only access their own designated pigeonhole to grab their mail.

When an application calls write() or send() on a socket:

  • The data is copied from user-space memory blocks into the kernel's transmit buffer (tx_buf).
  • The kernel's TCP/IP stack segments the buffered data, wraps it in protocol headers, and passes it to the device driver for physical transmission.
  • The write() syscall returns as soon as the data is copied to the kernel buffer, not when the data actually crosses the physical wire.

Conversely, when data packets arrive at the physical NIC:

  • The driver moves the data to kernel-space receive buffers (rx_buf) using direct memory access (DMA).
  • The kernel parses headers and identifies the target socket.
  • When the user application calls read() or recv(), the kernel copies the data from the kernel's receive buffer into the application's user-space memory buffer.

If the application calls read() more slowly than data arrives, the kernel receive buffer fills up. As free space shrinks, TCP flow control advertises a smaller receive window, and once the buffer is completely full the advertised window drops to zero, forcing the sender to pause until space is freed.
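
The copy-and-return behavior of send() can be observed with a small sketch. It assumes Python's standard socket module and uses socketpair() as a stand-in for a TCP connection (on most Unix systems the pair is a Unix-domain stream socket, which is buffered by the kernel in a similar way). The function name send_without_reader is hypothetical.

```python
import socket


def send_without_reader(payload: bytes = b"hello") -> tuple[int, int, int]:
    """Send on a connected pair whose peer never calls recv()."""
    # socketpair() returns two already-connected sockets in one process.
    a, b = socket.socketpair()
    try:
        # Sizes (in bytes) of the kernel buffers backing socket `a`.
        sndbuf = a.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
        rcvbuf = a.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
        # send() returns once the bytes are copied into kernel memory,
        # even though `b` has not read anything yet.
        sent = a.send(payload)
        return sent, sndbuf, rcvbuf
    finally:
        a.close()
        b.close()


if __name__ == "__main__":
    print(send_without_reader())
```

The reported buffer sizes vary by operating system and configuration, so the sketch only shows that they exist and that send() does not wait for the peer.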


Socket API System Calls

Interacting with sockets requires a sequence of POSIX system calls. Each call moves the socket through a specific lifecycle, configuring either a passive listener (server) or an active initiator (client).

The process begins with the socket() syscall, which requests the operating system to allocate file descriptor resources. The kernel returns an integer representing the socket. Next, the server binds this socket to an IP address and port number using bind(), identifying it on the system. It then enters a passive listening state via listen(), enabling the kernel to accept incoming connection requests.

To establish the connection, the client issues connect(), which triggers the TCP handshake. The server accepts this connection via accept(), returning a new, dedicated socket file descriptor for that specific client connection while the initial listening socket remains free to accept additional incoming requests.
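
The call sequence described above can be sketched in Python, whose socket module wraps the same system calls. This is a minimal illustration on the loopback interface, not a production server; the function name echo_once and the use of a helper thread for the server side are assumptions made for the example.

```python
import socket
import threading


def echo_once(message: bytes = b"hello") -> bytes:
    # Server side: socket() -> bind() -> listen()
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    server.listen(1)
    address = server.getsockname()

    def serve() -> None:
        # accept() returns a new socket dedicated to this one client.
        conn, _peer = server.accept()
        with conn:
            conn.sendall(conn.recv(1024))

    worker = threading.Thread(target=serve)
    worker.start()

    # Client side: socket() -> connect(), which triggers the handshake.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
        client.connect(address)
        client.sendall(message)
        reply = client.recv(1024)

    worker.join()
    server.close()  # the listening socket stayed open the whole time
    return reply


if __name__ == "__main__":
    print(echo_once())
```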

[Figure: TCP Socket System Calls and State Transitions. Server: socket(), bind(), listen(), accept(). Client: socket(), connect(). The SYN, SYN-ACK, ACK handshake moves a connection from the SYN queue (SYN_RCVD) to the accept queue (ESTABLISHED); data is exchanged with write() and read(); read() returning 0 signals EOF; close() exchanges FIN and ACK and ends in TIME_WAIT (2 * MSL).]

The OS uses two connection queues in the background: the SYN backlog and the Accept backlog.

TCP Transmission Controls and Connection Queues

To provide reliable delivery over an unreliable physical medium, TCP enforces three core mechanisms: flow control, congestion control, and connection lifecycle queues.

The Handshake and Connection Backlogs

When a server socket transitions to the listening state, the kernel initializes two distinct backlogs:

  • SYN Backlog (Incomplete Connection Queue): Stores connection requests that have sent a SYN packet and received a SYN-ACK, but have not yet replied with the final ACK. Sockets here are in the SYN_RCVD state.
  • Accept Queue (Complete Connection Queue): Stores fully established connections that have completed the three-way handshake and are in the ESTABLISHED state. When the application calls accept(), a connection is popped from this queue.

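If a high-traffic server does not call accept() fast enough, the Accept Queue fills up. When the queue overflows, the default behavior on Linux is to silently drop the final ACK of the handshake rather than reset the connection. Both peers then fall back on retransmission (the server resends its SYN-ACK, and the client resends its ACK or data), which gives the application time to drain the queue.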
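
A short sketch can make the role of the accept queue visible: the kernel completes the handshake by itself, so connect() succeeds before the server application has called accept(). The example assumes Python's socket module and the loopback interface; the function name connect_without_accept and the backlog value are illustrative.

```python
import socket


def connect_without_accept(backlog: int = 4) -> bool:
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(backlog)  # hint for the size of the accept queue
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(2.0)
    try:
        # The kernel finishes the three-way handshake on its own; the
        # established connection now waits in the accept queue.
        client.connect(server.getsockname())
        connected = True
        # Only now does the application pop it from the queue.
        conn, _peer = server.accept()
        conn.close()
    finally:
        client.close()
        server.close()
    return connected


if __name__ == "__main__":
    print(connect_without_accept())
```

How the backlog argument maps to actual queue lengths is operating-system specific, so the value passed to listen() should be read as a hint rather than an exact limit.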

Flow Control: The Sliding Window

To prevent a fast sender from overwhelming a slow receiver, TCP uses a sliding window mechanism. During packet exchanges, both sides advertise their remaining receive buffer capacity using the window field in the TCP header.

The sender must never allow its unacknowledged bytes to exceed this advertised window size. If the receiver's window shrinks to zero, the sender stops transmitting data. The sender then periodically transmits single-byte probe packets to check if the window has reopened.
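
The effect of a receiver that stops reading can be sketched with a non-blocking sender on the loopback interface. Once the receiver's buffer and the sender's own buffer are full, send() can no longer copy data into the kernel and raises BlockingIOError. The function name fill_until_blocked and the 64 MiB safety limit are assumptions for this example, and the exact byte count depends on kernel buffer tuning.

```python
import socket


def fill_until_blocked(limit: int = 64 * 1024 * 1024) -> tuple[int, bool]:
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sender.connect(server.getsockname())
    receiver, _peer = server.accept()  # accepted, but never read from
    sender.setblocking(False)
    chunk = b"x" * 65536
    total, blocked = 0, False
    try:
        while total < limit:
            try:
                # send() reports how many bytes the kernel accepted.
                total += sender.send(chunk)
            except BlockingIOError:
                # No buffer space left: the kernel refuses more data
                # until the receiver reads and the window reopens.
                blocked = True
                break
    finally:
        sender.close()
        receiver.close()
        server.close()
    return total, blocked


if __name__ == "__main__":
    print(fill_until_blocked())
```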

Congestion Control

While flow control manages receiver capacity, congestion control prevents the network infrastructure from dropping packets due to intermediate queue overflows. TCP monitors packet loss to dynamically scale its transmission rate using key stages:

  • Slow Start: When a connection starts, TCP sets its congestion window (cwnd) to a low initial value (typically 10 segments). It doubles the window size with every round-trip time (RTT) where packets are successfully acknowledged, scaling exponentially.
  • Congestion Avoidance: Once cwnd hits a threshold (ssthresh), the window expansion transitions to a linear growth pattern (adding one segment per RTT).
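  • Fast Recovery: If a packet loss is detected via duplicate ACKs (conventionally three in a row), TCP retransmits the missing segment, lowers ssthresh to about half of the current window, and reduces cwnd to roughly that level instead of collapsing it to its initial value. The connection keeps sending and resumes linear growth without resorting to a full slow start sequence.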
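
The growth pattern of these stages can be illustrated with a toy model. It is a simplified, Reno-style sketch that counts the window in whole segments per RTT and ignores timeouts, pacing, and modern algorithms such as CUBIC or BBR; the function names and the ssthresh value of 64 are assumptions chosen for the example.

```python
def simulate_cwnd(rtts: int, initial_cwnd: int = 10, ssthresh: int = 64) -> list[int]:
    """Return the congestion window (in segments) at the start of each RTT."""
    cwnd = initial_cwnd
    history = []
    for _ in range(rtts):
        history.append(cwnd)
        if cwnd < ssthresh:
            # Slow start: double per RTT, capped at the threshold.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: add one segment per RTT.
            cwnd += 1
    return history


def on_triple_duplicate_ack(cwnd: int) -> tuple[int, int]:
    """Simplified loss reaction: halve the threshold, continue from there."""
    ssthresh = max(cwnd // 2, 2)
    return ssthresh, ssthresh  # (new ssthresh, new cwnd)


if __name__ == "__main__":
    print(simulate_cwnd(8))             # [10, 20, 40, 64, 65, 66, 67, 68]
    print(on_triple_duplicate_ack(68))  # (34, 34)
```

The first three values show exponential growth, the jump to 64 is the cap at ssthresh, and the remaining values grow by one segment per RTT.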

Connection State Transitions

TCP connections undergo a strict sequence of state changes during termination. When a thread calls close() on a socket:

  • It sends a FIN packet and enters the FIN_WAIT_1 state.
  • The remote peer responds with an ACK, transitioning the local socket to FIN_WAIT_2, while the remote peer enters CLOSE_WAIT.
  • The remote peer calls close(), sending its own FIN and transitioning to LAST_ACK.
  • The local socket receives this FIN and responds with an ACK, entering the TIME_WAIT state.

A socket remains in the TIME_WAIT state for twice the Maximum Segment Lifetime (2 * MSL), which typically works out to between 1 and 4 minutes depending on the implementation. This duration ensures that any delayed packets still traveling through the network are fully discarded rather than corrupting a new socket reassigned to the same address and port combination. It also ensures that the final ACK is successfully delivered, preventing the remote peer from retransmitting its final FIN.
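
From the application's point of view, the peer's FIN shows up as a zero-length read. The sketch below, assuming Python's socket module on the loopback interface, closes one side and observes end-of-file on the other; the function name observe_fin is hypothetical, and the TCP states named in the comments are the ones described above rather than something the code inspects.

```python
import socket


def observe_fin() -> bytes:
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    active = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    active.connect(server.getsockname())
    passive, _peer = server.accept()
    passive.settimeout(2.0)

    # Active closer: sends FIN and will eventually pass through TIME_WAIT.
    active.close()
    # Passive side: the FIN surfaces as an empty read (EOF) while the
    # connection sits in CLOSE_WAIT.
    data = passive.recv(1024)
    # Closing the passive side sends its own FIN (CLOSE_WAIT -> LAST_ACK).
    passive.close()
    server.close()
    return data


if __name__ == "__main__":
    print(observe_fin())  # b''
```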


UDP Datagram Semantics

Unlike TCP, the User Datagram Protocol (UDP) is a connectionless, unreliable transport protocol. It eliminates handshakes, retransmissions, sliding windows, and congestion control algorithms.

Unreliable and Connectionless Delivery

A UDP socket does not establish a virtual connection with a remote peer. An application can send datagrams to different destinations using the same socket by specifying the target IP and port on every write request. Packet drops, bit corruptions, and out-of-order arrivals must be handled entirely in the application layer if needed.
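
The absence of a connection can be sketched with one UDP socket that addresses two different receivers, naming the destination on every send. The example assumes Python's socket module and the loopback interface, where datagram loss is unlikely but still not guaranteed, so the receivers use a timeout. The function name send_to_two_peers is hypothetical.

```python
import socket


def send_to_two_peers() -> list[bytes]:
    receivers = []
    for _ in range(2):
        r = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        r.bind(("127.0.0.1", 0))  # kernel picks a free port
        r.settimeout(2.0)         # UDP gives no delivery guarantee
        receivers.append(r)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # No connect(): the destination is supplied with each datagram.
        for index, r in enumerate(receivers):
            sender.sendto(f"to peer {index}".encode(), r.getsockname())
        return [r.recvfrom(1024)[0] for r in receivers]
    finally:
        sender.close()
        for r in receivers:
            r.close()


if __name__ == "__main__":
    print(send_to_two_peers())  # [b'to peer 0', b'to peer 1']
```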

Boundary-Preserving Transmission

TCP is a byte-stream protocol and does not preserve record boundaries. If a client makes two 100-byte writes to a TCP socket, the server might read all 200 bytes in a single read() call, or read 50 bytes in the first call and 150 bytes in the second.

In contrast, UDP is boundary-preserving:

  • If a sender writes a 500-byte datagram to a UDP socket, the receiver will get the exact 500-byte block in a single read operation.
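  • If the receiver passes a buffer smaller than the datagram to recv(), the OS fills the buffer and discards the excess bytes. On POSIX systems the call itself still succeeds, and truncation is reported through the MSG_TRUNC flag when recvmsg() is used; some platforms (such as Windows) return an error instead.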
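
Datagram boundaries and truncation can be observed with a short sketch. It assumes Python's socket module on a POSIX system such as Linux or macOS, where a short recv() typically returns the bytes that fit and discards the rest of that datagram; other platforms may raise an error instead. The function name datagram_boundaries is hypothetical.

```python
import socket


def datagram_boundaries() -> tuple[bytes, bytes, bytes]:
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))
    receiver.settimeout(2.0)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    address = receiver.getsockname()
    try:
        sender.sendto(b"first", address)
        sender.sendto(b"0123456789", address)
        sender.sendto(b"third", address)
        # Each recv() returns at most one datagram, never a merged stream.
        one = receiver.recv(1024)
        # Buffer too small: only 4 bytes are returned, and the remaining
        # 6 bytes of this datagram are discarded (POSIX behavior).
        two = receiver.recv(4)
        # The next call starts at the next datagram, not at the leftovers.
        three = receiver.recv(1024)
        return one, two, three
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print(datagram_boundaries())
```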

UDP is widely used in real-time media streaming, online gaming, and DNS queries, where low latency is critical and losing occasional packets is preferable to waiting for retransmissions.


Network I/O Multiplexing and Event Loops

Handling thousands of concurrent network connections efficiently requires moving away from the naive "one thread per connection" design. In a blocking network model, calling read() blocks the executing thread until data is available in the socket's receive buffer. Assigning a thread to every active connection wastes system resources because most threads spend their time asleep, waiting for network data.

I/O multiplexing solves this by letting a single thread watch multiple file descriptors at the same time. The kernel provides system calls to notify the application when specific file descriptors are ready for read or write operations:

  • select(): The oldest multiplexing call. It accepts bitmasks of file descriptors, is capped at FD_SETSIZE descriptors (typically 1,024), and requires an O(N) loop to scan which descriptors are active, making it scale poorly for high numbers of connections.
  • poll(): Replaces the bitmask with an array of structures, removing the 1,024 limit. However, it still requires the kernel and user space to scan the entire array on every check, preserving the O(N) performance bottleneck.
  • epoll (Linux) and kqueue (macOS/BSD): High-performance, event-driven APIs. Instead of passing an array of file descriptors on every call, the application registers descriptors with the kernel once (with epoll_ctl() on Linux). The kernel tracks readiness in the background. When the application calls epoll_wait(), the kernel returns only the descriptors that are ready for I/O, so the cost of each call scales with the number of ready descriptors rather than the total number registered.

This event-driven model is the foundation of high-throughput servers and runtimes such as Nginx, Node.js, and Redis.
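
A register-once, wait-for-events loop can be sketched with Python's selectors module, whose DefaultSelector picks epoll on Linux and kqueue on macOS/BSD when available. The example runs a listener, one accepted connection, and a client in a single thread; the function name single_thread_echo and the string labels stored in the data field are assumptions for illustration.

```python
import selectors
import socket


def single_thread_echo(message: bytes = b"ping") -> bytes:
    sel = selectors.DefaultSelector()
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen()
    listener.setblocking(False)
    # Register once; the kernel tracks readiness from here on.
    sel.register(listener, selectors.EVENT_READ, data="listener")

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(2.0)
    client.connect(listener.getsockname())
    client.sendall(message)

    echoed = False
    while not echoed:
        events = sel.select(timeout=2.0)  # only ready descriptors return
        if not events:
            break  # nothing became ready in time
        for key, _mask in events:
            if key.data == "listener":
                conn, _peer = key.fileobj.accept()
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ, data="conn")
            else:
                conn = key.fileobj
                data = conn.recv(1024)
                if data:
                    conn.sendall(data)  # echo back to the client
                sel.unregister(conn)
                conn.close()
                echoed = True

    reply = client.recv(1024)
    client.close()
    sel.unregister(listener)
    listener.close()
    sel.close()
    return reply


if __name__ == "__main__":
    print(single_thread_echo())
```

A real server would keep looping and handle partial writes; this sketch stops after one echo to stay short.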


TCP Socket Options

Applications can configure socket behavior using setsockopt(). Three options are critical for high-performance network engineering:

  • SO_REUSEADDR: Instructs the kernel to bypass the standard port-release restrictions for sockets in the TIME_WAIT state. Enabling this option allows a restarted server process to immediately bind to its designated port, preventing startup failures during quick restarts.
  • TCP_NODELAY: Disables Nagle's algorithm. Nagle's algorithm groups small outbound packets and delays their transmission to reduce header overhead on slow networks. Disabling this with TCP_NODELAY ensures that packets are dispatched immediately, which is critical for low-latency interactive connections.
  • SO_KEEPALIVE: Configures the kernel to periodically transmit keepalive probes on idle established connections. If the peer fails to answer the configured number of probes, the kernel treats the connection as dead and subsequent reads or writes on the socket fail with an error (such as ETIMEDOUT), allowing servers to detect and clean up orphaned connections.
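
Setting these three options can be sketched with Python's socket module, which exposes setsockopt() and getsockopt() directly. The helper name tune_tcp_socket is hypothetical, and the values read back are reported as booleans because some systems return a non-zero number other than 1 for an enabled option.

```python
import socket


def tune_tcp_socket(sock: socket.socket) -> dict[str, bool]:
    # Allow rebinding to a port that still has connections in TIME_WAIT.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Disable Nagle's algorithm so small writes are sent immediately.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Ask the kernel to probe idle connections.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Read the options back to confirm they are enabled.
    return {
        "SO_REUSEADDR": sock.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0,
        "TCP_NODELAY": sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0,
        "SO_KEEPALIVE": sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0,
    }


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print(tune_tcp_socket(s))
```

SO_REUSEADDR must be set before bind() to have any effect, which is why a server would normally call a helper like this right after creating the socket.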

Further Reading

Core Literature References

Unix Network Programming, Volume 1: The Sockets Networking API

by W. Richard Stevens, Bill Fenner, and Andrew M. Rudoff — Chapter 3: Sockets Introduction & Chapter 4: Elementary TCP Sockets, pp. 79-120
