Linux Namespaces & Cgroups

Under the hood of containerization: isolating processes, networks, and mount points with namespaces, and enforcing resource limits with cgroups.

Namespaces: Controlling What a Process Can See

Unlike virtual machines, which run on virtualized hardware, containers are standard processes running directly on the host machine's kernel. To keep containers from seeing and interfering with each other, the Linux kernel uses namespaces.

Namespaces wrap global system resources in an abstraction layer, so that processes inside a namespace appear to have their own isolated instance of each resource. Six namespace types matter most for containers (newer kernels also provide cgroup and time namespaces):

  • PID (Process ID): Isolates the process ID space. The first process created in a new PID namespace becomes PID 1 (that namespace's init process), and processes inside it cannot see or signal processes running in the host's default namespace.
  • Net (Network): Isolates network devices, IP addresses, routing tables, port bindings, and firewall rules. A new network namespace starts with only a private loopback interface (lo), which stays down until it is brought up.
  • Mount: Isolates the system mount table. A mount namespace allows processes to mount and unmount filesystems without affecting the host or other namespaces.
  • IPC (Inter-Process Communication): Isolates System V IPC objects (message queues, semaphores, and shared memory segments) and POSIX message queues, preventing processes in different containers from communicating through them.
  • UTS (UNIX Timesharing System): Isolates hostnames and domain names. This allows each container to have its own hostname (e.g. api-server-1).
  • User: Isolates user and group ID mappings. This allows a process to have root privileges (UID 0) inside its container while mapped to a non-privileged user (e.g. UID 10001) on the host.
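
On a Linux host, the namespaces a process belongs to appear as symbolic links under /proc/<pid>/ns, and a user namespace's ID mapping appears in /proc/<pid>/uid_map. The Python sketch below reads both. The function names are illustrative assumptions, the sample mapping line reuses the UID 10001 example from the list above, and the translation assumes the map is read from the host's initial user namespace.

```python
import os
import re


def read_namespaces(ns_dir="/proc/self/ns"):
    """Return {entry_name: inode} for the namespace symlinks in ns_dir."""
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        # Each link target looks like "pid:[4026531836]".
        target = os.readlink(os.path.join(ns_dir, name))
        match = re.fullmatch(r"(\w+):\[(\d+)\]", target)
        if match:
            result[name] = int(match.group(2))
    return result


def map_uid_to_host(inside_uid, uid_map_text):
    """Translate a UID inside a user namespace to the mapped outside UID.

    Each uid_map line holds three numbers:
    <first inside ID> <first outside ID> <range length>.
    Returns None when the UID is not covered by any range.
    """
    for line in uid_map_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        inside_start, outside_start, length = (int(f) for f in fields)
        if inside_start <= inside_uid < inside_start + length:
            return outside_start + (inside_uid - inside_start)
    return None


if __name__ == "__main__":
    if os.path.isdir("/proc/self/ns"):
        for name, inode in read_namespaces().items():
            print(f"{name:20s} {inode}")
    # Hypothetical mapping: container root (UID 0) maps to host UID 10001.
    print(map_uid_to_host(0, "0 10001 65536"))
```

Two processes share a namespace of a given type when the inode numbers reported for that entry are equal, which gives a quick way to check whether a process runs in the host's namespaces or in a container's.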

Control Groups: Controlling What a Process Can Use

While namespaces isolate what a process can see, Control Groups (cgroups) limit how much of the host's resources a process can consume. Without cgroups, a compromised or poorly written container could consume all host memory or CPU time, starving other services (the noisy neighbor problem).

Modern Linux kernels support cgroups v2, which organizes resource allocation hierarchies under /sys/fs/cgroup/. Cgroups enforce limits on:

  • CPU: Caps the CPU time a container can use. cpu.max takes a quota and a period in microseconds (e.g. 200000 100000 allows up to 2 cores' worth of CPU time per period).
  • Memory: Enforces a memory usage ceiling (e.g. memory.max). If a container reaches its limit and the kernel cannot reclaim enough memory, the Out-Of-Memory (OOM) killer steps in and terminates a process in that cgroup, which usually ends the container.
  • I/O: Enforces read and write bandwidth limits on block storage devices through io.max (e.g. limiting disk writes to 50MB/s).
  • Process Counts: Limits the total number of processes and threads a container can create (e.g. pids.max), preventing fork bombs from exhausting the host's process table.

[Figure: Linux kernel container sandbox boundary on the host operating system. Namespace sandbox (what it sees): PID namespace (PID 1 inside), net namespace (isolated IP / ports), mount namespace (isolated root /). Cgroups constraints (what it uses): CPU max limit (e.g. 2 cores), memory max limit (e.g. 512MB; exceeding it triggers the OOM killer), block I/O bandwidth (e.g. 50MB/s).]
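
Under cgroups v2, each limit is a plain text file inside the cgroup's directory, so applying limits amounts to writing short strings. The Python sketch below formats the values for the four limits listed above and writes them into a directory. The function names, the device number 8:0, and the choice to treat 50MB/s as 50 MiB/s are assumptions for illustration. Writing to a real cgroup also requires root privileges and the relevant controllers to be enabled in the parent's cgroup.subtree_control.

```python
import os


def cpu_max(cores, period_us=100_000):
    """Format cpu.max as "<quota> <period>" in microseconds.

    quota / period is the number of cores' worth of CPU time allowed.
    """
    return f"{int(cores * period_us)} {period_us}"


def build_limits(cores, memory_bytes, max_pids, device, write_bps):
    """Return {cgroup v2 file name: value to write}."""
    return {
        "cpu.max": cpu_max(cores),
        "memory.max": str(memory_bytes),
        "pids.max": str(max_pids),
        # io.max lines are keyed by the block device's "major:minor" number.
        "io.max": f"{device} wbps={write_bps}",
    }


def apply_limits(cgroup_dir, limits):
    """Write each limit into its control file under cgroup_dir."""
    for filename, value in limits.items():
        with open(os.path.join(cgroup_dir, filename), "w") as handle:
            handle.write(value + "\n")


if __name__ == "__main__":
    limits = build_limits(
        cores=2,
        memory_bytes=512 * 1024 * 1024,  # 512 MiB
        max_pids=100,
        device="8:0",                    # hypothetical block device
        write_bps=50 * 1024 * 1024,      # 50 MiB/s
    )
    for name, value in limits.items():
        print(f"{name}: {value}")
    # On a real host (as root), something like:
    # apply_limits("/sys/fs/cgroup/demo", limits)
```

A process is placed under these limits by writing its PID to the cgroup.procs file in the same directory.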

Root Filesystem Isolation: chroot and pivot_root

To isolate what files a process can access, a container runtime must change the root directory of the container process. This is done using two key system calls:

  • chroot: Changes the root directory of the calling process. For example, chroot /var/lib/container/rootfs makes that folder the process's /, hiding host directories outside it. However, chroot is not a security boundary: it leaves the mount table untouched, and a process with root privileges (CAP_SYS_CHROOT) can escape it.
  • pivot_root: Replaces the root filesystem of the current mount namespace. It makes the new root filesystem the / mount and moves the old root to a subdirectory of it, which the runtime then unmounts. Unlike chroot, pivot_root changes the namespace's mount table, so once the old root is unmounted, the host's filesystem mounts are no longer reachable from inside the container.

By calling pivot_root, a runtime ensures that the container starts with a completely clean, isolated directory structure based on its container image.
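
The order of operations around pivot_root matters, so it helps to see it spelled out. The Python sketch below only builds the shell script that would perform the switch using the mount, pivot_root, and umount command-line tools; it does not run it. Real runtimes issue the equivalent system calls directly. The function name and the .old_root directory name are assumptions, and the script is meant to run as root inside a fresh mount namespace (for example via unshare --mount), with a new root filesystem that ships its own umount and rmdir binaries.

```python
import shlex


def build_pivot_root_script(rootfs, old_root_name=".old_root"):
    """Return shell lines that switch the root filesystem to `rootfs`."""
    root = shlex.quote(rootfs)
    old = shlex.quote(old_root_name)
    return [
        "set -e",
        # Keep mount changes from propagating back to the host namespace.
        "mount --make-rprivate /",
        # pivot_root needs the new root to be a mount point; bind-mounting
        # the directory onto itself turns it into one.
        f"mount --bind {root} {root}",
        f"mkdir -p {root}/{old}",
        f"cd {root}",
        # Swap roots: "." becomes /, the old root moves to /<old_root_name>.
        f"pivot_root . {old}",
        "cd /",
        # Lazily detach the old root so host mounts become unreachable.
        f"umount -l /{old}",
        f"rmdir /{old}",
    ]


if __name__ == "__main__":
    print("\n".join(build_pivot_root_script("/var/lib/container/rootfs")))
```

Printing the script rather than executing it keeps the example safe to run on any machine; the unmount step is what separates this from a plain chroot, because it removes the host's root from the namespace's mount table.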


Virtual Ethernet (veth) Pairs and Network Routing

Because each container gets its own isolated network namespace, it cannot communicate with other containers or the host out of the box. To solve this, runtimes use virtual Ethernet (veth) pairs.

A veth pair acts as a virtual network cable. One end of the cable is placed inside the container's network namespace (renamed to eth0), and the other end is bound to a virtual bridge interface (like docker0) on the host.

When the container sends packets through eth0, they travel down the virtual cable and emerge on the host's network bridge. The bridge delivers traffic for another container directly to that container's veth interface, while the host's routing tables and netfilter rules (iptables) forward, and typically NAT, traffic bound for outside networks through the physical network card.
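
The wiring described above can be reproduced by hand with the iproute2 ip command. The Python sketch below builds the list of commands without running them; executing them requires root. The namespace name, bridge name, interface names, and the 172.18.0.0/24 subnet are all assumptions chosen for illustration, and a dedicated bridge stands in for docker0.

```python
import ipaddress


def build_veth_plan(netns="demo", bridge="br-demo", subnet="172.18.0.0/24",
                    host_if="veth-host", peer_if="veth-ctr"):
    """Return the ip commands (as argv lists) that wire a namespace to a bridge."""
    network = ipaddress.ip_network(subnet)
    hosts = iter(network.hosts())
    gateway = next(hosts)        # first usable address goes to the bridge
    container_ip = next(hosts)   # second usable address goes to the container
    prefix = network.prefixlen
    in_ns = ["ip", "netns", "exec", netns]
    return [
        ["ip", "netns", "add", netns],
        # Host-side bridge that acts as the containers' gateway.
        ["ip", "link", "add", bridge, "type", "bridge"],
        ["ip", "addr", "add", f"{gateway}/{prefix}", "dev", bridge],
        ["ip", "link", "set", bridge, "up"],
        # The virtual cable: two connected interfaces.
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", peer_if],
        # One end moves into the namespace and is renamed to eth0.
        ["ip", "link", "set", peer_if, "netns", netns],
        in_ns + ["ip", "link", "set", peer_if, "name", "eth0"],
        # The other end is attached to the bridge.
        ["ip", "link", "set", host_if, "master", bridge],
        ["ip", "link", "set", host_if, "up"],
        in_ns + ["ip", "link", "set", "lo", "up"],
        in_ns + ["ip", "addr", "add", f"{container_ip}/{prefix}", "dev", "eth0"],
        in_ns + ["ip", "link", "set", "eth0", "up"],
        in_ns + ["ip", "route", "add", "default", "via", str(gateway)],
    ]


if __name__ == "__main__":
    for command in build_veth_plan():
        print(" ".join(command))
```

Reaching networks beyond the host additionally needs IP forwarding and a NAT rule on the host, which this sketch leaves out.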


Container Runtime Architecture

To coordinate namespaces and cgroups, the container ecosystem uses a standard layered architecture governed by the Open Container Initiative (OCI):

  • Low-Level Runtimes (e.g. runc): The actual program that configures namespaces, cgroups, mounts, and runs the container command. It is short-lived; once the container process starts, runc exits.
  • High-Level Runtimes (e.g. containerd, CRI-O): Long-running daemons that manage container lifecycles. They pull images from registries, manage storage overlays, monitor container statuses, and expose APIs to orchestrators like Kubernetes.

This separation of concerns ensures that the core container isolation logic remains simple and standard across different orchestration frameworks.
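
The contract between the two layers is a JSON file, config.json, defined by the OCI runtime specification: the high-level runtime writes it and the low-level runtime reads it. The Python sketch below builds a fragment of such a file covering the namespaces and cgroup limits discussed earlier. The function name and the default values are assumptions, and the result is a partial illustration rather than a complete configuration; a full template can be generated with runc spec.

```python
import json


def build_oci_config(args, hostname="api-server-1",
                     memory_bytes=512 * 1024 * 1024, cpu_cores=2, max_pids=100):
    """Return a partial OCI runtime config as a dictionary."""
    period_us = 100_000
    return {
        "ociVersion": "1.0.2",
        "process": {"args": list(args), "cwd": "/", "user": {"uid": 0, "gid": 0}},
        # Directory (relative to the bundle) that becomes the container's /.
        "root": {"path": "rootfs", "readonly": True},
        "hostname": hostname,
        "linux": {
            # One entry per namespace the runtime should create.
            "namespaces": [
                {"type": kind}
                for kind in ("pid", "network", "mount", "ipc", "uts")
            ],
            # Translated by the runtime into cgroup settings.
            "resources": {
                "memory": {"limit": memory_bytes},
                "cpu": {"quota": int(cpu_cores * period_us), "period": period_us},
                "pids": {"limit": max_pids},
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_oci_config(["/bin/sh"]), indent=2))
```

Because the file is declarative, any orchestrator that can produce it can drive any runtime that implements the specification.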


Further Reading

Core Literature References

The Linux Programming Interface

by Michael Kerrisk — Chapter 53: POSIX Semaphores (for concurrency baseline) and namespaces/cgroups sections, pp. 1101-1140

Control Groups v2

by Linux Kernel Organization — Documentation/admin-guide/cgroup-v2.rst, Sections 1-3
