25.05.01 · computer-science / operating-systems

Operating systems: processes and memory

shipped3 tiersLean: none

Anchor (Master): Corbet, Rubini, and Kroah-Hartman, Linux Device Drivers; Bovet and Cesati, Understanding the Linux Kernel; Chen and Helian, concurrent systems research

Intuition Beginner

An operating system is the most important program running on any computer. It is the layer between your applications and the physical hardware, managing all the resources that make a computer useful. Without an operating system, every application would need to contain its own code for reading files from disk, drawing pixels on screen, sending packets over the network, and managing memory. The operating system provides these services once, shared among all applications, so that each application can focus on its own task.

The operating system has two primary jobs. First, it provides an abstraction layer that makes the hardware easier to use. Instead of dealing with disk sectors, tracks, and cylinders, you deal with files and directories. Instead of dealing with screen coordinates and color registers, you deal with windows and buttons. Instead of dealing with network packets and routing tables, you deal with sockets and connections.

Second, the operating system manages resource sharing. A modern computer runs dozens or hundreds of programs simultaneously, all sharing the same CPU, memory, disk, and network. The operating system decides which program gets to use the CPU at any given moment, how memory is divided among programs, and how disk and network access are scheduled. Without this management, programs would conflict with each other, corrupt each other's data, or monopolize resources.

A process is the operating system's fundamental unit of work. When you open a web browser, the operating system creates a process for it. When you open a text editor, another process is created. Each process has its own private memory space, its own set of open files, and its own execution state (the values of CPU registers, the program counter, and the call stack). This isolation is critical: if one process crashes, it does not bring down other processes, because each has its own protected memory.

The operating system switches between processes so rapidly that they all appear to be running simultaneously. This technique is called multitasking or time-sharing. On a computer with a single CPU core, only one process can actually be executing at any instant. But the operating system gives each process a small time slice (called a quantum), typically 10 to 100 milliseconds, before switching to the next process. This switching happens thousands of times per second, fast enough that users perceive all programs as running concurrently. On a multi-core processor, multiple processes can truly run in parallel, one per core.

The mechanism by which the operating system switches between processes is called a context switch. The operating system saves the complete state of the currently running process (all CPU registers, the program counter, the stack pointer, memory mappings) into a data structure called the process control block (PCB). It then loads the saved state of the next process from its PCB and resumes execution. Context switches are expensive: saving and restoring state, flushing CPU caches, and updating memory management data structures can take several microseconds. Operating systems are carefully designed to minimize the frequency and cost of context switches.

Memory management is one of the most complex responsibilities of an operating system. Each process needs memory for its code, data, stack, and heap. But physical memory (RAM) is limited, and the total memory requested by all processes usually exceeds the available RAM. The operating system solves this problem with virtual memory.

Virtual memory gives each process the illusion that it has a large, private address space, independent of the physical memory installed in the machine. When a process reads from memory address 1000, it is not reading from physical address 1000. The operating system translates the virtual address to a physical address using a data structure called a page table. This translation happens transparently, managed by hardware (the Memory Management Unit, or MMU) in concert with the operating system.

Virtual memory is organized in pages, fixed-size blocks typically 4 kilobytes in size. The virtual address space of a process is divided into pages, and each page can be mapped to any page-sized region of physical memory, or it can be stored on disk. When a process accesses a page that is not currently in physical memory, the hardware generates a page fault. The operating system handles this fault by loading the required page from disk into a free frame of physical memory, updating the page table, and resuming the process.

This mechanism allows the total virtual memory used by all processes to exceed the available physical memory. Pages that have not been recently used are swapped out to disk, freeing physical memory for pages that are actively needed. The operating system uses page replacement algorithms to decide which pages to evict. The least recently used (LRU) algorithm evicts the page that has not been accessed for the longest time, on the assumption that pages used recently are more likely to be used again soon.

Scheduling is the operating system's decision about which process to run next. The simplest scheduling algorithm is first-come, first-served (FCFS): processes are executed in the order they arrive. This is fair but can be inefficient. A long process that arrives first blocks all subsequent short processes, a problem known as the convoy effect.

Round-robin scheduling gives each process a fixed time quantum in turn. If a process does not finish within its quantum, it is moved to the back of the queue and the next process gets the CPU. Round-robin provides good response time for interactive processes because no process waits more than $n$ quanta (where $n$ is the number of processes) before getting a turn.

Priority scheduling assigns each process a priority level. Higher-priority processes run before lower-priority ones. This is useful for systems where some tasks are more important than others: a real-time audio processing thread should have higher priority than a background file download. The challenge with priority scheduling is starvation: low-priority processes may never get CPU time if high-priority processes keep arriving. Aging, gradually increasing the priority of waiting processes, prevents starvation.

Visual Beginner

The table below summarizes the key components of an operating system.

Component	Purpose	Key mechanism
Process management	Create, schedule, and terminate processes	PCBs, context switches, scheduling algorithms
Memory management	Allocate and protect memory for each process	Virtual memory, page tables, page replacement
File system	Organize and store data persistently	Files, directories, inodes, journaling
I/O management	Control access to devices	Device drivers, interrupts, DMA
Networking	Enable communication between computers	Sockets, TCP/IP stack, routing
Security	Protect resources from unauthorized access	Users, permissions, access control lists

Worked example Beginner

Consider a system with three processes arriving at different times with different burst times (the CPU time each process needs to complete).

Process P1 arrives at time 0 and needs 24 milliseconds of CPU time. Process P2 arrives at time 1 and needs 3 milliseconds. Process P3 arrives at time 2 and needs 3 milliseconds.

With FCFS scheduling, processes run in arrival order. P1 runs from time 0 to 24, using all 24 milliseconds. P2 runs from 24 to 27. P3 runs from 27 to 30. The waiting times are: P1 waits 0, P2 waits 23 (from arrival at 1 to start at 24), P3 waits 25 (from arrival at 2 to start at 27). Average waiting time: $(0 + 23 + 25) /3 = 16$ milliseconds.

With round-robin scheduling and a quantum of 4 milliseconds, the execution proceeds differently. At time 0, P1 starts and runs for its quantum of 4ms. At time 4, P2 has arrived (at time 1) and P3 has arrived (at time 2). The scheduler switches to P2.

P2 runs from 4 to 7 (it only needs 3ms, so it finishes early). At time 7, the scheduler switches to P3. P3 runs from 7 to 10 (also finishing in 3ms). At time 10, only P1 remains, with 20ms still needed.

P1 runs from 10 to 14 (another quantum). Then from 14 to 18. Then 18 to 22. Then 22 to 26. Then 26 to 30 (only 4ms left). P1 finishes at time 30.

Waiting times: P1 waited from 4 to 10 (6ms), from 14 to 14 (0ms), from 18 to 18 (0ms), and from 22 to 22 (0ms). Total wait for P1: 6ms. P2 waited from 1 to 4 = 3ms. P3 waited from 2 to 7 = 5ms. Average waiting time: $(6 + 3 + 5) /3 = 4.67$ milliseconds.

Round-robin reduced the average waiting time from 16ms to 4.67ms, a significant improvement. The short processes P2 and P3 completed much sooner under round-robin than under FCFS, which demonstrates why round-robin is preferred for interactive systems where response time matters.

Check your understanding Beginner

Formal definition Intermediate+

Process. A process is an instance of a program in execution. Formally, a process is represented by its process control block (PCB), which contains: (1) the process state (new, ready, running, waiting, terminated), (2) the program counter (address of the next instruction), (3) CPU registers, (4) CPU scheduling information (priority, scheduling queue pointers), (5) memory management information (page tables, base and limit registers), (6) accounting information (PID, time used), and (7) I/O status information (open files, pending I/O).

Process state diagram. A process transitions through five states. New: the process is being created. Ready: the process is waiting to be assigned to a processor. Running: instructions are being executed. Waiting: the process is waiting for some event to occur (I/O completion, signal). Terminated: the process has finished execution.

Virtual memory: formal model

Let $V$ denote the virtual address space and $P$ the physical address space. The virtual-to-physical translation is a function $f : V \to P \cup {disk}$ . For a page-based system with page size $p$ , a virtual address $v$ is decomposed as $v = (v_{p}, v_{o})$ where $v_{p} = ⌊ v / p ⌋$ is the virtual page number and $v_{o} = v mod p$ is the offset within the page.

The page table is an array indexed by virtual page number. Each entry contains: (1) a valid/invalid bit indicating whether the page is in physical memory, (2) a frame number specifying which physical frame holds the page, (3) protection bits (read/write/execute), (4) reference and dirty bits for page replacement, and (5) a reference count.

The physical address is computed as $phys = frame \times p + v_{o}$ .

Translation Lookaside Buffer (TLB). The TLB is a hardware cache that stores recent virtual-to-physical translations. A TLB lookup is performed in parallel with the page table walk. A TLB hit avoids the memory access required for page table lookup, reducing address translation from two memory accesses to one (or zero with further caching).

Scheduling: formal framework

A scheduling algorithm maps a set of processes ${P_{1}, P_{2}, \dots, P_{n}}$ with arrival times ${a_{1}, a_{2}, \dots, a_{n}}$ and burst times ${b_{1}, b_{2}, \dots, b_{n}}$ to a schedule $S$ , which is a sequence of $(P_{i}, start_{i}, end_{i})$ triples indicating when each process runs.

Key metrics for evaluating schedules. Turnaround time for process $P_{i}$ : $T_{i} = completion_{i} - a_{i}$ . Waiting time for $P_{i}$ : $W_{i} = T_{i} - b_{i}$ . Response time for $P_{i}$ : $R_{i} = first_run_{i} - a_{i}$ .

Concurrency primitives

Mutex (mutual exclusion lock). A mutex provides exclusive access to a shared resource. Operations: $lock (m)$ acquires the mutex (blocking if already held), $unlock (m)$ releases it. A mutex ensures that only one thread can hold it at a time.

Semaphore. A semaphore $s$ is an integer variable with two atomic operations. $wait (s)$ : decrement $s$ ; if $s < 0$ , block the calling process. $signal (s)$ : increment $s$ ; if $s \leq 0$ , wake one blocked process. A binary semaphore (initialized to 1) functions like a mutex. A counting semaphore (initialized to $n$ ) allows up to $n$ processes to access a resource simultaneously.

Condition variable. A condition variable $c$ allows threads to wait for a condition to become true. Operations: $wait (c, m)$ atomically releases mutex $m$ and blocks on $c$ ; when awakened, reacquires $m$ . $signal (c)$ wakes one waiting thread. $broadcast (c)$ wakes all waiting threads.

Key result: Banker's algorithm for deadlock avoidance Intermediate+

Theorem. The Banker's algorithm guarantees that the system never enters an unsafe state (a state from which deadlock is possible), provided the maximum resource needs of each process are declared in advance.

Proof. The algorithm maintains the following data structures. Available: a vector of available resources of each type. Maximum: an $n \times m$ matrix where Maximum $[i] [j]$ is the maximum units of resource $j$ that process $i$ may request. Allocation: the current allocation matrix. Need = Maximum $-$ Allocation: the remaining needs matrix.

A state is safe if there exists a sequence $⟨ P_{i_{1}}, P_{i_{2}}, \dots, P_{i_{n}} ⟩$ such that for each $k$ , the resources needed by $P_{i_{k}}$ can be satisfied by currently available resources plus the resources held by all $P_{i_{j}}$ with $j < k$ . That is:

$Need [i_{k}] \leq Available + j < k \sum Allocation [i_{j}]$

When a request arrives from process $P_{i}$ for resources $R$ , the algorithm checks:

$R \leq Need [i]$ (request does not exceed declared maximum)
$R \leq Available$ (resources are available)
Pretend to grant the request: compute the resulting Available' $=$ Available $-$ $R$ , Allocation' $[i]$ $=$ Allocation $[i]$ $+$ $R$ , Need' $[i]$ $=$ Need $[i]$ $-$ $R$ .
Run the safety algorithm on the pretend state. If the pretend state is safe, grant the request. Otherwise, deny it and have $P_{i}$ wait.

The safety algorithm tries to find a safe sequence. It repeatedly finds a process $P_{i}$ with Need $[i] \leq$ Work (where Work starts as Available'), adds Allocation $[i]$ to Work, and marks $P_{i}$ as finished. If all processes can be marked finished, the state is safe.

Since the algorithm only grants requests that lead to safe states, and a safe state guarantees that all processes can eventually complete, the system never enters deadlock. $□$

Exercises Intermediate+

Exercise 4 (hard, short answer).

Prove that the Shortest Job First (SJF) scheduling algorithm minimizes average waiting time for a set of processes that all arrive at time 0.

Hint

Use an exchange argument: consider any schedule that is not SJF, and show that swapping two adjacent out-of-order processes reduces the total waiting time.

Answer

Consider any schedule where process $P_{i}$ with burst time $b_{i}$ runs immediately before $P_{j}$ with burst time $b_{j}$ , where $b_{i} > b_{j}$ . The contribution of $P_{i}$ and $P_{j}$ to total waiting time includes $P_{j}$ waiting for $P_{i}$ to finish, contributing $b_{i}$ to $P_{j}$ 's wait. If we swap them, $P_{j}$ runs first and $P_{i}$ waits $b_{j}$ instead. The change in total waiting time is $b_{j} - b_{i} < 0$ . So swapping reduces total waiting time. By repeatedly swapping adjacent inversions, any schedule can be transformed into SJF order, each swap reducing total waiting time. SJF is therefore optimal.

Exercise 3 (medium, short answer).

A system has three processes with the following maximum resource needs and current allocation:

Process	Maximum	Allocation
P0	(7, 5, 3)	(0, 1, 0)
P1	(3, 2, 2)	(2, 0, 0)
P2	(9, 0, 2)	(3, 0, 2)

Available resources: (3, 3, 2). Is this state safe? If so, find a safe sequence.

Hint

Compute Need = Maximum - Allocation for each process, then run the safety algorithm: find a process whose Need is less than or equal to Available, add its Allocation to Available, and repeat.

Answer

Need matrix: P0 = (7, 4, 3), P1 = (1, 2, 2), P2 = (6, 0, 0).

Available = (3, 3, 2). P1's Need (1, 2, 2) <= Available. Allocate to P1. After P1 finishes, Available = (3, 3, 2) + (2, 0, 0) = (5, 3, 2).

Now P2's Need (6, 0, 0) <= Available? No, 6 > 5. Try P0: Need (7, 4, 3) <= (5, 3, 2)? No.

Deadlock: no process can proceed. This state is unsafe.

Domain evidence Master

Operating system design principles appear across many domains beyond desktop and server computing.

Mobile operating systems

Android and iOS extend traditional OS concepts with power management, sandboxed applications, and touch-centric interfaces. Android uses a modified Linux kernel with additions like the Binder IPC mechanism (enabling inter-process communication between app components) and the wakelock mechanism (preventing the CPU from sleeping while background work is in progress). iOS employs a stricter sandboxing model where each app runs in a tightly constrained environment with limited access to system resources.

Mobile operating systems face unique scheduling challenges: background apps must be suspended aggressively to conserve battery, while foreground apps need responsive performance. Both Android and iOS use opportunistic background execution, allowing apps limited time to complete background tasks before suspending them.

Embedded and IoT operating systems

Embedded systems, from automotive controllers to medical devices to smart thermostats, run specialized operating systems optimized for constrained resources. FreeRTOS, Zephyr, and VxWorks provide deterministic scheduling, minimal memory footprints (FreeRTOS can run in as little as 2KB of RAM), and real-time guarantees.

The challenge for IoT operating systems is managing thousands of devices with limited compute resources, intermittent connectivity, and severe power constraints. The Zephyr RTOS, supported by the Linux Foundation, provides a scalable kernel that can run on devices from 8KB RAM single-board computers to more powerful ARM processors, with a unified security model and device management framework.

Mainframe operating systems

IBM z/OS, the descendant of OS/360, demonstrates how operating system concepts scale to handle extreme workloads. A single mainframe can run thousands of concurrent transactions per second with five-nines (99.999%) availability. z/OS uses sophisticated workload management (WLM) that classifies work by service class (response time goals, execution velocity) and dynamically adjusts resource allocation to meet service level agreements.

Mainframe virtualization, pioneered by IBM's VM/370 in 1972, allows multiple operating systems to run on a single mainframe. Each virtual machine believes it has the entire hardware to itself. This concept is the direct ancestor of modern hypervisors and cloud computing.

Operating systems for AI workloads

Modern AI workloads stress operating systems in new ways. GPUs have their own memory hierarchies and scheduling requirements. NVIDIA's CUDA runtime manages GPU memory separately from CPU memory, and the operating system must coordinate between the two. Unified Memory in CUDA provides a virtual address space shared between CPU and GPU, with the driver handling page migration between CPU and GPU memory transparently.

AI inference servers handle thousands of concurrent model execution requests, requiring specialized scheduling that considers GPU utilization, memory bandwidth, and model loading latency. Operating system support for NVMe direct storage access allows AI workloads to load model parameters from SSD to GPU memory with minimal CPU involvement.

Advanced results Master

Lock-free and wait-free data structures

Traditional concurrency uses locks (mutexes, semaphores) to protect shared data, but locks have problems: contention, priority inversion, deadlock, and convoy effects. Lock-free data structures allow multiple threads to access shared data without locks, using atomic hardware instructions like compare-and-swap (CAS).

A data structure is lock-free if it guarantees that at least one thread makes progress in a bounded number of steps, even if other threads are suspended. It is wait-free if every thread completes its operation in a bounded number of steps, regardless of what other threads do. Wait-free algorithms are strictly stronger than lock-free ones: they guarantee individual progress, not just system-wide progress.

Maurice Herlihy's 1991 paper established the theoretical foundations of wait-free synchronization. He proved that read-modify-write instructions like CAS have universal consensus number infinity, meaning they can implement wait-free versions of any data structure. In contrast, simpler primitives like test-and-set have finite consensus number, limiting their expressive power for lock-free constructions.

The ABA problem is a notorious challenge for lock-free algorithms using CAS. Thread 1 reads value A from a shared location. Thread 2 changes the location from A to B, then back to A. Thread 1 executes CAS, sees A, and succeeds, even though the state has changed in between. Solutions include tagged pointers (appending a version counter that increments on every modification) and hazard pointers (tracking which threads are accessing which memory, preventing premature reclamation).

Memory reclamation in lock-free data structures is particularly challenging. A thread might be accessing a node that another thread has removed from the data structure and freed. If the memory is reused for a different object, the accessing thread reads garbage. Techniques like epoch-based reclamation, hazard pointers, and RCU (Read-Copy-Update) address this problem by deferring reclamation until no thread can possibly hold a reference to the freed node. RCU, used extensively in the Linux kernel, allows readers to proceed without any synchronization overhead, while updaters use COW to create modified versions of data structures.

Memory-mapped files and shared memory

Memory-mapped files provide an elegant interface for file I/O and inter-process communication. A process maps a file into its virtual address space using the mmap system call. Subsequent reads and writes to the mapped region are translated by the virtual memory system into file operations, without explicit read() or write() calls. This eliminates the overhead of copying data between kernel buffers and user-space buffers, because the file data is accessed directly through the page cache.

Memory-mapped I/O is particularly efficient for large files and random access patterns. Database systems like SQLite and MongoDB use mmap for their storage engines. However, mmap introduces complexity in error handling: a write to a mapped region that exceeds disk quota or encounters a bad block generates a SIGBUS signal, which the application must handle. For sequential I/O of small files, traditional read/write calls may be simpler and equally fast.

Shared memory is the fastest form of inter-process communication. Two processes map the same physical memory into their respective virtual address spaces. Writes by one process are immediately visible to the other, because they share the same physical pages. No data copying is required, unlike pipes or sockets.

Copy-on-write

Copy-on-write (COW) is an optimization that defers copying data until it is actually modified. When a process calls fork() to create a child process, the operating system does not immediately copy the parent's memory pages. Instead, both parent and child share the same physical pages, marked as read-only. When either process writes to a page, the hardware generates a protection fault, and the operating system creates a private copy of that page for the writing process.

COW makes fork() much faster, because it avoids copying potentially gigabytes of memory that the child may never modify. The fork-exec pattern (fork followed immediately by exec, which replaces the child's memory entirely) benefits enormously from COW: the child never writes to any inherited pages, so no copies are ever made.

Real-time scheduling

Real-time operating systems (RTOS) guarantee that tasks complete within specified deadlines. Hard real-time systems require that all deadlines be met; failure causes catastrophic consequences (avionics, medical devices). Soft real-time systems tolerate occasional deadline misses (multimedia, gaming).

The Rate Monotonic (RM) algorithm assigns higher priority to tasks with shorter periods. Liu and Layland proved in 1973 that RM is optimal among fixed-priority scheduling algorithms for preemptive periodic tasks on a single processor. They also proved the utilization bound: a set of $n$ periodic tasks is schedulable under RM if $\sum_{i = 1}^{n} (C_{i} / T_{i}) \leq n (2^{1/ n} - 1)$ , where $C_{i}$ is the computation time and $T_{i}$ is the period of task $i$ .

The Earliest Deadline First (EDF) algorithm assigns the highest priority to the task with the nearest deadline. EDF is optimal among all scheduling algorithms (fixed or dynamic priority) for preemptive tasks on a single processor: if any algorithm can schedule the task set, EDF can. The utilization bound for EDF is $\sum_{i = 1}^{n} (C_{i} / T_{i}) \leq 1$ , which is strictly better than RM's bound.

Linux provides real-time scheduling classes alongside its default Completely Fair Scheduler. The SCHED_FIFO and SCHED_RR policies implement POSIX real-time scheduling, with real-time tasks always preempting normal tasks. The PREEMPT_RT patch set makes the Linux kernel more preemptible, reducing worst-case latency from milliseconds to tens of microseconds, bringing Linux closer to the determinism required by hard real-time applications.

NUMA and memory architecture

Non-Uniform Memory Access (NUMA) systems have multiple processors, each with its own local memory. A processor can access its local memory faster than remote memory attached to other processors. The operating system must be NUMA-aware, scheduling processes on the processor whose local memory holds their data and allocating memory from the local node when possible.

Linux's NUMA policy allows administrators to specify memory allocation preferences. The numactl command controls NUMA scheduling and memory placement. Kernel memory allocators like SLUB are NUMA-aware, allocating objects from the local node by default. Poor NUMA placement can degrade performance dramatically: a process running on one node accessing memory on another node can experience 2-3x higher latency and reduced bandwidth. Database workloads like PostgreSQL and MySQL benefit significantly from NUMA-aware configuration, binding database processes and their shared buffers to the same NUMA node.

Containers and namespace isolation

Modern operating systems provide lightweight virtualization through containers (Docker, Kubernetes). Containers use OS-level features like namespaces (isolating process IDs, network interfaces, mount points, and user IDs) and cgroups (limiting CPU, memory, and I/O usage) to create isolated environments without the overhead of full virtual machines.

Namespaces provide the illusion that each container is a separate system. PID namespaces give each container its own process numbering. Network namespaces give each container its own network stack. Mount namespaces give each container its own filesystem view. Cgroups enforce resource limits, preventing one container from monopolizing system resources.

Containers differ from virtual machines in a crucial way: containers share the host kernel, while VMs include their own full operating system. This makes containers much lighter: a container image might be megabytes, while a VM image is gigabytes. Containers start in milliseconds, VMs in seconds. The trade-off is isolation: because containers share a kernel, a kernel vulnerability affects all containers, while VMs provide stronger isolation through hardware virtualization.

File systems and storage

File systems organize persistent data on storage devices. A file system provides abstractions like files (named sequences of bytes), directories (named collections of files), and permissions (access control). The inode, the fundamental data structure in Unix file systems, stores metadata about a file (size, permissions, timestamps, disk block locations) separately from the file name.

Journaling file systems (ext4, XFS, NTFS) maintain a log of pending changes before committing them to the main file system. If the system crashes during a write, the journal allows recovery to a consistent state. Without journaling, a crash during a metadata update can leave the file system in an inconsistent state requiring a full filesystem check, which can take hours for large disks.

The ZFS file system, developed by Sun Microsystems, introduced copy-on-write semantics at the file system level, enabling snapshots, deduplication, and self-healing through checksumming. Btrfs provides similar features for Linux. These modern file systems blur the line between file system and volume manager, integrating RAID, compression, and encryption.

I/O systems and device management

Operating systems manage input/output through device drivers, specialized code that translates generic OS requests into device-specific commands. The device driver model allows the operating system to support thousands of hardware devices without knowing the details of each one. Linux alone supports over 10,000 device drivers.

Direct Memory Access (DMA) allows devices to transfer data directly to and from memory without CPU involvement. The CPU initiates the transfer, then continues executing other processes. The DMA controller handles the data transfer and interrupts the CPU when complete. This dramatically reduces CPU overhead for I/O operations.

Interrupt handling is the mechanism by which devices signal the CPU. When a device completes an operation or needs attention, it raises an interrupt. The CPU suspends the current process, saves its state, and executes an interrupt service routine (ISR) in kernel mode. After the ISR completes, the CPU restores the saved state and resumes the interrupted process. Modern systems handle thousands of interrupts per second.

Symmetric multiprocessing (SMP)

Modern systems have multiple CPU cores sharing the same memory. The operating system must schedule processes across cores, manage shared data structures with proper synchronization, and maintain cache coherence. Linux's Completely Fair Scheduler (CFS) uses a red-black tree to efficiently select the next process to run on each core, aiming to give each process a fair share of CPU time.

Synchronization on multiprocessor systems requires hardware support. Atomic instructions (test-and-set, compare-and-swap, fetch-and-add) provide building blocks for locks and other synchronization primitives. Spinlocks busy-wait for the lock, wasting CPU cycles but avoiding the overhead of context switches. Mutexes sleep when the lock is unavailable, freeing the CPU for other work but requiring a context switch to reacquire. The choice between spinlocks and mutexes depends on the expected wait time: spinlocks are preferred for very short critical sections.

Virtualization and hypervisors

Hardware virtualization allows multiple operating systems to run simultaneously on a single physical machine. A hypervisor (virtual machine monitor) manages this sharing. Type 1 hypervisors (VMware ESXi, Xen, Hyper-V) run directly on hardware. Type 2 hypervisors (VirtualBox, VMware Workstation) run as applications on a host operating system.

Hardware-assisted virtualization (Intel VT-x, AMD-V) provides CPU instructions that make virtualization efficient. Without hardware support, the hypervisor must emulate privileged instructions in software, which is slow. Extended Page Tables (EPT) and Nested Page Tables (NPT) hardware-accelerate virtual memory translation for VMs, reducing the overhead of nested page tables (guest virtual to guest physical to host physical).

System calls and the user-kernel boundary

The user-kernel boundary is the most fundamental interface in an operating system. User applications request kernel services through system calls (syscalls). A syscall triggers a software interrupt (or uses the dedicated syscall instruction on x86-64), switching from user mode to kernel mode. The kernel validates the request, performs the operation, and returns the result to user space.

The system call interface is stable and well-documented: Linux's syscall interface has remained backward-compatible for decades. This stability allows applications to run on newer kernel versions without modification. The POSIX standard defines a portable subset of this interface across Unix-like systems. Key syscalls include read, write, open, close, fork, exec, wait, mmap, brk, socket, connect, and ioctl.

Connections Master

Connections to computer architecture

Operating systems and hardware are co-designed. Virtual memory requires hardware support (MMU, TLB, page tables). Context switching requires hardware support (timer interrupts, privilege levels). Device drivers require hardware interfaces (interrupts, DMA, memory-mapped I/O). The x86 architecture provides four privilege rings (ring 0 for kernel, ring 3 for user), hardware page tables, and TLB management instructions specifically designed for operating system use.

The RISC-V architecture, an open-source instruction set, was designed with operating system requirements in mind. Its privilege modes (machine, supervisor, user) and page table format were designed to support efficient virtual memory and process isolation.

The cache hierarchy in modern processors interacts closely with the operating system. Cache coherence protocols (MESI, MOESI) ensure that multiple cores see a consistent view of memory. The operating system must consider cache affinity when scheduling processes: migrating a process from one core to another invalidates its cache state, causing a burst of cache misses. Linux's scheduler attempts to keep processes on the same core (cache affinity) while still balancing load across cores.

Connections to programming languages

Operating systems provide the primitives that programming languages use for concurrency. Threads are OS-level entities managed by the kernel. Goroutines (Go), green threads (Java early versions), and coroutines (Python, JavaScript) are user-level scheduling constructs that map onto OS threads.

Memory management in programming languages interacts with OS memory management. Garbage collectors allocate memory from the OS in large chunks (using mmap or brk) and manage individual objects within those chunks. Language-level memory allocation is layered on top of OS-level virtual memory. A generational garbage collector may use mmap with MAP_ANONYMOUS to allocate large nursery spaces, and madvise with MADV_DONTNEED to release memory back to the operating system when collection frees entire pages.

Rust's ownership model provides compile-time guarantees about memory safety that make some OS-level protections redundant. A Rust program that compiles without unsafe blocks cannot have buffer overflows, use-after-free bugs, or data races. This has motivated projects like Redox OS, a Unix-like operating system written in Rust, which aims to leverage the language's safety guarantees to reduce the large class of kernel bugs caused by memory safety violations.

Connections to security

Operating system security is the foundation of computer security. Process isolation prevents unauthorized access to data. User authentication and access control determine who can access what. The kernel's privilege model (kernel mode vs user mode) ensures that user processes cannot directly access hardware or other processes' memory.

Capabilities, a more flexible alternative to traditional access control lists, associate permissions with unforgeable tokens rather than with user identities. Capability-based security was pioneered in operating systems like Hydra and Capability-Based Computer System, and influences modern designs like seL4, a formally verified microkernel.

Security exploits frequently target the kernel-user boundary. Spectre and Meltdown, disclosed in 2018, exploited speculative execution in modern CPUs to read kernel memory from user space, bypassing the hardware isolation that operating systems rely on. These vulnerabilities required operating system patches (KPTI, Kernel Page Table Isolation) that separate kernel and user page tables, at a performance cost. This illustrates how hardware bugs can undermine OS security assumptions and require complex software mitigations.

Connections to distributed systems

Many operating system concepts have direct analogs in distributed systems. Process scheduling maps to task scheduling in distributed frameworks. Virtual memory maps to distributed caching. File systems map to distributed file systems (GFS, HDFS). Locks and semaphores map to distributed coordination services (ZooKeeper, etcd).

Container orchestration platforms like Kubernetes extend OS scheduling to clusters of machines. The Kubernetes scheduler assigns pods (groups of containers) to nodes, considering resource requirements, affinity rules, and taints/tolerations, analogous to how an OS scheduler assigns processes to CPU cores.

Connections to databases

Database management systems build heavily on operating system primitives. Buffer pools are analogous to virtual memory page caches. Write-ahead logging in databases parallels journaling in file systems. B-tree page management builds on the memory-mapped file abstraction. Many databases bypass the OS page cache using direct I/O (O_DIRECT) to implement their own buffer management with domain-specific eviction policies that outperform the general-purpose OS page replacement algorithm.

Historical and philosophical context Master

From batch processing to time-sharing

The first operating systems (1950s) managed batch processing: jobs were submitted on punch cards, processed sequentially, and output was printed. Users had no interactive access to the machine. Turnaround time was measured in hours or days. The IBM 704 and its operating system, the IBM FORTRAN Monitor System (1956), were among the earliest, allowing sequential execution of programs written in FORTRAN.

Time-sharing, developed at MIT (CTSS, 1961; Multics, 1964) and elsewhere, gave multiple users simultaneous interactive access. This was a revolutionary idea: the computer was a shared resource, not a batch processor. The technical challenges of time-sharing, fast context switching, memory protection, and fair scheduling, drove the development of modern operating system concepts. CTSS demonstrated that a single IBM 7094 could support 30 simultaneous users, each with the illusion of having the machine to themselves.

Unix, created by Ken Thompson and Dennis Ritchie at Bell Labs in 1969, was a reaction to the complexity of Multics. Unix was small, simple, and elegant, written in a high-level language (C) rather than assembly. Its philosophy of small, composable tools connected by pipes influenced every operating system that followed. The POSIX standard, derived from Unix, defines the interface that most operating systems implement today. Brian Kernighan and Rob Pike's Unix Programming Environment (1984) documented this philosophy and inspired generations of programmers.

The microkernel debate

The 1990s saw a famous debate between Andrew Tanenbaum (advocate of microkernels) and Linus Torvalds (advocate of monolithic kernels). In a microkernel, only the most essential services (scheduling, IPC, memory management) run in kernel mode; all other services (file systems, device drivers, network stack) run as user-level processes. In a monolithic kernel, all services run in kernel mode.

Tanenbaum argued that microkernels were more reliable and secure because a bug in a device driver would crash only the driver, not the entire system. Torvalds argued that the performance overhead of microkernels was unacceptable and that the monolithic design was simpler to implement correctly.

The debate continues today. Linux remains monolithic but has adopted microkernel-like features (loadable kernel modules). MINIX 3, seL4, and Fuchsia (Google) are microkernels. The trend toward unikernels (single-purpose VMs) and library operating systems represents another approach to the same problem. Jochen Liedtke's L4 microkernel (1995) demonstrated that microkernel performance could be competitive with monolithic kernels through careful IPC design, addressing one of Torvalds's primary objections.

The philosophical significance of abstraction layers

Operating systems embody one of the most powerful ideas in engineering: abstraction layers. Each layer hides the complexity of the layer below and provides a simpler interface to the layer above. Applications see files, not disk sectors. Processes see virtual memory, not physical RAM. Programs see sockets, not network packets.

This layering principle extends far beyond operating systems. Network protocols are layered (the OSI model). Software architectures are layered (presentation, business logic, data access). Organizations are layered (management hierarchies). The operating system's use of abstraction layers is a specific instance of a general engineering principle: manage complexity by organizing systems into layers with well-defined interfaces.

Butler Lampson's 1983 Turing Award lecture, "Hints for Computer System Design," distilled lessons from building operating systems into broadly applicable design principles: "Do one thing well," "Make it fast, rather than general or correct," and "Keep secrets" (abstraction). These principles remain relevant decades later, influencing everything from API design to microservice architecture.

The open source revolution

The success of Linux, an open-source operating system, has had profound implications for the software industry. Linux runs the vast majority of web servers, smartphones (via Android), supercomputers, and embedded devices. The open-source development model, where thousands of contributors collaborate on a shared codebase, has proven remarkably effective at producing reliable, high-quality software.

The GPL license under which Linux is distributed requires that modifications be shared back with the community. This creates a virtuous cycle: improvements by one contributor benefit all users, who in turn contribute their own improvements. The open-source model challenges traditional assumptions about innovation, ownership, and collaboration in software development. Eric Raymond's "The Cathedral and the Bazaar" (1999) argued that open-source development, with its rapid feedback and parallel exploration of solutions, produces better software than closed, hierarchical development.

The Linux kernel, with over 15,000 contributors and millions of lines of code, demonstrates that large-scale distributed development is feasible. Its maintainer hierarchy (Linus Torvalds at the top, subsystem maintainers below, patch reviewers below them) provides the coordination needed to integrate contributions at a rate of roughly 10 changes per hour.

Bibliography Master

Primary sources

Liu, C.L. and Layland, J.W. (1973). "Scheduling algorithms for multiprogramming in a hard-real-time environment." Journal of the ACM, 20(1), 46-61.
Ritchie, D.M. and Thompson, K. (1974). "The UNIX time-sharing system." Communications of the ACM, 17(7), 365-375.
Lampson, B.W. and Redell, D.D. (1980). "Experience with processes and monitors in Mesa." Communications of the ACM, 23(2), 105-117.
Liedtke, J. (1995). "On micro-kernel construction." Proceedings of the 15th ACM Symposium on Operating Systems Principles, 237-250.
Herlihy, M. (1991). "Wait-free synchronization." ACM Transactions on Programming Languages and Systems, 13(1), 124-149.
Corbató, F.J., Merwin-Daggett, M., and Daley, R.C. (1962). "An experimental time-sharing system." Proceedings of the Spring Joint Computer Conference, 335-344.

Secondary sources

Silberschatz, A., Galvin, P.B., and Gagne, G. (2018). Operating System Concepts (9th ed.). Wiley.
Tanenbaum, A.S. and Bos, H. (2014). Modern Operating Systems (4th ed.). Pearson.
Bryant, R.E. and O'Hallaron, D.R. (2015). Computer Systems: A Programmer's Perspective (3rd ed.). Pearson.
Bovet, D.P. and Cesati, M. (2005). Understanding the Linux Kernel (3rd ed.). O'Reilly.
Corbet, J., Rubini, A., and Kroah-Hartman, G. (2005). Linux Device Drivers (3rd ed.). O'Reilly.
Love, R. (2010). Linux Kernel Development (3rd ed.). Addison-Wesley.
Kleiman, S.R. and Eykholt, J. (1995). "Interrupts as threads." ACM SIGOPS Operating Systems Review, 29(2), 21-26.
Raymond, E.S. (1999). The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary. O'Reilly.
Lampson, B.W. (1983). "Hints for computer system design." ACM SIGOPS Operating Systems Review, 17(5), 33-48.
Kernighan, B.W. and Pike, R. (1984). The Unix Programming Environment. Prentice-Hall.

Prerequisites

25.01.01
25.04.01

Tier anchors

beginner: Silberschatz, Galvin, and Gagne, Operating System Concepts (9e), Ch. 1-4; Bryant and O'Hallaron, Computer Systems: A Programmer's Perspective (3e), Ch. 8-9
intermediate: Silberschatz, Galvin, and Gagne, Operating System Concepts (9e), Ch. 5-9; Tanenbaum and Bos, Modern Operating Systems (4e), Ch. 2-3
master: Corbet, Rubini, and Kroah-Hartman, Linux Device Drivers; Bovet and Cesati, Understanding the Linux Kernel; Chen and Helian, concurrent systems research

References

computer-science · Ch. 0
Silberschatz, A., Galvin, P.B., and Gagne, G., Operating System Concepts (9e, Wiley, 2018) · Ch. 1-9 · source being verified
Tanenbaum, A.S. and Bos, H., Modern Operating Systems (4e, Pearson, 2014) · Ch. 1-3 · source being verified
Bryant, R.E. and O'Hallaron, D.R., Computer Systems: A Programmer's Perspective (3e, Pearson, 2015) · Ch. 8-9 · source being verified

Estimated time

beginner: 30m
intermediate: 55m
master: 80m