25.02.01 · computer-science / data-structures

Data structures: arrays, trees, and graphs

shipped3 tiersLean: none

Anchor (Master): Knuth, The Art of Computer Programming Vol. 1, Ch. 2; Tarjan, Data Structures and Network Algorithms; Okasaki, Purely Functional Data Structures

Intuition Beginner

Data structures are containers for organizing information. Just as you choose different types of physical storage for different purposes, like a bookshelf for books, a filing cabinet for documents, and a toolbox for tools, you choose different data structures depending on what you need to do with your data. The right data structure can make an operation fast; the wrong one can make it agonizingly slow.

An array is the simplest data structure: a numbered list of items stored in consecutive memory locations. Think of a row of mailboxes, each with a numbered slot. If you know the number, you can go directly to that mailbox. This direct access is what makes arrays powerful. Looking up the element at position 5 takes the same amount of time as looking up the element at position 5,000,000. But arrays have limitations. Inserting a new element in the middle requires shifting every element after it, which can be slow for large arrays. Imagine inserting a new mailbox in the middle of a row: every mailbox after the insertion point must be moved one position to make room.

A linked list solves the insertion problem by giving up direct access. Instead of storing elements in consecutive locations, each element (called a node) contains both a value and a pointer to the next node. Think of a scavenger hunt where each clue tells you where to find the next one. To find the 100th element, you must visit the first 99 elements in sequence. This is slow for random access. But inserting a new element in the middle is fast: just change two pointers. No shifting required.

A stack is a data structure that follows the last-in, first-out principle. Think of a stack of plates in a cafeteria. You add plates to the top and remove them from the top. The last plate placed on the stack is the first one removed. Stacks are used everywhere in computing. The undo feature in a text editor uses a stack: each action is pushed onto the stack, and undo pops the most recent action. Function call management uses a call stack: each function call pushes a frame onto the stack, and returning from a function pops that frame.

A queue follows the first-in, first-out principle. Think of a line at a ticket counter. The first person in line is the first person served. Queues model any situation where fairness demands that items be processed in the order they arrive. Print job scheduling, network packet routing, and task scheduling in operating systems all use queues.

A tree is a data structure that branches hierarchically, like a family tree or an organizational chart. The top of the tree is called the root. Each node can have children, forming a branching structure. The most commonly used tree in computing is the binary search tree, where each node has at most two children, and the tree satisfies an ordering property: for every node, all values in the left subtree are smaller, and all values in the right subtree are larger.

This property makes searching efficient. Start at the root. If the target is smaller, go left. If larger, go right. At each step, you eliminate roughly half the remaining nodes. For a balanced tree with $n$ nodes, search takes about $lo g_{2} n$ steps.

The problem with binary search trees is that they can become unbalanced. If you insert elements in sorted order (1, 2, 3, 4, 5), the tree degenerates into a linked list. Searching for element 5 requires visiting all 5 nodes, not $lo g_{2} 5 \approx 2.3$ nodes. Balanced tree structures like red-black trees and AVL trees solve this problem by automatically reorganizing themselves after each insertion to maintain approximate balance. Red-black trees guarantee that the longest path from root to leaf is at most twice the shortest path, ensuring that search, insertion, and deletion all run in $O (lo g n)$ time.

A hash table provides a different approach to fast lookup. Instead of organizing data by comparisons, it uses a hash function to compute where each element should be stored. A hash function maps each possible key to an integer index in an array. To insert a key, compute its hash, and store it at that index. To look up a key, compute its hash again and go directly to that index.

In the ideal case, this takes constant time, $O (1)$ , regardless of how many elements are stored. The complication is collisions: two different keys may hash to the same index. Collisions are handled by chaining (storing a linked list at each index) or open addressing (probing for the next available slot). With a good hash function and a table that is not too full, collisions are rare, and hash tables provide near-constant-time operations.

A graph generalizes the tree by allowing arbitrary connections. A graph consists of vertices (also called nodes) connected by edges. Social networks are graphs: people are vertices, and friendships are edges. Road maps are graphs: intersections are vertices, and roads are edges. The internet is a graph: web pages are vertices, and hyperlinks are edges. Graphs can be directed (edges have a direction, like one-way streets) or undirected (edges go both ways, like two-way streets). They can be weighted (edges have costs, like toll roads) or unweighted.

Graphs are represented in two primary ways. An adjacency list stores, for each vertex, a list of its neighbors. This is space-efficient for sparse graphs (graphs with relatively few edges) and makes it easy to iterate over a vertex's neighbors. An adjacency matrix is a 2D array where entry $(i, j)$ indicates whether there is an edge from vertex $i$ to vertex $j$ . This is space-efficient for dense graphs and makes it fast to check whether a specific edge exists.

The choice of data structure depends on the operations you need to perform. If you need fast random access by index, use an array. If you need fast insertion and deletion in the middle, use a linked list. If you need fast lookup by key, use a hash table or a balanced binary search tree. If you need ordered traversal, use a balanced tree. If you need to model relationships between entities, use a graph. There is no single best data structure. Each excels at some operations and is poor at others.

A priority queue is a hybrid structure that combines ideas from queues and trees. Elements are dequeued not in the order they arrived but in order of priority. A hospital emergency room is a priority queue: the most critical patients are treated first, regardless of arrival order. Priority queues are typically implemented using a heap, a specialized tree structure where each parent node has higher priority than its children. Heaps support insertion and extraction of the maximum (or minimum) element in $O (lo g n)$ time.

Heaps are stored as arrays, not as explicit tree structures, which makes them space-efficient and cache-friendly. The parent of node $i$ is at position $⌊ i /2 ⌋$ . The children of node $i$ are at positions $2 i$ and $2 i + 1$ . This mapping between tree positions and array indices means the entire heap fits in a single array with no pointers needed.

Visual Beginner

The table below compares common data structures by their time complexity for fundamental operations.

Data structure	Access	Search	Insert	Delete	Space
Array	$O (1)$	$O (n)$	$O (n)$	$O (n)$	$O (n)$
Linked list	$O (n)$	$O (n)$	$O (1)$	$O (1)$	$O (n)$
Hash table	$O (1)$ avg	$O (1)$ avg	$O (1)$ avg	$O (1)$ avg	$O (n)$
Balanced BST	$O (lo g n)$	$O (lo g n)$	$O (lo g n)$	$O (lo g n)$	$O (n)$
Heap (priority queue)	$O (1)$ max	$O (n)$	$O (lo g n)$	$O (lo g n)$	$O (n)$

Worked example Beginner

Consider the problem of maintaining a spell-checker's dictionary. You need to support three operations: add a word, check whether a word exists, and suggest corrections for misspelled words. Each operation should be fast, because a spell-checker must process thousands of words during a single document scan.

If you store the dictionary as an unsorted array, checking whether a word exists requires scanning every entry, taking $O (n)$ time where $n$ is the number of words. With 100,000 words in the dictionary, each lookup could require 100,000 comparisons. This is too slow for interactive use.

A sorted array with binary search reduces lookup to $O (lo g n)$ time, about 17 comparisons for 100,000 words. But inserting a new word requires finding the correct position and shifting all subsequent words, which takes $O (n)$ time. This is fine if the dictionary is static, but if users can add words, each insertion becomes expensive.

A hash table provides $O (1)$ average-case lookup and insertion. To check whether "algorithm" is in the dictionary, compute hash("algorithm") to get an index, and check that bucket. To add "algorithm", compute the same hash and insert at that bucket. Collisions are handled by chaining. With a good hash function and a table sized appropriately, operations are fast in practice.

However, hash tables do not support ordered operations natively. Finding all words between "algorithm" and "binary" requires scanning the entire table. For the suggestion feature, you might need to find all words within one edit distance of the misspelled word. A trie (prefix tree) is better suited for this task. A trie stores words character by character along a tree path. Each node represents a prefix shared by all words that pass through it. Finding all words that start with "algo" requires traversing to the node for that prefix and collecting all descendants, which is efficient.

In practice, a real spell-checker might combine multiple data structures: a hash table for fast membership testing, a trie for prefix-based suggestions, and a bloom filter (a space-efficient probabilistic structure) as a first-pass filter to quickly rule out words that are definitely not in the dictionary.

Now consider a navigation system like Google Maps. The road network is a graph: intersections are vertices, road segments are edges, and distances or travel times are edge weights. Finding the shortest route between two locations requires graph traversal algorithms.

The graph has millions of vertices and edges. The adjacency list representation is appropriate because road networks are sparse: each intersection connects to only a few neighboring intersections, not to all intersections. An adjacency matrix would waste enormous amounts of space storing zeros.

To find the shortest path, Dijkstra's algorithm uses a priority queue. It starts at the source vertex and repeatedly selects the unvisited vertex with the smallest known distance, then updates the distances to its neighbors. The priority queue ensures that vertices are processed in order of increasing distance from the source. For a graph with $V$ vertices and $E$ edges, Dijkstra's algorithm runs in $O ((V + E) lo g V)$ time with a binary heap, or $O (V lo g V + E)$ with a Fibonacci heap.

Check your understanding Beginner

Formal definition Intermediate+

Array. An array $A$ of size $n$ is a mapping from the index set ${0, 1, \dots, n - 1}$ to a set of values, stored in contiguous memory. Access to $A [i]$ is $O (1)$ because the address of element $i$ is computed as $base address + i \times element size$ .

Linked list. A singly linked list is a sequence of nodes, where each node contains a value and a reference (pointer) to the next node. The list is accessed through a head pointer. A doubly linked list adds a pointer to the previous node in each node.

Stack. A stack supports two operations: $push (x)$ , which adds $x$ to the top, and $pop ()$ , which removes and returns the top element. Both are $O (1)$ . Formally, a stack implements the LIFO (last-in, first-out) abstract data type.

Queue. A queue supports $enqueue (x)$ , which adds $x$ to the rear, and $dequeue ()$ , which removes and returns the front element. Both are $O (1)$ . A queue implements the FIFO (first-in, first-out) abstract data type.

Binary search tree. A binary search tree is a binary tree where for every node $v$ : (1) all keys in the left subtree of $v$ are less than the key at $v$ , (2) all keys in the right subtree of $v$ are greater than the key at $v$ , and (3) both subtrees are also binary search trees. On a balanced BST with $n$ nodes, search, insert, and delete all run in $O (lo g n)$ time.

Hash tables

A hash table uses a hash function $h : K \to {0, 1, \dots, m - 1}$ to map keys to array indices. The load factor $α = n / m$ measures how full the table is, where $n$ is the number of stored keys and $m$ is the table size. Under chaining for collision resolution and the assumption of simple uniform hashing (each key is equally likely to hash to any slot, independently of other keys), a successful search takes $O (1 + α)$ time on average.

Collision resolution. Two primary methods exist. In chaining, each slot holds a linked list of all keys that hash to that slot. In open addressing, all keys are stored in the array itself. When a collision occurs, the algorithm probes for the next available slot using a probe sequence. Linear probing uses $h (k, i) = (h^{'} (k) + i) mod m$ . Quadratic probing uses $h (k, i) = (h^{'} (k) + c_{1} i + c_{2} i^{2}) mod m$ . Double hashing uses $h (k, i) = (h_{1} (k) + i \cdot h_{2} (k)) mod m$ .

Heaps and priority queues

A min-heap is a complete binary tree where each node's value is less than or equal to the values of its children. The minimum element is always at the root. A max-heap reverses the inequality. Heaps support $insert$ in $O (lo g n)$ (by adding at the bottom and sifting up) and $extract-min$ in $O (lo g n)$ (by replacing the root with the last element and sifting down).

Graphs

A graph $G = (V, E)$ consists of a set of vertices $V$ and a set of edges $E \subseteq V \times V$ . In a directed graph, edges are ordered pairs $(u, v)$ . In an undirected graph, edges are unordered pairs ${u, v}$ . A weighted graph assigns a weight $w : E \to R$ to each edge.

Adjacency list representation. For each vertex $v \in V$ , store a list of vertices adjacent to $v$ . Space: $O (V + E)$ . Checking whether edge $(u, v)$ exists: $O (de g (u))$ . Iterating over neighbors of $v$ : $O (de g (v))$ .

Adjacency matrix representation. Store an $∣ V ∣ \times ∣ V ∣$ matrix $A$ where $A [u] [v] = 1$ if $(u, v) \in E$ and $0$ otherwise. Space: $O (V^{2})$ . Checking whether edge $(u, v)$ exists: $O (1)$ . Iterating over neighbors of $v$ : $O (V)$ .

Key result: amortized analysis of dynamic arrays Intermediate+

Theorem. A dynamic array that doubles in size when full supports $n$ consecutive $push$ operations in $O (n)$ total time, giving $O (1)$ amortized time per operation.

Proof. Consider a dynamic array starting with capacity 1. The array doubles at sizes 1, 2, 4, 8, $\dots$ , $2^{k}$ where $2^{k} \geq n$ . Each doubling requires copying all existing elements to a new array. The cost of the doubling at size $2^{i}$ is $2^{i}$ (copying $2^{i}$ elements). The total cost of all doublings up to size $2^{k}$ is:

$i = 0 \sum k - 1 2^{i} = 2^{k} - 1 \leq n - 1$

Each push operation costs $O (1)$ for the actual insertion (not counting resizing). So the total cost of $n$ pushes is at most $n + (n - 1) = 2 n - 1 = O (n)$ . The amortized cost per push is $O (n) / n = O (1)$ . $□$

This analysis uses the aggregate method of amortized analysis. Two other methods exist. The accounting method assigns different charges to different operations, overcharging some to prepay for expensive future operations. The potential method defines a potential function $Φ$ that captures the stored energy in the data structure; the amortized cost of an operation is its actual cost plus the change in potential.

For dynamic arrays, the potential method uses $Φ (D_{i}) = 2 \times (number of elements - capacity /2)$ when the array is more than half full, and $Φ (D_{i}) = 0$ otherwise. The expensive resizing operation is paid for by the accumulated potential.

Exercises Intermediate+

Exercise 2 (medium, short answer).

Draw the red-black tree that results from inserting the keys 7, 3, 18, 10, 22, 8, 11 in that order into an initially empty red-black tree.

Hint

Apply the red-black insertion rules: color new nodes red, then fix violations by recoloring and rotating. Remember the red-black properties: the root is black, red nodes have black children, and every path from root to nil has the same number of black nodes.

Answer

After inserting 7, 3, 18, 10, 22, 8, 11: the tree restructures to maintain balance. The root is 10 (black). Left child 7 (black) with left child 3 (red). Right child 18 (black) with left child 11 (red) and right child 22 (red). Node 8 is the right child of 7 (red). The exact coloring depends on the specific fix-up operations, but the key property is that the black-height is uniform and no red-red violations exist.

Exercise 4 (hard, short answer).

Using the potential method, prove that inserting into a hash table with chaining and a load factor at most $α$ takes $O (1 + α)$ amortized time, assuming dynamic resizing that doubles the table size when $α$ exceeds a constant threshold.

Hint

Define the potential as proportional to the excess elements beyond the threshold. Show that each insertion adds at most a constant to the potential, and the potential pays for the eventual resize.

Answer

Let $m$ be the table size, $n$ the number of elements, and $α_{m a x}$ the threshold. Define $Φ = 2 \cdot max (0, n - α_{m a x} \cdot m)$ . When $n \leq α_{m a x} \cdot m$ , $Φ = 0$ and insertion costs $O (1 + α)$ actual time with $O (1 + α)$ amortized. When $n > α_{m a x} \cdot m$ , each insertion increases $Φ$ by 2. When resizing triggers (doubling $m$ to $2 m$ ), the actual cost is $O (n)$ for rehashing, but $Φ$ drops from $2 (n - α_{m a x} m)$ to $0$ (since after doubling, $n \leq α_{m a x} \cdot 2 m$ ). The accumulated potential $O (n)$ pays for the resize. Amortized cost per insertion remains $O (1 + α)$ .

Exercise 7 (hard, short answer).

Explain why a splay tree on $n$ nodes might have $O (n)$ worst-case time for a single access, yet guarantees $O (lo g n)$ amortized time per access. Give a concrete scenario that causes worst-case linear time.

Hint

Consider a splay tree that has degenerated into a long chain. What happens when you access the deepest node?

Answer

If elements were inserted in order 1, 2, 3, ..., n, the tree is a rightward chain. Accessing node $n$ (the deepest leaf) requires traversing all $n$ nodes, costing $O (n)$ worst-case time. However, the splay operation moves $n$ to the root and roughly halves the path length. Subsequent accesses to the same node take $O (1)$ . The access lemma guarantees that over any sequence of $m$ accesses, the total cost is $O (m lo g n + n lo g n)$ , giving $O (lo g n)$ amortized per access. The expensive accesses are "prepaying" for future cheap accesses.

Domain evidence Master

Data structure choices have measurable, sometimes dramatic, effects on real-world system performance.

Data structures in web browsers

A modern web browser relies on dozens of specialized data structures. The DOM (Document Object Model) is a tree structure representing the HTML document. CSS selectors are matched using rule trees. The layout engine uses box trees to compute element positions. JavaScript engines use hidden classes (similar to structs) and hash tables for object property storage. V8 (Chrome's JavaScript engine) uses a hash table for object properties when the number of properties is small, switching to a more efficient array-backed representation when the object shape stabilizes.

Browser caches use LRU-like structures to manage cached resources. The Chrome disk cache uses a variation of a doubly-linked list indexed by a hash table. Session history is managed as a stack. The bookmark system uses a tree structure reflecting the folder hierarchy. Each of these choices reflects a specific trade-off between access patterns, memory usage, and implementation complexity.

Data structures in databases

Database management systems are built almost entirely from data structures. PostgreSQL uses B+ trees for its default index type (the B-tree index), hash indexes for equality comparisons, GIN (Generalized Inverted Index) for full-text search and array containment queries, GiST (Generalized Search Tree) for geometric and range data, and SP-GiST for space-partitioned data like radix trees.

Write-ahead logs use sequential append-only files optimized for throughput. The buffer pool, which caches disk pages in memory, uses a hash table for page lookup and a modified LRU structure (the clock sweep algorithm) for page eviction. PostgreSQL's MVCC (Multi-Version Concurrency Control) system maintains multiple versions of each row, effectively creating a persistent data structure where reads can access older versions without blocking writes.

Data structures in version control

Git's internal data model is a content-addressable graph stored as a forest of trees. Each commit is a node pointing to a tree object (the root directory), which contains pointers to subtrees (subdirectories) and blobs (file contents). The structure is a directed acyclic graph (DAG) where merge commits have multiple parents.

Git uses hash tables (the object store) for content-addressed storage, where the key is the SHA-1 hash of the content. Pack files use a delta compression scheme that stores only differences between similar objects, similar to persistent data structures that share unchanged subtrees. Branching in Git is implemented as a lightweight pointer to a commit node, making branch creation an $O (1)$ operation.

Data structures in gaming and real-time graphics

Game engines use specialized data structures optimized for spatial queries. Octrees and kd-trees partition 3D space for efficient frustum culling (determining which objects are visible), collision detection (finding which objects overlap), and ray casting (determining what a ray of light hits first). Quad-trees serve the same purpose for 2D games.

Entity Component System (ECS) architectures, used by most modern game engines, organize game entities as collections of components stored in dense arrays for cache-efficient iteration. The data-oriented design philosophy prioritizes memory layout (arrays of structs versus structs of arrays) to maximize cache performance, because on modern hardware, cache misses are often more expensive than the actual computations.

Data structures in compilers

Compilers use data structures at every stage of compilation. The lexer uses finite automata (which are directed graphs). The parser builds abstract syntax trees (ASTs). The type checker uses symbol tables (typically hash tables scoped by the AST structure). The optimizer uses control flow graphs, data flow graphs, and interference graphs (for register allocation). The code generator uses DAGs for instruction selection.

The LLVM intermediate representation uses a SSA (Static Single Assignment) form where each variable is assigned exactly once, implemented using a linked list of instructions grouped into basic blocks, which are grouped into functions, forming a hierarchical structure. The dominance tree, a data structure computed from the control flow graph, is used for efficient data flow analysis and register allocation.

Advanced results Master

Splay trees and self-adjusting structures

Splay trees, invented by Sleator and Tarjan in 1985, are self-adjusting binary search trees that do not store any balance information. Instead, every accessed node is moved to the root through a sequence of rotations called splaying. Despite having no explicit balance criterion, splay trees guarantee $O (lo g n)$ amortized time per operation.

The dynamic optimality conjecture states that splay trees are as efficient as any binary search tree algorithm on any sequence of accesses, up to a constant factor. This conjecture remains open after nearly four decades. The closest known result is the Tango tree (Demaine et al., 2004), which achieves $O (lo g lo g n)$ -competitiveness.

Sleator and Tarjan proved the access lemma: the amortized cost of splaying a node $x$ in a tree with $n$ nodes is $O (lo g (n / w (x)))$ , where $w (x)$ is the weight assigned to node $x$ . This implies the balance theorem ( $O (lo g n)$ amortized per operation), the static optimality theorem (as fast as the optimal static tree), and the working set theorem (recently accessed elements are faster to access again).

Fibonacci heaps and decrease-key

Fibonacci heaps, introduced by Fredman and Tarjan in 1984, support insert and decrease-key in $O (1)$ amortized time, and delete-min in $O (lo g n)$ amortized time. These bounds improve Dijkstra's algorithm from $O ((V + E) lo g V)$ to $O (V lo g V + E)$ .

The structure is a collection of heap-ordered trees. Key operations are lazy: inserts simply add a tree, and decrease-key detaches a subtree without restructuring. The expensive consolidation work (combining trees of equal degree) is deferred until delete-min. The potential function counts the number of trees and marks (nodes that have lost a child), enabling the amortized analysis.

The name comes from the fact that the maximum degree $D (n)$ of any node in a Fibonacci heap with $n$ nodes is at most $⌊ lo g_{ϕ} n ⌋$ , where $ϕ$ is the golden ratio. This bound derives from the relationship between node degrees and Fibonacci numbers.

Union-Find and inverse Ackermann

The union-find (disjoint set) data structure maintains a partition of elements into disjoint sets. It supports $find (x)$ , which returns the representative of the set containing $x$ , and $union (x, y)$ , which merges the sets containing $x$ and $y$ .

With two optimizations, union by rank (attach the shorter tree under the root of the taller tree) and path compression (make every node on the find path point directly to the root), the amortized time per operation is $O (α (n))$ , where $α$ is the inverse Ackermann function. This function grows so slowly that $α (n) \leq 4$ for all practical values of $n$ . Tarjan proved in 1975 that this bound is essentially optimal: no data structure can do better in the pointer machine model.

The inverse Ackermann function $α (n)$ is defined as the smallest $k$ such that $A (k, k) \geq n$ , where $A$ is the Ackermann function. For any $n$ that can be stored in the observable universe, $α (n) \leq 4$ . This makes union-find operations practically constant time.

Persistent data structures

Persistent data structures preserve previous versions when modified. A fully persistent structure allows updates and queries to any version. A partially persistent structure allows queries to any version but updates only to the current version.

Okasaki's 1998 book Purely Functional Data Structures showed that functional programming naturally produces persistent structures, because immutable data can be shared between versions. For example, inserting into a functional red-black tree creates a new tree that shares most of its structure with the old one, using $O (lo g n)$ additional space rather than $O (n)$ .

Path copying is the basic technique: copy all nodes along the path from the root to the modified node, reusing unchanged subtrees. For balanced trees, this costs $O (lo g n)$ time and space per update. Fat node simulation achieves persistence with $O (1)$ space overhead per update by storing versioned values at each node.

External memory data structures

When data is too large to fit in main memory, the bottleneck shifts from CPU operations to disk I/O. The I/O model (Aggarwal and Vitter, 1988) measures complexity in terms of block transfers between memory and disk, where each block holds $B$ records and memory holds $M$ records.

B-trees are the standard external memory search structure. A B-tree of order $t$ has internal nodes with between $t - 1$ and $2 t - 1$ keys. The height is $O (lo g_{t} n)$ , so a search requires $O (lo g_{t} (n / B))$ I/O operations. Database systems use B-trees (and their variant B+ trees) almost universally for index structures.

The buffer tree (Arge, 1995) extends this idea to batched operations, achieving $O (1/ B)$ amortized I/O per operation for certain workloads. The LSMT (Log-Structured Merge-Tree) (O'Neil et al., 1996) is another external memory structure used heavily in modern databases and storage systems like LevelDB and RocksDB.

Cache-oblivious structures

Cache-oblivious algorithms achieve good cache performance without knowing the parameters $B$ (block size) and $M$ (memory size) of the memory hierarchy. The van Emde Boas layout for binary trees stores nodes in an order that ensures any root-to-leaf path traverses $O (lo g_{B} n)$ blocks, matching the B-tree bound without knowing $B$ .

Cache-oblivious B-trees (Bender et al., 2000) support search in $O (lo g_{B} n)$ I/Os and update in $O (lo g_{B} n)$ I/Os amortized, without explicitly parameterizing by $B$ . This is remarkable because the algorithm achieves optimal cache behavior at every level of the memory hierarchy simultaneously.

Connections Master

Connections to algorithms

Data structures and algorithms are inseparable. Many algorithms achieve their efficiency only through the use of specific data structures. Dijkstra's shortest-path algorithm requires a priority queue. Kruskal's minimum spanning tree algorithm requires union-find. Huffman coding requires a min-heap. A* search requires a priority queue with decrease-key. The choice of data structure can change the asymptotic complexity of an algorithm.

The relationship runs both ways: new algorithms inspire new data structures. The development of garbage collection algorithms led to efficient memory management structures. Machine learning has inspired approximate data structures like learned index structures (Kraska et al., 2018), which use neural networks to predict the location of data.

Connections to databases

Database systems are built on a foundation of data structures. B-trees and B+ trees provide the index structures that make queries fast. Hash indices provide point queries. Write-ahead logs use sequential append-only structures. Buffer pools are essentially caches built on hash tables and linked lists. LSM-trees, which combine in-memory sorted structures with disk-based sorted runs, are the backbone of modern write-heavy databases like Cassandra and BigTable.

The entire field of database indexing is an applied study of data structure trade-offs: read performance versus write performance, exact matching versus range queries, memory usage versus query speed, and consistency requirements versus availability. MongoDB uses B-trees for its default indexes but also supports hashed indexes for sharding, geospatial indexes (based on R-trees or geohashes), and text indexes (based on inverted indices). Each index type is a specialized data structure optimized for a particular query pattern.

Redis, an in-memory data store, exposes data structures directly to the user: strings, lists, sets, sorted sets (implemented as skip lists), hashes, and streams (implemented as radix trees). The insight behind Redis is that many applications need simple data structure operations with shared state, and that exposing these directly is more useful than a generic query language.

Connections to operating systems

Operating systems rely heavily on data structures. File systems use B-trees (or variants like B+ trees and extents) to map file names to disk locations. Memory management uses red-black trees to track allocated and free memory regions. Process scheduling uses priority queues. The virtual memory system uses hash tables for page table lookups and balanced trees for managing memory-mapped regions.

The Linux kernel, for example, uses red-black trees for the completely fair scheduler (CFS), radix trees for page cache management, and priority heaps for timer management. The choice of data structure in kernel code is driven by the need for predictable worst-case performance, not just good average-case behavior.

The ext4 file system uses H-trees (a specialized form of B-trees) for directory indexing, allowing fast file lookup even in directories containing millions of files. Without this index, listing or accessing a file in a large directory would require a linear scan. The page cache, which caches disk pages in memory, uses a radix tree (a compressed trie) to map file offsets to cached pages, providing $O (1)$ lookup for cached data.

Connections to networking and distributed systems

Network routers use specialized data structures for packet forwarding. The longest prefix match problem in IP routing uses tries (binary tries, multi-bit tries, or compressed tries like Lulea). Bloom filters provide fast membership testing for cache and routing tables with controlled false-positive rates.

Distributed hash tables (DHTs) like Chord, Pastry, and Kademlia use consistent hashing to distribute data across nodes in a peer-to-peer network. Consistent hashing minimizes data movement when nodes join or leave, a property critical for dynamic distributed systems.

Connections to machine learning

Modern machine learning systems depend on efficient data structures for handling large-scale data. Inverted indices power search engines. KD-trees and ball trees accelerate nearest-neighbor search. Approximate nearest neighbor (ANN) structures like locality-sensitive hashing (LSH), HNSW (hierarchical navigable small world), and FAISS indices enable billion-scale vector search.

Feature stores and data pipelines rely on columnar data structures (Parquet, ORC) that optimize for analytical queries. The training loop of deep learning systems uses memory pools, ring buffers for replay buffers, and concurrent queues for data loading pipelines.

Learned index structures (Kraska et al., 2018) replace traditional B-tree indices with neural networks that learn the cumulative distribution function of the data, predicting the location of a key rather than traversing a tree. Early results show that learned indices can outperform B-trees by up to 70% in speed while using less memory, though they trade off the strong worst-case guarantees of B-trees for better average-case performance on specific data distributions.

Connections to bioinformatics

Bioinformatics relies on specialized data structures for processing genomic data. Suffix trees and suffix arrays index DNA sequences for pattern matching, enabling fast search for gene sequences within entire genomes. The FM-index (used by sequence alignment tools like BWA and Bowtie) combines the Burrows-Wheeler transform with suffix array sampling to achieve remarkable compression while supporting fast exact matching.

De Bruijn graphs are used for genome assembly: DNA sequencing produces short fragments (reads), and the de Bruijn graph represents all possible overlaps between fragments, allowing the original genome to be reconstructed by finding paths through the graph. Sequence alignment algorithms use dynamic programming matrices stored as 2D arrays, with banded variants that exploit biological constraints to reduce memory and computation.

Connections to social networks and graph databases

Social network analysis relies on graph data structures and algorithms. Friend-of-friend recommendations use breadth-first search limited to depth 2. Community detection uses graph clustering algorithms. Influence maximization uses priority queues and greedy algorithms on graph structures.

Graph databases like Neo4j store data as property graphs (vertices and edges with key-value attributes) and provide query languages (Cypher) for pattern matching on graph structures. Unlike relational databases that require expensive joins for relationship queries, graph databases traverse edges directly, making queries like "find all friends of friends who live in New York" efficient. The trade-off is that graph databases perform poorly for tabular queries that relational databases handle easily.

Historical and philosophical context Master

The early history of data structures

The concept of organizing data for efficient retrieval predates computers. Library card catalogs were early index structures, organizing books by author, title, and subject using alphabetical ordering in wooden drawers. The Dewey Decimal System, developed by Melvil Dewey in 1876, is a hierarchical classification scheme that organizes knowledge into a tree structure. Filing cabinets with alphabetical tabs are physical hash tables.

The formal study of data structures began with the development of programming languages in the 1950s and 1960s. The array was the first data structure implemented in hardware (via indexed addressing). The linked list was proposed for symbolic computation in IPL (Information Processing Language) and later refined in Lisp. Trees were formalized for representing arithmetic expressions in compilers. The hash table was independently invented by multiple researchers in the 1950s, including Hans Peter Luhn at IBM for automatic indexing.

The term "data structure" was popularized by the 1968 textbook The Art of Computer Programming by Donald Knuth and the 1976 textbook Fundamentals of Data Structures by Ellis Horowitz and Sartaj Sahni. These works established data structures as a distinct area of computer science, separate from algorithms on one hand and programming languages on the other.

The contributions of Donald Knuth

Donald Knuth's The Art of Computer Programming (TAOCP), beginning with Volume 1 in 1968, established data structures as a rigorous discipline. Knuth analyzed the performance of fundamental structures with unprecedented mathematical precision, deriving exact counts of operations rather than asymptotic bounds. His analysis of hashing, trees, and sorting algorithms set the standard for the field.

Knuth introduced or formalized many of the data structures in common use today, including B-trees (with Bayer), hash chaining analysis, and the analysis of random binary search trees. His emphasis on mathematical rigor transformed data structures from ad hoc programming tricks into a systematic science.

Robert Tarjan and the theory of data structures

Robert Tarjan's work in the 1970s and 1980s transformed the theory of data structures. His analysis of union-find (with Hopcroft) established the inverse Ackermann bound. His invention of splay trees (with Sleator) introduced amortized analysis as a fundamental technique. His work on network algorithms showed that graph-theoretic insights could lead to practical data structure improvements.

Tarjan's 1983 book Data Structures and Network Algorithms remains a classic, demonstrating that deep mathematical analysis of data structures leads to algorithms with provably good performance. His work illustrates a principle that has guided the field: understanding the theoretical limits of data structures is essential for building practical systems.

The amortized analysis revolution

The introduction of amortized analysis by Tarjan in 1985 changed how computer scientists think about data structure performance. Before amortized analysis, data structures were evaluated by their worst-case operation costs. After amortized analysis, it became clear that some structures with expensive worst-case operations could guarantee excellent average performance over sequences of operations.

This perspective shift had practical consequences. Splay trees, dynamic arrays, and hash table resizing all have operations that are expensive in the worst case but efficient in the amortized sense. Amortized analysis provided the theoretical justification for using these structures in practice.

The philosophical significance of abstraction

Data structures embody the principle of abstraction in computing. An abstract data type (ADT) specifies what operations are available and what properties they satisfy, without specifying how they are implemented. A stack ADT guarantees LIFO behavior regardless of whether it is implemented with an array, a linked list, or something else entirely.

This separation of interface from implementation is one of the most powerful ideas in computer science. It allows programmers to reason about correctness at the abstract level and optimize performance at the implementation level. The ADT concept parallels ideas in mathematics (defining algebraic structures by their axioms rather than their constructions) and engineering (specifying component behavior without dictating internal design).

Functional data structures and immutability

The rise of functional programming has renewed interest in persistent and immutable data structures. In a purely functional setting, updating a data structure creates a new version without modifying the old one. This eliminates an entire class of bugs related to mutation and aliasing, and makes concurrency fundamentally easier.

Okasaki's 1998 work showed that functional data structures can achieve time bounds competitive with their imperative counterparts, often within a logarithmic factor. This challenged the assumption that functional programming necessarily sacrifices performance. Modern languages like Clojure, Haskell, and Scala use persistent hash array mapped tries (HAMTs) and other functional structures as their default collection types.

Clojure's persistent hash map, based on HAMTs, achieves $O (lo g_{32} n)$ operations, which is effectively $O (1)$ for maps with up to billions of entries (since $lo g_{32} (1 0^{9}) \approx 6$ ). This demonstrates that persistence and performance are not mutually exclusive, and that immutable data structures can be practical even for high-performance applications.

Concurrent data structures

The rise of multi-core processors has driven the development of concurrent data structures that support simultaneous access from multiple threads. Lock-free hash tables (used in Java's ConcurrentHashMap), concurrent skip lists (used in Java's ConcurrentSkipListMap), and lock-free queues (used in Go's channels) provide thread-safe operations without the bottleneck of a single global lock.

The design of concurrent data structures involves trade-offs between correctness, performance, and progress guarantees. Lock-free structures guarantee that at least one thread makes progress, but may be more complex than lock-based alternatives. Optimistic concurrency control, where threads read data, compute results, and then validate that the data has not changed, is a common pattern that reduces contention at the cost of occasional retries.

Bibliography Master

Primary sources

Knuth, D.E. (1997). The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd ed.). Addison-Wesley. The foundational reference for data structure analysis.
Tarjan, R.E. (1983). Data Structures and Network Algorithms. SIAM. Rigorous analysis of fundamental data structures.
Sleator, D.D. and Tarjan, R.E. (1985). "Self-adjusting binary search trees." Journal of the ACM, 32(3), 652-686. The paper introducing splay trees and amortized analysis.
Fredman, M.L. and Tarjan, R.E. (1987). "Fibonacci heaps and their uses in improved network optimization algorithms." Journal of the ACM, 34(3), 596-615.
Bayer, R. and McCreight, E. (1972). "Organization and maintenance of large ordered indices." Acta Informatica, 1(3), 173-189. The original B-tree paper.
Fredman, M.L., Komlos, J., and Szemeredi, E. (1984). "Storing a sparse table with O(1) worst case access time." Journal of the ACM, 31(3), 538-544.
Arge, L. (1995). "The buffer tree: A technique for designing batched external data structures." Algorithmica, 13(1), 1-24.

Secondary sources

Cormen, T.H., Leiserson, C.E., Rivest, R.L., and Stein, C. (2009). Introduction to Algorithms (3rd ed.). MIT Press. Comprehensive treatment of data structures and their analysis.
Okasaki, C. (1998). Purely Functional Data Structures. Cambridge University Press. The definitive work on persistent and functional data structures.
Sedgewick, R. and Wayne, K. (2011). Algorithms (4th ed.). Addison-Wesley. Practical introduction with implementations in Java.
Skiena, S.S. (2008). The Algorithm Design Manual (2nd ed.). Springer. Emphasizes practical data structure selection for real-world problems.
Aggarwal, A. and Vitter, J.S. (1988). "The input/output complexity of sorting and related problems." Communications of the ACM, 31(9), 1116-1127. The foundational paper on the I/O model.
Herlihy, M. and Shavit, N. (2012). The Art of Multiprocessor Programming (revised 1st ed.). Morgan Kaufmann. Comprehensive treatment of concurrent data structures.
Kraska, T., Beutel, A., Chi, E.H., Dean, J., and Polyzotis, N. (2018). "The case for learned index structures." Proceedings of SIGMOD, 489-504.
Bender, M.A., Demaine, E.D., and Farach-Colton, M. (2000). "Cache-oblivious B-trees." Proceedings of the 41st Annual Symposium on Foundations of Computer Science, 399-409.

Prerequisites

25.01.01

Tier anchors

beginner: Cormen et al., Introduction to Algorithms (3e), Ch. 10; Sedgewick and Wayne, Algorithms (4e), Ch. 1-2
intermediate: Cormen et al., Introduction to Algorithms (3e), Ch. 10-12, 22-23; Skiena, The Algorithm Design Manual (2e), Ch. 3-4
master: Knuth, The Art of Computer Programming Vol. 1, Ch. 2; Tarjan, Data Structures and Network Algorithms; Okasaki, Purely Functional Data Structures

References

computer-science · Ch. 1
Cormen, T.H., Leiserson, C.E., Rivest, R.L., and Stein, C., Introduction to Algorithms (3e, MIT Press, 2009) · Ch. 10-12, 22-23 · source being verified
Knuth, D.E., The Art of Computer Programming Vol. 1: Fundamental Algorithms (3e, Addison-Wesley, 1997) · Ch. 2 · source being verified
Tarjan, R.E., Data Structures and Network Algorithms (SIAM, 1983) · Ch. 1-4 · source being verified
Okasaki, C., Purely Functional Data Structures (Cambridge University Press, 1998) · Ch. 1-3 · source being verified

Estimated time

beginner: 30m
intermediate: 55m
master: 80m