25.07.01 · computer-science / databases

Databases: relational, NoSQL, and data modeling

shipped3 tiersLean: none

Anchor (Master): Codd 1970; Gray and Reuter, Transaction Processing; Stonebraker and Cetintemel, One Size Fits All; DeWitt and Stonebraker, MapReduce

Intuition Beginner

A database is an organized collection of data stored and accessed electronically. If you have ever used a spreadsheet, you have used a simple database. But databases are far more powerful than spreadsheets. They handle data that is too large to fit in memory, serve thousands of users simultaneously, protect data from corruption and loss, and enforce rules that keep the data consistent.

The relational model, invented by Edgar Codd in 1970, organizes data into tables (called relations). Each table has rows (called tuples) and columns (called attributes). A student table might have columns for student ID, name, major, and GPA. Each row represents one student. The relational model's power comes from its mathematical foundation: operations on tables are based on relational algebra, a formal system for manipulating sets of tuples.

SQL (Structured Query Language) is the standard language for interacting with relational databases. A simple query like "find all students with a GPA above 3.5" is expressed as SELECT * FROM students WHERE gpa > 3.5. More complex queries can join data from multiple tables, aggregate results, and sort output. SQL is declarative: you specify what data you want, not how to retrieve it. The database engine figures out the most efficient way to execute your query.

Consider a university database with three tables. The Students table has columns student_id, name, and major. The Courses table has columns course_id, title, and credits. The Enrollments table has columns student_id, course_id, and grade. This third table connects students to courses, forming a many-to-many relationship: one student can enroll in many courses, and one course can have many students.

To find which courses Maria is taking, you join the Students table with the Enrollments table on student_id, then join with the Courses table on course_id. SQL expresses this as: SELECT courses.title FROM students JOIN enrollments ON students.student_id = enrollments.student_id JOIN courses ON enrollments.course_id = courses.course_id WHERE students.name = 'Maria'. The database engine determines the most efficient way to execute this query, choosing which indexes to use and in what order to process the joins.

Joins are the defining operation of relational databases. An inner join returns only rows that have matching values in both tables. A left outer join returns all rows from the left table and matched rows from the right table, filling in NULLs where there is no match. A right outer join does the reverse. A full outer join returns all rows from both tables, matching where possible and using NULLs elsewhere. A cross join produces the Cartesian product, pairing every row from the first table with every row from the second. A self join joins a table to itself, useful for hierarchical data like employee-manager relationships.

Aggregate functions summarize data across groups. COUNT(*) counts rows, SUM(column) totals values, AVG(column) computes the mean, MIN(column) and MAX(column) find extremes. The GROUP BY clause groups rows that have the same values in specified columns, and the aggregate function is applied to each group. SELECT major, AVG(gpa) FROM students GROUP BY major computes the average GPA for each major. The HAVING clause filters groups after aggregation, unlike WHERE which filters rows before aggregation.

Subqueries nest one query inside another. A correlated subquery references columns from the outer query, re-executing for each row of the outer query. SELECT name FROM students s WHERE gpa > (SELECT AVG(gpa) FROM students WHERE major = s.major) finds students whose GPA exceeds the average for their major. Common table expressions (CTEs), introduced with the WITH keyword, provide named subqueries that improve readability and can be recursive for hierarchical queries.

Window functions perform calculations across rows related to the current row without collapsing groups. RANK() assigns a rank to each row within a partition. ROW_NUMBER() assigns a unique number. LEAD() and LAG() access values from subsequent or preceding rows. SUM() OVER (PARTITION BY major ORDER BY gpa) computes a running total within each major. Window functions are essential for analytical queries and were standardized in SQL:2003.

Indexes make queries fast. Without an index, finding a student by name requires scanning every row in the Students table. With an index on the name column, the database can jump directly to the relevant rows, like looking up a word in a dictionary. Indexes are typically implemented as B-trees, which provide $O (lo g n)$ lookup, insertion, and deletion. The database administrator decides which columns to index based on which queries are most frequent.

ACID properties guarantee that database transactions are processed reliably. Atomicity means a transaction either completes entirely or has no effect. If you transfer money from account A to account B, either both the debit and credit happen, or neither does. Consistency means a transaction takes the database from one valid state to another. Isolation means concurrent transactions do not interfere with each other. Durability means once a transaction is committed, it survives system crashes.

The implementation of ACID properties relies on several mechanisms. Write-ahead logging (WAL) ensures durability by recording all changes to a log file before applying them to the database. If the system crashes, the recovery process reads the WAL and reapplies any committed transactions that were not yet written to the database files. Shadow paging, an alternative approach used by early versions of SQLite, maintains two copies of the database page table and atomically switches from the old to the new copy on commit. WAL is more commonly used because it provides better concurrent performance.

Locking mechanisms implement isolation. Shared locks allow multiple transactions to read the same data simultaneously. Exclusive locks prevent any other transaction from reading or writing the locked data. Two-phase locking, where a transaction acquires all locks before releasing any, guarantees serializability but can cause deadlocks. Most databases use a variant called strong strict two-phase locking (SS2PL), which holds all exclusive locks until the transaction commits or aborts, preventing cascading aborts.

Multi-version concurrency control (MVCC), used by PostgreSQL, MySQL (InnoDB), and Oracle, takes a different approach to isolation. Instead of locking data, MVCC maintains multiple versions of each row. Each transaction sees a snapshot of the database as it existed at the transaction's start time. Writers create new versions without blocking readers, and readers access old versions without blocking writers. This provides excellent read concurrency because readers never wait for writers and vice versa. The trade-off is that the database must periodically clean up old versions that are no longer visible to any active transaction (a process called vacuuming in PostgreSQL).

NoSQL databases emerged to address limitations of relational databases for specific use cases. Document databases like MongoDB store data as flexible JSON-like documents rather than fixed-schema tables. Key-value stores like Redis provide extremely fast lookups by trading query flexibility for speed. Graph databases like Neo4j excel at querying relationships between entities. Column-family stores like Cassandra handle massive write-heavy workloads across distributed clusters.

The NoSQL movement of the late 2000s was driven by the needs of large-scale web applications (Google, Amazon, Facebook) that generated unprecedented volumes of data and required levels of availability and scalability that traditional relational databases struggled to provide. Google's Bigtable paper (2006) and Amazon's Dynamo paper (2007) described systems that traded consistency for availability and scalability, inspiring a generation of open-source databases (HBase from Bigtable, Cassandra and Riak from Dynamo).

The choice between relational and NoSQL depends on the data and the queries. If your data has a clear schema and you need complex queries with joins and aggregations, a relational database is appropriate. If your data is unstructured or semi-structured and you need horizontal scalability, a NoSQL database may be better. Many modern applications use both, choosing the right tool for each type of data.

In practice, the boundary between relational and NoSQL has blurred. PostgreSQL supports JSON columns with indexing and query capabilities, making it a capable document database. MongoDB added support for ACID transactions across multiple documents. Redis added data structures beyond key-value (lists, sets, sorted sets, streams). The convergence suggests that the future is not "relational versus NoSQL" but rather a spectrum of data management tools, each optimized for different access patterns and consistency requirements. The pragmatic approach for most organizations is to default to relational databases for their strong consistency guarantees and mature tooling, and to adopt specialized databases only when the relational model genuinely cannot meet the requirements.

Visual Beginner

Database type	Data model	Strengths	Use case
Relational (PostgreSQL, MySQL)	Tables with rows and columns	ACID transactions, complex queries, joins	Financial systems, ERP, analytics
Document (MongoDB)	JSON-like documents	Flexible schema, nested data	Content management, catalogs
Key-value (Redis)	Key-value pairs	Extremely fast lookups	Caching, session storage
Graph (Neo4j)	Nodes and edges	Relationship queries	Social networks, recommendation
Column-family (Cassandra)	Rows grouped into column families	Massive scale, write throughput	Time series, IoT data

Worked example Beginner

A bank needs to transfer $500 f r o m A l i c e^{'} s a cco u n t (ba l an ce :$ 2000) to Bob's account (balance: $1000). This transfer must be atomic: either both the debit and credit happen, or neither does.

Without transaction management, a crash between the debit and credit operations would leave the database in an inconsistent state: Alice lost $500 b u tB o bn e v er r ece i v e d i t . T h e d a t aba se w o u l d s h o w t ha t$ 500 vanished.

With a transaction, the sequence is wrapped in BEGIN and COMMIT. The database ensures atomicity by writing all changes to a log before applying them. If a crash occurs, the recovery process reads the log and either completes committed transactions or undoes uncommitted ones.

The SQL for the transfer is:

BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 500 WHERE name = 'Alice';
UPDATE accounts SET balance = balance + 500 WHERE name = 'Bob';
COMMIT;

If both updates succeed, COMMIT makes the changes permanent. If either fails, ROLLBACK undoes all changes, restoring both accounts to their original balances. Isolation ensures that no other transaction sees the intermediate state where Alice has been debited but Bob has not yet been credited.

Consider what happens under concurrent access. Suppose a second transaction simultaneously reads Alice's balance to check if she has sufficient funds for a different transfer. Without proper isolation, this second transaction might read Alice's balance after the debit but before the credit to Bob, seeing $1, 500 in s t e a d o f t h ecor r ec t$ 2,000. Under Read Committed isolation, the second transaction would see only committed data, but Alice's balance might change between two reads within the same transaction. Under Repeatable Read, Alice's balance would remain consistent throughout the second transaction. Under Serializable, the effect would be as if the two transactions ran one at a time, eliminating all concurrency anomalies.

Now consider a more complex scenario: a banking system that must enforce the constraint that no account balance goes below zero. With optimistic concurrency control, both transactions proceed without locking. At commit time, the database checks whether any account balance would become negative. If both transactions try to withdraw from the same account, one commits and the other is rolled back and must retry. With pessimistic locking, each transaction locks the account row before modifying it, preventing the conflict entirely. The choice between optimistic and pessimistic approaches depends on the conflict rate: if conflicts are rare, optimistic avoids unnecessary locking overhead; if conflicts are frequent, pessimistic avoids wasted work from retries.

The bank transfer example also illustrates distributed transactions. If Alice's account is in a PostgreSQL database in New York and Bob's is in MySQL in London, the transfer spans two separate database systems managed by different teams. Two-phase commit (2PC) coordinates this: a transaction manager sends "prepare" to both databases. If both respond affirmatively, the manager sends "commit." The weakness of 2PC is that if the manager crashes after prepare but before the final decision, prepared transactions block indefinitely, holding their locks. Modern distributed databases prefer consensus-based approaches over 2PC for this reason.

Check your understanding Beginner

Formal definition Intermediate+

Relational algebra consists of the following operations on relations (tables). Selection $σ_{condition} (R)$ : returns rows satisfying the condition. Projection $π_{attributes} (R)$ : returns specified columns. Cartesian product $R \times S$ : combines every row of $R$ with every row of $S$ . Join $R ⋈_{condition} S$ : Cartesian product filtered by condition. Union $R \cup S$ : combines rows from both relations (schemas must match). Difference $R - S$ : rows in $R$ but not in $S$ . Rename $ρ_{new_name} (R)$ : changes the relation name.

These operations form a closed algebra: every operation on relations produces a relation, enabling composition. A SQL query like SELECT name, gpa FROM students WHERE major = 'CS' ORDER BY gpa DESC is translated into the relational algebra expression $π_{name, gpa} (σ_{major=’CS’} (students))$ , followed by a sort. The query optimizer can transform this expression using algebraic equivalences: selection can be pushed before joins to reduce intermediate result sizes (predicate pushdown), projections can be pushed before joins to eliminate unneeded columns early (projection pushdown), and joins can be reordered to minimize intermediate result sizes.

Relational algebra is equivalent in expressive power to the relational calculus (a declarative query language based on first-order logic), a result known as Codd's completeness theorem. SQL extends relational algebra with practical features like aggregation, sorting, null handling, and recursive queries, but its core is rooted in the algebraic foundation Codd established.

Normalization. A relation is in First Normal Form (1NF) if every column contains atomic values. Second Normal Form (2NF) requires 1NF plus no partial dependencies (non-key attributes depend on the entire primary key). Third Normal Form (3NF) requires 2NF plus no transitive dependencies (non-key attributes depend only on the primary key). Boyce-Codd Normal Form (BCNF) strengthens 3NF: for every functional dependency $X \to Y$ , $X$ must be a superkey.

The normalization process decomposes relations to eliminate redundancy and update anomalies. A relation that stores student_id, student_name, course_id, and instructor_name in a single table suffers from several problems. Inserting a new course with no enrolled students is impossible (insertion anomaly). Deleting the last enrollment for a course loses the course-instructor relationship (deletion anomaly). Changing an instructor's name requires updating all rows for that instructor's courses (modification anomaly). Normalization decomposes this into separate tables for Students, Courses, and Enrollments, eliminating all three anomalies.

However, normalization involves trade-offs. A fully normalized schema minimizes redundancy but may require many joins to reconstruct the original information, increasing query complexity and potentially degrading performance. Denormalization, the deliberate introduction of redundancy, is sometimes used to improve read performance at the cost of more complex write operations. Data warehouses commonly use denormalized star schemas, where a central fact table (e.g., sales) is surrounded by dimension tables (e.g., time, product, store), precomputing joins that would otherwise need to be performed at query time.

Concurrency control in depth

Two-phase locking (2PL) is the most common concurrency control mechanism. In the growing phase, a transaction acquires locks but does not release any. In the shrinking phase, it releases locks but does not acquire any. Strict 2PL holds all exclusive locks until the transaction commits or aborts, preventing cascading rollbacks. The locking protocol guarantees serializability: the order in which transactions commit corresponds to a valid serial execution.

Deadlocks occur when two transactions each hold a lock that the other needs. Transaction T1 locks resource A and requests B. Transaction T2 locks resource B and requests A. Neither can proceed. Databases detect deadlocks using wait-for graphs or timeouts. When a deadlock is detected, one transaction is chosen as the victim and rolled back, releasing its locks so the other can proceed. The victim typically receives an error that the application must handle by retrying the transaction.

Optimistic concurrency control (OCC) assumes conflicts are rare. Transactions execute without locking, maintaining read and write sets. At commit time, the transaction's read set is validated: if any read data has been modified by a concurrently committed transaction, the validating transaction is aborted and must retry. OCC works well when conflicts are infrequent, such as in read-heavy workloads or when transactions access disjoint data. Under high contention, the abort rate can become prohibitive.

Query optimization

A query can be executed in many ways. The query optimizer considers alternative execution plans and selects the one with the lowest estimated cost. Cost is measured in I/O operations (disk reads and writes).

The optimizer uses statistics about the data (table sizes, value distributions, index presence) to estimate the cost of each plan. Key techniques include: choosing join order (joining small tables first reduces intermediate result sizes), choosing join algorithm (nested-loop, sort-merge, or hash join), and deciding when to use an index versus a full table scan.

For a three-table join between Students, Enrollments, and Courses, the optimizer might consider several plans. It could join Students with Enrollments first (if most students are filtered by a WHERE clause), then join the result with Courses. Or it could join Enrollments with Courses first (if most courses are filtered), then join with Students. The optimizer estimates the size of intermediate results for each join order and chooses the one with the lowest total cost. This estimation depends heavily on the accuracy of table statistics: if the optimizer believes the Enrollments table has 1,000 rows when it actually has 10,000,000, it may choose a catastrophically wrong plan.

The Cascades optimizer framework, developed by Goetz Graefe in 1995, provides a modular approach to query optimization. It separates the logical properties of a query (what it computes) from the physical properties (how it is computed). Logical transformations (like rewriting a subquery as a join) generate alternative logical plans. Physical implementations (like choosing a hash join versus a merge join) generate alternative physical plans for each logical plan. The optimizer explores this space using memoization to avoid redundant work, pruning branches that cannot lead to the optimal plan. Modern databases including Microsoft SQL Server, Apache Calcite (used by Hive and Spark), and CockroachDB use Cascades-style optimizers.

Transaction isolation levels

SQL defines four isolation levels, each providing different guarantees against concurrency anomalies. Read Uncommitted: transactions can read uncommitted data from other transactions (dirty reads). Read Committed: transactions only read committed data, but non-repeatable reads are possible (a row read twice may change between reads). Repeatable Read: rows read by a transaction cannot be changed by other transactions, but phantom reads are possible (new rows matching a query may appear). Serializable: complete isolation, as if transactions ran one at a time.

Most production databases default to Read Committed, balancing consistency and performance. PostgreSQL defaults to Read Committed but supports all four levels, implementing Serializable using Serializable Snapshot Isolation (SSI), which detects dangerous patterns of read-write conflicts and aborts transactions that could lead to non-serializable outcomes. Oracle historically branded its Snapshot Isolation as "serializable," which has been a source of confusion since Snapshot Isolation does not prevent all serialization anomalies (specifically, it allows write skew, where two concurrent transactions read overlapping data and write to non-overlapping portions, producing a result that could not occur in any serial execution).

Key result: the CAP theorem Intermediate+

Theorem (Brewer's CAP Theorem). A distributed database system can provide at most two of the following three guarantees simultaneously: Consistency (every read receives the most recent write), Availability (every request receives a response), and Partition tolerance (the system continues to operate despite network partitions).

Proof sketch. Suppose a network partition separates the system into two components $A$ and $B$ . A client writes to $A$ . A client then reads from $B$ . If the system is consistent, $B$ must return the latest value, but it cannot communicate with $A$ due to the partition. If $B$ returns a response, it might return a stale value (violating consistency). If $B$ refuses to respond, the system is unavailable. Therefore, during a partition, the system must choose between consistency and availability. $□$

The CAP theorem implies that distributed databases must make trade-offs. CP systems (like ZooKeeper, HBase) choose consistency over availability. AP systems (like Cassandra, DynamoDB) choose availability over consistency. The BASE (Basically Available, Soft state, Eventually consistent) model is the AP counterpart to ACID.

Exercises Intermediate+

Exercise 3 (hard, short answer).

Explain why serializable isolation is typically implemented using one of several approaches and describe the trade-offs between two-phase locking and optimistic concurrency control.

Hint

Consider how each approach handles conflicts and what performance implications arise.

Answer

Two-phase locking (2PL) acquires all locks before releasing any. It guarantees serializability but can cause deadlocks and reduces concurrency because locks are held for the entire transaction. Optimistic concurrency control (OCC) lets transactions execute without locks and validates at commit time, aborting if conflicts are detected. OCC provides better concurrency under low contention but wastes work under high contention when many transactions abort. 2PL is better for write-heavy workloads with frequent conflicts; OCC is better for read-heavy or low-contention workloads.

Exercise 5 (medium, short answer).

Given the relation R(A, B, C, D) with functional dependencies {AB -> C, C -> D, D -> A}, identify all candidate keys and state whether the relation is in BCNF.

Hint

A candidate key is a minimal set of attributes that functionally determines all attributes. Check each FD to see if its left side is a superkey.

Answer

Starting from AB: AB -> C and C -> D, so AB determines all attributes. Therefore AB is a candidate key. Starting from BC: BC can determine D via C -> D, but we also need A. From D -> A, we get A from D, which comes from C. So BC determines all attributes. Similarly, BD: D -> A, and with B, A we get C via AB -> C. So BD is also a candidate key. The FDs C -> D and D -> A have determinants (C and D) that are not superkeys, so the relation is not in BCNF.

Exercise 6 (medium, short answer).

Explain the difference between a data lake and a data warehouse, and identify a scenario where each would be more appropriate.

Hint

Consider the structure of the data and the types of analysis performed.

Answer

A data warehouse stores structured data that has been cleaned, transformed, and organized according to a predefined schema (schema-on-write). It is optimized for SQL-based analytical queries and is appropriate for business reporting and dashboards where data quality and consistency are paramount. A data lake stores raw data in its native format (structured, semi-structured, or unstructured) without upfront transformation (schema-on-read). It is appropriate for exploratory data analysis, machine learning, and data science where flexibility and the ability to process diverse data formats are more important than predefined structure.

Exercise 7 (hard, short answer).

A distributed database uses Raft for replication with 5 nodes. Under what conditions can the database continue to serve reads and writes? If a network partition separates 3 nodes from 2 nodes, describe what happens to each group.

Hint

Raft requires a majority (quorum) for both leader election and log commitment.

Answer

The database can continue serving reads and writes as long as a majority of nodes (at least 3 of 5) can communicate. In the partition scenario, the group of 3 nodes retains a majority and can elect a leader, commit log entries, and serve both reads and writes. The group of 2 nodes cannot achieve a majority. If the previous leader was in this group, it steps down. The 2-node group cannot elect a new leader or commit any writes, so it becomes unavailable for writes. It may serve stale reads from its local data but cannot guarantee consistency. When the partition heals, the 2-node group receives the missing log entries from the majority group and catches up.

Advanced results Master

LSM-trees and write-optimized storage

Log-Structured Merge Trees (LSM-trees), used by LevelDB, RocksDB, Cassandra, and many modern databases, optimize for write-heavy workloads. Writes are appended to an in-memory table (memtable). When the memtable fills, it is flushed to disk as an immutable sorted file (SSTable). Reads check the memtable, then the most recent SSTable, then older SSTables.

Periodically, a compaction process merges SSTables, removing duplicates and deleted entries. This trade-off, fast writes at the cost of slower reads, is favorable for workloads dominated by inserts and updates. Bloom filters reduce the cost of reads by quickly ruling out SSTables that do not contain the sought key.

The compaction strategy significantly affects performance. Size-tiered compaction (used by Cassandra) groups SSTables into size tiers and compacts all SSTables in a tier at once. This minimizes write amplification (the ratio of bytes written to disk versus bytes written by the application) but can cause read amplification during compaction. Leveled compaction (used by RocksDB's default) organizes SSTables into levels where each level is a sorted run, with each level being approximately ten times larger than the previous. This reduces space amplification and improves read performance at the cost of higher write amplification during compaction. Time-windowed compaction (used for time-series data) groups SSTables by time range, optimizing for the common pattern of querying recent data.

Columnar storage and analytical databases

Traditional row-oriented storage stores all columns of a row together. Columnar storage stores all values of a column together. For analytical queries that aggregate specific columns over millions of rows, columnar storage dramatically reduces I/O because only the needed columns are read.

Systems like Amazon Redshift, Google BigQuery, and Apache Parquet use columnar storage. Compression is more effective in columnar format because similar values are stored together (run-length encoding, dictionary compression, and delta encoding achieve high compression ratios on columnar data).

The performance difference between row and column storage can be dramatic. A query computing the average age of 100 million customers from a 50-column table reads approximately 800 MB in row storage (assuming 8 bytes per column, reading all columns) but only 800 MB / 50 columns = 16 MB in column storage (reading only the age column). This 50x reduction in I/O translates directly to faster query execution. Column stores also enable vectorized query execution, where operations are applied to batches of values simultaneously using SIMD instructions, providing further speedups.

HTAP (Hybrid Transactional/Analytical Processing) systems like TiDB and Google Spanner attempt to serve both OLTP and OLAP workloads from a single database. They use a row-based storage engine for transactions and asynchronously replicate data to a columnar store for analytics. This eliminates the traditional ETL (Extract, Transform, Load) pipeline that moves data from operational databases to data warehouses, reducing latency between data generation and data analysis.

Vector databases and embedding search

The rise of machine learning has created a new class of database optimized for similarity search over high-dimensional vectors. Embedding models convert text, images, and audio into fixed-length vectors (typically 128 to 1536 dimensions). Vector databases like Pinecone, Milvus, and Weaviate index these vectors for fast nearest-neighbor search.

Approximate nearest neighbor (ANN) algorithms trade accuracy for speed. HNSW (hierarchical navigable small world) graphs, IVF (inverted file index), and product quantization enable billion-scale vector search in milliseconds, powering recommendation systems, semantic search, and retrieval-augmented generation (RAG).

The choice of distance metric affects both the quality of results and the index structures that can be used. Euclidean distance ( $L_{2}$ norm) measures the straight-line distance between vectors. Cosine similarity measures the angle between vectors, making it robust to differences in vector magnitude. Inner product similarity is used when vectors are normalized or when magnitude carries meaning. Each metric requires different index optimizations. For example, HNSW works with any metric but is most efficient when the metric satisfies the triangle inequality. Product quantization compresses vectors into compact codes, trading recall for storage efficiency, and is particularly effective for billion-scale datasets where storing full-precision vectors is impractical.

NewSQL and distributed SQL

NewSQL databases (Google Spanner, CockroachDB, TiDB) combine the scalability of NoSQL systems with the ACID guarantees and SQL interface of relational databases. They distribute data across multiple nodes, use distributed consensus (Raft) for replication, and provide globally consistent transactions.

Google Spanner uses TrueTime, a globally synchronized clock service that provides bounded clock uncertainty, to implement externally consistent transactions across data centers. CockroachDB uses hybrid logical clocks to achieve similar guarantees without specialized hardware.

Spanner's TrueTime API returns an interval $[e a r l i es t, l a t es t]$ representing the possible range of the current time. The uncertainty is typically less than 10 milliseconds, ensured by GPS and atomic clocks at each data center. When a transaction commits, the coordinator waits until the current time is definitively past the transaction's commit timestamp, guaranteeing that all subsequent transactions see the committed data. This approach, called "commit wait," adds latency but provides a correctness guarantee that no other distributed database achieves without specialized hardware.

CockroachDB's approach is different. It uses hybrid logical clocks (HLCs) that combine physical timestamps with logical counters to order events across nodes. For transaction isolation, CockroachDB uses a serializable isolation level implemented through a combination of timestamp ordering and optimistic concurrency control. When conflicts are detected, CockroachDB automatically retries transactions, making the retry transparent to the client in many cases.

Stream processing and real-time databases

Stream processing systems (Apache Kafka, Apache Flink, Spark Streaming) process continuous data streams rather than bounded datasets. They support windowed aggregations (compute statistics over the last 5 minutes), joins between streams, and complex event processing (detect patterns across events).

Change Data Capture (CDC) streams database changes in real-time, enabling event-driven architectures where changes in one database trigger actions in other systems. This pattern, called CDC + Kafka + Flink, is the foundation of many modern real-time data pipelines.

Apache Kafka's architecture is worth examining in detail. Kafka organizes data into topics, which are partitioned, ordered, and append-only logs. Each partition is replicated across multiple brokers for fault tolerance. Producers write to the tail of the log; consumers read at their own pace, maintaining their own offset (position in the log). This design decouples producers from consumers: producers do not need to know how many consumers exist, and consumers can join, leave, or replay messages without affecting producers. Kafka retains messages for a configurable period (often days or weeks), enabling time-travel queries and reprocessing of historical data.

Apache Flink provides exactly-once semantics for stream processing through a distributed snapshot mechanism inspired by the Chandy-Lamport algorithm. Periodically, Flink injects checkpoint barriers into the data stream. When an operator receives a barrier, it snapshots its state and acknowledges the checkpoint. When all operators have acknowledged, the checkpoint is complete. If a failure occurs, Flink restores all operators to the last completed checkpoint and replays the data from that point. This mechanism guarantees that every record is processed exactly once, even in the presence of failures.

Graph databases and knowledge graphs

Graph databases store data as nodes and edges, making them natural for modeling and querying relationships. Neo4j, the most popular graph database, uses a property graph model where nodes and edges can have key-value properties, and edges have types that define the relationship.

The Cypher query language, developed by Neo4j, provides a declarative syntax for graph traversal. A query like MATCH (person:Person)-[:KNOWS]->(friend:Person)-[:WORKS_AT]->(company:Company) WHERE person.name = 'Alice' RETURN company.name finds all companies where friends of Alice work. This query is expressed naturally in Cypher but would require multiple joins in SQL.

Knowledge graphs extend the graph database concept with semantic meaning. Google's Knowledge Graph, Wikipedia's Wikidata, and enterprise knowledge graphs represent entities and their relationships in a structured, queryable format. The combination of graph databases with machine learning (graph neural networks) enables powerful reasoning over relational data, including link prediction (suggesting missing relationships), node classification (categorizing entities), and community detection (identifying clusters of related entities).

Connections Master

Connections to data structures

Databases are built on data structures. B-trees and B+ trees provide indexed access. Hash indexes provide point lookups. LSM-trees use skip lists and bloom filters. Buffer pools are hash tables. Write-ahead logs are sequential files. Understanding data structures is essential for understanding database internals and tuning database performance.

The connection goes deeper than implementation. The choice of index data structure determines the types of queries a database can serve efficiently. B+ trees provide efficient range queries (finding all rows where age is between 20 and 30) because leaf nodes are linked in sorted order. Hash indexes provide efficient point queries (finding the row where id = 42) but cannot serve range queries. Bitmap indexes, which store a bitmap for each distinct value of a column, are efficient for low-cardinality columns in analytical workloads (finding all rows where gender = 'F' and region = 'West'). Spatial indexes (R-trees, quadtrees) enable efficient geometric queries (finding all restaurants within 2 km of a location). Full-text indexes (inverted indexes) enable efficient text search. Each index type is a specialized data structure optimized for a specific query pattern.

Connections to distributed systems

Distributed databases must handle network partitions, replication lag, and consistency trade-offs. The CAP theorem, consistent hashing, and distributed consensus algorithms (Paxos, Raft) are fundamental to distributed database design. Modern distributed databases like CockroachDB and YugabyteDB are essentially distributed systems research applied to database architecture.

The raft consensus algorithm, used by CockroachDB, TiDB, and many others for data replication, ensures that all replicas agree on the sequence of operations. Each piece of data (called a range or shard in CockroachDB) is replicated using a Raft group. The leader of the Raft group serves all reads and writes for that range, and followers serve as hot standbys. If the leader fails, the remaining replicas hold an election and choose a new leader. This provides automatic failover with no data loss, as long as a majority of replicas remain available.

Connections to machine learning

Databases and machine learning are increasingly intertwined. Feature stores (Feast, Tecton) serve ML features from databases. Vector databases enable similarity search over ML embeddings. Automated ML platforms query databases for training data. The convergence of OLTP (transaction processing), OLAP (analytics), and ML is driving the development of unified data platforms.

Learned index structures, proposed by Kraska et al. in 2018, replace traditional index structures with neural networks trained to predict the position of keys. A B-tree index learns the cumulative distribution function of the keys; a neural network can approximate this function more compactly, potentially reducing index size by orders of magnitude. While learned indexes have not yet displaced traditional indexes in production systems, the concept illustrates the growing intersection of databases and machine learning.

Database tuning, traditionally a manual task performed by expert database administrators, is increasingly automated using machine learning. Systems like IBM's DB2 Learning Classifier and Microsoft's Database Tuning Advisor use query workload analysis to recommend index creation, table partitioning, and configuration parameter changes. The OtterTune system (Van Aken et al., 2017) uses a combination of supervised learning and collaborative filtering to automatically tune database configuration parameters, achieving performance comparable to human experts.

Connections to programming languages

The interaction between databases and programming languages has been a persistent challenge. SQL is a declarative language optimized for data manipulation, while application languages (Java, Python, JavaScript) are imperative or object-oriented. Object-Relational Mapping (ORM) frameworks like Hibernate (Java), Django ORM (Python), and ActiveRecord (Ruby) bridge this gap by translating between object graphs and relational tables.

ORMs provide convenience but can introduce performance problems. The N+1 query problem occurs when an ORM issues one query to fetch a list of objects and then N additional queries to fetch each object's associated data. For 1,000 orders, fetching each order's items individually generates 1,001 queries instead of 2 (one for orders, one for all items). Solutions include eager loading (fetching associated data in the initial query using JOINs) and batch loading (fetching associated data for multiple parent objects in a single query).

Query builders and type-safe query libraries attempt to provide the safety of a type system while working with SQL. jOOQ (Java), Slick (Scala), and SQLAlchemy Core (Python) allow developers to write type-checked queries that are translated to SQL at runtime. These libraries catch SQL errors at compile time rather than runtime, reducing the risk of SQL injection and syntax errors.

Historical and philosophical context Master

Codd's relational model

Edgar Codd's 1970 paper "A Relational Model of Data for Large Shared Data Banks" revolutionized data management. Before Codd, databases used hierarchical or network models that required navigating physical data structures through pointers. Codd proposed a mathematical model based on relations (tables) and first-order predicate logic, separating the logical view of data from its physical storage.

Codd's insight was that data independence, the ability to change the physical storage without changing the logical model, was essential for building reliable, maintainable systems. The relational model achieved this through declarative querying: users specify what data they want, and the database system determines how to retrieve it.

Codd also proposed 12 rules (numbered 0 through 12) that define a truly relational database system. Rule 0 states that the system must qualify as relational, use its relational facilities exclusively to manage the database. Rule 1 (Information Rule) requires all information to be represented as values in tables. Rule 2 (Guaranteed Access) requires every value to be addressable by table name, primary key value, and column name. Rule 6 (View Updating) requires that views (virtual tables) be updatable. No commercial database fully satisfies all 13 rules, though some come close. PostgreSQL, for example, satisfies most rules but does not fully support updatable views in all cases.

The path from Codd's theoretical model to commercial SQL databases was not straightforward. Codd's relational algebra was a mathematical formalism, not a practical query language. IBM's System R project (1974-1979) developed SQL and demonstrated that relational databases could be practical for production use. Meanwhile, Michael Stonebraker's Ingres project at UC Berkeley developed the QUEL query language and the Ingres database, which later became the commercial product CA-Ingres. Larry Ellison's Oracle Corporation, founded in 1977, was the first company to release a commercial SQL-based relational database, beating IBM's own SQL/DS to market by several years. This competitive dynamic between academic research and commercial products drove rapid innovation in database technology throughout the 1980s.

The object-relational impedance mismatch

The rise of object-oriented programming in the 1990s created tension between the object model (with inheritance, polymorphism, and encapsulation) and the relational model (with tables, foreign keys, and SQL). This "impedance mismatch" led to Object-Relational Mapping (ORM) frameworks like Hibernate, ActiveRecord, and SQLAlchemy, which translate between object graphs and relational tables.

NoSQL databases partly arose from the desire to eliminate this mismatch by storing data in formats closer to how applications use it. Document databases store JSON objects directly, eliminating the translation layer. Graph databases model object relationships natively. The lesson from the impedance mismatch is that data models are not neutral: they shape how applications are designed, what queries are efficient, and what kinds of complexity are easy or hard to manage.

The data democratization movement

Modern data engineering aims to make data accessible to everyone in an organization, not just database administrators. Data lakes, data meshes, and self-service analytics platforms empower non-technical users to query and visualize data. This democratization raises questions about data quality, governance, and privacy that the database community continues to address.

The data lake concept, popularized by James Dixon in 2010, stores raw data in its native format (structured, semi-structured, and unstructured) in a centralized repository, typically built on distributed file systems like HDFS or cloud object stores like Amazon S3. Unlike data warehouses, which require data to be cleaned and structured before loading (schema-on-write), data lakes apply schema when the data is read (schema-on-read). This flexibility enables data scientists and analysts to explore data without the overhead of upfront modeling, but it can lead to "data swamps" where data is stored without adequate metadata, lineage, or quality controls.

The data mesh concept, proposed by Zhamak Dehghani in 2019, takes a fundamentally different approach. Instead of centralizing data in a single platform, data mesh treats data as a product, with domain teams owning and publishing their own data products. Data infrastructure is provided as a self-serve platform, and federated computational governance ensures interoperability between data products. This approach addresses the organizational and scaling challenges that centralized data platforms face, but requires significant cultural and architectural change to implement successfully.

The evolution of query optimization

Query optimization has evolved dramatically since System R's pioneering optimizer in 1979. The System R optimizer used a cost-based approach: it enumerated possible execution plans, estimated the cost of each using statistical models of the data, and chose the cheapest. This approach remains the foundation of modern query optimizers, but with significant extensions.

Volcano (Graefe, 1994) and Cascades (Graefe, 1995) introduced extensible optimizer frameworks that allow new operators and transformations to be added without modifying the optimizer core. These frameworks use memoization to avoid re-evaluating equivalent sub-plans, enabling exhaustive search over large plan spaces. Microsoft SQL Server adopted the Cascades framework, and the approach influenced the design of PostgreSQL's optimizer.

Modern optimizers increasingly use adaptive techniques that adjust execution plans during query execution based on observed performance. PostgreSQL's adaptive hash join, for example, can switch between hash join and nested loop join mid-execution if cardinality estimates prove inaccurate. This adaptivity is crucial because cardinality estimation errors, caused by outdated statistics or complex predicate correlations, are the most common cause of poor query plans. Research systems like Eddies (Avnur and Hellerstein, 2000) took this further, treating query optimization as a continuous process that adjusts operator scheduling throughout execution rather than making a single upfront decision.

The rise of cloud databases

Cloud databases have transformed how organizations provision and manage database infrastructure. Amazon RDS, introduced in 2009, was among the first managed database services, handling backups, patching, replication, and failover automatically. AWS later introduced Aurora, a MySQL and PostgreSQL-compatible database that separates storage from compute, replicating data across three availability zones with a distributed storage layer that provides durability and self-healing. Aurora achieves significantly higher throughput than standard MySQL by reducing write amplification: the log is written to the distributed storage layer, and database pages are reconstructed from the log in the background.

Serverless databases like Aurora Serverless and PlanetScale take this further by automatically scaling compute resources based on demand, scaling to zero when idle. This model is particularly attractive for applications with variable or unpredictable workloads, where provisioning for peak capacity would be wasteful. The database-as-a-service model has shifted the operational burden from database administrators to the cloud provider, allowing development teams to focus on application logic rather than infrastructure management.

Bibliography Master

Primary sources

Codd, E.F. (1970). "A relational model of data for large shared data banks." Communications of the ACM, 13(6), 377-387.
Gray, J. and Reuter, A. (1993). Transaction Processing: Concepts and Techniques. Morgan Kaufmann.
Stonebraker, M. and Cetintemel, U. (2005). "One size fits all: An idea whose time has come and gone." Proceedings of the 21st ICDE.
DeWitt, D.J. and Stonebraker, M. (2008). "MapReduce: A major step backwards." Database Column.
Dean, J. and Ghemawat, S. (2008). "MapReduce: Simplified data processing on large clusters." Communications of the ACM, 51(1), 107-113.

Secondary sources

Ramakrishnan, R. and Gehrke, J. (2002). Database Management Systems (3rd ed.). McGraw-Hill.
Silberschatz, A., Korth, H.F., and Sudarshan, S. (2010). Database System Concepts (6th ed.). McGraw-Hill.
Date, C.J. (2003). An Introduction to Database Systems (8th ed.). Addison-Wesley.
Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly.
Graefe, G. (1995). "The Cascades framework for query optimization." Data Engineering Bulletin, 18(3), 19-29.
Dehghani, Z. (2022). Data Mesh: Delivering Data-Driven Value at Scale. O'Reilly.

Prerequisites

25.02.01
25.04.01

Tier anchors

beginner: Ramakrishnan and Gehrke, Database Management Systems (3e), Ch. 1-4; Silberschatz et al., Database System Concepts (6e), Ch. 1-3
intermediate: Ramakrishnan and Gehrke, Database Management Systems (3e), Ch. 5-12; Date, An Introduction to Database Systems (8e), Ch. 4-9
master: Codd 1970; Gray and Reuter, Transaction Processing; Stonebraker and Cetintemel, One Size Fits All; DeWitt and Stonebraker, MapReduce

References

computer-science · Ch. 0
Ramakrishnan, R. and Gehrke, J., Database Management Systems (3e, McGraw-Hill, 2002) · Ch. 1-12 · source being verified
Codd, E.F., A Relational Model of Data for Large Shared Data Banks, Communications of the ACM 13(6), 1970 · pp. 377-387 · source being verified
Silberschatz, A., Korth, H.F., and Sudarshan, S., Database System Concepts (6e, McGraw-Hill, 2010) · Ch. 1-7 · source being verified

Estimated time

beginner: 30m
intermediate: 55m
master: 80m