25.09.01 · computer-science / ai-ml

Artificial intelligence and machine learning

shipped · 3 tiers · Lean: none

Anchor (Master): Goodfellow, Bengio, and Courville, Deep Learning; Vapnik, The Nature of Statistical Learning Theory; Sutton and Barto, Reinforcement Learning

Intuition Beginner

Artificial intelligence (AI) is the field of computer science dedicated to creating systems that perform tasks normally requiring human intelligence. These tasks include understanding language, recognizing images, making decisions, and learning from experience. Machine learning (ML), a subset of AI, focuses on systems that improve their performance by learning from data rather than being explicitly programmed.

Traditional programming follows a fixed recipe: given input data and a program, produce output. Machine learning inverts this: given input data and the desired outputs, learn the program (model) that maps inputs to outputs. Instead of writing rules by hand, you provide examples and let the algorithm discover the rules.

Consider the problem of recognizing whether an email is spam. A traditional approach would require writing hundreds of rules: if the email contains the word "lottery," flag it; if it comes from an unknown sender with an attachment, flag it; and so on. This approach is fragile because spammers adapt their messages to bypass known rules. A machine learning approach feeds thousands of labeled emails (spam or not spam) to an algorithm, which learns statistical patterns that distinguish spam from legitimate email. The learned model generalizes to new emails it has never seen before, and adapts to new spamming techniques as long as the training data is updated periodically.

Machine learning comes in three main flavors. Supervised learning trains on labeled data: each training example has an input and the correct output. Classification (predicting a category) and regression (predicting a number) are supervised tasks. Unsupervised learning finds patterns in unlabeled data. Clustering (grouping similar items) and dimensionality reduction (finding simplified representations) are unsupervised tasks. Reinforcement learning trains an agent through trial and error, receiving rewards for good actions and penalties for bad ones.

A neural network is a machine learning model inspired by the structure of the brain. It consists of layers of interconnected nodes (neurons). Each neuron takes inputs, multiplies them by weights, adds a bias, and applies an activation function. The output of one layer becomes the input to the next. The network learns by adjusting the weights to reduce the error between its predictions and the correct answers.

A deep neural network has many layers (sometimes hundreds). Each layer learns progressively more abstract features. In an image recognition network, the first layer might detect edges, the second might detect textures, the third might detect shapes, and deeper layers might detect objects like faces or cars. This hierarchical feature learning is what makes deep learning so powerful: the model discovers useful representations automatically, without manual feature engineering. The term "deep" refers specifically to the number of layers, not to any mystical property of the model.

Training a neural network involves three steps repeated thousands or millions of times. First, forward propagation: the input passes through the network, producing a prediction. Second, loss computation: the prediction is compared to the correct answer using a loss function that measures the error. Third, backpropagation: the error is propagated backward through the network, computing the gradient (direction and magnitude of change) of the loss with respect to each weight. The weights are updated by a small step in the direction that reduces the error.

To illustrate forward propagation concretely, consider a neuron with three inputs $x_{1}, x_{2}, x_{3}$ , weights $w_{1}, w_{2}, w_{3}$ , bias $b$ , and ReLU activation. Given inputs $(0.5, 0.3, 0.8)$ , weights $(0.2, - 0.5, 0.7)$ , and bias $0.1$ : the weighted sum is $z = 0.2 (0.5) + (- 0.5) (0.3) + 0.7 (0.8) + 0.1 = 0.10 - 0.15 + 0.56 + 0.1 = 0.61$ . The ReLU activation outputs $max (0, 0.61) = 0.61$ . This output becomes an input to the next layer. A network with many such neurons in many layers computes increasingly complex functions of the original input.
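The single-neuron computation above can be sketched in a few lines of Python. The function names `relu` and `neuron_forward` are illustrative choices, not part of any library.

```python
def relu(z: float) -> float:
    """Rectified linear unit: pass positive values through, clamp negatives to zero."""
    return max(0.0, z)


def neuron_forward(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, followed by ReLU."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return z, relu(z)


# Values from the worked example in the text.
z, activation = neuron_forward([0.5, 0.3, 0.8], [0.2, -0.5, 0.7], 0.1)
print(round(z, 2), round(activation, 2))  # 0.61 0.61
```

A full layer repeats this computation once per neuron, each with its own weights and bias.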

Convolutional neural networks (CNNs) are specialized for grid-structured data like images. Instead of connecting every input to every neuron, a convolutional layer applies a small filter (typically 3x3 or 5x5) across the entire input. This weight-sharing strategy means the same filter detects the same feature (an edge, a texture) at every spatial position. A CNN typically stacks convolutional layers with pooling layers (which reduce spatial resolution) followed by fully connected layers that produce the final classification. AlexNet (2012) popularized this architecture, and subsequent models like VGGNet, Inception, and ResNet refined it.

Recurrent neural networks (RNNs) process sequential data by maintaining a hidden state that is updated at each time step. At step $t$ , the RNN reads input $x_{t}$ and combines it with the previous hidden state $h_{t - 1}$ to produce a new hidden state $h_{t} = f (W_{h} h_{t - 1} + W_{x} x_{t} + b)$ . This allows the network to maintain information about previous inputs. However, standard RNNs struggle with long sequences because gradients either vanish or explode over many time steps. Long Short-Term Memory (LSTM) networks, introduced by Hochreiter and Schmidhuber in 1997, addressed this with gating mechanisms that control information flow, allowing the network to learn dependencies spanning hundreds of time steps.

The learning rate controls the size of the step taken during weight updates. Too large, and the model overshoots good solutions, oscillating or diverging. Too small, and training takes impractically long. Modern optimizers like Adam adapt the learning rate dynamically, using larger steps when the gradient is consistent and smaller steps when it fluctuates.

Evaluation methodology is critical for trustworthy ML. A model is evaluated on data it has never seen during training (the test set) to estimate real-world performance. The available data is typically split into three parts: a training set (for learning weights), a validation set (for tuning hyperparameters like learning rate and model architecture), and a test set (for final evaluation, used only once). Cross-validation splits the data into $k$ folds, training on $k - 1$ folds and validating on the remaining fold, rotating through all $k$ combinations. This provides a more robust estimate of performance than a single train/validation split, especially for small datasets.
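As a minimal sketch of the fold rotation, the generator below yields index lists for each of the $k$ train/validation splits. The name `k_fold_indices` is hypothetical, and the sketch assumes the data has already been shuffled; real pipelines usually shuffle first and hold the test set out before cross-validating.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print(len(train), validation)
```

Every sample lands in exactly one validation fold, so averaging the $k$ validation scores uses all of the data for evaluation once.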

Common evaluation metrics depend on the task. For classification: accuracy (fraction correct), precision (fraction of positive predictions that are correct), recall (fraction of actual positives that are detected), and the F1 score (harmonic mean of precision and recall). For regression: mean squared error (MSE), mean absolute error (MAE), and $R^{2}$ (fraction of variance explained). A confusion matrix shows the full breakdown of correct and incorrect predictions for each class, revealing not just overall performance but which classes are most often confused.
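The classification metrics can be computed directly from the four cells of a binary confusion matrix. The sketch below uses made-up labels, and `binary_metrics` is a hypothetical helper, not a library function.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "confusion": [[tn, fp], [fn, tp]]}


# Hypothetical labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With one false positive and one false negative, accuracy is 0.8 while precision and recall are both 0.75, which shows how the metrics can diverge even on a small sample.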

Overfitting is the most common pitfall in machine learning. A model that memorizes the training data perfectly may perform poorly on new, unseen data. This happens when the model is too complex relative to the amount of training data. Regularization techniques prevent overfitting. L2 regularization penalizes large weights, encouraging simpler models. Dropout randomly disables neurons during training, preventing the network from relying too heavily on any single neuron. Early stopping halts training when performance on a validation set starts to degrade.

Underfitting is the opposite problem: the model is too simple to capture the underlying pattern in the data, performing poorly on both training and test data. A linear model trying to fit a quadratic relationship would underfit. The solution is to use a more expressive model (more layers, more neurons) or engineer better features. The art of machine learning is finding the sweet spot between underfitting and overfitting, a model complex enough to capture the true pattern but simple enough to generalize to new data. This tension is formalized by the bias-variance tradeoff, discussed in the intermediate section.

Deep learning has achieved remarkable results in many domains. On benchmarks such as ImageNet, convolutional neural networks (CNNs) now match or exceed estimated human accuracy at image classification, a task once considered a hard AI problem. Natural language processing has been transformed by transformer architectures, which can generate fluent text, translate languages, and answer questions. Game-playing AI, from Deep Blue (chess, 1997) to AlphaGo (Go, 2016) to AlphaStar (StarCraft, 2019), has progressed from brute-force search with handcrafted evaluation to systems that learn strong strategies largely through self-play.

Support vector machines (SVMs), before the deep learning revolution, were the dominant approach for classification. An SVM finds the hyperplane that maximally separates two classes by maximizing the margin, the distance between the hyperplane and the nearest data points from each class (the support vectors). The optimization problem is: minimize $\frac{1}{2} ∥ w ∥^{2}$ subject to $y_{i} (w^{T} x_{i} + b) \geq 1$ for all $i$ .

The kernel trick allows SVMs to learn nonlinear decision boundaries by implicitly mapping inputs into a high-dimensional feature space where a linear separator suffices. Common kernels include the polynomial kernel $K (x, x') = (x^{T} x' + c)^{d}$ and the radial basis function (RBF) kernel $K (x, x') = \exp (- γ ∥ x - x' ∥^{2})$ . The kernel trick computes the inner product in the high-dimensional space without explicitly computing the mapping, making it computationally feasible.
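To make the "without explicitly computing the mapping" point concrete, the sketch below compares the degree-2 polynomial kernel with an explicit six-dimensional feature map for 2-D inputs. The helper names and the sample vectors are illustrative assumptions.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, c=1.0, d=2):
    """K(x, x') = (x^T x' + c)^d."""
    return (dot(x, y) + c) ** d


def rbf_kernel(x, y, gamma=0.5):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def explicit_degree2_features(x):
    """Explicit feature map for the 2-D polynomial kernel with c = 1, d = 2."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0]


a, b = [1.0, 2.0], [3.0, 0.5]
# Both routes give the same inner product; the kernel never builds the 6-D vectors.
k_implicit = polynomial_kernel(a, b)
k_explicit = dot(explicit_degree2_features(a), explicit_degree2_features(b))
print(k_implicit, k_explicit)
print(rbf_kernel(a, a))  # identical points have similarity 1.0
```

For the RBF kernel the corresponding feature space is infinite-dimensional, so the kernel form is the only practical way to evaluate the inner product.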

Decision trees and ensemble methods provide an alternative approach. A decision tree splits the feature space recursively based on feature values, creating a flowchart-like structure where each internal node tests a feature and each leaf assigns a class label. Decision trees are interpretable but prone to overfitting. Ensemble methods address this by combining many trees. Random forests (Breiman, 2001) train many trees on bootstrap samples with random feature subsets and average their predictions. Gradient boosting (Friedman, 2001) sequentially trains trees, each fitting the residual errors of the previous ensemble. XGBoost (Chen and Guestrin, 2016) and LightGBM (Ke et al., 2017) are optimized gradient boosting implementations that achieve state-of-the-art results on tabular data, often outperforming deep learning on structured datasets.

Visual Beginner

ML type	Data	Goal	Example algorithm
Supervised	Labeled (input + correct output)	Predict labels for new data	Linear regression, neural networks, SVM
Unsupervised	Unlabeled	Find hidden patterns	K-means clustering, PCA, autoencoders
Reinforcement	Reward signals	Maximize cumulative reward	Q-learning, policy gradients

Worked example Beginner

Suppose you want to predict house prices based on square footage. You have data for five houses.

House 1: 1000 sq ft, \$150,000. House 2: 1500 sq ft, \$200,000. House 3: 2000 sq ft, \$280,000. House 4: 2500 sq ft, \$310,000. House 5: 3000 sq ft, \$400,000.

Linear regression finds the line $y = m x + b$ that best fits this data, where $x$ is square footage and $y$ is price. The "best fit" is defined as the line that minimizes the sum of squared errors between predicted and actual prices.

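Using least squares, the slope and intercept have closed forms: $m = \sum_{i} (x_{i} - \bar{x}) (y_{i} - \bar{y}) / \sum_{i} (x_{i} - \bar{x})^{2}$ and $b = \bar{y} - m \bar{x}$ . For the five houses, $\bar{x} = 2000$ and $\bar{y} = 268,000$ , which gives $m = 305,000,000 / 2,500,000 = 122$ dollars per square foot and $b = 268,000 - 122 \times 2000 = 24,000$ .

For an 1800 sq ft house: $y = 122 \times 1800 + 24,000 = 243,600$ . The model predicts about \$243,600.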
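The least-squares fit for the five houses can be checked with a short sketch. It uses the standard closed form for one feature: the slope is the sum of products of deviations from the means divided by the sum of squared deviations of $x$, and the intercept follows from the means. `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


square_feet = [1000, 1500, 2000, 2500, 3000]
prices = [150_000, 200_000, 280_000, 310_000, 400_000]

m, b = fit_line(square_feet, prices)
prediction = m * 1800 + b
print(m, b, prediction)
```

Running the sketch prints the fitted slope, the intercept, and the predicted price for an 1800 sq ft house.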

This is a simple model with one feature. Real estate pricing models might include dozens of features (bedrooms, bathrooms, location, age, lot size) and use nonlinear models to capture complex relationships. A neural network trained on thousands of house sales could learn these relationships automatically.

Logistic regression example: email spam classification. Suppose you want to classify emails as spam or not spam using two features: the frequency of exclamation marks and whether the email contains the word "free." You have five training examples.

Email 1: 0.02 frequency, no "free" (not spam). Email 2: 0.15 frequency, "free" present (spam). Email 3: 0.01 frequency, no "free" (not spam). Email 4: 0.20 frequency, "free" present (spam). Email 5: 0.03 frequency, no "free" (not spam).

Logistic regression fits $p (spam) = σ (w_{1} x_{1} + w_{2} x_{2} + b)$ , where $σ (z) = 1/ (1 + e^{- z})$ is the sigmoid function, $x_{1}$ is exclamation mark frequency, and $x_{2}$ is 1 if "free" is present, 0 otherwise.

Suppose the fitted model yields $w_{1} = 15$ , $w_{2} = 5$ , $b = - 3$ . For a new email with $x_{1} = 0.10$ frequency and "free" present ( $x_{2} = 1$ ): $z = 15 (0.10) + 5 (1) - 3 = 1.5 + 5 - 3 = 3.5$ . Then $p (spam) = σ (3.5) = 1/ (1 + e^{- 3.5}) \approx 0.97$ . The model predicts spam with 97% confidence.

The sigmoid function maps any real number to the range $(0, 1)$ , making it suitable for binary classification. The decision boundary (where $p = 0.5$ ) is the line $w_{1} x_{1} + w_{2} x_{2} + b = 0$ , which is linear in the feature space. For a new email with $x_{1} = 0.01$ and no "free" ( $x_{2} = 0$ ): $z = 15 (0.01) + 5 (0) - 3 = - 2.85$ , so $p (spam) = σ (- 2.85) \approx 0.055$ , and the model predicts not spam.
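Both predictions can be reproduced with a few lines of Python. The weights are the ones assumed in the text, not values fitted here, and `spam_probability` is a hypothetical helper name.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def spam_probability(exclaim_freq, has_free, w1=15.0, w2=5.0, b=-3.0):
    """p(spam) for the two-feature model with the weights assumed in the text."""
    z = w1 * exclaim_freq + w2 * (1.0 if has_free else 0.0) + b
    return sigmoid(z)


p_spam = spam_probability(0.10, True)     # z = 3.5
p_ham = spam_probability(0.01, False)     # z = -2.85
print(round(p_spam, 3), round(p_ham, 3))
```

Thresholding the probability at 0.5 is the same as checking the sign of $z$, which is why the decision boundary is a straight line in feature space.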

Formal definition Intermediate+

Supervised learning. Given a dataset $D = \{(x_{i}, y_{i})\}_{i = 1}^{n}$ drawn from an unknown distribution $P (X, Y)$ , find a function $f_{θ} : X \to Y$ parameterized by $θ$ that minimizes the expected loss: $θ^{*} = \arg\min_{θ} E_{(x, y) \sim P} [L (f_{θ} (x), y)]$ , where $L$ is a loss function.

Loss functions. For regression: mean squared error $L = \frac{1}{n} \sum_{i = 1}^{n} (f_{θ} (x_{i}) - y_{i})^{2}$ . For classification: cross-entropy $L = - \frac{1}{n} \sum_{i = 1}^{n} \sum_{c = 1}^{C} y_{i c} \log (f_{θ} (x_{i})_{c})$ .

Gradient descent. Update rule: $θ_{t + 1} = θ_{t} - α \nabla_{θ} L (θ_{t})$ , where $α$ is the learning rate and $\nabla_{θ} L$ is the gradient of the loss with respect to the parameters. Stochastic gradient descent (SGD) estimates the gradient using a mini-batch of training examples.
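A minimal sketch of the update rule on a one-dimensional toy loss, $L (θ) = (θ - 3)^{2}$ , shows how the learning rate $α$ decides between convergence and divergence. The loss, the starting point, and the two learning rates are illustrative assumptions.

```python
def gradient_descent(grad, theta0, learning_rate, steps):
    """Apply theta <- theta - learning_rate * grad(theta) repeatedly."""
    theta = theta0
    history = [theta]
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
        history.append(theta)
    return history


# Hypothetical toy loss L(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
def grad(theta):
    return 2.0 * (theta - 3.0)


converging = gradient_descent(grad, theta0=0.0, learning_rate=0.1, steps=50)
diverging = gradient_descent(grad, theta0=0.0, learning_rate=1.1, steps=50)
print(converging[-1], diverging[-1])
```

For this loss each step multiplies the distance to the minimum by $1 - 2 α$ , so $α = 0.1$ shrinks it by a factor of 0.8 per step while $α = 1.1$ grows it by a factor of 1.2 in magnitude.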

Backpropagation. Efficient computation of gradients in neural networks using the chain rule. For a network with layers $h_{1}, h_{2}, \dots, h_{L}$ : $\frac{\partial L}{\partial W_{l}} = \frac{\partial L}{\partial h_{L}} \cdot \frac{\partial h_{L}}{\partial h_{L - 1}} \dots \frac{\partial h_{l + 1}}{\partial h_{l}} \cdot \frac{\partial h_{l}}{\partial W_{l}}$ . Each factor is a Jacobian matrix. Backpropagation evaluates this product starting from the loss, that is, from the leftmost factor $\partial L / \partial h_{L}$ and moving right, so every step is a vector-Jacobian product and the running product $\partial L / \partial h_{l}$ is reused for all earlier layers.

Regularization. L2 regularization (ridge/Tikhonov) modifies the loss: $L_{reg} = L + \frac{λ}{2} ∥ θ ∥_{2}^{2}$ , where $λ$ controls the penalty strength. L1 regularization (Lasso) uses $L_{reg} = L + λ ∥ θ ∥_{1}$ , which encourages sparsity (many weights become exactly zero). Elastic net combines both: $L_{reg} = L + λ_{1} ∥ θ ∥_{1} + λ_{2} ∥ θ ∥_{2}^{2}$ . Dropout, specific to neural networks, randomly sets a fraction $p$ of activations to zero during training, equivalent to training an ensemble of thinned networks and approximately averaging their predictions.

Optimization algorithms. Vanilla SGD updates parameters as $θ_{t + 1} = θ_{t} - α \nabla_{θ} L$ . Momentum accelerates SGD by accumulating a velocity vector: $v_{t + 1} = μ v_{t} + \nabla_{θ} L$ , $θ_{t + 1} = θ_{t} - α v_{t + 1}$ , where $μ$ is the momentum coefficient (typically 0.9). Adam (Adaptive Moment Estimation), introduced by Kingma and Ba in 2015, maintains per-parameter adaptive learning rates using estimates of the first and second moments of the gradients: $m_{t} = β_{1} m_{t - 1} + (1 - β_{1}) g_{t}$ (first moment), $v_{t} = β_{2} v_{t - 1} + (1 - β_{2}) g_{t}^{2}$ (second moment), with bias-corrected estimates $\hat{m}_{t} = m_{t} / (1 - β_{1}^{t})$ , $\hat{v}_{t} = v_{t} / (1 - β_{2}^{t})$ , and the update $θ_{t + 1} = θ_{t} - α \hat{m}_{t} / (\sqrt{\hat{v}_{t}} + ϵ)$ . Adam combines the benefits of momentum and adaptive learning rates, and is a common default optimizer for deep learning applications.
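The Adam update can be sketched for a single scalar parameter. The sketch follows the update as published by Kingma and Ba, dividing the bias-corrected first moment by the square root of the bias-corrected second moment plus $ϵ$ . The toy loss $(θ - 3)^{2}$ , the step count, and the name `adam_minimize` are illustrative assumptions.

```python
import math


def adam_minimize(grad, theta0, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Scalar Adam: returns the parameter after `steps` updates."""
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)             # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Hypothetical toy loss L(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta_final = adam_minimize(lambda th: 2.0 * (th - 3.0), theta0=0.0)
print(theta_final)
```

Because the step is normalized by the gradient's running magnitude, the first updates move by roughly $α$ regardless of how large the raw gradient is.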

Bias-variance tradeoff

The expected prediction error decomposes as: $E [(y - \hat{f} (x))^{2}] = Bias^{2} (\hat{f}) + Var (\hat{f}) + σ^{2}$ , where $Bias^{2} = (E [\hat{f}] - f)^{2}$ measures systematic error, $Var = E [(\hat{f} - E [\hat{f}])^{2}]$ measures sensitivity to training data, and $σ^{2}$ is irreducible noise.

Simple models have high bias (underfitting) and low variance. Complex models have low bias and high variance (overfitting). The optimal model balances bias and variance.

Universal approximation theorem

Theorem. A feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of $R^{n}$ , under mild conditions on the activation function.

This theorem guarantees representational capacity but says nothing about learnability (whether gradient descent can find the right weights) or generalization (whether the learned function works on new data). The theorem was first proved by Cybenko (1989) for sigmoid activation functions and extended by Hornik (1991) to arbitrary nonconstant bounded activation functions. The practical implication is that the question "can a neural network solve my problem?" is almost always yes; the real questions are "how many neurons and layers does it need?" and "can I train it with my data?"

The depth vs. width tradeoff is significant. While a single wide hidden layer is theoretically sufficient, deep networks (many narrow layers) are exponentially more efficient for certain function classes. Telgarsky (2016) showed, roughly, that there exist functions representable by deep networks with $O (n)$ layers and $O (n)$ neurons that require $\Omega (2^{n})$ neurons to represent with a shallow network. This result helps explain why depth is important in practice: deep networks can learn hierarchical representations that would require exponentially many neurons in a shallow architecture.

Key result: VC dimension and generalization bounds Intermediate+

Theorem (Vapnik-Chervonenkis). Let $H$ be a hypothesis class with VC dimension $d$ . For any $δ > 0$ , with probability at least $1 - δ$ over a training set of size $n$ , every hypothesis $h \in H$ satisfies:

$R (h) \leq R_{emp} (h) + \sqrt{\frac{d ( \ln ( 2 n / d ) + 1 ) + \ln ( 4/ δ )}{n}}$

where $R (h)$ is the true risk (error on the full distribution) and $R_{emp} (h)$ is the empirical risk (error on the training set).

This bound shows that generalization error decreases as the training set size $n$ grows and increases as the model complexity (VC dimension $d$ ) grows. It formalizes the intuition that simpler models generalize better from limited data. The VC dimension of a linear classifier in $R^{d}$ is $d + 1$ . For neural networks, the VC dimension grows with the number of parameters, suggesting that very large networks should overfit, yet in practice they often generalize well. This apparent contradiction is an active area of research.
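A small numerical sketch shows how the complexity term of the bound behaves. It uses the square-root form from Vapnik's statement of the bound; the function name and the chosen values of $n$ , $d$ , and $δ$ are illustrative assumptions.

```python
import math


def vc_generalization_gap(n: int, d: int, delta: float) -> float:
    """Complexity term sqrt((d * (ln(2n/d) + 1) + ln(4/delta)) / n) of the VC bound."""
    if not (n > 0 and d > 0 and 0.0 < delta < 1.0):
        raise ValueError("need n > 0, d > 0 and 0 < delta < 1")
    return math.sqrt((d * (math.log(2.0 * n / d) + 1.0) + math.log(4.0 / delta)) / n)


# Hypothetical setting: linear classifier on 10 features, so VC dimension d = 11.
gaps = {n: vc_generalization_gap(n, d=11, delta=0.05) for n in (100, 1_000, 10_000, 100_000)}
for n, gap in gaps.items():
    print(n, round(gap, 3))
```

The term shrinks roughly like $\sqrt{d \ln n / n}$ , so the bound only becomes informative once $n$ is much larger than $d$ .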

Exercises Intermediate+

Exercise 3 (hard, short answer).

Explain the double descent phenomenon and why it challenges the classical bias-variance tradeoff.

Hint

Consider what happens as model size increases beyond the point where it can perfectly fit the training data.

Answer

Classical bias-variance theory predicts a U-shaped curve: test error falls as model complexity grows, reaches a minimum, and then rises as the model starts to overfit. However, empirical results show that for overparameterized models (like modern deep networks with more parameters than training examples), the curve continues. Test error follows the classical U-shape up to the interpolation threshold (where the model just barely fits all training data), peaks there, and then decreases a second time as the model becomes more heavily overparameterized. This "double descent" curve challenges the classical picture because the largest models can generalize better than models at the classical sweet spot. A common explanation is that overparameterized models have many solutions that fit the training data perfectly, and gradient descent tends to select simple ones among them, such as low-norm solutions (implicit regularization).

Exercise 7 (medium, short answer).

Explain why stochastic gradient descent (SGD) with mini-batches is preferred over full-batch gradient descent for training large neural networks.

Hint

Consider both computational efficiency and the optimization dynamics.

Answer

Three reasons. First, computational efficiency: computing the gradient over the full dataset is expensive for large datasets; mini-batch SGD provides a noisy but unbiased estimate of the gradient at a fraction of the cost. Second, noise as regularization: the stochasticity in mini-batch gradients helps the optimizer escape shallow local minima and saddle points, acting as implicit regularization that improves generalization. Third, data parallelism: mini-batch computation maps efficiently to GPU hardware, where the same operation is applied to many examples simultaneously.

Exercise 8 (hard, short answer).

Prove that the XOR function cannot be learned by a single perceptron (a linear classifier). Then explain how a two-layer network with a hidden layer of two neurons can learn XOR.

Hint

XOR outputs 1 when exactly one of its two binary inputs is 1. Try to find a single line that separates the positive and negative examples in 2D space.

Answer

The XOR function has four input-output pairs: (0,0)->0, (0,1)->1, (1,0)->1, (1,1)->0. In 2D space, the positive examples (0,1) and (1,0) sit at opposite corners of the unit square, and the negative examples (0,0) and (1,1) at the other two corners. Suppose a perceptron $step (w_{1} x_{1} + w_{2} x_{2} + b)$ , which outputs 1 exactly when its argument is positive, classified all four points correctly. The positive examples require $w_{2} + b > 0$ and $w_{1} + b > 0$ , so adding them gives $w_{1} + w_{2} + 2 b > 0$ . The negative examples require $b \leq 0$ and $w_{1} + w_{2} + b \leq 0$ , so adding them gives $w_{1} + w_{2} + 2 b \leq 0$ . This is a contradiction, so no single line separates the two classes. A two-layer network solves this by having the hidden layer learn two features: $h_{1} = step (x_{1} + x_{2} - 0.5)$ (OR gate) and $h_{2} = step (- x_{1} - x_{2} + 1.5)$ (NAND gate). The output neuron computes $y = step (h_{1} + h_{2} - 1.5)$ (AND gate). The hidden layer maps the inputs into a new 2D space where the positive and negative examples become linearly separable.
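The two-layer construction can be verified by enumerating all four inputs. The sketch hard-codes the weights given in the answer rather than learning them, and takes `step` to output 1 only for strictly positive arguments.

```python
def step(z: float) -> int:
    """Heaviside step: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0


def xor_network(x1: int, x2: int) -> int:
    h1 = step(x1 + x2 - 0.5)      # OR
    h2 = step(-x1 - x2 + 1.5)     # NAND
    return step(h1 + h2 - 1.5)    # AND of the two hidden units


for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), "->", xor_network(x1, x2))
```

In the hidden space the four inputs map to $(h_{1}, h_{2})$ values (0,1), (1,1), (1,1), and (1,0), and the single point (1,1) is easy to separate from the other two.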

Exercise 9 (hard, short answer).

Describe the vanishing gradient problem in deep neural networks. How do residual connections (skip connections) in ResNet architectures address this problem?

Hint

Consider what happens when you multiply many small numbers together during backpropagation.

Answer

During backpropagation, gradients are computed as products of Jacobians across layers. If the Jacobians have singular values less than 1, the gradient shrinks exponentially with depth: a 100-layer network with singular values of 0.9 would have gradients reduced by a factor of $0.9^{100} \approx 2.66 \times 10^{-5}$ . This means early layers receive vanishingly small gradient signals and learn extremely slowly. Residual connections address this by adding the input of a block directly to its output: $y = F (x) + x$ . The Jacobian of the block is then $\partial y / \partial x = \partial F / \partial x + I$ , so during backpropagation the gradient flows through both $F (x)$ and the skip connection. Even if $\partial F / \partial x$ vanishes, the identity term passes the upstream gradient through unchanged, so the signal reaching early layers no longer depends on a long product of small factors. This allows gradients to flow through very deep networks (hundreds of layers) without vanishing.
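A scalar caricature makes the contrast concrete: treat each layer's Jacobian as a single number, and compare the plain product with the residual product where every factor is one plus that number. Real networks multiply matrices, so this is only a sketch under that simplifying assumption, and `chained_gradient` is a hypothetical helper.

```python
def chained_gradient(per_layer_factor: float, depth: int, residual: bool = False) -> float:
    """Product of scalar layer derivatives; residual layers contribute 1 + factor."""
    gradient = 1.0
    for _ in range(depth):
        gradient *= (1.0 + per_layer_factor) if residual else per_layer_factor
    return gradient


plain = chained_gradient(0.9, depth=100)                      # about 2.66e-5
tiny_plain = chained_gradient(0.01, depth=100)                # effectively zero
tiny_residual = chained_gradient(0.01, depth=100, residual=True)
print(plain, tiny_plain, tiny_residual)
```

With a per-layer derivative of 0.01 the plain product is numerically negligible, while the residual product stays of order one because the identity term dominates each factor.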

Exercise 10 (hard, short answer).

What is the difference between model-based and model-free reinforcement learning? Give an example algorithm for each and explain when you would prefer one over the other.

Hint

Model-based methods learn a model of the environment's dynamics; model-free methods learn a policy or value function directly.

Answer

Model-based RL learns a model of the environment's transition dynamics $T (s^{'} ∣ s, a)$ and reward function $R (s, a)$ , then uses this model to plan (e.g., by simulating trajectories or solving a Bellman equation). Example: Dyna-Q combines model-free Q-learning with a learned model for simulated experience. Model-free RL learns a policy or value function directly from experience without building an explicit model. Examples: Q-learning (learns action-value function) and REINFORCE (learns policy directly via policy gradients). Model-based methods are more sample-efficient because the learned model enables planning and imagined experience, but they can be biased if the model is inaccurate. Model-free methods are simpler and avoid model bias, but require more environment interactions. Model-based methods are preferred when environment interaction is expensive (robotics, scientific experiments); model-free methods are preferred when the environment is complex and hard to model accurately.

Domain evidence Master

Computer vision. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) catalyzed progress in image classification. In 2012, AlexNet (Krizhevsky et al.) reduced the top-5 error rate from 26% to 15%, demonstrating the power of deep convolutional networks and GPU training. By 2015, ResNet achieved 3.57% top-5 error, surpassing human-level performance (estimated at 5.1%). This rapid progress established deep learning as the dominant approach for computer vision.

Natural language processing. The evolution of language models from word2vec (2013) to BERT (2018) to GPT-4 (2023) represents a paradigm shift. Word2vec learned static word embeddings, capturing semantic similarity. BERT introduced bidirectional contextual representations, enabling transfer learning. GPT-4 demonstrated that sufficiently large autoregressive models can perform complex reasoning, code generation, and multi-step problem solving. Each generation expanded the scope of tasks that NLP systems could handle.

Game playing. AlphaGo (Silver et al., 2016) defeated world champion Lee Sedol at Go, a game with roughly $10^{170}$ legal positions, far more than the roughly $10^{80}$ atoms in the observable universe. AlphaGo combined deep reinforcement learning with Monte Carlo tree search. Its successor, AlphaZero, learned entirely from self-play without human data, achieving superhuman performance in Go, chess, and shogi. These results demonstrated that reinforcement learning with deep function approximation can master complex strategic domains.

Scientific discovery. AlphaFold 2 (Jumper et al., 2021) solved the protein structure prediction problem, a 50-year grand challenge in biology. It achieved a median Global Distance Test score of 92.4 on the CASP14 benchmark, comparable to experimental methods. This result demonstrated that deep learning can accelerate scientific discovery in domains beyond traditional AI tasks. Subsequently, AlphaFold has predicted the structures of over 200 million proteins, transforming structural biology.

Healthcare. Deep learning has achieved expert-level performance on several medical imaging tasks: detecting diabetic retinopathy from retinal photographs (Gulshan et al., 2016), classifying skin lesions (Esteva et al., 2017), and detecting breast cancer in mammograms (McKinney et al., 2020). These systems are intended to complement clinicians rather than replace them, and in some studies they flag conditions that human reviewers miss.

Advanced results Master

Transformer architectures and attention

The transformer architecture (Vaswani et al., 2017) replaced recurrence with self-attention, enabling parallel processing of sequences. Self-attention computes a weighted sum of all positions in a sequence, where the weights (attention scores) are determined by the compatibility between query and key vectors. Multi-head attention applies this mechanism multiple times with different projections, allowing the model to attend to different types of relationships simultaneously.

The transformer's self-attention mechanism computes attention scores as $Attention (Q, K, V) = softmax (Q K^{T} / \sqrt{d_{k}}) V$ , where $Q$ , $K$ , and $V$ are query, key, and value matrices derived from the input, and $d_{k}$ is the dimension of the key vectors. The softmax normalizes each row of scores to sum to 1, producing a probability distribution over positions. The scaling factor $1/ \sqrt{d_{k}}$ prevents the dot products from growing too large, which would push the softmax into regions with extremely small gradients.
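Scaled dot-product attention can be sketched with plain Python lists, dividing the scores by $\sqrt{d_{k}}$ as in Vaswani et al. The toy matrices are made-up values, and the sketch covers a single head with no masking or learned projections.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for matrices given as lists of rows."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    output, weights = [], []
    for q in Q:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        attn = softmax(scores)
        weights.append(attn)
        # Weighted sum of the value rows.
        output.append([sum(a * v[j] for a, v in zip(attn, V)) for j in range(len(V[0]))])
    return output, weights


# Hypothetical toy sequence of three positions with d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = scaled_dot_product_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each output row is a convex combination of the value rows, so every position can draw on every other position in one step, which is what allows the computation to run in parallel across the sequence.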

The scaling law (Kaplan et al., 2020) showed that transformer performance improves predictably as a power law of model size, dataset size, and compute. This observation drove the development of increasingly large language models (GPT-3, GPT-4, PaLM, LLaMA), which demonstrate emergent capabilities at scale. GPT-3, with 175 billion parameters, demonstrated few-shot learning: the ability to perform new tasks with only a few examples provided in the prompt, without any weight updates. This capability was not explicitly trained but emerged from the scale of the model and dataset.

The Chinchilla scaling laws (Hoffmann et al., 2022) refined Kaplan's findings, showing that previous large language models were undertrained relative to their size. Chinchilla, a 70-billion-parameter model trained on 1.4 trillion tokens, outperformed GPT-3 (175 billion parameters) by using a more favorable ratio of model size to training data. This insight shifted the field toward training smaller models on more data, improving both performance and efficiency.

Generative models

Generative models learn to produce new data samples that resemble the training distribution. Variational autoencoders (VAEs) learn a latent representation and generate new samples by sampling from the latent space. Generative adversarial networks (GANs) pit a generator against a discriminator in a minimax game. Diffusion models (DDPM, Stable Diffusion, DALL-E 2) learn to reverse a noise-adding process, generating images by iteratively denoising random noise.

Each approach has trade-offs. VAEs produce blurry but diverse samples. GANs produce sharp samples but can suffer from mode collapse (generating limited diversity). Diffusion models produce high-quality, diverse samples but require many iterative steps.

The generative adversarial network framework, introduced by Goodfellow et al. in 2014, formulates generative modeling as a game between two neural networks. The generator $G$ maps random noise $z$ to synthetic data $G (z)$ . The discriminator $D$ classifies inputs as real or synthetic. The generator aims to fool the discriminator, while the discriminator aims to correctly classify. The training objective is $\min_{G} \max_{D} E_{x \sim p_{data}} [\log D (x)] + E_{z \sim p_{z}} [\log (1 - D (G (z)))]$ . At equilibrium, the generator produces samples indistinguishable from real data, and the discriminator outputs 0.5 for all inputs. StyleGAN (Karras et al., 2019) produced photorealistic faces that are often difficult to distinguish from real photographs, demonstrating the power of the GAN framework.

Diffusion models, proposed by Sohl-Dickstein et al. (2015) and popularized as denoising diffusion probabilistic models (DDPM) by Ho, Jain, and Abbeel in 2020, work by gradually adding Gaussian noise to data over $T$ steps (the forward process), then learning to reverse this process (the reverse process). The forward process is fixed: $q (x_{t} ∣ x_{t - 1}) = N (x_{t}; \sqrt{1 - β_{t}} x_{t - 1}, β_{t} I)$ , where $β_{t}$ is a variance schedule. The reverse process is learned: a neural network predicts the noise added at each step, and a less noisy sample is obtained by removing a scaled version of the predicted noise. Stable Diffusion, released in 2022, combined diffusion models with latent space representations, dramatically reducing the computational cost of image generation and enabling consumer hardware to produce high-quality images.
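The fixed forward process can be sketched by sampling one step at a time, with mean $\sqrt{1 - β_{t}} x_{t - 1}$ and variance $β_{t}$ as in Ho et al. The linear schedule, the constant toy "signal", and the function names are illustrative assumptions; no learned reverse process is included.

```python
import math
import random


def forward_diffusion_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I) for a list of floats."""
    keep = math.sqrt(1.0 - beta_t)
    noise_scale = math.sqrt(beta_t)
    return [keep * x + noise_scale * rng.gauss(0.0, 1.0) for x in x_prev]


def forward_process(x0, betas, seed=0):
    """Apply the fixed forward process for every beta in the schedule."""
    rng = random.Random(seed)
    x = list(x0)
    for beta_t in betas:
        x = forward_diffusion_step(x, beta_t, rng)
    return x


# Hypothetical linear variance schedule with T = 1000 steps.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
x0 = [5.0] * 500                      # a constant toy "signal"
xT = forward_process(x0, betas)
mean = sum(xT) / len(xT)
variance = sum((v - mean) ** 2 for v in xT) / len(xT)
print(round(mean, 2), round(variance, 2))
```

After all $T$ steps the original value of 5.0 is essentially erased and the samples look like standard Gaussian noise, which is the starting point the learned reverse process denoises from.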

Reinforcement learning from human feedback (RLHF)

RLHF aligns language models with human preferences by training a reward model on human comparisons, then optimizing the language model using reinforcement learning (typically PPO). This process tends to produce models that are more helpful, harmless, and honest, as seen in ChatGPT and similar systems.

The alignment problem, ensuring that AI systems pursue goals consistent with human values, is one of the most important open problems in AI safety. RLHF is a first step, but it has limitations: the reward model is an imperfect proxy for true human preferences, and optimizing against it can lead to reward hacking.

The RLHF process has three stages. First, supervised fine-tuning (SFT) trains the language model on high-quality demonstrations of the desired behavior. Second, a reward model is trained on human preference data: human annotators compare pairs of model outputs and indicate which is preferred. The reward model learns to assign higher scores to outputs that humans prefer. Third, the fine-tuned model is optimized against the reward model using proximal policy optimization (PPO), a reinforcement learning algorithm that balances exploration with stable training by constraining policy updates.
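The second stage can be made concrete with a pairwise loss. The sketch below assumes a Bradley-Terry preference model, a common choice for reward modeling that the text above does not itself specify: the probability that the preferred output wins is the sigmoid of the difference in reward scores, and the loss is the negative log of that probability. The function name `preference_loss` is hypothetical.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the human preference under Bradley-Terry.

    Equals -log(sigmoid(reward_chosen - reward_rejected)), computed in a
    numerically stable form.
    """
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Equal scores: the reward model is indifferent, so the loss is log(2).
print(preference_loss(0.0, 0.0))

# Scoring the preferred output higher lowers the loss; the reverse raises it.
print(preference_loss(2.0, 0.0) < preference_loss(0.0, 0.0) < preference_loss(0.0, 2.0))
```

Minimizing this loss over many annotated pairs pushes the reward model to score human-preferred outputs above rejected ones, which is the signal the third stage optimizes against.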

Constitutional AI (Bai et al., 2022), developed by Anthropic, extends RLHF by using the AI model itself to evaluate and improve its outputs according to a set of principles (a "constitution"). This reduces the need for human annotators and provides a more scalable approach to alignment. The model generates responses, critiques them according to constitutional principles, and revises them, creating a self-improving loop that can operate without human feedback at each step.

Foundation models and transfer learning

Foundation models are large models pretrained on massive datasets that can be adapted to many downstream tasks with minimal task-specific training. BERT (bidirectional encoder) and GPT (autoregressive decoder) are the two main families. Transfer learning from pretrained foundation models has become the standard approach for most NLP tasks, reducing the labeled data required by orders of magnitude.

The concept extends beyond NLP: vision transformers (ViT) pretrained on ImageNet transfer to medical imaging, satellite imagery, and other domains. In biology, protein language models such as ESM and structure prediction models such as AlphaFold transfer learned structural knowledge to tasks such as novel protein design.

BERT, introduced by Devlin et al. in 2018, pretrains a bidirectional transformer encoder using two objectives: masked language modeling (predicting randomly masked tokens) and next sentence prediction (predicting whether two sentences are consecutive). BERT demonstrated that bidirectional pretraining produces representations that transfer effectively to downstream tasks like question answering, sentiment analysis, and named entity recognition. GPT, introduced by Radford et al. in 2018, pretrains a unidirectional transformer decoder using autoregressive language modeling: predicting the next token given all preceding tokens. While BERT excels at understanding tasks (classification, extraction), GPT excels at generation tasks (writing, translation, summarization).
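The difference between the two pretraining objectives shows up in how training examples are built from the same text. The sketch below is a simplified illustration: the function names are made up, the masked positions are chosen by hand rather than sampled at random, and real tokenization is replaced by a list of words.

```python
def masked_lm_example(tokens, masked_positions, mask_token="[MASK]"):
    """Build one masked-language-modeling example.

    Returns the masked sequence and a dict mapping each masked position to
    the original token the model must predict from context on both sides.
    """
    masked = list(tokens)
    targets = {}
    for pos in masked_positions:
        targets[pos] = tokens[pos]
        masked[pos] = mask_token
    return masked, targets

def next_token_examples(tokens):
    """Build autoregressive examples: each prefix predicts the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = ["the", "cat", "sat", "on", "the", "mat"]

masked, targets = masked_lm_example(tokens, [1, 4])
print(masked)   # ['the', '[MASK]', 'sat', 'on', '[MASK]', 'mat']
print(targets)  # {1: 'cat', 4: 'the'}

pairs = next_token_examples(tokens)
print(pairs[0])   # (['the'], 'cat')
print(pairs[-1])  # (['the', 'cat', 'sat', 'on', 'the'], 'mat')
```

The masked example lets the model see tokens on both sides of each gap, while every autoregressive example exposes only the left context, which is what makes the second formulation a natural fit for generation.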

Federated learning and privacy-preserving ML

Federated learning trains models across multiple devices without centralizing data. Each device computes model updates locally and sends only the updates to a central server, which aggregates them. This keeps raw data on the device, which reduces privacy exposure, although the updates themselves can still leak information about the local data.

Differential privacy provides formal guarantees that limit how much the model can reveal about any individual training example. By clipping each example's gradient and adding calibrated noise to the aggregated update, differentially private training bounds the influence of any single data point on the learned model.
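A differentially private gradient step is commonly built from two operations: clipping each example's gradient to a maximum norm, then adding Gaussian noise scaled to that norm. The sketch below illustrates only these mechanics. The function names and parameters (`max_norm`, `noise_multiplier`) are assumptions for the example, and the privacy accounting that turns a noise level into a formal guarantee is omitted.

```python
import math
import random

def clip_by_norm(grad, max_norm):
    """Scale grad down so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_by_norm(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    noise_std = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, noise_std) for s in summed]
    return [v / len(clipped) for v in noisy]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-example gradients

# The first gradient has norm 5 and is clipped to norm 1; the second is unchanged.
print(clip_by_norm(grads[0], 1.0))

# With noise, no single example can move the result by more than max_norm / batch size
# before the noise is added, which is what the noise scale is calibrated against.
print(private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng))
```

Clipping is what bounds each example's contribution; the noise then hides whatever bounded contribution remains.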

Federated averaging (FedAvg), introduced by McMahan et al. in 2017, is the standard federated learning algorithm. Each round, a subset of clients downloads the current model, trains on its local data for several epochs, and uploads the updated model parameters. The server averages the received parameters, weighting each client by its number of local examples, to produce a new global model. The key insight is that clients can take many local steps between communication rounds. With a single local step, weighted averaging is equivalent to one step of gradient descent on the participating clients' pooled data, because the gradient of a sum is the sum of the gradients. With several local steps the equivalence no longer holds exactly, but the average works well empirically and greatly reduces communication. Challenges include non-IID data distributions across clients (client data may not be representative of the global distribution), communication efficiency (mobile devices have limited bandwidth), and system heterogeneity (devices have different computational capabilities).
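The server-side aggregation step of federated averaging can be sketched in a few lines. This is a minimal illustration that assumes model parameters are flat lists of floats and that each client is weighted by its number of local examples, as in McMahan et al.; the function name `federated_average` is made up for the example, and local training is not shown.

```python
def federated_average(client_params, client_sizes):
    """Average client parameter vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[i] * n for params, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]

# Two hypothetical clients: the second holds three times as much data,
# so the global model lands closer to its parameters.
client_params = [[1.0, 1.0], [3.0, 3.0]]
client_sizes = [1, 3]
global_params = federated_average(client_params, client_sizes)
print(global_params)  # [2.5, 2.5]
```

Weighting by dataset size means a client with more data pulls the global model further, which is also why skewed (non-IID) client data can bias the result.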

Connections Master

Connections to statistics

Machine learning and statistics share many techniques (regression, classification, density estimation) but differ in emphasis. Statistics focuses on inference (understanding the data-generating process), while machine learning focuses on prediction (making accurate predictions on new data). This difference affects methodology: ML emphasizes cross-validation and test-set performance, while statistics emphasizes confidence intervals and hypothesis tests.

The convergence of the two fields has accelerated. Causal inference, long a staple of statistics and epidemiology, is increasingly important in machine learning for understanding why models make certain predictions and for building models that generalize across different environments (distribution shift). Judea Pearl's work on causal graphs and the do-calculus provides a mathematical framework for reasoning about causation from observational data, bridging the gap between statistical association and causal explanation.

Connections to neuroscience

Artificial neural networks were inspired by biological neurons. Convolutional networks mirror the hierarchical visual processing in the ventral stream. Reinforcement learning algorithms parallel dopamine-based reward signaling. While the analogy is imperfect (biological neurons are far more complex than artificial ones), neuroscience continues to inspire new architectures and learning algorithms.

Hubel and Wiesel's 1959 discovery of simple and complex cells in the visual cortex directly inspired the convolutional architecture. Simple cells detect edges at specific orientations, analogous to convolutional filters. Complex cells pool over spatial positions, analogous to pooling layers. The ventral stream (V1 to V4 to IT cortex) processes visual information through progressively more abstract representations, analogous to the layer hierarchy in deep CNNs. This parallel is not coincidental: the CNN architecture was explicitly designed to mirror the known organization of the visual system.

Connections to ethics

AI systems raise ethical concerns including bias and fairness (models can perpetuate or amplify societal biases present in training data), privacy (models may memorize and reveal sensitive training data), accountability (who is responsible when an AI system makes a harmful decision?), and existential risk (could superintelligent AI pose a threat to humanity?).

The COMPAS recidivism prediction system, used by courts in the United States to assess the likelihood that a defendant will reoffend, illustrates the bias problem. A 2016 ProPublica investigation found that the system was more likely to falsely flag Black defendants as high-risk and falsely flag white defendants as low-risk. While the system achieved similar overall accuracy across racial groups, the error rates were distributed differently, violating equalized odds fairness. This case demonstrated that technical accuracy is insufficient for fairness: the choice of fairness metric matters, and different fairness criteria can conflict with each other. In particular, when the underlying base rates differ between groups, a score generally cannot be both well calibrated within each group and have equal false positive and false negative rates across groups.
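The gap between overall accuracy and per-group error rates is easy to reproduce with a small calculation. The confusion-matrix counts below are invented for illustration and are not COMPAS data; the function name `rates` is likewise hypothetical. Both groups get the same accuracy, yet one group bears more false positives and the other more false negatives.

```python
def rates(tp, fp, tn, fn):
    """Accuracy and error rates from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }

# Made-up counts for two groups of 100 people each.
group_a = rates(tp=30, fp=20, tn=40, fn=10)
group_b = rates(tp=20, fp=10, tn=50, fn=20)

for name, group in (("A", group_a), ("B", group_b)):
    print(name, {key: round(value, 3) for key, value in group.items()})
```

Equalized odds asks for the false positive and false negative rates to match across groups, so a classifier like this one fails it even though its accuracy is identical for both groups.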

Connections to software engineering

Deploying machine learning models in production introduces engineering challenges distinct from traditional software development. Machine learning models have complex dependencies on training data, hyperparameters, and random initialization, making reproducibility difficult. Model performance degrades over time as the input distribution shifts (data drift) or as the relationship between inputs and outputs changes (concept drift). The ML lifecycle, from data collection to model training to deployment to monitoring, requires specialized tooling.

MLOps (Machine Learning Operations) applies DevOps principles to ML systems. Key practices include version control for data and models (DVC, MLflow), automated training pipelines, model validation before deployment, and monitoring for data drift and model performance degradation. Feature stores (Feast, Tecton) centralize feature engineering, ensuring that training and serving use the same feature transformations. These practices address the reproducibility and maintenance challenges that distinguish ML systems from traditional software.

Connections to information theory

Information theory provides the mathematical foundations for several ML techniques. Cross-entropy loss, the standard loss function for classification, is derived directly from information theory: it measures the expected number of bits needed to encode samples from the true distribution using a code optimized for the predicted distribution. Minimizing cross-entropy is equivalent to minimizing the Kullback-Leibler divergence between the true and predicted distributions. The information bottleneck theory of deep learning (Tishby and Zaslavsky, 2015) proposes that deep networks learn by compressing the input while preserving information about the output, formalized as an optimization problem involving mutual information.
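The equivalence between minimizing cross-entropy and minimizing KL divergence follows from the identity $H(p, q) = H(p) + D_{KL}(p \,\|\, q)$: the entropy $H(p)$ of the true distribution does not depend on the model, so only the divergence term can be reduced. The sketch below checks the identity numerically in bits; the two distributions are arbitrary values chosen for the example.

```python
import math

def entropy(p):
    """Entropy of p in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Expected code length in bits when samples from p use a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Extra bits per sample paid for using q's code instead of p's."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]    # true distribution
q = [0.25, 0.75]  # predicted distribution

print(cross_entropy(p, q))
print(entropy(p) + kl_divergence(p, q))  # same value as the line above
print(cross_entropy(p, p) == entropy(p))  # a perfect prediction pays no extra bits
```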

Connections to optimization

Training a neural network is fundamentally an optimization problem: finding parameters $θ$ that minimize a loss function $L(θ)$. The loss landscape of a deep network is highly non-convex, with many local minima, saddle points, and flat regions. Understanding this landscape is an active area of research. Theoretical and empirical work suggests that, under certain assumptions, most local minima in overparameterized networks have similar loss values (the "no bad local minima" property), which helps explain why gradient descent works well despite non-convexity. In the infinite-width limit, the training dynamics of gradient descent are described by the neural tangent kernel (Jacot et al., 2018), connecting optimization of neural networks to kernel methods.
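Non-convexity can be seen even in one dimension. The sketch below runs plain gradient descent on a hand-picked quartic with two local minima; the function, learning rate, and step count are assumptions chosen for the example and have nothing to do with a real network. The starting point alone decides which minimum is reached.

```python
def loss(theta):
    """A non-convex toy loss with two local minima."""
    return theta**4 - 3 * theta**2 + theta

def grad(theta):
    """Derivative of the toy loss."""
    return 4 * theta**3 - 6 * theta + 1

def gradient_descent(theta, learning_rate=0.01, steps=2000):
    """Repeatedly step against the gradient from the given starting point."""
    for _ in range(steps):
        theta -= learning_rate * grad(theta)
    return theta

left = gradient_descent(-2.0)
right = gradient_descent(2.0)

print(round(left, 3), round(loss(left), 3))    # minimum reached from the left
print(round(right, 3), round(loss(right), 3))  # minimum reached from the right
```

Here the two minima have different loss values, so initialization matters; the observation about overparameterized networks is that such gaps between minima tend to be much smaller in practice.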

Historical and philosophical context Master

The AI winters

AI has experienced periods of enthusiasm followed by disappointment. The first AI winter (1974-1980) followed the failure of symbolic AI to achieve its ambitious goals. The second AI winter (1987-1993) followed the failure of expert systems and the collapse of the Lisp machine market. Each winter was triggered by a gap between expectations and reality, leading to reduced funding and interest.

The first AI winter was precipitated by several developments. In 1969, Marvin Minsky and Seymour Papert published "Perceptrons," which proved that single-layer perceptrons cannot represent the XOR function, leading to a widespread (but incorrect) belief that all neural networks were fundamentally limited. The Lighthill Report (1973) to the British Science Research Council criticized AI's failure to achieve its grand promises, leading to funding cuts in the UK. DARPA reduced AI funding in the US in the early 1970s after research programs such as speech understanding failed to meet expectations. The lesson of the first winter was that AI hype outpaced reality: early researchers underestimated the complexity of intelligence and overestimated the power of their techniques.

The current AI boom, driven by deep learning, may or may not face a similar correction. What distinguishes the current era is the demonstrable practical success of AI in industry: deep learning powers products used by billions of people.

The philosophical significance of machine intelligence

Can machines think? Alan Turing addressed this question in his 1950 paper "Computing Machinery and Intelligence," proposing the Turing Test: if a human cannot distinguish the machine's responses from a human's, the machine should be considered intelligent. Critics argue that the Turing Test measures simulation of intelligence, not intelligence itself.

The Chinese Room argument (Searle, 1980) contends that a computer executing a program can produce correct outputs without understanding them, just as a person in a room who follows a rulebook for manipulating Chinese symbols can produce correct Chinese responses without understanding Chinese. The argument challenges the computational theory of mind.

The debate between connectionist and symbolic AI reflects deeper philosophical commitments about the nature of thought. Symbolic AI, rooted in the logicist tradition, holds that intelligence operates on formal symbols according to rules. Connectionist AI, rooted in the empiricist tradition, holds that intelligence emerges from the statistical regularities learned by neural networks from data. Modern large language models are squarely in the connectionist tradition: they learn patterns from data without explicit symbolic representations. Yet they exhibit behaviors that look remarkably like symbolic reasoning, suggesting that the distinction between connectionist and symbolic intelligence may be less sharp than previously believed.

The alignment problem

As AI systems become more capable, ensuring they pursue goals aligned with human values becomes critical. The alignment problem has both technical aspects (how to specify objectives that capture what we actually want) and philosophical aspects (what values should AI systems promote, and who decides). The growing capabilities of large language models have brought these questions from academic theory into practical urgency.

The instrumental convergence thesis, articulated by Nick Bostrom, argues that sufficiently capable AI systems will pursue certain instrumental goals (self-preservation, resource acquisition, goal preservation) regardless of their final objective, because these instrumental goals help achieve almost any final objective. This creates a risk: an AI system given a seemingly benign goal might pursue it in ways that are harmful to humans if the goal is not carefully specified. The classic illustration is the paperclip maximizer: an AI tasked with manufacturing paperclips might convert all available resources (including humans) into paperclips and paperclip factories, not out of malice but out of single-minded pursuit of its objective.

Bibliography Master

Primary sources

Turing, A.M. (1950). "Computing machinery and intelligence." Mind, 59, 433-460.
Rumelhart, D.E., Hinton, G.E., and Williams, R.J. (1986). "Learning representations by back-propagating errors." Nature, 323, 533-536.
Vaswani, A. et al. (2017). "Attention is all you need." Advances in Neural Information Processing Systems, 30.
Hochreiter, S. and Schmidhuber, J. (1997). "Long short-term memory." Neural Computation, 9(8), 1735-1780.
Goodfellow, I. et al. (2014). "Generative adversarial nets." Advances in Neural Information Processing Systems, 27.
Ho, J., Jain, A., and Abbeel, P. (2020). "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems, 33.
Silver, D. et al. (2016). "Mastering the game of Go with deep neural networks and tree search." Nature, 529, 484-489.
Devlin, J. et al. (2019). "BERT: Pre-training of deep bidirectional transformers for language understanding." NAACL-HLT.
Brown, T. et al. (2020). "Language models are few-shot learners." Advances in Neural Information Processing Systems, 33.

Secondary sources

Russell, S. and Norvig, P. (2020). Artificial Intelligence: A Modern Approach (4th ed.). Pearson.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer.
Sutton, R.S. and Barto, A.G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
Vapnik, V. (1998). Statistical Learning Theory. Wiley.
Breiman, L. (2001). "Random forests." Machine Learning, 45(1), 5-32.
Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.
Kaplan, J. et al. (2020). "Scaling laws for neural language models." arXiv preprint arXiv:2001.08361.
Hoffmann, J. et al. (2022). "Training compute-optimal large language models." arXiv preprint arXiv:2203.15556.

Prerequisites

25.03.01
25.02.01
25.07.01

Tier anchors

beginner: Russell and Norvig, Artificial Intelligence: A Modern Approach (4e), Ch. 1-6; Bishop, Pattern Recognition and Machine Learning, Ch. 1
intermediate: Bishop, Pattern Recognition and Machine Learning, Ch. 2-5; Goodfellow, Bengio, and Courville, Deep Learning, Ch. 1-6
master: Goodfellow, Bengio, and Courville, Deep Learning; Vapnik, The Nature of Statistical Learning Theory; Sutton and Barto, Reinforcement Learning

References

computer-science · Ch. 0
Russell, S. and Norvig, P., Artificial Intelligence: A Modern Approach (4e, Pearson, 2020) · Ch. 1-6 · source being verified
Goodfellow, I., Bengio, Y., and Courville, A., Deep Learning (MIT Press, 2016) · Ch. 1-9 · source being verified
Bishop, C.M., Pattern Recognition and Machine Learning (Springer, 2006) · Ch. 1-5 · source being verified

Estimated time

beginner: 30m
intermediate: 55m
master: 80m