Week Three · Linear & Additive Models

Linear is more
than it looks.

From least squares to logistic regression to ridge, lasso, splines, and GAMs — a tour of the workhorse models that quietly underpin most of modern machine learning.

KCL · Machine Learning · Reading time: ~70 min · Self-test: 18 questions

Part I Is Everything Linear?

Linear models are the unsung heroes of statistics — simple, interpretable, and (it turns out) far more flexible than they first appear.

There's a lot to love about linear models, and the reasons are worth pausing on:

Linear functions are simple to grasp and visualise. A straight line, a flat plane — easy to draw, easy to reason about.
Linear coefficients are interpretable. The number $\beta_j$ has a clean meaning: "if I increase $x_j$ by one unit, the prediction changes by $\beta_j$ units, holding everything else fixed."
Under some assumptions, linear estimators are optimal in a precise sense — we'll meet the Gauss-Markov theorem shortly.

But linearity has limits

Plenty of real-world relationships aren't linear. Linear approaches struggle to describe:

Interaction effects — when two features matter together in a way they don't matter separately. (Caffeine helps you study; sleep deprivation hurts. Their combination doesn't average out — it's worse than either alone.)
Non-monotonic behaviour — relationships that go up then down. Worker productivity vs. hours worked: rises, peaks, falls.
Higher-order polynomials — quadratic, cubic, or wigglier shapes that no straight line can capture.

The big idea: all functions are linear in some representation

This is the slide that re-frames everything. Even when the underlying relationship is wildly nonlinear, we can often recover linearity by transforming our features first. The slides flag three reasons this matters:

Hornik's universal approximation theorem (1993). Any reasonable function can be approximated to arbitrary precision by a neural network. Neural networks are extraordinarily flexible function fitters.
Neural networks instantiate a form of generalised additive models (GAMs). Their predictions are just a (possibly nonlinear function of a) linear combination of basis functions — the activations of hidden units.
So any classification or regression task can be modelled as linear — provided we use a rich enough class of basis and/or link functions.

The reframe

Don't think of linearity as a restriction. Think of it as a lens. The question isn't "is this relationship linear?" — it's "in what representation does this relationship become linear?" Find the right representation (the right basis), and you can deploy the entire toolbox of linear methods on what looks superficially like a nonlinear problem. This insight is the bridge from this week to almost everything that follows.

· · ·

Part II Linear Regression & Ordinary Least Squares

The original. The grandfather of supervised learning. The model every other model is implicitly compared against.

Let $X \in \mathcal{X} \subseteq \mathbb{R}^d$ be a vector of $d$ input features. Let $Y \in \mathcal{Y} \subseteq \mathbb{R}$ be a continuous response variable. If $Y$ is a linear function of $X$, then the conditional expectation $\mathbb{E}[Y \mid x]$ is a map $f: \mathcal{X} \to \mathcal{Y}$ of the form:

$$f(x) = \beta_0 + \sum_{j=1}^{d} \beta_j x_j = x\beta,$$

for some parameter vector $\beta = [\beta_0, \dots, \beta_d]$. The first coefficient $\beta_0$ is the intercept; the rest are slopes. The compact form $x\beta$ assumes we've absorbed the intercept by adding a column of 1s to $x$.

The error term and homoskedasticity

Real data don't sit on a line — they scatter around it. The residuals $\epsilon$ represent the deviation from the conditional expectation:

$$y = f(x) + \epsilon, \quad \mathbb{E}[\epsilon] = 0.$$

If the errors are homoskedastic — meaning they have the same variance regardless of $x$ — we additionally have $\epsilon \perp\!\!\!\perp X$ (the noise is independent of the input). This is a crucial assumption that turns up later when we discuss the Gauss-Markov theorem.

Definition · Homoskedasticity

Errors are homoskedastic if their variance doesn't depend on the input $x$. The opposite is heteroskedastic, where the spread of residuals fans out (or shrinks) as $x$ changes.

The loss: residual sum of squares

Given a dataset $\{x_i, y_i\}_{i=1}^n$, how do we find the best $\beta$? We minimise the residual sum of squares (RSS):

$$\text{RSS}(\beta) = \sum_{i=1}^n (y_i - f(x_i))^2 = \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^d \beta_j x_{ij}\Big)^2.$$

In compact matrix form, with $X$ being the $n \times d$ feature matrix and $y$ the vector of outcomes:

$$\text{RSS}(\beta) = (y - X\beta)^\top (y - X\beta).$$

Solving for the optimal coefficients

To minimise RSS, we take its derivative with respect to $\beta$:

$$\frac{\partial \text{RSS}}{\partial \beta} = -2X^\top (y - X\beta).$$

Set the gradient to zero:

$$X^\top(y - X\beta) = 0.$$

And, assuming $X$ has full column rank (so $X^\top X$ is invertible), solve:

$$\boxed{\hat{\beta} = (X^\top X)^{-1} X^\top y.}$$

This is the famous ordinary least squares (OLS) closed-form solution. Memorise it — it appears everywhere.

When does this fail?

If $X$ doesn't have full column rank, $X^\top X$ is singular (non-invertible) and the formula blows up. This happens when features are perfectly collinear — e.g. you've included both "height in inches" and "height in cm". One fix is to drop redundant features. Another fix — coming up in the regularization section — is ridge regression, which adds a constant to the diagonal to make the matrix invertible again.

Inference under Gaussian errors

Suppose we add two assumptions on top of OLS:

Samples are i.i.d. — independent and identically distributed.
Residuals are Gaussian with constant variance: $\epsilon \sim \mathcal{N}(0, \sigma^2)$.

Then we can prove three useful results:

Three classical results

1. The estimate $\hat{\beta}$ is itself multivariate normal:

$$\hat{\beta} \sim \mathcal{N}\big(\beta, (X^\top X)^{-1} \sigma^2\big).$$

2. An unbiased estimate of the residual variance:

$$\hat{\sigma}^2 = \frac{1}{n - d - 1} \sum_{i=1}^n (y_i - \hat{y}_i)^2.$$

The denominator $n - d - 1$ is the degrees of freedom — sample size minus the number of parameters we estimated.

3. To test whether a coefficient $\beta_j$ is significantly different from zero, we compute the z-score:

$$z_j = \frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{v_j}},$$

where $\sqrt{v_j}$ is the $j$-th diagonal element of $(X^\top X)^{-1}$. Large $|z_j|$ means strong evidence that $\beta_j \neq 0$.

Worked example: pesticides on plant weight (Dobson, 1990)

The slides walk through a classic dataset: 20 plants, half treated with pesticide, half control, dried weight measured. The model:

$$\text{weight}_i = \beta_0 + \beta_1 \cdot \text{group}_i + \epsilon_i,$$

where $\text{group}_i = 1$ for "treated" and 0 for "control". The fitted model from R output:

Intercept (control mean): $\hat{\beta}_0 = 5.032$
Treatment effect: $\hat{\beta}_1 = -0.371$ (treated plants weigh 0.371 units less on average)
p-value for treatment: 0.249 — not significant at the 5% level
$R^2 = 0.0728$ — the treatment explains only ~7% of the variance

Conclusion: with this small sample, we can't statistically distinguish the pesticide effect from noise. (Echo of last week's biased coin example: small sample, weak conclusion.)

The Gauss-Markov theorem (BLUE)

Why is OLS such a popular default? The Gauss-Markov theorem tells us that, under fairly mild assumptions, the OLS solution is the best linear unbiased estimator — affectionately known as BLUE.

Theorem · Gauss-Markov (BLUE)

Under standard assumptions (linearity, mean-zero errors, homoskedastic uncorrelated errors), the OLS estimator $\hat{\beta} = (X^\top X)^{-1} X^\top y$ is the best linear unbiased estimator. Unpacking the acronym:

Linear: the estimator is a linear function of $y$ — that is, it can be written as $\hat\beta = Cy$ for some matrix $C$.
Unbiased: $\mathbb{E}[\hat{\beta}_j] = \beta_j$ for all $j$ — on average, we get the right answer.
Best: among all linear unbiased estimators, OLS has the lowest variance. Formally, for any alternative weights $\tilde{\beta} = Cy$ that are also unbiased: $$\text{Var}(\hat{\beta}) \preceq \text{Var}(\tilde{\beta}),$$ where $A \preceq B$ means $B - A$ is positive semi-definite (the matrix version of "less than or equal").

Why does "best" matter?

"Lowest variance" means the OLS estimate of $\beta$ jiggles around the true value the least when you re-collect data. So OLS gives the most stable, reliable estimates among the broad class of linear unbiased estimators. There's still room to do better if you're willing to accept some bias — and that's the whole motivation for regularization.

Self-test · Linear Regression

Write the OLS closed-form solution and explain when it fails.

$\hat{\beta} = (X^\top X)^{-1} X^\top y$. It fails when $X^\top X$ is not invertible — i.e. when $X$ doesn't have full column rank. This happens when features are perfectly (or near-perfectly) collinear, or when $d > n$ (more features than data points). Ridge regression is one solution: adding $\lambda I$ to the diagonal makes the matrix invertible.

What does it mean for the OLS estimator to be "BLUE"?

Best Linear Unbiased Estimator. Linear: $\hat{\beta}$ is a linear function of $y$. Unbiased: $\mathbb{E}[\hat{\beta}] = \beta$. Best: among all linear unbiased estimators, OLS has the lowest variance. The Gauss-Markov theorem proves this under mild assumptions (mean-zero homoskedastic uncorrelated errors).

Why do we divide by n − d − 1 in the variance estimate, not just n?

Because we used the data to estimate $d + 1$ parameters (the intercept plus $d$ slopes), so $d + 1$ degrees of freedom are "spent". Dividing by $n - d - 1$ corrects for this and gives an unbiased estimate of $\sigma^2$. Dividing by $n$ would systematically underestimate the variance.

What does homoskedasticity mean and why does it matter?

Homoskedasticity means the error variance is constant across the input space — the residuals don't fan out or shrink as $x$ changes. It's one of the assumptions behind Gauss-Markov; if errors are heteroskedastic, OLS is still unbiased but no longer the minimum-variance unbiased estimator. The standard inference machinery (z-scores, p-values) also breaks down.

· · ·

Part III Logistic Regression & GLMs

When the response is binary, OLS doesn't quite fit. Logistic regression bends the linear model to live on the [0, 1] interval — and turns out to be a special case of a much bigger family.

Linear regression assumes $Y$ is continuous. But what if $Y$ is binary — spam vs. ham, sick vs. healthy, click vs. no-click? Predicting these directly with a linear model is bad: the predictions can drift outside $[0, 1]$, which makes no sense as a probability. Logistic regression solves the problem with a clever transformation.

Step 1 · From probabilities to odds

Let $P(x)$ be the probability of the event $X = x$. The odds are:

$$\text{odds}(x) = \frac{P(x)}{1 - P(x)}.$$

If $P = 0.5$, the odds are 1 (or "1 to 1"). If $P = 0.75$, the odds are 3 ("3 to 1"). If $P$ is tiny, odds are tiny; if $P \to 1$, odds explode.

Step 2 · From odds to log-odds (the logit)

Take the log:

$$\text{logit}(x) = \log\frac{P(x)}{1 - P(x)} = \log P(x) - \log(1 - P(x)).$$

The log-odds (or logit) range over the entire real line, from $-\infty$ (when $P \to 0$) to $+\infty$ (when $P \to 1$). And — crucially — they're an unconstrained quantity, the kind of thing a linear model can predict happily.

Step 3 · The logistic regression model

Let $X \in \mathcal{X} \subseteq \mathbb{R}^d$ be features, $Y \in \{0, 1\}$ a binary response. Logistic regression assumes the log-odds of $Y$ are linear in $X$:

$$\log\Big(\frac{f(x)}{1 - f(x)}\Big) = \beta_0 + \sum_{j=1}^d \beta_j x_j,$$

where $f(x) = P(Y = 1 \mid x)$. Equivalently, solving for $f(x)$:

$$f(x) = \frac{1}{1 + \exp(-x\beta)} = \frac{\exp(x\beta)}{1 + \exp(x\beta)} = \sigma(x\beta),$$

where $\sigma$ is the famous standard logistic function (or "sigmoid"), which squashes any real number into $(0, 1)$ — exactly the range we need for a probability.

Plain English

Take the linear combination $x\beta$ — a number that could be anywhere on the real line. Pass it through the sigmoid $\sigma$. Out comes a number between 0 and 1: your predicted probability that $Y = 1$. The model is "linear in the logits" — once you transform the probability through the logit function, the relationship to the features becomes linear again.

Generalised linear models (GLMs)

Logistic regression is one example of a much larger family. A generalised linear model (GLM) has three components:

Definition · The three components of a GLM

A conditional distribution for the response variable $Y$, typically from the exponential family (Gaussian, Bernoulli, Poisson, Gamma, etc.).
A set of linear weights $\beta$ on input features $X$.
A link function $g$ such that: $$\mathbb{E}[Y \mid x] = g^{-1}(x\beta).$$ The link function $g$ maps the conditional mean of $Y$ to the linear predictor; its inverse $g^{-1}$ maps back.

Linear regression is the special case where the response is Gaussian and the link is the identity ($g(\mu) = \mu$). Logistic regression is the special case where:

The conditional distribution for $Y$ is the Bernoulli.
The linear weights $\beta$ are learned via iteratively reweighted least squares (IRLS) — a numerical algorithm; there's no closed-form.
The link is the logit function, the inverse of the standard logistic function $\sigma$.

The big picture

A GLM stitches together (i) a probability model for the noise, (ii) a linear predictor, and (iii) a link function that bridges the two. Pick different combinations and you get linear regression, logistic regression, Poisson regression for count data, and many more. They all share the same internal architecture; only the parts vary.

Worked example: pesticides revisited

Same Dobson dataset as before, but with the logic flipped: predict the probability that a plant was treated given its observed weight. The slide's confusion matrix:

Of 10 actual controls, the model predicts 6 control / 4 treatment.
Of 10 actual treated, it predicts 3 control / 7 treatment.

So overall accuracy is (6 + 7) / 20 = 65%. Better than 50% chance, but reflects the fact that the underlying signal is weak.

Why don't we just use linear regression for binary outcomes?

Linear regression's predictions can drift outside the [0, 1] range that a probability must live in. It also gives Gaussian-distributed residuals, which makes no sense for binary data (where residuals can only take a few discrete values). Logistic regression respects the [0, 1] constraint via the sigmoid and uses the appropriate Bernoulli noise model.

What are the three components of a GLM?

(1) A conditional distribution for $Y$, typically from the exponential family. (2) A linear predictor $x\beta$. (3) A link function $g$ relating the mean of $Y$ to the linear predictor: $\mathbb{E}[Y \mid x] = g^{-1}(x\beta)$.

If the log-odds are 2, what's the probability?

Apply the inverse: $P = \sigma(2) = \dfrac{1}{1 + e^{-2}} = \dfrac{1}{1 + 0.135} \approx 0.881$. Roughly 88%. (Useful sanity check: log-odds = 0 → P = 0.5; positive log-odds → P > 0.5; negative → P < 0.5.)

· · ·

Part IV Regularization · Bias is Beautiful

When models have too many features or too little data, they overfit. Regularization deliberately adds a small bias to gain a lot of stability — and it's how modern machine learning tames complexity.

Regularization is an umbrella term for techniques that prevent overfitting by constraining the solution. In linear models specifically, regularization adds an extra term to the loss function that limits how big or how many coefficients can be. The general form is:

$$\arg\min_{\beta \in \mathbb{R}^d} \|y - X\beta\|^2 \quad \text{s.t.} \quad \beta \in \text{(something)}.$$

The "something" might be: the coefficients can't be too large (limit their magnitude), or only a few can be nonzero (force sparsity). Different choices give us different regularizers.

A quick primer on norms

To regularize, we need a way to measure the "size" of the coefficient vector $\beta$. That's what a norm does.

Definition · Norm

A norm is a function $\ell : \mathbb{R}^d \to \mathbb{R}_{\geq 0}$ that satisfies three properties:

Triangle inequality: $\ell(x + y) \leq \ell(x) + \ell(y)$.
Absolute homogeneity: $\ell(s \cdot x) = |s| \cdot \ell(x)$ for any scalar $s$.
Positive definiteness: $\ell(x) = 0$ if and only if $x = 0$.

The two norms you need to know:

Definition · L₂ and L₁ norms

L₂ norm (Euclidean): the root sum of squared entries.

$$\|x\|_2 = \sqrt{\sum_{j=1}^d x_j^2}.$$

L₁ norm (Manhattan): the sum of absolute values.

$$\|x\|_1 = \sum_{j=1}^d |x_j|.$$

General L_p norm:

$$\|x\|_p = \Big(\sum_{j=1}^d |x_j|^p\Big)^{1/p}.$$

The L₂ name "Euclidean" is intuitive — it's the straight-line distance you get from the Pythagorean theorem. The L₁ name "Manhattan" is also intuitive — like walking on a grid of city blocks, you can only go horizontally or vertically, so the distance is just the sum of moves in each dimension.

Ridge regression · the L₂ penalty

Definition · Ridge regression

Ridge regression places a penalty on the squared L₂ norm of the coefficients:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta \in \mathbb{R}^d} \|y - X\beta\|^2 \quad \text{s.t.} \quad \|\beta\|_2^2 \leq \tau,$$

where $\tau \geq 0$ is some threshold. In Lagrangian form, the constraint is folded into the objective:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta \in \mathbb{R}^d} \|y - X\beta\|^2 + \lambda \|\beta\|_2^2,$$

where $\lambda \geq 0$ is the tuning parameter. There's a one-to-one (bijective, monotonic) relationship between $\tau$ and $\lambda$ — they're two ways of controlling the same thing.

The closed-form solution

Writing the full RSS:

$$\text{RSS}(\beta, \lambda) = (y - X\beta)^\top(y - X\beta) + \lambda \beta^\top \beta.$$

Differentiate, set to zero, and solve. The solution is:

$$\boxed{\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y.}$$

Compare to OLS: the only difference is that we add $\lambda I$ — a positive constant down the diagonal — to $X^\top X$ before inverting. This has a beautiful side-effect: even when $X^\top X$ is singular (perfectly collinear features, or $d > n$), adding $\lambda I$ makes it invertible. Ridge regression always has a unique solution.

What ridge does

Ridge shrinks the coefficients smoothly toward zero. Bigger $\lambda$ → more shrinkage. As $\lambda \to 0$, you recover OLS. As $\lambda \to \infty$, all coefficients go to zero. The slides show the "ridge path" — a plot of how each coefficient changes as $\lambda$ varies. The key visual is that the paths are smooth curves; coefficients shrink gradually, never quite hitting zero.

The Bayesian interpretation of ridge

One of the most elegant facts in this course: ridge regression has a natural Bayesian interpretation. Suppose:

Prior on $\beta$: $\beta \sim \mathcal{N}(0, \tau^2 I)$ — Gaussian, centred at zero, isotropic.
Likelihood: $Y \mid x \sim \mathcal{N}(x\beta, \sigma^2)$ — standard linear regression.

Then the ridge estimate $\hat{\beta}^{\text{ridge}}$ is exactly the posterior mode (and mean) with $\lambda = \sigma^2 / \tau^2$.

Why this is beautiful

Ridge can be derived two completely different ways: as a frequentist trick to stabilise OLS, or as the natural Bayesian estimate under a Gaussian prior on the coefficients. The two derivations agree exactly. The penalty parameter $\lambda$ has a meaning: it's the ratio of noise variance to prior variance — how much you trust the data versus how confident you are that the true coefficients are small.

Lasso regression · the L₁ penalty

Definition · Lasso regression

Lasso regression places a penalty on the L₁ norm of the coefficients:

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta \in \mathbb{R}^d} \|y - X\beta\|^2 \quad \text{s.t.} \quad \|\beta\|_1 \leq \tau.$$

In Lagrangian form:

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta \in \mathbb{R}^d} \|y - X\beta\|^2 + \lambda \|\beta\|_1.$$

The math change from ridge (squared sum to absolute sum) looks tiny, but the practical consequence is huge: lasso doesn't just shrink coefficients — it sets some of them exactly to zero. As you crank up $\lambda$, coefficients drop out of the model one by one. Lasso simultaneously fits the model and selects features.

Unlike ridge, lasso has no closed-form solution; you need iterative algorithms. The slides also note a visual property: the solution paths are piecewise linear — straight-line segments that bend at certain $\lambda$ values where coefficients hit zero.

The geometric story · why lasso zeroes out coefficients

This is one of the most-tested ideas in this section: why does L₁ produce sparsity but L₂ doesn't? The answer is geometric.

The orange ellipses are RSS contours — solutions of equal training error around the unconstrained OLS optimum (orange dot). We're forced to find the lowest-RSS point that's also inside the blue feasible region. Ridge's circle has no corners, so the optimum almost always lands somewhere with both coefficients nonzero. Lasso's diamond has sharp corners on the axes — and those corners often "win" geometrically, which means one or more coefficients gets set to exactly zero.

Why corners matter

The RSS contours are smooth ellipses growing out from the OLS estimate. As they expand, they eventually touch the feasible set. With a circle (ridge), the touch point can be anywhere on the boundary — almost certainly off the axes, so all coefficients stay nonzero (though shrunk). With a diamond (lasso), the corners stick out into the contours, so the contours typically meet the feasible region at a corner — and corners lie on the axes, where one or more $\beta_j = 0$. That's why L₁ produces sparsity and L₂ doesn't.

Elastic net · the best of both worlds

Definition · Elastic net regression

Elastic net combines both penalties:

$$\hat{\beta}^{\text{e-net}} = \arg\min_{\beta \in \mathbb{R}^d} \|y - X\beta\|^2 \quad \text{s.t.} \quad \|\beta\|_2^2 \leq \tau_2 \;\text{and}\; \|\beta\|_1 \leq \tau_1.$$

In Lagrangian form:

$$\hat{\beta}^{\text{e-net}} = \arg\min_{\beta \in \mathbb{R}^d} \|y - X\beta\|^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1.$$

Elastic net inherits both behaviours: lasso's variable selection (some coefficients become exactly zero) and ridge's smooth shrinkage (coefficients that stay nonzero are still gently pulled toward zero). It's especially helpful when features are correlated — pure lasso can be unstable in that setting, picking arbitrarily among correlated features, while elastic net tends to keep them together.

Compare ridge and lasso: what does each penalty do, and when would you prefer one over the other?

Ridge (L₂): shrinks coefficients smoothly toward zero. None hit exactly zero. Has a closed-form solution. Stabilises estimates when features are correlated. Best when you believe many features matter a little.

Lasso (L₁): shrinks coefficients and sets many exactly to zero, performing automatic variable selection. No closed form; needs iterative solvers. Best when you believe only a few features matter and want a sparse interpretable model.

When to use each: Lasso when you want feature selection. Ridge when you want stability with all features retained. Elastic net when you want both, especially with correlated features.

Geometrically, why does lasso produce sparse solutions but ridge doesn't?

The L₁ feasible region is a diamond, which has corners precisely on the coordinate axes. The L₂ feasible region is a circle, which has no corners. The RSS contours typically meet the feasible region first at a corner of the diamond (where some $\beta_j = 0$), but at a generic point of the circle (where all $\beta_j \neq 0$). So L₁ encourages sparsity geometrically; L₂ doesn't.

Write the closed-form solution for ridge regression and explain what changes vs OLS.

$\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$. Compared with OLS, we've added $\lambda I$ — a positive constant down the diagonal — to $X^\top X$ before inverting. Two consequences: (1) coefficients are shrunk toward zero (more so for larger $\lambda$); (2) the matrix is always invertible even when $X^\top X$ is singular, so a unique solution always exists.

Sketch the Bayesian interpretation of ridge regression.

Place a Gaussian prior $\beta \sim \mathcal{N}(0, \tau^2 I)$ on the coefficients (centred at zero, isotropic). Use a Gaussian likelihood $Y \mid x \sim \mathcal{N}(x\beta, \sigma^2)$. The posterior mode (also the mean, since Gaussians) is exactly $\hat{\beta}^{\text{ridge}}$ with $\lambda = \sigma^2 / \tau^2$. The penalty parameter is the ratio of noise variance to prior variance.

· · ·

Part V The Bias–Variance Tradeoff

A single equation that explains why simple models often beat complicated ones — and why regularization works.

Why does ridge regression — which deliberately distorts the OLS estimate — sometimes beat OLS in prediction? The answer is the bias–variance decomposition, one of the most important results in the whole course.

Definition · Bias–variance decomposition

The expected prediction error of a regression model $f$ at point $x$ decomposes into a sum of three terms:

$$\text{Err}(f(x)) = \mathbb{E}\Big[(Y - f(x))^2 \mid X = x\Big] = \sigma^2 + \text{Bias}^2(f(x)) + \text{Var}(f(x)).$$

Let's unpack each term:

$\sigma^2$ — irreducible noise. The variance of $Y$ around its true mean. No model can reduce this; it's the inherent randomness of the world. (This is the Bayes risk from Week 1.)
$\text{Bias}^2(f(x))$ — squared bias. How far off, on average, our estimator is from the truth. A model that's too simple to capture the real structure has high bias.
$\text{Var}(f(x))$ — variance. How much the estimator's predictions jiggle when we re-collect data. A model that's too flexible (overfitting) has high variance — small changes in the training set lead to wildly different predictions.

The picture from slide 39: as model complexity rises, training error always falls, but test error follows a U-shape. Bias² shrinks; variance grows. The sweet spot is the minimum of the test-error curve — the right balance.

Why "bias is beautiful"

OLS is unbiased — but its variance can be huge when features are many or correlated. Ridge regression deliberately introduces a small bias (by shrinking coefficients) to gain a large reduction in variance. The total prediction error often goes down. Same lesson for nearly every regularisation technique you'll encounter — penalising flexibility trades a bit of bias for a lot of stability.

Write the bias-variance decomposition. What does each term represent?

$\text{Err}(f(x)) = \sigma^2 + \text{Bias}^2(f(x)) + \text{Var}(f(x))$.

$\sigma^2$ — irreducible noise (Bayes risk; can't be reduced by any model). $\text{Bias}^2$ — systematic error of the estimator (high when model is too simple). $\text{Var}$ — variability of estimator across datasets (high when model is too flexible / overfitting).

How does ridge regression help with the bias-variance tradeoff?

OLS is unbiased but can have high variance, especially with many or correlated features. Ridge introduces a small amount of bias (by shrinking coefficients toward zero) and in exchange substantially reduces variance. The net effect on total expected error is often a decrease — even though the model is now biased.

· · ·

Part VI Additive Models · Why So Smooth?

Generalised additive models extend GLMs by transforming the inputs through basis functions — letting linear machinery capture beautifully nonlinear shapes.

A generalised additive model (GAM) is a GLM with basis expansions of the input features. That is: instead of feeding raw inputs $x$ into a linear predictor, we first transform them using a set of carefully chosen basis functions $h_k$, then form a linear combination of those.

Definition · GAM

A generalised additive model has four components:

A conditional distribution for $Y$, typically from the exponential family.
A set of basis functions $h_k : \mathcal{X} \to \mathbb{R}$ for $k \in \{1, \dots, m\}$.
A set of linear weights $\beta_k$ on the basis function outputs.
A link function $g$ such that: $$\mathbb{E}[Y \mid x] = g^{-1}\Big(\sum_{k=1}^m \beta_k h_k(x)\Big).$$

Compare with a GLM: the only addition is the basis expansion in component (2). We're still doing linear regression — just on transformed inputs. This is the technical realisation of the "all functions are linear in some representation" idea from the introduction.

Common basis functions

The slide gives five examples of basis functions you might use:

Interactions: $h_k(x) = x_1 x_2$. Lets the model use products of features — captures interaction effects.
Polynomials: $h_k(x) = x_1^2$. Squares, cubes, etc. — captures curvature.
Log transforms: $h_k(x) = \log(x_1)$. Useful when the relationship is multiplicative or the input has heavy tails.
Region indicators: $h_k(x) = \mathbb{I}[0 \leq x_1 < 1]$. Indicator variables that are 1 inside a region and 0 outside — turns continuous features into categorical bins.
Neural embeddings: $h_k(x) = \sigma(w^\top x)$. The activation of a hidden unit. This is the GAM-formulation of a neural network — and the bridge to deep learning.

Local structure: knots, bins and continuity

Often, we want the model to behave differently in different regions of the input space — e.g. a "running average" that smooths the data locally. Three core ideas:

Partition the feature space into bins using cut points or knots.
Within each bin, fit something simple — a constant, a line, or a low-order polynomial.
Optionally, impose continuity constraints at the knots, so the function doesn't jump discontinuously between bins.

The slides show a beautiful four-panel figure (slide 44) with two knots $\xi_1, \xi_2$ dividing the input range into three bins:

Piecewise constant. A flat horizontal line in each bin. Discontinuous at the knots.
Piecewise linear. A separate line segment in each bin. Still discontinuous at knots.
Continuous piecewise linear. Line segments forced to meet at the knots. Continuous, but still has corners (kinks).
Piecewise-linear basis function. A "ramp" basis $(x - \xi_1)_+$ — zero before $\xi_1$, then rising linearly. Combine these and you can build a continuous piecewise-linear function.

Splines · piecewise polynomials done elegantly

Splines are piecewise polynomials — the most popular class of basis functions for GAMs. The slides flag three flavours:

Natural cubic splines — piecewise cubic, with continuity of value, first, and second derivative; forced to be linear past the boundary knots to avoid wild extrapolation.
B-splines — basis splines with local support and good numerical properties.
M-splines — non-negative versions used when modelling positive quantities (e.g. densities).

Slide 46 shows the four levels of cubic-polynomial smoothness across knots: discontinuous, continuous, continuous-first-derivative, continuous-second-derivative. Each step up the ladder gives a smoother fit. Natural cubic splines hit the sweet spot — smooth enough to look graceful, simple enough to compute efficiently.

The penalised least-squares characterisation of splines

Here's a stunning theoretical fact. Among all twice-differentiable functions $f$, find the one that minimises:

$$\text{RSS}(f, \lambda) = \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int (f''(t))^2 \, dt. \quad (1)$$

The first term is RSS. The second term penalises wiggliness — the integral of the squared second derivative measures total curvature. Big $\lambda$ → smoother $f$.

Two extremes

At $\lambda = 0$: any function passing through every data point is a minimiser. We can use any interpolating function — total overfit.
At $\lambda = \infty$: $f''$ must be zero everywhere, so $f$ must be linear. We recover the linear least squares fit.

And the punchline:

Theorem · Smoothing splines

For any $0 < \lambda < \infty$, equation (1) is uniquely minimised by the natural cubic spline with knots at each $x_i$.

This is why natural cubic splines aren't just one option among many — they're the canonical choice. They drop out of an optimisation problem you'd write down anyway if you were thinking from first principles about smoothness.

Nonparametric alternatives · LOESS and kernel smoothers

Splines aren't the only way to achieve local smoothness. Kernel smoothers (and the related LOESS regression) take a different approach: predict the outcome at a new point $x_0$ as a weighted average of nearby training outcomes.

Definition · Kernel smoother

A kernel smoother predicts:

$$f(x_0) = \frac{\sum_{i=1}^n K(x_0, x_i) \, y_i}{\sum_{i=1}^n K(x_0, x_i)},$$

where $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_{\geq 0}$ is a kernel function measuring how close $x_0$ is to $x_i$. Common kernels:

Tricube kernel — popular in LOESS.
Epanechnikov kernel — provably optimal in a specific MSE sense.
Radial basis function (RBF) / Gaussian kernel: $$K(x_0, x) = \exp\Big(\frac{\|x_0 - x\|_2^2}{-2\sigma}\Big).$$

Plain English

A kernel smoother says: "to predict at $x_0$, look at the training points nearby; weight them by how close they are; take a weighted average of their $y$ values." Close-by points get heavy weight; far-away points get tiny (or zero) weight. A bigger bandwidth $\sigma$ means a wider neighbourhood — smoother but more biased. A smaller bandwidth means tighter local fits — wigglier but lower bias.

Why GAMs are wonderful

The slides close with four reasons to love GAMs:

GAMs are very expressive. By picking the right basis functions, you can capture nearly any smooth pattern.
Visual interpretability. You can plot each predictor's contribution to the outcome separately — a partial dependence curve. This is much easier than interpreting deep neural networks. The slides show six such curves for a heart-disease GAM (sbp, tobacco, ldl, famhist, obesity, age) on slide 49.
You can incorporate prior knowledge as restrictions on the basis functions. E.g., monotonicity constraints — "I know dose-response should always increase, so force the function to be monotone."
Smooth functions are generally more robust and interpretable than jagged, wiggly functions. Smoothness is itself a useful inductive bias.

Self-test · Additive Models

What are the four components of a GAM?

(1) A conditional distribution for $Y$, typically from the exponential family. (2) A set of basis functions $h_k : \mathcal{X} \to \mathbb{R}$. (3) A set of linear weights $\beta_k$ on the basis function outputs. (4) A link function $g$ such that $\mathbb{E}[Y \mid x] = g^{-1}\big(\sum_k \beta_k h_k(x)\big)$.

Compared with a GLM, the only new ingredient is (2) — the basis expansion of the inputs.

Give five examples of basis functions used in GAMs.

Interactions ($x_1 x_2$), polynomials ($x_1^2$), log transforms ($\log x_1$), region indicators ($\mathbb{I}[0 \leq x_1 < 1]$), and neural embeddings ($\sigma(w^\top x)$).

What is special about natural cubic splines?

Natural cubic splines are piecewise cubic with continuity of value, first, and second derivative — and they're forced to be linear beyond the boundary knots (which improves extrapolation). They are the unique solution to the penalised RSS problem $\sum(y_i - f(x_i))^2 + \lambda \int (f''(t))^2\,dt$ for any finite $\lambda$, with knots placed at each training $x_i$. This makes them the canonical smoother.

How does a kernel smoother predict at a new point x₀?

It computes a weighted average of training outcomes: $f(x_0) = \dfrac{\sum_i K(x_0, x_i)\, y_i}{\sum_i K(x_0, x_i)}$. Kernels that decrease with distance (like the Gaussian/RBF) give more weight to points near $x_0$ and less to those far away. The bandwidth controls the size of the local neighborhood: bigger bandwidth → smoother but more biased; smaller → wigglier but lower bias.

What happens to the smoothing-spline objective as λ → 0 and as λ → ∞?

At $\lambda = 0$, only the RSS matters, and any function that passes through every data point is a minimiser — total overfit, no smoothness. At $\lambda \to \infty$, the penalty dominates, forcing $f''$ to be zero everywhere, so $f$ must be linear — we recover ordinary linear least squares. For any $0 < \lambda < \infty$, the unique minimiser is a natural cubic spline with knots at each $x_i$.

· · ·

Part VII Looking Ahead

Three takeaways the lecture closes on — and a preview of why all this matters for the rest of the course.

The slides end with three high-level claims:

Linear and additive models are a powerful, flexible class of supervised learning methods. Don't dismiss them as "too simple" — with the right basis functions, they cover an enormous range of phenomena.
Most fancy machine learning algorithms can be understood as some variant of GAMs. Neural networks, boosted trees, kernel methods — they're all in some sense linear combinations of cleverly chosen basis functions. The framework you've just learned is the lens through which you'll understand the rest of the course.
The bias–variance tradeoff is inescapable. No matter how clever the algorithm, you're always trading off how well it fits the data against how stable the fit is. Regularization is the principled way to navigate that tradeoff. This idea will reappear every single week.

Putting it all together

Week 3 in one breath: linear models are everywhere because they're simple, interpretable, and — under mild assumptions — provably optimal. OLS gives the BLUE estimator with closed form $\hat\beta = (X^\top X)^{-1} X^\top y$. Logistic regression extends linearity to binary outcomes via the logit link, and is one example of a GLM. When OLS overfits or breaks (singular matrix, too many features), ridge and lasso add penalties on the L₂ and L₁ norms of $\beta$ — ridge shrinks, lasso shrinks-and-selects. The geometric difference (circle vs. diamond) explains the difference in behaviour. Bias–variance tells us why this trade-off is worthwhile. GAMs generalise GLMs by replacing raw features with basis functions — interactions, polynomials, splines, kernel smoothers, neural embeddings — and give us a unified lens for almost every model in the course.

For the exam, be ready to: (i) write the OLS closed-form solution and explain when it fails; (ii) explain the BLUE properties from Gauss-Markov; (iii) describe the three components of a GLM and identify them in linear and logistic regression; (iv) define the L₁ and L₂ norms; (v) write ridge and lasso in both constraint and Lagrangian form, and write the ridge closed-form; (vi) explain the geometric difference between L₁ and L₂ regularization; (vii) state the Bayesian interpretation of ridge; (viii) write the bias-variance decomposition and explain why regularization helps; (ix) list the four components of a GAM, give five basis-function examples, and characterise natural cubic splines as the unique solution to the penalised RSS problem; (x) write the kernel-smoother prediction formula.

Now go nail this worksheet.