From a single decision tree to random forests and gradient boosting — the algorithms that quietly win most tabular data competitions.
A decision tree is just a sequence of yes/no questions. Simple to draw, simple to interpret — and surprisingly powerful when you stack many of them.
Before we can learn from a tree, we need to know what one is. The slides start by laying out the graph-theoretic vocabulary you'll need.
In a binary tree, every node falls into one of three categories:
A binary tree: one root, two internal nodes, four leaves. Data enters at the root and follows edges down to a leaf.
Now we add the splits. Suppose $X \in \mathcal{X} = \{0, 1\}^d$ are binary features and $Y \in \{0, 1\}$ a binary response. Then:
A simple if-then decision tree: if it's not summer, take an umbrella; if it's summer and cloudy, take one; if it's summer and clear, leave it home.
What if features aren't binary? The slides cover three crucial extensions:
Each tree corresponds to a recursive carving of the feature space into nested rectangles. A leaf prediction is constant within each rectangle — for regression, the average $y$ value of training points in that region.
The "high variance" + "NP-hard to optimise" combination is the entire reason for the rest of the lecture. We can't grow the perfect tree, and a single tree wobbles too much anyway. So we'll grow many imperfect trees — and combine them. Bagging, random forests, boosting — they're all elaborate ways to dodge those two limitations.
Root: the only node with no parents — every prediction starts here.
Leaves: nodes with no children — every prediction ends in one.
Internal nodes: nodes with both parents and children — these hold the splits.
Start at the root. At each non-leaf node, evaluate the split (an if-then test on one feature) and follow the matching edge — left if the condition is satisfied, right otherwise. Repeat until you arrive at a leaf. The leaf's value is the prediction for that data point.
If learning the optimal tree is NP-hard, how do we actually train one? With a greedy heuristic that's been the standard for forty years.
CART stands for Classification And Regression Trees. It was introduced by Leo Breiman and colleagues in 1984. Three things to know about CART up-front:
For a continuous outcome, CART picks each split to minimise the residual sum of squares within the resulting two regions.
For some feature $j \in [d]$ and split point $s$, define the pair of half-planes:
$$R_1(j, s) = \{X \mid X_j \leq s\}, \quad R_2(j, s) = \{X \mid X_j > s\}.$$Define the predicted value in each region as the mean response there:
$$\hat{y}_m := \hat{\mathbb{E}}[Y \mid x \in R_m].$$At each split, CART solves:
$$\min_{j, s} \Bigg[\sum_{i: x_i \in R_1(j, s)} (y_i - \hat{y}_1)^2 \;+\; \sum_{i: x_i \in R_2(j, s)} (y_i - \hat{y}_2)^2\Bigg].$$Try every feature $j$ and every meaningful split point $s$; pick the pair $(j, s)$ that minimises the combined squared error in the two child regions. Then recurse on each child.
For a categorical outcome, CART uses Gini impurity instead of squared error.
Define the empirical class probability in region $m$ as:
$$\hat{y}_{mk} := P(Y = y_k \mid x \in R_m).$$Then CART solves at each split:
$$\min_{j, s} \Bigg[\sum_k N_1(j,s)\, \hat{y}_{1k}(1 - \hat{y}_{1k}) \;+\; N_2(j,s)\, \hat{y}_{2k}(1 - \hat{y}_{2k})\Bigg],$$where $N_m(j, s)$ is the number of samples in region $R_m(j, s)$.
The quantity $\hat{y}_{mk}(1 - \hat{y}_{mk})$ is the variance of a Bernoulli with parameter $\hat y_{mk}$ — it's maximised when classes are evenly mixed in a region (50/50 → impurity 0.25 per class) and minimised when one class dominates (100/0 → impurity 0). Multiplying by $N_m$ weights each region by its size. So CART picks splits that produce regions which are as pure as possible — each child region dominated by one class.
The slides walk through CART's runtime carefully. The headline:
$$\text{Time complexity of CART} = O(d \, n \log n).$$Reasoning:
Depth is the primary complexity parameter for CART.
Because it makes the locally optimal choice at each split (minimising the loss for that single split) without considering whether that choice will be optimal for the tree as a whole. Once a split is made, it's never reconsidered — even if a different split higher in the tree would have led to a better overall solution. This is necessary because finding the truly optimal tree is NP-hard.
Gini impurity for a binary class with probability $\hat{y}$ is $\hat{y}(1 - \hat{y})$, the Bernoulli variance. When classes are perfectly mixed (50/50), Gini = 0.25 — the maximum for two classes. When perfectly pure (one class only, $\hat y = 0$ or $\hat y = 1$), Gini = 0 — the minimum. CART picks splits that minimise the sample-weighted sum of Gini values across child regions.
$O(d \, n \log n)$. At each level, we evaluate up to $n \times d$ candidate splits (every feature, every threshold). The tree has roughly $\log n$ levels in a balanced case, hence the $n \log n \cdot d$ scaling. This is a heuristic bound; the actual cost depends on tree depth and stopping criteria.
If one model is high-variance, average many of them. If one is weak, combine them so the errors cancel. The intuition is centuries old.
An ensemble method is any technique that combines multiple supervised learning models into a single predictor. The intuition — the "wisdom of crowds" — goes back at least to Condorcet's jury theorem (1785).
If jurors are (i) independent and (ii) better than random, then the more jurors, the more likely the majority verdict is correct. As $N \to \infty$, the probability of a correct collective verdict approaches 1.
Both conditions are essential. Correlated jurors don't add information. Worse-than-random jurors actually amplify error as you add more.
If we can build many models that are (i) somewhat decorrelated and (ii) at least slightly better than random, then averaging or voting will produce a much better predictor than any single model. This is the entire theoretical basis for random forests and boosting.
The lecture covers three ways to combine models:
Train diverse models in parallel; learn weights to combine them on a held-out validation set.
Train many copies of the same model on different bootstrap samples; average the predictions.
Train models sequentially, each correcting the errors of the previous. Combine with weighted votes.
In each case, we begin with a set of basis functions $h_k : \mathcal{X} \to \mathcal{Y}$ for $k \in \{1, \dots, m\}$. (Same vocabulary as Week 3's GAMs — the basis functions can be anything: linear, polynomial, full trees.) For instance, $h_1$ could be a linear predictor and $h_2$ a quadratic one.
A stacking ensemble is a linear combination of basis functions:
$$f(x) = \sum_{k=1}^m w_k \, h_k(x).$$The basis functions $h_k$ are trained on the training set. The stacking weights $w_k$ are learned on a separate validation set by solving:
$$\hat{w} = \arg\min_{w \in \mathbb{R}^m} \sum_{i=1}^{n_{\text{val}}} \Big[y_i - \sum_{k=1}^m w_k h_k(x_i)\Big]^2,$$where $n_{\text{val}}$ is the validation-set size. Optionally, we can constrain or regularize the stacking weights (e.g. force them to be non-negative or sum to 1).
If we learned the stacking weights on the same data as the basis functions, the weights would be fit to noise in the training data — overfitting. The validation set gives an honest signal about how each basis function performs on unseen data, so the combination is principled.
Bagging = Bootstrap Aggregation. The procedure:
For any single bootstrap sample, the probability that a given training observation is excluded tends to $\exp(-1) \approx 0.368$ as $n \to \infty$. The roughly 37% of samples not included in a given bag are called out-of-bag (OOB) samples.
Because OOB samples weren't used to train that bag's model, we can use them to evaluate it for free — no separate test set required. Average across bags and you get an honest OOB estimate of generalization error.
For a single observation in a sample of size $n$, the probability of being missed in one draw is $1 - 1/n$. The probability of being missed in all $n$ draws (with replacement) is $(1 - 1/n)^n$. As $n \to \infty$, this tends to $e^{-1} \approx 0.368$. Built-in test set, free of charge.
The slides note that the bootstrap was originally invented (by Efron, 1979) for a different purpose: as a nonparametric method for estimating the variance of an estimator. If you compute a statistic on each bag and look at how it varies across bags, you get a sense of the statistic's sampling distribution without needing closed-form formulas.
The technical conditions are subtle, but informally: the bootstrap works for any estimand that's a smooth function of the data.
If jurors are (i) independent and (ii) each better than random, then the probability the majority verdict is correct rises with the number of jurors, approaching 1.
Both conditions matter. Correlated jurors all make the same mistakes — adding more doesn't help. Worse-than-random jurors push the majority systematically wrong — adding more makes things worse. ML translation: ensembles need their members to be both (i) somewhat decorrelated and (ii) at least slightly better than chance.
Stacking uses different kinds of models (e.g. a linear model and a tree and a neural net) and learns weights for combining them on a validation set. Diversity comes from the model types.
Bagging uses many copies of the same model trained on different bootstrap samples of the data. Diversity comes from the data resampling. Predictions are usually averaged or voted with equal weight (no learned weights).
Each bootstrap sample selects $n$ items with replacement from $n$ original items. The probability that any one item is missed in a single draw is $1 - 1/n$, and missed in all $n$ draws is $(1 - 1/n)^n \to e^{-1} \approx 0.368$ as $n$ grows. So about 37% of the data is out-of-bag for each tree, and provides a free generalization estimate.
Bag many trees. Decorrelate them by sampling features at each split. The result is one of the most popular and robust algorithms in machine learning.
The most famous bagging ensemble is the random forest (RF), originally developed by Leo Breiman in 2001 as an improvement on CART. The idea is straightforward:
Bagging alone reduces variance but not by as much as you'd hope, because the trees end up correlated — they all tend to split on the same strong features at the top. Sampling features at each split decorrelates the trees, satisfying the first Condorcet condition. The result:
Random forests have only three real tuning parameters:
Time complexity:
$$O(B \cdot \text{mtry} \cdot n \log n).$$$B$ trees, each looking at mtry features per split, each costing $O(n \log n)$ to grow.
RFs do not overfit as $B$ increases. Adding more trees only reduces noise in the average; it can't introduce bias. (Compare with boosting, which can overfit if you keep going.) That said, individual trees can be deep and overfit on their own bag, but the ensemble averaging dampens this.
One of the most elegant interpretations of RFs in the slides: random forests can be understood as an adaptive nearest-neighbours algorithm.
Standard $k$-NN predicts an outcome at $x$ by averaging the $k$ closest training points in Euclidean space. RFs do something similar, but the "neighbourhood" is defined by the forest itself, not by a fixed metric.
For tree $b$, define the kernel:
$$k^{(b)}(x, x_i) = \begin{cases} 0, & \text{if } R^{(b)}(x) \neq R^{(b)}(x_i) \\ n_i^{(b)}, & \text{otherwise.} \end{cases}$$That is: $x$ and $x_i$ get weight zero unless they fall into the same leaf of tree $b$; if they do, the weight is the number of training points in that leaf.
Normalise per tree:
$$w_i^{(b)}(x) := \frac{k^{(b)}(x, x_i)}{\sum_{j=1}^n k^{(b)}(x, x_j)}.$$Then average over trees to get the forest weighting function:
$$w_i(x) := \frac{1}{B} \sum_{b=1}^B w_i^{(b)}(x).$$And the random forest prediction is:
$$f(x) = \sum_{i=1}^n w_i(x) \, y_i.$$The RF prediction is a weighted average of all training outcomes, where the weights depend on how often $x$ and each training point land in the same leaf. Training points that land near $x$ in the forest's geometry get high weight; those that don't get little or none. This means the "neighbourhood" adapts to the local structure of the data — close in the forest means close in features that actually mattered for the prediction, not close in raw Euclidean distance.
One of the most useful features of RFs is that they let you measure how important each feature is to the model's predictions. The slides describe three popular methods:
The slides show side-by-side bar charts of impurity importance vs. permutation importance for the same dataset. The rankings often broadly agree but sometimes differ — impurity importance tends to favour high-cardinality features (more splits available), while permutation importance is more robust but slower. In practice, permutation is usually preferred for honest interpretation.
(1) Bootstrap sampling — each tree is grown on a different bootstrap sample, decorrelating the trees through data variation. (2) Random feature subsets at each split — only mtry features are considered at each internal node, further decorrelating the trees by preventing them all from splitting on the same dominant feature first. Together, these are what makes the ensemble closer to satisfying Condorcet's independence assumption.
(1) $B$, the number of trees (RFs don't overfit as $B$ grows; ~200 is usually plenty). (2) mtry, features per split — defaults: $\lfloor\sqrt{d}\rfloor$ for classification, $\lfloor d/3 \rfloor$ for regression. (3) The stopping criterion — depth, purity, or min leaf size, as in CART.
Use the out-of-bag (OOB) samples. Each tree was grown without about 37% of the original observations, so for each observation, you can predict using only the trees for which it was OOB. Aggregating across all observations gives an honest test-error estimate without needing a held-out set.
For each test point $x$, every training point $x_i$ gets a weight equal to how often they land in the same leaf across the forest, normalised. The prediction is $f(x) = \sum_i w_i(x) y_i$ — a weighted average of all training outcomes. The weights define a "neighbourhood" that adapts to the data: nearness is determined by the forest's learned structure, not raw Euclidean distance, so only features that actually mattered for the prediction influence which training points count as neighbours.
Stacking and bagging combine independent models in parallel. Boosting builds them sequentially, each one focusing on what the last one got wrong.
Whereas stacking and bagging combine independent basis functions, boosting is a sequential algorithm — each learner improves upon the results of the last. The mechanism is simple in principle:
The first major boosting algorithm was AdaBoost, by Freund and Schapire in 1995 — they later won the 2003 Gödel Prize for it. The goal was to build a strong learner from a combination of weak learners. Originally designed for binary classification, it has since been extended to multi-class and regression.
Here's the algorithm in full. Assume binary labels $y_i \in \{-1, +1\}$.
Input: dataset $\{(x_i, y_i)\}_{i=1}^n$ with $y_i \in \{-1, +1\}$, number of rounds $B$.
Step 1 — Initialize observation weights uniformly:
$$w_i = 1/n, \quad i \in [n].$$
Step 2 — For $b = 1$ to $B$:
(a) Fit weak classifier $h_b(x)$ on the data using weights $w_i$.
(b) Compute the weighted error rate:
$$\epsilon_b = \sum_{i=1}^n w_i \, \mathbb{I}[y_i \neq h_b(x_i)].$$
(c) Compute the classifier weight:
$$\alpha_b = \tfrac{1}{2} \log\Big(\tfrac{1 - \epsilon_b}{\epsilon_b}\Big).$$
(d) Update each observation weight:
$$w_i \leftarrow \frac{w_i \cdot \exp(-\alpha_b \, y_i \, h_b(x_i))}{Z_b},$$
where $Z_b$ is a normalizing constant ensuring the weights sum to 1.
Output: the boosted classifier:
$$f(x) = \text{sgn}\Big(\sum_{b=1}^B \alpha_b \, h_b(x)\Big).$$
Three things deserve attention:
The classifier weight $\alpha_b$: If a weak learner achieves $\epsilon_b = 0.5$ (random guessing), then $\alpha_b = 0$ — that learner is silenced in the final vote. If $\epsilon_b < 0.5$ (better than random), $\alpha_b > 0$ and the learner gets a positive vote — bigger if it's more accurate. If $\epsilon_b > 0.5$, $\alpha_b < 0$ and the learner's vote is flipped (a worse-than-random classifier is still useful — just believe the opposite of what it says).
The weight update: The exponent is $-\alpha_b \, y_i \, h_b(x_i)$. If the weak learner gets sample $i$ right, then $y_i \, h_b(x_i) = +1$ and $\exp(-\alpha_b)$ shrinks $w_i$. If wrong, then $y_i \, h_b(x_i) = -1$ and $\exp(+\alpha_b)$ grows $w_i$. The next round will focus on the hard examples.
The final prediction: $\text{sgn}\big(\sum_b \alpha_b h_b(x)\big)$ is a weighted majority vote across the weak learners. Confident, accurate learners count for more.
The AdaBoost test-error plot (slide 40) is one of the most important pictures in this lecture. It shows the test error of AdaBoost on a benchmark, with two reference horizontal lines:
The theory of boosting has been thoroughly developed since AdaBoost. The slides flag three key results:
GBMs are the workhorse of modern tabular ML. Famously: XGBoost, LightGBM, CatBoost. The slides note:
GBMs routinely attain best-in-class performance on tabular data tasks. They have three key tuning parameters:
Two further important properties:
RFs cannot overfit by adding more trees — averaging only helps. Boosting can overfit by running too many rounds. The reason: each boosting round can fit noise more precisely; the ensemble grows in complexity. RFs grow in averaging, which is monotonically beneficial.
Bagging trains many models in parallel on independent bootstrap samples of the data; combines them by averaging or voting. Reduces variance.
Boosting trains models sequentially, each one focusing on examples the previous models got wrong. Combines them by weighted vote. Reduces bias (and variance).
Bagged trees are independent → Condorcet-style. Boosted learners are explicitly correlated by design — each is an attempted correction of the previous.
$\alpha_b = \tfrac{1}{2}\log\big(\tfrac{1-\epsilon_b}{\epsilon_b}\big)$.
If $\epsilon_b = 0.5$: $\alpha_b = \tfrac{1}{2}\log 1 = 0$ — random-guessing learner gets zero weight; effectively silenced.
If $\epsilon_b = 0$: the log goes to $+\infty$ — a perfect learner gets infinite weight (in practice, you'd stop early or cap this).
If $\epsilon_b > 0.5$: $\alpha_b < 0$ — learner is worse than random; its vote is flipped in the ensemble.
The exponent in the weight update is $-\alpha_b y_i h_b(x_i)$. If the learner classifies $i$ correctly, $y_i h_b(x_i) = +1$, the exponent is $-\alpha_b$, and $w_i$ shrinks (down-weight). If wrong, $y_i h_b(x_i) = -1$, the exponent is $+\alpha_b$, and $w_i$ grows (up-weight). After normalisation by $Z_b$, weights sum to 1 again. The next learner sees the misclassified examples as more important.
(1) Forward stagewise additive modelling — at each round, greedily add a basis function that most improves the loss. (2) A game-theoretic min-max solution to a zero-sum game between a learner and nature. (3) The gradient boosting framework, which generalises boosting to any differentiable loss function.
RF predictions are averages of independent tree predictions; adding more trees just reduces noise in the average — it can't introduce bias or fit the training noise more closely. Boosting, by contrast, is sequential and additive: each round adds capacity that can fit increasingly subtle structure (including noise). Run enough rounds and the ensemble becomes complex enough to memorise. Hence boosting requires careful early stopping or cross-validation.
Three takeaways from the lecture's closing slide.
Week 4 in one breath: a decision tree partitions the feature space with axis-aligned splits, predicts a constant in each leaf, and is interpretable but high-variance and NP-hard to learn optimally. CART is a greedy heuristic that grows trees in $O(d n \log n)$ time, using RSS for regression splits and Gini impurity for classification splits. To tame variance, we ensemble: stacking learns linear weights on diverse models on a validation set; bagging averages copies of one model trained on bootstrap samples; boosting trains models sequentially, each correcting the last. The most famous bagged ensemble is the random forest — bagged trees with extra feature sub-sampling per split, three tuning parameters, free OOB error, and an interpretation as adaptive nearest neighbours. The most famous boosted method is AdaBoost, which up-weights misclassified examples each round; gradient boosting machines generalise this to any differentiable loss and dominate tabular ML competitions today.
For the exam, be ready to: (i) define root, leaf, internal node, and split; (ii) describe how a tree partitions feature space and predicts in leaves; (iii) list trees' four pros and four cons; (iv) describe CART for both regression (RSS) and classification (Gini); (v) state CART's time complexity $O(d n \log n)$; (vi) state Condorcet's jury theorem and its two conditions; (vii) explain stacking, bagging, and boosting and how they differ; (viii) explain bootstrap sampling and the ~37% OOB rate via $(1-1/n)^n \to e^{-1}$; (ix) describe RF construction (bootstrap + feature sub-sampling), the three tuning parameters and mtry defaults ($\sqrt{d}$ classification, $d/3$ regression), and the adaptive-nearest-neighbours interpretation; (x) name three RF variable-importance methods; (xi) write the AdaBoost algorithm step-by-step including the formulas for $\epsilon_b$, $\alpha_b$, the weight update, and the final classifier; (xii) explain why RFs can't overfit in $B$ but boosting can; (xiii) name the three GBM tuning parameters.
Now go nail this worksheet.