Week Four · Trees & Ensembles

If-then rules,
and the wisdom
of crowds.

From a single decision tree to random forests and gradient boosting — the algorithms that quietly win most tabular data competitions.

KCL · Machine Learning · Reading time: ~65 min · Self-test: 18 questions

Part I Trees · What's in a Tree?

A decision tree is just a sequence of yes/no questions. Simple to draw, simple to interpret — and surprisingly powerful when you stack many of them.

Before we can learn from a tree, we need to know what one is. The slides start by laying out the graph-theoretic vocabulary you'll need.

Graphs, directed graphs, and trees

Definition · Graph and tree vocabulary
  • A graph $G = \langle V, E \rangle$ is a set of nodes $V$ and edges $E : V \times V$ between nodes.
  • A graph is directed if each edge has a direction (e.g. $X \to Y$).
  • It is acyclic if no sequence of directed edges loops back on itself (no $X \to Y \to X$).
  • If $G$ contains $X \to Y$, then $X$ is the parent of $Y$, and $Y$ is the child of $X$.
  • A binary tree is a directed acyclic graph in which each node has at most two children.

The three kinds of nodes

In a binary tree, every node falls into one of three categories:

Root Int. 1 Int. 2 Leaf 1 Leaf 2 Leaf 3 Leaf 4

A binary tree: one root, two internal nodes, four leaves. Data enters at the root and follows edges down to a leaf.

What makes a tree a decision tree

Now we add the splits. Suppose $X \in \mathcal{X} = \{0, 1\}^d$ are binary features and $Y \in \{0, 1\}$ a binary response. Then:

Definition · Splits in a decision tree
  • A split $g_k : \mathcal{X} \to \{0, 1\}$ is a literal of the form "$X_j = 1$".
  • At each non-leaf node, we place a unique split. Samples go left or right depending on whether they satisfy the literal.
  • A tree maps samples to leaves by "dropping data down the tree" — checking which literals it satisfies.

Toy example: when to pack an umbrella

Summer? No Umbrella ☂ Yes Cloudy? No No umbrella ☀ Yes Umbrella ☂

A simple if-then decision tree: if it's not summer, take an umbrella; if it's summer and cloudy, take one; if it's summer and clear, leave it home.

Continuous features and continuous outcomes

What if features aren't binary? The slides cover three crucial extensions:

Tree X₁ ≤ t₁ X₂ ≤ t₂ X₁ ≤ t₃ R₁ R₂ R₃ X₂ ≤ t₄ R₄ R₅ Recursive partition t₁ t₂ t₃ t₄ R₁ R₂ R₃ R₄ R₅ X₁ X₂

Each tree corresponds to a recursive carving of the feature space into nested rectangles. A leaf prediction is constant within each rectangle — for regression, the average $y$ value of training points in that region.

What's to love about trees · what isn't

Pros · what's to love about trees
  • Easy to interpret: a (short-ish) path from root to leaf is straightforward to trace.
  • Each split is an "if-then" rule — humans get this immediately.
  • Trees approximate joint distributions over the features and outcome.
  • Trees characterise a set of formulae in disjunctive normal form (DNF) — every leaf corresponds to a conjunction of literals; the whole tree is the OR of those conjunctions.
Cons · what's hard for trees
  • Trees struggle to model smooth functions. A staircase doesn't approximate a curve well.
  • Trees struggle to model additive functions. Each split is "axis-aligned" — capturing additive behaviour like $f(x) = x_1 + x_2$ requires lots of splits.
  • Trees are high-variance predictors. Small changes in the training data can completely reshape the tree.
  • Learning optimal trees is NP-hard. We can't solve this exactly in polynomial time, so we have to use heuristics.
Why this matters for what comes next

The "high variance" + "NP-hard to optimise" combination is the entire reason for the rest of the lecture. We can't grow the perfect tree, and a single tree wobbles too much anyway. So we'll grow many imperfect trees — and combine them. Bagging, random forests, boosting — they're all elaborate ways to dodge those two limitations.

Define "root", "leaf", and "internal node" in a binary decision tree.

Root: the only node with no parents — every prediction starts here.
Leaves: nodes with no children — every prediction ends in one.
Internal nodes: nodes with both parents and children — these hold the splits.

What does it mean to "drop a data point down a tree"?

Start at the root. At each non-leaf node, evaluate the split (an if-then test on one feature) and follow the matching edge — left if the condition is satisfied, right otherwise. Repeat until you arrive at a leaf. The leaf's value is the prediction for that data point.

· · ·

Part II CART · Learning the Splits

If learning the optimal tree is NP-hard, how do we actually train one? With a greedy heuristic that's been the standard for forty years.

CART stands for Classification And Regression Trees. It was introduced by Leo Breiman and colleagues in 1984. Three things to know about CART up-front:

  1. It's a polytime procedure — runs in polynomial time despite the NP-hardness of the global optimisation problem.
  2. It's a greedy algorithm — instead of finding the globally best tree, it minimises a loss function at each split, one at a time, never reconsidering past decisions.
  3. It continues until some stopping criterion is met (e.g. node purity, target depth).

CART for regression

For a continuous outcome, CART picks each split to minimise the residual sum of squares within the resulting two regions.

Definition · CART regression split

For some feature $j \in [d]$ and split point $s$, define the pair of half-planes:

$$R_1(j, s) = \{X \mid X_j \leq s\}, \quad R_2(j, s) = \{X \mid X_j > s\}.$$

Define the predicted value in each region as the mean response there:

$$\hat{y}_m := \hat{\mathbb{E}}[Y \mid x \in R_m].$$

At each split, CART solves:

$$\min_{j, s} \Bigg[\sum_{i: x_i \in R_1(j, s)} (y_i - \hat{y}_1)^2 \;+\; \sum_{i: x_i \in R_2(j, s)} (y_i - \hat{y}_2)^2\Bigg].$$

Try every feature $j$ and every meaningful split point $s$; pick the pair $(j, s)$ that minimises the combined squared error in the two child regions. Then recurse on each child.

CART for classification: the Gini impurity

For a categorical outcome, CART uses Gini impurity instead of squared error.

Definition · Gini impurity and CART classification split

Define the empirical class probability in region $m$ as:

$$\hat{y}_{mk} := P(Y = y_k \mid x \in R_m).$$

Then CART solves at each split:

$$\min_{j, s} \Bigg[\sum_k N_1(j,s)\, \hat{y}_{1k}(1 - \hat{y}_{1k}) \;+\; N_2(j,s)\, \hat{y}_{2k}(1 - \hat{y}_{2k})\Bigg],$$

where $N_m(j, s)$ is the number of samples in region $R_m(j, s)$.

What Gini impurity is doing

The quantity $\hat{y}_{mk}(1 - \hat{y}_{mk})$ is the variance of a Bernoulli with parameter $\hat y_{mk}$ — it's maximised when classes are evenly mixed in a region (50/50 → impurity 0.25 per class) and minimised when one class dominates (100/0 → impurity 0). Multiplying by $N_m$ weights each region by its size. So CART picks splits that produce regions which are as pure as possible — each child region dominated by one class.

Time complexity

The slides walk through CART's runtime carefully. The headline:

$$\text{Time complexity of CART} = O(d \, n \log n).$$

Reasoning:

Depth, overfitting, and pruning

Depth is the primary complexity parameter for CART.

Why is CART called a "greedy" algorithm?

Because it makes the locally optimal choice at each split (minimising the loss for that single split) without considering whether that choice will be optimal for the tree as a whole. Once a split is made, it's never reconsidered — even if a different split higher in the tree would have led to a better overall solution. This is necessary because finding the truly optimal tree is NP-hard.

What is Gini impurity, and what value does it take when classes are perfectly mixed vs. perfectly pure?

Gini impurity for a binary class with probability $\hat{y}$ is $\hat{y}(1 - \hat{y})$, the Bernoulli variance. When classes are perfectly mixed (50/50), Gini = 0.25 — the maximum for two classes. When perfectly pure (one class only, $\hat y = 0$ or $\hat y = 1$), Gini = 0 — the minimum. CART picks splits that minimise the sample-weighted sum of Gini values across child regions.

State the time complexity of CART and explain where it comes from.

$O(d \, n \log n)$. At each level, we evaluate up to $n \times d$ candidate splits (every feature, every threshold). The tree has roughly $\log n$ levels in a balanced case, hence the $n \log n \cdot d$ scaling. This is a heuristic bound; the actual cost depends on tree depth and stopping criteria.

· · ·

Part III Ensembles · Are Crowds That Wise?

If one model is high-variance, average many of them. If one is weak, combine them so the errors cancel. The intuition is centuries old.

An ensemble method is any technique that combines multiple supervised learning models into a single predictor. The intuition — the "wisdom of crowds" — goes back at least to Condorcet's jury theorem (1785).

Theorem · Condorcet's jury theorem (1785)

If jurors are (i) independent and (ii) better than random, then the more jurors, the more likely the majority verdict is correct. As $N \to \infty$, the probability of a correct collective verdict approaches 1.

Both conditions are essential. Correlated jurors don't add information. Worse-than-random jurors actually amplify error as you add more.

Why this matters for ML

If we can build many models that are (i) somewhat decorrelated and (ii) at least slightly better than random, then averaging or voting will produce a much better predictor than any single model. This is the entire theoretical basis for random forests and boosting.

Three aggregation strategies

The lecture covers three ways to combine models:

Stacking

Train diverse models in parallel; learn weights to combine them on a held-out validation set.

Bagging

Train many copies of the same model on different bootstrap samples; average the predictions.

Boosting

Train models sequentially, each correcting the errors of the previous. Combine with weighted votes.

In each case, we begin with a set of basis functions $h_k : \mathcal{X} \to \mathcal{Y}$ for $k \in \{1, \dots, m\}$. (Same vocabulary as Week 3's GAMs — the basis functions can be anything: linear, polynomial, full trees.) For instance, $h_1$ could be a linear predictor and $h_2$ a quadratic one.

Stacking · learn the weights

Definition · Stacking

A stacking ensemble is a linear combination of basis functions:

$$f(x) = \sum_{k=1}^m w_k \, h_k(x).$$

The basis functions $h_k$ are trained on the training set. The stacking weights $w_k$ are learned on a separate validation set by solving:

$$\hat{w} = \arg\min_{w \in \mathbb{R}^m} \sum_{i=1}^{n_{\text{val}}} \Big[y_i - \sum_{k=1}^m w_k h_k(x_i)\Big]^2,$$

where $n_{\text{val}}$ is the validation-set size. Optionally, we can constrain or regularize the stacking weights (e.g. force them to be non-negative or sum to 1).

Why a separate validation set?

If we learned the stacking weights on the same data as the basis functions, the weights would be fit to noise in the training data — overfitting. The validation set gives an honest signal about how each basis function performs on unseen data, so the combination is principled.

Bagging · bootstrap aggregation

Definition · Bagging (bootstrap aggregation)

Bagging = Bootstrap Aggregation. The procedure:

  1. The bootstrap creates $B$ "synthetic" datasets by sampling with replacement from the original data of size $n$, each new dataset also of size $n$.
  2. Some original data points appear multiple times in a single bag; others, not at all.
  3. Train one model per bag.
  4. Aggregate predictions — average for regression, vote for classification.

Out-of-bag (OOB) samples — a freebie

Definition · Out-of-bag samples

For any single bootstrap sample, the probability that a given training observation is excluded tends to $\exp(-1) \approx 0.368$ as $n \to \infty$. The roughly 37% of samples not included in a given bag are called out-of-bag (OOB) samples.

Because OOB samples weren't used to train that bag's model, we can use them to evaluate it for free — no separate test set required. Average across bags and you get an honest OOB estimate of generalization error.

Why ~37%?

For a single observation in a sample of size $n$, the probability of being missed in one draw is $1 - 1/n$. The probability of being missed in all $n$ draws (with replacement) is $(1 - 1/n)^n$. As $n \to \infty$, this tends to $e^{-1} \approx 0.368$. Built-in test set, free of charge.

Original purpose of the bootstrap

The slides note that the bootstrap was originally invented (by Efron, 1979) for a different purpose: as a nonparametric method for estimating the variance of an estimator. If you compute a statistic on each bag and look at how it varies across bags, you get a sense of the statistic's sampling distribution without needing closed-form formulas.

The technical conditions are subtle, but informally: the bootstrap works for any estimand that's a smooth function of the data.

State Condorcet's jury theorem and explain why both conditions are necessary.

If jurors are (i) independent and (ii) each better than random, then the probability the majority verdict is correct rises with the number of jurors, approaching 1.

Both conditions matter. Correlated jurors all make the same mistakes — adding more doesn't help. Worse-than-random jurors push the majority systematically wrong — adding more makes things worse. ML translation: ensembles need their members to be both (i) somewhat decorrelated and (ii) at least slightly better than chance.

What's the difference between stacking and bagging?

Stacking uses different kinds of models (e.g. a linear model and a tree and a neural net) and learns weights for combining them on a validation set. Diversity comes from the model types.

Bagging uses many copies of the same model trained on different bootstrap samples of the data. Diversity comes from the data resampling. Predictions are usually averaged or voted with equal weight (no learned weights).

Why is the OOB sample roughly 37% of the original data?

Each bootstrap sample selects $n$ items with replacement from $n$ original items. The probability that any one item is missed in a single draw is $1 - 1/n$, and missed in all $n$ draws is $(1 - 1/n)^n \to e^{-1} \approx 0.368$ as $n$ grows. So about 37% of the data is out-of-bag for each tree, and provides a free generalization estimate.

· · ·

Part IV Random Forests · A Stochastic Fairytale

Bag many trees. Decorrelate them by sampling features at each split. The result is one of the most popular and robust algorithms in machine learning.

The most famous bagging ensemble is the random forest (RF), originally developed by Leo Breiman in 2001 as an improvement on CART. The idea is straightforward:

Definition · Random Forest
  1. Each basis function is a CART tree grown on an independent bootstrap sample of the data.
  2. At each internal node, only a subset of features is considered for candidate splits — not all of them.
  3. Predictions are made by averaging across trees (regression) or voting (classification).
  4. Generalization error can be cheaply evaluated by averaging across the OOB samples.

Why random feature subsets?

Bagging alone reduces variance but not by as much as you'd hope, because the trees end up correlated — they all tend to split on the same strong features at the top. Sampling features at each split decorrelates the trees, satisfying the first Condorcet condition. The result:

Tuning parameters

Random forests have only three real tuning parameters:

RF tuning parameters
  • $B$ — number of trees to grow. RFs do not overfit as $B$ grows large; models tend to stabilise around $B \approx 200$ for most datasets.
  • mtry — number of features to sample per split. Standard defaults: $\lfloor\sqrt{d}\rfloor$ for classification, $\lfloor d/3 \rfloor$ for regression. This controls model complexity and sparsity.
  • Stopping criterion — same as for CART (depth, purity, min leaf size), with similar pros and cons.

Time complexity:

$$O(B \cdot \text{mtry} \cdot n \log n).$$

$B$ trees, each looking at mtry features per split, each costing $O(n \log n)$ to grow.

A subtle but important fact

RFs do not overfit as $B$ increases. Adding more trees only reduces noise in the average; it can't introduce bias. (Compare with boosting, which can overfit if you keep going.) That said, individual trees can be deep and overfit on their own bag, but the ensemble averaging dampens this.

RFs as adaptive nearest neighbours

One of the most elegant interpretations of RFs in the slides: random forests can be understood as an adaptive nearest-neighbours algorithm.

Standard $k$-NN predicts an outcome at $x$ by averaging the $k$ closest training points in Euclidean space. RFs do something similar, but the "neighbourhood" is defined by the forest itself, not by a fixed metric.

Definition · The forest weighting function

For tree $b$, define the kernel:

$$k^{(b)}(x, x_i) = \begin{cases} 0, & \text{if } R^{(b)}(x) \neq R^{(b)}(x_i) \\ n_i^{(b)}, & \text{otherwise.} \end{cases}$$

That is: $x$ and $x_i$ get weight zero unless they fall into the same leaf of tree $b$; if they do, the weight is the number of training points in that leaf.

Normalise per tree:

$$w_i^{(b)}(x) := \frac{k^{(b)}(x, x_i)}{\sum_{j=1}^n k^{(b)}(x, x_j)}.$$

Then average over trees to get the forest weighting function:

$$w_i(x) := \frac{1}{B} \sum_{b=1}^B w_i^{(b)}(x).$$

And the random forest prediction is:

$$f(x) = \sum_{i=1}^n w_i(x) \, y_i.$$
Why this is beautiful

The RF prediction is a weighted average of all training outcomes, where the weights depend on how often $x$ and each training point land in the same leaf. Training points that land near $x$ in the forest's geometry get high weight; those that don't get little or none. This means the "neighbourhood" adapts to the local structure of the data — close in the forest means close in features that actually mattered for the prediction, not close in raw Euclidean distance.

Variable importance · three flavours

One of the most useful features of RFs is that they let you measure how important each feature is to the model's predictions. The slides describe three popular methods:

Three measures of variable importance
  • Impurity importance. Quantify the total decrease in Gini impurity (or RSS for regression) that results from splitting on each variable $X_j$. Sum across all splits in all trees.
  • Permutation importance. Create a synthetic dataset by randomly permuting variable $X_j$ (breaking its relationship to $Y$) and see how much worse the model performs on average.
  • Noising importance. Replace each split on $X_j$ with a random binomial process and see how much worse the model performs on average.
When the rankings disagree

The slides show side-by-side bar charts of impurity importance vs. permutation importance for the same dataset. The rankings often broadly agree but sometimes differ — impurity importance tends to favour high-cardinality features (more splits available), while permutation importance is more robust but slower. In practice, permutation is usually preferred for honest interpretation.

What two key things make a random forest more than just bagged CART?

(1) Bootstrap sampling — each tree is grown on a different bootstrap sample, decorrelating the trees through data variation. (2) Random feature subsets at each split — only mtry features are considered at each internal node, further decorrelating the trees by preventing them all from splitting on the same dominant feature first. Together, these are what makes the ensemble closer to satisfying Condorcet's independence assumption.

What are the three RF tuning parameters and the standard mtry defaults?

(1) $B$, the number of trees (RFs don't overfit as $B$ grows; ~200 is usually plenty). (2) mtry, features per split — defaults: $\lfloor\sqrt{d}\rfloor$ for classification, $\lfloor d/3 \rfloor$ for regression. (3) The stopping criterion — depth, purity, or min leaf size, as in CART.

How do you get a free generalization error estimate from a random forest?

Use the out-of-bag (OOB) samples. Each tree was grown without about 37% of the original observations, so for each observation, you can predict using only the trees for which it was OOB. Aggregating across all observations gives an honest test-error estimate without needing a held-out set.

Explain how RFs can be viewed as an adaptive nearest-neighbours algorithm.

For each test point $x$, every training point $x_i$ gets a weight equal to how often they land in the same leaf across the forest, normalised. The prediction is $f(x) = \sum_i w_i(x) y_i$ — a weighted average of all training outcomes. The weights define a "neighbourhood" that adapts to the data: nearness is determined by the forest's learned structure, not raw Euclidean distance, so only features that actually mattered for the prediction influence which training points count as neighbours.

· · ·

Part V Boosting · Better Together?

Stacking and bagging combine independent models in parallel. Boosting builds them sequentially, each one focusing on what the last one got wrong.

Whereas stacking and bagging combine independent basis functions, boosting is a sequential algorithm — each learner improves upon the results of the last. The mechanism is simple in principle:

Definition · Boosting
  • Boosting is an iterative procedure for reweighting data.
  • At each step, we down-weight samples that have been correctly classified and up-weight those that have not.
  • The next learner therefore focuses on the previously hard-to-classify examples.
  • Boosting works in principle with any set of weak learners, but in practice we typically use shallow trees (often just decision stumps — trees with a single split).

AdaBoost · the original

The first major boosting algorithm was AdaBoost, by Freund and Schapire in 1995 — they later won the 2003 Gödel Prize for it. The goal was to build a strong learner from a combination of weak learners. Originally designed for binary classification, it has since been extended to multi-class and regression.

Here's the algorithm in full. Assume binary labels $y_i \in \{-1, +1\}$.

AdaBoost

Input: dataset $\{(x_i, y_i)\}_{i=1}^n$ with $y_i \in \{-1, +1\}$, number of rounds $B$.

Step 1 — Initialize observation weights uniformly:

$$w_i = 1/n, \quad i \in [n].$$

Step 2 — For $b = 1$ to $B$:

(a) Fit weak classifier $h_b(x)$ on the data using weights $w_i$.

(b) Compute the weighted error rate:

$$\epsilon_b = \sum_{i=1}^n w_i \, \mathbb{I}[y_i \neq h_b(x_i)].$$

(c) Compute the classifier weight:

$$\alpha_b = \tfrac{1}{2} \log\Big(\tfrac{1 - \epsilon_b}{\epsilon_b}\Big).$$

(d) Update each observation weight:

$$w_i \leftarrow \frac{w_i \cdot \exp(-\alpha_b \, y_i \, h_b(x_i))}{Z_b},$$

where $Z_b$ is a normalizing constant ensuring the weights sum to 1.

Output: the boosted classifier:

$$f(x) = \text{sgn}\Big(\sum_{b=1}^B \alpha_b \, h_b(x)\Big).$$

Reading the algorithm carefully

Three things deserve attention:

What's actually happening

The classifier weight $\alpha_b$: If a weak learner achieves $\epsilon_b = 0.5$ (random guessing), then $\alpha_b = 0$ — that learner is silenced in the final vote. If $\epsilon_b < 0.5$ (better than random), $\alpha_b > 0$ and the learner gets a positive vote — bigger if it's more accurate. If $\epsilon_b > 0.5$, $\alpha_b < 0$ and the learner's vote is flipped (a worse-than-random classifier is still useful — just believe the opposite of what it says).

The weight update: The exponent is $-\alpha_b \, y_i \, h_b(x_i)$. If the weak learner gets sample $i$ right, then $y_i \, h_b(x_i) = +1$ and $\exp(-\alpha_b)$ shrinks $w_i$. If wrong, then $y_i \, h_b(x_i) = -1$ and $\exp(+\alpha_b)$ grows $w_i$. The next round will focus on the hard examples.

The final prediction: $\text{sgn}\big(\sum_b \alpha_b h_b(x)\big)$ is a weighted majority vote across the weak learners. Confident, accurate learners count for more.

Why boosting is remarkable

The AdaBoost test-error plot (slide 40) is one of the most important pictures in this lecture. It shows the test error of AdaBoost on a benchmark, with two reference horizontal lines:

A combination of weak learners can radically outperform a single strong learner.

Subsequent developments in boosting

The theory of boosting has been thoroughly developed since AdaBoost. The slides flag three key results:

Gradient boosting machines (GBMs)

GBMs are the workhorse of modern tabular ML. Famously: XGBoost, LightGBM, CatBoost. The slides note:

Definition · Gradient boosting machines (GBMs)

GBMs routinely attain best-in-class performance on tabular data tasks. They have three key tuning parameters:

  1. Depth of basis functions — how complex each weak learner is allowed to be.
  2. Number of training rounds $B$ — how many weak learners to combine.
  3. Learning rate — a shrinkage factor multiplying each new learner's contribution; smaller learning rate = needs more rounds but tends to generalise better.

Two further important properties:

  • GBMs can continue to improve even after attaining zero training error — counter-intuitive but real. The loss surface keeps getting smoother.
  • Overfitting is a real risk, however — best to tune via cross-validation or similar.
Boosting vs. RF · the key distinction

RFs cannot overfit by adding more trees — averaging only helps. Boosting can overfit by running too many rounds. The reason: each boosting round can fit noise more precisely; the ensemble grows in complexity. RFs grow in averaging, which is monotonically beneficial.

What's the conceptual difference between bagging and boosting?

Bagging trains many models in parallel on independent bootstrap samples of the data; combines them by averaging or voting. Reduces variance.

Boosting trains models sequentially, each one focusing on examples the previous models got wrong. Combines them by weighted vote. Reduces bias (and variance).

Bagged trees are independent → Condorcet-style. Boosted learners are explicitly correlated by design — each is an attempted correction of the previous.

In AdaBoost, what does the classifier weight α_b equal when ε_b = 0.5? When ε_b = 0?

$\alpha_b = \tfrac{1}{2}\log\big(\tfrac{1-\epsilon_b}{\epsilon_b}\big)$.

If $\epsilon_b = 0.5$: $\alpha_b = \tfrac{1}{2}\log 1 = 0$ — random-guessing learner gets zero weight; effectively silenced.

If $\epsilon_b = 0$: the log goes to $+\infty$ — a perfect learner gets infinite weight (in practice, you'd stop early or cap this).

If $\epsilon_b > 0.5$: $\alpha_b < 0$ — learner is worse than random; its vote is flipped in the ensemble.

How does the AdaBoost weight update push attention onto hard examples?

The exponent in the weight update is $-\alpha_b y_i h_b(x_i)$. If the learner classifies $i$ correctly, $y_i h_b(x_i) = +1$, the exponent is $-\alpha_b$, and $w_i$ shrinks (down-weight). If wrong, $y_i h_b(x_i) = -1$, the exponent is $+\alpha_b$, and $w_i$ grows (up-weight). After normalisation by $Z_b$, weights sum to 1 again. The next learner sees the misclassified examples as more important.

State three ways boosting has been theoretically understood.

(1) Forward stagewise additive modelling — at each round, greedily add a basis function that most improves the loss. (2) A game-theoretic min-max solution to a zero-sum game between a learner and nature. (3) The gradient boosting framework, which generalises boosting to any differentiable loss function.

Why don't random forests overfit as B increases, but boosting does?

RF predictions are averages of independent tree predictions; adding more trees just reduces noise in the average — it can't introduce bias or fit the training noise more closely. Boosting, by contrast, is sequential and additive: each round adds capacity that can fit increasingly subtle structure (including noise). Run enough rounds and the ensemble becomes complex enough to memorise. Hence boosting requires careful early stopping or cross-validation.

· · ·

Part VI Conclusion

Three takeaways from the lecture's closing slide.

  1. Trees are an intuitive and effective method for data mining and supervised learning. If-then rules are the most human-friendly model class we have.
  2. Ensembles can turn weak learners into strong learners, under the right conditions (the Condorcet conditions: independence and better-than-random).
  3. Random forests and gradient boosting are among the most popular and powerful algorithms for tabular data tasks. If you have a structured dataset and want a baseline that's hard to beat, reach for one of these.

Putting it all together

Week 4 in one breath: a decision tree partitions the feature space with axis-aligned splits, predicts a constant in each leaf, and is interpretable but high-variance and NP-hard to learn optimally. CART is a greedy heuristic that grows trees in $O(d n \log n)$ time, using RSS for regression splits and Gini impurity for classification splits. To tame variance, we ensemble: stacking learns linear weights on diverse models on a validation set; bagging averages copies of one model trained on bootstrap samples; boosting trains models sequentially, each correcting the last. The most famous bagged ensemble is the random forest — bagged trees with extra feature sub-sampling per split, three tuning parameters, free OOB error, and an interpretation as adaptive nearest neighbours. The most famous boosted method is AdaBoost, which up-weights misclassified examples each round; gradient boosting machines generalise this to any differentiable loss and dominate tabular ML competitions today.

For the exam, be ready to: (i) define root, leaf, internal node, and split; (ii) describe how a tree partitions feature space and predicts in leaves; (iii) list trees' four pros and four cons; (iv) describe CART for both regression (RSS) and classification (Gini); (v) state CART's time complexity $O(d n \log n)$; (vi) state Condorcet's jury theorem and its two conditions; (vii) explain stacking, bagging, and boosting and how they differ; (viii) explain bootstrap sampling and the ~37% OOB rate via $(1-1/n)^n \to e^{-1}$; (ix) describe RF construction (bootstrap + feature sub-sampling), the three tuning parameters and mtry defaults ($\sqrt{d}$ classification, $d/3$ regression), and the adaptive-nearest-neighbours interpretation; (x) name three RF variable-importance methods; (xi) write the AdaBoost algorithm step-by-step including the formulas for $\epsilon_b$, $\alpha_b$, the weight update, and the final classifier; (xii) explain why RFs can't overfit in $B$ but boosting can; (xiii) name the three GBM tuning parameters.

Now go nail this worksheet.