Week One · Learning from Data

Probability,
Information
& Loss

A first principles tour of the foundations every machine learning algorithm rests on — written for someone who's never seen a sigma before.

KCL · Machine Learning · Reading time: ~45 min · Self-test: 14 questions

Part I The Language of Probability

Before we can teach a computer to learn, we need a vocabulary for talking about uncertainty. That vocabulary is probability theory.

Imagine you flip a coin. You don't know what it'll land on, but you do know something: heads is just as likely as tails. Probability is the mathematics of putting numbers on that "something" — turning vague intuitions about chance into precise calculations.

The framework we still use today was nailed down in the 1930s by a Russian mathematician called Andrey Kolmogorov. His insight was that all of probability could be built from just three simple ideas and three simple rules. Let's meet them.

The three ingredients

Whenever we set up a probability problem, we need three things — together they're called the probability space, written as a triple $(\Omega, \mathcal{F}, P)$. Don't be intimidated by the symbols; here's what each one means in plain English.

Definition · Sample space Ω

The sample space Ω (capital omega) is just the list of every possible outcome. Every single thing that could happen, written down.

Example

Flip a coin once → Ω = {Heads, Tails}.

Roll a normal die → Ω = {1, 2, 3, 4, 5, 6}.

Flip two coins → Ω = {HH, HT, TH, TT}.

The slide's example: for two binary variables X and Y (each taking value or its prime), the sample space is $\Omega = \{(x,y), (x,y'), (x',y), (x',y')\}$ — the four ways the pair can come out.

Definition · Event space ℱ

The event space $\mathcal{F}$ (script F) is the list of questions you're allowed to ask. Each question corresponds to a subset of outcomes.

"Did I get a head?" is a question. The set of outcomes that answer "yes" — namely {HH, HT, TH} when we flip two coins — is an event. The event space is the collection of all such events.

The slides mention $\mathcal{F}$ is a "σ-algebra" (sigma-algebra). The technical bit just means: the event space is well-behaved — it contains the empty set, and if you can ask "did A happen?" you can also ask "did A not happen?", and you can combine questions with "or" and "and". For our purposes, just think: collection of questions about outcomes.

Definition · Measure P

The probability measure P is the rule that assigns a number — the probability — to each event.

Plain English

Ω = what could happen. ℱ = the questions you can ask. P = the function that gives each question a numerical answer between 0 and 1. That's it. That's the whole setup.

Kolmogorov's three axioms

An axiom is a rule we just agree on without proof — a starting point. Kolmogorov said: any sensible definition of probability has to obey these three rules.

Probabilities aren't negative. P(E) ≥ 0 for every event. You can't have a -10% chance of rain. This is the nonnegativity axiom.
Something must happen. P(Ω) = 1. The probability that at least one outcome from your sample space occurs is 100%. This is the unit measure axiom.
Disjoint events add up. If two events can't both happen at once, the probability of "either one or the other" equals the sum of their individual probabilities. This is σ-additivity. Formally, for any countable sequence of disjoint events: $$P\Big(\bigcup_{i=1}^{\infty} E_i\Big) = \sum_{i=1}^{\infty} P(E_i).$$

Example · The third axiom in action

For a fair die, the events "roll a 1" and "roll a 2" are disjoint — they can't both be true on the same roll. So:

$$P(\text{roll 1 or 2}) = P(1) + P(2) = \tfrac{1}{6} + \tfrac{1}{6} = \tfrac{1}{3}.$$

Simple, but every probability calculation you'll ever do is built on top of this.

Conditional probability: changing what we know

What if I told you something about the world before asking the question? For instance: "I rolled a die. The result is even. What's the probability it's a 6?"

The new information rules out 1, 3, and 5 — so we should only consider {2, 4, 6}. Among those, only one outcome is "6". So the probability is 1/3, not 1/6. This is what conditional probability captures.

Definition · Conditional probability

The probability of A given that B has happened:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Read it as: "Of all the worlds where B is true, what fraction also have A true?"

Note the symbol $\cap$ (intersection) just means "and". $P(A \cap B)$ is the probability that both A and B happen.

Independence: when knowing doesn't help

Sometimes learning B tells you nothing about A. Knowing whether it rained yesterday in Tokyo doesn't change my belief about whether your coin will land heads. Such events are called independent.

Definition · Statistical independence

A and B are independent if and only if:

$$P(A \cap B) = P(A) \, P(B)$$

Equivalently: $P(A \mid B) = P(A)$. Knowing B leaves your beliefs about A unchanged. We write $A \perp\!\!\!\perp B$.

Conditional independence: a subtler beast

This is one of the most important — and most easily misunderstood — concepts in machine learning. Two events can be dependent overall, but independent once you know a third thing.

Example · Ice cream and drowning

Ice cream sales and drowning incidents are strongly correlated. Cities that sell more ice cream have more drownings. So they're definitely not independent.

But once you know the season, they become independent. In summer, both are high; in winter, both are low — the season explains the link. Within any single season, knowing ice cream sales tells you nothing extra about drownings.

So: ice cream and drowning are conditionally independent given season. Written: Ice ⫫ Drown | Season.

Definition · Conditional independence

A is conditionally independent of B given C if:

$$P(A \cap B \mid C) = P(A \mid C) \, P(B \mid C)$$

Once we condition on C, learning B adds nothing to our knowledge of A. We write $A \perp\!\!\!\perp B \mid C$.

This also entails two equivalent statements (which the slide flags explicitly — likely exam material):

$$P(A \mid B, C) = P(A \mid C) \quad \text{and} \quad P(B \mid A, C) = P(B \mid C).$$

In words: once you know C, additionally knowing B tells you nothing further about A — and vice versa.

Joint, marginal, expected value

Now suppose we have two random quantities — call them A and B. We can ask three different kinds of question.

Joint probability — both at once

The probability that A takes value $a$ and B takes value $b$:

$$P(a, b) = P(a) \, P(b \mid a) = P(b) \, P(a \mid b).$$

The two equalities are the same fact written from different angles — this is sometimes called the chain rule of probability.

Marginal probability — collapsing one variable

If we have the joint distribution but only care about A, we sum across all possible values of B:

$$P(a) = \sum_b P(a, b).$$

This is called marginalising out B. Picture a table of joint probabilities and adding up each row — the row totals are the marginal probabilities of A.

Expected value — the weighted average

If $f$ is some function of the variable A (for example, your winnings on each outcome), the expected value is the average of $f$ weighted by how likely each outcome is:

$$\mathbb{E}[f(a)] = \sum_a P(a) \, f(a).$$

Example · Expected roll of a die

Each face has probability 1/6, and the function is just "the number on the face". So:

$$\mathbb{E}[X] = \tfrac{1}{6}(1+2+3+4+5+6) = 3.5.$$

The expected value of a die roll is 3.5 — even though you can never actually roll 3.5.

Conditional expectation — average given a piece of evidence

If we've observed B = b and want the expected value of $f(A)$ in that conditioned world, we use the conditional probability instead of the marginal:

$$\mathbb{E}\big[f(a) \mid b\big] = \sum_a P(a \mid b) \, f(a).$$

Same recipe, just with each weight updated to reflect the new information. This is one of the most-used quantities in machine learning — the Bayes estimator we'll meet later is essentially a conditional expectation.

Example · Conditional expectation

You roll a fair die. What's the expected value given that the roll is even?

Conditioning on "even" leaves three outcomes {2, 4, 6}, each with probability 1/3:

$$\mathbb{E}[X \mid \text{even}] = \tfrac{1}{3}(2+4+6) = 4.$$

MAP — the most likely answer

The maximum a posteriori (MAP) estimate of A given B = b is just: pick the value of A that's most likely once you've seen B.

$$a^* = \arg\max_{a \in A} P(a \mid b).$$

The notation $\arg\max$ means "the value of $a$ that maximises this quantity". When a doctor sees symptoms B and guesses the most probable disease A, that's a MAP estimate.

Self-test · Probability

You roll two fair dice. What's the probability the sum is 7, given that the first die shows 4?

If the first die is 4, the sum is 7 only when the second die is 3. The second die is independent of the first, so P(second = 3) = 1/6.

Answer: 1/6.

A fair die is rolled. What is E[X²] (the expected value of the square of the outcome)?

Apply the expectation formula with f(x) = x²:

$\mathbb{E}[X^2] = \sum_x P(x) \, x^2 = \tfrac{1}{6}(1 + 4 + 9 + 16 + 25 + 36) = \tfrac{91}{6} \approx 15.17.$

Note this is not the same as $(\mathbb{E}[X])^2 = 3.5^2 = 12.25$. In general $\mathbb{E}[X^2] \neq (\mathbb{E}[X])^2$ — the difference is the variance.

If A ⫫ B | C, can you simplify P(A | B, C)?

Yes — to P(A | C). This is one of the two consequences of conditional independence flagged on slide 16. Once C is known, B becomes irrelevant for predicting A.

Concretely: in our ice-cream/drowning example, once you know the season, also knowing ice-cream sales adds nothing to your prediction of drownings.

Are "drawing a king" and "drawing a heart" independent in a standard 52-card deck?

P(King) = 4/52 = 1/13. P(Heart) = 13/52 = 1/4. P(King ∩ Heart) = 1/52 (only the King of Hearts).

Check: P(King) × P(Heart) = (1/13)(1/4) = 1/52. ✓ They are independent.

Intuition: knowing a card is a heart doesn't change the chance it's a king, because the suits and ranks are arranged in a perfect 4×13 grid.

A bag has 3 red and 2 blue marbles. What's the expected value of the number of reds drawn in a single draw?

Let X = 1 if red, 0 if blue. P(red) = 3/5, P(blue) = 2/5.

$\mathbb{E}[X] = 1 \cdot \tfrac{3}{5} + 0 \cdot \tfrac{2}{5} = \tfrac{3}{5}.$

Answer: 0.6. (For 0/1 variables, expected value equals the probability of "1".)

· · ·

Part II Information & Surprise

If probability is about uncertainty, information theory asks: how much uncertainty is there, and how much does new evidence reduce it?

In 1948, an engineer at Bell Labs called Claude Shannon published a paper that quietly invented the digital age. He realised that "information" — that fuzzy thing we all talk about — could be measured precisely, in bits. The recipe is surprisingly simple, and it turns out to be the foundation for everything from data compression to neural network training.

Surprise, made measurable

Here's Shannon's intuition. If I tell you "the sun rose this morning", you've learned almost nothing — you already expected it. But if I tell you "the sun didn't rise this morning", you've learned a lot — that was unexpected.

So information is connected to surprise, and surprise is high when probability is low. Shannon's specific choice was to measure the surprise of an outcome $x$ as $-\log P(x)$.

Why log?

The logarithm has two beautiful properties. First, when probability is 1 (certain), $-\log 1 = 0$ — no surprise. When probability is tiny, $-\log$ becomes huge — lots of surprise. Second, surprise from independent events adds: if A and B are independent, then $-\log P(A \cap B) = -\log P(A) - \log P(B)$. Information accumulates.

Entropy: average surprise

If we average the surprise over all possible outcomes, weighted by how likely they are, we get the entropy of the distribution.

Definition · Entropy

For a discrete random variable X with distribution P:

$$H(X) = -\sum_x P(x) \log P(x).$$

This is the average information content of X — or equivalently, the average uncertainty.

Two quick facts:

Entropy is never negative: $H(X) \geq 0$.
Entropy is zero only when one outcome is certain (probability 1) — there's no uncertainty to measure.

Example · Coin entropy

Fair coin: P(H) = P(T) = 1/2.

$$H(X) = -\tfrac{1}{2}\log_2 \tfrac{1}{2} - \tfrac{1}{2}\log_2 \tfrac{1}{2} = 1 \text{ bit}.$$

Maximum uncertainty for two outcomes — exactly one bit, the most a coin can tell you.

Two-headed coin: P(H) = 1.

$$H(X) = -1 \cdot \log 1 = 0 \text{ bits}.$$

You knew it was heads. Zero information. The flip told you nothing.

Example · Fair 8-sided die (the slide example)

P(x) = 1/8 for each of 8 faces. Using log base 2:

$$H(X) = -\sum_{x=1}^{8} \tfrac{1}{8} \log_2 \tfrac{1}{8} = -\log_2 \tfrac{1}{8} = \log_2 8 = 3 \text{ bits}.$$

Three bits — exactly what you'd need to encode 8 outcomes (000 to 111). Shannon's measure tells us this is also the most efficient possible code on average.

Two ways to read entropy

1. As uncertainty. Maximum when the distribution is spread evenly across all outcomes (a fair coin, a fair die). Minimum (zero) when one outcome dominates entirely.

2. As code length. $H(X)$ is the average number of bits per message in the optimal code for X. Frequent outcomes get short codes; rare outcomes get long ones.

Joint and conditional entropy

Just as we extended probability from one variable to two, we can extend entropy.

Joint entropy — combined uncertainty

$$H(X, Y) = -\sum_{x, y} P(x, y) \log P(x, y).$$

The total uncertainty in the pair (X, Y) considered as a single thing.

Useful properties:

$H(X, Y) \leq H(X) + H(Y)$. Combining can never increase total uncertainty beyond just summing the individual uncertainties.
Equality holds if and only if X and Y are independent — in which case the joint really does just decompose into the parts.

Conditional entropy — uncertainty left over

$$H(Y \mid X) = -\sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x)}.$$

This is the average uncertainty about Y once you've been told X.

The slide highlights three formal properties — worth memorising:

Nonnegative: $H(Y \mid X) \geq 0$.
Zero iff Y is a deterministic function of X. That is, $H(Y \mid X) = 0 \;\Leftrightarrow\; L\big(f^*(X), Y\big) = 0$, where $f^*$ is the Bayes estimator. In words: Y carries no leftover uncertainty given X precisely when the Bayes risk is zero.
Monotonic: $H(Y \mid X) \leq H(Y)$, with equality if and only if $X \perp\!\!\!\perp Y$. Conditioning on more information can never increase your uncertainty — and only leaves it unchanged when X is irrelevant.

Example · Conditional entropy intuition

Y = whether it rains tomorrow. X = the forecast.

Without seeing the forecast, $H(Y)$ might be high — you're very uncertain.

After seeing a forecast that says "98% chance of rain", $H(Y \mid X)$ drops dramatically — most uncertainty is resolved.

The amount the uncertainty drops is what we'll define next.

Mutual information: how much one tells you about the other

Definition · Mutual information $$I(X; Y) = \sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x) P(y)}.$$

Mutual information measures the dependence between X and Y — how much knowing one reduces uncertainty about the other.

$I(X; Y) \geq 0$ always.
$I(X; Y) = 0$ if and only if X and Y are independent.
It is symmetric: $I(X; Y) = I(Y; X)$. X and Y share information equally.

The web of relationships

All these quantities — joint, conditional, mutual — fit together in beautifully tidy ways. Picture two overlapping circles: one for X's information, one for Y's.

$$ \begin{aligned} H(X, Y) &= H(X) + H(Y) - I(X; Y) \\ H(X \mid Y) &= H(X, Y) - H(Y) \\ H(Y \mid X) &= H(X, Y) - H(X) \end{aligned} $$

Visualise it

The two circles are $H(X)$ and $H(Y)$. Their union is $H(X, Y)$ — the total information in both. Their intersection is $I(X; Y)$ — the information they share. The bits sticking out of each circle (only in X, only in Y) are the conditional entropies $H(X \mid Y)$ and $H(Y \mid X)$ — what's left of each once the other is known.

Cross entropy and KL divergence

So far, we've measured properties of a single distribution P. Now suppose you have a true distribution P, and a guess Q. How wrong is your guess?

Definition · Cross entropy $$H_P(Q) = -\sum_x P(x) \log Q(x).$$

Average surprise when reality is P but you're scoring outcomes using Q's beliefs.

Cross entropy is the most popular loss function for classification in deep learning — when a neural network produces probabilities Q over classes and the true label distribution is P, training minimises exactly this quantity.

Watch out

The slide flags a notation trap: cross entropy is sometimes written $H(P, Q)$ — looking exactly like joint entropy! They are not the same. When you see $H(P, Q)$, check the context: are P and Q two distributions on the same variable (cross entropy), or are they two random variables (joint entropy)?

Cross entropy is always at least as big as the entropy of the truth: $H_P(Q) \geq H(P)$. Equality holds only when Q matches P perfectly. The "extra" cost of using a wrong distribution is what KL divergence measures.

Definition · Kullback–Leibler divergence $$D_{\text{KL}}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}.$$

The extra cost in bits of encoding data drawn from P using a code optimised for Q.

Beautifully, this is just cross entropy minus entropy:

$$D_{\text{KL}}(P \| Q) = H_P(Q) - H(P).$$

And as a bonus, mutual information itself is a KL divergence — between the joint distribution and what the joint would look like if X and Y were independent:

$$I(X; Y) = D_{\text{KL}}\big(P(X, Y) \,\big\|\, P(X) P(Y)\big).$$

That's why $I(X;Y) = 0$ exactly when X and Y are independent: the joint coincides with the product of marginals, so the divergence is zero.

Once you see these connections, information theory feels less like a list of definitions and more like a single elegant idea told from different angles.

Self-test · Information

What's the entropy of a fair four-sided die (using log base 2)?

Each outcome has P = 1/4. So $H(X) = -\sum 4 \cdot \tfrac{1}{4} \log_2 \tfrac{1}{4} = -\log_2 \tfrac{1}{4} = 2$ bits.

Pattern: for a fair n-sided die, entropy is $\log_2 n$ bits.

If a coin is biased so heads has probability 0.99, is its entropy higher or lower than a fair coin?

Lower. Entropy peaks for the most uncertain (uniform) distribution. The biased coin is almost predictable, so its entropy is close to zero. Specifically: $H \approx 0.08$ bits, vs 1 bit for a fair coin.

If X and Y are independent, what is I(X; Y)?

Zero. Independence means knowing X tells you nothing about Y, so they share no information. Formally, P(x, y) = P(x)P(y), so the log term inside the mutual information sum becomes log 1 = 0.

Why does cross entropy work as a loss function for classification?

The neural network outputs a predicted distribution Q over classes; the truth (one-hot) is P. Cross entropy $H_P(Q)$ measures the average surprise of P-data under Q. It is minimised when Q = P. So pushing cross entropy down forces the model's predicted probabilities to align with reality.

Is KL divergence symmetric? That is, does D(P‖Q) = D(Q‖P)?

No! It's a "divergence", not a distance. In general $D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)$. The formula is asymmetric in the roles of P and Q. This matters in practice: the two directions emphasise different kinds of error.

· · ·

Part III The Three Flavours of Machine Learning

Now that we have the language, what does it mean for a machine to learn? It depends on what kind of feedback the machine gets.

All machine learning boils down to using data to discover patterns or strategies. But there are three fundamentally different setups, depending on what kind of guidance the algorithm receives. They are: supervised learning, reinforcement learning, and unsupervised learning. You will meet each one many times this term.

Supervised learning · learning by example

You're given a stack of inputs paired with their correct outputs — like flashcards. The algorithm's job is to learn the rule connecting input to output, well enough to predict the output for new inputs it's never seen.

Definition · Supervised learning

Learn a function $f: \mathcal{X} \to \mathcal{Y}$ from labelled examples, where $\mathcal{X}$ is the feature space (the inputs) and $\mathcal{Y}$ is the label space (the outputs).

If $\mathcal{Y} = \{0, 1\}$ — binary classification (e.g. spam or not).
If $\mathcal{Y} = \{y_1, \dots, y_k\}$ — multiclass classification (e.g. dog, cat, horse).
If $\mathcal{Y} \subseteq \mathbb{R}$ — regression (e.g. predicting price).

Real-world examples

Photos labelled "cat"/"dog" → classifier learns to recognise pets.
House features (size, location, age) plus sale price → predict prices.
Patient symptoms plus diagnosis → suggest diagnoses.
Loan application data plus default outcomes → credit risk score.

Reinforcement learning · learning by trial and reward

No flashcards. The algorithm is an agent placed inside an environment. It takes actions, receives rewards (or punishments), and learns over time which actions are best.

Definition · Reinforcement learning

The setup involves a set of states $\mathcal{S}$ and possible actions $\mathcal{A}$. A policy $\pi: \mathcal{S} \to \mathcal{A}$ tells the agent what action to take in each state. The environment returns a reward $R$. The goal is to find:

$$\pi^* = \arg\max_\pi \mathbb{E}[R(\pi)].$$

That is — the policy that maximises expected reward.

Real-world examples

Chess and Go agents (DeepMind's AlphaGo).
Robots learning to walk by trying and falling.
Trading bots adjusting strategies for profit.
Video game characters that get better through self-play.

Unsupervised learning · finding structure on your own

No labels, no rewards. Just data. The algorithm has to find patterns purely from the structure of the inputs themselves.

Definition · Unsupervised learning

There's no single formal setup, but a frequent goal is to learn a low-dimensional latent representation:

$$f: \mathcal{X} \to \mathcal{Z}, \quad \mathcal{X} \subseteq \mathbb{R}^{d_X}, \; \mathcal{Z} \subseteq \mathbb{R}^{d_Z}, \; d_X > d_Z.$$

That is, compress complex high-dimensional data into a simpler representation that captures its essence.

Real-world examples

Clustering customers by buying behaviour (no "correct" cluster labels).
Density estimation (modelling how data is distributed).
Autoencoders compressing images.
Generative models like the ones behind ChatGPT or Stable Diffusion.

Quick check: an algorithm that predicts house prices from features is which type of ML?

Supervised learning — specifically, regression (because the target is a continuous number).

You group news articles by topic without being told which topics exist. Which type?

Unsupervised learning (clustering). There are no pre-existing topic labels — the algorithm discovers groupings from text similarity alone.

· · ·

Part IV Loss: Measuring How Wrong You Are

For an algorithm to improve, it needs a target — a number it's trying to push down. That number is called the loss.

If supervised learning is about finding a function $f$ that maps inputs to outputs, we need a way to score how good our $f$ is. We pick a loss function $L$ that measures how far the prediction $f(x)$ is from the truth $y$, then train the algorithm to make $L$ as small as possible. Different problems call for different losses.

Two classic loss functions

Mean squared error · for regression

When predicting numbers, a natural penalty is the squared difference between prediction and reality, averaged over all data:

$$L_{\text{MSE}}(Y, \hat{Y}) = \mathbb{E}_{XY}\big[(f(x) - y)^2\big].$$

Why squared? Two reasons. It's always positive (so over- and under-shoots both count), and it punishes big errors much more than small ones — being off by 10 hurts 100 times as much as being off by 1.

Accuracy · for classification

For predicting a category, the simplest score is just: how often are you right?

$$L_{\text{ACC}}(Y, \hat{Y}) = P_{XY}\big(f(x) = y\big).$$

Strictly speaking this is a "score" — higher is better. The corresponding loss is "1 minus accuracy", or the error rate.

True loss vs. empirical loss

Here's a subtle but crucial point. The formulas above use the true joint distribution $P(x, y)$, expressed via expectations over X and Y. But we don't know that distribution! All we have is a finite dataset of examples.

So in practice, we approximate the true loss by averaging over our sample. This is called the empirical risk.

$$\hat{L}_{\text{MSE}}(Y, \hat{Y}) = \frac{1}{n}\sum_{i=1}^{n}\big(f(x_i) - y_i\big)^2.$$ $$\hat{L}_{\text{ACC}}(Y, \hat{Y}) = \frac{1}{n}\sum_{i=1}^{n}\mathbb{I}\big[f(x_i) = y_i\big].$$

Here $\mathbb{I}[\cdot]$ is just an "indicator function" — 1 when the inside is true, 0 when false. Counting correct predictions, then dividing by total, gives accuracy.

Why this matters

The whole game of machine learning is to minimise true loss using only access to empirical loss. Doing well on training data (empirical) doesn't always mean doing well on new data (true). This gap will haunt every algorithm we study — it's why we need test sets, validation, and regularisation.

The Bayes estimator: the best you could ever do

Suppose you knew the true joint distribution $P(x, y)$ perfectly. What's the best possible predictor? It's called the Bayes estimator, and for any input $x$ it picks the prediction $\hat{y}$ that minimises expected loss:

$$f^*(x) = \arg\min_{\hat{y}} \sum_y P(y \mid x) L(y, \hat{y}).$$

The loss this best-possible predictor achieves is called the Bayes risk — and it's a kind of speed limit. No algorithm, no matter how clever, can do better.

Crucial

The Bayes risk is often greater than zero. There may be irreducible noise in the world: two patients with identical symptoms can have different outcomes. That randomness is built into reality, not into our model. The best we can do is reach the Bayes risk; we cannot go below it.

Example · The slides' picture story (slides 38–41)

The slides walk you through a simple regression example showing this clearly:

A clean sine wave — the true relationship between x and y. Zero noise.
The same wave with a sprinkle of noise — observed data points scatter around the curve. The wave is still the best predictor; the residual scatter is the Bayes risk.
The same wave with much more noise. The data is messy, but the underlying signal is still there. The Bayes risk is now large but not infinite.
So much noise that no signal is recoverable. The Bayes estimator becomes essentially "predict the mean" — and the Bayes risk is huge.

The lesson: how good a model can possibly get is set by the noise in the world, not just by the algorithm.

Galton's heights — the original Bayes-risk example

Slide 42 shows Francis Galton's classic 1885 dataset of children's heights versus their parents'. The cloud is roughly elliptical: tall parents tend to have tall children, but the relationship is far from perfect. Even with the best possible model — a regression line through the centre — there's still vertical scatter that no algorithm can eliminate. That scatter is the Bayes risk.

Other notions of "loss"

For supervised learning, MSE and accuracy are the workhorses. But other paradigms have their own ideas about what to minimise.

Regret · in reinforcement learning

In RL, we measure how much reward we've left on the table compared to playing optimally:

$$\text{Regret}_k(\hat{\pi}) = \sum_{t=0}^{k} R_t(\pi^*) - R_t(\hat{\pi}).$$

The cumulative gap, summed over $k$ rounds, between optimal and learned-policy rewards.

Reconstruction error · in autoencoders

An autoencoder is a network with two halves: an encoder $f$ that compresses an input into a small code, and a decoder $g$ that tries to rebuild the input from that code. We score it by how close the reconstruction is to the original:

$$L(X, \hat{X}) = \mathbb{E}\big[(x - g(f(x)))^2\big].$$

Low reconstruction error means the latent code retained the important information. This is how autoencoders learn meaningful compressed representations.

Self-test · Loss

You're predicting customer ratings (1–5 stars). Which loss is more appropriate: MSE or accuracy?

It depends on whether you treat the rating as a number or a category.

If as a number: MSE — predicting "4" when the truth is "5" is much closer than predicting "1", and MSE captures that gradient.

If as 5 unordered categories: accuracy — but you'd lose the information that "4 vs 5" is a small mistake while "1 vs 5" is a big one. For ratings, MSE is usually better.

An algorithm achieves 100% accuracy on its training set. Has it learned perfectly?

Not necessarily — and probably not. Training accuracy is empirical loss; it tells you how well the model memorised the seen data. The real question is true loss: how it does on new data. A model can hit perfect training accuracy by memorising everything, while doing terribly on anything new. That gap (overfitting) is one of the central themes of the course.

What does it mean if the Bayes risk for a problem is very high?

It means the relationship between inputs and outputs is genuinely noisy or fundamentally limited — the inputs simply don't contain enough information to predict the output well. No model, however sophisticated, can do better. The fix is usually to find better features, not a better algorithm.

Why use squared error rather than just absolute error |f(x) − y|?

Both are valid. Squared error is preferred when:

• You want to punish large errors disproportionately (squaring inflates them).
• You want a smooth, differentiable loss (the absolute value has a sharp corner at zero, which makes optimisation slightly harder).
• You want the optimal predictor to be the conditional mean $\mathbb{E}[Y \mid X]$, which has nice statistical properties.

Absolute error, by contrast, is more robust to outliers and gives the conditional median.

Putting it all together

Here's the through-line of Week 1, in one breath:

Probability gives us a language for uncertainty — sample spaces, events, conditional probabilities, joints and marginals.
Information theory turns probability into quantities of surprise — entropy, mutual information, KL divergence.
Machine learning uses data to fit functions (supervised), policies (RL), or structure (unsupervised).
To make any of this work, we need a loss function — a way to measure being wrong — and we minimise it via empirical risk, hoping to approach the unbeatable Bayes risk.

Every later week — neural networks, decision trees, kernel methods, generative models — will rest on these foundations. If a concept later in the course feels confusing, the chances are very high that re-reading the relevant bit of Week 1 will untangle it. Worth a re-read in a few days.

Good luck with the worksheet — you've got this.