A first principles tour of the foundations every machine learning algorithm rests on — written for someone who's never seen a sigma before.
Before we can teach a computer to learn, we need a vocabulary for talking about uncertainty. That vocabulary is probability theory.
Imagine you flip a coin. You don't know what it'll land on, but you do know something: heads is just as likely as tails. Probability is the mathematics of putting numbers on that "something" — turning vague intuitions about chance into precise calculations.
The framework we still use today was nailed down in the 1930s by a Russian mathematician called Andrey Kolmogorov. His insight was that all of probability could be built from just three simple ideas and three simple rules. Let's meet them.
Whenever we set up a probability problem, we need three things — together they're called the probability space, written as a triple $(\Omega, \mathcal{F}, P)$. Don't be intimidated by the symbols; here's what each one means in plain English.
The sample space Ω (capital omega) is just the list of every possible outcome. Every single thing that could happen, written down.
Flip a coin once → Ω = {Heads, Tails}.
Roll a normal die → Ω = {1, 2, 3, 4, 5, 6}.
Flip two coins → Ω = {HH, HT, TH, TT}.
The slide's example: for two binary variables X and Y (each taking value or its prime), the sample space is $\Omega = \{(x,y), (x,y'), (x',y), (x',y')\}$ — the four ways the pair can come out.
The event space $\mathcal{F}$ (script F) is the list of questions you're allowed to ask. Each question corresponds to a subset of outcomes.
"Did I get a head?" is a question. The set of outcomes that answer "yes" — namely {HH, HT, TH} when we flip two coins — is an event. The event space is the collection of all such events.
The slides mention $\mathcal{F}$ is a "σ-algebra" (sigma-algebra). The technical bit just means: the event space is well-behaved — it contains the empty set, and if you can ask "did A happen?" you can also ask "did A not happen?", and you can combine questions with "or" and "and". For our purposes, just think: collection of questions about outcomes.
The probability measure P is the rule that assigns a number — the probability — to each event.
Ω = what could happen. ℱ = the questions you can ask. P = the function that gives each question a numerical answer between 0 and 1. That's it. That's the whole setup.
An axiom is a rule we just agree on without proof — a starting point. Kolmogorov said: any sensible definition of probability has to obey these three rules.
For a fair die, the events "roll a 1" and "roll a 2" are disjoint — they can't both be true on the same roll. So:
$$P(\text{roll 1 or 2}) = P(1) + P(2) = \tfrac{1}{6} + \tfrac{1}{6} = \tfrac{1}{3}.$$Simple, but every probability calculation you'll ever do is built on top of this.
What if I told you something about the world before asking the question? For instance: "I rolled a die. The result is even. What's the probability it's a 6?"
The new information rules out 1, 3, and 5 — so we should only consider {2, 4, 6}. Among those, only one outcome is "6". So the probability is 1/3, not 1/6. This is what conditional probability captures.
The probability of A given that B has happened:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$Read it as: "Of all the worlds where B is true, what fraction also have A true?"
Note the symbol $\cap$ (intersection) just means "and". $P(A \cap B)$ is the probability that both A and B happen.
Sometimes learning B tells you nothing about A. Knowing whether it rained yesterday in Tokyo doesn't change my belief about whether your coin will land heads. Such events are called independent.
A and B are independent if and only if:
$$P(A \cap B) = P(A) \, P(B)$$Equivalently: $P(A \mid B) = P(A)$. Knowing B leaves your beliefs about A unchanged. We write $A \perp\!\!\!\perp B$.
This is one of the most important — and most easily misunderstood — concepts in machine learning. Two events can be dependent overall, but independent once you know a third thing.
Ice cream sales and drowning incidents are strongly correlated. Cities that sell more ice cream have more drownings. So they're definitely not independent.
But once you know the season, they become independent. In summer, both are high; in winter, both are low — the season explains the link. Within any single season, knowing ice cream sales tells you nothing extra about drownings.
So: ice cream and drowning are conditionally independent given season. Written: Ice ⫫ Drown | Season.
A is conditionally independent of B given C if:
$$P(A \cap B \mid C) = P(A \mid C) \, P(B \mid C)$$Once we condition on C, learning B adds nothing to our knowledge of A. We write $A \perp\!\!\!\perp B \mid C$.
This also entails two equivalent statements (which the slide flags explicitly — likely exam material):
$$P(A \mid B, C) = P(A \mid C) \quad \text{and} \quad P(B \mid A, C) = P(B \mid C).$$In words: once you know C, additionally knowing B tells you nothing further about A — and vice versa.
Now suppose we have two random quantities — call them A and B. We can ask three different kinds of question.
The probability that A takes value $a$ and B takes value $b$:
$$P(a, b) = P(a) \, P(b \mid a) = P(b) \, P(a \mid b).$$The two equalities are the same fact written from different angles — this is sometimes called the chain rule of probability.
If we have the joint distribution but only care about A, we sum across all possible values of B:
$$P(a) = \sum_b P(a, b).$$This is called marginalising out B. Picture a table of joint probabilities and adding up each row — the row totals are the marginal probabilities of A.
If $f$ is some function of the variable A (for example, your winnings on each outcome), the expected value is the average of $f$ weighted by how likely each outcome is:
$$\mathbb{E}[f(a)] = \sum_a P(a) \, f(a).$$Each face has probability 1/6, and the function is just "the number on the face". So:
$$\mathbb{E}[X] = \tfrac{1}{6}(1+2+3+4+5+6) = 3.5.$$The expected value of a die roll is 3.5 — even though you can never actually roll 3.5.
If we've observed B = b and want the expected value of $f(A)$ in that conditioned world, we use the conditional probability instead of the marginal:
$$\mathbb{E}\big[f(a) \mid b\big] = \sum_a P(a \mid b) \, f(a).$$Same recipe, just with each weight updated to reflect the new information. This is one of the most-used quantities in machine learning — the Bayes estimator we'll meet later is essentially a conditional expectation.
You roll a fair die. What's the expected value given that the roll is even?
Conditioning on "even" leaves three outcomes {2, 4, 6}, each with probability 1/3:
$$\mathbb{E}[X \mid \text{even}] = \tfrac{1}{3}(2+4+6) = 4.$$The maximum a posteriori (MAP) estimate of A given B = b is just: pick the value of A that's most likely once you've seen B.
$$a^* = \arg\max_{a \in A} P(a \mid b).$$The notation $\arg\max$ means "the value of $a$ that maximises this quantity". When a doctor sees symptoms B and guesses the most probable disease A, that's a MAP estimate.
If the first die is 4, the sum is 7 only when the second die is 3. The second die is independent of the first, so P(second = 3) = 1/6.
Answer: 1/6.
Apply the expectation formula with f(x) = x²:
$\mathbb{E}[X^2] = \sum_x P(x) \, x^2 = \tfrac{1}{6}(1 + 4 + 9 + 16 + 25 + 36) = \tfrac{91}{6} \approx 15.17.$
Note this is not the same as $(\mathbb{E}[X])^2 = 3.5^2 = 12.25$. In general $\mathbb{E}[X^2] \neq (\mathbb{E}[X])^2$ — the difference is the variance.
Yes — to P(A | C). This is one of the two consequences of conditional independence flagged on slide 16. Once C is known, B becomes irrelevant for predicting A.
Concretely: in our ice-cream/drowning example, once you know the season, also knowing ice-cream sales adds nothing to your prediction of drownings.
P(King) = 4/52 = 1/13. P(Heart) = 13/52 = 1/4. P(King ∩ Heart) = 1/52 (only the King of Hearts).
Check: P(King) × P(Heart) = (1/13)(1/4) = 1/52. ✓ They are independent.
Intuition: knowing a card is a heart doesn't change the chance it's a king, because the suits and ranks are arranged in a perfect 4×13 grid.
Let X = 1 if red, 0 if blue. P(red) = 3/5, P(blue) = 2/5.
$\mathbb{E}[X] = 1 \cdot \tfrac{3}{5} + 0 \cdot \tfrac{2}{5} = \tfrac{3}{5}.$
Answer: 0.6. (For 0/1 variables, expected value equals the probability of "1".)
If probability is about uncertainty, information theory asks: how much uncertainty is there, and how much does new evidence reduce it?
In 1948, an engineer at Bell Labs called Claude Shannon published a paper that quietly invented the digital age. He realised that "information" — that fuzzy thing we all talk about — could be measured precisely, in bits. The recipe is surprisingly simple, and it turns out to be the foundation for everything from data compression to neural network training.
Here's Shannon's intuition. If I tell you "the sun rose this morning", you've learned almost nothing — you already expected it. But if I tell you "the sun didn't rise this morning", you've learned a lot — that was unexpected.
So information is connected to surprise, and surprise is high when probability is low. Shannon's specific choice was to measure the surprise of an outcome $x$ as $-\log P(x)$.
The logarithm has two beautiful properties. First, when probability is 1 (certain), $-\log 1 = 0$ — no surprise. When probability is tiny, $-\log$ becomes huge — lots of surprise. Second, surprise from independent events adds: if A and B are independent, then $-\log P(A \cap B) = -\log P(A) - \log P(B)$. Information accumulates.
If we average the surprise over all possible outcomes, weighted by how likely they are, we get the entropy of the distribution.
For a discrete random variable X with distribution P:
$$H(X) = -\sum_x P(x) \log P(x).$$This is the average information content of X — or equivalently, the average uncertainty.
Two quick facts:
Fair coin: P(H) = P(T) = 1/2.
$$H(X) = -\tfrac{1}{2}\log_2 \tfrac{1}{2} - \tfrac{1}{2}\log_2 \tfrac{1}{2} = 1 \text{ bit}.$$Maximum uncertainty for two outcomes — exactly one bit, the most a coin can tell you.
Two-headed coin: P(H) = 1.
$$H(X) = -1 \cdot \log 1 = 0 \text{ bits}.$$You knew it was heads. Zero information. The flip told you nothing.
P(x) = 1/8 for each of 8 faces. Using log base 2:
$$H(X) = -\sum_{x=1}^{8} \tfrac{1}{8} \log_2 \tfrac{1}{8} = -\log_2 \tfrac{1}{8} = \log_2 8 = 3 \text{ bits}.$$Three bits — exactly what you'd need to encode 8 outcomes (000 to 111). Shannon's measure tells us this is also the most efficient possible code on average.
1. As uncertainty. Maximum when the distribution is spread evenly across all outcomes (a fair coin, a fair die). Minimum (zero) when one outcome dominates entirely.
2. As code length. $H(X)$ is the average number of bits per message in the optimal code for X. Frequent outcomes get short codes; rare outcomes get long ones.
Just as we extended probability from one variable to two, we can extend entropy.
The total uncertainty in the pair (X, Y) considered as a single thing.
Useful properties:
This is the average uncertainty about Y once you've been told X.
The slide highlights three formal properties — worth memorising:
Y = whether it rains tomorrow. X = the forecast.
Without seeing the forecast, $H(Y)$ might be high — you're very uncertain.
After seeing a forecast that says "98% chance of rain", $H(Y \mid X)$ drops dramatically — most uncertainty is resolved.
The amount the uncertainty drops is what we'll define next.
Mutual information measures the dependence between X and Y — how much knowing one reduces uncertainty about the other.
All these quantities — joint, conditional, mutual — fit together in beautifully tidy ways. Picture two overlapping circles: one for X's information, one for Y's.
$$ \begin{aligned} H(X, Y) &= H(X) + H(Y) - I(X; Y) \\ H(X \mid Y) &= H(X, Y) - H(Y) \\ H(Y \mid X) &= H(X, Y) - H(X) \end{aligned} $$The two circles are $H(X)$ and $H(Y)$. Their union is $H(X, Y)$ — the total information in both. Their intersection is $I(X; Y)$ — the information they share. The bits sticking out of each circle (only in X, only in Y) are the conditional entropies $H(X \mid Y)$ and $H(Y \mid X)$ — what's left of each once the other is known.
So far, we've measured properties of a single distribution P. Now suppose you have a true distribution P, and a guess Q. How wrong is your guess?
Average surprise when reality is P but you're scoring outcomes using Q's beliefs.
Cross entropy is the most popular loss function for classification in deep learning — when a neural network produces probabilities Q over classes and the true label distribution is P, training minimises exactly this quantity.
The slide flags a notation trap: cross entropy is sometimes written $H(P, Q)$ — looking exactly like joint entropy! They are not the same. When you see $H(P, Q)$, check the context: are P and Q two distributions on the same variable (cross entropy), or are they two random variables (joint entropy)?
Cross entropy is always at least as big as the entropy of the truth: $H_P(Q) \geq H(P)$. Equality holds only when Q matches P perfectly. The "extra" cost of using a wrong distribution is what KL divergence measures.
The extra cost in bits of encoding data drawn from P using a code optimised for Q.
Beautifully, this is just cross entropy minus entropy:
$$D_{\text{KL}}(P \| Q) = H_P(Q) - H(P).$$And as a bonus, mutual information itself is a KL divergence — between the joint distribution and what the joint would look like if X and Y were independent:
$$I(X; Y) = D_{\text{KL}}\big(P(X, Y) \,\big\|\, P(X) P(Y)\big).$$That's why $I(X;Y) = 0$ exactly when X and Y are independent: the joint coincides with the product of marginals, so the divergence is zero.
Once you see these connections, information theory feels less like a list of definitions and more like a single elegant idea told from different angles.
Each outcome has P = 1/4. So $H(X) = -\sum 4 \cdot \tfrac{1}{4} \log_2 \tfrac{1}{4} = -\log_2 \tfrac{1}{4} = 2$ bits.
Pattern: for a fair n-sided die, entropy is $\log_2 n$ bits.
Lower. Entropy peaks for the most uncertain (uniform) distribution. The biased coin is almost predictable, so its entropy is close to zero. Specifically: $H \approx 0.08$ bits, vs 1 bit for a fair coin.
Zero. Independence means knowing X tells you nothing about Y, so they share no information. Formally, P(x, y) = P(x)P(y), so the log term inside the mutual information sum becomes log 1 = 0.
The neural network outputs a predicted distribution Q over classes; the truth (one-hot) is P. Cross entropy $H_P(Q)$ measures the average surprise of P-data under Q. It is minimised when Q = P. So pushing cross entropy down forces the model's predicted probabilities to align with reality.
No! It's a "divergence", not a distance. In general $D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)$. The formula is asymmetric in the roles of P and Q. This matters in practice: the two directions emphasise different kinds of error.
Now that we have the language, what does it mean for a machine to learn? It depends on what kind of feedback the machine gets.
All machine learning boils down to using data to discover patterns or strategies. But there are three fundamentally different setups, depending on what kind of guidance the algorithm receives. They are: supervised learning, reinforcement learning, and unsupervised learning. You will meet each one many times this term.
You're given a stack of inputs paired with their correct outputs — like flashcards. The algorithm's job is to learn the rule connecting input to output, well enough to predict the output for new inputs it's never seen.
Learn a function $f: \mathcal{X} \to \mathcal{Y}$ from labelled examples, where $\mathcal{X}$ is the feature space (the inputs) and $\mathcal{Y}$ is the label space (the outputs).
If $\mathcal{Y} = \{0, 1\}$ — binary classification (e.g. spam or not).
If $\mathcal{Y} = \{y_1, \dots, y_k\}$ — multiclass classification (e.g. dog, cat, horse).
If $\mathcal{Y} \subseteq \mathbb{R}$ — regression (e.g. predicting price).
No flashcards. The algorithm is an agent placed inside an environment. It takes actions, receives rewards (or punishments), and learns over time which actions are best.
The setup involves a set of states $\mathcal{S}$ and possible actions $\mathcal{A}$. A policy $\pi: \mathcal{S} \to \mathcal{A}$ tells the agent what action to take in each state. The environment returns a reward $R$. The goal is to find:
$$\pi^* = \arg\max_\pi \mathbb{E}[R(\pi)].$$That is — the policy that maximises expected reward.
No labels, no rewards. Just data. The algorithm has to find patterns purely from the structure of the inputs themselves.
There's no single formal setup, but a frequent goal is to learn a low-dimensional latent representation:
$$f: \mathcal{X} \to \mathcal{Z}, \quad \mathcal{X} \subseteq \mathbb{R}^{d_X}, \; \mathcal{Z} \subseteq \mathbb{R}^{d_Z}, \; d_X > d_Z.$$That is, compress complex high-dimensional data into a simpler representation that captures its essence.
Supervised learning — specifically, regression (because the target is a continuous number).
Unsupervised learning (clustering). There are no pre-existing topic labels — the algorithm discovers groupings from text similarity alone.
For an algorithm to improve, it needs a target — a number it's trying to push down. That number is called the loss.
If supervised learning is about finding a function $f$ that maps inputs to outputs, we need a way to score how good our $f$ is. We pick a loss function $L$ that measures how far the prediction $f(x)$ is from the truth $y$, then train the algorithm to make $L$ as small as possible. Different problems call for different losses.
When predicting numbers, a natural penalty is the squared difference between prediction and reality, averaged over all data:
$$L_{\text{MSE}}(Y, \hat{Y}) = \mathbb{E}_{XY}\big[(f(x) - y)^2\big].$$Why squared? Two reasons. It's always positive (so over- and under-shoots both count), and it punishes big errors much more than small ones — being off by 10 hurts 100 times as much as being off by 1.
For predicting a category, the simplest score is just: how often are you right?
$$L_{\text{ACC}}(Y, \hat{Y}) = P_{XY}\big(f(x) = y\big).$$Strictly speaking this is a "score" — higher is better. The corresponding loss is "1 minus accuracy", or the error rate.
Here's a subtle but crucial point. The formulas above use the true joint distribution $P(x, y)$, expressed via expectations over X and Y. But we don't know that distribution! All we have is a finite dataset of examples.
So in practice, we approximate the true loss by averaging over our sample. This is called the empirical risk.
$$\hat{L}_{\text{MSE}}(Y, \hat{Y}) = \frac{1}{n}\sum_{i=1}^{n}\big(f(x_i) - y_i\big)^2.$$ $$\hat{L}_{\text{ACC}}(Y, \hat{Y}) = \frac{1}{n}\sum_{i=1}^{n}\mathbb{I}\big[f(x_i) = y_i\big].$$Here $\mathbb{I}[\cdot]$ is just an "indicator function" — 1 when the inside is true, 0 when false. Counting correct predictions, then dividing by total, gives accuracy.
The whole game of machine learning is to minimise true loss using only access to empirical loss. Doing well on training data (empirical) doesn't always mean doing well on new data (true). This gap will haunt every algorithm we study — it's why we need test sets, validation, and regularisation.
Suppose you knew the true joint distribution $P(x, y)$ perfectly. What's the best possible predictor? It's called the Bayes estimator, and for any input $x$ it picks the prediction $\hat{y}$ that minimises expected loss:
$$f^*(x) = \arg\min_{\hat{y}} \sum_y P(y \mid x) L(y, \hat{y}).$$The loss this best-possible predictor achieves is called the Bayes risk — and it's a kind of speed limit. No algorithm, no matter how clever, can do better.
The Bayes risk is often greater than zero. There may be irreducible noise in the world: two patients with identical symptoms can have different outcomes. That randomness is built into reality, not into our model. The best we can do is reach the Bayes risk; we cannot go below it.
The slides walk you through a simple regression example showing this clearly:
The lesson: how good a model can possibly get is set by the noise in the world, not just by the algorithm.
Slide 42 shows Francis Galton's classic 1885 dataset of children's heights versus their parents'. The cloud is roughly elliptical: tall parents tend to have tall children, but the relationship is far from perfect. Even with the best possible model — a regression line through the centre — there's still vertical scatter that no algorithm can eliminate. That scatter is the Bayes risk.
For supervised learning, MSE and accuracy are the workhorses. But other paradigms have their own ideas about what to minimise.
In RL, we measure how much reward we've left on the table compared to playing optimally:
$$\text{Regret}_k(\hat{\pi}) = \sum_{t=0}^{k} R_t(\pi^*) - R_t(\hat{\pi}).$$The cumulative gap, summed over $k$ rounds, between optimal and learned-policy rewards.
An autoencoder is a network with two halves: an encoder $f$ that compresses an input into a small code, and a decoder $g$ that tries to rebuild the input from that code. We score it by how close the reconstruction is to the original:
$$L(X, \hat{X}) = \mathbb{E}\big[(x - g(f(x)))^2\big].$$Low reconstruction error means the latent code retained the important information. This is how autoencoders learn meaningful compressed representations.
It depends on whether you treat the rating as a number or a category.
If as a number: MSE — predicting "4" when the truth is "5" is much closer than predicting "1", and MSE captures that gradient.
If as 5 unordered categories: accuracy — but you'd lose the information that "4 vs 5" is a small mistake while "1 vs 5" is a big one. For ratings, MSE is usually better.
Not necessarily — and probably not. Training accuracy is empirical loss; it tells you how well the model memorised the seen data. The real question is true loss: how it does on new data. A model can hit perfect training accuracy by memorising everything, while doing terribly on anything new. That gap (overfitting) is one of the central themes of the course.
It means the relationship between inputs and outputs is genuinely noisy or fundamentally limited — the inputs simply don't contain enough information to predict the output well. No model, however sophisticated, can do better. The fix is usually to find better features, not a better algorithm.
Both are valid. Squared error is preferred when:
• You want to punish large errors disproportionately (squaring inflates them).
• You want a smooth, differentiable loss (the absolute value has a sharp corner at zero, which makes optimisation slightly harder).
• You want the optimal predictor to be the conditional mean $\mathbb{E}[Y \mid X]$, which has nice statistical properties.
Absolute error, by contrast, is more robust to outliers and gives the conditional median.
Here's the through-line of Week 1, in one breath:
Every later week — neural networks, decision trees, kernel methods, generative models — will rest on these foundations. If a concept later in the course feels confusing, the chances are very high that re-reading the relevant bit of Week 1 will untangle it. Worth a re-read in a few days.
Good luck with the worksheet — you've got this.