A guided tour through Bayesianism and frequentism — the two great traditions for reasoning about uncertainty, the wars they fight, and the truce machine learning helps to broker.
Logic alone can't get us through the day. The world is messy, evidence is partial, and decisions can't always wait for certainty. Statistical inference is the toolkit for this reality.
Classical logic is wonderfully clean. Given true premises, deductive rules deliver true conclusions with no ambiguity. If all men are mortal and Socrates is a man, then Socrates is mortal. Or: if Arsenal or Tottenham won and Arsenal didn't win, then Tottenham won. These are airtight — the conclusion is locked in by the premises.
But most real-world situations don't come pre-packaged with airtight premises. Three quick examples from the slides:
None of these has a deductive answer. They all require reasoning under uncertainty — taking incomplete or imperfect information and turning it into a decision.
Statistical inference is any strategy for making decisions and drawing conclusions from data when our information is incomplete or imperfect.
This raises three deep questions: how do we weigh evidence, how do we evaluate hypotheses, and how do we interpret probabilities? Two big philosophical schools have given different answers — Bayesianism and frequentism — and the differences turn out to matter enormously for how we build machine learning algorithms. This week is your introduction to the schools and the war between them.
If a probability represents how strongly you believe something, what's the rational way to update that belief when new evidence arrives? Reverend Bayes had an answer.
In September 1983, a Soviet officer called Stanislav Petrov was on duty when his early-warning system reported five US nuclear missiles inbound. Protocol said: report it as an attack. Petrov refused. He reasoned that a real first strike would involve hundreds of missiles, not five, and that the system's prior probability of malfunctioning was high. He was right — the alarm was a glitch caused by sunlight reflecting off clouds. Petrov's intuitive Bayesian reasoning may well have saved the world.
Bayesianism is named after the Reverend Thomas Bayes (1701–1761), but most of the heavy lifting was actually done by Pierre-Simon Laplace (1749–1827). Modern Bayesians of note include Richard Jeffreys, Edwin Jaynes, and Jennifer Hill.
Bayesianism is a formal approach to modelling rational belief. It provides a recipe for optimally combining background knowledge with new evidence to update what we believe.
In Bayesian language: Bayes's theorem shows us how to combine a prior with a likelihood to infer a posterior.
Before we hit the formula, let's sit with the puzzle that motivates it. From slide 13:
Disease x has a prevalence of 0.1% — one in a thousand people have it.
Diagnostic test t for disease x is 99% accurate.
You test positive.
What's the probability you actually have the disease?
Most people instinctively answer "99%, or close to it". The right answer is roughly 9%. Yes — really.
The reason this answer feels wrong is that most people ignore the prior. The disease is rare; almost everyone tested doesn't have it; even a 1% false-positive rate generates many more false positives than there are true positives among the tiny diseased minority. We'll see exactly why in a moment. But first — the formula.
Let $h$ stand for some hypothesis (e.g. "I have the disease") and $h'$ for its negation. Let $e$ be some new evidence (e.g. "the test was positive"). Bayes's theorem says:
$$P(h \mid e) = \frac{P(e \mid h) \, P(h)}{P(e)}.$$
That denominator $P(e)$, the probability of seeing the evidence at all, can be expanded using the law of total probability:
$$P(h \mid e) = \frac{P(e \mid h) \, P(h)}{P(h) P(e \mid h) + P(h') P(e \mid h')}.$$
This second form is the workhorse: it's what you actually plug numbers into.
A really useful way to remember the theorem is in proportional form:
$$\underbrace{P(h \mid e)}_{\text{posterior}} \;\propto\; \underbrace{P(e \mid h)}_{\text{likelihood}} \times \underbrace{P(h)}_{\text{prior}}.$$
The marginal likelihood $P(e)$ in the full denominator is just the normalising constant that makes the posterior probabilities sum to 1 across all hypotheses. It doesn't affect which hypothesis is most probable; it just rescales.
"How much should I believe this hypothesis after seeing the evidence?" depends on two things: how much I believed it before (the prior), and how well it predicts what I observed (the likelihood). Multiply those together, divide by a normalising constant, and you have your updated belief.
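Let's nail it. Define:
• $h$ = you have the disease; $h'$ = you don't.
• $e$ = the test comes back positive.
The numbers from the slide:
• $P(h) = 0.001$ (the prevalence), so $P(h') = 0.999$.
• $P(e \mid h) = 0.99$ and $P(e \mid h') = 0.01$, reading "99% accurate" as a 99% true-positive rate and a 1% false-positive rate.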
Plug into the expanded formula:
$$P(h \mid e) = \frac{0.99 \times 0.001}{(0.001 \times 0.99) + (0.999 \times 0.01)} = \frac{0.00099}{0.01098} \approx 0.0902.$$
So the posterior probability you have the disease is about 9%, not 99%.
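As a quick check of the arithmetic above, here is a minimal Python sketch of the expanded form of Bayes's theorem. The function name `posterior` and its parameter names are illustrative rather than anything from the slides, and reading "99% accurate" as $P(e \mid h) = 0.99$ with $P(e \mid h') = 0.01$ is an assumption carried over from the worked calculation.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """Expanded Bayes's theorem for a hypothesis h and its negation h'."""
    numerator = p_e_given_h * prior
    # P(e) via the law of total probability: P(h)P(e|h) + P(h')P(e|h')
    marginal = numerator + p_e_given_not_h * (1.0 - prior)
    return numerator / marginal


# Rare-disease numbers: prevalence 0.1%, assumed 99% true-positive rate
# and 1% false-positive rate.
p_disease = posterior(prior=0.001, p_e_given_h=0.99, p_e_given_not_h=0.01)
print(f"P(disease | positive) = {p_disease:.4f}")  # 0.0902
```

Keeping the three inputs as separate arguments makes it easy to rerun the same calculation with a different prior or a different test.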
Imagine 100,000 people. About 100 have the disease (0.1% × 100,000); of these, the test correctly flags 99. Among the 99,900 healthy people, the test wrongly flags 1% — that's 999 false positives. So when anyone tests positive, they're one of 99 + 999 = 1,098 positives, and only 99 are real. Probability you're a real positive: 99 / 1,098 ≈ 9%. The rare base rate dominates.
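The head count above can also be reproduced with plain integer arithmetic. This is only a sketch of the same natural-frequency argument; the variable names are made up for illustration, and the 99% and 1% rates are the same assumed reading of "99% accurate".

```python
population = 100_000

sick = population // 1000              # 0.1% prevalence -> 100 people
healthy = population - sick            # 99,900 people

true_positives = sick * 99 // 100      # test flags 99% of the sick -> 99
false_positives = healthy // 100       # test wrongly flags 1% of the healthy -> 999

all_positives = true_positives + false_positives
share_real = true_positives / all_positives

print(f"{true_positives} true positives, {false_positives} false positives")
print(f"Share of positives that are real: {share_real:.1%}")
```

Counting people rather than multiplying probabilities gives the same 9% answer, which is why the natural-frequency framing is often easier to explain.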
This is called the base rate fallacy — ignoring the prior leads to wildly wrong intuitions. It is genuinely the most exam-worthy insight in this whole topic.
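The slides walk through a second case to give geometric intuition (the dotted-emoji slides 23–28). The setup: prevalence $P(h) = 0.1$, true-positive rate $P(e \mid h) = 0.8$, and false-positive rate $P(e \mid h') = 0.1$.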
Plug in:
$$P(h \mid e) = \frac{0.8 \times 0.1}{(0.1 \times 0.8) + (0.9 \times 0.1)} = \frac{0.08}{0.17} \approx 0.471.$$
So with a 100× higher prevalence (10% rather than 0.1%), a positive result now gives a roughly 47% posterior probability of disease, even though this test is less accurate than the first one. Same logic, dramatically different answer, and the change is driven by the prior.
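To see how strongly the prior drives the result, the sketch below holds the second test fixed (true-positive rate 0.8, false-positive rate 0.1, as in the formula above) and varies only the prevalence. The list of priors is an arbitrary choice made for illustration.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """Expanded Bayes's theorem for a hypothesis and its negation."""
    numerator = p_e_given_h * prior
    return numerator / (numerator + p_e_given_not_h * (1.0 - prior))


TRUE_POSITIVE_RATE = 0.8
FALSE_POSITIVE_RATE = 0.1

# Illustrative prevalences, from very rare to a coin toss.
priors = [0.001, 0.01, 0.1, 0.5]
posteriors = [posterior(p, TRUE_POSITIVE_RATE, FALSE_POSITIVE_RATE) for p in priors]

for prior, post in zip(priors, posteriors):
    print(f"prior = {prior:<6} posterior = {post:.3f}")
```

The row for a prior of 0.1 reproduces the 47% figure; the rows around it show the same test result meaning very different things.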
When you see a posterior probability, always ask: what was the prior? A test result can never tell you the answer on its own — it can only update what you already believed.
$P(h \mid e) = \dfrac{P(e \mid h) P(h)}{P(e)}$, where:
• $P(h)$ = prior — belief in $h$ before seeing evidence.
• $P(e \mid h)$ = likelihood — probability of the evidence given $h$.
• $P(h \mid e)$ = posterior — updated belief after evidence.
• $P(e)$ = marginal likelihood — overall probability of seeing the evidence; the normalising constant.
Let $h$ = uses the drug, $e$ = positive test. Then $P(h) = 0.02$, $P(e\mid h) = 0.95$, $P(e\mid h') = 0.05$.
$P(h\mid e) = \dfrac{0.95 \times 0.02}{(0.02 \times 0.95) + (0.98 \times 0.05)} = \dfrac{0.019}{0.068} \approx 0.279.$
About 28%. Even with a "95% accurate" test, most positive results are false positives when the base rate is low. Same lesson as the rare disease example.
Because the disease is so rare that, among the population of people who test positive, false positives vastly outnumber true positives. With prevalence 0.1% and a 1% false-positive rate, for every true positive (out of ~100 sick people in 100,000) there are about 10 false positives (out of 99,900 healthy people, 1% of whom test positive). Ignoring the prior — the base rate fallacy — gives the wrong intuition.
Without it, you get a number proportional to the posterior but not equal to it — fine for comparing hypotheses ("which is more likely?"), but not for stating the actual probability ("how likely?"). The marginal likelihood ensures the posterior probabilities sum to 1 across all hypotheses, giving you a calibrated answer rather than a relative score.
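A small sketch of that distinction, using the rare-disease numbers (prevalence 0.1%, assumed 99% true-positive and 1% false-positive rates). The dictionary keys are just labels chosen for this example.

```python
prior = {"disease": 0.001, "no disease": 0.999}
likelihood = {"disease": 0.99, "no disease": 0.01}  # P(positive | hypothesis)

# Unnormalised scores: likelihood x prior. Good enough for ranking hypotheses.
scores = {h: likelihood[h] * prior[h] for h in prior}

# Dividing by the marginal likelihood P(e) turns scores into probabilities.
marginal = sum(scores.values())
posterior = {h: score / marginal for h, score in scores.items()}

for h in prior:
    print(f"{h:<11} score = {scores[h]:.5f}  posterior = {posterior[h]:.4f}")
```

The scores and the posteriors pick the same winner, but only the posteriors sum to 1 and can be read as probabilities.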
Frequentists refuse to put probabilities on hypotheses at all. Instead, they design procedures that, used over and over, control how often we make mistakes.
If you find Bayesian "degrees of belief" suspiciously subjective, you're in good company. Frequentists rejected that whole framing. To them, a probability isn't how strongly you believe something — it's the long-run frequency with which an event occurs in repeated trials. Early frequentist theory came from John Venn (yes, of the diagrams) and Richard von Mises. Hypothesis testing as we now teach it was largely developed by Jerzy Neyman and Egon Pearson. Modern frequentists of note include David Cox, Deborah Mayo, and Nate Silver.
Frequentists interpret probabilities as limiting frequencies in repeated experiments — not as degrees of belief. The goal of hypothesis testing is to design rules that control or minimise the long-run error rate of those rules.
The standard frequentist framework is called null hypothesis significance testing (NHST).
In NHST, we pit two mutually exclusive, jointly exhaustive hypotheses against each other:
For example: testing whether a coin is fair. Let $\theta \in [0, 1]$ be the probability of landing heads. Then:
$$H_0: \theta = 0.5 \quad \text{vs.} \quad H_1: \theta \neq 0.5.$$
More formally, we partition the parameter space $\Theta$ into a null region $\Theta_0$ and an alternative region $\Theta_1$, such that:
Why frame everything around a "null"? Because of a philosophical commitment to falsificationism, championed by Karl Popper:
The frequentist version: we never prove $H_1$. We only fail to falsify $H_0$, or we falsify it. There are exactly two possible outcomes of any NHST:
Frequentists never say "we accept $H_0$" — they say "we fail to reject" it. The distinction is real: failing to find evidence against the null isn't the same as evidence for the null. (Absence of evidence ≠ evidence of absence.)
Since there are two outcomes and two true states (the null is actually true, or actually false), there are two kinds of error:
| Truth | Decision: reject $H_0$ | Decision: don't reject $H_0$ |
|---|---|---|
| $H_1$ true | TP | FN |
| $H_0$ true | FP | TN |
Type I error (false positive): rejecting $H_0$ when it's actually true. The false positive rate is:
$$\alpha := \frac{\text{FP}}{\text{FP} + \text{TN}}$$
Type II error (false negative): failing to reject $H_0$ when it's actually false. The false negative rate is:
$$\beta := \frac{\text{FN}}{\text{TP} + \text{FN}}$$
The denominators are about the truth (total actual negatives and total actual positives respectively), not about your decisions.
Type I error sees something that isn't there (false alarm). Type II error misses something that is there. The "power" of a test is $1 - \beta$ — the probability of correctly catching a real effect.
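The definitions of $\alpha$, $\beta$, and power translate directly into code. The counts below are hypothetical, invented only to show which cells of the table feed which rate.

```python
def error_rates(tp, fn, fp, tn):
    """Return (alpha, beta, power) from the four cells of the decision table."""
    alpha = fp / (fp + tn)   # false positive rate: among cases where H0 is true
    beta = fn / (tp + fn)    # false negative rate: among cases where H1 is true
    power = 1.0 - beta       # chance of catching a real effect
    return alpha, beta, power


# Hypothetical tallies from 200 repeated experiments (100 with H1 true, 100 with H0 true).
alpha, beta, power = error_rates(tp=80, fn=20, fp=5, tn=95)
print(f"alpha = {alpha:.2f}, beta = {beta:.2f}, power = {power:.2f}")
```

Note that each rate divides by a row of the table (the truth), never by a column (the decision).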
The frequentist machinery for hypothesis testing works like this:
You want to control $\alpha$ and minimise $\beta$ — minimise misses while bounding false alarms. A test that achieves this for any value of $\alpha$ is called uniformly most powerful (UMP).
A statistical test is uniformly most powerful if, for any chosen $\alpha$, it minimises the type II error rate $\beta$ among all tests that control the type I error rate at level $\alpha$.
Neyman and Pearson proved the existence and uniqueness of several UMP tests using likelihood ratios — this result is known as the Neyman–Pearson lemma, and it's why their names are attached to the framework.
You flip a coin 10 times and get 7 heads. Is the coin fair? We're testing $H_0: \theta = 0.5$ versus $H_1: \theta \neq 0.5$.
The probability of seeing exactly $k$ heads in $n$ flips with fixed probability $\theta$ is the binomial PMF:
$$P(X = k) = \binom{n}{k} \theta^k (1-\theta)^{n-k}.$$
Plugging in $k = 7$, $n = 10$, $\theta = 0.5$:
$$P(X = 7) = \binom{10}{7}(0.5)^7(0.5)^3 \approx 0.1172.$$
But the question is not just "what's the probability of exactly 7 heads?". The frequentist asks: under $H_0$, how often would we see a result this extreme or more extreme? That's the p-value.
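The binomial PMF value quoted above can be checked with Python's standard library. The function name `binomial_pmf` is illustrative.

```python
from math import comb


def binomial_pmf(k, n, theta):
    """P(X = k) for n independent flips, each landing heads with probability theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


p_seven = binomial_pmf(7, 10, 0.5)
print(f"P(X = 7) = {p_seven:.4f}")  # 0.1172
```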
The p-value is the probability, assuming $H_0$ is true, of observing a test statistic at least as extreme as the one we actually observed.
Because we're doing a two-sided test (the coin could be biased either way), "as extreme as 7 out of 10" includes both tails:
$$p = \sum_{k=0}^{3} P(X = k) + \sum_{k=7}^{10} P(X = k) = 0.34375.$$
By the symmetry of the fair-coin distribution, getting 7 heads out of 10 is just as "extreme" as getting 3 heads. The p-value sums probabilities of all outcomes at least that lopsided.
With $\alpha = 0.05$, since $p = 0.34 \gg 0.05$, we fail to reject $H_0$. The data is consistent with a fair coin.
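The two-sided p-value for the coin example can be computed by adding up every outcome at least as lopsided as the observed one. The sketch below assumes the fair-coin null $\theta = 0.5$ specifically, so that "lopsided" can be measured as distance from $n/2$; it is not a general-purpose binomial test.

```python
from math import comb


def two_sided_p_value(k, n):
    """Two-sided p-value for H0: theta = 0.5, given k heads in n flips."""
    observed_gap = abs(2 * k - n)  # distance from n/2, doubled to stay in integers
    tail_count = sum(
        comb(n, j) for j in range(n + 1) if abs(2 * j - n) >= observed_gap
    )
    return tail_count / 2**n  # every sequence of n fair flips has probability 1/2^n


ALPHA = 0.05
p = two_sided_p_value(7, 10)
decision = "reject H0" if p < ALPHA else "fail to reject H0"
print(f"p = {p:.5f} -> {decision}")  # p = 0.34375 -> fail to reject H0
```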
Now suppose we flip 30 times and get 21 heads — the same proportion (70%). Plug in $k = 21$, $n = 30$, $\theta = 0.5$:
$$P(X = 21) \approx 0.0133.$$
The two-sided p-value:
$$p = \sum_{k=0}^{9} P(X=k) + \sum_{k=21}^{30} P(X=k) \approx 0.0428.$$
Now $p < 0.05$, so we reject $H_0$. With more data, the same proportion of heads becomes statistically significant.
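To watch the p-value shrink as the sample grows, the sketch below keeps the proportion of heads at 70% and varies $n$. The cases with 20 and 50 flips are extra illustrations that do not appear in the slides, and the function again assumes the fair-coin null.

```python
from math import comb


def two_sided_p_value(k, n):
    """Two-sided p-value for H0: theta = 0.5, given k heads in n flips."""
    observed_gap = abs(2 * k - n)
    tail_count = sum(
        comb(n, j) for j in range(n + 1) if abs(2 * j - n) >= observed_gap
    )
    return tail_count / 2**n


# (heads, flips) pairs, all with 70% heads.
trials = [(7, 10), (14, 20), (21, 30), (35, 50)]
p_values = {n: two_sided_p_value(k, n) for k, n in trials}

for n, p in p_values.items():
    verdict = "reject H0" if p < 0.05 else "fail to reject H0"
    print(f"n = {n:>2}: p = {p:.4f} -> {verdict}")
```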
Sample size matters enormously. A 70/30 split looks pretty unfair to the eye, but with only 10 flips you can't rule out chance. With 30 flips of the same proportion, you can. This is why a small $n$ is one of the most common critiques of results based on significance tests.
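The same point can be made in terms of power. The sketch below asks: if the coin really landed heads 70% of the time, how often would the two-sided test at $\alpha = 0.05$ reject the fair-coin null? The true bias of 0.7 is an assumption made purely for illustration, and the rejection rule ($p < \alpha$) mirrors the worked examples.

```python
from math import comb


def binomial_pmf(k, n, theta):
    """P(X = k) for n flips with heads probability theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def two_sided_p_value(k, n):
    """Two-sided p-value for H0: theta = 0.5, given k heads in n flips."""
    observed_gap = abs(2 * k - n)
    tail_count = sum(
        comb(n, j) for j in range(n + 1) if abs(2 * j - n) >= observed_gap
    )
    return tail_count / 2**n


def power(n, true_theta, alpha=0.05):
    """Probability that the test rejects H0 when the coin's real bias is true_theta."""
    return sum(
        binomial_pmf(k, n, true_theta)
        for k in range(n + 1)
        if two_sided_p_value(k, n) < alpha
    )


for n in (10, 30):
    print(f"n = {n}: power against theta = 0.7 is {power(n, 0.7):.3f}")
```

Calling `power(n, 0.5)` gives the actual type I error rate of the rule, which stays below 0.05 for both sample sizes because the binomial distribution is discrete.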
The p-value is NOT the probability that $H_0$ is true. It is the probability of the data (or more extreme) given that $H_0$ is true. These are entirely different — a frequentist would say the first quantity is meaningless, since hypotheses don't have probabilities. This confusion is one of the main objections Bayesians level at NHST.
Type I (α): false positive — rejecting $H_0$ when it's actually true. "Crying wolf when there isn't one."
Type II (β): false negative — failing to reject $H_0$ when it's actually false. "Missing a real wolf."
α is set in advance (often 0.05); β is what we try to minimise subject to that constraint.
Since $p = 0.11 > 0.05$, we fail to reject $H_0$. The data is consistent with a fair coin at the 5% significance level. (Note: this does NOT mean the coin is fair — only that we don't have strong enough evidence to say it isn't.)
Failing to reject the null isn't evidence that the null is true — it's just evidence we couldn't disprove it with the data we have. The test was set up to falsify, not to confirm. Saying "accept" overstates what the procedure actually licenses. Always say "fail to reject".
For any chosen α, the test minimises the type II error rate β across all tests that successfully control type I error at level α. In other words: it's the test that gives you the most power (1 − β) for any given tolerance for false positives.
It tells you: if the drug had no effect (the null), the probability of observing data this favourable to the drug (or more so) would be 4%. It does not tell you the probability that the drug works is 96%. To get that, you'd need to bring in a prior — and you'd have left frequentism for Bayesianism.
Two and a half centuries of philosophical disagreement, with real practical consequences for how science gets done — and where machine learning gets to slot in.
By now you've probably noticed: Bayesianism and frequentism aren't just two computational tools. They have genuinely different views about what a probability is. As the statistician Brad Efron put it:
Slide 48 shows Randall Munroe's famous xkcd comic ("Frequentists vs. Bayesians"). The setup: a neutrino detector tells two statisticians whether the sun has gone nova. The detector rolls two dice and lies if both come up six (probability 1/36 ≈ 0.028).
The detector announces: "Yes — the sun has gone nova."
The frequentist calculates: under $H_0$ ("sun is fine"), the probability of getting a "yes" is 1/36 < 0.05, so we reject $H_0$. The sun has exploded!
The Bayesian shrugs: "Bet you $50 it hasn't." Why? Because the prior probability of the sun going nova is astronomically small, and a 1/36 false-positive rate isn't enough to overcome it. The Bayesian wins the bet.
The cartoon is funny because it lampoons exactly the failure mode the rare-disease example highlighted: ignoring priors gives absurd answers when the prior is extreme.
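The Bayesian's side of the bet can be made concrete. The comic gives no prior, so the one-in-a-million figure below is a placeholder assumption chosen only to show the mechanics; any realistic prior would be far smaller still.

```python
from fractions import Fraction

P_LIE = Fraction(1, 36)  # detector lies only when both dice come up six


def posterior_nova(prior):
    """P(sun has gone nova | detector says yes), by Bayes's theorem."""
    p_yes_given_nova = 1 - P_LIE  # detector tells the truth
    p_yes_given_fine = P_LIE      # detector lies
    numerator = p_yes_given_nova * prior
    return numerator / (numerator + p_yes_given_fine * (1 - prior))


ASSUMED_PRIOR = Fraction(1, 1_000_000)  # hypothetical, not from the comic
post = posterior_nova(ASSUMED_PRIOR)
print(f"P(nova | yes) = {float(post):.6f}")
```

Even after the detector says "yes", the posterior stays tiny, which is why the Bayesian is happy to take the bet.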
The slide ends on a hopeful note: the wars don't have to be eternal. Modern statistics has produced several genuine hybrids — methods that draw from both traditions. The slide names four:
If your exam asks for a synthesis approach, "Empirical Bayes" or "PAC-Bayes" are the safest two to mention.
Many classical inference problems can be reframed as machine learning problems — sometimes with surprising elegance.
The closing arc of the lecture argues that machine learning isn't just another tool in the statistical toolbox; it changes how we approach inference altogether. Three concrete examples are flagged:
The whole agenda is anchored by a now-famous 2001 essay by Leo Breiman, "Statistical Modeling: The Two Cultures":
Breiman's "two cultures" — the data-model culture (classical statistics) versus the algorithmic-model culture (machine learning) — sets up the rest of this course. From Week 3 onwards, we mostly live in the second culture. But the first one is what we just toured, and you'll need it on the exam.
Pool the two samples and label each observation by which sample it came from. Train a classifier on this binary task. If the classifier achieves accuracy meaningfully above chance (50%), the two distributions differ — there's signal a model can learn. If it can't beat chance, the distributions are indistinguishable. The classifier's accuracy itself becomes a test statistic.
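A minimal sketch of this idea using only the standard library. Everything specific here is an assumption made for illustration: the data are one-dimensional and synthetic, the classifier is a simple nearest-mean rule, and the p-value treats held-out accuracy as a binomial count with success probability 0.5, which relies on the two samples being roughly the same size.

```python
import random
from math import comb
from statistics import fmean


def classifier_two_sample_test(sample_a, sample_b, seed=0):
    """Return (held-out accuracy, one-sided p-value against chance accuracy)."""
    rng = random.Random(seed)
    labelled = [(x, 0) for x in sample_a] + [(x, 1) for x in sample_b]
    rng.shuffle(labelled)
    half = len(labelled) // 2
    train, held_out = labelled[:half], labelled[half:]

    # "Train" a nearest-mean classifier: remember each sample's training mean.
    mean_a = fmean([x for x, label in train if label == 0])
    mean_b = fmean([x for x, label in train if label == 1])

    # Predict 1 when a point is closer to mean_b than to mean_a.
    correct = sum(
        1
        for x, label in held_out
        if (1 if abs(x - mean_b) < abs(x - mean_a) else 0) == label
    )
    n = len(held_out)

    # Chance of scoring at least this well if the labels were unguessable.
    p_value = sum(comb(n, k) for k in range(correct, n + 1)) / 2**n
    return correct / n, p_value


rng = random.Random(42)
same_a = [rng.gauss(0.0, 1.0) for _ in range(400)]
same_b = [rng.gauss(0.0, 1.0) for _ in range(400)]
shifted_b = [rng.gauss(2.0, 1.0) for _ in range(400)]

acc_same, p_same = classifier_two_sample_test(same_a, same_b)
acc_diff, p_diff = classifier_two_sample_test(same_a, shifted_b)
print(f"same distribution:    accuracy = {acc_same:.3f}, p = {p_same:.3f}")
print(f"shifted distribution: accuracy = {acc_diff:.3f}, p = {p_diff:.3g}")
```

Evaluating on held-out data matters: accuracy measured on the training points would overstate how distinguishable the samples are.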
Empirical Bayes is Bayesian inference where the prior is estimated from the observed data, rather than specified subjectively before seeing any data. This sidesteps the most common frequentist objection to Bayesianism (priors are arbitrary) while still using Bayesian machinery to combine prior with likelihood. It's a synthesis because the data-driven prior is essentially a frequentist move plugged into a Bayesian engine.
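As a rough sketch of that recipe, suppose several coins are each flipped 20 times. The head counts below are hypothetical, and fitting a Beta prior by matching the mean and variance of the observed proportions is one crude choice among many (it ignores the sampling noise in each proportion). The point is only the shape of the procedure: estimate the prior from all the data, then update each coin with it.

```python
from statistics import fmean, pvariance

FLIPS_PER_COIN = 20
heads_counts = [9, 11, 10, 14, 8, 12, 10, 6]  # hypothetical data, one entry per coin

# Step 1 (the "empirical" part): estimate a Beta(a, b) prior from the data
# by matching the mean and variance of the observed proportions.
rates = [k / FLIPS_PER_COIN for k in heads_counts]
mean_rate = fmean(rates)
var_rate = pvariance(rates)
strength = mean_rate * (1 - mean_rate) / var_rate - 1  # a + b
a = mean_rate * strength
b = (1 - mean_rate) * strength

# Step 2 (the Bayesian part): posterior mean for each coin under that prior.
shrunk = [(a + k) / (a + b + FLIPS_PER_COIN) for k in heads_counts]

for raw, est in zip(rates, shrunk):
    print(f"raw = {raw:.2f}  empirical-Bayes estimate = {est:.3f}")
```

Each estimate is pulled from the coin's raw proportion towards the overall mean, with the amount of shrinkage set by the data rather than by a hand-picked prior.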
Logic gives certainty; the world doesn't oblige. Statistical inference is the discipline of reasoning despite incomplete information. Two great traditions answer the call differently:
For the exam: be ready to (i) state Bayes's theorem and identify its four ingredients in a worked problem, (ii) compute a posterior given a prior and likelihood, (iii) explain why a low base rate makes most positive tests false, (iv) define Type I/II errors and α/β, (v) compute or interpret a p-value, (vi) name at least one objection to each school and at least one synthesis approach.
Now go nail this worksheet.