Week Two · Statistical Inference

Belief, Evidence, and the Art of Being Wrong

A guided tour through Bayesianism and frequentism — the two great traditions for reasoning about uncertainty, the wars they fight, and the truce machine learning helps to broker.

KCL · Machine Learning · Reading time: ~50 min · Self-test: 16 questions

Part I Why We Need Statistical Inference

Logic alone can't get us through the day. The world is messy, evidence is partial, and decisions can't always wait for certainty. Statistical inference is the toolkit for this reality.

Classical logic is wonderfully clean. Given true premises, deductive rules deliver true conclusions with no ambiguity. If all men are mortal and Socrates is a man, then Socrates is mortal. Or: if Arsenal or Tottenham won and Arsenal didn't win, then Tottenham won. These are airtight — the conclusion is locked in by the premises.

But most real-world situations don't come pre-packaged with airtight premises. Three quick examples from the slides:

You've been diagnosed with a rare disease but show no common symptoms. Should you start treatment?
The forecast says 10% chance of rain. Should you risk wearing your new suede shoes?
Arsenal are 2:1 favourites, but you have a feeling about Tottenham's new striker. Should you place a bet?

None of these has a deductive answer. They all require reasoning under uncertainty — taking incomplete or imperfect information and turning it into a decision.

Definition · Statistical inference

Statistical inference is any strategy for making decisions and drawing conclusions from data when our information is incomplete or imperfect.

This raises three deep questions: how do we weigh evidence, how do we evaluate hypotheses, and how do we interpret probabilities? Two big philosophical schools have given different answers — Bayesianism and frequentism — and the differences turn out to matter enormously for how we build machine learning algorithms. This week is your introduction to the schools and the war between them.

· · ·

Part II Bayesianism · Updating Beliefs

If a probability represents how strongly you believe something, what's the rational way to update that belief when new evidence arrives? Reverend Bayes had an answer.

In September 1983, a Soviet officer called Stanislav Petrov was on duty when his early-warning system reported five US nuclear missiles inbound. Protocol said: report it as an attack. Petrov refused. He reasoned that a real first strike would involve hundreds of missiles, not five, and that the system's prior probability of malfunctioning was high. He was right — the alarm was a glitch caused by sunlight reflecting off clouds. Petrov's intuitive Bayesian reasoning may well have saved the world.

Bayesianism is named after the Reverend Thomas Bayes (1701–1761), but most of the heavy lifting was actually done by Pierre-Simon Laplace (1749–1827). Modern Bayesians of note include Richard Jeffreys, Edwin Jaynes, and Jennifer Hill.

Definition · Bayesianism

Bayesianism is a formal approach to modelling rational belief. It provides a recipe for optimally combining background knowledge with new evidence to update what we believe.

In Bayesian language: Bayes's theorem shows us how to combine a prior with a likelihood to infer a posterior.

The marquee puzzle: rare disease

Before we hit the formula, let's sit with the puzzle that motivates it. From slide 13:

The puzzle

Disease x has a prevalence of 0.1% — one in a thousand people have it.
Diagnostic test t for disease x is 99% accurate.
You test positive.
What's the probability you actually have the disease?

Most people instinctively answer "99%, or close to it". The right answer is roughly 9%. Yes — really.

The reason this answer feels wrong is that most people ignore the prior. The disease is rare; almost everyone tested doesn't have it; even a 1% false-positive rate generates many more false positives than there are true positives among the tiny diseased minority. We'll see exactly why in a moment. But first — the formula.

Bayes's theorem

Let $h$ stand for some hypothesis (e.g. "I have the disease") and $h'$ for its negation. Let $e$ be some new evidence (e.g. "the test was positive"). Bayes's theorem says:

$$P(h \mid e) = \frac{P(e \mid h) \, P(h)}{P(e)}.$$

That denominator $P(e)$ — the probability of seeing the evidence at all — can be expanded using the law of total probability:

$$P(h \mid e) = \frac{P(e \mid h) \, P(h)}{P(h) P(e \mid h) + P(h') P(e \mid h')}.$$

This second form is the workhorse — it's what you actually plug numbers into.

The four ingredients

Definition · Bayes's four ingredients

Prior $P(h)$: the probability assigned to $h$ before seeing the new evidence.
Likelihood $P(e \mid h)$: the probability of the evidence, given the hypothesis is true.
Posterior $P(h \mid e)$: the probability of the hypothesis after observing the evidence.
Marginal likelihood $P(e)$: the total probability of the evidence, marginalising over all hypotheses: $P(e) = \sum_h P(h) P(e \mid h)$.

A really useful way to remember the theorem is in proportional form:

$$\underbrace{P(h \mid e)}_{\text{posterior}} \;\propto\; \underbrace{P(e \mid h)}_{\text{likelihood}} \times \underbrace{P(h)}_{\text{prior}}.$$

The marginal likelihood $P(e)$ in the full denominator is just the normalising constant that makes the posterior probabilities sum to 1 across all hypotheses. It doesn't affect which hypothesis is most probable; it just rescales.
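A minimal Python sketch of the normalisation step, assuming three hypothetical hypotheses with made-up priors and likelihoods (none of these numbers come from the slides). It illustrates that dividing by $P(e)$ rescales the scores so they sum to 1 without changing which hypothesis comes out on top.

```python
# Hypothetical priors P(h) and likelihoods P(e | h) for three competing hypotheses.
priors = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
likelihoods = {"h1": 0.1, "h2": 0.4, "h3": 0.7}

# Unnormalised scores: likelihood times prior.
scores = {h: likelihoods[h] * priors[h] for h in priors}

# Marginal likelihood P(e): the sum of the scores over all hypotheses.
p_e = sum(scores.values())

# Posterior: the same scores rescaled so that they sum to 1.
posterior = {h: s / p_e for h, s in scores.items()}

best_by_score = max(scores, key=scores.get)
best_by_posterior = max(posterior, key=posterior.get)
print(posterior, best_by_score, best_by_posterior)
```

The ranking is identical before and after normalisation; only the scale changes.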

Plain-English version

"How much should I believe this hypothesis after seeing the evidence?" depends on two things: how much I believed it before (the prior), and how well it predicts what I observed (the likelihood). Multiply those together, divide by a normalising constant, and you have your updated belief.

Solving the rare disease puzzle

Let's nail it. Define:

$h$ = "I have disease $x$"
$e$ = "the test came back positive"

The numbers from the slide:

Prior: $P(h) = 0.001$ (prevalence is 0.1%)
Likelihood under $h$: $P(e \mid h) = 0.99$ (reading "99% accurate" as a 99% true positive rate)
Likelihood under $h'$: $P(e \mid h') = 0.01$ (and as a 99% true negative rate, so the false positive rate is 1%)

Plug into the expanded formula:

$$P(h \mid e) = \frac{0.99 \times 0.001}{(0.001 \times 0.99) + (0.999 \times 0.01)} = \frac{0.00099}{0.01098} \approx 0.0902.$$

So the posterior probability you have the disease is about 9%, not 99%.
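The calculation above can be sketched as a small Python function. The function name and argument names are illustrative choices, not something from the slides; the numbers are the ones given in the puzzle.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """Return P(h | e) via Bayes's theorem with the expanded denominator."""
    numerator = p_e_given_h * prior
    marginal = numerator + p_e_given_not_h * (1.0 - prior)
    return numerator / marginal

# Rare disease puzzle: prevalence 0.1%, true positive rate 99%, false positive rate 1%.
rare = posterior(prior=0.001, p_e_given_h=0.99, p_e_given_not_h=0.01)
print(round(rare, 4))  # 0.0902
```

Calling the function with a different prior and the same test is a quick way to see how strongly the answer depends on prevalence.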

Why so low?

Imagine 100,000 people. About 100 have the disease (0.1% × 100,000); of these, the test correctly flags 99. Among the 99,900 healthy people, the test wrongly flags 1% — that's 999 false positives. So when anyone tests positive, they're one of 99 + 999 = 1,098 positives, and only 99 are real. Probability you're a real positive: 99 / 1,098 ≈ 9%. The rare base rate dominates.
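The same head count can be written out in a few lines of Python. This is only a sketch of the counting argument, using integer arithmetic so that the figures match the paragraph exactly; the variable names are illustrative.

```python
population = 100_000
sick = population // 1000                 # 0.1% prevalence: 100 people
healthy = population - sick               # 99,900 people

true_positives = sick * 99 // 100         # 99% of the sick test positive
false_positives = healthy * 1 // 100      # 1% of the healthy test positive

share_real = true_positives / (true_positives + false_positives)
print(true_positives, false_positives, round(share_real, 3))  # 99 999 0.09
```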

This is called the base rate fallacy — ignoring the prior leads to wildly wrong intuitions. It is genuinely the most exam-worthy insight in this whole topic.

A second example: less rare disease

The slides walk through a second case to give geometric intuition (the dotted-emoji slides 23–28). The setup:

Prevalence: 10%, so $P(h) = 0.1$, $P(h') = 0.9$.
True positive rate: $P(e \mid h) = 0.8$.
False positive rate: $P(e \mid h') = 0.1$.

Plug in:

$$P(h \mid e) = \frac{0.8 \times 0.1}{(0.1 \times 0.8) + (0.9 \times 0.1)} = \frac{0.08}{0.17} \approx 0.471.$$

So with a 100× higher prevalence (10% instead of 0.1%), even this less accurate test gives a roughly 47% posterior probability of disease. Same logic, dramatically different answer, mainly because the prior changed.

Lesson

When you see a posterior probability, always ask: what was the prior? A test result can never tell you the answer on its own — it can only update what you already believed.

Self-test · Bayesianism

State Bayes's theorem and identify the four ingredients.

$P(h \mid e) = \dfrac{P(e \mid h) P(h)}{P(e)}$, where:

• $P(h)$ = prior — belief in $h$ before seeing evidence.
• $P(e \mid h)$ = likelihood — probability of the evidence given $h$.
• $P(h \mid e)$ = posterior — updated belief after evidence.
• $P(e)$ = marginal likelihood — overall probability of seeing the evidence; the normalising constant.

A drug test is 95% accurate. 2% of athletes use the drug. An athlete tests positive. What's the probability they actually use it?

Let $h$ = uses the drug, $e$ = positive test. Then $P(h) = 0.02$, $P(e\mid h) = 0.95$, $P(e\mid h') = 0.05$.

$P(h\mid e) = \dfrac{0.95 \times 0.02}{(0.02 \times 0.95) + (0.98 \times 0.05)} = \dfrac{0.019}{0.068} \approx 0.279.$

About 28%. Even with a "95% accurate" test, roughly 72% of positive results are false positives when the base rate is this low. Same lesson as the rare disease example.

Why is the answer to the rare disease puzzle ~9% and not ~99%?

Because the disease is so rare that, among the population of people who test positive, false positives vastly outnumber true positives. With prevalence 0.1% and a 1% false-positive rate, for every true positive (out of ~100 sick people in 100,000) there are about 10 false positives (out of 99,900 healthy people, 1% of whom test positive). Ignoring the prior — the base rate fallacy — gives the wrong intuition.

If the marginal likelihood P(e) is just a normalising constant, why include it?

Without it, you get a number proportional to the posterior but not equal to it — fine for comparing hypotheses ("which is more likely?"), but not for stating the actual probability ("how likely?"). The marginal likelihood ensures the posterior probabilities sum to 1 across all hypotheses, giving you a calibrated answer rather than a relative score.

· · ·

Part III Frequentism · Minimising Error

Frequentists refuse to put probabilities on hypotheses at all. Instead, they design procedures that, used over and over, control how often we make mistakes.

If you find Bayesian "degrees of belief" suspiciously subjective, you're in good company. Frequentists rejected that whole framing. To them, a probability isn't how strongly you believe something; it's the long-run frequency with which an event occurs in repeated trials. Early frequentist theory came from John Venn (yes, of the diagrams) and Richard von Mises. Hypothesis testing as we now teach it was largely developed by Jerzy Neyman and Egon Pearson. Modern frequentists of note include David Cox and Deborah Mayo.

Definition · Frequentism

Frequentists interpret probabilities as limiting frequencies in repeated experiments — not as degrees of belief. The goal of hypothesis testing is to design rules that control or minimise the long-run error rate of those rules.

The standard frequentist framework is called null hypothesis significance testing (NHST).

The setup: two hypotheses

In NHST, we pit two mutually exclusive, jointly exhaustive hypotheses against each other:

$H_0$ — the null hypothesis. Usually represents "nothing interesting is happening" — the boring default.
$H_1$ — the alternative hypothesis. The exciting claim we want evidence for.

For example: testing whether a coin is fair. Let $\theta \in [0, 1]$ be the probability of landing heads. Then:

$$H_0: \theta = 0.5 \quad \text{vs.} \quad H_1: \theta \neq 0.5.$$

More formally, we partition the parameter space $\Theta$ into a null region $\Theta_0$ and an alternative region $\Theta_1$, such that:

$\Theta_0 \cap \Theta_1 = \emptyset$ (mutually exclusive — no overlap)
$\Theta_0 \cup \Theta_1 = \Theta$ (jointly exhaustive — together they cover everything)

Falsificationism

Why frame everything around a "null"? Because of a philosophical commitment to falsificationism, championed by Karl Popper:

Theories are never empirically verifiable…I shall require that its logical form shall be such that it can be singled out, by means of empirical tests, in a negative sense: it must be possible for an empirical scientific system to be refuted by experience. — Karl Popper, Logic of Scientific Discovery

The frequentist version: we never prove $H_1$. We only fail to falsify $H_0$, or we falsify it. There are exactly two possible outcomes of any NHST:

We reject $H_0$ (the data are inconsistent with the null).
We fail to reject $H_0$ (the data don't give us enough to rule it out).

Notation pedantry that matters

Frequentists never say "we accept $H_0$" — they say "we fail to reject" it. The distinction is real: failing to find evidence against the null isn't the same as evidence for the null. (Absence of evidence ≠ evidence of absence.)

Two ways to be wrong

Since there are two outcomes and two true states (the null is actually true, or actually false), there are two kinds of error:

Truth	Decision: reject $H_0$	Decision: don't reject $H_0$
$H_1$ true	TP	FN
$H_0$ true	FP	TN

Definition · Type I & Type II errors

Type I error (false positive): rejecting $H_0$ when it's actually true. The false positive rate is:

$$\alpha := \frac{\text{FP}}{\text{FP} + \text{TN}}$$

Type II error (false negative): failing to reject $H_0$ when it's actually false. The false negative rate is:

$$\beta := \frac{\text{FN}}{\text{TP} + \text{FN}}$$

The denominators are about the truth — total actual negatives and total actual positives respectively — not about your decisions.

A way to remember

Type I error sees something that isn't there (false alarm). Type II error misses something that is there. The "power" of a test is $1 - \beta$ — the probability of correctly catching a real effect.
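As a quick check on the definitions, here is a short Python sketch that turns the four cells of the table into $\alpha$, $\beta$ and power. The counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def error_rates(tp, fn, fp, tn):
    """Return (alpha, beta, power) from the four confusion-table counts."""
    alpha = fp / (fp + tn)   # false positives among cases where H0 is true
    beta = fn / (tp + fn)    # false negatives among cases where H1 is true
    return alpha, beta, 1.0 - beta

# Hypothetical counts from 1,000 tests: 200 real effects and 800 true nulls.
alpha, beta, power = error_rates(tp=160, fn=40, fp=40, tn=760)
print(alpha, beta, power)
```

Note that each rate divides by a row of the table (what is actually true), never by a column (what you decided).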

The full NHST procedure

The frequentist machinery for hypothesis testing works like this:

Fix $\alpha$ upfront — typically at $\alpha = 0.05$. This is your tolerance for false positives.
Compute a test statistic $T(X_n)$ from your data $X_n$ of size $n$. The test statistic is chosen so that large values indicate stronger evidence against $H_0$.
Partition the space of possible test-statistic values into a rejection region $R_1$ and an acceptance region $R_0$. Reject $H_0$ if and only if $T(X_n) \in R_1$.
The test controls the false positive rate at level $\alpha$ if and only if, for every sample size $n$: $$\sup_{\theta \in \Theta_0} P_\theta\big(T(X_n) \in R_1\big) \;\leq\; \alpha.$$ In words: the worst-case probability of falsely rejecting $H_0$ — over all parameter values consistent with $H_0$ — is at most $\alpha$.
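The four steps above can be sketched for the fair-coin example with $n = 10$ flips. The choice of test statistic $T = |\text{heads} - n/2|$ is an assumption made for illustration. Because $H_0$ here is the single point $\theta = 0.5$, the supremum over $\Theta_0$ reduces to one probability.

```python
from math import comb

def size_of_test(n, threshold, theta0=0.5):
    """P(T >= threshold) under theta0, where T = |heads - n/2|."""
    return sum(
        comb(n, k) * theta0**k * (1 - theta0) ** (n - k)
        for k in range(n + 1)
        if abs(k - n / 2) >= threshold
    )

n, alpha = 10, 0.05
# Smallest threshold whose rejection region keeps the false positive rate <= alpha.
threshold = min(t for t in range(n // 2 + 2) if size_of_test(n, t) <= alpha)
print(threshold, size_of_test(n, threshold))
```

With these settings the rejection region is "0, 1, 9 or 10 heads". Its actual false positive rate is below 0.05 because the binomial distribution is discrete, so the bound is met with room to spare.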

Uniformly most powerful (UMP) tests

You want to control $\alpha$ and minimise $\beta$: bound the false alarms first, then miss as little as possible. A test that achieves the lowest possible $\beta$ at a given level $\alpha$, whichever parameter value in $\Theta_1$ happens to be the true one, is called uniformly most powerful (UMP).

Definition · UMP test

A statistical test is uniformly most powerful at level $\alpha$ if, among all tests that control the type I error rate at level $\alpha$, it minimises the type II error rate $\beta$ for every parameter value $\theta \in \Theta_1$.

Neyman and Pearson proved the existence and uniqueness of several UMP tests using likelihood ratios — this result is known as the Neyman–Pearson lemma, and it's why their names are attached to the framework.

Worked example: the biased coin

You flip a coin 10 times and get 7 heads. Is the coin fair? We're testing $H_0: \theta = 0.5$ versus $H_1: \theta \neq 0.5$.

The probability of seeing exactly $k$ heads in $n$ flips with fixed probability $\theta$ is the binomial PMF:

$$P(X = k) = \binom{n}{k} \theta^k (1-\theta)^{n-k}.$$

Plugging in $k = 7$, $n = 10$, $\theta = 0.5$:

$$P(X = 7) = \binom{10}{7}(0.5)^7(0.5)^3 \approx 0.1172.$$

But the question is not just "what's the probability of exactly 7 heads?". The frequentist asks: under $H_0$, how often would we see a result this extreme or more extreme? That's the p-value.

Definition · p-value

The p-value is the probability, assuming $H_0$ is true, of observing a test statistic at least as extreme as the one we actually observed.

Because we're doing a two-sided test (the coin could be biased either way), "as extreme as 7 out of 10" includes both tails:

$$p = \sum_{k=0}^{3} P(X = k) + \sum_{k=7}^{10} P(X = k) = 0.34375.$$

By the symmetry of the fair-coin distribution, getting 7 heads out of 10 is just as "extreme" as getting 3 heads. The p-value sums probabilities of all outcomes at least that lopsided.

With $\alpha = 0.05$, since $p = 0.34 \gg 0.05$, we fail to reject $H_0$. The data is consistent with a fair coin.
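A minimal Python sketch of the two-sided p-value for a fair-coin null, using only the standard library. It relies on the symmetry of the $\theta = 0.5$ distribution, so it is not a general-purpose binomial test; the function name is an illustrative choice.

```python
from math import comb

def two_sided_p_fair_coin(k, n):
    """Exact two-sided p-value for k heads in n flips under H0: theta = 0.5."""
    observed_gap = abs(k - n / 2)
    # Count every outcome at least as far from n/2 as the observed one.
    extreme = sum(comb(n, j) for j in range(n + 1) if abs(j - n / 2) >= observed_gap)
    return extreme / 2**n

print(two_sided_p_fair_coin(7, 10))  # 0.34375
```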

Same coin, more flips

Now suppose we flip 30 times and get 21 heads — the same proportion (70%). Plug in $k = 21$, $n = 30$, $\theta = 0.5$:

$$P(X = 21) \approx 0.0133.$$

The two-sided p-value:

$$p = \sum_{k=0}^{9} P(X=k) + \sum_{k=21}^{30} P(X=k) \approx 0.0428.$$

Now $p < 0.05$, so we reject $H_0$. With more data, the same proportion of heads becomes statistically significant.

The lesson

Sample size matters enormously. A 70/30 split looks pretty unfair to the eye, but with only 10 flips you can't rule out chance. With 30 flips of the same proportion, you can (at the 5% level). This is why "n is too small" is such a common critique of frequentist results.
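To see the effect of sample size directly, the sketch below holds the proportion of heads at 70% and varies only the number of flips. The sample sizes beyond 10 and 30 are hypothetical additions, and the helper assumes a fair-coin null.

```python
from math import comb

def two_sided_p_fair_coin(k, n):
    """Exact two-sided p-value for k heads in n flips under H0: theta = 0.5."""
    gap = abs(k - n / 2)
    return sum(comb(n, j) for j in range(n + 1) if abs(j - n / 2) >= gap) / 2**n

p_values = {}
for n in (10, 20, 30, 40, 50):
    k = 7 * n // 10                      # 70% heads at every sample size
    p_values[n] = two_sided_p_fair_coin(k, n)
    print(n, k, round(p_values[n], 4))
```

The p-value shrinks steadily as $n$ grows even though the observed proportion never changes.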

A widely tested misinterpretation

The p-value is NOT the probability that $H_0$ is true. It is the probability of the data (or more extreme) given that $H_0$ is true. These are entirely different — a frequentist would say the first quantity is meaningless, since hypotheses don't have probabilities. This confusion is one of the main objections Bayesians level at NHST.

Self-test · Frequentism

State the difference between Type I and Type II errors. Which is α, which is β?

Type I (α): false positive — rejecting $H_0$ when it's actually true. "Crying wolf when there isn't one."

Type II (β): false negative — failing to reject $H_0$ when it's actually false. "Missing a real wolf."

α is set in advance (often 0.05); β is what we try to minimise subject to that constraint.

You flip a coin 10 times and get 8 heads. The two-sided p-value works out to about 0.11. With α = 0.05, what do you conclude?

Since $p = 0.11 > 0.05$, we fail to reject $H_0$. The data is consistent with a fair coin at the 5% significance level. (Note: this does NOT mean the coin is fair — only that we don't have strong enough evidence to say it isn't.)

Why is "we accept H₀" considered bad form?

Failing to reject the null isn't evidence that the null is true; it only shows that we couldn't disprove it with the data we have. The test was set up to falsify, not to confirm. Saying "accept" overstates what the procedure actually licenses. Always say "fail to reject".

What does it mean for a test to be "uniformly most powerful"?

Among all tests that control type I error at a chosen level α, the test minimises the type II error rate β for every parameter value in the alternative. In other words: it gives you the most power (1 − β) for that tolerance for false positives, whichever alternative happens to be true.

You see a study reporting p = 0.04, claimed as evidence the drug works. What does this actually tell you?

It tells you: if the drug had no effect (the null), the probability of observing data this favourable to the drug (or more so) would be 4%. It does not tell you the probability that the drug works is 96%. To get that, you'd need to bring in a prior — and you'd have left frequentism for Bayesianism.

· · ·

Part IV The Statistics Wars

Two and a half centuries of philosophical disagreement, with real practical consequences for how science gets done — and where machine learning gets to slot in.

By now you've probably noticed: Bayesianism and frequentism aren't just two computational tools. They have genuinely different views about what a probability is. As the statistician Brad Efron put it:

By and large, statistics is a prosperous and happy country, but it is not a completely peaceful one. Two contending philosophical parties, the Bayesians and the frequentists, have been vying for supremacy over the past two and a half centuries. Unlike most philosophical arguments, this one has important practical consequences. The two philosophies represent competing visions of how science progresses. — Brad Efron, 2013

The xkcd version

Slide 48 shows Randall Munroe's famous xkcd comic ("Frequentists vs. Bayesians"). The setup: a neutrino detector tells two statisticians whether the sun has gone nova. The detector rolls two dice and lies if both come up six (probability 1/36 ≈ 0.028).

The detector announces: "Yes — the sun has gone nova."

The frequentist calculates: under $H_0$ ("sun is fine"), the probability of getting a "yes" is 1/36 < 0.05, so we reject $H_0$. The sun has exploded!

The Bayesian shrugs: "Bet you $50 it hasn't." Why? Because the prior probability of the sun going nova is astronomically small, and a 1/36 false-positive rate isn't enough to overcome it. The Bayesian wins the bet.

The cartoon is funny because it lampoons exactly the failure mode the rare-disease example highlighted: ignoring priors gives absurd answers when the prior is extreme.
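The Bayesian's side of the bet can be sketched numerically. The comic gives the detector's error rate (1/36) but no prior, so the one-in-a-billion prior below is a purely hypothetical stand-in for "astronomically small".

```python
from fractions import Fraction

def posterior_nova(prior_nova):
    """P(nova | detector says yes) when the detector lies with probability 1/36."""
    p_yes_given_nova = Fraction(35, 36)   # detector tells the truth
    p_yes_given_fine = Fraction(1, 36)    # detector lies
    numerator = p_yes_given_nova * prior_nova
    return numerator / (numerator + p_yes_given_fine * (1 - prior_nova))

# Hypothetical prior: a one-in-a-billion chance that the sun has just gone nova.
prior = Fraction(1, 10**9)
print(float(posterior_nova(prior)))
```

Even after a "yes", the posterior stays tiny under any prior of this order, which is why the Bayesian is happy to bet.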

Objections, fired in both directions

Objections to Bayesianism

Priors are inherently subjective — different priors give different answers.
Inference is often suspiciously easy (when you use conjugate priors chosen for mathematical convenience) or prohibitively hard (requiring MCMC sampling).
Model misspecification can break likelihoods badly — and there's no guard against this in the framework.
Bayesianism is, somewhat unkindly, called a "cult of rationality" by its detractors.

Objections to frequentism

Who actually cares how improbable the data is under $H_0$? We want to know how probable $H_1$ is given the data — and frequentism flatly refuses to answer.
Ignoring previous results is inefficient and irrational — the next experiment shouldn't start from a blank slate.
Optional stopping (peeking at the data and stopping when significance is reached) leads to invalid frequentist inference but is harmless for Bayesians.
The "cult of p-values" has been blamed for the replication crisis in psychology and biomedicine.
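The optional-stopping objection can be illustrated with a small simulation. This is a sketch under assumed settings (a fair coin, peeking after every flip from 10 to 100, 2,000 simulated experiments, a fixed random seed), and it only demonstrates the frequentist half of the claim: peeking inflates the false positive rate above the nominal $\alpha$.

```python
import random
from math import comb

MIN_N, MAX_N, ALPHA = 10, 100, 0.05   # hypothetical peeking window and level

def two_sided_p_fair_coin(k, n):
    """Exact two-sided p-value for k heads in n flips under H0: theta = 0.5."""
    gap = abs(k - n / 2)
    return sum(comb(n, j) for j in range(n + 1) if abs(j - n / 2) >= gap) / 2**n

def critical_gap(n):
    """Smallest |heads - n/2| whose two-sided p-value is <= ALPHA."""
    for k in range((n + 1) // 2, n + 1):
        if two_sided_p_fair_coin(k, n) <= ALPHA:
            return k - n / 2
    return float("inf")

# Precompute the rejection threshold for every sample size we might stop at.
GAPS = {n: critical_gap(n) for n in range(MIN_N, MAX_N + 1)}

def rejects_h0(rng, peek):
    """Flip a fair coin up to MAX_N times; return True if H0 ends up rejected."""
    heads = 0
    for n in range(1, MAX_N + 1):
        heads += rng.random() < 0.5
        if peek and n >= MIN_N and abs(heads - n / 2) >= GAPS[n]:
            return True                   # stop at the first "significant" look
    return abs(heads - MAX_N / 2) >= GAPS[MAX_N]

rng = random.Random(0)
trials = 2000
peeking_rate = sum(rejects_h0(rng, peek=True) for _ in range(trials)) / trials
fixed_rate = sum(rejects_h0(rng, peek=False) for _ in range(trials)) / trials
print(peeking_rate, fixed_rate)
```

The coin is fair in every simulated experiment, so every rejection is a false positive. Testing once at the end keeps the rate near or below $\alpha$, while stopping at the first significant look does not.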

Syntheses are possible

The slide ends on a hopeful note: the wars don't have to be eternal. Modern statistics has produced several genuine hybrids — methods that draw from both traditions. The slide names four:

Definitions worth knowing

Empirical Bayes: Bayesian inference where the prior is estimated from the data itself, rather than chosen subjectively. Bridges the schools by removing the most cited Bayesian sin.
Bootstrap sampling: a resampling technique to estimate the variability of a statistic. Has both frequentist and Bayesian interpretations.
PAC-Bayes learning theory: a framework for proving generalisation bounds in machine learning that borrows heavily from both traditions.
E-values and test martingales: a recent body of work combining frequentist optimality criteria with Bayes-factor-style evidence quantification, designed to handle optional stopping.

If your exam asks for a synthesis approach, "Empirical Bayes" or "PAC-Bayes" are the safest two to mention.
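Of the four, bootstrap sampling is the easiest to sketch in code. The example below estimates the standard error of a sample mean and a rough 95% percentile interval by resampling with replacement. The data are simulated and every setting (sample size, number of resamples, seed) is a hypothetical choice.

```python
import random
import statistics

def bootstrap_means(data, n_resamples, rng):
    """Means of resamples drawn from `data` with replacement."""
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]

rng = random.Random(0)
# Hypothetical sample: 50 draws from a normal distribution with mean 10 and sd 2.
data = [rng.gauss(10, 2) for _ in range(50)]

means = sorted(bootstrap_means(data, n_resamples=2000, rng=rng))
standard_error = statistics.stdev(means)
# Percentile interval: cut off 2.5% of the resampled means in each tail.
ci_low = means[int(0.025 * len(means))]
ci_high = means[int(0.975 * len(means)) - 1]
print(round(standard_error, 3), round(ci_low, 2), round(ci_high, 2))
```

No formula for the sampling distribution is needed; the spread of the resampled means stands in for it.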

· · ·

Part V Machine Learning to the Rescue

Many classical inference problems can be reframed as machine learning problems — sometimes with surprising elegance.

The closing arc of the lecture argues that machine learning isn't just another tool in the statistical toolbox; it changes how we approach inference altogether. Three concrete examples are flagged:

Two-sample testing as classification. Want to know if two samples come from the same distribution? Train a classifier to tell them apart. If it can't beat random chance on held-out data, you have no evidence that the distributions differ; if it reliably can, they differ. (Reference: Kim et al., 2021.)
Prediction-powered inference. Use a trained ML model to predict labels on a large unlabelled dataset, then use those predictions — carefully calibrated against a small labelled set — to estimate parameters more precisely. (Reference: Angelopoulos et al., 2023.)
Simulation-based inference. When you can simulate from a model but can't write down its likelihood, learn the inference machinery from simulated data instead. (Reference: Dalmasso et al., 2023.)

The whole agenda is anchored by a now-famous 2001 essay by Leo Breiman, "Statistical Modeling: The Two Cultures":

There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models…If our goal as a field is to use data to solve problems, then we need to…adopt a more diverse set of tools. — Leo Breiman, "Two cultures of statistical modeling"

Breiman's "two cultures" — the data-model culture (classical statistics) versus the algorithmic-model culture (machine learning) — sets up the rest of this course. From Week 3 onwards, we mostly live in the second culture. But the first one is what we just toured, and you'll need it on the exam.

Briefly explain how a two-sample test can be reframed as a classification task.

Pool the two samples and label each observation by which sample it came from. Train a classifier on this binary task and score it on held-out observations. If the held-out accuracy is meaningfully above chance (50% when the two samples are the same size), the two distributions differ: there's signal a model can learn. If it can't beat chance, this classifier gives no evidence that the distributions differ. The classifier's accuracy itself becomes a test statistic.
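A rough Python sketch of the idea, using a deliberately simple nearest-centroid classifier on one-dimensional data so that it runs with the standard library alone. The classifier, the 50/50 train/test split and the binomial approximation to the null distribution of the accuracy are all assumptions made for illustration, not the method of the cited paper.

```python
import random
from math import comb

def classifier_two_sample_test(sample_a, sample_b, rng):
    """Return (held-out accuracy, one-sided p-value) for a nearest-centroid classifier."""
    labelled = [(x, 0) for x in sample_a] + [(x, 1) for x in sample_b]
    rng.shuffle(labelled)
    half = len(labelled) // 2
    train, test = labelled[:half], labelled[half:]

    # "Training": the mean of each class in the training half.
    centroid = {}
    for label in (0, 1):
        xs = [x for x, y in train if y == label]
        centroid[label] = sum(xs) / len(xs)

    # Predict the class with the closer centroid; count hits on held-out points.
    correct = sum(
        (abs(x - centroid[1]) < abs(x - centroid[0])) == bool(y) for x, y in test
    )
    n = len(test)
    # Under H0 each prediction is roughly a coin flip: upper tail of binomial(n, 0.5).
    p_value = sum(comb(n, k) for k in range(correct, n + 1)) / 2**n
    return correct / n, p_value

rng = random.Random(0)
same = classifier_two_sample_test(
    [rng.gauss(0, 1) for _ in range(200)], [rng.gauss(0, 1) for _ in range(200)], rng
)
shifted = classifier_two_sample_test(
    [rng.gauss(0, 1) for _ in range(200)], [rng.gauss(1.5, 1) for _ in range(200)], rng
)
print(same, shifted)
```

When the second sample is shifted, the held-out accuracy sits well above 50% and the p-value is tiny. When both samples come from the same distribution, the accuracy hovers around chance.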

What is "Empirical Bayes" and why does it count as a synthesis between schools?

Empirical Bayes is Bayesian inference where the prior is estimated from the observed data, rather than specified subjectively before seeing any data. This sidesteps the most common frequentist objection to Bayesianism (priors are arbitrary) while still using Bayesian machinery to combine prior with likelihood. It's a synthesis because the data-driven prior is essentially a frequentist move plugged into a Bayesian engine.

The whole week, in one breath

Logic gives certainty; the world doesn't oblige. Statistical inference is the discipline of reasoning despite incomplete information. Two great traditions answer the call differently:

Bayesians treat probabilities as degrees of belief and use Bayes's theorem to update them. Strength: optimally uses background knowledge. Weakness: priors are subjective.
Frequentists treat probabilities as long-run frequencies and design procedures (NHST) to control error rates. Strength: no priors required. Weakness: doesn't directly tell you what you usually want to know about hypotheses.
The wars are real but increasingly papered over by syntheses (empirical Bayes, bootstrap, PAC-Bayes, E-values), and machine learning offers fresh approaches that often sidestep the dispute entirely.

For the exam: be ready to (i) state Bayes's theorem and identify its four ingredients in a worked problem, (ii) compute a posterior given a prior and likelihood, (iii) explain why a low base rate makes most positive tests false, (iv) define Type I/II errors and α/β, (v) compute or interpret a p-value, (vi) name at least one objection to each school and at least one synthesis approach.

Now go nail this worksheet.