KCL · Machine Learning · Week 10

Dimensionality Reduction

Why high-dimensional data breaks our tools — and four families of methods (selection, extraction, random projection, factor analysis) to fight back.

Module: KCL · ML Reading time: ~70 min Self-tests: 19 Lecturer: F. Mallmann-Trenn
Part I — the motivation

I. The Curse of Dimensionality & Why We Visualise

Before we reduce dimensions, we need to understand why high dimensions are poison for the algorithms we love — and why a simple picture is often worth a pound of analysis.

Almost every method you have met this term — nearest neighbours, k-means, hierarchical clustering — secretly relies on one thing: that the distance between two points means something. "Close" points are similar, "far" points are different. The whole edifice of distance-based machine learning rests on that assumption. The curse of dimensionality is the discovery that, as you add more and more features, this assumption quietly collapses.

What the curse actually is

Definition — Curse of Dimensionality

A critical issue in data analysis where an increase in the dimensionality of the feature space leads to a deterioration in the performance of distance-based algorithms.

The methods explicitly named on the slide as affected:

  • K-Nearest Neighbour (KNN)
  • K-Means Clustering
  • Hierarchical Clustering

The slide's takeaway: because of this, dimensionality reduction is often adopted as a preliminary processing step — you shrink the feature space first, then run KNN/k-means/clustering on the cleaner, lower-dimensional data so they actually work well.

Geometric evidence #1 — the "kissing number"

How do we see that high-dimensional space is strange? One clean way is the kissing number.

Definition — Kissing Number

The maximum number of non-overlapping unit spheres that can simultaneously touch ("kiss") a single central unit sphere in \(d\)-dimensional space.

Table: Known kissing numbers for specific dimensions
Dimension \(d\)Kissing numberNote (from slide)
12left and right
26forming a hexagon around the central point
312
424
8240
24196560
In plain English

The number of neighbours a sphere can touch explodes with dimension (2 → 6 → 12 → 24 → 240 → 196,560). The slide's punchline is the opposite of what that growth suggests at first: "High-dimensional spaces are empty." Even though a sphere could have hundreds of thousands of touching neighbours, there is so much room that points you actually sample spread out into a vast, sparse void — they almost never sit close together. Volume grows so fast that any finite set of data points becomes a scattering of lonely specks.

Geometric evidence #2 — distances stretch as \(\sqrt n\)

The slide runs a thought experiment: draw 100 points from a \(d\)-dimensional uniform distribution on the cube \([0,1]^d\). What is the largest possible distance between two points (the diagonal of the cube)?

Worked example — where √n comes from

The two most extreme corners of the unit cube are \(\mathbf 0=(0,0,\dots,0)\) and \(\mathbf 1=(1,1,\dots,1)\). Their Euclidean distance is

\[\|\mathbf 1-\mathbf 0\|_2=\sqrt{\sum_{i=1}^{n}(1-0)^2}=\sqrt{\underbrace{1+1+\cdots+1}_{n}}=\sqrt n.\]

So in \(n=100\) dimensions the diagonal is \(\sqrt{100}=10\) — ten times longer than the side of the cube. The space the points must spread across grows without bound while the points stay finite in number.

The fatal consequence — nearest ≈ farthest

The lecture's figure makes the killer point. It plots, on a log–log scale, the distance to the nearest point (red) and to the farthest point (blue) as the dimension grows from 1 to 100.

Dimension (log scale, 1 → 100) Distance (log) gap → 0 Nearest point Farthest point
Slide 5 redrawn — as dimension increases the nearest and farthest distances climb toward each other and the gap between them shrinks. (The slide's real plot shows exactly this convergence.)
Exam trap

The point is not simply "distances get bigger" — that alone wouldn't hurt anything. The damage is that the nearest and the farthest distances become almost equal. If your closest neighbour and your most distant point are nearly the same distance away, then the word "nearest" loses all meaning — and KNN, k-means and clustering, which depend on telling near from far, stop discriminating. That is the curse. Don't write "distances grow" on the exam without explaining the convergence.

QA dataset lives in \([0,1]^{64}\). What is the maximum possible Euclidean distance between two points, and why is this a problem for KNN?

Maximum distance is the cube diagonal \(\sqrt{64}=8\). The problem is not the size itself but that, in such high dimension, nearly all pairwise distances cluster close to this same large value: the nearest neighbour ends up almost as far as the farthest. KNN's notion of "the \(k\) closest points" becomes essentially arbitrary, so its predictions degrade. The fix is to reduce dimensionality first.

QThe kissing number rises from 2 (d=1) to 196,560 (d=24), yet the slide says high-dimensional space is "empty." Is that a contradiction?

No. The kissing number is the maximum possible number of unit spheres that can physically touch a central one — it shows how much "room" a high-dimensional space affords. Because the available volume grows astronomically with \(d\), any finite, realistic sample of data points spreads out into that volume and almost never lands close to any other point. So the space could pack many touching spheres, but real data is sparse within it — hence "empty." High capacity, low occupancy.

·  ·  ·

Why visualise at all?

Slide 6 is a wall of quotes hammering one theme — looking at data beats computing on it blind. Worth knowing the named sources for the exam:

Quotes from the slide (named)
  • John Tukey (American mathematician): "The simple graph has brought more information to the data analyst's mind than any other device."
  • John Tukey: "Data is useless without the ability to visualize and act upon it."
  • John Tukey: "The greatest value of a picture is when it forces us to notice what we never expected to see."
  • Ben Shneiderman (American computer scientist): "A picture is worth a thousand words. An interface is worth a thousand pictures."
  • Ben Shneiderman: "Visualization gives you answers to questions you didn't know you had."
  • Anonymous: "An ounce of visualization is worth a pound of analysis."
  • Frederik: "Visualization/Dimensionality Reduction also really important for the exam ;)"
Take the hint

The lecturer literally put on a slide that this topic is "really important for the exam" — treat the whole of Week 10 as high-yield.

Slide 7 drives it home with a 2-D scatter: six coloured groups plus two lone stray points. Just looking at the data immediately suggests natural clusters and outliers — no algorithm needed. That is the entire promise of dimensionality reduction for visualisation: get the data down to 2-D and your eye does the clustering for free.

X1 X2 outlier outlier
Slide 7 redrawn — six visually obvious clusters and two outliers. The eye does the clustering instantly once data is in 2-D.

The four families (the roadmap for this week)

Slide 8's overview tells you exactly what is coming, and gives one example technique for each:

  1. Feature Selection — keep a subset of the original features (e.g. Pearson correlation).
  2. Feature Extraction — build new features from the old ones (e.g. t-SNE, PCA).
  3. Random Projection — project with a random matrix (e.g. Johnson–Lindenstrauss, "JL").
  4. Factor Analysis — model latent factors (flagged later as non-examinable).
Selection vs. extraction — the one-line distinction

Selection keeps a subset of your original columns (the survivors are still "height," "weight," etc., so results stay interpretable). Extraction invents brand-new columns that are combinations of the originals (e.g. "principal component 1") — more compact, but the new axes no longer map cleanly to a single real-world variable.

QClassify each as feature selection or feature extraction: (a) dropping a column whose correlation with the label is ~0, (b) PCA, (c) lasso (L1) zeroing some coefficients, (d) t-SNE.

(a) Selection — you keep/drop original columns. (b) Extraction — PCA builds new linear-combination axes. (c) Selection (specifically embedded selection — it happens during training and forces some original coefficients to exactly 0). (d) Extraction — t-SNE creates new low-dimensional coordinates.

Part II — keeping the right columns

II. Feature Selection — Pearson, Causation & Simpson's Paradox

Feature selection picks a subset of the original features. The workhorse tool here is Pearson correlation — but the lecturer spends most of the section on the traps that make naïve correlation dangerous: confounding, and Simpson's paradox.

The simplest dimensionality reduction is to throw away the features that don't help. The question is: how do you decide which features don't help? The slides give three broad strategies, then zoom in on correlation as the concrete tool — and then spend a long time on how correlation can lie to you.

Three families of feature selection

Definition — three selection method types
  • Filter Methods: use statistical measures to score each feature's relevance to the target, and filter out the least important ones. Common measures: correlation coefficients and mutual information. (Cheap; model-independent.)
  • Wrapper Methods: use an actual predictive model to evaluate combinations of features and keep the best-performing subset. Techniques: recursive feature elimination, forward selection, backward elimination. (Expensive; searches subsets.)
  • Embedded Methods: selection happens as part of model training. Example: L1 (lasso) regularization, which adds a penalty on feature coefficients to the loss, driving some to zero. (Built into the fit — recall this from Week 3.)
QYour team wants feature selection that is model-agnostic and runs in seconds on millions of features. Filter, wrapper, or embedded — and why?

Filter. It scores each feature with a cheap statistic (e.g. Pearson correlation or mutual information) independent of any model, so it is fast and scales to huge feature counts. Wrapper methods retrain a model for many candidate subsets (far too slow at that scale), and embedded methods tie you to one particular model's training (e.g. lasso). Trade-off: filters ignore feature interactions, so they can keep redundant features a wrapper would catch.

·  ·  ·

Correlation ≠ causation: the ice-cream & sharks story

Before the maths, the lecturer tells the parable. In the seaside town of Sunny Beach, every summer ice-cream sales soar — and so do shark attacks. Year after year, more ice cream sold ⇒ more shark attacks reported.

Observation: plot the data and you see a clear correlation. But does buying ice cream cause shark attacks?

The resolution

Correlation: ice-cream sales and shark attacks both rise in summer. Causation: the hot weather independently causes both — more people buy ice cream and more people swim (so more are exposed to sharks). Conclusion: two things happening together doesn't mean one causes the other; the link can be due to an underlying confounding factor (here, temperature). Always look deeper for a true causal link versus a coincidental correlation.

QName the confounder in the ice-cream/shark example and draw the causal structure in words.

Confounder: hot weather / temperature. Structure: Hot weather → ice-cream sales and Hot weather → swimming → shark attacks. Ice cream and shark attacks have no direct arrow between them; they are correlated only because they share a common cause. This is the textbook confounding pattern.

Pearson correlation coefficient

Definition — Pearson correlation

A measure of the linear correlation between two variables \(X\) and \(Y\). Its value ranges from \(-1\) to \(1\):

  • \(\;1\) — a perfect positive linear relationship,
  • \(\;0\) — no linear correlation,
  • \(-1\) — a perfect negative linear relationship.

The formula

For \(n\) paired scores, the Pearson coefficient is

\[ r_{xy}=\frac{n\left(\sum xy\right)-\left(\sum x\right)\left(\sum y\right)}{\sqrt{\big[\,n\sum x^2-(\sum x)^2\,\big]\big[\,n\sum y^2-(\sum y)^2\,\big]}}\,,\]

where \(n\) is the number of pairs of scores.

In plain English

The numerator asks: "do \(X\) and \(Y\) move together more than you'd expect if they were independent?" The denominator normalises by how spread out each variable is, so the answer always lands between \(-1\) and \(1\) regardless of the units. Same sign-moving ⇒ positive; opposite-moving ⇒ negative; no pattern ⇒ near zero.

Worked example (the slide's, fully plugged in)

Worked example — r for X={1,2,3}, Y={2,4,6}

Here \(n=3\). Compute the five sums:

\[\sum x = 1+2+3 = 6,\qquad \sum y = 2+4+6 = 12,\] \[\sum xy = (1)(2)+(2)(4)+(3)(6)=2+8+18 = 28,\] \[\sum x^2 = 1^2+2^2+3^2 = 14,\qquad \sum y^2 = 2^2+4^2+6^2 = 56.\]

Substitute:

\[ r=\frac{3(28)-(6)(12)}{\sqrt{[\,3(14)-6^2\,][\,3(56)-12^2\,]}}=\frac{84-72}{\sqrt{(42-36)(168-144)}}=\frac{12}{\sqrt{6\cdot 24}}=\frac{12}{\sqrt{144}}=\frac{12}{12}=1.\]

So \(r=1\): a perfect positive linear relationship — which makes sense, since \(Y=2X\) exactly.

QCompute Pearson r for X={1,2,3}, Y={6,4,2} (the same X, but Y reversed). Predict the sign before computing.

Prediction: \(Y\) falls as \(X\) rises, so \(r\) should be \(-1\). Check: \(n=3\), \(\sum x=6\), \(\sum y=12\), \(\sum xy=(1)(6)+(2)(4)+(3)(2)=6+8+6=20\), \(\sum x^2=14\), \(\sum y^2=36+16+4=56\).

\[ r=\frac{3(20)-(6)(12)}{\sqrt{(42-36)(168-144)}}=\frac{60-72}{\sqrt{6\cdot24}}=\frac{-12}{12}=-1.\]

Perfect negative linear relationship — \(Y=8-2X\) exactly. ✓

What different r values look like

The slides show three scatter plots. Redrawn together:

No correlation (r≈0) Ice cream vs. pirate sightings Positive (r→+1) Math knowledge vs. ML skills Negative (r→−1) Coffee vs. unfinished puzzles
Slides 9–11 redrawn — the three regimes Pearson distinguishes (and the lecturer's playful axis labels). Note: the slide captions for the positive/negative figures read "between Size and Weight," but the in-plot titles are the ones shown here.
Exam trap — Pearson only sees lines

Pearson \(r\) measures linear association only. A perfect U-shape (e.g. \(Y=X^2\) symmetric about zero) can have \(r=0\) despite a perfect deterministic relationship. "\(r=0\)" means "no linear correlation," not "no relationship." This is why filter methods sometimes also use mutual information, which catches non-linear dependence.

Practical applications & how to actually use it for selection

Slide 12 lists fields where Pearson is used: Health Sciences (lifestyle factors vs. health outcomes), Economics (relationships between economic factors), Psychology (behavioural patterns vs. outcomes).

How to use Pearson for feature selection (slide 13)
  • Correlation between the label and a feature shows how predictive that feature is.
  • If a feature's correlation with the label is very close to 0, it may be good to drop it.
  • If it's very close to \(+1\) or \(-1\), it's probably very helpful — keep it. (Note: strong negative correlation is just as useful as strong positive!)
  • High correlation between two different features means they're redundant, so one of them can be dropped.
QA feature has correlation −0.92 with the label. A junior colleague suggests dropping it because it's negative. Right or wrong?

Wrong — keep it. Magnitude is what matters, not sign. \(|{-}0.92|=0.92\) is close to 1, so the feature is highly predictive (it just predicts in the opposite direction). The slide explicitly warns: "If the correlation is very close to 1 or −1 it's probably very helpful." You'd only consider dropping features whose correlation is near 0, or one of a pair of mutually highly-correlated features.

·  ·  ·

Simpson's Paradox — when aggregation flips the answer

This is the most exam-worthy idea in Part II. Simpson's paradox is when a trend that appears in every subgroup of the data reverses when the subgroups are combined.

The medical example, built up exactly as the slides do

Step 1 — the aggregate. Compare two treatments by overall success rate:

SurgeryMedication
Total success78% (273/350)83% (289/350)

Medication looks better (83% > 78%). But now split by tumour size:

Step 2 — split by tumour size, then the totals:

Simpson's Paradox — full table (slide 17)
SurgeryMedication
small tumour93% (81/87)87% (234/270)
large tumour73% (192/263)69% (55/80)
total78% (273/350)83% (289/350)
The flip — read this twice

In both subgroups, Surgery wins: 93% > 87% for small tumours, and 73% > 69% for large tumours. Yet in the total, Medication wins (83% > 78%). The within-group winner and the overall winner are opposite. That reversal is Simpson's paradox.

Why it happens — the confounder is group size

The groups are wildly unbalanced. Small tumours (easy cases, high success) are mostly treated with medication (270 vs 87). Large tumours (hard cases, low success) are mostly treated with surgery (263 vs 80). Success depends heavily on the confounding variable tumour size: large tumours have low success regardless of treatment. So surgery's overall rate is dragged down because surgery was handed the hard cases. Medication only appears better because of this treatment-assignment bias — but surgery is actually the better option within each like-for-like comparison.

Slide 18 then asks the design question: "How could one design a study that would not have these problems?"

Answer to the design question

A randomised controlled trial (RCT). By randomly assigning treatment independently of tumour size, you break the link between the confounder (size) and the treatment, so the groups become comparable and the aggregate no longer lies. (This connects directly to Week 2's discussion of experimental design.) Alternatively, stratify/control for tumour size in the analysis.

SMALL tumour 93%Surg 87%Med surgery wins LARGE tumour 73%Surg 69%Med surgery wins TOTAL (flips!) 78%Surg 83%Med medication "wins" Within each group surgery is better — but pooling reverses the verdict.
Simpson's paradox visualised — the within-group winner (surgery, blue-dark) becomes the overall loser once unequal group sizes are pooled.
QIn one sentence, what is Simpson's paradox, and what single design choice prevents it here?

Simpson's paradox is when a relationship holding in every subgroup reverses after the subgroups are aggregated, caused by a confounder correlated with both group membership and outcome. Here the confounder is tumour size (correlated with both treatment choice and success). The design fix is randomised assignment of treatment (an RCT), which decouples the confounder from the treatment.

Case study: UC Berkeley graduate admissions, 1973

The real-world version. In the early 1970s, UC Berkeley was sued for gender discrimination in graduate admissions. The Fall 1973 aggregate figures:

At first glance, assuming similar qualifications, this looks like discrimination against women. But department by department, a different picture emerges:

Table 1 — six largest departments, 1973 Berkeley case (Bickel, Hammel & O'Connell, 1975, via Wikipedia)
DeptMen — applicantsMen — admittedWomen — applicantsWomen — admitted
A82562%10882%
B56063%2568%
C32537%59334%
D41733%37535%
E19128%39324%
F2726%3417%
What the breakdown reveals

In several departments women were admitted at a higher rate (A: 82% vs 62%; B, D, F roughly tied or women higher). The aggregate gap arose because women disproportionately applied to the most competitive departments (e.g. C, E, F with low admit rates for everyone), while men applied more to easy-admit departments (A, B). Department choice is the confounder — same structure as the tumour example. The paradox showed the headline discrimination claim was an artefact of aggregation.

QUsing Berkeley dept A's numbers, what's the confounding mechanism that produces the misleading 44%-vs-35% headline?

Dept A admits at a high rate (62% men, 82% women) and received 825 male but only 108 female applicants — men piled into an easy department. Other departments (C, E, F) are hard for everyone and attracted far more women. Because admit rate depends strongly on which department you applied to (the confounder), and the sexes applied to different department mixes, pooling all departments produces an overall male advantage even though within most departments women did as well or better. Controlling for department dissolves the apparent discrimination.

Part III — inventing better axes

III. Feature Extraction — PCA & t-SNE

Instead of keeping a subset of original features, feature extraction builds brand-new features that are combinations of the originals. PCA finds the linear directions of greatest variance; t-SNE non-linearly preserves who-is-near-whom for visualisation.

Feature extraction trades interpretability for compactness. The new axes — "principal component 1," "t-SNE feature 2" — don't correspond to any single real-world quantity, but they pack the structure of the data into far fewer numbers. The slide opens with a menu of five extraction techniques; the exam focus is the first three.

The five feature-extraction techniques (slide 2)

Definitions — the extraction menu
  • Principal Component Analysis (PCA): a statistical procedure using an orthogonal transformation to convert observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components.
  • Singular Value Decomposition (SVD): a matrix factorization that decomposes a matrix as \(A=U\Sigma V^{T}\), where \(A\) is the original matrix, \(U\) is orthogonal (columns = left singular vectors), \(\Sigma\) is diagonal (the singular values), and \(V^{T}\) is the transpose of \(V\).
  • t-SNE (t-distributed Stochastic Neighbor Embedding): a non-linear technique especially suited to embedding high-dimensional data into 2 or 3 dimensions for a scatter plot.
  • Autoencoders: a type of artificial neural network that learns efficient codings of unlabelled data; the encoder reduces dimensionality by learning a compressed representation of the input. (Recall the autoencoder reconstruction idea from Week 1.)
  • Kernel PCA: an extension of PCA using kernel methods to allow non-linear dimensionality reduction.
QWrite out the SVD factorization and say what each of the three factors contains.

\(A = U\Sigma V^{T}\). \(U\) is orthogonal and its columns are the left singular vectors; \(\Sigma\) is diagonal and holds the singular values (non-negative, usually sorted descending); \(V^{T}\) is the transpose of the orthogonal \(V\), whose columns are the right singular vectors. SVD underlies PCA: applying it to a (centred) data matrix gives the principal directions and the amount of variance each captures.

·  ·  ·

Principal Component Analysis (PCA)

What PCA does (slide 4)
  • Transforms a large set of variables into a smaller one.
  • Reduces the number of random variables under consideration.
  • Obtains a set of principal variables.
  • Retains most of the original data's variation.
In plain English

Picture a flattish, tilted cloud of 2-D points. PCA finds the single straight line along which the cloud is most stretched out (most variance) — that's principal component 1. The next component is the most-stretched direction perpendicular to the first, and so on. To reduce dimensions, you keep the first few components and throw the rest away, losing as little spread (information) as possible. The new axes are uncorrelated by construction.

How PCA works — the four steps (slide 5)

  1. Standardization — standardize the range of the initial data points (so features with big units don't dominate).
  2. Covariance Matrix Computation — identifies the correlation between features.
  3. Eigenvalue Decomposition — finds the directions that maximize variance (the eigenvectors of the covariance matrix; eigenvalues = variance captured).
  4. Projection — projects the data onto the chosen lower-dimensional space.
PC1 most variance PC2
PCA on a 2-D cloud — PC1 is the direction of greatest variance; PC2 is orthogonal to it. Keeping only PC1 reduces 2-D → 1-D while preserving most of the spread.

PCA & correlation; benefits (slides 6–7)

PCA and correlation

PCA works off the correlation/covariance matrix (correlation coefficients between pairs of features). It identifies patterns in that matrix to capture significant variance, and reduces redundancy caused by correlated features (two strongly correlated features carry overlapping information; PCA merges them into one component).

Benefits of PCA: Simplification (reduces data complexity), Visualization (facilitates viewing lower-dimensional data), Efficiency (improves computational efficiency in downstream tasks).

QWhy is the standardization step (step 1) important before computing the covariance matrix?

PCA seeks directions of maximum variance, and variance is scale-dependent. If one feature is in metres (range 0–10000) and another in kilograms (range 0–5), the metre feature will dominate the covariance purely because of its units, and PC1 will essentially align with it regardless of true structure. Standardizing (e.g. zero mean, unit variance) puts all features on a comparable footing so the components reflect genuine shared structure, not arbitrary unit choices.

·  ·  ·

t-SNE

What is t-SNE? (slide 12)

t-distributed Stochastic Neighbor Embedding is a non-linear dimensionality reduction technique, developed by Laurens van der Maaten and Geoffrey Hinton in 2008, used for visualizing high-dimensional data in 2-D or 3-D.

The big idea (slide 8, in plain English)

t-SNE models the probability that one point is a neighbour of another, in both the high- and low-dimensional spaces. In the original space it measures pairwise similarities with a Gaussian kernel — close points get high probability, far points get low probability. It then places points in 2-D/3-D and shuffles their positions until the low-dimensional neighbour-probabilities match the high-dimensional ones. "Match" is made precise by minimising the divergence between the two probability distributions, using gradient descent, until the layout stabilises.

The t-SNE algorithm — five steps (slide 13)

  1. Compute pairwise similarities in high-dimensional space.
  2. Define a joint probability distribution in high-dimensional space.
  3. Initialize points in low-dimensional space (randomly).
  4. Compute pairwise similarities in low-dimensional space.
  5. Minimize the Kullback–Leibler (KL) divergence between the two distributions using gradient descent.

The four formulas you must know

(1) High-dimensional similarity — Gaussian (slide 14)

Conditional similarity of point \(j\) to point \(i\):

\[ p_{j\mid i}=\frac{\exp\!\big(-\|x_i-x_j\|^2/2\sigma_i^2\big)}{\sum_{k\neq i}\exp\!\big(-\|x_i-x_k\|^2/2\sigma_i^2\big)}.\]

Symmetrise into a joint probability over \(n\) points:

\[ p_{ij}=\frac{p_{j\mid i}+p_{i\mid j}}{2n}.\]

In plain English (the Gaussian formula)

\(p_{j\mid i}\) is "the chance point \(i\) would pick point \(j\) as its neighbour." The \(\exp(-\text{distance}^2)\) makes nearby points hugely more likely than distant ones; the denominator normalises so the probabilities over all neighbours sum to 1. \(\sigma_i\) sets the radius of "near" around point \(i\) — and it is chosen per-point via perplexity (next).

(2) Perplexity & choosing \(\sigma_i\) (slide 15)

The user sets a single perplexity value for the entire dataset. This guides the selection of \(\sigma_i\) for each data point. Adaptive \(\sigma_i\) selection:

  1. Initialize \(\sigma_i\) for each point \(x_i\).
  2. Compute pairwise distances between \(x_i\) and other points.
  3. Adjust \(\sigma_i\) by binary search to match the target perplexity.

Goal: ensure the local distribution around each \(x_i\) has the same effective number of neighbours as defined by the target perplexity.

(3) Low-dimensional similarity — Student's t (slide 16)

In the low-dimensional map, similarities use a Student's t-distribution (one degree of freedom):

\[ q_{ij}=\frac{\big(1+\|y_i-y_j\|^2\big)^{-1}}{\sum_{k\neq l}\big(1+\|y_k-y_l\|^2\big)^{-1}}.\]

Exam trap — why a t-distribution downstairs?

High dimensions use a Gaussian; the low-dimensional map uses a heavy-tailed Student's t. The heavy tail lets moderately-distant points sit comfortably far apart in 2-D without being penalised, which relieves crowding (the "crowding problem") and gives the clean, well-separated clusters t-SNE is famous for. If you forget which space uses which: Gaussian high, t-distribution low.

(4) Minimising KL divergence + the gradient (slide 17)

The cost is the Kullback–Leibler divergence between the high-dim distribution \(P\) and the low-dim distribution \(Q\):

\[ \mathrm{KL}(P\,\|\,Q)=\sum_{i\neq j} p_{ij}\log\frac{p_{ij}}{q_{ij}}.\]

Positions \(y_i\) are updated by gradient descent, with gradient

\[ \frac{\partial C}{\partial y_i}=4\sum_{j}\big(p_{ij}-q_{ij}\big)\big(y_i-y_j\big)\big(1+\|y_i-y_j\|^2\big)^{-1}.\]

In plain English (the KL cost)

KL divergence (from Week 1!) measures how different distribution \(Q\) is from \(P\). Driving \(\mathrm{KL}(P\|Q)\) toward 0 forces the low-dim neighbour-probabilities \(q_{ij}\) to imitate the true high-dim ones \(p_{ij}\). The gradient term \((p_{ij}-q_{ij})\) acts like a spring: if two points are more similar in high-dim than currently shown in low-dim, the spring pulls them together; if less, it pushes them apart.

What is \(y_i\)? (slide 18) & Summary (slide 19)

\(y_i\) is the position of the \(i\)-th data point in the low-dimensional space — initially assigned randomly, then adjusted through gradient descent. It represents the 2-D or 3-D coordinates used for visualization. In summary: t-SNE creates a map that preserves the structure and relationships of high-dimensional data; the \(y_i\) are the coordinates in that map; it helps in visualizing and identifying patterns and clusters.

PCA vs t-SNE — the comparison that gets examined

The lecture shows side-by-side PCA and t-SNE on the same data: PCA leaves the colour groups partly overlapping, while t-SNE separates them into clean islands.

PCA — overlapping teal & yellow groups blur together t-SNE — separated three clean islands
Slide 10 redrawn — on the same data PCA leaves clusters overlapping while t-SNE pulls them into distinct islands, because t-SNE optimises local neighbour structure rather than global variance.
PCA vs t-SNE — the slide's comparison (slide 11)

PCA — a linear technique, optimal for data with linear structure. Projects onto principal components that preserve the variance and large pairwise distances. Good for capturing the most significant features by reducing dimensions.

t-SNE — a non-linear technique that excels at preserving pairwise similarities. Focuses on maintaining small pairwise distances to model local structure.

Summary: PCA preserves variance (global, large distances); t-SNE preserves relationships between points (local, small distances), making it highly effective for visualizing complex high-dimensional data.

Exam trap — the slide's wording on PCA

The slide text literally says PCA projects "through principal components that minimize variance." That phrasing is loose/erroneous — PCA components maximize variance along each successive direction (equivalently, they minimize the reconstruction error). If asked, state clearly: PCA maximizes retained variance / minimizes projection error. Don't reproduce the slide's slip.

QYou have 50-D data and want a 2-D scatter that makes natural clusters visually obvious. PCA or t-SNE? Give the reason in terms of what each preserves.

t-SNE. It preserves local structure — small pairwise distances / who-is-near-whom — which is exactly what makes clusters pop visually. PCA preserves global variance and large distances, so distinct clusters can stay overlapped in the top-2 components. (Caveat for completeness: t-SNE distances/cluster sizes between islands aren't quantitatively meaningful, and it's slower — but for "make clusters visible in 2-D," it's the right tool, as the side-by-side figure shows.)

QIn t-SNE, which space uses the Gaussian and which uses the Student's t-distribution — and what problem does the t-distribution solve?

High-dimensional similarities \(p_{j\mid i}\) use a Gaussian; low-dimensional similarities \(q_{ij}\) use a Student's t. The t-distribution's heavy tails let dissimilar points be placed far apart in the 2-D map without incurring a large cost, relieving the crowding problem and producing well-separated clusters. Mnemonic: Gaussian high, t low.

QWhat does the perplexity parameter control, and how is σᵢ actually set to honour it?

Perplexity is a single user-set value that fixes the effective number of neighbours each point should have. It guides the per-point Gaussian bandwidth \(\sigma_i\): the algorithm initialises \(\sigma_i\), computes pairwise distances from \(x_i\), and uses binary search to tune \(\sigma_i\) until the local neighbourhood's effective neighbour count matches the target perplexity. Dense regions get a smaller \(\sigma_i\), sparse regions a larger one.

QWhat exactly is being minimised in t-SNE, and what role does each yᵢ play?

The KL divergence \(\mathrm{KL}(P\|Q)=\sum_{i\neq j}p_{ij}\log(p_{ij}/q_{ij})\) between the high-dim neighbour distribution \(P\) and the low-dim one \(Q\). Each \(y_i\) is the 2-D/3-D coordinate of point \(i\) in the map; it starts random and is moved by gradient descent (gradient \(\frac{\partial C}{\partial y_i}=4\sum_j (p_{ij}-q_{ij})(y_i-y_j)(1+\|y_i-y_j\|^2)^{-1}\)) until \(Q\) imitates \(P\). The optimised \(\{y_i\}\) is the visualisation.

Part IV — reduce by random luck (with a guarantee)

IV. Random Projection & Johnson–Lindenstrauss

Surprisingly, multiplying your data by a random matrix preserves all pairwise distances — with high probability — as long as you keep enough dimensions. The Johnson–Lindenstrauss lemma says exactly how few you need, and it doesn't depend on the original dimension at all.

Random projection feels like it shouldn't work. You take high-dimensional points, hit them with a random matrix, and somehow the geometry survives. The Johnson–Lindenstrauss (JL) lemma is the theorem that makes this rigorous — and it's a favourite for proof-style exam questions, so we'll walk the whole argument.

The idea

Definition — Random Projection (slide 2)

A method that projects data into a lower-dimensional space using random linear projections that preserve distances well, according to the Johnson–Lindenstrauss lemma. It is often computationally simpler and faster than PCA / t-SNE / the other methods.

JL vs PCA (slides 3–4)

Side-by-side

Johnson–Lindenstrauss (JL)Goal: embed high-dim points into lower dim while approximately preserving pairwise distances. Method: random linear projections. Application: distance-based tasks — nearest-neighbour search, clustering. Basis: probabilistic — guarantees distance preservation with high probability.

PCAGoal: reduce dimension while retaining as much variability (variance) as possible. Method: eigenvectors of the covariance matrix → principal components. Application: visualization, noise reduction, feature extraction. Basis: deterministic linear algebra capturing maximum-variance directions.

Key differences: Focus — JL preserves distances, PCA captures variance. Method — JL uses random projections, PCA uses deterministic eigendecomposition. Output — JL guarantees distance preservation; PCA gives principal components for transformation.

The goal, formally (slide 5)

Given a set of points \(P=\{p_1,p_2,\dots,p_n\}\) with each \(p_i\in\mathbb R^{d}\), we want to embed \(P\) into \(k\ll d\) dimensions while preserving distances (useful for nearest-neighbour, clustering, etc.). Concretely, we seek a function \(f\) such that for all \(p_i,p_j\in P\):

\[ \|f(p_i)-f(p_j)\|_2 \;\approx\; \|p_i-p_j\|_2.\]

The lemma, formally (slide 6)

Johnson–Lindenstrauss Lemma

Let \(P=\{p_1,\dots,p_n\}\) with \(p_i\in\mathbb R^{d}\). For any \(\varepsilon\in(0,1)\) and any \(n\) and \(k\) such that

\[ k \;\ge\; \frac{24}{3\varepsilon^{2}-2\varepsilon^{3}}\,\log n,\]

there exists a function \(f:\mathbb R^{d}\to\mathbb R^{k}\) such that for all \(p_i,p_j\in P\):

\[ (1-\varepsilon)\,\|p_i-p_j\|_2^{2} \;\le\; \|f(p_i)-f(p_j)\|_2^{2} \;\le\; (1+\varepsilon)\,\|p_i-p_j\|_2^{2}.\]

In plain English — the astonishing part

Read the bound on \(k\) carefully: it depends on the number of points \(n\) (logarithmically!) and the tolerance \(\varepsilon\) — but not on the original dimension \(d\) at all. So a million-dimensional dataset of \(n\) points can be squashed to roughly \(O(\log n / \varepsilon^2)\) dimensions while keeping every pairwise (squared) distance within a \((1\pm\varepsilon)\) factor. Want more accuracy (smaller \(\varepsilon\))? You pay with more target dimensions. The squared distances are sandwiched between \((1-\varepsilon)\) and \((1+\varepsilon)\) times the originals.

QYou have n = 1,000,000 points in ℝ^{5000} and want ε = 0.5. Roughly how many target dimensions k does JL require — and how does the answer change if d were 50,000 instead?

Plug into \(k\ge \dfrac{24}{3\varepsilon^2-2\varepsilon^3}\log n\) with \(\varepsilon=0.5\): denominator \(=3(0.25)-2(0.125)=0.75-0.25=0.5\), so the factor is \(24/0.5=48\). With \(\log n=\ln(10^6)\approx 13.8\): \(k\ge 48\times13.8\approx 663\). (If the slide intends \(\log_2\), \(k\approx 48\times 20\approx 957\) — either way a few hundred to ~1000.) Crucially, changing \(d\) from 5000 to 50,000 changes \(k\) by nothing: the bound is independent of the original dimension. That independence is the whole point.

The proof (slides 7, built up)

Proof setup

Let \(R\) be a random \(k\times d\) matrix with each entry \(R_{ij}\sim \mathcal N(0,1)\) (i.i.d. standard normal). Fix any point \(p\in P\) and define the projection

\[ v=\tfrac{1}{\sqrt k}\,R\cdot p \quad\in\mathbb R^{k},\qquad v_i=\tfrac{1}{\sqrt k}\sum_j R_{ij}\,p_j.\]

So \(f(p)=v\) is the random projection (scaled by \(1/\sqrt k\)). The proof rests on three ingredients (lemmas):

The three ingredients

Lemma 1 (correct on average): the projection preserves squared length in expectation,

\[ \mathbb E\!\left[\|v\|_2^{2}\right]=\|p\|_2^{2}.\]

Lemma 2 (rarely too big):

\[ \Pr\!\left[\|v\|_2^{2}\ge(1+\varepsilon)\|p\|_2^{2}\right]\le \frac{1}{n^{2}}.\]

Lemma 3 (rarely too small):

\[ \Pr\!\left[\|v\|_2^{2}\le(1-\varepsilon)\|p\|_2^{2}\right]\le \frac{1}{n^{2}}.\]

In plain English — how the three pieces finish the job

Lemma 1 says random projection gets the length right on average. Lemmas 2 and 3 say it's almost never off by more than a \((1\pm\varepsilon)\) factor — each bad event has probability \(\le 1/n^2\). Apply this to the difference vector \(p=p_i-p_j\) (length \(\|p_i-p_j\|\)) for every one of the \(\binom n2 < n^2/2\) pairs, and take a union bound: the chance that any pair is distorted is at most \(n^2 \times \tfrac{1}{n^2} =\) a small constant \(<1\). Since the probability of failure is below 1, a good projection must exist — which is exactly the "there exists \(f\)" claim. The \(\log n\) in the dimension bound is precisely what makes the \(1/n^2\) tail small enough for the union bound to survive.

ℝ^d (large d) a ×(1/√k)R R: random Gaussian ℝ^k (k≈log n/ε²) ≈a distances preserved to (1±ε)
Random projection — multiply by \(\tfrac{1}{\sqrt k}R\) (random Gaussian) to drop from \(\mathbb R^d\) to \(\mathbb R^k\); JL guarantees every pairwise squared distance survives within a \((1\pm\varepsilon)\) factor, with \(k\) independent of \(d\).
QIn the JL proof, each "bad" event has probability ≤ 1/n². Why is that the right target, and what tool turns per-pair guarantees into an all-pairs guarantee?

There are \(\binom{n}{2}union bound: \(\Pr[\text{any pair fails}] \le \sum_{\text{pairs}}\Pr[\text{that pair fails}] \le \tfrac{n^2}{2}\cdot \tfrac{2}{n^2}=\) a constant \(<1\) (counting both the too-big and too-small directions, \(2/n^2\) per pair). Because the total failure probability is strictly below 1, the probability that all pairs are simultaneously preserved is positive — so such an \(f\) exists (the probabilistic method). The \(1/n^2\) tail is engineered by the \(\log n\) term in the bound on \(k\).

QGive two reasons you might choose random projection (JL) over PCA in practice.

(1) Speed/simplicity: JL just multiplies by a random matrix — no covariance matrix, no eigendecomposition — so it's computationally cheaper and easy to apply to streaming or huge-\(d\) data. (2) Distance guarantee: JL comes with a provable, with-high-probability bound that all pairwise distances are preserved to within \((1\pm\varepsilon)\), which is exactly what distance-based algorithms (nearest-neighbour, clustering) need — whereas PCA optimises variance and gives no such per-pair distance guarantee. Bonus: the target dimension is independent of \(d\).

Part V — for completeness only

V. Factor Analysis (non-examinable)

The lecturer explicitly flags this section as not exam relevant — there isn't even a video. Skim it once so the name doesn't surprise you, then spend your energy on Parts I–IV.

Lecturer's disclaimer (slide 2) — read this first

Verbatim intent from the slide: "This is not exam relevant so feel free to skip it. There won't be a video about it either. I just wanted to mention that in other branches of science people use FA to try to model the influence between variables. Personally, I wouldn't reauss), or model latent factors (factor analysis — skip it).

Exam-readiness checklist — you should be able to:

  1. State the curse of dimensionality, name the three affected algorithms (KNN, k-means, hierarchical clustering), and explain the nearest ≈ farthest convergence (not just "distances grow").
  2. Derive the max distance \(\sqrt n\) in \([0,1]^n\), and interpret the kissing-number table / "high-dim spaces are empty."
  3. Distinguish feature selection vs extraction, and classify filter / wrapper / embedded methods (with lasso as the embedded example).
  4. Compute Pearson \(r\) from raw data using the \(n\sum xy\) formula — and reproduce the worked example giving \(r=1\) for \(Y=2X\).
  5. Explain correlation ≠ causation via the ice-cream/shark confounder, and state when to drop a feature (corr ≈ 0; or one of a highly-correlated pair — magnitude, not sign).
  6. Define Simpson's paradox, work the surgery-vs-medication table (surgery wins both subgroups, loses overall), identify tumour size as the confounder, and prescribe an RCT.
  7. Explain the 1973 Berkeley admissions reversal using department choice as the confounder (cite Bickel, Hammel & O'Connell 1975).
  8. List the five extraction techniques and write the SVD factorization \(A=U\Sigma V^{T}\) with the role of each factor.
  9. State PCA's four steps (standardize → covariance → eigendecomposition → project) and why standardization matters; note PCA maximizes variance (correct the slide's slip).
  10. Reproduce the four t-SNE formulas: Gaussian \(p_{j\mid i}\), joint \(p_{ij}\), Student-t \(q_{ij}\), and KL cost + gradient — and explain perplexity / per-point \(\sigma_i\) via binary search.
  11. Justify Gaussian high, t low (heavy tails fix crowding) and contrast PCA (variance, global, linear) vs t-SNE (similarity, local, non-linear); credit van der Maaten & Hinton 2008.
  12. State the JL lemma including the bound \(k\ge \frac{24}{3\varepsilon^2-2\varepsilon^3}\log n\) and the \((1\pm\varepsilon)\) squared-distance sandwich; stress \(k\) is independent of \(d\).
  13. Reproduce the JL proof skeleton: \(v=\tfrac1{\sqrt k}Rp\), the three lemmas (mean-correct, two tail bounds \(\le 1/n^2\)), and the union-bound / probabilistic-method finish.
  14. Say why random projection beats PCA when you need speed and a per-pair distance guarantee.
  15. Recognise factor analysis as latent-factor extraction — and remember it's non-examinable.