Why high-dimensional data breaks our tools — and four families of methods (selection, extraction, random projection, factor analysis) to fight back.
Before we reduce dimensions, we need to understand why high dimensions are poison for the algorithms we love — and why a simple picture is often worth a pound of analysis.
Almost every method you have met this term — nearest neighbours, k-means, hierarchical clustering — secretly relies on one thing: that the distance between two points means something. "Close" points are similar, "far" points are different. The whole edifice of distance-based machine learning rests on that assumption. The curse of dimensionality is the discovery that, as you add more and more features, this assumption quietly collapses.
A critical issue in data analysis where an increase in the dimensionality of the feature space leads to a deterioration in the performance of distance-based algorithms.
The methods explicitly named on the slide as affected:
The slide's takeaway: because of this, dimensionality reduction is often adopted as a preliminary processing step — you shrink the feature space first, then run KNN/k-means/clustering on the cleaner, lower-dimensional data so they actually work well.
How do we see that high-dimensional space is strange? One clean way is the kissing number.
The maximum number of non-overlapping unit spheres that can simultaneously touch ("kiss") a single central unit sphere in \(d\)-dimensional space.
| Dimension \(d\) | Kissing number | Note (from slide) |
|---|---|---|
| 1 | 2 | left and right |
| 2 | 6 | forming a hexagon around the central point |
| 3 | 12 | |
| 4 | 24 | |
| 8 | 240 | |
| 24 | 196560 |
The number of neighbours a sphere can touch explodes with dimension (2 → 6 → 12 → 24 → 240 → 196,560). The slide's punchline is the opposite of what that growth suggests at first: "High-dimensional spaces are empty." Even though a sphere could have hundreds of thousands of touching neighbours, there is so much room that points you actually sample spread out into a vast, sparse void — they almost never sit close together. Volume grows so fast that any finite set of data points becomes a scattering of lonely specks.
The slide runs a thought experiment: draw 100 points from a \(d\)-dimensional uniform distribution on the cube \([0,1]^d\). What is the largest possible distance between two points (the diagonal of the cube)?
The two most extreme corners of the unit cube are \(\mathbf 0=(0,0,\dots,0)\) and \(\mathbf 1=(1,1,\dots,1)\). Their Euclidean distance is
\[\|\mathbf 1-\mathbf 0\|_2=\sqrt{\sum_{i=1}^{n}(1-0)^2}=\sqrt{\underbrace{1+1+\cdots+1}_{n}}=\sqrt n.\]
So in \(n=100\) dimensions the diagonal is \(\sqrt{100}=10\) — ten times longer than the side of the cube. The space the points must spread across grows without bound while the points stay finite in number.
The lecture's figure makes the killer point. It plots, on a log–log scale, the distance to the nearest point (red) and to the farthest point (blue) as the dimension grows from 1 to 100.
The point is not simply "distances get bigger" — that alone wouldn't hurt anything. The damage is that the nearest and the farthest distances become almost equal. If your closest neighbour and your most distant point are nearly the same distance away, then the word "nearest" loses all meaning — and KNN, k-means and clustering, which depend on telling near from far, stop discriminating. That is the curse. Don't write "distances grow" on the exam without explaining the convergence.
Maximum distance is the cube diagonal \(\sqrt{64}=8\). The problem is not the size itself but that, in such high dimension, nearly all pairwise distances cluster close to this same large value: the nearest neighbour ends up almost as far as the farthest. KNN's notion of "the \(k\) closest points" becomes essentially arbitrary, so its predictions degrade. The fix is to reduce dimensionality first.
No. The kissing number is the maximum possible number of unit spheres that can physically touch a central one — it shows how much "room" a high-dimensional space affords. Because the available volume grows astronomically with \(d\), any finite, realistic sample of data points spreads out into that volume and almost never lands close to any other point. So the space could pack many touching spheres, but real data is sparse within it — hence "empty." High capacity, low occupancy.
Slide 6 is a wall of quotes hammering one theme — looking at data beats computing on it blind. Worth knowing the named sources for the exam:
The lecturer literally put on a slide that this topic is "really important for the exam" — treat the whole of Week 10 as high-yield.
Slide 7 drives it home with a 2-D scatter: six coloured groups plus two lone stray points. Just looking at the data immediately suggests natural clusters and outliers — no algorithm needed. That is the entire promise of dimensionality reduction for visualisation: get the data down to 2-D and your eye does the clustering for free.
Slide 8's overview tells you exactly what is coming, and gives one example technique for each:
Selection keeps a subset of your original columns (the survivors are still "height," "weight," etc., so results stay interpretable). Extraction invents brand-new columns that are combinations of the originals (e.g. "principal component 1") — more compact, but the new axes no longer map cleanly to a single real-world variable.
(a) Selection — you keep/drop original columns. (b) Extraction — PCA builds new linear-combination axes. (c) Selection (specifically embedded selection — it happens during training and forces some original coefficients to exactly 0). (d) Extraction — t-SNE creates new low-dimensional coordinates.
Feature selection picks a subset of the original features. The workhorse tool here is Pearson correlation — but the lecturer spends most of the section on the traps that make naïve correlation dangerous: confounding, and Simpson's paradox.
The simplest dimensionality reduction is to throw away the features that don't help. The question is: how do you decide which features don't help? The slides give three broad strategies, then zoom in on correlation as the concrete tool — and then spend a long time on how correlation can lie to you.
Filter. It scores each feature with a cheap statistic (e.g. Pearson correlation or mutual information) independent of any model, so it is fast and scales to huge feature counts. Wrapper methods retrain a model for many candidate subsets (far too slow at that scale), and embedded methods tie you to one particular model's training (e.g. lasso). Trade-off: filters ignore feature interactions, so they can keep redundant features a wrapper would catch.
Before the maths, the lecturer tells the parable. In the seaside town of Sunny Beach, every summer ice-cream sales soar — and so do shark attacks. Year after year, more ice cream sold ⇒ more shark attacks reported.
Observation: plot the data and you see a clear correlation. But does buying ice cream cause shark attacks?
Correlation: ice-cream sales and shark attacks both rise in summer. Causation: the hot weather independently causes both — more people buy ice cream and more people swim (so more are exposed to sharks). Conclusion: two things happening together doesn't mean one causes the other; the link can be due to an underlying confounding factor (here, temperature). Always look deeper for a true causal link versus a coincidental correlation.
Confounder: hot weather / temperature. Structure: Hot weather → ice-cream sales and Hot weather → swimming → shark attacks. Ice cream and shark attacks have no direct arrow between them; they are correlated only because they share a common cause. This is the textbook confounding pattern.
A measure of the linear correlation between two variables \(X\) and \(Y\). Its value ranges from \(-1\) to \(1\):
For \(n\) paired scores, the Pearson coefficient is
\[ r_{xy}=\frac{n\left(\sum xy\right)-\left(\sum x\right)\left(\sum y\right)}{\sqrt{\big[\,n\sum x^2-(\sum x)^2\,\big]\big[\,n\sum y^2-(\sum y)^2\,\big]}}\,,\]
where \(n\) is the number of pairs of scores.
The numerator asks: "do \(X\) and \(Y\) move together more than you'd expect if they were independent?" The denominator normalises by how spread out each variable is, so the answer always lands between \(-1\) and \(1\) regardless of the units. Same sign-moving ⇒ positive; opposite-moving ⇒ negative; no pattern ⇒ near zero.
Here \(n=3\). Compute the five sums:
\[\sum x = 1+2+3 = 6,\qquad \sum y = 2+4+6 = 12,\] \[\sum xy = (1)(2)+(2)(4)+(3)(6)=2+8+18 = 28,\] \[\sum x^2 = 1^2+2^2+3^2 = 14,\qquad \sum y^2 = 2^2+4^2+6^2 = 56.\]
Substitute:
\[ r=\frac{3(28)-(6)(12)}{\sqrt{[\,3(14)-6^2\,][\,3(56)-12^2\,]}}=\frac{84-72}{\sqrt{(42-36)(168-144)}}=\frac{12}{\sqrt{6\cdot 24}}=\frac{12}{\sqrt{144}}=\frac{12}{12}=1.\]
So \(r=1\): a perfect positive linear relationship — which makes sense, since \(Y=2X\) exactly.
Prediction: \(Y\) falls as \(X\) rises, so \(r\) should be \(-1\). Check: \(n=3\), \(\sum x=6\), \(\sum y=12\), \(\sum xy=(1)(6)+(2)(4)+(3)(2)=6+8+6=20\), \(\sum x^2=14\), \(\sum y^2=36+16+4=56\).
\[ r=\frac{3(20)-(6)(12)}{\sqrt{(42-36)(168-144)}}=\frac{60-72}{\sqrt{6\cdot24}}=\frac{-12}{12}=-1.\]
Perfect negative linear relationship — \(Y=8-2X\) exactly. ✓
The slides show three scatter plots. Redrawn together:
Pearson \(r\) measures linear association only. A perfect U-shape (e.g. \(Y=X^2\) symmetric about zero) can have \(r=0\) despite a perfect deterministic relationship. "\(r=0\)" means "no linear correlation," not "no relationship." This is why filter methods sometimes also use mutual information, which catches non-linear dependence.
Slide 12 lists fields where Pearson is used: Health Sciences (lifestyle factors vs. health outcomes), Economics (relationships between economic factors), Psychology (behavioural patterns vs. outcomes).
Wrong — keep it. Magnitude is what matters, not sign. \(|{-}0.92|=0.92\) is close to 1, so the feature is highly predictive (it just predicts in the opposite direction). The slide explicitly warns: "If the correlation is very close to 1 or −1 it's probably very helpful." You'd only consider dropping features whose correlation is near 0, or one of a pair of mutually highly-correlated features.
This is the most exam-worthy idea in Part II. Simpson's paradox is when a trend that appears in every subgroup of the data reverses when the subgroups are combined.
Step 1 — the aggregate. Compare two treatments by overall success rate:
| Surgery | Medication | |
|---|---|---|
| Total success | 78% (273/350) | 83% (289/350) |
Medication looks better (83% > 78%). But now split by tumour size:
Step 2 — split by tumour size, then the totals:
| Surgery | Medication | |
|---|---|---|
| small tumour | 93% (81/87) | 87% (234/270) |
| large tumour | 73% (192/263) | 69% (55/80) |
| total | 78% (273/350) | 83% (289/350) |
In both subgroups, Surgery wins: 93% > 87% for small tumours, and 73% > 69% for large tumours. Yet in the total, Medication wins (83% > 78%). The within-group winner and the overall winner are opposite. That reversal is Simpson's paradox.
The groups are wildly unbalanced. Small tumours (easy cases, high success) are mostly treated with medication (270 vs 87). Large tumours (hard cases, low success) are mostly treated with surgery (263 vs 80). Success depends heavily on the confounding variable tumour size: large tumours have low success regardless of treatment. So surgery's overall rate is dragged down because surgery was handed the hard cases. Medication only appears better because of this treatment-assignment bias — but surgery is actually the better option within each like-for-like comparison.
Slide 18 then asks the design question: "How could one design a study that would not have these problems?"
A randomised controlled trial (RCT). By randomly assigning treatment independently of tumour size, you break the link between the confounder (size) and the treatment, so the groups become comparable and the aggregate no longer lies. (This connects directly to Week 2's discussion of experimental design.) Alternatively, stratify/control for tumour size in the analysis.
Simpson's paradox is when a relationship holding in every subgroup reverses after the subgroups are aggregated, caused by a confounder correlated with both group membership and outcome. Here the confounder is tumour size (correlated with both treatment choice and success). The design fix is randomised assignment of treatment (an RCT), which decouples the confounder from the treatment.
The real-world version. In the early 1970s, UC Berkeley was sued for gender discrimination in graduate admissions. The Fall 1973 aggregate figures:
At first glance, assuming similar qualifications, this looks like discrimination against women. But department by department, a different picture emerges:
| Dept | Men — applicants | Men — admitted | Women — applicants | Women — admitted |
|---|---|---|---|---|
| A | 825 | 62% | 108 | 82% |
| B | 560 | 63% | 25 | 68% |
| C | 325 | 37% | 593 | 34% |
| D | 417 | 33% | 375 | 35% |
| E | 191 | 28% | 393 | 24% |
| F | 272 | 6% | 341 | 7% |
In several departments women were admitted at a higher rate (A: 82% vs 62%; B, D, F roughly tied or women higher). The aggregate gap arose because women disproportionately applied to the most competitive departments (e.g. C, E, F with low admit rates for everyone), while men applied more to easy-admit departments (A, B). Department choice is the confounder — same structure as the tumour example. The paradox showed the headline discrimination claim was an artefact of aggregation.
Dept A admits at a high rate (62% men, 82% women) and received 825 male but only 108 female applicants — men piled into an easy department. Other departments (C, E, F) are hard for everyone and attracted far more women. Because admit rate depends strongly on which department you applied to (the confounder), and the sexes applied to different department mixes, pooling all departments produces an overall male advantage even though within most departments women did as well or better. Controlling for department dissolves the apparent discrimination.
Instead of keeping a subset of original features, feature extraction builds brand-new features that are combinations of the originals. PCA finds the linear directions of greatest variance; t-SNE non-linearly preserves who-is-near-whom for visualisation.
Feature extraction trades interpretability for compactness. The new axes — "principal component 1," "t-SNE feature 2" — don't correspond to any single real-world quantity, but they pack the structure of the data into far fewer numbers. The slide opens with a menu of five extraction techniques; the exam focus is the first three.
\(A = U\Sigma V^{T}\). \(U\) is orthogonal and its columns are the left singular vectors; \(\Sigma\) is diagonal and holds the singular values (non-negative, usually sorted descending); \(V^{T}\) is the transpose of the orthogonal \(V\), whose columns are the right singular vectors. SVD underlies PCA: applying it to a (centred) data matrix gives the principal directions and the amount of variance each captures.
Picture a flattish, tilted cloud of 2-D points. PCA finds the single straight line along which the cloud is most stretched out (most variance) — that's principal component 1. The next component is the most-stretched direction perpendicular to the first, and so on. To reduce dimensions, you keep the first few components and throw the rest away, losing as little spread (information) as possible. The new axes are uncorrelated by construction.
PCA works off the correlation/covariance matrix (correlation coefficients between pairs of features). It identifies patterns in that matrix to capture significant variance, and reduces redundancy caused by correlated features (two strongly correlated features carry overlapping information; PCA merges them into one component).
Benefits of PCA: Simplification (reduces data complexity), Visualization (facilitates viewing lower-dimensional data), Efficiency (improves computational efficiency in downstream tasks).
PCA seeks directions of maximum variance, and variance is scale-dependent. If one feature is in metres (range 0–10000) and another in kilograms (range 0–5), the metre feature will dominate the covariance purely because of its units, and PC1 will essentially align with it regardless of true structure. Standardizing (e.g. zero mean, unit variance) puts all features on a comparable footing so the components reflect genuine shared structure, not arbitrary unit choices.
t-distributed Stochastic Neighbor Embedding is a non-linear dimensionality reduction technique, developed by Laurens van der Maaten and Geoffrey Hinton in 2008, used for visualizing high-dimensional data in 2-D or 3-D.
t-SNE models the probability that one point is a neighbour of another, in both the high- and low-dimensional spaces. In the original space it measures pairwise similarities with a Gaussian kernel — close points get high probability, far points get low probability. It then places points in 2-D/3-D and shuffles their positions until the low-dimensional neighbour-probabilities match the high-dimensional ones. "Match" is made precise by minimising the divergence between the two probability distributions, using gradient descent, until the layout stabilises.
Conditional similarity of point \(j\) to point \(i\):
\[ p_{j\mid i}=\frac{\exp\!\big(-\|x_i-x_j\|^2/2\sigma_i^2\big)}{\sum_{k\neq i}\exp\!\big(-\|x_i-x_k\|^2/2\sigma_i^2\big)}.\]
Symmetrise into a joint probability over \(n\) points:
\[ p_{ij}=\frac{p_{j\mid i}+p_{i\mid j}}{2n}.\]
\(p_{j\mid i}\) is "the chance point \(i\) would pick point \(j\) as its neighbour." The \(\exp(-\text{distance}^2)\) makes nearby points hugely more likely than distant ones; the denominator normalises so the probabilities over all neighbours sum to 1. \(\sigma_i\) sets the radius of "near" around point \(i\) — and it is chosen per-point via perplexity (next).
The user sets a single perplexity value for the entire dataset. This guides the selection of \(\sigma_i\) for each data point. Adaptive \(\sigma_i\) selection:
Goal: ensure the local distribution around each \(x_i\) has the same effective number of neighbours as defined by the target perplexity.
In the low-dimensional map, similarities use a Student's t-distribution (one degree of freedom):
\[ q_{ij}=\frac{\big(1+\|y_i-y_j\|^2\big)^{-1}}{\sum_{k\neq l}\big(1+\|y_k-y_l\|^2\big)^{-1}}.\]
High dimensions use a Gaussian; the low-dimensional map uses a heavy-tailed Student's t. The heavy tail lets moderately-distant points sit comfortably far apart in 2-D without being penalised, which relieves crowding (the "crowding problem") and gives the clean, well-separated clusters t-SNE is famous for. If you forget which space uses which: Gaussian high, t-distribution low.
The cost is the Kullback–Leibler divergence between the high-dim distribution \(P\) and the low-dim distribution \(Q\):
\[ \mathrm{KL}(P\,\|\,Q)=\sum_{i\neq j} p_{ij}\log\frac{p_{ij}}{q_{ij}}.\]
Positions \(y_i\) are updated by gradient descent, with gradient
\[ \frac{\partial C}{\partial y_i}=4\sum_{j}\big(p_{ij}-q_{ij}\big)\big(y_i-y_j\big)\big(1+\|y_i-y_j\|^2\big)^{-1}.\]
KL divergence (from Week 1!) measures how different distribution \(Q\) is from \(P\). Driving \(\mathrm{KL}(P\|Q)\) toward 0 forces the low-dim neighbour-probabilities \(q_{ij}\) to imitate the true high-dim ones \(p_{ij}\). The gradient term \((p_{ij}-q_{ij})\) acts like a spring: if two points are more similar in high-dim than currently shown in low-dim, the spring pulls them together; if less, it pushes them apart.
\(y_i\) is the position of the \(i\)-th data point in the low-dimensional space — initially assigned randomly, then adjusted through gradient descent. It represents the 2-D or 3-D coordinates used for visualization. In summary: t-SNE creates a map that preserves the structure and relationships of high-dimensional data; the \(y_i\) are the coordinates in that map; it helps in visualizing and identifying patterns and clusters.
The lecture shows side-by-side PCA and t-SNE on the same data: PCA leaves the colour groups partly overlapping, while t-SNE separates them into clean islands.
PCA — a linear technique, optimal for data with linear structure. Projects onto principal components that preserve the variance and large pairwise distances. Good for capturing the most significant features by reducing dimensions.
t-SNE — a non-linear technique that excels at preserving pairwise similarities. Focuses on maintaining small pairwise distances to model local structure.
Summary: PCA preserves variance (global, large distances); t-SNE preserves relationships between points (local, small distances), making it highly effective for visualizing complex high-dimensional data.
The slide text literally says PCA projects "through principal components that minimize variance." That phrasing is loose/erroneous — PCA components maximize variance along each successive direction (equivalently, they minimize the reconstruction error). If asked, state clearly: PCA maximizes retained variance / minimizes projection error. Don't reproduce the slide's slip.
t-SNE. It preserves local structure — small pairwise distances / who-is-near-whom — which is exactly what makes clusters pop visually. PCA preserves global variance and large distances, so distinct clusters can stay overlapped in the top-2 components. (Caveat for completeness: t-SNE distances/cluster sizes between islands aren't quantitatively meaningful, and it's slower — but for "make clusters visible in 2-D," it's the right tool, as the side-by-side figure shows.)
High-dimensional similarities \(p_{j\mid i}\) use a Gaussian; low-dimensional similarities \(q_{ij}\) use a Student's t. The t-distribution's heavy tails let dissimilar points be placed far apart in the 2-D map without incurring a large cost, relieving the crowding problem and producing well-separated clusters. Mnemonic: Gaussian high, t low.
Perplexity is a single user-set value that fixes the effective number of neighbours each point should have. It guides the per-point Gaussian bandwidth \(\sigma_i\): the algorithm initialises \(\sigma_i\), computes pairwise distances from \(x_i\), and uses binary search to tune \(\sigma_i\) until the local neighbourhood's effective neighbour count matches the target perplexity. Dense regions get a smaller \(\sigma_i\), sparse regions a larger one.
The KL divergence \(\mathrm{KL}(P\|Q)=\sum_{i\neq j}p_{ij}\log(p_{ij}/q_{ij})\) between the high-dim neighbour distribution \(P\) and the low-dim one \(Q\). Each \(y_i\) is the 2-D/3-D coordinate of point \(i\) in the map; it starts random and is moved by gradient descent (gradient \(\frac{\partial C}{\partial y_i}=4\sum_j (p_{ij}-q_{ij})(y_i-y_j)(1+\|y_i-y_j\|^2)^{-1}\)) until \(Q\) imitates \(P\). The optimised \(\{y_i\}\) is the visualisation.
Surprisingly, multiplying your data by a random matrix preserves all pairwise distances — with high probability — as long as you keep enough dimensions. The Johnson–Lindenstrauss lemma says exactly how few you need, and it doesn't depend on the original dimension at all.
Random projection feels like it shouldn't work. You take high-dimensional points, hit them with a random matrix, and somehow the geometry survives. The Johnson–Lindenstrauss (JL) lemma is the theorem that makes this rigorous — and it's a favourite for proof-style exam questions, so we'll walk the whole argument.
A method that projects data into a lower-dimensional space using random linear projections that preserve distances well, according to the Johnson–Lindenstrauss lemma. It is often computationally simpler and faster than PCA / t-SNE / the other methods.
Johnson–Lindenstrauss (JL) — Goal: embed high-dim points into lower dim while approximately preserving pairwise distances. Method: random linear projections. Application: distance-based tasks — nearest-neighbour search, clustering. Basis: probabilistic — guarantees distance preservation with high probability.
PCA — Goal: reduce dimension while retaining as much variability (variance) as possible. Method: eigenvectors of the covariance matrix → principal components. Application: visualization, noise reduction, feature extraction. Basis: deterministic linear algebra capturing maximum-variance directions.
Key differences: Focus — JL preserves distances, PCA captures variance. Method — JL uses random projections, PCA uses deterministic eigendecomposition. Output — JL guarantees distance preservation; PCA gives principal components for transformation.
Given a set of points \(P=\{p_1,p_2,\dots,p_n\}\) with each \(p_i\in\mathbb R^{d}\), we want to embed \(P\) into \(k\ll d\) dimensions while preserving distances (useful for nearest-neighbour, clustering, etc.). Concretely, we seek a function \(f\) such that for all \(p_i,p_j\in P\):
\[ \|f(p_i)-f(p_j)\|_2 \;\approx\; \|p_i-p_j\|_2.\]
Let \(P=\{p_1,\dots,p_n\}\) with \(p_i\in\mathbb R^{d}\). For any \(\varepsilon\in(0,1)\) and any \(n\) and \(k\) such that
\[ k \;\ge\; \frac{24}{3\varepsilon^{2}-2\varepsilon^{3}}\,\log n,\]
there exists a function \(f:\mathbb R^{d}\to\mathbb R^{k}\) such that for all \(p_i,p_j\in P\):
\[ (1-\varepsilon)\,\|p_i-p_j\|_2^{2} \;\le\; \|f(p_i)-f(p_j)\|_2^{2} \;\le\; (1+\varepsilon)\,\|p_i-p_j\|_2^{2}.\]
Read the bound on \(k\) carefully: it depends on the number of points \(n\) (logarithmically!) and the tolerance \(\varepsilon\) — but not on the original dimension \(d\) at all. So a million-dimensional dataset of \(n\) points can be squashed to roughly \(O(\log n / \varepsilon^2)\) dimensions while keeping every pairwise (squared) distance within a \((1\pm\varepsilon)\) factor. Want more accuracy (smaller \(\varepsilon\))? You pay with more target dimensions. The squared distances are sandwiched between \((1-\varepsilon)\) and \((1+\varepsilon)\) times the originals.
Plug into \(k\ge \dfrac{24}{3\varepsilon^2-2\varepsilon^3}\log n\) with \(\varepsilon=0.5\): denominator \(=3(0.25)-2(0.125)=0.75-0.25=0.5\), so the factor is \(24/0.5=48\). With \(\log n=\ln(10^6)\approx 13.8\): \(k\ge 48\times13.8\approx 663\). (If the slide intends \(\log_2\), \(k\approx 48\times 20\approx 957\) — either way a few hundred to ~1000.) Crucially, changing \(d\) from 5000 to 50,000 changes \(k\) by nothing: the bound is independent of the original dimension. That independence is the whole point.
Let \(R\) be a random \(k\times d\) matrix with each entry \(R_{ij}\sim \mathcal N(0,1)\) (i.i.d. standard normal). Fix any point \(p\in P\) and define the projection
\[ v=\tfrac{1}{\sqrt k}\,R\cdot p \quad\in\mathbb R^{k},\qquad v_i=\tfrac{1}{\sqrt k}\sum_j R_{ij}\,p_j.\]
So \(f(p)=v\) is the random projection (scaled by \(1/\sqrt k\)). The proof rests on three ingredients (lemmas):
Lemma 1 (correct on average): the projection preserves squared length in expectation,
\[ \mathbb E\!\left[\|v\|_2^{2}\right]=\|p\|_2^{2}.\]
Lemma 2 (rarely too big):
\[ \Pr\!\left[\|v\|_2^{2}\ge(1+\varepsilon)\|p\|_2^{2}\right]\le \frac{1}{n^{2}}.\]
Lemma 3 (rarely too small):
\[ \Pr\!\left[\|v\|_2^{2}\le(1-\varepsilon)\|p\|_2^{2}\right]\le \frac{1}{n^{2}}.\]
Lemma 1 says random projection gets the length right on average. Lemmas 2 and 3 say it's almost never off by more than a \((1\pm\varepsilon)\) factor — each bad event has probability \(\le 1/n^2\). Apply this to the difference vector \(p=p_i-p_j\) (length \(\|p_i-p_j\|\)) for every one of the \(\binom n2 < n^2/2\) pairs, and take a union bound: the chance that any pair is distorted is at most \(n^2 \times \tfrac{1}{n^2} =\) a small constant \(<1\). Since the probability of failure is below 1, a good projection must exist — which is exactly the "there exists \(f\)" claim. The \(\log n\) in the dimension bound is precisely what makes the \(1/n^2\) tail small enough for the union bound to survive.
There are \(\binom{n}{2}
(1) Speed/simplicity: JL just multiplies by a random matrix — no covariance matrix, no eigendecomposition — so it's computationally cheaper and easy to apply to streaming or huge-\(d\) data. (2) Distance guarantee: JL comes with a provable, with-high-probability bound that all pairwise distances are preserved to within \((1\pm\varepsilon)\), which is exactly what distance-based algorithms (nearest-neighbour, clustering) need — whereas PCA optimises variance and gives no such per-pair distance guarantee. Bonus: the target dimension is independent of \(d\).
The lecturer explicitly flags this section as not exam relevant — there isn't even a video. Skim it once so the name doesn't surprise you, then spend your energy on Parts I–IV.
Verbatim intent from the slide: "This is not exam relevant so feel free to skip it. There won't be a video about it either. I just wanted to mention that in other branches of science people use FA to try to model the influence between variables. Personally, I wouldn't re auss), or model latent factors (factor analysis — skip it).