From four-point disasters to dendrograms — how to group data when no one tells you the labels.
You have a pile of points and an integer $k$. You want $k$ groups such that points inside a group huddle together and points across groups spread apart. The simplest way to formalise "huddle together" is squared distance to a centroid. That's k-Means.
Clustering is an unsupervised problem. You receive $n$ points $x_1, x_2, \ldots, x_n$ in some $d$-dimensional space, but no labels — nobody tells you what the right groups are. The job is to invent the labels yourself by producing a partition that "makes sense." The hard part is deciding what "makes sense" means mathematically, and different choices give different algorithms. k-Means makes one specific choice: the cost of a clustering is the sum of squared Euclidean distances from each point to its assigned cluster's centroid.
Given $n$ points $x_1, \ldots, x_n \in \mathbb{R}^d$ and an integer $k$, partition them into sets $S_1, S_2, \ldots, S_k$ (every point in exactly one set) so as to minimise $$c(S_1, S_2, \ldots, S_k) \;=\; \sum_{i=1}^{n} \min_{j \in [k]} \bigl\{\, \|x_i - \mu_j\|_2^2 \,\bigr\},$$ where $\mu_j$ is the centroid (mid-point) of cluster $S_j$, $$\mu_j \;=\; \frac{1}{|S_j|}\sum_{\ell \in S_j} x_\ell,$$ and $\|\cdot\|_2^2$ is the squared $L_2$ norm — the squared Euclidean distance.
Note that $x_i$ and $\mu_j$ are vectors, not scalars. The notation $[k]$ is shorthand for $\{1, 2, \ldots, k\}$.
For every point, find the cluster centre it's closest to (in squared Euclidean distance), and add up all those squared distances. That total is the cost. A good clustering is one with a small total. The centre of each cluster is just the average of the points in it.
Suppose $S_1 = \{x_1, x_2\}$ with $$x_1 = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \quad x_2 = \begin{pmatrix} 3 \\ 4 \end{pmatrix}.$$ Then the centroid is just the coordinate-wise average: $$\mu_1 \;=\; \tfrac{1}{2}(x_1 + x_2) \;=\; \begin{pmatrix} 2 \\ 3 \end{pmatrix}.$$ That's it — the centroid lives in the same space as the data and need not be one of the original points.
Pick $k$ pretend centres. Tag every point with its nearest centre. Move each centre to the average of its tagged points. Repeat. When nothing moves, stop. That's the entire algorithm.
The k-Means cost uses $\|x_i - \mu_j\|_2^2$ — the squared Euclidean distance. Don't take a square root. The reason this matters: minimising sum of squared distances has the centroid (mean) as the unique optimal centre. If you used un-squared $L_2$, the optimal centre would be the geometric median (a much harder optimisation). Squaring is what makes the closed-form centroid update $\mu_i = \tfrac{1}{|S_i|}\sum x_j$ correct.
Suppose $x_1 = (0,0)$, $x_2 = (2,0)$, $x_3 = (10,0)$, $x_4 = (12,0)$. We partition $S_1 = \{x_1,x_2\}$ and $S_2 = \{x_3,x_4\}$.
Centroids: $\mu_1 = (1,0)$, $\mu_2 = (11,0)$.
Squared distances: $\|x_1 - \mu_1\|^2 = 1$, $\|x_2 - \mu_1\|^2 = 1$, $\|x_3 - \mu_2\|^2 = 1$, $\|x_4 - \mu_2\|^2 = 1$.
Cost $= 1+1+1+1 = 4$. (Compare with the alternative partition $\{x_1, x_4\}, \{x_2, x_3\}$ which gives a cost of $72 + 32 = 104$ — much worse.)
Each step of Lloyd's algorithm never increases the cost: the assignment step minimises cost over assignments while holding centres fixed, and the update step minimises cost over centre placements while holding assignments fixed (the mean is the unique minimiser of sum-of-squared-distances for a finite set of points). Because there are only finitely many possible assignments (and any repeat of an assignment means we've converged), the algorithm terminates.
However — termination is not the same as finding the global optimum. Lloyd's iteration only guarantees a local minimum. And it can be a very bad local minimum.
Consider four points arranged at the corners of a wide rectangle. The optimal 2-clustering is clearly to split left from right (the rectangle is wider than it is tall). But suppose the random initialization places the two centres at the top-centre and bottom-centre — splitting top from bottom.
"We can see that if we start with sub-optimal clusters, [k-Means] never changes them. This can be made arbitrarily bad by increasing the width of the rectangle." Stretch the rectangle to be 100× wider and the bad solution stays equally trapped — the ratio of bad-to-optimal cost grows without bound. So k-Means has no worst-case approximation guarantee with arbitrary initialization.
Because the configuration is symmetric. With centres on the horizontal midline (one at the top-middle, one at the bottom-middle), every point's closest centre is the one directly above or below it. So the assignment step gives two horizontal clusters: top-pair and bottom-pair. The centroid of the top-pair is at the top-midline, and of the bottom-pair at the bottom-midline. Nothing moves vertically off the midline. The algorithm is at a (local) fixed point.
Mathematically, every Lloyd's iteration is locally optimal in two senses — assignments are best given centres, and centres are best given assignments — but neither of those local optimalities is enough to find the global optimum.
Fix a cluster $S = \{x_1, \ldots, x_m\}$ and consider the cost function $f(\mu) = \sum_{j=1}^{m} \|x_j - \mu\|_2^2$. Expand: $$f(\mu) = \sum_j (x_j - \mu)^\top (x_j - \mu) = \sum_j \|x_j\|^2 - 2\mu^\top \sum_j x_j + m\|\mu\|^2.$$ Take the gradient and set to zero: $$\nabla f(\mu) = -2\sum_j x_j + 2m\mu = 0 \;\;\Rightarrow\;\; \mu = \tfrac{1}{m}\sum_j x_j.$$ The Hessian is $2m\,I$ (positive definite), so this critical point is the unique minimiser. That's why the centroid update is correct — it's the closed-form optimum.
If k-Means' problem is the random initialization, the fix is to be less random. k-Means++ seeds the centres deliberately — each new centre is sampled with probability proportional to its squared distance from the centres already chosen. The payoff: a worst-case approximation factor of $O(\log k)$.
The idea is irresistible: if you're picking $k$ centres and you've already placed one in cluster A, the smart move is to put your next centre far from A — most likely in cluster B. Distance-weighted random sampling does exactly this: the further a point is from the centres you already have, the more likely you are to pick it next. Points that are close to existing centres are unlikely to be chosen as a new centre (and you don't need them anyway, since they're already well-covered).
Pick the first centre randomly. For each subsequent centre, roll a weighted die where the weight of each point is its (squared) distance from the nearest centre already chosen. So points far from existing centres are very likely to be picked, and points already near a centre are very unlikely. Then run normal k-Means.
The slides write $d_j = \min_\ell \|x_j - \mu_\ell\|_2^2$ (so $d_j$ is already the squared distance) and then say "sample proportional to $d_j^2$." Most textbooks use a different convention: they let $D(x_j) = \min_\ell \|x_j - \mu_\ell\|_2$ be the un-squared distance, and sample proportional to $D(x_j)^2$. Either way, the actual sampling weight is the squared Euclidean distance to the nearest existing centre — that's the "$D^2$ sampling" rule. Don't get tripped up by which variable name is squared; the effective weight on the slide is "squared Euclidean distance" (read the slide carefully and use what's literally written for the exam).
It can be shown that the expected cost of the k-Means++ initialization (just the seeds — before running any Lloyd iterations) is within a factor of $O(\log k)$ of the optimal k-Means cost. After Lloyd refines the seeds, the result is usually much better in practice. Contrast this with random initialization, which has no approximation guarantee at all (it can be arbitrarily bad, as the rectangle example showed).
Intuitively, k-Means++ might still occasionally miss a small far-away cluster early in the seeding (because the first centre is uniform random, and small clusters are unlikely to be hit). Once you've missed it, you might or might not pick it later — and the probability you keep missing accumulates. The full proof (Arthur–Vassilvitskii, 2007) is a careful inductive argument showing that with high probability the seeds cover all the "important" clusters but with a multiplicative slack that grows like $\log k$.
For the exam: just remember that k-Means++ gives an expected $O(\log k)$-approximation, whereas plain random initialization gives no guarantee.
Suppose $x_1 = 0$, $x_2 = 1$, $x_3 = 10$ on the real line, and the first centre is $\mu_1 = x_1 = 0$.
Compute distances: $d_1 = 0$, $d_2 = (1-0)^2 = 1$, $d_3 = (10-0)^2 = 100$.
Sampling weights (using slide notation $d_j^2$): $w_1 = 0$, $w_2 = 1$, $w_3 = 10000$.
Probabilities: $\Pr(x_1) = 0$, $\Pr(x_2) = 1/10001 \approx 0.0001$, $\Pr(x_3) = 10000/10001 \approx 0.9999$.
So the second centre is almost certainly $x_3 = 10$, which is precisely the well-separated cluster we want. Compare with uniform random sampling, where each of $x_2$ and $x_3$ would be equally likely.
Same partitioning problem, but swap the squared Euclidean distance for the $L_1$ (Manhattan) distance, and swap the mean for the median. The change is small in symbols and large in behaviour: k-Median is much less sensitive to outliers, because the median doesn't drift toward extreme points the way the mean does.
The mean is fragile. Move one data point a thousand units to the right and the mean of the cluster moves with it. The median, by contrast, is unfazed — it only cares about ordering, not magnitude. So when your data has outliers (sensor errors, mistyped values, occasional extreme readings), k-Median is often more robust than k-Means.
Given $n$ points $x_1, \ldots, x_n$ and an integer $k$, partition them into $S_1, \ldots, S_k$ minimising $$c(S_1, S_2, \ldots, S_k) \;=\; \sum_{i=1}^{n} \min_{j \in [k]} \bigl\{\, \|x_i - \mu_j\|_1 \,\bigr\},$$ where each centre $\mu_j$ is the per-dimension median of the points assigned to it: $$\mu_j \;=\; \mathrm{median}\bigl(x_\ell \;\big|\; \ell \in S_j \bigr),$$ and the median of a set of vectors is defined as the vector whose $i$-th entry is the median of the $i$-th entries of the input vectors.
Instead of summing squared straight-line distances, sum Manhattan distances (walks along the grid). Instead of using the average as a cluster centre, use the median per coordinate. Both changes pull in the same direction: less weight on outliers.
Compute the median of three 2-D vectors: $$\mathrm{median}\left(\begin{pmatrix}1\\1\end{pmatrix}, \begin{pmatrix}2\\3\end{pmatrix}, \begin{pmatrix}2\\0\end{pmatrix}\right) \;=\; \begin{pmatrix} \mathrm{median}(1,2,2) \\ \mathrm{median}(1,3,0)\end{pmatrix} \;=\; \begin{pmatrix} 2 \\ 1 \end{pmatrix}.$$
Notice that the resulting vector $(2,1)$ is not one of the input points. The per-dimension median combines a $1$st-coordinate from one input and a $2$nd-coordinate from another. This distinguishes k-Median from k-medoid, which is a related method that does require centres to be actual data points.
k-Median uses the per-dimension median — the centre can be a "synthetic" point that doesn't appear in the data. k-medoid requires each centre to be an actual data point. Don't confuse them.
| k-Means | k-Median | |
|---|---|---|
| Cost | $\sum_i \min_j \|x_i - \mu_j\|_2^2$ | $\sum_i \min_j \|x_i - \mu_j\|_1$ |
| Distance metric | Squared Euclidean ($L_2^2$) | Manhattan ($L_1$) |
| Centre | Mean (centroid) | Per-dimension median |
| Geometry | Straight-line distance | Grid / taxicab distance |
| Outlier sensitivity | High (mean drifts) | Low (median is robust) |
For a single dimension and a single cluster $\{a_1, \ldots, a_m\}$, the function $f(\mu) = \sum_j |a_j - \mu|$ is piecewise linear with knots at each $a_j$. Its subgradient is "the number of $a_j$ above $\mu$ minus the number below". This is zero exactly when half are above and half are below — i.e., at the median. In higher dimensions, $L_1$ is separable across coordinates, so the per-coordinate median is the closed-form optimum.
Compare with $L_2^2$, where the analogous calculation gives the mean. The cost function determines the centre type — squared distance gives mean, absolute distance gives median.
k-Median. The mean is dragged toward outliers because squared distance heavily penalises extreme values. The median is unaffected by the magnitude of the most extreme points — it only depends on their ordinal position. So a single sensor reading of $10^6$ shifts the mean dramatically but leaves the median essentially unchanged.
k-Means and k-Median both assume that clusters are blob-shaped around a central point. What if your data forms an "S" shape, or a ring, or two interlocking spirals? Centroid methods fail. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) ignores centroids entirely and defines clusters as connected regions of dense points.
The intuition is geographical. Think of points as villages on a map. A cluster is a region where villages are close together. A point with many neighbours within a short distance is part of a dense region — a "core" point. Two core points are in the same cluster if you can walk from one to the other through a chain of dense neighbourhoods. Lonely points in sparse regions don't get a cluster — they're flagged as noise. The number of clusters is not chosen in advance; it emerges from the data and the chosen density threshold.
Pick the $k \in \{2, 3, \ldots, n-1\}$ that maximises the ratio $$\frac{\text{cost}_{k-1}}{\text{cost}_k},$$ where $\text{cost}_j$ is the cost of the best $j$-clustering you can find. The values $k = 1$ and $k = n$ are ruled out because they give trivial clusterings (everything in one cluster; or every point its own cluster).
Find the value of $k$ where adding one more cluster gave the biggest bang for the buck — i.e., where the cost drop $\text{cost}_{k-1} \to \text{cost}_k$ is largest relative to $\text{cost}_k$. That's the bend in the curve.
Suppose the costs are: $\text{cost}_1 = 200$, $\text{cost}_2 = 100$, $\text{cost}_3 = 50$, $\text{cost}_4 = 30$, $\text{cost}_5 = 28$, $\text{cost}_6 = 27$.
Compute ratios for $k \in \{2, 3, 4, 5\}$ (excluding $k=1$ and $k=n$):
The largest ratio occurs at $k=2$ and $k=3$ (tied at 2.0). In practice you'd pick the larger one, $k=3$, since it provides more structure for the same marginal gain. Note the sharp drop in the ratio at $k=5$ — that's the elbow.
$k=1$ puts every point in the same cluster — no structure at all. $k=n$ puts every point in its own cluster — also no structure, and cost is zero, so the ratio $\text{cost}_{n-1}/\text{cost}_n$ would be infinity, dominating any meaningful comparison. Both are degenerate.
"Flat" clustering means the algorithm gives you $k$ groups with no notion of structure among the groups — no parents, no children, just a list. That's fine when the world really is flat. But often it isn't.
Consider four headlines:
Cluster them into three categories. Two reasonable answers:
Option A: CS / Physics / Sports / Sports — gives three categories (CS, Physics, Sports), but lumps cycling and cricket together.
Option B: Science / Science / Cycling / Cricket — also three categories, but now "doping" (cycling) and "cricket rules" are separate, and AlphaBingo + black holes get merged into "Science."
Both are reasonable. Neither captures the natural nested structure: Science (CS, Physics) and Sports (Cycling, Cricket).
The flat $k=3$ constraint forces a single level of granularity. There's no way for a flat algorithm to express "these two clusters belong to a meta-cluster" or "this cluster has sub-clusters." That's the motivation for hierarchical clustering.
A hierarchical clustering doesn't commit to a single partition. It produces a tree: at the root, everything is one giant cluster; at the leaves, every point is alone. Each horizontal slice through the tree gives you a flat clustering at some level of granularity — you pick the slice you want at the end, not the start.
Hierarchical clustering recursively partitions data at increasingly finer granularity, represented as a tree. The leaves are individual data points; internal nodes are merged clusters. The example tree from the slides looks like this:
If you cut between root and the Sci/Sports level you get $k=1$. Cut just below the Sci/Sports level: $k=2$ (Sci, Sports). Cut at the bottom: $k=4$ (CS, Phy, Cric, Cycl). The tree encodes all these clusterings at once.
Agglomerative (bottom-up):
Divisive (top-down):
Picture the data as a graph: each point is a node, edges carry weights that measure either similarity (high = should be in the same cluster) or dissimilarity (high = should be apart). The slides use similarity throughout this lecture, so a higher edge weight means "these should be merged."
Once clusters have more than one point, "similarity between two clusters" needs definition. The slides give three standard answers, illustrated on $C_1 = \{a, b\}$ and $C_2 = \{c, d\}$ with edges $a\!-\!c = 2$, $a\!-\!d = 1$, $b\!-\!c = 5$, $b\!-\!d = 3$:
With similarity, single linkage = max (you merge clusters that share at least one very similar pair). With distance, single linkage = min (you merge clusters that have at least one very close pair). Same concept, opposite arithmetic. Always check whether your edges are similarities or distances before you compute. The slides use similarities.
Suppose $C_1 = \{p, q\}$ and $C_2 = \{r, s\}$ with cross-cluster similarities $p\!-\!r = 3$, $p\!-\!s = 7$, $q\!-\!r = 2$, $q\!-\!s = 6$.
Different linkages give wildly different "similarities" — and therefore wildly different clusterings if you run agglomerative with each rule. Single linkage tends to produce long, chained clusters (good for snake-shaped data, bad when noisy bridges connect distinct clumps). Complete linkage is the opposite: it prefers tight, ball-shaped clusters because one bad outlier kills the score. Average linkage is the middle ground.
The agglomerative algorithm runs in rounds. Round 0: every point is its own cluster. Each subsequent round merges the two most similar clusters (using whichever linkage rule you chose). After $n-1$ merges, everything is one cluster. The slides show 8 rounds of single-linkage on the 9-node graph above. The output is best visualised as a dendrogram — a binary tree where the height of each merge represents the similarity at which that merge happened.
A dendrogram is a record of "when did this thing get absorbed?" The earlier (lower) the merge, the more similar the merged clusters were. Cut the dendrogram with a horizontal line and read off the connected sub-trees — that's your flat clustering at that threshold.
Points $a, b, c, d$ with similarities: $a\!-\!b = 0.9$, $c\!-\!d = 0.8$, $b\!-\!c = 0.3$, $a\!-\!c = 0.2$, $a\!-\!d = 0.1$, $b\!-\!d = 0.4$.
Round 0: $\{a\}, \{b\}, \{c\}, \{d\}$. Highest similarity is $a\!-\!b = 0.9$.
Round 1: Merge $a, b$. Clusters: $\{a,b\}, \{c\}, \{d\}$. Compute single-linkage similarities:
$\{a,b\}\!-\!\{c\}$: $\max(a\!-\!c, b\!-\!c) = \max(0.2, 0.3) = 0.3$
$\{a,b\}\!-\!\{d\}$: $\max(0.1, 0.4) = 0.4$
$\{c\}\!-\!\{d\}$: $0.8$
Highest is $\{c\}\!-\!\{d\} = 0.8$.
Round 2: Merge $c, d$. Clusters: $\{a,b\}, \{c,d\}$. Cross-similarity: $\max(a\!-\!c, a\!-\!d, b\!-\!c, b\!-\!d) = \max(0.2, 0.1, 0.3, 0.4) = 0.4$.
Round 3: Merge $\{a,b\}, \{c,d\}$ at similarity $0.4$. Done.
Dendrogram: leaves $a, b$ merge at height $0.9$; $c, d$ merge at $0.8$; the two pairs merge at $0.4$.
Going top-down is harder than going bottom-up because the very first split — partitioning $n$ points into two well-separated halves — is itself a nontrivial clustering problem. Two common heuristics:
In both cases, you build the cluster-tree top-down by repeatedly splitting until each leaf is a single point (or until you stop at some granularity threshold).
Given a graph with node set $V$ and edge set $E \subset V \times V$, the sparsity of a cut $S \subseteq V$ is $$\phi(S) \;=\; \frac{E(S, \bar S)}{\min(|S|, |\bar S|)},$$ where $E(S, \bar S)$ is the sum of weights of edges crossing the cut, and $\bar S = V \setminus S$.
The sparsest cut is the partition $S^*$ that minimises $\phi$: $$\phi(S^*) \;=\; \min_{S \subset V} \phi(S).$$
Among all possible ways to split the graph into two parts, find the one where the total weight of crossing edges is small relative to the size of the smaller part. Dividing by $\min(|S|, |\bar S|)$ prevents the trivial answer of "put one lonely vertex on one side, everything else on the other" — that has few crossing edges but a tiny denominator, so the ratio is large.
Suppose the graph has 6 nodes $\{1,2,3,4,5,6\}$. Total crossing edge weights for two candidate splits:
Cut B is sparser ($\phi = 1/3 < 3$) — it isolates a balanced half with very few crossing edges. The denominator $\min(|S|, |\bar S|)$ penalises the "lonely vertex" trick that Cut A tried.
Because the cut is symmetric: removing the edges between $S$ and $\bar S$ is the same as removing them between $\bar S$ and $S$. So the criterion should be symmetric in $S$ and $\bar S$. The smaller side determines how "balanced" the cut is.
An alternative formulation uses $|S| \cdot |\bar S|$ (the "edge expansion" or "normalised cut" denominator). Both penalise lopsided splits; the $\min$ form is just one common convention.
Agglomerative is "easy" per step: you just need pairwise similarities and a rule to merge the closest pair. The bookkeeping is on the order of $O(n^2 \log n)$ for naive implementations.
Divisive is "hard" per step: each split is itself an NP-hard problem in general (sparsest cut is NP-hard exactly; bisection k-means is NP-hard exactly but Lloyd-style heuristics work). So divisive is usually slower and depends on the quality of the split heuristic.
Trade-off: agglomerative gets the fine structure right (since the very first merges are between the most similar pairs), while divisive gets the coarse structure right (the first split separates the most distinct halves). Choose based on which level of granularity matters most for your problem.
k-Means minimises sum-of-squared-Euclidean to centroids and converges fast but lands in bad local minima with bad seeds, which k-Means++ patches via $D^2$-sampling and an expected $O(\log k)$ guarantee. k-Median swaps $L_1$ and the per-dimension median for robustness to outliers. DBSCAN ditches centroids for density: a point is core if it has $\ge m$ neighbours within $\varepsilon$, clusters are connected components of cores, and isolated points are noise — handling non-convex shapes that centroid methods can't. To pick $k$, the elbow rule maximises $\text{cost}_{k-1}/\text{cost}_k$ over $k \in \{2,\ldots,n-1\}$. Flat clustering loses nested structure, so hierarchical clustering builds a tree — bottom-up (agglomerative, repeatedly merge the most-similar pair using single, average, or complete linkage) or top-down (divisive, repeatedly split via bisection k-means or sparsest cut $\phi(S) = E(S,\bar S)/\min(|S|, |\bar S|)$) — and you read off any flat clustering by cutting the dendrogram at a chosen height.