Week 09 · Unsupervised Learning · Clustering

k-Means, DBSCAN, and the Shape of Data

From four-point disasters to dendrograms — how to group data when no one tells you the labels.

KCL · MACHINE LEARNING · WEEK 9 · ~75 min read · 17 self-test questions
Part I

k-Means: the workhorse of partitional clustering

You have a pile of points and an integer $k$. You want $k$ groups such that points inside a group huddle together and points across groups spread apart. The simplest way to formalise "huddle together" is squared distance to a centroid. That's k-Means.

Clustering is an unsupervised problem. You receive $n$ points $x_1, x_2, \ldots, x_n$ in some $d$-dimensional space, but no labels — nobody tells you what the right groups are. The job is to invent the labels yourself by producing a partition that "makes sense." The hard part is deciding what "makes sense" means mathematically, and different choices give different algorithms. k-Means makes one specific choice: the cost of a clustering is the sum of squared Euclidean distances from each point to its assigned cluster's centroid.

The cost function

Definition · k-Means cost

Given $n$ points $x_1, \ldots, x_n \in \mathbb{R}^d$ and an integer $k$, partition them into sets $S_1, S_2, \ldots, S_k$ (every point in exactly one set) so as to minimise $$c(S_1, S_2, \ldots, S_k) \;=\; \sum_{i=1}^{n} \min_{j \in [k]} \bigl\{\, \|x_i - \mu_j\|_2^2 \,\bigr\},$$ where $\mu_j$ is the centroid (mid-point) of cluster $S_j$, $$\mu_j \;=\; \frac{1}{|S_j|}\sum_{\ell \in S_j} x_\ell,$$ and $\|\cdot\|_2^2$ is the squared $L_2$ norm — the squared Euclidean distance.

Note that $x_i$ and $\mu_j$ are vectors, not scalars. The notation $[k]$ is shorthand for $\{1, 2, \ldots, k\}$.

In plain English

For every point, find the cluster centre it's closest to (in squared Euclidean distance), and add up all those squared distances. That total is the cost. A good clustering is one with a small total. The centre of each cluster is just the average of the points in it.

Worked example · Centroid of a tiny cluster

Suppose $S_1 = \{x_1, x_2\}$ with $$x_1 = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \quad x_2 = \begin{pmatrix} 3 \\ 4 \end{pmatrix}.$$ Then the centroid is just the coordinate-wise average: $$\mu_1 \;=\; \tfrac{1}{2}(x_1 + x_2) \;=\; \begin{pmatrix} 2 \\ 3 \end{pmatrix}.$$ That's it — the centroid lives in the same space as the data and need not be one of the original points.

The algorithm (Lloyd's iteration)

Algorithm · k-Means (Lloyd, 1957)
  1. Pick $k$ cluster centres arbitrarily.
  2. Repeat until convergence:
    1. Assignment step. Assign each point to the cluster with the nearest mean: $$S_i \;=\; \bigl\{\, x_j \;\big|\; \|x_j - \mu_i\|_2^2 \le \|x_j - \mu_\ell\|_2^2 \text{ for all } \ell \in [k]\,\bigr\}.$$ Each point goes to exactly one cluster (break ties arbitrarily).
    2. Update step. Recalculate each cluster's centroid: $$\mu_i \;=\; \frac{1}{|S_i|}\sum_{j \in S_i} x_j.$$
In plain English

Pick $k$ pretend centres. Tag every point with its nearest centre. Move each centre to the average of its tagged points. Repeat. When nothing moves, stop. That's the entire algorithm.

1. Random init × × 2. Assignment × × 3. Update × ×
One iteration of Lloyd's algorithm: assign every point to the nearest centre, then move each centre to the mean of its assigned points.
Exam trap · Squared, not square-rooted

The k-Means cost uses $\|x_i - \mu_j\|_2^2$ — the squared Euclidean distance. Don't take a square root. The reason this matters: minimising sum of squared distances has the centroid (mean) as the unique optimal centre. If you used un-squared $L_2$, the optimal centre would be the geometric median (a much harder optimisation). Squaring is what makes the closed-form centroid update $\mu_i = \tfrac{1}{|S_i|}\sum x_j$ correct.

Compute the k-Means cost of a given partition.

Suppose $x_1 = (0,0)$, $x_2 = (2,0)$, $x_3 = (10,0)$, $x_4 = (12,0)$. We partition $S_1 = \{x_1,x_2\}$ and $S_2 = \{x_3,x_4\}$.

Centroids: $\mu_1 = (1,0)$, $\mu_2 = (11,0)$.

Squared distances: $\|x_1 - \mu_1\|^2 = 1$, $\|x_2 - \mu_1\|^2 = 1$, $\|x_3 - \mu_2\|^2 = 1$, $\|x_4 - \mu_2\|^2 = 1$.

Cost $= 1+1+1+1 = 4$. (Compare with the alternative partition $\{x_1, x_4\}, \{x_2, x_3\}$ which gives a cost of $72 + 32 = 104$ — much worse.)

Why does the algorithm converge?

Each step of Lloyd's algorithm never increases the cost: the assignment step minimises cost over assignments while holding centres fixed, and the update step minimises cost over centre placements while holding assignments fixed (the mean is the unique minimiser of sum-of-squared-distances for a finite set of points). Because there are only finitely many possible assignments (and any repeat of an assignment means we've converged), the algorithm terminates.

However — termination is not the same as finding the global optimum. Lloyd's iteration only guarantees a local minimum. And it can be a very bad local minimum.

The bad-initialization disaster

Consider four points arranged at the corners of a wide rectangle. The optimal 2-clustering is clearly to split left from right (the rectangle is wider than it is tall). But suppose the random initialization places the two centres at the top-centre and bottom-centre — splitting top from bottom.

Optimal split (vertical) × × Terrible split (horizontal) — stuck! × ×
The four-corner rectangle. With the bad initialization on the right, every assignment-update cycle keeps the centres on the horizontal midline — the algorithm never finds the vertical split.
Exam-worthy claim from the slides

"We can see that if we start with sub-optimal clusters, [k-Means] never changes them. This can be made arbitrarily bad by increasing the width of the rectangle." Stretch the rectangle to be 100× wider and the bad solution stays equally trapped — the ratio of bad-to-optimal cost grows without bound. So k-Means has no worst-case approximation guarantee with arbitrary initialization.

Why doesn't Lloyd's iteration escape the bad initialization in the rectangle example?

Because the configuration is symmetric. With centres on the horizontal midline (one at the top-middle, one at the bottom-middle), every point's closest centre is the one directly above or below it. So the assignment step gives two horizontal clusters: top-pair and bottom-pair. The centroid of the top-pair is at the top-midline, and of the bottom-pair at the bottom-midline. Nothing moves vertically off the midline. The algorithm is at a (local) fixed point.

Mathematically, every Lloyd's iteration is locally optimal in two senses — assignments are best given centres, and centres are best given assignments — but neither of those local optimalities is enough to find the global optimum.

Show that the mean is the unique minimiser of sum-of-squared-distances for a single cluster.

Fix a cluster $S = \{x_1, \ldots, x_m\}$ and consider the cost function $f(\mu) = \sum_{j=1}^{m} \|x_j - \mu\|_2^2$. Expand: $$f(\mu) = \sum_j (x_j - \mu)^\top (x_j - \mu) = \sum_j \|x_j\|^2 - 2\mu^\top \sum_j x_j + m\|\mu\|^2.$$ Take the gradient and set to zero: $$\nabla f(\mu) = -2\sum_j x_j + 2m\mu = 0 \;\;\Rightarrow\;\; \mu = \tfrac{1}{m}\sum_j x_j.$$ The Hessian is $2m\,I$ (positive definite), so this critical point is the unique minimiser. That's why the centroid update is correct — it's the closed-form optimum.

· · ·
Part II

k-Means++: smarter seeding

If k-Means' problem is the random initialization, the fix is to be less random. k-Means++ seeds the centres deliberately — each new centre is sampled with probability proportional to its squared distance from the centres already chosen. The payoff: a worst-case approximation factor of $O(\log k)$.

The idea is irresistible: if you're picking $k$ centres and you've already placed one in cluster A, the smart move is to put your next centre far from A — most likely in cluster B. Distance-weighted random sampling does exactly this: the further a point is from the centres you already have, the more likely you are to pick it next. Points that are close to existing centres are unlikely to be chosen as a new centre (and you don't need them anyway, since they're already well-covered).

Algorithm · k-Means++ (Arthur & Vassilvitskii, 2007)
  1. Set the first centre to be one of the input points, chosen uniformly at random: $$\mu_1 = \text{uniform}(x_1, x_2, \ldots, x_n).$$
  2. For cluster $i = 2$ to $k$:
    1. For each point $x_j$, compute the distance to the nearest existing centre: $$d_j \;=\; \min_\ell \bigl\{\, \|x_j - \mu_\ell\|_2^2 \,\bigr\}.$$
    2. Open a new centre at a point with probability proportional to $d_j^2$: $$\Pr(\text{new centre at } x_j) \;=\; \frac{d_j^2}{\sum_\ell d_\ell^2}.$$
  3. Continue with the usual k-Means iterations starting from these centres.
In plain English

Pick the first centre randomly. For each subsequent centre, roll a weighted die where the weight of each point is its (squared) distance from the nearest centre already chosen. So points far from existing centres are very likely to be picked, and points already near a centre are very unlikely. Then run normal k-Means.

Notation note · The "$D^2$ sampling" convention

The slides write $d_j = \min_\ell \|x_j - \mu_\ell\|_2^2$ (so $d_j$ is already the squared distance) and then say "sample proportional to $d_j^2$." Most textbooks use a different convention: they let $D(x_j) = \min_\ell \|x_j - \mu_\ell\|_2$ be the un-squared distance, and sample proportional to $D(x_j)^2$. Either way, the actual sampling weight is the squared Euclidean distance to the nearest existing centre — that's the "$D^2$ sampling" rule. Don't get tripped up by which variable name is squared; the effective weight on the slide is "squared Euclidean distance" (read the slide carefully and use what's literally written for the exam).

Approximation guarantee

It can be shown that the expected cost of the k-Means++ initialization (just the seeds — before running any Lloyd iterations) is within a factor of $O(\log k)$ of the optimal k-Means cost. After Lloyd refines the seeds, the result is usually much better in practice. Contrast this with random initialization, which has no approximation guarantee at all (it can be arbitrarily bad, as the rectangle example showed).

Why is the approximation factor $O(\log k)$ and not just $O(1)$?

Intuitively, k-Means++ might still occasionally miss a small far-away cluster early in the seeding (because the first centre is uniform random, and small clusters are unlikely to be hit). Once you've missed it, you might or might not pick it later — and the probability you keep missing accumulates. The full proof (Arthur–Vassilvitskii, 2007) is a careful inductive argument showing that with high probability the seeds cover all the "important" clusters but with a multiplicative slack that grows like $\log k$.

For the exam: just remember that k-Means++ gives an expected $O(\log k)$-approximation, whereas plain random initialization gives no guarantee.

Run one step of k-Means++ on a tiny example.

Suppose $x_1 = 0$, $x_2 = 1$, $x_3 = 10$ on the real line, and the first centre is $\mu_1 = x_1 = 0$.

Compute distances: $d_1 = 0$, $d_2 = (1-0)^2 = 1$, $d_3 = (10-0)^2 = 100$.

Sampling weights (using slide notation $d_j^2$): $w_1 = 0$, $w_2 = 1$, $w_3 = 10000$.

Probabilities: $\Pr(x_1) = 0$, $\Pr(x_2) = 1/10001 \approx 0.0001$, $\Pr(x_3) = 10000/10001 \approx 0.9999$.

So the second centre is almost certainly $x_3 = 10$, which is precisely the well-separated cluster we want. Compare with uniform random sampling, where each of $x_2$ and $x_3$ would be equally likely.

Sampling weight ∝ squared distance from nearest existing centre × μ₁ low high
Each point's probability of being chosen as the next centre grows quadratically with its distance from the centres already placed.
· · ·
Part III

k-Median: when outliers ruin everything

Same partitioning problem, but swap the squared Euclidean distance for the $L_1$ (Manhattan) distance, and swap the mean for the median. The change is small in symbols and large in behaviour: k-Median is much less sensitive to outliers, because the median doesn't drift toward extreme points the way the mean does.

The mean is fragile. Move one data point a thousand units to the right and the mean of the cluster moves with it. The median, by contrast, is unfazed — it only cares about ordering, not magnitude. So when your data has outliers (sensor errors, mistyped values, occasional extreme readings), k-Median is often more robust than k-Means.

The k-Median cost function

Definition · k-Median

Given $n$ points $x_1, \ldots, x_n$ and an integer $k$, partition them into $S_1, \ldots, S_k$ minimising $$c(S_1, S_2, \ldots, S_k) \;=\; \sum_{i=1}^{n} \min_{j \in [k]} \bigl\{\, \|x_i - \mu_j\|_1 \,\bigr\},$$ where each centre $\mu_j$ is the per-dimension median of the points assigned to it: $$\mu_j \;=\; \mathrm{median}\bigl(x_\ell \;\big|\; \ell \in S_j \bigr),$$ and the median of a set of vectors is defined as the vector whose $i$-th entry is the median of the $i$-th entries of the input vectors.

In plain English

Instead of summing squared straight-line distances, sum Manhattan distances (walks along the grid). Instead of using the average as a cluster centre, use the median per coordinate. Both changes pull in the same direction: less weight on outliers.

Worked example · Vector median

Compute the median of three 2-D vectors: $$\mathrm{median}\left(\begin{pmatrix}1\\1\end{pmatrix}, \begin{pmatrix}2\\3\end{pmatrix}, \begin{pmatrix}2\\0\end{pmatrix}\right) \;=\; \begin{pmatrix} \mathrm{median}(1,2,2) \\ \mathrm{median}(1,3,0)\end{pmatrix} \;=\; \begin{pmatrix} 2 \\ 1 \end{pmatrix}.$$

Notice that the resulting vector $(2,1)$ is not one of the input points. The per-dimension median combines a $1$st-coordinate from one input and a $2$nd-coordinate from another. This distinguishes k-Median from k-medoid, which is a related method that does require centres to be actual data points.

Exam trap · k-Median vs k-medoid

k-Median uses the per-dimension median — the centre can be a "synthetic" point that doesn't appear in the data. k-medoid requires each centre to be an actual data point. Don't confuse them.

Side-by-side comparison

k-Meansk-Median
Cost $\sum_i \min_j \|x_i - \mu_j\|_2^2$ $\sum_i \min_j \|x_i - \mu_j\|_1$
Distance metric Squared Euclidean ($L_2^2$) Manhattan ($L_1$)
Centre Mean (centroid) Per-dimension median
Geometry Straight-line distance Grid / taxicab distance
Outlier sensitivity High (mean drifts) Low (median is robust)
Euclidean (≈10) Manhattan (12) Manhattan (12) Manhattan (12) Three different Manhattan paths, all length 12. The Euclidean distance is unique (≈10).
k-Means minimises Euclidean / geometric distance (one straight diagonal line). k-Median minimises Manhattan distance — any grid-walking path of the same length is equivalent. Source: Wikipedia.
Why is the median the right centre for $L_1$ cost?

For a single dimension and a single cluster $\{a_1, \ldots, a_m\}$, the function $f(\mu) = \sum_j |a_j - \mu|$ is piecewise linear with knots at each $a_j$. Its subgradient is "the number of $a_j$ above $\mu$ minus the number below". This is zero exactly when half are above and half are below — i.e., at the median. In higher dimensions, $L_1$ is separable across coordinates, so the per-coordinate median is the closed-form optimum.

Compare with $L_2^2$, where the analogous calculation gives the mean. The cost function determines the centre type — squared distance gives mean, absolute distance gives median.

Which method would you use if your data has occasional sensor errors that produce extreme outliers?

k-Median. The mean is dragged toward outliers because squared distance heavily penalises extreme values. The median is unaffected by the magnitude of the most extreme points — it only depends on their ordinal position. So a single sensor reading of $10^6$ shifts the mean dramatically but leaves the median essentially unchanged.

· · ·
Part IV

DBSCAN: density, not centroids

k-Means and k-Median both assume that clusters are blob-shaped around a central point. What if your data forms an "S" shape, or a ring, or two interlocking spirals? Centroid methods fail. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) ignores centroids entirely and defines clusters as connected regions of dense points.

The intuition is geographical. Think of points as villages on a map. A cluster is a region where villages are close together. A point with many neighbours within a short distance is part of a dense region — a "core" point. Two core points are in the same cluster if you can walk from one to the other through a chain of dense neighbourhoods. Lonely points in sparse regions don't get a cluster — they're flagged as noise. The number of clusters is not chosen in advance; it emerges from the data and the chosen density threshold.

Algorithm · DBSCAN with parameters $m$ and $\varepsilon$
  1. For every point, find the points in its $\varepsilon$-neighborhood. Identify the core points: those with more than $m$ neighbours within distance $\varepsilon$.
  2. Find the connected components of core points on the neighbour graph (ignoring all non-core points). Each connected component is a cluster.
  3. Assign each non-core point to a nearby cluster that's useless. As you increase $k$, the cost monotonically decreases, dramatically at first, then more slowly. The "elbow" is the value of $k$ where the rate of decrease slows down sharply. That's typically where extra clusters stop buying you much.

    Number of clusters $k$ k-Means cost 2 3 4 5 6 7 8 9 elbow at $k = 5$ — extra clusters stop helping
    The elbow method: plot cost as a function of $k$ and pick the value at the bend. Source: towardsdatascience.com.
    Definition · Elbow rule (slide formulation)

    Pick the $k \in \{2, 3, \ldots, n-1\}$ that maximises the ratio $$\frac{\text{cost}_{k-1}}{\text{cost}_k},$$ where $\text{cost}_j$ is the cost of the best $j$-clustering you can find. The values $k = 1$ and $k = n$ are ruled out because they give trivial clusterings (everything in one cluster; or every point its own cluster).

    In plain English

    Find the value of $k$ where adding one more cluster gave the biggest bang for the buck — i.e., where the cost drop $\text{cost}_{k-1} \to \text{cost}_k$ is largest relative to $\text{cost}_k$. That's the bend in the curve.

    Apply the elbow rule to a concrete cost sequence.

    Suppose the costs are: $\text{cost}_1 = 200$, $\text{cost}_2 = 100$, $\text{cost}_3 = 50$, $\text{cost}_4 = 30$, $\text{cost}_5 = 28$, $\text{cost}_6 = 27$.

    Compute ratios for $k \in \{2, 3, 4, 5\}$ (excluding $k=1$ and $k=n$):

    • $k=2$: $200/100 = 2.0$
    • $k=3$: $100/50 = 2.0$
    • $k=4$: $50/30 \approx 1.67$
    • $k=5$: $30/28 \approx 1.07$

    The largest ratio occurs at $k=2$ and $k=3$ (tied at 2.0). In practice you'd pick the larger one, $k=3$, since it provides more structure for the same marginal gain. Note the sharp drop in the ratio at $k=5$ — that's the elbow.

    Why are $k=1$ and $k=n$ excluded?

    $k=1$ puts every point in the same cluster — no structure at all. $k=n$ puts every point in its own cluster — also no structure, and cost is zero, so the ratio $\text{cost}_{n-1}/\text{cost}_n$ would be infinity, dominating any meaningful comparison. Both are degenerate.

    The problem with flat clustering

    "Flat" clustering means the algorithm gives you $k$ groups with no notion of structure among the groups — no parents, no children, just a list. That's fine when the world really is flat. But often it isn't.

    The slide's news-headlines example

    Consider four headlines:

    • "Deepmind's AlphaBingo wins world championship"
    • "Black holes swallow stars whole according to new study"
    • "Someone didn't dope"
    • "Researcher finally figured out the rules of cricket"

    Cluster them into three categories. Two reasonable answers:

    Option A: CS / Physics / Sports / Sports — gives three categories (CS, Physics, Sports), but lumps cycling and cricket together.

    Option B: Science / Science / Cycling / Cricket — also three categories, but now "doping" (cycling) and "cricket rules" are separate, and AlphaBingo + black holes get merged into "Science."

    Both are reasonable. Neither captures the natural nested structure: Science (CS, Physics) and Sports (Cycling, Cricket).

    The flat $k=3$ constraint forces a single level of granularity. There's no way for a flat algorithm to express "these two clusters belong to a meta-cluster" or "this cluster has sub-clusters." That's the motivation for hierarchical clustering.

· · ·
Part VI

Hierarchical clustering: dendrograms, linkages, and cuts

A hierarchical clustering doesn't commit to a single partition. It produces a tree: at the root, everything is one giant cluster; at the leaves, every point is alone. Each horizontal slice through the tree gives you a flat clustering at some level of granularity — you pick the slice you want at the end, not the start.

Hierarchical clustering recursively partitions data at increasingly finer granularity, represented as a tree. The leaves are individual data points; internal nodes are merged clusters. The example tree from the slides looks like this:

News Sci Sports CS Phy Cric Cycl
The news hierarchy: the root is "News", which splits into "Sci" and "Sports", which each split into specific topics. Any horizontal cut through this tree yields a flat clustering at a chosen level.

If you cut between root and the Sci/Sports level you get $k=1$. Cut just below the Sci/Sports level: $k=2$ (Sci, Sports). Cut at the bottom: $k=4$ (CS, Phy, Cric, Cycl). The tree encodes all these clusterings at once.

Two strategies: agglomerative vs divisive

Definition · Two ways to build the tree

Agglomerative (bottom-up):

  • Initially place each data point in its own cluster.
  • Repeatedly merge the two most similar clusters until one cluster remains.

Divisive (top-down):

  • Start with one cluster containing all points.
  • Split using bisection k-means (or sparsest cut).
  • Recurse on each part.

Agglomerative clustering: the similarity graph setup

Picture the data as a graph: each point is a node, edges carry weights that measure either similarity (high = should be in the same cluster) or dissimilarity (high = should be apart). The slides use similarity throughout this lecture, so a higher edge weight means "these should be merged."

2 3 1 2 5 4 4 3 1 5 1 1 1 1 a b c d e f g h i
The similarity graph from the slides. Edge weights = similarity (higher = more similar). High-weight edges (4, 5) are within natural clusters; low-weight edges (1) connect different clusters.

Linkage criteria — how to measure inter-cluster similarity

Once clusters have more than one point, "similarity between two clusters" needs definition. The slides give three standard answers, illustrated on $C_1 = \{a, b\}$ and $C_2 = \{c, d\}$ with edges $a\!-\!c = 2$, $a\!-\!d = 1$, $b\!-\!c = 5$, $b\!-\!d = 3$:

Definition · Three linkage criteria (with similarities)
  • Single linkage = maximum similarity between any pair of cross-cluster members.
    For $C_1 = \{a,b\}, C_2 = \{c,d\}$: $\max(2, 1, 5, 3) = \mathbf{5}$.
  • Average linkage = mean similarity over all cross-cluster pairs.
    $(2 + 1 + 5 + 3)/4 = \mathbf{2.75}$.
  • Complete linkage = minimum similarity between any pair of cross-cluster members.
    $\min(2, 1, 5, 3) = \mathbf{1}$.
Exam trap · "Single" and "complete" flip when you switch from similarity to distance

With similarity, single linkage = max (you merge clusters that share at least one very similar pair). With distance, single linkage = min (you merge clusters that have at least one very close pair). Same concept, opposite arithmetic. Always check whether your edges are similarities or distances before you compute. The slides use similarities.

Single (max = 5) 5 b a c d 2 1 3 1 Average (mean = 2.75) b a c d 5 3 2 1 Complete (min = 1) b a c d 2 1 3 5
The same two clusters $\{a,b\}$, $\{c,d\}$ measured three ways. Single = best pair, Average = mean, Complete = worst pair. Wait — there's a slight notation subtlety: the slide says complete linkage = 1, but the slide-drawn edge weights are $b\!-\!c=2, a\!-\!c=2, a\!-\!d=1, b\!-\!d=3$. The minimum cross-pair similarity is $1$, matching the slide.
Compute single, average, and complete linkage on a worked example.

Suppose $C_1 = \{p, q\}$ and $C_2 = \{r, s\}$ with cross-cluster similarities $p\!-\!r = 3$, $p\!-\!s = 7$, $q\!-\!r = 2$, $q\!-\!s = 6$.

  • Single linkage = $\max(3, 7, 2, 6) = 7$.
  • Average linkage = $(3+7+2+6)/4 = 18/4 = 4.5$.
  • Complete linkage = $\min(3, 7, 2, 6) = 2$.

Different linkages give wildly different "similarities" — and therefore wildly different clusterings if you run agglomerative with each rule. Single linkage tends to produce long, chained clusters (good for snake-shaped data, bad when noisy bridges connect distinct clumps). Complete linkage is the opposite: it prefers tight, ball-shaped clusters because one bad outlier kills the score. Average linkage is the middle ground.

Running agglomerative clustering — the dendrogram

The agglomerative algorithm runs in rounds. Round 0: every point is its own cluster. Each subsequent round merges the two most similar clusters (using whichever linkage rule you chose). After $n-1$ merges, everything is one cluster. The slides show 8 rounds of single-linkage on the 9-node graph above. The output is best visualised as a dendrogram — a binary tree where the height of each merge represents the similarity at which that merge happened.

a c e b d f g h i
The dendrogram from the slides. Reading bottom-up: $c$ and $e$ merge first; $a$ joins to form $\{a,c,e\}$; $b$ joins to form $\{a,b,c,e\}$; meanwhile $f\!-\!g$ and $h\!-\!i$ merge separately; the two halves combine; finally $d$ (the most distant leaf) joins last to form the root.
In plain English

A dendrogram is a record of "when did this thing get absorbed?" The earlier (lower) the merge, the more similar the merged clusters were. Cut the dendrogram with a horizontal line and read off the connected sub-trees — that's your flat clustering at that threshold.

Run single-linkage agglomerative on a tiny similarity matrix.

Points $a, b, c, d$ with similarities: $a\!-\!b = 0.9$, $c\!-\!d = 0.8$, $b\!-\!c = 0.3$, $a\!-\!c = 0.2$, $a\!-\!d = 0.1$, $b\!-\!d = 0.4$.

Round 0: $\{a\}, \{b\}, \{c\}, \{d\}$. Highest similarity is $a\!-\!b = 0.9$.

Round 1: Merge $a, b$. Clusters: $\{a,b\}, \{c\}, \{d\}$. Compute single-linkage similarities:
$\{a,b\}\!-\!\{c\}$: $\max(a\!-\!c, b\!-\!c) = \max(0.2, 0.3) = 0.3$
$\{a,b\}\!-\!\{d\}$: $\max(0.1, 0.4) = 0.4$
$\{c\}\!-\!\{d\}$: $0.8$
Highest is $\{c\}\!-\!\{d\} = 0.8$.

Round 2: Merge $c, d$. Clusters: $\{a,b\}, \{c,d\}$. Cross-similarity: $\max(a\!-\!c, a\!-\!d, b\!-\!c, b\!-\!d) = \max(0.2, 0.1, 0.3, 0.4) = 0.4$.

Round 3: Merge $\{a,b\}, \{c,d\}$ at similarity $0.4$. Done.

Dendrogram: leaves $a, b$ merge at height $0.9$; $c, d$ merge at $0.8$; the two pairs merge at $0.4$.

Divisive heuristics

Going top-down is harder than going bottom-up because the very first split — partitioning $n$ points into two well-separated halves — is itself a nontrivial clustering problem. Two common heuristics:

In both cases, you build the cluster-tree top-down by repeatedly splitting until each leaf is a single point (or until you stop at some granularity threshold).

Sparsest cut

Definition · Sparsest cut

Given a graph with node set $V$ and edge set $E \subset V \times V$, the sparsity of a cut $S \subseteq V$ is $$\phi(S) \;=\; \frac{E(S, \bar S)}{\min(|S|, |\bar S|)},$$ where $E(S, \bar S)$ is the sum of weights of edges crossing the cut, and $\bar S = V \setminus S$.

The sparsest cut is the partition $S^*$ that minimises $\phi$: $$\phi(S^*) \;=\; \min_{S \subset V} \phi(S).$$

In plain English

Among all possible ways to split the graph into two parts, find the one where the total weight of crossing edges is small relative to the size of the smaller part. Dividing by $\min(|S|, |\bar S|)$ prevents the trivial answer of "put one lonely vertex on one side, everything else on the other" — that has few crossing edges but a tiny denominator, so the ratio is large.

cut S 1 2 3 4 5 6 7 8 weight 1 $|S| = 4, \; |\bar S| = 4, \; E(S, \bar S) = 1, \; \phi(S) = 1/4$.
Sparsest cut intuition. Two dense communities connected by a single light edge. The natural split has very few crossing edges relative to the part sizes, giving a small $\phi$.
Compute $\phi(S)$ for two candidate cuts.

Suppose the graph has 6 nodes $\{1,2,3,4,5,6\}$. Total crossing edge weights for two candidate splits:

  • Cut A: $S_A = \{1\}$. Then $|S_A| = 1$, $|\bar S_A| = 5$. Suppose node 1 has 3 edges out, each weight 1. Then $E(S_A, \bar S_A) = 3$.
    $\phi(S_A) = 3 / \min(1, 5) = 3/1 = 3$.
  • Cut B: $S_B = \{1, 2, 3\}$. Then $|S_B| = 3$, $|\bar S_B| = 3$. Suppose only 1 edge of weight 1 crosses. Then $E(S_B, \bar S_B) = 1$.
    $\phi(S_B) = 1 / \min(3, 3) = 1/3 \approx 0.33$.

Cut B is sparser ($\phi = 1/3 < 3$) — it isolates a balanced half with very few crossing edges. The denominator $\min(|S|, |\bar S|)$ penalises the "lonely vertex" trick that Cut A tried.

Why does the sparsest cut definition use $\min(|S|, |\bar S|)$ instead of just $|S|$?

Because the cut is symmetric: removing the edges between $S$ and $\bar S$ is the same as removing them between $\bar S$ and $S$. So the criterion should be symmetric in $S$ and $\bar S$. The smaller side determines how "balanced" the cut is.

An alternative formulation uses $|S| \cdot |\bar S|$ (the "edge expansion" or "normalised cut" denominator). Both penalise lopsided splits; the $\min$ form is just one common convention.

How do agglomerative and divisive clustering differ in expense?

Agglomerative is "easy" per step: you just need pairwise similarities and a rule to merge the closest pair. The bookkeeping is on the order of $O(n^2 \log n)$ for naive implementations.

Divisive is "hard" per step: each split is itself an NP-hard problem in general (sparsest cut is NP-hard exactly; bisection k-means is NP-hard exactly but Lloyd-style heuristics work). So divisive is usually slower and depends on the quality of the split heuristic.

Trade-off: agglomerative gets the fine structure right (since the very first merges are between the most similar pairs), while divisive gets the coarse structure right (the first split separates the most distinct halves). Choose based on which level of granularity matters most for your problem.

Week 09 · One Breath Summary

Clustering, distilled

k-Means minimises sum-of-squared-Euclidean to centroids and converges fast but lands in bad local minima with bad seeds, which k-Means++ patches via $D^2$-sampling and an expected $O(\log k)$ guarantee. k-Median swaps $L_1$ and the per-dimension median for robustness to outliers. DBSCAN ditches centroids for density: a point is core if it has $\ge m$ neighbours within $\varepsilon$, clusters are connected components of cores, and isolated points are noise — handling non-convex shapes that centroid methods can't. To pick $k$, the elbow rule maximises $\text{cost}_{k-1}/\text{cost}_k$ over $k \in \{2,\ldots,n-1\}$. Flat clustering loses nested structure, so hierarchical clustering builds a tree — bottom-up (agglomerative, repeatedly merge the most-similar pair using single, average, or complete linkage) or top-down (divisive, repeatedly split via bisection k-means or sparsest cut $\phi(S) = E(S,\bar S)/\min(|S|, |\bar S|)$) — and you read off any flat clustering by cutting the dendrogram at a chosen height.

Exam-readiness checklist

  1. State the k-Means cost function and prove the centroid update is optimal given fixed assignments.
  2. Explain why Lloyd's iteration converges but can converge to an arbitrarily bad local minimum (rectangle example).
  3. Write down the full k-Means++ algorithm including the $D^2$-weighted sampling rule and quote the $O(\log k)$ expected-approximation guarantee.
  4. State the k-Median cost and compute a vector median per-dimension; distinguish k-Median from k-medoid.
  5. Contrast $L_1$ (Manhattan / taxicab) and $L_2^2$ (squared Euclidean) geometry — many equal-length grid paths vs unique straight line.
  6. Define DBSCAN's three point types (core / border / noise) and run the algorithm on a small example.
  7. Explain why DBSCAN handles non-convex clusters and why k-Means cannot (Voronoi cells are convex).
  8. State and apply the elbow rule: pick $k$ maximising $\text{cost}_{k-1}/\text{cost}_k$ over $\{2, \ldots, n-1\}$.
  9. Articulate the limit of flat clustering using the news-headlines example (CS / Physics / Cycling / Cricket).
  10. Define agglomerative vs divisive hierarchical clustering and the two divisive heuristics (bisection k-means, sparsest cut).
  11. Compute single, average, and complete linkage on a 4-node cross-cluster example with similarities.
  12. Run agglomerative clustering by hand on a similarity matrix and draw the resulting dendrogram with merge heights.
  13. Define $\phi(S) = E(S, \bar S)/\min(|S|, |\bar S|)$ for sparsest cut and explain the role of the denominator in preventing lopsided partitions.