King's College London

Clustering

Week 7 — Multiple Choice Quiz  ·  Dr Lin Gui

Questions: 10
Type: Single & multi-select
Topic: K-Means, DBSCAN & Hierarchical Clustering
Question 01
What is clustering?
⚑ Select all that apply
A. A supervised learning method
B. Grouping similar data points together
C. Dividing data into predefined categories
D. Identifying patterns in unlabelled data
Clustering is an unsupervised learning technique — it works without labels. Its two core properties are: Grouping similar data points (B) — the whole goal is to bring similar items together into clusters; and identifying patterns in unlabelled data (D) — because there are no class labels, the algorithm must discover structure on its own. A is wrong — clustering is unsupervised, not supervised (supervised learning requires labelled training examples). C is wrong — predefined categories are a feature of classification, not clustering; in clustering the groups emerge from the data itself.
✓ Correct answers: B & D
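As a quick, non-authoritative illustration of the idea, the toy snippet below (using scikit-learn's KMeans, an assumed choice) fits a clusterer on unlabelled points and lets the groups emerge from the data:

```python
# Minimal sketch: clustering receives NO labels; the groups emerge
# from similarity between the points themselves.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1, 2], [8, 8], [8, 9]])  # unlabelled data
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0 0 1 1]: similar points end up grouped together
```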
Question 02
Which of the following algorithms is density-based?
A. K-Means
B. DBSCAN
C. Hierarchical clustering
D. CURE
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the only density-based algorithm in this list. It defines clusters as regions of high point density separated by regions of low density, using two parameters: ε (neighbourhood radius) and MinPts (minimum points to form a core point). K-Means (A) is centroid-based — it partitions data by minimising distance to cluster centres. Hierarchical clustering (C) is connectivity-based — it builds a tree (dendrogram) by merging or splitting clusters. CURE (D) is a representative-point method, a variant of hierarchical clustering suited to large datasets.
✓ Correct answer: B
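For reference, here is a minimal sketch of how the two DBSCAN parameters map onto scikit-learn's API (eps for ε, min_samples for MinPts); the data and parameter values are made up for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [1.0, 1.1],   # dense region
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # dense region
              [9.0, 1.0]])                          # isolated point
# eps plays the role of the neighbourhood radius, min_samples of MinPts
labels = DBSCAN(eps=0.3, min_samples=3).fit_predict(X)
print(labels)  # [0 0 0 1 1 1 -1]: the isolated point is flagged as noise
```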
Question 03
Which measure evaluates clustering quality?
⚑ Select all that apply
A. Silhouette score
B. Euclidean distance
C. Mean Absolute Error (MAE)
D. F1 score based on reference clusters (ground truth)
Two distinct types of clustering evaluation metrics exist. Silhouette score (A) is an internal metric — it measures how well each point fits its assigned cluster relative to neighbouring clusters without needing ground truth labels; values range from −1 to +1. F1 score based on ground truth (D) is an external metric — when reference labels are available, you can compare predicted clusters to the truth using precision, recall, and F1. Euclidean distance (B) is a similarity/distance measure used within algorithms like K-Means, but it is not itself a quality evaluation metric. MAE (C) is a regression error metric — it has no role in evaluating cluster quality.
✓ Correct answers: A & D
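A small sketch of the internal metric in practice (toy data assumed): silhouette_score needs only the points and the predicted labels, never the ground truth:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Internal evaluation: no reference labels involved
print(silhouette_score(X, labels))  # near +1 for well-separated clusters
```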
Question 04
Which property is true of DBSCAN?
⚑ Select all that apply
A. It handles arbitrary cluster shapes
B. It requires a fixed number of clusters
C. It uses a density threshold to form clusters
D. It cannot handle noise in data
DBSCAN has two key strengths: Arbitrary shapes (A) — because it grows clusters region-by-region based on density reachability, it can find rings, crescents, and other non-convex shapes that K-Means cannot; Density threshold (C) — a point becomes a core point if at least MinPts neighbours lie within radius ε, which is exactly a density threshold. B is wrong — DBSCAN does not require you to specify the number of clusters in advance; the number of clusters is inferred from the data's density structure. D is wrong — noise handling is actually one of DBSCAN's biggest selling points; points that don't belong to any dense region are classified as noise/outliers automatically.
✓ Correct answers: A & C
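To make the "arbitrary shapes plus noise" point concrete, here is an illustrative sketch on the classic two-moons dataset (the parameter values are assumptions that happen to work for this data):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Two interleaving crescents: non-convex shapes K-Means handles poorly
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))  # typically {0, 1}; any -1 entries would be noise
```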
Question 05
In K-Means, which of the following determines cluster assignment?
A. Distance from centroids
B. Density of nearby points
C. Gradient descent
D. Membership probabilities
K-Means assigns each data point to whichever cluster centroid it is closest to (A), typically using Euclidean distance. Each iteration: (1) assign every point to its nearest centroid, (2) recompute centroids as the mean of all assigned points; repeat until stable (see the sketch below). B is wrong — density of nearby points drives DBSCAN, not K-Means. C is wrong — standard K-Means (Lloyd's algorithm) uses this alternating assign-and-update procedure, a form of alternating minimisation, not gradient descent. D is wrong — membership probabilities are used in soft/fuzzy clustering (e.g., Gaussian Mixture Models), not in standard hard K-Means.
✓ Correct answer: A
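A minimal NumPy sketch of one such iteration, on made-up data; step (1) is the nearest-centroid assignment, step (2) the mean update:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

# Step 1: assign each point to its nearest centroid (Euclidean distance)
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
assign = dists.argmin(axis=1)
# Step 2: recompute each centroid as the mean of its assigned points
centroids = np.array([X[assign == k].mean(axis=0)
                      for k in range(len(centroids))])
print(assign, centroids)  # [0 0 1 1] and the two cluster means
```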
Question 06
The term "silhouette score" measures:
A. How similar a data point is to its assigned cluster compared to others
B. The compactness of clusters
C. The accuracy of predictions in clustering
D. The separation between clusters
The silhouette score for a point is computed as (b − a) / max(a, b), where a is the average distance to points in its own cluster and b is the average distance to the nearest other cluster. This jointly captures how well a point fits its own cluster vs. its nearest alternative (A) — a high score means the point is well-matched to its cluster and poorly matched to neighbours. B (compactness only) and D (separation only) are each half the story — the silhouette combines both into one value. C is wrong — clustering is unsupervised, so there's no "prediction accuracy" in the supervised sense.
✓ Correct answer: A
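A worked one-liner with made-up distances, just to show the formula's behaviour:

```python
a = 0.5   # mean distance to points in the same cluster (cohesion)
b = 2.0   # mean distance to points in the nearest other cluster (separation)
s = (b - a) / max(a, b)
print(s)  # 0.75: well matched to its own cluster, far from the neighbour
```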
Question 07
Which clustering technique is most suitable for large datasets?
A. K-Means
B. DBSCAN
C. Hierarchical clustering
D. CURE
K-Means (A) scales well to large datasets. Its time complexity is approximately O(n · k · i) where n is the number of points, k the number of clusters, and i the number of iterations — all of which are typically small relative to n, making it efficient in practice. Hierarchical clustering (C) has O(n²) or worse complexity and must store the entire distance matrix, making it impractical for large data. DBSCAN (B) can struggle with high-dimensional large datasets and requires careful parameter tuning. CURE (D) was designed to improve scalability over hierarchical methods, but it still generally lags behind K-Means for very large datasets.
✓ Correct answer: A
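As an aside, scikit-learn also offers MiniBatchKMeans, which pushes K-Means scalability further by updating centroids from small random batches; the sketch below uses arbitrary sizes purely for illustration:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))  # a "large" toy dataset
model = MiniBatchKMeans(n_clusters=8, batch_size=1024, n_init=3,
                        random_state=0).fit(X)
print(model.cluster_centers_.shape)  # (8, 10)
```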
Question 08
Which of the following are true of hierarchical clustering?
⚑ Select all that apply
A. It can be agglomerative or divisive
B. It is able to generate a dendrogram
C. It requires the number of clusters as input
D. It cannot handle small datasets
Hierarchical clustering has two flavours: agglomerative (bottom-up — start with each point as its own cluster, merge the two closest repeatedly) and divisive (top-down — start with one big cluster, split recursively). So A is true. It also naturally produces a dendrogram (B) — a tree diagram that shows the hierarchy of merges/splits at every distance level, which you can cut at any level to get different numbers of clusters. C is wrong — unlike K-Means, you do not need to specify k upfront; you choose where to cut the dendrogram after the fact. D is wrong — hierarchical clustering is actually very well-suited to small datasets; its high computational cost makes it struggle with large datasets, not small ones.
✓ Correct answers: A & B
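A small SciPy sketch (an assumed library choice) showing both points: linkage builds the full agglomerative merge tree, and fcluster then cuts it at a distance chosen afterwards, so no k is fixed up front:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1, 2], [8, 8], [8, 9], [15, 1]])
Z = linkage(X, method='single')   # bottom-up (agglomerative) merges
# Cut the dendrogram at distance 3.0; a different cut gives a different k
print(fcluster(Z, t=3.0, criterion='distance'))  # e.g. [1 1 2 2 3]
```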
Question 09
Suppose we have the following 3 data points in a 2-dimensional space for K-Means clustering — initially all assigned to the same cluster. What will be the new centroid after updating?
A: (2, 3)    B: (4, 5)    C: (6, 7)
A. A: (2, 3)
B. B: (4, 5)
C. C: (6, 7)
D. Not enough information
The K-Means centroid update rule sets the new centroid to the mean of all points in the cluster. With all three points in one cluster: x̄ = (2 + 4 + 6) / 3 = 4 and ȳ = (3 + 5 + 7) / 3 = 5. So the new centroid is (4, 5) = point B. Note that the three points are collinear and evenly spaced — B sits exactly in the middle, making it the mean by symmetry. A and C are wrong because they are the extremes of the cluster. D is wrong because we have exactly the information needed: three points in one cluster is sufficient to compute the centroid.
✓ Correct answer: B — centroid is (4, 5)
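The arithmetic can be verified in one line of NumPy:

```python
import numpy as np

points = np.array([[2, 3], [4, 5], [6, 7]])
print(points.mean(axis=0))  # [4. 5.], i.e. point B
```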
Question 10
After the centroid is updated to (4, 5), only one point is reassigned to a different cluster. Which point is most likely to be removed?
A: (2, 3)    B: (4, 5)    C: (6, 7)    New centroid: (4, 5)
A. (3, 4)
B. (4.67, 5.67)
C. (5.12, 6.33)
D. Not enough information
The answer is D — not enough information. For a point to be reassigned away from the current cluster, there must be a second cluster with a centroid closer to that point. The question only gives us one cluster of three points — we have no information about any other clusters or their centroids. Without knowing where the other cluster centroid(s) are, it is impossible to determine which point would move. The other options (A, B, C) are plausible new centroid values under different scenarios, but none can be determined as correct without additional data. This tests whether you understand that K-Means assignment requires comparing distances to all centroids.
✓ Correct answer: D — not enough information
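To see why the other centroids matter, here is a sketch with a purely hypothetical second centroid (the question provides none); changing c2 changes which point, if any, is reassigned:

```python
import numpy as np

points = np.array([[2, 3], [4, 5], [6, 7]])
c1 = np.array([4, 5])   # updated centroid from Question 09
c2 = np.array([7, 8])   # HYPOTHETICAL second centroid, illustration only

d1 = np.linalg.norm(points - c1, axis=1)
d2 = np.linalg.norm(points - c2, axis=1)
print(d2 < d1)  # [False False  True]: under this c2, point C would move
```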