Week 15 — Graph Neural Networks & Unsupervised Learning

I. What is a graph? (slides 3–5)
II. Representing a graph: A, X, E (slides 6–10)
III. GNNs, tasks & loss functions (slides 11–16)
IV. Graph Convolutional Networks (slides 17–24)
V. Graph classification (molecules) (slides 25–27)
VI. Inductive vs transductive (slides 28–29)
VII. Node classification & batching (slides 30–34)
VIII. Layers for GCNs (aggregation zoo) (slides 35–43)
IX. Edge graphs & summary (slides 44–46)
X. Unsupervised & generative models (slides 47–50)
XI. Evaluating generative models (slides 51–57)
XII. Conclusion & what's next (slides 58–60)

Part the first

I. What is a graph?

Most of the data you've met so far lives on a regular grid — pixels in an image, words in a sequence. A graph is what you reach for when the data has structure but no natural grid: things connected to other things.

A graph is, at heart, a wonderfully simple idea: a collection of things and the connections between them. The things are called nodes (or vertices); the connections are called edges. That's it. The power comes from how many real-world situations fit this picture.

Definition · Graph

A graph consists of a set of $N$ nodes connected by a set of $E$ edges. A node usually represents an entity; an edge represents a relationship between two entities.

The lecture opens with three pictures that look completely different but are all graphs (slide 4):

A road / transit map — nodes are junctions or stations, edges are the roads between them.
A molecule — nodes are atoms (the slide shows a chlorine-substituted indole with a piperazine group), edges are chemical bonds.
An electrical circuit — nodes are components (resistors, capacitors, transistors, a diode, a battery), edges are the wires.

The zoo of graph types (slide 5)

Slide 5 (adapted from Fernández-Madrigal & González, 2002) shows that graphs come in many flavours. It's worth knowing the vocabulary because the exam may ask you to classify a scenario:

Undirected social graph — a friendship network (Alice, Bob, Carol, Dan, Erin, Frank). Edges have no direction: if Alice knows Bob, Bob knows Alice.
Undirected influence graph — musicians influencing one another (Mascis 1988, Mould & Hart 1985, Westerberg 1986, Haynes 1985, MacKaye et al. 1981, Watt et al. 1982).
Directed, labelled "knowledge graph" — here the edges carry labels and directions: Alice is brother/is sister of Bob, works for Google, citizen of Canada, Intel is supplier to Google, and so on. The edge type matters.
Geometric / point-cloud graph — a 3-D scan of an aeroplane, where nearby scanned points are joined into a mesh. Nodes carry spatial coordinates.
Hierarchical / scene graph — a room decomposed into parts: Room (ceiling, walls, floor) → Table (tabletop, legs) → Light (cord, lightshade). Nodes group into nested sub-graphs.

Intuition · why this matters

A graph can encode three different kinds of information at once: the structure (who is connected to whom), data living on the nodes, and data living on the edges. A grid (image) only really lets neighbours be fixed left/right/up/down; a graph lets any node connect to any other. That flexibility is exactly why we need new machinery — ordinary CNNs and transformers assume a fixed neighbourhood layout that graphs don't have.

Q. Give three real-world systems that are naturally graphs, and state what the nodes and edges represent in each.

Any three of the lecture's examples (or your own) are fine, as long as you correctly identify nodes and edges:

Molecule — nodes = atoms, edges = chemical bonds.
Social network — nodes = people, edges = friendships (undirected) or "follows" (directed).
Road map — nodes = junctions/intersections, edges = road segments.
Electrical circuit — nodes = components, edges = wires.
Knowledge graph — nodes = entities (people, companies, countries), edges = typed relations ("works for", "supplier to").

Bonus marks for noting whether the graph is directed (knowledge graph) or undirected (friendships), and whether edges are labelled.

· · ·

Part the second

II. Representing a graph: A, X, E

To feed a graph to a neural network we must turn it into numbers. The lecture's key claim: a graph is fully captured by three matrices — one for the wiring, one for the node data, and one for the edge data.

A neural network eats matrices, not pictures. So before anything else we encode the graph. The lecture states it cleanly (slide 7): "The graph can be encoded by three matrices $A$, $X$, and $E$, representing the graph structure, node embeddings, and edge embeddings." Let us meet each one.

Definition · The three matrices (slide 8)

Adjacency matrix $\mathbf{A}\in\{0,1\}^{N\times N}$ — encodes the structure: $$A_{m,n}=\begin{cases}1 & \text{if there is an edge between nodes } m \text{ and } n\\[2pt] 0 & \text{otherwise}\end{cases}$$

Node data matrix $\mathbf{X}\in\mathbb{R}^{D\times N}$ — each column is the embedding of one node, $\mathbf{x}^{(n)}\in\mathbb{R}^{D}$ for the $n$-th node. So there are $N$ columns and each is $D$-dimensional.

Edge data matrix $\mathbf{E}\in\mathbb{R}^{D_E\times E}$ — each column is an edge embedding $\mathbf{e}^{(e)}\in\mathbb{R}^{D_E}$ for the $e$-th edge.

Exam trap · which dimension is which

Note the orientation: in this course nodes are columns of $\mathbf{X}$, so $\mathbf{X}$ is $D\times N$ (features-by-nodes), not $N\times D$. Many textbooks use the transpose. If you mix them up, every matrix product in the GCN layer (Part IV) will have the wrong shape. Keep $\mathbf{A}$ as $N\times N$, $\mathbf{X}$ as $D\times N$.
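To make the shapes concrete, here is a minimal NumPy sketch that builds the three matrices for a small made-up graph. The edge list, the sizes `D` and `D_E`, and the random feature values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Hypothetical toy graph: 4 nodes, 4 undirected edges (0-indexed).
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
N, D, D_E = 4, 3, 2  # node count, node-feature size, edge-feature size

# Adjacency matrix A (N x N): symmetric because the graph is undirected.
A = np.zeros((N, N), dtype=int)
for m, n in edges:
    A[m, n] = 1
    A[n, m] = 1

# Node data X (D x N): column n is the embedding of node n.
rng = np.random.default_rng(0)
X = rng.normal(size=(D, N))

# Edge data E (D_E x number of edges): column e is the embedding of edge e.
E = rng.normal(size=(D_E, len(edges)))

print(A)
print(X.shape, E.shape)  # (3, 4) (2, 4)
```

Note that `X.shape` is `(D, N)`, matching the course's features-by-nodes orientation.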

What the adjacency matrix can do (slide 9)

The adjacency matrix is not just storage — multiplying by it moves information around the graph. This is the single most important mechanical fact in the whole lecture, so let's build it slowly.

Slide 9 in motion. Start with an indicator vector $\mathbf{x}$ that is 1 only at node 6. Multiplying by $\mathbf A$ "spreads" the signal to node 6's direct neighbours (nodes 5, 7, 8). Multiplying again by $\mathbf A$ (giving $\mathbf A^2\mathbf x$) spreads to neighbours-of-neighbours.

Intuition · A is a "spread" operator

Think of node values as heat. The product $\mathbf{A}\mathbf{x}$ replaces each node's value with the sum of its neighbours' values. So if only node 6 was "hot", after one multiplication the heat appears at everything touching node 6. The slide colours nodes 5, 7, 8 because those are exactly node 6's neighbours.

Push further: the entry $(\mathbf{A}^2)_{m,n}$ counts the number of length-2 walks from node $m$ to node $n$. In general $(\mathbf{A}^k)_{m,n}$ counts walks of length $k$. That's why $\mathbf{A}^2\mathbf{x}$ reaches two hops away. This "information spreads one hop per multiply" is the seed of the entire GCN idea: stack $K$ layers and each node sees $K$ hops out.
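The spreading behaviour is easy to check numerically. The sketch below uses a small hypothetical graph (not the one on slide 9) and an indicator vector at node 2, then applies $\mathbf A$ once and twice.

```python
import numpy as np

# Hypothetical 5-node undirected graph; node 2 has neighbours 1, 3 and 4.
edges = [(0, 1), (1, 2), (2, 3), (2, 4)]
N = 5
A = np.zeros((N, N), dtype=int)
for m, n in edges:
    A[m, n] = A[n, m] = 1

# Indicator vector: "heat" only at node 2.
x = np.zeros(N, dtype=int)
x[2] = 1

one_hop = A @ x        # lands on node 2's direct neighbours
two_hop = A @ one_hop  # equals A^2 x: counts length-2 walks starting at node 2
A2 = np.linalg.matrix_power(A, 2)

print(one_hop)  # [0 1 0 1 1]
print(two_hop)  # [1 0 3 0 0]
```

The 3 at node 2 in the second result counts the three out-and-back walks, one per neighbour, which is also node 2's degree.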

Q. What does the $(m,n)$ entry of $\mathbf{A}^2$ represent? And what does the vector $\mathbf{A}\mathbf{x}$ compute if $\mathbf{x}$ is an indicator for a single node?

$(\mathbf{A}^2)_{m,n}$ = the number of distinct walks of length 2 from node $m$ to node $n$ (routes that go $m \to \text{something} \to n$). More generally $(\mathbf A^k)_{m,n}$ counts walks of length $k$. Note that for an undirected graph with no self-loops, the diagonal of $\mathbf A^2$ gives each node's degree (the number of ways to walk out along an edge and straight back).

$\mathbf{A}\mathbf{x}$ with $\mathbf{x}$ = indicator of node $j$: produces a vector that is 1 at every direct neighbour of $j$ (and 0 elsewhere). It "selects the $j$-th column of $\mathbf A$," which lists $j$'s neighbours. This is the one-hop message-passing operation at the core of a GCN.

Node indices are arbitrary — permutation matrices (slide 10)

Here's a subtlety that drives the whole design of GNNs. When you label the nodes 1, 2, 3, …, that numbering is a free choice. Relabel them and you have the same graph — the structure hasn't changed one bit. We capture relabelling with a permutation matrix.

Definition · Permutation of node indices

A permutation matrix $\mathbf{P}$ is a square matrix with exactly one 1 in each row and column (a shuffled identity). To relabel a graph from one indexing to another: $$\mathbf{X}' = \mathbf{X}\mathbf{P}, \qquad \mathbf{A}' = \mathbf{P}^{\mathsf T}\mathbf{A}\mathbf{P}$$

The node data $\mathbf X$ gets its columns shuffled (one $\mathbf P$ on the right). The adjacency matrix gets shuffled on both axes — rows and columns — hence $\mathbf{P}^{\mathsf T}\mathbf A\mathbf P$, because both the "from" and "to" indices of every edge must be renamed consistently.

Exam trap · why two P's for A but one for X

$\mathbf X$ has nodes on one axis (columns), so it needs one permutation: $\mathbf X\mathbf P$. $\mathbf A$ has nodes on both axes (rows = from-node, columns = to-node), so it needs the permutation applied twice: $\mathbf P^{\mathsf T}\mathbf A\mathbf P$. This asymmetry is a classic short-answer question. Remember it foreshadows the equivariance requirement in Part IV.
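A quick numerical check of the relabelling rule, as a sketch: the path graph, the feature values and the particular permutation `perm` are arbitrary choices made for illustration.

```python
import numpy as np

# Hypothetical path graph 0-1-2-3 with D = 2 features per node.
edges = [(0, 1), (1, 2), (2, 3)]
N, D = 4, 2
A = np.zeros((N, N), dtype=int)
for m, n in edges:
    A[m, n] = A[n, m] = 1
X = np.arange(D * N, dtype=float).reshape(D, N)

# New index j will hold old node perm[j].
perm = [2, 0, 3, 1]
P = np.eye(N, dtype=int)[:, perm]  # column j of P is the one-hot for perm[j]

X_new = X @ P        # one P: only the columns (nodes) are shuffled
A_new = P.T @ A @ P  # two P's: rows and columns are shuffled together

# Same result as direct index shuffling.
assert np.array_equal(X_new, X[:, perm])
assert np.array_equal(A_new, A[np.ix_(perm, perm)])
# The structure is unchanged: same multiset of node degrees.
assert sorted(A_new.sum(axis=0)) == sorted(A.sum(axis=0))
```

Applying $\mathbf P$ on only one side of $\mathbf A$ would rename the "to" index of each edge but not the "from" index, which no longer describes the same graph.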

Q. Why is node indexing said to be "arbitrary," and how do we formally relabel a graph?

Indexing is arbitrary because the names we assign to nodes carry no real information — a molecule is the same molecule whether we call a particular carbon "atom 1" or "atom 7." Two graphs that differ only by a relabelling are isomorphic (structurally identical).

We relabel with a permutation matrix $\mathbf P$: node data becomes $\mathbf X' = \mathbf X\mathbf P$, adjacency becomes $\mathbf A' = \mathbf P^{\mathsf T}\mathbf A\mathbf P$. Because a model must give the "same" answer regardless of this arbitrary choice, every GNN layer is built to respect permutations (equivariance / invariance — Part IV).

· · ·

Part the third

III. GNNs, tasks & loss functions

A graph neural network takes a graph in and pushes it through several layers, gradually mixing each node's own information with its surroundings — exactly like a transformer turns isolated words into context-aware ones.

Now that a graph is three matrices, what does a GNN actually do with them? Slide 12 gives the bird's-eye view. The inputs are the node embeddings $\mathbf X$ and the adjacency matrix $\mathbf A$. The network passes them through $K$ layers, producing intermediate representations $\mathbf H_k$ at each layer, and a final output $\mathbf H_K$.

Intuition · from "the node itself" to "the node in context"

The lecture gives a lovely analogy. At the start, $\mathbf X$ contains information about each node on its own. By the end, $\mathbf H_K$ contains each node's information plus the context of where it sits in the graph.

This is exactly like word embeddings in a transformer: the initial embedding of "bank" is the same everywhere, but after the transformer's layers, "bank" in "river bank" and "bank" in "money bank" have different representations because each has absorbed its sentence context. A GNN does the same thing, but the "sentence" is the graph and the "context" comes from neighbouring nodes.

Three common tasks (slide 13)

Once you have those context-aware embeddings $\mathbf H_K$, there are three things you typically want to predict. The lecture's figure shows all three flowing out of the same GNN:

Slide 13. The same GNN backbone produces embeddings; what differs is the read-out head. Graph-level: combine all node embeddings → one label. Node-level: classify each node separately. Edge-level: score whether a pair of nodes should be linked.

The three read-out formulas (slides 14–16)

Definition · Graph-level task (slide 14)

The probability the whole graph belongs to class 1: $$P(y=1\mid \mathbf X,\mathbf A)=\operatorname{sig}\!\Big(\beta_K+\boldsymbol{\omega}_K\,\mathbf H_K\,\tfrac{1}{N}\Big)$$

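where the scalar $\beta_K$ and the row vector $\boldsymbol\omega_K\in\mathbb R^{1\times D}$ are learned parameters, and $\operatorname{sig}(\cdot)$ is the logistic sigmoid, which maps any real number to a value between 0 and 1. The formula leaves an $N\times1$ column vector of ones implicit: written out in full, the term is $\boldsymbol\omega_K\mathbf H_K\mathbf 1/N$. Multiplying $\mathbf H_K$ (a $D\times N$ matrix) by $\mathbf 1$ sums its columns, and the factor $\tfrac1N$ turns that sum into an average, so the node embeddings are "mean pooled" into a single $D$-vector before the linear classifier and sigmoid.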

Definition · Node-level task (slide 15)

The read-out (and hence the loss) is defined exactly as for graph-level tasks, except now it is applied independently at each node $n$: $$P\big(y^{(n)}=1\mid \mathbf X,\mathbf A\big)=\operatorname{sig}\!\Big(\beta_K+\boldsymbol\omega_K\,\mathbf h^{(n)}_K\Big)$$

The only change from the graph-level case: we feed the single node's final embedding $\mathbf h^{(n)}_K$ instead of the pooled average. No $\tfrac1N$, no summing across nodes.

Definition · Edge prediction (slide 16)

One possibility: take the dot product of two node embeddings and squash with a sigmoid to get the probability an edge exists between nodes $m$ and $n$: $$P\big(y^{(mn)}=1\mid\mathbf X,\mathbf A\big)=\operatorname{sig}\!\Big(\mathbf h^{(m)\mathsf T}\mathbf h^{(n)}\Big)$$

Intuition · why a dot product for edges

The dot product $\mathbf h^{(m)\mathsf T}\mathbf h^{(n)}$ is large and positive when the two node embeddings point in a similar direction — i.e. when the nodes are "similar" in the learned feature space. The sigmoid turns that similarity into a probability. So the model learns to place nodes that should be linked close together in embedding space. This is the same trick used in recommender systems ("users like this also like…").
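The three read-outs differ only in how they consume the final embeddings, which a short sketch can show side by side. The random `H_K`, `omega_K` and `beta_K` below are stand-ins for trained values, and the ones-vector that the graph-level formula leaves implicit is written out explicitly.

```python
import numpy as np

def sig(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
D, N = 4, 5
H_K = rng.normal(size=(D, N))      # stand-in for the final node embeddings
omega_K = rng.normal(size=(1, D))  # stand-in for the learned weight row vector
beta_K = 0.1                       # stand-in for the learned scalar bias

# Graph-level: mean-pool the N columns, then classify. Result has shape (1, 1).
ones = np.ones((N, 1))
p_graph = sig(beta_K + omega_K @ H_K @ ones / N)

# Node-level: the same classifier applied to every column. Shape (1, N).
p_nodes = sig(beta_K + omega_K @ H_K)

# Edge-level: sigmoid of all pairwise dot products. Entry (m, n) scores edge m-n.
p_edges = sig(H_K.T @ H_K)

print(p_graph.shape, p_nodes.shape, p_edges.shape)  # (1, 1) (1, 5) (5, 5)
```

The edge scores form a symmetric matrix because the dot product does not care which node comes first, which suits undirected edges.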

Q. Write the graph-level classification probability and explain the role of each symbol. How does the node-level formula differ, and why?

Graph-level: $P(y=1\mid\mathbf X,\mathbf A)=\operatorname{sig}(\beta_K+\boldsymbol\omega_K\mathbf H_K\tfrac1N)$.

$\mathbf H_K$ — the $D\times N$ matrix of final node embeddings.
$\tfrac1N$ (with the implicit ones-vector) — mean-pools the $N$ node columns into one $D$-vector, giving a single graph-level representation.
$\boldsymbol\omega_K\in\mathbb R^{1\times D}$ — learned weight vector mapping that representation to a scalar logit.
$\beta_K$ — learned scalar bias.
$\operatorname{sig}$ — logistic sigmoid, maps the logit to a probability in $[0,1]$.

Node-level: $P(y^{(n)}=1)=\operatorname{sig}(\beta_K+\boldsymbol\omega_K\mathbf h^{(n)}_K)$. It uses a single node's embedding $\mathbf h^{(n)}_K$ with no pooling, because we want a separate prediction per node, not one for the whole graph. The classifier weights $\beta_K,\boldsymbol\omega_K$ are shared across all nodes.

· · ·

Part the fourth

IV. Graph Convolutional Networks

The workhorse of the lecture. A GCN updates every node by mixing it with the sum of its neighbours — the same learned weights everywhere — and that single design choice gives it permutation-respecting behaviour and a manageable parameter count.

A Graph Convolutional Network (GCN) is a particular, very natural way to build the layers $F$. Its defining feature (slide 18) is a relational inductive bias: GCNs prioritise information from neighbouring nodes. The slide contrasts this with spectral-based methods, which instead operate in the Fourier domain (using the graph Laplacian's eigenvectors). The GCNs in this course are spatial — they work directly with neighbours, not frequencies.

Definition · Network architecture (slide 19)

Each layer is a function $F$ with its own parameters $\boldsymbol\phi_k$ (the full set of network parameters is written $\Phi$) that updates node embeddings using the adjacency matrix. The full network is a stack: $$\begin{aligned}\mathbf H_1 &= F(\mathbf X,\mathbf A,\boldsymbol\phi_0)\\ \mathbf H_2 &= F(\mathbf H_1,\mathbf A,\boldsymbol\phi_1)\\ &\;\;\vdots\\ \mathbf H_K &= F(\mathbf H_{K-1},\mathbf A,\boldsymbol\phi_{K-1})\end{aligned}$$ where $\mathbf X$ = input node embeddings, $\mathbf A$ = adjacency matrix, $\mathbf H_k$ = embeddings at layer $k$, and $\boldsymbol\phi_k$ = parameters mapping layer $k$ to layer $k{+}1$. Crucially, $\mathbf A$ is fed into every layer.

Equivariance and invariance (slides 20–21)

Recall from Part II that node labels are arbitrary. This forces a hard requirement on every layer.

Definition · Equivariance requirement (slide 20)

Each layer must respect permutations of node indices. Formally, permuting the input and then applying the layer must equal applying the layer and then permuting: $$\mathbf H_{k+1}\mathbf P = F\big(\mathbf H_k\mathbf P,\;\mathbf P^{\mathsf T}\mathbf A\mathbf P,\;\boldsymbol\phi_k\big)$$

Intuition · equivariance vs invariance, in one breath

Equivariant = "if I shuffle the inputs, the outputs shuffle the same way." Relabel the nodes and every node's prediction just follows its node — nothing is lost. Invariant = "shuffle the inputs and the output doesn't change at all."

Which one you need depends on the task (slide 21):

Node classification & edge prediction → output must be equivariant: each node/edge keeps its own answer regardless of labelling.
Graph-level tasks → output must be invariant: the final layer aggregates across the whole graph (e.g. mean pooling), so node order cannot affect the single graph label.

The image analogy nails it: image segmentation should be equivariant to geometric transforms (move the object, the mask moves with it), while image classification should be invariant (a cat is a cat wherever it sits). For graphs, networks can be designed to guarantee either property with respect to permutations.

Exam trap · equivariant ≠ invariant

Don't blur the two. A common mistake is to say a node classifier should be "invariant to permutations" — it should be equivariant. Invariance would mean every node gets the same label, which is nonsense for per-node prediction. Reserve invariance for whole-graph outputs.
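Both properties can be checked numerically. The sketch below uses the sum-aggregation layer from slides 23–24 (covered later in this part) with ReLU as the activation; the random graph, weights and permutation are illustrative assumptions.

```python
import numpy as np

def gcn_layer(H, A, Omega, beta):
    """One layer a(beta 1^T + Omega H (A + I)) with a = ReLU."""
    N = A.shape[0]
    pre = beta @ np.ones((1, N)) + Omega @ H @ (A + np.eye(N))
    return np.maximum(pre, 0.0)

rng = np.random.default_rng(1)
N, D = 5, 3

# Random symmetric 0/1 adjacency matrix with an empty diagonal.
upper = np.triu(rng.integers(0, 2, size=(N, N)), k=1)
A = upper + upper.T

H = rng.normal(size=(D, N))
Omega = rng.normal(size=(D, D))
beta = rng.normal(size=(D, 1))

# Random relabelling of the nodes.
P = np.eye(N)[:, rng.permutation(N)]

out = gcn_layer(H, A, Omega, beta)
out_perm = gcn_layer(H @ P, P.T @ A @ P, Omega, beta)

# Equivariance: per-node outputs are shuffled in exactly the same way.
equivariant = np.allclose(out @ P, out_perm)
# Invariance: a mean-pooled graph-level summary does not change at all.
invariant = np.allclose(out.mean(axis=1), out_perm.mean(axis=1))

print(equivariant, invariant)
```

The per-node output is equivariant but not invariant in general: `out` and `out_perm` hold the same columns in a different order, and only the pooled summary is identical.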

Parameter sharing (slide 22)

Why use the same weights at every node? The lecture frames it as a direct parallel to CNNs.

	Fully-connected (the problem)	Shared-weight solution
Images	Must learn to recognise an object separately at every pixel position → enormous parameter count.	Convolution: process every position with the same filter → fewer parameters + the inductive bias that every part of the image should be treated the same.
Graphs	Would need separate parameters per node and independent learning per graph position → inefficient; needs many graphs with identical topology.	GCN: use the same parameters at every node → fewer parameters + learned information shared across the whole graph.

The GCN layer itself (slides 23–24)

This is the most important formula in Part IV. Build it in two steps.

Definition · Aggregation, then transform (slide 23)

Step 1 — aggregate by summing the embeddings of node $n$'s neighbours: $$\operatorname{agg}(n,k)=\sum_{m\in\text{ne}(n)}\mathbf h^{(m)}_k$$ where $\text{ne}(n)$ returns the set of indices of the neighbours of node $n$.

Step 2 — linearly transform both the node's own embedding and the aggregate, add a bias, apply an activation $a(\cdot)$: $$\mathbf h^{(n)}_{k+1}=a\Big(\boldsymbol\beta_k+\boldsymbol\Omega_k\mathbf h^{(n)}_k+\boldsymbol\Omega_k\operatorname{agg}(n,k)\Big)$$

Matrix form (all nodes at once): $$\mathbf H_{k+1}=a\big(\boldsymbol\beta_k\mathbf 1^{\mathsf T}+\boldsymbol\Omega_k\mathbf H_k+\boldsymbol\Omega_k\mathbf H_k\mathbf A\big)=a\big(\boldsymbol\beta_k\mathbf 1^{\mathsf T}+\boldsymbol\Omega_k\mathbf H_k(\mathbf A+\mathbf I)\big)$$

Intuition · decoding that final line

Read $\mathbf H_k(\mathbf A+\mathbf I)$ right-to-left. We saw that $\mathbf H_k\mathbf A$ sums each node's neighbours (the "spread" operator from Part II). Adding the identity, $\mathbf A+\mathbf I$, means "neighbours plus yourself" — the $\mathbf I$ keeps each node's own embedding in the mix (a self-loop). Then:

$\boldsymbol\Omega_k$ — a learned weight matrix applied identically to every node (parameter sharing!).
$\boldsymbol\beta_k\mathbf 1^{\mathsf T}$ — a bias added to every column (the $\mathbf 1^{\mathsf T}$ broadcasts the bias vector across all $N$ nodes).
$a(\cdot)$ — a nonlinearity (e.g. ReLU).

So one GCN layer = "mix yourself with your neighbours, apply a shared linear map, add bias, squash." Stack $K$ of them and node $n$ has absorbed everything within $K$ hops.

Slide 24. Node embeddings evolve layer by layer. At the input each node knows only itself; after layer 1 it knows its 1-hop neighbourhood; after $k$ layers, its $k$-hop neighbourhood. The update rule is the same at every node and every layer (only the parameters $\boldsymbol\Omega_k,\boldsymbol\beta_k$ change with the layer).
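As a sanity check on the two forms of the update, the sketch below implements the per-node rule and the matrix rule separately and confirms they agree. ReLU is assumed for $a(\cdot)$, and the graph and parameter values are made up.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def gcn_layer_per_node(H, A, Omega, beta):
    """Per-node update: h_next = a(beta + Omega h + Omega agg)."""
    D_out, N = Omega.shape[0], A.shape[0]
    H_next = np.zeros((D_out, N))
    for n in range(N):
        neighbours = np.flatnonzero(A[:, n])  # ne(n)
        agg = H[:, neighbours].sum(axis=1)    # sum of neighbour embeddings
        H_next[:, n] = relu(beta[:, 0] + Omega @ H[:, n] + Omega @ agg)
    return H_next

def gcn_layer_matrix(H, A, Omega, beta):
    """Matrix update: H_next = a(beta 1^T + Omega H (A + I))."""
    N = A.shape[0]
    return relu(beta @ np.ones((1, N)) + Omega @ H @ (A + np.eye(N)))

# Hypothetical path graph 0-1-2-3 and random parameters.
N, D_in, D_out = 4, 3, 2
A = np.zeros((N, N))
for m, n in [(0, 1), (1, 2), (2, 3)]:
    A[m, n] = A[n, m] = 1.0

rng = np.random.default_rng(0)
H = rng.normal(size=(D_in, N))
Omega = rng.normal(size=(D_out, D_in))
beta = rng.normal(size=(D_out, 1))

per_node = gcn_layer_per_node(H, A, Omega, beta)
matrix = gcn_layer_matrix(H, A, Omega, beta)
print(np.allclose(per_node, matrix))
```

The loop version makes the neighbour sum explicit; the matrix version is what you would actually run, since it replaces the loop with a single product.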

Q. Starting from the per-node GCN update, derive the matrix form and explain why the term $(\mathbf A + \mathbf I)$ appears.

Per-node: $\mathbf h^{(n)}_{k+1}=a(\boldsymbol\beta_k+\boldsymbol\Omega_k\mathbf h^{(n)}_k+\boldsymbol\Omega_k\sum_{m\in\text{ne}(n)}\mathbf h^{(m)}_k)$.

Collect all nodes into the $D\times N$ matrix $\mathbf H_k$. The neighbour-sum for all nodes at once is the matrix product $\mathbf H_k\mathbf A$ (column $n$ of $\mathbf H_k\mathbf A$ is the sum of $\mathbf H_k$'s columns over $n$'s neighbours). The "own embedding" term is just $\mathbf H_k$. So:

$$\mathbf H_{k+1}=a(\boldsymbol\beta_k\mathbf 1^{\mathsf T}+\boldsymbol\Omega_k\mathbf H_k+\boldsymbol\Omega_k\mathbf H_k\mathbf A).$$

Factor $\boldsymbol\Omega_k\mathbf H_k$ out of the last two terms: $\boldsymbol\Omega_k\mathbf H_k+\boldsymbol\Omega_k\mathbf H_k\mathbf A=\boldsymbol\Omega_k\mathbf H_k(\mathbf I+\mathbf A)$. Hence the $(\mathbf A+\mathbf I)$: the $\mathbf A$ gathers the neighbours, the $\mathbf I$ is a self-loop that keeps the node's own current embedding. Without the $\mathbf I$, a node would forget itself and only see its neighbours. $\boldsymbol\beta_k\mathbf 1^{\mathsf T}$ broadcasts the bias to every node.

Q. A GCN with $K=3$ layers is applied to a graph. How many "hops" of the graph does each node's final embedding depend on, and why?

Three hops. Each layer mixes a node with its direct (1-hop) neighbours via $(\mathbf A+\mathbf I)$. After layer 1 a node "sees" its 1-hop neighbourhood; those neighbours in turn saw their neighbours, so after layer 2 a node indirectly sees 2 hops; after layer 3, 3 hops. In general $K$ layers ⇒ a receptive field of $K$ hops. (This is exactly why deep GNNs on big graphs hit the "receptive-field explosion" problem in Part VII.)
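The hop count can be read off the sparsity pattern of $(\mathbf A+\mathbf I)^K$: entry $(m,n)$ is nonzero exactly when node $m$ lies within $K$ hops of node $n$. The sketch below checks this on a hypothetical 6-node path graph; the helper name `receptive_field` is introduced here for illustration.

```python
import numpy as np

def receptive_field(A, n, K):
    """Indices of nodes that can influence node n after K layers."""
    N = A.shape[0]
    reach = np.linalg.matrix_power(A + np.eye(N, dtype=int), K)
    return np.flatnonzero(reach[:, n])

# Hypothetical path graph 0-1-2-3-4-5.
N = 6
A = np.zeros((N, N), dtype=int)
for n in range(N - 1):
    A[n, n + 1] = A[n + 1, n] = 1

print(receptive_field(A, 0, 3))  # [0 1 2 3]
```

This only tracks which inputs can reach a node structurally; how strongly they influence it depends on the learned weights and the activation.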

· · ·

Part the fifth

V. Graph classification: molecules

A worked, end-to-end example: is this molecule toxic or harmless? It pins down exactly what every matrix in the GCN equations holds, and shows how to train on batches of differently-shaped graphs.

The lecture grounds the abstract machinery in a concrete graph-level binary classification task (slide 26): classify molecules as toxic or harmless. A molecule is naturally a graph — atoms are nodes, bonds are edges — so this is a perfect fit.

Example · molecule classification (slide 26)

Inputs:

Adjacency matrix $\mathbf A\in\mathbb R^{N\times N}$ — represents the molecular structure (which atoms are bonded).
Node embedding matrix $\mathbf X\in\mathbb R^{118\times N}$ — encodes the presence of elements from the periodic table using one-hot vectors. (There are 118 known elements, hence 118 rows; each atom's column is a one-hot indicating its element.)

Transformation to arbitrary size $D$: the first weight matrix $\boldsymbol\Omega_0\in\mathbb R^{D\times 118}$ maps the bulky 118-dim one-hot down (or up) to a chosen embedding size $D$.

Network equations: $$\begin{aligned}\mathbf H_1&=a(\boldsymbol\beta_0\mathbf 1^{\mathsf T}+\boldsymbol\Omega_0\mathbf X(\mathbf A+\mathbf I))\\ \mathbf H_2&=a(\boldsymbol\beta_1\mathbf 1^{\mathsf T}+\boldsymbol\Omega_1\mathbf H_1(\mathbf A+\mathbf I))\\ &\;\;\vdots\\ \mathbf H_K&=a(\boldsymbol\beta_{K-1}\mathbf 1^{\mathsf T}+\boldsymbol\Omega_{K-1}\mathbf H_{K-1}(\mathbf A+\mathbf I))\end{aligned}$$ and the read-out: $$f(\mathbf X,\mathbf A,\Phi)=\operatorname{sig}\!\Big(\beta_K+\boldsymbol\omega_K\mathbf H_K\tfrac1N\Big)$$

Intuition · why one-hot, why Ω₀

The raw "what element is this atom" has no natural numeric meaning (carbon isn't "6 units" of anything useful), so we one-hot encode it: 118 slots, a single 1. But 118 dimensions is wasteful and the one-hots are not learnable. So the very first layer's $\boldsymbol\Omega_0$ (shape $D\times118$) acts as a learned embedding lookup — turning each sparse one-hot column into a dense, trainable $D$-dimensional atom embedding. After that, every layer works in the compact $D$-dimensional space.
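The "embedding lookup" reading can be verified directly: multiplying $\boldsymbol\Omega_0$ by a one-hot column simply picks out one column of $\boldsymbol\Omega_0$. In this sketch the four-atom fragment, the size `D`, the random weights, and the convention that row $z-1$ stands for atomic number $z$ are all assumptions made for illustration.

```python
import numpy as np

NUM_ELEMENTS = 118
# Hypothetical fragment: carbon, carbon, oxygen, hydrogen (atomic numbers).
atomic_numbers = [6, 6, 8, 1]
N, D = len(atomic_numbers), 4

# One-hot node matrix X (118 x N). Assumed convention: row z-1 <-> atomic number z.
X = np.zeros((NUM_ELEMENTS, N))
for n, z in enumerate(atomic_numbers):
    X[z - 1, n] = 1.0

rng = np.random.default_rng(0)
Omega_0 = rng.normal(size=(D, NUM_ELEMENTS))  # stand-in for learned weights

embedded = Omega_0 @ X                                # dense D x N embeddings
lookup = Omega_0[:, [z - 1 for z in atomic_numbers]]  # plain column indexing

print(np.allclose(embedded, lookup))
```

The two carbon atoms start with identical embeddings; they only become distinguishable once the $(\mathbf A+\mathbf I)$ factor mixes in their different neighbourhoods.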

Training on batches of graphs (slide 27)

Definition · batch training set-up

Training data: graphs $\{\mathbf X_i,\mathbf A_i\}$ with labels $y_i$. Method: stochastic gradient descent (SGD) with binary cross-entropy loss to update model parameters $\Phi$.
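For reference, the binary cross-entropy loss over a set of training graphs can be written out as below. Here $f(\mathbf X_i,\mathbf A_i,\Phi)$ is the network's predicted probability for graph $i$ and $y_i\in\{0,1\}$ is its label; the symbol $L$ for the loss is a naming assumption rather than the slide's notation.

$$L(\Phi)=-\sum_{i}\Big[y_i\log f(\mathbf X_i,\mathbf A_i,\Phi)+(1-y_i)\log\big(1-f(\mathbf X_i,\mathbf A_i,\Phi)\big)\Big]$$

Each term penalises the model when it assigns low probability to the correct class, and SGD follows the gradient of this sum computed on one mini-batch at a time.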

Exam trap · the mini-batch problem

Modern networks process whole batches in parallel for speed. But here's the snag: each graph can have a different number of nodes, so the matrices $\mathbf X_i$ and $\mathbf A_i$ have varying sizes. You can't just stack them into one fixed-size tensor the way you stack equally-sized images.

Example · the unified-graph trick (the solution)

The fix is elegant:

Treat each graph in the batch as a component of one larger, unified graph — just place them side by side with no edges between different molecules (a block-diagonal big adjacency matrix).
Run the network as a single instance on this combined graph. Because there are no edges between molecules, information never leaks across them.
Use mean pooling to collapse each molecule's nodes into a single representation (one vector per graph).
Feed these per-graph representations into the loss for training.
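The steps above can be sketched in a few lines of NumPy. The helper names `batch_graphs` and `mean_pool_per_graph`, the `graph_id` bookkeeping vector, and the two tiny example graphs are all hypothetical, introduced only to illustrate the idea.

```python
import numpy as np

def batch_graphs(graphs):
    """Merge a list of (X_i, A_i) pairs into one disconnected graph."""
    sizes = [A.shape[0] for _, A in graphs]
    total = sum(sizes)
    X_big = np.concatenate([X for X, _ in graphs], axis=1)  # D x total
    A_big = np.zeros((total, total))                        # block diagonal
    graph_id = np.zeros(total, dtype=int)                   # owner of each node
    start = 0
    for i, (_, A) in enumerate(graphs):
        stop = start + A.shape[0]
        A_big[start:stop, start:stop] = A
        graph_id[start:stop] = i
        start = stop
    return X_big, A_big, graph_id

def mean_pool_per_graph(H, graph_id, num_graphs):
    """Column g of the result is the mean embedding of graph g's nodes."""
    return np.stack(
        [H[:, graph_id == g].mean(axis=1) for g in range(num_graphs)], axis=1
    )

# Two hypothetical graphs with different node counts (2 and 3), D = 2.
X0 = np.array([[1.0, 2.0], [3.0, 4.0]])
A0 = np.array([[0.0, 1.0], [1.0, 0.0]])
X1 = np.array([[5.0, 6.0, 7.0], [8.0, 9.0, 10.0]])
A1 = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])

X_big, A_big, graph_id = batch_graphs([(X0, A0), (X1, A1)])

# One aggregation step on the merged graph: no mixing across molecules.
mixed = X_big @ (A_big + np.eye(A_big.shape[0]))
pooled = mean_pool_per_graph(mixed, graph_id, num_graphs=2)
print(pooled.shape)  # (2, 2)
```

Because the off-diagonal blocks of `A_big` are zero, the first two columns of `mixed` equal `X0 @ (A0 + I)` exactly, as if the first graph had been processed on its own.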

Q. In the molecule example, why does $\mathbf X$ have exactly 118 rows, and what is the job of $\boldsymbol\Omega_0$?

118 rows = the 118 elements of the periodic table. Each atom (node) is a one-hot column with a single 1 in the row of its element, so $\mathbf X\in\mathbb R^{118\times N}$ just records "which element each atom is."

$\boldsymbol\Omega_0\in\mathbb R^{D\times118}$ is the first layer's weight matrix. It maps each 118-dim one-hot to a dense, learned $D$-dimensional embedding — i.e. it converts the arbitrary one-hot code into a trainable atom representation and sets the working dimension $D$ for all later layers.

Q. Graphs in a batch have different node counts, so their matrices don't share a shape. How do we still train in mini-batches?

Combine all graphs in the batch into one big disconnected graph (block-diagonal adjacency matrix; node features concatenated). Run the GCN once on this single combined graph — since there are no edges between the original graphs, each graph's nodes only ever aggregate within their own graph. Then mean-pool each original graph's nodes to get one vector per graph, and pass those to the binary cross-entropy loss. This sidesteps the variable-size problem while keeping full parallelism.

· · ·

Part the sixth

VI. Inductive vs transductive

Two fundamentally different problem set-ups. Are you trained on many graphs and tested on a brand-new one (inductive)? Or do you live inside one giant graph, labelling its unknown nodes (transductive)?

This short but exam-favourite distinction (slide 29) tells you what kind of generalisation your model must do.

GANs sample fast and look great but suffer poor coverage (mode collapse) and give no likelihood at all. Diffusion models flip this: gorgeous high-quality samples but slow sampling (many denoising steps) and no easy likelihood. Flows are the only family with a clean ✓ for efficient likelihood — that exact-likelihood property is their signature. VAEs are the all-rounder with a well-behaved latent but mediocre sample quality.
