How to do deep learning on data that isn't a grid — molecules, social networks, road maps — plus a first look at how we judge generative models. Built slide-by-slide, formula-by-formula.
Most of the data you've met so far lives on a regular grid — pixels in an image, words in a sequence. A graph is what you reach for when the data has structure but no natural grid: things connected to other things.
A graph is, at heart, a wonderfully simple idea: a collection of things and the connections between them. The things are called nodes (or vertices); the connections are called edges. That's it. The power comes from how many real-world situations fit this picture.
A graph consists of a set of \(N\) nodes connected by a set of \(E\) edges. A node usually represents an entity; an edge represents a relationship between two entities.
The lecture opens with three pictures that look completely different but are all graphs (slide 4):
Slide 5 (adapted from Fernández-Madrigal & González, 2002) shows that graphs come in many flavours. It's worth knowing the vocabulary because the exam may ask you to classify a scenario:
A graph can encode three different kinds of information at once: the structure (who is connected to whom), data living on the nodes, and data living on the edges. A grid (image) only really lets neighbours be fixed left/right/up/down; a graph lets any node connect to any other. That flexibility is exactly why we need new machinery — ordinary CNNs and transformers assume a fixed neighbourhood layout that graphs don't have.
Any three of the lecture's examples (or your own) are fine, as long as you correctly identify nodes and edges:
Bonus marks for noting whether the graph is directed (knowledge graph) or undirected (friendships), and whether edges are labelled.
To feed a graph to a neural network we must turn it into numbers. The lecture's key claim: a graph is fully captured by three matrices — one for the wiring, one for the node data, and one for the edge data.
A neural network eats matrices, not pictures. So before anything else we encode the graph. The lecture states it cleanly (slide 7): "The graph can be encoded by three matrices \(A\), \(X\), and \(E\), representing the graph structure, node embeddings, and edge embeddings." Let us meet each one.
Adjacency matrix \(\mathbf{A}\in\{0,1\}^{N\times N}\) — encodes the structure: $$A_{m,n}=\begin{cases}1 & \text{if there is an edge between nodes } m \text{ and } n\\[2pt] 0 & \text{otherwise}\end{cases}$$
Node data matrix \(\mathbf{X}\in\mathbb{R}^{D\times N}\) — each column is the embedding of one node, \(\mathbf{x}^{(n)}\in\mathbb{R}^{D}\) for the \(n\)-th node. So there are \(N\) columns and each is \(D\)-dimensional.
Edge data matrix \(\mathbf{E}\in\mathbb{R}^{D_E\times E}\) — each column is an edge embedding \(\mathbf{e}^{(e)}\in\mathbb{R}^{D_E}\) for the \(e\)-th edge.
Note the orientation: in this course nodes are columns of \(\mathbf{X}\), so \(\mathbf{X}\) is \(D\times N\) (features-by-nodes), not \(N\times D\). Many textbooks use the transpose. If you mix them up, every matrix product in the GCN layer (Part IV) will have the wrong shape. Keep \(\mathbf{A}\) as \(N\times N\), \(\mathbf{X}\) as \(D\times N\).
The adjacency matrix is not just storage — multiplying by it moves information around the graph. This is the single most important mechanical fact in the whole lecture, so let's build it slowly.
Think of node values as heat. The product \(\mathbf{A}\mathbf{x}\) replaces each node's value with the sum of its neighbours' values. So if only node 6 was "hot", after one multiplication the heat appears at everything touching node 6. The slide colours nodes 5, 7, 8 because those are exactly node 6's neighbours.
Push further: the entry \((\mathbf{A}^2)_{m,n}\) counts the number of length-2 walks from node \(m\) to node \(n\). In general \((\mathbf{A}^k)_{m,n}\) counts walks of length \(k\). That's why \(\mathbf{A}^2\mathbf{x}\) reaches two hops away. This "information spreads one hop per multiply" is the seed of the entire GCN idea: stack \(K\) layers and each node sees \(K\) hops out.
\((\mathbf{A}^2)_{m,n}\) = the number of distinct walks of length 2 from node \(m\) to node \(n\) (paths that go \(m \to \text{something} \to n\)). More generally \((\mathbf A^k)_{m,n}\) counts walks of length \(k\). Note the diagonal of \(\mathbf A^2\) gives each node's degree (number of ways to walk out and back).
\(\mathbf{A}\mathbf{x}\) with \(\mathbf{x}\) = indicator of node \(j\): produces a vector that is 1 at every direct neighbour of \(j\) (and 0 elsewhere). It "selects the \(j\)-th column of \(\mathbf A\)," which lists \(j\)'s neighbours. This is the one-hop message-passing operation at the core of a GCN.
Here's a subtlety that drives the whole design of GNNs. When you label the nodes 1, 2, 3, …, that numbering is a free choice. Relabel them and you have the same graph — the structure hasn't changed one bit. We capture relabelling with a permutation matrix.
A permutation matrix \(\mathbf{P}\) is a square matrix with exactly one 1 in each row and column (a shuffled identity). To relabel a graph from one indexing to another: $$\mathbf{X}' = \mathbf{X}\mathbf{P}, \qquad \mathbf{A}' = \mathbf{P}^{\mathsf T}\mathbf{A}\mathbf{P}$$
The node data \(\mathbf X\) gets its columns shuffled (one \(\mathbf P\) on the right). The adjacency matrix gets shuffled on both axes — rows and columns — hence \(\mathbf{P}^{\mathsf T}\mathbf A\mathbf P\), because both the "from" and "to" indices of every edge must be renamed consistently.
\(\mathbf X\) has nodes on one axis (columns), so it needs one permutation: \(\mathbf X\mathbf P\). \(\mathbf A\) has nodes on both axes (rows = from-node, columns = to-node), so it needs the permutation applied twice: \(\mathbf P^{\mathsf T}\mathbf A\mathbf P\). This asymmetry is a classic short-answer question. Remember it foreshadows the equivariance requirement in Part IV.
Indexing is arbitrary because the names we assign to nodes carry no real information — a molecule is the same molecule whether we call a particular carbon "atom 1" or "atom 7." Two graphs that differ only by a relabelling are isomorphic (structurally identical).
We relabel with a permutation matrix \(\mathbf P\): node data becomes \(\mathbf X' = \mathbf X\mathbf P\), adjacency becomes \(\mathbf A' = \mathbf P^{\mathsf T}\mathbf A\mathbf P\). Because a model must give the "same" answer regardless of this arbitrary choice, every GNN layer is built to respect permutations (equivariance / invariance — Part IV).
A graph neural network takes a graph in and pushes it through several layers, gradually mixing each node's own information with its surroundings — exactly like a transformer turns isolated words into context-aware ones.
Now that a graph is three matrices, what does a GNN actually do with them? Slide 12 gives the bird's-eye view. The inputs are the node embeddings \(\mathbf X\) and the adjacency matrix \(\mathbf A\). The network passes them through \(K\) layers, producing intermediate representations \(\mathbf H_k\) at each layer, and a final output \(\mathbf H_K\).
The lecture gives a lovely analogy. At the start, \(\mathbf X\) contains information about each node on its own. By the end, \(\mathbf H_K\) contains each node's information plus the context of where it sits in the graph.
This is exactly like word embeddings in a transformer: the initial embedding of "bank" is the same everywhere, but after the transformer's layers, "bank" in "river bank" and "bank" in "money bank" have different representations because each has absorbed its sentence context. A GNN does the same thing, but the "sentence" is the graph and the "context" comes from neighbouring nodes.
Once you have those context-aware embeddings \(\mathbf H_K\), there are three things you typically want to predict. The lecture's figure shows all three flowing out of the same GNN:
The probability the whole graph belongs to class 1: $$P(y=1\mid \mathbf X,\mathbf A)=\operatorname{sig}\!\Big(\beta_K+\boldsymbol{\omega}_K\,\mathbf H_K\,\tfrac{1}{N}\Big)$$
where the scalar \(\beta_K\) and the vector \(\boldsymbol\omega_K\in\mathbb R^{1\times D}\) are learned parameters, and \(\operatorname{sig}(\cdot)\in[0,1]\) is the logistic sigmoid. The factor \(\tfrac1N\) and the multiplication of \(\mathbf H_K\) (a \(D\times N\) matrix) by a vector of ones effectively averages the node embeddings into a single \(D\)-vector — "mean pooling" — before the sigmoid classifier.
The loss is defined exactly as for graph-level tasks, except now it is applied independently at each node \(n\): $$P\big(y^{(n)}=1\mid \mathbf X,\mathbf A\big)=\operatorname{sig}\!\Big(\beta_K+\boldsymbol\omega_K\,\mathbf h^{(n)}_K\Big)$$
The only change from the graph-level case: we feed the single node's final embedding \(\mathbf h^{(n)}_K\) instead of the pooled average. No \(\tfrac1N\), no summing across nodes.
One possibility: take the dot product of two node embeddings and squash with a sigmoid to get the probability an edge exists between nodes \(m\) and \(n\): $$P\big(y^{(mn)}=1\mid\mathbf X,\mathbf A\big)=\operatorname{sig}\!\Big(\mathbf h^{(m)\mathsf T}\mathbf h^{(n)}\Big)$$
The dot product \(\mathbf h^{(m)\mathsf T}\mathbf h^{(n)}\) is large and positive when the two node embeddings point in a similar direction — i.e. when the nodes are "similar" in the learned feature space. The sigmoid turns that similarity into a probability. So the model learns to place nodes that should be linked close together in embedding space. This is the same trick used in recommender systems ("users like this also like…").
Graph-level: \(P(y=1\mid\mathbf X,\mathbf A)=\operatorname{sig}(\beta_K+\boldsymbol\omega_K\mathbf H_K\tfrac1N)\).
Node-level: \(P(y^{(n)}=1)=\operatorname{sig}(\beta_K+\boldsymbol\omega_K\mathbf h^{(n)}_K)\). It uses a single node's embedding \(\mathbf h^{(n)}_K\) with no pooling, because we want a separate prediction per node, not one for the whole graph. The classifier weights \(\beta_K,\boldsymbol\omega_K\) are shared across all nodes.
The workhorse of the lecture. A GCN updates every node by mixing it with the sum of its neighbours — the same learned weights everywhere — and that single design choice gives it permutation-respecting behaviour and a manageable parameter count.
A Graph Convolutional Network (GCN) is a particular, very natural way to build the layers \(F\). Its defining feature (slide 18) is a relational inductive bias: GCNs prioritise information from neighbouring nodes. The slide contrasts this with spectral-based methods, which instead operate in the Fourier domain (using the graph Laplacian's eigenvectors). The GCNs in this course are spatial — they work directly with neighbours, not frequencies.
Each layer is a function \(F\) with parameters \(\Phi\) that updates node embeddings using the adjacency matrix. The full network is a stack: $$\begin{aligned}\mathbf H_1 &= F(\mathbf X,\mathbf A,\boldsymbol\phi_0)\\ \mathbf H_2 &= F(\mathbf H_1,\mathbf A,\boldsymbol\phi_1)\\ &\;\;\vdots\\ \mathbf H_K &= F(\mathbf H_{K-1},\mathbf A,\boldsymbol\phi_{K-1})\end{aligned}$$ where \(\mathbf X\) = input node embeddings, \(\mathbf A\) = adjacency matrix, \(\mathbf H_k\) = embeddings at layer \(k\), and \(\boldsymbol\phi_k\) = parameters mapping layer \(k\) to layer \(k{+}1\). Crucially, \(\mathbf A\) is fed into every layer.
Recall from Part II that node labels are arbitrary. This forces a hard requirement on every layer.
Each layer must respect permutations of node indices. Formally, permuting the input and then applying the layer must equal applying the layer and then permuting: $$\mathbf H_{k+1}\mathbf P = F\big(\mathbf H_k\mathbf P,\;\mathbf P^{\mathsf T}\mathbf A\mathbf P,\;\boldsymbol\phi_k\big)$$
Equivariant = "if I shuffle the inputs, the outputs shuffle the same way." Relabel the nodes and every node's prediction just follows its node — nothing is lost. Invariant = "shuffle the inputs and the output doesn't change at all."
Which one you need depends on the task (slide 21):
The image analogy nails it: image segmentation should be equivariant to geometric transforms (move the object, the mask moves with it), while image classification should be invariant (a cat is a cat wherever it sits). For graphs, networks can be designed to guarantee either property with respect to permutations.
Don't blur the two. A common mistake is to say a node classifier should be "invariant to permutations" — it should be equivariant. Invariance would mean every node gets the same label, which is nonsense for per-node prediction. Reserve invariance for whole-graph outputs.
Why use the same weights at every node? The lecture frames it as a direct parallel to CNNs.
| Fully-connected (the problem) | Shared-weight solution | |
|---|---|---|
| Images | Must learn to recognise an object separately at every pixel position → enormous parameter count. | Convolution: process every position with the same filter → fewer parameters + the bias that "treat every part of the image the same." |
| Graphs | Would need separate parameters per node and independent learning per graph position → inefficient; needs many graphs with identical topology. | GCN: use the same parameters at every node → fewer parameters + learned information shared across the whole graph. |
This is the most important formula in Part IV. Build it in two steps.
Step 1 — aggregate by summing the embeddings of node \(n\)'s neighbours: $$\operatorname{agg}(n,k)=\sum_{m\in\text{ne}(n)}\mathbf h^{(m)}_k$$ where \(\text{ne}(n)\) returns the set of indices of the neighbours of node \(n\).
Step 2 — linearly transform both the node's own embedding and the aggregate, add a bias, apply an activation \(a(\cdot)\): $$\mathbf h^{(n)}_{k+1}=a\Big(\boldsymbol\beta_k+\boldsymbol\Omega_k\mathbf h^{(n)}_k+\boldsymbol\Omega_k\operatorname{agg}(n,k)\Big)$$
Matrix form (all nodes at once): $$\mathbf H_{k+1}=a\big(\boldsymbol\beta_k\mathbf 1^{\mathsf T}+\boldsymbol\Omega_k\mathbf H_k+\boldsymbol\Omega_k\mathbf H_k\mathbf A\big)=a\big(\boldsymbol\beta_k\mathbf 1^{\mathsf T}+\boldsymbol\Omega_k\mathbf H_k(\mathbf A+\mathbf I)\big)$$
Read \(\mathbf H_k(\mathbf A+\mathbf I)\) right-to-left. We saw that \(\mathbf H_k\mathbf A\) sums each node's neighbours (the "spread" operator from Part II). Adding the identity, \(\mathbf A+\mathbf I\), means "neighbours plus yourself" — the \(\mathbf I\) keeps each node's own embedding in the mix (a self-loop). Then:
So one GCN layer = "mix yourself with your neighbours, apply a shared linear map, add bias, squash." Stack \(K\) of them and node \(n\) has absorbed everything within \(K\) hops.
Per-node: \(\mathbf h^{(n)}_{k+1}=a(\boldsymbol\beta_k+\boldsymbol\Omega_k\mathbf h^{(n)}_k+\boldsymbol\Omega_k\sum_{m\in\text{ne}(n)}\mathbf h^{(m)}_k)\).
Collect all nodes into the \(D\times N\) matrix \(\mathbf H_k\). The neighbour-sum for all nodes at once is the matrix product \(\mathbf H_k\mathbf A\) (column \(n\) of \(\mathbf H_k\mathbf A\) is the sum of \(\mathbf H_k\)'s columns over \(n\)'s neighbours). The "own embedding" term is just \(\mathbf H_k\). So:
$$\mathbf H_{k+1}=a(\boldsymbol\beta_k\mathbf 1^{\mathsf T}+\boldsymbol\Omega_k\mathbf H_k+\boldsymbol\Omega_k\mathbf H_k\mathbf A).$$
Factor \(\boldsymbol\Omega_k\mathbf H_k\) out of the last two terms: \(\boldsymbol\Omega_k\mathbf H_k+\boldsymbol\Omega_k\mathbf H_k\mathbf A=\boldsymbol\Omega_k\mathbf H_k(\mathbf I+\mathbf A)\). Hence the \((\mathbf A+\mathbf I)\): the \(\mathbf A\) gathers the neighbours, the \(\mathbf I\) is a self-loop that keeps the node's own current embedding. Without the \(\mathbf I\), a node would forget itself and only see its neighbours. \(\boldsymbol\beta_k\mathbf 1^{\mathsf T}\) broadcasts the bias to every node.
Three hops. Each layer mixes a node with its direct (1-hop) neighbours via \((\mathbf A+\mathbf I)\). After layer 1 a node "sees" its 1-hop neighbourhood; those neighbours in turn saw their neighbours, so after layer 2 a node indirectly sees 2 hops; after layer 3, 3 hops. In general \(K\) layers ⇒ a receptive field of \(K\) hops. (This is exactly why deep GNNs on big graphs hit the "receptive-field explosion" problem in Part VII.)
A worked, end-to-end example: is this molecule toxic or harmless? It pins down exactly what every matrix in the GCN equations holds, and shows how to train on batches of differently-shaped graphs.
The lecture grounds the abstract machinery in a concrete graph-level binary classification task (slide 26): classify molecules as toxic or harmless. A molecule is naturally a graph — atoms are nodes, bonds are edges — so this is a perfect fit.
Inputs:
Transformation to arbitrary size \(D\): the first weight matrix \(\boldsymbol\Omega_0\in\mathbb R^{D\times 118}\) maps the bulky 118-dim one-hot down (or up) to a chosen embedding size \(D\).
Network equations: $$\begin{aligned}\mathbf H_1&=a(\boldsymbol\beta_0\mathbf 1^{\mathsf T}+\boldsymbol\Omega_0\mathbf X(\mathbf A+\mathbf I))\\ \mathbf H_2&=a(\boldsymbol\beta_1\mathbf 1^{\mathsf T}+\boldsymbol\Omega_1\mathbf H_1(\mathbf A+\mathbf I))\\ &\;\;\vdots\\ \mathbf H_K&=a(\boldsymbol\beta_{K-1}\mathbf 1^{\mathsf T}+\boldsymbol\Omega_{K-1}\mathbf H_{K-1}(\mathbf A+\mathbf I))\end{aligned}$$ and the read-out: $$f(\mathbf X,\mathbf A,\Phi)=\operatorname{sig}\!\Big(\beta_K+\boldsymbol\omega_K\mathbf H_K\tfrac1N\Big)$$
The raw "what element is this atom" has no natural numeric meaning (carbon isn't "6 units" of anything useful), so we one-hot encode it: 118 slots, a single 1. But 118 dimensions is wasteful and the one-hots are not learnable. So the very first layer's \(\boldsymbol\Omega_0\) (shape \(D\times118\)) acts as a learned embedding lookup — turning each sparse one-hot column into a dense, trainable \(D\)-dimensional atom embedding. After that, every layer works in the compact \(D\)-dimensional space.
Training data: graphs \(\{\mathbf X_i,\mathbf A_i\}\) with labels \(y_i\). Method: stochastic gradient descent (SGD) with binary cross-entropy loss to update model parameters \(\Phi\).
Modern networks process whole batches in parallel for speed. But here's the snag: each graph can have a different number of nodes, so the matrices \(\mathbf X_i\) and \(\mathbf A_i\) have varying sizes. You can't just stack them into one fixed-size tensor the way you stack equally-sized images.
The fix is elegant:
118 rows = the 118 elements of the periodic table. Each atom (node) is a one-hot column with a single 1 in the row of its element, so \(\mathbf X\in\mathbb R^{118\times N}\) just records "which element each atom is."
\(\boldsymbol\Omega_0\in\mathbb R^{D\times118}\) is the first layer's weight matrix. It maps each 118-dim one-hot to a dense, learned \(D\)-dimensional embedding — i.e. it converts the arbitrary one-hot code into a trainable atom representation and sets the working dimension \(D\) for all later layers.
Combine all graphs in the batch into one big disconnected graph (block-diagonal adjacency matrix; node features concatenated). Run the GCN once on this single combined graph — since there are no edges between the original graphs, each graph's nodes only ever aggregate within their own graph. Then mean-pool each original graph's nodes to get one vector per graph, and pass those to the binary cross-entropy loss. This sidesteps the variable-size problem while keeping full parallelism.
Two fundamentally different problem set-ups. Are you trained on many graphs and tested on a brand-new one (inductive)? Or do you live inside one giant graph, labelling its unknown nodes (transductive)?
This short but exam-favourite distinction (slide 29) tells you what kind of generalisation your model must do.
Normalizing flows. They are the one family that allows exact and efficient likelihood computation (slide 52) — the only ✓ in the efficient-likelihood column.
GANs are out immediately: they produce no probabilistic output at all (n/a). VAEs can give a likelihood, but only an expensive/approximate one (✗ for efficient likelihood) — workable but not "fast and exact." So flows are the right tool.
Generative-model experiments predominantly use images, because image data is abundant and quality is easy to judge by eye. To compare probabilistic models numerically:
Likelihood comparison only works if the model has a likelihood. Slide 52: GANs do not provide probabilistic outputs; VAEs and diffusion models can, but estimation is expensive. The lone exception is normalizing flows — exact and efficient. Don't write "compare GANs by test likelihood" — GANs have none.
Pass each generated image through a pretrained classifier. Two things signal a good generator: (1) each image is classified confidently as one class (sharp \(P(y\mid x^*)\) — quality), and (2) across all images the classes are varied (flat marginal \(P(y)\) — diversity). The KL divergence between the two is large exactly when both hold.
$$\text{IS}=\exp\!\left(\frac{1}{I}\sum_{i=1}^{I}D_{KL}\big[P(y_i\mid \mathbf x_i^{*})\,\big\|\,P(y)\big]\right)$$ where the marginal class distribution is the average over all \(I\) generated examples: $$P(y)=\frac{1}{I}\sum_{i=1}^{I}P(y_i\mid \mathbf x_i^{*})$$ and \(I\) is the number of generated examples. Higher IS = better.
The IS is low. Each \(P(y_i\mid x_i^*)\) is sharp (good — confident "dog"), but the marginal \(P(y)=\frac1I\sum_i P(y_i\mid x_i^*)\) is also sharply peaked on "dog" because every image is a dog. The KL divergence between each conditional and the marginal is then ~0, so \(\text{IS}=\exp(\approx 0)\approx 1\) — near the floor.
The score rewards diversity: a high IS needs the marginal to be flat (many classes represented) while each individual prediction stays peaked. Mode collapse kills the first condition.
Purpose: measure the distance between the distribution of generated examples and the distribution of real examples; designed specifically for images.
Method: approximate each distribution with a Gaussian, then compute the Fréchet distance between the two Gaussians. The features come from the deepest layer of InceptionV3, whose activations are associated with object classes.
Push real images and fake images through Inception, grab a feature vector for each, and fit a bell-curve cloud to each set. FID is "how far apart are the two clouds." Lower FID = better — opposite direction to Inception Score.
Advantages: considers diversity within classes; reflects both realism and diversity of the generated samples (an improvement over IS).
Limitations: information loss — only accounts for features that InceptionV3 happens to retain; and no distinction — a single number cannot tell you whether a bad score is due to poor realism or poor diversity.
Inception Score: higher is better. FID: lower is better. They also differ in what they use — IS uses only the generated images (and a classifier); FID compares generated vs real feature distributions. Mixing these up is a classic slip.
FID gives one number and can't separate "unrealistic" from "not diverse." Manifold precision/recall splits them. Draw a blob around where real data lives and a blob around where fakes live. Precision = "what fraction of my fakes land inside the real blob?" (realism). Recall = "what fraction of the real data is covered by my fake blob?" (coverage).
Place a hypersphere around each example; its radius is the distance to that example's \(k\)-th nearest neighbour. The union of all these spheres approximates the manifold. The whole computation is usually done in the feature space of a classifier, not raw pixel space.
(a) High precision, low recall. Every fake is realistic (lands inside the real manifold → precision high), but the fakes cluster in one region and miss most of the real data's spread (real points outside the model manifold → recall low). This is mode collapse seen through the precision/recall lens.
(b) Low precision, high recall. The fakes are scattered everywhere, so they cover the real data (recall high), but many land outside the real manifold because they look like garbage (precision low).
This is exactly the advantage over FID: precision/recall separates realism from coverage, which a single FID number cannot.
The lecture closes by zooming back out — placing the day's two big topics, graphs and generative models, in their proper relationship — and teases the next lecture.
Slide 59 ties together the unsupervised half of the lecture as a set of nested ideas, exactly mirroring the taxonomy diagram from Part X.
Probabilistic generative ⊂ generative ⊂ unsupervised. Each inner ring can do more: structure-finding → plus synthesis → plus assigning probabilities. This is the same nesting as the coloured-box diagram in Part X — if you can redraw that diagram, you've got slide 59.
The course continues into Generative Adversarial Networks. The slide-60 figure previews the GAN setup on MNIST handwritten digits: a generator turns a latent sample \(\mathbf z\) into a fake digit, while a discriminator is shown either a real digit or the fake and must output 1 (real) or 0 (fake). The two networks are trained against each other — the generator tries to fool the discriminator, the discriminator tries not to be fooled.
Both are unsupervised and both are generative (they synthesise new data). The difference is the innermost ring: VAEs are probabilistic generative models — they assign \(P(\mathbf x\mid\boldsymbol\phi)\) to data, so they sit in the inner box. GANs are generative but not probabilistic — they produce samples but no likelihood (recall the n/a in the slide-51 scorecard), so they sit in the generative ring but outside the probabilistic inner box.
A graph is just nodes-and-edges encoded as three matrices \((\mathbf A,\mathbf X,\mathbf E)\); a graph neural network repeatedly mixes each node with its neighbours — \(\mathbf H_{k+1}=a(\boldsymbol\beta\mathbf 1^{T}+\boldsymbol\Omega\mathbf H_k(\mathbf A+\mathbf I))\) — in a way that is permutation-equivariant and parameter-shared, so it can classify whole graphs, individual nodes, or predict edges; and when we then turn to learning without labels, generative models let us synthesise new data, judged by metrics like Inception Score, FID, and manifold precision/recall.