KCL · Machine Learning · Week 14

Attention & Transformers

From a confusing restaurant review to BERT, GPT-3, machine translation and Vision Transformers — how self-attention lets a network decide, for every word, which other words matter.

KCL · ML  Garcia Peraza Herrera ~70 min read 18 self-tests covers slides 1–53
Part I

I. Motivating the Transformer

Before any maths: why do we even need a new architecture for text? Three properties of language break the networks we already know.

Imagine a network has to read this restaurant review and turn it into something a downstream task can use:

The running example (slide 4)

“The restaurant refused to serve me a ham sandwich because it only cooks vegetarian food. In the end, they just gave me two slices of bread. Their ambiance was just as good as the food and service.”

The goal is to process this text into a representation suitable for downstream tasks — for instance, classify the review's sentiment, or answer questions like “Does the restaurant serve steak?”

Notice how hard this is for a machine. To answer “does it serve steak?”, the network must connect “refused to serve”, “only cooks vegetarian” and “two slices of bread” — words scattered across the whole passage — and resolve pronouns like “it” and “they”. That difficulty is exactly what motivates the transformer.

The three key observations (slide 5)

1 · Encoded input size is huge

Text turns out to be surprisingly large once encoded. The slide's worked figure: a 37-word passage with an embedding length of 1024 gives

\[ 37 \text{ words} \times 1024 \text{ dims} = 37\text{K numbers.} \]

A fully connected network would need a weight for every input number to every hidden unit, so it is impractical for large bodies of text.

2 · Input length varies

NLP inputs are not a fixed size — sentences and documents differ in length. So we need to share parameters across different input positions, the same trick convolutional networks use for images (one kernel slides over every location).

3 · Language is ambiguous

What the architecture must have

From these observations, a good text model needs exactly two ingredients: (a) parameter sharing across positions, and (b) data-dependent connections between word representations whose strength depends on the words. Dot-product self-attention delivers both.

QWhy is a fully connected network a bad choice for processing a paragraph of text?▾ tap to reveal▴ hide answer

Two reasons from slide 5. (1) Size. Encoded text is large — e.g. 37 words × 1024-dim embeddings = 37K numbers — and a fully connected layer needs a separate weight for every input-to-hidden pair, which explodes. (2) Variable length & no sharing. Inputs vary in length and an FC net has fixed-size, position-specific weights with no parameter sharing across positions, so it cannot generalise a pattern learned at one position to another. Transformers fix both by sharing the value/query/key weights across all positions.

·  ·  ·
Part II

II. Dot-Product Self-Attention

The core mechanism. Each output is a weighted average of transformed inputs — and the weights are computed from the inputs themselves.

A self-attention block is a function that takes \(N\) input vectors \(x_1,\dots,x_N\), each of dimension \(D\times 1\), and returns \(N\) output vectors of the same size. In NLP each input \(x_i\), for \(i\in\{1,\dots,N\}\), represents a word or word fragment. Transformers get both the parameter-sharing and the data-dependent connections of Part I from this one block.

Step 1 — Values (slide 8)

For each input \(x_i\) we compute a value vector using a shared weight matrix and bias:

\[ v_i = \beta_v + \Omega_v\, x_i, \qquad \Omega_v\in\mathbb{R}^{D\times D},\;\; \beta_v\in\mathbb{R}^{D\times 1}. \]

The same \(\Omega_v\) and \(\beta_v\) are applied to every input — that is the parameter sharing.

Step 2 — Weighted sum (slide 8)

The \(i\)-th output of the block is a weighted sum of all the value vectors:

\[ \alpha(x_1,\dots,x_N,i)=\alpha_i(x_1,\dots,x_N)=\sum_{j=1}^{N} a_{j,i}\, v_j, \qquad \sum_{j=1}^{N} a_{j,i}=1. \]
Definition · attention weight

The scalar \(a_{j,i}\) is the attention that the \(i\)-th output pays to input \(x_j\). The weights for a given output are non-negative and sum to one, so each output is a convex combination (a weighted average) of the values.

In plain English

To build the new representation of word \(i\), look at every word \(j\) in the sentence, give each a "relevance score" \(a_{j,i}\), and mix their value vectors in those proportions. A word builds its meaning from the words it pays attention to.

0.1 0.3 0.6 values out 1 0.5 0.2 0.3 values out 2 0.1 0.2 0.7 values out 3
Slide 9 redrawn. The same three value vectors feed every output; only the attention weights differ. Output 1 = 0.1·v₁ + 0.3·v₂ + 0.6·v₃; output 2 uses 0.5/0.2/0.3; output 3 uses 0.1/0.2/0.7. Each set of weights sums to 1.

Where do the weights come from? (slide 10)

Two important facts the slide stresses: the overall self-attention computation is nonlinear, and the attention weights are themselves nonlinear functions of the input. They are produced in two steps.

Queries and keys

We apply two more linear transformations to the inputs:

\[ q_i = \beta_q + \Omega_q\, x_i, \qquad k_i = \beta_k + \Omega_k\, x_i. \]

The \(\{q_i\}\) are called queries and the \(\{k_i\}\) are keys.

Dot products + softmax

Take the dot product of every query with every key and push the results through a softmax:

\[ a_{j,i}=\frac{\exp\!\big(k_j^{\mathsf T} q_i\big)}{\displaystyle\sum_{m=1}^{N}\exp\!\big(k_m^{\mathsf T} q_i\big)}. \]
In plain English · the database analogy

Think of a lookup. Output \(i\) issues a query \(q_i\) ("what am I looking for?"). Every input advertises a key \(k_j\) ("what I am about") and a value \(v_j\) ("what I'll hand over"). The match score is the dot product \(k_j^{\mathsf T} q_i\) — large when query and key point the same way. Softmax turns those scores into a probability distribution over inputs, and we return the value-mixture in those proportions.

inputs xᵢ qᵢ = β+Ωx kⱼ = β+Ωx kⱼᵀqᵢdot prod. softmaxover cols attention weights aⱼ,ᵢ (each column sums to 1) values vⱼ output αᵢ
Slide 11 redrawn as a pipeline. Inputs are split into queries and keys; their dot products form a matrix; a column-wise softmax produces the attention weights; those weights mix the values into each output. Crucially the softmax is taken down each column (over all \(j\) for a fixed output \(i\)).

Self-attention in matrix form (slide 12)

Stacking everything makes the whole block a few matrix products. The full parameter set is

\[ \phi=\{\beta_v,\Omega_v,\;\beta_q,\Omega_q,\;\beta_k,\Omega_k\}. \]

Put the \(N\) inputs as columns of \(X\in\mathbb{R}^{D\times N}\). Then values, queries and keys are

\[ V=\beta_v\mathbf{1}^{\mathsf T}+\Omega_v X,\qquad Q=\beta_q\mathbf{1}^{\mathsf T}+\Omega_q X,\qquad K=\beta_k\mathbf{1}^{\mathsf T}+\Omega_k X, \]

where \(\mathbf{1}\) is an \(N\times 1\) vector of ones (it broadcasts the bias across all \(N\) columns). The block output is

\[ A=V\,\sigma\!\big(K^{\mathsf T}Q\big), \]

where \(\sigma(\cdot)\) takes a matrix and performs the softmax independently on each of its columns.

Exam trap · which index, which softmax direction

Subscript order matters. \(a_{j,i}\) is the weight output \(i\) puts on input \(j\); the softmax normalises over \(j\) (all inputs) for each fixed \(i\) — i.e. over columns of \(K^{\mathsf T}Q\). Get the direction wrong and your weights won't sum to one per output. Also remember: \(\Omega_v,\Omega_q,\Omega_k\) are all \(D\times D\) in the single-head case and the bias is added via \(\beta\mathbf{1}^{\mathsf T}\), not just \(\beta\).

QDistinguish queries, keys and values. Which one is mixed to form the output, and which two produce the weights?▾ tap to reveal▴ hide answer

All three are linear transforms of the same inputs: \(q_i=\beta_q+\Omega_q x_i\), \(k_i=\beta_k+\Omega_k x_i\), \(v_i=\beta_v+\Omega_v x_i\). The values \(v_j\) are what gets mixed into each output: \(\alpha_i=\sum_j a_{j,i}v_j\). The queries and keys produce the weights via dot product + softmax: \(a_{j,i}=\mathrm{softmax}_j(k_j^{\mathsf T}q_i)\). Mnemonic: output \(i\)'s query asks a question; each input's key answers how relevant it is; the value is the content actually passed along.

QIn \(A = V\sigma(K^{\mathsf T}Q)\), what are the shapes of \(K^{\mathsf T}Q\) and \(A\), and what does \(\sigma\) do?▾ tap to reveal▴ hide answer

With \(D\times N\) inputs, \(Q\) and \(K\) are \(D\times N\) (single head), so \(K^{\mathsf T}Q\) is \(N\times N\) — the score for every (key \(j\), query \(i\)) pair. \(\sigma\) applies softmax independently to each column, leaving an \(N\times N\) attention matrix whose columns sum to 1. Then \(V\) (which is \(D\times N\)) times that gives \(A\) of shape \(D\times N\): same shape as the input \(X\), so blocks can be stacked.

QThe slide says the computation is "nonlinear" even though \(v,q,k\) are all linear in \(x\). Where does the nonlinearity enter?▾ tap to reveal▴ hide answer

From the softmax and the product of input-dependent terms. The weights \(a_{j,i}\) are a softmax of dot products of \(x\)-dependent vectors (nonlinear in \(x\)), and the output multiplies these input-dependent weights by input-dependent values — so the output is a nonlinear function of the inputs even though each individual projection is linear. This is why the slide separately states "the attention weights are themselves nonlinear functions of the input."

·  ·  ·
Part III

III. Extensions: Position, Scaling & Heads

Plain self-attention has three weaknesses. It ignores word order, its dot products can blow up, and one head can only model one kind of relationship. Three fixes follow.

Positional encoding (slide 14)

The problem · order is discarded

Self-attention is permutation-equivariant: the computation is the same regardless of the order of the inputs \(x_i\). Shuffle the words and you get the same outputs (just reordered). "Dog bites man" and "man bites dog" would be indistinguishable. So position must be injected explicitly.

There are two main approaches to incorporating position information:

  1. Absolute positional encodings. A matrix \(\Pi\) is added to the input \(X\) that encodes positional information. Each column of \(\Pi\) is unique, so it carries the absolute position in the input sequence.
  2. Relative positional encodings. Each element of the attention matrix corresponds to a particular offset between the key position and the query position — i.e. it encodes "how far apart" two tokens are rather than where each one sits absolutely.
X (input) + Π (unique cols) = X + Π (position-aware) pos 1 2 3 4
Absolute positional encoding: a matrix \(\Pi\) whose every column is distinct is added to \(X\), so the network can tell position 1 from position 4 even though attention itself ignores order.

Scaled dot-product self-attention (slide 15)

The problem · saturated softmax

The dot products \(k_j^{\mathsf T}q_i\) can have large magnitudes (they sum over \(D_q\) terms). When the softmax inputs are huge, the function saturates: small changes to the inputs produce almost no change in the output, and gradients vanish.

The fix is to scale the dot products by the square root of the query/key dimension:

\[ A=V\,\sigma\!\left(\frac{K^{\mathsf T}Q}{\sqrt{D_q}}\right), \]

where \(D_q\) is the dimension of the queries and keys.

In plain English · why \(\sqrt{D_q}\)?

A dot product of two \(D_q\)-dimensional vectors is a sum of \(D_q\) products. If the entries are roughly unit-scale and independent, that sum has variance growing like \(D_q\), i.e. standard deviation \(\sqrt{D_q}\). Dividing by \(\sqrt{D_q}\) rescales the scores back to a sensible range so the softmax stays in its responsive, well-gradiented region.

Multiple heads (slides 16–17)

Multi-head self-attention runs several self-attention mechanisms in parallel. We compute \(H\) different sets of values, keys and queries:

\[ V_h=\beta_{vh}\mathbf{1}^{\mathsf T}+\Omega_{vh}X,\qquad Q_h=\beta_{qh}\mathbf{1}^{\mathsf T}+\Omega_{qh}X,\qquad K_h=\beta_{kh}\mathbf{1}^{\mathsf T}+\Omega_{kh}X. \]

The \(h\)-th self-attention mechanism — its head — is

\[ A_h=V_h\,\sigma\!\left(\frac{K_h^{\mathsf T}Q_h}{\sqrt{D_q}}\right), \]

with different parameters \(\{\beta_{vh},\Omega_{vh}\}\), \(\{\beta_{qh},\Omega_{qh}\}\) and \(\{\beta_{kh},\Omega_{kh}\}\) for each head.

Dimensions & combination (slide 17)

If the input dimension is \(D\) and there are \(H\) heads, the values, queries and keys are all of size \(D/H\) — which keeps the total work the same and allows an efficient implementation. The head outputs are vertically concatenated and a final linear transform \(\Omega_c\) combines them:

\[ A_\eta=\Omega_c\big[A_1^{\mathsf T},A_2^{\mathsf T},\dots,A_H^{\mathsf T}\big]^{\mathsf T}. \]
In plain English

One head can only learn one "way of relating" words (say, subject→verb). Multiple heads let the layer look at the sentence through several relationship lenses at once — one might track syntax, another co-reference, another adjacency — then \(\Omega_c\) blends their findings into a single representation.

Input X Head 1 Qₕ Kₕ Vₕ σ A₁ Head 2 Qₕ Kₕ Vₕ σ A₂ concat + Ωc
Slide 18 redrawn. Two heads each compute their own Q/K/V (here at reduced dimension \(D/H\)) and attention output \(A_h\); the outputs are concatenated and passed through \(\Omega_c\) to produce the combined \(A_\eta\) of the original dimension \(D\).
QWhy divide the score by \(\sqrt{D_q}\) rather than, say, \(D_q\)?▾ tap to reveal▴ hide answer

Because the standard deviation of the dot product grows like \(\sqrt{D_q}\), not \(D_q\). The dot product sums \(D_q\) roughly-independent unit-scale terms; variance adds, so variance \(\propto D_q\) and std \(\propto\sqrt{D_q}\). Dividing by \(\sqrt{D_q}\) restores roughly unit-variance scores, keeping the softmax in its sensitive region so small input changes still move the output (avoiding the saturation problem on slide 15). Dividing by \(D_q\) would over-shrink the scores.

QWith \(D=1024\) and \(H=16\) heads, what is the per-head dimension, and why does this keep multi-head attention efficient?▾ tap to reveal▴ hide answer

\(D/H = 1024/16 = 64\). Each head's queries, keys and values are 64-dimensional (exactly BERT's setup, slide 30). Because each head works in a \(D/H\) subspace, the total compute and parameters across all \(H\) heads is comparable to a single full-dimension attention — you get \(H\) different relationship "lenses" essentially for free, then recombine with \(\Omega_c\).

QSelf-attention is permutation-equivariant. State precisely what that means and how positional encoding breaks it.▾ tap to reveal▴ hide answer

Permutation-equivariant means: if you permute the input columns, the output columns are permuted the same way and otherwise unchanged — the mechanism has no notion of "first" vs "last". Adding an absolute positional matrix \(\Pi\) with a unique column per position makes each position's input distinct, so permuting the words now changes the values being attended over. With relative encodings, the attention scores depend on the offset between query and key positions, again injecting order. Either way, order information re-enters the computation.

·  ·  ·
Part IV

IV. The Transformer Layer

Self-attention is the engine; the transformer layer is the full machine. It wraps attention with residual connections, normalisation and a per-token MLP — then stacks the whole thing many times.

A single transformer layer performs this series of operations (slide 20):

\[ \begin{aligned} X &\leftarrow X + A_\eta \\ X &\leftarrow \mathrm{LayerNorm}(X) \\ x_i &\leftarrow x_i + \mathrm{MLP}(x_i) \qquad \forall\, i\in\{1,\dots,N\}\\ X &\leftarrow \mathrm{LayerNorm}(X) \end{aligned} \]

where the column vectors \(x_i\) are taken separately from the full data matrix \(X\). In a real network, the data passes through a series of these transformer layers.

Reading the four lines

Line 1 — residual around attention: add the multi-head output \(A_\eta\) back onto the input \(X\). Line 2 — LayerNorm. Line 3 — per-token MLP with its own residual: the same MLP is applied independently to each column \(x_i\) (parameter sharing across positions, exactly like a 1×1 operation). Line 4 — LayerNorm again.

In plain English · why each piece is here

Attention lets tokens exchange information (mixing across positions). The MLP then processes each token on its own (computation within a position). Residual connections ("add the input back") give gradients a shortcut so very deep stacks still train. LayerNorm keeps the activations at a stable scale between sub-blocks.

Input multi-headattention + LayerN. per-tokenMLP + LayerN. Output residual residual
Slide 21 redrawn. Input → multi-head self-attention (with a residual) → LayerNorm → parallel per-token MLPs (with a residual) → LayerNorm → output. A real transformer stacks many of these layers; the input/output shape is preserved as \(D\times N\).
Exam trap · attention mixes, MLP doesn't

The MLP step is written per token: \(x_i \leftarrow x_i + \mathrm{MLP}(x_i)\). It does not mix information between tokens — only attention does. A common mistake is to think the MLP combines words. It processes each column independently (with shared weights). All cross-token communication happens in the attention sub-block.

QA transformer layer has two sub-blocks. Which one moves information between positions, and which one transforms each position on its own?▾ tap to reveal▴ hide answer

Multi-head self-attention moves information between positions — each output is a weighted mix of all tokens' values. The per-token MLP (\(x_i\leftarrow x_i+\mathrm{MLP}(x_i)\)) transforms each position independently with shared weights and performs no cross-token mixing. Each sub-block has its own residual connection and is followed by LayerNorm.

·  ·  ·
Part V

V. Transformers for NLP — Tokens, Embeddings & Model Types

Before text can enter a transformer it must become numbers. That means cutting it into tokens, turning tokens into embeddings, and choosing one of three model shapes.

Tokenization (slide 23)

A text-processing pipeline begins with a tokenizer, which splits text into smaller constituent units → tokens. Tokens must come from a pre-defined vocabulary of possible tokens.

Why not just use whole words?

Problems of using words as tokens: (1) some words — e.g. names — won't be in the vocabulary; (2) it's unclear how to handle punctuation; (3) the same word with different suffixes would get entirely different tokens (run, runs, running).

The compromise: sub-word tokenization (slide 24)

A possible solution is to use letters and punctuation marks as tokens. The actual solution is a compromise between letters and full words: the vocabulary is computed using a sub-word tokenizer such as byte pair encoding (BPE).

Byte pair encoding (slide 25)

BPE starts with the tokens being just the individual characters and whitespace, together with their frequencies. At each iteration the tokenizer finds the most commonly occurring adjacent pair of tokens and merges it into a single new token, adding it to the vocabulary. Repeating this grows tokens from letters up towards common sub-words and words.

start: characters s · e · a s · e · a s · e · a "se" is most frequent pair iter 1: merge s+e → "se" se · a se · a se · a "sea" next… iter 2: merge se+a → "sea" (token grows toward a word)
Slide 25 (Fig. 12.8) idea, simplified on the nursery-rhyme corpus "a sailor went to sea sea sea…". Each iteration the most frequent adjacent pair is merged; tokens grow from characters → "se" → "sea". The slide's plot (f) shows the token count rising then falling over ~50 iterations as merges trade many short tokens for fewer long ones.

Embeddings (slide 26)

Each token in the vocabulary \(\mathcal V\) is mapped to a unique word embedding. The embedding matrix is

\[ \Omega_e\in\mathbb{R}^{D\times|\mathcal V|}. \]

To obtain the embeddings: (1) the \(N\) input tokens are first encoded in the matrix \(T\in\mathbb{R}^{|\mathcal V|\times N}\) (one-hot columns); (2) the input embeddings are computed as

\[ X=\Omega_e\, T. \]

A typical embedding size is \(D=1024\); a typical total vocabulary size is \(|\mathcal V|=30\text{K}\).

X (D×N) = Ωₑ (D×|V|) × T (|V|×N), one-hot
Slide 27 idea ("an aardvark ate an ant"). Each token is a one-hot column of \(T\); multiplying by \(\Omega_e\) simply selects that token's embedding column. So \(X=\Omega_e T\) is a lookup that turns token indices into \(D\)-dimensional embedding vectors.

The transformer model & its three types (slide 28)

The transformer model is a series of \(K\) transformer layers through which the embedding matrix \(X\) (representing the text) is passed. There are three types:

TypeIdeaExample (later in deck)
EncoderBuild a representation of the whole input at onceBERT
DecoderGenerate text autoregressively, left to rightGPT-3
Encoder–decoderRead one sequence, produce anotherMachine translation
QGive the three reasons whole words make a poor tokenizer vocabulary, and name the sub-word algorithm used instead.▾ tap to reveal▴ hide answer

(1) Out-of-vocabulary words such as names can't be represented; (2) punctuation handling is unclear; (3) suffix variants of one word (run/runs/running) become unrelated tokens. The standard fix is a sub-word tokenizer — byte pair encoding (BPE) — which is a compromise between characters and full words, iteratively merging the most frequent adjacent token pair.

QIn \(X=\Omega_e T\), what are the shapes of \(\Omega_e\) and \(T\), and why does the multiplication amount to a lookup table?▾ tap to reveal▴ hide answer

\(\Omega_e\in\mathbb{R}^{D\times|\mathcal V|}\) (one column per vocabulary token; typically \(D=1024\), \(|\mathcal V|=30\)K) and \(T\in\mathbb{R}^{|\mathcal V|\times N}\) has one one-hot column per input token. Multiplying a matrix by a one-hot column just picks out the corresponding column of that matrix — so \(\Omega_e T\) selects each token's embedding. Result \(X\) is \(D\times N\), ready for the transformer stack.

QName the three transformer model types and the canonical example of each from the deck.▾ tap to reveal▴ hide answer

Encoder → BERT (representation by predicting missing tokens). Decoder → GPT-3 (autoregressive generation). Encoder–decoder → machine translation (read source, generate target). All are stacks of \(K\) transformer layers over the embedding matrix \(X\).

·  ·  ·
Part VI

VI. Encoder Model — BERT

BERT is an encoder: it sees the whole sentence and learns to fill in deliberately hidden words. The numbers are worth memorising.

Architecture (slide 30)

QuantityValue
Vocabulary30K tokens
Word embedding dimension1024
Transformer layers24
Heads per self-attention16
Query / key / value dimension (per head)64
Shape of \(\Omega_{vh},\Omega_{qh},\Omega_{kh}\)\(1024\times 64\)

Note the consistency check: \(16\text{ heads}\times 64 = 1024 = D\), exactly the \(D/H\) rule from Part III.

Pre-training (slide 31)

Self-supervised learning

BERT uses self-supervised learning. The pretext task is predicting missing words from sentences — words are hidden and the model must reconstruct them. Through this it learns the statistics of the language, with no need for manual annotation, because a large corpus of raw text is freely available.

Training settings: maximum input length 512 tokens, batch size 256, and 50 epochs of a 3.3-billion-word corpus.

Fine-tuning (slide 32)

After pre-training, BERT is fine-tuned on specific tasks:

In plain English

Pre-training teaches BERT general language from cheap, unlabelled text by playing fill-in-the-blank. Fine-tuning then cheaply specialises that general knowledge to a labelled task — classify a whole review, tag each word, or pick the answer span in a passage.

QWhat is BERT's pretext task, why does it count as self-supervised, and what is the practical benefit?▾ tap to reveal▴ hide answer

The pretext task is predicting missing (masked) words in a sentence. It is self-supervised because the "labels" — the hidden words — come for free from the raw text itself, so no manual annotation is needed. The benefit: you can train on enormous freely-available corpora (here, 50 epochs of a 3.3-billion-word corpus), learning rich language statistics that transfer to many downstream tasks via fine-tuning.

QBERT has \(D=1024\) and 16 heads. Verify the per-head dimension and give the shape of each head's projection matrices.▾ tap to reveal▴ hide answer

Per-head dimension \(=D/H = 1024/16 = 64\). So each head's queries, keys and values are 64-dimensional, and each projection \(\Omega_{vh},\Omega_{qh},\Omega_{kh}\) maps the 1024-dim input down to 64 dims — shape \(1024\times 64\). Across 16 heads this restores the full 1024 dimensions before \(\Omega_c\).

·  ·  ·
Part VII

VII. Decoder Model — GPT-3

A decoder generates text one token at a time, each new token conditioned on everything before it. The maths is just the chain rule of probability.

Autoregressive language modelling (slides 34–35)

GPT-3 is an autoregressive language model. Take the sentence “It takes great courage to let yourself appear weak.” Its probability factorises as a product of conditional probabilities (slide 34):

\[ \begin{aligned} P(&\text{It takes great courage to let yourself appear weak}) =\\ &P(\text{It})\times P(\text{takes}\mid\text{It})\times P(\text{great}\mid\text{It takes})\times P(\text{courage}\mid\text{It takes great})\\ &\times P(\text{to}\mid\text{It takes great courage})\times P(\text{let}\mid\text{It takes great courage to})\\ &\times P(\text{yourself}\mid\text{It takes great courage to let})\\ &\times P(\text{appear}\mid\text{It takes great courage to let yourself})\\ &\times P(\text{weak}\mid\text{It takes great courage to let yourself appear}). \end{aligned} \]

In general (slide 35), an autoregressive model factors the joint probability of \(N\) tokens as

\[ P(t_1,t_2,\dots,t_N)=P(t_1)\prod_{n=2}^{N}P(t_n\mid t_1,\dots,t_{n-1}). \]
In plain English

This is just the chain rule of probability: the chance of a sentence equals the chance of its first word, times the chance of the second given the first, and so on. A decoder learns each of these "next-word-given-the-past" distributions, so it can both score text and generate it.

Masked self-attention (slide 36)

To train a decoder we maximise the log probability of the input text. But there's a problem: if we pass in the full sentence, the model can cheat — it could "see" the future word it is meant to predict.

The solution · masking

Set the corresponding dot products in the self-attention computation to negative infinity before they pass through the softmax. Since \(\exp(-\infty)=0\), those positions get exactly zero attention weight, so each token can only attend to itself and earlier tokens — never to future ones.

key (attended-to position) → query → red = −∞ (future, blocked) green = allowed
Masked (causal) self-attention. Scores for "future" keys (above the diagonal) are set to \(-\infty\) before softmax, so they become 0 weight. Each token attends only to itself and the past — preventing the model from cheating during training.

Generating text (slide 37)

An autoregressive language model is a type of generative model: it defines a probability distribution over text sequences and enables sampling of new plausible text examples. The slide flags the generation process and its efficiency, and lists two strategies for coherence:

GPT-3 and few-shot learning (slide 38)

QuantityValue
Sequence length2048 tokens
Batch size3.2 million tokens
Transformer layers96 (includes sparse attention)
Word embedding size12288
Self-attention heads96
Value/query/key dimension128
Training tokens300 billion
Parameters175 billion
Key property · few-shot learning

GPT-3's headline property is that it performs many tasks without fine-tuning — so-called few-shot learning: you describe the task and give a handful of examples in the prompt, and the model adapts on the fly, no weight updates required.

(Consistency check: \(96\text{ heads}\times 128 = 12288\), again the \(D/H\) rule.)

QWrite the autoregressive factorisation of \(P(t_1,\dots,t_N)\) and explain in one line why a decoder can both score and generate text.▾ tap to reveal▴ hide answer

\(P(t_1,\dots,t_N)=P(t_1)\prod_{n=2}^{N}P(t_n\mid t_1,\dots,t_{n-1})\) — the chain rule. The model learns every "next token given the past" conditional, so it can multiply them to score a given sequence, or sample them one at a time to generate a new one.

QExactly how does masked self-attention prevent "cheating", mechanically?▾ tap to reveal▴ hide answer

Before the softmax, the dot-product scores for any future key positions (relative to the current query) are set to \(-\infty\). Softmax exponentiates, and \(\exp(-\infty)=0\), so those future tokens receive exactly zero attention weight. Each position can therefore only attend to itself and earlier positions, so when predicting token \(n\) the model cannot peek at tokens \(n,n+1,\dots\) that it's supposed to predict.

QWhat does "few-shot learning" mean for GPT-3, and how does it differ from BERT-style fine-tuning?▾ tap to reveal▴ hide answer

Few-shot learning means GPT-3 performs a new task by being shown a description and a few examples in the prompt, with no weight updates at all. BERT-style fine-tuning instead retrains (updates parameters of) the model on a labelled dataset for each task. Few-shot trades a task-specific training run for in-context demonstrations.

QCompare BERT and GPT-3 across model role, what they attend to, and rough scale.▾ tap to reveal▴ hide answer

BERT = encoder; bidirectional (each token attends to all tokens); trained by predicting masked words; 24 layers, \(D=1024\), 16 heads, 512-token inputs. GPT-3 = decoder; masked/causal attention (each token attends only to the past); autoregressive next-token prediction; 96 layers, \(D=12288\), 96 heads, 2048-token sequences, 175 billion parameters, 300 billion training tokens. BERT is for representation/understanding; GPT-3 is generative and does tasks few-shot.

·  ·  ·
Part VIII

VIII. Encoder–Decoder & Cross-Attention

Machine translation needs to read one language and write another. The bridge between the two stacks is cross-attention — self-attention's queries and keys come from different sequences.

In an encoder–decoder model (used for machine translation), the encoder processes the source sentence and the decoder generates the target. The decoder reaches into the encoder's output through cross-attention (slide 40).

Cross-attention — where Q, K, V come from

The crucial difference from self-attention: the queries come from the decoder, while the keys and values come from the encoder.

\[ \begin{aligned} Q &= \beta_q\mathbf{1}^{\mathsf T}+\Omega_q X_{dec}\\ K &= \beta_k\mathbf{1}^{\mathsf T}+\Omega_k X_{enc}\\ V &= \beta_v\mathbf{1}^{\mathsf T}+\Omega_v X_{enc} \end{aligned} \]

and the output is \( V\cdot\mathrm{Softmax}\!\big[K^{\mathsf T}Q\big]\).

Cross-attention Xdec Xenc Q (Nd) K (Ne) V (Ne) softmax[KᵀQ] output V·σ[KᵀQ]
Slide 40 redrawn. Decoder input \(X_{dec}\) (length \(N_d\)) supplies the queries; encoder input \(X_{enc}\) (length \(N_e\)) supplies the keys and values. The attention matrix is \(N_e\times N_d\) and the output is \(D\times N_d\) — one enriched vector per decoder position, informed by the whole source.
In plain English

Self-attention is a sentence talking to itself. Cross-attention is the output sentence asking questions of the input sentence: "as I write this target word (query), which source words (keys) should I look at, and what do they say (values)?" That is how a translator keeps the output faithful to the source.

QIn cross-attention, which sequence provides the queries and which provides the keys and values? What shape is the attention matrix?▾ tap to reveal▴ hide answer

Queries come from the decoder (\(Q=\beta_q\mathbf 1^{\mathsf T}+\Omega_q X_{dec}\)); keys and values come from the encoder (\(K=\beta_k\mathbf 1^{\mathsf T}+\Omega_k X_{enc}\), \(V=\beta_v\mathbf 1^{\mathsf T}+\Omega_v X_{enc}\)). With \(N_d\) decoder positions and \(N_e\) encoder positions, \(K^{\mathsf T}Q\) is \(N_e\times N_d\), and the output \(V\cdot\sigma[K^{\mathsf T}Q]\) is \(D\times N_d\) — one vector per decoder token.

·  ·  ·
Part IX

IX. Transformers for Long Sequences

Self-attention compares every token with every other token — cost grows with the square of the length. For long inputs we prune or sparsify those comparisons.

Pruning and sparsification (slide 42)

Objective

Reduce self-attention complexity by pruning interactions or sparsifying the interaction matrix. Example: restrict interactions to a convolutional structure, where each token only interacts with a few neighbouring tokens, instead of all \(N\).

The convolutional structure brings three features:

Two further strategies to speed up attention:

The benefit: reduced computational complexity, making long sequences tractable.

Exam trap · the quadratic cost

Standard self-attention forms the full \(N\times N\) score matrix \(K^{\mathsf T}Q\), so cost and memory scale quadratically with sequence length \(N\) (re-stated in the conclusion, slide 52). Every "long sequence" method here is a way to avoid materialising or computing all \(N^2\) interactions.

QWhy does vanilla self-attention scale quadratically with sequence length, and name two ways to reduce that cost.▾ tap to reveal▴ hide answer

Because it computes a dot product between every query and every key — an \(N\times N\) interaction matrix \(K^{\mathsf T}Q\) — so work and memory grow like \(N^2\). To reduce it: (1) sparsify/prune the interaction matrix, e.g. a convolutional/local structure where each token attends only to nearby tokens (with receptive-field expansion across layers); (2) selective attention and/or global tokens that act as shared hubs while most tokens stay local.

·  ·  ·
Part X

X. Transformers for Images

Images are just another sequence — of pixels, or of patches. Four models trace the arc from treating pixels as tokens (ImageGPT) to patch-based encoders (ViT) and multiscale designs (SWin, DaViT).

ImageGPT (slides 44–45)

ImageGPT — pixels as tokens

ImageGPT is an autoregressive transformer model for image pixels. Key facts:

  • Restricted to 64×64-pixel images because of the model's quadratic complexity.
  • Uses a quantized 9-bit colour space512 tokens per pixel.
  • Learns the positional encoding to handle 2D relationships.
  • Classification: average the pixel embeddings, then a linear layer with softmax.
  • Pre-trained on web images; fine-tuned on ImageNet (48×48 pixels) with combined loss functions.
  • Accuracy: 27.4% top-1 error on ImageNet.
  • Limitations: small or thin objects.

The slide-45 figure shows ImageGPT completing partially-masked images (e.g. a water droplet, a room, birds on a branch) — multiple plausible completions per input, demonstrating its generative nature.

Vision Transformer — ViT (slides 46–47)

ViT — patches as tokens

ViT is an encoder model with a special <cls> token. It was pre-trained on 303 million labelled images from 18K classes.

  • Patch division: images are split into 16×16 patches; each patch is mapped to a lower dimension by a learned linear transformation.
  • Positional encoding: 1D positional encodings are used.
  • Classification: the <cls> token is processed through a final layer and softmax for class probabilities.
  • Fine-tuning: adapt to a task by replacing the final layer.
  • Performance: 11.45% top-1 error on ImageNet.

Comparison to state-of-the-art: ViT outperformed convolutional networks with pre-training, but was less effective than SotA CNNs without pre-training.

image → 16×16 patches linear <cls> patch embeds Transformer×K MLP +softmax bird ball zebra class prob.
Slide 47 redrawn. The image is cut into 16×16 patches, each linearly embedded; a learnable <cls> token is prepended; the sequence passes through \(K\) transformer layers; the <cls> output goes through an MLP + softmax to give class probabilities.

Multiscale vision transformers (slides 48–50)

ViT operates on a single scale with fixed-size patches. Multi-scale transformers instead behave like CNNs: start with high-resolution patches and few channels, then gradually decrease resolution and increase channels. Two examples:

SWin Transformer (Shifted-Window)
  • Divides the image into patches and applies self-attention within windows (local, not global) — cutting cost.
  • Windows are shifted between layers to expand the receptive field (so information crosses window boundaries).
  • Scale is reduced by concatenating 2×2 patches and mapping features to increased channels.
  • No <cls> token: output features are averaged and mapped to class probabilities via softmax.
  • 9.89% top-1 error on ImageNet.
4×4 windows 2×2 (concat 2×2) 1×1 window resolution ↓ channels ↑ windows shift to grow receptive field
Slide 49 idea. Like a CNN, SWin progressively merges 2×2 patches to reduce spatial resolution while increasing channels; self-attention runs within windows, and shifting the windows lets information flow across their boundaries.
Dual Attention Vision Transformers (DaViT)

DaViT alternates between two types of transformers: one where image patches attend to one another using all channels, and one where channels attend to one another using all image patches. It achieves 9.60% top-1 error on ImageNet.

ModelTokensImageNet top-1 error
ImageGPTpixels (9-bit, 512/pixel)27.4%
ViT16×16 patches, <cls>11.45%
SWinshifted windows, no <cls>9.89%
DaViTpatch- & channel-attention9.60%
QWhat is ImageGPT's token, why is it limited to 64×64, and how does it classify?▾ tap to reveal▴ hide answer

Its tokens are pixels in a quantized 9-bit colour space (512 possible tokens per pixel). It is limited to 64×64 images because attention's quadratic complexity in the number of tokens makes more pixels infeasible. It classifies by averaging the pixel embeddings and passing the result through a linear layer + softmax. Top-1 error 27.4%; struggles on small/thin objects.

QHow does ViT turn an image into a sequence, and what is the role of the <cls> token?▾ tap to reveal▴ hide answer

ViT splits the image into 16×16 patches and maps each patch to a lower-dimensional embedding via a learned linear transformation, with 1D positional encodings added. A special <cls> token is included; after the transformer stack, the <cls> token's output is sent through a final layer + softmax to produce class probabilities — it acts as the aggregate "summary" representation used for classification. Fine-tuning replaces that final layer. ViT reached 11.45% top-1 error and beat CNNs with pre-training (but not without).

QGive two ways SWin differs from ViT.▾ tap to reveal▴ hide answer

Any two of: (1) SWin is multiscale — it reduces resolution by concatenating 2×2 patches and increasing channels (CNN-like), whereas ViT is single-scale with fixed patches. (2) SWin applies self-attention within windows (local), shifting them between layers to grow the receptive field, while ViT attends globally. (3) SWin has no <cls> token — it averages output features for classification — whereas ViT classifies from the <cls> token. (SWin: 9.89% vs ViT 11.45% top-1 error.)

·  ·  ·
Part XI

XI. Conclusion & Exam Readiness

The whole week, distilled into the points the lecturer chose to end on (slide 52).

On the architecture

On training and learning

Sign-off (slide 53)

The deck closes: “Thanks! See you in the next lecture on Graph Neural Networks.” — so the natural follow-on topic is GNNs.

QThe conclusion claims transformers have "low computational complexity per layer" yet "scale quadratically with sequence length." Reconcile these.▾ tap to reveal▴ hide answer

They describe different axes. "Low complexity per layer" / "parallelizable" means each layer is a handful of large matrix multiplications that run efficiently on hardware in parallel (no sequential recurrence like an RNN). "Quadratic in sequence length" refers to the \(N\times N\) attention matrix: the per-layer cost still grows like \(N^2\) in the number of tokens. So a layer is cheap and parallel for moderate \(N\), but becomes expensive as \(N\) grows — which is why Part IX's sparsification methods exist.

QMap each of "unsupervised", "predicting missing tokens", and "generative model" to the right transformer type.▾ tap to reveal▴ hide answer

Training on large unlabelled corpora (no manual annotation) is unsupervised learning and applies to transformers in general. Predicting missing tokens is the encoder (BERT) representation-learning objective. Building an autoregressive / generative model that can create new data examples is the decoder (GPT-3).

The week in one breath

Text is too big and too ambiguous for fully connected nets, so self-attention lets every token build itself from a data-dependent weighted mix of the others' values — weights set by query·key softmax — and wrapping that (scaled, multi-headed, position-encoded) in residual + LayerNorm + per-token-MLP layers gives the transformer, which becomes an encoder (BERT, fill-in-the-blank), a decoder (GPT-3, masked autoregression), or an encoder–decoder (translation via cross-attention), and extends to long sequences by sparsifying and to images via pixels (ImageGPT) or patches (ViT, SWin, DaViT).

Exam-readiness checklist — be able to…

  1. State the three motivations for transformers (input size 37×1024=37K, variable length / parameter sharing, language ambiguity / long-range data-dependent connections).
  2. Write the value, query, key equations and the output \(\alpha_i=\sum_j a_{j,i}v_j\) with \(\sum_j a_{j,i}=1\).
  3. Give the softmax attention weight \(a_{j,i}=\exp(k_j^{\mathsf T}q_i)/\sum_m\exp(k_m^{\mathsf T}q_i)\) and the matrix form \(A=V\sigma(K^{\mathsf T}Q)\) (softmax per column).
  4. Explain positional encoding (absolute \(\Pi\) vs relative) and why self-attention needs it (permutation-equivariance).
  5. Justify the \(\sqrt{D_q}\) scaling via the variance/saturation argument.
  6. Describe multi-head attention: \(D/H\) per head, parallel heads, concat + \(\Omega_c\); compute \(1024/16=64\).
  7. Recite the four transformer-layer operations and say which sub-block mixes tokens (attention) vs not (per-token MLP), plus the role of residuals + LayerNorm.
  8. Explain tokenization, the three problems with word tokens, and byte pair encoding (merge most frequent adjacent pair).
  9. Give the embedding lookup \(X=\Omega_e T\) with shapes (\(D{=}1024\), \(|\mathcal V|{=}30\)K).
  10. Name the three model types and their examples (encoder/BERT, decoder/GPT-3, enc-dec/translation).
  11. Recall BERT specs (24 layers, 16 heads, dim 64, 512 tokens, batch 256, 3.3B-word corpus) and its masked-word self-supervised pretext task + fine-tuning uses.
  12. Write the autoregressive factorisation \(P(t_1{\dots}t_N)=P(t_1)\prod_{n\ge2}P(t_n|t_{masked self-attention (set future scores to \(-\infty\)).
  13. Recall GPT-3 scale (96 layers, \(D{=}12288\), 96 heads, 2048 tokens, 175B params, 300B training tokens) and define few-shot learning; name beam search & top-\(k\).
  14. State cross-attention: queries from decoder, keys/values from encoder; output \(V\sigma[K^{\mathsf T}Q]\).
  15. Explain why attention is quadratic in \(N\) and list reductions (convolutional/local structure, selective attention, global tokens).
  16. Summarise the four image models and their ImageNet top-1 errors (ImageGPT 27.4%, ViT 11.45%, SWin 9.89%, DaViT 9.60%).
  17. Contrast ViT vs SWin (single- vs multi-scale, global vs windowed attention, <cls> vs averaged features).
  18. Reproduce the conclusion points: parallelizable, long-range, quadratic, sparsifiable; encoders predict missing tokens, decoders are generative.