From a confusing restaurant review to BERT, GPT-3, machine translation and Vision Transformers — how self-attention lets a network decide, for every word, which other words matter.
Before any maths: why do we even need a new architecture for text? Three properties of language break the networks we already know.
Imagine a network has to read this restaurant review and turn it into something a downstream task can use:
“The restaurant refused to serve me a ham sandwich because it only cooks vegetarian food. In the end, they just gave me two slices of bread. Their ambiance was just as good as the food and service.”
The goal is to process this text into a representation suitable for downstream tasks — for instance, classify the review's sentiment, or answer questions like “Does the restaurant serve steak?”
Notice how hard this is for a machine. To answer “does it serve steak?”, the network must connect “refused to serve”, “only cooks vegetarian” and “two slices of bread” — words scattered across the whole passage — and resolve pronouns like “it” and “they”. That difficulty is exactly what motivates the transformer.
Text turns out to be surprisingly large once encoded. The slide's worked figure: a 37-word passage with an embedding length of 1024 gives
A fully connected network would need a weight for every input number to every hidden unit, so it is impractical for large bodies of text.
NLP inputs are not a fixed size — sentences and documents differ in length. So we need to share parameters across different input positions, the same trick convolutional networks use for images (one kernel slides over every location).
From these observations, a good text model needs exactly two ingredients: (a) parameter sharing across positions, and (b) data-dependent connections between word representations whose strength depends on the words. Dot-product self-attention delivers both.
Two reasons from slide 5. (1) Size. Encoded text is large — e.g. 37 words × 1024-dim embeddings = 37K numbers — and a fully connected layer needs a separate weight for every input-to-hidden pair, which explodes. (2) Variable length & no sharing. Inputs vary in length and an FC net has fixed-size, position-specific weights with no parameter sharing across positions, so it cannot generalise a pattern learned at one position to another. Transformers fix both by sharing the value/query/key weights across all positions.
The core mechanism. Each output is a weighted average of transformed inputs — and the weights are computed from the inputs themselves.
A self-attention block is a function that takes \(N\) input vectors \(x_1,\dots,x_N\), each of dimension \(D\times 1\), and returns \(N\) output vectors of the same size. In NLP each input \(x_i\), for \(i\in\{1,\dots,N\}\), represents a word or word fragment. Transformers get both the parameter-sharing and the data-dependent connections of Part I from this one block.
For each input \(x_i\) we compute a value vector using a shared weight matrix and bias:
The same \(\Omega_v\) and \(\beta_v\) are applied to every input — that is the parameter sharing.
The \(i\)-th output of the block is a weighted sum of all the value vectors:
The scalar \(a_{j,i}\) is the attention that the \(i\)-th output pays to input \(x_j\). The weights for a given output are non-negative and sum to one, so each output is a convex combination (a weighted average) of the values.
To build the new representation of word \(i\), look at every word \(j\) in the sentence, give each a "relevance score" \(a_{j,i}\), and mix their value vectors in those proportions. A word builds its meaning from the words it pays attention to.
Two important facts the slide stresses: the overall self-attention computation is nonlinear, and the attention weights are themselves nonlinear functions of the input. They are produced in two steps.
We apply two more linear transformations to the inputs:
The \(\{q_i\}\) are called queries and the \(\{k_i\}\) are keys.
Take the dot product of every query with every key and push the results through a softmax:
Think of a lookup. Output \(i\) issues a query \(q_i\) ("what am I looking for?"). Every input advertises a key \(k_j\) ("what I am about") and a value \(v_j\) ("what I'll hand over"). The match score is the dot product \(k_j^{\mathsf T} q_i\) — large when query and key point the same way. Softmax turns those scores into a probability distribution over inputs, and we return the value-mixture in those proportions.
Stacking everything makes the whole block a few matrix products. The full parameter set is
Put the \(N\) inputs as columns of \(X\in\mathbb{R}^{D\times N}\). Then values, queries and keys are
where \(\mathbf{1}\) is an \(N\times 1\) vector of ones (it broadcasts the bias across all \(N\) columns). The block output is
where \(\sigma(\cdot)\) takes a matrix and performs the softmax independently on each of its columns.
Subscript order matters. \(a_{j,i}\) is the weight output \(i\) puts on input \(j\); the softmax normalises over \(j\) (all inputs) for each fixed \(i\) — i.e. over columns of \(K^{\mathsf T}Q\). Get the direction wrong and your weights won't sum to one per output. Also remember: \(\Omega_v,\Omega_q,\Omega_k\) are all \(D\times D\) in the single-head case and the bias is added via \(\beta\mathbf{1}^{\mathsf T}\), not just \(\beta\).
All three are linear transforms of the same inputs: \(q_i=\beta_q+\Omega_q x_i\), \(k_i=\beta_k+\Omega_k x_i\), \(v_i=\beta_v+\Omega_v x_i\). The values \(v_j\) are what gets mixed into each output: \(\alpha_i=\sum_j a_{j,i}v_j\). The queries and keys produce the weights via dot product + softmax: \(a_{j,i}=\mathrm{softmax}_j(k_j^{\mathsf T}q_i)\). Mnemonic: output \(i\)'s query asks a question; each input's key answers how relevant it is; the value is the content actually passed along.
With \(D\times N\) inputs, \(Q\) and \(K\) are \(D\times N\) (single head), so \(K^{\mathsf T}Q\) is \(N\times N\) — the score for every (key \(j\), query \(i\)) pair. \(\sigma\) applies softmax independently to each column, leaving an \(N\times N\) attention matrix whose columns sum to 1. Then \(V\) (which is \(D\times N\)) times that gives \(A\) of shape \(D\times N\): same shape as the input \(X\), so blocks can be stacked.
From the softmax and the product of input-dependent terms. The weights \(a_{j,i}\) are a softmax of dot products of \(x\)-dependent vectors (nonlinear in \(x\)), and the output multiplies these input-dependent weights by input-dependent values — so the output is a nonlinear function of the inputs even though each individual projection is linear. This is why the slide separately states "the attention weights are themselves nonlinear functions of the input."
Plain self-attention has three weaknesses. It ignores word order, its dot products can blow up, and one head can only model one kind of relationship. Three fixes follow.
Self-attention is permutation-equivariant: the computation is the same regardless of the order of the inputs \(x_i\). Shuffle the words and you get the same outputs (just reordered). "Dog bites man" and "man bites dog" would be indistinguishable. So position must be injected explicitly.
There are two main approaches to incorporating position information:
The dot products \(k_j^{\mathsf T}q_i\) can have large magnitudes (they sum over \(D_q\) terms). When the softmax inputs are huge, the function saturates: small changes to the inputs produce almost no change in the output, and gradients vanish.
The fix is to scale the dot products by the square root of the query/key dimension:
where \(D_q\) is the dimension of the queries and keys.
A dot product of two \(D_q\)-dimensional vectors is a sum of \(D_q\) products. If the entries are roughly unit-scale and independent, that sum has variance growing like \(D_q\), i.e. standard deviation \(\sqrt{D_q}\). Dividing by \(\sqrt{D_q}\) rescales the scores back to a sensible range so the softmax stays in its responsive, well-gradiented region.
Multi-head self-attention runs several self-attention mechanisms in parallel. We compute \(H\) different sets of values, keys and queries:
The \(h\)-th self-attention mechanism — its head — is
with different parameters \(\{\beta_{vh},\Omega_{vh}\}\), \(\{\beta_{qh},\Omega_{qh}\}\) and \(\{\beta_{kh},\Omega_{kh}\}\) for each head.
If the input dimension is \(D\) and there are \(H\) heads, the values, queries and keys are all of size \(D/H\) — which keeps the total work the same and allows an efficient implementation. The head outputs are vertically concatenated and a final linear transform \(\Omega_c\) combines them:
One head can only learn one "way of relating" words (say, subject→verb). Multiple heads let the layer look at the sentence through several relationship lenses at once — one might track syntax, another co-reference, another adjacency — then \(\Omega_c\) blends their findings into a single representation.
Because the standard deviation of the dot product grows like \(\sqrt{D_q}\), not \(D_q\). The dot product sums \(D_q\) roughly-independent unit-scale terms; variance adds, so variance \(\propto D_q\) and std \(\propto\sqrt{D_q}\). Dividing by \(\sqrt{D_q}\) restores roughly unit-variance scores, keeping the softmax in its sensitive region so small input changes still move the output (avoiding the saturation problem on slide 15). Dividing by \(D_q\) would over-shrink the scores.
\(D/H = 1024/16 = 64\). Each head's queries, keys and values are 64-dimensional (exactly BERT's setup, slide 30). Because each head works in a \(D/H\) subspace, the total compute and parameters across all \(H\) heads is comparable to a single full-dimension attention — you get \(H\) different relationship "lenses" essentially for free, then recombine with \(\Omega_c\).
Permutation-equivariant means: if you permute the input columns, the output columns are permuted the same way and otherwise unchanged — the mechanism has no notion of "first" vs "last". Adding an absolute positional matrix \(\Pi\) with a unique column per position makes each position's input distinct, so permuting the words now changes the values being attended over. With relative encodings, the attention scores depend on the offset between query and key positions, again injecting order. Either way, order information re-enters the computation.
Self-attention is the engine; the transformer layer is the full machine. It wraps attention with residual connections, normalisation and a per-token MLP — then stacks the whole thing many times.
A single transformer layer performs this series of operations (slide 20):
where the column vectors \(x_i\) are taken separately from the full data matrix \(X\). In a real network, the data passes through a series of these transformer layers.
Line 1 — residual around attention: add the multi-head output \(A_\eta\) back onto the input \(X\). Line 2 — LayerNorm. Line 3 — per-token MLP with its own residual: the same MLP is applied independently to each column \(x_i\) (parameter sharing across positions, exactly like a 1×1 operation). Line 4 — LayerNorm again.
Attention lets tokens exchange information (mixing across positions). The MLP then processes each token on its own (computation within a position). Residual connections ("add the input back") give gradients a shortcut so very deep stacks still train. LayerNorm keeps the activations at a stable scale between sub-blocks.
The MLP step is written per token: \(x_i \leftarrow x_i + \mathrm{MLP}(x_i)\). It does not mix information between tokens — only attention does. A common mistake is to think the MLP combines words. It processes each column independently (with shared weights). All cross-token communication happens in the attention sub-block.
Multi-head self-attention moves information between positions — each output is a weighted mix of all tokens' values. The per-token MLP (\(x_i\leftarrow x_i+\mathrm{MLP}(x_i)\)) transforms each position independently with shared weights and performs no cross-token mixing. Each sub-block has its own residual connection and is followed by LayerNorm.
Before text can enter a transformer it must become numbers. That means cutting it into tokens, turning tokens into embeddings, and choosing one of three model shapes.
A text-processing pipeline begins with a tokenizer, which splits text into smaller constituent units → tokens. Tokens must come from a pre-defined vocabulary of possible tokens.
Problems of using words as tokens: (1) some words — e.g. names — won't be in the vocabulary; (2) it's unclear how to handle punctuation; (3) the same word with different suffixes would get entirely different tokens (run, runs, running).
A possible solution is to use letters and punctuation marks as tokens. The actual solution is a compromise between letters and full words: the vocabulary is computed using a sub-word tokenizer such as byte pair encoding (BPE).
BPE starts with the tokens being just the individual characters and whitespace, together with their frequencies. At each iteration the tokenizer finds the most commonly occurring adjacent pair of tokens and merges it into a single new token, adding it to the vocabulary. Repeating this grows tokens from letters up towards common sub-words and words.
Each token in the vocabulary \(\mathcal V\) is mapped to a unique word embedding. The embedding matrix is
To obtain the embeddings: (1) the \(N\) input tokens are first encoded in the matrix \(T\in\mathbb{R}^{|\mathcal V|\times N}\) (one-hot columns); (2) the input embeddings are computed as
A typical embedding size is \(D=1024\); a typical total vocabulary size is \(|\mathcal V|=30\text{K}\).
The transformer model is a series of \(K\) transformer layers through which the embedding matrix \(X\) (representing the text) is passed. There are three types:
| Type | Idea | Example (later in deck) |
|---|---|---|
| Encoder | Build a representation of the whole input at once | BERT |
| Decoder | Generate text autoregressively, left to right | GPT-3 |
| Encoder–decoder | Read one sequence, produce another | Machine translation |
(1) Out-of-vocabulary words such as names can't be represented; (2) punctuation handling is unclear; (3) suffix variants of one word (run/runs/running) become unrelated tokens. The standard fix is a sub-word tokenizer — byte pair encoding (BPE) — which is a compromise between characters and full words, iteratively merging the most frequent adjacent token pair.
\(\Omega_e\in\mathbb{R}^{D\times|\mathcal V|}\) (one column per vocabulary token; typically \(D=1024\), \(|\mathcal V|=30\)K) and \(T\in\mathbb{R}^{|\mathcal V|\times N}\) has one one-hot column per input token. Multiplying a matrix by a one-hot column just picks out the corresponding column of that matrix — so \(\Omega_e T\) selects each token's embedding. Result \(X\) is \(D\times N\), ready for the transformer stack.
Encoder → BERT (representation by predicting missing tokens). Decoder → GPT-3 (autoregressive generation). Encoder–decoder → machine translation (read source, generate target). All are stacks of \(K\) transformer layers over the embedding matrix \(X\).
BERT is an encoder: it sees the whole sentence and learns to fill in deliberately hidden words. The numbers are worth memorising.
| Quantity | Value |
|---|---|
| Vocabulary | 30K tokens |
| Word embedding dimension | 1024 |
| Transformer layers | 24 |
| Heads per self-attention | 16 |
| Query / key / value dimension (per head) | 64 |
| Shape of \(\Omega_{vh},\Omega_{qh},\Omega_{kh}\) | \(1024\times 64\) |
Note the consistency check: \(16\text{ heads}\times 64 = 1024 = D\), exactly the \(D/H\) rule from Part III.
BERT uses self-supervised learning. The pretext task is predicting missing words from sentences — words are hidden and the model must reconstruct them. Through this it learns the statistics of the language, with no need for manual annotation, because a large corpus of raw text is freely available.
Training settings: maximum input length 512 tokens, batch size 256, and 50 epochs of a 3.3-billion-word corpus.
After pre-training, BERT is fine-tuned on specific tasks:
Pre-training teaches BERT general language from cheap, unlabelled text by playing fill-in-the-blank. Fine-tuning then cheaply specialises that general knowledge to a labelled task — classify a whole review, tag each word, or pick the answer span in a passage.
The pretext task is predicting missing (masked) words in a sentence. It is self-supervised because the "labels" — the hidden words — come for free from the raw text itself, so no manual annotation is needed. The benefit: you can train on enormous freely-available corpora (here, 50 epochs of a 3.3-billion-word corpus), learning rich language statistics that transfer to many downstream tasks via fine-tuning.
Per-head dimension \(=D/H = 1024/16 = 64\). So each head's queries, keys and values are 64-dimensional, and each projection \(\Omega_{vh},\Omega_{qh},\Omega_{kh}\) maps the 1024-dim input down to 64 dims — shape \(1024\times 64\). Across 16 heads this restores the full 1024 dimensions before \(\Omega_c\).
A decoder generates text one token at a time, each new token conditioned on everything before it. The maths is just the chain rule of probability.
GPT-3 is an autoregressive language model. Take the sentence “It takes great courage to let yourself appear weak.” Its probability factorises as a product of conditional probabilities (slide 34):
In general (slide 35), an autoregressive model factors the joint probability of \(N\) tokens as
This is just the chain rule of probability: the chance of a sentence equals the chance of its first word, times the chance of the second given the first, and so on. A decoder learns each of these "next-word-given-the-past" distributions, so it can both score text and generate it.
To train a decoder we maximise the log probability of the input text. But there's a problem: if we pass in the full sentence, the model can cheat — it could "see" the future word it is meant to predict.
Set the corresponding dot products in the self-attention computation to negative infinity before they pass through the softmax. Since \(\exp(-\infty)=0\), those positions get exactly zero attention weight, so each token can only attend to itself and earlier tokens — never to future ones.
An autoregressive language model is a type of generative model: it defines a probability distribution over text sequences and enables sampling of new plausible text examples. The slide flags the generation process and its efficiency, and lists two strategies for coherence:
| Quantity | Value |
|---|---|
| Sequence length | 2048 tokens |
| Batch size | 3.2 million tokens |
| Transformer layers | 96 (includes sparse attention) |
| Word embedding size | 12288 |
| Self-attention heads | 96 |
| Value/query/key dimension | 128 |
| Training tokens | 300 billion |
| Parameters | 175 billion |
GPT-3's headline property is that it performs many tasks without fine-tuning — so-called few-shot learning: you describe the task and give a handful of examples in the prompt, and the model adapts on the fly, no weight updates required.
(Consistency check: \(96\text{ heads}\times 128 = 12288\), again the \(D/H\) rule.)
\(P(t_1,\dots,t_N)=P(t_1)\prod_{n=2}^{N}P(t_n\mid t_1,\dots,t_{n-1})\) — the chain rule. The model learns every "next token given the past" conditional, so it can multiply them to score a given sequence, or sample them one at a time to generate a new one.
Before the softmax, the dot-product scores for any future key positions (relative to the current query) are set to \(-\infty\). Softmax exponentiates, and \(\exp(-\infty)=0\), so those future tokens receive exactly zero attention weight. Each position can therefore only attend to itself and earlier positions, so when predicting token \(n\) the model cannot peek at tokens \(n,n+1,\dots\) that it's supposed to predict.
Few-shot learning means GPT-3 performs a new task by being shown a description and a few examples in the prompt, with no weight updates at all. BERT-style fine-tuning instead retrains (updates parameters of) the model on a labelled dataset for each task. Few-shot trades a task-specific training run for in-context demonstrations.
BERT = encoder; bidirectional (each token attends to all tokens); trained by predicting masked words; 24 layers, \(D=1024\), 16 heads, 512-token inputs. GPT-3 = decoder; masked/causal attention (each token attends only to the past); autoregressive next-token prediction; 96 layers, \(D=12288\), 96 heads, 2048-token sequences, 175 billion parameters, 300 billion training tokens. BERT is for representation/understanding; GPT-3 is generative and does tasks few-shot.
Machine translation needs to read one language and write another. The bridge between the two stacks is cross-attention — self-attention's queries and keys come from different sequences.
In an encoder–decoder model (used for machine translation), the encoder processes the source sentence and the decoder generates the target. The decoder reaches into the encoder's output through cross-attention (slide 40).
The crucial difference from self-attention: the queries come from the decoder, while the keys and values come from the encoder.
and the output is \( V\cdot\mathrm{Softmax}\!\big[K^{\mathsf T}Q\big]\).
Self-attention is a sentence talking to itself. Cross-attention is the output sentence asking questions of the input sentence: "as I write this target word (query), which source words (keys) should I look at, and what do they say (values)?" That is how a translator keeps the output faithful to the source.
Queries come from the decoder (\(Q=\beta_q\mathbf 1^{\mathsf T}+\Omega_q X_{dec}\)); keys and values come from the encoder (\(K=\beta_k\mathbf 1^{\mathsf T}+\Omega_k X_{enc}\), \(V=\beta_v\mathbf 1^{\mathsf T}+\Omega_v X_{enc}\)). With \(N_d\) decoder positions and \(N_e\) encoder positions, \(K^{\mathsf T}Q\) is \(N_e\times N_d\), and the output \(V\cdot\sigma[K^{\mathsf T}Q]\) is \(D\times N_d\) — one vector per decoder token.
Self-attention compares every token with every other token — cost grows with the square of the length. For long inputs we prune or sparsify those comparisons.
Reduce self-attention complexity by pruning interactions or sparsifying the interaction matrix. Example: restrict interactions to a convolutional structure, where each token only interacts with a few neighbouring tokens, instead of all \(N\).
The convolutional structure brings three features:
Two further strategies to speed up attention:
The benefit: reduced computational complexity, making long sequences tractable.
Standard self-attention forms the full \(N\times N\) score matrix \(K^{\mathsf T}Q\), so cost and memory scale quadratically with sequence length \(N\) (re-stated in the conclusion, slide 52). Every "long sequence" method here is a way to avoid materialising or computing all \(N^2\) interactions.
Because it computes a dot product between every query and every key — an \(N\times N\) interaction matrix \(K^{\mathsf T}Q\) — so work and memory grow like \(N^2\). To reduce it: (1) sparsify/prune the interaction matrix, e.g. a convolutional/local structure where each token attends only to nearby tokens (with receptive-field expansion across layers); (2) selective attention and/or global tokens that act as shared hubs while most tokens stay local.
Images are just another sequence — of pixels, or of patches. Four models trace the arc from treating pixels as tokens (ImageGPT) to patch-based encoders (ViT) and multiscale designs (SWin, DaViT).
ImageGPT is an autoregressive transformer model for image pixels. Key facts:
The slide-45 figure shows ImageGPT completing partially-masked images (e.g. a water droplet, a room, birds on a branch) — multiple plausible completions per input, demonstrating its generative nature.
ViT is an encoder model with a special <cls> token. It was pre-trained on 303 million labelled images from 18K classes.
Comparison to state-of-the-art: ViT outperformed convolutional networks with pre-training, but was less effective than SotA CNNs without pre-training.
ViT operates on a single scale with fixed-size patches. Multi-scale transformers instead behave like CNNs: start with high-resolution patches and few channels, then gradually decrease resolution and increase channels. Two examples:
DaViT alternates between two types of transformers: one where image patches attend to one another using all channels, and one where channels attend to one another using all image patches. It achieves 9.60% top-1 error on ImageNet.
| Model | Tokens | ImageNet top-1 error |
|---|---|---|
| ImageGPT | pixels (9-bit, 512/pixel) | 27.4% |
| ViT | 16×16 patches, <cls> | 11.45% |
| SWin | shifted windows, no <cls> | 9.89% |
| DaViT | patch- & channel-attention | 9.60% |
Its tokens are pixels in a quantized 9-bit colour space (512 possible tokens per pixel). It is limited to 64×64 images because attention's quadratic complexity in the number of tokens makes more pixels infeasible. It classifies by averaging the pixel embeddings and passing the result through a linear layer + softmax. Top-1 error 27.4%; struggles on small/thin objects.
ViT splits the image into 16×16 patches and maps each patch to a lower-dimensional embedding via a learned linear transformation, with 1D positional encodings added. A special <cls> token is included; after the transformer stack, the <cls> token's output is sent through a final layer + softmax to produce class probabilities — it acts as the aggregate "summary" representation used for classification. Fine-tuning replaces that final layer. ViT reached 11.45% top-1 error and beat CNNs with pre-training (but not without).
Any two of: (1) SWin is multiscale — it reduces resolution by concatenating 2×2 patches and increasing channels (CNN-like), whereas ViT is single-scale with fixed patches. (2) SWin applies self-attention within windows (local), shifting them between layers to grow the receptive field, while ViT attends globally. (3) SWin has no <cls> token — it averages output features for classification — whereas ViT classifies from the <cls> token. (SWin: 9.89% vs ViT 11.45% top-1 error.)
The whole week, distilled into the points the lecturer chose to end on (slide 52).
The deck closes: “Thanks! See you in the next lecture on Graph Neural Networks.” — so the natural follow-on topic is GNNs.
They describe different axes. "Low complexity per layer" / "parallelizable" means each layer is a handful of large matrix multiplications that run efficiently on hardware in parallel (no sequential recurrence like an RNN). "Quadratic in sequence length" refers to the \(N\times N\) attention matrix: the per-layer cost still grows like \(N^2\) in the number of tokens. So a layer is cheap and parallel for moderate \(N\), but becomes expensive as \(N\) grows — which is why Part IX's sparsification methods exist.
Training on large unlabelled corpora (no manual annotation) is unsupervised learning and applies to transformers in general. Predicting missing tokens is the encoder (BERT) representation-learning objective. Building an autoregressive / generative model that can create new data examples is the decoder (GPT-3).
Text is too big and too ambiguous for fully connected nets, so self-attention lets every token build itself from a data-dependent weighted mix of the others' values — weights set by query·key softmax — and wrapping that (scaled, multi-headed, position-encoded) in residual + LayerNorm + per-token-MLP layers gives the transformer, which becomes an encoder (BERT, fill-in-the-blank), a decoder (GPT-3, masked autoregression), or an encoder–decoder (translation via cross-attention), and extends to long sequences by sparsifying and to images via pixels (ImageGPT) or patches (ViT, SWin, DaViT).