KCL · Machine Learning · Week 16

Generative Adversarial Networks

Two networks locked in a duel — a forger and a detective — and what happens when the detective gets too good. From the minimax loss, through Jensen–Shannon and Wasserstein, to StyleGAN.

~70 min read · 16 self-tests · Lecturer: L. C. Garcia Peraza Herrera

Part I

Discrimination as a Signal

The core idea of a GAN is almost mischievous: instead of telling the generator what good output looks like, you train a second network to catch fakes, and use its complaints as the only teaching signal.

Suppose you have a pile of real training examples (photographs of faces, say) written as $\{x_i\}$. You would like a machine that can invent new faces that look like they came from the same source, even though they are not copies of any real photo. We call these invented examples $\{x^*_j\}$, and we want them drawn from the same distribution as the real data.

The trick is to never look at a real face and a fake face side by side and measure their difference directly. (How would you even define "difference" between two faces?) Instead, a GAN sets up a game between two networks.

Definition · The two players

Generator $g[z, \theta]$: takes a random latent variable $z$ (drawn from a simple base distribution, e.g. a normal) and turns it into a sample $x^* = g[z, \theta]$. It is the forger.

Discriminator $f[\cdot, \phi]$: takes an input and returns a scalar that is higher when it believes the input is real. It is the detective.

Generation works in two strokes: (1) choose a latent $z_j$ from the base distribution; (2) push it through the generator, $x^*_j = g[z_j, \theta]$. The training goal is to adjust the generator parameters $\theta$ until the generated samples $\{x^*_j\}$ are statistically indistinguishable from the real data $\{x_i\}$, at which point even an excellent detective can do no better than a coin flip.

Intuition · Why two networks?

You cannot write down "looks like a real face" as a loss function. But you can train a classifier to separate real from fake — that is an ordinary supervised problem. The GAN insight is to use the classifier's success as the generator's loss. If the detective can tell them apart, the forger gets a gradient telling it how to improve. The forger only stops improving when the detective is fooled.

A toy generator you can picture

To build intuition, the slides use the simplest generator imaginable: it just adds a single parameter $\theta$ to the latent:

$$x^*_j = g[z_j, \theta] = z_j + \theta$$

So the generated samples are just the latent samples shifted along the line by $\theta$. Training amounts to sliding the orange (synthesized) cloud until it sits on top of the teal (real) cloud.

[Figure: three panels, (a) θ = 3.0, (b) θ = 4.9, (c) θ = 6.7. Horizontal axis: data x; curve: Pr(real); legend: synthesized, real.]
Slide 5. As $\theta$ grows, the synthesized cloud (orange) slides toward the real cloud (teal). In (c) they overlap and the discriminator's sigmoid flattens: it can no longer separate them, which is exactly the goal.

The discriminator's loss: ordinary binary cross-entropy

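The discriminator faces a plain classification task. Label real examples $y = 1$ and generated ones $y = 0$. With the logistic sigmoid, the standard binary cross-entropy objective is:

$$\hat{\phi} = \underset{\phi}{\mathrm{argmin}} \left[ \sum_i -(1 - y_i) \log\big[1 - \mathrm{sig}[f[x_i, \phi]]\big] - y_i \log\big[\mathrm{sig}[f[x_i, \phi]]\big] \right]$$

Now plug in the labels. Real examples carry $y = 1$ (only the second term survives); generated samples carry $y = 0$ (only the first term survives):

$$\hat{\phi} = \underset{\phi}{\mathrm{argmin}} \left[ \sum_j -\log\big[1 - \mathrm{sig}[f[x^*_j, \phi]]\big] - \sum_i \log\big[\mathrm{sig}[f[x_i, \phi]]\big] \right]$$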

Intuition · Reading the loss

$\mathrm{sig}[f[x, \phi]]$ is the discriminator's estimated probability that $x$ is real. The first sum punishes the detective when it assigns high realness to fakes; the second punishes it when it assigns low realness to real data. Minimising it makes the detective a sharp judge.
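
To make the reading concrete, here is a minimal sketch that evaluates this loss for three hand-picked discriminators on toy 1-D data. The linear form `f[x] = phi0 + phi1 * x`, the cloud means (3 and 7) and the sample counts are illustrative assumptions, not taken from the slides.

```python
import math
import random

def sig(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-a))

def discriminator_loss(real, fake, phi0, phi1):
    """Binary cross-entropy for a linear discriminator f[x] = phi0 + phi1 * x."""
    # Generated samples (label 0): penalise a high "realness" score.
    loss_fake = sum(-math.log(1.0 - sig(phi0 + phi1 * x)) for x in fake)
    # Real samples (label 1): penalise a low "realness" score.
    loss_real = sum(-math.log(sig(phi0 + phi1 * x)) for x in real)
    return loss_fake + loss_real

rng = random.Random(0)
real = [rng.gauss(7.0, 1.0) for _ in range(200)]  # hypothetical real cloud
fake = [rng.gauss(3.0, 1.0) for _ in range(200)]  # hypothetical generated cloud

sharp = discriminator_loss(real, fake, phi0=-10.0, phi1=2.0)  # boundary at x = 5
blind = discriminator_loss(real, fake, phi0=0.0, phi1=0.0)    # always outputs 0.5
wrong = discriminator_loss(real, fake, phi0=10.0, phi1=-2.0)  # labels flipped
print(sharp, blind, wrong)
```

The sharp judge gets the lowest loss, the coin-flipping judge pays $\log 2$ per sample, and the judge with the labels flipped pays the most.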

Folding in the generator: a minimax game

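Substitute the generator definition $x^*_j = g[z_j, \theta]$. Here is the pivotal observation: we want the generator to fool the detective, so we must maximise the same loss with respect to $\theta$ (push the generated samples toward being misclassified):

$$\hat{\theta} = \underset{\theta}{\mathrm{argmax}} \left[ \min_{\phi} \left[ \sum_j -\log\big[1 - \mathrm{sig}[f[g[z_j, \theta], \phi]]\big] - \sum_i \log\big[\mathrm{sig}[f[x_i, \phi]]\big] \right] \right]$$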

Definition · Minimax

One objective, two opposed optimisers. The discriminator parameters $\phi$ minimise the loss (try to classify correctly); the generator parameters $\theta$ maximise it (try to make the discriminator wrong). This is more complex than any single-objective loss seen before.

Q. Why does the generator maximise the loss while the discriminator minimises it? Couldn't we just give the generator its own separate loss to minimise?

The loss measures the discriminator's classification error. The discriminator wants that error small (good separation), so it minimises. The generator wants the discriminator to fail, i.e. it wants the error large, so it maximises the very same quantity. They are pulling on opposite ends of one rope, hence "minimax."

And yes, in practice we do split it into two losses (next slide). The point of writing the single expression is to make the adversarial structure explicit: it is one game, not two unrelated training problems.

In practice: two loss functions

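GAN training is a minimax game, and to actually run gradient steps we split the joint objective into the part each network controls:

$$L[\phi] = \sum_j -\log\big[1 - \mathrm{sig}[f[g[z_j, \theta], \phi]]\big] - \sum_i \log\big[\mathrm{sig}[f[x_i, \phi]]\big]$$

$$L[\theta] = \sum_j \log\big[1 - \mathrm{sig}[f[g[z_j, \theta], \phi]]\big]$$

We alternate: take a step minimising $L[\phi]$ (improve the detective), then a step minimising $L[\theta]$ (improve the forger), and repeat. $L[\theta]$ is the generated-sample term of $L[\phi]$ with its sign flipped, so minimising it is the same as maximising the joint objective with respect to $\theta$. It contains only the generated-sample term because the real examples don't depend on $\theta$.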

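[Figure: the GAN training loop. Latent z → generator g[z,θ] → generated samples {x*}. Generated samples and real examples {x} feed the discriminator sig[f[•,φ]], which outputs the probability that its input is real. Discriminator loss: L[φ] = −Σ log[1 − sig f(x*)] − Σ log[sig f(x)] (x* → low probability, x → high probability). Generator loss: L[θ] = Σ log[1 − sig f(g[z,θ],φ)] (generated samples should get high probability from the discriminator).]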
Slide 9. The training loop. The discriminator loss (blue, top) flows back to the discriminator; the generator loss (orange, bottom) flows back to the generator. Real and generated samples are mixed before the discriminator scores them.

Making it deep: DCGAN

The toy "add $\theta$" generator becomes a deep convolutional network in practice. The Deep Convolutional GAN (DCGAN) turns a 100-dimensional latent vector into a $64 \times 64 \times 3$ image.

[Figure: DCGAN. Generator: z (100×1) → 4×4×1024 → 8×8×512 → 16×16×256 → 32×32×128 → 64×64×3 (project and reshape, fractionally-strided convolutions, tanh). Discriminator: x or x* → 32×32×128 → 16×16×256 → 8×8×512 → 4×4×1024 → Pr(real), 1×1 (strided convolutions, 4×4 convolution, sigmoid).]
Slide 10. DCGAN. The generator up-samples (project & reshape → fractionally-strided convolutions → tanh output). The discriminator down-samples with strided convolutions to a single sigmoid probability. Architecture from Radford et al. (2015); samples on slide 11 show faces, scenes and bedrooms.

Exam trap · "Fractional" vs "strided"

Don't mix them up. The generator uses fractionally-strided (a.k.a. transposed) convolutions to grow spatial size; the discriminator uses ordinary strided convolutions to shrink it. The generator ends in tanh; the discriminator ends in sigmoid.
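
The grow/shrink distinction can be checked with the standard output-size arithmetic for the two layer types. This is a sketch only: the kernel size 4, stride 2 and padding 1 are assumed values that happen to double or halve the width, and the notes do not state which hyperparameters the slides use.

```python
def transposed_conv_out(size, kernel=4, stride=2, padding=1):
    """Output width of a fractionally-strided (transposed) convolution."""
    return (size - 1) * stride - 2 * padding + kernel

def strided_conv_out(size, kernel=4, stride=2, padding=1):
    """Output width of an ordinary strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1

# Generator: grow from the 4x4 projected latent up to the 64x64 image.
gen_sizes = [4]
while gen_sizes[-1] < 64:
    gen_sizes.append(transposed_conv_out(gen_sizes[-1]))

# Discriminator: shrink the 64x64 image back down to 4x4.
disc_sizes = [64]
while disc_sizes[-1] > 4:
    disc_sizes.append(strided_conv_out(disc_sizes[-1]))

print(gen_sizes)   # [4, 8, 16, 32, 64]
print(disc_sizes)  # [64, 32, 16, 8, 4]
```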

Q. A friend says: "GANs are just autoencoders that reconstruct images." Correct them in one or two sentences.

No — a GAN never reconstructs a specific input and has no per-sample reconstruction loss. The generator maps random latents to images, and is trained only by an adversarial signal (a separate discriminator's success/failure), not by comparing its output to a target image. (Autoencoders, by contrast, have an encoder and a reconstruction objective — that distinction matters for the Week 1 autoencoder material and for VAEs next week.)

Example · One training iteration, narrated

(1) Draw a mini-batch of latents $z_j$, generate fakes $x^*_j = g[z_j, \theta]$. (2) Grab a mini-batch of real $x_i$. (3) Update the discriminator by descending $L[\phi]$: it learns to score $x_i$ high and $x^*_j$ low. (4) Freeze $\phi$; update the generator by ascending the adversarial term (descending $L[\theta]$ as written): the fakes shift toward regions the detective currently calls "real." Repeat. Early on the detective wins easily; ideally they converge to a stalemate where $\mathrm{sig}[f[x, \phi]] = 0.5$ everywhere.
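
The four narrated steps can be sketched for the toy shift generator. Everything specific here is an assumption for illustration: the helper name `gan_step`, the linear discriminator `f[x] = phi0 + phi1 * x`, the real cloud centred at 7, the batch size and the learning rate. The gradients are written out by hand, and the losses are averaged over the batch rather than summed.

```python
import math
import random

def sig(a):
    """Numerically stable logistic sigmoid."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def gan_step(theta, phi0, phi1, real, z, lr):
    """One alternating update for g[z] = z + theta and f[x] = phi0 + phi1 * x."""
    fake = [zj + theta for zj in z]                      # (1) generate fakes

    # (3) Discriminator: descend L[phi].
    # d/df of -log(1 - sig(f)) is sig(f); d/df of -log(sig(f)) is sig(f) - 1.
    s_fake = [sig(phi0 + phi1 * x) for x in fake]
    s_real = [sig(phi0 + phi1 * x) - 1.0 for x in real]
    g0 = sum(s_fake) / len(fake) + sum(s_real) / len(real)
    g1 = (sum(s * x for s, x in zip(s_fake, fake)) / len(fake)
          + sum(s * x for s, x in zip(s_real, real)) / len(real))
    phi0, phi1 = phi0 - lr * g0, phi1 - lr * g1

    # (4) Generator: phi frozen, descend L[theta] = mean log(1 - sig(f[z + theta])).
    # dL/dtheta = -mean(sig(f)) * phi1.
    g_theta = -phi1 * sum(sig(phi0 + phi1 * x) for x in fake) / len(fake)
    return theta - lr * g_theta, phi0, phi1

rng = random.Random(1)
theta, phi0, phi1 = 3.0, 0.0, 0.0
for _ in range(200):
    real = [rng.gauss(7.0, 1.0) for _ in range(64)]      # (2) real mini-batch
    z = [rng.gauss(0.0, 1.0) for _ in range(64)]
    theta, phi0, phi1 = gan_step(theta, phi0, phi1, real, z, lr=0.1)
print(theta)
```

The first step always pushes `theta` toward the real cloud, but nothing guarantees the loop settles there: adversarial updates of this kind can overshoot and oscillate around the stalemate.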

But there's a problem lurking in that last sentence — "early on the detective wins easily." When it wins too easily, the generator gets almost no usable gradient. The slides show Arjovsky et al. (2017) examples of this failure (muddy, mode-collapsed images). Understanding why is the whole of Part II.

· · ·
Part II

Improving Stability

Why does the original GAN train so badly? The answer is that its loss secretly measures a specific distance between distributions — the Jensen–Shannon divergence — and that distance gives no gradient when the two distributions don't overlap. The fix, the Wasserstein distance, needs a detour through linear programming and duality.

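We start by rewriting the discriminator loss in a more revealing form. Divide the two sums by the counts $I$ (real) and $J$ (generated). Each averaged sum becomes an expectation, and in the limit an integral over the data space:

$$L[\phi] = -\frac{1}{J} \sum_{j=1}^{J} \log\big[1 - \mathrm{sig}[f[x^*_j, \phi]]\big] - \frac{1}{I} \sum_{i=1}^{I} \log\big[\mathrm{sig}[f[x_i, \phi]]\big]$$

$$\approx -\mathbb{E}_{x^*}\Big[\log\big[1 - \mathrm{sig}[f[x^*, \phi]]\big]\Big] - \mathbb{E}_{x}\Big[\log\big[\mathrm{sig}[f[x, \phi]]\big]\Big]$$

$$= -\int Pr(x^*) \log\big[1 - \mathrm{sig}[f[x^*, \phi]]\big]\, dx^* - \int Pr(x) \log\big[\mathrm{sig}[f[x, \phi]]\big]\, dx$$

Here $Pr(x^*)$ is the distribution over generated samples and $Pr(x)$ the true distribution over real examples. We have turned a finite-sample loss into a statement about two probability distributions.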

What discriminator is optimal?

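Consider a sample $\tilde{x}$ of unknown origin. Define the class-conditional densities $Pr(\tilde{x} \mid \text{real}) = Pr(x)$ and $Pr(\tilde{x} \mid \text{generated}) = Pr(x^*)$, both evaluated at $\tilde{x}$. By Bayes' theorem:

$$Pr(\text{real} \mid \tilde{x}) = \frac{Pr(\tilde{x} \mid \text{real})\, Pr(\text{real})}{Pr(\tilde{x})}$$

where $Pr(\text{real})$ and $Pr(\text{generated})$ are the prior probabilities of a sample being real or generated. If we assume equal counts $I = J$, the priors are equal:

$$Pr(\text{real}) = Pr(\text{generated}) = \tfrac{1}{2}$$

By the law of total probability, $Pr(\tilde{x}) = Pr(\tilde{x} \mid \text{real})\, Pr(\text{real}) + Pr(\tilde{x} \mid \text{generated})\, Pr(\text{generated})$. Substituting the priors and plugging into Bayes, the priors cancel:

$$Pr(\text{real} \mid \tilde{x}) = \frac{Pr(\tilde{x} \mid \text{real})}{Pr(\tilde{x} \mid \text{generated}) + Pr(\tilde{x} \mid \text{real})} = \frac{Pr(x)}{Pr(x^*) + Pr(x)}$$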
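
As a numerical illustration of the equal-prior result, Pr(real | x) = Pr(x | real) / (Pr(x | generated) + Pr(x | real)), the sketch below evaluates it for two unit-variance Gaussian clouds. The Gaussian shapes and the means 3 and 7 are assumptions chosen to echo the toy example, not values from the slides.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def optimal_discriminator(x, real_mean, fake_mean, std=1.0):
    """Posterior probability that x is real, assuming equal priors."""
    p_real = gaussian_pdf(x, real_mean, std)
    p_fake = gaussian_pdf(x, fake_mean, std)
    return p_real / (p_fake + p_real)

for x in (3.0, 5.0, 7.0):
    print(x, optimal_discriminator(x, real_mean=7.0, fake_mean=3.0))
```

Midway between the clouds the optimal detective is at exactly 0.5; once the two distributions coincide it outputs 0.5 everywhere, which is the coin-flip stalemate.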

That last constraint, $|f_{i+1} - f_i| < 1$, is a discrete Lipschitz condition: the dual variable $f_i$ may not change too fast between neighbouring bins.

Intuition · Why the dual is the gift

The primal optimises over the whole transport matrix $P$ (one variable per pair of bins: huge, and impossible in continuous space). The dual optimises over a single function $f$ (one value per bin), with a slope cap. In the continuous limit that "single function with a slope cap" is exactly a neural network constrained to be Lipschitz, which is buildable.

Continuous case

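The primal generalises to a transport plan $\pi(x_1, x_2)$ (mass moved from position $x_1$ to $x_2$), whose marginals must match the two distributions:

$$D_w\big[Pr(x), Pr(x^*)\big] = \min_{\pi[\cdot, \cdot]} \left[ \iint \pi(x_1, x_2)\, \lVert x_1 - x_2 \rVert \, dx_1\, dx_2 \right]$$

and the dual generalises to an optimisation over a function $f[x]$:

$$D_w\big[Pr(x), Pr(x^*)\big] = \max_{f[x]} \left[ \int Pr(x^*)\, f[x^*]\, dx^* - \int Pr(x)\, f[x]\, dx \right]$$

subject to the constraint that the Lipschitz constant of $f$ is less than one.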

The Wasserstein GAN loss

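Now make it trainable. Maximise over the function space by optimising the parameters $\phi$ of a network $f[x, \phi]$ (now called a critic, not a classifier: there is no sigmoid). Approximate the integrals with samples, generated $x^*_j$ and real $x_i$:

$$L[\phi] = \frac{1}{J} \sum_j f[x^*_j, \phi] - \frac{1}{I} \sum_i f[x_i, \phi] = \frac{1}{J} \sum_j f[g[z_j, \theta], \phi] - \frac{1}{I} \sum_i f[x_i, \phi]$$

subject to the constraint that the critic has an absolute gradient norm of less than one at every position $x$:

$$\left| \frac{\partial f[x, \phi]}{\partial x} \right| < 1$$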

Example · What changed vs the vanilla GAN

Three concrete differences. (1) No sigmoid/log: the critic outputs a raw score, and the loss is a plain difference of means. (2) The critic is constrained to be 1-Lipschitz (gradient norm $< 1$). (3) Because the underlying distance is Wasserstein, the loss keeps decreasing smoothly as the clouds approach, even when they don't yet overlap, so the generator always has a gradient. This directly cures the vanishing-gradient disease from slide 19.
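
Point (3) can be checked numerically on small histograms. The sketch below is an illustration under assumed inputs (two-bin blocks on a 16-bin grid); it compares the Jensen–Shannon divergence with the 1-D earth-mover's distance, computed as the summed absolute difference of the two cumulative distributions.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two histograms (natural log)."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q):
    """Earth-mover's distance on a 1-D grid with unit spacing."""
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi          # running difference of the two CDFs
        total += abs(cdf_gap)
    return total

def block(offset, n_bins=16):
    """Histogram with mass 0.5 in each of two adjacent bins, starting at `offset`."""
    h = [0.0] * n_bins
    h[offset] = h[offset + 1] = 0.5
    return h

real = block(0)
for shift in (9, 5, 2, 1, 0):
    fake = block(shift)
    print(shift, round(js_divergence(real, fake), 4), wasserstein_1d(real, fake))
```

While the blocks are disjoint (shift 2 or more) the JS value sits flat at $\log 2$, so it cannot tell the generator which way to move; the earth-mover's distance shrinks in step with the shift.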

Exam trap · The Lipschitz constraint is mandatory

The dual identity is only valid when $f$ is 1-Lipschitz. Drop the constraint and the "max" is unbounded (you can scale $f$ up forever), so the loss is meaningless. Enforcing it, by weight clipping in the original WGAN or by a gradient penalty in WGAN-GP, is not an optional regulariser; it is what makes the quantity equal the Wasserstein distance.
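
A toy critic makes the point tangible. In this sketch the critic is assumed to be linear, `f[x] = a * x`, so its slope is just `a`, and clipping `a` to [-1, 1] is a crude stand-in for weight clipping. The clouds are identical shapes shifted by 4, for which the Wasserstein distance is exactly 4. The function names and data are hypothetical.

```python
import random

def critic_objective(a, real, fake):
    """Mean critic score on fakes minus mean score on reals, for f[x] = a * x."""
    return a * (sum(fake) / len(fake) - sum(real) / len(real))

def train_critic(real, fake, steps=100, lr=0.1, clip=1.0):
    """Gradient ascent on the objective, clipping `a` so that |df/dx| <= clip."""
    a = 0.0
    grad = sum(fake) / len(fake) - sum(real) / len(real)   # d(objective)/da
    for _ in range(steps):
        a += lr * grad
        a = max(-clip, min(clip, a))                       # enforce the slope cap
    return a

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(500)]
real = [zi + 7.0 for zi in z]     # hypothetical real cloud
fake = [zi + 3.0 for zi in z]     # same shape, shifted by 4

a = train_critic(real, fake)
print(a, critic_objective(a, real, fake))        # slope -1, estimate close to 4
print(critic_objective(10.0 * a, real, fake))    # without the cap: ten times larger
```

With the cap, the maximised objective recovers the distance between the clouds; scaling the critic by any factor inflates the number by the same factor, which is why the unconstrained maximum carries no information.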

Q. Both the optimal vanilla discriminator and the WGAN critic are "networks scoring inputs." State the single most important difference and its consequence.

The vanilla discriminator outputs a probability via a sigmoid and minimises JS divergence; when the clouds are separated its output saturates and the generator gradient vanishes. The WGAN critic outputs an unbounded 1-Lipschitz score and estimates the Wasserstein distance; its gradient stays informative even for disjoint distributions. Consequence: the WGAN critic can be trained to (near-)optimality without starving the generator, whereas an over-trained vanilla discriminator kills generator learning.

Q. Trace the full logical chain from "BCE classification loss" to "Wasserstein distance." (This is the spine of Part II and a favourite long-answer question.)

(1) The discriminator's BCE loss, averaged, becomes expectations/integrals over $Pr(x^*)$ and $Pr(x)$. (2) Solving for the optimal discriminator via Bayes (equal priors when $I = J$) gives $Pr(x) / (Pr(x^*) + Pr(x))$. (3) Substituting it back shows the loss equals the Jensen–Shannon divergence between $Pr(x^*)$ and $Pr(x)$. (4) JS saturates at $\log 2$ for disjoint supports → vanishing gradients. (5) Replace JS with the Wasserstein distance (earth-mover's work), which is well-defined for disjoint supports and smooth. (6) Its transport definition is a hard primal LP, so we take its dual (justified by strong duality), which optimises a single 1-Lipschitz function. (7) Parameterise that function as a network → the WGAN critic loss with gradient-norm $< 1$.

· · ·
Part III

Progressive Growing, Mini-batch Discrimination & Truncation

A better loss (Part II) stabilises training; these three architectural and sampling tricks push image quality and diversity. They are the engineering that made high-resolution GANs (e.g. faces at $1024 \times 1024$) possible.

Training a GAN to emit a megapixel image from scratch is hopeless — the discriminator instantly spots the chaos and the generator never finds its footing. Progressive growing (Karras et al., 2018) sidesteps this by starting tiny and adding resolution in stages.

[Figure: train at low resolution first, then add layers to both networks. Stage 1: 4×4; stage 2: 8×8; then 16×16, 32×32, …, and finally 1024×1024.]
Slides 47–48. Both generator and discriminator begin at $4 \times 4$ and synchronously grow: $4 \times 4 \to 8 \times 8 \to 16 \times 16 \to \dots \to 1024 \times 1024$. Each new resolution is faded in smoothly. Coarse structure is learned before fine detail is ever attempted.

Intuition · Curriculum for pixels

It is a curriculum: master the easy, low-frequency layout (where the head/sky/horizon go) before tackling hard, high-frequency detail (hair strands, skin pores). Each stage is a small, stable problem, and it starts from a good initialisation handed down by the previous stage. This both speeds training and dramatically improves final quality.
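
One way to picture the smooth fade-in is as a linear blend between the upsampled output of the previous stage and the output of the newly added layer, with a weight `alpha` that ramps from 0 to 1. The sketch below assumes that blend and nearest-neighbour upsampling; the toy 4×4 and 8×8 arrays stand in for real network outputs.

```python
def upsample2x(img):
    """Nearest-neighbour upsampling of a 2-D list by a factor of two."""
    return [[v for v in row for _ in range(2)] for row in img for _ in range(2)]

def fade_in(old_lowres, new_highres, alpha):
    """Blend the upsampled old output with the new layer's output."""
    up = upsample2x(old_lowres)
    return [[(1.0 - alpha) * u + alpha * n for u, n in zip(up_row, new_row)]
            for up_row, new_row in zip(up, new_highres)]

resolutions = [4 * 2 ** k for k in range(9)]     # 4, 8, 16, ..., 1024

old = [[float(r * 4 + c) for c in range(4)] for r in range(4)]   # toy 4x4 output
new = [[1.0] * 8 for _ in range(8)]                              # toy 8x8 output
for alpha in (0.0, 0.5, 1.0):
    blended = fade_in(old, new, alpha)
    print(alpha, blended[0][:4])
```

At `alpha = 0` the grown network reproduces the previous stage exactly, which is the warm start described above; by `alpha = 1` the new layer has taken over.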

Mini-batch discrimination — fighting mode collapse

Definition · Mini-batch discrimination

A mechanism designed to prevent mode collapse and ensure diversity in generated samples. The discriminator is allowed to look not just at one sample in isolation but at feature statistics across the whole mini-batch, comparing the spread of generated samples against the spread of real ones. These batch statistics are appended to the discriminator's feature maps, so they influence its verdict and hence the gradient the generator receives.

Intuition · Catching a one-trick forger

Mode collapse is when the generator finds one image that fools the detective and produces near-copies of it forever. If the detective only ever sees one sample at a time, it cannot notice the lack of variety. Let it see a whole batch and compute "how diverse is this batch compared to a real batch?" — now a collapsed generator is instantly caught, because its batch is suspiciously uniform. The pressure to match real-batch diversity forces the generator to cover more modes.
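
A much-simplified stand-in for the idea: compute one diversity number per batch (here the standard deviation across the batch, averaged over feature positions) and append it to every sample's features. This particular statistic and the helper names are assumptions for illustration; the notes do not specify the exact statistic used on the slides.

```python
import statistics

def minibatch_std_feature(batch):
    """Average, over feature positions, of the standard deviation across the batch."""
    n_features = len(batch[0])
    stds = [statistics.pstdev([sample[k] for sample in batch]) for k in range(n_features)]
    return sum(stds) / n_features

def append_batch_statistic(batch):
    """Give every sample one extra feature describing the diversity of its batch."""
    stat = minibatch_std_feature(batch)
    return [sample + [stat] for sample in batch]

diverse = [[0.1, 0.9, 0.3], [0.8, 0.2, 0.7], [0.4, 0.5, 0.1], [0.9, 0.1, 0.6]]
collapsed = [[0.5, 0.5, 0.5]] * 4       # mode collapse: every sample identical

print(minibatch_std_feature(diverse))    # clearly above zero
print(minibatch_std_feature(collapsed))  # 0.0
print(append_batch_statistic(collapsed)[0])
```

A discriminator that sees this extra feature can flag a batch whose diversity is far below that of real batches, even when each individual sample looks plausible.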

Truncation — trading diversity for quality at sampling time

This trick happens after training, when you sample. The latent $z$ is drawn from a base distribution; truncation keeps only latents within a threshold $\tau$ of the mean, for example by resampling any component whose magnitude exceeds $\tau$, before generating (Brock et al., 2019).

[Figure: truncation. τ = 2.0: diverse (more variety, more failures). τ = 0.04: canonical (fewer failures, near-identical).]
Slide 50. Truncating the latent toward its mean (lower $\tau$) yields cleaner, more "canonical" outputs but kills diversity (Brock et al. show spaniels becoming near-identical at $\tau = 0.04$); larger $\tau$ gives variety but more artefacts.
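
One common way to truncate, assumed here, is to redraw any latent component whose magnitude exceeds the threshold. The sketch uses a standard-normal base distribution and a hypothetical 64-dimensional latent.

```python
import random

def truncated_latent(dim, tau, rng):
    """Standard-normal latent in which any component with |z| > tau is resampled."""
    z = []
    while len(z) < dim:
        value = rng.gauss(0.0, 1.0)
        if abs(value) <= tau:        # keep only values within the threshold
            z.append(value)
    return z

rng = random.Random(0)
for tau in (2.0, 0.5, 0.04):
    z = truncated_latent(64, tau, rng)
    print(tau, max(abs(v) for v in z))
```

As `tau` shrinks, every sampled latent is squeezed toward the mean, so the generator is only ever asked for its most typical outputs.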

Q. Mini-batch discrimination and truncation both touch "diversity," yet pull in opposite directions. Distinguish them.

Mini-batch discrimination acts during training and increases diversity — it prevents mode collapse so the trained model can cover many modes. Truncation acts during sampling and decreases diversity on purpose — it sacrifices variety to raise per-image quality (samples near the latent mean are the ones the generator renders most reliably). One is a learning fix; the other is a knob you turn at inference to pick your point on the quality–diversity trade-off.

Q. Why can't you just train a GAN directly at $1024 \times 1024$ with a deep enough network? Why is progressive growing needed?

At full resolution from the start, the discriminator can trivially distinguish real from the generator's initial noise (high-res statistics are very informative), so the JS/adversarial signal saturates and the generator gets little usable gradient — the Part II problem, amplified. There is also an enormous, unstable optimisation landscape to navigate at once. Progressive growing turns one impossible problem into a sequence of easy, stable ones, each warm-started from the last, so coarse structure is locked in before fine detail is attempted.

· · ·
Part IV

Conditional Generation

A vanilla GAN generates something from the dataset, but you can't ask for "a 7" or "a bird." Conditional generation gives you a steering wheel — and the slides show three different ways to attach it.

All three variants feed extra information $c$ (a class label, an attribute vector) alongside the latent $z$. They differ in where the conditioning is enforced and what the discriminator is asked to do.

[Figure: three conditioning strategies. (a) Conditional GAN: z, c → g[z,c,θ]; the discriminator sig[f[•,•,φ]] sees the (image, attribute) pairs {[x*; c]} and {[x; c]} and outputs Pr(pair is real). (b) Auxiliary classifier GAN (ACGAN): z, c → g[z,c,θ] → x* (and real x); the discriminator has two heads, sig[f₁[•,φ]] for Pr(is real) and softmax[f₂[•,φ]] for Pr(class), so it also classifies the image's class. (c) InfoGAN: [z; c] → g[[z;c],θ] → x* (and real x); heads sig[f₁[•,φ]] for Pr(is real) and f₂[•,φ] for an estimate of c; c is unlabelled and the network recovers it to maximise information.]
Slide 52. Three conditioning strategies, increasing in subtlety from top to bottom.

Definition · The three variants

(a) Conditional GAN (cGAN): the attribute $c$ is fed to both generator and discriminator. The discriminator scores the pair $[x; c]$: is this image-plus-attribute combination real? This forces the image to match the requested attribute.

(b) Auxiliary classifier GAN (ACGAN): the discriminator has two heads, $\mathrm{sig}[f_1[\cdot, \phi]]$ for "is it real?" and $\mathrm{softmax}[f_2[\cdot, \phi]]$ for "which class?". The class label is supervised, so the discriminator both judges realism and classifies. (Odena et al., 2017.)

(c) InfoGAN: the code $c$ is part of the latent input $[z; c]$ and is not labelled. A second head $f_2[\cdot, \phi]$ tries to recover $c$ from the generated image, maximising the mutual information between $c$ and the output. The result: $c$ comes to encode interpretable factors with no supervision. (Chen et al., 2016.)
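
The difference in what each network is shown can be sketched as plain vector concatenation. The sizes below (a 4-dimensional latent, 10 classes, a 16-pixel flattened image) are arbitrary toy values, and `one_hot` is a hypothetical helper.

```python
def one_hot(label, n_classes):
    """Encode a class label as a vector with a single 1."""
    return [1.0 if k == label else 0.0 for k in range(n_classes)]

z = [0.3, -1.2, 0.7, 0.1]        # toy latent
c = one_hot(7, 10)               # the attribute: "make a 7"
x_fake = [0.0] * 16              # stand-in for a flattened generated image

generator_input = z + c          # all three variants give the generator both z and c
cgan_disc_input = x_fake + c     # cGAN: the discriminator scores the pair [x*; c]
acgan_disc_input = x_fake        # ACGAN / InfoGAN: image only; a second head predicts c

print(len(generator_input), len(cgan_disc_input), len(acgan_disc_input))
```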

Example · InfoGAN on MNIST (slide 54)

Trained on unlabelled MNIST, InfoGAN's codes spontaneously align with meaningful factors: a discrete code controls digit identity (sweeping it walks 0→9); a continuous code controls rotation/slant; another continuous code controls stroke width. Nobody told it what "rotation" or "width" was. It found them because they are the factors that carry the most information about the image.

Q. cGAN and InfoGAN both involve a code $c$, yet are nearly opposites. What is the key conceptual difference?

In cGAN the code is a known label you supply — you tell the model "make a 7," and conditioning is enforced by having the discriminator judge the (image, label) pair. In InfoGAN the code is unsupervised — you don't know what it means in advance; the model is rewarded for making the code recoverable from the output, so it discovers interpretable factors on its own. Short version: cGAN imposes meaning; InfoGAN discovers it.

Q. In ACGAN, the discriminator has two output heads. Name them, say what each predicts, and explain how this differs structurally from how cGAN handles the class label.

ACGAN's discriminator outputs two things from a single image input: (1) a sigmoid head giving the probability the image is real, and (2) a softmax head giving a probability distribution over the class the image belongs to. The generator still receives the class as input, but the discriminator is not shown the label — it must reconstruct it.

This is the structural contrast with cGAN: in cGAN the label is fed to both networks and the single discriminator head scores the pair ("does this image-and-label combination look real?"). In ACGAN the label is fed only to the generator; the discriminator's extra classification head supplies the conditioning pressure by being trained (on both real and generated images) to name the class. So cGAN verifies the label, ACGAN predicts it.

· · ·
Part V

Image Translation

Conditioning on a label is one thing; conditioning on a whole image is image translation — turn a sketch into a photo, a low-res into high-res, a horse into a zebra. The recurring recipe: a content loss to stay faithful to the input, plus an adversarial loss to look real.

Pix2Pix — paired translation

The generator is a U-Net that maps an input image to a prediction. Two losses train it:

[Figure: Pix2Pix. Input c → U-Net generator g[c,θ] → prediction. Content loss: prediction ≈ ground truth. Adversarial loss: the discriminator sig[f[•,•,φ]] outputs Pr(pair real), and the (input, prediction) pair should look real.]
Slide 56. Pix2Pix. Content loss: the prediction should resemble the paired ground-truth target. Adversarial loss: the (input, prediction) pair should look like a real pair to the discriminator. Demos: maps↔aerial, edges↔handbag, B&W↔colour, labels↔building facade.

Intuition · Why two losses, not one

Content loss alone (e.g. a pixel-wise distance) gives blurry, "average" predictions: it can't decide between equally valid sharp details, so it hedges. Adversarial loss alone would let the generator produce a perfectly realistic image that ignores the input. Together: content loss keeps it faithful, adversarial loss keeps it sharp and realistic.
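
A minimal sketch of how the two terms might be combined into a single generator loss. The mean absolute pixel difference, the weight `lam` and the function names are assumptions for illustration; `d_prob_real` stands for the discriminator's probability that the (input, prediction) pair is real.

```python
import math

def content_loss(prediction, target):
    """Mean absolute pixel difference between prediction and paired ground truth."""
    return sum(abs(p - t) for p, t in zip(prediction, target)) / len(target)

def generator_loss(prediction, target, d_prob_real, lam=10.0):
    """Adversarial term log(1 - D) plus lam times the content term (to be minimised)."""
    adversarial = math.log(1.0 - d_prob_real)
    return adversarial + lam * content_loss(prediction, target)

target = [0.0, 1.0, 1.0, 0.0]        # toy ground-truth pixels
prediction = [0.5, 1.0, 1.0, 0.0]    # toy generator output
print(content_loss(prediction, target))
print(generator_loss(prediction, target, d_prob_real=0.5))
```

The loss falls both when the prediction moves closer to the target and when the discriminator is more convinced the pair is real, so neither faithfulness nor realism can be ignored.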

SRGAN — super-resolution

Same template, specialised. The generator is a convolutional network that takes a low-resolution input and predicts a high-resolution image. The content term is a VGG (perceptual) loss: the prediction should agree with the real high-resolution image in the feature space of a pretrained VGG network, not just pixel-wise. It is combined with the adversarial "looks real" loss. (Ledig et al., 2017; the upscaled results are visibly sharper than bicubic interpolation.)

Exam trap · VGG/perceptual loss ≠ pixel loss

SRGAN's content loss is computed on VGG feature activations, not raw pixels. This is deliberate: pixel-wise loss for super-resolution produces over-smoothed results (the same blurring problem as Pix2Pix), whereas matching deep features rewards perceptually plausible high-frequency detail. If asked why SRGAN looks sharper than bicubic, "perceptual/VGG content loss + adversarial loss" is the answer.

CycleGAN — unpaired translation

Pix2Pix needs paired data (the same scene as sketch and photo). Often you only have two unpaired piles — horses here, zebras there. CycleGAN translates anyway, using a cycle-consistency loss.

[Figure: CycleGAN. Horse c → g[c,θ] → zebra c'. Adversarial loss: the discriminator sig[f[•,φ]] outputs Pr(real zebra), so c' should look like a real zebra. Cycle-consistency loss: g'[c',θ] maps the zebra back to a horse, which should match the original image.]
Slide 58. CycleGAN. Translate horse→zebra with $g$, then zebra→horse with a second generator $g'$; the round trip must return the original image (cycle-consistency loss), while a discriminator enforces that the intermediate zebra looks real (adversarial loss). Demos: horse↔zebra, photo↔Monet. (Zhu et al., 2017.)

Intuition · What cycle-consistency buys you

Without paired data there is no ground-truth target, so no ordinary content loss. Cycle-consistency replaces it: "if you turn a horse into a zebra and back, you must recover that same horse." This stops the generator from ignoring the input or scrambling content — it must preserve enough structure to be reversible — while the adversarial loss handles the "look like a zebra" part.
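
The round-trip requirement can be written down directly. In this sketch the two "generators" are hypothetical hand-written functions on short lists rather than networks, and the loss is assumed to be the mean absolute difference between each input and its reconstruction.

```python
def cycle_consistency_loss(batch, g, g_back):
    """Mean absolute difference between each input and its round trip g_back(g(x))."""
    total, count = 0.0, 0
    for x in batch:
        round_trip = g_back(g(x))
        total += sum(abs(a - b) for a, b in zip(x, round_trip))
        count += len(x)
    return total / count

def to_zebra(x):                 # hypothetical horse -> zebra map
    return [2.0 * v + 1.0 for v in x]

def to_horse(x):                 # exact inverse of to_zebra
    return [(v - 1.0) / 2.0 for v in x]

def lazy_to_horse(x):            # ignores its input entirely
    return [0.0 for _ in x]

horses = [[0.25, 0.5, 0.75], [1.0, 0.0, 0.5]]
print(cycle_consistency_loss(horses, to_zebra, to_horse))       # 0.0
print(cycle_consistency_loss(horses, to_zebra, lazy_to_horse))  # 0.5
```

A pair of maps that preserves the content round-trips perfectly; a map that throws the input away is penalised even though no paired target was ever used.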

Q. Pix2Pix uses a content loss against a ground-truth target; CycleGAN can't. Why, and what replaces it?

Pix2Pix is trained on paired data, so for each input there is a known correct output to compute a content loss against. CycleGAN is trained on unpaired collections (e.g. a set of horse photos and an unrelated set of zebra photos) — there is no matching target image, so a direct content loss is impossible. It is replaced by the cycle-consistency loss: translating to the other domain and back must reproduce the original input, which preserves content without ever needing a paired target.

Q. All three image-translation models (Pix2Pix, SRGAN, CycleGAN) combine an adversarial loss with a second "content"-type term. Match each model to how it computes that second term.

The adversarial loss is shared — a discriminator pushing outputs to look real. The second term differs because the available supervision differs:

Pix2Pix — paired data, so a direct content loss in pixel space: the prediction is compared to the known ground-truth target image.

SRGAN — paired (low-res in, high-res target), but pixel-wise comparison gives blurry results, so it uses a content/VGG loss: prediction and target are pushed to agree in the feature space of a pre-trained VGG network rather than raw pixels.

CycleGAN — unpaired, so no target exists; the second term is the cycle-consistency loss (translate across and back, recover the original). Exam trap: don't say all three use "content loss". Only Pix2Pix uses a plain pixel content loss; SRGAN uses a feature-space variant; CycleGAN uses cycle-consistency instead.

· · ·
Part VI

StyleGAN & Inverting GANs

StyleGAN's signature move is to separate the things you'd intuitively want to control independently — coarse structure vs. fine texture vs. random detail — and inject them at different scales of the generator.

In a plain GAN the latent $z$ is shoved in at the bottom and everything is entangled: nudge $z$ and pose, identity, hair and lighting all move together. StyleGAN reorganises the generator into three subsystems: a style path, a noise path, and the main generative pipeline that consumes both.

[Figure: StyleGAN. Style path: z (1×1×512) → fully-connected net → w (1×1×512) → linear transforms → styles y₁, y₂, y₃, applied as a per-channel scale and offset (AdaIN). Main pipeline: learned constant 4×4×512 → 4×4×512 → 8×8×512 → image. Noise path: z₁⊗ψ₁, z₂⊗ψ₂, z₃⊗ψ₃, added to every channel.]
Slide 60. StyleGAN. Latent $z$ is first mapped by a fully-connected network to an intermediate $w$; linear transforms turn $w$ into per-layer styles $y_1, y_2, y_3$ that scale-and-offset each channel (AdaIN). Separately, noise maps $z_1 \otimes \psi_1, z_2 \otimes \psi_2, z_3 \otimes \psi_3$ are added to every channel. The main pipeline starts from a learned constant and grows $4 \times 4 \to 8 \times 8 \to \dots \to$ image.

Definition · The three subsystems

Mapping network: latent $z$ ($1 \times 1 \times 512$) → fully-connected network → intermediate latent $w$ ($1 \times 1 \times 512$). Disentangling happens here.

Style: linear transforms of $w$ produce styles $y_1, y_2, y_3, \dots$ (each a scale and an offset). Each is applied as a per-channel scale and offset at a different layer: coarse layers control coarse attributes (pose, face shape), fine layers control fine attributes (hair, freckles, colour).

Noise: independent noise maps ($z_1 \otimes \psi_1$, $z_2 \otimes \psi_2$, etc.) are added to every channel, supplying stochastic, non-semantic detail (exact hair placement, skin pores).
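
The per-channel "scale and offset" can be sketched as adaptive instance normalisation (AdaIN): each channel is normalised to zero mean and unit standard deviation, then rescaled and shifted by its style values. The toy feature map, the style numbers and the small `eps` guard are assumptions for illustration.

```python
import math

def adain(channels, scales, offsets, eps=1e-8):
    """Normalise each channel, then apply its style scale and offset."""
    styled = []
    for values, scale, offset in zip(channels, scales, offsets):
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        styled.append([scale * (v - mean) / (std + eps) + offset for v in values])
    return styled

# Two channels of a toy flattened 2x2 feature map.
features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 30.0, 30.0]]
style_scale = [2.0, 0.5]       # hypothetical style: one scale per channel ...
style_offset = [5.0, -1.0]     # ... and one offset per channel
out = adain(features, style_scale, style_offset)
print(out)
```

After the operation each channel's mean equals its offset and its standard deviation equals its scale, whatever statistics the incoming features had. The style therefore overrides the channel statistics at that layer, which is what lets a style set at one scale act independently of the others.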

Intuition · Style vs noise, coarse vs fine

Two independent axes of control. Style = systematic, semantic content injected via $w$ (what the face is); noise = random texture (irrelevant micro-detail). And scale: inject at early/low-resolution layers to change coarse things, at late/high-resolution layers to change fine things. Slide 61 demonstrates exactly this: changing coarse/medium/fine styles independently, and increasing coarse vs fine noise independently, with the rest of the face held fixed.

Q. Why introduce the intermediate latent $w$ at all? Why not feed $z$ straight into the styles?

$z$ comes from a fixed simple prior (e.g. a Gaussian), whose shape need not match the real distribution of face factors, so forcing styles directly from $z$ entangles attributes (the manifold gets "warped"). The learned mapping network produces $w$ in a space free to be disentangled, where individual directions correspond to individual attributes. Controlling styles from $w$ is therefore far cleaner, which is why you can change "coarse style" without disturbing "fine style."

Inverting GANs — editing real photos

Definition · GAN inversion

Objective: edit a real image by projecting it into the latent space, manipulating the latent variables, and re-projecting back to image space. Challenge: a GAN maps latent → image, not image → latent, so there is no built-in inverse. Methods: (i) encoder-based (train a network to predict the latent), (ii) optimisation-based (search for the latent whose output best matches the image), and for StyleGAN specifically (iii) pivotal tuning.

Intuition · Why inversion is the gateway to editing

StyleGAN gives gorgeous, controllable generated faces, but to edit your photo you first need its latent code. Inversion finds the $z$ or $w$ that reproduces your photo; then you nudge that code along a meaningful direction (older, smiling, different hair) and regenerate. The hard part is that the inverse isn't unique or exact, which is why there are competing methods and why StyleGAN's pivotal tuning fine-tunes the generator slightly around the found code.
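
The optimisation-based route can be sketched with a deliberately tiny stand-in: a frozen linear "generator" whose latent is recovered by gradient descent on the squared reconstruction error. The matrix `A`, the step size and the function names are hypothetical; a real generator is nonlinear and the search is far harder.

```python
def generate(w, A):
    """Toy linear 'generator': image = A @ w."""
    return [sum(a * wk for a, wk in zip(row, w)) for row in A]

def invert(target, A, steps=300, lr=0.05):
    """Gradient descent on ||generate(w) - target||^2, starting from w = 0."""
    w = [0.0] * len(A[0])
    for _ in range(steps):
        residual = [o - t for o, t in zip(generate(w, A), target)]
        grad = [2.0 * sum(row[k] * r for row, r in zip(A, residual))
                for k in range(len(w))]
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w

A = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]    # hypothetical frozen generator weights
w_true = [0.5, -1.5]
photo = generate(w_true, A)                  # stand-in for the image to be edited

w_hat = invert(photo, A)                     # step 1: project into latent space
edited = generate([w_hat[0] + 1.0, w_hat[1]], A)   # steps 2-3: nudge and regenerate
print(w_hat, edited)
```

In this toy case the latent is recovered essentially exactly because the generator is linear and the target was produced by it; for a real photo the match is only approximate, which is the motivation given above for fine-tuning the generator around the found code.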

· · ·
Part VII

Conclusion

Everything in this week answers one question: how do you train a network to produce realistic data when you can't write down a loss for "realistic"? The answer was an adversary — and most of the week was about fixing the adversary's weaknesses.

The objective is to learn a generator network that transforms random noise into data indistinguishable from a training set. The mechanism has three moving parts: a generator that creates samples from random noise; a discriminator that tries to distinguish real examples from generated samples; and a training loop in which the generator is repeatedly updated to produce data that is increasingly "real" to the discriminator.

The core challenge (the thread running through everything)

Weak training signal: the original GAN formulation suffers when it's easy to tell whether samples are real or generated. That's precisely early training — and precisely when JS divergence saturates and the generator's gradient vanishes.

Each later section is a solution to a different facet of the problem:

Improvement | What it fixes / adds
Wasserstein | A more consistent training signal by measuring the Wasserstein (earth-mover's) distance: non-zero, smooth gradient even for disjoint distributions.
Convolutional tricks | Progressive growing, mini-batch discrimination and truncation enhance image quality and diversity (and enable high resolution).
Conditional | Control over output attributes (e.g. object class) and improved output relevance: cGAN, ACGAN, InfoGAN.
Image translation | Retains conditional information as images and uses the discriminator loss to favour realistic outputs: Pix2Pix, SRGAN, CycleGAN.
StyleGAN | Injects style and noise at different scales of the generator, enhancing realism and controllability.

Q. In one breath: name the single problem that motivates Parts II–VI, and the one fix each part contributes.

Problem: the adversarial signal is weak/unstable. Fixes: II swaps JS for Wasserstein (smooth gradient); III stabilises and sharpens via progressive growing + mini-batch discrimination + truncation; IV adds controllability via conditioning; V extends conditioning to whole images (translation) with content + adversarial losses; VI disentangles control by scale (StyleGAN) and enables editing real images (inversion).

The week in one breath

A GAN trains a forger against a detective; the detective's binary-cross-entropy loss is secretly the Jensen–Shannon divergence, which vanishes when fakes are obvious — so we replace it with the Wasserstein distance (computed via the LP dual as a 1-Lipschitz critic), then stack on progressive growing, mini-batch discrimination, truncation, conditioning, image-to-image translation, and StyleGAN's scale-separated style/noise to win quality, diversity and control.

Exam-readiness checklist — be able to…

  1. Write the BCE discriminator loss, assign labels (real $y = 1$, generated $y = 0$), and explain why the generator maximises while the discriminator minimises (the minimax $\mathrm{argmax}_\theta \min_\phi$).
  2. State the two practical losses $L[\phi]$ and $L[\theta]$, and reproduce the DCGAN structure (generator: fractional conv + tanh; discriminator: strided conv + sigmoid).
  3. Derive the optimal discriminator $Pr(x) / (Pr(x^*) + Pr(x))$ via Bayes with equal priors ($I = J$).
  4. Show the loss equals the Jensen–Shannon divergence, identify the quality and coverage terms, and explain vanishing gradients for disjoint supports.
  5. Define the linear programming problem; convert any LP to all- / matrix / standard form; recite the LP vocabulary (feasible set, free variable, optimal cost, unbounded below).
  6. Reproduce the Lagrangian motivation ( s.t. ) and the relaxed-problem lower-bound derivation .
  7. State the primal–dual correspondence table and both duality theorems (weak: every dual-feasible cost is a lower bound on every primal-feasible cost; strong: equal optimal costs).
  8. Define the Wasserstein distance (discrete primal with marginal constraints; dual with $|f_{i+1} - f_i| < 1$; continuous primal/dual with Lipschitz $f$).
  9. Write the WGAN critic loss with gradient-norm $< 1$, and say why the Lipschitz constraint is essential.
  10. Explain progressive growing, mini-batch discrimination (anti-mode-collapse), and truncation (quality↔diversity at sampling).
  11. Distinguish cGAN vs ACGAN vs InfoGAN, including InfoGAN's unsupervised, interpretable codes on MNIST.
  12. Contrast Pix2Pix (paired, content+adversarial), SRGAN (VGG content loss), CycleGAN (unpaired, cycle-consistency).
  13. Describe StyleGAN's mapping network $z \to w$, per-scale styles vs added noise, and the purpose of $w$; define GAN inversion and its three methods.
  14. Attach the right citations: Radford 2015 (DCGAN), Arjovsky 2017 / Arjovsky & Bottou 2017 (instability, vanishing gradients), Karras 2018 (progressive growing), Brock 2019 (truncation), Odena 2017 (ACGAN), Chen 2016 (InfoGAN), Ledig 2017 (SRGAN), Zhu 2017 (CycleGAN).

Next lecture: Variational Autoencoders — a second route to generative modelling, this time with an explicit, optimisable likelihood instead of an adversary.