Two networks locked in a duel — a forger and a detective — and what happens when the detective gets too good. From the minimax loss, through Jensen–Shannon and Wasserstein, to StyleGAN.
The core idea of a GAN is almost mischievous: instead of telling the generator what good output looks like, you train a second network to catch fakes, and use its complaints as the only teaching signal.
Suppose you have a pile of real training examples (photographs of faces, say) written as $\{x_i\}$. You would like a machine that can invent new faces that look as though they came from the same source, even though they are not copies of any real photo. We call these invented examples $\{x_j^*\}$, and we want them drawn from the same distribution as the real data.
The trick is to never look at a real face and a fake face side by side and measure their difference directly. (How would you even define "difference" between two faces?) Instead, a GAN sets up a game between two networks.
Generator $g[z_j, \theta]$: takes a random latent variable $z_j$ (drawn from a simple base distribution, e.g. a normal) and turns it into a sample $x_j^*$. It is the forger.
Discriminator $f[x, \phi]$: takes an input $x$ and returns a scalar that is higher when it believes the input is real. It is the detective.
Generation works in two strokes: (1) choose a latent $z_j$ from the base distribution; (2) push it through the generator, $x_j^* = g[z_j, \theta]$. The training goal is to adjust the generator parameters $\theta$ until the generated samples $\{x_j^*\}$ are statistically indistinguishable from the real data $\{x_i\}$, at which point even an excellent detective can do no better than a coin flip.
You cannot write down "looks like a real face" as a loss function. But you can train a classifier to separate real from fake — that is an ordinary supervised problem. The GAN insight is to use the classifier's success as the generator's loss. If the detective can tell them apart, the forger gets a gradient telling it how to improve. The forger only stops improving when the detective is fooled.
To build intuition, the slides use the simplest generator imaginable, which just adds a single parameter $\theta$ to the latent:
$$x_j^* = g[z_j, \theta] = z_j + \theta$$
So the generated samples are just the latent samples shifted along the line by $\theta$. Training amounts to sliding the orange (synthesized) cloud until it sits on top of the teal (real) cloud.
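The shift generator is small enough to write out directly. The sketch below assumes a standard normal base distribution and an arbitrary shift of 3.0; the function name `generate` is illustrative, not from the slides.

```python
import numpy as np

def generate(z, theta):
    """Toy generator: shift every latent sample by the single parameter theta."""
    return z + theta

rng = np.random.default_rng(0)
z = rng.standard_normal(1000)      # latents from an assumed standard normal base
x_fake = generate(z, theta=3.0)    # generated samples: the same cloud, moved by theta

# The shift changes the sample mean by theta and leaves the spread untouched.
print(x_fake.mean() - z.mean())
print(np.isclose(x_fake.std(), z.std()))
```

Because the generator can only translate the cloud, this toy model can match the real data's location but never its shape, which is what keeps the example easy to reason about.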
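The discriminator faces a plain classification task. Label real examples $y = 1$ and generated ones $y = 0$. With the logistic sigmoid $\operatorname{sig}[\cdot]$, the standard binary cross-entropy objective, summed over every example $x_i$ with label $y_i$, is:
$$\hat{\phi} = \underset{\phi}{\operatorname{argmin}} \left[ \sum_i -(1 - y_i) \log\left[1 - \operatorname{sig}[f[x_i, \phi]]\right] - y_i \log\left[\operatorname{sig}[f[x_i, \phi]]\right] \right]$$
Now plug in the labels. Real examples carry $y_i = 1$ (only the second term survives); generated samples carry $y_i = 0$ (only the first term survives):
$$\hat{\phi} = \underset{\phi}{\operatorname{argmin}} \left[ \sum_j -\log\left[1 - \operatorname{sig}[f[x_j^*, \phi]]\right] - \sum_i \log\left[\operatorname{sig}[f[x_i, \phi]]\right] \right]$$
$\operatorname{sig}[f[x, \phi]]$ is the discriminator's estimated probability that $x$ is real. The first sum punishes the detective when it assigns high realness to fakes; the second punishes it when it assigns low realness to real data. Minimising it makes the detective a sharp judge.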
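As a rough numerical illustration of the two sums, the sketch below evaluates the cross-entropy on hand-picked discriminator scores. The function name `discriminator_loss`, the small `eps` guard, and the example scores are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def discriminator_loss(scores_real, scores_fake):
    """Binary cross-entropy with real examples labelled 1 and generated ones 0."""
    eps = 1e-12  # assumed guard against log(0); not part of the formula itself
    fake_term = -np.sum(np.log(1.0 - sigmoid(scores_fake) + eps))  # high realness on fakes is punished
    real_term = -np.sum(np.log(sigmoid(scores_real) + eps))        # low realness on real data is punished
    return fake_term + real_term

sharp = discriminator_loss(np.array([4.0, 5.0]), np.array([-4.0, -5.0]))
unsure = discriminator_loss(np.zeros(2), np.zeros(2))
fooled = discriminator_loss(np.array([-4.0, -5.0]), np.array([4.0, 5.0]))
print(sharp, unsure, fooled)
```

A detective that outputs 0.5 for everything pays log 2 per example (4 log 2, about 2.77, for these four examples); a sharp one pays much less and a fooled one much more.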
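Substitute the generator definition $x_j^* = g[z_j, \theta]$. Here is the pivotal observation: we want the generator to fool the detective, so we must maximise the same loss with respect to $\theta$ (push the generated samples toward being misclassified):
$$\hat{\theta} = \underset{\theta}{\operatorname{argmax}} \left[ \min_{\phi} \left[ \sum_j -\log\left[1 - \operatorname{sig}[f[g[z_j, \theta], \phi]]\right] - \sum_i \log\left[\operatorname{sig}[f[x_i, \phi]]\right] \right] \right]$$
One objective, two opposed optimisers. The discriminator parameters $\phi$ minimise the loss (try to classify correctly); the generator parameters $\theta$ maximise it (try to make the discriminator wrong). This is more complex than any single-objective loss seen before.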
The loss measures how badly the discriminator separates real from fake: it is the discriminator's classification error. The discriminator wants this error small (good separation), so it minimises. The generator wants the discriminator to fail, i.e. it wants the separation error large, so it maximises the very same quantity. They are pulling on opposite ends of one rope, hence "minimax."
In practice we do split it into two losses (next slide). The point of writing the single expression is to make the adversarial structure explicit: it is one game, not two unrelated training problems.
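GAN training is a minimax game, and to actually run gradient steps we split the joint objective into the part each network controls:
$$L[\phi] = \sum_j -\log\left[1 - \operatorname{sig}[f[g[z_j, \theta], \phi]]\right] - \sum_i \log\left[\operatorname{sig}[f[x_i, \phi]]\right]$$
$$L[\theta] = \sum_j \log\left[1 - \operatorname{sig}[f[g[z_j, \theta], \phi]]\right]$$
Here $L[\theta]$ is the generated-sample term multiplied by $-1$, so that maximising the joint objective becomes minimising $L[\theta]$ and both networks descend their own loss. We alternate: take a step minimising $L[\phi]$ (improve the detective), then a step on $L[\theta]$ (improve the forger), and repeat. Note $L[\theta]$ uses only the generated-sample term, because the real examples don't depend on $\theta$.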
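The claim that the real-data term contributes nothing to the generator update can be checked numerically. The sketch below assumes the toy shift generator (fake sample = latent + theta), a hypothetical linear discriminator score, and arbitrary parameter values; it compares finite-difference derivatives with respect to theta of the full loss and of the generated-sample term alone.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def f(x, phi):
    """Assumed toy discriminator: a linear score phi[0] + phi[1] * x."""
    return phi[0] + phi[1] * x

def fake_term(theta, phi, z):
    """Generated-sample term of the discriminator loss, with fakes z + theta."""
    return -np.sum(np.log(1.0 - sigmoid(f(z + theta, phi))))

def real_term(phi, x_real):
    """Real-sample term; theta does not appear anywhere in it."""
    return -np.sum(np.log(sigmoid(f(x_real, phi))))

def d_dtheta(fn, theta, h=1e-5):
    """Central finite-difference derivative of fn at theta."""
    return (fn(theta + h) - fn(theta - h)) / (2.0 * h)

rng = np.random.default_rng(0)
z = rng.standard_normal(200)
x_real = rng.standard_normal(200) + 4.0   # assumed real data centred at 4
phi = np.array([-1.0, 0.5])               # arbitrary discriminator parameters
theta = 0.5

grad_joint = d_dtheta(lambda t: fake_term(t, phi, z) + real_term(phi, x_real), theta)
grad_fake = d_dtheta(lambda t: fake_term(t, phi, z), theta)
print(grad_joint, grad_fake)  # the two agree up to floating-point error
```

Since the real term is a constant as far as theta is concerned, dropping it changes the value of the generator's loss but not the direction of any gradient step.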
The toy "add $\theta$" generator becomes a deep convolutional network in practice. The Deep Convolutional GAN (DCGAN) turns a 100-dimensional latent vector into a $64 \times 64 \times 3$ image.
Don't mix up the two networks. The generator uses fractionally-strided (a.k.a. transposed) convolutions to grow spatial size; the discriminator uses ordinary strided convolutions to shrink it. The generator ends in tanh; the discriminator ends in sigmoid.
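The grow-versus-shrink behaviour follows from the standard output-size formulas for the two layer types. The sketch below assumes kernel size 4, stride 2, padding 1 and a 4×4 starting feature map, a common choice in DCGAN implementations that the notes do not specify, and a 64×64 output image; the function names are illustrative.

```python
def transposed_conv_out(size, kernel=4, stride=2, padding=1):
    """Output width of a transposed convolution (no dilation, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

def strided_conv_out(size, kernel=4, stride=2, padding=1):
    """Output width of an ordinary strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1

# Generator: start from an assumed 4x4 feature map and double four times.
size, gen_sizes = 4, [4]
for _ in range(4):
    size = transposed_conv_out(size)
    gen_sizes.append(size)
print(gen_sizes)   # [4, 8, 16, 32, 64]

# Discriminator: start from the 64x64 image and halve four times.
size, disc_sizes = 64, [64]
for _ in range(4):
    size = strided_conv_out(size)
    disc_sizes.append(size)
print(disc_sizes)  # [64, 32, 16, 8, 4]
```

With these settings each transposed convolution exactly doubles the width and each strided convolution exactly halves it, so the two networks are mirror images in spatial size.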
A GAN is not an autoencoder: it never reconstructs a specific input and has no per-sample reconstruction loss. The generator maps random latents to images, and is trained only by an adversarial signal (a separate discriminator's success/failure), not by comparing its output to a target image. (Autoencoders, by contrast, have an encoder and a reconstruction objective; that distinction matters for the Week 1 autoencoder material and for VAEs next week.)
(1) Draw a mini-batch of latents $z_j$, generate fakes $x_j^* = g[z_j, \theta]$. (2) Grab a mini-batch of real $x_i$. (3) Update the discriminator by descending the discriminator loss $L[\phi]$: it learns to score $x_i$ high and $x_j^*$ low. (4) Freeze $\phi$; update the generator by ascending the adversarial term (descending the generator loss $L[\theta]$ as written): the fakes shift toward regions the detective currently calls "real." Repeat. Early on the detective wins easily; ideally they converge to a stalemate where $\operatorname{sig}[f[x, \phi]] = 0.5$ everywhere.
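The four steps can be sketched end to end on the one-parameter shift generator. Everything specific here is an assumption for illustration: real data drawn from a normal with mean 4, a linear discriminator score, hand-derived gradients, mean rather than summed losses (which only rescales the step), and the learning rate, batch size and step count.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
real_mean = 4.0        # assumed toy data: real samples ~ N(4, 1)
theta = 0.0            # generator parameter: fake = z + theta
phi = np.zeros(2)      # discriminator score: phi[0] + phi[1] * x
lr, batch, steps = 0.05, 256, 3000

for _ in range(steps):
    # (1) latents -> fakes, (2) a mini-batch of real examples
    x_fake = rng.standard_normal(batch) + theta
    x_real = rng.standard_normal(batch) + real_mean

    # (3) discriminator step: descend the cross-entropy loss in phi.
    # d/da of -log(1 - sig(a)) is sig(a); d/da of -log(sig(a)) is sig(a) - 1.
    s_fake = sigmoid(phi[0] + phi[1] * x_fake)
    s_real = sigmoid(phi[0] + phi[1] * x_real)
    grad_phi = np.array([
        s_fake.mean() + (s_real - 1.0).mean(),
        (s_fake * x_fake).mean() + ((s_real - 1.0) * x_real).mean(),
    ])
    phi -= lr * grad_phi

    # (4) generator step with phi frozen: descend mean log(1 - sig(score of fake)).
    # d/da of log(1 - sig(a)) is -sig(a), and the score changes with theta at rate phi[1].
    z = rng.standard_normal(batch)
    s_fake = sigmoid(phi[0] + phi[1] * (z + theta))
    grad_theta = -(s_fake * phi[1]).mean()
    theta -= lr * grad_theta

print(theta, phi)
```

In this toy setting theta should drift from 0 toward the real mean as the detective's slope turns positive, but this is a sketch with no convergence guarantee; adversarial updates can overshoot and oscillate.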
But there's a problem lurking in that last sentence — "early on the detective wins easily." When it wins too easily, the generator gets almost no usable gradient. The slides show Arjovsky et al. (2017) examples of this failure (muddy, mode-collapsed images). Understanding why is the whole of Part II.
Why does the original GAN train so badly? The answer is that its loss secretly measures a specific distance between distributions — the Jensen–Shannon divergence — and that distance gives no gradient when the two distributions don't overlap. The fix, the Wasserstein distance, needs a detour through linear programming and duality.
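We start by rewriting the discriminator loss in a more revealing form. Divide the two sums by the counts $I$ (real) and $J$ (generated). Each averaged sum becomes an expectation, and in the limit an integral over the data space:
$$\begin{aligned} L[\phi] &= -\frac{1}{J} \sum_{j=1}^{J} \log\left[1 - \operatorname{sig}[f[x_j^*, \phi]]\right] - \frac{1}{I} \sum_{i=1}^{I} \log\left[\operatorname{sig}[f[x_i, \phi]]\right] \\ &\approx -\mathbb{E}_{x^*}\left[\log\left[1 - \operatorname{sig}[f[x^*, \phi]]\right]\right] - \mathbb{E}_{x}\left[\log\left[\operatorname{sig}[f[x, \phi]]\right]\right] \\ &= -\int Pr(x^*) \log\left[1 - \operatorname{sig}[f[x^*, \phi]]\right] dx^* - \int Pr(x) \log\left[\operatorname{sig}[f[x, \phi]]\right] dx \end{aligned}$$
Here $Pr(x^*)$ is the distribution over generated samples and $Pr(x)$ the true distribution over real examples. We have turned a finite-sample loss into a statement about two probability distributions.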
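Consider a sample $\tilde{x}$ of unknown origin. Define the class-conditional densities $Pr(\tilde{x} \mid \text{real})$ and $Pr(\tilde{x} \mid \text{generated})$. By Bayes' theorem:
$$Pr(\text{real} \mid \tilde{x}) = \frac{Pr(\tilde{x} \mid \text{real}) \, Pr(\text{real})}{Pr(\tilde{x})}$$
where $Pr(\text{real})$ and $Pr(\text{generated})$ are the prior probabilities of a sample being real or generated. If we assume equal counts $I = J$, the priors are equal:
$$Pr(\text{real}) = Pr(\text{generated}) = \frac{1}{2}$$
By the law of total probability, $Pr(\tilde{x}) = Pr(\tilde{x} \mid \text{real}) \, Pr(\text{real}) + Pr(\tilde{x} \mid \text{generated}) \, Pr(\text{generated})$. Substituting the priors and plugging into Bayes, the priors cancel:
$$Pr(\text{real} \mid \tilde{x}) = \frac{Pr(\tilde{x} \mid \text{real})}{Pr(\tilde{x} \mid \text{real}) + Pr(\tilde{x} \mid \text{generated})}$$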
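The cancellation of the priors is easy to confirm numerically. The sketch below assumes two unit-variance Gaussians as the real and generated densities (means 4 and 0, chosen only for illustration) and compares the full Bayes computation with the simplified density ratio; the function names are illustrative.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def posterior_real(x, p_real, p_gen, prior_real=0.5):
    """Pr(real | x) via Bayes; the evidence comes from the law of total probability."""
    prior_gen = 1.0 - prior_real
    evidence = p_real(x) * prior_real + p_gen(x) * prior_gen
    return p_real(x) * prior_real / evidence

p_real = lambda x: gaussian_pdf(x, mean=4.0, std=1.0)  # assumed real density
p_gen = lambda x: gaussian_pdf(x, mean=0.0, std=1.0)   # assumed generated density

x = np.array([0.0, 2.0, 4.0])
with_priors = posterior_real(x, p_real, p_gen)         # equal priors of 1/2
ratio_only = p_real(x) / (p_real(x) + p_gen(x))        # priors cancelled by hand
print(with_priors)
print(np.allclose(with_priors, ratio_only))
```

At the midpoint between the two means the densities are equal and the posterior is exactly 0.5; with unequal priors (`prior_real` different from 0.5) the two expressions would no longer agree.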