Week 17 · Generative Models

Variational Autoencoders

Learning a probability distribution you can sample from — by training a network to be its own approximate inverse.

KCL · Machine Learning · ~55 min read · 17 self-tests · Lecturer: L. C. Garcia Peraza Herrera
where it all begins

PART I · Latent variable models

Some things we want to model are too lumpy and complicated to describe directly. The trick is to invent a hidden "knob" that, once set, makes the rest simple — then average over all settings of the knob.

Suppose you want to write down the probability of some data \(x\) — say, the probability of a particular face image, or a particular handwritten digit. Real data has a messy, multi-bumped distribution that no single tidy formula (like one Gaussian) can capture. A latent variable model sidesteps this by introducing an extra variable \(z\) that we never actually observe.

Definition · Latent variable model

We model a joint distribution \(Pr(x,z)\), where \(z\) is an unobserved (latent) variable. We then describe the probability of the data \(Pr(x)\) as a marginalization of this joint over \(z\):

\[ Pr(x) = \int Pr(x,z)\,dz = \int Pr(x\mid z)\,Pr(z)\,dz \]
In plain English

Think of \(z\) as a hidden setting. For each setting of \(z\), the conditional \(Pr(x\mid z)\) is a simple distribution. The observed data distribution \(Pr(x)\) is what you get when you blend all those simple pieces together, weighting each by how likely that setting \(Pr(z)\) is. "Marginalizing" just means "summing/integrating away the variable you don't care about, leaving only \(x\)."

The second equality uses the product rule \(Pr(x,z)=Pr(x\mid z)\,Pr(z)\). The integral becomes a sum when \(z\) is discrete.

The canonical example: a mixture of Gaussians

The friendliest latent variable model is the one-dimensional mixture of Gaussians. Here the latent variable \(z\) is discrete — it picks which Gaussian component a data point came from.

Definition · 1D mixture of Gaussians

The latent \(z\) is a category label \(n\in\{1,\dots,N\}\). The two ingredients are:

\[ Pr(z=n) = \lambda_n \qquad Pr(x\mid z=n) = \text{Norm}_x\!\left[\mu_n,\sigma_n^2\right] \]

where \(\lambda_n\) is the mixing weight (prior probability) of component \(n\), and each component is a normal with its own mean \(\mu_n\) and variance \(\sigma_n^2\).

The likelihood \(Pr(x)\) is found by marginalizing over the latent \(z\) — exactly the recipe above, but now the integral is a sum over the \(N\) components:

\[ \begin{aligned} Pr(x) &= \sum_{n=1}^{N} Pr(x,z=n)\\ &= \sum_{n=1}^{N} Pr(x\mid z=n)\cdot Pr(z=n)\\ &= \sum_{n=1}^{N} \lambda_n\cdot \text{Norm}_x\!\left[\mu_n,\sigma_n^2\right] \end{aligned} \]
In plain English

To draw a point: first roll a weighted die to choose a component \(n\) (weights \(\lambda_n\)), then draw \(x\) from that component's bell curve. The overall density is a weighted sum of bell curves — so it can have several bumps even though each ingredient has only one. That is the whole point: simple parts, complex whole.

[Figure: (a) marginal density Pr(x) plotted against x; (b) joint Pr(x,z) over three slices z=1, z=2, z=3, which marginalizing over z collapses to Pr(x).]
Slides 5–6. Each dashed curve is one weighted component \(\lambda_n\,\text{Norm}_x[\mu_n,\sigma_n^2]\); the solid aubergine curve is their sum, the marginal \(Pr(x)\). Panel (b): the joint lives over discrete slices \(z=1,2,3\); summing the slices ("marginalizing over the latent variable \(z\)") flattens them into the multimodal \(Pr(x)\).
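
Sketch · sampling and evaluating a mixture (illustrative, not from the slides)

The die-roll recipe and the weighted-sum density can be tried out in a few lines. This is a minimal sketch using only Python's standard library; the three weights, means and standard deviations are made-up values chosen so the bumps are well separated, and the names `mixture_pdf` and `sample_mixture` are hypothetical.

```python
import math
import random

# Hypothetical 3-component mixture; these numbers are illustrative only.
LAMBDAS = [0.2, 0.5, 0.3]   # mixing weights lambda_n (sum to 1)
MUS = [-3.0, 0.0, 4.0]      # component means mu_n
SIGMAS = [0.5, 1.0, 0.8]    # component standard deviations sigma_n


def normal_pdf(x, mu, sigma):
    """Density of Norm_x[mu, sigma^2] evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(x):
    """Pr(x) = sum_n lambda_n * Norm_x[mu_n, sigma_n^2], i.e. z marginalized out."""
    return sum(lam * normal_pdf(x, mu, s)
               for lam, mu, s in zip(LAMBDAS, MUS, SIGMAS))


def sample_mixture(rng):
    """Roll the weighted die for z = n, then draw x from component n."""
    n = rng.choices(range(len(LAMBDAS)), weights=LAMBDAS, k=1)[0]
    return n, rng.gauss(MUS[n], SIGMAS[n])


rng = random.Random(0)
draws = [sample_mixture(rng) for _ in range(20000)]

# The fraction of draws that came from each component should approach lambda_n.
fractions = [sum(1 for n, _ in draws if n == k) / len(draws)
             for k in range(len(LAMBDAS))]
print("component fractions:", fractions)

# The density is higher at a component mean than in the valley between two means.
print("Pr(0.0) =", mixture_pdf(0.0), " Pr(-1.5) =", mixture_pdf(-1.5))
```

Moving the means closer together (say `MUS = [-0.5, 0.0, 0.5]`) is a quick way to see the bumps merge into one.
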
Q · Why can a mixture of three Gaussians have three bumps when each Gaussian has only one?

Because \(Pr(x)=\sum_n \lambda_n\,\text{Norm}_x[\mu_n,\sigma_n^2]\) is a weighted sum of the components. If the means \(\mu_n\) are far apart relative to the variances, each component dominates in its own region of \(x\), producing a separate local peak. Marginalization adds the densities pointwise; it never forces unimodality. (If the means are close, the bumps can merge into one.)

Q · In the marginalization \(Pr(x)=\int Pr(x\mid z)Pr(z)\,dz\), which factor is "simple" and which makes the result complex?

The conditional \(Pr(x\mid z)\) is the simple factor — for any fixed \(z\) it is an easy distribution (a single Gaussian, say). The complexity of \(Pr(x)\) comes from blending many of these simple conditionals together, weighted by the prior \(Pr(z)\). The model is "rich on the outside, simple on the inside."

·  ·  ·
from a die-roll to a deep network

PART II · The nonlinear latent variable model

Replace the discrete die-roll with a continuous Gaussian knob, and replace the lookup table of component means with a neural network. Now the latent variable can paint an infinitely flexible distribution.

The mixture of Gaussians had a discrete latent variable. The VAE's generative model keeps the same marginalization idea but makes \(z\) a continuous vector and lets a neural network decide how \(z\) maps to data. Three pieces define it.

Definition · Nonlinear latent variable model

Prior — the latent has a standard multivariate normal prior (a unit Gaussian blob centred at the origin):

\[ Pr(z) = \text{Norm}_z[\mathbf{0},\mathbf{I}] \]

Likelihood — given \(z\), the data is normal with a mean computed by a network \(\mathbf{f}[z,\phi]\) and a fixed spherical covariance:

\[ Pr(x\mid z,\phi) = \text{Norm}_x\!\Big[\mathbf{f}[z,\phi],\,\sigma^2\mathbf{I}\Big] \]

Data probability — marginalize over \(z\):

\[ \begin{aligned} Pr(x\mid\phi) &= \int Pr(x,z\mid\phi)\,dz\\ &= \int Pr(x\mid z,\phi)\cdot Pr(z)\,dz\\ &= \int \text{Norm}_x\!\big[\mathbf{f}[z,\phi],\sigma^2\mathbf{I}\big]\cdot\text{Norm}_z[\mathbf{0},\mathbf{I}]\,dz \end{aligned} \]
In plain English

It is a mixture of Gaussians with infinitely many components — one for every point \(z\) in the continuous latent space. The network \(\mathbf{f}[z,\phi]\) is a lookup function that tells you where to place the little Gaussian blob for each \(z\). Because \(\mathbf{f}\) is nonlinear, sweeping \(z\) drags the blob along a curved path, and the blended result \(Pr(x\mid\phi)\) can be a curved, complicated density. The parameters \(\phi\) are the weights of that decoder network.

Exam trap · two networks, two parameter sets

Keep \(\phi\) and \(\theta\) straight from the start. \(\phi\) = parameters of the decoder / likelihood network \(\mathbf{f}[z,\phi]\) (the generative model). Later, \(\theta\) = parameters of the encoder network \(\mathbf{g}[x,\theta]\) (the variational approximation). Mixing them up is the single most common mistake on this material.

[Figure: left, the joint Pr(x,z|φ), with blobs sliding along a curve over the latent axis z and the data axis x₂; right, the marginal Pr(x|φ) obtained by marginalizing over z.]
Slides 9 & 11. Left: the joint \(Pr(x,z\mid\phi)\). For each latent value the likelihood is a small spherical Gaussian (mean \(\mathbf{f}[z,\phi]\), variance \(\sigma^2\mathbf{I}\)); as \(z\) varies, the blob's centre traces a curve. Right: marginalizing over \(z\) smears all the blobs together into the curved manifold density \(Pr(x\mid\phi)\) — a horseshoe that no single Gaussian could ever produce.

Generation: ancestral sampling

Once the model is trained, generating a fresh sample is a clean three-step pipeline. Because the joint factors as \(Pr(z)\,Pr(x\mid z)\), you sample the parent \(z\) first, then the child \(x\) — hence ancestral sampling.

Worked recipe · ancestral sampling (slide 10)

1 — Draw latent. Sample a latent value from the prior: \[ z^{*} \sim Pr(z) \]

2 — Compute likelihood mean. Push \(z^{*}\) through the decoder \(\mathbf{f}[z^{*},\phi]\) to get the mean of the likelihood: \[ \text{Mean}\big(Pr(x\mid z^{*},\phi)\big)=\mathbf{f}[z^{*},\phi] \]

3 — Draw data sample. Sample a new example from that likelihood: \[ x^{*} \sim Pr(x\mid z^{*},\phi) \]

In plain English (slide 11)

Panel (a): pick a point \(z^{*}\) under the unit-Gaussian prior \(\text{Norm}_z[0,1]\). Panel (b): the decoder turns that into a narrow Gaussian blob sitting somewhere on the data manifold; that blob is \(Pr(x\mid z^{*},\phi)\), and your sample \(x^{*}\) is drawn from it. Panel (c): if you repeated this for every \(z\) and averaged, weighting each by the prior, you'd recover the full curved marginal \(Pr(x\mid\phi)=\int Pr(x\mid z,\phi)\,Pr(z)\,dz\).
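
Sketch · ancestral sampling with a stand-in decoder (illustrative)

The three steps of the recipe map directly onto code. In this minimal sketch the decoder is not a trained network: `decoder` is a hypothetical hand-written curve \(z\mapsto(z,z^2)\) standing in for \(\mathbf{f}[z,\phi]\), the latent is one-dimensional, the data is two-dimensional, and `SIGMA` is an arbitrary choice for the likelihood standard deviation.

```python
import random

SIGMA = 0.1  # likelihood standard deviation (the sigma in sigma^2 * I); arbitrary value


def decoder(z):
    """Hypothetical stand-in for f[z, phi]: maps a 1D latent to a 2D mean on a parabola."""
    return (z, z * z)


def ancestral_sample(rng):
    """Sample the parent z first, then the child x."""
    z_star = rng.gauss(0.0, 1.0)                       # step 1: z* ~ Pr(z) = Norm[0, 1]
    mean = decoder(z_star)                             # step 2: mean of Pr(x | z*, phi)
    x_star = tuple(rng.gauss(m, SIGMA) for m in mean)  # step 3: x* ~ Pr(x | z*, phi)
    return z_star, x_star


rng = random.Random(0)
samples = [ancestral_sample(rng) for _ in range(5000)]
for z_star, x_star in samples[:3]:
    print("z* =", z_star, " x* =", x_star)
```

Scatter-plotting the `x_star` values would show a noisy parabola: a curved density that no single Gaussian in \(x\) could produce, even though every individual step only ever samples from a Gaussian.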

Q · Why is the prior \(Pr(z)=\text{Norm}_z[\mathbf{0},\mathbf{I}]\) chosen to be so boring (a plain unit Gaussian) when the data is complicated?

All the complexity is offloaded to the nonlinear decoder \(\mathbf{f}[z,\phi]\). A trivial prior is (i) trivially easy to sample from at generation time, and (ii) gives a fixed, known target the encoder can be pulled toward. The network learns to warp the simple Gaussian blob into whatever shape the data needs — so the prior need not be expressive itself. Simplicity of the prior is a feature, not a bug.

Q · Contrast this nonlinear latent model with the mixture of Gaussians from Part I in one sentence.

The mixture has a discrete latent (finitely many components, each with a hand-stored mean), whereas the nonlinear model has a continuous latent and a network \(\mathbf{f}[z,\phi]\) that produces a different Gaussian mean for every point in latent space — effectively a mixture with uncountably many components, all tied together by shared weights \(\phi\).

·  ·  ·
the integral we cannot do

PART III · Training & the evidence lower bound

Maximum likelihood demands an integral over all latent values — and that integral is intractable. The escape route is to maximize a clever lower bound on the likelihood instead, built from Jensen's inequality.

Training a probabilistic model means maximum likelihood: choose the parameters that make the observed data as probable as possible. For our dataset \(\{x_i\}_{i=1}^{I}\) we want

Definition · Maximum likelihood objective (slide 13)

\[ \hat{\phi} = \operatorname*{argmax}_{\phi}\left[\sum_{i=1}^{I}\log\big[Pr(x_i\mid\phi)\big]\right] \]

where each term contains the intractable marginal

\[ Pr(x_i\mid\phi) = \int \text{Norm}_{x_i}\!\big[\mathbf{f}[z,\phi],\sigma^2\mathbf{I}\big]\cdot\text{Norm}_z[\mathbf{0},\mathbf{I}]\,dz. \]
Exam trap · why we cannot just do this

That integral has no closed form: the network \(\mathbf{f}[z,\phi]\) sits inside the Gaussian, so you cannot integrate \(z\) out analytically, and numerically integrating over a high-dimensional latent space is hopeless. This intractability is the entire reason the VAE exists. Every subsequent idea — the ELBO, the encoder, sampling, reparametrisation — is machinery to dodge this one integral.
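
Sketch · what brute force looks like in one dimension (illustrative)

To get a feel for the integral, one can estimate it naively by averaging \(Pr(x\mid z,\phi)\) over samples from the prior. The sketch below does this under a loud assumption: the decoder is replaced by a hypothetical linear map \(\mathbf{f}[z,\phi]=Az\) with a one-dimensional latent, chosen only because the marginal then has a known closed form (\(\text{Norm}_x[0,A^2+\sigma^2]\)) to compare against. With a real nonlinear network there is no such reference value.

```python
import math
import random

A = 2.0       # hypothetical linear "decoder" f[z, phi] = A * z (so phi = A)
SIGMA2 = 0.5  # likelihood variance sigma^2
X = 1.3       # one observed data point (made up)


def normal_pdf(x, mean, var):
    """Density of Norm_x[mean, var]."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def naive_marginal(x, num_samples, rng):
    """Estimate Pr(x | phi) = E_{z ~ Pr(z)}[Pr(x | z, phi)] by averaging over prior samples."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)
        total += normal_pdf(x, A * z, SIGMA2)
    return total / num_samples


exact = normal_pdf(X, 0.0, A * A + SIGMA2)  # available only because the decoder is linear
rng = random.Random(0)
estimates = {n: naive_marginal(X, n, rng) for n in (10, 1000, 100000)}
for n, value in estimates.items():
    print(n, "samples:", value, " exact:", exact)
```

Even in this friendly one-dimensional case the estimate needs many samples to settle. In a high-dimensional latent space almost all prior samples land where \(Pr(x\mid z,\phi)\) is negligible, so this brute-force route scales badly, which is the motivation for the bound that follows.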

The evidence lower bound (ELBO)

Since we cannot maximize \(\log[Pr(x\mid\phi)]\) directly, we build a function that is always \(\le\) the log-likelihood and push that up instead. If the lower bound rises, the true log-likelihood is dragged up with it.

Definition · ELBO (slide 14)

The evidence lower bound is a function that is always less than or equal to the log-likelihood for a given \(\phi\), and which also depends on a new set of parameters \(\theta\). "Evidence" is another name for the data likelihood \(Pr(x\mid\phi)\). Building it requires Jensen's inequality.

Jensen's inequality

Definition · Jensen's inequality (slide 15)

For a concave function \(g\), the function of the expectation is \(\ge\) the expectation of the function:

\[ g\big[\mathbb{E}[y]\big] \;\ge\; \mathbb{E}\big[g[y]\big] \]

The logarithm is concave, so in particular:

\[ \log\big[\mathbb{E}[y]\big] \;\ge\; \mathbb{E}\big[\log[y]\big] \]

Written as an integral over a density \(Pr(y)\):

\[ \log\!\left[\int Pr(y)\,y\,dy\right] \;\ge\; \int Pr(y)\,\log[y]\,dy \]
In plain English

A concave function bends downward like a frown / a dome. If you draw a chord between two points on a frown-shaped curve, the chord lies below the curve. "Average first, then apply log" lands you on the curve (high); "apply log first, then average" lands you on the chord (low). So \(\log\) of an average is at least the average of the logs. The gap between them is exactly what the bound will trade on.

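[Figure: the concave curve log[y] plotted against y, marking 𝔼[y] on the horizontal axis and log[𝔼[y]] and 𝔼[log[y]] on the vertical axis.]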
Slides 16–17. Brown dots are sampled values sitting on the concave \(\log\) curve. Reading up at \(\mathbb{E}[y]\): the curve gives \(\log[\mathbb{E}[y]]\) (green, higher); the grey chord through the samples gives \(\mathbb{E}[\log[y]]\) (aubergine, lower). The shaded sliver between curve and chord is the Jensen gap — and it is exactly the gap the ELBO will later identify as a KL divergence.
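
Sketch · checking Jensen numerically (illustrative)

The inequality is easy to check on samples. This minimal sketch draws positive values from an arbitrary distribution (a uniform on \([0.5,4]\), chosen only for convenience) and compares "average then log" with "log then average".

```python
import math
import random

rng = random.Random(0)
# Any positive random variable will do; this uniform range is an arbitrary choice.
ys = [rng.uniform(0.5, 4.0) for _ in range(10000)]

log_of_mean = math.log(sum(ys) / len(ys))             # log[E[y]]: average first, then log
mean_of_log = sum(math.log(y) for y in ys) / len(ys)  # E[log[y]]: log first, then average
jensen_gap = log_of_mean - mean_of_log                # non-negative by Jensen's inequality

print("log[E[y]] =", log_of_mean)
print("E[log[y]] =", mean_of_log)
print("gap       =", jensen_gap)
```

The gap shrinks toward zero as the samples become less spread out: if every `y` were identical, the two quantities would coincide.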

Deriving the bound

Now apply Jensen to the log-likelihood. The trick is to multiply and divide by an arbitrary distribution \(q(z)\) inside the integral — a move that does nothing mathematically but creates an expectation we can bound.

\[ \begin{aligned} \log[Pr(x\mid\phi)] &= \log\!\left[\int Pr(x,z\mid\phi)\,dz\right]\\ &= \log\!\left[\int q(z)\,\frac{Pr(x,z\mid\phi)}{q(z)}\,dz\right] \end{aligned} \]

The right-hand side is now \(\log\) of an expectation under \(q(z)\). Apply Jensen (\(\log[\mathbb{E}[\cdot]]\ge\mathbb{E}[\log[\cdot]]\)):

\[ \log\!\left[\int q(z)\frac{Pr(x,z\mid\phi)}{q(z)}dz\right] \;\ge\; \int q(z)\,\log\!\left[\frac{Pr(x,z\mid\phi)}{q(z)}\right]dz \]

In practice \(q(z)\) carries its own parameters \(\theta\), so we write it \(q(z\mid\theta)\). The right-hand side is the ELBO:

Definition · The ELBO (slide 19)

\[ \text{ELBO}[\theta,\phi] = \int q(z\mid\theta)\,\log\!\left[\frac{Pr(x,z\mid\phi)}{q(z\mid\theta)}\right]dz \]

To learn the nonlinear latent variable model we maximize this quantity as a function of both \(\phi\) and \(\theta\). The neural architecture that computes this quantity is the VAE.

In plain English

We replaced an impossible problem ("maximize an integral we can't compute") with a possible one ("maximize a bound that touches that integral from below"). The free distribution \(q(z\mid\theta)\) is a dial: choose it well and the bound hugs the true log-likelihood tightly; the next part shows exactly how tight.

Q · The step "multiply and divide by \(q(z)\)" looks like a magic trick. Why is it legal, and why do it?

It is legal because \(q(z)/q(z)=1\): the integrand is unchanged, so the integral's value is identical. We do it to turn the integral into an expectation under \(q\), namely \(\int q(z)\,h(z)\,dz=\mathbb{E}_{q}[h(z)]\) with \(h(z)=Pr(x,z\mid\phi)/q(z)\). Only once it is an expectation can Jensen's inequality be applied to pull the \(\log\) inside.

Q · The slide says the ELBO "will also depend on some other parameters \(\theta\)." Where do those \(\theta\) come from?

From the auxiliary distribution \(q(z\mid\theta)\) we introduced. The true likelihood \(\log[Pr(x\mid\phi)]\) depends only on \(\phi\). But the bound we maximize depends on the shape of \(q\), which is parameterised by \(\theta\). So the ELBO is a function of both: \(\phi\) controls the generative model (and the true objective), \(\theta\) controls how tightly the bound approximates it.

Q · Jensen's inequality requires a concave function. Is \(\log\) concave, convex, or neither, and what would change if we had a convex function instead?

\(\log\) is concave (its second derivative \(-1/y^2<0\) everywhere on \(y>0\)). For a concave function Jensen gives \(g[\mathbb{E}[y]]\ge\mathbb{E}[g[y]]\), which yields a lower bound — exactly what we want for a quantity we are maximizing. For a convex function the inequality flips (\(g[\mathbb{E}[y]]\le\mathbb{E}[g[y]]\)), giving an upper bound instead, which would be useless for maximization.

·  ·  ·
how good is the bound?

PART IV · ELBO properties & tightness

The bound has two knobs. Moving \(\theta\) lifts the bound toward the true curve; moving \(\phi\) slides along it toward higher likelihood. And the gap between bound and truth turns out to be a familiar quantity: a KL divergence.

Picture the true log-likelihood \(\log[Pr(x\mid\phi)]\) as a fixed curve over the parameter \(\phi\). The ELBO is a second curve living underneath it. Three observations govern training.

Definition · ELBO properties (slide 21)

① The original log-likelihood is a function of \(\phi\) alone, and we want its maximum. ② Depending on our choice of \(\theta\), the lower bound may sit closer to or further from the log-likelihood. ③ When we change \(\phi\), we move along the lower-bound function.

[Figure: (a) fix φ, raise θ: the bound lifts from ELBO[θ⁰,φ⁰] to ELBO[θ¹,φ⁰] beneath log[Pr(x|φ)] at φ⁰; (b) fix θ, change φ: climb along the bound from ELBO[θ¹,φ⁰] to ELBO[θ¹,φ¹], moving from φ⁰ to φ¹.]
Slide 22. (a) Holding \(\phi^{0}\) fixed and improving \(\theta\) lifts the ELBO curve until it touches the true log-likelihood at that point — the bound becomes tight. (b) Holding \(\theta^{1}\) fixed and improving \(\phi\) walks the contact point up the curve toward higher likelihood. Real VAE training alternates / interleaves both moves via joint gradient ascent.

Tightness of the bound

How tight is the bound, exactly? Factor the numerator inside the ELBO using the conditional-probability identity \(Pr(x,z\mid\phi)=Pr(z\mid x,\phi)\,Pr(x\mid\phi)\). Watch the ELBO split cleanly into the log-likelihood minus a KL term.

\[ \begin{aligned} \text{ELBO}[\theta,\phi] &= \int q(z\mid\theta)\log\!\left[\frac{Pr(x,z\mid\phi)}{q(z\mid\theta)}\right]dz\\[2pt] &= \int q(z\mid\theta)\log\!\left[\frac{Pr(z\mid x,\phi)\,Pr(x\mid\phi)}{q(z\mid\theta)}\right]dz\\[2pt] &= \int q(z\mid\theta)\log\big[Pr(x\mid\phi)\big]dz + \int q(z\mid\theta)\log\!\left[\frac{Pr(z\mid x,\phi)}{q(z\mid\theta)}\right]dz\\[2pt] &= \log\big[Pr(x\mid\phi)\big] + \int q(z\mid\theta)\log\!\left[\frac{Pr(z\mid x,\phi)}{q(z\mid\theta)}\right]dz\\[2pt] &= \log\big[Pr(x\mid\phi)\big] - D_{KL}\!\Big[q(z\mid\theta)\,\big\|\,Pr(z\mid x,\phi)\Big] \end{aligned} \]
In plain English · reading the final line (slide 23)

The ELBO = (the thing we actually want, \(\log[Pr(x\mid\phi)]\)) minus a KL divergence measuring how far our chosen \(q(z\mid\theta)\) is from the true posterior \(Pr(z\mid x,\phi)\). KL divergence is \(\ge 0\) always, which re-proves that ELBO \(\le\) log-likelihood. The step from the third line to the fourth works because \(\log[Pr(x\mid\phi)]\) does not depend on \(z\), so it pulls out of the integral and \(\int q(z\mid\theta)\,dz=1\). The final line flips the sign because \(D_{KL}[q\,\|\,p]=\int q\,\log[q/p]\,dz\), which has the ratio inside the log inverted.

Definition · When the bound is tight (slide 23)

The KL divergence is zero, and the bound is tight (it touches the log-likelihood), exactly when our approximating distribution equals the true posterior:

\[ q(z\mid\theta) = Pr(z\mid x,\phi) \]
[Figure: prior Pr(z) versus posterior Pr(z|x*,φ), both plotted against z.]
Slide 24. The broad blue curve is the prior \(Pr(z)\); the sharp, possibly multi-peaked terracotta curve is the true posterior \(Pr(z\mid x^{*},\phi)\) for a specific observation \(x^{*}\). The posterior concentrates on the latent values that could have produced \(x^{*}\). Making \(q\) match this curve is what makes the bound tight — and as we'll see, the true posterior is generally too awkward to match exactly.
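
Sketch · watching the gap equal the KL on a solvable toy model (illustrative)

The identity "log-likelihood minus ELBO equals KL to the posterior" can be checked numerically if we pick a model where everything has a closed form. The sketch below assumes a hypothetical one-dimensional linear decoder \(\mathbf{f}[z,\phi]=Az\), which is not the nonlinear network of the lecture; it is swapped in only because the evidence and the true posterior are then both Gaussian and known exactly. The ELBO is evaluated for a Gaussian \(q(z)=\text{Norm}_z[m,s^2]\) as an expected log-likelihood minus the KL from \(q\) to the unit prior.

```python
import math

A = 2.0       # hypothetical linear decoder f[z, phi] = A * z (so phi = A)
SIGMA2 = 0.5  # likelihood variance sigma^2
X = 1.3       # one observed data point (made up)


def log_normal(x, mean, var):
    """log Norm_x[mean, var]."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def log_evidence(x):
    """Exact log Pr(x | phi): with a linear decoder, x ~ Norm[0, A^2 + sigma^2]."""
    return log_normal(x, 0.0, A * A + SIGMA2)


def elbo(x, m, s2):
    """ELBO for q(z) = Norm[m, s2]: E_q[log Pr(x | z)] minus KL[q || Norm[0, 1]]."""
    expected_sq_err = (x - A * m) ** 2 + A * A * s2   # E_q[(x - A z)^2]
    recon = -0.5 * math.log(2.0 * math.pi * SIGMA2) - expected_sq_err / (2.0 * SIGMA2)
    kl_prior = 0.5 * (s2 + m * m - 1.0 - math.log(s2))
    return recon - kl_prior


def kl_gauss(m1, v1, m2, v2):
    """KL[Norm[m1, v1] || Norm[m2, v2]] for 1D Gaussians."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)


# True posterior Pr(z | x, phi) of this linear-Gaussian model.
post_mean = A * X / (A * A + SIGMA2)
post_var = SIGMA2 / (A * A + SIGMA2)

for m, s2 in [(0.0, 1.0), (0.5, 0.3), (post_mean, post_var)]:
    gap = log_evidence(X) - elbo(X, m, s2)
    kl_post = kl_gauss(m, s2, post_mean, post_var)
    print(f"q=Norm[{m:.3f},{s2:.3f}]  gap={gap:.6f}  KL to posterior={kl_post:.6f}")
```

For every choice of \(q\) the printed gap and KL agree, and both vanish when \(q\) is set to the true posterior, which is the tightness condition stated above.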

Three faces of the ELBO

The slides emphasise that the same ELBO can be written three equivalent ways. The first two we already have (slide 25):

Forms 1 & 2 (slide 25)

1 · Original integral form:

\[ \int q(z\mid\theta)\log\!\left[\frac{Pr(x,z\mid\phi)}{q(z\mid\theta)}\right]dz \]

2 · Likelihood-minus-KL-to-posterior:

\[ \log\big[Pr(x\mid\phi)\big] - D_{KL}\!\Big[q(z\mid\theta)\,\big\|\,Pr(z\mid x,\phi)\Big] \]

The third form is the one used in practice. Factor the numerator the other way, as \(Pr(x,z\mid\phi)=Pr(x\mid z,\phi)\,Pr(z)\):

\[ \begin{aligned} \text{ELBO}[\theta,\phi] &= \int q(z\mid\theta)\log\!\left[\frac{Pr(x,z\mid\phi)}{q(z\mid\theta)}\right]dz\\[2pt] &= \int q(z\mid\theta)\log\!\left[\frac{Pr(x\mid z,\phi)\,Pr(z)}{q(z\mid\theta)}\right]dz\\[2pt] &= \int q(z\mid\theta)\log\big[Pr(x\mid z,\phi)\big]dz + \int q(z\mid\theta)\log\!\left[\frac{Pr(z)}{q(z\mid\theta)}\right]dz\\[2pt] &= \int q(z\mid\theta)\log\big[Pr(x\mid z,\phi)\big]dz - D_{KL}\!\Big[q(z\mid\theta)\,\big\|\,Pr(z)\Big] \end{aligned} \]
Form 3 · reconstruction − KL-to-prior (slide 26)

\[ \text{ELBO}[\theta,\phi] = \underbrace{\int q(z\mid\theta)\log\big[Pr(x\mid z,\phi)\big]dz}_{\text{reconstruction term}} - \underbrace{D_{KL}\!\Big[q(z\mid\theta)\,\big\|\,Pr(z)\Big]}_{\text{KL to prior}} \]
In plain English — this is the one to memorise

Reconstruction term: "If I encode \(x\) into a latent code \(z\) and decode it back, how probable is the original \(x\)?" High = good reconstruction. KL-to-prior term: "How far has my encoder's distribution \(q(z\mid\theta)\) drifted from the tidy prior \(Pr(z)\)?" We subtract it, so it acts as a regulariser pulling codes toward the unit Gaussian. Maximizing the ELBO = reconstruct well while keeping the latent code well-behaved.

Exam trap · two different KL divergences

Form 2 has \(D_{KL}[\,q\,\|\,Pr(z\mid x,\phi)\,]\) — KL to the posterior (intractable; explains the gap). Form 3 has \(D_{KL}[\,q\,\|\,Pr(z)\,]\) — KL to the prior (tractable; appears in the loss we actually optimize). Same ELBO, different KL term. Don't conflate them.

Q · In the tightness derivation, why does \(\int q(z\mid\theta)\log[Pr(x\mid\phi)]\,dz\) collapse to just \(\log[Pr(x\mid\phi)]\)?

Because \(\log[Pr(x\mid\phi)]\) contains no \(z\) — it is a constant with respect to the integration variable. So it factors out: \(\log[Pr(x\mid\phi)]\int q(z\mid\theta)\,dz\). And \(q(z\mid\theta)\) is a probability distribution, so \(\int q(z\mid\theta)\,dz=1\). The constant survives, the integral evaporates.

Q · Using Form 2, prove in one line that the ELBO can never exceed the log-likelihood.

Form 2 is \(\text{ELBO}=\log[Pr(x\mid\phi)] - D_{KL}[q\,\|\,Pr(z\mid x,\phi)]\). A KL divergence is non-negative (\(D_{KL}\ge 0\), with equality iff the two distributions are identical). Subtracting a non-negative number can only decrease or preserve, so \(\text{ELBO}\le\log[Pr(x\mid\phi)]\). \(\blacksquare\)

Q · Maximizing the ELBO over \(\theta\) does what to the true log-likelihood, and why is that initially surprising?

It does nothing to the true log-likelihood \(\log[Pr(x\mid\phi)]\), which depends only on \(\phi\). Surprising at first — but from Form 2, varying \(\theta\) only changes the KL term. Maximizing the ELBO over \(\theta\) minimizes that KL, lifting the bound up to the (fixed) log-likelihood curve. So \(\theta\) tightens the bound; \(\phi\) actually raises the likelihood. That's precisely the two-panel story of slide 22.

·  ·  ·
an honest cheat

PART V · The variational approximation

The bound is tight when \(q\) equals the true posterior — but the true posterior is intractable. So we cheat honestly: pick a simple family for \(q\) and let a neural network choose its parameters per data point.

We established that the ELBO is tight when \(q(z\mid\theta)=Pr(z\mid x,\phi)\). Why not just set \(q\) to the posterior and be done? Because Bayes' rule \(Pr(z\mid x,\phi)=Pr(x\mid z,\phi)Pr(z)/Pr(x\mid\phi)\) needs the denominator \(Pr(x\mid\phi)\) — the very intractable integral we started with.

Definition · Variational approximation (slide 28)

The ELBO is tight when \(q(z\mid\theta)=Pr(z\mid x,\phi)\), but we cannot use Bayes' rule because \(Pr(x\mid\phi)\) is intractable. The fix: choose a simple parametric form for \(q(z\mid\theta)\) and use it to approximate the true posterior. Since the optimal \(q\) was the posterior \(Pr(z\mid x)\), which depends on the data \(x\), the approximation should depend on \(x\) too:

\[ q(z\mid x,\theta) = \text{Norm}_z\!\Big[\mathbf{g}_{\mu}[x,\theta],\,\mathbf{g}_{\Sigma}[x,\theta]\Big] \]

Here \(\mathbf{g}[x,\theta]\) is a second neural network with parameters \(\theta\) that predicts the mean \(\mu\) and variance \(\Sigma\) of the normal variational approximation.

In plain English

We give up on matching the true (gnarly) posterior exactly and instead say: "for each input \(x\), I will guess a single Gaussian that approximates its posterior." A network \(\mathbf{g}\) reads \(x\) and outputs the mean and spread of that Gaussian. This network is the encoder. The word variational just means we are optimizing over a family of functions/distributions to find the best approximation — classic calculus-of-variations flavour.

[Figure: (a) unimodal posterior, good fit of q(z|x*,θ) to Pr(z|x*,φ) over latent z; (b) bimodal posterior, poor fit of q(z|x*,θ) to Pr(z|x*,φ) over latent z.]
Slide 29. A single Gaussian \(q\) (teal) can hug a unimodal posterior (a) but cannot capture a multi-peaked posterior (b) — it smears across both modes and fits neither well. This mismatch is the price of the approximation: the bound stays slack by the amount \(D_{KL}[q\,\|\,Pr(z\mid x,\phi)]\) whenever the true posterior isn't Gaussian.
Q · Why must \(q\) depend on \(x\) (written \(q(z\mid x,\theta)\)) rather than being one fixed distribution \(q(z\mid\theta)\) shared by all data points?

Because the quantity \(q\) is approximating — the true posterior \(Pr(z\mid x,\phi)\) — is itself different for every \(x\). Different observations imply different beliefs about which latents produced them. A single shared Gaussian could not track all those posteriors. So we let the encoder \(\mathbf{g}[x,\theta]\) amortise the inference: one network reads any \(x\) and instantly emits the right \((\mu,\Sigma)\). (This is called amortised variational inference.)

Q · Looking at panel (b), what does the slack bound cost us, and could we fix it?

When the true posterior is multimodal but \(q\) is a single Gaussian, \(D_{KL}[q\,\|\,Pr(z\mid x,\phi)]>0\), so the ELBO underestimates the true log-likelihood — we optimize a looser proxy and may learn a worse model. We could fix it by giving \(q\) a richer form (mixtures, normalizing flows, hierarchical posteriors), at the cost of complexity. The vanilla VAE accepts the Gaussian limitation for tractability and speed.

·  ·  ·
putting it together

PART VI · The variational autoencoder

Encoder, sampler, decoder, loss. The VAE is a network that computes the (reconstruction − KL) form of the ELBO, makes the intractable integral tractable with a single sample, and uses a closed-form KL for Gaussians.

We now assemble the full architecture. Start from Form 3 of the ELBO and plug in our Gaussian variational approximation \(q(z\mid x,\theta)\):

Definition · The VAE objective (slide 31)

\[ \text{ELBO}[\theta,\phi] = \int q(z\mid x,\theta)\log\big[Pr(x\mid z,\phi)\big]dz - D_{KL}\!\Big[q(z\mid x,\theta)\,\big\|\,Pr(z)\Big] \]

with the approximation \(q(z\mid x,\theta)=\text{Norm}_z[\mathbf{g}_{\mu}[x,\theta],\mathbf{g}_{\Sigma}[x,\theta]]\).

Making the first term tractable: sampling

That first integral (the reconstruction term) is itself an intractable expectation. We approximate any expectation under \(q\) by a Monte-Carlo average:

Definition · Monte-Carlo estimate of the expectation (slide 31)

For any function \(\mathrm{a}\):

\[ \mathbb{E}_{z}\big[\mathrm{a}[z]\big] = \int \mathrm{a}[z]\,q(z\mid x,\theta)\,dz \;\approx\; \frac{1}{N}\sum_{n=1}^{N}\mathrm{a}[z_n^{*}] \]

where \(z_n^{*}\) is the \(n\)-th sample drawn from \(q(z\mid x,\theta)\).

In practice we are stingy: a single sample \(z^{*}\) gives a very rough but unbiased estimate, which is enough for stochastic gradient training:

Definition · Single-sample ELBO (slide 32)

\[ \text{ELBO}[\theta,\phi] \;\approx\; \log\big[Pr(x\mid z^{*},\phi)\big] - D_{KL}\!\Big[q(z\mid x,\theta)\,\big\|\,Pr(z)\Big] \]
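
Sketch · Monte-Carlo estimates with many samples versus one (illustrative)

The claim that a single sample is "rough but unbiased" can be seen on a toy expectation. In this minimal sketch `MU` and `SIGMA` are made-up encoder outputs for a one-dimensional latent, and the test function \(\mathrm{a}[z]=z^2\) is chosen only because its exact expectation under \(\text{Norm}_z[\mu,\sigma^2]\) is known to be \(\mu^2+\sigma^2\).

```python
import random

MU, SIGMA = 0.7, 0.4  # hypothetical encoder outputs for one x (1D latent)


def a(z):
    """Arbitrary test function; E_q[a[z]] = MU^2 + SIGMA^2 exactly."""
    return z * z


def mc_estimate(num_samples, rng):
    """(1/N) * sum_n a[z_n*] with z_n* drawn from q = Norm[MU, SIGMA^2]."""
    return sum(a(rng.gauss(MU, SIGMA)) for _ in range(num_samples)) / num_samples


exact = MU ** 2 + SIGMA ** 2
rng = random.Random(0)

big_estimate = mc_estimate(20000, rng)                      # one estimate, many samples
single_estimates = [mc_estimate(1, rng) for _ in range(5000)]  # many one-sample estimates
mean_of_singles = sum(single_estimates) / len(single_estimates)

print("exact                    :", exact)
print("N = 20000 estimate       :", big_estimate)
print("a few N = 1 estimates    :", single_estimates[:4])
print("mean of N = 1 estimates  :", mean_of_singles)
```

Individual one-sample estimates scatter widely around the exact value, but their average lands close to it. That is the sense in which the single-sample ELBO is noisy yet usable inside stochastic gradient training.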

Making the second term exact: the Gaussian KL

The KL term is between two Gaussians — the variational \(q=\text{Norm}_z[\mu,\Sigma]\) and the prior \(Pr(z)=\text{Norm}_z[\mathbf{0},\mathbf{I}]\) — so it has a closed form with no sampling needed:

Definition · KL between Gaussian and unit prior (slide 32)

\[ D_{KL}\!\Big[q(z\mid x,\theta)\,\big\|\,Pr(z)\Big] = \frac{1}{2}\left(\operatorname{Tr}[\Sigma] + \mu^{T}\mu - D_z - \log\Big[\det[\Sigma]\Big]\right) \]

where \(D_z\) is the dimensionality of the latent space.

In plain English · what each piece of the KL penalises

\(\mu^{T}\mu\) penalises the latent mean drifting away from the origin. \(\operatorname{Tr}[\Sigma]\) penalises the variance being too large; \(-\log\det[\Sigma]\) penalises it being too small (collapsing). \(-D_z\) is just the constant that makes the whole thing zero when \(\Sigma=\mathbf{I}\) and \(\mu=\mathbf{0}\) — i.e. when \(q\) exactly equals the prior. The term gently herds every encoded code toward the standard Gaussian blob.
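
Sketch · the Gaussian KL as a function (illustrative)

The closed form is short enough to write out directly. This sketch assumes a diagonal covariance \(\Sigma=\operatorname{diag}(\texttt{var})\), a common simplification that the formula above does not require; under it the trace is the sum of the variances and the log-determinant is the sum of their logs. The function name `kl_to_unit_prior` is hypothetical.

```python
import math


def kl_to_unit_prior(mu, var):
    """KL[Norm[mu, diag(var)] || Norm[0, I]] via 0.5 * (Tr + mu^T mu - D_z - log det)."""
    d_z = len(mu)                            # latent dimensionality D_z
    trace = sum(var)                         # Tr[Sigma] for a diagonal Sigma
    mu_sq = sum(m * m for m in mu)           # mu^T mu
    log_det = sum(math.log(v) for v in var)  # log det[Sigma] for a diagonal Sigma
    return 0.5 * (trace + mu_sq - d_z - log_det)


print("q equals the prior :", kl_to_unit_prior([0.0, 0.0], [1.0, 1.0]))
print("mean shifted       :", kl_to_unit_prior([1.0, 0.0], [1.0, 1.0]))
print("variance too large :", kl_to_unit_prior([0.0, 0.0], [4.0, 1.0]))
print("variance too small :", kl_to_unit_prior([0.0, 0.0], [0.25, 1.0]))
```

Each of the last three calls perturbs one ingredient away from the prior and yields a positive penalty, matching the piece-by-piece reading above.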

The VAE algorithm

Worked recipe · computing the ELBO for one point (slide 33)

We build a model that computes the ELBO for a point \(x\), then use an optimization algorithm to maximize it over the dataset, thereby improving the log-likelihood. For each \(x\):

1. Compute the mean \(\mu\) and variance \(\Sigma\) of the variational posterior \(q(z\mid x,\theta)\) using the encoder network \(\mathbf{g}[x,\theta]\).

2. Draw a sample \(z^{*}\) from this distribution.

3. Compute the ELBO: \[ \text{ELBO}[\theta,\phi] \approx \log\big[Pr(x\mid z^{*},\phi)\big] - D_{KL}\!\Big[q(z\mid x,\theta)\,\big\|\,Pr(z)\Big] \]
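
Sketch · the three steps for one data point (illustrative)

The recipe above can be sketched end to end with toy stand-ins for the two networks. Everything named here is an assumption for illustration: `encoder` is a hypothetical linear map playing the role of \(\mathbf{g}[x,\theta]\), `decoder` is a hypothetical parabola playing the role of \(\mathbf{f}[z,\phi]\), the latent is one-dimensional, the data is two-dimensional, and the parameter values are made up rather than trained. No gradients are computed; the sketch only evaluates the single-sample ELBO.

```python
import math
import random

SIGMA2 = 0.05  # likelihood variance sigma^2 (arbitrary value)


def encoder(x, theta):
    """Hypothetical g[x, theta]: returns the mean and variance of q(z | x, theta)."""
    w1, w2, b, log_var = theta
    mu = w1 * x[0] + w2 * x[1] + b
    return mu, math.exp(log_var)  # exponentiating keeps the variance positive


def decoder(z, phi):
    """Hypothetical f[z, phi]: maps the 1D latent to a 2D likelihood mean."""
    a, c = phi
    return (a * z, c * z * z)


def log_likelihood(x, mean):
    """log Norm_x[mean, SIGMA2 * I]."""
    sq_err = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * len(x) * math.log(2.0 * math.pi * SIGMA2) - 0.5 * sq_err / SIGMA2


def single_sample_elbo(x, theta, phi, rng):
    mu, var = encoder(x, theta)                       # step 1: parameters of q
    z_star = rng.gauss(mu, math.sqrt(var))            # step 2: z* ~ q(z | x, theta)
    recon = log_likelihood(x, decoder(z_star, phi))   # step 3a: log Pr(x | z*, phi)
    kl = 0.5 * (var + mu * mu - 1.0 - math.log(var))  # step 3b: closed-form KL, 1D case
    return recon - kl


X = (0.9, 0.7)                 # one made-up data point
THETA = (0.8, 0.1, 0.0, -1.0)  # made-up encoder parameters
PHI = (1.0, 1.0)               # made-up decoder parameters

rng = random.Random(0)
values = [single_sample_elbo(X, THETA, PHI, rng) for _ in range(2000)]
print("three single-sample ELBOs:", values[:3])
print("average over 2000 samples:", sum(values) / len(values))
```

Step 2 here draws \(z^{*}\) directly from \(q\). That is fine for evaluating the ELBO, but a direct draw like this gives an optimizer no path for gradients back to the encoder parameters.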

[Figure: VAE architecture. Data example x → encoder g[x,θ] → parameters μ, Σ of the variational distribution q(z|x,θ) → sample z* from the variational distribution → decoder f[z*,φ] → prediction Pr(x|z*,φ). The loss function ELBO[θ,φ] combines D_KL[q(z|x,θ)‖Pr(z)] (variational distribution similar to prior) and log[Pr(x|z*,φ)] (data should be probable).]
Slide 34. The VAE forward pass. The encoder \(\mathbf{g}[x,\theta]\) maps \(x\) to the parameters \(\mu,\Sigma\) of the variational distribution \(q(z\mid x,\theta)\). We sample \(z^{*}\) from it, then the decoder \(\mathbf{f}[z^{*},\phi]\) maps \(z^{*}\) to the likelihood \(Pr(x\mid z^{*},\phi)\). The objective is the ELBO = (reconstruction: data should be probable) − (KL: variational distribution should resemble the prior). Since we maximize it, the loss handed to a minimizer is its negative.
[Figure: log[Pr(x|φ)] plotted against φ, with the optimization trajectory climbing from φ⁰ to φ³.]
Slide 35. Joint gradient ascent in action. Each step both raises the local ELBO bump toward the true curve (improving \(\theta\)) and slides the contact point rightward and upward along the log-likelihood (improving \(\phi\)), climbing from \(\phi^{0}\) to \(\phi^{3}\). The dashed arrows trace the optimization trajectory.
Q · Why can the KL term be computed exactly while the reconstruction term needs sampling?

The KL term is between two Gaussians of known parameters — \(q=\text{Norm}[\mu,\Sigma]\) and the prior \(\text{Norm}[\mathbf{0},\mathbf{I}]\) — and the KL between Gaussians has a tidy analytic formula \(\tfrac12(\operatorname{Tr}[\Sigma]+\mu^T\mu-D_z-\log\det[\Sigma])\). The reconstruction term \(\int q(z\mid x,\theta)\log[Pr(x\mid z,\phi)]\,dz\) has the nonlinear decoder \(\mathbf{f}[z,\phi]\) buried inside the log, so the integral has no closed form — hence the Monte-Carlo (sampling) estimate.

Q · A single sample is a terrible estimate of an expectation. Why is using just one sample acceptable here?

Because we are training with stochastic gradient methods over many minibatches and many epochs. A one-sample estimate is high-variance but unbiased — on average it points in the right direction. Across thousands of noisy updates the variance averages out, much like SGD tolerates noisy minibatch gradients. The slide explicitly calls it "a very approximate estimate," and it works in practice.

Q · In the Gaussian KL formula, what value of \((\mu,\Sigma)\) makes the KL exactly zero, and why does that matter?

\(\mu=\mathbf{0}\) and \(\Sigma=\mathbf{I}\): then \(\operatorname{Tr}[\mathbf{I}]=D_z\), \(\mu^T\mu=0\), \(\log\det[\mathbf{I}]=0\), so \(D_{KL}=\tfrac12(D_z+0-D_z-0)=0\). It matters because that is exactly when \(q\) equals the prior — the KL term has its minimum (zero) there and grows as the encoder's output drifts away from the standard Gaussian, which is precisely the regularising pressure we want.

·  ·  ·
letting gradients flow through randomness

PART VII · The reparametrisation trick

There's a sampling step right in the middle of the network — and you cannot backpropagate through a random draw. The fix is to push the randomness off to the side, so the path the gradient travels is fully deterministic.

Look back at the architecture: between encoder and decoder we sample \(z^{*}\sim q(z\mid x,\theta)\). That sampling operation is stochastic and not a differentiable function of \(\mu\) and \(\Sigma\) — so gradient descent cannot send error signals back through it to the encoder. The whole thing would be untrainable end-to-end.

Definition · The reparametrisation trick (slide 37)

The network involves a sampling step that is difficult to differentiate. We move the stochastic part into a side branch that draws a sample \(\epsilon^{*}\) from \(\text{Norm}_{\epsilon}[\mathbf{0},\mathbf{I}]\), and then use the following relation to draw from the Gaussian:

\[ z^{*} = \mu + \Sigma^{1/2}\,\epsilon^{*} \qquad \epsilon^{*}\sim\text{Norm}_{\epsilon}[\mathbf{0},\mathbf{I}] \]
In plain English

Instead of asking the network to "draw from a Gaussian whose mean and spread you computed" (a random, non-differentiable act), we draw a standard random number \(\epsilon^{*}\) from a fixed unit Gaussian on the side, then deterministically reshape it: scale by \(\Sigma^{1/2}\), shift by \(\mu\). The randomness now enters as an external input \(\epsilon^{*}\), and \(z^{*}=\mu+\Sigma^{1/2}\epsilon^{*}\) is a smooth, differentiable function of \(\mu\) and \(\Sigma\). Gradients flow straight through to the encoder.

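[Figure: VAE computation graph with reparametrisation. Main path: x → g[x,θ] → μ, Σ → z* = μ + Σ^{1/2} ε* → f[z*,φ] → Pr(x|z*,φ). Side branch: sample ε* from Norm_ε[0,I]. Loss: ELBO[θ,φ], combining log[Pr(x|z*,φ)] and D_KL[q‖Pr(z)].]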
Slide 38. The sampling node has moved off the main path. The random draw \(\epsilon^{*}\sim\text{Norm}_{\epsilon}[\mathbf{0},\mathbf{I}]\) enters from a side branch; the combination \(z^{*}=\mu+\Sigma^{1/2}\epsilon^{*}\) is deterministic in \(\mu,\Sigma\). Backpropagation now has an unbroken differentiable route from the loss all the way back to the encoder weights \(\theta\).
Q · Verify that \(z^{*}=\mu+\Sigma^{1/2}\epsilon^{*}\) with \(\epsilon^{*}\sim\text{Norm}[\mathbf{0},\mathbf{I}]\) genuinely has distribution \(\text{Norm}[\mu,\Sigma]\).

A linear transform of a Gaussian is Gaussian. Mean: \(\mathbb{E}[z^{*}]=\mu+\Sigma^{1/2}\mathbb{E}[\epsilon^{*}]=\mu+\Sigma^{1/2}\mathbf{0}=\mu\). Covariance: \(\operatorname{Cov}[z^{*}]=\Sigma^{1/2}\operatorname{Cov}[\epsilon^{*}](\Sigma^{1/2})^{T}=\Sigma^{1/2}\,\mathbf{I}\,\Sigma^{1/2}=\Sigma\). So \(z^{*}\sim\text{Norm}[\mu,\Sigma]\) exactly — the trick changes how we sample, not what we sample.
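The same check can be run empirically. This sketch assumes a diagonal \(\Sigma\), for which \(\Sigma^{1/2}\) is just the elementwise standard deviation; the values of `mu` and `var` are illustrative stand-ins for encoder outputs.

```python
import math
import random

def reparametrised_sample(mu, var, rng):
    """Draw z* = mu + Sigma^{1/2} eps* for diagonal Sigma = diag(var)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]          # side-branch noise ~ Norm[0, I]
    return [m + math.sqrt(v) * e for m, v, e in zip(mu, var, eps)]

rng = random.Random(1)
mu, var = [2.0, -1.0], [0.25, 4.0]                   # illustrative encoder outputs
samples = [reparametrised_sample(mu, var, rng) for _ in range(40_000)]

n = len(samples)
emp_mean = [sum(s[d] for s in samples) / n for d in range(2)]
emp_var = [sum((s[d] - emp_mean[d]) ** 2 for s in samples) / n for d in range(2)]
print("empirical mean    ", emp_mean)
print("empirical variance", emp_var)
```

The empirical mean and variance should land close to `mu` and `var`, matching the algebra above.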

Q · Precisely why does backprop fail through a raw \(z^{*}\sim\text{Norm}[\mu,\Sigma]\) sampling node, and how does reparametrisation cure it?

A sampling operation is not a deterministic function of its inputs, so \(\partial z^{*}/\partial\mu\) and \(\partial z^{*}/\partial\Sigma\) are undefined — there is no smooth map to differentiate, and the gradient chain to the encoder is severed. After reparametrisation the only random node is \(\epsilon^{*}\), which has no learnable parameters; the dependence of \(z^{*}\) on \(\mu,\Sigma\) is the deterministic affine map \(\mu+\Sigma^{1/2}\epsilon^{*}\), which is fully differentiable. Gradients route through that map and ignore the parameter-free \(\epsilon^{*}\) branch.
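To see gradients route through the affine map, here is a hand-differentiated sketch on a toy objective. Everything here is an assumption for illustration: a scalar latent with standard deviation `s` in place of \(\Sigma^{1/2}\), and the objective \(h(z)=z^2\), chosen because \(\mathbb{E}[h(z^{*})]=\mu^2+s^2\) gives exact gradients \(2\mu\) and \(2s\) to compare against.

```python
import random

def reparam_gradients(mu, s, n_samples, rng):
    """Monte Carlo gradients of E[h(z*)] with h(z) = z^2 and z* = mu + s * eps*."""
    grad_mu, grad_s = 0.0, 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)       # parameter-free noise from the side branch
        z = mu + s * eps                # deterministic in (mu, s)
        dh_dz = 2.0 * z                 # derivative of the toy objective
        grad_mu += dh_dz * 1.0          # chain rule: dz/dmu = 1
        grad_s += dh_dz * eps           # chain rule: dz/ds = eps
    return grad_mu / n_samples, grad_s / n_samples

rng = random.Random(2)
mu, s = 1.5, 0.8
grad_mu, grad_s = reparam_gradients(mu, s, 100_000, rng)
print(f"d/dmu: estimate {grad_mu:.3f}, exact {2 * mu:.3f}")
print(f"d/ds : estimate {grad_s:.3f}, exact {2 * s:.3f}")
```

No derivative is ever taken with respect to `eps`; it is treated as a constant input, exactly as the answer above describes.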

·  ·  ·
what VAEs are good for

PART VIII · Applications

Estimating how probable a sample is, generating new data, editing real data, and discovering interpretable latent factors — four things a trained VAE buys you, each with its own catch.

Approximating the probability of a sample

A trained VAE is a probability model, so we may want to ask: how probable is a given \(x\)? The model says

\[ \begin{aligned} Pr(x) &= \int Pr(x\mid z)Pr(z)\,dz = \mathbb{E}_{z}\big[Pr(x\mid z)\big] = \mathbb{E}_{z}\Big[\text{Norm}_x[\mathbf{f}[z,\phi],\sigma^2\mathbf{I}]\Big] \end{aligned} \]

In principle we could estimate this by drawing samples \(z_n\) straight from the prior \(Pr(z)=\text{Norm}_z[\mathbf{0},\mathbf{I}]\) and averaging (slide 40):

\[ Pr(x) \approx \frac{1}{N}\sum_{n=1}^{N} Pr(x\mid z_n) \]
Bad news (slide 40)

This naive estimate needs a huge number of samples to be reliable. Most prior samples \(z_n\) land in regions where \(Pr(x\mid z_n)\) is essentially zero for the particular \(x\) you care about, so almost every sample is wasted and the average is dominated by rare lucky draws.

Definition · Importance sampling (slide 41)

A better approach draws from \(q(z)\) instead of the prior, reweighting to stay unbiased:

\[ \begin{aligned} Pr(x) &= \int Pr(x\mid z)Pr(z)\,dz = \int \frac{Pr(x\mid z)Pr(z)}{q(z)}\,q(z)\,dz\\ &= \mathbb{E}_{q(z)}\!\left[\frac{Pr(x\mid z)Pr(z)}{q(z)}\right] \approx \frac{1}{N}\sum_{n=1}^{N}\frac{Pr(x\mid z_n)Pr(z_n)}{q(z_n)} \end{aligned} \]

where now we draw the samples \(z_n\) from \(q(z)\).

In plain English

Don't sample blindly from the prior — sample from \(q(z)\), which (via the encoder) already knows where the good latents for this \(x\) live. Then correct for having sampled from the "wrong" distribution by multiplying each term by the ratio \(Pr(z_n)/q(z_n)\). You spend your samples where they matter, so far fewer are needed.
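A sketch comparing the two estimators on a toy model where \(Pr(x)\) is known exactly. The model is an assumption, not the lecture's VAE: a scalar latent with a linear decoder, \(Pr(x\mid z)=\text{Norm}_x[a z,\sigma^2]\), so the marginal is \(\text{Norm}_x[0,a^2+\sigma^2]\). The proposal \(q(z)\) here is the exact posterior of this toy model widened by a factor of 1.5 in standard deviation, standing in for an encoder's output.

```python
import math
import random
import statistics

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Toy model (assumed): Pr(z) = Norm[0, 1], Pr(x|z) = Norm[a*z, noise_var].
a, noise_var, x = 2.0, 0.01, 3.0
exact = normal_pdf(x, 0.0, a * a + noise_var)        # closed-form marginal Pr(x)

# Exact posterior of the toy model, widened to act as the proposal q(z).
post_var = 1.0 / (1.0 + a * a / noise_var)
post_mean = post_var * a * x / noise_var
q_mean, q_var = post_mean, (1.5 ** 2) * post_var

def naive_estimate(n, rng):
    """Average Pr(x|z_n) over draws z_n from the prior."""
    return sum(normal_pdf(x, a * rng.gauss(0.0, 1.0), noise_var) for _ in range(n)) / n

def importance_estimate(n, rng):
    """Average Pr(x|z_n) Pr(z_n) / q(z_n) over draws z_n from q."""
    total = 0.0
    for _ in range(n):
        z = rng.gauss(q_mean, math.sqrt(q_var))
        total += (normal_pdf(x, a * z, noise_var) * normal_pdf(z, 0.0, 1.0)
                  / normal_pdf(z, q_mean, q_var))
    return total / n

rng = random.Random(3)
naive_runs = [naive_estimate(500, rng) for _ in range(20)]
importance_runs = [importance_estimate(500, rng) for _ in range(20)]
print(f"exact Pr(x)      {exact:.5f}")
print(f"naive       mean {statistics.mean(naive_runs):.5f}  std {statistics.stdev(naive_runs):.5f}")
print(f"importance  mean {statistics.mean(importance_runs):.5f}  std {statistics.stdev(importance_runs):.5f}")
```

Both estimators target the same value, but with the same budget of 500 samples per run the importance-sampled runs cluster far more tightly around it.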

Generation & the aggregated posterior

Why vanilla VAE samples look blurry (slide 42)

Samples from vanilla VAEs are generally low quality, for two reasons: the naive spherical Gaussian noise model \(\sigma^2\mathbf{I}\) on the likelihood, and the use of Gaussian models for both the prior and the variational posterior. These oversmooth the output.

Definition · Aggregated posterior (slide 42)

To improve generation quality, sample from the aggregated posterior rather than the prior:

\[ q(z\mid\theta) = \frac{1}{I}\sum_{i} q(z\mid x_i,\theta) \]

i.e. the average of the encoder's distributions across the whole training set.

In plain English

The prior \(\text{Norm}_z[\mathbf{0},\mathbf{I}]\) is what we told the latents to look like, but the encoder's actual outputs over the dataset may occupy a different region. Sampling from the prior can therefore hit "dead zones" the decoder never learned to handle, giving junk. The aggregated posterior is where the encoded data actually lives, so sampling from it lands you in well-trained territory and produces better images.
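Because the aggregated posterior is an equal-weight mixture of the per-example encoder distributions, drawing from it takes two steps: pick a training index uniformly, then sample from that example's Gaussian. The sketch below assumes diagonal covariances and uses three hypothetical encoder outputs in a 2D latent space, arranged so that the encoded data avoids the region around the origin.

```python
import math
import random

def sample_aggregated_posterior(mus, variances, rng):
    """Draw one z from (1/I) * sum_i q(z | x_i): pick i uniformly, then sample q_i."""
    i = rng.randrange(len(mus))
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mus[i], variances[i])]

# Hypothetical encoder outputs for I = 3 training points (not real data).
mus = [[1.8, 0.1], [2.1, -0.2], [-1.9, 0.0]]
variances = [[0.05, 0.04], [0.06, 0.05], [0.05, 0.03]]

rng = random.Random(4)
draws = [sample_aggregated_posterior(mus, variances, rng) for _ in range(30_000)]
mean_z1 = sum(z[0] for z in draws) / len(draws)
near_origin = sum(1 for z in draws if abs(z[0]) < 0.5) / len(draws)
print(f"mean of first latent dim: {mean_z1:.3f}")
print(f"fraction of draws with |z1| < 0.5: {near_origin:.4f}")
```

A standard normal prior puts roughly 38% of its mass on \(|z_1|<0.5\), whereas almost no aggregated-posterior draws fall there in this toy setup; that gap is the kind of "dead zone" described above.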

[Figure: face samples in four rows: a) prior samples (noisy), b) smoother, c) noise, d) high quality. Better noise/prior/posterior modelling gives sharper samples from a) to d).]
Slide 43. Face samples improve dramatically as the noise model, prior, and posterior become more sophisticated — from grainy prior draws (a) through smoothed versions (b) to crisp, photorealistic faces (d). The vanilla spherical-Gaussian VAE sits at the blurry end.

Resynthesis

Definition · Resynthesis (slide 44)

VAEs can also modify real data. To find the latent code for an existing example you can either:

1. Take the mean of the distribution predicted by the encoder, or

2. Solve an optimisation problem to find the latent variable \(z\) that maximises the posterior probability.

In plain English

Encode a real photo into its latent code, nudge that code along meaningful directions, then decode. Move along one direction and the face starts smiling; move along another and the mouth opens. You are editing high-level attributes by doing simple arithmetic in latent space.
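The second option, finding \(z\) by optimisation, can be sketched on a toy model. The assumptions are loud ones: a scalar latent and a linear decoder \(\mathbf{f}[z]=\mathbf{w}z\) with made-up weights, chosen because the maximum of \(\log Pr(x\mid z)+\log Pr(z)\) then has a closed form to check the optimiser against. A real VAE decoder is nonlinear and has no such closed form.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def map_latent(x, w, noise_var, steps=500, lr=0.05):
    """Gradient ascent on log Pr(x|z) + log Pr(z) for the toy decoder f[z] = w * z.

    Pr(x|z) = Norm[w*z, noise_var * I] and Pr(z) = Norm[0, 1], with scalar z.
    """
    z = 0.0
    for _ in range(steps):
        residual = [xi - wi * z for xi, wi in zip(x, w)]
        grad = dot(w, residual) / noise_var - z      # d/dz of the log posterior
        z += lr * grad
    return z

w = [1.0, -0.5, 2.0]          # hypothetical decoder weights
noise_var = 0.5
x = [0.9, -0.6, 2.2]          # the "real" example to be edited

z_map = map_latent(x, w, noise_var)
z_closed = (dot(w, x) / noise_var) / (dot(w, w) / noise_var + 1.0)   # exact for a linear decoder
z_edited = z_map + 0.5        # nudge the code along the (only) latent direction
x_edited = [wi * z_edited for wi in w]
print(f"z_map {z_map:.4f}  closed form {z_closed:.4f}")
print("decoded after edit:", [round(v, 3) for v in x_edited])
```

The last two lines mirror the editing recipe: recover a code for the example, shift it, and decode the shifted code.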

[Figure: resynthesis grid starting from the original face, with axes "Smiling" and "Mouth open".]
Slide 45. Resynthesis grid: a single encoded face decoded while sweeping two latent directions — "smiling" left-to-right and "mouth open" top-to-bottom. Smooth, independent control of attributes is exactly what a good latent space gives you.

Disentanglement

Definition · Disentanglement & the β-VAE (slide 46)

When each dimension of \(z\) represents an independent real-world factor, the latent space is disentangled. To encourage it, a general regularised loss is

\[ L_{\text{new}} = -\text{ELBO}[\theta,\phi] + \lambda_1\,\mathbb{E}_{Pr(x)}\!\Big[\mathrm{r}_1\big[q(z\mid x,\theta)\big]\Big] + \lambda_2\,\mathrm{r}_2\big[q(z\mid\theta)\big] \]

The β-VAE simply upweights the second (KL) term in the ELBO by a factor \(\beta\):

\[ \text{ELBO}[\theta,\phi] \approx \log\big[Pr(x\mid z^{*},\phi)\big] - \beta\cdot D_{KL}\!\Big[q(z\mid x,\theta)\,\big\|\,Pr(z)\Big] \]
In plain English

Setting \(\beta>1\) leans harder on the "stay close to the unit-Gaussian prior" pressure. Because the prior has independent (axis-aligned) dimensions, this push encourages each latent dimension to act independently — so one axis ends up controlling rotation, another size, another the number of legs, and so on. The trade-off: too large a \(\beta\) sacrifices reconstruction sharpness for cleaner factor separation.
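A minimal sketch of the single-sample β-ELBO, assuming the spherical Gaussian likelihood \(\text{Norm}_x[\mathbf{f}[z^{*},\phi],\sigma^2\mathbf{I}]\) and a diagonal encoder covariance. The function name and all numeric inputs are illustrative; `x_mean` stands for a decoder output that would come from a trained network.

```python
import math

def beta_elbo(x, x_mean, noise_var, mu, var, beta=1.0):
    """Single-sample beta-ELBO: log Pr(x|z*) - beta * D_KL[q || Norm[0, I]].

    x_mean is the decoder output f[z*, phi]; mu and var are the encoder's
    mean and diagonal covariance. All inputs are plain Python lists.
    """
    d_x, d_z = len(x), len(mu)
    sq_err = sum((xi - fi) ** 2 for xi, fi in zip(x, x_mean))
    log_lik = -0.5 * d_x * math.log(2.0 * math.pi * noise_var) - sq_err / (2.0 * noise_var)
    kl = 0.5 * (sum(var) + sum(m * m for m in mu) - d_z - sum(math.log(v) for v in var))
    return log_lik - beta * kl, log_lik, kl

x, x_mean = [0.2, 0.9], [0.25, 0.8]      # illustrative data point and reconstruction
mu, var = [0.7, -0.3], [0.5, 0.8]        # illustrative encoder outputs
elbo_1, log_lik, kl = beta_elbo(x, x_mean, 0.1, mu, var, beta=1.0)
elbo_4, _, _ = beta_elbo(x, x_mean, 0.1, mu, var, beta=4.0)
print(f"log-likelihood {log_lik:.4f}  KL {kl:.4f}")
print(f"ELBO (beta=1) {elbo_1:.4f}  ELBO (beta=4) {elbo_4:.4f}")
```

Raising \(\beta\) from 1 to 4 lowers the objective by exactly three times the KL, so the same encoder output is penalised more heavily for straying from the prior.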

[Figure: chair samples in three panels: a) Rotation, b) Size, c) Legs. Each latent axis varies one factor independently.]
Slide 47. A disentangled latent space on chairs: sweeping one latent dimension changes only the chair's rotation (a), another only its size (b), another only the legs/base (c). Clean one-axis-one-factor behaviour is the hallmark of disentanglement.
Q · Why is importance sampling so much more efficient than sampling from the prior for estimating \(Pr(x)\)?

Prior samples are spread over the whole latent space, but for a specific \(x\) only a tiny region of \(z\) gives non-negligible \(Pr(x\mid z)\). Almost all prior draws contribute \(\approx 0\), so you need enormous \(N\). The proposal \(q(z)\) (from the encoder) concentrates samples exactly in that high-contribution region; the weights \(Pr(z_n)/q(z_n)\) keep the estimate unbiased. Same accuracy, far fewer samples.

Q · Why does sampling from the aggregated posterior beat sampling from the prior for generation, given we trained the prior to be \(\text{Norm}[\mathbf{0},\mathbf{I}]\)?

The KL term only approximately matches the encoder's outputs to the prior; in practice the aggregated posterior \(\tfrac1I\sum_i q(z\mid x_i,\theta)\) — the region the encoded data actually occupies — differs from the prior, leaving "holes" in the prior that the decoder never trained on. Sampling those holes yields garbage. Sampling the aggregated posterior keeps you on the manifold the decoder understands, so outputs are sharper. (This mismatch is the well-known prior hole problem.)

Q · In the β-VAE, what happens to reconstructions as \(\beta\to\infty\), and as \(\beta\to 0\)?

As \(\beta\to\infty\) the loss is dominated by the KL term, so \(q(z\mid x,\theta)\) is forced to equal the prior for every \(x\). The latent then carries no information about \(x\) (posterior collapse): reconstructions become generic and blurry, and although the latent dimensions are independent, they are so only trivially, because they no longer encode any factor of the data. As \(\beta\to 0\) the KL is ignored and the model becomes a plain autoencoder optimising reconstruction only: sharp reconstructions but an unstructured, entangled, non-generative latent space. Useful disentanglement lives at an intermediate \(\beta>1\).

·  ·  ·
tying the bow

PART IX · Conclusion

One model, one bound, one trick — and a clear-eyed list of what the VAE can and cannot do.

Summary · Conclusion (slide 49)

VAEs learn a nonlinear latent variable model over \(x\). The generation process is: ① sample from the latent variable; ② pass the result through a deep network; ③ add independent Gaussian noise.
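The three-step generation process can be sketched as below. The decoder is a tiny one-hidden-layer network with random weights standing in for a trained \(\mathbf{f}[z,\phi]\); the layer sizes, the `tanh` nonlinearity, and the noise level are all assumptions made for illustration.

```python
import math
import random

def decoder(z, w1, b1, w2, b2):
    """Tiny stand-in for the deep network f[z, phi]: one tanh hidden layer."""
    hidden = [math.tanh(sum(w * zi for w, zi in zip(row, z)) + b) for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]

def generate(d_z, params, noise_std, rng):
    z = [rng.gauss(0.0, 1.0) for _ in range(d_z)]                 # 1. sample the latent from Norm[0, I]
    x_mean = decoder(z, *params)                                   # 2. pass it through the network
    return [m + noise_std * rng.gauss(0.0, 1.0) for m in x_mean]   # 3. add independent Gaussian noise

rng = random.Random(5)
d_z, d_h, d_x = 2, 4, 3
# Random weights stand in for trained parameters phi (an assumption for illustration).
w1 = [[rng.gauss(0.0, 1.0) for _ in range(d_z)] for _ in range(d_h)]
b1 = [0.0] * d_h
w2 = [[rng.gauss(0.0, 1.0) for _ in range(d_h)] for _ in range(d_x)]
b2 = [0.0] * d_x

samples = [generate(d_z, (w1, b1, w2, b2), noise_std=0.1, rng=rng) for _ in range(5)]
for s in samples:
    print([round(v, 3) for v in s])
```

Note that the encoder plays no part in generation; it is needed only during training and for tasks such as resynthesis.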

Challenges: it is not possible to compute the likelihood of a data point in closed form, and computing the posterior probability of the latent variable given observed data is intractable.

Solution: the variational approximation. Enhancement: sophisticated latent-space modelling — e.g. hierarchical priors.

The one-paragraph story

We wanted to model complicated data with a simple-inside / rich-outside latent model, but the defining integral was intractable. Jensen's inequality handed us a lower bound (the ELBO); rewriting it three ways revealed it as reconstruction minus a KL-to-prior. We approximated the intractable posterior with an encoder network (the variational approximation), estimated the reconstruction term by single-sample Monte Carlo, computed the KL in closed form, and used the reparametrisation trick to keep everything differentiable. The result is the VAE: encoder → sample → decoder, trained by maximising the ELBO.

Q · The conclusion lists two challenges: likelihood not computable in closed form, and intractable posterior. Which VAE component addresses each?

Intractable likelihood \(Pr(x\mid\phi)\) → addressed by maximizing the ELBO instead of the likelihood directly (Jensen's bound). Intractable posterior \(Pr(z\mid x,\phi)\) → addressed by the variational approximation […]

  • […] samples are blurry and how the aggregated posterior \(\tfrac1I\sum_i q(z\mid x_i,\theta)\) helps.
  • Describe the two resynthesis methods and define disentanglement.
  • Write the β-VAE objective and explain the reconstruction-vs-disentanglement trade-off as \(\beta\) varies.
  • Recite the conclusion: what VAEs learn, the 3-step generation, the two intractability challenges, and the variational solution.

END OF WEEK 17 · VARIATIONAL AUTOENCODERS · KCL MACHINE LEARNING