Week 7 — Backward Propagation & the Universal Approximation Theorem

Part I

The learning problem — loss, gradient, update

Training a neural network means finding numerical settings of its weights so that, when an input is pushed through, the output matches the truth as closely as possible. We need three things: a measure of "how wrong are we" (the loss), a way to know which way to nudge each weight (the gradient), and a rule for actually nudging (the update).

Think of the apple-vs-orange picture from the slides. We have a network with some random weights, and we want it to fire on apples and stay quiet on oranges. We don't change the architecture — the boxes and arrows stay the same. We change the numbers on the arrows. That's what learning is.

The procedure on slide 3 is short enough to memorise: (i) pick a training example $(x, y)$ at random, (ii) push it forward through the network and look at the output $\hat y$, (iii) if $\hat y$ disagrees with $y$, change the weights and biases. Most of this week's mathematics is just the third step in detail.

Loss: a scalar that says "how wrong"

Definition · MSE loss for one example

Given a label $y$ and a network output $\hat y$, the mean squared error loss on this single example is

$$\text{loss}(w) \;=\; \text{MSE} \;=\; \tfrac{1}{2}(y - \hat y)^2.$$

The factor $\tfrac{1}{2}$ has no statistical meaning — it just makes the derivative come out cleaner (the $2$ cancels with the $\tfrac{1}{2}$). The loss is a function of the weights $w$ because $\hat y$ depends on $w$.

In plain English

"How far off are we?" Square the gap so positive and negative misses both count, and so big misses are punished more than small ones. Halve it for clean derivatives.

The update rule

Slide 4 gives the central object of the entire week. Once we have a loss, we update each weight by moving it opposite the slope of the loss with respect to that weight:

$$w \;:=\; w \;-\; \alpha \, \nabla \text{loss}(w).$$

Two new symbols. The learning rate $\alpha$ is a small positive number that controls step size. The gradient $\nabla \text{loss}(w)$ is the vector of partial derivatives of the loss with respect to every weight: it points in the direction of steepest increase of the loss, so we negate it to descend.

Pitfall — sign of the update

The minus sign matters. Adding the gradient would climb the loss surface, not descend it. If you've ever written w = w + alpha * grad and watched the loss explode, this is why.

Pitfall — the learning rate is a knife edge

Slide 4 spells it out: "Big values might overshoot, small values will take a long time to converge."

This is one of the few exam-friendly verbal answers — but make sure to give both sides. Too big and you bounce out of the basin; too small and you barely move per step.

Why is MSE written with a $\tfrac{1}{2}$ instead of just $(y-\hat y)^2$? Does it change anything?

It changes the value of the loss by a constant factor of two but does not change where the loss is minimised — the optimum is the same. The point of the $\tfrac{1}{2}$ is purely cosmetic for derivatives: $\partial(\tfrac{1}{2}(y-\hat y)^2)/\partial \hat y = -(y-\hat y)$, with no extra 2 floating around. Without the $\tfrac{1}{2}$ you'd write $-2(y-\hat y)$ everywhere; the algorithm absorbs the extra factor of 2 into the learning rate $\alpha$ anyway.

Sketch what happens to the loss-vs-iteration curve when (a) $\alpha$ is too large, (b) $\alpha$ is too small, (c) $\alpha$ is just right.

(a) Too large: the loss may oscillate or even grow — you overshoot the minimum and bounce back and forth across it (or out of the basin entirely).

(b) Too small: the loss decreases monotonically but very slowly — you crawl downhill and may not reach the minimum within your training budget.

(c) Just right: the loss decreases quickly at first, then levels off near the minimum. (Slide 20 shows exactly the (a) vs (b) picture: zigzag jumps for big $\alpha$, tiny steps for small $\alpha$.)

· · ·

Part II

Stochastic gradient descent

"Stochastic" just means random. Instead of computing the gradient over the entire training set on every step (which is expensive), we use one example at a time, picked at random. This is faster, noisier, and — surprisingly — often better.

Slide 9 names the only thing that distinguishes SGD from plain (batch) gradient descent: "only one piece of data from the dataset is used to calculate the step, and the piece of data is picked randomly at each step."

Definition · Stochastic gradient descent

Given a dataset $\{(x_i, y_i)\}_{i=1}^N$, repeat: pick an index $i$ uniformly at random, compute the gradient of the loss on that single example, and update

$$w \;:=\; w \;-\; \alpha \, \nabla \text{loss}_i(w).$$

Compare to batch gradient descent, where the gradient is the average over all $N$ examples before any update.

In plain English

Batch GD takes one careful step per pass through the entire dataset. SGD takes $N$ noisy steps per pass. The noise is usually a feature, not a bug — it can knock you out of bad local minima.

The descent, picture by picture

Slides 11–17 animate descent on a 1-D loss curve. The story is: pick a starting weight, compute the slope at that point (the dotted tangent), step downhill (the arrow), repeat. Each step roughly halves the distance to the minimum until you're in the bottom of the basin and progress slows.

Slides 11–18: dots are successive weights; arrows are gradient steps. Big slope at the start → big steps. Near the bottom, the slope flattens and the steps shrink.

Local optima and the bumpy landscape

Slide 18 introduces a crucial vocabulary distinction. A local optimum is a point that is lower than everything in its immediate neighbourhood. The global optimum is the lowest point overall. Plain gradient descent has no way to tell them apart: it just rolls downhill from wherever you started, and where it stops depends on where you began.

Slide 19's three-dimensional surface drives the point home. In a real network with millions of weights the loss surface is hyper-dimensional, but the cartoon is the same: lots of valleys of varying depth. Where you land depends on where you started and which gradient you happened to step against — which is why the noise of SGD can be useful.

Two unavoidable approximations (slide 20)

The slides explicitly call these out as the two failure modes of SGD:

We only find local optima (not necessarily the global one).
We don't reach them exactly — we only approach (this depends on the learning rate; with a non-zero $\alpha$, we hop around near the bottom rather than settling there).

Both are testable verbatim.

What is the key difference between SGD and batch gradient descent?

Batch GD computes the gradient over the entire training set before each weight update; SGD uses just one randomly-picked example per update. Per slide 9, this is the only difference at the algorithmic level. Practical consequences: SGD is faster per step, noisier, and can escape shallow local minima.

Slide 20 says we have "two errors/approximations." Name them and say which one is fixable in principle and which one isn't.

(1) Only local optima: we may end up in a basin that isn't the deepest one. Not generally fixable for non-convex losses (which any deep network has) — though restarts, momentum, and noise help in practice.

(2) We only approach, never reach: the algorithm hops around the minimum because $\alpha > 0$. This is fixable in principle by shrinking $\alpha$ over time (a "learning-rate schedule"). With a constant $\alpha$ you are stuck oscillating in a small region around the optimum.

· · ·

Part III

Feed-forward, with numbers

Before we backpropagate anything, we have to push something forward. Slides 21–24 set up the toy network we'll use for the entire rest of the week. Memorise these numbers — every backprop computation that follows reuses them.

The toy network has two inputs $x_1, x_2$, two hidden units $h_1, h_2$ (with ReLU activations), and one output $\hat y$ (also ReLU). All six weights are labelled $w_1$ through $w_6$. There are no biases in this example — the picture is already busy enough.

The toy network from slide 22. Inputs $x_1=2$, $x_2=3$. Six weights, all positive and small.

The forward pass, term by term

The activation function throughout is $\sigma(z) = \text{ReLU}(z) = \max(0, z)$. Slides 23–24 walk the calculation:

$\sigma(h_1) = \text{ReLU}(x_1 w_1 + x_2 w_2) = \text{ReLU}(2 \cdot 0.11 + 3 \cdot 0.12) = \text{ReLU}(0.58) = 0.58$

$\sigma(h_2) = \text{ReLU}(x_1 w_3 + x_2 w_4) = \text{ReLU}(2 \cdot 0.21 + 3 \cdot 0.08) = \text{ReLU}(0.66) = 0.66$

$\hat y = \sigma(z) = \text{ReLU}(\sigma(h_1) \cdot w_5 + \sigma(h_2) \cdot w_6) = \text{ReLU}(0.58 \cdot 0.14 + 0.66 \cdot 0.15) = \text{ReLU}(0.1802) = 0.1802$

Notational trap on the slides

The slides write things like "$\sigma(h_1) = \text{ReLU}(x_1 w_1 + x_2 w_2)$" — i.e. they're using $h_1$ to mean the pre-activation $x_1 w_1 + x_2 w_2$, and $\sigma(h_1)$ to mean the post-activation output of the unit. In the chain rule pages later, $h_1$ is used to mean the activation (the value 0.58). It's the same convention some books use; the meaning is always clear from context, but watch for it on the exam.

Now suppose the true label is $y = 1$. The error is

$$\text{Error} \;=\; \tfrac{1}{2}(y - \hat y)^2 \;=\; \tfrac{1}{2}(1 - 0.1802)^2 \;\approx\; 0.336.$$

(Slide 26.) The network is wildly wrong — the output 0.1802 is far from the desired 1. Backprop's job is to adjust each of the six weights to reduce this error.

Recompute the forward pass with the same inputs but $w_5 = -0.14$ (not $+0.14$). What is $\hat y$?

The hidden activations don't depend on $w_5$, so they're still $0.58$ and $0.66$. The pre-activation at the output is now

$0.58 \cdot (-0.14) + 0.66 \cdot 0.15 = -0.0812 + 0.099 = 0.0178.$

This is positive, so $\text{ReLU}$ leaves it alone: $\hat y = 0.0178$.

If the pre-activation had been negative, $\hat y$ would be exactly $0$ — a useful sanity check that ReLU can produce zeros.

Compute the error if the same network were given $y = 0$ instead of $y = 1$.

$\text{Error} = \tfrac{1}{2}(0 - 0.1802)^2 = \tfrac{1}{2}(0.0325) \approx 0.0162$. Much smaller than $0.336$ — the network's output of $0.1802$ is much closer to $0$ than to $1$, so it's "almost right" if the answer were 0.

· · ·

Part IV

Backprop into the output layer

Backpropagation is just the chain rule applied carefully. The key insight: to update a weight, you don't need to recompute the network from scratch — you can reuse pieces of the gradient that you already calculated for weights closer to the output. Information literally flows backwards.

Slides 5–8 illustrate this with the deeper four-layer network: starting from the output (slide 6), gradient information propagates layer by layer back into the network (slides 7–8), each layer reusing what came after it. We'll do the math on the simpler 2-2-1 toy network from Part III.

The plan and the ReLU derivative

Slide 27 states the update rule for any weight $w_i$:

$$w_i' \;=\; w_i \;-\; \alpha \left( \frac{\partial \text{Error}}{\partial w_i} \right).$$

So the entire problem reduces to computing $\partial \text{Error} / \partial w_i$ for each of the six weights. We'll need the derivative of the activation. Slide 28 reminds us that for ReLU,

Definition · Derivative of ReLU

For $\sigma(x) = \text{ReLU}(x) = \max(0,x)$,

$$\frac{\partial \sigma(x)}{\partial x} \;=\; \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise.} \end{cases}$$

(At $x = 0$ the derivative is technically undefined, but we conventionally set it to $0$.)

Deriving $\partial \text{Error} / \partial w_6$

Let's start with $w_6$, the weight from $h_2$ into the output. The chain rule on slide 30 says

$$\frac{\partial \text{Error}}{\partial w_6} \;=\; \frac{\partial \text{Error}}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_6} \;=\; \frac{\partial \tfrac{1}{2}(y - \hat y)^2}{\partial \hat y} \cdot \frac{\partial \sigma(h_1 w_5 + h_2 w_6)}{\partial w_6}.$$

Two factors. The first is just calculus on the loss; the second is calculus on the activation.

Factor 1: Loss derivative

Slide 31 computes

$$\frac{\partial \tfrac{1}{2}(y-\hat y)^2}{\partial \hat y} \;=\; -2 \cdot \tfrac{1}{2}(y - \hat y) \;=\; -(y - \hat y) \;=:\; \Delta.$$

The minus sign comes from differentiating $(y - \hat y)$ with respect to $\hat y$. The slides give this quantity a name — $\Delta$ — and it shows up in every backprop derivation that follows.

Definition · The output error term Δ

For MSE loss with one output, $\Delta := -(y - \hat y) = \hat y - y$.

It is the "signed error" — positive if the network overshot ($\hat y > y$), negative if it undershot. For our example, $\Delta = -(1 - 0.1802) = -0.8198$.

Factor 2: Activation derivative — the case split

The second factor is $\partial \sigma(h_1 w_5 + h_2 w_6) / \partial w_6$. We need the ReLU derivative, which depends on the sign of the argument $h_1 w_5 + h_2 w_6$.

Case A (slide 33): if $h_1 w_5 + h_2 w_6 \le 0$, then $\partial \sigma / \partial w_6 = 0$, so

$$\frac{\partial \text{Error}}{\partial w_6} \;=\; \Delta \cdot 0 \;=\; 0.$$

The weight gets no update at all.

Case B (slide 34): if $h_1 w_5 + h_2 w_6 > 0$, then $\partial \sigma / \partial w_6 = 1$, and inside the ReLU we have a linear function whose derivative w.r.t. $w_6$ is just $h_2$:

$$\frac{\partial \text{Error}}{\partial w_6} \;=\; \Delta \cdot \frac{\partial [h_1 w_5 + h_2 w_6]}{\partial w_6} \;=\; \Delta \cdot \frac{\partial [h_2 w_6]}{\partial w_6} \;=\; \Delta \cdot h_2.$$

The first term $h_1 w_5$ doesn't depend on $w_6$, so its derivative is zero. Only $h_2 w_6$ contributes, and its derivative is $h_2$.

The update for $w_6$ and (by symmetry) $w_5$

Putting it all together (slide 36):

$$w_6' \;=\; \begin{cases} w_6 - \alpha (\Delta \cdot h_2) & \text{if } h_1 w_5 + h_2 w_6 > 0 \\ w_6 & \text{otherwise.} \end{cases}$$

For $w_5$, the derivation is identical with $h_1$ in place of $h_2$ (slide 37):

$$w_5' \;=\; \begin{cases} w_5 - \alpha (\Delta \cdot h_1) & \text{if } h_1 w_5 + h_2 w_6 > 0 \\ w_5 & \text{otherwise.} \end{cases}$$

In plain English

"How much should I change the weight from hidden unit $h_2$ to the output?" The answer is: (error at the output) times (how active $h_2$ was). If $h_2$ was silent, this weight had no influence on the prediction this time, so don't change it. If the output was nearly correct ($\Delta$ small), don't change it much either. The bigger the error and the louder the upstream neuron, the bigger the weight update.

Pitfall — the gating condition is on the output's pre-activation

The condition for both $w_5$ and $w_6$ is the same: $h_1 w_5 + h_2 w_6 > 0$. This is the pre-activation of the output unit. If the output ReLU was off (input ≤ 0), the gradient is zero for every weight feeding the output — and (as we'll see) for every weight feeding any earlier layer too, because the error can't propagate past a dead ReLU.

For the toy network from Part III with $y=1$, $\hat y = 0.1802$, $h_2 = 0.66$, and $\alpha = 0.1$, compute the new $w_6$.

First, $\Delta = -(1 - 0.1802) = -0.8198$.

Output pre-activation: $h_1 w_5 + h_2 w_6 = 0.58 \cdot 0.14 + 0.66 \cdot 0.15 = 0.1802 > 0$, so we're in Case B.

$\partial \text{Error}/\partial w_6 = \Delta \cdot h_2 = -0.8198 \cdot 0.66 \approx -0.5411$.

$w_6' = 0.15 - 0.1 \cdot (-0.5411) = 0.15 + 0.05411 \approx 0.2041$.

The weight increases — which makes sense: the network undershot ($\hat y < y$), so we want the output to be larger, and increasing $w_6$ (with $h_2 > 0$) increases the output.

Suppose someone in the same toy setup chose $w_5 = -1$ and $w_6 = -1$. What is $\partial \text{Error}/\partial w_6$? Why?

The output pre-activation is $0.58 \cdot (-1) + 0.66 \cdot (-1) = -1.24 < 0$. We are in Case A. The gradient is exactly zero, regardless of the size of the error. The output ReLU is "dead" for this input, and no signal can pass through it. This is the so-called dying ReLU problem in miniature.

What is $\Delta$ in terms of $y$ and $\hat y$, and what is its sign when the network undershoots vs overshoots?

$\Delta = -(y - \hat y) = \hat y - y$.

If the network overshoots ($\hat y > y$), $\Delta > 0$.
If the network undershoots ($\hat y < y$), $\Delta < 0$.

The sign of $\Delta$ flips the sign of the gradient, which (combined with the $-\alpha$ in the update) ensures we move in the direction that reduces the error.

· · ·

Part V

Backprop into the hidden layer

For weights closer to the input, the chain rule has more links — the error signal has to be propagated back through more layers. The derivation is the same idea, just with one extra step. The reward is a tidy summary of all six weight updates.

We tackle $w_1$, the weight from input $x_1$ into hidden unit $h_1$. Slide 38 says: assume both ReLUs are active — that is, $h_1 w_5 + h_2 w_6 > 0$ and $x_1 w_1 + x_2 w_2 > 0$. (If either is non-positive, the gradient is zero by the same argument as before.)

The chain rule, three links deep

Slide 39 writes the chain

$$\frac{\partial \text{Error}}{\partial w_1} \;=\; \frac{\partial \text{Error}}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial h_1} \cdot \frac{\partial h_1}{\partial w_1}.$$

The first factor is $\Delta$, as before. The middle and right factors slide 40 unpacks:

$$\frac{\partial \hat y}{\partial h_1} \;=\; \frac{\partial \sigma(h_1 w_5 + h_2 w_6)}{\partial h_1} \;=\; w_5 \quad \text{(under Case B for the output ReLU)}.$$

$$\frac{\partial h_1}{\partial w_1} \;=\; \frac{\partial \sigma(x_1 w_1 + x_2 w_2)}{\partial w_1} \;=\; x_1 \quad \text{(under Case B for the hidden ReLU)}.$$

Multiplying:

$$\frac{\partial \text{Error}}{\partial w_1} \;=\; \Delta \cdot w_5 \cdot x_1.$$

In plain English

To update an early-layer weight: take the output error, multiply by every weight on the path from this unit to the output (here just $w_5$), and multiply by the input that this weight saw (here $x_1$). Each weight along the path "scales" how much the error gets blamed on the upstream weight.

The full summary (slide 41)

Slide 41 collects all six update rules. It looks intimidating but it's just six instances of the same idea. Let me lay them out side by side. Recall $\Delta = -(y - \hat y)$.

Weight	Update (when both relevant ReLUs are active)	Active condition
$w_1$	$w_1' = w_1 - \alpha(\Delta \cdot w_5 \cdot x_1)$	$h_1 w_5 + h_2 w_6 > 0$ and $x_1 w_1 + x_2 w_2 > 0$
$w_2$	$w_2' = w_2 - \alpha(\Delta \cdot w_5 \cdot x_2)$	$h_1 w_5 + h_2 w_6 > 0$ and $x_1 w_1 + x_2 w_2 > 0$
$w_3$	$w_3' = w_3 - \alpha(\Delta \cdot w_6 \cdot x_1)$	$h_1 w_5 + h_2 w_6 > 0$ and $x_1 w_3 + x_2 w_4 > 0$
$w_4$	$w_4' = w_4 - \alpha(\Delta \cdot w_6 \cdot x_2)$	$h_1 w_5 + h_2 w_6 > 0$ and $x_1 w_3 + x_2 w_4 > 0$
$w_5$	$w_5' = w_5 - \alpha(\Delta \cdot h_1)$	$h_1 w_5 + h_2 w_6 > 0$
$w_6$	$w_6' = w_6 - \alpha(\Delta \cdot h_2)$	$h_1 w_5 + h_2 w_6 > 0$

If any active condition fails, the weight does not change on that step (the bracketed expression in slide 41 is zero).

Three patterns to spot

Output-layer weights ($w_5$, $w_6$): update factor is $\Delta \cdot h_j$, where $h_j$ is the upstream neuron's activation.
Input-layer weights into $h_1$ ($w_1, w_2$): update factor is $\Delta \cdot w_5 \cdot x_j$ — error times the path weight $w_5$ times the input.
Input-layer weights into $h_2$ ($w_3, w_4$): update factor is $\Delta \cdot w_6 \cdot x_j$ — error times the path weight $w_6$ times the input.

This pattern generalises to any architecture: walk the path from the weight to the output, multiply the activation derivatives and weights along the way, and multiply by the upstream activation.

Using the worked example ($\Delta \approx -0.8198$, $w_5 = 0.14$, $x_1 = 2$, $\alpha = 0.1$, all ReLUs active), compute the new $w_1$.

$\partial \text{Error}/\partial w_1 = \Delta \cdot w_5 \cdot x_1 = -0.8198 \cdot 0.14 \cdot 2 \approx -0.2295$.

$w_1' = w_1 - \alpha \cdot (-0.2295) = 0.11 + 0.02295 \approx 0.1330$.

Notice the gradient is much smaller than the gradient at $w_6$ (≈ −0.5411), even though everything else is comparable. That's because we multiplied by $w_5 = 0.14$ along the way — the weight on the path from $h_1$ to the output. Small weights along the path damp the error signal as it travels back. Hold that thought; it's the entire idea behind vanishing gradients in Part VI.

Why does the update for $w_3$ involve $w_6$ but not $w_5$?

$w_3$ feeds into $h_2$ (not $h_1$). The only path from $h_2$ to the output goes through $w_6$. There is no path from $h_2$ that uses $w_5$, so $w_5$ doesn't appear in the chain. In a fully-connected network with multiple output paths, you'd sum over all paths — here there's only one each.

True or false: if the hidden pre-activation $x_1 w_3 + x_2 w_4$ is exactly $0$, the slides' gradient for $w_3$ is non-zero. Explain.

False. Slide 41 uses the strict inequality $> 0$, and per slide 28 the convention at $x = 0$ for the ReLU derivative is $0$. So the gradient is $0$ exactly at the kink. (Real implementations sometimes use a sub-gradient; the slide convention is fine for an exam.)

· · ·

Part VI

Vanishing gradients

If the gradient is multiplied by a number smaller than $1$ at every layer it crosses, then in a deep network the gradient at the earliest layer is multiplied by something tiny — sometimes so tiny it's effectively zero. The early layers stop learning. This is the vanishing-gradients problem.

Slide 42 sets up a deeper version of the toy net: input $x$, then three hidden layers $h_{1,1}, h_{1,2}, h_{1,3}$ (top stream) and $h_{2,1}, h_{2,2}, h_{2,3}$ (bottom stream), then output $\hat y$. The chain rule for the very first weight $w_1$ now has five factors:

$$\frac{\partial \text{Error}}{\partial w_1} \;=\; \frac{\partial \text{Error}}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial h_{1,3}} \cdot \frac{\partial h_{1,3}}{\partial h_{1,2}} \cdot \frac{\partial h_{1,2}}{\partial h_{1,1}} \cdot \frac{\partial h_{1,1}}{\partial w_1}.$$

Each $\partial h_{1,x+1}/\partial h_{1,x}$ in the middle is, by the chain rule, the activation derivative times a weight on the path. Three of these chained.

The sigmoid activation, and the bound 0.25

Slide 43 brings in the sigmoid:

$$\sigma(x) \;=\; \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) \;=\; \sigma(x)\big(1 - \sigma(x)\big).$$

The derivative is largest when $\sigma(x) = 0.5$, where $\sigma'(x) = 0.5 \cdot 0.5 = 0.25$. Everywhere else it is smaller. So $\sigma'(x) \le 0.25$ for all $x$.

Slide 43. The sigmoid (blue) saturates at 0 and 1 with a smooth $S$ shape; its derivative (orange dashed) peaks at exactly $0.25$ at $x=0$ and decays exponentially in both directions.

Why this is fatal for deep networks

Each factor $\partial h_{1,x+1}/\partial h_{1,x}$ in the chain involves $\sigma'$ at some pre-activation, multiplied by a weight. If $\sigma'$ is bounded by $0.25$, and weights are typically initialised around $1$ in magnitude or smaller, then each factor is bounded by something like $0.25$. Chaining $L$ of them gives

$$\Big| \frac{\partial \text{Error}}{\partial w_1} \Big| \;\lesssim\; (0.25)^L \cdot |\Delta|.$$

For $L = 10$, that's $(0.25)^{10} \approx 10^{-6}$. The gradient at the first layer is a million times smaller than the gradient at the output. Early-layer weights barely move; effectively, only the last few layers are training. This is the vanishing-gradients problem (slide 44).

The exam-friendly bullet

"Each factor $\partial h_{1,x+1}/\partial h_{1,x}$ is bounded by 0.25 (because the maximum of $\sigma'$ is 0.25), so for a deep network the gradient at the early layers is exponentially small in the depth." — that's the entire argument in one sentence.

Solutions (slide 45)

The slides list three:

Shallower networks. Fewer layers means fewer factors in the chain, less attenuation. (Crude but effective.)
The ReLU function. Its derivative is $1$ wherever the unit is active, not $\le 0.25$. So the chain doesn't shrink — though it can still die entirely if a unit's pre-activation is non-positive (the "dead ReLU" problem).
Batch Normalization. Slide 45 verbatim: "normalizes the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation. Batch normalization can mitigate the problem by reducing internal covariate shift, which in turn helps in maintaining stronger gradients."

Definition · Batch Normalization (slide 45)

For a mini-batch of activations from a layer, compute the batch mean $\mu_B$ and batch standard deviation $\sigma_B$, then replace each activation $a$ by $(a - \mu_B)/\sigma_B$ before passing it on. This keeps activations in a sensible range across training, which keeps activation derivatives away from saturated regions and thus keeps gradients larger.

The effect this targets is called internal covariate shift — the changing distribution of layer inputs as upstream weights update. Batch norm reduces this shift.

If we use the sigmoid in a 5-layer network and weights are around 1, roughly how much smaller is the gradient at the first layer than at the output?

Each layer contributes a factor bounded by $\sigma'(\cdot) \le 0.25$ (assuming weights are about 1). With four "between-layer" factors in a 5-layer net, the bound is $(0.25)^4 \approx 0.0039$. The gradient at the first layer is at most about $0.4\%$ of the gradient at the output. (In practice it can be far smaller, since most pre-activations sit in the saturated tails where $\sigma'$ is much less than 0.25.)

Why does ReLU help with vanishing gradients but not entirely solve it?

ReLU's derivative is $1$ wherever the unit is active, so the activation factor stops shrinking the gradient — that's the help. But:

Where ReLU is inactive (pre-activation ≤ 0), the derivative is $0$, killing the gradient entirely along that path.
The weights themselves can still be small or large, so the chain can vanish (or explode) through the weight-multiplication factors even when activations are healthy.

So ReLU helps but isn't a magic bullet — hence techniques like careful initialisation, batch norm, and skip connections.

What is "internal covariate shift" and what does batch norm do about it?

Internal covariate shift: as upstream weights change during training, the distribution of inputs to a downstream layer changes too — that layer is "chasing a moving target." This shift can push activations into saturated regions of nonlinearities (where derivatives are tiny) and slow learning.

Batch norm's response: normalise each layer's activations (subtract batch mean, divide by batch std) so that, at every layer, inputs to the next layer have a stable mean and variance regardless of what upstream weights are doing. Per slide 45, this "helps in maintaining stronger gradients."

· · ·

Part VII

The Universal Approximation Theorem

Once we know how to train weights, the next question is what those weights can ever express. The remarkable answer: a neural network with just one hidden layer can approximate any continuous function arbitrarily well — provided you allow the hidden layer to be wide enough.

The setting: what does "approximating a function" mean?

Slide 47 frames the problem in terms of image classification. Each input image is a point in some very-high-dimensional feature space; the goal of training is to partition that space so each region maps to the right label. Said geometrically, learning is just carving up $\mathbb{R}^d$.

Slide 48 simplifies to a one-dimensional cartoon: given $x$, predict $f(x)$ for some target function — say $f(x) = \sin(x) \cdot x^2 - x^3/30 + x^4/500$, the wiggly curve on the slide. Two natural questions arise (slide 49):

How many samples do we need to learn $f$ well? (A statistical question.)
Can we approximate any function $f$? (An expressivity question.)

The week tackles the second.

Approximation, not exact reproduction

Slide 50 shows a continuous curve (blue) overlaid with a piecewise-constant staircase (yellow). The staircase isn't perfect — there are gaps wherever the blue curve isn't flat — but if we make the staircase steps fine enough, the gap can be made as small as we want. This is the core idea: we don't need to be the function, we only need to approximate it to any desired accuracy.

The same picture works in higher dimensions (slide 51): a smooth surface gets approximated by a piecewise-constant tessellation. As long as the tiles are small enough, the approximation gets as good as we like.

The theorem itself

Theorem · Universal Approximation (slide 52)

"We can approximate any d-dimensional continuous function arbitrarily well using a NN with only one hidden layer."

More carefully: let $f: K \subset \mathbb{R}^d \to \mathbb{R}$ be a continuous function on a compact set, and let $\varepsilon > 0$. Then there exists a neural network with a single hidden layer (using sigmoid activation) that approximates $f$ to within $\varepsilon$ everywhere on $K$.

The catch — never written on the slide but worth knowing for the exam — is that "wide enough" can mean astronomically wide. The theorem is about existence, not efficiency.

Common misinterpretations

UAT does not say neural networks are good at finding these weights — only that the weights exist. Training is a separate question.
UAT does not say one hidden layer is enough in practice. Deep networks express the same function with exponentially fewer units in many cases. UAT is about possibility, not parsimony.
The function must be continuous on a compact (bounded, closed) set. Functions with discontinuities, or functions on all of $\mathbb{R}$, need slightly stronger versions of the theorem.

Why is "approximation" enough? Why don't we need exact reproduction?

Slide 50 makes the point: a continuous function has uncountably many values, and any finite network has finitely many parameters, so exact reproduction is mathematically impossible in general. But for any practical purpose (classification, regression, function evaluation), being arbitrarily close is just as good. If you can be within $\varepsilon$ everywhere for any $\varepsilon > 0$, that is the function for engineering purposes.

What does UAT promise, and what does it not promise?

Promises: for any continuous $f$ on a compact set and any $\varepsilon > 0$, there exists a single-hidden-layer NN that approximates $f$ to within $\varepsilon$. The hidden layer can use a sigmoid (or any "squashing" activation).

Does not promise: (a) that the hidden layer is small — it might need to be enormous; (b) that gradient descent will find these weights; (c) that one layer is the most efficient — deeper networks usually need far fewer parameters; (d) anything about generalising from finite training data — that's a separate, statistical concern.

· · ·

Part VIII

Building any function from sigmoid bumps

Slides 53–63 walk through a concrete construction proving the 1-D version of the theorem. Three steps: (i) build a single steep step from a sigmoid, (ii) combine two steps into a localised "bump," (iii) add up many bumps at chosen heights to staircase any target function.

Step 1 — one sigmoid → one step

The hidden unit computes $h_1 = \sigma(x_1 w_1 + b_1)$ where $\sigma$ is the sigmoid (slide 53). The threshold (the place where the sigmoid crosses $0.5$) sits at

$$s_1 \;=\; -b_1 / w_1.$$

That's because $\sigma$ is centred at $0$, so it crosses $0.5$ when its argument is zero, which means $x_1 w_1 + b_1 = 0$, i.e. $x_1 = -b_1/w_1$.

The steepness of the transition is controlled by $w_1$. Larger $w_1$ gives a sharper transition. Slides 54–55 show this in pictures:

$w_1$	$b_1$	$s_1 = -b_1/w_1$	Shape
20	−20	1.0	smooth gentle ramp
4	−2	0.5	very gentle ramp
40	−20	0.5	steeper ramp at $x = 0.5$
400	−200	0.5	essentially a step at $x = 0.5$
600	−200	$1/3$	step at $x = 1/3$
800	−200	$1/4$	step at $x = 1/4$

The lesson (slide 56): pick $w_1$ huge to get a near-perfect step, then choose $b_1$ to place that step at any $s_1$ you want, since $s_1 = -b_1/w_1$.

Slide 56. With $w_1 = 600$ and $b_1 = -200$, the sigmoid's threshold $s_1 = -b_1/w_1 = 1/3$ and the transition is so sharp it functions as a step.

From slide 57 onward, the slides abstract this away: instead of writing $w_1, b_1$, they just write a hidden node labelled $s_1$, with the understanding that some pair $(w_1, b_1)$ has been chosen to put the step at $s_1$.

Step 2 — two steps → one bump

Slide 59 puts two such hidden nodes in parallel: one with threshold $s_1 = 1/3$, one with threshold $s_2 = 2/3$. The output combines them with weights $w_3 = 1$ and $w_4 = -1$:

$$\hat y \;=\; 1 \cdot s_1(x) \;+\; (-1) \cdot s_2(x).$$

Where $x < 1/3$: both step nodes are off (output 0); $\hat y = 0$.
Where $1/3 < x < 2/3$: $s_1$ is on (output 1) but $s_2$ is still off; $\hat y = 1$.
Where $x > 2/3$: both are on; $\hat y = 1 - 1 = 0$.

That's a rectangular bump of height $1$ between $x = 1/3$ and $x = 2/3$.

Slide 59. Step at $1/3$ minus step at $2/3$ = rectangular bump of height $1$ between $1/3$ and $2/3$.

To get a shorter bump of height $\ell$, just scale the output weights (slide 61):

$$w_3 = \ell, \quad w_4 = -\ell \;\;\Longrightarrow\;\; \text{bump of height } \ell.$$

Step 3 — many bumps → any staircase

The final move (slide 63): a target function approximated by a staircase consists of many flat pieces, each at its own height $\ell_i$ over its own interval $[s_{2i}, s_{2i+1}]$. To build the staircase, add a bump of height $\ell_i$ over $[s_{2i}, s_{2i+1}]$ for every flat piece. In a one-hidden-layer network, that means: two hidden nodes per flat piece.

Slide 63. Each flat segment of the target staircase needs two hidden units: a "step on" at the segment's left edge and a "step off" at its right edge. The output weights $\pm w_i$ set the segment's height.

What the proof actually shows

Once you can build any 1-D staircase, you can approximate any 1-D continuous function (slide 50). The same construction generalises to higher dimensions (slide 51): build "tile" bumps over hyper-rectangles in $\mathbb{R}^d$ instead of intervals. Add up tiles, one per region, at the right heights. That's the entire proof idea — every step uses at most one hidden layer.

Why this is so striking

One hidden layer of sigmoid (or step-like) units is enough to express any continuous function. The expressivity of neural networks doesn't come from depth — depth is about efficiency, not capability. Even the shallowest possible architecture, with the right (possibly very large) number of hidden units, can do the job. This was a major theoretical result of the late 1980s (Hornik, Cybenko, Funahashi, all 1989).

You want a step function that crosses 0.5 at $x = 1/4$ using a single sigmoid hidden unit. If you fix $b_1 = -200$, what should $w_1$ be?

The threshold formula is $s_1 = -b_1/w_1$. Setting $s_1 = 1/4$ and $b_1 = -200$:

$1/4 = -(-200)/w_1 = 200/w_1 \;\;\Rightarrow\;\; w_1 = 800.$

This matches slide 56 exactly. Note that $w_1 = 800$ is large enough that the transition is essentially a perfect step.

How would you build a bump of height 0.3 between $x = 0.4$ and $x = 0.7$? Give the full hidden-layer specification.

Two hidden units, both with steep weights so they act as steps:

Unit 1: threshold at $s_1 = 0.4$. Pick $w_1 = 1000$ (big), $b_1 = -1000 \cdot 0.4 = -400$.
Unit 2: threshold at $s_2 = 0.7$. Pick $w_2 = 1000$, $b_2 = -1000 \cdot 0.7 = -700$.

Output weights: $w_3 = 0.3$ from unit 1, $w_4 = -0.3$ from unit 2. Output is $\hat y = 0.3 \cdot s_1(x) + (-0.3) \cdot s_2(x)$, which is $0.3$ between $x = 0.4$ and $x = 0.7$ and $0$ outside.

If a target staircase has $k$ flat segments, how many hidden units do you need to reproduce it with this construction?

$2k$ hidden units — two per segment, one for the rising step at the segment's left edge and one for the falling step at the right edge. Per slide 63: "For the ith flat line, we add the nodes $s_{2i}$ and $s_{2i+1}$ to the middle layer, and the outgoing weights are set to height of the ith flat line (and its negative)."

The slides use sigmoid in the proof. Could the same construction work with ReLU?

Sketch: yes, with a small modification. A ReLU centred far to the left, scaled and shifted, also has a step-like transition (in fact ReLU itself looks like a half-plane "switch"). One can construct bumps from differences of two ReLUs. The classical UAT proofs in fact apply to a broad class of "squashing" or non-polynomial activations, including ReLU; the version on the slides is just the cleanest one to draw. The takeaway: the theorem isn't really about sigmoid specifically — it's about how flexible any nonlinearity makes the network.

Week 7 in one breath

A neural network learns by gradient descent on its loss; backpropagation is just the chain rule applied layer-by-layer to compute that gradient efficiently; deep networks suffer because gradients shrink exponentially through stacked sigmoids; and yet — astonishingly — even one hidden layer of such units can approximate any continuous function arbitrarily well, by adding up step-shaped sigmoids into bumps and bumps into staircases.

Exam-readiness checklist

State the SGD update rule $w := w - \alpha \nabla \text{loss}(w)$ and explain the role of every symbol, including the consequences of $\alpha$ being too large or too small.
Distinguish stochastic from batch gradient descent in one sentence, and name the two unavoidable approximations of SGD (only finds local optima; never reaches them exactly).
Run a forward pass through the slide-21 toy network with given inputs and weights — using ReLU activations — and produce $\hat y$.
Compute the MSE loss $\frac{1}{2}(y - \hat y)^2$ for given $y$ and $\hat y$.
Define $\Delta := -(y - \hat y)$ and identify it inside any backprop derivation.
State the ReLU derivative as a piecewise function and apply it to gating gradient flow.
Derive $\partial \text{Error}/\partial w_6 = \Delta \cdot h_2$ from the chain rule, including the case split on the output ReLU.
Derive $\partial \text{Error}/\partial w_1 = \Delta \cdot w_5 \cdot x_1$, including both ReLU case conditions.
Reproduce the full slide-41 summary of all six weight updates from memory, with their gating conditions.
Plug numbers into the update rule $w_i' = w_i - \alpha (\partial \text{Error}/\partial w_i)$ to compute one step of training.
Explain why $\sigma'(x) \le 0.25$ for the sigmoid (peak at $x = 0$), and use this to argue gradients vanish exponentially in deep networks.
List the three vanishing-gradients remedies from slide 45: shallower nets, ReLU, batch normalization (and explain how each helps).
Define internal covariate shift and explain how batch norm reduces it.
State the Universal Approximation Theorem precisely (continuous function, compact set, one hidden layer, arbitrary $\varepsilon$).
Distinguish what UAT does promise (existence of weights) from what it doesn't (that they can be efficiently found, that one layer is parsimonious, that training will discover them).
Given a desired threshold $s_1$, choose $w_1$ and $b_1$ for a sigmoid hidden unit so that the transition is sharp and centred at $s_1$ — using $s_1 = -b_1/w_1$.
Combine two threshold units into a rectangular bump of any chosen height $\ell$ by setting output weights $\pm \ell$.
Given a target staircase with $k$ flat segments, design the hidden layer ($2k$ units) and output weights to reproduce it.
Explain how the staircase construction proves UAT in 1-D, and why the same idea extends to higher dimensions.
Connect everything: backprop is how a network learns, UAT is what a network can in principle learn, and vanishing gradients are why early networks struggled in practice — motivating ReLU, batch norm, and modern deep learning.