How does a network actually learn its weights? And once it can learn anything — what can it learn? Two ideas, one chain rule, sixty-three slides.
Training a neural network means finding numerical settings of its weights so that, when an input is pushed through, the output matches the truth as closely as possible. We need three things: a measure of "how wrong are we" (the loss), a way to know which way to nudge each weight (the gradient), and a rule for actually nudging (the update).
Think of the apple-vs-orange picture from the slides. We have a network with some random weights, and we want it to fire on apples and stay quiet on oranges. We don't change the architecture — the boxes and arrows stay the same. We change the numbers on the arrows. That's what learning is.
The procedure on slide 3 is short enough to memorise: (i) pick a training example $(x, y)$ at random, (ii) push it forward through the network and look at the output $\hat y$, (iii) if $\hat y$ disagrees with $y$, change the weights and biases. Most of this week's mathematics is just the third step in detail.
Given a label $y$ and a network output $\hat y$, the mean squared error loss on this single example is
$$\text{loss}(w) \;=\; \text{MSE} \;=\; \tfrac{1}{2}(y - \hat y)^2.$$
The factor $\tfrac{1}{2}$ has no statistical meaning — it just makes the derivative come out cleaner (the $2$ cancels with the $\tfrac{1}{2}$). The loss is a function of the weights $w$ because $\hat y$ depends on $w$.
"How far off are we?" Square the gap so positive and negative misses both count, and so big misses are punished more than small ones. Halve it for clean derivatives.
Slide 4 gives the central object of the entire week. Once we have a loss, we update each weight by moving it opposite the slope of the loss with respect to that weight:
Two new symbols. The learning rate $\alpha$ is a small positive number that controls step size. The gradient $\nabla \text{loss}(w)$ is the vector of partial derivatives of the loss with respect to every weight: it points in the direction of steepest increase of the loss, so we negate it to descend.
The minus sign matters. Adding the gradient would climb the loss surface, not descend it. If you've ever written w = w + alpha * grad and watched the loss explode, this is why.
Slide 4 spells it out: "Big values might overshoot, small values will take a long time to converge."
This is one of the few exam-friendly verbal answers — but make sure to give both sides. Too big and you bounce out of the basin; too small and you barely move per step.
It changes the value of the loss by a constant factor of two but does not change where the loss is minimised — the optimum is the same. The point of the $\tfrac{1}{2}$ is purely cosmetic for derivatives: $\partial(\tfrac{1}{2}(y-\hat y)^2)/\partial \hat y = -(y-\hat y)$, with no extra 2 floating around. Without the $\tfrac{1}{2}$ you'd write $-2(y-\hat y)$ everywhere; the algorithm absorbs the extra factor of 2 into the learning rate $\alpha$ anyway.
(a) Too large: the loss may oscillate or even grow — you overshoot the minimum and bounce back and forth across it (or out of the basin entirely).
(b) Too small: the loss decreases monotonically but very slowly — you crawl downhill and may not reach the minimum within your training budget.
(c) Just right: the loss decreases quickly at first, then levels off near the minimum. (Slide 20 shows exactly the (a) vs (b) picture: zigzag jumps for big $\alpha$, tiny steps for small $\alpha$.)
"Stochastic" just means random. Instead of computing the gradient over the entire training set on every step (which is expensive), we use one example at a time, picked at random. This is faster, noisier, and — surprisingly — often better.
Slide 9 names the only thing that distinguishes SGD from plain (batch) gradient descent: "only one piece of data from the dataset is used to calculate the step, and the piece of data is picked randomly at each step."
Given a dataset $\{(x_i, y_i)\}_{i=1}^N$, repeat: pick an index $i$ uniformly at random, compute the gradient of the loss on that single example, and update
$$w \;:=\; w \;-\; \alpha \, \nabla \text{loss}_i(w).$$
Compare to batch gradient descent, where the gradient is the average over all $N$ examples before any update.
Batch GD takes one careful step per pass through the entire dataset. SGD takes $N$ noisy steps per pass. The noise is usually a feature, not a bug — it can knock you out of bad local minima.
Slides 11–17 animate descent on a 1-D loss curve. The story is: pick a starting weight, compute the slope at that point (the dotted tangent), step downhill (the arrow), repeat. Each step roughly halves the distance to the minimum until you're in the bottom of the basin and progress slows.
Slide 18 introduces a crucial vocabulary distinction. A local optimum is a point that is lower than everything in its immediate neighbourhood. The global optimum is the lowest point overall. Plain gradient descent has no way to tell them apart: it just rolls downhill from wherever you started, and where it stops depends on where you began.
Slide 19's three-dimensional surface drives the point home. In a real network with millions of weights the loss surface is hyper-dimensional, but the cartoon is the same: lots of valleys of varying depth. Where you land depends on where you started and which gradient you happened to step against — which is why the noise of SGD can be useful.
The slides explicitly call these out as the two failure modes of SGD:
Both are testable verbatim.
Batch GD computes the gradient over the entire training set before each weight update; SGD uses just one randomly-picked example per update. Per slide 9, this is the only difference at the algorithmic level. Practical consequences: SGD is faster per step, noisier, and can escape shallow local minima.
(1) Only local optima: we may end up in a basin that isn't the deepest one. Not generally fixable for non-convex losses (which any deep network has) — though restarts, momentum, and noise help in practice.
(2) We only approach, never reach: the algorithm hops around the minimum because $\alpha > 0$. This is fixable in principle by shrinking $\alpha$ over time (a "learning-rate schedule"). With a constant $\alpha$ you are stuck oscillating in a small region around the optimum.
Before we backpropagate anything, we have to push something forward. Slides 21–24 set up the toy network we'll use for the entire rest of the week. Memorise these numbers — every backprop computation that follows reuses them.
The toy network has two inputs $x_1, x_2$, two hidden units $h_1, h_2$ (with ReLU activations), and one output $\hat y$ (also ReLU). All six weights are labelled $w_1$ through $w_6$. There are no biases in this example — the picture is already busy enough.
The activation function throughout is $\sigma(z) = \text{ReLU}(z) = \max(0, z)$. Slides 23–24 walk the calculation:
$\sigma(h_1) = \text{ReLU}(x_1 w_1 + x_2 w_2) = \text{ReLU}(2 \cdot 0.11 + 3 \cdot 0.12) = \text{ReLU}(0.58) = 0.58$
$\sigma(h_2) = \text{ReLU}(x_1 w_3 + x_2 w_4) = \text{ReLU}(2 \cdot 0.21 + 3 \cdot 0.08) = \text{ReLU}(0.66) = 0.66$
$\hat y = \sigma(z) = \text{ReLU}(\sigma(h_1) \cdot w_5 + \sigma(h_2) \cdot w_6) = \text{ReLU}(0.58 \cdot 0.14 + 0.66 \cdot 0.15) = \text{ReLU}(0.1802) = 0.1802$
The slides write things like "$\sigma(h_1) = \text{ReLU}(x_1 w_1 + x_2 w_2)$" — i.e. they're using $h_1$ to mean the pre-activation $x_1 w_1 + x_2 w_2$, and $\sigma(h_1)$ to mean the post-activation output of the unit. In the chain rule pages later, $h_1$ is used to mean the activation (the value 0.58). It's the same convention some books use; the meaning is always clear from context, but watch for it on the exam.
Now suppose the true label is $y = 1$. The error is
$$\text{Error} \;=\; \tfrac{1}{2}(y - \hat y)^2 \;=\; \tfrac{1}{2}(1 - 0.1802)^2 \;\approx\; 0.336.$$
(Slide 26.) The network is wildly wrong — the output 0.1802 is far from the desired 1. Backprop's job is to adjust each of the six weights to reduce this error.
The hidden activations don't depend on $w_5$, so they're still $0.58$ and $0.66$. The pre-activation at the output is now
$0.58 \cdot (-0.14) + 0.66 \cdot 0.15 = -0.0812 + 0.099 = 0.0178.$
This is positive, so $\text{ReLU}$ leaves it alone: $\hat y = 0.0178$.
If the pre-activation had been negative, $\hat y$ would be exactly $0$ — a useful sanity check that ReLU can produce zeros.
$\text{Error} = \tfrac{1}{2}(0 - 0.1802)^2 = \tfrac{1}{2}(0.0325) \approx 0.0162$. Much smaller than $0.336$ — the network's output of $0.1802$ is much closer to $0$ than to $1$, so it's "almost right" if the answer were 0.
Backpropagation is just the chain rule applied carefully. The key insight: to update a weight, you don't need to recompute the network from scratch — you can reuse pieces of the gradient that you already calculated for weights closer to the output. Information literally flows backwards.
Slides 5–8 illustrate this with the deeper four-layer network: starting from the output (slide 6), gradient information propagates layer by layer back into the network (slides 7–8), each layer reusing what came after it. We'll do the math on the simpler 2-2-1 toy network from Part III.
Slide 27 states the update rule for any weight $w_i$:
So the entire problem reduces to computing $\partial \text{Error} / \partial w_i$ for each of the six weights. We'll need the derivative of the activation. Slide 28 reminds us that for ReLU,
For $\sigma(x) = \text{ReLU}(x) = \max(0,x)$,
$$\frac{\partial \sigma(x)}{\partial x} \;=\; \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise.} \end{cases}$$
(At $x = 0$ the derivative is technically undefined, but we conventionally set it to $0$.)
Let's start with $w_6$, the weight from $h_2$ into the output. The chain rule on slide 30 says
Two factors. The first is just calculus on the loss; the second is calculus on the activation.
Slide 31 computes
$$\frac{\partial \tfrac{1}{2}(y-\hat y)^2}{\partial \hat y} \;=\; -2 \cdot \tfrac{1}{2}(y - \hat y) \;=\; -(y - \hat y) \;=:\; \Delta.$$
The minus sign comes from differentiating $(y - \hat y)$ with respect to $\hat y$. The slides give this quantity a name — $\Delta$ — and it shows up in every backprop derivation that follows.
For MSE loss with one output, $\Delta := -(y - \hat y) = \hat y - y$.
It is the "signed error" — positive if the network overshot ($\hat y > y$), negative if it undershot. For our example, $\Delta = -(1 - 0.1802) = -0.8198$.
The second factor is $\partial \sigma(h_1 w_5 + h_2 w_6) / \partial w_6$. We need the ReLU derivative, which depends on the sign of the argument $h_1 w_5 + h_2 w_6$.
Case A (slide 33): if $h_1 w_5 + h_2 w_6 \le 0$, then $\partial \sigma / \partial w_6 = 0$, so
$$\frac{\partial \text{Error}}{\partial w_6} \;=\; \Delta \cdot 0 \;=\; 0.$$
The weight gets no update at all.
Case B (slide 34): if $h_1 w_5 + h_2 w_6 > 0$, then $\partial \sigma / \partial w_6 = 1$, and inside the ReLU we have a linear function whose derivative w.r.t. $w_6$ is just $h_2$:
$$\frac{\partial \text{Error}}{\partial w_6} \;=\; \Delta \cdot \frac{\partial [h_1 w_5 + h_2 w_6]}{\partial w_6} \;=\; \Delta \cdot \frac{\partial [h_2 w_6]}{\partial w_6} \;=\; \Delta \cdot h_2.$$
The first term $h_1 w_5$ doesn't depend on $w_6$, so its derivative is zero. Only $h_2 w_6$ contributes, and its derivative is $h_2$.
Putting it all together (slide 36):
For $w_5$, the derivation is identical with $h_1$ in place of $h_2$ (slide 37):
"How much should I change the weight from hidden unit $h_2$ to the output?" The answer is: (error at the output) times (how active $h_2$ was). If $h_2$ was silent, this weight had no influence on the prediction this time, so don't change it. If the output was nearly correct ($\Delta$ small), don't change it much either. The bigger the error and the louder the upstream neuron, the bigger the weight update.
The condition for both $w_5$ and $w_6$ is the same: $h_1 w_5 + h_2 w_6 > 0$. This is the pre-activation of the output unit. If the output ReLU was off (input ≤ 0), the gradient is zero for every weight feeding the output — and (as we'll see) for every weight feeding any earlier layer too, because the error can't propagate past a dead ReLU.
First, $\Delta = -(1 - 0.1802) = -0.8198$.
Output pre-activation: $h_1 w_5 + h_2 w_6 = 0.58 \cdot 0.14 + 0.66 \cdot 0.15 = 0.1802 > 0$, so we're in Case B.
$\partial \text{Error}/\partial w_6 = \Delta \cdot h_2 = -0.8198 \cdot 0.66 \approx -0.5411$.
$w_6' = 0.15 - 0.1 \cdot (-0.5411) = 0.15 + 0.05411 \approx 0.2041$.
The weight increases — which makes sense: the network undershot ($\hat y < y$), so we want the output to be larger, and increasing $w_6$ (with $h_2 > 0$) increases the output.
The output pre-activation is $0.58 \cdot (-1) + 0.66 \cdot (-1) = -1.24 < 0$. We are in Case A. The gradient is exactly zero, regardless of the size of the error. The output ReLU is "dead" for this input, and no signal can pass through it. This is the so-called dying ReLU problem in miniature.
$\Delta = -(y - \hat y) = \hat y - y$.
The sign of $\Delta$ flips the sign of the gradient, which (combined with the $-\alpha$ in the update) ensures we move in the direction that reduces the error.
For weights closer to the input, the chain rule has more links — the error signal has to be propagated back through more layers. The derivation is the same idea, just with one extra step. The reward is a tidy summary of all six weight updates.
We tackle $w_1$, the weight from input $x_1$ into hidden unit $h_1$. Slide 38 says: assume both ReLUs are active — that is, $h_1 w_5 + h_2 w_6 > 0$ and $x_1 w_1 + x_2 w_2 > 0$. (If either is non-positive, the gradient is zero by the same argument as before.)
Slide 39 writes the chain
The first factor is $\Delta$, as before. The middle and right factors slide 40 unpacks:
$$\frac{\partial \hat y}{\partial h_1} \;=\; \frac{\partial \sigma(h_1 w_5 + h_2 w_6)}{\partial h_1} \;=\; w_5 \quad \text{(under Case B for the output ReLU)}.$$
$$\frac{\partial h_1}{\partial w_1} \;=\; \frac{\partial \sigma(x_1 w_1 + x_2 w_2)}{\partial w_1} \;=\; x_1 \quad \text{(under Case B for the hidden ReLU)}.$$
Multiplying:
To update an early-layer weight: take the output error, multiply by every weight on the path from this unit to the output (here just $w_5$), and multiply by the input that this weight saw (here $x_1$). Each weight along the path "scales" how much the error gets blamed on the upstream weight.
Slide 41 collects all six update rules. It looks intimidating but it's just six instances of the same idea. Let me lay them out side by side. Recall $\Delta = -(y - \hat y)$.
| Weight | Update (when both relevant ReLUs are active) | Active condition |
|---|---|---|
| $w_1$ | $w_1' = w_1 - \alpha(\Delta \cdot w_5 \cdot x_1)$ | $h_1 w_5 + h_2 w_6 > 0$ and $x_1 w_1 + x_2 w_2 > 0$ |
| $w_2$ | $w_2' = w_2 - \alpha(\Delta \cdot w_5 \cdot x_2)$ | $h_1 w_5 + h_2 w_6 > 0$ and $x_1 w_1 + x_2 w_2 > 0$ |
| $w_3$ | $w_3' = w_3 - \alpha(\Delta \cdot w_6 \cdot x_1)$ | $h_1 w_5 + h_2 w_6 > 0$ and $x_1 w_3 + x_2 w_4 > 0$ |
| $w_4$ | $w_4' = w_4 - \alpha(\Delta \cdot w_6 \cdot x_2)$ | $h_1 w_5 + h_2 w_6 > 0$ and $x_1 w_3 + x_2 w_4 > 0$ |
| $w_5$ | $w_5' = w_5 - \alpha(\Delta \cdot h_1)$ | $h_1 w_5 + h_2 w_6 > 0$ |
| $w_6$ | $w_6' = w_6 - \alpha(\Delta \cdot h_2)$ | $h_1 w_5 + h_2 w_6 > 0$ |
If any active condition fails, the weight does not change on that step (the bracketed expression in slide 41 is zero).
This pattern generalises to any architecture: walk the path from the weight to the output, multiply the activation derivatives and weights along the way, and multiply by the upstream activation.
$\partial \text{Error}/\partial w_1 = \Delta \cdot w_5 \cdot x_1 = -0.8198 \cdot 0.14 \cdot 2 \approx -0.2295$.
$w_1' = w_1 - \alpha \cdot (-0.2295) = 0.11 + 0.02295 \approx 0.1330$.
Notice the gradient is much smaller than the gradient at $w_6$ (≈ −0.5411), even though everything else is comparable. That's because we multiplied by $w_5 = 0.14$ along the way — the weight on the path from $h_1$ to the output. Small weights along the path damp the error signal as it travels back. Hold that thought; it's the entire idea behind vanishing gradients in Part VI.
$w_3$ feeds into $h_2$ (not $h_1$). The only path from $h_2$ to the output goes through $w_6$. There is no path from $h_2$ that uses $w_5$, so $w_5$ doesn't appear in the chain. In a fully-connected network with multiple output paths, you'd sum over all paths — here there's only one each.
False. Slide 41 uses the strict inequality $> 0$, and per slide 28 the convention at $x = 0$ for the ReLU derivative is $0$. So the gradient is $0$ exactly at the kink. (Real implementations sometimes use a sub-gradient; the slide convention is fine for an exam.)
If the gradient is multiplied by a number smaller than $1$ at every layer it crosses, then in a deep network the gradient at the earliest layer is multiplied by something tiny — sometimes so tiny it's effectively zero. The early layers stop learning. This is the vanishing-gradients problem.
Slide 42 sets up a deeper version of the toy net: input $x$, then three hidden layers $h_{1,1}, h_{1,2}, h_{1,3}$ (top stream) and $h_{2,1}, h_{2,2}, h_{2,3}$ (bottom stream), then output $\hat y$. The chain rule for the very first weight $w_1$ now has five factors:
Each $\partial h_{1,x+1}/\partial h_{1,x}$ in the middle is, by the chain rule, the activation derivative times a weight on the path. Three of these chained.
Slide 43 brings in the sigmoid:
$$\sigma(x) \;=\; \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) \;=\; \sigma(x)\big(1 - \sigma(x)\big).$$
The derivative is largest when $\sigma(x) = 0.5$, where $\sigma'(x) = 0.5 \cdot 0.5 = 0.25$. Everywhere else it is smaller. So $\sigma'(x) \le 0.25$ for all $x$.
Each factor $\partial h_{1,x+1}/\partial h_{1,x}$ in the chain involves $\sigma'$ at some pre-activation, multiplied by a weight. If $\sigma'$ is bounded by $0.25$, and weights are typically initialised around $1$ in magnitude or smaller, then each factor is bounded by something like $0.25$. Chaining $L$ of them gives
$$\Big| \frac{\partial \text{Error}}{\partial w_1} \Big| \;\lesssim\; (0.25)^L \cdot |\Delta|.$$
For $L = 10$, that's $(0.25)^{10} \approx 10^{-6}$. The gradient at the first layer is a million times smaller than the gradient at the output. Early-layer weights barely move; effectively, only the last few layers are training. This is the vanishing-gradients problem (slide 44).
"Each factor $\partial h_{1,x+1}/\partial h_{1,x}$ is bounded by 0.25 (because the maximum of $\sigma'$ is 0.25), so for a deep network the gradient at the early layers is exponentially small in the depth." — that's the entire argument in one sentence.
The slides list three:
For a mini-batch of activations from a layer, compute the batch mean $\mu_B$ and batch standard deviation $\sigma_B$, then replace each activation $a$ by $(a - \mu_B)/\sigma_B$ before passing it on. This keeps activations in a sensible range across training, which keeps activation derivatives away from saturated regions and thus keeps gradients larger.
The effect this targets is called internal covariate shift — the changing distribution of layer inputs as upstream weights update. Batch norm reduces this shift.
Each layer contributes a factor bounded by $\sigma'(\cdot) \le 0.25$ (assuming weights are about 1). With four "between-layer" factors in a 5-layer net, the bound is $(0.25)^4 \approx 0.0039$. The gradient at the first layer is at most about $0.4\%$ of the gradient at the output. (In practice it can be far smaller, since most pre-activations sit in the saturated tails where $\sigma'$ is much less than 0.25.)
ReLU's derivative is $1$ wherever the unit is active, so the activation factor stops shrinking the gradient — that's the help. But:
So ReLU helps but isn't a magic bullet — hence techniques like careful initialisation, batch norm, and skip connections.
Internal covariate shift: as upstream weights change during training, the distribution of inputs to a downstream layer changes too — that layer is "chasing a moving target." This shift can push activations into saturated regions of nonlinearities (where derivatives are tiny) and slow learning.
Batch norm's response: normalise each layer's activations (subtract batch mean, divide by batch std) so that, at every layer, inputs to the next layer have a stable mean and variance regardless of what upstream weights are doing. Per slide 45, this "helps in maintaining stronger gradients."
Once we know how to train weights, the next question is what those weights can ever express. The remarkable answer: a neural network with just one hidden layer can approximate any continuous function arbitrarily well — provided you allow the hidden layer to be wide enough.
Slide 47 frames the problem in terms of image classification. Each input image is a point in some very-high-dimensional feature space; the goal of training is to partition that space so each region maps to the right label. Said geometrically, learning is just carving up $\mathbb{R}^d$.
Slide 48 simplifies to a one-dimensional cartoon: given $x$, predict $f(x)$ for some target function — say $f(x) = \sin(x) \cdot x^2 - x^3/30 + x^4/500$, the wiggly curve on the slide. Two natural questions arise (slide 49):
The week tackles the second.
Slide 50 shows a continuous curve (blue) overlaid with a piecewise-constant staircase (yellow). The staircase isn't perfect — there are gaps wherever the blue curve isn't flat — but if we make the staircase steps fine enough, the gap can be made as small as we want. This is the core idea: we don't need to be the function, we only need to approximate it to any desired accuracy.
The same picture works in higher dimensions (slide 51): a smooth surface gets approximated by a piecewise-constant tessellation. As long as the tiles are small enough, the approximation gets as good as we like.
"We can approximate any d-dimensional continuous function arbitrarily well using a NN with only one hidden layer."
More carefully: let $f: K \subset \mathbb{R}^d \to \mathbb{R}$ be a continuous function on a compact set, and let $\varepsilon > 0$. Then there exists a neural network with a single hidden layer (using sigmoid activation) that approximates $f$ to within $\varepsilon$ everywhere on $K$.
The catch — never written on the slide but worth knowing for the exam — is that "wide enough" can mean astronomically wide. The theorem is about existence, not efficiency.
Slide 50 makes the point: a continuous function has uncountably many values, and any finite network has finitely many parameters, so exact reproduction is mathematically impossible in general. But for any practical purpose (classification, regression, function evaluation), being arbitrarily close is just as good. If you can be within $\varepsilon$ everywhere for any $\varepsilon > 0$, that is the function for engineering purposes.
Promises: for any continuous $f$ on a compact set and any $\varepsilon > 0$, there exists a single-hidden-layer NN that approximates $f$ to within $\varepsilon$. The hidden layer can use a sigmoid (or any "squashing" activation).
Does not promise: (a) that the hidden layer is small — it might need to be enormous; (b) that gradient descent will find these weights; (c) that one layer is the most efficient — deeper networks usually need far fewer parameters; (d) anything about generalising from finite training data — that's a separate, statistical concern.
Slides 53–63 walk through a concrete construction proving the 1-D version of the theorem. Three steps: (i) build a single steep step from a sigmoid, (ii) combine two steps into a localised "bump," (iii) add up many bumps at chosen heights to staircase any target function.
The hidden unit computes $h_1 = \sigma(x_1 w_1 + b_1)$ where $\sigma$ is the sigmoid (slide 53). The threshold (the place where the sigmoid crosses $0.5$) sits at
$$s_1 \;=\; -b_1 / w_1.$$
That's because $\sigma$ is centred at $0$, so it crosses $0.5$ when its argument is zero, which means $x_1 w_1 + b_1 = 0$, i.e. $x_1 = -b_1/w_1$.
The steepness of the transition is controlled by $w_1$. Larger $w_1$ gives a sharper transition. Slides 54–55 show this in pictures:
| $w_1$ | $b_1$ | $s_1 = -b_1/w_1$ | Shape |
|---|---|---|---|
| 20 | −20 | 1.0 | smooth gentle ramp |
| 4 | −2 | 0.5 | very gentle ramp |
| 40 | −20 | 0.5 | steeper ramp at $x = 0.5$ |
| 400 | −200 | 0.5 | essentially a step at $x = 0.5$ |
| 600 | −200 | $1/3$ | step at $x = 1/3$ |
| 800 | −200 | $1/4$ | step at $x = 1/4$ |
The lesson (slide 56): pick $w_1$ huge to get a near-perfect step, then choose $b_1$ to place that step at any $s_1$ you want, since $s_1 = -b_1/w_1$.
From slide 57 onward, the slides abstract this away: instead of writing $w_1, b_1$, they just write a hidden node labelled $s_1$, with the understanding that some pair $(w_1, b_1)$ has been chosen to put the step at $s_1$.
Slide 59 puts two such hidden nodes in parallel: one with threshold $s_1 = 1/3$, one with threshold $s_2 = 2/3$. The output combines them with weights $w_3 = 1$ and $w_4 = -1$:
$$\hat y \;=\; 1 \cdot s_1(x) \;+\; (-1) \cdot s_2(x).$$
Where $x < 1/3$: both step nodes are off (output 0); $\hat y = 0$.
Where $1/3 < x < 2/3$: $s_1$ is on (output 1) but $s_2$ is still off; $\hat y = 1$.
Where $x > 2/3$: both are on; $\hat y = 1 - 1 = 0$.
That's a rectangular bump of height $1$ between $x = 1/3$ and $x = 2/3$.
To get a shorter bump of height $\ell$, just scale the output weights (slide 61):
$$w_3 = \ell, \quad w_4 = -\ell \;\;\Longrightarrow\;\; \text{bump of height } \ell.$$
The final move (slide 63): a target function approximated by a staircase consists of many flat pieces, each at its own height $\ell_i$ over its own interval $[s_{2i}, s_{2i+1}]$. To build the staircase, add a bump of height $\ell_i$ over $[s_{2i}, s_{2i+1}]$ for every flat piece. In a one-hidden-layer network, that means: two hidden nodes per flat piece.
Once you can build any 1-D staircase, you can approximate any 1-D continuous function (slide 50). The same construction generalises to higher dimensions (slide 51): build "tile" bumps over hyper-rectangles in $\mathbb{R}^d$ instead of intervals. Add up tiles, one per region, at the right heights. That's the entire proof idea — every step uses at most one hidden layer.
One hidden layer of sigmoid (or step-like) units is enough to express any continuous function. The expressivity of neural networks doesn't come from depth — depth is about efficiency, not capability. Even the shallowest possible architecture, with the right (possibly very large) number of hidden units, can do the job. This was a major theoretical result of the late 1980s (Hornik, Cybenko, Funahashi, all 1989).
The threshold formula is $s_1 = -b_1/w_1$. Setting $s_1 = 1/4$ and $b_1 = -200$:
$1/4 = -(-200)/w_1 = 200/w_1 \;\;\Rightarrow\;\; w_1 = 800.$
This matches slide 56 exactly. Note that $w_1 = 800$ is large enough that the transition is essentially a perfect step.
Two hidden units, both with steep weights so they act as steps:
Output weights: $w_3 = 0.3$ from unit 1, $w_4 = -0.3$ from unit 2. Output is $\hat y = 0.3 \cdot s_1(x) + (-0.3) \cdot s_2(x)$, which is $0.3$ between $x = 0.4$ and $x = 0.7$ and $0$ outside.
$2k$ hidden units — two per segment, one for the rising step at the segment's left edge and one for the falling step at the right edge. Per slide 63: "For the ith flat line, we add the nodes $s_{2i}$ and $s_{2i+1}$ to the middle layer, and the outgoing weights are set to height of the ith flat line (and its negative)."
Sketch: yes, with a small modification. A ReLU centred far to the left, scaled and shifted, also has a step-like transition (in fact ReLU itself looks like a half-plane "switch"). One can construct bumps from differences of two ReLUs. The classical UAT proofs in fact apply to a broad class of "squashing" or non-polynomial activations, including ReLU; the version on the slides is just the cleanest one to draw. The takeaway: the theorem isn't really about sigmoid specifically — it's about how flexible any nonlinearity makes the network.