Week 05 · Study Guide

The Brain, the Perceptron, and Why We Need Curves

From a single firing neuron to multilayer feed-forward networks — and the convergence theorem that started it all.

KCL · Machine Learning  ·  ~50 min reading  ·  18 self-tests  ·  covers all 54 slides
Part I

Biological inspiration & the artificial neuron

A perceptron is a comically simplified neuron. Real neurons fire when chemistry tips them past a threshold; perceptrons fire when a weighted sum tips past a number. Same shape, different ingredients.

Your brain has roughly 86 billion neurons. Each one is a tiny decision-maker. Dendrites are the input branches — they're covered in receptors that catch chemical messengers (neurotransmitters) released by other neurons, and they translate those chemical hits into electrical signals. Those signals all flow into the soma (the cell body), where they're summed up. The nucleus, sitting inside the soma, holds the cell's genetic material. If the total signal crosses some threshold, the neuron fires — sending an electrical pulse down the axon, which terminates in branches called axon terminals. Those terminals release their own neurotransmitters, which crash into the next neuron's dendrites, and the chain continues.

The artificial neuron — the perceptron — is a cartoon of this. We replace the dendrites with input numbers, the synapse strengths with weights, the threshold with a bias, and the firing decision with an activation function.

x₁ x₂ x₃ h = f(Σ xᵢwᵢ + b) (activation) b w₁ w₂ w₃ ŷ
The artificial neuron: inputs are weighted, summed with a bias, and passed through an activation function.
Definition · Perceptron

A perceptron has four parts: an input vector $\mathbf{x} \in \mathbb{R}^d$, a weight vector $\mathbf{w} \in \mathbb{R}^d$, a scalar bias $b \in \mathbb{R}$, and an activation function $f : \mathbb{R} \to \mathbb{R}$. Its output is

$$h \;=\; f\!\left(\sum_i x_i w_i + b\right) \;=\; f(\mathbf{x}^\top \mathbf{w} + b).$$

The quantity $\mathbf{x}^\top\mathbf{w} - b$ (or sometimes $+b$, depending on convention) is called the activation potential — the number that the activation function gets to decide on.

Intuition

Each weight $w_i$ tells the neuron how much it cares about input $x_i$. A large positive $w_i$ means "this input matters and pulls me toward firing"; a large negative $w_i$ means "this input matters and pulls me away from firing"; near-zero means "I don't care about this one." The bias $b$ is the neuron's mood — its baseline reluctance or eagerness to fire, independent of inputs.

For a single perceptron, let's pick the simplest possible activation function: the binary step.

$$\hat{y} \;=\; f_{\text{binary}}(\mathbf{x},\mathbf{w}) \;=\; \begin{cases} 1 & \text{if } \mathbf{x}^\top\mathbf{w} - b \;\geq\; 0\\ 0 & \text{otherwise.}\end{cases}$$

Think of it as: if the weighted vote crosses the threshold $b$, output 1; otherwise output 0.

Self-test 1. A perceptron has weights $\mathbf{w}=(0.5, -0.3)$, bias $b=0.2$, and binary activation. What does it output on $\mathbf{x}=(2, 1)$?

Compute the activation potential first: $\mathbf{x}^\top\mathbf{w} - b = (2)(0.5) + (1)(-0.3) - 0.2 = 1.0 - 0.3 - 0.2 = 0.5$. Since $0.5 \geq 0$, the binary step outputs $\hat{y}=1$.

· · ·
Part II

The perceptron as a classifier

A perceptron with binary activation is a line-drawer. Give it 2D data and it will draw a straight line, calling everything on one side "1" and everything on the other "0".

Suppose you have data on apples and oranges. (Don't try this with pears — you can't compare them with apples.) For each fruit you measure two things: sweetness and size. Apples have sweetness 4–7 and size 6–10cm; oranges have sweetness 6–9 and size 8–12cm. The two clouds overlap in the middle, but mostly they sit in different parts of the 2D plane. We'd like a perceptron that, given a new fruit's $(x_1, x_2)$, tells us "apple" or "orange".

The decision boundary

For binary activation, the perceptron outputs $1$ exactly when $x_1 w_1 + x_2 w_2 - b \geq 0$. The set of points where $x_1 w_1 + x_2 w_2 - b = 0$ is a straight line in 2D (a hyperplane in higher dimensions). On one side of this line, the output is 1; on the other, 0.

x₁ x₂ 1 1 f(x,b) = 1 f(x,b) = 0 + + + + + + + + + +
A linear decision boundary in 2D. Above the dashed line: positive class. Below: negative class.

The "game": pick the right weights

You don't get to control $x_1$ and $x_2$ — they're given by nature. Your job is to pick $w_1$, $w_2$, and $b$ so that the line they define correctly separates the two classes. Think of it as a game: nature throws you a point, you've already committed to your weights, and you want to be right as often as possible.

For the toy example where the ideal boundary is $x_1 + x_2 = 1$:

Self-test 2. The ideal boundary is $2x_1 + x_2 = 4$, with the positive class on the upper-right side. Give one valid choice of $(w_1, w_2, b)$.

We need $x_1 w_1 + x_2 w_2 - b \geq 0 \iff 2x_1 + x_2 \geq 4$. So pick $w_1 = 2$, $w_2 = 1$, $b = 4$. (Any positive scalar multiple, e.g. $w_1=4, w_2=2, b=8$, also works — perceptrons are scale-invariant.)

Exam trap · Sign of the bias

The slides write the activation potential as $\mathbf{x}^\top \mathbf{w} - b$ (subtracting $b$). Many textbooks write $\mathbf{x}^\top\mathbf{w} + b$ (adding $b$). They're the same model — just absorb the sign by replacing $b$ with $-b$. In an exam, always check which convention the question uses before computing.

Self-test 3. Why is "the decision boundary of a perceptron is a hyperplane" — i.e., what makes $\{\mathbf{x} : \mathbf{w}^\top\mathbf{x} - b = 0\}$ flat?

The equation $\mathbf{w}^\top\mathbf{x} - b = 0$ is linear in $\mathbf{x}$ — every term is a constant times one coordinate of $\mathbf{x}$, plus a constant. The set of solutions to one linear equation in $d$ unknowns is a $(d-1)$-dimensional flat — a line in 2D, a plane in 3D, a hyperplane in general. This is what restricts a single perceptron: it can only carve $\mathbb{R}^d$ with one straight cut.

· · ·
Part III

When binary breaks: sigmoid & more neurons

A binary perceptron is brutally confident. It says "1" or "0", with no shades of grey. Two upgrades fix this: smooth the activation (sigmoid), or stack many perceptrons together (multi-layer).

The binary step is computationally cheap and biologically pleasing — but it has a fatal flaw for learning: it has no useful derivative. The derivative is zero almost everywhere, with an infinite spike at the threshold. This means we can't use gradient-based methods (backprop) to update the weights; the gradient signal vanishes.

Fix #1: replace binary with the sigmoid. It looks like a smoothed step. Now near the boundary the perceptron output is $0.5$ish — a soft "I'm not sure" — and only out at the extremes does it commit to $\sim 0$ or $\sim 1$. The boundary becomes a fuzzy stripe instead of a knife-edge.

Fix #2: when even a smooth boundary can't separate the classes (e.g., a red blob surrounded by blue), one neuron isn't enough. Stack them: feed the outputs of several neurons into another neuron. The output region becomes the intersection or union of half-planes — and you can carve out arbitrary shapes (triangles, polygons, smooth blobs) by combining straight cuts.

binary cut hard boundary sigmoid soft boundary multi-neuron arbitrary shapes
Three regimes: hard binary (left), smooth sigmoid (middle), and multi-neuron carving (right).
Self-test 4. Why can a single perceptron with sigmoid still not separate the red triangle from the blue surround?

Because sigmoid only smooths the boundary — the boundary itself is still a single hyperplane. Smoothing doesn't change where the boundary lies, only how sharply we commit on either side. To enclose a triangular region you need to combine three half-planes (one per edge), which requires three hidden neurons feeding into an output neuron. The non-linearity of sigmoid is in the output, not in the geometry of the boundary.

· · ·
Part IV

The Perceptron Learning Algorithm

We've talked about which weights would classify correctly. Now: how does the perceptron find them on its own from data? Rosenblatt's 1958 answer is shockingly simple — and it's guaranteed to work, given certain conditions.

First, a small change of convention. From here on, the labels are not $\{0, 1\}$ but $\{-1, +1\}$. This makes the algebra cleaner — we'll multiply predictions by labels, and a wrong answer turns out to flip the sign. This change is part of the lecturer's handwritten annotation on slide 20: the activation is now

$$\hat{y} = \text{sign}(\mathbf{x}^\top\mathbf{w} - b) \in \{-1, +1\}.$$

Setting up: augmenting the bias into the weights

It's cleaner to fold the bias into the weight vector. Define the augmented input and weights:

$$\mathbf{x}'_i = \begin{bmatrix} \mathbf{x}_i \\ 1 \end{bmatrix}, \qquad \mathbf{w}' = \begin{bmatrix} \mathbf{w} \\ -b \end{bmatrix}.$$

Then the activation potential becomes:

$${\mathbf{x}'_i}^\top \mathbf{w}' = \mathbf{x}_i^\top \mathbf{w} - b.$$

From now on we'll just write $\mathbf{x}^\top\mathbf{w}$, with the understanding that the bias is hidden inside.

Definition · Hyperplane

The decision hyperplane of a perceptron with weights $\mathbf{w}$ and bias $b$ is

$$H_{\mathbf{w}, b} \;=\; \{\mathbf{x} \in \mathbb{R}^d : \mathbf{w}^\top\mathbf{x} - b = 0\}.$$

Points with $\mathbf{w}^\top\mathbf{x} - b > 0$ are classified $+1$; those with $\mathbf{w}^\top\mathbf{x} - b < 0$ are classified $-1$.

The "classified correctly" trick

With labels in $\{-1, +1\}$ there's a beautiful single-line check for whether a point is classified correctly.

Key observation (slide 22 handwriting)

$$y_i \cdot (\mathbf{x}_i^\top \mathbf{w}) \;\geq\; 0 \;\;\iff\;\; \mathbf{x}_i \text{ is classified correctly.}$$

Why? If the true label is $y_i = +1$, we want the prediction $\mathbf{x}_i^\top\mathbf{w}$ to be $\geq 0$ — so $y_i \cdot (\mathbf{x}_i^\top\mathbf{w}) \geq 0$. If the true label is $y_i = -1$, we want $\mathbf{x}_i^\top\mathbf{w} \leq 0$ — and multiplying both sides by $y_i = -1$ flips the inequality, giving again $y_i \cdot (\mathbf{x}_i^\top\mathbf{w}) \geq 0$. The product $y_i \cdot (\mathbf{x}_i^\top\mathbf{w})$ is positive exactly when label and prediction agree.

Assumptions for learning

Two assumptions (slide 20)
  1. Labels are in $\{-1, +1\}$.
  2. The data is linearly separable — there exists some hyperplane that perfectly splits the positive from the negative class.

The algorithm itself

Algorithm 1: Perceptron Algorithm
─────────────────────────────────
Initialize weights w ← 0
repeat
    Done ← True
    for each training sample (xⱼ, yⱼ):
        compute prediction ŷⱼ = sign(w · xⱼ)
        if ŷⱼ ≠ yⱼ:
            update weights w ← w + yⱼ · xⱼ
            Done ← False
        end if
    end for
until Done
Why this update makes sense

The update rule $\mathbf{w} \leftarrow \mathbf{w} + y_j \mathbf{x}_j$ pushes the weight vector in the direction of $\mathbf{x}_j$ if $y_j = +1$ (so that $\mathbf{w}^\top\mathbf{x}_j$ grows), and in the opposite direction if $y_j = -1$ (so that $\mathbf{w}^\top\mathbf{x}_j$ shrinks). In both cases the update is exactly tuned to increase $y_j \cdot (\mathbf{w}^\top\mathbf{x}_j)$. We never update on a correctly-classified point — only when we make a mistake.

Self-test 5. Suppose $\mathbf{w}=(0,0)$, and we see $\mathbf{x}_1 = (2, 1)$ with $y_1=+1$. What is the new $\mathbf{w}$ after one update?

The current prediction is $\text{sign}(\mathbf{0}^\top\mathbf{x}_1) = \text{sign}(0)$. By the algorithm convention, $\text{sign}(0) = +1$ or $-1$ depending on tie-breaking; if we treat $0$ as misclassified for $y_1 = +1$ (since the algorithm says ŷⱼ ≠ yⱼ triggers an update — and many implementations treat sign(0) as 0 or as -1), we apply the rule. Update: $\mathbf{w}' = \mathbf{w} + y_1 \mathbf{x}_1 = (0,0) + 1 \cdot (2,1) = \mathbf{(2,1)}$.

Check: now $\mathbf{w}'^\top\mathbf{x}_1 = 2\cdot 2 + 1 \cdot 1 = 5 > 0$, so $\hat{y}_1 = +1$ — correct!

Self-test 6. Why does the perceptron algorithm only update on mistakes, never on correctly classified points?

Two reasons. First, philosophically: if the current $\mathbf{w}$ already gets a point right, perturbing it might make things worse. Second, algorithmically: the convergence proof (next section) crucially relies on the fact that the algorithm only takes a step on a misclassified point, because that's what makes the inequality $-2y(\mathbf{w}^\top\mathbf{x}) \geq 0$ available — which is needed to bound how fast $\|\mathbf{w}\|^2$ can grow.

· · ·
Part V

Geometry — distance to a hyperplane

Before we can prove the algorithm converges, we need one geometric fact: how far is a given point from a given hyperplane? The answer is a single tidy formula, and the proof is two lines of algebra.

Claim · Distance formula (slide 26)

Let $b \in \mathbb{R}$, $\mathbf{w} \in \mathbb{R}^d$, and define the hyperplane

$$H_{\mathbf{w}, b} \;=\; \{\mathbf{x} : \mathbf{w}^\top\mathbf{x} + b = 0\}.$$

The (perpendicular) distance from a point $\mathbf{x}_0 \in \mathbb{R}^d$ to $H$ is

$$\boxed{\;\text{dist}(H, \mathbf{x}_0) \;=\; \frac{|\mathbf{w}^\top\mathbf{x}_0 + b|}{\|\mathbf{w}\|}\;}$$

Proof (slide 27)

Consider the line through $\mathbf{x}_0$ in the direction of $\mathbf{w}$ (which is the normal to $H$):

$$L \;=\; \{\mathbf{x}_0 + t\mathbf{w} : t \in \mathbb{R}\} \;\subseteq\; \mathbb{R}^d.$$

This line $L$ meets the hyperplane $H$ at a single point — call it $\mathbf{x}'$ — at parameter $t = t'$. To find $t'$, we plug $\mathbf{x}_0 + t\mathbf{w}$ into the hyperplane equation:

$$\mathbf{w}^\top(\mathbf{x}_0 + t\mathbf{w}) + b = 0 \;\;\Longleftrightarrow\;\; t\|\mathbf{w}\|^2 = -(\mathbf{w}^\top\mathbf{x}_0 + b).$$

Solving:

$$t' \;=\; -\frac{\mathbf{w}^\top\mathbf{x}_0 + b}{\|\mathbf{w}\|^2}.$$

Now the distance is just $\|\mathbf{x}' - \mathbf{x}_0\|$:

$$\text{dist}(H, \mathbf{x}_0) \;=\; \|\mathbf{x}' - \mathbf{x}_0\| \;=\; \|\mathbf{x}_0 + t'\mathbf{w} - \mathbf{x}_0\| \;=\; |t'|\cdot\|\mathbf{w}\| \;=\; \frac{|\mathbf{w}^\top\mathbf{x}_0 + b|}{\|\mathbf{w}\|^2} \cdot \|\mathbf{w}\| \;=\; \frac{|\mathbf{w}^\top\mathbf{x}_0 + b|}{\|\mathbf{w}\|}. \quad\blacksquare$$
Intuition

The numerator $|\mathbf{w}^\top\mathbf{x}_0 + b|$ measures how strongly the point activates the hyperplane (the absolute activation potential). Dividing by $\|\mathbf{w}\|$ normalises the measurement — without it, just doubling all the weights would double the "score" without the point actually moving. So the distance formula is "raw activation, scaled to physical units."

Self-test 7. The hyperplane is $3x_1 + 4x_2 - 5 = 0$ (so $\mathbf{w} = (3,4)$, $b = -5$). What is the distance from the origin to this hyperplane?

Plug in $\mathbf{x}_0 = (0,0)$:

$$\text{dist} = \frac{|3(0) + 4(0) - 5|}{\sqrt{3^2 + 4^2}} = \frac{|-5|}{\sqrt{25}} = \frac{5}{5} = 1.$$

So the origin sits exactly 1 unit away from the line $3x_1 + 4x_2 = 5$.

Self-test 8. What happens to the distance formula if we scale $\mathbf{w}$ and $b$ by a factor of 10 (i.e., new weights $10\mathbf{w}$, new bias $10b$)?

Numerator scales by 10, denominator scales by 10 (since $\|10\mathbf{w}\| = 10\|\mathbf{w}\|$), so the distance is unchanged. This is good — the hyperplane $\{\mathbf{w}^\top\mathbf{x} + b = 0\}$ and $\{10\mathbf{w}^\top\mathbf{x} + 10b = 0\}$ are the same set of points, so they should give the same distance. The formula respects this geometric invariance.

· · ·
Part VI

Novikoff's Convergence Theorem

This is the punchline of the perceptron story. If the data is linearly separable with margin γ, the perceptron makes at most 1/γ² mistakes. The algorithm cannot get stuck: it always halts, and the bound on mistakes depends only on the geometry of the data — not on its size, dimension, or order.

This theorem is due to Albert Novikoff (1962) and is stated and proved entirely in the lecturer's handwritten annotations on slides 23–24. It's a foundational result in learning theory, and it's the kind of thing that absolutely could appear on the exam — the statement, the proof technique, and the role of every assumption.

Setup: the margin

Definition · Margin γ (slide 23)

Suppose the data is linearly separable. Among all separating hyperplanes, pick one with weights $\mathbf{w}^*$ that achieves $\|\mathbf{w}^*\| = 1$ (we can always rescale). The margin of the data with respect to $\mathbf{w}^*$ is

$$\gamma \;=\; \min_{(\mathbf{x}_i, y_i)} |\mathbf{x}_i^\top \mathbf{w}^*|.$$

i.e., the distance from the closest training point to the optimal separating hyperplane (since $\|\mathbf{w}^*\|=1$, the absolute activation equals the perpendicular distance — see Part V).

Standing assumptions for Novikoff
  1. $\|\mathbf{x}_i\| \leq 1$ for all training points (data is bounded — we can rescale to make this true).
  2. $\|\mathbf{w}^*\| = 1$ (we normalise the optimal separator).
  3. There exists $\gamma > 0$ such that $y_i \cdot (\mathbf{x}_i^\top \mathbf{w}^*) \geq \gamma$ for all $i$ (linear separability with margin).
‖xᵢ‖ ≤ 1 w* + γ
The margin γ is the distance from the closest training point to the optimal separating hyperplane.

The theorem

Theorem · Novikoff (1962)

Under the assumptions above, the Perceptron Algorithm makes at most

$$m \;\leq\; \frac{1}{\gamma^2}$$

misclassifications (i.e., at most $1/\gamma^2$ updates) before halting with a separator that classifies all training data correctly.

Proof (slide 24, full)

The proof is a beautiful sandwich argument. We'll bound $\mathbf{w}^{(m)\top}\mathbf{w}^*$ from below (it grows) and $\|\mathbf{w}^{(m)}\|$ from above (it doesn't grow too fast); together, these force $m$ to be small.

Notation: at step $k$, we make a mistake on point $(\mathbf{x}, y)$ (dropping the index for clarity) and update $\mathbf{w}' \leftarrow \mathbf{w} + y\mathbf{x}$. Two facts about the misclassified point:

Step 1: $\mathbf{w}^\top\mathbf{w}^*$ grows by at least γ each update

$$\mathbf{w}'^\top \mathbf{w}^* \;=\; (\mathbf{w} + y\mathbf{x})^\top \mathbf{w}^* \;=\; \mathbf{w}^\top \mathbf{w}^* + y(\mathbf{x}^\top \mathbf{w}^*) \;\geq\; \mathbf{w}^\top \mathbf{w}^* + \gamma.$$

So after $m$ updates, starting from $\mathbf{w}^{(0)} = \mathbf{0}$:

$$\mathbf{w}^{(m)\top}\mathbf{w}^* \;\geq\; m\gamma. \qquad (\star)$$

Step 2: $\|\mathbf{w}\|^2$ grows by at most 1 each update

$$\|\mathbf{w}'\|^2 \;=\; (\mathbf{w} + y\mathbf{x})^\top(\mathbf{w} + y\mathbf{x}) \;=\; \mathbf{w}^\top\mathbf{w} + 2y(\mathbf{w}^\top\mathbf{x}) + y^2(\mathbf{x}^\top\mathbf{x}).$$

The middle term is $\leq 0$ (by fact (i): $y(\mathbf{w}^\top\mathbf{x}) \leq 0$, so $2y(\mathbf{w}^\top\mathbf{x}) \leq 0$). The last term is $y^2 \|\mathbf{x}\|^2 = \|\mathbf{x}\|^2 \leq 1$ (since $y \in \{-1,+1\}$ and $\|\mathbf{x}\| \leq 1$). So:

$$\|\mathbf{w}'\|^2 \;\leq\; \|\mathbf{w}\|^2 + 1.$$

After $m$ updates, starting from $\mathbf{w}^{(0)} = \mathbf{0}$:

$$\|\mathbf{w}^{(m)}\|^2 \;\leq\; m. \qquad (\star\star)$$

Step 3: combine via Cauchy–Schwarz

By Cauchy–Schwarz (recall: $\mathbf{a}^\top\mathbf{b} \leq \|\mathbf{a}\|\|\mathbf{b}\|$):

$$\mathbf{w}^{(m)\top}\mathbf{w}^* \;\leq\; \|\mathbf{w}^{(m)}\| \cdot \|\mathbf{w}^*\| \;=\; \|\mathbf{w}^{(m)}\| \cdot 1 \;\leq\; \sqrt{m}.$$

Combining with $(\star)$:

$$m\gamma \;\leq\; \mathbf{w}^{(m)\top}\mathbf{w}^* \;\leq\; \sqrt{m}.$$

So $m\gamma \leq \sqrt{m}$, i.e., $m^2\gamma^2 \leq m$, i.e.,

$$\boxed{\;m \;\leq\; \frac{1}{\gamma^2}.\;}\qquad \blacksquare$$
Why this is gorgeous

$\mathbf{w}^{(m)\top}\mathbf{w}^*$ is the alignment between the current weights and the optimal weights. Step 1 says: every mistake brings us at least γ closer in alignment. Step 2 says: but the weight vector can't grow too fast in length. Cauchy–Schwarz forces alignment to be bounded by length, so we run out of room. The bound depends only on γ — not on the number of training points, not on the dimension, not on the order of presentation. It's a "free lunch" guarantee.

Exam trap · What if γ = 0?

If the data is not linearly separable, then there is no $\mathbf{w}^*$ with margin $\gamma > 0$. The bound $1/\gamma^2$ blows up, and indeed: the perceptron algorithm will not halt; it cycles forever, updating on points that lie on the wrong side. The convergence guarantee is conditional on linear separability — it gives no guarantees on noisy or overlapping data.

Self-test 9. The bound $m \leq 1/\gamma^2$ does not depend on the dimension $d$ or the number of training points $n$. Why is that surprising?

It would seem reasonable that learning in higher dimensions, or with more data, takes more mistakes — that's how it works for many ML algorithms. But Novikoff's bound says: the geometry of the worst-case point (its distance from the optimal separator) is all that matters. Adding more "easy" points doesn't slow convergence; what hurts is having a point sit very close to the boundary (small γ). This is the same intuition that motivates Support Vector Machines: maximise the margin to make learning robust.

Self-test 10. Where in the proof do we use assumption (i) that $y(\mathbf{x}^\top\mathbf{w}) \leq 0$ (i.e., the point is misclassified)?

In Step 2, when we said the middle term $2y(\mathbf{w}^\top\mathbf{x}) \leq 0$. That's exactly the misclassification condition: the prediction has the wrong sign relative to the label. Without this, we couldn't drop the middle term, and $\|\mathbf{w}'\|^2$ might grow much faster than $\|\mathbf{w}\|^2 + 1$ — destroying the bound. This is why the algorithm only updates on mistakes: the proof relies on it.

Self-test 11. If the data has margin γ = 0.1 (tiny), what's the maximum number of mistakes the perceptron will make?

$m \leq 1/\gamma^2 = 1/(0.1)^2 = 1/0.01 = 100$ mistakes. With margin γ = 0.01, it would be $1/0.0001 = 10{,}000$ mistakes — so the algorithm gets very slow as the margin shrinks, even though it eventually halts.

Self-test 12. We assumed $\|\mathbf{x}_i\| \leq 1$ in the proof. If instead the points have $\|\mathbf{x}_i\| \leq R$ for some $R > 0$, what does the bound become?

Step 2 changes: $\|\mathbf{x}\|^2 \leq R^2$, so $\|\mathbf{w}^{(m)}\|^2 \leq mR^2$, hence $\|\mathbf{w}^{(m)}\| \leq R\sqrt{m}$. Then $m\gamma \leq R\sqrt{m}$, giving $m \leq R^2 / \gamma^2$. The general form of Novikoff's theorem is $\boxed{m \leq R^2/\gamma^2}$, and the "$\|\mathbf{x}\| \leq 1$" version is just the case $R = 1$.

· · ·
Part VII

Feed-forward in a multi-layer network

Now we move from one neuron to many, organised in layers. Feed-forward is the recipe for taking an input and producing an output, layer by layer. It's just plugging numbers into formulas — but with bookkeeping.

A multi-layer perceptron has an input layer (the raw features), one or more hidden layers (each layer's outputs become the next layer's inputs), and an output layer (whose outputs are the predictions $\hat{y}$). Each connection has a weight, each neuron has a bias and an activation function. Feed-forward is the act of pushing inputs through the layers in order to compute the output.

For this section, we'll use the ReLU activation: $f(x) = \max(0, x)$. We'll see why ReLU (or some non-linear function) matters in Part VIII.

Input layer x₁ x₂ Hidden layer h₁ h₂ Output layer ŷ w₁=.11 w₃=.21 w₂=.12 w₄=.08 w₅=.14 w₆=.15
A 2–2–1 feed-forward network. Each edge is a weight; we'll plug in numbers.

Worked example (slides 34–36)

Let $x_1 = 2$, $x_2 = 3$, with weights:

(For simplicity all biases are zero in this example.)

Step 1. Compute the hidden activations.

$$\sigma(h_1) = \text{ReLU}(x_1 w_1 + x_2 w_2) = \text{ReLU}(2 \cdot 0.11 + 3 \cdot 0.12) = \text{ReLU}(0.22 + 0.36) = \text{ReLU}(0.58) = 0.58.$$ $$\sigma(h_2) = \text{ReLU}(x_1 w_3 + x_2 w_4) = \text{ReLU}(2 \cdot 0.21 + 3 \cdot 0.08) = \text{ReLU}(0.42 + 0.24) = \text{ReLU}(0.66) = 0.66.$$

Step 2. Compute the output.

$$\hat{y} = \sigma(z) = \text{ReLU}(\sigma(h_1)\cdot w_5 + \sigma(h_2)\cdot w_6) = \text{ReLU}(0.58 \cdot 0.14 + 0.66 \cdot 0.15)$$ $$= \text{ReLU}(0.0812 + 0.099) = \text{ReLU}(0.1802) = 0.1802.$$
Worked answer

For input $(2, 3)$ this network outputs $\hat{y} = 0.1802$. Notice that ReLU did nothing here — every pre-activation was already positive. ReLU only "kicks in" when an input is negative, sending it to zero.

Self-test 13. Repeat the calculation with the same network and weights, but with input $(-2, 3)$. (Hint: this time ReLU will matter.)

Hidden layer:

$$\sigma(h_1) = \text{ReLU}((-2)(0.11) + 3(0.12)) = \text{ReLU}(-0.22 + 0.36) = \text{ReLU}(0.14) = 0.14.$$ $$\sigma(h_2) = \text{ReLU}((-2)(0.21) + 3(0.08)) = \text{ReLU}(-0.42 + 0.24) = \text{ReLU}(-0.18) = \mathbf{0}.$$

$h_2$ has been zeroed by ReLU — the neuron didn't fire. Output layer:

$$\hat{y} = \text{ReLU}(0.14 \cdot 0.14 + 0 \cdot 0.15) = \text{ReLU}(0.0196) = 0.0196.$$

So the prediction drops sharply when one of the hidden neurons goes silent.

Self-test 14. In a fully-connected feed-forward network with input dimension $d_0$, hidden layers of sizes $d_1, d_2, \ldots, d_L$, and output dimension $d_{L+1}$, how many weights does the network have? (Ignore biases.)

Each layer $\ell$ has $d_{\ell-1} \cdot d_\ell$ weights connecting from the previous layer's $d_{\ell-1}$ neurons to its own $d_\ell$ neurons. Total:

$$\sum_{\ell=1}^{L+1} d_{\ell-1} \cdot d_\ell.$$

For a 2–2–1 network: $2 \cdot 2 + 2 \cdot 1 = 4 + 2 = 6$ weights — exactly the six in our example.

· · ·
Part VIII

Why we need non-linearity

Here's a striking fact: if every neuron uses a linear activation function, the entire network — no matter how many layers — collapses to a single linear neuron. Any expressive power has to come from non-linearity. Here's the algebra that proves it.

Take the same 2–2–1 network as before, with biases this time: $b_1$ and $b_2$ on the hidden neurons, $b_3$ on the output neuron, and a linear activation (i.e., $f(x) = x$). The output is

$$\hat{y} = h_1 w_5 + h_2 w_6 - b_3.$$

Now substitute $h_1 = x_1 w_1 + x_2 w_2 - b_1$ and $h_2 = x_1 w_3 + x_2 w_4 - b_2$:

$$\hat{y} = (x_1 w_1 + x_2 w_2 - b_1) w_5 + (x_1 w_3 + x_2 w_4 - b_2) w_6 - b_3.$$

Distribute everything:

$$\hat{y} = x_1 w_1 w_5 + x_2 w_2 w_5 - b_1 w_5 + x_1 w_3 w_6 + x_2 w_4 w_6 - b_2 w_6 - b_3.$$

Group by $x_1$, $x_2$, and constants:

$$\hat{y} = x_1(w_1 w_5 + w_3 w_6) + x_2(w_2 w_5 + w_4 w_6) - (b_1 w_5 + b_2 w_6 + b_3).$$

Define new weights and a new bias:

$$w' = w_1 w_5 + w_3 w_6, \quad w'' = w_2 w_5 + w_4 w_6, \quad b' = b_1 w_5 + b_2 w_6 + b_3.$$

Then:

$$\boxed{\;\hat{y} = x_1 w' + x_2 w'' - b'.\;}$$

This is exactly the formula for a single perceptron with weights $(w', w'')$ and bias $b'$. The hidden layer has vanished into the equivalent single neuron.

Claim · Linear activations collapse

Every neural network with linear activation functions can be replaced by a single neuron with the same input/output behaviour, no matter how many hidden layers it has.

Why this is true in general

Each layer of a linear network is a matrix multiplication: $\mathbf{h}^{(\ell)} = W^{(\ell)} \mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}$. Composing $L$ such linear maps gives $\mathbf{h}^{(L)} = W^{(L)} W^{(L-1)} \cdots W^{(1)} \mathbf{x} + (\text{some bias})$, and the product of matrices is just another matrix. So the whole stack has the form $W_{\text{eff}}\mathbf{x} + \mathbf{b}_{\text{eff}}$ — one linear map. Depth gains you nothing without non-linearity.

Exam trap · "Identity is linear"

The identity function $f(x) = x$ is linear. So is $f(x) = 2x$, $f(x) = -x$, and $f(x) = ax + c$ for any constants $a, c$. The linear-collapse argument applies to any of these. Only genuinely non-linear functions — sigmoid, tanh, ReLU, leaky ReLU, swish — give multi-layer networks expressive power beyond a single neuron.

Self-test 15. We saw that a 2–2–1 linear network collapses to a single neuron. Does the same happen for a 2–10–10–10–1 network, all linear? Why?

Yes — exactly the same argument applies. Each layer is a linear map $\mathbf{h}^{(\ell)} = W^{(\ell)}\mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}$, and the composition of any number of linear maps is still a linear map. The 2–10–10–10–1 network with linear activations is mathematically identical to a 2–1 network (one neuron) with appropriate $w'$, $w''$, $b'$. The 30 hidden neurons buy you nothing.

Self-test 16. Why does ReLU break the collapse? It's piecewise linear — it's literally the identity on positive inputs.

Because ReLU is not a single linear function: it's $0$ on negatives and identity on positives. The "kink" at zero means that for different inputs, different sets of neurons fire vs. don't fire — and the effective linear map of the whole network changes depending on the input. Composing two ReLU layers gives a piecewise-linear function with many "pieces" (each corresponding to a different combination of which neurons are active), and the number of pieces can be exponentially large in the depth. That non-linearity at the kink is enough to give the network all its expressive power.

· · ·
Part IX

The activation-function zoo

There are six activation functions on the slides, each with a personality. Here they are, ranked by their pros, cons, and a one-line summary of when you'd use them.

1. Binary step

$$f(x) = \begin{cases} 1 & x \geq 0 \\ 0 & x < 0 \end{cases}$$

Advantages:

Disadvantages:

2. Sigmoid (logistic)

$$f(x) = \frac{1}{1 + e^{-x}}.$$

Advantages:

Disadvantages:

3. Hyperbolic tangent (tanh)

$$f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}.$$

Like sigmoid, but its outputs sit in $[-1, +1]$ instead of $[0, 1]$. So tanh is zero-centred on the y-axis.

Why this helps: if all hidden activations are positive (as with sigmoid), the gradient updates for the next layer's weights all share a sign — they all push up, or all push down — which slows learning. Centring around zero means activations can be positive or negative, and gradient updates can move in either direction independently.

4. ReLU (Rectified Linear Unit)

$$f(x) = \max(0, x).$$

Advantages:

Disadvantages:

5. Leaky ReLU

$$f(x) = \max(0.1\,x,\, x).$$

Same as ReLU but with a small slope on the negative side (here, 0.1). This means even on negatives, the gradient is non-zero (it's 0.1), so a "dying" neuron can still receive update signals and recover.

Advantages:

Disadvantages:

6. Swish

$$f(x) = \frac{x}{1 + e^{-x}} \;=\; x \cdot \text{sigmoid}(x).$$

A smooth, non-monotonic function that looks similar to ReLU but with a small dip below zero before flattening out. Empirically, Swish performs a tiny bit better than ReLU on many tasks (introduced by Google researchers via neural architecture search).

Summary table

FunctionFormulaRangeSmooth?Risks
Binary1 if x≥0, else 0{0, 1}NoNo gradient
Sigmoid1/(1+e⁻ˣ)[0, 1]YesVanishing gradient
tanh(eˣ−e⁻ˣ)/(eˣ+e⁻ˣ)[−1, 1]YesVanishing gradient
ReLUmax(0, x)[0, ∞)No (kink at 0)Dying ReLU
Leaky ReLUmax(0.1x, x)(−∞, ∞)No (kink at 0)Fixed leak coef.
Swishx·sigmoid(x)(≈−0.28, ∞)YesComputationally heavier
Self-test 17. Which activation functions cannot be used with backpropagation, and why?

The binary step cannot, because its derivative is zero almost everywhere (and undefined at the threshold). Backpropagation works by chaining derivatives, so a zero derivative kills the gradient signal — no learning happens. All the others (sigmoid, tanh, ReLU, leaky ReLU, swish) have well-defined or sub-differentiable gradients that backprop can use.

Self-test 18. A neuron uses sigmoid activation. Its pre-activation values during training are typically around $+8$ or $-8$. Why is this a problem?

At $x = \pm 8$, sigmoid is essentially saturated — $\sigma(8) \approx 0.9997$, $\sigma(-8) \approx 0.00033$. The derivative $\sigma(x)(1 - \sigma(x))$ is therefore extremely tiny ($\approx 0.0003$ at $x=8$). During backpropagation, this tiny derivative gets multiplied through the gradient chain, so the weights connected to this neuron receive almost no update. This is the vanishing-gradient problem, and it's why deep networks trained with sigmoid throughout were notoriously hard to train before ReLU became popular.