From a single firing neuron to multilayer feed-forward networks — and the convergence theorem that started it all.
A perceptron is a comically simplified neuron. Real neurons fire when chemistry tips them past a threshold; perceptrons fire when a weighted sum tips past a number. Same shape, different ingredients.
Your brain has roughly 86 billion neurons. Each one is a tiny decision-maker. Dendrites are the input branches — they're covered in receptors that catch chemical messengers (neurotransmitters) released by other neurons, and they translate those chemical hits into electrical signals. Those signals all flow into the soma (the cell body), where they're summed up. The nucleus, sitting inside the soma, holds the cell's genetic material. If the total signal crosses some threshold, the neuron fires — sending an electrical pulse down the axon, which terminates in branches called axon terminals. Those terminals release their own neurotransmitters, which crash into the next neuron's dendrites, and the chain continues.
The artificial neuron — the perceptron — is a cartoon of this. We replace the dendrites with input numbers, the synapse strengths with weights, the threshold with a bias, and the firing decision with an activation function.
A perceptron has four parts: an input vector $\mathbf{x} \in \mathbb{R}^d$, a weight vector $\mathbf{w} \in \mathbb{R}^d$, a scalar bias $b \in \mathbb{R}$, and an activation function $f : \mathbb{R} \to \mathbb{R}$. Its output is
$$h \;=\; f\!\left(\sum_i x_i w_i + b\right) \;=\; f(\mathbf{x}^\top \mathbf{w} + b).$$The quantity $\mathbf{x}^\top\mathbf{w} - b$ (or sometimes $+b$, depending on convention) is called the activation potential — the number that the activation function gets to decide on.
Each weight $w_i$ tells the neuron how much it cares about input $x_i$. A large positive $w_i$ means "this input matters and pulls me toward firing"; a large negative $w_i$ means "this input matters and pulls me away from firing"; near-zero means "I don't care about this one." The bias $b$ is the neuron's mood — its baseline reluctance or eagerness to fire, independent of inputs.
For a single perceptron, let's pick the simplest possible activation function: the binary step.
$$\hat{y} \;=\; f_{\text{binary}}(\mathbf{x},\mathbf{w}) \;=\; \begin{cases} 1 & \text{if } \mathbf{x}^\top\mathbf{w} - b \;\geq\; 0\\ 0 & \text{otherwise.}\end{cases}$$Think of it as: if the weighted vote crosses the threshold $b$, output 1; otherwise output 0.
Compute the activation potential first: $\mathbf{x}^\top\mathbf{w} - b = (2)(0.5) + (1)(-0.3) - 0.2 = 1.0 - 0.3 - 0.2 = 0.5$. Since $0.5 \geq 0$, the binary step outputs $\hat{y}=1$.
A perceptron with binary activation is a line-drawer. Give it 2D data and it will draw a straight line, calling everything on one side "1" and everything on the other "0".
Suppose you have data on apples and oranges. (Don't try this with pears — you can't compare them with apples.) For each fruit you measure two things: sweetness and size. Apples have sweetness 4–7 and size 6–10cm; oranges have sweetness 6–9 and size 8–12cm. The two clouds overlap in the middle, but mostly they sit in different parts of the 2D plane. We'd like a perceptron that, given a new fruit's $(x_1, x_2)$, tells us "apple" or "orange".
For binary activation, the perceptron outputs $1$ exactly when $x_1 w_1 + x_2 w_2 - b \geq 0$. The set of points where $x_1 w_1 + x_2 w_2 - b = 0$ is a straight line in 2D (a hyperplane in higher dimensions). On one side of this line, the output is 1; on the other, 0.
You don't get to control $x_1$ and $x_2$ — they're given by nature. Your job is to pick $w_1$, $w_2$, and $b$ so that the line they define correctly separates the two classes. Think of it as a game: nature throws you a point, you've already committed to your weights, and you want to be right as often as possible.
For the toy example where the ideal boundary is $x_1 + x_2 = 1$:
We need $x_1 w_1 + x_2 w_2 - b \geq 0 \iff 2x_1 + x_2 \geq 4$. So pick $w_1 = 2$, $w_2 = 1$, $b = 4$. (Any positive scalar multiple, e.g. $w_1=4, w_2=2, b=8$, also works — perceptrons are scale-invariant.)
The slides write the activation potential as $\mathbf{x}^\top \mathbf{w} - b$ (subtracting $b$). Many textbooks write $\mathbf{x}^\top\mathbf{w} + b$ (adding $b$). They're the same model — just absorb the sign by replacing $b$ with $-b$. In an exam, always check which convention the question uses before computing.
The equation $\mathbf{w}^\top\mathbf{x} - b = 0$ is linear in $\mathbf{x}$ — every term is a constant times one coordinate of $\mathbf{x}$, plus a constant. The set of solutions to one linear equation in $d$ unknowns is a $(d-1)$-dimensional flat — a line in 2D, a plane in 3D, a hyperplane in general. This is what restricts a single perceptron: it can only carve $\mathbb{R}^d$ with one straight cut.
A binary perceptron is brutally confident. It says "1" or "0", with no shades of grey. Two upgrades fix this: smooth the activation (sigmoid), or stack many perceptrons together (multi-layer).
The binary step is computationally cheap and biologically pleasing — but it has a fatal flaw for learning: it has no useful derivative. The derivative is zero almost everywhere, with an infinite spike at the threshold. This means we can't use gradient-based methods (backprop) to update the weights; the gradient signal vanishes.
Fix #1: replace binary with the sigmoid. It looks like a smoothed step. Now near the boundary the perceptron output is $0.5$ish — a soft "I'm not sure" — and only out at the extremes does it commit to $\sim 0$ or $\sim 1$. The boundary becomes a fuzzy stripe instead of a knife-edge.
Fix #2: when even a smooth boundary can't separate the classes (e.g., a red blob surrounded by blue), one neuron isn't enough. Stack them: feed the outputs of several neurons into another neuron. The output region becomes the intersection or union of half-planes — and you can carve out arbitrary shapes (triangles, polygons, smooth blobs) by combining straight cuts.
Because sigmoid only smooths the boundary — the boundary itself is still a single hyperplane. Smoothing doesn't change where the boundary lies, only how sharply we commit on either side. To enclose a triangular region you need to combine three half-planes (one per edge), which requires three hidden neurons feeding into an output neuron. The non-linearity of sigmoid is in the output, not in the geometry of the boundary.
We've talked about which weights would classify correctly. Now: how does the perceptron find them on its own from data? Rosenblatt's 1958 answer is shockingly simple — and it's guaranteed to work, given certain conditions.
First, a small change of convention. From here on, the labels are not $\{0, 1\}$ but $\{-1, +1\}$. This makes the algebra cleaner — we'll multiply predictions by labels, and a wrong answer turns out to flip the sign. This change is part of the lecturer's handwritten annotation on slide 20: the activation is now
$$\hat{y} = \text{sign}(\mathbf{x}^\top\mathbf{w} - b) \in \{-1, +1\}.$$It's cleaner to fold the bias into the weight vector. Define the augmented input and weights:
$$\mathbf{x}'_i = \begin{bmatrix} \mathbf{x}_i \\ 1 \end{bmatrix}, \qquad \mathbf{w}' = \begin{bmatrix} \mathbf{w} \\ -b \end{bmatrix}.$$Then the activation potential becomes:
$${\mathbf{x}'_i}^\top \mathbf{w}' = \mathbf{x}_i^\top \mathbf{w} - b.$$From now on we'll just write $\mathbf{x}^\top\mathbf{w}$, with the understanding that the bias is hidden inside.
The decision hyperplane of a perceptron with weights $\mathbf{w}$ and bias $b$ is
$$H_{\mathbf{w}, b} \;=\; \{\mathbf{x} \in \mathbb{R}^d : \mathbf{w}^\top\mathbf{x} - b = 0\}.$$Points with $\mathbf{w}^\top\mathbf{x} - b > 0$ are classified $+1$; those with $\mathbf{w}^\top\mathbf{x} - b < 0$ are classified $-1$.
With labels in $\{-1, +1\}$ there's a beautiful single-line check for whether a point is classified correctly.
$$y_i \cdot (\mathbf{x}_i^\top \mathbf{w}) \;\geq\; 0 \;\;\iff\;\; \mathbf{x}_i \text{ is classified correctly.}$$
Why? If the true label is $y_i = +1$, we want the prediction $\mathbf{x}_i^\top\mathbf{w}$ to be $\geq 0$ — so $y_i \cdot (\mathbf{x}_i^\top\mathbf{w}) \geq 0$. If the true label is $y_i = -1$, we want $\mathbf{x}_i^\top\mathbf{w} \leq 0$ — and multiplying both sides by $y_i = -1$ flips the inequality, giving again $y_i \cdot (\mathbf{x}_i^\top\mathbf{w}) \geq 0$. The product $y_i \cdot (\mathbf{x}_i^\top\mathbf{w})$ is positive exactly when label and prediction agree.
Algorithm 1: Perceptron Algorithm
─────────────────────────────────
Initialize weights w ← 0
repeat
Done ← True
for each training sample (xⱼ, yⱼ):
compute prediction ŷⱼ = sign(w · xⱼ)
if ŷⱼ ≠ yⱼ:
update weights w ← w + yⱼ · xⱼ
Done ← False
end if
end for
until Done
The update rule $\mathbf{w} \leftarrow \mathbf{w} + y_j \mathbf{x}_j$ pushes the weight vector in the direction of $\mathbf{x}_j$ if $y_j = +1$ (so that $\mathbf{w}^\top\mathbf{x}_j$ grows), and in the opposite direction if $y_j = -1$ (so that $\mathbf{w}^\top\mathbf{x}_j$ shrinks). In both cases the update is exactly tuned to increase $y_j \cdot (\mathbf{w}^\top\mathbf{x}_j)$. We never update on a correctly-classified point — only when we make a mistake.
The current prediction is $\text{sign}(\mathbf{0}^\top\mathbf{x}_1) = \text{sign}(0)$. By the algorithm convention, $\text{sign}(0) = +1$ or $-1$ depending on tie-breaking; if we treat $0$ as misclassified for $y_1 = +1$ (since the algorithm says ŷⱼ ≠ yⱼ triggers an update — and many implementations treat sign(0) as 0 or as -1), we apply the rule. Update: $\mathbf{w}' = \mathbf{w} + y_1 \mathbf{x}_1 = (0,0) + 1 \cdot (2,1) = \mathbf{(2,1)}$.
Check: now $\mathbf{w}'^\top\mathbf{x}_1 = 2\cdot 2 + 1 \cdot 1 = 5 > 0$, so $\hat{y}_1 = +1$ — correct!
Two reasons. First, philosophically: if the current $\mathbf{w}$ already gets a point right, perturbing it might make things worse. Second, algorithmically: the convergence proof (next section) crucially relies on the fact that the algorithm only takes a step on a misclassified point, because that's what makes the inequality $-2y(\mathbf{w}^\top\mathbf{x}) \geq 0$ available — which is needed to bound how fast $\|\mathbf{w}\|^2$ can grow.
Before we can prove the algorithm converges, we need one geometric fact: how far is a given point from a given hyperplane? The answer is a single tidy formula, and the proof is two lines of algebra.
Let $b \in \mathbb{R}$, $\mathbf{w} \in \mathbb{R}^d$, and define the hyperplane
$$H_{\mathbf{w}, b} \;=\; \{\mathbf{x} : \mathbf{w}^\top\mathbf{x} + b = 0\}.$$The (perpendicular) distance from a point $\mathbf{x}_0 \in \mathbb{R}^d$ to $H$ is
$$\boxed{\;\text{dist}(H, \mathbf{x}_0) \;=\; \frac{|\mathbf{w}^\top\mathbf{x}_0 + b|}{\|\mathbf{w}\|}\;}$$Consider the line through $\mathbf{x}_0$ in the direction of $\mathbf{w}$ (which is the normal to $H$):
$$L \;=\; \{\mathbf{x}_0 + t\mathbf{w} : t \in \mathbb{R}\} \;\subseteq\; \mathbb{R}^d.$$This line $L$ meets the hyperplane $H$ at a single point — call it $\mathbf{x}'$ — at parameter $t = t'$. To find $t'$, we plug $\mathbf{x}_0 + t\mathbf{w}$ into the hyperplane equation:
$$\mathbf{w}^\top(\mathbf{x}_0 + t\mathbf{w}) + b = 0 \;\;\Longleftrightarrow\;\; t\|\mathbf{w}\|^2 = -(\mathbf{w}^\top\mathbf{x}_0 + b).$$Solving:
$$t' \;=\; -\frac{\mathbf{w}^\top\mathbf{x}_0 + b}{\|\mathbf{w}\|^2}.$$Now the distance is just $\|\mathbf{x}' - \mathbf{x}_0\|$:
$$\text{dist}(H, \mathbf{x}_0) \;=\; \|\mathbf{x}' - \mathbf{x}_0\| \;=\; \|\mathbf{x}_0 + t'\mathbf{w} - \mathbf{x}_0\| \;=\; |t'|\cdot\|\mathbf{w}\| \;=\; \frac{|\mathbf{w}^\top\mathbf{x}_0 + b|}{\|\mathbf{w}\|^2} \cdot \|\mathbf{w}\| \;=\; \frac{|\mathbf{w}^\top\mathbf{x}_0 + b|}{\|\mathbf{w}\|}. \quad\blacksquare$$The numerator $|\mathbf{w}^\top\mathbf{x}_0 + b|$ measures how strongly the point activates the hyperplane (the absolute activation potential). Dividing by $\|\mathbf{w}\|$ normalises the measurement — without it, just doubling all the weights would double the "score" without the point actually moving. So the distance formula is "raw activation, scaled to physical units."
Plug in $\mathbf{x}_0 = (0,0)$:
$$\text{dist} = \frac{|3(0) + 4(0) - 5|}{\sqrt{3^2 + 4^2}} = \frac{|-5|}{\sqrt{25}} = \frac{5}{5} = 1.$$So the origin sits exactly 1 unit away from the line $3x_1 + 4x_2 = 5$.
Numerator scales by 10, denominator scales by 10 (since $\|10\mathbf{w}\| = 10\|\mathbf{w}\|$), so the distance is unchanged. This is good — the hyperplane $\{\mathbf{w}^\top\mathbf{x} + b = 0\}$ and $\{10\mathbf{w}^\top\mathbf{x} + 10b = 0\}$ are the same set of points, so they should give the same distance. The formula respects this geometric invariance.
This is the punchline of the perceptron story. If the data is linearly separable with margin γ, the perceptron makes at most 1/γ² mistakes. The algorithm cannot get stuck: it always halts, and the bound on mistakes depends only on the geometry of the data — not on its size, dimension, or order.
This theorem is due to Albert Novikoff (1962) and is stated and proved entirely in the lecturer's handwritten annotations on slides 23–24. It's a foundational result in learning theory, and it's the kind of thing that absolutely could appear on the exam — the statement, the proof technique, and the role of every assumption.
Suppose the data is linearly separable. Among all separating hyperplanes, pick one with weights $\mathbf{w}^*$ that achieves $\|\mathbf{w}^*\| = 1$ (we can always rescale). The margin of the data with respect to $\mathbf{w}^*$ is
$$\gamma \;=\; \min_{(\mathbf{x}_i, y_i)} |\mathbf{x}_i^\top \mathbf{w}^*|.$$i.e., the distance from the closest training point to the optimal separating hyperplane (since $\|\mathbf{w}^*\|=1$, the absolute activation equals the perpendicular distance — see Part V).
Under the assumptions above, the Perceptron Algorithm makes at most
$$m \;\leq\; \frac{1}{\gamma^2}$$misclassifications (i.e., at most $1/\gamma^2$ updates) before halting with a separator that classifies all training data correctly.
The proof is a beautiful sandwich argument. We'll bound $\mathbf{w}^{(m)\top}\mathbf{w}^*$ from below (it grows) and $\|\mathbf{w}^{(m)}\|$ from above (it doesn't grow too fast); together, these force $m$ to be small.
Notation: at step $k$, we make a mistake on point $(\mathbf{x}, y)$ (dropping the index for clarity) and update $\mathbf{w}' \leftarrow \mathbf{w} + y\mathbf{x}$. Two facts about the misclassified point:
So after $m$ updates, starting from $\mathbf{w}^{(0)} = \mathbf{0}$:
$$\mathbf{w}^{(m)\top}\mathbf{w}^* \;\geq\; m\gamma. \qquad (\star)$$The middle term is $\leq 0$ (by fact (i): $y(\mathbf{w}^\top\mathbf{x}) \leq 0$, so $2y(\mathbf{w}^\top\mathbf{x}) \leq 0$). The last term is $y^2 \|\mathbf{x}\|^2 = \|\mathbf{x}\|^2 \leq 1$ (since $y \in \{-1,+1\}$ and $\|\mathbf{x}\| \leq 1$). So:
$$\|\mathbf{w}'\|^2 \;\leq\; \|\mathbf{w}\|^2 + 1.$$After $m$ updates, starting from $\mathbf{w}^{(0)} = \mathbf{0}$:
$$\|\mathbf{w}^{(m)}\|^2 \;\leq\; m. \qquad (\star\star)$$By Cauchy–Schwarz (recall: $\mathbf{a}^\top\mathbf{b} \leq \|\mathbf{a}\|\|\mathbf{b}\|$):
$$\mathbf{w}^{(m)\top}\mathbf{w}^* \;\leq\; \|\mathbf{w}^{(m)}\| \cdot \|\mathbf{w}^*\| \;=\; \|\mathbf{w}^{(m)}\| \cdot 1 \;\leq\; \sqrt{m}.$$Combining with $(\star)$:
$$m\gamma \;\leq\; \mathbf{w}^{(m)\top}\mathbf{w}^* \;\leq\; \sqrt{m}.$$So $m\gamma \leq \sqrt{m}$, i.e., $m^2\gamma^2 \leq m$, i.e.,
$$\boxed{\;m \;\leq\; \frac{1}{\gamma^2}.\;}\qquad \blacksquare$$$\mathbf{w}^{(m)\top}\mathbf{w}^*$ is the alignment between the current weights and the optimal weights. Step 1 says: every mistake brings us at least γ closer in alignment. Step 2 says: but the weight vector can't grow too fast in length. Cauchy–Schwarz forces alignment to be bounded by length, so we run out of room. The bound depends only on γ — not on the number of training points, not on the dimension, not on the order of presentation. It's a "free lunch" guarantee.
If the data is not linearly separable, then there is no $\mathbf{w}^*$ with margin $\gamma > 0$. The bound $1/\gamma^2$ blows up, and indeed: the perceptron algorithm will not halt; it cycles forever, updating on points that lie on the wrong side. The convergence guarantee is conditional on linear separability — it gives no guarantees on noisy or overlapping data.
It would seem reasonable that learning in higher dimensions, or with more data, takes more mistakes — that's how it works for many ML algorithms. But Novikoff's bound says: the geometry of the worst-case point (its distance from the optimal separator) is all that matters. Adding more "easy" points doesn't slow convergence; what hurts is having a point sit very close to the boundary (small γ). This is the same intuition that motivates Support Vector Machines: maximise the margin to make learning robust.
In Step 2, when we said the middle term $2y(\mathbf{w}^\top\mathbf{x}) \leq 0$. That's exactly the misclassification condition: the prediction has the wrong sign relative to the label. Without this, we couldn't drop the middle term, and $\|\mathbf{w}'\|^2$ might grow much faster than $\|\mathbf{w}\|^2 + 1$ — destroying the bound. This is why the algorithm only updates on mistakes: the proof relies on it.
$m \leq 1/\gamma^2 = 1/(0.1)^2 = 1/0.01 = 100$ mistakes. With margin γ = 0.01, it would be $1/0.0001 = 10{,}000$ mistakes — so the algorithm gets very slow as the margin shrinks, even though it eventually halts.
Step 2 changes: $\|\mathbf{x}\|^2 \leq R^2$, so $\|\mathbf{w}^{(m)}\|^2 \leq mR^2$, hence $\|\mathbf{w}^{(m)}\| \leq R\sqrt{m}$. Then $m\gamma \leq R\sqrt{m}$, giving $m \leq R^2 / \gamma^2$. The general form of Novikoff's theorem is $\boxed{m \leq R^2/\gamma^2}$, and the "$\|\mathbf{x}\| \leq 1$" version is just the case $R = 1$.
Now we move from one neuron to many, organised in layers. Feed-forward is the recipe for taking an input and producing an output, layer by layer. It's just plugging numbers into formulas — but with bookkeeping.
A multi-layer perceptron has an input layer (the raw features), one or more hidden layers (each layer's outputs become the next layer's inputs), and an output layer (whose outputs are the predictions $\hat{y}$). Each connection has a weight, each neuron has a bias and an activation function. Feed-forward is the act of pushing inputs through the layers in order to compute the output.
For this section, we'll use the ReLU activation: $f(x) = \max(0, x)$. We'll see why ReLU (or some non-linear function) matters in Part VIII.
Let $x_1 = 2$, $x_2 = 3$, with weights:
(For simplicity all biases are zero in this example.)
Step 1. Compute the hidden activations.
$$\sigma(h_1) = \text{ReLU}(x_1 w_1 + x_2 w_2) = \text{ReLU}(2 \cdot 0.11 + 3 \cdot 0.12) = \text{ReLU}(0.22 + 0.36) = \text{ReLU}(0.58) = 0.58.$$ $$\sigma(h_2) = \text{ReLU}(x_1 w_3 + x_2 w_4) = \text{ReLU}(2 \cdot 0.21 + 3 \cdot 0.08) = \text{ReLU}(0.42 + 0.24) = \text{ReLU}(0.66) = 0.66.$$Step 2. Compute the output.
$$\hat{y} = \sigma(z) = \text{ReLU}(\sigma(h_1)\cdot w_5 + \sigma(h_2)\cdot w_6) = \text{ReLU}(0.58 \cdot 0.14 + 0.66 \cdot 0.15)$$ $$= \text{ReLU}(0.0812 + 0.099) = \text{ReLU}(0.1802) = 0.1802.$$For input $(2, 3)$ this network outputs $\hat{y} = 0.1802$. Notice that ReLU did nothing here — every pre-activation was already positive. ReLU only "kicks in" when an input is negative, sending it to zero.
Hidden layer:
$$\sigma(h_1) = \text{ReLU}((-2)(0.11) + 3(0.12)) = \text{ReLU}(-0.22 + 0.36) = \text{ReLU}(0.14) = 0.14.$$ $$\sigma(h_2) = \text{ReLU}((-2)(0.21) + 3(0.08)) = \text{ReLU}(-0.42 + 0.24) = \text{ReLU}(-0.18) = \mathbf{0}.$$$h_2$ has been zeroed by ReLU — the neuron didn't fire. Output layer:
$$\hat{y} = \text{ReLU}(0.14 \cdot 0.14 + 0 \cdot 0.15) = \text{ReLU}(0.0196) = 0.0196.$$So the prediction drops sharply when one of the hidden neurons goes silent.
Each layer $\ell$ has $d_{\ell-1} \cdot d_\ell$ weights connecting from the previous layer's $d_{\ell-1}$ neurons to its own $d_\ell$ neurons. Total:
$$\sum_{\ell=1}^{L+1} d_{\ell-1} \cdot d_\ell.$$For a 2–2–1 network: $2 \cdot 2 + 2 \cdot 1 = 4 + 2 = 6$ weights — exactly the six in our example.
Here's a striking fact: if every neuron uses a linear activation function, the entire network — no matter how many layers — collapses to a single linear neuron. Any expressive power has to come from non-linearity. Here's the algebra that proves it.
Take the same 2–2–1 network as before, with biases this time: $b_1$ and $b_2$ on the hidden neurons, $b_3$ on the output neuron, and a linear activation (i.e., $f(x) = x$). The output is
$$\hat{y} = h_1 w_5 + h_2 w_6 - b_3.$$Now substitute $h_1 = x_1 w_1 + x_2 w_2 - b_1$ and $h_2 = x_1 w_3 + x_2 w_4 - b_2$:
$$\hat{y} = (x_1 w_1 + x_2 w_2 - b_1) w_5 + (x_1 w_3 + x_2 w_4 - b_2) w_6 - b_3.$$Distribute everything:
$$\hat{y} = x_1 w_1 w_5 + x_2 w_2 w_5 - b_1 w_5 + x_1 w_3 w_6 + x_2 w_4 w_6 - b_2 w_6 - b_3.$$Group by $x_1$, $x_2$, and constants:
$$\hat{y} = x_1(w_1 w_5 + w_3 w_6) + x_2(w_2 w_5 + w_4 w_6) - (b_1 w_5 + b_2 w_6 + b_3).$$Define new weights and a new bias:
$$w' = w_1 w_5 + w_3 w_6, \quad w'' = w_2 w_5 + w_4 w_6, \quad b' = b_1 w_5 + b_2 w_6 + b_3.$$Then:
$$\boxed{\;\hat{y} = x_1 w' + x_2 w'' - b'.\;}$$This is exactly the formula for a single perceptron with weights $(w', w'')$ and bias $b'$. The hidden layer has vanished into the equivalent single neuron.
Every neural network with linear activation functions can be replaced by a single neuron with the same input/output behaviour, no matter how many hidden layers it has.
Each layer of a linear network is a matrix multiplication: $\mathbf{h}^{(\ell)} = W^{(\ell)} \mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}$. Composing $L$ such linear maps gives $\mathbf{h}^{(L)} = W^{(L)} W^{(L-1)} \cdots W^{(1)} \mathbf{x} + (\text{some bias})$, and the product of matrices is just another matrix. So the whole stack has the form $W_{\text{eff}}\mathbf{x} + \mathbf{b}_{\text{eff}}$ — one linear map. Depth gains you nothing without non-linearity.
The identity function $f(x) = x$ is linear. So is $f(x) = 2x$, $f(x) = -x$, and $f(x) = ax + c$ for any constants $a, c$. The linear-collapse argument applies to any of these. Only genuinely non-linear functions — sigmoid, tanh, ReLU, leaky ReLU, swish — give multi-layer networks expressive power beyond a single neuron.
Yes — exactly the same argument applies. Each layer is a linear map $\mathbf{h}^{(\ell)} = W^{(\ell)}\mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}$, and the composition of any number of linear maps is still a linear map. The 2–10–10–10–1 network with linear activations is mathematically identical to a 2–1 network (one neuron) with appropriate $w'$, $w''$, $b'$. The 30 hidden neurons buy you nothing.
Because ReLU is not a single linear function: it's $0$ on negatives and identity on positives. The "kink" at zero means that for different inputs, different sets of neurons fire vs. don't fire — and the effective linear map of the whole network changes depending on the input. Composing two ReLU layers gives a piecewise-linear function with many "pieces" (each corresponding to a different combination of which neurons are active), and the number of pieces can be exponentially large in the depth. That non-linearity at the kink is enough to give the network all its expressive power.
There are six activation functions on the slides, each with a personality. Here they are, ranked by their pros, cons, and a one-line summary of when you'd use them.
$$f(x) = \begin{cases} 1 & x \geq 0 \\ 0 & x < 0 \end{cases}$$
Advantages:
Disadvantages:
$$f(x) = \frac{1}{1 + e^{-x}}.$$
Advantages:
Disadvantages:
$$f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}.$$
Like sigmoid, but its outputs sit in $[-1, +1]$ instead of $[0, 1]$. So tanh is zero-centred on the y-axis.
Why this helps: if all hidden activations are positive (as with sigmoid), the gradient updates for the next layer's weights all share a sign — they all push up, or all push down — which slows learning. Centring around zero means activations can be positive or negative, and gradient updates can move in either direction independently.
$$f(x) = \max(0, x).$$
Advantages:
Disadvantages:
$$f(x) = \max(0.1\,x,\, x).$$
Same as ReLU but with a small slope on the negative side (here, 0.1). This means even on negatives, the gradient is non-zero (it's 0.1), so a "dying" neuron can still receive update signals and recover.
Advantages:
Disadvantages:
$$f(x) = \frac{x}{1 + e^{-x}} \;=\; x \cdot \text{sigmoid}(x).$$
A smooth, non-monotonic function that looks similar to ReLU but with a small dip below zero before flattening out. Empirically, Swish performs a tiny bit better than ReLU on many tasks (introduced by Google researchers via neural architecture search).
| Function | Formula | Range | Smooth? | Risks |
|---|---|---|---|---|
| Binary | 1 if x≥0, else 0 | {0, 1} | No | No gradient |
| Sigmoid | 1/(1+e⁻ˣ) | [0, 1] | Yes | Vanishing gradient |
| tanh | (eˣ−e⁻ˣ)/(eˣ+e⁻ˣ) | [−1, 1] | Yes | Vanishing gradient |
| ReLU | max(0, x) | [0, ∞) | No (kink at 0) | Dying ReLU |
| Leaky ReLU | max(0.1x, x) | (−∞, ∞) | No (kink at 0) | Fixed leak coef. |
| Swish | x·sigmoid(x) | (≈−0.28, ∞) | Yes | Computationally heavier |
The binary step cannot, because its derivative is zero almost everywhere (and undefined at the threshold). Backpropagation works by chaining derivatives, so a zero derivative kills the gradient signal — no learning happens. All the others (sigmoid, tanh, ReLU, leaky ReLU, swish) have well-defined or sub-differentiable gradients that backprop can use.
At $x = \pm 8$, sigmoid is essentially saturated — $\sigma(8) \approx 0.9997$, $\sigma(-8) \approx 0.00033$. The derivative $\sigma(x)(1 - \sigma(x))$ is therefore extremely tiny ($\approx 0.0003$ at $x=8$). During backpropagation, this tiny derivative gets multiplied through the gradient chain, so the weights connected to this neuron receive almost no update. This is the vanishing-gradient problem, and it's why deep networks trained with sigmoid throughout were notoriously hard to train before ReLU became popular.