KCL · Machine Learning · Week 13
From a three-neuron toy network to deep convolutional and residual architectures — built up one piecewise-linear joint at a time, then trained by maximum likelihood.
This week is about discriminative models: networks that learn a function mapping an input directly to an output or to a distribution over outputs. The two reference texts are Understanding Deep Learning (Simon J. D. Prince, 2023) and Deep Learning: Foundations and Concepts (Christopher M. Bishop, 2023).
Part I
A neural network is just a function with knobs. Turn the knobs (the parameters) and you change which function it computes. The genius of the design is that, with the right activation, even a one-layer network can bend a straight line into almost any shape you like.
Strip away the mystique and a neural network is a function that maps inputs to outputs — nothing more exotic than that. What makes it useful is that the function has free parameters we can tune, and a clever internal structure that lets a handful of simple parts combine into very flexible shapes. We start with the simplest possible case and build up.
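Consider a network with one hidden layer and three neurons (also called hidden units) that maps a single number $x$ to a single number $y$. It has 10 parameters, collected into the vector $\phi = \{\phi_0, \phi_1, \phi_2, \phi_3, \theta_{10}, \theta_{11}, \theta_{20}, \theta_{21}, \theta_{30}, \theta_{31}\}$, and is defined by Equation 1:
$$y = f[x, \phi] = \phi_0 + \phi_1\, a[\theta_{10} + \theta_{11} x] + \phi_2\, a[\theta_{20} + \theta_{21} x] + \phi_3\, a[\theta_{30} + \theta_{31} x] \tag{1}$$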
The computation breaks into three parts, exactly as the slide lists them:
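1. Compute three linear functions of the input: $\theta_{10} + \theta_{11} x$, $\theta_{20} + \theta_{21} x$, $\theta_{30} + \theta_{31} x$.
2. Pass each through an activation function $a[\cdot]$.
3. Weight the three resulting activations by $\phi_1$, $\phi_2$, $\phi_3$, sum them, and add an offset $\phi_0$.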
Each hidden unit $d$ first draws a straight line through the input (slope $\theta_{d1}$, intercept $\theta_{d0}$), then bends it with the activation. The output is a weighted blend of these bent lines plus a constant. "Parameters" are the numbers we learn; everything else is fixed structure.
To finish the description we must pick the activation $a[\cdot]$. There are many choices, but the most common is the rectified linear unit (ReLU), Equation 2:
$$a[z] = \mathrm{ReLU}[z] = \begin{cases} 0 & z < 0 \\ z & z \ge 0 \end{cases} \tag{2}$$
Because each hidden unit contributes one kink, the function of Equation 1 is a continuous piecewise-linear function with up to four linear regions (three units → four pieces).
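The three computation steps and the ReLU can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the course: the function names `relu` and `shallow_net`, the layout of `theta` and `phi`, and the parameter values are all assumptions chosen for readability.

```python
from typing import List, Tuple


def relu(z: float) -> float:
    """ReLU activation: 0 for negative input, the input itself otherwise."""
    return max(0.0, z)


def shallow_net(x: float, theta: List[Tuple[float, float]], phi: List[float]) -> float:
    """One-hidden-layer network with scalar input and scalar output.

    theta holds one (intercept, slope) pair per hidden unit;
    phi holds the output offset followed by one weight per hidden unit.
    """
    # Step 1: one linear function of the input per hidden unit.
    pre_activations = [intercept + slope * x for intercept, slope in theta]
    # Step 2: pass each linear function through the activation.
    hidden = [relu(z) for z in pre_activations]
    # Step 3: weight the activations, sum them, and add the offset.
    return phi[0] + sum(w * h for w, h in zip(phi[1:], hidden))


# Made-up parameter values: 6 hidden parameters + 4 output parameters = 10.
theta = [(-0.5, 1.0), (-1.0, 1.0), (2.0, -1.0)]
phi = [0.5, 1.0, -2.0, 1.0]

for x in (0.0, 1.0, 2.0, 3.0):
    print(x, shallow_net(x, theta, phi))
```

With these values the printed outputs are 2.5, 2.0, 0.0 and -1.0, so the slope between consecutive sample points changes, which is the piecewise-linear behaviour described above.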
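Introduce the hidden units as intermediate quantities (Equation 3):
$$h_1 = a[\theta_{10} + \theta_{11} x], \qquad h_2 = a[\theta_{20} + \theta_{21} x], \qquad h_3 = a[\theta_{30} + \theta_{31} x] \tag{3}$$
Then $y$ becomes a tidy linear function of them (Equation 4):
$$y = \phi_0 + \phi_1 h_1 + \phi_2 h_2 + \phi_3 h_3 \tag{4}$$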
Each hidden unit is a line clipped below zero by ReLU. The point where each unit crosses zero becomes a "joint" in the final output: a place where the slope is allowed to change. The three clipped lines are scaled by $\phi_1$, $\phi_2$, $\phi_3$ (which can flip or stretch them), summed, and lifted by $\phi_0$, which sets the overall height. Stack the pieces and you get the kinked curve on slide 8.
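To make the joints concrete, the sketch below computes the hidden units separately, locates where each one crosses zero, and reads off the output slope in each region as the sum of weight times slope over the units that are currently active. The helper names (`hidden_units`, `output`, `joints`, `slope_at`) and the parameter values are hypothetical, picked only for illustration.

```python
from typing import List, Tuple

Theta = List[Tuple[float, float]]  # (intercept, slope) per hidden unit


def hidden_units(x: float, theta: Theta) -> List[float]:
    """Clipped lines: each hidden unit is a ReLU applied to a line in x."""
    return [max(0.0, intercept + slope * x) for intercept, slope in theta]


def output(hidden: List[float], phi: List[float]) -> float:
    """Linear function of the hidden units: offset plus weighted sum."""
    return phi[0] + sum(w * h for w, h in zip(phi[1:], hidden))


def joints(theta: Theta) -> List[float]:
    """Inputs where a unit's line crosses zero (units with zero slope have none)."""
    return sorted(-intercept / slope for intercept, slope in theta if slope != 0.0)


def slope_at(x: float, theta: Theta, phi: List[float]) -> float:
    """Output slope at x: only units with a positive pre-activation contribute."""
    return sum(
        w * slope
        for w, (intercept, slope) in zip(phi[1:], theta)
        if intercept + slope * x > 0.0
    )


# Made-up parameter values for illustration.
theta = [(-0.5, 1.0), (-1.0, 1.0), (2.0, -1.0)]
phi = [0.5, 1.0, -2.0, 1.0]

print(joints(theta))                                        # joint locations
print([slope_at(x, theta, phi) for x in (0.0, 0.75, 1.5, 3.0)])  # one slope per region
```

For these values the joints sit at 0.5, 1.0 and 2.0, and the four regions have slopes -1.0, 0.0, -2.0 and -1.0.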
10 parameters: four output parameters (one offset + one weight per unit) and six hidden parameters (an intercept and a slope for each of the three units).
Each ReLU contributes exactly one "joint" (where its argument crosses zero), and between joints the function is a sum of linear pieces, hence linear. Three joints partition the input line into four intervals, so the curve has at most four straight segments.
$a[z] = 0$ for $z < 0$ and $a[z] = z$ for $z \ge 0$. Without the nonlinearity, a sum of linear functions is just another linear function (a single straight line); the ReLU kink is what lets the pieces have different slopes in different regions, so the composite can approximate curves.
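The collapse to a single line can be checked directly. In this sketch the same network is evaluated once with an identity "activation" and once with ReLU; the function names and parameter values are assumptions made for the example.

```python
from typing import Callable, List, Tuple

Theta = List[Tuple[float, float]]  # (intercept, slope) per hidden unit


def identity(z: float) -> float:
    """No nonlinearity at all."""
    return z


def relu(z: float) -> float:
    return max(0.0, z)


def net(x: float, theta: Theta, phi: List[float], activation: Callable[[float], float]) -> float:
    """Shallow network with a pluggable activation."""
    return phi[0] + sum(
        w * activation(intercept + slope * x)
        for w, (intercept, slope) in zip(phi[1:], theta)
    )


def collapsed_line(theta: Theta, phi: List[float]) -> Tuple[float, float]:
    """Intercept and slope of the single line computed when the activation is the identity."""
    intercept = phi[0] + sum(w * t0 for w, (t0, _) in zip(phi[1:], theta))
    slope = sum(w * t1 for w, (_, t1) in zip(phi[1:], theta))
    return intercept, slope


# Made-up parameter values for illustration.
theta = [(-0.5, 1.0), (-1.0, 1.0), (2.0, -1.0)]
phi = [0.5, 1.0, -2.0, 1.0]

line_intercept, line_slope = collapsed_line(theta, phi)
for x in (0.0, 1.0, 2.0):
    # Identity activation matches the single line; ReLU does not.
    print(x, net(x, theta, phi, identity), line_intercept + line_slope * x, net(x, theta, phi, relu))
```

With the identity activation every evaluation lands on the line with intercept 4.0 and slope -2.0, while the ReLU version changes slope between the sample points.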
For the example network (1 input, 1 output, ReLU, $D$ hidden units): the number of hidden units determines the network's capacity. The key fact (slide 9):
With $D$ hidden units and ReLU, $f[x, \phi]$ is piecewise-linear with at most $D + 1$ linear regions. More hidden units enable approximation of more complex functions. With adequate capacity, the network can describe any continuous 1D function on a compact subset of the real line to arbitrary precision.
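One way to see the capacity claim at work is to build, by hand rather than by training, a ReLU network whose joints sit at evenly spaced points and whose output weights are the changes in slope needed to interpolate a target function. The construction, the helper names and the choice of `math.sin` as target are assumptions for this sketch; it shows that the error shrinks as hidden units are added, and is not a proof of the general statement.

```python
import math
from typing import Callable, List, Tuple

Theta = List[Tuple[float, float]]  # (intercept, slope) per hidden unit


def relu(z: float) -> float:
    return max(0.0, z)


def build_interpolating_net(
    target: Callable[[float], float], lo: float, hi: float, num_units: int
) -> Tuple[Theta, List[float]]:
    """Hand-set parameters so the network linearly interpolates target on [lo, hi]."""
    width = (hi - lo) / num_units
    knots = [lo + d * width for d in range(num_units + 1)]
    segment_slopes = [
        (target(knots[d + 1]) - target(knots[d])) / width for d in range(num_units)
    ]
    # Unit d computes relu(x - knots[d]), so it switches on at knots[d].
    theta = [(-knots[d], 1.0) for d in range(num_units)]
    # Each output weight is the change in slope required at that unit's joint.
    weights = [segment_slopes[0]] + [
        segment_slopes[d] - segment_slopes[d - 1] for d in range(1, num_units)
    ]
    phi = [target(lo)] + weights
    return theta, phi


def net(x: float, theta: Theta, phi: List[float]) -> float:
    return phi[0] + sum(w * relu(t0 + t1 * x) for w, (t0, t1) in zip(phi[1:], theta))


def max_error(target: Callable[[float], float], lo: float, hi: float,
              num_units: int, num_samples: int = 1000) -> float:
    """Largest absolute error on an evenly spaced grid over [lo, hi]."""
    theta, phi = build_interpolating_net(target, lo, hi, num_units)
    xs = [lo + (hi - lo) * i / num_samples for i in range(num_samples + 1)]
    return max(abs(net(x, theta, phi) - target(x)) for x in xs)


errors = {d: max_error(math.sin, 0.0, 2.0 * math.pi, d) for d in (4, 16, 64)}
print(errors)
```

Because the first joint sits on the left edge of the interval, a network with $D$ units built this way uses $D$ linear pieces inside the interval; the grid-based error is only an estimate of the true maximum error.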
At most 51 linear regions. The rule for a 1D input is $D + 1$ regions for $D$ hidden units, so $D = 50$ gives 51: each unit adds at most one joint, and $D$ joints cut the input line into $D + 1$ pieces. (This bound is specific to one input dimension; for higher-dimensional inputs the region count grows differently.)
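The words "at most" matter: joints can coincide, and a unit with zero output weight or zero slope adds no visible joint. The sketch below counts the regions a given 1D ReLU network actually has by summing the slope change at each distinct joint. The function names are hypothetical, and the exact floating-point comparisons are a simplification that a careful implementation would replace with a tolerance.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Theta = List[Tuple[float, float]]  # (intercept, slope) per hidden unit


def max_regions(num_hidden_units: int) -> int:
    """Upper bound for a 1D input: D joints cut the line into D + 1 pieces."""
    return num_hidden_units + 1


def count_regions(theta: Theta, phi: List[float]) -> int:
    """Number of linear regions the network actually has on the real line."""
    slope_change: Dict[float, float] = defaultdict(float)
    for weight, (intercept, slope) in zip(phi[1:], theta):
        if slope == 0.0:
            continue  # constant pre-activation never crosses zero, so no joint
        joint = -intercept / slope
        # Crossing the joint left to right changes the output slope by weight * |slope|.
        slope_change[joint] += weight * abs(slope)
    # Only joints where the total slope change is nonzero separate two regions.
    return 1 + sum(1 for change in slope_change.values() if change != 0.0)


print(max_regions(50))
print(count_regions([(-0.5, 1.0), (-1.0, 1.0), (2.0, -1.0)], [0.5, 1.0, -2.0, 1.0]))
print(count_regions([(-1.0, 1.0), (-1.0, 1.0), (2.0, -1.0)], [0.0, 1.0, -1.0, 1.0]))
```

The first network reaches the bound of four regions; in the second, two units share a joint and their slope changes cancel, leaving only two regions.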
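To produce several outputs, simply use a different linear combination of the same hidden units for each output. A network with scalar input $x$, four hidden units $h_d = a[\theta_{d0} + \theta_{d1} x]$ for $d = 1, \dots, 4$, and 2D output $y = [y_1, y_2]^T$ (Equation 5):
$$y_1 = \phi_{10} + \phi_{11} h_1 + \phi_{12} h_2 + \phi_{13} h_3 + \phi_{14} h_4$$
$$y_2 = \phi_{20} + \phi_{21} h_1 + \phi_{22} h_2 + \phi_{23} h_3 + \phi_{24} h_4 \tag{5}$$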
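A two-output network of this kind can be sketched by giving each output its own row of offset and weights while computing the hidden units only once. The function name `two_output_net`, the nested-list layout of `phi` and the parameter values are assumptions for illustration.

```python
from typing import List, Tuple

Theta = List[Tuple[float, float]]  # (intercept, slope) per hidden unit


def two_output_net(x: float, theta: Theta, phi: List[List[float]]) -> List[float]:
    """Scalar input, shared ReLU hidden units, one linear combination per output.

    Each row of phi is [offset, weight_1, ..., weight_D] for one output.
    """
    # The hidden units are computed once and shared by every output.
    hidden = [max(0.0, intercept + slope * x) for intercept, slope in theta]
    return [row[0] + sum(w * h for w, h in zip(row[1:], hidden)) for row in phi]


# Made-up parameter values: four hidden units, two outputs.
theta = [(0.5, 1.0), (-1.0, 1.0), (2.0, -1.0), (-3.0, 1.0)]
phi = [
    [0.0, 1.0, -1.0, 0.5, 2.0],   # offset and weights for the first output
    [1.0, -1.0, 2.0, 0.0, 1.0],   # offset and weights for the second output
]

print(two_output_net(0.0, theta, phi))
print(two_output_net(4.0, theta, phi))
```

Because both outputs are built from the same hidden units, their joints can only fall at the same input locations (here -0.5, 1.0, 2.0 and 3.0); only the slopes and heights differ between the two outputs.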