Week 13 — Discriminative Neural Networks

This week is about discriminative models: networks that learn $Pr(\mathbf{y} \mid \mathbf{x})$, mapping an input $\mathbf{x}$ directly to an output $\mathbf{y}$ or a distribution over outputs. The two reference texts are Understanding Deep Learning (Simon J. D. Prince, 2023) and Deep Learning: Foundations and Concepts (Christopher M. Bishop, 2023).

Part I

Neural Networks — A Refresher

A neural network is just a function with knobs. Turn the knobs (the parameters) and you change which function it computes. The genius of the design is that, with the right activation, even a one-layer network can bend a straight line into almost any shape you like.

Strip away the mystique and a neural network is a function that maps inputs to outputs — nothing more exotic than that. What makes it useful is that the function has free parameters we can tune, and a clever internal structure that lets a handful of simple parts combine into very flexible shapes. We start with the simplest possible case and build up.

The single-layer network

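Consider a network with one layer and three neurons (also called hidden units) that maps a single number $x$ to a single number $y$. It has 10 parameters, collected into the vector $\boldsymbol{\phi} = \{\phi_0, \phi_1, \phi_2, \phi_3, \theta_{10}, \theta_{11}, \theta_{20}, \theta_{21}, \theta_{30}, \theta_{31}\}$, and is defined by Equation 1:

$$y = f[x, \boldsymbol{\phi}] = \phi_0 + \phi_1\, a[\theta_{10} + \theta_{11} x] + \phi_2\, a[\theta_{20} + \theta_{21} x] + \phi_3\, a[\theta_{30} + \theta_{31} x] \tag{1}$$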

Slide 5 — one scalar input, three hidden units, one scalar output. The θ weights feed the hidden units; the φ weights combine them.

The computation breaks into three parts, exactly as the slide lists them:

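1. Compute three linear functions of the input: $\theta_{10} + \theta_{11}x$, $\theta_{20} + \theta_{21}x$, $\theta_{30} + \theta_{31}x$.
2. Pass each through an activation function $a[\cdot]$.
3. Weight the three resulting activations by $\phi_1, \phi_2, \phi_3$, sum them, and add an offset $\phi_0$.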

Definition · in plain English

Each hidden unit first draws a straight line through the input (slope $\theta_{d1}$, intercept $\theta_{d0}$ for unit $d$), then bends it with the activation. The output is a weighted blend of these bent lines plus a constant. "Parameters" are the numbers we learn; everything else is fixed structure.

The activation: ReLU

To finish the description we must pick the activation $a[\cdot]$. There are many choices, but the most common is the rectified linear unit (ReLU), Equation 2:

$$a[z] = \mathrm{ReLU}[z] = \begin{cases} 0 & z < 0 \\ z & z \ge 0 \end{cases} \tag{2}$$

Slide 6 — ReLU clips everything below zero and passes everything above unchanged. This single "kink" is what gives the network its bending power.

Because each hidden unit contributes one kink, the function of Equation 1 is a continuous piecewise-linear function with up to four linear regions (three units → four pieces).
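The three-step computation can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: the function names `relu` and `shallow_net` and every parameter value below are assumptions chosen for the example.

```python
def relu(z):
    """Rectified linear unit: clip negatives to zero, pass the rest unchanged."""
    return z if z > 0.0 else 0.0


def shallow_net(x, theta, phi):
    """One scalar input, len(theta) hidden units, one scalar output.

    theta: list of (intercept, slope) pairs, one per hidden unit.
    phi:   [offset, weight_1, ..., weight_D] used to combine the hidden units.
    """
    hidden = [relu(t0 + t1 * x) for (t0, t1) in theta]
    return phi[0] + sum(w * h for w, h in zip(phi[1:], hidden))


# Arbitrary illustrative parameters (assumed values): 6 hidden + 4 output = 10.
theta = [(-1.0, 1.0), (0.0, 1.0), (1.0, -1.0)]
phi = [0.5, 2.0, -1.0, 1.0]

n_params = 2 * len(theta) + len(phi)   # 10
y0 = shallow_net(0.0, theta, phi)      # only the third unit is active -> 1.5
y2 = shallow_net(2.0, theta, phi)      # the first two units are active -> 0.5
```

Changing any entry of `theta` or `phi` changes which piecewise-linear function is computed, which is the "knobs" picture from the start of this part.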

Why it produces that family — the joints

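Introduce the hidden units as intermediate quantities (Equation 3):

$$h_1 = a[\theta_{10} + \theta_{11}x], \quad h_2 = a[\theta_{20} + \theta_{21}x], \quad h_3 = a[\theta_{30} + \theta_{31}x] \tag{3}$$

Then $y$ becomes a tidy linear function of them (Equation 4):

$$y = \phi_0 + \phi_1 h_1 + \phi_2 h_2 + \phi_3 h_3 \tag{4}$$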

Intuition

Each hidden unit is a line clipped below zero by ReLU. The point where each unit crosses zero becomes a "joint" in the final output, a place where the slope is allowed to change. The three clipped lines are scaled by $\phi_1, \phi_2, \phi_3$ (which can flip or stretch them), summed, and lifted by $\phi_0$, which sets the overall height. Stack the pieces and you get the kinked curve on slide 8.

Slides 7–8 — the joints (red dots) sit exactly where each hidden unit's pre-activation crosses zero. More units ⇒ more joints ⇒ more flexible curve.
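As a small sketch of where the joints fall: a unit with intercept `t0` and slope `t1` has pre-activation `t0 + t1 * x`, which crosses zero at `x = -t0 / t1`. The helper name `joint_positions` and the parameter values are assumptions made for illustration.

```python
def joint_positions(theta):
    """Sorted inputs at which each pre-activation t0 + t1 * x crosses zero.

    Units with zero slope never cross zero and so contribute no joint.
    """
    return sorted(-t0 / t1 for (t0, t1) in theta if t1 != 0.0)


# Assumed (intercept, slope) pairs for three hidden units.
theta = [(-1.0, 1.0), (0.5, 1.0), (2.0, -1.0)]

joints = joint_positions(theta)    # [-0.5, 1.0, 2.0]
max_regions = len(joints) + 1      # three joints cut the line into four pieces
```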

Q: The three-unit network of Equation 1 has how many parameters, and what are they? Why is the output piecewise-linear with at most four regions?

10 parameters: four output parameters (one offset + one weight per unit) and six hidden parameters (an intercept and a slope for each of the three units).

Each ReLU contributes exactly one "joint" (where its argument crosses zero), and between joints the function is a sum of linear pieces, hence linear. Three joints partition the input line into four intervals, so the curve has at most four straight segments.

Q: Write the ReLU definition from memory and explain in one sentence why it, rather than the linear functions alone, gives the network its expressive power.

$a[z] = 0$ for $z < 0$ and $a[z] = z$ for $z \ge 0$. Without the nonlinearity, a sum of linear functions is just another linear function (a single straight line); the ReLU kink is what lets the pieces have different slopes in different regions, so the composite can approximate curves.

Capacity — more units, more regions

For the example network (1 input, 1 output, ReLU, $D$ hidden units): the number of hidden units determines the network's capacity. The key fact (slide 9):

Definition · capacity rule (1D input)

With $D$ hidden units and ReLU, $f[x, \boldsymbol{\phi}]$ is piecewise-linear with at most $D + 1$ linear regions. More hidden units enable approximation of more complex functions. With adequate capacity, the network can describe any continuous 1D function on a compact subset of the real line to arbitrary precision.

Slide 9 — the grey curve is the target; the coloured polyline is the network. As the number of linear regions grows from 5 → 10 → 20, the fit tightens.

Q: A single-layer ReLU network with one scalar input has 50 hidden units. What is the maximum number of linear regions its output can have? What is the general rule?

At most 51 linear regions. The rule for a 1D input is $D + 1$ regions for $D$ hidden units: each unit adds at most one joint, and $D$ joints cut the input line into $D + 1$ pieces. (This bound is specific to one input dimension; for higher-dimensional inputs the region count grows differently.)

Multivariate outputs

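To produce several outputs, simply use a different linear combination of the same hidden units for each output. A network with scalar input $x$, four hidden units $h_1, \dots, h_4$, and 2D output $\mathbf{y} = [y_1, y_2]^T$ (Equation 5):

$$h_d = a[\theta_{d0} + \theta_{d1}x], \quad d = 1, \dots, 4$$
$$y_1 = \phi_{10} + \phi_{11}h_1 + \phi_{12}h_2 + \phi_{13}h_3 + \phi_{14}h_4$$
$$y_2 = \phi_{20} + \phi_{21}h_1 + \phi_{22}h_2 + \phi_{23}h_3 + \phi_{24}h_4 \tag{5}$$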

The bottleneck of (a) throws information away; the full matrix of (b) does not.

Naming conventions

We denote the number of layers by $K$ and the per-layer widths by $D_1, D_2, \dots, D_K$. These are hyperparameters. For fixed hyperparameters (e.g. $K = 2$ with each $D_k = 3$) the model describes a family of functions; the weights pick out one particular function. Considering the hyperparameters too, a deep network is a "family of families" of functions.

Matrix notation

where is applied element-wise to its vector input, and are bias vectors, and is the full inter-layer weight matrix (slides 23–24).

The general formulation

with (input, ), (output), (the parameters), layers, intermediate layers , and output function .

Universal approximation — do we even need depth?

So why deep, and why now?

Part III

Loss Functions via Maximum Likelihood

Every loss in this course comes from one recipe: decide what kind of thing you are predicting, pick a probability distribution for it, let the network output the distribution's parameters, and then minimise the negative log-likelihood of the training data. Regression, binary, and multiclass classification all fall out of this single idea.

Training means searching for the parameters $\boldsymbol{\phi}$ that produce the best mapping from input to output for the task at hand. "Best" needs a definition, and that definition is the loss function (or cost function) $L[\boldsymbol{\phi}]$: a single number measuring the mismatch between the model's predictions $f[\mathbf{x}_i, \boldsymbol{\phi}]$ and the ground-truth outputs $\mathbf{y}_i$. To train, we need a training dataset $\{\mathbf{x}_i, \mathbf{y}_i\}$ of input/output pairs and such a loss.

Computing a distribution over outputs

How can a deterministic model produce a probability distribution? Two steps:

1. Choose a parametric distribution $Pr(\mathbf{y} \mid \boldsymbol{\theta})$ defined on the output domain.
2. Use the network to compute one or more of that distribution's parameters $\boldsymbol{\theta}$.

Example: if $y \in \mathbb{R}$, use the univariate normal with $\boldsymbol{\theta} = \{\mu, \sigma^2\}$. The network might predict the mean $\mu$, treating the variance $\sigma^2$ as an unknown constant.

Slide 30: the choice of distribution matches the output type (continuous values, discrete classes, non-negative counts, or circular directions). The network predicts each distribution's parameters from the input $\mathbf{x}$.

The recipe (four steps)

Definition · maximum-likelihood loss recipe

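1. Choose a distribution $Pr(\mathbf{y} \mid \boldsymbol{\theta})$ over the prediction domain, with parameters $\boldsymbol{\theta}$.
2. Set the model to predict those parameters: $\boldsymbol{\theta} = f[\mathbf{x}, \boldsymbol{\phi}]$, so $Pr(\mathbf{y} \mid \boldsymbol{\theta}) = Pr(\mathbf{y} \mid f[\mathbf{x}, \boldsymbol{\phi}])$.
3. Find the parameters $\hat{\boldsymbol{\phi}}$ that minimise the negative log-likelihood over the training pairs $\{\mathbf{x}_i, \mathbf{y}_i\}_{i=1}^{I}$:

$$\hat{\boldsymbol{\phi}} = \underset{\boldsymbol{\phi}}{\operatorname{argmin}} \Big[ -\sum_{i=1}^{I} \log Pr(\mathbf{y}_i \mid f[\mathbf{x}_i, \boldsymbol{\phi}]) \Big]$$

4. At inference, return the full distribution $Pr(\mathbf{y} \mid f[\mathbf{x}, \hat{\boldsymbol{\phi}}])$ or its maximum (a point estimate).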

Intuition · why negative log-likelihood

"Most probable data" means maximising the likelihood $\prod_{i=1}^{I} Pr(\mathbf{y}_i \mid f[\mathbf{x}_i, \boldsymbol{\phi}])$. Products of small probabilities underflow and are awkward to differentiate, so we take the log (turning the product into a sum) and negate it (turning maximisation into minimisation). Minimising the NLL is identical to maximising the likelihood: it is the same objective wearing optimisation-friendly clothes.

Q: State the four-step recipe for building a loss function by maximum likelihood. Why do we minimise the negative log-likelihood rather than maximise the likelihood directly?

(1) Choose a distribution $Pr(\mathbf{y} \mid \boldsymbol{\theta})$ over the output domain. (2) Let the network predict its parameters, $\boldsymbol{\theta} = f[\mathbf{x}, \boldsymbol{\phi}]$. (3) Minimise $-\sum_{i=1}^{I} \log Pr(\mathbf{y}_i \mid f[\mathbf{x}_i, \boldsymbol{\phi}])$. (4) At test time return the distribution or its mode.

The likelihood is a product over data points, which underflows numerically and is hard to differentiate. Taking the log converts it to a sum; negating converts maximisation to minimisation. The optimum is unchanged because $\log$ is monotonically increasing.
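The underflow argument is easy to see numerically. The sketch below assumes a hypothetical dataset of 1000 examples, each assigned likelihood 0.01 by the model; the numbers are made up purely to illustrate the floating-point behaviour.

```python
import math

# Assumed per-example likelihoods (illustrative values only).
probs = [0.01] * 1000

# Direct product: the true value is 1e-2000, far below the smallest float64.
likelihood = 1.0
for p in probs:
    likelihood *= p          # silently underflows to 0.0 along the way

# Negative log-likelihood: a sum of moderate numbers that stays finite.
nll = -sum(math.log(p) for p in probs)
```

After the loop `likelihood` is exactly `0.0`, so it carries no information about the parameters, while `nll` is about 4605 and remains usable as a loss.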

Case 1 — Univariate regression

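Goal: predict a single scalar $y \in \mathbb{R}$. Following the recipe, choose the univariate normal (defined on $\mathbb{R}$) and let the network compute the mean, $\mu = f[\mathbf{x}, \boldsymbol{\phi}]$:

$$Pr(y \mid f[\mathbf{x}, \boldsymbol{\phi}], \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left[ -\frac{(y - f[\mathbf{x}, \boldsymbol{\phi}])^2}{2\sigma^2} \right]$$

The negative log-likelihood loss is:

$$L[\boldsymbol{\phi}] = -\sum_{i=1}^{I} \log\!\left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left[ -\frac{(y_i - f[\mathbf{x}_i, \boldsymbol{\phi}])^2}{2\sigma^2} \right] \right]$$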

Worked example · NLL becomes least squares

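Expanding the log: $L[\boldsymbol{\phi}] = \sum_{i=1}^{I} \left( \tfrac{1}{2}\log(2\pi\sigma^2) + \frac{(y_i - f[\mathbf{x}_i, \boldsymbol{\phi}])^2}{2\sigma^2} \right)$. The first term is a constant in $\boldsymbol{\phi}$; the second is the squared error $(y_i - f[\mathbf{x}_i, \boldsymbol{\phi}])^2$ scaled by the constant $1/(2\sigma^2)$. So minimising the Gaussian NLL is exactly minimising the sum of squared errors: least-squares regression is maximum likelihood under a Gaussian with constant variance.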
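A quick numerical check of this equivalence, as a sketch: for any predictions, the Gaussian NLL minus the squared error divided by $2\sigma^2$ should equal the same constant. The targets, predictions, and variance below are assumed values, and the function names are hypothetical.

```python
import math


def gaussian_nll(ys, preds, sigma2):
    """Negative log-likelihood of ys under N(pred, sigma2), summed over points."""
    return sum(
        0.5 * math.log(2 * math.pi * sigma2) + (y - f) ** 2 / (2 * sigma2)
        for y, f in zip(ys, preds)
    )


def sum_squared_error(ys, preds):
    return sum((y - f) ** 2 for y, f in zip(ys, preds))


ys = [1.0, 2.0, 3.0]          # assumed targets
preds_a = [1.1, 1.9, 3.2]     # assumed predictions from one parameter setting
preds_b = [0.0, 0.0, 0.0]     # assumed predictions from a worse setting
sigma2 = 0.5                  # assumed constant variance

# The part of the NLL that does not depend on the predictions.
const = len(ys) * 0.5 * math.log(2 * math.pi * sigma2)

# NLL minus the scaled squared error leaves the same constant in both cases.
gap_a = gaussian_nll(ys, preds_a, sigma2) - sum_squared_error(ys, preds_a) / (2 * sigma2)
gap_b = gaussian_nll(ys, preds_b, sigma2) - sum_squared_error(ys, preds_b) / (2 * sigma2)
```

Because the two losses differ only by a constant offset and a positive scale, they rank every parameter setting identically and share the same minimiser.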

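Q: Show that the negative log-likelihood for univariate regression with a constant-variance Gaussian reduces to the squared-error loss.

Take $-\log$ of the Gaussian PDF for each point: $-\log Pr(y_i \mid f[\mathbf{x}_i, \boldsymbol{\phi}], \sigma^2) = \tfrac{1}{2}\log(2\pi\sigma^2) + \frac{(y_i - f[\mathbf{x}_i, \boldsymbol{\phi}])^2}{2\sigma^2}$. Summing over $i$, the terms $\tfrac{1}{2}\log(2\pi\sigma^2)$ and the factor $1/(2\sigma^2)$ are constants w.r.t. $\boldsymbol{\phi}$. Hence $\hat{\boldsymbol{\phi}} = \operatorname{argmin}_{\boldsymbol{\phi}} \sum_{i=1}^{I} (y_i - f[\mathbf{x}_i, \boldsymbol{\phi}])^2$, the least-squares objective.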

Case 2 — Binary classification

Goal: assign $\mathbf{x}$ to one of two classes $y \in \{0, 1\}$ (here $y$ is called a label). Examples: review positive ($y = 1$) vs. negative ($y = 0$); tumour present ($y = 1$) vs. absent ($y = 0$) from an MRI scan.

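Choose the Bernoulli distribution on $\{0, 1\}$, with a single parameter $\lambda \in [0, 1]$ = probability that $y = 1$:

$$Pr(y \mid \lambda) = (1 - \lambda)^{1 - y} \cdot \lambda^{y}$$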

The network should predict $\lambda$, but its raw output can be any real number, whereas $\lambda \in [0, 1]$. Solution: squash the output through the logistic sigmoid:

$$\mathrm{sig}[z] = \frac{1}{1 + \exp[-z]}$$

Slide 36: the sigmoid maps $\mathbb{R} \to (0, 1)$, guaranteeing $\lambda$ is a valid probability.

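So $\lambda = \mathrm{sig}[f[\mathbf{x}, \boldsymbol{\phi}]]$, giving $Pr(y \mid \mathbf{x}) = (1 - \mathrm{sig}[f[\mathbf{x}, \boldsymbol{\phi}]])^{1 - y} \cdot \mathrm{sig}[f[\mathbf{x}, \boldsymbol{\phi}]]^{y}$. The negative log-likelihood is the binary cross-entropy loss:

$$L[\boldsymbol{\phi}] = \sum_{i=1}^{I} -(1 - y_i) \log\big[1 - \mathrm{sig}[f[\mathbf{x}_i, \boldsymbol{\phi}]]\big] - y_i \log\big[\mathrm{sig}[f[\mathbf{x}_i, \boldsymbol{\phi}]]\big]$$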

For a point estimate at inference: predict $y = 1$ if $\lambda > 0.5$, else $y = 0$.
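The binary pipeline (raw output → sigmoid → Bernoulli NLL → threshold) can be sketched as follows. The function names are hypothetical, and the raw outputs ("logits") are taken as given rather than produced by a real network.

```python
import math


def sigmoid(z):
    """Logistic sigmoid; adequate for moderate z (math.exp overflows for z < about -709)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(logits, labels):
    """Bernoulli negative log-likelihood summed over the examples; labels are 0 or 1."""
    total = 0.0
    for z, y in zip(logits, labels):
        lam = sigmoid(z)     # predicted probability that y = 1
        total += -(1 - y) * math.log(1.0 - lam) - y * math.log(lam)
    return total


def predict(z):
    """Point estimate: class 1 if the predicted probability exceeds 0.5."""
    return 1 if sigmoid(z) > 0.5 else 0


loss_uncertain = binary_cross_entropy([0.0], [1])    # log 2: the model is at 50/50
loss_confident = binary_cross_entropy([4.0], [1])    # small: confident and correct
```

Thresholding the probability at 0.5 is the same as thresholding the raw output at 0, since the sigmoid of 0 is 0.5.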

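Q: Walk through the recipe for binary classification: which distribution, why the sigmoid, and what is the resulting loss called?

Distribution: Bernoulli on $\{0, 1\}$, parameter $\lambda$, written $Pr(y \mid \lambda) = (1 - \lambda)^{1 - y} \lambda^{y}$. Sigmoid: the network output is unbounded, but $\lambda$ must lie in $[0, 1]$; $\mathrm{sig}[z] = 1 / (1 + \exp[-z])$ maps $\mathbb{R} \to (0, 1)$, so $\lambda = \mathrm{sig}[f[\mathbf{x}, \boldsymbol{\phi}]]$ is always valid. Loss: the negative log-likelihood is the binary cross-entropy, $L[\boldsymbol{\phi}] = \sum_i -(1 - y_i)\log[1 - \mathrm{sig}[f[\mathbf{x}_i, \boldsymbol{\phi}]]] - y_i \log[\mathrm{sig}[f[\mathbf{x}_i, \boldsymbol{\phi}]]]$. Decision rule: $\hat{y} = 1$ iff $\lambda > 0.5$.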

Case 3 — Multiclass classification

Goal: assign $\mathbf{x}$ to one of $K$ classes, $y \in \{1, 2, \dots, K\}$. Examples: which of $K = 10$ digits is in an image; which of $K$ possible words follows an incomplete sentence.

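Choose the categorical distribution, with $K$ parameters $\lambda_1, \dots, \lambda_K$ where $Pr(y = k) = \lambda_k$. The parameters must (i) lie in $[0, 1]$ and (ii) sum to one. The network outputs $K$ numbers, but they will not satisfy these constraints, so pass them through the softmax:

$$\mathrm{softmax}_k[\mathbf{z}] = \frac{\exp[z_k]}{\sum_{k'=1}^{K} \exp[z_{k'}]}$$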

The exponentials guarantee positivity; the denominator forces the outputs to sum to one.

Slides 38–39: softmax exponentiates and normalises, turning arbitrary network scores into a valid categorical distribution over the $K$ classes.

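The likelihood that $\mathbf{x}$ has label $y = k$ is $Pr(y = k \mid \mathbf{x}) = \mathrm{softmax}_k[f[\mathbf{x}, \boldsymbol{\phi}]]$. The negative log-likelihood is the multiclass cross-entropy loss, where $f_k$ denotes the $k$-th network output:

$$L[\boldsymbol{\phi}] = -\sum_{i=1}^{I} \log\big[\mathrm{softmax}_{y_i}[f[\mathbf{x}_i, \boldsymbol{\phi}]]\big] = -\sum_{i=1}^{I} \Big( f_{y_i}[\mathbf{x}_i, \boldsymbol{\phi}] - \log\Big[ \sum_{k'=1}^{K} \exp\big[f_{k'}[\mathbf{x}_i, \boldsymbol{\phi}]\big] \Big] \Big)$$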

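The transformed output is a categorical distribution over $y \in \{1, \dots, K\}$. For a point estimate, take the most probable class:

$$\hat{y} = \underset{k}{\operatorname{argmax}} \big[ Pr(y = k \mid f[\mathbf{x}, \hat{\boldsymbol{\phi}}]) \big]$$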
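A minimal sketch of the multiclass pipeline, assuming the network's raw scores are already available. The function names are hypothetical; classes are indexed from 0 here (Python convention) rather than from 1 as in the text, and subtracting the maximum score before exponentiating is a standard numerical-stability step that leaves the softmax output unchanged.

```python
import math


def softmax(scores):
    """Turn arbitrary real scores into positive values that sum to one."""
    m = max(scores)                          # stability shift; cancels in the ratio
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multiclass_cross_entropy(batch_scores, labels):
    """Categorical negative log-likelihood; labels are 0-based class indices."""
    return -sum(math.log(softmax(s)[y]) for s, y in zip(batch_scores, labels))


def predict_class(scores):
    """Point estimate: index of the most probable class."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


probs = softmax([2.0, 1.0, 0.1])                          # assumed scores
loss = multiclass_cross_entropy([[0.0, 0.0]], [1])        # log 2 for a 50/50 guess
```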

Exam trap · sigmoid vs. softmax

Binary classification uses one output + sigmoid → Bernoulli → binary cross-entropy. Multiclass uses $K$ outputs + softmax → categorical → multiclass cross-entropy. Softmax with $K = 2$ reduces to the sigmoid case, but on an exam name the distribution (Bernoulli vs. categorical) and the squashing function (sigmoid vs. softmax) correctly for each.

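Q: Write the $k$-th softmax output, explain the two constraints it enforces, and give the multiclass cross-entropy loss plus the inference rule.

$\mathrm{softmax}_k[\mathbf{z}] = \exp[z_k] / \sum_{k'=1}^{K} \exp[z_{k'}]$. The exponential makes every entry positive, so after normalisation each value lies in $[0, 1]$ (constraint i); dividing by the sum makes them sum to one (constraint ii), i.e. a valid categorical distribution.

Loss: $L[\boldsymbol{\phi}] = -\sum_{i=1}^{I} \log\big[\mathrm{softmax}_{y_i}[f[\mathbf{x}_i, \boldsymbol{\phi}]]\big]$. Inference: $\hat{y} = \operatorname{argmax}_k Pr(y = k \mid f[\mathbf{x}, \hat{\boldsymbol{\phi}}])$, the most probable class.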

Part IV

Convolutional Neural Networks

Images are huge, locally correlated, and meaningful regardless of where the object sits. Fully connected networks ignore all three facts. Convolutions exploit them by sliding one small set of shared weights across the whole image — fewer parameters, built-in translation handling, and a receptive field that grows with depth.

Until now every network was fully connected — a single path from each input to each output, with a private weight for every connection. That is wasteful for images, which have three properties demanding specialised architecture: they are high-dimensional; nearby pixels are statistically related; and the interpretation of an image is stable under geometric transformations. Convolutional layers process each local region independently using parameters shared across the whole image. A network built mainly from such layers is a CNN.

Invariance and equivariance

Definition · invariance

A function $f[\mathbf{x}]$ is invariant to a transformation $t[\mathbf{x}]$ if $f[t[\mathbf{x}]] = f[\mathbf{x}]$. The output does not change when the input is transformed. Image classification should be invariant: a translated, rotated, flipped, or warped photo of a cat is still classified "cat."

Definition · equivariance (covariance)

A function $f[\mathbf{x}]$ is equivariant (or covariant) to $t[\mathbf{x}]$ if $f[t[\mathbf{x}]] = t[f[\mathbf{x}]]$. Transforming the input transforms the output the same way. Per-pixel segmentation should be equivariant: shift the image and the segmentation mask should shift identically. CNNs are equivariant to translation: each convolutional layer commutes with shifts.

Slides 43–44 — invariance: the label is unchanged by the shift. Equivariance: the output (mask) undergoes the same shift as the input.

Q: Define invariance and equivariance with their equations, and state which property image classification needs and which segmentation needs.

Invariance: $f[t[\mathbf{x}]] = f[\mathbf{x}]$, so the output is unchanged by the transform. Needed for classification (a shifted cat is still "cat").

Equivariance: $f[t[\mathbf{x}]] = t[f[\mathbf{x}]]$, so the output transforms the same way. Needed for per-pixel segmentation (shift the image → shift the mask). Convolutional layers are equivariant to translation.

The 1D convolution

A 1D convolution turns an input vector $\mathbf{x}$ into an output $\mathbf{z}$ where each $z_i$ is a weighted sum of nearby inputs, using the same weights $\boldsymbol{\omega}$ at every position: the convolution kernel (or filter). The number of inputs combined is the kernel size. For kernel size three:

$$z_i = \omega_1 x_{i-1} + \omega_2 x_i + \omega_3 x_{i+1}$$

This operation is equivariant with respect to translation: translate $\mathbf{x}$ and $\mathbf{z}$ translates the same way.
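The equivariance claim can be checked directly. The sketch below implements a size-3 convolution with zero padding and compares "shift then convolve" with "convolve then shift". The kernel and input values are assumptions, and the input is zero near its right edge so that shifting by one position does not push any signal off the end.

```python
def conv1d_size3(x, w):
    """z[i] = w[0]*x[i-1] + w[1]*x[i] + w[2]*x[i+1], with x taken as 0 outside its range."""
    def at(i):
        return x[i] if 0 <= i < len(x) else 0.0
    return [w[0] * at(i - 1) + w[1] * at(i) + w[2] * at(i + 1) for i in range(len(x))]


w = [0.5, 1.0, -0.5]                            # assumed kernel weights
x = [0.0, 0.0, 1.0, 2.0, 0.0, 0.0, 0.0]         # assumed input
x_shifted = [0.0] + x[:-1]                      # translate the input right by one

z = conv1d_size3(x, w)                          # [0, -0.5, 0, 2.5, 1.0, 0, 0]
z_shifted = conv1d_size3(x_shifted, w)          # the same values, moved right by one
```

Here `z_shifted` equals `[0.0] + z[:-1]`: translating the input translated the output identically.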

Stride and dilation

Kernel size: how many inputs each output combines (e.g. 3 or 5).
Stride: how far the kernel hops between outputs (stride 2 keeps every other position, halving the output length).
Dilation: gaps inserted between kernel taps (dilation 2 with size 3 reaches inputs $x_{i-2}, x_i, x_{i+2}$), enlarging the receptive field without adding weights.

Slides 45–46: the same three weights slide across all positions. Stride sets the hop size; dilation spaces the taps apart.
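Stride and dilation are small changes to the same loop, as the following sketch suggests. The function `conv1d` is a hypothetical helper (odd kernel sizes, zero padding), and the all-ones kernel is chosen only so the outputs are easy to verify by hand.

```python
def conv1d(x, w, stride=1, dilation=1):
    """Odd-size kernel centred on each visited position, zero outside the input."""
    half = len(w) // 2

    def at(i):
        return x[i] if 0 <= i < len(x) else 0.0

    out = []
    for centre in range(0, len(x), stride):          # stride: hop between outputs
        out.append(sum(
            w[j] * at(centre + (j - half) * dilation)  # dilation: spacing of the taps
            for j in range(len(w))
        ))
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # assumed input
ones = [1.0, 1.0, 1.0]                  # assumed kernel: sums the three taps

plain = conv1d(x, ones)                 # [3, 6, 9, 12, 15, 11]
strided = conv1d(x, ones, stride=2)     # [3, 9, 15]: every other position
dilated = conv1d(x, ones, dilation=2)   # taps at i-2, i, i+2: [4, 6, 9, 12, 8, 10]
```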

Padding

At the boundaries there is no "previous" input for the first output and no "subsequent" input for the last. Two common approaches:

1. Pad the edges with invented values. Zero padding assumes the input is zero outside its valid range; alternatively treat the input as circular or reflecting at the boundaries.
2. Discard output positions where the kernel runs off the edge. Advantage: no invented information at the edges. Disadvantage: the representation shrinks in size.

The convolutional layer

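A convolutional layer convolves the input, adds a bias $\beta$, and passes each result through an activation $a[\cdot]$. With kernel size 3, stride 1, dilation 1, the $i$-th hidden unit is:

$$h_i = a[\beta + \omega_1 x_{i-1} + \omega_2 x_i + \omega_3 x_{i+1}] = a\Big[\beta + \sum_{j=1}^{3} \omega_j x_{i+j-2}\Big]$$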

The bias $\beta$ and kernel weights $\omega_1, \omega_2, \omega_3$ are the trainable parameters; with zero padding, $x$ is treated as zero out of range.

Q: Given input and kernel with stride 1 and zero padding, compute the convolution output where . What does this kernel detect?

With , (out-of-range entries are 0):

; ; ; ; .

So . The kernel computes a difference between neighbours — it is an edge / gradient detector, large where the signal changes sharply and ≈0 where it is flat.

Channels

A single convolution loses information — it averages neighbours and ReLU clips negatives. So we compute several convolutions in parallel; each produces a new set of hidden variables called a feature map or channel.

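In general, inputs and hidden layers all have multiple channels. If the incoming layer has $C_i$ channels and kernel size $K$, each output-channel unit is a weighted sum over all $C_i$ channels and all $K$ kernel positions, using a weight matrix $\boldsymbol{\Omega} \in \mathbb{R}^{C_i \times K}$ and one bias. For $C_o$ output channels:

$$\boldsymbol{\Omega} \in \mathbb{R}^{C_i \times C_o \times K}, \qquad \boldsymbol{\beta} \in \mathbb{R}^{C_o}$$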

Q: A convolutional layer maps $C_i$ input channels to $C_o$ output channels with a kernel of size $K$ (1D). How many weights and biases does it have? Why use multiple channels at all?

Weights: $C_i \times C_o \times K$. Biases: $C_o$. Total: $C_i C_o K + C_o$ trainable parameters.
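As a sketch of the counting rule (one weight per input channel, output channel, and kernel position, plus one bias per output channel), the hypothetical helper below tallies the parameters of a 1D convolutional layer. The example shape is an assumption chosen for illustration.

```python
def conv1d_param_count(c_in, c_out, kernel_size):
    """Return (weights, biases, total) for a 1D convolutional layer."""
    weights = c_in * c_out * kernel_size   # one kernel per (input, output) channel pair
    biases = c_out                         # one bias per output channel
    return weights, biases, weights + biases


# Assumed example: 3 input channels, 16 output channels, kernel size 5.
weights, biases, total = conv1d_param_count(3, 16, 5)   # (240, 16, 256)
```

The count does not depend on the input length, which is the parameter saving that weight sharing buys.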

A single convolution would lose information (it averages neighbours, and ReLU clips negatives). Computing many convolutions in parallel produces many feature maps, each detecting a different pattern, preserving far more of the input's structure.

Receptive field

Definition · receptive field

The receptive field of a hidden unit is the region of the original input that feeds into it. With kernel size 3 at every layer: layer-1 units see the 3 closest inputs (field 3); layer-2 units see field 5; the field grows with depth, gradually integrating information from across the whole input.

Slides 51–52: stacking kernel-3 layers widens the receptive field 3 → 5 → 7 → … ($2L + 1$ after $L$ layers). Stride further accelerates this growth.

Q: In a stack of convolutional layers each with kernel size 3 and stride 1, what is the receptive field of a unit in layer $L$? Derive the first few values.

Each layer extends the field by 2 inputs (one on each side). Layer 1: field 3; layer 2: $3 + 2 = 5$; layer 3: $5 + 2 = 7$. General formula: field $= 2L + 1$ for $L$ layers of kernel size 3, stride 1. (More generally, with stride 1 and no dilation the field grows by $K - 1$ per layer for kernel size $K$; stride and dilation increase the growth rate.)
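The growth rule can be written as a short recurrence. This sketch uses the standard bookkeeping in which each layer widens the field by (kernel size − 1) × dilation × the current spacing between adjacent units, and stride multiplies that spacing for later layers. The function name and the layer-tuple format are assumptions.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride, dilation), first layer to last."""
    field = 1   # a raw input sample sees only itself
    jump = 1    # spacing, in input samples, between adjacent units at this depth
    for kernel_size, stride, dilation in layers:
        field += (kernel_size - 1) * dilation * jump
        jump *= stride
    return field


# Kernel size 3, stride 1, no dilation: 3, 5, 7, 9, ... i.e. 2L + 1.
fields = [receptive_field([(3, 1, 1)] * depth) for depth in range(1, 5)]

# Two stride-2 layers grow the field faster: 3, then 7.
strided_field = receptive_field([(3, 2, 1), (3, 2, 1)])
```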

2D convolution

Everything generalises to 2D: the kernel is a small grid (e.g. $3 \times 3$) slid over the image, computing a weighted sum at each position, adding a bias, and applying the activation (slide 53). For an RGB image the input has 3 channels, so the kernel is $3 \times 3 \times 3$: it sums over the spatial window and the colour channels to produce each hidden value (slide 54).

Downsampling — shrinking a 2D map

Three main approaches to scale a 2D representation down:

(a) Stride — skip positions when convolving. (b) Max-pooling — take the maximum in each window. (c) Average pooling — take the mean in each window.

Downsampling is applied separately to each channel, so the output has half the width and height but the same number of channels.

Slide 55 — pooling collapses each window to one value. Max keeps the strongest activation; average smooths. Number of channels is unchanged.
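A minimal sketch of max- and average-pooling on a single channel, using non-overlapping 2×2 windows. The helper names and the 4×4 example values are assumptions; a multi-channel map would simply apply the same function to each channel in turn.

```python
def pool2d(channel, reduce_fn, size=2):
    """Apply reduce_fn to each non-overlapping size x size window of one channel."""
    rows, cols = len(channel), len(channel[0])
    return [
        [
            reduce_fn([channel[r + dr][c + dc] for dr in range(size) for dc in range(size)])
            for c in range(0, cols - size + 1, size)
        ]
        for r in range(0, rows - size + 1, size)
    ]


def mean(values):
    return sum(values) / len(values)


channel = [
    [1, 2, 5, 6],
    [3, 4, 7, 8],
    [9, 10, 13, 14],
    [11, 12, 15, 16],
]

max_pooled = pool2d(channel, max)    # [[4, 8], [12, 16]]
avg_pooled = pool2d(channel, mean)   # [[2.5, 6.5], [10.5, 14.5]]
```

Both results are 2×2: width and height are halved while the single channel stays a single channel.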

Upsampling — enlarging a 2D map

Four main approaches to scale a 2D representation up:

(a) Duplication — copy each value into a larger block. (b) Max-unpooling — place each value back at the location it came from during max-pooling, zeros elsewhere. (c) Interpolation (e.g. bilinear) — smoothly blend between known values. (d) Transposed convolution — a learnable upsampling that spreads each input across a kernel-shaped patch.

Changing the number of channels — the 1×1 convolution

Sometimes we want to change the channel count between layers without further spatial pooling. The trick: apply a convolution with kernel size one. A $1 \times 1$ convolution mixes channels at each spatial location independently: it is a per-pixel linear map from $C_i$ channels to $C_o$ channels, leaving width and height untouched.
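Since a kernel of size one never looks at neighbouring pixels, the operation reduces to the same small linear map applied at every location. The sketch below makes that explicit; the function name, the nested-list layout (`feature_map[row][col]` is a list of channel values), and all numbers are assumptions for illustration.

```python
def conv1x1(feature_map, weights, biases):
    """Per-pixel linear map across channels.

    weights[o][i] maps input channel i to output channel o; biases[o] is added.
    """
    return [
        [
            [biases[o] + sum(w * v for w, v in zip(weights[o], pixel))
             for o in range(len(weights))]
            for pixel in row
        ]
        for row in feature_map
    ]


fmap = [[[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]]      # 1 x 2 pixels, 3 channels each
weights = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]    # 3 channels in, 2 channels out
biases = [0.0, 1.0]

out = conv1x1(fmap, weights, biases)             # [[[-2.0, 4.0], [0.0, 1.5]]]
```

The output keeps the 1 × 2 spatial layout but each pixel now carries two channels instead of three.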

Slide 57: the $1 \times 1$ convolution recombines channels at each pixel, the standard way to grow or shrink channel count without spatial change.

Q: What does a $1 \times 1$ convolution do, and when would you use it? Contrast it with max-pooling.

A $1 \times 1$ convolution applies a learned linear combination across channels at each individual pixel, changing the channel count while leaving the spatial dimensions unchanged. Use it to increase or decrease the number of feature maps between layers without pooling.

Max-pooling, by contrast, reduces the spatial size (e.g. halves width and height) and keeps the channel count fixed — the opposite axis of change.

Q: List the three downsampling methods and the four upsampling methods, and state what downsampling does to the channel count.

Downsampling (3): stride, max-pooling, average pooling. Upsampling (4): duplication, max-unpooling, interpolation (e.g. bilinear), transposed convolution.

Downsampling is applied independently per channel, so it halves width and height but leaves the number of channels unchanged.

Part V

Residual Networks & Normalisation

Naively stacking ever more layers makes networks worse, not better. The fix is almost embarrassingly simple — add the input of each block back to its output. That one change turns a deep tower into an ensemble of short paths, tames the gradients, and lets us train networks an order of magnitude deeper.

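Every architecture so far processes data sequentially: $\mathbf{h}_1 = f_1[\mathbf{x}, \boldsymbol{\phi}_1]$, $\mathbf{h}_2 = f_2[\mathbf{h}_1, \boldsymbol{\phi}_2]$, $\mathbf{h}_3 = f_3[\mathbf{h}_2, \boldsymbol{\phi}_3]$, $\mathbf{y} = f_4[\mathbf{h}_3, \boldsymbol{\phi}_4]$. In principle we can add as many layers as we want. In practice, image-classification performance decreases again as further layers are added, a phenomenon that "is not completely understood." Residual networks address it directly.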

Residual connections and residual blocks

Definition · residual / skip connection

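A residual (or skip) connection is a branch in the computational path whereby the input to each layer is added back to its output. The residual network is:

$$\mathbf{h}_1 = \mathbf{x} + f_1[\mathbf{x}, \boldsymbol{\phi}_1]$$
$$\mathbf{h}_2 = \mathbf{h}_1 + f_2[\mathbf{h}_1, \boldsymbol{\phi}_2]$$
$$\mathbf{h}_3 = \mathbf{h}_2 + f_3[\mathbf{h}_2, \boldsymbol{\phi}_3]$$
$$\mathbf{y} = \mathbf{h}_3 + f_4[\mathbf{h}_3, \boldsymbol{\phi}_4]$$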

Each function learns an additive change to the current representation. Their outputs must be the same size as their inputs (so the addition is defined). Each input-plus-processed-output combination is a residual block (or residual layer).

Slide 60: the defining picture. The input takes a shortcut around the processing block and is summed back in, so the block only has to learn the difference from the identity.

Residual networks as ensembles

Intuition · sixteen paths

Residual connections turn the original network into an ensemble of smaller networks whose outputs are summed. A four-block residual network creates sixteen paths of different lengths from input to output (each of the 4 blocks can be taken or skipped: $2^4 = 16$). Gradients through shorter paths are better behaved. Because both the identity term and many short chains of derivatives contribute to each layer's gradient, residual networks suffer less from shattered gradients.
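The path count can be enumerated explicitly. In this sketch each path is a tuple of four choices, 1 for "go through the block" and 0 for "take the skip connection"; the encoding is an assumption made for illustration.

```python
from collections import Counter
from itertools import product

n_blocks = 4

# Every path makes an independent take-or-skip choice at each block.
paths = list(product([0, 1], repeat=n_blocks))

# Path length = number of blocks actually traversed.
length_counts = Counter(sum(path) for path in paths)
```

There are 16 paths in total, and `length_counts` shows the spread of lengths: one path of length 0 (pure identity), four of length 1, six of length 2, four of length 3, and one of length 4 (the original sequential network). Most paths are therefore shorter than the full depth.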

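Q: Write the residual-network equations for four blocks. Why does adding skip connections help training, and how many input-to-output paths does a four-block network create?

$\mathbf{h}_1 = \mathbf{x} + f_1[\mathbf{x}, \boldsymbol{\phi}_1]$, $\mathbf{h}_2 = \mathbf{h}_1 + f_2[\mathbf{h}_1, \boldsymbol{\phi}_2]$, $\mathbf{h}_3 = \mathbf{h}_2 + f_3[\mathbf{h}_2, \boldsymbol{\phi}_3]$, $\mathbf{y} = \mathbf{h}_3 + f_4[\mathbf{h}_3, \boldsymbol{\phi}_4]$. Each $f_k$ learns an additive correction, and its output must match its input in size.

Skip connections create an ensemble of $2^4 = 16$ paths of varying length. Short paths have well-behaved gradients, and the ever-present identity term keeps a clean derivative flowing to every layer, so the network suffers far less from shattered (and vanishing) gradients, making very deep training feasible.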

Exploding gradients in residual networks

Adding residual connections roughly doubles the depth that can be practically trained before performance degrades. But to go deeper still we must consider (1) how the variance of activations changes in the forward pass and (2) how gradient magnitudes change in the backward pass. Initialisation is critical: without care, forward-pass values and backward-pass gradients can grow or shrink exponentially (explode or vanish) as we move through the network.

Definition · He initialisation

We initialise so the expected variance of activations and gradients stays the same between layers. He initialisation achieves this for ReLU activations by: (i) initialising biases to zero; and (ii) choosing weights from a normal distribution with mean zero and variance $2 / D_h$, where $D_h$ is the number of hidden units in the previous layer.
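A minimal sketch of this initialisation using only the standard library. The function name `he_init`, the layer sizes, and the random seed are assumptions; a real framework would provide its own initialiser.

```python
import random


def he_init(n_in, n_out, rng):
    """Zero biases; weights drawn from a normal with mean 0 and variance 2 / n_in."""
    std = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


rng = random.Random(0)                       # fixed seed so the sketch is repeatable
weights, biases = he_init(n_in=100, n_out=200, rng=rng)

# Empirical check: the sample variance should be close to 2 / 100 = 0.02.
flat = [w for row in weights for w in row]
sample_var = sum(w * w for w in flat) / len(flat)
```

The factor of 2 compensates for ReLU zeroing roughly half of the pre-activations, which would otherwise halve the variance at every layer.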

Exam trap · why residual nets explode forward

In a residual network we do not have to worry about vanishing gradients, because the skip connections give every layer a direct path to the output. But even with He initialisation inside each block, the forward-pass values still increase exponentially. Reason: each branch has some uncorrelated variability; the expected variance is unchanged by the processing inside a block, but when we add the block's output back to the input, the variances add. The variance roughly doubles per block, growing exponentially with the number of blocks. This limits depth before floating-point precision is exceeded in the forward pass.

Q: State He initialisation precisely. In a residual network with good initialisation, which problem is solved and which remains, and why?

He init: biases = 0; weights $\sim \mathcal{N}(0, 2 / D_h)$, where $D_h$ is the number of hidden units in the previous layer. Designed for ReLU so activation/gradient variance is preserved layer-to-layer.

Solved: vanishing gradients (the identity path keeps gradients flowing). Remains: the forward-pass values explode. Each residual block adds its (uncorrelated) output back to the input, so the variances sum: variance roughly doubles per block, growing exponentially and eventually exceeding floating-point precision.

Batch normalisation (BN)

Batch normalisation stabilises the forward and backward passes of residual networks. It shifts and rescales each activation so its mean and variance across the batch become values learned during training.

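First compute the empirical mean $m_h$ and standard deviation $s_h$ over the batch $\mathcal{B}$:

$$m_h = \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} h_i, \qquad s_h = \sqrt{\frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} (h_i - m_h)^2}$$

Then standardise to zero mean and unit variance:

$$h_i \leftarrow \frac{h_i - m_h}{s_h + \epsilon} \quad \forall i \in \mathcal{B}$$

Finally, scale by $\gamma$ and shift by $\delta$ (Equation 12):

$$h_i \leftarrow \gamma h_i + \delta \quad \forall i \in \mathcal{B} \tag{12}$$

After this, the activations have mean $\delta$ and standard deviation $\gamma$ across the batch, and both $\gamma$ and $\delta$ are learned during training.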
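The three stages map onto three lines of code. This is a training-time sketch for a single hidden unit across one batch; the function name, the default `eps`, and the example values are assumptions, and the running statistics used at test time are not modelled.

```python
def batch_norm(batch, gamma, delta, eps=1e-5):
    """Normalise one hidden unit's activations across a batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n                                        # stage 1: statistics
    std = (sum((h - mean) ** 2 for h in batch) / n) ** 0.5
    standardised = [(h - mean) / (std + eps) for h in batch]     # stage 2: standardise
    return [gamma * h + delta for h in standardised]             # stage 3: scale, shift


# Assumed activations and assumed values for the learned scale and offset.
out = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, delta=5.0)

out_mean = sum(out) / len(out)
out_std = (sum((h - out_mean) ** 2 for h in out) / len(out)) ** 0.5
```

The output mean matches `delta` and the output standard deviation matches `gamma` up to the small effect of `eps`.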

Definition · BN parameter counts & test-time behaviour

BN is applied independently to each hidden unit. In a standard network with $K$ layers of $D$ hidden units each, there are $KD$ learned offsets $\delta$ and $KD$ learned scales $\gamma$.

In a convolutional network the statistics are computed over both the batch and the spatial position: with $K$ layers of $C$ channels each, there are $KC$ offsets and $KC$ scales.

At test time there is no batch to gather statistics from, so $m_h$ and $s_h$ are computed across the whole training dataset and frozen in the final network.

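Q: Describe the three stages of batch normalisation with their formulas. After the final stage, what are the activations' mean and standard deviation, and what happens at test time?

(1) Compute statistics: $m_h = \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} h_i$ and $s_h = \sqrt{\frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} (h_i - m_h)^2}$.

(2) Standardise: $h_i \leftarrow (h_i - m_h) / (s_h + \epsilon)$ → zero mean, unit variance ($\epsilon$ avoids division by zero).

(3) Scale & shift: $h_i \leftarrow \gamma h_i + \delta$. Afterwards the activations have mean $\delta$, standard deviation $\gamma$, both learned.

Test time: no batch is available, so $m_h, s_h$ are computed over the entire training set and frozen.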

Q: How many learned BN parameters does a fully-connected network with $K$ layers of $D$ units have? How does the count change for a convolutional network with $C$ channels?

Fully connected: $KD$ offsets + $KD$ scales $= 2KD$ parameters (BN applied per hidden unit).

Convolutional: statistics are pooled over batch and spatial position, so BN is applied per channel, not per spatial unit: $KC$ offsets + $KC$ scales $= 2KC$ parameters. The count now scales with the number of channels $C$, independent of the (much larger) spatial resolution, which is the key saving.

Coming next

The lecture closes by pointing forward (slide 68): the next topic is attention and transformers — where, instead of fixed local convolutions, each output is a learned weighted sum over all input positions (the small numbers 0.1, 0.3, 0.6 … on the final slide are attention weights routing values to outputs).

The week in one breath

A neural network is a tunable function built from clipped lines; ReLU gives it kinks, depth folds the input to multiply those kinks cheaply, maximum likelihood turns "predict well" into a concrete negative-log-likelihood loss for regression, binary, and multiclass tasks, convolutions share weights across an image to gain translation equivariance and growing receptive fields, and residual connections plus batch normalisation keep the gradients sane so the whole tower can actually be trained.

Exam-readiness checklist

Write Equation 1 for the 3-unit single-layer network, count its 10 parameters, and explain why ReLU yields up to $D + 1$ linear regions for a 1D input.
Extend the network to multivariate inputs and outputs (Equations 5–8) and count parameters for a given shape.
Explain why composing two networks gives regions, and why a genuine two-layer network (arbitrary ) is strictly more expressive than the composition (constrained outer-product ).
Reproduce the -substitution derivation (Equation 10) showing composition is a constrained two-layer net.
Define width, depth, capacity; distinguish shallow () from deep (); recall the general formulation .
State the universal approximation theorem (Cybenko 1989, Hornik 1991, Lu 2017) and explain why depth is still preferred (fewer params, easier training, better generalisation) and why "now" (data, GPUs).
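Recite the four-step maximum-likelihood recipe and the NLL objective $\hat{\boldsymbol{\phi}} = \operatorname{argmin}_{\boldsymbol{\phi}} \big[ -\sum_{i=1}^{I} \log Pr(\mathbf{y}_i \mid f[\mathbf{x}_i, \boldsymbol{\phi}]) \big]$.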
Derive the Gaussian-NLL → squared-error equivalence for univariate regression.
Build binary classification end-to-end: Bernoulli → sigmoid → binary cross-entropy → 0.5 threshold.
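Build multiclass classification: categorical → softmax ($\mathrm{softmax}_k[\mathbf{z}] = \exp[z_k] / \sum_{k'=1}^{K} \exp[z_{k'}]$) → multiclass cross-entropy → argmax.
Define invariance $f[t[\mathbf{x}]] = f[\mathbf{x}]$ vs. equivariance $f[t[\mathbf{x}]] = t[f[\mathbf{x}]]$; match to classification vs. segmentation.
Compute a 1D convolution by hand (kernel size 3, stride, dilation, zero padding) and write the layer formula $h_i = a\big[\beta + \sum_{j=1}^{3} \omega_j x_{i+j-2}\big]$.
Count convolution parameters: $\boldsymbol{\Omega} \in \mathbb{R}^{C_i \times C_o \times K}$, $\boldsymbol{\beta} \in \mathbb{R}^{C_o}$; explain feature maps/channels.
Compute receptive field growth ($2L + 1$ for kernel-3, stride-1 stacks); list the 3 downsampling and 4 upsampling methods; explain the $1 \times 1$ convolution.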
Write the four residual-block equations; explain the 16-path ensemble view and reduced shattered gradients.
Explain why residual nets explode in the forward pass (variance doubles per block) yet don't vanish in the backward pass; state He initialisation (biases 0, weights $\sim \mathcal{N}(0, 2 / D_h)$).
Derive batch normalisation's three stages, give the post-BN mean $\delta$ and std $\gamma$, count its $2KD$ (or $2KC$) parameters, and state the frozen-statistics test-time rule.