Exploration, generalisation through function approximation, and learning policies directly — how RL escapes the look-up table and scales to real problems.
If an agent only ever does what currently looks best, it may never discover that something better exists. Learning to act well requires deliberately doing things that look sub-optimal — at least for a while.
In the previous lectures we built two ways for an agent to learn while acting: active ADP (adaptive dynamic programming — learn a model of the MDP, then solve it) and Q-learning (learn action-values directly without a model). Both, as we ran them, were greedy: at every step the agent took the action that looked best according to whatever estimate of the world it had so far.
Concretely, an ADP agent picks its action by looking one step ahead through its learned transition model and choosing whatever maximises expected utility:
and a Q-learning agent simply reads off the best action from its current Q-table:
"Greedy" means always cash in on what you already believe. The trouble: your beliefs were formed from a handful of trials, so they're probably wrong in places you've barely visited. By only ever exploiting them, you keep re-visiting the same familiar corner of the world and never gather the evidence that would correct your mistaken beliefs about everywhere else.
The danger is concrete: a purely greedy agent could miss large parts of the state space entirely, never build a good enough estimate of the MDP, and so end up committed to a policy that is worse than optimal — permanently.
In our running 4×3 grid (the same one from the MDP lectures), greedy active ADP converges to the policy on the left below. Compare it to the true optimal policy on the right.
Because greediness creates a self-reinforcing blind spot. The agent only takes actions that look best under its current estimate of the MDP. If that estimate undervalues some region (perhaps just from bad luck in early trials), the agent stops going there — and because it stops going there, it never collects the data that would correct the under-estimate. So the error never gets fixed: the agent converges, but to the wrong policy. There is no contradiction with "unlimited time" because the agent's own behaviour starves it of the very experience it would need.
One clean way to force exploration is an exploration function that makes the agent try every action a minimum number of times, $N_{\min}$, regardless of how good that action currently looks. Only after an action has been tried enough times does the agent go back to choosing by expected utility (i.e. greedily).
$$f(u,n) \equiv \begin{cases} R^{+}, & \text{if } n < N_{\min}\\[4pt] u, & \text{otherwise}\end{cases}$$
Here $u$ is the action's current estimated utility (or Q-value) and $n$ is the number of times that action has been tried so far in this state. $R^{+}$ is an optimistic value.
Slide 4 describes $R^{+}$ with a garbled, duplicated phrase ("an optimistic reward that is more than the reward that is more than the value of any state"). The intended meaning is simply: $R^{+}$ is an optimistic value chosen to be larger than the utility of any reachable state. Because $R^{+}$ beats every real estimate, the agent prefers an under-tried action ($n
Pretend, until proven otherwise, that every untried action leads to paradise. That optimism guarantees the agent will go and check each action at least $N_{\min}$ times. Once it has the evidence, the optimism switches off (the function returns the real value $u$) and the agent behaves greedily on solid ground.
We drop the exploration function into any active RL algorithm. For Q-learning it changes only the action-selection line:
1: for s ∈ States do 2: for a ∈ Acts do 3: Q(s, a) := 0 4: N(s, a) := 0 5: end for 6: end for 7: n := 0 8: si := s0 9: while True do 10: amax := arg maxa∈Acts f( Q(si, a), N(si, a) ) 11: N(si, amax) ++ 12: Execute amax in environment 13: si+1 := state-perceived 14: if si+1 is not terminal then 15: Q(si, amax) := Q(si, amax) + α(n)·( R(si) + γ·maxa′ Q(si+1, a′) − Q(si, amax) ) 16: si := si+1 17: else 18: break 19: end if 20: end while
On the original slide, line 10 reads f(Q(si,a), N(s,a)) and line 11 reads N(s,amax)++ — they index the visit-count table $N$ with a bare $s$ instead of the loop's current state $s_i$. There is no plain variable $s$ in scope inside the loop (it was the for-loop dummy used only to initialise the tables). The correct indexing is $N(s_i, a)$ and $N(s_i, a_{\max})$, as written above. The slide also closes the while loop with od; for consistency with the end for style it should read end while.
The lecturer stresses: only line 10 (action selection) was changed, not line 15 (the value update). Line 15 still bootstraps off $\max_{a'} Q(s_{i+1}, a')$ — the value of the greedy next action, not the action exploration will actually take.
Off-policy = "do one thing, learn about another." You wander around exploring (behaviour policy), but every update still asks "what's the best I could have done next?" (target policy = greedy). On-policy = "learn about exactly what you do," so the update must reflect the exploratory action you genuinely took.
It is off-policy because only line 10 was modified to use the exploration function $f$, while line 15 (the Q-update) was left unchanged. Line 15 still uses $\max_{a'} Q(s_{i+1}, a')$ — the value of the greedy successor action — even though the agent's behaviour (line 10) may pick an exploratory, non-greedy action. To make it on-policy you would also change line 15 so the bootstrap target uses the value of the action exploration will actually execute next (rather than the greedy $\max$).
So that the optimism always wins the arg max while an action is still under-explored. If $R^{+}$ were merely large but smaller than some genuinely high-value state's estimate, the agent could prefer that known good state over trying an untested action — leaving the untested action with $n < N_{\min}$ forever and defeating the purpose. Making $R^{+}$ exceed every possible state value guarantees that any action with $n < N_{\min}$ is ranked above every "known" option, so it will be tried at least $N_{\min}$ times before the agent is allowed to ignore it.
Everything we did before quietly assumed one cell of memory per state. That works for a 12-cell grid. It collapses the moment the world gets interesting.
So far we discussed RL assuming we have an MDP, a value/utility function, Q-values, and so on — but we never asked how these objects are actually stored in a program. The implicit assumption throughout was that everything could be represented in a tabular manner.
For example, in direct utility estimation the approximate utility function is just an array in memory, with one cell per MDP state. For our grid that's trivial — a 2D array of 12 cells, each holding the learned utility of one square:
For richer problems, the table simply cannot scale:
Slide 7 literally says of these counts: "This would fit the memory of most practical computers." Read in context that is the opposite of what is meant — the entire slide is arguing tabular methods break at this scale. The intended statement is: $10^{20}$ or $10^{40}$ states would not fit in the memory of any practical computer. (For scale: $10^{20}$ cells at one byte each is ~$10^{8}$ terabytes; $10^{40}$ is astronomically beyond any storage that could ever exist.)
This raises the central question of the lecture:
"How could reinforcement learning be used to solve complex problems?" The answer is function approximation: instead of representing the utility function explicitly with one entry per state, we represent it approximately — as a compact function with far fewer parameters than there are states.
Chess has on the order of $10^{40}$ states. A look-up table needs one memory cell per state, so it would require ~$10^{40}$ cells — far more storage than any physical computer has or could ever have. The agent also could never visit more than a vanishing fraction of those states, so most cells would never even be filled. The proposed alternative is function approximation: represent $U$ (or $Q$) as a parameterised function with a small number of parameters that computes a state's value from its features, rather than looking it up. This both fits in memory and lets value learned for visited states transfer to unvisited ones.
Replace the giant table with a short recipe: describe each state by a few features, then compute its value as a weighted sum of those features. Learning becomes tuning the weights — and, as a bonus, knowledge spreads between similar states.
The first and simplest form of function approximation is linear function approximation. The idea: describe every state not by its identity but by a handful of state features.
A set of functions $f_i : \text{States} \to \mathbb{R}$, for $1 \le i \le n$, each mapping a state to a real number. Each $f_i$ extracts one numerical property of the state.
In the grid world a (toy) feature could be the sum of the agent's coordinates, $f(x,y) = x + y$. Writing each cell's value of $x+y$ (with $x$ the column index $0$–$3$ and $y$ the row index $0$–$2$):
1+2, 2+1 and 3+0 all equal 3) — this will matter shortly.Given the features, the approximate utility of a state $s$ is a weighted sum:
$$U_\theta(s) \equiv \sum_{i=1}^{n} \theta_i\, f_i(s)$$
The $\theta_i$ are parameters (weights) determining how much each feature influences the final computed utility.
With the single feature $f(x,y)=x+y$ and weight $\theta = 2$: $$U_\theta\big((x,y)\big) \equiv 2(x+y).$$ So instead of storing 12 numbers, we store one — the weight $\theta=2$ — and compute any cell's utility on demand.
A linear approximator is just "value = weight × feature, added up." Learning no longer means writing a number into a cell; it means nudging the weights $\theta_i$ so the formula's outputs match what experience tells us. A handful of weights now stands in for a potentially enormous table.
Recall how direct utility estimation set a value in the tabular case: it averaged the returns observed for that state across trials,
With linear approximation we can't write into a cell — instead we tune the parameters $\theta_i$ to adjust the estimate. The agent still interacts with the environment in the familiar loop, receiving a new state and reward after each action:
Naïvely, we could just set the parameters so the approximation matches an observed value. Suppose the trials gave an estimate of 0.75 for state $(1,2)$. Then with the single-feature model we could change $\theta$ from $2$ to $0.25$, since
We cannot immediately slam $\theta$ to match one state's observed value, because the features may not fully capture the state — several distinct states get compressed into one entry of the approximator. With $f(x,y)=x+y$, the model cannot distinguish $(1,2)$ from $(2,1)$ (both have $x+y=3$). Yet the trials gave $0.75$ for $(1,2)$ and $-1.12$ for $(2,1)$. Any single $\theta$ that nails one will be wrong for the other. Forcing a perfect fit to $(1,2)$ wrecks $(2,1)$, and vice versa.
So if trials imply $(1,2)$ is worth $1$ and $(2,1)$ is worth $0$, what should the parameters be? We do what the rest of the module has trained us to do: pick parameters that minimise the error, using a diminishing learning rate to guarantee convergence.
For an observed value $u_j$, the error for a state $s$ is $$E_j(s) = \frac{\big(U_\theta(s) - u_j(s)\big)^2}{2}.$$
(Slide 13 writes this as $E(s)_j$ and slide 14 as $E_j(s)$ — the same quantity; the half-squared-error. The $\tfrac12$ is there purely so it cancels the $2$ from differentiating the square.)
To minimise the error we follow its gradient, updating each parameter $\theta_i$ separately. Differentiating $E_j(s)$ and applying the chain rule gives the practically usable form on the right:
$$\theta_i := \theta_i - \alpha\,\frac{\partial E_j(s)}{\partial \theta_i} \;=\; \theta_i + \alpha\,\big(u_j(s) - U_\theta(s)\big)\,\frac{\partial U_\theta(s)}{\partial \theta_i}$$
The bracket $\big(u_j(s) - U_\theta(s)\big)$ is the prediction error — "how far off was I?" The rule says: move each weight a little ($\alpha$) in the direction that would have shrunk that error, scaled by how strongly that weight affected the output ($\partial U_\theta/\partial\theta_i$, which for the linear model is just the feature $f_i(s)$). The minus-then-plus rewrite is the chain rule: the two minus signs (one from gradient descent, one from differentiating the squared term) cancel into a plus.
For the linear model $U_\theta(s)=\sum_i \theta_i f_i(s)$, we have $\dfrac{\partial U_\theta(s)}{\partial\theta_i}=f_i(s)$, so the rule becomes the very intuitive $$\theta_i := \theta_i + \alpha\big(u_j(s)-U_\theta(s)\big)f_i(s).$$ Big error and a strongly-active feature ⟹ big weight change.
Earlier we flagged it as a danger that changing $\theta$ for one state $s$ also changes the value of another state $s'$ that the features can't distinguish from $s$. That is in fact the main advantage of the approach.
Tying states together through shared features is a liability when the tied states genuinely differ in value (as $(1,2)$ vs $(2,1)$), but an asset when they truly are alike — because one experience then teaches the agent about a whole family of states at once. The art of linear FA is designing features so the tying happens between the right states.
Because the feature collapses them onto the same number: $f(1,2)=1+2=3$ and $f(2,1)=2+1=3$. The approximator only ever sees the feature value, so $U_\theta(1,2)=U_\theta(2,1)=3\theta$ — it is structurally incapable of assigning them different values. To match $(1,2)=1$ we'd need $\theta=\tfrac13$; to match $(2,1)=0$ we'd need $\theta=0$. No single $\theta$ satisfies both. The best a learner can do is pick $\theta$ that minimises the total squared error across both observations (which is exactly what Widrow–Hoff with a diminishing learning rate converges to), rather than fitting either one exactly.
$$\theta_i := \theta_i + \alpha\,\big(u_j(s)-U_\theta(s)\big)\,\frac{\partial U_\theta(s)}{\partial\theta_i}.$$
Net effect: take a small step down the squared-error gradient. The derivation starts from $\theta_i := \theta_i - \alpha\,\partial E_j/\partial\theta_i$; differentiating $E_j=\tfrac12(U_\theta-u_j)^2$ produces a factor that flips the sign, giving the $+\alpha(\dots)$ form.
The same parameterised representation works inside TD learning and Q-learning. We just swap the "observed value" for the relevant bootstrapped target:
$$\theta_i := \theta_i + \alpha\,\big(R(s) + \gamma\,U_\theta(s_{k+1}) - U_\theta(s_k)\big)\,\frac{\partial U_\theta(s)}{\partial \theta_i}$$
$$\theta_i := \theta_i + \alpha\,\big(r + \gamma\,\max_{a'} Q_\theta(s_{k+1}, a') - Q_\theta(s_k, a_{\max})\big)\,\frac{\partial Q_\theta(s_k, a)}{\partial \theta_i}$$
In both cases the modification is identical: the error term that was "observed minus predicted" is now the TD error / Q-error — the bootstrapped target minus the current estimate — and we move $\theta$ in the direction that reduces that error signal.
Every version is the same skeleton: $\theta_i \mathrel{+}= \alpha \cdot (\text{error}) \cdot (\text{gradient of the prediction})$. Only the definition of "error" changes — a directly observed return for plain estimation, $R+\gamma U(s') - U(s)$ for TD, $r+\gamma\max_{a'}Q(s',a') - Q(s,a)$ for Q-learning.
They all instantiate $\;\theta_i := \theta_i + \alpha\,(\text{error})\,\dfrac{\partial(\text{prediction})}{\partial\theta_i}$. The only thing that changes is the error term:
In all three we step $\theta$ in the direction that shrinks that error, scaled by the gradient of the prediction w.r.t. $\theta_i$ (which is the feature value for a linear model).
Linear approximation only works if you hand-design features powerful enough to capture the world linearly. When you can't, let a neural network learn the features for you.
One stubborn problem with linear approximators is that you must design the features yourself, and they must distinguish classes of states that have different utilities. Worse, they must be "powerful enough" that a mere linear combination of them accurately describes the whole state space.
In new or complex domains that is hard to achieve by hand. So a great deal of work has gone into using Deep Neural Networks (DNNs) as approximators for value functions or Q-values.
DNN approximators were responsible for most of the breakthroughs achieved with RL — the headline example being DeepMind's Go-playing system.
Linear FA says "you tell me the features; I'll learn the weights." Deep RL says "I'll learn the features too." The network's hidden layers discover their own non-linear features of the state, which is why deep approximators cope with raw, high-dimensional inputs (board positions, pixels) where hand-crafted linear features would fail.
The linear approximator $U_\theta(s)=\sum_i\theta_i f_i(s)$ is linear in $\theta$. A DNN with non-linear activation functions is not linear in $\theta$ — composing non-linearities makes the output a complicated non-linear function of the weights. Practically: for the linear model the gradient $\partial U_\theta/\partial\theta_i$ is simply the feature $f_i(s)$, whereas for a DNN the gradient must be computed through the layers via back-propagation. The pay-off is that the DNN learns its own features rather than relying on hand-designed ones that must already make the value linear-combinable.
The features must be powerful enough that a purely linear combination of them accurately describes the state space — i.e. states with different utilities must be separable by a weighted sum of the chosen features. In novel or complex domains, finding such features by hand is extremely difficult (you'd essentially need to already understand the value structure). Deep RL sidesteps this by letting a DNN learn the features automatically through its non-linear hidden layers, removing the dependence on human feature engineering.
Instead of learning values and reading a policy off them, search the space of policies directly. You don't need to know how good each action is — only which one to pick.
The final RL technique of the lecture is policy search. The shift in mindset: rather than estimating a value or Q-function and then acting greedily, we search directly in the space of possible policies for one that performs well.
$$\pi_\theta(s) \equiv \operatorname*{arg\,max}_{a} Q_\theta(s, a)$$
The policy is represented by parameters $\theta$. The parameterisation could be a linear function approximation or a neural network.
We then search the parameter space by following the gradient with respect to $\theta$ — the same gradient-descent idea used earlier for estimating the Q-function or value function, but now aimed at the policy itself.
This is the conceptual crux of the whole section. When we optimise a parameterised Q-function, we minimise the error
i.e. we want $Q_\theta(s,a)$ to numerically match the true $Q(s,a)$ for all actions — the exact expected reward of taking each action. Policy search asks for far less:
For policy search we only care that the policy chooses the correct action — regardless of how accurately $Q_\theta(s,a)$ approximates the true Q-function. Formally, we only need $Q_\theta$ to preserve the ordering of action values at each state: $$Q_\theta(s,a_1) < Q_\theta(s,a_2) \;\Longleftrightarrow\; Q(s,a_1) < Q(s,a_2).$$ Get the ranking right and the $\arg\max$ picks the right action — the exact magnitudes are irrelevant.
The bar chart makes this vivid. The true Q-values are $Q(s,a_1)=2$ and $Q(s,a_2)=4$. Four different estimates are shown:
Approximating $Q$ accurately is harder than acting well. A policy only needs to know which action wins, not by how much. Three of the four estimates above are numerically wrong but rank the actions correctly — so they all act optimally. This is why searching for a good policy can be easier than nailing down the value function.
Not at all — it is perfectly good for control. Although the magnitudes are wrong (6 and 10 vs. true 2 and 4), it preserves the ordering: it ranks Action 2 above Action 1, exactly as the true Q-values do ($4>2$). Since the policy is $\pi_\theta(s)=\arg\max_a Q_\theta(s,a)$, only the ranking matters, so this estimate selects the same (optimal) action. It would be a poor estimate if we needed accurate values (e.g. for further bootstrapping), but for choosing actions it is as good as the exact one. This is precisely the point of the Est1 bar in the chart.
How do we actually perform the search? First we define the quantity we want to maximise.
$\rho(\theta)$ is a function returning the expected reward obtained if the policy $\pi_\theta$ is executed starting from a given state $s_0$.
There's a snag. The parameterised policy is defined with an $\arg\max$:
which makes $\pi_\theta$ non-differentiable w.r.t. $\theta$ (a tiny change in $\theta$ leaves the chosen action unchanged — flat — until it abruptly jumps to a different action). Consequently $\rho$, being a function of $\pi_\theta(s_0)$, is also non-differentiable — and we can't follow a gradient that doesn't exist.
$\arg\max$ produces a step function in $\theta$: piecewise-constant with sudden jumps. Its derivative is zero almost everywhere and undefined at the jumps. A policy gradient of "zero almost everywhere" gives the search no direction to move — hence the need to smooth it out.
To make the policy differentiable w.r.t. $\theta$, replace the hard $\arg\max$ with a softmax:
$$\pi_\theta(s, a) \equiv \frac{e^{\,Q_\theta(s,a)}}{\displaystyle\sum_{a' \in \mathrm{Acts}} e^{\,Q_\theta(s,a')}}$$
The slide writes this definition with two errors:
Memorise the correct form: numerator uses the action $a$ you're scoring; denominator sums $e^{Q_\theta(s,a')}$ over all actions $a'$ (this normalises the probabilities to sum to 1).
Two things to note about the softmax policy:
Softmax is "soft arg-max." Instead of flatly handing all the probability to the single best action (a step function), it spreads probability across actions in proportion to $e^{Q}$ — a smooth ramp. Smooth means differentiable, and differentiable means we can finally compute a gradient and climb it.
$\pi_\theta(s)=\arg\max_a Q_\theta(s,a)$ is piecewise constant in $\theta$: small changes to $\theta$ don't change which action is the max, so the output is flat, then jumps discontinuously when a different action overtakes. Its derivative w.r.t. $\theta$ is therefore $0$ almost everywhere and undefined at the jumps — so $\rho(\theta)$ inherits this non-differentiability and there is no usable gradient $\nabla_\theta\rho(\theta)$ to follow.
The softmax $\pi_\theta(s,a)=e^{Q_\theta(s,a)}/\sum_{a'}e^{Q_\theta(s,a')}$ replaces the hard max with a smooth, everywhere-differentiable function of $\theta$ that outputs a probability distribution over actions. Because it is differentiable, $\rho(\theta)$ becomes differentiable too, and gradient ascent on $\rho$ can proceed.
Correct form: $$\pi_\theta(s,a)=\frac{e^{\,Q_\theta(s,a)}}{\sum_{a'\in\mathrm{Acts}} e^{\,Q_\theta(s,a')}}.$$
Slide mistakes: (1) the left side has a spurious trailing subscript, $\pi_\theta(s,a)_\theta$, which should just be $\pi_\theta(s,a)$; (2) the denominator's exponent uses $a$ instead of the summation variable $a'$ — written as $\sum_{a'} e^{Q_\theta(s,a)}$, the term being summed ignores $a'$ entirely. It must be $\sum_{a'} e^{Q_\theta(s,a')}$ so that the normaliser genuinely sums the exponentiated Q-values of all actions and the probabilities add to 1.
Last piece: how do we actually compute $\nabla_\theta \rho(\theta)$?
If the policy is probabilistic, you need many more samples to estimate $\rho(\theta)$ and $\rho(\theta+\Delta)$ in an unbiased way — the randomness of the policy adds variance to every run, so naive finite differencing becomes sample-hungry.
A more sample-efficient route to unbiased estimates: first estimate $\nabla_\theta \pi_\theta(s,a)$ for every state visited across the trials. Then an unbiased estimator of the policy-value gradient is:
$$\nabla_\theta \rho(\theta) := \frac{1}{N} \sum_{i=1}^{N} \frac{\nabla_\theta \pi_\theta(s, a_i)\, u_i}{\pi_\theta(s, a_i)}$$
where:
This is the likelihood-ratio (REINFORCE-style) estimator. For each time the agent visited $s$ and took $a_i$, look at how that visit actually turned out (the reward-to-go $u_i$). Then nudge $\theta$ to make the actions that led to high $u_i$ more probable. The division by $\pi_\theta(s,a_i)$ corrects for the fact that we sampled $a_i$ according to the policy's own probabilities — it's what keeps the estimate unbiased. Averaging over the $N$ visits reduces variance.
Deterministic: use finite differences — estimate $\rho(\theta)$ and $\rho(\theta+\Delta)$ (each via direct utility estimation, i.e. run the policy and average the returns) and approximate the gradient from how $\rho$ changed. Because the policy is deterministic, each run of a fixed $\theta$ gives a clean, repeatable return, so few samples are needed.
Probabilistic: now every execution of the same $\theta$ can take different actions and yield different returns, injecting variance. To estimate $\rho(\theta)$ and $\rho(\theta+\Delta)$ unbiasedly you therefore need many more samples to average out the policy's own randomness. The likelihood-ratio estimator $\nabla_\theta\rho(\theta)=\tfrac1N\sum_i \nabla_\theta\pi_\theta(s,a_i)u_i/\pi_\theta(s,a_i)$ is the more sample-efficient way to get an unbiased gradient in that case.
Averaging the product over the $N$ visits to $s$ then gives the unbiased gradient estimate (and reduces its variance).
Greedy RL gets stuck, so we force exploration (try every action $N_{\min}$ times via an optimistic $R^{+}$); look-up tables can't hold $10^{20}$–$10^{40}$ states, so we switch to function approximation — write $U_\theta(s)=\sum_i\theta_i f_i(s)$ and learn the weights with the Widrow–Hoff gradient rule (which generalises across feature-similar states, for better or worse); when good linear features can't be hand-built we let deep nets learn them; and finally, instead of values, we search policies directly — smoothing $\arg\max$ into a differentiable softmax and climbing the policy gradient $\nabla_\theta\rho(\theta)$.