Week 21 — Reinforcement Learning III

Part I

Exploration vs. Exploitation

If an agent only ever does what currently looks best, it may never discover that something better exists. Learning to act well requires deliberately doing things that look sub-optimal — at least for a while.

In the previous lectures we built two ways for an agent to learn while acting: active ADP (adaptive dynamic programming — learn a model of the MDP, then solve it) and Q-learning (learn action-values directly without a model). Both, as we ran them, were greedy: at every step the agent took the action that looked best according to whatever estimate of the world it had so far.

Concretely, an ADP agent picks its action by looking one step ahead through its learned transition model and choosing whatever maximises expected utility:

$$\pi(s) := \operatorname*{arg\,max}_{a\in \mathrm{Acts}} \sum_{s'\in \mathrm{States}} T(s' \mid s,a)\, U^{\pi}(s')$$

and a Q-learning agent simply reads off the best action from its current Q-table:

$$a_{\max} := \operatorname*{arg\,max}_{a\in \mathrm{Acts}} Q(s_i, a)$$

In plain English

"Greedy" means always cash in on what you already believe. The trouble: your beliefs were formed from a handful of trials, so they're probably wrong in places you've barely visited. By only ever exploiting them, you keep re-visiting the same familiar corner of the world and never gather the evidence that would correct your mistaken beliefs about everywhere else.

The danger is concrete: a purely greedy agent could miss large parts of the state space entirely, never build a good enough estimate of the MDP, and so end up committed to a policy that is worse than optimal — permanently.

What it looks like in the grid world

In our running 4×3 grid (the same one from the MDP lectures), greedy active ADP converges to the policy on the left below. Compare it to the true optimal policy on the right.

Greedy active ADP (left) vs. the optimal policy (right). The optimal policy routes the bottom row left and the lower-middle column up, deliberately steering clear of the −1 (☢) state. The greedy policy doesn't bother avoiding −1, and its overall utility is noticeably worse.

Compared to the optimal policy, greedy active ADP "seems to not want to avoid the −1 state" — its arrows don't route around the danger the way the optimal ones do.
The accompanying convergence plot (RMS error and policy loss vs. number of trials) shows the difference in utility between optimal and greedy ADP is relatively high — i.e. the greedy agent's policy loss does not vanish even after hundreds of trials.

QWhy can a purely greedy RL agent end up with a permanently sub-optimal policy, even with unlimited time?

Because greediness creates a self-reinforcing blind spot. The agent only takes actions that look best under its current estimate of the MDP. If that estimate undervalues some region (perhaps just from bad luck in early trials), the agent stops going there — and because it stops going there, it never collects the data that would correct the under-estimate. So the error never gets fixed: the agent converges, but to the wrong policy. There is no contradiction with "unlimited time" because the agent's own behaviour starves it of the very experience it would need.

Fixing it with an exploration function

One clean way to force exploration is an exploration function that makes the agent try every action a minimum number of times, $N_{\min}$, regardless of how good that action currently looks. Only after an action has been tried enough times does the agent go back to choosing by expected utility (i.e. greedily).

Definition · Exploration function

$$f(u,n) \equiv \begin{cases} R^{+}, & \text{if } n < N_{\min}\\[4pt] u, & \text{otherwise}\end{cases}$$

Here $u$ is the action's current estimated utility (or Q-value) and $n$ is the number of times that action has been tried so far in this state. $R^{+}$ is an optimistic value.

Slide notation — corrected

Slide 4 describes $R^{+}$ with a garbled, duplicated phrase ("an optimistic reward that is more than the reward that is more than the value of any state"). The intended meaning is simply: $R^{+}$ is an optimistic value chosen to be larger than the utility of any reachable state. Because $R^{+}$ beats every real estimate, the agent prefers an under-tried action ($n

Intuition

Pretend, until proven otherwise, that every untried action leads to paradise. That optimism guarantees the agent will go and check each action at least $N_{\min}$ times. Once it has the evidence, the optimism switches off (the function returns the real value $u$) and the agent behaves greedily on solid ground.

Plugging it into Q-learning

We drop the exploration function into any active RL algorithm. For Q-learning it changes only the action-selection line:

Algorithm 1 · Q-learning with exploration ($s_0$)

1: for s ∈ States do 2: for a ∈ Acts do 3: Q(s, a) := 0 4: N(s, a) := 0 5: end for 6: end for 7: n := 0 8: s_i := s₀ 9: while True do 10: a_max := arg max_a∈Acts f( Q(s_i, a), N(s_i, a) ) 11: N(s_i, a_max) ++ 12: Execute a_max in environment 13: s_i+1 := state-perceived 14: if s_i+1 is not terminal then 15: Q(s_i, a_max) := Q(s_i, a_max) + α(n)·( R(s_i) + γ·max_a′ Q(s_i+1, a′) − Q(s_i, a_max) ) 16: s_i := s_i+1 17: else 18: break 19: end if 20: end while

Slide notation — corrected (the "table indexing" bug)

On the original slide, line 10 reads f(Q(s_i,a), N(s,a)) and line 11 reads N(s,a_max)++ — they index the visit-count table $N$ with a bare $s$ instead of the loop's current state $s_i$. There is no plain variable $s$ in scope inside the loop (it was the for-loop dummy used only to initialise the tables). The correct indexing is $N(s_i, a)$ and $N(s_i, a_{\max})$, as written above. The slide also closes the while loop with od; for consistency with the end for style it should read end while.

Off-policy vs. on-policy

The lecturer stresses: only line 10 (action selection) was changed, not line 15 (the value update). Line 15 still bootstraps off $\max_{a'} Q(s_{i+1}, a')$ — the value of the greedy next action, not the action exploration will actually take.

Because the behaviour (what we do, via $f$) differs from the target we learn about (the greedy $\max$), the modified algorithm is off-policy.
To make it on-policy, we would also have to change line 15 so the update uses the value of the action that exploration actually selects (this is the SARSA-style fix).

In plain English

Off-policy = "do one thing, learn about another." You wander around exploring (behaviour policy), but every update still asks "what's the best I could have done next?" (target policy = greedy). On-policy = "learn about exactly what you do," so the update must reflect the exploratory action you genuinely took.

QIn Algorithm 1, exactly which line makes it off-policy, and what single change would make it on-policy?

It is off-policy because only line 10 was modified to use the exploration function $f$, while line 15 (the Q-update) was left unchanged. Line 15 still uses $\max_{a'} Q(s_{i+1}, a')$ — the value of the greedy successor action — even though the agent's behaviour (line 10) may pick an exploratory, non-greedy action. To make it on-policy you would also change line 15 so the bootstrap target uses the value of the action exploration will actually execute next (rather than the greedy $\max$).

QWhy must $R^{+}$ be larger than the value of any state, rather than just "a big positive number"?

So that the optimism always wins the arg max while an action is still under-explored. If $R^{+}$ were merely large but smaller than some genuinely high-value state's estimate, the agent could prefer that known good state over trying an untested action — leaving the untested action with $n < N_{\min}$ forever and defeating the purpose. Making $R^{+}$ exceed every possible state value guarantees that any action with $n < N_{\min}$ is ranked above every "known" option, so it will be tried at least $N_{\min}$ times before the agent is allowed to ignore it.

· · ·

Part II

Why Look-up Tables Don't Scale

Everything we did before quietly assumed one cell of memory per state. That works for a 12-cell grid. It collapses the moment the world gets interesting.

So far we discussed RL assuming we have an MDP, a value/utility function, Q-values, and so on — but we never asked how these objects are actually stored in a program. The implicit assumption throughout was that everything could be represented in a tabular manner.

For example, in direct utility estimation the approximate utility function is just an array in memory, with one cell per MDP state. For our grid that's trivial — a 2D array of 12 cells, each holding the learned utility of one square:

The learned utilities $U^\pi(s)$ for the 4×3 grid, stored as one array cell per state — 12 numbers in total. Tiny problems pose no storage difficulty.

The wall: state-space explosion

For richer problems, the table simply cannot scale:

Backgammon has on the order of $10^{20}$ states.
Chess has on the order of $10^{40}$ states.
Go is worse still — the problem is "further exacerbated."

Slide wording — corrected

Slide 7 literally says of these counts: "This would fit the memory of most practical computers." Read in context that is the opposite of what is meant — the entire slide is arguing tabular methods break at this scale. The intended statement is: $10^{20}$ or $10^{40}$ states would not fit in the memory of any practical computer. (For scale: $10^{20}$ cells at one byte each is ~$10^{8}$ terabytes; $10^{40}$ is astronomically beyond any storage that could ever exist.)

This raises the central question of the lecture:

The key question

"How could reinforcement learning be used to solve complex problems?" The answer is function approximation: instead of representing the utility function explicitly with one entry per state, we represent it approximately — as a compact function with far fewer parameters than there are states.

QA look-up table stores one cell per state. State precisely why this fails for chess, and name the alternative the lecture proposes.

Chess has on the order of $10^{40}$ states. A look-up table needs one memory cell per state, so it would require ~$10^{40}$ cells — far more storage than any physical computer has or could ever have. The agent also could never visit more than a vanishing fraction of those states, so most cells would never even be filled. The proposed alternative is function approximation: represent $U$ (or $Q$) as a parameterised function with a small number of parameters that computes a state's value from its features, rather than looking it up. This both fits in memory and lets value learned for visited states transfer to unvisited ones.

· · ·

Part III

Linear Function Approximation

Replace the giant table with a short recipe: describe each state by a few features, then compute its value as a weighted sum of those features. Learning becomes tuning the weights — and, as a bonus, knowledge spreads between similar states.

The first and simplest form of function approximation is linear function approximation. The idea: describe every state not by its identity but by a handful of state features.

Definition · State features

A set of functions $f_i : \text{States} \to \mathbb{R}$, for $1 \le i \le n$, each mapping a state to a real number. Each $f_i$ extracts one numerical property of the state.

Worked example · the grid

In the grid world a (toy) feature could be the sum of the agent's coordinates, $f(x,y) = x + y$. Writing each cell's value of $x+y$ (with $x$ the column index $0$–$3$ and $y$ the row index $0$–$2$):

The single feature $f(x,y)=x+y$ evaluated at every cell. Notice already that several distinct cells share the same value (e.g. 1+2, 2+1 and 3+0 all equal 3) — this will matter shortly.

The linear approximator

Given the features, the approximate utility of a state $s$ is a weighted sum:

Definition · Linear function approximator

$$U_\theta(s) \equiv \sum_{i=1}^{n} \theta_i\, f_i(s)$$

The $\theta_i$ are parameters (weights) determining how much each feature influences the final computed utility.

Worked example

With the single feature $f(x,y)=x+y$ and weight $\theta = 2$: $$U_\theta\big((x,y)\big) \equiv 2(x+y).$$ So instead of storing 12 numbers, we store one — the weight $\theta=2$ — and compute any cell's utility on demand.

In plain English

A linear approximator is just "value = weight × feature, added up." Learning no longer means writing a number into a cell; it means nudging the weights $\theta_i$ so the formula's outputs match what experience tells us. A handful of weights now stands in for a potentially enormous table.

How do we update the weights?

Recall how direct utility estimation set a value in the tabular case: it averaged the returns observed for that state across trials,

$$U^\pi(s) := \frac{\sum_{i=1}^{N_{\text{trials}}} u_i(s)}{N_{\text{trials}}}.$$

With linear approximation we can't write into a cell — instead we tune the parameters $\theta_i$ to adjust the estimate. The agent still interacts with the environment in the familiar loop, receiving a new state and reward after each action:

The standard agent–environment loop: the agent emits action $a_t$; the environment returns the next state $s_{t+1}$ and reward $r_{t+1}$. Each such transition is a learning signal for updating $\theta$.

The naïve idea (and why it almost works)

Naïvely, we could just set the parameters so the approximation matches an observed value. Suppose the trials gave an estimate of 0.75 for state $(1,2)$. Then with the single-feature model we could change $\theta$ from $2$ to $0.25$, since

$$U_\theta\big((x,y)\big) = \theta\, f(x,y) = 0.25\,(1+2) = 0.75. \checkmark$$

Exam trap · features collapse states together

We cannot immediately slam $\theta$ to match one state's observed value, because the features may not fully capture the state — several distinct states get compressed into one entry of the approximator. With $f(x,y)=x+y$, the model cannot distinguish $(1,2)$ from $(2,1)$ (both have $x+y=3$). Yet the trials gave $0.75$ for $(1,2)$ and $-1.12$ for $(2,1)$. Any single $\theta$ that nails one will be wrong for the other. Forcing a perfect fit to $(1,2)$ wrecks $(2,1)$, and vice versa.

So if trials imply $(1,2)$ is worth $1$ and $(2,1)$ is worth $0$, what should the parameters be? We do what the rest of the module has trained us to do: pick parameters that minimise the error, using a diminishing learning rate to guarantee convergence.

Definition · Error

For an observed value $u_j$, the error for a state $s$ is $$E_j(s) = \frac{\big(U_\theta(s) - u_j(s)\big)^2}{2}.$$

(Slide 13 writes this as $E(s)_j$ and slide 14 as $E_j(s)$ — the same quantity; the half-squared-error. The $\tfrac12$ is there purely so it cancels the $2$ from differentiating the square.)

The Widrow–Hoff rule

To minimise the error we follow its gradient, updating each parameter $\theta_i$ separately. Differentiating $E_j(s)$ and applying the chain rule gives the practically usable form on the right:

Definition · The Widrow–Hoff rule

$$\theta_i := \theta_i - \alpha\,\frac{\partial E_j(s)}{\partial \theta_i} \;=\; \theta_i + \alpha\,\big(u_j(s) - U_\theta(s)\big)\,\frac{\partial U_\theta(s)}{\partial \theta_i}$$

In plain English

The bracket $\big(u_j(s) - U_\theta(s)\big)$ is the prediction error — "how far off was I?" The rule says: move each weight a little ($\alpha$) in the direction that would have shrunk that error, scaled by how strongly that weight affected the output ($\partial U_\theta/\partial\theta_i$, which for the linear model is just the feature $f_i(s)$). The minus-then-plus rewrite is the chain rule: the two minus signs (one from gradient descent, one from differentiating the squared term) cancel into a plus.

Why the gradient form?

For the linear model $U_\theta(s)=\sum_i \theta_i f_i(s)$, we have $\dfrac{\partial U_\theta(s)}{\partial\theta_i}=f_i(s)$, so the rule becomes the very intuitive $$\theta_i := \theta_i + \alpha\big(u_j(s)-U_\theta(s)\big)f_i(s).$$ Big error and a strongly-active feature ⟹ big weight change.

The "warning" is actually the whole point

Earlier we flagged it as a danger that changing $\theta$ for one state $s$ also changes the value of another state $s'$ that the features can't distinguish from $s$. That is in fact the main advantage of the approach.

It lets the agent generalise what it learns from trials on one state to other similar states — including states it has never visited.
Whether this helps or hurts depends entirely on the features: good features must categorise states with similar values as similar (i.e. give them similar feature vectors).

Two sides of one coin

Tying states together through shared features is a liability when the tied states genuinely differ in value (as $(1,2)$ vs $(2,1)$), but an asset when they truly are alike — because one experience then teaches the agent about a whole family of states at once. The art of linear FA is designing features so the tying happens between the right states.

QWith $f(x,y)=x+y$ and weight $\theta$, why can't a single $\theta$ correctly value both $(1,2)$ and $(2,1)$ when trials say they're worth $1$ and $0$?

Because the feature collapses them onto the same number: $f(1,2)=1+2=3$ and $f(2,1)=2+1=3$. The approximator only ever sees the feature value, so $U_\theta(1,2)=U_\theta(2,1)=3\theta$ — it is structurally incapable of assigning them different values. To match $(1,2)=1$ we'd need $\theta=\tfrac13$; to match $(2,1)=0$ we'd need $\theta=0$. No single $\theta$ satisfies both. The best a learner can do is pick $\theta$ that minimises the total squared error across both observations (which is exactly what Widrow–Hoff with a diminishing learning rate converges to), rather than fitting either one exactly.

QState the Widrow–Hoff update and explain, term by term, why a weight moves the way it does.

$$\theta_i := \theta_i + \alpha\,\big(u_j(s)-U_\theta(s)\big)\,\frac{\partial U_\theta(s)}{\partial\theta_i}.$$

$\big(u_j(s)-U_\theta(s)\big)$ — the error: observed value minus current prediction. Positive ⟹ we under-predicted ⟹ push the value up; negative ⟹ push it down.
$\partial U_\theta(s)/\partial\theta_i = f_i(s)$ — how much this particular weight influences the output for this state. Weights attached to large/active features get adjusted more (and a feature that's zero here doesn't move at all).
$\alpha$ — the learning rate, taken diminishing over time to guarantee convergence.

Net effect: take a small step down the squared-error gradient. The derivation starts from $\theta_i := \theta_i - \alpha\,\partial E_j/\partial\theta_i$; differentiating $E_j=\tfrac12(U_\theta-u_j)^2$ produces a factor that flips the sign, giving the $+\alpha(\dots)$ form.

Widrow–Hoff for TD and Q-learning

The same parameterised representation works inside TD learning and Q-learning. We just swap the "observed value" for the relevant bootstrapped target:

Definition · TD Widrow–Hoff rule

$$\theta_i := \theta_i + \alpha\,\big(R(s) + \gamma\,U_\theta(s_{k+1}) - U_\theta(s_k)\big)\,\frac{\partial U_\theta(s)}{\partial \theta_i}$$

Definition · Q-learning Widrow–Hoff rule

$$\theta_i := \theta_i + \alpha\,\big(r + \gamma\,\max_{a'} Q_\theta(s_{k+1}, a') - Q_\theta(s_k, a_{\max})\big)\,\frac{\partial Q_\theta(s_k, a)}{\partial \theta_i}$$

In both cases the modification is identical: the error term that was "observed minus predicted" is now the TD error / Q-error — the bootstrapped target minus the current estimate — and we move $\theta$ in the direction that reduces that error signal.

Spot the pattern

Every version is the same skeleton: $\theta_i \mathrel{+}= \alpha \cdot (\text{error}) \cdot (\text{gradient of the prediction})$. Only the definition of "error" changes — a directly observed return for plain estimation, $R+\gamma U(s') - U(s)$ for TD, $r+\gamma\max_{a'}Q(s',a') - Q(s,a)$ for Q-learning.

QThe plain, TD, and Q-learning Widrow–Hoff rules look different. What single template do they all instantiate, and what is the only thing that changes between them?

They all instantiate $\;\theta_i := \theta_i + \alpha\,(\text{error})\,\dfrac{\partial(\text{prediction})}{\partial\theta_i}$. The only thing that changes is the error term:

Direct estimation: error $= u_j(s) - U_\theta(s)$ (observed return minus prediction).
TD: error $= R(s) + \gamma U_\theta(s_{k+1}) - U_\theta(s_k)$ (bootstrapped one-step target minus prediction).
Q-learning: error $= r + \gamma \max_{a'} Q_\theta(s_{k+1},a') - Q_\theta(s_k,a_{\max})$ (greedy bootstrap target minus prediction).

In all three we step $\theta$ in the direction that shrinks that error, scaled by the gradient of the prediction w.r.t. $\theta_i$ (which is the feature value for a linear model).

· · ·

Part IV

Beyond Linear: Deep RL

Linear approximation only works if you hand-design features powerful enough to capture the world linearly. When you can't, let a neural network learn the features for you.

One stubborn problem with linear approximators is that you must design the features yourself, and they must distinguish classes of states that have different utilities. Worse, they must be "powerful enough" that a mere linear combination of them accurately describes the whole state space.

In new or complex domains that is hard to achieve by hand. So a great deal of work has gone into using Deep Neural Networks (DNNs) as approximators for value functions or Q-values.

Why this matters

DNN approximators were responsible for most of the breakthroughs achieved with RL — the headline example being DeepMind's Go-playing system.

A DNN approximator can still be viewed as a function of parameters $\theta$ — but if non-linear activation functions are used, it is not linear in those parameters (unlike $U_\theta = \sum_i \theta_i f_i$).
Its parameters are optimised with back-propagation, exactly as in deep learning.
Deep RL remains an active research area — applied to problems once thought out of reach for AI, including automated mathematical theorem proving.

The conceptual jump

Linear FA says "you tell me the features; I'll learn the weights." Deep RL says "I'll learn the features too." The network's hidden layers discover their own non-linear features of the state, which is why deep approximators cope with raw, high-dimensional inputs (board positions, pixels) where hand-crafted linear features would fail.

QA DNN value approximator and a linear approximator are both "functions of parameters $\theta$." What is the crucial mathematical difference, and what does it change about training?

The linear approximator $U_\theta(s)=\sum_i\theta_i f_i(s)$ is linear in $\theta$. A DNN with non-linear activation functions is not linear in $\theta$ — composing non-linearities makes the output a complicated non-linear function of the weights. Practically: for the linear model the gradient $\partial U_\theta/\partial\theta_i$ is simply the feature $f_i(s)$, whereas for a DNN the gradient must be computed through the layers via back-propagation. The pay-off is that the DNN learns its own features rather than relying on hand-designed ones that must already make the value linear-combinable.

QWhat is the fundamental requirement on hand-designed features for linear approximation to succeed, and why does this motivate deep RL?

The features must be powerful enough that a purely linear combination of them accurately describes the state space — i.e. states with different utilities must be separable by a weighted sum of the chosen features. In novel or complex domains, finding such features by hand is extremely difficult (you'd essentially need to already understand the value structure). Deep RL sidesteps this by letting a DNN learn the features automatically through its non-linear hidden layers, removing the dependence on human feature engineering.

· · ·

Part V

Policy Search & Policy Gradients

Instead of learning values and reading a policy off them, search the space of policies directly. You don't need to know how good each action is — only which one to pick.

The final RL technique of the lecture is policy search. The shift in mindset: rather than estimating a value or Q-function and then acting greedily, we search directly in the space of possible policies for one that performs well.

Definition · Parameterised policy

$$\pi_\theta(s) \equiv \operatorname*{arg\,max}_{a} Q_\theta(s, a)$$

The policy is represented by parameters $\theta$. The parameterisation could be a linear function approximation or a neural network.

We then search the parameter space by following the gradient with respect to $\theta$ — the same gradient-descent idea used earlier for estimating the Q-function or value function, but now aimed at the policy itself.

Policy search is not Q-function optimisation

This is the conceptual crux of the whole section. When we optimise a parameterised Q-function, we minimise the error

$$\big(r + \gamma \max_{a'} Q_\theta(s_{k+1}, a') - Q_\theta(s_k, a_{\max})\big),$$

i.e. we want $Q_\theta(s,a)$ to numerically match the true $Q(s,a)$ for all actions — the exact expected reward of taking each action. Policy search asks for far less:

The key distinction

For policy search we only care that the policy chooses the correct action — regardless of how accurately $Q_\theta(s,a)$ approximates the true Q-function. Formally, we only need $Q_\theta$ to preserve the ordering of action values at each state: $$Q_\theta(s,a_1) < Q_\theta(s,a_2) \;\Longleftrightarrow\; Q(s,a_1) < Q(s,a_2).$$ Get the ranking right and the $\arg\max$ picks the right action — the exact magnitudes are irrelevant.

The bar chart makes this vivid. The true Q-values are $Q(s,a_1)=2$ and $Q(s,a_2)=4$. Four different estimates are shown:

True Q-values vs. four estimates. In every estimate, Action 2 > Action 1 — matching the true ordering — so all four induce the optimal policy. Only Est3 reproduces the actual numbers $(2,4)$. Est1 and Est2 are numerically way off yet still "correct" for policy purposes.

In plain English

Approximating $Q$ accurately is harder than acting well. A policy only needs to know which action wins, not by how much. Three of the four estimates above are numerically wrong but rank the actions correctly — so they all act optimally. This is why searching for a good policy can be easier than nailing down the value function.

QAn estimate $Q_\theta$ assigns Action 1 → 6 and Action 2 → 10, while the true values are 2 and 4. Is this estimate useless for control? Explain.

Not at all — it is perfectly good for control. Although the magnitudes are wrong (6 and 10 vs. true 2 and 4), it preserves the ordering: it ranks Action 2 above Action 1, exactly as the true Q-values do ($4>2$). Since the policy is $\pi_\theta(s)=\arg\max_a Q_\theta(s,a)$, only the ranking matters, so this estimate selects the same (optimal) action. It would be a poor estimate if we needed accurate values (e.g. for further bootstrapping), but for choosing actions it is as good as the exact one. This is precisely the point of the Est1 bar in the chart.

The policy value $\rho(\theta)$

How do we actually perform the search? First we define the quantity we want to maximise.

Definition · Policy value $\rho(\theta)$

$\rho(\theta)$ is a function returning the expected reward obtained if the policy $\pi_\theta$ is executed starting from a given state $s_0$.

Naturally, $\rho(\theta)$ is a function of $\pi_\theta(s_0)$.
Policy search then becomes the optimisation problem: maximise $\rho(\theta)$, solved by ascending the gradient $\nabla_\theta \rho(\theta)$ — known as the policy gradient.

Challenge 1 — $\arg\max$ is not differentiable

There's a snag. The parameterised policy is defined with an $\arg\max$:

$$\pi_\theta(s) \equiv \operatorname*{arg\,max}_{a} Q_\theta(s,a)$$

which makes $\pi_\theta$ non-differentiable w.r.t. $\theta$ (a tiny change in $\theta$ leaves the chosen action unchanged — flat — until it abruptly jumps to a different action). Consequently $\rho$, being a function of $\pi_\theta(s_0)$, is also non-differentiable — and we can't follow a gradient that doesn't exist.

Why this blocks gradient methods

$\arg\max$ produces a step function in $\theta$: piecewise-constant with sudden jumps. Its derivative is zero almost everywhere and undefined at the jumps. A policy gradient of "zero almost everywhere" gives the search no direction to move — hence the need to smooth it out.

Challenge 1, solved — the softmax policy

To make the policy differentiable w.r.t. $\theta$, replace the hard $\arg\max$ with a softmax:

Definition · Softmax policy

$$\pi_\theta(s, a) \equiv \frac{e^{\,Q_\theta(s,a)}}{\displaystyle\sum_{a' \in \mathrm{Acts}} e^{\,Q_\theta(s,a')}}$$

Slide notation — corrected (important for the exam)

The slide writes this definition with two errors:

It writes the left-hand side as $\pi_\theta(s,a)_\theta$ with a stray extra subscript $\theta$. It should simply be $\pi_\theta(s,a)$.
In the denominator it writes $\sum_{a'\in\mathrm{Acts}} e^{Q_\theta(s,a)}$ — using $a$ instead of $a'$ inside the exponent. As written, the summand doesn't depend on the summation index $a'$, so the sum would just be (number of actions)$\times e^{Q_\theta(s,a)}$ — wrong. The correct denominator sums over $a'$: $\;\sum_{a'\in\mathrm{Acts}} e^{Q_\theta(s,a')}$, as shown in the corrected definition above.

Memorise the correct form: numerator uses the action $a$ you're scoring; denominator sums $e^{Q_\theta(s,a')}$ over all actions $a'$ (this normalises the probabilities to sum to 1).

Two things to note about the softmax policy:

It is a stochastic policy. $\pi_\theta(s,a)$ can be read as the probability of selecting action $a$ in state $s$. (Higher $Q_\theta(s,a)$ ⟹ exponentially higher probability, but every action keeps non-zero probability — which also gives built-in exploration.)
It is differentiable w.r.t. $\theta$, so it is now suitable for any gradient-descent technique: we move in the direction of the policy gradient $\nabla_\theta \rho(\theta)$.

In plain English

Softmax is "soft arg-max." Instead of flatly handing all the probability to the single best action (a step function), it spreads probability across actions in proportion to $e^{Q}$ — a smooth ramp. Smooth means differentiable, and differentiable means we can finally compute a gradient and climb it.

QWhy does the hard $\arg\max$ policy break gradient-based policy search, and exactly how does softmax fix it?

$\pi_\theta(s)=\arg\max_a Q_\theta(s,a)$ is piecewise constant in $\theta$: small changes to $\theta$ don't change which action is the max, so the output is flat, then jumps discontinuously when a different action overtakes. Its derivative w.r.t. $\theta$ is therefore $0$ almost everywhere and undefined at the jumps — so $\rho(\theta)$ inherits this non-differentiability and there is no usable gradient $\nabla_\theta\rho(\theta)$ to follow.

The softmax $\pi_\theta(s,a)=e^{Q_\theta(s,a)}/\sum_{a'}e^{Q_\theta(s,a')}$ replaces the hard max with a smooth, everywhere-differentiable function of $\theta$ that outputs a probability distribution over actions. Because it is differentiable, $\rho(\theta)$ becomes differentiable too, and gradient ascent on $\rho$ can proceed.

QWrite the correct softmax policy and point out the two notational mistakes on the slide.

Correct form: $$\pi_\theta(s,a)=\frac{e^{\,Q_\theta(s,a)}}{\sum_{a'\in\mathrm{Acts}} e^{\,Q_\theta(s,a')}}.$$

Slide mistakes: (1) the left side has a spurious trailing subscript, $\pi_\theta(s,a)_\theta$, which should just be $\pi_\theta(s,a)$; (2) the denominator's exponent uses $a$ instead of the summation variable $a'$ — written as $\sum_{a'} e^{Q_\theta(s,a)}$, the term being summed ignores $a'$ entirely. It must be $\sum_{a'} e^{Q_\theta(s,a')}$ so that the normaliser genuinely sums the exponentiated Q-values of all actions and the probabilities add to 1.

Challenge 2 — computing the policy gradient

Last piece: how do we actually compute $\nabla_\theta \rho(\theta)$?

If the policy is deterministic

Just estimate $\rho(\theta)$ and $\rho(\theta + \Delta)$ and take an empirical finite-difference estimate of the gradient.
Both $\rho(\theta)$ and $\rho(\theta+\Delta)$ can be estimated using direct utility estimation (run the policy, average the returns).

If the policy is probabilistic

The challenge

If the policy is probabilistic, you need many more samples to estimate $\rho(\theta)$ and $\rho(\theta+\Delta)$ in an unbiased way — the randomness of the policy adds variance to every run, so naive finite differencing becomes sample-hungry.

A more sample-efficient route to unbiased estimates: first estimate $\nabla_\theta \pi_\theta(s,a)$ for every state visited across the trials. Then an unbiased estimator of the policy-value gradient is:

Definition · Policy-gradient estimator

$$\nabla_\theta \rho(\theta) := \frac{1}{N} \sum_{i=1}^{N} \frac{\nabla_\theta \pi_\theta(s, a_i)\, u_i}{\pi_\theta(s, a_i)}$$

where:

$N$ — the number of times the agent was at state $s$ in a given trial;
$a_i$ — the action executed at the $i$-th appearance of $s$;
$u_i$ — the reward-to-go from step $i$ to the end of the trial.

In plain English

This is the likelihood-ratio (REINFORCE-style) estimator. For each time the agent visited $s$ and took $a_i$, look at how that visit actually turned out (the reward-to-go $u_i$). Then nudge $\theta$ to make the actions that led to high $u_i$ more probable. The division by $\pi_\theta(s,a_i)$ corrects for the fact that we sampled $a_i$ according to the policy's own probabilities — it's what keeps the estimate unbiased. Averaging over the $N$ visits reduces variance.

QFor a deterministic policy, how is $\nabla_\theta\rho(\theta)$ estimated, and why does a probabilistic policy make this harder?

Deterministic: use finite differences — estimate $\rho(\theta)$ and $\rho(\theta+\Delta)$ (each via direct utility estimation, i.e. run the policy and average the returns) and approximate the gradient from how $\rho$ changed. Because the policy is deterministic, each run of a fixed $\theta$ gives a clean, repeatable return, so few samples are needed.

Probabilistic: now every execution of the same $\theta$ can take different actions and yield different returns, injecting variance. To estimate $\rho(\theta)$ and $\rho(\theta+\Delta)$ unbiasedly you therefore need many more samples to average out the policy's own randomness. The likelihood-ratio estimator $\nabla_\theta\rho(\theta)=\tfrac1N\sum_i \nabla_\theta\pi_\theta(s,a_i)u_i/\pi_\theta(s,a_i)$ is the more sample-efficient way to get an unbiased gradient in that case.

QIn the policy-gradient estimator, what role does each of $u_i$, $\nabla_\theta\pi_\theta(s,a_i)$, and the division by $\pi_\theta(s,a_i)$ play?

$u_i$ (reward-to-go from step $i$): the "credit" — how good the outcome was after taking $a_i$ at $s$. It weights the update so high-return actions are reinforced and low-return ones suppressed.
$\nabla_\theta\pi_\theta(s,a_i)$: the direction in parameter space that increases the probability of the action that was taken. Multiplied by $u_i$, it pushes $\theta$ to make good actions more likely.
Dividing by $\pi_\theta(s,a_i)$: an importance-weighting / likelihood-ratio correction. Because $a_i$ was sampled with probability $\pi_\theta(s,a_i)$, frequently-chosen actions would otherwise be over-counted; dividing by that probability removes the sampling bias, making the estimator unbiased.

Averaging the product over the $N$ visits to $s$ then gives the unbiased gradient estimate (and reduces its variance).

QSummarise the whole policy-search pipeline in order: what we optimise, the obstacle, the fix, and how we get the gradient.

Objective: define the policy value $\rho(\theta)$ = expected reward of running $\pi_\theta$ from $s_0$; we want to maximise it by ascending $\nabla_\theta\rho(\theta)$.
Obstacle: $\pi_\theta(s)=\arg\max_a Q_\theta(s,a)$ is non-differentiable in $\theta$, so $\rho(\theta)$ is too — no gradient.
Fix: replace $\arg\max$ with the softmax $\pi_\theta(s,a)=e^{Q_\theta(s,a)}/\sum_{a'}e^{Q_\theta(s,a')}$ — a smooth, differentiable, stochastic policy.
Gradient: if deterministic, finite-difference $\rho(\theta)$ vs $\rho(\theta+\Delta)$ via direct utility estimation; if probabilistic, use the unbiased estimator $\tfrac1N\sum_i \nabla_\theta\pi_\theta(s,a_i)u_i/\pi_\theta(s,a_i)$ with reward-to-go $u_i$.
Throughout: remember we only need correct action ordering, not accurate Q magnitudes.

Week 21 in one breath

Greedy RL gets stuck, so we force exploration (try every action $N_{\min}$ times via an optimistic $R^{+}$); look-up tables can't hold $10^{20}$–$10^{40}$ states, so we switch to function approximation — write $U_\theta(s)=\sum_i\theta_i f_i(s)$ and learn the weights with the Widrow–Hoff gradient rule (which generalises across feature-similar states, for better or worse); when good linear features can't be hand-built we let deep nets learn them; and finally, instead of values, we search policies directly — smoothing $\arg\max$ into a differentiable softmax and climbing the policy gradient $\nabla_\theta\rho(\theta)$.

Exam-readiness checklist — be able to…

Explain why a purely greedy agent can converge to a sub-optimal policy, and describe the greedy-vs-optimal grid example (it fails to avoid the −1 state).
Write the exploration function $f(u,n)=R^{+}$ if $n
Reproduce Algorithm 1 (Q-learning with exploration), including the corrected table indexing $N(s_i,a)$, and identify which line ($a_{\max}$ selection, line 10) is changed.
Define off-policy vs on-policy and state that leaving line 15 unchanged makes the algorithm off-policy; describe the single change for on-policy.
Justify why tabular representations don't scale ($10^{20}$/$10^{40}$ states will not fit in memory) and name function approximation as the remedy.
Define state features $f_i:\text{States}\to\mathbb{R}$ and write the linear approximator $U_\theta(s)=\sum_i\theta_i f_i(s)$; evaluate the $\theta=2$, $x+y$ grid example.
Explain why features that collapse states (e.g. $(1,2)$ and $(2,1)$ both $\mapsto 3$) prevent exact fitting, forcing error minimisation.
State the error $E_j(s)=\tfrac12(U_\theta(s)-u_j(s))^2$ and the Widrow–Hoff rule, and explain each term (error × prediction-gradient).
Explain how feature-sharing turns the "warning" into the generalisation advantage, and why this depends on feature quality.
Write the TD and Q-learning Widrow–Hoff rules and identify the common template (error term is the only thing that changes).
Explain why DNN approximators are non-linear in $\theta$, are trained by back-propagation, and remove the need for hand-designed features (DeepMind Go; theorem proving).
Define a parameterised policy $\pi_\theta(s)=\arg\max_a Q_\theta(s,a)$ and contrast policy search with Q-function optimisation.
State the ordering-preservation condition and use the Est1–Est4 bar chart to argue that accurate Q magnitudes are unnecessary for optimal control.
Define the policy value $\rho(\theta)$ and the optimisation goal (maximise $\rho$ via the policy gradient $\nabla_\theta\rho(\theta)$).
Explain why $\arg\max$ makes $\pi_\theta$ (and hence $\rho$) non-differentiable, and how the softmax fixes it — writing the correct softmax with $a'$ in the denominator.
Interpret softmax as a stochastic policy (probability of choosing $a$ in $s$) and state that it is differentiable.
Describe gradient estimation: finite differences for deterministic policies vs. the unbiased likelihood-ratio estimator $\tfrac1N\sum_i\nabla_\theta\pi_\theta(s,a_i)u_i/\pi_\theta(s,a_i)$ for probabilistic ones, defining $N$, $a_i$, $u_i$ (reward-to-go).
Recite the corrected notation flagged in this guide: $N(s_i,\cdot)$ table indexing, the softmax denominator $\sum_{a'}e^{Q_\theta(s,a')}$, and that $10^{20}/10^{40}$ states do not fit in memory.