KCL · Machine Learning · Week 20

Reinforcement
Learning II

Learning to act when you don't have a map of the world — passive vs. active learning, model-based vs. model-free, and the road to Q-learning.

KCL · ML ◆ ~50 min read ◆ 22 self-tests ◆ Lecturer: M. Abdulaziz

Part I · the problem restated

The Setup — Why "Learning" at All?

Last week we knew the rules of the world and just had to solve it. This week we don't know the rules — and acting in the world is the only way to find them out.

Last week (RL I) you built the machinery for planning inside a Markov Decision Process (MDP): a world described by states, actions, a reward function \(R\), and a transition function \(T(s' \mid s,a)\) — the probability that doing action \(a\) in state \(s\) lands you in \(s'\). Crucially, the lecturer stressed the word modelling: value iteration and policy iteration both assume you already have that full model in hand.

The whole point of this week is that, in real life, you usually don't. The slide's example: in a real robotic environment, no one knows the exact probability that an action will fail. You cannot write down \(T\) for a robot arm slipping on a wet floor. Reinforcement Learning is the framework for acting under exactly that ambiguity — and the way it copes is by learning the model (or learning the values directly) from experience. Hence the word "learning."

A one-slide recap of value iteration

To see precisely what breaks, recall the algorithm from RL I. Its goal: compute the utility \(U(s)\) of every state with respect to the optimal policy.

Algorithm 1 — ValIteration(States, Acts, T) 1for s ∈ States do 2 U(s) := 0 3end for 4 U(s) := U(s)new 5for s ∈ States do 6 U(s)new := R(s) + max_a∈Acts Σ_s'∈States T(s'|s,a)·U(s') 7end for 8for s ∈ States do 9 if ‖U(s)new − U(s)‖ > ε then 10 go to 4 11 end if 12end for

The engine underneath it is the Bellman update (with discount \(\gamma\) made explicit):

\[ U(s)^{\text{new}} \;=\; R(s) \;+\; \gamma \,\max_{a\in Acts}\, \sum_{s'\in States} T(s' \mid s,a)\,U(s') \]

Exam trap

Line 6 of the printed algorithm omits the \(\gamma\) — but the Bellman update written underneath it includes \(\gamma\). On an exam, always write the discount factor. The "no-\(\gamma\)" version is just the special case \(\gamma=1\) used later in the worked grid examples.

Where it breaks — and the four-way split

The Bellman update depends on knowing \(T\). If you don't have a model of the world, you literally cannot evaluate \(\sum_{s'} T(s'\mid s,a)U(s')\), so the update is unusable. The lecturer makes one simplifying assumption for this course:

\(T\) is not known — this is the hard part RL must solve.
\(R\) is usually also unknown, but here we will assume \(R\) is known to keep things clean.

Definition — RL's fundamental problem

To construct or learn a model of the world (in our case, an MDP) by acting in the world — and to use that experience to act well.

RL methods are organised along two independent axes. Every method this week is a corner of this 2×2 grid:

The map of Week 20. Read across: do you build a model? Read down: can you change the policy?

"Passive vs. active" asks whether the policy is fixed while we learn, or whether the agent is allowed to change it. "Model-based vs. model-free" asks whether we build an explicit approximation of the MDP (the \(T\)'s) or skip that and learn values directly.

QWhy can we no longer use the Bellman update once we drop the MDP assumption?

Because the Bellman update contains the term \(\sum_{s'} T(s'\mid s,a)\,U(s')\), which requires the transition function \(T\). If \(T\) is unknown, that sum cannot be computed, so the update has no defined value. Learning must first recover \(T\) (model-based) or sidestep it entirely (model-free).

QIn this course's RL setting, what is assumed known and what must be learned?

Known: the reward function \(R\) (an explicit simplification — "usually \(R\) is not known, but here I will assume it is known"). Unknown / to be learned: the transition function \(T(s'\mid s,a)\). Recovering or bypassing \(T\) through experience is RL's fundamental problem.

QName the two axes that classify the RL methods in this lecture, with one example of each extreme.

Axis 1 — passive vs. active: passive keeps a fixed policy \(\pi\) (e.g. Direct Utility Estimation, plain ADP, plain TD); active changes the policy while learning (e.g. Active ADP, Q-learning).
Axis 2 — model-based vs. model-free: model-based builds an estimate of \(T\) (ADP); model-free updates values without ever building \(T\) (TD, Q-learning).

· · ·

Part II · the simplest thing that works

Passive RL — Direct Utility Estimation

Fix a policy, run it many times, and just average the rewards you actually got. No model, no cleverness — but it leaves value on the table.

In passive RL the agent is handed a fixed policy \(\pi\) — it always acts the same way — and the only job is to estimate the utility of every state with respect to that policy. (This is exactly the first step of policy iteration from RL I, except now we have no \(T\) to compute it with.) The recipe: execute \(\pi\) for a number of trials, each starting from a state, and observe the rewards and successor states that come back. We call it "passive" precisely because the policy never moves.

The grid world we'll keep returning to

Every example this week lives in the same 4×3 grid. The agent starts bottom-left at \((0,0)\); the top-right cell \((3,2)\) is the goal (€, reward \(+1\)); the cell below it \((3,1)\) is the trap (☢, reward \(-1\)); \((1,1)\) is a wall. Every non-terminal step costs \(-0.04\). The arrows are the fixed policy \(\pi\) being evaluated.

The grid world. Coordinates are (column, row), row 0 at the bottom. Arrows = the fixed policy π being evaluated.

Three example trials (traces) are observed. Each subscript on a state is the reward collected there (\(-0.04\) for ordinary cells, \(+1\)/\(-1\) at the terminals):

Trial 1: (0,0)→(0,1)→(0,2)→(0,1)→(0,2)→(1,2)→(2,2)→(3,2)+1 Trial 2: (0,0)→(0,1)→(0,2)→(1,2)→(2,2)→(2,1)→(2,2)→(3,2)+1 Trial 3: (0,0)→(1,0)→(2,0)→(2,1)→(3,1)−1

What "utility of a state" means here

Recall from RL I that the value of a single trace \(s_0, s_1, \dots\) is its discounted reward sum:

\[ \sum_{i=0}^{\infty} \gamma^{i} R(s_i) \]

The true utility of a state under policy \(\pi\) is the expectation of that sum over all traces \(\pi\) could produce starting there:

\[ U^{\pi}(s) \;\equiv\; \mathbb{E}_{\text{traces } s_0 s_1 s_2 \dots}\!\left[\sum_{i=0}^{\infty} \gamma^{i} R(s_i)\right] \]

We can't compute that expectation (no \(T\)), so we estimate it the most direct way possible: run trials and average.

Definition — Direct Utility Estimation (DUE)

For a trial \(i\) starting at \(s_0\) and spanning states \(s_1, s_2, \dots, s_n\), let \(u_i(s_0)\) denote the observed reward-to-go \(\displaystyle\sum_{i=1}^{n}\gamma^{i}R(s_i)\). In direct utility estimation, the utility of a state is approximated by averaging that observed value over all trials that pass through it:

\[ \overline{U}^{\pi}(s) \;:=\; \frac{\sum_{i=1}^{N_{\text{trials}}} u_i(s)}{N_{\text{trials}}} \]

In plain English

"Every time a trial walked through state \(s\), write down how much total reward it ended up collecting from there to the end. The value of \(s\) is just the average of those numbers." It's the same logic as estimating a die's average roll by rolling it a lot and averaging — pure Monte-Carlo sampling, no model needed.

Worked example — state (0,0), with γ = 1

With \(\gamma = 1\) the discounted sum is simply the total reward along the trace. Reading off the three trials' totals from state \((0,0)\):

Trial 1: \(-0.04 \times 8 + 1 = 0.68\)
Trial 2: \(-0.04 \times 8 + 1 = 0.68\)
Trial 3: \(-0.04 \times 4 - 1 = -1.16\)

Averaging the three:

\[ \overline{U}^{\pi}\big((0,0)\big) \;:=\; \frac{0.68 + 0.68 - 1.16}{3} \;=\; 0.06 \]

Worked example — state (0,1)

Trial 1: \(-0.04 \times 7 + 1 = 0.72\)
Trial 2: \(-0.04 + 1 = 0.84\)
Trial 3: \(-0.04 \times 7 + 1 = 0.72\)

\[ \overline{U}^{\pi}\big((0,1)\big) \;:=\; \frac{0.72 + 0.84 + 0.72}{3} \;=\; 0.76 \]

Read the slide carefully

These are the exact figures on the lecturer's slides — reproduce them in the exam, not your own re-derivation. (The arithmetic is illustrative: each trial value is the observed reward-to-go from that state to the terminal, then averaged across trials.) Also note the slide labels states "(new0,0)" / "(new0,1)" — this is just a rendering quirk for states \((0,0)\) and \((0,1)\).

QA trial visits state \(s\) and then collects rewards \(-0.04,\,-0.04,\,-0.04,\,+1\) before terminating. With \(\gamma=1\), what is this trial's \(u_i(s)\)? If a second trial gives \(u_i(s)=-1.12\), what is the DUE estimate of \(U^\pi(s)\)?

First trial: \(3\times(-0.04) + 1 = -0.12 + 1 = 0.88\). DUE estimate over the two trials: \(\frac{0.88 + (-1.12)}{2} = \frac{-0.24}{2} = -0.12\).

The flaw: DUE ignores the structure of the problem

DUE treats every state as an island — it learns each value purely from the raw traces that happened to pass through it. But the states are connected: the value of a state is tied to the values of its neighbours through the Bellman equation

\[ U(s) = R(s) + \gamma\,\max_{a\in Acts}\sum_{s'\in States} T(s'\mid s,a)\,U(s') \]

Because the policy is fixed, there is no maximisation over actions — \(\pi\) already tells us the action — so it simplifies to:

\[ U(s) = R(s) + \gamma \sum_{s'\in States} T(s'\mid s,\pi(s))\,U(s') \]

This simplified Bellman equation says a state's value should be consistent with its successors' values. DUE never enforces that consistency, so it converges slowly and wastes information. The methods in Parts III and IV fix exactly this.

QWhat important opportunity does Direct Utility Estimation miss, and why does the fixed policy simplify the Bellman equation?

DUE misses the connection between the utilities of different states expressed by the Bellman equation — a state's value constrains and is constrained by its neighbours'. DUE estimates each state independently from raw traces, so it converges slowly. With a fixed policy \(\pi\), the action at \(s\) is already determined, so the \(\max_a\) drops out, leaving \(U(s)=R(s)+\gamma\sum_{s'}T(s'\mid s,\pi(s))U(s')\).

QWhy is this whole family of methods called "passive," and which step of policy iteration does it correspond to?

"Passive" because the policy \(\pi\) is fixed — the agent does not change how it acts while learning. It corresponds to the policy-evaluation step (the first step of policy iteration): given a fixed \(\pi\), estimate \(U^\pi\) for every state.

· · ·

Part III · build the model, then plan

Passive RL — Adaptive Dynamic Programming

If the only thing DUE was missing is the Bellman connection, then let's put it back: estimate the transitions from experience, then solve the MDP you've reconstructed.

Adaptive Dynamic Programming (ADP) is the obvious repair. It is a model-based method: it actually builds an approximate MDP and then evaluates the policy on that approximate model, so the Bellman connection between states is fully exploited. The main idea, in three moves:

Take every \(\langle\)state, reward\(\rangle\) pair from the trials.
Use them to compute an approximation \(T\) of the transition probabilities — i.e. an approximation of the MDP.
Then compute the value of all states w.r.t. \(\pi\), exactly as in policy iteration.

It is model-based because it constructs an approximation of the MDP. The value computation is driven by the simplified Bellman update:

\[ U(s)^{\text{new}} = R(s) + \gamma \sum_{s'\in States} T(s'\mid s,\pi(s))\,U(s') \]

The only thing standing in the way is that we don't have \(T(s'\mid s,\pi(s))\) — so ADP estimates it.

Estimating T by counting

Definition — transition estimate from counts

Let \(N(s,a,s')\) count how many times taking action \(a\) in state \(s\) was observed to lead to \(s'\). Then estimate

\[ T(s'\mid s,a) := \frac{N(s,a,s')}{\sum_{s''\in States} N(s,a,s'')} \]

i.e. the empirical fraction of times \(s'\) followed \((s,a)\). The more trials, the better this approximation.

In plain English

To learn "how likely is action right from this cell to actually move me right?", just tally it: out of all the times you tried right here, what fraction actually ended up in the cell on the right? That fraction is your estimate of \(T\). It's maximum-likelihood estimation by frequency-counting.

The ADP algorithm

Algorithm 2 — ADP(s₀) 1for s ∈ States do 2 U(s) := R(s) 3end for 4 sᵢ := s₀ 5while True do 6 Execute π(sᵢ) in environment 7 sᵢ₊₁ := state-perceived 8 N(sᵢ, π(sᵢ), sᵢ₊₁) ++ # tally the transition 9 for s ∈ States do 10 T(s | sᵢ,π(sᵢ)) := N(sᵢ,π(sᵢ),s) / Σ_s' N(sᵢ,π(sᵢ),s') 11 end for 12 Compute value of all states under π, given MDP (States,Acts,R,T) 13 if sᵢ is not terminal then 14 sᵢ := sᵢ₊₁ 15 else 16 break 17 end if 18od

Line 12 is the expensive heart: a full policy-evaluation solve on the current estimated MDP, repeated after every single step.

How it converges in the grid

Running ADP on the grid, the per-state value estimates settle toward their true values, and the RMS error in utility drops toward zero. The lecturer's note: "a lot of changes happen later on, around trial 80, which is the first trial to enter the state whose value is \(-1\)." Until the agent actually falls into the trap, it has no evidence the trap exists, so its values are biased; the first encounter corrects them sharply.

Converged utility estimates from ADP (γ<1 here, so values differ from the γ=1 worked examples). Note the dip toward the −1 trap.

The catch: ADP is expensive

ADP reaches correct values in a reasonable number of trials (≈100 in the grid). But line 12 — recomputing all state values w.r.t. \(\pi\) — is, as established for policy iteration, \(O(n^3)\) in the worst case (\(n\) = number of states). For an MDP with a few million states, solving that linear system after every step is impractical.

Exam trap — where the cost lives

The expensive part is not estimating \(T\) (that's a cheap count update). It is the full policy-evaluation solve done every step — \(O(n^3)\). This is the precise pain point TD learning is designed to remove.

QHow does ADP estimate the transition probability \(T(s'\mid s,a)\), and what makes ADP "model-based"?

By frequency counting: \(T(s'\mid s,a) = \dfrac{N(s,a,s')}{\sum_{s''}N(s,a,s'')}\), the empirical fraction of times \(s'\) followed \((s,a)\). ADP is model-based because it explicitly builds an approximation of the MDP (the transition function \(T\)) and then plans on it, rather than learning values directly.

QWhat is the worst-case cost of one ADP iteration, and why is that a problem?

\(O(n^3)\) where \(n\) is the number of states — the same cost as the policy-evaluation step in policy iteration (solving a linear system). Because this is done after every perceived state, it becomes impractical for MDPs with millions of states.

QIn the grid experiment, why do the value estimates change sharply around trial 80?

Trial 80 is the first trial to enter the \(-1\) (trap) state. Before that, the agent has no experience of the penalty, so its model and values are biased; the first encounter injects new evidence and the estimates correct abruptly.

QADP initialises \(U(s):=R(s)\) rather than \(U(s):=0\). Why is that a sensible starting point?

\(R(s)\) is the one piece of value information we are given (R is assumed known). Seeding utilities with the immediate reward gives the Bellman iteration a more informed start than zero, so policy evaluation has a better initial guess to refine.

· · ·

Part IV · skip the model entirely

Passive RL — Temporal Difference Learning

What if we never build the MDP at all, and instead nudge each value a little every time experience disagrees with it? That's TD — cheaper per step, slower overall.

Temporal Difference (TD) learning avoids ADP's cost by avoiding the construction of an MDP altogether — it is a model-free approach. Instead of estimating \(T\) and solving for values, it directly updates an approximation \(U^{\pi}(\cdot)\) of the state values, one observed transition at a time. The single update rule:

\[ U^{\pi}(s_i) \;:=\; U^{\pi}(s_i) \;+\; \alpha(n)\Big(R(s_i) + \gamma\,U^{\pi}(s_{i+1}) - U^{\pi}(s_i)\Big) \]

Definition — the TD update, term by term

\(R(s_i) + \gamma\,U^{\pi}(s_{i+1})\) is the target: a fresh, one-step-better estimate of \(s_i\)'s value, built from the reward you just got plus the discounted value of where you landed. \(U^{\pi}(s_i)\) is the old estimate. Their difference is the TD error (error signal). \(\alpha(n)\) is the learning rate, controlling how far you move toward the target.

What the update is really doing

The slide spells out the reasoning:

TD updates a state's value using the discounted value of the state it encounters next, plus the reward of the current state.
It does not completely discard the old approximation — the new value is taken to be in between the old value and the target \(r + \gamma U^{\pi}(s_{i+1})\), with the blend controlled by \(\alpha\).
The learning rate decreases as the algorithm runs (that's why \(\alpha\) is written as \(\alpha(n)\), a function of the step count) — this is what guarantees convergence.
The difference between the old estimate and \(r+\gamma U^{\pi}(s_{i+1})\) is treated as an error signal that guides learning.

In plain English

Every time you take a step, you get a slightly more informed opinion about the state you just left ("given what just happened, it looks worth this much"). Rather than overwrite your old opinion, you split the difference — move a fraction \(\alpha\) of the way from your old guess toward the new one. Big surprises (large TD error) move you more; as \(\alpha\) shrinks over time, your estimates stop wobbling and settle.

The new estimate lands between the old estimate and the target — a fraction α of the way along the TD error.

The TD algorithm

Algorithm 4 — TD(s₀) 1for s ∈ States do 2 U(s) := 0 3 n := 0 4end for 5 sᵢ := s₀ 6while True do 7 Execute π(sᵢ) in environment 8 sᵢ₊₁ := state-perceived 9 Uᵖⁱ(sᵢ) := Uᵖⁱ(sᵢ) + α(n)·( R(sᵢ) + γ·Uᵖⁱ(sᵢ₊₁) − Uᵖⁱ(sᵢ) ) 10 if sᵢ is not terminal then 11 sᵢ := sᵢ₊₁ 12 n := n + 1 13 else 14 break 15 end if 16od

Notice there is no \(T\), no \(N\) counts, and no \(O(n^3)\) solve — just one cheap arithmetic update per step. That's the model-free payoff.

The trade-off: TD vs. ADP

Aspect	ADP (model-based)	TD (model-free)
Cost per (state,reward) pair	High — full \(O(n^3)\) re-solve every step	Low — one arithmetic update
Trials needed to converge	Few — ≈ 100 in the grid	Many — ≈ 500 in the grid
Builds a model of \(T\)?	Yes	No
Exploits Bellman connection	Fully (global solve)	Locally, one transition at a time

In short: ADP does more work per step but needs fewer steps; TD does almost no work per step but needs many more steps. The grid figures show TD's value curves are visibly noisier and take ~500 trials to approach the truth, versus ADP's smoother ~100.

QWrite the TD(0) update rule and label the target, old estimate, learning rate and error.

\(U^{\pi}(s_i) := U^{\pi}(s_i) + \alpha(n)\big(\underbrace{R(s_i)+\gamma U^{\pi}(s_{i+1})}_{\text{target}} - \underbrace{U^{\pi}(s_i)}_{\text{old}}\big)\).
The bracketed difference is the TD error; \(\alpha(n)\) is the learning rate. The new value is the old value moved a fraction \(\alpha\) toward the target.

QIs TD model-based or model-free? Justify in one sentence.

Model-free: it never constructs or estimates the transition function \(T\) (no MDP is built); it updates the value estimates \(U^{\pi}\) directly from observed transitions.

QWhy must the learning rate \(\alpha\) decrease over time?

To ensure convergence. Early on, large \(\alpha\) lets estimates move quickly toward better targets; as learning proceeds, shrinking \(\alpha\) damps the random fluctuations from sampling so the estimates settle on stable values instead of perpetually overshooting. (Hence \(\alpha\) is written \(\alpha(n)\), a function of the step counter.)

QState the cost/convergence trade-off between ADP and TD, with the grid numbers.

ADP costs more per step (an \(O(n^3)\) re-solve each step) but needs fewer trials (≈100). TD costs much less per step (one update) but needs many more trials (≈500) for the same convergence in the grid example.

QA state currently has \(U^\pi(s_i)=0.5\). The agent takes one step, receives \(R(s_i)=-0.04\), and lands in \(s_{i+1}\) with \(U^\pi(s_{i+1})=0.7\). With \(\gamma=1\) and \(\alpha=0.5\), what is the updated \(U^\pi(s_i)\)?

Target \(= R + \gamma U^\pi(s_{i+1}) = -0.04 + 0.7 = 0.66\). TD error \(= 0.66 - 0.5 = 0.16\). Update \(= 0.5 + 0.5\times0.16 = 0.5 + 0.08 = \mathbf{0.58}\). The estimate moves halfway toward the target.

· · ·

Part V · let the agent steer

Active RL — Letting the Policy Change

A fixed policy can only learn the parts of the world it happens to walk through. Active RL frees the agent to revise its plan as it learns — and exposes a subtle problem for the model-free case.

All three passive methods share a weakness: they learn the environment only by executing one specific policy. That makes the agent too rigid — it can miss large parts of the state space entirely, because the fixed \(\pi\) never visits them. The remedy is active learning: allow the agent to change its policy during learning. In every active method we'll see, the agent adapts its policy as it senses the states that result from acting in the environment.

The active loop: the agent sends an action aₜ; the environment returns the new state sₜ₊₁ and reward rₜ₊₁; the agent revises its policy and acts again.

Active ADP — re-plan after every step

Making ADP active is conceptually easy: change the policy after every state sensed. The new policy is the one that's optimal with respect to the current estimate of the transition probabilities. Since we have an estimated MDP in hand, we can run value iteration on it to compute that optimal policy.

Algorithm 5 — ADP-active(s₀) 1for s ∈ States do 2 U(s) := R(s) 3 π(s) := random(Acts) # start from a random policy 4end for 5 sᵢ := s₀ 6while True do 7 Execute π(sᵢ) in environment 8 sᵢ₊₁ := state-perceived 9 N(sᵢ, π(sᵢ), sᵢ₊₁) ++ 10 for s ∈ States do 11 T(s | sᵢ,π(sᵢ)) := N(sᵢ,π(sᵢ),s) / Σ_s' N(sᵢ,π(sᵢ),s') 12 end for 13 Compute value of all states under π, given MDP (States,Acts,R,T) 14 if sᵢ is not terminal then 15 for s ∈ States do 16 π(s) := argmax_a∈Acts Σ_s' T(s'|s,a)·Uᵖⁱ(s') # the new bit 17 end for 18 sᵢ := sᵢ₊₁ 19 else 20 break 21 end if 22od

Compared with plain ADP (Algorithm 2), the only additions are: a random initial policy (line 3), and the policy-improvement loop (lines 15–17) that re-derives \(\pi\) greedily from the current values and transition estimates after each step. Note the \(\max_a\) is back — because the policy is no longer fixed.

Active TD — and the problem with U^π

We can make TD active in exactly the same way: after every sensed state, set

\[ \pi(s) := \arg\max_{a\in Acts}\sum_{s'\in States} T(s'\mid s,a)\,U^{\pi}(s') \]

and leave the rest of TD unchanged. But notice this still uses \(T\) — so it is no longer model-free! We'd like an active TD method that stays model-free (keeps no estimate of the transition probabilities). To do that, we need to change utilities directly from sensed states.

The core challenge

We cannot use \(U^{\pi}(s)\) directly to choose actions in a model-free way. \(U^{\pi}\) tells you how good a state is, but to pick the best action via \(\arg\max_a \sum_{s'} T(s'\mid s,a)U^\pi(s')\) you still need \(T\). With only state-values and no model, there is no way to distinguish which action is best. This dead-end is exactly what Q-values resolve in Part VI.

QWhat concrete weakness of passive RL motivates active RL?

Passive RL learns the environment only by running one fixed policy, which makes the agent too rigid and can cause it to miss large parts of the state space (states the fixed policy never visits). Active RL lets the agent change its policy during learning so it can explore and discover those regions.

QIn active ADP, how is the policy updated after each step, and which RL-I algorithm performs the computation?

The policy is made greedy w.r.t. the current value/transition estimates: \(\pi(s):=\arg\max_a \sum_{s'} T(s'\mid s,a)U^\pi(s')\), i.e. optimal w.r.t. the current estimate of the transition probabilities. Value iteration is used to compute this new optimal policy on the estimated MDP.

QWhy does the obvious "active TD via \(\arg\max_a \sum_{s'} T U^\pi\)" fail to give us what we want, and what is the deeper obstacle?

That update still references \(T\), so it is not model-free — defeating the purpose of TD. The deeper obstacle: state-values \(U^\pi\) alone cannot tell you the best action without a model, because choosing an action requires knowing where each action leads (\(T\)). With only \(U^\pi\) and no \(T\), the agent cannot distinguish actions. Q-values fix this.

· · ·

Part VI · the destination

Active RL — Q-Learning

Store the value of each action, not just each state — and suddenly you can choose the best action with no model at all.

The fix is to stop tracking the value of a state and instead track the expected utility of taking each action in that state. That object is the Q-value.

Definition — Q-value

\(Q(s,a)\) takes a state \(s\) and an action \(a\) and returns the expected utility of executing \(a\) in \(s\) (and behaving well thereafter). Where \(U^\pi(s)\) scores a state, \(Q(s,a)\) scores a state–action pair.

Why this solves the problem

With state-values you had to ask "where would each action take me, and how good is it there?" — which needs \(T\). With Q-values, the goodness of each action is already stored. To act, you just read off \(\arg\max_a Q(s,a)\): no model, no transition function, no lookahead. The model has been "baked into" the action-values.

The Q-learning update

We modify the TD update to operate on an approximation of the Q-value instead of the state utility. Again there is an error signal, but now on the Q-value:

\[ Q(s_i,a_{\max}) := Q(s_i,a_{\max}) + \alpha(n)\Big(R(s_i) + \gamma\,\max_{a'} Q(s_{i+1},a') - Q(s_i,a_{\max})\Big) \]

Compare with TD: the old \(U^\pi(s_{i+1})\) is replaced by \(\max_{a'} Q(s_{i+1},a')\) — the value of acting optimally from the next state. That single \(\max\) is what makes Q-learning learn the optimal action-values directly.

Algorithm 6 — Q-learning(s₀) 1for s ∈ States do 2 for a ∈ Acts do 3 Q(s,a) := 0 4 end for 5end for 6 n := 0 7 sᵢ := s₀ 8while True do 9 a_max := argmax_a∈Acts Q(sᵢ, a) 10 Execute a_max in environment 11 sᵢ₊₁ := state-perceived 12 if sᵢ₊₁ is not terminal then 13 Q(sᵢ,a_max) := Q(sᵢ,a_max) + α(n)·( R(sᵢ) + γ·max_a' Q(sᵢ₊₁,a') − Q(sᵢ,a_max) ) 14 sᵢ := sᵢ₊₁ 15 else 16 break 17 end if 18od

This is a fully active, model-free method — the bottom-right corner of the Part I taxonomy. It keeps no \(T\), no \(N\) counts, never solves a linear system, and chooses actions purely from its own action-values.

Exam trap — the exploration gap

As printed (line 9), the agent always executes \(a_{\max}=\arg\max_a Q(s_i,a)\) — purely greedy. With Q's initialised to 0 and no exploration mechanism, a greedy agent can get stuck repeatedly choosing whatever first looked good and never discover better actions. In practice Q-learning needs an exploration strategy (e.g. \(\varepsilon\)-greedy). Reproduce the slide's greedy version, but be ready to name the missing ingredient: exploration vs. exploitation. Also note line 13 only updates on a non-terminal successor.

QDefine \(Q(s,a)\) and explain why action-values let an agent act without a model.

\(Q(s,a)\) is the expected utility of executing action \(a\) in state \(s\). Because the value of each action is stored explicitly, the best action is simply \(\arg\max_a Q(s,a)\) — no transition function \(T\) is needed to evaluate or compare actions. The model dependence that plagued state-values is removed.

QWrite the Q-learning update and identify the one change relative to the TD update that makes it learn optimal action-values.

\(Q(s_i,a_{\max}) := Q(s_i,a_{\max}) + \alpha(n)\big(R(s_i) + \gamma\,\max_{a'}Q(s_{i+1},a') - Q(s_i,a_{\max})\big)\).
The change: TD's \(\gamma U^\pi(s_{i+1})\) becomes \(\gamma\,\max_{a'}Q(s_{i+1},a')\) — the value of acting optimally from the next state. The \(\max\) over next-state actions is what targets the optimal policy.

QPlace Q-learning, plain TD, plain ADP and DUE in the passive/active × model-based/model-free grid.

DUE: passive, neither (no \(T\), no Bellman use — just averaging). ADP: passive, model-based. TD: passive, model-free. Q-learning: active, model-free. (Active ADP would be active, model-based.)

QSuppose \(Q(s_i,a_{\max})=0.2\), reward \(R(s_i)=-0.04\), and the next state \(s_{i+1}\) has action-values \(\{0.5, 0.9, 0.3\}\). With \(\gamma=1\), \(\alpha=0.1\), compute the updated \(Q(s_i,a_{\max})\).

\(\max_{a'}Q(s_{i+1},a') = 0.9\). Target \(= -0.04 + 1\cdot 0.9 = 0.86\). Error \(= 0.86 - 0.2 = 0.66\). Update \(= 0.2 + 0.1\times 0.66 = 0.2 + 0.066 = \mathbf{0.266}\).

Week 20 in one breath

When you no longer know the world's transition function \(T\), you can either rebuild it from counts and plan (ADP, model-based) or skip it and nudge values straight from experience (TD, model-free); freeing the policy to change turns these into active methods, and storing action-values \(Q(s,a)\) instead of state-values gives you Q-learning — the active, model-free method that picks the best action with no model at all.

Exam-readiness checklist

State why RL is needed (\(T\) unknown in the real world; \(R\) assumed known here) and reproduce the value-iteration recap + Bellman update with \(\gamma\).
Place any method in the passive/active × model-based/model-free 2×2 grid and justify the placement.
Write the Direct Utility Estimation definition and compute \(\overline U^\pi(s)\) by averaging observed reward-to-go over trials (do the \((0,0)\)→0.06 and \((0,1)\)→0.76 grid examples with \(\gamma=1\)).
Explain the flaw of DUE (ignores the Bellman connection between states) and write the simplified Bellman equation for a fixed \(\pi\) (no \(\max_a\)).
Reproduce the ADP algorithm, the count-based estimate \(T(s'\mid s,a)=N(s,a,s')/\sum N(s,a,s'')\), and explain why ADP is model-based.
Identify ADP's cost as \(O(n^3)\) per iteration (the policy-evaluation solve, done every step) and why that's impractical at scale.
Explain the grid observation that values shift around trial 80 — the first entry into the \(-1\) state.
Write the TD update, label target / old / error / learning rate, and explain why \(\alpha(n)\) decreases (convergence).
State the ADP-vs-TD trade-off with numbers: ADP ≈100 trials but \(O(n^3)\)/step; TD ≈500 trials but \(O(1)\)/step.
Reproduce active ADP (random initial \(\pi\); greedy policy-improvement \(\pi(s)=\arg\max_a\sum_{s'}T U^\pi\) via value iteration) and say why \(\max_a\) returns.
Explain why state-values \(U^\pi\) can't drive a model-free active method, and how Q-values \(Q(s,a)\) resolve it.
Write the Q-learning update and full algorithm; identify the \(\max_{a'}Q(s_{i+1},a')\) target as the change from TD, and flag the exploration/exploitation gap in the greedy version.

The Setup — Why "Learning" at All?

A one-slide recap of value iteration

Where it breaks — and the four-way split

Passive RL — Direct Utility Estimation

The grid world we'll keep returning to

What "utility of a state" means here

Worked example — state (0,0), with γ = 1

Worked example — state (0,1)

The flaw: DUE ignores the structure of the problem

Passive RL — Adaptive Dynamic Programming

Estimating T by counting

The ADP algorithm

How it converges in the grid

The catch: ADP is expensive

Passive RL — Temporal Difference Learning

What the update is really doing

The TD algorithm

The trade-off: TD vs. ADP

Active RL — Letting the Policy Change

Active ADP — re-plan after every step

Active TD — and the problem with Uπ

Active RL — Q-Learning

The Q-learning update

Week 20 in one breath

Exam-readiness checklist

Active TD — and the problem with U^π