Learning to act when you don't have a map of the world — passive vs. active learning, model-based vs. model-free, and the road to Q-learning.
Last week we knew the rules of the world and just had to solve it. This week we don't know the rules — and acting in the world is the only way to find them out.
Last week (RL I) you built the machinery for planning inside a Markov Decision Process (MDP): a world described by states, actions, a reward function \(R\), and a transition function \(T(s' \mid s,a)\) — the probability that doing action \(a\) in state \(s\) lands you in \(s'\). Crucially, the lecturer stressed the word modelling: value iteration and policy iteration both assume you already have that full model in hand.
The whole point of this week is that, in real life, you usually don't. The slide's example: in a real robotic environment, no one knows the exact probability that an action will fail. You cannot write down \(T\) for a robot arm slipping on a wet floor. Reinforcement Learning is the framework for acting under exactly that ambiguity — and the way it copes is by learning the model (or learning the values directly) from experience. Hence the word "learning."
To see precisely what breaks, recall the algorithm from RL I. Its goal: compute the utility \(U(s)\) of every state with respect to the optimal policy.
The engine underneath it is the Bellman update (with discount \(\gamma\) made explicit):
\[ U(s)^{\text{new}} \;=\; R(s) \;+\; \gamma \,\max_{a\in Acts}\, \sum_{s'\in States} T(s' \mid s,a)\,U(s') \]Line 6 of the printed algorithm omits the \(\gamma\) — but the Bellman update written underneath it includes \(\gamma\). On an exam, always write the discount factor. The "no-\(\gamma\)" version is just the special case \(\gamma=1\) used later in the worked grid examples.
The Bellman update depends on knowing \(T\). If you don't have a model of the world, you literally cannot evaluate \(\sum_{s'} T(s'\mid s,a)U(s')\), so the update is unusable. The lecturer makes one simplifying assumption for this course:
To construct or learn a model of the world (in our case, an MDP) by acting in the world — and to use that experience to act well.
RL methods are organised along two independent axes. Every method this week is a corner of this 2×2 grid:
"Passive vs. active" asks whether the policy is fixed while we learn, or whether the agent is allowed to change it. "Model-based vs. model-free" asks whether we build an explicit approximation of the MDP (the \(T\)'s) or skip that and learn values directly.
Because the Bellman update contains the term \(\sum_{s'} T(s'\mid s,a)\,U(s')\), which requires the transition function \(T\). If \(T\) is unknown, that sum cannot be computed, so the update has no defined value. Learning must first recover \(T\) (model-based) or sidestep it entirely (model-free).
Known: the reward function \(R\) (an explicit simplification — "usually \(R\) is not known, but here I will assume it is known"). Unknown / to be learned: the transition function \(T(s'\mid s,a)\). Recovering or bypassing \(T\) through experience is RL's fundamental problem.
Axis 1 — passive vs. active: passive keeps a fixed policy \(\pi\) (e.g. Direct Utility Estimation, plain ADP, plain TD); active changes the policy while learning (e.g. Active ADP, Q-learning).
Axis 2 — model-based vs. model-free: model-based builds an estimate of \(T\) (ADP); model-free updates values without ever building \(T\) (TD, Q-learning).
Fix a policy, run it many times, and just average the rewards you actually got. No model, no cleverness — but it leaves value on the table.
In passive RL the agent is handed a fixed policy \(\pi\) — it always acts the same way — and the only job is to estimate the utility of every state with respect to that policy. (This is exactly the first step of policy iteration from RL I, except now we have no \(T\) to compute it with.) The recipe: execute \(\pi\) for a number of trials, each starting from a state, and observe the rewards and successor states that come back. We call it "passive" precisely because the policy never moves.
Every example this week lives in the same 4×3 grid. The agent starts bottom-left at \((0,0)\); the top-right cell \((3,2)\) is the goal (€, reward \(+1\)); the cell below it \((3,1)\) is the trap (☢, reward \(-1\)); \((1,1)\) is a wall. Every non-terminal step costs \(-0.04\). The arrows are the fixed policy \(\pi\) being evaluated.
Three example trials (traces) are observed. Each subscript on a state is the reward collected there (\(-0.04\) for ordinary cells, \(+1\)/\(-1\) at the terminals):
Recall from RL I that the value of a single trace \(s_0, s_1, \dots\) is its discounted reward sum:
\[ \sum_{i=0}^{\infty} \gamma^{i} R(s_i) \]The true utility of a state under policy \(\pi\) is the expectation of that sum over all traces \(\pi\) could produce starting there:
\[ U^{\pi}(s) \;\equiv\; \mathbb{E}_{\text{traces } s_0 s_1 s_2 \dots}\!\left[\sum_{i=0}^{\infty} \gamma^{i} R(s_i)\right] \]We can't compute that expectation (no \(T\)), so we estimate it the most direct way possible: run trials and average.
For a trial \(i\) starting at \(s_0\) and spanning states \(s_1, s_2, \dots, s_n\), let \(u_i(s_0)\) denote the observed reward-to-go \(\displaystyle\sum_{i=1}^{n}\gamma^{i}R(s_i)\). In direct utility estimation, the utility of a state is approximated by averaging that observed value over all trials that pass through it:
\[ \overline{U}^{\pi}(s) \;:=\; \frac{\sum_{i=1}^{N_{\text{trials}}} u_i(s)}{N_{\text{trials}}} \]"Every time a trial walked through state \(s\), write down how much total reward it ended up collecting from there to the end. The value of \(s\) is just the average of those numbers." It's the same logic as estimating a die's average roll by rolling it a lot and averaging — pure Monte-Carlo sampling, no model needed.
With \(\gamma = 1\) the discounted sum is simply the total reward along the trace. Reading off the three trials' totals from state \((0,0)\):
Averaging the three:
\[ \overline{U}^{\pi}\big((0,0)\big) \;:=\; \frac{0.68 + 0.68 - 1.16}{3} \;=\; 0.06 \]These are the exact figures on the lecturer's slides — reproduce them in the exam, not your own re-derivation. (The arithmetic is illustrative: each trial value is the observed reward-to-go from that state to the terminal, then averaged across trials.) Also note the slide labels states "(new0,0)" / "(new0,1)" — this is just a rendering quirk for states \((0,0)\) and \((0,1)\).
First trial: \(3\times(-0.04) + 1 = -0.12 + 1 = 0.88\). DUE estimate over the two trials: \(\frac{0.88 + (-1.12)}{2} = \frac{-0.24}{2} = -0.12\).
DUE treats every state as an island — it learns each value purely from the raw traces that happened to pass through it. But the states are connected: the value of a state is tied to the values of its neighbours through the Bellman equation
\[ U(s) = R(s) + \gamma\,\max_{a\in Acts}\sum_{s'\in States} T(s'\mid s,a)\,U(s') \]Because the policy is fixed, there is no maximisation over actions — \(\pi\) already tells us the action — so it simplifies to:
\[ U(s) = R(s) + \gamma \sum_{s'\in States} T(s'\mid s,\pi(s))\,U(s') \]This simplified Bellman equation says a state's value should be consistent with its successors' values. DUE never enforces that consistency, so it converges slowly and wastes information. The methods in Parts III and IV fix exactly this.
DUE misses the connection between the utilities of different states expressed by the Bellman equation — a state's value constrains and is constrained by its neighbours'. DUE estimates each state independently from raw traces, so it converges slowly. With a fixed policy \(\pi\), the action at \(s\) is already determined, so the \(\max_a\) drops out, leaving \(U(s)=R(s)+\gamma\sum_{s'}T(s'\mid s,\pi(s))U(s')\).
"Passive" because the policy \(\pi\) is fixed — the agent does not change how it acts while learning. It corresponds to the policy-evaluation step (the first step of policy iteration): given a fixed \(\pi\), estimate \(U^\pi\) for every state.
If the only thing DUE was missing is the Bellman connection, then let's put it back: estimate the transitions from experience, then solve the MDP you've reconstructed.
Adaptive Dynamic Programming (ADP) is the obvious repair. It is a model-based method: it actually builds an approximate MDP and then evaluates the policy on that approximate model, so the Bellman connection between states is fully exploited. The main idea, in three moves:
It is model-based because it constructs an approximation of the MDP. The value computation is driven by the simplified Bellman update:
\[ U(s)^{\text{new}} = R(s) + \gamma \sum_{s'\in States} T(s'\mid s,\pi(s))\,U(s') \]The only thing standing in the way is that we don't have \(T(s'\mid s,\pi(s))\) — so ADP estimates it.
Let \(N(s,a,s')\) count how many times taking action \(a\) in state \(s\) was observed to lead to \(s'\). Then estimate
\[ T(s'\mid s,a) := \frac{N(s,a,s')}{\sum_{s''\in States} N(s,a,s'')} \]i.e. the empirical fraction of times \(s'\) followed \((s,a)\). The more trials, the better this approximation.
To learn "how likely is action right from this cell to actually move me right?", just tally it: out of all the times you tried right here, what fraction actually ended up in the cell on the right? That fraction is your estimate of \(T\). It's maximum-likelihood estimation by frequency-counting.
Line 12 is the expensive heart: a full policy-evaluation solve on the current estimated MDP, repeated after every single step.
Running ADP on the grid, the per-state value estimates settle toward their true values, and the RMS error in utility drops toward zero. The lecturer's note: "a lot of changes happen later on, around trial 80, which is the first trial to enter the state whose value is \(-1\)." Until the agent actually falls into the trap, it has no evidence the trap exists, so its values are biased; the first encounter corrects them sharply.
ADP reaches correct values in a reasonable number of trials (≈100 in the grid). But line 12 — recomputing all state values w.r.t. \(\pi\) — is, as established for policy iteration, \(O(n^3)\) in the worst case (\(n\) = number of states). For an MDP with a few million states, solving that linear system after every step is impractical.
The expensive part is not estimating \(T\) (that's a cheap count update). It is the full policy-evaluation solve done every step — \(O(n^3)\). This is the precise pain point TD learning is designed to remove.
By frequency counting: \(T(s'\mid s,a) = \dfrac{N(s,a,s')}{\sum_{s''}N(s,a,s'')}\), the empirical fraction of times \(s'\) followed \((s,a)\). ADP is model-based because it explicitly builds an approximation of the MDP (the transition function \(T\)) and then plans on it, rather than learning values directly.
\(O(n^3)\) where \(n\) is the number of states — the same cost as the policy-evaluation step in policy iteration (solving a linear system). Because this is done after every perceived state, it becomes impractical for MDPs with millions of states.
Trial 80 is the first trial to enter the \(-1\) (trap) state. Before that, the agent has no experience of the penalty, so its model and values are biased; the first encounter injects new evidence and the estimates correct abruptly.
\(R(s)\) is the one piece of value information we are given (R is assumed known). Seeding utilities with the immediate reward gives the Bellman iteration a more informed start than zero, so policy evaluation has a better initial guess to refine.
What if we never build the MDP at all, and instead nudge each value a little every time experience disagrees with it? That's TD — cheaper per step, slower overall.
Temporal Difference (TD) learning avoids ADP's cost by avoiding the construction of an MDP altogether — it is a model-free approach. Instead of estimating \(T\) and solving for values, it directly updates an approximation \(U^{\pi}(\cdot)\) of the state values, one observed transition at a time. The single update rule:
\[ U^{\pi}(s_i) \;:=\; U^{\pi}(s_i) \;+\; \alpha(n)\Big(R(s_i) + \gamma\,U^{\pi}(s_{i+1}) - U^{\pi}(s_i)\Big) \]\(R(s_i) + \gamma\,U^{\pi}(s_{i+1})\) is the target: a fresh, one-step-better estimate of \(s_i\)'s value, built from the reward you just got plus the discounted value of where you landed. \(U^{\pi}(s_i)\) is the old estimate. Their difference is the TD error (error signal). \(\alpha(n)\) is the learning rate, controlling how far you move toward the target.
The slide spells out the reasoning:
Every time you take a step, you get a slightly more informed opinion about the state you just left ("given what just happened, it looks worth this much"). Rather than overwrite your old opinion, you split the difference — move a fraction \(\alpha\) of the way from your old guess toward the new one. Big surprises (large TD error) move you more; as \(\alpha\) shrinks over time, your estimates stop wobbling and settle.
Notice there is no \(T\), no \(N\) counts, and no \(O(n^3)\) solve — just one cheap arithmetic update per step. That's the model-free payoff.
| Aspect | ADP (model-based) | TD (model-free) |
|---|---|---|
| Cost per (state,reward) pair | High — full \(O(n^3)\) re-solve every step | Low — one arithmetic update |
| Trials needed to converge | Few — ≈ 100 in the grid | Many — ≈ 500 in the grid |
| Builds a model of \(T\)? | Yes | No |
| Exploits Bellman connection | Fully (global solve) | Locally, one transition at a time |
In short: ADP does more work per step but needs fewer steps; TD does almost no work per step but needs many more steps. The grid figures show TD's value curves are visibly noisier and take ~500 trials to approach the truth, versus ADP's smoother ~100.
\(U^{\pi}(s_i) := U^{\pi}(s_i) + \alpha(n)\big(\underbrace{R(s_i)+\gamma U^{\pi}(s_{i+1})}_{\text{target}} - \underbrace{U^{\pi}(s_i)}_{\text{old}}\big)\).
The bracketed difference is the TD error; \(\alpha(n)\) is the learning rate. The new value is the old value moved a fraction \(\alpha\) toward the target.
Model-free: it never constructs or estimates the transition function \(T\) (no MDP is built); it updates the value estimates \(U^{\pi}\) directly from observed transitions.
To ensure convergence. Early on, large \(\alpha\) lets estimates move quickly toward better targets; as learning proceeds, shrinking \(\alpha\) damps the random fluctuations from sampling so the estimates settle on stable values instead of perpetually overshooting. (Hence \(\alpha\) is written \(\alpha(n)\), a function of the step counter.)
ADP costs more per step (an \(O(n^3)\) re-solve each step) but needs fewer trials (≈100). TD costs much less per step (one update) but needs many more trials (≈500) for the same convergence in the grid example.
Target \(= R + \gamma U^\pi(s_{i+1}) = -0.04 + 0.7 = 0.66\). TD error \(= 0.66 - 0.5 = 0.16\). Update \(= 0.5 + 0.5\times0.16 = 0.5 + 0.08 = \mathbf{0.58}\). The estimate moves halfway toward the target.
A fixed policy can only learn the parts of the world it happens to walk through. Active RL frees the agent to revise its plan as it learns — and exposes a subtle problem for the model-free case.
All three passive methods share a weakness: they learn the environment only by executing one specific policy. That makes the agent too rigid — it can miss large parts of the state space entirely, because the fixed \(\pi\) never visits them. The remedy is active learning: allow the agent to change its policy during learning. In every active method we'll see, the agent adapts its policy as it senses the states that result from acting in the environment.
Making ADP active is conceptually easy: change the policy after every state sensed. The new policy is the one that's optimal with respect to the current estimate of the transition probabilities. Since we have an estimated MDP in hand, we can run value iteration on it to compute that optimal policy.
Compared with plain ADP (Algorithm 2), the only additions are: a random initial policy (line 3), and the policy-improvement loop (lines 15–17) that re-derives \(\pi\) greedily from the current values and transition estimates after each step. Note the \(\max_a\) is back — because the policy is no longer fixed.
We can make TD active in exactly the same way: after every sensed state, set
\[ \pi(s) := \arg\max_{a\in Acts}\sum_{s'\in States} T(s'\mid s,a)\,U^{\pi}(s') \]and leave the rest of TD unchanged. But notice this still uses \(T\) — so it is no longer model-free! We'd like an active TD method that stays model-free (keeps no estimate of the transition probabilities). To do that, we need to change utilities directly from sensed states.
We cannot use \(U^{\pi}(s)\) directly to choose actions in a model-free way. \(U^{\pi}\) tells you how good a state is, but to pick the best action via \(\arg\max_a \sum_{s'} T(s'\mid s,a)U^\pi(s')\) you still need \(T\). With only state-values and no model, there is no way to distinguish which action is best. This dead-end is exactly what Q-values resolve in Part VI.
Passive RL learns the environment only by running one fixed policy, which makes the agent too rigid and can cause it to miss large parts of the state space (states the fixed policy never visits). Active RL lets the agent change its policy during learning so it can explore and discover those regions.
The policy is made greedy w.r.t. the current value/transition estimates: \(\pi(s):=\arg\max_a \sum_{s'} T(s'\mid s,a)U^\pi(s')\), i.e. optimal w.r.t. the current estimate of the transition probabilities. Value iteration is used to compute this new optimal policy on the estimated MDP.
That update still references \(T\), so it is not model-free — defeating the purpose of TD. The deeper obstacle: state-values \(U^\pi\) alone cannot tell you the best action without a model, because choosing an action requires knowing where each action leads (\(T\)). With only \(U^\pi\) and no \(T\), the agent cannot distinguish actions. Q-values fix this.
Store the value of each action, not just each state — and suddenly you can choose the best action with no model at all.
The fix is to stop tracking the value of a state and instead track the expected utility of taking each action in that state. That object is the Q-value.
\(Q(s,a)\) takes a state \(s\) and an action \(a\) and returns the expected utility of executing \(a\) in \(s\) (and behaving well thereafter). Where \(U^\pi(s)\) scores a state, \(Q(s,a)\) scores a state–action pair.
With state-values you had to ask "where would each action take me, and how good is it there?" — which needs \(T\). With Q-values, the goodness of each action is already stored. To act, you just read off \(\arg\max_a Q(s,a)\): no model, no transition function, no lookahead. The model has been "baked into" the action-values.
We modify the TD update to operate on an approximation of the Q-value instead of the state utility. Again there is an error signal, but now on the Q-value:
\[ Q(s_i,a_{\max}) := Q(s_i,a_{\max}) + \alpha(n)\Big(R(s_i) + \gamma\,\max_{a'} Q(s_{i+1},a') - Q(s_i,a_{\max})\Big) \]Compare with TD: the old \(U^\pi(s_{i+1})\) is replaced by \(\max_{a'} Q(s_{i+1},a')\) — the value of acting optimally from the next state. That single \(\max\) is what makes Q-learning learn the optimal action-values directly.
This is a fully active, model-free method — the bottom-right corner of the Part I taxonomy. It keeps no \(T\), no \(N\) counts, never solves a linear system, and chooses actions purely from its own action-values.
As printed (line 9), the agent always executes \(a_{\max}=\arg\max_a Q(s_i,a)\) — purely greedy. With Q's initialised to 0 and no exploration mechanism, a greedy agent can get stuck repeatedly choosing whatever first looked good and never discover better actions. In practice Q-learning needs an exploration strategy (e.g. \(\varepsilon\)-greedy). Reproduce the slide's greedy version, but be ready to name the missing ingredient: exploration vs. exploitation. Also note line 13 only updates on a non-terminal successor.
\(Q(s,a)\) is the expected utility of executing action \(a\) in state \(s\). Because the value of each action is stored explicitly, the best action is simply \(\arg\max_a Q(s,a)\) — no transition function \(T\) is needed to evaluate or compare actions. The model dependence that plagued state-values is removed.
\(Q(s_i,a_{\max}) := Q(s_i,a_{\max}) + \alpha(n)\big(R(s_i) + \gamma\,\max_{a'}Q(s_{i+1},a') - Q(s_i,a_{\max})\big)\).
The change: TD's \(\gamma U^\pi(s_{i+1})\) becomes \(\gamma\,\max_{a'}Q(s_{i+1},a')\) — the value of acting optimally from the next state. The \(\max\) over next-state actions is what targets the optimal policy.
DUE: passive, neither (no \(T\), no Bellman use — just averaging). ADP: passive, model-based. TD: passive, model-free. Q-learning: active, model-free. (Active ADP would be active, model-based.)
\(\max_{a'}Q(s_{i+1},a') = 0.9\). Target \(= -0.04 + 1\cdot 0.9 = 0.86\). Error \(= 0.86 - 0.2 = 0.66\). Update \(= 0.2 + 0.1\times 0.66 = 0.2 + 0.066 = \mathbf{0.266}\).
When you no longer know the world's transition function \(T\), you can either rebuild it from counts and plan (ADP, model-based) or skip it and nudge values straight from experience (TD, model-free); freeing the policy to change turns these into active methods, and storing action-values \(Q(s,a)\) instead of state-values gives you Q-learning — the active, model-free method that picks the best action with no model at all.