Week 19 — Reinforcement Learning I

Part I

What Is Reinforcement Learning?

You have met two ways for a machine to learn. This is the third — and it is the one that looks most like how a creature actually learns: by doing things and seeing what happens.

So far in this module you have studied two paradigms of machine learning. In supervised learning you are handed labelled examples (here is an input, here is the correct answer) and you learn to copy the mapping. In unsupervised learning you are handed unlabelled data and you learn its hidden structure (clusters, components, codes). Reinforcement learning is a genuinely different third paradigm.

Intuition

Imagine training a dog. You don't show it a labelled dataset of "correct moves." You let it act, and you hand out treats (reward) or scoldings (punishment). Over many episodes the dog adjusts its behaviour to get more treats. That loop — act, observe what happened, get a reward signal, adjust — is reinforcement learning.

Definition — Reinforcement Learning

In RL we have an intelligent agent whose goal is to learn a policy to act in the world such that it maximises its rewards. The policy is learnt by acting in the world, observing the results of acting — in terms of its new state and a reward signal — and updating the policy next time from those observations.

Notice the three moving parts the agent exchanges with its environment at each time step $t$: it chooses an action $a_t$, and in return the environment hands back a new state $s_{t+1}$ and a reward $r_{t+1}$. That cycle repeats forever (or until the task ends).

The RL loop (slide 3). The agent emits actions; the environment returns the next state and a reward.

Why this paradigm matters

The lecturer's framing is worth quoting directly: "RL combines learning and acting: the essence of intelligence." A supervised learner is a passive predictor; an RL agent must decide and live with the consequences, which is much closer to what we mean by intelligent behaviour.

Examples from the slides

Problems that can be modelled as RL: Pacman, Chess, the stock market, and even Mathematics. The approach is used in robotics and game playing. Many recent AI breakthroughs were solved as RL problems — most famously AlphaGo and other DeepMind achievements.

QIn one sentence, what distinguishes RL from supervised and unsupervised learning?

In supervised learning the agent is given the correct answers (labels); in unsupervised learning it is given unlabelled data to find structure in. In RL the agent is given neither — it must act in an environment and learn purely from the state transitions and reward signals its actions produce, with the goal of maximising cumulative reward.

QName the two things the environment returns to the agent after each action, and the one thing the agent sends to the environment.

The agent sends an action $a_t$. The environment returns a new state $s_{t+1}$ and a reward $r_{t+1}$. (Memorise the loop diagram — examiners love asking you to label it.)

QIs chess "supervised, unsupervised, or reinforcement learning," and why?

It fits reinforcement learning: there is an agent taking actions (moves), a changing state (the board), and a reward (win/lose), with no labelled "correct move" dataset to imitate. The agent must learn a policy by playing and observing outcomes. The slides explicitly list Chess (alongside Pacman, the stock market, and Mathematics) as an RL-modellable problem, and note AlphaGo-style breakthroughs solved board games this way.

· · ·

Part II

Markov Decision Processes: The Model

Before we can learn how to act, we need a precise mathematical description of "the world" and "acting in it." That description is the Markov Decision Process — five ingredients that capture states, actions, randomness, and reward.

The first thing we need is the mathematical / formal model of environments and agents that underlies the RL methods in this course. A very general such model is the Markov Decision Process (MDP). The whole of Part II is just unpacking the five things an MDP gives us, using one running example: a robot in a grid.

Definition — MDP (assembled from the slides)

An MDP provides a mathematical model for: a set of states the world can be in; a set of actions the agent may take in each state; a transition relation $T$ giving, for each state–action pair, a probability distribution over next states; and a reward $R$ accrued on entering a state. (We add the discount factor $\gamma$ in Part III.) The agent acts according to a policy $\pi$.

Ingredient 1 — States

The world can be in different states. In our running example we have a grid of cells (4 columns wide, 3 rows tall) and a robot that occupies one cell. The states are simply the possible positions of the robot, written as coordinates such as $s\equiv(0,3)$, $s\equiv(1,3)$, $s\equiv(2,3)$ as it moves across the bottom row.

The running grid world (slides 6–15). Robot starts bottom-left; the gold bar (+1) is the goal terminal, the bomb (−1) a disaster terminal, and the shaded cell is an impassable wall.

Exam trap — the coordinate notation

The slides are loose about coordinates. The robot demo labels the bottom row $(0,3),(1,3),(2,3)$, while the policy and reward slides use pairs like $(2,0),(3,1),(3,2)$ under a $(\text{col},\text{row})$ convention. Do not get hung up on reconciling the exact indices — they are just labels for cells. What you must remember is the layout: $+1$ goal top-right, $-1$ bomb directly below it, a wall in the centre. If asked, state your coordinate convention explicitly.

Ingredient 2 — Actions & the policy

At every state the agent executes one action from a set of available actions, based on its policy. Here there are four actions: $\uparrow,\ \rightarrow,\ \leftarrow,\ \downarrow$. A policy $\pi$ is a rule that says which action to take in each state. The slide gives a concrete deterministic example:

$$\pi(s)=\begin{cases}\uparrow, & \text{if } s=(2,0)\ \text{or}\ s=(2,1)\\[2pt]\leftarrow, & \text{if } s=(3,0)\\[2pt]\text{None}, & \text{if } s=(3,1)\ \text{or}\ s=(3,2)\\[2pt]\rightarrow, & \text{otherwise}\end{cases}$$

The two terminal cells (gold and bomb) map to None — once you reach a terminal you stop acting. Pictured as arrows, this policy steers the robot rightward and up toward the gold:

The example policy as arrows (slide 9). Terminal cells (gold, bomb) take action "None."

Ingredient 3 — The transition relation (the crucial twist)

Choosing an action results in a new state — but crucially, in MDPs taking an action does not always lead to the same outcome. The resulting states form a probability distribution depending on the state and the action. This is the transition relation $T(s'\mid s,a)$.

Worked example — a slippery "up"

If the robot at $(2,2)$ chooses $\uparrow$, the action can fail: it has a 10% chance of veering right and 10% left. So:

$T\big((2,3)\mid(2,2),\uparrow\big)=0.8$ (intended move succeeds)
$T\big((3,2)\mid(2,2),\uparrow\big)=0.1$ (slips right)
$T\big((1,2)\mid(2,2),\uparrow\big)=0.1$ (slips left)

These three probabilities sum to $1$ — they form a proper distribution over outcomes.

In plain English

The robot is clumsy. When it tries to go up, it usually does (80%), but sometimes it skids sideways (10% each way). This models real-world messiness: wheels slip, sensors lie, motors stutter. The lecturer lists the sources explicitly — errors / unreliability of actions in the physical world, inaccuracies in our model of the actions, and other sources of randomness. The "Markov" in MDP means the next state depends only on the current state and action, not the full history.

The stochastic outcome of a single "up" action (slide 10).

QThe robot at $(2,2)$ takes $\uparrow$. What is the probability it ends up not directly above where it intended?

The intended outcome $(2,3)$ has probability $0.8$, so the probability of not landing there is $1-0.8 = \mathbf{0.2}$ (split as $0.1$ left $+\,0.1$ right). Always check transition probabilities sum to 1.

Ingredient 4 — The state space (graph view)

The different choices of actions and their possible consequences induce a graph — the state space. From a state, each action is a branch; each action then fans out into its possible next states with the transition probabilities as edge weights.

The state-space graph rooted at $(2,2)$ (slide 11).

Worked example — reaching the prize

Because actions are probabilistic, executing a policy does not always lead to known results. From the start, the action sequence $\uparrow,\uparrow,\rightarrow,\rightarrow,\rightarrow$ reaches the prize only with probability $0.8^5$ — every one of the five steps must succeed (each succeeds with $0.8$). Numerically $0.8^5 = 0.32768$, so even a "perfect plan" works barely a third of the time.

QWhy is $0.8^5$ and not, say, $5\times 0.8$ the probability of the plan succeeding?

The five steps are independent stochastic events, and the plan succeeds only if all five succeed. Independent "AND" means multiply the probabilities: $0.8\times0.8\times0.8\times0.8\times0.8 = 0.8^5 \approx 0.328$. ($5\times0.8$ would be a sum, which isn't even a valid probability since it exceeds 1.)

Ingredient 5 — The reward

Executing an action results in a reward depending on the resulting state, written $R(s)$. The slides show two reward designs. Version A (no living cost):

$$R(s)=\begin{cases}1, & \text{if } s=(3,2)\quad(\text{gold})\\ -1, & \text{if } s=(3,1)\quad(\text{bomb})\\ 0, & \text{otherwise}\end{cases}$$

This gives the agent a positive reward only at the gold and a strong negative reward at the bomb, encouraging it to move toward the positive terminal. Version B (small living penalty): replace the $0$ with $-0.04$ on every non-terminal cell.

$$R(s)=\begin{cases}1, & \text{if } s=(3,2)\\ -1, & \text{if } s=(3,1)\\ -0.04, & \text{otherwise}\end{cases}$$

Reward Version B (slide 15): a $-0.04$ "living penalty" on every non-terminal state.

In plain English — why the −0.04?

The tiny negative reward on every ordinary cell is a cost of dawdling. It makes the agent "keen" to leave non-terminal states quickly rather than wandering forever. Without it, an agent with $\gamma<1$ has no positive reason to hurry; with it, every wasted step bleeds a little reward, so the optimal behaviour becomes "get to the gold efficiently — but not so recklessly that you risk the bomb." This single number tunes the whole personality of the agent (we'll see in Part III how different ranges of $R(s)$ produce wildly different optimal policies).

QReward design: what behaviour would you expect if the living penalty were a large negative number, e.g. $R(s)=-2$ on non-terminal cells?

If existing in any non-terminal state costs $-2$ per step — worse than the $-1$ bomb! — the agent becomes desperate to terminate as fast as possible, even by diving into the bomb, because ending the episode quickly (even badly) beats accumulating huge living costs. This is exactly the kind of insight slide 21's "policies for four ranges of $R(s)$" figure illustrates: the reward magnitude, not just its sign, reshapes the optimal policy.

QWhat does the "Markov" property mean for the transition relation $T(s'\mid s,a)$, and why does it make the model tractable?

The Markov property means the distribution over the next state depends only on the current state $s$ and action $a$ — written $T(s'\mid s,a)$ — and not on the full history of how the agent got to $s$. This is why we can write value/Bellman equations in terms of a single state $s$ rather than entire trajectories: each state summarises everything relevant about the past, so the optimal action depends only on where you are now, not on your story so far.

· · ·

Part III

Optimal Policies & the Value of a State

We can describe the world and a policy — but which policy is best? To answer that we need a number that scores how good it is to be somewhere. That number is the value of a state, and it rests on one clever trick: discounting.

The goal of the agent is to achieve the best outcome by executing a policy, and each state carries a reward accrued on entering it. So we need a way to compare policies. We can do this by evaluating a policy based on the states it makes the agent enter. The question is: how do we turn an endless stream of rewards into one comparable number?

The problem with adding rewards

We consider an infinite horizon — the sequence of states could be infinitely long: $R(s_0), R(s_1), R(s_2), R(s_3), \dots$ The naive idea is to just add them all:

$$R(s_0)+R(s_1)+R(s_2)+R(s_3)+\dots$$

But this fails for an infinite sequence — the sum can diverge to $\pm\infty$, and infinitely many policies would all score "$+\infty$," giving us no way to rank them.

The fix — discounted reward

Definition — Discounted Reward

$$R(s_0)+\gamma R(s_1)+\gamma^2 R(s_2)+\gamma^3 R(s_3)+\dots=\sum_{i=0}^{\infty}\gamma^i R(s_i)$$ where $\gamma$ is the discount factor, chosen so that $0\le\gamma<1$.

With $0\le\gamma<1$ the sum converges even for an infinite sequence, and it is bounded:

$$\sum_{i=0}^{\infty}\gamma^i R(s_i)\ \le\ \frac{R_{\max}}{1-\gamma}$$

In plain English — what discounting means

$\gamma$ is "how much you care about the future." Each step into the future, reward is multiplied by another factor of $\gamma<1$, so it shrinks. The implication the lecturer stresses: when we evaluate a sequence of states, we prefer sequences that earn more reward earlier rather than later — the value of later rewards diminishes. It's the maths of "a treat now beats the same treat next year," and it also conveniently makes infinite sums finite.

Worked example — the bound

The bound $\frac{R_{\max}}{1-\gamma}$ comes from a geometric series. If the best possible per-step reward is $R_{\max}=1$ and $\gamma=0.9$, then no infinite trace can be worth more than $\frac{1}{1-0.9}=\frac{1}{0.1}=10$. With $\gamma=0.99$ the ceiling rises to $100$ — a patient agent values the long run far more.

QWhy must $\gamma<1$ strictly, rather than $\gamma\le 1$?

At $\gamma=1$ the factors $\gamma^i$ never shrink (they all equal 1), so $\sum_{i=0}^\infty R(s_i)$ is just the un-discounted sum again — which can diverge for an infinite horizon. The geometric bound $\frac{R_{\max}}{1-\gamma}$ also blows up ($\tfrac{1}{0}$) at $\gamma=1$. We need $\gamma<1$ for guaranteed convergence. ($\gamma\ge0$ is required so future rewards aren't perversely negated.)

QWith $\gamma=0.5$ and $R(s_i)=2$ for every state forever, what is the discounted return?

$\sum_{i=0}^\infty 0.5^i\cdot 2 = 2\cdot\frac{1}{1-0.5}=2\cdot 2 = \mathbf{4}$. (Geometric series $\sum\gamma^i=\frac{1}{1-\gamma}$.) Notice it equals the bound $\frac{R_{\max}}{1-\gamma}=\frac{2}{0.5}=4$ here, because every reward is already at the maximum.

Utility of a state under a policy

To assess a policy we ask: what discounted reward does it return when executed from a given starting state? That quantity is the utility of a state under the policy, written $U^\pi(s)$.

Definition — Utility (deterministic case)

For deterministic actions, $U^\pi(s)$ is simply the discounted reward accrued by executing $\pi$ starting from $s$: $\;U^\pi(s)=\sum_{i=0}^\infty \gamma^i R(s_i)$, where $s_0=s$ and each $s_{i+1}$ is the (single, certain) state that $\pi$ leads to.

Worked example from the slide

Set $\gamma=0.9$ and $R(s)=-0.04$ for all non-terminal $s$. The slide writes the utility of the start state $(0,0)$ along a path that pays five $-0.04$ steps then collects the $+1$ gold:

$U^\pi\big((0,0)\big)=-0.04-0.9\times0.04-0.9^2\times0.04-0.9^3\times0.04-0.9^4\times0.04+0.9^5\times1$

The structure to learn: each successive step's reward is multiplied by one more factor of $\gamma$, and the final $+1$ is discounted by $\gamma^5$ because it arrives at step 5.

Exam trap — the slide's printed number

The slide states this evaluates to $-0.084$. If you actually compute the expression as written, you get $-0.04(1+0.9+0.81+0.729+0.6561)+0.9^5 = -0.1638+0.5905 \approx +0.43$, not $-0.084$. The printed $-0.084$ does not match the printed formula (a slide slip, or a different/longer path was intended). For the exam: reproduce the method exactly — geometric discounting of the living penalties plus the $\gamma^k$-discounted terminal reward — and show your arithmetic. Don't memorise $-0.084$ as a result of this formula; if the lecturer's figure reappears, note the inconsistency rather than being thrown by it.

Utility when actions are stochastic

As we saw, actions are generally probabilistic, so executing the same policy can produce many different traces, each with its own probability. We therefore define the utility as the expected (average) discounted return over all possible traces:

Definition — Utility (stochastic case)

$$U^\pi(s)\ \equiv\ \mathbb{E}_{\text{possible traces } s_0 s_1 s_2 \dots}\!\left(\sum_{i=0}^{\infty}\gamma^i R(s_i)\right)$$

In plain English

Because the robot is clumsy, "run policy $\pi$ from $s$" isn't one story — it's a whole bushy tree of possible stories, each weighted by how likely it is. The utility is the probability-weighted average of the discounted returns of all those stories. The expectation $\mathbb{E}$ is doing exactly the averaging-over-randomness job.

The optimal policy and the value of a state

Definition — Optimal Policy & Value

The optimal policy from a state $s$ is the policy with the maximum expected return if executed from that state: $$\pi^*(s)=\operatorname*{argmax}_{\pi} U^\pi(s).$$ We call $U^{\pi^*}(s)$ the value of a state, and denote it $U(s)$ from now on.

So $U(s)$ is "the best expected discounted reward achievable from $s$, playing optimally." Russell & Norvig's Figure 17.2 (on the slide) illustrates two things: (a) the optimal policy for the stochastic grid with $R(s)=-0.04$, and (b) how the optimal policy changes across four different ranges of $R(s)$ — from "rush to any exit" when living is very costly to "avoid the $-1$ at all costs, even loop forever" when living is pleasant.

Solving the grid (with $\gamma$ and the $-0.04$ penalty) gives these state values:

State values $U(s)$ for the grid (slides 22, 25, 30). Values rise as you near the $+1$ gold.

Read the pattern: the cell next to the gold is worth $0.918$; the bottom-right corner, near the bomb and far from the gold, is worth only $0.388$. Value flows outward from the goal, decaying with distance and risk.

The recursive characterisation (the key theorem)

Theorem — optimal action from values

$$\pi^*(s)=\operatorname*{argmax}_{a\in \text{Acts}}\sum_{s'\in \text{States}} T(s'\mid s,a)\,U(s')$$

In plain English — why this is the engine of everything

A crucial observation: to know the best action in a state, you only need the values of its successor states. For each candidate action you compute its expected next-state value — sum over possible next states $s'$ of (probability of landing in $s'$) $\times$ (value of $s'$) — and pick the action with the biggest expected value. This is a recursive characterisation of the optimal policy, and it is the basis of all exact methods to solve MDPs (Part IV).

Exam trap — expected value vs. best-case value

The optimal action is not "the action whose most likely outcome has the highest value," nor "the action that could reach the highest-value cell." It is the action with the highest expected value, $\sum_{s'} T(s'\mid s,a)U(s')$ — averaging over all outcomes weighted by probability. A slippery action that might reach a great cell but probably skids into the bomb can have low expected value. (This is the very puzzle slide 31 asks you to ponder — see Part IV.)

QYou're in a state with two actions. Action A: 0.9 chance of a cell worth 0.5, 0.1 chance of a cell worth 1.0. Action B: 1.0 chance of a cell worth 0.6. Which does $\pi^*$ pick?

Compute expected successor value for each:

A: $0.9\times0.5 + 0.1\times1.0 = 0.45+0.10 = 0.55$.
B: $1.0\times0.6 = 0.60$.

$\pi^*$ picks Action B ($0.60 > 0.55$) — even though only A can reach the highest-value cell ($1.0$). This is the expected-value-not-best-case lesson in numbers.

QIn the value grid, the corner $(3,0)$-ish cell is $0.388$ while the cell next to the gold is $0.918$. Explain the gap intuitively.

Value is the best achievable expected discounted reward from a cell. The cell beside the gold is one reliable step from $+1$, so almost the full reward survives discounting → $0.918$. The far corner is many steps away (so the $+1$ is heavily discounted by $\gamma^{\text{many}}$), it pays several $-0.04$ living penalties en route, and it sits near the $-1$ bomb (risk of skidding in). Distance + living cost + danger all push its value down to $0.388$.

· · ·

Part IV

Solving MDPs Exactly

We have defined the optimal policy and the value of a state — but defining is not computing. This part is the payoff: two algorithms that actually crunch an MDP into an optimal (or near-optimal) policy, both built on a single equation due to Bellman.

Until now we discussed definitions (optimal policy, value of a state) and a theorem (the optimal policy is recursively specified), but never how to compute them. The focus now is exactly that — computing these concepts, most importantly the optimal policy, which lets us exactly solve a given MDP. The first method is value iteration: approximate the state values under an optimal policy, then read off a policy from those values.

Bellman's Equation

Theorem — Bellman's Equation

$$U(s)=R(s)+\gamma\max_{a\in\text{Acts}}\sum_{s'\in\text{States}}T(s'\mid s,a)\,U(s')$$

In plain English

The value of being in state $s$ equals the reward you get for being here ($R(s)$), plus the discounted value of where you'll be next if you act optimally. That second term is exactly the recursive theorem from Part III: take the best action $a$ (the $\max$), and for that action average the successor values weighted by transition probabilities. Bellman's equation is a recursive characterisation of the value of a state, and it is just a restatement of the optimal-policy characterisation $\pi^*(s)=\operatorname*{argmax}_a\sum_{s'}T(s'\mid s,a)U(s')$ — the equation finds the value, the argmax finds the action.

Exam trap — argmax vs max

Bellman's equation uses $\max$ (it returns the best value, a number). The policy uses $\operatorname{argmax}$ (it returns the best action). Same inner expression $\sum_{s'}T(s'\mid s,a)U(s')$, different output type. Also note $R(s)$ sits outside the discounted max term — the immediate reward is not discounted; only the future is.

Bellman's Update & Value Iteration

Turn the equation into an assignment and you get Bellman's update — the heart of value iteration:

Definition — Bellman's Update

$$U(s)_{\text{new}}=R(s)+\gamma\max_{a\in\text{Acts}}\sum_{s'\in\text{States}}T(s'\mid s,a)\,U(s')$$

The algorithm initialises all values to $0$, then repeatedly applies the update to every state until the values stop changing by more than a tolerance $\epsilon$:

Algorithm 1 — Value Iteration for s ∈ States do U(s)_new := 0 end for U(s) := U(s)_new for s ∈ States do # Bellman update, all states U(s)_new := R(s) + γ max_a Σ_s' T(s'|s,a) U(s') end for for s ∈ States do # convergence check if ||U(s)_new − U(s)|| > ε then go to line 4 end for

In plain English — how it converges

Start with everyone guessing "my value is 0." Each sweep, every state looks one step ahead, sees the best it can expect from its neighbours, and corrects its guess. Information about the $+1$ gold ripples outward one cell per sweep. After enough sweeps the guesses stop moving — they have converged to the true values. The lecturer notes the values get arbitrarily close to exact; on termination the error is bounded: $\;\lVert U(s)-\hat U(s)\rVert<\epsilon$.

Value-iteration convergence (slide 30). Each curve is one state's value estimate over sweeps; they settle to the true values. States far from the goal (e.g. (4,1)) even dip negative before climbing.

Reading a policy off the values

Once values are computed, the policy follows by picking, in each state, the maximum-expected-utility action:

$$\pi_{\text{valiter}}(s)=\operatorname*{argmax}_{a\in\text{Acts}}\sum_{s'\in\text{States}}T(s'\mid s,a)\,U(s')$$

The lecturer's LGT question (think about it!)

"Why don't we take the action leading to the state with the maximum computed value?" — i.e. why use the expected value over outcomes instead of just walking toward the single highest-valued neighbour? Answer (for your own benefit): because actions are stochastic. The action that points at the best cell might, with high probability, slip somewhere terrible (e.g. the bomb). Only the expectation $\sum_{s'}T(s'\mid s,a)U(s')$ accounts for all the places an action might actually send you, weighted by how likely each is. Greedily chasing the best neighbour ignores the risk of the action failing.

Value iteration gives only an ε-optimal policy

Because the computed values carry error $\lVert U(s)-\hat U(s)\rVert<\epsilon$, the resulting policy is not guaranteed optimal — it is ε-optimal. Its shortfall in achieved reward is bounded:

$$\big\lVert U^{\pi_{\text{valiter}}}(s)-U(s)\big\rVert<\frac{2\epsilon\gamma}{1-\gamma}$$

In plain English

"Close-enough values" produce a "close-enough policy." The reward you lose by following value iteration's policy instead of the truly optimal one is at most $\frac{2\epsilon\gamma}{1-\gamma}$. Smaller tolerance $\epsilon$ → tighter bound; larger $\gamma$ (more future-focused) → looser bound, because small value errors compound over a long horizon.

QIf you halve the convergence tolerance $\epsilon$, what happens to the worst-case policy sub-optimality bound?

The bound $\frac{2\epsilon\gamma}{1-\gamma}$ is linear in $\epsilon$, so halving $\epsilon$ halves the bound. The trade-off: a smaller $\epsilon$ means value iteration needs more sweeps to terminate. (Note $\gamma$ and the bound move together — pushing $\gamma\to 1$ makes the bound explode.)

Policy iteration — a provably optimal alternative

Is there a way to compute a policy we know is optimal (not just ε-optimal)? Yes: policy iteration. It produces an optimal policy — one for which no better policy exists in terms of reward when executed.

Algorithm 2 — Policy Iteration while π_new ≠ π do π := π_new Compute Uπ(s), for all s ∈ States # step 1: policy evaluation for s ∈ States do # step 2: policy improvement if max_a Σ_s' γ T(s'|s,a) Uπ(s') > Uπ(s) then π_new(s) := argmax_a Σ_s' γ T(s'|s,a) Uπ(s') end if end for od

In plain English — the two-step dance

Policy iteration alternates two steps until the policy stops changing:

Step 1 — Policy evaluation: compute the value $U^\pi(s)$ of the current policy in every state. With $\pi$ fixed there's no $\max$, so Bellman's equation becomes a system of linear equations — solvable by linear programming (the lecturer's hint: "guess it from Bellman's equation!").

Step 2 — Policy improvement: for each state, check whether some action gives a better expected utility than the current policy's action; if so, switch to it. Repeat. When no state can be improved, the policy is optimal.

Exam trap — the cost of policy evaluation

Step 1 is the bottleneck: solving the linear system has worst-case running time $O(n^3)$ in the number of states $n$ — prohibitive for realistic MDPs. This is precisely why we don't always use policy iteration even though it gives exact optimal policies.

Value iteration vs. policy iteration

Aspect	Value Iteration	Policy Iteration
Policy quality	$\epsilon$-optimal (approximate)	Exactly optimal
Number of iterations	More iterations	Fewer iterations
Cost per iteration	Cheap Bellman sweeps	Step 1 (policy evaluation) is $O(n^3)$ — the bottleneck
Practical verdict	Often better overall for realistic systems	The $O(n^3)$ step makes it worse overall for any realistic system

The punchline

Policy iteration needs fewer iterations and gives a better (truly optimal) answer — yet value iteration usually wins in practice, because each policy-iteration step pays a steep $O(n^3)$ price to exactly evaluate the current policy. Quality per iteration vs. cost per iteration: the classic trade-off.

QYour MDP has a few million states. A colleague insists on policy iteration "because it's exact." What's your objection?

Policy iteration's policy-evaluation step is $O(n^3)$ in the number of states. With $n$ in the millions, $n^3$ is astronomically large — a single iteration is intractable. Value iteration, with its cheap per-state Bellman sweeps, will reach an $\epsilon$-optimal policy far faster, and you can drive $\epsilon$ small enough that the sub-optimality $\frac{2\epsilon\gamma}{1-\gamma}$ is negligible. "Exact" is worthless if you can't finish even one step.

QIn policy iteration's improvement step, what condition triggers a policy change at state $s$?

The policy at $s$ changes when some action's expected discounted successor value strictly exceeds the current policy's value there: $\;\max_{a}\sum_{s'}\gamma\,T(s'\mid s,a)U^\pi(s') > U^\pi(s)$. Then $\pi_{\text{new}}(s)$ is set to the $\operatorname{argmax}$ action. When no state satisfies this strict inequality, the loop ($\pi_{\text{new}}\neq\pi$) ends and the policy is optimal.

QWrite Bellman's equation from memory and identify each of its four pieces.

$U(s)=R(s)+\gamma\max_{a\in\text{Acts}}\sum_{s'\in\text{States}}T(s'\mid s,a)U(s')$.

$R(s)$ — immediate (undiscounted) reward for being in $s$. $\gamma$ — discount factor weighting the future. $\max_a$ — choose the best action (this makes it the optimal-value equation). $\sum_{s'}T(s'\mid s,a)U(s')$ — expected value of the next state under action $a$. Drop the $\max$ (fix $a=\pi(s)$) and you get the equation policy iteration solves in step 1.

QTrue or false: value iteration computes the optimal policy directly, then derives values from it.

False — it's the reverse. Value iteration first approximates the values $U(s)$ via repeated Bellman updates, and then reads off a policy with $\pi_{\text{valiter}}(s)=\operatorname{argmax}_a\sum_{s'}T(s'\mid s,a)U(s')$. (Policy iteration is the one that works directly with policies, alternating evaluation and improvement.)

· · ·

The week in one breath

Reinforcement learning is an agent learning to act by trial and reward; we model its world as a Markov Decision Process (states, stochastic transitions $T$, rewards $R$, discount $\gamma$); we score behaviour by the discounted expected return, defining the value $U(s)$ of acting optimally; and Bellman's equation lets us compute it exactly — approximately and cheaply via value iteration ($\epsilon$-optimal), or exactly but expensively via policy iteration ($O(n^3)$ per step).

Exam-readiness checklist

State the three components exchanged in the RL loop ($a_t$ out; $s_{t+1}$, $r_{t+1}$ back) and place RL beside supervised/unsupervised learning.
List the ingredients of an MDP and write a transition relation $T(s'\mid s,a)$ for a stochastic action, checking the outcome probabilities sum to 1.
Explain why a naive infinite-horizon reward sum fails, and write the discounted return $\sum_{i=0}^\infty\gamma^i R(s_i)$ with the bound $\frac{R_{\max}}{1-\gamma}$; justify $0\le\gamma<1$.
Define $U^\pi(s)$ in both the deterministic and stochastic (expectation-over-traces) cases.
Define the optimal policy $\pi^*(s)=\operatorname{argmax}_\pi U^\pi(s)$ and the value $U(s)=U^{\pi^*}(s)$.
State and explain the recursive theorem $\pi^*(s)=\operatorname{argmax}_a\sum_{s'}T(s'\mid s,a)U(s')$ — best action needs only successor values.
Reproduce a discounted-return calculation (living penalties $+$ $\gamma^k$-discounted terminal), showing every term — and know the slide's printed $-0.084$ doesn't match its own formula.
Write Bellman's Equation and Bellman's Update; distinguish $\max$ (value) from $\operatorname{argmax}$ (action), and note $R(s)$ is undiscounted.
Write the Value Iteration algorithm, its initialisation, its convergence test $\lVert U_{\text{new}}-U\rVert>\epsilon$, and its value error bound $<\epsilon$.
Derive a policy from values and answer the LGT question (expected value over outcomes, not the single best neighbour, because actions are stochastic).
State the $\epsilon$-optimality bound $\frac{2\epsilon\gamma}{1-\gamma}$ and explain how $\epsilon$ and $\gamma$ affect it.
Write the Policy Iteration algorithm; explain policy evaluation (linear system / LP, $O(n^3)$) vs. policy improvement (switch on strict improvement).
Compare VI and PI on quality, iteration count, and per-iteration cost — and explain why VI usually wins in practice.
Recall the named results and sources: Bellman's equation; "RL combines learning and acting: the essence of intelligence"; AlphaGo/DeepMind; review Russell & Norvig 17.1–17.3.