The third paradigm of machine learning — an agent that learns to act by trial, reward, and consequence — and the exact mathematics for solving it.
You have met two ways for a machine to learn. This is the third — and it is the one that looks most like how a creature actually learns: by doing things and seeing what happens.
So far in this module you have studied two paradigms of machine learning. In supervised learning you are handed labelled examples (here is an input, here is the correct answer) and you learn to copy the mapping. In unsupervised learning you are handed unlabelled data and you learn its hidden structure (clusters, components, codes). Reinforcement learning is a genuinely different third paradigm.
Imagine training a dog. You don't show it a labelled dataset of "correct moves." You let it act, and you hand out treats (reward) or scoldings (punishment). Over many episodes the dog adjusts its behaviour to get more treats. That loop — act, observe what happened, get a reward signal, adjust — is reinforcement learning.
In RL we have an intelligent agent whose goal is to learn a policy to act in the world such that it maximises its rewards. The policy is learnt by acting in the world, observing the results of acting — in terms of its new state and a reward signal — and updating the policy next time from those observations.
Notice the three moving parts the agent exchanges with its environment at each time step $t$: it chooses an action $a_t$, and in return the environment hands back a new state $s_{t+1}$ and a reward $r_{t+1}$. That cycle repeats forever (or until the task ends).
The lecturer's framing is worth quoting directly: "RL combines learning and acting: the essence of intelligence." A supervised learner is a passive predictor; an RL agent must decide and live with the consequences, which is much closer to what we mean by intelligent behaviour.
Problems that can be modelled as RL: Pacman, Chess, the stock market, and even Mathematics. The approach is used in robotics and game playing. Many recent AI breakthroughs were solved as RL problems — most famously AlphaGo and other DeepMind achievements.
In supervised learning the agent is given the correct answers (labels); in unsupervised learning it is given unlabelled data to find structure in. In RL the agent is given neither — it must act in an environment and learn purely from the state transitions and reward signals its actions produce, with the goal of maximising cumulative reward.
The agent sends an action $a_t$. The environment returns a new state $s_{t+1}$ and a reward $r_{t+1}$. (Memorise the loop diagram — examiners love asking you to label it.)
It fits reinforcement learning: there is an agent taking actions (moves), a changing state (the board), and a reward (win/lose), with no labelled "correct move" dataset to imitate. The agent must learn a policy by playing and observing outcomes. The slides explicitly list Chess (alongside Pacman, the stock market, and Mathematics) as an RL-modellable problem, and note AlphaGo-style breakthroughs solved board games this way.
Before we can learn how to act, we need a precise mathematical description of "the world" and "acting in it." That description is the Markov Decision Process — five ingredients that capture states, actions, randomness, and reward.
The first thing we need is the mathematical / formal model of environments and agents that underlies the RL methods in this course. A very general such model is the Markov Decision Process (MDP). The whole of Part II is just unpacking the five things an MDP gives us, using one running example: a robot in a grid.
An MDP provides a mathematical model for: a set of states the world can be in; a set of actions the agent may take in each state; a transition relation $T$ giving, for each state–action pair, a probability distribution over next states; and a reward $R$ accrued on entering a state. (We add the discount factor $\gamma$ in Part III.) The agent acts according to a policy $\pi$.
The world can be in different states. In our running example we have a grid of cells (4 columns wide, 3 rows tall) and a robot that occupies one cell. The states are simply the possible positions of the robot, written as coordinates such as $s\equiv(0,3)$, $s\equiv(1,3)$, $s\equiv(2,3)$ as it moves across the bottom row.
The slides are loose about coordinates. The robot demo labels the bottom row $(0,3),(1,3),(2,3)$, while the policy and reward slides use pairs like $(2,0),(3,1),(3,2)$ under a $(\text{col},\text{row})$ convention. Do not get hung up on reconciling the exact indices — they are just labels for cells. What you must remember is the layout: $+1$ goal top-right, $-1$ bomb directly below it, a wall in the centre. If asked, state your coordinate convention explicitly.
At every state the agent executes one action from a set of available actions, based on its policy. Here there are four actions: $\uparrow,\ \rightarrow,\ \leftarrow,\ \downarrow$. A policy $\pi$ is a rule that says which action to take in each state. The slide gives a concrete deterministic example:
The two terminal cells (gold and bomb) map to None — once you reach a terminal you stop acting. Pictured as arrows, this policy steers the robot rightward and up toward the gold:
Choosing an action results in a new state — but crucially, in MDPs taking an action does not always lead to the same outcome. The resulting states form a probability distribution depending on the state and the action. This is the transition relation $T(s'\mid s,a)$.
If the robot at $(2,2)$ chooses $\uparrow$, the action can fail: it has a 10% chance of veering right and 10% left. So:
$T\big((2,3)\mid(2,2),\uparrow\big)=0.8$ (intended move succeeds)
$T\big((3,2)\mid(2,2),\uparrow\big)=0.1$ (slips right)
$T\big((1,2)\mid(2,2),\uparrow\big)=0.1$ (slips left)
These three probabilities sum to $1$ — they form a proper distribution over outcomes.
The robot is clumsy. When it tries to go up, it usually does (80%), but sometimes it skids sideways (10% each way). This models real-world messiness: wheels slip, sensors lie, motors stutter. The lecturer lists the sources explicitly — errors / unreliability of actions in the physical world, inaccuracies in our model of the actions, and other sources of randomness. The "Markov" in MDP means the next state depends only on the current state and action, not the full history.
The intended outcome $(2,3)$ has probability $0.8$, so the probability of not landing there is $1-0.8 = \mathbf{0.2}$ (split as $0.1$ left $+\,0.1$ right). Always check transition probabilities sum to 1.
The different choices of actions and their possible consequences induce a graph — the state space. From a state, each action is a branch; each action then fans out into its possible next states with the transition probabilities as edge weights.
Because actions are probabilistic, executing a policy does not always lead to known results. From the start, the action sequence $\uparrow,\uparrow,\rightarrow,\rightarrow,\rightarrow$ reaches the prize only with probability $0.8^5$ — every one of the five steps must succeed (each succeeds with $0.8$). Numerically $0.8^5 = 0.32768$, so even a "perfect plan" works barely a third of the time.
The five steps are independent stochastic events, and the plan succeeds only if all five succeed. Independent "AND" means multiply the probabilities: $0.8\times0.8\times0.8\times0.8\times0.8 = 0.8^5 \approx 0.328$. ($5\times0.8$ would be a sum, which isn't even a valid probability since it exceeds 1.)
Executing an action results in a reward depending on the resulting state, written $R(s)$. The slides show two reward designs. Version A (no living cost):
This gives the agent a positive reward only at the gold and a strong negative reward at the bomb, encouraging it to move toward the positive terminal. Version B (small living penalty): replace the $0$ with $-0.04$ on every non-terminal cell.
The tiny negative reward on every ordinary cell is a cost of dawdling. It makes the agent "keen" to leave non-terminal states quickly rather than wandering forever. Without it, an agent with $\gamma<1$ has no positive reason to hurry; with it, every wasted step bleeds a little reward, so the optimal behaviour becomes "get to the gold efficiently — but not so recklessly that you risk the bomb." This single number tunes the whole personality of the agent (we'll see in Part III how different ranges of $R(s)$ produce wildly different optimal policies).
If existing in any non-terminal state costs $-2$ per step — worse than the $-1$ bomb! — the agent becomes desperate to terminate as fast as possible, even by diving into the bomb, because ending the episode quickly (even badly) beats accumulating huge living costs. This is exactly the kind of insight slide 21's "policies for four ranges of $R(s)$" figure illustrates: the reward magnitude, not just its sign, reshapes the optimal policy.
The Markov property means the distribution over the next state depends only on the current state $s$ and action $a$ — written $T(s'\mid s,a)$ — and not on the full history of how the agent got to $s$. This is why we can write value/Bellman equations in terms of a single state $s$ rather than entire trajectories: each state summarises everything relevant about the past, so the optimal action depends only on where you are now, not on your story so far.
We can describe the world and a policy — but which policy is best? To answer that we need a number that scores how good it is to be somewhere. That number is the value of a state, and it rests on one clever trick: discounting.
The goal of the agent is to achieve the best outcome by executing a policy, and each state carries a reward accrued on entering it. So we need a way to compare policies. We can do this by evaluating a policy based on the states it makes the agent enter. The question is: how do we turn an endless stream of rewards into one comparable number?
We consider an infinite horizon — the sequence of states could be infinitely long: $R(s_0), R(s_1), R(s_2), R(s_3), \dots$ The naive idea is to just add them all:
But this fails for an infinite sequence — the sum can diverge to $\pm\infty$, and infinitely many policies would all score "$+\infty$," giving us no way to rank them.
$$R(s_0)+\gamma R(s_1)+\gamma^2 R(s_2)+\gamma^3 R(s_3)+\dots=\sum_{i=0}^{\infty}\gamma^i R(s_i)$$ where $\gamma$ is the discount factor, chosen so that $0\le\gamma<1$.
With $0\le\gamma<1$ the sum converges even for an infinite sequence, and it is bounded:
$\gamma$ is "how much you care about the future." Each step into the future, reward is multiplied by another factor of $\gamma<1$, so it shrinks. The implication the lecturer stresses: when we evaluate a sequence of states, we prefer sequences that earn more reward earlier rather than later — the value of later rewards diminishes. It's the maths of "a treat now beats the same treat next year," and it also conveniently makes infinite sums finite.
The bound $\frac{R_{\max}}{1-\gamma}$ comes from a geometric series. If the best possible per-step reward is $R_{\max}=1$ and $\gamma=0.9$, then no infinite trace can be worth more than $\frac{1}{1-0.9}=\frac{1}{0.1}=10$. With $\gamma=0.99$ the ceiling rises to $100$ — a patient agent values the long run far more.
At $\gamma=1$ the factors $\gamma^i$ never shrink (they all equal 1), so $\sum_{i=0}^\infty R(s_i)$ is just the un-discounted sum again — which can diverge for an infinite horizon. The geometric bound $\frac{R_{\max}}{1-\gamma}$ also blows up ($\tfrac{1}{0}$) at $\gamma=1$. We need $\gamma<1$ for guaranteed convergence. ($\gamma\ge0$ is required so future rewards aren't perversely negated.)
$\sum_{i=0}^\infty 0.5^i\cdot 2 = 2\cdot\frac{1}{1-0.5}=2\cdot 2 = \mathbf{4}$. (Geometric series $\sum\gamma^i=\frac{1}{1-\gamma}$.) Notice it equals the bound $\frac{R_{\max}}{1-\gamma}=\frac{2}{0.5}=4$ here, because every reward is already at the maximum.
To assess a policy we ask: what discounted reward does it return when executed from a given starting state? That quantity is the utility of a state under the policy, written $U^\pi(s)$.
For deterministic actions, $U^\pi(s)$ is simply the discounted reward accrued by executing $\pi$ starting from $s$: $\;U^\pi(s)=\sum_{i=0}^\infty \gamma^i R(s_i)$, where $s_0=s$ and each $s_{i+1}$ is the (single, certain) state that $\pi$ leads to.
Set $\gamma=0.9$ and $R(s)=-0.04$ for all non-terminal $s$. The slide writes the utility of the start state $(0,0)$ along a path that pays five $-0.04$ steps then collects the $+1$ gold:
$U^\pi\big((0,0)\big)=-0.04-0.9\times0.04-0.9^2\times0.04-0.9^3\times0.04-0.9^4\times0.04+0.9^5\times1$
The structure to learn: each successive step's reward is multiplied by one more factor of $\gamma$, and the final $+1$ is discounted by $\gamma^5$ because it arrives at step 5.
The slide states this evaluates to $-0.084$. If you actually compute the expression as written, you get $-0.04(1+0.9+0.81+0.729+0.6561)+0.9^5 = -0.1638+0.5905 \approx +0.43$, not $-0.084$. The printed $-0.084$ does not match the printed formula (a slide slip, or a different/longer path was intended). For the exam: reproduce the method exactly — geometric discounting of the living penalties plus the $\gamma^k$-discounted terminal reward — and show your arithmetic. Don't memorise $-0.084$ as a result of this formula; if the lecturer's figure reappears, note the inconsistency rather than being thrown by it.
As we saw, actions are generally probabilistic, so executing the same policy can produce many different traces, each with its own probability. We therefore define the utility as the expected (average) discounted return over all possible traces:
$$U^\pi(s)\ \equiv\ \mathbb{E}_{\text{possible traces } s_0 s_1 s_2 \dots}\!\left(\sum_{i=0}^{\infty}\gamma^i R(s_i)\right)$$
Because the robot is clumsy, "run policy $\pi$ from $s$" isn't one story — it's a whole bushy tree of possible stories, each weighted by how likely it is. The utility is the probability-weighted average of the discounted returns of all those stories. The expectation $\mathbb{E}$ is doing exactly the averaging-over-randomness job.
The optimal policy from a state $s$ is the policy with the maximum expected return if executed from that state: $$\pi^*(s)=\operatorname*{argmax}_{\pi} U^\pi(s).$$ We call $U^{\pi^*}(s)$ the value of a state, and denote it $U(s)$ from now on.
So $U(s)$ is "the best expected discounted reward achievable from $s$, playing optimally." Russell & Norvig's Figure 17.2 (on the slide) illustrates two things: (a) the optimal policy for the stochastic grid with $R(s)=-0.04$, and (b) how the optimal policy changes across four different ranges of $R(s)$ — from "rush to any exit" when living is very costly to "avoid the $-1$ at all costs, even loop forever" when living is pleasant.
Solving the grid (with $\gamma$ and the $-0.04$ penalty) gives these state values:
Read the pattern: the cell next to the gold is worth $0.918$; the bottom-right corner, near the bomb and far from the gold, is worth only $0.388$. Value flows outward from the goal, decaying with distance and risk.
$$\pi^*(s)=\operatorname*{argmax}_{a\in \text{Acts}}\sum_{s'\in \text{States}} T(s'\mid s,a)\,U(s')$$
A crucial observation: to know the best action in a state, you only need the values of its successor states. For each candidate action you compute its expected next-state value — sum over possible next states $s'$ of (probability of landing in $s'$) $\times$ (value of $s'$) — and pick the action with the biggest expected value. This is a recursive characterisation of the optimal policy, and it is the basis of all exact methods to solve MDPs (Part IV).
The optimal action is not "the action whose most likely outcome has the highest value," nor "the action that could reach the highest-value cell." It is the action with the highest expected value, $\sum_{s'} T(s'\mid s,a)U(s')$ — averaging over all outcomes weighted by probability. A slippery action that might reach a great cell but probably skids into the bomb can have low expected value. (This is the very puzzle slide 31 asks you to ponder — see Part IV.)
Compute expected successor value for each:
A: $0.9\times0.5 + 0.1\times1.0 = 0.45+0.10 = 0.55$.
B: $1.0\times0.6 = 0.60$.
$\pi^*$ picks Action B ($0.60 > 0.55$) — even though only A can reach the highest-value cell ($1.0$). This is the expected-value-not-best-case lesson in numbers.
Value is the best achievable expected discounted reward from a cell. The cell beside the gold is one reliable step from $+1$, so almost the full reward survives discounting → $0.918$. The far corner is many steps away (so the $+1$ is heavily discounted by $\gamma^{\text{many}}$), it pays several $-0.04$ living penalties en route, and it sits near the $-1$ bomb (risk of skidding in). Distance + living cost + danger all push its value down to $0.388$.
We have defined the optimal policy and the value of a state — but defining is not computing. This part is the payoff: two algorithms that actually crunch an MDP into an optimal (or near-optimal) policy, both built on a single equation due to Bellman.
Until now we discussed definitions (optimal policy, value of a state) and a theorem (the optimal policy is recursively specified), but never how to compute them. The focus now is exactly that — computing these concepts, most importantly the optimal policy, which lets us exactly solve a given MDP. The first method is value iteration: approximate the state values under an optimal policy, then read off a policy from those values.
$$U(s)=R(s)+\gamma\max_{a\in\text{Acts}}\sum_{s'\in\text{States}}T(s'\mid s,a)\,U(s')$$
The value of being in state $s$ equals the reward you get for being here ($R(s)$), plus the discounted value of where you'll be next if you act optimally. That second term is exactly the recursive theorem from Part III: take the best action $a$ (the $\max$), and for that action average the successor values weighted by transition probabilities. Bellman's equation is a recursive characterisation of the value of a state, and it is just a restatement of the optimal-policy characterisation $\pi^*(s)=\operatorname*{argmax}_a\sum_{s'}T(s'\mid s,a)U(s')$ — the equation finds the value, the argmax finds the action.
Bellman's equation uses $\max$ (it returns the best value, a number). The policy uses $\operatorname{argmax}$ (it returns the best action). Same inner expression $\sum_{s'}T(s'\mid s,a)U(s')$, different output type. Also note $R(s)$ sits outside the discounted max term — the immediate reward is not discounted; only the future is.
Turn the equation into an assignment and you get Bellman's update — the heart of value iteration:
$$U(s)_{\text{new}}=R(s)+\gamma\max_{a\in\text{Acts}}\sum_{s'\in\text{States}}T(s'\mid s,a)\,U(s')$$
The algorithm initialises all values to $0$, then repeatedly applies the update to every state until the values stop changing by more than a tolerance $\epsilon$:
Start with everyone guessing "my value is 0." Each sweep, every state looks one step ahead, sees the best it can expect from its neighbours, and corrects its guess. Information about the $+1$ gold ripples outward one cell per sweep. After enough sweeps the guesses stop moving — they have converged to the true values. The lecturer notes the values get arbitrarily close to exact; on termination the error is bounded: $\;\lVert U(s)-\hat U(s)\rVert<\epsilon$.
Once values are computed, the policy follows by picking, in each state, the maximum-expected-utility action:
"Why don't we take the action leading to the state with the maximum computed value?" — i.e. why use the expected value over outcomes instead of just walking toward the single highest-valued neighbour? Answer (for your own benefit): because actions are stochastic. The action that points at the best cell might, with high probability, slip somewhere terrible (e.g. the bomb). Only the expectation $\sum_{s'}T(s'\mid s,a)U(s')$ accounts for all the places an action might actually send you, weighted by how likely each is. Greedily chasing the best neighbour ignores the risk of the action failing.
Because the computed values carry error $\lVert U(s)-\hat U(s)\rVert<\epsilon$, the resulting policy is not guaranteed optimal — it is ε-optimal. Its shortfall in achieved reward is bounded:
"Close-enough values" produce a "close-enough policy." The reward you lose by following value iteration's policy instead of the truly optimal one is at most $\frac{2\epsilon\gamma}{1-\gamma}$. Smaller tolerance $\epsilon$ → tighter bound; larger $\gamma$ (more future-focused) → looser bound, because small value errors compound over a long horizon.
The bound $\frac{2\epsilon\gamma}{1-\gamma}$ is linear in $\epsilon$, so halving $\epsilon$ halves the bound. The trade-off: a smaller $\epsilon$ means value iteration needs more sweeps to terminate. (Note $\gamma$ and the bound move together — pushing $\gamma\to 1$ makes the bound explode.)
Is there a way to compute a policy we know is optimal (not just ε-optimal)? Yes: policy iteration. It produces an optimal policy — one for which no better policy exists in terms of reward when executed.
Policy iteration alternates two steps until the policy stops changing:
Step 1 — Policy evaluation: compute the value $U^\pi(s)$ of the current policy in every state. With $\pi$ fixed there's no $\max$, so Bellman's equation becomes a system of linear equations — solvable by linear programming (the lecturer's hint: "guess it from Bellman's equation!").
Step 2 — Policy improvement: for each state, check whether some action gives a better expected utility than the current policy's action; if so, switch to it. Repeat. When no state can be improved, the policy is optimal.
Step 1 is the bottleneck: solving the linear system has worst-case running time $O(n^3)$ in the number of states $n$ — prohibitive for realistic MDPs. This is precisely why we don't always use policy iteration even though it gives exact optimal policies.
| Aspect | Value Iteration | Policy Iteration |
|---|---|---|
| Policy quality | $\epsilon$-optimal (approximate) | Exactly optimal |
| Number of iterations | More iterations | Fewer iterations |
| Cost per iteration | Cheap Bellman sweeps | Step 1 (policy evaluation) is $O(n^3)$ — the bottleneck |
| Practical verdict | Often better overall for realistic systems | The $O(n^3)$ step makes it worse overall for any realistic system |
Policy iteration needs fewer iterations and gives a better (truly optimal) answer — yet value iteration usually wins in practice, because each policy-iteration step pays a steep $O(n^3)$ price to exactly evaluate the current policy. Quality per iteration vs. cost per iteration: the classic trade-off.
Policy iteration's policy-evaluation step is $O(n^3)$ in the number of states. With $n$ in the millions, $n^3$ is astronomically large — a single iteration is intractable. Value iteration, with its cheap per-state Bellman sweeps, will reach an $\epsilon$-optimal policy far faster, and you can drive $\epsilon$ small enough that the sub-optimality $\frac{2\epsilon\gamma}{1-\gamma}$ is negligible. "Exact" is worthless if you can't finish even one step.
The policy at $s$ changes when some action's expected discounted successor value strictly exceeds the current policy's value there: $\;\max_{a}\sum_{s'}\gamma\,T(s'\mid s,a)U^\pi(s') > U^\pi(s)$. Then $\pi_{\text{new}}(s)$ is set to the $\operatorname{argmax}$ action. When no state satisfies this strict inequality, the loop ($\pi_{\text{new}}\neq\pi$) ends and the policy is optimal.
$U(s)=R(s)+\gamma\max_{a\in\text{Acts}}\sum_{s'\in\text{States}}T(s'\mid s,a)U(s')$.
$R(s)$ — immediate (undiscounted) reward for being in $s$. $\gamma$ — discount factor weighting the future. $\max_a$ — choose the best action (this makes it the optimal-value equation). $\sum_{s'}T(s'\mid s,a)U(s')$ — expected value of the next state under action $a$. Drop the $\max$ (fix $a=\pi(s)$) and you get the equation policy iteration solves in step 1.
False — it's the reverse. Value iteration first approximates the values $U(s)$ via repeated Bellman updates, and then reads off a policy with $\pi_{\text{valiter}}(s)=\operatorname{argmax}_a\sum_{s'}T(s'\mid s,a)U(s')$. (Policy iteration is the one that works directly with policies, alternating evaluation and improvement.)
Reinforcement learning is an agent learning to act by trial and reward; we model its world as a Markov Decision Process (states, stochastic transitions $T$, rewards $R$, discount $\gamma$); we score behaviour by the discounted expected return, defining the value $U(s)$ of acting optimally; and Bellman's equation lets us compute it exactly — approximately and cheaply via value iteration ($\epsilon$-optimal), or exactly but expensively via policy iteration ($O(n^3)$ per step).