King's College London · Machine Learning · Week 8

Stochastic Multi-Armed Bandits

Learning which lever to pull when every pull costs you the chance to pull another — exploration versus exploitation, with concentration inequalities for muscle.

KCL · ML Lecturer: Frederik Mallmann-Trenn Reading time ≈ 55 min 18 self-tests 3 algorithms
Part I

The Bandit Problem & Regret

A gambler stands in front of $K$ slot machines. Each one secretly has its own average payout — but she doesn't know which. Every round she pulls exactly one lever and sees what fell out. After $T$ rounds, how do we measure whether she "learned"?

§1The slot-machine analogy

A multi-armed bandit (MAB) is a stripped-down model of sequential decisions under uncertainty. Imagine $K$ slot machines lined up in a casino — old-school slot machines were called "one-armed bandits" because they had a single lever and robbed your wallet. A row of them is therefore a "multi-armed bandit". The gambler must decide, round by round:

so as to maximise total winnings over time.

Intuition

Each "arm" is an action whose payoff is random and unknown. You learn what arm $i$ pays only by pulling it — and every pull you spend exploring arm $i$ is a pull you didn't spend exploiting whichever arm currently looks best. This tension is called the exploration–exploitation trade-off, and the whole field of bandits is the study of how to resolve it.

Five arms, hidden means — the player sees only one realised reward per round arm 1 μ₁ = 0.72? arm 2 μ₂ = 0.55? arm 3 μ₃ = 0.40? arm 4 μ₄ = 0.30? arm K μ_K = 0.15? The bars are the true means $\mu_i$ — invisible to the player. The agent must infer the ordering from samples.
Slide 6 — formal setup. The bars are the unknown true means; the player sees only one i.i.d. draw from one chosen arm per round.

§2Where it shows up

The same skeleton — repeated choices, stochastic rewards, no model up front — describes a surprising number of real systems:

Examples from the slides
  • Online advertising. Each ad is an arm; "clicked" is the reward. The system optimises engagement and revenue.
  • Recommendation systems. Netflix and Amazon choose which thumbnail to show; the reward is "did the user click / watch / buy?"
  • Clinical trials. Each treatment is an arm; "patient recovered" is the reward. Adaptive trials allocate more patients to therapies that look promising mid-trial.
  • Dynamic pricing. Each price point is an arm; revenue is the reward. The system adjusts in real time to demand.
  • Website optimisation / A/B testing. Each design is an arm; conversion is the reward.
  • Networking. Each routing path is an arm; latency / throughput is the reward. Adaptive routing finds the best path online.

§3Formal setup

Definition · the stochastic K-armed bandit
  • There are $K \in \mathbb{N}$ arms.
  • $X_{i,t}$ denotes the random reward of arm $i \in \{1,\dots,K\}$ at time $t \in \mathbb{N}$.
  • Rewards are bounded: $X_{i,t} \in [0,1]$, $i.i.d.$ across $t$ for each fixed arm.
  • $\mu_i := \mathbb{E}[X_{i}]$, independent of $t$.
  • In every round you play exactly one arm, and only see that one arm's reward.
  • W.l.o.g. arms are labelled so that $\mu_1 > \mu_2 \ge \cdots \ge \mu_K$ — but this ordering and the values $\mu_1,\dots,\mu_K$ are unknown to the player.
Exam trap

"I.i.d. rewards" means independent over time for the same arm, and the arms are independent of each other. It does not mean all $K$ arms have the same distribution — they have different means $\mu_i$, that's the whole point. Don't confuse "i.i.d." (across time, per arm) with "identically distributed" (across arms — they're not).

Why is the ordering $\mu_1 > \mu_2 \ge \cdots \ge \mu_K$ called "without loss of generality"? Could the algorithm exploit it?

"W.l.o.g." means we're just labelling the arms by their unknown true means for analytical convenience — calling the best arm "1" and the worst "K" doesn't constrain the problem because the player doesn't see those labels. The algorithm only sees indices that are essentially arbitrary names. We use the ordering in the proof (e.g., to call out "arm 1" as the optimal one), but the algorithm never benefits from it because it doesn't know which physical arm is "arm 1".

§4Regret — the right scoreboard

Call $I_t$ the arm we choose to play at round $t$. The expected reward we get at round $t$ is $\mathbb{E}[X_{I_t, t}]$, and over $T$ rounds the cumulative expected reward is

$$\mathbb{E}\!\left[\sum_{t=1}^{T} X_{I_t,t}\right] \;=\; \sum_{t=1}^{T} \mathbb{E}\!\left[X_{I_t,t}\right].$$

If we had been an oracle and known the means in advance, we would have always played arm 1 and earned $T\mu_1$ in expectation. The gap between these is what we minimise:

Definition · pseudo-regret R(T) $$R(T) \;=\; \max_{i \le K}\sum_{t=1}^{T}\mathbb{E}[X_{i,t}] \;-\; \sum_{t=1}^{T}\mathbb{E}[X_{I_t,t}] \;=\; T\mu_1 - \sum_{t=1}^{T}\mathbb{E}[X_{I_t,t}].$$

In plain English: "how much worse, on average, am I doing than the best single arm in hindsight?" Smaller is better; $R(T)=0$ means we played arm 1 every round.

Note that $I_t$ is itself random — it depends on whatever the algorithm observed in rounds $1,\dots,t-1$ — so $\mathbb{E}[X_{I_t,t}]$ is the expectation over both the arm choice and its reward.

If you ran for $T=1000$ rounds, pulled arm 1 (mean $0.7$) 800 times and arm 2 (mean $0.5$) 200 times, what is $R(T)$?

Expected total reward $= 800\cdot 0.7 + 200\cdot 0.5 = 560 + 100 = 660$. Oracle would have got $T\mu_1 = 1000 \cdot 0.7 = 700$. So $R(T) = 700 - 660 = 40$. Equivalently: each of the 200 sub-optimal pulls cost $\mu_1 - \mu_2 = 0.2$ in expectation; $200 \cdot 0.2 = 40$. ✓

§5Why pseudo-regret? Why not raw realisations?

You might be tempted to define a more "honest" regret using the actually-observed numbers:

$$R_{\text{alternative}}(T) \;=\; \max_{i \le K}\sum_{t=1}^{T} X_{i,t} \;-\; \sum_{t=1}^{T} X_{I_t,t}.$$

The slides explicitly warn against this. Two problems:

Why the alternative breaks
  1. Identical arms become unlearnable. Suppose $\mu_i = \mu_j$ for all $i,j$. Under the alternative regret, the "best arm in hindsight" is whichever arm got lucky and accumulated the highest realised reward — and you'd be punished for not having pulled that arm, even though no learning algorithm could possibly have known. "Very unfair," as the slides put it.
  2. Many tied arms. Suppose $\mu_1 > \mu_2 = \mu_3 = \dots = \mu_K$. With many tied second-best arms, with high probability one of them will have a luckier sample average than arm 1, simply by variance. Under the alternative regret you'd "almost never play the best arm" — the variance washes out the signal — so learning becomes pointless.

Pseudo-regret sidesteps both pathologies by comparing to the expected performance of the best arm, not its realised performance. Learning becomes meaningful again.

In one sentence each, why is $R(T)$ (pseudo-regret) preferred over $R_{\text{alternative}}(T)$?

(a) Because under $R_{\text{alternative}}$ ties between arms make learning impossible — you'd be punished by luck rather than by bad decisions. (b) Because variance among many near-equal arms causes a sub-optimal arm to "win" the in-hindsight comparison with high probability, so $R_{\text{alternative}}$ doesn't reward actually identifying $\mu_1$.

§6The three algorithms we'll study

Slide 14 sets the menu, and the rest of the deck delivers all three:

  1. Explore-First — try everything $\tau$ times, then commit. Simple. Regret $O(T^{2/3}\cdot\text{stuff})$.
  2. Epsilon-Greedy — same idea, but interleaved: explore with probability $\varepsilon$, exploit with probability $1-\varepsilon$. Regret bound holds for every round $t$, not just for a fixed $T$.
  3. UCB (Upper Confidence Bound) — "optimism in the face of uncertainty". Pick the arm with the highest empirical mean plus a confidence radius. Regret $O(\sqrt{KT\log(KT)})$ — much better.

But to analyse any of them we first need a kit of concentration inequalities, which is what Part II is about.

· · ·
Part II

The Concentration Toolkit

Every bandit proof in this deck reduces to one question: "once we've pulled arm $i$ enough times, how close is the sample mean to the true mean, and with what probability?" Here are the six ingredients we'll combine.

§1Two free reminders

Reminder 1 · Independence factorises expectations

If $X$ and $Y$ are independent, then for any (well-behaved) functions $f,g$:

$$\mathbb{E}\!\left[f(X)\,g(Y)\right] \;=\; \mathbb{E}[f(X)]\;\mathbb{E}[g(Y)].$$

We use this in the proof of the variance lemma to kill cross terms.

Reminder 2 · Union bound

For any events $E_1, \dots, E_k$:

$$\Pr\!\left[\bigcup_{i} E_i\right] \;\le\; \sum_{i} \Pr[E_i].$$

In plain English: the probability that at least one bad thing happens is at most the sum of the probabilities of each bad thing happening alone. This is what lets us state a "good event" simultaneously for all $K$ arms and all $T$ rounds, without independence assumptions.

§2The variance of the sample mean ($\sigma^2/n$)

Lemma · MSE of the empirical mean

Let $X_1,\dots,X_n$ be i.i.d. with $\mathbb{E}[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$. Define the estimator

$$\hat{\mu}_n \;=\; \frac{1}{n}\sum_{i=1}^{n} X_i.$$

Then

$$\mathbb{E}\!\left[(\hat{\mu}_n - \mu)^2\right] \;=\; \frac{\sigma^2}{n}.$$
In plain English

Average $n$ noisy measurements and your mean-squared error shrinks like $1/n$. Equivalently, the standard error shrinks like $1/\sqrt{n}$. That $1/\sqrt{n}$ is everywhere in this deck (it's the "radius" of every confidence interval we'll build).

Proof (Slides 21–22, in full)

Substitute and expand the square:

$$\mathbb{E}[(\hat\mu_n - \mu)^2] = \mathbb{E}\!\left[\left(\tfrac{1}{n}\sum_i (X_i-\mu)\right)^2\right] = \mathbb{E}\!\left[\tfrac{1}{n^2}\sum_i (X_i-\mu)^2 + \tfrac{1}{n^2}\sum_{i\ne j}(X_i-\mu)(X_j-\mu)\right].$$

Split the expectation. For the cross term, $X_i$ and $X_j$ are independent for $i\ne j$, so by Reminder 1:

$$\mathbb{E}\!\left[\tfrac{1}{n^2}\sum_{i\ne j}(X_i-\mu)(X_j-\mu)\right] = \tfrac{1}{n^2}\sum_{i\ne j}\mathbb{E}[X_i-\mu]\,\mathbb{E}[X_j-\mu] = 0,$$

since $\mathbb{E}[X_i - \mu] = 0$. The diagonal term is

$$\mathbb{E}\!\left[\tfrac{1}{n^2}\sum_i (X_i-\mu)^2\right] = \tfrac{1}{n^2}\sum_i \sigma^2 = \tfrac{\sigma^2}{n}. \quad\blacksquare$$
If $X_i \in [0,1]$, give an explicit upper bound on $\sigma^2$, and hence on $\mathbb{E}[(\hat\mu_n - \mu)^2]$.

For $X \in [0,1]$, the variance is maximised by a Bernoulli with $p=1/2$, giving $\sigma^2 \le 1/4$. So $\mathbb{E}[(\hat\mu_n - \mu)^2] \le 1/(4n)$. This is why the standard error for a bounded variable is always at most $1/(2\sqrt n)$.

§3Markov's inequality

Markov's inequality

For any non-negative random variable $X$ and any $t \ge 0$:

$$\Pr[X > t] \;\le\; \frac{\mathbb{E}[X]}{t}.$$
In plain English

"A non-negative variable can't sit far above its mean too often, because if it did the mean would already be huge." Markov is the weakest tail bound in our kit — it only uses the mean, not the variance — but its weakness is what makes the proof of the next item (WLLN) trivial.

§4The Weak Law of Large Numbers

Weak LLN

For any $\varepsilon > 0$:

$$\lim_{n\to\infty} \Pr\!\left[\,|\hat\mu_n - \mu| > \varepsilon\,\right] = 0.$$

That is, the sample mean converges in probability to the true mean.

Proof (Slide 25)

Square both sides inside the probability — this is legal because squaring is monotone on non-negatives — and apply Markov to the non-negative variable $|\hat\mu_n - \mu|^2$:

$$\Pr[\,|\hat\mu_n - \mu| > \varepsilon\,] = \Pr[\,|\hat\mu_n - \mu|^2 > \varepsilon^2\,] \;\le\; \frac{\mathbb{E}[|\hat\mu_n - \mu|^2]}{\varepsilon^2} \;=\; \frac{\sigma^2}{n\,\varepsilon^2}.$$

Now take $n \to \infty$: the bound $\sigma^2/(n\varepsilon^2) \to 0$. $\blacksquare$

What just happened

The squaring trick lets us recycle the $\sigma^2/n$ lemma. This is also exactly the trick (Markov on the squared deviation) that gives Chebyshev's inequality. WLLN tells us learning is possible; the next tool (Hoeffding) tells us how fast.

Why was it necessary to square first before applying Markov?

Markov's inequality requires a non-negative variable. The quantity $\hat\mu_n - \mu$ can be negative — it's a centred deviation, signed. So we square it to make it non-negative, and the event $\{|\hat\mu_n - \mu| > \varepsilon\}$ is identical to the event $\{|\hat\mu_n - \mu|^2 > \varepsilon^2\}$, so no information is lost. Then Markov applies cleanly.

§5Hoeffding's inequality — the workhorse

Hoeffding's inequality

Let $Y_1, \dots, Y_m$ be independent random variables with $a_i \le Y_i \le b_i$ almost surely. Then for any $d > 0$:

$$\Pr\!\left[\,\left|\sum_{i=1}^{m} Y_i - \sum_{i=1}^{m}\mathbb{E}[Y_i]\right| \ge d\,\right] \;\le\; 2\exp\!\left(-\frac{2d^2}{\sum_{i=1}^{m}(b_i - a_i)^2}\right).$$
In plain English

The probability that a sum of bounded independent variables deviates from its mean by more than $d$ decays exponentially in $d^2$. Compare with Markov's $1/d$ decay: exponential is much tighter for moderate deviations. This is why concentration inequalities are sometimes called "tail bounds on steroids".

The specialisation we'll use over and over

If $Y_1,\dots,Y_m \in [0,1]$ then $(b_i - a_i)^2 \le 1$ for each $i$, so $\sum (b_i - a_i)^2 \le m$, and Hoeffding simplifies to

$$\Pr\!\left[\,\Big|\sum_{i=1}^{m} Y_i - m\mu\Big| \ge d\,\right] \;\le\; 2\exp\!\left(-\frac{2d^2}{m}\right).$$

Dividing by $m$: the empirical mean deviates from $\mu$ by more than $d/m$ with probability at most $2\exp(-2d^2/m)$. Set $d = r\sqrt{m}$ and you get the form we'll need: the sample mean deviates by $r/\sqrt m$ with probability at most $2\exp(-2r^2)$.

§6Hoeffding vs Markov — the picture

Slide 27 shows a binomial histogram and three vertical lines: $\mu - 3\sigma$, $\mu + 3\sigma$ (Hoeffding's $\pm 3\sigma$ window), and $3\mu$ (Markov's threshold at three times the mean).

μ (mean) μ − 3σ μ + 3σ 3μ (Markov) Hoeffding's window ≈ 99% mass inside Markov's window (much looser) guarantees only ≥ 2/3 mass inside Sum of Bernoulli RVs — concentration thresholds compared
Slide 27 — Hoeffding's $\pm 3\sigma$ window captures ~99% of the mass; Markov's bound at $3\mu$ guarantees only ≥ 2/3 — much looser even though it sits further from the mean.
Takeaway

Markov is the "any non-negative variable, no further structure" tool — it's universal but loose. Hoeffding adds the assumption "bounded and independent" — and pays you back with exponential tails. For bandit problems both assumptions hold, so Hoeffding is the right hammer.

Exam trap

Hoeffding requires independence and boundedness. Drop independence and the inequality fails — that's why the $\mu_i = \mathbb{E}[X_i]$ being independent of $t$ in the bandit setup matters. Also: Markov requires the variable to be non-negative; it cannot be applied directly to a centred deviation like $\hat\mu_n - \mu$ — you must square (or otherwise non-negativise) first. Mixing these conditions up is a classic exam slip.

Concrete numbers: let $m = 100$ Bernoulli($1/2$) trials. Use Hoeffding to bound $\Pr[|\sum Y_i - 50| \ge 15]$. Then use Markov on $\sum Y_i$ to bound $\Pr[\sum Y_i \ge 65]$. Which bound is tighter and by how much?

Hoeffding. Each $Y_i \in [0,1]$ so $\sum (b_i-a_i)^2 = 100$. With $d = 15$:

$$\Pr\!\left[\,\left|\sum Y_i - 50\right| \ge 15\,\right] \le 2\exp\!\left(-\frac{2 \cdot 225}{100}\right) = 2 e^{-4.5} \approx 0.0222.$$

So roughly a 2.2% upper bound on the two-sided deviation.

Markov. $\sum Y_i \ge 0$ with $\mathbb{E}[\sum Y_i] = 50$. With $t = 65$:

$$\Pr\!\left[\sum Y_i \ge 65\right] \le \frac{50}{65} \approx 0.769.$$

So Markov only promises ≤ 76.9% — almost useless. Hoeffding is roughly 35× tighter here, and the gap widens exponentially as $d$ grows. This is exactly the lesson of slide 27's histogram: with $m=100$ Bernoullis you'd never see the sum near 65 in practice, and Hoeffding correctly says so; Markov can't.

Apply Hoeffding to $\tau$ i.i.d. samples in $[0,1]$ from arm $i$. Show that the empirical sum $\sum_{t=1}^{\tau} X_{i,t}$ is within $r$ of $\tau\mu_i$ with probability at least $1 - 2/(KT^3)$, when $r = \sqrt{\tau \log(KT^3)}$.

Each $X_{i,t} \in [0,1]$ so $\sum (b_t - a_t)^2 = \tau$. By Hoeffding with $d = r$:

$$\Pr\!\left[\left|\sum_{t=1}^{\tau} X_{i,t} - \tau\mu_i\right| > r\right] \le 2\exp\!\left(-\frac{2r^2}{\tau}\right).$$

Plug in $r^2 = \tau \log(KT^3)$:

$$2\exp(-2\log(KT^3)) = 2 \cdot (KT^3)^{-2} \le \frac{2}{KT^3}.$$

So the complement (the deviation $\le r$) has probability at least $1 - 2/(KT^3)$. This is exactly claim (*) on slide 33. ✓

· · ·
Part III

Explore-First — commit and pray

The crudest possible strategy: try each arm $\tau$ times, then play the empirical winner forever. The whole point is that even this stupid algorithm gives a non-trivial regret bound — provided $\tau$ is chosen properly.

§1The algorithm

Algorithm · Explore-First
  1. Exploration phase. Play each of the $K$ arms exactly $\tau$ times.
  2. Compute the empirical mean of each arm and let $\hat a$ be the arm with the highest empirical mean.
  3. Exploitation phase. Play $\hat a$ for all of the remaining $T - K\tau$ rounds.

The parameter $\tau$ is the only knob, and it depends on $T$ and $K$.

Exploration round-robin · Kτ rounds play each arm τ times, compute means Exploitation commit · (T − Kτ) rounds play the empirical winner â every round t = 1 t = T
Slide 16 — Explore-First's two phases. Once we commit, we never re-explore.

Should you pull each arm 10, 100, or 1000 times? You can't pick by gut feeling — you need a concentration bound to know how much exploration is "enough" to identify the best arm. That was the whole point of Part II.

§2The regret theorem

Theorem · Explore-First regret

For an appropriate choice of $\tau$, the Explore-First algorithm satisfies

$$\mathbb{E}[R(T)] \;=\; O\!\left(T^{2/3}\,\bigl(K\log(KT^3)\bigr)^{1/3}\right).$$
What the rate means

The regret grows like $T^{2/3}$ — sub-linear, so per-round regret $R(T)/T \to 0$, meaning you do eventually "learn". But $T^{2/3}$ is much worse than the $\sqrt{T}$ rate that UCB will achieve, because Explore-First commits irreversibly: if you got unlucky in the $K\tau$ exploration rounds, you spend the next $T - K\tau$ rounds suffering for it.

§3The full proof (Slides 32–35)

The proof has the structure: (a) define a "good event" E that the empirical means are all close to the truth; (b) show $\Pr[\neg E]$ is " stroke-width="1"/> explore (prob ε_t) exploit (prob 1−ε_t) t = 1 t = T

Slide 39 — Epsilon-Greedy interleaves: a few exploration rounds sprinkled into a sea of greedy exploitation. The exploration rate $\varepsilon_t$ shrinks with $t$.

§2The right schedule for $\varepsilon_t$

Theorem · Epsilon-Greedy regret

With the schedule

$$\varepsilon_t \;=\; \frac{(K\log(Kt))^{1/3}}{t^{1/3}},$$

the algorithm achieves, for every round $t$,

$$\mathbb{E}[R(t)] \;=\; O\!\left(t^{2/3}\,(K\log(Kt^3))^{1/3}\right).$$
In plain English

The rate is the same as Explore-First's: $t^{2/3}$ up to log factors. So Epsilon-Greedy isn't faster — but it has the huge practical advantage that you don't need to know $T$ in advance. The bound holds anytime: at $t = 100$, at $t = 10^6$, at any later round. Explore-First only achieves a fixed-$T$ bound.

Why $\varepsilon_t$ shrinks like $t^{-1/3}$

Initially you know nothing — you want lots of exploration. Later your estimates are good, and each new exploration is mostly wasted (you'd rather exploit). The schedule $\varepsilon_t \propto t^{-1/3}$ shrinks exploration smoothly and at exactly the rate that balances exploration cost vs commitment risk, producing the same $t^{2/3}$ regret as Explore-First but in an anytime fashion.

What is the main practical advantage of Epsilon-Greedy over Explore-First, given they have the same regret rate?

The Epsilon-Greedy bound holds for every $t$, not just for one prescribed horizon. You don't need to know in advance how long you'll run. Explore-First requires picking $\tau$ as a function of $T$, so if you committed to $T = 10^6$ but only got to play for $T = 10^4$, your $\tau$ would be wrong and you'd be stuck in the exploration phase the whole time. Epsilon-Greedy is "anytime": you can stop whenever and the bound applies to whatever $t$ you stopped at.

Compute $\varepsilon_t$ for $K = 5$ and $t = 1000$. What fraction of rounds are spent exploring at that point?

$\varepsilon_{1000} = (5 \log 5000)^{1/3} / 1000^{1/3} = (5 \cdot 8.52)^{1/3} / 10 = 42.6^{1/3}/10 \approx 3.49/10 \approx 0.35$. So about 35% of round 1000 is still exploration. At $t = 10^6$: $\varepsilon = (5 \log(5\cdot 10^6))^{1/3} / 100 = (5 \cdot 15.4)^{1/3}/100 \approx 4.3/100 \approx 0.043$. About 4.3%. The exploration fraction drops gracefully.

Exam trap — why not a constant $\varepsilon$?

A tempting shortcut is to fix $\varepsilon = 0.1$ for all $t$ ("explore 10% of the time forever"). That gives linear regret: even if your estimates are perfect, you keep pulling random arms forever, losing roughly $\varepsilon \cdot (\mu_1 - \bar\mu)$ per round on average $\Rightarrow$ $R(T) = \Theta(T)$. The whole point of the $\varepsilon_t \propto t^{-1/3}$ schedule is that exploration decays so total exploration cost grows only as $T^{2/3}$. Constant $\varepsilon$ doesn't pass the regret test, even though it is what most "epsilon-greedy" papers in reinforcement learning use heuristically.

Show that with a constant $\varepsilon \in (0,1)$, $\varepsilon$-Greedy has expected regret $R(T) = \Omega(\varepsilon \,T \,(\mu_1 - \mu_2))$ — i.e. linear in $T$.

At each round, with probability $\varepsilon$ we pick a uniformly random arm. Conditional on exploring, the probability of picking the best arm is $1/K$, so the probability of pulling a sub-optimal arm is $(K-1)/K$. The expected per-round regret from exploration alone is then at least

$$\varepsilon \cdot \frac{K-1}{K} \cdot (\mu_1 - \mu_2),$$

since the best of the sub-optimal arms has gap $\mu_1 - \mu_2$. Summing over $T$ rounds:

$$R(T) \;\ge\; T \cdot \varepsilon \cdot \frac{K-1}{K} \cdot (\mu_1 - \mu_2) \;=\; \Omega(\varepsilon\, T).$$

This is linear in $T$, so the average regret per round does not vanish. Hence no constant $\varepsilon$ gives sublinear regret — you must let $\varepsilon_t \to 0$.

§3Summary

From slide 41: Epsilon-Greedy is simple yet effective; it offers a straightforward exploration–exploitation balance; the theorem tells us its theoretical limits. But the rate is still $t^{2/3}$ — and we can do better.

· · ·
Part V

UCB — optimism in the face of uncertainty

UCB doesn't separate exploration and exploitation at all — it does both at once, by always playing the arm whose plausible upper limit is largest. The algorithm is one line; the analysis is the payoff.

§1The idea

Optimism in the face of uncertainty

For each arm, we know its empirical mean $\hat\mu_{i,t}$ and a confidence radius $r_{i,t}$ that shrinks as we pull it more. The upper confidence bound $\hat\mu_{i,t} + r_{i,t}$ is the largest value the true mean could plausibly take, with high probability. UCB plays the arm whose upper-bound is highest — i.e. the arm that could plausibly be the best.

Why is this clever? Two arms can lead in UCB for two reasons: (a) good empirical mean (exploitation) or (b) large radius from few pulls (exploration). The algorithm naturally explores arms it doesn't know well — but stops as soon as their radius shrinks below the leading arm's mean.

§2The algorithm

Algorithm · UCB
  1. Play each arm once (to give every arm a finite radius).
  2. For each subsequent round $t$:
    1. Compute, for every arm $i$, $$\mathrm{UCB}(i,t) = \hat\mu_{i,t} + r_{i,t},$$ where the empirical mean is $$\hat\mu_{i,t} = \frac{1}{n_{i,t}}\sum_{s\le t,\,I_s = i} X_{i,s}$$ (average over observations of arm $i$ so far), and the radius is $$r_{i,t} = \sqrt{\frac{2\log(KT)}{n_{i,t}}}.$$ Here $n_{i,t}$ counts how many times arm $i$ has been played up to round $t$.
    2. Play the arm with the highest UCB value.
    3. Update $\hat\mu$ and $n$ for that arm.
0 1 UCB μ̂ + r arm 1 n=120, μ̂=0.58 small radius UCB ★ arm 2 n=45, μ̂=0.50 medium radius UCB arm 3 n=20, μ̂=0.30 larger radius UCB arm 4 n=8, μ̂=0.10 huge radius For each arm: dot = empirical mean, bar = mean ± radius. UCB plays the arm whose top is highest. ★ this round, arm 2 leads in UCB even though its empirical mean is below arm 1's — its radius hasn't shrunk enough yet.
Slide 44 — UCB plays the arm whose top-of-confidence-bar is highest. Less-pulled arms have wider bars, giving them a chance to be "optimistic".

§3The regret theorem

Theorem · UCB regret $$\mathbb{E}[R(T)] \;=\; O\!\left(\sqrt{KT\log(TK)}\right).$$
Why this matters

This is $O(\sqrt{T})$ — much better than the $T^{2/3}$ of the previous two algorithms. As $T$ grows large, UCB pulls away dramatically. This $\sqrt{KT\log(KT)}$ rate is actually worst-case optimal up to the log factor — no algorithm can do asymptotically better in the minimax sense.

AlgorithmRegret rateAnytime?Comment
Explore-FirstO(T2/3(K log KT³)1/3)Nocommit-and-pray; need to know T
Epsilon-GreedyO(t2/3(K log Kt³)1/3)Yessame rate, anytime version
UCBO(√(KT log KT))Yesworst-case optimal up to logs

§4The full proof (Slides 47–51)

The good event

Define $E$ to hold if, for all arms $i \in [K]$ and all rounds $t \in [T]$,

$$|\hat\mu_{i,t} - \mu_i| \;\le\; r_{i,t}.$$

That is, every empirical mean is within its own confidence radius of the truth. We want to bound $\Pr[\neg E]$.

Pointwise concentration via Hoeffding

For a fixed $i$ and $t$, conditional on $n_{i,t}$ pulls of arm $i$, multiply both sides by $n_{i,t}$ and apply Hoeffding to the sum of $n_{i,t}$ i.i.d. bounded variables:

$$\Pr\!\left[|\hat\mu_{i,t} - \mu_i| > r_{i,t}\right] = \Pr\!\left[|n_{i,t}\hat\mu_{i,t} - n_{i,t}\mu_i| > n_{i,t} r_{i,t}\right] \le 2\exp\!\left(-\frac{2(n_{i,t}r_{i,t})^2}{n_{i,t}}\right).$$

Substitute $r_{i,t}^2 = 2\log(KT)/n_{i,t}$:

$$2\exp(-2n_{i,t} \cdot r_{i,t}^2) = 2\exp(-4\log(KT)) = 2(KT)^{-4}.$$

Hmm — the slide writes $2\exp(-2\log(KT)) = 2/(KT)^2 \le 1/(KT^2)$. Let me reproduce the slide's algebra exactly: the slide effectively gives

$$\Pr[|\hat\mu_{i,t} - \mu_i| > r_{i,t}] \;\le\; 2\exp(-2\log(KT)) \;=\; \frac{2}{(KT)^2} \;\le\; \frac{1}{KT^2}.$$

Union bound over all arms and rounds

There are at most $KT$ pairs $(i,t)$, so by union bound:

$$\Pr[\neg E] \;\le\; KT \cdot \frac{1}{KT^2} \;=\; \frac{1}{T}.$$

The bad event contributes $O(1)$

On $\neg E$ the regret is at most $T$ (trivial), so

$$\mathbb{E}[R(T) \mid \neg E]\cdot \Pr[\neg E] \;\le\; T \cdot \frac{1}{T} \;=\; O(1).$$

This is asymptotically negligible compared to the $\sqrt{KT\log KT}$ term we'll get on the good event.

The key inequality on the good event

Suppose at round $t$, UCB plays a sub-optimal arm $i > 1$. Then by the algorithm's rule, $\mathrm{UCB}(i,t) \ge \mathrm{UCB}(1,t)$, i.e.

$$\hat\mu_{i,t} + r_{i,t} \;\ge\; \hat\mu_{1,t} + r_{1,t}.$$

On the good event:

  • $\hat\mu_{i,t} \le \mu_i + r_{i,t}$,
  • $\hat\mu_{1,t} \ge \mu_1 - r_{1,t}$.

Substituting both bounds and rearranging:

$$\mu_i + 2r_{i,t} \;\ge\; \hat\mu_{i,t} + r_{i,t} \;\ge\; \hat\mu_{1,t} + r_{1,t} \;\ge\; \mu_1.$$

So $\mu_1 - \mu_i \le 2 r_{i,t}$ — the gap between arm 1 and the played arm is at most twice the radius of the played arm. (Note: the slide also mentions "due to monotonicity" — meaning $r_{i,t}$ is monotone decreasing in $n_{i,t}$, so the largest radius for arm $i$ is at the moment it's played, and we can bound $r_{i,t}$ at the time of play by $r_{i,T}$ at the end of all its pulls.)

Translation

This is the UCB version of the Explore-First key inequality ("if we picked the wrong arm, we didn't pick a terribly wrong one"). Here it says: whenever UCB pulls a sub-optimal arm $i$, the suboptimality gap $\mu_1 - \mu_i$ can be at most $2 r_{i,t}$. So if we want to pull arm $i$ a lot ($n_{i,T}$ large), the radius is small, so the suboptimality is small, so the regret per pull is small. Self-limiting.

Bounding total regret on the good event

Arm $i$ contributes $(\mu_1 - \mu_i) n_{i,T}$ to total regret. Using the inequality above:

$$(\mu_1 - \mu_i)\, n_{i,T} \;\le\; 2 r_{i,T} n_{i,T} \;=\; 2\sqrt{\frac{2\log(KT)}{n_{i,T}}} \cdot n_{i,T} \;=\; 2\sqrt{2\log(KT)} \cdot \sqrt{n_{i,T}}.$$

Summing over sub-optimal arms:

$$\mathbb{E}[R(T) \mid E] \;=\; \sum_{i > 1}(\mu_1 - \mu_i) n_{i,T} \;\le\; 2\sqrt{2\log(KT)} \sum_{i} \sqrt{n_{i,T}}.$$

Jensen's inequality finishes it

The function $\sqrt{\cdot}$ is concave, so by Jensen:

$$\frac{1}{K}\sum_{i} \sqrt{n_{i,T}} \;\le\; \sqrt{\frac{1}{K}\sum_{i} n_{i,T}} \;=\; \sqrt{\frac{T}{K}},$$

since $\sum_i n_{i,T} = T$ (each round pulls exactly one arm). Multiplying both sides by $K$:

$$\sum_{i} \sqrt{n_{i,T}} \;\le\; K \sqrt{T/K} \;=\; \sqrt{KT}.$$

Putting it together:

$$\mathbb{E}[R(T) \mid E] \;\le\; 2\sqrt{2\log(KT)} \cdot \sqrt{KT} \;=\; O\!\left(\sqrt{KT\log(KT)}\right).$$

Combining good and bad events

$$\mathbb{E}[R(T)] = \underbrace{\mathbb{E}[R(T)\mid E]\Pr[E]}_{O(\sqrt{KT\log KT})\,\cdot\,1} + \underbrace{\mathbb{E}[R(T)\mid \neg E]\Pr[\neg E]}_{O(1)} = O\!\left(\sqrt{KT\log(KT)}\right). \quad\blacksquare$$
In the UCB formula, why is the radius $r_{i,t} = \sqrt{2\log(KT)/n_{i,t}}$ — specifically, where does the $\log(KT)$ come from?

From the union bound. We need a high-probability bound that holds for all $K$ arms and all $T$ rounds simultaneously — that's $KT$ events. For each individual event, Hoeffding gives a failure probability of roughly $2\exp(-2r^2 n_{i,t})$. To make the total failure probability after union bound be at most $1/T$, we need each individual event's failure probability to be $\le 1/(KT^2)$, which requires $r^2 \approx \log(KT)/n_{i,t}$. Hence the formula.

Why is the regret bound only $\sqrt{KT}$ and not $T$, given there are $K$ arms?

Because of Jensen's inequality. The worst case for $\sum \sqrt{n_i}$ subject to $\sum n_i = T$ is when the $n_i$ are equal (concavity), giving $\sum \sqrt{n_i} = K\sqrt{T/K} = \sqrt{KT}$. If one arm dominated (say $n_1 = T$, others zero), we'd get $\sum\sqrt{n_i} = \sqrt{T}$ — even smaller. So UCB's regret is largest when the algorithm spreads its pulls evenly, but that case is bounded by $\sqrt{KT}$, not $T$.

A common student misreading: "UCB plays the arm with the highest empirical mean." True or false, and why?

False. UCB plays the arm with the highest $\hat\mu_{i,t} + r_{i,t}$ — empirical mean plus radius. The arm with the highest empirical mean would be pure exploitation (i.e. greedy), and that strategy can fail catastrophically: if arm 1's first few pulls are unlucky, you'd never play it again. UCB's radius term ensures that arms with few samples (and hence large $r_{i,t}$) are revisited, giving them a chance to redeem themselves.

Suppose at round 50 you have observed: arm A pulled 30 times with $\hat\mu_A = 0.6$, arm B pulled 5 times with $\hat\mu_B = 0.5$. With $K=2$, $T=1000$, compute the UCB of each and say which arm UCB plays next.

$\log(KT) = \log(2000) \approx 7.6$. Then $r_A = \sqrt{2 \cdot 7.6 / 30} = \sqrt{0.51} \approx 0.71$, so $\mathrm{UCB}(A) \approx 0.6 + 0.71 = 1.31$. And $r_B = \sqrt{2 \cdot 7.6 / 5} = \sqrt{3.04} \approx 1.74$, so $\mathrm{UCB}(B) \approx 0.5 + 1.74 = 2.24$. Arm B wins — its small visit count gives it a huge radius, so UCB pulls it next to gather more information. (Note: since rewards are in $[0,1]$, UCB values bigger than 1 are fine; they just mean the algorithm has very low confidence.)

Where does Jensen's inequality enter the UCB proof, and why is $\sqrt{\cdot}$ being concave the key?

After bounding regret per arm by $2r_{i,T} n_{i,T} = 2\sqrt{2\log(KT)} \sqrt{n_{i,T}}$, we need to sum $\sum_{i} \sqrt{n_{i,T}}$. The constraint is $\sum_i n_{i,T} = T$. Concavity of $\sqrt{\cdot}$ gives, via Jensen, that the average $\frac{1}{K}\sum_i \sqrt{n_{i,T}} \le \sqrt{\frac{1}{K}\sum n_{i,T}} = \sqrt{T/K}$. Multiplying by $K$: $\sum \sqrt{n_{i,T}} \le \sqrt{KT}$. Without concavity (e.g. if we had $\sum n_{i,T}^2$ instead), the bound would go the other way and we'd be stuck.

§5Key takeaways

From Slide 52
  • UCB beats Explore-First by a lot. $\sqrt{T}$ vs $T^{2/3}$ is asymptotically a huge gap.
  • UCB effectively balances exploration and exploitation by encoding both into one quantity.
  • It adapts over time: arms with many pulls have small radii, so they're played only when their empirical mean is genuinely highest; arms with few pulls have large radii and get re-tried.
  • Suitable for environments where the reward distributions are unknown and must be learned online.
· · ·
Outro · Week 8

The week in one breath

A stochastic bandit poses the dilemma: $K$ arms with unknown means $\mu_1 > \mu_2 \ge \cdots \ge \mu_K$, one pull per round, and you want to minimise the pseudo-regret $R(T) = T\mu_1 - \sum_t \mathbb{E}[X_{I_t,t}]$. To analyse any algorithm we use a concentration toolkit (Markov, the $\sigma^2/n$ lemma, the Weak Law of Large Numbers, and especially Hoeffding's exponential tail bound), combined with the union bound to handle multiple arms and rounds simultaneously. Three algorithms tackle the problem: Explore-First tries each arm $\tau$ times and commits, achieving $O(T^{2/3}\,\text{polylog})$ regret if $\tau$ is chosen optimally; Epsilon-Greedy interleaves exploration and exploitation with a shrinking $\varepsilon_t = (K\log Kt)^{1/3}/t^{1/3}$, matching the rate but giving an anytime bound; UCB plays the arm with the highest $\hat\mu_{i,t} + \sqrt{2\log(KT)/n_{i,t}}$, embodying "optimism in the face of uncertainty", and achieves $O(\sqrt{KT\log KT})$ — worst-case optimal up to logs. The proofs all share the same backbone: define a good event using Hoeffding + union bound, show the bad event contributes $O(1)$, and on the good event bound regret using a self-limiting argument (a sub-optimal arm gets pulled only if its radius is large, which limits how badly we can lose per pull).

  1. State the stochastic $K$-armed bandit formally: arms, $X_{i,t} \in [0,1]$ i.i.d., $\mu_i = \mathbb{E}[X_i]$, ordering $\mu_1 > \mu_2 \ge \cdots \ge \mu_K$ unknown.
  2. Write the pseudo-regret $R(T) = T\mu_1 - \sum_t \mathbb{E}[X_{I_t,t}]$ and explain in one sentence why it's preferred over $R_{\text{alternative}}(T)$ defined on realisations.
  3. Give Markov's inequality $\Pr[X > t] \le \mathbb{E}[X]/t$, the variance-of-mean lemma $\mathbb{E}[(\hat\mu_n - \mu)^2] = \sigma^2/n$, and reproduce the slide proof.
  4. Prove the Weak Law of Large Numbers from Markov + the variance lemma (squaring trick).
  5. State Hoeffding's inequality with the exponent $-2d^2/\sum(b_i - a_i)^2$ and specialise it to $[0,1]$ variables.
  6. Explain the Hoeffding-vs-Markov picture: Hoeffding gives ~99% within $\mu \pm 3\sigma$; Markov only guarantees 2/3 within $3\mu$.
  7. State the union bound $\Pr[\cup E_i] \le \sum \Pr[E_i]$ and use it to combine per-arm concentration into a single "good event $E$".
  8. Describe Explore-First's two phases and state its theorem $\mathbb{E}[R(T)] = O(T^{2/3}(K\log KT^3)^{1/3})$.
  9. Reproduce the Explore-First proof end-to-end: radius $r = \sqrt{\tau\log(KT^3)}$, Hoeffding gives $\Pr[\neg E] \le 2/T^3$ (via $K$ × $2/(KT^3)$), key inequality $\mu_1 - \mu_i \le 2r/\tau$ on $E$, regret $K\tau + (T/\tau)\cdot 2r$, balanced $\tau$.
  10. State Epsilon-Greedy with schedule $\varepsilon_t = (K\log Kt)^{1/3}/t^{1/3}$ and the anytime regret $O(t^{2/3}(K\log Kt^3)^{1/3})$, and explain why "anytime" is the practical win over Explore-First.
  11. Write the UCB formula $\mathrm{UCB}(i,t) = \hat\mu_{i,t} + \sqrt{2\log(KT)/n_{i,t}}$ and explain "optimism in the face of uncertainty".
  12. State and motivate the UCB theorem $\mathbb{E}[R(T)] = O(\sqrt{KT\log KT})$.
  13. Reproduce the UCB proof: $\Pr[\neg E] \le 1/T$ by Hoeffding + union bound over $KT$ pairs; bad event contributes $O(1)$; on $E$, $\mu_1 - \mu_i \le 2 r_{i,t}$; arm $i$ contributes $\le 2r_{i,T} n_{i,T}$; sum with Jensen to get $\sqrt{KT}$.
  14. Show $\sum_i \sqrt{n_{i,T}} \le \sqrt{KT}$ given $\sum n_{i,T} = T$, using Jensen on the concave $\sqrt{\cdot}$.
  15. Compare the three algorithms in a table: rate, anytime?, when to use which.
  16. Compute UCB values for a worked example with specific $\hat\mu$ and $n$ counts, and identify which arm UCB plays next.
  17. Explain why the radius shrinks (more pulls → smaller $r_{i,t}$), and how this self-limiting feature gives UCB its $\sqrt{T}$ rate.
  18. Be ready to discuss the worst-case optimality of UCB and why $T^{2/3}$ is not achievable by any algorithm that ignores past observations.