Learning which lever to pull when every pull costs you the chance to pull another — exploration versus exploitation, with concentration inequalities for muscle.
A gambler stands in front of $K$ slot machines. Each one secretly has its own average payout — but she doesn't know which. Every round she pulls exactly one lever and sees what fell out. After $T$ rounds, how do we measure whether she "learned"?
A multi-armed bandit (MAB) is a stripped-down model of sequential decisions under uncertainty. Imagine $K$ slot machines lined up in a casino — old-school slot machines were called "one-armed bandits" because they had a single lever and robbed your wallet. A row of them is therefore a "multi-armed bandit". The gambler must decide, round by round:
so as to maximise total winnings over time.
Each "arm" is an action whose payoff is random and unknown. You learn what arm $i$ pays only by pulling it — and every pull you spend exploring arm $i$ is a pull you didn't spend exploiting whichever arm currently looks best. This tension is called the exploration–exploitation trade-off, and the whole field of bandits is the study of how to resolve it.
The same skeleton — repeated choices, stochastic rewards, no model up front — describes a surprising number of real systems:
"I.i.d. rewards" means independent over time for the same arm, and the arms are independent of each other. It does not mean all $K$ arms have the same distribution — they have different means $\mu_i$, that's the whole point. Don't confuse "i.i.d." (across time, per arm) with "identically distributed" (across arms — they're not).
"W.l.o.g." means we're just labelling the arms by their unknown true means for analytical convenience — calling the best arm "1" and the worst "K" doesn't constrain the problem because the player doesn't see those labels. The algorithm only sees indices that are essentially arbitrary names. We use the ordering in the proof (e.g., to call out "arm 1" as the optimal one), but the algorithm never benefits from it because it doesn't know which physical arm is "arm 1".
Call $I_t$ the arm we choose to play at round $t$. The expected reward we get at round $t$ is $\mathbb{E}[X_{I_t, t}]$, and over $T$ rounds the cumulative expected reward is
$$\mathbb{E}\!\left[\sum_{t=1}^{T} X_{I_t,t}\right] \;=\; \sum_{t=1}^{T} \mathbb{E}\!\left[X_{I_t,t}\right].$$If we had been an oracle and known the means in advance, we would have always played arm 1 and earned $T\mu_1$ in expectation. The gap between these is what we minimise:
In plain English: "how much worse, on average, am I doing than the best single arm in hindsight?" Smaller is better; $R(T)=0$ means we played arm 1 every round.
Note that $I_t$ is itself random — it depends on whatever the algorithm observed in rounds $1,\dots,t-1$ — so $\mathbb{E}[X_{I_t,t}]$ is the expectation over both the arm choice and its reward.
Expected total reward $= 800\cdot 0.7 + 200\cdot 0.5 = 560 + 100 = 660$. Oracle would have got $T\mu_1 = 1000 \cdot 0.7 = 700$. So $R(T) = 700 - 660 = 40$. Equivalently: each of the 200 sub-optimal pulls cost $\mu_1 - \mu_2 = 0.2$ in expectation; $200 \cdot 0.2 = 40$. ✓
You might be tempted to define a more "honest" regret using the actually-observed numbers:
$$R_{\text{alternative}}(T) \;=\; \max_{i \le K}\sum_{t=1}^{T} X_{i,t} \;-\; \sum_{t=1}^{T} X_{I_t,t}.$$The slides explicitly warn against this. Two problems:
Pseudo-regret sidesteps both pathologies by comparing to the expected performance of the best arm, not its realised performance. Learning becomes meaningful again.
(a) Because under $R_{\text{alternative}}$ ties between arms make learning impossible — you'd be punished by luck rather than by bad decisions. (b) Because variance among many near-equal arms causes a sub-optimal arm to "win" the in-hindsight comparison with high probability, so $R_{\text{alternative}}$ doesn't reward actually identifying $\mu_1$.
Slide 14 sets the menu, and the rest of the deck delivers all three:
But to analyse any of them we first need a kit of concentration inequalities, which is what Part II is about.
Every bandit proof in this deck reduces to one question: "once we've pulled arm $i$ enough times, how close is the sample mean to the true mean, and with what probability?" Here are the six ingredients we'll combine.
If $X$ and $Y$ are independent, then for any (well-behaved) functions $f,g$:
$$\mathbb{E}\!\left[f(X)\,g(Y)\right] \;=\; \mathbb{E}[f(X)]\;\mathbb{E}[g(Y)].$$We use this in the proof of the variance lemma to kill cross terms.
For any events $E_1, \dots, E_k$:
$$\Pr\!\left[\bigcup_{i} E_i\right] \;\le\; \sum_{i} \Pr[E_i].$$In plain English: the probability that at least one bad thing happens is at most the sum of the probabilities of each bad thing happening alone. This is what lets us state a "good event" simultaneously for all $K$ arms and all $T$ rounds, without independence assumptions.
Let $X_1,\dots,X_n$ be i.i.d. with $\mathbb{E}[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$. Define the estimator
$$\hat{\mu}_n \;=\; \frac{1}{n}\sum_{i=1}^{n} X_i.$$Then
$$\mathbb{E}\!\left[(\hat{\mu}_n - \mu)^2\right] \;=\; \frac{\sigma^2}{n}.$$Average $n$ noisy measurements and your mean-squared error shrinks like $1/n$. Equivalently, the standard error shrinks like $1/\sqrt{n}$. That $1/\sqrt{n}$ is everywhere in this deck (it's the "radius" of every confidence interval we'll build).
Substitute and expand the square:
$$\mathbb{E}[(\hat\mu_n - \mu)^2] = \mathbb{E}\!\left[\left(\tfrac{1}{n}\sum_i (X_i-\mu)\right)^2\right] = \mathbb{E}\!\left[\tfrac{1}{n^2}\sum_i (X_i-\mu)^2 + \tfrac{1}{n^2}\sum_{i\ne j}(X_i-\mu)(X_j-\mu)\right].$$Split the expectation. For the cross term, $X_i$ and $X_j$ are independent for $i\ne j$, so by Reminder 1:
$$\mathbb{E}\!\left[\tfrac{1}{n^2}\sum_{i\ne j}(X_i-\mu)(X_j-\mu)\right] = \tfrac{1}{n^2}\sum_{i\ne j}\mathbb{E}[X_i-\mu]\,\mathbb{E}[X_j-\mu] = 0,$$since $\mathbb{E}[X_i - \mu] = 0$. The diagonal term is
$$\mathbb{E}\!\left[\tfrac{1}{n^2}\sum_i (X_i-\mu)^2\right] = \tfrac{1}{n^2}\sum_i \sigma^2 = \tfrac{\sigma^2}{n}. \quad\blacksquare$$For $X \in [0,1]$, the variance is maximised by a Bernoulli with $p=1/2$, giving $\sigma^2 \le 1/4$. So $\mathbb{E}[(\hat\mu_n - \mu)^2] \le 1/(4n)$. This is why the standard error for a bounded variable is always at most $1/(2\sqrt n)$.
For any non-negative random variable $X$ and any $t \ge 0$:
$$\Pr[X > t] \;\le\; \frac{\mathbb{E}[X]}{t}.$$"A non-negative variable can't sit far above its mean too often, because if it did the mean would already be huge." Markov is the weakest tail bound in our kit — it only uses the mean, not the variance — but its weakness is what makes the proof of the next item (WLLN) trivial.
For any $\varepsilon > 0$:
$$\lim_{n\to\infty} \Pr\!\left[\,|\hat\mu_n - \mu| > \varepsilon\,\right] = 0.$$That is, the sample mean converges in probability to the true mean.
Square both sides inside the probability — this is legal because squaring is monotone on non-negatives — and apply Markov to the non-negative variable $|\hat\mu_n - \mu|^2$:
$$\Pr[\,|\hat\mu_n - \mu| > \varepsilon\,] = \Pr[\,|\hat\mu_n - \mu|^2 > \varepsilon^2\,] \;\le\; \frac{\mathbb{E}[|\hat\mu_n - \mu|^2]}{\varepsilon^2} \;=\; \frac{\sigma^2}{n\,\varepsilon^2}.$$Now take $n \to \infty$: the bound $\sigma^2/(n\varepsilon^2) \to 0$. $\blacksquare$
The squaring trick lets us recycle the $\sigma^2/n$ lemma. This is also exactly the trick (Markov on the squared deviation) that gives Chebyshev's inequality. WLLN tells us learning is possible; the next tool (Hoeffding) tells us how fast.
Markov's inequality requires a non-negative variable. The quantity $\hat\mu_n - \mu$ can be negative — it's a centred deviation, signed. So we square it to make it non-negative, and the event $\{|\hat\mu_n - \mu| > \varepsilon\}$ is identical to the event $\{|\hat\mu_n - \mu|^2 > \varepsilon^2\}$, so no information is lost. Then Markov applies cleanly.
Let $Y_1, \dots, Y_m$ be independent random variables with $a_i \le Y_i \le b_i$ almost surely. Then for any $d > 0$:
$$\Pr\!\left[\,\left|\sum_{i=1}^{m} Y_i - \sum_{i=1}^{m}\mathbb{E}[Y_i]\right| \ge d\,\right] \;\le\; 2\exp\!\left(-\frac{2d^2}{\sum_{i=1}^{m}(b_i - a_i)^2}\right).$$The probability that a sum of bounded independent variables deviates from its mean by more than $d$ decays exponentially in $d^2$. Compare with Markov's $1/d$ decay: exponential is much tighter for moderate deviations. This is why concentration inequalities are sometimes called "tail bounds on steroids".
If $Y_1,\dots,Y_m \in [0,1]$ then $(b_i - a_i)^2 \le 1$ for each $i$, so $\sum (b_i - a_i)^2 \le m$, and Hoeffding simplifies to
$$\Pr\!\left[\,\Big|\sum_{i=1}^{m} Y_i - m\mu\Big| \ge d\,\right] \;\le\; 2\exp\!\left(-\frac{2d^2}{m}\right).$$Dividing by $m$: the empirical mean deviates from $\mu$ by more than $d/m$ with probability at most $2\exp(-2d^2/m)$. Set $d = r\sqrt{m}$ and you get the form we'll need: the sample mean deviates by $r/\sqrt m$ with probability at most $2\exp(-2r^2)$.
Slide 27 shows a binomial histogram and three vertical lines: $\mu - 3\sigma$, $\mu + 3\sigma$ (Hoeffding's $\pm 3\sigma$ window), and $3\mu$ (Markov's threshold at three times the mean).
Markov is the "any non-negative variable, no further structure" tool — it's universal but loose. Hoeffding adds the assumption "bounded and independent" — and pays you back with exponential tails. For bandit problems both assumptions hold, so Hoeffding is the right hammer.
Hoeffding requires independence and boundedness. Drop independence and the inequality fails — that's why the $\mu_i = \mathbb{E}[X_i]$ being independent of $t$ in the bandit setup matters. Also: Markov requires the variable to be non-negative; it cannot be applied directly to a centred deviation like $\hat\mu_n - \mu$ — you must square (or otherwise non-negativise) first. Mixing these conditions up is a classic exam slip.
Hoeffding. Each $Y_i \in [0,1]$ so $\sum (b_i-a_i)^2 = 100$. With $d = 15$:
$$\Pr\!\left[\,\left|\sum Y_i - 50\right| \ge 15\,\right] \le 2\exp\!\left(-\frac{2 \cdot 225}{100}\right) = 2 e^{-4.5} \approx 0.0222.$$So roughly a 2.2% upper bound on the two-sided deviation.
Markov. $\sum Y_i \ge 0$ with $\mathbb{E}[\sum Y_i] = 50$. With $t = 65$:
$$\Pr\!\left[\sum Y_i \ge 65\right] \le \frac{50}{65} \approx 0.769.$$So Markov only promises ≤ 76.9% — almost useless. Hoeffding is roughly 35× tighter here, and the gap widens exponentially as $d$ grows. This is exactly the lesson of slide 27's histogram: with $m=100$ Bernoullis you'd never see the sum near 65 in practice, and Hoeffding correctly says so; Markov can't.
Each $X_{i,t} \in [0,1]$ so $\sum (b_t - a_t)^2 = \tau$. By Hoeffding with $d = r$:
$$\Pr\!\left[\left|\sum_{t=1}^{\tau} X_{i,t} - \tau\mu_i\right| > r\right] \le 2\exp\!\left(-\frac{2r^2}{\tau}\right).$$Plug in $r^2 = \tau \log(KT^3)$:
$$2\exp(-2\log(KT^3)) = 2 \cdot (KT^3)^{-2} \le \frac{2}{KT^3}.$$So the complement (the deviation $\le r$) has probability at least $1 - 2/(KT^3)$. This is exactly claim (*) on slide 33. ✓
The crudest possible strategy: try each arm $\tau$ times, then play the empirical winner forever. The whole point is that even this stupid algorithm gives a non-trivial regret bound — provided $\tau$ is chosen properly.
The parameter $\tau$ is the only knob, and it depends on $T$ and $K$.
Should you pull each arm 10, 100, or 1000 times? You can't pick by gut feeling — you need a concentration bound to know how much exploration is "enough" to identify the best arm. That was the whole point of Part II.
For an appropriate choice of $\tau$, the Explore-First algorithm satisfies
$$\mathbb{E}[R(T)] \;=\; O\!\left(T^{2/3}\,\bigl(K\log(KT^3)\bigr)^{1/3}\right).$$The regret grows like $T^{2/3}$ — sub-linear, so per-round regret $R(T)/T \to 0$, meaning you do eventually "learn". But $T^{2/3}$ is much worse than the $\sqrt{T}$ rate that UCB will achieve, because Explore-First commits irreversibly: if you got unlucky in the $K\tau$ exploration rounds, you spend the next $T - K\tau$ rounds suffering for it.
The proof has the structure: (a) define a "good event" E that the empirical means are all close to the truth; (b) show $\Pr[\neg E]$ is " stroke-width="1"/>
With the schedule the algorithm achieves, for every round $t$, The rate is the same as Explore-First's: $t^{2/3}$ up to log factors. So Epsilon-Greedy isn't faster — but it has the huge practical advantage that you don't need to know $T$ in advance. The bound holds anytime: at $t = 100$, at $t = 10^6$, at any later round. Explore-First only achieves a fixed-$T$ bound. Initially you know nothing — you want lots of exploration. Later your estimates are good, and each new exploration is mostly wasted (you'd rather exploit). The schedule $\varepsilon_t \propto t^{-1/3}$ shrinks exploration smoothly and at exactly the rate that balances exploration cost vs commitment risk, producing the same $t^{2/3}$ regret as Explore-First but in an anytime fashion. The Epsilon-Greedy bound holds for every $t$, not just for one prescribed horizon. You don't need to know in advance how long you'll run. Explore-First requires picking $\tau$ as a function of $T$, so if you committed to $T = 10^6$ but only got to play for $T = 10^4$, your $\tau$ would be wrong and you'd be stuck in the exploration phase the whole time. Epsilon-Greedy is "anytime": you can stop whenever and the bound applies to whatever $t$ you stopped at. $\varepsilon_{1000} = (5 \log 5000)^{1/3} / 1000^{1/3} = (5 \cdot 8.52)^{1/3} / 10 = 42.6^{1/3}/10 \approx 3.49/10 \approx 0.35$. So about 35% of round 1000 is still exploration. At $t = 10^6$: $\varepsilon = (5 \log(5\cdot 10^6))^{1/3} / 100 = (5 \cdot 15.4)^{1/3}/100 \approx 4.3/100 \approx 0.043$. About 4.3%. The exploration fraction drops gracefully. A tempting shortcut is to fix $\varepsilon = 0.1$ for all $t$ ("explore 10% of the time forever"). That gives linear regret: even if your estimates are perfect, you keep pulling random arms forever, losing roughly $\varepsilon \cdot (\mu_1 - \bar\mu)$ per round on average $\Rightarrow$ $R(T) = \Theta(T)$. The whole point of the $\varepsilon_t \propto t^{-1/3}$ schedule is that exploration decays so total exploration cost grows only as $T^{2/3}$. Constant $\varepsilon$ doesn't pass the regret test, even though it is what most "epsilon-greedy" papers in reinforcement learning use heuristically. At each round, with probability $\varepsilon$ we pick a uniformly random arm. Conditional on exploring, the probability of picking the best arm is $1/K$, so the probability of pulling a sub-optimal arm is $(K-1)/K$. The expected per-round regret from exploration alone is then at least since the best of the sub-optimal arms has gap $\mu_1 - \mu_2$. Summing over $T$ rounds: This is linear in $T$, so the average regret per round does not vanish. Hence no constant $\varepsilon$ gives sublinear regret — you must let $\varepsilon_t \to 0$. From slide 41: Epsilon-Greedy is simple yet effective; it offers a straightforward exploration–exploitation balance; the theorem tells us its theoretical limits. But the rate is still $t^{2/3}$ — and we can do better.§2The right schedule for $\varepsilon_t$
What is the main practical advantage of Epsilon-Greedy over Explore-First, given they have the same regret rate?
Compute $\varepsilon_t$ for $K = 5$ and $t = 1000$. What fraction of rounds are spent exploring at that point?
Show that with a constant $\varepsilon \in (0,1)$, $\varepsilon$-Greedy has expected regret $R(T) = \Omega(\varepsilon \,T \,(\mu_1 - \mu_2))$ — i.e. linear in $T$.
§3Summary
UCB doesn't separate exploration and exploitation at all — it does both at once, by always playing the arm whose plausible upper limit is largest. The algorithm is one line; the analysis is the payoff.
For each arm, we know its empirical mean $\hat\mu_{i,t}$ and a confidence radius $r_{i,t}$ that shrinks as we pull it more. The upper confidence bound $\hat\mu_{i,t} + r_{i,t}$ is the largest value the true mean could plausibly take, with high probability. UCB plays the arm whose upper-bound is highest — i.e. the arm that could plausibly be the best.
Why is this clever? Two arms can lead in UCB for two reasons: (a) good empirical mean (exploitation) or (b) large radius from few pulls (exploration). The algorithm naturally explores arms it doesn't know well — but stops as soon as their radius shrinks below the leading arm's mean.
This is $O(\sqrt{T})$ — much better than the $T^{2/3}$ of the previous two algorithms. As $T$ grows large, UCB pulls away dramatically. This $\sqrt{KT\log(KT)}$ rate is actually worst-case optimal up to the log factor — no algorithm can do asymptotically better in the minimax sense.
| Algorithm | Regret rate | Anytime? | Comment |
|---|---|---|---|
| Explore-First | O(T2/3(K log KT³)1/3) | No | commit-and-pray; need to know T |
| Epsilon-Greedy | O(t2/3(K log Kt³)1/3) | Yes | same rate, anytime version |
| UCB | O(√(KT log KT)) | Yes | worst-case optimal up to logs |
Define $E$ to hold if, for all arms $i \in [K]$ and all rounds $t \in [T]$,
$$|\hat\mu_{i,t} - \mu_i| \;\le\; r_{i,t}.$$That is, every empirical mean is within its own confidence radius of the truth. We want to bound $\Pr[\neg E]$.
For a fixed $i$ and $t$, conditional on $n_{i,t}$ pulls of arm $i$, multiply both sides by $n_{i,t}$ and apply Hoeffding to the sum of $n_{i,t}$ i.i.d. bounded variables:
$$\Pr\!\left[|\hat\mu_{i,t} - \mu_i| > r_{i,t}\right] = \Pr\!\left[|n_{i,t}\hat\mu_{i,t} - n_{i,t}\mu_i| > n_{i,t} r_{i,t}\right] \le 2\exp\!\left(-\frac{2(n_{i,t}r_{i,t})^2}{n_{i,t}}\right).$$Substitute $r_{i,t}^2 = 2\log(KT)/n_{i,t}$:
$$2\exp(-2n_{i,t} \cdot r_{i,t}^2) = 2\exp(-4\log(KT)) = 2(KT)^{-4}.$$Hmm — the slide writes $2\exp(-2\log(KT)) = 2/(KT)^2 \le 1/(KT^2)$. Let me reproduce the slide's algebra exactly: the slide effectively gives
$$\Pr[|\hat\mu_{i,t} - \mu_i| > r_{i,t}] \;\le\; 2\exp(-2\log(KT)) \;=\; \frac{2}{(KT)^2} \;\le\; \frac{1}{KT^2}.$$There are at most $KT$ pairs $(i,t)$, so by union bound:
$$\Pr[\neg E] \;\le\; KT \cdot \frac{1}{KT^2} \;=\; \frac{1}{T}.$$On $\neg E$ the regret is at most $T$ (trivial), so
$$\mathbb{E}[R(T) \mid \neg E]\cdot \Pr[\neg E] \;\le\; T \cdot \frac{1}{T} \;=\; O(1).$$This is asymptotically negligible compared to the $\sqrt{KT\log KT}$ term we'll get on the good event.
Suppose at round $t$, UCB plays a sub-optimal arm $i > 1$. Then by the algorithm's rule, $\mathrm{UCB}(i,t) \ge \mathrm{UCB}(1,t)$, i.e.
$$\hat\mu_{i,t} + r_{i,t} \;\ge\; \hat\mu_{1,t} + r_{1,t}.$$On the good event:
Substituting both bounds and rearranging:
$$\mu_i + 2r_{i,t} \;\ge\; \hat\mu_{i,t} + r_{i,t} \;\ge\; \hat\mu_{1,t} + r_{1,t} \;\ge\; \mu_1.$$So $\mu_1 - \mu_i \le 2 r_{i,t}$ — the gap between arm 1 and the played arm is at most twice the radius of the played arm. (Note: the slide also mentions "due to monotonicity" — meaning $r_{i,t}$ is monotone decreasing in $n_{i,t}$, so the largest radius for arm $i$ is at the moment it's played, and we can bound $r_{i,t}$ at the time of play by $r_{i,T}$ at the end of all its pulls.)
This is the UCB version of the Explore-First key inequality ("if we picked the wrong arm, we didn't pick a terribly wrong one"). Here it says: whenever UCB pulls a sub-optimal arm $i$, the suboptimality gap $\mu_1 - \mu_i$ can be at most $2 r_{i,t}$. So if we want to pull arm $i$ a lot ($n_{i,T}$ large), the radius is small, so the suboptimality is small, so the regret per pull is small. Self-limiting.
Arm $i$ contributes $(\mu_1 - \mu_i) n_{i,T}$ to total regret. Using the inequality above:
$$(\mu_1 - \mu_i)\, n_{i,T} \;\le\; 2 r_{i,T} n_{i,T} \;=\; 2\sqrt{\frac{2\log(KT)}{n_{i,T}}} \cdot n_{i,T} \;=\; 2\sqrt{2\log(KT)} \cdot \sqrt{n_{i,T}}.$$Summing over sub-optimal arms:
$$\mathbb{E}[R(T) \mid E] \;=\; \sum_{i > 1}(\mu_1 - \mu_i) n_{i,T} \;\le\; 2\sqrt{2\log(KT)} \sum_{i} \sqrt{n_{i,T}}.$$The function $\sqrt{\cdot}$ is concave, so by Jensen:
$$\frac{1}{K}\sum_{i} \sqrt{n_{i,T}} \;\le\; \sqrt{\frac{1}{K}\sum_{i} n_{i,T}} \;=\; \sqrt{\frac{T}{K}},$$since $\sum_i n_{i,T} = T$ (each round pulls exactly one arm). Multiplying both sides by $K$:
$$\sum_{i} \sqrt{n_{i,T}} \;\le\; K \sqrt{T/K} \;=\; \sqrt{KT}.$$Putting it together:
$$\mathbb{E}[R(T) \mid E] \;\le\; 2\sqrt{2\log(KT)} \cdot \sqrt{KT} \;=\; O\!\left(\sqrt{KT\log(KT)}\right).$$From the union bound. We need a high-probability bound that holds for all $K$ arms and all $T$ rounds simultaneously — that's $KT$ events. For each individual event, Hoeffding gives a failure probability of roughly $2\exp(-2r^2 n_{i,t})$. To make the total failure probability after union bound be at most $1/T$, we need each individual event's failure probability to be $\le 1/(KT^2)$, which requires $r^2 \approx \log(KT)/n_{i,t}$. Hence the formula.
Because of Jensen's inequality. The worst case for $\sum \sqrt{n_i}$ subject to $\sum n_i = T$ is when the $n_i$ are equal (concavity), giving $\sum \sqrt{n_i} = K\sqrt{T/K} = \sqrt{KT}$. If one arm dominated (say $n_1 = T$, others zero), we'd get $\sum\sqrt{n_i} = \sqrt{T}$ — even smaller. So UCB's regret is largest when the algorithm spreads its pulls evenly, but that case is bounded by $\sqrt{KT}$, not $T$.
False. UCB plays the arm with the highest $\hat\mu_{i,t} + r_{i,t}$ — empirical mean plus radius. The arm with the highest empirical mean would be pure exploitation (i.e. greedy), and that strategy can fail catastrophically: if arm 1's first few pulls are unlucky, you'd never play it again. UCB's radius term ensures that arms with few samples (and hence large $r_{i,t}$) are revisited, giving them a chance to redeem themselves.
$\log(KT) = \log(2000) \approx 7.6$. Then $r_A = \sqrt{2 \cdot 7.6 / 30} = \sqrt{0.51} \approx 0.71$, so $\mathrm{UCB}(A) \approx 0.6 + 0.71 = 1.31$. And $r_B = \sqrt{2 \cdot 7.6 / 5} = \sqrt{3.04} \approx 1.74$, so $\mathrm{UCB}(B) \approx 0.5 + 1.74 = 2.24$. Arm B wins — its small visit count gives it a huge radius, so UCB pulls it next to gather more information. (Note: since rewards are in $[0,1]$, UCB values bigger than 1 are fine; they just mean the algorithm has very low confidence.)
After bounding regret per arm by $2r_{i,T} n_{i,T} = 2\sqrt{2\log(KT)} \sqrt{n_{i,T}}$, we need to sum $\sum_{i} \sqrt{n_{i,T}}$. The constraint is $\sum_i n_{i,T} = T$. Concavity of $\sqrt{\cdot}$ gives, via Jensen, that the average $\frac{1}{K}\sum_i \sqrt{n_{i,T}} \le \sqrt{\frac{1}{K}\sum n_{i,T}} = \sqrt{T/K}$. Multiplying by $K$: $\sum \sqrt{n_{i,T}} \le \sqrt{KT}$. Without concavity (e.g. if we had $\sum n_{i,T}^2$ instead), the bound would go the other way and we'd be stuck.
A stochastic bandit poses the dilemma: $K$ arms with unknown means $\mu_1 > \mu_2 \ge \cdots \ge \mu_K$, one pull per round, and you want to minimise the pseudo-regret $R(T) = T\mu_1 - \sum_t \mathbb{E}[X_{I_t,t}]$. To analyse any algorithm we use a concentration toolkit (Markov, the $\sigma^2/n$ lemma, the Weak Law of Large Numbers, and especially Hoeffding's exponential tail bound), combined with the union bound to handle multiple arms and rounds simultaneously. Three algorithms tackle the problem: Explore-First tries each arm $\tau$ times and commits, achieving $O(T^{2/3}\,\text{polylog})$ regret if $\tau$ is chosen optimally; Epsilon-Greedy interleaves exploration and exploitation with a shrinking $\varepsilon_t = (K\log Kt)^{1/3}/t^{1/3}$, matching the rate but giving an anytime bound; UCB plays the arm with the highest $\hat\mu_{i,t} + \sqrt{2\log(KT)/n_{i,t}}$, embodying "optimism in the face of uncertainty", and achieves $O(\sqrt{KT\log KT})$ — worst-case optimal up to logs. The proofs all share the same backbone: define a good event using Hoeffding + union bound, show the bad event contributes $O(1)$, and on the good event bound regret using a self-limiting argument (a sub-optimal arm gets pulled only if its radius is large, which limits how badly we can lose per pull).