Activity 2 — Interactive Beta–Bernoulli Bandit (Game)
- We model a website that, for each arriving user, shows one of \(K\) banner ads.
- If ad \(k\in\{1,\dots,K\}\) is displayed, the outcome is a click (\(r=1\)) with probability \(\theta_k\in[0,1]\) and no click (\(r=0\)) with probability \(1-\theta_k\).
- The vector \(\theta=(\theta_1,\dots,\theta_K)\) is unknown and fixed. The quantity \(\theta_k\) is the click-through rate (CTR) of ad \(k\).
Interaction protocol
Over rounds \(t=1,2,\ldots,T\):
- choose an ad \(x_t\in\{1,\dots,K\}\) to show;
- observe a binary reward \(r_t\in\{0,1\}\) indicating whether the user clicked.
Formally, \[ \mathbb{P}(r_t=1\mid x_t=k,\theta)=\theta_k, \qquad \mathbb{P}(r_t=0\mid x_t=k,\theta)=1-\theta_k. \]
Goal: maximize clicks
We want a policy (sequence \(x_1,\dots,x_T\)) that maximizes the expected number of clicks. It is standard to evaluate policies via regret. Let \[ \theta^\star \triangleq \max_{k} \theta_k, \qquad \Delta_k \triangleq \theta^\star-\theta_k \ge 0. \] If \(N_k(T)\) is the number of times ad \(k\) is shown by time \(T\), the cumulative (realized) regret is \begin{equation}\label{eq:regret-def} R_T \;\triangleq\; T\,\theta^\star - \sum_{t=1}^T r_t, \end{equation} and its expectation satisfies \[ \mathbb{E}[R_T] \;=\; \sum_{k\neq k^\star} \Delta_k \,\mathbb{E}[N_k(T)], \qquad k^\star\in\arg\max_k \theta_k. \] Thus, regret that grows sublinearly in \(T\) means suboptimal ads are selected only sublinearly often.
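To make the regret definition in \eqref{eq:regret-def} concrete, here is a minimal Python sketch that simulates the protocol with a deliberately naive uniformly random policy; the CTR values are illustrative assumptions, not part of the game.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.05, 0.12, 0.08])    # hypothetical hidden CTRs, for illustration
T = 1000
theta_star = theta.max()

total_clicks = 0
for t in range(T):
    k = rng.integers(len(theta))        # naive policy: show an ad uniformly at random
    total_clicks += rng.binomial(1, theta[k])

# Realized regret R_T = T * theta_star - sum of rewards
R_T = T * theta_star - total_clicks
print(f"Realized regret after T={T} rounds: {R_T:.1f}")
```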
Bayesian modeling for the Beta–Bernoulli bandit
Goal. For each ad \(k\), we want to learn its (unknown) click-through rate (CTR) \(\theta_k\in(0,1)\) from binary outcomes \(r_t\in\{0,1\}\) observed when that ad is shown.
Step 0 — Data model (what we observe)
If ad \(k\) is shown at round \(t\), we assume \[ r_t \mid (x_t=k,\theta_k)\sim \mathrm{Bernoulli}(\theta_k). \] Why: A click/no-click is a single success/failure trial; Bernoulli is the standard model for that.
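A one-line simulation of this observation model (the CTR value \(0.10\) is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_k = 0.10                    # hypothetical CTR of ad k
r_t = rng.binomial(1, theta_k)    # 1 = click, 0 = no click
print(r_t)
```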
Step 1 — Prior (what we believe before data)
For each ad \(k\), place an independent Beta prior \[ \theta_k \sim \mathrm{Beta}(\alpha_k,\beta_k)\quad\text{with density} \] \[ \label{eq:beta-pdf} p(\theta_k)=\frac{\Gamma(\alpha_k+\beta_k)}{\Gamma(\alpha_k)\Gamma(\beta_k)}\, \theta_k^{\alpha_k-1}(1-\theta_k)^{\beta_k-1}, \quad k=1,\dots,K. \] If you have no prior information, set \((\alpha_k,\beta_k)=(1,1)\) (uniform on \([0,1]\)). Why: The Beta family is supported on \((0,1)\) and is conjugate to the Bernoulli likelihood, making updates simple and fast.
Interpretation: \(\alpha_k-1\) and \(\beta_k-1\) act as pseudocounts of prior “clicks” and “no-clicks,” respectively.
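A short sketch of the prior, assuming SciPy is available: \(\mathrm{Beta}(1,1)\) is flat on \((0,1)\), while pseudocounts shift the prior mean.

```python
from scipy.stats import beta

# Uniform prior Beta(1, 1): density equals 1 everywhere on (0, 1)
print(beta.pdf(0.3, 1, 1))    # 1.0

# Prior with 2 pseudo-clicks and 8 pseudo-no-clicks: Beta(3, 9)
print(beta.mean(3, 9))        # prior mean = 3 / (3 + 9) = 0.25
```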
Step 2 — Likelihood (how data inform \(\theta_k\))
Suppose ad \(k\) has been shown \(n_k\) times so far, producing \(s_k\) clicks and \(f_k=n_k-s_k\) no-clicks. The likelihood is \[ p(\text{data}_k\mid \theta_k)\propto \theta_k^{\,s_k}(1-\theta_k)^{\,f_k}. \] Why: Bernoulli trials are independent given \(\theta_k\); multiplying their probabilities collects exponents \(s_k\) and \(f_k\).
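To see the exponents being collected, a small sketch that tallies \(s_k\) and \(f_k\) from a hypothetical outcome sequence and evaluates the (unnormalized) log-likelihood on a grid:

```python
import numpy as np

outcomes_k = np.array([1, 0, 0, 1, 0, 0, 0, 1])    # hypothetical outcomes for ad k
s_k = outcomes_k.sum()                             # clicks
f_k = len(outcomes_k) - s_k                        # no-clicks

grid = np.linspace(0.01, 0.99, 99)
log_lik = s_k * np.log(grid) + f_k * np.log(1 - grid)
print(f"s_k={s_k}, f_k={f_k}, likelihood peaks near theta={grid[log_lik.argmax()]:.2f}")
```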
Step 3 — Posterior (Bayes’ rule = prior × likelihood)
Multiplying prior and likelihood yields another Beta: \[ \theta_k \mid \text{data}_k \;\sim\; \mathrm{Beta}\big(\alpha_k+s_k,\;\beta_k+f_k\big). \] Equivalently, you can update online after each new outcome \(r_t\) when ad \(k\) is shown: \[ \label{eq:beta-update} (\alpha_k,\beta_k)\leftarrow (\alpha_k+r_t,\;\beta_k+(1-r_t)), \] leaving \((\alpha_j,\beta_j)\) unchanged for \(j\neq k\). Why (conjugacy): The Beta prior and Bernoulli likelihood have the same functional form in \(\theta_k\), so the posterior stays in the Beta family with parameters that just add counts.
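The online update in \eqref{eq:beta-update} is just two additions; a minimal sketch:

```python
def update(alpha_k, beta_k, r_t):
    """Conjugate Beta-Bernoulli update after observing reward r_t in {0, 1}."""
    return alpha_k + r_t, beta_k + (1 - r_t)

a, b = 1, 1                   # uniform Beta(1, 1) prior
for r in [1, 0, 0, 1, 1]:     # hypothetical outcomes for one ad
    a, b = update(a, b, r)
print(a, b)                   # Beta(4, 3): 3 clicks and 2 no-clicks added to the prior
```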
Step 4 — Point estimates (what number to report)
A convenient summary is the posterior mean \[ \label{eq:posterior-mean} \hat{\theta}_k \;\triangleq\; \mathbb{E}[\theta_k\mid \text{data}_k] \;=\; \frac{\alpha_k}{\alpha_k+\beta_k}. \] Why: For \(\mathrm{Beta}(\alpha,\beta)\), the mean is \(\alpha/(\alpha+\beta)\); with pseudocounts it becomes a smoothed empirical CTR.
Confidence grows with data: The posterior variance for \(\mathrm{Beta}(\alpha,\beta)\) is \[ \mathrm{Var}[\theta_k\mid \text{data}_k]= \frac{\alpha_k\,\beta_k}{(\alpha_k+\beta_k)^2\,(\alpha_k+\beta_k+1)}, \] which shrinks as \(\alpha_k+\beta_k\) increases. This justifies saying the posterior “concentrates” as we collect more outcomes.
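A sketch computing the posterior mean \eqref{eq:posterior-mean} and the variance above; with roughly the same empirical CTR but more data, the variance shrinks:

```python
def posterior_summary(alpha_k, beta_k):
    """Posterior mean and variance of Beta(alpha_k, beta_k)."""
    n = alpha_k + beta_k
    return alpha_k / n, alpha_k * beta_k / (n**2 * (n + 1))

# Roughly the same posterior mean, increasingly many observations: variance shrinks
for a, b in [(2, 6), (11, 31), (101, 301)]:
    print(posterior_summary(a, b))
```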
Step 5 — Repeat (sequential learning)
Each time you show ad \(k\), observe \(r_t\), apply the update in \eqref{eq:beta-update}, and recompute summaries like \(\hat{\theta}_k\) in \eqref{eq:posterior-mean}. Policies (e.g., Greedy or Thompson Sampling) then use these posteriors to choose which ad to show next.
Baseline: Greedy (posterior-mean) policy
At each round \(t\):
- For each \(k\), compute \(\hat{\theta}_k\) via \eqref{eq:posterior-mean}.
- Play \(x_t\in\arg\max_k \hat{\theta}_k\).
- Observe \(r_t\) and update \eqref{eq:beta-update}.
Greedy always exploits the current best estimate and can under-explore if early noise misleads it.
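A minimal sketch of the Greedy baseline on a simulated 3-ad environment (the hidden CTRs are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.04, 0.10, 0.07])    # hidden CTRs (illustrative assumption)
K, T = len(theta), 1000
alpha = np.ones(K)                      # uniform Beta(1, 1) priors
beta = np.ones(K)

clicks = 0
for t in range(T):
    k = int(np.argmax(alpha / (alpha + beta)))    # exploit the current posterior means
    r = rng.binomial(1, theta[k])                 # observe click / no-click
    alpha[k] += r
    beta[k] += 1 - r
    clicks += r
print(f"Greedy clicks: {clicks} / {T}")
```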
Thompson Sampling (TS) for Beta–Bernoulli
At each round \(t\):
- For each \(k\), draw a plausible CTR from its posterior: \[ \tilde{\theta}_k \sim \mathrm{Beta}(\alpha_k,\beta_k). \]
- Play \(x_t\in\arg\max_k \tilde{\theta}_k\).
- Observe \(r_t\) and update \eqref{eq:beta-update}.
This randomized choice naturally balances exploration (trying uncertain ads) and exploitation (favoring ads likely to have high CTR).
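Thompson Sampling needs only one change to the Greedy loop: replace the argmax over posterior means with an argmax over posterior draws (same illustrative CTRs as above):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.04, 0.10, 0.07])    # hidden CTRs (illustrative assumption)
K, T = len(theta), 1000
alpha = np.ones(K)
beta = np.ones(K)

clicks = 0
for t in range(T):
    k = int(np.argmax(rng.beta(alpha, beta)))    # one posterior draw per ad, play the largest
    r = rng.binomial(1, theta[k])
    alpha[k] += r
    beta[k] += 1 - r
    clicks += r
print(f"Thompson Sampling clicks: {clicks} / {T}")
```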
LET’S SIMULATE AND PLAY THIS GAME!
Game description & rules
- You control a 3-arm bandit with hidden click-through rates \(\theta_1,\theta_2,\theta_3\).
- Each round \(t=1,\dots,T\), with \(T=30\), choose one ad to show. You’ll see the instantaneous reward (😀 for 1, 😞 for 0).
- Each arm’s posterior is \(\mathrm{Beta}(\alpha_k,\beta_k)\) and updates as \[ (\alpha_k,\beta_k)\leftarrow(\alpha_k+r_t,\;\beta_k+(1-r_t)). \]
Control buttons
- Play Ad 1/2/3: pulls that arm and updates its posterior.
- Sample (TS): gives you the option to play a Thompson Sampling strategy. If you want to follow it, press this button to draw \(\tilde\theta_k\sim\mathrm{Beta}(\alpha_k,\beta_k)\) for each arm, then make your decision.
- New Game: resets with fresh random CTRs \(\theta_k\sim\mathrm{Beta}(2,2)\) and uniform priors \((1,1)\).
- After round 30, the true CTRs \(\theta_k\) are revealed.
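If you want to prototype the game offline, a compact sketch that mirrors the rules above (3 arms, \(T=30\) rounds, CTRs drawn from \(\mathrm{Beta}(2,2)\), uniform priors), with the buttons replaced by a Thompson Sampling choice each round:

```python
import numpy as np

rng = np.random.default_rng()
K, T = 3, 30
theta = rng.beta(2.0, 2.0, size=K)      # fresh hidden CTRs, as in "New Game"
alpha = np.ones(K)                      # uniform Beta(1, 1) priors
beta = np.ones(K)

total = 0
for t in range(1, T + 1):
    k = int(np.argmax(rng.beta(alpha, beta)))    # the "Sample (TS)" button, then play that arm
    r = rng.binomial(1, theta[k])                # click (1) or no click (0)
    alpha[k] += r
    beta[k] += 1 - r
    total += r
    print(f"round {t:2d}: ad {k + 1} -> {'😀' if r else '😞'}")

print("True CTRs revealed:", np.round(theta, 3), "| total clicks:", total)
```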