Runzhe Wan *1 Haoyu Wei *2 Branislav Kveton 1 Rui Song 1

Abstract

Despite the great interest in the bandit problem, designing efficient algorithms for complex models remains challenging, as there is typically no analytical way to quantify uncertainty. In this paper, we propose Multiplier Bootstrap-based Exploration (MBE), a novel exploration strategy that is applicable to any reward model amenable to weighted loss minimization. We prove both instance-dependent and instance-independent rate-optimal regret bounds for MBE in sub-Gaussian multi-armed bandits. With extensive simulation and real data experiments, we show the generality and adaptivity of MBE.

1. Introduction

The bandit problem has found wide applications in various areas such as clinical trials (Durand et al., 2018), finance (Shen et al., 2015), recommendation systems (Zhou et al., 2017), among others. Accurate uncertainty quantification is the key to address the exploration-exploitation trade-off. Most existing bandit algorithms critically rely on certain analytical property of the imposed model (e.g., linear bandits) to quantify the uncertainty and derive the exploration strategy. Thompson Sampling (TS, Thompson, 1933) and Upper Confidence Bound (UCB, Auer et al., 2002) are two prominent examples, which are typically based on explicit-form posterior distributions or confidence sets, respectively.

However, in many real problems, the reward model is fairly complex: e.g., a general graphical model (Chapelle & Zhang, 2009) or a pipeline with multiple prediction modules and manual rules. In these cases, it is typically impossible to quantify the uncertainty in an analytical way, and frameworks such as TS or UCB are either methodologically not applicable or computationally infeasible. Motivated by the real needs, we are concerned with the following question:

Can we design a practical bandit algorithm framework that is general, adaptive, and computationally tractable, with certain theoretical guarantees?

*Equal contribution. 1Amazon. 2Department of Statistics, North Carolina State University. Correspondence to: Runzhe Wan <runzhe.wan@gmail.com>.

A straightforward idea is to apply the bootstrap method (Efron, 1992), a widely applicable data-driven approach for measuring uncertainty. However, as discussed in Section 2, most existing bootstrap-based bandit algorithms are either heuristic without a theoretical guarantee, computationally intensive, or only applicable in limited scenarios. To address these limitations, we propose a new exploration strategy based on multiplier bootstrap (Van Der Vaart & Wellner, 1996), an easy-to-adapt bootstrap framework that only requires randomly weighted data points. We further show that a naive application of multiplier bootstrap may result in linear regret, and we introduce a suitable way to add additional perturbations for sufficient exploration.

Contribution. Our contributions are three-fold. First, we propose a general-purpose bandit algorithm framework, Multiplier Bootstrap-based Exploration (MBE). The main advantage of MBE is that it is general: it is applicable to any reward model amenable to weighted loss minimization, without need of analytical-form uncertainty quantification or case-by-case algorithm design. As a data-driven exploration strategy, MBE is also adaptive to different environments.

Second, theoretically, we prove near-optimal regret bounds for MBE under sub-Gaussian multi-armed bandits (MAB), in both the instance-dependent and the instance-independent sense. Compared with all existing results for bootstrap-based bandit algorithms, our result is strictly more general (see Table 1), since existing results only apply to some special cases of sub-Gaussian distributions. To overcome the technical challenges, we proved a novel concentration inequality for some function of sub-exponential variables, and also developed the first finite-sample concentration and anti-concentration analysis for multiplier bootstrap, to the best of our knowledge. Given the broad applications of multiplier bootstrap in statistics and machine learning, we believe our theoretical analysis has separate interest.

Third, with extensive simulation and real data experiments, we demonstrate that MBE yields comparable performance with existing algorithms in different MAB settings and three real-world problems (online learning to rank, online combinatorial optimization, and dynamic slate optimization). This supports that MBE is easily generalizable, as it requires minimal modifications and derivations to match the performance of near-optimal algorithms specifically designed for each problem. Moreover, we also show that MBE adapts to different environments and is relatively robust, due to its data-driven nature.

Table 1. Comparisons between several bootstrap-based bandit algorithms. We divide the sources of exploration into leveraging the intrinsic randomness in the observed data and manually adding extrinsic perturbations that are independent of the observed data. All papers derive near-optimal regret bounds in MAB, with different reward distribution requirements. To compare the computational cost, we focus on MAB and consider Algorithm 2 for MBE. See Section 2 for more details of the discussions in this table.

| | Exploration Source | Methodology Generality | Theory Requirement | Computation Cost |
|---|---|---|---|---|
| MBE (this paper) | intrinsic & extrinsic | general | sub-Gaussian | $\mathcal{O}(KT)$ |
| GIRO (Kveton et al., 2019c) | intrinsic & extrinsic | general | Bernoulli | $\mathcal{O}(T^2)$ |
| ReBoot (Wang et al., 2020; Wu et al., 2022) | intrinsic & extrinsic | fixed & finite set of arms | Gaussian | $\mathcal{O}(KT)$ |
| PHE (Kveton et al., 2019a;b; 2020a) | only extrinsic | general | bounded | $\mathcal{O}(KT)$ |

2. Related Work

The most popular bandit algorithms, arguably, include $\epsilon$-greedy (Watkins, 1989), TS, and UCB. $\epsilon$-greedy is simple and thus widely used; however, its exploration strategy is not aware of the uncertainty in the data and is therefore known to be statistically sub-optimal. TS and UCB rely on posteriors and confidence sets, respectively, yet their closed forms only exist in limited cases, such as MAB or linear bandits. For a few other models (such as generalized linear models or neural networks), we know how to construct approximate posteriors or confidence sets (Filippi et al., 2010; Li et al., 2017; Phan et al., 2019; Kveton et al., 2020b; Zhou et al., 2020), though the corresponding algorithms are usually costly or conservative. In more general cases, it is often unclear how to adapt UCB and TS in a valid and efficient way. Moreover, the dependence on probabilistic model assumptions also poses challenges to robustness.

To enable wider applications of bandit algorithms, several bootstrap-based (and related perturbation-based) methods have been proposed in the literature. Most algorithms are TS-type, by replacing the posterior with a bootstrap distribution. We next review the related papers, and summarize those with near-optimal regret bounds in Table 1.

Arguably, the non-parametric bootstrap is the most well-known bootstrap method, which works by re-sampling data with replacement. Vaswani et al. (2018) propose a version of non-parametric bootstrap with forced exploration that achieves an $\mathcal{O}(T^{2/3})$ regret bound in Bernoulli MAB. GIRO, proposed in Kveton et al. (2019c), achieves a rate-optimal regret bound in Bernoulli MAB by adding Bernoulli perturbations to the non-parametric bootstrap. However, due to its re-sampling nature, the non-parametric bootstrap is hard to update efficiently outside of Bernoulli MAB (see Section 4.3). Specifically, the computational cost of re-sampling scales quadratically in $T$.

Another line of research is the residual bootstrap-based approach (ReBoot) (Hao et al., 2019; Wang et al., 2020; Tang et al., 2021; Wu et al., 2022). For each arm, ReBoot randomly perturbs the residuals of the corresponding observed rewards with respect to the estimated model to quantify the uncertainty of its mean reward. We note that, although these methods also use random weights, the weights are applied to residuals, and the approach is hence fundamentally different from ours. The limitation is that, by design, this approach is only applicable to problems with a fixed and finite set of arms, since the residuals are attached closely to each arm (see Appendix A.4 for more details).

The perturbed history exploration (PHE) algorithm (Kveton et al., 2019a;b; 2020a) is also related. PHE works by adding additive noise to the observed rewards. Osband et al. (2019) applies similar ideas to reinforcement learning. However, PHE has two main limitations. First, for models where adding additive noise is not feasible (e.g., decision trees), PHE is not applicable. Second, as demonstrated in both Wang et al. (2020) and our experiments, the fact that PHE relies only on the extrinsically injected noise for exploration makes it less robust. For a complex structured problem, it may not be clear how to add the noise in a sound way (Wang et al., 2020). In contrast, it is typically more natural (and hence easier to accept) to leverage the intrinsic randomness in the observed data.

Finally, we note that multiplier bootstrap has been considered in the bandit literature, mostly as a computationally efficient approximation to the non-parametric bootstrap studied in those papers. Eckles & Kaptein (2014) study the direct adaptation of multiplier bootstrap (see Section 4.1) in simulation, and its empirical performance in contextual bandits was studied later (Tang et al., 2015; Elmachtoub et al., 2017; Riquelme et al., 2018; Bietti et al., 2021). However, no theoretical guarantee is provided in these works. In fact, as demonstrated in Section 4.1, such a naive adaptation may have a linear regret. Osband & Van Roy (2015) show that, in Bernoulli MAB, a variant of multiplier bootstrap is mathematically equivalent to TS; no further theoretical or numerical results are established beyond this special case. Our work is the first systematic study of multiplier bootstrap in bandits. Our unique contributions include: we identify the potential failure of naively applying multiplier bootstrap, highlight the importance of additional perturbations, design a general algorithm framework to make this heuristic idea concrete, provide the first theoretical guarantee in general MAB settings, and conduct extensive numerical experiments to study its generality and adaptivity.

3. Preliminary

Setup. We consider a general stochastic bandit problem. For any positive integer $M$, let $[M] = \{1, \dots, M\}$. At each round $t \in [T]$, the agent observes a context vector $\mathbf{x}_t$ (empty in non-contextual problems) and an action set $\mathcal{A}_t$, then chooses an action $A_t \in \mathcal{A}_t$, and finally receives the corresponding reward $R_t = f(\mathbf{x}_t, A_t) + \epsilon_t$. Here, $f$ is an unknown function and $\epsilon_t$ is a noise term. Without loss of generality, we assume $f(\mathbf{x}_t, A_t) \in [0, 1]$. The goal is to minimize the cumulative regret

$$\text{Reg}_T = \sum_{t=1}^T \mathbb{E} \left[ \max_{a \in \mathcal{A}_t} f(\mathbf{x}_t, a) - f(\mathbf{x}_t, A_t) \right].$$

At time $t$, with an existing dataset $\mathcal{D}_t = \{(\mathbf{x}_l, A_l, R_l)\}_{l \in [t]}$, to decide the action $A_{t+1}$, most algorithms first estimate $f$ in some function class $\mathcal{F}$ by solving a weighted loss minimization problem (also called weighted empirical risk minimization or cost-sensitive training)

$$\hat{f} = \arg \min_{f \in \mathcal{F}} \frac{1}{t} \sum_{l=1}^t \omega_l \mathcal{L}(f(\mathbf{x}_l, A_l), R_l) + J(f). \quad (1)$$

Here, $\mathcal{L}$ is a loss function (e.g., $\ell_2$ loss or negative log-likelihood), $\omega_l$ is the weight of the $l$-th data point, and $J$ is an optional penalty function. We consider the weighted problem as it is general and related to our proposal below; one can set $\omega_l \equiv 1$ to recover the unweighted problem. As the simplest example, consider the $K$-armed bandit problem where $\mathbf{x}_l$ is empty and $\mathcal{A}_l = [K]$. Let $\mathcal{L}$ be the $\ell_2$ loss, $J \equiv 0$, and $f(\mathbf{x}_l, A_l) \equiv r_{A_l}$, where $r_k$ is the mean reward of the $k$-th arm. Then, (1) reduces to $\arg \min_{\{r_1, \dots, r_K\}} \sum_{l=1}^t \omega_l (R_l - r_{A_l})^2$, which gives the estimator $\hat{r}_k = (\sum_{l: A_l=k} \omega_l)^{-1} \sum_{l: A_l=k} \omega_l R_l$, i.e., the arm-wise weighted average. Similarly, in linear bandits, (1) reduces to a weighted least-squares problem (see Appendix A.2 for details).
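To make the reduction concrete, the following sketch (toy data; the `weighted_arm_means` helper is ours, not the paper's) computes the closed-form minimizer of the weighted $\ell_2$ loss, i.e., the arm-wise weighted average:

```python
import numpy as np

# Toy two-armed data: actions A_l, rewards R_l, and per-point weights w_l.
A = np.array([0, 1, 0, 1, 0])
R = np.array([1.0, 0.0, 0.5, 1.0, 0.0])
w = np.array([2.0, 1.0, 1.0, 3.0, 2.0])

def weighted_arm_means(A, R, w, K):
    """Closed-form minimizer of sum_l w_l (R_l - r_{A_l})^2:
    the arm-wise weighted average of the observed rewards."""
    return np.array([
        np.sum(w[A == k] * R[A == k]) / np.sum(w[A == k])
        for k in range(K)
    ])

r_hat = weighted_arm_means(A, R, w, K=2)
# Arm 0: (2*1 + 1*0.5 + 2*0) / 5 = 0.5; arm 1: (1*0 + 3*1) / 4 = 0.75.
```

Setting all weights to 1 recovers the ordinary sample mean per arm.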

Challenges. The estimation of $f$, together with the related uncertainty quantification, forms the foundation of most bandit algorithms. In the literature, $\mathcal{F}$ is typically a class of models that permit closed-form uncertainty quantification (e.g., linear models, Gaussian processes, etc.). However, in many real applications, the reward model can have a fairly complicated structure, e.g., a hierarchical pipeline with both classification and regression modules; manually specified rules are also commonly part of the model. It is challenging to quantify the uncertainty of such complicated models in analytical form. Even when feasible, the dependence on probabilistic model assumptions also poses challenges to robustness.

Therefore, in this paper, we focus on the bootstrap-based approach due to its generality and data-driven nature. Bootstrapping, as a general approach to quantifying model uncertainty, has many variants. The most popular one, arguably, is the non-parametric bootstrap (used in GIRO), which constructs bootstrap samples by re-sampling the dataset with replacement. However, due to its re-sampling nature, it is computationally intensive (see Section 4.3 for more discussion). In contrast, multiplier bootstrap (Van Der Vaart & Wellner, 1996), an efficient and easy-to-implement alternative, is popular in statistics and machine learning.

Multiplier bootstrap. The main idea of multiplier bootstrap is to learn the model using randomly weighted data points. Specifically, given a multiplier weight distribution $\rho(\omega)$, for every bootstrap sample, we first randomly sample $\{\omega_l^{MB}\}_{l=1}^{t} \sim \rho(\omega)$, and then solve (1) with $\omega_l = \omega_l^{MB}$ to obtain $\hat{f}^{MB}$. Repeating this procedure, the distribution of $\hat{f}^{MB}$ forms the bootstrap distribution that quantifies our uncertainty about $f$. Popular choices of $\rho(\omega)$ include $\mathcal{N}(1, \sigma_\omega^2)$, $\text{Exp}(1)$, $\text{Poisson}(1)$, and the double-or-nothing distribution $2 \times \text{Bernoulli}(0.5)$.
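As a minimal illustration (assuming $\mathcal{N}(1, 1)$ multiplier weights and the sample mean as the model, both chosen here for simplicity), the bootstrap distribution of a weighted mean can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=1.0, size=200)  # observed rewards

def multiplier_bootstrap_means(data, n_boot=2000, rng=rng):
    """Multiplier bootstrap for the mean: each replicate reweights every
    data point with an i.i.d. weight from rho(w) = N(1, 1) and recomputes
    the weighted average."""
    n = len(data)
    W = rng.normal(1.0, 1.0, size=(n_boot, n))   # one weight row per replicate
    return (W @ data) / W.sum(axis=1)            # weighted means

boot = multiplier_bootstrap_means(data)
# The spread of `boot` quantifies uncertainty about the mean; its standard
# deviation is close to the usual standard error of the sample mean.
```

Replacing the weighted average with any weighted-loss minimizer gives the general procedure described above.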

4. Multiplier Bootstrap-based Exploration

4.1. Failure of the naive adaptation of multiplier bootstrap

To design an exploration strategy based on multiplier bootstrap, a natural idea is to replace the posterior distribution in TS with the bootstrap distribution. Specifically, at every time point, we sample a function $\hat{f}$ following the multiplier bootstrap procedure as described in Section 3, and then take the greedy action $\arg \max_{a \in \mathcal{A}_t} \hat{f}(\mathbf{x}_t, a)$ . However, perhaps surprisingly, such an adaptation may not be valid. The main reason is that the intrinsic randomness in a finite dataset is, in some cases, not enough to guarantee sufficient exploration. We illustrate with the following toy example.

Example 1. Consider a two-armed Bernoulli bandit. Let the mean rewards of the two arms be $p_1$ and $p_2$, respectively. Without loss of generality, assume $1 > p_1 > p_2 > 0$. Let $\mathbb{P}(\omega = 0) = 0$. Then, with non-zero probability, an agent following the naive adaptation of multiplier bootstrap (breaking ties randomly; see Algorithm 5 in Appendix A.3 for details) pulls arm 1 only once. Therefore, the agent suffers a linear regret.

Proof. We first define two events $\mathcal{E}_1 = \{A_1 = 1, R_1 = 0\}$ and $\mathcal{E}_2 = \{A_2 = 2, R_2 = 1\}$. By design, at time $t = 1$, the agent chooses an arm uniformly at random and hence pulls arm 1 with probability 0.5. The observed reward $R_1$ is then 0 with probability $1 - p_1$. Therefore, $\mathbb{P}(\mathcal{E}_1) = 0.5(1 - p_1)$. Conditioned on $\mathcal{E}_1$, at $t = 2$, the agent pulls arm 2 (since multiplying $R_1 = 0$ by any weight always gives 0), and it then observes reward $R_2 = 1$ with probability $p_2$. Conditioned on $\mathcal{E}_1 \cap \mathcal{E}_2$, by induction, the agent pulls arm 2 for every $t > 2$: the only reward record for arm 1 is $R_1 = 0$, so its weighted average is always 0, which is smaller than the weighted average for arm 2, which is positive. In conclusion, with probability at least $0.5 \times (1 - p_1) \times p_2 > 0$, the algorithm pulls the optimal arm 1 only once.
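This failure mode can be reproduced in a small simulation. The sketch below is ours, not the paper's code: it uses hypothetical mean rewards $p = [0.6, 0.5]$ and $\text{Exp}(1)$ weights (so $\mathbb{P}(\omega = 0) = 0$); in a noticeable fraction of runs the optimal arm is abandoned after a single unlucky pull:

```python
import numpy as np

p = [0.6, 0.5]  # mean rewards, arm 0 optimal (hypothetical values)

def naive_mb_run(T=300, rng=None):
    """Naive multiplier bootstrap (no pseudo-rewards) on a two-armed
    Bernoulli bandit. Each round, every arm's history is reweighted with
    fresh Exp(1) weights and the arm with the higher weighted mean wins."""
    rng = rng or np.random.default_rng()
    rewards = {0: [], 1: []}
    pulls = np.zeros(2, dtype=int)
    for t in range(T):
        means = np.empty(2)
        for k in (0, 1):
            r = np.asarray(rewards[k], dtype=float)
            if len(r) == 0:
                means[k] = np.inf          # pull each arm at least once
            else:
                w = rng.exponential(1.0, size=len(r))
                means[k] = (w @ r) / w.sum()
        a = int(rng.choice(np.flatnonzero(means == means.max())))
        rewards[a].append(float(rng.random() < p[a]))
        pulls[a] += 1
    return pulls

pulls = naive_mb_run(rng=np.random.default_rng(1))
# If the first pull of arm 0 returns 0, its weighted mean stays 0 forever
# once arm 1 records a reward of 1, so arm 0 is never revisited.
```

Across repeated runs, the stuck event occurs with probability at least $0.5(1 - p_1)p_2 = 0.1$ under these parameters, matching the proof.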

4.2. Main algorithm

The failure of the naive application of multiplier bootstrap implies that additional randomness is needed to ensure sufficient exploration. In this paper, we achieve this by adding pseudo-rewards, an approach that has proven effective in a few other setups (Kveton et al., 2019c; Wang et al., 2020). The intuition is as follows. The under-exploration issue arises when, by chance, the observed rewards fall in the low-value region (relative to the expected reward). Therefore, if we blend in some data points with rewards that have a relatively wide coverage, the agent has a higher chance to explore.

These discussions motivate the design of our main algorithm, Multiplier Bootstrap-based Exploration (MBE), as in Algorithm 1. Specifically, at every round, in addition to the observed reward, we add two pseudo-rewards with values 0 and 1. The pseudo-rewards are associated with the pulled arm and the context (if it exists). Then, we solve a weighted loss minimization problem to update the model estimate (line 8). The weights are first sampled from a multiplier distribution (line 7), and those of the pseudo-rewards are then additionally multiplied by a tuning parameter $\lambda$. In MAB, the estimates are arm-wise weighted averages of all (observed or pseudo-) rewards, $\bar{Y}_k = \sum_{\ell: A_\ell=k} (\omega_\ell R_{k,\ell} + \lambda \omega''_\ell) / \sum_{\ell: A_\ell=k} (\omega_\ell + \lambda \omega'_\ell + \lambda \omega''_\ell)$. See Appendix A.1 for details.

We make three remarks on the algorithm design. First, we choose to add pseudo-rewards at the boundaries of the mean reward range (i.e., $[0, 1]$), since such a design naturally induces a high variance (and hence more exploration); adding pseudo-rewards in other ways is also possible. Second, the tuning parameter $\lambda$ controls the amount of extrinsic perturbation and, together with the dispersion of $\rho(\omega)$, determines the degree of exploration. In Section 5, we give a theoretically valid range for $\lambda$. Finally and critically, besides guaranteeing sufficient exploration, we need to make sure the optimal arm can still be identified (asymptotically) after adding the pseudo-rewards. Intuitively, this is guaranteed, since we shift and scale the (asymptotic) mean reward from $f(\mathbf{x}, a)$ to $(f(\mathbf{x}, a) + \lambda) / (1 + 2\lambda) = f(\mathbf{x}, a) / (1 + 2\lambda) + \lambda / (1 + 2\lambda)$, which preserves the order between arms. A detailed analysis for MAB can be found in Appendix A.1.

We conclude this section by revisiting Example 1 to provide some insight into how the pseudo-rewards help.

Example 1 (Continued). Even under the event $\mathcal{E}_1 \cap \mathcal{E}_2$, Algorithm 1 keeps a chance to explore. To see this, suppose the multiplier distribution is $2 \times \text{Bernoulli}(0.5)$. Then, we have $\mathbb{P}(A_3 = 1) \geq \mathbb{P}(\bar{Y}_1 > \bar{Y}_2) = \mathbb{P}\left(\frac{\lambda \omega''_1}{\omega_1 + \lambda \omega'_1 + \lambda \omega''_1} > \frac{\omega_2 + \lambda \omega''_2}{\omega_2 + \lambda \omega'_2 + \lambda \omega''_2}\right) \geq \mathbb{P}(\omega''_1 = 2, \omega'_2 = 2, \omega_1 = \omega'_1 = \omega_2 = \omega''_2 = 0) = (1/2)^6$. Therefore, the agent still has a chance to choose the optimal arm.
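With the $\ell_2$ loss, line 8 of Algorithm 1 gives each arm's estimate in closed form, so the double-or-nothing weight combinations can be enumerated exactly. The sketch below (the helper is ours; $\lambda = 0.5$ is chosen only for illustration) confirms that favorable weight draws exist, each with probability $(1/2)^6$:

```python
import itertools

lam = 0.5  # pseudo-reward scaling (hypothetical value for illustration)

def arm_estimate(w, wp, wpp, reward):
    """l2-loss minimizer for one arm with a single observed reward plus
    pseudo-rewards 0 (weight lam*wp) and 1 (weight lam*wpp), following
    line 8 of Algorithm 1."""
    denom = w + lam * wp + lam * wpp
    if denom == 0:
        return None  # no effective data for this arm; estimate undefined
    return (w * reward + lam * wpp * 1.0) / denom

favorable = 0
for ws in itertools.product([0, 2], repeat=6):  # double-or-nothing weights
    w1, wp1, wpp1, w2, wp2, wpp2 = ws
    y1 = arm_estimate(w1, wp1, wpp1, reward=0.0)  # arm 1 observed R_1 = 0
    y2 = arm_estimate(w2, wp2, wpp2, reward=1.0)  # arm 2 observed R_2 = 1
    if y1 is not None and y2 is not None and y1 > y2:
        favorable += 1

# Each of the 64 weight combinations has probability (1/2)^6, so
# P(pull arm 1 next) >= favorable / 64 > 0.
```

In particular, the combination with $\omega''_1 = 2$, $\omega'_2 = 2$, and all other weights 0 gives estimates 1 and 0 for arms 1 and 2, respectively.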


Algorithm 1 General Template for MBE


Data: Function class $\mathcal{F}$ , loss function $\mathcal{L}$ , (optional) penalty function $J$ , multiplier weight distribution $\rho(\omega)$ , tuning parameter $\lambda$


2 Initialize  $\hat{f}$ 
3 for  $t = 1, \dots, T$  do
4     Observe context  $\mathbf{x}_t$  and action set  $\mathcal{A}_t$ 
5     Offer  $A_t = \arg \max_{a \in \mathcal{A}_t} \hat{f}(\mathbf{x}_t, a)$  (break ties randomly)
6     Observe reward  $R_t$ 
7     Sample the multiplier weights  $\{\omega_l, \omega'_l, \omega''_l\}_{l=1}^t \sim \rho(\omega)$ 
8     Solve the weighted loss minimization problem
        
$$\hat{f} = \arg \min_{f \in \mathcal{F}} \sum_{l=1}^t \left[ \omega_l \mathcal{L}(f(\mathbf{x}_l, A_l), R_l) + \lambda \omega'_l \mathcal{L}(f(\mathbf{x}_l, A_l), 0) + \lambda \omega''_l \mathcal{L}(f(\mathbf{x}_l, A_l), 1) \right] + J(f).$$

9 end
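For concreteness, a minimal MAB instantiation of Algorithm 1 with the $\ell_2$ loss might look as follows; line 8 then reduces to arm-wise weighted averages. The $\text{Exp}(1)$ weight choice and the initialization scheme are illustrative assumptions, not the paper's reference implementation:

```python
import numpy as np

def mbe_mab(K, T, pull, lam=1.0, rng=None):
    """Sketch of Algorithm 1 for a K-armed bandit with the l2 loss and
    Exp(1) multiplier weights. Each round, fresh weights are drawn for
    all past observations and their two pseudo-rewards (0 and 1)."""
    rng = rng or np.random.default_rng()
    arms, rewards = [], []
    for t in range(T):
        if t < K:
            a = t  # initialization: pull each arm once
        else:
            A, R = np.array(arms), np.array(rewards)
            # weights for real rewards, pseudo-reward 0, pseudo-reward 1
            w, wp, wpp = rng.exponential(1.0, size=(3, t))
            est = np.empty(K)
            for k in range(K):
                m = A == k
                num = np.sum(w[m] * R[m]) + lam * np.sum(wpp[m])
                den = np.sum(w[m] + lam * (wp[m] + wpp[m]))
                est[k] = num / den
            a = int(np.argmax(est))
        arms.append(a)
        rewards.append(pull(a))
    return np.bincount(arms, minlength=K)

# e.g. two Bernoulli arms with (hypothetical) means 0.8 and 0.3:
rng = np.random.default_rng(0)
counts = mbe_mab(2, 400, pull=lambda a: float(rng.random() < [0.8, 0.3][a]), rng=rng)
```

Note that re-sampling all $t$ weights each round is exactly the $\mathcal{O}(T^2)$ cost that Section 4.3 removes.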
    

4.3. Computationally efficient implementation

Efficient computation is critical for real applications of bandit algorithms. One potential limitation of Algorithm 1 is the computational burden: at every decision point, we need to re-sample the weights for all historical observations (line 8). This leads to a total computational cost of order $\mathcal{O}(T^2)$, similar to GIRO.

Fortunately, one prominent advantage of multiplier bootstrap over other bootstrap methods (such as non-parametric bootstrap or residual bootstrap) is that the (approximate) bootstrap distribution can be efficiently updated in an online manner, such that the per-round computation cost does not grow over time. Suppose we have a dataset $\mathcal{D}_t$ at time $t$, and denote by $\mathcal{B}(\mathcal{D}_t)$ the corresponding bootstrap distribution for $f$. With multiplier bootstrap, it is feasible to update $\mathcal{B}(\mathcal{D}_{t+1})$ approximately based on $\mathcal{B}(\mathcal{D}_t)$. We detail the procedure below and elaborate more in Algorithm 2.

Specifically, we maintain $B$ different models $\{\hat{f}_{b,t}\}_{b=1}^B$ and the corresponding histories (with random weights) $\{\mathcal{H}_b, \mathcal{H}'_b\}_{b=1}^B$. $\{\hat{f}_{b,t}\}_{b=1}^B$ can be regarded as sampled from $\mathcal{B}(\mathcal{D}_t)$, and hence the empirical distribution over them is an approximation to the bootstrap distribution. At every time point $t$, for each replicate $b$, we only need to sample one weight for the new data point and then update $\hat{f}_{b,t}$ to $\hat{f}_{b,t+1}$. Then, $\{\hat{f}_{b,t+1}\}_{b=1}^B$ are still $B$ valid samples from $\mathcal{B}(\mathcal{D}_{t+1})$ and hence still a valid approximation. We note that, since we only have one new data point, the update of $f$ can typically be done efficiently (e.g., with a closed-form update or via online gradient descent). The per-round computational cost is hence independent of $t$.

Such an approximation is a common practice in the online bootstrap literature and can be regarded as an ensemble sampling-type algorithm (Lu & Van Roy, 2017; Qin et al., 2022). The hyper-parameter $B$ is typically not treated as a tuning parameter but depends on the available computational resources (Hao et al., 2019). In our numerical experiments, this practical variant shows the desired performance with $B = 50$. Moreover, the algorithm is embarrassingly parallel and also easy to implement: given an existing implementation for estimating $f$ (i.e., solving (1)), the major requirement is to replicate it $B$ times and use random weights for each replicate. This feature is attractive in real applications.
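For MAB with the $\ell_2$ loss, the per-replicate minimization has a closed form that can be maintained via running weighted sums, so each replicate is updated in constant time per round. A sketch under these assumptions (the class name and the $\text{Exp}(1)$ weight choice are illustrative, not from the paper):

```python
import numpy as np

class MBEPractical:
    """Sketch of Algorithm 2 for a K-armed bandit: B bootstrap replicates,
    each keeping running weighted sums so the per-round cost is O(B)."""
    def __init__(self, K, B=50, lam=1.0, rng=None):
        self.rng = rng or np.random.default_rng()
        self.K, self.B, self.lam = K, B, lam
        self.num = np.zeros((B, K))  # running weighted (pseudo-)reward sums
        self.den = np.zeros((B, K))  # running weight sums

    def act(self):
        b = self.rng.integers(self.B)  # sample one replicate (line 6)
        unexplored = np.flatnonzero(self.den[b] == 0)
        if len(unexplored) > 0:
            return int(unexplored[0])  # pull each arm at least once
        return int(np.argmax(self.num[b] / self.den[b]))

    def update(self, a, r):
        # One fresh weight triple per replicate for the new data point
        # (lines 10-12); with the l2 loss the estimate is num / den.
        w, wp, wpp = self.rng.exponential(1.0, size=(3, self.B))
        self.num[:, a] += w * r + self.lam * wpp   # pseudo-reward 1
        self.den[:, a] += w + self.lam * (wp + wpp)

# Usage on two Bernoulli arms with (hypothetical) means 0.8 and 0.3:
rng = np.random.default_rng(0)
agent, counts = MBEPractical(K=2, rng=rng), np.zeros(2)
for t in range(400):
    a = agent.act()
    agent.update(a, float(rng.random() < [0.8, 0.3][a]))
    counts[a] += 1
```

Only the per-arm sums are stored, so memory and per-round time stay constant in $t$, matching the discussion above.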

5. Regret Analysis

In this section, we provide regret bounds for Algorithm 1 under MAB with sub-Gaussian rewards. We regard this as a first step towards the theoretical understanding of MBE, and leave the analysis of more general settings to future research. We call a random variable $X$ $\sigma$-sub-Gaussian if $\mathbb{E} \exp\{t(X - \mathbb{E}X)\} \leq \exp\{t^2\sigma^2/2\}$ for any $t \in \mathbb{R}$.

Theorem 5.1. Consider a $K$-armed bandit, where the reward distribution of arm $k$ is 1-sub-Gaussian with mean $\mu_k$. Suppose arm 1 is the unique best arm, i.e., it has the highest mean reward, and let $\Delta_k = \mu_1 - \mu_k$. Take the multiplier weight distribution to be $\mathcal{N}(1, \sigma_\omega^2)$ in Algorithm 3. Let the tuning parameter satisfy $\lambda \geq (1 + 4/\sigma_\omega) + \sqrt{4(1 + 4/\sigma_\omega)/\sigma_\omega}$. Then, the problem-dependent regret is upper bounded by

$$\text{Reg}_T \leq \sum_{k=2}^K \left\{ 7\Delta_k + \frac{10[C_1^*(\lambda, \sigma_\omega) + C_2^*(\lambda, \sigma_\omega)]}{\Delta_k} \log T \right\},$$


Algorithm 2 Practical Implementation of MBE


Data: Number of bootstrap replicates $B$, function class $\mathcal{F}$, loss function $\mathcal{L}$, (optional) penalty function $J$, weight distribution $\rho(\omega)$, tuning parameter $\lambda$


2 Set  $\mathcal{H}_b = \{\}$  as the history and  $\mathcal{H}'_b = \{\}$  as the pseudo-history, for each  $b \in [B]$ 
3 Initialize  $\hat{f}_{b,0}$  for any  $b \in [B]$ 
4 for  $t = 1, \dots, T$  do
5   Observe context  $\mathbf{x}_t$  and action set  $\mathcal{A}_t$ 
6   Sample an index  $b_t$  uniformly from  $\{1, \dots, B\}$ 
7   Offer  $A_t = \arg \max_{A \in \mathcal{A}_t} \hat{f}_{b_t, t-1}(\mathbf{x}_t, A)$  (break ties randomly)
8   Observe reward  $R_t$ 
9   for  $b = 1, \dots, B$  do
10    Sample the weights  $\omega_{t,b}, \omega'_{t,b}, \omega''_{t,b} \sim \rho(\omega)$ 
11    Update  $\mathcal{H}_b = \mathcal{H}_b \cup \{(\mathbf{x}_t, A_t, R_t, \omega_{t,b})\}$  and  $\mathcal{H}'_b = \mathcal{H}'_b \cup \{(\mathbf{x}_t, A_t, 0, \omega'_{t,b}), (\mathbf{x}_t, A_t, 1, \omega''_{t,b})\}$ 
12    Solve the weighted loss minimization problem
    
$$\hat{f}_{b,t} = \arg \min_{f \in \mathcal{F}} \sum_{l=1}^t \left[ \omega_{l,b} \mathcal{L}(f(\mathbf{x}_l, A_l), R_l) + \lambda \omega'_{l,b} \mathcal{L}(f(\mathbf{x}_l, A_l), 0) + \lambda \omega''_{l,b} \mathcal{L}(f(\mathbf{x}_l, A_l), 1) \right] + J(f).$$

13  end
14 end

and the problem-independent regret is bounded by

$$\text{Reg}_T \leq 7K\mu_1 + C_1^*(\lambda, \sigma_\omega)K \log T + 2\sqrt{C_2^*(\lambda, \sigma_\omega)KT \log T}.$$

Here, $C_1^*(\lambda, \sigma_\omega) = 8\sqrt{2}C_3^*(\lambda, \sigma_\omega) + 38\sigma_\omega^2$ and $C_2^*(\lambda, \sigma_\omega) = 5\lambda^2 + [45(3 + \sigma_\omega^2)\lambda^4 C_3^*(\lambda, \sigma_\omega) + 38\sigma_\omega^2]$ are tuning-parameter-related constants, and $C_3^*(\lambda, \sigma_\omega)$ is the logarithmic term $\log[(1 + 15\sigma_\omega^{-2} + 3\sigma_\omega + 10\sigma_\omega^2)\lambda^2]/(3 \log 2) + 1$.

The two regret bounds are near-optimal (up to logarithmic terms) in the problem-dependent and problem-independent sense, respectively (Lattimore & Szepesvári, 2020). Notably, recall that Gaussian distributions and all bounded distributions belong to the sub-Gaussian class. Therefore, as reviewed in Table 1, our theory is strictly more general than all existing results for bootstrap-based MAB algorithms.

Technical challenges. It is particularly challenging to analyze MBE for two reasons. First, the probabilistic analysis of multiplier bootstrap itself is technically challenging, since the same random weights appear in both the denominator and the numerator (recall that MBE uses the weighted averages $\{\bar{Y}_k = \sum_{\ell: A_\ell=k} (\omega_\ell R_{k,\ell} + \lambda \omega''_\ell) / \sum_{\ell: A_\ell=k} (\omega_\ell + \lambda \omega'_\ell + \lambda \omega''_\ell)\}_{k=1}^K$ to select actions in MAB). It is notoriously complicated to analyze ratios of random variables, especially when they are correlated. Besides, existing bootstrap-based papers rely on the properties of specific parametric reward classes (e.g., Bernoulli in Kveton et al. (2019c) and Gaussian in Wang et al. (2020)), while we lose these nice structures when considering sub-Gaussian rewards.

To overcome these challenges, we start by carefully defining two good events on which the weighted average $\bar{Y}_k$, the non-weighted average (with pseudo-rewards) $\bar{R}_k^* = \sum_{\ell: A_\ell=k} (R_{k,\ell} + 1 \times \lambda + 0 \times \lambda) / \sum_{\ell: A_\ell=k} (1 + 2\lambda)$, and the shifted asymptotic mean $(\mu_k + \lambda) / (1 + 2\lambda)$ are close to each other (see Appendix C). To bound the probability of the bad event and to control the regret on it, we face two major technical challenges. First, when transforming the ratio into an analyzable form, a summation of correlated sub-Gaussian and sub-exponential variables appears and is hard to analyze. We carefully design and analyze a novel event to remove the correlation and the sub-Gaussian terms (see the proof of Lemma D.3). Second, the proof needs a new concentration inequality, for functions of sub-exponential variables, that does not exist in the literature. We obtain this inequality (Lemma E.4) via a careful analysis of sub-exponential distributions.

We believe our new concentration inequality is of separate interest for the analysis of sub-exponential distributions. Moreover, to the best of our knowledge, our proof provides the first finite-sample concentration and anti-concentration analysis for multiplier bootstrap, which has broad applications in statistics and machine learning.

Tuning parameters. In Theorem 5.1, MBE has two tuning parameters, $\lambda$ and $\sigma_\omega$. Intuitively, $\lambda$ controls the amount of external perturbation and $\sigma_\omega$ controls the magnitude of exploration from the bootstrap. In general, higher values of these two parameters facilitate exploration but also lead to slower convergence. The condition $\lambda \geq (1 + 4/\sigma_\omega) + \sqrt{4(1 + 4/\sigma_\omega)/\sigma_\omega}$ requires that (i) $\lambda$ is not too small and (ii) the joint effect of $\lambda$ and $\sigma_\omega$ is not too small. Both are intuitive and reasonable. In practice, the theoretical condition could be loose: e.g., it requires $\lambda \geq 5 + 2\sqrt{5}$ when $\sigma_\omega = 1$. As we observe in Section 6, MBE with a smaller $\lambda$ (e.g., 0.5) still performs well empirically.
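The threshold is easy to evaluate numerically; a small helper (the function name is ours) recovers the $\lambda \geq 5 + 2\sqrt{5} \approx 9.47$ figure for $\sigma_\omega = 1$:

```python
import math

def lambda_lower_bound(sigma_w):
    """Theoretical condition on lambda from Theorem 5.1:
    lambda >= (1 + 4/sigma_w) + sqrt(4 * (1 + 4/sigma_w) / sigma_w)."""
    c = 1 + 4 / sigma_w
    return c + math.sqrt(4 * c / sigma_w)

# sigma_w = 1 gives 5 + 2*sqrt(5) ~ 9.47; larger sigma_w (more bootstrap
# dispersion) permits a smaller lambda, consistent with intuition (ii).
```

The gap between this threshold and the empirically sufficient $\lambda = 0.5$ illustrates how loose the theoretical condition can be.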

6. Experiments

In this section, we study the empirical performance of MBE with both simulation (Section 6.1) and real datasets (Section 6.2).

6.1. MAB Simulation

We first experiment with simulated MAB instances. The goal is to (i) further validate our theoretical findings, (ii) check whether MBE can yield comparable performance with standard methods, and (iii) study the robustness and adaptivity of MBE. We also experimented with linear bandits, and the main findings are similar; to save space, we defer those results to Appendix B.1.

We compare MBE with TS (Thompson, 1933), PHE (Kveton et al., 2019a), ReBoot (Wang et al., 2020), and GIRO (Kveton et al., 2019c). The last three are the existing bootstrap- or perturbation-type algorithms reviewed in Section 2. Specifically, PHE explores by perturbing observed rewards with additive noise, without leveraging the intrinsic uncertainty in the data; ReBoot explores by perturbing the residuals of the rewards observed for each arm; and GIRO re-samples observed data points. In all experiments below, the weights of MBE are sampled from $\mathcal{N}(1, \sigma_\omega^2)$.1 We fix $\lambda = 0.5$ and run MBE with three different values of $\sigma_\omega^2$: 0.5, 1, and 1.5. We also compare with the naive adaptation of multiplier bootstrap (i.e., no pseudo-rewards; denoted as Naive MB). We run Algorithm 2 with $B = 50$ replicates.

We first study 10-armed bandits, where the mean reward of each arm is independently sampled from $\text{Beta}(1, 8)$ . We consider three reward distributions: Bernoulli, Gaussian, and exponential. For Gaussian MAB, the reward noise is sampled from $\mathcal{N}(0, 1)$ . The other two distributions are determined by their means. For TS, we always use the correct reward distribution class and its conjugate prior. The prior mean and variance are calibrated using the true model. Therefore, we use TS as a strong and standard baseline. For GIRO and ReBoot, we use the default implementations, as they work well. For PHE, the original paper adds Bernoulli perturbations since it only studies bounded reward distributions. We extend PHE by sampling additive noise from the same distribution family as the true rewards, as done in Wu et al. (2022). GIRO, ReBoot, and PHE all have one tuning parameter that controls the degree of exploration. We tune these hyper-parameters over $\{2^{k-4}\}_{k=0}^6$ and report the best performance of each method. Without tuning, these algorithms generally do not perform well with the hyper-parameters suggested in the original papers, due to the differences in settings. We tuned Naive MB as well.

Results. Results over 100 runs are displayed in Figure 1. Our findings can be summarized as follows. First, without knowledge of the problem settings (e.g., the reward distribution family and its parameters, and the prior distribution) and without heavy tuning, MBE performs favorably and close to TS. Second, pseudo-rewards are indeed important in exploration, otherwise the algorithm suffers a linear regret. Third, MBE has a stable performance with different $\sigma_\omega$ (note that other methods are tuned for their best performance). This is thanks to the data-driven nature of MBE. Finally, the other three general-purpose exploration strategies perform

1We also experimented with other weight distributions, with similar main conclusions. Using Gaussian weights allows us to study the impact of different multiplier magnitudes more clearly.

Figure 1. Simulation results under MAB. The error bars indicate the standard errors, which may not be visible when the width is small.

reasonably after tuning, as expected. However, GIRO is computationally intensive. For example, in Gaussian bandits, the time cost for GIRO is 2 minutes, while all the other algorithms complete within 10 seconds. The computational burden is due to the limitation of non-parametric bootstrap (see Section 4.3). ReBoot also performs reasonably, yet by design it is not easy to extend to more complex problems (e.g., the problems in Section 6.2).

Adaptivity. PHE relies on sampling additive noise from an appropriate distribution, and TS has a similar dependency. In the results above, we provided them with auxiliary information about the environment and had to modify their implementations across setups. In contrast, MBE automatically adapts to these problems. As argued in Section 2, one main advantage of MBE over these methods is its adaptiveness. To see this, we consider the following procedure: we run the Gaussian versions of TS and PHE in Bernoulli MAB, and run their Bernoulli versions in Gaussian MAB. We also run MBE with $\sigma_\omega^2 = 0.5$ , which does not require any modification across the two problems. The results presented in Figure 2 clearly demonstrate that MBE adapts to reward distributions.

Similarly, we also studied the adaptivity of these methods with respect to the reward distribution scale (the standard deviation of the Gaussian noise, $\sigma$ ) and the task distribution (we sample the mean rewards from $\text{Beta}(\alpha, 8)$ and vary the parameter $\alpha$ ). For all settings, we use the algorithms tuned for Figure 1; the results are reported in Figure 3. We observe that MBE shows impressive adaptiveness, while PHE and TS may not perform well when the environment is not close to the one they were tuned for. Recall that, in real applications, heavy tuning is not possible without ground truths. This demonstrates the adaptivity of MBE as a data-driven exploration strategy.

Additional results. In Appendix B.2, we also try different values of $\lambda$ and $B$ for MBE. We also repeat the main experiment with $K = 25$ . Our main observations above still hold and MBE is relatively robust to these tuning parameters.

Figure 2. Robust results against the reward distribution class.

Figure 3. Results with different reward variances and task distributions. For the x-axis in both figures and the y-axis in the second one, we plot at the logarithmic scale for better visualization.

6.2. Real data applications

The main advantage of MBE is that it easily generalizes to complex models. In this section, we use real datasets to study this property. Our goal is to investigate whether, without problem-specific algorithm design and without heavy tuning, MBE can achieve comparable performance with strong problem-specific baselines proposed in the literature.

We study the three problems considered in Wan et al. (2022), including cascading bandits for online learning to rank (Kveton et al., 2015), combinatorial semi-bandits for online combinatorial optimization (Chen et al., 2013), and multinomial logit (MNL) bandits for dynamic slate optimization (Agrawal et al., 2017; 2019). All these are practical and important problems in real life. Yet, these domain models all have unique structures and require a case-by-case algorithm design. For example, the rewards in MNL bandits follow multinomial distributions that have complex dependency with the pulled arms. To derive the posterior or confidence bound, one has to use a delicately designed epoch-type procedure (Agrawal et al., 2019).

Figure 4. Real data results for three structured bandit problems that need domain-specific models.

We compared MBE with state-of-the-art baselines in the literature, including TS-Cascade (Zhong et al., 2021) and CascadeKL-UCB (Kveton et al., 2015) for cascading bandits, CUCB (Chen et al., 2016) and CTS (Wang & Chen, 2018) for semi-bandits, and MNL-TS (Agrawal et al., 2017) and MNL-UCB (Agrawal et al., 2019) for MNL bandits. To save space, we denote the TS-type algorithms by TS and UCB-type ones by UCB. We also study PHE and $\epsilon$ -greedy (EG) as two other general-purpose exploration strategies.

We use the three datasets studied in Wan et al. (2022). Specifically, we use the Yelp rating dataset (Zong et al., 2016) to recommend and rank $K$ restaurants, use the Adult dataset (Dua & Graff, 2017) to send advertisements to $K/2$ men and $K/2$ women (a combinatorial semi-bandit problem with continuous rewards), and use the MovieLens dataset (Harper & Konstan, 2015) to display $K$ movies. In our experiments, we fix $K = 4$ and randomly sample 30 items from the dataset to choose from. We provide a summary of these datasets and problems in Appendix B.3, and refer interested readers to Wan et al. (2022) and references therein for more details.

For the baseline methods, as in Section 6.1, we either use the default hyper-parameters in Wan et al. (2022) or tune them extensively and present their best performance. For EG, we set the exploration rate $\epsilon_t = \min(1, a/(2\sqrt{t}))$ with tuning parameter $a$ . For MBE, with every bootstrap sample, we estimate the reward model via maximum weighted likelihood estimation, which yields a closed-form solution that allows online updating in all three problems. The other implementation details are similar to Section 6.1.
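The EG schedule can be sketched as follows; we read $a/2\sqrt{t}$ as $a/(2\sqrt{t})$ so that the rate decays, and the function name is ours:

```python
import math

def eg_epsilon(t: int, a: float) -> float:
    """Exploration rate for epsilon-greedy, reading a/2*sqrt(t) as a / (2*sqrt(t))."""
    return min(1.0, a / (2.0 * math.sqrt(t)))

# The rate is capped at 1 and decays like 1/sqrt(t), so exploration is front-loaded.
schedule = [eg_epsilon(t, a=0.5) for t in (1, 4, 100)]
print(schedule)  # [0.25, 0.125, 0.025]
```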

We present the results in Figure 4. The overall findings are consistent with simulation. First, without any additional derivations or algorithm design, MBE matches the performance of problem-specific algorithms. Second, pseudo-rewards are important to guarantee sufficient exploration, and naively applying multiplier bootstrap may fail. Third, MBE has relatively stable performance with $\sigma_\omega$ , as its exploration is mostly data-driven. In contrast, we found that

the hyper-parameters of PHE and EG have to be carefully tuned, because they rely on externally added perturbation or forced exploration. For example, the best parameters for EG are $a = 5, 0.1$ and $0.5$ in the three problems. Finally, we observe that PHE does not perform well in MNL and cascading bandits, where the outcomes are binary. From a closer look, we find one possible reason: the response rates (i.e., the probabilities for the binary outcome to be 1) in the two datasets are low, so PHE introduces too much (additive) noise for exploration purposes, which slows down the convergence of the estimation.

7. Conclusion

In this paper, we propose a new bandit exploration strategy, Multiplier Bootstrap-based Exploration (MBE). The main advantage of MBE is its generality: for any reward model that can be estimated via weighted loss minimization, the idea of MBE is applicable and requires minimal effort in deriving or implementing the exploration mechanism. As a data-driven method, MBE also shows nice adaptivity. We prove near-optimal regret bounds for MBE in the sub-Gaussian MAB setup, which is more general than the setups in other bootstrap-based bandit papers. Numerical experiments demonstrate that MBE is general, efficient, and adaptive.

There are a few meaningful future extensions. First, the regret analysis of MBE (and, more generally, of other bootstrap-based bandit methods) in more complicated setups would be valuable. Second, adding pseudo-rewards at every round is needed for the analysis. We hypothesize that there exists a more adaptive way of adding them. Last, the practical implementation of MBE relies on an ensemble of models to approximate the bootstrap distribution and on an online regression oracle to update the model estimates. Our numerical experiments show that such an approach works well empirically, but it would still be meaningful to have more theoretical understanding.

References

Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. Thompson sampling for the mnl-bandit. arXiv preprint arXiv:1706.00977, 2017.

Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. Mnl-bandit: A dynamic learning approach to assortment selection. Operations Research, 67(5):1453–1485, 2019.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235–256, 2002.

Bietti, A., Agarwal, A., and Langford, J. A contextual bandit bake-off. J. Mach. Learn. Res., 22(133):1–49, 2021.

Chapelle, O. and Zhang, Y. A dynamic bayesian network click model for web search ranking. In Proceedings of the 18th international conference on World wide web, pp. 1–10, 2009.

Chen, W., Wang, Y., and Yuan, Y. Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pp. 151–159. PMLR, 2013.

Chen, W., Wang, Y., Yuan, Y., and Wang, Q. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. The Journal of Machine Learning Research, 17(1):1746–1778, 2016.

Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Durand, A., Achilleos, C., Iacovides, D., Strati, K., Mitsis, G. D., and Pineau, J. Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis. In Machine learning for healthcare conference, pp. 67–82. PMLR, 2018.

Eckles, D. and Kaptein, M. Thompson sampling with the online bootstrap. arXiv preprint arXiv:1410.4009, 2014.

Efron, B. Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics, pp. 569–593. Springer, 1992.

Elmachtoub, A. N., McNellis, R., Oh, S., and Petrik, M. A practical method for solving contextual bandit problems using decision trees. arXiv preprint arXiv:1706.04687, 2017.

Filippi, S., Cappe, O., Garivier, A., and Szepesvári, C. Parametric bandits: The generalized linear case. Advances in Neural Information Processing Systems, 23, 2010.

Hao, B., Abbasi Yadkori, Y., Wen, Z., and Cheng, G. Bootstrapping upper confidence bound. Advances in Neural Information Processing Systems, 32, 2019.

Harper, F. M. and Konstan, J. A. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1–19, 2015.

Kveton, B., Szepesvari, C., Wen, Z., and Ashkan, A. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, pp. 767–776. PMLR, 2015.

Kveton, B., Szepesvári, C., Ghavamzadeh, M., and Boutilier, C. Perturbed-history exploration in stochastic multi-armed bandits. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 2786–2793, 2019a.

Kveton, B., Szepesvari, C., Ghavamzadeh, M., and Boutilier, C. Perturbed-history exploration in stochastic linear bandits. arXiv preprint arXiv:1903.09132, 2019b.

Kveton, B., Szepesvari, C., Vaswani, S., Wen, Z., Lattimore, T., and Ghavamzadeh, M. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. In International Conference on Machine Learning, pp. 3601–3610. PMLR, 2019c.

Kveton, B., Zaheer, M., Szepesvari, C., Li, L., Ghavamzadeh, M., and Boutilier, C. Randomized exploration in generalized linear bandits. In International Conference on Artificial Intelligence and Statistics, pp. 2066–2076. PMLR, 2020a.

Kveton, B., Zaheer, M., Szepesvari, C., Li, L., Ghavamzadeh, M., and Boutilier, C. Randomized exploration in generalized linear bandits. In International Conference on Artificial Intelligence and Statistics, pp. 2066–2076. PMLR, 2020b.

Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2020.

Li, L., Lu, Y., and Zhou, D. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, pp. 2071–2080. PMLR, 2017.

Lu, X. and Van Roy, B. Ensemble sampling. Advances in neural information processing systems, 30, 2017.

Osband, I. and Van Roy, B. Bootstrapped thompson sampling and deep exploration. arXiv preprint arXiv:1507.00300, 2015.

Osband, I., Van Roy, B., Russo, D. J., Wen, Z., et al. Deep exploration via randomized value functions. J. Mach. Learn. Res., 20(124):1–62, 2019.

Phan, M., Abbasi Yadkori, Y., and Domke, J. Thompson sampling and approximate inference. Advances in Neural Information Processing Systems, 32, 2019.

Qin, C., Wen, Z., Lu, X., and Roy, B. V. An analysis of ensemble sampling. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=c6ibx0yl-aG.

Riquelme, C., Tucker, G., and Snoek, J. Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling. arXiv preprint arXiv:1802.09127, 2018.

Shen, W., Wang, J., Jiang, Y.-G., and Zha, H. Portfolio choices with orthogonal bandit learning. In Twenty-fourth international joint conference on artificial intelligence, 2015.

Tang, L., Jiang, Y., Li, L., Zeng, C., and Li, T. Personalized recommendation via parameter-free contextual bandits. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, pp. 323–332, 2015.

Tang, Q., Xie, H., Xia, Y., Lee, J., and Zhu, Q. Robust contextual bandits via bootstrapping. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 12182–12189, 2021.

Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.

Van Der Vaart, A. W. and Wellner, J. A. Weak convergence. In Weak convergence and empirical processes, pp. 16–28. Springer, 1996.

Vaswani, S., Kveton, B., Wen, Z., Rao, A., Schmidt, M., and Abbasi-Yadkori, Y. New insights into bootstrapping for bandits. arXiv preprint arXiv:1805.09793, 2018.

Wan, R., Ge, L., and Song, R. Towards scalable and robust structured bandits: A meta-learning framework. arXiv preprint arXiv:2202.13227, 2022.

Wang, C.-H., Yu, Y., Hao, B., and Cheng, G. Residual bootstrap exploration for bandit algorithms. arXiv preprint arXiv:2002.08436, 2020.

Wang, S. and Chen, W. Thompson sampling for combinatorial semi-bandits. In International Conference on Machine Learning, pp. 5114–5122. PMLR, 2018.

Watkins, C. J. C. H. Learning from delayed rewards. 1989.

Wen, Z., Kveton, B., and Ashkan, A. Efficient learning in large-scale combinatorial semi-bandits. In International Conference on Machine Learning, pp. 1113–1122. PMLR, 2015.

Wu, S., Wang, C.-H., Li, Y., and Cheng, G. Residual bootstrap exploration for stochastic linear bandit. arXiv preprint arXiv:2202.11474, 2022.

Zhang, H. and Chen, S. Concentration inequalities for statistical inference. Communications in Mathematical Research, 37(1):1–85, 2021.

Zhang, H. and Wei, H. Sharper sub-weibull concentrations. Mathematics, 10(13):2252, 2022.

Zhong, Z., Cheung, W. C., and Tan, V. Y. Thompson sampling algorithms for cascading bandits. Journal of Machine Learning Research, 22(218):1–66, 2021.

Zhou, D., Li, L., and Gu, Q. Neural contextual bandits with ucb-based exploration. In International Conference on Machine Learning, pp. 11492–11502. PMLR, 2020.

Zhou, Q., Zhang, X., Xu, J., and Liang, B. Large-scale bandit approaches for recommender systems. In International Conference on Neural Information Processing, pp. 811–821. Springer, 2017.

Zong, S., Ni, H., Sung, K., Ke, N. R., Wen, Z., and Kveton, B. Cascading bandits for large-scale recommendation problems. arXiv preprint arXiv:1603.05359, 2016.

A. Additional Method Details

A.1. MBE for MAB

In this section, we present the concrete form of MBE when applied to MAB. Recall that $\mathbf{x}_t$ is null, $A_t \in [K]$ , and $r_k$ is the mean reward of the $k$ -th arm. We define $f(\mathbf{x}_t, A_t; \mathbf{r}) = r_{A_t}$ , where the parameter vector is $\mathbf{r} = (r_1, \dots, r_K)^\top$ . We define the loss function as

$$\frac{1}{t'} \sum_{t=1}^{t'} \omega_t (r_{A_t} - R_t)^2.$$

The solution is then $(\hat{r}_1, \dots, \hat{r}_K)^\top$ with $\hat{r}_k = (\sum_{t:A_t=k} \omega_t)^{-1} \sum_{t:A_t=k} \omega_t R_t$ , i.e., the arm-wise weighted average. After adding the pseudo-rewards, we obtain the algorithm for MAB in Algorithm 3.
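As a concrete illustration (function and variable names ours), the weighted loss above is minimized arm by arm:

```python
import numpy as np

def weighted_arm_means(actions, rewards, weights, K):
    """Closed-form minimizer of the weighted squared loss above:
    each r_k is the weight-weighted average of arm k's observed rewards."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(weights, dtype=float)
    r_hat = np.full(K, np.nan)                 # NaN for arms never pulled
    for k in range(K):
        mask = actions == k
        if mask.any():
            r_hat[k] = np.sum(weights[mask] * rewards[mask]) / np.sum(weights[mask])
    return r_hat

print(weighted_arm_means([0, 0, 1], [1.0, 0.0, 2.0], [1.0, 3.0, 2.0], K=2))
# arm 0: (1*1 + 3*0) / (1 + 3) = 0.25; arm 1: 2.0
```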

Next, we provide an intuitive explanation of why Algorithm 3 works. Denote $s := |\mathcal{H}_{k,T}|$ , where $\mathcal{H}_{k,T}$ is the set of observed rewards for the $k$ -th arm up to round $T$ . Let $R_{k,i}$ be the $i$ -th element in $\mathcal{H}_{k,T}$ . Then

$$\begin{aligned} \bar{Y}_{k,s} &= \frac{\sum_{i=1}^s \omega_i R_{k,i} + \lambda \sum_{i=1}^s \omega'_i}{\sum_{i=1}^s \omega_i + \lambda \sum_{i=1}^s \omega'_i + \lambda \sum_{i=1}^s \omega''_i} \\ &= \frac{s^{-1} \sum_{i=1}^s \omega_i (R_{k,i} - \mu_k) + s^{-1} \sum_{i=1}^s (\omega_i - 1) + \lambda s^{-1} \sum_{i=1}^s (\omega'_i - 1) + \mu_k + \lambda}{s^{-1} \sum_{i=1}^s (\omega_i - 1) + \lambda s^{-1} \sum_{i=1}^s (\omega'_i - 1) + \lambda s^{-1} \sum_{i=1}^s (\omega''_i - 1) + 1 + 2\lambda} \xrightarrow{\mathbb{P}} \frac{\mu_k + \lambda}{1 + 2\lambda} \end{aligned}$$

by using the law of large numbers. Then, by Slutsky's theorem,

$$\sqrt{s} \left[ \bar{Y}_{k,s} - \frac{\mu_k + \lambda}{1 + 2\lambda} \right] = \frac{1}{1 + 2\lambda} \left[ \frac{1}{\sqrt{s}} \sum_{i=1}^s \omega_i (R_{k,i} - \mu_k) + \frac{1}{\sqrt{s}} \sum_{i=1}^s (\omega_i - 1) + \frac{\lambda}{\sqrt{s}} \sum_{i=1}^s (\omega'_i - 1) \right] + o_p(1)$$

will weakly converge to a mean-zero Gaussian distribution $\mathcal{N}\left(0, \frac{\sigma_k^2 + 2}{(1+2\lambda)^2} \sigma_\omega^2\right)$ . Therefore, our algorithm preserves the order of the arms for any $\lambda > 0$ .
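This limit is easy to check by simulation; a sketch under Gaussian rewards and $\mathcal{N}(1, \sigma_\omega^2)$ weights (all constants are illustrative):

```python
import numpy as np

# Monte Carlo check of the limit (mu_k + lambda) / (1 + 2*lambda) for one arm.
rng = np.random.default_rng(0)
mu_k, lam, sigma_w, s = 0.3, 0.5, 1.0, 200_000

R = rng.normal(mu_k, 1.0, s)                         # observed rewards of arm k
w, w1, w0 = (rng.normal(1.0, sigma_w, s) for _ in range(3))

# Weighted average with pseudo-rewards 1 and 0, as in Algorithm 3.
y_bar = (w @ R + lam * w1.sum()) / (w.sum() + lam * (w1.sum() + w0.sum()))
print(y_bar, (mu_k + lam) / (1 + 2 * lam))           # both close to 0.4
```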


Algorithm 3 MBE for MAB with sub-Gaussian rewards whose means are bounded in $[0, 1]$


Data: number of arms $K$ , multiplier weight distribution $\rho(\omega)$ , tuning parameter $\lambda$


1 Set  $\mathcal{H}_k = \{\}$  to be the history of arm  $k$  and  $\bar{Y}_k = +\infty, \forall k \in [K]$ 
2 for  $t = 1, \dots, T$  do
3   Pull  $A_t = \arg \max_{k \in [K]} \bar{Y}_k$  (break ties randomly)
4   Observe reward  $R_t$ 
5   Set  $\mathcal{H}_{A_t} = \mathcal{H}_{A_t} \cup \{R_t\}$ 
6   for  $k = 1, \dots, K$  do
7     if  $|\mathcal{H}_k| > 0$  then
8       Sample the multiplier weights  $\{\omega_\ell, \omega'_\ell, \omega''_\ell\}_{\ell=1}^{|\mathcal{H}_k|} \sim \rho(\omega)$ 
9       Update the mean reward

$$\bar{Y}_k = \left( \sum_{\ell=1}^{|\mathcal{H}_k|} (\omega_\ell R_{k,\ell} + \omega'_\ell \cdot 1 \cdot \lambda + \omega''_\ell \cdot 0 \cdot \lambda) \right) \Big/ \left( \sum_{\ell=1}^{|\mathcal{H}_k|} (\omega_\ell + \lambda \omega'_\ell + \lambda \omega''_\ell) \right),$$

where  $R_{k,\ell}$  is the  $\ell$ -th element in  $\mathcal{H}_k$ .
10    end
11  end
12 end
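A minimal Python sketch of Algorithm 3 (names and the toy test environment are ours, not from the paper):

```python
import numpy as np

def mbe_mab(reward_fn, K, T, lam=0.5, sigma_w=1.0, seed=0):
    """Minimal sketch of Algorithm 3. `reward_fn(k)` draws one reward of arm k;
    multiplier weights are N(1, sigma_w^2) and are resampled every round."""
    rng = np.random.default_rng(seed)
    hist = [[] for _ in range(K)]
    y_bar = np.full(K, np.inf)        # +inf forces every arm to be tried once
    actions = []
    for _ in range(T):
        a = int(np.argmax(y_bar))     # pull the arm with the largest index
        actions.append(a)
        hist[a].append(reward_fn(a))
        for k in range(K):
            n = len(hist[k])
            if n == 0:
                continue
            w, w1, w0 = (rng.normal(1.0, sigma_w, n) for _ in range(3))
            num = w @ np.asarray(hist[k]) + lam * w1.sum()   # pseudo-rewards 1 and 0
            den = w.sum() + lam * (w1.sum() + w0.sum())
            y_bar[k] = num / den
    return actions

# Two Bernoulli arms with means 0.2 and 0.8: the better arm should dominate.
env_rng = np.random.default_rng(1)
acts = mbe_mab(lambda k: float(env_rng.random() < [0.2, 0.8][k]), K=2, T=500)
print(acts.count(1) / len(acts))
```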

A.2. MBE for stochastic linear bandits

In this section, we derive the form of MBE when applied to stochastic linear bandits. We focus on the setup where $\mathbf{x}_t$ is empty and $A_t \in \mathbb{R}^p$ is a linear feature vector; other setups of linear bandits can be formulated similarly. In this case,

Algorithm 4 MBE for linear bandits.


Data: number of arms $K$ , multiplier weight distribution $\rho(\omega)$ , tuning parameters $\lambda$ and $\xi$

1 Set $A_0 = \mathbf{0}$ , $\hat{\boldsymbol{\theta}}_0 = \mathbf{0}$ with $b_0 = \mathbf{0}$ , and $V_0 = (1 + \xi)I_p$
2 for $t = 1, \dots, p$ do
3   Offer $A_t = t$
4 end
5 for $t = p + 1, \dots, T$ do
6   Offer $A_t = \arg \max_{\mathbf{a} \in \mathcal{A}_t} \mathbf{a}^\top \hat{\boldsymbol{\theta}}_t$ (break ties randomly)
7   Observe reward $R_t$
8   Sample the multiplier weights $\omega_t, \omega'_t, \omega''_t \sim \rho(\omega)$
9   Update the following quantities:
      • $V_{t+1} = V_t + \omega_t A_t A_t^\top + \lambda \omega'_t I_p + \lambda \omega''_t I_p$ ;
      • $b_{t+1} = b_t + A_t(\omega_t R_t + \lambda \omega'_t \cdot 0 + \lambda \omega''_t \cdot 1)$ ;
      • Refresh the parameter as $\hat{\boldsymbol{\theta}}_{t+1} = V_{t+1}^{-1} b_{t+1}$
10 end


$f(\mathbf{x}_t, A_t; \boldsymbol{\theta}) = A_t^\top \boldsymbol{\theta}$ where the parameter vector is $\boldsymbol{\theta} \in \mathbb{R}^p$ . Then, the weighted loss function is

$$\sum_{t=1}^T \omega_t (A_t^\top \boldsymbol{\theta} - R_t)^2 + \frac{\xi}{2} \|\boldsymbol{\theta}\|_2^2,$$

where $\xi \geq 0$ is a penalty tuning parameter. The solution is the standard weighted ridge regression estimator and can be updated in the following way:

1. Initialization: $A_0 = \mathbf{0}$ , $\hat{\boldsymbol{\theta}}_0 = \mathbf{0}$ with $b_0 = \mathbf{0}$ , and $V_0 = (\xi + 1)I_{\dim(A_t)}$ .
2. $\hat{\boldsymbol{\theta}}_t = V_t^{-1} b_t$ ;
3. $V_{t+1} = V_t + \omega_t A_t A_t^\top$ , $b_{t+1} = b_t + \omega_t R_t A_t$ ; hence, by the Sherman-Morrison formula,

$$V_{t+1}^{-1} = (V_t + \omega_t A_t A_t^\top)^{-1} = V_t^{-1} - V_t^{-1} A_t (\omega_t^{-1} + A_t^\top V_t^{-1} A_t)^{-1} A_t^\top V_t^{-1}, \qquad \hat{\boldsymbol{\theta}}_{t+1} = V_{t+1}^{-1} b_{t+1}.$$

4. Take the action $A_{t+1} = \arg \max_{\mathbf{a} \in \mathcal{A}_{t+1}} \mathbf{a}^\top \hat{\boldsymbol{\theta}}_{t+1}$ .

The MBE algorithm for linear bandits is presented in Algorithm 4.
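The Sherman-Morrison-style rank-one update can be sanity-checked numerically; a sketch with illustrative constants, using positive weights so the toy check stays well conditioned:

```python
import numpy as np

rng = np.random.default_rng(0)
p, xi = 3, 1.0
V = (xi + 1.0) * np.eye(p)          # V_0 = (xi + 1) I_p, as in the initialization step
V_inv = np.linalg.inv(V)
b = np.zeros(p)

for _ in range(50):
    a = rng.normal(size=p)                    # feature vector A_t
    r = float(a.sum()) + 0.1 * rng.normal()   # reward under theta* = (1, ..., 1)
    w = abs(rng.normal(1.0, 1.0)) + 1e-3      # multiplier weight (kept positive here)
    V += w * np.outer(a, a)
    b += w * r * a
    # rank-one (Sherman-Morrison) update of V^{-1}, matching the closed form above
    Va = V_inv @ a
    V_inv -= np.outer(Va, Va) / (1.0 / w + float(a @ Va))

theta = V_inv @ b                   # weighted ridge estimate; close to (1, ..., 1)
print(np.max(np.abs(V_inv - np.linalg.inv(V))))
```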

A.3. Naive Adaptation of the Multiplier Bootstrap

We present the naive multiplier bootstrap-based exploration algorithm in Algorithm 5. Specifically, no pseudo-rewards are added.

A.4. ReBoot

For completeness, we introduce the details of ReBoot in this section and discuss its generalizability. More details can be found in the original papers (Wang et al., 2020; Wu et al., 2022).

Algorithm 5 A Naive Design of MBE


Data: Function class $\mathcal{F}$, loss function $\mathcal{L}$, (optional) penalty function $J$, multiplier weight distribution $\rho(\omega)$


1  Set  $\mathcal{H} = \{\}$  to be the history
2  Initialize  $\hat{f}$  in an optimistic way
3  for  $t = 1, \dots, T$  do
4      Observe context  $\mathbf{x}_t$  and action set  $\mathcal{A}_t$ 
5      Offer  $A_t = \arg \max_{a \in \mathcal{A}_t} \hat{f}(\mathbf{x}_t, a)$  (break tie randomly)
6      Observe reward  $R_t$ 
7      Update  $\mathcal{H} = \mathcal{H} \cup \{(\mathbf{x}_t, A_t, R_t)\}$ 
8      Sample the multiplier weights  $\{\omega_l\}_{l=1}^t \sim \rho(\omega)$ 
9      Solve the weighted loss minimization problem to update  $\hat{f}$  as
10

$$\hat{f} = \arg \min_{f \in \mathcal{F}} \frac{1}{t} \sum_{l=1}^t \omega_l \mathcal{L}(f(\mathbf{x}_l, A_l), R_l) + J(f).$$

11 end

Figure 5. Performance of MBE in three linear bandit problems.

Consider a stochastic bandit problem with a fixed and finite set of arms $\mathcal{A}$ . Every arm $a \in \mathcal{A}$ may have a fixed feature vector (which, with slight overload of notation, we also denote as $a$ ). The mean reward of arm $a$ is $f(a)$ . At each round $t'$ , ReBoot first fits the model $f$ as $\hat{f}$ using all data. Then, for each arm $a$ , ReBoot computes the corresponding residuals using the rewards related to that arm as $\{\epsilon_t = R_t - \hat{f}(a)\}_{t:A_t=a, t \leq t'}$ , perturbs these residuals with random weights as $\{\omega_t \epsilon_t\}_{t:A_t=a, t \leq t'}$ (ReBoot also adds pseudo-residuals, which we omit for ease of notation), and finally uses $\hat{f}(a) + |\{t : A_t = a, t \leq t'\}|^{-1} \sum_{t:A_t=a, t \leq t'} \omega_t \epsilon_t$ as the perturbed estimate of the mean reward of arm $a$ . By design, ReBoot critically relies on the reward history of each fixed arm. Therefore, to the best of our understanding, it is not easy to extend ReBoot to problems with either changing (e.g., contextual problems) or infinite arms.

Figure 6. Performance of MBE with different values of $\lambda$ in MAB.
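The per-arm residual perturbation described above can be sketched as follows (function name and inputs ours):

```python
import numpy as np

def reboot_index(rewards_a, f_hat_a, sigma_w=1.0, rng=None):
    """Perturbed mean-reward estimate for one arm: f_hat(a) plus the average
    of multiplier-reweighted residuals (pseudo-residuals omitted, as in the text).
    Function name and inputs are illustrative, not from the original papers."""
    if rng is None:
        rng = np.random.default_rng()
    eps = np.asarray(rewards_a, dtype=float) - f_hat_a   # residuals for this arm
    w = rng.normal(1.0, sigma_w, len(eps))               # multiplier weights
    return f_hat_a + float(np.mean(w * eps))

rng = np.random.default_rng(0)
# These residuals average to zero, so the index fluctuates around f_hat(a) = 1.0.
idx = reboot_index([0.9, 1.1, 1.3, 0.7], f_hat_a=1.0, rng=rng)
```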

B. More Experiment Results and Details

B.1. Results for linear bandits

We also consider the linear bandit problem. The linear bandit version of MBE is presented in Appendix A.2. We experiment with several dimensions $p = 10, 15, 20$ . The number of arms is $K = 100$ . The feature vector $x_k \in \mathbb{R}^p$ of arm $k$ is generated as follows. For the last 10 arms, the features are drawn uniformly at random from $(0, 1)$ . For the first 90 arms, we consider a practical setup where they are low-rank: we first generate a loading matrix $A = (a_{ij}) \in \mathbb{R}^{p \times 5}$ from $\text{Uniform}(0, 1)$ , then sample $b \in \mathbb{R}^5$ from $\text{Uniform}(0, 1)$ , and finally construct $x_k = Ab$ . The parameter vector $\theta \in \mathbb{R}^p$ is uniformly sampled from $[0, 1]^p$ . We normalize the feature vectors so that the mean reward $\mu_k = x_k^\top \theta$ falls within the interval $[0, 1]$ . The rewards of arm $k$ are drawn i.i.d. from $\text{Bernoulli}(x_k^\top \theta)$ .
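The low-rank feature construction can be sketched as follows (variable names ours; constants taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
p, K, rank = 10, 100, 5

A = rng.uniform(size=(p, rank))                      # loading matrix A
B = rng.uniform(size=(rank, K - 10))                 # one b per low-rank arm
X_low = (A @ B).T                                    # first 90 arms: x_k = A b
X_free = rng.uniform(size=(10, p))                   # last 10 arms: fully random
X = np.vstack([X_low, X_free])

theta = rng.uniform(size=p)                          # parameter vector in [0, 1]^p
X = X / (X @ theta).max()                            # normalize so mu_k lies in [0, 1]
mu = X @ theta
print(mu.min(), mu.max())
```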

We still compare MBE with the linear bandit versions of GIRO, PHE, and ReBoot, each tuned to its best performance over the hyper-parameter set $\{2^{k-4}\}_{k=0}^6$ . For TS, we use Gaussian distributions for both its reward model and its prior, and calibrate their parameters using the true model. The total number of rounds is $T = 20000$ , and our results are averaged over 50 randomly chosen problems. Most other details are similar to our MAB experiments.

We present the results in Figure 5, where we vary either $\sigma_\omega$ or $\lambda$ in the two subplots. We can see that Naive MB leads to a linear regret. Hence, the pseudo-rewards also matter in this problem. MBE achieves comparable performance with strong baselines such as TS. Another finding is that MBE is robust to its tuning parameters. Finally, ReBoot needs to pull each of the $K$ arms once to initialize (the linear-regret part in the first $K$ rounds) due to the nature of its design. In contrast, most other linear bandit algorithms typically only need $p$ rounds of forced exploration. This shows a limitation of the ReBoot framework (see Appendix A.4).

B.2. Additional results

In this section, we study the performance of MBE with respect to a few other hyper-parameters.

We first study the robustness to the other tuning parameter, $\lambda$ . Recall that $\lambda$ controls the amount of external perturbation. Specifically, we repeat the experiment in Section 6.1 with $\sigma_\omega^2$ fixed at 0.5 and with different values of $\lambda$ (0.25, 0.5, 0.75). From Figure 6, it can be seen that a small amount of pseudo-rewards ( $\lambda = 0.25$ ) seems sufficient in these settings, and the results are fairly stable. We believe this is because the exploration of MBE is mainly driven by the internal randomness in the data.

In Figure 7, we repeat our main experiments with $K$ changed to 25. We can see that our main conclusions still hold.

Finally, in Figure 8, we implement MBE with different numbers of replicates $B$ . As expected, more replicates help exploration, due to a better approximation of the full bootstrap distribution. Yet, we find that $B = 50$ suffices to generate comparable performance with TS, and the performance of MBE becomes relatively stable for larger values of $B$ .

B.3. Details of the real data experiments

Our real data experiments closely follow Wan et al. (2022). For completeness, we provide information on the three problems here, and refer interested readers to Wan et al. (2022) and references therein for more details.

Figure 7. Performance of MBE with $K = 25$ .

In an online learning to rank problem, we aim to select and rank $K$ items from a pool of $L$ ones. We iteratively interact with users to learn about their preferences. The cascading model is popular in learning to rank (Kveton et al., 2015), which

Figure 8. Performance of MBE with different number of replicates $B$ .

models the user behaviour as glancing from top to bottom (like a cascade) and clicking an item, following a Bernoulli distribution, when she looks at it. Therefore, we observe binary outcomes for all items that the user has examined, and there are complex dependencies between them.

In a slate optimization (also called assortment optimization) problem, we aim to offer $K$ items from a pool of $L$ ones, especially when there exist substitution effects. The multinomial logit (MNL) model characterizes the choice behaviour as a multinomial distribution based on the attractiveness of each item. Since the offered subset changes over rounds, the joint likelihood is complex. To get the posterior or confidence bounds, one has to resort to an epoch-type offering schedule (Agrawal et al., 2017).

Online combinatorial optimization also has numerous applications (Wen et al., 2015), including maximum weighted matching, ads allocation, webpage optimization, etc. It is common that every chosen item will generate a separate observation, known as the semi-bandit problem. We consider a special problem in our experiments, where we need to choose $K$ persons from a pool under constraints.

The three datasets we used (and the related problem setups) are studied in the corresponding TS papers in the literature. To generate random rewards, we either sample from a real-data-calibrated model or sample directly from the dataset. We follow Wan et al. (2022) and references therein. For cascading and MNL bandits, we split the dataset into a training and a testing set, use the training set to estimate the reward model, and compare on the testing set. For semi-bandits, we sample rewards from the dataset.

C. Main Proof

This section gives the proof of the main regret bound (Theorem 5.1). Section D gives the major lemmas required to bound the regret components used in this section. Section E lists all supporting technical lemmas, including the lower bound of the Gaussian tail and some novel results on the concentration property of sub-Gaussian and sub-exponential distributions.

We now present a complete version of the theory of MBE under MAB, Theorem C.1, as well as its proof, in which we allow arbitrary variance proxies $\sigma_k^2$ rather than requiring them to be one.

Theorem C.1. Consider a $K$ -armed bandit, where the reward distribution of arm $k$ is $\text{subG}(\sigma_k^2)$ with mean $\mu_k$ . Suppose $\mu_1 = \max_{k \in [K]} \mu_k$ and $\Delta_k = \mu_1 - \mu_k$ . Take the multiplier weight distribution as $\mathcal{N}(1, \sigma_\omega^2)$ in Algorithm 3. Let the tuning parameters satisfy $\lambda \geq \left(\frac{4\sigma_1}{\sigma_\omega} + 1\right) + \sqrt{\frac{4\sigma_1}{\sigma_\omega} \left(\frac{4\sigma_1}{\sigma_\omega} + 1\right)}$ . Then the problem-dependent regret is upper bounded by

\text{Reg}_T \leq \sum_{k=2}^K \Delta_k \left[ 7 + \left\{ C_1(\sigma_1, \sigma_k, \lambda, \sigma_\omega) + \frac{C_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega)}{\Delta_k^2} \right\} \log T \right],

where

C_1(\sigma_1, \sigma_k, \lambda, \sigma_\omega) = 10 \left[ 8\sqrt{2} \max_{k \in [K]} \sigma_k^2 \left( \frac{\log D_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega)}{3 \log 2} + 1 \right) + 38\sigma_\omega^2 \right],

and

C_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega) = 50\lambda^2 \sigma_k^2 + 10 \left[ D_1(\sigma_1, \sigma_k, \lambda, \sigma_\omega) \left( \frac{\log D_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega)}{3 \log 2} + 1 \right) + 38\sigma_\omega^2 \right],

with

D_1(\sigma_1, \sigma_k, \lambda, \sigma_\omega) = \left[ (1 + 8\sqrt{2} \max_{k \in [K]} \sigma_k^2) \left( 16 + \frac{\sigma_\omega^2}{\sigma_1^2} \right) + 16\sigma_1^4 + 3\frac{\sigma_\omega^2}{\sigma_1^2} + 3\sigma_\omega^2 + 1 \right] \sigma_\omega^2 \lambda^4,

and

D_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega) = 3 \left[ 1 + \frac{3\sqrt{\pi}\sigma_\omega^2}{2\sigma_1} \left( \frac{\sigma_1}{\sigma_\omega} + 3\lambda \right) + 16\sqrt{\pi} \max_{k \in [K]} \sigma_k^2 \left( \frac{\sigma_\omega^2}{16\sigma_1^4} + \frac{1}{\sigma_\omega^2} \right) + \frac{\lambda^2 \sigma_\omega^2}{4\sigma_1^4} \right].

Furthermore, the problem-independent regret is upper bounded by

\text{Reg}_T \leq 7K\mu_1 + \max_{k \in [K] \setminus \{1\}} C_1(\sigma_1, \sigma_k, \lambda, \sigma_\omega) K \log T + 2 \sqrt{\max_{k \in [K] \setminus \{1\}} C_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega) K T \log T}.

Before starting our proof, we first give the definition of sub-exponential variables: we call a mean-zero variable $X$ sub-exponential with parameters $\lambda$ and $\alpha$ if

\mathbb{E} \exp\{tX\} \leq \exp\left\{ \frac{t^2 \lambda^2}{2} \right\}, \quad |t| \leq \frac{1}{\alpha}.

We write $X \sim \text{subE}(\lambda, \alpha)$ if the sub-exponential variable $X$ has parameters $\lambda$ and $\alpha$ . We have the following concentration property:

\mathbb{P}(X \geq t) \leq \exp\left\{ -\frac{1}{2} \left( \frac{t^2}{\lambda^2} \wedge \frac{t}{\alpha} \right) \right\}, \quad \forall t \geq 0,

where $a \wedge b$ is defined as $\min\{a, b\}$ . See Zhang & Chen (2021) and Zhang & Wei (2022) for more details. For simplicity, we denote $\text{subE}(\lambda) := \text{subE}(\lambda, \lambda)$ . We call a random variable $X$ sub-Gaussian with variance proxy $\sigma^2$ if $\mathbb{E} \exp\{t(X - \mathbb{E}X)\} \leq \exp\{t^2 \sigma^2 / 2\}$ for any $t \in \mathbb{R}$ , and denote it as $X \sim \text{subG}(\sigma^2)$ .
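As a quick numerical illustration of this concentration property, one can check the tail bound by Monte Carlo on a standard example: if $Z \sim \mathcal{N}(0,1)$, then $Z^2 - 1$ is sub-exponential, and $(\lambda, \alpha) = (2, 4)$ is a known valid choice of parameters. These constants come from a textbook example, not from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Z^2 - 1 with Z ~ N(0, 1) is a standard sub-exponential example;
# lam = 2, alpha = 4 is a known valid parameter choice.
lam, alpha = 2.0, 4.0
x = rng.standard_normal(1_000_000) ** 2 - 1.0

for t in (1.0, 2.0, 5.0):
    empirical = (x >= t).mean()
    bound = np.exp(-0.5 * min(t**2 / lam**2, t / alpha))
    print(f"t={t}: empirical tail {empirical:.4f} <= bound {bound:.4f}")
```

The empirical tail sits well below the bound at every level, reflecting that the sub-exponential inequality is not tight but uniform in $t$.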

Notations: Denote $\mathbb{P}_\xi(A) = \int_A dF_\xi(x)$ as the probability of event $A$ , where $F_\xi(x)$ is the distribution function of the random variable $\xi$ , and similarly, we denote $\mathbb{E}_\xi f(\xi) = \int f(x) dF_\xi(x)$ as the expectation. Recall that we write two functions $a(s, T) \lesssim b(s, T)$ if $a(s, T) \leq c \times b(s, T)$ for some constant $c$ free of $s$ and $T$ . And we write $a(s, T) \asymp b(s, T)$ if both $a(s, T) \lesssim b(s, T)$ and $a(s, T) \gtrsim b(s, T)$ . Furthermore, we define $a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$ for any real numbers $a$ and $b$ . Similarly, we can define $a \vee b \vee c = \max\{a, b, c\}$ and $a \wedge b \wedge c = \min\{a, b, c\}$ for any $a, b, c \in \mathbb{R}$ .

Proof. Step 0: Decomposition of the regret bound. Our proof relies on the following decomposition of the cumulative regret (Kveton et al., 2019c). We denote the first $s$ rewards from pulling arm $k$ as $\mathcal{H}_{k,s}$ , with the $i$ -th observation denoted as $R_{k,i}$ . Let $Q_{k,s}(\tau) = \mathbb{P}(\bar{Y}_{k,s} > \tau \mid \mathcal{H}_{k,s})$ be the tail probability that $\bar{Y}_{k,s}$ , conditioned on history $\mathcal{H}_{k,s}$ , is at least $\tau$ , and $N_{k,s}(\tau) = 1/Q_{k,s}(\tau) - 1$ is the expected number of rounds that arm $k$ is underestimated given $s$ sample rewards. Here

\bar{Y}_{k,s} = \frac{\sum_{i=1}^s [\omega_i R_{k,i} + \omega'_i \cdot (1 \times \lambda) + \omega''_i \cdot (0 \times \lambda)]}{\sum_{i=1}^s (\omega_i + \lambda \omega'_i + \lambda \omega''_i)}

is the objective function defined in Algorithm 3.

Lemma C.2. Suppose in MAB we select arms according to the rule $A_t = \arg \max_{k \in [K]} \bar{Y}_{k,t}$ with $\bar{Y}_{k,t}$ defined in Algorithm 3. Then for any $\{\tau_k\}_{k=2}^K \subseteq \mathbb{R}$ , the expected $T$ -round regret can be bounded above as

\text{Reg}_T \leq \sum_{k=2}^K \Delta_k (a_k + b_k),

where $a_k = \sum_{s=0}^{T-1} a_{k,s}$ and $b_k = \sum_{s=0}^{T-1} b_{k,s} + 1$ , with $a_{k,s} = \mathbb{E}[N_{1,s}(\tau_k) \wedge T]$ and $b_{k,s} = \mathbb{P}(Q_{k,s}(\tau_k) > T^{-1})$ .
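As a sanity check on the statistic $\bar{Y}_{k,s}$ appearing in the lemma, the following minimal sketch (with $\mathcal{N}(1, \sigma_\omega^2)$ multiplier weights as in Algorithm 3; the reward values and parameter settings are placeholders, not the paper's experiments) computes it directly from its definition. Setting $\sigma_\omega = 0$ makes every weight exactly one, so $\bar{Y}_{k,s}$ collapses to the shifted sample mean $(\sum_i R_{k,i} + \lambda s)/\{s(1 + 2\lambda)\}$, which the last line verifies.

```python
import numpy as np

def y_bar(rewards, lam, sigma_w, rng):
    """Multiplier-bootstrap statistic: each real reward and each of the two
    pseudo-rewards (values 1 and 0, both scaled by lam) gets an independent
    N(1, sigma_w^2) weight."""
    s = len(rewards)
    w = rng.normal(1.0, sigma_w, size=s)    # weights on real rewards
    w1 = rng.normal(1.0, sigma_w, size=s)   # weights on pseudo-reward 1
    w0 = rng.normal(1.0, sigma_w, size=s)   # weights on pseudo-reward 0
    num = np.sum(w * rewards + w1 * lam * 1.0 + w0 * lam * 0.0)
    den = np.sum(w + lam * w1 + lam * w0)
    return num / den

rng = np.random.default_rng(0)
rewards = rng.normal(0.5, 1.0, size=50)  # placeholder sub-Gaussian rewards
lam = 2.0

# sigma_w = 0: all weights equal 1, so y_bar equals the shifted sample mean.
y0 = y_bar(rewards, lam, 0.0, rng)
shifted = (rewards.sum() + lam * len(rewards)) / (len(rewards) * (1 + 2 * lam))
assert abs(y0 - shifted) < 1e-9
```

With $\sigma_\omega > 0$, $\bar{Y}_{k,s}$ becomes a randomly perturbed copy of this shifted mean, which is exactly the source of exploration in MBE.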

Recall the summation index $s$ is the number of times we pull the $k$ -th arm. The definitions of $a_k$ and $b_k$ have important meanings: $a_k$ represents the expected number of rounds that the optimal arm 1 is underestimated, whereas $b_k$ is the probability that the suboptimal arm $k$ is overestimated. Here we only need to consider the lower bound of the tail of the distribution of the rewards from the optimal arm. The intuition is: (i) we only need the rewards from the optimal arm to take a relatively large value with a probability that is not too small; (ii) we do not care about the tiny probability of a large reward from suboptimal arms.

Therefore, our target is then to bound $a_k$ and $b_k$ for any $k \geq 2$ . These are completed in Step 1 and Step 2 below, respectively.

Step 1: Bounding $a_k$ .

We first provide a roadmap for proving that $a_k$ is bounded by a term of $O(\log T)$ order: for a given constant level $\tau_k$ , the probability of the optimal arm 1 being underestimated given $s$ rewards is $1 - Q_{1,s}(\tau_k)$ . If we pick the level to satisfy $\tau_k < \frac{\mu_1 + \lambda}{1 + 2\lambda}$ , the theory of large deviations gives

\lim_{s \rightarrow \infty} Q_{1,s}(\tau_k) = 1.

Hence the expected number of rounds to observe a not-underestimated case, $N_{1,s}(\tau_k) = \frac{1}{Q_{1,s}(\tau_k)} - 1$ , satisfies $\lim_{s \rightarrow \infty} N_{1,s}(\tau_k) = 0$ as the number of pulls $s$ grows to infinity. Thus, given the round $T$ , there exists a constant $s_0(T)$ such that $N_{1,s}(\tau_k) \leq T$ for all $s \geq s_0(T)$ . Consequently, the quantity $a_k$ in the regret bound will be bounded by

a_k \leq \sum_{s=0}^{s_0(T)} \mathbb{E}[N_{1,s}(\tau_k) \wedge T].

The fact that the constant $s_0(T)$ is of the order of $\log T$ will be shown in Lemma D.2, Lemma D.3, and Lemma D.4. For a small number of pulls $s \leq s_0(T)$ , we show in Lemma D.1 that $\mathbb{E}[N_{1,s}(\tau_k) \wedge T] \leq 2 + 4e^{9/8}$ for any $s \in \mathbb{N}$ . Thus, it is enough to conclude that $a_k$ can be bounded by a term of $O(\log T)$ order.
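To make the roadmap concrete, a small Monte Carlo sketch (all parameter values here are assumptions for illustration, not the constants of the theorem) estimates $Q_{1,s}(\tau_k)$ for a level $\tau_k$ below $\frac{\mu_1 + \lambda}{1 + 2\lambda}$ and shows it approaching 1 as $s$ grows, so that $N_{1,s}(\tau_k) = 1/Q_{1,s}(\tau_k) - 1$ approaches 0:

```python
import numpy as np

rng = np.random.default_rng(1)
mu1, lam, sigma_w = 0.0, 2.0, 0.5          # assumed values for illustration
tau = (mu1 + lam) / (1 + 2 * lam) - 0.05   # level below the limit of Y-bar

def q_hat(s, n_rep=4000):
    """Monte Carlo estimate of Q_{1,s}(tau) = P(Y-bar_{1,s} > tau)."""
    hits = 0
    for _ in range(n_rep):
        r = rng.normal(mu1, 1.0, size=s)   # subG(1) rewards from arm 1
        w = rng.normal(1.0, sigma_w, size=s)
        w1 = rng.normal(1.0, sigma_w, size=s)
        w0 = rng.normal(1.0, sigma_w, size=s)
        y = (w @ r + lam * w1.sum()) / (w + lam * w1 + lam * w0).sum()
        hits += y > tau
    return hits / n_rep

for s in (5, 50, 400):
    q = q_hat(s)
    print(f"s={s}: Q ~ {q:.3f}, N = 1/Q - 1 ~ {1 / q - 1:.3f}")
```

By the law of large numbers $\bar{Y}_{1,s}$ concentrates near $(\mu_1 + \lambda)/(1 + 2\lambda)$, so any $\tau_k$ strictly below that limit is eventually exceeded with probability close to one.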

To formally bound $a_k$ in the non-asymptotic sense following the intuition above, we need to decompose $a_k$ . For the decomposition, a common approach is to use indicators on good events. Denote the shifted (sample-) mean reward as

\bar{R}_{k,s}^* = \frac{\sum_{i=1}^s R_{k,i} + \lambda s}{s(1 + 2\lambda)} = \frac{\lambda}{1 + 2\lambda} + \frac{1}{1 + 2\lambda} \bar{R}_{k,s},

where $\bar{R}_{k,s} = \frac{1}{s} \sum_{i=1}^s R_{k,i}$ is the mean reward of the $k$ -th arm. Then we can define the following good events for the $l$ -th arm as

A_{l,s} = \left\{ -C_1 \Delta_k < \bar{R}_{l,s}^* - \frac{\mu_l + \lambda}{1 + 2\lambda} < C_1 \Delta_k \right\},

and

G_{l,s} = \left\{ -C_2 \Delta_k \leq \bar{Y}_{l,s} - \bar{R}_{l,s}^* \leq C_2 \Delta_k \right\}.

The definitions of $A_{l,s}$ and $G_{l,s}$ are intuitive: $A_{l,s}$ represents that the sample mean does not deviate too far from the true mean, and $G_{l,s}$ means that $\bar{Y}_{l,s}$ is not too far away from the scaled-shifted sample mean $\bar{R}_{l,s}^*$ . Here $C_1, C_2 \in (0, 1)$ are some constants.

Therefore, by using these good events, we decompose $a_{k,s}$ as the following three parts:

a_{k,s,1} = \mathbb{E}\left[\left(N_{1,s}(\tau_k) \wedge T\right) \mathbb{I}(A_{1,s}^c)\right], \quad (2)

a_{k,s,2} = \mathbb{E} \left[ (N_{1,s}(\tau_k) \wedge T) \mathbb{I}(A_{1,s}) \mathbb{I}(G_{1,s}^c) \right], \quad (3)

and

a_{k,s,3} = \mathbb{E} \left[ (N_{1,s}(\tau_k) \wedge T) \mathbb{I}(A_{1,s}) \mathbb{I}(G_{1,s}) \right]. \quad (4)

Let $C_1 = \frac{1}{6\lambda}$ and $C_2 = \frac{1}{12\lambda}$ with fixed $\lambda > 1$ , then $C_1, C_2 \in (0, 1)$ . Denote

\begin{aligned} a_1 &= \frac{192\sigma_\omega^2\lambda^4}{3(1+2\lambda)^2\Delta_k^2}, & b_1 &= \frac{2\sigma_\omega^2 [96\lambda^4\sigma_1^2 + (1+2\lambda)^2C_2^2\Delta_k^2]}{3(1+2\lambda)^2\Delta_k^2}, \\ a_2 &= \frac{36\sigma_\omega^2\lambda^4}{3(\lambda-1)^2\Delta_k^2}, & b_2 &= \frac{\sigma_\omega^2 [72\lambda^4\sigma_1^2 + 25(1+2\lambda)^2\Delta_k^2]}{6(\lambda-1)^2\Delta_k^2}, & \Lambda_k &= 8\sqrt{2}\sigma_k^2. \end{aligned}

Consider the case $T \geq 2$ . Define

s_{a,j}(T) = \max\{s : a_{k,s,j} \geq T^{-1}\}, \quad j = 1, 2, 3.

Lemma D.2 in Appendix D proves that by taking

s \geq s_{a,1}(T) = \frac{144\lambda^2\sigma_1^2}{(1+2\lambda)^2\Delta_k^2} \log T,

we will have $a_{k,s,1} \leq T^{-1}$ . Similarly, Lemma D.3 and Lemma D.4 in Appendix D show that if we take

\begin{aligned} s \geq s_{a,2}(T) &= \left[ \Lambda_1 + a_1 + b_1 + \Lambda_1 a_1 \right] \left( \frac{1}{3} \log^{-1} 2 \times \log \left\{ 3 \left[ 1 + \frac{\sqrt{\pi b_1}}{2} + \frac{\sqrt{2\pi}\Lambda_1 a_1}{2b_1^2} + \frac{2a_1}{b_1} \right] \right\} + 1 \right) \\ &\quad \vee \frac{18\sigma_\omega^2(2\mu_1 - 1)^2}{(\lambda^2 + \lambda + 1/4)^2} \vee \frac{2\sigma_\omega^2(1+2\lambda^2)}{(1+2\lambda)^2} \times 3 \log T \end{aligned}

and

\begin{aligned} s \geq s_{a,3}(T) &= \left[ \Lambda_1 + a_2 + b_2 + \Lambda_1 a_2 \right] \left( \frac{1}{3} \log^{-1} 2 \times \log \left\{ 3 \left[ 1 + \frac{\sqrt{\pi b_2}}{2} + \frac{\sqrt{2\pi}\Lambda_1 a_2}{2b_2^2} + \frac{2a_2}{b_2} \right] \right\} + 1 \right) \\ &\quad \vee \frac{18\sigma_\omega^2(2\mu_k - 1)^2}{(\lambda^2 + \lambda + 1/4)^2} \vee \frac{2\sigma_\omega^2(1+2\lambda^2)}{(1+2\lambda)^2} \times 3 \log T, \end{aligned}

then $a_{k,s,2} \leq T^{-1}$ and $a_{k,s,3} \leq T^{-1}$ , respectively. Denote $\Lambda_{\max} = \max_{k \in [K]} \Lambda_k$ and $\Lambda_{\min} = \min_{k \in [K]} \Lambda_k$ , then for any

\begin{aligned} s \geq s_a(T) &= \frac{144\lambda^2\sigma_1^2}{(1+2\lambda)^2\Delta_k^2} \log T + \left[ \Lambda_{\max} + (a_1 + a_2) + (b_1 + b_2) + \Lambda_{\max}(a_1 + a_2) \right] \\ &\quad \left( \frac{1}{3} \log^{-1} 2 \times \log \left\{ 3 \left[ 1 + \frac{\sqrt{\pi}}{2} (\sqrt{b_1} + \sqrt{b_2}) + \frac{\sqrt{2\pi}\Lambda_{\max}}{2} \left( \frac{a_1}{b_1^2} + \frac{a_2}{b_2^2} \right) + 2 \left( \frac{a_1}{b_1} + \frac{a_2}{b_2} \right) \right] \right\} + 1 \right) \\ &\quad \vee \frac{18\sigma_\omega^2(2\mu_1 - 1)^2 \vee (2\mu_k - 1)^2}{(\lambda^2 + \lambda + 1/4)^2} \vee \frac{2\sigma_\omega^2(1+2\lambda^2)}{(1+2\lambda)^2} \times 3 \log T, \end{aligned}

we have $s \geq \max_{j=1,2,3} s_{a,j}(T)$ , because

s_a(T) = s_{a,1}(T) + \max_{j=2,3} s_{a,j}(T) \geq \max_{j=1,2,3} s_{a,j}(T).

Hence, for any $s \geq s_a(T)$ , we have

a_{k,s} = a_{k,s,1} + a_{k,s,2} + a_{k,s,3} \leq 3T^{-1}.

Finally, Lemma D.1 in Appendix D ensures that if we take $\lambda \geq \frac{4\sigma_1}{\sigma_\omega} + \sqrt{\frac{4\sigma_1}{\sigma_\omega} \left( \frac{4\sigma_1}{\sigma_\omega} + 1 \right)}$ , the component $a_{k,s} \leq 2 + 4e^{9/8}$ for any $s \geq 0$ . Therefore, by setting $\lambda \geq \left( \frac{4\sigma_1}{\sigma_\omega} + 1 \right) + \sqrt{\frac{4\sigma_1}{\sigma_\omega} \left( \frac{4\sigma_1}{\sigma_\omega} + 1 \right)}$ , we have

\begin{aligned} a_k &= \sum_{s=0}^{T-1} a_{k,s} \\ &\leq \sum_{s < s_a(T)} \max_{s \in \{0,1,\dots,T-1\}} a_{k,s} + \sum_{s_a(T) \leq s < T} 3T^{-1} \\ &= \max_{s \in \{0,1,\dots,T-1\}} a_{k,s} \times s_a(T) + 3T^{-1} \times (T - s_a(T)) \\ &\leq 2(1 + 2e^{9/8}) \Big\{ \frac{144\lambda^2\sigma_1^2}{(1+2\lambda)^2\Delta_k^2} + \Big[ \big( \Lambda_{\max} + (\mathbf{a}_1 + \mathbf{a}_2) + (\mathbf{b}_1 + \mathbf{b}_2) + \Lambda_{\max}(\mathbf{a}_1 + \mathbf{a}_2) \big) \\ &\qquad \times \Big( \frac{1}{3} \log^{-1} 2 \times \log \Big\{ 3 \Big[ 1 + \frac{\sqrt{\pi}}{2} \big( \sqrt{\mathbf{b}_1} + \sqrt{\mathbf{b}_2} \big) + \frac{\sqrt{2\pi}\Lambda_{\max}}{2} \Big( \frac{\mathbf{a}_1}{\mathbf{b}_1^2} + \frac{\mathbf{a}_2}{\mathbf{b}_2^2} \Big) + 2 \Big( \frac{\mathbf{a}_1}{\mathbf{b}_1} + \frac{\mathbf{a}_2}{\mathbf{b}_2} \Big) \Big] \Big\} + 1 \Big) \\ &\qquad \vee \frac{18\sigma_\omega^2(2\mu_1 - 1)^2 \vee (2\mu_k - 1)^2}{(\lambda^2 + \lambda + 1/4)^2} \vee \frac{2\sigma_\omega^2(1 + 2\lambda^2)}{(1 + 2\lambda)^2} \Big] \Big\} \log T + 3 \end{aligned}

for any $T \geq 2$ .

Step 2: Bounding $b_k$ .

Again, we provide a roadmap for proving that $b_k$ is bounded by a term of $O(\log T)$ order. Similar to Step 1, we set a fixed level $\tau_k$ such that $\tau_k > \frac{\mu_k + \lambda}{1 + 2\lambda}$ ; then the theory of large deviations gives

\lim_{s \rightarrow \infty} Q_{k,s}(\tau_k) = 0.

Thus, given the time horizon $T$ , there exists a constant $s_0(T)$ such that $Q_{k,s}(\tau_k) \leq T^{-1}$ for all $s \geq s_0(T)$ . As a result, the event $\{Q_{k,s}(\tau_k) > T^{-1}\}$ is empty if the number of pulls $s$ is beyond $s_0(T)$ . Consequently,

b_k \leq \sum_{s=0}^{s_0(T)} \mathbb{P}(Q_{k,s}(\tau_k) > T^{-1}).

The fact that the constant $s_0(T)$ is of $O(\log T)$ order will be shown in Lemma D.5, Lemma D.6, and Lemma D.7. For a small number of pulls $s \leq s_0(T)$ , we apply the trivial bound $\mathbb{P}(Q_{k,s}(\tau_k) > T^{-1}) \leq 1$ that holds for any $s$ . Therefore, it is sufficient to conclude that $b_k$ can be bounded by a term of $O(\log T)$ order.

Note that $b_{k,s}$ is naturally bounded by the constant 1. Similar to Step 1, decompose $b_{k,s} = \mathbb{P}(Q_{k,s}(\tau_k) > T^{-1})$ as $b_{k,s} = b_{k,s,1} + b_{k,s,2} + b_{k,s,3}$ with

b_{k,s,1} = \mathbb{E}[\mathbb{I}(Q_{k,s}(\tau_k) > T^{-1})\mathbb{I}(A_{k,s}^c)], \quad (5)

b_{k,s,2} = \mathbb{E}[\mathbb{I}(Q_{k,s}(\tau_k) > T^{-1})\mathbb{I}(A_{k,s})\mathbb{I}(G_{k,s}^c)], \quad (6)

and

b_{k,s,3} = \mathbb{E}[\mathbb{I}(Q_{k,s}(\tau_k) > T^{-1})\mathbb{I}(A_{k,s})\mathbb{I}(G_{k,s})]. \quad (7)

Again, define $s_{b,j}(T) := \max\{s : b_{k,s,j} \geq T^{-1}\}$ for $j = 1, 2, 3$ ; then Lemma D.5 in Appendix D guarantees that if we take $s \geq s_{b,1}(T)$ with

s_{b,1}(T) = \frac{72\lambda^2\sigma_k^2}{(1+2\lambda)^2\Delta_k^2} \log T,

then $b_{k,s,1} \leq T^{-1}$ . Consider $T \geq 2$ and let $C_1 = \frac{1}{6\lambda}$ and $C_2 = \frac{1}{12\lambda}$ with fixed $\lambda > 1$ as in Step 1. Lemma D.6 and Lemma D.7 in Appendix D prove that if we take

s \geq s_{b,2}(T) = \Big[ \left[ \Lambda_k + \mathbf{a}_1 + \mathbf{b}_1 + \Lambda_k \mathbf{a}_1 \right] \left( \frac{1}{3} \log^{-1} 2 \times \log \left\{ 3 \left[ 1 + \frac{\mathbf{b}_1}{4\Lambda_k \mathbf{a}_1} + \frac{\sqrt{2\pi}\Lambda_k \mathbf{a}_1}{2\mathbf{b}_1^2} + \frac{2\mathbf{a}_1}{\mathbf{b}_1} \right] \right\} + 1 \right) \\ \vee \frac{18\sigma_\omega^2(2\mu_1 - 1)^2}{(\lambda^2 + \lambda + 1/4)^2} \vee \frac{2\sigma_\omega^2(1 + 2\lambda^2)}{(1 + 2\lambda)^2} \Big] \times 3 \log T

and

s \geq s_{b,3}(T) = \Big[ \left[ \Lambda_k + \mathbf{a}_2 + \mathbf{b}_2 + \Lambda_k \mathbf{a}_2 \right] \left( \frac{1}{3} \log^{-1} 2 \times \log \left\{ 3 \left[ 1 + \frac{\mathbf{b}_2}{4\Lambda_k \mathbf{a}_2} + \frac{\sqrt{2\pi}\Lambda_k \mathbf{a}_2}{2\mathbf{b}_2^2} + \frac{2\mathbf{a}_2}{\mathbf{b}_2} \right] \right\} + 1 \right) \\ \vee \frac{18\sigma_\omega^2(2\mu_k - 1)^2}{(\lambda^2 + \lambda + 1/4)^2} \vee \frac{2\sigma_\omega^2(1 + 2\lambda^2)}{(1 + 2\lambda)^2} \Big] \times 3 \log T,

then we have $b_{k,s,2} \leq T^{-1}$ and $b_{k,s,3} \leq T^{-1}$ , respectively. Let

s_b(T) := \frac{72\lambda^2\sigma_k^2}{(1 + 2\lambda)^2\Delta_k^2} \log T + \Big[ \left[ \Lambda_{\max} + (\mathbf{a}_1 + \mathbf{a}_2) + (\mathbf{b}_1 + \mathbf{b}_2) + \Lambda_{\max}(\mathbf{a}_1 + \mathbf{a}_2) \right] \\ \left( \frac{1}{3} \log^{-1} 2 \times \log \left\{ 3 \left[ 1 + \frac{\sqrt{\pi}}{2} (\sqrt{\mathbf{b}_1} + \sqrt{\mathbf{b}_2}) + \frac{\sqrt{2\pi}\Lambda_{\max}}{2} \left( \frac{\mathbf{a}_1}{\mathbf{b}_1^2} + \frac{\mathbf{a}_2}{\mathbf{b}_2^2} \right) + 2 \left( \frac{\mathbf{a}_1}{\mathbf{b}_1} + \frac{\mathbf{a}_2}{\mathbf{b}_2} \right) \right] \right\} + 1 \right) \\ \vee \frac{18\sigma_\omega^2(2\mu_1 - 1)^2 \vee (2\mu_k - 1)^2}{(\lambda^2 + \lambda + 1/4)^2} \vee \frac{2\sigma_\omega^2(1 + 2\lambda^2)}{(1 + 2\lambda)^2} \Big] \times 3 \log T,

then for any $s \geq s_b(T)$ , we have $s \geq \max_{j=1,2,3} s_{b,j}(T)$ due to $s_b(T) = s_{b,1}(T) + \max_{j=2,3} s_{b,j}(T)$ . Thus, $b_{k,s,1} + b_{k,s,2} + b_{k,s,3} \leq 3T^{-1}$ for any $s \geq s_b(T)$ . Note that $b_{k,s} = \mathbb{P}(Q_{k,s}(\tau_k) > T^{-1}) \leq 1$ for any $s \geq 0$ , then

\begin{aligned} b_k &= 1 + \sum_{s=0}^{T-1} b_{k,s} \\ &\leq 1 + 1 \times s_b(T) + 3T^{-1} \times (T - s_b(T)) \\ &\leq \Big\{ \frac{72\lambda^2\sigma_k^2}{(1 + 2\lambda)^2\Delta_k^2} + \Big[ \big( \Lambda_{\max} + (\mathbf{a}_1 + \mathbf{a}_2) + (\mathbf{b}_1 + \mathbf{b}_2) + \Lambda_{\max}(\mathbf{a}_1 + \mathbf{a}_2) \big) \\ &\qquad \times \Big( \frac{1}{3} \log^{-1} 2 \times \log \Big\{ 3 \Big[ 1 + \frac{\sqrt{\pi}}{2} (\sqrt{\mathbf{b}_1} + \sqrt{\mathbf{b}_2}) + \frac{\sqrt{2\pi}\Lambda_{\max}}{2} \Big( \frac{\mathbf{a}_1}{\mathbf{b}_1^2} + \frac{\mathbf{a}_2}{\mathbf{b}_2^2} \Big) + 2 \Big( \frac{\mathbf{a}_1}{\mathbf{b}_1} + \frac{\mathbf{a}_2}{\mathbf{b}_2} \Big) \Big] \Big\} + 1 \Big) \\ &\qquad \vee \frac{18\sigma_\omega^2(2\mu_1 - 1)^2 \vee (2\mu_k - 1)^2}{(\lambda^2 + \lambda + 1/4)^2} \vee \frac{2\sigma_\omega^2(1 + 2\lambda^2)}{(1 + 2\lambda)^2} \Big] \Big\} \log T + 4 \end{aligned}

for any $T \geq 2$ .

Step 3: Aggregating results.

Denote

d_1 = \Delta_k^2 \{ (\mathbf{a}_1 + \mathbf{a}_2) + (\mathbf{b}_1 + \mathbf{b}_2) + \Lambda_{\max}(\mathbf{a}_1 + \mathbf{a}_2) \}

and

d_2 = 3 \left[ 1 + \frac{\sqrt{\pi}}{2} (\sqrt{\mathbf{b}_1} + \sqrt{\mathbf{b}_2}) + \frac{\sqrt{2\pi}\Lambda_{\max}}{2} \left( \frac{\mathbf{a}_1}{\mathbf{b}_1^2} + \frac{\mathbf{a}_2}{\mathbf{b}_2^2} \right) + 2 \left( \frac{\mathbf{a}_1}{\mathbf{b}_1} + \frac{\mathbf{a}_2}{\mathbf{b}_2} \right) \right],

then the bounds in Step 1 and Step 2 imply

a_k \leq 2(1 + 2e^{9/8}) \left\{ \frac{144\lambda^2\sigma_1^2}{(1+2\lambda)^2\Delta_k^2} + \left[ (\Lambda_{\max} + d_1) \left( \frac{1}{3} \frac{\log d_2}{\log 2} + 1 \right) \vee \frac{18\sigma_\omega^2(2\mu_1 - 1)^2 \vee (2\mu_k - 1)^2}{(\lambda^2 + \lambda + 1/4)^2} \vee \frac{2\sigma_\omega^2(1 + 2\lambda^2)}{(1 + 2\lambda)^2} \right] \right\} \log T + 3.

and

b_k \leq \left\{ \frac{72\lambda^2\sigma_k^2}{(1+2\lambda)^2\Delta_k^2} + \left[ (\Lambda_{\max} + d_1) \left( \frac{1}{3} \frac{\log d_2}{\log 2} + 1 \right) \vee \frac{18\sigma_\omega^2(2\mu_1 - 1)^2 \vee (2\mu_k - 1)^2}{(\lambda^2 + \lambda + 1/4)^2} \vee \frac{2\sigma_\omega^2(1 + 2\lambda^2)}{(1 + 2\lambda)^2} \right] \right\} \log T + 4

for any $T \geq 2$ . Therefore, by applying Lemma C.2, we have

\begin{aligned} \text{Reg}_T &\leq \sum_{k=2}^K \Delta_k (a_k + b_k) \\ &\leq \sum_{k=2}^K \Delta_k \left[ 7 + \{c_1(\mu_1, \sigma_1, \mu_k, \sigma_k, \lambda, \sigma_\omega) + c_2(\mu_1, \sigma_1, \mu_k, \sigma_k, \lambda, \sigma_\omega)\Delta_k^{-2}\} \log T \right] \end{aligned} \quad (8)

for any $T \geq 2$ , where

c_1(\mu_1, \sigma_1, \mu_k, \sigma_k, \lambda, \sigma_\omega) = (3 + 2e^{9/8}) \left[ \left( \Lambda_{\max} \left( \frac{\log d_2}{3 \log 2} + 1 \right) \right) \vee \frac{18\sigma_\omega^2\{(2\mu_1 - 1)^2 \vee (2\mu_k - 1)^2\}}{(\lambda^2 + \lambda + 1/4)^2} \vee \frac{2\sigma_\omega^2(1 + 2\lambda^2)}{(1 + 2\lambda)^2} \right],

and

\begin{aligned} c_2(\mu_1, \sigma_1, \mu_k, \sigma_k, \lambda, \sigma_\omega) &= \frac{72\lambda^2\sigma_k^2}{(1+2\lambda)^2} (5 + 4e^{9/8}) \\ &\quad + (3 + 2e^{9/8}) \left[ \left( d_1 \left( \frac{\log d_2}{3 \log 2} + 1 \right) \right) \vee \frac{18\sigma_\omega^2\{(2\mu_1 - 1)^2 \vee (2\mu_k - 1)^2\}}{(\lambda^2 + \lambda + 1/4)^2} \vee \frac{2\sigma_\omega^2(1 + 2\lambda^2)}{(1 + 2\lambda)^2} \right], \end{aligned}

for $k = 2, \dots, K$ . When the total round $T = 1$ , the bound (8) still holds because $\text{Reg}T \leq \max{k=2, \dots, K} \Delta_k \leq 7 \sum_{k=2}^K \Delta_k$ .

Step 4: Simplifying bounds.

Note that

\begin{aligned} d_1 &= \Delta_k^2 \{ (a_1 + a_2) + (b_1 + b_2) + \Lambda_{\max}(a_1 + a_2) \} \\ &\stackrel{\Delta_k \leq 1}{\leq} \frac{1 + \Lambda_{\max}}{3} \sigma_\omega^2 \left[ \frac{192\lambda^4}{(1+2\lambda)^2} + \frac{36\lambda^4}{(\lambda-1)^2} \right] \\ &\quad + \sigma_\omega^2 \left[ \frac{2[96\lambda^4\sigma_1^2 + (1+2\lambda)^2/(36\lambda^2)]}{3(1+2\lambda)^2} + \frac{72\lambda^4\sigma_1^2 + 25(1+2\lambda)^2}{6(\lambda-1)^2} \right]. \end{aligned}

By $\lambda \geq \left( \frac{4\sigma_1}{\sigma_\omega} + 1 \right) + \sqrt{\frac{4\sigma_1}{\sigma_\omega} \left( \frac{4\sigma_1}{\sigma_\omega} + 1 \right)}$ , we have $(\lambda - 1)^2 \geq 16\sigma_1^2/\sigma_\omega^2$ , and hence

\frac{192\lambda^4}{(1+2\lambda)^2} + \frac{36\lambda^4}{(\lambda-1)^2} \leq \frac{192\lambda^4}{4\lambda^2} + \frac{36\lambda^4}{16\sigma_1^2/\sigma_\omega^2} \leq 48\lambda^2 + 3\frac{\sigma_\omega^2}{\sigma_1^2}\lambda^4.

Similarly,

\begin{aligned} & \frac{2 \left[ 96\lambda^4\sigma_1^2 + (1+2\lambda)^2/(36\lambda^2) \right]}{3(1+2\lambda)^2} + \frac{72\lambda^4\sigma_1^2 + 25(1+2\lambda)^2}{6(\lambda-1)^2} \\ & \leq \frac{2 \times 96\lambda^4\sigma_1^2}{3 \times 4\lambda^2} + \frac{2}{3 \times 36\lambda^2} + \frac{72\lambda^4\sigma_1^2 + 25(1+2\lambda)^2}{6 \times 16\sigma_1^2/\sigma_\omega^2} \\ & \leq 16\lambda^2\sigma_1^2 + \frac{1}{48} + \lambda^4\sigma_\omega^2 + \frac{25 \times 9\lambda^2}{6 \times 16\sigma_1^2/\sigma_\omega^2} \leq 16\lambda^2\sigma_1^2 + \frac{1}{48} + 3\lambda^4\sigma_\omega^2 + 3\lambda^2\sigma_\omega^2/\sigma_1^2. \end{aligned}

Thus,

\begin{aligned} d_1 &= \Delta_k^2 \{ (\mathbf{a}_1 + \mathbf{a}_2) + (\mathbf{b}_1 + \mathbf{b}_2) + \Lambda_{\max}(\mathbf{a}_1 + \mathbf{a}_2) \} \\ & \leq \frac{(1 + \Lambda_{\max})\sigma_\omega^2}{3} \left[ 48\lambda^2 + 3\frac{\sigma_\omega^2}{\sigma_1^2}\lambda^4 \right] + \sigma_\omega^2 \left[ 16\lambda^2\sigma_1^2 + \frac{1}{48} + 3\lambda^4\sigma_\omega^2 + 3\frac{\sigma_\omega^2}{\sigma_1^2}\lambda^2 \right] \\ &= \sigma_\omega^2 \left\{ \left[ 16(1 + \Lambda_{\max}) + 16\sigma_1^4 + 3\frac{\sigma_\omega^2}{\sigma_1^2} \right] \lambda^2 + \left[ (1 + \Lambda_{\max})\frac{\sigma_\omega^2}{\sigma_1^2} + 3\sigma_\omega^2 \right] \lambda^4 + \frac{1}{48} \right\} \\ & \leq \sigma_\omega^2 \left[ 16(1 + \Lambda_{\max}) + 16\sigma_1^4 + 3\frac{\sigma_\omega^2}{\sigma_1^2} + (1 + \Lambda_{\max})\frac{\sigma_\omega^2}{\sigma_1^2} + 3\sigma_\omega^2 + 1 \right] \lambda^4 \\ & \leq \left[ (1 + \Lambda_{\max}) \left( 16 + \frac{\sigma_\omega^2}{\sigma_1^2} \right) + 16\sigma_1^4 + 3\frac{\sigma_\omega^2}{\sigma_1^2} + 3\sigma_\omega^2 + 1 \right] \sigma_\omega^2 \lambda^4. \end{aligned} \tag{9}

Define

D_1(\sigma_1, \sigma_k, \lambda, \sigma_\omega) := \left[ (1 + \Lambda_{\max}) \left( 16 + \frac{\sigma_\omega^2}{\sigma_1^2} \right) + 16\sigma_1^4 + 3\frac{\sigma_\omega^2}{\sigma_1^2} + 3\sigma_\omega^2 + 1 \right] \sigma_\omega^2 \lambda^4,

then $d_1 \leq D_1(\sigma_1, \sigma_k, \lambda, \sigma_\omega)$ . On the other hand, for $d_2$ , we have

\begin{aligned} d_2 &= 3 \left[ 1 + \frac{\sqrt{\pi}}{2} \left( \sqrt{\mathbf{b}_1} + \sqrt{\mathbf{b}_2} \right) + \frac{\sqrt{2\pi}\Lambda_{\max}}{2} \left( \frac{\mathbf{a}_1}{\mathbf{b}_1^2} + \frac{\mathbf{a}_2}{\mathbf{b}_2^2} \right) + 2 \left( \frac{\mathbf{a}_1}{\mathbf{b}_1} + \frac{\mathbf{a}_2}{\mathbf{b}_2} \right) \right] \\ & \leq 3 \left[ 1 + \frac{\sqrt{\pi}}{2} \left( \sqrt{\frac{2\sigma_\omega^2(1+2\lambda)^2\Delta_k^2/(144\lambda^2)}{3(1+2\lambda)^2\Delta_k^2}} + \sqrt{\frac{25\sigma_\omega^2(1+2\lambda)^2\Delta_k^2}{6(\lambda-1)^2\Delta_k^2}} \right) \right. \\ & \quad \left. + \frac{\sqrt{2\pi}\Lambda_{\max}}{2} \left( \frac{(1+2\lambda)^2}{64\lambda^4\sigma_1^4} + \frac{(\lambda-1)^2}{12\lambda^4\sigma_1^2\sigma_\omega^2} \right) \right. \\ & \quad \left. + 2 \left( \frac{96\lambda^4}{96\lambda^4\sigma_1^2 + (1+2\lambda)^2/(36\lambda^2)} + \frac{72\lambda^4}{72\lambda^4\sigma_1^2 + 25(1+2\lambda)^2} \right) \right]. \end{aligned}

Similarly, by the fact $\lambda > 1$ ,

\begin{aligned} \sqrt{\frac{2\sigma_\omega^2(1+2\lambda)^2\Delta_k^2/(144\lambda^2)}{3(1+2\lambda)^2\Delta_k^2}} + \sqrt{\frac{25\sigma_\omega^2(1+2\lambda)^2\Delta_k^2}{6(\lambda-1)^2\Delta_k^2}} & \leq \frac{\sigma_\omega}{12\lambda} + \frac{5\sigma_\omega(1+2\lambda)}{2(\lambda-1)} \\ & \leq \sigma_\omega + 3\frac{\sigma_\omega^2}{\sigma_1}(1+2\lambda) \leq \frac{3\sigma_\omega^2}{\sigma_1} \left[ \frac{\sigma_1}{\sigma_\omega} + 3\lambda \right] \end{aligned}

\begin{aligned} \frac{(1+2\lambda)^2}{64\lambda^4\sigma_1^4} + \frac{(\lambda-1)^2}{12\lambda^4\sigma_1^2\sigma_\omega^2} & \leq \frac{9\lambda^2}{64\lambda^4\sigma_1^4} + \frac{16\sigma_1^2/\sigma_\omega^2}{12\lambda^4\sigma_1^2\sigma_\omega^2} \\ & \leq \frac{9\lambda^2}{64\lambda^4\sigma_1^4} + \frac{16\sigma_1^2/\sigma_\omega^2}{12\lambda^2\sigma_1^2\sigma_\omega^2} \\ & \leq 2 \left[ \frac{1}{\sigma_1^2\lambda^2} + \frac{1}{\sigma_\omega^2\lambda^2} \right] \leq 2 \left[ \frac{\sigma_\omega^2}{16\sigma_1^4} + \frac{1}{\sigma_\omega^2} \right], \end{aligned}

and

\begin{aligned} & \frac{96\lambda^4}{96\lambda^4\sigma_1^2 + (1+2\lambda)^2/(36\lambda^2)} + \frac{72\lambda^4}{72\lambda^4\sigma_1^2 + 25(1+2\lambda)^2} \\ & \leq \frac{96\lambda^4}{96\lambda^4\sigma_1^2} + \frac{72\lambda^4}{72\lambda^4\sigma_1^2} = \frac{2\lambda^2}{\lambda^2\sigma_1^2} \leq \frac{2\lambda^2}{16\sigma_1^4/\sigma_\omega^2} = \frac{\lambda^2\sigma_\omega^2}{8\sigma_1^4}. \end{aligned}

Thus,

\begin{aligned} d_2 &= 3 \left[ 1 + \frac{\sqrt{\pi}}{2} \left( \sqrt{b_1} + \sqrt{b_2} \right) + \frac{\sqrt{2\pi}\Lambda_{\max}}{2} \left( \frac{a_1}{b_1^2} + \frac{a_2}{b_2^2} \right) + 2 \left( \frac{a_1}{b_1} + \frac{a_2}{b_2} \right) \right] \\ & \leq 3 \left[ 1 + \frac{3\sqrt{\pi}\sigma_\omega^2}{2\sigma_1} \left( \frac{\sigma_1}{\sigma_\omega} + 3\lambda \right) + \sqrt{2\pi}\Lambda_{\max} \left( \frac{\sigma_\omega^2}{16\sigma_1^4} + \frac{1}{\sigma_\omega^2} \right) + \frac{\lambda^2\sigma_\omega^2}{4\sigma_1^4} \right]. \end{aligned} \quad (10)

Again, we define

D_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega) = 3 \left[ 1 + \frac{3\sqrt{\pi}\sigma_\omega^2}{2\sigma_1} \left( \frac{\sigma_1}{\sigma_\omega} + 3\lambda \right) + \sqrt{2\pi}\Lambda_{\max} \left( \frac{\sigma_\omega^2}{16\sigma_1^4} + \frac{1}{\sigma_\omega^2} \right) + \frac{\lambda^2\sigma_\omega^2}{4\sigma_1^4} \right],

then $d_2 \leq D_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega)$ .
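As a quick numeric sanity check (not part of the proof), the elementary bound $(1+2\lambda)^2 \leq 9\lambda^2$ used at the start of the chain above can be verified for a few values of $\lambda \geq 1$:

```python
# Check the elementary bound used above: for lam >= 1 we have
# 1 + 2*lam <= 3*lam, hence (1 + 2*lam)^2 <= 9*lam^2.
for lam in (1.0, 1.5, 2.0, 5.0, 50.0):
    assert (1 + 2 * lam) ** 2 <= 9 * lam**2
```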

The next step is to give simple bounds for $c_1(\mu_1, \sigma_1, \mu_k, \sigma_k, \lambda, \sigma_\omega)$ and $c_2(\mu_1, \sigma_1, \mu_k, \sigma_k, \lambda, \sigma_\omega)$. Using the facts $a \vee b \leq a + b$ and $a \vee b \vee c = (a \vee b) \vee c$, we obtain that

\begin{aligned} & \left( \Lambda_{\max} \left( \frac{\log d_2}{3 \log 2} + 1 \right) \right) \vee \frac{18\sigma_\omega^2 \{(2\mu_1 - 1)^2 \vee (2\mu_k - 1)^2\}}{(\lambda^2 + \lambda + 1/4)^2} \vee \frac{2\sigma_\omega^2(1+2\lambda^2)}{(1+2\lambda)^2} \\ & \leq \left( \Lambda_{\max} \left( \frac{\log d_2}{3 \log 2} + 1 \right) \right) \vee \frac{72\sigma_\omega^2 \{(2\mu_1 - 1)^2 + (2\mu_k - 1)^2\}}{(1+2\lambda)^2} \vee \frac{2\sigma_\omega^2(1+2\lambda^2)}{(1+2\lambda)^2} \\ & \leq \left( \Lambda_{\max} \left( \frac{\log d_2}{3 \log 2} + 1 \right) \right) \vee \frac{72\sigma_\omega^2 \times 2 + 2\sigma_\omega^2(1+2\lambda^2)}{(1+2\lambda)^2} \\ & \leq \left( \Lambda_{\max} \left( \frac{\log d_2}{3 \log 2} + 1 \right) \right) \vee \frac{144\sigma_\omega^2 + 2\sigma_\omega^2 + 4\lambda^2\sigma_\omega^2}{4\lambda^2} \\ & \leq \left[ \Lambda_{\max} \left( \frac{\log d_2}{3 \log 2} + 1 \right) \right] + 38\sigma_\omega^2 \leq \Lambda_{\max} \left( \frac{\log D_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega)}{3 \log 2} + 1 \right) + 38\sigma_\omega^2, \end{aligned}

and

\begin{aligned} & \left( d_1 \left( \frac{\log d_2}{3 \log 2} + 1 \right) \right) \vee \frac{18\sigma_\omega^2 \{(2\mu_1 - 1)^2 \vee (2\mu_k - 1)^2\}}{(\lambda^2 + \lambda + 1/4)^2} \vee \frac{2\sigma_\omega^2(1+2\lambda^2)}{(1+2\lambda)^2} \\ & \leq D_1(\sigma_1, \sigma_k, \lambda, \sigma_\omega) \left( \frac{\log D_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega)}{3 \log 2} + 1 \right) + 38\sigma_\omega^2, \end{aligned}

where the last steps in the two inequalities above follow from (9) and (10). Define

C_1(\sigma_1, \sigma_k, \lambda, \sigma_\omega) = 10 \left[ \Lambda_{\max} \left( \frac{\log D_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega)}{3 \log 2} + 1 \right) + 38\sigma_\omega^2 \right],

and

C_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega) = 50\lambda^2\sigma_k^2 + 10 \left[ D_1(\sigma_1, \sigma_k, \lambda, \sigma_\omega) \left( \frac{\log D_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega)}{3 \log 2} + 1 \right) + 38\sigma_\omega^2 \right].

Noting that

3 + 2e^{9/8} \leq 10, \quad \frac{72\lambda^2\sigma_k^2}{(1+2\lambda)^2}(5 + 4e^{9/8}) \leq \frac{72\lambda^2\sigma_k^2}{25}(5 + 4e^{9/8}) \leq 50\lambda^2\sigma_k^2,

we will have $c_1(\mu_1, \sigma_1, \mu_k, \sigma_k, \lambda, \sigma_\omega) \leq C_1(\sigma_1, \sigma_k, \lambda, \sigma_\omega)$ and $c_2(\mu_1, \sigma_1, \mu_k, \sigma_k, \lambda, \sigma_\omega) \leq C_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega)$ . Finally, by using the bound in (8), one has

\begin{aligned} \text{Reg}_T & \leq \sum_{k=2}^K \Delta_k \left[ 7 + \{c_1(\mu_1, \sigma_1, \mu_k, \sigma_k, \lambda, \sigma_\omega) + c_2(\mu_1, \sigma_1, \mu_k, \sigma_k, \lambda, \sigma_\omega)\Delta_k^{-2}\} \log T \right] \\ & \leq \sum_{k=2}^K \Delta_k \left[ 7 + \{C_1(\sigma_1, \sigma_k, \lambda, \sigma_\omega) + C_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega)\Delta_k^{-2}\} \log T \right]. \end{aligned}

This proves the problem-dependent bound. For the problem-independent case, denote

J_{k,T} := \sum_{t=1}^T \mathbb{I}(I_t = k),

then

\begin{aligned} \text{Reg}_T &= \sum_{\Delta_k: \Delta_k < \Delta} \Delta_k \mathbb{E} J_{k,T} + \sum_{\Delta_k: \Delta_k \geq \Delta} \Delta_k \mathbb{E} J_{k,T} \\ &\stackrel{\text{by the bounds of } a_k, b_k}{\leq} T\Delta + \sum_{\Delta_k: \Delta_k \geq \Delta} \Delta_k \left[ 7 + \{C_1(\sigma_1, \sigma_k, \lambda, \sigma_\omega) + C_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega) \Delta_k^{-2}\} \log T \right] \\ &= T\Delta + 7 \sum_{\Delta_k: \Delta_k \geq \Delta} \Delta_k + \sum_{\Delta_k: \Delta_k \geq \Delta} C_1(\sigma_1, \sigma_k, \lambda, \sigma_\omega) \Delta_k \log T + \sum_{\Delta_k: \Delta_k \geq \Delta} \frac{C_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega)}{\Delta_k} \log T \\ &\stackrel{\text{by } \Delta_k \leq \mu_1 \leq 1}{\leq} T\Delta + 7K\mu_1 + \max_{k \in [K] \setminus \{1\}} C_1(\sigma_1, \sigma_k, \lambda, \sigma_\omega) K \log T + \max_{k \in [K] \setminus \{1\}} C_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega) \frac{K \log T}{\Delta}, \end{aligned} \tag{11}

for any prespecified $\Delta \in (0, 1)$. Taking $\Delta = \sqrt{\max_{k \in [K] \setminus \{1\}} C_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega) K \log T / T}$, we obtain

\text{Reg}_T \leq 7K\mu_1 + \max_{k \in [K] \setminus \{1\}} C_1(\sigma_1, \sigma_k, \lambda, \sigma_\omega) K \log T + 2 \sqrt{\max_{k \in [K] \setminus \{1\}} C_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega) K T \log T}.

Therefore, we finish the proof of Theorem C.1.
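The numeric facts used above, and the balancing choice of $\Delta$, can be sanity-checked with a short script (illustrative only; the values of $T$, $K$, and $C$ below are arbitrary):

```python
import math

# Constants appearing in the definitions of C_1 and C_2:
assert 3 + 2 * math.exp(9 / 8) <= 10
assert (72 / 25) * (5 + 4 * math.exp(9 / 8)) <= 50

# Balancing the two Delta-dependent terms of (11):
# f(Delta) = T*Delta + C*K*log(T)/Delta is minimized at
# Delta* = sqrt(C*K*log(T)/T), where f(Delta*) = 2*sqrt(C*K*T*log(T)).
T, K, C = 10_000, 5, 3.7  # arbitrary illustrative values
f = lambda d: T * d + C * K * math.log(T) / d
delta_star = math.sqrt(C * K * math.log(T) / T)
assert abs(f(delta_star) - 2 * math.sqrt(C * K * T * math.log(T))) < 1e-6
# delta_star is optimal: nearby choices of Delta are no better
assert all(f(delta_star) <= f(delta_star * (1 + e)) + 1e-6 for e in (-0.5, -0.1, 0.1, 0.5))
```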

To prove Theorem 5.1, set $\sigma_k \equiv 1$ for $k \in [K]$; then we have

\begin{aligned} D_1(1, 1, \lambda, \sigma_\omega) &= \left[ (1 + 8\sqrt{2})(16 + \sigma_\omega^2) + 16 + 3\sigma_\omega^2 + 3\sigma_\omega^2 + 1 \right] \sigma_\omega^2 \lambda^4 \\ &\leq \left[ 13(16 + 3\sigma_\omega^2) + 6\sigma_\omega^2 + 17 \right] \sigma_\omega^2 \lambda^4 = 45(3 + \sigma_\omega^2) \lambda^4 \end{aligned}

and

\begin{aligned} D_2(1, 1, \lambda, \sigma_\omega) &= 3 \left[ 1 + \frac{3\sqrt{\pi}\sigma_\omega^2}{2} \left( \frac{1}{\sigma_\omega} + 3\lambda \right) + 8\sqrt{\pi} \left( \frac{\sigma_\omega^2}{16} + \frac{1}{\sigma_\omega^2} \right) + \frac{\lambda^2 \sigma_\omega^2}{4} \right] \\ &\leq 3 \left[ 1 + \frac{3\sqrt{\pi}\sigma_\omega^2}{2} \left( \frac{1}{\sigma_\omega} + 3 \right) + 8\sqrt{\pi} \left( \frac{\sigma_\omega^2}{16} + \frac{1}{\sigma_\omega^2} \right) + \frac{\sigma_\omega^2}{4} \right] \lambda^2 \leq (1 + 15\sigma_\omega^{-2} + 3\sigma_\omega + 10\sigma_\omega^2) \lambda^2. \end{aligned}

Therefore, we have

\begin{aligned} C_1(1, 1, \lambda, \sigma_\omega) &= 10 \left[ 8\sqrt{2} \left( \frac{\log D_2(1, 1, \lambda, \sigma_\omega)}{3 \log 2} + 1 \right) + 38\sigma_\omega^2 \right] \\ &\leq 10 \left[ 8\sqrt{2} \left( \frac{\log [(1 + 15\sigma_\omega^{-2} + 3\sigma_\omega + 10\sigma_\omega^2) \lambda^2]}{3 \log 2} + 1 \right) + 38\sigma_\omega^2 \right], \end{aligned}

and

\begin{aligned} C_2(1, 1, \lambda, \sigma_\omega) &= 50\lambda^2 + 10 \left[ D_1(1, 1, \lambda, \sigma_\omega) \left( \frac{\log D_2(1, 1, \lambda, \sigma_\omega)}{3 \log 2} + 1 \right) + 38\sigma_\omega^2 \right] \\ &\leq 50\lambda^2 + 10 \left[ 45(3 + \sigma_\omega^2) \left( \frac{\log [(1 + 15\sigma_\omega^{-2} + 3\sigma_\omega + 10\sigma_\omega^2) \lambda^2]}{3 \log 2} + 1 \right) \lambda^4 + 38\sigma_\omega^2 \right]. \end{aligned}

The above bounds together with the fact that

C_1(\sigma_1, \sigma_k, \lambda, \sigma_\omega) + C_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega) \Delta_k^{-2} \leq \frac{C_1(\sigma_1, \sigma_k, \lambda, \sigma_\omega) + C_2(\sigma_1, \sigma_k, \lambda, \sigma_\omega)}{\Delta_k^2}

gives the first result in Theorem 5.1. Finally, by using the same technique as in (11), we obtain the problem-independent regret bound that we require. $\square$

## D. Lemmas on Bounding Regret Components

D.1. Lemmas on bounding $a_k$ .

Lemma D.1 (Bounding $a_{k,s}$ for any $s > 0$ ). Set

\lambda \geq \frac{4\sigma_1}{\sigma_\omega} + \sqrt{\frac{4\sigma_1}{\sigma_\omega} \left( \frac{4\sigma_1}{\sigma_\omega} + 1 \right)},

then

a_{k,s} \leq 2 + 4e^{9/8}

for any $s \geq 0$ .

Proof. Note that if we take

\tau_k \leq \frac{\mu_1 + \lambda}{1 + 2\lambda}, \quad (12)

then we will have

\begin{aligned} a_{k,s} &= \mathbb{E} \left[ \left( \frac{1}{Q_{1,s}(\tau_k)} - 1 \right) \wedge T \right] \\ &\stackrel{\text{by } Q_{1,s}(\cdot) \text{ is decreasing}}{\leq} \mathbb{E} \frac{1}{Q_{1,s}(\tau_k)} \leq \mathbb{E} Q_{1,s}^{-1} \left( \frac{\mu_1 + \lambda}{1 + 2\lambda} \right). \quad (13)
\end{aligned}

Hence, we need to find a lower bound of the tail probability $Q_{1,s} \left( \frac{\mu_1 + \lambda}{1 + 2\lambda} \right) = \mathbb{P} \left( \bar{Y}_{1,s} > \frac{\mu_1 + \lambda}{1 + 2\lambda} \mid \mathcal{H}_{1,s} \right)$. Throughout the proof, we use the choice (12) for $\tau_k$, so that we can take advantage of the bound (13).

To further analyze (13), we rewrite it as a probability with respect to weighted random sums of the Gaussian random variables $\{\omega_i, \omega'_i, \omega''_i\}$.

Let $x_i := (2\lambda + 1)R_{1,i} - (\mu_1 + \lambda)$ , $y_i := \lambda(\mu_1 - 1 - \lambda)$ , $z_i := \lambda(\lambda + \mu_1)$ for $i \in [s]$ , then

S_x = \sum_{i=1}^s x_i = s((2\lambda + 1)\bar{R}_{1,s} - (\mu_1 + \lambda)), \quad S_y = \sum_{i=1}^s y_i = s\lambda(\mu_1 - 1 - \lambda), \quad S_z = \sum_{i=1}^s z_i = s\lambda(\lambda + \mu_1),

with $S_x - S_y - S_z = s(2\lambda + 1)(\bar{R}_{1,s} - \mu_1)$ and

T_x = \sum_{i=1}^s x_i^2 = \sum_{i=1}^s [(2\lambda + 1)R_{1,i} - (\mu_1 + \lambda)]^2, \quad T_y = \sum_{i=1}^s y_i^2 = s\lambda^2(\mu_1 - 1 - \lambda)^2, \quad T_z = \sum_{i=1}^s z_i^2 = s\lambda^2(\lambda + \mu_1)^2,

with $T_x + T_y + T_z = \sum_{i=1}^s [(2\lambda + 1)R_{1,i} - (\mu_1 + \lambda)]^2 + s[\lambda^2(\mu_1 - 1 - \lambda)^2 + \lambda^2(\lambda + \mu_1)^2]$. For a standard Gaussian variable $Z$, we have Lemma E.1, and we can write the probability as

\begin{aligned} &Q_{1,s} \left( \frac{\mu_1 + \lambda}{1 + 2\lambda} \right) \\ &= \mathbb{P} \left( \bar{Y}_{1,s} > \frac{\mu_1 + \lambda}{1 + 2\lambda} \mid \mathcal{H}_{1,s} \right) \\ &= \mathbb{P}_{\omega, \omega', \omega''} \left( \sum_{i=1}^s ((2\lambda + 1)R_{1,i} - (\mu_1 + \lambda))\omega_i > \lambda(\mu_1 - 1 - \lambda) \sum_{i=1}^s \omega'_i + \lambda(\lambda + \mu_1) \sum_{i=1}^s \omega''_i \right). \end{aligned}

Hence, by applying the first inequality in Lemma E.1 and using the fact that $T_x + T_y + T_z \geq T_z \geq s\lambda^4$, we can get that

\begin{aligned} & Q_{1,s} \left( \frac{\mu_1 + \lambda}{1 + 2\lambda} \right) \\ &= \mathbb{P}_{\omega, \omega', \omega''} \left( \sum_{i=1}^s x_i \omega_i - \sum_{i=1}^s y_i \omega'_i - \sum_{i=1}^s z_i \omega''_i > 0 \right) \\ &= \mathbb{P}_Z \left( Z > \frac{-(S_x - S_y - S_z)}{\sigma_\omega \sqrt{T_x + T_y + T_z}} \right) \\ &\geq \frac{1}{2} \times \mathbb{I} \left( \frac{-(S_x - S_y - S_z)}{\sigma_\omega \sqrt{T_x + T_y + T_z}} < 0 \right) + \frac{1}{4} \exp \left\{ -\frac{(S_x - S_y - S_z)^2}{\sigma_\omega^2 (T_x + T_y + T_z)} \right\} \times \mathbb{I} \left( \frac{-(S_x - S_y - S_z)}{\sigma_\omega \sqrt{T_x + T_y + T_z}} > 0 \right) \\ &\geq \frac{1}{2} \times \mathbb{I} (S_x - S_y - S_z > 0) + \frac{1}{4} \exp \left\{ -\frac{(S_x - S_y - S_z)^2}{\sigma_\omega^2 \cdot s\lambda^4} \right\} \times \mathbb{I} (S_x - S_y - S_z \leq 0) \\ &= \frac{1}{2} \mathbb{I} (\bar{R}_{1,s} - \mu_1 > 0) + \frac{1}{4} \exp \left\{ -\frac{s(2\lambda + 1)^2 (\bar{R}_{1,s} - \mu_1)^2}{\sigma_\omega^2 \lambda^4} \right\} \mathbb{I} (\bar{R}_{1,s} - \mu_1 \leq 0), \end{aligned}

where $Z$ again is the standard Gaussian variable and we will use this notation throughout this proof. Then this result combined with the fact (13) gives the upper bound for $a_{k,s}$ as follows:

\begin{aligned} a_{k,s} &\leq 2\mathbb{P} (\bar{R}_{1,s} - \mu_1 > 0) + 4\mathbb{E} \exp \left\{ \frac{s(2\lambda + 1)^2 (\bar{R}_{1,s} - \mu_1)^2}{\sigma_\omega^2 \lambda^4} \right\} \mathbb{I} (\bar{R}_{1,s} - \mu_1 \leq 0) \\ &\leq 2 + 4\mathbb{E} \exp \left\{ \frac{s(2\lambda + 1)^2 (\bar{R}_{1,s} - \mu_1)^2}{\sigma_\omega^2 \lambda^4} \right\}. \end{aligned}

From the above expression, to prove that $a_{k,s}$ is bounded by a constant free of $s$, it remains to show that the expectation

\mathbb{E} \exp \left\{ \frac{s(2\lambda + 1)^2 (\bar{R}_{1,s} - \mu_1)^2}{\sigma_\omega^2 \lambda^4} \right\} \quad (14)

can be bounded above by a constant free of $s$.
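Before invoking Lemma E.2, note that in the special case of Gaussian rewards the expectation (14) has a closed form, which makes the constant $e^{9/8}$ used below easy to sanity-check numerically (an illustration of the Gaussian case only, not the sub-Gaussian argument itself):

```python
import math

# If R_bar - mu_1 ~ N(0, sigma1^2/s), then for a*v < 1/2 (with v = sigma1^2/s):
#   E exp{a * (R_bar - mu_1)^2} = (1 - 2*a*v)^{-1/2}.
# Taking a = s/(8*sigma1^2) gives 2*a*v = 1/4 regardless of s and sigma1,
# so the expectation equals 2/sqrt(3) ~ 1.155, comfortably below e^{9/8} ~ 3.08.
sigma1, s = 1.3, 57  # arbitrary illustrative values
a, v = s / (8 * sigma1**2), sigma1**2 / s
val = (1 - 2 * a * v) ** -0.5
assert abs(val - 2 / math.sqrt(3)) < 1e-9
assert val <= math.exp(9 / 8)
```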

Applying Lemma E.2, we know that if we take

\frac{s(2\lambda + 1)^2}{\sigma_\omega^2 \lambda^4} \leq \frac{s}{8\sigma_1^2},

then (14) can be upper bounded by $e^{9/8}$ . A sufficient condition for the above inequality is

\frac{s(2\lambda + 1)^2}{\sigma_\omega^2 \lambda^4} \leq \frac{s}{16\sigma_1^2}, \quad \lambda^2 - \frac{8\sigma_1}{\sigma_\omega} \lambda - \frac{4\sigma_1}{\sigma_\omega} \geq 0,

or say,

\lambda \geq \frac{4\sigma_1}{\sigma_\omega} + \sqrt{\frac{4\sigma_1}{\sigma_\omega} \left( \frac{4\sigma_1}{\sigma_\omega} + 1 \right)}. \quad (15)

Therefore, if we take the tuning parameters $\lambda$ and $\sigma_\omega$ as in (15), then $a_{k,s}$ will be bounded by a constant such that

a_{k,s} \leq 2 + 4e^{9/8}. \quad (16)

Note that this step derives the tuning conditions for the tuning parameters in Theorem C.1. $\square$
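The proof above uses a Gaussian-tail lower bound of the form $\mathbb{P}(Z > x) \geq \tfrac{1}{2}$ for $x \leq 0$ and $\mathbb{P}(Z > x) \geq \tfrac{1}{4}e^{-x^2}$ for $x > 0$ (this is the shape suggested by the displays above; the exact statement is Lemma E.1, which is not reproduced here). A quick numeric check over a grid:

```python
import math

def tail(x):
    """P(Z > x) for a standard Gaussian Z, via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))

for i in range(0, 81):
    x = i / 10
    assert tail(-x) >= 0.5                      # left half: tail is at least 1/2
    assert tail(x) >= 0.25 * math.exp(-x * x)   # right half: (1/4) e^{-x^2} lower bound
```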

Lemma D.2 (Bounding $a_{k,s,1}$ at (2)). Take

s_{a,1}(T) := \frac{4\sigma_1^2}{C_1^2 (1 + 2\lambda)^2 \Delta_k^2} \times \log T.

Then for any $s \geq s_{a,1}(T)$,

a_{k,s,1} \leq T^{-1},

when $T \geq 2$ .

Proof. We will bound $a_{k,s,1}$ by bounding the probability of the event $A_{1,s}^c$. Write

a_{k,s,1} = \mathbb{E} \left[ (N_{1,s}(\tau_k) \wedge T) \mathbb{I}(A_{1,s}^c) \right] \leq \mathbb{E} \left[ T \mathbb{I}(A_{1,s}^c) \right] = T \mathbb{P}(A_{1,s}^c). \quad (17)

Since the summation of independent sub-Gaussian variables is still sub-Gaussian, (30) in Lemma E.2 gives $\bar{R}_{1,s} - \mu_1 \sim \text{subG}(\sigma_1^2/s)$ . By the concentration inequality of the sub-Gaussian variable,

\begin{aligned} \mathbb{P}(A_{1,s}^c) &= \mathbb{P}(|\bar{R}_{1,s} - \mu_1| \geq (1+2\lambda)C_1\Delta_k) \\ &\leq 2 \exp \left\{ -\frac{(1+2\lambda)^2 C_1^2 \Delta_k^2 s}{2\sigma_1^2} \right\} \\ &\stackrel{\text{take } s \geq s_{a,1}(T) \geq \frac{2\sigma_1^2}{C_1^2(1+2\lambda)^2 \Delta_k^2} \times 2\log(\sqrt{2}T)}{\leq} \frac{1}{T^2}. \end{aligned}

Hence, we obtain that

a_{k,s,1} \leq T \mathbb{P}(A_{1,s}^c) \leq T^{-1}, \quad \text{for any } s \geq s_{a,1}(T) \asymp \log T.
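The arithmetic in the last display can be checked directly: writing $c$ for the exponent rate $(1+2\lambda)^2C_1^2\Delta_k^2/(2\sigma_1^2)$, any $s \geq (2/c)\log(\sqrt{2}T)$ makes $2e^{-cs} \leq T^{-2}$. A quick check with an arbitrary illustrative value of $c$:

```python
import math

for T in (2, 10, 1000):
    c = 0.37  # arbitrary illustrative exponent rate
    s = (2 / c) * math.log(math.sqrt(2) * T)
    # 2*exp(-c*s) = 2*(sqrt(2)*T)^(-2) = 1/T^2
    assert 2 * math.exp(-c * s) <= 1 / T**2 + 1e-12
```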

Lemma D.3 (Bounding $a_{k,s,2}$ at (3)). Take

\begin{aligned} s_{a,2}(T) := & \left[ \Lambda_1 + \mathbf{a}_1 + \mathbf{b}_1 + \Lambda_1 \mathbf{a}_1 \right] \left( \frac{1}{3} \log^{-1} 2 \times \log \left\{ 3 \left[ 1 + \frac{\sqrt{\pi \mathbf{b}_1}}{2} + \frac{\sqrt{2\pi} \Lambda_1 \mathbf{a}_1}{2\mathbf{b}_1^2} + \frac{2\mathbf{a}_1}{\mathbf{b}_1} \right] \right\} + 1 \right) \\ & \vee \frac{18\sigma_\omega^2(2\mu_1 - 1)^2}{(\lambda^2 + \lambda + 1/4)^2} \vee \frac{2\sigma_\omega^2(1+2\lambda^2)}{(1+2\lambda)^2} \times 3 \log T, \end{aligned}

where

\mathbf{a}_1 = \frac{4\sigma_\omega^2 \lambda^2}{3C_2^2(1+2\lambda)^2 \Delta_k^2}, \quad \mathbf{b}_1 = \frac{2\sigma_\omega^2 [2\lambda^2 \sigma_1^2 + (1+2\lambda)^2 C_2^2 \Delta_k^2]}{3C_2^2(1+2\lambda)^2 \Delta_k^2}, \quad \Lambda_1 = 8\sqrt{2}\sigma_1^2.

Then for any $s \geq s_{a,2}(T)$ ,

a_{k,s,2} \leq T^{-1},

for any $T \geq 2$ .

Proof. To prove this lemma, we will bound the probability of $G_{1,s}^c$ conditional on $\mathcal{H}_{1,s}$. Since

\begin{aligned} & \mathbb{E}_{\omega, \omega', \omega''} \left[ \sum_{i=1}^s [(1+2\lambda)R_{1,i} - (\bar{R}_{1,s} + \lambda)] \omega_i + \sum_{i=1}^s \lambda(1+\lambda - \bar{R}_{1,s}) \omega'_i - \sum_{i=1}^s \lambda(\bar{R}_{1,s} + \lambda) \omega''_i \right] \\ &= s((1+2\lambda)\bar{R}_{1,s} - (\bar{R}_{1,s} + \lambda) + \lambda(1+\lambda - \bar{R}_{1,s}) - \lambda(\bar{R}_{1,s} + \lambda)) = 0, \end{aligned}

we can bound the tail probability $\mathbb{P}(G_{1,s}^c \mid \mathcal{H}_{1,s})$ by

\begin{aligned} & \mathbb{P}(G_{1,s}^c \mid \mathcal{H}_{1,s}) \\ &= \mathbb{P}(|\bar{Y}_{1,s} - \bar{R}_{1,s}^*| > C_2 \Delta_k \mid \mathcal{H}_{1,s}) \\ &= 2\mathbb{P}_{\omega, \omega', \omega''} \left\{ \left| \frac{\sum_{i=1}^s [(1+2\lambda)R_{1,i} - (\bar{R}_{1,s} + \lambda)]\omega_i + \sum_{i=1}^s \lambda(1+\lambda - \bar{R}_{1,s})\omega'_i - \sum_{i=1}^s \lambda(\bar{R}_{1,s} + \lambda)\omega''_i}{(1+2\lambda)(\sum_{i=1}^s \omega_i + \lambda \sum_{i=1}^s \omega'_i + \lambda \sum_{i=1}^s \omega''_i)} \right| > C_2 \Delta_k \right\} \\ &\stackrel{\text{by Lemma E.3}}{\leq} 2\mathbb{P}_{\omega, \omega', \omega''} \left\{ \sum_{i=1}^s [(1+2\lambda)R_{1,i} - (\bar{R}_{1,s} + \lambda)]\omega_i + \sum_{i=1}^s \lambda(1+\lambda - \bar{R}_{1,s})\omega'_i \right. \\ &\quad \left. - \sum_{i=1}^s \lambda(\bar{R}_{1,s} + \lambda)\omega''_i - C_2(1+2\lambda)\Delta_k \left[ \sum_{i=1}^s \omega_i + \lambda \sum_{i=1}^s \omega'_i + \lambda \sum_{i=1}^s \omega''_i \right] > 0 \right\} \\ &\quad + \mathbb{P}_{\omega, \omega', \omega''} \left\{ \sum_{i=1}^s \omega_i + \lambda \sum_{i=1}^s \omega'_i + \lambda \sum_{i=1}^s \omega''_i < 0 \right\} = 2I + II, \end{aligned} \tag{18}

where

\begin{aligned} I = \mathbb{P}_{\omega, \omega', \omega''} \left\{ \sum_{i=1}^s [(1+2\lambda)R_{1,i} - (\bar{R}_{1,s} + \lambda)]\omega_i + \sum_{i=1}^s \lambda(1+\lambda - \bar{R}_{1,s})\omega'_i \right. \\ \left. - \sum_{i=1}^s \lambda(\bar{R}_{1,s} + \lambda)\omega''_i - C_2(1+2\lambda)\Delta_k \left[ \sum_{i=1}^s \omega_i + \lambda \sum_{i=1}^s \omega'_i + \lambda \sum_{i=1}^s \omega''_i \right] > 0 \right\} \end{aligned}

and

II = \mathbb{P}_{\omega, \omega', \omega''} \left\{ \sum_{i=1}^s \omega_i + \lambda \sum_{i=1}^s \omega'_i + \lambda \sum_{i=1}^s \omega''_i < 0 \right\}.

For bounding $\mathbb{P}(G_{1,s}^c \mid \mathcal{H}_{1,s})$, it suffices to bound $I$ and $II$ separately. We first study $II$. Write

\begin{aligned} II &= \mathbb{P}_{\omega, \omega', \omega''} \left( \sum_{i=1}^s \omega_i + \lambda \sum_{i=1}^s \omega'_i + \lambda \sum_{i=1}^s \omega''_i < 0 \right) \\ &= \mathbb{P}_{\omega, \omega', \omega''} \left( \frac{\sum_{i=1}^s (\omega_i - 1) + \lambda \sum_{i=1}^s (\omega'_i - 1) + \lambda \sum_{i=1}^s (\omega''_i - 1)}{\sqrt{s\sigma_\omega^2(1+2\lambda^2)}} < -\frac{(1+2\lambda)s}{\sqrt{s\sigma_\omega^2(1+2\lambda^2)}} \right) \\ &= \mathbb{P}_{\omega, \omega', \omega''} \left( Z < -\frac{(1+2\lambda)}{\sigma_\omega \sqrt{1+2\lambda^2}} \times \sqrt{s} \right) \\ &\stackrel{\text{by Lemma E.1}}{\leq} \frac{1}{2} \exp \left\{ -\frac{s(1+2\lambda)^2}{2\sigma_\omega^2(1+2\lambda^2)} \right\}. \end{aligned}

For studying $I$, we define the following functions of $\{R_{1,i}\}_{i=1}^s$:

f(\{R_{1,i}\}_{i=1}^s) = f_1(\{R_{1,i}\}_{i=1}^s) + f_2(\{R_{1,i}\}_{i=1}^s) - f_3(\{R_{1,i}\}_{i=1}^s)

with

\begin{aligned} f_1(\{R_{1,i}\}_{i=1}^s) &= \sum_{i=1}^s [(R_{1,i} - \bar{R}_{1,s}) + \lambda(2\bar{R}_{1,s} - 1) - (1+2\lambda)C_2\Delta_k]\omega_i, \\ f_2(\{R_{1,i}\}_{i=1}^s) &= \sum_{i=1}^s [\lambda(\lambda+1 - \bar{R}_{1,s}) - (1+2\lambda)C_2\Delta_k]\omega'_i, \end{aligned}

and

f_3(\{R_{1,i}\}_{i=1}^s) = \sum_{i=1}^s [\lambda(\bar{R}_{1,s} + \lambda) + (1+2\lambda)C_2\Delta_k]\omega''_i.

Then we can write $I = \mathbb{P}_{\omega, \omega', \omega''}(f_1 + f_2 - f_3 > 0)$. Note that $f_1, f_2$, and $f_3$ are mutually independent conditioning on $\mathcal{H}_{1,s}$. Then the expectation is

\mathbb{E}[f_1 + f_2 - f_3 \mid \mathcal{H}_{1,s}] = -3C_2(1 + 2\lambda)\Delta_k s,

and the variance is

\begin{aligned} & \text{var}(f_1 + f_2 - f_3 \mid \mathcal{H}_{1,s}) \\ &= \sigma_\omega^2 \left[ \sum_{i=1}^s [(R_{1,i} - \bar{R}_{1,s}) + \lambda(2\bar{R}_{1,s} - 1) - (1 + 2\lambda)C_2\Delta_k]^2 \right. \\ & \quad \left. + \sum_{i=1}^s [\lambda(\lambda + 1 - \bar{R}_{1,s}) - (1 + 2\lambda)C_2\Delta_k]^2 + \sum_{i=1}^s [\lambda(\bar{R}_{1,s} + \lambda) + (1 + 2\lambda)C_2\Delta_k]^2 \right] \\ &= \sigma_\omega^2 [V_1 + V_2 + V_3], \end{aligned} \quad (19)

where

V_1 = \sum_{i=1}^s [(R_{1,i} - \bar{R}_{1,s}) + \lambda(2\bar{R}_{1,s} - 1) - (1 + 2\lambda)C_2\Delta_k]^2,

V_2 = \sum_{i=1}^s [\lambda(\lambda + 1 - \bar{R}_{1,s}) - (1 + 2\lambda)C_2\Delta_k]^2,

and

V_3 = \sum_{i=1}^s [\lambda(\bar{R}_{1,s} + \lambda) + (1 + 2\lambda)C_2\Delta_k]^2.

For bounding the conditional variance above, we expand $V_1 + V_2 + V_3 = I^{(2)} + II^{(2)} + III^{(2)}$, where $I^{(2)}$ collects the pure-square terms, $II^{(2)}$ the cross terms, and $III^{(2)}$ the $(1+2\lambda)^2C_2^2\Delta_k^2$ terms. We calculate each of them as follows.

\begin{aligned} I^{(2)} &= \sum_{i=1}^s [(R_{1,i} - \bar{R}_{1,s}) + \lambda(2\bar{R}_{1,s} - 1)]^2 + s\lambda^2(\lambda + 1 - \bar{R}_{1,s})^2 + s\lambda^2(\bar{R}_{1,s} + \lambda)^2 \\ &= \sum_{i=1}^s (R_{1,i} - \bar{R}_{1,s})^2 + 2s\lambda^2(3\bar{R}_{1,s}^2 - 3\bar{R}_{1,s} + \lambda^2 + \lambda + 1) \\ &= \sum_{i=1}^s (R_{1,i} - \mu_1)^2 + (6\lambda^2 - 1)s(\bar{R}_{1,s} - \mu_1)^2 + 6s\lambda^2(2\mu_1 - 1)(\bar{R}_{1,s} - \mu_1) + 2s\lambda^2[\lambda^2 + \lambda + 1 + 3\mu_1(\mu_1 - 1)] \\ &\stackrel{\text{Cauchy inequality}}{\leq} 6\lambda^2 \sum_{i=1}^s (R_{1,i} - \mu_1)^2 + 6s\lambda^2(2\mu_1 - 1)(\bar{R}_{1,s} - \mu_1) + 2s\lambda^2(\lambda^2 + \lambda + 1 - 3\mu_1(1 - \mu_1)), \end{aligned} \quad (20)

\begin{aligned} II^{(2)} &= -2(1 + 2\lambda)C_2\Delta_k \left[ \sum_{i=1}^s ((R_{1,i} - \bar{R}_{1,s}) + \lambda(2\bar{R}_{1,s} - 1)) + s\lambda(\lambda + 1 - \bar{R}_{1,s}) - s\lambda(\bar{R}_{1,s} + \lambda) \right] \\ &= -2s(1 + 2\lambda)C_2\Delta_k \left[ \lambda(2\bar{R}_{1,s} - 1) + \lambda(\lambda + 1 - \bar{R}_{1,s}) - \lambda(\bar{R}_{1,s} + \lambda) \right] = 0, \end{aligned}

and

III^{(2)} = 3s(1 + 2\lambda)^2 C_2^2 \Delta_k^2.

Therefore, the conditional variance in (19) is bounded by

\begin{aligned} \sigma_\omega^{-2} \text{var} (f_1 + f_2 - f_3 \mid \mathcal{H}_{1,s}) &= I^{(2)} + II^{(2)} + III^{(2)} \\ &\leq \underbrace{6\lambda^2 \sum_{i=1}^s (R_{1,i} - \mu_1)^2 + 6s\lambda^2(2\mu_1 - 1) (\bar{R}_{1,s} - \mu_1)}_{\text{the random part}} \\ &\quad + \underbrace{2s\lambda^2 (\lambda^2 + \lambda + 1 - 3\mu_1(1 - \mu_1)) + 3s(1 + 2\lambda)^2 C_2^2 \Delta_k^2}_{\text{the deterministic part}} \\ &= \mathfrak{R}_1 + \mathfrak{R}_2 + \mathfrak{Q}, \end{aligned} \tag{21}

where

\mathfrak{R}_1 = 6\lambda^2 \sum_{i=1}^s (R_{1,i} - \mu_1)^2, \quad \mathfrak{R}_2 = 6s\lambda^2(2\mu_1 - 1) (\bar{R}_{1,s} - \mu_1) + 2s\lambda^2 (\lambda^2 + \lambda + 1 - 3\mu_1(1 - \mu_1)),

and

\mathfrak{Q} = 3s(1 + 2\lambda)^2 C_2^2 \Delta_k^2.

Obviously, $\mathfrak{R}_1$ and $\mathfrak{Q}$ are nonnegative. Furthermore, for $\mathfrak{R}_2$, we can show that it exceeds twice its deterministic part only with a small probability. Indeed, noting that $0 \leq 3\mu_1(1 - \mu_1) \leq \frac{3}{4}$, we have

\begin{aligned} \mathbb{P}\big(\mathfrak{R}_2 > 4s\lambda^2(\lambda^2 + \lambda + 1 - 3\mu_1(1 - \mu_1))\big) &= \mathbb{P} (3(2\mu_1 - 1) (\bar{R}_{1,s} - \mu_1) > \lambda^2 + \lambda + 1 - 3\mu_1(1 - \mu_1)) \\ &\leq \mathbb{P} \left( \left| 3(2\mu_1 - 1) (\bar{R}_{1,s} - \mu_1) \right| > \left| \lambda^2 + \lambda + 1 - 3\mu_1(1 - \mu_1) \right| \right) \mathbb{I} \left( \mu_1 \neq \frac{1}{2} \right) + \mathbb{P}(\emptyset) \times \mathbb{I} \left( \mu_1 = \frac{1}{2} \right) \\ &\stackrel{0 \leq 3\mu_1(1-\mu_1) \leq \frac{3}{4}}{\leq} \mathbb{P} \left( \left| \bar{R}_{1,s} - \mu_1 \right| > \frac{\lambda^2 + \lambda + 1/4}{|3(2\mu_1 - 1)|} \right) \mathbb{I} \left( \mu_1 \neq \frac{1}{2} \right) + 0 \times \mathbb{I} \left( \mu_1 = \frac{1}{2} \right) \\ &\stackrel{\text{by sub-Gaussian inequality}}{\leq} \exp \left\{ -\frac{(\lambda^2 + \lambda + 1/4)^2 s}{18\sigma_\omega^2(2\mu_1 - 1)^2} \right\} \times \mathbb{I} \left( \mu_1 \neq \frac{1}{2} \right) + 0 \times \mathbb{I} \left( \mu_1 = \frac{1}{2} \right). \end{aligned} \tag{22}

Then by applying Lemma E.1 again,

\begin{aligned} \mathbb{P}(G_{1,s}^c \mid \mathcal{H}_{1,s}) &\leq 2I + II \\ &\leq 2\mathbb{P} \left( \frac{f_1 + f_2 - f_3 - \mathbb{E}(f_1 + f_2 - f_3 \mid \mathcal{H}_{1,s})}{\sqrt{\text{var} (f_1 + f_2 - f_3 \mid \mathcal{H}_{1,s})}} > -\frac{\mathbb{E}(f_1 + f_2 - f_3 \mid \mathcal{H}_{1,s})}{\sqrt{\text{var} (f_1 + f_2 - f_3 \mid \mathcal{H}_{1,s})}} \right) \\ &\quad + \frac{1}{2} \exp \left\{ -\frac{s(1 + 2\lambda)^2}{2\sigma_\omega^2(1 + 2\lambda^2)} \right\} \\ &\leq 2 \exp \left\{ -\frac{\mathbb{E}^2(f_1 + f_2 - f_3 \mid \mathcal{H}_{1,s})}{2 \text{var} (f_1 + f_2 - f_3 \mid \mathcal{H}_{1,s})} \right\} + \frac{1}{2} \exp \left\{ -\frac{s(1 + 2\lambda)^2}{2\sigma_\omega^2(1 + 2\lambda^2)} \right\}. \end{aligned}
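As an aside, the Gaussian reduction used throughout these proofs (conditional on the data, a weighted sum $\sum_i x_i\omega_i$ with i.i.d. $\omega_i \sim N(1, \sigma_\omega^2)$ is Gaussian, so its tail probability has the closed form used in the displays above) can be sanity-checked by simulation. A minimal sketch, with arbitrary illustrative data, not the paper's code:

```python
import math
import random

random.seed(0)
sigma_w = 1.5
x = [random.uniform(-1.0, 2.0) for _ in range(20)]  # fixed "data" x_1, ..., x_s
S_x = sum(x)
T_x = sum(v * v for v in x)

# Closed form: sum_i x_i*omega_i = S_x + sigma_w*sqrt(T_x)*Z, hence
# P(sum_i x_i*omega_i > 0) = P(Z > -S_x/(sigma_w*sqrt(T_x))).
z = -S_x / (sigma_w * math.sqrt(T_x))
closed_form = 0.5 * math.erfc(z / math.sqrt(2))

# Monte Carlo over the multiplier weights agrees with the closed form
n = 100_000
hits = sum(
    1 for _ in range(n)
    if sum(v * random.gauss(1.0, sigma_w) for v in x) > 0
)
assert abs(hits / n - closed_form) < 0.02
```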
