huggingchat/papers-content / 2302 /2302.01544.md

|

96.5 kB

Optimality of Thompson Sampling with Noninformative Priors for Pareto Bandits

Jongyeong Lee^1,2 Junya Honda^3,2 Chao-Kai Chiang¹ Masashi Sugiyama^2,1

¹ The University of Tokyo ² RIKEN AIP ³ Kyoto University

Abstract

In the stochastic multi-armed bandit problem, a randomized probability matching policy called Thompson sampling (TS) has shown excellent performance in various reward models. In addition to the empirical performance, TS has been shown to achieve asymptotic problem-dependent lower bounds in several models. However, its optimality has been mainly addressed under light-tailed or one-parameter models that belong to exponential families. In this paper, we consider the optimality of TS for the Pareto model that has a heavy tail and is parameterized by two unknown parameters. Specifically, we discuss the optimality of TS with probability matching priors that include the Jeffreys prior and the reference priors. We first prove that TS with certain probability matching priors can achieve the optimal regret bound. Then, we show the suboptimality of TS with other priors, including the Jeffreys and the reference priors. Nevertheless, we find that TS with the Jeffreys and reference priors can achieve the asymptotic lower bound if one uses a truncation procedure. These results suggest carefully choosing noninformative priors to avoid suboptimality and show the effectiveness of truncation procedures in TS-based policies.

1 Introduction

In the multi-armed bandit (MAB) problem, an agent plays an arm and observes a reward only from the played arm, which is partial feedback [Robbins, 1952, Thompson, 1933]. The rewards are further assumed to be generated from the distribution of the corresponding arm in the stochastic MAB problem [Bubeck et al., 2012]. Since only partial observations are available, the agent has to estimate unknown distributions to guess which arm is optimal while avoiding playing suboptimal arms that induce loss of resources. Thus, the agent has to cope with the dilemma of exploration and exploitation.

In this problem, Thompson sampling (TS), a randomized Bayesian policy that plays an arm according to the posterior probability of being optimal, has been widely adopted because of its outstanding empirical performance [Chapelle and Li, 2011, Russo et al., 2018]. Following its empirical success, theoretical analysis of TS has been conducted for several reward models such as Bernoulli models [Agrawal and Goyal, 2012, Kaufmann et al., 2012], one-dimensional exponential families [Korda et al., 2013], Gaussian models [Honda and Takemura, 2014], and bounded support models [Baudry et al., 2021, Riou and Honda, 2020] where asymptotic optimality of TS was established. Here, an algorithm is said to be asymptotically optimal if it can achieve the theoretical problem-dependent lower bound derived by Lai et al. [1985] for one-parameter models and Burnetas and Katehakis [1996] for multiparameter or nonparametric models. Note that the performance of any reasonable algorithms cannot be better than these lower bounds.

Apart from the problem-dependent regret analysis, several works studied the problem-independent or prior-independent bounds of TS [Agrawal and Goyal, 2017, Bubeck and Liu, 2013, Russo and Van Roy,2016]. In this paper, we study how the choice of noninformative priors affects the performance of TS for any given problem instance. In other words, we focus on the asymptotic optimality of TS depending on the choice of noninformative priors.

The asymptotic optimality of TS has been mainly considered in the one-parameter model, while its optimality under the multiparameter model has not been well studied. To the best of our knowledge, the asymptotic optimality of TS in the noncompact multiparameter model is only known for the Gaussian bandits [Honda and Takemura, 2014] where both the mean and variance are unknown. They showed that TS with the uniform prior is optimal while TS with the Jeffreys prior and reference prior cannot achieve the lower bound. The success of the uniform prior is due to its frequent playing of seemingly suboptimal arms. Its conservativeness comes from a moderate overestimation of the posterior probability that current suboptimal arms might be optimal.

In this paper, we consider the two-parameter Pareto models where the tail function is heavy-tailed. We first derive the closed form of the problem-dependent constant that appears in the theoretical lower bound in Pareto models, which is not trivial, unlike those for exponential families. Based on this result, we show that TS with some probability-matching priors achieves the optimal bound, which is the first result for two-parameter Pareto bandit models, to our knowledge.

We further show that TS with different choices of probability matching priors, called optimistic priors, suffers a polynomial regret in expectation. Therefore, being conservative would be better when one chooses noninformative priors to avoid suboptimality in view of expectation. Nevertheless, we show that TS with the Jeffreys prior or the reference prior can achieve the optimal regret bound if we add a truncation procedure on the shape parameter. Our contributions are summarized as follows:

• We prove the asymptotic optimality/suboptimality of TS under different choices of priors, which shows the importance of the choice of noninformative priors in cases of two-parameter Pareto models.
• We provide another option to achieve optimality: adding a truncation procedure to the parameter space of the posterior distribution instead of finding an optimal prior.

This paper is organized as follows. In Section 2, we formulate the stochastic MAB problems under the Pareto distribution and derive its regret lower bound. Based on the choice of noninformative priors and their corresponding posteriors, we formulate TS for the Pareto models and propose another TS-based algorithm to solve the suboptimality problem of the Jeffreys prior and the reference prior in Section 3. In Section 4, we provide the main results on the optimality of TS and TS with a truncation procedure, whose proof outline is given in Section 6. Numerical results that support our theoretical analysis are provided in Section 5.

2 Preliminaries

In this section, we formulate the stochastic MAB problem. We derive the exact form of the problem-dependent constant that appears in the lower bound of the expected regret in Pareto bandits.

2.1 Notations

We consider the stochastic $K$ -armed bandit problem where the rewards are generated from Pareto distributions with fixed parameters. An agent chooses an arm $a$ in $[K] := {1, \dots, K}$ at each round $t \in \mathbb{N}$ and observes an independent and identically distributed reward from $\text{Pa}(\kappa_a, \alpha_a)$ , where $\text{Pa}(\kappa, \alpha)$ denotes the Paretodistribution parameterized by scale $\kappa > 0$ and shape $\alpha > 0$ . This has the density function of form

$f_{\kappa, \alpha}^{\text{Pa}}(x) = \frac{\alpha \kappa^\alpha}{x^{\alpha+1}} \mathbb{1}[x \geq \kappa], \quad (1)$

where $\mathbb{1}[\cdot]$ denotes the indicator function. We consider a bandit model where parameters $\theta_a = (\kappa_a, \alpha_a) \in \mathbb{R}+ \times (1, \infty)$ are unknown to the agent. We denote the mean of a random variable following $\text{Pa}(\theta_a)$ by $\mu_a = \mu(\theta_a) := \frac{\kappa_a \alpha_a}{\alpha_a - 1}$ . Note that $\alpha > 1$ is a necessary condition of an arm to have a finite mean, which is required to define the sub-optimality gap $\Delta_a := \max{i \in [K]} \mu_i - \mu_a$ . We assume without loss of generality that arm 1 has the maximum mean for simplicity, i.e., $\mu_1 = \max_{i \in [K]} \mu_i$ . Let $j(t)$ be the arm played at round $t \in \mathbb{N}$ and $N_a(t) = \sum_{s=1}^{t-1} \mathbb{1}[j(s) = a]$ denote the number of rounds the arm $a$ is played until round $t$ . Then, the regret at round $T$ is given as

$\text{Reg}(T) = \sum_{t=1}^T \Delta_{j(t)} = \sum_{a=2}^K \Delta_a N_a(T+1).$

Let $r_{a,n}$ be the $n$ -th reward generated from the arm $a$ . In the Pareto distribution, the maximum likelihood estimators (MLEs) of $\kappa, \alpha$ for arm $a$ given $n$ rewards and their distributions are given as follows [Malik, 1970]:

$\hat{\kappa}_a(n) = \min_{s \in [n]} r_{a,s} \sim \text{Pa}(\kappa_a, n\alpha_a), \quad \hat{\alpha}_a(n) = \frac{n}{\sum_{s=1}^n \log(r_{a,s}) - n \log \hat{\kappa}_a(n)} \sim \text{IG}(n-1, n\alpha_a), \quad (2)$

where $\text{IG}(n, \alpha)$ denotes the inverse-gamma distribution with shape $n > 0$ and scale $\alpha > 0$ . Note that Malik [1970] further showed the stochastic independence of $\hat{\alpha}(n)$ and $\hat{\kappa}(n)$ .

2.2 Asymptotic lower bound

Burnetas and Katehakis [1996] provided a problem-dependent lower bound of the expected regret such that any uniformly fast convergent policy, which is a policy satisfying $\text{Reg}(T) = o(t^\alpha)$ for all $\alpha \in (0, 1)$ , must satisfy

$\liminf_{T \rightarrow \infty} \frac{\mathbb{E}[\text{Reg}(T)]}{\log T} \geq \sum_{a=2}^K \frac{\Delta_a}{\inf_{\theta: \mu(\theta) > \mu_1} \text{KL}(\text{Pa}(\kappa_a, \alpha_a), \text{Pa}(\theta))}, \quad (3)$

where $\text{KL}(\cdot, \cdot)$ denotes the Kullback-Leibler (KL) divergence. Notice that the bandit model $(\theta_a)_{a \in [K]}$ is considered as a fixed constant in the problem-dependent analysis.

The KL divergence between Pareto distributions is given as

$\text{KL}(\text{Pa}(\kappa_1, \alpha_1), \text{Pa}(\kappa_2, \alpha_2)) = \begin{cases} \log\left(\frac{\alpha_1}{\alpha_2}\right) + \alpha_2 \log\left(\frac{\kappa_1}{\kappa_2}\right) + \frac{\alpha_2}{\alpha_1} - 1 & \text{if } \kappa_2 \leq \kappa_1, \\ \infty & \text{otherwise.} \end{cases}$

Here the divergence sometimes becomes infinity since the scale parameter $\kappa$ determines the support of the Pareto distribution. We denote the numerator in (3) for $a \neq 1$ by

$\begin{aligned} \text{KL}_{\inf}(a) &:= \inf_{\theta: \mu(\theta) > \mu_1} \text{KL}(\text{Pa}(\kappa_a, \alpha_a), \text{Pa}(\theta)) \\ &= \inf_{\theta \in \Theta_a} \log \frac{\alpha_a}{\alpha} + \alpha \log \frac{\kappa_a}{\kappa} + \frac{\alpha}{\alpha_a} - 1, \end{aligned}$ where

$\Theta_a = \{(\kappa, \alpha) \in (0, \kappa_a] \times (0, \infty) : \mu(\kappa, \alpha) > \mu_1\}. \quad (4)$

Notice that $\Theta_a$ allows parameters whose expected rewards are infinite ( $\alpha \in (0, 1]$ ), although we consider a bandit model with $\alpha_a > 1$ for all $a \in [K]$ so that the sub-optimality gap $\Delta_a$ becomes finite. This implies that $\text{KL}_{\text{inf}}(a)$ does not depend on whether the agent considers the possibility that an arm has the infinite expected reward or not. Then, we can simply rewrite the lower bound in (3) as

$\liminf_{T \rightarrow \infty} \frac{\mathbb{E}[\text{Reg}(T)]}{\log T} \geq \sum_{a=2}^K \frac{\Delta_a}{\text{KL}_{\text{inf}}(a)}.$

The following lemma shows the closed form of this infimum, whose proof is given in Appendix B.

Lemma 1. For any arm $a \neq 1$ , it holds that

$\text{KL}_{\text{inf}}(a) = \log \left( \alpha_a \frac{\mu_1 - \kappa_a}{\mu_1} \right) + \frac{1}{\alpha_a} \frac{\mu_1}{\mu_1 - \kappa_a} - 1.$

2.3 Relation with bounded moment models

In MAB literature, several algorithms based on the upper confidence bound (UCB) were proposed to tackle heavy-tailed models with infinite variance under additional assumptions on moments [Bubeck et al., 2013]. One major assumption is that the moment of any arm $a$ satisfies $\mathbb{E}[|r_{a,n}|^\gamma] \leq v$ for some fixed $\gamma \in [1, 2)$ and known $v < \infty$ [Bubeck et al., 2013]. Note that the $\gamma$ -th raw moment of the density function of $X$ following $\text{Pa}(\kappa, \alpha)$ is given as

$\mathbb{E}[X^\gamma] = \begin{cases} \infty & \alpha \leq \gamma, \\ \frac{\alpha \kappa^\gamma}{\alpha - \gamma} & \alpha > \gamma, \end{cases}$

which implies that the Pareto models and the bounded moment models are not a subset of each other.

Recently, Agrawal et al. [2021] proposed an asymptotically optimal KL-UCB based algorithm that requires solving the optimization problem at every round. Since the bounded moment model only covers certain Pareto distributions in general, the known optimality result of KL-UCB does not necessarily imply the optimality in the sense of (3).

3 Thompson sampling and probability matching priors

TS is a policy from the Bayesian viewpoint, where the choices of priors are important. Although one can utilize prior knowledge on parameters when choosing the prior, such information would not always be available in practice. To deal with such scenarios, we consider noninformative priors based on the Fisher information (FI) matrix, which does not assume any information on unknown parameters.

For a random variable $X$ with density $f(\cdot | \boldsymbol{\theta})$ , FI is defined as the variance of the score, a partial derivative of $\log f$ with respect to $\boldsymbol{\theta}$ , which is given as follows [Cover and Thomas, 2006]:

$I_{ij} = [I(\boldsymbol{\theta})]_{ij} := \mathbb{E}_X \left[ \left( \frac{\partial}{\partial \theta_i} \log f(X | \boldsymbol{\theta}) \right) \left( \frac{\partial}{\partial \theta_j} \log f(X | \boldsymbol{\theta}) \right) \middle| \boldsymbol{\theta} \right]. \quad (5)$

It is known that the FI matrix in (5) coincides with the negative expected value of the Hessian matrix of $\log f(X | \boldsymbol{\theta})$ if the model satisfies the FI regular condition [Schervish, 2012]. However, $\text{Pa}(\kappa, \alpha)$ does notsatisfy this condition since it is a parametric-support family. Therefore, for $X$ with density function in (1), one can obtain the FI matrix of $\text{Pa}(\kappa, \alpha)$ based on (5) as follows [Li et al., 2022]:

$I(\kappa, \alpha) = \begin{bmatrix} \frac{\alpha^2}{\kappa^2} & 0 \\ 0 & \frac{1}{\alpha^2} \end{bmatrix} = \begin{bmatrix} I_{11}(\kappa)I_{11}(\alpha) & 0 \\ 0 & I_{22}(\alpha) \end{bmatrix}, \quad (6)$

where $I_{11}(\kappa) = \frac{1}{\kappa^2}$ , $I_{11}(\alpha) = \alpha^2$ , and $I_{22}(\alpha) = \frac{1}{\alpha^2}$ . Note that $I_{11}$ differs from $-\mathbb{E} \left[ \frac{\partial^2}{\partial \kappa^2} \log f_{\kappa, \alpha}^{\text{Pa}}(X; \theta) \mid \theta \right] = \frac{\alpha}{\kappa^2}$ .

Based on (6), the Jeffreys prior and the reference prior are given as $\pi_J(\kappa, \alpha) \propto \sqrt{\det(I)} = \frac{1}{\kappa}$ and $\pi_R(\kappa, \alpha) \propto \sqrt{I_{11}(\kappa)I_{22}(\alpha)} = \frac{1}{\kappa\alpha}$ , respectively. Here, the reverse reference prior is the same as the reference prior from the orthogonality of parameters [Datta, 1996, Datta and Ghosh, 1995].

From the orthogonality of parameters, the probability matching prior when $\kappa$ is of interest and $\alpha$ is the nuisance parameter is given as

$\pi_P(\kappa, \alpha) \propto \sqrt{I_{11}}g_1(\alpha) = \frac{\alpha}{\kappa}g_1(\alpha)$

for arbitrary $g_1(\alpha) > 0$ [Tibshirani, 1989]. In this paper, we consider the prior $\pi(\kappa, \alpha) \propto \frac{\alpha^{-k}}{\kappa}$ for $k \in \mathbb{Z}$ since the cases $k = 0, 1$ correspond to the Jeffreys prior and the (reverse) reference prior, respectively.

Remark 1. The Pareto distribution discussed in this paper is sometimes called the Pareto type 1 distribution [Arnold, 2008]. On the other hand, Kim et al. [2009] derived several noninformative priors for a special case of the Pareto type 2 distribution called the Lomax distribution [Lomax, 1954], where the FI matrix can be written using the negative Hessian.

In the multiparameter cases, the Jeffreys prior is known to suffer from many problems [Datta and Ghosh, 1996, Ghosh, 2011]. For example, it is known that the Jeffreys prior leads to inconsistent estimators for the error variance in the Neyman-Scott problem [see Berger and Bernardo, 1992, Example 3]. This might be a possible reason why TS with Jeffreys prior suffers a polynomial expected regret in a multiparameter distribution setting. More details on the probability matching prior and the Jeffreys prior can be found in Appendix E.

3.1 Sampling procedure

Let $\mathcal{F}t := (j(s), r{j(s), N_{j(s)}(s)})_{s=1}^{t-1}$ be the history until round $t$ . Under the prior $\frac{\alpha^{-k}}{\kappa}$ with $k \in \mathbb{Z}$ , the marginalized posterior distribution of the shape parameter of arm $a$ is given as

$\alpha_a \mid \mathcal{F}_t \sim \text{Erlang} \left( N_a(t) - k, \frac{N_a(t)}{\hat{\kappa}_a(N_a(t))} \right), \quad (7)$

where $\text{Erlang}(s, \beta)$ denotes the Erlang distribution with shape $s$ and rate $\beta$ . Note that we require $\bar{n} \geq \max{2, k + 1}$ initial plays to avoid improper posteriors and MLE of $\alpha$ . When the shape parameter $\alpha_a$ is given as $\beta$ , the cumulative distribution function (CDF) of the conditional posterior of $\kappa_a$ is given as

$\mathbb{P}[\kappa_a \leq x \mid \mathcal{F}_t, \alpha_a = \beta] = \left( \frac{x}{\hat{\kappa}_a(N_a(t))} \right)^{\beta N_a(t)}, \quad (8)$

if $0 < x \leq \hat{\kappa}_a(N_a(t))$ . Since one can derive the posteriors following the same steps as Sun et al. [2020], the detailed derivation is postponed to Appendix E.3. At round $t$ , we denote the sampled scale and shape---

Algorithm 1 STS / STS-T


1: Parameter:  $k \in \mathbb{Z}$ ,  $\bar{n} = \max\{2, k + 1\}$ .
2: Initialization: Select each arm  $\bar{n}$  times.
3: Loop:
4:   Sample  $\tilde{\alpha}_a(t) \sim \text{Erlang}\left(N_a(t) - k, \frac{N_a(t)}{\tilde{\alpha}_a(N_a(t))}\right)$ .
5:    $\bar{\alpha}_a(N_a(t)) \leftarrow \min(N_a(t), \hat{\alpha}_a(N_a(t)))$ .
6:   Sample  $\tilde{\alpha}_a(t) \sim \text{Erlang}\left(N_a(t) - k, \frac{N_a(t)}{\bar{\alpha}_a(N_a(t))}\right)$ .
7:   if  $\{a \in [K] : \tilde{\alpha}_a(t) \leq 1\} \neq \emptyset$  then
8:     Select  $j(t) = \arg \min_{a \in [K]} \tilde{\alpha}_a(t)$ .
9:   else
10:    Sample  $u_a \sim U(0, 1)$  for every  $a \in [K]$ .
11:     $\tilde{\kappa}_a(t) = \hat{\kappa}_a(N_a(t)) u_a^{1/(N_a(t)\tilde{\alpha}_a(t))}$ .
12:    Play  $j(t) = \arg \max_{a \in [K]} \frac{\tilde{\kappa}_a(t)\tilde{\alpha}_a(t)}{\tilde{\alpha}_a(t)-1}$ 
13:            $= \arg \max_{a \in [K]} \tilde{\mu}_a(t)$ .
14:   end if

parameters of arm $a$ by $\tilde{\kappa}_a(t)$ and $\tilde{\alpha}_a(t)$ , respectively, and the corresponding mean reward by $\tilde{\mu}_a(t) := \mu(\tilde{\kappa}_a(t), \tilde{\alpha}_a(t))$ . We first sample the shape parameter from the marginalized posterior in (7). Then, we sample the scale parameter given the sampled shape parameter from the CDF of the conditional posterior in (8) by using inverse transform sampling. TS based on this sequential procedure, which we call Sequential Thompson Sampling (STS), can be formulated as Algorithm 1.

In Theorem 3 given in the next section, STS with the Jeffreys prior and the reference prior turns out to be suboptimal in view of the lower bound in (3). Their suboptimality is mainly due to the behavior of the posterior in (7) when $\hat{\alpha}_1(n)$ is overestimated for small $N_1(t) = n$ . To overcome such issues, we propose the STS-T policy, a variant of STS with truncation, where we replace $\hat{\alpha}(n)$ with $\bar{\alpha}(n) := \min(n, \hat{\alpha}(n))$ in (7). Note that such a truncation procedure is especially considered in the posterior sampling by (7) and (8). We show that STS-T with the Jeffreys prior and the reference prior can achieve the optimal regret bound in Theorem 4.

3.2 Interpretation of the prior parameter $k$

The Erlang distribution is a special case of the Gamma distribution, where the shape parameter is a positive integer. If a random variable $X$ follows $\text{Erlang}(s, \beta)$ , then it has the density of form

$f_{s,\beta}^{\text{Er}}(x) = \frac{\beta^s}{\Gamma(s)} x^{s-1} e^{-\beta x} \mathbb{1}[x \in \mathbb{R}_+], \quad (9)$

where $s \in \mathbb{N}$ and $\beta > 0$ denote the shape and rate parameter, respectively. Then, the CDF evaluated at $x > 0$ is given as

$F_{s,\beta}^{\text{Er}}(x) = \frac{\int_0^{\beta x} t^{s-1} e^{-t} dt}{\Gamma(s)} = \frac{\gamma(s, \beta x)}{\Gamma(s)}, \quad (10)$

where $\gamma(\cdot, \cdot)$ denotes the lower incomplete gamma function. Since $\gamma(s+1, x) = s\gamma(s, x) - x^s e^{-x}$ holds, one can observe that for any $x > 0$

$F_{s,\beta}^{\text{Er}}(x) \geq F_{s+1,\beta}^{\text{Er}}(x). \quad (11)$ From the sampling procedure of STS and STS-T, $\tilde{\mu}$ depends on $\tilde{\kappa}$ only when $\tilde{\alpha} > 1$ holds since $\tilde{\alpha} \leq 1$ results in $\tilde{\mu}(\cdot, \tilde{\alpha}) = \infty$ . Therefore, for any $\beta > 1$ in (8), $\tilde{\kappa}$ will concentrate on $\hat{\kappa}$ for sufficiently large $N_a(t) = n$ . Thus, $\tilde{\mu}$ will be mainly determined by $\tilde{\alpha}$ and $\hat{\kappa}$ , where the choice of $k$ affects the sampling of $\tilde{\alpha}$ by (7). From (11), one could see that the probability of sampling small $\tilde{\alpha}$ increases as shape $n - k$ decreases. Therefore, $\tilde{\mu}$ of suboptimal arms would increase as $k$ increases for the same $n$ . In other words, the probability of sampling large $\tilde{\mu}$ becomes large as $k$ increases. Therefore, TS with large $k$ becomes a conservative policy that could frequently play currently suboptimal arms. In contrast, priors with small $k$ yield an optimistic policy that focuses on playing the current best arm.

4 Main results

In this section, we provide regret bounds of STS and STS-T with different choices of $k \in \mathbb{Z}$ . At first, we show the asymptotic optimality of STS for priors $\pi(\kappa, \alpha) \propto \frac{\alpha^{-k}}{\kappa}$ with $k \in \mathbb{Z}_{\geq 2}$ .

Theorem 2. Assume that arm 1 is the unique optimal arm with a finite mean. For every $a \in [K]$ , let $\varepsilon_a = \min \left{ \frac{\kappa_a}{\alpha_a(\kappa_a+1)} \frac{\kappa_a \delta_a}{\mu_a(\mu_a + \delta_a - \kappa_a) + \kappa_a \delta_a}, \frac{\kappa_a \delta_a}{\mu_a(1 + \mu_a + \delta_a)} \right}$ where $\delta_a = \frac{\Delta_a}{2}$ for $a \neq 1$ and $\delta_1 = \min_{a \neq 1} \delta_a$ . Given arbitrary $\epsilon \in (0, \min_{a \in [K]} \varepsilon_a)$ , the expected regret of STS with $k \in \mathbb{Z}_{\geq 2}$ is bounded as

$\mathbb{E}[\text{Reg}(T)] \leq \sum_{a=2}^K \frac{\Delta_a \log T}{D_{a,k}(\epsilon)} + \mathcal{O}(\epsilon^{-2}),$

where $D_{a,k}(\epsilon)$ is a function such that $\lim_{\epsilon \rightarrow 0} D_{a,k}(\epsilon) = \text{KL}_{\text{inf}}(a)$ for any fixed $k \in \mathbb{Z}$ .

By letting $\epsilon = o(1)$ in Theorem 2, we see that STS with $k \in \mathbb{Z}_{\geq 2}$ satisfies

$\liminf_{T \rightarrow \infty} \frac{\mathbb{E}[\text{Reg}(T)]}{\log T} \leq \sum_{a=2}^K \frac{\Delta_a}{\text{KL}_{\text{inf}}(a)},$

which shows the asymptotic optimality of STS in terms of the lower bound in (3).

Next, we show that STS with $k \in \mathbb{Z}_{\leq 1}$ cannot achieve the asymptotic bound in the theorem below. Following the proofs for Gaussian bandits [Honda and Takemura, 2014], we consider two-armed bandit problems where the full information on the suboptimal arm is given to simplify the analysis. We further assume that two arms have the same scale parameter $\kappa_1 = \kappa_2$ .

Theorem 3. Consider a two-armed bandit problem where $\kappa_1 = \kappa_2$ and $1 < \alpha_1 < \alpha_2$ . When $\tilde{\alpha}_1(t)$ and $\tilde{\kappa}_1(t)$ are sampled from the posteriors in (7) and (8), respectively and $\tilde{\mu}_2(t) = \mu_2$ holds, there exists a constant $C(\alpha_1, \alpha_2) > 0$ satisfying

$\liminf_{T \rightarrow \infty} \frac{\mathbb{E}[\text{Reg}(T)]}{\log T} \geq C(\alpha_1, \alpha_2),$

where $C(\alpha_1, \alpha_2) > \text{KL}{\text{inf}}(2)$ holds for some instances. In particular, for $k \in \mathbb{Z}{\leq 0}$ , there exists $C'(\alpha_1, \alpha_2)$ satisfying

$\liminf_{T \rightarrow \infty} \frac{\mathbb{E}[\text{Reg}(T)]}{\sqrt{T}} \geq C'(\alpha_1, \alpha_2).$

From Theorems 2 and 3, we find that the prior should be conservative to some extent when one considers maximizing rewards in expectation.

Although STS with the Jeffreys prior ( $k = 0$ ) and reference prior ( $k = 1$ ) were shown to be suboptimal, we show that a modified algorithm, STS-T, can achieve the optimal regret bound with $k \in \mathbb{Z}{\geq 0}$ .Theorem 4. *With the same notation as Theorem 2, the expected regret of STS-T with $k \in \mathbb{Z}{\geq 0}$ is bounded as*

$\mathbb{E}[\text{Reg}(T)] \leq \sum_{a=2}^K \frac{\Delta_a \log T}{D_{a,k}(\epsilon)} + \mathcal{O}(\epsilon^{-m}),$

where $m = \max(2, 3 - k)$ .

From Theorems 2 and 4, we have two choices to achieve the lower bound in (3): use either the conservative priors with MLEs or moderately optimistic priors with truncated samples. Since initialization steps require playing every arm $\max(2, k + 1)$ times, if the number of arms $K$ is large, the Jeffreys priors or the reference prior with the truncated estimator would be a better choice. On the other hand, if the model can contain arms with large $\alpha$ , where the truncation might be problematic for small $n$ , it would be better to use STS with conservative priors.

5 Experiments

In this section, we present numerical results to demonstrate the performance of STS and STS-T, which supports our theoretical analysis. We consider the 4-armed bandit model $\theta_4$ with parameters given in Table 1 as an example where suboptimal arms have smaller, equal, and larger $\kappa$ compared with the optimal arm. $\theta_4$ has $\mu = (4.55, 3.2, 2.74, 3)$ and infinite variance. Further experimental results can be found in Appendix H.

Table 1: 4-armed bandit model $\theta_4$ .

	ARM 1	ARM 2	ARM 3	ARM 4
$\kappa$	1.3	1.2	1.3	1.5
$\alpha$	1.4	1.6	1.9	2.0

Figure 1 shows the cumulative regret for the proposed policies with various choices of parameters $k$ on the prior. The solid lines denote the averaged cumulative regret over 100,000 independent runs of priors that can achieve the optimal lower bound in (3), whereas the dashed lines denote that of priors that cannot. The green dotted line denotes the problem-dependent lower bound and shaded regions denote a quarter standard deviation.

In Figures 2 and 3, we investigate the difference between STS and STS-T with the same $k$ . The solid lines denote the averaged cumulative regret over 100,000 independent runs. The shaded regions and dashed lines show the central 99% interval and the upper 0.05% of regret.

The Jeffreys prior ( $k = 0$ ) In Figure 1(a), the Jeffreys prior seems to have a larger order of the regret compared with priors $k = 1, 3$ , which performed the best in this setting. As Theorem 4 states, its performance improves under STS-T, which shows a similar performance to that of $k = 1, 3$ . Figure 2(a) illustrates the possible reason for the improvements, where the central 99% interval of the regret noticeably shrank under STS-T. Since the suboptimality of STS with the Jeffreys prior ( $k = 0$ ) is due to an extreme case that induces a polynomial regret with small probability, this kind of shrink contributes to decreasing the expected regret of STS-T with the Jeffreys prior.(a) Cumulative regret of STS with various $k$

(b) Cumulative regret of STS-T with various $k$

Figure 1: The solid lines denote the averaged cumulative regret over 100,000 independent runs of priors that can achieve the optimal lower bound in (3). The dashed lines denote that of priors that cannot achieve the optimal lower bound in (3). The shaded regions show a quarter standard deviation. The green dotted line denotes the problem-dependent lower bound based on Lemma 1.

(a) The Jeffreys prior $k = 0$

(b) The reference prior $k = 1$

(c) Prior with $k = 3$

Figure 2: The solid lines denote an averaged regret over independent 100,000 runs. The shaded regions and dashed lines show the central 99% interval and the upper 0.05% of the regret, respectively.

The reference prior ( $k = 1$ ) The reference prior showed a similar performance to the asymptotically optimal prior $k = 3$ , although it was shown to be suboptimal for some instances under STS in Theorem 3. Similarly to the Jeffreys prior ( $k = 0$ ), the reference prior ( $k = 1$ ) under STS-T has a smaller central 99% interval of the regret than that under STS as shown in Figure 2(b), although its decrement is comparably smaller than that of the Jeffreys prior. This would imply that the reference prior is more conservative than the Jeffreys prior.Figure 3: The solid lines denote an averaged regret over independent 100,000 runs. The shaded regions and dashed lines show the central 99% interval and the upper 0.05% of the regret, respectively.

The conservative prior ( $k = 3$ ) Interestingly, Figure 2(c) showed that a truncated procedure does not affect the central 99% interval of the regret and even degrade the performance in upper 0.05%. Notice that the upper 0.05% of the regret of $k = 3$ is much lower than that of $k = 0, 1$ , which shows the stability of the conservative prior in Figure 2.

Since a truncation procedure was adopted to prevent an extreme case that was a problem for $k \in \mathbb{Z}_{\leq 1}$ , it is natural to see that there is no difference between STS and STS-T with $k = 3$ . This would imply that $k = 3$ is sufficiently conservative, and so the truncated procedure does not affect the overall performance.

Optimistic priors ( $k < 0$ ) In Figure 1(a), one can see that the averaged regret of $k = -1$ and $k = -3$ increases much faster than that of $k = 0, 1, 3$ under the STS policy, which illustrates the suboptimality of STS with priors $k \in \mathbb{Z}_{< 0}$ .

As the optimistic priors ( $k < 0$ ) showed better performance under STS-T in Figure 1, we can check the effectiveness of a truncation procedure in the posterior sampling with optimistic priors. However, as Figures 3(a) and 3(b) illustrate, there is no big difference in the central 99% interval of the regret between STS and STS-T with $k = -1, -3$ , which might imply that a prior with $k \in \mathbb{Z}_{< 0}$ is too optimistic. Therefore, we might need to use a more conservative truncation procedure such as the one using $\bar{\alpha}_a(n) = \max(\sqrt{n}, \hat{\alpha}_a(n))$ or $\max(\log n, \hat{\alpha}_a(n))$ , which would induce a larger regret in the finite time horizon.

6 Proof outline of optimal results

In this section, we provide the proof outline of Theorem 2 and Theorem 4, whose detailed proof is given in Appendix C. Note that the proof of Theorem 3 is postponed to Appendix D.Let us first consider good events on MLEs defined by

$\begin{aligned}\mathcal{K}_{a,n}(\epsilon) &:= \{\hat{\kappa}_a(n) \in [\kappa_a, \kappa_a + \epsilon]\} \\ \mathcal{A}_{a,n}(\epsilon) &:= \{\hat{\alpha}_a(n) \in [\alpha_a - \epsilon_{a,l}(\epsilon), \alpha_a + \epsilon_{a,u}(\epsilon)]\} \\ \mathcal{E}_{a,n}(\epsilon) &:= \mathcal{K}_{a,n}(\epsilon) \cap \mathcal{A}_{a,n}(\epsilon),\end{aligned}$

where $n \in \mathbb{N}$ and

$\epsilon_{a,l}(\epsilon) = \frac{\epsilon \alpha_a^2}{1 + \epsilon \alpha_a}, \quad \epsilon_{a,u}(\epsilon) = \frac{\epsilon \alpha_a^2 (\kappa_a + 1)}{\kappa_a - \epsilon \alpha_a (\kappa_a + 1)}. \quad (12)$

Note that $\bar{\alpha}_a(n) = \hat{\alpha}a(n)$ holds on $\mathcal{A}{a,n}(\epsilon)$ for any $n \geq \alpha_a + 1$ . Here, we set $\varepsilon_a$ to satisfy $\hat{\mu}_a \in [\mu_a - \delta_a, \mu_a + \delta_a]$ on $\mathcal{E}_a(\epsilon)$ for any $\epsilon \leq \varepsilon_a$ . Define an event on the sample of the optimal arm

$\mathcal{M}_\epsilon(t) := \{\tilde{\mu}_1(t) \geq \mu_1 - \epsilon\}.$

Then, the expected regret at round $T$ can be decomposed as follows:

$\begin{aligned}\mathbb{E}[\text{Reg}(T)] &= \mathbb{E} \left[ \sum_{t=1}^T \Delta_{j(t)} \right] \\ &= \sum_{a=2}^K \Delta_a \left( \bar{n} + \sum_{t=\bar{n}K+1}^T \mathbb{E}[\mathbb{1}[j(t) = a]] \right) \\ &\leq \Delta_2 \sum_{t=\bar{n}K+1}^T \left( \mathbb{E}[\mathbb{1}[j(t) \neq 1, \mathcal{K}_{1,N_1(t)}(\epsilon), \mathcal{M}_\epsilon^c(t)]] + \mathbb{E}[\mathbb{1}[j(t) \neq 1, \mathcal{K}_{1,N_1(t)}^c(\epsilon), \mathcal{M}_\epsilon^c(t)]] \right) \\ &\quad + \sum_{a=2}^K \Delta_a \left\{ \bar{n} + \sum_{t=\bar{n}K+1}^T \left( \mathbb{E}[\mathbb{1}[j(t) = a, \mathcal{M}_\epsilon(t), \mathcal{E}_{a,N_a(t)}(\epsilon)]] \right. \right. \\ &\quad \left. \left. + \mathbb{E}[\mathbb{1}[j(t) = a, \mathcal{M}_\epsilon(t), \mathcal{E}_{a,N_a(t)}^c(\epsilon)]] \right) \right\},\end{aligned}$

where $\mathcal{E}^c$ denotes the complementary set of $\mathcal{E}$ . Lemmas 5–8 complete the proof of Theorems 2 and 4, whose proofs are given in Appendix C.

Lemma 5. Under STS with $k \in \mathbb{Z}_{\geq 2}$ ,

$\sum_{t=\bar{n}K+1}^T \mathbb{E}[\mathbb{1}[j(t) \neq 1, \mathcal{K}_{1,N_1(t)}^c(\epsilon), \mathcal{M}_\epsilon^c(t)]] \leq \mathcal{O}(\epsilon^{-2}).$

and under STS-T with $k \in \mathbb{Z}_{\geq 0}$ ,

$\sum_{t=\bar{n}K+1}^T \mathbb{E}[\mathbb{1}[j(t) \neq 1, \mathcal{K}_{1,N_1(t)}^c(\epsilon), \mathcal{M}_\epsilon^c(t)]] \leq \mathcal{O}(\epsilon^{-m}),$

where $m = \max(2, 3 - k)$ .

Although Lemma 6 contributes to the main term of the regret, the proof of Lemma 5 is the main difficulty in the regret analysis. We found that our analysis does not result in a finite upper bound for STS with $k \in \mathbb{Z}_{<2}$ and designed STS-T to solve such problems.Lemma 6. Under STS and STS-T with $k \in \mathbb{Z}$ , it holds that for any $a \in [K]$

$\sum_{t=\bar{n}K+1}^T \mathbb{E}[\mathbb{1}[j(t) = a, \mathcal{M}_\epsilon(t), \mathcal{E}_{a, N_a(t)}(\epsilon)]] \leq \max(0, k) + 1 + \frac{1}{\alpha_a \epsilon} \mathbb{1}[k > 0] + \frac{\log T}{D_{a,k}(\epsilon)}.$

where $D_{a,k}(\epsilon) > 0$ is a finite problem-deterministic constant satisfying $\lim_{\epsilon \rightarrow 0} D_{a,k}(\epsilon) = \text{KL}_{\text{inf}}(a)$ .

Since large $k$ yields a more conservative policy and it requires additional initial plays of every arm, large $k$ might induce larger regret for a finite time horizon $T$ , which corresponds to the component of the regret discussed in Lemma 6. Thus, this lemma would imply that the policy has to be conservative to some extent and being overly conservative would induce larger regrets in a finite time.

Lemma 7. Under STS and STS-T with $k \in \mathbb{Z}_{\geq 0}$ ,

$\sum_{t=\bar{n}K+1}^T \mathbb{E}[\mathbb{1}[j(t) \neq 1, \mathcal{K}_{1, N_1(t)}(\epsilon), \mathcal{M}_\epsilon^c(t)]] \leq \mathcal{O}(\epsilon^{-1}).$

The key to Lemma 7 is to convert the term on $\tilde{\mu}_1(t)$ , $\mathcal{M}_\epsilon(t)$ , to a term on $\tilde{\alpha}_1(t)$ . Since $\mu(\kappa, \alpha) = \infty$ holds for $\alpha \leq 1$ , $\tilde{\mu}_1 = \mu(\tilde{\kappa}_1, \tilde{\alpha}_1)$ becomes infinity regardless of the value of $\tilde{\kappa}_1$ if $\tilde{\alpha}_1 \leq 1$ holds, which implies $\mathbb{P}[\mathcal{M}_\epsilon^c(t), \tilde{\alpha}_1(t) \leq 1] = 0$ . Therefore, it is enough to consider the case where $\tilde{\alpha}_1(t) > 1$ holds to prove Lemma 7. Although density functions of $\tilde{\alpha}_1$ under STS and STS-T are different, conditional CDFs of $\tilde{\kappa}_1$ given $\alpha_1 = \tilde{\alpha}_1$ are the same, which is given in (8) as

$\mathbb{P}[\tilde{\kappa}_1 \leq x | \mathcal{F}_t, \tilde{\alpha}_1 = \alpha_1] = \left( \frac{x}{\hat{\kappa}_1(N_1(t))} \right)^{\tilde{\alpha}_1 N_1(t)}.$

Therefore, for sufficiently large $N_1(t)$ and $\tilde{\alpha}_1(t) > 1$ , $\tilde{\kappa}_1(t)$ will concentrate on $\hat{\kappa}1(N_1(t))$ with high probability, which is close to its true value $\kappa_1$ under the event ${\mathcal{K}{1, N_1(t)}(\epsilon)}$ . Thus, $\tilde{\mu}_1 = \frac{\tilde{\kappa}_1 \tilde{\alpha}_1}{\tilde{\alpha}_1 - 1} \geq \frac{\kappa_1 \tilde{\alpha}_1}{\tilde{\alpha}_1 - 1} = \mu(\kappa_1, \tilde{\alpha}1)$ holds with high probability, which implies that $\mathbb{P}[\mathcal{K}{1, N_1(t)}(\epsilon), \mathcal{M}_\epsilon^c(t) | \mathcal{F}t]$ can be roughly bounded by $\mathbb{P}[\mathcal{K}{1, N_1(t)}(\epsilon), \tilde{\alpha}_1(t) \geq c | \mathcal{F}_t]$ for some problem-dependent constants $c > 1$ . Since $\mathcal{K}_1$ is deterministic given $\mathcal{F}_t$ , we have

$\mathbb{P}[\mathcal{K}_{1, N_1(t)}(\epsilon), \tilde{\alpha}_1(t) \geq c | \mathcal{F}_t] = \mathbb{1}[\mathcal{K}_{1, N_1(t)}(\epsilon)] \mathbb{P}[\tilde{\alpha}_1(t) \geq c | \mathcal{F}_t],$

which implies $\tilde{\mu}_1(t)$ is mainly determined by the value of $\tilde{\alpha}1(t)$ under the event ${\mathcal{K}{1, N_1(t)}(\epsilon)}$ for both policies. In such cases, STS and STS-T behave like TS in the Pareto distribution with a known scale parameter, where $\tilde{\mu}_1(t) := \mu(\kappa_1, \tilde{\alpha}_1(t))$ for $t \in \mathbb{N}$ . Here, the Pareto distribution with the known scale parameter belongs to the one-dimensional exponential family, where Korda et al. [2013] showed the optimality of TS with the Jeffreys prior. Since the posterior of $\alpha$ under the Jeffreys prior is given as the Erlang distribution with shape $N_1(t) + 1$ in the one-parameter Pareto model, we can apply the results by Korda et al. [2013] to prove Lemma 7 by using some properties of the Erlang distribution such as (11).

Lemma 8. Under STS and STS-T with $k \in \mathbb{Z}$ , it holds that for any $a \neq 1$

$\sum_{t=\bar{n}K+1}^T \mathbb{E}[\mathbb{1}[j(t) = a, \mathcal{E}_{a, N_a(t)}^c(\epsilon)]] \leq \mathcal{O}(\epsilon^{-2}).$

Lemma 8 controls the regret induced when estimators of the played arm are not close to their true parameters, which is not difficult to analyze as in the usual analysis of TS. In fact, the proof of this lemma is straightforward since the upper bounds of $\mathbb{P}[\mathcal{K}_a^c]$ and $\mathbb{P}[\mathcal{K}_a, \mathcal{A}_a^c]$ can be easily derived based on the distributions of $\hat{\kappa}_a$ and $\hat{\alpha}_a$ in (2).## 7 Conclusion

We considered the MAB problems under the Pareto distribution that has a heavy tail and follows the power-law. While most previous research on TS has focused on one-dimensional or light-tailed distributions, we focused on the Pareto distribution characterized by unknown scale and shape parameters. By sequentially sampling parameters via their marginalized and conditional posterior distributions, we can realize an efficient sampling procedure. We showed that TS with the appropriate choice of priors achieves a problem-dependent optimal regret bound in such a setting for the first time. Although the Jeffreys prior and the reference prior are shown to be suboptimal under the direct implementation of TS, we showed that they can achieve the optimal regret bound if we add a truncation procedure. Experimental results support our theoretical results, which show the optimality of conservative priors and the effectiveness of the truncation procedure for the Jeffreys prior and the reference prior.

Acknowledgement

JL was supported by JST SPRING, Grant Number JPMJSP2108. CC and MS were supported by the Institute for AI and Beyond, UTokyo.

Bibliography

Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1. PMLR, 2012.

Shipra Agrawal and Navin Goyal. Near-optimal regret bounds for Thompson sampling. Journal of the Association for Computing Machinery, 64(5):1–24, 2017.

Shubhada Agrawal, Sandeep K Juneja, and Wouter M Koolen. Regret minimization in heavy-tailed bandits. In Conference on Learning Theory, pages 26–62. PMLR, 2021.

Barry C Arnold. Pareto and generalized Pareto distributions. In Modeling income distributions and Lorenz curves, pages 119–145. Springer, 2008.

Dorian Baudry, Romain Gautron, Emilie Kaufmann, and Odalric Maillard. Optimal Thompson sampling strategies for support-aware CVaR bandits. In International Conference on Machine Learning, pages 716–726. PMLR, 2021.

James O Berger and José M Bernardo. On the development of the reference prior method. Bayesian Statistics, 4(4):35–60, 1992.

Jose M Bernardo. Reference posterior distributions for bayesian inference. Journal of the Royal Statistical Society: Series B (Methodological), 41(2):113–128, 1979.

Sébastien Bubeck and Che-Yu Liu. Prior-free and prior-dependent regret bounds for Thompson sampling. In Advances in Neural Information Processing Systems, volume 26, pages 638–646, 2013.

Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.Sébastien Bubeck, Nicolo Cesa-Bianchi, and Gábor Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.

Apostolos N Burnetas and Michael N Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122–142, 1996.

Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011.

Thomas M Cover and Joy A Thomas. Elements of information theory. Wiley-Interscience, 2nd edition, 2006.

Gauri Sankar Datta. On priors providing frequentist validity of Bayesian inference for multiple parametric functions. Biometrika, 83(2):287–298, 1996.

Gauri Sankar Datta and Malay Ghosh. Some remarks on noninformative priors. Journal of the American Statistical Association, 90(432):1357–1363, 1995.

Gauri Sankar Datta and Malay Ghosh. On the invariance of noninformative priors. The Annals of Statistics, 24(1):141–159, 1996.

Gauri Sankar Datta and Trevor J Sweeting. Probability matching priors. Handbook of Statistics, 25:91–114, 2005.

Thomas J DiCiccio, Todd A Kuffner, and G Alastair Young. A simple analysis of the exact probability matching prior in the location-scale model. The American Statistician, 71(4):302–304, 2017.

Malay Ghosh. Objective priors: An introduction for frequentists. Statistical Science, 26(2):187–202, 2011.

Junya Honda and Akimichi Takemura. Optimality of Thompson sampling for Gaussian bandits depends on priors. In International Conference on Artificial Intelligence and Statistics, volume 33, pages 375–383. PMLR, 2014.

Harold Jeffreys. The theory of probability. OUP Oxford, 1998.

Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite time analysis. In International Conference on Algorithmic Learning Theory, volume 7568, pages 199–213. Springer, 2012.

Dal-Ho Kim, Sang-Gil Kang, and Woo-Dong Lee. Noninformative priors for Pareto distribution. Journal of the Korean Data and Information Science Society, 20(6):1213–1223, 2009.

Nathaniel Korda, Emilie Kaufmann, and Remi Munos. Thompson sampling for 1-dimensional exponential family bandits. In International Conference on Neural Information Processing Systems, volume 26, pages 1448–1456. Curran Associates, Inc., 2013.

Tze Leung Lai, Herbert Robbins, et al. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Mingming Li, Huafei Sun, and Linyu Peng. Fisher–Rao geometry and Jeffreys prior for Pareto distribution. Communications in Statistics-Theory and Methods, 51(6):1895–1910, 2022.Kenneth S Lomax. Business failures: Another example of the analysis of failure data. Journal of the American Statistical Association, 49(268):847–852, 1954.

Henrick John Malik. Estimation of the parameters of the Pareto distribution. Metrika, 15(1):126–132, 1970.

Rahul Mukerjee and Malay Ghosh. Second-order probability matching priors. Biometrika, 84(4):970–975, 1997.

Charles Riou and Junya Honda. Bandit algorithms based on Thompson sampling for bounded reward distributions. In International Conference on Algorithmic Learning Theory, pages 777–826. PMLR, 2020.

Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

Christian P Robert et al. The Bayesian choice: from decision-theoretic foundations to computational implementation. Springer, 2nd edition, 2007.

Daniel Russo and Benjamin Van Roy. An information-theoretic analysis of Thompson sampling. The Journal of Machine Learning Research, 17(1):2442–2471, 2016.

Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96, 2018.

Mark J Schervish. Theory of statistics. Springer Science & Business Media, 2012.

Marvin K Simon and Dariush Divsalar. Some new twists to problems involving the Gaussian probability integral. IEEE Transactions on Communications, 46(2):200–210, 1998.

Charles Stein. An example of wide discrepancy between fiducial and confidence intervals. The Annals of Mathematical Statistics, 30(4):877–880, 1959.

Fupeng Sun, Yueqi Cao, Shiqiang Zhang, and Huafei Sun. The Bayesian inference of Pareto models based on information geometry. Entropy, 23(1):45, 2020.

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Robert Tibshirani. Noninformative priors for one parameter of many. Biometrika, 76(3):604–608, 1989.

Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.

David L. Wallace. Bounds on normal approximations to student’s and the chi-square distributions. The Annals of Mathematical Statistics, 30(4):1121–1130, 1959.

BL Welch and HW Peers. On formulae for confidence points based on integrals of weighted likelihoods. Journal of the Royal Statistical Society: Series B (Methodological), 25(2):318–329, 1963.## A Notations

Tables 2–5 summarize the symbols used in this paper.

Table 2: Notations for the bandit problem.

Symbol	Meaning
$K$	the number of arms.
$T$	time horizon.
$j(t)$	the index of the played arm at round $t$ .
$k \in \mathbb{Z}$	prior parameter, see Section 3 for details.
$\bar{n} = \max(2, k + 1)$	initial plays to avoid improper posteriors.
$N_a(t)$	the number of playing arm $a$ until round $t$ .
$r_{a,n}$	$n$ -th reward generated from the arm $a$ .
$\mu(\theta)$	the expected value of the random variable following $\text{Pa}(\theta)$ .
$\mu_a = \mu(\theta_a)$	the expected rewards of arm $a$ .
$\Delta_a$	sub-optimality gap of arm $a$ .
$\delta_a = \frac{\Delta_a}{2}$ for $a \neq 1$	a half of sub-optimality gap of arm $a$ .
$\delta_1 := \min_{a \neq 1} \delta_a$	defined as the minimum of sub-optimality gap.
$\mathcal{F}_t = (j(s), r_{j(s), N_{j(s)}})_{s=1}^{t-1}$	the history until round $t$ .
$\mathbb{P}_t[\cdot] := \mathbb{P}[\cdot \| \mathcal{F}_t]$	conditional probability given $\mathcal{F}_t$ .
$g_a(c, \alpha)$	KL-divergence from $\text{Pa}(\frac{\kappa_a}{c}, \alpha)$ to $\text{Pa}(\kappa_a, \alpha_a)$ for $c \geq 1$ .
$h_a(c, \boldsymbol{\mu}) = h_a(c)$	the upper bound of $\alpha$ satisfying $\mu(\frac{\kappa_a}{c}, \alpha) \geq \mu_1$ for $c \geq 1$ .

Table 3: Notations for probability distributions and estimators

Symbol	Meaning
$\text{Pa}(\kappa, \alpha)$	Pareto distribution with the scale $\kappa > 0$ and shape $\alpha > 0$ .
$f_{\kappa, \alpha}^{\text{Pa}}(x)$	density function of $\text{Pa}(\kappa, \alpha)$ .
$\text{Erlang}(s, \beta)$	Erlang distribution with the shape $s > 0$ and rate $\beta > 0$ .
$f_{s, \beta}^{\text{Er}}(x)$	density function of $\text{Erlang}(s, \beta)$ .
$F_{s, \beta}^{\text{Er}}(x)$	CDF of $\text{Erlang}(s, \beta)$ evaluated at $x > 0$ .
$\text{IG}(s, \beta)$	Inverse Gamma distribution with shape $s > 0$ and scale $\beta > 0$ .
$F_n(x)$	CDF of the chi-squared distribution with $n$ degree of freedom.
$\Gamma(s)$	Gamma function.
$\gamma(s, x)$	the lower incomplete gamma function.
$\Gamma(s, x)$	the upper incomplete gamma function.
$\hat{\kappa}_a(n), \hat{\alpha}_a(n)$	MLEs of the scale and shape parameter of arm $a$ after $n$ observations, defined in (2).
$\tilde{\kappa}_a(t), \tilde{\alpha}_a(t)$	sampled parameters at round $t$ from posterior distribution in (8) and (7).
$\bar{\alpha}_a(n) = \max(\hat{\alpha}_a(n), n)$	truncated estimator of the shape parameter.
$\alpha_{a,n}$	a temporary notation that can be replaced by both $\hat{\alpha}_a(n)$ (STS) and $\bar{\alpha}_a(n)$ (STS-T).
$\hat{\mu}_a(n) = \mu(\hat{\kappa}_a(n), \hat{\alpha}_a(n))$	computed mean rewards by the MLEs after $n$ observation.
$\tilde{\mu}_a(t) = \mu(\tilde{\kappa}_a(t), \tilde{\alpha}_a(t))$	computed mean rewards by sampled parameters $\tilde{\kappa}_a(t)$ and $\tilde{\alpha}_a(t)$ at round $t$ .
$\theta_a = (\kappa_a, \alpha_a)$	tuple of true parameters of arm $a \in [K]$ .
$\hat{\theta}_{a,n} = (\hat{\kappa}_a(n), \hat{\alpha}_a(n))$	tuple of MLEs of arm $a$ after $n$ observations.
$\bar{\theta}_{a,n} = (\hat{\kappa}_a(n), \bar{\alpha}_a(n))$	tuple of estimators with a truncation procedure of arm $a$ after $n$ observations.
$\theta_{a,n} = (\hat{\kappa}_a(n), \alpha_{a,n})$	a temporary notation that can be replaced by both $\hat{\theta}_{a,n}$ (STS) and $\bar{\theta}_{a,n}$ (STS-T).

Table 4: Notations for the regret analysis

Symbol	Meaning
$D_{a,k}(\epsilon)$	a function contributes to the main term of regret analysis defined in (42).
$\mathcal{K}_{a,n}(\epsilon)$	an event where MLE of $\kappa$ is close to its true value at round $t$ after $n$ observations.
$\mathcal{A}_{a,n}(\epsilon)$	an event where MLE of $\alpha$ is close to its true value at round $t$ after $n$ observations.
$\mathcal{E}_{a,n}(\epsilon)$	intersection of $\mathcal{K}_{a,n}(\epsilon)$ and $\mathcal{A}_{a,n}(\epsilon)$ .
$\mathcal{M}_\epsilon(t)$	an event where sampled mean of the optimal arm is close to its true mean reward at round $t$ .
$p_n(x\|\theta_{1,n})$	probability of $\{\tilde{\mu}_1(t) \leq \mu_1 - x\}$ after $n$ observation of arm 1 given $\theta_{1,n}$ .
$G_k(x; n)$	another expression of the CDF of the Erlang distribution in (20).

Table 5: Notations for (deterministic) constants

Symbol	Meaning
$\varepsilon_a$	problem-dependent constants to satisfy $\hat{\mu}_a(n) \in [\mu_a - \delta_a, \mu_a + \delta_a]$ on $\mathcal{E}_{a,n}(\epsilon)$ for any $\epsilon \leq \varepsilon_a$ .
$\epsilon_{a,l}(\epsilon), \epsilon_{a,l}(\epsilon)$	constants to control a deviation of $\hat{\alpha}_a(\epsilon)$ under the event $\mathcal{A}_{a,n}(\epsilon)$ .
$\rho_{\theta_1}(\epsilon), \bar{\rho} = \rho_{\theta_1}(\epsilon/2)$	a difference from its true shape parameter $\alpha_1$ to satisfy $\mu(\kappa_a, \alpha + \rho_{\mu_1}(\epsilon)) \geq \mu_1 - \frac{\epsilon}{2}$ .
$C_1(\mu_1, \epsilon, n) = C_{1,n}$	a constant smaller than 1 in (20).
$C_2(\mu_1, \epsilon, k)$	an uniform bound of $p_n(\epsilon\|\cdot)$ under $\mathcal{K}_{1,n}^c(\epsilon) \cap \mathcal{A}_{1,n}(\epsilon/2)$ .
$c_{\mu_1}(\epsilon)$	a small constant with $\mathcal{O}(\epsilon^{-2})$ .

## B Proof of Lemma 1

Lemma 1. For any arm $a \neq 1$ , it holds that

$\text{KL}_{\text{inf}}(a) = \log \left( \alpha_a \frac{\mu_1 - \kappa_a}{\mu_1} \right) + \frac{1}{\alpha_a} \frac{\mu_1}{\mu_1 - \kappa_a} - 1.$

Proof. Recall the definition

$\text{KL}_{\text{inf}}(a) = \text{KL}_{\text{inf}}(\text{Pa}(\theta_a), \text{Pa}(\theta)) := \inf_{\theta \in \Theta_a} \log \frac{\alpha_a}{\alpha} + \alpha \log \frac{\kappa_a}{\kappa} + \frac{\alpha}{\alpha_a} - 1,$

where $\theta = (\kappa, \alpha)$ and $\Theta_a$ defined in (4).

Here, we consider the partition of $\Theta_a$ ,

$\begin{aligned} \Theta_a^{(1)} &= \{(\kappa, \alpha) \in (0, \kappa_a] \times (0, 1] : \mu(\kappa, \alpha) > \mu_1\} = (0, \kappa_a] \times (0, 1] \\ \Theta_a^{(2)} &= \left\{ (\kappa, \alpha) \in (0, \kappa_a] \times (1, \infty) : \mu(\kappa, \alpha) = \frac{\kappa \alpha}{\alpha - 1} > \mu_1 \right\}, \end{aligned} \quad (13)$

where $\Theta_a^{(1)} \cup \Theta_a^{(2)} = \Theta_a$ . Therefore, it holds that

$\text{KL}_{\text{inf}}(a) = \min \left( \inf_{\theta \in \Theta_a^{(1)}} \log \frac{\alpha_a}{\alpha} + \alpha \log \frac{\kappa_a}{\kappa} + \frac{\alpha}{\alpha_a} - 1, \inf_{\theta \in \Theta_a^{(2)}} \log \frac{\alpha_a}{\alpha} + \alpha \log \frac{\kappa_a}{\kappa} + \frac{\alpha}{\alpha_a} - 1 \right).$

For $(\kappa, \alpha) \in \Theta_a^{(1)}$ , $\mu(\kappa, \alpha) = \infty$ holds regardless of $\kappa$ . Therefore, we obtain

$\begin{aligned} \inf_{\theta \in \Theta_a^{(1)}} \log \frac{\alpha_a}{\alpha} + \alpha \log \frac{\kappa_a}{\kappa} + \frac{\alpha}{\alpha_a} - 1 &= \inf_{\alpha \in (0, 1]} \log \frac{\alpha_a}{\alpha} + \frac{\alpha}{\alpha_a} - 1 \\ &= \log \alpha_a + \frac{1}{\alpha_a} - 1, \end{aligned}$

where the last equality holds since $\log x + \frac{1}{x} - 1$ is an increasing function for $x \geq 1$ .

Let $\frac{\kappa_a}{\kappa} = c \geq 1$ to make KL divergence from $\text{Pa}(\theta_a)$ to $\text{Pa}(\kappa, \alpha)$ be well-defined. From its definition of $\Theta_a^{(2)}$ in (13), any $\theta = (\kappa, \alpha) \in \Theta_a^{(2)}$ satisfies $\frac{\kappa \alpha}{\alpha - 1} \geq \mu_1$ , i.e.,

$\frac{\kappa_a \alpha}{c(\alpha - 1)} \geq \mu_1 \Leftrightarrow \alpha \leq \frac{c\mu_1}{c\mu_1 - \kappa_a} =: h_a(c, \mu) = h_a(c).$

Note that it holds that

$h_a(1) = \frac{\mu_1}{\mu_1 - \kappa_a} \leq \frac{\mu_a}{\mu_a - \kappa_a} = \alpha_a$

since $\frac{x}{x-y}$ is decreasing with respect to $x \geq y$ . Then, we can rewrite the infimum of KL divergence as

$\text{KL}_{\text{inf}}(a) = \min \left( \log \alpha_a + \frac{1}{\alpha_a} - 1, \inf_{c \geq 1} \inf_{\alpha \leq h_a(c)} g_a(\alpha, c) \right),$

where $g_a(\alpha, c) := \log \frac{\alpha_a}{\alpha} + \alpha \log c + \frac{\alpha}{\alpha_a} - 1$ satisfying

$\frac{\partial g_a(\alpha, c)}{\partial \alpha} = \frac{1}{\alpha_a} + \log c - \frac{1}{\alpha}. \quad (14)$ Then, the inner infimum can be obtained when $\alpha = \frac{\alpha_a}{1+\alpha_a \log c}$ holds if $\frac{\alpha_a}{1+\alpha_a \log c} < h_a(c)$ , where $g_a\left(\frac{\alpha_a}{1+\alpha_a \log c}, c\right) = \log(1 + \alpha_a \log c)$ .

Let $c_a^* \geq 1$ be a deterministic constant such that

$h_a(c_a^*) = \frac{c_a^* \mu_1}{c_a^* \mu_1 - \kappa_a} = \frac{\alpha_a}{1 + \alpha_a \log c_a^*} \Leftrightarrow (\mu_1 \alpha_a) c_a^* \log c_a^* + (\mu_1 - \alpha_a \mu_1) c_a^* = -\alpha_a \kappa_a \quad (15)$

so that $h_a(c) \geq \frac{\alpha_a}{1+\alpha_a \log c}$ holds for any $c \geq c_a^*$ . Since the solution of $ax \log(x) + bx = -c$ is given as $\exp\left(W\left(-\frac{ce^{b/a}}{a}\right) - \frac{b}{a}\right)$ for principal branch of Lambert W function $W(\cdot)$ , one can obtain $c_a^*$ by solving the equality in (15), which is

$c_a^* = \exp\left(W\left(-\frac{\kappa_a}{\mu_1} e^{\frac{1}{\alpha_a} - 1}\right) + 1 - \frac{1}{\alpha_a}\right). \quad (16)$

Notice that $\frac{\kappa_a}{\mu_1} e^{\frac{1}{\alpha_a} - 1} \leq \frac{\kappa_a}{\mu_a} e^{\frac{1}{\alpha_a} - 1} \leq \left(1 - \frac{1}{\alpha_a}\right) e^{-\left(1 - \frac{1}{\alpha_a}\right)} \leq e^{-1}$ holds so that $c_a^*$ is a real value. Here, we consider the principal branch to ensure $c_a^* \geq 1$ since the solution on other branches, $W_{-1}(\cdot)$ , is less than 1, which is out of our interest.

Let $A_a = 1 - \frac{1}{\alpha_a}$ , which is positive as $\alpha_a > 1$ and $B_a = \frac{\kappa_a}{\mu_1} \leq \frac{\kappa_a}{\mu_a} = \frac{\alpha_a - 1}{\alpha_a} = A_a$ . Then, we can rewrite $c_a^*$ as

$c_a^* = e^{A_a} e^{W(-B_a e^{-A_a})} = e^{A_a} e^{-A_a} \frac{-B_a}{W(-B_a e^{-A_a})}. \quad \therefore e^{W(x)} = \frac{x}{W(x)}$

Since the principal branch of Lambert W function, $W(x)$ , is increasing for $x \geq -\frac{1}{e}$ , we have

$0 > W(-B_a e^{-A_a}) \geq W(-B_a e^{-B_a}) = -B_a,$

which implies that $c_a^* \geq 1$ . Therefore, the infimum of $g_a$ can be written as

$\begin{aligned} \inf_{c \geq 1} \inf_{\alpha \leq h_a(c)} g_a(\alpha, c) &= \min \left( \inf_{c \in [1, c_a^*]} g_a(h_a(c), c), \inf_{c \geq c_a^*} \log(1 + \alpha_a \log c) \right) \\ &= \min \left( \inf_{c \in [1, c_a^*]} g_a(h_a(c), c), \log(1 + \alpha_a \log c_a^*) \right), \end{aligned}$

where we follow the convention that the infimum over the empty set is defined as infinity.

By substituting $c_a^*$ in (16), we obtain

$\log(1 + \alpha_a \log c_a^*) = \log \left( \alpha_a + W \left( -\frac{\kappa_a}{\mu_1} e^{\frac{1}{\alpha_a} - 1} \right) \right).$

Let us consider the following inequalities:

$\begin{aligned} \log \left( \alpha_a + W \left( -\frac{\kappa_a}{\mu_1} e^{\frac{1}{\alpha_a} - 1} \right) \right) &\geq \log \left( \alpha_a + W \left( -\frac{\kappa_a}{\mu_a} e^{\frac{1}{\alpha_a} - 1} \right) \right) \\ &= \log \left( \alpha_a + W \left( \frac{1 - \alpha_a}{\alpha_a} e^{\frac{1}{\alpha_a} - 1} \right) \right) \\ &= \log \left( \alpha_a + \frac{1}{\alpha_a} - 1 \right), \end{aligned} \quad (17)$ where the first inequality holds since the principal branch of Lambert W function $W(x)$ is increasing and negative with respect to $x \in [-1/e, 0)$ .

It remains to find the closed form of $\inf_{c \in [1, c_a^*]} g_a(h_a(c), c)$ . From the definition of $h_a(c) = \frac{c\mu_1}{c\mu_1 - \kappa_a}$ , we have $h'_a(c) = -\frac{\mu_1 \kappa_a}{(c\mu_1 - \kappa_a)^2}$ and

$\begin{aligned} \frac{\partial g_a(h_a(c), c)}{\partial c} &= \frac{\partial}{\partial c} \left( \log \frac{\alpha_a}{h_a(c)} + h_a(c) \log c + \frac{h_a(c)}{\alpha_a} - 1 \right) \quad \because g_a(x, c) = \log \frac{\alpha_a}{x} + x \log c + \frac{x}{\alpha_a} - 1 \\ &= -\frac{h'_a(c)}{h_a(c)} + h'_a(c) \log c + \frac{h_a(c)}{c} + \frac{1}{\alpha_a} h'_a(c) \\ &= \frac{\kappa_a}{c(c\mu_1 - \kappa_a)} - \frac{\mu_1 \kappa_a}{(c\mu_1 - \kappa_a)^2} \log c + \frac{\mu_1}{c\mu_1 - \kappa_a} - \frac{\kappa_a \mu_1}{\alpha_a (c\mu_1 - \kappa_a)^2} \\ &= \frac{\kappa_a}{c(c\mu_1 - \kappa_a)} - \frac{\mu_1 \kappa_a}{(c\mu_1 - \kappa_a)^2} \log c + \mu_1 \frac{c\mu_1 - \kappa_a - \frac{\kappa_a}{\alpha_a}}{(c\mu_1 - \kappa_a)^2}. \end{aligned} \quad (18)$

Since the first term in (18) is positive for $c \geq 1$ and $\mu_1 \geq \mu_a > \kappa_a$ , let us consider the last two terms for $c \in [1, c_a^*]$ ,

$\begin{aligned} -\frac{\mu_1 \kappa_a}{(c\mu_1 - \kappa_a)^2} \log c + \mu_1 \frac{c\mu_1 - \kappa_a - \frac{\kappa_a}{\alpha_a}}{(c\mu_1 - \kappa_a)^2} &= \frac{\mu_1}{(c\mu_1 - \kappa_a)^2} \left( c\mu_1 - \kappa_a - \frac{\kappa_a}{\alpha_a} - \kappa_a \log c \right) \\ &= \frac{\mu_1}{(c\mu_1 - \kappa_a)^2} \left( \mu_1 - \kappa_a - \frac{\kappa_a}{\alpha_a} + (c-1)\mu_1 - \kappa_a \log c \right) \\ &= \frac{\mu_1}{(c\mu_1 - \kappa_a)^2} \left( \mu_1 - \kappa_a - \frac{\kappa_a}{\alpha_a} + \mu_1 \left( c - \frac{\kappa_a}{\mu_1} \log c - 1 \right) \right). \end{aligned}$

Here,

$\mu_1 - \kappa_a - \frac{\kappa_a}{\alpha_a} \geq \mu_a - \kappa_a - \frac{\kappa_a}{\alpha_a} = \frac{\kappa_a \alpha_a}{\alpha_a - 1} - \kappa_a - \frac{\kappa_a}{\alpha_a} = \frac{\kappa_a}{\alpha_a(\alpha_a - 1)} > 0,$

and $c - \frac{\kappa_a}{\mu_1} \log c - 1$ is increasing with respect to $c$ so that $c - \frac{\kappa_a}{\mu_1} \log c - 1 \geq 0$ for $c \geq 1$ . Therefore, $\frac{\partial}{\partial c} g_a(h_a(c), c)$ is positive for $c \geq 1$ , i.e., $g_a(h_a(c), c)$ is an increasing function with respect to $c \geq 1$ .

Thus, we have for $c \in [1, c_a^*]$ ,

$\inf_{c \in [1, c_a^*]} g_a(h_a(c), c) = g_a(h_a(1), 1) = g_a\left(\frac{\mu_1}{\mu_1 - \kappa_a}, 1\right) = \log\left(\alpha_a \frac{\mu_1 - \kappa_a}{\mu_1}\right) + \frac{1}{\alpha_a} \frac{\mu_1}{\mu_1 - \kappa_a} - 1.$

Denote $X_a = \alpha_a \frac{\mu_1 - \kappa_a}{\mu_1} \in [1, \alpha_a)$ where $X_a = 1$ happens only when $\mu_a = \mu_1$ . Then, we have for $\alpha_a > 1$

$\log(X_a) + \frac{1}{X_a} - 1 \leq \log \alpha_a + \frac{1}{\alpha_a} - 1 \leq \log\left(\alpha_a + \frac{1}{\alpha_a} - 1\right) \leq \log(1 + \alpha_a \log c_a^*),$

where the last inequality comes from the result in (17). Therefore, we have

$\begin{aligned} \text{KL}_{\inf}(a) &= \min \left( \log \alpha_a + \frac{1}{\alpha_a} - 1, \inf_{c \in [1, c_a^*]} g_a(h_a(c), c), \log(1 + \alpha_a \log c_a^*) \right) \\ &= \log\left(\alpha_a \frac{\mu_1 - \kappa_a}{\mu_1}\right) + \frac{1}{\alpha_a} \frac{\mu_1}{\mu_1 - \kappa_a} - 1, \end{aligned}$

which concludes the proof. $\square$## C Proofs of lemmas for Theorems 2 and 4

In this section, we provide the proof of lemmas for the main results.

To avoid redundancy, we use a temporary notation $\alpha_{a,n}$ when the same result holds for both $\hat{\alpha}_a(n)$ and $\bar{\alpha}a(n)$ . When $\alpha{a,n}$ notation is used, one can replace it with either $\hat{\alpha}_a(n)$ or $\bar{\alpha}_a(n)$ depending on which policy we are considering. For example, it holds that

$\alpha_{a,n} \leq 1 \Leftrightarrow \begin{cases} \hat{\alpha}_a(n) \leq 1 & \text{under STS policy,} \\ \bar{\alpha}_a(n) \leq 1 & \text{under STS-T policy.} \end{cases}$

Similarly, we use the notation $\theta_{a,n} := (\hat{\kappa}a(n), \alpha{a,n})$ when it can be replaced by both $\hat{\theta}_{a,n} = (\hat{\kappa}_a(n), \hat{\alpha}a(n))$ and $\bar{\theta}{a,n} = (\hat{\kappa}_a(n), \bar{\alpha}a(n))$ for any arm $a \in [K]$ and $n \in \mathbb{N}$ . Based on $\theta{a,n}$ notation, we provide an inequality on the posterior probability that the sampled mean is smaller than a given value with proofs in Appendix C.5.

Lemma 9. For any arm $a \in [K]$ and fixed $t \in \mathbb{N}$ , let $N_a(t) = n$ . For any positive $\xi \leq \frac{y}{y-\kappa_a}$ and $k \in \mathbb{Z}$ , it holds that

$\mathbb{1}[\hat{\kappa}_a(n) \leq y] \mathbb{P}[\tilde{\mu}_a(t) \leq y | \theta_{a,n}] \leq \int_{\xi}^{\infty} f_{n-k, \frac{n}{\alpha_{a,n}}}^{\text{Er}}(x) dx + \left( \frac{y}{\mu((\kappa_a, \xi))} \right)^n \int_1^{\xi} f_{n-k, \frac{n}{\alpha_{a,n}}}^{\text{Er}}(x) dx,$

where $f_{s,\beta}^{\text{Er}}(\cdot)$ denotes the probability density function of the Erlang distribution with shape $s \in \mathbb{N}$ and rate $\beta > 0$ .

Based on $\theta_{1,n}$ notation, we denote the probability of sample from the posterior distribution after $n$ times playing is smaller than $\mu_1 - x$ by

$p_n(x | \theta_{1,n}) := \mathbb{P}[\tilde{\mu}_1 \leq \mu_1 - x | \hat{\kappa}_1(n), \alpha_{1,n}]. \quad (19)$

Let $K(\epsilon) = (\kappa_1 + \epsilon, \mu_1 - \epsilon)$ be an open interval on $\mathbb{R}$ . The Lemma 10 below shows the upper bound of $p_n$ conditioned on $\theta_{1,n}$ .

Lemma 10. Given $\epsilon > 0$ , define a positive problem-dependent constant $\rho = \rho_{\theta_1}(\epsilon) := \frac{\kappa_1 \epsilon}{2(\mu_1 - \epsilon/2 - \kappa_1)(\mu_1 - \kappa_1)}$ . If $n \geq \bar{n} = \max(2, k + 1)$ , then for $k \in \mathbb{Z}_{\geq 0}$

$p_n(\epsilon | \theta_{1,n}) \leq \begin{cases} e^{-n}, & \text{if } \hat{\kappa}_1(n) \geq \mu_1 - \epsilon, \\ h(\mu_1, \epsilon, n), & \text{if } \hat{\kappa}_1(n) \in K(\epsilon), \alpha_{1,n} \leq \alpha_1 + \rho, \\ C_1(\mu_1, \epsilon, n) G_k(1/\alpha_{1,n}; n) + 1 - G_k(1/\alpha_{1,n}; n) & \text{if } \hat{\kappa}_1(n) \in K(\epsilon), \alpha_{1,n} \geq \alpha_1 + \rho, \end{cases}$

where

$\begin{aligned} h(\mu_1, \epsilon, n) &:= e^{-n \frac{3\epsilon}{4\mu_1}} \left( 1 - \frac{1}{2} e^{-n c_{\mu_1}(\epsilon)} \right) + \frac{1}{2} e^{-n c_{\mu_1}(\epsilon)} \\ C_1(\mu_1, \epsilon, n) &:= \left( \frac{\mu_1 - \epsilon}{\mu_1 - \epsilon/2} \right)^n \leq e^{-n \frac{\epsilon}{2\mu_1 - \epsilon}} < 1. \\ G_k(x; n) &:= F_{n-k, nx}^{\text{Er}}(\alpha_1 + \rho) \end{aligned} \quad (20)$

for $F^{\text{Er}}$ defined in (10). Here, $c_{\mu_1}(\epsilon) = \zeta - \log(1 + \zeta) = \mathcal{O}(\epsilon^{-2})$ and $\zeta = \frac{\epsilon}{4\mu_1 - 2\epsilon} \in (0, 1)$ are deterministic constants of $\mu_1$ and $\epsilon$ .

Notice that $\mu((\kappa_1, \alpha_1 + \rho)) = \mu_1 - \epsilon/2$ holds and there exists a problem-dependent constant $C_2(\mu_1, \epsilon, k) < 1$ such that for any $n \geq \bar{n} = \max(2, k + 1)$ and $\epsilon > 0$

$h(\mu_1, \epsilon, n) \leq 1 - C_2(\mu_1, \epsilon, k). \quad (21)$ ## C.1 Proof of Lemma 5

Let us start by stating a well-known fact that is utilized in the proof.

Fact 11. When $X \sim \text{Erlang}(n, \beta)$ with rate parameter $\beta$ , then $\frac{1}{X}$ follows the inverse gamma distribution with shape $n \in \mathbb{N}$ and scale $\beta \in \mathbb{R}_+$ , i.e., $\frac{1}{X} \sim \text{IG}(n, \beta)$ .

Lemma 5. Under STS with $k \in \mathbb{Z}_{\geq 2}$ ,

$\sum_{t=\bar{n}K+1}^T \mathbb{E} \left[ \mathbb{1}[j(t) \neq 1, \mathcal{K}_{1,N_1(t)}^c(\epsilon), \mathcal{M}_\epsilon^c(t)] \right] \leq \mathcal{O}(\epsilon^{-2}).$

and under STS-T with $k \in \mathbb{Z}_{\geq 0}$ ,

$\sum_{t=\bar{n}K+1}^T \mathbb{E} \left[ \mathbb{1}[j(t) \neq 1, \mathcal{K}_{1,N_1(t)}^c(\epsilon), \mathcal{M}_\epsilon^c(t)] \right] \leq \mathcal{O}(\epsilon^{-m}),$

where $m = \max(2, 3 - k)$ .

Proof. Let us consider the following decomposition that holds under both STS and STS-T:

$\begin{aligned} & \sum_{t=K\bar{n}+1}^T \mathbb{1}[j(t) \neq 1, \mathcal{K}_{1,N_1(t)}^c(\epsilon), \mathcal{M}_\epsilon^c(t)] \\ &= \sum_{n=\bar{n}}^T \sum_{t=K\bar{n}+1}^T \mathbb{1} \left[ j(t) \neq 1, \mathcal{K}_{1,N_1(t)}^c(\epsilon), \mathcal{M}_\epsilon^c(t), N_1(t) = n \right] \\ &= \sum_{n=\bar{n}}^T \sum_{m=1}^T \mathbb{1} \left[ m \leq \sum_{t=K\bar{n}+1}^T \mathbb{1} \left[ j(t) \neq 1, \mathcal{M}_\epsilon^c(t), \mathcal{K}_{1,N_1(t)}^c(\epsilon), N_1(t) = n \right] \right]. \end{aligned}$

Notice that

$m \leq \sum_{t=K\bar{n}+1}^T \mathbb{1}[j(t) \neq 1, \mathcal{K}_{1,N_1(t)}^c(\epsilon), \mathcal{M}_\epsilon^c(t), N_1(t) = n]$

implies that $\tilde{\mu}1(t) \leq \mu_1 - \epsilon$ occurred $m$ times in a row on ${t : \mathcal{K}{1,N_1(t)}^c(\epsilon), \mathcal{M}_\epsilon^c(t), N_1(t) = n}$ . Thus, we have

$\begin{aligned} \mathbb{E} \left[ \sum_{t=K\bar{n}+1}^T \mathbb{1}[j(t) \neq 1, \mathcal{K}_{1,N_1(t)}^c(\epsilon), \mathcal{M}_\epsilon^c(t)] \right] &\leq \mathbb{E} \left[ \sum_{n=\bar{n}}^T \sum_{m=1}^T \mathbb{1}[\mathcal{K}_{1,n}^c(\epsilon)] p_n(\epsilon|\theta_{1,n})^m \right] \\ &\leq \sum_{n=\bar{n}}^T \mathbb{E} \left[ \mathbb{1}[\mathcal{K}_{1,n}^c(\epsilon)] \frac{p_n(\epsilon|\theta_{1,n})}{1 - p_n(\epsilon|\theta_{1,n})} \right] \end{aligned} \quad (22)$

for $p_n(\cdot|\cdot)$ defined in (19). From now on, we fix $n \geq \bar{n}$ and drop the argument $n$ of $\hat{\kappa}_1(n)$ , $\hat{\alpha}1(n)$ and $\bar{\alpha}_1(n)$ for simplicity.Under STS with $k \in \mathbb{Z}_{\geq 2}$ : By applying Lemma 10 and (21) under STS with $k \in \mathbb{Z}{\geq 0}$ , we can decompose the expectation in (22) as

$\begin{aligned} \mathbb{E} \left[ \mathbb{1} [\mathcal{K}_{1,n}^c(\epsilon)] \frac{p_n(\epsilon|\hat{\theta}_{1,n})}{1 - p_n(\epsilon|\hat{\theta}_{1,n})} \right] &\leq \mathbb{P}[\hat{\kappa}_1 \geq \mu_1 - \epsilon] \frac{e^{-n}}{1 - e^{-n}} + \mathbb{P}[\hat{\kappa}_1 \in K(\epsilon), \hat{\alpha}_1 < \alpha_1 + \rho] \frac{h(\mu_1, \epsilon, n)}{C_2(\mu_1, \epsilon, k)} \\ &\quad + \underbrace{\mathbb{E}_{\hat{\theta}_{1,n}} \left[ \frac{\mathbb{1}[\hat{\kappa}_1 \in K(\epsilon), \hat{\alpha}_1 > \alpha_1 + \rho]}{G_k(1/\hat{\alpha}_1; n)(1 - C_{1,n})} \right]}_{(*)}, \end{aligned} \quad (23)$

where we denoted $C_{1,n} = C_1(\mu_1, \epsilon, n)$ . For simplicity, let us define $z := \frac{1}{\hat{\alpha}_1}$ where $z \sim \text{Erlang}(n-1, n\alpha_1)$ holds from Fact (11) since $\hat{\alpha}_1 \sim \text{IG}(n-1, n\alpha_1)$ in (2). From the independence of $\hat{\kappa}$ and $\hat{\alpha}$ and distributions of $z$ and $\hat{\alpha}$ in (9) and (2), respectively, we have

$\begin{aligned} (*) &= \int_0^{\frac{1}{\alpha_1 + \rho}} z^{n-2} e^{-n\alpha_1 z} \frac{(n\alpha_1)^{n-1}}{\Gamma(n-1)} \int_{\hat{\kappa}_1 \in K(\epsilon)} \frac{f_{\kappa_1, n\alpha_1}^{\text{Pa}}(\hat{\kappa}_1)}{G_k(z; n)(1 - C_{1,n})} d\hat{\kappa}_1 dz \\ &= \mathbb{P}[\hat{\kappa}_1 \in K(\epsilon)] \int_0^{\frac{1}{\alpha_1 + \rho}} \frac{z^{n-2} e^{-n\alpha_1 z}}{G_k(z; n)(1 - C_{1,n})} \frac{(n\alpha_1)^{n-1}}{\Gamma(n-1)} dz. \end{aligned}$

By substituting the CDF in (10), we obtain

$\begin{aligned} G_k(z; n) &= F_{n-k, nz}^{\text{Er}}(\alpha_1 + \rho) \\ &= \frac{1}{\Gamma(n-k)} \int_0^{n(\alpha_1 + \rho)z} e^{-t} t^{n-k-1} dt \\ &\geq \frac{e^{-n(\alpha_1 + \rho)z}}{\Gamma(n-k)} \int_0^{n(\alpha_1 + \rho)z} t^{n-k-1} dt \\ &= \frac{e^{-n(\alpha_1 + \rho)z}}{\Gamma(n-k+1)} (n(\alpha_1 + \rho)z)^{n-k}. \end{aligned} \quad (24)$

Therefore,

$\begin{aligned} \frac{(*)}{\mathbb{P}[\hat{\kappa} \in K(\epsilon)]} &\leq \int_0^{\frac{1}{\alpha_1 + \rho}} \frac{\Gamma(n-k+1)}{(n(\alpha_1 + \rho)z)^{n-k}(1 - C_{1,n})} e^{n(\alpha_1 + \rho)z} \frac{(n\alpha_1)^{n-1}}{\Gamma(n-1)} z^{n-2} e^{-n\alpha_1 z} dz \\ &= \frac{\Gamma(n-k+1)}{\Gamma(n-1)(1 - C_{1,n})} (\alpha_1 + \rho)^{k-1} \left( \frac{\alpha_1}{\alpha_1 + \rho} \right)^{n-1} n^{k-1} \int_0^{\frac{1}{\alpha_1 + \rho}} z^{k-2} e^{n\rho z} dz \\ &\leq \frac{\Gamma(n-k+1)}{\Gamma(n-1)(1 - C_{1,n})} (\alpha_1 + \rho)^{k-1} e^{-\frac{\rho}{\alpha_1 + \rho}(n-1)} \frac{n^{k-1}}{(n\rho)^{k-2}} \int_0^{\frac{1}{\alpha_1 + \rho}} (n\rho z)^{k-2} e^{n\rho z} dz. \end{aligned} \quad (25)$

By letting $n\rho z = y$ , we can bound the integral in (25) as

$\begin{aligned} \frac{n^{k-1}}{(n\rho)^{k-2}} \int_0^{\frac{1}{\alpha_1 + \rho}} (n\rho z)^{k-2} e^{n\rho z} dz &= \rho^{-(k-1)} \int_0^{\frac{n\rho}{\alpha_1 + \rho}} y^{k-2} e^y dy \\ &\leq \rho^{-(k-1)} e^{\frac{n\rho}{\alpha_1 + \rho}} \int_0^{\frac{n\rho}{\alpha_1 + \rho}} y^{k-2} dy \end{aligned} \quad (26)$

$= \frac{e^{\frac{n\rho}{\alpha_1 + \rho}}}{k-1} \left( \frac{n}{\alpha_1 + \rho} \right)^{k-1}, \quad \text{if } k \in \mathbb{Z}_{\geq 2} \quad (27)$ where (27) holds only for $k \in \mathbb{Z}{\geq 2}$ since the integral in (26) diverges for $k \in \mathbb{Z}{\leq 1}$ .

By applying (27) to (25), we obtain for $k \in \mathbb{Z}_{\geq 2}$

$\begin{aligned} (*) &\leq \mathbb{P}[\hat{\kappa} \in K(\epsilon)] \frac{e^{\frac{\rho}{\alpha_1+\rho}}}{1-C_{1,n}} \frac{n^{k-1}}{k-1} \frac{\Gamma(n-k+1)}{\Gamma(n-1)} \\ &\leq \frac{e^{1-\frac{\epsilon\alpha_1}{\kappa+\epsilon}n}}{1-C_{1,n}} \frac{\Gamma(n-k+1)}{\Gamma(n-1)} \frac{n^{k-1}}{k-1} = \mathcal{O}(ne^{-n\epsilon}), \end{aligned} \quad (28)$

where (28) can be obtained by Lemma 15 and $\frac{\rho}{\alpha_1+\rho} < 1$ . By combining (28) with (23) and (22), we obtain for $k \in \mathbb{Z}_{\geq 2}$

$\begin{aligned} \mathbb{E} \left[ \sum_{t=K\bar{n}+1}^T \mathbb{1}[j(t) \neq 1, \mathcal{K}_{1,N_1(t)}^c(\epsilon), \mathcal{M}_\epsilon^c(t)] \right] &\leq \sum_{n=\bar{n}}^T \left( \frac{e^{-n}}{1-e^{-n}} + \frac{e^{-n\frac{3\epsilon}{4\mu_1}} + \frac{1}{2}e^{-n\epsilon\mu_1(\epsilon)}}{C_2(\mu_1, \epsilon, k)} + (*) \right) \\ &\leq \sum_{n=\bar{n}}^T \mathcal{O}(e^{-n}) + \mathcal{O}(e^{-n\epsilon}) + \mathcal{O}(e^{-n\epsilon^2}) + \mathcal{O}(ne^{-n\epsilon}) \\ &= \mathcal{O}(1) + \mathcal{O}(\epsilon^{-1}) + \mathcal{O}(\epsilon^{-2}) + \mathcal{O}(\epsilon^{-2}), \end{aligned}$

which concludes the proof under STS with $k \in \mathbb{Z}_{\geq 2}$ .

Under STS-T with $k \in \mathbb{Z}_{\geq 0}$ : Under STS-T, we have the following inequality instead of (23):

$\begin{aligned} \mathbb{E} \left[ \mathbb{1}[\mathcal{K}_n^c(\epsilon)] \frac{p_n(\epsilon|\bar{\theta}_{1,n})}{1-p_n(\epsilon|\bar{\theta}_{1,n})} \right] &\leq \mathbb{P}[\hat{\kappa}_1 \geq \mu_1 - \epsilon] \frac{e^{-n}}{1-e^{-n}} + \mathbb{P}[\hat{\kappa}_1 \in K(\epsilon), \bar{\alpha}_1 < \alpha_1 + \rho] \frac{h(\mu_1, \epsilon, n)}{C_2(\mu_1, \epsilon, k)} \\ &\quad + \underbrace{\mathbb{E}_{\bar{\theta}_{1,n}} \left[ \frac{\mathbb{1}[\hat{\kappa}_1 \in K(\epsilon), \bar{\alpha}_1 \in (\alpha_1 + \rho, n]]}{G_k(1/\bar{\alpha}_1; n)(1-C_{1,n})} \right]}_{(*)}. \end{aligned} \quad (29)$

From $\mathbb{1}[\bar{\alpha}_1(n) < n] = \mathbb{1}[\bar{\alpha}_1(n) = \hat{\alpha}_1(n)]$ , it holds that

$(*) = \mathbb{E}_{\hat{\theta}_{1,n}} \left[ \frac{\mathbb{1}[\hat{\kappa}_1 \in K(\epsilon), \hat{\alpha}_1 \in (\alpha_1 + \rho, n]]}{G_k(1/\hat{\alpha}_1; n)(1-C_{1,n})} \right] + \mathbb{E}_{\bar{\theta}_{1,n}} \left[ \frac{\mathbb{1}[\hat{\kappa}_1 \in K(\epsilon), \bar{\alpha}_1 = n]}{G_k(1/\bar{\alpha}_1; n)(1-C_{1,n})} \right].$

Since $\mathbb{1}[\bar{\alpha}_1(n) = n] = \mathbb{1}[\hat{\alpha}_1(n) \geq n]$ holds and $\hat{\kappa}$ and $\hat{\alpha}$ are independent, we have for $z = \frac{1}{\hat{\alpha}_1} \sim \text{Erlang}(n-1, n\alpha_1)$

$\frac{(*)}{\mathbb{P}[\hat{\kappa}_1 \in K(\epsilon)]} \leq \underbrace{\int_{\frac{1}{n}}^{\frac{1}{\alpha_1+\rho}} \frac{z^{n-2} e^{-n\alpha_1 z}}{G_k(z; n)(1-C_{1,n})} \frac{(n\alpha_1)^{n-1}}{\Gamma(n-1)} dz}_{(\dagger)} + \underbrace{\frac{1}{G_k(1/n; n)(1-C_{1,n})} \mathbb{P} \left[ \frac{1}{\hat{\alpha}_1} \leq \frac{1}{n} \right]}_{(\diamond)}, \quad (30)$

where the same techniques on $(*)$ can be applied to derive an upper bound of $(\dagger)$ . By following the same steps as (25) and (26), we obtain

$\int_{\rho}^{\frac{n\rho}{\alpha+\rho}} y^{k-2} dy \leq \begin{cases} \left( \frac{n\rho}{\alpha+\rho} \right)^{k-1}, & \text{if } k \in \mathbb{Z}_{\geq 2}, \\ \log \left( \frac{n}{\alpha+\rho} \right), & \text{if } k = 1, \\ \rho^{k-1}/(1-k), & \text{if } k \in \mathbb{Z}_{k \leq 0}, \end{cases}$ as a counterpart of the integral in (26). By following the same steps as (27) and (28), we have

$(\dagger) \leq \begin{cases} \frac{\Gamma(n-k+1)}{\Gamma(n-1)} \frac{en^{k-1}}{k-1}, & \text{if } k \in \mathbb{Z}_{\geq 2}, \\ (n-1) \log\left(\frac{n}{\alpha+\rho}\right), & \text{if } k = 1, \\ \frac{\Gamma(n-k+1)}{\Gamma(n-1)} \frac{e}{(1-k)(\alpha+\rho)^{1-k}}, & \text{if } k \in \mathbb{Z}_{k \leq 0}. \end{cases} \quad (31)$

Note that the probability term in $(\diamond)$ is the same as the CDF of the Erlang( $n-1, n\alpha_1$ ) with rate $n\alpha_1$ evaluated at $\frac{1}{n}$ since $\hat{\alpha}_1 \sim \text{IG}(n-1, n\alpha_1)$ from (2). Thus, we have

$\begin{aligned} (\diamond) &= \frac{1}{1 - C_{1,n}} \frac{1}{G_k(1/n; n)} \frac{\gamma(n-1, \alpha_1)}{\Gamma(n-1)} \\ &\leq \frac{e^{\alpha_1+\rho}}{1 - C_{1,n}} \frac{\Gamma(n-k+1)}{(\alpha_1+\rho)^{n-k}} \frac{\gamma(n-1, \alpha_1)}{\Gamma(n-1)} \quad \text{by (24)} \\ &\leq \frac{e^{\alpha_1+\rho}}{1 - C_{1,n}} \frac{\Gamma(n-k+1)}{\Gamma(n-1)} \frac{\alpha_1^{n-1}}{(\alpha_1+\rho)^{n-k}} \end{aligned} \quad (32)$

$\begin{aligned} &\leq \frac{e^{\alpha_1+\rho}}{1 - C_{1,n}} \frac{\Gamma(n-k+1)}{\Gamma(n-1)} \frac{1}{(\alpha_1+\rho)^{1-k}} \\ &= \mathcal{O}(n^{2-k}), \end{aligned} \quad (33)$

where (32) holds from $\gamma(s, x) \leq x^s$ for any $s \geq 1$ and $x > 0$ . Therefore, by combining (31) and (33) with (30) and $\mathbb{P}[\hat{k} \in K(\epsilon)] = \mathcal{O}(e^{-n\epsilon})$ , we have

$(\star) \leq \begin{cases} \mathcal{O}(ne^{-n\epsilon}), & \text{if } k \in \mathbb{Z}_{\geq 2} \\ \mathcal{O}(n \log(n)e^{-n\epsilon}), & \text{if } k = 1, \\ \mathcal{O}(n^{2-k}e^{-n\epsilon}), & \text{if } k \in \mathbb{Z}_{\leq 0}. \end{cases} \quad (34)$

By combining (34) with (29) and (22), we obtain for $k \in \mathbb{Z}_{\geq 0}$

$\begin{aligned} \mathbb{E} \left[ \sum_{t=K\bar{n}+1}^T \mathbb{1}[j(t) \neq 1, \mathcal{K}_{1, N_1(t)}^c(\epsilon), \mathcal{M}_\epsilon^c(t)] \right] &\leq \sum_{n=\bar{n}}^T \left( \frac{e^{-n}}{1 - e^{-n}} + \frac{e^{-n\frac{3\epsilon}{4\mu_1}} + \frac{1}{2}e^{-nc_{\mu_1}(\epsilon)}}{C_2(\mu, \epsilon, k)} + (\star) \right) \\ &\leq \sum_{n=\bar{n}}^T \left( \mathcal{O}(e^{-n}) + \mathcal{O}(e^{-n\epsilon}) + \mathcal{O}(e^{-n\epsilon^2}) \right. \\ &\quad \left. + \mathcal{O}(\psi(n, k)e^{-n\epsilon}) \right) \\ &= \mathcal{O}(1) + \mathcal{O}(\epsilon^{-1}) + \mathcal{O}(\epsilon^{-2}) + \mathcal{O}(\epsilon^{-\max(2, 3-k)}), \end{aligned}$

where

$\psi(n, k) = n\mathbb{1}[k \geq 2] + n \log(n)\mathbb{1}[k = 1] + n^{2-k}\mathbb{1}[k \leq 0].$

Note that the analysis on term $(\star)$ also holds for STS-T with $k \in \mathbb{Z}{< 0}$ . However, differently from the case of $k \in {0, 1}$ , priors with $k \in \mathbb{Z}{< 0}$ have additional problems in Lemma 10 under the event ${\hat{k}_1 \in K(\epsilon), \bar{\alpha}_1(n) \leq \alpha_1 + \rho}$ , where the upper bound becomes a constant $\frac{1}{2}$ . $\square$## C.2 Proof of Lemma 6

Firstly, we state a well-known fact that is utilized in the proof.

Fact 12. When $X \sim \text{Erlang}(n, \beta)$ with rate parameter $\beta$ , then $2\beta X$ follows the chi-squared distribution with $2n$ degree of freedom, i.e., $2\beta X \sim \chi_{2n}^2$ .

Lemma 6. Under STS and STS-T with $k \in \mathbb{Z}$ , it holds that for any $a \in [K]$

$\sum_{t=\bar{n}K+1}^T \mathbb{E}[\mathbb{1}[j(t) = a, \mathcal{M}_\epsilon(t), \mathcal{E}_{a, N_a(t)}(\epsilon)]] \leq \max(0, k) + 1 + \frac{1}{\alpha_a \epsilon} \mathbb{1}[k > 0] + \frac{\log T}{D_{a,k}(\epsilon)}.$

where $D_{a,k}(\epsilon) > 0$ is a finite problem-deterministic constant satisfying $\lim_{\epsilon \rightarrow 0} D_{a,k}(\epsilon) = \text{KL}_{\text{inf}}(a)$ .

Proof. From the sampling rule, it holds that

$\sum_{t=\bar{n}K+1}^T \mathbb{P}[j(t) = a, \tilde{\mu}_1(t) \geq \mu_1 - \epsilon, \mathcal{E}_{a, N_a(t)}(\epsilon)] \leq \sum_{t=\bar{n}K+1}^T \mathbb{P}[j(t) = a, \tilde{\mu}_a(t) \geq \mu_1 - \epsilon, \mathcal{E}_{a, N_a(t)}(\epsilon)].$

Fix a time index $t$ and denote $\mathbb{P}_t[\cdot] = \mathbb{P}[\cdot | \mathcal{F}_t]$ and $N_a(t) = n$ . To simplify notations, we drop the argument $t$ of $\tilde{\kappa}_a(t)$ , $\tilde{\alpha}_a(t)$ and $\tilde{\mu}_a(t)$ and the argument $n$ of $\hat{\kappa}_a(n)$ , $\hat{\alpha}_a(n)$ , $\bar{\alpha}_a(n)$ .

Since $\tilde{\kappa}_a \in (0, \hat{\kappa}_a]$ holds from its posterior distribution, if $\tilde{\alpha}_a \geq \frac{\mu_1 - \epsilon}{\mu_1 - \epsilon - \hat{\kappa}_a}$ holds, then $\tilde{\mu}_a = \frac{\tilde{\kappa}_a \tilde{\alpha}_a}{\tilde{\alpha}_a - 1} \leq \mu_1 - \epsilon$ holds for any $\tilde{\kappa}a$ . Recall that $f{n-k, \frac{n}{\alpha_a, n}}^{\text{Er}}(\cdot)$ denotes a density function of $\text{Erlang}\left(n - k, \frac{n}{\alpha_a, n}\right)$ with rate parameter $\frac{n}{\alpha_a, n}$ , which is the marginalized posterior distribution of $\tilde{\alpha}$ under STS and STS-T. From the CDF of $\tilde{\kappa}$ in (8), if $\hat{\kappa}_a < \mu_1 - \epsilon$ , then

$\begin{aligned} \mathbb{P}_t[\tilde{\mu}_a \geq \mu_1 - \epsilon] &= \mathbb{P}_t[\tilde{\alpha}_a \leq 1] + \mathbb{P}_t\left[\tilde{\kappa}_a \geq \frac{\tilde{\alpha}_a - 1}{\tilde{\alpha}_a}(\mu_1 - \epsilon) \cap \tilde{\alpha}_a \in \left(1, \frac{\mu_1 - \epsilon}{\mu_1 - \epsilon - \hat{\kappa}_a}\right)\right] \\ &= \int_0^1 f_{n-k, \frac{n}{\alpha_a, n}}^{\text{Er}}(x) dx + \int_1^{\frac{\mu_1 - \epsilon}{\mu_1 - \epsilon - \hat{\kappa}_a}} f_{n-k, \frac{n}{\alpha_a, n}}^{\text{Er}}(x) \mathbb{P}_t\left[\tilde{\kappa}_a \geq \frac{x-1}{x}(\mu_1 - \epsilon)\right] dx \\ &= \int_0^1 f_{n-k, \frac{n}{\alpha_a, n}}^{\text{Er}}(x) dx + \int_1^{\frac{\mu_1 - \epsilon}{\mu_1 - \epsilon - \hat{\kappa}_a}} f_{n-k, \frac{n}{\alpha_a, n}}^{\text{Er}}(x) \left(1 - \left(\frac{x-1}{\hat{\kappa}_a}(\mu_1 - \epsilon)\right)^{nx}\right) dx \\ &= \int_0^{\frac{\mu_1 - \epsilon}{\mu_1 - \epsilon - \hat{\kappa}_a}} f_{n-k, \frac{n}{\alpha_a, n}}^{\text{Er}}(x) dx - \int_1^{\frac{\mu_1 - \epsilon}{\mu_1 - \epsilon - \hat{\kappa}_a}} f_{n-k, \frac{n}{\alpha_a, n}}^{\text{Er}}(x) \left(\frac{x-1}{\hat{\kappa}_a}(\mu_1 - \epsilon)\right)^{nx} dx. \end{aligned}$

Since $\frac{x}{x-y}$ is increasing with respect to $y < x$ and $\hat{\kappa} \leq \kappa + \epsilon$ holds on $\mathcal{E}$ , we have for $\mathcal{E}$

$\frac{\mu_1 - \epsilon}{\mu_1 - \epsilon - \hat{\kappa}} \leq \frac{\mu_1 - \epsilon}{\mu_1 - \epsilon - (\kappa + \epsilon)}.$

Let

$\eta_a(\epsilon) = \frac{\kappa_a(\Delta_a - \epsilon) - \epsilon \mu_a}{(\mu_a - \kappa_a)(\mu_1 - \kappa_a - 2\epsilon)} > 0 \quad (35)$ be a deterministic constant that depends only on the model and $\epsilon$ and satisfies

$\begin{aligned} \alpha_a - \eta_a(\epsilon) &= \frac{\mu_a}{\mu_a - \kappa_a} - \frac{\kappa_a(\Delta_a - \epsilon) - \epsilon\mu_a}{(\mu_a - \kappa_a)(\mu_1 - \kappa_a - 2\epsilon)} \\ &= \frac{\mu_a\mu_1 - \kappa_a\mu_a - 2\epsilon\mu_a - \kappa_a(\mu_1 - \mu_a - \epsilon) + \epsilon\mu_a}{(\mu_a - \kappa_a)(\mu_1 - \kappa_a - 2\epsilon)} \\ &= \frac{\mu_1(\mu_a - \kappa_a) - \epsilon(\mu_a - \kappa_a)}{(\mu_a - \kappa_a)(\mu_1 - \kappa_a - 2\epsilon)} \\ &= \frac{\mu_1 - \epsilon}{\mu_1 - \kappa_a - 2\epsilon}. \end{aligned}$

Since $\eta_a(\epsilon) > 0$ , it holds that for any $\epsilon \in (0, \varepsilon_a)$

$\alpha_a - \eta_a(\epsilon) = \frac{\mu_1 - \epsilon}{\mu_1 - \kappa_a - 2\epsilon} \leq \frac{\mu_a}{\mu_a - \kappa_a} = \alpha_a.$

Note that $\frac{\mu}{\mu - \kappa} = \alpha$ holds and the change of $\mu$ to $\mu'$ with fixed $\kappa$ that is $\frac{\mu'}{\mu' - \kappa}$ , implies how the value of the shape parameter $\alpha'$ should be to satisfy $\mu((\kappa, \alpha')) = \mu'$ . For example, $\theta = (\kappa_a + \varepsilon_a, \alpha_a)$ satisfies $\mu(\theta) \leq \mu_a + \frac{\delta_a}{2}$ . Thus, if $\mu((\kappa_a + \varepsilon_a, \alpha)) = \mu_1 - \epsilon > \mu_a + \frac{\delta_a}{2}$ , then $\alpha$ should be smaller than $\alpha_a$ . Hence, we have

$\begin{aligned} &\mathbb{1}[\mathcal{E}_{a,n}(\epsilon)]\mathbb{P}_t[\tilde{\mu}_a \geq \mu_1 - \epsilon] \\ &\leq \mathbb{1}[\mathcal{E}_{a,n}(\epsilon)] \left( \int_0^{\frac{\mu_1 - \epsilon}{\mu_1 - \epsilon - \hat{\kappa}}} f_{n-k, \frac{n}{\alpha_{a,n}}}^{\text{Er}}(x) dx \right. \\ &\quad \left. - \int_1^{\frac{\mu_1 - \epsilon}{\mu_1 - \epsilon - \hat{\kappa}}} f_{n-k, \frac{n}{\alpha_{a,n}}}^{\text{Er}}(x) \left( \frac{x-1}{\hat{\kappa}x} (\mu_1 - \epsilon) \right)^{nx} dx \right) \\ &\leq \mathbb{1}[\mathcal{E}_{a,n}(\epsilon)] \int_0^{\frac{\mu_1 - \epsilon}{\mu_1 - \epsilon - \hat{\kappa}}} f_{n-k, \frac{n}{\alpha_{a,n}}}^{\text{Er}}(x) dx \tag{36} \end{aligned}$

$\leq \mathbb{1}[\mathcal{E}_{a,n}(\epsilon)] \int_0^{\alpha_a - \eta_a(\epsilon)} f_{n-k, \frac{n}{\alpha_{a,n}}}^{\text{Er}}(x) dx = \mathbb{1}[\mathcal{E}_{a,n}(\epsilon)]\mathbb{P}_t[\tilde{\alpha}_a(t) \leq \alpha_a - \eta_a(\epsilon)]. \tag{37}$

Therefore, by taking expectation and using Fact 12, we have

$\begin{aligned} \mathbb{P}[\tilde{\mu}_a(t) \geq \mu_1 - \epsilon, \mathcal{E}_{a,n}(\epsilon)] &\leq \mathbb{P}[\tilde{\alpha}_a \leq \alpha_a - \eta_a(\epsilon), \mathcal{E}_{a,n}(\epsilon)], \\ &= \mathbb{P}\left[Z \leq \frac{2n}{\alpha_{a,n}} (\alpha - \eta_a(\epsilon)), \mathcal{E}_{a,n}(\epsilon)\right] \tag{38} \end{aligned}$

where $Z$ is a random variable following the chi-squared distribution with $2(n - k)$ degrees of freedom, i.e., $Z \sim \chi_{2n-2k}^2$ .

C.2.1 Under STS

Here, we first consider the case of STS where we replace $\alpha_{a,n}$ with $\hat{\alpha}_a(n)$ .

Since $\hat{\alpha}a \in [\alpha_a - \epsilon{a,l}, \alpha_a + \epsilon_{a,u}]$ holds on $\mathcal{E}_{a,n}(\epsilon)$ , we have

$\frac{1}{\alpha_a} - \epsilon \left( 1 + \frac{1}{\kappa_a} \right) = \frac{1}{\alpha_a + \epsilon_{a,u}} \leq \frac{1}{\hat{\alpha}_a} \leq \frac{1}{\alpha_a - \epsilon_{a,l}} = \frac{1}{\alpha_a} + \epsilon \tag{39}$ by the definition of $\epsilon_{a,l}(\epsilon)$ and $\epsilon_{a,u}(\epsilon)$ in (12).

By replacing $\alpha_{a,n}$ with $\hat{\alpha}_a(n)$ in (38) and applying (39), we have

$\begin{aligned} \mathbb{P}[\tilde{\mu}_a(t) \geq \mu_1 - \epsilon, \mathcal{E}_{a,n}(\epsilon)] &\leq \mathbb{P}\left[Z \leq \frac{2n}{\hat{\alpha}_a}(\alpha_a - \eta_a(\epsilon)), \mathcal{E}_{a,n}(\epsilon)\right] \\ &\leq \mathbb{P}\left[Z \leq 2n\left(\frac{1}{\alpha_a} + \epsilon\right)(\alpha_a - \eta_a(\epsilon))\right] \\ &= \mathbb{P}\left[Z \leq 2(n-k)\frac{n}{n-k}\left(\frac{1}{\alpha_a} + \epsilon\right)(\alpha_a - \eta_a(\epsilon))\right]. \end{aligned} \quad (40)$

Priors with $k \in \mathbb{Z}_{\leq 0}$ . Let us first consider the case $k \in \mathbb{Z}_{\leq 0}$ , where we have $\frac{n}{n-k} \leq 1$ . It holds that

$\begin{aligned} \mathbb{P}[\tilde{\mu}_a(t) \geq \mu_1 - \epsilon, \mathcal{E}_{a,n}(\epsilon)] &\leq \mathbb{P}\left[Z \leq 2(n-k)\frac{n}{n-k}\left(\frac{1}{\alpha_a} + \epsilon\right)(\alpha_a - \eta_a(\epsilon))\right] \\ &\leq \mathbb{P}\left[Z \leq 2(n-k)\left(\frac{1}{\alpha_a} + \epsilon\right)(\alpha_a - \eta_a(\epsilon))\right]. \end{aligned}$

Note that the definition of $\epsilon_a$ in Theorem 2 is set to satisfy $(\frac{1}{\alpha} + \epsilon)(\alpha - \eta_a(\epsilon)) < 1$ for any $\epsilon \leq \epsilon_a$ . Thus, we can apply Lemma 17, which shows

$\mathbb{P}\left[Z \leq 2(n-k)\left(1 - \frac{\eta_a(\epsilon)}{\alpha_a} + \epsilon(\alpha_a - \eta_a(\epsilon))\right)\right] \leq e^{-(n-k)D_{a,k}(\epsilon)}, \quad (41)$

where

$\begin{aligned} D_{a,k}(\epsilon) := -\log\left(1 - \frac{\eta_a(\epsilon)}{\alpha_a} + (\max(0, k) + 1)\epsilon(\alpha_a - \eta_a(\epsilon))\right) \\ - \frac{\eta_a(\epsilon)}{\alpha_a} + (\max(0, k) + 1)\epsilon(\alpha_a - \eta_a(\epsilon)) \end{aligned} \quad (42)$

is a finite constant that only depends on the model parameters, $\epsilon$ , and prior parameter $k$ .

For arbitrary $n_a > 0$ , applying (41) to (38) gives

$\begin{aligned} \sum_{t=\bar{n}K+1}^T \mathbb{E}[\mathbb{1}[j(t) = a, \tilde{\mu}_1(t) \geq \mu_1 - \epsilon, \mathcal{E}_{a,N_a(t)}(\epsilon)]] \\ \leq \sum_{t=\bar{n}K+1}^T \mathbb{P}[j(t) = a, \tilde{\mu}_a(t) \geq \mu_1 - \epsilon, \mathcal{E}_{a,n}(\epsilon)] \\ \leq n_a + \sum_{t=\bar{n}K+1}^T \mathbb{P}[\tilde{\mu}_a(t) \geq \mu_1 - \epsilon, \mathcal{E}_{a,N_a(t)}(\epsilon), N_a(t) \geq n_a] \\ \leq n_a + \sum_{t=\bar{n}K+1}^T e^{-(n_a-k)D_{a,k}(\epsilon)} \\ \leq n_a + \sum_{t=\bar{n}K+1}^T e^{-n_a D_{a,k}(\epsilon)} = n_a + T e^{-n_a D_{a,k}(\epsilon)}. \end{aligned}$

Letting $n_a = \frac{\log T}{D_{a,k}(\epsilon)}$ concludes the cases of priors with $k \in \mathbb{Z}{\leq 0}$ .Priors with $k \in \mathbb{Z}_{>0}$ Next, consider the case $k \in \mathbb{Z}{>0}$ . Recall that we first play every arm $k + 1$ times if $k > 0$ , which implies that $n - k > 0$ . For $n \geq \frac{1}{\alpha\epsilon} + k + 1$ , it holds that

$\frac{n}{n-k} \left( \frac{1}{\alpha} + \epsilon \right) \leq \frac{1}{\alpha} + (k+1)\epsilon. \quad (43)$

By applying (43) to (38), we have for $n \geq \frac{1}{\alpha\epsilon} + k + 1$ ,

$\mathbb{P}[\tilde{\alpha}_a \leq \alpha_a - \eta_a(\epsilon), \mathcal{E}_{a,N_a(t)}(\epsilon)] \leq \mathbb{P} \left[ Z \leq 2(n-k) \left( 1 - \frac{\eta_a(\epsilon)}{\alpha_a} + (k+1)\epsilon(\alpha_a - \eta_a(\epsilon)) \right) \right].$

Similarly, by applying Lemma 17, one can see that for $n \geq \frac{1}{\alpha_a\epsilon} + k + 1$

$\mathbb{P}[\tilde{\alpha}_a \leq \alpha_a - \eta_a(\epsilon), \mathcal{E}_{a,N_a(t)}(\epsilon)] \leq e^{-(n-k)D_{a,k}(\epsilon)}, \quad (44)$

where $D_{a,k}(\epsilon)$ is defined in (42).

When $k \in \mathbb{Z}_{>0}$ , let $n_a \geq \frac{1}{\alpha_a\epsilon} + k + 1$ be arbitrary. By applying (44) to (38) again, we have

$\begin{aligned} & \sum_{t=\bar{n}K+1}^T \mathbb{E}[\mathbb{1}[j(t) = a, \tilde{\mu}_1(t) \geq \mu_1 - \epsilon, \mathcal{E}_{a,N_a(t)}(\epsilon)]] \\ & \leq \sum_{t=\bar{n}K+1}^T \mathbb{P}[j(t) = a, \tilde{\mu}_a(t) \geq \mu_1 - \epsilon, \mathcal{E}_{a,n}(\epsilon)] \\ & \leq n_a + \sum_{t=\bar{n}K+1}^T \mathbb{P}[\tilde{\mu}_a(t) \geq \mu_1 - \epsilon, \mathcal{E}_{a,N_a(t)}(\epsilon), N_a(t) \geq n_a] \\ & \leq n_a + \sum_{t=\bar{n}K+1}^T e^{-(n_a-k)D_{a,k}(\epsilon)} = n_a + Te^{-(n_a-k)D_{a,k}(\epsilon)}. \end{aligned}$

Letting $n_a = k + 1 + \frac{1}{\alpha_a\epsilon} + \frac{\log T}{D_{a,k}(\epsilon)}$ concludes the cases of priors with $k > 0$ .

C.2.2 Under STS-T

Here, we consider the case of STS-T where we replace $\alpha_{a,n}$ with $\bar{\alpha}_a(n) = \min(\hat{\alpha}_a(n), n)$ . From the definition of $\bar{\alpha}_a(n)$ , it holds that for $\epsilon \leq \varepsilon_a$

$\forall n \geq \alpha_a + 1 : \mathbb{1}[\bar{\alpha}_a(n) = \hat{\alpha}_a(n), \mathcal{A}_{a,n}(\epsilon)] = 1.$

Therefore, for $n \geq \alpha_a + 1$ , the analysis on STS can be applied to STS-T directly.

Let us consider the case where $\bar{\alpha}a(n) = n < \alpha_a + 1$ holds under the condition $\mathcal{A}{a,n}(\epsilon)$ . By replacing $\alpha_{a,n}$ with $n$ in (38) and following the same steps as in (38) and (41), we have for any $k \in \mathbb{Z}$ ,

$\begin{aligned} \mathbb{P}[\tilde{\mu}_a(t) \geq \mu_1 - \epsilon, \mathcal{E}_{a,n}(\epsilon)] & \leq \mathbb{P} \left[ Z \leq \frac{2n}{n} (\alpha_a - \eta_a(\epsilon)), \mathcal{E}_{a,n}(\epsilon) \right] \\ & \leq \mathbb{P} \left[ Z \leq 2(n-k) \frac{1}{n-k} \left( \frac{1}{\alpha_a} + \epsilon \right) (\alpha_a - \eta_a(\epsilon)), \mathcal{E}_{a,n}(\epsilon) \right] \\ & \leq \mathbb{P} \left[ Z \leq 2(n-k) \left( \frac{1}{\alpha_a} + \epsilon \right) (\alpha_a - \eta_a(\epsilon)), \mathcal{E}_{a,n}(\epsilon) \right] \\ & \leq e^{-(n-k)D_{a,k}(\epsilon)}, \end{aligned}$ where $D_{a,k}(\epsilon)$ defined in (42). Therefore, the same result follows by the analysis in Section C.2.1.

C.2.3 Asymptotic behavior of $D_{a,k}(\epsilon)$

Finally, we show that $\lim_{\epsilon \rightarrow 0} D_{a,k}(\epsilon) = \text{KL}_{\text{inf}}(a)$ . From their definitions of $\eta_a(\epsilon)$ in (35) and $\Delta_a = \mu_1 - \mu_a$ , we have

$\begin{aligned} \lim_{\epsilon \rightarrow 0} \eta_a(\epsilon) &= \lim_{\epsilon \rightarrow 0} \frac{\kappa_a(\Delta_a - \epsilon) - \epsilon\mu_a}{(\mu_a - \kappa_a)(\mu_1 - \kappa_a - 2\epsilon)} \\ &= \frac{\kappa_a\Delta_a}{(\mu_a - \kappa_a)(\mu_1 - \kappa_a)} = \frac{\kappa_a(\mu_1 - \mu_a)}{(\mu_a - \kappa_a)(\mu_1 - \kappa_a)} = (\alpha_a - 1) \frac{\mu_1 - \mu_a}{\mu_1 - \kappa_a}. \end{aligned}$

Then, it holds that

$\begin{aligned} \lim_{\epsilon \rightarrow 0} 1 - \frac{\eta_a(\epsilon)}{\alpha_a} &= 1 - \left( \frac{\alpha_a - 1}{\alpha_a} \frac{\mu_1 - \mu_a}{\mu_1 - \kappa_a} \right) \\ &= \frac{\alpha_a(\mu_1 - \kappa_a) - (\alpha_a - 1)(\mu_1 - \mu_a)}{\alpha_a(\mu_1 - \kappa_a)} \\ &= \frac{\alpha_a(\mu_a - \kappa_a) + \mu_1 - \mu_a}{\alpha_a(\mu_1 - \kappa_a)} \\ &= \frac{\mu_a}{\mu_a - \kappa_a} \frac{(\mu_a - \kappa_a)}{\alpha_a(\mu_1 - \kappa_a)} + \frac{\mu_1 - \mu_a}{\alpha_a(\mu_1 - \kappa_a)} \quad \because \alpha_a = \frac{\mu_a}{\mu_a - \kappa_a} \\ &= \frac{\mu_1}{\alpha_a(\mu_1 - \kappa_a)} =: \frac{1}{X_a}, \end{aligned}$

Therefore, from the definition of $D_{a,k}(\epsilon)$ in (42)

$\begin{aligned} \lim_{\epsilon \rightarrow 0} D_{a,k}(\epsilon) &= \lim_{\epsilon \rightarrow 0} \left[ -\log \left( 1 - \frac{\eta_a(\epsilon)}{\alpha_a} + (\max(0, k) + 1)(\alpha_a - \eta_a(\epsilon))\epsilon \right) \right. \\ &\quad \left. - \frac{\eta_a(\epsilon)}{\alpha_a} + (\max(0, k) + 1)(\alpha_a - \eta_a(\epsilon))\epsilon \right] \\ &= -\log \left( 1 - \lim_{\epsilon \rightarrow 0} \frac{\eta_a(\epsilon)}{\alpha_a} \right) - \lim_{\epsilon \rightarrow 0} \frac{\eta_a(\epsilon)}{\alpha_a} \\ &= -\log \frac{1}{X_a} + \frac{1}{X_a} - 1 = \log X_a + \frac{1}{X_a} - 1 \\ &= \log \left( \alpha_a \frac{\mu_1 - \kappa_a}{\mu_1} \right) + \frac{\mu_1}{\alpha_a(\mu_1 - \kappa_a)} - 1 = \text{KL}_{\text{inf}}(a), \end{aligned}$

where the last equality comes from Lemma 1. $\square$

C.3 Proof of Lemma 7

Firstly, we state two well-known facts that are utilized in the proof.

Fact 13. When $X \sim \text{Pa}(\kappa, \alpha)$ with the scale parameter $\kappa \in \mathbb{R}$ and rate parameter $\alpha \in \mathbb{R}_+$ , then $\log \left( \frac{X}{\kappa} \right)$ follows the exponential distribution with rate $\alpha$ , i.e., $\log \left( \frac{X}{\kappa} \right) \sim \text{Exp}(\alpha)$ .

Buckets:

huggingchat
/

papers-content

Optimality of Thompson Sampling with Noninformative Priors for Pareto Bandits

Abstract

1 Introduction

2 Preliminaries

2.1 Notations

2.2 Asymptotic lower bound

2.3 Relation with bounded moment models

3 Thompson sampling and probability matching priors

3.1 Sampling procedure

3.2 Interpretation of the prior parameter $k$

4 Main results

5 Experiments

6 Proof outline of optimal results

Acknowledgement

Bibliography

C.2.1 Under STS

C.2.2 Under STS-T

C.2.3 Asymptotic behavior of $D_{a,k}(\epsilon)$

C.3 Proof of Lemma 7

Xet Storage Details