Finite Budget Analysis of Multi-armed Bandit Problems

Yingce Xia1, Tao Qin2, Wenkui Ding3, Haifang Li4, Xu-Dong Zhang3, Nenghai Yu1, Tie-Yan Liu2

1 University of Science and Technology of China  2 Microsoft Research  3 Tsinghua University  4 University of Chinese Academy of Sciences
1 [email protected], [email protected]; 2 {taoqin,tie-yan.liu}@microsoft.com; 3 [email protected], [email protected]; 4 [email protected]

Abstract

In the budgeted multi-armed bandit (MAB) problem, a player receives a random reward and needs to pay a cost after pulling an arm, and he cannot pull any more arms after running out of budget. In this paper, we give an extensive study of upper confidence bound based algorithms and a greedy algorithm for budgeted MABs. We perform theoretical analysis on the proposed algorithms, and show that they all enjoy sublinear regret bounds with respect to the budget B. Furthermore, by carefully choosing the parameters, they can even achieve log-linear regret bounds. We also prove that the asymptotic lower bound for budgeted Bernoulli bandits is Ω(ln B). Our proof technique can be used to improve the theoretical results for fractional KUBE [27] and Budgeted Thompson Sampling [31].

Keywords: Budgeted Multi-Armed Bandits, UCB Algorithms, Regret Analysis

1. Introduction

Multi-armed bandits (MAB) correspond to a typical sequential decision problem, in which a player receives a random reward by playing one of K arms of a slot machine at each round and wants to maximize his cumulative reward. Many real-world applications can be modeled as MAB problems, such as auction mechanism design [25], search advertising [28], UGC mechanism design [18], and personalized recommendation [23]. Many algorithms have been designed for MAB problems and studied from both theoretical and empirical perspectives, like UCB1, εn-GREEDY [6], UCB-V [4], LinRel [5], DMED [19], and KL-UCB [17]. A survey on MAB can be found in [12].

Most of the aforementioned works assume that playing an arm is costless. However, in many real applications, including the real-time bidding problem in ad exchanges [13], the bid optimization problem in sponsored search [10], the on-spot instance bidding problem in Amazon EC2 [9], and the cloud service provider selection problem in IaaS [2], one needs to pay some cost to play an arm and the number of plays is constrained by a budget. To model these applications, a new kind of MAB problem, called budgeted MAB, has been proposed and studied in recent years, in which the play of an arm is associated with both a random reward and a cost. Depending on the setting of budgeted MAB, the cost could be either deterministic or random, and either discrete or continuous.

In the literature, a few algorithms have been developed for some particular settings of the budgeted MAB problem. The setting of deterministic costs was studied in [27], and two algorithms named KUBE and fractional KUBE were proposed, which learn the probability of pulling an arm by solving an integer programming problem. It has been proven that these two algorithms can lead to a regret bound of O(ln B). The setting of random discrete costs was studied in [14], and two upper confidence bound (UCB) based algorithms with specifically designed (complex) indexes were proposed; log-linear regret bounds were derived for both algorithms.

The above algorithms for the budgeted MAB problem could only address some (but not all) settings of budgeted MAB. Given the tight connection between budgeted MAB and standard MAB problems, an interesting question to ask is whether some extensions of the algorithms originally designed for standard MAB (without budget constraints) could be good enough to fulfill the budgeted MAB tasks, and perhaps in a more general way. [31] shows that a simple extension of the Thompson sampling algorithm, which is designed for standard MAB, works quite well for budgeted MAB with very general settings. Inspired by that work, we are interested in whether we can handle budgeted MAB by extending other algorithms designed for standard MAB.

In order to answer this question, we study the following natural extensions of the UCB and εn-GREEDY algorithms [6] in this paper (these extensions do not need to know the budget B in advance). We first propose four basic algorithms for budgeted MABs: (1) i-UCB, which replaces the average reward of an arm in the exploitation term of UCB1 by the average reward-to-cost ratio; (2) c-UCB, which further incorporates the average cost of an arm into the exploration term of UCB1 (c-UCB can be regarded as an adaptive version of i-UCB); (3) m-UCB, which mixes the upper confidence bound of reward and the lower confidence bound of cost; (4) b-GREEDY, which replaces the average reward in εn-GREEDY with the average reward-to-cost ratio.

We conduct theoretical analysis on these algorithms, and show that they all enjoy sublinear regret bounds with respect to B. By carefully setting the hyper-parameter in each algorithm, we show that the regret bounds can be further improved to be log-linear. Although the basic idea of regret analysis for the algorithms for budgeted MAB is similar to that for standard MAB, i.e., we only need to bound the number of plays of all the suboptimal arms (the arms whose expected-reward-to-expected-cost ratios are not the maximum), there are two challenges to address compared with the standard MAB (without budgets). First, there are two random factors, rewards and costs, that can influence the choice of arm simultaneously, which brings difficulties when decomposing the probabilities that suboptimal arms are pulled. Second, the stopping time (i.e., the round at which a pulling algorithm runs out of budget and stops) is a random variable related to the costs of each arm; as a result, the independence of the costs at different rounds is destroyed, which brings difficulties when applying concentration inequalities. To address the first challenge, we introduce the $\delta$-gap (5), with which we can separate the terms w.r.t. the ratio of rewards and costs into terms related to rewards only and costs only. To address the second one, we make a decomposition like (8): before round $\lfloor 2B/\mu^c_{\min}\rfloor$ (where $\mu^c_{\min}$ is the minimum expected cost among all the arms), we only consider the event of pulling a suboptimal arm; after round $\lfloor 2B/\mu^c_{\min}\rfloor$, we only consider the event that there is remaining budget at each round. Combining these two techniques with those for standard MAB, we can derive the regret bounds for the aforementioned four algorithms.

Furthermore, we give a lower bound for the budgeted Bernoulli MAB (the reward and cost of a budgeted Bernoulli bandit are either 0 or 1), and show that our proposed algorithms can match the lower bound (by carefully setting the hyper-parameters).

In addition to the theoretical analysis, we also conduct an empirical study of these algorithms. We simulate two bandits, one with 10 arms and the other with 50 arms. For each bandit, we consider two sub-cases with different distributions for rewards and costs: the Bernoulli distribution in the first sub-case and the Beta distribution in the second one. The simulation results show that our extensions work surprisingly well for both bandits and both sub-cases, and can achieve comparable or better performance than existing algorithms.

The Budget-UCB algorithm proposed in [30] can be seen as a combined version of c-UCB and m-UCB. With our proposed δ-gap, the regret of Budget-UCB can be improved.

Besides the literature mentioned above, there also exist some works studying MAB problems with multiple budget constraints. For example, [7] studies the bandits-with-knapsacks (BwK) setting: the total number of plays is constrained by a predefined number T and the total cost of the plays is constrained by a monetary budget B. [7] studies the distribution-free regret bound for BwK, while [16] provides a distribution-dependent bound. In [1], the above setting is extended to arms with concave rewards. Furthermore, in [8] contextual bandits are considered. At first glance, these settings seem to be more general than the budgeted MAB problem studied in this paper. We would like to point out that their algorithms and theoretical analysis cannot be directly applied here, because in their settings the total number of plays T is critical to their algorithms and analysis; however, such a number T does not exist in the setting under our investigation.

2. Problem Setup

A budgeted MAB problem can be described as follows. A player is facing a slot machine with $K$ arms ($K \ge 2$). At round $t$, he pulls an arm $i \in [K]$ (for ease of reference, let $[K]$ denote the set $\{1, 2, \cdots, K\}$), receives a random reward $r_i(t)$, and pays a cost $c_i(t)$, until his budget $B$ runs out. Both the rewards and costs are supported in $[0, 1]$. There can be different settings depending on the values of the costs, which could be either deterministic or random, either discrete or continuous. In this work, we mainly focus on the setting with random costs, since deterministic costs can be regarded as a special case of random ones. When analyzing the regret bounds of our proposed algorithms, the costs could be either discrete or continuous. Following the common practice in standard MAB, we assume independence between arms and rounds: the rewards and costs of an arm are independent of any other arm, and the rewards (and costs) of arm $i$ at different rounds $t$ are independently drawn from the same distribution with expectation $\mu^r_i$ (and $\mu^c_i$). We do not need to assume that the rewards of an arm are independent of its costs. Without loss of generality, we assume $0 < \mu^r_i, \mu^c_i < 1$ for any $i \in [K]$ and $\frac{\mu^r_1}{\mu^c_1} > \frac{\mu^r_i}{\mu^c_i}$ for any $i \ne 1$. We name arm 1 the optimal arm and the others suboptimal arms. The goal of the player is to maximize his total expected reward before the budget runs out.

We introduce some notations that will be frequently used throughout the text. Let $I_t$ denote the arm chosen at round $t$. $\mathbb{1}\{\cdot\}$ is the indicator function. Let $r_i(t)$ and $c_i(t)$ denote the reward and cost of arm $i \in [K]$ at round $t$; note that $r_i(t)$ and $c_i(t)$ are visible to the player only if arm $i$ is pulled at round $t$. Let $n_{i,t}$, $r_{i,t}$ and $c_{i,t}$ denote the number of pulls, the average reward, and the average cost of arm $i$ before (excluding) round $t$, respectively. Mathematically,

$$n_{i,t} = \sum_{s=1}^{t-1}\mathbb{1}\{I_s = i\}, \qquad r_{i,t} = \frac{1}{n_{i,t}}\sum_{s=1}^{t-1} r_i(s)\mathbb{1}\{I_s = i\}, \qquad c_{i,t} = \frac{1}{n_{i,t}}\sum_{s=1}^{t-1} c_i(s)\mathbb{1}\{I_s = i\}. \qquad (1)$$

We often use the pseudo regret to evaluate an algorithm (denoted as $R_a$ for algorithm "a"), which is defined as follows:

$$R_a = R^* - \mathbb{E}\Big[\sum_{t=1}^{\infty} r_{I_t}(t)\,\mathbb{1}\{B_t \ge 0\}\Big], \qquad (2)$$

where $R^*$ is the expected reward of the optimal policy (the policy that obtains the maximum expected reward when the reward distribution and the cost distribution of each arm are known in advance), $B_t$ is the remaining budget at round $t$ (i.e., $B_t = B - \sum_{s=1}^{t} c_{I_s}(s)$), and the expectation is taken with respect to the randomness of the algorithm, the received rewards/costs, and the stopping time. Maximizing the expected reward is equivalent to minimizing the pseudo regret.

Please note that it could be very complex to obtain the optimal policy for the budgeted MAB problem (even under the condition that the reward and cost distributions of each arm are known). Even for its degenerate case, where the reward and cost of each arm are deterministic, the problem is known to be NP-hard (in this case the problem becomes an unbounded knapsack problem [24]). Therefore, generally speaking, it is hard to compute $R^*$ exactly.
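To make definition (2) concrete, the following minimal Python sketch simulates one run of a pulling policy and estimates its pseudo regret. The `policy` and `bandit` interfaces are our own illustrative stand-ins (not from the paper), and $R^*$ is replaced by the upper bound $(\mu^r_1/\mu^c_1)(B+1)$ from Lemma 1 below.

```python
def run_once(policy, bandit, budget):
    """Play until the budget runs out; count reward only while the remaining budget B_t >= 0."""
    total_reward, remaining = 0.0, float(budget)
    while remaining > 0:
        arm = policy.select_arm()            # I_t (hypothetical policy interface)
        reward, cost = bandit.pull(arm)      # r_{I_t}(t), c_{I_t}(t) (hypothetical bandit interface)
        remaining -= cost                    # B_t = B - sum_{s<=t} c_{I_s}(s)
        if remaining >= 0:
            total_reward += reward           # the indicator 1{B_t >= 0} in (2)
        policy.update(arm, reward, cost)
    return total_reward

def pseudo_regret(total_reward, mu_r, mu_c, budget):
    """Approximate R_a = R* - total reward, replacing R* by its upper bound from Lemma 1."""
    best_ratio = max(r / c for r, c in zip(mu_r, mu_c))
    return best_ratio * (budget + 1) - total_reward
```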

However, we find that it is much easier to approximate the optimal policy and to upper bound $R^*$. Specifically, when the reward and cost per pull are supported in $[0, 1]$ and $B$ is large, always pulling the optimal arm is very close to the optimal policy.¹ The result is summarized in Lemma 1, together with an upper bound on $R^*$. The proof of Lemma 1 can be found in the full version of [31].

Lemma 1. When the reward and cost per pull are supported in $[0, 1]$, we have $R^* \le (\mu^r_1/\mu^c_1)(B + 1)$, and the suboptimality of always pulling arm 1 (as compared to the optimal policy) is at most $2\mu^r_1/\mu^c_1$.

¹ This is inspired by the greedy heuristic for the knapsack problem [15], i.e., at each round one selects the item with the maximum value-to-weight ratio. Although there are many approximation algorithms for the knapsack problem, such as the total-value greedy heuristic [20] and the FPTAS [29], we find that under our budgeted MAB setting they do not bring much benefit in tightening the bound on $R^*$.

3. Algorithms

We present two kinds of algorithms for budgeted multi-armed bandit problems. As pointed out in Section 2, although the optimal policy is quite complex for budgeted MAB, always pulling the optimal arm brings almost the same expected reward as the optimal policy. Therefore, our proposed algorithms target pulling the optimal arm as frequently as possible, with some tradeoff between exploration (on the less pulled arms) and exploitation (on the empirically best arms).

3.1. Extensions of UCB Based Algorithms

UCB algorithms [6, 4, 33] are widely used and well studied for standard MAB. The main idea of UCB is to play the arm with the largest index, where the index of an arm is the sum of two terms: an exploitation term (the average reward $r_{i,t}$ of the arm), and an exploration term (the uncertainty of the average reward as an estimator of the true expected reward).

When we have a cost for each arm and a budget constraint, it is straightforward to replace the average reward in the exploitation term of UCB1 by the average reward-to-cost ratio (i.e., $r_{i,t}/c_{i,t}$). By doing so, we get the so-called i-UCB algorithm, in which the exploration term is defined as

$$E^{\alpha}_{i,t} = \alpha\sqrt{\ln(t-1)/n_{i,t}}, \qquad (3)$$

where $\alpha$ is a positive hyper-parameter. Please note that such a hyper-parameter also exists in [23]; it makes the algorithm flexible. Here, "i" stands for "independence", since the exploration term is independent of the average reward and cost. The index² for i-UCB is $\mathcal{D}^i_{i,t}$ in (4).

² When the denominator is 0, the index is defined as (i) 0 when the numerator is 0; (ii) infinity when the numerator is larger than zero.

Note that the exploration term of i-UCB only considers the number of pulls of an arm (i.e., a less frequently pulled arm is more likely to be explored), and does not take costs into consideration. By incorporating the average cost of an arm into the exploration term, we get another algorithm called c-UCB; here, "c" stands for "cost". In c-UCB, an arm that was pulled less frequently and/or has smaller costs is more likely to be explored. The index for c-UCB is $\mathcal{D}^c_{i,t}$ in (4).

Furthermore, by mixing the exploitation and exploration terms together, we get the m-UCB algorithm, where "m" stands for "mixed". As can be seen, m-UCB plays arms according to the ratio of the upper confidence bound of the reward to the lower confidence bound of the cost. Here the upper confidence bound of the reward is capped at 1 and the lower confidence bound of the cost is floored at zero, so that both remain supported in [0, 1]. The index for m-UCB is $\mathcal{D}^m_{i,t}$ in (4).

$$\mathcal{D}^{i}_{i,t} = \frac{r_{i,t}}{c_{i,t}} + E^{\alpha}_{i,t}, \qquad \mathcal{D}^{c}_{i,t} = \frac{r_{i,t}}{c_{i,t}} + \frac{E^{\alpha}_{i,t}}{c_{i,t}}, \qquad \mathcal{D}^{m}_{i,t} = \frac{\min\{r_{i,t} + E^{\alpha}_{i,t},\, 1\}}{\max\{c_{i,t} - E^{\alpha}_{i,t},\, 0\}}. \qquad (4)$$

The details of the UCB based algorithms are summarized in Algorithm 1.

Algorithm 1: UCB Based Algorithms
1 Input: hyper-parameter α (> 0); one index defined in (4);
2 Pull each arm once in the first K rounds and set t ← K + 1;
3 while the budget has not run out do
4   Compute the index for each arm;
5   Pull the arm I_t with the largest index;
6   Update n_{I_t,t}, r_{I_t,t}, c_{I_t,t} and the remaining budget; set t ← t + 1.
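For readers who prefer code, here is a minimal Python sketch of Algorithm 1 under the paper's setting (rewards and costs in [0, 1]). The class and variable names are our own; the index formulas follow (3) and (4), and ties are broken by the first maximizer rather than randomly for brevity.

```python
import math

class BudgetedUCB:
    """Sketch of Algorithm 1 with the i-UCB / c-UCB / m-UCB indices of eq. (4)."""
    def __init__(self, n_arms, alpha, variant="c"):
        self.K, self.alpha, self.variant = n_arms, alpha, variant
        self.n = [0] * n_arms        # n_{i,t}: number of pulls
        self.r = [0.0] * n_arms      # average reward r_{i,t}
        self.c = [0.0] * n_arms      # average cost  c_{i,t}
        self.t = 1                   # current round

    def _index(self, i):
        e = self.alpha * math.sqrt(math.log(self.t - 1) / self.n[i])   # E^alpha_{i,t}, eq. (3)
        if self.c[i] == 0:                                             # footnote-2 convention
            ratio = float("inf") if self.r[i] > 0 else 0.0
        else:
            ratio = self.r[i] / self.c[i]
        if self.variant == "i":                                        # i-UCB
            return ratio + e
        if self.variant == "c":                                        # c-UCB
            return ratio + (float("inf") if self.c[i] == 0 else e / self.c[i])
        num = min(self.r[i] + e, 1.0)                                  # m-UCB
        den = max(self.c[i] - e, 0.0)
        return float("inf") if den == 0 else num / den

    def select_arm(self):
        if self.t <= self.K:                 # pull each arm once in the first K rounds
            return self.t - 1
        return max(range(self.K), key=self._index)

    def update(self, i, reward, cost):
        self.n[i] += 1
        self.r[i] += (reward - self.r[i]) / self.n[i]
        self.c[i] += (cost - self.c[i]) / self.n[i]
        self.t += 1
```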

3.2. Extension of a Greedy Algorithm

The εt-GREEDY algorithm [6] is a simple and well-known algorithm for standard MAB, which plays, with probability 1 − εt, the arm with the largest average reward, and, with probability εt, a randomly chosen arm. It can be easily extended to budgeted MAB by replacing the average reward of each arm with its average reward-to-cost ratio. For ease of reference, we call the extended algorithm b-GREEDY. Algorithm 2 shows the details of the b-GREEDY algorithm.

Algorithm 2: b-GREEDY
1 Input: hyper-parameter α (> 0); t ← 1;
2 while the budget has not run out do
3   Set εt = min{1, αK/t}; with probability 1 − εt, play the arm with the highest average reward-to-cost ratio; with probability εt, play a randomly chosen arm;
4   Update n_{I_t,t}, r_{I_t,t}, c_{I_t,t} and the remaining budget; set t ← t + 1.
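A corresponding Python sketch of Algorithm 2, again with our own illustrative names and the same running-average bookkeeping as in the previous sketch:

```python
import random

class BGreedy:
    """Sketch of b-GREEDY: epsilon_t-greedy on the average reward-to-cost ratio."""
    def __init__(self, n_arms, alpha):
        self.K, self.alpha, self.t = n_arms, alpha, 1
        self.n = [0] * n_arms
        self.r = [0.0] * n_arms
        self.c = [0.0] * n_arms

    def select_arm(self):
        eps = min(1.0, self.alpha * self.K / self.t)       # epsilon_t = min{1, alpha*K/t}
        if random.random() < eps:
            return random.randrange(self.K)                # explore uniformly at random
        def ratio(i):                                      # footnote-2 convention for c = 0
            if self.c[i] == 0:
                return float("inf") if self.r[i] > 0 else 0.0
            return self.r[i] / self.c[i]
        return max(range(self.K), key=ratio)               # exploit the best empirical ratio

    def update(self, i, reward, cost):
        self.n[i] += 1
        self.r[i] += (reward - self.r[i]) / self.n[i]
        self.c[i] += (cost - self.c[i]) / self.n[i]
        self.t += 1
```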

Please note that all the algorithms have a hyper-parameter α (> 0), the empirical impact of which is studied in the experiments (Section 5). Furthermore, for all the proposed algorithms, when more than one arm shares the same largest index, we randomly pull one of them.

4. Regret Analysis

4.1. Upper Bounds of the Regrets

For any suboptimal arm $i \ge 2$, the weighted ratio gap $\Delta_i$, the asymmetric $\delta$-gap $\delta_i(\gamma)$ with $\gamma \ge 0$, and the symmetric $\varrho$-gap $\varrho_i$ play important roles throughout this work. They are defined as follows:

$$\Delta_i = \mu^c_i\frac{\mu^r_1}{\mu^c_1} - \mu^r_i; \qquad \delta_i(\gamma) = \frac{\Delta_i}{\gamma\,\mu^r_1/\mu^c_1 + 1}; \qquad \varrho_i = \frac{\mu^c_1\Delta_i}{\mu^r_1 + \mu^c_1 + \mu^r_i + \mu^c_i}. \qquad (5)$$

One can verify that

$$\frac{\mu^r_1}{\mu^c_1} = \frac{\mu^r_i + \delta_i(\gamma)}{\mu^c_i - \gamma\delta_i(\gamma)}; \qquad \frac{\mu^r_1 - \varrho_i}{\mu^c_1 + \varrho_i} = \frac{\mu^r_i + \varrho_i}{\mu^c_i - \varrho_i}. \qquad (6)$$

(6) shows two kinds of gaps between arm 1 and arm $i$: (i) to make a suboptimal arm $i$ an optimal arm, we can increase $\mu^r_i$ by $\delta_i(\gamma)$ while decreasing $\mu^c_i$ by $\gamma\delta_i(\gamma)$; when $\gamma = 1$, the asymmetric gap becomes symmetric. (ii) To give arm 1 and a suboptimal arm $i$ the same expected-reward to expected-cost ratio, we can increase $\mu^r_i$ and $\mu^c_1$ by $\varrho_i$ while decreasing $\mu^r_1$ and $\mu^c_i$ by $\varrho_i$. We will see later that these gaps are highly correlated with the regret bounds of the algorithms.
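As a quick sanity check of the definitions in (5) and the identities in (6), the following snippet verifies both equalities numerically; the parameter values are our own illustrative choices.

```python
mu_r1, mu_c1 = 0.8, 0.5        # optimal arm (illustrative values)
mu_ri, mu_ci = 0.3, 0.4        # a suboptimal arm
gamma = 2.0

Delta_i = mu_ci * mu_r1 / mu_c1 - mu_ri                                   # weighted ratio gap, eq. (5)
delta_i = Delta_i / (gamma * mu_r1 / mu_c1 + 1)                           # delta_i(gamma), eq. (5)
rho_i = mu_c1 * Delta_i / (mu_r1 + mu_c1 + mu_ri + mu_ci)                 # varrho_i, eq. (5)

# Both identities in (6) should hold up to floating-point error.
assert abs(mu_r1 / mu_c1 - (mu_ri + delta_i) / (mu_ci - gamma * delta_i)) < 1e-9
assert abs((mu_r1 - rho_i) / (mu_c1 + rho_i) - (mu_ri + rho_i) / (mu_ci - rho_i)) < 1e-9
```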

To present the regret bounds of the proposed algorithms, we need to introduce some further notations: (1) $\tau_B = \lfloor 2B/\mu^c_{\min}\rfloor$, where $\mu^c_{\min} = \min_{i\in[K]}\mu^c_i$; (2) define four indices as follows:

$$o_i = \frac{1}{2}\,\frac{(\alpha\mu^c_{\min})^2}{(\mu^r_1/\mu^c_1 + 1)^2}; \qquad o_c = \frac{(\alpha\mu^c_1)^2}{(\mu^r_1 + \mu^c_1)^2}; \qquad o_m = \alpha^2; \qquad o_b = \min_{i\ge 2}\Big\{\frac{\alpha}{10},\, \alpha\varrho^2_i\Big\}.$$

The subscripts "i, c, m, b" stand for the i-UCB, c-UCB, m-UCB and b-GREEDY algorithms respectively; we use them throughout this work when the context is clear. The regret bounds of the proposed algorithms are summarized as follows:

Theorem 2. For any algorithm $a \in \{i, c, m, b\}$:

(I) If $o_a > 1$, the regret $R_a$ can be upper bounded as follows:
$$R_a \le \sum_{i=2}^{K} \ell_{a,i}\ln\tau_B + O(1),$$
where $\ell_{i,i} = 4\alpha^2(\mu^c_i)^2/\Delta_i$, $\ell_{c,i} = (\mu^r_1/\mu^c_1 + \alpha + 1)^2/\Delta_i$, $\ell_{m,i} = (\alpha+1)^2(\mu^r_1/\mu^c_1 + 1)^2/\Delta_i$, $\ell_{b,i} = \alpha\Delta_i$, and the $O(1)$ term depends only on $\mu^r_i, \mu^c_i$ for $i \in [K]$.

(II) If $o_a < 1$, the regret $R_a$ can be upper bounded as follows:
$$R_i \le \sum_{i=2}^{K}\big[7(\mu^r_1/\mu^c_1 + 1)^2/\Delta_i\big]\,\tau_B^{1-o_i} + o\big(\tau_B^{1-o_i}\ln\tau_B\big);$$
$$R_a \le \sum_{i=2}^{K} 2\Delta_i\,\tau_B^{1-o_a}\log_2\tau_B/(1-o_a) + O\big(\tau_B^{1-o_a}\big) \quad \forall a \in \{c, m\};$$
$$R_b \le \sum_{i=2}^{K} C_{b,i}\Big(\frac{e^{1-1/(\alpha K)}}{\alpha K + 1}\Big)^{-o_b}\Delta_i\,\tau_B^{1-o_b}/(1-o_b) + o\big(\tau_B^{1-o_b}\big),$$
where $C_{b,i}$ is a constant related to $\varrho_i$ only.

(III) If $o_a = 1$ for $a \in \{i, c, m\}$, the regret $R_a$ is $O(\ln^2 B)$; if $o_b = 1$, the regret $R_b$ of b-GREEDY is $O(\ln B)$.

More detailed theoretical results are summarized in Appendix A. In general, the regret bounds of the proposed algorithms are sublinear in terms of $\tau_B$ (i.e., $O(\tau_B^{1-o_a})$) and thus in $B$. By carefully choosing the hyper-parameters (e.g., making $o_a > 1$ by setting $\alpha$ large enough), all four algorithms can achieve log-linear regret bounds. Corollary 3 provides a group of $\alpha$'s that attain $O(\ln B)$ regrets:

Corollary 3. If parameters $\lambda$ and $D$ satisfying $\lambda \le \mu^c_{\min}$ and $D \le \min_{i\ge 2}\Delta_i$ are known in advance, then by setting $\alpha$ as $2(1 + \tfrac{1}{\lambda})/\lambda$, $2(1 + \tfrac{1}{\lambda})$, $2$ and $\max\{10, \tfrac{16}{\lambda^2 D^2}\} + 1$ for i-UCB, c-UCB, m-UCB and b-GREEDY respectively, their regrets are

$$R_a \le \sum_{i=2}^{K}\tilde\ell_{a,i}\ln\tau_B + O(1), \qquad a \in \{i, c, m, b\}, \qquad (7)$$

where $\tilde\ell_{i,i} = 16(\mu^c_i)^2(1+\tfrac{1}{\lambda})^2/(\Delta_i\lambda^2)$, $\tilde\ell_{c,i} = (\mu^r_1/\mu^c_1 + \tfrac{2}{\lambda} + 3)^2/\Delta_i$, $\tilde\ell_{m,i} = 9(\mu^r_1/\mu^c_1 + 1)^2/\Delta_i$, and $\tilde\ell_{b,i} = \big(\max\{10, \tfrac{16}{\lambda^2 D^2}\} + 1\big)\Delta_i$.

Remark: Corollary 3 only gives a sufficient condition for the algorithms to reach O(ln B) regrets. It does not mean that the execution of the algorithms needs λ or D in order to determine α.
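If such prior knowledge is available, the parameter choices in Corollary 3 can be packaged as a small helper. This is a sketch with our own function name; `lam` and `D` are the λ and D of the corollary.

```python
def corollary3_alpha(lam, D):
    """Return the alpha values prescribed by Corollary 3, given lam <= mu^c_min and D <= min_i Delta_i."""
    return {
        "i-UCB":    2.0 * (1.0 + 1.0 / lam) / lam,
        "c-UCB":    2.0 * (1.0 + 1.0 / lam),
        "m-UCB":    2.0,
        "b-GREEDY": max(10.0, 16.0 / (lam ** 2 * D ** 2)) + 1.0,
    }
```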

4.2. Proof for the c-UCB Algorithm

Due to space limitations, we leave the detailed proofs of Theorem 2 to Appendix C and only give the proof for c-UCB here. We choose c-UCB because the proofs for i-UCB and b-GREEDY are more notation heavy, while the proof for m-UCB is a little simpler; c-UCB is thus of medium difficulty.

The analysis for the general budgeted MAB problem can be decomposed into four steps.

(S1) Bridge the regret to the expected number of pulls of suboptimal arms. Let $T_i$ denote the number of pulls of arm $i\,(\ge 2)$ from round 1 to round $\tau_B$. We have the following lemma:

Lemma 4. The regret of any algorithm can be bounded by
$$\mathrm{Regret} \le \sum_{i=2}^{K}\Delta_i\,\mathbb{E}[T_i] + X(B)\sum_{i=2}^{K}\Delta_i + \frac{2\mu^r_1}{\mu^c_1}, \qquad (8)$$
where $X(B)$ denotes a term of order $O\big((B/\mu^c_{\min})\exp\{-0.5\,B\mu^c_{\min}\}\big)$.

The insight behind Lemma 4 is intuitive: (a) before round $\tau_B$, if the player had spent the $\mathbb{E}[T_i]$ rounds used on arm $i$ on pulling arm 1 instead, she would gain $\Delta_i\mathbb{E}[T_i]$ more reward; (b) the probability that the pulling procedure exceeds round $\tau_B$ is bounded by $X(B)$, which constitutes the second term in (8); (c) the randomness of the stopping time introduces the additional term $2\mu^r_1/\mu^c_1$. Note that $X(B)$ tends to zero as $B \to \infty$. Lemma 4 can be obtained by setting $L = 1$ in Eqn. (10) of [32].

(S2) Decompose $T_i$. Define $\gamma' = 1/(1 + \alpha)$ and let $\delta'_i$ denote $\delta_i(\gamma')$ for the $\delta$-gap in Eqn. (5). Let $U^c_i$ denote $(1+\alpha)^2\ln\tau_B/\delta'^2_i$. For a suboptimal arm $i$, we have

$$T_i = 1 + \sum_{t=K+1}^{\tau_B}\mathbb{1}\{I_t = i\} \le \lceil U^c_i\rceil + \sum_{t=K+1}^{\tau_B}\mathbb{1}\{I_t = i,\ n_{i,t} \ge \lceil U^c_i\rceil\}
\le \lceil U^c_i\rceil + \sum_{t=K+1}^{\tau_B}\mathbb{1}\Big\{\frac{r_{i,t} + E^{\alpha}_{i,t}}{c_{i,t}} \ge \frac{r_{1,t} + E^{\alpha}_{1,t}}{c_{1,t}},\ n_{i,t} \ge \lceil U^c_i\rceil\Big\}. \qquad (9)$$

Define two mutually exclusive events $\mathcal{A}^i_t$ and $\neg\mathcal{A}^i_t$ as follows:

$$\mathcal{A}^i_t:\ \frac{r_{1,t} + E^{\alpha}_{1,t}}{c_{1,t}} \ge \frac{\mu^r_i + \delta'_i}{\mu^c_i - \gamma'\delta'_i}; \qquad \neg\mathcal{A}^i_t:\ \frac{r_{1,t} + E^{\alpha}_{1,t}}{c_{1,t}} < \frac{\mu^r_1}{\mu^c_1}. \qquad (10)$$

By intersecting the union of $\mathcal{A}^i_t$ and $\neg\mathcal{A}^i_t$ with the second term of (9), we obtain that (9) is bounded by

$$\lceil U^c_i\rceil + \sum_{t=K+1}^{\tau_B}\Big(\mathbb{1}\{\neg\mathcal{A}^i_t\} + \mathbb{1}\Big\{\frac{r_{i,t} + E^{\alpha}_{i,t}}{c_{i,t}} \ge \frac{\mu^r_i + \delta'_i}{\mu^c_i - \gamma'\delta'_i},\ n_{i,t} \ge \lceil U^c_i\rceil\Big\}\Big). \qquad (11)$$

For ease of reference, denote the second indicator in (11) as $\mathcal{B}^i_t$.

(S3) Bound $\sum_{t=K+1}^{\tau_B}\mathbb{P}\{\neg\mathcal{A}^i_t\}$. Note that given $t \ge K+1$, we have $n_{1,t} \ge 1$ and hence $E^{\alpha}_{1,t} > 0$. If $c_{1,t} = 0$, the ratio $(r_{1,t} + E^{\alpha}_{1,t})/c_{1,t}$ is infinite, and then $\mathbb{P}\{\neg\mathcal{A}^i_t,\ c_{1,t} = 0\} = 0$. Therefore, we only need to consider the case $c_{1,t} > 0$. Since

$$\frac{r_{1,t}}{c_{1,t}} - \frac{\mu^r_1}{\mu^c_1} = \frac{(r_{1,t} - \mu^r_1)\mu^c_1}{\mu^c_1 c_{1,t}} + \frac{(\mu^c_1 - c_{1,t})\mu^r_1}{\mu^c_1 c_{1,t}}, \qquad (12)$$

if $\neg\mathcal{A}^i_t$ holds, at least one of the following inequalities is true:

$$(a)\ \frac{(r_{1,t} - \mu^r_1)\mu^c_1}{\mu^c_1 c_{1,t}} \le -\frac{\mu^c_1}{\mu^r_1 + \mu^c_1}\cdot\frac{E^{\alpha}_{1,t}}{c_{1,t}}; \qquad (b)\ \frac{(\mu^c_1 - c_{1,t})\mu^r_1}{\mu^c_1 c_{1,t}} \le -\frac{\mu^r_1}{\mu^r_1 + \mu^c_1}\cdot\frac{E^{\alpha}_{1,t}}{c_{1,t}}.$$

As a result, we have $\mathbb{P}\{\neg\mathcal{A}^i_t\} = \mathbb{P}\{\neg\mathcal{A}^i_t,\ c_{1,t} > 0\}$, and

$$\begin{aligned}
\mathbb{P}\{\neg\mathcal{A}^i_t,\ c_{1,t} > 0\}
&\le \mathbb{P}\Big\{\frac{(r_{1,t} - \mu^r_1)\mu^c_1}{\mu^c_1 c_{1,t}} \le -\frac{\mu^c_1}{\mu^r_1 + \mu^c_1}\cdot\frac{E^{\alpha}_{1,t}}{c_{1,t}},\ c_{1,t} > 0\Big\} + \mathbb{P}\Big\{\frac{(\mu^c_1 - c_{1,t})\mu^r_1}{\mu^c_1 c_{1,t}} \le -\frac{\mu^r_1}{\mu^r_1 + \mu^c_1}\cdot\frac{E^{\alpha}_{1,t}}{c_{1,t}},\ c_{1,t} > 0\Big\}\\
&\le \mathbb{P}\Big\{r_{1,t} - \mu^r_1 \le -\frac{\mu^c_1 E^{\alpha}_{1,t}}{\mu^r_1 + \mu^c_1}\Big\} + \mathbb{P}\Big\{\mu^c_1 - c_{1,t} \le -\frac{\mu^c_1 E^{\alpha}_{1,t}}{\mu^r_1 + \mu^c_1}\Big\}. \qquad (13)
\end{aligned}$$

To simplify the expressions, temporarily denote by $X_l$ the reward from the $l$-th pull of arm 1; note that the $X_l$'s are independent for different $l$. By the peeling technique, we have

$$\begin{aligned}
&\mathbb{P}\Big\{r_{1,t} \le \mu^r_1 - \frac{\alpha\mu^c_1}{\mu^r_1 + \mu^c_1}\sqrt{\frac{\ln(t-1)}{n_{1,t}}}\Big\}
\le \mathbb{P}\Big\{\exists s \in [t-1] \text{ s.t. } \sum_{l=1}^{s}(X_l - \mu^r_1) \le -\frac{\alpha\mu^c_1}{\mu^r_1 + \mu^c_1}\sqrt{s\ln(t-1)}\Big\}\\
&\le \sum_{j=0}^{\lfloor\frac{\ln(t-1)}{\ln 2}\rfloor}\mathbb{P}\Big\{\exists s:\ \big(\tfrac{1}{2}\big)^{j+1}(t-1) < s \le \big(\tfrac{1}{2}\big)^{j}(t-1) \text{ s.t. } \sum_{l=1}^{s}(X_l - \mu^r_1) \le -\frac{\alpha\mu^c_1}{\mu^r_1 + \mu^c_1}\sqrt{\big(\tfrac{1}{2}\big)^{j+1}(t-1)\ln(t-1)}\Big\}\\
&\overset{\triangle}{\le} \sum_{j=0}^{\lfloor\frac{\ln(t-1)}{\ln 2}\rfloor}(t-1)^{-o_c} \le \big(\log_2(t-1) + 1\big)(t-1)^{-o_c},
\end{aligned} \qquad (14)$$

in which the "$\le$" marked with $\triangle$ is obtained by Hoeffding's maximal inequality (see Fact 2 in Appendix B). The second term of (13) can be bounded similarly. Thus, (13) is upper bounded by $2(\log_2(t-1) + 1)(t-1)^{-o_c}$.

Define $\varphi_B(x) = \sum_{t=K+1}^{\tau_B} 2\big(\log_2(t-1) + 1\big)(t-1)^{-x}$ for any $x > 0$. Note that when $x \in (0, 1)$, $\varphi_B(x)$ is of order $O(\tau_B^{1-x}\ln\tau_B)$; when $x = 1$, $\varphi_B(x)$ is of order $O(\ln^2\tau_B)$; when $x > 1$, $\varphi_B(x)$ can be bounded by a term independent of $B$. We thus obtain $\sum_{t=K+1}^{\tau_B}\mathbb{P}\{\neg\mathcal{A}^i_t\} \le \varphi_B(o_c)$.

(S4) Bound $\sum_{t=K+1}^{\tau_B}\mathbb{P}\{\mathcal{B}^i_t\}$ in (11). First, one can verify that if $n_{i,t} \ge U^c_i$, then $\delta'_i - E^{\alpha}_{i,t} \ge \delta'_i/(1+\alpha)$. Accordingly, $\mathbb{P}\{\mathcal{B}^i_t\}$ is bounded by

$$\mathbb{P}\{\mathcal{B}^i_t\} \le \mathbb{P}\Big\{\{r_{i,t} + E^{\alpha}_{i,t} \ge \mu^r_i + \delta'_i\} \cup \{c_{i,t} \le \mu^c_i - \gamma'\delta'_i\},\ n_{i,t} \ge \lceil U^c_i\rceil\Big\}
\le \mathbb{P}\big\{r_{i,t} - \mu^r_i \ge \gamma'\delta'_i,\ n_{i,t} \ge \lceil U^c_i\rceil\big\} + \mathbb{P}\big\{c_{i,t} - \mu^c_i \le -\gamma'\delta'_i,\ n_{i,t} \ge \lceil U^c_i\rceil\big\}.$$

By Hoeffding's inequality and the union bound, we have

$$\mathbb{P}\big\{r_{i,t} \ge \mu^r_i + \gamma'\delta'_i,\ n_{i,t} \ge \lceil U^c_i\rceil\big\} = \sum_{l=\lceil U^c_i\rceil}^{t-1}\mathbb{P}\big\{r_{i,t} \ge \mu^r_i + \gamma'\delta'_i,\ n_{i,t} = l\big\}
\le \sum_{l=\lceil U^c_i\rceil}^{\tau_B}\exp\{-2l\gamma'^2\delta'^2_i\} \le \tau_B\exp\{-2\lceil U^c_i\rceil\gamma'^2\delta'^2_i\} \le \tau_B^{-1}.$$

Similarly, we have $\mathbb{P}\big\{c_{i,t} - \mu^c_i \le -\gamma'\delta'_i,\ n_{i,t} \ge \lceil U^c_i\rceil\big\} \le \tau_B^{-1}$. Accordingly, $\sum_{t=K+1}^{\tau_B}\mathbb{P}\{\mathcal{B}^i_t\} \le 2$.

In summary, the regret bound of c-UCB can be written as

$$\sum_{i=2}^{K}\frac{(1 + \alpha + \mu^r_1/\mu^c_1)^2}{\Delta_i}\ln\tau_B + \big[\varphi_B(o_c) + X(B) + 3\big]\sum_{i=2}^{K}\Delta_i + \frac{2\mu^r_1}{\mu^c_1}, \qquad (15)$$

from which we can get the corresponding results in Theorem 2 and Corollary 3.

We would like to point out that the proofs for i-UCB and m-UCB depend on the δi(1) defined in Eqn. (5), while the proof for b-GREEDY depends on the ϱi.

4.3. Lower Bound for Budgeted Bernoulli MABs

A budgeted Bernoulli MAB is one in which the reward and cost are either 0 or 1, i.e., $r_i(t), c_i(t) \in \{0, 1\}$ for all $i \in [K]$ and $t \ge 1$. Lower bounds related to Bernoulli MAB have been widely studied in the standard MAB literature [12, 3]. We give a lower bound for budgeted Bernoulli MAB. As for standard bandits, the lower bound for the budgeted MAB also depends on the KL-divergence between the distributions of the optimal arm and the suboptimal arms. Mathematically, define the KL-divergence of two Bernoulli distributions with success probabilities $x$ and $y$ as

$$\mathrm{kl}(x, y) = x\ln\frac{x}{y} + (1 - x)\ln\frac{1 - x}{1 - y} \qquad \forall x, y \in (0, 1). \qquad (16)$$

To make a suboptimal arm $j\,(\ge 2)$ an optimal arm, according to the definition of the $\delta$-gap (see (5)), we can increase $\mu^r_j$ by $\delta_j(\gamma)$ while decreasing $\mu^c_j$ by $\gamma\delta_j(\gamma)$. Let $\gamma^*$ and $L^*_j$ denote the optimal solution and the corresponding value of the problem defined in (17):

$$\inf_{\gamma}\ \mathrm{kl}\big(\mu^r_j,\ \mu^r_j + \delta_j(\gamma)\big) + \mathrm{kl}\big(\mu^c_j,\ \mu^c_j - \gamma\delta_j(\gamma)\big) \qquad \text{s.t. } \gamma \ge 0,\ \mu^r_j + \delta_j(\gamma) < 1. \qquad (17)$$

We can regard $\delta_j(\gamma^*)$ as an "optimal" $\delta$-gap for arm $j$, since the KL-divergence defined in (17) is minimized. One can show that for any $j \ge 2$, $\gamma^*$ and $L^*_j$ always exist and $L^*_j$ is strictly positive.
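Since (17) is a one-dimensional optimization over γ, the constant $L^*_j$ can be approximated numerically. The sketch below uses a simple grid search over a truncated range of γ; the helper names and the truncation bound are our own illustrative choices, not part of the paper.

```python
import math

def kl(x, y):
    """KL-divergence between Bernoulli(x) and Bernoulli(y), eq. (16); requires x, y in (0, 1)."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def L_star(mu_r1, mu_c1, mu_rj, mu_cj, grid=10000, gamma_max=1000.0):
    """Grid-search approximation of L*_j in (17); gamma_max truncates the search range."""
    Delta_j = mu_cj * mu_r1 / mu_c1 - mu_rj                       # weighted ratio gap, eq. (5)
    best = float("inf")
    for k in range(grid + 1):
        g = gamma_max * k / grid
        d = Delta_j / (g * mu_r1 / mu_c1 + 1)                     # delta_j(gamma), eq. (5)
        r_new, c_new = mu_rj + d, mu_cj - g * d
        if r_new < 1 and 0 < c_new < 1:                           # feasibility of the perturbed arm
            best = min(best, kl(mu_rj, r_new) + kl(mu_cj, c_new))
    return best
```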

Denote by $T^B_i$ the number of pulls of arm $i \in [K]$ in the first $B$ rounds. The lower bound on $T^B_i$ for Bernoulli budgeted bandits is given in the following theorem:

Theorem 5. For Bernoulli budgeted MABs in which the rewards of each arm are independent of its costs, consider any pulling policy such that, as $B \to \infty$, $\sum_{i=2}^{K}\mathbb{E}[T^B_i] = o(B^a)$ for every $a \in (0, 1)$. Then for any $\varepsilon > 0$ and $i \ge 2$, $\lim_{B\to\infty}\mathbb{P}\big\{T^B_i \ge \frac{(1-\varepsilon)\ln B}{L^*_i}\big\} = 1$, and accordingly $\liminf_{B\to\infty}\mathbb{E}[T^B_i]/\ln B \ge 1/L^*_i$.

We can easily bridge $T^B_i$ to the regret. That is:

Corollary 6. For Bernoulli budgeted MABs in which the rewards of each arm are independent of its costs, the regret of any pulling policy satisfying $\sum_{i=2}^{K}\mathbb{E}[T^B_i] = o(B^a)$ for all $a \in (0, 1)$ as $B \to \infty$ is at least $\Omega\big(\sum_{i=2}^{K}(\Delta_i/L^*_i)\ln B\big)$.

The basic idea of the proof of the theorem is as follows. First, we note that in budgeted MAB the number of pulls is at least $B$, since the cost of each arm per pull is at most 1. Second, to remove the randomness of the stopping time, we consider the number of pulls of a suboptimal arm during the first $B$ rounds (i.e., $T^B_i$), which is a lower bound on its total number of pulls before the budget runs out. Third, we extend the change-of-measure technique of [12, 22] to budgeted MAB. Finally, with some calculations, we obtain the lower bound in Theorem 5. Corollary 6 can be obtained by some simple derivations. We leave the proofs of Theorem 5 and Corollary 6 to Appendix D.

We can see that m-UCB can achieve an asymptotic O(ln B) regret bound without any additional information about the bandit. Recall that Theorem 5 implies that the lower bound for budgeted Bernoulli MAB in terms of B is Ω(ln B). Therefore, by choosing the hyper-parameters carefully, the regret bounds of all four proposed algorithms can match the lower bound for budgeted Bernoulli MAB in terms of B.

4.4. Discussions

4.4.1. Special Cases Discussion

By choosing the parameters carefully, our proposed algorithms can achieve regret $O(C_u\ln B)$. Meanwhile, denote the lower bound as $\Omega(C_l\ln B)$. It is difficult for our proposed UCB algorithms and the greedy algorithm to match the lower bound perfectly (i.e., $C_u > C_l$), which is a common drawback for conventional MABs (without budget constraints) as well. However, we can still see that $C_l$ and $C_u$ share many similar trends. We discuss m-UCB with $\alpha = 2$ in the following examples; the other algorithms can be discussed similarly.

(Example 1) The regrets w.r.t. $\Delta_i$ for $i \ge 2$: We adopt a setting similar to [21] to discuss the relation between the upper bounds of our algorithms and the lower bound. Assume that there exists an $l_0 \in (0, 0.5)$ s.t. $\mu^r_i, \mu^c_i \in [l_0, 1 - l_0]$ and $\Delta_i < l_0/2$ for any $i \in [K]$. One can verify that $\gamma = 1$ satisfies the constraint in (17) and that $\delta_i(1) < l_0/2$. According to the fact that for any $p, q \in (0, 1)$,

$$2(p - q)^2 \le \mathrm{kl}(p, q) \le \frac{(p - q)^2}{q(1 - q)}, \qquad (18)$$

we can get that $L^*_i$ is upper bounded by

$$\frac{\delta^2_i(1)}{(\mu^r_i + \delta_i(1))(1 - \mu^r_i - \delta_i(1))} + \frac{\delta^2_i(1)}{(\mu^c_i - \delta_i(1))(1 - \mu^c_i + \delta_i(1))} \le \frac{4\Delta^2_i}{l^2_0}. \qquad (19)$$

Thus, $C_l = \sum_{i=2}^{K}(l^2_0/\Delta_i)$. We also know that $C_u = \sum_{i=2}^{K} 1/(l^2_0\Delta_i)$. Therefore, both the upper bound and the lower bound of the regret are linear in $\sum_{i=2}^{K}(1/\Delta_i)\ln B$.

(Example 2) The regrets w.r.t. the magnitude of the expected costs: Consider a K-armed Bernoulli bandit with expected rewards $\{\mu^r_i\}_{i=1}^{K}$ and expected costs $\{\theta\mu^c_i\}_{i=1}^{K}$, where the $\mu^r_i, \mu^c_i$ are fixed and $\theta \in (1/\sqrt{B}, 1)$. Since $\gamma = \infty$ satisfies the constraint in (17), $L^*_i$ is upper bounded by

$$\frac{\delta^2_i(\infty)}{(\theta\mu^c_i - \delta_i(\infty))(1 - \theta\mu^c_i + \delta_i(\infty))} \le \frac{\theta\mu^c_1\Delta^2_i}{\mu^r_1\mu^r_i(1 - \mu^c_i)}. \qquad (20)$$

That is, $C_l = K/\theta$. On the other hand, one can verify that the upper bound is of order $O((K/\theta^2)\ln(B/\theta))$. Though the upper and lower bounds are not perfectly matched in terms of $\theta$, we can still see that when the magnitudes of the expected costs of all the arms become smaller, the regret becomes larger.

4.4.2. Proof Technique for Existing Algorithms

Our proof technique (and the proposed $\delta$-gap (5)) can also be used to analyze several existing algorithms:

(A) For the standard MAB [6], in which there is no cost for each arm and the number of pulls is bounded by a given $T$, the $\Delta_i$ in [6] is exactly the $\delta_i(0)$ ($i \ge 2$) of our analysis. Thus, our proof technique covers that of the standard multi-armed bandit.

(B) For the budgeted MAB with a fixed cost for each arm (studied in [27]), using our proposed method we can obtain a better result than Theorem 2 in [27] (and Theorem 7 in its arXiv version). See Appendix E for details.³

³ Since [27] uses different notations from our work, we leave the detailed discussion to the appendix.

(C) For the Budgeted Thompson Sampling (BTS) algorithm proposed in [31], with our asymmetric $\delta$-gap we can show that, when the reward and cost of each arm are both Bernoulli, BTS achieves an improved regret:

Theorem 7. For Bernoulli budgeted bandits, there exists an $\varepsilon_0 \in (0, 1)$ such that for any $\varepsilon \in (0, \varepsilon_0)$, the regret of BTS is upper bounded by

$$\sum_{i=2}^{K}\frac{(1 + \varepsilon)\Delta_i}{L_i}\ln\tau_B + X(B)\sum_{i=2}^{K}\Delta_i + O(K\varepsilon^{-6}), \qquad (21)$$

where $L_i = \sup_{\gamma}\min\big\{\mathrm{kl}(\mu^r_i,\ \mu^r_i + \delta_i(\gamma)),\ \mathrm{kl}(\mu^c_i,\ \mu^c_i - \gamma\delta_i(\gamma))\big\}$.

Please note that the regret in Theorem 7 improves upon the result in [31] and is closer to the lower bound. The proof of Theorem 7 is given in Appendix F.

5. EMPIRICAL EVALUATIONS

The theoretical results in the previous section guarantee the asymptotic performance of the algorithms when the budget is very large. In this section, we empirically evaluate the practical performance of the algorithms when the budget is limited. For the moment, we do not evaluate the performance of the variants of the UCB algorithms.

Specifically, we first simulate a 10-armed Bernoulli bandit, in which the rewards and costs of each arm follow Bernoulli distributions, and a 10-armed Beta bandit, in which the rewards and costs of each arm follow Beta distributions. The parameters of the Bernoulli and Beta distributions are randomly chosen from (0, 1). Then we simulate a 50-armed Bernoulli bandit and a 50-armed Beta bandit, to see whether the number of arms leads to different results.

When implementing the algorithms introduced in the previous sections, the hyper-parameters of i-UCB, c-UCB and m-UCB are all chosen as 0.25. The hyper-parameter in b-GREEDY is chosen as 4. For comparison purposes, we implement several baseline algorithms. The first is the budgeted Thompson sampling algorithm (BTS) proposed in [31]. The second baseline is the UCB-BV1 algorithm [14], designed for random discrete costs. The third baseline is the ε-first algorithm [26], designed for deterministic costs. The fourth baseline is a variant of the PD-BwK algorithm [7], which we call vPD-BwK. At each round, this baseline plays the arm with the maximum ratio $\frac{\min\{r_{i,t} + \mathrm{rad}(r_{i,t}, n_{i,t}),\, 1\}}{\max\{c_{i,t} - \mathrm{rad}(c_{i,t}, n_{i,t}),\, 0\}}$, where $\mathrm{rad}(\nu, N) = \sqrt{C_{\mathrm{rad}}\,\nu/N} + C_{\mathrm{rad}}/N$ and $C_{\mathrm{rad}} = x\log(BK)$. Note that we do not use the original PD-BwK algorithm because it requires a hard time constraint T that indicates how many times one can play the arms; in our setting there is no such time constraint, and therefore the original PD-BwK algorithm cannot be directly applied. For the hyper-parameters of the baselines, we choose ε = 0.1 for ε-first and x = 0.25 for vPD-BwK, to keep consistency with [31].
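As an illustration of the vPD-BwK index just described, the following sketch (our own helper names; r_bar, c_bar and n are an arm's empirical mean reward, mean cost and pull count) computes the confidence radius and the arm index.

```python
import math

def rad(nu, n, c_rad):
    """Confidence radius used by vPD-BwK: sqrt(C_rad * nu / n) + C_rad / n."""
    return math.sqrt(c_rad * nu / n) + c_rad / n

def vpd_bwk_index(r_bar, c_bar, n, budget, n_arms, x=0.25):
    """vPD-BwK index: optimistic reward estimate over pessimistic cost estimate."""
    c_rad = x * math.log(budget * n_arms)          # C_rad = x * log(B * K)
    num = min(r_bar + rad(r_bar, n, c_rad), 1.0)
    den = max(c_bar - rad(c_bar, n, c_rad), 0.0)
    return float("inf") if den == 0.0 else num / den
```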

To run vPD-BwK and the ε-first algorithm, the budget B needs to be known in advance. In our experiments, we try a set of budgets: {100, 200, 500, 1K, 2K, 5K, 10K, 20K, 30K, 40K, 50K}. For each budget, we repeat the experiments for 500 runs and report the average performance. All the other algorithms do not need to know B in advance; therefore, by setting B = 50K, we can obtain the performance of these algorithms for all budgets below 50K. Similarly, we repeat the experiments for 500 runs for these algorithms and report the average performance.
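This evaluation protocol can be reproduced with a small driver like the following sketch (Bernoulli case only; `make_policy` would construct, e.g., one of the classes sketched in Section 3, and R* is again approximated by the ratio bound of Lemma 1).

```python
import random

def average_regret(make_policy, mu_r, mu_c, budget, n_runs=500, seed=0):
    """Average pseudo regret over n_runs independent runs, as in the protocol above."""
    rng = random.Random(seed)
    rewards = []
    for _ in range(n_runs):
        policy = make_policy()                      # fresh policy, e.g. BudgetedUCB(K, 0.25)
        total, remaining = 0.0, float(budget)
        while remaining > 0:
            arm = policy.select_arm()
            r = 1.0 if rng.random() < mu_r[arm] else 0.0   # Bernoulli reward
            c = 1.0 if rng.random() < mu_c[arm] else 0.0   # Bernoulli cost
            remaining -= c
            if remaining >= 0:
                total += r
            policy.update(arm, r, c)
        rewards.append(total)
    best_ratio = max(r / c for r, c in zip(mu_r, mu_c))
    return best_ratio * budget - sum(rewards) / n_runs     # approximate pseudo regret
```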

Figure 1 shows the performance of each algorithm for the 10-armed bandits. We do not include the regret curve of UCB-BV1 in Figures 1(a) and 1(b), because its empirical regrets are much worse than those of all the other algorithms and would affect the visualization of the results (its regret at B = 50K is 1687.7 for the Bernoulli bandit and 2151.8 for the Beta bandit). As can be seen from the figures, for both the Bernoulli bandit and the Beta bandit, the four extended algorithms introduced in this paper outperform the baselines (BTS, UCB-BV1, ε-first, and vPD-BwK). Among them, i-UCB performs best for the Bernoulli bandit while c-UCB performs best for the Beta bandit.

[Figure 1: Results of the 10-armed bandits. Panels: (a) Bernoulli bandit, regret vs. budget; (b) Beta bandit, regret vs. budget; (c) Bernoulli bandit, per-run rank distribution; (d) Beta bandit, per-run rank distribution. Compared methods: i-UCB, c-UCB, m-UCB, b-GREEDY, BTS, ε-first, vPD-BwK, UCB-BV1.]

In addition to the average regrets shown in Figures 1(a) and 1(b), we also visualize the per-run performance of each algorithm in Figures 1(c) and 1(d), in which rank 1 means an algorithm performs the best and rank 2 means it performs the second best. From Figure 1(c), we see that i-UCB performs best in 252 runs out of the total 500 runs and b-GREEDY performs best in 140 runs. Overall, the four extended algorithms introduced in this paper take the top 4 positions in most runs; e.g., i-UCB is ranked in the top 4 positions in more than 400 runs. Thus, we can conclude that the four simple extended algorithms outperform the baseline algorithms with high probability.

Recall that Corollary 3 shows that the four algorithms can achieve log-linear regret as B → ∞ if the hyper-parameters are chosen carefully. One might also be interested in the empirical regrets for a limited budget B when α is chosen in the way provided in Corollary 3. Our experiments show that this way of setting the hyper-parameters is not good in practice and leads to larger regret when B is limited. Taking B = 50000 and the Bernoulli bandit as an example, the regrets of the four algorithms are 1806.3, 1784.8, 1693.9, and 2003.3 respectively, which are much worse than the results shown in Figure 1(a).

[Figure 2: Results of the 50-armed bandits. Panels: (a) Bernoulli bandit, regret vs. budget; (b) Beta bandit, regret vs. budget; (c) Bernoulli bandit, per-run rank distribution; (d) Beta bandit, per-run rank distribution.]

The results for the 50-armed bandits are shown in Figure 2. Again, we observe that the simple extended algorithms introduced in this paper perform better than the existing algorithms. In particular, compared with the 10-armed bandits, the 50-armed bandits have more arms and therefore need more exploration. As a result, the regrets for the 50-armed bandits are larger than those for the 10-armed bandits.

[Figure 3: Sensitivity analysis — regret vs. log2(α) for i-UCB, c-UCB, m-UCB and b-GREEDY on a 50-armed Beta bandit.]

Finally, we investigate the sensitivity of the proposed algorithms w.r.t. the hyper-parameter α. Figure 3 shows the results on a 50-armed Beta bandit. We can see that b-GREEDY and i-UCB are quite stable w.r.t. α. We need to set a relatively small α for m-UCB and c-UCB; otherwise their performance drops considerably.

6. Better Regret Bound for Budget-UCB

The Budget-UCB algorithm proposed in [30] can be seen as a combined version of our c-UCB and m-UCB with $\alpha$ specialized to $\sqrt{2}$. Mathematically, the index of Budget-UCB is⁴, for any $i \in [K]$ and $t > K$,

$$\mathcal{D}^{\mathrm{Budget\text{-}UCB}}_{i,t} = \frac{r_{i,t}}{c_{i,t}} + \frac{E^{\alpha}_{i,t}}{c_{i,t}} + \frac{E^{\alpha}_{i,t}}{c_{i,t}}\cdot\frac{\min\{r_{i,t} + E^{\alpha}_{i,t},\, 1\}}{\max\{c_{i,t} - E^{\alpha}_{i,t},\, 0\}}. \qquad (22)$$

⁴ A minor revision: the execution of Budget-UCB in [30] needs to know a $\lambda \le \mu^c_{\min}$ in advance; here we do not need to know such a $\lambda$.
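Continuing the index sketches of Section 3, the Budget-UCB index (22) with α = √2 could be computed as follows (illustrative helper; the footnote-2 convention for zero denominators is reused).

```python
import math

def budget_ucb_index(r_bar, c_bar, n, t, alpha=math.sqrt(2)):
    """Budget-UCB index of eq. (22): the c-UCB index plus an m-UCB-style correction term."""
    e = alpha * math.sqrt(math.log(t - 1) / n)            # E^alpha_{i,t}, eq. (3)
    if c_bar == 0:                                        # footnote-2 convention
        return float("inf") if (r_bar + e) > 0 else 0.0
    num = min(r_bar + e, 1.0)
    den = max(c_bar - e, 0.0)
    m_part = float("inf") if den == 0.0 else num / den
    return r_bar / c_bar + e / c_bar + (e / c_bar) * m_part
```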

With the $\delta_i(1)$ gap, we can get an improved regret bound compared to that in [30] (the proof is in Appendix G):

Corollary 8. By setting $\alpha = \sqrt{2}$ in Eqn. (22), the regret of Budget-UCB is upper bounded by

$$\sum_{i=2}^{K}\frac{3 + 2\sqrt{2}}{\Delta_i}\Big(\frac{\mu^r_1}{\mu^c_1} + 1\Big)^2\ln\tau_B + (X(B) + 3)\sum_{i=2}^{K}\Delta_i + \frac{2\mu^r_1}{\mu^c_1}. \qquad (23)$$

7. CONCLUSIONS

In this work we have studied the budgeted MAB problem. We show that simple extensions of the algorithms designed for standard MAB work surprisingly well for budgeted MAB: they enjoy sublinear regret bounds with respect to the budget and perform comparably to or even better than the baseline methods in our simulations. We also give a lower bound for the budgeted Bernoulli MAB.

There are many directions to explore in the future. First, in addition to the simulations, it would make more sense to test the performance of the extended algorithms in real-world applications. Second, in addition to the distribution-dependent regret bounds, it would be interesting to study distribution-free bounds as well. Third, in addition to extending UCB1 and εt-GREEDY, we will consider extending other algorithms designed for standard MAB to the budgeted setting.

References

[1] Agrawal, S., Devanur, N. R., 2014. Bandits with concave rewards and convex knapsacks. In: Proceedings of the Fifteenth ACM Conference on Economics and Computation. ACM, pp. 989–1006.

[2] Ardagna, D., Panicucci, B., Passacantando, M., 2011. A game theoretic formulation of the service provisioning problem in cloud systems. In: Proceedings of the 20th International Conference on World Wide Web. ACM, pp. 177–186.

[3] Audibert, J.-Y., Bubeck, S., 2010. Best arm identification in multi-armed bandits. In: COLT - 23rd Conference on Learning Theory - 2010. pp. 13-p.

[4] Audibert, J.-Y., Munos, R., Szepesvari, C., 2009. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theor. Comput. Sci. 410 (19), 1876–1902.

[5] Auer, P., 2003. Using confidence bounds for exploitation-exploration trade-offs. The Journal of Machine Learning Research 3, 397–422.

[6] Auer, P., Cesa-Bianchi, N., Fischer, P., 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 (2-3), 235–256.

[7] Badanidiyuru, A., Kleinberg, R., Slivkins, A., 2013. Bandits with knapsacks. In: Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on. IEEE, pp. 207–216.

[8] Badanidiyuru, A., Langford, J., Slivkins, A., 2014. Resourceful contextual bandits. In: 27th Conference on Learning Theory (COLT).

[9] Ben-Yehuda, O., Ben-Yehuda, M., Schuster, A., Tsafrir, D., 2011. Deconstructing Amazon EC2 spot instance pricing. In: Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on. IEEE, pp. 304–311.

[10] Borgs, C., Chayes, J., Immorlica, N., Jain, K., Etesami, O., Mahdian, M., 2007. Dynamics of bid optimization in online advertisement auctions. In: WWW. ACM, pp. 531–540.

[11] Bubeck, S., 2010. Bandits games and clustering foundations. Ph.D. thesis, Universite des Sciences et Technologie de Lille - Lille I.

[12] Bubeck, S., Cesa-Bianchi, N., 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721.

[13] Chakraborty, T., Even-Dar, E., Guha, S., Mansour, Y., Muthukrishnan, S., 2010. Selective call out and real time bidding. In: Internet and Network Economics. Springer, pp. 145–157.

[14] Ding, W., Qin, T., Zhang, X.-D., Liu, T.-Y., 2013. Multi-armed bandit with budget constraint and variable costs. In: Twenty-Seventh AAAI Conference on Artificial Intelligence.

[15] Fisher, M. L., 1980. Worst-case analysis of heuristic algorithms. Management Science 26 (1), 1–17.

[16] Flajolet, A., Jaillet, P., 2015. Low regret bounds for bandits with knapsacks. arXiv preprint arXiv:1510.01800.

[17] Garivier, A., Cappe, O., 2011. The KL-UCB algorithm for bounded stochastic bandits and beyond. arXiv preprint arXiv:1102.2490.

[18] Ghosh, A., Hummel, P., 2013. Learning and incentives in user-generated content: Multi-armed bandits with endogenous arms. In: Proceedings of the 4th Conference on Innovations in Theoretical Computer Science. ACM, pp. 233–246.

[19] Honda, J., Takemura, A., 2010. An asymptotically optimal bandit algorithm for bounded support models. In: COLT. Citeseer, pp. 67–79.

[20] Kohli, R., Krishnamurti, R., 1992. A total-value greedy heuristic for the integer knapsack problem. Operations Research Letters 12 (2), 65–71.

[21] Kveton, B., Wen, Z., Ashkan, A., Eydgahi, H., Eriksson, B., 2014. Matroid bandits: Fast combinatorial optimization with learning. In: The 30th Conference on Uncertainty in Artificial Intelligence.

[22] Lai, T. L., Robbins, H., 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1), 4–22.

[23] Li, L., Chu, W., Langford, J., Schapire, R. E., 2010. A contextual-bandit approach to personalized news article recommendation. In: WWW. ACM, pp. 661–670.

[24] Martello, S., Toth, P., 1990. Knapsack Problems: Algorithms and Computer Implementations. John Wiley & Sons, Inc.

[25] Mohri, M., Munoz, A., 2014. Optimal regret minimization in posted-price auctions with strategic buyers. In: Advances in Neural Information Processing Systems. pp. 1871–1879.

[26] Tran-Thanh, L., Chapman, A., Munoz de Cote, E., Rogers, A., Jennings, N. R., 2010. Epsilon-first policies for budget-limited multi-armed bandits. In: AAAI.

[27] Tran-Thanh, L., Chapman, A. C., Rogers, A., Jennings, N. R., 2012. Knapsack based optimal policies for budget-limited multi-armed bandits. In: AAAI.

[28] Tran-Thanh, L., Stavrogiannis, L. C., Naroditskiy, V., Robu, V., Jennings, N. R., Key, P., 2014. Efficient regret bounds for online bid optimisation in budget-limited sponsored search auctions. pp. 809–818.

[29] Vazirani, V. V., 2001. Approximation Algorithms. Springer Science & Business Media.

[30] Xia, Y., Ding, W., Zhang, X.-D., Yu, N., Qin, T., 2015. Budgeted bandit problems with continuous random costs. In: The 7th Asian Conference on Machine Learning (ACML-15).

[31] Xia, Y., Li, H., Qin, T., Yu, N., Liu, T.-Y., 2015. Thompson sampling for budgeted multi-armed bandits. In: 24th International Joint Conference on Artificial Intelligence. pp. 3960–3966.

[32] Xia, Y., Qin, T., Ma, W., Yu, N., Liu, T.-Y., 2016. Budgeted multi-armed bandits with multiple plays. In: Proceedings of the 25th International Joint Conference on Artificial Intelligence.

[33] Audibert, J.-Y., Bubeck, S., 2009. Minimax policies for adversarial and stochastic bandits. In: Proceedings of the 22nd Annual Conference on Learning Theory. Omnipress, pp. 773–818.

Appendix

Appendix A. Detailed Theoretical Results

To present the detailed regret bounds of the proposed algorithms, we first introduce some further notations:

$$\begin{aligned}
&(1)\ I^{\text{i-UCB}}_a(\alpha) = \frac{(\alpha\mu^c_1)^2}{(\mu^r_1/\mu^c_1 + 1)^2}; \qquad (2)\ I^{\text{i-UCB}}_b(\alpha) = \frac{(\alpha\mu^c_{\min})^2}{2(\mu^r_1/\mu^c_1 + 1)^2};\\
&(3)\ \varphi_B(x) = \sum_{t=K+1}^{\tau_B} 2\big(\log_2(t-1) + 1\big)\Big(\frac{1}{t-1}\Big)^x \ \ \forall x > 0; \qquad (4)\ \bar\varphi_B(x) = \Big(\frac{e^{1-\frac{1}{\alpha K}}}{\alpha K + 1}\Big)^{-x}\sum_{t=2}^{\tau_B}\Big(\frac{1}{t}\Big)^x \ \ \forall x > 0;\\
&(5)\ \psi_B(x) = (\tau_B)^{1-x}\,\mathbb{1}\{x \le 1\} + \mathbb{1}\{x > 1\}; \qquad (6)\ g_i(\alpha) = \alpha + 4\Big(\frac{1}{\kappa(\alpha)}\Big)^{\frac{\alpha}{10}} + \frac{16(\kappa(\alpha))^{-\varrho^2_i\alpha}}{\varrho^2_i}, \ \text{where } \kappa(\alpha) = \frac{e^{1-\frac{1}{\alpha K}}}{\alpha K + 1};\\
&(7)\ X(B) = (\tau_B + 1)\exp\Big\{-\frac{B\mu^c_{\min}}{2}\Big\} + \frac{1}{(\mu^c_{\min})^2}\exp\big\{(\mu^c_{\min})^2 - 2B\mu^c_{\min}\big\}.
\end{aligned} \qquad (A.1)$$

Please note that the functions $\varphi_B(\cdot)$, $\bar\varphi_B(\cdot)$ and $\psi_B(\cdot)$ are all sublinear in terms of $B$. The $X(B)$ here is a refined version of the one defined in Lemma 4; when $B \to \infty$, $X(B) \to 0$, and $X(B)$ can also be bounded by a term related to $\mu^c_{\min}$ only.

Theorem 9. All four algorithms enjoy sublinear regret bounds. Specifically,

$$R_i \le \sum_{i=2}^{K}\frac{4\alpha^2(\mu^c_i)^2}{\Delta_i}\ln\tau_B + \sum_{i=2}^{K}\Delta_i\Big(\varphi_B\big(I^{\text{i-UCB}}_a(\alpha)\big) + \frac{7(\mu^r_1/\mu^c_1 + 1)^2}{\Delta^2_i}\psi_B\big(I^{\text{i-UCB}}_b(\alpha)\big) + 1 + X(B)\Big) + \frac{2\mu^r_1}{\mu^c_1};$$

$$R_c \le \sum_{i=2}^{K}\frac{(\mu^r_1/\mu^c_1 + \alpha + 1)^2}{\Delta_i}\ln\tau_B + \big(\varphi_B(o_c) + 3 + X(B)\big)\sum_{i=2}^{K}\Delta_i + \frac{2\mu^r_1}{\mu^c_1};$$

$$R_m \le \sum_{i=2}^{K}\frac{\big[(\alpha + 1)(\mu^r_1/\mu^c_1 + 1)\big]^2}{\Delta_i}\ln\tau_B + \big(\varphi_B(\alpha^2) + 3 + X(B)\big)\sum_{i=2}^{K}\Delta_i + \frac{2\mu^r_1}{\mu^c_1};$$

$$R_b \le \sum_{i=2}^{K}\Delta_i\Big(\alpha\ln\tau_B + 4\bar\varphi_B\Big(\frac{\alpha}{10}\Big) + \frac{16}{\varrho^2_i}\bar\varphi_B(\varrho^2_i\alpha) + X(B) + g_i(\alpha)\Big) + \frac{2\mu^r_1}{\mu^c_1}.$$

In general, the regret bounds of the four proposed algorithms are sublinear in terms of B. By choosing the hyper-parameters carefully, all four algorithms can achieve O(ln B) regret bounds, as shown in Corollary 10.

Corollary 10. If parameters $\lambda$ and $D$ satisfying $\lambda \le \mu^c_{\min}$ and $D \le \min_{i\ge 2}\Delta_i$ are known in advance, then by setting $\alpha$ as $2(1 + \tfrac{1}{\lambda})/\lambda$, $2(1 + \tfrac{1}{\lambda})$, $2$ and $\max\{10, \tfrac{16}{\lambda^2 D^2}\} + 1$ for i-UCB, c-UCB, m-UCB and b-GREEDY respectively, their regrets are

$$R_i \le \sum_{i=2}^{K}\frac{16(\mu^c_i)^2}{\Delta_i}\Big(\frac{1 + \frac{1}{\lambda}}{\lambda}\Big)^2\ln\tau_B + \sum_{i=2}^{K}\Delta_i\Big(4 + X(B) + \frac{7(\mu^r_1/\mu^c_1 + 1)^2}{\Delta^2_i}\Big) + \frac{2\mu^r_1}{\mu^c_1};$$

$$R_c \le \sum_{i=2}^{K}\frac{(\mu^r_1/\mu^c_1 + \frac{2}{\lambda} + 3)^2}{\Delta_i}\ln\tau_B + \big(X(B) + 6\big)\sum_{i=2}^{K}\Delta_i + \frac{2\mu^r_1}{\mu^c_1};$$

$$R_m \le \sum_{i=2}^{K}\frac{9(\mu^r_1/\mu^c_1 + 1)^2}{\Delta_i}\ln\tau_B + \big(X(B) + 6\big)\sum_{i=2}^{K}\Delta_i + \frac{2\mu^r_1}{\mu^c_1};$$

$$R_b \le \Big(\max\Big\{10, \frac{16}{\lambda^2 D^2}\Big\} + 1\Big)\Big(\sum_{i=2}^{K}\Delta_i\Big)\ln\tau_B + C,$$

where C is related to X(B) and the expected reward/cost of each arm.

Appendix B. Important Facts

Fact 1 (Chernoff-Hoeffding bound, Fact 1 of [6]). Let $X_1, \dots, X_n$ be random variables with common range $[0, 1]$ such that $\mathbb{E}[X_t \mid X_{t-1}, \dots, X_1] = \mu$. Let $S_n = X_1 + X_2 + \dots + X_n$. Then for all $a \ge 0$,

$$\mathbb{P}\{S_n \ge n\mu + a\} \le e^{-2a^2/n} \quad\text{and}\quad \mathbb{P}\{S_n \le n\mu - a\} \le e^{-2a^2/n}. \qquad (B.1)$$

Fact 2 (Hoeffding's maximal inequality, page 30 of [11]). For centered i.i.d. random variables $X_1, X_2, \dots$ and for any $x > 0$, $t \ge 1$,

$$\mathbb{P}\Big\{\exists s \in \{1, 2, \dots, t\}:\ \sum_{l=1}^{s} X_l \ge x\Big\} \le \exp\Big\{-\frac{2x^2}{t}\Big\}. \qquad (B.2)$$

Please note that in [11] the "$\ge$" in Fact 2 is "$>$"; however, [11] actually uses the form stated in Fact 2.

Fact 3 (Bernstein inequality, Fact 2 of [6]). Let $X_1, X_2, \dots, X_n$ be random variables with range $[0, 1]$ and $\sum_{t=1}^{n}\mathrm{Var}[X_t \mid X_{t-1}, \dots, X_1] = \sigma^2$. Let $S_n = \sum_{t=1}^{n} X_t$. Then for all $a \ge 0$,

$$\mathbb{P}\{S_n \ge \mathbb{E}[S_n] + a\} \le \exp\Big\{-\frac{a^2/2}{\sigma^2 + a/2}\Big\}. \qquad (B.3)$$

Appendix C. Detailed analysis of the regret upper bounds

Appendix C.1. Detailed analysis of i-UCB

Throughout this subsection, (A) let $T_i$ denote the number of pulls of arm $i$ from round 1 to round $\tau_B$; (B) write $\delta_i$ for $\delta_i(1)$ for all $i \ge 2$; by Eqn. (5), $\frac{\mu^r_1}{\mu^c_1} = \frac{\mu^r_i + \delta_i}{\mu^c_i - \delta_i}$ for all $i \ge 2$; (C) temporarily omit the "$\alpha$" from $E^{\alpha}_{i,t}$, i.e., write $E_{i,t}$ for $E^{\alpha}_{i,t}$ for any $i \in [K]$.

(S1) Decompose $\mathbb{E}[T_i]$ for $i \ge 2$: For any suboptimal arm $i$, given $\eta \in (0, 1)$, we have

$$\begin{aligned}
T_i &\le 1 + \sum_{t=K+1}^{\tau_B}\mathbb{1}\{I_t = i\} \le \lceil U^i_i(\eta)\rceil + \sum_{t=K+1}^{\tau_B}\mathbb{1}\{I_t = i,\ n_{i,t} \ge \lceil U^i_i(\eta)\rceil\}\\
&\le 1 + U^i_i(\eta) + \sum_{t=K+1}^{\tau_B}\mathbb{1}\Big\{\frac{r_{i,t}}{c_{i,t}} + E_{i,t} \ge \frac{r_{j,t}}{c_{j,t}} + E_{j,t}\ \forall j \ne i,\ n_{i,t} \ge U^i_i(\eta)\Big\}
\le 1 + U^i_i(\eta) + \sum_{t=K+1}^{\tau_B}\mathbb{1}\Big\{\frac{r_{i,t}}{c_{i,t}} + E_{i,t} \ge \frac{r_{1,t}}{c_{1,t}} + E_{1,t},\ n_{i,t} \ge U^i_i(\eta)\Big\}\\
&\le 1 + U^i_i(\eta) + \sum_{t=K+1}^{\tau_B}\Big(\mathbb{1}\Big\{\frac{r_{i,t}}{c_{i,t}} + E_{i,t} \ge \frac{r_{1,t}}{c_{1,t}} + E_{1,t},\ \frac{r_{1,t}}{c_{1,t}} + E_{1,t} \le \frac{\mu^r_1}{\mu^c_1},\ n_{i,t} \ge U^i_i(\eta)\Big\} + \mathbb{1}\Big\{\frac{r_{i,t}}{c_{i,t}} + E_{i,t} \ge \frac{r_{1,t}}{c_{1,t}} + E_{1,t},\ \frac{r_{1,t}}{c_{1,t}} + E_{1,t} \ge \frac{\mu^r_1}{\mu^c_1},\ n_{i,t} \ge U^i_i(\eta)\Big\}\Big)\\
&\le 1 + U^i_i(\eta) + \sum_{t=K+1}^{\tau_B}\Big(\mathbb{1}\Big\{\frac{r_{1,t}}{c_{1,t}} + E_{1,t} \le \frac{\mu^r_1}{\mu^c_1}\Big\} + \mathbb{1}\Big\{\frac{r_{i,t}}{c_{i,t}} + E_{i,t} \ge \frac{\mu^r_i + \delta_i}{\mu^c_i - \delta_i},\ n_{i,t} \ge U^i_i(\eta)\Big\}\Big),
\end{aligned}$$

in which $U^i_i(\eta) = \frac{\alpha^2\ln\tau_B}{(\mu^r_1/\mu^c_1 + 1)^2}\Big(\frac{\mu^c_i - \eta\delta_i}{\delta_i - \eta\delta_i}\Big)^2$, and $\eta \in (0, 1)$ is a parameter to be determined later. Accordingly,

$$\mathbb{E}[T_i] \le 1 + U^i_i(\eta) + \sum_{t=K+1}^{\tau_B}\underbrace{\mathbb{P}\Big\{\frac{r_{1,t}}{c_{1,t}} + E_{1,t} \le \frac{\mu^r_1}{\mu^c_1}\Big\}}_{\text{term } a} + \sum_{t=K+1}^{\tau_B}\underbrace{\mathbb{P}\Big\{\frac{r_{i,t}}{c_{i,t}} + E_{i,t} \ge \frac{\mu^r_i + \delta_i}{\mu^c_i - \delta_i},\ n_{i,t} \ge U^i_i(\eta)\Big\}}_{\text{term } b}.$$

We will bound the sums of term $a$ and term $b$ in the following two steps.

(S2-1) Bound the sum of term $a$: First we have $\mathbb{P}\{\frac{\mu^r_1}{\mu^c_1}-E_{1,t}\ge\frac{\bar r_{1,t}}{\bar c_{1,t}},\ \frac{\mu^r_1}{\mu^c_1}-E_{1,t}<0\}=0$. Thus we only need to focus on the case $\frac{\mu^r_1}{\mu^c_1}-E_{1,t}\ge0$. Given an $E_{1,t}$, one can find a proper $d_a(t)\in(0,\mu^r_1]$ s.t. $\frac{\mu^r_1-d_a(t)}{\mu^c_1+d_a(t)}=\frac{\mu^r_1}{\mu^c_1}-E_{1,t}$, whose solution is $d_a(t)=\frac{\mu^c_1E_{1,t}}{\mu^r_1/\mu^c_1+1-E_{1,t}}$. The condition $\frac{\mu^r_1}{\mu^c_1}-E_{1,t}\ge0$ implies that (i) $d_a(t)>0$; (ii) $\mu^r_1-d_a(t)\ge0$; (iii) $\mu^c_1+d_a(t)>0$.

Define $I^{\text{i-UCB}}_a(\alpha)=\big(\frac{\alpha\mu^c_1}{\mu^r_1/\mu^c_1+1}\big)^2$ and $I^{\text{i-UCB}}_b(\alpha)=\frac{1}{2}\big(\frac{\alpha\mu^c_{\min}}{\mu^r_1/\mu^c_1+1}\big)^2$. Accordingly, we have

\[
\begin{aligned}
\mathbb{P}\Big\{\frac{\bar r_{1,t}}{\bar c_{1,t}}+E_{1,t}\le\frac{\mu^r_1}{\mu^c_1}\Big\}
&=\mathbb{P}\Big\{\frac{\bar r_{1,t}}{\bar c_{1,t}}+E_{1,t}\le\frac{\mu^r_1}{\mu^c_1},\ \frac{\mu^r_1}{\mu^c_1}-E_{1,t}\ge0\Big\}
=\mathbb{P}\Big\{\frac{\bar r_{1,t}}{\bar c_{1,t}}\le\frac{\mu^r_1-d_a(t)}{\mu^c_1+d_a(t)},\ \frac{\mu^r_1}{\mu^c_1}-E_{1,t}\ge0\Big\}\\
&\le\mathbb{P}\Big\{\frac{\bar r_{1,t}}{\bar c_{1,t}}\le\frac{\mu^r_1-d_a(t)}{\mu^c_1+d_a(t)},\ \mu^c_1+d_a(t)>0,\ \mu^r_1-d_a(t)\ge0,\ d_a(t)>0\Big\}\\
&\le\mathbb{P}\{\bar r_{1,t}\le\mu^r_1-d_a(t),\ \mu^c_1+d_a(t)>0,\ \mu^r_1-d_a(t)\ge0,\ d_a(t)>0\}
+\mathbb{P}\{\bar c_{1,t}\ge\mu^c_1+d_a(t),\ \mu^c_1+d_a(t)>0,\ \mu^r_1-d_a(t)\ge0,\ d_a(t)>0\}\\
&\le\mathbb{P}\Big\{\bar r_{1,t}\le\mu^r_1-\frac{\alpha\mu^c_1}{\mu^r_1/\mu^c_1+1}\sqrt{\frac{\ln(t-1)}{n_{1,t}}}\Big\}
+\mathbb{P}\Big\{\bar c_{1,t}\ge\mu^c_1+\frac{\alpha\mu^c_1}{\mu^r_1/\mu^c_1+1}\sqrt{\frac{\ln(t-1)}{n_{1,t}}}\Big\}.
\end{aligned}
\]

Recall that we have defined, $\forall i\in[K]$,
\[
n_{i,t}=\sum_{s=1}^{t-1}\mathbf{1}\{I_s=i\},\qquad \bar r_{i,t}=\frac{1}{n_{i,t}}\sum_{s=1}^{t-1}r_i(s)\mathbf{1}\{I_s=i\}.
\]


To simplify the expressions, temporarily denote $\bar r_{i,t}$ as $\bar X_{i,n_{i,t}}$ $\forall i\in[K]$, where $X_{i,l}$ denotes the $l$-th observation of the reward of the $i$-th arm and $\bar X_{i,s}=\frac{1}{s}\sum_{l=1}^{s}X_{i,l}$. We can get that

\[
\begin{aligned}
&\mathbb{P}\Big\{\bar r_{1,t}\le\mu^r_1-\frac{\alpha\mu^c_1}{\mu^r_1/\mu^c_1+1}\sqrt{\frac{\ln(t-1)}{n_{1,t}}}\Big\}
\le\mathbb{P}\Big\{\exists s\in\{1,2,\cdots,t-1\}\ \text{s.t.}\ \bar X_{1,s}\le\mu^r_1-\frac{\alpha\mu^c_1}{\mu^r_1/\mu^c_1+1}\sqrt{\frac{\ln(t-1)}{s}}\Big\}\\
&=\mathbb{P}\Big\{\exists s\in\{1,2,\cdots,t-1\}\ \text{s.t.}\ \sum_{l=1}^{s}(X_{1,l}-\mu^r_1)\le-\frac{\alpha\mu^c_1}{\mu^r_1/\mu^c_1+1}\sqrt{s\ln(t-1)}\Big\}\\
&\le\sum_{j=0}^{\lfloor\frac{\ln(t-1)}{\ln2}\rfloor}\mathbb{P}\Big\{\exists s:\ \big(\tfrac{1}{2}\big)^{j+1}(t-1)<s\le\big(\tfrac{1}{2}\big)^{j}(t-1)\ \text{s.t.}\ \sum_{l=1}^{s}(X_{1,l}-\mu^r_1)\le-\frac{\alpha\mu^c_1}{\mu^r_1/\mu^c_1+1}\sqrt{\big(\tfrac{1}{2}\big)^{j+1}(t-1)\ln(t-1)}\Big\}\\
&\overset{(\star)}{\le}\sum_{j=0}^{\lfloor\frac{\ln(t-1)}{\ln2}\rfloor}\Big(\frac{1}{t-1}\Big)^{I^{\text{i-UCB}}_a(\alpha)}
\le\big(\log_2(t-1)+1\big)\Big(\frac{1}{t-1}\Big)^{I^{\text{i-UCB}}_a(\alpha)},
\end{aligned}\tag{C.1}
\]

in which the "$\le$" marked with $(\star)$ is obtained by Hoeffding's maximal inequality [11] (Page 30). Similarly, we also have

\[
\mathbb{P}\Big\{\bar c_{1,t}\ge\mu^c_1+\frac{\alpha\mu^c_1}{\mu^r_1/\mu^c_1+1}\sqrt{\frac{\ln(t-1)}{n_{1,t}}}\Big\}
\le\big(\log_2(t-1)+1\big)\Big(\frac{1}{t-1}\Big)^{I^{\text{i-UCB}}_a(\alpha)}.
\]
Consequently, we have obtained that term $a$ is bounded by $2\big(\log_2(t-1)+1\big)\big(\frac{1}{t-1}\big)^{I^{\text{i-UCB}}_a(\alpha)}$. Therefore, the sum of term $a$ is upper bounded by $\phi_B\big(I^{\text{i-UCB}}_a(\alpha)\big)$, which is sublinear in $B$.

(S2-2) Bound the sum of term $b$: Similar to the analysis of term $a$, we can find a proper $d_b(t)$ s.t.
\[
\frac{\mu^r_i+\delta_i-d_b(t)}{\mu^c_i-\delta_i+d_b(t)}=\frac{\mu^r_i+\delta_i}{\mu^c_i-\delta_i}-\alpha\sqrt{\frac{\ln(t-1)}{n_{i,t}}},\tag{C.2}
\]
the solution of which is
\[
d_b(t)=\frac{\alpha(\mu^c_i-\delta_i)\sqrt{\frac{\ln(t-1)}{n_{i,t}}}}{\frac{\mu^r_1}{\mu^c_1}+1-\alpha\sqrt{\frac{\ln(t-1)}{n_{i,t}}}}.
\]

For any $\eta\in(0,1)$, if $n_{i,t}\ge U^i_i(\eta)$ and $t\in\{K+1,\cdots,\tau_B\}$, we can verify that $0<d_b(t)\le(1-\eta)\delta_i$. We have
\[
\begin{aligned}
\mathbb{P}\Big\{\frac{\bar r_{i,t}}{\bar c_{i,t}}+E_{i,t}\ge\frac{\mu^r_i+\delta_i}{\mu^c_i-\delta_i},\ n_{i,t}\ge U^i_i(\eta)\Big\}
&\le\mathbb{P}\Big\{\frac{\bar r_{i,t}}{\bar c_{i,t}}\ge\frac{\mu^r_i+\delta_i-d_b(t)}{\mu^c_i-\delta_i+d_b(t)},\ n_{i,t}\ge U^i_i(\eta)\Big\}\\
&\le\mathbb{P}\{\bar r_{i,t}\ge\mu^r_i+\delta_i-d_b(t),\ n_{i,t}\ge U^i_i(\eta)\}+\mathbb{P}\{\bar c_{i,t}\le\mu^c_i-\delta_i+d_b(t),\ n_{i,t}\ge U^i_i(\eta)\}\\
&\le\mathbb{P}\{\bar r_{i,t}\ge\mu^r_i+\eta\delta_i,\ n_{i,t}\ge U^i_i(\eta)\}+\mathbb{P}\{\bar c_{i,t}\le\mu^c_i-\eta\delta_i,\ n_{i,t}\ge U^i_i(\eta)\}.
\end{aligned}
\]

For simplicity, define
\[
f_b(i,\alpha,\eta)=\frac{2\alpha^2\eta^2\delta_i^2}{(\mu^r_1/\mu^c_1+1)^2}\Big(\frac{\mu^c_i-\eta\delta_i}{\delta_i-\eta\delta_i}\Big)^2.
\]

Again, temporarily denote $\bar r_{i,t}$ as $\bar X_{i,n_{i,t}}$ $\forall i\in[K]$. We can obtain that
\[
\begin{aligned}
\mathbb{P}\{\bar r_{i,t}\ge\mu^r_i+\eta\delta_i,\ n_{i,t}\ge U^i_i(\eta)\}
&=\sum_{l=\lceil U^i_i(\eta)\rceil}^{\tau_B-1}\mathbb{P}\{\bar X_{i,n_{i,t}}\ge\mu^r_i+\eta\delta_i,\ n_{i,t}=l\}
\le\sum_{l=\lceil U^i_i(\eta)\rceil}^{\tau_B-1}\mathbb{P}\{\bar X_{i,l}\ge\mu^r_i+\eta\delta_i\}\\
&\le\sum_{l=\lceil U^i_i(\eta)\rceil}^{\infty}\exp(-2l\eta^2\delta_i^2)\ \ \text{(by Hoeffding's inequality)}
\le\frac{\exp(2\eta^2\delta_i^2)}{2\eta^2\delta_i^2}\,\tau_B^{-f_b(i,\alpha,\eta)}.
\end{aligned}\tag{C.3}
\]
Similarly, one can obtain that
\[
\mathbb{P}\{\bar c_{i,t}\le\mu^c_i-\delta_i+d_b(t),\ n_{i,t}\ge U^i_i(\eta)\}\le\frac{\exp(2\eta^2\delta_i^2)}{2\eta^2\delta_i^2}\,\tau_B^{-f_b(i,\alpha,\eta)}.
\]
As a result, term $b$ is upper bounded by $\frac{\exp(2\eta^2\delta_i^2)}{\eta^2\delta_i^2}\tau_B^{-f_b(i,\alpha,\eta)}$. Accordingly, for term $b$,
\[
\sum_{t=K+1}^{\tau_B}\mathbb{P}\Big\{\frac{\bar r_{i,t}}{\bar c_{i,t}}+E_{i,t}\ge\frac{\mu^r_i+\delta_i}{\mu^c_i-\delta_i},\ n_{i,t}\ge U^i_i(\eta)\Big\}
\le\frac{\exp(2\eta^2\delta_i^2)}{\eta^2\delta_i^2}\,\tau_B^{1-f_b(i,\alpha,\eta)}.\tag{C.4}
\]


Therefore, according to (S1) and (S2-1)–(S2-2), we have
\[
\mathbb{E}[T_i]\le1+U^i_i(\eta)+\phi_B\big(I^{\text{i-UCB}}_a(\alpha)\big)+\frac{\exp(2\eta^2\delta_i^2)}{\eta^2\delta_i^2}\,\tau_B^{1-f_b(i,\alpha,\eta)}.\tag{C.5}
\]

Due to the complexity of optimizing over $\eta$, we simply choose $\eta=\frac{1}{2}$ and get
\[
f_b\big(i,\alpha,\tfrac{1}{2}\big)=\frac{2\alpha^2\eta^2\delta_i^2}{(\mu^r_1/\mu^c_1+1)^2}\Big(\frac{\mu^c_i-\eta\delta_i}{\delta_i-\eta\delta_i}\Big)^2
=\frac{2\alpha^2\big(\mu^c_i-\tfrac{1}{2}\delta_i\big)^2}{(\mu^r_1/\mu^c_1+1)^2}
>\frac{\alpha^2(\mu^c_i)^2}{2(\mu^r_1/\mu^c_1+1)^2};
\qquad
U^i_i\big(\tfrac{1}{2}\big)=\frac{4\alpha^2\ln\tau_B}{\Delta_i^2}\big(\mu^c_i-\tfrac{1}{2}\delta_i\big)^2\le\frac{4(\mu^c_i)^2\alpha^2\ln\tau_B}{\Delta_i^2}.
\]

Therefore, according to Lemma 4, the regret of i-UCB is upper bounded by
\[
\sum_{i=2}^{K}\frac{4(\mu^c_i)^2\alpha^2}{\Delta_i}\ln\tau_B
+\sum_{i=2}^{K}\Delta_i\Big(1+\phi_B\big(I^{\text{i-UCB}}_a(\alpha)\big)+\frac{7}{\delta_i^2}\,\tau_B^{1-I^{\text{i-UCB}}_b(\alpha)}+X(B)\Big)
+\frac{2\mu^r_1}{\mu^c_1}.\tag{C.6}
\]

Appendix C.2. The regret bound of i-UCB by setting $\alpha=2(1+\frac{1}{\lambda})/\lambda$

For term $a$, since $I^{\text{i-UCB}}_a(\alpha)=\big(\frac{\alpha\mu^c_1}{\mu^r_1/\mu^c_1+1}\big)^2>4$, we know that
\[
\phi_B\big(I^{\text{i-UCB}}_a(\alpha)\big)=\sum_{t=K+1}^{\tau_B}2\big(\log_2(t-1)+1\big)\Big(\frac{1}{t-1}\Big)^{I^{\text{i-UCB}}_a(\alpha)}
\le\sum_{t=K+1}^{\infty}2\big(\log_2(t-1)+1\big)\Big(\frac{1}{t-1}\Big)^{4}
\le\sum_{t=K+1}^{\infty}\Big[\frac{2}{\ln2}\Big(\frac{1}{t-1}\Big)^{3}+2\Big(\frac{1}{t-1}\Big)^{4}\Big]
\le\int_{1}^{\infty}\frac{1}{\ln2}\frac{2}{t^3}\,\mathrm{d}t+\int_{1}^{\infty}\frac{2}{t^4}\,\mathrm{d}t\le3.
\]
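The comparison with the integrals above can also be checked numerically; the short script below (not part of the proof; $K=2$ is an illustrative assumption) sums a long truncation of $\sum_{t\ge K+1}2(\log_2(t-1)+1)(t-1)^{-4}$ and confirms it is well below 3.

```python
# Numerical sanity check of phi_B(4) <= 3 used in the term-a bound above.
import math

K = 2
partial = sum(2 * (math.log2(t - 1) + 1) * (t - 1) ** -4 for t in range(K + 1, 100_000))
print(f"truncated sum = {partial:.4f} (<= 3)")
```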

For term $b$, we have $\tau_B^{1-I^{\text{i-UCB}}_b(\alpha)}\le1$. As a result,
\[
R_{\text{i-UCB}}\le\sum_{i=2}^{K}\frac{16(\mu^c_i)^2}{\Delta_i}\Big(\frac{1+\frac{1}{\lambda}}{\lambda}\Big)^2\ln\tau_B
+\sum_{i=2}^{K}\Delta_i\Big(4+X(B)+\frac{7}{\delta_i^2}\Big)
+\frac{2\mu^r_1}{\mu^c_1},\tag{C.7}
\]
which is an $O(\ln B)$ regret bound.

Appendix C.3. Detailed analysis of the m-UCB algorithm

Throughout this subsection, (A) let $T_i$ denote the number of pulling rounds of arm $i$ from round 1 to round $\tau_B$ under the m-UCB algorithm; (B) denote $\delta_i(1)$ as $\delta_i$ $\forall i\ge2$; (C) temporarily omit the "$\alpha$" from $E^{\alpha}_{i,t}$, i.e., denote $E^{\alpha}_{i,t}$ as $E_{i,t}$ for any $i\in[K]$; (D) define $U^m_i(\eta)=\frac{\alpha^2\ln\tau_B}{\delta_i^2(1-\eta)^2}$.
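For orientation, the following is a minimal sketch of the m-UCB index that the decomposition below analyzes: $\min\{\bar r_{i,t}+E_{i,t},1\}/\max\{\bar c_{i,t}-E_{i,t},0\}$. The exploration term $E^{\alpha}_{i,t}=\alpha\sqrt{\ln(t-1)/n_{i,t}}$ is assumed as in Appendix C.1; a zero denominator corresponds to an effectively infinite index, and the tiny floor in the code is only a numerical stand-in for that convention.

```python
# Sketch of the m-UCB pulling rule: index = min{r+E,1} / max{c-E,0}.
import math
import numpy as np

def m_ucb_pull(mean_r, mean_c, n, t, alpha):
    bonus = alpha * np.sqrt(math.log(t - 1) / n)
    num = np.minimum(mean_r + bonus, 1.0)
    den = np.maximum(mean_c - bonus, 0.0)
    index = num / np.maximum(den, 1e-12)   # floor stands in for an infinite index
    return int(np.argmax(index))
```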

(S1) Decompose $T_i$ for m-UCB:
\[
\begin{aligned}
T_i &\le 1+\sum_{t=K+1}^{\tau_B}\mathbf{1}\{I_t=i\}
\le\lceil U^m_i(\eta)\rceil+\sum_{t=K+1}^{\tau_B}\mathbf{1}\Big\{\frac{\min\{\bar r_{i,t}+E_{i,t},1\}}{\max\{\bar c_{i,t}-E_{i,t},0\}}\ge\frac{\min\{\bar r_{j,t}+E_{j,t},1\}}{\max\{\bar c_{j,t}-E_{j,t},0\}}\ \forall j\neq i,\ n_{i,t}\ge\lceil U^m_i(\eta)\rceil\Big\}\\
&\le1+U^m_i(\eta)+\sum_{t=K+1}^{\tau_B}\mathbf{1}\Big\{\frac{\min\{\bar r_{i,t}+E_{i,t},1\}}{\max\{\bar c_{i,t}-E_{i,t},0\}}\ge\frac{\min\{\bar r_{1,t}+E_{1,t},1\}}{\max\{\bar c_{1,t}-E_{1,t},0\}},\ n_{i,t}\ge U^m_i(\eta)\Big\}\\
&\le1+U^m_i(\eta)+\sum_{t=K+1}^{\tau_B}\Big(\mathbf{1}\Big\{\frac{\min\{\bar r_{i,t}+E_{i,t},1\}}{\max\{\bar c_{i,t}-E_{i,t},0\}}\ge\frac{\min\{\bar r_{1,t}+E_{1,t},1\}}{\max\{\bar c_{1,t}-E_{1,t},0\}},\ \frac{\min\{\bar r_{1,t}+E_{1,t},1\}}{\max\{\bar c_{1,t}-E_{1,t},0\}}\le\frac{\mu^r_1}{\mu^c_1},\ n_{i,t}\ge U^m_i(\eta)\Big\}\\
&\qquad\qquad+\mathbf{1}\Big\{\frac{\min\{\bar r_{i,t}+E_{i,t},1\}}{\max\{\bar c_{i,t}-E_{i,t},0\}}\ge\frac{\min\{\bar r_{1,t}+E_{1,t},1\}}{\max\{\bar c_{1,t}-E_{1,t},0\}},\ \frac{\min\{\bar r_{1,t}+E_{1,t},1\}}{\max\{\bar c_{1,t}-E_{1,t},0\}}\ge\frac{\mu^r_i+\delta_i}{\mu^c_i-\delta_i},\ n_{i,t}\ge U^m_i(\eta)\Big\}\Big)\\
&\le1+U^m_i(\eta)+\sum_{t=K+1}^{\tau_B}\Big(\mathbf{1}\Big\{\frac{\min\{\bar r_{1,t}+E_{1,t},1\}}{\max\{\bar c_{1,t}-E_{1,t},0\}}\le\frac{\mu^r_1}{\mu^c_1}\Big\}
+\mathbf{1}\Big\{\frac{\min\{\bar r_{i,t}+E_{i,t},1\}}{\max\{\bar c_{i,t}-E_{i,t},0\}}\ge\frac{\mu^r_i+\delta_i}{\mu^c_i-\delta_i},\ n_{i,t}\ge U^m_i(\eta)\Big\}\Big).
\end{aligned}
\]


Since $\mu^r_i<1$ and $\mu^c_i>0$ $\forall i\in[K]$, $\mathbb{E}[T_i]$ is upper bounded by
\[
\begin{aligned}
&1+U^m_i(\eta)+\sum_{t=K+1}^{\tau_B}\Big[\mathbb{P}\Big\{\frac{\min\{\bar r_{1,t}+E_{1,t},1\}}{\max\{\bar c_{1,t}-E_{1,t},0\}}\le\frac{\mu^r_1}{\mu^c_1}\Big\}
+\mathbb{P}\Big\{\frac{\min\{\bar r_{i,t}+E_{i,t},1\}}{\max\{\bar c_{i,t}-E_{i,t},0\}}\ge\frac{\mu^r_i+\delta_i}{\mu^c_i-\delta_i},\ n_{i,t}\ge U^m_i(\eta)\Big\}\Big]\\
&\le1+U^m_i(\eta)+\sum_{t=K+1}^{\tau_B}\Big[\mathbb{P}\{\min\{\bar r_{1,t}+E_{1,t},1\}\le\mu^r_1\}+\mathbb{P}\{\max\{\bar c_{1,t}-E_{1,t},0\}\ge\mu^c_1\}\\
&\qquad\qquad+\mathbb{P}\{\min\{\bar r_{i,t}+E_{i,t},1\}\ge\mu^r_i+\delta_i,\ n_{i,t}\ge U^m_i(\eta)\}+\mathbb{P}\{\max\{\bar c_{i,t}-E_{i,t},0\}\le\mu^c_i-\delta_i,\ n_{i,t}\ge U^m_i(\eta)\}\Big]\\
&\le1+U^m_i(\eta)+\sum_{t=K+1}^{\tau_B}\Big[\underbrace{\mathbb{P}\{\bar r_{1,t}+E_{1,t}\le\mu^r_1\}+\mathbb{P}\{\bar c_{1,t}-E_{1,t}\ge\mu^c_1\}}_{\text{term }a}
+\underbrace{\mathbb{P}\{\bar r_{i,t}+E_{i,t}\ge\mu^r_i+\delta_i,\ n_{i,t}\ge U^m_i(\eta)\}+\mathbb{P}\{\bar c_{i,t}-E_{i,t}\le\mu^c_i-\delta_i,\ n_{i,t}\ge U^m_i(\eta)\}}_{\text{term }b}\Big].
\end{aligned}\tag{C.8}
\]

(S2-1) Bound the sum of term $a$: Similar to the derivation of (C.1), we have
\[
\mathbb{P}\{\bar r_{1,t}+E_{1,t}\le\mu^r_1\}+\mathbb{P}\{\bar c_{1,t}-E_{1,t}\ge\mu^c_1\}\le2\big(\log_2(t-1)+1\big)\Big(\frac{1}{t-1}\Big)^{\alpha^2}.\tag{C.9}
\]
Therefore, the sum of term $a$ is upper bounded by $\phi_B(\alpha^2)$.

(S2-2) Bound the sum of term $b$: $\forall\eta\in(0,1)$, given $\tau_B$, if $n_{i,t}\ge U^m_i(\eta)$, we can obtain that $\delta_i-E_{i,t}\ge\eta\delta_i$. Thus, term $b$ can be bounded as

\[
\begin{aligned}
&\mathbb{P}\{\bar r_{i,t}+E_{i,t}\ge\mu^r_i+\delta_i,\ n_{i,t}\ge U^m_i(\eta)\}+\mathbb{P}\{\bar c_{i,t}-E_{i,t}\le\mu^c_i-\delta_i,\ n_{i,t}\ge U^m_i(\eta)\}\\
&\le\mathbb{P}\{\bar r_{i,t}-\mu^r_i\ge\eta\delta_i,\ n_{i,t}\ge U^m_i(\eta)\}+\mathbb{P}\{\bar c_{i,t}-\mu^c_i\le-\eta\delta_i,\ n_{i,t}\ge U^m_i(\eta)\}\\
&\le2\tau_B\exp\big(-2U^m_i(\eta)\eta^2\delta_i^2\big)\le2\tau_B\exp\Big(-\frac{2\alpha^2\eta^2\ln\tau_B}{(1-\eta)^2}\Big),
\end{aligned}\tag{C.10}
\]

where the last step is obtained by Hoeffding's inequality and the union bound (similar to the derivation of (C.3)). Thus,
\[
\mathbb{E}\sum_{t=K+1}^{\tau_B}\mathbf{1}\Big\{\frac{\min\{\bar r_{i,t}+E_{i,t},1\}}{\max\{\bar c_{i,t}-E_{i,t},0\}}\ge\frac{\mu^r_i+\delta_i}{\mu^c_i-\delta_i},\ n_{i,t}\ge U^m_i(\eta)\Big\}\le2\tau_B^{\,1-\frac{2\alpha^2\eta^2}{(1-\eta)^2}}.\tag{C.11}
\]

According to (S1)–(S2-2), $\mathbb{E}[T^{\text{m-UCB}}_i]$ is upper bounded by
\[
1+\frac{\alpha^2(\mu^r_1/\mu^c_1+1)^2}{(1-\eta)^2\Delta_i^2}\ln\tau_B+\phi_B(\alpha^2)+2\tau_B^{\,1-\frac{2\alpha^2\eta^2}{(1-\eta)^2}}.\tag{C.12}
\]

According to Lemma 4, by setting $\frac{2\alpha^2\eta^2}{(1-\eta)^2}-1=1$, the regret bound of m-UCB is
\[
\sum_{i=2}^{K}\frac{[(\alpha+1)(\mu^r_1/\mu^c_1+1)]^2}{\Delta_i}\ln\tau_B+\big(3+\phi_B(\alpha^2)+X(B)\big)\sum_{i=2}^{K}\Delta_i+\frac{2\mu^r_1}{\mu^c_1}.
\]

By choosing $\alpha=2$, we can obtain
\[
\sum_{i=2}^{K}\frac{9(\mu^r_1/\mu^c_1+1)^2}{\Delta_i}\ln\tau_B+\big(6+X(B)\big)\sum_{i=2}^{K}\Delta_i+\frac{2\mu^r_1}{\mu^c_1}.
\]

Appendix C.4. Detailed analysis of the b-GREEDY algorithm

Throughout this subsection, (A) let $T_i$ denote the number of pulling rounds of arm $i$ from round 1 to round $\tau_B$ under the b-GREEDY algorithm.

(S1) Decompose $\mathbb{E}[T_i]$: Obviously, $\mathbb{E}[T_i]=\sum_{t=1}^{\tau_B}\mathbb{P}\{I_t=i\}$. We will bound $\mathbb{P}\{I_t=i\}$ in the next four steps.

(S2-1) Decompose $\mathbb{P}\{I_t=i\}$: It is obvious that
\[
\mathbb{P}\{I_t=i\}\le\frac{\varepsilon_t}{K}+(1-\varepsilon_t)\mathbb{P}\Big\{\frac{\bar r_{i,t}}{\bar c_{i,t}}\ge\frac{\bar r_{1,t}}{\bar c_{1,t}}\Big\}
\le\frac{\varepsilon_t}{K}+\mathbb{P}\Big\{\frac{\bar r_{i,t}}{\bar c_{i,t}}\ge\frac{\bar r_{1,t}}{\bar c_{1,t}}\Big\},\tag{C.13}
\]
and the term $\mathbb{P}\big\{\frac{\bar r_{i,t}}{\bar c_{i,t}}\ge\frac{\bar r_{1,t}}{\bar c_{1,t}}\big\}$ can be further upper bounded by

\[
\begin{aligned}
&\mathbb{P}\Big\{\frac{\bar r_{i,t}}{\bar c_{i,t}}\ge\frac{\bar r_{1,t}}{\bar c_{1,t}},\ \frac{\bar r_{1,t}}{\bar c_{1,t}}\ge\frac{\mu^r_i+\varrho_i}{\mu^c_i-\varrho_i}\Big\}
+\mathbb{P}\Big\{\frac{\bar r_{i,t}}{\bar c_{i,t}}\ge\frac{\bar r_{1,t}}{\bar c_{1,t}},\ \frac{\bar r_{1,t}}{\bar c_{1,t}}\le\frac{\mu^r_1-\varrho_i}{\mu^c_1+\varrho_i}\Big\}\\
&\le\mathbb{P}\Big\{\frac{\bar r_{i,t}}{\bar c_{i,t}}\ge\frac{\mu^r_i+\varrho_i}{\mu^c_i-\varrho_i}\Big\}
+\mathbb{P}\Big\{\frac{\bar r_{1,t}}{\bar c_{1,t}}\le\frac{\mu^r_1-\varrho_i}{\mu^c_1+\varrho_i}\Big\}\\
&\le\mathbb{P}\{\bar r_{i,t}\ge\mu^r_i+\varrho_i\}+\mathbb{P}\{\bar c_{i,t}\le\mu^c_i-\varrho_i\}+\mathbb{P}\{\bar r_{1,t}\le\mu^r_1-\varrho_i\}+\mathbb{P}\{\bar c_{1,t}\ge\mu^c_1+\varrho_i\}.
\end{aligned}\tag{C.14}
\]


(S2-2) Bound the four terms in (C.14): We can bound the four terms in (C.14) in similar ways; take the first one as an example. Define $z_t=\frac{1}{2K}\sum_{s=1}^{t-1}\varepsilon_s$ and denote by $n^R_{i,t}$ the number of random plays of arm $i$ before (excluding) round $t$. According to Hoeffding's inequality,
\[
\begin{aligned}
\mathbb{P}\{\bar r_{i,t}\ge\mu^r_i+\varrho_i\}
&=\sum_{l=0}^{t-1}\mathbb{P}\{n_{i,t}=l;\ \bar r_{i,t}\ge\mu^r_i+\varrho_i\}
\le\sum_{l=\lceil z_t\rceil}^{t-1}\mathbb{P}\{\bar r_{i,t}\ge\mu^r_i+\varrho_i,\ n_{i,t}=l\}+\sum_{l=0}^{\lfloor z_t\rfloor}\mathbb{P}\{n_{i,t}=l\}\\
&\le\sum_{l=\lceil z_t\rceil}^{\infty}\exp(-2l\varrho_i^2)+\mathbb{P}\{n_{i,t}\le\lfloor z_t\rfloor\}
\le\frac{4}{\varrho_i^2}e^{-2\varrho_i^2z_t}+\mathbb{P}\{n^R_{i,t}\le\lfloor z_t\rfloor\}.
\end{aligned}\tag{C.15}
\]

Considering that $\mathbb{E}[n^R_{i,t}]=2z_t$ and
\[
\mathrm{Var}[n^R_{i,t}]=\sum_{s=1}^{t-1}\frac{\varepsilon_s}{K}\Big(1-\frac{\varepsilon_s}{K}\Big)\le\sum_{s=1}^{t-1}\frac{\varepsilon_s}{K}=2z_t,
\]
according to the Bernstein inequality (Fact 3), we have $\mathbb{P}\{n^R_{i,t}\le z_t\}\le e^{-\frac{z_t}{5}}$, and thus
\[
\mathbb{P}\{\bar r_{i,t}\ge\mu^r_i+\varrho_i\}\le\frac{4}{\varrho_i^2}e^{-2\varrho_i^2z_t}+e^{-\frac{z_t}{5}}.\tag{C.16}
\]

Similarly, we can obtain the upper bounds for the other three terms as follows:
\[
\text{(A)}\ \mathbb{P}\{\bar c_{i,t}\le\mu^c_i-\varrho_i\}\le\frac{4}{\varrho_i^2}e^{-2\varrho_i^2z_t}+e^{-\frac{z_t}{5}};\quad
\text{(B)}\ \mathbb{P}\{\bar r_{1,t}\le\mu^r_1-\varrho_i\}\le\frac{4}{\varrho_i^2}e^{-2\varrho_i^2z_t}+e^{-\frac{z_t}{5}};\quad
\text{(C)}\ \mathbb{P}\{\bar c_{1,t}\ge\mu^c_1+\varrho_i\}\le\frac{4}{\varrho_i^2}e^{-2\varrho_i^2z_t}+e^{-\frac{z_t}{5}}.\tag{C.17}
\]

(S2-3) Lower bound $z_t$: Let $t'=\alpha K$; then we can easily verify that
\[
z_t\ge\frac{\lfloor t'\rfloor}{2K}+\frac{1}{2K}\sum_{s=\lfloor t'\rfloor+1}^{t-1}\frac{\alpha K}{s}
\ge\frac{\lfloor\alpha K\rfloor}{2K}+\frac{\alpha}{2}\ln\frac{t}{\alpha K+1}
\ge\frac{\alpha}{2}\ln\frac{e^{1-\frac{1}{\alpha K}}\,t}{\alpha K+1}
=\frac{\alpha}{2}\ln\big(\kappa(\alpha)t\big),\tag{C.18}
\]
where $\kappa(\alpha)=\frac{e^{1-\frac{1}{\alpha K}}}{\alpha K+1}$.

(S2-4) Bound $\mathbb{E}[T_i]$: Combining inequalities (C.13) to (C.18),

\[
\mathbb{P}\{I_t=i\}\le\frac{\varepsilon_t}{K}+4e^{-\frac{z_t}{5}}+\frac{16}{\varrho_i^2}e^{-2\varrho_i^2z_t}
\le\frac{\alpha}{t}+4\Big(\frac{1}{\kappa(\alpha)t}\Big)^{\frac{\alpha}{10}}+\frac{16}{\varrho_i^2}\big(\kappa(\alpha)t\big)^{-\varrho_i^2\alpha}.
\]

As a result,
\[
\mathbb{E}[T_i]\le\alpha\ln\tau_B+4\phi_B\Big(\frac{\alpha}{10}\Big)+\frac{16}{\varrho_i^2}\phi_B(\varrho_i^2\alpha)+g_i(\alpha),\tag{C.19}
\]
where
\[
g_i(\alpha)=\alpha+4\Big(\frac{1}{\kappa(\alpha)}\Big)^{\frac{\alpha}{10}}+\frac{16}{\varrho_i^2}\kappa(\alpha)^{-\varrho_i^2\alpha},
\qquad
\phi_B(x)=\Big(\frac{e^{1-\frac{1}{\alpha K}}}{\alpha K+1}\Big)^{-x}\sum_{t=2}^{\tau_B}\Big(\frac{1}{t}\Big)^{x}\ \ \forall x>0.\tag{C.20}
\]

(S3) We can get the regret bound of b-GREEDY by Lemma 4.
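For completeness, the following is a minimal sketch of the b-GREEDY decision rule analyzed above. The exploration schedule $\varepsilon_t=\min\{1,\alpha K/t\}$ is an assumption consistent with the lower bound on $z_t$ in step (S2-3); the paper's exact pseudo-code may differ in details such as initialization and tie-breaking.

```python
# Sketch of b-GREEDY: explore uniformly with probability eps_t, otherwise pull
# the arm with the largest empirical reward/cost ratio.
import numpy as np

def b_greedy_pull(mean_r, mean_c, t, alpha, rng):
    K = len(mean_r)
    eps_t = min(1.0, alpha * K / t)
    if rng.random() < eps_t:                       # explore uniformly at random
        return int(rng.integers(K))
    ratio = mean_r / np.maximum(mean_c, 1e-12)     # exploit: largest empirical r/c
    return int(np.argmax(ratio))
```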

Appendix D. Proof of Theorem 5 and Corollary 6

Denote the pulling number of arm $i$ during the first $B$ rounds as $T_i$. (Note that in budgeted MAB, the total number of pulling rounds is at least $B$.) The regret incurred in the first $B$ rounds is certainly a lower bound of the regret of the budgeted Bernoulli MAB. To remove the randomness of the stopping time, we will only consider the regret of the first $B$ rounds.

To prove Theorem 5, we first need to prove Lemma 11.

Lemma 11. Given any rule for budgeted Bernoulli MAB such that, as $B\to\infty$, $\sum_{i=2}^{K}\mathbb{E}[T_i]=o(B^a)$ for any $a>0$, we have that for any $\varepsilon>0$ and $\gamma\ge0$ with $\mu^r_j+\delta_j(\gamma)<1$,
\[
\lim_{B\to\infty}\mathbb{P}\Big\{T_j\ge\frac{(1-\varepsilon)\ln B}{kl(\mu^r_j,\mu^r_j+\delta_j(\gamma))+kl(\mu^c_j,\mu^c_j-\gamma\delta_j(\gamma))}\Big\}=1.\tag{D.1}
\]

Proof of Lemma 11: Consider a $K$-armed budgeted Bernoulli bandit: $\forall i\in[K]$, the expected reward and cost of arm $i$ are $\mu^r_i$ and $\mu^c_i$ respectively. (We call this bandit the "original bandit".) Fix $j\ge2$; given a proper $\gamma$ s.t. $\mu^r_j+\delta_j(\gamma)<1$, for any $0<\varrho<1$ we can always find an $x^r$ and an $x^c$ such that
\[
\begin{aligned}
&kl(\mu^r_j,\mu^r_j+\delta_j(\gamma)+x^r)<(1+\varrho)\,kl(\mu^r_j,\mu^r_j+\delta_j(\gamma))\ \text{ with }\ x^r>0,\ \mu^r_j+\delta_j(\gamma)+x^r<1;\\
&kl(\mu^c_j,\mu^c_j-\gamma\delta_j(\gamma)-x^c)<(1+\varrho)\,kl(\mu^c_j,\mu^c_j-\gamma\delta_j(\gamma))\ \text{ with }\ x^c>0,\ \mu^c_j-\gamma\delta_j(\gamma)-x^c>0.
\end{aligned}
\]


Define $\chi^r=\mu^r_j+\delta_j(\gamma)+x^r$ and $\chi^c=\mu^c_j-\gamma\delta_j(\gamma)-x^c$. Consider a modified $K$-armed budgeted Bernoulli bandit: the expected reward and cost of arm $i(\neq j)$ are $\mu^r_i$ and $\mu^c_i$ respectively; the expected reward and cost of arm $j$ are $\chi^r$ and $\chi^c$ respectively. We use the notation $\mathbb{P}_M$ and $\mathbb{E}_M$ when we integrate with respect to the original bandit, and $\mathbb{P}_{M'}$ and $\mathbb{E}_{M'}$ when we integrate with respect to the modified bandit. According to the assumption, we have
\[
\mathbb{E}_{M'}[B-T_j]=\sum_{i\neq j}\mathbb{E}_{M'}[T_i]=o(B^a).\tag{D.2}
\]

On the other hand, with $0<a<\varrho$ we can obtain that
\[
\begin{aligned}
\mathbb{E}_{M'}[B-T_j]
&=\mathbb{E}_{M'}\Big[(B-T_j)\,\Big|\,T_j<\tfrac{(1-\varrho)\ln B}{kl(\mu^r_j,\chi^r)+kl(\mu^c_j,\chi^c)}\Big]\mathbb{P}_{M'}\Big\{T_j<\tfrac{(1-\varrho)\ln B}{kl(\mu^r_j,\chi^r)+kl(\mu^c_j,\chi^c)}\Big\}
+\mathbb{E}_{M'}\Big[(B-T_j)\,\Big|\,T_j\ge\tfrac{(1-\varrho)\ln B}{kl(\mu^r_j,\chi^r)+kl(\mu^c_j,\chi^c)}\Big]\mathbb{P}_{M'}\Big\{T_j\ge\tfrac{(1-\varrho)\ln B}{kl(\mu^r_j,\chi^r)+kl(\mu^c_j,\chi^c)}\Big\}\\
&\ge\mathbb{E}_{M'}\Big[(B-T_j)\,\Big|\,T_j<\tfrac{(1-\varrho)\ln B}{kl(\mu^r_j,\chi^r)+kl(\mu^c_j,\chi^c)}\Big]\mathbb{P}_{M'}\Big\{T_j<\tfrac{(1-\varrho)\ln B}{kl(\mu^r_j,\chi^r)+kl(\mu^c_j,\chi^c)}\Big\}\\
&\ge\Big(B-\tfrac{(1-\varrho)\ln B}{kl(\mu^r_j,\chi^r)+kl(\mu^c_j,\chi^c)}\Big)\mathbb{P}_{M'}\Big\{T_j<\tfrac{(1-\varrho)\ln B}{kl(\mu^r_j,\chi^r)+kl(\mu^c_j,\chi^c)}\Big\}.
\end{aligned}\tag{D.3}
\]

Accordingly, we can obtain that
\[
\mathbb{P}_{M'}\Big\{T_j<\frac{(1-\varrho)\ln B}{kl(\mu^r_j,\chi^r)+kl(\mu^c_j,\chi^c)}\Big\}=o(B^{a-1}).\tag{D.4}
\]

Let $X^r_1,X^c_1,X^r_2,X^c_2,\cdots,X^r_B,X^c_B$ denote successive reward and cost observations from the $j$-th arm. Note that $X^r_i,X^c_i\in\{0,1\}$ $\forall i\in[B]$. Define

Define

Lm =

m∑i=1

lnµr

jXri + (1 − µr

j)(1 − Xri )

χrXri + (1 − χr)(1 − Xr

i )+

m∑i=1

lnµc

jXci + (1 − µc

j)(1 − Xci )

χcXci + (1 − χc)(1 − Xc

i ).

An important property is the following: for any event $A$ in the $\sigma$-algebra generated by $X^r_1,X^c_1,X^r_2,X^c_2,\cdots,X^r_B,X^c_B$, the change-of-measure identity
\[
\mathbb{P}_{M'}\{A\}=\mathbb{E}_M\big[\mathbf{1}\{A\}\exp(-L_{T_j})\big]\tag{D.5}
\]
holds.

According to (D.4), we know that $\mathbb{P}_{M'}(\xi)=o(B^{a-1})$, where
\[
\xi=\Big\{T_j<\frac{(1-\varrho)\ln B}{kl(\mu^r_j,\chi^r)+kl(\mu^c_j,\chi^c)}\ \text{ and }\ L_{T_j}\le(1-a)\ln B\Big\}.\tag{D.6}
\]

Given $n_1,n_2,\cdots,n_K$ s.t. $\sum_{i=1}^{K}n_i=B$, $n_i\ge0$ $\forall i\in[K]$, we have
\[
\begin{aligned}
&\mathbb{P}_{M'}\{T_1=n_1,T_2=n_2,\cdots,T_K=n_K,\ L_{T_j}\le(1-a)\ln B\}\\
&=\mathbb{E}_M\big[\mathbf{1}\{T_1=n_1,T_2=n_2,\cdots,T_K=n_K,\ L_{T_j}\le(1-a)\ln B\}\exp(-L_{T_j})\big]\\
&\ge\exp(-(1-a)\ln B)\,\mathbb{P}_M\{T_1=n_1,T_2=n_2,\cdots,T_K=n_K,\ L_{T_j}\le(1-a)\ln B\}.
\end{aligned}\tag{D.7}
\]

Since
\[
\xi=\bigcup_{\sum_{i=1}^{K}n_i=B;\ n_j<\frac{(1-\varrho)\ln B}{kl(\mu^r_j,\chi^r)+kl(\mu^c_j,\chi^c)}}\{T_1=n_1,T_2=n_2,\cdots,T_K=n_K,\ L_{T_j}\le(1-a)\ln B\},\tag{D.8}
\]
we know that $\xi$ can be decomposed into a group of disjoint events. Therefore,
\[
\mathbb{P}_M(\xi)\le B^{1-a}\,\mathbb{P}_{M'}(\xi)\to0\ \text{ as }B\to\infty.\tag{D.9}
\]

By the strong law of large numbers, $\frac{L_m}{m}\to kl(\mu^r_j,\chi^r)+kl(\mu^c_j,\chi^c)>0$; therefore, according to Lemma 10.5 in [11], as $m\to\infty$,
\[
\max_{i\le m}\frac{L_i}{m}\to kl(\mu^r_j,\chi^r)+kl(\mu^c_j,\chi^c)\quad\text{a.s. }[\mathbb{P}_M].\tag{D.10}
\]

Since $1-a>1-\varrho$, it then follows that
\[
\mathbb{P}_M\Big\{L_i>(1-a)\ln B\ \text{for some }i<\frac{(1-\varrho)\ln B}{kl(\mu^r_j,\chi^r)+kl(\mu^c_j,\chi^c)}\Big\}\to0\tag{D.11}
\]

as $B\to\infty$. Therefore,
\[
\begin{aligned}
\lim_{B\to\infty}\mathbb{P}_M\Big\{T_j<\frac{(1-\varrho)\ln B}{kl(\mu^r_j,\chi^r)+kl(\mu^c_j,\chi^c)}\Big\}
&=\lim_{B\to\infty}\mathbb{P}_M\Big\{T_j<\frac{(1-\varrho)\ln B}{kl(\mu^r_j,\chi^r)+kl(\mu^c_j,\chi^c)},\ L_{T_j}>(1-a)\ln B\Big\}+\lim_{B\to\infty}\mathbb{P}_M\{\xi\}\\
&=\lim_{B\to\infty}\mathbb{P}_M\{\xi\}=0.
\end{aligned}\tag{D.12}
\]


As a result, we have
\[
\lim_{B\to\infty}\mathbb{P}_M\Big\{T_j<\frac{1-\varrho}{1+\varrho}\cdot\frac{\ln B}{kl(\mu^r_j,\mu^r_j+\delta_j(\gamma))+kl(\mu^c_j,\mu^c_j-\gamma\delta_j(\gamma))}\Big\}
\le\lim_{B\to\infty}\mathbb{P}_M\Big\{T_j<\frac{(1-\varrho)\ln B}{kl(\mu^r_j,\chi^r)+kl(\mu^c_j,\chi^c)}\Big\}=0.\tag{D.13}
\]
Therefore, by replacing $\frac{1-\varrho}{1+\varrho}$ with $1-\varepsilon$, we obtain Lemma 11.

Given any algorithm "alg" for budgeted Bernoulli MAB such that, as $B\to\infty$, $\sum_{i=2}^{K}\mathbb{E}[T^{\text{alg}}_i]=o(B^a)$ $\forall a>0$, we have $\sum_{i=2}^{K}\mathbb{E}[T_i]=o(B^a)$. By Lemma 11, we can obtain that for any $\varepsilon>0$ and $\gamma\ge0$ with $\mu^r_j+\delta_j(\gamma)<1$,
\[
\lim_{B\to\infty}\mathbb{P}\Big\{T^{\text{alg}}_j\ge\frac{(1-\varepsilon)\ln B}{kl(\mu^r_j,\mu^r_j+\delta_j(\gamma))+kl(\mu^c_j,\mu^c_j-\gamma\delta_j(\gamma))}\Big\}=1.\tag{D.14}
\]

Next we prove that $L^*_j$ is positive and always exists. Let $L_j(\gamma)$ denote $kl(\mu^r_j,\mu^r_j+\delta_j(\gamma))+kl(\mu^c_j,\mu^c_j-\gamma\delta_j(\gamma))$, and denote by $\gamma^*$ the $\gamma$ corresponding to $L^*_j$. Define $\Gamma_j=\{\gamma\ge0\,|\,\mu^r_j+\delta_j(\gamma)<1\}$. Minimizing $L_j(\gamma)$ w.r.t. $\gamma\in\Gamma_j$ is equivalent to maximizing
\[
L_j=\mu^r_j\ln\Big(\mu^r_j+\frac{\Delta_j}{\gamma\mu^r_1/\mu^c_1+1}\Big)
+(1-\mu^r_j)\ln\Big(1-\mu^r_j-\frac{\Delta_j}{\gamma\mu^r_1/\mu^c_1+1}\Big)
+\mu^c_j\ln\Big(\mu^c_j-\frac{\gamma\Delta_j}{\gamma\mu^r_1/\mu^c_1+1}\Big)
+(1-\mu^c_j)\ln\Big(1-\mu^c_j+\frac{\gamma\Delta_j}{\gamma\mu^r_1/\mu^c_1+1}\Big).\tag{D.15}
\]

The first-order derivative of $L_j$ w.r.t. $\gamma$ is
\[
\frac{\partial L_j}{\partial\gamma}=\frac{\Delta_j}{(\gamma\mu^r_1/\mu^c_1+1)^2}
\Bigg\{-\frac{\mu^r_j\,\mu^r_1/\mu^c_1}{\mu^r_j+\frac{\Delta_j}{\gamma\mu^r_1/\mu^c_1+1}}
+\frac{(1-\mu^r_j)\,\mu^r_1/\mu^c_1}{1-\mu^r_j-\frac{\Delta_j}{\gamma\mu^r_1/\mu^c_1+1}}
-\frac{\mu^c_j}{\mu^c_j-\frac{\gamma\Delta_j}{\gamma\mu^r_1/\mu^c_1+1}}
+\frac{1-\mu^c_j}{1-\mu^c_j+\frac{\gamma\Delta_j}{\gamma\mu^r_1/\mu^c_1+1}}\Bigg\}.\tag{D.16}
\]

It is easy to see that (1) $\lim_{\gamma\to\infty}\frac{\partial L_j}{\partial\gamma}<0$; (2) the term within $\{\cdot\}$ decreases w.r.t. $\gamma\in\Gamma_j$. Thus,

1. If $\mu^r_j+\delta_j(0)\ge1$: we can always find a unique non-negative solution $\gamma_0$ s.t. $\mu^r_j+\delta_j(\gamma_0)=1$. It is obvious that $\lim_{\gamma\to\gamma_0^+}\frac{\partial L_j}{\partial\gamma}=+\infty$. According to the intermediate value theorem⁵, there exists a $\gamma'$ s.t. $\frac{\partial L_j}{\partial\gamma}\big|_{\gamma=\gamma'}=0$, and this $\gamma'$ is exactly $\gamma^*$.

2. If $\mu^r_j+\delta_j(0)<1$: when $\frac{\partial L_j}{\partial\gamma}\big|_{\gamma=0}\le0$, $\gamma^*=0$; otherwise, $\gamma^*$ is the value satisfying $\frac{\partial L_j}{\partial\gamma}=0$.

Therefore, we have proved that $L^*_j$ always exists. Since at least one of $\delta_j(\gamma^*)$ and $\gamma^*\delta_j(\gamma^*)$ is positive, we know $L^*_j$ is positive. Finally, according to Lemma 2 of [31], we know that for Bernoulli bandits the pseudo-regret can be written as
\[
\text{Regret}=\sum_{i=2}^{K}\Delta_i\,\mathbb{E}[T^{\text{alg}}_i].\tag{D.17}
\]
During the first $B$ rounds, if the budget spent on a suboptimal arm $j$ were instead spent on the optimal arm, the player could gain more expected reward, which is $\Omega\big(\frac{\Delta_j}{L^*_j}\ln B\big)$ as $B\to\infty$. Summing this over all the suboptimal arms, the regret is $\Omega\big(\sum_{j=2}^{K}\frac{\Delta_j}{L^*_j}\ln B\big)$ as $B\to\infty$.
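As an illustration (not part of the proof), the lower-bound constant $L^*_j$ can be approximated by a simple grid search over $\gamma$. The sketch below assumes $\delta_j(\gamma)$ is defined by $(\mu^r_j+\delta_j(\gamma))/(\mu^c_j-\gamma\delta_j(\gamma))=\mu^r_1/\mu^c_1$, consistent with the $\gamma=1$ case in Eqn. (5); the bandit parameters in the usage line are illustrative assumptions only.

```python
# Grid-search approximation of L_j^* = min_gamma [kl(mu_j^r, mu_j^r + delta) + kl(mu_j^c, mu_j^c - gamma*delta)].
import numpy as np

def kl(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def L_star(mu1r, mu1c, mujr, mujc, grid=np.linspace(0.0, 50.0, 200_001)):
    rho = mu1r / mu1c
    delta = (rho * mujc - mujr) / (1.0 + grid * rho)     # solves the ratio equation
    ok = (mujr + delta < 1.0) & (mujc - grid * delta > 0.0)
    vals = kl(mujr, mujr + delta[ok]) + kl(mujc, mujc - grid[ok] * delta[ok])
    return vals.min()

print(L_star(mu1r=0.6, mu1c=0.5, mujr=0.4, mujc=0.5))   # illustrative parameters
```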

Appendix E. Proof for the fractional KUBE in [27]

In the budgeted MAB with deterministic costs [27], the reward distribution of each arm has support in $[0,1]$. The expected reward of arm $i$ is denoted as $\mu^r_i$ $\forall i\in[K]$. The pulling cost of arm $i\in[K]$ is denoted as $c_i$, which is deterministic; in [27], $c_i\ge1$ $\forall i\in[K]$. Denote $c_{\min}=\min_{i\in[K]}c_i$. W.l.o.g., set $\arg\max_{i\in[K]}\frac{\mu^r_i}{c_i}=1$. Define $\Delta_i=\frac{\mu^r_1}{c_1}-\frac{\mu^r_i}{c_i}$ $\forall i\ge2$, $\Delta^c_i=c_i-c_1$ $\forall i\ge2$ (i.e., the $\delta_i$ in [27]), and $\Delta^r_i=\mu^r_1-\mu^r_i$ $\forall i\ge2$ (i.e., the $\Delta_i$ in [27]). Denote the stopping time as $T_B$.

The algorithm "fractional KUBE" proposed in [27] is: at each round, pull the arm with the maximum index $\frac{\bar r_{i,t}+E_{i,t}}{c_i}$, in which $E_{i,t}=\sqrt{\frac{2\ln(t-1)}{n_{i,t}}}$. Denote the number of pulls of arm $i$ as $T^{\text{fKUBE}}_i$ when the algorithm fKUBE stops.
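A minimal sketch of the index rule just described is given below; budget accounting, initialization and the stopping rule (pull while some arm is still feasible) are simplified assumptions.

```python
# Sketch of the fractional KUBE index: (r_bar + sqrt(2*ln(t-1)/n)) / c_i with deterministic costs.
import math
import numpy as np

def fkube_pull(mean_r, n, costs, t):
    index = (mean_r + np.sqrt(2.0 * math.log(t - 1) / n)) / costs
    return int(np.argmax(index))
```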

5See http://en.wikipedia.org/wiki/Intermediate_value_theorem for a quick introduction.


Under the setting that the cost of each arm is deterministic, the number of pulling rounds is at most $\frac{B}{c_{\min}}$. Given $\eta\in(0,1)$, let $U^f_i(\eta)$ denote $\frac{2\ln\frac{B}{c_{\min}}}{(1-\eta)^2\delta_i^2(0)}$. Therefore, we have

\[
\begin{aligned}
T^{\text{fKUBE}}_i &\le 1+\sum_{t=K+1}^{B/c_{\min}}\mathbf{1}\{I_t=i\}
\le\lceil U^f_i(\eta)\rceil+\sum_{t=K+1}^{B/c_{\min}}\mathbf{1}\{I_t=i;\ n_{i,t}\ge\lceil U^f_i(\eta)\rceil\}\\
&\le1+U^f_i(\eta)+\sum_{t=K+1}^{B/c_{\min}}\mathbf{1}\Big\{\frac{\bar r_{i,t}+E_{i,t}}{c_i}\ge\frac{\bar r_{j,t}+E_{j,t}}{c_j}\ \forall j\neq i,\ n_{i,t}\ge\lceil U^f_i(\eta)\rceil\Big\}
\le1+U^f_i(\eta)+\sum_{t=K+1}^{B/c_{\min}}\mathbf{1}\Big\{\frac{\bar r_{i,t}+E_{i,t}}{c_i}\ge\frac{\bar r_{1,t}+E_{1,t}}{c_1},\ n_{i,t}\ge\lceil U^f_i(\eta)\rceil\Big\}\\
&\le1+U^f_i(\eta)+\sum_{t=K+1}^{B/c_{\min}}\mathbf{1}\Big\{\frac{\bar r_{i,t}+E_{i,t}}{c_i}\ge\frac{\bar r_{1,t}+E_{1,t}}{c_1},\ n_{i,t}\ge\lceil U^f_i(\eta)\rceil,\ \frac{\bar r_{1,t}+E_{1,t}}{c_1}\ge\frac{\mu^r_i+\delta_i(0)}{c_i}\Big\}\\
&\qquad+\sum_{t=K+1}^{B/c_{\min}}\mathbf{1}\Big\{\frac{\bar r_{i,t}+E_{i,t}}{c_i}\ge\frac{\bar r_{1,t}+E_{1,t}}{c_1},\ n_{i,t}\ge\lceil U^f_i(\eta)\rceil,\ \frac{\bar r_{1,t}+E_{1,t}}{c_1}\le\frac{\mu^r_1}{c_1}\Big\}\\
&\le1+U^f_i(\eta)+\sum_{t=K+1}^{B/c_{\min}}\mathbf{1}\{\bar r_{i,t}+E_{i,t}\ge\mu^r_i+\delta_i(0),\ n_{i,t}\ge U^f_i(\eta)\}
+\sum_{t=K+1}^{B/c_{\min}}\mathbf{1}\{\bar r_{1,t}+E_{1,t}\le\mu^r_1\}.
\end{aligned}\tag{E.1}
\]

One can verify that $\mathbb{P}\{\bar r_{1,t}+E_{1,t}\le\mu^r_1\}\le\frac{1}{(t-1)^3}$ by Hoeffding's inequality and the union bound. Therefore,
\[
\mathbb{E}\sum_{t=K+1}^{B/c_{\min}}\mathbf{1}\{\bar r_{1,t}+E_{1,t}\le\mu^r_1\}\le\sum_{t=K}^{\infty}\frac{1}{t^3}\le\frac{1}{2}.\tag{E.2}
\]

On the other hand, if $n_{i,t}\ge U^f_i(\eta)$, then $\delta_i(0)-E_{i,t}\ge\eta\delta_i(0)$. By Hoeffding's inequality and the union bound, we can obtain that
\[
\mathbb{P}\{\bar r_{i,t}+E_{i,t}\ge\mu^r_i+\delta_i(0),\ n_{i,t}\ge U^f_i(\eta)\}\le\Big(\frac{B}{c_{\min}}\Big)^{1-\frac{4\eta^2}{(1-\eta)^2}}.\tag{E.3}
\]

Therefore,
\[
\mathbb{E}\sum_{t=1}^{B/c_{\min}}\mathbf{1}\{\bar r_{i,t}+E_{i,t}\ge\mu^r_i+\delta_i(0),\ n_{i,t}\ge U^f_i(\eta)\}\le\Big(\frac{B}{c_{\min}}\Big)^{2-\frac{4\eta^2}{(1-\eta)^2}}.\tag{E.4}
\]

By choosing $\eta=\frac{1}{\sqrt{2}+1}$, we have
\[
\mathbb{E}\sum_{t=1}^{B/c_{\min}}\mathbf{1}\{\bar r_{i,t}+E_{i,t}\ge\mu^r_i+\delta_i(0),\ n_{i,t}\ge U^f_i(\eta)\}\le1.\tag{E.5}
\]

Therefore, we can conclude that
\[
\mathbb{E}[T^{\text{fKUBE}}_i]\le\frac{(3+2\sqrt{2})\ln\frac{B}{c_{\min}}}{(c_i\Delta_i)^2}+\frac{5}{2}.\tag{E.6}
\]

In [27], the player keeps pulling until none of the arms is feasible. Thus,
\[
\begin{aligned}
B-c_{\min}&\le\mathbb{E}\Big[\sum_{t=1}^{T_B}\sum_{i=1}^{K}c_i\mathbf{1}\{I_t=i\}\Big]
=\mathbb{E}\Big[\sum_{t=1}^{T_B}\sum_{i=1}^{K}(c_i-c_1+c_1)\mathbf{1}\{I_t=i\}\Big]
=\mathbb{E}[T_B]c_1+\sum_{i=2}^{K}\mathbb{E}\Big[\sum_{t=1}^{T_B}(c_i-c_1)\mathbf{1}\{I_t=i\}\Big]\\
&=\mathbb{E}[T_B]c_1+\sum_{i=2}^{K}\Delta^c_i\,\mathbb{E}[T^{\text{fKUBE}}_i]
\le\mathbb{E}[T_B]c_1+\sum_{\Delta^c_i>0}\Delta^c_i\Big(\frac{(3+2\sqrt{2})\ln\frac{B}{c_{\min}}}{(c_i\Delta_i)^2}+\frac{5}{2}\Big).
\end{aligned}
\]

Therefore, we have
\[
\frac{B}{c_1}-\mathbb{E}[T_B]\le\frac{c_{\min}}{c_1}+\sum_{\Delta^c_i>0}\frac{\Delta^c_i}{c_1}\Big(\frac{(3+2\sqrt{2})\ln\frac{B}{c_{\min}}}{(c_i\Delta_i)^2}+\frac{5}{2}\Big).\tag{E.7}
\]


According to inequality (31) in [27],
\[
\begin{aligned}
R_{\text{fKUBE}}&\le\Big(\frac{B}{c_1}-\mathbb{E}[T_B]\Big)\mu^r_1+\sum_{\Delta^r_i>0}\Delta^r_i\,\mathbb{E}[T_i]\\
&\le\frac{\mu^r_1c_{\min}}{c_1}+\sum_{\Delta^c_i>0}\frac{\mu^r_1\Delta^c_i}{c_1}\Big(\frac{(3+2\sqrt{2})\ln\frac{B}{c_{\min}}}{(c_i\Delta_i)^2}+\frac{5}{2}\Big)
+\sum_{\Delta^r_i>0}\Delta^r_i\Big(\frac{(3+2\sqrt{2})\ln\frac{B}{c_{\min}}}{(c_i\Delta_i)^2}+\frac{5}{2}\Big)\\
&\le(3+2\sqrt{2})\ln\frac{B}{c_{\min}}\Big(\sum_{\Delta^c_i>0}\frac{\mu^r_1\Delta^c_i}{c_1(c_i\Delta_i)^2}+\sum_{\Delta^r_i>0}\frac{\Delta^r_i}{(c_i\Delta_i)^2}\Big)
+\frac{5}{2}\Big(\sum_{\Delta^c_i>0}\frac{\mu^r_1\Delta^c_i}{c_1}+\sum_{\Delta^r_i>0}\Delta^r_i\Big)+\frac{\mu^r_1c_{\min}}{c_1},
\end{aligned}
\]
which is certainly smaller than the bound given in Theorem 7 (on Page 16) of [27].

Appendix F. Improved Results for Budgeted Thompson Sampling [31]

Notations: (1) For any suboptimal arm $i(\ge2)$ and for any $\gamma\ge0$, $x^r_i,x^c_i,y^r_i,y^c_i$ are constants to be determined satisfying the following constraints:
\[
\frac{\mu^r_i}{\mu^c_i}\le\frac{x^r_i}{x^c_i}\le\frac{y^r_i}{y^c_i}\le\frac{\mu^r_i+\delta_i(\gamma)}{\mu^c_i-\gamma\delta_i(\gamma)};\quad
x^r_i,x^c_i,y^r_i,y^c_i\in(0,1);\quad
\mu^r_i<x^r_i<y^r_i<\mu^r_i+\delta_i(\gamma);\quad
\mu^c_i>x^c_i>y^c_i>\mu^c_i-\gamma\delta_i(\gamma).\tag{F.1}
\]

(2) Define $L_i(\tau_B)$ as follows:
\[
L_i(\tau_B)=\frac{\ln\tau_B}{\sup_{\gamma}\min\{kl(x^r_i,y^r_i),\,kl(x^c_i,y^c_i)\}}.\tag{F.2}
\]

(3) We introduce some other notations:
\[
\begin{gathered}
k_i(t)=S^r_i(t)+F^r_i(t);\qquad
\mu^r_i(t)=\frac{S^r_i(t)}{S^r_i(t)+F^r_i(t)+1};\qquad
\mu^c_i(t)=\frac{S^c_i(t)}{S^c_i(t)+F^c_i(t)+1};\\
E^{A,r}_i(t):\ \{\mu^r_i(t)\le x^r_i\};\quad
E^{A,c}_i(t):\ \{\mu^c_i(t)\ge x^c_i\};\quad
E^{A}_i(t)=E^{A,r}_i(t)\cap E^{A,c}_i(t);\quad
E^{B}_i(t):\ \Big\{\frac{\theta^r_i(t)}{\theta^c_i(t)}\le\frac{y^r_i}{y^c_i}\Big\}.
\end{gathered}\tag{F.3}
\]
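For reference, the following is a minimal sketch of the Budgeted Thompson Sampling pulling rule assumed in this analysis (following [31]): at each round, sample $\theta^r_i\sim\mathrm{Beta}(S^r_i+1,F^r_i+1)$ and $\theta^c_i\sim\mathrm{Beta}(S^c_i+1,F^c_i+1)$ for every arm and pull the arm with the largest ratio $\theta^r_i/\theta^c_i$.

```python
# Sketch of the BTS sampling rule with Beta posteriors on Bernoulli rewards/costs.
import numpy as np

def bts_pull(S_r, F_r, S_c, F_c, rng):
    theta_r = rng.beta(S_r + 1, F_r + 1)
    theta_c = rng.beta(S_c + 1, F_c + 1)
    return int(np.argmax(theta_r / np.maximum(theta_c, 1e-12)))
```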

Analysis of BTS for Bernoulli Budgeted MAB: According to Lemma 4, we only need to bound $\mathbb{E}[T_i]$ for any $i\ge2$, where $T_i$ is still defined as the number of pulls of arm $i$ from round 1 to round $\tau_B$. Similar to the analysis in [31], we can get that
\[
\mathbb{E}[T_i]=\mathbb{E}\sum_{t=1}^{\tau_B}\mathbf{1}\{I_t=i\}
\le\sum_{t=1}^{\tau_B}\mathbb{P}\{I_t=i,\overline{E^{A}_i(t)}\}
+\sum_{t=1}^{\tau_B}\mathbb{P}\{I_t=i,E^{A}_i(t),\overline{E^{B}_i(t)}\}
+\sum_{t=1}^{\tau_B}\mathbb{P}\{I_t=i,E^{B}_i(t)\}.\tag{F.4}
\]

The first two terms of (F.4) can be bounded by the following lemma.

Lemma 12. For any $i\ge2$, we have the following two results:
\[
\text{(I)}\ \sum_{t=1}^{\tau_B}\mathbb{P}\{I_t=i;\overline{E^{A}_i(t)}\}\le1+\frac{1}{kl(x^r_i,\mu^r_i)}+\frac{1}{kl(x^c_i,\mu^c_i)};\qquad
\text{(II)}\ \sum_{t=1}^{\tau_B}\mathbb{P}\{I_t=i,E^{A}_i(t),\overline{E^{B}_i(t)}\}\le L_i(\tau_B)+2.\tag{F.5}
\]

The proof of Lemma 12 will be provided later. Define $w_i$ as follows:
\[
w_i=\frac{\mu^r_1y^c_i-\mu^c_1y^r_i}{y^r_i+y^c_i},\tag{F.6}
\]
from which we can verify that
\[
\frac{y^r_i}{y^c_i}=\frac{\mu^r_1-w_i}{\mu^c_1+w_i}.\tag{F.7}
\]

Given $\varepsilon\in(0,1)$, we can always find a group of proper $x^r_i,x^c_i,y^r_i,y^c_i$ s.t.
\[
kl(x^r_i,y^r_i)=\frac{kl(\mu^r_i,y^r_i)}{1+\varepsilon}=\frac{kl(\mu^r_i,\mu^r_i+\delta_i(\gamma))}{(1+\varepsilon)^2};\qquad
kl(x^c_i,y^c_i)=\frac{kl(\mu^c_i,y^c_i)}{1+\varepsilon}=\frac{kl(\mu^c_i,\mu^c_i-\gamma\delta_i(\gamma))}{(1+\varepsilon)^2}.\tag{F.8}
\]

Temporarily denote the $y^r_i$ and $y^c_i$ in (F.8) as $y^r_i(\varepsilon)$ and $y^c_i(\varepsilon)$, and denote the $w_i$ in (F.6) as $w_i(\varepsilon)$. We can get that
\[
\frac{\mathrm{d}y^r_i(\varepsilon)}{\mathrm{d}\varepsilon}=-\frac{1}{(1+\varepsilon)^2}\cdot\frac{y^r_i(\varepsilon)(1-y^r_i(\varepsilon))}{y^r_i(\varepsilon)-\mu^r_i}\,kl(\mu^r_i,\mu^r_i+\delta_i(\gamma));\qquad
\frac{\mathrm{d}y^c_i(\varepsilon)}{\mathrm{d}\varepsilon}=-\frac{1}{(1+\varepsilon)^2}\cdot\frac{y^c_i(\varepsilon)(1-y^c_i(\varepsilon))}{y^c_i(\varepsilon)-\mu^c_i}\,kl(\mu^c_i,\mu^c_i-\gamma\delta_i(\gamma)).\tag{F.9}
\]


Applying the total differential to both sides of (F.6), we can obtain that
\[
\frac{\mathrm{d}w_i(\varepsilon)}{\mathrm{d}\varepsilon}
=\frac{\mu^r_1-w_i(\varepsilon)}{y^r_i(\varepsilon)+y^c_i(\varepsilon)}\frac{\mathrm{d}y^c_i(\varepsilon)}{\mathrm{d}\varepsilon}
-\frac{\mu^c_1+w_i(\varepsilon)}{y^r_i(\varepsilon)+y^c_i(\varepsilon)}\frac{\mathrm{d}y^r_i(\varepsilon)}{\mathrm{d}\varepsilon}.\tag{F.10}
\]

Since $\lim_{\varepsilon\to0}y^r_i(\varepsilon)=\mu^r_i+\delta_i(\gamma)$ and $\lim_{\varepsilon\to0}y^c_i(\varepsilon)=\mu^c_i-\gamma\delta_i(\gamma)$, by (F.9) and (F.10) we can verify that
\[
\frac{\mathrm{d}w_i(\varepsilon)}{\mathrm{d}\varepsilon}\Big|_{\varepsilon=0}
=\frac{\mu^r_1(\mu^c_i-\gamma\delta_i(\gamma))(1-\mu^c_i+\gamma\delta_i(\gamma))}{\mu^r_i+\mu^c_i+(1-\gamma)\delta_i(\gamma)}\cdot\frac{kl(\mu^c_i,\mu^c_i-\gamma\delta_i(\gamma))}{\gamma\delta_i(\gamma)}
+\frac{\mu^c_1(\mu^r_i+\delta_i(\gamma))(1-\mu^r_i-\delta_i(\gamma))}{\mu^r_i+\mu^c_i+(1-\gamma)\delta_i(\gamma)}\cdot\frac{kl(\mu^r_i,\mu^r_i+\delta_i(\gamma))}{\delta_i(\gamma)},\tag{F.11}
\]
which is obviously a positive finite constant. As a result, when $\varepsilon\in(0,1)$, we can say that $O(w_i^{-1}(\varepsilon))\le O(\varepsilon^{-1})$.

Therefore, by substituting the $\varepsilon_i(\gamma)$ in Step 3 of [31] with $w_i(\varepsilon)$, we can obtain that when $\varepsilon$ is chosen small enough such that $w_i(\varepsilon)<0.5(1-\mu^c_1)$, we have $\sum_{t=1}^{\tau_B}\mathbb{P}\{I_t=i,E^{B}_i(t)\}\le O(\varepsilon_i^{-6})$. According to the above derivations, we can get the improved regret bound in Theorem 7. Finally we prove Lemma 12.

(Proof of (I) in Lemma 12): Let $s_k$ be the $k$-th time that arm $i$ is pulled, and define $s_0=0$. Then
\[
\begin{aligned}
\sum_{t=1}^{\tau_B}\mathbb{P}\{I_t=i;\overline{E^{A}_i(t)}\}
&\le\mathbb{E}\Big[\sum_{k=0}^{\tau_B-1}\sum_{t=s_k+1}^{s_{k+1}}\mathbf{1}\{I_t=i\}\mathbf{1}\{\overline{E^{A}_i(t)}\}\Big]
=\mathbb{E}\Big[\sum_{k=0}^{\tau_B-1}\mathbf{1}\{\overline{E^{A}_i(s_k+1)}\}\sum_{t=s_k+1}^{s_{k+1}}\mathbf{1}\{I_t=i\}\Big]
=\mathbb{E}\Big[\sum_{k=0}^{\tau_B-1}\mathbf{1}\{\overline{E^{A}_i(s_k+1)}\}\Big]\\
&\le1+\mathbb{E}\Big[\sum_{k=1}^{\tau_B-1}\mathbf{1}\{\overline{E^{A}_i(s_k+1)}\}\Big]
\le1+\sum_{k=1}^{\tau_B-1}\mathbb{P}\{\mu^r_i(s_k+1)\ge x^r_i\}+\sum_{k=1}^{\tau_B-1}\mathbb{P}\{\mu^c_i(s_k+1)\le x^c_i\}\\
&\le1+\sum_{k=1}^{\tau_B-1}\exp\big(-k\,kl(x^r_i,\mu^r_i)\big)+\sum_{k=1}^{\tau_B-1}\exp\big(-k\,kl(x^c_i,\mu^c_i)\big)
\le1+\frac{1}{kl(x^r_i,\mu^r_i)}+\frac{1}{kl(x^c_i,\mu^c_i)}.
\end{aligned}\tag{F.12}
\]

(Proof of (II) in Lemma 12): First, given the history until round $t-1$ (denoted as $\mathcal{F}_{t-1}$), $k_i(t)$ is a deterministic number rather than a random variable. Then we have
\[
\begin{aligned}
&\mathbb{P}\{I_t=i,\overline{E^{B}_i(t)},E^{A,r}_i(t),E^{A,c}_i(t)\,|\,\mathcal{F}_{t-1}\}\\
&\le\mathbb{P}\{I_t=i,\theta^r_i(t)>y^r_i,E^{A,r}_i(t)\,|\,\mathcal{F}_{t-1}\}+\mathbb{P}\{I_t=i,\theta^c_i(t)<y^c_i,E^{A,c}_i(t)\,|\,\mathcal{F}_{t-1}\}\\
&\le\mathbb{P}\{\theta^r_i(t)>y^r_i\,|\,\mu^r_i(t)\le x^r_i,\mathcal{F}_{t-1}\}+\mathbb{P}\{\theta^c_i(t)<y^c_i\,|\,\mu^c_i(t)\ge x^c_i,\mathcal{F}_{t-1}\}\\
&\le\mathbb{P}\{\mathrm{Beta}(\mu^r_i(t)(k_i(t)+1)+1,(1-\mu^r_i(t))(k_i(t)+1))>y^r_i\}
+\mathbb{P}\{\mathrm{Beta}(\mu^c_i(t)(k_i(t)+1)+1,(1-\mu^c_i(t))(k_i(t)+1))<y^c_i\}\\
&=F^B_{k_i(t)+1,y^r_i}\big(x^r_i(k_i(t)+1)\big)+\Big(1-F^B_{k_i(t)+1,y^c_i}\big(x^c_i(k_i(t)+1)\big)\Big)\\
&\le\exp\big(-(k_i(t)+1)\,kl(x^r_i,y^r_i)\big)+\exp\big(-(k_i(t)+1)\,kl(x^c_i,y^c_i)\big).
\end{aligned}\tag{F.13}
\]

As a result, when $k_i(t)>L_i(\tau_B)$, we have
\[
\mathbb{P}\{I_t=i,\overline{E^{B}_i(t)},E^{A,r}_i(t),E^{A,c}_i(t)\}\le\frac{2}{\tau_B}.\tag{F.14}
\]

Temporarily let $s$ denote the last round for which $k_i(t)\le L_i(\tau_B)$. As a result, we have
\[
\begin{aligned}
\sum_{t=1}^{\tau_B}\mathbb{P}\{I_t=i,\overline{E^{B}_i(t)},E^{A,r}_i(t),E^{A,c}_i(t)\}
&\le\mathbb{E}\Big[\sum_{t=1}^{s}\mathbb{P}\{I_t=i,\overline{E^{B}_i(t)},E^{A,r}_i(t),E^{A,c}_i(t)\}+\sum_{t=s+1}^{\tau_B}\mathbb{P}\{I_t=i,\overline{E^{B}_i(t)},E^{A,r}_i(t),E^{A,c}_i(t)\}\Big]\\
&\le\mathbb{E}\Big[\sum_{t=1}^{s}\mathbb{P}\{I_t=i\}+\sum_{t=s+1}^{\tau_B}\mathbb{P}\{I_t=i,\overline{E^{B}_i(t)},E^{A,r}_i(t),E^{A,c}_i(t)\}\Big]
\le L_i(\tau_B)+2.
\end{aligned}\tag{F.15}
\]

Appendix G. Detailed Analysis of the Budget-UCB Algorithm

Throughout this subsection, (A) let $T_i$ denote the number of pulling rounds of arm $i$ from round 1 to round $\tau_B$ under the Budget-UCB algorithm; (B) denote $\delta_i(1)$ as $\delta_i$ $\forall i\ge2$; (C) temporarily omit the "$\alpha$" from $E^{\alpha}_{i,t}$, i.e., denote $E^{\alpha}_{i,t}$ as $E_{i,t}$ for any $i\in[K]$.


(S1) Similar to the derivations of (S1) in Subsection Appendix C.1, we have
\[
\mathbb{E}[T_i]\le\lceil U^j_i\rceil
+\sum_{t=K+1}^{\tau_B}\underbrace{\mathbb{P}\Big\{\frac{\bar r_{1,t}}{\bar c_{1,t}}+H_{1,t}\le\frac{\mu^r_1}{\mu^c_1}\Big\}}_{\text{term }a}
+\sum_{t=K+1}^{\tau_B}\underbrace{\mathbb{P}\Big\{\frac{\bar r_{i,t}}{\bar c_{i,t}}+H_{i,t}\ge\frac{\mu^r_i+\delta_i}{\mu^c_i-\delta_i},\ n_{i,t}\ge\lceil U^j_i\rceil\Big\}}_{\text{term }b},\tag{G.1}
\]
where
\[
U^j_i=\frac{(3+2\sqrt{2})\ln\tau_B}{\delta_i^2};\qquad
H_{i,t}=\frac{\varepsilon_{i,t}}{\bar c_{i,t}}+\frac{\varepsilon_{i,t}}{\bar c_{i,t}}\cdot\frac{\min\{\bar r_{i,t}+\varepsilon_{i,t},1\}}{\max\{\bar c_{i,t}-\varepsilon_{i,t},0\}}.\tag{G.2}
\]
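For orientation, the sketch below computes the Budget-UCB index $\bar r_{i,t}/\bar c_{i,t}+H_{i,t}$ with $H_{i,t}$ as in (G.2). The confidence radius $\varepsilon_{i,t}=\sqrt{2\ln(t-1)/n_{i,t}}$ is an assumption consistent with the constants used in (G.11); a zero value of $\max\{\bar c_{i,t}-\varepsilon_{i,t},0\}$ makes the index infinite, and the small floor is only a numerical stand-in for that convention.

```python
# Sketch of the Budget-UCB index: r_bar/c_bar + H with the two-part exploration term H.
import math
import numpy as np

def budget_ucb_pull(mean_r, mean_c, n, t):
    eps = np.sqrt(2.0 * math.log(t - 1) / n)    # assumed confidence radius
    c_safe = np.maximum(mean_c, 1e-12)
    H = eps / c_safe + (eps / c_safe) * np.minimum(mean_r + eps, 1.0) / np.maximum(mean_c - eps, 1e-12)
    index = mean_r / c_safe + H
    return int(np.argmax(index))
```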

Next, we bound the sums of term $a$ and term $b$ respectively.

(S2) Bound the sum of term $a$ in (G.1): It is obvious that
\[
\mathbb{P}\Big\{\frac{\bar r_{1,t}}{\bar c_{1,t}}+H_{1,t}\le\frac{\mu^r_1}{\mu^c_1}\Big\}
\le\mathbb{P}\{\bar c_{1,t}-\mu^c_1\ge\varepsilon_{1,t}\}
+\mathbb{P}\Big\{\frac{\bar r_{1,t}}{\bar c_{1,t}}+H_{1,t}\le\frac{\mu^r_1}{\mu^c_1},\ \bar c_{1,t}-\mu^c_1<\varepsilon_{1,t}\Big\}.\tag{G.3}
\]

If the event expressed by the last term of (G.3) holds, we have⁶
\[
\frac{\bar r_{1,t}}{\bar c_{1,t}}-\frac{\mu^r_1}{\mu^c_1}
=\frac{(\bar r_{1,t}-\mu^r_1)\mu^c_1+(\mu^c_1-\bar c_{1,t})\mu^r_1}{\bar c_{1,t}\mu^c_1}
\le-\frac{\varepsilon_{1,t}}{\bar c_{1,t}}-\frac{\varepsilon_{1,t}}{\bar c_{1,t}}\cdot\frac{\min\{\bar r_{1,t}+\varepsilon_{1,t},1\}}{\max\{\bar c_{1,t}-\varepsilon_{1,t},0\}}.\tag{G.4}
\]

Therefore, at least one of the following events is true:
\[
\text{(i)}\ \bar r_{1,t}-\mu^r_1\le-\varepsilon_{1,t};\qquad
\text{(ii)}\ \frac{(\bar c_{1,t}-\mu^c_1)\mu^r_1}{\mu^c_1}\ge\varepsilon_{1,t}\cdot\frac{\min\{\bar r_{1,t}+\varepsilon_{1,t},1\}}{\max\{\bar c_{1,t}-\varepsilon_{1,t},0\}}.\tag{G.5}
\]

Given $\bar c_{1,t}-\mu^c_1<\varepsilon_{1,t}$, (ii) in (G.5) implies that $\bar r_{1,t}+\varepsilon_{1,t}\le\mu^r_1$. Therefore, (G.3) is upper bounded by $\mathbb{P}\{\bar r_{1,t}+\varepsilon_{1,t}\le\mu^r_1\}+\mathbb{P}\{\bar c_{1,t}-\mu^c_1\ge\varepsilon_{1,t}\}$. By Hoeffding's inequality and the union bound, we have
\[
\mathbb{P}\{\bar r_{1,t}+\varepsilon_{1,t}\le\mu^r_1\}+\mathbb{P}\{\bar c_{1,t}-\mu^c_1\ge\varepsilon_{1,t}\}\le2(t-1)^{-3}.\tag{G.6}
\]

As a result, the sum of term $a$ in (G.1) can be bounded by
\[
\sum_{t=K+1}^{\tau_B}\mathbb{P}\Big\{\frac{\bar r_{1,t}}{\bar c_{1,t}}+H_{1,t}\le\frac{\mu^r_1}{\mu^c_1}\Big\}\le2\sum_{t=2}^{\infty}\frac{1}{t^3}\le1.\tag{G.7}
\]

(S3) Bound the sum of term $b$ in (G.1): For any suboptimal arm $i$, we can find a shadow expected reward $\tilde\mu^r_i$ and a shadow expected cost $\tilde\mu^c_i$ of the optimal arm related to both $\mu^r_i$ and $\mu^c_i$: $\forall i\neq1$, $\tilde\mu^r_i=\mu^r_i+\delta_i$ and $\tilde\mu^c_i=\mu^c_i-\delta_i$. We can verify that
\[
\frac{\mu^r_1}{\mu^c_1}-\frac{\varepsilon_{i,t}}{\bar c_{i,t}}
=\frac{\tilde\mu^r_i}{\tilde\mu^c_i}-\frac{\varepsilon_{i,t}}{\bar c_{i,t}}
=\frac{\tilde\mu^r_i(\bar c_{i,t}-\tilde\mu^c_i)}{\tilde\mu^c_i\bar c_{i,t}}+\frac{\tilde\mu^r_i-\varepsilon_{i,t}}{\bar c_{i,t}}.\tag{G.8}
\]

Thus, $\frac{\bar r_{i,t}}{\bar c_{i,t}}+H_{i,t}\ge\frac{\mu^r_1}{\mu^c_1}$ implies that
\[
\frac{\bar r_{i,t}}{\bar c_{i,t}}+\frac{\varepsilon_{i,t}}{\bar c_{i,t}}\cdot\frac{\min\{\bar r_{i,t}+\varepsilon_{i,t},1\}}{\max\{\bar c_{i,t}-\varepsilon_{i,t},0\}}
\ge\frac{\tilde\mu^r_i(\bar c_{i,t}-\tilde\mu^c_i)}{\tilde\mu^c_i\bar c_{i,t}}+\frac{\tilde\mu^r_i-\varepsilon_{i,t}}{\bar c_{i,t}}.\tag{G.9}
\]

If (G.9) is true, then conditioned on $\bar r_{i,t}+\varepsilon_{i,t}<\tilde\mu^r_i$, it is certain that $\varepsilon_{i,t}\ge\bar c_{i,t}-\tilde\mu^c_i$. Thus, if the event expressed by term $b$ of (G.1) is true, at least one of the following events is true:⁷
\[
\text{(i)}\ \bar c_{i,t}-\tilde\mu^c_i\le\varepsilon_{i,t};\qquad
\text{(ii)}\ \bar r_{i,t}+\varepsilon_{i,t}\ge\tilde\mu^r_i.\tag{G.10}
\]

Given $n_{i,t}\ge U^j_i=\frac{(3+2\sqrt{2})\ln\tau_B}{\delta_i^2}$, one can verify that $\delta_i-\varepsilon_{i,t}\ge\frac{\delta_i}{\sqrt{2}+1}$. Therefore,
\[
\begin{aligned}
\sum_{t=K+1}^{\tau_B}\mathbb{P}\{\bar c_{i,t}-\tilde\mu^c_i\le\varepsilon_{i,t},\ n_{i,t}\ge\lceil U^j_i\rceil\}
&\le\sum_{t=K+1}^{\tau_B}\sum_{l=\lceil U^j_i\rceil}^{t-1}\mathbb{P}\Big\{\bar c_{i,t}-\mu^c_i\le-\frac{\delta_i}{\sqrt{2}+1},\ n_{i,t}=l\Big\}\\
&\le(\tau_B)^2\exp\Big(-\frac{2U^j_i\delta_i^2}{(\sqrt{2}+1)^2}\Big)=1\qquad\text{(obtained by the Chernoff--Hoeffding bound).}
\end{aligned}\tag{G.11}
\]

Similarly, we can also obtain that
\[
\sum_{t=K+1}^{\tau_B}\mathbb{P}\{\bar r_{i,t}+\varepsilon_{i,t}\ge\tilde\mu^r_i,\ n_{i,t}\ge U^j_i\}\le1.\tag{G.12}
\]

Therefore,
\[
\mathbb{E}[T_i]\le\frac{(3+2\sqrt{2})\ln\tau_B}{\delta_i^2}+3.\tag{G.13}
\]

The regret can be obtained by applying Lemma 4.

⁶For the boundary case $\bar c_{1,t}=0$: since $\varepsilon_{1,t}>0$ when $t\ge K+1$, we have $\frac{\bar r_{1,t}}{\bar c_{1,t}}+H_{1,t}\to\infty$, which makes the last $\mathbb{P}\{\cdots\}$ of (G.3) zero. Therefore, we can safely consider only the case $\bar c_{1,t}>0$.

⁷Note that the boundary case $\bar c_{i,t}=0$ is covered by the events expressed by (G.10).
