Transcript of Battle of Bandits (2019-09-10)

Page 1:

Battle of Bandits - Online learning from relative preferences

Aadirupa Saha, Computer Science and Automation (CSA); Prof. Aditya Gopalan, Electrical Communication Engineering (ECE); Indian Institute of Science (IISc), Bangalore.

Page 2:

Brief Intro.

Aadirupa Saha - Ph.D. student, Dept. of Computer Science and Automation (CSA, IISc). Advisors: Prof. Aditya Gopalan (ECE, IISc), Prof. Siddharth Barman (CSA, IISc). M.E. (CSA, IISc), advisor: Prof. Shivani Agarwal. Research interests: Machine Learning, Learning Theory, Optimization.

Currently: Research intern, Google Mountain View. Collaborators: Ofer Meshi, Branislav Kveton, Craig Boutilier.

Page 3:

Problem Overview

Learning from Preferences

Page 4:

Introduction – Multi-Armed Bandits (MAB)

Qualcomm Innovation Fellowship India 2018

Page 5:

Introduction – Multi-Armed Bandits (MAB)


[Figure: slot-machine arms with example rewards -$10, $1, $50, -$5]

Play sequentially (one by one): how fast can we find the arm with the *highest reward*?

Page 6:


Arm means: μ1 > μ2 > μ3 > μ4 > … > μn, so arm 1 (mean μ1) is the *best* arm.

More formally: MAB (Learning from single choices)

Page 7:

More formally: MAB (Learning from single choices)

At round t:
- Select an arm a_t from {1, 2, …, n}
- Observe (noisy) reward r_t ~ Dist(μ(a_t))
- Repeat

Expected Regret in T rounds: the gap between the best arm's cumulative expected reward and the algorithm's (the standard notion is illustrated in the sketch below).

Best possible (state of the art): O(n log T)

Auer et al. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2002.
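To make the protocol concrete, here is a minimal simulation sketch of the loop above using the UCB1 index from the cited Auer et al. (2002) paper; the Gaussian noise model and the example means are our own illustrative assumptions, not from the slides.

```python
import numpy as np

def ucb1_run(means, T, rng=np.random.default_rng(0)):
    """Simulate the MAB protocol above with the UCB1 index (Auer et al. 2002).

    means : true expected rewards mu(1..n) (illustrative; unknown to the learner)
    T     : horizon (number of rounds)
    Returns the expected regret in T rounds: T*mu_best - sum_t mu(a_t).
    """
    n = len(means)
    pulls = np.zeros(n)            # number of times each arm was played
    reward_sum = np.zeros(n)       # sum of observed rewards per arm
    played_means = []

    for t in range(T):
        if t < n:                  # play every arm once to initialise
            a = t
        else:                      # UCB1: empirical mean + exploration bonus
            ucb = reward_sum / pulls + np.sqrt(2.0 * np.log(t + 1) / pulls)
            a = int(np.argmax(ucb))
        r = means[a] + rng.normal(0.0, 1.0)   # noisy reward r_t ~ Dist(mu(a_t))
        pulls[a] += 1
        reward_sum[a] += r
        played_means.append(means[a])

    return T * max(means) - sum(played_means)

# Example: 5 arms; UCB1 keeps the expected regret at O(n log T).
print(ucb1_run(means=[0.9, 0.7, 0.5, 0.3, 0.1], T=5000))
```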

Page 8:

Restaurant recommendation


Page 9:

Search engine optimization:

Page 10:

Information aggregation from preference data

Page 11:

Learning from relative preferences

Absolute vs. Relative preferences:
- Ratings (Absolute) --- How much do you score it, out of 5?
- Rankings (Relative) --- Do you like movie A over movie B?

Often easier (and more accurate) to elicit relative preferences than absolute scores.

Page 12:


Guess the most liked flavour?

Information aggregation from pairwise preference data

Page 13:

Dueling Bandits

Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. ICML 2009.


Page 14:

More Formally: Dueling Bandits (Learning from pairwise preferences)

At round t:
- Select two arms (a_t, b_t)
- Observe (noisy) comparison x_t ∈ {0,1} ~ P := Pr(a_t beats b_t)
- Repeat

Preference Matrix (example, n = 5 arms), entry (i, j) = Pr(arm i beats arm j):

     1     2     3     4     5
1   0.5   0.53  0.54  0.56  0.6
2   0.47  0.5   0.53  0.58  0.61
3   0.46  0.47  0.5   0.54  0.57
4   0.44  0.42  0.46  0.5   0.51
5   0.4   0.39  0.43  0.49  0.5

Objective: Find the best arm, with minimum possible #samples (rounds).

Szörényi et al. Online rank elicitation for Plackett-Luce: A dueling bandits approach. NeurIPS 2015.
Yue and Joachims. Beat the mean bandit. ICML 2011.
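A minimal sketch of how this pairwise feedback can be simulated from a preference matrix such as the one above; the 5x5 values are taken from the slide, and the helper name `duel` is ours.

```python
import random

# Preference matrix from the slide: P[i][j] = Pr(arm i+1 beats arm j+1).
P = [
    [0.50, 0.53, 0.54, 0.56, 0.60],
    [0.47, 0.50, 0.53, 0.58, 0.61],
    [0.46, 0.47, 0.50, 0.54, 0.57],
    [0.44, 0.42, 0.46, 0.50, 0.51],
    [0.40, 0.39, 0.43, 0.49, 0.50],
]

def duel(a, b, rng=random.Random(0)):
    """Noisy comparison x_t in {0,1}: 1 if arm a wins the duel against arm b."""
    return 1 if rng.random() < P[a][b] else 0

# One round of the dueling-bandit protocol: select a pair, observe the comparison.
a_t, b_t = 0, 3                                  # arms 1 and 4 (0-indexed here)
x_t = duel(a_t, b_t)
print("winner of the duel:", a_t + 1 if x_t else b_t + 1)
```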

Page 15:

Wouldn’t a subset-wise preference make more sense?


Page 16:

Why Subsets??


Realistic & budget friendly

More feedback flexibility

Easy data collection

Page 17:

Main question: Faster information aggregation with subsets?


Page 18:

Prior Art: Not much! Almost none!

Batch (offline / non-active) setting and active setting:
- Chen, Xi, Yuanzhi Li, and Jieming Mao. "A Nearly Instance Optimal Algorithm for Top-k Ranking under the Multinomial Logit Model." SODA 2018.
- Khetan, Ashish, and Sewoong Oh. "Data-driven rank breaking for efficient rank aggregation." Journal of Machine Learning Research 17(193), 2016.
- Wenbo Ren, Jia Liu, Ness B. Shroff. "PAC Ranking from Pairwise and Listwise Queries: Lower Bounds and Upper Bounds." arXiv, 2018.

Assortment Optimization:
- Agrawal, S., Avadhanula, V., Goyal, V., & Zeevi, A. "A near-optimal exploration-exploitation approach for assortment selection." ACM Conference on Economics and Computation (EC), 2016.

Page 19:

Battling Bandits (Learning relatively from subsets)

At round t:
- Select a set of k arms, S_t
- Observe (noisy) "subset-wise feedback" f(S_t) ~ P (a "stochastic model")
- Repeat

Objective: Find the best arm, with minimum possible #samples (rounds).

Page 20:

Choice Modeling (Challenges):

1. Choice modeling: probabilistic modeling of the feedback --- the probability of choosing item a from set S, P(a|S).

2. Combinatorial structure: the number of parameters is on the order of (n choose k) or n^k --- combinatorially large!!

3. How to express relative utilities of arms within subsets?

Subset-wise preference matrix (one row per subset, one column per outcome):

                 1      2      3     ...   #outcomes
S_1             0.13   0.01   0.05   ...   0.22
S_2             0.27   0.12   0.03   ...   0.19
S_3             0.04   0.11   0.05   ...   0.23
...              ...    ...    ...   ...    ...
S_(n choose k)  0.23   0.19   0.03   0.19   0.24

Page 21:

Discrete Choice Models: modelling stochastic preferences over an individual item or a group of items in a given context (subset).

Plackett-Luce (PL) choice model: just n parameters, one score per item, defining the choice probabilities for any subset. Parameter reduction: from order-(n choose k) (or n^k) down to n!

Other choice models: Multinomial Probit, Mallows, Nested GEV, etc.

Azari et al. Random utility theory for social choice. NeurIPS 2012.
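The PL choice probabilities themselves did not survive the extraction; for reference, the standard Plackett-Luce rule with one score parameter θ_i > 0 per item is:

```latex
P(i \mid S) \;=\; \frac{\theta_i}{\sum_{j \in S} \theta_j}, \qquad i \in S \subseteq [n].
```

Winner feedback on a subset S is a single draw from this distribution, which is what gives the model its n parameters regardless of the subset size k.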

Page 22:

Discrete Choice Models: Plackett-Luce choice model (as on the previous slide) --- just n parameters, for any subset (Azari et al., NeurIPS 2012).

Type of PL feedback: general top-m ranking.

Example, for a subset of size k = 4:
-- Top-m ranking feedback (m = 2): a ranking of the 2 most preferred items of the subset.
-- Full ranking feedback (m = 4): a ranking of all 4 items.
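A minimal sketch of how top-m ranking feedback can be drawn from a PL model by repeated choice without replacement; the sequential sampling rule is the standard PL mechanism, while the score values and helper name are illustrative assumptions.

```python
import random

def pl_top_m_feedback(scores, subset, m, rng=random.Random(0)):
    """Draw top-m ranking feedback for a played subset under a Plackett-Luce model.

    scores : dict item -> PL score theta_i > 0 (illustrative values)
    subset : the k items offered this round (the set S_t)
    m      : length of the observed ranking (m = 1: winner, m = k: full ranking)
    """
    remaining = list(subset)
    ranking = []
    for _ in range(m):
        total = sum(scores[i] for i in remaining)
        r, acc = rng.random() * total, 0.0
        for i in remaining:                    # pick item i with prob theta_i / total
            acc += scores[i]
            if r <= acc:
                ranking.append(i)
                remaining.remove(i)
                break
    return ranking

theta = {1: 1.0, 2: 0.8, 3: 0.6, 4: 0.4}                    # illustrative PL scores
print(pl_top_m_feedback(theta, subset=[1, 2, 3, 4], m=2))   # top-2 ranking feedback
print(pl_top_m_feedback(theta, subset=[1, 2, 3, 4], m=4))   # full ranking feedback
```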

Page 23:

Work done

Part I: Learning the (ϵ, δ)-PAC Best Item
Part II: Instance-optimal PAC Best Item
Part III: Learning the (ϵ, δ)-PAC-Optimal Ranking
Part IV: Cost optimization / Regret minimization --- (a) PL model, (b) PSC model

Page 24:

Part-I: Learning (𝜖, 𝛿)-PAC Best Item


In 30th International Conference on Algorithmic Learning Theory (ALT), 2019.

Page 25:

Problem: (ϵ, δ)-PAC Best Item

Objective: output an item i that is ϵ-optimal with probability at least 1 − δ, using the minimum possible #samples (rounds).

Page 26:

Result Overview: (ϵ, δ)-Sample Complexity

1. Sample Complexity Lower Bound: for any ϵ, δ and any algorithm A, there exists an instance of the PL model where A requires a sample complexity of at least Ω((n / (m ϵ²)) log(1/δ)) rounds.

2. Proposed algorithms take O((n / (m ϵ²)) log(k/δ)) rounds.
   -- Algorithm-1: Divide and Battle
   -- Algorithm-2: Halving Battle

Essentially 'independent' of k! Reduces with m!

Page 27:

A. Lower Bound Analysis


Page 28:

PL instances --- True instance:

[Figure: PL parameters of Arm-1, Arm-2, Arm-3, …, Arm-a, …, Arm-(n-1), Arm-n, with the optimal arm marked]

Page 29:

PL instances --- True instance vs. Alternative instance:

[Figure: PL parameters of Arm-1, Arm-2, Arm-3, …, Arm-a, …, Arm-(n-1), Arm-n, for the true instance (its optimal arm marked) and for an alternative instance in which Arm-a becomes the optimal arm]

Page 30:

Fundamental Inequality (Kaufmann et al. 2016): consider two MAB instances on n arms, ν and ν′, over the same arm set.
- ν_i (resp. ν′_i): reward distribution of arm i under ν (resp. ν′)
- N_i(τ): number of plays of arm i during any finite stopping time τ
- E: any event under the sigma-algebra of the algorithm's trajectory

Kaufmann et al., On the complexity of best-arm identification in multi-armed bandit models. JMLR 2016.
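The inequality itself is lost in the extraction; the change-of-measure lemma of Kaufmann et al. (2016) that this slide quotes reads, for any such event E and finite stopping time τ:

```latex
\sum_{i=1}^{n} \mathbb{E}_{\nu}\big[N_i(\tau)\big]\,
  \mathrm{KL}\big(\nu_i, \nu'_i\big)
  \;\ge\;
  \mathrm{kl}\big(\Pr_{\nu}(\mathcal{E}),\, \Pr_{\nu'}(\mathcal{E})\big),
\qquad
\mathrm{kl}(x, y) = x\log\frac{x}{y} + (1-x)\log\frac{1-x}{1-y}.
```

Lower-bound arguments of this kind are usually completed with the standard consequence kl(δ, 1 − δ) ≥ log(1/(2.4 δ)), which is presumably the "result follows further using" step on the next slide.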


Page 31:

Lower Bound Analysis:

- Arm set: the PL instances above.
- E_0: event that Algorithm A returns item 1.
- LHS / RHS of the fundamental inequality are evaluated for the true and alternative instances.
- Result follows further using the standard kl bound of Kaufmann et al. (2016) noted above.

Page 32:

B. Proposed Algorithms + Guarantees


Page 33:

Rank Breaking: the idea of extracting pairwise preferences from subset-wise feedback.

Example: consider a subset of size k = 4.
-- Upon top-m ranking feedback (m = 2): rank-breaking extracts the pairwise duels implied by the top-2 ranking.
-- Upon full ranking feedback (m = 4): rank-breaking extracts all pairwise duels implied by the full ranking.

'Strongest' item: the winner of the maximum number of pairwise duels.
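A minimal sketch of rank-breaking as described above: a top-m ranking of a size-k subset is broken into the pairwise duels it implies (each ranked item beats every item ranked below it and every unranked item in the subset), and the 'strongest' item is the one winning the most duels. The counting scheme is a plain illustration, not necessarily the paper's exact estimator.

```python
from collections import Counter

def rank_break(ranking, subset):
    """Pairwise duels implied by a top-m ranking of `subset`.

    ranking : the observed top-m items, best first
    subset  : all k items that were offered
    Returns a list of (winner, loser) pairs.
    """
    duels = []
    unranked = [i for i in subset if i not in ranking]
    for pos, winner in enumerate(ranking):
        for loser in ranking[pos + 1:] + unranked:   # winner beats lower-ranked and unranked items
            duels.append((winner, loser))
    return duels

# Example from the slide: a subset of size k = 4 with top-m feedback, m = 2.
subset = ["a", "b", "c", "d"]
feedback = ["c", "a"]                       # item c preferred over a; both beat b and d
duels = rank_break(feedback, subset)
print(duels)                                # [('c','a'), ('c','b'), ('c','d'), ('a','b'), ('a','d')]
print(Counter(w for w, _ in duels).most_common(1))   # 'strongest': winner of most duels
```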

Page 34:

Key Lemma (deviations of pairwise win-probability estimates for the PL model): [assumptions and concentration bound elided in the extracted slide text]

Page 35:

Proposed Algorithm-1: Divide and Battle (DaB)

- Divide the items into groups
- Play each group for a prescribed number of times, + rank-breaking
- Retain the 'strongest' item of each group

Page 36:

Proposed Algorithm-1: Divide and Battle (DaB), continued

Repeat over phases:
- Divide the surviving items into groups
- Play each group for a prescribed number of times, + rank-breaking
- Retain the 'strongest' item of each group

Output: after the final phase, the surviving item is the PAC item.
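A rough sketch of the phase structure described on these two slides, assuming a hypothetical helper `play_and_rank_break` (plays one group the prescribed number of times and returns rank-broken win counts) and a simple 'most wins' rule for the strongest item; the actual play counts and elimination test are in the paper, not here.

```python
def divide_and_battle(items, k, play_and_rank_break):
    """Sketch of Divide-and-Battle: repeatedly group, battle, retain the strongest.

    items               : current pool of arms
    k                   : subset (group) size played per round
    play_and_rank_break : callable(group) -> dict item -> #pairwise wins
                          (hypothetical stand-in for 'play each group + rank-breaking')
    """
    survivors = list(items)
    while len(survivors) > 1:                       # one phase per iteration
        groups = [survivors[i:i + k] for i in range(0, len(survivors), k)]
        next_round = []
        for g in groups:
            wins = play_and_rank_break(g)           # prescribed #plays + rank-breaking
            next_round.append(max(g, key=lambda i: wins.get(i, 0)))  # retain the 'strongest'
        survivors = next_round
    return survivors[0]                             # the PAC item (output)
```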

Page 37:

Comparisons: Existing Results

- Algorithm-1 (DaB): … (see the bound above)
- Algorithm-2 (HB): …

Existing Dueling Bandit (k = 2) results (m = 1):
- PLPAC (Szörényi et al., 2015): …
- BTM (Yue and Joachims, 2011): …, with an extra sub-optimality factor

But no existing work for top-m feedback!!

Szörényi et al. Online rank elicitation for Plackett-Luce: A dueling bandits approach. NeurIPS 2015.
Yue and Joachims. Beat the mean bandit. ICML 2011.

Page 38:

Part-II: Instance-Optimal PAC Best Item

In submission


Page 39:

What if ϵ = 0? Shouldn't the sample complexity depend on the 'hardness' of the problem instance?

Page 40:

Motivation: finding the (ϵ-)optimal arm

[Figure: PL parameters of Arm-1, Arm-2, Arm-3, …, Arm-a, …, Arm-(n-1), Arm-n under two instances]

Instance-1 is a "hard" instance; Instance-2 is an "easy" instance --- the sample complexities can't be the same!

Page 41:

Motivation (continued): the same "hard" (Instance-1) vs. "easy" (Instance-2) comparison for finding the (ϵ-)optimal arm --- the sample complexities can't be the same!

Page 42:

Instance-Dependent Sample Complexity --- Instance-optimal Best-Item

Pure exploration, with gaps Δ_i:
- Lower Bound: …
- We achieved: …
(each shown separately for the "hard" and the "easy" instances)

Page 43:

Instance-optimal Best-Item: Proposed algorithm --- PAC-Wrapper!

Repeat, at any sub-phase s [assume …]:
- Partition the items into batches, and play each batch for a prescribed number of times
- Run the (ϵ, δ)-PAC Best-Item subroutine to find an ϵ-optimal Best-Item
- Prune the items that are ruled out and merge the rest
- The resulting set moves to the next sub-phase
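One plausible reading of the loop above, as a sketch; the batch sizes, play counts, ϵ-schedule and pruning test are all elided on the slide, so `pac_best_item` and `prune` below are hypothetical stand-ins and the halving schedules are assumptions.

```python
def pac_wrapper(items, pac_best_item, prune, delta):
    """Sketch of the PAC-Wrapper loop: find a near-best item, prune, repeat.

    pac_best_item : callable(items, eps, delta) -> an (eps, delta)-PAC best item
    prune         : callable(items, pivot, eps, delta) -> items not yet ruled out
    """
    survivors, eps, s = list(items), 0.5, 1              # assumed starting accuracy / sub-phase index
    while len(survivors) > 1:
        delta_s = delta / (2 ** s)                       # assumed confidence schedule
        pivot = pac_best_item(survivors, eps, delta_s)   # the PAC Best-Item subroutine
        kept = prune(survivors, pivot, eps, delta_s)     # prune ruled-out items, merge the rest
        survivors = kept if pivot in kept else kept + [pivot]   # the pivot always survives
        eps, s = eps / 2.0, s + 1                        # assumed eps-halving between sub-phases
    return survivors[0]
```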

Page 44:

Item-wise survival time:


Page 45:

Sample Complexity vs (𝜖, 𝛿):

[Plots: sample complexity with varying ϵ, and with varying δ]

Page 46:

Sample Complexity vs. rank-ordered feedback (m):


Page 47:

Part-III: Learning the (ϵ, δ)-PAC Best Ranking

In 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

Page 48:

Problem Setting: (ϵ, δ)-PAC Ranking

True ranking: …
Objective: predict a full ranking (σ) that is ϵ-accurate with probability at least 1 − δ, with minimum possible #samples (rounds).

Page 49:

Result Overview: (ϵ, δ)-Sample Complexity

1. Sample Complexity Lower Bound: for any ϵ, δ and any algorithm A satisfying label invariance, there exists an instance of the PL model where A requires a sample complexity of at least Ω(…) rounds.

2. Proposed algorithms take O(…) rounds.
   -- Algorithm-1: Beat-the-Pivot
   -- Algorithm-2: Score-and-Rank

Again 'independent' of k! Inverse linear dependence on m!

Page 50:

A. Lower Bound


Page 51:

PL instances --- True instance:

[Figure: PL parameters of Arm-0, Arm-1, Arm-2, …, Arm-(n/2), Arm-(n/2 + 1), …, Arm-(n-2), Arm-(n-1), chosen such that …; the best (optimal) arms are marked]

Page 52:

PL instances --- Alternative instance:

[Figure: the same arms, Arm-0 through Arm-(n-1), with modified PL parameters such that …; the best (optimal) arms are marked]

Page 53:

PL instances --- Alternative instance, continued:

[Figure: the alternative instance again; the construction exploits 'Label Invariance'!]

Page 54:

Lower Bound Analysis:

Arm set for the current setting: … We have …, we set …, and we show ….
Result follows using the fundamental inequality of Kaufmann et al. (2016).

Kaufmann et al. On the complexity of best-arm identification in multi-armed bandit models. JMLR 2016.

Page 55:

B. Proposed Algorithms + Guarantees


-- Algorithm-1: Beat-the-Pivot -- Algorithm-2: Score-and-Rank

Page 56:

Proposed Algorithm-1: Beat-the-Pivot (BP)   (Saha & Gopalan, ALT 2019)

- Find a PAC item 'b' --- the pivot --- using the (ϵ, δ)-PAC Best-Item subroutine
- Divide the remaining items into groups

Page 57:

Proposed Algorithm-1: Beat-the-Pivot (BP), continued   (Saha & Gopalan, ALT 2019)

- Find a PAC item 'b' (the pivot) using the (ϵ, δ)-PAC Best-Item subroutine
- Divide the remaining items into groups, each containing the pivot b
- Play each group for a prescribed number of times, + rank-breaking
- Compute the estimated pairwise preferences against the pivot, and output the ranking

Page 58:

Proposed Algorithm-1: Beat-the-Pivot (BP) --- correctness and sample complexity guarantee

Theorem: Beat-the-Pivot finds an (ϵ, δ)-PAC Optimal Ranking with sample complexity O(…).

Proof? Main idea: if …, then for any pivot b ∈ [n], …. Can we estimate these quantities with high confidence?
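The "main idea" did not survive extraction; presumably it rests on the standard PL fact that pairwise win-probabilities against a fixed pivot preserve the score ordering:

```latex
\Pr(i \succ b) \;=\; \frac{\theta_i}{\theta_i + \theta_b}
\quad\Longrightarrow\quad
\theta_i > \theta_j \;\iff\; \Pr(i \succ b) > \Pr(j \succ b)
\quad \text{for any fixed pivot } b \in [n],
```

so ranking the items by sufficiently accurate estimates of their win probability against the pivot b recovers a near-correct ranking of the θ's.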

Page 59:

Guarantees on (ϵ, δ)-Sample Complexity:

- Algorithm-1 (BP): …
- Algorithm-2 (SaR): …

Comparison: existing Dueling Bandit (k = 2) results (m = 1):
- PLPAC-AMPR (Szörényi et al., 2015): …, with an extra sub-optimality factor

Again, no existing work for top-m feedback!!

Szörényi et al. Online rank elicitation for Plackett-Luce: A dueling bandits approach. NeurIPS 2015.

Page 60:

Experiments: Kendall-Tau ranking loss between the true and the predicted ranking.

[Plots: Kendall-Tau loss vs. sample size]
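The loss definition is elided in the extraction; a common form of the Kendall-Tau ranking loss simply counts (or normalizes) the item pairs that the true and predicted rankings order differently, e.g.:

```python
from itertools import combinations

def kendall_tau_loss(true_rank, pred_rank, normalize=True):
    """Number (or fraction) of item pairs ordered differently by the two rankings."""
    pos_t = {item: r for r, item in enumerate(true_rank)}
    pos_p = {item: r for r, item in enumerate(pred_rank)}
    discordant = sum(
        1
        for i, j in combinations(true_rank, 2)
        if (pos_t[i] - pos_t[j]) * (pos_p[i] - pos_p[j]) < 0   # the pair is flipped
    )
    n = len(true_rank)
    return discordant / (n * (n - 1) / 2) if normalize else discordant

print(kendall_tau_loss([1, 2, 3, 4], [2, 1, 3, 4]))   # one flipped pair out of 6 -> 0.1667
```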

Page 61:

Part-IV: Cost (Regret Minimization)

Accepted to Neural Information Processing Systems (NeurIPS), 2019.

Page 62:

Part-IV(a): Regret Minimization for the Plackett-Luce Model

Accepted to Neural Information Processing Systems (NeurIPS), 2019.

Page 63:

Problem Setting --- Objectives:

1. Regret w.r.t. the Best Item: …
2. Regret w.r.t. the Top-k Items: …

Page 64:

Result Overview: Cumulative Regret in T rounds

A. Lower Bound: …
B. We achieved: …

Page 65:

B. Proposed Algorithm (Regret w.r.t. the Best Item)

Page 66:

Algorithm: MaxMin-UCB

- Compute the pairwise preference estimate matrix P̂ = [p̂_ij]_{n×n}
- Compute U = [u_ij]_{n×n}: the UCB of P̂
- From U, identify the possible set of good items
- Build S_t via the max-min set building rule, play S_t, and repeat
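A generic sketch of the two "compute" steps in the diagram: an empirical pairwise-preference matrix and an optimistic (UCB) version of it. The Hoeffding-style confidence radius below is an assumption for illustration, not necessarily the exact bonus used by MaxMin-UCB.

```python
import numpy as np

def pairwise_ucb(wins, plays, t, alpha=2.0):
    """Empirical pairwise preferences and their upper confidence bounds.

    wins[i, j]  : number of (rank-broken) duels that arm i won against arm j
    plays[i, j] : number of duels between arms i and j
    t           : current round, used in the exploration bonus
    Returns (p_hat, ucb), both n x n, with ucb[i, j] = p_hat[i, j] + radius.
    """
    plays_safe = np.maximum(plays, 1)                      # avoid division by zero
    p_hat = np.where(plays > 0, wins / plays_safe, 0.5)    # unseen pairs default to 1/2
    radius = np.sqrt(alpha * np.log(t + 1) / plays_safe)   # assumed Hoeffding-style bonus
    ucb = np.minimum(p_hat + radius, 1.0)
    np.fill_diagonal(ucb, 0.5)                             # an arm against itself is a tie
    return p_hat, ucb

# Example: 3 arms after a handful of rank-broken duels.
wins = np.array([[0, 4, 5], [2, 0, 3], [1, 2, 0]], dtype=float)
plays = wins + wins.T
p_hat, U = pairwise_ucb(wins, plays, t=20)
print(np.round(U, 2))
```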

Page 67:

Algorithm: MaxMin-UCB --- Set Building Rule

[Diagram: the max-min set building rule]

Page 68:

Algorithm: MaxMin-UCB --- Set Building Rule (continued)

Max-min set building rule: … --- that's it! Otherwise, recurse for m times: …

Page 69:

Effect of varying subset-size(k):

[Plots: regret under full ranking feedback and under winner feedback]

Page 70:

Winner-Regret Performance:


Page 71:

Top-k-Regret Performance:


Page 72:

Part-IV(b): Regret Minimization for the Pairwise-Subset Choice Model

In Conference on Uncertainty in Artificial Intelligence (UAI), 2018.

Page 73:

Pairwise-Subset Choice (PSC) model --- parameters: preference matrix P = [P(a,b)]_{n×n}

Result Overview: Cumulative Regret in T rounds
A. Lower Bound: …
B. We achieved: … --- matching! Thus optimal.

Page 74:

Lower Bound: Reducing Dueling Bandits to Battling Bandits

[Diagram: the battling-bandit algorithm (BB) proposes S_t = {a_1, a_2, …, a_k}; the dueling-bandit (DB) side plays two random arms from S_t and feeds the winner of the duel back to BB.]

Page 75:

2. Regret Setting: (a) PSC model --- Proposed Algorithm (using a Dueling Bandit blackbox)

[Diagram: the dueling-bandit blackbox (DB) proposes a pair (x_t, y_t); each of x_t and y_t is replicated k/2 times to form S_t; S_t is played, and the winner of the battle is fed back to DB as the winner of the duel.]
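A minimal sketch of the reduction pictured above: the dueling-bandit blackbox proposes a pair (x_t, y_t), each arm is replicated k/2 times to form S_t, and the winner of the battle is reported back to the blackbox as the winner of the duel. The blackbox interface (`propose_pair`, `update`) is a hypothetical stand-in.

```python
def battling_round(db_blackbox, play_subset, k):
    """One round of the DB-blackbox-to-battling reduction (k assumed even).

    db_blackbox : object with propose_pair() -> (x, y) and update(winner)
                  (hypothetical dueling-bandit interface)
    play_subset : callable(S_t) -> winning item of the battle
    """
    x_t, y_t = db_blackbox.propose_pair()         # the DB blackbox suggests a duel
    S_t = [x_t] * (k // 2) + [y_t] * (k // 2)     # replicate each arm k/2 times
    winner = play_subset(S_t)                     # play the battle, observe the winner
    db_blackbox.update(winner)                    # feed the battle winner back as the duel winner
    return S_t, winner
```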

Page 76:

Comparative Regret Performances (on synthetic datasets):


Page 77:


Comparative Regret Performances (on real datasets):

Page 78:

Future directions…


Page 79:

Future Work: (2) With other choice models + objectives

Feedback mechanisms: Best-Item, Top-K, Full-Ranking, …, Cascading
Choice models: Plackett-Luce, Multinomial Probit, Mallows model, Nested GEV (also Contextual settings)

Objective \ Feedback    Best-Item   Top-K   Full-Ranking   ...   Cascading
Condorcet-winner        All 4       All 4   All 4
Borda-winner            ??          ??      ??
Copeland-winner
Top-cycle
Bank-set
Ranking

Page 80:

More future work: Revenue Maximization (item prices / budgets)

Problem Setup:
- Every item i is priced at r_i.
- Modeling: Plackett-Luce model, with its parameters defining the choice probabilities for any subset (as before).
- Objective (revenue maximization): every round, choose a set of items S_t ⊆ [n] such that …, where …, and … (constraints elided).

Page 81:

Thank You!

Questions?
