
Applied bandits: Supporting health-related research

Audrey Durand

July 2, 2019

We are going to talk about

Bandits with structure → Neuroscience research

Application to microscopy imaging parameters

Bandits with contexts → Cancer research

Application to adaptive treatment allocation

Stochastic bandits

For each episode t:
• Select an action k_t ∈ {1, 2, …, K}
• Observe outcome r_t ~ D(μ_{k_t})

[Diagram: arms 1, 2, 3, …, K with expected outcomes μ_1, μ_2, μ_3, …, μ_K]


Goal: Maximize expected rewards

k⋆ = argmax_{k ∈ {1, 2, …, K}} μ_k

Minimize R(T) = Σ_{t=1}^T (μ_{k⋆} − μ_{k_t})

Exploration/Exploitation trade-off

Exploit: Potentially minimize regret
• k_t = argmax_{k ∈ {1, 2, …, K}} μ̂_k

Explore: Gain information

Many strategies:
• ε-Greedy
• Optimism in the face of uncertainty (UCB)
• Thompson Sampling
• Best Empirical Sampled Average (BESA)

Theory showing sublinear regret under proper assumptions
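To make two of these strategies concrete, here is a minimal sketch (not from the talk) of ε-greedy and UCB1 on a simulated Bernoulli bandit; the arm means, horizon, and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.7])       # illustrative arm means, unknown to the learner
K, T, eps = len(mu), 1000, 0.1

counts = np.zeros(K)                  # number of pulls per arm
means = np.zeros(K)                   # empirical means (mu_hat)

def select_eps_greedy():
    # Explore with probability eps, otherwise exploit the empirical best arm.
    if rng.random() < eps:
        return int(rng.integers(K))
    return int(np.argmax(means))

def select_ucb1(t):
    # Optimism in the face of uncertainty: empirical mean + confidence bonus.
    if np.any(counts == 0):
        return int(np.argmin(counts))             # pull each arm once first
    bonus = np.sqrt(2.0 * np.log(t + 1) / counts)
    return int(np.argmax(means + bonus))

regret = 0.0
for t in range(T):
    k = select_ucb1(t)                            # or select_eps_greedy()
    r = float(rng.random() < mu[k])               # Bernoulli outcome
    counts[k] += 1
    means[k] += (r - means[k]) / counts[k]        # online mean update
    regret += mu.max() - mu[k]                    # pseudo-regret, only known in simulation

print(f"pseudo-regret after {T} rounds: {regret:.1f}")
```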

In practice

We cannot compute regret: 𝔼[Σ_{t=1}^T (μ_{k⋆} − μ_{k_t})]

• Instead we minimize cumulative bad events, e.g. system failures, fractures, patient dropout

• Or we maximize cumulative good events, e.g. clicks, minutes spent on website, lives saved


We need to face constraints and challenges specific to applications


Many actions → Structured bandits

Expected reward is a functionof the action features

𝑓:𝒳 ↦ ℝ

For each episode t:
• Select an action x_t ∈ 𝒳
• Obtain a reward r_t ~ 𝒟(f(x_t))


Goal: Maximize rewards → Find x⋆ = argmax_{x ∈ 𝒳} f(x)


Capture structure: Linear model

• Unknown θ ∈ ℝ^d
• Mapping φ: 𝒳 ↦ ℝ^d
• f(x) = ⟨φ(x), θ⟩


Online function approximation

• Unknown θ ∈ ℝ^d
• f(x) = ⟨φ(x), θ⟩
• x⋆ = argmax_{x ∈ 𝒳} ⟨φ(x), θ⟩

For each episode t:
• Select an action x_t ∈ 𝒳
• Observe outcome y_t = f(x_t) + ξ_t, with noise ξ_t ∼ 𝒩(0, σ²)

Minimize 𝔼[Σ_{t=1}^T (f(x⋆) − f(x_t))]
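As an illustration of this linear setting, here is a minimal sketch (mine, not the talk's) of online ridge regression on the features φ(x) with an optimistic, LinUCB-style selection rule over a finite grid of candidate actions; the feature map, the grid, and the constants are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(x):
    # Assumed feature map: polynomial features of a scalar action in [0, 1].
    return np.array([1.0, x, x**2, x**3])

theta_true = np.array([0.1, 2.0, -1.5, 0.2])     # unknown to the learner (illustrative)
sigma, lam, alpha = 0.1, 1.0, 1.0
d = len(theta_true)

candidates = np.linspace(0.0, 1.0, 101)          # finite grid over the action space
A = lam * np.eye(d)                               # regularized design matrix
b = np.zeros(d)

for t in range(200):
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b                         # ridge-regression estimate of theta
    # Optimistic score: predicted reward plus a confidence width, per candidate action.
    scores = [phi(x) @ theta_hat + alpha * np.sqrt(phi(x) @ A_inv @ phi(x))
              for x in candidates]
    x_t = candidates[int(np.argmax(scores))]
    y_t = phi(x_t) @ theta_true + sigma * rng.standard_normal()   # noisy outcome
    A += np.outer(phi(x_t), phi(x_t))             # online update of the regression
    b += y_t * phi(x_t)

theta_hat = np.linalg.inv(A) @ b
print("estimated theta:", np.round(theta_hat, 2))
```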

Kernel regression

• Kernel k(x, x′) = ⟨φ(x), φ(x′)⟩
• Gaussian prior θ ∼ 𝒩_d(0, Σ) with Σ = (σ²/λ) I for λ > 0

K_n = [k(x_i, x_j)]_{1 ≤ i, j ≤ n} and k_n(x) = [k(x, x_i)]_{1 ≤ i ≤ n}

ℙ(f | x_1, …, x_n, y_1, …, y_n) ∼ 𝒩((f_n(x))_{x ∈ 𝒳}, (k_n(x, x′))_{x, x′ ∈ 𝒳})

f_n(x) = k_n(x)^⊤ (K_n + λI)^{−1} y_n
k_n(x, x′) = k(x, x′) − k_n(x)^⊤ (K_n + λI)^{−1} k_n(x′)

φ: 𝒳 ↦ ℝ^d, where d can be very large!


For λ = σ² → Gaussian process regression (Rasmussen and Williams, 2006)

• Example: Pointwise posterior mean and standard deviation


Streaming kernel regression

• Next input location x_t is selected based on the t − 1 past observations

• Many bandit algorithm variants, e.g. Kernel UCB, Kernel TS, GP-UCB, GP-TS
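A minimal sketch of how the posterior formulas above can drive a Kernel TS / GP-TS selection rule on a 1-D parameter grid; the RBF kernel, its length scale, the noise level, and the toy objective are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def kern(a, b, ell=0.2):
    # Assumed RBF kernel k(x, x') with length scale ell.
    a, b = np.atleast_1d(a), np.atleast_1d(b)
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

grid = np.linspace(-1.0, 1.0, 200)                # candidate parameters
f_true = lambda x: np.sin(3.0 * x)                # illustrative unknown objective
lam = sigma2 = 0.05                               # lambda = sigma^2  ->  GP regression

X, y = [], []
for t in range(30):
    if X:
        Kn = kern(np.array(X), np.array(X)) + lam * np.eye(len(X))
        Kinv = np.linalg.inv(Kn)
        ks = kern(grid, np.array(X))              # k_n(x) for every grid point
        mean = ks @ Kinv @ np.array(y)            # posterior mean f_n(x)
        cov = kern(grid, grid) - ks @ Kinv @ ks.T # posterior covariance k_n(x, x')
    else:
        mean, cov = np.zeros(len(grid)), kern(grid, grid)
    cov = 0.5 * (cov + cov.T) + 1e-8 * np.eye(len(grid))
    # Thompson Sampling: draw one function from the posterior and act greedily on it.
    f_tilde = rng.multivariate_normal(mean, cov)
    x_t = grid[int(np.argmax(f_tilde))]
    y_t = f_true(x_t) + np.sqrt(sigma2) * rng.standard_normal()
    X.append(x_t); y.append(y_t)

print("best parameter found:", round(float(X[int(np.argmax(y))]), 2))
```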

Let’s apply those bandits!

Optimizing super-resolution imaging parameters

• Flavie Lavoie-Cardinal
• Theresa Wiesner
• Anthony Bilodeau
• Paul De Koninck
• Louis-Émile Robitaille
• Marc-André Gardner
• Christian Gagné

Joint work with

D., Wiesner, Gardner, Robitaille, Bilodeau, Gagné, De Koninck, and Lavoie-Cardinal (Nature Comm. 2018)

Observing structures at the nanoscale (Hell and Wichmann, 1994)

Problem

Biology: The optimal parameters are not always the same

Typical strategy:
• Split samples into two groups A and B
• Find good parameters on group A
• Perform the imaging task on group B

From abberior-instruments.com

Structured bandit problem

Find good parameters during the imaging task
• Maximize the acquisition of useful images
• Minimize trials of poor parameters

Imaging parameters

Feedback

→ Identify best parameters
→ Explore wisely

Optimizing image quality

Recall goal: Maximize the acquisition of images useful to researchers

[Diagram: imaging parameters → image → quality score, with Thompson Sampling closing the loop]


Thompson Sampling for selecting imaging parameters

[Figure: posterior mean f_N, sampled function f̃, previous points, and next location over a 1-D parameter (objective: quality), after N = 3, 10, and 29 observations]

What is good image quality?

Avoiding images like these:

Getting more like these:

But imaging is a destructive process…

Trade-off image quality and photobleaching

[Diagram: loop of parameter selection, imaging, online analysis of the images (quality score, % bleach), and outcome options]

Thompson Sampling for generating outcome options

• One kernel regression model f̂_i per objective i
• Sample one function f̃_i per objective i
• Option f̃(x) at parameter x: concatenate f̃_i(x) for all i (sketched below)

[Figure: sampled functions over the parameter for the quality and photobleaching objectives (N = 3); e.g. option f̃(0) = (0.75, 0.65)]
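Here is a minimal sketch of that option-generation step, under assumptions (RBF kernel, made-up observations for the two objectives): one kernel-regression posterior per objective, one sampled function per objective, and a concatenation of the sampled values at each candidate parameter.

```python
import numpy as np

rng = np.random.default_rng(3)

def posterior(Xobs, yobs, grid, ell=0.3, lam=0.05):
    # Kernel regression posterior mean/covariance on a grid (assumed RBF kernel).
    k = lambda a, b: np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)
    Kinv = np.linalg.inv(k(Xobs, Xobs) + lam * np.eye(len(Xobs)))
    ks = k(grid, Xobs)
    return ks @ Kinv @ yobs, k(grid, grid) - ks @ Kinv @ ks.T

grid = np.linspace(-1.0, 1.0, 100)
# Made-up past observations at the same parameters, one series per objective.
Xobs = np.array([-0.8, 0.0, 0.7])
obs = {"quality": np.array([0.3, 0.75, 0.5]),
       "photobleaching": np.array([0.2, 0.65, 0.9])}

sampled = {}
for name, yobs in obs.items():
    mean, cov = posterior(Xobs, yobs, grid)
    cov = 0.5 * (cov + cov.T) + 1e-8 * np.eye(len(grid))
    sampled[name] = rng.multivariate_normal(mean, cov)   # one f~_i per objective

# Option at a parameter: concatenation of the sampled objective values there.
idx = int(np.argmin(np.abs(grid)))                        # grid point closest to 0
option = tuple(round(float(sampled[name][idx]), 2) for name in obs)
print("option f~(0) =", option)
```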

Presenting estimates to the expert
• Exploration/Exploitation in the cloud!
• Expert acts as an argmax on the preference function

[Figure: outcome options plotted as photobleaching vs. image quality]

Experiments on neuronal imaging

Three parameters (1000 configurations):
• Excitation laser power
• Depletion laser power
• Duration of imaging per pixel

Different imaging targets:
• Neuron: Rat neuron
• PC12: Rat tumor cell line
• HEK293: Human embryonic kidney cells

Acquire two STED images with ↑ 1st STED quality and ↓ photobleaching

Acquire good images and control photobleaching

[Figure: photobleaching vs. quality; region marked "not super-resolution"]

Sublinear regret, as suggested by theory


Fully automated process

[Diagram: fully automated loop of parameter selection, imaging, online analysis (quality score, % bleach), and outcome options]

Fully automated imaging

Bottom left corner: Not super-resolution

Beginning of optimization →

End of optimization →

Towards the next application: Getting closer to the patient

Randomized trials

[Diagram: randomization into Option A and Option B, followed over time by comparison & decision]

Adaptive trials

Dynamically adapt the study design based on previous observations

Effective

Favorable

Promising

Unfavorable

Futile

Writing down the setting…

Treatments: 1, 2, 3, …, K
Probability of effectiveness: μ_1, μ_2, μ_3, …, μ_K

For each patient t:
• Select a treatment k_t ∈ {1, 2, …, K}
• Observe outcome r_t ~ D(μ_{k_t})

This is stochastic bandits! (Thompson, 1933)

In the absence of a one-size-fits-all strategy

Treatments: 1, 2, 3, …, K
Probability of effectiveness now depends on the context: μ_{k,c} for treatment k and context c


You can treat them as independent bandit problems!


What happens if the number of contexts grows large?

Contextual bandits

Exploit the underlying structure of the context space

Reward = treatment effect

[Figure: expected reward as a function of the context for Actions 1–4, e.g. f_1(0.6), f_2(0.6), f_3(0.6), f_4(0.6)]

Contextual bandits

Expected reward of action k is a function f_k of the context features

f_k: 𝒮 ↦ ℝ

For each episode t:
• Observe a context s_t ∼ Π
• Select an action k_t ∈ {1, 2, …, K}
• Observe a reward r_t ~ D(f_{k_t}(s_t))


Goal: Maximize rewards → Find k_t⋆ = argmax_{k ∈ {1, 2, …, K}} f_k(s_t)


Online function approximation

• Unknown θ_k ∈ ℝ^d for each action k
• f_k(s) = ⟨φ(s), θ_k⟩
• k_t⋆ = argmax_{k ∈ {1, 2, …, K}} ⟨φ(s_t), θ_k⟩

For each episode t:
• Observe a context s_t ∼ Π
• Select an action k_t ∈ {1, 2, …, K}
• Observe reward r_t = f_{k_t}(s_t) + ξ_t, with noise ξ_t ∼ 𝒩(0, σ²)

Minimize 𝔼[Σ_{t=1}^T (f_{k_t⋆}(s_t) − f_{k_t}(s_t))]
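A minimal sketch (not from the talk) of this contextual setting with one Bayesian linear-regression model per action and Thompson Sampling on the sampled parameters; the feature map, the number of actions, and the true reward functions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

phi = lambda s: np.array([1.0, s, s**2])            # assumed context feature map
theta_true = [np.array([0.0, 1.0, -1.0]),            # illustrative per-action parameters
              np.array([0.5, -0.5, 0.0]),
              np.array([0.2, 0.0, 0.8])]
K, d, sigma2, lam, T = len(theta_true), 3, 0.05, 1.0, 500

A = [lam * np.eye(d) for _ in range(K)]              # per-action regularized Gram matrices
b = [np.zeros(d) for _ in range(K)]
counts = [0] * K

for t in range(T):
    s = rng.random()                                 # observe a context s_t ~ Pi
    sampled = []
    for k in range(K):
        A_inv = np.linalg.inv(A[k])
        theta_hat = A_inv @ b[k]
        # Thompson Sampling: draw theta~_k from the posterior N(theta_hat, sigma^2 * A_k^{-1}).
        theta_tilde = rng.multivariate_normal(theta_hat, sigma2 * A_inv)
        sampled.append(phi(s) @ theta_tilde)
    k_t = int(np.argmax(sampled))                    # act greedily on the sampled models
    r = phi(s) @ theta_true[k_t] + np.sqrt(sigma2) * rng.standard_normal()
    A[k_t] += np.outer(phi(s), phi(s))               # update only the chosen action's model
    b[k_t] += r * phi(s)
    counts[k_t] += 1

print("pulls per action:", counts)
```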

Adaptive treatment allocation for mice trials

• Georgios D. Mitsis
• Joelle Pineau
• Charis Achilleos
• Demetris Iacovides
• Katerina Strati

Joint work with

D., Achilleos, Iacovides, Strati, Mitsis, and Pineau (MLHC 2018)

Data acquisition problem

• Mice with induced cancer tumours

• Treatment options: 5FU, Imiquimod, 5FU+Imiquimod, None

• Treatment allocation twice a week

Which treatment should be allocated to patients with cancer given the stage of their disease?

Squamous cell carcinoma


Phase 1: Randomized allocation (only exploration)

• 6 mice

Processing a mouse, 2x/week:
• Measure volume of tumours
• If all tumours are below a critical level:
  – Randomly assign one of the four treatment options
• Otherwise terminate this animal

Phase 1: Randomized allocation (only exploration)

Result: 12 usable tumours
• 163 triplets (tumour volume, treatment, next tumour volume)

[Figures: histogram of the number of measurements per tumour volume (mm³); tumour volume (mm³) over the number of days]

Phase 1: Randomized allocation (only exploration)

Exponential tumour growth → Few data collected for larger tumours

Phase 2: Adaptive trial (exploration/exploitation)

• 10 mice
• Select treatment 2x/week

Adaptive clinical trial:
• Do not fix the experiment design a priori
• Adapt treatment allocation based on previous observations
• Favor selection of better treatments
• Reduce exposure to less effective treatments

Contextual bandit problem

Improve treatment allocation online

• Maximize amount of acquired data
• Minimize trials of poor treatments

Treatment

Tumour volume after treatment

Tumour volume before treatment

→ Identify best action given context
→ Explore wisely

Alert: Contexts are not independent of actions!

Recall contextual bandits:

For each episode t:
• Observe a context s_t ∼ Π
• Select an action k_t ∈ {1, 2, …, K}
• Obtain a reward r_t ~ 𝒟(f_{k_t}(s_t))

Beware of traps!

Reward shaping

Natural reward definition could be r_t = s_t − s_{t+1} (tumour volume reduction)

Controlling the disease, i.e. keeping the tumour volume constant, would then have the same value independently of the tumour volume

What we used instead: r_t = −s_{t+1}
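A tiny made-up illustration of the difference: keeping the tumour constant at 200 mm³ or at 20 mm³ gives the same reward under the volume-reduction definition, but not under r_t = −s_{t+1}.

```python
# Two hypothetical "disease controlled" transitions: the volume stays constant.
transitions = [(200.0, 200.0), (20.0, 20.0)]    # (s_t, s_{t+1}) in mm^3

for s_t, s_next in transitions:
    r_reduction = s_t - s_next     # volume-reduction reward: 0.0 in both cases
    r_used = -s_next               # reward actually used: prefers small tumours
    print(f"s_t = {s_t:6.1f}   reduction reward = {r_reduction:5.1f}   -s_(t+1) = {r_used:7.1f}")
```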

Exploration/Exploitation strategy

Best Empirical Sampled Average: BESA (Baransi et al., 2014)

• Fair comparison of empirical estimators
• Opportunities for actions to show how good they are

Samples # | Action 1 | Action 2
1 | 0 | 1
2 | 1 | 0
3 | 1 | 1
4 | 0 |
… | … |
100 | 1 |

Empirical expected rewards: μ̂_1, μ̂_2
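A minimal sketch of the BESA comparison for two actions, in the spirit of Baransi et al. (2014): sub-sample the richer history down to the size of the poorer one, compare the sub-sampled empirical means, and break ties toward the least-pulled action; the 0/1 reward histories are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

def besa_choice(hist1, hist2):
    # Sub-sample the longer history (without replacement) to the length of the shorter one.
    n = min(len(hist1), len(hist2))
    m1 = np.mean(rng.choice(hist1, size=n, replace=False))
    m2 = np.mean(rng.choice(hist2, size=n, replace=False))
    if m1 == m2:
        return 0 if len(hist1) <= len(hist2) else 1   # break ties toward the least-pulled action
    return 0 if m1 > m2 else 1

# Illustrative histories: action 1 has 100 observations, action 2 only 3.
hist1 = rng.integers(0, 2, size=100).astype(float)
hist2 = np.array([1.0, 0.0, 1.0])
print("next action:", besa_choice(hist1, hist2) + 1)
```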



GP BESA: Extension to contextual bandits

[Figure: true treatment-effect function f and sub-sampled kernel-regression fit f̃_N vs. tumour volume, for N = 50, 20, 10, and 5 observations]


Example of exploration

Action 1: 100 observations; Action 2: 5 observations

[Figure: true functions, predictions from all data, and predictions from a sub-sample, for A1 (N = 100) and A2 (N = 5), compared at the current context s_t]
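A minimal sketch of the GP BESA idea, under assumptions (RBF kernel, two treatments, synthetic histories): sub-sample each action's history down to the size of the smallest one, fit a kernel regression on each sub-sample, and compare the predicted treatment effects at the current tumour volume.

```python
import numpy as np

rng = np.random.default_rng(6)

def gp_mean(Xobs, yobs, x, ell=50.0, lam=0.1):
    # Posterior mean of a kernel regression with an assumed RBF kernel.
    k = lambda a, b: np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)
    Kinv = np.linalg.inv(k(Xobs, Xobs) + lam * np.eye(len(Xobs)))
    return (k(np.atleast_1d(x), Xobs) @ Kinv @ yobs)[0]

def gp_besa(histories, s_t):
    # histories: one (contexts, rewards) pair of arrays per action.
    n = min(len(X) for X, _ in histories)                 # size of the poorest history
    scores = []
    for X, y in histories:
        idx = rng.choice(len(X), size=n, replace=False)   # sub-sample the richer histories
        scores.append(gp_mean(X[idx], y[idx], s_t))       # fit on the sub-sample only
    return int(np.argmax(scores))                          # compare at the current context

# Synthetic histories: (tumour volume, treatment effect) pairs per treatment.
hist_a1 = (rng.uniform(0, 300, 100), rng.uniform(0, 300, 100))
hist_a2 = (rng.uniform(0, 300, 5), rng.uniform(0, 300, 5))
print("treatment chosen at volume 150:", gp_besa([hist_a1, hist_a2], 150.0) + 1)
```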

Experimental setting

• 10 mice total

• Processing a group of mice (2–3 subjects)
  – Twice a week, for each mouse in the group:
    » Measure tumour → reward for last treatment
    » Select treatment to assign now
  – Until death/sacrifice of all mice in the group

• Update algorithm with tuples of (volume, treatment, next volume)

• Start next group

Animals live longer

Algorithm updated after each group

[Figure: life duration (days) under None, Random, 5-FU, and GP BESA, and per group A, B, C, D]

Evolution of tumour volumes

Slowing the exponential growth

Recall phase 1:

[Figures: tumour volume (mm³) over days for groups A, B, C, and D, compared with the phase-1 curve (volume over number of days)]

A better state space covering

Using data in a next phase
• More information on the tumour growth process
• 40% more data points with volume > 70 mm³

[Figures: histogram of the number of measurements per tumour volume (mm³); treatment-selection probability (None, 5-FU, Imiquimod, 5-FU + Imiquimod) as a function of tumour volume (mm³) for groups A, B, C, and D]

Evolution of the policy

Conclusion + Take homes

• Bandits are a nice framework for theory, but they also have applications! :)

• We often break theoretical guarantees in practice :(

• How to design algorithms that don’t make unrealistic assumptions?

• Other aspects important in practice were not considered here, e.g.
  – Fairness in exploration
  – Safe exploration

Huge thanks again to my collaborators!

Questions? …and more