Learning in Approximate Dynamic Programming for Managing a Multi-Attribute Driver

Martijn Mes
Department of Operational Methods for Production and Logistics
University of Twente, The Netherlands

INFORMS Annual Meeting, Austin, Sunday, November 7, 2010
OUTLINE

1. Illustration: a transportation application
2. Stylized illustration: the Nomadic Trucker Problem
3. Approximate Dynamic Programming (ADP)
4. Challenges with ADP
5. Optimal Learning
6. Optimal Learning in ADP
7. Challenges with Optimal Learning in ADP
8. Sketch of our solution concept
TRANSPORTATION APPLICATION

Heisterkamp Trailer trucking: providing trucks and drivers.

Planning department:
- Accept orders
- Assign orders to trucks
- Assign drivers to trucks

Types of orders:
- Direct order: move a trailer from A to B; the client pays based on the distance between A and B, but the trailer might pass through hubs to change the truck and/or driver
- Customer guidance order: rent a truck and driver to a client for some time period
REAL APPLICATION
Heisterkamp
CHARACTERISTICS

- The drivers are bound by EU drivers' hours regulations.
- However, given a sufficient supply of orders and drivers, trucks can in principle be utilized 24/7 by switching drivers.
- Even though we can replace a driver (to increase truck utilization), we might still face costs for the old driver.
- Objective: increase profits by 'clever' order acceptance and by minimizing the costs for drivers, trucks, and moving empty (i.e., without a trailer).
- We solve a dynamic assignment problem, given the state of all trucks and the (probabilistically) known orders, at specific time instances over a fixed horizon.
- This problem is known as a Dynamic Fleet Management Problem (DFMP). For illustrative purposes we now focus on the single-vehicle version of the DFMP.
THE NOMADIC TRUCKER PROBLEM

- A single trucker moves from city to city, either with a load or empty.
- There are rewards when moving loads; otherwise there are costs involved.
- A vector of attributes $a$ describes a single resource, with $\mathcal{A}$ the set of possible attribute vectors:

$$a = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \\ a_6 \\ a_7 \\ a_8 \\ a_9 \\ a_{10} \end{pmatrix} = \begin{pmatrix} \text{location of the truck} \\ \text{arrival time at next location} \\ \text{maintenance status} \\ \text{driver type} \\ \text{hours driving in current trip} \\ \text{hours driving today} \\ \text{hours driving this week} \\ \text{hours away from home} \\ \text{home domicile} \\ \text{day of the week} \end{pmatrix}$$

The first attributes describe the truck, the remaining ones the driver; most of them (in particular the hours- and time-related ones) are dynamic.
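As a concrete illustration, such an attribute vector could be represented as follows; this is a sketch, and the field names are our assumptions, not taken from the talk.

```python
from dataclasses import dataclass

# Illustrative representation of the attribute vector a.
@dataclass(frozen=True)
class Attributes:
    location: int            # truck: current location (city index)
    arrival_time: int        # truck: arrival time at next location
    maintenance_status: int  # truck: maintenance status
    driver_type: int         # driver: type
    hours_trip: float        # driver: hours driving in current trip (dynamic)
    hours_today: float       # driver: hours driving today (dynamic)
    hours_week: float        # driver: hours driving this week (dynamic)
    hours_from_home: float   # driver: hours away from home (dynamic)
    home_domicile: int       # driver: home domicile
    day_of_week: int         # 0 = Monday, ..., 6 = Sunday (dynamic)
```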
MODELING THE DYNAMICS

State $S_t = (R_t, D_t)$, where:
- $R_t = (R_{ta})_{a \in \mathcal{A}}$, with $R_{ta} = 1$ when the truck has attribute vector $a$ (in the DFMP, $R_{ta}$ gives the number of resources at time $t$ with attribute $a$)
- $D_t = (D_{tl})_{l \in \mathcal{L}}$, with $D_{tl}$ the number of loads of type $l$

Decision $x_t$: make a loaded move, wait at the current location, or move empty to another location; $x_t$ follows from a decision function $X_t^\pi(S_t)$, where $\pi \in \Pi$, with $\Pi$ a family of policies.

Exogenous information $W_{t+1}$: information arriving between $t$ and $t+1$, such as new loads, wear of the truck, occurrence of breakdowns, etc.

Choosing decision $x_t$ in the current state $S_t$, together with the exogenous information $W_{t+1}$, results in a transition

$$S_{t+1} = S^M(S_t, x_t, W_{t+1})$$

with contribution (payment or costs) $C_t(S_t, x_t)$.
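A minimal sketch of this transition logic for a toy single-truck model; the state representation and helper names are our assumptions, not from the talk.

```python
# Toy single-truck transition S_{t+1} = S^M(S_t, x_t, W_{t+1}).
# State: (location, loads), where loads is the number of loads available
# at the truck's location; the reward structure is left out.

def post_decision(location, destination):
    """Deterministic post-decision part S^{M,x}: after committing to the
    decision, the truck will be at `destination`."""
    return destination

def transition(location, decision, new_loads):
    """Full transition: apply the decision, then merge the exogenous
    information W_{t+1} (loads that become available at the new location)."""
    new_location = post_decision(location, decision)
    return new_location, new_loads.get(new_location, 0)

# Example: truck in city 'A' decides to move to 'C'; two loads appear at 'C'.
print(transition('A', 'C', {'B': 1, 'C': 2}))  # ('C', 2)
```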
OBJECTIVE

The objective is to find the policy $\pi$ that maximizes the expected sum of discounted contributions over all time periods:

$$\sup_{\pi \in \Pi} \mathbb{E}\left[\left.\sum_{t=0}^{T} \gamma^t\, C_t\bigl(S_t, X_t^\pi(S_t)\bigr)\,\right|\,S_0\right]$$
SOLVING THE PROBLEM

Optimality equation (expectation form of Bellman's equation):

$$V_t(S_t) = \max_{x_t \in \mathcal{X}_t}\Bigl(C_t(S_t, x_t) + \gamma\,\mathbb{E}\bigl[V_{t+1}(S_{t+1}) \mid S_t\bigr]\Bigr)$$

Enumerating by backward induction? Suppose $a = (\text{location}, \text{arrival time}, \text{domicile})$ and we discretize to 500 locations and 50 possible arrival times → $|\mathcal{A}| = 500 \times 50 \times 500 = 12{,}500{,}000$.

In the backward loop we not only have to visit all states, but we also have to evaluate all actions, and, to compute the expectation, we probably also have to evaluate all possible outcomes.

Backward dynamic programming might become intractable → Approximate Dynamic Programming.
APPROXIMATE DYNAMIC PROGRAMMING

We replace the original optimality equation

$$V_t(S_t) = \max_{x_t \in \mathcal{X}_t}\Bigl(C_t(S_t, x_t) + \gamma\,\mathbb{E}\bigl[V_{t+1}(S_{t+1}) \mid S_t\bigr]\Bigr), \qquad S_{t+1} = S^M(S_t, x_t, W_{t+1})$$

with the following:

$$\hat{v}_t^n = \max_{x_t \in \mathcal{X}_t^n}\Bigl(C_t(S_t^n, x_t) + \gamma\,\bar{V}_t^{n-1}\bigl(S^{M,x}(S_t^n, x_t)\bigr)\Bigr),$$

where $x_t^n$ is our decision and $S_{t+1}^n = S^M(S_t^n, x_t^n, W_{t+1}^n)$.

This approximation rests on four ingredients:
1. Using a value function approximation $\bar{V}_t^{n-1}$; this allows us to step forward in time.
2. Using the post-decision state variable: the process becomes $(S_t, x_t, S_t^x, W_{t+1}, S_{t+1}, x_{t+1}, S_{t+1}^x, W_{t+2}, \ldots)$, where the post-decision state $S_t^x = S^{M,x}(S_t, x_t)$ is a deterministic function of the state and decision.
3. Generating sample paths $\omega^n = (W_1^n, W_2^n, \ldots, W_T^n)$.
4. Learning through iterations $n$.
OUTLINE OF THE ADP ALGORITHM

Step 0. Initialize $\bar{V}_t^0$ for all $t$ and $S_0^1$; set $n = 1$.

Step 1. Choose a sample path $\omega^n$.

Step 2. For $t = 0, 1, \ldots, T$ do:

Step 2a. Solve (deterministic optimization)
$$\hat{v}_t^n = \max_{x_t \in \mathcal{X}_t^n}\Bigl(C_t(S_t^n, x_t) + \gamma\,\bar{V}_t^{n-1}\bigl(S^{M,x}(S_t^n, x_t)\bigr)\Bigr),$$
let $x_t^n$ be the best decision and let $S_t^{x,n} = S^{M,x}(S_t^n, x_t^n)$.

Step 2b. Update the value function (statistics):
$$\bar{V}_{t-1}^n\bigl(S_{t-1}^{x,n}\bigr) = (1 - \alpha_{n-1})\,\bar{V}_{t-1}^{n-1}\bigl(S_{t-1}^{x,n}\bigr) + \alpha_{n-1}\,\hat{v}_t^n.$$

Step 2c. Compute the new pre-decision state (simulation): $S_{t+1}^n = S^M(S_t^n, x_t^n, W_{t+1}^n)$.

Step 3. Increment $n$. If $n \le N$, go to Step 1.

Step 4. Return the value functions $\bigl(\bar{V}_t^N\bigr)_{t=1}^{T}$.
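A minimal Python sketch of this forward-pass loop, assuming a generic model interface and a lookup-table approximation keyed on post-decision states. The method names (initial_state, feasible_decisions, contribution, post_decision, sample_exogenous, next_state) and the stepsize choice are our assumptions, not from the talk.

```python
from collections import defaultdict

def adp(model, T, N, gamma=1.0):
    """Forward ADP with a lookup-table value function approximation
    keyed on (hashable) post-decision states."""
    V = defaultdict(float)  # \bar{V}(S^x), initialized to zero

    for n in range(1, N + 1):          # Step 1: one sample path per iteration
        alpha = 1.0 / n                # stepsize alpha_{n-1}; a common simple choice
        state = model.initial_state()
        prev_post = None               # S_{t-1}^{x,n}
        for t in range(T + 1):         # Step 2
            # Step 2a: deterministic optimization over feasible decisions.
            def value(x):
                return (model.contribution(t, state, x)
                        + gamma * V[model.post_decision(t, state, x)])
            best_x = max(model.feasible_decisions(t, state), key=value)
            v_hat = value(best_x)
            post = model.post_decision(t, state, best_x)  # S_t^{x,n}
            # Step 2b: smooth v_hat into the previous post-decision state's value.
            if prev_post is not None:
                V[prev_post] = (1 - alpha) * V[prev_post] + alpha * v_hat
            # Step 2c: simulate exogenous information W_{t+1}
            # to obtain the next pre-decision state.
            state = model.next_state(post, model.sample_exogenous(t))
            prev_post = post
    return V  # the value function approximations after N iterations
```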
CHALLENGES WITH ADP

Exploration vs. exploitation:
- Exploitation: we do what we currently think is best
- Exploration: we choose to try something and learn more (information collection)

To avoid getting stuck in a local optimum, we have to explore. But what do we want to explore, and for how long? Do we need to explore the whole state space?

Do we update the value functions using the results of the exploration steps, or do we want to perform off-policy control?

Techniques from Optimal Learning might help here.
OPTIMAL LEARNING

To cope with the exploration vs. exploitation dilemma:

Undirected exploration: try to randomly explore the whole state space.
- Examples: pure exploration and epsilon-greedy (explore with probability $\varepsilon_n$ and exploit with probability $1 - \varepsilon_n$)

Directed exploration: utilize past experience to explore efficiently (costs are gradually avoided by making more expensive actions less likely). Examples of directed exploration (a sketch of the first two follows this list):
- Boltzmann exploration: choose $x$ with probability $p_x^n = \dfrac{\exp\bigl(\bar{\theta}_x^n\bigr)}{\sum_{x'} \exp\bigl(\bar{\theta}_{x'}^n\bigr)}$
- Interval estimation: choose the $x$ that maximizes $\bar{\theta}_x^n + z_\alpha\,\bar{\sigma}_x^n$
- The knowledge gradient policy (see the next slides)
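Hedged sketches of these selection rules, assuming a list of belief means (and, for interval estimation, standard deviations); all names and defaults are illustrative.

```python
import math
import random

def epsilon_greedy(means, eps):
    """Explore with probability eps, otherwise exploit the current best."""
    if random.random() < eps:
        return random.randrange(len(means))  # undirected exploration
    return max(range(len(means)), key=lambda x: means[x])

def boltzmann(means):
    """Choose x with probability proportional to exp(mean_x)."""
    weights = [math.exp(m) for m in means]
    return random.choices(range(len(means)), weights=weights)[0]

def interval_estimation(means, sigmas, z=1.96):
    """Choose the x maximizing mean_x + z * sigma_x (an optimism bonus)."""
    return max(range(len(means)), key=lambda x: means[x] + z * sigmas[x])
```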
THE KNOWLEDGE GRADIENT POLICY [1/2]

Basic principle:
- Assume you can make only one measurement, after which you have to make a final choice (the implementation decision)
- What choice would you make now to maximize the expected value of the implementation decision?

[Figure: belief estimates for five options; an observation of option 5 updates its estimated value, and only a change large enough to alter which option is best produces a change in the decision.]
THE KNOWLEDGE GRADIENT POLICY [2/2]

- The knowledge gradient $\nu_x^{KG}$ is the expected marginal value of a single measurement $x$
- The knowledge gradient policy is given by $x = \arg\max_{x \in \mathcal{X}} \nu_x^{KG}$
- There are many problems where making one measurement tells us something about what we might observe from other measurements (e.g., in our transportation application, nearby locations have similar properties)
- Correlations are particularly important when the number of possible measurements is extremely large relative to the measurement budget (or when the function being measured is continuous)
- There are various extensions of the knowledge gradient policy that take similarities between alternatives into account → the Hierarchical Knowledge Gradient policy
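For independent normal beliefs, the knowledge gradient has a well-known closed form from the literature, $\nu_x^{KG} = \tilde{\sigma}_x f(\zeta_x)$ with $f(\zeta) = \zeta\Phi(\zeta) + \phi(\zeta)$; the sketch below implements that formula (the variable names are ours, and correlated or hierarchical beliefs need the extensions cited above).

```python
import math

def kg_independent_normal(means, variances, noise_var):
    """Knowledge gradient nu_x = sigma_tilde_x * f(zeta_x),
    f(z) = z * Phi(z) + phi(z), for independent normal beliefs."""
    phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
    kg = []
    for x, (mu, var) in enumerate(zip(means, variances)):
        # Standard deviation of the change in the posterior mean of x
        # caused by one noisy measurement of x.
        sigma_tilde = var / math.sqrt(var + noise_var)
        best_other = max(m for i, m in enumerate(means) if i != x)
        zeta = -abs(mu - best_other) / sigma_tilde
        kg.append(sigma_tilde * (zeta * Phi(zeta) + phi(zeta)))
    return kg  # the KG policy measures argmax over these values
```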
HIERARCHICAL KNOWLEDGE GRADIENT (HKG)

Idea: instead of having a belief on the true value $\theta_x$ of each alternative $x$ (a Bayesian prior with mean $\bar{\theta}_x^n$ and precision $\beta_x^n$), we have a belief on the value of each alternative at various levels of aggregation $g \in G$ (with mean $\bar{\theta}_x^{g,n}$ and precision $\beta_x^{g,n}$).

Using aggregation, we express $\bar{\theta}_x^n$ (our estimate of $\theta_x$) as a weighted combination

$$\bar{\theta}_x^n = \sum_{g \in G} w_x^{g,n}\,\bar{\theta}_x^{g,n}$$

Intuition: the highest weight goes to the aggregation levels with the lowest sum of variance and bias; see [1] and [2] for details.

[1] M.R.K. Mes, W.B. Powell, and P.I. Frazier (2010). Hierarchical Knowledge Gradient for Sequential Sampling.

[2] A. George, W.B. Powell, and S.R. Kulkarni (2008). Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management.
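A minimal sketch of such a weighted combination, assuming a per-level variance and bias have already been tracked; the simple inverse (variance + bias squared) weighting here is our illustration, while [1] and [2] derive the exact weights.

```python
def weighted_estimate(level_means, level_vars, level_biases):
    """Combine estimates across aggregation levels, giving the highest
    weight to levels with the lowest variance + squared bias."""
    raw = [1.0 / (v + b * b) for v, b in zip(level_vars, level_biases)]
    total = sum(raw)
    weights = [r / total for r in raw]
    return sum(w * m for w, m in zip(weights, level_means))

# Example: the disaggregate estimate is noisy, the aggregate one is biased.
print(weighted_estimate([10.0, 8.0, 6.0], [4.0, 1.0, 0.25], [0.0, 0.5, 2.0]))
```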
STATISTICAL AGGREGATION

Example of an aggregation structure for the Nomadic Trucker Problem:

Level | Location | Driver type | Day of week | Size of state space
0     | City     | *           | *           | 500 x 10 x 7 = 35,000
1     | Region   | *           | *           | 50 x 10 x 7 = 3,500
2     | Region   | -           | *           | 50 x 1 x 7 = 350
3     | Region   | -           | -           | 50 x 1 x 1 = 50
4     | Province | -           | -           | 10 x 1 x 1 = 10
5     | Country  | -           | -           | 1 x 1 x 1 = 1

(* = attribute included at this level, - = attribute excluded; we need this for each time unit.)

With HKG we would have 38,911 beliefs in total, and our belief about a single alternative can be expressed as a function of 6 beliefs (one for each aggregation level).
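A sketch of how a single alternative maps to one lookup key per aggregation level; the city→region and region→province lookup tables are hypothetical inputs.

```python
def aggregation_keys(city, driver_type, day, city_to_region, region_to_province):
    """Return one belief-table key per aggregation level (0 = most detailed)."""
    region = city_to_region[city]
    province = region_to_province[region]
    return [
        (city, driver_type, day),    # level 0: city, driver type, day of week
        (region, driver_type, day),  # level 1: city aggregated to region
        (region, day),               # level 2: driver type dropped
        (region,),                   # level 3: day of week dropped
        (province,),                 # level 4: region aggregated to province
        (),                          # level 5: country (a single belief)
    ]
```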
ILLUSTRATION OF HKG

The knowledge gradient policy prefers to measure alternatives with a high mean and/or low precision:
- Equal means → measure the alternative with the lowest precision
- Equal precisions → measure the alternative with the highest mean

Demo HKG…
COMBINING OPTIMAL LEARNING AND ADP DECISIONS

Illustration of learning in ADP:
- State $S_t = (R_t, D_t)$, where $R_t$ resembles a location, $R_t \in \{A, B, C, D\}$, and $D_t$ the available loads going out from $R_t$
- Decision $x_t$ is a location to move to, $x_t \in \{A, B, C, D\}$
- Exogenous information $W_t$ are the new loads $D_t$

[Figure: locations A, B, C, D (vertical axis) versus time t-1, t, t+1, t+2 (horizontal axis), showing the decision $x_{t-1}^n$, the pre-decision state $S_{t-1}^n$, and the post-decision state $S_{t-1}^{x,n}$ within iteration $n$.]

The following steps trace one iteration $n$:
- We were in the post-decision state $S_{t-1}^{x,n}$, where we decided to move to location C. After observing the new loads $W_t^n$, we are in the pre-decision state $S_t^n$.
- We solve
$$\hat{v}_t^n = \max_{x_t \in \mathcal{X}_t}\Bigl(C_t(S_t^n, x_t) + \gamma\,\bar{V}_t^{n-1}\bigl(S^{M,x}(S_t^n, x_t)\bigr)\Bigr)$$
and update
$$\bar{V}_{t-1}^n\bigl(S_{t-1}^{x,n}\bigr) = (1 - \alpha_{n-1})\,\bar{V}_{t-1}^{n-1}\bigl(S_{t-1}^{x,n}\bigr) + \alpha_{n-1}\,\hat{v}_t^n.$$
- So $x_t^n$ does not necessarily influence the value $\bar{V}_{t-1}^n\bigl(S_{t-1}^{x,n}\bigr)$; however, it determines the post-decision state $S_t^{x,n}$ whose value $\bar{V}_t^n\bigl(S_t^{x,n}\bigr)$ we update next.
- Using Optimal Learning, we estimate the knowledge gain: $K_{t+1}^n = K^M\bigl(K_t^n, x_t^n, W_{t+1}^n\bigr)$, where the knowledge state $K_t^n$ is given by $\bar{V}^{n-1}$.
- We decide to move to location B, resulting in the post-decision state $S_t^{x,n}$.
- After observing the new loads $W_{t+1}^n$, we are in the pre-decision state $S_{t+1}^n$.
- We solve
$$\hat{v}_{t+1}^n = \max_{x_{t+1} \in \mathcal{X}_{t+1}}\Bigl(C_{t+1}(S_{t+1}^n, x_{t+1}) + \gamma\,\bar{V}_{t+1}^{n-1}\bigl(S^{M,x}(S_{t+1}^n, x_{t+1})\bigr)\Bigr)$$
and update
$$\bar{V}_t^n\bigl(S_t^{x,n}\bigr) = (1 - \alpha_{n-1})\,\bar{V}_t^{n-1}\bigl(S_t^{x,n}\bigr) + \alpha_{n-1}\,\hat{v}_{t+1}^n.$$
- Again we have to make a sampling decision.
- Again, we estimate the knowledge gain: $K_{t+2}^n = K^M\bigl(K_{t+1}^n, x_{t+1}^n, W_{t+2}^n\bigr)$, where $K_{t+1}^n$ is given by $\bar{V}^{n-1}$.
- We decide to move to location B, resulting in the post-decision state $S_{t+1}^{x,n}$.
CHALLENGES WITH OPTIMAL LEARNING IN ADP

The impact on the next iteration is hard to compute → so we assume a similar resource and demand state in the next iteration and evaluate the impact of an updated knowledge state.

Bias:
- Decisions have an impact on the value of states in the downstream path (we learn what we measure)
- Decisions have an impact on the value of states in the upstream path (with on-policy control)
- The decision to measure a state will change its value, which in turn might influence our decisions in the next iteration: simply measuring states more often might increase their estimated values, which in turn makes them more attractive next time
SKETCH OF OUR SOLUTION APPROACH

To cope with the bias, we propose using so-called projected value functions.

Assumption: an exponential increase (decrease if we started with optimistic estimates) in the estimated values as a function of the number of iterations. Value iteration is known to converge geometrically, see [1].

Let $G_t\bigl(S_t^x\bigr) = \lim_{n \to \infty} \bar{V}_t^n\bigl(S_t^x\bigr)$ … and hopefully $G_t\bigl(S_t^x\bigr) = V_t\bigl(S_t^x\bigr)$. For $n > n_0$, we fit

$$\bar{V}_t^n\bigl(S_t^x\bigr) \approx B_t\bigl(S_t^x\bigr) + \Bigl(G_t\bigl(S_t^x\bigr) - B_t\bigl(S_t^x\bigr)\Bigr)\Bigl(1 - \exp\bigl(-n\,Z_t(S_t^x)\bigr)\Bigr),$$

where $B_t\bigl(S_t^x\bigr)$ is the weighted estimate output after the first $n_0$ iterations, $G_t\bigl(S_t^x\bigr)$ the limiting value, and $Z_t\bigl(S_t^x\bigr)$ the rate.

[1] M.L. Puterman (1994). Markov Decision Processes. New York: John Wiley & Sons.
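A sketch of fitting this curve to the observed value estimates of one post-decision state, assuming observations for iterations $n > n_0$ and using scipy's curve_fit; the parameter names B, G, Z follow the formula above, and the synthetic data is purely illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def projected_value(n, B, G, Z):
    """Exponential model: V_n = B + (G - B) * (1 - exp(-n * Z))."""
    return B + (G - B) * (1.0 - np.exp(-n * Z))

# Hypothetical observed value estimates for one post-decision state, n = 1..200.
n = np.arange(1, 201)
observed = projected_value(n, B=0.2, G=1.0, Z=0.03) + np.random.normal(0, 0.02, n.size)

# Fit B, G, Z; the fitted G is the projected (limiting) value of the state.
(B, G, Z), _ = curve_fit(projected_value, n, observed,
                         p0=(observed[0], observed[-1], 0.1))
print(f"projected limiting value G = {G:.3f}, rate Z = {Z:.4f}")
```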
SKETCH OF OUR SOLUTION APPROACH

Illustration of projected value functions:

[Figure: value (0 to 1.4) versus iteration (0 to 200), showing the observed values, the projected values, and the final fitted function.]
NEW ADP ALGORITHM

(The outline is the ADP algorithm above, with Steps 2b and 2c modified.)

Step 2b: Update the value function estimates at all levels of aggregation. Update the weights and compute the weighted value function estimates, possibly for many states at once.

Step 2c: Combine the updated value function estimates with the prior distributions on the projected value functions to obtain posterior distributions, see [1] for details. The new state follows from running HKG using our beliefs on the projected value functions as input.

So we have completely separated the updating step (Steps 2a/2b) from the exploration step (Step 2c).

[1] P.I. Frazier, W.B. Powell, and H.P. Simão (2009). Simulation Model Calibration with Correlated Knowledge-Gradients.
PERFORMANCE IMPRESSION

Experiment on an instance of the Nomadic Trucker Problem:

[Figure: value (0 to 40,000) versus iteration (0 to 100,000) for pure exploitation, pure exploration, exploitation with projected values, and HKG, compared against the optimal policy and the true value.]
SHORTCOMINGS

Fitting:
- It is not always possible to find a nice fit
- For example, if the observed values increase slightly faster in the beginning and slower after that (compared to the fitted exponential), we still have the bias where sampled states look more attractive than others; after a sufficient number of measurements this will be corrected

Computation time:
- We have to spend quite some computation time to make the sampling decision; we could have used this time just to sample the states instead of thinking about it
- Application area: a large state space (pure exploration doesn't make sense) but a small action space
CONCLUSIONS

- We illustrated the challenges of ADP using the Nomadic Trucker example
- We illustrated how optimal learning can be helpful here
- We illustrated the difficulty of learning in ADP due to bias: our estimated values are influenced by the measurement policy, which in turn is influenced by our estimated values
- To cope with this bias we introduced the notion of projected value functions
- This enables us to use the HKG policy, to cope with the exploration vs. exploitation dilemma, and to allow generalization across states
- We briefly illustrated the potential of this approach, but also mentioned several shortcomings
QUESTIONS?

Martijn Mes
Assistant Professor
University of Twente
School of Management and Governance
Operational Methods for Production and Logistics
The Netherlands

Contact
Phone: +31-534894062
Email: [email protected]
Web: http://mb.utwente.nl/ompl/staff/Mes/