Learning in Approximate Dynamic Programming for Managing a Multi-Attribute Driver

Martijn Mes
Department of Operational Methods for Production and Logistics
University of Twente, The Netherlands

INFORMS Annual Meeting, Austin, Sunday, November 7, 2010
OUTLINE

1. Illustration: a transportation application
2. Stylized illustration: the Nomadic Trucker Problem
3. Approximate Dynamic Programming (ADP)
4. Challenges with ADP
5. Optimal Learning
6. Optimal Learning in ADP
7. Challenges with Optimal Learning in ADP
8. Sketch of our solution concept
TRANSPORTATION APPLICATION

Heisterkamp Trailer trucking: providing trucks and drivers.

Planning department:
- Accept orders
- Assign orders to trucks
- Assign drivers to trucks

Types of orders:
- Direct order: move a trailer from A to B; the client pays based on the distance between A and B, but the trailer might pass through hubs to change the truck and/or driver
- Customer guidance order: rent a truck and driver to a client for some time period
REAL APPLICATION
Heisterkamp
CHARACTERISTICS

- The drivers are bound by EU drivers' hours regulations.
- However, given a sufficient supply of orders and drivers, trucks can in principle be utilized 24/7 by switching drivers.
- Even though we can replace a driver (to increase truck utilization), we might still face costs for the old driver.
- Objective: increase profits by 'clever' order acceptance and by minimizing the costs for drivers, trucks, and moving empty (i.e., without a trailer).
- We solve a dynamic assignment problem, given the state of all trucks and the (probabilistically) known orders, at specific time instances over a fixed horizon.
- This problem is known as a Dynamic Fleet Management Problem (DFMP). For illustrative purposes we now focus on the single-vehicle version of the DFMP.
THE NOMADIC TRUCKER PROBLEM

- A single trucker moves from city to city, either with a load or empty.
- There are rewards when moving loads; otherwise there are costs involved.
- A vector of attributes $a$ describes a single resource, with $\mathcal{A}$ the set of possible attribute vectors:

$$a = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \\ a_6 \\ a_7 \\ a_8 \\ a_9 \\ a_{10} \end{pmatrix} = \begin{pmatrix} \text{location of the truck} \\ \text{arrival time at next location} \\ \text{maintenance status} \\ \text{driver type} \\ \text{hours driving in current trip} \\ \text{hours driving today} \\ \text{hours driving this week} \\ \text{hours away from home} \\ \text{home domicile} \\ \text{day of the week} \end{pmatrix}$$

The first attributes describe the truck, the remaining ones the driver; most of them (in particular the hours- and time-related ones) are dynamic.
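As a concrete illustration, such an attribute vector could be represented as follows; this is a sketch, and the field names are our assumptions, not taken from the talk.

```python
from dataclasses import dataclass

# Illustrative representation of the attribute vector a.
@dataclass(frozen=True)
class Attributes:
    location: int            # truck: current location (city index)
    arrival_time: int        # truck: arrival time at next location
    maintenance_status: int  # truck: maintenance status
    driver_type: int         # driver: type
    hours_trip: float        # driver: hours driving in current trip (dynamic)
    hours_today: float       # driver: hours driving today (dynamic)
    hours_week: float        # driver: hours driving this week (dynamic)
    hours_from_home: float   # driver: hours away from home (dynamic)
    home_domicile: int       # driver: home domicile
    day_of_week: int         # 0 = Monday, ..., 6 = Sunday (dynamic)
```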
MODELING THE DYNAMICS

State $S_t = (R_t, D_t)$, where:
- $R_t = (R_{ta})_{a \in \mathcal{A}}$, with $R_{ta} = 1$ when the truck has attribute vector $a$ (in the DFMP, $R_{ta}$ gives the number of resources at time $t$ with attribute $a$)
- $D_t = (D_{tl})_{l \in \mathcal{L}}$, with $D_{tl}$ the number of loads of type $l$

Decision $x_t$: make a loaded move, wait at the current location, or move empty to another location; $x_t$ follows from a decision function $X_t^\pi(S_t)$, where $\pi \in \Pi$, with $\Pi$ a family of policies.

Exogenous information $W_{t+1}$: information arriving between $t$ and $t+1$, such as new loads, wear of the truck, occurrence of breakdowns, etc.

Choosing decision $x_t$ in the current state $S_t$, together with the exogenous information $W_{t+1}$, results in a transition

$$S_{t+1} = S^M(S_t, x_t, W_{t+1})$$

with contribution (payment or costs) $C_t(S_t, x_t)$.
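A minimal sketch of this transition logic for a toy single-truck model; the state representation and helper names are our assumptions, not from the talk.

```python
# Toy single-truck transition S_{t+1} = S^M(S_t, x_t, W_{t+1}).
# State: (location, loads), where loads is the number of loads available
# at the truck's location; the reward structure is left out.

def post_decision(location, destination):
    """Deterministic post-decision part S^{M,x}: after committing to the
    decision, the truck will be at `destination`."""
    return destination

def transition(location, decision, new_loads):
    """Full transition: apply the decision, then merge the exogenous
    information W_{t+1} (loads that become available at the new location)."""
    new_location = post_decision(location, decision)
    return new_location, new_loads.get(new_location, 0)

# Example: truck in city 'A' decides to move to 'C'; two loads appear at 'C'.
print(transition('A', 'C', {'B': 1, 'C': 2}))  # ('C', 2)
```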
OBJECTIVE

The objective is to find the policy $\pi$ that maximizes the expected sum of discounted contributions over all time periods:

$$\sup_{\pi \in \Pi} \mathbb{E}\left[\left.\sum_{t=0}^{T} \gamma^t\, C_t\bigl(S_t, X_t^\pi(S_t)\bigr)\,\right|\,S_0\right]$$
SOLVING THE PROBLEM

Optimality equation (expectation form of Bellman's equation):

$$V_t(S_t) = \max_{x_t \in \mathcal{X}_t}\Bigl(C_t(S_t, x_t) + \gamma\,\mathbb{E}\bigl[V_{t+1}(S_{t+1}) \mid S_t\bigr]\Bigr)$$

Enumerating by backward induction? Suppose $a = (\text{location}, \text{arrival time}, \text{domicile})$ and we discretize to 500 locations and 50 possible arrival times → $|\mathcal{A}| = 500 \times 50 \times 500 = 12{,}500{,}000$.

In the backward loop we not only have to visit all states, but we also have to evaluate all actions, and, to compute the expectation, we probably also have to evaluate all possible outcomes.

Backward dynamic programming might become intractable → Approximate Dynamic Programming.
APPROXIMATE DYNAMIC PROGRAMMING

We replace the original optimality equation

$$V_t(S_t) = \max_{x_t \in \mathcal{X}_t}\Bigl(C_t(S_t, x_t) + \gamma\,\mathbb{E}\bigl[V_{t+1}(S_{t+1}) \mid S_t\bigr]\Bigr), \qquad S_{t+1} = S^M(S_t, x_t, W_{t+1})$$

with the following:

$$\hat{v}_t^n = \max_{x_t \in \mathcal{X}_t^n}\Bigl(C_t(S_t^n, x_t) + \gamma\,\bar{V}_t^{n-1}\bigl(S^{M,x}(S_t^n, x_t)\bigr)\Bigr),$$

where $x_t^n$ is our decision and $S_{t+1}^n = S^M(S_t^n, x_t^n, W_{t+1}^n)$.

This approximation rests on four ingredients:
1. Using a value function approximation $\bar{V}_t^{n-1}$; this allows us to step forward in time.
2. Using the post-decision state variable: the process becomes $(S_t, x_t, S_t^x, W_{t+1}, S_{t+1}, x_{t+1}, S_{t+1}^x, W_{t+2}, \ldots)$, where the post-decision state $S_t^x = S^{M,x}(S_t, x_t)$ is a deterministic function of the state and decision.
3. Generating sample paths $\omega^n = (W_1^n, W_2^n, \ldots, W_T^n)$.
4. Learning through iterations $n$.
OUTLINE OF THE ADP ALGORITHM

Step 0. Initialize $\bar{V}_t^0$ for all $t$ and $S_0^1$; set $n = 1$.

Step 1. Choose a sample path $\omega^n$.

Step 2. For $t = 0, 1, \ldots, T$ do:

Step 2a. Solve (deterministic optimization)
$$\hat{v}_t^n = \max_{x_t \in \mathcal{X}_t^n}\Bigl(C_t(S_t^n, x_t) + \gamma\,\bar{V}_t^{n-1}\bigl(S^{M,x}(S_t^n, x_t)\bigr)\Bigr),$$
let $x_t^n$ be the best decision and let $S_t^{x,n} = S^{M,x}(S_t^n, x_t^n)$.

Step 2b. Update the value function (statistics):
$$\bar{V}_{t-1}^n\bigl(S_{t-1}^{x,n}\bigr) = (1 - \alpha_{n-1})\,\bar{V}_{t-1}^{n-1}\bigl(S_{t-1}^{x,n}\bigr) + \alpha_{n-1}\,\hat{v}_t^n.$$

Step 2c. Compute the new pre-decision state (simulation): $S_{t+1}^n = S^M(S_t^n, x_t^n, W_{t+1}^n)$.

Step 3. Increment $n$. If $n \le N$, go to Step 1.

Step 4. Return the value functions $\bigl(\bar{V}_t^N\bigr)_{t=1}^{T}$.
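A minimal Python sketch of this forward-pass loop, assuming a generic model interface and a lookup-table approximation keyed on post-decision states. The method names (initial_state, feasible_decisions, contribution, post_decision, sample_exogenous, next_state) and the stepsize choice are our assumptions, not from the talk.

```python
from collections import defaultdict

def adp(model, T, N, gamma=1.0):
    """Forward ADP with a lookup-table value function approximation
    keyed on (hashable) post-decision states."""
    V = defaultdict(float)  # \bar{V}(S^x), initialized to zero

    for n in range(1, N + 1):          # Step 1: one sample path per iteration
        alpha = 1.0 / n                # stepsize alpha_{n-1}; a common simple choice
        state = model.initial_state()
        prev_post = None               # S_{t-1}^{x,n}
        for t in range(T + 1):         # Step 2
            # Step 2a: deterministic optimization over feasible decisions.
            def value(x):
                return (model.contribution(t, state, x)
                        + gamma * V[model.post_decision(t, state, x)])
            best_x = max(model.feasible_decisions(t, state), key=value)
            v_hat = value(best_x)
            post = model.post_decision(t, state, best_x)  # S_t^{x,n}
            # Step 2b: smooth v_hat into the previous post-decision state's value.
            if prev_post is not None:
                V[prev_post] = (1 - alpha) * V[prev_post] + alpha * v_hat
            # Step 2c: simulate exogenous information W_{t+1}
            # to obtain the next pre-decision state.
            state = model.next_state(post, model.sample_exogenous(t))
            prev_post = post
    return V  # the value function approximations after N iterations
```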
CHALLENGES WITH ADP

Exploration vs. exploitation:
- Exploitation: we do what we currently think is best
- Exploration: we choose to try something and learn more (information collection)

To avoid getting stuck in a local optimum, we have to explore. But what do we want to explore, and for how long? Do we need to explore the whole state space?

Do we update the value functions using the results of the exploration steps, or do we want to perform off-policy control?

Techniques from Optimal Learning might help here.
OPTIMAL LEARNING

To cope with the exploration vs. exploitation dilemma:

Undirected exploration: try to randomly explore the whole state space.
- Examples: pure exploration and epsilon-greedy (explore with probability $\varepsilon_n$ and exploit with probability $1 - \varepsilon_n$)

Directed exploration: utilize past experience to explore efficiently (costs are gradually avoided by making more expensive actions less likely). Examples of directed exploration (a sketch of the first two follows this list):
- Boltzmann exploration: choose $x$ with probability $p_x^n = \dfrac{\exp\bigl(\bar{\theta}_x^n\bigr)}{\sum_{x'} \exp\bigl(\bar{\theta}_{x'}^n\bigr)}$
- Interval estimation: choose the $x$ that maximizes $\bar{\theta}_x^n + z_\alpha\,\bar{\sigma}_x^n$
- The knowledge gradient policy (see the next slides)
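Hedged sketches of these selection rules, assuming a list of belief means (and, for interval estimation, standard deviations); all names and defaults are illustrative.

```python
import math
import random

def epsilon_greedy(means, eps):
    """Explore with probability eps, otherwise exploit the current best."""
    if random.random() < eps:
        return random.randrange(len(means))  # undirected exploration
    return max(range(len(means)), key=lambda x: means[x])

def boltzmann(means):
    """Choose x with probability proportional to exp(mean_x)."""
    weights = [math.exp(m) for m in means]
    return random.choices(range(len(means)), weights=weights)[0]

def interval_estimation(means, sigmas, z=1.96):
    """Choose the x maximizing mean_x + z * sigma_x (an optimism bonus)."""
    return max(range(len(means)), key=lambda x: means[x] + z * sigmas[x])
```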
THE KNOWLEDGE GRADIENT POLICY [1/2]

Basic principle:
- Assume you can make only one measurement, after which you have to make a final choice (the implementation decision)
- What choice would you make now to maximize the expected value of the implementation decision?

[Figure: belief estimates for five options; an observation of option 5 updates its estimated value, and only a change large enough to alter which option is best produces a change in the decision.]
THE KNOWLEDGE GRADIENT POLICY [2/2]

- The knowledge gradient $\nu_x^{KG}$ is the expected marginal value of a single measurement $x$
- The knowledge gradient policy is given by $x = \arg\max_{x \in \mathcal{X}} \nu_x^{KG}$
- There are many problems where making one measurement tells us something about what we might observe from other measurements (e.g., in our transportation application, nearby locations have similar properties)
- Correlations are particularly important when the number of possible measurements is extremely large relative to the measurement budget (or when the function being measured is continuous)
- There are various extensions of the knowledge gradient policy that take similarities between alternatives into account → the Hierarchical Knowledge Gradient policy
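For independent normal beliefs, the knowledge gradient has a well-known closed form from the literature, $\nu_x^{KG} = \tilde{\sigma}_x f(\zeta_x)$ with $f(\zeta) = \zeta\Phi(\zeta) + \phi(\zeta)$; the sketch below implements that formula (the variable names are ours, and correlated or hierarchical beliefs need the extensions cited above).

```python
import math

def kg_independent_normal(means, variances, noise_var):
    """Knowledge gradient nu_x = sigma_tilde_x * f(zeta_x),
    f(z) = z * Phi(z) + phi(z), for independent normal beliefs."""
    phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
    kg = []
    for x, (mu, var) in enumerate(zip(means, variances)):
        # Standard deviation of the change in the posterior mean of x
        # caused by one noisy measurement of x.
        sigma_tilde = var / math.sqrt(var + noise_var)
        best_other = max(m for i, m in enumerate(means) if i != x)
        zeta = -abs(mu - best_other) / sigma_tilde
        kg.append(sigma_tilde * (zeta * Phi(zeta) + phi(zeta)))
    return kg  # the KG policy measures argmax over these values
```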
HIERARCHICAL KNOWLEDGE GRADIENT (HKG)

Idea: instead of having a belief on the true value $\theta_x$ of each alternative $x$ (a Bayesian prior with mean $\bar{\theta}_x^n$ and precision $\beta_x^n$), we have a belief on the value of each alternative at various levels of aggregation $g \in G$ (with mean $\bar{\theta}_x^{g,n}$ and precision $\beta_x^{g,n}$).

Using aggregation, we express $\bar{\theta}_x^n$ (our estimate of $\theta_x$) as a weighted combination

$$\bar{\theta}_x^n = \sum_{g \in G} w_x^{g,n}\,\bar{\theta}_x^{g,n}$$

Intuition: the highest weight goes to the aggregation levels with the lowest sum of variance and bias; see [1] and [2] for details.

[1] M.R.K. Mes, W.B. Powell, and P.I. Frazier (2010). Hierarchical Knowledge Gradient for Sequential Sampling.

[2] A. George, W.B. Powell, and S.R. Kulkarni (2008). Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management.
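A minimal sketch of such a weighted combination, assuming a per-level variance and bias have already been tracked; the simple inverse (variance + bias squared) weighting here is our illustration, while [1] and [2] derive the exact weights.

```python
def weighted_estimate(level_means, level_vars, level_biases):
    """Combine estimates across aggregation levels, giving the highest
    weight to levels with the lowest variance + squared bias."""
    raw = [1.0 / (v + b * b) for v, b in zip(level_vars, level_biases)]
    total = sum(raw)
    weights = [r / total for r in raw]
    return sum(w * m for w, m in zip(weights, level_means))

# Example: the disaggregate estimate is noisy, the aggregate one is biased.
print(weighted_estimate([10.0, 8.0, 6.0], [4.0, 1.0, 0.25], [0.0, 0.5, 2.0]))
```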
STATISTICAL AGGREGATION

Example of an aggregation structure for the Nomadic Trucker Problem:

Level | Location | Driver type | Day of week | Size of state space
0     | City     | *           | *           | 500 x 10 x 7 = 35,000
1     | Region   | *           | *           | 50 x 10 x 7 = 3,500
2     | Region   | -           | *           | 50 x 1 x 7 = 350
3     | Region   | -           | -           | 50 x 1 x 1 = 50
4     | Province | -           | -           | 10 x 1 x 1 = 10
5     | Country  | -           | -           | 1 x 1 x 1 = 1

(* = attribute included at this level, - = attribute excluded; we need this for each time unit.)

With HKG we would have 38,911 beliefs in total, and our belief about a single alternative can be expressed as a function of 6 beliefs (one for each aggregation level).
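A sketch of how a single alternative maps to one lookup key per aggregation level; the city→region and region→province lookup tables are hypothetical inputs.

```python
def aggregation_keys(city, driver_type, day, city_to_region, region_to_province):
    """Return one belief-table key per aggregation level (0 = most detailed)."""
    region = city_to_region[city]
    province = region_to_province[region]
    return [
        (city, driver_type, day),    # level 0: city, driver type, day of week
        (region, driver_type, day),  # level 1: city aggregated to region
        (region, day),               # level 2: driver type dropped
        (region,),                   # level 3: day of week dropped
        (province,),                 # level 4: region aggregated to province
        (),                          # level 5: country (a single belief)
    ]
```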
ILLUSTRATION OF HKG

The knowledge gradient policy prefers to measure alternatives with a high mean and/or low precision:
- Equal means → measure the alternative with the lowest precision
- Equal precisions → measure the alternative with the highest mean

Demo HKG…
COMBINING OPTIMAL LEARNING AND ADP DECISIONS

Illustration of learning in ADP:
- State $S_t = (R_t, D_t)$, where $R_t$ resembles a location, $R_t \in \{A, B, C, D\}$, and $D_t$ the available loads going out from $R_t$
- Decision $x_t$ is a location to move to, $x_t \in \{A, B, C, D\}$
- Exogenous information $W_t$ are the new loads $D_t$

[Figure: locations A, B, C, D (vertical axis) versus time t-1, t, t+1, t+2 (horizontal axis), showing the decision $x_{t-1}^n$, the pre-decision state $S_{t-1}^n$, and the post-decision state $S_{t-1}^{x,n}$ within iteration $n$.]

The following steps trace one iteration $n$:
- We were in the post-decision state $S_{t-1}^{x,n}$, where we decided to move to location C. After observing the new loads $W_t^n$, we are in the pre-decision state $S_t^n$.
- We solve
$$\hat{v}_t^n = \max_{x_t \in \mathcal{X}_t}\Bigl(C_t(S_t^n, x_t) + \gamma\,\bar{V}_t^{n-1}\bigl(S^{M,x}(S_t^n, x_t)\bigr)\Bigr)$$
and update
$$\bar{V}_{t-1}^n\bigl(S_{t-1}^{x,n}\bigr) = (1 - \alpha_{n-1})\,\bar{V}_{t-1}^{n-1}\bigl(S_{t-1}^{x,n}\bigr) + \alpha_{n-1}\,\hat{v}_t^n.$$
- So $x_t^n$ does not necessarily influence the value $\bar{V}_{t-1}^n\bigl(S_{t-1}^{x,n}\bigr)$; however, it determines the post-decision state $S_t^{x,n}$ whose value $\bar{V}_t^n\bigl(S_t^{x,n}\bigr)$ we update next.
- Using Optimal Learning, we estimate the knowledge gain: $K_{t+1}^n = K^M\bigl(K_t^n, x_t^n, W_{t+1}^n\bigr)$, where the knowledge state $K_t^n$ is given by $\bar{V}^{n-1}$.
- We decide to move to location B, resulting in the post-decision state $S_t^{x,n}$.
- After observing the new loads $W_{t+1}^n$, we are in the pre-decision state $S_{t+1}^n$.
- We solve
$$\hat{v}_{t+1}^n = \max_{x_{t+1} \in \mathcal{X}_{t+1}}\Bigl(C_{t+1}(S_{t+1}^n, x_{t+1}) + \gamma\,\bar{V}_{t+1}^{n-1}\bigl(S^{M,x}(S_{t+1}^n, x_{t+1})\bigr)\Bigr)$$
and update
$$\bar{V}_t^n\bigl(S_t^{x,n}\bigr) = (1 - \alpha_{n-1})\,\bar{V}_t^{n-1}\bigl(S_t^{x,n}\bigr) + \alpha_{n-1}\,\hat{v}_{t+1}^n.$$
- Again we have to make a sampling decision.
- Again, we estimate the knowledge gain: $K_{t+2}^n = K^M\bigl(K_{t+1}^n, x_{t+1}^n, W_{t+2}^n\bigr)$, where $K_{t+1}^n$ is given by $\bar{V}^{n-1}$.
- We decide to move to location B, resulting in the post-decision state $S_{t+1}^{x,n}$.
CHALLENGES WITH OPTIMAL LEARNING IN ADP

The impact on the next iteration is hard to compute → so we assume a similar resource and demand state in the next iteration and evaluate the impact of an updated knowledge state.

Bias:
- Decisions have an impact on the value of states in the downstream path (we learn what we measure)
- Decisions have an impact on the value of states in the upstream path (with on-policy control)
- The decision to measure a state will change its value, which in turn might influence our decisions in the next iteration: simply measuring states more often might increase their estimated values, which in turn makes them more attractive next time
SKETCH OF OUR SOLUTION APPROACH

To cope with the bias, we propose using so-called projected value functions.

Assumption: an exponential increase (decrease if we started with optimistic estimates) in the estimated values as a function of the number of iterations. Value iteration is known to converge geometrically, see [1].

Let $G_t\bigl(S_t^x\bigr) = \lim_{n \to \infty} \bar{V}_t^n\bigl(S_t^x\bigr)$ … and hopefully $G_t\bigl(S_t^x\bigr) = V_t\bigl(S_t^x\bigr)$. For $n > n_0$, we fit

$$\bar{V}_t^n\bigl(S_t^x\bigr) \approx B_t\bigl(S_t^x\bigr) + \Bigl(G_t\bigl(S_t^x\bigr) - B_t\bigl(S_t^x\bigr)\Bigr)\Bigl(1 - \exp\bigl(-n\,Z_t(S_t^x)\bigr)\Bigr),$$

where $B_t\bigl(S_t^x\bigr)$ is the weighted estimate output after the first $n_0$ iterations, $G_t\bigl(S_t^x\bigr)$ the limiting value, and $Z_t\bigl(S_t^x\bigr)$ the rate.

[1] M.L. Puterman (1994). Markov Decision Processes. New York: John Wiley & Sons.
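A sketch of fitting this curve to the observed value estimates of one post-decision state, assuming observations for iterations $n > n_0$ and using scipy's curve_fit; the parameter names B, G, Z follow the formula above, and the synthetic data is purely illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def projected_value(n, B, G, Z):
    """Exponential model: V_n = B + (G - B) * (1 - exp(-n * Z))."""
    return B + (G - B) * (1.0 - np.exp(-n * Z))

# Hypothetical observed value estimates for one post-decision state, n = 1..200.
n = np.arange(1, 201)
observed = projected_value(n, B=0.2, G=1.0, Z=0.03) + np.random.normal(0, 0.02, n.size)

# Fit B, G, Z; the fitted G is the projected (limiting) value of the state.
(B, G, Z), _ = curve_fit(projected_value, n, observed,
                         p0=(observed[0], observed[-1], 0.1))
print(f"projected limiting value G = {G:.3f}, rate Z = {Z:.4f}")
```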
SKETCH OF OUR SOLUTION APPROACH

Illustration of projected value functions:

[Figure: value (0 to 1.4) versus iteration (0 to 200), showing the observed values, the projected values, and the final fitted function.]
NEW ADP ALGORITHM

(The outline is the ADP algorithm above, with Steps 2b and 2c modified.)

Step 2b: Update the value function estimates at all levels of aggregation. Update the weights and compute the weighted value function estimates, possibly for many states at once.

Step 2c: Combine the updated value function estimates with the prior distributions on the projected value functions to obtain posterior distributions, see [1] for details. The new state follows from running HKG using our beliefs on the projected value functions as input.

So we have completely separated the updating step (Steps 2a/2b) from the exploration step (Step 2c).

[1] P.I. Frazier, W.B. Powell, and H.P. Simão (2009). Simulation Model Calibration with Correlated Knowledge-Gradients.
PERFORMANCE IMPRESSION

Experiment on an instance of the Nomadic Trucker Problem:

[Figure: value (0 to 40,000) versus iteration (0 to 100,000) for pure exploitation, pure exploration, exploitation with projected values, and HKG, compared against the optimal policy and the true value.]
SHORTCOMINGS

Fitting:
- It is not always possible to find a nice fit
- For example, if the observed values increase slightly faster in the beginning and slower after that (compared to the fitted exponential), we still have the bias where sampled states look more attractive than others; after a sufficient number of measurements this will be corrected

Computation time:
- We have to spend quite some computation time to make the sampling decision; we could have used this time just to sample the states instead of thinking about it
- Application area: a large state space (pure exploration doesn't make sense) but a small action space
CONCLUSIONS

- We illustrated the challenges of ADP using the Nomadic Trucker example
- We illustrated how optimal learning can be helpful here
- We illustrated the difficulty of learning in ADP due to bias: our estimated values are influenced by the measurement policy, which in turn is influenced by our estimated values
- To cope with this bias we introduced the notion of projected value functions
- This enables us to use the HKG policy, to cope with the exploration vs. exploitation dilemma, and to allow generalization across states
- We briefly illustrated the potential of this approach, but also mentioned several shortcomings
QUESTIONS?

Martijn Mes
Assistant Professor
University of Twente
School of Management and Governance
Operational Methods for Production and Logistics
The Netherlands

Contact
Phone: +31-534894062
Email: [email protected]
Web: http://mb.utwente.nl/ompl/staff/Mes/