
A Reinforcement Learning Approach for Product Delivery by Multiple Vehicles

Scott Proper

Oregon State University

Prasad Tadepalli, Hong Tang, Rasaratnam Logendran

Vehicle Routing & Product Delivery

Contributions of our Research

Multiple vehicle product delivery is a well-studied problem in operations research

We have formulated this problem as an average reward reinforcement learning (RL) problem

We have combined inventory control with vehicle routing

We have scaled RL methods to work with large state spaces

Markov Decision Processes

Actions are stochastic: P_{i,j}(a)

Actions have costs or rewards: r_i(a)

[Figure: example MDP showing a truck's Move and Unload actions]
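As a concrete illustration of these two ingredients, here is a minimal sketch of how such a model might be stored; the state and action names below are made up for illustration and are not from the slides.

```python
# Minimal sketch of an MDP model: transition probabilities and rewards.
# States, actions, and numbers here are illustrative only.

# P[state][action] -> {next_state: probability}
P = {
    "depot":  {"move_east": {"shop_A": 0.9, "depot": 0.1},
               "wait":      {"depot": 1.0}},
    "shop_A": {"unload":    {"shop_A": 1.0},
               "wait":      {"shop_A": 1.0}},
}

# r[state][action] -> immediate reward (movement penalty, stockout penalty, etc.)
r = {
    "depot":  {"move_east": -0.1, "wait": 0.0},
    "shop_A": {"unload": 0.0, "wait": 0.0},
}

def expected_next_value(state, action, h):
    """Expected value of the next state under an action, given value table h."""
    return sum(p * h[s2] for s2, p in P[state][action].items())
```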

Average Reward Reinforcement Learning

Goal: maximize the average reward per time step
– i.e., minimize the stockout penalty plus the movement penalty

Policy: states → actions

Value function: states → real values
– the expected long-term reward from a state, relative to other states, when following the optimal policy

H-Learning

The value function satisfies the Bellman equation:
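The equation itself did not survive in this transcript; a hedged reconstruction of the standard average-reward Bellman equation used in H-learning, with ρ denoting the average reward per time step, is:

$$ h(i) \;=\; \max_{a}\Big[\, r_i(a) \;-\; \rho \;+\; \sum_{j} P_{i,j}(a)\, h(j) \,\Big] $$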

The optimal action a* maximizes the immediate reward + expected value of the next state

H-Learning is a real-time algorithm for learning this value function

H-Learning: an example 1

[Figure: five-state example (A–E); edges labeled with reward and estimated transition frequency (-0.1, 1/1; 0, 9/9; 0, 0/9). Value table: A = 0, B = 0, C = 0, D = 0, E = 0]

H-Learning: an example 2

Stockout penalty: -20

[Figure: after observing a stockout, the edge estimates become (0, 9/10) and (-20, 1/10). Value table: A = -0.1, B = 0, C = 0, D = 0, E = 0]

H-Learning: an example 3

[Figure: same example one step later; edge estimates (-0.1, 1/1), (0, 9/10), (-20, 1/10). Value table: A = -0.1, B = 0, C = 0, D = 0, E = 0]

H-Learning: an example 4

Move penalty: -0.1

[Figure: after a second move the edge estimate becomes (-0.1, 2/2); other estimates unchanged. Value table: A = -0.1, B = 0, C = 0, D = 0, E = 0]
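A minimal, hedged sketch of the kind of update these examples walk through, assuming the learned model dictionaries `P` and `r` and the helper `expected_next_value` from the MDP sketch above; the exact step-size schedule and ρ update rule of the original H-learning algorithm are simplified here.

```python
# Hedged sketch of an H-learning style update (model-based, average-reward RL).

def greedy_backup(state, h, rho, actions):
    """Best action under the learned model and its backup r - rho + E[h(next)]."""
    values = {a: r[state][a] - rho + expected_next_value(state, a, h)
              for a in actions[state]}
    best_a = max(values, key=values.get)
    return best_a, values[best_a]

def h_learning_step(state, next_state, reward, took_greedy, h, rho,
                    actions, alpha=0.01):
    """One real-time update after observing (state, reward, next_state)."""
    if took_greedy:
        # Average-reward estimate moves toward reward + h(next) - h(current).
        rho = (1 - alpha) * rho + alpha * (reward + h[next_state] - h[state])
    # Value of the current state is set to its Bellman backup under the model.
    _, h[state] = greedy_backup(state, h, rho, actions)
    return h, rho
```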

On-line Product Delivery

Deliver 1 product

9 truck actions:
– 4 levels of unload
– 4 move directions
– wait

P(inventory decrease | shop)

Stockout penalty: -20; movement penalty: -0.1

[Figure: grid world with 5 shops and a depot]

The problem of state-space explosion

The loads of trucks and shop inventories are discretized into 5 levels

States grow exponentially in the number of shops and trucks:
– 10 locations, 5 shops, 2 trucks: (10^2)(5^5)(5^2) = 7,812,500 states
– 5 trucks: (10^5)(5^5)(5^5) = 976,562,500,000 states

Table-based methods take too much time and space
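A quick sketch of the arithmetic behind these counts, assuming 5 inventory levels per shop and 5 load levels per truck as stated above:

```python
# State count: locations^trucks * levels^shops * levels^trucks.
def num_states(locations, shops, trucks, levels=5):
    return (locations ** trucks) * (levels ** shops) * (levels ** trucks)

print(num_states(10, 5, 2))  # 7,812,500
print(num_states(10, 5, 5))  # 976,562,500,000
```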

Piecewise Linear Function Approximation

We use a different linear function for each possible 5-tuple of locations (l_1, …, l_5) of the trucks

Each function is linear in the truck loads and shop inventories

Each function covers roughly 10 million states

This yields a nearly million-fold reduction in learnable parameters compared to a table-based representation
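A hedged sketch of this representation, assuming one weight per truck load and per shop inventory plus a bias for each tuple of truck locations; the class and method names are illustrative.

```python
import numpy as np

# Piecewise linear value function: one linear function per tuple of truck locations.
class PiecewiseLinearValue:
    def __init__(self, num_trucks, num_shops):
        self.num_features = num_trucks + num_shops
        self.weights = {}  # location tuple -> weight vector (bias last)

    def value(self, truck_locations, truck_loads, shop_inventories):
        key = tuple(truck_locations)
        w = self.weights.setdefault(key, np.zeros(self.num_features + 1))
        x = np.append(np.concatenate([truck_loads, shop_inventories]), 1.0)
        return float(w @ x)

    def update(self, truck_locations, truck_loads, shop_inventories, target, lr=0.01):
        """Move this piece's linear function toward a backed-up target value."""
        key = tuple(truck_locations)
        w = self.weights.setdefault(key, np.zeros(self.num_features + 1))
        x = np.append(np.concatenate([truck_loads, shop_inventories]), 1.0)
        w += lr * (target - w @ x) * x
```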

Piecewise Linear Function Approximation vs. Table-based

[Figure: average reward over 1000's of iterations (10 locations, 5 shops, 2 trucks, 10^6 iterations), comparing piecewise linear function approximation with the table-based method]

Storing and using the action models

Problem: naively, computing the expected value of the next state takes time exponential in the number of shops

However:
– each shop's consumption is independent
– the value function is piecewise linear

so the expectation decomposes into a sum of per-shop terms (see the sketch below)
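A hedged sketch of why those two facts help, using the notation of the linear pieces above (w_i are the learned weights, x_i the truck loads and shop inventories):

$$ \mathbb{E}\big[V(s')\big] \;=\; \mathbb{E}\Big[\,w_0 + \sum_i w_i\, x_i'\,\Big] \;=\; w_0 + \sum_i w_i\, \mathbb{E}\big[x_i'\big] $$

so only each shop's expected inventory after consumption is needed, not the joint distribution over all shops.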

Ignoring Truck Identity

m = number of locations (10); k = number of trucks (2–5)

With distinct trucks there are m^k location tuples: for 5 trucks, 10^5 = 100,000 functions and 1.1 million learnable parameters

Ignoring truck identity, only the (unordered) multiset of truck locations matters: 2002 functions and 22,022 learnable parameters
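A quick sketch of the bookkeeping behind these numbers, assuming each linear piece has one weight per truck load and per shop inventory plus a bias (11 parameters with 5 trucks and 5 shops):

```python
from math import comb

m, k, shops = 10, 5, 5               # locations, trucks, shops
params_per_function = k + shops + 1  # assumption: loads + inventories + bias

ordered = m ** k                     # distinct trucks: 100,000 functions
unordered = comb(m + k - 1, k)       # ignoring truck identity: 2,002 functions

print(ordered, ordered * params_per_function)      # 100000 1100000
print(unordered, unordered * params_per_function)  # 2002 22022
```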

The problem of action-space explosion

Every action a is a vector of individual "truck actions": a = (a_1, a_2, …, a_n)

Actions grow exponentially in the number of trucks:
– 9 "truck actions" per truck
– For 2 trucks: 9^2 = 81 joint actions
– For 5 trucks: 9^5 = 59,049 joint actions

Hill Climbing Search

We initialize the vector of truck actions a to all "wait" actions

We use hill climbing to reach a local optimum; we then randomly perturb a truck action and repeat (a sketch follows below)

This results in an order-of-magnitude improvement in search time
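A minimal, hedged sketch of this search over joint actions, assuming a function q_value(state, joint_action) that returns the backed-up value r - ρ + E[h(next)] for a joint action; the perturbation and restart details below are assumptions, not the authors' exact procedure.

```python
import random

TRUCK_ACTIONS = ["wait", "move_N", "move_S", "move_E", "move_W",
                 "unload_1", "unload_2", "unload_3", "unload_4"]

def hill_climb_joint_action(state, num_trucks, q_value, restarts=3):
    """Greedy search over joint actions (a_1, ..., a_n), changing one truck at a time."""
    current = ["wait"] * num_trucks                    # start from all-wait
    current_v = q_value(state, tuple(current))
    best, best_v = tuple(current), current_v
    for _ in range(restarts):
        improved = True
        while improved:                                # climb to a local optimum
            improved = False
            for i in range(num_trucks):
                for a in TRUCK_ACTIONS:
                    cand = current.copy()
                    cand[i] = a
                    v = q_value(state, tuple(cand))
                    if v > current_v:
                        current, current_v, improved = cand, v, True
        if current_v > best_v:                         # keep the best optimum seen
            best, best_v = tuple(current), current_v
        # randomly perturb one truck's action before climbing again
        current[random.randrange(num_trucks)] = random.choice(TRUCK_ACTIONS)
        current_v = q_value(state, tuple(current))
    return best, best_v
```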

Hill climbing vs. exhaustive search for 4 and 5 trucks

[Figure: average reward over 1000's of iterations (10 locations, 5 shops, 5 trucks, 10^6 iterations), comparing hill climbing with exhaustive search over joint actions for 5 trucks]

Conclusion

Average-reward RL and piecewise linear function approximation are promising approaches for real-time product delivery

Hill climbing shows great potential for speeding up search in domains with a large action space

Problems of scaling are surmountable

Future Work

Scaling! More trucks, more locations, more shops, more depots, and more items

Allowing trucks to move with non-uniform speeds (event-based model needed)

Real-valued shop inventory and truck load levels