
A Reinforcement Learning Approach for Product Delivery by Multiple Vehicles

Scott Proper

Oregon State University

Prasad Tadepalli, Hong Tang, Rasaratnam Logendran

Vehicle Routing & Product Delivery

Contributions of our Research

Multiple vehicle product delivery is a well-studied problem in operations research

We have formulated this problem as an average reward reinforcement learning (RL) problem

We have combined inventory control with vehicle routing

We have scaled RL methods to work with large state spaces

Markov Decision Processes

Actions are stochastic: P_{i,j}(a)

Actions have costs or rewards: r_i(a)

[Figure: example MDP showing a truck's Move and Unload actions]
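As a concrete illustration of these two ingredients, here is a minimal sketch of how such a model might be stored; the state and action names below are made up for illustration and are not from the slides.

```python
# Minimal sketch of an MDP model: transition probabilities and rewards.
# States, actions, and numbers here are illustrative only.

# P[state][action] -> {next_state: probability}
P = {
    "depot":  {"move_east": {"shop_A": 0.9, "depot": 0.1},
               "wait":      {"depot": 1.0}},
    "shop_A": {"unload":    {"shop_A": 1.0},
               "wait":      {"shop_A": 1.0}},
}

# r[state][action] -> immediate reward (movement penalty, stockout penalty, etc.)
r = {
    "depot":  {"move_east": -0.1, "wait": 0.0},
    "shop_A": {"unload": 0.0, "wait": 0.0},
}

def expected_next_value(state, action, h):
    """Expected value of the next state under an action, given value table h."""
    return sum(p * h[s2] for s2, p in P[state][action].items())
```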

Average Reward Reinforcement Learning

Goal: maximize the average reward per time step
– i.e., minimize the stockout penalty plus the movement penalty

Policy: states → actions

Value function: states → real values
– the expected long-term reward from a state, relative to other states, when following the optimal policy

H-Learning

The value function satisfies the Bellman equation:
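The equation itself did not survive in this transcript; a hedged reconstruction of the standard average-reward Bellman equation used in H-learning, with ρ denoting the average reward per time step, is:

$$ h(i) \;=\; \max_{a}\Big[\, r_i(a) \;-\; \rho \;+\; \sum_{j} P_{i,j}(a)\, h(j) \,\Big] $$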

The optimal action a* maximizes the immediate reward + expected value of the next state

H-Learning is a real-time algorithm for learning this value function

H-Learning: an example 1

[Figure: five-state example (A–E); edges labeled with reward and estimated transition frequency (-0.1, 1/1; 0, 9/9; 0, 0/9). Value table: A = 0, B = 0, C = 0, D = 0, E = 0]

H-Learning: an example 2

Stockout penalty: -20

[Figure: after observing a stockout, the edge estimates become (0, 9/10) and (-20, 1/10). Value table: A = -0.1, B = 0, C = 0, D = 0, E = 0]

H-Learning: an example 3

[Figure: same example one step later; edge estimates (-0.1, 1/1), (0, 9/10), (-20, 1/10). Value table: A = -0.1, B = 0, C = 0, D = 0, E = 0]

H-Learning: an example 4

Move penalty: -0.1

[Figure: after a second move the edge estimate becomes (-0.1, 2/2); other estimates unchanged. Value table: A = -0.1, B = 0, C = 0, D = 0, E = 0]
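A minimal, hedged sketch of the kind of update these examples walk through, assuming the learned model dictionaries `P` and `r` and the helper `expected_next_value` from the MDP sketch above; the exact step-size schedule and ρ update rule of the original H-learning algorithm are simplified here.

```python
# Hedged sketch of an H-learning style update (model-based, average-reward RL).

def greedy_backup(state, h, rho, actions):
    """Best action under the learned model and its backup r - rho + E[h(next)]."""
    values = {a: r[state][a] - rho + expected_next_value(state, a, h)
              for a in actions[state]}
    best_a = max(values, key=values.get)
    return best_a, values[best_a]

def h_learning_step(state, next_state, reward, took_greedy, h, rho,
                    actions, alpha=0.01):
    """One real-time update after observing (state, reward, next_state)."""
    if took_greedy:
        # Average-reward estimate moves toward reward + h(next) - h(current).
        rho = (1 - alpha) * rho + alpha * (reward + h[next_state] - h[state])
    # Value of the current state is set to its Bellman backup under the model.
    _, h[state] = greedy_backup(state, h, rho, actions)
    return h, rho
```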

On-line Product Delivery

Deliver 1 product

9 truck actions:
– 4 levels of unload
– 4 move directions
– wait

P(inventory decrease | shop)

Stockout penalty: -20; movement penalty: -0.1

[Figure: grid world with 5 shops and a depot]

The problem of state-space explosion

The loads of trucks and shop inventories are discretized into 5 levels

States grow exponentially in the number of shops and trucks:
– 10 locations, 5 shops, 2 trucks: (10^2)(5^5)(5^2) = 7,812,500 states
– 5 trucks: (10^5)(5^5)(5^5) = 976,562,500,000 states

Table-based methods take too much time and space
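A quick sketch of the arithmetic behind these counts, assuming 5 inventory levels per shop and 5 load levels per truck as stated above:

```python
# State count: locations^trucks * levels^shops * levels^trucks.
def num_states(locations, shops, trucks, levels=5):
    return (locations ** trucks) * (levels ** shops) * (levels ** trucks)

print(num_states(10, 5, 2))  # 7,812,500
print(num_states(10, 5, 5))  # 976,562,500,000
```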

Piecewise Linear Function Approximation

We use a different linear function for each possible 5-tuple of locations (l_1, …, l_5) of the trucks

Each function is linear in the truck loads and shop inventories

Each function covers roughly 10 million states

This yields a nearly million-fold reduction in learnable parameters compared to a table-based representation
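A hedged sketch of this representation, assuming one weight per truck load and per shop inventory plus a bias for each tuple of truck locations; the class and method names are illustrative.

```python
import numpy as np

# Piecewise linear value function: one linear function per tuple of truck locations.
class PiecewiseLinearValue:
    def __init__(self, num_trucks, num_shops):
        self.num_features = num_trucks + num_shops
        self.weights = {}  # location tuple -> weight vector (bias last)

    def value(self, truck_locations, truck_loads, shop_inventories):
        key = tuple(truck_locations)
        w = self.weights.setdefault(key, np.zeros(self.num_features + 1))
        x = np.append(np.concatenate([truck_loads, shop_inventories]), 1.0)
        return float(w @ x)

    def update(self, truck_locations, truck_loads, shop_inventories, target, lr=0.01):
        """Move this piece's linear function toward a backed-up target value."""
        key = tuple(truck_locations)
        w = self.weights.setdefault(key, np.zeros(self.num_features + 1))
        x = np.append(np.concatenate([truck_loads, shop_inventories]), 1.0)
        w += lr * (target - w @ x) * x
```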

Piecewise Linear Function Approximation vs. Table-based

[Figure: average reward over 1000's of iterations (10 locations, 5 shops, 2 trucks, 10^6 iterations), comparing piecewise linear function approximation with the table-based method]

Storing and using the action models

Problem: naively, computing the expected value of the next state takes time exponential in the number of shops

However:
– each shop's consumption is independent
– the value function is piecewise linear

so the expectation decomposes into a sum of per-shop terms (see the sketch below)
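A hedged sketch of why those two facts help, using the notation of the linear pieces above (w_i are the learned weights, x_i the truck loads and shop inventories):

$$ \mathbb{E}\big[V(s')\big] \;=\; \mathbb{E}\Big[\,w_0 + \sum_i w_i\, x_i'\,\Big] \;=\; w_0 + \sum_i w_i\, \mathbb{E}\big[x_i'\big] $$

so only each shop's expected inventory after consumption is needed, not the joint distribution over all shops.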

Ignoring Truck Identity

m = number of locations (10); k = number of trucks (2–5)

With distinct trucks there are m^k location tuples: for 5 trucks, 10^5 = 100,000 functions and 1.1 million learnable parameters

Ignoring truck identity, only the (unordered) multiset of truck locations matters: 2002 functions and 22,022 learnable parameters
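A quick sketch of the bookkeeping behind these numbers, assuming each linear piece has one weight per truck load and per shop inventory plus a bias (11 parameters with 5 trucks and 5 shops):

```python
from math import comb

m, k, shops = 10, 5, 5               # locations, trucks, shops
params_per_function = k + shops + 1  # assumption: loads + inventories + bias

ordered = m ** k                     # distinct trucks: 100,000 functions
unordered = comb(m + k - 1, k)       # ignoring truck identity: 2,002 functions

print(ordered, ordered * params_per_function)      # 100000 1100000
print(unordered, unordered * params_per_function)  # 2002 22022
```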

The problem of action-space explosion

Every action a is a vector of individual "truck actions": a = (a_1, a_2, …, a_n)

Actions grow exponentially in the number of trucks:
– 9 "truck actions" per truck
– For 2 trucks: 9^2 = 81 joint actions
– For 5 trucks: 9^5 = 59,049 joint actions

Hill Climbing Search

We initialize the vector of truck actions a to all "wait" actions

We use hill climbing to reach a local optimum; we then randomly perturb a truck action and repeat (a sketch follows below)

This results in an order-of-magnitude improvement in search time
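A minimal, hedged sketch of this search over joint actions, assuming a function q_value(state, joint_action) that returns the backed-up value r - ρ + E[h(next)] for a joint action; the perturbation and restart details below are assumptions, not the authors' exact procedure.

```python
import random

TRUCK_ACTIONS = ["wait", "move_N", "move_S", "move_E", "move_W",
                 "unload_1", "unload_2", "unload_3", "unload_4"]

def hill_climb_joint_action(state, num_trucks, q_value, restarts=3):
    """Greedy search over joint actions (a_1, ..., a_n), changing one truck at a time."""
    current = ["wait"] * num_trucks                    # start from all-wait
    current_v = q_value(state, tuple(current))
    best, best_v = tuple(current), current_v
    for _ in range(restarts):
        improved = True
        while improved:                                # climb to a local optimum
            improved = False
            for i in range(num_trucks):
                for a in TRUCK_ACTIONS:
                    cand = current.copy()
                    cand[i] = a
                    v = q_value(state, tuple(cand))
                    if v > current_v:
                        current, current_v, improved = cand, v, True
        if current_v > best_v:                         # keep the best optimum seen
            best, best_v = tuple(current), current_v
        # randomly perturb one truck's action before climbing again
        current[random.randrange(num_trucks)] = random.choice(TRUCK_ACTIONS)
        current_v = q_value(state, tuple(current))
    return best, best_v
```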

Hill climbing vs. exhaustive search for 4 and 5 trucks

[Figure: average reward over 1000's of iterations (10 locations, 5 shops, 5 trucks, 10^6 iterations), comparing hill climbing with exhaustive search over joint actions for 5 trucks]

Conclusion

Average-reward RL and piecewise linear function approximation are promising approaches for real-time product delivery

Hill climbing shows great potential for speeding up search in domains with a large action space

Problems of scaling are surmountable

Future Work

Scaling! More trucks, more locations, more shops, more depots, and more items

Allowing trucks to move with non-uniform speeds (event-based model needed)

Real-valued shop inventory and truck load levels