GPU Accelerated Markov Decision
Processes in Crowd Simulation
Sergio Ruiz
Computer Science Department
Tecnológico de Monterrey, CCM
Mexico City, México
Benjamín Hernández
National Center for Computational Sciences
Oak Ridge National Laboratory
Tennessee, USA
Contents
• Introduction
• Optimization Approaches
• Problem solving strategy
• A simple example
• Algorithm description
• Results
• Conclusions & future work
Crowd Simulation
Path Planning · Local Collision Avoidance (LCA)
Optimization Approaches
• According to (Reyes et al. 2009, Foka and Trahanias 2003), solving
Markov Decision Processes (MDPs) is computationally expensive: as
the state space grows, the problem becomes intractable.
• Decomposition offers the possibility to solve large MDPs (Sucar
2007, Meuleau et al. 1998, Singh and Cohn 1998), either in
State Space decomposition, or Process decomposition.
• (Mausam and Weld 2004) use concurrency to solve MDPs, generating
near-optimal solutions by extending the Labeled Real-Time Dynamic
Programming method.
Optimization Approaches
• (Sucar 2007) proposes a parallel implementation of weakly
coupled MDPs.
• (Jóhansson 2009) presents a dynamic programming framework
that implements the Value Iteration algorithm to solve MDPs
using CUDA.
• (Noer 2013) explores the design and implementation of a point-based
Value Iteration algorithm for Partially Observable MDPs (POMDPs)
with approximate solutions. The GPU implementation supports belief
state pruning, which avoids unnecessary calculations.
Problem Solving Strategy
• We propose a parallel Value Iteration MDP-solving algorithm to
guide groups of agents toward assigned goals, avoiding obstacles,
at interactive rates. For performance, the algorithm runs over a
hexagonal grid in the context of a Fully Observable MDP.
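Since the solver operates on a hexagonal grid with six movement actions, a neighbor lookup is the basic building block. A minimal Python sketch, assuming an "odd-r" offset layout (the deck does not specify its indexing scheme, so the layout and names here are illustrative):

```python
# Illustrative sketch (not the authors' code): neighbor lookup on an
# "odd-r" offset hexagonal grid. Each cell has |A| = 6 neighbors,
# matching the 6 movement actions of the MDP.

def hex_neighbors(row, col, rows, cols):
    """Return the 6 neighbors of (row, col); None marks out-of-bounds."""
    if row % 2 == 0:  # even rows are shifted left
        offsets = [(-1, -1), (-1, 0), (0, -1), (0, 1), (1, -1), (1, 0)]
    else:
        offsets = [(-1, 0), (-1, 1), (0, -1), (0, 1), (1, 0), (1, 1)]
    result = []
    for dr, dc in offsets:
        r, c = row + dr, col + dc
        result.append((r, c) if 0 <= r < rows and 0 <= c < cols else None)
    return result
```

Out-of-bounds neighbors are reported explicitly because, as the algorithm section notes, each cell also needs out-of-bounds values during data collection.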
Problem Solving Strategy
• A Markov Decision Process is a tuple 𝑀 = 𝑆, 𝐴, 𝑇, 𝑅
• S is a finite set of states. In our case, 2D cells.
• A is a finite set of actions. In our case, 6 directions.
• T is a transition model T(s, a, s’).
• R is a reward function R(s).
• A policy 𝜋 is a solution that specifies the action for an agent
at a given state.
• 𝜋∗ is the optimal policy.
Problem Solving Strategy
Value Iteration

𝜋∗_t(s) = argmax_a Q_t(s, a)

Q_t(s, a) = R(s, a) + γ Σ_{j=0}^{5} T^a_{sj} V_{t−1}(j)

V_t(s) = Q_t(s, 𝜋∗(s));  V_0(s) = 0
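The Value Iteration update can be sketched as a plain CPU loop (illustrative Python, not the parallel GPU version; all names are ours):

```python
# CPU sketch of tabular Value Iteration. States are indices 0..n-1,
# T[a][s][j] is the transition probability of reaching j from s under
# action a, R[s][a] the reward, gamma the discount factor.

def value_iteration(T, R, gamma, eps=1e-6, max_iters=1000):
    n_states, n_actions = len(R), len(R[0])
    V = [0.0] * n_states                         # V_0(s) = 0
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(max_iters):
        # Q_t(s, a) = R(s, a) + gamma * sum_j T[a][s][j] * V_{t-1}(j)
        Q = [[R[s][a] + gamma * sum(T[a][s][j] * V[j] for j in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
        V_new = [max(Q[s]) for s in range(n_states)]
        if max(abs(V_new[s] - V[s]) for s in range(n_states)) < eps:
            V = V_new
            break                                # policy has stabilized
        V = V_new
    # pi*(s) = argmax_a Q(s, a)
    policy = [max(range(n_actions), key=lambda a: Q[s][a])
              for s in range(n_states)]
    return V, policy
```

The GPU version described later in the deck replaces the nested loops with parallel transformations and a reduction by key.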
Problem Solving Strategy
• We propose to temporarily override the optimal policy when
agent density in a cell is above a certain threshold 𝝈.
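A hypothetical sketch of this override rule (the names `density`, `sigma`, and `next_cell_of` are ours, not from the deck): when the optimal action leads into a cell whose agent density exceeds 𝝈, fall back to the next-best action.

```python
# Illustrative sketch of a density-based policy override: pick the
# highest-valued action whose target cell is not over the threshold.

def choose_action(cell, Q_row, density, sigma, next_cell_of):
    """Q_row: Q-values of `cell` per action; next_cell_of(cell, a) -> cell id."""
    ranked = sorted(range(len(Q_row)), key=lambda a: Q_row[a], reverse=True)
    for a in ranked:                       # optimal action first
        if density[next_cell_of(cell, a)] <= sigma:
            return a                       # first action whose target is not crowded
    return ranked[0]                       # all targets crowded: keep the optimum
```

This keeps the precomputed policy intact; the override is applied per agent, per step, only while the congestion persists.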
A simplified example
      1    2    3    4
 a   −3   −3   −3  +100
 b   −3   −3    ■  −100
 c   −3   −3   −3   −3

(■ marks the blocked cell)
A = { N, W, E }
𝛾 = 1 (for simplicity)
Transitions:
p = 0.8 (probability that the chosen action is executed)
q = 0.1 (probability of each of the other actions)
𝜋∗_t(s) = argmax_a Q_t(s, a)

Q_t(s, a) = R(s, a) + γ Σ_{j=0}^{2} T^a_{sj} V_{t−1}(j)
What is 𝜋 for cell a3?  𝜋(a3) = argmax{Q(a3, W), Q(a3, N), Q(a3, E)}
Q(a3, E) = 100 + 1.0 (0.8(100) + 0.1(−3) + 0.1(0)) = 179.7
Q(a3, W) = −3 + 1.0 (0.1(100) + 0.8(−3) + 0.1(0)) = 4.6
Q(a3, N) = 0 + 1.0 (0.1(100) + 0.1(−3) + 0.8(0)) = 9.7

⇒ the maximum is Q(a3, E), so 𝜋(a3) = E.

(In each line, the terms are R(s, a), then γ Σ_{j=0}^{2} T^a_{sj} V(j).)
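The arithmetic of this example can be checked in a few lines (a quick sanity-check script, not part of the original method; the V values are those of a3's relevant neighbors):

```python
# Check of the worked example for cell a3 (values from the slide).
gamma = 1.0
V = {"a4": 100.0, "a2": -3.0, "out": 0.0}   # V_{t-1} of the relevant neighbors

Q_E = 100.0 + gamma * (0.8 * V["a4"] + 0.1 * V["a2"] + 0.1 * V["out"])
Q_W = -3.0  + gamma * (0.1 * V["a4"] + 0.8 * V["a2"] + 0.1 * V["out"])
Q_N = 0.0   + gamma * (0.1 * V["a4"] + 0.1 * V["a2"] + 0.8 * V["out"])
# Q_E ≈ 179.7, Q_W ≈ 4.6, Q_N ≈ 9.7 -> East is the best action for a3
```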
Algorithm
– Data collection: each cell needs the rewards of its neighboring cells and the out-of-bounds values.
– Input generation: build T^a_{sj} and R(s, a) = R_W.
– Value Iteration: the optimal policy is computed using parallel transformations and a parallel reduction by key.
Algorithm: input generation
• Transition matrix requirements:

  T^P = [ p ⋯ p        T^Q_{r,c} = [ q_i ⋯ q_i
          ⋮ ⋱ ⋮                      ⋮  ⋱  ⋮
          p ⋯ p ]                    q_i ⋯ q_i ]

  D_A is the |A|×|A| identity matrix; D_B is its complement
  (0 on the diagonal, 1 elsewhere).

  Dimensions: |A|×|A|, i.e., each cell can compute its neighbors' information.

  q_i = q / (RE_i − 1),  r ∈ [1, MDP_rows],  c ∈ [1, MDP_columns]
Algorithm: input generation
where T_{r,c} = T^P ∘ D_A + T^Q_{r,c} ∘ D_B =

  [ p q q
    q p q
    q q p ]

(∘ denotes the element-wise product; the 3-action case is shown)

Transition matrix T^a_{sj} computation — each block T_{r,c} represents a cell:

  T^a_{sj} = [ T_{1,1}        ⋯ T_{1,MDP_columns}
               ⋮              ⋱ ⋮
               T_{MDP_rows,1} ⋯ T_{MDP_rows,MDP_columns} ]
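The per-cell block construction T_{r,c} = T^P ∘ D_A + T^Q_{r,c} ∘ D_B can be sketched in plain Python (3 actions as in the slide's example; a uniform q is assumed here for simplicity):

```python
# Illustrative sketch of one per-cell transition block:
# T_rc = T_P ∘ D_A + T_Q ∘ D_B, where ∘ is the element-wise product.
p, q, A = 0.8, 0.1, 3

T_P = [[p] * A for _ in range(A)]                     # p everywhere
T_Q = [[q] * A for _ in range(A)]                     # q everywhere
D_A = [[1 if i == j else 0 for j in range(A)] for i in range(A)]  # identity
D_B = [[1 - D_A[i][j] for j in range(A)] for i in range(A)]       # complement

T_rc = [[T_P[i][j] * D_A[i][j] + T_Q[i][j] * D_B[i][j]
         for j in range(A)] for i in range(A)]
# -> p on the diagonal, q off the diagonal
```

Because D_A keeps only the diagonal of T^P and D_B only the off-diagonal of T^Q, the sum yields p on the diagonal and q elsewhere, exactly the block shown above.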
Algorithm: Parallel Value Iteration
1. Computation of Q-values.
𝜋_t = R_W + γ T^a_{sj} V

Consecutive parallel transformations (multiply, multiply, sum) result in a matrix Q that stores an |A|-tuple of Q-values (one per action) for each cell.
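The chain of transformations in step 1 can be mimicked sequentially in Python (the flat memory layout below is our assumption, chosen to echo Thrust-style transform/reduce-by-key patterns):

```python
# Sketch of step 1 as flat array transformations. Layout assumption
# (ours, not the slide's): one segment of n_states transition
# probabilities per (cell, action) pair, flattened into one list.

def q_values(R_flat, T_flat, V, gamma, n_states):
    """R_flat[k]: reward of pair k; T_flat[k*n_states + j]: T(s_k, a_k, j)."""
    # transform 1: multiply every transition probability by V[j]
    TV = [T_flat[i] * V[i % n_states] for i in range(len(T_flat))]
    # reduce by key: sum each segment of n_states products
    sums = [sum(TV[k * n_states:(k + 1) * n_states])
            for k in range(len(R_flat))]
    # transform 2: scale by gamma and add the reward
    return [R_flat[k] + gamma * sums[k] for k in range(len(R_flat))]
```

On the GPU, each of these three passes maps to an independent data-parallel operation over the whole flattened array.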
Algorithm: Parallel Value Iteration
2. Selection of best Q-values.
– Parallel reduction by key: within every consecutive |A|-tuple of 𝜋_t, the index of the largest value gives the current best action for that cell.
3. Check for convergence.
– If 𝜋_t − 𝜋_{t−1} = [0, …, 0], the policy is stable and iteration stops.
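Steps 2 and 3 amount to a segmented argmax followed by a policy comparison; a sequential sketch (our naming, not the authors' code):

```python
# Sketch of steps 2-3: segmented argmax over consecutive |A|-tuples of
# Q-values, then a convergence test against the previous policy.

def best_actions(Q_flat, n_actions):
    """Argmax over each consecutive n_actions-sized segment of Q_flat."""
    return [max(range(n_actions), key=lambda a: Q_flat[s * n_actions + a])
            for s in range(len(Q_flat) // n_actions)]

def converged(policy, prev_policy):
    """True when pi_t - pi_{t-1} == [0, ..., 0]."""
    return policy == prev_policy
```

On the GPU, the segmented argmax is a reduction by key where the cell index serves as the key for its |A|-tuple.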
Crowd Navigation
Video
https://www.youtube.com/watch?v=369td2O8dxY
Results: test scenarios
Office (1,584 cells) · Maze (100×100 cells) · Champ de Mars (100×100 cells)

Implementation: CUDA Thrust with OpenMP and CUDA backends.
CPU: Intel Core i7 running at 3.40 GHz. ARM (Jetson TK1): 32-bit quad-core Cortex-A15 running at 2.32 GHz.
GPUs: Tegra K1 (192 CUDA cores), Tesla K40c (2880 CUDA cores), GeForce GTX TITAN (2688 CUDA cores).
Results: GPU performance
Results: GPU speedup
Intel CPU baseline: 8 threads; ARM CPU baseline: 4 threads
Conclusion
• Parallelization of the proposed algorithm was made possible by formulating it in terms of matrix operations, leveraging the massive data parallelism of GPU computing to reduce the MDP solution time.
• We demonstrated that standard parallel transformation and reduction operations are sufficient to solve MDPs via Value Iteration with high performance.
Conclusion
• Taking advantage of the proposed hexagonal grid partitioning method, our implementation provides a good balance between spatial discretization quality and performance.
• We obtained up to a 90× speedup using GPUs, enabling us to simulate crowd behavior interactively.
• We found the Jetson TK1 GPU to have remarkable performance, opening many possibilities for incorporating real-time MDP solvers into mobile robotics.
Future Work
• Reinforcement learning. Evaluate different parameter values to obtain policy convergence in the least number of iterations without losing precision in the generated paths.
• Couple the MDP solver with a Local Collision Avoidance method to obtain more precise simulation results at microscopic level.
• Investigate further applications of our MDP solver beyond the context of crowd simulation.
Thank you!
This research was partially supported by: CONACyT SNI-54067, CONACyT PhD
scholarship 375247, Nvidia Hardware Grant and Oak Ridge Leadership Computing Facility
at the Oak Ridge National Laboratory, under DOE Contract No. DE-AC05-00OR22725.
Further reading: Ruiz, S. and Hernández, B., "A Parallel Solver for Markov Decision Process in Crowd Simulation," MICAI 2015 (14th Mexican International Conference on Artificial Intelligence), Cuernavaca, Mexico, IEEE. ISBN 978-1-5090-0323-5.
Additional Results: Intel CPU
Additional Results: ARM CPU