GPU Accelerated Markov Decision
Processes in Crowd Simulation
Sergio Ruiz
Computer Science Department
Tecnológico de Monterrey, CCM
Mexico City, México
Benjamín Hernández
National Center for Computational Sciences
Oak Ridge National Laboratory
Tennessee, USA
Contents
• Introduction
• Optimization Approaches
• Problem solving strategy
• A simple example
• Algorithm description
• Results
• Conclusions & future work
Crowd Simulation
Path Planning · Local Collision Avoidance (LCA)
Optimization Approaches
• According to (Reyes et al. 2009, Foka and Trahanias 2003), solving
Markov Decision Processes (MDPs) is computationally expensive: as
the state space grows, the problem becomes intractable.
• Decomposition offers the possibility to solve large MDPs (Sucar
2007, Meuleau et al. 1998, Singh and Cohn 1998), either in
State Space decomposition, or Process decomposition.
• (Mausam and Weld 2004) use concurrency to solve MDPs, generating
near-optimal solutions by extending the Labeled Real-Time Dynamic
Programming method.
Optimization Approaches
• (Sucar 2007) proposes a parallel implementation of weakly
coupled MDPs.
• (Jóhansson 2009) presents a dynamic programming framework
that implements the Value Iteration algorithm to solve MDPs
using CUDA.
• (Noer 2013) explores the design and implementation of a point-based
Value Iteration algorithm for Partially Observable MDPs (POMDPs)
with approximate solutions. The GPU implementation supports belief
state pruning, which avoids unnecessary calculations.
Problem Solving Strategy
• We propose a parallel Value Iteration MDP-solving algorithm to
guide groups of agents toward assigned goals, avoiding obstacles,
at interactive rates. For performance, the algorithm runs over a
hexagonal grid in the context of a Fully Observable MDP.
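Since the solver operates on a hexagonal grid with six movement actions, a neighbor lookup is the basic building block. A minimal Python sketch, assuming an "odd-r" offset layout (the deck does not specify its indexing scheme, so the layout and names here are illustrative):

```python
# Illustrative sketch (not the authors' code): neighbor lookup on an
# "odd-r" offset hexagonal grid. Each cell has |A| = 6 neighbors,
# matching the 6 movement actions of the MDP.

def hex_neighbors(row, col, rows, cols):
    """Return the 6 neighbors of (row, col); None marks out-of-bounds."""
    if row % 2 == 0:  # even rows are shifted left
        offsets = [(-1, -1), (-1, 0), (0, -1), (0, 1), (1, -1), (1, 0)]
    else:
        offsets = [(-1, 0), (-1, 1), (0, -1), (0, 1), (1, 0), (1, 1)]
    result = []
    for dr, dc in offsets:
        r, c = row + dr, col + dc
        result.append((r, c) if 0 <= r < rows and 0 <= c < cols else None)
    return result
```

Out-of-bounds neighbors are reported explicitly because, as the algorithm section notes, each cell also needs out-of-bounds values during data collection.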
Problem Solving Strategy
• A Markov Decision Process is a tuple 𝑀 = 𝑆, 𝐴, 𝑇, 𝑅
• S is a finite set of states. In our case, 2D cells.
• A is a finite set of actions. In our case, 6 directions.
• T is a transition model T(s, a, s’).
• R is a reward function R(s).
• A policy 𝜋 is a solution that specifies the action for an agent
at a given state.
• 𝜋∗ is the optimal policy.
Problem Solving Strategy
Value Iteration

𝜋∗_t(s) = argmax_a Q_t(s, a)

Q_t(s, a) = R(s, a) + γ Σ_{j=0}^{5} T^a_{sj} V_{t−1}(j)

V_t(s) = Q_t(s, 𝜋∗(s));  V_0(s) = 0
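The Value Iteration update can be sketched as a plain CPU loop (illustrative Python, not the parallel GPU version; all names are ours):

```python
# CPU sketch of tabular Value Iteration. States are indices 0..n-1,
# T[a][s][j] is the transition probability of reaching j from s under
# action a, R[s][a] the reward, gamma the discount factor.

def value_iteration(T, R, gamma, eps=1e-6, max_iters=1000):
    n_states, n_actions = len(R), len(R[0])
    V = [0.0] * n_states                         # V_0(s) = 0
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(max_iters):
        # Q_t(s, a) = R(s, a) + gamma * sum_j T[a][s][j] * V_{t-1}(j)
        Q = [[R[s][a] + gamma * sum(T[a][s][j] * V[j] for j in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
        V_new = [max(Q[s]) for s in range(n_states)]
        if max(abs(V_new[s] - V[s]) for s in range(n_states)) < eps:
            V = V_new
            break                                # policy has stabilized
        V = V_new
    # pi*(s) = argmax_a Q(s, a)
    policy = [max(range(n_actions), key=lambda a: Q[s][a])
              for s in range(n_states)]
    return V, policy
```

The GPU version described later in the deck replaces the nested loops with parallel transformations and a reduction by key.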
Problem Solving Strategy
• We propose to temporarily override the optimal policy when
agent density in a cell is above a certain threshold 𝝈.
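A hypothetical sketch of this override rule (the names `density`, `sigma`, and `next_cell_of` are ours, not from the deck): when the optimal action leads into a cell whose agent density exceeds 𝝈, fall back to the next-best action.

```python
# Illustrative sketch of a density-based policy override: pick the
# highest-valued action whose target cell is not over the threshold.

def choose_action(cell, Q_row, density, sigma, next_cell_of):
    """Q_row: Q-values of `cell` per action; next_cell_of(cell, a) -> cell id."""
    ranked = sorted(range(len(Q_row)), key=lambda a: Q_row[a], reverse=True)
    for a in ranked:                       # optimal action first
        if density[next_cell_of(cell, a)] <= sigma:
            return a                       # first action whose target is not crowded
    return ranked[0]                       # all targets crowded: keep the optimum
```

This keeps the precomputed policy intact; the override is applied per agent, per step, only while the congestion persists.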
A simplified example
      1    2    3    4
 a   −3   −3   −3  +100
 b   −3   −3    ■  −100
 c   −3   −3   −3   −3

(■ marks the blocked cell)
A = { N, W, E }
𝛾 = 1 (for simplicity)
Transitions:
p = 0.8 (probability that the chosen action is executed)
q = 0.1 (probability of each of the other actions)
𝜋∗_t(s) = argmax_a Q_t(s, a)

Q_t(s, a) = R(s, a) + γ Σ_{j=0}^{2} T^a_{sj} V_{t−1}(j)
What is 𝜋 for cell a3?  𝜋(a3) = argmax{Q(a3, W), Q(a3, N), Q(a3, E)}
Q(a3, E) = 100 + 1.0 (0.8(100) + 0.1(−3) + 0.1(0)) = 179.7
Q(a3, W) = −3 + 1.0 (0.1(100) + 0.8(−3) + 0.1(0)) = 4.6
Q(a3, N) = 0 + 1.0 (0.1(100) + 0.1(−3) + 0.8(0)) = 9.7

⇒ the maximum is Q(a3, E), so 𝜋(a3) = E.

(In each line, the terms are R(s, a), then γ Σ_{j=0}^{2} T^a_{sj} V(j).)
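The arithmetic of this example can be checked in a few lines (a quick sanity-check script, not part of the original method; the V values are those of a3's relevant neighbors):

```python
# Check of the worked example for cell a3 (values from the slide).
gamma = 1.0
V = {"a4": 100.0, "a2": -3.0, "out": 0.0}   # V_{t-1} of the relevant neighbors

Q_E = 100.0 + gamma * (0.8 * V["a4"] + 0.1 * V["a2"] + 0.1 * V["out"])
Q_W = -3.0  + gamma * (0.1 * V["a4"] + 0.8 * V["a2"] + 0.1 * V["out"])
Q_N = 0.0   + gamma * (0.1 * V["a4"] + 0.1 * V["a2"] + 0.8 * V["out"])
# Q_E ≈ 179.7, Q_W ≈ 4.6, Q_N ≈ 9.7 -> East is the best action for a3
```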
Algorithm
– Data collection: each cell needs the rewards of its neighboring cells and the out-of-bounds values.
– Input generation: build T^a_{sj} and R(s, a) = R_W.
– Value Iteration: the optimal policy is computed using parallel transformations and a parallel reduction by key.
Algorithm: input generation
• Transition matrix requirements:

  T^P = [ p ⋯ p        T^Q_{r,c} = [ q_i ⋯ q_i
          ⋮ ⋱ ⋮                      ⋮  ⋱  ⋮
          p ⋯ p ]                    q_i ⋯ q_i ]

  D_A is the |A|×|A| identity matrix; D_B is its complement
  (0 on the diagonal, 1 elsewhere).

  Dimensions: |A|×|A|, i.e., each cell can compute its neighbors' information.

  q_i = q / (RE_i − 1),  r ∈ [1, MDP_rows],  c ∈ [1, MDP_columns]
Algorithm: input generation
where T_{r,c} = T^P ∘ D_A + T^Q_{r,c} ∘ D_B =

  [ p q q
    q p q
    q q p ]

(∘ denotes the element-wise product; the 3-action case is shown)

Transition matrix T^a_{sj} computation — each block T_{r,c} represents a cell:

  T^a_{sj} = [ T_{1,1}        ⋯ T_{1,MDP_columns}
               ⋮              ⋱ ⋮
               T_{MDP_rows,1} ⋯ T_{MDP_rows,MDP_columns} ]
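The per-cell block construction T_{r,c} = T^P ∘ D_A + T^Q_{r,c} ∘ D_B can be sketched in plain Python (3 actions as in the slide's example; a uniform q is assumed here for simplicity):

```python
# Illustrative sketch of one per-cell transition block:
# T_rc = T_P ∘ D_A + T_Q ∘ D_B, where ∘ is the element-wise product.
p, q, A = 0.8, 0.1, 3

T_P = [[p] * A for _ in range(A)]                     # p everywhere
T_Q = [[q] * A for _ in range(A)]                     # q everywhere
D_A = [[1 if i == j else 0 for j in range(A)] for i in range(A)]  # identity
D_B = [[1 - D_A[i][j] for j in range(A)] for i in range(A)]       # complement

T_rc = [[T_P[i][j] * D_A[i][j] + T_Q[i][j] * D_B[i][j]
         for j in range(A)] for i in range(A)]
# -> p on the diagonal, q off the diagonal
```

Because D_A keeps only the diagonal of T^P and D_B only the off-diagonal of T^Q, the sum yields p on the diagonal and q elsewhere, exactly the block shown above.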
Algorithm: Parallel Value Iteration
1. Computation of Q-values.
𝜋_t = R_W + γ T^a_{sj} V

Consecutive parallel transformations (multiply, multiply, sum) result in a matrix Q that stores an |A|-tuple of Q-values (one per action) for each cell.
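The chain of transformations in step 1 can be mimicked sequentially in Python (the flat memory layout below is our assumption, chosen to echo Thrust-style transform/reduce-by-key patterns):

```python
# Sketch of step 1 as flat array transformations. Layout assumption
# (ours, not the slide's): one segment of n_states transition
# probabilities per (cell, action) pair, flattened into one list.

def q_values(R_flat, T_flat, V, gamma, n_states):
    """R_flat[k]: reward of pair k; T_flat[k*n_states + j]: T(s_k, a_k, j)."""
    # transform 1: multiply every transition probability by V[j]
    TV = [T_flat[i] * V[i % n_states] for i in range(len(T_flat))]
    # reduce by key: sum each segment of n_states products
    sums = [sum(TV[k * n_states:(k + 1) * n_states])
            for k in range(len(R_flat))]
    # transform 2: scale by gamma and add the reward
    return [R_flat[k] + gamma * sums[k] for k in range(len(R_flat))]
```

On the GPU, each of these three passes maps to an independent data-parallel operation over the whole flattened array.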
Algorithm: Parallel Value Iteration
2. Selection of best Q-values.
– Parallel reduction by key: within every consecutive |A|-tuple of 𝜋_t, the index of the largest value gives the current best action for that cell.
3. Check for convergence.
– If 𝜋_t − 𝜋_{t−1} = [0, …, 0], the policy is stable and iteration stops.
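Steps 2 and 3 amount to a segmented argmax followed by a policy comparison; a sequential sketch (our naming, not the authors' code):

```python
# Sketch of steps 2-3: segmented argmax over consecutive |A|-tuples of
# Q-values, then a convergence test against the previous policy.

def best_actions(Q_flat, n_actions):
    """Argmax over each consecutive n_actions-sized segment of Q_flat."""
    return [max(range(n_actions), key=lambda a: Q_flat[s * n_actions + a])
            for s in range(len(Q_flat) // n_actions)]

def converged(policy, prev_policy):
    """True when pi_t - pi_{t-1} == [0, ..., 0]."""
    return policy == prev_policy
```

On the GPU, the segmented argmax is a reduction by key where the cell index serves as the key for its |A|-tuple.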
Crowd Navigation
Video
https://www.youtube.com/watch?v=369td2O8dxY
Results: test scenarios
Office (1,584 cells) · Maze (100×100 cells) · Champ de Mars (100×100 cells)

Implementation: CUDA Thrust with OpenMP and CUDA backends.
CPU: Intel Core i7 running at 3.40 GHz. ARM (Jetson TK1): 32-bit quad-core Cortex-A15 running at 2.32 GHz.
GPUs: Tegra K1 (192 CUDA cores), Tesla K40c (2880 CUDA cores), GeForce GTX TITAN (2688 CUDA cores).
Results: GPU performance
Results: GPU speedup
Intel CPU baseline: 8 threads; ARM CPU baseline: 4 threads
Conclusion
• Parallelization of the proposed algorithm was made possible by formulating it in terms of matrix operations, leveraging the massive data parallelism of GPU computing to reduce the MDP solution time.
• We demonstrated that standard parallel transformation and reduction operations are sufficient to solve MDPs via Value Iteration with high performance.
Conclusion
• Taking advantage of the proposed hexagonal grid partitioning method, our implementation provides a good balance between spatial discretization quality and performance.
• We obtained up to a 90× speedup using GPUs, enabling us to simulate crowd behavior interactively.
• We found the Jetson TK1 GPU to have remarkable performance, opening many possibilities for incorporating real-time MDP solvers into mobile robotics.
Future Work
• Reinforcement learning. Evaluate different parameter values to obtain policy convergence in the least number of iterations without losing precision in the generated paths.
• Couple the MDP solver with a Local Collision Avoidance method to obtain more precise simulation results at microscopic level.
• Investigate further applications of our MDP solver beyond the context of crowd simulation.
Thank you!
This research was partially supported by: CONACyT SNI-54067, CONACyT PhD
scholarship 375247, Nvidia Hardware Grant and Oak Ridge Leadership Computing Facility
at the Oak Ridge National Laboratory, under DOE Contract No. DE-AC05-00OR22725.
Further reading: Ruiz, S. and Hernández, B., "A Parallel Solver for Markov Decision Process in Crowd Simulation," MICAI 2015 (14th Mexican International Conference on Artificial Intelligence), Cuernavaca, Mexico, IEEE. ISBN 978-1-5090-0323-5.
Additional Results: Intel CPU
Additional Results: ARM CPU