
Learning to Plan in High Dimensions via Neural Exploration-Exploitation Trees

Binghong Chen1, Bo Dai2, Qinjie Lin3, Guo Ye3, Han Liu3, Le Song1,4

1Georgia Institute of Technology, 2Google Research, Brain Team, 3Northwestern University, 4Ant Financial

ICLR 2020 Spotlight Presentation – Full Version


Task: Robot Arm Fetching

Problem: planning a robot arm to fetch a book on a shelf

β–ͺ Given $(\text{state}_{\text{start}}, \text{state}_{\text{goal}}, \text{map}_{\text{obs}})$

β–ͺ Find the shortest collision-free path

Traditional planners: based on sampling the state space

β–ͺ Construct a tree to search for a path

β–ͺ Use random samples from a uniform distribution to grow the tree (see the sketch below)

β†’ Sample inefficient.
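To make the uniform-sampling baseline concrete, here is a minimal RRT-style sketch (not the exact planner used in the paper); `sample_uniform`, `steer`, and `is_collision_free` are assumed helper functions, and states are plain tuples.

```python
import math

def rrt_plan(s_start, s_goal, sample_uniform, steer, is_collision_free,
             max_samples=500, goal_tol=0.05):
    """Grow a tree from s_start with uniform random samples; states are tuples."""
    tree = {s_start: None}                            # maps node -> parent
    for _ in range(max_samples):
        s_rand = sample_uniform()                     # uniform sample of the state space
        s_near = min(tree, key=lambda s: math.dist(s, s_rand))
        s_new = steer(s_near, s_rand)                 # short step from s_near toward s_rand
        if is_collision_free(s_near, s_new):
            tree[s_new] = s_near
            if math.dist(s_new, s_goal) < goal_tol:   # goal reached: backtrack the path
                path = [s_new]
                while tree[path[-1]] is not None:
                    path.append(tree[path[-1]])
                return list(reversed(path))
    return None                                       # failed within the sample budget
```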


Daily Task: Robot Arm Fetching

Similar problems are solved again and again, with $(\text{state}_{\text{start}}, \text{state}_{\text{goal}}, \text{map}_{\text{obs}}) \sim \text{Dist}$.

Can we learn from past planning experience to build a sampling-based planner that is smarter than simple uniform random sampling?

Yes! Learned planner β†’ improved sample efficiency β†’ problems solved in less time.

Sampling from a Uniform vs. a Learned Distribution

Different sampling distributions for growing a search tree on the same problem.

Uniform distribution: failed after 500 samples.

Learned $q_\theta(s \mid \text{tree}, s_{\text{goal}}, \text{map}_{\text{obs}})$: success with fewer than 60 samples.


Neural Exploration-Exploitation Trees (NEXT)

β€’ Learns a search bias from past planning experience and uses it to guide future planning.

β€’ Balances exploration and exploitation to escape local minima.

β€’ Improves itself as it plans.

Learning to Plan by Learning to Sample

1. Define a sampling policy $q_\theta(s \mid \text{tree}, s_{\text{goal}}, \text{map}_{\text{obs}})$.

2. Learn $\theta$ from past planning experience.


π‘žπœƒ 𝑠 π‘‘π‘Ÿπ‘’π‘’, π‘ π‘”π‘œπ‘Žπ‘™ , π‘šπ‘Žπ‘π‘œπ‘π‘  :

1. π‘ π‘π‘Žπ‘Ÿπ‘’π‘›π‘‘ = argmaxπ‘ βˆˆπ‘‘π‘Ÿπ‘’π‘’π‘‰πœƒ 𝑠 π‘ π‘”π‘œπ‘Žπ‘™ , π‘šπ‘Žπ‘π‘œπ‘π‘  ;

2. 𝑠𝑛𝑒𝑀~πœ‹πœƒ β‹… π‘ π‘π‘Žπ‘Ÿπ‘’π‘›π‘‘ , π‘ π‘”π‘œπ‘Žπ‘™ , π‘šπ‘Žπ‘π‘œπ‘π‘  .

Pure exploitation policy6
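As a rough illustration of this two-step policy, the sketch below selects the tree node with the highest learned value and then samples an expansion from the learned local policy; `value_net` and `policy_net` are hypothetical stand-ins for $V_\theta$ and $\pi_\theta$, not the authors' interfaces.

```python
import numpy as np

def sample_next_state(tree_states, s_goal, map_obs, value_net, policy_net):
    """Pure exploitation: expand the tree node that V_theta scores highest."""
    # 1. s_parent = argmax over tree nodes of V_theta(s | s_goal, map_obs).
    values = [value_net(s, s_goal, map_obs) for s in tree_states]
    s_parent = tree_states[int(np.argmax(values))]
    # 2. s_new ~ pi_theta(. | s_parent, s_goal, map_obs).
    s_new = policy_net.sample(s_parent, s_goal, map_obs)
    return s_parent, s_new
```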

Learning Sampling Bias by Imitation

β–ͺ The learned $\pi_\theta(\cdot \mid s_{\text{parent}}, s_{\text{goal}}, \text{map}_{\text{obs}})$ points toward the goal while avoiding the obstacles.

β–ͺ The learned $V_\theta(s \mid s_{\text{goal}}, \text{map}_{\text{obs}})$ (heatmap) decreases as the state moves away from the goal.

We learn $\pi_\theta$ and $V_\theta$ by imitating the shortest solutions from previous planning problems.


Minimize: $\sum_{i=1}^{n} \left( -\sum_{j=1}^{m-1} \log \pi_\theta(s_{j+1} \mid s_j) + \sum_{j=1}^{m} \left( V_\theta(s_j) - v_j \right)^2 \right)$

Value Iteration as Inductive Bias for $\pi$ / $V$

Intuition: first embed the state and the planning problem into a discrete latent representation via an attention-based module; then a neuralized value iteration (planning module) is performed on it to extract the features that define $V_\theta$ and $\pi_\theta$.

Neural Architecture for $\pi$ / $V$

Overall model architecture: an attention-based state embedding (latent representation of states) followed by neuralized value iteration.

Attention-based State Embedding

Attention module. For illustration purposes, assume the workspace is 2D and the map shape is 3×3.


[Attention module diagram: the state is split into the workspace state (i.e. spatial locations) and the remaining state (i.e. configurations); two softmax operations followed by an outer product produce attention weights whose entries are >= 0 and sum to 1.]
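The toy sketch below illustrates the outer-product attention idea for the 2D, 3×3 example, scoring each axis against assumed grid centers; it is not the exact module from the paper.

```python
import torch
import torch.nn.functional as F

def attention_map(workspace_xy, grid_size=3):
    """workspace_xy: tensor of shape [2], the spatial part of the state in [0, 1]^2."""
    centers = torch.linspace(0.0, 1.0, grid_size)   # assumed grid-cell centers per axis
    att_x = F.softmax(-(workspace_xy[0] - centers) ** 2, dim=0)  # softmax over x cells
    att_y = F.softmax(-(workspace_xy[1] - centers) ** 2, dim=0)  # softmax over y cells
    att = att_y[:, None] * att_x[None, :]           # outer product -> 3x3 attention map
    return att                                      # entries >= 0 and sum to 1
```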


Neuralized Value Iteration

Planning module. Inspired by Value Iteration Networks*.

*Tamar, Aviv, et al. "Value iteration networks." Advances in Neural Information Processing Systems. 2016.
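For intuition, here is a minimal VIN-style planning module in the spirit of Tamar et al. (2016): a convolution produces per-action Q-maps and a max over actions updates the value map. The shapes and layer sizes are assumptions, not the NEXT architecture.

```python
import torch
import torch.nn as nn

class NeuralValueIteration(nn.Module):
    """VIN-style planning module: K steps of convolutional Bellman updates."""
    def __init__(self, num_actions=8, iterations=20):
        super().__init__()
        # Maps [reward map, current value map] to one Q-map per action.
        self.q_conv = nn.Conv2d(2, num_actions, kernel_size=3, padding=1)
        self.iterations = iterations

    def forward(self, reward_map):
        # reward_map: [batch, 1, H, W], e.g. produced by the attention-based embedding.
        value = torch.zeros_like(reward_map)
        for _ in range(self.iterations):
            q = self.q_conv(torch.cat([reward_map, value], dim=1))  # [B, A, H, W]
            value, _ = q.max(dim=1, keepdim=True)                   # Bellman max over actions
        return value
```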

Escaping Local Minima

Pure exploitation policy vs. exploration-exploitation policy.

The variance $U$ estimates the local density.
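One plausible way to realize this, sketched below under assumptions (the exact form of $U$ and the trade-off constant are not taken from the paper), is a UCB-style node selection whose bonus grows where the local density of tree nodes is low.

```python
import numpy as np

def select_parent(tree_states, s_goal, map_obs, value_net, c=1.0, radius=0.1):
    """Pick argmax_s V_theta(s) + c * U(s), with U(s) large where the tree is sparse."""
    states = np.asarray(tree_states)                     # [n, d] array of tree nodes
    scores = []
    for s in states:
        v = value_net(s, s_goal, map_obs)                # exploitation term
        n_near = np.sum(np.linalg.norm(states - s, axis=1) < radius)
        scores.append(v + c / np.sqrt(n_near))           # density-based exploration bonus
    return tree_states[int(np.argmax(scores))]
```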

AlphaGo-Zero-style Learning


AlphaGo Zero: MCTS as a policy improvement operator.

NEXT (Meta Self-Improving Learning): exploration with random sampling as a policy improvement operator!

β–ͺ Use $\epsilon$-exploration to improve the result.

β–ͺ Anneal $\epsilon$ (see the sketch below).
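The sketch below outlines such a self-improving loop under assumptions: `planner.plan` mixes the learned sampler with $\epsilon$ random sampling, successful paths are kept as demonstrations, `fit` refits $\pi_\theta$/$V_\theta$ on them, and $\epsilon$ is annealed linearly. This is illustrative, not the authors' exact procedure.

```python
def meta_self_improving_learning(problems, planner, fit, epochs=10,
                                 eps_start=1.0, eps_end=0.1):
    """Plan with epsilon-exploration, keep successes as demonstrations, refit, anneal epsilon."""
    demos = []
    for epoch in range(epochs):
        # Linearly anneal epsilon from eps_start to eps_end.
        eps = eps_start + (eps_end - eps_start) * epoch / max(epochs - 1, 1)
        for problem in problems:
            path = planner.plan(problem, epsilon=eps)   # mix learned and random sampling
            if path is not None:                        # success: keep the solution path
                demos.append((problem, path))
        fit(demos)                                      # imitate the collected solutions
    return demos
```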

Experiments


Environments: 2D, 3D, 5D, 7D.

3000 distinct maps (planning problems) per environment: 2000 for training (SL/RL/MSIL) and 1000 for testing.

Experiment Results

Learning-based:

β–ͺ Proposed method (MSIL): NEXT-KS, NEXT-GP

β–ͺ Ablation studies: GPPN-KS, GPPN-GP (architecture); BFS (UCB)

β–ͺ Baselines: Reinforce (RL); CVAE (SL)

Non-learning:

β–ͺ Ablation: Dijkstra (heuristic)

β–ͺ Baselines: RRT*, BIT* (sampling-based)

Search Tree Comparison


Case Study: Robot Arm Fetching

Success rates of different planners under varying time limits; higher is better.


Thanks for listening!


For more details, please refer to our paper, full slides, and poster.
