Life: play and win in 20 trillion moves
Stuart Russell
Based on work with Ron Parr, David Andre, Carlos Guestrin, Bhaskara Marthi, and Jason Wolfe

White to play and win in 2 moves

Life
- 100 years x 365 days x 24 hrs x 3600 seconds x 640 muscles x 10/second = 20 trillion actions
- (Not to mention choosing brain activities!)
- The world has a very large, partially observable state space and an unknown model
- So, being intelligent is "provably" hard
- How on earth do we manage?

Hierarchical structure!! (among other things)
- Deeply nested behaviors give modularity
  - E.g., the tongue controls for [t] are independent of everything else, given the decision to say "tongue"
- High-level decisions give scale
  - E.g., "go to IJCAI-13" ≈ 3,000,000,000 actions
- Look further ahead, reduce computation exponentially
- [NIPS 97, ICML 99, NIPS 00, AAAI 02, ICML 03, IJCAI 05, ICAPS 07, ICAPS 08, ICAPS 10, IJCAI 11]

Reminder: MDPs
- A Markov decision process (MDP) is defined by:
  - a set of states S
  - a set of actions A
  - a transition model P(s' | s, a), Markovian, i.e., depending only on s_t and not on the history
  - a reward function R(s, a, s')
  - a discount factor γ
- The optimal policy π* maximizes the value function V^π = Σ_t γ^t R(s_t, π(s_t), s_{t+1})
- Q(s, a) = Σ_{s'} P(s' | s, a) [R(s, a, s') + γ V(s')]

Reminder: policy invariance
- Given an MDP M with reward function R(s, a, s') and any potential function Φ(s), any policy that is optimal for M is also optimal for the version M' with reward function R'(s, a, s') = R(s, a, s') + γ Φ(s') − Φ(s)
- So we can provide "hints" to speed up RL by shaping the reward function

Reminder: Bellman equation
- Bellman equation for a fixed policy π: V^π(s) = Σ_{s'} P(s' | s, π(s)) [R(s, π(s), s') + γ V^π(s')]
- Bellman equation for the optimal value function: V*(s) = max_a Σ_{s'} P(s' | s, a) [R(s, a, s') + γ V*(s')]

Reminder: Reinforcement learning
- The agent observes transitions and rewards: doing a in s leads to s' with reward r
- Temporal-difference (TD) learning: V(s) ← V(s) + α [r + γ V(s') − V(s)]
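The sketch below ties the three reminder slides together: a TD(0) value update applied to rewards shaped with a potential function, R'(s, a, s') = R(s, a, s') + γΦ(s') − Φ(s). It is an illustration only; the toy chain environment, the potential phi, and all identifiers are invented here and do not come from the talk.

```python
import random
from collections import defaultdict

# Illustrative sketch: TD(0) value learning on a toy chain MDP, with a
# potential-based shaping bonus gamma*phi(s') - phi(s) added to the observed
# reward. Per the policy-invariance reminder above, a shaping term of this
# form leaves the optimal policies of the underlying MDP unchanged.

GAMMA, ALPHA, N_STATES = 0.9, 0.1, 10

def step(s, a):
    """Toy chain dynamics: a is -1 or +1; reward 1.0 on reaching the right end."""
    s2 = max(0, min(N_STATES - 1, s + a))
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

def phi(s):
    """Hand-built potential: states nearer the goal look better (the 'hint')."""
    return s / (N_STATES - 1)

V = defaultdict(float)
s = 0
for _ in range(10_000):
    a = random.choice([-1, +1])                      # exploratory behaviour policy
    s2, r = step(s, a)
    shaped_r = r + GAMMA * phi(s2) - phi(s)          # potential-based shaping
    V[s] += ALPHA * (shaped_r + GAMMA * V[s2] - V[s])  # TD(0) update
    s = 0 if s2 == N_STATES - 1 else s2              # restart the episode at the goal
```

Deleting the `shaped_r` line and using `r` directly recovers plain TD(0); the shaped version simply gets its "hint" from phi and typically propagates value back from the goal faster.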
[Several intermediate slides are missing from the extracted transcript; only the fragment "... global optimality" survives.]

Learning threadwise decomposition
- [Figure: a top thread and three peasant threads run concurrently; the joint action a is taken in the environment and the reward r is decomposed into per-thread rewards r1, r2, r3]

Learning threadwise decomposition (continued)
- [Figure: at a choice state ω, each thread j maintains its own Q_j(ω, ·); the thread values are summed to give Q(ω, ·) = Q_1 + Q_2 + Q_3, the joint choice is u = argmax of the sum, and each Q_j receives its own Q-update]

Threadwise and temporal decomposition
- Q_j = Q_{j,r} + Q_{j,c} + Q_{j,e}, where
  - Q_{j,r}(ω, u) = expected reward gained by thread j while doing u
  - Q_{j,c}(ω, u) = expected reward gained by thread j after u but before leaving the current subroutine
  - Q_{j,e}(ω, u) = expected reward gained by thread j after the current subroutine

Resource gathering with 15 peasants
- [Plot: reward of the learnt policy vs. number of learning steps (x 1000), comparing Flat, Undecomposed, Threadwise, and Threadwise + Temporal]
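A minimal sketch of the threadwise Q-decomposition just described: each thread j keeps its own component Q_j(ω, u), the joint choice u maximizes the summed value Σ_j Q_j(ω, u), and each component is updated only from that thread's share r_j of the decomposed reward. The fixed three-thread setup and all names below are invented for the illustration; the talk's actual system learns these components inside concurrent partial programs.

```python
from collections import defaultdict

# Sketch of threadwise Q-decomposition: Q(omega, u) = sum_j Q_j(omega, u).
# Each thread j learns from its own reward share r_j, but the joint choice u
# is greedy with respect to the *summed* Q-value across threads.

GAMMA, ALPHA = 0.9, 0.1
THREADS = (0, 1, 2)                                # e.g., one thread per peasant
Q = {j: defaultdict(float) for j in THREADS}       # Q_j(omega, u)

def joint_q(omega, u):
    return sum(Q[j][(omega, u)] for j in THREADS)

def choose(omega, choices):
    """Joint choice at choice state omega: argmax over u of the summed thread Q-values."""
    return max(choices, key=lambda u: joint_q(omega, u))

def q_update(omega, u, rewards, omega2, next_choices):
    """Per-thread Q-update; rewards[j] is thread j's share of the decomposed reward."""
    u2 = choose(omega2, next_choices)              # greedy successor joint choice
    for j in THREADS:
        target = rewards[j] + GAMMA * Q[j][(omega2, u2)]
        Q[j][(omega, u)] += ALPHA * (target - Q[j][(omega, u)])
```

The further temporal split Q_j = Q_{j,r} + Q_{j,c} + Q_{j,e} from the slide above would be applied inside each thread's component in the same style.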
Summary of Part I
- Structure in behavior seems essential for scaling up
- Partial programs:
  - provide natural structural constraints on policies
  - decompose value functions into simple components
  - include internal state (e.g., "goals") that further simplifies value functions, shaping rewards
- Concurrency:
  - simplifies the description of multieffector behavior
  - messes up temporal decomposition and credit assignment (but threadwise reward decomposition restores it)

Outline
- Hierarchical RL with abstract actions
- Hierarchical RL with partial programs
- Efficient hierarchical planning

Abstract Lookahead
- k-step lookahead >> 1-step lookahead, e.g., in chess
- k-step lookahead is no use if the steps are too small, e.g., the first k characters of a NIPS paper
- Abstract plans (high-level executions of partial programs) are shorter
  - Much shorter plans for given goals => exponential savings
  - Can look ahead much further into the future
- [Figure: a concrete chess variation (Kc3 Rxa1 Qa4 c2 Kxa1) alongside a refinement tree over abstract actions a and b (a, b, aa, ab, ba, bb, ..., aabb)]

High-level actions
- Start with a restricted, classical HTN form of partial program
- A high-level action (HLA):
  - has a set of possible immediate refinements into sequences of actions, primitive or high-level
  - each refinement may have a precondition on its use
- Executable plan space = all primitive refinements of Act
- [Figure: [Act] refines to [GoSFO, Act] or [GoOAK, Act]; [GoSFO, Act] refines further to [WalkToBART, BARTtoSFO, Act]]

Lookahead with HLAs
- To do offline or online planning with HLAs, we need a model
- See the classical HTN planning literature... not!
- Drew McDermott (AI Magazine, 2000): "The semantics of hierarchical planning have never been clarified... no one has ever figured out how to reconcile the semantics of hierarchical plans with the semantics of primitive actions"

Downward refinement
- Downward refinement property (DRP): every apparently successful high-level plan has a successful primitive refinement
- The Holy Grail of HTN planning: the DRP allows commitment to abstract plans without backtracking
- Bacchus and Yang, 1991: it is naïve to expect the DRP to hold in general
- MRW, 2007, Theorem: if the assertions made about the HLAs are true, then the DRP always holds
- Problem: how can we say true things about the effects of HLAs?

Angelic semantics for HLAs
- Begin with the atomic state-space view: start in s0, with goal state g
- The central idea is the reachable set of an HLA from each state
- Extended to sequences of actions, reachable sets allow proving that a plan can or cannot possibly reach the goal
- This may seem related to nondeterminism, but the nondeterminism is angelic: the uncertainty will be resolved by the agent, not by an adversary or by nature
- [Figure: a state space S with start state s0 and goal g; the reachable sets of h1, of h2, and of the sequence h1 h2 are shown, so that "h1 h2 is a solution"; one primitive refinement a1 a2 a3 a4 is traced through the state space]

Technical development
- NCSTRIPS to describe reachable sets
  - STRIPS add/delete lists, plus "possibly add", "possibly delete", etc.
  - E.g., GoSFO adds AtSFO, possibly adds CarAtSFO
- Reachable sets may be too hard to describe exactly
  - Upper and lower descriptions bound the exact reachable set (and its cost) above and below
  - They still support proofs of plan success/failure
  - "Possibly successful" plans must be refined further
- Sound, complete, optimal offline planning algorithms
- Complete, eventually-optimal online (real-time) search

Example: Warehouse World
- Similar to the blocks and taxi domains, but with more choices and constraints:
  - the gripper must stay in bounds
  - it can't pass through blocks
  - it can only turn around at the top row
- Goal: have C on T4; the gripper can't just move C there directly
- The final plan has 22 steps: Left, Down, Pickup, Up, Turn, Down, Putdown, Right, Right, Down, Pickup, Left, Put, Up, Left, Pickup, Up, Turn, Right, Down, Down, Putdown

Representing descriptions: NCSTRIPS
- Assume S = the set of truth assignments to a set of propositions
- Descriptions specify the propositions (possibly) added or deleted by an HLA
- Navigate(x_t, y_t) (precondition: At(x_s, y_s)):
  - Upper: -At(x_s, y_s), +At(x_t, y_t), ±FacingRight
  - Lower: IF Free(x_t, y_t) ∧ ∃x Free(x, y_max) THEN -At(x_s, y_s), +At(x_t, y_t), ±FacingRight ELSE nil
- An efficient algorithm exists to progress state sets (represented as DNF formulae) through descriptions

Experiment
- Instance 3: a 5x8 world with a 90-step plan
- Flat search and hierarchical search without descriptions did not terminate within 10,000 seconds

Online Search: Preliminary Results
- [Plot: steps to reach the goal; the transcript breaks off here]
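Loosely illustrating the angelic-semantics machinery above, the sketch below gives each HLA an upper (optimistic) and a lower (pessimistic) description of its reachable set, progresses a set of states through a high-level plan, and classifies the plan as provably successful, possibly successful, or provably failing. Everything here is a toy: states are plain integers held in Python sets rather than NCSTRIPS propositions with DNF state-set formulae, and the HLA names and helper functions are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable

State = int   # toy atomic states; the real system uses truth assignments to propositions

@dataclass
class HLA:
    """A high-level action with upper/lower bounds on its reachable sets."""
    name: str
    upper: Callable[[State], FrozenSet[State]]   # superset of the truly reachable states
    lower: Callable[[State], FrozenSet[State]]   # subset of the truly reachable states

def progress(states: Iterable[State],
             desc: Callable[[State], FrozenSet[State]]) -> FrozenSet[State]:
    """Progress a set of states through one description (upper or lower)."""
    out = set()
    for s in states:
        out |= desc(s)
    return frozenset(out)

def evaluate_plan(plan, s0: State, goal: FrozenSet[State]) -> str:
    """Angelic evaluation: the agent, not an adversary, resolves the nondeterminism."""
    upper_set = lower_set = frozenset({s0})
    for hla in plan:
        upper_set = progress(upper_set, hla.upper)
        lower_set = progress(lower_set, hla.lower)
    if lower_set & goal:
        return "provably successful: commit and refine downward"
    if upper_set & goal:
        return "possibly successful: must refine to find out"
    return "provably unsuccessful: prune"

# Toy usage: a "Go" HLA that can certainly advance one step and might advance two.
go = HLA("Go",
         upper=lambda s: frozenset({s + 1, s + 2}),
         lower=lambda s: frozenset({s + 1}))
print(evaluate_plan([go, go], s0=0, goal=frozenset({2})))   # -> provably successful
```

This is the sense in which true assertions about HLAs restore the downward refinement property: a plan whose lower-bound reachable set hits the goal is guaranteed to have a successful primitive refinement, so a planner can commit to it without backtracking, while plans whose upper-bound set misses the goal can be pruned outright.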