Planning with Uncertain Specifications
Ankit Shah, Julie Shah
{ajshah, arnoldj}@mit.edu
RSS 2019 Workshop on Combining Learning with Reasoning
Visit our website for more! http://interactive.mit.edu

Formulation
Given:
• 𝒙 ∈ 𝑿: the learner's state representation.
• 𝜶 = 𝑓(𝒙); 𝜶 ∈ {0,1}^n_prop: the learner's labeling function and task propositions.
• 𝑨: the learner's set of available actions.
• 𝑃(𝜑): the task specification, expressed as a belief over formulas with support {𝜑}.
Expected output:
• 𝜋_{𝑃(𝜑)}(𝒙): a stochastic policy that best satisfies 𝑃(𝜑).
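To make the formulation concrete, here is a minimal sketch of the interface in Python. The type names and the plain-text rendering of the formulas are illustrative assumptions, not part of the poster.

```python
from typing import Callable, Dict, Tuple

State = Tuple[float, ...]        # x in X: the learner's state representation
PropVector = Tuple[bool, ...]    # alpha in {0,1}^n_prop

# Belief over candidate LTL formulas: each formula maps to its probability P(phi).
# These are the poster's three candidates, written in plain-text LTL syntax.
belief: Dict[str, float] = {
    "G !T0 & F W3": 0.80,
    "G !T0 & F W1 & F W2 & F W3": 0.15,
    "G !T0 & F W3 & (!W3 U W2) & (!W2 U W1)": 0.05,
}

# f: X -> {0,1}^n_prop, the labeling function evaluating the task propositions.
LabelingFunction = Callable[[State], PropVector]

# Expected output: a stochastic policy pi_{P(phi)}(x), here a map action -> probability.
Policy = Callable[[State], Dict[str, float]]
```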

Motivation
• Linear temporal logic (LTL) formulas are an expressive means of specifying non-Markovian tasks.
• Prior research relies on LTL-to-automaton compilation for planning; however, this is restricted to a single LTL formula.
• In many applications there is inherent uncertainty in the specification [1],[2].
• In general, specifications are expressed as a belief 𝑃(𝜑) over support {𝜑}.
Question 1: What does satisfying a belief 𝑃(𝜑) mean?
Question 2: How do we plan for a collection of LTL formulas {𝜑}?

Evaluation Criteria

• Most Likely: maximize 𝟙(𝜶 ⊨ 𝜑*), where 𝜑* = argmax_{𝜑 ∈ {𝜑}} 𝑃(𝜑). Satisfy only the most likely formula.
• Maximum Coverage: maximize (1/|{𝜑}|) Σ_{𝜑 ∈ {𝜑}} 𝟙(𝜶 ⊨ 𝜑). Satisfy the largest set of unique formulas.
• Minimum Regret: maximize Σ_{𝜑 ∈ {𝜑}} 𝑃(𝜑) 𝟙(𝜶 ⊨ 𝜑). Maximize satisfaction weighted by probability.
• Chance Constrained: maximize Σ_{𝜑 ∈ {𝜑_𝛿}} 𝑃(𝜑) 𝟙(𝜶 ⊨ 𝜑), where {𝜑_𝛿} is the set of likeliest formulas whose total probability is at least 1 − 𝛿, and 𝛿 is the maximum failure probability.
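A hedged sketch of the four criteria as scalar objectives over a finished execution. Here `satisfies(trace, phi)` stands in for a hypothetical finite-trace LTL checker; it is an assumed interface, not part of the poster.

```python
from typing import Dict

def most_likely(belief: Dict[str, float], trace, satisfies) -> float:
    """1[alpha |= phi*], with phi* = argmax_phi P(phi)."""
    phi_star = max(belief, key=belief.get)
    return float(satisfies(trace, phi_star))

def max_coverage(belief, trace, satisfies) -> float:
    """(1/|{phi}|) * sum_phi 1[alpha |= phi]."""
    return sum(satisfies(trace, phi) for phi in belief) / len(belief)

def min_regret(belief, trace, satisfies) -> float:
    """sum_phi P(phi) * 1[alpha |= phi]."""
    return sum(p * satisfies(trace, phi) for phi, p in belief.items())

def chance_constrained(belief, trace, satisfies, delta=0.1) -> float:
    """Like min_regret, but restricted to the likeliest formulas whose
    cumulative probability reaches 1 - delta."""
    kept, mass = [], 0.0
    for phi, p in sorted(belief.items(), key=lambda kv: -kv[1]):
        kept.append((phi, p))
        mass += p
        if mass >= 1.0 - delta:
            break
    return sum(p * satisfies(trace, phi) for phi, p in kept)
```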

Automata/MDP Compilation
Candidate formulas, with the state counts of their individual automata:
• 𝜑1 = 𝑮¬𝑇0 ∧ 𝑭𝑊3 ∧ (¬𝑊3 𝑼 𝑊2) ∧ (¬𝑊2 𝑼 𝑊1); 𝑃(𝜑1) = 0.05; 5 states.
• 𝜑2 = 𝑮¬𝑇0 ∧ 𝑭𝑊1 ∧ 𝑭𝑊2 ∧ 𝑭𝑊3; 𝑃(𝜑2) = 0.15; 9 states.
• 𝜑3 = 𝑮¬𝑇0 ∧ 𝑭𝑊3; 𝑃(𝜑3) = 0.8; 3 states.
Composite automaton ℳ_𝜑 = ⟨{⟨𝜑'⟩}, {0,1}^n_prop, 𝑇_𝜑, 𝑅⟩:
• Naïve cross-product: 135 states (5 × 9 × 3).
• Minimal automaton: 11 states.
• Determines task success and reward.
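The state-count arithmetic can be checked with a reachability-pruned product. The sketch below assumes each automaton object exposes `initial` and a deterministic `step(state, alpha)`; this interface is hypothetical. Reachability pruning alone can discard many of the 135 naive joint states; full DFA minimization (e.g., Hopcroft's algorithm) would yield the 11-state minimal machine the poster reports.

```python
from collections import deque
from itertools import product

def reachable_product(automata, n_prop):
    """BFS over the joint state space of several DFAs; keeps only the
    joint states reachable from the joint initial state."""
    alphabet = list(product([False, True], repeat=n_prop))  # {0,1}^n_prop
    start = tuple(a.initial for a in automata)
    seen, frontier = {start}, deque([start])
    while frontier:
        joint = frontier.popleft()
        for alpha in alphabet:
            nxt = tuple(a.step(s, alpha) for a, s in zip(automata, joint))
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen  # naive bound here would be 5 * 9 * 3 = 135 joint states
```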

Environment MDP ℳ_env = ⟨𝑿, 𝑨, 𝑇_env⟩:
• Determines available actions.
• Determines changes to the environment state.

Cross product ℳ_𝜑 × ℳ_env = ℳ_spec:
ℳ_spec = ⟨{⟨𝜑'⟩} × 𝑿, 𝑨, 𝑇_spec, 𝑅⟩
𝑇_spec(⟨⟨𝜑'_1⟩, 𝑥_1⟩, ⟨⟨𝜑'_2⟩, 𝑥_2⟩) = 𝑇_𝜑(⟨𝜑'_1⟩, ⟨𝜑'_2⟩) × 𝑇_env(𝑥_1, 𝑥_2)
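Because the automaton update is deterministic once the label of the next environment state is known, one step of ℳ_spec factors as below. `env_step`, `automaton_step`, and `labeling_function` are assumed interfaces mirroring 𝑇_env, 𝑇_𝜑, and 𝑓; this is a sketch, not the authors' implementation.

```python
def step_spec(phi_state, x, action, env_step, automaton_step, labeling_function):
    """One transition of the compiled MDP M_spec = M_phi x M_env:
    the environment moves stochastically under T_env, then the composite
    automaton advances on the label alpha = f(x') of the new state."""
    x_next = env_step(x, action)                  # sample x' ~ T_env(x, action, .)
    alpha = labeling_function(x_next)             # alpha = f(x')
    phi_next = automaton_step(phi_state, alpha)   # deterministic T_phi update
    return phi_next, x_next
```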

Discussion

• MDP compilation admits formulas of the Obligation class of temporal properties.
• Any RL algorithm can be used to solve the compiled MDP, but exploration-vs-exploitation considerations remain important; see the sketch below.
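As an illustration only (Q-learning is one possible choice, not the poster's prescribed method), a minimal tabular loop over (automaton state, environment state) pairs. The `env` interface (`reset`, `actions`, `step`) is a hypothetical wrapper around ℳ_spec.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=5000, lr=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning on the compiled MDP M_spec; states are
    (automaton_state, env_state) pairs and rewards come from R."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            acts = env.actions(s)
            if random.random() < eps:                     # epsilon-greedy exploration
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(s, a)
            target = r if done else r + gamma * max(Q[(s2, b)] for b in env.actions(s2))
            Q[(s, a)] += lr * (target - Q[(s, a)])
            s = s2
    return Q
```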

Future Work
• Algorithms that exploit the composition of ℳ_𝜑 and ℳ_env.
• Scaffolding of reward based on automaton structure.
• Allowing temporal properties like Recurrence, Persistence, and Reactivity.

Results
[Figure: sample task executions under the Most Likely, Max Coverage/Min Regret, and Chance Constrained (𝛿 = 0.1) criteria, for different classes of belief distributions.]
The nature of the resulting task executions depends on:
• The nature of the belief distribution.
• The choice of evaluation criterion.
• The exploration strategy of the RL algorithm.

References
[1] Shah, A., Kamath, P., Shah, J. A., & Li, S. (2018). Bayesian inference of temporal task specifications from demonstrations. In Advances in Neural Information Processing Systems (pp. 3804-3813).
[2] Kim, J., Banks, C. J., & Shah, J. A. (2017). Collaborative planning with encoding of users' high-level strategies. In Thirty-First AAAI Conference on Artificial Intelligence.