Learning to Improve the Quality of Plans Produced by Partial-order Planners
M. Afzal Upal
Intelligent Agents & Multiagent Systems Lab
Outline
- Artificial Intelligence Planning: Problems and Solutions
- Why Learn to Improve Plan Quality?
- The Performance Improving Partial-order planner (PIP)
- The Intra-solution Learning (ISL) algorithm
- Search-control vs. rewrite rules
- Empirical evaluation
- Conclusion
The Performance Task: Classical AI Planning
Given:
- Initial state
- Goals
- Actions: {up, down, left, right}
Find:
- A sequence of actions that achieves the goals when executed in the initial state, e.g., down(4), right(3), up(2)
[Figure: 8-puzzle initial and goal board configurations]
Automated Planning Systems
- Domain-independent planning systems: modular, sound, and complete
- Domain-dependent planning systems: practical, efficient, and produce high-quality plans
Domain-Independent Systems
- State-space search (each search node is a valid world state), e.g., PRODIGY, FF
- Partial-order plan-space search (each search node is a partially ordered plan), e.g., SNLP, UCPOP
- Graphplan-based search (a search node is a union of world states), e.g., STAN
- Compilation to general search: satisfiability engines, e.g., SATPLAN; constraint satisfaction engines, e.g., CPLAN
State-space vs Plan-space Planning
[Figure: side-by-side search trees for the 8-puzzle. The state-space tree expands world states via moves such as right(8), down(2), left(4), and up(6); the plan-space tree's nodes are partially ordered plans built backwards from END.]
Partial-order Plan-space Planning
Partial-order planning is the process of removing flaws: unresolved goals (open conditions) and pairs of unordered actions that cannot take place at the same time (threats)
Partial-order Plan-space Planning
Decouple the order in which actions are added during planning from the order in which they appear in the final plan
[Figure: a partial plan whose actions are numbered 4, 1, 2, 3 in the order they were added, illustrating that this order can differ from their order in the final plan.]
Learning to Improve Plan Quality for Partial-order Planners
- How to represent plan quality information? Extended STRIPS operators + a value function.
- How to identify learning opportunities (there are no planning failures or successes to learn from)? Assume a better-quality model plan for a given problem is available (from a domain expert or through a more extensive automated search of the problem's search space).
- What search features should the quality-improving search-control knowledge be based on?
The Logistics Transportation Domain
Initial state: {at-object(parcel, postoffice), at-truck(truck1, postoffice), at-plane(plane1, airport)}
Goals: {at-object(parcel, airport)}
STRIPS encoding of the Logistics Transportation Domain
LOAD-TRUCK(Object, Truck, Location)
  Preconditions: {at-object(Object, Location), at-truck(Truck, Location)}
  Effects: {in(Object, Truck), not(at-object(Object, Location))}

DRIVE-TRUCK(Truck, From, To)
  Preconditions: {at-truck(Truck, From), same-city(From, To)}
  Effects: {at-truck(Truck, To), not(at-truck(Truck, From))}

UNLOAD-TRUCK(Object, Truck, Location)
  Preconditions: {in(Object, Truck), at-truck(Truck, Location)}
  Effects: {at-object(Object, Location), not(in(Object, Truck))}
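As a minimal sketch of how such operators can be encoded (illustrative Python, not PIP's actual implementation; ground literals are represented as plain strings), each action carries a precondition set, an add set, and a delete set:

```python
# Illustrative encoding of the STRIPS operators above (not PIP's code).
from typing import NamedTuple

class Action(NamedTuple):
    name: str
    preconditions: frozenset
    add_effects: frozenset   # positive effects
    del_effects: frozenset   # the not(...) effects

def applicable(state: frozenset, action: Action) -> bool:
    # Every precondition must hold in the current state.
    return action.preconditions <= state

def apply(state: frozenset, action: Action) -> frozenset:
    assert applicable(state, action)
    return (state - action.del_effects) | action.add_effects

# Ground instance of LOAD-TRUCK(o1, tr1, lax):
load_truck = Action(
    "load-truck(o1,tr1,lax)",
    preconditions=frozenset({"at-object(o1,lax)", "at-truck(tr1,lax)"}),
    add_effects=frozenset({"in(o1,tr1)"}),
    del_effects=frozenset({"at-object(o1,lax)"}),
)
state = frozenset({"at-object(o1,lax)", "at-truck(tr1,lax)"})
print(apply(state, load_truck))  # {'in(o1,tr1)', 'at-truck(tr1,lax)'}
```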
PR-STRIPS (similar to PDDL 2.1, level 2)
- A state is described using propositional as well as metric attributes (which specify the levels of the resources in that state).
- An action can have propositional as well as metric effects (functions that specify the amounts of resources the action consumes).
- A value function specifies the relative importance of the amount of each resource consumed and defines plan quality as a function of the amounts of resources consumed by all actions in the plan.
PR-STRIPS encoding of the Logistics Transportation Domain
LOAD-TRUCK(Object, Truck, Location)
  Preconditions: {at-object(Object, Location), at-truck(Truck, Location)}
  Effects: {in(Object, Truck), not(at-object(Object, Location)), time(-0.5), money(-5)}

DRIVE-TRUCK(Truck, From, To)
  Preconditions: {at-truck(Truck, From), same-city(From, To)}
  Effects: {at-truck(Truck, To), not(at-truck(Truck, From)), time(-0.02*distance(From, To)), money(-distance(From, To))}

UNLOAD-TRUCK(Object, Truck, Location)
  Preconditions: {in(Object, Truck), at-truck(Truck, Location)}
  Effects: {at-object(Object, Location), not(in(Object, Truck)), time(-0.5), money(-5)}
PR-STRIPS encoding of the Logistics Transportation Domain
LOAD-PLANE(Object, Plane, Location)
  Preconditions: {at-object(Object, Location), at-plane(Plane, Location)}
  Effects: {in(Object, Plane), not(at-object(Object, Location)), time(-0.5), money(-5)}

FLY-PLANE(Plane, From, To)
  Preconditions: {at-plane(Plane, From), airport(To)}
  Effects: {at-plane(Plane, To), not(at-plane(Plane, From)), time(-0.02*distance(From, To)), money(-distance(From, To))}

UNLOAD-PLANE(Object, Plane, Location)
  Preconditions: {in(Object, Plane), at-plane(Plane, Location)}
  Effects: {at-object(Object, Location), not(in(Object, Plane)), time(-0.5), money(-5)}
PR-STRIPS encoding of the Logistics Transportation Domain
Quality(Plan) = 1/ (2*time-used(Plan) + 5*money-used(Plan))
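A minimal sketch of this value function (illustrative Python; the action format is an assumption, not PIP's representation): each action records its metric effects as negative deltas, and plan quality is the reciprocal of the weighted total consumption:

```python
# Illustrative sketch of Quality(Plan) = 1/(2*time-used + 5*money-used).
WEIGHTS = {"time": 2.0, "money": 5.0}  # relative importance of resources

def plan_quality(plan):
    """plan: list of actions, each a dict whose 'metric_effects' map
    holds negative deltas such as time(-0.5) and money(-5)."""
    used = {resource: 0.0 for resource in WEIGHTS}
    for action in plan:
        for resource, delta in action["metric_effects"].items():
            used[resource] -= delta  # deltas are negative: consumption
    return 1.0 / sum(WEIGHTS[r] * used[r] for r in WEIGHTS)

# One drive-truck(tr1, lax, sjc) step with distance(lax, sjc) = 250:
drive = {"metric_effects": {"time": -0.02 * 250, "money": -250.0}}
print(plan_quality([drive]))  # 1 / (2*5 + 5*250) ≈ 0.00079
```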
The Learning Problem
Given:
- A planning problem (goals, initial state, and initial resource levels)
- Domain knowledge (actions, plan quality knowledge)
- A partial-order planner
- A model plan for the given problem
Find:
- Domain-specific rules that the given planner can use to produce better quality plans than it would have produced had it not learned those rules.
Solution: The Intra-solution Learning Algorithm
1. Find a learning opportunity
2. Choose the relevant information and ignore the rest
3. Generalize the relevant information using a generalization theory
Phase 1: Find a Learning Opportunity
1. Generate the system's default plan and default planning trace by running the given partial-order planner on the given problem.
2. Compare the default plan with the model plan. If the model plan is not of higher quality, there is nothing to learn from this problem; return to Step 1 with another problem.
3. Infer the planning decisions that produced the model plan.
4. Compare the inferred model planning trace with the default planning trace to identify the decision points where the two traces differ. These are the conflicting choice points.
[Figure: the model trace and the system's planning trace share a set of common nodes; they diverge at the conflicting choice points.]
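A sketch of step 4 under the simplifying assumption that each trace has been linearized into a sequence of decisions (PIP's traces are richer than this; the decision strings below are hypothetical):

```python
# Walk both decision sequences in parallel; the first disagreement is a
# conflicting choice point, i.e., a potential learning opportunity.
def first_conflict(default_trace, model_trace):
    common = 0
    for d, m in zip(default_trace, model_trace):
        if d != m:
            break
        common += 1
    # Decisions before `common` are the shared nodes of the two traces.
    return default_trace[common:], model_trace[common:]

default = ["add:START-END", "add:unload-truck(o1)", "add:drive-truck()"]
model   = ["add:START-END", "add:unload-plane(o1)", "add:fly-plane()"]
print(first_conflict(default, model))
# (['add:unload-truck(o1)', 'add:drive-truck()'],
#  ['add:unload-plane(o1)', 'add:fly-plane()'])
```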
Phase 2: Choose the Relevant Information
Examine the downstream planning traces, identifying relevant planning decisions using the following heuristics (the first is sketched in code below):
1. A planning decision to add an action Q is relevant if Q supplies a relevant condition to a relevant action.
2. A planning decision to establish an open condition is relevant if it binds an uninstantiated variable of a relevant open condition.
3. A planning decision to resolve a threat is relevant if all three actions involved are relevant.
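A sketch of the first heuristic, assuming causal links are given as (producer, condition, consumer) triples (an illustration, not PIP's implementation). Relevance propagates backwards from the goal action to every producer along the causal links, to a fixpoint:

```python
# Fixpoint computation: an add-action decision is relevant if its action
# supplies a condition to an action that is already relevant.
def relevant_actions(causal_links, seed="END"):
    relevant = {seed}
    changed = True
    while changed:
        changed = False
        for producer, _condition, consumer in causal_links:
            if consumer in relevant and producer not in relevant:
                relevant.add(producer)
                changed = True
    return relevant

links = [("unload-plane(o1)", "at-object(o1,sjc)", "END"),
         ("load-plane(o1)", "in(o1,p1)", "unload-plane(o1)"),
         ("fly-plane()", "at-plane(p1,sjc)", "unload-plane(o1)")]
print(relevant_actions(links))
# {'END', 'unload-plane(o1)', 'load-plane(o1)', 'fly-plane()'}
```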
Phase 3: Generalize the Relevant Information
Generalize the relevant information using a generalization theory: replace all constants with variables (a sketch follows).
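A sketch of this generalization step, assuming each planning decision is a (name, constant, …) tuple; repeated constants map to the same variable, so structure shared across the chain is preserved:

```python
# Replace every constant with a variable, reusing variables so that the
# same constant generalizes to the same variable everywhere in the chain.
def generalize(decisions):
    mapping = {}  # constant -> variable, shared across the whole chain
    def var(const):
        if const not in mapping:
            mapping[const] = f"V{len(mapping) + 1}"
        return mapping[const]
    return [(name, *map(var, args)) for name, *args in decisions]

chain = [("unload-plane", "o1", "p1", "sjc"),
         ("fly-plane", "p1", "lax", "sjc"),
         ("load-plane", "o1", "p1", "lax")]
print(generalize(chain))
# [('unload-plane', 'V1', 'V2', 'V3'),
#  ('fly-plane', 'V2', 'V4', 'V3'),
#  ('load-plane', 'V1', 'V2', 'V4')]
```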
An Example Logistics Problem
Initial-state: {at-object(o1, lax),
at-object(o2, lax),
at-truck(tr1, lax),
at-plane(p1, lax),
airport(sjc),
distance(lax, sjc)=250,
time=0,
money=500}
Goals: {at-object(o1, sjc),
at-object(o2, sjc)}
Generate the System's Default Plan and Default Planning Trace
Use the given planner to generate the system's default planning trace (an ordered constraint set):
- Each add-step/establishment decision adds a causal link and an ordering constraint.
- Each threat-resolution decision adds an ordering constraint.
Each numbered decision adds constraints (causal links are written producer --condition--> consumer):
1. START ‹ END
2. unload-truck() ‹ END, with causal link unload-truck(o1,Tr,sjc) --at-object(o1,sjc)--> END
3. load-truck() ‹ unload-truck(), with causal link load-truck(o1,Tr,sjc) --in(o1,Tr)--> unload-truck(o1,Tr,sjc)
4. drive-truck() ‹ unload-truck(), with causal link drive-truck(Tr,X,sjc) --at-truck(Tr,sjc)--> unload-truck(o1,Tr,sjc)
5. …
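A sketch of the constraint set being built in the trace above (hypothetical types; the slides do not show PIP's internal representation): establishments add a causal link plus an ordering, threat resolutions add an ordering only:

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    orderings: list = field(default_factory=list)     # (before, after)
    causal_links: list = field(default_factory=list)  # (producer, cond, consumer)

    def establish(self, producer, condition, consumer):
        # Add-step/establishment: causal link + ordering constraint.
        self.causal_links.append((producer, condition, consumer))
        self.orderings.append((producer, consumer))

    def resolve_threat(self, before, after):
        # Threat resolution: ordering constraint only.
        self.orderings.append((before, after))

trace = Trace()
trace.establish("unload-truck(o1,Tr,sjc)", "at-object(o1,sjc)", "END")
trace.establish("drive-truck(Tr,X,sjc)", "at-truck(Tr,sjc)",
                "unload-truck(o1,Tr,sjc)")
```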
Compare System’s Default Plan with the Model Plan
System's Default Plan:
  load-truck(o1, tr1, lax), load-truck(o2, tr1, lax), drive-truck(tr1, lax, sjc), unload-truck(o1, tr1, sjc), unload-truck(o2, tr1, sjc)

Model Plan:
  load-plane(o1, p1, lax), load-plane(o2, p1, lax), fly-plane(p1, lax, sjc), unload-plane(o1, p1, sjc), unload-plane(o2, p1, sjc)
Infer the Unordered Model Constraint Set
The causal links of the model plan (producer --condition--> consumer):
For the goal at-object(o1,sjc):
- unload-plane(o1,p1,sjc) --at-object(o1,sjc)--> END
- load-plane(o1,p1,lax) --in(o1,p1)--> unload-plane(o1,p1,sjc)
- fly-plane(p1,lax,sjc) --at-plane(p1,sjc)--> unload-plane(o1,p1,sjc)
- START --at-plane(p1,lax)--> load-plane(o1,p1,lax)
- START --at-plane(p1,lax)--> fly-plane(p1,lax,sjc)
- START --at-object(o1,lax)--> load-plane(o1,p1,lax)
For the goal at-object(o2,sjc):
- unload-plane(o2,p1,sjc) --at-object(o2,sjc)--> END
- load-plane(o2,p1,lax) --in(o2,p1)--> unload-plane(o2,p1,sjc)
- fly-plane(p1,lax,sjc) --at-plane(p1,sjc)--> unload-plane(o2,p1,sjc)
- START --at-plane(p1,lax)--> load-plane(o2,p1,lax)
- START --at-plane(p1,lax)--> fly-plane(p1,lax,sjc)
- START --at-object(o2,lax)--> load-plane(o2,p1,lax)
Compare the two Planning Traces to Identify Learning Opportunities
Both traces begin with START ‹ END and the open condition at-object(o1,sjc). They then diverge:
- Default trace: START ‹ END, unload-truck(o1,tr1,sjc) ‹ END, with causal link unload-truck(o1,tr1,sjc) --at-object(o1,sjc)--> END
- Model trace: START ‹ END, unload-plane(o1,p1,sjc) ‹ END, with causal link unload-plane(o1,p1,sjc) --at-object(o1,sjc)--> END
This divergence is a learning opportunity.
Choose the Relevant Planning Decisions
Learning opportunity: add-actions: START-END (the node where the two traces diverge).
Relevant decisions:
- Model trace: add-action: unload-plane(o1) → add-action: fly-plane() → add-action: load-plane(o1)
- Default trace: add-action: unload-truck(o1) → add-action: drive-truck() → add-action: load-truck(o1)
Irrelevant decisions:
- Model trace: add-action: unload-plane(o2) → add-action: load-plane(o2)
- Default trace: add-action: drive-truck() → add-action: load-truck(o2)
Generalize the Relevant Planning Decision Chains
add-actions: START-END
- Model chain: add-action: unload-plane(O, P) → add-action: fly-plane(P, X, Y) → add-action: load-plane(O, P)
- Default chain: add-action: unload-truck(O, T) → add-action: drive-truck(T, X, Y) → add-action: load-truck(O, T)
(Here O is the object, P the plane, T the truck, and X, Y are locations.)
In What Form Should the Learned Knowledge be Stored?
Rewrite Rule
To-be-replaced actions: {load-truck(O,T,X), drive-truck(T,X,Y), unload-truck(O,T,Y)}
Replacing actions: {load-plane(O,P,X), fly-plane(P,X,Y), unload-plane(O,P,Y)}
Search-Control Rule
Given the goal {at-object(O,Y)} to resolve, the effects {at-truck(T,X), at-plane(P,X), airport(Y)}, and distance(X,Y) > 100,
prefer the planning decisions
{add-step(unload-plane(O,P,Y)), add-step(load-plane(O,P,X)), add-step(fly-plane(P,X,Y))}
over the planning decisions
{add-step(unload-truck(O,T,Y)), add-step(load-truck(O,T,X)), add-step(drive-truck(T,X,Y))}
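One way such a preference rule might be applied during search, as an illustrative sketch (the rule object with its .matches and .preferred members is hypothetical; PIP's matcher also unifies variables and checks the numeric distance condition):

```python
# Reorder candidate planning decisions so that the ones a matching
# search-control rule prefers are expanded before the others.
def order_decisions(candidates, rules, goal, state):
    for rule in rules:
        if rule.matches(goal, state):  # e.g., distance(X,Y) > 100 holds
            preferred = [c for c in candidates if c in rule.preferred]
            others = [c for c in candidates if c not in rule.preferred]
            return preferred + others
    return candidates  # no rule fired: keep the planner's default order
```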
Search-Control Knowledge
A heuristic function that provides an estimate of the quality of the plan a node is expected to lead to.
[Figure: a search tree rooted at root; each node n is annotated with an estimated quality, e.g., quality=8, quality=4, quality=2.]
Rewrite Rules
- A rewrite rule is a 2-tuple ‹to-be-replaced-subplan, replacing-subplan›.
- Used after search has produced a complete plan, to rewrite it into a higher-quality plan.
- Only useful in domains where it is possible to efficiently produce a low-quality plan but hard to produce a higher-quality plan.
- E.g., to-be-replaced subplan: A4, A5; replacing subplan: B1.
Planning by Rewriting
[Figure: a complete plan A1 … A6 in which the subplan A4, A5 is rewritten as B1.]
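A sketch of applying a learned rewrite rule after planning, under the simplifying assumption that the completed plan is a totally ordered action list and the rule's subplans are contiguous (real plan rewriting matches subplans inside a partial order and unifies variables):

```python
# Replace the first occurrence of the to-be-replaced subplan with the
# replacing subplan; return the plan unchanged if there is no match.
def rewrite(plan, to_be_replaced, replacing):
    n = len(to_be_replaced)
    for i in range(len(plan) - n + 1):
        if plan[i:i + n] == to_be_replaced:
            return plan[:i] + replacing + plan[i + n:]
    return plan

plan = ["load-truck(o1,t1,lax)", "drive-truck(t1,lax,sjc)",
        "unload-truck(o1,t1,sjc)"]
truck_leg = ["load-truck(o1,t1,lax)", "drive-truck(t1,lax,sjc)",
             "unload-truck(o1,t1,sjc)"]
plane_leg = ["load-plane(o1,p1,lax)", "fly-plane(p1,lax,sjc)",
             "unload-plane(o1,p1,sjc)"]
print(rewrite(plan, truck_leg, plane_leg))
```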
Empirical Evaluation I: What Form Should the Learned Knowledge be Stored in?
Perform empirical experiments to compare the performance of a version of PIP that learns search-control rules (Sys-search-control) with a version that learns rewrite rules (Sys-rewrite).
- Both Sys-rewrite-first and Sys-rewrite-best perform up to two rewritings.
- At each rewriting, Sys-rewrite-first randomly chooses one of the applicable rewrite rules, whereas Sys-rewrite-best applies all applicable rewrite rules to try every way of rewriting the plan.
Experimental Set-up
- Three benchmark planning domains: logistics, softbot, and process planning.
- Randomly generate 120 unique problem instances.
- Train Sys-search-control and Sys-rewrite on optimal-quality solutions for 20, 30, 40, and 60 examples and test them on the remaining examples (cross-validation).
- Plan quality is one minus the average distance of the plans generated by a system from the optimal-quality plans (see the sketch below).
- Planning efficiency is measured by counting the average number of new nodes generated by each system.
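A small sketch of the quality metric as read from the slide (the normalization by the optimal quality is an assumption; the slides do not spell out the distance measure):

```python
# "Plan quality" score: one minus the average normalized distance of the
# generated plans' qualities from the optimal plans' qualities.
def quality_score(generated, optimal):
    """generated, optimal: parallel lists of Quality(Plan) values."""
    distances = [(opt - gen) / opt for gen, opt in zip(generated, optimal)]
    return 1 - sum(distances) / len(distances)

print(quality_score([0.8, 0.5], [1.0, 1.0]))  # 1 - (0.2 + 0.5)/2 = 0.65
```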
Results
[Charts: plan quality (0-1.2) and average number of new nodes generated (0-160) versus number of training examples (0, 20, 30, 40, 60) for Sys-Search-control, Sys-Rewrite-first, and Sys-Rewrite-best across the Softbot, Logistics, and Process Planning domains.]
Conclusion I
- Both search-control and rewrite rules lead to improvements in plan quality.
- Rewrite rules incur a larger cost in lost planning efficiency than search-control rules.
- A mechanism is needed to distinguish good rules from bad rules and to forget the bad rules.
- Comparing planning traces seems to be a better technique for learning search-control rules than for learning rewrite rules.
- Alternate strategies for learning rewrite rules need to be explored: comparing two completed plans of different quality, or static domain analysis.
Empirical Evaluation II: A Study of the Factors Affecting PIP’s Learning Performance
- Generated 25 abstract domains varying along a number of seemingly relevant dimensions:
  - instance similarity;
  - quality branching factor (the average number of alternative-quality solutions per problem);
  - association between the default planning bias and the quality bias.
- Are there any statistically significant differences in PIP's performance as each factor is varied (Student's t-test)?
Results
- PIP's learning leads to greater improvements in domains where:
  - the quality branching factor is large;
  - the planner's default biases are negatively correlated with the quality-improving heuristic function.
- There is no simple relationship between instance similarity and PIP's learning performance.
Conclusion II
- Need to address scale-up issues.
- Need to keep up with advances in AI planning technologies: "It is arguably more difficult to accelerate a new generation planner by outfitting it with learning as the overhead cost by the learning system can overwhelm the gains in search efficiency" (Kambhampati 2001).
- The problem is not the lack of a well-defined task!
- Organize a symposium/special issue on how to efficiently organize, retrieve, and forget learned knowledge.
- An open-source planning and learning software package?