Learning to Improve the Quality of Plans Produced by Partial-order Planners
M. Afzal Upal
Intelligent Agents & Multiagent Systems Lab
Outline
- Artificial Intelligence Planning: Problems and Solutions
- Why Learn to Improve Plan Quality?
- The Performance Improving Partial-order planner (PIP)
- The Intra-solution Learning (ISL) algorithm
- Search-control vs. rewrite rules
- Empirical evaluation
- Conclusion
The Performance Task: Classical AI Planning
Given:
- Initial state
- Goals
- Actions: {up, down, left, right}
Find:
- A sequence of actions that achieves the goals when executed in the initial state, e.g., down(4), right(3), up(2)
[Figure: 8-puzzle initial and goal board configurations]
Automated Planning Systems
- Domain-independent planning systems: modular, sound, and complete
- Domain-dependent planning systems: practical, efficient, and produce high-quality plans
Domain-Independent Systems
- State-space search (each search node is a valid world state), e.g., PRODIGY, FF
- Partial-order plan-space search (each search node is a partially ordered plan), e.g., SNLP, UCPOP
- Graphplan-based search (a search node is a union of world states), e.g., STAN
- Compilation to general search: satisfiability engines, e.g., SATPLAN; constraint satisfaction engines, e.g., CPLAN
State-space vs Plan-space Planning
[Figure: side-by-side search trees for the 8-puzzle. The state-space tree expands world states via moves such as right(8), down(2), left(4), and up(6); the plan-space tree's nodes are partially ordered plans built backwards from END.]
Partial-order Plan-space Planning
Partial-order planning is the process of removing flaws: unresolved goals (open conditions) and pairs of unordered actions that cannot take place at the same time (threats)
Partial-order Plan-space Planning
Decouple the order in which actions are added during planning from the order in which they appear in the final plan
[Figure: a partial plan whose actions are numbered 4, 1, 2, 3 in the order they were added, illustrating that this order can differ from their order in the final plan.]
Learning to Improve Plan Quality for Partial-order Planners
- How to represent plan quality information? Extended STRIPS operators + a value function.
- How to identify learning opportunities (there are no planning failures or successes to learn from)? Assume a better-quality model plan for a given problem is available (from a domain expert or through a more extensive automated search of the problem's search space).
- What search features should the quality-improving search-control knowledge be based on?
The Logistics Transportation Domain
Initial state: {at-object(parcel, postoffice), at-truck(truck1, postoffice), at-plane(plane1, airport)}
Goals: {at-object(parcel, airport)}
STRIPS encoding of the Logistics Transportation Domain
LOAD-TRUCK(Object, Truck, Location)
  Preconditions: {at-object(Object, Location), at-truck(Truck, Location)}
  Effects: {in(Object, Truck), not(at-object(Object, Location))}

DRIVE-TRUCK(Truck, From, To)
  Preconditions: {at-truck(Truck, From), same-city(From, To)}
  Effects: {at-truck(Truck, To), not(at-truck(Truck, From))}

UNLOAD-TRUCK(Object, Truck, Location)
  Preconditions: {in(Object, Truck), at-truck(Truck, Location)}
  Effects: {at-object(Object, Location), not(in(Object, Truck))}
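As a minimal sketch of how such operators can be encoded (illustrative Python, not PIP's actual implementation; ground literals are represented as plain strings), each action carries a precondition set, an add set, and a delete set:

```python
# Illustrative encoding of the STRIPS operators above (not PIP's code).
from typing import NamedTuple

class Action(NamedTuple):
    name: str
    preconditions: frozenset
    add_effects: frozenset   # positive effects
    del_effects: frozenset   # the not(...) effects

def applicable(state: frozenset, action: Action) -> bool:
    # Every precondition must hold in the current state.
    return action.preconditions <= state

def apply(state: frozenset, action: Action) -> frozenset:
    assert applicable(state, action)
    return (state - action.del_effects) | action.add_effects

# Ground instance of LOAD-TRUCK(o1, tr1, lax):
load_truck = Action(
    "load-truck(o1,tr1,lax)",
    preconditions=frozenset({"at-object(o1,lax)", "at-truck(tr1,lax)"}),
    add_effects=frozenset({"in(o1,tr1)"}),
    del_effects=frozenset({"at-object(o1,lax)"}),
)
state = frozenset({"at-object(o1,lax)", "at-truck(tr1,lax)"})
print(apply(state, load_truck))  # {'in(o1,tr1)', 'at-truck(tr1,lax)'}
```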
PR-STRIPS (similar to PDDL 2.1, level 2)
- A state is described using propositional as well as metric attributes (which specify the levels of the resources in that state).
- An action can have propositional as well as metric effects (functions that specify the amounts of resources the action consumes).
- A value function specifies the relative importance of the amount of each resource consumed and defines plan quality as a function of the amounts of resources consumed by all actions in the plan.
PR-STRIPS encoding of the Logistics Transportation Domain
LOAD-TRUCK(Object, Truck, Location)
  Preconditions: {at-object(Object, Location), at-truck(Truck, Location)}
  Effects: {in(Object, Truck), not(at-object(Object, Location)), time(-0.5), money(-5)}

DRIVE-TRUCK(Truck, From, To)
  Preconditions: {at-truck(Truck, From), same-city(From, To)}
  Effects: {at-truck(Truck, To), not(at-truck(Truck, From)), time(-0.02*distance(From, To)), money(-distance(From, To))}

UNLOAD-TRUCK(Object, Truck, Location)
  Preconditions: {in(Object, Truck), at-truck(Truck, Location)}
  Effects: {at-object(Object, Location), not(in(Object, Truck)), time(-0.5), money(-5)}
PR-STRIPS encoding of the Logistics Transportation Domain
LOAD-PLANE(Object, Plane, Location)
  Preconditions: {at-object(Object, Location), at-plane(Plane, Location)}
  Effects: {in(Object, Plane), not(at-object(Object, Location)), time(-0.5), money(-5)}

FLY-PLANE(Plane, From, To)
  Preconditions: {at-plane(Plane, From), airport(To)}
  Effects: {at-plane(Plane, To), not(at-plane(Plane, From)), time(-0.02*distance(From, To)), money(-distance(From, To))}

UNLOAD-PLANE(Object, Plane, Location)
  Preconditions: {in(Object, Plane), at-plane(Plane, Location)}
  Effects: {at-object(Object, Location), not(in(Object, Plane)), time(-0.5), money(-5)}
PR-STRIPS encoding of the Logistics Transportation Domain
Quality(Plan) = 1/ (2*time-used(Plan) + 5*money-used(Plan))
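A minimal sketch of this value function (illustrative Python; the action format is an assumption, not PIP's representation): each action records its metric effects as negative deltas, and plan quality is the reciprocal of the weighted total consumption:

```python
# Illustrative sketch of Quality(Plan) = 1/(2*time-used + 5*money-used).
WEIGHTS = {"time": 2.0, "money": 5.0}  # relative importance of resources

def plan_quality(plan):
    """plan: list of actions, each a dict whose 'metric_effects' map
    holds negative deltas such as time(-0.5) and money(-5)."""
    used = {resource: 0.0 for resource in WEIGHTS}
    for action in plan:
        for resource, delta in action["metric_effects"].items():
            used[resource] -= delta  # deltas are negative: consumption
    return 1.0 / sum(WEIGHTS[r] * used[r] for r in WEIGHTS)

# One drive-truck(tr1, lax, sjc) step with distance(lax, sjc) = 250:
drive = {"metric_effects": {"time": -0.02 * 250, "money": -250.0}}
print(plan_quality([drive]))  # 1 / (2*5 + 5*250) ≈ 0.00079
```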
The Learning Problem
Given:
- A planning problem (goals, initial state, and initial resource levels)
- Domain knowledge (actions, plan quality knowledge)
- A partial-order planner
- A model plan for the given problem
Find:
- Domain-specific rules that the given planner can use to produce better quality plans than it would have produced had it not learned those rules.
Solution: The Intra-solution Learning Algorithm
1. Find a learning opportunity
2. Choose the relevant information and ignore the rest
3. Generalize the relevant information using a generalization theory
Phase 1: Find a Learning Opportunity
1. Generate the system's default plan and default planning trace by running the given partial-order planner on the given problem.
2. Compare the default plan with the model plan. If the model plan is not of higher quality, there is nothing to learn from this problem; return to Step 1 with another problem.
3. Infer the planning decisions that produced the model plan.
4. Compare the inferred model planning trace with the default planning trace to identify the decision points where the two traces differ. These are the conflicting choice points.
[Figure: the model trace and the system's planning trace share a set of common nodes; they diverge at the conflicting choice points.]
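A sketch of step 4 under the simplifying assumption that each trace has been linearized into a sequence of decisions (PIP's traces are richer than this; the decision strings below are hypothetical):

```python
# Walk both decision sequences in parallel; the first disagreement is a
# conflicting choice point, i.e., a potential learning opportunity.
def first_conflict(default_trace, model_trace):
    common = 0
    for d, m in zip(default_trace, model_trace):
        if d != m:
            break
        common += 1
    # Decisions before `common` are the shared nodes of the two traces.
    return default_trace[common:], model_trace[common:]

default = ["add:START-END", "add:unload-truck(o1)", "add:drive-truck()"]
model   = ["add:START-END", "add:unload-plane(o1)", "add:fly-plane()"]
print(first_conflict(default, model))
# (['add:unload-truck(o1)', 'add:drive-truck()'],
#  ['add:unload-plane(o1)', 'add:fly-plane()'])
```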
Phase 2: Choose the Relevant Information
Examine the downstream planning traces, identifying relevant planning decisions using the following heuristics (the first is sketched in code below):
1. A planning decision to add an action Q is relevant if Q supplies a relevant condition to a relevant action.
2. A planning decision to establish an open condition is relevant if it binds an uninstantiated variable of a relevant open condition.
3. A planning decision to resolve a threat is relevant if all three actions involved are relevant.
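A sketch of the first heuristic, assuming causal links are given as (producer, condition, consumer) triples (an illustration, not PIP's implementation). Relevance propagates backwards from the goal action to every producer along the causal links, to a fixpoint:

```python
# Fixpoint computation: an add-action decision is relevant if its action
# supplies a condition to an action that is already relevant.
def relevant_actions(causal_links, seed="END"):
    relevant = {seed}
    changed = True
    while changed:
        changed = False
        for producer, _condition, consumer in causal_links:
            if consumer in relevant and producer not in relevant:
                relevant.add(producer)
                changed = True
    return relevant

links = [("unload-plane(o1)", "at-object(o1,sjc)", "END"),
         ("load-plane(o1)", "in(o1,p1)", "unload-plane(o1)"),
         ("fly-plane()", "at-plane(p1,sjc)", "unload-plane(o1)")]
print(relevant_actions(links))
# {'END', 'unload-plane(o1)', 'load-plane(o1)', 'fly-plane()'}
```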
Phase 3: Generalize the Relevant Information
Generalize the relevant information using a generalization theory: replace all constants with variables (a sketch follows).
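A sketch of this generalization step, assuming each planning decision is a (name, constant, …) tuple; repeated constants map to the same variable, so structure shared across the chain is preserved:

```python
# Replace every constant with a variable, reusing variables so that the
# same constant generalizes to the same variable everywhere in the chain.
def generalize(decisions):
    mapping = {}  # constant -> variable, shared across the whole chain
    def var(const):
        if const not in mapping:
            mapping[const] = f"V{len(mapping) + 1}"
        return mapping[const]
    return [(name, *map(var, args)) for name, *args in decisions]

chain = [("unload-plane", "o1", "p1", "sjc"),
         ("fly-plane", "p1", "lax", "sjc"),
         ("load-plane", "o1", "p1", "lax")]
print(generalize(chain))
# [('unload-plane', 'V1', 'V2', 'V3'),
#  ('fly-plane', 'V2', 'V4', 'V3'),
#  ('load-plane', 'V1', 'V2', 'V4')]
```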
An Example Logistics Problem
Initial-state: {at-object(o1, lax),
at-object(o2, lax),
at-truck(tr1, lax),
at-plane(p1, lax),
airport(sjc),
distance(lax, sjc)=250,
time=0,
money=500}
Goals: {at-object(o1, sjc),
at-object(o2, sjc)}
Generate the System's Default Plan and Default Planning Trace
Use the given planner to generate the system's default planning trace (an ordered constraint set):
- Each add-step/establishment decision adds a causal link and an ordering constraint.
- Each threat-resolution decision adds an ordering constraint.
Each numbered decision adds constraints (causal links are written producer --condition--> consumer):
1. START ‹ END
2. unload-truck() ‹ END, with causal link unload-truck(o1,Tr,sjc) --at-object(o1,sjc)--> END
3. load-truck() ‹ unload-truck(), with causal link load-truck(o1,Tr,sjc) --in(o1,Tr)--> unload-truck(o1,Tr,sjc)
4. drive-truck() ‹ unload-truck(), with causal link drive-truck(Tr,X,sjc) --at-truck(Tr,sjc)--> unload-truck(o1,Tr,sjc)
5. …
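A sketch of the constraint set being built in the trace above (hypothetical types; the slides do not show PIP's internal representation): establishments add a causal link plus an ordering, threat resolutions add an ordering only:

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    orderings: list = field(default_factory=list)     # (before, after)
    causal_links: list = field(default_factory=list)  # (producer, cond, consumer)

    def establish(self, producer, condition, consumer):
        # Add-step/establishment: causal link + ordering constraint.
        self.causal_links.append((producer, condition, consumer))
        self.orderings.append((producer, consumer))

    def resolve_threat(self, before, after):
        # Threat resolution: ordering constraint only.
        self.orderings.append((before, after))

trace = Trace()
trace.establish("unload-truck(o1,Tr,sjc)", "at-object(o1,sjc)", "END")
trace.establish("drive-truck(Tr,X,sjc)", "at-truck(Tr,sjc)",
                "unload-truck(o1,Tr,sjc)")
```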
Compare System’s Default Plan with the Model Plan
System's Default Plan:
  load-truck(o1, tr1, lax), load-truck(o2, tr1, lax), drive-truck(tr1, lax, sjc), unload-truck(o1, tr1, sjc), unload-truck(o2, tr1, sjc)

Model Plan:
  load-plane(o1, p1, lax), load-plane(o2, p1, lax), fly-plane(p1, lax, sjc), unload-plane(o1, p1, sjc), unload-plane(o2, p1, sjc)
Infer the Unordered Model Constraint Set
The causal links of the model plan (producer --condition--> consumer):
For the goal at-object(o1,sjc):
- unload-plane(o1,p1,sjc) --at-object(o1,sjc)--> END
- load-plane(o1,p1,lax) --in(o1,p1)--> unload-plane(o1,p1,sjc)
- fly-plane(p1,lax,sjc) --at-plane(p1,sjc)--> unload-plane(o1,p1,sjc)
- START --at-plane(p1,lax)--> load-plane(o1,p1,lax)
- START --at-plane(p1,lax)--> fly-plane(p1,lax,sjc)
- START --at-object(o1,lax)--> load-plane(o1,p1,lax)
For the goal at-object(o2,sjc):
- unload-plane(o2,p1,sjc) --at-object(o2,sjc)--> END
- load-plane(o2,p1,lax) --in(o2,p1)--> unload-plane(o2,p1,sjc)
- fly-plane(p1,lax,sjc) --at-plane(p1,sjc)--> unload-plane(o2,p1,sjc)
- START --at-plane(p1,lax)--> load-plane(o2,p1,lax)
- START --at-plane(p1,lax)--> fly-plane(p1,lax,sjc)
- START --at-object(o2,lax)--> load-plane(o2,p1,lax)
Compare the two Planning Traces to Identify Learning Opportunities
Both traces begin with START ‹ END and the open condition at-object(o1,sjc). They then diverge:
- Default trace: START ‹ END, unload-truck(o1,tr1,sjc) ‹ END, with causal link unload-truck(o1,tr1,sjc) --at-object(o1,sjc)--> END
- Model trace: START ‹ END, unload-plane(o1,p1,sjc) ‹ END, with causal link unload-plane(o1,p1,sjc) --at-object(o1,sjc)--> END
This divergence is a learning opportunity.
Choose the Relevant Planning Decisions
Learning opportunity: add-actions: START-END (the node where the two traces diverge).
Relevant decisions:
- Model trace: add-action: unload-plane(o1) → add-action: fly-plane() → add-action: load-plane(o1)
- Default trace: add-action: unload-truck(o1) → add-action: drive-truck() → add-action: load-truck(o1)
Irrelevant decisions:
- Model trace: add-action: unload-plane(o2) → add-action: load-plane(o2)
- Default trace: add-action: drive-truck() → add-action: load-truck(o2)
Generalize the Relevant Planning Decision Chains
add-actions: START-END
- Model chain: add-action: unload-plane(O, P) → add-action: fly-plane(P, X, Y) → add-action: load-plane(O, P)
- Default chain: add-action: unload-truck(O, T) → add-action: drive-truck(T, X, Y) → add-action: load-truck(O, T)
(Here O is the object, P the plane, T the truck, and X, Y are locations.)
In What Form Should the Learned Knowledge be Stored?
Rewrite Rule
To-be-replaced actions: {load-truck(O,T,X), drive-truck(T,X,Y), unload-truck(O,T,Y)}
Replacing actions: {load-plane(O,P,X), fly-plane(P,X,Y), unload-plane(O,P,Y)}
Search-Control Rule
Given the goal {at-object(O,Y)} to resolve, the effects {at-truck(T,X), at-plane(P,X), airport(Y)}, and distance(X,Y) > 100,
prefer the planning decisions
{add-step(unload-plane(O,P,Y)), add-step(load-plane(O,P,X)), add-step(fly-plane(P,X,Y))}
over the planning decisions
{add-step(unload-truck(O,T,Y)), add-step(load-truck(O,T,X)), add-step(drive-truck(T,X,Y))}
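One way such a preference rule might be applied during search, as an illustrative sketch (the rule object with its .matches and .preferred members is hypothetical; PIP's matcher also unifies variables and checks the numeric distance condition):

```python
# Reorder candidate planning decisions so that the ones a matching
# search-control rule prefers are expanded before the others.
def order_decisions(candidates, rules, goal, state):
    for rule in rules:
        if rule.matches(goal, state):  # e.g., distance(X,Y) > 100 holds
            preferred = [c for c in candidates if c in rule.preferred]
            others = [c for c in candidates if c not in rule.preferred]
            return preferred + others
    return candidates  # no rule fired: keep the planner's default order
```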
Search-Control Knowledge
A heuristic function that provides an estimate of the quality of the plan a node is expected to lead to.
[Figure: a search tree rooted at root; each node n is annotated with an estimated quality, e.g., quality=8, quality=4, quality=2.]
Rewrite Rules
- A rewrite rule is a 2-tuple ‹to-be-replaced-subplan, replacing-subplan›.
- Used after search has produced a complete plan, to rewrite it into a higher-quality plan.
- Only useful in domains where it is possible to efficiently produce a low-quality plan but hard to produce a higher-quality plan.
- E.g., to-be-replaced subplan: A4, A5; replacing subplan: B1.
Planning by Rewriting
[Figure: a complete plan A1 … A6 in which the subplan A4, A5 is rewritten as B1.]
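A sketch of applying a learned rewrite rule after planning, under the simplifying assumption that the completed plan is a totally ordered action list and the rule's subplans are contiguous (real plan rewriting matches subplans inside a partial order and unifies variables):

```python
# Replace the first occurrence of the to-be-replaced subplan with the
# replacing subplan; return the plan unchanged if there is no match.
def rewrite(plan, to_be_replaced, replacing):
    n = len(to_be_replaced)
    for i in range(len(plan) - n + 1):
        if plan[i:i + n] == to_be_replaced:
            return plan[:i] + replacing + plan[i + n:]
    return plan

plan = ["load-truck(o1,t1,lax)", "drive-truck(t1,lax,sjc)",
        "unload-truck(o1,t1,sjc)"]
truck_leg = ["load-truck(o1,t1,lax)", "drive-truck(t1,lax,sjc)",
             "unload-truck(o1,t1,sjc)"]
plane_leg = ["load-plane(o1,p1,lax)", "fly-plane(p1,lax,sjc)",
             "unload-plane(o1,p1,sjc)"]
print(rewrite(plan, truck_leg, plane_leg))
```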
Empirical Evaluation I: What Form Should the Learned Knowledge be Stored in?
Perform empirical experiments to compare the performance of a version of PIP that learns search-control rules (Sys-search-control) with a version that learns rewrite rules (Sys-rewrite).
- Both Sys-rewrite-first and Sys-rewrite-best perform up to two rewritings.
- At each rewriting, Sys-rewrite-first randomly chooses one of the applicable rewrite rules, whereas Sys-rewrite-best applies all applicable rewrite rules to try every way of rewriting the plan.
Experimental Set-up
- Three benchmark planning domains: logistics, softbot, and process planning.
- Randomly generate 120 unique problem instances.
- Train Sys-search-control and Sys-rewrite on optimal-quality solutions for 20, 30, 40, and 60 examples and test them on the remaining examples (cross-validation).
- Plan quality is one minus the average distance of the plans generated by a system from the optimal-quality plans (see the sketch below).
- Planning efficiency is measured by counting the average number of new nodes generated by each system.
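A small sketch of the quality metric as read from the slide (the normalization by the optimal quality is an assumption; the slides do not spell out the distance measure):

```python
# "Plan quality" score: one minus the average normalized distance of the
# generated plans' qualities from the optimal plans' qualities.
def quality_score(generated, optimal):
    """generated, optimal: parallel lists of Quality(Plan) values."""
    distances = [(opt - gen) / opt for gen, opt in zip(generated, optimal)]
    return 1 - sum(distances) / len(distances)

print(quality_score([0.8, 0.5], [1.0, 1.0]))  # 1 - (0.2 + 0.5)/2 = 0.65
```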
Results
[Charts: plan quality (0-1.2) and average number of new nodes generated (0-160) versus number of training examples (0, 20, 30, 40, 60) for Sys-Search-control, Sys-Rewrite-first, and Sys-Rewrite-best across the Softbot, Logistics, and Process Planning domains.]
Conclusion I
- Both search-control and rewrite rules lead to improvements in plan quality.
- Rewrite rules incur a larger cost in lost planning efficiency than search-control rules.
- A mechanism is needed to distinguish good rules from bad rules and to forget the bad rules.
- Comparing planning traces seems to be a better technique for learning search-control rules than for learning rewrite rules.
- Alternate strategies for learning rewrite rules need to be explored: comparing two completed plans of different quality, or static domain analysis.
Empirical Evaluation II: A Study of the Factors Affecting PIP’s Learning Performance
- Generated 25 abstract domains varying along a number of seemingly relevant dimensions:
  - instance similarity;
  - quality branching factor (the average number of alternative-quality solutions per problem);
  - association between the default planning bias and the quality bias.
- Are there any statistically significant differences in PIP's performance as each factor is varied (Student's t-test)?
Results
- PIP's learning leads to greater improvements in domains where:
  - the quality branching factor is large;
  - the planner's default biases are negatively correlated with the quality-improving heuristic function.
- There is no simple relationship between instance similarity and PIP's learning performance.
Conclusion II
- Need to address scale-up issues.
- Need to keep up with advances in AI planning technologies: "It is arguably more difficult to accelerate a new generation planner by outfitting it with learning as the overhead cost by the learning system can overwhelm the gains in search efficiency" (Kambhampati 2001).
- The problem is not the lack of a well-defined task!
- Organize a symposium/special issue on how to efficiently organize, retrieve, and forget learned knowledge.
- An open-source planning and learning software package?