Guiding Combinatorial Optimization with UCT

Ashish Sabharwal and Horst Samulowitz
IBM Watson Research Center

(presented by Raghuram Ramanujan)

MCTS Workshop at ICAPS-2011
June 12, 2011

© 2011 IBM Corporation


MCTS and Combinatorial Search

Monte Carlo Tree Search (MCTS): widely used in a variety of domains in AI

Upper Confidence bounds on Trees (UCT): a form of MCTS, especially successful in two-agent game tree search, e.g., Go, Kriegspiel, Mancala, General Game Playing

Based on single-agent tree search: one multi-armed bandit at each node of a tree

Goal: find the most “rewarding” root-to-leaf path in the tree

Combinatorial Search

A discrete search space, e.g., {0,1}^N or {R, G, B}^N

A “feasible” subspace of interest: typically defined indirectly by a finite set of constraints

Goal: find a solution – an element of the discrete space that satisfies all constraints

If a utility function / objective function given: find an optimal solution

E.g., Boolean Satisfiability (SAT), Graph Coloring (COL), Constraint Satisfaction Problems (CSPs), Constraint Optimization, Integer Programming (IP)
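The definitions above can be made concrete with a small sketch. It treats graph 3-coloring as a search over {R, G, B}^N in which each edge contributes one constraint. The instance and the function names `is_feasible` and `find_coloring` are made up for illustration, and brute-force enumeration is only practical for very small N.

```python
from itertools import product

COLORS = ("R", "G", "B")

def is_feasible(assignment, edges):
    """A coloring is feasible if no edge joins two vertices of the same color."""
    return all(assignment[u] != assignment[v] for u, v in edges)

def find_coloring(num_vertices, edges):
    """Enumerate {R, G, B}^N and return the first feasible point, or None."""
    for assignment in product(COLORS, repeat=num_vertices):
        if is_feasible(assignment, edges):
            return assignment
    return None

# Hypothetical instance: the 4-cycle 0-1-2-3-0 plus the chord 0-2.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
coloring = find_coloring(4, edges)
print(coloring)
```

A complete graph on four vertices has no feasible point in this space, so `find_coloring` returns `None` for it.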

Can MCTS/UCT inspired techniques be used to improve the performance of combinatorial search algorithms?

[Figure: graph coloring example]


Mixed Integer Programming (MIP): A Challenging but Promising Opportunity

MIP: linear inequality constraints, continuous & discrete variables

Typically with a linear (or quadratic) objective function

NP-hard; highly useful, with several academic and commercial solvers available

MIP search appears much more suitable than, e.g., SAT for applying UCT!

Opportunity for applying UCT

MIP solvers such as IBM ILOG’s CPLEX, Gurobi, etc.:

maintain a “frontier” of open nodes, exploring them with a combination of best-first search, “diving” to the bottom of the tree, etc.

rely on spending substantial effort per node, e.g., computing LP relaxation to obtain a bound on the objective value in the subtree: an estimate of the true value

In contrast, state-of-the-art SAT solvers not easily adapted to UCT:

are based on enhancements to basic depth-first search traversal

rely on processing nodes extremely fast (~ 2000-5000 per second)

Can we improve CPLEX by letting UCT decide search tree exploration order?
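The LP relaxation bound mentioned above can be illustrated on a tiny 0/1 knapsack, a hypothetical instance chosen because its LP relaxation can be solved greedily without an LP solver. This is a minimal sketch of the idea that the relaxed optimum bounds the true integer optimum. It does not show how CPLEX computes relaxations, and the function names are illustrative.

```python
from itertools import product

def lp_relaxation_bound(values, weights, capacity):
    """Optimal value of the knapsack LP relaxation (0 <= x_i <= 1).

    With a single capacity constraint, the LP optimum is reached greedily by
    value/weight ratio, taking a fraction of the first item that does not fit.
    Weights are assumed to be positive.
    """
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    bound, remaining = 0.0, float(capacity)
    for i in order:
        if remaining <= 0:
            break
        take = min(1.0, remaining / weights[i])
        bound += take * values[i]
        remaining -= take * weights[i]
    return bound

def integer_optimum(values, weights, capacity):
    """True optimum by enumerating {0,1}^N (only viable for tiny N)."""
    best = 0
    for x in product((0, 1), repeat=len(values)):
        if sum(w * xi for w, xi in zip(weights, x)) <= capacity:
            best = max(best, sum(v * xi for v, xi in zip(values, x)))
    return best

# Made-up instance for illustration.
values, weights, capacity = [60, 100, 120], [10, 20, 30], 50
bound = lp_relaxation_bound(values, weights, capacity)
optimum = integer_optimum(values, weights, capacity)
print(bound, optimum)  # the bound is never below the integer optimum
```

For a maximization problem the relaxation is an upper bound. For minimization the same argument gives a lower bound.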


Mixed Integer Programming (MIP): A Challenging but Promising Opportunity

Challenges and Differences from the “usual” setup for UCT

Biggest success of UCT so far: two-agent game tree search, rather than single-agent

Random playouts are costly to implement in MIP search

Unlike game tree search, too costly to create a full UCT tree at each node

Exploitation isn’t very meaningful after the true value of a node is revealed: no reason to repeatedly visit that node even if it is optimal

LP relaxation: available for “free”, provides a guaranteed bound on the true value, so averaging backups may not be the best strategy!

Highly optimized commercial MIP solvers such as CPLEX very hard to improve upon!

Implementation: no easy access to CPLEX’s internal data structures; must maintain our own “shadow tree” for exploring UCT strategies – additional overhead

Main Finding:

Guidance near the top of the tree can improve performance across a variety of instances!


How does Search in CPLEX (roughly) work?

[Figure: search tree. The Root-Node (estimate E_0) is branched on x into nodes with estimates E_1 and E_2; E_1 is branched on y into E_3 and E_4; E_2 is branched on z into E_5 and E_6; E_5 is branched on v into E_7 and E_8. Legend: CPLEX open nodes carry a quality estimate E_i of the underlying sub-tree (e.g., LP objective value); CPLEX closed nodes are marked separately.]

CPLEX explores the search tree by alternating between two operations:

I. Node Selection: Select the next open search node to continue search on: CPLEX selects node with the best estimate E

II. Branching: Select the next variable to branch on (assume binary branching)

Example:

- Node Selection: Initially only one node that can be selected
- Branching: Select variable x
- Node Selection: Select node with estimate E_1
- Branching: Select variable y
- Node Selection: Select node with estimate E_2
- Branching: Select variable z
- Node Selection: Select node with estimate E_5
- Branching: Select variable v
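The alternation between node selection and branching can be sketched as a best-first branch-and-bound loop over a frontier of open nodes. This is a toy model on a 0/1 knapsack with a greedy relaxation bound as the estimate E. It is not CPLEX's actual procedure, which also uses diving and many other techniques, and all names here are illustrative.

```python
import heapq
from itertools import count

def lp_bound(values, weights, capacity, fixed):
    """Upper bound on the best value in the sub-tree where `fixed` holds.

    `fixed` maps variable index -> 0 or 1. Returns None if the items fixed
    to 1 already exceed the capacity (infeasible sub-tree).
    """
    remaining = capacity - sum(weights[i] for i, x in fixed.items() if x == 1)
    if remaining < 0:
        return None
    bound = float(sum(values[i] for i, x in fixed.items() if x == 1))
    free = [i for i in range(len(values)) if i not in fixed]
    free.sort(key=lambda i: values[i] / weights[i], reverse=True)
    for i in free:
        if remaining <= 0:
            break
        take = min(1.0, remaining / weights[i])
        bound += take * values[i]
        remaining -= take * weights[i]
    return bound

def best_first_search(values, weights, capacity):
    """Return (best value, number of nodes selected); capacity assumed >= 0."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    frontier = [(-lp_bound(values, weights, capacity, {}), next(tie), {})]
    best_value, selected = 0, 0
    while frontier:
        # I. Node selection: open node with the best estimate.
        neg_estimate, _, fixed = heapq.heappop(frontier)
        selected += 1
        if -neg_estimate <= best_value:
            continue  # this sub-tree cannot beat the incumbent
        if len(fixed) == len(values):
            best_value = sum(values[i] for i, x in fixed.items() if x == 1)
            continue
        # II. Branching: binary branching on the next unfixed variable.
        var = len(fixed)
        for x in (1, 0):
            child = {**fixed, var: x}
            estimate = lp_bound(values, weights, capacity, child)
            if estimate is not None:
                heapq.heappush(frontier, (-estimate, next(tie), child))
    return best_value, selected

best, nodes = best_first_search([60, 100, 120], [10, 20, 30], 50)
print(best, nodes)
```

The heap plays the role of the frontier: the tree itself is never stored, only the open nodes and their estimates.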


Guiding Node Selection in CPLEX with UCT

Node Selection with UCT

Idea: expand nodes in the order in which UCT would expand them

Traverse search tree from root to a current leaf node (i.e., “open” node) while at each node selecting the child that has the highest UCT score s.

UCT score s: Combines an estimate of the “quality” of a node (the same one CPLEX uses) with how often this node has been visited already

Goal: Balance Exploration / Exploitation in CPLEX search
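The root-to-leaf descent described above can be sketched as follows. The score used here is the standard UCB1 formula, swapped in as a stand-in and not the variant used in the talk. The sketch also assumes a maximization problem with estimates scaled to [0, 1]. `ShadowNode`, `ucb1_score`, and `select_open_node` are hypothetical names.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)
class ShadowNode:
    estimate: float  # quality estimate, assumed scaled to [0, 1]
    visits: int = 0
    parent: Optional["ShadowNode"] = None
    children: List["ShadowNode"] = field(default_factory=list)

    def add_child(self, estimate, visits=0):
        child = ShadowNode(estimate, visits, parent=self)
        self.children.append(child)
        return child

def ucb1_score(node, c):
    """Standard UCB1: estimate plus an exploration bonus (a stand-in score)."""
    if node.visits == 0:
        return math.inf  # unvisited children are tried first
    parent_visits = max(1, node.parent.visits)
    return node.estimate + c * math.sqrt(math.log(parent_visits) / node.visits)

def select_open_node(root, c=0.7):
    """Descend from the root, always taking the child with the highest score."""
    node = root
    while node.children:
        node = max(node.children, key=lambda child: ucb1_score(child, c))
    return node  # a leaf of the shadow tree, i.e. an open node

# Made-up tree: child `a` has the better estimate, child `b` fewer visits.
root = ShadowNode(estimate=0.9, visits=3)
a = root.add_child(estimate=0.9, visits=2)
b = root.add_child(estimate=0.8, visits=1)
chosen_with_exploration = select_open_node(root)
chosen_greedy = select_open_node(root, c=0.0)
```

With the exploration term active, the less-visited node `b` is chosen even though `a` has the better estimate. Setting `c=0.0` reduces the rule to picking the best estimate.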

Tree Update Phase

When node selection reaches a leaf node,

compute its quality estimate (e.g., objective value of LP relaxation) and propagate it upwards towards the root

branch on this node using the default variable/value selection of CPLEX

Update rule / backup operator: max of the two children (no averaging!), if maximization problem; min if minimization

Result: estimate at each node N along this leaf-to-root path equals the best value seen in the entire sub-tree under N


Guiding Search in CPLEX with UCT: Node Selection

Node Selection is now guided by UCT scores (as illustrated below)

UCT score is based on estimate E and number of visits to a search node

In order to employ UCT one needs to maintain a shadow tree of CPLEX’s search tree

CPLEX maintains just a frontier of open nodes; the underlying search tree only exists implicitly

[Figure: the same search tree as before, with each node annotated with a visit count #visits_i in addition to its estimate E_i. Legend: CPLEX open nodes carry a quality estimate E_i of the underlying sub-tree (e.g., LP objective value); CPLEX closed nodes are marked separately.]

- Node Selection: Initially only one node that can be selected
- Branching: Select variable x
- Node Selection: Select node with highest UCT score based on E_i and #visits_i
- Branching: Select variable y
- Node Selection: Select node with highest UCT score based on E_i and #visits_i
- …


Guiding Search in CPLEX with UCT: Tree Update Phase

After selecting a node N and branching on a variable, two child nodes N_left and N_right will be created with their corresponding estimates E_left and E_right

When propagating estimates upwards, we only consider the best estimate (i.e., no averaging)

Update using the “backup operator”

[Figure: search tree with the Root-Node (estimate E_0) branched on x into E_1 and E_2, and E_1 branched on y into E_3 and E_4. Legend: CPLEX open nodes carry a quality estimate E_i of the underlying sub-tree (e.g., LP objective value); CPLEX closed nodes are marked separately.]

max(E_1, E_2): propagate to E_0

max(E_3, E_4): propagate to E_1, and further up as long as the new estimate improves the current best estimate at a node on the path to the root.

E.g., only if E_4 improves on E_0 is the new estimate propagated to the node labeled E_0. However, visit counts are updated for each node on the path to the root.
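One reading of this update rule, as a minimal sketch: after branching on a node, the better of the two child estimates moves up the path to the root, overwriting an ancestor's estimate only while it improves on it. Visit counts are incremented on every node of the path regardless. The `Node` class and `backup` function are hypothetical names, and the numbers are made up.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)
class Node:
    estimate: float
    visits: int = 0
    parent: Optional["Node"] = None

def backup(node, e_left, e_right, maximize=True):
    """Propagate the best child estimate from `node` toward the root."""
    new_estimate = max(e_left, e_right) if maximize else min(e_left, e_right)
    improving = True
    current = node
    while current is not None:
        current.visits += 1  # visit counts are always updated
        if improving:
            better = (new_estimate > current.estimate if maximize
                      else new_estimate < current.estimate)
            if better:
                current.estimate = new_estimate
            else:
                improving = False  # stop overwriting, keep counting visits
        current = current.parent

# Made-up example for a maximization problem.
root = Node(estimate=5.0)
n = Node(estimate=3.0, parent=root)
backup(n, e_left=4.0, e_right=7.0)   # 7.0 improves on both n and root
backup(n, e_left=2.0, e_right=6.0)   # 6.0 improves on neither
print(n.estimate, root.estimate, n.visits, root.visits)
```

Because no averaging takes place, each estimate on the path ends up holding the best value backed up through it so far.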


UCT Score: “Epsilon Greedy” Variant of UCB1

UCT Score computation: [formula not preserved in this transcript], where

N = tree node under consideration
P = parent of N
[symbol not preserved] = a constant balancing exploration and exploitation (0.7 in experiments)
[symbol not preserved] = theoretically a number decreasing inversely proportional to visits(N) (a constant set to 0.01 in experiments)

Fast and accurate enough for our purposes, compared to the standard UCB1 formula


Experimental Evaluation

Starting with 1,024 publicly available MIP instances, we removed:

All instances solved by default CPLEX within 10 seconds (too easy)

All instances not solved by default CPLEX within 900 seconds (too hard)

Experimental Evaluation is based on the 170 remaining instances

Spanning a variety of domains

Experimentation not limited to any particular instance family (e.g., TSP instances, set covering, etc.)

Experiments were conducted on:

Intel Xeon CPU E5410, 2.33GHz with 8 cores, and 32GB of memory

Only a single run per machine since multiple CPLEXs on one machine can (and often do!) interfere with each other

OS: Ubuntu


Experimental Evaluation: Solvers

Default CPLEX

Uses various strategies, including a combination of best-first node selection and depth-first “diving” to reach a leaf node from each best node

Highly optimized; very challenging to beat by a large margin across a large variety of problem domains

CPLEX with node selection guided by UCT

Best results when guidance limited to the top 5 levels of the tree; then revert to the default node selection of CPLEX

Other standard exploration schemes

Best-first

Breadth-first

Depth-first


Preliminary Experimental Results

[ timeout: 600 sec ]

Promising performance:

UCT guidance results in the fewest instances timing out (8)

Fastest on 39 instances

Lowest average runtime (albeit only by a few seconds)


Preliminary Experimental Results

Pairwise performance measure (timeout: 600 sec): how often does the row solver outperform the column solver?

E.g., UCT guidance outperforms default CPLEX on 64 instances; 52 times vice versa

Promising performance:

UCT guidance outperforms default CPLEX and other natural alternatives
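A pairwise measure of this kind could be computed along the following lines. This is a sketch under assumptions: "outperforms" is taken to mean a strictly lower runtime after capping at the timeout, and the solver names and runtimes are invented for illustration. They are not the results reported in the talk.

```python
def pairwise_wins(runtimes, timeout=600.0):
    """Count, for each (row, column) pair, the instances where row is faster.

    `runtimes` maps a solver name to its per-instance runtimes in seconds,
    with all lists in the same instance order. Runs at or above the timeout
    are treated as equal to the timeout, so two timeouts count as a tie.
    """
    solvers = list(runtimes)
    wins = {row: {col: 0 for col in solvers if col != row} for row in solvers}
    for row in solvers:
        for col in solvers:
            if row == col:
                continue
            for t_row, t_col in zip(runtimes[row], runtimes[col]):
                if min(t_row, timeout) < min(t_col, timeout):
                    wins[row][col] += 1
    return wins

# Invented runtimes for two hypothetical solvers on four instances.
toy_runtimes = {
    "solver_a": [12.0, 600.0, 30.0, 700.0],
    "solver_b": [15.0, 40.0, 30.0, 650.0],
}
table = pairwise_wins(toy_runtimes)
print(table)
```

Ties and double timeouts count for neither solver, which is why the two entries for a pair need not add up to the number of instances.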


Conclusion

Explored the use of MCTS/UCT in a combinatorial search setting

Specifically, for mixed integer programming (MIP) search, with CPLEX

Typical “random playouts” very costly but LP relaxation objective value serves as a good estimate – a guaranteed one-sided bound!

Max-style update rule performs better here than the usual averaging backups

Guiding combinatorial search with UCT holds promise!

Improving performance of highly optimized MIP solvers across a variety of problem domains is a huge challenge

UCT-inspired guidance for node selection shows promise

Most benefit when UCT used only near the top of the search tree

Further exploration along these lines appears fruitful, e.g.:

using UCT for variable or value selection (rather than node selection)

building a “full” UCT tree at each search tree node before branching