Approximation Techniques for Automated Reasoning
Irina Rish, IBM T.J. Watson Research Center (rish@us.ibm.com)
Rina Dechter, University of California, Irvine
SP2 2
Outline
Introduction
  Reasoning tasks
  Reasoning approaches: elimination and conditioning
CSPs: exact inference and approximations
Belief networks: exact inference and approximations
MDPs: decision-theoretic planning
Conclusions
Automated reasoning tasks
Propositional satisfiability
Constraint satisfaction
Planning and scheduling
Probabilistic inference
Decision-theoretic planning
Etc.
Reasoning is NP-hard
Approximations
Graphical Frameworks
Our focus: graphical frameworks (constraint and belief networks).
Nodes = variables
Edges = dependencies (constraints, probabilities, utilities)
Reasoning = graph transformations
Propositional Satisfiability
Example: party problem
If Alex goes, then Becky goes: A → B (or ¬A ∨ B)
If Chris goes, then Alex goes: C → A (or ¬C ∨ A)
Query: Is it possible that Chris goes to the party but Becky does not?
Equivalently: is the propositional theory {A → B, C → A, C, ¬B} satisfiable?
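The party query can be checked by brute-force enumeration of truth assignments; a minimal sketch (the variable names A, B, C follow the slide, but the encoding of clauses as Python lambdas is our own illustration):

```python
from itertools import product

# The party-problem theory: the two implications plus the query
# assumptions "Chris goes" and "Becky does not go".
clauses = [
    lambda A, B, C: (not A) or B,   # A -> B: if Alex goes, Becky goes
    lambda A, B, C: (not C) or A,   # C -> A: if Chris goes, Alex goes
    lambda A, B, C: C,              # Chris goes
    lambda A, B, C: not B,          # Becky does not go
]

def satisfiable(clauses):
    """Check all 2^3 truth assignments (feasible only for tiny theories)."""
    return any(all(cl(A, B, C) for cl in clauses)
               for A, B, C in product([False, True], repeat=3))

print(satisfiable(clauses))  # False: Chris cannot go to the party without Becky
```

C forces A (second clause), A forces B (first clause), contradicting ¬B, so the theory is unsatisfiable.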
Constraint Satisfaction
Example: map coloring
Variables: countries (A, B, C, etc.)
Values: colors (e.g., red, green, yellow)
Constraints: A ≠ B, A ≠ D, D ≠ E, etc.
Constrained Optimization
Example: power plant scheduling
Variables: X_1, ..., X_n with domains {ON, OFF}.
Constraints: min-up and min-down times; power demand:
  Power(X_1) + Power(X_2) + Power(X_3) + Power(X_4) + ... ≥ Demand
Objective: minimize TotalFuelCost(X_1, ..., X_n).
Probabilistic Inference
[Belief network for medical diagnosis: visit to Asia (V), smoking (S), tuberculosis (T), lung cancer (C), bronchitis (B), abnormality in lungs (A), X-ray (X), dyspnoea / shortness of breath (D)]
Query: P(T = yes | S = no, D = yes) = ?
Example: medical diagnosis
Decision-Theoretic Planning
State = {X, Y, Battery_Level}
Actions = {Go_North, Go_South, Go_West, Go_East}
Probability of success = P
Task: reach the goal location ASAP
Example: robot navigation
Reasoning Methods
Our focus: conditioning and elimination.
Conditioning (“guessing” assignments, reasoning by assumptions):
• Branch-and-bound (optimization)
• Backtracking search (CSPs)
• Cycle-cutset (CSPs, belief nets)
Variable elimination (inference, “propagation” of constraints, probabilities, cost functions):
• Dynamic programming (optimization)
• Adaptive consistency (CSPs)
• Join-tree propagation (CSPs, belief nets)
Conditioning: Backtracking Search
Complexity: O(exp(n))
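Conditioning by backtracking search can be sketched in a few lines; the graph-coloring usage example and the `consistent` callback below are our own illustration, not the slides' code:

```python
def backtrack(assignment, variables, domains, consistent):
    """Depth-first search with chronological backtracking; `consistent`
    checks the constraints among the variables assigned so far."""
    if len(assignment) == len(variables):
        return dict(assignment)          # all variables assigned: a solution
    var = variables[len(assignment)]     # static variable ordering
    for value in domains[var]:
        assignment[var] = value
        if consistent(assignment):       # prune branches that already fail
            result = backtrack(assignment, variables, domains, consistent)
            if result is not None:
                return result
        del assignment[var]              # dead end: undo and backtrack
    return None

# Usage: 2-coloring the path A-B-C
edges = [("A", "B"), ("B", "C")]
ok = lambda a: all(a[u] != a[v] for u, v in edges if u in a and v in a)
sol = backtrack({}, ["A", "B", "C"], {v: ["red", "green"] for v in "ABC"}, ok)
print(sol)  # {'A': 'red', 'B': 'green', 'C': 'red'}
```

Worst-case time is O(exp(n)) as stated above, but only the current branch is stored, so space is linear.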
Bucket Elimination: Adaptive Consistency (Dechter and Pearl, 1987)
Bucket E: E ≠ D, E ≠ C
Bucket D: D ≠ A
Bucket C: C ≠ B
Bucket B: B ≠ A
Bucket A:
Processing the buckets in order records new constraints (e.g., D = C and B = A) until a contradiction is derived in bucket A.
Complexity: O(n exp(w*)), where w* is the induced width.
Bucket-elimination and conditioning: a uniform framework
Unifying approach to different reasoning tasks
Understanding: commonality and differences
“Technology transfer”
Ease of implementation
Extensions to hybrids: conditioning + elimination
Approximations
Exact CSP techniques: complexity
Approximations
Exact approaches can be intractable.
Approximate conditioning:
• Local search, gradient descent (optimization, CSPs, SAT)
• Stochastic simulations (belief nets)
Approximate elimination:
• Local consistency enforcing (CSPs), local probability propagation (belief nets)
• Bounded resolution (SAT)
• Mini-bucket approach (belief nets)
Hybrids (conditioning + elimination)
Other approximations (e.g., variational)
“Road map”
CSPs: complete algorithms
  Variable elimination
  Conditioning (search)
CSPs: approximations
Belief nets: complete algorithms
Belief nets: approximations
MDPs
Constraint Satisfaction
Applications: planning and scheduling, configuration and design problems, circuit diagnosis, scene labeling, temporal reasoning, natural language processing.
Constraint Satisfaction
Example: map coloring
Variables: countries (A, B, C, etc.)
Values: colors (e.g., red, green, yellow)
Constraints: A ≠ B, A ≠ D, D ≠ E, etc.
Allowed pairs for A ≠ B:
A       B
red     green
red     yellow
green   red
green   yellow
yellow  green
yellow  red
[Constraint graph over the countries A, B, C, D, E, F, G]
Constraint Networks
Constraint network: ⟨X, D, C⟩
Variables: X = {X_1, ..., X_n}
Domains: D = {D_1, ..., D_n}, where D_i = {v_1, ..., v_k}
Constraints: C = {C_1, ..., C_l}
Constraint graph: a node per variable, an edge between each pair of constrained variables.
Constraint Satisfaction Problem (CSP): find a solution, i.e., an assignment of a value to each variable that satisfies all the constraints.
The Idea of Elimination
Eliminate variable E: join all relations whose scope includes E, then project E out:
R_DBC = π_DBC (R_ED ⋈ R_EB ⋈ R_EC)
The new relation R_DBC records the value assignments to D, B, and C that remain consistent after eliminating E.
Variable Elimination
Eliminate variables one by one: “constraint propagation”.
Solution generation after elimination is backtrack-free.
Elimination Operation: join followed by projection
The join operation over A finds all solutions satisfying the constraints that involve A; the projection then removes A from the resulting relation.
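The join-project step can be sketched on extensional relations; representing a relation as a (scope, tuple-set) pair is an assumption of this sketch:

```python
def natural_join(r1, r2):
    """Join two relations on their shared variables."""
    (s1, t1), (s2, t2) = r1, r2
    shared = [v for v in s1 if v in s2]
    scope = list(s1) + [v for v in s2 if v not in s1]
    rows = set()
    for a in t1:
        for b in t2:
            amap, bmap = dict(zip(s1, a)), dict(zip(s2, b))
            if all(amap[v] == bmap[v] for v in shared):
                merged = {**amap, **bmap}
                rows.add(tuple(merged[v] for v in scope))
    return scope, rows

def project(rel, keep):
    """Project a relation onto the variables in `keep`."""
    sc, rows = rel
    idx = [sc.index(v) for v in keep]
    return keep, {tuple(r[i] for i in idx) for r in rows}

# Eliminating E from two not-equal constraints over the domain {1, 2}:
R_DE = (["D", "E"], {(1, 2), (2, 1)})   # D != E
R_CE = (["C", "E"], {(1, 2), (2, 1)})   # C != E
scope, rows = project(natural_join(R_DE, R_CE), ["D", "C"])
print(rows)  # {(1, 1), (2, 2)}: eliminating E records the constraint D = C
```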
Bucket Elimination: Adaptive Consistency (Dechter and Pearl, 1987)
Complexity: O(n exp(w*(d))), where w*(d) is the induced width along ordering d.

Ordering (A, B, C, D, E) — each constraint goes into the bucket of its latest variable:
Bucket E: E ≠ B, E ≠ C, E ≠ D
Bucket D: D ≠ A
Bucket C: C ≠ B
Bucket B: B ≠ A
Bucket A:

Ordering (E, D, C, B, A):
Bucket A: A ≠ D, A ≠ B
Bucket B: B ≠ E
Bucket C: C ≠ B, C ≠ E
Bucket D: D ≠ E
Bucket E:

Variable domains are {1, 2}, except one variable with domain {1, 2, 3}. Processing the buckets records intermediate relations (e.g., R_DBE, R_DCB, R_ACB, R_CBE, R_DB, R_AB, R_E, R_A); their arity, and hence the cost, depends on the ordering.
Induced Width
Width along ordering d: the maximum number of a node’s earlier neighbors (“parents”).
Induced width w*(d): the width of the ordered induced graph, obtained by connecting the “parents” of each node X_i recursively, from i = n down to 1.
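The definition above translates directly into code; a minimal sketch (the edge-list graph representation is our assumption):

```python
def induced_width(edge_list, ordering):
    """w*(d): process nodes from last to first, connecting each node's
    earlier ('parent') neighbors; the width is the max parent count."""
    edges = {frozenset(e) for e in edge_list}
    pos = {v: i for i, v in enumerate(ordering)}
    width = 0
    for v in reversed(ordering):
        parents = {u for e in edges if v in e
                   for u in e if u != v and pos[u] < pos[v]}
        width = max(width, len(parents))
        # connect the parents of v in the induced graph
        edges |= {frozenset((a, b)) for a in parents for b in parents if a != b}
    return width

print(induced_width([("A", "B"), ("B", "C")], ["A", "B", "C"]))  # 1 (a tree)
print(induced_width([("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")],
                    ["A", "B", "C", "D"]))                       # 2 (a cycle)
```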
Induced width (continued)
Finding a minimum-w* ordering is NP-complete (Arnborg, 1985).
Greedy ordering heuristics: min-width, min-degree, max-cardinality (Bertele and Brioschi, 1972; Freuder, 1982).
Tractable classes: trees have w* = 1.
The w* of a given ordering is computed in O(n) time, i.e., the complexity of elimination is easy to predict.
Example: crossword puzzle
Crossword Puzzle: Adaptive consistency
Adaptive Consistency as “bucket-elimination”
Initialize: partition the constraints into bucket_1, ..., bucket_n.
For i = n down to 1:  // process buckets in the reverse order
  For all relations R_1, ..., R_m in bucket_i do
    R_new = π_{scope − {X_i}} (R_1 ⋈ ... ⋈ R_m)  // join all relations and “project out” X_i
  If R_new is not empty, add it to bucket_k, k < i, where k is the largest variable index in R_new;
  Else the problem is unsatisfiable.
Return the set of all relations (old and new) in the buckets.
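The pseudocode above can be sketched in Python over extensional relations (the (scope, tuple-set) representation and the brute-force join over the given domains are assumptions of this sketch, not the authors' implementation):

```python
from itertools import product

def adaptive_consistency(variables, domains, constraints):
    """Bucket-elimination sketch. `constraints` is a list of
    (scope_tuple, set_of_allowed_tuples); returns False if unsatisfiable."""
    buckets = {v: [] for v in variables}
    for scope, rel in constraints:
        # place each constraint in the bucket of its latest variable
        buckets[max(scope, key=variables.index)].append((scope, rel))
    for v in reversed(variables):
        items = buckets[v]
        if not items:
            continue
        scope = sorted({u for sc, _ in items for u in sc} - {v},
                       key=variables.index)
        # join all relations in the bucket and project out v
        new_rows = set()
        for combo in product(*(domains[u] for u in scope + [v])):
            asg = dict(zip(scope + [v], combo))
            if all(tuple(asg[u] for u in sc) in rel for sc, rel in items):
                new_rows.add(tuple(asg[u] for u in scope))
        if not new_rows:
            return False          # empty relation: no consistent values
        if scope:
            buckets[max(scope, key=variables.index)].append((tuple(scope), new_rows))
    return True

# A triangle of not-equal constraints needs 3 values, not 2:
neq = lambda sc, dom: (sc, {t for t in product(dom, repeat=2) if t[0] != t[1]})
V = ["A", "B", "C"]
tri = lambda d: [neq(("A", "B"), d), neq(("B", "C"), d), neq(("A", "C"), d)]
print(adaptive_consistency(V, {v: [1, 2] for v in V}, tri([1, 2])))        # False
print(adaptive_consistency(V, {v: [1, 2, 3] for v in V}, tri([1, 2, 3])))  # True
```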
Solving Trees (Mackworth and Freuder, 1985)
Adaptive consistency is linear for trees and equivalent to enforcing directional arc-consistency (recording only unary constraints).
Properties of bucket-elimination (adaptive consistency)
Adaptive consistency generates a constraint network that is backtrack-free (can be solved without dead-ends).
The time and space complexity of adaptive consistency along ordering d is O(n exp(w*(d))).
Therefore, problems having bounded induced width are tractable (solved in polynomial time).
Examples of tractable problem classes: trees (w* = 1), series-parallel networks (w* = 2), and in general k-trees (w* = k).
“Road map”
CSPs: complete algorithms
  Variable elimination
  Conditioning (search)
CSPs: approximations
Belief nets: complete algorithms
Belief nets: approximations
MDPs
The Idea of Conditioning
Conditioning is used by search algorithms. Complexity: exponential time, linear space.
Backtracking Search + Heuristics
“Vanilla” backtracking + variable/value ordering heuristics + constraint propagation + learning + …
Look-ahead schemes:
  Forward checking (Haralick and Elliot, 1980)
  MAC (full arc-consistency at each node) (Gaschnig, 1977)
Look-back schemes:
  Backjumping (Gaschnig 1977, Dechter 1990, Prosser 1993)
  Backmarking (Gaschnig, 1977)
  BJ+DVO (Frost and Dechter, 1994)
  Constraint learning (Dechter 1990, Frost and Dechter 1994, Bayardo and Miranker 1996)
Search complexity distributions
Complexity histograms (dead-ends, time) => continuous distributions (Frost, Rish, and Vila 1997; Selman and Gomez 1997; Hoos 1998)
[Plot: frequency (probability) versus the number of nodes explored in the search space]
Constraint Programming
Constraint solving embedded in programming languages.
Allows flexible modeling combined with solving algorithms.
Logic programs + forward checking.
Examples: Eclipse, Ilog, OPL.
Uses only look-ahead schemes.
Complete CSP algorithms: summary
Bucket elimination: adaptive consistency (CSP), directional resolution (SAT); the elimination operation is join-project (CSP) or resolution (SAT). Time and space exponential in the induced width (given a variable ordering).
Conditioning: backtracking search + heuristics. Time complexity: worst-case O(exp(n)), but the average case is often much better. Space complexity: linear.
“Road map”
CSPs: complete algorithms
CSPs: approximations
  Approximating elimination
  Approximating conditioning
Belief nets: complete algorithms
Belief nets: approximations
MDPs
Approximating Elimination: Local Constraint Propagation
Problem: bucket-elimination algorithms are intractable when the induced width is large.
Approximation: bound the size of recorded dependencies, i.e., perform local constraint propagation (local inference).
Advantages: efficiency; may discover inconsistencies by deducing new constraints.
Disadvantages: does not guarantee that a solution exists.
From Global to Local Consistency
Constraint Propagation
• Arc-consistency, unit resolution, i-consistency
Example: variables X, Y, Z, T with domains {1, 2, 3} and constraints X < Y, Y = Z, T < Z, X < T.
Constraint Propagation
• Arc-consistency, unit resolution, i-consistency
Enforcing arc-consistency on the same example prunes the variable domains.
• Incorporated into backtracking search
• Constraint programming languages: a powerful approach for modeling and solving combinatorial optimization problems.
Arc-consistency
Only domain (unary) constraints are recorded:
D_A ← π_A (R_AB ⋈ D_B)
Example: D_X = {1, 2, 3}, D_Y = {1, 2, 3}; the constraint X < Y reduces the domain of X to {1, 2}.
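A single arc-consistency revision step, as a sketch (the name `revise` follows common usage in AC algorithms; the predicate representation of a constraint is our assumption):

```python
def revise(domains, constraint):
    """Prune values of X that have no supporting value in Y.
    `constraint` is ((X, Y), predicate); returns True if X's domain shrank."""
    (x, y), pred = constraint
    supported = [a for a in domains[x] if any(pred(a, b) for b in domains[y])]
    changed = len(supported) < len(domains[x])
    domains[x] = supported
    return changed

# The slide's example: X, Y in {1, 2, 3} with X < Y prunes X's domain to {1, 2}
domains = {"X": [1, 2, 3], "Y": [1, 2, 3]}
revise(domains, (("X", "Y"), lambda a, b: a < b))
print(domains["X"])  # [1, 2]
```

Repeating `revise` over all constraint arcs until no domain changes yields full arc-consistency.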
Local consistency: i-consistency
i-consistency: any consistent assignment to any i − 1 variables is consistent with at least one value of any i-th variable.
Strong i-consistency: k-consistency for every k ≤ i.
Directional i-consistency: given an ordering, each variable is i-consistent with any i − 1 preceding variables.
Strong directional i-consistency: given an ordering, each variable is strongly i-consistent with any i − 1 preceding variables.
Directional i-consistency
Ordering (A, B, C, D, E), buckets:
Bucket E: E ≠ B, E ≠ C, E ≠ D
Bucket D: D ≠ A, D ≠ C
Bucket C: C ≠ B
Bucket B: B ≠ A
Bucket A:
Constraints recorded by each method:
  Adaptive consistency: R_DCB
  Directional path-consistency (d-path): R_DC, R_DB, R_CB
  Directional arc-consistency (d-arc): R_D, R_C
Enforcing Directional i-consistency
Directional i-consistency bounds the size of recorded constraints by i:
i = 1: arc-consistency; i = 2: path-consistency.
For i ≥ w*, directional i-consistency is equivalent to adaptive consistency.
Example: SAT
Elimination operation: resolution.
Directional Resolution = adaptive consistency (Davis and Putnam, 1960; Dechter and Rish, 1994).
Bounded resolution: bounds the resolvent size.
  BDR(i): directional i-consistency (Dechter and Rish, 1994).
  k-closure: full k-consistency (van Gelder and Tsuji, 1996).
In general: bounded induced-width resolution.
  DCDR(b): generalizes the cycle-cutset idea; limits the induced width by conditioning on cutset variables (Rish and Dechter 1996, Rish and Dechter 2000).
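The resolution and bounded-resolution operations can be sketched over integer-literal clauses (the DIMACS-style encoding, with -v for a negated variable, is an assumption of this sketch, not the papers' code):

```python
def resolve(c1, c2, var):
    """Resolve two clauses on `var`; returns None if the resolvent
    is a tautology (contains both a literal and its negation)."""
    assert var in c1 and -var in c2
    resolvent = (set(c1) - {var}) | (set(c2) - {-var})
    if any(-lit in resolvent for lit in resolvent):
        return None
    return frozenset(resolvent)

def bounded_resolvents(clauses, var, bound):
    """BDR-style step: generate resolvents on `var`, keeping only those
    with at most `bound` literals (larger ones are discarded)."""
    pos = [c for c in clauses if var in c]
    neg = [c for c in clauses if -var in c]
    out = set()
    for c1 in pos:
        for c2 in neg:
            r = resolve(c1, c2, var)
            if r is not None and len(r) <= bound:
                out.add(r)
    return out

# (-A or B) and (-C or A) resolve on A to give (-C or B)
print(bounded_resolvents([frozenset({-1, 2}), frozenset({-3, 1})], 1, 2))
```

Directional resolution applies this bucket by bucket with no bound; BDR(i) caps the resolvent size and so trades completeness of inference for tractability.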
Directional Resolution = Adaptive Consistency
DR complexity
|bucket_i| = O(exp(w*))
DR time and space: O(n exp(w*))
History
1960: the resolution-based Davis-Putnam algorithm.
1962: the resolution step was replaced by conditioning (Davis, Logemann and Loveland, 1962) to avoid memory explosion, resulting in a backtracking search algorithm known as Davis-Putnam (DP), or the DPLL procedure.
The dependency on induced width was not known in 1960.
1994 – Directional Resolution (DR), a rediscovery of the original Davis-Putnam, identification of tractable classes (Dechter and Rish, 1994).
DR versus DPLL: complementary properties
Uniform random 3-CNFs (large induced width) versus (k,m)-tree 3-CNFs (bounded induced width)
Complementary properties => hybrids
BDR-DP(i): bounded resolution + backtracking Complete algorithm: run BDR(i) as preprocessing before the Davis-Putnam backtracking algorithm. Empirical results: random vs. structured (low-w*) problems:
DCDR(b): Conditioning + DR
Idea: conditioning reduces w*; resolution guarantees tractability for low w*.
If w*(X_i) ≤ b, resolve; otherwise, condition.
DCDR(b): empirical results
Adjustable trade-off:
b = 0: pure DPLL; b ≥ w*: pure DR; 0 < b < w*: hybrid.
Time: exp(w*(b) + |cond(b)|); space: exp(b).
Approximating Elimination: Summary
Key idea: local propagation, restricting the number of variables involved in recorded constraints.
Examples: arc-, path-, and i-consistency (CSPs); bounded resolution, k-closure (SAT).
For SAT:
  bucket elimination = directional resolution (the original resolution-based Davis-Putnam algorithm)
  conditioning = DPLL (backtracking search)
  hybrids: bounded resolution + search = complete algorithms (BDR-DP(i), DCDR(b))
“Road map”
CSPs: complete algorithms
CSPs: approximations
  Approximating elimination
  Approximating conditioning
Belief nets: complete algorithms
Belief nets: approximations
MDPs
Approximating Conditioning: Local Search
Problem: complete (systematic, exhaustive) search can be intractable (worst-case O(exp(n))).
Approximation idea: explore only parts of search space
Advantages: anytime answer; may “run into” a solution quicker than systematic approaches
Disadvantages: may not find an exact solution even if there is one; cannot detect that a problem is unsatisfiable
Simple “greedy” search
1. Generate a random assignment to all variables.
2. Repeat until no improvement is made or a solution is found:
3.   Flip a variable (change its value) to increase the number of satisfied constraints.  // hill-climbing step
Easily gets stuck at local maxima.
GSAT – local search for SAT (Selman, Levesque and Mitchell, 1992)
1. For i = 1 to MaxTries
2.   Select a random assignment A
3.   For j = 1 to MaxFlips
4.     If A satisfies all constraints, return A
5.     Else flip a variable to maximize the score
       (the number of satisfied constraints); if no flip
       increases the score, flip at random
6.   End
7. End
Greatly improves hill-climbing by adding restarts and sideway moves.
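A runnable sketch of the GSAT loop above (the integer-literal clause encoding and the deterministic tie-breaking are our assumptions; the original also flips at random when no flip improves the score, which this minimal version omits):

```python
import random

def gsat(clauses, n_vars, max_tries=50, max_flips=1000, rng=random):
    """GSAT sketch: clauses are lists of int literals (-v = negated v)."""
    def score(asg):
        return sum(any(asg[abs(l)] == (l > 0) for l in c) for c in clauses)
    for _ in range(max_tries):
        # restart with a fresh random assignment
        asg = {v: rng.random() < 0.5 for v in range(1, n_vars + 1)}
        for _ in range(max_flips):
            if score(asg) == len(clauses):
                return asg               # all clauses satisfied
            def flipped_score(v):        # score if v were flipped
                asg[v] = not asg[v]
                s = score(asg)
                asg[v] = not asg[v]
                return s
            # flip the best variable (sideway and downhill moves allowed)
            best = max(range(1, n_vars + 1), key=flipped_score)
            asg[best] = not asg[best]
    return None

sol = gsat([[1, 2], [-1, 2], [-2, 3]], 3, rng=random.Random(0))
print(sol)  # {1: False, 2: True, 3: True}
```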
WalkSAT (Selman, Kautz and Cohen, 1994)
Adds random walk to GSAT:
With probability p: random walk, i.e., flip a variable in some unsatisfied constraint.
With probability 1 − p: perform a hill-climbing step.
Randomized hill-climbing often solves large and hard satisfiable problems.
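One WalkSAT move can be sketched as follows (same integer-literal clause encoding as above; note that this sketch restricts both the random and the greedy flip to one randomly chosen unsatisfied clause, one common formulation of the algorithm):

```python
import random

def walksat_flip(clauses, asg, p, rng=random):
    """One WalkSAT move: pick an unsatisfied clause; with probability p
    flip one of its variables at random, otherwise flip the variable in
    it that maximizes the number of satisfied clauses."""
    sat = lambda c: any(asg[abs(l)] == (l > 0) for l in c)
    unsat = [c for c in clauses if not sat(c)]
    if not unsat:
        return asg                      # already a satisfying assignment
    clause = rng.choice(unsat)
    if rng.random() < p:
        v = abs(rng.choice(clause))     # random-walk step
    else:
        def n_sat(u):                   # greedy (hill-climbing) step
            asg[u] = not asg[u]
            s = sum(map(sat, clauses))
            asg[u] = not asg[u]
            return s
        v = max({abs(l) for l in clause}, key=n_sat)
    asg[v] = not asg[v]
    return asg
```

Iterating `walksat_flip` inside GSAT's MaxTries/MaxFlips loop yields the full algorithm.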
Other approaches
Different flavors of GSAT with randomization (GenSAT by Gent and Walsh, 1993; Novelty by McAllester, Kautz and Selman, 1997)
Simulated annealing
Tabu search
Genetic algorithms
Hybrid approximations: elimination + conditioning
Approximating conditioning with elimination
Energy minimization in neural networks (Pinkas and Dechter, 1995)
For cycle-cutset nodes, use the greedy update function (relative to neighbors).
For the remaining nodes, run the arc-consistency algorithm followed by value assignment.

[Figure: a network of binary nodes X_i, X_j in {0,1}; the cycle-cutset nodes are highlighted.]
![Page 64: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/64.jpg)
SP2 64
GSAT with Cycle-Cutset (Kask and Dechter, 1996)

Input: a CSP, a partition of the variables into cycle-cutset and tree variables
Output: an assignment to all the variables

Within each try:
Generate a random initial assignment, then alternate between the two steps:
1. Run the Tree algorithm (arc-consistency + assignment) on the problem with fixed values of cutset variables.
2. Run GSAT on the problem with fixed values of tree variables.
![Page 65: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/65.jpg)
SP2 65
Results: GSAT with Cycle-Cutset (Kask and Dechter, 1996)

[Bar chart: number of problems solved (0–70) by GSAT and GSAT+CC as a function of cycle-cutset size (14, 22, 36, 43).]
![Page 66: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/66.jpg)
SP2 66
Results: GSAT with Cycle-Cutset (Kask and Dechter, 1996)
![Page 67: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/67.jpg)
SP2 67
“Road map”
  CSPs: complete algorithms
  CSPs: approximations
  Bayesian belief nets: complete algorithms
    Bucket elimination
    Relation to: join-tree, Pearl’s poly-tree algorithm, conditioning
  Belief nets: approximations
  MDPs
![Page 68: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/68.jpg)
SP2 68
Belief Networks
P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)

[Figure: a network with nodes Smoking (S), lung Cancer (C), Bronchitis (B), X-ray (X), and Dyspnoea (D), annotated with the CPDs P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B).]

BN = (G, Theta)
Conditional independencies => efficient representation

CPD P(D|C,B):
  C B | D=0 | D=1
  0 0 | 0.1 | 0.9
  0 1 | 0.7 | 0.3
  1 0 | 0.8 | 0.2
  1 1 | 0.9 | 0.1
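To make the factored representation concrete, the following sketch evaluates the product P(S)P(C|S)P(B|S)P(X|C,S)P(D|C,B) for the network above. Except for P(D|C,B), which follows the slide's table, all CPD numbers here are made up for illustration:

```python
# Sketch: evaluating the factored joint of the Smoking network.
P_S = {0: 0.7, 1: 0.3}
P_C_given_S = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.8, 1: 0.2}}     # P(C|S)[s][c]
P_B_given_S = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.6, 1: 0.4}}       # P(B|S)[s][b]
P_X_given_CS = {(c, s): {0: 0.9 - 0.4 * c - 0.2 * s,
                         1: 0.1 + 0.4 * c + 0.2 * s}
                for c in (0, 1) for s in (0, 1)}               # P(X|C,S)[(c,s)][x]
P_D_given_CB = {(0, 0): {0: 0.1, 1: 0.9}, (0, 1): {0: 0.7, 1: 0.3},
                (1, 0): {0: 0.8, 1: 0.2}, (1, 1): {0: 0.9, 1: 0.1}}

def joint(s, c, b, x, d):
    """P(S,C,B,X,D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)."""
    return (P_S[s] * P_C_given_S[s][c] * P_B_given_S[s][b]
            * P_X_given_CS[(c, s)][x] * P_D_given_CB[(c, b)][d])
```

Because every factor is a (conditional) distribution, the 32 joint entries sum to 1, which is a quick sanity check on the representation.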
![Page 69: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/69.jpg)
SP2 69
Example: Printer Troubleshooting
![Page 70: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/70.jpg)
SP2 70
Example: Car Diagnosis
![Page 71: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/71.jpg)
SP2 71
What are they good for?

  Diagnosis: P(cause | symptom) = ?
  Prediction: P(symptom | cause) = ?
  Classification: max_class P(class | data)
  Decision-making (given a cost function)

Application areas: medicine, bio-informatics, computer troubleshooting, stock market, text classification, speech recognition.
![Page 72: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/72.jpg)
SP2 72
Probabilistic Inference Tasks
Belief updating:
  BEL(X_i) = P(X_i = x_i | evidence)

Finding the most probable explanation (MPE):
  x* = argmax_x P(x, e)

Finding the maximum a-posteriori hypothesis:
  (a*_1, ..., a*_k) = argmax_a sum_{X/A} P(x, e),  where A ⊆ X are hypothesis variables

Finding the maximum-expected-utility (MEU) decision:
  (d*_1, ..., d*_k) = argmax_d sum_{X/D} P(x, e) U(x),  where D ⊆ X are decision variables and U(x) is a utility function
![Page 73: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/73.jpg)
SP2 73
Belief Updating
[Figure: the belief network with nodes Smoking, lung Cancer, Bronchitis, X-ray, and Dyspnoea.]
P (lung cancer=yes | smoking=no, dyspnoea=yes ) = ?
![Page 74: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/74.jpg)
SP2 74
“Moral” Graph
P(X_1, ..., X_n) = prod_{i=1}^{n} P(X_i | parents(X_i))

Each conditional probability distribution (CPD) corresponds to a clique (“family”) in the moral graph.
![Page 75: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/75.jpg)
SP2 75
Belief updating: P(X|evidence)=?
“Moral” graph over A, B, C, D, E:

P(a | e=0) ∝ P(a, e=0) = sum_{b,c,d,e=0} P(a) P(b|a) P(c|a) P(d|b,a) P(e|b,c)
  = P(a) sum_{e=0} sum_d sum_c P(c|a) sum_b P(b|a) P(d|b,a) P(e|b,c)
  = P(a) sum_{e=0} sum_d sum_c P(c|a) h^B(a, d, c, e)

Variable Elimination: summing out B records the new function
  h^B(a, d, c, e) = sum_b P(b|a) P(d|b,a) P(e|b,c)
SP2 76
Bucket elimination
Algorithm elim-bel (Dechter 1996)

Elimination operator: sum_b

bucket B:  P(b|a)  P(d|b,a)  P(e|b,c)
bucket C:  P(c|a)  ||  h^B(a, d, c, e)
bucket D:          ||  h^C(a, d, e)
bucket E:  e=0     ||  h^D(a, e)
bucket A:  P(a)    ||  h^E(a)

Result: P(a | e=0)

W* = 4 — “induced width” (max clique size)
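The bucket computation — multiply the functions mentioning the bucket's variable, then sum that variable out — can be sketched generically. This is an illustrative toy implementation over binary variables, not the authors' code; a factor here is a pair (variable list, table):

```python
from itertools import product

def multiply(f, g):
    """Pointwise product of two factors over the union of their variables."""
    fv, ft = f; gv, gt = g
    vs = list(dict.fromkeys(fv + gv))
    table = {}
    for vals in product((0, 1), repeat=len(vs)):
        env = dict(zip(vs, vals))
        table[vals] = ft[tuple(env[v] for v in fv)] * gt[tuple(env[v] for v in gv)]
    return (vs, table)

def sum_out(var, f):
    """Marginalize `var` out of factor `f` (the elimination operator)."""
    fv, ft = f
    i = fv.index(var)
    vs = fv[:i] + fv[i + 1:]
    table = {}
    for vals, p in ft.items():
        key = vals[:i] + vals[i + 1:]
        table[key] = table.get(key, 0.0) + p
    return (vs, table)

def eliminate(factors, order):
    """Process one bucket per variable in `order`; return the remaining factor."""
    for var in order:
        bucket = [f for f in factors if var in f[0]]
        if not bucket:
            continue
        rest = [f for f in factors if var not in f[0]]
        h = bucket[0]
        for f in bucket[1:]:
            h = multiply(h, f)
        factors = rest + [sum_out(var, h)]      # record the new function h
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result
```

On a two-node chain with P(A=0)=0.6 and P(B=0|A) = 0.7/0.2, eliminating A yields the marginal P(B=0) = 0.6·0.7 + 0.4·0.2 = 0.5, while eliminating B returns P(A) unchanged.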
![Page 77: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/77.jpg)
SP2 77
Finding MPE = max_x P(x)
Algorithm elim-mpe (Dechter 1996): sum_b is replaced by max_b

MPE = max_{a,e,d,c,b} P(a) P(c|a) P(b|a) P(d|b,a) P(e|b,c)

Elimination operator: max_b

bucket B:  P(b|a)  P(d|b,a)  P(e|b,c)
bucket C:  P(c|a)  ||  h^B(a, d, c, e)
bucket D:          ||  h^C(a, d, e)
bucket E:  e=0     ||  h^D(a, e)
bucket A:  P(a)    ||  h^E(a)

Result: MPE value

W* = 4 — “induced width” (max clique size)
![Page 78: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/78.jpg)
SP2 78
Generating the MPE-tuple
bucket B:  P(b|a)  P(d|b,a)  P(e|b,c)
bucket C:  P(c|a)  ||  h^B(a, d, c, e)
bucket D:          ||  h^C(a, d, e)
bucket E:  e=0     ||  h^D(a, e)
bucket A:  P(a)    ||  h^E(a)

1. a' = argmax_a P(a) h^E(a)
2. e' = 0
3. d' = argmax_d h^C(a', d, e')
4. c' = argmax_c P(c|a') h^B(a', d', c, e')
5. b' = argmax_b P(b|a') P(d'|b,a') P(e'|b,c')

Return (a', b', c', d', e')
![Page 79: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/79.jpg)
SP2 79
Complexity of elimination: O(n exp(w*(d))),
where w*(d) is the induced width of the moral graph along ordering d.

The effect of the ordering: for the “moral” graph over A, B, C, D, E,
one ordering yields w*(d1) = 4, another yields w*(d2) = 2.
![Page 80: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/80.jpg)
SP2 80
Other tasks and algorithms

MAP and MEU tasks:
  Similar bucket-elimination algorithms – elim-map, elim-meu (Dechter 1996)
  Elimination operation: either summation or maximization
  Restriction on variable ordering: summation must precede maximization (i.e., hypothesis or decision variables are eliminated last)

Other inference algorithms:
  Join-tree clustering
  Pearl’s poly-tree propagation
  Conditioning, etc.
![Page 81: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/81.jpg)
SP2 81
Relationship with join-tree clustering
Ordering: A, B, C, D, E

bucket(A): P(a)
bucket(B): P(b|a)
bucket(C): P(c|a)
bucket(D): P(d|a,b)
bucket(E): P(e|b,c)

Clusters ABC, BCE, ADB form the join-tree; a cluster is a set of buckets (a “super-bucket”).
![Page 82: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/82.jpg)
SP2 82
Relationship with Pearl’s belief propagation in poly-trees
Pearl’s belief propagation for a single-root query = elim-bel using a topological ordering and super-buckets for families

[Figure: a poly-tree with parents U_1, U_2, U_3, intermediate nodes Z_1, Z_2, Z_3, node X_1, and child Y_1; lambda messages such as lambda_{Z_1}(u_1) carry “diagnostic support” and pi messages such as pi(x_1) carry “causal support”.]

Elim-bel, elim-mpe, and elim-map are linear for poly-trees.
![Page 83: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/83.jpg)
SP2 83
Conditioning generates the probability tree
P(a, e=0) = sum_b sum_c P(a) P(b|a) P(c|a) P(e=0|b,c) sum_d P(d|b,a)
Complexity of conditioning: exponential time, linear space
![Page 84: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/84.jpg)
SP2 84
Conditioning+Elimination
P(a, e=0) = sum_b sum_c P(a) P(b|a) P(c|a) P(e=0|b,c) sum_d P(d|b,a)

Idea: condition until the induced width w* of a (sub)problem gets small
![Page 85: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/85.jpg)
SP2 85
Super-bucket elimination(Dechter and El Fattah, 1996)
Eliminating several variables ‘at once’ Conditioning is done only in super-buckets
![Page 86: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/86.jpg)
SP2 86
The idea of super-buckets
Larger super-buckets (cliques) => more time but less space

Complexity:
1. Time: exponential in clique (super-bucket) size
2. Space: exponential in separator size
![Page 87: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/87.jpg)
SP2 87
Application: circuit diagnosis
Problem: given a circuit and its unexpected output, identify faulty components. The problem can be modeled as a constraint optimization problem and solved by bucket elimination.
![Page 88: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/88.jpg)
SP2 88
Time-Space Tradeoff
![Page 89: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/89.jpg)
SP2 89
“Road map”
  CSPs: complete algorithms
  CSPs: approximations
  Belief nets: complete algorithms
  Belief nets: approximations
    Local inference: mini-buckets
    Stochastic simulations
    Variational techniques
  MDPs
![Page 90: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/90.jpg)
SP2 90
Mini-buckets: “local inference”
The idea is similar to i-consistency: bound the size of recorded dependencies
Computation in a bucket is time and space exponential in the number of variables involved
Therefore, partition the functions in a bucket into “mini-buckets” over smaller numbers of variables
![Page 91: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/91.jpg)
SP2 91
Mini-bucket approximation: MPE task
Split a bucket into mini-buckets => bound complexity

Exponential complexity decrease: O(e^n) becomes O(e^r) + O(e^{n-r}), where the bucket’s function is approximated by a product of functions h^X and g^X computed in separate mini-buckets.
![Page 92: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/92.jpg)
SP2 92
Approx-mpe(i)
Input: i – max number of variables allowed in a mini-bucket
Output: [lower bound (the probability of a sub-optimal solution), upper bound]

Example: approx-mpe(3) (effective w* = 2) versus elim-mpe (w* = 4)
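The upper bound behind the MPE mini-bucket split comes from the inequality max_b f(b)·g(b) <= (max_b f(b))·(max_b g(b)). A two-function toy bucket makes this visible (the numbers are made up):

```python
# One bucket over variable B containing two functions f and g of b:
# processing them jointly gives the exact value max_b f(b)g(b); splitting
# them into two mini-buckets gives the upper bound (max_b f(b))(max_b g(b)).
def exact_bucket(f, g):
    return max(f[b] * g[b] for b in (0, 1))

def mini_buckets(f, g):
    return max(f.values()) * max(g.values())

f = {0: 0.9, 1: 0.2}   # e.g. P(d|b) for a fixed d (illustrative numbers)
g = {0: 0.1, 1: 0.8}   # e.g. P(e|b) for a fixed e
```

Here the exact bucket gives max(0.9·0.1, 0.2·0.8) = 0.16, while the mini-bucket bound is 0.9·0.8 = 0.72 — cheaper to compute, but only an upper bound.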
![Page 93: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/93.jpg)
SP2 93
Properties of approx-mpe(i)
Complexity: O(exp(2i)) time and O(exp(i)) space.
Accuracy: determined by upper/lower (U/L) bound.
As i increases, both accuracy and complexity increase.
Possible use of mini-bucket approximations: As anytime algorithms (Dechter and Rish, 1997) As heuristics in best-first search (Kask and Dechter,
1999)
Other tasks: similar mini-bucket approximations for: belief updating, MAP and MEU (Dechter and Rish, 1997)
![Page 94: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/94.jpg)
SP2 94
Anytime Approximation
anytime-mpe(epsilon):
  Initialize: i = 1
  While time and space resources are available:
    U = upper bound computed by approx-mpe(i)
    L = lower bound computed by approx-mpe(i)
        (keep the best solution found so far)
    if U/L <= 1 + epsilon, return solution
    i = i + step
  end
  Return the largest L and the smallest U found so far
![Page 95: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/95.jpg)
SP2 95
Empirical Evaluation (Dechter and Rish, 1997; Rish, 1999)
Randomly generated networks Uniform random probabilities Random noisy-OR
CPCS networks Probabilistic decoding
Comparing approx-mpe and anytime-mpe
versus elim-mpe
![Page 96: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/96.jpg)
SP2 96
Random networks Uniform random: 60 nodes, 90 edges (200 instances)
In 80% of cases, 10-100 times speed-up while U/L < 2
Noisy-OR – even better results:
exact elim-mpe was infeasible; approx-mpe took 0.1 to 80 sec.
P(x = 0 | y_1, ..., y_n) = prod_{i=1}^{n} q_i^{y_i},   q_i – random noise parameter
![Page 97: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/97.jpg)
SP2 97
CPCS networks – medical diagnosis (noisy-OR model)
Test case: no evidence

[Plot: anytime-mpe(0.0001) U/L error vs. time and parameter i (from i=1 to i=21), for cpcs422b and cpcs360b.]

Time (sec):
  Algorithm            | cpcs360 | cpcs422
  elim-mpe             |   115.8 |  1697.6
  anytime-mpe(1e-1)    |    70.3 |   110.5
  anytime-mpe(1e-4)    |    70.3 |   505.2
![Page 98: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/98.jpg)
SP2 98
The effect of evidence: likely evidence versus random (unlikely) evidence.
More likely evidence => higher MPE => higher accuracy (why?)

[Histograms: frequency of log(U/L) for i=10 on 1000 instances with random evidence and on 1000 instances with likely evidence.]
![Page 99: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/99.jpg)
SP2 99
Probabilistic decoding
Error-correcting linear block code

State-of-the-art: approximate algorithm – iterative belief propagation (IBP) (Pearl’s poly-tree algorithm applied to loopy networks)
![Page 100: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/100.jpg)
SP2 100
Iterative Belief Propagation

  Belief propagation is exact for poly-trees
  IBP – applying BP iteratively to cyclic networks
    No guarantees for convergence
    Works well for many coding networks

[Figure: one step of IBP — updating BEL(U_1) in a two-layer network with parents U_1, U_2, U_3 and children X_1, X_2, using the lambda and pi messages between the layers.]
![Page 101: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/101.jpg)
SP2 101
approx-mpe vs. IBP:
  approx-mpe is better on low-w* codes
  IBP is better on randomly generated (high-w*) codes
![Page 102: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/102.jpg)
SP2 102
Mini-buckets: summary Mini-buckets – local inference approximation
Idea: bound size of recorded functions
Approx-mpe(i) – mini-bucket algorithm for MPE:
  Better results for noisy-OR than for random problems
  Accuracy increases with decreasing noise
  Accuracy increases for likely evidence
  Sparser graphs -> higher accuracy
  Coding networks: approx-mpe outperforms IBP on low-induced-width codes
![Page 103: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/103.jpg)
SP2 103
Heuristic search

  Mini-buckets record upper-bound heuristics
  The evaluation function over a partial assignment x^p = (x_1, ..., x_p):

    f(x^p) = g(x^p) * H(x^p), where
    g(x^p) = prod_{i=1}^{p} P(x_i | pa_i)
    H(x^p) = prod_{h_j in bucket_p} h_j

  Best-first: expand a node with maximal evaluation function
  Branch and Bound: prune if f >= upper bound
  Properties:
    an exact algorithm
    better heuristics lead to more pruning
![Page 104: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/104.jpg)
SP2 104
Heuristic Function
Given a cost function
P(a,b,c,d,e) = P(a) • P(b|a) • P(c|a) • P(e|b,c) • P(d|b,a)

Define an evaluation function over a partial assignment as the probability of its best extension:

f*(a,e,d) = max_{b,c} P(a,b,c,d,e)
          = P(a) • max_{b,c} P(b|a) • P(c|a) • P(e|b,c) • P(d|a,b)
          = g(a,e,d) • H*(a,e,d)

[Figure: a partial search tree over assignments to A, E, D, B with 0/1 branches.]
![Page 105: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/105.jpg)
SP2 105
Heuristic Function
H*(a,e,d) = max_{b,c} P(b|a) • P(c|a) • P(e|b,c) • P(d|a,b)
          = max_c P(c|a) • max_b P(e|b,c) • P(b|a) • P(d|a,b)
         <= max_c P(c|a) • max_b P(e|b,c) • max_b P(b|a) • P(d|a,b)
          = H(a,e,d)

f(a,e,d) = g(a,e,d) • H(a,e,d) >= f*(a,e,d)
The heuristic function H is compiled during the preprocessing stage of the
Mini-Bucket algorithm.
![Page 106: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/106.jpg)
SP2 106
bucket B (two mini-buckets):  max_B  P(e|b,c)  ||  P(d|a,b)  P(b|a)
bucket C:  max_C  P(c|a)  h^B(e,c)
bucket D:  max_D  h^B(d,a)
bucket E:  max_E  h^C(e,a)
bucket A:  max_A  P(a)  h^E(a)  h^D(a)
Heuristic Function
The evaluation function f(x^p) can be computed using the functions recorded by the Mini-Bucket scheme, and estimates the probability of the best extension of the partial assignment x^p = (x_1, ..., x_p):

f(x^p) = g(x^p) H(x^p)

For example,
  H(a,e,d) = h^B(d,a) h^C(e,a)
  g(a,e,d) = P(a)
![Page 107: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/107.jpg)
SP2 107
Properties
  Heuristic is monotone
  Heuristic is admissible
  Heuristic is computed in linear time

IMPORTANT:
  Mini-buckets generate heuristics of varying strength using the control parameter (bound) i
  Higher bound -> more preprocessing -> stronger heuristics -> less search
  Allows a controlled trade-off between preprocessing and search
![Page 108: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/108.jpg)
SP2 108
Empirical Evaluation of mini-bucket heuristics
[Plots: fraction of problems solved exactly vs. time (0-30 sec) for Branch-and-Bound with mini-bucket heuristics (BBMB) and Best-First with mini-bucket heuristics (BFMB) at i = 2, 6, 10, 14; random coding networks, K=100, noise 0.32.]
![Page 109: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/109.jpg)
SP2 109
“Road map”
  CSPs: complete algorithms
  CSPs: approximations
  Belief nets: complete algorithms
  Belief nets: approximations
    Local inference: mini-buckets
    Stochastic simulations
    Variational techniques
  MDPs
![Page 110: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/110.jpg)
SP2 110
Stochastic Simulation Forward sampling (logic sampling) Likelihood weighing Markov Chain Monte Carlo
(MCMC): Gibbs sampling
![Page 111: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/111.jpg)
SP2 111
Approximation via Sampling
1. Generate samples s_1, ..., s_N from P(X), where s_i = (x_1^i, ..., x_n^i).

2. Estimate probabilities by frequencies:
   P(Y = y) ≈ #{samples with Y = y} / N

3. How to handle evidence E?
   - acceptance-rejection (e.g., forward sampling)
   - “clamping” evidence nodes to their values:
     * likelihood weighing
     * Gibbs sampling (MCMC)
SP2 112
Forward Sampling (logic sampling (Henrion, 1988))

Input: evidence E, N – # of samples, an ancestral ordering X_1, ..., X_n
Output: N samples consistent with E

1. For sample = 1 to N
2.   For i = 1 to n
3.     sample x_i from P(x_i | pa_i)
4.     if X_i ∈ E and x_i ≠ e_i, reject the sample and goto 1
5. end
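The rejection loop above can be sketched directly; the CPD representation (a function of the partial assignment returning P(v = 1 | pa_v)) and all names are my own illustrative choices:

```python
import random

def forward_sample(cpds, order, evidence, rng=random):
    """Forward (logic) sampling sketch for binary variables.
    `order` must be an ancestral ordering. Returns one sample consistent
    with `evidence`, rejecting and restarting on any mismatch."""
    while True:
        x = {}
        ok = True
        for v in order:
            x[v] = 1 if rng.random() < cpds[v](x) else 0
            if v in evidence and x[v] != evidence[v]:
                ok = False            # reject the whole sample
                break
        if ok:
            return x
```

On the chain A -> B with P(A=1)=0.5 and P(B=1|A) = 0.9/0.1, accepted samples with B=1 should show A=1 about 90% of the time, matching the exact posterior P(A=1|B=1) = 0.9.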
![Page 113: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/113.jpg)
SP2 113
Forward sampling (example)
[Figure: a four-node network X_1 -> X_2, X_1 -> X_3, {X_2, X_3} -> X_4, with CPDs P(x_1), P(x_2|x_1), P(x_3|x_1), P(x_4|x_2,x_3).]

Evidence: X_3 = 0. To generate sample k:
1. Sample x_1 from P(x_1)
2. Sample x_2 from P(x_2 | x_1)
3. Sample x_3 from P(x_3 | x_1)
4. If x_3 ≠ 0, reject the sample and start again from 1; otherwise
5. sample x_4 from P(x_4 | x_2, x_3)

Drawback: high rejection rate!
![Page 114: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/114.jpg)
SP2 114
Likelihood Weighing (Fung and Chang, 1990; Shachter and Peot, 1990)

“Clamping” evidence + forward sampling + weighing samples by evidence likelihood

1. For each X_i ∈ E, assign x_i = e_i.
2. Find an ancestral ordering of the nodes, X_1, ..., X_n.
3. For sample = 1 to N:
4.   For each X_i ∉ E:
5.     sample x_i from P(x_i | pa_i)
6.   score(sample) = prod_{X_i ∈ E} P(e_i | pa_i)
7. Normalize the scores:
   P(Y = y | E) ≈ sum_{samples with Y=y} score(sample) / sum_{samples} score(sample)

Works well for likely evidence!
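A sketch of the clamp-sample-weigh estimator, under the same made-up CPD representation as before (a function of the partial assignment returning P(v = 1 | pa_v)):

```python
import random

def likelihood_weighing(cpds, order, evidence, query_var, n=1000, rng=random):
    """Likelihood-weighing sketch for binary variables: clamp evidence nodes,
    forward-sample the rest, and weigh each sample by the likelihood of the
    evidence. Returns an estimate of P(query_var = 1 | evidence)."""
    num = den = 0.0
    for _ in range(n):
        x, w = {}, 1.0
        for v in order:                       # ancestral ordering
            p1 = cpds[v](x)
            if v in evidence:
                x[v] = evidence[v]            # clamp, don't sample
                w *= p1 if x[v] == 1 else 1.0 - p1
            else:
                x[v] = 1 if rng.random() < p1 else 0
        num += w * x[query_var]
        den += w
    return num / den
```

No samples are rejected: every sample contributes, weighted by how well it explains the evidence, which is why the method shines when the evidence is likely.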
![Page 115: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/115.jpg)
SP2 115
Gibbs Sampling (Geman and Geman, 1984)

Markov Chain Monte Carlo (MCMC): create a Markov chain of samples

1. For each X_i ∈ E, assign x_i = e_i.
2. For each X_i ∉ E, assign x_i = a random value.
3. For sample = 1 to N:
4.   For each X_i ∉ E:
5.     sample x_i from P(x_i | x \ {x_i})

Advantage: guaranteed to converge to P(X)
Disadvantage: convergence may be slow
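A toy Gibbs sampler following the loop above. For clarity this sketch computes each full conditional by brute force from the joint rather than locally from the Markov blanket; the interface (a `joint` function over full assignments) and the binary-variable restriction are my own assumptions:

```python
import random

def gibbs(joint, variables, evidence, n=5000, burn=500, rng=random):
    """Gibbs-sampling sketch for binary variables. `joint(x)` returns the
    (possibly unnormalized) joint probability of a full assignment dict x.
    Evidence variables stay clamped; the rest are resampled in turn from
    P(x_i | x \\ {x_i}). Returns estimates of P(v = 1 | evidence)."""
    x = dict(evidence)
    free = [v for v in variables if v not in evidence]
    for v in free:
        x[v] = rng.randrange(2)               # random initial value
    counts = {v: 0 for v in free}
    for t in range(n):
        for v in free:
            p = []
            for val in (0, 1):                # full conditional of v
                x[v] = val
                p.append(joint(x))
            x[v] = 1 if rng.random() < p[1] / (p[0] + p[1]) else 0
        if t >= burn:                         # discard burn-in samples
            for v in free:
                counts[v] += x[v]
    return {v: counts[v] / (n - burn) for v in free}
```

On the A -> B chain with B=1 observed, the chain's empirical frequency of A=1 should approach the exact posterior 0.9.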
![Page 116: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/116.jpg)
SP2 116
Gibbs Sampling (cont’d) (Pearl, 1988)

Important: P(x_i | x \ {x_i}) is computed locally:

  P(x_i | x \ {x_i}) ∝ P(x_i | pa_i) prod_{X_j ∈ ch_i} P(x_j | pa_j)

Markov blanket:
  M_i = pa_i ∪ ch_i ∪ (∪_{X_j ∈ ch_i} pa_j)

Given its Markov blanket (parents, children, and their parents), X_i is independent of all other nodes.
![Page 117: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/117.jpg)
SP2 117
“Road map”
  CSPs: complete algorithms
  CSPs: approximations
  Belief nets: complete algorithms
  Belief nets: approximations
    Local inference: mini-buckets
    Stochastic simulations
    Variational techniques
  MDPs
![Page 118: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/118.jpg)
SP2 118
Variational Approximations

Idea: a variational transformation of CPDs simplifies inference

Advantages:
  Compute upper and lower bounds on P(Y)
  Usually faster than sampling techniques
Disadvantages:
  More complex and less general: must be re-derived for each particular form of CPD functions
![Page 119: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/119.jpg)
SP2 119
Variational bounds: example
log(x) = min_lambda {lambda*x - log(lambda) - 1} <= lambda*x - log(lambda) - 1

lambda – variational parameter

This approach can be generalized to any concave (convex) function in order to compute its upper (lower) bounds.
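A quick numeric check of the bound log(x) <= lambda*x - log(lambda) - 1: the minimizing parameter is lambda = 1/x, where the bound is tight, and any other lambda gives a strictly looser upper bound (x = 2.5 and lambda = 0.7 below are arbitrary choices):

```python
import math

def upper_bound(x, lam):
    """Variational upper bound on log(x) with parameter lam > 0."""
    return lam * x - math.log(lam) - 1.0

x = 2.5
best = upper_bound(x, 1.0 / x)    # optimal parameter: bound equals log(x)
loose = upper_bound(x, 0.7)       # any other lam: bound strictly above log(x)
```

Substituting lam = 1/x gives (1/x)*x - log(1/x) - 1 = log(x), confirming that minimizing over the variational parameter recovers the original function exactly.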
![Page 120: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/120.jpg)
SP2 120
Convex duality approach (Jaakkola and Jordan, 1997)

1. If f(x) is concave, it has a concave dual f*(lambda) such that:

     f(x) = min_lambda {lambda^T x - f*(lambda)}
     f*(lambda) = min_x {lambda^T x - f(x)}

   and we get upper bounds:  f(x) <= lambda^T x - f*(lambda).

2. For convex f(x), we get lower bounds.
SP2 121
Example: QMR-DT network (Quick Medical Reference – Decision-Theoretic; Shwe et al., 1991)
[Figure: a two-layer network with ~600 disease nodes d_1, ..., d_k on top and ~4000 finding nodes f_1, ..., f_n below.]

Noisy-OR model:

  P(f_i = 0 | d) = (1 - q_{i0}) prod_{j ∈ pa_i} (1 - q_{ij})^{d_j}
                 = e^{-theta_{i0} - sum_{j ∈ pa_i} theta_{ij} d_j},

  where theta_{ij} = -log(1 - q_{ij})
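The two equivalent forms of the noisy-OR CPD — the product form and the exponential form with theta = -log(1 - q) — can be checked numerically; the q values below are made up:

```python
import math

def p_finding_off(d, q, q0=0.0):
    """P(f=0|d) = (1 - q0) * prod_j (1 - q_j)^{d_j}  (product form).
    d is a 0/1 parent vector, q the noise parameters, q0 the leak."""
    p = 1.0 - q0
    for dj, qj in zip(d, q):
        p *= (1.0 - qj) ** dj
    return p

def p_finding_off_exp(d, q, q0=0.0):
    """Same quantity via exp(-theta0 - sum_j theta_j d_j), theta_j = -log(1 - q_j)."""
    theta0 = -math.log(1.0 - q0)
    theta = [-math.log(1.0 - qj) for qj in q]
    return math.exp(-theta0 - sum(t * dj for t, dj in zip(theta, d)))
```

For d = [1, 0, 1], q = [0.5, 0.9, 0.2], q0 = 0.1 both forms give 0.9 · 0.5 · 0.8 = 0.36; the exponential form is what makes the variational treatment below tractable.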
![Page 122: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/122.jpg)
SP2 122
Inference in QMR-DT
Inference complexity: O(exp(min{p, k})), where p = # of positive findings, k = max family size
(Heckerman, 1989 (“Quickscore”); Rish and Dechter, 1998)

P(d | f) ∝ P(f | d) P(d) = prod_{i ∈ f+} P(f_i = 1 | d) prod_{i ∈ f-} P(f_i = 0 | d) P(d)

Negative findings are factorized:
  P(f_i = 0 | d) = e^{-theta_{i0} - sum_{j ∈ pa_i} theta_{ij} d_j}
                 = e^{-theta_{i0}} prod_{j ∈ pa_i} [e^{-theta_{ij}}]^{d_j}

Positive evidence “couples” the disease nodes:
  P(f_i = 1 | d) = 1 - e^{-theta_{i0} - sum_{j ∈ pa_i} theta_{ij} d_j}

Inference: P(d_1 | f) = sum_{d_2, ..., d_k} P(d, f)
![Page 123: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/123.jpg)
SP2 123
Variational approach to QMR-DT(Jaakkola and Jordan, 1997)
f(x) = ln(1 − e^{−x}) is concave and has a concave dual

f*(λ) = −λ ln λ + (λ + 1) ln(λ + 1)

Then P(f_i = 1 | d) = 1 − e^{−θ_i0 − Σ_{j∈pa(i)} θ_ij d_j} can be bounded by:

P(f_i = 1 | d) ≤ e^{λ_i (θ_i0 + Σ_{j∈pa(i)} θ_ij d_j) − f*(λ_i)}
              = e^{λ_i θ_i0 − f*(λ_i)} ∏_{j∈pa(i)} [e^{λ_i θ_ij}]^{d_j}

The effect of positive evidence is now factorized (diseases are “decoupled”)
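The bound above follows directly from the convex-duality slide: since f(x) = min_λ{λx − f*(λ)}, we have ln(1 − e^{−x}) ≤ λx − f*(λ) for every λ > 0, and exponentiating gives the factorized upper bound on the positive-finding CPD. A quick numerical check (hypothetical values of x and λ):

```python
import math

# f(x) = ln(1 - e^{-x}) is concave with dual
# f*(lam) = -lam*ln(lam) + (lam+1)*ln(lam+1), so for any lam > 0:
#   P(f = 1 | d) = 1 - e^{-x} <= exp(lam*x - f*(lam)),
# where x = theta_0 + sum_j theta_j * d_j.

def f_star(lam):
    return -lam * math.log(lam) + (lam + 1) * math.log(lam + 1)

def exact(x):
    return 1.0 - math.exp(-x)

def bound(x, lam):
    return math.exp(lam * x - f_star(lam))

for x in [0.1, 0.5, 1.0, 2.0, 5.0]:
    for lam in [0.05, 0.2, 1.0, 3.0]:
        assert exact(x) <= bound(x, lam) + 1e-12
```

Because the bound is an exponential of a term linear in the d_j, each disease contributes an independent factor, which is precisely what restores tractability.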
![Page 124: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/124.jpg)
SP2 124
Variational approach (cont.)
Bounds on local CPDs yield a bound on the posterior

Two approaches: sequential and block
 Sequential: applies the variational transformation to (a subset of) nodes
 sequentially during inference, using a heuristic node ordering; then
 optimizes over the variational parameters
 Block: selects in advance the nodes to be transformed, then selects the
 variational parameters that minimize the KL-distance between the true and
 approximate posteriors
![Page 125: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/125.jpg)
SP2 125
Block approach
P(Y | E) – exact posterior of Y given evidence E
Q(Y | E, λ) – approximation after replacing some CPDs with their variational bounds

Find λ* = argmin_λ D(Q || P),

where D(Q || P) is the Kullback–Leibler (KL) distance:

D(Q || P) = Σ_S Q(S) log [ Q(S) / P(S) ]
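The KL distance in the formula above is straightforward to compute for discrete distributions; a toy sketch with hypothetical distributions Q and P over three joint configurations:

```python
import math

# KL distance D(Q||P) = sum_S Q(S) * log(Q(S) / P(S))
# over joint configurations S (hypothetical toy distributions).
Q = [0.2, 0.5, 0.3]
P = [0.25, 0.25, 0.5]

def kl(q, p):
    # Terms with q_i = 0 contribute 0 by convention.
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

d = kl(Q, P)
assert d >= 0.0                  # KL is non-negative ...
assert abs(kl(Q, Q)) < 1e-12     # ... and zero when Q == P
```

Note that D(Q||P) is asymmetric: minimizing over the variational parameters λ with Q in the first argument is the choice made in the block approach.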
![Page 126: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/126.jpg)
SP2 126
Variational approach: summary

Variational approximations were successfully applied to inference in QMR-DT
and in neural networks (logistic functions), and to learning (approximate
E-step in the EM algorithm)

For more details, see:
 Saul, Jaakkola, and Jordan, 1996
 Jaakkola and Jordan, 1997
 Neal and Hinton, 1998
 Jordan, 1999
![Page 127: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/127.jpg)
SP2 127
“Road map”
CSPs: complete algorithms
CSPs: approximations
Belief nets: complete algorithms
Belief nets: approximations
MDPs: elimination and conditioning
![Page 128: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/128.jpg)
SP2 128
Decision-Theoretic Planning
Example: robot navigation
 State = {X, Y, Battery_Level}
 Actions = {Go_North, Go_South, Go_West, Go_East}
 Probability of success = P
 Task: reach the goal location ASAP
![Page 129: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/129.jpg)
SP2 129
Dynamic Belief Networks (DBNs)
Two-stage influence diagram Interaction graph
![Page 130: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/130.jpg)
SP2 130
Markov Decision Process
1. State x = {x_1, …, x_n}, domain D; state space Ω = Dⁿ
2. Action a = {a_1, …, a_m}, domain D_a; action space Ω_a = D_aᵐ
3. Transition probabilities P(y | x, a)
4. Reward r(x, a) for taking action a in state x
5. N – number of time slices
6. Problem: find an optimal policy π
   - Finite-horizon MDP (N slices): π = (d_1, …, d_N), d_t : Ω → Ω_a
   - Infinite-horizon MDP: stationary π : Ω → Ω_a
7. Criterion: maximum expected total (discounted) reward

   V^π(x) = r(x, π(x)) + γ Σ_y P(y | x, π(x)) V^π(y)
![Page 131: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/131.jpg)
SP2 131
Dynamic Programming: Elimination
Optimality equation:

V_t(x_t) = max_{a_t} { r(x_t, a_t) + Σ_{x_{t+1}} P(x_{t+1} | x_t, a_t) V_{t+1}(x_{t+1}) },
V_N(x_N) = r_N(x_N)

Complexity of dynamic programming:

O(N |Ω_X|² |Ω_A|) = O(N |D_X|^{2n} |D_A|^m)

Decomposable utilities and probabilities:

r(x^t, a^t) = Σ_i r_i(x_i^t, a^t)
P(x^{t+1} | x^t, a^t) = ∏_i P(x_i^{t+1} | pa(x_i^{t+1}))
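The backward recursion in the optimality equation can be sketched on a tiny MDP (all rewards and transition probabilities below are hypothetical):

```python
# Finite-horizon dynamic programming:
#   V_N(x) = r_N(x);
#   V_t(x) = max_a { r(x,a) + sum_y P(y|x,a) * V_{t+1}(y) }.
S, A, N = 2, 2, 10          # 2 states, 2 actions, horizon 10
r = [[1.0, 0.0],            # r[x][a]: reward for action a in state x
     [0.0, 2.0]]
# P[x][a][y]: transition probabilities; each row sums to 1.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]

V = [0.0] * S                              # terminal values V_N
for t in range(N - 1, -1, -1):             # backward recursion over time slices
    V = [max(r[x][a] + sum(P[x][a][y] * V[y] for y in range(S))
             for a in range(A))
         for x in range(S)]

assert all(v >= 0.0 for v in V)
```

Each backward step costs O(|Ω_X|² |Ω_A|), matching the complexity bound above; with decomposable rewards and transitions, bucket elimination can exploit the factored structure instead of enumerating flat states.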
![Page 132: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/132.jpg)
SP2 132
Bucket Elimination
Complexity: O(exp(w*))
![Page 133: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/133.jpg)
SP2 133
MDPs: Elimination and Conditioning
Finite-horizon MDPs (N slices): dynamic programming = elimination along a
temporal ordering

Infinite-horizon MDPs (iterative):
 Value Iteration (VI) = elimination along a temporal ordering
 Policy Iteration (PI) = conditioning on A_j, elimination on X_j

Bucket elimination allows “non-temporal” orderings.
Complexity: O(exp(w*)), where the induced width w* of a good non-temporal
ordering can be smaller than the 2n induced by temporal orderings.
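Value Iteration, mentioned above, iterates the Bellman update until a fixed point; a minimal sketch on the same kind of hypothetical 2-state MDP (discount γ < 1 guarantees convergence):

```python
# Infinite-horizon value iteration on a hypothetical 2-state, 2-action MDP.
gamma = 0.9                 # discount factor
r = [[1.0, 0.0],            # r[x][a]
     [0.0, 2.0]]
P = [[[0.9, 0.1], [0.2, 0.8]],   # P[x][a][y]
     [[0.5, 0.5], [0.1, 0.9]]]

V = [0.0, 0.0]
for _ in range(1000):
    V_new = [max(r[x][a] + gamma * sum(P[x][a][y] * V[y] for y in range(2))
                 for a in range(2))
             for x in range(2)]
    if max(abs(V_new[x] - V[x]) for x in range(2)) < 1e-10:
        break                          # converged to the Bellman fixed point
    V = V_new

# Greedy policy extracted from the converged values:
policy = [max(range(2), key=lambda a: r[x][a] +
              gamma * sum(P[x][a][y] * V[y] for y in range(2)))
          for x in range(2)]
assert policy[1] == 1    # in state 1, keep taking the action with reward 2
```

Policy Iteration instead alternates policy evaluation (a linear system, i.e., elimination on the X_j) with greedy policy improvement (conditioning on the A_j).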
![Page 134: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/134.jpg)
SP2 134
MDPs: approximations Open directions for further research:
Applying probabilistic inference approximations to DBNs
Handling actions (rewards)
Approximating elimination, heuristic search, etc.
![Page 135: Approximation Techniques for Automated Reasoning Irina Rish IBM T.J.Watson Research Center rish@us.ibm.com Rina Dechter University of California, Irvine.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649ec85503460f94bd594d/html5/thumbnails/135.jpg)
SP2 135
Conclusions

Common reasoning approaches: elimination and conditioning
Exact reasoning is often intractable => need approximations
Approximation principles:
 Approximating elimination – local inference, bounding the size of
 dependencies among variables (cliques in a problem’s graph):
 mini-buckets, IBP, i-consistency enforcing
 Approximating conditioning – local search, stochastic simulations
 Other approximations: variational techniques, etc.
Further research:
 Combining “orthogonal” approximation approaches
 Better understanding of “what works well where”: which approximation suits
 which problem structure
 Other approximation paradigms (e.g., other ways of approximating
 probabilities, constraints, cost functions)