Efficient Weight Learning for Markov Logic Networks
![Page 1: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/1.jpg)
Efficient Weight Learning for Markov Logic Networks
Daniel Lowd, University of Washington
(Joint work with Pedro Domingos)
![Page 2: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/2.jpg)
Outline
- Background
- Algorithms
  - Gradient descent
  - Newton's method
  - Conjugate gradient
- Experiments
  - Cora – entity resolution
  - WebKB – collective classification
- Conclusion
![Page 3: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/3.jpg)
Markov Logic Networks
- Statistical Relational Learning: combining probability with first-order logic
- Markov Logic Network (MLN) = weighted set of first-order formulas
- Applications: link prediction [Richardson & Domingos, 2006], entity resolution [Singla & Domingos, 2006], information extraction [Poon & Domingos, 2007], and more…
P(X = x) = (1/Z) exp( Σᵢ wᵢ nᵢ(x) )
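To make the MLN distribution P(X = x) = (1/Z) exp(Σᵢ wᵢ nᵢ(x)) concrete, here is a minimal pure-Python sketch that brute-forces Z over all possible worlds of a tiny hypothetical MLN: a single formula Smokes(x) ⇒ Cancer(x) over two constants. The formula, constants, and weight are illustrative assumptions, not from the talk.

```python
import itertools
import math

people = ["Anna", "Bob"]
w = 1.5  # weight of the single (hypothetical) formula Smokes(x) => Cancer(x)

def n_true(world):
    # n_i(x): number of true groundings of Smokes(x) => Cancer(x) in this world
    return sum(1 for p in people
               if (not world[("Smokes", p)]) or world[("Cancer", p)])

# Enumerate every possible world (truth assignment to all ground atoms).
atoms = [(pred, p) for pred in ("Smokes", "Cancer") for p in people]
worlds = [dict(zip(atoms, vals))
          for vals in itertools.product([False, True], repeat=len(atoms))]

# Partition function Z normalizes the distribution.
Z = sum(math.exp(w * n_true(x)) for x in worlds)

def prob(world):
    # P(X = x) = exp(sum_i w_i n_i(x)) / Z
    return math.exp(w * n_true(world)) / Z
```

Brute-force enumeration is only feasible for toy domains; real MLNs with millions of ground clauses need the approximate inference discussed later.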
![Page 4: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/4.jpg)
Example: WebKB
Collective classification of university web pages:

Has(page, "homework") ⇒ Class(page, Course)
¬Has(page, "sabbatical") ⇒ Class(page, Student)
Class(page1, Student) ∧ LinksTo(page1, page2) ⇒ Class(page2, Professor)
![Page 5: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/5.jpg)
Example: WebKB
Collective classification of university web pages:

Has(page, +word) ⇒ Class(page, +class)
¬Has(page, +word) ⇒ Class(page, +class)
Class(page1, +class1) ∧ LinksTo(page1, page2) ⇒ Class(page2, +class2)
![Page 6: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/6.jpg)
Overview
Discriminative weight learning in MLNs is a convex optimization problem.

- Problem: It can be prohibitively slow.
- Solution: Second-order optimization methods.
- Problem: Line search and function evaluations are intractable.
- Solution: This talk!
![Page 7: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/7.jpg)
Sneak preview
[Chart: AUC vs. time (s, log scale, 1 to 100,000); series: Before, After]
![Page 8: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/8.jpg)
Outline
- Background
- Algorithms
  - Gradient descent
  - Newton's method
  - Conjugate gradient
- Experiments
  - Cora – entity resolution
  - WebKB – collective classification
- Conclusion
![Page 9: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/9.jpg)
Gradient descent
Move in the direction of steepest descent, scaled by a learning rate η:

w_{t+1} = w_t + η g_t
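The update w_{t+1} = w_t + η g_t can be sketched in a few lines of Python; the toy objective below (a simple concave quadratic, maximized by ascending its gradient) is an illustrative assumption, not part of the talk.

```python
def gradient_step(w, g, lr=0.1):
    # w_{t+1} = w_t + eta * g_t (ascent on the log-likelihood)
    return [wi + lr * gi for wi, gi in zip(w, g)]

# Toy example: maximize f(w) = -(w0^2 + w1^2); its gradient is -2w,
# so repeated steps shrink w toward the optimum at the origin.
w = [1.0, -2.0]
for _ in range(200):
    g = [-2.0 * wi for wi in w]
    w = gradient_step(w, g)
```

Each step multiplies every coordinate by (1 - 2·lr), so convergence here is geometric; the point of the talk is that a single global η behaves far worse on ill-conditioned MLN objectives.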
![Page 10: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/10.jpg)
Gradient descent in MLNs

The gradient of the conditional log-likelihood is:

∂ log P(Y = y | X = x) / ∂wᵢ = nᵢ − E[nᵢ]

- Problem: Computing the expected counts E[nᵢ] is hard.
- Solution: Voted perceptron [Collins, 2002; Singla & Domingos, 2005]
  - Approximate counts use the MAP state
  - MAP state approximated using MaxWalkSAT
  - The only algorithm previously used for MLN discriminative learning
- Solution: Contrastive divergence [Hinton, 2002]
  - Approximate counts from a few MCMC samples
  - MC-SAT gives less correlated samples [Poon & Domingos, 2006]
  - Never before applied to Markov logic
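The gradient nᵢ − E[nᵢ] with a sample-based expectation (the contrastive-divergence approach above) can be sketched as follows; the count values are made up for illustration, and real samples would come from MC-SAT or another MCMC procedure.

```python
def cll_gradient(observed_counts, sampled_counts):
    """Gradient of the conditional log-likelihood w.r.t. each weight:
    dL/dw_i = n_i(observed) - E[n_i], with the expectation estimated
    as the mean of clause counts over a few MCMC samples."""
    k = len(sampled_counts)
    expected = [sum(s[i] for s in sampled_counts) / k
                for i in range(len(observed_counts))]
    return [n - e for n, e in zip(observed_counts, expected)]

# Hypothetical counts: 2 clauses, observed counts vs. 2 sampled worlds.
g = cll_gradient([3.0, 1.0], [[2.0, 2.0], [4.0, 0.0]])
```

When the sampled expectation matches the observed counts, as in this contrived example, the gradient is zero and the weights are at a stationary point.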
![Page 11: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/11.jpg)
Per-weight learning rates
- Some clauses have vastly more groundings than others:
  - Smokes(X) ⇒ Cancer(X)
  - Friends(A,B) ∧ Friends(B,C) ⇒ Friends(A,C)
- Need a different learning rate in each dimension
- Impractical to tune the rate for each weight by hand
- Learning rate in each dimension:

  ηᵢ = η / (# of true groundings of clause i)
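The per-weight rate ηᵢ = η / (# of true groundings of clause i) is a one-liner; the grounding counts below are illustrative assumptions.

```python
def per_weight_rates(base_lr, true_groundings):
    # eta_i = eta / (# of true groundings of clause i);
    # clauses with many groundings get proportionally smaller steps
    return [base_lr / max(1, n) for n in true_groundings]

# Hypothetical clauses with 10, 1000, and 1 true groundings:
rates = per_weight_rates(1.0, [10, 1000, 1])
```

A transitivity clause like Friends(A,B) ∧ Friends(B,C) ⇒ Friends(A,C) has O(n³) groundings versus O(n) for a unary clause, so without this scaling a single global rate either crawls on one clause or oscillates on the other.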
![Page 12: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/12.jpg)
Ill-Conditioning
- Skewed surface ⇒ slow convergence
- Condition number: λmax/λmin of the Hessian
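The condition number λmax/λmin can be computed in closed form for a 2×2 symmetric Hessian; this small sketch (the matrix values are illustrative assumptions) shows how a skewed surface yields a large κ.

```python
def condition_number_2x2(h):
    # kappa = lambda_max / lambda_min for a symmetric 2x2 matrix,
    # using the closed-form eigenvalues from trace and determinant
    (a, b), (_, d) = h
    tr, det = a + d, a * d - b * b
    disc = (tr * tr / 4.0 - det) ** 0.5
    lam_max = tr / 2.0 + disc
    lam_min = tr / 2.0 - disc
    return lam_max / lam_min

# Hypothetical Hessian: one direction 100x more curved than the other.
kappa = condition_number_2x2([[100.0, 0.0], [0.0, 1.0]])
```

Gradient descent's convergence rate degrades roughly linearly in κ, which is why the Cora condition number reported later (> 600,000) makes plain gradient methods so slow.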
![Page 13: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/13.jpg)
The Hessian matrix
- Hessian matrix: matrix of all second derivatives
- In an MLN, the Hessian is the negative covariance matrix of the clause counts
  - Diagonal entries are clause variances
  - Off-diagonal entries show correlations
- Shows the local curvature of the error function
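Since the Hessian is the negative covariance matrix of the clause counts, it can be estimated from the same MCMC samples used for the gradient. A sketch, with made-up sampled counts:

```python
def hessian_from_samples(count_samples):
    """Estimate the CLL Hessian as the negative covariance matrix of
    clause counts, from sampled counts (one list of counts per sample):
    H_ij = -(E[n_i n_j] - E[n_i] E[n_j])."""
    k = len(count_samples)
    m = len(count_samples[0])
    mean = [sum(s[i] for s in count_samples) / k for i in range(m)]
    return [[-(sum(s[i] * s[j] for s in count_samples) / k
               - mean[i] * mean[j])
             for j in range(m)] for i in range(m)]

# Two hypothetical samples of counts for two clauses:
H = hessian_from_samples([[1.0, 2.0], [3.0, 2.0]])
```

The diagonal entries are (negated) clause-count variances and the off-diagonals are (negated) correlations, exactly as the slide describes.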
![Page 14: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/14.jpg)
Newton’s method
- Weight update: w = w + H⁻¹ g
- Converges in one step if the error surface is quadratic
- Requires inverting the Hessian matrix
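The one-step-on-a-quadratic property can be checked on a toy 2-weight problem. This sketch writes the step as w − H⁻¹g with H the true (negative-definite) Hessian, which is the same step as the slide's w + H⁻¹g when H is taken to be the positive count covariance; the example objective is an illustrative assumption.

```python
def newton_step(w, g, h):
    # One Newton update w' = w - H^{-1} g, with the 2x2 inverse
    # written out in closed form (Cramer's rule).
    (a, b), (c, d) = h
    det = a * d - b * c
    dx = (d * g[0] - b * g[1]) / det
    dy = (a * g[1] - c * g[0]) / det
    return [w[0] - dx, w[1] - dy]

# Toy quadratic f(w) = -(w0 - 1)^2 - (w1 + 2)^2, maximized at (1, -2).
# At w = (0, 0): gradient = (2, -4), Hessian = -2I.
w = newton_step([0.0, 0.0], [2.0, -4.0], [[-2.0, 0.0], [0.0, -2.0]])
```

One step lands exactly on the optimum; on non-quadratic surfaces Newton only converges locally, and inverting the full Hessian is intractable for thousands of weights, motivating the diagonal approximation next.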
![Page 15: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/15.jpg)
Diagonalized Newton’s method
- Weight update: w = w + D⁻¹ g
- Converges in one step if the error surface is quadratic AND the features are uncorrelated
- (May need to determine the step length…)
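Using only the diagonal D (the per-clause count variances, which are non-negative) reduces the update to an element-wise division; the example values are illustrative assumptions.

```python
def diag_newton_step(w, g, diag_var):
    # w_i' = w_i + g_i / D_ii, where D_ii is the variance of clause i's
    # counts (the magnitude of the Hessian's diagonal entry); the small
    # floor guards against division by a near-zero variance
    return [wi + gi / max(di, 1e-8)
            for wi, gi, di in zip(w, g, diag_var)]

# Hypothetical gradient and per-clause variances:
w2 = diag_newton_step([0.0, 0.0], [2.0, 4.0], [2.0, 2.0])
```

This is effectively per-weight learning rates chosen automatically from curvature, rather than hand-tuned, which is why DN needs no learning rate in the experiments.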
![Page 16: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/16.jpg)
Conjugate gradient
- Include the previous direction in the new search direction
  - Avoids "undoing" any work
- If quadratic, finds n optimal weights in n steps
- Depends heavily on line searches
  - Finds the optimum along each search direction by function evaluations
![Page 17: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/17.jpg)
Scaled conjugate gradient
- Include the previous direction in the new search direction
  - Avoids "undoing" any work
- If quadratic, finds n optimal weights in n steps
- Uses the Hessian matrix in place of a line search
- Still cannot store the entire Hessian matrix in memory

[Møller, 1993]
![Page 18: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/18.jpg)
Step sizes and trust regions

- Choose the step length
  - Compute the optimal quadratic step length: α = gᵀd / dᵀHd
  - Limit the step size to a "trust region"
  - Key idea: within the trust region, the quadratic approximation is good
- Updating the trust region
  - Check the quality of the approximation (predicted vs. actual change in function value)
  - If good, grow the trust region; if bad, shrink it
- Modifications for MLNs
  - Fast computation of quadratic forms:

    dᵀHd = (E_w[Σᵢ dᵢnᵢ])² − E_w[(Σᵢ dᵢnᵢ)²]

  - Use a lower bound on the function change:

    f(w_{t+1}) − f(w_t) ≥ g_{t+1}ᵀ(w_{t+1} − w_t)

[Møller, 1993; Nocedal & Wright, 2007]
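A sketch of the step-length computation: the optimal quadratic step α = gᵀd / dᵀHd, clamped so the step stays inside a trust region of a given radius. All inputs here (gradient, direction, quadratic form value, radius) are illustrative assumptions; in the talk dᵀHd would come from the sampled-counts identity above rather than an explicit Hessian.

```python
def quadratic_step(g, d, dHd, radius):
    """Step alpha*d along direction d, with alpha = g.d / d.H.d
    clamped so that |alpha| * |d| <= radius (the trust region)."""
    gd = sum(gi * di for gi, di in zip(g, d))
    norm = sum(di * di for di in d) ** 0.5
    if dHd != 0.0 and norm > 0.0:
        alpha = gd / dHd
        limit = radius / norm
        alpha = max(-limit, min(limit, alpha))  # stay in trust region
    else:
        alpha = 0.0
    return [alpha * di for di in d]

# Hypothetical inputs: the unclamped alpha is 1.0, within the radius.
step = quadratic_step([2.0, 0.0], [1.0, 0.0], 2.0, 10.0)
```

After taking the step, comparing the predicted quadratic change against the actual (lower-bounded) change decides whether to grow or shrink the radius, as the slide describes.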
![Page 19: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/19.jpg)
Preconditioning
- Initial direction of SCG is the gradient: very bad for ill-conditioned problems
- Well-known fix: preconditioning
  - Multiply by a matrix to lower the condition number
  - Ideally, approximate the inverse Hessian
- Standard preconditioner: D⁻¹

[Sha & Pereira, 2003]
![Page 20: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/20.jpg)
Outline
- Background
- Algorithms
  - Gradient descent
  - Newton's method
  - Conjugate gradient
- Experiments
  - Cora – entity resolution
  - WebKB – collective classification
- Conclusion
![Page 21: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/21.jpg)
Experiments: Algorithms
- Voted perceptron (VP, VP-PW)
- Contrastive divergence (CD, CD-PW)
- Diagonal Newton (DN)
- Scaled conjugate gradient (SCG, PSCG)

Baseline: VP. New algorithms: VP-PW, CD, CD-PW, DN, SCG, PSCG.
![Page 22: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/22.jpg)
Experiments: Datasets
- Cora
  - Task: Deduplicate 1295 citations to 132 papers
  - Weights: 6141 [Singla & Domingos, 2006]
  - Ground clauses: > 3 million
  - Condition number: > 600,000
- WebKB [Craven & Slattery, 2001]
  - Task: Predict categories of 4165 web pages
  - Weights: 10,891
  - Ground clauses: > 300,000
  - Condition number: ~7000
![Page 23: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/23.jpg)
Experiments: Method
- Gaussian prior on each weight
- Tuned learning rates on held-out data
- Trained for 10 hours
- Evaluated on test data
  - AUC: Area under the precision-recall curve
  - CLL: Average conditional log-likelihood of all query predicates
![Page 24: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/24.jpg)
Results: Cora AUC
[Chart: AUC vs. time (s, log scale, 1 to 100,000); series: VP]
![Page 25: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/25.jpg)
Results: Cora AUC
[Chart: AUC vs. time (s, log scale, 1 to 100,000); series: VP, VP-PW]
![Page 26: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/26.jpg)
Results: Cora AUC
[Chart: AUC vs. time (s, log scale, 1 to 100,000); series: VP, VP-PW, CD, CD-PW]
![Page 27: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/27.jpg)
Results: Cora AUC
[Chart: AUC vs. time (s, log scale, 1 to 100,000); series: VP, VP-PW, CD, CD-PW, DN, SCG, PSCG]
![Page 28: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/28.jpg)
Results: Cora CLL
[Chart: CLL vs. time (s, log scale, 1 to 100,000); series: VP]
![Page 29: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/29.jpg)
Results: Cora CLL
[Chart: CLL vs. time (s, log scale, 1 to 100,000); series: VP, VP-PW]
![Page 30: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/30.jpg)
Results: Cora CLL
[Chart: CLL vs. time (s, log scale, 1 to 100,000); series: VP, VP-PW, CD, CD-PW]
![Page 31: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/31.jpg)
Results: Cora CLL
[Chart: CLL vs. time (s, log scale, 1 to 100,000); series: VP, VP-PW, CD, CD-PW, DN, SCG, PSCG]
![Page 32: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/32.jpg)
Results: WebKB AUC
[Chart: AUC vs. time (s, log scale, 1 to 100,000); series: VP, VP-PW]
![Page 33: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/33.jpg)
Results: WebKB AUC
[Chart: AUC vs. time (s, log scale, 1 to 100,000); series: VP, VP-PW, CD, CD-PW]
![Page 34: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/34.jpg)
Results: WebKB AUC
[Chart: AUC vs. time (s, log scale, 1 to 100,000); series: VP, VP-PW, CD, CD-PW, DN, SCG, PSCG]
![Page 35: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/35.jpg)
Results: WebKB CLL
[Chart: CLL vs. time (s, log scale, 1 to 100,000); series: VP, VP-PW, CD, CD-PW, DN, SCG, PSCG]
![Page 36: Efficient Weight Learning for Markov Logic Networks](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816849550346895dde317c/html5/thumbnails/36.jpg)
Conclusion
- Ill-conditioning is a real problem in statistical relational learning
- PSCG and DN are an effective solution
  - Efficiently converge to good models
  - No learning rate to tune
  - Orders of magnitude faster than VP
- Details remaining
  - Detecting convergence
  - Preventing overfitting
  - Approximate inference

Try it out in Alchemy: http://alchemy.cs.washington.edu/