Amortized Integer Linear Programming Inference
Transcript of Amortized Integer Linear Programming Inference
June 2013, Inferning Workshop, ICML, Atlanta, GA
Amortized Integer Linear Programming Inference
Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign
Page 1
With thanks to: Collaborators: Gourab Kundu, Vivek Srikumar, and many others. Funding: NSF, DHS, NIH, DARPA; DASH Optimization (Xpress-MP).
Please…
Page 2
Learning and Inference
Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome.
In current NLP we often think about simpler structured problems: parsing, information extraction, SRL, etc.
As we move up the problem hierarchy (textual entailment, QA, ...), not all component models can be learned simultaneously.
We need to think about (learned) models for different sub-problems.
Knowledge relating sub-problems (constraints) becomes more essential and may appear only at evaluation time.
Goal: incorporate the models' information, along with prior knowledge (constraints), in making coherent decisions: decisions that respect the local models as well as domain- and context-specific knowledge/constraints.
Page 3
Comprehension
1. Christopher Robin was born in England. 2. Winnie the Pooh is a title of a book. 3. Christopher Robin’s dad was a magician. 4. Christopher Robin must be at least 65 now.
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
This is an Inference Problem
Page 4
Outline
Integer Linear Programming formulations for Natural Language Processing
  Examples
Amortized inference
  What is it and why could it be possible?
  The general scheme
Theorems for amortized inference
  Making the k-th inference cheaper than the 1st
  Full structures; partial structures
  Experimental results
Page 5
Semantic Role Labeling
I left my pearls to my daughter in my will.
[I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC.

A0: Leaver   A1: Things left   A2: Benefactor   AM-LOC: Location
Page 6
Archetypical information extraction problem: e.g., concept identification and typing, event identification, etc.

Algorithmic Approach (example: "I left my nice pearls to her"):
1. Identify argument candidates
   • Pruning [Xue & Palmer, EMNLP'04]
   • Argument Identifier: binary classification over candidate spans
2. Classify argument candidates
   • Argument Classifier: multi-class classification
3. Inference
   • Use the estimated probability distribution given by the argument classifier
   • Use structural and linguistic constraints
   • Infer the optimal global output
One inference problem for each verb predicate.

argmax Σ_{a,t} c_{a,t} · y_{a,t}
Subject to:
• One label per argument: Σ_t y_{a,t} = 1
• No overlapping or embedding
• Relations between verbs and arguments, ...
Page 7
Variable ya,t indicates whether candidate argument a is assigned a label t. ca,t is the corresponding model score
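As a concrete illustration of this argmax, here is a minimal brute-force sketch of the inference step. The candidate spans, label set, and scores below are invented for illustration; a real system would pass the same objective and constraints to an ILP solver rather than enumerate assignments.

```python
from itertools import product

def srl_inference(candidates, labels, scores, overlaps):
    """Assign one label t to each candidate argument a, maximizing the
    total score c_{a,t}, subject to a no-overlap constraint: two
    overlapping candidates cannot both receive a non-NULL label."""
    best, best_val = None, float("-inf")
    for assignment in product(labels, repeat=len(candidates)):
        if any(assignment[i] != "NULL" and assignment[j] != "NULL"
               for i, j in overlaps):
            continue  # violates "no overlapping or embedding"
        val = sum(scores[a][t] for a, t in zip(candidates, assignment))
        if val > best_val:
            best, best_val = dict(zip(candidates, assignment)), val
    return best

# Hypothetical candidates and scores for "I left my pearls to my daughter"
cands = ["I", "my pearls", "to my daughter"]
labels = ["A0", "A1", "A2", "NULL"]
scores = {"I":              {"A0": 0.9, "A1": 0.1, "A2": 0.1, "NULL": 0.0},
          "my pearls":      {"A0": 0.2, "A1": 0.8, "A2": 0.1, "NULL": 0.0},
          "to my daughter": {"A0": 0.1, "A1": 0.3, "A2": 0.7, "NULL": 0.2}}
print(srl_inference(cands, labels, scores, overlaps=[]))
# {'I': 'A0', 'my pearls': 'A1', 'to my daughter': 'A2'}
```

Enumerating label tuples is exponential in the number of candidates; this is only meant to make the objective and constraints concrete.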
John, a fast-rising politician, slept on the train to Chicago.
Verb predicate: sleep
  Sleeper: John, a fast-rising politician
  Location: on the train to Chicago
Who was John? Relation: apposition (comma): John, a fast-rising politician
What was John's destination? Relation: destination (preposition): train to Chicago
Verb SRL is not Sufficient
Examples of preposition relations
Queen of England
City of Chicago
Page 9
The bus was heading for Nairobi in Kenya.
Coherence of predictions: the preposition relations (Destination for "for Nairobi", Location for "in Kenya") and the verb SRL arguments should agree.
Predicate: head.02
A0 (mover): The bus
A1 (destination): for Nairobi in Kenya
Predicate arguments from different triggers should be consistent.
Joint constraints linking the two tasks (e.g., the preposition label Destination corresponds to the verb argument A1).
Page 10
Joint inference (CCMs)
Maximize the sum of scores over both tasks: each verb argument label and each preposition relation label, with re-scaling parameters (one per label).
Constraints:
• Verb SRL constraints
• Only one label per preposition
• Joint constraints between verb arguments and preposition relations
Variable ya,t indicates whether candidate argument a is assigned a label t; ca,t is the corresponding model score.
Page 11
+ ….
+ Joint constraints between tasks, easy with ILP formulation
Joint inference with no (or minimal) joint learning.
Shown useful in the context of many NLP problems
[Roth & Yih '04, '07: entities and relations; Punyakanok et al.: SRL; ...]: summarization; co-reference; information & relation extraction; event identification; transliteration; textual entailment; knowledge acquisition; sentiment; temporal reasoning; dependency parsing; ...
Some theoretical work on training paradigms [Punyakanok et al. '05 and more; Constraint-Driven Learning; PR; Constrained EM; ...]
Some work on inference, mostly approximations, bringing back ideas on Lagrangian relaxation, etc.
Good summary and description of training paradigms: [Chang, Ratinov & Roth, Machine Learning Journal 2012]
Summary of work & a bibliography: http://L2R.cs.uiuc.edu/tutorials.html
Constrained Conditional Models—ILP Formulations
Page 12
Outline
Integer Linear Programming formulations for Natural Language Processing
  Examples
Amortized inference
  What is it and why could it be possible?
  The general scheme
Theorems for amortized inference
  Making the k-th inference cheaper than the 1st?
  Full structures; partial structures
  Experimental results
Page 13
Inference in NLP
In NLP, we typically don’t solve a single inference problem. We solve one or more per sentence. Beyond improving the inference algorithm, what can be done?
S1:  He    is   reading  a   book
S2:  They  are  watching a   movie
POS: PRP   VBZ  VBG      DT  NN

After inferring the POS structure for S1, can we speed up inference for S2? Can we make the k-th inference problem cheaper than the first?
S1 & S2 look very different, but their output structures are the same: the inference outcomes are the same.
Page 14
Amortized ILP Inference [Kundu, Srikumar & Roth, EMNLP-12,ACL-13]
We formulate the problem of amortized inference: reducing inference time over the lifetime of an NLP tool.
We develop conditions under which the solution of a new, previously unseen problem can be exactly inferred from earlier solutions without invoking a solver.
This results in a family of exact inference schemes. The algorithms are invariant to the underlying solver; we simply reduce the number of calls to the solver.
Significant improvements, both in solver calls and in wall-clock time, in a state-of-the-art semantic role labeling system.
Page 15
The Hope: POS Tagging on Gigaword
[Figure: number of examples of a given size, by number of tokens]
Page 16
Number of structures is much smaller than the number of sentences
The Hope: POS Tagging on Gigaword
[Figure: number of examples of a given size vs. number of unique POS tag sequences, by number of tokens]
Page 17
The Hope: Dependency Parsing on Gigaword
[Figure: number of examples of a given size vs. number of unique dependency trees, by number of tokens]
Number of structures is much smaller than the number of sentences.
Page 18
The Hope: Semantic Role Labeling on Gigaword
[Figure: number of SRL structures vs. number of unique SRL structures, by number of arguments per predicate]
Number of structures is much smaller than the number of sentences.
Page 19
POS Tagging on Gigaword
[Figure: number of examples of a given size vs. number of unique POS tag sequences, by number of tokens]
How skewed is the distribution of the structures? A small number of structures occur very frequently.
Page 20
Amortized ILP Inference
These statistics show that many different instances are mapped into identical inference outcomes.
How can we exploit this fact to save inference cost?
We do this in the context of 0-1 LP, which is the most commonly used formulation in NLP.
max c · x
s.t. A · x ≤ b,  x ∈ {0,1}^n
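To make the 0-1 LP formulation concrete, here is a minimal exhaustive-search sketch (fine for a handful of variables; real systems call an ILP solver). The instance solved at the end is the example problem P that appears later in the talk:

```python
from itertools import product

def solve_01lp(c, A, b):
    """max c.x subject to A.x <= b, x in {0,1}^n, by enumeration."""
    n = len(c)
    best_x, best_val = None, float("-inf")
    for x in product((0, 1), repeat=n):
        # Check every constraint row: A[j] . x <= b[j]
        if all(sum(a_i * x_i for a_i, x_i in zip(row, x)) <= b_j
               for row, b_j in zip(A, b)):
            val = sum(c_i * x_i for c_i, x_i in zip(c, x))
            if val > best_val:
                best_x, best_val = x, val
    return best_x, best_val

# max 2x1 + 3x2 + 2x3 + x4  s.t.  x1 + x2 <= 1, x3 + x4 <= 1
x, v = solve_01lp([2, 3, 2, 1],
                  [[1, 1, 0, 0], [0, 0, 1, 1]], [1, 1])
print(x, v)   # (0, 1, 1, 0) with value 5
```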
Page 21
After solving n inference problems, can we make the (n+1)th one faster?
Equivalence Classes

P:  max 2x1 + 3x2 + 2x3 + x4        Q:  max 2x1 + 4x2 + 2x3 + 0.5x4
    s.t. x1 + x2 ≤ 1, x3 + x4 ≤ 1       s.t. x1 + x2 ≤ 1, x3 + x4 ≤ 1

Objective coefficients: cP: <2, 3, 2, 1>   cQ: <2, 4, 2, 0.5>
Optimal solution of P: x*P: <0, 1, 1, 0>
P and Q are in the same equivalence class.

We define an equivalence class as the set of ILPs that have:
• the same number of inference variables
• the same feasible set (the same constraints, modulo renaming)
Page 22
For problems in a given equivalence class, we give conditions on the objective functions under which the solution of a new problem Q is the same as that of a problem P we already cached.
The Recipe
Given: A cache of solved ILPs and a new problem
If CONDITION(cache, new problem):
    SOLUTION(new problem) = old solution
Else:
    call the base solver and update the cache
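The recipe above can be sketched as a small caching wrapper. Everything here is illustrative: the `Problem` tuple, the string constraint signature, and the toy solver are invented, and the demo uses a deliberately trivial CONDITION (identical objectives); the theorems in the talk give much weaker, and hence more useful, conditions.

```python
from collections import namedtuple

# A problem = its objective vector plus a signature of its feasible set
# (the equivalence-class key: same #variables, same constraints).
Problem = namedtuple("Problem", "objective constraint_signature")

def amortized_solve(cache, problem, condition, base_solver):
    """Reuse a cached solution when CONDITION holds; otherwise call
    the base solver and update the cache."""
    key = problem.constraint_signature
    for c_old, x_old in cache.get(key, []):
        if condition(c_old, x_old, problem.objective):
            return x_old                      # cache hit: no solver call
    x_new = base_solver(problem)              # cache miss
    cache.setdefault(key, []).append((problem.objective, x_new))
    return x_new

# Demo with a trivial CONDITION: identical objective vectors.
calls = []
def toy_solver(p):
    calls.append(p)
    return (0, 1)                             # placeholder solution

cache = {}
p1 = Problem((2.0, 3.0), "x1+x2<=1")
p2 = Problem((2.0, 3.0), "x1+x2<=1")          # same class, same objective
same = lambda c_old, x_old, c_new: c_old == c_new
amortized_solve(cache, p1, same, toy_solver)
amortized_solve(cache, p2, same, toy_solver)
print(len(calls))  # 1 -- the second problem was answered from the cache
```

Any base solver, exact or approximate, can be plugged in unchanged, which is the invariance property claimed earlier.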
Page 23
Page 24
Amortized Inference Experiments
Setup
• Verb semantic role labeling (other results at the end of the talk)
• Speedup & accuracy are measured over the WSJ test set (Section 23)
• Baseline: solving each ILP with the Gurobi solver
For amortization:
• Cache 250,000 SRL inference problems from Gigaword
• For each problem in the test set, invoke an amortized inference algorithm
Theorem I

P:  max 2x1 + 3x2 + 2x3 + x4        Q:  max 2x1 + 4x2 + 2x3 + 0.5x4
    s.t. x1 + x2 ≤ 1, x3 + x4 ≤ 1       s.t. x1 + x2 ≤ 1, x3 + x4 ≤ 1

cP: <2, 3, 2, 1>   cQ: <2, 4, 2, 0.5>   x*P: <0, 1, 1, 0>

Page 25

If
• the objective coefficients of active variables did not decrease from P to Q, and
• the objective coefficients of inactive variables did not increase from P to Q,
then the optimal solution of Q is the same as P's: x*Q = x*P.

Page 26
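The Theorem I condition is a coordinate-wise check, which is what makes it so cheap compared to a solver call. A minimal sketch, using the talk's own example numbers:

```python
def theorem1_condition(c_p, x_p, c_q):
    """Theorem I check: x*_P stays optimal for Q if no active variable's
    coefficient decreased and no inactive variable's coefficient
    increased when moving from cP to cQ."""
    return all(cq >= cp if x == 1 else cq <= cp
               for cp, x, cq in zip(c_p, x_p, c_q))

# The talk's example: cP = <2,3,2,1>, x*_P = <0,1,1,0>, cQ = <2,4,2,0.5>
print(theorem1_condition([2, 3, 2, 1], [0, 1, 1, 0], [2, 4, 2, 0.5]))  # True
```

This runs in O(n) per cached problem, so checking it against many cached ILPs is still far cheaper than one solver invocation.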
Page 28
Speedup & Accuracy
[Figure: speedup (0.8-2.6) and F1 (50-75) for the baseline vs. Theorem 1]
Amortized inference gives a speedup without losing accuracy
Solve only 40% of problems
Theorem II (Geometric Interpretation)
[Figure: a feasible region with two objective vectors cP1, cP2 that share the solution x*]
ILPs corresponding to all objective vectors in the cone spanned by cP1 and cP2 share the same maximizer x* for this feasible region: all ILPs in the cone share the maximizer.
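In general, testing whether a new objective lies in the cone of cached objectives is an LP feasibility problem. As an illustrative special case (not the talk's implementation), the two-generator, two-dimensional test reduces to solving a 2x2 linear system:

```python
def in_cone_2d(c_new, g1, g2):
    """Theorem II sketch, 2-D special case: is c_new a non-negative
    combination of the cached objective vectors g1 and g2?  If so, the
    maximizer they share is also optimal for c_new.  (In general this
    is an LP feasibility check, not a 2x2 solve.)"""
    det = g1[0] * g2[1] - g1[1] * g2[0]
    if det == 0:
        return None                    # degenerate generators
    # Cramer's rule for l1*g1 + l2*g2 = c_new
    l1 = (c_new[0] * g2[1] - c_new[1] * g2[0]) / det
    l2 = (g1[0] * c_new[1] - g1[1] * c_new[0]) / det
    return l1 >= 0 and l2 >= 0

print(in_cone_2d((3, 1), (1, 0), (1, 1)))    # True:  2*(1,0) + 1*(1,1)
print(in_cone_2d((-1, 0), (1, 0), (1, 1)))   # False: needs a negative weight
```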
Page 30
Theorem III
[Figure: objective values of all structures for problems P and Q, ordered by increasing objective value; y* is the solution to problem P, separated from the two competing structures by the structured margin d]
A = (cP - cQ) · y*: decrease in the objective value of the solution.
B = (cQ - cP) · y: increase in the objective value of the competing structures.
Theorem (margin-based amortized inference): if A + B is less than the structured margin d, then y* is still the optimum for Q.
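A minimal sketch of the margin-based test. A is exact; B is a max over competing structures, so, following the relaxation mentioned in the backup slides, it is bounded here by the sum of positive coefficient increases over all 0/1 vectors (this can only make the test more conservative). The example numbers and the margin value are invented:

```python
def margin_amortization(c_p, c_q, y_star, margin_d):
    """Theorem III sketch.  A = (cP - cQ).y*: drop in the cached
    solution's objective value.  B is upper-bounded by the sum of
    positive coefficient increases (a relaxation of the max over
    competing structures).  If A + B < d, y* is still optimal for Q."""
    A = sum((cp - cq) * y for cp, cq, y in zip(c_p, c_q, y_star))
    B = sum(max(0.0, cq - cp) for cp, cq in zip(c_p, c_q))
    return A + B < margin_d

# Hypothetical numbers: a small perturbation of cP, structured margin d = 1.0
print(margin_amortization([2, 3, 2, 1], [2, 3.5, 2, 1.2], [0, 1, 1, 0], 1.0))
# True
```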
Page 33
Speedup & Accuracy
[Figure: speedup (0.8-3.3) and F1 (50-75) for the baseline and the amortization schemes of [EMNLP'12, ACL'13]: Theorem 1, Theorem 2, Theorem 3, and margin-based]
Amortized inference gives a speedup without losing accuracy.
Solve only one in three problems.
Page 34
So far…
Amortized inference: making inference faster by re-using previous computations.
Techniques for amortized inference.
But these are not useful if the full structure is not redundant!
[Figure: smaller structures are more redundant]
Page 35
Decomposed amortized inference
Taking advantage of redundancy in components of structures:
• Extend amortization techniques to cases where the full structured output may not be repeated
• Store partial computations of "components" for use in future inference problems
Page 36
The bus was heading for Nairobi in Kenya.
Coherence of predictions: the preposition relations (Destination for "for Nairobi", Location for "in Kenya") and the verb SRL arguments should agree.
Predicate: head.02
A0 (mover): The bus
A1 (destination): for Nairobi in Kenya
Predicate arguments from different triggers should be consistent.
Joint constraints linking the two tasks (e.g., the preposition label Destination corresponds to the verb argument A1).
Page 37
Example: Decomposition for inference
The Verb Problem: verb SRL constraints.
The Preposition Problem: only one label per preposition.
Joint constraints link the verb relations and the preposition relations.
Re-introduce the joint constraints using Lagrangian relaxation [Komodakis et al., 2007], [Rush & Collins, 2011], [Chang & Collins, 2011], ...
Page 38
Decomposed amortized Inference
Intuition:
• Create smaller problems by removing constraints from the ILPs; smaller problems mean more cache hits!
• Solve the relaxed inference problems using any amortized inference algorithm
• Re-introduce the constraints via Lagrangian relaxation
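The steps above can be sketched as a standard dual-decomposition loop. The two toy subproblems and their scores below are invented; in the talk's setting each `solve_*` call would itself go through the amortized-inference cache.

```python
def dual_decomposition(solve_a, solve_b, num_shared, steps=50, rate=0.5):
    """Lagrangian-relaxation sketch: two subproblems (each a candidate
    for amortized inference) are linked by agreement constraints
    y_a = y_b on shared variables.  Multipliers u price disagreement;
    subgradient steps drive the two solutions to agree."""
    u = [0.0] * num_shared
    for _ in range(steps):
        ya = solve_a(u)                  # max f_a(y) + u . y
        yb = solve_b(u)                  # max f_b(y) - u . y
        if ya == yb:
            return ya                    # agreement reached: done
        u = [ui - rate * (a - b) for ui, a, b in zip(u, ya, yb)]
    return ya                            # no agreement within budget

# Toy subproblems over one shared binary variable (invented scores):
def solve_a(u):
    return [1] if 2.0 + u[0] > 0.0 else [0]   # strongly prefers y = 1

def solve_b(u):
    return [1] if -u[0] > 0.5 else [0]        # mildly prefers y = 0

print(dual_decomposition(solve_a, solve_b, num_shared=1))  # [1]
```

If the loop ends in agreement, the result is provably optimal for the joint problem; otherwise the last subproblem solution serves as an approximation.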
Page 39
Speedup & Accuracy
[Figure: speedup (0.8-6.8) and F1 (50-75) for the baseline and the amortization schemes of [EMNLP'12, ACL'13]: Theorems 1-3, margin-based, and margin-based + decomposition]
Amortized inference gives a speedup without losing accuracy.
Solve only one in six problems.
Page 40
Reduction in inference calls (SRL)
                               Num. inference calls   +decomposition
  Inference engine (baseline)         100
  Theorem 1                            41                  32.7
  Margin-based inference               24.4                16.6
Solve only one in six problems
Page 41
Reduction in inference calls (Entity-relation extraction)
                               Num. inference calls   +decomposition
  Inference engine (baseline)         100
  Theorem 1                            59.5                57
  Margin-based inference               28.2                24.4
Solve only one in four problems
Page 42
So far…
We have given theorems that allow saving 5/6 of the calls to your favorite inference engine.
But there is some cost in checking the conditions of the theorems and in accessing the cache.
Our implementations are clearly not state-of-the-art, but...
Page 43
Reduction in wall-clock time (SRL)
                               Wall-clock time (%)    +decomposition
  ILP solver (baseline)               100
  Theorem 1                            54.8                45.9
  Margin-based inference               40                  38.1
Solve only one in 2.6 problems
Page 44
Conclusion
Amortized inference: gave conditions for determining when a new, unseen problem shares a previously seen solution (or parts of it).
The theory depends on the ILP formulation of the problem, but applies to your favorite inference algorithm. In particular, approximate inference can be used as the base solver; the approximation properties of the underlying algorithm will be retained.
We showed that we can save 5/6 of the calls to an inference engine. The theorems can be relaxed to increase cache hits.
Integer Linear Programming formulations are powerful. We already knew that they are expressive and easy to use in many problems. Moreover, even if you want to use other solvers, we showed that the ILP formulation is key to amortization.
Thank You!
Page 45
Page 46
Theorem III
[Figure: objective values of all structures for problems P and Q, ordered by increasing objective value; y* is the solution to problem P, separated from the two competing structures by the structured margin d]
A = (cP - cQ) · y*: decrease in the objective value of the solution. Easy to compute.
B = (cQ - cP) · y: increase in the objective value of the competing structures. Hard to compute (a max over y); we relax the problem, which can only increase B.
The structured margin d is easy to compute during caching.
Theorem (margin-based amortized inference): if A + B is less than the structured margin d, then y* is still the optimum for Q.
Experiments: Semantic Role Labeling
SRL: based on the state-of-the-art Illinois SRL system [V. Punyakanok, D. Roth and W. Yih, The Importance of Syntactic Parsing and Inference in Semantic Role Labeling, Computational Linguistics, 2008].
In SRL, we solve an ILP problem for each verb predicate in each sentence.
Amortization experiments:
• Speedup & accuracy are measured over the WSJ test set (Section 23)
• Baseline: solving each ILP with Gurobi 4.6
• For amortization: we collect 250,000 SRL inference problems from Gigaword and store them in a database
• For each ILP in the test set, we invoke one of the theorems (exact / approximate); if a matching cached solution is found, we return it, otherwise we call the baseline ILP solver
Page 47
Inference with General Constraint Structure [Roth & Yih '04, '07]
Recognizing Entities and Relations

Dole's wife, Elizabeth, is a native of N.C.
Entities: E1 (Dole), E2 (Elizabeth), E3 (N.C.); relations: R12, R23.
Local scores (entity and relation classifiers):

  E1 (Dole):       other 0.05   per 0.85   loc 0.10
  E2 (Elizabeth):  other 0.10   per 0.60   loc 0.30
  E3 (N.C.):       other 0.05   per 0.50   loc 0.45

  R12:  irrelevant 0.05   spouse_of 0.45   born_in 0.50
  R23:  irrelevant 0.10   spouse_of 0.05   born_in 0.85
Improvement over no inference: 2-5%
Models could be learned separately; constraints may come up only at decision time.
Page 48
Note: Non-Sequential Model
Key Questions: How to guide the global inference? How to learn? Why not Jointly?
Y = argmax_y Σ score(y = v) · [[y = v]]
  = argmax score(E1 = PER) · [[E1 = PER]] + score(E1 = LOC) · [[E1 = LOC]] + ...
          + score(R12 = spouse_of) · [[R12 = spouse_of]] + ...
Subject to constraints.

An objective function that incorporates learned models with knowledge (constraints): a Constrained Conditional Model.