Amortized Integer Linear Programming Inference
Transcript of Amortized Integer Linear Programming Inference
June 2013, Inferning Workshop, ICML, Atlanta, GA
Amortized Integer Linear Programming Inference
Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign
Page 1
With thanks to: Collaborators: Gourab Kundu, Vivek Srikumar, and many others. Funding: NSF, DHS, NIH, DARPA; DASH Optimization (Xpress-MP).
Please…
Page 2
Learning and Inference
Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome.
In current NLP we often think about simpler structured problems: parsing, information extraction, SRL, etc.
As we move up the problem hierarchy (textual entailment, QA, ...), not all component models can be learned simultaneously.
We need to think about (learned) models for different sub-problems.
Knowledge relating sub-problems (constraints) becomes more essential and may appear only at evaluation time.
Goal: incorporate the models' information, along with prior knowledge (constraints), in making coherent decisions: decisions that respect the local models as well as domain- and context-specific knowledge/constraints.
Page 3
Comprehension
1. Christopher Robin was born in England. 2. Winnie the Pooh is a title of a book. 3. Christopher Robin’s dad was a magician. 4. Christopher Robin must be at least 65 now.
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
This is an Inference Problem
Page 4
Outline
Integer Linear Programming formulations for Natural Language Processing
  Examples
Amortized inference
  What is it and why could it be possible?
  The general scheme
Theorems for amortized inference
  Making the k-th inference cheaper than the 1st
  Full structures; partial structures
  Experimental results
Page 5
Semantic Role Labeling
I left my pearls to my daughter in my will.
[I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC.

A0: Leaver   A1: Things left   A2: Benefactor   AM-LOC: Location
Page 6
Archetypical information extraction problem: e.g., concept identification and typing, event identification, etc.

Algorithmic Approach (example: "I left my nice pearls to her"):
1. Identify argument candidates
   • Pruning [Xue & Palmer, EMNLP'04]
   • Argument Identifier: binary classification over candidate spans
2. Classify argument candidates
   • Argument Classifier: multi-class classification
3. Inference
   • Use the estimated probability distribution given by the argument classifier
   • Use structural and linguistic constraints
   • Infer the optimal global output
One inference problem for each verb predicate.

argmax Σ_{a,t} c_{a,t} · y_{a,t}
Subject to:
• One label per argument: Σ_t y_{a,t} = 1
• No overlapping or embedding
• Relations between verbs and arguments, ...
Page 7
Variable ya,t indicates whether candidate argument a is assigned a label t. ca,t is the corresponding model score
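As a concrete illustration of this argmax, here is a minimal brute-force sketch of the inference step. The candidate spans, label set, and scores below are invented for illustration; a real system would pass the same objective and constraints to an ILP solver rather than enumerate assignments.

```python
from itertools import product

def srl_inference(candidates, labels, scores, overlaps):
    """Assign one label t to each candidate argument a, maximizing the
    total score c_{a,t}, subject to a no-overlap constraint: two
    overlapping candidates cannot both receive a non-NULL label."""
    best, best_val = None, float("-inf")
    for assignment in product(labels, repeat=len(candidates)):
        if any(assignment[i] != "NULL" and assignment[j] != "NULL"
               for i, j in overlaps):
            continue  # violates "no overlapping or embedding"
        val = sum(scores[a][t] for a, t in zip(candidates, assignment))
        if val > best_val:
            best, best_val = dict(zip(candidates, assignment)), val
    return best

# Hypothetical candidates and scores for "I left my pearls to my daughter"
cands = ["I", "my pearls", "to my daughter"]
labels = ["A0", "A1", "A2", "NULL"]
scores = {"I":              {"A0": 0.9, "A1": 0.1, "A2": 0.1, "NULL": 0.0},
          "my pearls":      {"A0": 0.2, "A1": 0.8, "A2": 0.1, "NULL": 0.0},
          "to my daughter": {"A0": 0.1, "A1": 0.3, "A2": 0.7, "NULL": 0.2}}
print(srl_inference(cands, labels, scores, overlaps=[]))
# {'I': 'A0', 'my pearls': 'A1', 'to my daughter': 'A2'}
```

Enumerating label tuples is exponential in the number of candidates; this is only meant to make the objective and constraints concrete.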
John, a fast-rising politician, slept on the train to Chicago.
Verb predicate: sleep
  Sleeper: John, a fast-rising politician
  Location: on the train to Chicago
Who was John? Relation: apposition (comma): John, a fast-rising politician
What was John's destination? Relation: destination (preposition): train to Chicago
Verb SRL is not Sufficient
Examples of preposition relations
Queen of England
City of Chicago
Page 9
The bus was heading for Nairobi in Kenya.
Coherence of predictions: the preposition relations (Destination for "for Nairobi", Location for "in Kenya") and the verb SRL arguments should agree.
Predicate: head.02
A0 (mover): The bus
A1 (destination): for Nairobi in Kenya
Predicate arguments from different triggers should be consistent.
Joint constraints linking the two tasks (e.g., the preposition label Destination corresponds to the verb argument A1).
Page 10
Joint inference (CCMs)
Maximize the sum of scores over both tasks: each verb argument label and each preposition relation label, with re-scaling parameters (one per label).
Constraints:
• Verb SRL constraints
• Only one label per preposition
• Joint constraints between verb arguments and preposition relations
Variable ya,t indicates whether candidate argument a is assigned a label t; ca,t is the corresponding model score.
Page 11
+ ….
+ Joint constraints between tasks, easy with ILP formulation
Joint inference with no (or minimal) joint learning.
Shown useful in the context of many NLP problems
[Roth & Yih '04, '07: entities and relations; Punyakanok et al.: SRL; ...]: summarization; co-reference; information & relation extraction; event identification; transliteration; textual entailment; knowledge acquisition; sentiment; temporal reasoning; dependency parsing; ...
Some theoretical work on training paradigms [Punyakanok et al. '05 and more; Constraint-Driven Learning; PR; Constrained EM; ...]
Some work on inference, mostly approximations, bringing back ideas on Lagrangian relaxation, etc.
Good summary and description of training paradigms: [Chang, Ratinov & Roth, Machine Learning Journal 2012]
Summary of work & a bibliography: http://L2R.cs.uiuc.edu/tutorials.html
Constrained Conditional Models—ILP Formulations
Page 12
Outline
Integer Linear Programming formulations for Natural Language Processing
  Examples
Amortized inference
  What is it and why could it be possible?
  The general scheme
Theorems for amortized inference
  Making the k-th inference cheaper than the 1st?
  Full structures; partial structures
  Experimental results
Page 13
Inference in NLP
In NLP, we typically don’t solve a single inference problem. We solve one or more per sentence. Beyond improving the inference algorithm, what can be done?
S1:  He    is   reading  a   book
S2:  They  are  watching a   movie
POS: PRP   VBZ  VBG      DT  NN

After inferring the POS structure for S1, can we speed up inference for S2? Can we make the k-th inference problem cheaper than the first?
S1 & S2 look very different, but their output structures are the same: the inference outcomes are the same.
Page 14
Amortized ILP Inference [Kundu, Srikumar & Roth, EMNLP-12,ACL-13]
We formulate the problem of amortized inference: reducing inference time over the lifetime of an NLP tool.
We develop conditions under which the solution of a new, previously unseen problem can be exactly inferred from earlier solutions without invoking a solver.
This results in a family of exact inference schemes. The algorithms are invariant to the underlying solver; we simply reduce the number of calls to the solver.
Significant improvements, both in solver calls and in wall-clock time, in a state-of-the-art semantic role labeling system.
Page 15
The Hope: POS Tagging on Gigaword
[Figure: number of examples of a given size, by number of tokens]
Page 16
Number of structures is much smaller than the number of sentences
The Hope: POS Tagging on Gigaword
[Figure: number of examples of a given size vs. number of unique POS tag sequences, by number of tokens]
Page 17
The Hope: Dependency Parsing on Gigaword
[Figure: number of examples of a given size vs. number of unique dependency trees, by number of tokens]
Number of structures is much smaller than the number of sentences.
Page 18
The Hope: Semantic Role Labeling on Gigaword
[Figure: number of SRL structures vs. number of unique SRL structures, by number of arguments per predicate]
Number of structures is much smaller than the number of sentences.
Page 19
POS Tagging on Gigaword
[Figure: number of examples of a given size vs. number of unique POS tag sequences, by number of tokens]
How skewed is the distribution of the structures? A small number of structures occur very frequently.
Page 20
Amortized ILP Inference
These statistics show that many different instances are mapped into identical inference outcomes.
How can we exploit this fact to save inference cost?
We do this in the context of 0-1 LP, which is the most commonly used formulation in NLP.
max c · x
s.t. A · x ≤ b,  x ∈ {0,1}^n
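To make the 0-1 LP formulation concrete, here is a minimal exhaustive-search sketch (fine for a handful of variables; real systems call an ILP solver). The instance solved at the end is the example problem P that appears later in the talk:

```python
from itertools import product

def solve_01lp(c, A, b):
    """max c.x subject to A.x <= b, x in {0,1}^n, by enumeration."""
    n = len(c)
    best_x, best_val = None, float("-inf")
    for x in product((0, 1), repeat=n):
        # Check every constraint row: A[j] . x <= b[j]
        if all(sum(a_i * x_i for a_i, x_i in zip(row, x)) <= b_j
               for row, b_j in zip(A, b)):
            val = sum(c_i * x_i for c_i, x_i in zip(c, x))
            if val > best_val:
                best_x, best_val = x, val
    return best_x, best_val

# max 2x1 + 3x2 + 2x3 + x4  s.t.  x1 + x2 <= 1, x3 + x4 <= 1
x, v = solve_01lp([2, 3, 2, 1],
                  [[1, 1, 0, 0], [0, 0, 1, 1]], [1, 1])
print(x, v)   # (0, 1, 1, 0) with value 5
```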
Page 21
After solving n inference problems, can we make the (n+1)th one faster?
Equivalence Classes

P:  max 2x1 + 3x2 + 2x3 + x4        Q:  max 2x1 + 4x2 + 2x3 + 0.5x4
    s.t. x1 + x2 ≤ 1, x3 + x4 ≤ 1       s.t. x1 + x2 ≤ 1, x3 + x4 ≤ 1

Objective coefficients: cP: <2, 3, 2, 1>   cQ: <2, 4, 2, 0.5>
Optimal solution of P: x*P: <0, 1, 1, 0>
P and Q are in the same equivalence class.

We define an equivalence class as the set of ILPs that have:
• the same number of inference variables
• the same feasible set (the same constraints, modulo renaming)
Page 22
For problems in a given equivalence class, we give conditions on the objective functions under which the solution of a new problem Q is the same as that of a problem P we already cached.
The Recipe
Given: A cache of solved ILPs and a new problem
If CONDITION(cache, new problem):
    SOLUTION(new problem) = old solution
Else:
    call the base solver and update the cache
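The recipe above can be sketched as a small caching wrapper. Everything here is illustrative: the `Problem` tuple, the string constraint signature, and the toy solver are invented, and the demo uses a deliberately trivial CONDITION (identical objectives); the theorems in the talk give much weaker, and hence more useful, conditions.

```python
from collections import namedtuple

# A problem = its objective vector plus a signature of its feasible set
# (the equivalence-class key: same #variables, same constraints).
Problem = namedtuple("Problem", "objective constraint_signature")

def amortized_solve(cache, problem, condition, base_solver):
    """Reuse a cached solution when CONDITION holds; otherwise call
    the base solver and update the cache."""
    key = problem.constraint_signature
    for c_old, x_old in cache.get(key, []):
        if condition(c_old, x_old, problem.objective):
            return x_old                      # cache hit: no solver call
    x_new = base_solver(problem)              # cache miss
    cache.setdefault(key, []).append((problem.objective, x_new))
    return x_new

# Demo with a trivial CONDITION: identical objective vectors.
calls = []
def toy_solver(p):
    calls.append(p)
    return (0, 1)                             # placeholder solution

cache = {}
p1 = Problem((2.0, 3.0), "x1+x2<=1")
p2 = Problem((2.0, 3.0), "x1+x2<=1")          # same class, same objective
same = lambda c_old, x_old, c_new: c_old == c_new
amortized_solve(cache, p1, same, toy_solver)
amortized_solve(cache, p2, same, toy_solver)
print(len(calls))  # 1 -- the second problem was answered from the cache
```

Any base solver, exact or approximate, can be plugged in unchanged, which is the invariance property claimed earlier.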
Page 23
Page 24
Amortized Inference Experiments
Setup
• Verb semantic role labeling (other results at the end of the talk)
• Speedup & accuracy are measured over the WSJ test set (Section 23)
• Baseline: solving each ILP with the Gurobi solver
For amortization:
• Cache 250,000 SRL inference problems from Gigaword
• For each problem in the test set, invoke an amortized inference algorithm
Theorem I

P:  max 2x1 + 3x2 + 2x3 + x4        Q:  max 2x1 + 4x2 + 2x3 + 0.5x4
    s.t. x1 + x2 ≤ 1, x3 + x4 ≤ 1       s.t. x1 + x2 ≤ 1, x3 + x4 ≤ 1

cP: <2, 3, 2, 1>   cQ: <2, 4, 2, 0.5>   x*P: <0, 1, 1, 0>

Page 25

If
• the objective coefficients of active variables did not decrease from P to Q, and
• the objective coefficients of inactive variables did not increase from P to Q,
then the optimal solution of Q is the same as P's: x*Q = x*P.

Page 26
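The Theorem I condition is a coordinate-wise check, which is what makes it so cheap compared to a solver call. A minimal sketch, using the talk's own example numbers:

```python
def theorem1_condition(c_p, x_p, c_q):
    """Theorem I check: x*_P stays optimal for Q if no active variable's
    coefficient decreased and no inactive variable's coefficient
    increased when moving from cP to cQ."""
    return all(cq >= cp if x == 1 else cq <= cp
               for cp, x, cq in zip(c_p, x_p, c_q))

# The talk's example: cP = <2,3,2,1>, x*_P = <0,1,1,0>, cQ = <2,4,2,0.5>
print(theorem1_condition([2, 3, 2, 1], [0, 1, 1, 0], [2, 4, 2, 0.5]))  # True
```

This runs in O(n) per cached problem, so checking it against many cached ILPs is still far cheaper than one solver invocation.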
Page 28
Speedup & Accuracy
[Figure: speedup (0.8-2.6) and F1 (50-75) for the baseline vs. Theorem 1]
Amortized inference gives a speedup without losing accuracy
Solve only 40% of problems
Theorem II (Geometric Interpretation)
[Figure: a feasible region with two objective vectors cP1, cP2 that share the solution x*]
ILPs corresponding to all objective vectors in the cone spanned by cP1 and cP2 share the same maximizer x* for this feasible region: all ILPs in the cone share the maximizer.
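In general, testing whether a new objective lies in the cone of cached objectives is an LP feasibility problem. As an illustrative special case (not the talk's implementation), the two-generator, two-dimensional test reduces to solving a 2x2 linear system:

```python
def in_cone_2d(c_new, g1, g2):
    """Theorem II sketch, 2-D special case: is c_new a non-negative
    combination of the cached objective vectors g1 and g2?  If so, the
    maximizer they share is also optimal for c_new.  (In general this
    is an LP feasibility check, not a 2x2 solve.)"""
    det = g1[0] * g2[1] - g1[1] * g2[0]
    if det == 0:
        return None                    # degenerate generators
    # Cramer's rule for l1*g1 + l2*g2 = c_new
    l1 = (c_new[0] * g2[1] - c_new[1] * g2[0]) / det
    l2 = (g1[0] * c_new[1] - g1[1] * c_new[0]) / det
    return l1 >= 0 and l2 >= 0

print(in_cone_2d((3, 1), (1, 0), (1, 1)))    # True:  2*(1,0) + 1*(1,1)
print(in_cone_2d((-1, 0), (1, 0), (1, 1)))   # False: needs a negative weight
```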
Page 30
Theorem III
[Figure: objective values of all structures for problems P and Q, ordered by increasing objective value; y* is the solution to problem P, separated from the two competing structures by the structured margin d]
A = (cP - cQ) · y*: decrease in the objective value of the solution.
B = (cQ - cP) · y: increase in the objective value of the competing structures.
Theorem (margin-based amortized inference): if A + B is less than the structured margin d, then y* is still the optimum for Q.
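A minimal sketch of the margin-based test. A is exact; B is a max over competing structures, so, following the relaxation mentioned in the backup slides, it is bounded here by the sum of positive coefficient increases over all 0/1 vectors (this can only make the test more conservative). The example numbers and the margin value are invented:

```python
def margin_amortization(c_p, c_q, y_star, margin_d):
    """Theorem III sketch.  A = (cP - cQ).y*: drop in the cached
    solution's objective value.  B is upper-bounded by the sum of
    positive coefficient increases (a relaxation of the max over
    competing structures).  If A + B < d, y* is still optimal for Q."""
    A = sum((cp - cq) * y for cp, cq, y in zip(c_p, c_q, y_star))
    B = sum(max(0.0, cq - cp) for cp, cq in zip(c_p, c_q))
    return A + B < margin_d

# Hypothetical numbers: a small perturbation of cP, structured margin d = 1.0
print(margin_amortization([2, 3, 2, 1], [2, 3.5, 2, 1.2], [0, 1, 1, 0], 1.0))
# True
```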
Page 33
Speedup & Accuracy
[Figure: speedup (0.8-3.3) and F1 (50-75) for the baseline and the amortization schemes of [EMNLP'12, ACL'13]: Theorem 1, Theorem 2, Theorem 3, and margin-based]
Amortized inference gives a speedup without losing accuracy.
Solve only one in three problems.
Page 34
So far…
Amortized inference: making inference faster by re-using previous computations.
Techniques for amortized inference.
But these are not useful if the full structure is not redundant!
[Figure: smaller structures are more redundant]
Page 35
Decomposed amortized inference
Taking advantage of redundancy in components of structures:
• Extend amortization techniques to cases where the full structured output may not be repeated
• Store partial computations of "components" for use in future inference problems
Page 36
The bus was heading for Nairobi in Kenya.
Coherence of predictions: the preposition relations (Destination for "for Nairobi", Location for "in Kenya") and the verb SRL arguments should agree.
Predicate: head.02
A0 (mover): The bus
A1 (destination): for Nairobi in Kenya
Predicate arguments from different triggers should be consistent.
Joint constraints linking the two tasks (e.g., the preposition label Destination corresponds to the verb argument A1).
Page 37
Example: Decomposition for inference
The Verb Problem: verb SRL constraints.
The Preposition Problem: only one label per preposition.
Joint constraints link the verb relations and the preposition relations.
Re-introduce the joint constraints using Lagrangian relaxation [Komodakis et al., 2007], [Rush & Collins, 2011], [Chang & Collins, 2011], ...
Page 38
Decomposed amortized Inference
Intuition:
• Create smaller problems by removing constraints from the ILPs; smaller problems mean more cache hits!
• Solve the relaxed inference problems using any amortized inference algorithm
• Re-introduce the constraints via Lagrangian relaxation
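The steps above can be sketched as a standard dual-decomposition loop. The two toy subproblems and their scores below are invented; in the talk's setting each `solve_*` call would itself go through the amortized-inference cache.

```python
def dual_decomposition(solve_a, solve_b, num_shared, steps=50, rate=0.5):
    """Lagrangian-relaxation sketch: two subproblems (each a candidate
    for amortized inference) are linked by agreement constraints
    y_a = y_b on shared variables.  Multipliers u price disagreement;
    subgradient steps drive the two solutions to agree."""
    u = [0.0] * num_shared
    for _ in range(steps):
        ya = solve_a(u)                  # max f_a(y) + u . y
        yb = solve_b(u)                  # max f_b(y) - u . y
        if ya == yb:
            return ya                    # agreement reached: done
        u = [ui - rate * (a - b) for ui, a, b in zip(u, ya, yb)]
    return ya                            # no agreement within budget

# Toy subproblems over one shared binary variable (invented scores):
def solve_a(u):
    return [1] if 2.0 + u[0] > 0.0 else [0]   # strongly prefers y = 1

def solve_b(u):
    return [1] if -u[0] > 0.5 else [0]        # mildly prefers y = 0

print(dual_decomposition(solve_a, solve_b, num_shared=1))  # [1]
```

If the loop ends in agreement, the result is provably optimal for the joint problem; otherwise the last subproblem solution serves as an approximation.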
Page 39
Speedup & Accuracy
[Figure: speedup (0.8-6.8) and F1 (50-75) for the baseline and the amortization schemes of [EMNLP'12, ACL'13]: Theorems 1-3, margin-based, and margin-based + decomposition]
Amortized inference gives a speedup without losing accuracy.
Solve only one in six problems.
Page 40
Reduction in inference calls (SRL)
                               Num. inference calls   +decomposition
  Inference engine (baseline)         100
  Theorem 1                            41                  32.7
  Margin-based inference               24.4                16.6
Solve only one in six problems
Page 41
Reduction in inference calls (Entity-relation extraction)
                               Num. inference calls   +decomposition
  Inference engine (baseline)         100
  Theorem 1                            59.5                57
  Margin-based inference               28.2                24.4
Solve only one in four problems
Page 42
So far…
We have given theorems that allow saving 5/6 of the calls to your favorite inference engine.
But there is some cost in checking the conditions of the theorems and in accessing the cache.
Our implementations are clearly not state-of-the-art, but...
Page 43
Reduction in wall-clock time (SRL)
                               Wall-clock time (%)    +decomposition
  ILP solver (baseline)               100
  Theorem 1                            54.8                45.9
  Margin-based inference               40                  38.1
Solve only one in 2.6 problems
Page 44
Conclusion
Amortized inference: gave conditions for determining when a new, unseen problem shares a previously seen solution (or parts of it).
The theory depends on the ILP formulation of the problem, but applies to your favorite inference algorithm. In particular, approximate inference can be used as the base solver; the approximation properties of the underlying algorithm will be retained.
We showed that we can save 5/6 of the calls to an inference engine. The theorems can be relaxed to increase cache hits.
Integer Linear Programming formulations are powerful. We already knew that they are expressive and easy to use in many problems. Moreover, even if you want to use other solvers, we showed that the ILP formulation is key to amortization.
Thank You!
Page 45
Page 46
Theorem III
[Figure: objective values of all structures for problems P and Q, ordered by increasing objective value; y* is the solution to problem P, separated from the two competing structures by the structured margin d]
A = (cP - cQ) · y*: decrease in the objective value of the solution. Easy to compute.
B = (cQ - cP) · y: increase in the objective value of the competing structures. Hard to compute (a max over y); we relax the problem, which can only increase B.
The structured margin d is easy to compute during caching.
Theorem (margin-based amortized inference): if A + B is less than the structured margin d, then y* is still the optimum for Q.
Experiments: Semantic Role Labeling
SRL: based on the state-of-the-art Illinois SRL system [V. Punyakanok, D. Roth and W. Yih, The Importance of Syntactic Parsing and Inference in Semantic Role Labeling, Computational Linguistics, 2008].
In SRL, we solve an ILP problem for each verb predicate in each sentence.
Amortization experiments:
• Speedup & accuracy are measured over the WSJ test set (Section 23)
• Baseline: solving each ILP with Gurobi 4.6
• For amortization: we collect 250,000 SRL inference problems from Gigaword and store them in a database
• For each ILP in the test set, we invoke one of the theorems (exact / approximate); if a matching cached solution is found, we return it, otherwise we call the baseline ILP solver
Page 47
Inference with General Constraint Structure [Roth & Yih '04, '07]
Recognizing Entities and Relations

Dole's wife, Elizabeth, is a native of N.C.
Entities: E1 (Dole), E2 (Elizabeth), E3 (N.C.); relations: R12, R23.
Local scores (entity and relation classifiers):

  E1 (Dole):       other 0.05   per 0.85   loc 0.10
  E2 (Elizabeth):  other 0.10   per 0.60   loc 0.30
  E3 (N.C.):       other 0.05   per 0.50   loc 0.45

  R12:  irrelevant 0.05   spouse_of 0.45   born_in 0.50
  R23:  irrelevant 0.10   spouse_of 0.05   born_in 0.85
Improvement over no inference: 2-5%
Models could be learned separately; constraints may come up only at decision time.
Page 48
Note: Non-Sequential Model
Key Questions: How to guide the global inference? How to learn? Why not Jointly?
Y = argmax_y Σ score(y = v) · [[y = v]]
  = argmax score(E1 = PER) · [[E1 = PER]] + score(E1 = LOC) · [[E1 = LOC]] + ...
          + score(R12 = spouse_of) · [[R12 = spouse_of]] + ...
Subject to constraints.

An objective function that incorporates learned models with knowledge (constraints): a Constrained Conditional Model.