
Autonomous Web-scale Information Extraction

Doug Downey
Advisor: Oren Etzioni
Department of Computer Science and Engineering, Turing Center, University of Washington

Web Information Extraction

…cities such as Chicago… => Chicago ∈ City
C such as x => x ∈ C [Hearst, 1992]

…Edison invented the light bulb… => (Edison, light bulb) ∈ Invented
x V y => (x, y) ∈ V

e.g., KnowItAll [Etzioni et al., 2005], TextRunner [Banko et al., 2007], others [Pasca et al., 2007]

Identifying correct extractions

…mayors of major cities such as Giuliani… => Giuliani ∈ City

Supervised IE: hand-label examples of each concept
Not possible on the Web (far too many concepts)
=> Unsupervised IE (UIE)

How can we automatically identify correct extractions for any concept without hand-labeled data?

KnowItAll Hypothesis (KH)

Extractions that occur more frequently in distinct sentences in the corpus are more likely to be correct.

Repetitions of the same error are relatively rare:
…mayors of major cities such as Giuliani…
…hotels in popular cities such as Marriot…

Misinformation is the exception rather than the rule:
“Elvis killed JFK” – 200 hits
“Oswald killed JFK” – 3000 hits

Redundancy

KH can identify many correct statements because the Web is highly redundant:
– same facts repeated many times, in many ways
– e.g., “Edison invented the light bulb” – 10,000 hits

(but leveraging the KH is a little tricky => probabilistic model)

Thesis: We can identify correct extractions without labeled data using a probabilistic model of redundancy.

Outline

1) Background
2) KH as a general problem structure
   • Monotonic Feature Model
3) URNS model
   • How does probability increase with repetition?
4) Challenge: The “long tail”
   • Unsupervised language models

Classical Supervised Learning

Learn a function from x = (x1, …, xd) to y ∈ {0, 1} given labeled examples (x, y)

[Figure: labeled points in the (x1, x2) plane]

Semi-Supervised Learning (SSL)

Learn a function from x = (x1, …, xd) to y ∈ {0, 1} given labeled examples (x, y) and unlabeled examples (x)

[Figure: labeled and unlabeled points in the (x1, x2) plane]

Monotonic Features

Learn a function from x = (x1, …, xd) to y ∈ {0, 1} given monotonic feature x1 and unlabeled examples (x)

[Figure: unlabeled points in the (x1, x2) plane]

Monotonic Features

Learn a function from x = (x1, …, xd) to y ∈ {0, 1} given monotonic feature x1 and unlabeled examples (x)

P(y = 1 | x1) increases with x1

Common Structure

Task                         Monotonic Feature
UIE                          “C such as x” [Etzioni et al., 2005]
Word Sense Disambiguation    “plant and animal species” [Yarowsky, 1995]
Information Retrieval        search query [Kwok & Grunfield, 1995; Thompson & Turtle, 1995]
Document Classification      topic word, e.g. “politics” [McCallum & Nigam, 1999; Gliozzo, 2005]
Named Entity Recognition     contains(“Mr.”) [Collins & Singer, 1998]

Isn’t this just ___ ?

The MF model is provably distinct from standard smoothness assumptions in SSL (the Cluster Assumption and the Manifold Assumption) => MFs can complement other methods.

Unlike co-training, the MF model doesn’t require labeled data or pre-defined “views”.

Theoretical Results

One MF implies PAC-learnability without labeled data, when the MF is conditionally independent of the other features and is minimally informative (a corollary to the co-training theorem [Blum and Mitchell, 1998]).

MFs provide more information (vs. labels) about unlabeled examples as the feature space grows: as the number of features increases, the information gain due to MFs stays constant, while the information gain due to labeled examples falls (under assumptions).

Classification with the MF Model

MFA: Given MFs and unlabeled data:
1) Use the MFs to produce noisy labels
2) Train any classifier
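As a sketch of these two steps under simplifying assumptions (toy data, a hypothetical threshold on the MF, and a nearest-centroid classifier standing in for “any classifier”; the thesis’s MFA is more involved):

```python
def mfa(examples, mf_index, threshold):
    """MFA sketch: use the monotonic feature to assign noisy labels to
    unlabeled examples, then train a classifier on those labels.
    Here the classifier is nearest-centroid over the non-MF features."""
    # Step 1: noisy labels from the monotonic feature alone.
    noisy = [(x, 1 if x[mf_index] >= threshold else 0) for x in examples]
    dims = [i for i in range(len(examples[0])) if i != mf_index]
    # Step 2: train "any classifier" -- here, per-class centroids.
    centroids = {}
    for cls in (0, 1):
        members = [x for x, y in noisy if y == cls]
        centroids[cls] = [sum(x[i] for x in members) / len(members)
                          for i in dims]

    def predict(x):
        def sqdist(cls):
            return sum((x[i] - c) ** 2
                       for i, c in zip(dims, centroids[cls]))
        return min((0, 1), key=sqdist)

    return predict

# Toy unlabeled pool: feature 0 is the MF; feature 1 also separates classes.
data = [(5.0, 1.0), (6.0, 1.2), (0.5, -1.0), (0.2, -0.8)]
classify = mfa(data, mf_index=0, threshold=3.0)
```

Because the MF is used only to generate labels, the resulting classifier can score examples whose MF value is uninformative.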

Experimental Results

20 Newsgroups dataset (MF: newsgroup name)

Vs. two SSL baselines (NB + EM, LP)

Without labeled data: [results figure]

Experimental Results

MFA-SSL provides a 15% error reduction for 100-400 labeled examples.
MFA-BOTH provides a 31% error reduction for 0-800 labeled examples.

Bad News: confusable MFs

For more complex tasks, monotonicity is insufficient.

Example: City extractions
MF: extraction frequency with, e.g., “cities such as x”
…but this is also an MF for: has skyscrapers, has an opera house, located on Earth, …

Extraction    MF value
New York      1488
Chicago       999
Los Angeles   859
…             …
Twisp         1
Northeast     1


Performance of MFA in UIE


MFA for SSL in UIE

Outline

1) Background
2) KH as a general problem structure
   • Monotonic Feature Model
3) URNS model
   • How does probability increase with repetition?
4) Challenge: The “long tail”
   • Unsupervised language models

Redundancy: Single Pattern

Consider a single pattern suggesting C, e.g., “countries such as x”

If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?

Redundancy: Single Pattern

C = Country, n = 10 occurrences:

“…countries such as Saudi Arabia…”
“…countries such as the United States…”
“…countries such as Saudi Arabia…”
“…countries such as Japan…”
“…countries such as Africa…”
“…countries such as Japan…”
“…countries such as the United Kingdom…”
“…countries such as Iraq…”
“…countries such as Afghanistan…”
“…countries such as Australia…”

Naïve Model: Noisy-Or

p = probability that the pattern yields a correct extraction; here p = 0.9

Pnoisy-or(x ∈ C | x seen k times) = 1 – (1 – p)^k
[Agichtein & Gravano, 2000; Lin et al. 2003]

C = Country, n = 10:

Extraction       k   Pnoisy-or
Saudi Arabia     2   0.99
Japan            2   0.99
United States    1   0.9
Africa           1   0.9
United Kingdom   1   0.9
Iraq             1   0.9
Afghanistan      1   0.9
Australia        1   0.9

Noisy-or ignores:
– Sample size (n)
– Distribution of C
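The noisy-or estimate is easy to compute directly; a minimal sketch using the p = 0.9 value from the slide:

```python
def p_noisy_or(k, p=0.9):
    """Noisy-or: probability that extraction x is correct given it was
    seen k times, where each occurrence is independently correct with
    probability p."""
    return 1 - (1 - p) ** k

# k = 1 gives 0.9 and k = 2 gives 0.99, matching the values on the slide.
```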

Needed in Model: Sample Size

C = Country, n = 10:

Extraction       k   Pnoisy-or
Saudi Arabia     2   0.99
Japan            2   0.99
United States    1   0.9
Africa           1   0.9
United Kingdom   1   0.9
Iraq             1   0.9
Afghanistan      1   0.9
Australia        1   0.9

C = Country, n ~ 50,000:

Extraction           k      Pnoisy-or
United States        3899   0.9999…
China                1999   0.9999…
…                    …      …
OilWatch Africa      1      0.9
Religion Paraguay    1      0.9
Chicken Mole         1      0.9
Republics of Kenya   1      0.9
Atlantic Ocean       1      0.9

As sample size increases, noisy-or becomes inaccurate.

Needed in Model: Distribution of C

C = Country, n ~ 50,000 (Pnoisy-or):

Extraction           k      Pnoisy-or
United States        3899   0.9999…
China                1999   0.9999…
…                    …      …
OilWatch Africa      1      0.9
Religion Paraguay    1      0.9
Chicken Mole         1      0.9
Republics of Kenya   1      0.9
Atlantic Ocean       1      0.9

Pfreq(x ∈ C | x seen k times) = 1 – (1 – p)^(k/n)

Needed in Model: Distribution of C

Pfreq(x ∈ C | x seen k times) = 1 – (1 – p)^(k/n)

C = Country, n ~ 50,000 (Pfreq):

Extraction           k      Pfreq
United States        3899   0.9999…
China                1999   0.9999…
…                    …      …
OilWatch Africa      1      0.05
Religion Paraguay    1      0.05
Chicken Mole         1      0.05
Republics of Kenya   1      0.05
Atlantic Ocean       1      0.05

Needed in Model: Distribution of C

C = City, n ~ 50,000 (Pfreq):

Extraction       k      Pfreq
New York         1488   0.9999…
Chicago          999    0.9999…
…                …      …
El Estor         1      0.05
Nikki            1      0.05
Ragaz            1      0.05
Villegas         1      0.05
Northeastwards   1      0.05

C = Country, n ~ 50,000 (Pfreq):

Extraction           k      Pfreq
United States        3899   0.9999…
China                1999   0.9999…
…                    …      …
OilWatch Africa      1      0.05
Religion Paraguay    1      0.05
Chicken Mole         1      0.05
Republics of Kenya   1      0.05
Atlantic Ocean       1      0.05

The probability that x ∈ C depends on the distribution of C.

My solution: URNS Model

“…cities such as Tokyo…”

Urn for C = City: Tokyo, U.K., Sydney, Cairo, Tokyo, Tokyo, Atlanta, Atlanta, Yakima, Utah, U.K.

Urn – Formal Definition

C – set of unique target labels
E – set of unique error labels
num(C) – distribution of target labels
num(E) – distribution of error labels

Urn Example

Urn for C = City: U.K., Sydney, Cairo, Tokyo, Tokyo, Atlanta, Atlanta, Yakima, Utah, U.K.

distribution of target labels: num(C) = {2, 2, 1, 1, 1}
distribution of error labels: num(E) = {2, 1}

Computing Probabilities

If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?

Computing Probabilities

Given that an extraction x appears k times in n draws from the urn (with replacement), what is the probability that x ∈ C?

P(x ∈ C | x appears k times in n draws) =
  [ Σ_{r ∈ num(C)} (r/s)^k (1 – r/s)^(n–k) ] / [ Σ_{r′ ∈ num(C) ∪ num(E)} (r′/s)^k (1 – r′/s)^(n–k) ]

where s is the total number of balls in the urn.
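This ratio can be computed directly from num(C) and num(E); a sketch under the standard binomial treatment of draws with replacement (the C(n, k) factor is shared by every term and cancels; the function name is mine, not from the thesis):

```python
def p_in_c(k, n, num_c, num_e):
    """URNS sketch: probability that an extraction appearing k times in
    n draws (with replacement) is a target, given the repetition counts
    num_c (target balls) and num_e (error balls) in the urn."""
    s = sum(num_c) + sum(num_e)  # total number of balls in the urn

    def weight(r):
        # Binomial likelihood of k appearances for a label with r balls,
        # up to the C(n, k) factor common to all labels.
        return (r / s) ** k * (1 - r / s) ** (n - k)

    target = sum(weight(r) for r in num_c)
    return target / (target + sum(weight(r) for r in num_e))

# The urn example: num(C) = {2, 2, 1, 1, 1}, num(E) = {2, 1}.
p = p_in_c(k=2, n=10, num_c=[2, 2, 1, 1, 1], num_e=[2, 1])
```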

URNS without labeled data

Needed: num(C), num(E)

Assumed to be Zipf: the frequency of the i-th element ∝ i^(–z)

With assumptions, learn the Zipfian parameters for any class C from unlabeled data alone.
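For concreteness, a Zipfian num(C) is easy to generate; a toy sketch (the exponent z and the scale here are arbitrary illustrative choices, not values from the thesis):

```python
def zipf_frequencies(m, z, scale=1000.0):
    """Frequencies for the m most common elements of a Zipfian class:
    the i-th most frequent element has frequency proportional to i**-z."""
    return [scale * i ** -z for i in range(1, m + 1)]

freqs = zipf_frequencies(5, z=1.0)
# With z = 1, the most frequent element is twice as frequent as the second.
```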

URNS without labeled data

[Diagram: the observed frequency distribution is a mixture of a Zipfian distribution over C (with weight p) and a Zipfian distribution over E (with weight 1 – p); p and num(E) are constant across classes C for a given pattern]

Learn num(C) from unlabeled data!

Probabilities Assigned by URNS

C = City, n ~ 50,000:

Extraction       k      PURNS
New York         1488   0.9999…
Chicago          999    0.9999…
…                …      …
El Estor         1      0.63
Nikki            1      0.63
Ragaz            1      0.63
Villegas         1      0.63
Cres             1      0.63
Northeastwards   1      0.63

C = Country, n ~ 50,000:

Extraction           k      PURNS
United States        3899   0.9999…
China                1999   0.9999…
…                    …      …
OilWatch Africa      1      0.03
Religion Paraguay    1      0.03
Chicken Mole         1      0.03
Republics of Kenya   1      0.03
Atlantic Ocean       1      0.03
New Zeland           1      0.03

Probability Accuracy

[Chart: deviation from ideal log likelihood (0-5) for urns, noisy-or, and pmi on the City, Film, Country, and MayorOf classes]

URNS’s probabilities are 15-22x closer to optimal.

Sensitivity Analysis

URNS assumes num(E) and p are constant.

If we alter parameter choices substantially, URNS still outperforms noisy-or and PMI by at least 8x.

Most sensitive to p.

p ~ 0.85 is relatively consistent across randomly selected classes from WordNet (solvents, devices, thinkers, relaxants, mushrooms, mechanisms, resorts, flies, tones, machines, …)

Multiple Extraction Patterns

Phrase                        Hits
“Omaha and other cities”      950
“Illinois and other cities”   24,400
“cities such as Omaha”        930
“cities such as Illinois”     6

Multiple urns: target label frequencies are correlated across urns, while error label frequencies can be uncorrelated.

Benefits from Multiple Urns

Precision at K:

K     Single   Multiple
10    1.0      1.0
20    0.9875   1.0
50    0.925    0.955
100   0.8375   0.845
200   0.7075   0.71

Using multiple urns reduces error by 29%.


URNS vs. MFA


URNS + MFA in SSL

MFA-ssl (urns) reduces error by 6%, on average.

URNS: Learnable from unlabeled data

All URNS parameters can be learned from unlabeled data alone (with assumptions) [Theorem 20].

URNS implies PAC learnability from unlabeled data alone [Theorem 21], even with confusable MFs (i.e., even without conditional independence).

Parameters Learnable (1)

We can express the URNS model as a Compound Poisson Process.

The mixture gC() + gE() can be learned, given enough samples [Loh, 1993].

Task: learn the power-law distributions gC() and gE() from their sum.

Parameters Learnable (2)

Assume:
– Sufficiently high frequency => only target elements
– Sufficiently low frequency => only errors

Then the sum gC() + gE() can be separated into its components.

Outline

1) Background
2) KH as a general problem structure
   • Monotonic Feature Model
3) URNS model
   • How does probability increase with repetition?
4) Challenge: The “long tail”
   • Unsupervised language models

Challenge: the “long tail”

[Chart: number of times each extraction appears in the pattern (0-500) vs. frequency rank of the extraction (0-100,000)]

Frequent extractions tend to be correct, e.g., (Bloomberg, New York City).

Sparse extractions in the long tail are a mixture of correct and incorrect, e.g., (Dave Shaver, Pickerington), (Ronald McDonald, McDonaldland).


Mayor McCheese

Assessing Sparse Extractions

Strategy:
1) Model how common extractions occur in text
2) Rank sparse extractions by fit to the model

The Distributional Hypothesis

Terms in the same class tend to appear in similar contexts.

Context                 Hits with Chicago   Hits with Twisp
“__ hotels”             2,000,000           1,670
“mayor of __”           657,000             82
“cities including __”   42,000              1
“__ and other cities”   37,900              0

Unsupervised Language Models

– Precomputed – scalable
– Handle sparsity

Baseline: context vectors

“cities such as Chicago , Boston ,”
“But Chicago isn’t the best”
“Los Angeles and Chicago .”

Replacing the extraction with x gives a vector of context counts, e.g. 1 2 1 … over contexts such as “such as x , Boston”, “But x isn’t the”, “Angeles and x .”

Compute dot products between the vectors of common and sparse extractions [cf. Ravichandran et al. 2005].
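A minimal sketch of this baseline (toy sentences and contexts of my own choosing; real systems build these vectors from Web-scale counts):

```python
def context_vector(occurrences, contexts):
    """Count how often each context pattern appears among the occurrences
    of an extraction (with the extraction itself replaced by 'x')."""
    return [sum(1 for occ in occurrences if ctx in occ) for ctx in contexts]

contexts = ["cities such as x", "x isn't the best"]

# Occurrences of a common extraction (Chicago, already replaced by x)...
common = context_vector(["cities such as x , Boston ,",
                         "But x isn't the best"], contexts)
# ...and of a sparse one (Twisp).
sparse = context_vector(["cities such as x ,"], contexts)

# Dot product: context overlap suggests the sparse extraction is a city.
score = sum(a * b for a, b in zip(common, sparse))
```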

HMM Compresses Context Vectors

Twisp:       ⟨ … 0 0 0 1 … ⟩
HMM(Twisp):  ⟨ 0.14 0.01 … 0.06 ⟩ over latent states t = 1, 2, …, N

The HMM provides a “distributional summary”:
– Compact (efficient – 10-50x less data retrieved)
– Dense (accurate – 23-46% error reduction)

Experimental Results

Task: ranking sparse TextRunner extractions.
Metric: area under the precision-recall curve.

            Headquartered   Merged   …   Average
Frequency   0.710           0.784        0.713
PL          0.651           0.851        0.785
LM          0.810           0.908        0.851

Language models reduce the missing area by 39% over the nearest competitor.

Summary of Thesis

Formalization of Monotonic Features (MFs):
– One MF enables PAC learnability from unlabeled data alone [Corollary 4.1]
– MFs provide greater information gain vs. labels as the feature space increases in size [Theorem 8]
– The MF model is formally distinct from other SSL approaches [Theorems 9 and 10]
– The MF model is insufficient when “subconcepts” are present [Proposition 12]

Summary: MFs (Continued)

MFA: general SSL algorithm for MFs:
– Given MFs, MFA performance is equivalent to a state-of-the-art SSL algorithm with 160 labeled examples [Table 2.1]
– Even when MFs are not given, MFA can detect MFs in SSL, reducing error by 16% [Figure 2.5]
– MFA is not effective for UIE [Table 2.2 & Figure 2.6]

Summary: URNS

URNS: formal model of redundancy in IE:
– Describes how probability increases with MF value [Proposition 13]
– Models corroboration among multiple extraction mechanisms (multiple urns) [Proposition 14]

URNS Theoretical Results

Uniform Special Case (USC):
– Odds in the USC increase exponentially with repetition [Theorem 15]
– Error decreases exponentially when parameters are known [Theorem 16]

Zipfian Case (ZC):
– Closed-form expression for ZC probability given parameters, and odds given repetitions [Theorem 17]
– Error in the ZC is bounded above by K / n^(1–ε) for any ε > 0 when parameters are known [Theorem 19]

URNS Theoretical Results (cont.)

Zipfian Case (ZC):
– In the ZC, with probability 1 – δ, the parameters of URNS can be estimated with error < ε for all ε, δ > 0, given sufficient data [Theorem 20]
– In the ZC, URNS guarantees PAC learnability given only unlabeled data, provided the MF is sufficiently informative and a “separability” criterion is met in the concept space [Theorem 21]

URNS Experimental Results

Supervised learning [Table 3.3]:
– 19% error reduction over noisy-or
– 10% error reduction over logistic regression
– Comparable performance to SVM

Semi-supervised IE [Figure 3.4]:
– 6% error reduction over LP

Unsupervised IE [Figure 3.2]:
– 1500% error reduction over noisy-or
– 2200% error reduction over PMI

Improved efficiency [Table 3.2]:
– 8x faster than PMI


Other Applications of URNS

Estimating extraction precision and recall [Table 3.7]

Identifying synonymous objects and relations (RESOLVER) [Yates & Etzioni, 2007]

Identifying functional relations in text [Ritter et al., 2008]

Assessing Sparse Extractions

Hidden Markov Model assessor (HMM-T):
– Error reduction of 23-46% over context vectors on the typechecking task [Table 4.1]
– Error reduction of 28% over context vectors on sparse unary extractions [Table 4.2]
– 10-50x more efficient than context vectors

Sparse extraction assessment with language models:
– Error reduction of 39% over previous work [Table 4.3]
– Massively more scalable than previous techniques

Acknowledgements: Oren Etzioni, Mike Cafarella, Pedro Domingos, Susan Dumais, Eric Horvitz, Alan Ritter, Stef Schoenmackers, Stephen Soderland, Dan Weld


Web IE without labeled examples

Extraction is sometimes “easy”: generic extraction patterns

…cities such as Chicago… => Chicago ∈ City
C such as x => x ∈ C [Hearst, 1992]

But most sentences are “tough”:

“We walked the tree-lined streets of the bustling metropolis that is Atlanta.”

Extracting Atlanta ∈ City requires:
– Syntactic parsing (Atlanta -> is -> metropolis)
– Subclass discovery (metropolis(x) => city(x))
Challenging and difficult to scale, e.g. [Collins, 1997; Snow & Ng 2006]

Web IE without labeled examples

Extraction is sometimes “easy”: generic extraction patterns

…cities such as Chicago… => Chicago ∈ City
C such as x => x ∈ C [Hearst, 1992]

But most sentences are “tough”:

“We walked the tree-lined streets of the bustling metropolis that is Atlanta.”

“cities such as Atlanta” – 21,600 hits

Web IE without labeled examples

Extraction is sometimes “easy”: generic extraction patterns

…cities such as Chicago… => Chicago ∈ City
C such as x => x ∈ C [Hearst, 1992]

…Bloomberg, mayor of New York City… => (Bloomberg, New York City) ∈ Mayor
x, C of y => (x, y) ∈ C

The scale and redundancy of the Web make a multitude of facts “easy” to extract.

TextRunner Search [Banko et al., 2007]
http://www.cs.washington.edu/research/textrunner/

But… extraction patterns make errors:

“Erik Jonsson, CEO of Texas Instruments, mayor of Dallas from 1964-1971, and…”

Task: assess which extractions are correct
– Without hand-labeled examples
– At Web-scale

Thesis: “We can assess extraction correctness by leveraging redundancy and probabilistic models.”

Outline

1) Motivation
2) Background on Web IE
3) Estimating extraction correctness
   • URNS model of redundancy [Downey et al., IJCAI 2005] (Distinguished Paper Award)
4) Challenge: The “long tail”
5) Machine learning generalization

Redundancy – Two Intuitions

1) Repetition
2) Multiple patterns

Phrase                        Hits
“Chicago and other cities”    94,400
“Illinois and other cities”   23,100
“cities such as Chicago”      42,500
“cities such as Illinois”     7

Goal: a formal model of these intuitions.

Given a term x and a set of sentences containing extraction patterns for a class C, what is the probability that x ∈ C?

Page 72: 1 Autonomous Web-scale Information Extraction Doug Downey Advisor: Oren Etzioni Department of Computer Science and Engineering Turing Center University.

72

Given a term x and a set of sentences containing extraction patterns for a class C, what is the probability that x C?

If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x C?

Consider a single pattern suggesting C , e.g.,

countries such as x

Redundancy: Single Pattern

Page 73: 1 Autonomous Web-scale Information Extraction Doug Downey Advisor: Oren Etzioni Department of Computer Science and Engineering Turing Center University.

73

“…countries such as Saudi Arabia…”

“…countries such as the United States…”

“…countries such as Saudi Arabia…”

“…countries such as Japan…”

“…countries such as Africa…”

“…countries such as Japan…”

“…countries such as the United Kingdom…”

“…countries such as Iraq…”

“…countries such as Afghanistan…”

“…countries such as Australia…”

C = Country

n = 10 occurrences

Redundancy: Single Pattern

Page 74: 1 Autonomous Web-scale Information Extraction Doug Downey Advisor: Oren Etzioni Department of Computer Science and Engineering Turing Center University.

74

C = Country

n = 10

Saudi Arabia

Japan

United States

Africa

United Kingdom

Iraq

Afghanistan

Australia

k2

2

1

1

1

1

1

1

p = probability pattern yields a correct extraction, i.e.,

p = 0.9

0.99

0.99

0.9

0.9

0.9

0.9

0.9

0.9 Noisy-or ignores: –Sample size (n) –Distribution of C

Naïve Model: Noisy-Or

Pnoisy-orPnoisy-or(xC | x seen k times)

= 1 – (1 – p)k

[Agichtein & Gravano, 2000; Lin et al. 2003]

Page 75: 1 Autonomous Web-scale Information Extraction Doug Downey Advisor: Oren Etzioni Department of Computer Science and Engineering Turing Center University.

75

United States

China

. . .

OilWatch Africa

Religion Paraguay

Chicken Mole

Republics of Kenya

Atlantic Ocean

C = Country

n ~50,000

3899

1999

1

1

1

1

1

0.9999…

0.9999…

0.9

0.9

0.9

0.9

0.9

C = Country

n = 10

Saudi Arabia

Japan

United States

Africa

United Kingdom

Iraq

Afghanistan

Australia

2

2

1

1

1

1

1

1

0.99

0.99

0.9

0.9

0.9

0.9

0.9

0.9

As sample size increases, noisy-or becomes inaccurate.

Needed in Model: Sample Size

Pnoisy-or Pnoisy-ork k


76

C = Country, n ~ 50,000

Extraction           k      P_noisy-or
United States        3899   0.9999…
China                1999   0.9999…
. . .
OilWatch Africa      1      0.9
Religion Paraguay    1      0.9
Chicken Mole         1      0.9
Republics of Kenya   1      0.9
Atlantic Ocean       1      0.9

Needed in Model: Distribution of C

P_freq(x ∈ C | x seen k times) = 1 – (1 – p)^(k/n)


77

C = Country, n ~ 50,000

Extraction           k      P_freq
United States        3899   0.9999…
China                1999   0.9999…
. . .
OilWatch Africa      1      0.05
Religion Paraguay    1      0.05
Chicken Mole         1      0.05
Republics of Kenya   1      0.05
Atlantic Ocean       1      0.05

Needed in Model: Distribution of C

P_freq(x ∈ C | x seen k times) = 1 – (1 – p)^(k/n)
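The frequency-adjusted estimate can be sketched the same way. Note the 0.05 values on the slides presumably come from learned parameters rather than from p = 0.9 directly, so this toy version only illustrates the direction of the effect of sample size:

```python
def p_freq(k, n, p=0.9):
    """Frequency-adjusted estimate from the slide: like noisy-or, but the
    per-occurrence evidence is scaled down by the sample size n."""
    return 1.0 - (1.0 - p) ** (k / n)

# A single occurrence is weak evidence at n = 10, and far weaker at n ~ 50,000.
print(round(p_freq(1, 10), 3))  # ≈ 0.206
print(p_freq(1, 50000))         # tiny
```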


78

C = City, n ~ 50,000

Extraction       k      P_freq
New York         1488   0.9999…
Chicago          999    0.9999…
. . .
El Estor         1      0.05
Nikki            1      0.05
Ragaz            1      0.05
Villegas         1      0.05
Northeastwards   1      0.05

C = Country, n ~ 50,000

Extraction           k      P_freq
United States        3899   0.9999…
China                1999   0.9999…
. . .
OilWatch Africa      1      0.05
Religion Paraguay    1      0.05
Chicken Mole         1      0.05
Republics of Kenya   1      0.05
Atlantic Ocean       1      0.05

Probability x ∈ C depends on the distribution of C.

Needed in Model: Distribution of C


79

…cities such as Tokyo…

Urn for C = City: Tokyo, Tokyo, Tokyo, Atlanta, Atlanta, Sydney, Cairo, Yakima, U.K., U.K., Utah

My solution: URNS Model


80

C – set of unique target labels

E – set of unique error labels

num(C) – distribution of target labels

num(E) – distribution of error labels

Urn – Formal Definition


81

distribution of target labels: num(C) = {2, 2, 1, 1, 1}
distribution of error labels: num(E) = {2, 1}

Urn for C = City: Tokyo, Tokyo, Atlanta, Atlanta, Sydney, Cairo, Yakima, U.K., U.K., Utah

Urn Example


82

If an extraction x appears k times in a set of n distinct occurrences of the pattern, what is the probability that x ∈ C?

Computing Probabilities


83

Given that an extraction x appears k times in n draws from the urn (with replacement), what is the probability that x ∈ C?

where s is the total number of balls in the urn

Computing Probabilities
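The closed-form expression itself did not survive the transcript (it referenced s, the total number of balls). One way to compute the urn posterior is to sum over the possible ball counts of the observed label; this is my reconstruction from the slide's definitions, not necessarily the paper's exact formula:

```python
def p_urns(k, n, num_C, num_E):
    """Probability that an extraction seen k times in n draws (with
    replacement) carries a target label, for an urn with ball counts
    num_C (targets) and num_E (errors).  The binomial coefficient
    cancels between numerator and denominator."""
    s = sum(num_C) + sum(num_E)  # total number of balls in the urn

    def weight(r):
        # likelihood (up to the shared binomial factor) that a label
        # with r balls is drawn exactly k times out of n
        return (r / s) ** k * (1 - r / s) ** (n - k)

    target = sum(weight(r) for r in num_C)
    total = target + sum(weight(r) for r in num_E)
    return target / total

# The urn example from the talk: num(C) = {2, 2, 1, 1, 1}, num(E) = {2, 1}.
print(p_urns(2, 10, [2, 2, 1, 1, 1], [2, 1]))  # ≈ 0.705
```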


84

Multiple urns:
– Target label frequencies correlated across urns
– Error label frequencies can be uncorrelated

Phrase                        Hits
“Chicago and other cities”    94,400
“Illinois and other cities”   23,100
“cities such as Chicago”      42,500
“cities such as Illinois”     7

Multiple Extraction Patterns


85

URNS without labeled data

Needed: num(C), num(E)

Assumed to be Zipf

Frequency of the ith element ∝ i^(-z)

With assumptions, learn Zipfian parameters for any class C from unlabeled data alone
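The Zipf assumption is easy to make concrete; a minimal sketch (the scale parameter is arbitrary, for illustration only):

```python
def zipf_frequencies(num_labels, z, scale=1.0):
    """Frequencies proportional to i^(-z) for ranks i = 1..num_labels."""
    return [scale * i ** (-z) for i in range(1, num_labels + 1)]

# With z = 1.0, frequency halves from rank 1 to rank 2, and so on.
freqs = zipf_frequencies(5, z=1.0, scale=1000)
print(freqs)  # [1000.0, 500.0, ...]
```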


86

Observed frequency distribution = a mixture of C’s Zipf curve (weight p) and E’s Zipf curve (weight 1 – p)

p: constant across C, for a given pattern
num(E): constant across C

Learn num(C) from unlabeled data!

URNS without labeled data


87

C = City, n ~ 50,000

Extraction       k      P_URNS
New York         1488   0.9999…
Chicago          999    0.9999…
. . .
El Estor         1      0.63
Nikki            1      0.63
Ragaz            1      0.63
Villegas         1      0.63
Cres             1      0.63
Northeastwards   1      0.63

C = Country, n ~ 50,000

Extraction           k      P_URNS
United States        3899   0.9999…
China                1999   0.9999…
. . .
OilWatch Africa      1      0.03
Religion Paraguay    1      0.03
Chicken Mole         1      0.03
Republics of Kenya   1      0.03
Atlantic Ocean       1      0.03
New Zeland           1      0.03

Probabilities Assigned by URNS


88

[Chart: deviation from ideal log likelihood (0–5) for City, Film, Country, and MayorOf, comparing urns, noisy-or, and pmi]

URNS’s probabilities are 15-22x closer to optimal.

Probability Accuracy


89

Computation is efficient
– Continuous Zipf & Poisson approximations
– => Closed-form expression for P(x ∈ C | evidence)

vs. Pointwise Mutual Information (PMI) [Etzioni et al. 2005]
– PMI computed with search engine hit counts (inspired by [Turney, 2000])
– URNS requires no hit count queries (~8x faster)

Scalability

Scalability


90

Probabilistic model of redundancy

Accurate without hand-labeled examples
– 15-22x improvement in probability accuracy

Scalable
– ~8x faster than PMI

[Downey et al., IJCAI 2005]

URNS: Contributions


91

1) Motivation

2) Background on Web IE

3) Estimating extraction correctness

4) Challenge: The “long tail” Language models to the rescue

[Downey et al., ACL 2007]

5) Machine learning generalization

Outline


92

[Plot: number of times an extraction appears in the pattern (0–500) vs. frequency rank of the extraction (0–100,000)]

Tail: a mixture of correct and incorrect
e.g., (Dave Shaver, Pickerington); (Ronald McDonald, McDonaldland)

Head: tend to be correct
e.g., (Bloomberg, New York City)

Challenge: the “long tail”


93

Mayor McCheese


94

Strategy:
1) Model how common extractions occur in text
2) Rank sparse extractions by fit to model

Unsupervised language models
– Precomputed – scalable
– Handle sparsity

Assessing Sparse Extractions


95

The “distributional hypothesis”: Instances of the same relationship tend to appear in similar contexts.

…David B. Shaver was elected as the new mayor of Pickerington, Ohio.

http://www.law.capital.edu/ebriefsarchive/Summer2004/ClassActionsLeft.asp

…Mike Bloomberg was elected as the new mayor of New York City.

http://www.queenspress.com/archives/coverstories/2001/issue52/coverstory.htm

Assessing Sparse Extractions


96

Type errors are common:

Alexander the Great conquered Egypt… => (Great, Egypt) ∈ Conquered

Locally acquired malaria is now uncommon… => (Locally, malaria) ∈ Acquired

Type checking


97

cities such as Chicago , Boston ,
But Chicago isn’t the best
cities such as Chicago , Boston ,
Los Angeles and Chicago .

Chicago: < 1, 2, 1, … >
(contexts: “such as <x> , Boston”, “But <x> isn’t the”, “Angeles and <x> .”)

Compute dot products between vectors of common and sparse extractions [cf. Ravichandran et al. 2005]

Baseline: context vectors (1)
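The context-vector baseline can be sketched with a toy corpus; the whitespace tokenization and one-word context window here are simplifications, not the system's actual features:

```python
from collections import Counter

def context_vector(corpus, target):
    """Count the contexts (here, just the immediate left/right neighbors)
    in which the target string appears."""
    vec = Counter()
    for sentence in corpus:
        words = sentence.split()
        for i, w in enumerate(words):
            if w == target:
                left = words[i - 1] if i > 0 else "<s>"
                right = words[i + 1] if i + 1 < len(words) else "</s>"
                vec[(left, right)] += 1
    return vec

def dot(u, v):
    """Dot product of two sparse context vectors."""
    return sum(u[c] * v[c] for c in u if c in v)

corpus = ["cities such as Chicago , Boston ,",
          "cities such as Boston , Chicago ,"]
print(dot(context_vector(corpus, "Chicago"), context_vector(corpus, "Boston")))  # 2
```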


98

Miami: < . . . 71 25 1 513 . . . >
Twisp: < . . . 0 0 0 1 . . . >
(contexts: “when he visited X”, “he visited X and”, “visited X and other”, “X and other cities”)

Problems:
– Vectors are large
– Intersections are sparse

Baseline: context vectors (2)


99

Hidden Markov Model (HMM)

States t_i t_{i+1} t_{i+2} t_{i+3} – unobserved
Words  w_i w_{i+1} w_{i+2} w_{i+3} – observed
       cities such as Seattle

Hidden states t_i ∈ {1, …, N} (N fairly small)

Train on unlabeled data – P(t_i | w_i = w) is an N-dim. distributional summary of w
– Compare extractions using KL divergence


100

Twisp: < . . . 0 0 0 1 . . . >

P(t | Twisp): < 0.14 0.01 … 0.06 >   (t = 1, 2, …, N)

Distributional Summary P(t | w)
– Compact (efficient – 10-50x less data retrieved)
– Dense (accurate – 23-46% error reduction)

HMM Compresses Context Vectors
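The compression step, turning a word's sparse context counts into an N-dimensional summary P(t | w), might look like this on toy tag–word counts (the counts are hypothetical, not learned by an actual HMM):

```python
def distributional_summary(tag_word_counts, word):
    """P(t | w): normalize the word's column of tag-word co-occurrence
    counts into an N-dimensional summary (N = number of hidden states)."""
    col = [row[word] for row in tag_word_counts]
    total = sum(col)
    return [c / total for c in col]

# Toy counts: one dict per hidden state, keyed by word.
counts = [{"Twisp": 3, "Miami": 6}, {"Twisp": 1, "Miami": 2}]
print(distributional_summary(counts, "Twisp"))  # [0.75, 0.25]
```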


101

Is Pickerington of the same type as Chicago?

Chicago , Illinois
Pickerington , Ohio

Chicago:      < 291, 0 >
Pickerington: < 0, 1 >
(contexts: “<x> , Illinois”, “<x> , Ohio”)

=> Context vectors say no, dot product is 0!

Example


102

HMM Generalizes:

Chicago , Illinois

Pickerington , Ohio

Example


103

Task: Ranking sparse TextRunner extractions.

Metric: Area under precision-recall curve.

Language models reduce missing area by 39% over nearest competitor.

Experimental Results

            Headquartered   Merged   …   Average
Frequency   0.710           0.784        0.713
PL          0.651           0.851   …    0.785
LM          0.810           0.908        0.851


104

No hand-labeled data

Scalability
– Language models precomputed => can be queried at interactive speed

Improved accuracy over previous work

[Downey et al., ACL 2007]

REALM: Contributions


105

1) Motivation

2) Background on Web IE

3) Estimating extraction correctness

4) Challenge: The “long tail”

5) Machine learning generalization Monotonic Features

[Downey et al., 2008 (submitted)]

Outline


106

Common Structure

Task                        Hint                           Bootstrap
Web IE                      “x, C of y”                    Distributional Hypothesis
Word Sense Disambiguation   “plant and animal species”     One sense per context, one sense per discourse [Yarowsky, 1995]
Information Retrieval       search query                   Pseudo-relevance feedback [Kwok & Grunfield, 1995; Thompson & Turtle, 1995]
Document Classification     Topic word, e.g.: “politics”   Semi-supervised Learning [McCallum & Nigam, 1999; Gliozzo, 2005]


107

Common Structure

Task                        Hint                           Bootstrap
Web IE                      “x, C of y”                    Distributional Hypothesis
Word Sense Disambiguation   “plant and animal species”     One sense per context, one sense per discourse [Yarowsky, 1995]
Information Retrieval       search query                   Pseudo-relevance feedback [Kwok & Grunfield, 1995; Thompson & Turtle, 1995]
Document Classification     Topic word, e.g.: “politics”   Bag-of-words and EM [McCallum & Nigam, 1999; Gliozzo, 2005]

Classification of examples x = (x1, …, xd) into classes y ∈ {0, 1}

Identity of a monotonic feature xi such that:
P(y = 1 | xi) increases strictly monotonically with xi


108

Classical Supervised Learning

?

Learn function from x = (x1, …, xd) to y ∈ {0, 1} given labeled examples (x, y)

x1

x2


109

Semi-Supervised Learning (SSL)

Learn function from x = (x1, …, xd) to y ∈ {0, 1} given labeled examples (x, y) and unlabeled examples (x)

x1

x2


110

Monotonic Features

x1

x2

Learn function from x = (x1, …, xd) to y ∈ {0, 1} given monotonic feature x1

and unlabeled examples (x)



113

1. No labeled data, MFs given (MA)
   With noisy labels from MFs, train any classifier

2. Labeled data, no MFs given (MA-SSL)
   Detect MFs from labeled data, run MA

3. Labeled data and MFs given (MA-BOTH)
   Run MA with given & detected MFs

Exploiting MF Structure
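The MA step in case 1, using a monotonic feature to self-label the unlabeled pool before training any classifier, can be sketched as follows; the threshold rule is a hypothetical stand-in for however the actual algorithm sets it:

```python
def self_label_by_mf(examples, mf_index, threshold):
    """Noisily label examples whose monotonic feature clearly exceeds a
    threshold (likely positive) or is absent (likely negative); leave
    the ambiguous rest unlabeled for the downstream classifier."""
    labeled = []
    for x in examples:
        if x[mf_index] > threshold:
            labeled.append((x, 1))   # high MF value => likely positive
        elif x[mf_index] == 0:
            labeled.append((x, 0))   # MF absent => likely negative
    return labeled

examples = [(5, 0.2), (0, 0.9), (1, 0.4)]
print(self_label_by_mf(examples, mf_index=0, threshold=2))
# [((5, 0.2), 1), ((0, 0.9), 0)]
```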


114

20 Newsgroups dataset

Task: Given text, determine newsgroup of origin

(MFs: newsgroup name)

Without labeled data:

Experimental Results


115

MA-SSL provides a 15% error reduction for 100-400 labeled examples.

MA-BOTH provides a 31% error reduction for 0-800 labeled examples.

Experimental Results


116

Co-training
– Requires labeled examples and known views

Semi-supervised smoothness assumptions
– Cluster assumption
– Manifold assumption
– …both provably distinct from MF structure

Relationship to other approaches


117

Best known methods for IE without labeled data

Probabilities of correctness (URNS)
– Massive improvements in accuracy (15-22x)

Handling sparse data (Language models)
– Vastly more scalable than previous work
– Accuracy wins (39% error reduction)

Generalization beyond IE
– Monotonic Feature abstraction – widely applicable
– Accuracy wins in document classification

Summary of Results


118

IE => Web IE. But still need:

A coherent knowledge base
– MayorOf(Chicago, Daley) – the same “Chicago” as Starred-in(Chicago, Zeta-Jones)?
– Future work: entity resolution, schema discovery

Improved accuracy and coverage
– Currently, ignore character/document features, recursive structure, etc.
– Future work: more sophisticated language models (e.g. PCFGs)

Conclusions and Future Work


119

Thanks!

Acknowledgements:
Oren Etzioni
Mike Cafarella
Pedro Domingos
Susan Dumais
Eric Horvitz
Stef Schoenmackers
Dan Weld


120

Self-Supervised Learning

                  Input Examples        Output
Supervised        Labeled               Classifier
Semi-supervised   Labeled & Unlabeled   Classifier
Self-supervised   Unlabeled             Classifier
Unsupervised      Unlabeled             Clustering


121

Language Modeling for IE
– REALM is simple, ignores:
  – Character- or document-level features
  – Web structure
  – Recursive structure (PCFGs)

Goal: x won an Oscar for playing a villain… What is P(x)?

From facts to knowledge
– Entity resolution and inference

Future Work


122

Named Entity Location
– Lexical statistics improve state of the art [Downey et al., IJCAI 2007]

Modeling Web Search
– Characterizing user behavior [Downey et al., SIGIR 2007] (poster); [Liebling et al., 2008] (submitted)
– Predictive models [Downey et al., IJCAI 2007]

Other Work


123

Web Fact-Finding

Who has won three or more Academy Awards?


124

Web Fact-Finding

Problems:

User has to pick the right words, often a tedious process:
“world foosball champion in 1998” – 0 hits
“world foosball champion” 1998 – 2 hits, no answer

What if I could just ask for P(x) in “x was world foosball champion in 1998”?

How far can language modeling and the distributional hypothesis take us?


125

Miami:     < . . . 98 0 20 250 30 513 . . . >
Twisp:     < . . . 5 0 1 2 1 1 . . . >
Star Wars: < . . . 1 1000 0 2 1 1 . . . >
(contexts include: “X soundtrack”, “he visited X and”, “cities such as X”, “X and other cities”, “X lodging”)

KnowItAll Hypothesis

Distributional Hypothesis



127

invent in real time

TextRunner

Ranked by frequency

REALM improves precision of the top 20 extractions by an average of 90%.


128

“headquartered” Top 10:

TextRunner (ranked by frequency):
company, Palo Alto
held company, Santa Cruz
storage hardware and software, Hopkinton
Northwestern Mutual, Tacoma
1997, New York City
Google, Mountain View
PBS, Alexandria
Linux provider, Raleigh
Red Hat, Raleigh
TI, Dallas

REALM:
Tarantella, Santa Cruz
International Business Machines Corporation, Armonk
Mirapoint, Sunnyvale
ALD, Sunnyvale
PBS, Alexandria
General Dynamics, Falls Church
Jupitermedia Corporation, Darien
Allegro, Worcester
Trolltech, Oslo
Corbis, Seattle

TR Precision: 40%    REALM Precision: 100%

Improving TextRunner: Example (1)


129

“conquered” Top 10:

TextRunner (ranked by frequency):
Great, Egypt
conquistador, Mexico
Normans, England
Arabs, North Africa
Great, Persia
Romans, part
Romans, Greeks
Rome, Greece
Napoleon, Egypt
Visigoths, Suevi Kingdom

REALM:
Arabs, Rhodes
Arabs, Istanbul
Assyrians, Mesopotamia
Great, Egypt
Assyrians, Kassites
Arabs, Samarkand
Manchus, Outer Mongolia
Vandals, North Africa
Arabs, Persia
Moors, Lagos

TR Precision: 60%    REALM Precision: 90%

Improving TextRunner: Example (2)


130

Previous n-gram technique (1)

1) Form a context vector for each extracted argument:

cities such as Chicago , Boston ,
But Chicago isn’t the best
cities such as Chicago , Boston ,
Los Angeles and Chicago .

Chicago: < 1, 2, 1, … >
(contexts: “such as <x> , Boston”, “But <x> isn’t the”, “Angeles and <x> .”)

2) Compute dot products between extractions and seeds in this space [cf. Ravichandran et al. 2005].


131

Miami: < . . . 71 25 1 513 . . . >
Twisp: < . . . 0 0 0 1 . . . >
(contexts: “when he visited X”, “he visited X and”, “visited X and other”, “X and other cities”)

Problems:
– Vectors are large
– Intersections are sparse

Previous n-gram technique (2)


132

Miami: < . . . 71 25 1 513 . . . >

P(t | Miami): < 0.14 0.01 … 0.06 >   (t = 1, 2, …, N)

Latent state distribution P(t | w)
– Compact (efficient – 10-50x less data retrieved)
– Dense (accurate – 23-46% error reduction)

Compressing Context Vectors


133

Example: N-Grams on Sparse Data

Is Pickerington of the same type as Chicago?

Chicago , Illinois
Pickerington , Ohio

Chicago:      < 291, 0 >
Pickerington: < 0, 1 >
(contexts: “<x> , Illinois”, “<x> , Ohio”)

=> N-grams says no, dot product is 0!


134

HMM Generalizes:

Chicago , Illinois

Pickerington , Ohio

Example: HMM-T on Sparse Data


135

HMM-T Limitations

Learning iterations take time proportional to (corpus size × T^(k+1))

T = number of latent states

k = HMM order

We use limited values T=20, k=3
– Sufficient for typechecking (Santa Clara is a city)
– Too coarse for relation assessment (Santa Clara is where Intel is headquartered)


136

The REALM Architecture

Two steps for assessing R(arg1, arg2):

Typechecking
– Ensure arg1 and arg2 are of proper type for R, e.g., rule out MayorOf(Intel, Santa Clara)
– Leverages all occurrences of each arg

Relation Assessment
– Ensure R actually holds between arg1 and arg2, e.g., rule out MayorOf(Giuliani, Seattle)

Both steps use pre-computed language models => scales to Open IE


137

Type checking isn’t enough:
NY Mayor Giuliani toured downtown Seattle.

Want: How do arguments behave in relation to each other?

Relation Assessment


138

N-gram language model:

P(wi, wi-1, … wi-k)

arg1, arg2 often far apart => large k (inaccurate)

REL-GRAMS (1)


139

Relational Language Model (REL-GRAMS):

For any two arguments e1, e2:

P(wi, wi-1, … wi-k | wi = e1, e1 near e2)

k can be small – REL-GRAMS still captures entity relationships

Mitigate sparsity with BM25 metric (from IR)

Combine with HMM-T by multiplying ranks.

REL-GRAMS (2)
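Combining HMM-T and REL-GRAMS “by multiplying ranks” can be sketched as:

```python
def combine_by_rank_product(ranking_a, ranking_b):
    """Merge two rankings of the same extractions by the product of
    their 1-based ranks (smaller product = better)."""
    rank_a = {x: i + 1 for i, x in enumerate(ranking_a)}
    rank_b = {x: i + 1 for i, x in enumerate(ranking_b)}
    return sorted(rank_a, key=lambda x: rank_a[x] * rank_b[x])

a = ["x1", "x2", "x3"]
b = ["x2", "x1", "x3"]
print(combine_by_rank_product(a, b))  # x1 and x2 tie (product 2); x3 last
```

An extraction must rank well under both the type-checking and the relation-assessment model to rank well overall.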


140

Experiments

Task: Re-rank sparse TextRunner extractions for Conquered, Founded, Headquartered, Merged

REALM vs.
– TextRunner (TR) – frequency ordering (equivalent to PMI [Etzioni et al, 2005] and Urns [Downey et al, 2005])
– Pattern Learning (PL) – based on Snowball [Agichtein 2000]
– HMM-T and REL-GRAMS in isolation


141

Learning num(C) and num(E)

From untagged data: ill-posed problem
• num(C) can vary wildly with C, e.g., countries vs. cities vs. mayors

Assume:
1) Consistent precision of a single co-occurrence, e.g., in a randomly drawn phrase “C such as x”, x ∈ C about p of the time (0.9 for [Etzioni, 2005])
2) num(E) is constant for all C
3) num(C) is Zipf

Estimate num(C) from untagged data using EM [Downey et al. 2005] (Also: multiple contexts)


142

URNS without labeled data

[Three plots: frequency vs. frequency rank, for the observed distribution and its components]

p = P(x ∈ C) in “C such as x”: assumed ~0.9

Error distribution num(E): assumed large, with Zipf parameter 1.0


143

URNS without labeled data

[Three plots: frequency vs. frequency rank]

num(C): can vary wildly (e.g. cities vs. countries)

Learned from unlabeled data using EM


144

Distributional Similarity

Naïve Approach – find sentences containing seed1 & seed2 or arg1 & arg2:

wb … wh seed1 wh+2 … wi seed2 wi+2 … we
wb … wh arg1 wh+2 … wi arg2 wi+2 … we

Compare context distributions:
P(wb, …, we | seed1, seed2)
P(wb, …, we | arg1, arg2)

But e – b can be large
– Many parameters, sparse data => inaccuracy


145

http://www.cs.washington.edu/research/textrunner/

TextRunner Search


146

Large textual corpora are redundant, and we can use this observation to bootstrap extraction and classification models from minimally labeled, or even completely unlabeled data.

Thesis


147

Supervised classification task:
– Feature space X of d-tuples x = (x1, …, xd)
– Binary output space Y = {0, 1}

Inputs:
– Labeled examples DL = {(x, y)} ~ P(x, y)

Output: concept c: X -> {0, 1} that approximates P(y | x).

Monotonic Features


148

Semi-supervised classification task:
– Feature space X of d-tuples x = (x1, …, xd)
– Binary output space Y = {0, 1}

Inputs:
– Labeled examples DL = {(x, y)} ~ P(x, y)  (smaller)
– Unlabeled examples DU = {(x)} ~ P(x)

Output: concept c: X -> {0, 1} that approximates P(y | x).


149

Semi-supervised classification task:
– Feature space X of d-tuples x = (x1, …, xd)
– Binary output space Y = {0, 1}

Inputs:
– Labeled examples DL = {(x, y)} ~ P(x, y)  (potentially empty!)
– Unlabeled examples DU = {(x)} ~ P(x)
– Monotonic features M ⊆ {1, …, d} such that:
  P(y = 1 | xi) increases strictly monotonically with xi for all i ∈ M.

Output: concept c: X -> {0, 1} that approximates P(y | x).

Monotonic Features


150

Problem: num(C) can vary wildly, e.g. cities vs. countries

Assume:
– num(C), num(E) Zipf distributed: freq. of ith element ∝ i^(-z)
– p and num(E) independent of C

Learn num(C) from unlabeled data alone, with Expectation Maximization

URNS without labeled data



152

Typecheck each arg by comparing HMM’s distributional summaries:

f(arg) = (1/|seeds|) Σ_i KL( P(t | seed_i) || P(t | arg) )

Rank arguments in ascending order of f(arg)

HMM Type-checking
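A sketch of this ranking function, under my reading of the formula (average KL divergence from each seed's summary to the argument's summary; smaller is a better type match):

```python
import math

def kl(p, q):
    """KL divergence D(p || q) for discrete distributions (assumes q > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def typecheck_score(arg_summary, seed_summaries):
    """f(arg): average KL divergence from the seeds' distributional
    summaries to the argument's; rank arguments in ascending order."""
    return sum(kl(s, arg_summary) for s in seed_summaries) / len(seed_summaries)

seeds = [[0.7, 0.3], [0.6, 0.4]]             # toy summaries of seed cities
print(typecheck_score([0.65, 0.35], seeds))  # city-like: small score
print(typecheck_score([0.1, 0.9], seeds))    # unlike the seeds: large score
```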

Page 153: 1 Autonomous Web-scale Information Extraction Doug Downey Advisor: Oren Etzioni Department of Computer Science and Engineering Turing Center University.

153

Classical Supervised Learning

?

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given labeled examples (x, y)

x1

x2

Page 154: 1 Autonomous Web-scale Information Extraction Doug Downey Advisor: Oren Etzioni Department of Computer Science and Engineering Turing Center University.

154

Semi-supervised Learning (SSL)

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given labeled examples (x, y) and unlabeled examples (x)

x1

x2

Page 155: 1 Autonomous Web-scale Information Extraction Doug Downey Advisor: Oren Etzioni Department of Computer Science and Engineering Turing Center University.

155

Self-supervised Learning

x1

x2

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given unlabeled examples (x)

Page 156: 1 Autonomous Web-scale Information Extraction Doug Downey Advisor: Oren Etzioni Department of Computer Science and Engineering Turing Center University.

156

Self-supervised Learning

x1

x2

Learn function from x = (x1, …, xd) to y ∈ {0, 1}, given unlabeled examples (x); the system labels its own examples
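A minimal self-supervised round under this definition: the system labels its own confident examples, fits a decision boundary from those self-labels, and then classifies everything, including the examples it initially skipped. The single-feature values and confidence cutoffs are invented for illustration.

```python
# Sketch of one self-supervised round (invented 1-D data and cutoffs).
examples = [0.05, 0.1, 0.45, 0.55, 0.9, 0.95]   # one MF value per example

# Step 1: the system labels only its confident extremes.
labels = {x: (1 if x >= 0.8 else 0)
          for x in examples if x >= 0.8 or x <= 0.2}

# Step 2: fit a threshold midway between the self-labeled classes,
# then classify every example, confident or not.
threshold = (max(x for x in labels if labels[x] == 0)
             + min(x for x in labels if labels[x] == 1)) / 2
predictions = {x: int(x > threshold) for x in examples}
print(predictions)
```

No hand-labeled (x, y) pairs were used: the only supervision came from the system's own confident self-labels, which is what distinguishes this setting from SSL.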

Page 157: 1 Autonomous Web-scale Information Extraction Doug Downey Advisor: Oren Etzioni Department of Computer Science and Engineering Turing Center University.

157

Self-supervised Learning

                  Input examples        Output
Supervised        Labeled               Classifier
Semi-supervised   Labeled & unlabeled   Classifier
Self-supervised   Unlabeled             Classifier
Unsupervised      Unlabeled             Clustering

Page 158: 1 Autonomous Web-scale Information Extraction Doug Downey Advisor: Oren Etzioni Department of Computer Science and Engineering Turing Center University.

158

Supervised classification task: Feature space X of d-tuples x = (x1, …, xd)

Binary output space Y = {0, 1} Inputs

Labeled examples DL = {(x, y)} ~ P(x, y)

Output: concept c: X -> {0, 1} that approximates P(y | x).

Monotonic Features

Page 159: 1 Autonomous Web-scale Information Extraction Doug Downey Advisor: Oren Etzioni Department of Computer Science and Engineering Turing Center University.

159

Semi-supervised classification task: Feature space X of d-tuples x = (x1, …, xd)

Binary output space Y = {0, 1} Inputs

Labeled examples DL = {(x, y)} ~ P(x, y)

Unlabeled examples DU = {(x)} ~ P(x)

Output: concept c: X -> {0, 1} that approximates P(y | x).

Monotonic Features

(DL is small relative to DU)

Page 160: 1 Autonomous Web-scale Information Extraction Doug Downey Advisor: Oren Etzioni Department of Computer Science and Engineering Turing Center University.

160

Semi-supervised classification task: Feature space X of d-tuples x = (x1, …, xd)

Binary output space Y = {0, 1} Inputs

Labeled examples DL = {(x, y)} ~ P(x, y)

Unlabeled examples DU = {(x)} ~ P(x)

Monotonic features M ⊆ {1,…,d} such that:

P(y=1 | xi) increases strictly monotonically with xi for all i ∈ M.

Output: concept c: X -> {0, 1} that approximates P(y | x).

Potentially empty!

Monotonic Features