1 Discovering and Utilizing Structure in Large Unstructured Text Datasets Eugene Agichtein Math and Computer Science Department


Discovering and Utilizing Structure in Large Unstructured Text Datasets

Eugene Agichtein

Math and Computer Science Department


Information Extraction Example

Information extraction systems represent text in structured form

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…

Information Extraction System

Disease Outbreaks in The New York Times

Date       Disease Name     Location
Jan. 1995  Malaria          Ethiopia
July 1995  Mad Cow Disease  U.K.
Feb. 1995  Pneumonia        U.S.
May 1995   Ebola            Zaire


How can information extraction help?

Large Text Collection → Structured Relation

… allow precise and efficient querying
… allow returning answers instead of documents
… support powerful query constructs
… allow data integration with (structured) RDBMS
… provide input to data mining & statistical analysis


Goal: Detect, Monitor, Predict Outbreaks

Current Patient Records: Diagnosis, physician’s notes, lab results/analysis, …

911 Calls, Traffic accidents, …

Historical news, breaking news stories, wire, alerts, …

Hospital Records

IESys 4

IESys 3

IESys 2

IESys 1

Data Integration, Data Mining, Trend Analysis

Detection, Monitoring, Prediction


Challenges in Information Extraction

Portability
  Reduce effort to tune for new domains and tasks
  MUC systems: experts would take 8-12 weeks to tune

Scalability, Efficiency, Access
  Enable information extraction over large collections
  1 sec / document * 5 billion docs = 158 CPU years

Approach: learn from data (“Bootstrapping”)
  Snowball: Partially Supervised Information Extraction
  Querying Large Text Databases for Efficient Information Extraction
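The scalability figure above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the function name `cpu_years` is made up for illustration, and a 365-day year is assumed.

```python
# Back-of-the-envelope check of the "158 CPU years" estimate.
# Assumption: one year = 365 days (leap years ignored).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def cpu_years(num_docs: int, seconds_per_doc: float) -> float:
    """Total sequential processing time, expressed in CPU years."""
    return num_docs * seconds_per_doc / SECONDS_PER_YEAR


estimate = cpu_years(5_000_000_000, 1.0)
print(f"{estimate:.1f} CPU years")  # about 158.5
```

At one second per document the cost is linear in collection size, which is why the later slides look at processing only the documents likely to be useful.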


Outline

Information extraction overview

Partially supervised information extraction
  Adaptivity
  Confidence estimation

Text retrieval for scalable extraction
  Query-based information extraction
  Implicit connections/graphs in text databases

Current and future work
  Inferring and analyzing social networks
  Utility-based extraction tuning
  Multi-modal information extraction and data mining
  Authority/trust/confidence estimation


What is “Information Extraction”

As a task:

Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

IE

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..


What is “Information Extraction”

As a family of techniques:

Information Extraction = segmentation + classification + clustering + association


Segments found in the news excerpt: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation



IE in Context

[Figure: pipeline connecting a Document collection to a Database, with the stages Spider, Filter by relevance, IE (Segment, Classify, Associate, Cluster), Load DB, Query/Search, and Data mine; supporting steps: Create ontology, Label training data, Train extraction models]


Information Extraction Tasks

Extracting entities and relations
  Entities
    Named (e.g., Person)
    Generic (e.g., disease name)
  Relations
    Entities related in a predefined way (e.g., Location of a Disease outbreak)
    Discovered automatically

Common information extraction steps:
  Preprocessing: sentence chunking, parsing, morphological analysis
  Rules/extraction patterns: manual, machine learning, and hybrid
  Applying extraction patterns to extract new information

Postprocessing and complex extraction (not covered):
  Co-reference resolution
  Combining Relations into Events, Rules, …


Two kinds of IE approaches

Knowledge Engineering
  rule based
  developed by experienced language engineers
  make use of human intuition
  requires only small amount of training data
  development could be very time consuming
  some changes may be hard to accommodate

Machine Learning
  use statistics or other machine learning
  developers do not need LE expertise
  requires large amounts of annotated training data
  some changes may require re-annotation of the entire training corpus
  annotators are cheap (but you get what you pay for!)


Extracting Entities from Text

Running example: Abraham Lincoln was born in Kentucky.

Lexicons (Alabama, Alaska, …, Wisconsin, Wyoming): member?
Classify Pre-segmented Candidates: Classifier, which class?
Sliding Window: Classifier, which class? Try alternate window sizes.
Boundary Models: BEGIN / END
Finite State Machines: Most likely state sequence?
Context Free Grammars: Most likely parse?
…and beyond

Any of these models can be used to capture words, formatting or both.


Hidden Markov Models

Finite state model / Graphical model

[Figure: chain of states ... S_{t-1}, S_t, S_{t+1} ... linked by transitions, each state emitting an observation O_{t-1}, O_t, O_{t+1}]

Generates:
  State sequence
  Observation sequence: o1 o2 o3 o4 o5 o6 o7 o8

$$P(\vec{s}, \vec{o}) = \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)$$

(for t = 1, P(s_1 | s_0) is read as the start state probability)

Parameters: for all states S = {s1, s2, …}
  Start state probabilities: P(s_t)
  Transition probabilities: P(s_t | s_{t-1})
  Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over atomic, fixed alphabet

Training: Maximize probability of training observations (w/ prior)


IE with Hidden Markov Models

Given a sequence of observations:

  Yesterday Lawrence Saul spoke this example sentence.

and a trained HMM:

Find the most likely state sequence: (Viterbi)

$$\arg\max_{\vec{s}} P(\vec{s}, \vec{o})$$

Any words said to be generated by the designated “person name” state extract as a person name:

  Yesterday Lawrence Saul spoke this example sentence.

  Person name: Lawrence Saul
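The decoding step can be sketched with a small Viterbi implementation. Everything numeric here is invented for illustration: the two states (`other`, `person`), the start, transition, and emission probabilities, and the `floor` value used for unseen words are assumptions, not parameters of any trained model from the talk.

```python
import math


def viterbi(obs, states, start, trans, emit, floor=1e-6):
    """Return the most likely state sequence for obs (log-space Viterbi)."""
    def log_emit(state, word):
        # Unseen words get a small floor probability (an assumption of this sketch).
        return math.log(emit[state].get(word, floor))

    best = [{s: math.log(start[s]) + log_emit(s, obs[0]) for s in states}]
    back = []
    for word in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + math.log(trans[r][s]))
            scores[s] = best[-1][prev] + math.log(trans[prev][s]) + log_emit(s, word)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Follow back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy model: all probabilities below are made up for this example.
states = ["other", "person"]
start = {"other": 0.8, "person": 0.2}
trans = {"other": {"other": 0.8, "person": 0.2},
         "person": {"other": 0.4, "person": 0.6}}
emit = {"other": {"Yesterday": 0.2, "spoke": 0.2, "this": 0.2,
                  "example": 0.2, "sentence": 0.2},
        "person": {"Lawrence": 0.4, "Saul": 0.4}}

words = "Yesterday Lawrence Saul spoke this example sentence".split()
path = viterbi(words, states, start, trans, emit)
person_name = [w for w, s in zip(words, path) if s == "person"]
print("Person name:", " ".join(person_name))
```

Words assigned to the `person` state are read off as the extracted name, mirroring the slide's example.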


HMM Example: “Nymble”

[Bikel, et al 1998], [BBN “IdentiFinder”]

Task: Named Entity Extraction

[Figure: HMM with states start-of-sentence, Person, Org, (five other name classes), Other, end-of-sentence]

Transition probabilities: P(s_t | s_{t-1}, o_{t-1})
  Back-off to: P(s_t | s_{t-1}), then P(s_t)

Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1})
  Back-off to: P(o_t | s_t), then P(o_t)

Train on 450k words of news wire text.

Results:

Case   Language  F1
Mixed  English   93%
Upper  English   91%
Mixed  Spanish   90%

Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99]


Relation Extraction

Extract structured relations from text

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…

Information Extraction System

Disease Outbreaks in The New York Times

Date       Disease Name     Location
Jan. 1995  Malaria          Ethiopia
July 1995  Mad Cow Disease  U.K.
Feb. 1995  Pneumonia        U.S.
May 1995   Ebola            Zaire


Relation Extraction

Typically requires Entity Tagging as preprocessing

Knowledge Engineering
  Rules defined over lexical items: “<company> located in <location>”
  Rules defined over parsed text: “((Obj <company>) (Verb located) (*) (Subj <location>))”
  Proteus, GATE, …

Machine Learning-based
  Learn rules/patterns from examples: Dan Roth 2005, Cardie 2006, Mooney 2005, …
  Partially-supervised: bootstrap from “seed” examples: Agichtein & Gravano 2000, Etzioni et al., 2004, …

Recently, hybrid models [Feldman2004, 2006]
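A hand-written lexical rule such as “<company> located in <location>” can be approximated with a regular expression over entity-tagged text. This is a sketch only: the inline `<ORG>…</ORG>` and `<LOC>…</LOC>` markup and the function name `extract_located_in` are assumptions for the example, not the rule syntax of Proteus or GATE.

```python
import re

# Assumed input format: an entity tagger has already wrapped entities
# in <ORG>...</ORG> and <LOC>...</LOC> tags.
LOCATED_IN = re.compile(
    r"<ORG>(?P<company>[^<]+)</ORG>\s+(?:is\s+)?located\s+in\s+"
    r"<LOC>(?P<location>[^<]+)</LOC>"
)


def extract_located_in(tagged_text):
    """Return (company, location) pairs matched by the lexical rule."""
    return [(m.group("company"), m.group("location"))
            for m in LOCATED_IN.finditer(tagged_text)]


sample = ("<ORG>Microsoft</ORG> is located in <LOC>Redmond</LOC>, while "
          "<ORG>IBM</ORG> located in <LOC>Armonk</LOC> announced a new line.")
pairs = extract_located_in(sample)
print(pairs)
```

A rule like this is precise but brittle: a phrasing such as "Redmond-based Microsoft" is missed, which is the motivation for learning patterns from examples.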


Comparison of Approaches

Use “language-engineering” environments to help experts create extraction patterns (significant effort)
  GATE [2002], Proteus [1998]

Train system over manually labeled data (substantial effort)
  Soderland et al. [1997], Muslea et al. [2000], Riloff et al. [1996]

Exploit large amounts of unlabeled data (minimal effort)
  DIPRE [Brin 1998], Snowball [Agichtein & Gravano 2000]
  Etzioni et al. (’04): KnowItAll: extracting unary relations
  Yangarber et al. (’00, ’02): Pattern refinement, generalized names detection


The Snowball System: Overview

Text Database → Snowball → extracted relation:

Organization         Location       Conf
Microsoft            Redmond        1
IBM                  Armonk         1
Intel                Santa Clara    1
AG Edwards           St Louis       0.9
Air Canada           Montreal       0.8
7th Level            Richardson     0.8
3Com Corp            Santa Clara    0.8
3DO                  Redwood City   0.7
3M                   Minneapolis    0.7
MacWorld             San Francisco  0.7
157th Street         Manhattan      0.52
15th Party Congress  China          0.3
15th Century Europe  Dark Ages      0.1
...                  ...            ...


Snowball: Getting User Input

ACM DL 2000

User input:
• a handful of example instances
• integrity constraints on the relation, e.g., Organization is a “key”, Age > 0, etc…

Organization  Headquarters
Microsoft     Redmond
IBM           Armonk
Intel         Santa Clara

[Figure: processing loop: Get Examples, Find Example Occurrences in Text, Tag Entities, Generate Extraction Patterns, Extract Tuples, Evaluate Tuples]


Snowball: Finding Example Occurrences

Can use any full-text search engine

[Figure: the seed tuples below are sent to a Search Engine over the Text Database]

Organization  Headquarters
Microsoft     Redmond
IBM           Armonk
Intel         Santa Clara

Computer servers at Microsoft’s headquarters in Redmond…

In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp

The Armonk-based IBM introduced a new line…

Change of guard at IBM Corporation’s headquarters near Armonk, NY ...


Snowball: Tagging Entities

Named entity taggers can recognize Dates, People, Locations, Organizations, …
MITRE’s Alembic, IBM’s Talent, LingPipe, …

Computer servers at Microsoft’s headquarters in Redmond…

In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp

The Armonk-based IBM introduced a new line…

Change of guard at IBM Corporation’s headquarters near Armonk, NY ...


Snowball: Extraction Patterns

General extraction pattern model:
  acceptor0, Entity, acceptor1, Entity, acceptor2

Acceptor instantiations:
  String Match (accepts string “’s headquarters in”)
  Vector-Space (~ vector [(’s, 0.5), (headquarters, 0.5), (in, 0.5)])
  Sequence Classifier (Prob(T=valid | ’s, headquarters, in))
    HMMs, Sparse sequences, Conditional Random Fields, …

Example: Computer servers at Microsoft’s headquarters in Redmond…


Snowball: Generating Patterns

1. Represent occurrences as vectors of tags and terms

ORGANIZATION {<'s 0.57>, <headquarters 0.57>, <in 0.57>} LOCATION
LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
ORGANIZATION {<'s 0.57>, <headquarters 0.57>, <near 0.57>} LOCATION
LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION

2. Cluster similar occurrences.


3. Create patterns as filtered cluster centroids

ORGANIZATION {<'s 0.71>, <headquarters 0.71>} LOCATION
LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
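The three pattern-generation steps can be sketched as follows. This is an illustrative reconstruction rather than Snowball's actual code: the single-pass clustering, the similarity `threshold` of 0.6, and the `min_weight` filter of 0.5 are assumptions chosen so that the toy occurrences from the slides reproduce the 0.71 centroid weights.

```python
import math
from collections import defaultdict


def unit(vec):
    """Scale a term -> weight dict to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two unit-length term vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def cluster(occurrences, threshold=0.6):
    """Single-pass clustering: join the first cluster with the same tag
    order whose first member is similar enough, else start a new one."""
    clusters = []
    for occ in occurrences:
        for members in clusters:
            first = members[0]
            if (first["order"] == occ["order"]
                    and cosine(first["middle"], occ["middle"]) >= threshold):
                members.append(occ)
                break
        else:
            clusters.append([occ])
    return clusters


def centroid_pattern(members, min_weight=0.5):
    """Average the member vectors, drop weak terms, renormalize."""
    mean = defaultdict(float)
    for occ in members:
        for term, w in occ["middle"].items():
            mean[term] += w / len(members)
    kept = {t: w for t, w in mean.items() if w >= min_weight}
    return {"order": members[0]["order"], "middle": unit(kept)}


ORG_LOC = ("ORGANIZATION", "LOCATION")
LOC_ORG = ("LOCATION", "ORGANIZATION")
occurrences = [
    {"order": ORG_LOC, "middle": unit({"'s": 1, "headquarters": 1, "in": 1})},
    {"order": LOC_ORG, "middle": unit({"-": 1, "based": 1})},
    {"order": ORG_LOC, "middle": unit({"'s": 1, "headquarters": 1, "near": 1})},
    {"order": LOC_ORG, "middle": unit({"-": 1, "based": 1})},
]
patterns = [centroid_pattern(c) for c in cluster(occurrences)]
for p in patterns:
    print(p["order"], {t: round(w, 2) for t, w in p["middle"].items()})
```

Filtering the centroid drops the terms that only one occurrence contributed ("in", "near"), leaving the shared context "'s headquarters" as the pattern.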


Vector Space Clustering


Snowball: Extracting New Tuples

Match tagged text fragments against patterns

Text: Google's new headquarters in Mountain View are …

Tagged fragment V: ORGANIZATION {<'s 0.5>, <new 0.5>, <headquarters 0.5>, <in 0.5>} LOCATION {<are 1>}

P1: ORGANIZATION {<'s 0.71>, <headquarters 0.71>} LOCATION   Match=0.8
P2: ORGANIZATION {<located 0.71>, <in 0.71>} LOCATION        Match=0.4
P3: LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION           Match=0
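One way to picture the matching step is a cosine similarity between the middle context of a tagged fragment and that of each pattern, with a score of zero when the entity tags appear in a different order. The sketch below makes that simplifying assumption (only the middle context is compared), so its scores are close to, but not the same as, the illustrative Match values on the slide.

```python
import math


def unit(vec):
    """Scale a term -> weight dict to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def match(fragment, pattern):
    """Cosine of the middle contexts, or 0 if the tag order differs.
    Comparing only the middle context is an assumption of this sketch."""
    if fragment["order"] != pattern["order"]:
        return 0.0
    return sum(w * pattern["middle"].get(t, 0.0)
               for t, w in fragment["middle"].items())


ORG_LOC = ("ORGANIZATION", "LOCATION")
LOC_ORG = ("LOCATION", "ORGANIZATION")

# "Google's new headquarters in Mountain View are ..."
fragment = {"order": ORG_LOC,
            "middle": unit({"'s": 1, "new": 1, "headquarters": 1, "in": 1})}

patterns = {
    "P1": {"order": ORG_LOC, "middle": unit({"'s": 1, "headquarters": 1})},
    "P2": {"order": ORG_LOC, "middle": unit({"located": 1, "in": 1})},
    "P3": {"order": LOC_ORG, "middle": unit({"-": 1, "based": 1})},
}
scores = {name: match(fragment, p) for name, p in patterns.items()}
for name, score in scores.items():
    print(name, round(score, 2))
```

A tuple such as <Google, Mountain View> would then be emitted for every pattern whose score clears a similarity threshold.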


Snowball: Evaluating Patterns

Automatically estimate pattern confidence:
Conf(P4) = Positive / Total = 2/3 = 0.66

Current seed tuples:

Organization  Headquarters
IBM           Armonk
Intel         Santa Clara
Microsoft     Redmond

P4: ORGANIZATION {<, 1>} LOCATION

IBM, Armonk, reported…                                         Positive
Intel, Santa Clara, introduced...                               Positive
“Bet on Microsoft”, New York-based analyst Jane Smith said...  Negative
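The pattern-confidence estimate can be sketched by checking each tuple a pattern extracts against the current seed tuples, treating Organization as a key. The function name `pattern_confidence` and the decision to ignore organizations missing from the seed set are assumptions of this sketch.

```python
def pattern_confidence(extracted, seeds):
    """Positive / Total, where a tuple is positive if it agrees with the
    seed table and negative if it contradicts it (Organization is a key).
    Organizations not in the seed table give no evidence either way."""
    positive = negative = 0
    for org, loc in extracted:
        if org not in seeds:
            continue
        if seeds[org] == loc:
            positive += 1
        else:
            negative += 1
    total = positive + negative
    return positive / total if total else 0.0


seeds = {"IBM": "Armonk", "Intel": "Santa Clara", "Microsoft": "Redmond"}

# Tuples produced by pattern P4 (ORGANIZATION, LOCATION) on the three sentences.
extracted_by_p4 = [("IBM", "Armonk"),
                   ("Intel", "Santa Clara"),
                   ("Microsoft", "New York")]

conf_p4 = pattern_confidence(extracted_by_p4, seeds)
print(f"Conf(P4) = {conf_p4:.2f}")
```

<Microsoft, New York> counts against P4 because the seed table already places Microsoft's headquarters in Redmond.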


Snowball: Evaluating Tuples

Automatically evaluate tuple confidence:

$$\mathrm{Conf}(T) = 1 - \prod_{i} \bigl(1 - \mathrm{Conf}(P_i) \cdot \mathrm{Match}(P_i)\bigr)$$

A tuple has high confidence if generated by high-confidence patterns.

Example: T = <3COM, Santa Clara>
  P4: ORGANIZATION {<, 1>} LOCATION, Conf(P4) = 0.66, Match = 0.4
  P3: LOCATION {<- 0.75>, <based 0.75>} ORGANIZATION, Conf(P3) = 0.95, Match = 0.8
  Conf(T): 0.83
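The tuple-confidence formula is a noisy-or over the patterns that produced the tuple, and can be evaluated directly. In this sketch the pairing of the slide's numbers (Match 0.4 with P4, Match 0.8 with P3) is an assumption; with that pairing the result is about 0.82, close to the 0.83 shown on the slide.

```python
def tuple_confidence(supports):
    """Conf(T) = 1 - prod_i (1 - Conf(P_i) * Match(P_i)).

    supports: iterable of (pattern_confidence, match_score) pairs, one per
    pattern that extracted the tuple.
    """
    miss = 1.0
    for conf_p, match_score in supports:
        miss *= 1.0 - conf_p * match_score
    return 1.0 - miss


# <3COM, Santa Clara>, extracted by P4 and P3 (pairing of values assumed).
conf_t = tuple_confidence([(0.66, 0.4), (0.95, 0.8)])
print(f"Conf(T) = {conf_t:.3f}")
```

Each additional supporting pattern can only raise the confidence, and one confident, well-matching pattern is enough to push it high.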


Snowball: Evaluating Tuples

Keep only high-confidence tuples for next iteration

Organization         Headquarters   Conf
Microsoft            Redmond        1
IBM                  Armonk         1
Intel                Santa Clara    1
AG Edwards           St Louis       0.9
Air Canada           Montreal       0.8
7th Level            Richardson     0.8
3Com Corp            Santa Clara    0.8
3DO                  Redwood City   0.7
3M                   Minneapolis    0.7
MacWorld             San Francisco  0.7
157th Street         Manhattan      0.52
15th Party Congress  China          0.3
15th Century Europe  Dark Ages      0.1
...                  ...            ...


Start new iteration with expanded example set

Organization  Headquarters   Conf
Microsoft     Redmond        1
IBM           Armonk         1
Intel         Santa Clara    1
AG Edwards    St Louis       0.9
Air Canada    Montreal       0.8
7th Level     Richardson     0.8
3Com Corp     Santa Clara    0.8
3DO           Redwood City   0.7
3M            Minneapolis    0.7
MacWorld      San Francisco  0.7

Iterate until no new tuples are extracted


Pattern-Tuple Duality

A “good” tuple:
  Extracted by “good” patterns
  Tuple weight: goodness

A “good” pattern:
  Generated by “good” tuples
  Extracts “good” new tuples
  Pattern weight: goodness

Edge weight:
  Match/Similarity of tuple context to pattern


How to Set Node Weights

Constraint violation (from before)
  Conf(P) = Log(Pos) * Pos/(Pos+Neg)
  $$\mathrm{Conf}(T) = 1 - \prod_{i} \bigl(1 - \mathrm{Conf}(P_i) \cdot \mathrm{Match}(P_i)\bigr)$$

HITS [Hassan et al., EMNLP 2006]
  Conf(P) = ∑ Conf(T)
  Conf(T) = ∑ Conf(P)

URNS [Downey et al., IJCAI 2005]

EM-Spy [Agichtein, SDM 2006]
  Unknown tuples = Neg
  Compute Conf(P), Conf(T)
  Iterate
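The HITS-style weighting can be illustrated on a small bipartite graph of patterns and tuples. This is a generic mutual-reinforcement sketch, not the exact procedure of the cited paper: the unweighted edges, the normalization by the maximum score, and the fixed iteration count are all assumptions.

```python
def hits_confidence(edges, iterations=20):
    """Mutual reinforcement on a pattern/tuple bipartite graph.

    edges: list of (pattern, tuple) pairs meaning "pattern extracted tuple".
    Each round sets Conf(T) = sum of Conf(P) over its patterns and
    Conf(P) = sum of Conf(T) over its tuples, then rescales so the
    largest score on each side is 1.
    """
    patterns = {p for p, _ in edges}
    tuples_ = {t for _, t in edges}
    conf_p = dict.fromkeys(patterns, 1.0)
    conf_t = dict.fromkeys(tuples_, 1.0)
    for _ in range(iterations):
        conf_t = {t: sum(conf_p[p] for p, t2 in edges if t2 == t)
                  for t in tuples_}
        conf_p = {p: sum(conf_t[t] for p2, t in edges if p2 == p)
                  for p in patterns}
        max_t, max_p = max(conf_t.values()), max(conf_p.values())
        conf_t = {t: v / max_t for t, v in conf_t.items()}
        conf_p = {p: v / max_p for p, v in conf_p.items()}
    return conf_p, conf_t


# Hypothetical graph: T2 is extracted by two patterns, T3 by an isolated one.
edges = [("P1", "T1"), ("P1", "T2"), ("P2", "T2"), ("P3", "T3")]
conf_p, conf_t = hits_confidence(edges)
for name in sorted(conf_t):
    print(name, round(conf_t[name], 3))
```

Tuples supported by several well-connected patterns rise to the top, while a tuple backed only by an isolated pattern decays toward zero.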


Evaluating Patterns and Tuples: Expectation Maximization

EM-Spy Algorithm
  “Hide” labels for some seed tuples
  Iterate EM algorithm to convergence on tuple/pattern confidence values
  Set threshold t such that (t > 90% of spy tuples)
  Re-initialize Snowball using new seed tuples

Organization         Headquarters   Initial  Final
Microsoft            Redmond        1        1
IBM                  Armonk         1        0.8
Intel                Santa Clara    1        0.9
AG Edwards           St Louis       0        0.9
Air Canada           Montreal       0        0.8
7th Level            Richardson     0        0.8
3Com Corp            Santa Clara    0        0.8
3DO                  Redwood City   0        0.7
3M                   Minneapolis    0        0.7
MacWorld             San Francisco  0        0.7
157th Street         Manhattan      0        0.52
15th Party Congress  China          0        0.3
15th Century Europe  Dark Ages      0        0.1
…


Adapting Snowball for New Relations

Large parameter space
  Initial seed tuples (randomly chosen, multiple runs)
  Acceptor features: words, stems, n-grams, phrases, punctuation, POS
  Feature selection techniques: OR, NB, Freq, “support”, combinations
  Feature weights: TF*IDF, TF, TF*NB, NB
  Pattern evaluation strategies: NN, Constraint violation, EM, EM-Spy

Automatically estimate parameter values:
  Estimate operating parameters based on occurrences of seed tuples
  Run cross-validation on hold-out sets of seed tuples for optimal perf.
  Seed occurrences that do not have close “neighbors” are discarded


Example Task: DiseaseOutbreaks

Proteus: 0.409
Snowball: 0.415

SDM 2006


Snowball Used in Various Domains

News: NYT, WSJ, AP [DL’00, SDM’06]
  CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks

Medical literature: PDR, Micromedex… [Thesis]
  AdverseEffects, DrugInteractions, RecommendedTreatments

Biological literature: GeneWays corpus [ISMB’03]
  Gene and Protein Synonyms



Extracting A Relation From a Large Text Database

Text Database → Information Extraction System → Structured Relation

Brute force approach: feed all docs to information extraction system (expensive for large collections)

  Often only a tiny fraction of documents is useful
  Many databases are not crawlable
  Often a search interface is available, with existing keyword index
  How to identify “useful” documents?


An Abstract View of Text-Centric Tasks

[Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]

Text Database → Extraction System → Output tuples

1. Retrieve documents from database
2. Process documents
3. Extract output tuples

Task                    Tuple
Information Extraction  Relation Tuple
Database Selection      Word (+Frequency)
Focused Crawling        Web Page about a Topic


Executing a Text-Centric Task

Text Database → Extraction System → Output tuples

1. Retrieve documents from database
2. Process documents
3. Extract output tuples

Similar to relational world
  Two major execution paradigms
    Scan-based: Retrieve and process documents sequentially
    Index-based: Query database (e.g., [case fatality rate]), retrieve and process documents in results

Unlike the relational world
  Indexes are only “approximate”: index is on keywords, not on tuples of interest
  Choice of execution plan affects output completeness (not only speed)

→ underlying data distribution dictates what is best


Scan

Text Database → Extraction System → Output tuples

1. Retrieve docs from database
2. Process documents
3. Extract output tuples

Scan retrieves and processes documents sequentially (until reaching target recall)

Execution time = |Retrieved Docs| · (R + P)
  R: time for retrieving a document
  P: time for processing a document

Question: How many documents does Scan retrieve to reach target recall?

Filtered Scan uses a classifier to identify and process only promising documents (details in paper)
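The Scan cost model is simple enough to evaluate directly. The sketch below uses made-up values for the per-document retrieval time R and processing time P; only the formula itself comes from the slide.

```python
def scan_time(retrieved_docs: int, r: float, p: float) -> float:
    """Execution time = |Retrieved Docs| * (R + P), in seconds."""
    return retrieved_docs * (r + p)


# Hypothetical costs: 0.5 s to retrieve and 1.5 s to process each document.
total_seconds = scan_time(retrieved_docs=100_000, r=0.5, p=1.5)
print(f"{total_seconds / 3600:.1f} hours")
```

Because every retrieved document is also processed, the only lever Scan has is how many documents it must read before reaching the target recall.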


Iterative Query Expansion

[Diagram: Text Database → Extraction System → Output tuples → Query Generation → Text Database]

1. Query database with seed tuples (e.g., [Ebola AND Zaire])

2. Process retrieved documents

3. Extract tuples from docs

4. Augment seed tuples with new tuples (e.g., <Malaria, Ethiopia>)

Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q

where R is the time for retrieving a document, P is the time for processing a document, and Q is the time for answering a query.

Question: How many queries and how many documents does Iterative Set Expansion need to reach target recall?
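
The execution-time formulas for Scan and for Iterative Set Expansion can be compared directly once the document and query counts are known. The following is a minimal sketch, assuming hypothetical values for R, P, and Q and for the number of documents and queries; none of these numbers come from the talk, and the function names are illustrative only.

```python
def scan_time(retrieved_docs: int, r: float, p: float) -> float:
    """Execution time = |Retrieved Docs| * (R + P)."""
    return retrieved_docs * (r + p)


def iterative_set_expansion_time(retrieved_docs: int, queries: int,
                                 r: float, p: float, q: float) -> float:
    """Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q."""
    return retrieved_docs * (r + p) + queries * q


# Hypothetical per-item costs in seconds (assumptions, not measurements).
R, P, Q = 0.5, 2.0, 1.0

# Hypothetical scenario: Scan reads 1,000 documents, while the query-based
# plan issues 20 queries and retrieves 100 documents.
scan_cost = scan_time(1000, R, P)
ise_cost = iterative_set_expansion_time(100, 20, R, P, Q)
print(scan_cost, ise_cost)
```

Which plan is cheaper depends entirely on how many documents and queries each plan needs to reach the target recall, which is the question the slides pose.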

48

QXtract: Querying Text Databases for Robust Scalable Information EXtraction

[Diagram: User-Provided Seed Tuples → Query Generation → Queries → Search Engine (over Text Database) → Promising Documents → Information Extraction System → Extracted Relation]

User-Provided Seed Tuples

DiseaseName      Location  Date
Malaria          Ethiopia  Jan. 1995
Ebola            Zaire     May 1995

Extracted Relation

DiseaseName      Location  Date
Malaria          Ethiopia  Jan. 1995
Ebola            Zaire     May 1995
Mad Cow Disease  The U.K.  July 1995
Pneumonia        The U.S.  Feb. 1995

Problem: Learn keyword queries to retrieve “promising” documents

49

Learning Queries to Retrieve Promising Documents

1. Get document sample with “likely negative” and “likely positive” examples.

2. Label sample documents using information extraction system as “oracle.”

3. Train classifiers to “recognize” useful documents.

4. Generate queries from classifier model/rules.

[Diagram: User-Provided Seed Tuples → Seed Sampling (Search Engine over Text Database) → unlabeled documents (?) → Information Extraction System labels documents as + or - → Classifier Training → Query Generation → Queries]

50

Training Classifiers to Recognize “Useful” Documents

Document features: words

Doc  Words                                      Label
D1   disease reported epidemic expected area    +
D2   virus reported expected infected patients  +
D3   products made used exported far            -
D4   past old homerun sponsored event           -

Ripper: disease AND reported => USEFUL

SVM: virus 3, infected 2, sponsored -1

Okapi (IR): [term cloud: disease, infected, reported, virus, epidemic, products, used, far, exported]

51

Generating Queries from Classifiers

Ripper: disease AND reported => USEFUL

SVM: virus 3, infected 2, sponsored -1

Okapi (IR): [term cloud: disease, infected, reported, virus, epidemic, products, used, far, exported]

Generated queries: [disease AND reported], [epidemic], [virus], [infected]

QCombined: [epidemic], [virus], [disease AND reported]
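
One plausible way to turn the classifier outputs shown on this slide into keyword queries is sketched below. It is an illustrative assumption, not QXtract's actual procedure: the helper names `queries_from_weights` and `query_from_rule` and the `top_k` cutoff are hypothetical, and only the term weights and the rule are taken from the slide.

```python
def queries_from_weights(weights: dict[str, float], top_k: int = 2) -> list[str]:
    """Turn the top_k highest positively weighted terms into one-word queries."""
    positive = [(w, term) for term, w in weights.items() if w > 0]
    # Sort by descending weight, breaking ties alphabetically.
    positive.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in positive[:top_k]]


def query_from_rule(antecedent_terms: list[str]) -> str:
    """Turn a rule antecedent such as 'disease AND reported' into a conjunctive query."""
    return " AND ".join(antecedent_terms)


# Values copied from the slide: SVM term weights and the Ripper rule antecedent.
svm_weights = {"virus": 3, "infected": 2, "sponsored": -1}
rule_terms = ["disease", "reported"]

queries = [query_from_rule(rule_terms)] + queries_from_weights(svm_weights)
print(queries)  # ['disease AND reported', 'virus', 'infected']
```

Negatively weighted terms such as "sponsored" are dropped because a query built from them would retrieve documents the classifier considers not useful.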

52

SIGMOD 2003 Demonstration

53

An Even Simpler Querying Strategy: “Tuples”

1. Convert given tuples into queries

2. Retrieve matching documents

3. Extract new tuples from documents and iterate

Seed tuple:

DiseaseName        Location  Date
Ebola              Zaire     May 1995

Query: “Ebola” and “Zaire” → Search Engine → Information Extraction System

New tuples:

DiseaseName        Location  Date
Malaria            Ethiopia  Jan. 1995
hemorrhagic fever  Africa    May 1995
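
The three steps of the Tuples strategy can be sketched as a simple worklist loop. This is a minimal sketch under stated assumptions: `DOCS` is a hypothetical in-memory stand-in for the text database, `search` stands in for the search engine (it matches on tuple membership rather than keywords), and `extract` stands in for the information extraction system.

```python
from collections import deque

# Hypothetical toy database: document id -> tuples an extraction system would find in it.
DOCS = {
    "d1": {("Ebola", "Zaire"), ("hemorrhagic fever", "Africa")},
    "d2": {("hemorrhagic fever", "Africa"), ("Malaria", "Ethiopia")},
    "d3": {("Cholera", "Sudan")},
}


def search(tup, max_results=10):
    """Stand-in for the search engine: ids of documents mentioning the tuple."""
    hits = sorted(d for d, tuples in DOCS.items() if tup in tuples)
    return hits[:max_results]


def extract(doc_id):
    """Stand-in for the information extraction system."""
    return DOCS[doc_id]


def tuples_strategy(seeds, max_queries=100):
    found = set(seeds)
    retrieved = set()
    frontier = deque(seeds)
    queries = 0
    while frontier and queries < max_queries:
        tup = frontier.popleft()             # 1. convert a tuple into a query
        queries += 1
        for doc_id in search(tup):           # 2. retrieve matching documents
            if doc_id in retrieved:
                continue
            retrieved.add(doc_id)
            for new_tup in extract(doc_id):  # 3. extract new tuples and iterate
                if new_tup not in found:
                    found.add(new_tup)
                    frontier.append(new_tup)
    return found, retrieved, queries


found, retrieved, queries = tuples_strategy([("Ebola", "Zaire")])
print(sorted(found), sorted(retrieved), queries)
```

In this toy database the tuple ("Cholera", "Sudan") is never found, because no document links it to the seed; the strategy can only reach tuples connected to its seeds.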

54

Comparison of Document Access Methods

[Figure: recall (%) versus MaxFractionRetrieved (5%, 10%, 25%) for QXtract, Manual, Tuples, and Baseline]

QXtract: 60% of the relation extracted from 10% of the documents in a database of 135,000 newspaper articles

Tuples strategy: Recall at most 46%

55

Predicting Recall of Tuples Strategy

Can we predict if Tuples will succeed?

[Figure: two querying runs, each starting from a seed tuple; one ends in SUCCESS, the other in FAILURE]

WebDB 2003

56

Using Querying Graph for Analysis

We need to compute the:
- Number of documents retrieved after sending Q tuples as queries (estimates time)
- Number of tuples that appear in the retrieved documents (estimates recall)

To estimate these we need to compute the:
- Degree distribution of the tuples discovered by retrieving documents
- Degree distribution of the documents retrieved by the tuples

(Not the same as the degree distribution of a randomly chosen tuple or document: it is easier to discover documents and tuples with high degrees.)

[Figure: bipartite querying graph linking tuples t1 to t5 with documents d1 to d5]

Example tuples: <SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>

57

Information Reachability Graph

t1 retrieves document d1 that contains t2

t2, t3, and t4 are “reachable” from t1

[Figure: querying graph (tuples t1 to t5, documents d1 to d5) next to the reachability graph over t1 to t5 derived from it]
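
The reachability graph can be derived from the querying graph by adding an edge from tuple t to tuple u whenever t retrieves a document that contains u. The sketch below assumes a hypothetical edge set, since the exact edges of the slide's figure are not recoverable from the text; it is chosen so that t2, t3, and t4 are reachable from t1 while t5 is not.

```python
# Hypothetical querying graph (assumed for illustration).
# retrieves: tuple -> documents returned when the tuple is used as a query.
retrieves = {
    "t1": ["d1"],
    "t2": ["d2", "d3"],
    "t3": [],
    "t4": ["d4"],
    "t5": ["d5"],
}
# contains: document -> tuples the extraction system finds in it.
contains = {
    "d1": ["t1", "t2"],
    "d2": ["t2", "t3"],
    "d3": ["t2", "t4"],
    "d4": ["t4"],
    "d5": ["t5"],
}


def build_reachability_graph(retrieves, contains):
    """Edge t -> u if t retrieves some document that contains u (u != t)."""
    edges = {t: set() for t in retrieves}
    for t, docs in retrieves.items():
        for d in docs:
            for u in contains.get(d, ()):
                if u != t:
                    edges[t].add(u)
    return edges


def reachable_from(edges, start):
    """Tuples reachable from start by following one or more edges."""
    seen, stack = set(), [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                stack.append(nxt)
    return seen


edges = build_reachability_graph(retrieves, contains)
print(sorted(reachable_from(edges, "t1")))  # ['t2', 't3', 't4']
```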

58

Connected Components

[Figure: reachability graph over t1 to t4 decomposed into In, Core, and Out]

- In: Tuples that retrieve other tuples but are not reachable
- Core (strongly connected): Tuples that retrieve other tuples and themselves
- Out: Reachable tuples, do not retrieve tuples in Core

59

Sizes of Connected Components

[Figure: several In / Core (strongly connected) / Out components of different sizes, with a seed tuple t0]

How many tuples are in largest Core + Out?

Conjecture: Degree distribution in reachability graphs follows “power-law.”

Then, reachability graph has at most one giant component.

Define Reachability as Fraction of tuples in largest Core + Out
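
Given a reachability graph, the Core, In, and Out sets and the resulting reachability value can be computed with plain graph searches. The following is a minimal sketch, assuming a small hypothetical graph; it uses a simple quadratic method that suits small samples only, and the function names are illustrative.

```python
def reach(edges, start):
    """All nodes reachable from start, including start itself."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def bow_tie(edges):
    """Return (nodes, core, in_part, out_part) for the largest strongly connected core."""
    nodes = set(edges) | {v for targets in edges.values() for v in targets}
    reverse = {n: set() for n in nodes}
    for u, targets in edges.items():
        for v in targets:
            reverse[v].add(u)
    forward = {n: reach(edges, n) for n in nodes}
    # Strongly connected component of n: nodes it reaches that also reach it.
    components = [frozenset(m for m in forward[n] if n in forward[m]) for n in nodes]
    core = max(components, key=len)
    anchor = next(iter(core))
    out_part = forward[anchor] - core        # reachable from Core, not in it
    in_part = reach(reverse, anchor) - core  # reaches Core, not in it
    return nodes, core, in_part, out_part


def reachability(edges):
    """Fraction of tuples in the largest Core + Out."""
    nodes, core, _, out_part = bow_tie(edges)
    return (len(core) + len(out_part)) / len(nodes)


# Hypothetical reachability graph (assumed for illustration).
edges = {"t0": {"t1"}, "t1": {"t2"}, "t2": {"t1", "t3"}, "t3": set(), "t4": set()}
nodes, core, in_part, out_part = bow_tie(edges)
print(sorted(core), sorted(in_part), sorted(out_part), reachability(edges))
```

Here t1 and t2 form the Core, t0 is In, t3 is Out, and t4 is disconnected, so three of the five tuples fall in Core + Out.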

60

NYT Reachability Graph: Outdegree Distribution

MaxResults=10

MaxResults=50

Matches the power-law distribution

61

NYT: Component Size Distribution

[Figure: component size distributions for MaxResults=10 and MaxResults=50]

MaxResults=10: CG / |T| = 0.297 (not “reachable”)

MaxResults=50: CG / |T| = 0.620 (“reachable”)

62

Connected Components Visualization

DiseaseOutbreaks, New York Times 1995

63

Estimating Reachability

In a power-law random graph G, a giant component CG emerges* if d (the average outdegree) > 1, and:

[Formula for the size of CG not recoverable from the slide text]

Estimate: Reachability ~ CG / |T|

Depends only on d (average outdegree)

* For < 3.457 (Chung and Lu, Annals of Combinatorics, 2002)

64

Estimating Reachability Algorithm

1. Pick some random tuples

2. Use tuples to query database

3. Extract tuples from matching documents to compute reachability graph edges

4. Estimate average outdegree

5. Estimate reachability using results of Chung and Lu, Annals of Combinatorics, 2002

[Figure: sampled tuples t1 to t4, the documents d1 to d4 they retrieve, and the tuples extracted from those documents; estimated d = 1.5]
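
Steps 1 to 4 of the algorithm reduce to averaging the outdegree of the sampled tuples in the reachability graph. The sketch below assumes a hypothetical sample whose edges were chosen to give d = 1.5, as in the slide's figure; the actual edges of that figure are not recoverable. Step 5, mapping d to a reachability estimate, relies on the Chung and Lu result, whose formula does not survive in the slide text and is therefore not reproduced here.

```python
# Hypothetical sample (assumed for illustration): each sampled tuple maps to the
# tuples extracted from the documents it retrieved, i.e. its outgoing edges.
SAMPLE_EDGES = {
    "t1": {"t2", "t3"},
    "t2": {"t4"},
    "t3": {"t2", "t4"},
    "t4": {"t2"},
}


def estimate_average_outdegree(sample_edges):
    """Average number of outgoing reachability edges per sampled tuple."""
    if not sample_edges:
        raise ValueError("need at least one sampled tuple")
    total_edges = sum(len(targets) for targets in sample_edges.values())
    return total_edges / len(sample_edges)


d = estimate_average_outdegree(SAMPLE_EDGES)
# The slides state that a giant component emerges if d > 1.
giant_component_expected = d > 1
print(d, giant_component_expected)  # 1.5 True
```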

65

Estimating Reachability of NYT

[Figure: reachability (0 to 1) versus MaxResults (MR=1, 10, 50, 100, 200, 1000), estimated from sample sizes S=10, 50, 100, 200 and compared with the real graph; one point is annotated .46]

Approximate reachability is estimated after ~ 50 queries.

Can be used to predict success (or failure) of a Tuples querying strategy.

66

Outline

- Information extraction overview
- Partially supervised information extraction
  - Adaptivity
  - Confidence estimation
- Text retrieval for scalable extraction
  - Query-based information extraction
  - Implicit connections/graphs in text databases
- Current and future work
  - Adaptive information extraction and tuning
  - Authority/trust/confidence estimation
  - Inferring and analyzing social networks
  - Multi-modal information extraction and data mining

67

Goal: Detect, Monitor, Predict Outbreaks

Current Patient Records: Diagnosis, physician’s notes, lab results/analysis, …

911 Calls: Traffic accidents, …

Historical news, breaking news stories, wire, alerts, …

Hospital Records

[Diagram: each source feeds its own information extraction system (IESys 1 to IESys 4) → Data Integration, Data Mining, Trend Analysis → Detection, Monitoring, Prediction]

68

Adaptive, Utility-Driven Extraction

- Extract relevant symptoms and modifiers from text
  - Physician notes, patient narrative, call transcripts
- Call transcripts: a difficult extraction problem
  - Not grammatical, dialogue, speech-to-text unreliable, …
  - Use partially supervised techniques to learn extraction patterns
- One approach:
  - Link together (when possible) call transcript and patient record (e.g., by time, address, and patient name)
  - Correlate patterns in transcript with diagnosis/symptoms
  - Fine-grained learning: can automatically train for each symptom or group of patients, etc.

69

Authority, Trust, Confidence

How reliable are signals emitted by information extraction?

Dimensions of trust/confidence:
- Source reliability: diagnosis vs. notes vs. 911 calls
- Tuple extraction confidence
- Source extraction difficulty

70

Source Confidence Estimation

Task is “easy” when context term distributions diverge from background

Quantify as relative entropy (Kullback-Leibler divergence)

After calibration, metric predicts whether task is “easy” or “hard”

[Figure: term frequency (0 to 0.07) for the terms: the, to, and, said, 's, company, mrs, won, president]

$$\mathrm{KL}(LM_C \,\|\, LM_{BG}) = \sum_{w \in V} LM_C(w) \cdot \log \frac{LM_C(w)}{LM_{BG}(w)}$$

CIKM 2005

President George W Bush’s three-day visit to India
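
The relative entropy (Kullback-Leibler divergence) between a context language model and a background language model can be computed as sketched below. This is a minimal sketch under stated assumptions: the token lists are made up for illustration, the models are plain unigram models, and additive smoothing (the `alpha` parameter) is an assumption introduced here so that no probability is zero; the talk does not say how its models were estimated.

```python
import math
from collections import Counter


def unigram_lm(tokens, vocab, alpha=1.0):
    """Unigram language model over vocab with additive smoothing (assumed)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(lm_c, lm_bg):
    """KL(LM_C || LM_BG) = sum over w of LM_C(w) * log(LM_C(w) / LM_BG(w))."""
    return sum(p * math.log(p / lm_bg[w]) for w, p in lm_c.items())


# Hypothetical token samples: background text vs. text around extracted tuples.
background = "the to and said the company said the president won".split()
context = "president visit india president visit".split()
vocab = set(background) | set(context)

lm_bg = unigram_lm(background, vocab)
lm_c = unigram_lm(context, vocab)
divergence = kl_divergence(lm_c, lm_bg)
print(round(divergence, 3))
```

A larger value means the context distribution diverges more from the background, which the slide associates with an “easy” extraction task once the metric is calibrated.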

71

Inferring Social Networks

- Explicit networks
  - Patient records: family, geographical entities in structured and unstructured portions
- Implicit connections
  - Extract events (e.g., “went to restaurant X yesterday”)
  - Extract relationships (e.g., “I work in Kroeger’s in Toco Hills”)

72

Modeling Social Networks for Epidemiology, security, …

Email exchange mapped onto cubicle locations.

73

Improve Prediction Accuracy

Suppose we managed to:
- Automatically identify people currently sick or about to get sick
- Automatically infer (part of) their social network

Can we improve prediction for dynamics of an outbreak?

74

Multimodal Information Extraction and Data Mining

- Develop joint models over structured data
  - E.g., lab results and symptoms extracted from text
- One approach: mutual reinforcement
  - Co-training: train classifier on redundant views of data (e.g., structured & unstructured)
  - Bootstrap on examples proposed by both views
- More generally: graphical models

75

Summary

- Information extraction overview
- Partially supervised information extraction
  - Adaptivity
  - Confidence estimation
- Text retrieval for scalable extraction
  - Query-based information extraction
  - Implicit connections/graphs in text databases
- Current and future work
  - Adaptive information extraction and tuning
  - Authority/trust/confidence estimation
  - Inferring and analyzing social networks
  - Multi-modal information extraction and data mining

76

Thank You

Details: papers, other talk slides: http://www.mathcs.emory.edu/~eugene/