1 Discovering and Utilizing Structure in Large Unstructured Text Datasets Eugene Agichtein Math and...

1

Discovering and Utilizing Structure in Large Unstructured Text Datasets

Eugene Agichtein

Math and Computer Science Department

2

Information Extraction Example

Information extraction systems represent text in structured form

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis…

Disease Outbreaks in The New York Times

Date        Disease Name      Location
Jan. 1995   Malaria           Ethiopia
July 1995   Mad Cow Disease   U.K.
Feb. 1995   Pneumonia         U.S.
May 1995    Ebola             Zaire

Information Extraction System

3

How can information extraction help?

  … allow precise and efficient querying
  … allow returning answers instead of documents
  … support powerful query constructs
  … allow data integration with (structured) RDBMS
  … provide input to data mining & statistics analysis

Large Text Collection Structured Relation

4

Goal: Detect, Monitor, Predict Outbreaks

Current Patient Records: Diagnosis, physician’s notes, lab results/analysis, …

911 calls, traffic accidents, …

Historical news, breaking news stories, wire, alerts, …

Hospital Records

IESys 4

IESys 3

IESys 2

IESys 1

Data Integration, Data Mining, Trend Analysis

Detection, Monitoring, Prediction

5

Challenges in Information Extraction

Portability
  Reduce effort to tune for new domains and tasks
  MUC systems: experts would take 8-12 weeks to tune

Scalability, Efficiency, Access
  Enable information extraction over large collections
  1 sec / document * 5 billion docs = 158 CPU years

Approach: learn from data (“Bootstrapping”)
  Snowball: Partially Supervised Information Extraction
  Querying Large Text Databases for Efficient Information Extraction
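The scalability estimate on this slide (1 second per document over 5 billion documents) can be checked with a few lines of arithmetic. This is only a sketch; the function name `cpu_years` and the choice of a 365-day year are assumptions made for illustration.

```python
# Assumption: a non-leap year of 365 days.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def cpu_years(num_docs: int, seconds_per_doc: float) -> float:
    """Total sequential processing time, expressed in CPU years."""
    return num_docs * seconds_per_doc / SECONDS_PER_YEAR


estimate = cpu_years(5_000_000_000, 1.0)
print(f"{estimate:.1f} CPU years")
```

The result lands between 158 and 159 CPU years, which agrees with the figure quoted above.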

6

Outline

Information extraction overview

Partially supervised information extraction
  Adaptivity
  Confidence estimation

Text retrieval for scalable extraction
  Query-based information extraction
  Implicit connections/graphs in text databases

Current and future work
  Inferring and analyzing social networks
  Utility-based extraction tuning
  Multi-modal information extraction and data mining
  Authority/trust/confidence estimation

7

What is “Information Extraction”

As a task: Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATION

8

What is “Information Extraction”

As a task: Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT





NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Soft..

IE

9

What is “Information Extraction”

As a family of techniques:

Information Extraction = segmentation + classification + clustering + association

October 14, 2002, 4:00 a.m. PT





Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation


13

IE in Context

Create ontology

Segment, Classify, Associate, Cluster

Load DB

Spider

Query, Search

Data mine

IE

Document collection

Database

Filter by relevance

Label training data

Train extraction models

14

Information Extraction Tasks

Extracting entities and relations
  Entities
    Named (e.g., Person)
    Generic (e.g., disease name)
  Relations
    Entities related in a predefined way (e.g., Location of a Disease outbreak)
    Discovered automatically

Common information extraction steps:
  Preprocessing: sentence chunking, parsing, morphological analysis
  Rules/extraction patterns: manual, machine learning, and hybrid
  Applying extraction patterns to extract new information

Postprocessing and complex extraction: not covered
  Co-reference resolution
  Combining Relations into Events, Rules, …

15

Two kinds of IE approaches

Knowledge Engineering
  rule based
  developed by experienced language engineers
  make use of human intuition
  requires only small amount of training data
  development could be very time consuming
  some changes may be hard to accommodate

Machine Learning
  use statistics or other machine learning
  developers do not need LE expertise
  requires large amounts of annotated training data
  some changes may require re-annotation of the entire training corpus
  annotators are cheap (but you get what you pay for!)

16

Extracting Entities from Text

Any of these models can be used to capture words, formatting or both.

Example sentence: Abraham Lincoln was born in Kentucky.

Lexicons
  Alabama, Alaska, …, Wisconsin, Wyoming
  Is the candidate a member?

Classify Pre-segmented Candidates
  Classifier: which class?

Sliding Window
  Classifier: which class?
  Try alternate window sizes

Boundary Models
  Classifier marks BEGIN and END positions

Finite State Machines
  Most likely state sequence?

Context Free Grammars
  Most likely parse? (e.g., NNP, V, P, NP, PP, VP, S)

…and beyond

17

Hidden Markov Models

Finite state model / Graphical model

[Figure: chain of states ... S_{t-1}, S_t, S_{t+1} ... linked by transitions, each emitting an observation O_{t-1}, O_t, O_{t+1}. The model generates a state sequence and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8.]

$$P(s, o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)$$

Parameters: for all states S = {s1, s2, …}
  Start state probabilities: P(s_t)
  Transition probabilities: P(s_t | s_{t-1})
  Observation (emission) probabilities: P(o_t | s_t)
    Usually a multinomial over atomic, fixed alphabet

Training: Maximize probability of training observations (w/ prior)

18

IE with Hidden Markov Models

Given a sequence of observations:

  Yesterday Lawrence Saul spoke this example sentence.

and a trained HMM:

Find the most likely state sequence (Viterbi):

$$\hat{s} = \arg\max_{s} P(s, o)$$

Any words said to be generated by the designated “person name” state extract as a person name:

  Person name: Lawrence Saul
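The decoding step on this slide can be sketched as follows. This is a minimal Viterbi implementation over a toy two-state HMM; the state names, every probability, and the small `unk` emission floor for unseen words are invented for illustration and are not taken from any trained model.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, unk=1e-6):
    """Return the state sequence maximizing P(s, o), computed in log space."""
    def emit(state, word):
        # Assumption: unseen words get a small floor probability `unk`.
        return math.log(emit_p[state].get(word, unk))

    best = [{s: math.log(start_p[s]) + emit(s, observations[0]) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + emit(s, observations[t])
            back[t][s] = prev
    # Follow back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Hypothetical toy model.
states = ["other", "person"]
start_p = {"other": 0.9, "person": 0.1}
trans_p = {
    "other": {"other": 0.8, "person": 0.2},
    "person": {"other": 0.4, "person": 0.6},
}
emit_p = {
    "other": {"yesterday": 0.2, "spoke": 0.2, "this": 0.2, "example": 0.2, "sentence": 0.2},
    "person": {"lawrence": 0.5, "saul": 0.5},
}

words = "Yesterday Lawrence Saul spoke this example sentence".split()
tags = viterbi([w.lower() for w in words], states, start_p, trans_p, emit_p)
names = [w for w, tag in zip(words, tags) if tag == "person"]
print(names)
```

With these toy parameters, the words assigned to the "person" state are "Lawrence" and "Saul", mirroring the extraction shown on the slide.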

19

HMM Example: “Nymble” [Bikel, et al 1998], [BBN “IdentiFinder”]

Task: Named Entity Extraction

[Figure: HMM with states start-of-sentence, Person, Org, (five other name classes), Other, end-of-sentence]

Transition probabilities: P(s_t | s_{t-1}, o_{t-1})
  Back-off to: P(s_t | s_{t-1}), then P(s_t)

Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1})
  Back-off to: P(o_t | s_t), then P(o_t)

Train on 450k words of news wire text.

Results:

Case    Language   F1
Mixed   English    93%
Upper   English    91%
Mixed   Spanish    90%

Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99]

20

Relation Extraction

Extract structured relations from text

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis…

Disease Outbreaks in The New York Times

Date        Disease Name      Location
Jan. 1995   Malaria           Ethiopia
July 1995   Mad Cow Disease   U.K.
Feb. 1995   Pneumonia         U.S.
May 1995    Ebola             Zaire

21

Relation Extraction

Typically require Entity Tagging as preprocessing

Knowledge Engineering
  Rules defined over lexical items
    “<company> located in <location>”
  Rules defined over parsed text
    “((Obj <company>) (Verb located) (*) (Subj <location>))”
  Proteus, GATE, …

Machine Learning-based
  Learn rules/patterns from examples
    Dan Roth 2005, Cardie 2006, Mooney 2005, …
  Partially-supervised: bootstrap from “seed” examples
    Agichtein & Gravano 2000, Etzioni et al., 2004, …

Recently, hybrid models [Feldman2004, 2006]

22

Comparison of Approaches

Use “language-engineering” environments to help experts create extraction patterns (significant effort)
  GATE [2002], Proteus [1998]

Train system over manually labeled data (substantial effort)
  Soderland et al. [1997], Muslea et al. [2000], Riloff et al. [1996]

Exploit large amounts of unlabeled data (minimal effort)
  DIPRE [Brin 1998], Snowball [Agichtein & Gravano 2000]
  Etzioni et al. (’04): KnowItAll: extracting unary relations
  Yangarber et al. (’00, ’02): Pattern refinement, generalized names detection

23

The Snowball System: Overview

Snowball

Text Database

Organization          Location        Conf
Microsoft             Redmond         1
IBM                   Armonk          1
Intel                 Santa Clara     1
AG Edwards            St Louis        0.9
Air Canada            Montreal        0.8
7th Level             Richardson      0.8
3Com Corp             Santa Clara     0.8
3DO                   Redwood City    0.7
3M                    Minneapolis     0.7
MacWorld              San Francisco   0.7
157th Street          Manhattan       0.52
15th Party Congress   China           0.3
15th Century Europe   Dark Ages       0.1
...                   ...             ...

24

Snowball: Getting User Input

User input:
  • a handful of example instances
  • integrity constraints on the relation, e.g., Organization is a “key”, Age > 0, etc…

Organization   Headquarters
Microsoft      Redmond
IBM            Armonk
Intel          Santa Clara

Snowball loop: Get Examples → Find Example Occurrences in Text → Tag Entities → Generate Extraction Patterns → Extract Tuples → Evaluate Tuples

ACM DL 2000

25

Snowball: Finding Example Occurrences

Can use any full-text search engine

Seed tuples are used to query a Search Engine over the Text Database:

Organization   Headquarters
Microsoft      Redmond
IBM            Armonk
Intel          Santa Clara

Computer servers at Microsoft’s headquarters in Redmond…

In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp

The Armonk-based IBM introduced a new line…

Change of guard at IBM Corporation’s headquarters near Armonk, NY ...

26

Snowball: Tagging Entities

Named entity taggers can recognize Dates, People, Locations, Organizations, …
MITRE’s Alembic, IBM’s Talent, LingPipe, …

Computer servers at Microsoft ’s headquarters in Redmond…

In mid-afternoon trading, shares of Redmond, WA -based Microsoft Corp

The Armonk -based IBM introduced a new line…

Change of guard at IBM Corporation‘s headquarters near Armonk, NY ...

27

Snowball: Extraction Patterns

General extraction pattern model:
  acceptor0, Entity, acceptor1, Entity, acceptor2

Acceptor instantiations:
  String Match (accepts string “’s headquarters in”)
  Vector-Space (~ vector [(-’s, 0.5), (headquarters, 0.5), (in, 0.5)])
  Sequence Classifier (Prob(T=valid | ’s, headquarters, in))
    HMMs, Sparse sequences, Conditional Random Fields, …

Computer servers at Microsoft’s headquarters in Redmond…

28

Snowball: Generating Patterns

1. Represent occurrences as vectors of tags and terms

  ORGANIZATION {<'s 0.57>, <headquarters 0.57>, <in 0.57>} LOCATION
  LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
  ORGANIZATION {<'s 0.57>, <headquarters 0.57>, <near 0.57>} LOCATION

2. Cluster similar occurrences.

29

Snowball: Generating Patterns

1. Represent occurrences as vectors of tags and terms
2. Cluster similar occurrences.
3. Create patterns as filtered cluster centroids

  ORGANIZATION {<'s 0.71>, <headquarters 0.71>} LOCATION
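The three pattern-generation steps can be sketched for a single, already-formed cluster. This is an illustrative reading, not the Snowball implementation: the clustering step itself is skipped, and the helper names, the unit-length term weighting, and the `min_weight` filter threshold are all assumptions.

```python
import math
from collections import Counter, defaultdict


def unit_vector(terms):
    """Represent a middle context as a term vector scaled to unit length."""
    counts = Counter(terms)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()}


def centroid_pattern(vectors, min_weight=0.5):
    """Average the cluster's vectors, drop weak terms, and renormalize."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    average = {term: w / len(vectors) for term, w in total.items()}
    kept = {term: w for term, w in average.items() if w >= min_weight}
    norm = math.sqrt(sum(w * w for w in kept.values()))
    return {term: w / norm for term, w in kept.items()}


# Two ORGANIZATION ... LOCATION occurrences assumed to fall in one cluster.
cluster = [
    unit_vector(["'s", "headquarters", "in"]),
    unit_vector(["'s", "headquarters", "near"]),
]
pattern = centroid_pattern(cluster)
print({term: round(w, 2) for term, w in pattern.items()})
```

Under these assumptions the terms "in" and "near" are filtered out, and the remaining terms "'s" and "headquarters" each get weight of about 0.71, the same shape as the pattern shown on the slide.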

30

Vector Space Clustering

31

Snowball: Extracting New Tuples

Match tagged text fragments against patterns

Google's new headquarters in Mountain View are …

V: ORGANIZATION {<'s 0.5>, <new 0.5>, <headquarters 0.5>, <in 0.5>} LOCATION {<are 1>}

P1: ORGANIZATION {<'s 0.71>, <headquarters 0.71>} LOCATION   Match=0.8
P2: ORGANIZATION {<located 0.71>, <in 0.71>} LOCATION       Match=0.4
P3: LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION          Match=0
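One plausible way to score a tagged fragment against the patterns is a dot product over the middle-context vectors, gated on the entity tags appearing in the same order. The sketch below assumes that scoring rule and ignores the left and right contexts; the dictionary layout and the `match` function are hypothetical. The slide's scores (0.8, 0.4, 0) presumably come from a different weighting, so only the ranking should be expected to agree.

```python
def match(fragment, pattern):
    """Dot product of middle-context vectors; 0 when the tag order differs."""
    if fragment["order"] != pattern["order"]:
        return 0.0
    middle = pattern["middle"]
    return sum(w * middle.get(term, 0.0) for term, w in fragment["middle"].items())


ORG_LOC = ("ORGANIZATION", "LOCATION")
LOC_ORG = ("LOCATION", "ORGANIZATION")

# Fragment from "Google's new headquarters in Mountain View are ..."
fragment = {
    "order": ORG_LOC,
    "middle": {"'s": 0.5, "new": 0.5, "headquarters": 0.5, "in": 0.5},
}
patterns = {
    "P1": {"order": ORG_LOC, "middle": {"'s": 0.71, "headquarters": 0.71}},
    "P2": {"order": ORG_LOC, "middle": {"located": 0.71, "in": 0.71}},
    "P3": {"order": LOC_ORG, "middle": {"-": 0.71, "based": 0.71}},
}

scores = {name: match(fragment, p) for name, p in patterns.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

P1 scores highest, P2 lower, and P3 scores zero because its tags are in the opposite order.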

32

Snowball: Evaluating Patterns

Automatically estimate pattern confidence:
  Conf(P4) = Positive / Total = 2/3 = 0.66

P4: ORGANIZATION {<, 1>} LOCATION

  IBM, Armonk, reported…                                              Positive
  Intel, Santa Clara, introduced...                                   Positive
  “Bet on Microsoft”, New York-based analyst Jane Smith said...      Negative

Current seed tuples

Organization   Headquarters
IBM            Armonk
Intel          Santa Clara
Microsoft      Redmond
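The Positive / Total estimate can be sketched by checking each tuple a pattern extracts against the current seed tuples, treating Organization as a key. The function name and the decision to skip organizations that are not among the seeds are assumptions of this sketch.

```python
def pattern_confidence(extracted, seeds):
    """Fraction of extracted tuples that agree with the seed tuples.

    `seeds` maps organization -> headquarters (Organization is a key).
    Tuples whose organization is not a seed are neither positive nor negative.
    """
    positive = negative = 0
    for organization, location in extracted:
        if organization not in seeds:
            continue
        if seeds[organization] == location:
            positive += 1
        else:
            negative += 1
    total = positive + negative
    return positive / total if total else 0.0


seeds = {"IBM": "Armonk", "Intel": "Santa Clara", "Microsoft": "Redmond"}
extracted_by_p4 = [
    ("IBM", "Armonk"),         # positive
    ("Intel", "Santa Clara"),  # positive
    ("Microsoft", "New York"), # negative: conflicts with the seed
]
conf_p4 = pattern_confidence(extracted_by_p4, seeds)
print(conf_p4)
```

For the three matches shown on the slide this gives 2/3.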

33

Snowball: Evaluating Tuples

Automatically evaluate tuple confidence:

$$\mathrm{Conf}(T) = 1 - \prod_{i=1}^{|P|} \bigl(1 - \mathrm{Conf}(P_i) \cdot \mathrm{Match}(P_i)\bigr)$$

A tuple has high confidence if generated by high-confidence patterns.

Tuple: <3COM, Santa Clara>

  P4: ORGANIZATION {<, 1>} LOCATION                    Conf(P4) = 0.66   Match = 0.4
  P3: LOCATION {<- 0.75>, <based 0.75>} ORGANIZATION   Conf(P3) = 0.95   Match = 0.8

Conf(T): 0.83
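The tuple-confidence formula, Conf(T) = 1 - product over patterns of (1 - Conf(P_i) * Match(P_i)), can be evaluated directly. The pairing of match scores with patterns below (0.4 with P4, 0.8 with P3) is an assumption read off the slide layout.

```python
import math


def tuple_confidence(support):
    """support: list of (pattern_confidence, match_score) pairs for one tuple."""
    return 1.0 - math.prod(1.0 - conf * match for conf, match in support)


# <3COM, Santa Clara>, assumed to be generated by P4 and P3.
support = [
    (0.66, 0.4),  # P4
    (0.95, 0.8),  # P3
]
conf_t = tuple_confidence(support)
print(round(conf_t, 2))
```

With these inputs the value is about 0.82, close to the 0.83 shown on the slide; the small gap is consistent with rounding of the displayed confidences.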

34

Snowball: Evaluating Tuples

Organization          Headquarters   Conf
Microsoft             Redmond        1
IBM                   Armonk         1
Intel                 Santa Clara    1
3M                    Minneapolis    0.7
157th Street          Manhattan      0.52
15th Party Congress   China          0.3
15th Century Europe   Dark Ages      0.1
...                   ...            ...

Keep only high-confidence tuples for next iteration

35

Snowball: Evaluating Tuples

Organization   Headquarters   Conf
Microsoft      Redmond        1
IBM            Armonk         1
Intel          Santa Clara    1
3M             Minneapolis    0.7

Start new iteration with expanded example set
Iterate until no new tuples are extracted

36

Pattern-Tuple Duality

A “good” tuple:
  Extracted by “good” patterns
  Tuple weight goodness

A “good” pattern:
  Generated by “good” tuples
  Extracts “good” new tuples
  Pattern weight goodness

Edge weight:
  Match/Similarity of tuple context to pattern

37

How to Set Node Weights

Constraint violation (from before)
  Conf(P) = Log(Pos) * Pos/(Pos+Neg)
  $\mathrm{Conf}(T) = 1 - \prod_{i=1}^{|P|} \bigl(1 - \mathrm{Conf}(P_i) \cdot \mathrm{Match}(P_i)\bigr)$

HITS [Hassan et al., EMNLP 2006]
  Conf(P) = ∑ Conf(T)
  Conf(T) = ∑ Conf(P)

URNS [Downey et al., IJCAI 2005]

EM-Spy [Agichtein, SDM 2006]
  Unknown tuples = Neg
  Compute Conf(P), Conf(T)
  Iterate

38

Evaluating Patterns and Tuples: Expectation Maximization

EM-Spy Algorithm
  “Hide” labels for some seed tuples
  Iterate EM algorithm to convergence on tuple/pattern confidence values
  Set threshold t such that (t > 90% of spy tuples)
  Re-initialize Snowball using new seed tuples

Organization          Headquarters    Initial   Final
Microsoft             Redmond         1         1
IBM                   Armonk          1         0.8
Intel                 Santa Clara     1         0.9
AG Edwards            St Louis        0         0.9
Air Canada            Montreal        0         0.8
7th Level             Richardson      0         0.8
3Com Corp             Santa Clara     0         0.8
3DO                   Redwood City    0         0.7
3M                    Minneapolis     0         0.7
MacWorld              San Francisco   0         0.7
157th Street          Manhattan       0         0.52
15th Party Congress   China           0         0.3
15th Century Europe   Dark Ages       0         0.1
…

39

Adapting Snowball for New Relations

Large parameter space
  Initial seed tuples (randomly chosen, multiple runs)
  Acceptor features: words, stems, n-grams, phrases, punctuation, POS
  Feature selection techniques: OR, NB, Freq, “support”, combinations
  Feature weights: TF*IDF, TF, TF*NB, NB
  Pattern evaluation strategies: NN, Constraint violation, EM, EM-Spy

Automatically estimate parameter values:
  Estimate operating parameters based on occurrences of seed tuples
  Run cross-validation on hold-out sets of seed tuples for optimal performance
  Seed occurrences that do not have close “neighbors” are discarded

40

Example Task: DiseaseOutbreaks

Proteus: 0.409
Snowball: 0.415

SDM 2006

41

Snowball Used in Various Domains

News: NYT, WSJ, AP [DL’00, SDM’06]
  CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks

Medical literature: PDR, Micromedex… [Thesis]
  AdverseEffects, DrugInteractions, RecommendedTreatments

Biological literature: GeneWays corpus [ISMB’03]
  Gene and Protein Synonyms

42




Current and future work
  Inferring and analyzing social networks
  Utility-based extraction tuning
  Multi-modal information extraction and data mining
  Authority/trust/confidence estimation

43

Extracting A Relation From a Large Text Database

[Figure: Text Database → Information Extraction System → Structured Relation]

Brute force approach: feed all docs to information extraction system
  Expensive for large collections

Only a tiny fraction of documents are often useful
Many databases are not crawlable
Often a search interface is available, with existing keyword index
How to identify “useful” documents?

44

An Abstract View of Text-Centric Tasks

[Figure: Text Database → Extraction System → Output tuples]

1. Retrieve documents from database
2. Process documents
3. Extract output tuples

Task                     Tuple
Information Extraction   Relation Tuple
Database Selection       Word (+Frequency)
Focused Crawling         Web Page about a Topic

[Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]

45

Executing a Text-Centric Task

Similar to relational world
  Two major execution paradigms
    Scan-based: Retrieve and process documents sequentially
    Index-based: Query database (e.g., [case fatality rate]), retrieve and process documents in results

Unlike the relational world
  Indexes are only “approximate”: index is on keywords, not on tuples of interest
  Choice of execution plan affects output completeness (not only speed)

→ underlying data distribution dictates what is best

46

Scan

Scan retrieves and processes documents sequentially (until reaching target recall)

Execution time = |Retrieved Docs| · (R + P)
  R: time for retrieving a document
  P: time for processing a document

Question: How many documents does Scan retrieve to reach target recall?

Filtered Scan uses a classifier to identify and process only promising documents (details in paper)

47

Iterative Query Expansion

1. Query database with seed tuples (e.g., <Malaria, Ethiopia>)
2. Process retrieved documents
3. Extract tuples from docs
4. Augment seed tuples with new tuples

Query Generation turns tuples into queries (e.g., [Ebola AND Zaire])

Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q
  R: time for retrieving a document
  P: time for processing a document
  Q: time for answering a query

Question: How many queries and how many documents does Iterative Set Expansion need to reach target recall?
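The two execution-time formulas, for Scan and for the query-based strategy, can be compared with a small cost model. The function names and all numeric values below (per-document and per-query times, document and query counts) are hypothetical, chosen only to show how the terms combine.

```python
def scan_time(retrieved_docs, R, P):
    """Scan: |Retrieved Docs| * (R + P)."""
    return retrieved_docs * (R + P)


def query_based_time(retrieved_docs, num_queries, R, P, Q):
    """Query-based: |Retrieved Docs| * (R + P) + |Queries| * Q."""
    return retrieved_docs * (R + P) + num_queries * Q


# Hypothetical costs in seconds.
R, P, Q = 0.01, 1.0, 0.5

full_scan = scan_time(135_000, R, P)
querying = query_based_time(13_500, 500, R, P, Q)
print(full_scan, querying)
```

The model makes the trade-off explicit: querying pays an extra |Queries| * Q term but can win when it reaches the target recall after touching far fewer documents. How many documents and queries each strategy actually needs is the open question the slides pose.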

48

QXtract: Querying Text Databases for Robust Scalable Information EXtraction

[Figure: User-Provided Seed Tuples → Query Generation → Queries → Search Engine over Text Database → Promising Documents → Information Extraction → Extracted Relation]

Extracted Relation

DiseaseName       Location    Date
Malaria           Ethiopia    Jan. 1995
Ebola             Zaire       May 1995
Mad Cow Disease   The U.K.    July 1995
Pneumonia         The U.S.    Feb. 1995

Problem: Learn keyword queries to retrieve “promising” documents

49

Learning Queries to Retrieve Promising Documents

1. Get document sample with “likely negative” and “likely positive” examples.

2. Label sample documents using information extraction system as “oracle.”

3. Train classifiers to “recognize” useful documents.

4. Generate queries from classifier model/rules.

[Figure: User-Provided Seed Tuples → Seed Sampling (Search Engine over Text Database) → Classifier Training on positive (+) and negative (-) documents → Query Generation → Queries]

50

Training Classifiers to Recognize “Useful” Documents

Document features: words

  D1 (+): disease reported epidemic expected area
  D2 (+): virus reported expected infected patients
  D3 (-): products made used exported far
  D4 (-): past old homerun sponsored event

Ripper
  disease AND reported => USEFUL

SVM
  virus       3
  infected    2
  sponsored  -1

Okapi (IR)
  Term ranking over: disease, infected, reported, virus, epidemic, products, used, far, exported

51

Generating Queries from Classifiers

Ripper
  Rule: disease AND reported => USEFUL
  Queries: [disease AND reported]

SVM
  Weights: virus 3, infected 2, sponsored -1
  Queries: [virus], [infected]

Okapi (IR)
  Terms: disease, infected, reported, virus, epidemic, products, used, far, exported
  Queries: [epidemic], [virus]

QCombined
  Queries: [epidemic], [virus], [disease AND reported]

52

SIGMOD 2003 Demonstration

53

An Even Simpler Querying Strategy: “Tuples”

1. Convert given tuples into queries
2. Retrieve matching documents
3. Extract new tuples from documents and iterate

[Figure: query “Ebola” and “Zaire” → Search Engine → Information Extraction System → new tuple <hemorrhagic fever, Africa, May 1995>]
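The Tuples strategy can be sketched as a simple loop over a frontier of tuples. Everything below is a toy stand-in: `search` and `extract` simulate a search engine and an extraction system over a three-document in-memory corpus, and the corpus contents are hypothetical.

```python
def tuples_strategy(seeds, search, extract, max_queries=100):
    """Query with each known tuple, extract new tuples, and iterate."""
    known = set(seeds)
    frontier = list(seeds)
    seen_docs = set()
    queries = 0
    while frontier and queries < max_queries:
        query_tuple = frontier.pop(0)       # 1. convert tuple into a query
        queries += 1
        for doc_id in search(query_tuple):  # 2. retrieve matching documents
            if doc_id in seen_docs:
                continue
            seen_docs.add(doc_id)
            for new_tuple in extract(doc_id):  # 3. extract new tuples
                if new_tuple not in known:
                    known.add(new_tuple)
                    frontier.append(new_tuple)
    return known, queries, len(seen_docs)


# Hypothetical corpus: document id -> tuples an extractor would find in it.
CORPUS = {
    "d1": {("Ebola", "Zaire"), ("Malaria", "Ethiopia")},
    "d2": {("Malaria", "Ethiopia"), ("Cholera", "Sudan")},
    "d3": {("SARS", "China")},
}


def search(query_tuple):
    return [doc for doc, tuples in sorted(CORPUS.items()) if query_tuple in tuples]


def extract(doc_id):
    return sorted(CORPUS[doc_id])


known, queries, docs = tuples_strategy([("Ebola", "Zaire")], search, extract)
print(sorted(known), queries, docs)
```

In this toy run the seed reaches two further tuples after three queries, but never finds <SARS, China>, because no retrieved document mentions it. That kind of unreachable tuple is what caps the recall of this strategy.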

54

Comparison of Document Access Methods

[Figure: recall (%), 0 to 80, vs. MaxFractionRetrieved (5%, 10%, 25%) for QXtract, Manual, Tuples, and Baseline]

QXtract: 60% of relation extracted from 10% of the documents in a database of 135,000 newspaper articles

Tuples strategy: Recall at most 46%

55

Predicting Recall of Tuples Strategy

[Figure: two runs starting from a seed tuple, one labeled SUCCESS! and one labeled FAILURE]

Can we predict if Tuples will succeed?

WebDB 2003

56

Using Querying Graph for Analysis

We need to compute the:
  Number of documents retrieved after sending Q tuples as queries (estimates time)
  Number of tuples that appear in the retrieved documents (estimates recall)

To estimate these we need to compute the:
  Degree distribution of the tuples discovered by retrieving documents
  Degree distribution of the documents retrieved by the tuples
  (Not the same as the degree distribution of a randomly chosen tuple or document: it is easier to discover documents and tuples with high degrees)

[Figure: bipartite querying graph between tuples t1 to t5 and documents d1 to d5; example tuples: <SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>]

57

Information Reachability Graph

t2, t3, and t4 are “reachable” from t1
t1 retrieves document d1 that contains t2

[Figure: bipartite graph between tuples t1 to t5 and documents d1 to d5, and the derived reachability graph over t1 to t5]
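The reachability graph can be derived from the querying graph by linking tuple t to tuple u whenever t retrieves some document that contains u. The sketch below assumes that construction; since the figure's edges are not recoverable from the text, the `retrieves` and `contains` mappings are hypothetical, chosen so that t2, t3, and t4 are reachable from t1 as the slide states.

```python
from collections import deque


def reachability_graph(retrieves, contains):
    """Edge t -> u when a document retrieved by t contains tuple u."""
    graph = {}
    for tup, docs in retrieves.items():
        graph[tup] = {u for d in docs for u in contains.get(d, ()) if u != tup}
    return graph


def reachable_from(graph, start):
    """Breadth-first search over the reachability graph."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in graph.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen


# Hypothetical querying graph.
retrieves = {"t1": ["d1"], "t2": ["d2", "d3"], "t3": ["d3"], "t4": ["d4"], "t5": ["d5"]}
contains = {
    "d1": ["t1", "t2"],
    "d2": ["t2", "t3"],
    "d3": ["t3", "t4"],
    "d4": ["t4"],
    "d5": ["t5", "t1"],
}

graph = reachability_graph(retrieves, contains)
from_t1 = reachable_from(graph, "t1")
print(sorted(from_t1))
```

Here t5 can reach t1 but nothing reaches t5, so a run seeded with t1 would never discover it.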

58

Connected Components

[Figure: reachability graph over tuples t1 to t4, partitioned into In, Core, and Out components]

In: tuples that retrieve other tuples but are not reachable
Core (strongly connected): tuples that retrieve other tuples and themselves
Out: reachable tuples that do not retrieve tuples in Core

59

Sizes of Connected Components

[Figure: several In / Core (strongly connected) / Out structures of different sizes, with a seed tuple t0]

How many tuples are in largest Core + Out?

Conjecture: Degree distribution in reachability graphs follows “power-law.”

Then, reachability graph has at most one giant component.

Define Reachability as Fraction of tuples in largest Core + Out

60

NYT Reachability Graph: Outdegree Distribution

[Figure: outdegree distributions for MaxResults=10 and MaxResults=50]

Matches the power-law distribution

61

NYT: Component Size Distribution

[Figure: component size distributions for two MaxResults settings]

MaxResults=10: CG / |T| = 0.297 (not “reachable”)
MaxResults=50: CG / |T| = 0.620 (“reachable”)

62

Connected Components Visualization

DiseaseOutbreaks, New York Times 1995

63

Estimating Reachability

In a power-law random graph G, a giant component CG emerges* if d (the average outdegree) > 1, and:

Estimate: Reachability ~ CG / |T|
  Depends only on d (average outdegree)

* For < 3.457
Chung and Lu, Annals of Combinatorics, 2002

64

Estimating Reachability Algorithm

1. Pick some random tuples
2. Use tuples to query database
3. Extract tuples from matching documents to compute reachability graph edges
4. Estimate average outdegree
5. Estimate reachability using results of Chung and Lu, Annals of Combinatorics, 2002

[Figure: sampled tuples t1 to t4, matching documents d1 to d4, and the resulting reachability edges; d = 1.5]
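Steps 1 to 4 of the algorithm can be sketched as below. The `query` and `extract` callables stand in for a search engine and an extraction system, and the toy data is hypothetical, constructed so that the estimate equals the d = 1.5 shown in the figure. Step 5, mapping d to a reachability estimate via Chung and Lu's result, is deliberately left out because the slides do not give the formula.

```python
import random


def estimate_avg_outdegree(tuples, query, extract, sample_size, seed=0):
    """Average number of other tuples discovered per sampled query tuple."""
    rng = random.Random(seed)
    sample = rng.sample(sorted(tuples), sample_size)  # 1. pick random tuples
    total = 0
    for tup in sample:
        discovered = set()
        for doc in query(tup):                # 2. use the tuple as a query
            discovered.update(extract(doc))   # 3. tuples in matching docs
        discovered.discard(tup)               # ignore self-loops
        total += len(discovered)
    return total / len(sample)                # 4. average outdegree


# Hypothetical toy data.
RESULTS = {"t1": ["d1", "d2"], "t2": ["d3"], "t3": ["d4"], "t4": ["d5"]}
CONTENTS = {
    "d1": {"t1", "t2"},
    "d2": {"t1", "t3"},
    "d3": {"t2", "t3", "t4"},
    "d4": {"t3", "t2"},
    "d5": {"t4", "t2"},
}

d = estimate_avg_outdegree(
    RESULTS.keys(), RESULTS.__getitem__, CONTENTS.__getitem__, sample_size=4
)
print(d)
```

Because the sample covers all four toy tuples, the estimate is exact here; with a real database the sample would be a small fraction of the tuples and the estimate would carry sampling error.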

65

Estimating Reachability of NYT

[Figure: estimated reachability (0 to 1) vs. MaxResults (MR=1, 10, 50, 100, 200, 1000) for sample sizes S=10, S=50, S=100, S=200 and the real graph; marked value: .46]

Approximate reachability is estimated after ~ 50 queries.

Can be used to predict success (or failure) of a Tuples querying strategy.

66




Current and future work
  Adaptive information extraction and tuning
  Authority/trust/confidence estimation
  Inferring and analyzing social networks
  Multi-modal information extraction and data mining

67

Goal: Detect, Monitor, Predict Outbreaks

Current Patient Records: Diagnosis, physician’s notes, lab results/analysis, …

911 calls, traffic accidents, …

Historical news, breaking news stories, wire, alerts, …

Hospital Records

IESys 4

IESys 3

IESys 2

IESys 1

Data Integration, Data Mining, Trend Analysis

Detection, Monitoring, Prediction

68

Adaptive, Utility-Driven Extraction

Extract relevant symptoms and modifiers from text
  Physician notes, patient narrative, call transcripts

Call transcripts: a difficult extraction problem
  Not grammatical, dialogue, speech-to-text unreliable, …
  Use partially supervised techniques to learn extraction patterns

One approach:
  Link together (when possible) call transcript and patient record (e.g., by time, address, and patient name)
  Correlate patterns in transcript with diagnosis/symptoms
  Fine-grained learning: can automatically train for each symptom or group of patients, etc.

69

Authority, Trust, Confidence

How reliable are signals emitted by information extraction?

Dimensions of trust/confidence:
  Source reliability: diagnosis vs. notes vs. 911 calls
  Tuple extraction confidence
  Source extraction difficulty

70

Source Confidence Estimation

Task “easy” when context term distributions diverge from background

Example context: President George W Bush’s three-day visit to India

[Figure: frequency (0 to 0.07) of the terms the, to, and, said, 's, company, mrs, won, president]

Quantify as relative entropy (Kullback-Leibler divergence):

$$\mathrm{KL}(LM_C \,\|\, LM_{BG}) = \sum_{w_i \in V} LM_C(w_i) \log \frac{LM_C(w_i)}{LM_{BG}(w_i)}$$

After calibration, metric predicts task is “easy” or “hard”

CIKM 2005
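The relative-entropy measure on this slide, the sum over vocabulary words of LM_C(w) * log(LM_C(w) / LM_BG(w)), can be sketched with unigram language models. The add-one smoothing, the helper names, and both token lists are assumptions made for illustration; they are not taken from the CIKM 2005 work.

```python
import math
from collections import Counter


def language_model(tokens, vocab, alpha=1.0):
    """Unigram model over `vocab` with additive smoothing (an assumption)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(lm_c, lm_bg):
    """KL(LM_C || LM_BG) over a shared vocabulary."""
    return sum(p * math.log(p / lm_bg[w]) for w, p in lm_c.items() if p > 0)


# Hypothetical token samples: words near extracted entities vs. general text.
context_tokens = "president said president won company president 's".split()
background_tokens = "the to and said the to the and 's company mrs won".split()

vocab = set(context_tokens) | set(background_tokens)
lm_context = language_model(context_tokens, vocab)
lm_background = language_model(background_tokens, vocab)

divergence = kl_divergence(lm_context, lm_background)
print(round(divergence, 3))
```

A larger divergence means the contexts look less like ordinary text, which the slide associates with an "easy" extraction task. Turning the raw number into an easy/hard prediction would still need the calibration step mentioned above.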

71

Inferring Social Networks

Explicit networks
  Patient records: family, geographical entities in structured and unstructured portions

Implicit connections
  Extract events (e.g., “went to restaurant X yesterday”)
  Extract relationships (e.g., “I work in Kroeger’s in Toco Hills”)

72

Modeling Social Networks for Epidemiology, security, …

Email exchange mapped onto cubicle locations.

73

Improve Prediction Accuracy

Suppose we managed to
  Automatically identify people currently sick or about to get sick
  Automatically infer (part of) their social network

Can we improve prediction for dynamics of an outbreak?

74

Multimodal Information Extraction and Data Mining

Develop joint models over structured data
  E.g., lab results and symptoms extracted from text

One approach: mutual reinforcement
  Co-training: train classifier on redundant views of data (e.g., structured & unstructured)
  Bootstrap on examples proposed by both views

More generally: graphical models

75

Summary

Information extraction overview

Current and future work
  Adaptive information extraction and tuning
  Authority/trust/confidence estimation
  Inferring and analyzing social networks
  Multi-modal information extraction and data mining

76

Thank You

Details: papers, other talk slides: http://www.mathcs.emory.edu/~eugene/
