Information Extraction CS 652 Information Extraction and Integration.
October 13, 2006
What is Information Extraction?
Input:
• Specification:
  • Types of entities to find
  • Types of relations to find
  • Templates to fill
• Corpus of text:
  • Possibly formatted
  • Possibly annotated for linguistic structure
Output:
• Text + annotation:
  • Entities tagged w/ type and coreference info
  • Relations b/t entities tagged
• Filled templates:
  • Instances of templates found in text
MUC: Genesis of IE
• DARPA funded significant efforts in IE in the early to mid 1990s.
• Message Understanding Conference (MUC) was an annual event/competition where results were presented.
• Focused on extracting information from news articles:
  • Terrorist events
  • Industrial joint ventures
  • Company management changes
• Information extraction is of particular interest to the intelligence community (CIA, NSA). (Note: early ’90s)
MUC
• Named entity: Person, Organization, Location
• Co-reference: Clinton = President Bill Clinton
• Template element: Perpetrator, Target
• Template relation: Incident
• Multilingual
Named entities and events
San Salvador, 19 Apr 89 (ACAN-EFE) -- [TEXT] Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime. … Garcia Alvarado, 56, was killed when a bomb placed by urban guerrillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador. … Vice President-elect Francisco Merino said that when the attorney general's car stopped at a light on a street in downtown San Salvador, an individual placed a bomb on the roof of the armored vehicle. …
Coreference links
(Same passage as above. For example, “Attorney General Roberto Garcia Alvarado”, “Garcia Alvarado”, and “the attorney general” are mentions of the same person.)
(Partial) Scenario template
Incident: Date                          19 Apr 89
Incident: Location                      El Salvador: San Salvador (CITY)
Incident: Type                          Bombing
Perpetrator: Individual ID              “urban guerrillas”
Perpetrator: Organization ID            “FMLN”
Perpetrator: Organization Confidence    Suspected or Accused by Authorities: “FMLN”
Physical Target: Description            “vehicle”
Physical Target: Effect                 Some Damage: “vehicle”
Human Target: Name                      “Roberto Garcia Alvarado”
Human Target: Description               “attorney general”: “Roberto Garcia Alvarado”
Human Target: Effect                    Death: “Roberto Garcia Alvarado”
MUC Typical Text
Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production of 20,000 iron and “metal wood” clubs a month
MUC Templates
Relationship: tie-up
Entities: Bridgestone Sports Co., a local concern, a Japanese trading house
Joint venture company: Bridgestone Sports Taiwan Co.
Activity: ACTIVITY 1
Amount: NT$20,000,000
MUC Templates
ACTIVITY 1
Activity: Production
Company: Bridgestone Sports Taiwan Co.
Product: Iron and “metal wood” clubs
Start Date: January 1990
Example from FASTUS (1993)
Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan.
The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.
TIE-UP-1
  Relationship: TIE-UP
  Entities: “Bridgestone Sports Co.”, “a local concern”, “a Japanese trading house”
  Joint Venture Company: “Bridgestone Sports Taiwan Co.”
  Activity: ACTIVITY-1
  Amount: NT$20000000
ACTIVITY-1
  Activity: PRODUCTION
  Company: “Bridgestone Sports Taiwan Co.”
  Product: “iron and ‘metal wood’ clubs”
  Start Date: DURING: January 1990
Automated Content Extraction
Objectives:
• Extract information from texts of varying quality
• Detect unique entities, events, and relations:
  • Find all entity mentions
  • Link mentions by entity
• Track entities within and across documents
• Output XML for downstream processes
ACE entity and mention types
Entity Type Subtypes
Person (PER) N/A
Organization (ORG) Government, Commercial, Educational, Non-profit, Other
Location (LOC) Address, Boundary, Celestial, Land-Region-Natural, Region-Local, Region-Subnational, Region-National, Region-International, Water-Body, Other
Geo-Political Entity (GPE) Continent, Nation, State-or-Province, County-or-District, Population-Center, Other
Facility (FAC) Building, Subarea-Building, Bounded-Area, Conduit, Path, Barrier, Plant, Other
Vehicle (VEH) Land, Air, Water, Subarea-Vehicle, Other
Weapon (WEA) Blunt, Exploding, Sharp, Chemical, Biological, Shooting, Projectile, Nuclear, Other
Entity Mention Type Description
Name (NAM) A proper name reference to the entity
Nominal (NOM) A common noun reference to the entity
Pronominal (PRO) A pronoun reference to the entity
Premodifier (PRE) A premodifier reference to the entity
ACE relation and event types
Relation Type Subtypes
Physical (PHYS) Located, Near, Part-Whole
Personal/Social (PER-SOC) Business, Family, Other
Employment/Membership/Subsidiary (EMP-ORG) Employ-Executive, Employ-Staff, Employ-Undetermined, Member-of-Group, Partner, Subsidiary, Other
Agent-Artifact (ART) User-or-Owner, Inventor-or-Manufacturer, Other
PER/ORG Affiliation (OTHER-AFF) Ethnic, Ideology, Other
GPE Affiliation Citizen-or-Resident, Based-in, Other
Discourse (DISC) N/A
Event Types
Destruction/ Damage (BRK)
Creation/ Improvement (MAK)
Transfer of Possession or Control (GIV)
Movement (MOV)
Interaction of Agents (INT)
Event roles
Agent
Object
Source (MOV/GIV)
Target (MOV/GIV)
Time
Location
Other
Applications
• Information gathering (intelligence tasks)
• Question answering
  • Answer extraction from retrieved documents
• Ontology induction
• Improving indexing for IR
IE task breakdown
• Entities:
  • Identification: finding entity mentions
  • Classification: determining entity type
  • Normalization: standardizing entity mentions (e.g., identifying co-referring entity mentions)
• Relations:
  • Association: identifying related entities and their relations
Two approaches to IE
• Knowledge-engineering approach
  • Grammar rules built by hand
  • Human expert generates domain-specific patterns through introspection and corpus work
  • Iterative process: build, test, evaluate errors, repeat
• Data-driven approach
  • Use statistical methods
  • Learn recognizers and classifiers from annotated data where available
  • Leverage unannotated corpora, if possible, by bootstrapping
Knowledge engineering
Advantages:
• Conceptually straightforward
• Best-performing systems still hand-built
Disadvantages:
• Lots of human effort required
• Human expertise also required
• Not readily portable to new domains or languages
Data-driven approach
Advantages:
• Porting to new domains straightforward
• Domain expertise not necessary
• Good coverage is ensured
Disadvantages:
• Training data may not exist or may be difficult to acquire
• Changes in specification may require re-annotation of training data
Which approach to use?
Use hand-built rule-based approach when:
• Resources (esp. lexicons) available
• Rule writers available
• Training data unavailable or hard to get
• Extraction specifications subject to change
• Highest possible performance needed
Use data-driven approach when:
• Resources unavailable
• Rule writers unavailable
• Training data cheap and plentiful
• Extraction specifications stable
• Good performance good enough
Typical NLP tasks for IE
• Tokenization
  • Finding word boundaries
• Lexical lookup
  • Using domain lexicons w/ type information, e.g., first-name lists, place-name lists, etc.
• Part-of-speech tagging
  • POS tags provide generalization for later processes
  • Can be hand-built or machine-learned
• Shallow parsing
• Coreference resolution
Shallow parsing: cascaded finite-state transducers
• Limited linguistic analysis:
  • Grammar divided into levels (chunks and clauses)
  • Pipeline of finite-state recognizers/transducers
• Robust:
  • Local decisions, no global optimization
• Easy-first parsing
  • High-precision decisions
  • Attachment decisions can be indefinitely delayed
• Time and space efficient
  • Deterministic search
Natural Language Processing-based Information Extraction
• If extracting from automatically generated web pages, simple regex patterns usually work.
• If extracting from more natural, unstructured, human-written text, some NLP may help.
  • Part-of-speech (POS) tagging: mark each word as a noun, verb, preposition, etc.
  • Syntactic parsing: identify phrases: NP, VP, PP
  • Semantic word categories (e.g. from WordNet): KILL: kill, murder, assassinate, strangle, suffocate
• Extraction patterns can use POS or phrase tags.
  • Crime victim:
    • Prefiller: [POS: V, Hypernym: KILL]
    • Filler: [Phrase: NP]
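The “Crime victim” pattern above can be sketched in a few lines, assuming the text has already been tagged and chunked by earlier stages. The `Chunk` record, the `find_victims` helper, and the example sentence are hypothetical; the `KILL` word list mirrors the slide and stands in for a WordNet hypernym lookup.

```python
from typing import List, NamedTuple

class Chunk(NamedTuple):
    text: str    # surface string of the chunk
    phrase: str  # phrase tag, e.g. "NP", "VP", "PP"
    pos: str     # POS tag of the head word, e.g. "V", "N"
    lemma: str   # lemma of the head word

# Semantic class from the slide (assumed here to be a plain word list).
KILL = {"kill", "murder", "assassinate", "strangle", "suffocate"}

def find_victims(chunks: List[Chunk]) -> List[str]:
    """Apply: Prefiller [POS: V, Hypernym: KILL], Filler [Phrase: NP]."""
    victims = []
    for prev, cur in zip(chunks, chunks[1:]):
        if prev.pos == "V" and prev.lemma in KILL and cur.phrase == "NP":
            victims.append(cur.text)
    return victims

# Hypothetical pre-chunked sentence.
sentence = [
    Chunk("Urban guerrillas", "NP", "N", "guerrilla"),
    Chunk("assassinated", "VP", "V", "assassinate"),
    Chunk("the attorney general", "NP", "N", "general"),
]
print(find_victims(sentence))
```

The pattern only looks at the chunk immediately after the verb; a real system would also need passive constructions such as “was killed”.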
FASTUS
1. Complex Words: recognition of multi-words and proper names
2. Basic Phrases: simple noun groups, verb groups and particles
3. Complex Phrases: complex noun groups and verb groups
4. Domain Events: patterns for events of interest to the application; basic templates are built
5. Merging Structures: templates from different parts of the text are merged if they provide information about the same entity or event
Based on finite state automata (FSA) transductions
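[Figure: example phrases recognized by the stages: “set up”, “new Taiwan dollars”, “a Japanese trading house”, “had set up”, “production of 20,000 iron and metal wood clubs”; domain event pattern: [company] [set up] [Joint-Venture] with [company]]
[Figure: finite automaton for noun groups, with states 0 to 4 and arcs labeled PN, ’s, ADJ, Art, N, P. Example: “John’s interesting book with a nice cover”]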
Grep++ = Cascaded grepping
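As a rough illustration of “cascaded grepping”, a noun-group recognizer can be written as a regular expression over a POS-tag string. The grammar below is an assumed reading of the noun-group automaton (the arc structure of the figure is not available), and the tag string for the example phrase is hand-assigned.

```python
import re

# Hand-assigned tags for: John 's interesting book with a nice cover
TAGS = "PN 's ADJ N P Art ADJ N"

# Assumed grammar (not a transcription of the slide's automaton):
#   NG -> (PN 's | Art)? ADJ* (N | PN) (P NG)*
SIMPLE_NG = r"(?:(?:PN 's|Art) )?(?:ADJ )*(?:N|PN)"
NOUN_GROUP = re.compile(rf"{SIMPLE_NG}(?: P {SIMPLE_NG})*")

def is_noun_group(tags: str) -> bool:
    """True if the whole tag sequence is accepted as a noun group."""
    return NOUN_GROUP.fullmatch(tags) is not None

print(is_noun_group(TAGS))
print(is_noun_group("ADJ P N"))
```

A later stage in the cascade would then treat each recognized noun group as a single symbol when matching domain patterns.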
Rule-based Extraction Examples
Determining which person holds what office in what organization:
• [person] , [office] of [org]
  • Vuk Draskovic, leader of the Serbian Renewal Movement
• [org] (named, appointed, etc.) [person] P [office]
  • NATO appointed Wesley Clark as Commander in Chief
Determining where an organization is located:
• [org] in [loc]
  • NATO headquarters in Brussels
• [org] [loc] (division, branch, headquarters, etc.)
  • KFOR Kosovo headquarters
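Two of the patterns above can be sketched with regular expressions, assuming an earlier stage has already marked entities inline. The `<PER>`, `<ORG>`, `<LOC>` markup, the optional “headquarters” token, and the `extract` helper are illustrative assumptions, not part of any particular system.

```python
import re

# Hypothetical output of an earlier entity-recognition stage.
TAGGED = ("<PER>Vuk Draskovic</PER>, leader of the "
          "<ORG>Serbian Renewal Movement</ORG>. "
          "<ORG>NATO</ORG> headquarters in <LOC>Brussels</LOC>.")

# [person] , [office] of [org]
HOLDS_OFFICE = re.compile(
    r"<PER>(?P<person>[^<]+)</PER>,\s+(?P<office>[\w ]+?)\s+of\s+(?:the\s+)?"
    r"<ORG>(?P<org>[^<]+)</ORG>")

# [org] in [loc], allowing an optional "headquarters" after the org
LOCATED_IN = re.compile(
    r"<ORG>(?P<org>[^<]+)</ORG>(?:\s+headquarters)?\s+in\s+"
    r"<LOC>(?P<loc>[^<]+)</LOC>")

def extract(text):
    """Return (relation, arg, ...) tuples for every pattern match."""
    facts = []
    for m in HOLDS_OFFICE.finditer(text):
        facts.append(("holds-office", m["person"], m["office"], m["org"]))
    for m in LOCATED_IN.finditer(text):
        facts.append(("located-in", m["org"], m["loc"]))
    return facts

for fact in extract(TAGGED):
    print(fact)
```

Patterns like these are precise but brittle, which is why rule sets tend to grow through the build, test, and repeat cycle.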
IE with Hidden Markov Models
Hidden Markov Models
[Figure: an HMM shown as a finite state model and as a graphical model, with states … S_{t-1}, S_t, S_{t+1} … and observations … O_{t-1}, O_t, O_{t+1} …]
Parameters, for all states S = {s_1, s_2, …}:
• Start state probabilities: P(s_1)
• Transition probabilities: P(s_t | s_{t-1})
• Observation (emission) probabilities: P(o_t | s_t)
Training: maximize probability of training observations (w/ prior)
$$ P(s, o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t) $$
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …
[Figure: a finite state model whose transitions generate a state sequence and whose emissions generate an observation sequence o1 o2 o3 o4 o5 o6 o7 o8]
Observations: usually a multinomial over an atomic, fixed alphabet
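A minimal sketch of how the joint probability of a state sequence and an observation sequence is computed from the three parameter tables. The two states, the tiny vocabulary, and all probability values are made up for illustration.

```python
import math

# Hypothetical toy parameters (not from the slides).
start = {"name": 0.3, "bg": 0.7}                      # start state probabilities
trans = {"name": {"name": 0.6, "bg": 0.4},            # P(s_t | s_{t-1})
         "bg":   {"name": 0.2, "bg": 0.8}}
emit = {"name": {"Pedro": 0.5, "spoke": 0.1, "today": 0.4},   # P(o_t | s_t)
        "bg":   {"Pedro": 0.1, "spoke": 0.5, "today": 0.4}}

def joint_log_prob(states, observations):
    """log P(s, o): start term plus one transition and one emission per step."""
    logp = math.log(start[states[0]]) + math.log(emit[states[0]][observations[0]])
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        logp += math.log(trans[prev][cur]) + math.log(emit[cur][obs])
    return logp

# P = 0.3 * 0.5 (start, emit "Pedro") * 0.4 * 0.5 (name->bg, emit "spoke")
print(math.exp(joint_log_prob(["name", "bg"], ["Pedro", "spoke"])))
```

Working in log space avoids numerical underflow on long sequences.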
Markov Property
[Figure: state-transition diagram over S1: rain, S2: cloud, S3: sun, with arcs labeled 1/2, 1/2, 1/3, 2/3, 1]
The state of a system at time t+1, q_{t+1}, is conditionally independent of {q_{t-1}, q_{t-2}, …, q_1, q_0} given q_t.
In other words, the current state determines the probability distribution for the next state.
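Markov Property
[Figure: state-transition diagram over S1: rain, S2: cloud, S3: sun, with arcs labeled 1/2, 1/2, 1/3, 2/3, 1]
State-transition probabilities: A = [3×3 matrix over S1, S2, S3; row and column layout not recoverable; entries: 0, 0.33, 0.67, 0.5, 0.5, 1]
Q: given today is sunny (i.e., q_1 = 3), what is the probability of “sun-cloud” with the model?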
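A question of this kind can be answered by multiplying transition probabilities along the sequence. The matrix below is an assumed assignment of the arc labels 1/2, 1/2, 1/3, 2/3, 1 to transitions, chosen only so that each row sums to 1; it is not confirmed by the source.

```python
# Assumed transition table A[i][j] = P(next = j | current = i).
A = {
    "rain":  {"rain": 0.0, "cloud": 0.0, "sun": 1.0},
    "cloud": {"rain": 0.5, "cloud": 0.5, "sun": 0.0},
    "sun":   {"rain": 2 / 3, "cloud": 1 / 3, "sun": 0.0},
}

def sequence_prob(first, rest):
    """P(rest | first): product of one-step transition probabilities."""
    p, prev = 1.0, first
    for state in rest:
        p *= A[prev][state]
        prev = state
    return p

# Today is sunny; probability that tomorrow is cloudy under the assumed table.
print(sequence_prob("sun", ["cloud"]))
# A longer forecast: rain, then sun, then cloud.
print(sequence_prob("sun", ["rain", "sun", "cloud"]))
```

Because of the Markov property, each factor depends only on the immediately preceding state.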
Hidden Markov Model
[Figure: the chain over S1: rain, S2: cloud, S3: sun (arcs labeled 1/2, 1/2, 1/3, 2/3, 1) with additional arcs labeled 4/5, 1/10, 7/10, 1/5, 3/10, 9/10; hidden state sequences generate observations O1 O2 O3 O4 O5]
IE with Hidden Markov Model
Given a sequence of observations:
  SI/EECS 767 is held weekly at SIN2 .
and a trained HMM with states: course name, location name, background
Find the most likely state sequence (Viterbi):
$$ \hat{s} = \arg\max_{s} P(s, o) $$
Any words said to be generated by the designated “course name” state are extracted as a course name:
  Course name: SI/EECS 767
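The decoding step can be sketched with a small Viterbi implementation. The three states follow the slide; every probability below is invented for illustration, and `unk` is an assumed floor for words a state has never emitted.

```python
import math

def viterbi(observations, states, start, trans, emit, unk=1e-6):
    """Return the state sequence maximizing P(s, o) for an HMM."""
    def e(s, o):
        return math.log(emit[s].get(o, unk))
    # best[t][s]: log-probability of the best path ending in state s at time t
    best = [{s: math.log(start[s]) + e(s, observations[0]) for s in states}]
    back = []
    for o in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + math.log(trans[p][s]))
            scores[s] = best[-1][prev] + math.log(trans[prev][s]) + e(s, o)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical "trained" parameters.
states = ["background", "course", "location"]
start = {"background": 0.4, "course": 0.5, "location": 0.1}
trans = {
    "background": {"background": 0.6, "course": 0.1, "location": 0.3},
    "course":     {"background": 0.4, "course": 0.5, "location": 0.1},
    "location":   {"background": 0.5, "course": 0.1, "location": 0.4},
}
emit = {
    "background": {"is": 0.25, "held": 0.25, "weekly": 0.25, "at": 0.25},
    "course":     {"SI/EECS": 0.5, "767": 0.5},
    "location":   {"SIN2": 1.0},
}

obs = "SI/EECS 767 is held weekly at SIN2".split()
path = viterbi(obs, states, start, trans, emit)
course_name = " ".join(w for w, s in zip(obs, path) if s == "course")
print(path)
print("Course name:", course_name)
```

Extraction then amounts to reading off the words aligned with the target state.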
Named Entity Extraction [Bikel, et al 1998]
Hidden states: start-of-sentence, end-of-sentence, Person, Org, (five other name classes), Other
Named Entity Extraction
Transition probabilities: P(s_t | s_{t-1}, o_{t-1})
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1})
(1) Generating first word of a name-class
(2) Generating the rest of words in the name-class
(3) Generating “+end+” in a name-class
HMM-Experimental Results
Train on ~500k words of news wire text.
Results:
Learning HMM for IE [Seymore, 1999]
Consider labeled, unlabeled, and distantly-labeled data
Some Issues with HMM
• Need to enumerate all possible observation sequences
• Not practical to represent multiple interacting features or long-range dependencies of the observations
• Very strict independence assumptions on the observations
We Want More than an Atomic View of Words
Would like richer representation of text: many arbitrary, overlapping features of the words:
• identity of word
• ends in “-ski”
• is capitalized
• is part of a noun phrase
• is in a list of city names
• is under node X in WordNet
• is in bold font
• is indented
• is in hyperlink anchor
• last person name was female
• next two words are “and Associates”
• …
[Figure: graphical model over S_{t-1}, S_t, S_{t+1} and O_{t-1}, O_t, O_{t+1}, with one observation annotated: is “Wisniewski”, part of noun phrase, ends in “-ski”]
Maximum Entropy Markov Models
[Figure: graphical model over S_{t-1}, S_t, S_{t+1} and O_{t-1}, O_t, O_{t+1}, with observations annotated by features: identity of word, ends in “-ski”, is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor, …; example: is “Wisniewski”, part of noun phrase, ends in “-ski”]
Idea: replace generative model in HMM with a maxent model, where state depends on observations:
$$ \Pr(s_t \mid x_t) = \ldots $$
Courtesy of William W. Cohen
[Lafferty, 2001]
Problems with Richer Representation and a Generative Model
These arbitrary features are not independent:
• Multiple levels of granularity (chars, words, phrases)
• Multiple dependent modalities (words, formatting, layout)
• Past & future
Two choices:
• Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
• Ignore the dependencies. This causes “over-counting” of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
[Figure: two copies of the graphical model over S_{t-1}, S_t, S_{t+1} and O_{t-1}, O_t, O_{t+1}]
MEMM
[Figure: graphical model over S_{t-1}, S_t, S_{t+1} and O_{t-1}, O_t, O_{t+1}, with observations annotated by features: identity of word, ends in “-ski”, is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor, …; example: is “Wisniewski”, part of noun phrase, ends in “-ski”]
Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state history:
$$ \Pr(s_t \mid x_t, s_{t-1}, s_{t-2}, \ldots) = \ldots $$
Courtesy of William W. Cohen
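HMM vs. MEMM
HMM [figure: graphical model over S_{t-1}, S_t, S_{t+1} and O_{t-1}, O_t, O_{t+1}]:
$$ \Pr(s, o) = \prod_{i} \Pr(s_i \mid s_{i-1}) \, \Pr(o_i \mid s_i) $$
MEMM [figure: the same variables, with each state conditioned on its observation]:
$$ \Pr(s \mid o) = \prod_{i} \Pr(s_i \mid s_{i-1}, o_i) $$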
Conditional Sequence Models
We prefer a model that is trained to maximize a conditional probability rather than joint probability: P(s|o) instead of P(s,o):
• Can examine features, but not responsible for generating them.
• Don’t have to explicitly model their dependencies.
• Don’t “waste modeling effort” trying to generate what we are given at test time anyway.
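Conditional Markov Models (CMMs) vs HMMs
HMM [figure: graphical model over S_{t-1}, S_t, S_{t+1} and O_{t-1}, O_t, O_{t+1}]:
$$ \Pr(s, o) = \prod_{i} \Pr(s_i \mid s_{i-1}) \, \Pr(o_i \mid s_i) $$
CMM [figure: the same variables, with each state conditioned on its observation]:
$$ \Pr(s \mid o) = \prod_{i} \Pr(s_i \mid s_{i-1}, o_i) $$
Lots of ML ways to estimate Pr(y | x)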
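Conditional Finite State Sequence Models
From HMMs to CRFs [Lafferty, McCallum, Pereira 2001] [McCallum, Freitag & Pereira, 2000]
$$ s = s_1, s_2, \ldots, s_n \qquad o = o_1, o_2, \ldots, o_n $$
[Figure: two graphical models over S_{t-1}, S_t, S_{t+1} and O_{t-1}, O_t, O_{t+1}, one for the joint model and one for the conditional model]
Joint:
$$ P(s, o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t) $$
Conditional:
$$ P(s \mid o) = \frac{1}{P(o)} \prod_{t=1}^{|o|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t) = \frac{1}{Z(o)} \prod_{t=1}^{|o|} \Phi_s(s_t, s_{t-1}) \, \Phi_o(o_t, s_t) $$
where
$$ \Phi_o(t) = \exp\left( \sum_k \lambda_k f_k(s_t, o_t) \right) $$
(A super-special case of Conditional Random Fields.)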
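Feature Functions
Arbitrary features of s, o, and t: f_k(s_t, s_{t-1}, o, t)
Example:
$$ f_{\langle \mathrm{Capitalized}, s_i, s_j \rangle}(s_t, s_{t-1}, o, t) = \begin{cases} 1 & \text{if } \mathrm{Capitalized}(o_t) \wedge s_i = s_{t-1} \wedge s_j = s_t \\ 0 & \text{otherwise} \end{cases} $$
o = o1 o2 o3 o4 o5 o6 o7 = Yesterday Pedro Domingos spoke this example sentence.
[Figure: state diagram with states s1, s2, s3, s4]
With o_2 = “Pedro” and a transition from s_1 to s_3 at t = 2, the feature fires:
$$ f_{\langle \mathrm{Capitalized}, s_1, s_3 \rangle}(s_t = s_3, s_{t-1} = s_1, o, 2) = 1 $$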
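A feature function of this form is just a small predicate over the current state, the previous state, the whole observation sequence, and the position. The sketch below assumes 1-based positions as on the slide; `make_capitalized_feature` is a hypothetical helper name.

```python
def make_capitalized_feature(s_i, s_j):
    """Build f_<Capitalized, s_i, s_j>(s_t, s_prev, o, t)."""
    def f(s_t, s_prev, o, t):
        word = o[t - 1]  # t is 1-based, so o_t is o[t - 1]
        fires = word[:1].isupper() and s_prev == s_i and s_t == s_j
        return 1 if fires else 0
    return f

o = "Yesterday Pedro Domingos spoke this example sentence.".split()
f = make_capitalized_feature("s1", "s3")

print(f("s3", "s1", o, 2))  # "Pedro" is capitalized and the transition is s1 -> s3
print(f("s3", "s1", o, 4))  # "spoke" is not capitalized
print(f("s2", "s1", o, 2))  # capitalized, but the transition is s1 -> s2
```

Because the function receives the full sequence `o`, it could just as well inspect neighboring words or formatting, which is what makes such features “arbitrary”.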
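Learning Parameters of CRFs
Maximize log-likelihood of parameters λ_k given training data D:
$$ L = \sum_{\langle s, o \rangle \in D} \log \left( \frac{1}{Z(o)} \exp \left( \sum_{t=1}^{|o|} \sum_k \lambda_k f_k(s_t, s_{t-1}, o, t) \right) \right) - \sum_k \frac{\lambda_k^2}{2\sigma^2} $$
Log-likelihood gradient:
$$ \frac{\partial L}{\partial \lambda_k} = \sum_{\langle s, o \rangle \in D} \#_k(s, o) - \sum_i \sum_{s'} P(s' \mid o^{(i)}) \, \#_k(s', o^{(i)}) - \frac{\lambda_k}{\sigma^2} $$
where
$$ \#_k(s, o) = \sum_t f_k(s_t, s_{t-1}, o, t) $$
Methods:
• iterative scaling (quite slow – 2000 iterations from good start)
• gradient, conjugate gradient (faster)
• limited-memory quasi-Newton methods (“super fast”)
[Sha & Pereira 2002] & [Malouf 2002]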
Voted Perceptron Sequence Models
Given training data: $\{ \langle o, s \rangle^{(i)} \}$
Initialize parameters to zero: $\forall k \; \lambda_k = 0$
Iterate to convergence: for all training instances i,
$$ s_{\mathrm{Viterbi}} = \arg\max_{s} \exp \left( \sum_t \sum_k \lambda_k f_k(s_t, s_{t-1}, o^{(i)}, t) \right) $$
$$ \forall k: \; \lambda_k \leftarrow \lambda_k + C_k(s^{(i)}, o^{(i)}) - C_k(s_{\mathrm{Viterbi}}, o^{(i)}) $$
where $C_k(s, o) = \sum_t f_k(s_t, s_{t-1}, o, t)$ as before.
The update is analogous to the gradient for this one training instance.
[Collins 2001; also Hofmann 2003, Taskar et al 2003]
Avoids the tricky math; very fast; uses “pseudo-negative” examples of sequences; approximates a margin classifier for “good” vs “bad” sequences
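The per-instance update can be sketched as follows. Only the weight update is shown; the Viterbi decoder that produces `predicted` is assumed to exist, and voting or averaging of weight vectors is left out. The two feature templates in `feature_counts` are hypothetical.

```python
from collections import Counter, defaultdict

def feature_counts(states, obs):
    """C_k(s, o): how often each (hypothetical) feature k fires along the sequence."""
    counts = Counter()
    prev = "<start>"
    for s, word in zip(states, obs):
        counts[("trans", prev, s)] += 1                 # state-transition feature
        counts[("cap", word[:1].isupper(), s)] += 1     # capitalization feature
        prev = s
    return counts

def perceptron_update(weights, gold, predicted, obs):
    """lambda_k <- lambda_k + C_k(gold, o) - C_k(predicted, o), in place."""
    if predicted == gold:
        return weights
    delta = feature_counts(gold, obs)
    delta.subtract(feature_counts(predicted, obs))
    for k, v in delta.items():
        if v:
            weights[k] += v
    return weights

weights = defaultdict(float)
obs = ["Pedro", "spoke"]
gold = ["PER", "O"]
predicted = ["O", "O"]          # stand-in for the decoder's current best guess
perceptron_update(weights, gold, predicted, obs)
print(dict(weights))
```

Features that fire equally often in the gold and predicted sequences cancel, so only the points of disagreement move the weights.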
Broader Issues in IE
Broader View
[Figure: pipeline diagram with components: Document collection, Spider, Filter by relevance, IE (Segment, Classify, Associate, Cluster), Load DB, Database, Query/Search, Data mine; supporting steps: Create ontology, Label training data, Train extraction models]
Up to now we have been focused on segmentation and classification
Broader View
[Figure: pipeline diagram with components: Document collection, Spider, Filter by relevance, IE (Tokenize, Segment, Classify, Associate, Cluster), Load DB, Database, Query/Search, Data mine; supporting steps: Create ontology, Label training data, Train extraction models; five issues are marked 1 to 5]
Now touch on some other issues
(1) Association as Binary Classification
[Zelenko et al, 2002]
Christos Faloutsos [Person] conferred with Ted Senator [Person], the KDD 2003 General Chair [Role].
Person-Role (Christos Faloutsos, KDD 2003 General Chair): NO
Person-Role (Ted Senator, KDD 2003 General Chair): YES
Do this with SVMs and tree kernels over parse trees.
(1) Association with Finite State Machines [Ray & Craven, 2001]
… This enzyme, UBC6, localizes to the endoplasmic reticulum, with the catalytic domain facing the cytosol. …
Tagged tokens: DET this, N enzyme, N ubc6, V localizes, PREP to, ART the, ADJ endoplasmic, N reticulum, PREP with, ART the, ADJ catalytic, N domain, V facing, ART the, N cytosol
Extracted relation: Subcellular-localization (UBC6, endoplasmic reticulum)
(1) Association with Graphical Models [Roth & Yih 2002]
Capture arbitrary-distance dependencies among predictions.
• Local language models contribute evidence to entity classification.
• Local language models contribute evidence to relation classification.
• Random variable over the class of entity #2, e.g. over {person, location, …}
• Random variable over the class of relation between entity #2 and #1, e.g. over {lives-in, is-boss-of, …}
• Dependencies between classes of entities and relations!
Inference with loopy belief propagation.
When do two extracted strings refer to the same object?
(2) Learning a Distance Metric Between Records [Borthwick, 2000; Cohen & Richman, 2001; Bilenko & Mooney, 2002, 2003]
Learn Pr({duplicate, not-duplicate} | record1, record2) with a Maximum Entropy classifier.
Do greedy agglomerative clustering using this probability as a distance metric.
(2) String Edit Distance
distance(“William Cohen”, “Willliam Cohon”)
Alignment:
s:    W I L L - I A M _ C O H E N
t:    W I L L L I A M _ C O H O N
op:   C C C C I C C C C C C C S C
cost: 0 0 0 0 1 1 1 1 1 1 1 1 2 2
(C = copy, I = insert, S = substitute; cost is the running total)
(2) Computing String Edit Distance
D(i,j) = min of:
  D(i-1,j-1) + d(s_i,t_j)   // subst/copy
  D(i-1,j) + 1              // insert
  D(i,j-1) + 1              // delete
    C O H E N
M   1 2 3 4 5
C   1 2 3 4 5
C   2 2 3 4 5
O   3 2 3 4 5
H   4 3 2 3 4
N   5 4 3 3 3
A trace indicates where the min value came from, and can be used to find edit operations and/or a best alignment (may be more than 1)
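The recurrence can be sketched directly as a dynamic program with unit costs. The function returns the full table so that a trace could be read from it; the base cases D(i,0) = i and D(0,j) = j are the standard ones and are assumed here, since the slide does not state them.

```python
def edit_distance_table(s, t):
    """Fill D(i, j) for unit-cost substitute/insert/delete; D[-1][-1] is the distance."""
    D = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        D[i][0] = i          # i edits to turn s[:i] into the empty string
    for j in range(len(t) + 1):
        D[0][j] = j          # j edits to turn the empty string into t[:j]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1   # copy costs 0, substitute costs 1
            D[i][j] = min(D[i - 1][j - 1] + d,
                          D[i - 1][j] + 1,
                          D[i][j - 1] + 1)
    return D

print(edit_distance_table("MCCOHN", "COHEN")[-1][-1])
print(edit_distance_table("William Cohen", "Willliam Cohon")[-1][-1])
```

Storing, for each cell, which of the three options produced the minimum gives the trace used to recover an alignment.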
Learn these parameters
(2) String Edit Distance Learning
Precision/recall for MAILING dataset duplicate detection
[Bilenko & Mooney, 2002, 2003]
(2) Information Integration
Goal might be to merge results of two IE systems:
Name: Introduction to Computer Science
Number: CS 101
Teacher: M. A. Kludge
Time: 9-11am
Name: Data Structures in Java
Room: 5032 Wean Hall
Title: Intro. to Comp. Sci.
Num: 101
Dept: Computer Science
Teacher: Dr. Klüdge
TA: John Smith
Topic: Java Programming
Start time: 9:10 AM
[Minton, Knoblock, et al 2001], [Doan, Domingos, Halevy 2001],[Richardson & Domingos 2003]
(2) Other Information Integration Issues
• Distance metrics for text – which work well? [Cohen, Ravikumar, Fienberg, 2003]
• Finessing integration by soft database operations based on similarity [Cohen, 2000]
• Integration of complex structured databases (capture dependencies among multiple merges) [Cohen, McAllester, Kautz KDD 2000; Pasula, Marthi, Milch, Russell, Shpitser, NIPS 2002; McCallum and Wellner, KDD WS 2003]
(5) Working with IE Data
Some special properties of IE data:
• It is based on extracted text
• It is “dirty” (missing or extraneous facts, improperly normalized entity names, etc.)
• May need cleaning before use
What operations can be done on dirty, unnormalized databases?
• Datamine it directly.
• Query it directly with a language that has “soft joins” across similar, but not identical keys. [Cohen 1998]
• Use it to construct features for learners [Cohen 2000]
• Infer a “best” underlying clean database [Cohen, Kautz, McAllester, KDD2000]
Evaluating IE Accuracy
• Always evaluate performance on independent, manually-annotated test data not used during system development.
• Template measure for each test document:
  • Total number of correct extractions in the solution template: N
  • Total number of slot/value pairs extracted by the system: E
  • Number of extracted slot/value pairs that are correct (i.e. in the solution template): C
• Compute average value of metrics adapted from IR:
  • Recall = C/N
  • Precision = C/E
  • F-Measure = Harmonic mean of recall and precision
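These per-document measures can be computed from two sets of slot/value pairs. The sketch below assumes exact matching of pairs; the gold and extracted templates are invented examples loosely based on the joint-venture text.

```python
def template_scores(gold, extracted):
    """gold, extracted: sets of (slot, value) pairs for one test document."""
    n, e = len(gold), len(extracted)      # N and E
    c = len(gold & extracted)             # C: extracted pairs that are correct
    recall = c / n if n else 0.0
    precision = c / e if e else 0.0
    f1 = 2 * precision * recall / (precision + recall) if c else 0.0
    return precision, recall, f1

# Hypothetical solution template and system output.
gold = {
    ("Relationship", "TIE-UP"),
    ("Joint Venture Company", "Bridgestone Sports Taiwan Co."),
    ("Activity", "PRODUCTION"),
    ("Start Date", "January 1990"),
}
extracted = {
    ("Relationship", "TIE-UP"),
    ("Joint Venture Company", "Bridgestone Sports Taiwan Co."),
    ("Activity", "SALES"),
}

precision, recall, f1 = template_scores(gold, extracted)
print(f"P={precision:.3f} R={recall:.3f} F={f1:.3f}")
```

Here N = 4, E = 3, and C = 2; corpus-level figures would average these values over all test documents.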
MUC Information Extraction: State of the Art c. 1997
• NE – named entity recognition
• CO – coreference resolution
• TE – template element construction
• TR – template relation construction
• ST – scenario template production
Summary and prelude
• We’ve looked at the “fragment extraction” task. Future?
  • Top-down semantic constraints (as well as syntax)?
  • Unified framework for extraction from regular & natural text? (BWI is one tiny step; Webfoot [Soderland 1999] is another.)
• Beyond fragment extraction:
  • Anaphora resolution, discourse processing, ...
  • Fragment extraction is good enough for many Web information services!
• Next time:
  • Learning methods for information extraction
Three generations of IE systems
• Hand-Built Systems – Knowledge Engineering [1980s– ]
  • Rules written by hand
  • Require experts who understand both the systems and the domain
  • Iterative guess-test-tweak-repeat cycle
• Automatic, Trainable Rule-Extraction Systems [1990s– ]
  • Rules discovered automatically using predefined templates, using methods like ILP
  • Require huge, labeled corpora (effort is just moved!)
• Machine Learning (Sequence) Models [1997 – ]
  • One decodes a statistical model that classifies the words of the text, using HMMs, random fields or statistical parsers
  • Learning usually supervised; may be partially unsupervised
Basic IE References
Douglas E. Appelt and David Israel. 1999. Introduction to Information Extraction Technology. IJCAI 1999 Tutorial. http://www.ai.sri.com/~appelt/ie-tutorial/
Kushmerick, Weld, Doorenbos: Wrapper Induction for Information Extraction. IJCAI 1997. http://www.cs.ucd.ie/staff/nick/
Stephen Soderland: Learning Information Extraction Rules for Semi-Structured and Free Text. Machine Learning 34(1-3): 233-272 (1999)
Some IE Tools Available
MALLET (UMass)
• Statistical natural language processing, document classification, clustering, information extraction, and other machine learning applications to text.
• Sample application: GeneTaggerCRF, a gene-entity tagger based on MALLET (MAchine Learning for LanguagE Toolkit). It uses conditional random fields to find genes in a text file.
MinorThird
• http://minorthird.sourceforge.net/
• “a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text”
• Stored documents can be annotated in independent files using TextLabels (denoting, say, part-of-speech and semantic information)
GATE
• http://gate.ac.uk/ie/annie.html
• Leading toolkit for text mining, distributed with an information extraction component set called ANNIE (demo)
• Used in many research projects; a long list can be found on its website
• Under integration with IBM UIMA
Sunita Sarawagi's CRF package
• http://crf.sourceforge.net/
• A Java implementation of conditional random fields for sequential labeling.
UIMA (IBM)
• Unstructured Information Management Architecture.
• A platform for unstructured information management solutions from combinations of semantic analysis (IE) and search components.
Some Interesting Websites Based on IE
• ZoomInfo
• CiteSeer.org (some of us use it every day!)
• Google Local, Google Scholar
• and many more…