Post on 19-Dec-2015
Information Extraction
Shallow Processing Techniques for NLP
Ling570
December 5, 2011
Roadmap
Information extraction:
  Definition & Motivation
  General approach
  Relation extraction techniques:
    Fully supervised
    Lightly supervised
Information Extraction
Motivation:
  Unstructured data hard to manage, assess, analyze
  Many tools for manipulating structured data:
    Databases: efficient search, agglomeration
    Statistical analysis packages
    Decision support systems
Goal:
  Given unstructured information in texts
  Convert to structured data
  E.g., populate DBs
IE Example
Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco.
Lead airline: United Airlines
Amount: $6
Effective Date: 2006-10-26
Follower: American Airlines
Other Applications
Shared tasks:
  Message Understanding Conferences (MUC)
    Joint ventures/mergers:
      CompanyA, CompanyB, CompanyC, amount, type
    Terrorist incidents:
      Incident type, Actor, Victim, Date, Location
General:
  Pricegrabber
  Stock market analysis
  Apartment finder
Information Extraction Components
Incorporates range of techniques, tools:
  FSTs (pattern extraction)
  Supervised learning:
    Classification
    Sequence labeling
  Semi-, un-supervised learning: clustering, bootstrapping
  Part-of-speech tagging
  Named Entity Recognition
  Chunking
  Parsing
Information Extraction Process
Typically pipeline of major components
First step: Named Entity Recognition (NER)
  Often application-specific
    Generic types: Person, Organization, Location
    Specific: genes to college courses
  Identify NE mentions: specific instances
Example: Pre-NER
Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp., immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco.
Example: post-NER
Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].
Information Extraction: Reference Resolution
Builds on NER
Clusters entity mentions referring to same things
  Creates entity equivalence classes
Includes anaphora resolution
  E.g. pronoun resolution (it, he, she…; one; this, that)
Also non-anaphoric, general mentions
Relation Detection & Classification
Relation detection & classification: find and classify relations among entities in text
Mix of application-specific and generic relations
Generic relation types:
  Family, employment, part-whole, membership, geospatial
Generic relations in example:
  United is part-of UAL Corp.
  American Airlines is part-of AMR
  Tim Wagner is an employee of American Airlines
Domain-specific relations:
  United serves Chicago; United serves Denver, etc.
Event Detection & Classification
Event detection and classification: find key events
  Fare increase by United
  Matching fare increase by American
  Reporting of fare increases
How cued?
  Instances of “said” and “cite”
Temporal Expressions: Recognition and Analysis
Temporal expression recognition
  Finds temporal events in text
    E.g. Friday, Thursday in example text
  Others:
    Date expressions:
      Absolute: days of week, holidays
      Relative: two days from now, next year
    Clock times: noon, 3:30pm
Temporal analysis: map expressions to points/spans on timeline
  Anchor: e.g. Friday, Thursday: tied to example dateline
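
The anchoring step can be sketched in a few lines of Python. This is only an illustration: the function name `resolve_weekday` is hypothetical, and the dateline of 2006-10-27 is an assumption inferred from the 2006-10-26 effective date in the filled template. The sketch also assumes a bare weekday name refers to the most recent such day on or before the dateline, which suits past-tense news text but not every case.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def resolve_weekday(expression: str, dateline: date) -> date:
    """Map a bare weekday name to the most recent such day on or before the dateline."""
    target = WEEKDAYS.index(expression.lower())
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    days_back = (dateline.weekday() - target) % 7
    return dateline - timedelta(days=days_back)

# Assumed dateline: the Friday on which the example story was reported.
DATELINE = date(2006, 10, 27)

friday = resolve_weekday("Friday", DATELINE)
thursday = resolve_weekday("Thursday", DATELINE)
print(friday.isoformat(), thursday.isoformat())
```

Relative expressions such as "two days from now" would need their own rules on top of the same anchor.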
Template Filling
Texts describe stereotypical situations in domain
Identify texts evoking template situations
Fill slots in templates based on text
In example: fare-raise is stereotypical event
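
A template can be pictured as a record with named slots. The sketch below uses the four slots from the fare-raise example; the class name `FareRaiseTemplate` and the helper `fill_template` are hypothetical, and the extracted facts are typed in by hand rather than produced by an actual extractor.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FareRaiseTemplate:
    # Slots taken from the IE example; unfilled slots stay None.
    lead_airline: Optional[str] = None
    amount: Optional[str] = None
    effective_date: Optional[str] = None
    follower: Optional[str] = None

def fill_template(extracted: dict) -> FareRaiseTemplate:
    """Copy extracted facts into matching slots, ignoring facts with no slot."""
    template = FareRaiseTemplate()
    for slot, value in extracted.items():
        if hasattr(template, slot):
            setattr(template, slot, value)
    return template

# Hand-written stand-in for the output of the earlier pipeline stages.
extracted = {
    "lead_airline": "United Airlines",
    "amount": "$6",
    "effective_date": "2006-10-26",
    "follower": "American Airlines",
    "spokesman": "Tim Wagner",  # no such slot, so it is dropped
}
filled = fill_template(extracted)
print(filled)
```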
Information Extraction Steps
Named Entity Recognition
Reference Resolution
Relation Detection & Classification
Event Detection & Classification
Temporal Expression Recognition & Analysis
Template-filling
Relation Detection & Classification
Common Relations
Relations and Entities from Example
Supervised Approaches to Relation Analysis
Two stage process:
  Relation detection:
    Detect whether a relation holds between two entities
    Positive examples:
      Drawn from labeled training data
    Negative examples:
      Generated from within-sentence entity pairs w/no relation
  Relation classification:
    Label each related pair of entities by type
    Multi-class classification task
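
As a rough sketch of how the detection training set might be assembled, the code below pairs up the entity mentions of one sentence and labels each pair positive if the gold annotation relates them and negative otherwise. The function name and the data layout (mentions as strings, gold relations as a dict) are assumptions made for illustration.

```python
from itertools import combinations

def detection_examples(sentence_entities, labeled_relations):
    """Return (pair, label) examples for one sentence.

    sentence_entities: entity mentions of one sentence, in textual order.
    labeled_relations: dict mapping an (arg1, arg2) pair to its relation type.
    """
    examples = []
    for e1, e2 in combinations(sentence_entities, 2):
        related = (e1, e2) in labeled_relations or (e2, e1) in labeled_relations
        examples.append(((e1, e2), 1 if related else 0))
    return examples

entities = ["American Airlines", "AMR Corp.", "Tim Wagner"]
gold = {
    ("American Airlines", "AMR Corp."): "part-of",
    ("Tim Wagner", "American Airlines"): "employment",
}
examples = detection_examples(entities, gold)
positives = [pair for pair, label in examples if label == 1]
negatives = [pair for pair, label in examples if label == 0]
print(positives)
print(negatives)
```

The positive pairs, together with their relation types from `gold`, would then serve as training data for the second-stage multi-class classifier.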
Feature Design
Which features can we use to classify relations?
Named entity features:
  Named entity types of candidate arguments
  Concatenation of two entity types
  Head words of arguments
  Bag-of-words from arguments
Context word features:
  Bag-of-words/bigrams between entities (+/- stemmed)
  Words and stems before/after entities
  Distance in words, # entities between arguments
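
Several of these features can be computed directly from a tokenized sentence and two entity spans. The following is a minimal sketch, not a full feature set: the function name, the span format, and the feature names are hypothetical, the head word is crudely approximated by the last token of each argument, and stemming, bigrams, and the count of intervening entities are left out.

```python
def relation_features(tokens, arg1, arg2):
    """Build a feature dict for a candidate pair.

    arg1/arg2: dicts with 'start', 'end' (token span, end exclusive) and 'type'.
    arg1 is assumed to precede arg2 in the sentence.
    """
    a1 = tokens[arg1["start"]:arg1["end"]]
    a2 = tokens[arg2["start"]:arg2["end"]]
    between = tokens[arg1["end"]:arg2["start"]]
    feats = {
        "type1": arg1["type"],
        "type2": arg2["type"],
        "types": arg1["type"] + "-" + arg2["type"],  # concatenated entity types
        "head1": a1[-1],  # crude head-word guess: last token of the mention
        "head2": a2[-1],
        "distance": len(between),  # distance in words between arguments
    }
    for word in a1 + a2:
        feats["arg_bow=" + word.lower()] = 1
    for word in between:
        feats["between_bow=" + word.lower()] = 1
    if arg1["start"] > 0:
        feats["before=" + tokens[arg1["start"] - 1].lower()] = 1
    if arg2["end"] < len(tokens):
        feats["after=" + tokens[arg2["end"]].lower()] = 1
    return feats

tokens = "American Airlines , a unit of AMR Corp. , immediately matched the move".split()
features = relation_features(
    tokens,
    {"start": 0, "end": 2, "type": "ORG"},
    {"start": 6, "end": 8, "type": "ORG"},
)
print(features)
```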
Feature Design
Features for relation classification
Syntactic structure features can signal relations
  E.g. appositive signals part-of relation
Feature Design
Syntactic structure features
Derived from syntactic analysis
  From base-phrase chunking to full constituent parse
Features:
  Presence of construction
  Chunk base-phrase paths
  Bag-of-chunk heads
  Dependency/constituency path
  Distance between arguments
How can we represent syntax?
  Binary features: Detection of construction patterns
  Structure features: Encode paths through trees
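
One way to encode a path through a tree as a feature is to walk from each argument up to their lowest common ancestor and join the tokens along the way into a string. The sketch below does this for a dependency tree stored as a list of head indices. The encoding, the function names, and above all the toy parse are assumptions: the heads were written by hand for illustration and do not come from a parser.

```python
def path_to_root(index, heads):
    """Token indices from `index` up to the root (whose head is -1)."""
    path = [index]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path

def dependency_path(i, j, heads, tokens):
    """Encode the tree path between tokens i and j as a single string feature."""
    up = path_to_root(i, heads)
    down = path_to_root(j, heads)
    # Lowest common ancestor: first node on i's path that also lies on j's path.
    common = next(node for node in up if node in down)
    parts = []
    for k in up[:up.index(common)]:
        parts += [tokens[k], "<-"]      # climbing from a child to its head
    parts.append(tokens[common])
    for k in reversed(down[:down.index(common)]):
        parts += ["->", tokens[k]]      # descending from a head to a child
    return " ".join(parts)

# Hand-built toy parse of the appositive "United , a unit of UAL Corp."
tokens = ["United", ",", "a", "unit", "of", "UAL", "Corp."]
heads = [-1, 0, 3, 0, 3, 6, 4]

path = dependency_path(0, 6, heads, tokens)
print(path)
```

In this toy parse the path runs through "unit", which is the kind of appositive cue the slides associate with a part-of relation.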
Some Example Features
Lightly Supervised Approaches
Supervised approaches highly effective, but
  Require large manually annotated corpus for training
Unsupervised approaches interesting, but
  May not yield results consistent with desired categories
Lightly supervised approaches:
  Build on small seed sets of patterns that capture relations of interest
    E.g. manually constructed regular expressions
  Iteratively generalize and refine patterns to optimize
Bootstrapping Relations
Create seed patterns and instances
Example: finding airline hub cities
  Initial pattern: / * has a hub at * /
  In a large corpus finds examples like:
    Delta has a hub at LaGuardia.
    Milwaukee-based Midwest has a hub at KCI.
    American Airlines has a hub at the San Juan airport.
  Instances: (Delta, LaGuardia), (Midwest, KCI), (American, San Juan)
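
The seed pattern translates fairly directly into a regular expression. The sketch below is one possible rendering; the exact regex and the function name `match_seed` are assumptions. Note that the raw wildcard captures are wider than the instances listed on the slide (for example "Milwaukee-based Midwest" rather than "Midwest"), so trimming them to entity mentions would still need something like NER.

```python
import re

# One possible rendering of the seed pattern / * has a hub at * /.
SEED = re.compile(r"^(?P<left>.+?) has a hub at (?P<right>.+?)\.?$")

def match_seed(sentences):
    """Return the raw (left, right) wildcard captures for matching sentences."""
    pairs = []
    for sentence in sentences:
        m = SEED.match(sentence.strip())
        if m:
            pairs.append((m.group("left"), m.group("right")))
    return pairs

corpus = [
    "Delta has a hub at LaGuardia.",
    "Milwaukee-based Midwest has a hub at KCI.",
    "American Airlines has a hub at the San Juan airport.",
    "United serves Chicago.",
]
seed_pairs = match_seed(corpus)
print(seed_pairs)
```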
Bootstrapping Relations
Issues:
  False alarms:
    Some matches aren’t valid relations
      The catheter has a hub at the proximal end.
      A star topology often has a hub at the center.
    Fix? /[ORG] has a hub at [LOC]/
  Misses:
    Some instances are narrowly missed by pattern
      No frills rival easyJet, which has established a hub at Liverpool.
      Ryanair also has a continental hub at Charleroi airport.
    Fix?
      Relax patterns (but see issue 1)
      Add new high precision patterns
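
The effect of adding entity-type constraints can be illustrated with a small filter. In this sketch a hand-written dictionary, `ENTITY_TYPES`, stands in for a real NER component; it and the function name are hypothetical, and the regex only captures single-word arguments to keep the example short.

```python
import re

# Hypothetical gazetteer standing in for a real NER component.
ENTITY_TYPES = {
    "Delta": "ORG",
    "Midwest": "ORG",
    "LaGuardia": "LOC",
    "KCI": "LOC",
}

SEED = re.compile(r"(?P<left>\w+) has a hub at (?:the )?(?P<right>\w+)")

def typed_matches(sentences):
    """Keep only matches whose arguments are typed ORG and LOC respectively."""
    kept = []
    for sentence in sentences:
        m = SEED.search(sentence)
        if not m:
            continue
        left, right = m.group("left"), m.group("right")
        if ENTITY_TYPES.get(left) == "ORG" and ENTITY_TYPES.get(right) == "LOC":
            kept.append((left, right))
    return kept

sentences = [
    "Delta has a hub at LaGuardia.",
    "Milwaukee-based Midwest has a hub at KCI.",
    "The catheter has a hub at the proximal end.",
    "A star topology often has a hub at the center.",
]
typed_pairs = typed_matches(sentences)
print(typed_pairs)
```

The two false alarms still match the surface pattern but are rejected because "catheter" and "often" are not typed as organizations.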
Bootstrapping
How can we obtain more high quality patterns?
  Human labeling? (see above…)
  Use tuples identified by seed patterns
    Find other occurrences of tuples
    Extract patterns from those occurrences
E.g. tuple: Ryanair, Charleroi and hub
  Occurrences:
    Budget airline Ryanair, which uses Charleroi as a hub
    All flights in and out of Ryanair’s Belgian hub at Charleroi
  Patterns:
    /[ORG], which uses [LOC] as a hub/
    /[ORG]’s hub at [LOC]/
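
The step from an occurrence to a candidate pattern can be sketched as a string substitution: replace the known tuple members with type placeholders and keep the text that spans them and the keyword. The function name `induce_pattern` is hypothetical, and the sketch does no generalization, so the second occurrence yields a pattern that still contains "Belgian" rather than the shorter pattern shown on the slide.

```python
def induce_pattern(occurrence, org, loc, keyword="hub"):
    """Turn one occurrence of a known (org, loc) tuple into a candidate pattern."""
    if org not in occurrence or loc not in occurrence or keyword not in occurrence:
        return None
    text = occurrence.replace(org, "[ORG]").replace(loc, "[LOC]")
    org_at, loc_at, key_at = text.index("[ORG]"), text.index("[LOC]"), text.index(keyword)
    # Keep the span from the first placeholder to the end of the last
    # placeholder or keyword, whichever comes later.
    start = min(org_at, loc_at)
    end = max(org_at + len("[ORG]"), loc_at + len("[LOC]"), key_at + len(keyword))
    return text[start:end]

occurrences = [
    "Budget airline Ryanair, which uses Charleroi as a hub",
    "All flights in and out of Ryanair's Belgian hub at Charleroi",
]
candidates = [induce_pattern(o, "Ryanair", "Charleroi") for o in occurrences]
print(candidates)
```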
Bootstrapping Issues
Issues:
  How do we represent patterns?
  How do we assess the patterns?
  How do we assess the tuples?
Pattern Representation
General approaches:
  Regular expressions
    Specific, high precision
  Machine learning style feature vectors
    More flexible, higher recall
Components in representation:
  Context before first entity mention
  Context between entity mentions
  Context after last entity mention
  Order of arguments
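
The four components listed above map naturally onto a small record type. The following sketch is one possible layout, not a prescribed one: the class name `ContextPattern`, the span format, and the three-token context window are all assumptions.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class ContextPattern:
    before: Tuple[str, ...]   # context before first entity mention
    between: Tuple[str, ...]  # context between entity mentions
    after: Tuple[str, ...]    # context after last entity mention
    order: Tuple[str, str]    # argument types in textual order

def make_pattern(tokens, span1, span2, window=3):
    """Build a pattern from two non-overlapping spans (start, end, type), end exclusive."""
    first, second = sorted([span1, span2])
    return ContextPattern(
        before=tuple(tokens[max(0, first[0] - window):first[0]]),
        between=tuple(tokens[first[1]:second[0]]),
        after=tuple(tokens[second[1]:second[1] + window]),
        order=(first[2], second[2]),
    )

tokens = "Budget airline Ryanair , which uses Charleroi as a hub".split()
pattern = make_pattern(tokens, (2, 3, "ORG"), (6, 7, "LOC"))
print(pattern)
```

Turning each context into a bag of words would give the more flexible feature-vector style; matching the contexts exactly behaves more like a regular expression.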
Assessing Patterns & Tuples
Issue: No gold standard!
  Can use seed patterns
Balance two main factors:
  Performance on seed tuples
    Hits and misses
  Productivity in full collection
    Finds
Use conservative strategy to minimize semantic drift
  E.g. Sydney has a ferry hub at Circular Quay.
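
The slides name the ingredients (hits, misses, finds) without giving a formula. One way to combine them, in the style of the RlogF metric of Riloff and Jones, is to multiply the hit rate on known tuples by the log of the number of finds. The sketch below should be read as an assumption about how such a score could look: the function names are hypothetical, the natural log is an arbitrary choice of base, and the matched tuples are made up for illustration.

```python
import math

def pattern_counts(matched, seeds):
    """hits: seed tuples the pattern matched; misses: seed tuples it did not;
    finds: all distinct tuples the pattern matched in the collection."""
    matched, seeds = set(matched), set(seeds)
    return len(matched & seeds), len(seeds - matched), len(matched)

def rlogf_confidence(hits, misses, finds):
    """RlogF-style score: accuracy on known tuples times log of productivity."""
    if hits + misses == 0 or finds <= 0:
        return 0.0
    return hits / (hits + misses) * math.log(finds)

seeds = {("Delta", "LaGuardia"), ("Midwest", "KCI"), ("American", "San Juan")}

# Illustrative (made-up) match sets for a typed and an untyped pattern.
typed = {("Delta", "LaGuardia"), ("Midwest", "KCI"),
         ("Ryanair", "Charleroi"), ("easyJet", "Liverpool")}
untyped = {("Delta", "LaGuardia"), ("catheter", "proximal end"),
           ("topology", "center"), ("Sydney", "Circular Quay")}

typed_score = rlogf_confidence(*pattern_counts(typed, seeds))
untyped_score = rlogf_confidence(*pattern_counts(untyped, seeds))
print(typed_score, untyped_score)
```

Both patterns find four tuples, but the untyped one recovers fewer of the seeds and so scores lower, which is the conservative behaviour wanted to limit semantic drift.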
Relation Analysis Evaluation
Approaches:
  Standardized, labeled data sets
    Precision, recall, f-measure
      ‘Labeled’ precision: test both detection and classification
      ‘Unlabeled’ precision: test only detection
  No fully labeled corpus
    Evaluate retrieved tuples (e.g. DB contents)
      Intuition: Don’t care how many times we find same tuple
      Checked by human assessors
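
The difference between labeled and unlabeled scoring, and the point about counting each tuple once, can be shown with a short sketch. The function names are hypothetical and the predicted tuples are invented for illustration; the gold tuples follow the relations listed for the example text.

```python
def prf(predicted, gold):
    """Precision, recall and F1 over sets, so repeated tuples count once."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def unlabeled(triples):
    """Drop the relation type so that only detection is scored."""
    return {(arg1, arg2) for arg1, arg2, _ in triples}

gold = [
    ("United", "UAL Corp.", "part-of"),
    ("American Airlines", "AMR Corp.", "part-of"),
    ("Tim Wagner", "American Airlines", "employment"),
]
# Invented system output: one duplicate, one wrong label, one spurious pair.
predicted = [
    ("United", "UAL Corp.", "part-of"),
    ("United", "UAL Corp.", "part-of"),
    ("American Airlines", "AMR Corp.", "employment"),
    ("United", "Chicago", "part-of"),
]

labeled_scores = prf(predicted, gold)
unlabeled_scores = prf(unlabeled(predicted), unlabeled(gold))
print(labeled_scores, unlabeled_scores)
```

The wrongly labeled pair counts as an error under labeled scoring but as correct under unlabeled scoring, so the unlabeled figures are higher.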