Information Extraction
Transcript of Information Extraction
CMPT 884, SFU, Martin Ester, 1-09 1
Information Extraction
Martin Ester
Simon Fraser University
School of Computing Science
CMPT 884
Spring 2009
Information Extraction
Outline
• Introduction
  motivation, applications, issues
• Entity extraction
  hand-coded, machine learning
• Relation extraction
  supervised, partially supervised
• Entity resolution
  string similarity, finding similar pairs, creating groups
• Future research

[Feldman 2006] [Agichtein & Sarawagi 2006]
Introduction
Motivation
• 80% of all human-generated data is natural language text
• search engines return whole documents, requiring the user to read documents and manually extract relevant information (entities, facts, . . .)
  ⇒ very time-consuming
• need for automatic extraction of such information from collections of natural language text documents
  ⇒ information extraction (IE)
Introduction
Definitions
• Entity: an object of interest such as a person or organization.
• Attribute: a property of an entity such as its name, alias, descriptor, or type.
• Relation: a relationship held between two or more entities, such as the position of a person in a company.
• Event: an activity involving several entities, such as a terrorist act, aircraft crash, management change, or new product introduction.
Introduction
Example
Introduction
Applications
• question answering
  Who is the president of the US?
  Where was Martin Luther born?
• automatic creation of databases
  e.g., a database of protein localizations or of adverse reactions to a drug
• opinion mining
  analyzing online product reviews to get user feedback
Introduction
Challenges
• Complexity of natural language
  e.g., identifying word and sentence boundaries is fairly easy in European languages, much harder in Chinese / Japanese
• Ambiguity of natural language
  e.g., homonyms
• Diversity of natural language
  many ways of expressing a given piece of information, e.g., synonyms
• Diversity of writing styles
  e.g., scientific papers, newspaper articles, maintenance reports, emails, . . .
Introduction
Challenges
• names are hard to discover
  – impossible to enumerate
  – new candidates are generated all the time
  – hard to provide syntactic rules
• types of proper names
  – people
  – companies
  – products
  – genes
  – . . .
Introduction
Architecture of IE System
[diagram: pipeline with a local analysis stage followed by a discourse (global) analysis stage]
Introduction
Knowledge Engineering Approach
• Extraction rules are hand-crafted by linguists in cooperation with domain experts.
• Most of the work is done by inspecting a set of relevant documents.
• Development of the rule set is very time-consuming.
• Requires substantial CS and domain expertise.
• Rule sets are domain-specific and do not transfer to other domains.
• The knowledge engineering (KE) approach often achieves higher accuracy than the machine learning approach.
Introduction
Machine Learning Approach
• Automatically learn a model ("rules") from an annotated training corpus.
• Techniques based on pure statistics and little linguistic knowledge.
• No CS expertise required when building the model.
• However, creating the annotated corpus is very laborious, since a very large number of training examples is needed.
• Transfer to other domains is easier than for the KE approach.
• Accuracy of the machine learning (ML) approach is typically lower.
Introduction
Topics Not Covered
• co-reference resolution
  e.g., an article referencing a noun (entity) in another sentence
• event extraction
  an event has a type, actor, time, . . .
• sentiment detection
  a certain statement (opinion) is classified as positive / negative
Entity Extraction
Lexical Analysis
• breaking up the input document into individual words = tokens
• token: a sequence of characters treated as a unit
• punctuation marks are also considered tokens
  e.g., "," (comma)
• often, regular expressions are used to define the format of a token
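As a sketch, such a tokenizer can be written with a regular expression; the token format below (a word, a number, or a single punctuation mark) is an illustrative choice, not the one used in the course:

```python
import re

# Illustrative token format: runs of letters, runs of digits, or a single
# punctuation character (so "," is its own token, as on the slide).
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+|[^\w\s]")

def tokenize(text):
    """Break a document into tokens using a regular expression."""
    return TOKEN_RE.findall(text)
```

For example, `tokenize("Dr. Haas, IBM")` yields `['Dr', '.', 'Haas', ',', 'IBM']`.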
Entity Extraction
Syntactic Analysis
• part-of-speech tagging [Charniak 1997]
  marking up the tokens in a text as corresponding to a particular part of speech (POS), based on both their definition and their context
• coarse POS tags: e.g., N, V, A, Aux, . . .
• finer POS tags:
  – PRP: personal pronouns (you, me, she, he, them, him, her, . . .)
  – PRP$: possessive pronouns (my, our, her, his, . . .)
  – NN: singular common nouns (sky, door, theorem, . . .)
  – NNS: plural common nouns (doors, theorems, women, . . .)
  – NNP: singular proper names (Fifi, IBM, Canada, . . .)
  – NNPS: plural proper names (Americas, Carolinas, . . .)
Entity Extraction
Syntactic Analysis
• Words often have more than one POS, e.g. back:
  – The back door → back = JJ
  – On my back → back = NN
  – Win the voters back → back = RB
  – Promised to back the bill → back = VB
• The POS tagging problem is to determine the POS tag for a particular instance of a word.
• e.g., input: the lead paint is unsafe
  output: the/Det lead/N paint/N is/V unsafe/Adj
Entity Extraction
Knowledge Engineering Approach [Chaudhuri 2005]
• hand-coded rules are often relatively straightforward
• easy to incorporate domain knowledge
• require substantial CS expertise
• example rule:
  <token>INITIAL</token>
  <token>DOT</token>
  <token>CAPSWORD</token>
  <token>CAPSWORD</token>
  finds person names with a salutation and two capitalized words, e.g. Dr. Laura Haas
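A minimal sketch of this rule as a Python regular expression; the salutation list for the INITIAL token and the `[A-Z][a-z]+` shape for CAPSWORD are illustrative assumptions:

```python
import re

# Hypothetical rendering of the rule INITIAL DOT CAPSWORD CAPSWORD:
# a salutation (illustrative list), a dot, then two capitalized words.
PERSON_RULE = re.compile(r"\b(?:Dr|Mr|Mrs|Ms|Prof)\.\s+[A-Z][a-z]+\s+[A-Z][a-z]+")

def find_person_names(text):
    """Return substrings matching the salutation + two-capitalized-words rule."""
    return PERSON_RULE.findall(text)
```

For example, `find_person_names("We met Dr. Laura Haas yesterday.")` returns `['Dr. Laura Haas']`.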
Entity Extraction
Knowledge Engineering Approach
• a more complex example: conference names

my $wordOrdinals="(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|fourteenth|fifteenth)";
my $numberOrdinals="(?:\\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))";
my $ordinals="(?:$wordOrdinals|$numberOrdinals)";
my $confTypes="(?:Conference|Workshop|Symposium)";
my $words="(?:[A-Z]\\w+\\s*)"; # a word starting with a capital letter, ending with 0 or more spaces
my $confDescriptors="(?:international\\s+|[A-Z]+\\s+)"; # e.g. "International Conference ..." or the conference name for workshops (e.g. "VLDB Workshop ...")
my $connectors="(?:on|of)";
my $abbreviations="(?:\\([A-Z]\\w\\w+[\\W\\s]*?(?:\\d\\d+)?\\))"; # abbreviations like "(SIGMOD'06)"
my $fullNamePattern="((?:$ordinals\\s+$words*|$confDescriptors)?$confTypes(?:\\s+$connectors\\s+.*?|\\s+)?$abbreviations?)(?:\\n|\\r|\\.|<)"; . . .
Entity Extraction
Machine Learning Approach
• We can view named entity extraction as a sequence classification problem: classify each word as belonging to one of the named-entity classes or to the "no-name" class.
• The class label of a sequence element depends on the neighboring ones.
• One of the most popular techniques for classifying sequences is the Hidden Markov Model (HMM).
• Another popular ML method for entity extraction: Conditional Random Fields [Lafferty et al 2001].
• Requires a large enough labeled (annotated) training dataset.
Entity Extraction
Hidden Markov Models [Rabiner 1989]
• HMM (Hidden Markov Model) is a finite state automaton
with stochastic state transitions and symbol emissions.
• The automaton models a probabilistic generative process.
• In this process a sequence of symbols is produced by
starting in an initial state, transitioning to a new state,
emitting a symbol selected by the state and repeating this
transition/emission cycle until a designated final state is
reached.
• Very successful in many sequence classification tasks.
Entity Extraction
Example
HMM for addresses
Entity Extraction
Hidden Markov Models
• T = length of the sequence of observations (training set)
• N = number of states in the model
• qt = the actual state at time t
• S = {S1, ..., SN} (finite set of possible states)
• V = {O1, ..., OM} (finite set of observation symbols)
• π = {πi} = {P(q1 = Si)}: starting probabilities
• A = {aij} = {P(qt+1 = Sj | qt = Si)}: transition probabilities
• B = {bi(Ot)} = {P(Ot | qt = Si)}: emission probabilities
• λ = (π, A, B): hidden Markov model
Entity Extraction
Hidden Markov Models
• How to find P(O | λ), the probability of an observation sequence given the HMM λ?
  ⇒ forward-backward algorithm
• How to find the λ that maximizes P(O | λ)?
  This is the task of the training phase.
  ⇒ Baum-Welch algorithm
• How to find the most likely state trajectory given λ and O?
  This is the task of the test phase.
  ⇒ Viterbi algorithm
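A minimal Viterbi decoder over the λ = (π, A, B) notation above; the dictionary-of-dictionaries representation and the toy two-state model in the usage example are assumptions for illustration, not from the course:

```python
def viterbi(obs, states, pi, A, B):
    """Return the most likely state sequence for observation sequence obs,
    given starting probabilities pi, transition probabilities A[i][j]
    (from state i to state j), and emission probabilities B[i][o]."""
    # delta[s] = probability of the best path ending in state s
    delta = {s: pi[s] * B[s][obs[0]] for s in states}
    back = []                                   # backpointers per step
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            # best predecessor state for s
            best = max(states, key=lambda r: prev[r] * A[r][s])
            delta[s] = prev[best] * A[best][s] * B[s][o]
            ptr[s] = best
        back.append(ptr)
    # backtrack from the most likely final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

With a toy name/other model where 'smith' is mostly emitted by the Name state and 'the' by the Other state, decoding ['the', 'smith'] yields ['O', 'N'].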
Relation Extraction
Example

Apple's programmers "think different" on a "campus" in Cupertino, Cal. Nike employees "just do it" at what the company refers to as its "World Campus," near Portland, Ore. Microsoft's central headquarters in Redmond is home to almost every product group and division.

Brent Barlow, 27, a software analyst and beta-tester at Apple Computer headquarters in Cupertino, was fired Monday for "thinking a little too different."

Organization      Location
Microsoft         Redmond
Apple Computer    Cupertino
Nike              Portland
Relation Extraction
Introduction
• No single source contains all the relations
• Each relation appears on many web pages
• There are repeated patterns in the way relations are represented on web pages
  ⇒ exploit redundancy
• Components of a relation appear "close" together
  ⇒ use the context of an occurrence of a relation to determine patterns
• pattern: consists of constants (tokens) and variables (placeholders for entities)
• tuple: instance / occurrence of a relation
Relation Extraction
Introduction
• Typically requires entity extraction (tagging) as preprocessing
• Knowledge engineering approach
  – patterns defined over lexical items: "<company> located in <location>"
  – patterns defined over parsed text: "((Obj <company>) (Verb located) (*) (Subj <location>))"
• Machine learning approach
  – learn rules/patterns from examples
  – partially supervised: bootstrap from example tuples [Agichtein & Gravano 2000, Etzioni et al 2004]
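As a sketch, a lexical pattern such as "<company> located in <location>" can be applied to entity-tagged text; the inline `<ORG>`/`<LOC>` tag format below is a made-up convention for illustration:

```python
import re

# Hypothetical tagged-text convention: entities wrapped in <ORG>...</ORG>
# and <LOC>...</LOC>.  The pattern captures the two entity slots around
# the constant tokens "located in".
PATTERN = re.compile(r"<ORG>(.+?)</ORG>\s+located\s+in\s+<LOC>(.+?)</LOC>")

def extract_located_in(tagged_text):
    """Return (organization, location) tuples matching the lexical pattern."""
    return PATTERN.findall(tagged_text)
```

For example, `extract_located_in("<ORG>Intel</ORG> located in <LOC>Santa Clara</LOC>")` returns `[('Intel', 'Santa Clara')]`.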
Relation Extraction
Snowball [Agichtein & Gravano 2000]
• Exploit duality between patterns and tuples
  – find tuples that match a set of patterns
  – find patterns that match a lot of tuples
  ⇒ bootstrapping approach

[diagram: Initial Seed Tuples → Tag Entities → Occurrences of Seed Tuples → Generate Extraction Patterns → Generate New Seed Tuples → Augment Table, in a loop]
Relation Extraction
Snowball
• how to represent patterns of occurrences?
initial seed tuples:

ORGANIZATION    LOCATION
MICROSOFT       REDMOND
IBM             ARMONK
BOEING          SEATTLE
INTEL           SANTA CLARA

occurrences of seed tuples:

"Computer servers at Microsoft's headquarters in Redmond..."
"In mid-afternoon trading, shares of Redmond-based Microsoft fell..."
"The Armonk-based IBM introduced a new line..."
"The combined company will operate from Boeing's headquarters in Seattle."
"Intel, Santa Clara, cut prices of its Pentium processor."
Relation Extraction
Patterns
• an (extraction) pattern has the format <left, tag1, middle, tag2, right>, where tag1, tag2 are named-entity tags and left, middle, and right are vectors of weighted terms
• patterns derived directly from occurrences are too specific
• example: from "ORGANIZATION's central headquarters in LOCATION is home to..." one obtains
  tag1 = ORGANIZATION, tag2 = LOCATION
  middle = {<'s 0.5>, <central 0.5>, <headquarters 0.5>, <in 0.5>}
  right = {<is 0.75>, <home 0.75>}
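One plausible way to score how well an occurrence matches a pattern of this form is to sum the dot products of the corresponding left, middle, and right weighted term vectors (Snowball uses a similar match function); the term vectors and weights in the usage example below are illustrative assumptions:

```python
def dot(u, v):
    """Dot product of two weighted term vectors stored as dicts."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def match(pattern, occurrence):
    """Degree of match between a pattern and an occurrence, both given as
    (left, tag1, middle, tag2, right) with dict-valued term vectors.
    The named-entity tags must agree; otherwise the match is 0."""
    pl, t1, pm, t2, pr = pattern
    ol, s1, om, s2, o_r = occurrence
    if (t1, t2) != (s1, s2):
        return 0.0
    return dot(pl, ol) + dot(pm, om) + dot(pr, o_r)
```

For instance, a pattern with middle {<'s 0.5>, <headquarters 0.5>, <in 0.5>} matched against an occurrence with middle {<'s 0.7>, <headquarters 0.7>, <in 0.7>} scores 3 × 0.5 × 0.7 = 1.05.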
Relation Extraction
Pattern Clusters
• cluster patterns; cluster centroids define the patterns

Cluster 1 (ORGANIZATION ... LOCATION):
  < {<servers 0.75>, <at 0.75>}, ORGANIZATION, {<'s 0.5>, <central 0.5>, <headquarters 0.5>, <in 0.5>}, LOCATION, ... >
  < {<operate 0.75>, <from 0.75>}, ORGANIZATION, {<'s 0.7>, <headquarters 0.7>, <in 0.7>}, LOCATION, ... >

Cluster 2 (LOCATION ... ORGANIZATION):
  < {<shares 0.75>, <of 0.75>}, LOCATION, {<- 0.75>, <based 0.75>}, ORGANIZATION, {<fell 1>} >
  < {<the 1>}, LOCATION, {<- 0.75>, <based 0.75>}, ORGANIZATION, {<introduced 0.75>, <a 0.75>} >
Relation Extraction
Evaluation of Patterns
• How good are new extraction patterns?
• Measure their performance through their accuracy vs. the initial seed tuples (ground truth).

extraction with pattern "ORGANIZATION, LOCATION":
  "Boeing, Seattle, said..."                                     → positive
  "Intel, Santa Clara, cut prices..."                            → positive
  "invest in Microsoft, New York-based analyst Jane Smith said"  → negative

initial seed tuples:
  ORGANIZATION    LOCATION
  MICROSOFT       REDMOND
  IBM             ARMONK
  BOEING          SEATTLE
  INTEL           SANTA CLARA
Relation Extraction
Evaluation of Patterns
• Trust only patterns with high "support" and "confidence", i.e. that produce many correct (positive) tuples and only a few false (negative) tuples.
• conf(p) = pos(p) / (pos(p) + neg(p))
  where p denotes a pattern and pos(p), neg(p) denote the numbers of positive and negative tuples produced
Relation Extraction
Evaluation of Tuples
• Trust only tuples that match many patterns.
• Suppose candidate tuple t matches patterns p1 and p2. What is the probability that t is a valid tuple?
• Assume matches of different patterns are independent events.
• Pr[t is not valid | t matches p1] = 1 − conf(p1)
  Pr[t is not valid | t matches p2] = 1 − conf(p2)
  Pr[t is not valid | t matches {p1, p2}] = (1 − conf(p1)) (1 − conf(p2))
  Pr[t is valid | t matches {p1, p2}] = 1 − (1 − conf(p1)) (1 − conf(p2))
• If tuple t matches a set of patterns P:
  conf(t) = 1 − ∏p∈P (1 − conf(p))
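The two confidence measures translate directly into code; this is a plain transcription of the slide formulas:

```python
def pattern_conf(pos, neg):
    """conf(p) = pos(p) / (pos(p) + neg(p))"""
    return pos / (pos + neg)

def tuple_conf(confs):
    """conf(t) = 1 - product over matched patterns p of (1 - conf(p))."""
    not_valid = 1.0
    for c in confs:
        not_valid *= (1.0 - c)
    return 1.0 - not_valid
```

For example, a tuple matched by two patterns with confidences 0.8 and 0.5 gets conf(t) = 1 − 0.2 × 0.5 = 0.9.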
Relation Extraction
Snowball Algorithm
1. Start with seed set R of tuples
2. Generate set P of patterns from R
   compute support and confidence for each pattern in P
   discard patterns with low support or confidence
3. Generate new set T of tuples matching the patterns in P
   compute the confidence of each tuple in T
   add to R the tuples t in T with conf(t) > threshold
4. Go back to step 2
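The four steps above can be sketched as a loop; `generate_patterns`, `evaluate_pattern`, and `find_matches` are hypothetical stand-ins for the components described on the previous slides, and the threshold and iteration count are arbitrary choices, not Snowball's actual settings:

```python
def snowball(seed_tuples, corpus, generate_patterns, evaluate_pattern,
             find_matches, tuple_conf, min_conf=0.8, iterations=3):
    """Bootstrapping loop: alternate between deriving patterns from the
    current tuple set and deriving new tuples from the trusted patterns."""
    R = set(seed_tuples)                       # step 1: seed tuples
    for _ in range(iterations):
        # step 2: patterns from current tuples; keep only trusted ones
        P = [p for p in generate_patterns(R, corpus)
             if evaluate_pattern(p, R) >= min_conf]
        # step 3: candidate tuples with the confidences of their matched patterns
        for t, matched_confs in find_matches(P, corpus).items():
            if tuple_conf(matched_confs) > min_conf:
                R.add(t)
        # step 4: repeat
    return R
```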
Relation Extraction
Discussion
• The bootstrapping approach requires only a relatively small number of training tuples (semi-supervised).
• It is effective for binary, 1:1 relations.
• The bootstrapping approach has been adopted by lots of subsequent work.
• Pattern evaluation is heuristic and has no theory behind it.
  ⇒ Statistical Snowball, WWW 09
• What about n-ary relations?
• What about 1:m relations?
Entity Resolution
Introduction
Entity Resolution
Introduction
• Entity resolution
  – map entity mentions to the corresponding entities
  – entities stored in a database or ontology
• Challenges
  – large lists with multiple noisy mentions of the same entity
  – no single attribute to order or cluster likely duplicates while separating them from similar but different entities
  – need to depend on fuzzy and computationally expensive string similarity functions
Entity Resolution
Introduction
• Typical approach
  – define string similarity
    numeric attributes are easy to compare, string attributes are hard:
    need to perform approximate matches
  – find similar pairs of entities
  – create groups from duplicate entity pairs (clustering)
Entity Resolution
String Similarity
• Token-based
  Jaccard
  TF-IDF cosine similarity
  ⇒ suitable for large documents
• Character-based
  edit distance and variants like Levenshtein, Jaro-Winkler
  Soundex
  ⇒ suitable for short strings with spelling mistakes
• Hybrids
Entity Resolution
Token-Based String Similarity
• Tokens/words
  'AT&T Corporation' → 'AT&T', 'Corporation'
• Similarity: various measures of overlap of two sets S, T
• Jaccard(S,T) = |S∩T| / |S∪T|
• Example
  S = 'AT&T Corporation' → 'AT&T', 'Corporation'
  T = 'AT&T Corp' → 'AT&T', 'Corp'
  Jaccard(S,T) = 1/3
• Variants: weights attached to each token
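Jaccard similarity over token sets is a one-liner:

```python
def jaccard(s, t):
    """Jaccard(S,T) = |S ∩ T| / |S ∪ T| for two token sets."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)
```

On the slide's example, `jaccard({'AT&T', 'Corporation'}, {'AT&T', 'Corp'})` is 1/3: one shared token out of three distinct tokens.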
Entity Resolution
Token-Based String Similarity
• Sets transformed to vectors with each term as a dimension
• Cosine similarity:
  dot product of two vectors, each normalized to unit length
  = cosine of the angle between them
• Term weight = TF-IDF:
  log(tf + 1) * log(idf), where
  tf: frequency of the term in a document d
  idf: number of documents / number of documents containing the term
  ⇒ rare terms are more important
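A sketch of TF-IDF cosine similarity using the slide's weighting, log(tf+1) * log(idf); the document-frequency table `df` passed in is assumed to be precomputed over some corpus:

```python
import math

def tfidf_vector(tokens, df, n_docs):
    """Unit-length TF-IDF vector for a token list, using the weighting
    log(tf + 1) * log(n_docs / df(term)) from the slide."""
    tf = {}
    for t in tokens:
        tf[t] = tf.get(t, 0) + 1
    vec = {t: math.log(f + 1) * math.log(n_docs / df[t]) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def cosine(u, v):
    """Cosine similarity of two unit-normalized sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())
```

With a rare term 'AT&T' and frequent terms 'Corporation' / 'Corp', the two strings on the slide come out highly similar because the shared rare term dominates.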
Entity Resolution
Token-Based String Similarity
• Widely used in traditional IR
• Example:
  'AT&T Corporation', 'AT&T Corp' or 'AT&T Inc'
  ⇒ low weights for 'Corporation', 'Corp', 'Inc'; higher weight for 'AT&T'
Entity Resolution
Character-Based String Similarity
• Given two strings S, T, edit(S,T):
  minimum-cost sequence of operations to transform S into T
• Character operations: I (insert), D (delete), R (replace)
• Example: edit(Error, Eror) = 1, edit(great, grate) = 2
• Dynamic programming algorithm to compute edit(S,T)
• Several variants (gaps, weights)
  ⇒ can become NP-complete
• Varying costs of operations: can be learned
• Suitable for common typing mistakes in short strings
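The standard dynamic-programming computation of edit() with unit-cost insert, delete, and replace operations:

```python
def edit_distance(s, t):
    """Unit-cost edit distance between strings s and t, by dynamic programming."""
    m, n = len(s), len(t)
    # d[i][j] = edit distance between s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # replace / match
    return d[m][n]
```

This reproduces the slide's examples: edit(Error, Eror) = 1 and edit(great, grate) = 2.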
Entity Resolution
Find Duplicate Pairs
• Input: a large list of entities with string attributes
• Output: all pairs (S,T) of entities which satisfy a similarity criterion such as
  Jaccard(S,T) > 0.7
  edit-distance(S,T) < k
• Naive method: for each record pair, compute the similarity score
  ⇒ I/O and CPU intensive, not scalable to millions of entities
• Goal: reduce the O(n²) cost to O(n·w), where w << n
  ⇒ reduce the number of pairs on which the similarity is computed
Entity Resolution
Find Duplicate Pairs
• Method: filter and refinement
• Use an inexpensive filter to eliminate as many pairs as possible, e.g.
  EditDistance(s,t) ≤ d  →  |q-grams(s) ∩ q-grams(t)| ≥ max(|s|,|t|) - (d-1)*q - 1
• q-gram: substring of q consecutive characters
  e.g., 3-grams for 'AT&T Corporation':
  {'AT&','T&T','&T ','T C',' Co','Cor','orp','rpo','por','ora','rat','ati','tio','ion'}
• If a pair (s,t) does not satisfy the filter, it cannot satisfy the similarity criterion:
  |q-grams(s) ∩ q-grams(t)| < max(|s|,|t|) - (d-1)*q - 1  →  EditDistance(s,t) > d
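A sketch of this count filter, using the bound exactly as given on the slide:

```python
def qgrams(s, q=3):
    """Set of q-grams (substrings of q consecutive characters) of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def passes_count_filter(s, t, d, q=3):
    """Necessary condition (per the slide) for EditDistance(s,t) <= d:
    the strings must share enough q-grams.  Pairs failing this cheap test
    can be discarded without computing the edit distance."""
    shared = len(qgrams(s, q) & qgrams(t, q))
    return shared >= max(len(s), len(t)) - (d - 1) * q - 1
```

For example, 'AT&T Corporation' and 'AT&T Corp' share all seven 3-grams of the shorter string and pass the filter for d = 5, while a pair like 'AT&T Corporation' / 'IBM' is filtered out for small d.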
Entity Resolution
Find Duplicate Pairs
• Do not have to apply the filter to all pairs of entities
  ⇒ use an index to retrieve the subset of entities that share q-grams
• Compute the expensive similarity function, e.g. EditDistance(s,t), only for pairs that survive the filter step
Entity Resolution
Create Groups of Duplicates
• Given pairs of duplicate entities
• Group them such that each group corresponds to one entity
• Many clustering algorithms have been applied
• The number of clusters is hard to specify in advance
• Ground truth may be available for some entity pairs
  ⇒ semi-supervised clustering
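A simple baseline for forming groups from duplicate pairs is to treat each pair as an edge and take connected components via union-find; this is a simplification for illustration, not the clustering approaches discussed in the course:

```python
def group_duplicates(pairs):
    """Group entities into clusters: entities connected by a chain of
    duplicate pairs end up in the same group (connected components)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:                      # union the two sides of each pair
        parent[find(a)] = find(b)

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())
```

Note that components conflate transitively linked mentions, which is exactly why the number of clusters is hard to control and real systems use the clustering methods above instead.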
Entity Resolution
Create Groups of Duplicates
• Agglomerative clustering:
  repeatedly merge the closest clusters
• Definition of closeness of clusters subject to tuning:
  average / max / min similarity
• Efficient implementations possible using special data structures
Entity Resolution
Challenges
• Collective entity resolution
  consider relationships between entities and propagate resolution decisions along these relationships
  ⇒ use Markov Logic Networks [Parag & Domingos 2005]
• Mapping to existing background knowledge
  an ontology of real-world entities may be given
  map entities / clusters of entities to ontology entries
  ⇒ k-nearest neighbor methods
Information Extraction
References
• Eugene Agichtein, Luis Gravano: Snowball: Extracting Relations from Large Plain-Text Collections, ACM DL, 2000
• Eugene Agichtein, Sunita Sarawagi: Scalable Information Extraction and Integration, Tutorial, KDD 2006
• Eugene Charniak: Statistical Techniques for Natural Language Parsing, AI Magazine 18(4), 1997
• S. Chaudhuri, R. Ramakrishnan, G. Weikum: Integrating DB and IR Technologies: What Is the Sound of One Hand Clapping?, CIDR 2005
• Ronen Feldman: Information Extraction: Theory and Practice, Tutorial, ICML 2006
• John Lafferty, Andrew McCallum, Fernando Pereira: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, ICML 2001
• L. R. Rabiner: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proc. IEEE 77(2), 1989