Post on 11-Jan-2016
807 - TEXT ANALYTICS
Massimo Poesio
Lecture 5: Named Entity Recognition
Text classification at Different Granularities
• Text Categorization: – Classify an entire document
• Information Extraction (IE):– Identify and classify small units within documents
• Named Entity Extraction (NE):– A subset of IE– Identify and classify proper names
• People, locations, organizations
Adapted from slide by William Cohen
What is Information Extraction?
Filling slots in a database from sub-segments of text. As a task:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
NAME TITLE ORGANIZATION
Adapted from slide by William Cohen
What is Information Extraction?
Filling slots in a database from sub-segments of text. As a task:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
IE
Adapted from slide by William Cohen
What is Information Extraction?
Information Extraction = segmentation + classification + association
As a family of techniques:
Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
aka “named entity extraction”
Adapted from slide by William Cohen
INFORMATION EXTRACTION
• More general definition: extraction of structured information from unstructured documents
• IE Tasks:– Named entity extraction
• Named entity recognition• Coreference resolution• Relationship extraction
• Semi-structured IE– Table extraction
• Terminology extraction
Adapted from slide by William Cohen
Landscape of IE Tasks: Degree of Formatting
Grammatical sentences and some formatting & links
Text paragraphs without formatting
Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.
Non-grammatical snippets, rich formatting & links; Tables
Adapted from slide by William Cohen
Landscape of IE Tasks: Intended Breadth of Coverage
Web site specific Genre specific Wide, non-specific
Amazon.com Book Pages Resumes University Names
Formatting Layout Language
Landscape of IE Tasks: Complexity
Closed set
He was born in Alabama…
The big Wyoming sky…
U.S. states
Regular set
Phone: (413) 545-1323
The CALD main office can be reached at 412-268-1299
U.S. phone numbers
Complex pattern
University of Arkansas
P.O. Box 140
Hope, AR 71802
U.S. postal addresses
Headquarters:
1128 Main Street, 4th Floor
Cincinnati, Ohio 45210
…was among the six houses sold by Hope Feldman that year.
Ambiguous patterns, needing context and many sources of evidence
Person names
Pawel Opalinski, Software Engineer at WhizBang Labs.
Adapted from slide by William Cohen
Landscape of IE Tasks: Single Field/Record
Single entity
Person: Jack Welch
Binary relationship
Relation: Person-Title
Person: Jack Welch
Title: CEO
N-ary record
“Named entity” extraction
Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.
Relation: Company-Location
Company: General Electric
Location: Connecticut

Relation: Succession
Company: General Electric
Title: CEO
Out: Jack Welch
In: Jeffrey Immelt
Person: Jeffrey Immelt
Location: Connecticut
Adapted from slide by William Cohen
State of the Art Performance: a sample
• Named entity recognition from newswire text– Person, Location, Organization, …– F1 in high 80’s or low- to mid-90’s
• Binary relation extraction– Contained-in (Location1, Location2)
Member-of (Person1, Organization1)– F1 in 60’s or 70’s or 80’s
• Web site structure recognition– Extremely accurate performance obtainable– Human effort (~10min?) required on each site
Slide by Chris Manning, based on slides by several others
Three generations of IE systems
• Hand-Built Systems – Knowledge Engineering [1980s– ]
– Rules written by hand
– Require experts who understand both the systems and the domain
– Iterative guess-test-tweak-repeat cycle
• Automatic, Trainable Rule-Extraction Systems [1990s– ]
– Rules discovered automatically from predefined templates, using automated rule learners
– Require huge labeled corpora (the effort is just moved!)
• Statistical Models [1997– ]
– Use machine learning to learn which features indicate boundaries and types of entities
– Learning usually supervised; may be partially unsupervised
Named Entity Recognition (NER)
Input:
Apple Inc., formerly Apple Computer, Inc., is an American multinational corporation headquartered in Cupertino, California that designs, develops, and sells consumer electronics, computer software and personal computers. It was established on April 1, 1976, by Steve Jobs, Steve Wozniak and Ronald Wayne.
Output:
Apple Inc., formerly Apple Computer, Inc., is an American multinational corporation headquartered in Cupertino, California that designs, develops, and sells consumer electronics, computer software and personal computers. It was established on April 1, 1976, by Steve Jobs, Steve Wozniak and Ronald Wayne.
Named Entity Recognition (NER)
• Locate and classify atomic elements in text into predefined categories (persons, organizations, locations, temporal expressions, quantities, percentages, monetary values, …)
• Input: a block of text– Jim bought 300 shares of Acme Corp. in 2006.
• Output: annotated block of text– <ENAMEX TYPE="PERSON">Jim</ENAMEX> bought
<NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>
– ENAMEX tags (MUC in the 1990s)
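The annotated output above can be produced mechanically once the entity spans and types are known; the real difficulty of NER is finding them. A minimal sketch (the `annotate` helper and the entity list are illustrative, not part of any MUC tool):

```python
# A minimal sketch of producing MUC-style inline annotations, assuming the
# entity spans and their types are already known (a real system must find them).
def annotate(text, entities):
    # entities: list of (surface_string, tag_element, tag_type) triples
    for surface, element, etype in entities:
        text = text.replace(
            surface, f'<{element} TYPE="{etype}">{surface}</{element}>'
        )
    return text

sentence = "Jim bought 300 shares of Acme Corp. in 2006."
entities = [
    ("Jim", "ENAMEX", "PERSON"),
    ("300", "NUMEX", "QUANTITY"),
    ("Acme Corp.", "ENAMEX", "ORGANIZATION"),
    ("2006", "TIMEX", "DATE"),
]
print(annotate(sentence, entities))
```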
THE STANDARD NEWS DOMAIN
• Most work on NER focuses on NEWS
• Variants of the repertoire of entity types first studied in MUC and then in ACE:
– PERSON
– ORGANIZATION
– GPE
– LOCATION
– TEMPORAL ENTITY
– NUMBER
HOW
• Two tasks:
– Identifying the part of text that mentions an entity (RECOGNITION)
– Classifying it (CLASSIFICATION)
• The two tasks are reduced to a standard classification task by having the system classify WORDS
Basic Problems in NER
• Variation of NEs – e.g. John Smith, Mr Smith, John
• Ambiguity of NE types:
– John Smith (company vs. person)
– May (person vs. month)
– Washington (person vs. location)
– 1945 (date vs. time)
• Ambiguity with common words, e.g. “may”
Problems in NER
• Category definitions are intuitively quite clear, but there are many grey areas.
• Many of these grey areas are caused by metonymy:
– Organisation vs. Location: "England won the World Cup" vs. "The World Cup took place in England"
– Company vs. Artefact: "shares in MTV" vs. "watching MTV"
– Location vs. Organisation: "she met him at Heathrow" vs. "the Heathrow authorities"
Solutions
• The task definition must be very clearly specified at the outset.
• The definitions adopted at the MUC conferences for each category listed guidelines, examples, counter-examples, and the "logic" behind the intuition.
• MUC essentially adopted simplistic approach of disregarding metonymous uses of words, e.g. “England” was always identified as a location. However, this is not always useful for practical applications of NER (e.g. football domain).
• Idealistic solutions, on the other hand, are not always practical to implement, e.g. making distinctions based on world knowledge.
More complex problems in NER
• Issues of style, structure, domain, genre, etc.
– Punctuation, spelling, spacing, formatting … all have an impact
Dept. of Computing and Maths
Manchester Metropolitan University
Manchester
United Kingdom
> Tell me more about Leonardo> Da Vinci
Approaches to NER: List Lookup
• System that recognises only entities stored in its lists (GAZETTEERS).
• Advantages - Simple, fast, language independent, easy to retarget
• Disadvantages – collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
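The list-lookup approach can be sketched in a few lines; the toy gazetteers below are illustrative, and the sketch also shows the limitations noted above (unknown names and ambiguous names simply fall through as "O"):

```python
# A toy list-lookup (gazetteer) recogniser: it can only label names that
# are stored in its lists; everything else is tagged "O" (outside).
GAZETTEERS = {
    "LOCATION": {"London", "Paris", "Manchester"},
    "ORGANIZATION": {"Microsoft", "IBM"},
}

def gazetteer_tag(tokens):
    tags = []
    for tok in tokens:
        label = "O"
        for etype, names in GAZETTEERS.items():
            if tok in names:
                label = etype
                break
        tags.append((tok, label))
    return tags

print(gazetteer_tag(["Microsoft", "opened", "an", "office", "in", "Paris"]))
```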
Approaches to NER: Shallow Parsing
• Names often have internal structure. These components can be either stored or guessed.
location: CapWord + {City, Forest, Center}, e.g. Sherwood Forest
CapWord + {Street, Boulevard, Avenue, Crescent, Road}, e.g. Portobello Street
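Patterns over internal name structure like these map naturally onto regular expressions; a minimal sketch (the keyword list follows the slide, the helper name is illustrative):

```python
import re

# Internal-structure pattern: a capitalised word followed by a location
# keyword, e.g. "Sherwood Forest" or "Portobello Street".
LOCATION_PATTERN = re.compile(
    r"\b([A-Z][a-z]+)\s+"
    r"(City|Forest|Center|Street|Boulevard|Avenue|Crescent|Road)\b"
)

def find_locations(text):
    return [m.group(0) for m in LOCATION_PATTERN.finditer(text)]

print(find_locations("They met on Portobello Street near Sherwood Forest."))
```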
Shallow Parsing Approach(E.g., Mikheev et al 1998)
• External evidence - names are often used in very predictive local contexts
Location:
"to the" COMPASS "of" CapWord, e.g. to the south of Loitokitok
"based in" CapWord, e.g. based in Loitokitok
CapWord "is a" (ADJ)? GeoWord, e.g. Loitokitok is a friendly city
Difficulties in Shallow Parsing Approach
• Ambiguously capitalised words (first word in sentence): [All American Bank] vs. All [State Police]
• Semantic ambiguity: "John F. Kennedy" = airport (location); "Philip Morris" = organisation
• Structural ambiguity: [Cable and Wireless] vs. [Microsoft] and [Dell]; [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith]
Machine learning approaches to NER
• NER as classification: the IOB representation• Supervised methods
– Support Vector Machines– Logistic regression (aka Maximum Entropy)– Sequence pattern learning– Hidden Markov Models– Conditional Random Fields
• Distant learning• Semi-supervised methods
THE ML APPROACH TO NE: THE IOB REPRESENTATION
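In the IOB (or BIO) scheme, each token gets one label: B-X begins an entity of type X, I-X continues it, and O marks tokens outside any entity. A small sketch of the encoding and of recovering entity spans from it (the tagged sentence is illustrative):

```python
# IOB reduces NER to per-token classification: B-X begins an entity of
# type X, I-X continues it, O marks tokens outside any entity.
tokens = ["Jim", "bought", "300", "shares", "of", "Acme", "Corp.", "in", "2006", "."]
iob    = ["B-PER", "O", "O", "O", "O", "B-ORG", "I-ORG", "O", "O", "O"]

def decode_iob(tokens, tags):
    """Recover (entity_string, type) spans from an IOB tag sequence."""
    spans, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:
        spans.append((" ".join(current), ctype))
    return spans

print(decode_iob(tokens, iob))  # [('Jim', 'PER'), ('Acme Corp.', 'ORG')]
```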
THE ML APPROACH TO NE: FEATURES
Supervised ML for NER
• Methods already seen– Decision trees– Support Vector Machines
• Sequence learning – Hidden Markov Models– Maximum Entropy Models– Conditional Random Fields
NER as a SEQUENCE CLASSIFICATION TASK
Sequence Labeling as Classification: POS Tagging
Classify each token independently, but use information about the surrounding tokens as input features (sliding window).
Slide from Ray Mooney
John saw the saw and decided to take it to the table.
classifier
NNP
Sequence Labeling as Classification
Classify each token independently, but use information about the surrounding tokens as input features (sliding window).
John saw the saw and decided to take it to the table.
Sliding the classifier across the sentence, one token at a time, yields the full tag sequence:
NNP VBD DT NN CC VBD TO VB PRP IN DT NN
Slide from Ray Mooney
Using Outputs as Inputs
Better input features are usually the categories of the surrounding tokens, but these are not available yet
Can use category of either the preceding or succeeding tokens by going forward or back and using previous output
Slide from Ray Mooney
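This greedy use of previous outputs can be sketched with a toy stand-in for the trained classifier (the hand-written rules below are purely illustrative, not a real model):

```python
# Greedy forward classification: tokens are tagged left to right, and each
# decision may condition on the previously predicted tag. The rule-based
# "classifier" is a toy stand-in for a trained model.
def classify(token, prev_tag):
    if prev_tag == "TO":
        return "VB"      # after "to", prefer a base-form verb
    if token == "to":
        return "TO"
    if token == "the":
        return "DT"
    if prev_tag == "DT":
        return "NN"      # right after a determiner, prefer a noun
    if token == "John":
        return "NNP"
    return "VBD"

def forward_tag(tokens):
    tags, prev = [], None
    for tok in tokens:
        prev = classify(tok, prev)
        tags.append(prev)
    return tags

print(forward_tag(["John", "saw", "the", "saw"]))  # ['NNP', 'VBD', 'DT', 'NN']
```

Note how the second "saw" is correctly tagged NN only because the previous prediction (DT) is available as a feature.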
Forward Classification
John saw the saw and decided to take it to the table.
Tagging left to right, each prediction becomes an input feature for the next decision:
NNP VBD DT NN CC VBD TO VB …
Slide from Ray Mooney
Backward Classification
• Disambiguating "to" in this case would be even easier backward.
John saw the saw and decided to take it to the table.
Tagging right to left, each prediction becomes an input feature for the next decision:
NNP VBD DT VBD CC VBD TO VB PRP IN DT NN
Slide from Ray Mooney
NER as Sequence Labeling
Probabilistic Sequence Models
Probabilistic sequence models allow integrating uncertainty over multiple, interdependent classifications and collectively determine the most likely global assignment
Two standard models:
– Hidden Markov Model (HMM)
– Conditional Random Field (CRF)
The Maximum Entropy Markov Model (MEMM) is a simplified version of the CRF.
Hidden Markov Models (HMMs)
• Generative – Find parameters to maximize P(X,Y)
• Assumes features are independent
• When labeling X_i, future observations are taken into account (forward-backward)
Conditional Random Fields (CRFs)
• Discriminative– Find parameters to maximize P(Y|X)
• Doesn't assume that features are independent
• When labeling Y_i, future observations are taken into account
The best of both worlds!
PROBABILISTIC CLASSIFICATION: GENERATIVE VS DISCRIMINATIVE
• Let Y be the random variable for the class which takes values {y1,y2,…ym}.
• Let X be the random variable describing an instance consisting of a vector of values for n features <X1,X2…Xn>, let xk be a possible vector value for X and xij a possible value for Xi.
• For classification, we need to compute P(Y=yi | X=xk) for i = 1…m
• Could be done using joint distribution but this requires estimating an exponential number of parameters.
Discriminative Vs. Generative
Generative model: models the joint distribution p(y, x); a model from which the observed data could be generated randomly.
Naïve Bayes: once the class label is known, all the features are independent.
Discriminative model: directly estimates the posterior probability p(y | x); aims at modeling the "discrimination" between different outputs.
MaxEnt classifier: a linear combination of feature functions in the exponent.
Both generative models and discriminative models describe distributions over (y , x), but they work in different directions.
Discriminative vs. Generative
(Graphical-model view: generative models describe p(y, x), discriminative models p(y | x); nodes are marked observable vs. unobservable.)
Simple Linear Chain CRF Features
• Modeling the conditional distribution is similar to that used in multinomial logistic regression.
• Create feature functions fk(Yt, Yt−1, Xt)
– Feature for each state transition pair i, j• fi,j(Yt, Yt−1, Xt) = 1 if Yt = i and Yt−1 = j and 0 otherwise
– Feature for each state observation pair i, o• fi,o(Yt, Yt−1, Xt) = 1 if Yt = i and Xt = o and 0 otherwise
• Note: number of features grows quadratically in the number of states (i.e. tags).
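These indicator feature functions can be enumerated directly; a sketch with an illustrative tag set and vocabulary, showing the quadratic growth in the number of tags:

```python
# Indicator feature functions for a linear chain CRF: one per
# (tag, previous-tag) transition pair and one per (tag, word) observation
# pair, so the count grows quadratically in the number of tags.
TAGS = ["O", "B-PER", "I-PER"]
VOCAB = ["John", "saw", "Smith"]

def transition_feature(i, j):
    # f_{i,j}(Y_t, Y_{t-1}, X_t) = 1 iff Y_t = i and Y_{t-1} = j
    return lambda y_t, y_prev, x_t: 1 if (y_t == i and y_prev == j) else 0

def observation_feature(i, o):
    # f_{i,o}(Y_t, Y_{t-1}, X_t) = 1 iff Y_t = i and X_t = o
    return lambda y_t, y_prev, x_t: 1 if (y_t == i and x_t == o) else 0

features = (
    [transition_feature(i, j) for i in TAGS for j in TAGS]
    + [observation_feature(i, o) for i in TAGS for o in VOCAB]
)
print(len(features))  # 3*3 transitions + 3*3 observations = 18
```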
Conditional Distribution for Linear Chain CRF
• Using these feature functions for a simple linear chain CRF, we can define:

P(Y | X) = (1 / Z(X)) exp( Σ_{t=1..T} Σ_{k=1..K} λ_k f_k(Y_t, Y_{t-1}, X_t) )

Z(X) = Σ_Y exp( Σ_{t=1..T} Σ_{k=1..K} λ_k f_k(Y_t, Y_{t-1}, X_t) )
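For a tiny tag set and a short sequence, this distribution can be computed by brute force, which makes the definition concrete (the tags, feature names, and weights below are illustrative; real CRFs use dynamic programming for Z(X)):

```python
from itertools import product
from math import exp

# Brute-force linear-chain CRF: score every tag sequence Y with
# exp(sum of weighted features) and normalise by Z(X).
TAGS = ["O", "ENT"]

def feature_score(y_t, y_prev, x_t, weights):
    # weights maps illustrative feature keys to lambda_k values
    s = 0.0
    s += weights.get(("trans", y_prev, y_t), 0.0)
    s += weights.get(("obs", y_t, x_t), 0.0)
    return s

def crf_distribution(x, weights):
    scores = {}
    for y in product(TAGS, repeat=len(x)):
        total = sum(
            feature_score(y[t], y[t - 1] if t else None, x[t], weights)
            for t in range(len(x))
        )
        scores[y] = exp(total)
    z = sum(scores.values())          # the partition function Z(X)
    return {y: s / z for y, s in scores.items()}

weights = {("obs", "ENT", "Acme"): 2.0, ("trans", "ENT", "ENT"): 1.0}
dist = crf_distribution(["Acme", "Corp."], weights)
best = max(dist, key=dist.get)
print(best)  # ('ENT', 'ENT')
```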
Adding Token Features to a CRF
• Can add token features X_{i,j}: each position t carries a vector of features X_{t,1} … X_{t,m} instead of a single observation, alongside the label chain Y_1 Y_2 … Y_T.
• Can add additional feature functions for each token feature to model the conditional distribution.
NER: EVALUATION
TYPICAL PERFORMANCE
NER Evaluation Campaigns
• English NER -- CoNLL 2003 - PER/ORG/LOC/MISC
– Training set: 203,621 tokens
– Development set: 51,362 tokens
– Test set: 46,435 tokens
• Italian NER -- Evalita 2009 - PER/ORG/LOC/GPE
– Development set: 223,706 tokens
– Test set: 90,556 tokens
• Mention Detection-- ACE 2005– 599 documents
CoNLL2003 shared task (1)
• English and German language• 4 types of NEs:
– LOC Location– MISC Names of miscellaneous entities– ORG Organization– PER Person
• Training Set for developing the system• Test Data for the final evaluation
CoNLL2003 shared task (2)
• Data– columns separated by a single space– A word for each line– An empty line after each sentence – Tags in IOB format
• An example:
Milan NNP B-NP I-ORG
's POS B-NP O
player NN I-NP O
George NNP I-NP I-PER
Weah NNP I-NP I-PER
meet VBP B-VP O
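A reader for this column format is straightforward; a sketch (the function name is illustrative) that splits on whitespace and treats blank lines as sentence boundaries:

```python
# Read the CoNLL-2003 column format: one token per line with
# space-separated columns (word, POS, chunk, NE tag), and a blank
# line after each sentence.
def read_conll(lines):
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ne = line.split()
        current.append((word, pos, chunk, ne))
    if current:
        sentences.append(current)
    return sentences

sample = [
    "Milan NNP B-NP I-ORG",
    "'s POS B-NP O",
    "player NN I-NP O",
    "George NNP I-NP I-PER",
    "Weah NNP I-NP I-PER",
    "meet VBP B-VP O",
    "",
]
sents = read_conll(sample)
print(len(sents), sents[0][0])  # 1 ('Milan', 'NNP', 'B-NP', 'I-ORG')
```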
CoNLL2003 shared task (3)
English     precision  recall   F
[FIJZ03]    88.99%     88.54%   88.76%
[CN03]      88.12%     88.51%   88.31%
[KSNM03]    85.93%     86.21%   86.07%
[ZJ03]      86.13%     84.88%   85.50%
---------------------------------------
[Ham03]     69.09%     53.26%   60.15%
baseline    71.91%     50.90%   59.61%
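The F column is the balanced F1 score, the harmonic mean of precision and recall, which can be verified against the table:

```python
# F1 = 2PR / (P + R), the harmonic mean of precision P and recall R.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Checking the top CoNLL-2003 entry [FIJZ03]:
print(round(f1(88.99, 88.54), 2))  # 88.76
```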
CURRENT RESEARCH ON NER
• New domains• New approaches:
– Semi-supervised– Distant
• Handling many NE types
• Integration with Machine Translation
• Handling difficult linguistic phenomena such as metonymy
NEW DOMAINS
• BIOMEDICAL• CHEMISTRY• HUMANITIES: MORE FINE GRAINED TYPES
Bioinformatics Named Entities
• Protein• DNA• RNA• Cell line• Cell type• Drug• Chemical
NER IN THE HUMANITIES
LOC
SITE
CULTURE
Semi-supervised learning
• Modest amounts of supervision– Small size of training data– Supervisor input sought when necessary
• Aims to match supervised learning performance, but with much less human effort
• Bootstrapping– Seeds used to identify contextual clues– Contextual clues used to find more NEs
• Examples: (Brin 1998); (Collins and Singer 1999); (Riloff and Jones 1999); (Cucchiarelli and Velardi 2001); (Pasca et al. 2006); (Heng and Grishman 2006); (Nadeau et al. 2006), and (Liao and Veeramachaneni, 2009)
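The bootstrapping cycle described above (seeds → contextual clues → more NEs) can be sketched on a toy corpus; the corpus, seed, and "context = preceding word" choice below are all illustrative simplifications:

```python
# Toy bootstrapping: seed names yield contextual clues, and the clues
# then harvest new names from the corpus.
corpus = [
    "President Obama visited Paris",
    "President Macron gave a speech",
    "Mr Smith met President Biden",
]
seeds = {"Obama"}

def bootstrap(corpus, seeds, iterations=2):
    names = set(seeds)
    for _ in range(iterations):
        # 1. learn contexts (here: the immediately preceding word) of known names
        contexts = set()
        for sent in corpus:
            toks = sent.split()
            for i, tok in enumerate(toks):
                if tok in names and i > 0:
                    contexts.add(toks[i - 1])
        # 2. use the contexts to find new candidate names
        for sent in corpus:
            toks = sent.split()
            for i, tok in enumerate(toks[:-1]):
                if tok in contexts:
                    names.add(toks[i + 1])
    return names

print(sorted(bootstrap(corpus, seeds)))  # ['Biden', 'Macron', 'Obama']
```

From one seed, the clue "President" is learned, and it in turn extracts the other person names; real systems rank clues and candidates to control drift.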
Semi-supervised learning
Input: a seed list of a few examples of a given NE type
• 'Muhammad' and 'Obama' can be used as seed examples for entities of type person.
Parameters:
– Number of iterations
– Number of initial seeds
– The ranking measure (reliability measure)
ASemiNER - Methodology
– Sentences containing a seed instance are retrieved.
– A number of tokens on each side of the seed is extracted, respecting sentence boundaries.
Pattern Induction - Initial Patterns
TP pair = (Token/POS) pair
– Token (noun) inflected forms
Pattern Induction - Final Patterns
− Token (verb) stems
– Lists of trigger nouns (e.g., alsayd `Mr.’, alraeys `President’, alduktur `Dr.’ )
• They will be used as Arabic NE indicators or trigger words in the training phase.
• Arabic Wikipedia articles are crawled randomly, prepared, and POS-tagged.
• The left and right nouns of the named entity are extracted and collected.
• The top most frequent nouns (inflected) are picked and stored as “trigger” nouns.
– Lists of trigger verbs (e.g., rasam `draw', naHat `sculpture', etc.)
• The top most frequent verbs (stems) are picked and stored as "trigger" verbs.
Pattern Induction - Final Patterns“Trigger” words
• Generalization:
– TP pairs containing nouns or verbs are stripped of their 'Token' parts, unless the token is in the corresponding list of trigger words: alsayd/NN `Mr./NN' stays alsayd/NN, since alsayd `Mr.' is in the list of trigger nouns; qalam/NN `pen/NN' becomes /NN, since qalam `pen' is not among the trigger nouns.
– TP pairs that contain a preposition are kept without changes.
– TP pairs that contain other parts of speech (e.g., proper noun, adjective, coordinating conjunction) are stripped of their 'Token' parts: mufyd/JJ `useful/JJ' becomes /JJ.
• All POS tags used for verbs (e.g., VBP, VBD, VBN) are converted to one form, VB.
• All POS tags used for nouns (e.g., NN, NNS) are converted to one form, NN.
• All POS tags used for proper nouns (e.g., NNP, NNPS) are converted to one form, NNP.
• The seed instance is replaced with an NE class tag (e.g., <PersonName>, <Location>, <Organization>).
Pattern Induction
• Producing Final Patterns
Initial Pattern
Final Pattern
English Gloss
Pattern Induction
• Two more Final Patterns
Pattern Induction
• The final pattern set (P) is modified and filtered every time a new pattern is added.
• Repeated patterns are rejected.
• A pattern consisting of fewer than six TP pairs must contain at least one 'Token' part.
– /VB /NN <PersonName>/NNP /NNP
ASemiNER - Methodology
Instance Extraction
• ASemiNER retrieves from the training corpus the set of instances (I) that match any of the patterns in (P), using regular expressions (regexes).
• ASemiNER automatically generates regexes from the final patterns without any modification, regardless of the correctness of the POS tags assigned to the proper noun by the POS tagger.
• ASemiNER automatically adds the average NE length (2 tokens) to the produced regexes.
ASemiNER - Methodology
Instance Ranking/Selection
• Extracted instances in (I) are ranked according to:
– The number of distinct patterns used to extract them (pattern variety is a better cue to semantics than absolute frequency)
– Pointwise Mutual Information (PMI) :
• |i,p|: the frequency of the instance i extracted by pattern p.
• |i|: the frequency of the instance i in the corpus.
• |p|: the frequency of the pattern p in the corpus.
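The PMI score can be sketched from these counts; the exact normalisation used by ASemiNER is not given here, so this follows the standard formulation PMI(i, p) = log(|i,p| · N / (|i| · |p|)), with N the corpus size (an assumption):

```python
from math import log

# Pointwise mutual information between an instance i and a pattern p,
# in the standard formulation (the normalisation by corpus size N is
# an assumption, not taken from the slides):
#   PMI(i, p) = log( |i,p| * N / (|i| * |p|) )
def pmi(count_ip, count_i, count_p, n):
    return log((count_ip * n) / (count_i * count_p))

# An instance extracted often by a pattern relative to its overall corpus
# frequency scores higher than one that merely occurs a lot everywhere.
print(pmi(8, 10, 20, 1000) > pmi(8, 200, 20, 1000))  # True
```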
ASemiNER - Methodology
Top m instances
Where ( m ) is set to the number of instances in the previous iteration + 1
Experiments & Results
• Several experiments – Different values of the parameters
• No. of iterations.
• No. of initial seeds.
• Ranking measure.
• Training data– ACE 2005 (Linguistic Data Consortium, LDC)
– ANERcorp training set (Benajiba et al. 2007)
• Test data– ANERcorp test corpus
Experiments & Results
• Several experiments
– Standard NE types:
• Person
• Location
• Organization
– Specialised NE types:
• Politicians
• Sportspersons
• Artists
Simple Models (standard NE types)
• ANERcorp (Training data)
• Without iterations.
• No. of Initial seeds : 5
ASemiNER (Specialised NE types)
• Politicians, Artists, and Sportspersons
• Unlike supervised techniques, ASemiNER does not require additional annotated training data or re-annotating the existing data.
• It requires only minor modification: for each new NE type, generate new trigger noun and verb lists.
– Artist trigger nouns (e.g., actress, actor, painter, etc.)
– Politicians trigger nouns (e.g., president, party, king, …etc.)
– Sportsmen trigger nouns (e.g., player, football, athletic, …etc.)
ASemiNER (Specialised NE types)
– ASemiNER performs as well on the specialised types as on recognizing the standard person category
– ASemiNER proved to be easily adaptable when extracting new types of NEs
• Query:
WIKIPEDIA AND NER
• Wikipedia:
May 2012, Truc-Vien T. Nguyen
Giotto was called to work in Padua, and also in Rimini
THANKS
• I used slides from Bernardo Magnini, Chris Manning, Roberto Zanoli, Ray Mooney