ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems
Hiroshi Nakagawa: Introduction, Feature Extraction (Phrase Extraction)
Minoru Yoshida: Feature Extraction (Information Extraction Approach) to End
(University of Tokyo)
Contents
1. Introduction
2. Feature Extraction
3. Feature Weighting / Similarity Calculation
4. Clustering
5. Evaluation Issues
Introduction
1. Motivation
2. Problem Settings
3. Differences from Other Problems
4. History
Motivation
• Web search for person names: over 10% of all queries
• The "same-name" problem in person name search
– When different real-world entities have the same name, the reference from the name to the entity is ambiguous.
– Many different persons have the same name (e.g., John Smith)
– Some persons have the same name as a famous one (e.g., Bill Gates)
→ Difficult to access the target person
A study of the query logs of the AllTheWeb and AltaVista search sites gives an idea of the relevance of the people search task: 11-17% of the queries were composed of a person name with additional terms, and 4% were identified simply as person names (Artiles+, 2009 WePS2).
With ordinary search engines, it is tough to find a Bill Gates who is not the Microsoft founder: the famous person dominates the results!
Problem in People Search
Query → Search engine → Results
Which pages refer to which persons?
Person Name Clustering
Query → Search engine → Search result → Clusters of Web pages
Each page in a cluster refers to the same entity.
Sample System (query = Ichiro Suzuki: famous Japanese baseball player)
• Keywords about the person
• Documents about the same person
Output Example (Ichiro Suzuki)
• Painter
• Lawyer
• Dentist
(Used as an example name because Ichiro is so famous.)
Introduction
1. Motivation
2. Problem Settings
3. Differences from Other Problems
4. History
Problem Setting
• Given: a set of Web pages returned from a search engine for a person name query
• Goal: cluster the Web pages
– One cluster for one entity
– Possibly with related information (e.g., biography and/or related words)
Another usage: if a person has many aspects, like scientist and poet, pages for these aspects are grouped together, making it easy to grasp who he/she is.
Example: Sakai Shuichi
Sakai Shuichi is a professor at the University of Tokyo in the field of computer architecture: these pages are about his books on computer architecture.
He is also a Japanese poet: these pages are about his collections of poems.
Example: the famous car maker "TOYOTA"
These pages are about TOYOTA's retailer network.
These pages are about TOYOTA HOME, a house maker and one of the TOYOTA group companies.
Introduction
1. Motivation
2. Problem Settings
3. Differences from Other Problems
4. History
Difference from Other Tasks
• Cluster documents for the same person
• Difficult to use training data for other person names

                 WSD / Categorization   Person Name Clustering             Document Clustering
Goal             Categorize             Cluster documents about the        Cluster similar documents
                                        same entity (= person)
Answers          Definite y/n           Definite y/n                       Not definite
# of clusters    # of categories        # of entities (unknown, but an     Task dependent
                                        exact number exists in the
                                        real world)
Training data    Yes                    Difficult to use                   No
Learning         Supervised             Unsupervised                       Unsupervised
WSD: Word Sense Disambiguation
Example: "bank"
I was strolling along the bank. / Do you use a bank card there? / Did you go to the bank?
Which sense does each "bank" have?
Challenges
• Noisy Web data → light linguistic tools
– POS taggers, stemmers, NE taggers
– Pattern-based information extraction
• How to use "training data"
– Most systems use an unsupervised clustering approach
– Some systems assume "background knowledge"
• How to determine K (the number of clusters)
– Remember: this K does not depend on the user's intention; it is an exact, fixed number in real use. Different from usual clustering!
Notes:
(1) Heavy, sophisticated NLP tools such as HPSG parsers are not suitable for this purpose.
(2) The system should work at a tolerable speed, so lightweight tools are needed.
Introduction
1. Motivation
2. Problem Settings
3. Differences from Other Problems
4. History
History
1998: Cross-document coreference resolution [Bagga+, 98] – naive VSM
      (with roots in Word Sense Disambiguation and Coreference Resolution)
2003: Disambiguation for Web search results [Mann+, 03] – biographic data
2007: Web People Search Workshop (WePS) [Artiles+, 07][Artiles+, 09]
History
• Web People Search Workshop
– 1st: SemEval-2007
– 2nd: WWW-2009
  • Document Clustering
  • Attribute Extraction
– 3rd: CLEF-2010 (Conference on Multilingual and Multimodal Information Access Evaluation), 20-23 September 2010, Padua
  • Document Clustering & Attribute Extraction
  • Organization Name Disambiguation
WePS2 Data (Artiles+, 09)
[Slides showing the WePS2 data sources (30 names), corpus statistics, and the WePS2 summary report; tables not reproduced here.]
Contents
1. Introduction
2. Feature Extraction
3. Feature Weighting / Similarity Calculation
4. Clustering
5. Evaluation Issues
Main Steps
1. Preprocessing
2. Feature extraction
3. Feature weighting / similarity calculation
4. Clustering
5. (Related information extraction)
PREPROCESSING
Preprocessing
• Filter out useless pages ("junk pages")
– The name is matched, but the matched string does not refer to a person (e.g., a company name)
– In addition, alphabetically ordered name-list pages are filtered out (Ono+, 08)
• Data cleaning
– HTML tag removal
– Sentence (snippet) extraction
– Coreference resolution (used by Bagga+) — in fact, a very difficult NLP task
Junk Page Filtering
• SVM-based classification (Wan+, 05)
• Features:
– Simple lexical features (words related or not related to the person name)
– Stylistic features (fonts / tags), i.e., how many and which words are in bold font
– Query-relevant features (next-to-query words)
– Linguistic features (NE counts), e.g., how many person, organization, and location names appear
A sketch of such a classifier follows this list.
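Below is a minimal sketch of such a classifier, assuming pages arrive as raw text; the features and training pairs are toy stand-ins, not the feature set of (Wan+, 05).

# Sketch: SVM-based junk-page filtering.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def page_features(text, query_name):
    tokens = text.split()
    return {
        "n_tokens": len(tokens),                             # simple lexical
        "query_hits": text.count(query_name),                # query-relevant
        "bold_tags": text.count("<b>"),                      # stylistic
        "capitalized": sum(t[0].isupper() for t in tokens),  # crude NE-count proxy
    }

# Hypothetical training data: 1 = page about a person, 0 = junk page.
pages = [("John Smith is a painter born in 1965 ...", "John Smith"),
         ("John Smith &amp; Sons Ltd. <b>catalog</b> ...", "John Smith")]
labels = [1, 0]

vec = DictVectorizer()
X = vec.fit_transform(page_features(t, q) for t, q in pages)
clf = LinearSVC().fit(X, labels)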
FEATURE EXTRACTION
Feature Extraction
• How to characterize each name appearance
– The name itself cannot be used for disambiguation!
• Each name appearance can be characterized by its contexts.
• Possible contexts
– Surrounding words, adjacent strings, syntactically related words, etc.
– Which to use?
Basic Approach
• Use all words in documents
– Or snippets (texts around the name)
– Or titles/summaries (first sentence, etc.)
• Use the TFIDF weighting scheme (a sketch follows)
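A minimal sketch of this baseline, assuming the pages returned for one name are already cleaned to plain text; the two example documents are invented.

# Sketch: TF-IDF vectors over all words, compared by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Bill Gates founded Microsoft and remains its largest shareholder.",
        "Bill Gates is a sociologist studying urban communities."]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)            # documents x terms TF-IDF matrix
sim = cosine_similarity(X)             # pairwise page similarities
print(sim[0, 1])                       # low similarity -> likely different persons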
Problem
• There exist relatively useful features and relatively useless features (especially for person name disambiguation)
– Useful: NEs, biography, noun phrases, etc.
– Useless: general words, boilerplate, etc.
• How to distinguish useful features from the others?
• How to weight each feature?
Named Entities
• Documents about Bill Gates: related person names and related organization names stand out.

Noun Phrases
• Documents about Bill Gates: related key words stand out.

Other Words
• Documents about Bill Gates: among the remaining words, some are more important than others.
Extracting Useful Features
• Thresholding, based on a score related to our purpose: TFIDF, etc.
• Tool-based approach: POS tagging, NE tagging
• Information extraction approach (described later by Yoshida)
• Meta-data approach: link structures, meta tags
Thresholding
• Calculate TFIDF scores of words
• Discard the words with low TFIDF scores
• Unigrams, bigrams, even N-grams can be used (Chen+, 09), where the Google 5-gram corpus (from 1T words) is used to calculate the TFIDF score
• Other scores can be used as well: log-likelihood ratio, mutual information, KL-divergence, ...
A sketch follows.
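A minimal sketch of the thresholding step; the cutoff 0.2 is arbitrary, and a real system would estimate IDF from a large corpus such as the Google 5-gram counts used by (Chen+, 09).

# Sketch: keep only unigram/bigram features whose TF-IDF score exceeds a cutoff.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Bill Gates founded Microsoft and remains its largest shareholder.",
        "Bill Gates is a sociologist studying urban communities."]

vec = TfidfVectorizer(ngram_range=(1, 2))        # unigrams and bigrams
X = vec.fit_transform(docs)
terms = vec.get_feature_names_out()

threshold = 0.2                                   # arbitrary cutoff
weights = X[0].toarray().ravel()                  # TF-IDF weights of doc 0
kept = [t for t, w in zip(terms, weights) if w > threshold]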
Tool-Based Approach
• Available tools:
– POS tagging
– NE extraction (sophisticated) vs. bigram / N-gram extraction (unsophisticated but simple)
– Keyword extraction (in the middle between NE extraction and bigram / N-gram extraction)
• High-performance POS taggers have been developed for many languages.
• For Western languages, stemmers have also been developed.
Part of Speech (POS) Tagging
• Detect the grammatical categories of the words
– Nouns, verbs, prepositions, adverbs, adjectives, ...
– Typically, nouns are used as features
– Noun phrases can be extracted with some simple rules
– Many available tools (e.g., Tree Tagger)
Example (tags such as NOUN, VERB, ADJECTIVE, DETERMINER assigned): William Henry "Bill" Gates III (born October 28, 1955) is an American business magnate, philanthropist, ...
A tagging sketch follows.
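A sketch with NLTK's tagger (an assumption about tooling; the slide mentions Tree Tagger, and any POS tagger works the same way).

# Sketch: POS-tag a sentence and keep the nouns as features.
# Requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
import nltk

sent = 'William Henry "Bill" Gates III is an American business magnate.'
tagged = nltk.pos_tag(nltk.word_tokenize(sent))      # [(word, tag), ...]
nouns = [w for w, tag in tagged if tag.startswith("NN")]
print(nouns)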
Named Entity (NE) Extraction
• Find "proper names" in texts
– e.g., names of persons, organizations, locations, ...
– Often includes time expressions
– Many available tools (Stanford NER, OpenNLP, ESpotter, ...)
Example (PERSON and DATE tagged): William Henry "Bill" Gates III (born October 28, 1955) is an American business magnate, philanthropist, ...
A sketch follows.
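A sketch with NLTK's bundled chunker (an assumption about tooling; Stanford NER or OpenNLP, listed above, are stronger alternatives).

# Sketch: extract named entities with NLTK's ne_chunk.
# Requires: nltk.download('maxent_ne_chunker'); nltk.download('words')
import nltk

sent = "Bill Gates was born on October 28, 1955 in Seattle."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
entities = [(st.label(), " ".join(w for w, t in st.leaves()))
            for st in tree if hasattr(st, "label")]
print(entities)          # e.g., [('PERSON', 'Bill Gates'), ('GPE', 'Seattle')]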
Key Phrase Extraction
• Noun phrases consisting of 2 or more words
– Likely to be topic-related concepts
– Term-extraction tool "Gensen" (Nakagawa+, 05)
  • Scores noun phrases by "term-likelihood"
  • Topic-related terms get higher scores
Example: "Gates held the positions of CEO and chief software architect, and remains the largest individual shareholder ..." — extracted phrases receive scores such as 45.2 and 22.4.
Gensen (言選) Web Score
Example: from a corpus we extract the compound terms 信息処理, 計算機処理能力, 処理段階, 信息処理学会 (information processing, computer processing capacity, processing step, information processing society).
For W = 処理 (processing):
L: # of distinct left-adjacent words (信息, 計算機) + 1 = 2 + 1
R: # of distinct right-adjacent words (能力, 段階, 学会) + 1 = 3 + 1
L(W=処理) = 3, R(W=処理) = 4, so L(処理) × R(処理) = 3 × 4 = 12
Calculation of LR and FLR
Compound word: W = w1 ... wn, where each wi is a simple noun.
L(wi) = (# of distinct left-side connections of wi) + 1
R(wi) = (# of distinct right-side connections of wi) + 1
The score LR of the compound word W = w1 ... wn (like 信息処理学会) is defined as:

    LR(W) = \left( \prod_{i=1}^{n} L(w_i)\, R(w_i) \right)^{\frac{1}{2n}}

(the 1/(2n) exponent normalizes by length).
Example: LR(信息処理) = [L(信息) × R(信息) × L(処理) × R(処理)]^{1/4}
Or: LR(information processing) = [L(info.) × R(info.) × L(proc.) × R(proc.)]^{1/4}

Calculation of LR and FLR
F(W) is the independent frequency of the compound word W, where "independent" means that W is not part of a longer compound word. Then FLR(W) is defined as:

    FLR(W) = F(W) \times LR(W)

Example: FLR(信息処理) = F(信息処理) × [L(信息) × R(信息) × L(処理) × R(処理)]^{1/4}
This FLR score is used to rank term candidates (LR keeps it normalized by length).
F(W) has a similar effect to TF; hence, if the corpus is big, F(W) affects FLR(W) more. A computation sketch follows.
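A minimal sketch of computing FLR directly from the formulas above; the counts reuse the 処理 example, and the frequency F is an assumed value.

# Sketch: LR and FLR scores for a compound word W = w1 ... wn.
from math import prod

def lr(words, L, R):
    # LR(W) = (prod_i L(w_i) * R(w_i)) ** (1 / (2n))
    n = len(words)
    return prod(L[w] * R[w] for w in words) ** (1.0 / (2 * n))

def flr(words, L, R, F):
    # FLR(W) = F(W) * LR(W)
    return F[tuple(words)] * lr(words, L, R)

# Counts from the example above (+1 smoothing already included):
L = {"信息": 0 + 1, "処理": 2 + 1}
R = {"信息": 1 + 1, "処理": 3 + 1}
F = {("信息", "処理"): 5}            # independent frequency (assumed value)
print(flr(["信息", "処理"], L, R, F))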
Example of term extraction by Gensen Web — English article: "Support vector machine" from Wikipedia:
Support vector machines (SVMs) are a set of related supervised learning methods that analyze data and recognize patterns, used for classification and regression analysis. The original SVM algorithm was invented by Vladimir Vapnik and the current standard incarnation (soft margin) was proposed by Corinna Cortes and Vladimir Vapnik[1]. The standard SVM is a non-probabilistic binary linear classifier, i.e. it predicts, for each given input, which of two possible classes the input is a member of. Since an SVM is a classifier, then given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other. Intuitively, an SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. …….Another approach is to use an interior point method that uses Newton-like iterations to find a solution of the Karush-Kuhn-Tucker conditions of the primal and dual problems.[10] Instead of solving a sequence of broken down problems, this approach directly solves the problem as a whole. To avoid solving a linear system involving the large kernel matrix, a low rank approximation to the matrix is often used to use the kernel trick.
Extracted terms (term score):
Top 1-17: hyperplane 116.65, margin 109.54, SVM 74.08, vector 56.12, point 52.85, support vector 49.34, training data 48.12, data 47.83, problem 44.27, space 44.09, data point 38.01, classifier 30.59, classification 29.58, optimization problem 26.05, set 25.30, support vector machine 24.66, kernel 21.00
Top 18-38: set of point 20.73, linear classifier 19.99, maximum-margin hyperplane 19.92, example 19.60, one 17.32, Vladimir Vapnik 15.87, parameter 14.70, linear SVM 14.40, training set 14.00, optimization 13.42, model 12.25, training vector 12.04, support vector classification 11.70, two classe 11.57, normal vector 11.38, kernel trick 11.22, maximum margin classifier 11.22
Top 408-426 (last): Vandewalle 1.00, derive 1.00, it 1.00, Leisch 1.00, 2.3 1.00, H1 1.00, c 1.00, Hornik 1.00, mean 1.00, testing 1.00, transformation 1.00, unconstrained 1.00, homogeneous 1.00, need 1.00, learner 1.00, grid-search 1.00, convex 1.00, See 1.00, trade 1.00
Contents
1. Introduction
2. Feature Extraction
3. Feature Weighting / Similarity Calculation
4. Clustering
5. Evaluation Issues
Information Extraction Approach
• Information extraction: the task of extracting specific types of information
– e.g., a person and his/her workplace
Example: William Henry "Bill" Gates III (born October 28, 1955) is an American business magnate, philanthropist, ...
(NAME = William Henry "Bill" Gates III; DATE OF BIRTH = October 28, 1955; NATIONALITY = American; OCCUPATION = business magnate)
Information Extraction Approach
• Useful features for disambiguation (Wan+, 05) (Mann+, 03) (Niu+, 04)
• Also used as "summaries" of clusters
– To help users find the clusters they are looking for
– WePS-2 "attribute extraction task"

Information Extraction Approach
• Different methods for different attributes
– Simple patterns (hand-crafted / automatically obtained): phone, FAX, URL, e-mail
– Syntactic rules (hand-crafted / automatically generated): date of birth, titles, positions
– Dictionary match (from Wikipedia, etc.): occupation, major, degree, nationality
– Keywords extracted by NER tools: birth place (LOCATION), affiliation (ORGANIZATION), schools (ORGANIZATION)
Hand-Crafted Patterns
• Typically written with regular expressions
• Phone, FAX: +## (#) ####-####
• URLs: http://www.xxx.xxx.xxx/...
• E-mails: xxx@xxx.xxx
• Some classification is needed (phone or FAX?)
– Supervised learning
– Keyword-based approach (e.g., "born" for date of birth)
A sketch follows.
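A sketch of such patterns as Python regular expressions; these are illustrative, not the patterns of any cited system.

# Sketch: hand-crafted extraction patterns as regular expressions.
import re

PHONE = re.compile(r"\+\d{1,3}\s*\(\d+\)\s*\d{3,4}-\d{4}")   # +## (#) ####-####
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL   = re.compile(r"https?://[^\s<>\"]+")

text = "Call +81 (3) 1234-5678, mail info@example.org, see http://www.example.org/"
print(PHONE.findall(text), EMAIL.findall(text), URL.findall(text))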
Automatically Generated Patterns
• Patterns for birth years (Mann+, 03):
– <name> (<birth year> - ####)
– <name> <name> ( <birth year>
– <name> was born in <birth year>
• Patterns for titles (Wan+, 05):
– <name> is a <title>

Automatically Generated Patterns
• Approach by (Mann+, 03): bootstrapping
– Start with seed facts (e.g., (Mozart, 1756))
– Find sentences (from the Web) that contain both elements (e.g., "Mozart was born in 1756")
– Perform some generalization (e.g., "<name> was born in <birth year>")
– Extract substrings with high scores (measured using the current facts)
– Extract new facts, and repeat
Dictionary Matching
• Construct lists of occupations, nations (for the "nationality" attribute), etc. from existing dictionaries
– Wikipedia, WordNet, etc.
• e.g., a list of countries
Link Structure Approach
• It is difficult to obtain the correct network structure
– Difficulty in finding "in-links"
• Some approximation is needed
• (Bekkerman+, 05): "socially linked persons tend to link to similar pages"
– Determine whether two pages are linked or not
– MaxEnt classification with "linked-page" (URLs in pages) features
FEATURE WEIGHTING / SIMILARITY CALCULATION
Feature Weighting
• Knowledge-based approach: US Census data, WordNet
• Web-query approach
• SVD
• Bootstrapping
• Determination of link/non-link by supervised classifiers
Knowledge-Based Approach
• US Census data
– Frequent name → ambiguous (Fleischman+, 04)
• WordNet
– Semantic similarity for concept words (WordNet distance)

WordNet
• Publicly available "dictionary" (thesaurus)
– Hierarchical structure over words
– We can find "synonyms", "hyponyms", and "hypernyms" of words
• Many "semantic distance" measures between two words
– Path length
– Depth of common hypernyms
– ...
A sketch follows.
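A sketch with NLTK's WordNet interface (an assumption about tooling), using path length, one of the measures listed above.

# Sketch: WordNet path-based similarity between two concept words.
# Requires: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

s1 = wn.synsets("singer")[0]
s2 = wn.synsets("politician")[0]
print(s1.path_similarity(s2))            # higher = semantically closer
print(s1.lowest_common_hypernyms(s2))    # shared hypernym concept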
Web-Query Approach
• Name-concept relations (Fleischman+, 04)
• Validate relations between context NEs by Web search counts (Kalashnikov+, 08) (Nuray-Turan+, 09)
• Use the query "name + bigram", concatenating the snippets into a new document (Chen+, 09)
• Obtain reliable counts (google_df) (Bekkerman+, 05)
Name-Concept Relations (Fleischman+, 04)
• Task: distinguish (name, concept) pairs
– (Paul Simon, pop star) vs. (Paul Simon, singer): likely the same person
– (Paul Simon, pop star) vs. (Paul Simon, politician): likely different persons
• MaxEnt classifier
• Features using Web counts (N: name, c: concept, +: AND operation), as sketched below:
– Q(N + c1 + c2): intersection
– |Q(N + c1) - Q(N + c2)|: difference
– Q(N + c1 + c2) / (Q(N + c1) + Q(N + c2)): ratio
Validate Relations between Context NEs by Web Search Counts (Kalashnikov+, 08) (Nuray-Turan+, 09)
• NE-based document similarity calculated using Web counts
– NEs: persons or organizations
• WebDice (C: context set ... [c1] OR [c2] OR ...):
– 2Q(N + C1 + C2) / (Q(N + C1) + Q(N + C2))
– 2Q(N + C1 + C2) / (Q(N) + Q(C1 + C2))
– The second one worked better
Use the Query "name + bigram", Concatenating the Snippets into a New Document (Chen+, 09)
• Obtain additional features for similarity calculation
– Web page → b: maximal-weight bigram
– Top-100 snippets for the query (N + b) → one new document
– New document → additional features (tokens)

Obtaining Reliable Counts (google_df) (Bekkerman+, 05)
• google_tfidf(w) = tf(w) / log(Q(w))
• Some recent systems use the Google N-gram corpus instead (Chen+, 09)
Dimension Reduction by SVD (Pedersen+, 05)
• Reduces the sparseness of context vectors
• Gives more semantic-level representations (can exploit word similarities in contexts)
• Bigram features (contexts)
• Strong features can identify a person
– High precision, but not always observed

[Figure: contexts "Bill Gates, Paul Allen, Microsoft, program" and "Bill Gates, Steve Ballmer, Microsoft, program" are judged to describe the same person; "Bill Gates, program" alone is harder.]
• Strong features: NEs, CKWs, ...
• Weak features: not useful in general, but useful for this particular name
A sketch of the SVD step follows.
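A sketch of the SVD step with scikit-learn's TruncatedSVD (an assumption about tooling); the contexts and the number of components are invented.

# Sketch: reduce sparse bigram context vectors with SVD.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

contexts = ["Bill Gates Paul Allen Microsoft program",
            "Bill Gates Steve Ballmer Microsoft program",
            "Bill Gates program"]
X = CountVectorizer(ngram_range=(1, 2)).fit_transform(contexts)
X_reduced = TruncatedSVD(n_components=2).fit_transform(X)   # dense, low-dimensional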
Cluster Refinement by Bootstrapping (1/4) (Yoshida+, 10)
[Figure: a document set d1, ..., dn and a feature set f1, ..., fm are connected by a document-feature matrix P; an initial clustering gives a document-cluster relation r_{D,C} and a feature-cluster relation r_{F,C}.]
Cluster Refinement by Bootstrapping (2/4)
The two relations are propagated through the document-feature matrix P:

    r_{F,C}^{(t)} = P^{T} r_{D,C}^{(t)}
    r_{D,C}^{(t+1)} = P\, r_{F,C}^{(t)}

so that

    r_{D,C}^{(t+1)} = P P^{T} r_{D,C}^{(t)}
Cluster Refinement by Bootstrapping (3/4)
[Figure: the initial 0/1 document-cluster values are refined into graded relation values by multiplying with P P^{T}.]
Each document is assigned to the cluster with the largest relation value.

Cluster Refinement by Bootstrapping (4/4)
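A minimal numpy sketch of the refinement loop above; the row normalization is an assumption added for stability, not necessarily the normalization used in (Yoshida+, 10).

# Sketch: bootstrapping refinement r <- normalize(P P^T r).
import numpy as np

def refine(P, r0, iters=10):
    r = r0.astype(float)
    for _ in range(iters):
        r = P @ (P.T @ r)                  # r_{F,C} = P^T r ; r_{D,C} = P r_{F,C}
        r /= r.sum(axis=1, keepdims=True)  # keep relation values bounded
    return r

P = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1]], dtype=float)   # docs x features
r0 = np.array([[1, 0], [0, 1], [0, 1]], dtype=float)           # initial 0/1 clusters
labels = refine(P, r0).argmax(axis=1)      # largest relation value wins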
Determination of "Linked" or "Not-Linked" by Supervised Classifiers
• MaxEnt classification (Fleischman+, 04)
– Features: name features, Web features, etc.
• Skyline-based classification (Kalashnikov+, 08)
– Features: search engine hit counts
CLUSTERING
Problem: How to Determine K
• Hierarchical clustering with thresholds
• Online Clustering (Single Pass Clustering)
• Building “core” clusters (2-stage clustering)
• Variable-Component-Number Clustering (e.g., Dirichlet Process Mixture)
Hierarchical Clustering with Thresholds
• Used in many systems
• Popular settings:
– Agglomerative clustering
– Group-average method (or, sometimes, the single-link method)
– Predetermined threshold (or, sometimes, determined by cross-validation)
Hierarchical Clustering with Thresholds
[Figure: a dendrogram over documents 5, 2, 3, 9, 1, 8, 7, 6, 4; cutting it at a high threshold yields 2 clusters {1,2,3,5,9}, {4,6,7,8}; cutting lower yields 4 clusters {2,5}, {1,3,9}, {6,7,8}, {4}.]
Cluster Similarity
Group-average method:

    sim(C_x, C_y) = \frac{1}{|C_x||C_y|} \sum_{d_x \in C_x} \sum_{d_y \in C_y} sim(d_x, d_y)

Other cluster-distance calculations: complete-linkage method, single-linkage method, centroid method.
A sketch follows.
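A sketch with SciPy (an assumption about tooling); the 0.9 threshold and random feature vectors are invented.

# Sketch: group-average agglomerative clustering cut at a fixed threshold.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(9, 20)                 # 9 documents x 20 TF-IDF features
Z = linkage(pdist(X, metric="cosine"), method="average")
labels = fcluster(Z, t=0.9, criterion="distance")   # cut dendrogram at 0.9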
Online Clustering
• Single-pass clustering (Balog+, 08)
– Take pages from the 1st in the search results
– For each page, find the most similar cluster
– If the similarity is below the threshold, create a new cluster
• Similarity: naive Bayes or cosine with TFIDF
A sketch follows this list.
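A minimal sketch of single-pass clustering; the cluster-page similarity here is the maximum over cluster members (a simplification, where (Balog+, 08) used naive Bayes or cosine with TFIDF).

# Sketch: single-pass (online) clustering.
def single_pass(vectors, sim, threshold):
    clusters = []                              # each cluster: list of indices
    for i in range(len(vectors)):
        best, best_sim = None, threshold
        for c in clusters:
            s = max(sim(vectors[i], vectors[j]) for j in c)
            if s >= best_sim:
                best, best_sim = c, s
        if best is None:
            clusters.append([i])               # below threshold: new cluster
        else:
            best.append(i)
    return clusters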
Building Core Clusters
• 1st-stage clustering: high-precision clusters
– Relatively high threshold (Mann+, 03)
– Use strong features only (Ikeda+, 09)
• 2nd-stage clustering: treat the remaining documents
– Add each to the most similar 1st-stage cluster (Mann+, 03) (Ikeda+, 09)
– Feature weighting by 1st-stage clusters (Yoshida+, 10)
Query Expansion Approach (Ikeda+, 09)
• Re-extract key phrases by using 1st-stage clusters
– Key phrases for documents → key phrases for clusters (e.g., "home runs", "major leagues", "all stars")
– More reliable than a single document
1. Extract top CKWs from the current cluster
2. Search for the CKWs in documents outside the cluster
3. If such documents exist, copy them into the cluster (soft clustering)
4. Remove 1-element clusters
Feature Weighting by 1st-Stage Clusters (Yoshida+, 10)
1. Make clusters using strong features
2. Weight weak features using the clusters, and refine similarities
3. Refine the clusters using the new similarities
Using Dirichlet Process Mixture (Ono+, 08)
• Topic = word distribution
– Topic "economics" = word distribution {"dollar": 0.03, "stock": 0.05, "share": 0.01, ...}
• Document = mixture of topics, e.g., {economics: 0.3, politics: 0.2, ...}
• A document's topic = the topic with the highest weight
• Modeling by DPUM (Dirichlet Process Unigram Mixture)
– The number of topics is automatically determined (see the sketch below)
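A sketch of the Chinese Restaurant Process that underlies the Dirichlet process prior, showing how the number of clusters K grows with the data rather than being fixed; alpha is the concentration parameter (value invented).

# Sketch: cluster assignments drawn from a Chinese Restaurant Process.
import random

def crp(n_docs, alpha=1.0):
    assignments, sizes = [], []
    for _ in range(n_docs):
        weights = sizes + [alpha]              # existing clusters, or a new one
        k = random.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)                    # open a new cluster
        else:
            sizes[k] += 1
        assignments.append(k)
    return assignments

print(crp(20))     # e.g., [0, 0, 1, 0, 2, ...]; K is determined by the data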
Example: Estimation of Latent Topics
[Figure: documents plotted as points in word space (word-1, word-2, word-3); each latent entity corresponds to a (red) bar.]
Dirichlet Process Unigram Mixture
[Graphical model: G0 = base distribution over θ (a Dirichlet distribution); G, drawn from the Dirichlet process, is a countable mixture of multinomial distributions; each document d draws a multinomial θd, and its words wdn (n = 1..Nd, d = 1..M) are drawn from θd. "UM" = unigram mixture.]
DPUM Parameter Estimation
[Figure: starting from an initial entity distribution, the entity distribution is estimated by iteratively maximizing the likelihood; clusters labeled Politics, Economics, Entertainment, Sports, Arts, and Society emerge.]
• Merge clusters with the same topic
EVALUATION ISSUES
Evaluation Issues
• Evaluation measures
• Available corpora
• WePS workshops

Evaluation Measures
• Precision / Recall / F-measure
• Purity / Inverse Purity
• B-cubed Precision / Recall / F-measure
– Extended B-cubed
Recall and Precision for Clustering
• Features and recall/precision
• First-stage cluster = high precision
A: size of the cluster; B: # of correct documents; C: # of correct documents in the cluster

    precision = C / A,  recall = C / B

Example: A = 5, B = 8, C = 3 → precision = 3/5 = 0.6, recall = 3/8 = 0.375
Recall and Precision [Larsen and Aone 1999]
• Machine-made clusters D and correct clusters C
• For each correct cluster c ∈ C, F is computed against the machine-made cluster d ∈ D that maximizes F(c, d)
• Total F-measure:

    F = \sum_{c \in C} \frac{|c|}{N} \max_{d \in D} F(c, d)

where N is the total number of documents.
Note: precision (P) and recall (R) are calculated in the same way.
Example
Correct clusters C: [A][A][A][A][A], [B][B], ...
Machine-made clusters D and their scores against class A:
• [A][A][B]: P = 2/3, R = 2/5, F = 1/2
• [A][A][A][B][C]: P = 3/5, R = 3/5, F = 3/5
Purity / Inverse Purity
• Similar to precision / recall
– L: manually annotated categories (clusters)
– C: clusters output by the system

B-Cubed Precision / Recall
• Entity-wise accuracy calculation
– C: the cluster (by the system) containing entity e
– L: the cluster (by a human) containing entity e
(Illustration borrowed from (Amigo+, 09); a sketch follows.)
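A minimal sketch of B-cubed for hard clusterings; the entity-to-cluster dictionaries are invented.

# Sketch: B-cubed precision and recall, averaged over entities.
def bcubed(system, gold):
    def members(assign):                       # cluster id -> set of entities
        inv = {}
        for e, c in assign.items():
            inv.setdefault(c, set()).add(e)
        return inv
    sys_inv, gold_inv = members(system), members(gold)
    P = R = 0.0
    for e in system:
        C = sys_inv[system[e]]                 # system cluster containing e
        L = gold_inv[gold[e]]                  # gold cluster containing e
        P += len(C & L) / len(C)
        R += len(C & L) / len(L)
    n = len(system)
    return P / n, R / n

print(bcubed({"d1": 0, "d2": 0, "d3": 1}, {"d1": 0, "d2": 1, "d3": 1}))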
Other Metrics
• Counting pairs
– Given a pair of documents, label it "link" or "unlink"
– Problem: the # of pairs is quadratic in the cluster size
• Entropy
– Low entropy within a cluster → pure
• Edit distance
– Distance from the system output to the correct output

Which Metrics to Use
• Constraints (borrowed from (Amigo+, 09)):
– Homogeneity: the purer, the better
– Completeness: the more complete, the better
– Rag bag: adding noise to an already noisy cluster is better than adding it to a pure cluster
– Cluster size vs. quantity: a small error in a big cluster is better than a large number of small errors in small clusters
[Illustrations comparing the metrics, borrowed from (Amigo+, 09), not reproduced here.]
Baselines
P-IP vs. B-Cubed for Practical Data
• The Purity/Inverse-Purity measure is not appropriate in the soft-clustering case
– It gives very high scores to a "cheat" baseline clustering (COMBINED in the table)
• The B-cubed measure is appropriate in this case
Available Corpora
• John Smith corpus (Bagga+, 98)
• "12 different people" corpus (Bekkerman+, 05)
• WePS corpora (Artiles+, 07) (Artiles+, 09)
– WePS-1: 79 person names (49 training + 30 test), top 100 pages for each
– WePS-2: 30 person names, top 150 pages for each

WePS (Web People Search) Workshops (Artiles+, 07) (Artiles+, 09)
• Evaluation campaigns for person name disambiguation (along with person attribute extraction)
• WePS-1: with SemEval-2007; 16 teams participated
• WePS-2: with WWW-2009; 17 teams participated
References
• (Amigo+, 09) Enrique Amigó, Julio Gonzalo, Javier Artiles, Felisa Verdejo, A comparison of extrinsic clustering evaluation metrics based on formal constraints, Information Retrieval, v.12 n.4, pp. 461-486, August 2009
• (Artiles+, 07) Javier Artiles, Julio Gonzalo, Satoshi Sekine, The SemEval-2007 WePS evaluation: establishing a benchmark for the web people search task, Proceedings of the 4th International Workshop on Semantic Evaluations, pp. 64-69, June 23-24, 2007, Prague, Czech Republic
• (Artiles+, 09) J. Artiles, J. Gonzalo, and S. Sekine, WePS 2 Evaluation Campaign: overview of the Web People Search Clustering Task, 2nd Web People Search Evaluation Workshop (WePS 2009), 2009
• (Bagga+, 98) Amit Bagga, Breck Baldwin, Entity-based cross-document coreferencing using the Vector Space Model, Proceedings of the 17th International Conference on Computational Linguistics, August 10-14, 1998, Montreal, Quebec, Canada
• (Balog+, 08) K. Balog, L. Azzopardi, and M. de Rijke, Personal name resolution of web people search, In WWW2008 Workshop: NLP Challenges in the Information Explosion Era (NLPIX 2008), 2008
• (Balog+, 09) Krisztian Balog, Jiyin He, Katja Hofmann, Valentin Jijkoun, Christof Monz, Manos Tsagkias, Wouter Weerkamp and Maarten de Rijke, The University of Amsterdam at WePS2, 2nd Web People Search Evaluation Workshop (WePS 2009), 2009
References
• (Bekkerman+, 05) Ron Bekkerman, Andrew McCallum, Disambiguating Web appearances of people in a social network, Proceedings of the 14th International Conference on World Wide Web, May 10-14, 2005, Chiba, Japan
• (Bollegala+, 06) Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka, Extracting key phrases to disambiguate personal name queries in web search, Proceedings of the Workshop on How Can Computational Linguistics Improve Information Retrieval?, July 23, 2006, Sydney, Australia
• (Bunescu+, 06) R. Bunescu and M. Pasca, Using encyclopedic knowledge for named entity disambiguation, In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), 2006
• (Chen+, 09) Ying Chen, Sophia Yat Mei Lee and Chu-Ren Huang, PolyUHK: A Robust Information Extraction System for Web Personal Names, 2nd Web People Search Evaluation Workshop (WePS 2009), 2009
• (Chen+, 07) Ying Chen, James Martin, Towards Robust Unsupervised Personal Name Disambiguation, EMNLP-CoNLL 2007, pp. 190-198, 2007
• (Elmacioglu+, 07) Ergin Elmacioglu, Yee Fan Tan, Su Yan, Min-Yen Kan, Dongwon Lee, PSNUS: web people name disambiguation by simple clustering with rich features, Proceedings of the 4th International Workshop on Semantic Evaluations, pp. 268-271, June 23-24, 2007, Prague, Czech Republic
References
• (Fleischman+, 04) M.B. Fleischman and E.H. Hovy, Multi-Document Person Name Resolution, Proceedings of the Reference Resolution Workshop at the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), Barcelona, Spain, 2004
• (Gooi+, 04) Chung H. Gooi, James Allan, Cross-Document Coreference on a Large Scale Corpus, HLT-NAACL 2004: Main Proceedings, pp. 9-16, 2004
• (Han+, 04) Hui Han, C. Lee Giles, Hongyuan Zha, Cheng Li, Kostas Tsioutsiouliklis, Two supervised learning approaches for name disambiguation in author citations, JCDL 2004, pp. 296-305, 2004
• (Ikeda+, 09) M. Ikeda, S. Ono, I. Sato, M. Yoshida, and H. Nakagawa, Person Name Disambiguation on the Web by Two-Stage Clustering, 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference, 2009
• (Kalashnikov+, 08) Dmitri V. Kalashnikov, Rabia Nuray-Turan, Sharad Mehrotra, Towards breaking the quality curse: a web-querying approach to web people search, Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 20-24, 2008, Singapore
• (Li+, 04) X. Li, P. Morie and D. Roth, Robust Reading: Identification and Tracing of Ambiguous Names, Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), pp. 17-24, 2004
References
• (Malin, 05) Bradley Malin, Unsupervised name disambiguation via social network similarity, In Workshop on Link Analysis, Counterterrorism, and Security, with SDM 2005
• (Murakami, 10) Hiroshi Ueda, Harumi Murakami, and Shoji Tatsumi, Suggesting Subject Headings using Web Information Sources, ... Conference on Agents and Artificial Intelligence (ICAART 2010), Volume 1: Artificial Intelligence, pp. 640-643, 2010
• (Mann+, 03) Gideon S. Mann, David Yarowsky, Unsupervised personal name disambiguation, Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 33-40, May 31, 2003, Edmonton, Canada
• (Nakagawa+, 03) H. Nakagawa and T. Mori, Automatic term recognition based on statistics of compound nouns and their components, Terminology, 9(2):201-219, 2003
• (Niu+, 04) Cheng Niu, Wei Li, Rohini K. Srihari, Weakly supervised learning for cross-document person name disambiguation supported by information extraction, Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, p. 597, July 21-26, 2004, Barcelona, Spain
• (Nuray-Turan+, 09) R. Nuray-Turan, Z. Chen, D. Kalashnikov, and S. Mehrotra, Exploiting web querying for web people search in WePS2, 2nd Web People Search Evaluation Workshop (WePS 2009), 2009
References
• (On+, 07) B.-W. On and D. Lee, Scalable name disambiguation using multi-level graph partition, In Proc. of the SIAM SDM Conf., Minneapolis, Minnesota, USA, 2007
• (Ono+, 08) Shingo Ono, Issei Sato, Minoru Yoshida, Hiroshi Nakagawa, Person name disambiguation in web pages using social network, compound words and latent topics, Proceedings of the 12th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, May 20-23, 2008, Osaka, Japan
• (Pedersen+, 05) Ted Pedersen, Amruta Purandare, Anagha Kulkarni, Name Discrimination by Clustering Similar Contexts, CICLing 2005, pp. 226-237, 2005
• (Resnick+, 94) Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, John Riedl, GroupLens: an open architecture for collaborative filtering of netnews, Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, pp. 175-186, October 22-26, 1994, Chapel Hill, North Carolina, United States
• (Yoshida+, 10) Minoru Yoshida, Masaki Ikeda, Shingo Ono, Issei Sato, Hiroshi Nakagawa, Person name disambiguation by bootstrapping, In SIGIR '10: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 10-17, 2010