ACML 2010 Tutorial Web People Search: Person Name Disambiguation and Other Problems
Hiroshi Nakagawa: Introduction, Feature Extraction (Phrase Extraction)
Minoru Yoshida: Feature Extraction (Information Extraction Approach) to End
(University of Tokyo)
Contents
1. Introduction
2. Feature Extraction
3. Feature Weighting / Similarity Calculation
4. Clustering
5. Evaluation Issues
Introduction
1. Motivation
2. Problem Settings
3. Differences from Other Problems
4. History
Motivation
• Web search for person names: over 10% of all queries
• The "same-name" problem in person name search
– When different real-world entities have the same name, the reference from the name to the entity is ambiguous.
– Many different persons have the same name (e.g., John Smith)
– Some persons have the same name as a famous one (e.g., Bill Gates)
→ Difficult to access the target person
A study of the query logs of the AllTheWeb and AltaVista search sites gives an idea of the relevance of the people search task: 11-17% of the queries were composed of a person name with additional terms, and 4% were identified simply as person names (Artiles+, 2009 WePS2).
With ordinary search engines, it is tough to find a Bill Gates who is not the Microsoft founder: the famous person dominates the results!
Problem in People Search
Query → Search engine → Results
Which pages refer to which persons?
Person Name Clustering
Query → Search engine → Search result → Clusters of Web pages
Each page in a cluster refers to the same entity.
Sample System (query = Ichiro Suzuki: famous Japanese baseball player)
• Keywords about the person
• Documents about the same person
Output Example (Ichiro Suzuki)
• Painter
• Lawyer
• Dentist
(Used as an example name because Ichiro is so famous.)
Introduction
1. Motivation
2. Problem Settings
3. Differences from Other Problems
4. History
Problem Setting
• Given: a set of Web pages returned from a search engine for a person name query
• Goal: cluster the Web pages
– One cluster for one entity
– Possibly with related information (e.g., biography and/or related words)
Another usage: if a person has many aspects, like scientist and poet, pages for these aspects are grouped together, making it easy to grasp who he/she is.
Example: Sakai Shuichi
Sakai Shuichi is a professor at the University of Tokyo in the field of computer architecture: these pages are about his books on computer architecture.
He is also a Japanese poet: these pages are about his collections of poems.
Example: the famous car maker "TOYOTA"
These pages are about TOYOTA's retailer network.
These pages are about TOYOTA HOME, a house maker and one of the TOYOTA group companies.
Introduction
1. Motivation
2. Problem Settings
3. Differences from Other Problems
4. History
Difference from Other Tasks
• Cluster documents for the same person
• Difficult to use training data for other person names

                 WSD / Categorization   Person Name Clustering             Document Clustering
Goal             Categorize             Cluster documents about the        Cluster similar documents
                                        same entity (= person)
Answers          Definite y/n           Definite y/n                       Not definite
# of clusters    # of categories        # of entities (unknown, but an     Task dependent
                                        exact number exists in the
                                        real world)
Training data    Yes                    Difficult to use                   No
Learning         Supervised             Unsupervised                       Unsupervised
WSD: Word Sense Disambiguation
Example: "bank"
I was strolling along the bank. / Do you use a bank card there? / Did you go to the bank?
Which sense does each "bank" have?
Challenges
• Noisy Web data → light linguistic tools
– POS taggers, stemmers, NE taggers
– Pattern-based information extraction
• How to use "training data"
– Most systems use an unsupervised clustering approach
– Some systems assume "background knowledge"
• How to determine K (the number of clusters)
– Remember: this K does not depend on the user's intention; it is an exact, fixed number in real use. Different from usual clustering!
Notes:
(1) Heavy, sophisticated NLP tools such as HPSG parsers are not suitable for this purpose.
(2) The system should work at a tolerable speed, so lightweight tools are needed.
Introduction
1. Motivation
2. Problem Settings
3. Differences from Other Problems
4. History
History
1998: Cross-document coreference resolution [Bagga+, 98] – naive VSM
      (with roots in Word Sense Disambiguation and Coreference Resolution)
2003: Disambiguation for Web search results [Mann+, 03] – biographic data
2007: Web People Search Workshop (WePS) [Artiles+, 07][Artiles+, 09]
History
• Web People Search Workshop
– 1st: SemEval-2007
– 2nd: WWW-2009
  • Document Clustering
  • Attribute Extraction
– 3rd: CLEF-2010 (Conference on Multilingual and Multimodal Information Access Evaluation), 20-23 September 2010, Padua
  • Document Clustering & Attribute Extraction
  • Organization Name Disambiguation
WePS2 Data (Artiles+, 09)
[Slides showing the WePS2 data sources (30 names), corpus statistics, and the WePS2 summary report; tables not reproduced here.]
Contents
1. Introduction
2. Feature Extraction
3. Feature Weighting / Similarity Calculation
4. Clustering
5. Evaluation Issues
Main Steps
1. Preprocessing
2. Feature extraction
3. Feature weighting / similarity calculation
4. Clustering
5. (Related information extraction)
PREPROCESSING
Preprocessing
• Filter out useless pages ("junk pages")
– The name is matched, but the matched string does not refer to a person (e.g., a company name)
– In addition, alphabetically ordered name-list pages are filtered out (Ono+, 08)
• Data cleaning
– HTML tag removal
– Sentence (snippet) extraction
– Coreference resolution (used by Bagga+) — in fact, a very difficult NLP task
Junk Page Filtering
• SVM-based classification (Wan+, 05)
• Features:
– Simple lexical features (words related or not related to the person name)
– Stylistic features (fonts / tags), i.e., how many and which words are in bold font
– Query-relevant features (next-to-query words)
– Linguistic features (NE counts), e.g., how many person, organization, and location names appear
A sketch of such a classifier follows this list.
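Below is a minimal sketch of such a classifier, assuming pages arrive as raw text; the features and training pairs are toy stand-ins, not the feature set of (Wan+, 05).

# Sketch: SVM-based junk-page filtering.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def page_features(text, query_name):
    tokens = text.split()
    return {
        "n_tokens": len(tokens),                             # simple lexical
        "query_hits": text.count(query_name),                # query-relevant
        "bold_tags": text.count("<b>"),                      # stylistic
        "capitalized": sum(t[0].isupper() for t in tokens),  # crude NE-count proxy
    }

# Hypothetical training data: 1 = page about a person, 0 = junk page.
pages = [("John Smith is a painter born in 1965 ...", "John Smith"),
         ("John Smith &amp; Sons Ltd. <b>catalog</b> ...", "John Smith")]
labels = [1, 0]

vec = DictVectorizer()
X = vec.fit_transform(page_features(t, q) for t, q in pages)
clf = LinearSVC().fit(X, labels)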
FEATURE EXTRACTION
Feature Extraction
• How to characterize each name appearance
– The name itself cannot be used for disambiguation!
• Each name appearance can be characterized by its contexts.
• Possible contexts
– Surrounding words, adjacent strings, syntactically related words, etc.
– Which to use?
Basic Approach
• Use all words in documents
– Or snippets (texts around the name)
– Or titles/summaries (first sentence, etc.)
• Use the TFIDF weighting scheme (a sketch follows)
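A minimal sketch of this baseline, assuming the pages returned for one name are already cleaned to plain text; the two example documents are invented.

# Sketch: TF-IDF vectors over all words, compared by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Bill Gates founded Microsoft and remains its largest shareholder.",
        "Bill Gates is a sociologist studying urban communities."]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)            # documents x terms TF-IDF matrix
sim = cosine_similarity(X)             # pairwise page similarities
print(sim[0, 1])                       # low similarity -> likely different persons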
Problem
• There exist relatively useful features and relatively useless features (especially for person name disambiguation)
– Useful: NEs, biography, noun phrases, etc.
– Useless: general words, boilerplate, etc.
• How to distinguish useful features from the others?
• How to weight each feature?
Named Entities
• Documents about Bill Gates: related person names and related organization names stand out.

Noun Phrases
• Documents about Bill Gates: related key words stand out.

Other Words
• Documents about Bill Gates: among the remaining words, some are more important than others.
Extracting Useful Features
• Thresholding, based on a score related to our purpose: TFIDF, etc.
• Tool-based approach: POS tagging, NE tagging
• Information extraction approach (described later by Yoshida)
• Meta-data approach: link structures, meta tags
Thresholding
• Calculate TFIDF scores of words
• Discard the words with low TFIDF scores
• Unigrams, bigrams, even N-grams can be used (Chen+, 09), where the Google 5-gram corpus (from 1T words) is used to calculate the TFIDF score
• Other scores can be used as well: log-likelihood ratio, mutual information, KL-divergence, ...
A sketch follows.
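A minimal sketch of the thresholding step; the cutoff 0.2 is arbitrary, and a real system would estimate IDF from a large corpus such as the Google 5-gram counts used by (Chen+, 09).

# Sketch: keep only unigram/bigram features whose TF-IDF score exceeds a cutoff.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Bill Gates founded Microsoft and remains its largest shareholder.",
        "Bill Gates is a sociologist studying urban communities."]

vec = TfidfVectorizer(ngram_range=(1, 2))        # unigrams and bigrams
X = vec.fit_transform(docs)
terms = vec.get_feature_names_out()

threshold = 0.2                                   # arbitrary cutoff
weights = X[0].toarray().ravel()                  # TF-IDF weights of doc 0
kept = [t for t, w in zip(terms, weights) if w > threshold]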
Tool-Based Approach
• Available tools:
– POS tagging
– NE extraction (sophisticated) vs. bigram / N-gram extraction (unsophisticated but simple)
– Keyword extraction (in the middle between NE extraction and bigram / N-gram extraction)
• High-performance POS taggers have been developed for many languages.
• For Western languages, stemmers have also been developed.
Part of Speech (POS) Tagging
• Detect the grammatical categories of the words
– Nouns, verbs, prepositions, adverbs, adjectives, ...
– Typically, nouns are used as features
– Noun phrases can be extracted with some simple rules
– Many available tools (e.g., Tree Tagger)
Example (tags such as NOUN, VERB, ADJECTIVE, DETERMINER assigned): William Henry "Bill" Gates III (born October 28, 1955) is an American business magnate, philanthropist, ...
A tagging sketch follows.
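A sketch with NLTK's tagger (an assumption about tooling; the slide mentions Tree Tagger, and any POS tagger works the same way).

# Sketch: POS-tag a sentence and keep the nouns as features.
# Requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
import nltk

sent = 'William Henry "Bill" Gates III is an American business magnate.'
tagged = nltk.pos_tag(nltk.word_tokenize(sent))      # [(word, tag), ...]
nouns = [w for w, tag in tagged if tag.startswith("NN")]
print(nouns)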
Named Entity (NE) Extraction
• Find "proper names" in texts
– e.g., names of persons, organizations, locations, ...
– Often includes time expressions
– Many available tools (Stanford NER, OpenNLP, ESpotter, ...)
Example (PERSON and DATE tagged): William Henry "Bill" Gates III (born October 28, 1955) is an American business magnate, philanthropist, ...
A sketch follows.
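A sketch with NLTK's bundled chunker (an assumption about tooling; Stanford NER or OpenNLP, listed above, are stronger alternatives).

# Sketch: extract named entities with NLTK's ne_chunk.
# Requires: nltk.download('maxent_ne_chunker'); nltk.download('words')
import nltk

sent = "Bill Gates was born on October 28, 1955 in Seattle."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
entities = [(st.label(), " ".join(w for w, t in st.leaves()))
            for st in tree if hasattr(st, "label")]
print(entities)          # e.g., [('PERSON', 'Bill Gates'), ('GPE', 'Seattle')]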
Key Phrase Extraction
• Noun phrases consisting of 2 or more words
– Likely to be topic-related concepts
– Term-extraction tool "Gensen" (Nakagawa+, 05)
  • Scores noun phrases by "term-likelihood"
  • Topic-related terms get higher scores
Example: "Gates held the positions of CEO and chief software architect, and remains the largest individual shareholder ..." — extracted phrases receive scores such as 45.2 and 22.4.
Gensen (言選) Web Score
Example: from a corpus we extract the compound terms 信息処理, 計算機処理能力, 処理段階, 信息処理学会 (information processing, computer processing capacity, processing step, information processing society).
For W = 処理 (processing):
L: # of distinct left-adjacent words (信息, 計算機) + 1 = 2 + 1
R: # of distinct right-adjacent words (能力, 段階, 学会) + 1 = 3 + 1
L(W=処理) = 3, R(W=処理) = 4, so L(処理) × R(処理) = 3 × 4 = 12
Calculation of LR and FLR
Compound word: W = w1 ... wn, where each wi is a simple noun.
L(wi) = (# of distinct left-side connections of wi) + 1
R(wi) = (# of distinct right-side connections of wi) + 1
The score LR of the compound word W = w1 ... wn (like 信息処理学会) is defined as:

    LR(W) = \left( \prod_{i=1}^{n} L(w_i)\, R(w_i) \right)^{\frac{1}{2n}}

(the 1/(2n) exponent normalizes by length).
Example: LR(信息処理) = [L(信息) × R(信息) × L(処理) × R(処理)]^{1/4}
Or: LR(information processing) = [L(info.) × R(info.) × L(proc.) × R(proc.)]^{1/4}

Calculation of LR and FLR
F(W) is the independent frequency of the compound word W, where "independent" means that W is not part of a longer compound word. Then FLR(W) is defined as:

    FLR(W) = F(W) \times LR(W)

Example: FLR(信息処理) = F(信息処理) × [L(信息) × R(信息) × L(処理) × R(処理)]^{1/4}
This FLR score is used to rank term candidates (LR keeps it normalized by length).
F(W) has a similar effect to TF; hence, if the corpus is big, F(W) affects FLR(W) more. A computation sketch follows.
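A minimal sketch of computing FLR directly from the formulas above; the counts reuse the 処理 example, and the frequency F is an assumed value.

# Sketch: LR and FLR scores for a compound word W = w1 ... wn.
from math import prod

def lr(words, L, R):
    # LR(W) = (prod_i L(w_i) * R(w_i)) ** (1 / (2n))
    n = len(words)
    return prod(L[w] * R[w] for w in words) ** (1.0 / (2 * n))

def flr(words, L, R, F):
    # FLR(W) = F(W) * LR(W)
    return F[tuple(words)] * lr(words, L, R)

# Counts from the example above (+1 smoothing already included):
L = {"信息": 0 + 1, "処理": 2 + 1}
R = {"信息": 1 + 1, "処理": 3 + 1}
F = {("信息", "処理"): 5}            # independent frequency (assumed value)
print(flr(["信息", "処理"], L, R, F))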
Example of term extraction by Gensen Web — English article: "Support vector machine" from Wikipedia:
Support vector machines (SVMs) are a set of related supervised learning methods that analyze data and recognize patterns, used for classification and regression analysis. The original SVM algorithm was invented by Vladimir Vapnik and the current standard incarnation (soft margin) was proposed by Corinna Cortes and Vladimir Vapnik[1]. The standard SVM is a non-probabilistic binary linear classifier, i.e. it predicts, for each given input, which of two possible classes the input is a member of. Since an SVM is a classifier, then given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other. Intuitively, an SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. …….Another approach is to use an interior point method that uses Newton-like iterations to find a solution of the Karush-Kuhn-Tucker conditions of the primal and dual problems.[10] Instead of solving a sequence of broken down problems, this approach directly solves the problem as a whole. To avoid solving a linear system involving the large kernel matrix, a low rank approximation to the matrix is often used to use the kernel trick.
Extracted terms (term score):
Top 1-17: hyperplane 116.65, margin 109.54, SVM 74.08, vector 56.12, point 52.85, support vector 49.34, training data 48.12, data 47.83, problem 44.27, space 44.09, data point 38.01, classifier 30.59, classification 29.58, optimization problem 26.05, set 25.30, support vector machine 24.66, kernel 21.00
Top 18-38: set of point 20.73, linear classifier 19.99, maximum-margin hyperplane 19.92, example 19.60, one 17.32, Vladimir Vapnik 15.87, parameter 14.70, linear SVM 14.40, training set 14.00, optimization 13.42, model 12.25, training vector 12.04, support vector classification 11.70, two classe 11.57, normal vector 11.38, kernel trick 11.22, maximum margin classifier 11.22
Top 408-426 (last): Vandewalle 1.00, derive 1.00, it 1.00, Leisch 1.00, 2.3 1.00, H1 1.00, c 1.00, Hornik 1.00, mean 1.00, testing 1.00, transformation 1.00, unconstrained 1.00, homogeneous 1.00, need 1.00, learner 1.00, grid-search 1.00, convex 1.00, See 1.00, trade 1.00
Contents
1. Introduction
2. Feature Extraction
3. Feature Weighting / Similarity Calculation
4. Clustering
5. Evaluation Issues
Information Extraction Approach
• Information extraction: the task of extracting specific types of information
– e.g., a person and his/her workplace
Example: William Henry "Bill" Gates III (born October 28, 1955) is an American business magnate, philanthropist, ...
(NAME = William Henry "Bill" Gates III; DATE OF BIRTH = October 28, 1955; NATIONALITY = American; OCCUPATION = business magnate)
Information Extraction Approach
• Useful features for disambiguation (Wan+, 05) (Mann+, 03) (Niu+, 04)
• Also used as "summaries" of clusters
– To help users find the clusters they are looking for
– WePS-2 "attribute extraction task"

Information Extraction Approach
• Different methods for different attributes
– Simple patterns (hand-crafted / automatically obtained): phone, FAX, URL, e-mail
– Syntactic rules (hand-crafted / automatically generated): date of birth, titles, positions
– Dictionary match (from Wikipedia, etc.): occupation, major, degree, nationality
– Keywords extracted by NER tools: birth place (LOCATION), affiliation (ORGANIZATION), schools (ORGANIZATION)
Hand-Crafted Patterns
• Typically written with regular expressions
• Phone, FAX: +## (#) ####-####
• URLs: http://www.xxx.xxx.xxx/...
• E-mails: xxx@xxx.xxx
• Some classification is needed (phone or FAX?)
– Supervised learning
– Keyword-based approach (e.g., "born" for date of birth)
A sketch follows.
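A sketch of such patterns as Python regular expressions; these are illustrative, not the patterns of any cited system.

# Sketch: hand-crafted extraction patterns as regular expressions.
import re

PHONE = re.compile(r"\+\d{1,3}\s*\(\d+\)\s*\d{3,4}-\d{4}")   # +## (#) ####-####
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL   = re.compile(r"https?://[^\s<>\"]+")

text = "Call +81 (3) 1234-5678, mail info@example.org, see http://www.example.org/"
print(PHONE.findall(text), EMAIL.findall(text), URL.findall(text))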
Automatically Generated Patterns
• Patterns for birth years (Mann+, 03):
– <name> (<birth year> - ####)
– <name> <name> ( <birth year>
– <name> was born in <birth year>
• Patterns for titles (Wan+, 05):
– <name> is a <title>

Automatically Generated Patterns
• Approach by (Mann+, 03): bootstrapping
– Start with seed facts (e.g., (Mozart, 1756))
– Find sentences (from the Web) that contain both elements (e.g., "Mozart was born in 1756")
– Perform some generalization (e.g., "<name> was born in <birth year>")
– Extract substrings with high scores (measured using the current facts)
– Extract new facts, and repeat
Dictionary Matching
• Construct lists of occupations, nations (for the "nationality" attribute), etc. from existing dictionaries
– Wikipedia, WordNet, etc.
• e.g., a list of countries
Link Structure Approach
• It is difficult to obtain the correct network structure
– Difficulty in finding "in-links"
• Some approximation is needed
• (Bekkerman+, 05): "socially linked persons tend to link to similar pages"
– Determine whether two pages are linked or not
– MaxEnt classification with "linked-page" (URLs in pages) features
FEATURE WEIGHTING / SIMILARITY CALCULATION
Feature Weighting
• Knowledge-based approach: US Census data, WordNet
• Web-query approach
• SVD
• Bootstrapping
• Determination of link/non-link by supervised classifiers
Knowledge-Based Approach
• US Census data
– Frequent name → ambiguous (Fleischman+, 04)
• WordNet
– Semantic similarity for concept words (WordNet distance)

WordNet
• Publicly available "dictionary" (thesaurus)
– Hierarchical structure over words
– We can find "synonyms", "hyponyms", and "hypernyms" of words
• Many "semantic distance" measures between two words
– Path length
– Depth of common hypernyms
– ...
A sketch follows.
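A sketch with NLTK's WordNet interface (an assumption about tooling), using path length, one of the measures listed above.

# Sketch: WordNet path-based similarity between two concept words.
# Requires: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

s1 = wn.synsets("singer")[0]
s2 = wn.synsets("politician")[0]
print(s1.path_similarity(s2))            # higher = semantically closer
print(s1.lowest_common_hypernyms(s2))    # shared hypernym concept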
Web-Query Approach
• Name-concept relations (Fleischman+, 04)
• Validate relations between context NEs by Web search counts (Kalashnikov+, 08) (Nuray-Turan+, 09)
• Use the query "name + bigram", concatenating the snippets into a new document (Chen+, 09)
• Obtain reliable counts (google_df) (Bekkerman+, 05)
Name-Concept Relations (Fleischman+, 04)
• Task: distinguish (name, concept) pairs
– (Paul Simon, pop star) vs. (Paul Simon, singer): likely the same person
– (Paul Simon, pop star) vs. (Paul Simon, politician): likely different persons
• MaxEnt classifier
• Features using Web counts (N: name, c: concept, +: AND operation), as sketched below:
– Q(N + c1 + c2): intersection
– |Q(N + c1) - Q(N + c2)|: difference
– Q(N + c1 + c2) / (Q(N + c1) + Q(N + c2)): ratio
Validate Relations between Context NEs by Web Search Counts (Kalashnikov+, 08) (Nuray-Turan+, 09)
• NE-based document similarity calculated using Web counts
– NEs: persons or organizations
• WebDice (C: context set ... [c1] OR [c2] OR ...):
– 2Q(N + C1 + C2) / (Q(N + C1) + Q(N + C2))
– 2Q(N + C1 + C2) / (Q(N) + Q(C1 + C2))
– The second one worked better
Use the Query "name + bigram", Concatenating the Snippets into a New Document (Chen+, 09)
• Obtain additional features for similarity calculation
– Web page → b: maximal-weight bigram
– Top-100 snippets for the query (N + b) → one new document
– New document → additional features (tokens)

Obtaining Reliable Counts (google_df) (Bekkerman+, 05)
• google_tfidf(w) = tf(w) / log(Q(w))
• Some recent systems use the Google N-gram corpus instead (Chen+, 09)
Dimension Reduction by SVD (Pedersen+, 05)
• Reduces the sparseness of context vectors
• Gives more semantic-level representations (can exploit word similarities in contexts)
• Bigram features (contexts)
• Strong features can identify a person
– High precision, but not always observed

[Figure: contexts "Bill Gates, Paul Allen, Microsoft, program" and "Bill Gates, Steve Ballmer, Microsoft, program" are judged to describe the same person; "Bill Gates, program" alone is harder.]
• Strong features: NEs, CKWs, ...
• Weak features: not useful in general, but useful for this particular name
A sketch of the SVD step follows.
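A sketch of the SVD step with scikit-learn's TruncatedSVD (an assumption about tooling); the contexts and the number of components are invented.

# Sketch: reduce sparse bigram context vectors with SVD.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

contexts = ["Bill Gates Paul Allen Microsoft program",
            "Bill Gates Steve Ballmer Microsoft program",
            "Bill Gates program"]
X = CountVectorizer(ngram_range=(1, 2)).fit_transform(contexts)
X_reduced = TruncatedSVD(n_components=2).fit_transform(X)   # dense, low-dimensional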
Cluster Refinement by Bootstrapping (1/4) (Yoshida+, 10)
[Figure: a document set d1, ..., dn and a feature set f1, ..., fm are connected by a document-feature matrix P; an initial clustering gives a document-cluster relation r_{D,C} and a feature-cluster relation r_{F,C}.]
Cluster Refinement by Bootstrapping (2/4)
The two relations are propagated through the document-feature matrix P:

    r_{F,C}^{(t)} = P^{T} r_{D,C}^{(t)}
    r_{D,C}^{(t+1)} = P\, r_{F,C}^{(t)}

so that

    r_{D,C}^{(t+1)} = P P^{T} r_{D,C}^{(t)}
Cluster Refinement by Bootstrapping (3/4)
[Figure: the initial 0/1 document-cluster values are refined into graded relation values by multiplying with P P^{T}.]
Each document is assigned to the cluster with the largest relation value.

Cluster Refinement by Bootstrapping (4/4)
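A minimal numpy sketch of the refinement loop above; the row normalization is an assumption added for stability, not necessarily the normalization used in (Yoshida+, 10).

# Sketch: bootstrapping refinement r <- normalize(P P^T r).
import numpy as np

def refine(P, r0, iters=10):
    r = r0.astype(float)
    for _ in range(iters):
        r = P @ (P.T @ r)                  # r_{F,C} = P^T r ; r_{D,C} = P r_{F,C}
        r /= r.sum(axis=1, keepdims=True)  # keep relation values bounded
    return r

P = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1]], dtype=float)   # docs x features
r0 = np.array([[1, 0], [0, 1], [0, 1]], dtype=float)           # initial 0/1 clusters
labels = refine(P, r0).argmax(axis=1)      # largest relation value wins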
Determination of "Linked" or "Not-Linked" by Supervised Classifiers
• MaxEnt classification (Fleischman+, 04)
– Features: name features, Web features, etc.
• Skyline-based classification (Kalashnikov+, 08)
– Features: search engine hit counts
CLUSTERING
Problem: How to Determine K
• Hierarchical clustering with thresholds
• Online Clustering (Single Pass Clustering)
• Building “core” clusters (2-stage clustering)
• Variable-Component-Number Clustering (e.g., Dirichlet Process Mixture)
Hierarchical Clustering with Thresholds
• Used in many systems
• Popular settings:
– Agglomerative clustering
– Group-average method (or, sometimes, the single-link method)
– Predetermined threshold (or, sometimes, determined by cross-validation)
Hierarchical Clustering with Thresholds
[Figure: a dendrogram over documents 5, 2, 3, 9, 1, 8, 7, 6, 4; cutting it at a high threshold yields 2 clusters {1,2,3,5,9}, {4,6,7,8}; cutting lower yields 4 clusters {2,5}, {1,3,9}, {6,7,8}, {4}.]
Cluster Similarity
Group-average method:

    sim(C_x, C_y) = \frac{1}{|C_x||C_y|} \sum_{d_x \in C_x} \sum_{d_y \in C_y} sim(d_x, d_y)

Other cluster-distance calculations: complete-linkage method, single-linkage method, centroid method.
A sketch follows.
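A sketch with SciPy (an assumption about tooling); the 0.9 threshold and random feature vectors are invented.

# Sketch: group-average agglomerative clustering cut at a fixed threshold.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(9, 20)                 # 9 documents x 20 TF-IDF features
Z = linkage(pdist(X, metric="cosine"), method="average")
labels = fcluster(Z, t=0.9, criterion="distance")   # cut dendrogram at 0.9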
Online Clustering
• Single-pass clustering (Balog+, 08)
– Take pages from the 1st in the search results
– For each page, find the most similar cluster
– If the similarity is below the threshold, create a new cluster
• Similarity: naive Bayes or cosine with TFIDF
A sketch follows this list.
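A minimal sketch of single-pass clustering; the cluster-page similarity here is the maximum over cluster members (a simplification, where (Balog+, 08) used naive Bayes or cosine with TFIDF).

# Sketch: single-pass (online) clustering.
def single_pass(vectors, sim, threshold):
    clusters = []                              # each cluster: list of indices
    for i in range(len(vectors)):
        best, best_sim = None, threshold
        for c in clusters:
            s = max(sim(vectors[i], vectors[j]) for j in c)
            if s >= best_sim:
                best, best_sim = c, s
        if best is None:
            clusters.append([i])               # below threshold: new cluster
        else:
            best.append(i)
    return clusters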
Building Core Clusters
• 1st-stage clustering: high-precision clusters
– Relatively high threshold (Mann+, 03)
– Use strong features only (Ikeda+, 09)
• 2nd-stage clustering: treat the remaining documents
– Add each to the most similar 1st-stage cluster (Mann+, 03) (Ikeda+, 09)
– Feature weighting by 1st-stage clusters (Yoshida+, 10)
Query Expansion Approach (Ikeda+, 09)
• Re-extract key phrases by using 1st-stage clusters
– Key phrases for documents → key phrases for clusters (e.g., "home runs", "major leagues", "all stars")
– More reliable than a single document
1. Extract top CKWs from the current cluster
2. Search for the CKWs in documents outside the cluster
3. If such documents exist, copy them into the cluster (soft clustering)
4. Remove 1-element clusters
Feature Weighting by 1st-Stage Clusters (Yoshida+, 10)
1. Make clusters using strong features
2. Weight weak features using the clusters, and refine similarities
3. Refine the clusters using the new similarities
Using Dirichlet Process Mixture (Ono+, 08)
• Topic = word distribution
– Topic "economics" = word distribution {"dollar": 0.03, "stock": 0.05, "share": 0.01, ...}
• Document = mixture of topics, e.g., {economics: 0.3, politics: 0.2, ...}
• A document's topic = the topic with the highest weight
• Modeling by DPUM (Dirichlet Process Unigram Mixture)
– The number of topics is automatically determined (see the sketch below)
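A sketch of the Chinese Restaurant Process that underlies the Dirichlet process prior, showing how the number of clusters K grows with the data rather than being fixed; alpha is the concentration parameter (value invented).

# Sketch: cluster assignments drawn from a Chinese Restaurant Process.
import random

def crp(n_docs, alpha=1.0):
    assignments, sizes = [], []
    for _ in range(n_docs):
        weights = sizes + [alpha]              # existing clusters, or a new one
        k = random.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)                    # open a new cluster
        else:
            sizes[k] += 1
        assignments.append(k)
    return assignments

print(crp(20))     # e.g., [0, 0, 1, 0, 2, ...]; K is determined by the data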
Example: Estimation of Latent Topics
[Figure: documents plotted as points in word space (word-1, word-2, word-3); each latent entity corresponds to a (red) bar.]
Dirichlet Process Unigram Mixture
[Graphical model: G0 = base distribution over θ (a Dirichlet distribution); G, drawn from the Dirichlet process, is a countable mixture of multinomial distributions; each document d draws a multinomial θd, and its words wdn (n = 1..Nd, d = 1..M) are drawn from θd. "UM" = unigram mixture.]
DPUM Parameter Estimation
[Figure: starting from an initial entity distribution, the entity distribution is estimated by iteratively maximizing the likelihood; clusters labeled Politics, Economics, Entertainment, Sports, Arts, and Society emerge.]
• Merge clusters with the same topic
EVALUATION ISSUES
Evaluation Issues
• Evaluation measures
• Available corpora
• WePS workshops

Evaluation Measures
• Precision / Recall / F-measure
• Purity / Inverse Purity
• B-cubed Precision / Recall / F-measure
– Extended B-cubed
Recall and Precision for Clustering
• Features and recall/precision
• First-stage cluster = high precision
A: size of the cluster; B: # of correct documents; C: # of correct documents in the cluster

    precision = C / A,  recall = C / B

Example: A = 5, B = 8, C = 3 → precision = 3/5 = 0.6, recall = 3/8 = 0.375
Recall and Precision [Larsen and Aone 1999]
• Machine-made clusters D and correct clusters C
• For each correct cluster c ∈ C, F is computed against the machine-made cluster d ∈ D that maximizes F(c, d)
• Total F-measure:

    F = \sum_{c \in C} \frac{|c|}{N} \max_{d \in D} F(c, d)

where N is the total number of documents.
Note: precision (P) and recall (R) are calculated in the same way.
Example
Correct clusters C: [A][A][A][A][A], [B][B], ...
Machine-made clusters D and their scores against class A:
• [A][A][B]: P = 2/3, R = 2/5, F = 1/2
• [A][A][A][B][C]: P = 3/5, R = 3/5, F = 3/5
Purity / Inverse Purity
• Similar to precision / recall
– L: manually annotated categories (clusters)
– C: clusters output by the system

B-Cubed Precision / Recall
• Entity-wise accuracy calculation
– C: the cluster (by the system) containing entity e
– L: the cluster (by a human) containing entity e
(Illustration borrowed from (Amigo+, 09); a sketch follows.)
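A minimal sketch of B-cubed for hard clusterings; the entity-to-cluster dictionaries are invented.

# Sketch: B-cubed precision and recall, averaged over entities.
def bcubed(system, gold):
    def members(assign):                       # cluster id -> set of entities
        inv = {}
        for e, c in assign.items():
            inv.setdefault(c, set()).add(e)
        return inv
    sys_inv, gold_inv = members(system), members(gold)
    P = R = 0.0
    for e in system:
        C = sys_inv[system[e]]                 # system cluster containing e
        L = gold_inv[gold[e]]                  # gold cluster containing e
        P += len(C & L) / len(C)
        R += len(C & L) / len(L)
    n = len(system)
    return P / n, R / n

print(bcubed({"d1": 0, "d2": 0, "d3": 1}, {"d1": 0, "d2": 1, "d3": 1}))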
Other Metrics
• Counting pairs
– Given a pair of documents, label it "link" or "unlink"
– Problem: the # of pairs is quadratic in the cluster size
• Entropy
– Low entropy within a cluster → pure
• Edit distance
– Distance from the system output to the correct output

Which Metrics to Use
• Constraints (borrowed from (Amigo+, 09)):
– Homogeneity: the purer, the better
– Completeness: the more complete, the better
– Rag bag: adding noise to an already noisy cluster is better than adding it to a pure cluster
– Cluster size vs. quantity: a small error in a big cluster is better than a large number of small errors in small clusters
[Illustrations comparing the metrics, borrowed from (Amigo+, 09), not reproduced here.]
Baselines
P-IP vs. B-Cubed for Practical Data
• The Purity/Inverse-Purity measure is not appropriate in the soft-clustering case
– It gives very high scores to a "cheat" baseline clustering (COMBINED in the table)
• The B-cubed measure is appropriate in this case
Available Corpora
• John Smith corpus (Bagga+, 98)
• "12 different people" corpus (Bekkerman+, 05)
• WePS corpora (Artiles+, 07) (Artiles+, 09)
– WePS-1: 79 person names (49 training + 30 test), top 100 pages for each
– WePS-2: 30 person names, top 150 pages for each

WePS (Web People Search) Workshops (Artiles+, 07) (Artiles+, 09)
• Evaluation campaigns for person name disambiguation (along with person attribute extraction)
• WePS-1: with SemEval-2007; 16 teams participated
• WePS-2: with WWW-2009; 17 teams participated
References
• (Amigo+, 09) Enrique Amigó, Julio Gonzalo, Javier Artiles, Felisa Verdejo, A comparison of extrinsic clustering evaluation metrics based on formal constraints, Information Retrieval, v.12 n.4, pp. 461-486, August 2009
• (Artiles+, 07) Javier Artiles, Julio Gonzalo, Satoshi Sekine, The SemEval-2007 WePS evaluation: establishing a benchmark for the web people search task, Proceedings of the 4th International Workshop on Semantic Evaluations, pp. 64-69, June 23-24, 2007, Prague, Czech Republic
• (Artiles+, 09) J. Artiles, J. Gonzalo, and S. Sekine, WePS 2 Evaluation Campaign: overview of the Web People Search Clustering Task, 2nd Web People Search Evaluation Workshop (WePS 2009), 2009
• (Bagga+, 98) Amit Bagga, Breck Baldwin, Entity-based cross-document coreferencing using the Vector Space Model, Proceedings of the 17th International Conference on Computational Linguistics, August 10-14, 1998, Montreal, Quebec, Canada
• (Balog+, 08) K. Balog, L. Azzopardi, and M. de Rijke, Personal name resolution of web people search, In WWW2008 Workshop: NLP Challenges in the Information Explosion Era (NLPIX 2008), 2008
• (Balog+, 09) Krisztian Balog, Jiyin He, Katja Hofmann, Valentin Jijkoun, Christof Monz, Manos Tsagkias, Wouter Weerkamp and Maarten de Rijke, The University of Amsterdam at WePS2, 2nd Web People Search Evaluation Workshop (WePS 2009), 2009
References
• (Bekkerman+, 05) Ron Bekkerman, Andrew McCallum, Disambiguating Web appearances of people in a social network, Proceedings of the 14th International Conference on World Wide Web, May 10-14, 2005, Chiba, Japan
• (Bollegala+, 06) Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka, Extracting key phrases to disambiguate personal name queries in web search, Proceedings of the Workshop on How Can Computational Linguistics Improve Information Retrieval?, July 23, 2006, Sydney, Australia
• (Bunescu+, 06) R. Bunescu and M. Pasca, Using encyclopedic knowledge for named entity disambiguation, In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), 2006
• (Chen+, 09) Ying Chen, Sophia Yat Mei Lee and Chu-Ren Huang, PolyUHK: A Robust Information Extraction System for Web Personal Names, 2nd Web People Search Evaluation Workshop (WePS 2009), 2009
• (Chen+, 07) Ying Chen, James Martin, Towards Robust Unsupervised Personal Name Disambiguation, EMNLP-CoNLL 2007, pp. 190-198, 2007
• (Elmacioglu+, 07) Ergin Elmacioglu, Yee Fan Tan, Su Yan, Min-Yen Kan, Dongwon Lee, PSNUS: web people name disambiguation by simple clustering with rich features, Proceedings of the 4th International Workshop on Semantic Evaluations, pp. 268-271, June 23-24, 2007, Prague, Czech Republic
References
• (Fleischman+, 04) M.B. Fleischman and E.H. Hovy, Multi-Document Person Name Resolution, Proceedings of the Reference Resolution Workshop at the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), Barcelona, Spain, 2004
• (Gooi+, 04) Chung H. Gooi, James Allan, Cross-Document Coreference on a Large Scale Corpus, HLT-NAACL 2004: Main Proceedings, pp. 9-16, 2004
• (Han+, 04) Hui Han, C. Lee Giles, Hongyuan Zha, Cheng Li, Kostas Tsioutsiouliklis, Two supervised learning approaches for name disambiguation in author citations, JCDL 2004, pp. 296-305, 2004
• (Ikeda+, 09) M. Ikeda, S. Ono, I. Sato, M. Yoshida, and H. Nakagawa, Person Name Disambiguation on the Web by Two-Stage Clustering, 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference, 2009
• (Kalashnikov+, 08) Dmitri V. Kalashnikov, Rabia Nuray-Turan, Sharad Mehrotra, Towards breaking the quality curse: a web-querying approach to web people search, Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 20-24, 2008, Singapore
• (Li+, 04) X. Li, P. Morie and D. Roth, Robust Reading: Identification and Tracing of Ambiguous Names, Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), pp. 17-24, 2004
References
• (Malin, 05) Bradley Malin, Unsupervised name disambiguation via social network similarity, In Workshop on Link Analysis, Counterterrorism, and Security, with SDM 2005
• (Murakami, 10) Hiroshi Ueda, Harumi Murakami, and Shoji Tatsumi, Suggesting Subject Headings using Web Information Sources, ... Conference on Agents and Artificial Intelligence (ICAART 2010), Volume 1: Artificial Intelligence, pp. 640-643, 2010
• (Mann+, 03) Gideon S. Mann, David Yarowsky, Unsupervised personal name disambiguation, Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 33-40, May 31, 2003, Edmonton, Canada
• (Nakagawa+, 03) H. Nakagawa and T. Mori, Automatic term recognition based on statistics of compound nouns and their components, Terminology, 9(2):201-219, 2003
• (Niu+, 04) Cheng Niu, Wei Li, Rohini K. Srihari, Weakly supervised learning for cross-document person name disambiguation supported by information extraction, Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, p. 597, July 21-26, 2004, Barcelona, Spain
• (Nuray-Turan+, 09) R. Nuray-Turan, Z. Chen, D. Kalashnikov, and S. Mehrotra, Exploiting web querying for web people search in WePS2, 2nd Web People Search Evaluation Workshop (WePS 2009), 2009
References
• (On+, 07) B.-W. On and D. Lee, Scalable name disambiguation using multi-level graph partition, In Proc. of the SIAM SDM Conf., Minneapolis, Minnesota, USA, 2007
• (Ono+, 08) Shingo Ono, Issei Sato, Minoru Yoshida, Hiroshi Nakagawa, Person name disambiguation in web pages using social network, compound words and latent topics, Proceedings of the 12th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, May 20-23, 2008, Osaka, Japan
• (Pedersen+, 05) Ted Pedersen, Amruta Purandare, Anagha Kulkarni, Name Discrimination by Clustering Similar Contexts, CICLing 2005, pp. 226-237, 2005
• (Resnick+, 94) Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, John Riedl, GroupLens: an open architecture for collaborative filtering of netnews, Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, pp. 175-186, October 22-26, 1994, Chapel Hill, North Carolina, United States
• (Yoshida+, 10) Minoru Yoshida, Masaki Ikeda, Shingo Ono, Issei Sato, Hiroshi Nakagawa, Person name disambiguation by bootstrapping, In SIGIR '10: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 10-17, 2010