Torsten Zesch and Iryna Gurevych Ubiquitous Knowledge Processing Lab


The More the Better? Assessing the Influence of Wikipedia’s Growth on Semantic Relatedness Measures

Torsten Zesch and Iryna Gurevych
Ubiquitous Knowledge Processing Lab
Technische Universität Darmstadt

Wikipedia as a Language Resource

NLP applications:
Information Extraction [Ruiz-Casado et al., 2005]
Information Retrieval [Gurevych et al., 2007]
Keyphrase Extraction [Medelyan, Milne & Witten, 2008]
Named Entity Recognition [Bunescu & Pasca, 2006]
Question Answering [Ahn et al., 2004]
Semantic Relatedness [Zesch & Gurevych, 2010]
Text Categorization [Gabrilovich & Markovitch, 2006]
WSD [Mihalcea, 2007]

See [Medelyan et al., 2008] for an excellent overview.

20.05.2010 | Computer Science Department | UKP Lab - Prof. Dr. Iryna Gurevych | Torsten Zesch

Growth of Wikipedia

[Figure: growth of the ten largest language editions of Wikipedia, log scale; the point where categories were introduced is marked]

Wikipedia grows very fast.

Coverage
Influence of Wikipedia's growth on task performance is unknown.
Only the most recent Wikipedia snapshots are publicly available.
Previous research cannot be reproduced.

JWPL TimeMachine

[Diagram: a Wikipedia dump (all revisions) is processed by the TimeMachine in a one-time effort into Snapshot 1 and Snapshot 2; applications access the snapshots at run-time through the Java-based API (JWPL)]

Wikipedia dump (all revisions): http://dumps.wikimedia.org, >1 TB uncompressed (English)
A snapshot from a certain date is reconstructed.
Multiple snapshots from a time span are possible.
Deleted articles are not included.
Available as part of the JWPL Wikipedia API release: http://www.ukp.tu-darmstadt.de/research/software/jwpl/
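The idea of reconstructing a snapshot from a dump that contains all revisions can be sketched in a few lines. This is only an illustration of the principle, not the JWPL TimeMachine code: the `Revision` record, the function name `reconstruct_snapshot`, and the toy history are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Revision:
    """One stored revision of an article (hypothetical record layout)."""
    article: str
    timestamp: datetime
    text: str


def reconstruct_snapshot(revisions, snapshot_date):
    """Return {article title: text} as of snapshot_date.

    For every article, keep the newest revision whose timestamp is not
    later than snapshot_date. Articles without such a revision did not
    exist yet and are left out.
    """
    latest = {}
    for rev in revisions:
        if rev.timestamp > snapshot_date:
            continue  # revision was written after the snapshot date
        current = latest.get(rev.article)
        if current is None or rev.timestamp > current.timestamp:
            latest[rev.article] = rev
    return {title: rev.text for title, rev in latest.items()}


# Toy revision history (invented for illustration only).
history = [
    Revision("Tree", datetime(2002, 11, 3), "A tree is a plant."),
    Revision("Tree", datetime(2005, 6, 1), "A tree is a tall woody plant."),
    Revision("Willow", datetime(2004, 2, 9), "Willows are trees."),
]

snapshot_2003 = reconstruct_snapshot(history, datetime(2003, 1, 1))
snapshot_2008 = reconstruct_snapshot(history, datetime(2008, 1, 1))
print(sorted(snapshot_2003))  # ['Tree']
print(sorted(snapshot_2008))  # ['Tree', 'Willow']
```

Calling the function with several dates yields multiple snapshots from a time span, which is what a study over Wikipedia's growth needs.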

Notes: Why semantic relatedness? Besides being familiar with the topic

Semantic Relatedness [Zesch & Gurevych, 2010]
Direct use of Wikipedia
Uses many features: article text, article titles, categories, links, link anchors, redirects

Semantic Relatedness Measures

Quantify the strength of semantic relatedness on a scale of [0,1]:
tree - car: 0.1
tree - willow: 0.9

Notes: More formally speaking, we need a semantic relatedness measure. This means an algorithm that quantifies the strength of the semantic relationship between two words using a knowledge source.

Types of Semantic Relatedness Measures

Path Based
Gloss Based
Concept Vector Based
Link Vector Based

Path based Measures

Semantic relatedness corresponds, e.g., to the number of edges on the shortest path between two nodes (articles, categories).

[Diagram: graph with the nodes motor vehicle, car, cab, minivan, bike, truck, garbage truck, tractor, ...]
cab - minivan: 2
cab - tractor: 4

Notes: Implementation makes use of the category graph, category-article links, redirects, and article titles.

Gloss based measures

WordNet glosses

tree (plant) a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown

trunk (tree) the main stem of a tree; usually covered with bark; the bole is usually the part that is commercially useful for lumber
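To make the comparison of glosses concrete, the sketch below counts the content words shared by the two glosses above, in the spirit of the word overlap idea of [Lesk, 1986]. The tokenization, the tiny stopword list, and the function names are assumptions chosen for this illustration.

```python
import re

# Small illustrative stopword list (an assumption, not from the slides).
STOPWORDS = {"a", "the", "of", "is", "and", "with", "for", "that"}


def content_words(gloss):
    """Lowercase a gloss, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z]+", gloss.lower())
    return {w for w in words if w not in STOPWORDS}


def gloss_overlap(gloss_a, gloss_b):
    """Return the set of content words that both glosses share."""
    return content_words(gloss_a) & content_words(gloss_b)


tree = ("a tall perennial woody plant having a main trunk and branches "
        "forming a distinct elevated crown")
trunk = ("the main stem of a tree; usually covered with bark; the bole is "
         "usually the part that is commercially useful for lumber")

shared = gloss_overlap(tree, trunk)
print(shared, len(shared))  # {'main'} 1
```

With glosses this short, the overlap consists of a single word, even though the two concepts are clearly related.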

A Wikipedia article is a kind of a (very long and detailed) gloss.

Term Document Matrix

        t1   t2   t3   tm-1  tm
d1      3    1    0    0     0
d2      0    5    0    1     0
d3      1    0    2    3     3
dn-1    0    2    3    2     1
dn      2    3    0    5     0

Columns: terms. Rows: documents.

Notes: We visualize the working principle of this measure type using a term-document matrix. Each cell holds the importance of a term in a document, usually determined by a term weighting method like tf.idf.

Gloss Based Measures [Lesk, 1986]

[Same term-document matrix as above; the documents d1 ... dn are the articles, and the concepts c1 ... cn are given by the article titles]
Inner Product (usually Lesk)

Notes: In structured knowledge sources, each document corresponds to a certain concept. The comparison between the document vectors thus also tells us something about the relatedness of the corresponding concepts. However, WordNet and Wiktionary only contain very short definitions, which results in a very sparse matrix and rather poor results.

Concept Vector Based Measure

ESA [Gabrilovich & Markovitch, 2007]
[Same term-document matrix as above, with the concepts c1 ... cn]
Inner Product (usually Cosine)

Notes: Concept vector based measures can also be visualized using the term-document matrix. However, term vectors are compared instead of document vectors.
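The comparison of term vectors can be sketched as follows. The matrix values are copied from the example matrix on the slides; reading each column as the vector of one term over the concepts, and the helper names `term_vector` and `cosine`, are assumptions for this illustration and not the ESA implementation itself.

```python
import math

# Rows are documents (concepts), columns are terms; values copied from
# the example matrix on the slides.
MATRIX = [
    [3, 1, 0, 0, 0],
    [0, 5, 0, 1, 0],
    [1, 0, 2, 3, 3],
    [0, 2, 3, 2, 1],
    [2, 3, 0, 5, 0],
]


def term_vector(matrix, term_index):
    """Return the column for one term: its weight in every concept."""
    return [row[term_index] for row in matrix]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


t1 = term_vector(MATRIX, 0)  # [3, 0, 1, 0, 2]
t2 = term_vector(MATRIX, 1)  # [1, 5, 0, 2, 3]
t3 = term_vector(MATRIX, 2)  # [0, 0, 2, 3, 0]

print(round(cosine(t1, t2), 3))  # 0.385
print(round(cosine(t1, t3), 3))  # 0.148
```

In this toy matrix, t1 comes out as more related to t2 than to t3, because t1 and t2 carry weight in more of the same concepts.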

Gabrilovich & Markovitch (2007) propose to use Wikipedia articles as the concept space.

Link Vector Based Measure

        l1   l2   l3   lm-1  lm
d1      3    1    0    0     0
d2      0    5    0    1     0
d3      1    0    2    3     3
dn-1    0    2    3    2     1
dn      2    3    0    5     0

Columns: links. Rows: articles d1 ... dn; the concepts c1 ... cn are given by the article titles.
Inner Product (usually Cosine)

Notes: Visualized using the document/link matrix.

Types of Semantic Relatedness Measures

Path Based
Gloss Based
Concept Vector Based
Link Vector Based

[The slide repeats the path diagram and the three matrices from the previous slides and labels the Wikipedia features used: category graph, titles, redirects, text, links]
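For the path based type, the core operation is a shortest-path search. The sketch below uses a breadth-first search over a small undirected graph. The edge list is hypothetical: it is arranged so that the two distances quoted on the path based slide (cab - minivan: 2, cab - tractor: 4) come out, and it is not taken from the actual Wikipedia category graph.

```python
from collections import deque

# Hypothetical edges, chosen to reproduce the two example distances.
EDGES = [
    ("motor vehicle", "car"),
    ("motor vehicle", "truck"),
    ("motor vehicle", "bike"),
    ("car", "cab"),
    ("car", "minivan"),
    ("truck", "garbage truck"),
    ("truck", "tractor"),
]


def build_graph(edges):
    """Build an undirected adjacency map from a list of node pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def path_length(graph, start, goal):
    """Number of edges on the shortest path, or None if there is none."""
    if start not in graph or goal not in graph:
        return None  # at least one word is not covered by the graph
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


GRAPH = build_graph(EDGES)
print(path_length(GRAPH, "cab", "minivan"))  # 2
print(path_length(GRAPH, "cab", "tractor"))  # 4
print(path_length(GRAPH, "cab", "willow"))   # None
```

A shorter path means stronger relatedness. How a path length is mapped to a score in [0,1] is not specified on the slides, so that step is left out here.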

Experimental Setup

Created 6-monthly snapshots of the German Wikipedia
Start: 01.12.2002
End: 23.11.2008
Accessed the dumps using the JWPL Wikipedia API
Implemented all measure types on top of JWPL

Two evaluation approaches:
Correlation with human judgments on word pair lists
Solving word choice problems

Notes: Exact dates for all snapshots are given in the paper. German Wikipedia, because the English dump was unavailable at that time. Only report results for the first task, as they are consistent with the results obtained on the other task.

Evaluation Datasets

Word pair        Human judgments       Mean    Measure
tree - lake      0.5   0.5   0.75      0.58    0.7
tree - willow    0.75  0.75  1.0       0.83    0.9
tree - car       0.25  0.0   0.0       0.08    0.0

Spearman rank correlation coefficient
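As a minimal sketch of this evaluation step, the Spearman rank correlation can be computed as the Pearson correlation of the ranks. The code uses the three example values above, reading 0.58 / 0.83 / 0.08 as averaged human judgments and 0.7 / 0.9 / 0.0 as the scores of a measure; that reading of the slide, and all function names, are assumptions.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


human = [0.58, 0.83, 0.08]   # tree-lake, tree-willow, tree-car
measure = [0.7, 0.9, 0.0]
print(spearman(human, measure))  # 1.0
```

The result is 1.0 because the measure orders the three pairs exactly as the human judgments do, even though the absolute values differ.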

Notes: For the evaluation process, we need human judgments of semantic relatedness (SR) as a reference. Present word pairs

Gur350 dataset [Gurevych, 2005]
350 word pairs
Nouns, verbs, and adjectives

Coverage

Example for the word pairs tree - lake, tree - willow, tree - car:
2003: coverage 0.33
2007: coverage 1.0

[Figures: Coverage Gur350 (with the introduction of categories marked); Correlation Gur350; Correlation Gur350 (Fixed Coverage)]

Dataset

1008 German word choice problems [Mohammad et al., 2007]

Evaluation metric: Coverage / Accuracy / Harmonic Mean
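The three metrics can be sketched as below. The slides do not define them, so the definitions used here are assumptions: coverage is the share of problems for which the measure can score every candidate, accuracy is the share of correct answers among those attempted problems, and the harmonic mean combines the two. The toy relatedness scores and problems are invented for illustration.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers (0.0 if both are 0)."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def evaluate(problems, relatedness):
    """Return (coverage, accuracy, harmonic mean) for word choice problems.

    problems: list of (target, candidates, gold answer).
    relatedness(w1, w2): score in [0, 1], or None if a word is not covered.
    """
    attempted = 0
    correct = 0
    for target, candidates, gold in problems:
        scores = {c: relatedness(target, c) for c in candidates}
        if any(score is None for score in scores.values()):
            continue  # problem is not covered by the knowledge source
        attempted += 1
        if max(scores, key=scores.get) == gold:
            correct += 1
    coverage = attempted / len(problems) if problems else 0.0
    accuracy = correct / attempted if attempted else 0.0
    return coverage, accuracy, harmonic_mean(coverage, accuracy)


# Toy relatedness scores (hypothetical values).
SCORES = {("tree", "willow"): 0.9, ("tree", "lake"): 0.7, ("tree", "car"): 0.1}


def toy_relatedness(w1, w2):
    return SCORES.get((w1, w2))


problems = [
    ("tree", ["willow", "car"], "willow"),   # covered, answered correctly
    ("tree", ["car", "lake"], "car"),        # covered, answered wrongly
    ("tree", ["willow", "bole"], "bole"),    # "bole" is not covered
    ("tree", ["lake", "car"], "lake"),       # covered, answered correctly
]

coverage, accuracy, h_mean = evaluate(problems, toy_relatedness)
print(coverage, round(accuracy, 3), round(h_mean, 3))  # 0.75 0.667 0.706
```

Reporting the harmonic mean next to the two individual values keeps a measure from looking good by answering only the few problems it is sure about.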

[Figure slides: Coverage; Accuracy; Harmonic Mean]

Summary

Wikipedia is a great resource for many NLP tasks.
Wikipedia grows very fast.

The more, the better?

Growth does not hurt performance of semantic relatedness measures.
Using more recent Wikipedia dumps does not increase coverage much.

JWPL Time Machine
Create a snapshot reflecting any past state of Wikipedia
Reproducing previous results obtained using a certain snapshot
Perform similar studies for other NLP tasks
http://www.ukp.tu-darmstadt.de/research/software/jwpl/

Notes: More recent also means a lot bigger.

References (I)

Ahn, D., Jijkoun, V., Mishne, G., Müller, K., de Rijke, M., and Schlobach, S. (2004). Using Wikipedia at the TREC QA Track. In Proceedings of the Thirteenth Text REtrieval Conference (TREC), Gaithersburg, Maryland.

Bunescu, R. and Pasca, M. (2006). Using Encyclopedic Knowledge for Named Entity Disambiguation. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 9-16, Trento, Italy.

Gabrilovich, E. and Markovitch, S. (2007). Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. In Proceedings of The 20th International Joint Conference on Artificial Intelligence (IJCAI), pages 1606-1611, Hyderabad, India.

Gurevych, I. (2005). Using the Structure of a Conceptual Network in Computing Semantic Relatedness. In Proceedings of the 2nd International Joint Conference on Natural Language Processing, pages 767-778, Jeju Island, Republic of Korea.

Gurevych, I., Müller, C., and Zesch, T. (2007). What to be? - Electronic Career Guidance Based on Semantic Relatedness. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 1032-1039, Prague, Czech Republic.

Lesk, M. (1986). Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation, pages 24-26, Toronto, Canada.

References (II)

Mihalcea, R. (2007). Using Wikipedia for Automatic Word Sense Disambiguation. In Proceedings of HLT 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, Rochester, NY, April 2007.

Medelyan, O., Legg, C., Milne, D., and Witten, I.H. (2008). Mining Meaning from Wikipedia. International Journal of Human-Computer Studies, 67:9, September 2009, p. 716-754.

Medelyan, O., Witten, I.H., and Milne, D. (2008). Topic Indexing with Wikipedia. In Proceedings of the first AAAI Workshop on Wikipedia and Artificial Intelligence (WIKIAI'08), Chicago, I.L.

Mohammad, S., Gurevych, I., Hirst, G., and Zesch, T. (2007). Cross-lingual Distributional Profiles of Concepts for Measuring Semantic Distance. In Proceedings of EMNLP-CoNLL, pages 571-580, Prague, Czech Republic.

Ruiz-Casado, M., Alfonseca, E., and Castells, P. (2005). Automatic Assignment of Wikipedia Encyclopedic Entries to WordNet Synsets. In Advances in Web Intelligence, pages 380-386.

Zesch, T., and Gurevych, I. (2010). Wisdom of Crowds versus Wisdom of Linguists - Measuring the Semantic Relatedness of Words. In: Journal of Natural Language Engineering, vol. 16, no. 01, pages 25-59.

Backup Slides

[Figure slides: Coverage Gur65; Correlation Gur65 (three slides)]