Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning
Marieke van Erp, Giuseppe Rizzo, Raphaël Troncy
@giusepperizzo
May 13, 2013, Making Sense of Microposts (#MSM2013)
Preprocessing
➢ Dataset converted to CoNLL IOB format
➢ Applied 10-fold cross-validation
➢ Split the set of tweets into 50 KB parts to comply with the NERD file-size limit
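The three preprocessing steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the span representation, and the round-robin fold assignment are assumptions, and the sketch uses the IOB2 variant (every entity starts with `B-`), since the slides do not say which IOB variant was used.

```python
from typing import Iterator, List, Sequence, Tuple

# Assumed span representation: (start token index, end index exclusive, class)
Span = Tuple[int, int, str]


def to_iob(tokens: List[str], spans: List[Span]) -> List[Tuple[str, str]]:
    """Tag each token as B-<class>, I-<class>, or O (IOB2 variant)."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return list(zip(tokens, tags))


def chunk_by_bytes(tweets: List[str], limit: int = 50 * 1024) -> Iterator[List[str]]:
    """Group whole tweets into parts whose UTF-8 size stays within `limit`.

    A single tweet larger than `limit` still gets a part of its own.
    """
    part: List[str] = []
    size = 0
    for tweet in tweets:
        n = len(tweet.encode("utf-8")) + 1  # +1 for a newline separator
        if part and size + n > limit:
            yield part
            part, size = [], 0
        part.append(tweet)
        size += n
    if part:
        yield part


def k_folds(items: Sequence, k: int = 10) -> Iterator[Tuple[list, list]]:
    """Yield (train, test) pairs; fold i tests on every k-th item from i."""
    for i in range(k):
        test = list(items[i::k])
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test
```

Splitting on whole tweets keeps every annotation inside a single part, which matters when each part is sent to a remote service separately.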
NERD extractors
➢ Retrieves named entities from 10 extractors (Web APIs)
➢ Harmonizes the classification according to the NERD Ontology v0.5 http://nerd.eurecom.fr/ontology
➢ 75 entity classes mapped to 4 MSM'13 classes
http://nerd.eurecom.fr
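The class harmonization step can be pictured as a lookup table from NERD ontology classes to the four MSM'13 classes. The sketch below is illustrative only: the slides do not list the actual 75-class mapping, so the table entries, the `nerd:` prefix notation, and the helper names are assumptions.

```python
from typing import List, Optional, Tuple

# Hypothetical excerpt of the mapping; the real table covers 75 classes
# and is not given in the slides.
NERD_TO_MSM = {
    "nerd:Person": "PER",
    "nerd:Location": "LOC",
    "nerd:Organization": "ORG",
    "nerd:Event": "MISC",  # assumption: events fall under MISC
}


def to_msm_class(nerd_class: str) -> Optional[str]:
    """Return the MSM'13 class, or None when the class has no counterpart."""
    return NERD_TO_MSM.get(nerd_class)


def harmonize(annotations: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Map (surface form, NERD class) pairs, dropping unmapped classes."""
    result = []
    for surface, nerd_class in annotations:
        msm_class = to_msm_class(nerd_class)
        if msm_class is not None:
            result.append((surface, msm_class))
    return result
```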
Ritter et al. (2011)
➢ Off-the-shelf tool tailored to a Twitter stream based on:
  – LabelledLDA (+CRF)
  – Textual features (POS, capitalization, suffix, etc.)
  – Freebase gazetteers (names of PER, ORG, LOC)
➢ 10 entity classes mapped to 4 classes
Ritter, A., Clark, S., Mausam, Etzioni, O.: Named Entity Recognition in Tweets: An Experimental Study. In: Empirical Methods in Natural Language Processing (EMNLP’11) (2011)
Stanford CRF
➢ Re-trained on the MSM'13 corpora
➢ Parameters based on english.conll.4class.distsim.crf.ser.gz properties file provided with the Stanford distribution
➢ Baseline of our approach
Finkel, J.R., Grenager, T., Manning, C.: Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In: 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05) (2005)
Textual features
➢ POS
➢ Capitalisation information
  – initial capital
  – all capitalized
  – proportion of token capitals
➢ Prefix (first three letters of the token)
➢ Suffix (last three letters of the token)
➢ Whether the token is at the beginning or at the end of the micropost
Ritter, A., Clark, S., Mausam, Etzioni, O.: Named Entity Recognition in Tweets: An Experimental Study. In: Empirical Methods in Natural Language Processing (EMNLP’11) (2011)
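The token-level features listed above, apart from POS, can be computed directly from the token string and its position. The following sketch assumes a pre-tokenized micropost; the function name and feature keys are hypothetical, and POS is left out because it needs an external tagger that the slides do not name.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Compute the non-POS textual features for the token at position i."""
    tok = tokens[i]
    capitals = sum(c.isupper() for c in tok)
    return {
        "initial_capital": tok[:1].isupper(),
        "all_capitalized": tok.isupper(),
        "capital_proportion": capitals / len(tok) if tok else 0.0,
        "prefix": tok[:3],   # first three characters
        "suffix": tok[-3:],  # last three characters
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }
```

For the micropost `["Obama", "visits", "PARIS"]`, the first token has an initial capital, prefix `Oba`, and suffix `ama`, while the last token is fully capitalized.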
ML settings
Run01: 7 textual features (POS, initial capital, proportion of capitals, prefix, suffix, start token, end token); 0 extractors; ML = k-NN, k = 1, Euclidean distance
Run02: 0 textual features; 12 extractors (AlchemyAPI, DBpedia Spotlight, Extractiv, Lupedia, OpenCalais, Saplo, Yahoo, TextRazor, Wikimeta, Zemanta, Stanford NER, Ritter et al.); ML = SVM, polynomial kernel, SMO
Run03: 4 textual features (POS, initial capital, suffix, proportion of capitals); 8 extractors (AlchemyAPI, DBpedia Spotlight, Extractiv, OpenCalais, TextRazor, Wikimeta, Stanford NER, Ritter et al.); ML = SVM, polynomial kernel, SMO
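To make the Run01 setting concrete, here is a plain-Python sketch of a nearest-neighbour classifier with k = 1 and Euclidean distance. It is not the toolkit the authors used, which the slides do not name. It assumes the features have already been encoded as numeric vectors (nominal features such as prefix and suffix would need an encoding step that is not shown), and the toy vectors in the usage note are invented.

```python
import math
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], str]  # (feature vector, IOB label)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_1nn(train: List[Example], x: Sequence[float]) -> str:
    """Return the label of the single closest training example (k = 1)."""
    _, label = min(train, key=lambda example: euclidean(example[0], x))
    return label
```

With the invented training set `[((1.0, 0.2), "B-PER"), ((0.0, 0.0), "O")]`, the query `(0.9, 0.25)` lies closest to the first example and receives its label.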
Precision – MSM'13 training set, 10-fold cross-validation
Recall – MSM'13 training set, 10-fold cross-validation
F1 – MSM'13 training set, 10-fold cross-validation
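For reference, the three measures reported above can be computed per class from aligned gold and predicted labels. The sketch below scores at the token level, which is an assumption: the slides do not state whether the evaluation was token-based or entity-based, and the function name is hypothetical.

```python
from typing import Dict, List


def prf(gold: List[str], pred: List[str], label: str) -> Dict[str, float]:
    """Token-level precision, recall, and F1 for one class."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Under 10-fold cross-validation, these values would typically be computed on each held-out fold and then averaged.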
Lessons learned
➢ MISC class is ambiguously defined
➢ 8.1% of the named entities from the training data occur in the test data
➢ Best run is Run03, which uses a subset of the extractors and some of the textual features
➢ For the next challenge: what about entity linking?
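An overlap figure like the one in the second lesson could be measured along these lines. How the authors matched entities is not stated, so the case-insensitive comparison of surface forms and the function name are assumptions.

```python
from typing import Iterable


def overlap_ratio(train_entities: Iterable[str], test_entities: Iterable[str]) -> float:
    """Share of distinct training entities that also appear in the test data."""
    train = {e.lower() for e in train_entities}
    test = {e.lower() for e in test_entities}
    return len(train & test) / len(train) if train else 0.0
```

A low ratio means that memorising entities seen in training helps little on the test data, so features that generalise beyond surface forms carry most of the weight.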
Thanks for your time and attention
http://www.slideshare.net/giusepperizzo
NERD-ML: http://github.com/giusepperizzo/nerdml