Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning

Marieke van Erp, Giuseppe Rizzo, Raphaël Troncy (@giusepperizzo)

description

Talk "Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning", given at #MSM'13 (WWW'13), Rio de Janeiro, Brazil. Microposts shared on social platforms instantaneously report facts, opinions or emotions. These posts often mention entities, but the entities change continuously depending on what is currently trending. In such a scenario, recognising named entities is a challenging task for which off-the-shelf approaches are not well equipped. We propose NERD-ML, an approach that unifies the benefits of a crowd entity recognizer through Web entity extractors combined with the linguistic strengths of a machine learning classifier.

Transcript of Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning

Page 1: Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning

Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning

Marieke van Erp, Giuseppe Rizzo, Raphaël Troncy

@giusepperizzo

Page 2: Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning

May 13, 2013, Making Sense of Microposts (#MSM2013)

NERD-ML @ MSM'13

Page 3: Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning

Preprocessing

➢ Dataset is converted to CoNLL IOB format

➢ Applied 10-fold cross-validation

➢ Chunked the set of tweets into 50 KB parts in order to comply with NERD file size limitations
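
The conversion and chunking steps above could be sketched roughly as follows. This is an illustration, not the authors' code: the function names, the token-offset span format and the choice to tag every entity start with B- are assumptions.

```python
from typing import List, Tuple


def to_iob(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Render one micropost as CoNLL-style lines: token, tab, IOB tag.

    `spans` holds (start, end, label) token offsets, end exclusive
    (a hypothetical input format chosen for this sketch).
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the entity
    return [f"{tok}\t{tag}" for tok, tag in zip(tokens, tags)]


def chunk_by_size(posts: List[str], max_bytes: int = 50 * 1024) -> List[List[str]]:
    """Group whole posts into chunks whose UTF-8 size stays within max_bytes.

    A single post longer than max_bytes still gets a chunk of its own.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    size = 0
    for post in posts:
        n = len(post.encode("utf-8")) + 1   # +1 for the newline separator
        if current and size + n > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(post)
        size += n
    if current:
        chunks.append(current)
    return chunks
```

Splitting on post boundaries rather than at an exact byte offset keeps every tweet intact inside one chunk, so no entity mention is cut in half.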

Page 4: Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning

NERD extractors

➢ Retrieves named entities from 10 extractors (Web APIs)

➢ Harmonizes the classification according to the NERD Ontology v0.5 http://nerd.eurecom.fr/ontology

➢ 75 entity classes mapped to 4 MSM'13 classes

http://nerd.eurecom.fr
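
A minimal sketch of the class mapping step, assuming the four MSM'13 classes are PER, LOC, ORG and MISC. Only a handful of NERD classes are shown, and the table entries, the `NERD_TO_MSM` name and the `map_nerd_class` helper are illustrative assumptions, not the 75-class mapping actually used.

```python
MSM_CLASSES = {"PER", "LOC", "ORG", "MISC"}

# Illustrative subset only; the mapping described on the slide covers 75 classes.
NERD_TO_MSM = {
    "Person": "PER",
    "Location": "LOC",
    "Organization": "ORG",
    "Product": "MISC",   # assumption: which classes fall under MISC
    "Event": "MISC",     # assumption
}


def map_nerd_class(nerd_class: str, default: str = "O") -> str:
    """Map a NERD class (full URI or local name) to an MSM'13 class.

    Classes without a counterpart fall back to `default` (no entity).
    """
    local = nerd_class.rsplit("#", 1)[-1].rsplit("/", 1)[-1]
    return NERD_TO_MSM.get(local, default)
```

For example, `map_nerd_class("http://nerd.eurecom.fr/ontology#Person")` yields `"PER"` under this table.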

Page 5: Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning

Ritter et al. (2011)

➢ Off-the-shelf tool tailored to a Twitter stream based on:

– LabelledLDA (+CRF)

– Textual features (POS, capitalization, suffix, etc.)

– Freebase gazetteers (names of PER, ORG, LOC)

➢ 10 entity classes mapped to 4 classes

Ritter, A., Clark, S., Mausam, Etzioni, O.: Named Entity Recognition in Tweets: An Experimental Study. In: Empirical Methods in Natural Language Processing (EMNLP’11) (2011)

Page 6: Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning

Stanford CRF

➢ Re-trained on the MSM'13 corpora

➢ Parameters based on english.conll.4class.distsim.crf.ser.gz properties file provided with the Stanford distribution

➢ Baseline of our approach

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In: 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05) (2005)
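
For readers unfamiliar with retraining Stanford NER, the sketch below renders a small properties file of the kind the trainer reads. The flag values follow the generic training example in the Stanford NER FAQ, not the settings used for NERD-ML; the file names are hypothetical, and the distributional similarity options of the distsim model (`useDistSim`, `distSimLexicon`) are left out because they need a cluster file.

```python
def render_crf_properties(train_file: str, model_file: str) -> str:
    """Return the text of a Stanford NER training properties file.

    Assumption: a generic feature set taken from the Stanford NER FAQ,
    not the exact configuration behind the slide.
    """
    props = {
        "trainFile": train_file,        # tab-separated token / label file
        "serializeTo": model_file,      # where the trained model is written
        "map": "word=0,answer=1",       # column layout of the training file
        "useClassFeature": "true",
        "useWord": "true",
        "useNGrams": "true",
        "noMidNGrams": "true",
        "maxNGramLeng": "6",
        "usePrev": "true",
        "useNext": "true",
        "useSequences": "true",
        "usePrevSequences": "true",
        "useTypeSeqs": "true",
        "useTypeSeqs2": "true",
        "useTypeySequences": "true",
        "wordShape": "chris2useLC",
    }
    return "\n".join(f"{key} = {value}" for key, value in props.items()) + "\n"
```

Saved as, say, `msm13.prop`, such a file is typically passed to the trainer with `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop msm13.prop`.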

Page 7: Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning

Textual features

➢ POS

➢ Capitalisation information

– initial capital

– all capitalized

– proportion of token capitals

➢ Prefix (first three letters of the token)

➢ Suffix (last three letters of the token)

➢ Whether the token is at the beginning or at the end of the micropost

Ritter, A., Clark, S., Mausam, Etzioni, O.: Named Entity Recognition in Tweets: An Experimental Study. In: Empirical Methods in Natural Language Processing (EMNLP’11) (2011)
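
The features listed on this slide could be computed per token along these lines. This is a sketch under assumptions: the POS tag is supplied by an external tagger, the proportion of capitals is taken over all characters of the token, and the feature names are made up for the example.

```python
from typing import Dict, List, Union

Feature = Union[str, bool, float]


def textual_features(tokens: List[str], index: int, pos_tag: str) -> Dict[str, Feature]:
    """Compute the textual features of tokens[index] within one micropost."""
    token = tokens[index]
    letters = [c for c in token if c.isalpha()]
    capitals = sum(1 for c in token if c.isupper())
    return {
        "pos": pos_tag,                                   # from an external tagger
        "initial_capital": token[:1].isupper(),
        "all_capitalized": bool(letters) and all(c.isupper() for c in letters),
        "capital_proportion": capitals / len(token) if token else 0.0,
        "prefix": token[:3],                              # first three letters
        "suffix": token[-3:],                             # last three letters
        "is_first": index == 0,                           # start of the micropost
        "is_last": index == len(tokens) - 1,              # end of the micropost
    }
```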

Page 8: Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning

ML settings

Run01: 7 textual features (POS, initial capital, proportion of capitals, prefix, suffix, end/start token); 0 extractors; ML=k-NN, k=1, Euclidean distance

Run02: 0 textual features; 12 extractors (AlchemyAPI, DBpedia Spotlight, Extractiv, Lupedia, OpenCalais, Saplo, Yahoo, Textrazor, Wikimeta, Zemanta, Stanford NER, Ritter et al.); ML=SVM, polynomial kernel, SMO

Run03: 4 textual features (POS, initial capital, suffix, proportion of capitals); 8 extractors (AlchemyAPI, DBpedia Spotlight, Extractiv, OpenCalais, Textrazor, Wikimeta, Stanford NER, Ritter et al.); ML=SVM, polynomial kernel, SMO
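
One way to picture the three runs is as different column selections over the same per-token data: textual features plus one label column per extractor. The sketch below only assembles the instance rows; the classifiers themselves are not shown. The `RUNS` table, the feature names and the reading of "end/start token" as two separate features are assumptions made for the illustration.

```python
from typing import Dict, List

ALL_EXTRACTORS = [
    "AlchemyAPI", "DBpedia Spotlight", "Extractiv", "Lupedia", "OpenCalais",
    "Saplo", "Yahoo", "Textrazor", "Wikimeta", "Zemanta",
    "Stanford NER", "Ritter et al.",
]

RUNS: Dict[str, Dict[str, List[str]]] = {
    "Run01": {
        "textual": ["pos", "initial_capital", "capital_proportion",
                    "prefix", "suffix", "is_first", "is_last"],
        "extractors": [],
    },
    "Run02": {"textual": [], "extractors": ALL_EXTRACTORS},
    "Run03": {
        "textual": ["pos", "initial_capital", "suffix", "capital_proportion"],
        "extractors": ["AlchemyAPI", "DBpedia Spotlight", "Extractiv", "OpenCalais",
                       "Textrazor", "Wikimeta", "Stanford NER", "Ritter et al."],
    },
}


def build_instance(run: str, textual: Dict[str, object],
                   extractor_labels: Dict[str, str]) -> List[object]:
    """Build the feature row of one token for the given run.

    An extractor that returned nothing for the token contributes "O".
    """
    config = RUNS[run]
    row: List[object] = [textual[name] for name in config["textual"]]
    row += [extractor_labels.get(name, "O") for name in config["extractors"]]
    return row
```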

Page 9: Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning

Precision – MSM'13 training, 10-fold cross-validation

Page 10: Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning

Recall – MSM'13 training, 10-fold cross-validation

Page 11: Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning

F1 – MSM'13 training, 10-fold cross-validation
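
The three measures reported on these slides can be computed per class as in the sketch below. It scores individual token labels, which is an assumption; the challenge evaluation may instead score whole entity mentions.

```python
from typing import Dict, List, Tuple


def prf_per_class(gold: List[str], pred: List[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return {class: (precision, recall, F1)} over aligned label sequences.

    The "O" (no entity) label is not scored.
    """
    scores: Dict[str, Tuple[float, float, float]] = {}
    for cls in sorted((set(gold) | set(pred)) - {"O"}):
        tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
        fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
        fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[cls] = (precision, recall, f1)
    return scores
```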

Page 12: Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning

Lessons learned

➢ The MISC class is ambiguously defined

➢ 8.1% of the named entities from the training data occur in the test data

➢ Best run is Run03: not all extractors and only some textual features

➢ For the next challenge: what about entity linking?
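
An overlap figure like the one in the second bullet could be obtained along the following lines. This is a guess at the procedure: comparing distinct, lowercased surface forms is an assumption, and the slide does not say how the 8.1% was measured.

```python
from typing import Iterable


def entity_overlap(train_entities: Iterable[str], test_entities: Iterable[str]) -> float:
    """Share of distinct training entities that also occur in the test data."""
    train = {entity.lower() for entity in train_entities}
    test = {entity.lower() for entity in test_entities}
    return len(train & test) / len(train) if train else 0.0
```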

Page 13: Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning

Thanks for your time and attention

http://www.slideshare.net/giusepperizzo

NERD-ML: http://github.com/giusepperizzo/nerdml