
Transcript of Alleviating Manual Feature Engineering for Part-of-Speech Tagging of Twitter Microposts using Distributed Word Representations

  • http://multimedialab.elis.ugent.be Ghent University – iMinds, ELIS Department/Multimedia Lab

    Gaston Crommenlaan 8 bus 201, B-9050 Ledeberg – Ghent, Belgium

    Fréderic Godin, Baptist Vandersmissen, Azarakhsh Jalalvand, Wesley De Neve and Rik Van de Walle Workshop on Machine Learning and NLP, NIPS 2014

    Alleviating Manual Feature Engineering for Part-of-Speech Tagging of Twitter Microposts using Distributed Word Representations

    12/12/2014, Montreal, Canada

    Research Question

    Vote-Constrained Bootstrapping*

    Can we avoid manual feature engineering when developing a Part-of-Speech tagger for Twitter microposts?

    @frederic_godin, @BaptistV, @wmdeneve and @rvdwalle

    Solution

    Automatically learn features on 400 million raw Twitter microposts that capture syntactic and semantic patterns and feed them to a neural network
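A Skip-gram model learns these features by predicting context words from each center word. A minimal sketch of how (center, context) training pairs could be extracted from a micropost — the helper name and window size are illustrative, not from the poster:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs within a symmetric window around each token."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["im", "doin", "good"], window=1)
# pairs == [("im", "doin"), ("doin", "im"), ("doin", "good"), ("good", "doin")]
```

In practice a library such as gensim would be trained on the 400 million raw microposts to produce the 400-dimensional vectors; the sketch only shows the pair extraction that Skip-gram training is based on.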

    [Diagram: two-stage pipeline]

    Learn Features: a Word2vec Skip-gram model is trained on 400 million raw microposts, yielding one 400D vector per word.

    Train the Part-of-Speech Tagger: the 400D vectors of the words in a context window (e.g., "im doin good") are concatenated and fed to a neural network with a Hidden Layer (500D) and an Output Layer (52D, one unit per tag); e.g., "doin" → VBG.
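The tagging network in the diagram can be sketched as a plain feed-forward pass; random vectors stand in for the pre-trained 400D Word2vec embeddings, and the weights and helper name are illustrative, assuming a three-word context window:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, WIN, HID, TAGS = 400, 3, 500, 52  # dimensions from the poster

# Stand-ins for the pre-trained Word2vec vectors and the learned weights.
vectors = {w: rng.normal(size=EMB) for w in ["im", "doin", "good"]}
W1 = rng.normal(scale=0.01, size=(WIN * EMB, HID))
b1 = np.zeros(HID)
W2 = rng.normal(scale=0.01, size=(HID, TAGS))
b2 = np.zeros(TAGS)

def tag_probs(window_words):
    """Concatenate the window's word vectors and run the feed-forward net."""
    x = np.concatenate([vectors[w] for w in window_words])  # (1200,) input
    h = np.tanh(x @ W1 + b1)                                # hidden layer (500D)
    logits = h @ W2 + b2                                    # output layer (52D)
    probs = np.exp(logits - logits.max())                   # stable softmax
    return probs / probs.sum()

p = tag_probs(["im", "doin", "good"])
# p is a probability distribution over the 52 tags for the center word "doin"
```

The choice of tanh and softmax here is an assumption; the poster only specifies the layer sizes.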

    Evaluation

    [Diagram: vote-constrained bootstrapping]

    The same micropost (e.g., "im doin good") is tagged with both the ARK tagger and the GATE tagger (e.g., "doin" → V vs. VBG). Agree? Where the two taggers assign compatible tags, high-confidence labeled data is generated automatically, and this data is used to pre-train the neural network.
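The agreement check could be sketched as follows; the tagset mapping is an assumed, incomplete illustration, since one tagger emits coarse tags (e.g., V) and the other PTB-style tags (e.g., VBG) that must be reconciled before voting:

```python
# Illustrative, incomplete PTB-to-coarse mapping (not the poster's actual table).
PTB_TO_COARSE = {"VB": "V", "VBG": "V", "NN": "N", "JJ": "A"}

def high_confidence(gate_tags, ark_tags):
    """Keep a micropost only if the two taggers agree on every token."""
    return all(PTB_TO_COARSE.get(g) == a for g, a in zip(gate_tags, ark_tags))

# The taggers agree on all three tokens, so the micropost would join
# the automatically labeled pre-training set.
ok = high_confidence(["VB", "VBG", "JJ"], ["V", "V", "A"])
# ok == True
```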

    *Derczynski et al., 2013. "Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data"

    Word2vec dataset | Pre-training dataset | Accuracy (validation set) | Accuracy (test set)
    -----------------|----------------------|---------------------------|--------------------
    150M             | /                    | 87.95%                    | 87.46%
    150M             | 50K                  | 89.64%                    | 88.82%
    400M             | 50K                  | 89.73%                    | 88.95%
    400M             | 125K                 | 90.09%                    | 88.90%

    Ritter et al. (2011): 84.55%

    Derczynski et al. (2013): 88.69%