
Transcript of Alleviating Manual Feature Engineering for Part-of-Speech Tagging of Twitter Microposts using Distributed Word Representations

  • http://multimedialab.elis.ugent.be Ghent University – iMinds, ELIS Department/Multimedia Lab

    Gaston Crommenlaan 8 bus 201, B-9050 Ledeberg – Ghent, Belgium

    Fréderic Godin, Baptist Vandersmissen, Azarakhsh Jalalvand, Wesley De Neve and Rik Van de Walle Workshop on Machine Learning and NLP, NIPS 2014

    Alleviating Manual Feature Engineering for Part-of-Speech Tagging of Twitter Microposts using Distributed Word Representations

    12/12/2014, Montreal, Canada

    Research Question

    Vote-Constrained Bootstrapping*

    Can we avoid manual feature engineering when developing a Part-of-Speech tagger for Twitter microposts?

    @frederic_godin, @BaptistV, @wmdeneve and @rvdwalle

    Solution

    Automatically learn features on 400 million raw Twitter microposts that capture syntactic and semantic patterns and feed them to a neural network
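A Skip-gram model learns these features by predicting context words from each center word. A minimal sketch of how (center, context) training pairs could be extracted from a micropost — the helper name and window size are illustrative, not from the poster:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs within a symmetric window around each token."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["im", "doin", "good"], window=1)
# pairs == [("im", "doin"), ("doin", "im"), ("doin", "good"), ("good", "doin")]
```

In practice a library such as gensim would be trained on the 400 million raw microposts to produce the 400-dimensional vectors; the sketch only shows the pair extraction that Skip-gram training is based on.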

    [Diagram: two-stage pipeline]

    Learn Features: a Word2vec Skip-gram model is trained on 400 million raw microposts, yielding one 400D vector per word.

    Train the Part-of-Speech Tagger: the 400D vectors of the words in a context window (e.g., "im doin good") are concatenated and fed to a neural network with a Hidden Layer (500D) and an Output Layer (52D, one unit per tag); e.g., "doin" → VBG.
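The tagging network in the diagram can be sketched as a plain feed-forward pass; random vectors stand in for the pre-trained 400D Word2vec embeddings, and the weights and helper name are illustrative, assuming a three-word context window:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, WIN, HID, TAGS = 400, 3, 500, 52  # dimensions from the poster

# Stand-ins for the pre-trained Word2vec vectors and the learned weights.
vectors = {w: rng.normal(size=EMB) for w in ["im", "doin", "good"]}
W1 = rng.normal(scale=0.01, size=(WIN * EMB, HID))
b1 = np.zeros(HID)
W2 = rng.normal(scale=0.01, size=(HID, TAGS))
b2 = np.zeros(TAGS)

def tag_probs(window_words):
    """Concatenate the window's word vectors and run the feed-forward net."""
    x = np.concatenate([vectors[w] for w in window_words])  # (1200,) input
    h = np.tanh(x @ W1 + b1)                                # hidden layer (500D)
    logits = h @ W2 + b2                                    # output layer (52D)
    probs = np.exp(logits - logits.max())                   # stable softmax
    return probs / probs.sum()

p = tag_probs(["im", "doin", "good"])
# p is a probability distribution over the 52 tags for the center word "doin"
```

The choice of tanh and softmax here is an assumption; the poster only specifies the layer sizes.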

    Evaluation

    [Diagram: vote-constrained bootstrapping]

    The same micropost (e.g., "im doin good") is tagged with both the ARK tagger and the GATE tagger (e.g., "doin" → V vs. VBG). Agree? Where the two taggers assign compatible tags, high-confidence labeled data is generated automatically, and this data is used to pre-train the neural network.
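The agreement check could be sketched as follows; the tagset mapping is an assumed, incomplete illustration, since one tagger emits coarse tags (e.g., V) and the other PTB-style tags (e.g., VBG) that must be reconciled before voting:

```python
# Illustrative, incomplete PTB-to-coarse mapping (not the poster's actual table).
PTB_TO_COARSE = {"VB": "V", "VBG": "V", "NN": "N", "JJ": "A"}

def high_confidence(gate_tags, ark_tags):
    """Keep a micropost only if the two taggers agree on every token."""
    return all(PTB_TO_COARSE.get(g) == a for g, a in zip(gate_tags, ark_tags))

# The taggers agree on all three tokens, so the micropost would join
# the automatically labeled pre-training set.
ok = high_confidence(["VB", "VBG", "JJ"], ["V", "V", "A"])
# ok == True
```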

    *Derczynski et al., 2013. "Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data"

    Word2vec dataset | Pre-training dataset | Accuracy (validation set) | Accuracy (test set)
    -----------------|----------------------|---------------------------|--------------------
    150M             | /                    | 87.95%                    | 87.46%
    150M             | 50K                  | 89.64%                    | 88.82%
    400M             | 50K                  | 89.73%                    | 88.95%
    400M             | 125K                 | 90.09%                    | 88.90%

    Ritter et al. (2011): 84.55%

    Derczynski et al. (2013): 88.69%