Alleviating Manual Feature Engineering for Part-of-Speech Tagging of Twitter Microposts using Distributed Word Representations
Fréderic Godin, Baptist Vandersmissen, Azarakhsh Jalalvand, Wesley De Neve and Rik Van de Walle
Ghent University - iMinds, ELIS Department / Multimedia Lab
Gaston Crommenlaan 8 bus 201, B-9050 Ledeberg-Ghent, Belgium
http://multimedialab.elis.ugent.be
Workshop on Machine Learning and NLP, NIPS 2014
12/12/2014, Montreal, Canada
Research Question

Can we avoid manual feature engineering when developing a Part-of-Speech tagger for Twitter microposts?

Twitter: @frederic_godin, @BaptistV, @wmdeneve and @rvdwalle

Solution
Automatically learn features from 400 million raw Twitter microposts that capture syntactic and semantic patterns, and feed these features to a neural network.
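The feature-learning step trains a word2vec skip-gram model on the raw microposts. A minimal sketch of how skip-gram (center, context) training pairs are generated from a tokenized micropost — the window size of 2 is an assumption for illustration, not a value taken from the poster:

```python
def skipgram_pairs(tokens, window=2):
    """Generate the (center, context) pairs that a word2vec skip-gram
    model is trained on, for one tokenized micropost."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # every neighbour within the window is a context word
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("im doin good".split())
# pairs now contains e.g. ("doin", "im") and ("doin", "good")
```

In the real system these pairs (over 400 million microposts) drive the skip-gram training that produces one 400D vector per word.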
[Figure: system overview. Left, "Learn Features": a word2vec skip-gram model is trained on 400 million raw microposts, producing one 400D vector per word. Right, "Train the Part-of-Speech Tagger": for the example micropost "im doin good", the three 400D word vectors are concatenated and passed through a hidden layer (500D) and an output layer (52D, one unit per tag); the tag predicted for "doin" is VBG.]
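The tagger's forward pass can be sketched with the layer sizes given above (400D vectors, 500D hidden layer, 52D output). The three-word context window, tanh nonlinearity, softmax output, and random toy weights are assumptions for illustration, not details confirmed by the poster:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, HID, TAGS = 400, 500, 52  # dimensions from the poster

# Toy embedding table standing in for the pre-trained word2vec vectors.
vocab = {"im": 0, "doin": 1, "good": 2}
E = rng.standard_normal((len(vocab), EMB))

# Randomly initialised tagger weights (trained in the real system).
W1 = rng.standard_normal((3 * EMB, HID)) * 0.01
b1 = np.zeros(HID)
W2 = rng.standard_normal((HID, TAGS)) * 0.01
b2 = np.zeros(TAGS)

def tag_probs(left, center, right):
    """Score the 52 tags for `center` given a 3-word window:
    concatenate the three 400D vectors, then hidden + softmax layers."""
    x = np.concatenate([E[vocab[w]] for w in (left, center, right)])  # 1200D
    h = np.tanh(x @ W1 + b1)                                          # 500D
    z = h @ W2 + b2                                                   # 52D
    z -= z.max()                                                      # stable softmax
    return np.exp(z) / np.exp(z).sum()

probs = tag_probs("im", "doin", "good")  # one probability per tag
```

With trained weights, `probs.argmax()` would pick the tag (VBG for "doin" in the example).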
Vote-Constrained Bootstrapping*

[Figure: the micropost "im doin good" is tagged by both the ARK tagger and the GATE tagger; for "doin", the ARK tagger outputs the coarse tag V and the GATE tagger outputs VBG. Do they agree?]

Where the two taggers agree, automatically generate high-confidence labeled data and use this data to pre-train the neural network.

*Derczynski et al., 2013. "Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data"
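The agreement filter behind vote-constrained bootstrapping — keep only microposts where the ARK and GATE taggers agree on every token — can be sketched as follows. The stand-in taggers and the PTB-to-coarse tag mapping are illustrative assumptions, not the real ARK/GATE implementations:

```python
# Hypothetical stand-ins for the two taggers: each maps a token list to a
# tag list (ARK uses a coarse tagset, GATE the PTB tagset).
def ark_tag(tokens):
    return ["O", "V", "A"]       # e.g. "im doin good" -> O / V / A

def gate_tag(tokens):
    return ["PRP", "VBG", "JJ"]  # e.g. "im doin good" -> PRP / VBG / JJ

# Assumed mapping from PTB tags to the ARK coarse tagset (illustrative).
PTB_TO_COARSE = {"PRP": "O", "VBG": "V", "VBD": "V", "JJ": "A", "NN": "N"}

def bootstrap(tweets):
    """Keep only microposts on which both taggers agree for every token;
    their (token, PTB-tag) pairs become high-confidence pre-training data."""
    labeled = []
    for tokens in tweets:
        ark, gate = ark_tag(tokens), gate_tag(tokens)
        if all(PTB_TO_COARSE.get(g) == a for a, g in zip(ark, gate)):
            labeled.append(list(zip(tokens, gate)))
    return labeled

data = bootstrap([["im", "doin", "good"]])
```

Run over unlabeled microposts, this yields the 50K/125K pre-training sets evaluated below.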
Evaluation

Word2vec dataset           Pre-training dataset   Accuracy (validation set)   Accuracy (test set)
150M                       /                      87.95%                      87.46%
150M                       50K                    89.64%                      88.82%
400M                       50K                    89.73%                      88.95%
400M                       125K                   90.09%                      88.90%
Ritter et al. (2011)                                                          84.55%
Derczynski et al. (2013)                                                      88.69%