Transcript of (NLPIT Workshop) (Keynote) Nathan Schneider - “Hacking a Way Through the Twitter Language Jungle: Syntactic Annotation, Tagging, and Parsing of English Tweets”
Nathan Schneider • NLPIT, Rotterdam • June 23, 2015
http://feelgrafix.com/835928-jungle-wallpaper.html
Hacking a Way Through the Twitter Language Jungle: Syntactic Annotation, Tagging, and Parsing of English Tweets
Edited Text
2
Conversational Web Text
3
#jesuischarlie<3333wut! u da man! *fist pump*
4
RICHNESS
ROBUSTNESS
syntactic parsing
semantic parsing
NER
POS
5
representation
annotation
automation
6
representation
annotation
automation
Outline
7
• Twitter POS tagging
• Twitter dependency parsing
• What’s next?
8
multi-word abbreviations
non-standard spellings
hashtags
Also: at-mentions, URLs, emoticons, symbols, typos, etc.
Twitter POS
• Part-of-speech tagging for Twitter: annotation, features, and experiments. Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. ACL-HLT 2011.
• Improved part-of-speech tagging for online conversational text with word clusters. Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. NAACL-HLT 2013.
9
Twitter POS: Representation
10
lexical & punctuation: common noun • determiner • proper noun • preposition • pronoun • verb particle • verb • coordinating conjunction • adjective • numeral • adverb • interjection • punctuation • predeterminer / existential there
complex: nominal+possessive (his, books’) • proper noun+possessive (Mary’s book) • nominal+verbal (ur, ima) • proper noun+verbal (Mary’s happy) • existential+verbal (there’s)
Twitter POS: Representation
11
Twitter-specific
hashtag #mcconnelling
username @justinbieber
URL/email cnn.com [email protected]
emoticon :-) \o/
Twitter discourse marker RT <—
other ily mfw ™
Twitter POS: Annotation
13
• 17 annotators
15
Twitter POS: Annotation
Inter-annotator agreement
16
[bar chart of tagging accuracies: 83.4, 85.9, 89.4, vs. inter-annotator agreement 92.2]
Twitter POS: Automation
(incl. special regexes, distributional similarity, phonetic normalization, tag dictionary)
Our CRF, base features
Our CRF, all features
Inter-annotator agreement
Stanford tagger
17
Can we do better?
• No explicit annotation guidelines document → some conventions unclear.
• Tokenization difficult due to creative emoticons. :~P \o/
• Accuracy still lags on rare/OOV words.
• CRF is too slow to tag huge volumes of text.
18
Annotation Conventions
• Jesse & the Rippers; the California Chamber of Commerce
‣ All and only nouns within proper names tagged as proper noun
• (1) this wind is serious
(2) i just orgasmed over this
(3) You should know, that if you come any closer …
‣ (1): determiner, (2): pronoun, (3): preposition/subordinator
‣ Gimpel et al. annotations were inconsistent, so we corrected them
19
Annotation Conventions
• RT @anonuser : Tonight’s memorial for Lucas Ransom starts at 8:00 p.m. and is being held at the open space at the corner of Del Pla ...
‣ Proper noun (truncated, but we can tell from context)
• (1) I need to go home man .
(2) * Bbm yawn face * Man that #napflow felt so refreshing .
‣ (1): noun, (2): interjection
20
Improved Tokenizer
• Rule-based, with regular expressions for faces, etc.: *O -_- <3333
• Also better URL patterns: about.me
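The rule-based idea can be sketched as a single ordered alternation: patterns for URLs, hashtags, at-mentions, and emoticon faces are tried before the generic word/punctuation fallbacks, so tokens like :-) and <3333 survive intact. This is a simplified, illustrative sketch, not the released twokenize rule set; the pattern names and the specific regexes are assumptions.

```python
import re

# Illustrative sketch of a rule-based tweet tokenizer. Alternation order
# matters: special tokens must match before the generic fallbacks.
TOKEN = re.compile(r"""
      https?://\S+                          # full URLs
    | \b\w[\w.\-]*\.(?:com|org|net|me)\b    # bare domains like about.me, cnn.com
    | \#\w+                                 # hashtags
    | @\w+                                  # at-mentions
    | <3+                                   # hearts: <3, <3333
    | \\o/                                  # arms-up emoticon
    | \*O                                   # wide-eyed face fragment
    | [-^=][._]{1,4}[-^=]                   # faces like -_-  ^_^  -.-
    | [<>]?[:;=8][\-o*'~]?[()\[\]dDpPsS/\\|@3]+   # :-)  ;P  :~P  :/  etc.
    | \w+(?:'\w+)?                          # ordinary words
    | \S                                    # any other single character
""", re.VERBOSE)

def tokenize(tweet):
    return TOKEN.findall(tweet)
```

Run on the slide’s opening example, `tokenize("#jesuischarlie<3333wut! u da man! *fist pump*")` keeps the hashtag and heart whole while splitting the run-together `wut`.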
21
Word Clusters
• Brown clusters help smooth over lexicon to better accommodate rare/OOV words
• We trained 1000 clusters on 56M English tweets (847M tokens) spread over 4 years
23
Word Clusters
Binary path — Top words (by frequency)
A1 111010100010  lmao lmfao lmaoo lmaooo hahahahaha lool ctfu rofl loool lmfaoo lmfaooo lmaoooo lmbo lololol
A2 111010100011  haha hahaha hehe hahahaha hahah aha hehehe ahaha hah hahahah kk hahaa ahah
A3 111010100100  yes yep yup nope yess yesss yessss ofcourse yeap likewise yepp yesh yw yuup yus
A4 111010100101  yeah yea nah naw yeahh nooo yeh noo noooo yeaa ikr nvm yeahhh nahh nooooo
A5 11101011011100  smh jk #fail #random #fact smfh #smh #winning #realtalk smdh #dead #justsaying
B  011101011  u yu yuh yhu uu yuu yew y0u yuhh youh yhuu iget yoy yooh yuo yue juu dya youz yyou
C  11100101111001  w fo fa fr fro ov fer fir whit abou aft serie fore fah fuh w/her w/that fron isn agains
D  111101011000  facebook fb itunes myspace skype ebay tumblr bbm flickr aim msn netflix pandora
E1 0011001  tryna gon finna bouta trynna boutta gne fina gonn tryina fenna qone trynaa qon
E2 0011000  gonna gunna gona gna guna gnna ganna qonna gonnna gana qunna gonne goona
F  0110110111  soo sooo soooo sooooo soooooo sooooooo soooooooo sooooooooo soooooooooo
G1 11101011001010  ;) :p :-) xd ;-) ;d (; :3 ;p =p :-p =)) ;] xdd #gno xddd >:) ;-p >:d 8-) ;-d
G2 11101011001011  :) (: =) :)) :] :’) =] ^_^ :))) ^.^ [: ;)) ((: ^__^ (= ^-^ :))))
G3 1110101100111  :( :/ -_- -.- :-( :’( d: :| :s -__- =( =/ >.< -___- :-/ </3 :\ -____- ;( /: :(( >_< =[ :[ #fml
G4 111010110001  <3 xoxo <33 xo <333 #love s2 <URL-twitition.com> #neversaynever <3333

Figure 2: Example word clusters (HMM classes): we list the most probable words, starting with the most probable, in descending order. Boldfaced words appear in the example tweet (Figure 1). The binary strings are root-to-leaf paths through the binary cluster tree. For example usage, see e.g. search.twitter.com, bing.com/social and urbandictionary.com.
3.1 Clustering Method
We obtained hierarchical word clusters via Brown clustering (Brown et al., 1992) on a large set of unlabeled tweets.4 The algorithm partitions words into a base set of 1,000 clusters, and induces a hierarchy among those 1,000 clusters with a series of greedy agglomerative merges that heuristically optimize the likelihood of a hidden Markov model with a one-class-per-lexical-type constraint. Not only does Brown clustering produce effective features for discriminative models, but its variants are better unsupervised POS taggers than some models developed nearly 20 years later; see comparisons in Blunsom and Cohn (2011). The algorithm is attractive for our purposes since it scales to large amounts of data.
When training on tweets drawn from a single day, we observed time-specific biases (e.g., numerical dates appearing in the same cluster as the word tonight), so we assembled our unlabeled data from a random sample of 100,000 tweets per day from September 10, 2008 to August 14, 2012, and filtered out non-English tweets (about 60% of the sample) using langid.py (Lui and Baldwin, 2012).5 Each tweet was processed with our tokenizer and lowercased. We normalized all at-mentions to ⟨@MENTION⟩ and URLs/email addresses to their domains (e.g. http://bit.ly/dP8rR8 → ⟨URL-bit.ly⟩). In an effort to reduce spam, we removed duplicated tweet texts (this also removes retweets) before word clustering. This normalization and cleaning resulted in 56 million unique tweets (847 million tokens). We set the clustering software’s count threshold to only cluster words appearing 40 or more times, yielding 216,856 word types, which took 42 hours to cluster on a single CPU.

4 As implemented by Liang (2005), v. 1.3: https://github.com/percyliang/brown-cluster
5 https://github.com/saffsd/langid.py
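The normalization steps described above (lowercasing, at-mention and URL-domain placeholders, deduplication) can be sketched as follows. The regexes and function names are illustrative assumptions, not the released pipeline.

```python
import re
from urllib.parse import urlparse

# Sketch of the pre-clustering normalization described in the text.
MENTION = re.compile(r"@\w+")
URLISH = re.compile(r"https?://\S+")

def normalize(tweet):
    tweet = tweet.lower()
    tweet = MENTION.sub("<@MENTION>", tweet)
    # Keep only the domain of each URL, e.g. http://bit.ly/dP8rR8 -> <URL-bit.ly>
    tweet = URLISH.sub(lambda m: "<URL-%s>" % urlparse(m.group()).netloc, tweet)
    return tweet

def dedupe(tweets):
    # Dropping duplicate texts also removes most retweets and copy-paste spam.
    seen, unique = set(), []
    for t in tweets:
        if t not in seen:
            seen.add(t)
            unique.append(t)
    return unique
```

Collapsing every mention to one placeholder and every URL to its domain keeps the vocabulary small enough that the 40-occurrence clustering threshold does useful work.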
3.2 Cluster Examples
Fig. 2 shows example clusters. Some of the challenging words in the example tweet (Fig. 1) are highlighted. The term lololol (an extension of lol for “laughing out loud”) is grouped with a large number of laughter acronyms (A1: “laughing my (fucking) ass off,” “cracking the fuck up”). Since expressions of laughter are so prevalent on Twitter, the algorithm creates another laughter cluster (A1’s sibling A2), that tends to have onomatopoeic, non-acronym variants (e.g., haha). The acronym ikr (“I know, right?”) is grouped with expressive variations of “yes” and “no” (A4). Note that A1–A4 are grouped in a fairly specific subtree; and indeed, in this message ikr and …
laughter + interjections
hearts/love symbols + faces
Word Clusters
Feature set                                        OCT27TEST  DAILY547  NPSCHATTEST
1  All features                                       91.60     92.80     91.19
2  with clusters; without tagdicts, namelists         91.15     92.38     90.66
3  without clusters; with tagdicts, namelists         89.81     90.81     90.00
4  only clusters (and transitions)                    89.50     90.54     89.55
5  without clusters, tagdicts, namelists              86.86     88.30     88.26
6  Gimpel et al. (2011) version 0.2                   88.89     89.17
7  Inter-annotator agreement (Gimpel et al., 2011)    92.2
8  Model trained on all OCT27                                   93.2

Table 2: Tagging accuracies (%) in ablation experiments. OCT27TEST and DAILY547 95% confidence intervals are roughly ±0.7%. Our final tagger uses all features and also trains on OCT27TEST, achieving 93.2% on DAILY547.
…features, affix n-grams, capitalization, emoticon patterns, etc., and the accuracy is in fact still better than the previous work (row 4).18

We also wanted to know whether to keep the tag dictionary and name list features, but the splits reported in Fig. 2 did not show statistically significant differences; so to better discriminate between ablations, we created a lopsided train/test split of all data with a much larger test portion (26,974 tokens), having greater statistical power (tighter confidence intervals of ±0.3%).19 The full system got 90.8% while the no–tag dictionary, no-namelists ablation had 90.0%, a statistically significant difference. Therefore we retain these features.

Compared to the tagger in Gimpel et al., most of our feature changes are in the new lexical features described in §3.5.20 We do not reuse the other lexical features from the previous work, including a phonetic normalizer (Metaphone), a name list consisting of words that are frequently capitalized, and distributional features trained on a much smaller unlabeled corpus; they are all worse than our new lexical features described here. (We did include, however, a variant of the tag dictionary feature that uses phonetic normalization for lookup; it seemed to yield a small improvement.)
18 Furthermore, when evaluating the clusters as unsupervised (hard) POS tags, we obtain a many-to-one accuracy of 89.2% on DAILY547. Before computing this, we lowercased the text to match the clusters and removed tokens tagged as URLs and at-mentions.
19 Reported confidence intervals in this paper are 95% binomial normal approximation intervals for the proportion of correctly tagged tokens: ±1.96 √(p(1−p)/n_tokens) ∝ 1/√n.

20 Details on the exact feature set are available in a technical report (Owoputi et al., 2012), also available on the website.
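Footnote 19’s interval can be checked directly with a two-line helper (`ci_halfwidth` is a hypothetical name, not from the paper’s code):

```python
import math

# 95% binomial normal-approximation confidence half-width for an accuracy p
# measured on n tokens, as in footnote 19.
def ci_halfwidth(p, n):
    return 1.96 * math.sqrt(p * (1 - p) / n)

# At ~90% accuracy on the 26,974-token lopsided test split, the half-width
# comes out near 0.0036, i.e. roughly the +/-0.3% quoted above.
```

Because the half-width shrinks as 1/√n, quadrupling the test set roughly halves the interval, which is why the lopsided split could separate the 90.8% vs. 90.0% ablation.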
Non-traditional words. The word clusters are especially helpful with words that do not appear in traditional dictionaries. We constructed a dictionary by lowercasing the union of the ispell ‘American’, ‘British’, and ‘English’ dictionaries, plus the standard Unix words file from Webster’s Second International dictionary, totalling 260,985 word types. After excluding tokens defined by the gold standard as punctuation, URLs, at-mentions, or emoticons,21 22% of DAILY547’s tokens do not appear in this dictionary. Without clusters, they are very difficult to classify (only 79.2% accuracy), but adding clusters generates a 5.7 point improvement, much larger than the effect on in-dictionary tokens (Table 3).
Varying the amount of unlabeled data. A tagger that only uses word clusters achieves an accuracy of 88.6% on the OCT27 development set.22 We created several clusterings with different numbers of unlabeled tweets, keeping the number of clusters constant at 800. As shown in Fig. 3, there was initially a logarithmic relationship between number of tweets and accuracy, but accuracy (and lexical coverage) levels out after 750,000 tweets. We use the largest clustering (56 million tweets and 1,000 clusters) as the default for the released tagger.
6.2 Evaluation on RITTERTW
Ritter et al. (2011) annotated a corpus of 787 tweets23 with a single annotator, using the PTB …

21 We retain hashtags since by our guidelines a #-prefixed token is ambiguous between a hashtag and a normal word, e.g. #1 or going #home.
22 The only observation features are the word clusters of a token and its immediate neighbors.
23 https://github.com/aritter/twitter_nlp/blob/master/data/annotated/pos.txt
28
Word Clusters
We approach part-of-speech tagging for informal, online conversational text using large-scale unsupervised word clustering and new lexical features. Our system achieves state-of-the-art tagging results on both Twitter and IRC data. Additionally, we contribute the first POS annotation guidelines for such text and release a new dataset of English language tweets annotated using these guidelines.
Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters
Word Clusters
Tagger Features
• Hierarchical word clusters via Brown clustering (Brown et al., 1992) on a sample of 56M tweets
• Surrounding words/clusters
• Current and previous tags
• Tag dict. constructed from WSJ, Brown corpora
• Tag dict. entries projected to Metaphone encodings
• Name lists from Freebase, Moby Words, Names Corpus
• Emoticon, hashtag, @mention, URL patterns
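One way the word-cluster feature pays off: each word’s binary cluster path is turned into prefix features at several depths, so a rare spelling shares features with frequent words higher in the cluster tree. The sketch below is illustrative only; the tiny table uses paths from Figure 2, and the particular prefix lengths and feature-name format are assumptions (the released tagger’s exact feature set is described in its tech report).

```python
# Toy cluster table: word -> root-to-leaf binary path (paths from Figure 2).
CLUSTERS = {
    "lmao": "111010100010", "lololol": "111010100010",
    "haha": "111010100011",
    "yeah": "111010100101", "ikr": "111010100101",
}

def cluster_features(word, prefix_lengths=(4, 6, 10, 12)):
    # Emit bit-path *prefixes* of several lengths: short prefixes generalize
    # broadly (e.g. all of A1-A4 share 1110101001), long ones stay specific.
    path = CLUSTERS.get(word.lower())
    if path is None:
        return []                    # word is outside the cluster vocabulary
    return ["cluster_prefix_%d=%s" % (k, path[:k]) for k in prefix_lengths]
```

For example, ikr and yeah (same cluster A4) produce identical features, while ikr and lmao still share their first ten bits, letting the model smooth across the whole yes/no/laughter subtree.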
Olutobi Owoputi* Brendan O’Connor* Chris Dyer* Kevin Gimpel+ Nathan Schneider* Noah A. Smith*
* School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
+ Toyota Technological Institute at Chicago, Chicago, IL 60637, USA
Highest Weighted Clusters
Speed
• Tagger: 800 tweets/s (compared to 20 tweets/s previously)
• Tokenizer: 3,500 tweets/s
Software & Data Release
• Improved emoticon detector and tweet tokenizer
• Newly annotated evaluation set, fixes to previous annotations
Examples
• Boutta/P Shake/V Da/D Croud/N So/P Yall/O Culd/V Start/V Hateing/V Now/R
• ikr/! smh/G he/O asked/V fir/P yo/D last/A name/N so/P he/O can/V add/V u/O on/P fb/^ lololol/!
Cluster prefix   Tag   Types   Most common word in each cluster with prefix
100110*          &       103   or n & and
11101*           O       899   you yall u it mine everything nothing something anyone someone everyone nobody
01*              V    29,267   do did kno know care mean hurts hurt say realize believe worry understand forget agree remember love miss hate think thought knew hope wish guess bet have
1101*            D       378   the da my your ur our their his
111110*          A     6,510   young sexy hot slow dark low interesting easy important safe perfect special different random short quick bad crazy serious stupid weird lucky sad
1110101100*      E     2,798   x <3 :d :p :) :o :/
11000*           L       428   i'm im you're we're he's there's its it's
11101010*        !     8,160   lol lmao haha yes yea oh omg aww ah btw wow thanks sorry congrats welcome yay ha hey goodnight hi dear please huh wtf exactly idk bless whatever well ok
Dev set accuracy using only clusters as features
Accuracy on NPSCHATTEST corpus (incl. system messages)
Tagset
Datasets
Tagger, tokenizer, and data all released at:
www.ark.cs.cmu.edu/TweetNLP
Accuracy on RITTERTW corpus
Model
Discriminative sequence model (MEMM) with L1/L2 regularization
29
Speed
• Tokenizer: 3500 tweets/s
• MEMM instead of CRF is much faster
‣ Greedy: 800 tweets/s (10k w/s), barely any loss in accuracy relative to Viterbi
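Why greedy decoding is so fast: a single left-to-right pass commits to the best tag at each position given only the previous tag, instead of Viterbi’s dynamic program over all tag sequences. The sketch below is illustrative; the `score` callback stands in for the trained MEMM’s local classifier and is an assumption, not the released tagger.

```python
def greedy_tag(tokens, score):
    # score(features) -> {tag: score} is a stand-in for the local MEMM
    # classifier; decoding cost is O(n * |tagset|), with no lattice.
    tags, prev = [], "<START>"
    for i, tok in enumerate(tokens):
        feats = {
            "word": tok.lower(),
            "prev_tag": prev,   # the only interaction between decisions
            "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>",
        }
        local = score(feats)
        prev = max(local, key=local.get)   # commit immediately
        tags.append(prev)
    return tags
```

Because each decision conditions only on the already-committed previous tag, throughput scales linearly in tweet length, which is what makes 10k words/s feasible with barely any accuracy loss.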
Outline
30
• Twitter POS tagging
• Twitter dependency parsing
• What’s next?
Twitter Syntax: Representation
31
OMG I <3 the Biebs & want to have his babies ! —> LA Times: Teen Pop Star Heartthrob is All the Rage on Social Media… #belieber
Twitter Syntax: Representation
32
OMG I <3 the Biebs & want to have his babies ! —> LA Times : Teen Pop Star Heartthrob is All the Rage on Social Media … #belieber
Twitter Syntax: Representation
34
OMG
I <3 the Biebs & want to have his babies
LA Times
Teen Pop Star Heartthrob is All the Rage on Social Media
Twitter Syntax: Representation
35
OMG
I <3 the Biebs & want to have his babies
LA_Times
Teen Pop Star Heartthrob is All_the_Rage on Social Media
• Fragmentary Unlabeled Dependency Grammar (“FUDG”; Schneider et al. 2013)
‣ allows utterance segmentation, token selection, MWEs, coordination, underspecification
• FUDG can be rendered in ASCII (“GFL”):
Teen > (Pop > Star) > Heartthrob
Heartthrob > is** < [All the Rage]
is < on < (Social > Media)
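The GFL lines above can be read mechanically: `a > b` attaches a to b, `a < b` attaches b to a, parentheses group a sub-expression (its head stands in for the group), square brackets mark a multiword expression, and `**` flags the root. A small recursive-descent reader for just this subset might look like the following; this is an illustrative sketch, not the official GFL parser, and it ignores GFL constructs beyond these (coordination nodes, anaphora variables, etc.).

```python
import re

def parse_gfl(line):
    """Parse a small subset of GFL: words, > and < attachment arrows,
    ( ) grouping, [ ] multiword expressions, and a ** root marker.
    Returns (edges, head), with edges as (dependent, head) pairs."""
    toks = re.findall(r"[()\[\]<>]|[^\s()\[\]<>]+", line)
    pos = 0

    def node():
        nonlocal pos
        t = toks[pos]; pos += 1
        if t == "(":                      # grouped sub-expression
            edges, head = expr()
            pos += 1                      # consume ')'
            return edges, head
        if t == "[":                      # multiword expression -> one node
            words = []
            while toks[pos] != "]":
                words.append(toks[pos]); pos += 1
            pos += 1                      # consume ']'
            return [], "_".join(words)
        return [], t.rstrip("*")          # ** only marks the root; drop it

    def expr():
        nonlocal pos
        edges, head = node()
        prev = head
        while pos < len(toks) and toks[pos] in "<>":
            op = toks[pos]; pos += 1
            sub_edges, nxt = node()
            edges += sub_edges
            if op == ">":                 # prev attaches to nxt
                edges.append((prev, nxt))
                if prev == head:
                    head = nxt            # chain head moves rightward
            else:                         # nxt attaches to prev
                edges.append((nxt, prev))
            prev = nxt
        return edges, head

    return expr()
```

On the first example line this yields the edges Pop→Star, Teen→Star, Star→Heartthrob, i.e. the group’s head (Star) is what attaches outward.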
Twitter Syntax: Annotation
38
1 day of annotation, 26 participants
• Custom web-based annotation tool (Mordowanec et al., ACL 2014 demo)
Twitter Syntax: Automation
• A supervised discriminative graph-based parser for tweets (Kong et al., EMNLP 2014)
‣ (1) lexical selection: a sequence model
‣ (2) parsing: a 2nd-order TurboParser model
‣ produces FUDG parses (incl. coordination, multiple utterances, MWEs)
• Experiments
‣ train on PTB, test on tweets: 73% UAS
‣ train on 1,473 English tweets (9k tokens): 80%
‣ domain adaptation (train on tweets, with some features derived from PTB-trained parser): 81%
39
Twitter POS & Syntax: Summary
• Modified traditional representations to meet the needs of our domain & process
• Rapid annotation by (mostly) CS grad students, informed by linguistics
• Widely downloaded (>3,000), state-of-the-art POS tagger for Twitter; parser will be released in time for EMNLP
• Syntactic representations & annotation tools inspired by Twitter are now being used for Wikipedia, low-resource African languages, and even Shakespeare!
40
Twitter POS & Syntax: Links
• http://www.ark.cs.cmu.edu/TweetNLP/
• http://www.ark.cs.cmu.edu/FUDG/
41
42
representation
annotation
automation
What’s Next?
• More languages/genres?
• Richer/deeper representations?
• Applications?
43
44
RICHNESS
ROBUSTNESS
syntactic parsing
semantic parsing
NER
POS
45
thx