(NLPIT Workshop) (Keynote) Nathan Schneider - “Hacking a Way Through the Twitter Language Jungle: Syntactic Annotation, Tagging, and Parsing of English Tweets”

Page 1

Nathan Schneider • NLPIT, Rotterdam • June 23, 2015

http://feelgrafix.com/835928-jungle-wallpaper.html

Hacking a Way Through the Twitter Language Jungle: Syntactic Annotation, Tagging, and Parsing of English Tweets

Page 2

Edited Text

2

Page 3

Conversational Web Text

3

#jesuischarlie    <3333    wut!    u da man!    *fist pump*

Page 4

4

[Diagram: NLP tasks (POS tagging, NER, syntactic parsing, semantic parsing) arranged along two axes, RICHNESS and ROBUSTNESS]

Page 5

5

representation

annotation

automation

Page 6

6

representation

annotation

automation

Page 7

Outline

7

• Twitter POS tagging

• Twitter dependency parsing

• What’s next?

Page 8

Twitter

8

multi-word abbreviations

non-standard spellings

hashtags

Also: at-mentions, URLs, emoticons, symbols, typos, etc.

Page 9

Twitter POS

• Part-of-speech tagging for Twitter: annotation, features, and experiments. Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. ACL-HLT 2011.

• Improved part-of-speech tagging for online conversational text with word clusters. Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. NAACL-HLT 2013.

9

Page 10

Twitter POS: Representation

10

lexical & punctuation: common noun, determiner, proper noun, preposition, pronoun, verb particle, verb, coordinating conjunction, adjective, numeral, adverb, interjection, punctuation, predeterminer / existential there

complex: nominal+possessive (his, books’), proper noun+possessive (Mary’s book), nominal+verbal (ur, ima), proper noun+verbal (Mary’s happy), existential+verbal (there’s)

Page 11

Twitter POS: Representation

11

Twitter-specific

hashtag #mcconnelling

username @justinbieber

URL/email cnn.com [email protected]

emoticon :-) \o/

Twitter discourse marker RT <—

other ily mfw ™

Page 12

Twitter POS: Annotation

13

• 17 annotators

Page 13

15

Twitter POS: Annotation

[Bar chart: inter-annotator agreement on the POS annotations, 92.2%]

Page 14

16

Twitter POS: Automation

[Bar chart of tagging accuracy (%): the Stanford tagger and our CRF with base features score 83.4 and 85.9; our CRF with all features (incl. special regexes, distributional similarity, phonetic normalization, tag dictionary) reaches 89.4; inter-annotator agreement is 92.2]

Page 15

17

Can we do better?

• No explicit annotation guidelines document → some conventions unclear.

• Tokenization difficult due to creative emoticons. :~P \o/

• Accuracy still lags on rare/OOV words.

• CRF is too slow to tag huge volumes of text.

Page 16

18

Annotation Conventions

• Jesse & the Rippers; the California Chamber of Commerce

‣ All and only nouns within proper names tagged as proper noun

• (1) this wind is serious  (2) i just orgasmed over this  (3) You should know, that if you come any closer …

‣ (1): determiner, (2): pronoun, (3): preposition/subordinator

‣ Gimpel et al. annotations were inconsistent, so we corrected them

Page 17

19

Annotation Conventions

• RT @anonuser : Tonight’s memorial for Lucas Ransom starts at 8:00 p.m. and is being held at the open space at the corner of Del Pla ...

‣ Proper noun (truncated, but we can tell from context)

• (1) I need to go home man .  (2) * Bbm yawn face * Man that #napflow felt so refreshing .

‣ (1): noun, (2): interjection

Page 18

20

Improved Tokenizer

• Rule-based, with regular expressions for faces, etc.: *O   -_-   <3333

• Also better URL patterns: about.me
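A minimal sketch of the idea, not the released twokenize rules: protect emoticon-like strings and bare domains (e.g. about.me) with regular expressions before falling back to generic splitting. The specific patterns below are illustrative simplifications.

```python
import re

# Illustrative simplification of a rule-based tweet tokenizer: protect emoticons,
# hearts, and bare domains as single tokens before splitting on anything else.
EMOTICON = r"(?:[<>]?[:;=8][\-o\*']?[\)\]\(\[dDpP/\\]+|<3+|\\o/|[\-_\.~]{2,}|\*[\w\s]+\*)"
URL      = r"(?:https?://\S+|\b[a-z0-9\-]+\.(?:com|org|net|me|ly)\b\S*)"
TOKEN    = re.compile(f"{URL}|{EMOTICON}|\\w+|\\S", re.IGNORECASE)

def tokenize(text: str):
    return TOKEN.findall(text)

print(tokenize("check about.me :-) <3333 *fist pump* \\o/"))
# ['check', 'about.me', ':-)', '<3333', '*fist pump*', '\\o/']
```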

Page 19

21

Word Clusters

• Brown clusters help smooth over the lexicon to better accommodate rare/OOV words

• We trained 1000 clusters on 56M English tweets (847M tokens) spread over 4 years
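As a rough sketch of how such clusters are typically consumed (this is not the released tagger code): load each word's bit-string path and emit prefix features of several lengths, so rare spellings share features with frequent ones. The file layout assumed here is the `<bitstring> TAB <word> TAB <count>` format produced by the percyliang/brown-cluster tool.

```python
# Minimal sketch: Brown cluster paths as tagger features (illustrative only).
def load_clusters(path):
    clusters = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            bits, word, _count = line.rstrip("\n").split("\t")
            clusters[word] = bits
    return clusters

def cluster_features(word, clusters, prefix_lengths=(2, 4, 6, 8, 12, 16)):
    # Prefixes of the bit string correspond to coarser and coarser clusters.
    bits = clusters.get(word.lower())
    if bits is None:
        return []
    return [f"cprefix{k}={bits[:k]}" for k in prefix_lengths if k <= len(bits)]

# e.g. clusters = load_clusters("paths"); cluster_features("lololol", clusters)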

Page 20

23

Word Clusters

Cluster | Binary path | Top words (by frequency)
A1 | 111010100010 | lmao lmfao lmaoo lmaooo hahahahaha lool ctfu rofl loool lmfaoo lmfaooo lmaoooo lmbo lololol
A2 | 111010100011 | haha hahaha hehe hahahaha hahah aha hehehe ahaha hah hahahah kk hahaa ahah
A3 | 111010100100 | yes yep yup nope yess yesss yessss ofcourse yeap likewise yepp yesh yw yuup yus
A4 | 111010100101 | yeah yea nah naw yeahh nooo yeh noo noooo yeaa ikr nvm yeahhh nahh nooooo
A5 | 11101011011100 | smh jk #fail #random #fact smfh #smh #winning #realtalk smdh #dead #justsaying
B | 011101011 | u yu yuh yhu uu yuu yew y0u yuhh youh yhuu iget yoy yooh yuo yue juu dya youz yyou
C | 11100101111001 | w fo fa fr fro ov fer fir whit abou aft serie fore fah fuh w/her w/that fron isn agains
D | 111101011000 | facebook fb itunes myspace skype ebay tumblr bbm flickr aim msn netflix pandora
E1 | 0011001 | tryna gon finna bouta trynna boutta gne fina gonn tryina fenna qone trynaa qon
E2 | 0011000 | gonna gunna gona gna guna gnna ganna qonna gonnna gana qunna gonne goona
F | 0110110111 | soo sooo soooo sooooo soooooo sooooooo soooooooo sooooooooo soooooooooo
G1 | 11101011001010 | ;) :p :-) xd ;-) ;d (; :3 ;p =p :-p =)) ;] xdd #gno xddd >:) ;-p >:d 8-) ;-d
G2 | 11101011001011 | :) (: =) :)) :] :’) =] ^_^ :))) ^.^ [: ;)) ((: ^__^ (= ^-^ :))))
G3 | 1110101100111 | :( :/ -_- -.- :-( :’( d: :| :s -__- =( =/ >.< -___- :-/ </3 :\ -____- ;( /: :(( >_< =[ :[ #fml
G4 | 111010110001 | <3 xoxo <33 xo <333 #love s2 <URL-twitition.com> #neversaynever <3333

Figure 2: Example word clusters (HMM classes): we list the most probable words, starting with the most probable, in descending order. Boldfaced words appear in the example tweet (Figure 1). The binary strings are root-to-leaf paths through the binary cluster tree. For example usage, see e.g. search.twitter.com, bing.com/social and urbandictionary.com.

3.1 Clustering Method

We obtained hierarchical word clusters via Brown clustering (Brown et al., 1992) on a large set of unlabeled tweets.4 The algorithm partitions words into a base set of 1,000 clusters, and induces a hierarchy among those 1,000 clusters with a series of greedy agglomerative merges that heuristically optimize the likelihood of a hidden Markov model with a one-class-per-lexical-type constraint. Not only does Brown clustering produce effective features for discriminative models, but its variants are better unsupervised POS taggers than some models developed nearly 20 years later; see comparisons in Blunsom and Cohn (2011). The algorithm is attractive for our purposes since it scales to large amounts of data.

When training on tweets drawn from a single day, we observed time-specific biases (e.g., numerical dates appearing in the same cluster as the word tonight), so we assembled our unlabeled data from a random sample of 100,000 tweets per day from September 10, 2008 to August 14, 2012, and filtered out non-English tweets (about 60% of the sample) using langid.py (Lui and Baldwin, 2012).5 Each tweet was processed with our tokenizer and lowercased. We normalized all at-mentions to ⟨@MENTION⟩ and URLs/email addresses to their domains (e.g. http://bit.ly/dP8rR8 → ⟨URL-bit.ly⟩). In an effort to reduce spam, we removed duplicated tweet texts (this also removes retweets) before word clustering. This normalization and cleaning resulted in 56 million unique tweets (847 million tokens). We set the clustering software’s count threshold to only cluster words appearing 40 or more times, yielding 216,856 word types, which took 42 hours to cluster on a single CPU.

4 As implemented by Liang (2005), v. 1.3: https://github.com/percyliang/brown-cluster

5 https://github.com/saffsd/langid.py
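A rough illustration of the normalization just described, not the authors' exact script (the placeholder strings and regexes are my own):

```python
import re
from urllib.parse import urlparse

# Lowercase, map URLs/emails to their domain, map at-mentions to a placeholder.
def normalize(tweet: str) -> str:
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+",
                   lambda m: "<URL-%s>" % urlparse(m.group()).netloc, tweet)
    tweet = re.sub(r"\b[\w.+-]+@([\w-]+\.[\w.-]+)\b", r"<URL-\1>", tweet)
    tweet = re.sub(r"@\w+", "<@MENTION>", tweet)
    return tweet

seen, unique = set(), []
for t in ["Check http://bit.ly/dP8rR8 @justinbieber",
          "check http://bit.ly/dp8rr8 @justinbieber"]:
    norm = normalize(t)
    if norm not in seen:          # de-duplication also removes most retweets
        seen.add(norm)
        unique.append(norm)
print(unique)   # ['check <URL-bit.ly> <@MENTION>']
```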

3.2 Cluster Examples

Fig. 2 shows example clusters. Some of the challenging words in the example tweet (Fig. 1) are highlighted. The term lololol (an extension of lol for “laughing out loud”) is grouped with a large number of laughter acronyms (A1: “laughing my (fucking) ass off,” “cracking the fuck up”). Since expressions of laughter are so prevalent on Twitter, the algorithm creates another laughter cluster (A1’s sibling A2), that tends to have onomatopoeic, non-acronym variants (e.g., haha). The acronym ikr (“I know, right?”) is grouped with expressive variations of “yes” and “no” (A4). Note that A1–A4 are grouped in a fairly specific subtree; and indeed, in this message ikr and

laughter

hearts/love symbols

Page 21

24

Word Clusters

[Same cluster figure and paper excerpt as the previous slide, now highlighting: laughter; hearts/love symbols + faces]

Page 22

25

Word Clusters

[Same cluster figure and paper excerpt, now highlighting: laughter + interjections; hearts/love symbols + faces]

Page 23

26

Word Clusters

[Same cluster figure and paper excerpt; highlights unchanged: laughter + interjections; hearts/love symbols + faces]

Page 24

27

Word Clusters

Feature set                                         | OCT27TEST | DAILY547 | NPSCHATTEST
1  All features                                     | 91.60     | 92.80    | 91.19
2  with clusters; without tagdicts, namelists       | 91.15     | 92.38    | 90.66
3  without clusters; with tagdicts, namelists       | 89.81     | 90.81    | 90.00
4  only clusters (and transitions)                  | 89.50     | 90.54    | 89.55
5  without clusters, tagdicts, namelists            | 86.86     | 88.30    | 88.26
6  Gimpel et al. (2011) version 0.2                 | 88.89     | 89.17    | —
7  Inter-annotator agreement (Gimpel et al., 2011)  | 92.2      | —        | —
8  Model trained on all OCT27                       | —         | 93.2     | —

Table 2: Tagging accuracies (%) in ablation experiments. OCT27TEST and DAILY547 95% confidence intervals are roughly ±0.7%. Our final tagger uses all features and also trains on OCT27TEST, achieving 93.2% on DAILY547.

…tures, affix n-grams, capitalization, emoticon patterns, etc.—and the accuracy is in fact still better than the previous work (row 4).18

We also wanted to know whether to keep the tag dictionary and name list features, but the splits reported in Fig. 2 did not show statistically significant differences; so to better discriminate between ablations, we created a lopsided train/test split of all data with a much larger test portion (26,974 tokens), having greater statistical power (tighter confidence intervals of ±0.3%).19 The full system got 90.8% while the no–tag dictionary, no-namelists ablation had 90.0%, a statistically significant difference. Therefore we retain these features.

Compared to the tagger in Gimpel et al., most of our feature changes are in the new lexical features described in §3.5.20 We do not reuse the other lexical features from the previous work, including a phonetic normalizer (Metaphone), a name list consisting of words that are frequently capitalized, and distributional features trained on a much smaller unlabeled corpus; they are all worse than our new lexical features described here. (We did include, however, a variant of the tag dictionary feature that uses phonetic normalization for lookup; it seemed to yield a small improvement.)

18 Furthermore, when evaluating the clusters as unsupervised (hard) POS tags, we obtain a many-to-one accuracy of 89.2% on DAILY547. Before computing this, we lowercased the text to match the clusters and removed tokens tagged as URLs and at-mentions.

19 Reported confidence intervals in this paper are 95% binomial normal approximation intervals for the proportion of correctly tagged tokens: ±1.96·√(p(1−p)/n_tokens) ≤ 1/√n.

20 Details on the exact feature set are available in a technical report (Owoputi et al., 2012), also available on the website.
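A quick numerical check of the interval in footnote 19, using made-up values for p and n (neither figure is taken from the paper's data):

```python
from math import sqrt

# Hypothetical example: accuracy p on a test set of n tokens.
p, n = 0.928, 7707
half_width = 1.96 * sqrt(p * (1 - p) / n)
print(f"±{100 * half_width:.2f}%   (bound 1/sqrt(n) = {100 / sqrt(n):.2f}%)")
# ±0.58%   (bound 1/sqrt(n) = 1.14%)
```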

Non-traditional words. The word clusters are especially helpful with words that do not appear in traditional dictionaries. We constructed a dictionary by lowercasing the union of the ispell ‘American’, ‘British’, and ‘English’ dictionaries, plus the standard Unix words file from Webster’s Second International dictionary, totalling 260,985 word types. After excluding tokens defined by the gold standard as punctuation, URLs, at-mentions, or emoticons,21 22% of DAILY547’s tokens do not appear in this dictionary. Without clusters, they are very difficult to classify (only 79.2% accuracy), but adding clusters generates a 5.7 point improvement—much larger than the effect on in-dictionary tokens (Table 3).

Varying the amount of unlabeled data. A tagger that only uses word clusters achieves an accuracy of 88.6% on the OCT27 development set.22 We created several clusterings with different numbers of unlabeled tweets, keeping the number of clusters constant at 800. As shown in Fig. 3, there was initially a logarithmic relationship between number of tweets and accuracy, but accuracy (and lexical coverage) levels out after 750,000 tweets. We use the largest clustering (56 million tweets and 1,000 clusters) as the default for the released tagger.

6.2 Evaluation on RITTERTW

Ritter et al. (2011) annotated a corpus of 787 tweets23 with a single annotator, using the PTB

21 We retain hashtags since by our guidelines a #-prefixed token is ambiguous between a hashtag and a normal word, e.g. #1 or going #home.

22 The only observation features are the word clusters of a token and its immediate neighbors.

23https://github.com/aritter/twitter_nlp/blob/master/data/annotated/pos.txt

Page 25

28

Word Clusters

We approach part-of-speech tagging for informal, online conversational text using large-scale unsupervised word clustering and new lexical features. Our system achieves state-of-the-art tagging results on both Twitter and IRC data. Additionally, we contribute the first POS annotation guidelines for such text and release a new dataset of English language tweets annotated using these guidelines.

Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters

Word Clusters

Tagger Features
• Hierarchical word clusters via Brown clustering (Brown et al., 1992) on a sample of 56M tweets
• Surrounding words/clusters
• Current and previous tags
• Tag dict. constructed from WSJ, Brown corpora
• Tag dict. entries projected to Metaphone encodings
• Name lists from Freebase, Moby Words, Names Corpus
• Emoticon, hashtag, @mention, URL patterns

Olutobi Owoputi*, Brendan O’Connor*, Chris Dyer*, Kevin Gimpel+, Nathan Schneider*, Noah A. Smith*

*School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
+Toyota Technological Institute at Chicago, Chicago, IL 60637, USA

Highest Weighted Clusters

Speed
Tagger: 800 tweets/s (compared to 20 tweets/s previously)
Tokenizer: 3,500 tweets/s

Software & Data Release
• Improved emoticon detector and tweet tokenizer
• Newly annotated evaluation set, fixes to previous annotations

Examples

[Poster example: a tagged tweet with tokens Now, Hateing, Start, Culd, Yall, So, Croud, Da, Shake, Boutta and tags R V V V O P N D V P (token-to-tag alignment not recoverable)]

Results
Our tagger achieves state-of-the-art results in POS tagging for each dataset:

[Poster example (Figure 1 of the paper): ikr smh he asked fir yo last name so he can add u on fb lololol, tagged ! G O V P D A N P O V V O P ^ !]

Cluster prefix | Tag | Types | Most common word in each cluster with prefix
100110*        | &   | 103   | or n & and
11101*         | O   | 899   | you yall u it mine everything nothing something anyone someone everyone nobody
01*            | V   | 29267 | do did kno know care mean hurts hurt say realize believe worry understand forget agree remember love miss hate think thought knew hope wish guess bet have
1101*          | D   | 378   | the da my your ur our their his
111110*        | A   | 6510  | young sexy hot slow dark low interesting easy important safe perfect special different random short quick bad crazy serious stupid weird lucky sad
1110101100*    | E   | 2798  | x <3 :d :p :) :o :/
11000*         | L   | 428   | i'm im you're we're he's there's its it's
11101010*      | !   | 8160  | lol lmao haha yes yea oh omg aww ah btw wow thanks sorry congrats welcome yay ha hey goodnight hi dear please huh wtf exactly idk bless whatever well ok

[Result charts: accuracy on the RITTERTW corpus; accuracy on the NPSCHATTEST corpus (incl. system messages); dev set accuracy using only clusters as features]

Tagset

Datasets

Tagger, tokenizer, and data all released at:

www.ark.cs.cmu.edu/TweetNLP


Model
Discriminative sequence model (MEMM) with L1/L2 regularization

Page 26

29

Speed

• Tokenizer: 3500 tweets/s

• MEMM instead of CRF is much faster

‣ Greedy: 800 tweets/s (10k w/s), barely any loss in accuracy relative to Viterbi
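A toy sketch of why greedy decoding is so cheap: one classifier call per token, feeding the previous prediction forward, with no search over tag sequences. The scoring function here is a stand-in, not the released MEMM.

```python
# Greedy left-to-right tagging with a local (MEMM-style) scorer; illustrative only.
TAGS = ["N", "V", "O", "D", "!"]

def score(word: str, prev_tag: str, tag: str) -> float:
    # Hypothetical scoring function; a real MEMM uses learned feature weights.
    return float(hash((word, prev_tag, tag)) % 100)

def greedy_tag(tokens):
    tags, prev = [], "<START>"
    for w in tokens:                       # one pass, one argmax per token
        prev = max(TAGS, key=lambda t: score(w, prev, t))
        tags.append(prev)
    return list(zip(tokens, tags))

print(greedy_tag("ikr smh he asked fir yo last name".split()))
```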

Page 27

Outline

30

• Twitter POS tagging

• Twitter dependency parsing

• What’s next?

Page 28

Twitter Syntax: Representation

31

OMG I <3 the Biebs & want to have his babies ! —> LA Times: Teen Pop Star Heartthrob is All the Rage on Social Media… #belieber

Page 29

Twitter Syntax: Representation

32

OMG I <3 the Biebs & want to have his babies ! —> LA Times : Teen Pop Star Heartthrob is All the Rage on Social Media … #belieber

Page 30

Twitter Syntax: Representation

33

OMG I <3 the Biebs & want to have his babies ! —> LA Times : Teen Pop Star Heartthrob is All the Rage on Social Media … #belieber

Page 31

Twitter Syntax: Representation

34

OMG

I <3 the Biebs & want to have his babies

LA Times

Teen Pop Star Heartthrob is All the Rage on Social Media

Page 32

Twitter Syntax: Representation

35

OMG

I <3 the Biebs & want to have his babies

LA_Times

Teen Pop Star Heartthrob is All_the_Rage on Social Media

Page 33

Twitter Syntax: Representation

36

OMG

I <3 the Biebs & want to have his babies

LA_Times

Teen Pop Star Heartthrob is All_the_Rage on Social Media

• Fragmentary Unlabeled Dependency Grammar (“FUDG”; Schneider et al. 2013)

‣ allows utterance segmentation, token selection, MWEs, coordination, underspecification

Page 34

Twitter Syntax: Representation

37

OMG

I <3 the Biebs & want to have his babies

LA_Times

Teen Pop Star Heartthrob is All_the_Rage on Social Media

• FUDG can be rendered in ASCII (“GFL”):

Teen > (Pop > Star) > Heartthrob
Heartthrob > is** < [All the Rage]
is < on < (Social > Media)
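To make the notation concrete, here is a small sketch (my own, not the released GFL tools) that reads simple chains in which `>` and `<` both point from a dependent toward its head; the `**` root marker is simply stripped, and grouping parentheses and MWE brackets like `[All the Rage]` are not handled.

```python
# Extract (dependent, head) arcs from simple GFL chains; simplified reader only.
def chain_arcs(line):
    tokens = line.replace("**", "").split()
    arcs = []
    for left, op, right in zip(tokens[::2], tokens[1::2], tokens[2::2]):
        if op == ">":            # left attaches to right
            arcs.append((left, right))
        elif op == "<":          # right attaches to left
            arcs.append((right, left))
    return arcs

print(chain_arcs("Teen > Star > Heartthrob"))   # [('Teen', 'Star'), ('Star', 'Heartthrob')]
print(chain_arcs("is** < on < Media"))          # [('on', 'is'), ('Media', 'on')]
```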

Page 35

Twitter Syntax: Annotation

38

1 day of annotation, 26 participants

• Custom web-based annotation tool (Mordowanec et al., ACL 2014 demo)

Page 36

Twitter Syntax: Automation

• A supervised discriminative graph-based parser for tweets (Kong et al., EMNLP 2014)

‣ (1) lexical selection: a sequence model

‣ (2) parsing: a 2nd-order TurboParser model

‣ produces FUDG parses (incl. coordination, multiple utterances, MWEs)

• Experiments

‣ train on PTB, test on tweets: 73% UAS

‣ train on 1,473 English tweets (9k tokens): 80%

‣ domain adaptation (train on tweets, with some features derived from PTB-trained parser): 81%
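A skeleton of the two-stage pipeline described above, with placeholder models; the function names and heuristics are hypothetical, not the Kong et al. models or the TurboParser API.

```python
from typing import List, Tuple

def select_tokens(tokens: List[str]) -> List[bool]:
    # Stage 1 (lexical selection): a trained sequence model decides which tokens
    # take part in the syntax; here, a crude heuristic drops URLs and @-mentions.
    return [not (t.startswith("@") or t.startswith("http")) for t in tokens]

def parse_dependencies(tokens: List[str]) -> List[Tuple[int, int]]:
    # Stage 2 (parsing): a graph-based dependency parser over the kept tokens;
    # here, a trivial chain that attaches each token to its left neighbor.
    return [(i, i - 1) for i in range(1, len(tokens))]

def parse_tweet(tokens: List[str]) -> List[Tuple[str, str]]:
    keep = select_tokens(tokens)
    kept = [t for t, k in zip(tokens, keep) if k]
    return [(kept[d], kept[h]) for d, h in parse_dependencies(kept)]

print(parse_tweet("OMG I <3 the Biebs http://bit.ly/x @user".split()))
# [('I', 'OMG'), ('<3', 'I'), ('the', '<3'), ('Biebs', 'the')]
```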

39

Page 37

Twitter POS & Syntax: Summary

• Modified traditional representations to meet the needs of our domain & process

• Rapid annotation by (mostly) CS grad students, informed by linguistics

• Widely downloaded (>3,000), state-of-the-art POS tagger for Twitter; parser will be released in time for EMNLP

• Syntactic representations & annotation tools inspired by Twitter are now being used for Wikipedia, low-resource African languages, and even Shakespeare!

40

Page 38

Twitter POS & Syntax: Links

• http://www.ark.cs.cmu.edu/TweetNLP/

• http://www.ark.cs.cmu.edu/FUDG/

41

Page 39

42

representation

annotation

automation

Page 40

What’s Next?

• More languages/genres?

• Richer/deeper representations?

• Applications?

43

Page 41

44

[Diagram repeated from the opening slide: NLP tasks (POS tagging, NER, syntactic parsing, semantic parsing) arranged along the RICHNESS and ROBUSTNESS axes]

Page 42

45

thx