University of Sheffield, NLP
Practical Sentiment Analysis:Hands-on Material
Diana MaynardUniversity of Sheffield, UK
© The University of Sheffield, 1995-2012. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs Licence
Introduction
• This hands-on tutorial assumes some basic knowledge of GATE
• If you've never used GATE before, you probably want to get familiar with at least the basics: try Module 1 of our GATE training course https://gate.ac.uk/wiki/TrainingCourseJune2012/
• You can also just have a look at the examples without trying them in GATE
• The hands-on exercises marked “Advanced” are for more expert users who are familiar with JAPE: they should have completed Track 1 of our GATE training course or had experience using GATE in real life
• You can just skip these if you're not an advanced user
• If you want more in-depth training with GATE, you can either join our annual GATE training course (the next one is in May/June 2013 in Sheffield) or contact us to arrange personalised training or consultancy, in person or online
• More info about everything GATE-related https://gate.ac.uk
Processing Social Media
Examples and exercises:
• Loading tweets into GATE
• Linguistic pre-processing
– Language detection
– Tokenisation
– Emoticon detection
– POS tagging
– Normalisation
– Spam removal
• Named Entity detection
• Visualisation (ANNIC)
Loading tweets into GATE
Example Tweet metadata in JSON
{ "contributors":null, "text":"Automotive RDFa (a horribly researched SEO article on RDFa/Microformats): http://ow.ly/5JSoS #somanyerrorsitsfunny", "geo":null, "retweeted":false, "in_reply_to_screen_name":null, "truncated":false, "entities":{"urls":[{"expanded_url":null,"indices":[74,92],"url":"http://ow.ly/5JSoS"}], "hashtags":[{"text":"somanyerrorsitsfunny","indices":[93,114]}], "user_mentions":[]}, "in_reply_to_status_id_str":null, "id":94029193863639040, "source":"<a href=\"http://www.hootsuite.com\" rel=\"nofollow\">HootSuite<\/a>", "in_reply_to_user_id_str":null, "favorited":false, "in_reply_to_status_id":null, "retweet_count":0, "created_at":"Thu Jul 21 13:01:21 +0000 2011",
Example Tweet metadata in JSON (2)
"in_reply_to_user_id":null, "id_str":"94029193863639040", "place":{"id":"c799e2d3a79f810e", "bounding_box":{"type":"Polygon",
"coordinates":[[[6.6266397,35.4928765],[18.5203619,35.4928765],[18.5203619,47.0924248],[6.6266397,47.0924248]]]},
"place_type":"country", "name":"Italia", "attributes":{}, "country_code":"IT",
"url":"http:/…/1/geo/id/c799e2d3a79f810e.json", "full_name":"Italia", "country":"Italia"
},
Type of place, e.g. “city”
Country containing the place of origin
More: https://courses.ischool.berkeley.edu/i202/f11/sites/default/files/map-of-a-tweet.pdf
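The metadata fields above can be read directly with Python's standard json module. A minimal sketch, using an abridged version of the example tweet (the indices here are recomputed for the shortened text):

```python
import json

# Abridged tweet JSON, following the field names shown on this slide
raw = """{
  "text": "Automotive RDFa: http://ow.ly/5JSoS #somanyerrorsitsfunny",
  "entities": {
    "urls": [{"expanded_url": null, "indices": [17, 35], "url": "http://ow.ly/5JSoS"}],
    "hashtags": [{"text": "somanyerrorsitsfunny", "indices": [36, 57]}],
    "user_mentions": []
  },
  "created_at": "Thu Jul 21 13:01:21 +0000 2011"
}"""

tweet = json.loads(raw)
# Twitter pre-extracts entities, so no parsing of the text is needed
hashtags = [h["text"] for h in tweet["entities"]["hashtags"]]
urls = [u["url"] for u in tweet["entities"]["urls"]]
```

Note that the entity offsets ("indices") refer to the raw tweet text, which is convenient when mapping them to GATE annotations later.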
Example Tweet metadata in JSON (3)
"user":{"location":"Blacksburg, VA", …, "statuses_count":2404, "lang":"en", "id":20446311, …, "description":"Text from the user profile (max 160 chars)", …, "name":"User Name", …, "created_at":"Mon Feb 09 16:33:16 +0000 2009", "followers_count":1239, "geo_enabled":false, …, "url":"The author's URL (optional)", "utc_offset":-21600, "time_zone":"Central Time (US & Canada)", …, "friends_count":160, …, "screen_name":"twitter-user-name", …, "listed_count":189, …
}, …
Embedded user information; this can get out of sync if the user later changes their profile
How to acquire tweets
• The Twitter Search API is a dedicated API for running searches against the real-time index of recent Tweets
– Currently the index includes roughly the last 6 to 9 days of Tweets
– The rate limit is not published
– Requests to the Search API are anonymous; the rate limit is measured against the requesting client's IP
• The REST API provides access to timelines, tweeting, following, etc.
• The Streaming API streams tweets in real time
• See https://dev.twitter.com/docs/twitter-libraries
• Currently tweet download is done externally to GATE
Importing tweets into GATE
• Currently GATE has no JSON format handler implemented
• Instead, we use a python script to convert the JSON into XML which can then be opened by GATE
– json2xml.py in your hand-outs (./json2xml.py --help)
– ./json2xml.py -s tweet-sample.json -o sample-out-dir
– Writes all tweets as 1 output.xml file in the <sample-out-dir>
• Create a corpus in GATE Developer
• Right click -> “Populate from a single concatenated file”
• Select lang-id-small-test-set.xml
• Change “Root element” to doc_root and File URL to point to the directory where output.xml is
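The core idea of the JSON-to-XML conversion can be sketched as follows. This is an illustrative simplification, not the actual json2xml.py from the hand-outs, which preserves much more of the tweet metadata:

```python
import json
from xml.sax.saxutils import escape

def tweets_to_xml(lines):
    """Wrap one JSON tweet per line into a single XML file under a
    doc_root element, so GATE's "Populate from a single concatenated
    file" can split it into one document per tweet."""
    docs = []
    for line in lines:
        tweet = json.loads(line)
        # escape() protects XML-special characters in the tweet text
        docs.append("<doc>%s</doc>" % escape(tweet.get("text", "")))
    return "<doc_root>%s</doc_root>" % "".join(docs)

xml = tweets_to_xml(['{"text": "Hello <world> :-)"}'])
```

The doc_root element here is why "Root element" must be set to doc_root when populating the corpus.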
Each tweet becomes a GATE document
Language Pre-processing
Language Detection
There are many language detection systems readily available
The main challenges with tweets/Facebook status updates:
the small number of tokens (10 tokens per tweet on average)
the noisy nature of the words (abbreviations, misspellings)
Since tweets are so short, we can assume that each tweet is written in a single language
We have adapted the TextCat language identification plugin
Provided fingerprints for 5 languages: DE, EN, FR, ES, NL
You can extend it to new languages easily (see GATE user guide)
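TextCat-style language identification compares a document's character n-gram profile against per-language fingerprints. A toy sketch of the idea, with tiny illustrative fingerprints (the real plugin ships with much larger fingerprint files):

```python
from collections import Counter

def profile(text, n=3, size=100):
    # Character n-gram frequency profile, ranked by frequency
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(size)]

def distance(doc_prof, lang_prof):
    # TextCat's "out-of-place" measure: sum of rank differences,
    # with a fixed penalty for n-grams missing from the fingerprint
    out = len(lang_prof)
    return sum(abs(i - lang_prof.index(g)) if g in lang_prof else out
               for i, g in enumerate(doc_prof))

# Toy fingerprints built from single sentences, for illustration only
langs = {
    "en": profile("the quick brown fox jumps over the lazy dog and the cat"),
    "de": profile("der schnelle braune fuchs springt über den faulen hund"),
}

def detect(text):
    p = profile(text)
    return min(langs, key=lambda lang: distance(p, langs[lang]))
```

On 10-token tweets these profiles are very sparse, which is exactly why short, noisy text makes language ID harder.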
Hands-On 1: Language ID
• Load twitie-lang-id.xgapp in GATE
• Use the tweet corpus already open (10 docs in 4 languages)
• Run the application
– The Annotation Set Transfer first copies the text annotation from the “Original markups” set as a Tweet annotation in the PreProcess annotation set
– The Tweet Language Identification PR adds a “lang” feature to the Tweet annotation in the PreProcess set
• Inspect the results
• Keep the app open for later
• Close the corpus
Language ID Results: English Example
Various annotations are created by the metadata-based pre-processing JAPE grammar (tweet-metadata-parser.jape in resources)
Sentence is an annotation created to span the entire tweet text
TwitterUser spans the entire user information in the tweet
TweetCreatedAt – the timestamp of this tweet
Tokenisation
Splitting a text into its constituent parts
Plenty of “unusual”, but very important tokens in social media:
– @Apple – mentions of company/brand/person names
– #fail, #SteveJobs – hashtags expressing sentiment, person or company names
– :-(, :-), :-P – emoticons (punctuation and optionally letters)
– URLs
Tokenisation is key for entity recognition and opinion mining
A study of 1.1 million tweets found that 26% of English tweets contain a URL, 16.6% a hashtag, and 54.8% a user name mention [Carter et al., 2013].
Example
– #WiredBizCon #nike vp said when @Apple saw what http://nikeplus.com did, #SteveJobs was like wow I didn't expect this at all.
– Tokenising on white space doesn't work that well:
• Nike and Apple are company names, but if we have tokens such as #nike and @Apple, this will make the entity recognition harder, as it will need to look at sub-token level
– Tokenising on white space and punctuation characters doesn't work well either: URLs get separated (http, nikeplus), as are emoticons and email addresses
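A tweet-aware tokeniser instead treats hashtags, @mentions, URLs and emoticons as tokens in their own right. A regex sketch of the idea (these patterns are illustrative, not GATE's actual tokeniser rules):

```python
import re

# Alternatives are tried in order, so URLs and hashtags are matched
# whole before falling back to plain words and punctuation
TOKEN = re.compile(r"""
    https?://\S+          # URLs kept as one token
  | [@#]\w+               # @mentions and #hashtags
  | [:;=][-o]?[)(DPp]     # common emoticons, e.g. :-) ;P
  | \w+(?:'\w+)?          # words, with optional apostrophe
  | [^\w\s]               # any remaining punctuation mark
""", re.VERBOSE)

def tokenise(text):
    return TOKEN.findall(text)

toks = tokenise("#nike vp said when @Apple saw http://nikeplus.com :-) wow")
```

Compare this with splitting on whitespace and punctuation, which would shred http://nikeplus.com and :-) into useless fragments.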
Special Tokenisation for Tweets (Stanford)
Try this demo with different tokenisations and tweets:
http://sentiment.christopherpotts.net/tokenizing/
Emoticons, @mentions, #tags, URLs, and emails are all tokens in their own right
Most words are lower-cased, unless written in all caps (this can convey sentiment)
Dates are normalised into 1 token, as are phone numbers
Deals with lengthening (YAAAAAAY → YAAAY)
Takes into account HTML tags, such as <strong>
Some issues:
It can be computationally slow, due to the complex cases it handles
Tailored for sentiment
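The lengthening reduction mentioned above (YAAAAAAY → YAAAY) can be sketched with a single regular expression. A minimal illustration, not the Potts tokeniser's actual code:

```python
import re

def reduce_lengthening(token):
    # Collapse runs of 3 or more identical characters to exactly 3,
    # so "YAAAAAAY" and "YAAAAY" map to the same form while still
    # signalling that lengthening (often sentiment-bearing) occurred
    return re.sub(r"(.)\1{2,}", r"\1\1\1", token)
```

Keeping three repeats, rather than normalising fully to "YAY", preserves the sentiment cue that the writer lengthened the word.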
Hands-On: Hashtag and @mention tokenisation
• Load the GATE Unicode Tokeniser, with its default settings
• Load a Document Reset PR with default settings
• Create a new corpus controller; add the Document Reset, then the Tokeniser
• Close prev. corpus, create new one and populate from single concatenated file, using test-10-tweets.xml
• Inspect the results, especially around hashtags and @mentions
• Create a JAPE transducer, loading resources/hashtag.jape
• Add it to the application and re-run. Hashtag annotations appear
• Now add a new rule to detect @mentions as UserID annotations
• Right-click on the JAPE transducer, re-load, and re-run the app
The GATE Twitter Tokeniser
Treats RTs and URLs as one token each
#nike is two tokens (# and nike), plus a separate HashTag annotation covering both; likewise, @mentions get a covering UserID annotation
Capitalisation is preserved, but an orthography feature is added: all caps, lowercase, mixed case
Date and phone number normalisation, lowercasing, and emoticons are optionally done later in separate modules
Consequently, tokenisation is faster and more generic
Also, more tailored to how ANNIE NER expects the input
GATE Twitter Tokeniser: An Example
Hands-On: Running the GATE Tweet Tokeniser
• Right click on ProcessingResources, load ANNIE English Tokeniser
– Leave TokeniserRulesURL unchanged
– For TransducerGrammarURL navigate to your hand-outs directory, then choose resources/tokeniser/twitter.jape
• Add the Tweet Tokeniser at the end of the TwitIE Lang ID app
• Set the AnnotationSetName parameter to PreProcess
• Re-run the app and inspect the results (Token, Hashtag, UserID)
• Note that the Token annotations under UserIDs now have POS category NNP, since they are proper names
• Take a quick look at the actual rules for Hashtag and UserID recognition in twitter.jape. See how they differ from the simple ones we wrote earlier.
Emoticon Detection
• There is a gazetteer list of some commonly used emoticons in your hand-outs, resources/emoticons-list.
• Create an ANNIE Gazetteer PR, name it Emoticon gazetteer, and point it to the emoticons directory
• Add it at the end of your pipeline, set the AnnotationSetName parameter to PreProcess, run the app
• Inspect the Lookup annotations in GATE Developer
Example candidates: Lookup.majorType == emoticon
• The easiest approach is a JAPE grammar that turns all emoticon Lookups into annotations
• BUT not all Lookups are actual emoticons, as you can see
Using the Segment Processing PR to restrict scope
• The PR is part of the Alignment Plugin
• Use it to process only the part of the tweet covered by the Tweet annotation (or any other annotation, e.g. user)
• The PR takes, as one of its parameters, another processing resource or an instance of an application (e.g. ANNIE) that you want to run on the document
• Here we will use it to restrict the processing scope of a JAPE grammar
• For other uses, see Module 10 Advanced IE of the GATE training course
Running emoticon.jape on the tweet text
• Application contains a Segment Processing PR
• Segment Processing PR calls Emoticon JAPE
Segment Processing Parameters
• Segment Processing PR calls the Emoticon JAPE grammar
• The input annotation set needs to be set to PreProcess
• It will run only on the text covered by the span of the “Tweet” annotation
• emoticon.jape simply converts all emoticon Lookups into Emoticon annotations
• In practice, you might also wish to merge all underlying Token annotations into one Token annotation of kind = emoticon or punctuation
Annotation Result
• Yellow shading shows the tweet text to be annotated
• The Emoticon annotation within it appears in purple
• Emoticon lookups in the rest of the document are not annotated
Hands-on: Emoticon detection (1)
• Create a JAPE transducer, pointing it at resources/emoticon.jape
• Add it to the application, set its inputAS and outputAS params to PreProcess, run the app, then remove the transducer. This just ensures we set it to the right annotation set, instead of the default one
• Open the CREOLE Plugin Manager and load the Alignment plugin
• Create a Segment Processing PR as follows:
• Select the Emoticon JAPE Transducer as “analyser” parameter
• Specify “Tweet” as the SegmentAnnotationType
• inputASName = “PreProcess”
• Add the Emoticon Segment Processing PR to the end of application
• Run it and inspect the results
Hands-on: Emoticon detection (2) (Advanced users)
• Have a look at the grammar emoticon.jape and:
• Modify it to use the within JAPE operator to check if the Lookup is within a Tweet annotation (Module 3). Remove the Segmenter PR
• Delete the underlying Lookup annotation
POS Tagging
• The accuracy of the Stanford POS tagger drops from about 97% on news to 80% on tweets (Ritter, 2011)
• Need for an adapted POS tagger, specifically for tweets
• We re-trained the Stanford POS tagger using some hand-annotated tweets, IRC and news texts
• The resulting new Tweet POS Tagger is in applications/plugins/Tagger_Stanford
• Next we compare the differences between the ANNIE POS Tagger and the Tweet POS Tagger on the example tweets
Create the ANNIE POS Tags
• Create an Annotation Set Transfer, add to the application
• Set its run-time parameters as shown:
• Create an ANNIE POS Tagger with default init parameters
• Add to the application and set run-time params:
Hands-On: ANNIE POS Tags (2)
• Run the application
• Inspect the Token annotations in the ANNIE set
Register the Tagger_Stanford plugin
• Open CREOLE Plugin Manager
• Select the directory hand-outs/applications/plugins/Tagger_Stanford
• Check “Load Now” and press Apply All
Configure the Stanford Tagger
• Create another Annotation Set Transfer, add to the application
• Set its run-time parameters as shown:
• Create an instance of Stanford Tagger from tweet model:
• Add to the application at the end
• Run the application
Hands On: App Sanity Check
• By now your tweet processing application should look like this
TwitIE POS Tagger Results: Example
• If all has been setup properly, you will get results in 2 sets:
– ANNIE will have the POS tags from the ANNIE POS Tagger
– The default set will have those from the TwitIE Tagger
Compare the Differences: Annotation Diff
• Click on the Annotation Diff button
• Select a document from the test corpus (same Key and Resp)
• Key set: [Default set]; Resp. set: ANNIE
• Type: Token; Features: some, then select: category
Compare the Differences (2)
• Click on the Compare button
• Inspect the results; repeat for 1-2 more documents
• HINT: Clicking on the Start column will sort tokens by offset
• We are still improving the tweet POS model, but major improvements have already been made
• Accuracy is already better than that reported by (Ritter, 2011)
App cleanup for next hands on
• Modify your application pipeline:
– Remove the Annotation Set Transfer that was copying to the ANNIE set
– Remove the ANNIE POS tagger
– Load an ANNIE Gazetteer with default init parameters
– Add it to the application *before* the Tweet POS AST and the TwitIE POS tagger
– Set its AnnotationSetName run-time param to PreProcess
– Re-run the application, check you have Lookups in PreProcess
ANNIE NER on Tweets
• To run the ANNIE Transducer just on the tweet text:
– Instantiate an ANNIE NE Transducer PR with defaults
– Add it to the end of your application
– Run it and inspect the default annotation set for NEs
Why the mistake? OrgJobTitle rule
// Grammar in plugins/ANNIE/resources/NE/org_context.jape
Rule: OrgJobTitle
Priority: 30
(
  {Unknown.kind == PN} // It only considers one preceding word as a candidate
):org
(
  {Lookup.majorType == jobtitle}
)
-->
{
  gate.AnnotationSet org = (gate.AnnotationSet) bindings.get("org");
  gate.FeatureMap features = Factory.newFeatureMap();
  features.put("rule", "OrgJobTitle");
  outputAS.add(org.firstNode(), org.lastNode(), "Organization", features);
  outputAS.removeAll(org);
}
Tweet Capitalisation: an NER nightmare!
…And hashtag semantics is yet another…
Case-Insensitive Matching
This would seem the ideal solution, especially for gazetteer lookup, when people don't use case information as expected
However, setting all PRs to be case-insensitive can have undesired consequences
– POS tagging becomes unreliable (e.g. “May” vs “may”)
– Back-off strategies may fail, e.g. unknown words beginning with a capital letter are normally assumed to be proper nouns
– BUT this doesn’t work on tweets anyway!
– Gazetteer entries quickly become ambiguous (e.g. many place names and first names are ambiguous with common words)
Solutions include selective use of case insensitivity, removal of ambiguous terms from lists, additional verification (e.g. use of the text of any contained URLs)
More flexible matching techniques
In GATE, as well as the standard gazetteers, we have options for modified versions which allow for more flexible matching
BWP Gazetteer: uses Levenshtein edit distance for approximate string matching
Extended Gazetteer: has a number of parameters for matching prefixes, suffixes, initial capitalisation and so on
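Levenshtein distance is the number of single-character insertions, deletions and substitutions needed to turn one string into another. A sketch of how a gazetteer could use it for approximate matching (illustrative only, not the BWP Gazetteer's implementation):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, kept to two rows
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def gazetteer_match(token, entries, max_dist=1):
    # Accept a gazetteer entry if it is within max_dist edits of the
    # token, so misspellings like "Londn" still match "London"
    return [e for e in entries
            if levenshtein(token.lower(), e.lower()) <= max_dist]
```

Small max_dist values are essential: with a large edit budget, short gazetteer entries become ambiguous with ordinary words.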
Try: Run ANNIE on User Profile Text
• User descriptions are another piece of useful text to mine
• Appear as UserDescription annotations in PreProcess
• Create another Annotation Set Transfer from PreProcess to the default set, using the UserDescription annotation from PreProcess as the textTagName
– HINT: See the parameters of the Tweet POS AST
• Add the new AST PR after the Tweet POS AST, but before the TwitIE POS Tagger. Re-run the app
ANNIE Results in User Descriptions
…TwitIE NE rules are being improved, watch this space…
NER in Tweets
Performance of the Stanford NER drops to 48% [Liu’11]
Pre-processing used:
Stop words, user names, and links are removed
Specially adapted/trained POS tagger [Ritter’11]
NP Chunker adapted to tweets [Ritter’11]
Capitalisation information [Ritter’11]
Syntactic normalisation [Doerhmann, 2011]
Gazetteers derived from Freebase [Ritter’11]
NER for Tweets (2)
Performance reported on 4 entity types (PER, LOC, ORG, PRODUCT): 80.2% f-score (81.6% P; 78.8% R) [Liu et al 2011]
[Doerhmann, 2011] improved on Liu's results by normalising the tweets first
Ritter's scores are lower but against more Freebase entity types: PERSON, GEO-LOCATION, COMPANY, PRODUCT, FACILITY, TV-SHOW, MOVIE, SPORTSTEAM, BAND, and OTHER
Stemming
The Snowball stemmer is already integrated in GATE
12 European languages: Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish and Swedish
http://snowball.tartarus.org
De-duplication and Spam Removal
Heuristics from [Choudhury & Breslin, #MSM2011]:
Remove as duplicates/spam:
Messages containing only hashtags (and an optional URL)
Messages containing only @mentions and a URL. Note this is not 100% reliable: somebody might legitimately reply to a tweet with just a URL, if the original tweet requested the information
Similar content posted under different user names with the same timestamp is treated as a case of multiple accounts
Identical content from the same account is treated as duplicate tweets
The same content posted by the same account at multiple times is treated as spam
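The first two heuristics and the duplicate check can be sketched outside GATE as well. A minimal illustration (the regexes and the (user, text) representation are assumptions for this sketch):

```python
import re

URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")

def looks_like_spam(text):
    # Flag tweets that consist only of hashtags/@mentions plus
    # optional URLs, i.e. nothing is left once those are stripped
    rest = MENTION.sub("", HASHTAG.sub("", URL.sub("", text))).strip()
    has_tags = bool(HASHTAG.search(text) or MENTION.search(text))
    return has_tags and rest == ""

def dedupe(tweets):
    # Drop repeated (account, content) pairs as duplicate tweets
    seen, kept = set(), []
    for user, text in tweets:
        if (user, text) not in seen:
            seen.add((user, text))
            kept.append((user, text))
    return kept
```

As the slide warns, these are heuristics: a legitimate URL-only reply would be flagged, so precision/recall trade-offs need checking on real data.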
Advanced Hands On: Spam Removing JAPE
• Implement a JAPE grammar that creates a Spam annotation for tweets which contain only 1 or more hashtags and optionally one or more URLs (a shell JAPE file is spam.jape in hand-outs)
– #hashtag1 #hashtag2 http://whatever.com
– http://whatever.com #hashtag1
– All URLs have a URL annotation
– All hashtags have a Hashtag annotation covering them
• To help you do that, create a JAPE transducer with resources/spam-pre-process.jape and add it to the app. Set inputAS and outputAS runtime params to PreProcess
• This creates BeforeTextToken and AfterTextToken to delimit the boundaries of the tweet text. Use these to help you build your JAPE pattern
Example Tweet with the Required Annotations Highlighted
Tweet Normalisation
“RT @Bthompson WRITEZ: @libbyabrego honored?! Everybody knows the libster is nice with it...lol...(thankkkks a bunch;))”
OMG! I’m so guilty!!! Sprained biibii’s leg! ARGHHHHHH!!!!!!
Similar to SMS normalisation
For some components to work well (POS tagger, parser), it is necessary to produce a normalised version of each token
BUT uppercasing, and letter and exclamation mark repetition often convey strong sentiment
Therefore some choose not to normalise, while others keep both versions of the tokens
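The "keep both versions" strategy can be sketched as a lookup against an abbreviation list. The list below is a toy assumption for illustration; the GATE normaliser uses much larger resources and spelling correction:

```python
# Toy abbreviation list; real normalisers use far larger resources
ABBREV = {"2moro": "tomorrow", "u": "you", "lol": "laughing out loud",
          "thankkkks": "thanks", "writez": "writes"}

def normalise(tokens):
    # Keep the original token alongside its normalised form, so that
    # sentiment cues (capitalisation, lengthening) are not lost
    return [(t, ABBREV.get(t.lower(), t.lower())) for t in tokens]

pairs = normalise(["u", "WRITEZ", "2moro"])
```

A POS tagger or parser can then consume the normalised column while a sentiment component still sees the original surface form.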
Syntactic Normalisation [Kaufmann, 2010]
Preparation: removing emoticons, tokenisation
Orthographic mapping: 2moro, u
Syntactic disambiguation
Determine when @mentions and #tags have syntactic value and should be kept in the sentence, vs replies, retweets and topic tagging
Machine Translation: used MOSES
Trained on SMS and ANC corpora
GATE Tweet Normaliser
• Open the Plugin manager, choose to register a new CREOLE directory and point it to applications/plugins/Normaliser_Twitter
• Create an instance of the Twitter Normaliser PR
• Remove all PRs from your pipeline that run after the Tokeniser
• Add the Twitter Normaliser PR at the end instead
• Set its inputAS and outputAS params to PreProcess
• Create a new corpus
• Populate it from a single file: corpora/normaliser_test_corpus.xml
– Remember to specify doc_root
• Run the pipeline and inspect the results
A normalised example
Normaliser currently based on spelling correction and some lists of common abbreviations
Outstanding issues:
Should new Token annotations be inserted, to make POS tagging etc. easier? For example, “trying to” is currently one annotation
Some abbreviations which span token boundaries (e.g. gr8, do n’t) are not yet handled
Capitalisation and punctuation normalisation
ANNIC Demo and HandsOn
• Formulating queries
• Finding matches in the corpus
• Analysing the contexts
• Refining the queries
• Demo: http://gate.ac.uk/demos/annic2008/Annic-only.htm
Hands-On: Using ANNIC
● Load the datastore politwits-500 in GATE and double click on it to open the datastore viewer
● Select “Lucene datastore searcher” from the datastore viewer (bottom pane)
● Try out some patterns to see what results you get, e.g. {Sentiment}
● Hint: click on the name of an annotation in the bottom right corner, to add it to the search box, or start typing in the search box to get some help with possible annotations.
Pattern examples
● {Party}
● {Affect}
● {Lookup.majorType == negation} ({Token})*4 {Lookup.majorType == "vote"}{Lookup.majorType == "party"}
● {Token.string == "I"} ({Token})*4 {Lookup.majorType == "vote"}{Lookup.majorType == "party"}
● {Person} ({Token})*4 {Lookup.majorType == "vote"}{Lookup.majorType == "party"}
● {Affect} ({Token})*5 {Lookup.majorType == "candidate"}
● {Vote} ({Token})*5 {Lookup.majorType == "candidate"}
References:
• T. Baldwin and M. Lui. Language Identification: The Long and the Short of the Matter. Proceedings of NAACL HLT 2010. http://www.aclweb.org/anthology/N10-1027
• M. Kaufmann. Syntactic Normalization of Twitter Messages. 2010. http://www.cs.uccs.edu/~kalita/work/reu/REUFinalPapers2010/Kaufmann.pdf
• S. Choudhury and J. Breslin. Extracting Semantic Entities and Events from Sports Tweets. Proceedings of #MSM2011: Making Sense of Microposts. 2011.
• X. Liu, S. Zhang, F. Wei and M. Zhou. Recognizing Named Entities in Tweets. Proceedings of ACL 2011.
• A. Ritter, Mausam and O. Etzioni. Named Entity Recognition in Tweets: An Experimental Study. Proceedings of EMNLP 2011.
• Doerhmann. Named Entity Extraction from the Colloquial Setting of Twitter. 2011. http://www.cs.uccs.edu/~kalita/work/reu/REU2011/FinalPapers/Doehermann.pdf
• S. Carter, W. Weerkamp and E. Tsagkias. Microblog Language Identification: Overcoming the Limitations of Short, Unedited and Idiomatic Text. Language Resources and Evaluation Journal. 2013 (forthcoming).