University of Sheffield, NLP
Practical Sentiment Analysis:Hands-on Material
Diana MaynardUniversity of Sheffield, UK
© The University of Sheffield, 1995-2012. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs Licence
Introduction
• This hands-on tutorial assumes some basic knowledge of GATE
• If you've never used GATE before, you probably want to get familiar with at least the basics: try Module 1 of our GATE training course https://gate.ac.uk/wiki/TrainingCourseJune2012/
• You can also just have a look at the examples without trying them in GATE
• The hands-on exercises marked “Advanced” are for more expert users who are familiar with JAPE: they should have completed Track 1 of our GATE training course or had experience using GATE in real life
• You can just skip these if you're not an advanced user
• If you want more in-depth training with GATE, you can either join our annual GATE training course (the next one is in May/June 2013 in Sheffield) or contact us to arrange personalised training or consultancy, in person or online
• More info about everything GATE-related https://gate.ac.uk
Processing Social Media
Examples and exercises:
• Loading tweets into GATE
• Linguistic pre-processing
– Language detection
– Tokenisation
– Emoticon detection
– POS tagging
– Normalisation
– Spam removal
• Named Entity detection
• Visualisation (ANNIC)
Loading tweets into GATE
Example Tweet metadata in JSON
{ "contributors":null, "text":"Automotive RDFa (a horribly researched SEO article on RDFa/Microformats): http://ow.ly/5JSoS #somanyerrorsitsfunny", "geo":null, "retweeted":false, "in_reply_to_screen_name":null, "truncated":false, "entities":{"urls":[{"expanded_url":null,"indices":[74,92],"url":"http://ow.ly/5JSoS"}], "hashtags":[{"text":"somanyerrorsitsfunny","indices":[93,114]}], "user_mentions":[]}, "in_reply_to_status_id_str":null, "id":94029193863639040, "source":"<a href=\"http://www.hootsuite.com\" rel=\"nofollow\">HootSuite<\/a>", "in_reply_to_user_id_str":null, "favorited":false, "in_reply_to_status_id":null, "retweet_count":0, "created_at":"Thu Jul 21 13:01:21 +0000 2011",
Example Tweet metadata in JSON (2)
"in_reply_to_user_id":null, "id_str":"94029193863639040", "place":{"id":"c799e2d3a79f810e", "bounding_box":{"type":"Polygon",
"coordinates":[[[6.6266397,35.4928765],[18.5203619,35.4928765],[18.5203619,47.0924248],[6.6266397,47.0924248]]]},
"place_type":"country", "name":"Italia", "attributes":{}, "country_code":"IT",
"url":"http:/…/1/geo/id/c799e2d3a79f810e.json", "full_name":"Italia", "country":"Italia"
},
Type of place, e.g. “city”
Country containing the place of origin
More: https://courses.ischool.berkeley.edu/i202/f11/sites/default/files/map-of-a-tweet.pdf
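The metadata fields above can be read directly with Python's standard json module. A minimal sketch, using an abridged version of the example tweet (the indices here are recomputed for the shortened text):

```python
import json

# Abridged tweet JSON, following the field names shown on this slide
raw = """{
  "text": "Automotive RDFa: http://ow.ly/5JSoS #somanyerrorsitsfunny",
  "entities": {
    "urls": [{"expanded_url": null, "indices": [17, 35], "url": "http://ow.ly/5JSoS"}],
    "hashtags": [{"text": "somanyerrorsitsfunny", "indices": [36, 57]}],
    "user_mentions": []
  },
  "created_at": "Thu Jul 21 13:01:21 +0000 2011"
}"""

tweet = json.loads(raw)
# Twitter pre-extracts entities, so no parsing of the text is needed
hashtags = [h["text"] for h in tweet["entities"]["hashtags"]]
urls = [u["url"] for u in tweet["entities"]["urls"]]
```

Note that the entity offsets ("indices") refer to the raw tweet text, which is convenient when mapping them to GATE annotations later.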
Example Tweet metadata in JSON (3)
"user":{"location":"Blacksburg, VA", …, "statuses_count":2404, "lang":"en", "id":20446311, …, "description":"Text from the user profile (max 160 chars)", …, "name":"User Name", …, "created_at":"Mon Feb 09 16:33:16 +0000 2009", "followers_count":1239, "geo_enabled":false, …, "url":"The author's URL (optional)", "utc_offset":-21600, "time_zone":"Central Time (US & Canada)", …, "friends_count":160, …, "screen_name":"twitter-user-name", …, "listed_count":189, …
}, …
Embedded user information; this can get out of sync if the user later changes their profile
How to acquire tweets
• The Twitter Search API is a dedicated API for running searches against the real-time index of recent Tweets
– Currently the index includes roughly the last 6 to 9 days of Tweets
– The rate limit is not published
– Requests to the Search API are anonymous; the rate limit is measured against the requesting client's IP
• The REST API provides access to timelines, tweeting, following, etc.
• The Streaming API streams tweets in real time
• See https://dev.twitter.com/docs/twitter-libraries
• Currently tweet download is done externally to GATE
Importing tweets into GATE
• Currently GATE has no JSON format handler implemented
• Instead, we use a python script to convert the JSON into XML which can then be opened by GATE
– json2xml.py in your hand-outs (./json2xml.py --help)
– ./json2xml.py -s tweet-sample.json -o sample-out-dir
– Writes all tweets as 1 output.xml file in the <sample-out-dir>
• Create a corpus in GATE Developer
• Right click -> “Populate from a single concatenated file”
• Select lang-id-small-test-set.xml
• Change “Root element” to doc_root and File URL to point to the directory where output.xml is
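The core idea of the JSON-to-XML conversion can be sketched as follows. This is an illustrative simplification, not the actual json2xml.py from the hand-outs, which preserves much more of the tweet metadata:

```python
import json
from xml.sax.saxutils import escape

def tweets_to_xml(lines):
    """Wrap one JSON tweet per line into a single XML file under a
    doc_root element, so GATE's "Populate from a single concatenated
    file" can split it into one document per tweet."""
    docs = []
    for line in lines:
        tweet = json.loads(line)
        # escape() protects XML-special characters in the tweet text
        docs.append("<doc>%s</doc>" % escape(tweet.get("text", "")))
    return "<doc_root>%s</doc_root>" % "".join(docs)

xml = tweets_to_xml(['{"text": "Hello <world> :-)"}'])
```

The doc_root element here is why "Root element" must be set to doc_root when populating the corpus.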
Each tweet becomes a GATE document
Language Pre-processing
Language Detection
There are many language detection systems readily available
The main challenges with tweets/Facebook status updates:
the small number of tokens (10 tokens per tweet on average)
the noisy nature of the words (abbreviations, misspellings)
Since tweets are so short, we can assume that each tweet is written in a single language
We have adapted the TextCat language identification plugin
Provided fingerprints for 5 languages: DE, EN, FR, ES, NL
You can extend it to new languages easily (see GATE user guide)
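TextCat-style language identification compares a document's character n-gram profile against per-language fingerprints. A toy sketch of the idea, with tiny illustrative fingerprints (the real plugin ships with much larger fingerprint files):

```python
from collections import Counter

def profile(text, n=3, size=100):
    # Character n-gram frequency profile, ranked by frequency
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(size)]

def distance(doc_prof, lang_prof):
    # TextCat's "out-of-place" measure: sum of rank differences,
    # with a fixed penalty for n-grams missing from the fingerprint
    out = len(lang_prof)
    return sum(abs(i - lang_prof.index(g)) if g in lang_prof else out
               for i, g in enumerate(doc_prof))

# Toy fingerprints built from single sentences, for illustration only
langs = {
    "en": profile("the quick brown fox jumps over the lazy dog and the cat"),
    "de": profile("der schnelle braune fuchs springt über den faulen hund"),
}

def detect(text):
    p = profile(text)
    return min(langs, key=lambda lang: distance(p, langs[lang]))
```

On 10-token tweets these profiles are very sparse, which is exactly why short, noisy text makes language ID harder.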
Hands-On 1: Language ID
• Load twitie-lang-id.xgapp in GATE
• Use the tweet corpus already open (10 docs in 4 languages)
• Run the application
– The Annotation Set Transfer first copies the text annotation from the “Original markups” set as a Tweet annotation in the PreProcess annotation set
– The Tweet Language Identification PR adds a “lang” feature to the Tweet annotation in the PreProcess set
• Inspect the results
• Keep the app open for later
• Close the corpus
Language ID Results: English Example
Various annotations are created by the metadata-based pre-processing JAPE grammar (tweet-metadata-parser.jape in resources)
Sentence is an annotation created to span the entire tweet text
TwitterUser spans the entire user information in the tweet
TweetCreatedAt – the timestamp of this tweet
Tokenisation
Splitting a text into its constituent parts
Plenty of “unusual”, but very important tokens in social media:
– @Apple – mentions of company/brand/person names
– #fail, #SteveJobs – hashtags expressing sentiment, person or company names
– :-(, :-), :-P – emoticons (punctuation and optionally letters)
– URLs
Tokenisation is key for entity recognition and opinion mining
A study of 1.1 million tweets found that 26% of English tweets contain a URL, 16.6% a hashtag, and 54.8% a user name mention [Carter et al., 2013].
Example
– #WiredBizCon #nike vp said when @Apple saw what http://nikeplus.com did, #SteveJobs was like wow I didn't expect this at all.
– Tokenising on white space doesn't work that well:
• Nike and Apple are company names, but if we have tokens such as #nike and @Apple, this will make the entity recognition harder, as it will need to look at sub-token level
– Tokenising on white space and punctuation characters doesn't work well either: URLs get separated (http, nikeplus), as are emoticons and email addresses
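A tweet-aware tokeniser instead treats hashtags, @mentions, URLs and emoticons as tokens in their own right. A regex sketch of the idea (these patterns are illustrative, not GATE's actual tokeniser rules):

```python
import re

# Alternatives are tried in order, so URLs and hashtags are matched
# whole before falling back to plain words and punctuation
TOKEN = re.compile(r"""
    https?://\S+          # URLs kept as one token
  | [@#]\w+               # @mentions and #hashtags
  | [:;=][-o]?[)(DPp]     # common emoticons, e.g. :-) ;P
  | \w+(?:'\w+)?          # words, with optional apostrophe
  | [^\w\s]               # any remaining punctuation mark
""", re.VERBOSE)

def tokenise(text):
    return TOKEN.findall(text)

toks = tokenise("#nike vp said when @Apple saw http://nikeplus.com :-) wow")
```

Compare this with splitting on whitespace and punctuation, which would shred http://nikeplus.com and :-) into useless fragments.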
Special Tokenisation for Tweets (Stanford)
Try this demo with different tokenisations and tweets:
http://sentiment.christopherpotts.net/tokenizing/
Emoticons, @mentions, #tags, URLs, and emails are all tokens in their own right
Most words are lower-cased, unless written in all caps (this can convey sentiment)
Dates are normalised into 1 token, as are phone numbers
Deals with lengthening (YAAAAAAY → YAAAY)
Takes into account HTML tags, such as <strong>
Some issues:
It can be computationally slow, due to the complex cases it handles
Tailored for sentiment
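The lengthening reduction mentioned above (YAAAAAAY → YAAAY) can be sketched with a single regular expression. A minimal illustration, not the Potts tokeniser's actual code:

```python
import re

def reduce_lengthening(token):
    # Collapse runs of 3 or more identical characters to exactly 3,
    # so "YAAAAAAY" and "YAAAAY" map to the same form while still
    # signalling that lengthening (often sentiment-bearing) occurred
    return re.sub(r"(.)\1{2,}", r"\1\1\1", token)
```

Keeping three repeats, rather than normalising fully to "YAY", preserves the sentiment cue that the writer lengthened the word.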
Hands-On: Hashtag and @mention tokenisation
• Load the GATE Unicode Tokeniser, with its default settings
• Load a Document Reset PR with default settings
• Create a new corpus controller; add the Document Reset, then the Tokeniser
• Close prev. corpus, create new one and populate from single concatenated file, using test-10-tweets.xml
• Inspect the results, especially around hashtags and @mentions
• Create a JAPE transducer, loading resources/hashtag.jape
• Add it to the application and re-run. Hashtag annotations appear
• Now add a new rule to detect @mentions as UserID annotations
• Right-click on the JAPE transducer, re-load, and re-run the app
The GATE Twitter Tokeniser
Treats RTs and URLs as one token each
#nike is two tokens (# and nike), plus a separate HashTag annotation covering both; likewise, @mentions get a covering UserID annotation
Capitalisation is preserved, but an orthography feature is added: all caps, lowercase, mixed case
Date and phone number normalisation, lowercasing, and emoticons are optionally done later in separate modules
Consequently, tokenisation is faster and more generic
Also, more tailored to how ANNIE NER expects the input
GATE Twitter Tokeniser: An Example
Hands-On: Running the GATE Tweet Tokeniser
• Right click on ProcessingResources, load ANNIE English Tokeniser
– Leave TokeniserRulesURL unchanged
– For TransducerGrammarURL navigate to your hand-outs directory, then choose resources/tokeniser/twitter.jape
• Add the Tweet Tokeniser at the end of the TwitIE Lang ID app
• Set the AnnotationSetName parameter to PreProcess
• Re-run the app and inspect the results (Token, Hashtag, UserID)
• Note that the Token annotations under UserIDs now have POS category NNP, since they are proper names
• Take a quick look at the actual rules for Hashtag and UserID recognition in twitter.jape. See how they differ from the simple ones we wrote earlier.
Emoticon Detection
• There is a gazetteer list of some commonly used emoticons in your hand-outs, resources/emoticons-list.
• Create an ANNIE Gazetteer PR, name it Emoticon gazetteer, and point it to the emoticons directory
• Add it at the end of your pipeline, set the AnnotationSetName parameter to PreProcess, run the app
• Inspect the Lookup annotations in GATE Developer
Example candidates: Lookup.majorType == emoticon
• The easiest approach is a JAPE grammar that turns all emoticon Lookups into annotations
• BUT not all Lookups are actual emoticons, as you can see
Using the Segment Processing PR to restrict scope
• The PR is part of the Alignment Plugin
• Use it to process only the part of the tweet covered by the Tweet annotation (or any other annotation, e.g. user)
• The PR takes, as one of its parameters, another processing resource or an instance of an application (e.g. ANNIE) that you want to run on the document
• Here we will use it to restrict the processing scope of a JAPE grammar
• For other uses, see Module 10 Advanced IE of the GATE training course
Running emoticon.jape on the tweet text
• Application contains a Segment Processing PR
• Segment Processing PR calls Emoticon JAPE
Segment Processing Parameters
• Segment Processing PR calls the Emoticon JAPE grammar
• The input annotation set needs to be set to PreProcess
• It will run only on the text covered by the span of the “Tweet” annotation
• emoticon.jape simply converts all emoticon Lookups into Emoticon annotations
• In practice, you might also wish to merge all underlying Token annotations into one Token annotation of kind = emoticon or punctuation
Annotation Result
• Yellow shading shows the tweet text to be annotated
• The Emoticon annotation within it appears in purple
• Emoticon lookups in the rest of the document are not annotated
Hands-on: Emoticon detection (1)
• Create a JAPE transducer, pointing it at resources/emoticon.jape
• Add it to the application, set its inputAS and outputAS params to PreProcess, run the app, then remove the transducer. This just ensures we set it to the right annotation set, instead of the default one
• Open the CREOLE Plugin Manager and load the Alignment plugin
• Create a Segment Processing PR as follows:
• Select the Emoticon JAPE Transducer as “analyser” parameter
• Specify “Tweet” as the SegmentAnnotationType
• inputASName = “PreProcess”
• Add the Emoticon Segment Processing PR to the end of application
• Run it and inspect the results
Hands-on: Emoticon detection (2) (Advanced users)
• Have a look at the grammar emoticon.jape and:
• Modify it to use the within JAPE operator to check if the Lookup is within a Tweet annotation (Module 3). Remove the Segmenter PR
• Delete the underlying Lookup annotation
POS Tagging
• The accuracy of the Stanford POS tagger drops from about 97% on news to 80% on tweets (Ritter, 2011)
• Need for an adapted POS tagger, specifically for tweets
• We re-trained the Stanford POS tagger using some hand-annotated tweets, IRC and news texts
• The resulting new Tweet POS Tagger is in applications/plugins/Tagger_Stanford
• Next we compare the differences between the ANNIE POS Tagger and the Tweet POS Tagger on the example tweets
Create the ANNIE POS Tags
• Create an Annotation Set Transfer, add to the application
• Set its run-time parameters as shown:
• Create an ANNIE POS Tagger with default init parameters
• Add to the application and set run-time params:
Hands-On: ANNIE POS Tags (2)
• Run the application
• Inspect the Token annotations in the ANNIE set
Register the Tagger_Stanford plugin
• Open CREOLE Plugin Manager
• Select the directory hand-outs/applications/plugins/Tagger_Stanford
• Check “Load Now” and press Apply All
Configure the Stanford Tagger
• Create another Annotation Set Transfer, add to the application
• Set its run-time parameters as shown:
• Create an instance of Stanford Tagger from tweet model:
• Add to the application at the end
• Run the application
Hands On: App Sanity Check
• By now your tweet processing application should look like this
TwitIE POS Tagger Results: Example
• If all has been setup properly, you will get results in 2 sets:
– ANNIE will have the POS tags from the ANNIE POS Tagger
– The default set will have those from the TwitIE Tagger
Compare the Differences: Annotation Diff
• Click on the Annotation Diff button
• Select a document from the test corpus (same Key and Resp)
• Key set: [Default set]; Resp. set: ANNIE
• Type: Token; Features: some, then select: category
Compare the Differences (2)
• Click on the Compare button
• Inspect the results; repeat for 1-2 more documents
• HINT: Clicking on the Start column will sort tokens by offset
• We are still improving the tweet POS model, but major improvements have already been made
• Accuracy is already better than that reported by (Ritter, 2011)
App cleanup for next hands on
• Modify your application pipeline:
– Remove the Annotation Set Transfer that was copying to the ANNIE set
– Remove the ANNIE POS tagger
– Load an ANNIE Gazetteer with default init parameters
– Add it to the application *before* the Tweet POS AST and the TwitIE POS tagger
– Set its AnnotationSetName run-time param to PreProcess
– Re-run the application, check you have Lookups in PreProcess
ANNIE NER on Tweets
• To run the ANNIE Transducer just on the tweet text:
– Instantiate an ANNIE NE Transducer PR with defaults
– Add it to the end of your application
– Run it and inspect the default annotation set for NEs
Why the mistake? OrgJobTitle rule
// Grammar in plugins/ANNIE/resources/NE/org_context.jape
Rule: OrgJobTitle
Priority: 30
(
  {Unknown.kind == PN} // It only considers one preceding word as a candidate
):org
(
  {Lookup.majorType == jobtitle}
)
-->
{
  gate.AnnotationSet org = (gate.AnnotationSet) bindings.get("org");
  gate.FeatureMap features = Factory.newFeatureMap();
  features.put("rule", "OrgJobTitle");
  outputAS.add(org.firstNode(), org.lastNode(), "Organization", features);
  outputAS.removeAll(org);
}
Tweet Capitalisation: an NER nightmare!
…And hashtag semantics is yet another…
Case-Insensitive Matching
This would seem the ideal solution, especially for gazetteer lookup, when people don't use case information as expected
However, setting all PRs to be case-insensitive can have undesired consequences
– POS tagging becomes unreliable (e.g. “May” vs “may”)
– Back-off strategies may fail, e.g. unknown words beginning with a capital letter are normally assumed to be proper nouns
– BUT this doesn’t work on tweets anyway!
– Gazetteer entries quickly become ambiguous (e.g. many place names and first names are ambiguous with common words)
Solutions include selective use of case insensitivity, removal of ambiguous terms from lists, additional verification (e.g. use of the text of any contained URLs)
More flexible matching techniques
In GATE, as well as the standard gazetteers, we have options for modified versions which allow for more flexible matching
BWP Gazetteer: uses Levenshtein edit distance for approximate string matching
Extended Gazetteer: has a number of parameters for matching prefixes, suffixes, initial capitalisation and so on
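Levenshtein distance is the number of single-character insertions, deletions and substitutions needed to turn one string into another. A sketch of how a gazetteer could use it for approximate matching (illustrative only, not the BWP Gazetteer's implementation):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, kept to two rows
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def gazetteer_match(token, entries, max_dist=1):
    # Accept a gazetteer entry if it is within max_dist edits of the
    # token, so misspellings like "Londn" still match "London"
    return [e for e in entries
            if levenshtein(token.lower(), e.lower()) <= max_dist]
```

Small max_dist values are essential: with a large edit budget, short gazetteer entries become ambiguous with ordinary words.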
Try: Run ANNIE on User Profile Text
• User descriptions are another piece of useful text to mine
• Appear as UserDescription annotations in PreProcess
• Create another Annotation Set Transfer from PreProcess to the default set, using the UserDescription annotation from PreProcess as the textTagName
– HINT: See the parameters of the Tweet POS AST
• Add the new AST PR after the Tweet POS AST, but before the TwitIE POS Tagger. Re-run the app
ANNIE Results in User Descriptions
…TwitIE NE rules are being improved, watch this space…
NER in Tweets
Performance of the Stanford NER drops to 48% [Liu’11]
Pre-processing used:
Stop words, user names, and links are removed
Specially adapted/trained POS tagger [Ritter’11]
NP Chunker adapted to tweets [Ritter’11]
Capitalisation information [Ritter’11]
Syntactic normalisation [Doerhmann, 2011]
Gazetteers derived from Freebase [Ritter’11]
NER for Tweets (2)
Performance reported on 4 entity types (PER, LOC, ORG, PRODUCT): 80.2% f-score (81.6% P; 78.8% R) [Liu et al 2011]
[Doerhmann, 2011] improved on Liu's results by normalising the tweets first
Ritter's scores are lower but against more Freebase entity types: PERSON, GEO-LOCATION, COMPANY, PRODUCT, FACILITY, TV-SHOW, MOVIE, SPORTSTEAM, BAND, and OTHER
Stemming
The Snowball stemmer is already integrated in GATE
12 European languages: Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish and Swedish
http://snowball.tartarus.org
De-duplication and Spam Removal
Heuristics from [Choudhury & Breslin, #MSM2011]:
Remove as duplicates/spam:
Messages containing only hashtags (and an optional URL)
Messages containing only @mentions and a URL. Note this is not 100% reliable: somebody might legitimately reply to a tweet with just a URL, if the original tweet requested the information
Similar content posted under different user names with the same timestamp is treated as a case of multiple accounts
Identical content from the same account is treated as duplicate tweets
The same content posted by the same account at multiple times is treated as spam
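The first two heuristics and the duplicate check can be sketched outside GATE as well. A minimal illustration (the regexes and the (user, text) representation are assumptions for this sketch):

```python
import re

URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")

def looks_like_spam(text):
    # Flag tweets that consist only of hashtags/@mentions plus
    # optional URLs, i.e. nothing is left once those are stripped
    rest = MENTION.sub("", HASHTAG.sub("", URL.sub("", text))).strip()
    has_tags = bool(HASHTAG.search(text) or MENTION.search(text))
    return has_tags and rest == ""

def dedupe(tweets):
    # Drop repeated (account, content) pairs as duplicate tweets
    seen, kept = set(), []
    for user, text in tweets:
        if (user, text) not in seen:
            seen.add((user, text))
            kept.append((user, text))
    return kept
```

As the slide warns, these are heuristics: a legitimate URL-only reply would be flagged, so precision/recall trade-offs need checking on real data.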
Advanced Hands On: Spam Removing JAPE
• Implement a JAPE grammar that creates a Spam annotation for tweets which contain only 1 or more hashtags and optionally one or more URLs (a shell JAPE file is spam.jape in hand-outs)
– #hashtag1 #hashtag2 http://whatever.com
– http://whatever.com #hashtag1
– All URLs have a URL annotation
– All hashtags have a Hashtag annotation covering them
• To help you do that, create a JAPE transducer with resources/spam-pre-process.jape and add it to the app. Set inputAS and outputAS runtime params to PreProcess
• This creates BeforeTextToken and AfterTextToken to delimit the boundaries of the tweet text. Use these to help you build your JAPE pattern
Example Tweet with the Required Annotations Highlighted
Tweet Normalisation
“RT @Bthompson WRITEZ: @libbyabrego honored?! Everybody knows the libster is nice with it...lol...(thankkkks a bunch;))”
OMG! I’m so guilty!!! Sprained biibii’s leg! ARGHHHHHH!!!!!!
Similar to SMS normalisation
For some components to work well (POS tagger, parser), it is necessary to produce a normalised version of each token
BUT uppercasing, and letter and exclamation mark repetition often convey strong sentiment
Therefore some choose not to normalise, while others keep both versions of the tokens
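The "keep both versions" strategy can be sketched as a lookup against an abbreviation list. The list below is a toy assumption for illustration; the GATE normaliser uses much larger resources and spelling correction:

```python
# Toy abbreviation list; real normalisers use far larger resources
ABBREV = {"2moro": "tomorrow", "u": "you", "lol": "laughing out loud",
          "thankkkks": "thanks", "writez": "writes"}

def normalise(tokens):
    # Keep the original token alongside its normalised form, so that
    # sentiment cues (capitalisation, lengthening) are not lost
    return [(t, ABBREV.get(t.lower(), t.lower())) for t in tokens]

pairs = normalise(["u", "WRITEZ", "2moro"])
```

A POS tagger or parser can then consume the normalised column while a sentiment component still sees the original surface form.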
Syntactic Normalisation [Kaufmann, 2010]
Preparation: removing emoticons, tokenisation
Orthographic mapping: 2moro, u
Syntactic disambiguation
Determine when @mentions and #tags have syntactic value and should be kept in the sentence, vs replies, retweets and topic tagging
Machine Translation: used MOSES
Trained on SMS and ANC corpora
GATE Tweet Normaliser
• Open the Plugin manager, choose to register a new CREOLE directory and point it to applications/plugins/Normaliser_Twitter
• Create an instance of the Twitter Normaliser PR
• Remove all PRs from your pipeline that run after the Tokeniser
• Add the Twitter Normaliser PR at the end instead
• Set its inputAS and outputAS params to PreProcess
• Create a new corpus
• Populate it from a single file: corpora/normaliser_test_corpus.xml
– Remember to specify doc_root
• Run the pipeline and inspect the results
A normalised example
Normaliser currently based on spelling correction and some lists of common abbreviations
Outstanding issues:
Should new Token annotations be inserted, to make POS tagging etc. easier? For example, “trying to” is currently one annotation
Some abbreviations which span token boundaries (e.g. gr8, do n’t) are not yet handled
Capitalisation and punctuation normalisation
ANNIC Demo and HandsOn
• Formulating queries
• Finding matches in the corpus
• Analysing the contexts
• Refining the queries
• Demo: http://gate.ac.uk/demos/annic2008/Annic-only.htm
Hands-On: Using ANNIC
● Load the datastore politwits-500 in GATE and double click on it to open the datastore viewer
● Select “Lucene datastore searcher” from the datastore viewer (bottom pane)
● Try out some patterns to see what results you get, e.g. {Sentiment}
● Hint: click on the name of an annotation in the bottom right corner, to add it to the search box, or start typing in the search box to get some help with possible annotations.
Pattern examples
● {Party}
● {Affect}
● {Lookup.majorType == negation} ({Token})*4 {Lookup.majorType == "vote"}{Lookup.majorType == "party"}
● {Token.string == "I"} ({Token})*4 {Lookup.majorType == "vote"}{Lookup.majorType == "party"}
● {Person} ({Token})*4 {Lookup.majorType == "vote"}{Lookup.majorType == "party"}
● {Affect} ({Token})*5 {Lookup.majorType == "candidate"}
● {Vote} ({Token})*5 {Lookup.majorType == "candidate"}
References:
• T. Baldwin and M. Lui. Language Identification: The Long and the Short of the Matter. Proceedings of NAACL HLT 2010. http://www.aclweb.org/anthology/N10-1027
• M. Kaufmann. Syntactic Normalization of Twitter Messages. 2010. http://www.cs.uccs.edu/~kalita/work/reu/REUFinalPapers2010/Kaufmann.pdf
• S. Choudhury and J. Breslin. Extracting Semantic Entities and Events from Sports Tweets. Proceedings of #MSM2011: Making Sense of Microposts. 2011.
• X. Liu, S. Zhang, F. Wei and M. Zhou. Recognizing Named Entities in Tweets. Proceedings of ACL 2011.
• A. Ritter, Mausam and O. Etzioni. Named Entity Recognition in Tweets: An Experimental Study. Proceedings of EMNLP 2011.
• Doerhmann. Named Entity Extraction from the Colloquial Setting of Twitter. 2011. http://www.cs.uccs.edu/~kalita/work/reu/REU2011/FinalPapers/Doehermann.pdf
• S. Carter, W. Weerkamp and E. Tsagkias. Microblog Language Identification: Overcoming the Limitations of Short, Unedited and Idiomatic Text. Language Resources and Evaluation Journal. 2013 (forthcoming).