University of Sheffield, NLP

Practical Sentiment Analysis: Hands-on Material

Diana Maynard
University of Sheffield, UK

© The University of Sheffield, 1995-2012
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs Licence



Introduction

• This hands-on tutorial assumes some basic knowledge of GATE

• If you've never used GATE before, you probably want to get familiar with at least the basics: try Module 1 of our GATE training course https://gate.ac.uk/wiki/TrainingCourseJune2012/

• You can also just have a look at the examples without trying them in GATE

• The hands-on exercises marked “Advanced” are for more expert users who are familiar with JAPE: they should have completed Track 1 of our GATE training course or had experience using GATE in real life

• You can just skip these if you're not an advanced user

• If you want more in-depth training with GATE, you can either join our annual GATE training course (next one May/June 2013 in Sheffield) or contact us to arrange a personalised training/consultancy in person or online

• More info about everything GATE-related https://gate.ac.uk


Processing Social Media

Examples and exercises:

• Loading tweets into GATE

• Linguistic pre-processing

– Language detection

– Tokenisation

– Emoticon detection

– POS tagging

– Normalisation

– Spam removal

• Named Entity detection

• Visualisation (ANNIC)


Loading tweets into GATE


Example Tweet metadata in JSON

{ "contributors":null, "text":"Automotive RDFa (a horribly researched SEO article on RDFa/Microformats): http://ow.ly/5JSoS #somanyerrorsitsfunny", "geo":null,"retweeted":false,"in_reply_to_screen_name":null, "truncated":false, "entities":{"urls":[{"expanded_url":null,"indices":[74,92],"url":"http://ow.ly/5JSoS"}], "hashtags":[{"text":"somanyerrorsitsfunny","indices":[93,114]}], "user_mentions":[]}, "in_reply_to_status_id_str":null,"id":94029193863639040,"source":"<a href=\"http://www.hootsuite.com\" rel=\"nofollow\">HootSuite<\/a>", "in_reply_to_user_id_str":null,"favorited":false,"in_reply_to_status_id":null,"retweet_count":0,"created_at":"Thu Jul 21 13:01:21 +0000 2011",


Example Tweet metadata in JSON (2)

"in_reply_to_user_id":null,"id_str":"94029193863639040","place":{"id":"c799e2d3a79f810e", "bounding_box":{"type":"Polygon",

"coordinates":[[[6.6266397,35.4928765],[18.5203619,35.4928765],[18.5203619,47.0924248],[6.6266397,47.0924248]]]},

"place_type":"country", "name":"Italia", "attributes":{}, "country_code":"IT",

"url":"http:/…/1/geo/id/c799e2d3a79f810e.json", "full_name":"Italia", "country":"Italia"

},

• place_type: the type of place, e.g. “city”

• country: the country containing the place of origin

More: https://courses.ischool.berkeley.edu/i202/f11/sites/default/files/map-of-a-tweet.pdf


Example Tweet metadata in JSON (3)

"user":{"location":"Blacksburg, VA", …, "statuses_count":2404, "lang":"en", "id":20446311, …, "description":"Text from the user profile (max 160 chars)", …, "name":"User Name", …, "created_at":"Mon Feb 09 16:33:16 +0000 2009", "followers_count":1239, "geo_enabled":false, …, "url":"The author’s URL (optional)", "utc_offset":-21600, "time_zone":"Central Time (US & Canada)", …, "friends_count":160, …, "screen_name":"twitter-user-name", …, "listed_count":189, …

}, …

The user information is embedded in each tweet, so it can get out of sync if the user changes it later

More: https://courses.ischool.berkeley.edu/i202/f11/sites/default/files/map-of-a-tweet.pdf


How to acquire tweets

• The Twitter Search API is a dedicated API for running searches against the real-time index of recent Tweets

– Currently the index includes between 6 and 9 days of Tweets

– Rate limit not published

– Requests to the Search API are anonymous. The rate limit is measured against the requesting client IP

• The REST API allows access to timelines, tweeting, following, etc.

• The Streaming API streams tweets in real time

• See https://dev.twitter.com/docs/twitter-libraries

• Currently tweet download is done externally to GATE


Importing tweets into GATE

• Currently GATE has no JSON format handler implemented

• Instead, we use a Python script to convert the JSON into XML, which can then be opened by GATE

– json2xml.py in your hand-outs (./json2xml.py --help)

– ./json2xml.py -s tweet-sample.json -o sample-out-dir

– Writes all tweets as 1 output.xml file in the <sample-out-dir>

• Create a corpus in GATE Developer

• Right click -> “Populate from a single concatenated file”

• Select lang-id-small-test-set.xml

• Change “Root element” to doc_root and File URL to point to the directory where output.xml is
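The conversion step can be illustrated with a minimal Python sketch. This is not the json2xml.py script from the hand-outs, whose output format is not shown here. It assumes one JSON tweet per input line; the outer `tweets` element, the inner element names and the function name are hypothetical. Each tweet is wrapped in a `doc_root` element, matching the “Root element” setting above.

```python
import json
from xml.sax.saxutils import escape

def tweets_to_xml(json_lines):
    """Wrap each tweet (one JSON object per line) in a <doc_root> element."""
    parts = ['<?xml version="1.0" encoding="UTF-8"?>', "<tweets>"]
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        user = tweet.get("user") or {}
        parts.append("<doc_root>")
        # escape() protects characters such as & and < inside the tweet text
        parts.append("  <text>%s</text>" % escape(tweet.get("text") or ""))
        parts.append("  <created_at>%s</created_at>" % escape(tweet.get("created_at") or ""))
        parts.append("  <screen_name>%s</screen_name>" % escape(user.get("screen_name") or ""))
        parts.append("</doc_root>")
    parts.append("</tweets>")
    return "\n".join(parts)

sample = [
    '{"text": "RDFa & Microformats #somanyerrorsitsfunny", '
    '"created_at": "Thu Jul 21 13:01:21 +0000 2011", '
    '"user": {"screen_name": "twitter-user-name"}}'
]
xml_out = tweets_to_xml(sample)
print(xml_out)
```

Keeping one `doc_root` element per tweet is what lets GATE split the concatenated file into one document per tweet.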


Each tweet becomes a GATE document


Language Pre-processing


Language Detection

• There are many language detection systems readily available

• The main challenges on tweets/Facebook status updates:

– the small number of tokens (10 tokens/tweet on average)

– the noisy nature of the words (abbreviations, misspellings)

• Because the text is so short, we can assume that each tweet is written in only one language

• We have adapted the TextCat language identification plugin

– Provided fingerprints for 5 languages: DE, EN, FR, ES, NL

– You can extend it to new languages easily (see GATE user guide)
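To give a feel for fingerprint-based language identification, here is a minimal Python sketch of the general character n-gram ranking idea that TextCat-style tools build on. It is not the GATE plugin's code. The function names and the two tiny training sentences are invented for illustration; real fingerprints are built from much larger samples.

```python
from collections import Counter

def ngram_profile(text, max_n=3, top=300):
    """Return the text's character n-grams, most frequent first."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    )
    # Sort by descending count, then alphabetically, so the ranking is deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the fingerprint get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else penalty
        for r, gram in enumerate(doc_profile)
    )

def identify(text, fingerprints):
    """Pick the language whose fingerprint is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(fingerprints, key=lambda lang: out_of_place(doc, fingerprints[lang]))

# Toy training data (hypothetical), far too small for real use
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and this is a short english sentence about the weather",
    "nl": "de snelle bruine vos springt over de luie hond en dit is een korte nederlandse zin over het weer",
}
FINGERPRINTS = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

print(identify("this is about the weather", FINGERPRINTS))
```

Adding a language in this scheme only requires one more fingerprint, which is why extending such a system is comparatively easy.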


Hands-On 1: Language ID

• Load twitie-lang-id.xgapp in GATE

• Use the tweet corpus already open (10 docs in 4 languages)

• Run the application

– The Annotation Set Transfer first copies the text annotation from the “Original markups” set as a Tweet annotation in the PreProcess annotation set

– The Tweet Language Identification PR adds a “lang” feature to the Tweet annotation in the PreProcess set

• Inspect the results

• Keep the app open for later

• Close the corpus


Language ID Results: English Example

Various annotations are created by the metadata-based pre-processing JAPE grammar (tweet-metadata-parser.jape in resources):

– Sentence is an annotation created to span the entire tweet text

– TwitterUser spans the entire user information in the tweet

– TweetCreatedAt is the timestamp of this tweet


Tokenisation

Splitting a text into its constituent parts

Plenty of “unusual”, but very important tokens in social media:

– @Apple – mentions of company/brand/person names

– #fail, #SteveJobs – hashtags expressing sentiment, person or company names

– :-(, :-), :-P – emoticons (punctuation and optionally letters)

– URLs

Tokenisation is key for entity recognition and opinion mining

A study of 1.1 million tweets: 26% of English tweets have a URL, 16.6% have a hashtag, and 54.8% have a user name mention [Carter, 2013].


Example

– #WiredBizCon #nike vp said when @Apple saw what http://nikeplus.com did, #SteveJobs was like wow I didn't expect this at all.

– Tokenising on white space doesn't work that well:

• Nike and Apple are company names, but if we have tokens such as #nike and @Apple, this will make the entity recognition harder, as it will need to look at sub-token level

– Tokenising on white space and punctuation characters doesn't work well either: URLs get separated (http, nikeplus), as are emoticons and email addresses
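The two failure modes above can be reproduced with a few lines of Python, alongside one possible alternative. The regular expression below is an illustrative assumption, not the tokeniser used in GATE or in any other tool mentioned here. It tries URLs, e-mail addresses, @mentions, #hashtags and a few emoticons before falling back to words and punctuation.

```python
import re

TWEET = ("#WiredBizCon #nike vp said when @Apple saw what http://nikeplus.com did, "
         "#SteveJobs was like wow I didn't expect this at all.")

# Alternatives are tried left to right, so the "special" token types win
TOKEN_RE = re.compile(r"""
    https?://\S+                 # URLs stay in one piece
  | [\w.+-]+@[\w-]+\.[\w.-]+     # e-mail addresses
  | [@#]\w+                      # @mentions and #hashtags
  | [:;=][-o]?[()DPp]            # a handful of Western-style emoticons
  | \w+(?:'\w+)*                 # words, including contractions
  | [^\w\s]                      # any remaining punctuation character
""", re.VERBOSE)

def tokenise(text):
    """Split a tweet into tokens using the pattern above."""
    return TOKEN_RE.findall(text)

whitespace_tokens = TWEET.split()                        # keeps "did," and "all."
punctuation_tokens = re.findall(r"\w+|[^\w\s]", TWEET)   # breaks the URL apart
tweet_tokens = tokenise(TWEET)

print(whitespace_tokens)
print(punctuation_tokens)
print(tweet_tokens)
```

Note that this sketch keeps `#nike` and `@Apple` as single tokens, so an entity recogniser would still need to look inside them; it only addresses the URL, emoticon and e-mail problems.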


Special Tokenisation for tweets (Stanford)

Try this demo with different tokenisations and tweets

http://sentiment.christopherpotts.net/tokenizing/

Emoticons, @mentions, #tags, URLs, and emails are all tokens in their own right

Most words are lower-cased, unless written in all caps (this can convey sentiment)

Dates are normalised into 1 token, as are phone numbers

Deals with lengthening (YAAAAAAY → YAAAY)

Takes into account HTML tags, such as <strong>

Some issues:

It can get very slow computationally, due to complex cases

Tailored for sentiment
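Two of the behaviours listed above, lengthening and selective lower-casing, are simple enough to sketch in Python. This is an illustration of the idea only, not the code behind the demo; the function names are made up.

```python
import re

def squeeze_lengthening(token):
    """Collapse any character repeated 3 or more times down to exactly 3."""
    return re.sub(r"(.)\1{2,}", r"\1\1\1", token)

def normalise_case(token):
    """Lower-case a token unless it is written entirely in capitals."""
    return token if token.isupper() else token.lower()

for raw in ["YAAAAAAY", "Nike", "thankkkks", "ARGHHHHHH!!!!!!"]:
    print(raw, "->", normalise_case(squeeze_lengthening(raw)))
```

Keeping three repeated characters rather than one preserves the fact that the word was lengthened, which can itself be a sentiment cue.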


Hands-On: Hashtag and @mention tokenisation

• Load the GATE Unicode Tokeniser, with its default settings

• Load a Document Reset with defaults

• Create a new corpus controller, add the Reset, then Tokeniser

• Close the previous corpus, create a new one and populate it from a single concatenated file, using test-10-tweets.xml

• Inspect the results, especially around hashtags and @mentions

• Create a JAPE transducer, loading resources/hashtag.jape

• Add it to the application and re-run. Hashtag annotations appear

• Now add a new rule to detect @mentions as UserID annotations

• Right-click on the JAPE transducer, re-load, and re-run the app


The GATE Twitter Tokeniser

Treat RTs and URLs as 1 token each

#nike is two tokens (# and nike) plus a separate Hashtag annotation covering both. The same applies to @mentions, which are covered by a UserID annotation

Capitalisation is preserved, but an orthography feature is added: all caps, lowercase, mixCase

Date and phone number normalisation, lowercasing, and emoticons are optionally done later in separate modules

Consequently, tokenisation is faster and more generic

Also, more tailored to how ANNIE NER expects the input


GATE Twitter Tokeniser: An Example


Hands-On: Running the GATE Tweet Tokeniser

• Right click on ProcessingResources, load ANNIE English Tokeniser

– Leave TokeniserRulesURL unchanged

– For TransducerGrammarURL navigate to your hand-outs directory, then choose resources/tokeniser/twitter.jape

• Add the Tweet Tokeniser at the end of the TwitIE Lang ID app

• Set the AnnotationSetName parameter to PreProcess

• Re-run the app and inspect the results (Token, Hashtag, UserID)

• Note that the Token annotations under UserIDs now have the POS category NNP, since they are proper names

• Take a quick look at the actual rules for Hashtag and UserID recognition in twitter.jape. See how they differ from the simple ones we wrote earlier.


Emoticon Detection

• There is a gazetteer list of some commonly used emoticons in your hand-outs, resources/emoticons-list.

• Create an ANNIE Gazetteer PR, name it Emoticon gazetteer, and point it to the emoticons directory

• Add it at the end of your pipeline, set the AnnotationSetName parameter to PreProcess, run the app

• Inspect the Lookup annotations in GATE Developer


Example candidates: Lookup.majorType == emoticon

• The easiest approach is a JAPE grammar that turns all emoticon Lookups into annotations

• BUT not all Lookups are actual emoticons, as you can see


Using the Segment Processing PR to restrict scope

• The PR is part of the Alignment Plugin

• Use it to process only the part of the tweet covered by the Tweet annotation (or any other annotation, e.g. user)

• The PR takes, as one of its parameters, another processing resource or an instance of an application that you want to run on the document (e.g. ANNIE)

• Here we will use it to restrict the processing scope of a JAPE grammar

• For other uses, see Module 10 Advanced IE of the GATE training course


Running emoticon.jape on the tweet text

• Application contains a Segment Processing PR

• Segment Processing PR calls Emoticon JAPE


Segment Processing Parameters

• Segment Processing PR calls the Emoticon JAPE grammar

• The input annotation set needs to be set to PreProcess

• It will run only on the text covered by the span of the “Tweet” annotation

• Emoticon.jape simply converts all emoticon Lookups into Emoticon annotations

• In practice, you might also wish to merge all underlying Token annotations into 1 Token annotation of kind = emoticon or punctuation
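The effect of restricting processing to a segment can be mimicked outside GATE with plain offsets. The Python sketch below is an analogy only and does not use the GATE API. The emoticon list, the sample document and the function names are invented; the "metadata" part deliberately contains a string that looks like an emoticon.

```python
EMOTICONS = [":)", ":-(", ":P"]   # stand-in for the emoticon gazetteer list

def find_lookups(text, entries):
    """Return (start, end, string) for every gazetteer entry found anywhere in text."""
    hits = []
    for entry in entries:
        start = text.find(entry)
        while start != -1:
            hits.append((start, start + len(entry), entry))
            start = text.find(entry, start + 1)
    return sorted(hits)

def within(hit, segment):
    """True if the hit lies completely inside the (start, end) segment."""
    return segment[0] <= hit[0] and hit[1] <= segment[1]

tweet_text = "Loving the new phone :)"
metadata = " created_at=13:01:21 url=http://ow.ly/x:Pq"
document = tweet_text + metadata
tweet_span = (0, len(tweet_text))   # plays the role of the Tweet annotation

lookups = find_lookups(document, EMOTICONS)                   # includes the false hit in the URL
emoticons = [hit for hit in lookups if within(hit, tweet_span)]

print(lookups)
print(emoticons)
```

Only the hit inside the tweet span survives, which is the behaviour the Segment Processing PR gives the Emoticon grammar.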


Annotation Result

• Yellow shading shows the tweet text to be annotated

• The Emoticon annotation within it appears in purple

• Emoticon lookups in the rest of the document are not annotated


Hands-on: Emoticon detection (1)

• Create a JAPE transducer, pointing it at resources/emoticon.jape

• Add it to the application, set its inputAS and outputAS params to PreProcess, run the app, then remove the transducer. This just ensures we set it to the right annotation set, instead of the default one

• Open the CREOLE Plugin Manager and load the Alignment plugin

• Create a Segment Processing PR as follows:

– Select the Emoticon JAPE Transducer as the “analyser” parameter

– Specify “Tweet” as the SegmentAnnotationType

– inputASName = “PreProcess”

• Add the Emoticon Segment Processing PR to the end of application

• Run it and inspect the results


Hands-on: Emoticon detection (2) (Advanced users)

• Have a look at the grammar emoticon.jape and:

• Modify it to use the within JAPE operator to check if the Lookup is within a Tweet annotation (Module 3). Remove the Segment Processing PR

• Delete the underlying Lookup annotation


POS Tagging

• The accuracy of the Stanford POS tagger drops from about 97% on news to 80% on tweets (Ritter, 2011)

• Need for an adapted POS tagger, specifically for tweets

• We re-trained the Stanford POS tagger using some hand-annotated tweets, IRC and news texts

• The resulting new Tweet POS Tagger is in applications/plugins/Tagger_Stanford

• Next we compare the differences between the ANNIE POS Tagger and the Tweet POS Tagger on the example tweets


Create the ANNIE POS Tags

• Create an Annotation Set Transfer, add to the application

• Set its run-time parameters as shown:

• Create an ANNIE POS Tagger with default init parameters

• Add to the application and set run-time params:


Hands-On: ANNIE POS Tags (2)

• Run the application

• Inspect the Token annotations in the ANNIE set


Register the Tagger_Stanford plugin

• Open CREOLE Plugin Manager

• Select the directory hand-outs/applications/plugins/Tagger_Stanford

• Check “Load Now” and press Apply All


Configure the Stanford Tagger

• Create another Annotation Set Transfer, add to the application

• Set its run-time parameters as shown:

• Create an instance of Stanford Tagger from tweet model:

• Add to the application at the end

• Run the application


Hands On: App Sanity Check

• By now your tweet processing application should look like this


TwitIE POS Tagger Results: Example

• If everything has been set up properly, you will get results in 2 sets:

– ANNIE will have the POS tags from the ANNIE POS Tagger

– The default set will have those from the TwitIE Tagger


Compare the Differences: Annotation Diff

• Click on the Annotation Diff button

• Select a document from the test corpus (same Key and Resp)

• Key set: [Default set]; Resp. set: ANNIE

• Type: Token; Features: some, then select: category


Compare the Differences (2)

• Click on the Compare button

• Inspect the results; repeat for 1-2 more documents

• HINT: Clicking on the Start column will sort tokens by offset

• We are still improving the tweet POS model, but major improvements have already been made

• Accuracy is already better than that of (Ritter, 2011)
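Conceptually, the comparison done by Annotation Diff on the category feature amounts to aligning tokens by their offsets and listing the ones whose tags differ. A rough Python sketch of that idea follows. It is not GATE's implementation, and the offsets and POS tags in the two dictionaries are invented purely to have something to compare.

```python
def diff_categories(key, response):
    """Return (span, key_tag, response_tag) for every token whose tags differ."""
    differences = []
    for span in sorted(set(key) | set(response)):
        key_tag, resp_tag = key.get(span), response.get(span)
        if key_tag != resp_tag:
            differences.append((span, key_tag, resp_tag))
    return differences

# Hypothetical (start, end) -> category maps for one short tweet
key_set = {(0, 3): "UH", (4, 9): "VBG", (10, 13): "VB", (14, 20): "NNP"}
annie_set = {(0, 3): "NN", (4, 9): "NN", (10, 13): "VB", (14, 20): "NNP"}

differences = diff_categories(key_set, annie_set)
agreement = 1 - len(differences) / len(set(key_set) | set(annie_set))

for span, key_tag, resp_tag in differences:   # sorted by start offset
    print(span, key_tag, resp_tag)
print("agreement:", agreement)
```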


App cleanup for next hands on

• Modify your application pipeline:

– Remove the Annotation Set Transfer that was copying to the ANNIE set

– Remove the ANNIE POS tagger

– Load an ANNIE Gazetteer with default init parameters

– Add it to the application *before* the Tweet POS AST and the TwitIE POS tagger

– Set its AnnotationSetName run-time param to PreProcess

– Re-run the application, check you have Lookups in PreProcess


ANNIE NER on Tweets

• To run the ANNIE Transducer just on the tweet text:

– Instantiate an ANNIE NE Transducer PR with defaults

– Add it to the end of your application

– Run it and inspect the default annotation set for NEs


Why the mistake? OrgJobTitle rule

// Grammar in plugins/ANNIE/resources/NE/org_context.jape
Rule: OrgJobtitle
Priority: 30
(
  {Unknown.kind == PN}   // It is only considering one preceding word as a candidate
):org
(
  {Lookup.majorType == jobtitle}
)
-->
{
  gate.AnnotationSet org = (gate.AnnotationSet) bindings.get("org");
  gate.FeatureMap features = Factory.newFeatureMap();
  features.put("rule ", "OrgJobTitle");
  outputAS.add(org.firstNode(), org.lastNode(), "Organization", features);
  outputAS.removeAll(org);
}


Tweet Capitalisation: an NER nightmare!

…And hashtag semantics is yet another…


Case-Insensitive matching

This would seem the ideal solution, especially for gazetteer lookup, when people don't use case information as expected

However, setting all PRs to be case-insensitive can have undesired consequences

– POS tagging becomes unreliable (e.g. “May” vs “may”)

– Back-off strategies may fail, e.g. unknown words beginning with a capital letter are normally assumed to be proper nouns

– BUT this doesn’t work on tweets anyway!

– Gazetteer entries quickly become ambiguous (e.g. many place names and first names are ambiguous with common words)

Solutions include selective use of case insensitivity, removal of ambiguous terms from lists, additional verification (e.g. use of the text of any contained URLs)


More flexible matching techniques

In GATE, as well as the standard gazetteers, we have options for modified versions which allow for more flexible matching

BWP Gazetteer: uses Levenshtein edit distance for approximate string matching

Extended Gazetteer: has a number of parameters for matching prefixes, suffixes, initial capitalisation and so on
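To make approximate matching concrete, here is a small Python sketch of Levenshtein edit distance applied to a gazetteer list. It shows the general technique only and is not how the BWP Gazetteer is implemented. The `fuzzy_lookup` helper, its `max_dist` threshold and the sample entries are assumptions for illustration.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if the characters match
            ))
        prev = curr
    return prev[-1]

def fuzzy_lookup(token, gazetteer, max_dist=1):
    """Return gazetteer entries within max_dist edits of the token, ignoring case."""
    token = token.lower()
    return [entry for entry in gazetteer if levenshtein(token, entry.lower()) <= max_dist]

places = ["Sheffield", "Leeds", "London"]
print(fuzzy_lookup("shefield", places))
```

A small threshold matters: the larger `max_dist` is, the more short common words start matching gazetteer entries by accident.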


Try: Run ANNIE on User Profile Text

• User descriptions are another piece of useful text to mine

• Appear as UserDescription annotations in PreProcess

• Create another Annotation Set Transfer from PreProcess to the default set, using the UserDescription annotation from PreProcess as the textTagName

– HINT: See the parameters of the Tweet POS AST

• Add the new AST PR after the Tweet POS AST, but before the TwitIE POS Tagger. Re-run the app


ANNIE Results in User Descriptions

…TwitIE NE rules are being improved, watch this space…


NER in Tweets

Performance of the Stanford NER drops to 48% [Liu’11]

Pre-processing used:

Stop words, user names, and links are removed

Specially adapted/trained POS tagger [Ritter’11]

NP Chunker adapted to tweets [Ritter’11]

Capitalisation information [Ritter’11]

Syntactic normalisation [Doerhmann, 2011]

Gazetteers derived from Freebase [Ritter’11]


NER for Tweets (2)

Performance reported on 4 entity types (PER, LOC, ORG, PRODUCT): 80.2% f-score (81.6% P; 78.8% R) [Liu et al 2011]

[Doerhmann, 2011] improved on Liu's results by normalising the tweets first

Ritter's scores are lower but against more Freebase entity types: PERSON, GEO-LOCATION, COMPANY, PRODUCT, FACILITY, TV-SHOW, MOVIE, SPORTSTEAM, BAND, and OTHER
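As a quick check on how the figures from [Liu et al 2011] relate to each other, the snippet below recomputes the f-score from the quoted precision and recall. It assumes “f-score” means the balanced F1 (harmonic mean), which is consistent with the numbers given.

```python
def f1(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

score = f1(81.6, 78.8)
print(round(score, 1))
```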


Stemming

The Snowball stemmer is already integrated in GATE

12 European languages: Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish and Swedish

http://snowball.tartarus.org


De-duplication and Spam Removal

Heuristics from [Choudhury & Breslin, #MSM2011]:

Remove as duplicates/spam:

Messages with only hashtags (and optional URL)

Messages with only @mentions and a URL

Note this is not 100% reliable, e.g. somebody might legitimately reply to a tweet with a URL, if the original tweet requested the information

Similar content, different user names and with the same timestamp are considered to be a case of multiple accounts

Same account, identical content are considered to be duplicate tweets

Same account, same content at multiple times are considered as spam tweets
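These heuristics translate fairly directly into code. The Python sketch below is one possible reading of them, not the implementation from [Choudhury & Breslin, #MSM2011]. It treats “similar content” as identical text after lower-casing and whitespace normalisation, which is a simplifying assumption, and the label names are made up.

```python
import re
from collections import defaultdict

HASHTAG = r"#\w+"
MENTION = r"@\w+"
URL = r"https?://\S+"

def only_hashtags_and_urls(text):
    """At least one hashtag, and nothing but hashtags and URLs."""
    tokens = text.split()
    return (any(re.fullmatch(HASHTAG, t) for t in tokens)
            and all(re.fullmatch(HASHTAG, t) or re.fullmatch(URL, t) for t in tokens))

def only_mentions_and_url(text):
    """At least one @mention and one URL, and nothing else."""
    tokens = text.split()
    return (any(re.fullmatch(MENTION, t) for t in tokens)
            and any(re.fullmatch(URL, t) for t in tokens)
            and all(re.fullmatch(MENTION, t) or re.fullmatch(URL, t) for t in tokens))

def label_repeats(tweets):
    """tweets: list of (user, timestamp, text). Returns one label per tweet."""
    by_account = defaultdict(list)   # (user, text) -> timestamps
    by_moment = defaultdict(set)     # (timestamp, text) -> users
    keys = [" ".join(text.lower().split()) for _, _, text in tweets]
    for (user, stamp, _), key in zip(tweets, keys):
        by_account[(user, key)].append(stamp)
        by_moment[(stamp, key)].add(user)
    labels = []
    for (user, stamp, _), key in zip(tweets, keys):
        stamps = by_account[(user, key)]
        if len(by_moment[(stamp, key)]) > 1:
            labels.append("multiple-accounts")   # same content, same time, different users
        elif len(set(stamps)) > 1:
            labels.append("spam")                # same account, same content, several times
        elif len(stamps) > 1:
            labels.append("duplicate")           # same account, identical content
        else:
            labels.append("ok")
    return labels

tweets = [
    ("a", "t1", "Buy now"), ("b", "t1", "buy  now"),
    ("c", "t1", "hello"), ("c", "t2", "hello"),
    ("d", "t3", "x"), ("d", "t3", "x"),
    ("e", "t4", "fine"),
]
print(label_repeats(tweets))
```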


Advanced Hands On: Spam Removing JAPE

• Implement a JAPE grammar that creates a Spam annotation for tweets which contain only 1 or more hashtags and optionally one or more URLs (a shell JAPE file is spam.jape in hand-outs)

– #hashtag1 #hashtag2 http://whatever.com

– http://whatever.com #hashtag1

– All URLs have a URL annotation

– All hashtags have a Hashtag annotation covering them

• To help you do that, create a JAPE transducer with resources/spam-pre-process.jape and add it to the app. Set inputAS and outputAS runtime params to PreProcess

• This creates BeforeTextToken and AfterTextToken to delimit the boundaries of the tweet text. Use these to help you build your JAPE pattern


Example Tweet with The Required Annotations highlighted


Tweet Normalisation

“RT @Bthompson WRITEZ: @libbyabrego honored?! Everybody knows the libster is nice with it...lol...(thankkkks a bunch;))”

OMG! I’m so guilty!!! Sprained biibii’s leg! ARGHHHHHH!!!!!!

Similar to SMS normalisation

For some components to work well (POS tagger, parser), it is necessary to produce a normalised version of each token

BUT uppercasing, and letter and exclamation mark repetition often convey strong sentiment

Therefore some choose not to normalise, while others keep both versions of the tokens


Syntactic Normalisation [Kaufmann, 2010]

Preparation: removing emoticons, tokenisation

Orthographic mapping: e.g. 2moro, u

Syntactic disambiguation

– Determine when @mentions and #tags have syntactic value and should be kept in the sentence, vs replies, retweets and topic tagging

Machine Translation: used MOSES

– Trained on SMS and ANC corpora
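The first steps can be roughly illustrated with a few hand-written rules. The Python sketch below is not Kaufmann's method (which uses statistical machine translation for the final step): the lookup table, the regular expressions and the positional heuristic (leading mentions and trailing tags are treated as non-syntactic, everything else is kept) are assumptions for illustration only.

```python
import re

# Hypothetical lookup table; a real system would use a much larger list.
ORTHOGRAPHIC_MAP = {"2moro": "tomorrow", "u": "you"}

RETWEET_PREFIX = re.compile(r"^RT\s+@\w+:?\s*")    # e.g. "RT @user: "
LEADING_MENTIONS = re.compile(r"^(?:@\w+\s+)+")    # reply addressing
TRAILING_TAGS = re.compile(r"(?:\s+#\w+)+$")       # topic tagging


def rough_normalise(tweet: str) -> str:
    """Drop retweet/reply/topic markers, keep in-sentence mentions and
    tags as plain words, and apply the orthographic mapping."""
    text = RETWEET_PREFIX.sub("", tweet)
    text = LEADING_MENTIONS.sub("", text)
    text = TRAILING_TAGS.sub("", text)
    words = []
    for word in text.split():
        # A mention or tag inside the sentence is assumed to have
        # syntactic value, so only its marker character is removed.
        if word[0] in "@#" and len(word) > 1:
            word = word[1:]
        words.append(ORTHOGRAPHIC_MAP.get(word.lower(), word))
    return " ".join(words)
```

Under these assumptions, "RT @bob: @alice see u 2moro at #gate training #nlp #sheffield" becomes "see you tomorrow at gate training".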

GATE Tweet Normaliser

• Open the Plugin manager, choose to register a new CREOLE directory and point it to applications/plugins/Normaliser_Twitter

• Create an instance of the Twitter Normaliser PR

• Remove all PRs from your pipeline that run after the Tokeniser

• Add the Twitter Normaliser PR at the end instead

• Set its inputAS and outputAS params to PreProcess

• Create a new corpus

• Populate it from a single file: corpora/normaliser_test_corpus.xml

– Remember to specify doc_root

• Run the pipeline and inspect the results

A normalised example

Normaliser currently based on spelling correction and some lists of common abbreviations

Outstanding issues:

Should new Token annotations be inserted, to make POS tagging etc. easier? For example, “trying to” is currently a single annotation

Some abbreviations which span token boundaries (e.g. gr8, do n’t) are not yet handled

Capitalisation and punctuation normalisation
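The combination of abbreviation lists and spelling correction can be sketched in a few lines. This Python example is only an illustration of the general approach, not the GATE normaliser's code: the abbreviation list, the vocabulary and the similarity cutoff are all made up for the example, and difflib from the standard library stands in for a proper spelling corrector.

```python
import difflib

# Hypothetical resources; real lists would be far larger.
ABBREVIATIONS = {"lol": "laughing out loud", "thx": "thanks", "pls": "please"}
VOCABULARY = ["tomorrow", "thanks", "honoured", "everybody", "guilty"]


def normalise_word(word: str) -> str:
    """Expand a known abbreviation, else try a close vocabulary match,
    else return the word unchanged."""
    lowered = word.lower()
    if lowered in ABBREVIATIONS:
        return ABBREVIATIONS[lowered]
    if lowered in VOCABULARY:
        return lowered
    # Take the single closest vocabulary entry above the cutoff, if any.
    matches = difflib.get_close_matches(lowered, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else word
```

Because this works one word at a time, it has the same limitation noted above: abbreviations that span token boundaries are not handled.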

ANNIC Demo and Hands-On

• Formulating queries

• Finding matches in the corpus

• Analysing the contexts

• Refining the queries

• Demo: http://gate.ac.uk/demos/annic2008/Annic-only.htm

Hands-On: Using ANNIC

● Load the datastore politwits-500 in GATE and double-click on it to open the datastore viewer

● Select “Lucene datastore searcher” from the datastore viewer (bottom pane)

● Try out some patterns to see what results you get, e.g. {Sentiment}

● Hint: click on the name of an annotation in the bottom right corner, to add it to the search box, or start typing in the search box to get some help with possible annotations.

Pattern examples

● {Party}

● {Affect}

● {Lookup.majorType == "negation"} ({Token})*4 {Lookup.majorType == "vote"}{Lookup.majorType == "party"}

● {Token.string == "I"} ({Token})*4 {Lookup.majorType == "vote"}{Lookup.majorType == "party"}

● {Person} ({Token})*4 {Lookup.majorType == "vote"}{Lookup.majorType == "party"}

● {Affect} ({Token})*5 {Lookup.majorType == "candidate"}

● {Vote} ({Token})*5 {Lookup.majorType == "candidate"}
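To make the bounded-gap part of these patterns concrete, here is a small Python sketch of the matching idea. It is not how ANNIC is implemented (ANNIC searches a Lucene index), and it assumes that ({Token})*4 means "zero up to four tokens". The data model is also a simplification: each token is a dict with a "string" and an optional "majorType", whereas in GATE the Token and Lookup annotations are separate.

```python
from typing import Callable, Dict, List, Tuple

Token = Dict[str, str]
Test = Callable[[Token], bool]


def find_with_gap(tokens: List[Token], first: Test, rest: List[Test],
                  max_gap: int) -> List[Tuple[int, int]]:
    """Find spans where `first` matches one token, then 0..max_gap
    arbitrary tokens follow, then `rest` matches consecutive tokens.
    Returns (start_index, end_index) pairs, both inclusive."""
    hits = []
    for start, token in enumerate(tokens):
        if not first(token):
            continue
        for gap in range(max_gap + 1):
            tail = start + 1 + gap
            window = tokens[tail:tail + len(rest)]
            if len(window) == len(rest) and all(
                    test(tok) for test, tok in zip(rest, window)):
                hits.append((start, tail + len(rest) - 1))
    return hits


# Sketch of: {Token.string == "I"} ({Token})*4
#            {Lookup.majorType == "vote"}{Lookup.majorType == "party"}
tokens = [
    {"string": "I"},
    {"string": "will"},
    {"string": "never"},
    {"string": "vote", "majorType": "vote"},
    {"string": "Labour", "majorType": "party"},
]
hits = find_with_gap(
    tokens,
    first=lambda t: t["string"] == "I",
    rest=[lambda t: t.get("majorType") == "vote",
          lambda t: t.get("majorType") == "party"],
    max_gap=4,
)
print(hits)  # [(0, 4)]
```

Here the two intervening tokens ("will", "never") fall within the allowed gap, so the whole five-token span is reported as a match.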

References:

• T. Baldwin and M. Lui. Language Identification: The Long and the Short of the Matter. In Proc. NAACL HLT ’10. http://www.aclweb.org/anthology/N10-1027.

• M. Kaufmann. Syntactic Normalization of Twitter Messages. http://www.cs.uccs.edu/~kalita/work/reu/REUFinalPapers2010/Kaufmann.pdf

• S. Choudhury and J. Breslin. Extracting Semantic Entities and Events from Sports Tweets. Proceedings of #MSM2011 Making Sense of Microposts. 2011.

• X. Liu, S. Zhang, F. Wei, M. Zhou. Recognizing Named Entities in Tweets. ACL'2011.

• A. Ritter, S. Clark, Mausam, O. Etzioni. Named Entity Recognition in Tweets: An Experimental Study. EMNLP'2011.

• Doehermann. Named Entity Extraction from the Colloquial Setting of Twitter. http://www.cs.uccs.edu/~kalita/work/reu/REU2011/FinalPapers/Doehermann.pdf

• S. Carter, W. Weerkamp, E. Tsagkias. Microblog Language Identification: Overcoming the Limitations of Short, Unedited and Idiomatic Text. Language Resources and Evaluation Journal. 2013 (Forthcoming).