Natural Language Processing TJTSD66: Advanced Topics in Social Media Dr. WANG, Shuaiqiang @ CS & IS,...
Transcript of Natural Language Processing TJTSD66: Advanced Topics in Social Media Dr. WANG, Shuaiqiang @ CS & IS,...
Natural Language Processing
TJTSD66: Advanced Topics in Social Media
Dr. WANG, Shuaiqiang @ CS & IS, JYUEmail: [email protected]
Homepage: http://users.jyu.fi/~swang/
(Social Media Mining)
2Social Media Mining Natural Language Processing Slide 2 of 66
Part I: What is Natural Language Processing?
3Social Media Mining Natural Language Processing Slide 3 of 66
NLP Example: Question Answering
4Social Media Mining Natural Language Processing Slide 4 of 66
NLP Example: Information Extraction
5Social Media Mining Natural Language Processing Slide 5 of 66
NLP Example: Sentiment Analysis
• size and weight– nice and compact to carry!– since the camera is small and light, I won't
need to carry around those heavy, bulky professional cameras either!
– the camera feels flimsy, is plastic and very light in weight, you have to be very delicate in the handling of this camera
Attributes:zoomaffordabilitysize and weightflashEase of use
6Social Media Mining Natural Language Processing Slide 6 of 66
NLP Example: Machine Translation
7Social Media Mining Natural Language Processing Slide 7 of 66
Language Technology
8Social Media Mining Natural Language Processing Slide 8 of 66
Part II: Preprocessing Techniques in NLP
9Social Media Mining Natural Language Processing Slide 9 of 66
Normalization
• Information Retrieval: indexed text & query terms must have same form.
• am, is, are be• U.S.A. USA• car, cars, car’s, cars’ car• automate(s), automatic, automation
automat• US vs. us!!!
10Social Media Mining Natural Language Processing Slide 10 of 66
Sentence Segmentation
• !,? Are relatively unambiguous• Period “.” is quite ambiguous– Sentence boundary– Abbreviations like Inc. or Dr.– Numbers like .02% or 4.3
• Build a binary classifier for “.”
11Social Media Mining Natural Language Processing Slide 11 of 66
Sentence Segmentation: Decision Tree
12Social Media Mining Natural Language Processing Slide 12 of 66
Word Tokenization
• English– “San Francisco”, “New York” as a token
• Chinese and Japanese– No spaces between words– 莎拉波娃现在居住在美国东南部的佛罗里达– 莎拉波娃 /现在 /居住 /在 /美国 /东南部 /的 /佛罗里达
– Also called Word Segmentation in Chinese (中文分词 )
13Social Media Mining Natural Language Processing Slide 13 of 66
Part III: Language Modeling
14Social Media Mining Natural Language Processing Slide 14 of 66
Probabilistic Language Models
• Assign a Probability to a sentence– Machine Translation:
• P(high winds tonight) > P(large winds tonight)
– Spell Correction• The office is about fifteen minuets from my house• P(about fifteen minutes from) > P(about fifteen
minuets from)
– Speech Recognition• P(I saw a van) >> P(eyes awe of an)
– Summarization, question-answering, etc.
15Social Media Mining Natural Language Processing Slide 15 of 66
Probabilistic Language Modeling
• Goal: compute the probability of a sentence or sequence of words:– P(W) = P(w1,w2,w3,w4,w5…wn)
• Related task: probability of an upcoming word:– P(w5|w1,w2,w3,w4)
• A model that computes either of these:– P(W) or P(wn|w1,w2…wn-1) Is called a language
model
16Social Media Mining Natural Language Processing Slide 16 of 66
How to Compute P(W)
•
17Social Media Mining Natural Language Processing Slide 17 of 66
Example
• P(“its water is so transparent”) = P(its) × P(water|its) × P(is|its water) × P(so|its water is) × P(transparent|its water is so)
• P(the|its water is so transparent that) = Count(its water is so transparent that the) / Count(its water is so transparent that) ???
• No! We’ll never see enough data for estimation.
18Social Media Mining Natural Language Processing Slide 18 of 66
Markov Assumption
• Simplifying assumption:– P(the | its water is so transparent that) ≈
P(the | that)
• Or maybe– P(the | its water is so transparent that) ≈
P(the | transparent that)
19Social Media Mining Natural Language Processing Slide 19 of 66
Markov Assumption
•
20Social Media Mining Natural Language Processing Slide 20 of 66
N-gram Models
•
21Social Media Mining Natural Language Processing Slide 21 of 66
Smoothing
• Intuition– Some words WITHOUT any occurrence in the
corpus do NOT mean its probability is ZERO!– Maybe because the corpus is too small
• Smoothing– Steal probability mass to generalize better
22Social Media Mining Natural Language Processing Slide 22 of 66
Add-one (Laplace) Smoothing
•
23Social Media Mining Natural Language Processing Slide 23 of 66
Berkeley Restaurant Corpus
24Social Media Mining Natural Language Processing Slide 24 of 66
Laplace-Smoothed Bigrams
25Social Media Mining Natural Language Processing Slide 25 of 66
Reconstituted Counts
26Social Media Mining Natural Language Processing Slide 26 of 66
Compare with Raw Counts
27Social Media Mining Natural Language Processing Slide 27 of 66
Disadvantage
• Add-1 estimation is a blunt instrument• So add‐1 isn’t used for N-grams:– We’ll see better methods
• But add‐1 is used to smooth other NLP models– For text classification– In domains where the number of zeros isn’t so
huge.
28Social Media Mining Natural Language Processing Slide 28 of 66
Good Turing Smoothing
Add-1 Smoothing
Add-k Smoothing:More general
¿𝑐 (𝑤 𝑖−1 ,𝑤𝑖 )+𝑚×
1𝑉
𝑐 (𝑤 𝑖−1 )+𝑚
29Social Media Mining Natural Language Processing Slide 29 of 66
Unigram Prior Smoothing
• Let : • : the probability of each word under the
uniform distribution• The word distribution P(wi) can be
estimated with the unigram prior knowledge, the smoothing formula can be rewritten :
30Social Media Mining Natural Language Processing Slide 30 of 66
Jelinek-Mercer Smoothing
• Let
• Then
¿𝑝 (𝑤𝑖∨𝑤𝑖− 1 )• Bigram interpolated model
𝑝𝑖𝑛𝑡𝑒𝑟𝑝 (𝑤𝑖∨𝑤𝑖− 1)=(1−𝜆)𝑝 (𝑤 𝑖∨𝑤 𝑖−1 )+𝜆𝑝 (𝑤 𝑖 )
31Social Media Mining Natural Language Processing Slide 31 of 66
Part IV: Sentiment Analysis
32Social Media Mining Natural Language Processing Slide 32 of 66
Positive or Negative in Movie Reviews?
• Unbelievably disappointing • Full of zany characters and richly applied
satire, and some great plot twists• This is the greatest screwball comedy
ever filmed• It was pathetic. The worst part about it
was the boxing scenes.
33Social Media Mining Natural Language Processing Slide 33 of 66
Google Product Search
34Social Media Mining Natural Language Processing Slide 34 of 66
Twitter sentiment vs. Gallup Poll of Consumer Confidence
35Social Media Mining Natural Language Processing Slide 35 of 66
Twitter Sentiment
• Johan Bollen, Huina Mao, Xiaojun Zeng. 2011. Twitter mood predicts the stock market, Journal of Computational Science 2:1, 1-8. 10.1016/j.jocs.2010.12.007.
36Social Media Mining Natural Language Processing Slide 36 of 66
Other Names of Sentiment Analysis
• Opinion extraction• Opinion mining• Sentiment mining• Subjectivity analysis
37Social Media Mining Natural Language Processing Slide 37 of 66
Why sentiment analysis?
• Movie: is this review positive or negative?
• Products: what do people think about the new iPhone?
• Public sentiment: how is consumer confidence? Is despair increasing?
• Politics: what do people think about this candidate or issue?
• Prediction: predict election outcomes or market trends from sentiment
38Social Media Mining Natural Language Processing Slide 38 of 66
Sentiment Analysis
• Simplest task: – Is the attitude of this text positive or negative?
• More complex:– Rank the attitude of this text from 1 to 5
• Advanced:– Detect the target, source, or complex attitude
types
39Social Media Mining Natural Language Processing Slide 39 of 66
Sentiment Classification for Movie Reviews
• Polarity detection:– Is an IMDB movie review positive or negative?
• Data: Polarity Data 2.0:– http
://www.cs.cornell.edu/people/pabo/movie-review-data
• Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP, 79-86.
• Bo Pang and Lillian Lee. 2004. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. ACL, 271- 278.
40Social Media Mining Natural Language Processing Slide 40 of 66
IMDB data in the Pang and Lee database
• when _star wars_ came out some twenty years ago, the image of traveling throughout the stars has become a commonplace image. […]
• when han solo goes light speed, the stars change to bright lines, going towards the viewer in lines that converge at an invisible point.
• cool.• _october sky_ offers a much
simpler image--that of a single white dot, traveling horizontally across the night sky. […]
• "snake eyes" is the most aggravating kind of movie: the kind that shows so much potential then becomes unbelievably disappointing.
• it's not just because this is a brian depalma film, and since he's a great director and one who’s films are always greeted with at least some fanfare.
• and it's not even because this was a film starring nicolas cage and since he gives a brauvara performance, this film is hardly worth his talents.
41Social Media Mining Natural Language Processing Slide 41 of 66
Baseline Algorithm
• Tokenization• Feature Extraction• Classification using different classifiers– Naïve Bayes– MaxEnt– SVM
42Social Media Mining Natural Language Processing Slide 42 of 66
Sentiment Tokenization Issues
• Deal with HTML and XML markup• Twitter mark-up (names, hash tags)• Capitalization (preserve for words in all
caps)• Phone numbers, dates• Emoticons• Useful code:– Christopher Potts sentiment tokenizer– Brendan O’Connor twitter tokenizer
43Social Media Mining Natural Language Processing Slide 43 of 66
Extracting Features
• How to handle negation– I didn’t like this movie! vs– I really like this movie!
• Which words to use?– Only adjectives– All words
• All words turns out to work better, at least on this data
44Social Media Mining Natural Language Processing Slide 44 of 66
Negations
• Add NOT_ to every word between negation and following punctuation:– didn’t like this movie , but I
– didn’t NOT_like NOT_this NOT_movie but I
• Das, Sanjiv and Mike Chen. 2001. Yahoo! for Amazon: Extracting Market Sentiment from Stock Message Boards. In APFA.
• Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP, 79-86.
45Social Media Mining Natural Language Processing Slide 45 of 66
Boolean feature for Multinomial Naïve Bayes
• Intuition:– For sentiment (and probably for other text
classification domains)– Word occurrence may matter more than word
frequency• The occurrence of the word fantastic tells us a lot• The fact that it occurs 5 times may not tell us much
more.
– Boolean Multinomial Naïve Bayes• Clips all the word counts in each document at 1
– Other possibility: log(freq(w))
46Social Media Mining Natural Language Processing Slide 46 of 66
Boolean Multinomial Naïve Bayes
• From training corpus, extract Vocabulary• Calculate P(cj) terms
• Calculate P(wk|cj) terms
• Make predictions
𝑐 𝑁𝐵=max𝑐 𝑗 ∈𝐶
𝑃 (𝑐 𝑗 ) ∏𝑖∈ 𝑝𝑜𝑠𝑖𝑡𝑖𝑜𝑛𝑠
𝑃 (𝑤𝑖∨𝑐 𝑗)
47Social Media Mining Natural Language Processing Slide 47 of 66
Other issues in Classification
• MaxEnt and SVM tend to do better than Naïve Bayes
• Subtlety: makes reviews hard to classify– Perfume review in Perfumes: the Guide
• “If you are reading this because it is your darling fragrance, please wear it at home exclusively, and tape the windows shut.”
48Social Media Mining Natural Language Processing Slide 48 of 66
Thwarted Expectations and Ordering Effects
• “This film should be brilliant. It sounds like a great plot, the actors are first grade, and the supporting cast is good as well, and Stallone is attempting to deliver a good performance. However, it can’t hold up.”
• Well as usual Keanu Reeves is nothing special, but surprisingly, the very talented Laurence Fishburne is not so good either, I was surprised.
49Social Media Mining Natural Language Processing Slide 49 of 66
Sentiment Lexicons
• The General Inquirer, 1966.– Home page: http://www.wjh.harvard.edu/~inquirer– List of Categories:
http://www.wjh.harvard.edu/~inquirer/homecat.htm
– Spreadsheet: http://www.wjh.harvard.edu/~inquirer/inquirerbasic.xls
– Categories:• Positive (1915 words) and Negative (2291 words)• Strong vs. Weak, Active vs. Passive, Overstated vs.
Understated• Pleasure, Pain, Virtue, Vice, Motivation, Cognitive
Orientation, etc.
– Free for Research Use
50Social Media Mining Natural Language Processing Slide 50 of 66
Sentiment Lexicons
• LIWC (Linguistic Inquiry and Word Count), 2007– Home page: http://www.liwc.net/– 2300 words, >70 classes– Affective Processes
• negative emotion (bad, weird, hate, problem, tough)• positive emotion (love, nice, sweet)
– Cognitive Processes• Tentative (maybe, perhaps, guess), Inhibition (block,
constraint)
– Pronouns, Negation (no, never), Quantifiers (few, many)
– $30 or $90 fee
51Social Media Mining Natural Language Processing Slide 51 of 66
Sentiment Lexicons
• MPQA Subjectivity Cues Lexicon (EMNLP’03, 05)– Home page:
http://www.cs.piX.edu/mpqa/subj_lexicon.html• 6885 words from 8221 lemmas• 2718 positive and 4912 negative
– Each word annotated for intensity (strong, weak)
– GNU GPL
52Social Media Mining Natural Language Processing Slide 52 of 66
Sentiment Lexicons
• Bing Liu Opinion Lexicon (KDD‐2004)– Bing Liu's Page on Opinion Mining– http://www.cs.uic.edu/~liub/FBS/opinion‐
lexicon-English.rar– 6786 words: 2006 positive and 4783 negative
53Social Media Mining Natural Language Processing Slide 53 of 66
Sentiment Lexicons
• SentiWordNet (LREC‐2010)– Home page: http://sentiwordnet.isti.cnr.it/– All WordNet synsets automatically annotated
for degrees of positivity, negativity, and neutrality/objectiveness
54Social Media Mining Natural Language Processing Slide 54 of 66
Disagreements between Polarity Lexicons
Christopher Potts, Sentiment Tutorial, 2011
55Social Media Mining Natural Language Processing Slide 55 of 66
Learning Sentiment Lexicons
• Semi-supervised learning of lexicons– Use a small amount of information– A few labeled examples– A few hand‐built patterns– To bootstrap a lexicon
56Social Media Mining Natural Language Processing Slide 56 of 66
Intuition for identifying word polarity (ACL’97)
• Adjectives conjoined by “and” have same polarity– Fair and legitimate, corrupt and brutal– *fair and brutal, *corrupt and legitimate
• Adjectives conjoined by “but” do not– fair but brutal
57Social Media Mining Natural Language Processing Slide 57 of 66
Step 1
• Label seed set of 1336 adjectives (all >20 in 21 million word WSJ corpus)– 657 positive
adequate central clever famous intelligent remarkable reputed sensitive slender thriving…
– 679 negativecontagious drunken ignorant lanky listless primitive strident troublesome unresolved unsuspecting…
58Social Media Mining Natural Language Processing Slide 58 of 66
Step 2
• Expand seed set to conjoined adjectives
59Social Media Mining Natural Language Processing Slide 59 of 66
Step 3
• Supervised classifier assigns “polarity similarity” to each word pair, resulting in graph:
60Social Media Mining Natural Language Processing Slide 60 of 66
Step 4
• Clustering for partitioning the graph into two
61Social Media Mining Natural Language Processing Slide 61 of 66
Output Polarity Lexicon
• Positive– bold decisive disturbing generous good honest
important large mature patient peaceful positive proud sound stimulating straightforward strange talented vigorous witty…
• Negative– ambiguous cautious cynical evasive harmful
hypocritical inefficient insecure irrational irresponsible minor outspoken pleasant reckless risky selfish tedious unsupported vulnerable wasteful…
62Social Media Mining Natural Language Processing Slide 62 of 66
Output Polarity Lexicon
• Positive– bold decisive disturbing generous good
honest important large mature patient peaceful positive proud sound stimulating straightforward strange talented vigorous witty…
• Negative– ambiguous cautious cynical evasive harmful
hypocritical inefficient insecure irrational irresponsible minor outspoken pleasant reckless risky selfish tedious unsupported vulnerable wasteful…
63Social Media Mining Natural Language Processing Slide 63 of 66
Using WordNet to learn polarity
• WordNet: online thesaurus (covered in later• lecture).• Create positive (“good”) and negative seed-
words (“terrible”)• Find Synonyms and Antonyms
– Positive Set: Add synonyms of positive words (“well”) and antonyms of negative words
– Negative Set: Add synonyms of negative words (“awful”) and antonyms of positive words (”evil”)
• Repeat, following chains of synonyms• Filter
64Social Media Mining Natural Language Processing Slide 64 of 66
Turney Algorithm (ACL’02)
1. Extract a phrasal lexicon from reviews2. Learn polarity of each phrase3. Rate a review by the average polarity of
its phrases
65Social Media Mining Natural Language Processing Slide 65 of 66
Summary on Learning Lexicons
• Advantages:– Can be domain‐specific– Can be more robust (more words)
• Intuition– Start with a seed set of words (‘good’, ‘poor’) – Find other words that have similar polarity:
• Using “and” and “but”• Using words that occur nearby in the same
document• Using WordNet synonyms and antonyms
66Social Media Mining Natural Language Processing Slide 66 of 66
Any Question?