Automatic Domain Adaptive Sentiment Analysis Phase 1
samantha-kelly -
Transcript of Automatic Domain Adaptive Sentiment Analysis Phase 1
Outline
  Introduction
    Problem Definition
    Thesis Statement
    Motivation
  Background and Related Work
    Challenges
    Approaches
  Research Plan
    Approach
    Evaluation
    Timeline
  Conclusion
Problem Definition
Sentiment Analysis is the automatic detection and measurement of sentiment in text segments by machines.
3 Sub Tasks
  Objective vs. Subjective
  Topic Detection
  Positive vs. Negative
Commonly applied to web data
Very Domain Dependent
1. Intro - 2. Related Work - 3. Research Plan - 4. Conclusion
Thesis Statement
This dissertation will develop and evaluate techniques to discover and encode domain-specific, domain-independent, and semantic knowledge to improve both single-domain and multiple-domain sentiment analysis of textual data under low labeled data conditions.
Motivation: Private Sector
Market Research
  Surveys
  Focus Groups
  Feature Analysis
  Customer targeting (free samples, etc.)
Consumer Sentiment Search
  Compare pros and cons
  Overall opinion of products/services
Motivation: Public Sector
Political
  Alternative Polling
  Determine popular support for legislation
  Choose campaign issues
National Security
  Detect individuals at risk for radicalization
  Determine local sentiment about US policy
  Determine local values and sentimental icons
  Portray actions positively using local flavor
Public Health
  Detect potential suicide victims
  Detect mentally unstable people
Challenges
Text Representation
Unedited Text
Sentiment Drift
Negation
Sarcasm
Sentiment Target Identification
Granularity
Domain Dependence
Domain Dependence 1: Domain Dependent Sentiment
The same sentence can mean two very different things in different domains
  Ex: “Read the book.” <= Good for books, bad for movies
  Ex: “Jolting, heart pounding, You’re in for one hell of a bumpy ride!” <= Good for movies and books, bad for cars
Sentimental word associations change with domain
  Fuzzy cameras are bad, but fuzzy teddy bears are good.
  Big trucks are good, but big iPods are bad.
  Bad is bad, but bad villains are good.
Domain Dependence 2: Endless Possibilities
Domain Dependence 3: Organization and Granularity
Theory of the Three Signals
Authors communicate messages using three types of signals
  Domain-Specific Signals
  Domain-Independent Signals
  Semantic Signals
More specific signals are generally more powerful than more generic signals
Domain-Specific Signals
  Dependent on problem and domain
  Considered more useful by readers
  Tell what is good or bad about the topic
  Domain knowledge determines sentiment orientation
  Very strong in context, but weak or misleading out of context
  Can cause overgeneralization error when overvalued
  New domain-specific signal words are ignored in CDT
Examples:
  Fuzzy teddy bears
  Sharp pictures
  Sharp knives
  Smooth rides
  New ideas
  Fast servers
  Fast cars
  Slow roasted burgers
  Slow motion
  Small cameras
  Big cars
Proposed Approach
Sentiment Search is more than just a classification problem
Detecting and Using the three signals
  Dynamic Domain Adapting Classifiers
  Generic Feature Detection using unlabeled data
  Semantic Feature Spaces
Dynamic Domain Adapting Classifiers
A (preferably domain-independent) model is built on a set of labeled data before query time, using computationally intensive algorithms.
Users interact at a query box level
  Query results define the domain of interest
Domain-specific adaptations are calculated
  Compares how the domain of interest differs from known cases
  Uses semantic knowledge about word senses and relations
  Must be a fast algorithm: users are waiting
Domain-specific adaptations are woven into the domain-independent model
  Resulting model is temporary
  Used to classify documents as positive, negative, or objective
Sentimental search results are processed for significant components and presented for human consumption
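The adapt-then-classify flow above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the word-weight dictionaries, the `adapt` and `classify` functions, the additive way adjustments are combined, and the threshold are all hypothetical.

```python
from typing import Dict, List

# Hypothetical domain-independent word weights, built ahead of query time
# from labeled data (positive > 0, negative < 0).
GENERAL_MODEL: Dict[str, float] = {"great": 1.0, "terrible": -1.0, "bad": -1.0}


def adapt(general: Dict[str, float], adjustments: Dict[str, float]) -> Dict[str, float]:
    """Weave domain-specific adjustments into a temporary copy of the general model."""
    temporary = dict(general)  # the general model itself is left unchanged
    for word, delta in adjustments.items():
        temporary[word] = temporary.get(word, 0.0) + delta  # assumed: additive update
    return temporary


def classify(model: Dict[str, float], tokens: List[str], threshold: float = 0.5) -> str:
    """Label a document positive, negative, or objective from summed word weights."""
    score = sum(model.get(token, 0.0) for token in tokens)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "objective"


# Illustrative adjustments for two domains, echoing the "fuzzy" example.
teddy_model = adapt(GENERAL_MODEL, {"fuzzy": 1.0})
camera_model = adapt(GENERAL_MODEL, {"fuzzy": -1.0})
print(classify(teddy_model, ["fuzzy", "teddy", "bear"]))
print(classify(camera_model, ["fuzzy", "picture"]))
```

Keeping the adapted model as a throwaway copy mirrors the "resulting model is temporary" point: each query gets its own adaptation while the expensive general model is reused.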
Overview
[Figure: system overview diagram. Components: Query; Lucene Index; Query Results Define a new Domain; Dynamic Domain Adapter; General Model; Labeled data from known domain; Semantic Knowledge; Context Specific Model; Sentiment Classifier (+ / -); Sentimental Search Results; Component Analysis; Business Intelligence. Key: User Level, Source Data, Knowledge, Labeled Data, Algorithms, Search Results.]
Subjective Context Scoring
Multiply:
  PMI(Word, Context)
  IDF
  Co-occurrence with known generic sentiment seed words times their bias (from movie reviews)
Seeds:
  bad, worst, stupid, ridiculous, terrible, poorly
  great, best, perfect, wonderful, excellent, effective
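One possible reading of this product, as a minimal sketch: the slide does not say how the seed biases are valued or how co-occurrences with several seeds are combined, so the +1/-1 biases and the summation below are assumptions, and `subjective_context_score` is a hypothetical name.

```python
from typing import Dict

# Seed words from the slide. The +1/-1 bias values are placeholders;
# the slide says the real biases come from movie reviews.
SEED_BIAS: Dict[str, float] = {
    **{w: -1.0 for w in ("bad", "worst", "stupid", "ridiculous", "terrible", "poorly")},
    **{w: 1.0 for w in ("great", "best", "perfect", "wonderful", "excellent", "effective")},
}


def subjective_context_score(pmi: float, idf: float, seed_cooccurrence: Dict[str, int]) -> float:
    """Multiply PMI, IDF, and the bias-weighted seed co-occurrence total."""
    # Assumed aggregation: sum of (seed bias * co-occurrence count) over seeds.
    seed_term = sum(
        SEED_BIAS.get(seed, 0.0) * count for seed, count in seed_cooccurrence.items()
    )
    return pmi * idf * seed_term


# A word seen 3 times near "great" and once near "bad" in the query results.
print(subjective_context_score(pmi=2.0, idf=1.5, seed_cooccurrence={"great": 3, "bad": 1}))
```

With this form, the sign of the score comes only from the seed term, while PMI and IDF scale how much the word matters in the current context.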
Rocchio Baseline
Rocchio - Query Expansion algorithm for search
  Similar goals to ours: find more relevant words
  Does not account for sentiment
The new query is a weighted sum of
  Matching document vectors
  Query vector
  Non-matching document vectors (negative value)
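The weighted sum can be sketched in its usual textbook form, where the matching and non-matching documents each contribute through their mean vector. The sparse dictionary representation and the default weights `alpha`, `beta`, and `gamma` are illustrative assumptions; the slides do not give the values used.

```python
from collections import defaultdict
from typing import Dict, List

Vector = Dict[str, float]  # sparse term -> weight


def centroid(docs: List[Vector]) -> Vector:
    """Mean of a list of sparse vectors; empty input gives an empty vector."""
    if not docs:
        return {}
    total: Dict[str, float] = defaultdict(float)
    for doc in docs:
        for term, weight in doc.items():
            total[term] += weight
    return {term: weight / len(docs) for term, weight in total.items()}


def rocchio(
    query: Vector,
    matching: List[Vector],
    non_matching: List[Vector],
    alpha: float = 1.0,
    beta: float = 0.75,
    gamma: float = 0.15,
) -> Vector:
    """Return alpha*query + beta*mean(matching) - gamma*mean(non_matching)."""
    new_query: Dict[str, float] = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for term, weight in centroid(matching).items():
        new_query[term] += beta * weight
    for term, weight in centroid(non_matching).items():
        new_query[term] -= gamma * weight  # non-matching docs push terms down
    return dict(new_query)


expanded = rocchio(
    {"battery": 1.0},
    matching=[{"battery": 1.0, "life": 1.0}, {"battery": 1.0}],
    non_matching=[{"price": 2.0}],
    alpha=1.0, beta=0.5, gamma=0.25,
)
print(expanded)
```

Terms such as "life" enter the expanded query because they are frequent in matching documents, regardless of whether they carry sentiment, which is the limitation noted above.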
Sentimental Context
Components:
  PMI(Word, Context)
  TF
  IDF
  Log(actual co-occurrence of word, seed, context / probability by chance)
Values:
  Abnormality to other docs
  Popular words in context
  Rare words in the corpus
  Words that occur with sentiment words in the query documents
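The PMI component and the log co-occurrence component share one shape: the log of an observed probability over the probability expected by chance. A minimal sketch of PMI from raw counts follows; the count-based estimates, the base-2 logarithm, and the handling of zero counts are assumptions, since the slides do not specify them.

```python
import math


def pmi(joint_count: int, word_count: int, context_count: int, total: int) -> float:
    """Pointwise mutual information: log2 of observed over chance co-occurrence.

    joint_count   -- documents containing the word within the context
    word_count    -- documents containing the word anywhere
    context_count -- documents in the context
    total         -- documents in the whole corpus
    """
    if joint_count == 0:
        return float("-inf")  # never observed together
    p_joint = joint_count / total
    p_chance = (word_count / total) * (context_count / total)
    return math.log2(p_joint / p_chance)


# A word in 20 of 1000 documents, 10 of them inside a 50-document context,
# appears there about ten times more often than chance would predict.
print(round(pmi(10, 20, 50, 1000), 2))
```

A value near zero means the word is no more common in the context than anywhere else; large positive values flag words that are unusually tied to the context.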
Google Hits (Battery Related):
  Query                         good                       bad
  iPod battery                  ~ 13.5 Mill                ~ 900 K
  iPod nano battery             ~ 3 Mill                   ~ 785 K
  iPod shuffle battery          ~ 1.6 Mill                 ~ 230 K
  iPod shuffle battery price    ~ 2.6 Mill (not a typo)    ~ 230 K
  iPod battery price            ~ 13.5 Mill                ~ 850 K
  iPod nano battery price       ~ 3 Mill                   ~ 785 K
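The slide does not say how these counts are meant to be compared. One simple reading, offered only as an assumption, is the ratio of "good" hits to "bad" hits per query; `good_bad_ratio` is a hypothetical helper, and the counts are the approximate figures above.

```python
# Approximate hit counts copied from the counts above: (good, bad).
HITS = {
    "iPod battery": (13_500_000, 900_000),
    "iPod nano battery": (3_000_000, 785_000),
    "iPod shuffle battery": (1_600_000, 230_000),
    "iPod shuffle battery price": (2_600_000, 230_000),
    "iPod battery price": (13_500_000, 850_000),
    "iPod nano battery price": (3_000_000, 785_000),
}


def good_bad_ratio(good: int, bad: int) -> float:
    """Ratio of 'good' hits to 'bad' hits for one query."""
    return good / bad


for query, (good, bad) in HITS.items():
    print(f"{query}: {good_bad_ratio(good, bad):.1f}")
```

Because raw hit counts are rounded and depend on the search engine, ratios like these are at best a rough signal of how sentiment words co-occur with a narrowing context.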
Summary
Interesting problem with many potential applications
Domain dependence is the core challenge
The keys to success are:
  Vast quantities of unlabeled data
  Semantic knowledge from freely available sources
  Semantics must guide and influence but not overrule the statistics