Opinion Detection by Transfer Learning
11-742 Information Retrieval Lab
Grace Hui Yang
Advised by Prof. Yiming Yang
Outline
• Introduction
• The Problem
• Transfer Learning by Constructing Informative Prior
• Datasets
• Evaluation Method
• Experimental Results
• Conclusion
Introduction
• TREC 2006 Blog Track
– Opinion Detection Task
<num> Number: 851
<title> "March of the Penguins"
<desc> Description: Provide opinion of the film documentary "March of the Penguins".
<narr> Narrative: Relevant documents should include opinions concerning the film documentary "March of the Penguins". Articles or comments about penguins outside the context of this film documentary are not relevant.
Opinion Detection Literature Review
• Researchers in the Natural Language Processing (NLP) community
– Turney (2002): groups online words by whether their pointwise mutual information is closer to "excellent" or to "poor"
– Riloff & Wiebe (2003): use a high-precision classifier to get high-quality opinions and non-opinions, then extract syntactic patterns; repeat this process to bootstrap
– Pang et al. (2002): treat opinion and sentiment detection as a text classification problem
• Naive Bayes, Maximum Entropy, SVM + unigram presence (82.9%)
– Pang & Lee (2005): use minimum cuts to cluster sentences based on their subjectivity and sentiment orientation
• Researchers from the data mining community
– Morinaga et al. (2002): use word polarity and syntactic pattern-matching rules to extract opinions, and PCA to create a correspondence between product names and keywords
Existing System
• Query Expansion
• Document Retrieval
• Binary Text Classification by Bayesian Logistic Regression
No Available Training Data
• Transfer Learning
– Transfer knowledge across similar tasks in different domains
– Generalize knowledge from limited training data
– Discover underlying general structures across domains
Transfer Learning Literature Review
• Baxter (1997) and Thrun (1996): both used hierarchical Bayesian learning
• Lawrence and Platt (2004), Yu et al. (2005): also use hierarchical Bayesian models to learn the hyper-parameters of a Gaussian process
• Ando and Zhang (2005): proposed a framework for Gaussian logistic regression for text classification.
• Raina et al. (2006): continued this approach and built informative priors for Gaussian logistic regression
Transfer Learning
• The approach presented in this project is inspired by the work of Raina, Ng & Koller (2006) on text classification
• Transfer common knowledge (word dependence) across similar tasks by constructing an informative prior in a Bayesian logistic regression framework
Logistic Regression Framework
• Logistic regression assumes a sigmoid-shaped relation between features and class probability
• To avoid overfitting, a multivariate Gaussian prior is placed on θ
• Maximum a posteriori (MAP) estimation
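A minimal sketch of MAP estimation for logistic regression with a zero-mean Gaussian prior, using plain NumPy gradient ascent. The toy data, learning rate, and iteration count are illustrative, not taken from the project:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_estimate(X, y, Sigma_inv, lr=0.01, iters=2000):
    """MAP estimate of theta: maximize
    sum_i log p(y_i | x_i, theta) - 0.5 * theta^T Sigma^{-1} theta."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ theta)
        # Gradient of the log posterior: likelihood term minus prior term.
        grad = X.T @ (y - p) - Sigma_inv @ theta
        theta += lr * grad
    return theta

# Toy data: feature 0 determines the label, feature 1 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)

# Diagonal Sigma^{-1} recovers the usual ridge-style penalty;
# a non-diagonal Sigma is what the following slides construct.
theta = map_estimate(X, y, Sigma_inv=np.eye(2))
```

With a non-diagonal Σ, the prior term couples coefficients of related words, which is the mechanism the rest of the deck builds.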
Non-diagonal Covariance
• Zero-mean, equal-variance prior
– Cannot capture relationships among words
• Zero-mean, non-diagonal-covariance prior
– Models word dependency in the covariance matrix's off-diagonal entries
Pairwise Covariance
• Covariance Definition:
• Given zero mean,
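The equations elided on this slide can be reconstructed from the standard definition (a reconstruction consistent with the sampling scheme on the next slides, not the original slide formulas):

```latex
\operatorname{Cov}(\theta_i, \theta_j) = E[\theta_i \theta_j] - E[\theta_i]\,E[\theta_j]
% Given zero mean, E[\theta_i] = E[\theta_j] = 0, so
\operatorname{Cov}(\theta_i, \theta_j) = E[\theta_i \theta_j]
  \approx \frac{1}{VT} \sum_{v=1}^{V} \sum_{t=1}^{T} \theta_i^{(v,t)}\,\theta_j^{(v,t)}
```

where $\theta^{(v,t)}$ is the coefficient vector fitted on the $t$-th training set sampled from the $v$-th vocabulary.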
Get Covariance by MCMC
• Markov Chain Monte Carlo (MCMC)
• Sample V (V=4) small vocabularies of size S (S=5), each containing the two words wi and wj corresponding to θi and θj
• From each vocabulary, sample T (T=4) training sets of size Z (Z=3) and train an ordinary logistic regression model on the labeled data
Get Covariance by MCMC
• Subtract a bootstrap estimate of the covariance that is due only to randomness in the choice of training set
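The sampling procedure above can be sketched as follows. This is a simplified stand-in: it resamples training sets and measures the empirical covariance of the fitted coefficients, while the vocabulary sampling and the bootstrap correction are only noted in comments; all sizes and data are illustrative:

```python
import numpy as np

def fit_logreg(X, y, lr=0.05, iters=300):
    # Ordinary logistic regression via (averaged) gradient ascent.
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        theta += lr * (X.T @ (y - p)) / len(y)
    return theta

def pairwise_covariance(X, y, n_samples=20, z=30, seed=0):
    """Train one model per resampled training set and return the
    empirical covariance of the coefficients across models.
    (The slides additionally vary the vocabulary per sample and
    subtract a bootstrap estimate of training-set noise.)"""
    rng = np.random.default_rng(seed)
    thetas = []
    for _ in range(n_samples):
        idx = rng.choice(len(y), size=z, replace=True)
        thetas.append(fit_logreg(X[idx], y[idx]))
    return np.cov(np.array(thetas), rowvar=False)

# Toy corpus: labels depend on the sum of the first two features,
# so their coefficients should co-vary across resampled models.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
C = pairwise_covariance(X, y)
```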
Learning a Covariance Matrix
• Learning a single covariance for each pair of regression coefficients is NOT all we need
• Two challenges:
(1) Valid covariance matrix
– A valid covariance matrix must be positive semi-definite (PSD)
– A Hermitian (square, self-adjoint) matrix with nonnegative eigenvalues
– Project the matrix onto the PSD cone
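The projection onto the PSD cone can be sketched with an eigendecomposition that clips negative eigenvalues to zero (the matrix values below are illustrative):

```python
import numpy as np

def project_to_psd(M):
    """Project a matrix onto the PSD cone: symmetrize, then
    zero out the negative eigenvalues."""
    M = (M + M.T) / 2.0                 # enforce symmetry first
    vals, vecs = np.linalg.eigh(M)
    vals = np.clip(vals, 0.0, None)     # drop negative eigenvalues
    return vecs @ np.diag(vals) @ vecs.T

M = np.array([[2.0, 3.0],
              [3.0, 1.0]])             # symmetric but indefinite
P = project_to_psd(M)                  # nearest PSD matrix in Frobenius norm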
Learning a Covariance Matrix
(2) Pairwise calculation grows quadratically with vocabulary size
– Represent word dependence as a linear combination of underlying features
– Learn the coefficients by least-squares error
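The least-squares step might look like this; the feature vectors f(wi, wj) and target covariances are made-up numbers for illustration:

```python
import numpy as np

# One row per word pair: hypothetical underlying features
# (e.g. log co-occurrence, synonym indicator).
F = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.2, 0.0],
              [0.9, 1.0]])
# Sampled pairwise covariances to be explained by the features.
c = np.array([0.8, 1.1, 0.2, 1.5])

# psi minimizes || F @ psi - c ||^2 over all word pairs,
# so new pairs only need their features, not fresh sampling.
psi, *_ = np.linalg.lstsq(F, c, rcond=None)
pred = F @ psi                      # modelled pairwise covariances
```

This reduces the cost from one sampled covariance per word pair to a handful of coefficients shared by all pairs.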
Learning a Covariance Matrix By Joint Minimization
• λ is the trade-off coefficient between the two objectives
– As λ → 0, only the PSD cone matters
– As λ → 1, only the word-pair relationships matter
– Set to 0.6
Solve the Joint Minimization
• Convex problem; converges to the global minimum
• Fix Σ, minimize over ψ
– Use a quadratic programming (QP) solver
• Fix ψ, minimize over Σ
– A special semi-definite program (SDP)
– Eigendecomposition, keeping the nonnegative eigenvalues
Feature Design
• Model word dependency
– WordNet synsets
– and?
• People do not always use the same general syntactic patterns to express an opinion
– "blah blah is good"
– "awesome blah blah!"
Target-Opinion Word Pair
• Different opinion targets go with different customary expressions
– A person is knowledgeable
– A computer processor is fast
– A computer processor is knowledgeable (ill-formed)
– A person is fast (ill-formed)
– A computer processor is running like a horse (word polarity test fails)
Target-Opinion Word Pair
• From the training corpus, extract from each positive example:
– subject and object (excluding pronouns)
• "Melvin, pig"
– subject and BE-predicate
• "lens, clear", "base, heavy"
– modifier and subject
• "good, coffee", "interesting, movie"
Word Synonym
• Bridge the vocabulary gap between training and testing
– "This movie is good" in the training corpus
– "The film is really good" in the testing corpus
Feature Vector
• Log co-occurrence
• Target-Opinion
• Synonym
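A toy sketch of how the three per-pair features named on this slide could be assembled for one word pair; all counts, pair sets, and synonym lists below are hypothetical:

```python
import math

# Hypothetical statistics collected from a training corpus.
cooccur = {("movie", "good"): 120, ("film", "good"): 95}   # raw counts
target_opinion = {("movie", "good"), ("film", "good")}     # extracted pairs
synonyms = {"movie": {"film"}, "film": {"movie"}}          # e.g. WordNet synsets

def pair_features(w1, w2):
    """Feature vector for a word pair, matching the slide's list."""
    return [
        math.log(cooccur.get((w1, w2), 0) + 1),          # log co-occurrence
        1.0 if (w1, w2) in target_opinion else 0.0,      # target-opinion pair
        1.0 if w2 in synonyms.get(w1, set()) else 0.0,   # synonym
    ]

f = pair_features("movie", "good")
```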
Datasets
• Training Corpus
– Movie reviews [Pang & Lee, Cornell]
• 10,000 sentences (5,000 opinions, 5,000 non-opinions)
– Product reviews [Hu & Liu, UIC]
• 4,000+ sentences (2,034 opinions, 2,173 non-opinions)
• Digital camera, cell phone, DVD player, jukebox, …
Datasets
• Test Corpus
– TREC 2006 Blog corpus
– 3,201,002 articles (TREC reports 3,215,171)
– December 2005 to February 2006
– Technorati, Bloglines, Blogpulse, …
• For each topic, 5,000 passages are retrieved
– Using Lemur as the search engine
– 132,399 passages in total
– 2,648 passages per topic on average
– Each passage is 1-10 sentences (less than 100 words)
Evaluation Method
• Precision at 11-point recall levels
• Mean average precision (MAP)
• Answers are provided by TREC qrels
– Document ids of documents containing an opinion
• Note that our system performs opinion detection at the sentence level
– An averaged score over all the sentences in a retrieved passage
– Extract unique document ids to compare with the TREC qrels
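A minimal sketch of the average-precision computation behind the MAP metric; the document ids and qrels below are hypothetical:

```python
def average_precision(ranked_ids, relevant):
    """AP for one topic: mean of precision@k over the ranks k
    where a relevant document is retrieved."""
    hits, precisions = 0, []
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """MAP: average of per-topic AP values."""
    return sum(average_precision(runs[t], qrels[t]) for t in runs) / len(runs)

# Relevant docs d1 and d3 are retrieved at ranks 1 and 3:
# AP = (1/1 + 2/3) / 2 = 5/6.
ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
map_score = mean_average_precision({"t1": ["d1", "d2", "d3", "d4"]},
                                   {"t1": {"d1", "d3"}})
```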
Experimental Results
• Effects of Using a Non-diagonal Prior Covariance
– Baseline: use movie reviews to train the Gaussian logistic regression model with prior ~N(0, σ²)
– Feature Selection: use word features common to movie reviews and product reviews to train the Gaussian logistic regression model with prior ~N(0, σ²)
– Informative Prior: use movie reviews to calculate the prior covariance, then train the Gaussian logistic regression model with the informative prior ~N(0, Σ)
32% improvement
Experimental Results
• Effects of Feature Design
– Baseline: use movie reviews to train the Gaussian logistic regression model with prior ~N(0, σ²), bi-gram model
– Transfer Learning Using Synonyms: informative prior ~N(0, Σ)
– Transfer Learning Using Target-Opinion Pairs: informative prior ~N(0, Σ)
– Transfer Learning Using Both: informative prior ~N(0, Σ)
A good feature
Experimental Results
• Effects on External Dataset Selection
Negative Effect of Transfer Learning
Why Negative Effect Occurs?
• Movie reviews cover more general topics
• Product reviews share only 23% of the topics
Conclusion
• Applied transfer learning to opinion detection
• Transfer learning with an informative prior improves over brute-force transfer by 32%
• Discovered a good feature for opinion detection
– Target-Opinion pairs
• Need to be careful when choosing external datasets to help
Thank You!