Opinion Detection by Transfer Learning
11-742 Information Retrieval Lab
Grace Hui Yang
Advised by Prof. Yiming Yang
Outline
• Introduction
• The Problem
• Transfer Learning by Constructing Informative Prior
• Datasets
• Evaluation Method
• Experimental Results
• Conclusion
Introduction
• TREC 2006 Blog Track
– Opinion Detection Task
<num> Number: 851
<title> "March of the Penguins"
<desc> Description: Provide opinion of the film documentary "March of the Penguins".
<narr> Narrative: Relevant documents should include opinions concerning the film documentary "March of the Penguins". Articles or comments about penguins outside the context of this film documentary are not relevant.
Opinion Detection Literature Review
• Researchers in the Natural Language Processing (NLP) community
– Turney (2002): groups online words by whether their pointwise mutual information is closer to "excellent" or to "poor"
– Riloff & Wiebe (2003): use a high-precision classifier to get high-quality opinions and non-opinions, then extract syntactic patterns; repeat this process to bootstrap
– Pang et al. (2002): treat opinion and sentiment detection as a text classification problem
• Naive Bayes, Maximum Entropy, SVM + unigram presence (82.9%)
– Pang & Lee (2005): use minimum cuts to cluster sentences based on their subjectivity and sentiment orientation
• Researchers from the data mining community
– Morinaga et al. (2002): use word polarity and syntactic pattern-matching rules to extract opinions, and PCA to create a correspondence between product names and keywords
Existing System
• Query Expansion
• Document Retrieval
• Binary Text Classification by Bayesian Logistic Regression
No Available Training Data
• Transfer Learning
– Transfer knowledge across similar tasks in different domains
– Generalize knowledge from limited training data
– Discover underlying general structures across domains
Transfer Learning Literature Review
• Baxter (1997) and Thrun (1996): both used hierarchical Bayesian learning
• Lawrence and Platt (2004), Yu et al. (2005): also use hierarchical Bayesian models to learn the hyper-parameters of a Gaussian process
• Ando and Zhang (2005): proposed a framework for Gaussian logistic regression for text classification.
• Raina et al. (2006): continued this approach and built informative priors for Gaussian logistic regression
Transfer Learning
• The approach presented in this project is inspired by the work of Raina, Ng & Koller (2006) on text classification
• Transfer common knowledge (word dependence) across similar tasks by constructing an informative prior in a Bayesian logistic regression framework
Logistic Regression Framework
• Logistic regression assumes a sigmoid-shaped relation between features and class probability
• To avoid overfitting, a multivariate Gaussian prior is placed on θ
• Maximum a posteriori (MAP) estimation
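A minimal sketch of MAP estimation for logistic regression with a zero-mean Gaussian prior, using plain NumPy gradient ascent. The toy data, learning rate, and iteration count are illustrative, not taken from the project:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_estimate(X, y, Sigma_inv, lr=0.01, iters=2000):
    """MAP estimate of theta: maximize
    sum_i log p(y_i | x_i, theta) - 0.5 * theta^T Sigma^{-1} theta."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ theta)
        # Gradient of the log posterior: likelihood term minus prior term.
        grad = X.T @ (y - p) - Sigma_inv @ theta
        theta += lr * grad
    return theta

# Toy data: feature 0 determines the label, feature 1 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)

# Diagonal Sigma^{-1} recovers the usual ridge-style penalty;
# a non-diagonal Sigma is what the following slides construct.
theta = map_estimate(X, y, Sigma_inv=np.eye(2))
```

With a non-diagonal Σ, the prior term couples coefficients of related words, which is the mechanism the rest of the deck builds.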
Non-diagonal Covariance
• Zero-mean, equal-variance prior
– Cannot capture relationships among words
• Zero-mean, non-diagonal-covariance prior
– Models word dependency in the covariance matrix's off-diagonal entries
Pairwise Covariance
• Covariance Definition:
• Given zero mean,
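The equations elided on this slide can be reconstructed from the standard definition (a reconstruction consistent with the sampling scheme on the next slides, not the original slide formulas):

```latex
\operatorname{Cov}(\theta_i, \theta_j) = E[\theta_i \theta_j] - E[\theta_i]\,E[\theta_j]
% Given zero mean, E[\theta_i] = E[\theta_j] = 0, so
\operatorname{Cov}(\theta_i, \theta_j) = E[\theta_i \theta_j]
  \approx \frac{1}{VT} \sum_{v=1}^{V} \sum_{t=1}^{T} \theta_i^{(v,t)}\,\theta_j^{(v,t)}
```

where $\theta^{(v,t)}$ is the coefficient vector fitted on the $t$-th training set sampled from the $v$-th vocabulary.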
Get Covariance by MCMC
• Markov Chain Monte Carlo (MCMC)
• Sample V (V=4) small vocabularies of size S (S=5), each containing the two words wi and wj corresponding to θi and θj
• From each vocabulary, sample T (T=4) training sets of size Z (Z=3) and train an ordinary logistic regression model on the labeled data
Get Covariance by MCMC
• Subtract a bootstrap estimate of the covariance that is due only to randomness in the choice of training set
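The sampling procedure above can be sketched as follows. This is a simplified stand-in: it resamples training sets and measures the empirical covariance of the fitted coefficients, while the vocabulary sampling and the bootstrap correction are only noted in comments; all sizes and data are illustrative:

```python
import numpy as np

def fit_logreg(X, y, lr=0.05, iters=300):
    # Ordinary logistic regression via (averaged) gradient ascent.
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        theta += lr * (X.T @ (y - p)) / len(y)
    return theta

def pairwise_covariance(X, y, n_samples=20, z=30, seed=0):
    """Train one model per resampled training set and return the
    empirical covariance of the coefficients across models.
    (The slides additionally vary the vocabulary per sample and
    subtract a bootstrap estimate of training-set noise.)"""
    rng = np.random.default_rng(seed)
    thetas = []
    for _ in range(n_samples):
        idx = rng.choice(len(y), size=z, replace=True)
        thetas.append(fit_logreg(X[idx], y[idx]))
    return np.cov(np.array(thetas), rowvar=False)

# Toy corpus: labels depend on the sum of the first two features,
# so their coefficients should co-vary across resampled models.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
C = pairwise_covariance(X, y)
```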
Learning a Covariance Matrix
• Learning a single covariance for each pair of regression coefficients is NOT all we need
• Two challenges:
(1) Valid covariance matrix
– A valid covariance matrix must be positive semi-definite (PSD)
– A Hermitian (square, self-adjoint) matrix with nonnegative eigenvalues
– Project the matrix onto the PSD cone
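The projection onto the PSD cone can be sketched with an eigendecomposition that clips negative eigenvalues to zero (the matrix values below are illustrative):

```python
import numpy as np

def project_to_psd(M):
    """Project a matrix onto the PSD cone: symmetrize, then
    zero out the negative eigenvalues."""
    M = (M + M.T) / 2.0                 # enforce symmetry first
    vals, vecs = np.linalg.eigh(M)
    vals = np.clip(vals, 0.0, None)     # drop negative eigenvalues
    return vecs @ np.diag(vals) @ vecs.T

M = np.array([[2.0, 3.0],
              [3.0, 1.0]])             # symmetric but indefinite
P = project_to_psd(M)                  # nearest PSD matrix in Frobenius norm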
Learning a Covariance Matrix
(2) Pairwise calculation grows quadratically with vocabulary size
– Represent word dependence as a linear combination of underlying features
– Learn the coefficients by least-squares error
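The least-squares step might look like this; the feature vectors f(wi, wj) and target covariances are made-up numbers for illustration:

```python
import numpy as np

# One row per word pair: hypothetical underlying features
# (e.g. log co-occurrence, synonym indicator).
F = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.2, 0.0],
              [0.9, 1.0]])
# Sampled pairwise covariances to be explained by the features.
c = np.array([0.8, 1.1, 0.2, 1.5])

# psi minimizes || F @ psi - c ||^2 over all word pairs,
# so new pairs only need their features, not fresh sampling.
psi, *_ = np.linalg.lstsq(F, c, rcond=None)
pred = F @ psi                      # modelled pairwise covariances
```

This reduces the cost from one sampled covariance per word pair to a handful of coefficients shared by all pairs.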
Learning a Covariance Matrix By Joint Minimization
• λ is the trade-off coefficient between the two objectives
– As λ → 0, only the PSD cone matters
– As λ → 1, only the word-pair relationships matter
– Set to 0.6
Solve the Joint Minimization
• Convex problem; converges to the global minimum
• Fix Σ, minimize over ψ
– Use a quadratic programming (QP) solver
• Fix ψ, minimize over Σ
– A special semi-definite program (SDP)
– Eigendecomposition, keeping the nonnegative eigenvalues
Feature Design
• Model word dependency
– WordNet synsets
– and?
• People do not always use the same general syntactic patterns to express an opinion
– "blah blah is good"
– "awesome blah blah!"
Target-Opinion Word Pair
• Different opinion targets go with different customary expressions
– A person is knowledgeable
– A computer processor is fast
– A computer processor is knowledgeable (ill-formed)
– A person is fast (ill-formed)
– A computer processor is running like a horse (word polarity test fails)
Target-Opinion Word Pair
• From the training corpus, extract from each positive example:
– subject and object (excluding pronouns)
• "Melvin, pig"
– subject and BE-predicate
• "lens, clear", "base, heavy"
– modifier and subject
• "good, coffee", "interesting, movie"
Word Synonym
• Bridge the vocabulary gap between training and testing
– "This movie is good" in the training corpus
– "The film is really good" in the testing corpus
Feature Vector
• Log co-occurrence
• Target-Opinion
• Synonym
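A toy sketch of how the three per-pair features named on this slide could be assembled for one word pair; all counts, pair sets, and synonym lists below are hypothetical:

```python
import math

# Hypothetical statistics collected from a training corpus.
cooccur = {("movie", "good"): 120, ("film", "good"): 95}   # raw counts
target_opinion = {("movie", "good"), ("film", "good")}     # extracted pairs
synonyms = {"movie": {"film"}, "film": {"movie"}}          # e.g. WordNet synsets

def pair_features(w1, w2):
    """Feature vector for a word pair, matching the slide's list."""
    return [
        math.log(cooccur.get((w1, w2), 0) + 1),          # log co-occurrence
        1.0 if (w1, w2) in target_opinion else 0.0,      # target-opinion pair
        1.0 if w2 in synonyms.get(w1, set()) else 0.0,   # synonym
    ]

f = pair_features("movie", "good")
```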
Datasets
• Training Corpus
– Movie reviews [Pang & Lee, Cornell]
• 10,000 sentences (5,000 opinions, 5,000 non-opinions)
– Product reviews [Hu & Liu, UIC]
• 4,000+ sentences (2,034 opinions, 2,173 non-opinions)
• Digital camera, cell phone, DVD player, jukebox, …
Datasets
• Test Corpus
– TREC 2006 Blog corpus
– 3,201,002 articles (TREC reports 3,215,171)
– December 2005 to February 2006
– Technorati, Bloglines, Blogpulse, …
• For each topic, 5,000 passages are retrieved
– Using Lemur as the search engine
– 132,399 passages in total
– 2,648 passages per topic on average
– Each passage is 1-10 sentences (less than 100 words)
Evaluation Method
• Precision at 11-point recall levels
• Mean average precision (MAP)
• Answers are provided by TREC qrels
– Document ids of documents containing an opinion
• Note that our system performs opinion detection at the sentence level
– An averaged score over all the sentences in a retrieved passage
– Extract unique document ids to compare with the TREC qrels
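A minimal sketch of the average-precision computation behind the MAP metric; the document ids and qrels below are hypothetical:

```python
def average_precision(ranked_ids, relevant):
    """AP for one topic: mean of precision@k over the ranks k
    where a relevant document is retrieved."""
    hits, precisions = 0, []
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """MAP: average of per-topic AP values."""
    return sum(average_precision(runs[t], qrels[t]) for t in runs) / len(runs)

# Relevant docs d1 and d3 are retrieved at ranks 1 and 3:
# AP = (1/1 + 2/3) / 2 = 5/6.
ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
map_score = mean_average_precision({"t1": ["d1", "d2", "d3", "d4"]},
                                   {"t1": {"d1", "d3"}})
```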
Experimental Results
• Effects of Using a Non-diagonal Prior Covariance
– Baseline: use movie reviews to train the Gaussian logistic regression model with prior ~N(0, σ²)
– Feature Selection: use word features common to movie reviews and product reviews to train the Gaussian logistic regression model with prior ~N(0, σ²)
– Informative Prior: use movie reviews to calculate the prior covariance, then train the Gaussian logistic regression model with the informative prior ~N(0, Σ)
32% improvement
Experimental Results
• Effects of Feature Design
– Baseline: use movie reviews to train the Gaussian logistic regression model with prior ~N(0, σ²), bi-gram model
– Transfer Learning Using Synonyms: informative prior ~N(0, Σ)
– Transfer Learning Using Target-Opinion Pairs: informative prior ~N(0, Σ)
– Transfer Learning Using Both: informative prior ~N(0, Σ)
A good feature
Experimental Results
• Effects on External Dataset Selection
Negative Effect of Transfer Learning
Why Negative Effect Occurs?
• Movie reviews cover more general topics
• Product reviews share only 23% of the topics
Conclusion
• Applied transfer learning to opinion detection
• Transfer learning with an informative prior improves over brute-force transfer by 32%
• Discovered a good feature for opinion detection
– Target-Opinion pairs
• Need to be careful when choosing external datasets to help
Thank You!