Fusion with Sentiment Scores for Market Research



Subrata Das

Machine Analytics, Belmont, MA

[email protected]

Arup Das

Alphaserve Technologies, NY & Machine Analytics, MA

adas@{alphaserveit, machineanalytics}.com

Abstract: The recent surge in electronic and social media has led to an explosion of sentiment data embedded in public and private documents, fueling interest in sentiment analysis, especially as individuals, brands and corporations look to manage their reputational risk, which is directly correlated with company performance. In this paper, we describe two approaches to score sentiments from a large unstructured text corpus¹ to fuse with other relevant structured relational data: 1) a simple but effective and fast lexicon-based approach where the score of a document is based on the occurrences of stemmed words representing positive and negative sentiments; and 2) a supervised machine learning approach where the score is derived by making use of a kernel-based classification model created from the training documents. Example applications of these techniques can be found in our text analytics tool called aText, which can compute sentiment scores of product reviews from Amazon and TripAdvisor to gain market insight into products and services. Another example is the computation of sentiment scores using aText for public and private companies from credible financial sources, which are further fused with market data (stock price) to create a composite index for financial analysts and traders.

Keywords: Text Analytics; Sentiment Analysis; Natural Language Processing; Machine Learning

I. INTRODUCTION

Sentiment analysis (aka opinion mining) refers to the identification, extraction, and quantification or scoring of various types of subjective emotions in documents. The recent surge in electronic and social media has led to an explosion of data and opinion, fueling interest in sentiment analysis to aid in market research to promote products. In the financial domain, market movement is largely driven by sentiments. The derived sentiment scores can be fused with other relevant structured relational data for enhanced high-level fusion [2] for market research and prediction. The question is how to compute sentiment given that consumers' and analysts' opinions are expressed in a vast number of blogs, news articles, social media posts, customer feedback, and reviews, some of which are openly available while the rest are company proprietary. Text analytics is a process for analyzing large text corpora to help discover information that is strategic to an organization. For example, text analytics can discover people’s opinions on various blog sites about a company’s new product, or analyze customers’ sentiment from text surveys.

¹ A corpus is a set of documents representing news articles, blogs, emails, reviews, opinions, and such.

A fundamental technology in sentiment analysis applications is classification, that is, labeling documents in a corpus with a predefined set of categories. The most common and primitive labels are “positive” and “negative”, but sentiment can also be labeled at finer levels, for example to express various types of emotions. Most approaches in sentiment analysis use bag-of-words representations [23] where a predefined set of words (i.e., a lexicon) is used to represent a category of emotion. Documents are classified according to the occurrences of the words. The approach is fast but does not take into account the wider context of the occurrence of a word in a document. For contextual consideration, a piece of text or document is converted into a feature vector or other representation that makes its most salient and important features available. Such feature vectors are used to build models that help classify documents into sentiment categories.

In this paper, we describe two approaches to score sentiment from a large unstructured text corpus: 1) a simple but effective and fast lexicon-based approach where the score of a document is based on the occurrences of stemmed words representing positive and negative sentiments; and 2) a supervised machine learning approach where the score is derived by making use of a kernel-based classification model created from labeled training documents. We demonstrate the machine learning approach with a classification task into five rating categories that is similar to aspect prediction. Snyder and Barzilay [31] examined larger reviews in more detail by analyzing the sentiment of multiple aspects of restaurants, such as food or atmosphere. Shimada and Endo [30] have proposed a method based on word variance for the “seeing several stars” rating task. Pappas and Popescu-Belis [24] have proposed a method using multiple-instance learning for aspect rating prediction.

All experiments in this paper were carried out using the Java application programming interface (API) of the in-house text analytics tool aText. The tool automatically analyzes text documents in order to extract actionable intelligence. It employs deep linguistic processing, text classification, and information extraction techniques. Given a corpus containing a set of textual documents, aText automatically extracts triples, summarizes documents, performs sentiment and social network analyses, and classifies documents in both supervised and unsupervised manners. The specific supervised classification techniques of aText that we make use of for the proposed machine learning approach to sentiment scoring are the Naive Bayesian Classifier (NBC) [7][20][4][2], k-dependence NBC [27], and Fisher Kernel (FK) [16] algorithms.

In both of the proposed sentiment scoring approaches, documents are first stemmed before classifying. Stemming [10] is the conflation of the morphological variants of the same word (e.g., application, applied, applying) into a common stem (apply). In most cases, stemming leads to an improvement in classification performance. aText can scrape specific web sites (e.g., Amazon and TripAdvisor) by making use of the underlying format of the pages on those sites.

The rest of the paper is organized as follows. Section II describes the lexicon-based approach. Section III describes the supervised classification-based approach. The concluding section touches on sentiment analysis using a richer lexicon and relates sentiment scores to the history of stock prices.

II. LEXICON BASED SENTIMENT SCORING

Lexicon-based sentiment scoring makes use of a dictionary of words annotated with positive and negative sentiments. A proprietary algorithm produces a positive and a negative sentiment score for each document (the two scores sum to 1.0), as shown in Figure 1. The corpus of documents in this example contains Amazon customer reviews of a specific television brand. The words representing positive (resp. negative) sentiment are highlighted in green (resp. red). Figure 2 shows the overall sentiment of all the downloaded reviews.

Figure 1: Sentiment scoring of individual articles

The top of the split pane on the right in Figure 2 shows the frequencies of the words occurring in the corpus. The two numbers within parentheses corresponding to a word indicate the number of appearances of that word in positive and negative contexts. For example, the word “warranty” (stemmed version is “warranti”) appears about 44% in the positive context and 56% in the negative context. A large negative context will perhaps trigger the manufacturer to look into the item in more detail. The user can then highlight (in cyan) the documents where the term “warranty” is occurring.

As shown in Figure 3, we stem a document before matching its words with the dictionary. Handling negation can be an important concern in sentiment analysis. While the bag-of-words representations of “This is pleasant” and “This is not pleasant” are considered to be very similar by most commonly used similarity measures, the only differing token, the negation term, forces the two sentences into opposite classes. We recognize such negations, highlight them as shown in Figure 3, and weight them appropriately when scoring sentiment.
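The scoring algorithm in aText is proprietary, so the following is only a minimal sketch of the general idea under stated assumptions: the tiny `POSITIVE`, `NEGATIVE`, and `NEGATORS` sets and the `STEMS` lookup table are hypothetical stand-ins for a real annotated dictionary and a real stemmer, and flipping the polarity of a lexicon hit that directly follows a negation term is just one simple way to weight negation.

```python
import re

# Hypothetical stemmed lexicon entries; a real dictionary would be far larger.
POSITIVE = {"pleasant", "great", "fantast"}
NEGATIVE = {"terribl", "rude", "worst"}
NEGATORS = {"not", "no", "never"}
# Toy stand-in for a real stemmer: maps a few surface forms to their stems.
STEMS = {"terrible": "terribl", "fantastic": "fantast"}


def stem(word):
    return STEMS.get(word, word)


def sentiment_scores(text):
    """Return (positive, negative) scores that sum to 1.0.

    A lexicon hit directly preceded by a negation term is counted
    with the opposite polarity.
    """
    tokens = [stem(t) for t in re.findall(r"[a-z']+", text.lower())]
    pos = neg = 0
    for i, tok in enumerate(tokens):
        negated = i > 0 and tokens[i - 1] in NEGATORS
        if tok in POSITIVE:
            if negated:
                neg += 1
            else:
                pos += 1
        elif tok in NEGATIVE:
            if negated:
                pos += 1
            else:
                neg += 1
    total = pos + neg
    if total == 0:
        return 0.5, 0.5  # no lexicon hits: treat the document as neutral
    return pos / total, neg / total
```

With this sketch, “This is pleasant” and “This is not pleasant” land in opposite classes even though their bags of words differ by a single token.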

Figure 2: Overall sentiment score and context

Figure 3: Document stemming

III. SUPERVISED CLASSIFICATION TECHNIQUES FOR SENTIMENT SCORING

Our objective is to develop a supervised clustering and classification technique that assigns a document to one of several sentiment categories, such as positive vs. negative or a rating between 1 and 5. Most traditional clustering techniques, such as feed-forward and supervised neural networks, rely on carefully crafted data models in terms of fixed-length vector structures of ordered n-tuples. Each component in a vector represents some feature of an object from the underlying problem domain. One of the early approaches to supervised text classification with successful applications to information retrieval, Latent Semantic Analysis (LSA) [9], constructs feature vectors from the terms occurring in documents. Such vectors become very high-dimensional to account for every term occurring in the corpus. A similarity measure between two vectors (usually the cosine of their contained angle in the semantic space) is defined to cluster the vectors representing a text corpus of documents. LSA, which is based on Singular Value Decomposition (SVD), attempts to solve the synonymy and polysemy problems to match documents by taking advantage of the implicit higher-order structure of the association of terms with articles to create a multi-dimensional semantic structure. Other notable developments for text classification are the unsupervised but generative Latent Dirichlet Allocation (LDA) [1], and probabilistic Latent Semantic Analysis (pLSA) [15] and its hierarchical extension [12]. High dimensionality remains a problem for these techniques. In general, discriminative techniques perform better than generative ones by learning only classifier functions, as opposed to learning explicit relations among variables via joint probability distributions to facilitate sampling.

In this section, we present a hybrid approach to text document classification that leverages the better performance of discriminative classifiers and the models of generative classifiers, which can be visually inspected and adjusted by human experts. The proposed approach, kNBC/FK, trains a generative graphical probabilistic model, called the k-dependence Naïve Bayesian Classifier (kNBC) [27], and then derives a Fisher Kernel (FK) [16] from the model to incorporate into a discriminative classifier. A kNBC model overcomes the strong conditional independence assumption of simple NBC (k = 0) by capturing relevant feature dependencies that exist in a corpus. We therefore expect a classifier to achieve optimal Bayesian accuracy if the right dependencies are set in the model. The TAN algorithm in [11], which induces conditional trees, generates optimal 1-dependence Bayesian classifiers.

FK with respect to a generative model compares two data points through the directions in which they ‘stretch’ the parameters of the model. This is achieved by comparing the two gradients of the derived score vectors at the two points as a function of the parameters. Thus the derived score vector corresponding to a sample text document explains how much parameters of the kNBC model contribute to generate the example, enabling the kernel approach to compare two documents with different numbers of features via any discriminant classifier. Our approach is suitable for handling a very high-dimensional feature space by discarding irrelevant features based on a mutual information measure during the kNBC model construction process without compromising the classifier quality.

This derivation of FK from a kNBC model seems to be the first in the literature. In principle, FK can be derived for any generative model with a differentiable likelihood function. Shi et al. [28] derived FK for an NBC model. Denoyer and Gallinari [6] developed a specialized Bayesian network for structured document classification; this generative model was then transformed into a discriminant classifier using the method of FK. Sewell [29] trained generative hidden Markov models on market data to derive an FK for a discriminative SVM. Nicotra et al. [22] extracted FK from a Hidden Tree Markov Model. Holub et al. [14] chose a simplified probabilistic Constellation model to derive FK, showing strong performance improvements for classification tasks over the corresponding generative approach. Dick and Kersting [8] developed FKs for relational data and empirically showed performance improvements over the results achieved without FKs. They used Bayesian logic programs as the relational model; such models integrate definite logic programs with Bayesian belief networks. Perronnin and Dance [26] proposed a framework for image categorization where the underlying generative model is a Gaussian mixture model approximating the distribution of low-level features in images from a visual vocabulary.

The experimental evaluation is carried out with the well-known TripAdvisor collection. We apply natural language processing techniques to preprocess these collections, including stemming and XML file parsing, before applying the techniques. We show a comparable and sometimes improved performance over the baseline discriminative SVM-based classification.

The rest of the section is organized as follows. Subsection A provides kNBC background. Subsection B details an algorithm for constructing kNBC using a mutual information measure. Subsection C provides FK background and derives the FK for a kNBC model. Subsection D details the nature of the corpus that we have selected for empirical evaluation of the proposed hybrid approach. Section 5 presents the detailed evaluation results and analyses.

A. k-Dependence Naïve Bayesian Classifier (kNBC)

A kNBC [27], as shown in Figure 4, is a Bayesian network [25][18] which contains the structure of the NBC [7][20][4][2] and allows each feature $v_i$ to have a maximum of k feature nodes as parents, where the features $v_j$ are tokens in document d. By varying the value of k one can define models that smoothly move along the spectrum of feature dependence.

Figure 4: Generic structure of a k-NBC (class variable with states $c_1, \ldots, c_n$; feature variables $v_1, v_2, v_3, \ldots, v_n$)

Let d be a document that we want to classify, and let the given set of classes be $C = \{c_1, \ldots, c_n\}$. We want to compute $p(c_i \mid d)$ for every i:

$$p(c_i \mid d) = \frac{p(c_i)\, p(d \mid c_i)}{p(d)} = \frac{p(c_i) \prod_j p(v_j \mid c_i, \pi(v_j))}{\sum_{k=1}^{n} p(c_k) \prod_j p(v_j \mid c_k, \pi(v_j))}$$

where $\pi(v_j)$ are the parents of $v_j$. Note that the computation of the posterior $p(c_i \mid d)$ after propagation of evidence e involves only a multiplication of the relevant entries from the probability tables, without requiring full belief propagation as in Bayesian networks. One requires the prior and conditional probabilities $p(c_i)$ and $p(v_j \mid c_i)$, which can either be obtained from domain experts or determined based on the keyword frequencies in documents. In NBC, the product of conditional probabilities comes from the assumption that tokens in a document are independent given the document class. This conditional independence assumption of features does not hold in most cases. For example, word co-occurrence is a commonly used feature for text classification.

We do not need the estimated posterior $p(c_i \mid d)$ to be correct. Instead, we only need

$$\arg\max_{c_i} p(c_i \mid d) = \arg\max_{c_i} p(c_i) \prod_j p(v_j \mid c_i)$$

The score for each class can be expressed in the following tractable form for analytical purposes:

$$\log p(c_i) + \sum_j \log p(v_j \mid c_i)$$

The score is not a probability value, but is sufficient for the purpose of determining the most probable class. It reduces round-off errors due to a product of small fractions caused by a large number of tokens.
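The log-score can be illustrated with a small sketch. The probability tables below are hypothetical numbers chosen for illustration, and `nbc_log_scores` is an illustrative helper name, not part of aText. The last lines contrast a long document whose raw probability product underflows to zero with its log-score, which stays finite.

```python
import math


def nbc_log_scores(tokens, priors, cond):
    """Score each class as log p(c) + sum_j log p(v_j | c)."""
    scores = {}
    for c, prior in priors.items():
        scores[c] = math.log(prior) + sum(math.log(cond[c][t]) for t in tokens)
    return scores


# Hypothetical probability tables for two classes and three stemmed tokens.
priors = {"pos": 0.6, "neg": 0.4}
cond = {
    "pos": {"clean": 0.5, "rude": 0.1, "walk": 0.4},
    "neg": {"clean": 0.2, "rude": 0.6, "walk": 0.2},
}

scores = nbc_log_scores(["rude", "rude", "walk"], priors, cond)
best = max(scores, key=scores.get)  # class with the highest log-score

# A long document: the raw product of small fractions underflows,
# while the sum of logarithms remains finite and comparable.
long_doc = ["rude"] * 400
raw_product = math.prod(cond["pos"][t] for t in long_doc)
long_scores = nbc_log_scores(long_doc, priors, cond)
```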

An example kNBC is shown in Figure 5, which is based on a ski-related document corpus of web pages. Some pages are advertisements for “shops”, some describe “resorts”, and the rest are categorized as “other”, containing articles, events, results, etc. The mutually exclusive and exhaustive set of hypotheses is the three classification classes of documents, and each child node of the network corresponds to a keyword as target attribute. In a kNBC structure of Figure 4, an edge from $v_i$ to $v_j$ implies that the influence of $v_i$ on the assessment of the class variable also depends on the value of $v_j$. For example, in Figure 5, the influence of the attribute “brand” on the class DocType (C) depends on the value of “ski,” while in the equivalent NBC (i.e., without the edges among children) the influence of each attribute on the class variable is independent of other attributes. These additional edges among children in a kNBC affect the classification process in that a value of “brand” that is typically surprising (i.e., $p(\text{brand} \mid C)$ is low) may be unsurprising if the value of its correlated attribute, “ski,” is also unlikely (i.e., $p(\text{brand} \mid C, \text{ski})$ is high). In this situation, the NBC will overpenalize the probability of the class variable by considering two unlikely observations, while the augmented network of Figure 5 will not.

More concretely, in a suitably constructed corpus with distribution of documents among the three categories shop, resort and other as 60%, 30% and 10%, the posterior probability distribution of the class variable in the equivalent NBC given that a document has only “ski” and “brand” keywords is as follows:

$$p(\text{DocType} = \text{shop} \mid \text{ski}, \text{brand}, \neg\text{slope}) = 0.91$$
$$p(\text{DocType} = \text{resort} \mid \text{ski}, \text{brand}, \neg\text{slope}) = 0.08$$
$$p(\text{DocType} = \text{other} \mid \text{ski}, \text{brand}, \neg\text{slope}) = 0.01$$

Figure 5: k-NBC for document classification (class node DocType (C) with states $c_1$ = shop, $c_2$ = resort, $c_3$ = other; feature nodes “price” ($v_1$), “ski” ($v_2$), “brand” ($v_3$), “slope” ($v_4$))

While computing conditional probabilities from the frequency of occurrences, one would expect $p(\text{brand} \mid \text{shop}, \text{ski})$ to be higher than $p(\text{brand} \mid \text{resort}, \text{ski})$ since a web page for a ski shop is more likely to mention the keyword “brand” than a web page of a ski resort. Similarly, $p(\text{slope} \mid \text{resort}, \text{ski})$ is likely to be higher than $p(\text{slope} \mid \text{shop}, \text{ski})$. These kinds of dependencies are not captured in an NBC. In the kNBC, the probability distribution among the hypotheses is as follows, due to the presence of the keywords ski and brand in a web page but absence of the keyword slope:

$$p(\text{DocType} = \text{shop} \mid \text{ski}, \text{brand}, \neg\text{slope}) = 0.99$$
$$p(\text{DocType} = \text{resort} \mid \text{ski}, \text{brand}, \neg\text{slope}) = 0.01$$
$$p(\text{DocType} = \text{other} \mid \text{ski}, \text{brand}, \neg\text{slope}) = 0.00$$

Note here the enhanced disambiguation in classification as compared to (0.91, 0.08, 0.01) obtained from the NBC presented earlier for the same evidence.
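The kNBC posterior for this kind of evidence can be sketched as a product of table entries followed by normalization. The class priors below are the 60/30/10 split quoted in the text; every conditional probability is a hypothetical value chosen for illustration (the tables behind the 0.99 / 0.01 / 0.00 result are not given in the paper), and the structure assumes “ski” is the only parent of “brand” and “slope”.

```python
def knbc_posterior(evidence, priors, cpts, parents):
    """Posterior over classes: p(c) * prod_j p(y_j | c, parent values), normalized.

    cpts[feature][(c, parent_values)] holds p(feature present | c, parent_values).
    """
    joint = {}
    for c, prior in priors.items():
        p = prior
        for feat, present in evidence.items():
            parent_values = tuple(evidence[q] for q in parents[feat])
            p_true = cpts[feat][(c, parent_values)]
            p *= p_true if present else 1.0 - p_true
        joint[c] = p
    z = sum(joint.values())
    return {c: p / z for c, p in joint.items()}


priors = {"shop": 0.6, "resort": 0.3, "other": 0.1}  # from the text
parents = {"ski": (), "brand": ("ski",), "slope": ("ski",)}  # assumed structure
cpts = {  # hypothetical conditional probability tables
    "ski": {("shop", ()): 0.8, ("resort", ()): 0.8, ("other", ()): 0.3},
    "brand": {
        ("shop", (True,)): 0.7, ("shop", (False,)): 0.3,
        ("resort", (True,)): 0.1, ("resort", (False,)): 0.05,
        ("other", (True,)): 0.1, ("other", (False,)): 0.05,
    },
    "slope": {
        ("shop", (True,)): 0.1, ("shop", (False,)): 0.05,
        ("resort", (True,)): 0.8, ("resort", (False,)): 0.3,
        ("other", (True,)): 0.2, ("other", (False,)): 0.05,
    },
}

# "ski" and "brand" present, "slope" absent.
evidence = {"ski": True, "brand": True, "slope": False}
posterior = knbc_posterior(evidence, priors, cpts, parents)
```

Setting every entry of `parents` to an empty tuple (and keying the tables by class only) would recover the plain NBC computation.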

B. Algorithm for Constructing kNBC

The algorithm for constructing a kNBC is provided with a set of labeled training instances, each belonging to a class in C, and the value of k for the maximum allowable degree of feature dependence. It outputs a kNBC model with conditional probability tables determined from the input data. The structural simplicity of kNBC (and hence NBC) and the completeness of the input labeled instances avoid the need for the complex algorithms used for learning structure and parameters in Bayesian networks [13][21]. The algorithm here makes use of the following mutual information between two variables X and Y when selecting the order of child nodes and the k parent nodes of a child.

$$I(X; Y) = \sum_{X, Y} p(X, Y) \log \frac{p(X, Y)}{p(X)\, p(Y)}$$

The probabilities in this formula are determined by counting the number of individual and pair-wise joint occurrences of the variables in the articles.

Algorithm

– Let the used variable list S be empty.

– Let the k-dependence network BN being constructed begin with a single class node C.

– Repeat until S includes all domain features (i.e., the vocabulary containing all the terms):

    • Select feature $X_{max}$ which is not in S and has the largest value $I(X_{max}; C)$.

    • Add a node to BN representing $X_{max}$.

    • Add an arc from C to $X_{max}$ in BN.

    • Add $m = \min(|S|, k)$ arcs from m distinct features $X_j$ in S with the highest value for $I(X_{max}; X_j \mid C)$.

    • Add $X_{max}$ to S.

– Compute the conditional probability tables inferred by the structure of BN by using counts from input instances and output BN.
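The structure-selection loop above can be sketched as follows, assuming binary presence/absence features stored as one column per stemmed word. The function names and the toy columns are hypothetical, the conditional mutual information is estimated as the class-weighted average of per-class mutual information, and the final step (filling in the conditional probability tables from counts) is omitted.

```python
import math
from collections import Counter


def mutual_info(xs, ys):
    """I(X;Y) = sum_{x,y} p(x,y) log(p(x,y) / (p(x) p(y))), estimated from counts."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (nxy / n) * math.log(nxy * n / (px[x] * py[y]))
        for (x, y), nxy in pxy.items()
    )


def cond_mutual_info(xs, ys, cs):
    """I(X;Y|C) = sum_c p(c) * I(X;Y | C = c)."""
    n = len(cs)
    total = 0.0
    for c, nc in Counter(cs).items():
        idx = [i for i, ci in enumerate(cs) if ci == c]
        total += (nc / n) * mutual_info([xs[i] for i in idx], [ys[i] for i in idx])
    return total


def build_knbc(columns, labels, k):
    """Return {feature: [parent features]}; the class node is an implicit parent of all."""
    used, parents = [], {}
    remaining = set(columns)
    while remaining:
        # Feature not yet used with the largest I(X; C).
        x_max = max(sorted(remaining), key=lambda f: mutual_info(columns[f], labels))
        # Up to k parents among used features, ranked by I(X_max; X_j | C).
        m = min(len(used), k)
        ranked = sorted(
            used,
            key=lambda f: cond_mutual_info(columns[x_max], columns[f], labels),
            reverse=True,
        )
        parents[x_max] = ranked[:m]
        used.append(x_max)
        remaining.remove(x_max)
    return parents


# Toy corpus of eight reviews: 1 = stemmed word present, 0 = absent.
labels = [1, 1, 1, 1, 5, 5, 5, 5]
columns = {
    "terribl": [1, 1, 1, 0, 0, 0, 0, 0],
    "rude": [1, 0, 0, 0, 0, 0, 0, 0],
    "clean": [0, 0, 0, 0, 1, 1, 0, 0],
}
structure = build_knbc(columns, labels, k=2)
```

Features are visited in decreasing order of their mutual information with the class, so the earliest nodes have fewer than k candidate parents.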

C. Fisher Kernel (FK)

Jaakkola and Haussler [16] first introduced the notion of FK to enable one to compare two incomplete data items with different numbers of features via any classical discriminant classifier. FK with respect to a generative model compares two data points through the directions in which they ‘stretch’ the parameters of the model. This is achieved by comparing the two gradients of the derived score vectors at the two points as a function of the parameters. A representative score vector of fixed length for each data item x is first derived as follows.

The log-likelihood of a data item x with respect to a generative model M with parameters $\theta = (\theta_1, \ldots, \theta_n)$ is defined as

$$\log L_{\theta}^{M}(x)$$

The Fisher score of a data item x with respect to a generative model M with parameters $\theta$ is defined as

$$f(M, x) = \nabla_{\theta} \log L_{\theta}^{M}(x) = \left( \nabla_{\theta_i} \log L_{\theta}^{M}(x) \right)_{i=1}^{n}$$

where $\nabla_{\theta_i}$ is the gradient operator with respect to the parameter $\theta_i$. Intuitively, the score vector explains how much the parameters of the model contribute to generating the example. The Fisher information matrix with respect to a generative model M with parameters $\theta$ is defined as

$$I_M = E\left[ f(M, x)\, f(M, x)^T \right]$$

where the expectation is over the generation of the data point x. The Fisher information kernel with respect to a generative model M with parameters $\theta$ is defined as

$$\kappa(x, y) = f(M, x)^T\, I_M^{-1}\, f(M, y)$$

This kernel defines a distance between two data points x and y. This kernel function can be used with any kernel-based classifier, such as the support vector machine. We will make use of the practical FK

$$\kappa(x, y) = f(M, x)^T f(M, y)$$

and the other types of kernels that the SVM package offers when clustering Fisher vectors.
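The three definitions (score, information matrix, kernel) can be traced on the simplest possible generative model. The sketch below assumes a one-parameter Bernoulli model, chosen only because its score and Fisher information have closed forms; it is not the kNBC model, and the function names are illustrative.

```python
def fisher_score(x, theta):
    """Derivative in theta of log L(x) = x*log(theta) + (1 - x)*log(1 - theta)."""
    return x / theta - (1 - x) / (1 - theta)


def fisher_information(theta):
    """E[f(x)^2], with the expectation taken over x ~ Bernoulli(theta)."""
    return (theta * fisher_score(1, theta) ** 2
            + (1 - theta) * fisher_score(0, theta) ** 2)


def fisher_kernel(x, y, theta):
    """f(x) * I^{-1} * f(y); the matrix inverse is a scalar division here."""
    return fisher_score(x, theta) * fisher_score(y, theta) / fisher_information(theta)


def practical_kernel(x, y, theta):
    """Practical FK: plain inner product of the two score vectors."""
    return fisher_score(x, theta) * fisher_score(y, theta)


theta = 0.25
gram = [[fisher_kernel(x, y, theta) for y in (0, 1)] for x in (0, 1)]
```

With more than one parameter the scores become vectors and the scalar division becomes a multiplication by the inverse information matrix; the practical FK drops that factor and reduces to a dot product, which any linear-kernel SVM implementation can consume directly.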

3.1. Derivation of FK for kNBC

We provide a mathematical derivation of the FK of kNBC models (some intermediate steps are omitted due to space limitations). Assume that C is the set $\{c_1, \ldots, c_m\}$ of mutually exclusive and exhaustive classes of the root node and X is the set $\{x_1, \ldots, x_n\}$ of feature nodes of the k-NBC. We also assume that an arbitrary combination of parent states of the node $x_i$ is denoted as $\pi^*(x_i)$ (there are $2^k$ of such if k is the number of parents of $x_i$). Evidence $e = \{y_1, \ldots, y_n\}$ is received on nodes $x_1, \ldots, x_n$, where each $y_i$ is categorical ($y_i = x_i$ or $\neg x_i$, i.e., $x_i$ is either true or false). The derivation below assumes that there is either positive or negative categorical evidence on every child node $x_i$ of $x_1, \ldots, x_n$, corresponding to whether the word $x_i$ is present or absent in the input document representing evidence e. The derivation is similar in case some of the child nodes are left uncertain or evidence is non-categorical.

Consider the following derivation of the likelihood in terms of the parameters $\theta$ of a kNBC model M.

$$
\begin{aligned}
P_M(e \mid \theta) &= \sum_{i=1}^{m} P_M(e, c_i) = \sum_{i=1}^{m} P_M(c_i)\, P_M(e \mid c_i) \\
&= \sum_{i=1}^{m} P_M(c_i)\, P_M(y_1, \ldots, y_n \mid c_i) \\
&= \sum_{i=1}^{m} P_M(c_i) \prod_{j=1}^{n} P_M(y_j \mid c_i, y_1, \ldots, y_{j-1}) \\
&= \sum_{i=1}^{m} P_M(c_i) \prod_{j=1}^{n} P_M(y_j \mid c_i, \pi_e(x_j))
\end{aligned}
$$

Here $\pi_e(x_j)$ denotes the parent states of $x_j$ under the evidence e. The last line follows from the fact that, given a class $c_i$, $x_j$ is independent of the non-parent nodes $\{y_1, \ldots, y_{j-1}\} \setminus \pi_e(x_j)$ and that the parent nodes of $x_j$ are only in $\{y_1, \ldots, y_{j-1}\}$ as per the structure of a kNBC model. The parameters $P_M(c_i)$ are the probability distribution of the root node and the parameters $P_M(y_j \mid c_i, \pi_e(x_j))$ are the conditional probabilities of the child nodes. Hence,

$$\sum_{i} P_M(c_i) = 1$$

$$P_M(x_j \mid c_i, \pi_k(x_j)) + P_M(\neg x_j \mid c_i, \pi_k(x_j)) = 1, \quad \text{i.e.,} \quad \sum_{y_j \in \{x_j, \neg x_j\}} P_M(y_j \mid c_i, \pi_k(x_j)) = 1$$

for all $c_i$. We now compute the partial derivatives of the log-likelihood function with respect to each conditional probability $P_M(y_j \mid c_k, \pi^*(x_j))$, where $\pi^*(x_j)$ is an arbitrary combination of the parent states and $y_j$ is a variable with domain $\{x_j, \neg x_j\}$.

$$
\begin{aligned}
\frac{\partial \log P_M(e \mid \theta)}{\partial P_M(y_j \mid c_k, \pi^*(x_j))}
&= \frac{\partial \log P_M(e \mid \theta)}{\partial P_M(y_j \mid c_k, \pi_e(x_j))} \\
&= \frac{1}{P_M(e \mid \theta)} \sum_{i=1}^{m} P_M(c_i) \sum_{l=1}^{n} \prod_{p \neq l} P_M(y_p \mid c_i, \pi_e(x_p))\; \delta(c_i, c_k)\, \delta(y_l, y_j)
\end{aligned}
$$

where $\delta(x, y) = 1$ if $x = y$, else 0. Pushing the second summation inside the product, we obtain the result below after a simple rearrangement.

$$\frac{\partial \log P_M(e \mid \theta)}{\partial P_M(y_j \mid c_k, \pi^*(x_j))} = \frac{P_M(c_k \mid e, \theta)}{P_M(y_j \mid c_k, \pi_e(x_j))}$$

Note that $\pi_e(x_j)$ is unique given e and hence we will have only two partial derivatives irrespective of the number of combinations of parents. The partial derivative with respect to each prior probability $P_M(c_i)$ of the classes (only $m - 1$ probabilities are independent) is the following:

$$\frac{\partial \log P_M(e \mid \theta)}{\partial P_M(c_k)} = \frac{P_M(c_k \mid e, \theta)}{P_M(c_k)} - \frac{P_M(c_1 \mid e, \theta)}{P_M(c_1)}, \quad k = 2, \ldots, m$$

As mentioned earlier, the computation of the posterior $P_M(c_k \mid e, \theta)$ in the above expressions after propagation of evidence e is just a multiplication of the relevant entries from the probability tables of the model.
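The two closed-form derivatives can be checked numerically on a small model. The sketch below assumes the simplest case, a two-class NBC with no feature parents (k = 0), so that $\pi_e(x_j)$ is empty; all table entries are hypothetical, and `numeric_conditional` is a finite-difference helper introduced here only for comparison with the closed form.

```python
import math

# Hypothetical two-class, two-feature model with no feature parents (k = 0).
priors = {"c1": 0.7, "c2": 0.3}
cond = {  # cond[c][j][y] = P(y_j = y | c)
    "c1": [{True: 0.2, False: 0.8}, {True: 0.6, False: 0.4}],
    "c2": [{True: 0.9, False: 0.1}, {True: 0.3, False: 0.7}],
}
evidence = (True, False)  # x_1 present, x_2 absent


def likelihood(priors, cond, evidence):
    """P(e) = sum_i P(c_i) * prod_j P(y_j | c_i)."""
    return sum(
        priors[c] * math.prod(cond[c][j][y] for j, y in enumerate(evidence))
        for c in priors
    )


def class_posterior(priors, cond, evidence, k):
    """P(c_k | e): a product of table entries, normalized by P(e)."""
    joint = priors[k] * math.prod(cond[k][j][y] for j, y in enumerate(evidence))
    return joint / likelihood(priors, cond, evidence)


def score_conditional(priors, cond, evidence, k, j):
    """Closed form P(c_k | e) / P(y_j | c_k) for the table entry matched by e."""
    return class_posterior(priors, cond, evidence, k) / cond[k][j][evidence[j]]


def score_prior(priors, cond, evidence, k, ref="c1"):
    """Closed form P(c_k|e)/P(c_k) - P(c_ref|e)/P(c_ref).

    The reference class c_ref absorbs the sum-to-one constraint on the priors.
    """
    return (class_posterior(priors, cond, evidence, k) / priors[k]
            - class_posterior(priors, cond, evidence, ref) / priors[ref])


def numeric_conditional(priors, cond, evidence, k, j, h=1e-6):
    """Central finite difference of log P(e) in the single entry P(y_j | c_k)."""
    def log_lik(delta):
        bumped = {c: [dict(t) for t in tables] for c, tables in cond.items()}
        bumped[k][j][evidence[j]] += delta
        return math.log(likelihood(priors, bumped, evidence))
    return (log_lik(h) - log_lik(-h)) / (2 * h)
```

Stacking `score_conditional` over the table entries matched by the evidence and `score_prior` over classes 2 to m gives a fixed-length Fisher score vector for a document, which is what the practical FK takes inner products of.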

D. Experimental Setting

In this section we present details of the TripAdvisor collection and various statistics pertaining to the preprocessed collection. The TripAdvisor corpus was built by Baccianella et al. and consists of 15,763 hotel reviews from the TripAdvisor Web site (http://www.tripadvisor.com), a popular site for reviewing tourism-related activities. Each review is labeled with a score of one to five “stars”. Figure 6 shows three samples with 5, 3 and 1 stars. Note the usage of highly discriminatory words like “fantastic”, “okay” and “terrible” in coherence with the degree of ratings.

We have used 10,508 documents for training and 5,255 documents for testing. The training set contains 23,341 unique stemmed words.

Figure 6: Example TripAdvisor Reviews

The distribution of labels is highly skewed, since 44% of all the training articles have a global score of 5 stars, 34.8% a global score of 4 stars, 10% 3 stars, 7.1% 2 stars and only 4.1% 1 star. Test articles have a similar distribution.

| Rating (stars) | Number of training articles |
|---|---|
| 5 | 4630 |
| 4 | 3643 |
| 3 | 1052 |
| 2 | 752 |
| 1 | 431 |

Table 1: Distribution of TripAdvisor training articles

This kind of skewed distribution tends to make the

classification task for the least frequent scores difficult.

Figure 7 shows a fragment of the kNBC for the TripAdvisor

corpus showing dependence between stemmed words. As

expected, the word “terrible” has dependence on words such

as “rude” and “worst” and the word “comfort” has

dependence on words such as “clean” and “walk”.

Figure 7: Example kNBC (k = 2) dependencies for the TripAdvisor Corpus (class variable with values 5, 4, 3, 2, 1; child nodes include the stemmed words “worst”, “rude”, “terribl”, “clean”, “walk” and “comfort”)

Figure 8 shows a screenshot for category prediction in aText

using NBC. The corresponding ground truth is shown in the

popup dialog window.

Number of articles (or reviews): 15763; Number of stemmed words: 23341; Number of topics: 5; Total number of training articles: 10508; Total number of test articles: 5255

253154_3638452 Terrible experience The twin room booked turned out to be a single with single bed and camp bed By our third night we had an ant infestation which the management were unwilling to deal with Eventually I had to spray the room and wait till 1 30am before being able to return We left _PROS_Nothing _CONS_Nothing 1

274573_3994017 Perfectly okay The Hotel Suisse was just okay Daniela was very helpful but the other people at the front desk were not We found their attitude less than desirable The rooms were clean and spacious Breakfast delivered to the room was a bit ackward but certainly doable The location to the Spanish steps was quite helpful easy access to the Metro and a taxi stand They also asked for 1 night stay in cash which we thought was odd _PROS_Nothing _CONS_Nothing 3

203223_3024338 Fantastic For the money the location is fantastic It is about 300 yards from a metro station From the hotel you can walk to the colosseum you can practically see it when you leave the hotel Yes the rooms are small but the hotel has great character I would definitely stay there again and would recommend it to others _PROS_Nothing _CONS_Nothing 5


Figure 8: Predicting review category

E. Evaluation Results

The performance analysis of kNBC and FK on the TripAdvisor corpus is presented in this section. To reduce the complexity of dealing with thousands of children in a kNBC model, we discarded all children sharing very low mutual information with the root node, using the formula for mutual information $I(X;Y)$ presented earlier. In the process we verified experimentally that the discarded children contribute little or nothing to performance. We then computed the performances of kNBC and FK by varying the value of the dependence degree k, and compared them against the baseline SVM.

As shown in the table below, the performance for kNBC (k = 0) does not degrade as the number of child nodes is progressively reduced by increasing the mutual information threshold (Hit Rate and F-Measure are formally defined below).

NBC Model

| Mutual Information Threshold | 0.001 | 0.002 | 0.003 | 0.004 | 0.005 | 0.006 | 0.007 |
|---|---|---|---|---|---|---|---|
| No. of Children | 3521 | 401 | 237 | 160 | 117 | 79 | 61 |
| Hit Rate (%) | 57.9 | 57.1 | 57.7 | 57.1 | 58.2 | 57.8 | 57.0 |
| F-Measure (%) | 89.8 | 89.6 | 90.5 | 90.7 | 91.3 | 91.7 | 91.6 |

Table 2: Performance of kNBC by varying number of children
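The child-node filtering step can be sketched as follows. This is a minimal illustration under stated assumptions, not the aText code: it uses the standard definition $I(X;Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)}$ with natural logarithms and empirical frequencies, treats each stemmed word as a binary presence indicator, and keeps a word when its mutual information with the class label is at least the threshold. The names `mutual_information` and `select_children` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in nats between two sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # P(x,y) / (P(x) P(y)) simplifies to count * n / (count_x * count_y).
        mi += p_xy * math.log(count * n / (count_x[x] * count_y[y]))
    return mi


def select_children(word_presence, labels, threshold):
    """Keep words whose mutual information with the class meets the threshold.

    word_presence maps a stemmed word to a list of 0/1 indicators, one per
    training document; labels holds the class (star rating) per document.
    """
    return [
        word
        for word, column in word_presence.items()
        if mutual_information(column, labels) >= threshold
    ]


# Toy data: "terribl" tracks the rating exactly, "hotel" is independent of it.
toy_labels = [5, 5, 1, 1]
toy_presence = {"terribl": [0, 0, 1, 1], "hotel": [1, 0, 1, 0]}
kept = select_children(toy_presence, toy_labels, threshold=0.003)
```

Raising the threshold shrinks the list of retained children, which is the effect tabulated in Table 2.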

The table above suggests that the performance stabilizes after the threshold value 0.003. The baseline SVM performance for this value is 60.7%.

With a kNBC consisting of 117 children and k = 2, the confusion matrix is shown in Table 3. From this table we computed the “hit rate”, i.e., the sum of the diagonal elements as a percentage of the total number of reviews. This is equal to 58.9%. A plausible explanation of the low performance on the TripAdvisor corpus is the blurriness between two consecutive ratings. Many words in reviews from satisfied customers giving ratings of 4 and 5 are likely to be common; the same is true of ratings of 1 and 2.

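| kNBC Model | 5 | 4 | 3 | 2 | 1 |
|---|---|---|---|---|---|
| 5 | 1662 | 602 | 56 | 25 | 10 |
| 4 | 535 | 996 | 220 | 61 | 9 |
| 3 | 49 | 167 | 153 | 83 | 9 |
| 2 | 25 | 60 | 80 | 138 | 71 |
| 1 | 12 | 6 | 15 | 71 | 140 |

Table 3: TripAdvisor Confusion Matrix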

If we now recast the task as a binary classification problem with ratings 4 and 5 in the “positive” class and ratings 1-3 in the “negative” class, performance becomes acceptable. We then follow the usual precision and recall definitions and the following definition of F-measure:

$$
F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$
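As a worked check, the hit rate and the binary precision, recall and F-measure can be recomputed from the TripAdvisor confusion matrix (Table 3). The sketch below is illustrative only; the function names are hypothetical, and it assumes that rows hold predicted ratings and columns hold true ratings, an orientation the text does not state but which reproduces the reported precision (90.9%) and recall (92.2%). The recomputed hit rate and F-measure agree with the reported 58.9% and 91.5% to within about 0.1 percentage point.

```python
# TripAdvisor confusion matrix; rows and columns ordered 5, 4, 3, 2, 1 stars.
# Assumption: rows = predicted rating, columns = true rating.
CONFUSION = [
    [1662, 602, 56, 25, 10],
    [535, 996, 220, 61, 9],
    [49, 167, 153, 83, 9],
    [25, 60, 80, 138, 71],
    [12, 6, 15, 71, 140],
]


def hit_rate(matrix):
    """Fraction of documents on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    diagonal = sum(matrix[i][i] for i in range(len(matrix)))
    return diagonal / total


def binary_metrics(matrix, positive=(0, 1)):
    """Precision, recall and F-measure after merging classes.

    `positive` lists the row/column indices forming the positive class
    (indices 0 and 1 correspond to 5 and 4 stars).
    """
    true_pos = sum(matrix[r][c] for r in positive for c in positive)
    predicted_pos = sum(sum(matrix[r]) for r in positive)
    actual_pos = sum(row[c] for row in matrix for c in positive)
    precision = true_pos / predicted_pos
    recall = true_pos / actual_pos
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


overall = hit_rate(CONFUSION)
precision, recall, f_measure = binary_metrics(CONFUSION)
print(f"hit rate={overall:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} F={f_measure:.3f}")
```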

We then varied the degree of dependence, and the table below shows the performance obtained by varying the value of k between 0 and 3.

| | K=0 (NBC) | K=1 | K=2 | K=3 |
|---|---|---|---|---|
| kNBC | 83.2% | 83.5% | 85.1% | 85.5% |
| kNBC/FK | 81.9% | 86.2% | 87.4% | 88.9% |

Table 4: Comparison of performances between kNBC and FK

It’s clear that k = 2 yields the optimum performance and that

the performance of the hybrid kNBC/FK approach is

comparable to the baseline SVM performance.

IV. CONCLUSIONS

We have presented two different ways of computing the sentiment score of a document, namely, lexicon-based and supervised-classification-based. In the financial domain, we have used a richer lexicon with ten categories, including positive and negative, as shown in Figure 9.

We have also experimented with about 300 articles written by analysts in 2015 on a particular company. We plotted the sentiment trend over a number of intervals, as shown in the bottom panel of Figure 10 (blue represents positive and red represents negative sentiment, and the sum of the two scores at any time point is 1.0), and superimposed it on the stock prices for the period. The correlation between the two graphs is evident in some segments. We have also defined a volatility index of sentiment reflecting a measure of price ups and downs during the period. The numeric sentiment scores over the time period can be incorporated into any time-series regression algorithm for predicting the stock price.
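One way such a normalized sentiment trend could be computed is sketched below. This is a hypothetical illustration rather than the procedure used in aText: it assumes each article has already been reduced to a numeric timestamp and raw positive and negative lexicon counts, splits the period into equal-width intervals, and normalizes the two totals per interval so that they sum to 1.0. The function name `sentiment_trend` and the input layout are assumptions.

```python
from collections import defaultdict


def sentiment_trend(articles, num_intervals):
    """Normalized (positive, negative) shares per time interval.

    articles: list of (timestamp, positive_count, negative_count) tuples,
    where timestamp is any numeric time value (e.g., day of year).
    Returns one entry per interval; None marks an interval with no
    sentiment-bearing words.
    """
    times = [t for t, _, _ in articles]
    start, end = min(times), max(times)
    # Fall back to a unit width when all articles share one timestamp.
    width = (end - start) / num_intervals or 1.0

    totals = defaultdict(lambda: [0.0, 0.0])
    for t, pos, neg in articles:
        # The last timestamp falls into the final interval.
        index = min(int((t - start) / width), num_intervals - 1)
        totals[index][0] += pos
        totals[index][1] += neg

    trend = []
    for index in range(num_intervals):
        pos, neg = totals[index]
        total = pos + neg
        trend.append(None if total == 0 else (pos / total, neg / total))
    return trend


# Toy example: four articles over days 0..10, split into two intervals.
toy_articles = [(0, 3, 1), (1, 1, 1), (9, 0, 4), (10, 2, 2)]
toy_trend = sentiment_trend(toy_articles, num_intervals=2)
```

The resulting per-interval scores form a numeric series that can be aligned with stock prices over the same intervals and supplied as an input feature to a regression model.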

Dependency k: 2; Overall: 58.9%; Precision: 90.9%; Recall: 92.2%; F-Measure: 91.5%

Mutual information threshold: 0.003; Reduced number of children nodes: 117; Baseline SVM performance: 60.7%


Figure 9: Finer level of sentiment scoring

Figure 10: Correlating sentiment trend with stock prices

Our future plan is to enhance the scoring techniques via deep linguistic processing and the unsupervised classification approaches of aText such as LDA and PLSA.

REFERENCES

[1] Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet allocation. J. of Machine Learning Research, 3(5):993–1022.

[2] Das, S. (2008). High-Level Data Fusion, Artech House, MA, USA.

[3] Das, S. (2012). “A framework for distributed high-level fusion,” In Net-centric Distributed Fusion, D. Hall, J. Llinas, M. Liggins, C. Chong (eds.), CRC Press/Taylor and Francis.

[4] Das, S. (2014). Computational Business Analytics, Chapman and Hall/CRC Press.

[5] Das, S., Ascano, R., and Macarty, M. (2015). “Distributed Big Data Search for Analyst Queries and Data Fusion,” International Conference on Information Fusion.

[6] Denoyer, L. and Gallinari, P. (2004). “Bayesian network model for semi-structured document classification,” Information Processing and Management, Vol. 40, pp. 807–827.

[7] Duda, R., and Hart, P. (1973). Pattern Classification and Scene Analysis, Wiley, NY.

[8] Dick, U. and Kersting, K. (2006). “Fisher Kernels for Relational Data,” Proc. of the 17th European Conference on Machine Learning, Springer-Verlag, pp. 114–125.

[9] Dumais, S., Furnas, G., Landauer, T., Deerwester, S., and Harshman, R. (1988). “Using latent semantic analysis to improve access to textual information,” Proc. of the Conf. on Human Factors in Computing Systems (CHI), pp. 281–286.

[10] Frakes, W. (1992). “Stemming Algorithms,” In: W.B. Frakes and R. Baeza-Yates (eds), Information Retrieval. Data Structures and Algorithms, pp. 131-160, Prentice Hall, 1992.

[11] Friedman, N., Geiger, D., and Goldszmidt, M. (1997). “Building classifiers using Bayesian networks,” Machine Learning, Vol. 29, pp. 131–163.

[12] Gaussier, E., Goutte, C., Popat, K., and Chen, F. (2002). “A Hierarchical Model for Clustering and Categorising Documents,” Adv. in Information Retrieval – Proc. of the 24th BCS-IRSG European Colloquium on IR Research (ECIR).

[13] Heckerman, D. E. (1996). “A tutorial on learning Bayesian networks,” Technical Report: MSR-TR-95-06, Microsoft Corporation, Redmond, WA.

[14] Holub, A., Welling, M., and Perona, P. (2005). “Combining Generative Models and Fisher Kernels for Object Recognition,” Proc. of the IEEE Int. Conf. on Comp. Vision.

[15] Hofmann, T. (1999). “Probabilistic Latent Semantic Analysis,” Proceedings of the Conference on Uncertainty in Artificial Intelligence, UAI’99, Stockholm.

[16] Jaakkola, T. and Haussler, D. (1999). “Exploiting Generative Models in Discriminative Classifiers,” Advances in Neural Information Processing Systems 11, Bradford Books. Cambridge, MA: The MIT Press, pp. 487–493.

[17] Joachims, T. (1998). “Text categorization with support vector machines: Learning with many relevant features,” Proc. of the 10th European Conf. on Machine Learning, Springer-Verlag, pp. 137–142.

[18] Jensen, F. V. (2002). Bayesian Networks and Decision Graphs, Springer-Verlag, NY.

[19] LIBSVM. (2011). http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

[20] Mitchell, T. (1997). Machine Learning. McGraw-Hill, NY.

[21] Neapolitan, R. E. (2003). Learning Bayesian Networks, Prentice Hall, Upper Saddle River, NJ.

[22] Nicotra, L., Micheli, A., and Starita, A. (2004). “Fisher Kernel for Tree Structured Data,” Proceedings of the 2004 IEEE Int. Joint Conference on Neural Networks, Vol. 3, pp. 1917–1922.

[23] Pang, B. and Lee, L. (2008). “Opinion mining and sentiment analysis,” Foundations and Trends in Information Retrieval, Vol. 2, No 1-2, pp 1–135.

[24] Pappas, N. and Popescu-Belis, A. (2014). “Explaining the stars: Weighted multiple-instance learning for aspect-based sentiment analysis,” Proceedings of the 2014 Conference on Empirical Methods In Natural Language Processing (EMNLP), pp. 455–466.

[25] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann.

[26] Perronnin, F. and Dance, C. (2007). “Fisher Kernels on Visual Vocabularies for Image Categorization,” Computer Vision and Pattern Recognition (CVPR), pp. 1–8.

[27] Sahami, M. (1996). “Learning limited dependence Bayesian classifiers,” Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 335-338.

[28] Shi, Z., Huang, Y., and Zhang, S. (2005). “Fisher Score Based Naive Bayesian Classifier,” Int. Conf. on Neural Networks and Brain (ICNN&B), Vol. 3(13-15), pp. 1616–1621.

[29] Sewell, M. (2007). “Fisher Kernel,” Department of Computer Science, University College London, April 2007.

[30] Shimada, K. and Tsutomu, E. (2008). “Seeing several stars: A rating inference task for a document containing several evaluation criteria,” Advances in Knowledge Discovery and Data Mining, 12th Pacific-Asia Conference, PAKDD, pp. 1006–1014.

[31] Snyder, B. and Barzilay, R. (2007). “Multiple aspect ranking using the good grief algorithm,” Proc. of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pp. 300–307.