
Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM) @ EACL 2014, pages 44–52, Gothenburg, Sweden, April 26–30 2014. © 2014 Association for Computational Linguistics

Aspect Term Extraction for Sentiment Analysis: New Datasets, New Evaluation Measures and an Improved Unsupervised Method

John Pavlopoulos and Ion Androutsopoulos
Dept. of Informatics, Athens University of Economics and Business, Greece

http://nlp.cs.aueb.gr/

Abstract

Given a set of texts discussing a particular entity (e.g., customer reviews of a smartphone), aspect based sentiment analysis (ABSA) identifies prominent aspects of the entity (e.g., battery, screen) and an average sentiment score per aspect. We focus on aspect term extraction (ATE), one of the core processing stages of ABSA that extracts terms naming aspects. We make publicly available three new ATE datasets, arguing that they are better than previously available ones. We also introduce new evaluation measures for ATE, again arguing that they are better than previously used ones. Finally, we show how a popular unsupervised ATE method can be improved by using continuous space vector representations of words and phrases.

1 Introduction

Before buying a product or service, consumers often search the Web for expert reviews, but increasingly also for opinions of other consumers, expressed in blogs, social networks etc. Many useful opinions are expressed in text-only form (e.g., in tweets). It is then desirable to extract aspects (e.g., screen, battery) from the texts that discuss a particular entity (e.g., a smartphone), i.e., figure out what is being discussed, and also estimate aspect sentiment scores, i.e., how positive or negative the (usually average) sentiment for each aspect is. These two goals are jointly known as Aspect Based Sentiment Analysis (ABSA) (Liu, 2012).

In this paper, we consider free text customer reviews of products and services; ABSA, however, is also applicable to texts about other kinds of entities (e.g., politicians, organizations). We assume that a search engine retrieves customer reviews about a particular target entity (product or service), that multiple reviews written by different customers are retrieved for each target entity, and that the ultimate goal is to produce a table like the one of Fig. 1, which presents the most prominent aspects and average aspect sentiment scores of the target entity. Most ABSA systems in effect perform all or some of the following three subtasks:

Figure 1: Automatically extracted prominent aspects (shown as clusters of aspect terms) and average aspect sentiment scores of a target entity.

Aspect term extraction: Starting from texts about a particular target entity or entities of the same type as the target entity (e.g., laptop reviews), this stage extracts and possibly ranks by importance aspect terms, i.e., terms naming aspects (e.g., ‘battery’, ‘screen’) of the target entity, including multi-word terms (e.g., ‘hard disk’) (Liu, 2012; Long et al., 2010; Snyder and Barzilay, 2007; Yu et al., 2011). At the end of this stage, each aspect term is taken to be the name of a different aspect, but aspect terms may subsequently be clustered during aspect aggregation; see below.

Aspect term sentiment estimation: This stage estimates the polarity and possibly also the intensity (e.g., strongly negative, mildly positive) of the opinions for each aspect term of the target entity, usually averaged over several texts. Classifying texts by sentiment polarity is a popular research topic (Liu, 2012; Pang and Lee, 2005; Tsytsarau and Palpanas, 2012). The goal, however, in this ABSA subtask is to estimate the (usually average) polarity and intensity of the opinions about particular aspect terms of the target entity.

Aspect aggregation: Some systems group aspect terms that are synonyms or near-synonyms (e.g., ‘price’, ‘cost’) or, more generally, cluster aspect terms to obtain aspects of a coarser granularity (e.g., ‘chicken’, ‘steak’, and ‘fish’ may all be replaced by ‘food’) (Liu, 2012; Long et al., 2010; Zhai et al., 2010; Zhai et al., 2011). A polarity (and intensity) score can then be computed for each coarser aspect (e.g., ‘food’) by combining (e.g., averaging) the polarity scores of the aspect terms that belong in the coarser aspect.

In this paper, we focus on aspect term extraction (ATE). Our contribution is threefold. Firstly, we argue (Section 2) that previous ATE datasets are not entirely satisfactory, mostly because they contain reviews from a particular domain only (e.g., consumer electronics), or they contain reviews for very few target entities, or they do not contain annotations for aspect terms. We constructed and make publicly available three new ATE datasets with customer reviews for a much larger number of target entities from three domains (restaurants, laptops, hotels), with gold annotations of all the aspect term occurrences; we also measured inter-annotator agreement, unlike previous datasets.

Secondly, we argue (Section 3) that commonly used evaluation measures are also not entirely satisfactory. For example, when precision, recall, and F-measure are computed over distinct aspect terms (types), equal weight is assigned to more and less frequent aspect terms, whereas frequently discussed aspect terms are more important; and when precision, recall, and F-measure are computed over aspect term occurrences (tokens), methods that identify very few, but very frequent aspect terms may appear to perform much better than they actually do. We propose weighted variants of precision and recall, which take into account the rankings of the distinct aspect terms that are obtained when the distinct aspect terms are ordered by their true and predicted frequencies. We also compute the average weighted precision over several weighted recall levels.

Thirdly, we show (Section 4) how the popular unsupervised ATE method of Hu and Liu (2004) can be extended with continuous space word vectors (Mikolov et al., 2013a; Mikolov et al., 2013b; Mikolov et al., 2013c). Using our datasets and evaluation measures, we demonstrate (Section 5) that the extended method performs better.

2 Datasets

We first discuss previous datasets that have been used for ATE, and we then introduce our own.

2.1 Previous datasets

So far, ATE methods have been evaluated mainly on customer reviews, often from the consumer electronics domain (Hu and Liu, 2004; Popescu and Etzioni, 2005; Ding et al., 2008).

The most commonly used dataset is that of Hu and Liu (2004), which contains reviews of only five particular electronic products (e.g., Nikon Coolpix 4300). Each sentence is annotated with aspect terms, but inter-annotator agreement has not been reported.1 All the sentences appear to have been selected to express clear positive or negative opinions. There are no sentences expressing conflicting opinions about aspect terms (e.g., “The screen is clear but small”), nor are there any sentences that do not express opinions about their aspect terms (e.g., “It has a 4.8-inch screen”). Hence, the dataset is not entirely representative of product reviews. By contrast, our datasets, discussed below, contain reviews from three domains, including sentences that express conflicting or no opinions about aspect terms, they concern many more target entities (not just five), and we have also measured inter-annotator agreement.

The dataset of Ganu et al. (2009), on which one of our datasets is based, is also popular. In the original dataset, each sentence is tagged with coarse aspects (‘food’, ‘service’, ‘price’, ‘ambience’, ‘anecdotes’, or ‘miscellaneous’). For example, “The restaurant was expensive, but the menu was great” would be tagged with the coarse aspects ‘price’ and ‘food’. The coarse aspects, however, are not necessarily terms occurring in the sentence, and it is unclear how they were obtained. By contrast, we asked human annotators to mark the explicit aspect terms of each sentence, leaving the task of clustering the terms to produce coarser aspects for an aspect aggregation stage.

The ‘Concept-Level Sentiment Analysis Challenge’ of ESWC 2014 uses the dataset of Blitzer et al. (2007), which contains customer reviews of DVDs, books, kitchen appliances, and electronic products, with an overall sentiment score for each review. One of the challenge’s tasks requires systems to extract the aspects of each sentence and a sentiment score (positive or negative) per aspect.2 The aspects are intended to be concepts from ontologies, not simply aspect terms. The ontologies to be used, however, are not fully specified and no training dataset with sentences and gold aspects is currently available.

1 Each aspect term occurrence is also annotated with a sentiment score. We do not discuss these scores here, since we focus on ATE. The same comment applies to the dataset of Ganu et al. (2009) and our datasets.

Overall, the previous datasets are not entirely satisfactory, because they contain reviews from a particular domain only, or reviews for very few target entities, or their sentences are not entirely representative of customer reviews, or they do not contain annotations for aspect terms, or no inter-annotator agreement has been reported. To address these issues, we provide three new ATE datasets, which contain customer reviews of restaurants, hotels, and laptops, respectively.3

2.2 Our datasets

The restaurants dataset contains 3,710 English sentences from the reviews of Ganu et al. (2009).4

We asked human annotators to tag the aspect terms of each sentence. In “The dessert was divine”, for example, the annotators would tag the aspect term ‘dessert’. In a sentence like “The restaurant was expensive, but the menu was great”, the annotators were instructed to tag only the explicitly mentioned aspect term ‘menu’. The sentence also refers to the prices, and a possibility would be to add ‘price’ as an implicit aspect term, but we do not consider implicit aspect terms in this paper.

We used nine annotators for the restaurant reviews. Each sentence was processed by a single annotator, and each annotator processed approximately the same number of sentences. Among the 3,710 restaurant sentences, 1,248 contain exactly one aspect term, 872 more than one, and 1,590 no aspect terms. There are 593 distinct multi-word aspect terms and 452 distinct single-word aspect terms. Removing aspect terms that occur only once leaves 67 distinct multi-word and 195 distinct single-word aspect terms.

2 See http://2014.eswc-conferences.org/.

3 Our datasets are available upon request. The datasets of the ABSA task of SemEval 2014 (http://alt.qcri.org/semeval2014/task4/) are based on our datasets.

4 The original dataset of Ganu et al. contains 3,400 sentences, but some of the sentences had not been properly split.

The hotels dataset contains 3,600 English sentences from online customer reviews of 30 hotels. We used three annotators. Among the 3,600 hotel sentences, 1,326 contain exactly one aspect term, 652 more than one, and 1,622 none. There are 199 distinct multi-word aspect terms and 262 distinct single-word aspect terms, of which 24 and 120, respectively, were tagged more than once.

The laptops dataset contains 3,085 English sentences of 394 online customer reviews. A single annotator (one of the authors) was used. Among the 3,085 laptop sentences, 909 contain exactly one aspect term, 416 more than one, and 1,760 none. There are 350 distinct multi-word and 289 distinct single-word aspect terms, of which 67 and 137, respectively, were tagged more than once.

To measure inter-annotator agreement, we used a sample of 75 restaurant, 75 laptop, and 100 hotel sentences. Each sentence was processed by two (for restaurants and laptops) or three (for hotels) annotators, other than the annotators used previously. For each sentence s_i, the inter-annotator agreement was measured as the Dice coefficient D_i = 2·|A_i ∩ B_i| / (|A_i| + |B_i|), where A_i, B_i are the sets of aspect term occurrences tagged by the two annotators, respectively, and |S| denotes the cardinality of a set S; for hotels, we use the mean pairwise D_i of the three annotators.5 The overall inter-annotator agreement D was taken to be the average D_i of the sentences of each sample. We, thus, obtained D = 0.72, 0.70, 0.69, for restaurants, hotels, and laptops, respectively, which indicate reasonably high inter-annotator agreement.
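For concreteness, the per-sentence agreement computation can be sketched in a few lines of Python; this is a minimal sketch with hypothetical annotation sets, and treating two empty annotations as perfect agreement is our assumption:

    def dice(a, b):
        # Dice coefficient between two sets of tagged aspect term occurrences.
        if not a and not b:
            return 1.0  # assumption: two empty annotations count as perfect agreement
        return 2.0 * len(a & b) / (len(a) + len(b))

    # Hypothetical annotations of one sentence: (term, occurrence index) pairs.
    ann_a = {("dessert", 0), ("service", 0)}
    ann_b = {("dessert", 0)}
    d_i = dice(ann_a, ann_b)   # 2 * 1 / (2 + 1) = 0.67
    # The overall agreement D is the average d_i over the sampled sentences.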

2.3 Single and multi-word aspect terms

ABSA systems use ATE methods ultimately to obtain the m most prominent (frequently discussed) distinct aspect terms of the target entity, for different values of m.6 In a system like the one of Fig. 1, for example, if we ignore aspect aggregation, each row will report the average sentiment score of a single frequent distinct aspect term, and m will be the number of rows, which may depend on the display size or user preferences.

5 Cohen’s Kappa cannot be used here, because the annotators may tag any word sequence of any sentence, which leads to a very large set of categories. A similar problem was reported by Kobayashi et al. (2007).

6 A more general definition of prominence might also consider the average sentiment score of each distinct aspect term.

Figure 2 shows the percentage of distinct multi-word aspect terms among the m most frequent distinct aspect terms, for different values of m, in our three datasets and the electronics dataset of Hu and Liu (2004). There are many more single-word distinct aspect terms than multi-word distinct aspect terms, especially in the restaurant and hotel reviews. In the electronics and laptops datasets, the percentage of multi-word distinct aspect terms (e.g., ‘hard disk’) is higher, but most of the distinct aspect terms are still single-word, especially for small values of m. By contrast, many ATE methods (Hu and Liu, 2004; Popescu and Etzioni, 2005; Wei et al., 2010) devote much of their processing to identifying multi-word aspect terms.

Figure 2: Percentage of (distinct) multi-word aspect terms among the most frequent aspect terms.

3 Evaluation measures

We now discuss previous ATE evaluation measures, also introducing our own.

3.1 Precision, Recall, F-measure

ATE methods are usually evaluated using precision, recall, and F-measure (Hu and Liu, 2004; Popescu and Etzioni, 2005; Kim and Hovy, 2006; Wei et al., 2010; Moghaddam and Ester, 2010; Bagheri et al., 2013), but it is often unclear if these measures are applied to distinct aspect terms (no duplicates) or aspect term occurrences.

In the former case, each method is expected to return a set A of distinct aspect terms, to be compared to the set G of distinct aspect terms the human annotators identified in the texts. TP (true positives) is |A ∩ G|, FP (false positives) is |A \ G|, FN (false negatives) is |G \ A|, and precision (P), recall (R), and F = 2·P·R / (P + R) are defined as usual:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \qquad (1)$$

This way, however, precision, recall, and F-measure assign the same importance to all the distinct aspect terms, whereas missing, for example, a more frequent (more frequently discussed) distinct aspect term should probably be penalized more heavily than missing a less frequent one.

When precision, recall, and F-measure are applied to aspect term occurrences (Liu et al., 2005), TP is the number of aspect term occurrences tagged (each term occurrence) both by the method being evaluated and the human annotators, FP is the number of aspect term occurrences tagged by the method but not the human annotators, and FN is the number of aspect term occurrences tagged by the human annotators but not the method. The three measures are then defined as above. They now assign more importance to frequently occurring distinct aspect terms, but they can produce misleadingly high scores when only a few, but very frequent distinct aspect terms are handled correctly. Furthermore, the occurrence-based definitions do not take into account that missing several aspect term occurrences or wrongly tagging expressions as aspect term occurrences may not actually matter, as long as the m most frequent distinct aspect terms can be correctly reported.

3.2 Weighted precision, recall, AWP

What the previous definitions of precision and recall miss is that in practice ABSA systems use ATE methods ultimately to obtain the m most frequent distinct aspect terms, for a range of m values. Let A_m and G_m be the lists that contain the m most frequent distinct aspect terms, ordered by their predicted and true frequencies, respectively; the predicted and true frequencies are computed by examining how frequently the ATE method or the human annotators, respectively, tagged occurrences of each distinct aspect term. Differences between the predicted and true frequencies do not matter, as long as A_m = G_m, for every m. Not including in A_m a term of G_m should be penalized more or less heavily, depending on whether the term's true frequency was high or low, respectively. Furthermore, including in A_m a term not in G_m should be penalized more or less heavily, depending on whether the term was placed towards the beginning or the end of A_m, i.e., depending on the prominence that was assigned to the term.

To address the issues discussed above, we introduce weighted variants of precision and recall.


For each ATE method, we now compute a single list A = ⟨a_1, . . . , a_|A|⟩ of distinct aspect terms identified by the method, ordered by decreasing predicted frequency. For every m value (number of most frequent distinct aspect terms to show), the method is treated as having returned the sub-list A_m with the first m elements of A. Similarly, we now take G = ⟨g_1, . . . , g_|G|⟩ to be the list of the distinct aspect terms that the human annotators tagged, ordered by decreasing true frequency.7 We define weighted precision (WP_m) and weighted recall (WR_m) as in Eq. 2–3. The notation 1{κ} denotes 1 if condition κ holds, and 0 otherwise. By r(a_i) we denote the ranking of the returned term a_i in G, i.e., if a_i = g_j, then r(a_i) = j; if a_i ∉ G, then r(a_i) is an arbitrary positive integer.

$$WP_m = \frac{\sum_{i=1}^{m} \frac{1}{i} \cdot 1\{a_i \in G\}}{\sum_{i=1}^{m} \frac{1}{i}} \qquad (2)$$

$$WR_m = \frac{\sum_{i=1}^{m} \frac{1}{r(a_i)} \cdot 1\{a_i \in G\}}{\sum_{j=1}^{|G|} \frac{1}{j}} \qquad (3)$$

WR_m counts how many terms of G (gold distinct aspect terms) the method returned in A_m, but weighting each term by its inverse ranking 1/r(a_i), i.e., assigning more importance to terms the human annotators tagged more frequently. The denominator of Eq. 3 sums the weights of all the terms of G; in unweighted recall applied to distinct aspect terms, where all the terms of G have the same weight, the denominator would be |G| = TP + FN (Eq. 1). WP_m counts how many gold aspect terms the method returned in A_m, but weighting each returned term a_i by its inverse ranking 1/i in A_m, to reward methods that return more gold aspect terms towards the beginning of A_m. The denominator of Eq. 2 sums the weights of all the terms of A_m; in unweighted precision applied to distinct aspect terms, the denominator would be |A_m| = TP + FP (Eq. 1).
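As a concrete illustration, Eq. 2–3 can be implemented in a few lines of Python; this is a minimal sketch assuming A and G are lists of distinct aspect terms already sorted by decreasing predicted and true frequency, respectively (the function and variable names are ours, not from the paper's implementation):

    def weighted_precision_recall(A, G, m):
        # WP_m and WR_m of Eq. 2-3 for the m most frequent predicted aspect terms.
        A_m = A[:m]
        gold_rank = {g: j for j, g in enumerate(G, start=1)}   # r(a_i) for terms in G
        wp_num = sum(1.0 / i for i, a in enumerate(A_m, start=1) if a in gold_rank)
        wp_den = sum(1.0 / i for i in range(1, len(A_m) + 1))
        wr_num = sum(1.0 / gold_rank[a] for a in A_m if a in gold_rank)
        wr_den = sum(1.0 / j for j in range(1, len(G) + 1))
        return wp_num / wp_den, wr_num / wr_den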

We plot weighted precision-recall curves by computing WP_m, WR_m pairs for different values of m, as in Fig. 3 below.8 The higher the curve of a method, the better the method. We also compute the average (interpolated) weighted precision (AWP) of each method over 11 recall levels:

$$AWP = \frac{1}{11} \sum_{r \in \{0, 0.1, \dots, 1\}} WP_{int}(r)$$

$$WP_{int}(r) = \max_{m \in \{1, \dots, |A|\}, \; WR_m \geq r} WP_m$$

AWP is similar to average (interpolated) precision (AP), which is used to summarize the tradeoff between (unweighted) precision and recall.

7 In our experiments, we exclude from G aspect terms tagged by the annotators only once.

8 With supervised methods, we perform a 10-fold cross-validation for each m, and we macro-average WP_m, WR_m over the folds. We provide our datasets partitioned in folds.
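Continuing the sketch above, AWP can be computed by interpolating over the 11 weighted recall levels; treating a recall level that no m reaches as contributing zero is our assumption, since the text does not spell out that case:

    def average_weighted_precision(A, G):
        # AWP: interpolated WP over the weighted recall levels 0.0, 0.1, ..., 1.0.
        pairs = [weighted_precision_recall(A, G, m) for m in range(1, len(A) + 1)]
        awp = 0.0
        for r in [i / 10.0 for i in range(11)]:
            # WP_int(r): best WP_m among the points with WR_m >= r (0 if none reaches r).
            awp += max((wp for wp, wr in pairs if wr >= r), default=0.0)
        return awp / 11.0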

3.3 Other related measures

Yu et al. (2011) used nDCG@m (Jarvelin and Kekalainen, 2002; Sakai, 2004; Manning et al., 2008), defined below, to evaluate each list of m distinct aspect terms returned by an ATE method.

$$nDCG@m = \frac{1}{Z} \sum_{i=1}^{m} \frac{2^{t(i)} - 1}{\log_2(1 + i)}$$

Z is a normalization factor to ensure that a perfect ranking gets nDCG@m = 1, and t(i) is a reward function for a term placed at position i of the returned list. In the work of Yu et al., t(i) = 1 if the term at position i is not important (as judged by a human), t(i) = 2 if the term is ‘ordinary’, and t(i) = 3 if it is important. The logarithm is used to reduce the reward for distinct aspect terms placed at lower positions of the returned list.
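For comparison, here is a minimal Python sketch of nDCG@m with this graded reward; the judgments mapping from each candidate term to its human label in {1, 2, 3} is hypothetical, and we assume every returned term has been judged:

    import math

    def ndcg_at_m(returned, judgments, m):
        # nDCG@m with reward 2^t(i) - 1 and a logarithmic position discount.
        dcg = sum((2 ** judgments[term] - 1) / math.log2(1 + i)
                  for i, term in enumerate(returned[:m], start=1))
        ideal = sorted(judgments.values(), reverse=True)[:m]   # best possible ranking
        Z = sum((2 ** t - 1) / math.log2(1 + i) for i, t in enumerate(ideal, start=1))
        return dcg / Z if Z > 0 else 0.0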

The nDCG@m measure is well known in ranking systems (e.g., search engines) and it is similar to our weighted precision (WP_m). The denominator of Eq. 2 corresponds to the normalization factor Z of nDCG@m; the 1/i factor in the numerator of Eq. 2 corresponds to the 1/log_2(1 + i) degradation factor of nDCG@m; and the 1{a_i ∈ G} factor of Eq. 2 is a binary reward function, corresponding to the 2^{t(i)} − 1 factor of nDCG@m.

The main difference from nDCG@m is that WP_m uses a degradation factor 1/i that is inversely proportional to the ranking of the returned term a_i in the returned list A_m, whereas nDCG@m uses a logarithmic factor 1/log_2(1 + i), which reduces less sharply the reward for distinct aspect terms returned at lower positions in A_m. We believe that the degradation factor of WP_m is more appropriate for ABSA, because most users would in practice wish to view sentiment scores for only a few (e.g., m = 10) frequent distinct aspect terms, whereas in search engines users are more likely to examine more of the highly-ranked returned items. It is possible, however, to use a logarithmic degradation factor in WP_m, as in nDCG@m.


Another difference is that we use a binary reward factor 1{a_i ∈ G} in WP_m, instead of the 2^{t(i)} − 1 factor of nDCG@m that has three possible values in the work of Yu et al. (2011). We use a binary reward factor, because preliminary experiments we conducted indicated that multiple relevance levels (e.g., not an aspect term, aspect term but unimportant, important aspect term) confused the annotators and led to lower inter-annotator agreement. The nDCG@m measure can also be used with a binary reward factor; the possible values of t(i) would be 0 and 1.

With a binary reward factor, nDCG@m in effect measures the ratio of correct (distinct) aspect terms to the terms returned, assigning more weight to correct aspect terms placed closer to the top of the returned list, like WP_m. The nDCG@m measure, however, does not provide any indication of how many of the gold distinct aspect terms have been returned. By contrast, we also measure weighted recall (Eq. 3), which examines how many of the (distinct) gold aspect terms have been returned in A_m, also assigning more weight to the gold aspect terms the human annotators tagged more frequently. We also compute the average weighted precision (AWP), which is a combination of WP_m and WR_m, for a range of m values.

4 Aspect term extraction methods

We implemented and evaluated four ATE methods: (i) a popular baseline (dubbed FREQ) that returns the most frequent distinct nouns and noun phrases, (ii) the well-known method of Hu and Liu (2004), which adds to the baseline pruning mechanisms and steps that detect more aspect terms (dubbed H&L), (iii) an extension of the previous method (dubbed H&L+W2V), with an extra pruning step we devised that uses the recently popular continuous space word vectors (Mikolov et al., 2013c), and (iv) a similar extension of FREQ (dubbed FREQ+W2V). All four methods are unsupervised, which is particularly important for ABSA systems intended to be used across domains with minimal changes. They return directly a list A of distinct aspect terms ordered by decreasing predicted frequency, rather than tagging aspect term occurrences, which would require computing the A list from the tagged occurrences before applying our evaluation measures (Section 3.2).

4.1 The FREQ baseline

The FREQ baseline returns the most frequent (distinct) nouns and noun phrases of the reviews in each dataset (restaurants, hotels, laptops), ordered by decreasing sentence frequency (how many sentences contain the noun or noun phrase).9 This is a reasonably effective and popular baseline (Hu and Liu, 2004; Wei et al., 2010; Liu, 2012).
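A minimal Python sketch of this baseline follows; it uses NLTK's default tagger as in footnote 9, but substitutes a simple regular-expression NP chunker for the Treebank-trained chunker the paper mentions, so it is an approximation rather than the authors' exact setup:

    from collections import Counter
    import nltk

    # Regular-expression NP chunker used as a stand-in for the chunker of footnote 9
    # (zero or more adjectives followed by one or more nouns).
    chunker = nltk.RegexpParser("NP: {<JJ>*<NN.*>+}")

    def freq_baseline(sentences, m=10):
        # Return the m candidate terms (nouns and NP chunks) found in the most sentences.
        sentence_freq = Counter()
        for s in sentences:
            tagged = nltk.pos_tag(nltk.word_tokenize(s.lower()))
            candidates = {w for w, tag in tagged if tag.startswith("NN")}
            for subtree in chunker.parse(tagged).subtrees(lambda t: t.label() == "NP"):
                candidates.add(" ".join(w for w, _ in subtree.leaves()))
            sentence_freq.update(candidates)   # each sentence counts a candidate once
        return [t for t, _ in sentence_freq.most_common(m)]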

4.2 The H&L method

The method of Hu and Liu (2004), dubbed H&L, first extracts all the distinct nouns and noun phrases from the reviews of each dataset (lines 3–6 of Algorithm 1) and considers them candidate distinct aspect terms.10 It then forms longer candidate distinct aspect terms by concatenating pairs and triples of candidate aspect terms occurring in the same sentence, in the order they appear in the sentence (lines 7–11). For example, if ‘battery life’ and ‘screen’ occur in the same sentence (in this order), then ‘battery life screen’ will also become a candidate distinct aspect term.

The resulting candidate distinct aspect terms are ordered by decreasing p-support (lines 12–15). The p-support of a candidate distinct aspect term t is the number of sentences that contain t, excluding sentences that contain another candidate distinct aspect term t′ that subsumes t. For example, if both ‘battery life’ and ‘battery’ are candidate distinct aspect terms, a sentence like “The battery life was good” is counted in the p-support of ‘battery life’, but not in the p-support of ‘battery’.
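The p-support computation (lines 12–15 of Algorithm 1) can be sketched as follows; treating "contains" as lower-cased substring matching is our simplification:

    from collections import defaultdict

    def p_support(sentences, terms):
        # Number of sentences containing each term t, excluding sentences where a
        # longer candidate term subsumes t (lines 12-15 of Algorithm 1).
        support = defaultdict(int)
        for s in sentences:
            present = [t for t in terms if t in s]   # simplification: substring match
            for t in present:
                if not any(t != t2 and t in t2 for t2 in present):
                    support[t] += 1
        return support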

The method then tries to correct itself by pruning wrong candidate distinct aspect terms and detecting additional candidates. Firstly, it discards multi-word distinct aspect terms that appear in ‘non-compact’ form in more than one sentence (lines 16–23). A multi-word term t appears in non-compact form in a sentence if there are more than three other words (not words of t) between any two of the words of t in the sentence. For example, the candidate distinct aspect term ‘battery life screen’ appears in non-compact form in “battery life is way better than screen”. Secondly, if the p-support of a candidate distinct aspect term t is smaller than 3 and t is subsumed by another candidate distinct aspect term t′, then t is discarded (lines 21–23).

9 We use the default POS tagger of NLTK, and the chunker of NLTK trained on the Treebank corpus; see http://nltk.org/. We convert all words to lower-case.

10 Some details of the work of Hu and Liu (2004) were not entirely clear to us. The discussion here and our implementation reflect our understanding.

Subsequently, a set of ‘opinion adjectives’ is formed; for each sentence and each candidate distinct aspect term t that occurs in the sentence, the adjective of the sentence closest to t (if there is one) is added to the set of opinion adjectives (lines 25–27). The sentences are then re-scanned; if a sentence does not contain any candidate aspect term, but contains an opinion adjective, then the nearest noun to the opinion adjective is added to the candidate distinct aspect terms (lines 28–31). The remaining candidate distinct aspect terms are returned, ordered by decreasing p-support.

Algorithm 1 The method of Hu and Liu
Require: sentences: a list of sentences
 1: terms = new Set(String)
 2: psupport = new Map(String, int)
 3: for s in sentences do
 4:   nouns = POSTagger(s).getNouns()
 5:   nps = Chunker(s).getNPChunks()
 6:   terms.add(nouns ∪ nps)
 7: for s in sentences do
 8:   for t1, t2 in terms s.t. t1, t2 in s ∧ s.index(t1) < s.index(t2) do
 9:     terms.add(t1 + " " + t2)
10:   for t1, t2, t3 in terms s.t. t1, t2, t3 in s ∧ s.index(t1) < s.index(t2) < s.index(t3) do
11:     terms.add(t1 + " " + t2 + " " + t3)
12: for s in sentences do
13:   for t: t in terms ∧ t in s do
14:     if ¬∃ t′: t′ in terms ∧ t′ in s ∧ t in t′ then
15:       psupport[t] += 1
16: nonCompact = new Map(String, int)
17: for t in terms do
18:   for s in sentences do
19:     if maxPairDistance(t.words()) > 3 then
20:       nonCompact[t] += 1
21: for t in terms do
22:   if nonCompact[t] > 1 ∨ (∃ t′: t′ in terms ∧ t in t′ ∧ psupport[t] < 3) then
23:     terms.remove(t)
24: adjs = new Set(String)
25: for s in sentences do
26:   if ∃ t: t in terms ∧ t in s then
27:     adjs.add(POSTagger(s).getNearestAdj(t))
28: for s in sentences do
29:   if (¬∃ t: t in terms ∧ t in s) ∧ (∃ a: a in adjs ∧ a in s) then
30:     t = POSTagger(s).getNearestNoun(adjs)
31:     terms.add(t)
32: return psupport.keysSortedByValue()

4.3 The H&L+W2V method

We extended H&L by including an additional pruning step that uses continuous vector space representations of words (Mikolov et al., 2013a; Mikolov et al., 2013b; Mikolov et al., 2013c). The vector representations of the words are produced by using a neural network language model, whose inputs are the vectors of the words occurring in each sentence, treated as latent variables to be learned. We used the English Wikipedia to train the language model and obtain word vectors, with 200 features per vector. Vectors for short phrases, in our case candidate multi-word aspect terms, are produced in a similar manner.11

Centroid      Closest Wikipedia words
Com. lang.    only, however, so, way, because
Restaurants   meal, meals, breakfast, wingstreet, snacks
Hotels        restaurant, guests, residence, bed, hotels
Laptops       gameport, hardware, hd floppy, pcs, apple macintosh

Table 1: Wikipedia words closest to the common language and domain centroids.

Our additional pruning stage is invoked immediately after line 6 of Algorithm 1. It uses the ten most frequent candidate distinct aspect terms that are available up to that point (frequency taken to be the number of sentences that contain each candidate) and computes the centroid of their vectors, dubbed the domain centroid. Similarly, it computes the centroid of the 20 most frequent words of the Brown Corpus (news category), excluding stop-words and words shorter than three characters; this is the common language centroid. Any candidate distinct aspect term whose vector is closer to the common language centroid than the domain centroid is discarded, the intuition being that the candidate names a very general concept, rather than a domain-specific aspect.12 We use cosine similarity to compute distances. Vectors obtained from Wikipedia are used in all cases.

To showcase the insight of our pruning step, Table 1 shows the five words from the English Wikipedia whose vectors are closest to the common language centroid and the three domain centroids. The words closest to the common language centroid are common words, whereas words closest to the domain centroids name domain-specific concepts that are more likely to be aspect terms.
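A hedged Python sketch of this pruning step follows; loading the vectors with gensim, the file name, and the underscore-joined key format for two-word phrases are our assumptions, since the paper only states that WORD2VEC vectors trained on Wikipedia were used (see footnotes 11–12):

    import numpy as np
    from gensim.models import KeyedVectors

    # Illustrative only: the vector file and the use of gensim are assumptions.
    wv = KeyedVectors.load_word2vec_format("wiki_vectors.bin", binary=True)

    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def centroid(words):
        return np.mean([wv[w] for w in words if w in wv], axis=0)

    def w2v_prune(candidates, domain_top10, common_top20):
        # Keep candidates whose vector is at least as close to the domain centroid as
        # to the common-language centroid; candidates without a vector are kept
        # (cf. footnote 12 on phrases longer than two words).
        dom_c = centroid(domain_top10)
        com_c = centroid(common_top20)
        kept = []
        for t in candidates:
            key = t.replace(" ", "_")   # assumed phrase-vector key format
            if key not in wv or cosine(wv[key], dom_c) >= cosine(wv[key], com_c):
                kept.append(t)
        return kept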

11 We use WORD2VEC, available at https://code.google.com/p/word2vec/, with a continuous bag of words model, default parameters, the first billion characters of the English Wikipedia, and the pre-processing of http://mattmahoney.net/dc/textdata.html.

12 WORD2VEC does not produce vectors for phrases longer than two words; thus, our pruning mechanism never discards candidate aspect terms of more than two words.


Figure 3: Weighted precision – weighted recall curves for the three datasets.

4.4 The FREQ+W2V method

As with H&L+W2V, we extended FREQ by adding our pruning step that uses the continuous space word (and phrase) vectors. Again, we produced one common language and three domain centroids, as before. Candidate distinct aspect terms whose vector was closer to the common language centroid than the domain centroid were discarded.

5 Experimental results

Table 2 shows the AWP scores of the methods. All four methods perform better on the restaurants dataset. At the other extreme, the laptops dataset seems to be the most difficult one; this is due to the fact that it contains many frequent nouns and noun phrases that are not aspect terms; it also contains more multi-word aspect terms (Fig. 2).

H&L performs much better than FREQ in all three domains, and our additional pruning (W2V) improves H&L in all three domains. By contrast, FREQ benefits from W2V only in the restaurant reviews (but to a smaller degree than H&L), benefits only marginally in the hotel reviews, and in the laptop reviews FREQ+W2V performs worse than FREQ. A possible explanation is that the list of candidate (distinct) aspect terms that FREQ produces already misses many aspect terms in the hotel and laptop datasets; hence, W2V, which can only prune aspect terms, cannot improve the results much, and in the case of laptops W2V has a negative effect, because it prunes several correct candidate aspect terms. All differences between AWP scores on the same dataset are statistically significant; we use stratified approximate randomization, which indicates p ≤ 0.01 in all cases.13
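For reference, the core of an approximate randomization test can be sketched as below; this is a plain paired (unstratified) version over hypothetical per-fold score pairs, so it only illustrates the idea behind the test cited in footnote 13 rather than our exact procedure:

    import random

    def approx_randomization(scores_a, scores_b, trials=10000, seed=0):
        # Two-sided approximate randomization test on paired score lists.
        rng = random.Random(seed)
        mean = lambda xs: sum(xs) / len(xs)
        observed = abs(mean(scores_a) - mean(scores_b))
        extreme = 0
        for _ in range(trials):
            shuffled_a, shuffled_b = [], []
            for a, b in zip(scores_a, scores_b):
                if rng.random() < 0.5:       # randomly swap the paired outputs
                    a, b = b, a
                shuffled_a.append(a)
                shuffled_b.append(b)
            if abs(mean(shuffled_a) - mean(shuffled_b)) >= observed:
                extreme += 1
        return (extreme + 1) / (trials + 1)   # p-value estimate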

13 See http://masanjin.net/sigtest.pdf.

Method      Restaurants  Hotels  Laptops
FREQ        43.40        30.11   9.09
FREQ+W2V    45.17        30.54   7.18
H&L         52.23        49.73   34.34
H&L+W2V     66.80        53.37   38.93

Table 2: Average weighted precision results (%).

Figure 3 shows the weighted precision and weighted recall curves of the four methods. In the restaurants dataset, our pruning improves the weighted precision of both H&L and FREQ; by contrast, it does not improve weighted recall, since it can only prune candidate aspect terms. The maximum weighted precision of FREQ+W2V is almost as good as that of H&L+W2V, but H&L+W2V (and H&L) reach much higher weighted recall scores. In the hotel reviews, W2V again improves the weighted precision of both H&L and FREQ, but to a smaller extent; again, W2V does not improve weighted recall; also, H&L and H&L+W2V again reach higher weighted recall scores. In the laptop reviews, W2V marginally improves the weighted precision of H&L, but it lowers the weighted precision of FREQ; again, H&L and H&L+W2V reach higher weighted recall scores. Overall, Fig. 3 confirms that H&L+W2V is the best method.

6 Conclusions

We constructed and made publicly available three new ATE datasets from three domains. We also introduced weighted variants of precision, recall, and average precision, arguing that they are more appropriate for ATE. Finally, we discussed how a popular unsupervised ATE method can be improved by adding a new pruning mechanism that uses continuous space vector representations of words and phrases. Using our datasets and evaluation measures, we showed that the improved method performs clearly better than the original one, also outperforming a simpler frequency-based baseline with or without our pruning.


References

A. Bagheri, M. Saraee, and F. Jong. 2013. An unsupervised aspect detection model for sentiment analysis of reviews. In Proceedings of NLDB, volume 7934, pages 140–151.

J. Blitzer, M. Dredze, and F. Pereira. 2007. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of ACL, pages 440–447, Prague, Czech Republic.

X. Ding, B. Liu, and P. S. Yu. 2008. A holistic lexicon-based approach to opinion mining. In Proceedings of WSDM, pages 231–240, Palo Alto, CA, USA.

G. Ganu, N. Elhadad, and A. Marian. 2009. Beyond the stars: Improving rating predictions using review text content. In Proceedings of WebDB, Providence, RI, USA.

M. Hu and B. Liu. 2004. Mining and summarizing customer reviews. In Proceedings of KDD, pages 168–177, Seattle, WA, USA.

K. Jarvelin and J. Kekalainen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422–446.

S.-M. Kim and E. Hovy. 2006. Extracting opinions, opinion holders, and topics expressed in online news media text. In Proceedings of SST, pages 1–8, Sydney, Australia.

N. Kobayashi, K. Inui, and Y. Matsumoto. 2007. Extracting aspect-evaluation and aspect-of relations in opinion mining. In Proceedings of EMNLP-CoNLL, pages 1065–1074, Prague, Czech Republic.

B. Liu, M. Hu, and J. Cheng. 2005. Opinion observer: analyzing and comparing opinions on the web. In Proceedings of WWW, pages 342–351, Chiba, Japan.

B. Liu. 2012. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies. Morgan & Claypool.

C. Long, J. Zhang, and X. Zhu. 2010. A review selection approach for accurate feature rating estimation. In Proceedings of COLING, pages 766–774, Beijing, China.

C. D. Manning, P. Raghavan, and H. Schutze. 2008. Introduction to Information Retrieval. Cambridge University Press.

T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR.

T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS.

T. Mikolov, W.-T. Yih, and G. Zweig. 2013c. Linguistic regularities in continuous space word representations. In Proceedings of NAACL HLT.

S. Moghaddam and M. Ester. 2010. Opinion digger: an unsupervised opinion miner from unstructured product reviews. In Proceedings of CIKM, pages 1825–1828, Toronto, ON, Canada.

B. Pang and L. Lee. 2005. Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of ACL, pages 115–124, Ann Arbor, MI, USA.

A.-M. Popescu and O. Etzioni. 2005. Extracting product features and opinions from reviews. In Proceedings of HLT-EMNLP, pages 339–346, Vancouver, Canada.

T. Sakai. 2004. Ranking the NTCIR systems based on multigrade relevance. In Proceedings of AIRS, pages 251–262, Beijing, China.

B. Snyder and R. Barzilay. 2007. Multiple aspect ranking using the good grief algorithm. In Proceedings of NAACL, pages 300–307, Rochester, NY, USA.

M. Tsytsarau and T. Palpanas. 2012. Survey on mining subjective data on the web. Data Mining and Knowledge Discovery, 24(3):478–514.

C.-P. Wei, Y.-M. Chen, C.-S. Yang, and C. C. Yang. 2010. Understanding what concerns consumers: a semantic approach to product feature extraction from consumer reviews. Information Systems and E-Business Management, 8(2):149–167.

J. Yu, Z. Zha, M. Wang, and T. Chua. 2011. Aspect ranking: identifying important product aspects from online consumer reviews. In Proceedings of NAACL, pages 1496–1505, Portland, OR, USA.

Z. Zhai, B. Liu, H. Xu, and P. Jia. 2010. Grouping product features using semi-supervised learning with soft-constraints. In Proceedings of COLING, pages 1272–1280, Beijing, China.

Z. Zhai, B. Liu, H. Xu, and P. Jia. 2011. Clustering product features for opinion mining. In Proceedings of WSDM, pages 347–354, Hong Kong, China.
