
Research Article

Big Data Aspect-Based Opinion Mining Using the SLDA and HME-LDA Models

Ling Yuan,1 JiaLi Bin,1 YinZhen Wei,2 Fei Huang,3 XiaoFei Hu,3 and Min Tan3

1School of Computer Science, Huazhong University of Science and Technology, 430074, China
2Huanggang Normal University, 438000, China
3Wuhan Fiberhome Technical Services Co., Ltd, 430205, China

Correspondence should be addressed to YinZhen Wei; [email protected]

Received 20 July 2020; Revised 1 September 2020; Accepted 23 October 2020; Published 19 November 2020

Academic Editor: Amr Tolba

Copyright © 2020 Ling Yuan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In order to make better use of massive network comment data for decision-making support of customers and merchants in the big data era, this paper proposes two unsupervised optimized LDA (Latent Dirichlet Allocation) models, namely, SLDA (SentiWordNet WordNet-Latent Dirichlet Allocation) and HME-LDA (Hierarchical Clustering MaxEnt-Latent Dirichlet Allocation), for aspect-based opinion mining. One scheme for each of the two optimized models, both of which use seed words as topic words and construct an inverted index, is designed to enhance the readability of experiment results. Meanwhile, based on the LDA topic model, we introduce new indicator variables to refine the classification of topics and classify the opinion target words and the sentiment opinion words by two different schemes. For a better classification effect, the similarity between words and seed words is calculated in two ways to offset the fixed parameters in the standard LDA. In addition, based on the SemEval2016 ABSA data set and the Yelp data set, we design comparative experiments with training sets of different sizes and different seed words, which show that SLDA and HME-LDA perform better on accuracy, recall, and harmonic mean with unannotated training sets.

1. Introduction

With the development of the Internet, almost all the things of human living have become digitized. The information in the network is in the form of structured data and unstructured data [1]. Big data analysis and mining is aimed at discovering implicit, previously unknown, and potentially useful information and knowledge from big databases that contain high volumes of valuable veracious data collected or generated at a high velocity from a wide variety of data sources [2], which is called the “4V” of big data. In fact, the deeper mining of big data is to mine user demand and other deep information; the text mining that this paper studies is one of the ways to mine valid information from big textual data. Therefore, this study proposes two optimized opinion mining methods for customers and merchants to extract the valid information they need from massive textual data that satisfies the “4V” [2].

For example, when purchasing a product, people usually refer to others’ comments in the specialized product comment area first. Although comments on different platforms have different forms of display, most of them are text-based. Because most product comments on Taobao only have positive, neutral, and negative labels, what users can directly refer to is just the number of positive and negative comments. Since different users have different needs for the same product, they need to know which attributes of it perform well and which perform poorly. However, these attributes are not shown in detail in the comment interface. It is impossible and uneconomical for users to read ever more comments to find the attributes they need, because of the massive quantity and continuous growth of product comments. Thus, in order to assist clients and merchants in better decision-making, conducting opinion mining and sentiment analysis of big data is necessary, which will bring huge profits to some markets.

Data mining and analysis have been used in the tourism industry [3], groundwater potential mapping [4], and so on; they have become more and more important in modern life. Liu and Zhang [5] divided text mining and analysis tasks into

Hindawi, Wireless Communications and Mobile Computing, Volume 2020, Article ID 8869385, 19 pages. https://doi.org/10.1155/2020/8869385


three levels of granularity: chapter level, sentence level, and attribute level. The chapter level, whose research unit is a document, usually uses algorithms to show whether the opinions expressed by the author are negative or positive, and it is often used to analyse blogs and news. Sentence-level sentiment analysis, in turn, is often used to analyse whether the expression of microblogs, tweets, etc., in social networks is negative or positive. Obviously, it is impossible to perform opinion mining at the chapter level and sentence level when analysing user comments. Please consider the following review:

The location of this restaurant is relatively remote, but the waiter has a good attitude and the dishes taste good.

In fact, this sentence contains three opinions. Therefore, we ought not to determine the sentiment polarity of this comment as a whole but should produce more specific results, such as <location, remote, negative>, <service, good, positive>, and <dish, good, positive>, i.e., aspect-based opinion mining. In 2015 and 2016, SemEval released the research topic of ABSA (Aspect-Based Sentiment Analysis), triggering a research boom among scholars. SemEval divides ABSA into three subtasks: identifying entities and attributes, identifying the expression of opinions, and identifying the sentiment polarity of opinions. SemEval also gives annotated training sets and test sets for domains such as catering, hotels, and laptops.
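A result triple like <location, remote, negative> can be held in a small data structure and grouped per aspect for display. The sketch below is purely illustrative; the `Triple` container and `summarize` helper are hypothetical names, not code from the paper:

```python
from collections import namedtuple

# A hypothetical container for aspect-based opinion mining output:
# each comment yields zero or more <aspect, opinion, polarity> triples.
Triple = namedtuple("Triple", ["aspect", "opinion", "polarity"])

def summarize(triples):
    """Group triples by aspect so users can scan per-attribute verdicts."""
    by_aspect = {}
    for t in triples:
        by_aspect.setdefault(t.aspect, []).append((t.opinion, t.polarity))
    return by_aspect

review = [
    Triple("location", "remote", "negative"),
    Triple("service", "good", "positive"),
    Triple("dish", "good", "positive"),
]
print(summarize(review))
```

Grouping by aspect is what lets a user see at a glance that this review praises the service and food while criticizing the location.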

In recent years, people have devoted a lot of energy to analysing comments on the Internet to obtain the detailed opinions of users [6, 7]. The aspect-based opinion mining of online comments can be divided into two subtasks, namely, opinion target extraction and the classification of the sentiment polarity of comments. Figure 1 gives an overview of the classification of study methods for aspect-based opinion mining.

With the development of supervised models, deep learning, and the conditional random field, CNNs (Convolutional Neural Networks) and RNNs (Recurrent Neural Networks) have been widely used in NLP (Natural Language Processing) [8] and have achieved good results in ABSA research. Some other supervised learning methods [9, 10], such as the recently proposed BMAM [11], have also achieved good results. In addition, Chen et al. [12] and Araque et al. [13] introduced deep learning into ABSA and achieved satisfactory results. However, compared with the LDA-based models proposed in this paper, all the above models lack adaptability across fields and have high manpower costs for annotation. For example, the effects of these models will be greatly reduced when the aspect category of the comments is transferred from food and beverage to laptops. Moreover, supervised models such as BMAM [11] need much more manpower than the models proposed in this paper to annotate data, due to the small number of annotated training sets available. In addition, the time complexity of SLDA and HME-LDA is better than that of the popular models above: it is related only to the number of documents, topics, and words, while the time complexity of popular models such as RNNs is related to the multilayer neural network structure, especially the fully connected layers, and to the result of the previous time step, leading to a much lower speed than SLDA and HME-LDA.

As for unsupervised models, there are two basic models for latent semantic analysis: the probabilistic latent semantic analysis (PLSA) [14] model and the latent Dirichlet allocation (LDA) [15] model, which can be applied to extract attributes, assuming that each comment is a combination of attribute words and opinions. Mei et al. [16] proposed a topic-sentiment hybrid model based on PLSA to extract aspect opinion target words and sentiment words from a group of blogs. Li et al. [17] proposed two kinds of joint models, sentiment LDA and dependence-sentiment LDA, to find positive or negative aspects of sentiment words. Due to the flexibility of the LDA topic model, it has been extended and combined with other methods to obtain topic models [18, 19] that improve the result topics or the additional information of the model.

In view of the short content, wide coverage, and small number of annotated corpora of network comments and their need for aspect-based mining, this paper proposes two schemes based on the LDA topic model that have unsupervised features and good extensibility, making it possible to perform aspect-based opinion mining on network comments with as little annotated data as possible. As for data cleaning, the amount of data, correctness, completeness, and time correlation [20] are all good evaluation indicators of data quality. As for data amendment, the Markov Random Field performs well [21].

This paper regards opinion targets, aspects, and opinion expressions as aspect opinion targets, which refer to the entities or properties that sentiment words modify; sentiment words related to aspect opinion targets are called sentiment opinion words or opinion words.

Moreover, the aspect-based opinion mining schemes proposed in this paper only require users to set corresponding seed words; they introduce a classification layer to classify the opinion target words and sentiment opinion words. Meanwhile, in order to improve the effects of the models in both schemes, we bias the parameters of the LDA model by calculating the similarity between the words in the corpus and the seed words we set.

To sum up, the main study contents of this paper are asfollows:

(1) Introduce the NLP tools, WordNet and SentiWordNet, into the standard LDA model to design an optimized LDA-based topic model

(2) Introduce a maximum entropy classifier into the standard LDA model to design another optimized LDA-based topic model

(3) Implement the above two optimized models and design experiments to verify the feasibility and superiority of the optimized models

The rest of the paper is organized as follows. Next, we will describe two schemes with two optimized LDA-based models in detail. Then, the experimental results will be given in the next section, together with some analysis of the results. Finally, we give the conclusion.



2. Two Optimized LDA Models for Aspect-Based Opinion Mining

This paper proposes two schemes for aspect-based opinion mining. The first scheme is based on the inverted list and the SLDA (SentiWordNet WordNet-Latent Dirichlet Allocation) model proposed in this paper. The second scheme is based on the inverted list and the HME-LDA (Hierarchical Clustering MaxEnt-Latent Dirichlet Allocation) model proposed in this paper.

The SLDA model is an optimized LDA model based on WordNet and SentiWordNet, where WordNet is used for the similarity calculation between words and seed words and SentiWordNet is used for the separation of opinion target words and opinion words, while the HME-LDA model is an optimized LDA model based on SLDA and MaxEnt-LDA [22]. In fact, SLDA still has the disadvantage of relying on dictionary tools (WordNet and SentiWordNet), which prevents SLDA from being applied to other languages. The HME-LDA model solves this problem.

Next, we will illustrate the two optimized models of the schemes in detail.

2.1. Optimized Model, SLDA, Based on SentiWordNet and WordNet. Different from text messages such as documents, blogs, and news, network comments tend to be shorter and often appear in the form of sentences. Please consider the following restaurant comments:

(1) “We, there were four of us, arrived at noon - the place was empty and the staff acted like we were imposing on them and they were very rude.”

(2) “Everything is always cooked to perfection, the service is excellent, the decor cool and understated.”

In nonaspect opinion mining, the only thing you need to do is analyse the sentiment polarity of sentences. For example, the word “rude” in the first sentence is negative, so the sentiment polarity of the first sentence is negative. In the second one, the sentiment polarity of the word “perfection” is positive, so the sentiment polarity of the second sentence is positive. This approach, which only judges the sentiment polarity of sentences, does not apply to network comments. More meaningful information should be specific to <Aspect Opinion Target-Opinion> word pairs such as <staff-rude>, <cook-perfect>, and <service-excellent>. Moreover, an algorithm based on a plain topic model lacks readability and appointed key words, so the relevant original content cannot be found directly from the final result.

SLDA is an optimized LDA model based on SentiWordNet (a sentiment dictionary based on WordNet) and WordNet (a large database of English words). Since it is impossible for LDA itself to separate opinions from opinion targets, this chapter adds a classification layer of opinion words and opinion target words on top of LDA to realize the separation of opinions and aspect opinion targets. The similarity between each word in the text and the seed words, which is reflected in the LDA parameters, is calculated by setting the seed words and using the vocabulary similarity calculation tools of WordNet. Meanwhile, opinion targets are separated from sentiment opinion words using the vocabulary sentiment calculation tools of SentiWordNet. To address the lack of readability of LDA results, we establish a belonging relationship among the clustering results, seed words, and original texts. The SLDA model needs seed words to be set but has no need for additional annotated data sets.

In order to achieve the goal that users can quickly find what they want from massive comments by querying the index rather than by reading, there are three steps:

(1) Construct an inverted index to number sentences and words

(2) Determine the aspect category of comments, separate the aspect-based attribute words from the sentiment opinion words, and determine the aspect category to which they belong

(3) Enhance the readability of results. Users can see an overview of Domain-Aspect-Opinion and find the specific sentence by the inverted index of a word

It is worth noting that the premise of our study is that the aspect categories of the original corpus are known. The aspect category, which has a belonging relationship with the aspect opinion target, is usually given in advance by the original corpus and can be learned from the user guide of the corpus. For example, the aspect opinion target “steak” belongs to the aspect category “Food.” In addition, both words and sentences can belong to an aspect category. Thus, all that needs to be done is to label aspect opinion target words and sentences with a specific aspect category.

[Figure 1 appears here. Recoverable labels: supervised, frequency, topic model, sentiment lexicon, grammatical rules, others; the two branches are opinion target expression extraction and polarity judgement of sentimental opinions.]

Figure 1: The classification of aspect-based opinion mining.



2.1.1. Implementation Process. Figure 2 is an overall flow chart of the scheme that conducts aspect-based opinion mining based on SLDA and the inverted index.

(1) Construct an inverted index. The words in the corpus are numbered in the form of a binary group <a, b>, where a is the serial number of the sentence and b is the serial number of the word in the sentence. In addition, the generation of the inverted list requires the removal of duplicate words and the recording of their numbers. The inverted list preserves, for each word, the numbers of the sentences that contain it and its position information within those sentences, making it easier to retrieve context information from the original corpus later

(2) Data preprocessing, whose main tasks are to extract the required data and remove stop words from the original corpus to make a training set. The formats of the original data sets are XML and CSV, so it is necessary to extract comment statements by the corresponding labels and fields. The text in the corpus also contains many useless stop words, such as “is” and “a,” which should be removed before further processing to avoid interference with the training of the SLDA model

(3) Introduce the preprocessed data into SLDA for model training and get clustering results. The setup of seed words, as well as the assistance of WordNet and SentiWordNet, is required in this process

(4) Process the clustering results for better readability. The results of the topic model are probability matrices, whose readability is poor. To solve this problem, it is necessary to find the words corresponding to the results with higher probability and find their original sentences by the inverted index

In short, the SLDA model, trained with the preprocessed data of step 2 above, queries the original sentences that contain keywords from the original corpus by the inverted index of step 1 above.
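The inverted-index step above (step 1) can be sketched in a few lines. `build_inverted_index` is a hypothetical helper name, and a real preprocessing pass would also strip punctuation and stop words:

```python
def build_inverted_index(sentences):
    """Map each deduplicated word to the list of <sentence, position>
    pairs where it occurs, so the original context of any clustered
    word can be retrieved without rereading the corpus."""
    index = {}
    for s_id, sentence in enumerate(sentences):
        for w_pos, word in enumerate(sentence.lower().split()):
            # dict keys deduplicate words; postings keep every occurrence
            index.setdefault(word, []).append((s_id, w_pos))
    return index

corpus = [
    "The staff was very rude",
    "The service is excellent and the staff friendly",
]
index = build_inverted_index(corpus)
print(index["staff"])  # (sentence number, word position) pairs
```

Given a high-probability word from the clustering results, a lookup in this index immediately yields every original sentence (and position) where it appears.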

2.1.2. The Optimized Direction of LDA. The expectation of every random variable μ_i of the Dirichlet distribution can be expressed as E(μ_i). The value of E(μ_i) can be calculated by equation (1), where α represents the parameter of the Dirichlet distribution and K represents the number of topics:

$$E(\mu_i) = \frac{\alpha_i}{\sum_{i=1}^{K} \alpha_i}, \tag{1}$$

where α is a fixed value and the E(μ_i) of each topic is the same. Based on this, a biasing method for α can be explored to make the expectations of the corresponding probabilities of each topic different, so as to generate a topic bias. More visually, when α is fixed, we cannot know which topic word to use before generating the document; when α is biased, the ideal topic is determined before the document is generated.
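Equation (1) is easy to check numerically. The sketch below simply evaluates E(μ_i) for a symmetric and a biased α, showing how biasing shifts the expected topic probabilities:

```python
def dirichlet_expectations(alpha):
    """E[mu_i] = alpha_i / sum_k alpha_k for a Dirichlet(alpha) vector."""
    total = sum(alpha)
    return [a / total for a in alpha]

# Symmetric alpha: every topic has the same expected probability.
print(dirichlet_expectations([1.0, 1.0, 1.0, 1.0]))  # [0.25, 0.25, 0.25, 0.25]
# Biased alpha: expectation shifts toward the favored topic.
print(dirichlet_expectations([4.0, 1.0, 1.0, 2.0]))  # [0.5, 0.125, 0.125, 0.25]
```

This is exactly the lever SLDA pulls: raising one component of α before generation makes the corresponding topic more probable a priori.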

Actually, the topic number Z_{m,n} can be used as an indicator variable in the LDA model to control the selection of the topic-word distribution. Based on this, more indicator variables can be introduced to refine the topic and obtain more topic-word distributions.

From the above, there are two directions in which the LDA model can be optimized: the first is to bias the parameters α and β, which can generate topic biases to improve the classification effect; the second is to introduce more indicator variables like Z_{m,n}, which can generate more detailed topic classifications.

2.1.3. The Description of the SLDA Model. The standard LDA uses the document as the unit of topic allocation, while in the aspect-based opinion mining of network comments, the sentence is often used as the unit of topic allocation, because there are no documents with large contents in network comments and the topic allocation of vocabulary in network comments is actually more meaningful. In the aspect-based opinion mining of network comments, it is necessary to extract the aspect category of the comment, the opinion target, and the comment opinion (sentiment polarity) from the text. For example, in restaurant comments, “food,” “service,” and “ambience” are the aspect categories of comments. In the “food” category, “steak” is the opinion target, and the evaluation of “good” for “steak” is the opinion of the comment (sentiment polarity).

In the SLDA model, seed words are directly used as the aspect categories of comments, while the opinion targets of comments and the comment opinions are separated by the introduced sentiment layer, and the positive and negative polarities of comments are classified as well. The PGM (Probabilistic Graphical Model) of SLDA is shown in Figure 3.

2.1.4. The Generation Process of the SLDA Model. Only the parameters α and β, which, respectively, belong to the document-topic Dirichlet distribution and the topic-word

[Figure 2 appears here. Recoverable labels: corpus, data preprocessing, training set, seed-word set, WordNet/SentiWordNet as inputs to the model, SLDA model, extraction results, construct the inverted index, query sentences by words, query the original statement, output the result.]

Figure 2: The overall flow chart of the scheme that is based on SLDA and inverted index.



Dirichlet distribution, are introduced into the standard LDA. In the SLDA model, seed words have been identified as the aspect categories of comments. The variable y_d ∈ {A, O} is introduced to represent the separation of the opinion target and the comment opinion. When y_d = A, the current word is the opinion target of the comment. When y_d = O, the current word is the comment opinion. Meanwhile, the variable v_d ∈ {P, N} is introduced into SLDA. When v_d = P, the sentiment polarity of the current word is positive, and when v_d = N, the sentiment polarity of the current word is negative. Both y_d and v_d are determined by the corresponding algorithms based on WordNet and SentiWordNet. In the standard LDA model, the values of the parameters α and β are fixed. In the SLDA model, a different α_d will be set for each sentence, and the parameters β_{t,A}, β_{t,P}, and β_{t,N} will be set for the opinion target, the positive comments, and the negative comments, respectively.

In SLDA, the first step is to generate the sentence-topic distribution and determine a topic for each sentence by the multinomial distribution. Then, two influence factors, y_d and v_d, are determined by WordNet and SentiWordNet to indicate the aspect categories of words. By the SLDA generation method above, we finally select a topic-word distribution and determine the final word.

The concrete generation process of SLDA is as follows. Firstly, the distributions of the opinion target φ_{t,A}, the positive opinion word φ_{t,P}, and the negative opinion word φ_{t,N} are, respectively, extracted from the parameters β_{t,A}, β_{t,P}, and β_{t,N}, where φ_{t,A} ~ Dir(β_{t,A}), φ_{t,P} ~ Dir(β_{t,P}), and φ_{t,N} ~ Dir(β_{t,N}).

Similar to the standard LDA, for each sentence (document in the standard LDA), a topic distribution θ_d ~ Dir(α_d) is extracted from the Dirichlet distribution with the parameter α_d.

Then, the word w_{d,n} in the sentence d extracts a topic number t, i.e., extracts z_{d,n} ~ Multi(θ_d).

After determining the topic number of the word w_{d,n}, its classification remains to be determined. After determining the topic t in SLDA, in order to further classify the word as an opinion target word or a comment opinion, the variables y_{d,n} ∈ {A, O} and v_{d,n} ∈ {P, N} are introduced into the SLDA model. The y_{d,n} and v_{d,n} jointly point to the n-th word w_{d,n} in the sentence d. The y_{d,n} is used to indicate whether the word w_{d,n} is the opinion target of the comment or the comment opinion, and it is extracted from the Bernoulli distribution on {0, 1} with the parameter π_{d,n}. The v_{d,n} is used to indicate whether the word w_{d,n} is a positive or a negative comment opinion, and it is extracted from the Bernoulli distribution on {0, 1} with the parameter Ω_{d,n}. The above two Bernoulli distributions are calculated by WordNet and SentiWordNet. Finally, the word w_{d,n} can be extracted according to equation (2):

$$w_{d,n} \sim \begin{cases} \mathrm{Multi}(\varphi_{t,A}), & \text{if } y_{d,n} = A, \\ \mathrm{Multi}(\varphi_{t,P}), & \text{if } y_{d,n} = O \text{ and } v_{d,n} = P, \\ \mathrm{Multi}(\varphi_{t,N}), & \text{if } y_{d,n} = O \text{ and } v_{d,n} = N. \end{cases} \tag{2}$$
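The generative story above can be sketched with NumPy. This is a toy illustration of the sampling flow only: the WordNet/SentiWordNet-derived parameters π_{d,n} and Ω_{d,n} are replaced by fixed constants, and the topic and vocabulary sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

T, V = 3, 8            # toy number of topics (seed words) and vocabulary size
beta = np.ones(V)      # symmetric priors for the three word distributions

# Per-topic word distributions: phi_{t,A}, phi_{t,P}, phi_{t,N} ~ Dir(beta)
phi_A = rng.dirichlet(beta, size=T)
phi_P = rng.dirichlet(beta, size=T)
phi_N = rng.dirichlet(beta, size=T)

def generate_sentence(alpha_d, n_words, pi=0.6, omega=0.5):
    """Sample one sentence: draw theta_d ~ Dir(alpha_d), then per word a
    topic z, the indicator y (target vs. opinion), the indicator v
    (positive vs. negative), and finally a word id from the matching
    topic-word distribution, as in equation (2)."""
    theta_d = rng.dirichlet(alpha_d)
    words = []
    for _ in range(n_words):
        z = rng.choice(T, p=theta_d)          # z_{d,n} ~ Multi(theta_d)
        if rng.random() < pi:                 # y_{d,n} ~ Bernoulli(pi_{d,n})
            dist = phi_A[z]                   # opinion target word
        elif rng.random() < omega:            # v_{d,n} ~ Bernoulli(Omega_{d,n})
            dist = phi_P[z]                   # positive opinion word
        else:
            dist = phi_N[z]                   # negative opinion word
        words.append(int(rng.choice(V, p=dist)))
    return words

print(generate_sentence(np.ones(T), n_words=5))
```

In the actual model, π and Ω would of course vary per word according to equations (3) and (4) rather than being constants.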

[Figure 3 appears here. Recoverable nodes: the priors β_{t,A}, β_{t,P}, β_{t,N}; the word distributions φ_{t,A}, φ_{t,P}, φ_{t,N}; the per-sentence variables α_d, θ_d, z_{d,n}; the indicators y_{d,n} and v_{d,n} with parameters π_{d,n} and Ω_{d,n} supplied by WordNet and SentiWordNet; and the observed word w_{d,n}.]

Figure 3: The probabilistic graphical model of SLDA.



Table 1 gives a description of the related symbols in the SLDA model for the reader's reference.

2.1.5. The Inference Process of the SLDA Model. The SLDA model consists of two major parts. One is the classification of opinion target words and sentiment opinion words based on WordNet and SentiWordNet, together with the classification of positive and negative sentiment opinion words. The other is the LDA topic model. Besides, the seed words should be set before the inference of SLDA. Next, these modules will be introduced in turn.

(1) The Setting of Seed Words. In SLDA, seed words are set as the aspect categories of comments. For example, in the classic English comment set of the Restaurant domain, the aspect categories are food, service, ambience, etc. If the corpus is the Restaurant English comment set, the seed words can be directly set as food, service, and ambience. A seed word is recorded as w_t, where t ∈ {1, ⋯, T}; that is to say, the number of seed words determines the number of topics in the SLDA model.

(2) The Inference of the Word Classification Model Based on WordNet and SentiWordNet. In the SLDA model, the Bernoulli distribution with parameter π_{d,n} and the Bernoulli distribution with parameter Ω_{d,n}, which are, respectively, used to separate the opinion target words from the sentiment opinion words and to separate positive sentiment opinion words from negative ones, are related to WordNet and SentiWordNet. The calculation of π_{d,n} depends on the seed word w_t.

The words in WordNet have the feature of polysemy. To calculate the similarity between words, it is necessary to determine the exact meaning of a word. Therefore, before the model inference, it is necessary to determine the semantic interpretation s_{t,k0} of the seed word w_t in WordNet. When the current word is w_{d,n}, its semantic interpretation in WordNet is s_{d,n,k}, where k ∈ {1, ⋯, K} and K is the number of semantic interpretations. After determining the semantics of the seed words, we can regard the seed word as the topic and the aspect category of the final opinion target. All nonsentiment opinion words will be grouped into a certain aspect category of comments. Therefore, the similarity between the semantics of the current word and the semantics of the seed word can be calculated, and the semantics with the greatest similarity can finally be determined as the meaning expressed by the word in the sentence. The degree of semantic similarity between different words in WordNet is recorded as Sim(s_1, s_2); then, the semantic similarity Sim(s_{d,n,k}, s_{t,k0}) between the s_{d,n,k} of w_{d,n} and the s_{t,k0} of each seed word w_t can be calculated, where k ∈ {1, ⋯, K}, K is the number of semantic interpretations, t ∈ {1, ⋯, T}, and T is the number of seed words. Among all calculation results, we choose the semantics with the maximum similarity value and determine the largest result k′ as the semantics s_{d,n,k′} to which the current word w_{d,n} belongs.

After determining the semantics s_{d,n,k′} of the word w_{d,n}, the sentiment polarity of s_{d,n,k′} can be queried in SentiWordNet. In SentiWordNet, each sense of a word has three sentiment propensity probabilities: p^O indicates the probability that the sense is objective (carrying no sentiment polarity), p^P indicates the probability that the sense is positive, and p^N indicates the probability that the sense is negative, with p^O + p^P + p^N = 1. It is assumed that if the sense is objective, the word is an opinion target, whose corresponding probability is p^O; otherwise, it is a sentiment opinion word. The sentiment scores of the sense s_{d,n,k′} of the word w_{d,n} are recorded as p^O_{d,n,k′}, p^P_{d,n,k′}, and p^N_{d,n,k′}. In Section 2.1.4, the Bernoulli distributions with parameter π_{d,n} and parameter Ω_{d,n} are determined by equations (3) and (4):

π_{d,n} = p^O_{d,n,k′},  (3)

Ω_{d,n} = p^P_{d,n,k′} / (p^P_{d,n,k′} + p^N_{d,n,k′}).  (4)

So far, the model inference based on WordNet and SentiWordNet has been completed. Algorithm 1 gives the pseudocode for calculating π_{d,n} and Ω_{d,n}.

(3) The Inference of the SLDA Model. In the standard LDA, the parameters α and β of the Dirichlet distributions are fixed. In the SLDA model, by contrast, α and β are biased by calculating the similarity between the input corpus and the seed words. The fixed parameters are recorded as α_base and β_base, and the biased parameters are recorded as α_d, β_{t,A}, β_{t,P}, and β_{t,N}. In this paragraph, the semantic similarity between the word w and the seed word t is recorded as sim(w, t). The probability that w is a positive word is recorded as sim(w, P), and the probability that w is a negative word as sim(w, N). If w is an opinion target word, the similarity is recorded as sim(w, A); if it is a sentiment opinion word, it is recorded as sim(w, O).

In the standard LDA, the parameter α controls the topic distribution probability of a document. Its value is the same for all documents in the corpus, so it is hard to favor a particular topic for a document before the document is generated. In SLDA, however, the parameter α_d, calculated by equation (5), is set separately for each document (each sentence in SLDA) based on the similarity between its vocabulary and the seed words, which makes it possible to favor a more suitable topic before the document is generated.

α_{d,t} = [Σ_{i=1}^{N_d} sim(w_{d,i}, t)] / [Σ_{t′=1}^{T} Σ_{i=1}^{N_d} sim(w_{d,i}, t′)] × α_base.  (5)

In equation (5), N_d is the number of words in the current sentence, T is the number of topics, w_{d,i} is the i-th word in the current sentence, and t is the seed word.

In the standard LDA, the parameter β controls the word distribution of each topic, and its value is the same for every topic. In SLDA, the parameters calculated by equations (6), (7), and (8) are set separately for each topic based on the similarity between the vocabulary and the seed words.

Wireless Communications and Mobile Computing

β_{t,A} = sim(w, A) × β_base,  (6)

β_{t,P} = sim(w, P) × β_base,  (7)

β_{t,N} = sim(w, N) × β_base.  (8)
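As a rough sketch (not the authors' code; the toy similarity function and base values below are invented for illustration), equations (5)–(8) amount to renormalizing similarity mass into per-sentence and per-topic Dirichlet pseudo-counts:

```python
import numpy as np

def biased_alpha(sentence_words, seed_words, sim, alpha_base=0.1):
    """Equation (5) as a sketch: split alpha_base across topics according to
    each seed word's share of the sentence's total similarity mass, so the
    per-topic values sum back to alpha_base."""
    mass = np.array([sum(sim(w, t) for w in sentence_words) for t in seed_words])
    return mass / mass.sum() * alpha_base

def biased_beta(vocab, label, sim, beta_base=0.01):
    """Equations (6)-(8) as a sketch: scale beta_base per word by its
    similarity to a category label ('A', 'P', or 'N')."""
    return np.array([sim(w, label) * beta_base for w in vocab])

# toy similarity: 1.0 on an exact match with the seed/label word, else 0.1
toy_sim = lambda w, t: 1.0 if w == t else 0.1
alpha_d = biased_alpha(["good", "food", "here"], ["food", "service"], toy_sim)
```

Under this toy similarity, the sentence containing "food" pulls most of the prior mass toward the food topic.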

Similar to the standard LDA, SLDA is solved by the Gibbs sampling method. The variables involved in the solution process are explained in Table 2.

Equation (9) is used to sample the topic of each sentence:

p(z_{d,n} = t | z_{¬d,n}, y_{¬d,n}, v_{¬d,n}, ·) ∝ [(n^{t,A}_{w_{d,n}} + β^{t,A}_{w_{d,n}}) / Σ_{v=1}^{V} (n^{t,A}_v + β^{t,A}_v)] × [(n^{t,P}_{w_{d,n}} + β^{t,P}_{w_{d,n}}) / Σ_{v=1}^{V} (n^{t,P}_v + β^{t,P}_v)] × [(n^{t,N}_{w_{d,n}} + β^{t,N}_{w_{d,n}}) / Σ_{v=1}^{V} (n^{t,N}_v + β^{t,N}_v)] × (n_{d,t} + α_{d,t}).  (9)
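A minimal sketch of one sampling draw proportional to equation (9) might look as follows; the count-matrix shapes and variable names are our own reading of the notation, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_topic(w, n_A, n_P, n_N, b_A, b_P, b_N, n_dt, alpha_d):
    """One Gibbs draw for the sentence topic, proportional to equation (9).
    n_A/n_P/n_N are (T, V) count matrices for the three word categories,
    b_* are the (T, V) biased priors, n_dt is the (T,) topic-count vector of
    the current sentence, and alpha_d is its (T,) biased prior."""
    t_A = (n_A[:, w] + b_A[:, w]) / (n_A + b_A).sum(axis=1)
    t_P = (n_P[:, w] + b_P[:, w]) / (n_P + b_P).sum(axis=1)
    t_N = (n_N[:, w] + b_N[:, w]) / (n_N + b_N).sum(axis=1)
    weights = t_A * t_P * t_N * (n_dt + alpha_d)   # unnormalized eq. (9)
    return int(rng.choice(len(weights), p=weights / weights.sum()))

T, V = 2, 3
ones = np.ones((T, V))
topic = sample_topic(0, ones, ones, ones, 0.01 * ones, 0.01 * ones, 0.01 * ones,
                     np.array([5.0, 0.0]), np.array([0.05, 0.05]))
```

In a full sampler the counts would first be decremented for the current assignment and re-incremented after the draw.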

Table 1: The description of related symbols in the SLDA model.

Symbol: Description
D: The total number of comments in the corpus; the unit of the corpus is the sentence
T: The number of topics
V: The number of words in the corpus
w_{d,n}: The n-th word in the d-th comment in the corpus
z_{d,n}: The topic of the d-th comment; its value is in {1, …, T}
y_{d,n}: An indicator variable with value in {A, O}, used to indicate the opinion target words and the sentiment opinion words
v_{d,n}: An indicator variable with value in {P, N}, used to indicate positive and negative sentiment opinion words
A, O, P, N: The opinion target, the sentiment opinion word, the positive sentiment word, and the negative sentiment word
φ_{t,A}: The distribution of the opinion target words, generated by a prior Dirichlet distribution with the parameter β_{t,A}
φ_{t,P}: The distribution of positive opinion words, generated by a prior Dirichlet distribution with the parameter β_{t,P}
φ_{t,N}: The distribution of negative opinion words, generated by a prior Dirichlet distribution with the parameter β_{t,N}
θ_d: The distribution of sentence topics, generated by a prior Dirichlet distribution with the parameter α_d

Query the semantic list Slist of w_{d,n} in WordNet
sw = 0        // records the semantics with the maximum similarity
simmax = 0    // records the maximum similarity
for each semantics s in Slist do
    for each seed semantics st in Stlist do
        // Sim() is a function provided by WordNet
        if simmax < Sim(s, st) then
            sw = s
            simmax = Sim(s, st)
        end if
    end for
end for
Use SentiWordNet to query the sentiment polarity of the semantics sw
Calculate π_{d,n} using equation (3)
Calculate Ω_{d,n} using equation (4)
return sw, π_{d,n}, Ω_{d,n}

Algorithm 1. The calculation of π_{d,n} and Ω_{d,n}.
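Algorithm 1 can be sketched in Python with toy stand-ins for the WordNet sense list, the similarity function, and the SentiWordNet scores (all lexicon entries and sense names below are invented for illustration):

```python
def pi_omega(word, seed_senses, senses, sim, senti):
    """Sketch of Algorithm 1. senses[word] lists candidate senses of the word;
    senti[sense] = (pO, pP, pN) are SentiWordNet-style scores summing to 1;
    sim(s1, s2) is any WordNet-style sense similarity."""
    best_sense, best_sim = None, -1.0
    for s in senses[word]:           # each candidate sense of the current word
        for st in seed_senses:       # the chosen sense of each seed word
            if sim(s, st) > best_sim:
                best_sense, best_sim = s, sim(s, st)
    pO, pP, pN = senti[best_sense]
    pi = pO                          # equation (3)
    omega = pP / (pP + pN)           # equation (4); assumes pP + pN > 0
    return best_sense, pi, omega

# toy data: "tasty" has two senses; sense "tasty.a.01" is closest to seed "food.n.01"
senses = {"tasty": ["tasty.a.01", "tasty.a.02"]}
senti = {"tasty.a.01": (0.1, 0.8, 0.1), "tasty.a.02": (0.6, 0.2, 0.2)}
sim = lambda s, t: 0.9 if (s, t) == ("tasty.a.01", "food.n.01") else 0.2
sense, pi, omega = pi_omega("tasty", ["food.n.01"], senses, sim, senti)
```

With real data, the sense lists, similarities, and scores would come from WordNet and SentiWordNet (e.g., via NLTK's corpus readers).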


Equations (10) and (11) are used to sample y_{d,n} and v_{d,n}:

p(y_{d,n} = u | z_{d,n} = t, ·) ∝ [(n^{t,u}_{w_{d,n}} + β^{t,u}_{w_{d,n}}) × sim(w_{d,n}, u)] / Σ_{v=1}^{V} (n^{t,u}_v + β^{t,u}_v),  u ∈ {A, O},  (10)

p(v_{d,n} = q | z_{d,n} = t, ·) ∝ [(n^{t,q}_{w_{d,n}} + β^{t,q}_{w_{d,n}}) × sim(w_{d,n}, q)] / Σ_{v=1}^{V} (n^{t,q}_v + β^{t,q}_v),  q ∈ {P, N}.  (11)

In the corpus, the approximate probability of the topic t in the sentence d can be calculated by equation (12):

θ_{d,t} = (n_{d,t} + α_{d,t}) / (n_d + Σ_{t′=1}^{T} α_{d,t′}).  (12)

With t as the topic, the approximate probability that the word w_{d,n} is the opinion target can be calculated by equation (13), the approximate probability that it is a positive opinion word by equation (14), and the approximate probability that it is a negative opinion word by equation (15).

φ^{t,A}_{w_{d,n}} = (n^{t,A}_{w_{d,n}} + β^{t,A}_{w_{d,n}}) / Σ_{v=1}^{V} (n^{t,A}_v + β^{t,A}_v),  (13)

φ^{t,P}_{w_{d,n}} = (n^{t,P}_{w_{d,n}} + β^{t,P}_{w_{d,n}}) / Σ_{v=1}^{V} (n^{t,P}_v + β^{t,P}_v),  (14)

φ^{t,N}_{w_{d,n}} = (n^{t,N}_{w_{d,n}} + β^{t,N}_{w_{d,n}}) / Σ_{v=1}^{V} (n^{t,N}_v + β^{t,N}_v).  (15)

(4) Gibbs Sampling Implementation of the SLDA Model. The Gibbs sampling process of SLDA mainly has the following steps:

(1) Random initialization: randomly assign a topic number z to each sentence in the corpus, and set the topic numbers of all words in the sentence to z as well; then randomly set the values of the indicator variables y and u for all words in the sentence, where y ∈ {A, O} and u ∈ {P, N}

(2) Traverse the corpus again, resample the topics according to equation (9), and update the relevant counts; then resample the indicator variables y and u according to equations (10) and (11), and update their values

(3) Repeat step 2 until the Gibbs sampling results converge

(4) Process the final results to improve their readability and save them

In the initialization phase of the Gibbs sampling, a topic number is randomly assigned to each sentence in the corpus. For each word in the sentence, one of three categories is randomly assigned, determined by the indicator variables y and u; then the corresponding statistics are increased by 1. Part of the pseudocode of the initialization phase is shown in Algorithm 2.

In the repeated iteration phase of Gibbs sampling, the topic of each sentence in the corpus and the category of each word in the sentence are resampled, and the relevant statistics are updated after each sampling. This process repeats until the end of the iterations.

In the result processing stage, the relevant value φ can be calculated by equations (13), (14), and (15). The calculated φ value is just a number, and the variables required in the calculation are the topic number, the category to which the word belongs, and the index of the word in the vocabulary. The relevant sentence information is missing; accordingly, it is necessary to organize the various types of information effectively for the user to view. We use result to represent the final result. The result.topic is the topic information, that is, the seed word set by ourselves, which can also be regarded as the comment category word. The result.word saves the original word. The result.wordType is the category of the current word (aspect target word, positive or negative opinion word). The result.sentences is the sentence to which the word belongs. The result.prob is the probability that the word becomes a member of the category to which the comment belongs. The first m results are generated for each category under all the topics, and the relevant pseudocode is shown in Algorithm 3.

After getting the finalResults, we can query the results according to both the topic and the wordType.
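The result-processing idea, ranking words by φ and recovering their original sentences through the inverted index, can be sketched as follows (the field names follow the result fields described above; the sentences and φ values are toy examples):

```python
from collections import defaultdict

def build_inverted_index(sentences):
    """Map each word to the ids of the sentences containing it, so the
    original sentence of a high-probability word can be recovered."""
    index = defaultdict(set)
    for sid, sent in enumerate(sentences):
        for word in sent.split():
            index[word].add(sid)
    return index

def top_m_with_sentences(phi_row, vocab, index, sentences, m=3):
    """Take the m words with the highest phi value for one (topic, wordType)
    pair and attach the sentences they occur in."""
    ranked = sorted(range(len(vocab)), key=lambda v: phi_row[v], reverse=True)[:m]
    return [{"word": vocab[v],
             "prob": phi_row[v],
             "sentences": [sentences[s] for s in sorted(index[vocab[v]])]}
            for v in ranked]

sentences = ["the pizza was great", "slow service", "great pizza here"]
index = build_inverted_index(sentences)
vocab = ["pizza", "service", "great"]
results = top_m_with_sentences([0.5, 0.2, 0.3], vocab, index, sentences, m=2)
```

This mirrors the lookup step of Algorithm 3: the probability matrix alone names only word indices, and the index restores the readable context.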

2.2. Optimized Model, HME-LDA, Based on MaxEnt-LDA. Because WordNet and SentiWordNet only support English, SLDA cannot adapt to other languages. Therefore, we propose an optimized model in this section, namely, the HME-LDA model, which is language adaptable. The

Table 2: The partial symbolic description of SLDA model inference.

Symbol: Description
n^{t,A}_v: The number of occurrences of word v in topic t and category A
n_{d,t}: The number of words with the topic t in the d-th sentence
β^{t,u}_v: The biased Dirichlet parameter of word v in topic t and category u, u ∈ {A, P, N}


optimized model, HME-LDA, is proposed by combining the MaxEnt-LDA [22] model with the SLDA model. HME-LDA uses an unsupervised hierarchical clustering method to generate the annotated data set required by the maximum entropy model. On this basis, the new model can be used for comment opinion mining in many other languages.

2.2.1. Implementation Process. The overall process of the scheme that conducts aspect-based opinion mining based on HME-LDA and the inverted index is shown in Figure 4:

(1) This step is the same as the implementation of step 1 in Section 2.1.1

(2) Data preprocessing, whose main task is to remove stop words. The text in the corpus contains many useless stop words, such as "is" and "a," which should be removed before further processing to avoid interference with the results

(3) Automatically generate annotated data sets by hierarchical clustering and train maximum entropy models for the classification of the opinion target words and sentiment opinion words

(4) Input the data into the HME-LDA model and perform the training to get the results

(5) Process the results for better readability. The results of the topic model are probability matrixes, whose readability is poor. To solve this problem, it is necessary to find the words corresponding to the results with higher probability and find their original sentences by the inverted index
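Step 2's stop-word removal can be sketched as below; the stop list is a tiny illustrative subset, not the one actually used in the experiments:

```python
# Minimal stop-word filtering sketch with a simple whitespace/punctuation
# tokenization; a real pipeline would use a full stop-word list.
STOP_WORDS = {"is", "a", "the", "was", "of", "and"}

def remove_stop_words(sentence):
    """Lowercase, strip trailing punctuation, and drop stop words."""
    tokens = [t.strip(".,!?").lower() for t in sentence.split()]
    return [t for t in tokens if t and t not in STOP_WORDS]

tokens = remove_stop_words("The service is a bit slow.")
```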

As for step 3 above, the corpus contains a great deal of word order information; thus, when the category of a word in the corpus can be determined, the feature {w_{i−2}, w_{i−1}, w_i, w_{i+1}, w_{i+2}} can be extracted. In HME-LDA, seed words are divided into topic seed words of aspect categories, seed words expressing positive sentiment, and seed words expressing negative sentiment. By scanning the corpus, annotated feature sets {w_{i−2}, w_{i−1}, w^u_i, w_{i+1}, w_{i+2}} can be obtained, where u ∈ {A, O}. When u = A, w^A_i is a topic seed word; when u = O, w^O_i is a sentiment seed word; and u is regarded as the label. However, the number of seed words is limited, so the training set obtained by scanning the corpus may be insufficient in size. Therefore, the words in the corpus are clustered, all words in the cluster of a seed word are considered to have the same category as that seed word, and the word w^u_i in the scanned annotated feature set {w_{i−2}, w_{i−1}, w^u_i, w_{i+1}, w_{i+2}} is replaced with each word in the seed word's cluster to obtain new annotated data.

2.2.2. MaxEnt-LDA Model. In order to realize aspect-based opinion mining of reviews, Zhao et al. [22] proposed the MaxEnt-LDA model based on LDA. Figure 5 is the probability model diagram of MaxEnt-LDA [22].

Adopting the second idea of optimizing LDA in Section 2.1.2, MaxEnt-LDA further divides topics and increases the topic number by introducing another two indicator variables, y_{d,s,n} and u_{d,s,n}, that are similar to z_{m,n}. Three new topic-word distributions are generated by the Dirichlet distribution with the parameter β. Meanwhile, the original topic is further divided into categories A (aspect term) and O (opinion word).

In the MaxEnt-LDA model, the indicator variable y_{d,s,n} is generated by sampling from the maximum entropy model. In this model, the selected features of the maximum entropy model include the word and its part of speech, and there are three labels: background word B, opinion word O, and opinion target word A. The annotated training set is partially extracted from the SemEval data set and partially annotated manually. The indicator variable y_{d,s,n} is treated the same as z_{m,n}: both are sampled from the multinomial distribution generated by the Dirichlet distribution.

MaxEnt-LDA increases both the type and number of classifications and introduces the maximum entropy model. However, it also increases the dependence on annotated data. In order to preserve the unsupervised nature of the model, there are two ways to improve MaxEnt-LDA: one is to replace the maximum entropy model with another unsupervised classification model; the other is to use unsupervised methods to automatically label data sets so as to avoid the dependence on annotated data. Meanwhile, the parameters α and β can also be biased.

2.2.3. The Description of the HME-LDA Model. In the HME-LDA model, seed words are directly set as an aspect category

for d = 1 to D do
    // randomly assign topic t to the d-th sentence in the corpus
    t = random integer from (1, T)
    documentTopics[d] = t
    documentTopicsCount[d][t]++
    for w = 1 to N do
        y = random integer from (0, 1)    // randomly assign value to y
        Y[d][w] = y
        u = random integer from (0, 1)    // randomly assign value to u
        U[d][w] = u
        if y == 0 then
            // statistic of aspect targets with topic t plus 1
            aspectWordCount[w][t]++
        end if
        if y == 1 and u == 0 then
            // statistic of positive opinions with topic t plus 1
            positiveWordCount[w][t]++
        end if
        if y == 1 and u == 1 then
            // statistic of negative opinions with topic t plus 1
            negativeWordCount[w][t]++
        end if
    end for
end for

Algorithm 2. The initialization of Gibbs sampling.


// Create a 3D array to store results
init results[][][]
for topic in TopicList do
    for wordType in {A, P, N} do
        for v in V do
            result = new result
            result.topic = topic
            result.word = v
            result.wordType = wordType
            Calculate φ according to equation (13), (14), or (15)
            result.prob = φ
            results[topic][wordType][v] = result    // add result to the array
        end for
    end for
end for
// Next, sort the results
init finalResults[][][]    // stores results together with their sentences
for topic in TopicList do
    for wordType in {A, P, N} do
        // sort by φ value and keep the top m results
        finalResults[topic][wordType] = getTopKByPhi(results[topic][wordType], m)
        for k = 1 to m do
            Query the corresponding sentence of result.word from the inverted index described in Section 3.1.1 and add it to the result
        end for
    end for
end for
return finalResults

Algorithm 3. The result processing of the Gibbs sampling.

[Figure 4 residue omitted: the flow chart shows the training set and additional corpus going through data preprocessing; a seed-word set driving both similarity calculation (via Word2Vec) and hierarchical classification for automatically labelling the training set of the maximum entropy model; the corpus feeding the HME-LDA model; and an inverted index over the corpus queried by word to recover the original sentences of the extraction results before output.]

Figure 4: The flow chart of the scheme that is based on the HME-LDA and inverted index.


of reviews. Therefore, it only needs to classify the opinion targets and the review opinions by introducing the maximum entropy model as a classifier. The Bernoulli distribution with the parameter π_{d,n}, where d indicates the d-th sentence and n indicates the n-th word, is jointly determined by the weight λ_{d,n} and the eigenvector f_{d,n} of the maximum entropy model. A beta distribution with a parameter δ_d is introduced as a prior to generate a Bernoulli distribution with a parameter Ω_d for the classification of positive and negative sentiment opinion words. Figure 6 is the probabilistic graphical model (PGM).

The generation process of the HME-LDA model is similar to that of the SLDA model in Section 2.1. The difference is that y_{d,n} is determined by the maximum entropy model rather than by WordNet and SentiWordNet, and v_{d,n} is determined by the parameter δ_d rather than by the calculation over WordNet and SentiWordNet. For the generation process of the HME-LDA model, refer to Section 2.1.4.

In addition, because there is no method to calculate sentiment polarity in HME-LDA, it is necessary to set a sentiment opinion word whose sentiment polarity is positive or negative for each comment category.

2.2.4. The Automatic Data Annotation Method Based on Hierarchical Clustering. The MaxEnt-LDA [22] model uses a word-feature window whose size is 3 to extract features from the annotated words. The selected features include the word and the word order {w_{i−1}, w_i, w_{i+1}}, where w_i is the current word, as well as the grammatical features {POS_{i−1}, POS_i, POS_{i+1}}, where POS_i indicates the part of speech of the current word (adjective, noun, verb, etc.). Part-of-speech tagging requires additional tools, its accuracy differs across languages, and for some languages a tagging tool may be lacking. Therefore, the features selected in this chapter are only the words themselves.

(1) The Process of Automatically Annotating Data. After identifying the feature information obtained from training, the next step is to consider how to label it automatically. In order to extract the feature {w_{i−2}, w_{i−1}, w_i, w_{i+1}, w_{i+2}}, the only thing that needs to be done here is to determine the category of the word w_i in the corpus with word order information. The seed-word setting of HME-LDA is explained in Section 2.2.3. In HME-LDA, seed words can be divided into three categories: topic seed words of a review category, seed words expressing positive sentiments, and seed words expressing negative sentiments. The opinion target words belong to the corresponding category of comments, both of which are the same kind of topic. Therefore, seed words can be divided into two categories, opinion target words and sentiment opinion words, and the categories of these seed words can be determined. The annotated feature sets {w_{i−2}, w_{i−1}, w^u_i, w_{i+1}, w_{i+2}} can be obtained by scanning the corpus, where u ∈ {A, O}. When u = A, w^A_i is a topic seed word; when u = O, w^O_i is a sentiment seed word; and u is the label.

However, the number of seed words is limited, and the size of the training set obtained by scanning the corpus may be insufficient. Therefore, we cluster the words in the corpus and treat all words in the cluster of a seed word as having the same category as that seed word. Each word in the cluster is used to replace the word w^u_i in the scanned annotated feature set {w_{i−2}, w_{i−1}, w^u_i, w_{i+1}, w_{i+2}}, so as to obtain new annotated data. The pseudocode is shown in Algorithm 4.

(2) The Selection of Clustering Method. In this section, in order to improve the domain adaptability of the model, to avoid the different effects that the same value of K has on data from different fields, and to avoid extra parameter tuning, we select the hierarchical clustering method, which does not require the number of clusters in advance. The result of hierarchical clustering is shown in Figure 7.

In the tree structure generated by the hierarchical clustering results, the leaf nodes of the tree are words in the corpus.

[Figure 5 residue omitted: the plate diagram shows the MaxEnt-LDA generative process, with maximum entropy weights λ and features x_{d,s,n} generating π_{d,s,n}, indicator variables y_{d,s,n} ∈ {B, O, A} and u_{d,s,n}, sentence topics z_{d,s} drawn from θ_d with prior α, and word distributions Φ^B, Φ^{A,t}, Φ^{O,t}, Φ^{A,g}, Φ^{O,g} drawn with prior β.]

Figure 5: The probability diagram of MaxEnt-LDA.

[Figure 6 residue omitted: the plate diagram shows HME-LDA, with biased priors β_{t,A}, β_{t,P}, β_{t,N} generating φ_{t,A}, φ_{t,P}, φ_{t,N}, maximum entropy weights λ_{d,n} and features f_{d,n} generating π_{d,n} for y_{d,n}, δ_d generating Ω_{d,n} for v_{d,n}, and α_d generating θ_d for z_{d,n}.]

Figure 6: The probabilistic graphical model of HME-LDA.


When looking for words of the same category, the intermediate node of the upper layer can be found from the current word. All the leaf nodes of the subtree rooted at this intermediate node belong to the same category.
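Under the assumption that word vectors are available (here invented 2-D toy vectors stand in for Word2Vec output), the cut-a-dendrogram idea can be sketched with SciPy's agglomerative clustering:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy word vectors; in practice these would come from a trained Word2Vec model.
words = ["pizza", "pasta", "waiter", "staff"]
vectors = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])

# Build the dendrogram; cutting it at a distance threshold avoids having to
# fix the number of clusters K in advance.
tree = linkage(vectors, method="average", metric="cosine")
labels = fcluster(tree, t=0.5, criterion="distance")

def same_cluster_words(seed, words, labels):
    """All words sharing the seed word's cluster, i.e. the leaves under the
    subtree containing the seed after the cut."""
    seed_label = labels[words.index(seed)]
    return [w for w, l in zip(words, labels) if l == seed_label and w != seed]
```

Each cluster member can then replace the seed word in its feature windows to expand the automatically annotated training set, as Algorithm 4 describes.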

2.2.5. The Inference of the HME-LDA Model. The inference of the HME-LDA model mainly includes two parts. The first part is the inference of the maximum entropy model for the classification of the opinion target words and sentiment opinion words, and the second part is the inference of the optimized LDA model.

(1) The Maximum Entropy Model. The maximum entropy model actually solves a classification problem. When the input of the model is x, the probability p(y | x) of the category y can be calculated by equation (16).

p(y | x) = e^{Σ_{i=1}^{n} λ_i f_i(x, y)} / Σ_{y′} e^{Σ_{i=1}^{n} λ_i f_i(x, y′)}.  (16)

In equation (16), λ_i represents the weight vector, f_i(x, y) is the feature function, and n is the number of feature functions. When using the maximum entropy model for the classification of opinion target words and sentiment opinion words, it is necessary to select appropriate features. The MaxEnt-LDA [22] model uses two features, i.e., word order and part of speech. Part-of-speech tagging relies on additional tools whose behavior differs across languages. In order to avoid language-dependent tools, this chapter chooses the word order as the only feature. Section 2.2.4 gives the method of automatically annotating the training set.

By training the maximum entropy model, the weight λ_u of the feature set f_{d,n} can be obtained, and π^u_{d,n} can be obtained by equation (17), where d represents the d-th sentence, n the n-th word in the sentence, and u ∈ {0, 1} is the label set: 0 represents the opinion target whose type is A, and 1 represents the sentiment opinion whose type is O.

p(y_{d,n} | f_{d,n}) = π^u_{d,n} = e^{λ_u × f_{d,n}} / Σ_{i=0}^{1} e^{λ_i × f_{d,n}}.  (17)
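Equation (17) is an ordinary binary softmax over the two label weights. A sketch with hypothetical weights and a toy feature vector:

```python
import numpy as np

def maxent_probs(f, lam):
    """Equation (17) for binary labels: f is the feature vector f_{d,n} and
    lam is a (2, len(f)) weight matrix (row 0: opinion target A, row 1:
    sentiment opinion O). Returns [pi^0_{d,n}, pi^1_{d,n}]."""
    scores = lam @ f           # lambda_u x f_{d,n} for u in {0, 1}
    scores -= scores.max()     # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

# hypothetical trained weights and a binary feature window
lam = np.array([[2.0, 0.0, 1.0], [0.0, 2.0, 1.0]])
probs = maxent_probs(np.array([1.0, 0.0, 1.0]), lam)
```

In the real model, the weights would be learned from the automatically annotated word-order windows of Section 2.2.4.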

(2) The Model Inference of HME-LDA. In SLDA, the parameters α and β are offset by calculating the similarity between the input corpus and the seed words. The fixed parameters are denoted as α_base and β_base, and the offset parameters are denoted as α_d, β_{t,A}, β_{t,P}, and β_{t,N}. In Section 2.1, these parameters are offset using the semantic similarity calculation of WordNet and the sentiment polarity calculation of SentiWordNet. In this section, the Word2Vec model and the cosine distance are used instead to calculate the similarity between words. The training corpus of the Word2Vec model is the entire content of the corpus. The vector of the current word in Word2Vec is represented by v_w, and the vector of the seed word w_t whose topic is t is represented by v_{w_t}. The similarity between the word w and the topic word w_t can be calculated by equation (18):

sim(w, w_t) = cos(θ) = (v_w · v_{w_t}) / (|v_w| × |v_{w_t}|).  (18)
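Equation (18) is plain cosine similarity; a sketch with toy vectors standing in for Word2Vec output:

```python
import numpy as np

def cosine_sim(v_w, v_wt):
    """Equation (18): cosine similarity between a word vector and a
    seed-word vector."""
    return float(np.dot(v_w, v_wt) / (np.linalg.norm(v_w) * np.linalg.norm(v_wt)))

sim_same = cosine_sim(np.array([1.0, 2.0]), np.array([2.0, 4.0]))   # parallel vectors
sim_orth = cosine_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))   # orthogonal vectors
```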

Similar to the SLDA model, HME-LDA sets the parameter α_d separately for each document (each sentence) based on the similarity between the vocabulary and the seed words. The biased parameter can be calculated by equation (19).

α_{d,t} = [Σ_{i=1}^{N_d} sim(w_{d,i}, w_t)] / [Σ_{t′=1}^{T} Σ_{i=1}^{N_d} sim(w_{d,i}, w_{t′})] × α_base.  (19)

In equation (19), N_d is the number of words in the current sentence, T is the number of topics, w_{d,i} is the i-th word in the current sentence, and w_t is the seed word of topic t.

Based on the similarity between the vocabulary and seed words, the HME-LDA model also sets parameters separately for each topic. These biased parameters can be calculated by equations (20), (21), and (22).

β_{t,A} = sim(w, w_t) × β_base,  (20)

β_{t,P} = sim(w, w_{t,P}) × β_base,  (21)

β_{t,N} = sim(w, w_{t,N}) × β_base.  (22)

Different from SLDA, HME-LDA introduces the

for each seed word t in Topics do
    // find the positions where t appears in the corpus and get the corresponding word-order windows
    wordList = getWordListFromCorpus(t)
    // get all the words of the category of t from the clustering results
    wordCluster = getWordCluster(t)
    trainSet = new Set    // used to save labelled training samples
    for wOrder in wordList do
        for w in wordCluster do
            replaceWord(wOrder, t, w)       // replace the word t in the window with w
            trainSet.add(wOrder, t.Type)    // add wOrder and the label of t to the training set
        end for
    end for
end for

Algorithm 4. The process of automatically labelling data.


parameter δ to control the sentiment polarity of each sentence. The parameter δ_{d,q} can be calculated by equation (23):

δ_{d,q} = [Σ_{i=1}^{N_d} sim(w_{d,i}, w_{t,q})] / [Σ_{q′∈{P,N}} Σ_{i=1}^{N_d} sim(w_{d,i}, w_{t,q′})] × δ_base.  (23)

Similar to SLDA, HME-LDA is solved by the method of Gibbs sampling, and the variables involved in the solution process have the same meanings as those in Tables 1 and 2.

Then, we use equation (24) to sample the topic of each sentence:

p(z_{d,n} = t | z_{¬d,n}, y_{¬d,n}, v_{¬d,n}, ·) ∝ [(n^{t,A}_{w_{d,n}} + β^{t,A}_{w_{d,n}}) / Σ_{v=1}^{V} (n^{t,A}_v + β^{t,A}_v)] × [(n^{t,P}_{w_{d,n}} + β^{t,P}_{w_{d,n}}) / Σ_{v=1}^{V} (n^{t,P}_v + β^{t,P}_v)] × [(n^{t,N}_{w_{d,n}} + β^{t,N}_{w_{d,n}}) / Σ_{v=1}^{V} (n^{t,N}_v + β^{t,N}_v)] × (n_{d,t} + α_{d,t}).  (24)

Equations (25) and (26) are used to sample y_{d,n} and v_{d,n}:

p(y_{d,n} = u | z_{d,n} = t, ·) ∝ [(n^{t,u}_{w_{d,n}} + β^{t,u}_{w_{d,n}}) / Σ_{v=1}^{V} (n^{t,u}_v + β^{t,u}_v)] × [e^{λ_u × f_{d,n}} / Σ_{u′∈{A,O}} e^{λ_{u′} × f_{d,n}}],  u ∈ {A, O},  (25)

p(v_{d,n} = q | z_{d,n} = t, ·) ∝ [(n^{t,q}_{w_{d,n}} + β^{t,q}_{w_{d,n}}) / Σ_{v=1}^{V} (n^{t,q}_v + β^{t,q}_v)] × (n_{d,q} + δ_{d,q}),  q ∈ {P, N}.  (26)

In the corpus, the approximate probability of the topic t in sentence d can be calculated by equation (27):

θ_{d,t} = (n_{d,t} + α_{d,t}) / (n_d + Σ_{t′=1}^{T} α_{d,t′}).  (27)

With t as the topic, the approximate probability that the word w_{d,n} is the opinion target can be calculated by equation (28):

φ^{t,A}_{w_{d,n}} = (n^{t,A}_{w_{d,n}} + β^{t,A}_{w_{d,n}}) / Σ_{v=1}^{V} (n^{t,A}_v + β^{t,A}_v).  (28)

With t as the topic, the approximate probability that the word w_{d,n} is a positive opinion word can be calculated by equation (29):

φ^{t,P}_{w_{d,n}} = (n^{t,P}_{w_{d,n}} + β^{t,P}_{w_{d,n}}) / Σ_{v=1}^{V} (n^{t,P}_v + β^{t,P}_v).  (29)

With t as the topic, the approximate probability that the word w_{d,n} is a negative opinion word can be calculated by equation (30):

φ^{t,N}_{w_{d,n}} = (n^{t,N}_{w_{d,n}} + β^{t,N}_{w_{d,n}}) / Σ_{v=1}^{V} (n^{t,N}_v + β^{t,N}_v).  (30)

3. Results and Analysis

This chapter trains the SLDA model of Section 2.1 and the HME-LDA model of Section 2.2, uses the two models to extract the opinion targets and opinion words on the Restaurant English data set, and then verifies the feasibility of the models and analyses the experimental results.

3.1. Experimental Data Set. The data sets of the experiment are from SemEval2016ABSA and Yelp. The original review data on Yelp include restaurants and hotels, so they need to be screened. The test data are from task B of SemEval2016ABSA. The test part of this experiment only pays attention to the classification of aspect-based opinion words, so the original test data need to be processed.

3.1.1. Original Data Set. SemEval provides the training set and test set of the Restaurant reviews in XML format, where the size of the training set is 737 KB and the size of the test set is 264 KB. The label structure of a sentence is shown in Figure 8.

The content in the label text is the original sentence, the attribute target in the label opinion is the opinion target, the category is the aspect category of the review, and the polarity describes the sentiment polarity of the target. The model input proposed in this paper is plain text without annotation information. The task of this paper is to extract the opinion target words and sentiment opinion words and judge the polarity of the sentiment words. Here, the content in the text label needs to be extracted and added to the

[Figure 7 residue omitted: the clustering tree groups corpus words such as Plan, Letter, Request, Memo, Draft; Accounts, People, Customers, Employees, Students, Representatives; and Day, Year, Week, Month, Quarter.]

Figure 7: The results presentation of hierarchical clustering methods.


corpus, and the content of target and category is extracted as test data. When verifying the experimental results based on the evaluation indicators, we only take the opinion target into account.

The size of the original Yelp data is 231.2 MB; it contains many useless fields and covers a wide range of domains, including Restaurant, Hotel, and Wine Bar. The Yelp data set has a field, business_categories, which represents the realm of a review, so the data can be filtered through this field. The Yelp data are used to provide additional training sets and to train the Word2Vec and hierarchical clustering models that provide automatically annotated training sets for the maximum entropy model in HME-LDA.

3.1.2. Data Preprocessing. The Restaurant data set in SemEval is in XML format, while the format of the Yelp data is CSV. Therefore, it is necessary to extract the data needed for this experiment from files in the two formats and number them uniformly. In the Restaurant data set, all sentences are annotated with the text label, so the sentences can be obtained directly from the labels. In the Yelp data set, the field business_categories stores the category of the review, and if it contains the word restaurant, the review is restaurant related. A Yelp review contains multiple sentences, so it is split into single sentences by punctuation. Finally, two txt files are used to store the extracted results, one sentence per line. The xml.dom package and the csv package in Python are used in the process. The amount of data finally extracted is shown in Table 3.
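A sketch of this extraction step using Python's standard xml.dom and csv modules; the XML fragment and CSV column names below are assumed from the description above, not copied from the real data sets:

```python
import csv
import io
from xml.dom import minidom

# A SemEval-style fragment (structure assumed from the paper's description).
XML = """<sentences>
  <sentence id="1"><text>The pizza was great.</text>
    <Opinions><Opinion target="pizza" category="FOOD#QUALITY" polarity="positive"/></Opinions>
  </sentence>
</sentences>"""

def sentences_from_semeval(xml_text):
    """Pull the raw sentences out of the text labels."""
    dom = minidom.parseString(xml_text)
    return [node.firstChild.data for node in dom.getElementsByTagName("text")]

def restaurant_sentences_from_yelp(csv_text):
    """Keep reviews whose business_categories mention 'restaurant' and split
    them into single sentences by punctuation."""
    out = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if "restaurant" in row["business_categories"].lower():
            out += [s.strip() for s in row["text"].replace("!", ".").split(".") if s.strip()]
    return out

yelp = "business_categories,text\nRestaurants,Great food. Slow service!\nHotels,Nice room."
corpus = sentences_from_semeval(XML) + restaurant_sentences_from_yelp(yelp)
```

The combined list would then be written out one sentence per line with a uniform numbering.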

Repeated words are removed when counting the number of words, but not when computing the average length of sentences. When words are extracted from the sentences, they are separated by spaces and punctuation marks. Because lemmatization would require additional tools, it is not carried out here. Therefore, when counting the number of words, the different tenses and the singular and plural forms of the same word are counted separately, so the final number of words is larger than the true vocabulary size.

There is additional annotation information in the SemEval corpus, from which the opinion target words and their aspect categories in the Restaurant review domain can be extracted. The opinion label in the original XML file is extracted by Python's xml.dom package; then the attributes target and category are extracted from the opinion label and counted. The statistical results are shown in Table 4.

Figure 9, generated from statistics on the SemEval training corpus, shows that the review category food accounts for a large proportion, while the categories location and drinks account for small proportions. The category restaurant, which covers broad semantics, is not usually the object of extraction. In this chapter, we only take the review categories food, service, and ambience into account.

3.1.3. The Construction of the Experimental Data Set. The SemEval2016 Restaurant review set is used for both the experimental data set and the test set. In the HME-LDA model, additional Yelp data are needed to train the hierarchical classification model and the Word2Vec model, and the amount of additional data provided affects the final results as well. Therefore, the Yelp data set is divided into four training sets according to the number of sentences, as shown in Table 5.

3.2. The Main Evaluation Indicators

3.2.1. The Evaluating Indicators of the Aspect Review Category of Sentences. Referring to the evaluation methods of Zhao et al. [22], this paper chooses accuracy P, recall R, and their harmonic value F1 as verification indicators, shown in equations (31), (32), and (33). In the MaxEnt-LDA [22] model, the topic of a sentence is undefined; according to the distribution of the topic words, the topics must be set to food, service, and ambience manually, and each sentence is then assigned to a topic according to the probability of its words appearing in each topic. The SLDA model proposed in this paper, as well as the HME-LDA model, which depends less on the language than SLDA, treats the seed words as both the topic words and the aspect categories of reviews. Therefore, the topic of a sentence can be determined directly from the distribution in the model.
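The manual sentence-to-topic assignment described for MaxEnt-LDA can be sketched as follows. The log-probability scoring and the small floor for unseen words are illustrative assumptions, not the exact procedure of [22]:

```python
import math

def sentence_topic(words, word_topic_prob,
                   topics=("food", "service", "ambience")):
    """Assign a sentence to the topic under which its words are most
    probable: score(t) = sum of log p(w | t) over the sentence's words.
    word_topic_prob maps (word, topic) -> probability; unseen pairs get
    a tiny floor so the log is defined."""
    def score(topic):
        return sum(math.log(word_topic_prob.get((w, topic), 1e-9))
                   for w in words)
    return max(topics, key=score)
```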

P = (the number of correct predictions) / (the total number of predictions),  (31)

R = (the number of correct predictions) / (the total correct number),  (32)

F1 = (2 × P × R) / (P + R).  (33)
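Equations (31)-(33) translate directly into code; a minimal sketch for one review category:

```python
def p_r_f1(predicted, gold, label):
    """P, R, and F1 for one review category, following equations
    (31)-(33): P = correct predictions / total predictions,
    R = correct predictions / total gold instances of the label."""
    pred_pos = sum(1 for p in predicted if p == label)
    gold_pos = sum(1 for g in gold if g == label)
    correct = sum(1 for p, g in zip(predicted, gold) if p == g == label)
    p = correct / pred_pos if pred_pos else 0.0
    r = correct / gold_pos if gold_pos else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```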

3.2.2. The Evaluation Indicators of the Opinion Target. In the models proposed in this paper, the distribution information of the opinion targets for a specific topic is stored in φt,A. Because φt,A stores word probabilities in the manner of a topic model, the same word may exist in different topics, so it is impossible to calculate the extraction accuracy and recall rate directly from φt,A. Referring to the scheme of Zhao et al. [22], the n words with the highest probability in each topic of φt,A are extracted as representatives, and their accuracy is calculated. As noted above, the annotated data set contains the opinion targets and the review categories they belong to; from this, the n words with the highest frequency of occurrence are selected as references for each review category. With n set to 5, 10, and 20, respectively, the accuracy Pt@n can be calculated according to equation (31), where t is the topic number and n is the number of words taken. The average extraction accuracy is expressed by P@n, which is calculated by equation (34).

Figure 8: The format of the SemEval data set.

Wireless Communications and Mobile Computing

P@n = (∑t Pt@n) / T,  (34)

where t ranges over the T topics.
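Equation (34) can be sketched as below; the per-topic lists of ranked words and reference sets are hypothetical inputs standing in for φt,A and the annotated references:

```python
def average_p_at_n(topic_top_words, reference_words, n):
    """Equation (34): the mean over topics of P_t@n, where P_t@n is the
    fraction of a topic's n most probable opinion-target words that
    appear among that topic's reference words."""
    per_topic = []
    for topic, ranked in topic_top_words.items():
        top = ranked[:n]
        hits = sum(1 for w in top if w in reference_words[topic])
        per_topic.append(hits / n)
    return sum(per_topic) / len(per_topic)
```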

3.3. The Experimental Results and Analysis

3.3.1. The Setting of Experimental Parameters. In this experiment, the parameters related to the topic model are set as αbase = 50/T, βbase = 0.01, and δbase = T, where T is the number of topics; T = 3 in the subsequent experiments. In addition, the seed words and the cluster number of the hierarchical classification model need to be set. The seed words are divided into two groups for comparison: seed word set A {food, service, ambience} and seed word set B {chicken, staff, atmosphere}. The cluster number of the hierarchical classification model does not directly determine the number of categories; it is set to 200 based on experience.

The parameter setting of the comparison model MaxEnt-LDA [22] is the same as in the original paper. When training the maximum entropy classifier, its features can be divided into three categories: word, part of speech, and part of speech plus word. The HME-LDA model proposed in this paper only uses the word feature, so in the comparative experiment, the maximum entropy feature of the MaxEnt-LDA [22] model is also set to word.

3.3.2. The Influence of the Seed Word Set on SLDA and HME-LDA. Figures 10 and 11 show the impact of the seed words on the SLDA model and the HME-LDA model when they are set to seed word set A {food, service, ambience} and seed word set B {chicken, staff, atmosphere}, respectively. When using seed word set A, the models are recorded as SLDA-A and HME-LDA-A; when using seed word set B, as SLDA-B and HME-LDA-B. In addition, the HME-LDA model uses Yelp10K as the training set of the automatic annotation method.

The experimental results in Figure 10 show that the selection of seed word sets has a greater impact on the SLDA model, while the impact on the HME-LDA model is not significant. Under the topics food and ambience, the accuracy, recall rate, and F1 value of the SLDA-A model are higher than those of SLDA-B, while under the topic service these indicators of the SLDA-A model are

Table 3: The information of the experimental data set.

Index                             SemEval    Yelp
The number of reviews             2000       85000
The number of words               3373       67066
The average length of sentence    13         13

Table 4: The information of the annotating data.

The aspect category    The number     The number of words    The number of words
of reviews             of sentences   before repetition      after repetition
Food                   952            952                    420
Restaurant             258            258                    90
Service                324            324                    57
Drinks                 96             96                     51
Ambience               228            228                    94
Location               22             22                     9

Figure 9: The proportion of review categories in the SemEval corpus (food 51%, service 17%, restaurant 14%, ambience 12%, drinks 5%, location 1%).

Table 5: The information of the Yelp training set.

The name of the    The number of    The number of
training sets      sentences        words
Yelp2K             2000             3809
Yelp4K             4000             5553
Yelp10K            10000            8948
Yelp20K            20000            12490


Figure 10: The influence of seed word sets on the extraction of sentence review categories: (a) evaluation index values for review category food; (b) evaluation index values for review category service; (c) evaluation index values for review category ambience.


lower than those of the SLDA-B model. This is related to the features of WordNet and SentiWordNet. In WordNet, the similarity between words is mainly calculated from the path between them. When the seed word food is used as the starting position, every branch only needs to trace up to food. When a word in one branch of food is used as the seed word, words in the other branches must first trace back up to food and then be searched downward. This increases the path lengths, which reduces the similarity of the words and thus affects the results. Moreover, WordNet mainly preserves the concepts of words and cannot perform knowledge reasoning; that is, it cannot infer the relationship between waiter and service to improve the similarity between them. When the seed word is replaced by staff, the explicit conceptual noun makes it easier to find noun words such as waiter, thus improving the effect of the SLDA. The feature vector of the maximum entropy classifier in the HME-LDA model is taken from word-order information, ignoring the concept of the word itself, so the choice of seed words has little effect on the final results. The evaluation indicators of the SLDA-A and HME-LDA-A models are both slightly better than those of MaxEnt-LDA.
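The path effect described above can be illustrated with a toy hypernym tree. The tree below and the 1/(1 + distance) scoring are deliberate simplifications of WordNet's path-based similarity, not the actual SLDA computation: a word in the food branch is close to the seed food, while waiter, sitting in another branch, must go up through a shared ancestor, lengthening the path and shrinking the similarity.

```python
def path_similarity(tree, a, b):
    """Toy WordNet-style path similarity: 1 / (1 + number of edges on
    the shortest path through the hypernym tree). `tree` maps each
    word to its parent (hypernym)."""
    def ancestors(word):
        # word -> depth along its hypernym chain, including itself
        chain = {word: 0}
        depth = 0
        while word in tree:
            word = tree[word]
            depth += 1
            chain[word] = depth
        return chain
    up_a, up_b = ancestors(a), ancestors(b)
    common = set(up_a) & set(up_b)
    if not common:
        return 0.0
    dist = min(up_a[c] + up_b[c] for c in common)
    return 1.0 / (1.0 + dist)
```

With a hypothetical tree {chicken -> food -> entity, waiter -> person -> entity}, chicken scores 0.5 against the seed food, while waiter scores only 0.25, mirroring the cross-branch penalty discussed above.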

The results in Figure 11 show that the smaller the value of n, the higher the accuracy of opinion target extraction in each model. All models, namely, MaxEnt-LDA, SLDA, and HME-LDA, are suited to extracting the opinion target words with the highest correlation, which is consistent with people's commenting habits: most people focus on only a few aspects of the comment objects, and the information people want to get from comments also concerns the aspects they care about most. Similar to the results in Figure 10, when the seed word set is changed from A to B, the accuracy of the SLDA model decreases considerably, while that of the HME-LDA model decreases little, which is closely related to the word classification method of the SLDA. Meanwhile, SLDA-A and HME-LDA-A are slightly more accurate in opinion target extraction than MaxEnt-LDA.

Figure 11: The influence of the seed word sets on P@N.

3.3.3. The Influence of the Size of the Training Set on HME-LDA. In the HME-LDA model, the effect of the maximum entropy model is related to the accuracy of the automatically annotated data sets. The size of the training set used to train the hierarchical classification model affects the accuracy of automatic data annotation. Since it is difficult to calculate the accuracy of the automatically annotated data set directly, the impact of training set size on the HME-LDA model is measured by the evaluation indexes F1 and P@5 of the final HME-LDA model. Here, F1 is the mean F1 value over the three topics, and P@5 is the mean accuracy of opinion target extraction in each review category when taking the top five words. In the experiment, the seed words are set to seed word set A {food, service, ambience}. The experimental results are shown in Figures 12 and 13.

Figure 12: The F1 value of HME-LDA in training sets of different sizes.

The values of the MaxEnt-LDA model are given as a reference in the figures above. The Yelp data set is not used in the MaxEnt-LDA model, so its evaluation indicators in these figures remain unchanged. As the training set grows, the effect of HME-LDA rises. When the training set reaches 10K, the effect of the HME-LDA model stabilizes and is slightly better than that of the MaxEnt-LDA model. The effect of the model is related to the size of the training set because a larger training set contains more effective information. Meanwhile, since the number of clusters in the hierarchical classification model is constant, when the amount of data is small, the number of words in each cluster is relatively small and the classification accuracy is relatively low; when the amount of data increases, the classification accuracy improves, which in turn affects the final effect of the model.

4. Conclusions

This paper studies methods to reduce the dependence of models on annotated data, focusing on aspect-based opinion mining. The unsupervised LDA topic model has good expansibility. Based on the LDA topic model, this paper introduces two dictionary tools, WordNet and SentiWordNet, to propose the SLDA model. To further reduce the dependence on language tools, the maximum entropy model and a method of automatically annotating data are introduced to propose the optimized model HME-LDA. The experiments show that both the SLDA model and the HME-LDA model achieve good accuracy and recall without relying on annotated data. Therefore, the two optimized models can give customers and merchants more detailed and accurate information to assist them in better decision-making.

Data Availability

The Yelp data set used to support the findings of this study is available from https://www.yelp.com/dataset. The SemEval2016ABSA data set used to support the findings of this study is described in the article "SemEval-2016 Task 5: Aspect Based Sentiment Analysis" and is available from http://alt.qcri.org/semeval2016/task5/index.php?id=data-and-tools.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the Social Science Fund Planning Project of the Ministry of Education of the People's Republic of China, "Research on Data Service and Guarantee for the Fourth Paradigm of Social Science" (20YJA870017).

References

[1] C. Song, H. Jung, and K. Chung, "Development of a medical big-data mining process using topic modeling," Cluster Computing, vol. 22, no. S1, pp. 1949–1958, 2019.

[2] C. K.-S. Leung, "Big data analysis and mining," in Advanced Methodologies and Technologies in Network Architecture, Mobile Computing, and Data Analytics, pp. 15–27, IGI Global, 2019.

[3] Q. Li, S. Li, S. Zhang, and J. Hu, "A review of text corpus-based tourism big data mining," Applied Sciences, vol. 9, no. 16, p. 3300, 2019.

[4] S. Lee, Y. Hyun, and M. J. Lee, "Groundwater potential mapping using data mining models of big data analysis in Goyang-si, South Korea," Sustainability, vol. 11, no. 6, article 1678, 2019.

[5] B. Liu and L. Zhang, "A survey of opinion mining and sentiment analysis," in Mining Text Data, pp. 415–463, Springer, Boston, MA, 2013.

[6] K. Liu, L. Xu, and J. Zhao, "Opinion target extraction using word-based translation model," pp. 1346–1356, 2012.

[7] Z. Chen, A. Mukherjee, and B. Liu, "Aspect extraction with automated prior knowledge learning," 2014.

[8] H. Cheng, Z. Xie, Y. Shi, and N. Xiong, "Multi-step data prediction in wireless sensor networks based on one-dimensional CNN and bidirectional LSTM," IEEE Access, vol. 7, 2019.

[9] M. Pontiki, D. Galanis, H. Papageorgiou et al., "SemEval-2016 task 5: aspect based sentiment analysis," pp. 19–30, 2016.

[10] M. Pontiki, D. Galanis, J. Pavlopoulos, H. Papageorgiou, I. Androutsopoulos, and S. Manandhar, "SemEval-2014 task 4: aspect based sentiment analysis," in Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pp. 27–35, 2014.

[11] Y. F. Zeng, T. Lan, and Z. F. Wu, "Bi-memory based attention model for aspect level sentiment classification," Chinese Journal of Computers, vol. 8, pp. 1845–1857, 2019.

[12] T. Chen, R. Xu, and X. Wang, "Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN," Expert Systems with Applications, vol. 72, pp. 221–230, 2017.

[13] O. Araque, I. Corcuera-Platas, J. F. Sánchez-Rada, and C. A. Iglesias, "Enhancing deep learning sentiment analysis with ensemble techniques in social applications," Expert Systems with Applications, vol. 77, pp. 236–246, 2018.

[14] T. Hofmann, "Probabilistic latent semantic indexing," ACM SIGIR Forum, vol. 51, no. 2, pp. 211–218, 2017.

[15] D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.

[16] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai, "Topic sentiment mixture: modeling facets and opinions in weblogs," 2007.

[17] F. Li, C. Han, M. Huang et al., "Structure-aware review mining and summarization," vol. 2, pp. 653–661, 2010.

[18] D. Blei and J. McAuliffe, "Supervised topic models," Advances in Neural Information Processing Systems, vol. 3, 2010.

[19] D. Ramage, D. Hall, R. Nallapati, and C. Manning, "Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora," vol. 1, pp. 248–256, 2009.

[20] H. Cheng, D. Feng, X. Shi, and C. Chen, "Data quality analysis and cleaning strategy for wireless sensor networks," EURASIP Journal on Wireless Communications and Networking, vol. 61, 2018.

[21] H. Cheng, Z. Su, N. Xiong, and Y. Xiao, "Energy-efficient node scheduling algorithms for wireless sensor networks using Markov random field model," Information Sciences, vol. 329, pp. 461–477, 2016.

[22] W. Zhao, J. Jiang, H. Yan, and X. Li, "Jointly modeling aspects and opinions with a MaxEnt-LDA hybrid," pp. 56–65, 2010.
