arXiv:1711.04305v2 [cs.IR] 6 Dec 2018
Latent Dirichlet Allocation (LDA) and Topic modeling:models, applications, a survey
Hamed Jelodar · Yongli Wang · Chi Yuan ·Xia Feng · Xiahui Jiang · Yanchao Li · LiangZhao
Abstract Topic modeling is one of the most powerful techniques in text mining for data mining, latent data discovery, and finding relationships among data and text documents. Researchers have published many articles on topic modeling and applied it in various fields such as software engineering, political science, medical science, and linguistics. Among the many methods for topic modeling, Latent Dirichlet Allocation (LDA) is one of the most popular, and researchers have proposed numerous models based on it. This paper therefore serves as a useful introduction to LDA-based approaches in topic modeling. We investigate highly cited scholarly articles (published between 2003 and 2016) related to LDA-based topic modeling in order to trace the research development, current trends, and intellectual structure of the field. In addition, we summarize open challenges and introduce well-known tools and datasets for LDA-based topic modeling.
Keywords Topic modeling · Latent Dirichlet Allocation · tag recommendation · Semantic web · Gibbs sampling
1 Introduction
Natural language processing (NLP) is a challenging research area in computer science concerned with information management, semantic mining, and enabling computers to derive meaning from human language in text documents. Topic modeling methods are
H. Jelodar, School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing 210094, China. Tel.: +8615298394479. E-mail: [email protected]
Y. Wang, School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing 210094, China. E-mail: [email protected]
powerful techniques widely applied in natural language processing for topic discovery and semantic mining from unordered documents [1]. From a broad perspective, topic modeling methods based on LDA have been applied to natural language processing, text mining, social media analysis, and information retrieval. For example, topic modeling in social media analytics helps in understanding the reactions and conversations among people in online communities, as well as in extracting useful and understandable patterns from their interactions and from what they share on social media websites such as Twitter and Facebook [2-3]. Topic models are prominent for representing discrete data and provide an efficient way to discover hidden structures (semantics) in massive data. There are many papers in this field and we certainly cannot mention all of them, so we selected the more significant ones. Topic models are applied in various fields including medical sciences [4-7], software engineering [8-12], geography [13-17], political science [18-20], etc.
For example, in political science, [20] proposed a new two-layer matrix factorization methodology for identifying topics in large political speech corpora over time, capable of identifying both niche topics related to events at a particular point in time and broad, long-running topics. While that paper focused on European Parliament speeches, the proposed topic modeling method has a number of potential applications in the study of politics, including the analysis of speeches in other parliaments, political manifestos, and other more traditional forms of political texts. [21] suggested a new unsupervised topic model based on LDA for contrastive opinion modeling, called the Cross-Perspective Topic (CPT) model, whose purpose is to find the opinions from multiple views on a given topic and to quantify their differences on that topic. They performed experiments with both qualitative and quantitative measures on two datasets in the political domain: the first consists of statement records of U.S. senators that reflect their political stances; the second consists of world news extracted from three representative media outlets in the U.S. (New York Times), China (Xinhua News), and India (The Hindu). To compare their approach with other models, they used corrLDA and LDA as two baselines.
Another group of researchers focused on topic modeling in software engineering. In [8], LDA was used for the first time to extract topics from source code and to visualize software similarity; in other words, LDA serves as an intuitive approach for calculating the similarity between source files by comparing their respective distributions over topics. They applied their method to 1,555 software projects from Apache and SourceForge comprising 19 million source lines of code (SLOC), and demonstrated that this approach can be effective for project organization and software refactoring. [22] introduced a method based on LDA for automatically categorizing software systems, called LACT. To evaluate LACT, they used 43 open-source software systems in different programming languages and showed that LACT can categorize software systems based on the type of programming language. [23, 24] proposed an LDA-based topic modeling approach for bug localization. They applied their idea to the analysis of the same bugs in Mozilla and Eclipse, and the results showed that their LDA-based approach is better than LSI for evaluating and analyzing bugs in these source codes.
The analysis of geographic information is another application, addressed in [17]. The authors introduced a novel method, called GeoFolk, based on multi-modal Bayesian models that describes social media by merging text features and spatial knowledge. In general terms, this method can be considered an extension of Latent Dirichlet Allocation (LDA). They used the standard CoPhIR dataset, which contains over 54 million Flickr items. The GeoFolk model can be used in quality-oriented applications and can be merged with models from the social Web 2.0. [16] examines the issue of topic modeling for extracting topics from geographic information and GPS-related documents. The authors suggested a new location-text method, called LGTA (Latent Geographical Topic Analysis), that combines topic modeling and geographical clustering. To test their approach, they collected a set of data from the Flickr website covering various topics.
From another point of view, most of the papers studied pursue goals such as source code analysis [8, 9, 22, 24-27], opinion and aspect mining [18, 28-36], event detection [37-40], image classification [13, 41], recommendation systems [42-48], and emotion classification [49-51]. For example, in recommendation systems, [44] proposed a personalized hashtag recommendation approach based on an LDA model, called Hashtag-LDA, that can discover latent topics in microblogs; they ran experiments on "UDI-TwitterCrawl-Aug2012-Tweets" as a real-world Twitter dataset.
1.1 Literature review and related work
Topic models have many applications in natural language processing. Many articles have been published based on topic modeling approaches in various areas such as social networks, software engineering, and linguistics, and some works have surveyed topic modeling itself.
In [52], the authors presented a survey on topic modeling in the software engineering field to specify how topic models have thus far been applied to one or more software repositories. They focused on articles written between Dec 1999 and Dec 2014 and surveyed 167 articles using topic modeling in the software engineering area. They identified and demonstrated the research trends in mining unstructured repositories with topic models, and found that most studies focused on only a limited number of software engineering tasks and used only basic topic models. In [53], the authors surveyed topic models with soft clustering abilities in text corpora and investigated basic concepts, a classification of existing models into various categories, parameter estimation (such as Gibbs sampling), and performance evaluation measures. In addition, the authors presented some applications of topic models for modeling text corpora and discussed several open issues and future directions.
In [54], the authors surveyed the field of opinion mining and sentiment analysis, which helps us extract elements from intimidating volumes of unstructured text. They discussed the most extensively studied subject, subjectivity and sentiment classification, which specifies whether a document is opinionated. They also described aspect-based sentiment analysis, which exploits the full power of the abstract model, and discussed aspect extraction based on topic modeling approaches. In [55], the authors discussed challenges of text mining techniques in information systems research and investigated the practical application of topic modeling in combination with explanatory regression analysis, using online customer reviews as an exemplary data source.
In [56], the authors presented a survey of how topic models have thus far been applied to software engineering tasks, drawing on four SE journals and eleven conference proceedings. They considered 38 selected publications from 2003 to 2015 and found that topic models are used with increasing frequency across various SE tasks, such as social software engineering and developer recommendation. Our research differs from these works in that we conducted a deep study of LDA-based topic modeling approaches covering various aspects such as applications, tools, datasets, and models.
1.2 Motivations and contributions
Topic modeling techniques can be very useful and effective in natural language processing for semantic mining and latent discovery in documents and datasets. Our motivation is therefore to investigate topic modeling approaches across different subjects, covering various aspects such as models, tools, datasets, and applications. The main goal of this work is to provide an overview of LDA-based topic modeling methods. In summary, this paper makes four main contributions:
– We investigate scholarly articles (from 2003 to 2016) which are related to TopicModeling based on LDA to discover the research development, current trends andintellectual structure of topic modeling based on LDA.
– We investigate topic modeling applications in various sciences.
– We summarize challenges in topic modeling, such as image processing, visualizing topic models, group discovery, and user behavior modeling.
– We introduce some of the most famous datasets and tools in topic modeling.
2 Computer science and topic modeling
Topic models have an important role in computer science for text mining and natural language processing. In topic modeling, a topic is a list of words that co-occur in statistically meaningful ways. A text can be an email, a book chapter, a blog post, a journal article, or any other kind of unstructured text. Topic models cannot understand the meaning of words in text documents. Instead, they assume that every piece of text is generated by selecting words from probable baskets of words, where each basket corresponds to a topic. The algorithm iterates through this process until it settles on the most probable distribution of words into baskets, which are called topics. Topic modeling can provide a useful view of a large collection in terms of the collection as a whole, the individual documents, and the relationships between documents. In Figure 1, we provide a taxonomy of LDA-based topic modeling methods drawn from some of the notable works.
2.1 Latent Dirichlet Allocation
LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. Latent Dirichlet allocation (LDA), first introduced by Blei, Ng and Jordan in 2003 [1], is one of the most popular methods in topic modeling. LDA represents topics by word probabilities, and the words with the highest probabilities in each topic usually give a good idea of what the topic is about.
LDA, an unsupervised generative probabilistic method for modeling a corpus, is the most commonly used topic modeling method. LDA assumes that each document can be represented as a probabilistic distribution over latent topics, and that the topic distributions of all documents share a common Dirichlet prior. Each latent topic in the LDA model is likewise represented as a probabilistic distribution over words, and the word distributions of topics share a common Dirichlet prior as well. Given a corpus D consisting of M documents, with document d having N_d words (d ∈ {1, ..., M}), LDA models D according to the following generative process [4]:
(a) Choose a multinomial distribution ϕ_t for each topic t (t ∈ {1, ..., T}) from a Dirichlet distribution with parameter β.
(b) Choose a multinomial distribution θ_d for each document d (d ∈ {1, ..., M}) from a Dirichlet distribution with parameter α.
(c) For each word w_n (n ∈ {1, ..., N_d}) in document d:
    i. Select a topic z_n from θ_d.
    ii. Select a word w_n from ϕ_{z_n}.
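As a concrete aid to the generative process described here, the following is a minimal NumPy sketch of steps (a)-(c); it is not the paper's code, and all names and parameter values are illustrative:

```python
import numpy as np

def generate_corpus(M, T, V, alpha, beta, mean_doc_len, rng=None):
    """Sample a toy corpus from the LDA generative process.

    M: number of documents, T: number of topics, V: vocabulary size.
    alpha, beta: symmetric Dirichlet hyperparameters.
    Returns (phi, theta, docs), where docs is a list of word-id lists.
    """
    rng = np.random.default_rng(rng)
    # (a) one word distribution phi_t per topic, drawn from Dirichlet(beta)
    phi = rng.dirichlet(np.full(V, beta), size=T)
    # (b) one topic distribution theta_d per document, drawn from Dirichlet(alpha)
    theta = rng.dirichlet(np.full(T, alpha), size=M)
    docs = []
    for d in range(M):
        # document length: illustrative choice, not part of the LDA model itself
        N_d = rng.poisson(mean_doc_len) + 1
        # (c) for each word: pick a topic z_n from theta_d, then a word from phi_{z_n}
        z = rng.choice(T, size=N_d, p=theta[d])
        docs.append([rng.choice(V, p=phi[t]) for t in z])
    return phi, theta, docs
```

Inference then runs this process in reverse: given only `docs`, it recovers estimates of `phi` and `theta`.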
In the above generative process, the words in documents are the only observed variables, while ϕ and θ are latent variables and α and β are hyperparameters. In order to infer the latent variables and hyperparameters, the probability of the observed data D is computed and maximized as follows:

p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \Big( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \Big)\, d\theta_d    (1)
Here α parameterizes the Dirichlet prior on the per-document topic distributions, and β parameterizes the Dirichlet prior on the per-topic word distributions. T is the number of topics, M the number of documents, and N the size of the vocabulary. The Dirichlet-multinomial pair (α, θ) governs the corpus-level topic distributions, and the pair (β, ϕ) governs the topic-word distributions. The variables θ_d are document-level variables, sampled once per document. The variables z_{dn} and w_{dn} are word-level variables, sampled once for each word in each document.
LDA is a distinguished tool for inferring the latent topic distribution of a large corpus. It can, for example, identify sub-topics for a technology area composed of many patents and represent each patent as an array of topic proportions. With LDA, the terms in the set of documents form a vocabulary that is then used to discover hidden topics. Documents are treated as mixtures of topics, where a topic is a probability distribution over this set of terms; each document is in turn a probability distribution over the set of topics. We can think of the data as coming from a generative process defined by the joint probability distribution over what is observed and what is hidden.
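To make this concrete, the following is a minimal sketch of fitting LDA to a toy corpus with scikit-learn (not part of the original survey; the corpus and parameter values are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

# LDA operates on bag-of-words counts, not raw text.
vec = CountVectorizer()
X = vec.fit_transform(docs)

# n_components is the number of topics T; doc_topic_prior and
# topic_word_prior correspond to the Dirichlet hyperparameters alpha and beta.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # each row is a per-document topic distribution
```

Each row of `doc_topics` is the estimated θ_d for one document, and `lda.components_` (after normalization) gives the per-topic word distributions ϕ_t.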
2.2 Parameter estimation, inference, and training for LDA
Various methods have been proposed to estimate LDA parameters, such as variational methods [1], expectation propagation [57], and Gibbs sampling [58].
– Gibbs sampling is a Markov chain Monte Carlo algorithm, a powerful technique in statistical inference, and a method for generating a sample from a joint distribution when only the conditional distributions of each variable can be computed efficiently. To our knowledge, it is the method researchers have used most widely for LDA; works based on LDA and Gibbs sampling include [22, 50, 59-66].
– Expectation-Maximization (EM) is a powerful method for obtaining parameter estimates of graphical models and can be used for unsupervised learning. The algorithm finds maximum likelihood estimates of parameters when the data model depends on certain latent variables. The EM algorithm alternates between two steps, the E-step (expectation) and the M-step (maximization). Some researchers have applied this method to LDA training, such as [67-70].
– Variational Bayes inference (VB) can be considered a type of EM extension that uses a parametric approximation to the posterior distribution of both parameters and other latent variables and attempts to optimize its fit (e.g. using KL-divergence) to the observed data. Some researchers have applied this method to LDA training, such as [71-72].
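As an illustration of the Gibbs sampling approach listed above, the following is a minimal, unoptimized collapsed Gibbs sampler for LDA; it is a sketch written for this survey's context, not code from any of the cited papers, and all names and defaults are illustrative:

```python
import numpy as np

def gibbs_lda(docs, V, T, alpha, beta, iters=100, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (illustrative, not optimized).

    docs: list of lists of word ids in [0, V).
    Returns smoothed estimates of phi (topic-word) and theta (doc-topic).
    """
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), T))   # document-topic counts
    nkw = np.zeros((T, V))           # topic-word counts
    nk = np.zeros(T)                 # total words per topic
    z = [rng.integers(T, size=len(doc)) for doc in docs]  # random init
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # remove the current assignment from the counts
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional p(z_dn = k | rest) up to a constant
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(T, p=p / p.sum())
                z[d][n] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    phi = (nkw + beta) / (nkw.sum(axis=1, keepdims=True) + V * beta)
    theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + T * alpha)
    return phi, theta
```

Production implementations add burn-in, multiple chains or recorded sweeps, and vectorized or distributed sampling, but the resampling loop above is the core of the collapsed Gibbs approach.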
2.3 A brief look at past work: Research between 2003 and 2009
LDA was first presented in 2003, and researchers have since tried to provide extended approaches based on it, as shown in Table 1. Undeniably, this period (2003 to 2009) is very important because key baseline approaches were introduced, such as Corr-LDA, the Author-Topic Model, DTM, and RTM.
[Figure 1 shows a timeline of LDA-based models: Corr-LDA and the Author-Topic Model (2003); the Dynamic Topic Model (DTM), Correlated Topic Models (CTM), the Pachinko Allocation Model (PAM), and Topics over Time (TOT) (2006); GWN-LDA (2007); MG-LDA and On-line LDA (OLDA) (2008); MedLDA, Rethinking LDA, Relational Topic Models, LACT, Labeled LDA, HDP-LDA, and LDA-SOM (2009); the Topic-Aspect Model, GeoFolk, TWC-LDA, and Dependency-Sentiment-LDA (2010); Sentence-LDA, MMLDA, Bio-LDA, Twitter-LDA, SHDT, and LogicLDA (2011); Mr. LDA, LDA-G, ET-LDA, and SShLDA (2012); emotion LDA (ELDA), Biterm Topic Modeling, and TWILITE (2014); LightLDA, Latent Feature LDA (LF-LDA), and MRF-LDA (2015); Hashtag-LDA and PMB-LDA (2016); Enriched LDA (ELDA) and iDoctor (2017).]
Fig. 1 A taxonomy of methods based on LDA extensions, covering some of the notable works
The Author-Topic Model [75] is a popular and simple probabilistic model in topic modeling for finding relationships among authors, topics, words, and documents. This model provides a distribution over topics for each author and a distribution over words for each topic. For evaluation, the authors used 1,700 papers from the NIPS conference and 160,000 abstracts from the CiteSeer dataset, applying Gibbs sampling to estimate the topic and author distributions. Their results showed that this approach is significantly predictive of authors' interests as measured by perplexity. The model has attracted much attention from researchers, and many approaches have been proposed based on the ATM, such as [12, 76].
DTM, the Dynamic Topic Model, was introduced by Blei and Lafferty as an extension of LDA that captures the evolution of topics over time in a sequentially arranged corpus of documents; it exhibits the evolution of the word-topic distribution, which makes it easy to visualize topic trends [73]. As an advantage, DTM is very effective for extracting topics from collections that change slowly over a period of time.
Labeled LDA (LLDA) is another LDA extension, which assumes that each document has a set of known labels [66]. The model can be trained with labeled documents and even supports documents with more than one label. Topics are learned from co-occurring terms in places from the same category, with topics approximately capturing different place categories; a separate L-LDA model is trained for each place category and can be used to infer the category of new, previously unseen places. LLDA is a supervised algorithm that builds topics from manually assigned labels, and it can therefore obtain meaningful topics whose words map well to the applied labels. As a disadvantage, Labeled LDA cannot support latent subtopics within a given label or any global topics; to overcome this problem, partially labeled LDA (PLDA) was proposed [74].
MedLDA, the maximum entropy discrimination latent Dirichlet allocation model, combines the mechanism behind hierarchical Bayesian models (such as LDA) with max-margin learning (such as SVMs) in a unified constrained optimization framework. Each data sample is mapped to a point in a finite-dimensional latent space, each feature of which corresponds to a topic, i.e., a unigram distribution over terms in a vocabulary [67]. The MedLDA model can be applied to both classification and regression. As a disadvantage, however, the quality of the topical space learned by MedLDA is heavily affected by the quality of the classifiers learned at each iteration of MedLDA.
Relational Topic Models (RTM) form another extension: RTM is a hierarchical model of networks and per-node attribute data. First, each document is generated from topics as in LDA. Then the connections between documents are modeled as binary variables, one for each pair of documents, distributed according to a distribution that depends on the topics used to generate each of the constituent documents. In this way, the content of the documents is statistically linked to the link structure between them, and the model can be used to summarize a network of documents [69]. The advantage of RTM is that it considers both the links and the document content. The disadvantage of RTM is its scalability: since RTM can only make predictions for a pair of documents, recommending related questions for a newly created one would require computing pairwise RTM responses between the query and every available question in the collection.
2.4 A brief look at past work: Research between 2010 and 2011
Twenty-seven approaches are introduced in Table 1, of which eight were published in 2010 and six in 2011. According to Table 1, the LDA model was used for a variety of subjects, such as scientific topic discovery [35], source code analysis [27], opinion mining [30], event detection [40], and image classification [41].
Sizov et al. [17] introduced a novel method, called GeoFolk, based on multi-modal Bayesian models that describes social media by merging text features and spatial knowledge. As noted earlier, this method can be considered an extension of Latent Dirichlet Allocation (LDA). They used the standard CoPhIR dataset, which contains over 54 million Flickr items. The GeoFolk model can be used in quality-oriented applications and can be merged with models from the social Web 2.0.
Z. Zhai et al. [30] used prior knowledge as constraints in LDA models to improve the grouping of features by LDA. They extract must-link and cannot-link constraints from the corpus: a must-link indicates that two features must be in the same group, while a cannot-link restricts two features from being in the same group. These constraints are extracted automatically. If at least one of the terms of two product features is the same, the features are considered to belong to the same group (must-link). On the other hand, if two features are expressed in the same sentence without the conjunction "and", they are considered different features and should be in different groups (cannot-link).
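The two extraction rules just described can be sketched as follows; this is an illustrative reading of the rules, not the authors' code, and it assumes feature extraction has already been done upstream (the "and"-conjunction filter is modeled by the caller only listing feature pairs not joined by "and"):

```python
def extract_constraints(sentences):
    """Sketch of must-link / cannot-link extraction from feature lists.

    sentences: list of lists; each inner list holds the product-feature
    phrases found in one sentence (not joined by "and").
    Returns (must_links, cannot_links) as sets of frozenset pairs.
    """
    features = {f for sent in sentences for f in sent}
    must, cannot = set(), set()
    # Must-link: two features sharing at least one term belong together.
    for a in features:
        for b in features:
            if a < b and set(a.split()) & set(b.split()):
                must.add(frozenset((a, b)))
    # Cannot-link: two distinct features expressed in the same sentence.
    for sent in sentences:
        for i, a in enumerate(sent):
            for b in sent[i + 1:]:
                if a != b and frozenset((a, b)) not in must:
                    cannot.add(frozenset((a, b)))
    return must, cannot
```

For example, "battery life" and "battery" share the term "battery" and become a must-link, while two unrelated features mentioned in one sentence become a cannot-link.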
Wang et al. [85] suggested an LDA-based approach, called Bio-LDA, that can identify biological terminology to obtain latent topics. The authors showed that this approach can be applied in different studies such as association search, association prediction, and connectivity map generation, and that Bio-LDA can broaden the use of molecular binding techniques such as heat maps.
Table 1 Some impressive articles based on LDA: between 2003-2011

Author-study | Model | Year | Parameter Estimation/Inference | Methods | Problem Domain
[70] | Corr-LDA | 2003 | Variational EM | LDA | Image annotation and retrieval
[75] | Author-Topic Model | 2004 | Gibbs Sampling | LDA | Find the relationships between authors, documents, words, and topics
[76] | Author-Recipient-Topic (ART) | 2005 | Gibbs Sampling | LDA, Author-Topic (AT) | Social network analysis and role discovery
[73] | Dynamic topic model (DTM) | 2006 | Kalman variational algorithm | LDA, Galton-Watson process | Provide a dynamic model for evolution of topics
[77] | Topics over Time (TOT) | 2006 | Gibbs Sampling | LDA | Capture word co-occurrences and localization in continuous time
[78] | Pachinko Allocation Model (PAM) | 2006 | Gibbs Sampling | LDA, directed acyclic graph method | Capture arbitrary topic correlations
[79] | GWN-LDA | 2007 | Gibbs sampling | LDA, hierarchical Bayesian algorithm | Probabilistic community profile discovery in social networks
[80] | On-line LDA (OLDA) | 2008 | Gibbs Sampling | LDA, empirical Bayes method | Topic detection and tracking
[36] | MG-LDA | 2008 | Gibbs sampling | LDA | Multi-aspect sentiment analysis
[66] | Labeled LDA | 2009 | Gibbs Sampling | LDA | Producing a labeled document collection
[69] | Relational Topic Models | 2009 | Expectation-maximization (EM) | LDA | Predictions over nodes, attributes, and link structure
[81] | HDP-LDA | 2009 | Gibbs Sampling | LDA | —
[82] | DiscLDA | 2009 | Gibbs Sampling | LDA | Classification and dimensionality reduction in documents
[17] | GeoFolk | 2010 | Gibbs sampling | LDA | Content management and retrieval of spatial information
[65] | JointLDA | 2010 | Gibbs Sampling | LDA, bag-of-words model | Mining multilingual topics
[35] | Topic-Aspect Model (TAM) | 2010 | Gibbs Sampling | LDA, SVM | Scientific topic discovery
[86] | Dependency-Sentiment-LDA | 2010 | Gibbs Sampling | LDA | Sentiment classification
[27] | TopicXP | 2010 | — | LDA | Source code analysis
[30] | constrained-LDA | 2010 | Gibbs Sampling | LDA | Opinion mining and grouping product features
[40] | PET (Popular Events Tracking) | 2010 | Gibbs Sampling | LDA | Event analysis in social networks
[39, 40] | EDCoW | 2010 | — | LDA, wavelet transformation | Event analysis in Twitter
[85] | Bio-LDA | 2011 | Gibbs Sampling | LDA | Extracting biological terminology
[64] | Twitter-LDA | 2011 | Gibbs Sampling | LDA, author-topic model, PageRank | Extracting topical keyphrases and analyzing Twitter content
[41] | max-margin latent Dirichlet allocation (MMLDA) | 2011 | Variational inference | LDA, SVM | Image classification and annotation
[34] | Sentence-LDA | 2011 | Gibbs sampling | LDA | Aspect and sentiment discovery for web reviews
[87] | PLDA+ | 2011 | Gibbs sampling | LDA, weighted round-robin | Reducing inter-computer communication time
[72] | Dirichlet class language model (DCLM) | 2011 | Variational Bayesian EM (VB-EM) | LDA | Speech recognition and exploitation of language models
2.5 A brief look at past work: Research between 2012 and 2013
According to Table 2, some of the popular works published between 2012 and 2013 focused on a variety of topics, such as music retrieval [88], opinion and aspect mining [89], and event analysis [38].
ET-LDA: in this work, the authors developed a joint Bayesian model that performs event segmentation and topic modeling in one unified framework. They proposed an LDA model, called Event and Tweets LDA (ET-LDA), to obtain an event's topics and analyze tweeting behavior on Twitter, employing Gibbs sampling to estimate the topic distribution [38]. The advantage of ET-LDA is that it can extract the top general topics for an entire tweet collection, which are highly relevant to and influenced by the event.
Mr. LDA: the authors introduced a novel model and a parallelized LDA algorithm in the MapReduce framework, called Mr. LDA. In contrast to other approaches, which use Gibbs sampling for LDA, this model uses variational inference [71].
LDA-GA: the authors focused on the issue of textual analysis in software engineering. They proposed an LDA model based on a genetic algorithm to determine a near-optimal configuration for LDA (LDA-GA), considering three scenarios: (a) traceability link recovery, (b) feature location, and (c) labeling. They applied the Euclidean distance for measuring the distance between documents and used fast collapsed Gibbs sampling to approximate the posterior distributions of the parameters [63].
TopicSpam: the authors proposed a generative LDA-based topic modeling method for detecting fake reviews. Their model can capture the differences between truthful and deceptive reviews and provides a clear probabilistic estimate of how likely a review is to be truthful or deceptive. For evaluation, the authors used 800 reviews of 20 Chicago hotels and showed that TopicSpam slightly outperforms TopicTDB in accuracy.
SSHLDA: the authors proposed a semi-supervised hierarchical topic model, called Semi-Supervised Hierarchical LDA (SSHLDA), whose goal is to explore new topics in the data space while incorporating the information from observed hierarchical labels into the modeling process. They used the label hierarchy as the base hierarchy, called the Base Tree (BT), and then applied hLDA to automatically build a topic hierarchy for each leaf node in the BT, called the Leaf Topic Hierarchy (LTH). One of the benefits of SSHLDA is that it can incorporate labeled topics into the generative process of documents; in addition, SSHLDA can automatically explore latent topics in the data space and extend the existing hierarchy of observed topics.
2.6 A brief look at past work: Research between 2014 and 2015
According to Table 2, some of the popular works published between 2014 and 2015 focused on a variety of topics, such as hashtag discovery [45-46], opinion and aspect mining [28-29, 31-32], and recommendation systems [45, 47-48].
Biterm Topic Modeling (BTM): topic modeling over short texts is an increasingly important task due to the prevalence of short texts on the Web, especially with the emergence of social media, and inferring topics from large-scale short texts has become critical. The authors proposed a novel topic model for short texts, the biterm topic model (BTM), which captures topics within short texts by explicitly modeling word co-occurrence patterns (biterms) over the whole corpus. BTM obtains the underlying topics in a set of text documents and a global topic distribution by analyzing the generation of biterms. Their results showed that BTM produces discriminative topic representations and rather coherent topics on short texts. The main advantage of BTM is that learning a global topic distribution avoids the data sparsity issue [96-97].
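The biterm construction at the heart of BTM can be sketched in a few lines: every unordered pair of words co-occurring in one short text is a biterm, and the pairs are aggregated over the whole corpus (an illustrative sketch, not the authors' implementation):

```python
from itertools import combinations

def extract_biterms(docs):
    """Collect all unordered word pairs (biterms) within each short text.

    docs: list of tokenized short texts.
    Returns one flat list of biterms aggregated over the corpus, which is
    the input BTM models instead of per-document word occurrences.
    """
    biterms = []
    for doc in docs:
        for w1, w2 in combinations(doc, 2):
            biterms.append(tuple(sorted((w1, w2))))
    return biterms

# Two toy "tweets": 3 words yield 3 biterms, 2 words yield 1.
biterms = extract_biterms([["apple", "iphone", "sale"], ["stock", "sale"]])
```

BTM then models each biterm as drawn from a single topic sampled from a corpus-level topic distribution, which is why sparsity at the level of individual short documents is avoided.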
TOT-MMM: the authors introduced a hashtag recommendation model called TOT-MMM. This hybrid model combines a temporal clustering component similar to that of the Topics-over-Time (TOT) model with the Mixed Membership Model (MMM) originally proposed for word-citation co-occurrence. The model can capture the temporal clustering effect in latent topics, thereby improving hashtag modeling and recommendation. The authors developed a collapsed Gibbs sampler (CGS) to approximate the posterior modes of the remaining random variables [45]. The posterior probability that the latent topic equals k for the n-th hashtag in tweet d is given by:
P(z^{(h)}_{dn} = k \mid z^{(m)}, z^{(h)}_{-dn}, w^{(m)}, w^{(h)}, t) \propto
\frac{\beta_h + c^{(h)}_{w^{(h)}_{dn},-dn,k}}{V_h \beta_h + c^{(h)}_{-dn,k}}
\times \frac{\alpha + c^{(dh)}_{-dn,k} + c^{(dm)}_{\cdot,k}}{K\alpha + N_{dm} + N_{dh} - 1}
\times \frac{t_d^{\psi_{k1}-1}(1-t_d)^{\psi_{k2}-1}}{B(\psi_{k1},\psi_{k2})}

where c^{(h)}_{w^{(h)}_{dn},-dn,k} denotes the number of hashtags of type w^{(h)}_{dn} assigned to latent topic k, excluding the hashtag currently being processed; c^{(h)}_{-dn,k} denotes the number of hashtags assigned to latent topic k, excluding the assignment at position dn; c^{(dh)}_{-dn,k} denotes the number of hashtags assigned to latent topic k in tweet d, excluding the hashtag currently being processed; c^{(dm)}_{\cdot,k} denotes the number of words assigned to latent topic k in tweet d; V_h is the number of unique hashtags; t_d is the tweet time stamp, with position subscripts omitted because all words and hashtags in a tweet share the same time stamp; and \psi_{k1}, \psi_{k2} are the parameters of the beta distribution for latent topic k.
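To make the conditional above concrete, the sketch below evaluates its three factors (word term, tweet-level topic term, and beta-distributed time term) for one candidate topic. The count layout, names, and toy numbers are assumptions for illustration only, not the authors' implementation:

```python
from math import gamma

def beta_pdf(t, a, b):
    """Density of Beta(a, b) at t, using B(a, b) = G(a)G(b)/G(a+b)."""
    return t ** (a - 1) * (1 - t) ** (b - 1) * gamma(a + b) / (gamma(a) * gamma(b))

def hashtag_topic_weight(k, w, d, counts, alpha, beta_h, V_h, K, t_d, psi):
    """Unnormalized P(z_dn = k | rest) for one candidate topic k.

    `counts` is a hypothetical dict of the count statistics from the
    equation above, already excluding the current assignment:
      counts['hw'][k][w]  -- hashtags of type w assigned to topic k
      counts['h'][k]      -- hashtags assigned to topic k
      counts['dh'][d][k]  -- hashtags in tweet d assigned to topic k
      counts['dm'][d][k]  -- words in tweet d assigned to topic k
      counts['N_dm'][d], counts['N_dh'][d] -- word/hashtag totals of tweet d
    """
    word_term = (beta_h + counts['hw'][k][w]) / (V_h * beta_h + counts['h'][k])
    topic_term = (alpha + counts['dh'][d][k] + counts['dm'][d][k]) / \
                 (K * alpha + counts['N_dm'][d] + counts['N_dh'][d] - 1)
    time_term = beta_pdf(t_d, psi[k][0], psi[k][1])
    return word_term * topic_term * time_term

# Toy example: K = 2 topics, V_h = 2 hashtag types, made-up counts.
counts = {'hw': [[1, 0], [0, 2]], 'h': [1, 2],
          'dh': [[1, 0]], 'dm': [[2, 1]],
          'N_dm': [3], 'N_dh': [1]}
weights = [hashtag_topic_weight(k, w=0, d=0, counts=counts, alpha=0.1,
                                beta_h=0.01, V_h=2, K=2, t_d=0.5,
                                psi=[(2.0, 2.0), (2.0, 2.0)])
           for k in range(2)]
```

In a full sampler these unnormalized weights would be normalized over k and a new topic drawn from the resulting distribution.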
The probability of a hashtag given the observed words and time stamps is:

p(w^{(h)}_{vn} \mid w^{(m)}_{v\cdot}, t_v)
= \int p(w^{(h)}_{vn} \mid \theta^{(v)})\, p(\theta^{(v)} \mid w^{(m)}_{v\cdot}, t_v)\, d\theta^{(v)}
\approx \frac{1}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} \phi^{(s)}_{h,k,w^{(h)}_{vn}}\, \theta^{(v)(s)}_{k},

where S is the total number of recorded sweeps and the superscript s marks parameters computed from a specific recorded sweep. To provide the top N predictions, they ranked these probabilities from largest to smallest and output the first N hashtags.
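Concretely, the averaging over recorded sweeps and the top-N ranking can be sketched as follows (the phi and theta values are made up for illustration):

```python
import numpy as np

# Hypothetical posterior estimates from S = 2 recorded Gibbs sweeps:
# phi[s, k, h] is the probability of hashtag h under topic k in sweep s,
# theta[s, k] is the topic distribution of the target tweet in sweep s.
phi = np.array([[[0.7, 0.2, 0.1],
                 [0.1, 0.1, 0.8]],
                [[0.6, 0.3, 0.1],
                 [0.2, 0.1, 0.7]]])
theta = np.array([[0.9, 0.1],
                  [0.8, 0.2]])

# p(h) = (1/S) * sum_s sum_k phi[s, k, h] * theta[s, k]
scores = np.einsum('skh,sk->h', phi, theta) / phi.shape[0]

top_n = np.argsort(scores)[::-1][:2]  # indices of the two best hashtags
```

Because each sweep's phi rows and theta rows are probability distributions, the averaged scores again sum to one over the hashtag vocabulary.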
rLDA. The authors introduced a novel probabilistic formulation that estimates the relevance of a tag by considering all the other images and their tags, and proposed a model called regularized latent Dirichlet allocation (rLDA). The model estimates the latent topics of each document while making use of the other documents. They used a collective inference scheme to estimate the distribution of latent topics and applied a deep network structure to analyze the benefit of regularized LDA [45-46].
2.7 A brief look at some impressive past works: research in 2016
According to Table 2, some of the popular works published in this year focused on a variety of topics, such as recommendation systems [42-44] and opinion mining and aspect mining [28-32].
A bursty topic on Twitter is one that triggers a surge of relevant tweets within a short period of time, which often reflects important events of mass interest. How to leverage Twitter for early detection of bursty topics has therefore become an important research problem with immense practical value. In [59], the authors proposed a sketch-based topic model, together with a set of techniques for achieving real-time bursty topic detection from the perspective of topic modeling, called TopicSketch.
Bursty topics are often triggered by events, such as breaking news or a compelling basketball game, that attract a lot of attention and "force" people to tweet about them intensely. By analogy with physics, this "force" can be expressed as "acceleration", which in this setting describes the change of "velocity", i.e., the arrival rate of tweets. Bursty topics show significant acceleration while they are bursting, whereas general topics usually have nearly zero acceleration, so the acceleration signal can be used to preserve information about bursty topics while filtering out the others. Equation (1) shows how the "velocity" v(t) of words is calculated; the "acceleration" a(t) is its rate of change.
v_{\Delta T}(t) = \frac{\sum_{t_i \le t} X_i \exp\left((t_i - t)/\Delta T\right)}{\Delta T} \quad (1)
In Equation (1), X_i is the frequency of a word (or a pair or triple of words) in the i-th tweet, and t_i is its timestamp. The exponential term in v_{\Delta T}(t) works like a soft moving window, giving recent terms high weight and distant ones low weight; the smoothing parameter \Delta T is the window size. The authors also proposed a novel data sketch that efficiently maintains, at low computational cost, three quantities: the total number of tweets, the occurrences of each word, and the occurrences of each word pair. Low computation cost is therefore one of the advantages of this approach.
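The velocity formula is straightforward to transcribe; the sketch below is a direct reading of Equation (1), with names and toy data chosen for illustration:

```python
from math import exp

def velocity(tweets, t, delta_t):
    """TopicSketch-style 'velocity' of a term at time t.

    `tweets` is a hypothetical list of (X_i, t_i) pairs, where X_i is
    the term's frequency in the i-th tweet and t_i its timestamp.
    Recent tweets get weight close to 1, older ones decay
    exponentially; delta_t is the soft window size.
    """
    return sum(x * exp((t_i - t) / delta_t)
               for x, t_i in tweets if t_i <= t) / delta_t

# A term appearing once now and once 10 time units ago, window = 10:
v = velocity([(1, 0.0), (1, 10.0)], t=10.0, delta_t=10.0)
```

Acceleration can then be estimated as the difference of velocities computed with two window sizes or at two time points, which is the signal TopicSketch thresholds to flag bursts.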
Hashtag-LDA. The authors introduced a personalized hashtag recommendation approach based on the latent topical information in untagged microblogs. The model enhances the influence of hashtags on latent-topic generation by jointly modeling hashtags and words in microblogs. Latent topics are inferred by Gibbs sampling, and a real-world Twitter dataset was used to evaluate the approach [44]. CDLDA proposed a conceptual dynamic latent Dirichlet allocation model for topic detection and tracking in conversational communication, particularly spoken interactions. The model can extract dependencies between topics and speech acts; it applies hypernym information and speech acts for topic detection and tracking in conversations, and it captures contextual information from transitions, incorporating concept features and speech acts [61].
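Since collapsed Gibbs sampling is the inference method listed for most models in Table 2, a minimal pure-Python sampler for vanilla LDA may help fix ideas. This is a toy sketch with symmetric priors and an invented corpus, not an implementation of any of the surveyed systems:

```python
import random

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampler for vanilla LDA on integer-coded docs."""
    rng = random.Random(seed)
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    ndk = [[0] * K for _ in docs]       # tokens in doc d assigned to topic k
    nkw = [[0] * V for _ in range(K)]   # occurrences of word w in topic k
    nk = [0] * K                        # total tokens assigned to topic k
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]             # remove the current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Unnormalized full conditional p(z_dn = j | rest).
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta)
                           / (nk[j] + V * beta) for j in range(K)]
                r = rng.random() * sum(weights)
                k = 0
                while r > weights[k]:
                    r -= weights[k]
                    k += 1
                z[d][n] = k             # record the new assignment
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return z, ndk, nkw

# Two toy "documents" over a 4-word vocabulary: words 0-1 vs. words 2-3.
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 2]]
z, ndk, nkw = lda_gibbs(docs, K=2, V=4)
```

After sampling, ndk and nkw (plus the priors) yield the usual point estimates of the document-topic and topic-word distributions; the extensions surveyed above modify the count structure and the conditional, not this overall loop.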
Table 2 Some impressive articles based on LDA: between 2012-2016

Study | Model | Year | Parameter Estimation/Inference | Methods | Problem Domain
[90] | locally discriminative topic model (LDTM) | 2012 | Expectation-maximization (EM) | LDA | Document semantic analysis
[38] | ET-LDA | 2012 | Gibbs sampling | LDA | Event segmentation (Twitter)
[88] | infinite latent harmonic allocation (iLHA) | 2012 | Expectation-maximization (EM) algorithm, Variational Bayes (VB) | LDA, HDP (Hierarchical Dirichlet processes) | Multipitch analysis and music information retrieval
[71] | Mr. LDA | 2012 | Variational Bayes inference | LDA, Newton-Raphson method, MapReduce algorithm | Exploring document collections at large scale
[91] | FB-LDA, RCB-LDA | 2012 | Gibbs sampling | LDA | Analyze and track public sentiment variations (on Twitter)
[92] | factorial LDA | 2012 | Gibbs sampling | LDA | Analysis of text in a multi-dimensional structure
[93] | SShLDA | 2012 | Gibbs sampling | LDA, hLDA | Topic discovery in data space
[94] | Utopian | 2013 | Gibbs sampling | LDA | Visual text analytics
[63] | LDA-GA | 2013 | Gibbs sampling | LDA, Genetic Algorithm | Software textual retrieval and analysis
[33] | Multi-aspect Sentiment Analysis for Chinese Online Social Reviews (MSA-COSRs) | 2013 | Gibbs sampling | LDA | Sentiment analysis and aspect mining of Chinese social reviews
[89] | TopicSpam | 2013 | Gibbs sampling | LDA | Opinion spam detection
[95] | WT-LDA | 2013 | Gibbs sampling | LDA | Web service clustering
[51] | emotion-LDA (ELDA) | 2014 | Gibbs sampling | LDA | Social emotion classification of online news
[48] | TWILITE | 2014 | EM algorithm | LDA | Recommendation system for Twitter
[98] | Red-LDA | 2014 | Gibbs sampling | LDA | Extracting information and data modeling in patient record notes
[96,97] | Biterm Topic Model (BTM) | 2014 | Gibbs sampling | LDA | Document clustering for short text
[47] | Trend Sensitive-Latent Dirichlet Allocation (TS-LDA) | 2014 | Gibbs sampling | LDA, Normalized Discounted Cumulative Gain (nDCG), Amazon Mechanical Turk (AMT) platform | Interesting-tweet discovery for users, recommendation system
[32] | Fine-grained Labeled LDA (FL-LDA), Unified Fine-grained Labeled LDA (UFL-LDA) | 2014 | Gibbs sampling | LDA | Aspect extraction and review mining
[45,46] | Regularized latent Dirichlet allocation (rLDA) | 2014 | Variational Bayes inference | LDA | Automatic image tagging or tag recommendation
[29] | generative probabilistic aspect mining model (PAMM) | 2014 | Expectation-maximization (EM) | LDA | Opinion mining and grouping of drug reviews, aspect mining
[28] | AEP-based Latent Dirichlet Allocation (AEP-LDA) | 2014 | Gibbs sampling | LDA | Opinion/aspect mining and sentiment word identification
[31] | ADM-LDA | 2014 | Gibbs sampling | LDA, Markov chain | Aspect mining and sentiment analysis
[99] | MRF-LDA | 2015 | EM algorithm | LDA, Markov Random Field | Exploiting word correlation knowledge
[100] | LightLDA | 2015 | Gibbs sampling | LDA | Topic modeling for very large data sizes
[101] | LF-LDA, LF-DMM | 2015 | Gibbs sampling | LDA | Document clustering for short text
[103] | LFT | 2015 | Gibbs sampling | LDA | Semantic community detection
[102] | ATC | 2015 | EM | LDA | Author community discovery
[106] | FLDA, DFLDA | 2015 | Gibbs sampling | LDA | Multi-label document categorization
[41] | Hashtag-LDA | 2016 | Gibbs sampling | LDA | Hashtag recommendation and finding relationships between topics and hashtags
[44] | PMB-LDA | 2016 | Expectation-maximization (EM) | LDA | Extracting population mobility behaviors at large scale
[108] | Automatic Rule Generation (LARGen) | 2016 | Gibbs sampling | LDA | Malware analysis and automatic signature generation
[99] | PT-LDA | 2016 | Gibbs-EM algorithm | LDA | Personality recognition in social networks
[110] | Corr-wddCRF | 2016 | Gibbs sampling | LDA | Knowledge discovery in electronic medical records
[42] | multi-idiomatic LDA model (MiLDA) | 2016 | Gibbs sampling | LDA | Content-based recommendation and automatic linking
[43] | Location-aware Topic Model (LTM) | 2016 | Gibbs sampling | LDA | Music recommendation
[62] | TopPRF | 2016 | Gibbs sampling | LDA | Evaluating the relevancy between feedback documents
[50] | contextual sentiment topic model (CSTM) | 2016 | Expectation-maximization (EM) | LDA | Emotion classification in social networks
[61] | conceptual dynamic latent Dirichlet allocation (CDLDA) | 2016 | Gibbs sampling | LDA | Topic detection in conversations
[60] | multiple-channel latent Dirichlet allocation (MCLDA) | 2016 | Gibbs sampling | LDA | Finding relations between diagnoses and medications from healthcare data
[37] | multi-modal event topic model (mmETM) | 2016 | Gibbs sampling | LDA | Tracking and social event analysis
[111] | Dynamic Online Hierarchical Dirichlet Process model (DOHDP) | 2016 | Gibbs sampling | LDA | Dynamic topic evolutionary discovery for Chinese social media
[59] | TopicSketch | 2016 | Gibbs sampling | LDA, tensor decomposition algorithm, Count-Min algorithm | Real-time detection and bursty topic discovery from Twitter
[112] | fast online EM (FOEM) | 2016 | Expectation-maximization (batch EM) | LDA | Big topic modeling
[113] | Joint Multi-grain Topic Sentiment (JMTS) | 2016 | Gibbs sampling | LDA | Extracting semantic aspects from online reviews
[114] | Character word topic model (CWTM) | 2016 | Gibbs sampling | LDA | Capturing the semantic content of text documents (Chinese language)
Fig. 2 A vision of the application of Topic modeling in various sciences (based on previous works)
mmETM. The authors proposed a novel multi-modal social event tracking model to capture the evolutionary trends of social events and to generate effective event summaries over time. mmETM models the multi-modal property of social events and learns the correlations between visual and textual modalities, separating visual-representative topics from non-visual-representative ones. The model works in an online mode for events consisting of many stories over time, which is a great advantage for capturing evolutionary trends in event tracking and evolution [37].
3 In which areas is Topic Modeling used?
With the passage of time, the importance of topic modeling in different disciplines will increase. According to previous studies, we present a taxonomy of current LDA-based topic model approaches across different subjects, such as social networks [76, 115-118], software engineering [8, 9, 25-26], and crime science [119-121], and also in the areas of geography [13, 14-17], political science [19-20, 122], medicine/biomedicine [4, 123-125], and linguistic science [126-130], as illustrated in Figure 2.
3.1 Topic modeling in Linguistic science
LDA is an advanced textual analysis technique, grounded in computational linguistics research, that calculates the statistical correlations among words in a large set of documents to identify and quantify the underlying topics in those documents. In this subsection, we examine some topic modeling methodology from computational linguistics research; significant studies are shown in Table 3. In [130], the authors employed the distributional hypothesis in various directions, aiming to remove the requirement of a seed lexicon as an essential prerequisite for building a bilingual vocabulary, and introduced various ways to identify the translation of words across languages. In [128], the authors introduced a method that guides machine translation systems to relevant translations based on topic-specific contexts, using the topic distributions to obtain
Table 3 Impressive LDA-based works in linguistic science

Study | Year | Purpose | Dataset
[130] | 2011 | Introduce various ways to identify the translation of words among languages [BiLDA] | A Wikipedia dataset (Arabic, Spanish, French, Russian and English)
[129] | 2010 | Obtain term weighting based on LDA | A multilingual dataset
[127] | 2013 | Present a diversity of new visualization techniques to make sense of topic solutions | Dissertation abstracts (1980 to 2010); 1 million abstracts
[126] | 2012 | A topic modeling approach that considers geographic information | Foursquare dataset
[131] | 2014 | An approach capable of handling documents in different languages | ALTW2010
[132] | 2013 | A method for discovering linguistic and conceptual metaphor resources | Wikipedia
topic-dependent lexical weighting probabilities. They trained a topic model on the training data and adapted the translation model accordingly. To evaluate their approach, they performed experiments on Chinese-to-English machine translation and showed that the approach can be an effective strategy for dynamically biasing a statistical machine translation system towards relevant translations.
In [127], the authors presented a diversity of new visualization techniques for making sense of topic solutions and introduced new forms of supervised LDA; for evaluation they considered a corpus of dissertation abstracts from 1980 to 2010 belonging to 240 universities in the United States. In [126], the authors developed a standard topic modeling approach that considers geographic and temporal information; the approach was applied to Foursquare data to discover the dominant topics in the proximity of a city. The researchers also showed that the abundance of data available in location-based social networks (LBSNs) enables such models to capture the topical dynamics of urban environments. In [132], the authors introduced a method for discovering linguistic and conceptual metaphor resources: they built an LDA model on Wikipedia, aligned its topics to possible source and target concepts, and used both target and source domains to identify potentially metaphorical sentences. In [131], the authors presented an LDA-based approach that can take a document containing different languages, identify the languages present, and then calculate their relative proportions; the ALTW2010 dataset was used to evaluate their method.
3.2 Topic modeling in political science
Some topic modeling methods have been adopted in the political science literature toanalyze political attention. In settings where politicians have limited time-resourcesto express their views, such as the plenary sessions in parliaments, politicians mustdecide what topics to address. Analyzing such speeches can thus provide insight intothe political priorities of the politician under consideration. Single membership topic
Table 4 Impressive LDA-based works in political science

Study | Year | Purpose | Dataset
[19] | 2013 | Evaluate the behavioral effects of different databases for political-orientation classifiers | Political Figures Dataset; Politically Active Dataset; Politically Modest Dataset; Conover 2011 Dataset (C2D)
[21] | | Introduce a topic model for contrastive opinion modeling | Statement records of U.S. senators
[133] | 2012 | Detect topics that evoke different reactions from communities across the political spectrum | A collection of blog posts from five blogs: 1. Carpetbagger (CB) thecarpetbaggerreport.com; 2. Daily Kos (DK) dailykos.com; 3. Matthew Yglesias (MY) yglesias.thinkprogress.org; 4. Red State (RS) redstate.com; 5. Right Wing News (RWN) rightwingnews.com
[18] | 2010 | Discover the hidden relationships between opinion words and topic words | The statement records of senators through Project Vote Smart (http://www.votesmart.org)
[134] | 2014 | Analyze issues related to Korea's presidential election | Project Vote Smart website (https://votesmart.org/)
[135] | 2014 | Examine political contention in the U.S. trucking industry | Regulations.gov online portal
[136] | 2015 | A method for multi-dimensional analysis of political documents | Three German national elections (2002, 2005 and 2009)
models that assume each speech relates to one topic have successfully been applied to plenary speeches made in the 105th to 108th U.S. Senates in order to trace the political attention of Senators over time [18]. In addition, [20] proposed a new two-layer matrix factorization methodology for identifying topics in large political speech corpora over time, capturing both niche topics related to events at a particular point in time and broad, long-running topics. That paper focused on European Parliament speeches; the proposed topic modeling method has a number of potential applications in the study of politics, including the analysis of speeches in other parliaments, political manifestos, and other more traditional forms of political text. The purpose of [19] is to examine the various effects of dataset selection on political-orientation classifiers; the authors built three datasets, each consisting of a collection of Twitter users with a political orientation. In this approach, the output of an LDA model is used as one of many features fed to an SVM classifier, while another part of the method uses a Labeled LDA (LLDA) as a stand-alone classifier. Their assessment showed that there are some limitations in building labels for non-political user categories. Significant LDA-based research in political science is shown in Table 4.
Fang et al. [21] suggested a new unsupervised topic model based on LDA for contrastive opinion modeling, whose purpose is to find the opinions on a given topic from multiple views and to quantify their differences with qualifying criteria; the model is called the Cross-Perspective Topic (CPT) model. They performed experiments with both qualitative and quantitative measures on two datasets in the political area: the first consists of statement records of U.S. senators, which reveal the political stances of the senators, and the second consists of extracts from world news media, from three representative outlets in the U.S. (New York Times), China (Xinhua News), and India (The Hindu). To compare their approach with other models, they used corrLDA and LDA as two baselines. Yano et al. [137, 138] applied several probabilistic models based on LDA to predict responses to political blog posts. In more detail, they used the topic models LinkLDA and CommentLDA to generate blog data (topics, words of posts), with which they could find relationships between posts, commentators, and their responses. To evaluate their models, they gathered comments and blog posts focused on American politics from 40 blog sites.
In [139], the authors introduced a new application of universal sensing based on mobile phone sensors, using an LDA topic model to discover patterns and analyze the behavior of people who changed their political opinions; they also evaluated the various political opinions of individual residents, with a measure of dynamic homophily that reveals patterns around external political events. To collect data and apply their approach, they provided a mobile sensing platform that captured social interactions and dependent variables during the American presidential campaigns of John McCain and Barack Obama in the last three months of 2008. In [133], the authors analyzed emotional reactions and suggested a novel multi-target model, Multi Community Response LDA (MCR-LDA), which uses sLDA and support vector machine classification to predict comment polarity from post content. To evaluate their approach, they used a dataset of blog posts from five blogs focused on US politics, created by [137].
In [18], the authors suggested a generative model for automatically discovering the latent associations between opinion words and topics, which can be useful for extracting political standpoints, and used an LDA model to reduce the set of adjective words; they showed that the sentences extracted by their model can effectively represent different opinions. They focused on statement records of senators, comprising 15,512 statements from 88 senators, from the Project Vote Smart website. In [134], the authors examined how social and political issues related to the 2012 South Korean presidential election appeared on Twitter and used an LDA method to evaluate the relationship between topics extracted from events and tweets. In [136], the authors proposed a method for evaluating and comparing documents based on an extension of LDA, using the LogicLDA and Labeled LDA approaches for topic modeling. They considered German national elections since 1990 as a dataset and showed that their method consistently outperforms a baseline that simulates manual annotation based on text and keyword evaluation.
Table 5 Impressive LDA-based works in medical and biomedical science

Study | Year | Purpose/problem domain | Dataset
[124] | 2017 | Presented three LDA-based models for adverse drug reaction prediction | ADReCS database
[84] | 2011 | Extract biological terminology | MEDLINE and Bio-Terms Extraction; Chem2Bio2Rdf
[4] | 2017 | User preference distribution discovery and identity distribution of doctor features | Yelp dataset (Yelp.com)
[7] | 2012 | Ranking gene-drug relations; detecting relationships between genes and drugs | National Library of Medicine
[6] | 2011 | Analyzing public health information on Twitter | 20 disease articles of Twitter data
[115] | 2013 | Analysis of user-generated content from social networking sites | One million English posts from Facebook's server logs
[124] | 2013 | Pattern discovery and extraction for clinical processes | A dataset from Zhejiang Huzhou Central Hospital of China
[123] | 2011 | Identifying miRNA-mRNA in functional miRNA regulatory modules | Mouse mammary dataset
[140] | 2011 | Extract common relationships | T2DM clinical dataset
[5] | 2012 | Extract the latent topics in Traditional Chinese Medicine documents | T2DM clinical dataset
3.3 Topic modeling in medical and biomedical
Topic models have been applied to pure biomedical and medical text mining, and researchers have introduced this approach into the fields of medical science; topic modeling can be advantageously applied to the large datasets of biomedical/medical research. Significant LDA-based research on analyzing and evaluating medical and biomedical content is shown in Table 5. In [125], the authors introduced three LDA-like models and found that they have higher accuracy than the state-of-the-art alternatives. The authors demonstrated that this LDA-based approach can successfully uncover the probabilistic patterns between adverse drug reaction (ADR) topics, and they used the ADReCS database to evaluate their approach; their aim was to predict ADRs from a large number of ADR candidates in order to obtain drug targets. In [4], the authors focused on the issue of professionalized medical recommendations and introduced a new healthcare recommendation system called iDoctor, which uses hybrid matrix factorization for personalized doctor recommendation. They adopted an LDA topic model to extract topics from doctor features and to analyze document similarity. The dataset for that article was collected from a crowd-sourced website called Yelp. Their results showed that iDoctor can increase the accuracy of health recommendations and achieves higher prediction quality for user ratings.
In [85], the authors suggested an LDA-based approach called Bio-LDA that can identify biological terminology to obtain latent topics. They showed that the approach can be applied in different studies, such as association search, association prediction, and connectivity map generation, and that Bio-LDA can enhance the application of molecular-binding techniques such as heat maps. In [7], the authors proposed a topic model for ranking gene-drug relations using probabilistic KL distance and LDA, called LDA-PKL, and showed that the suggested model achieved a higher Mean Average Precision (MAP) for ranking and detecting pharmacogenomics (PGx) relations. To analyze and apply their approach, they used a dataset from the National Library of Medicine. In [6], the authors presented the Ailment Topic Aspect Model (ATAM) for the analysis of more than one and a half million tweets on public health, focusing on a specific question with specific models: "What public health information can be learned from Twitter?"
In [124], the authors introduced an LDA-based method to discover patterns of internal treatment in clinical processes (CPs); detecting such hidden patterns is currently one of the most serious elements of clinical process evaluation. Their main approach is to obtain care-flow logs and estimate hidden patterns from the gathered logs based on LDA. The identified patterns can be applied to classify and discover clinical activities with the same medical treatment. To test the potential of their approach, they used a dataset collected from Zhejiang Huzhou Central Hospital of China. In [123], the authors introduced a model for the discovery of functional miRNA regulatory modules (FMRMs) that merges heterogeneous datasets, including expression profiles of both miRNAs and mRNAs, with or without using prior target-binding information. The model is based on Correspondence Latent Dirichlet Allocation (Corr-LDA). As an evaluation dataset, they applied their method to mouse-model expression datasets to study human breast cancer, and found that the model is able to capture different biologically meaningful relationships. In [140], the authors studied Chinese medicine (CM) diagnosis via topic modeling and introduced a model based on the Author-Topic model, called the Symptom-Herb-Diagnosis topic (SHDT) model, to detect CM diagnoses from the clinical information of diabetes patients. The evaluation dataset was collected from 328 diabetes patients. The results indicated that the SHDT model can discover typical symptom and herb-prescription topics for a number of important comorbid diseases (such as heart disease and diabetic kidney disease) [140].
3.4 Topic modeling in geographical and locations
There is a significant body of research on geographical topic modeling. Past work has shown that topic modeling based on location and textual information can be effective for discovering geographical topics and for geographical topic analysis. Table 6 shows significant LDA-based research on topic modeling and content analysis for geographical issues. In [16], the authors examine the issue of extracting topics from geographic information and GPS-related documents. They suggested a new location-text method, a combination of topic modeling and geographical clustering, called LGTA (Latent Geographical Topic Analysis). To test their approach, they collected a set of data from the Flickr website covering various topics. In [17], the authors introduced a novel method
Table 6 Impressive LDA-based works in geographical applications and locations

Study | Year | Purpose | Dataset
[17] | 2010 | Discovering multi-faceted summaries of documents | CoPhIR dataset
[16] | 2011 | Content management and retrieval | Flickr dataset
[15] | 2013 | Semantic clustering in very high resolution panchromatic satellite images | A QuickBird image of a suburban area
[14] | 2010 | Data discovery, evaluation of geographically coherent linguistic regions, and finding the relationship between topic variation and region | A Twitter dataset
[13] | 2008 | Geo-located image categorization and geo-recognition | 3013 Panoramio images in France
[141] | 2015 | Cluster discovery in geo-locations | Reuters-21578
[142] | 2014 | Discovering newsworthy information from Twitter | A small Twitter dataset
based on multi-modal Bayesian models, called GeoFolk, to describe social media by merging text features and spatial knowledge. In general terms, this method can be considered an extension of Latent Dirichlet Allocation (LDA). They used the available standard CoPhIR dataset, which contains over 54 million Flickr entries. The GeoFolk model can be used in quality-oriented applications and can be merged with models from the Web 2.0 social domain. In [15], the authors proposed a multiscale LDA model, a combination of multiscale image representation and a probabilistic topic model, to achieve effective clustering of VHR satellite images.
In [14], the authors introduced a model that includes two sources of lexical variation, geographical area and topic; in other words, the model can discover geographically coherent words in different linguistic regions and find relationships between regions and topic variation. To test their model, they gathered a dataset from Twitter; the model can also infer an author's geographic location from raw text [15-17]. In [13], the authors suggested a statistical model for the classification of geo-located images based on latent representations. In this model, the content of a geo-located database can be visualized by means of a few selected images for each geo-category. The model can be considered an extension of probabilistic Latent Semantic Analysis (pLSA). They built a geo-located image database containing 3013 Panoramio images related to southeastern France.
In addition, [143] designed a browsing system (GeoVisNews) and proposed an extended matrix factorization method based on geographic information, called Ordinal Correlation Consistent Matrix Factorization (OCCMF), for ranking locations, analyzing location relevance, and obtaining intra-relations between documents and locations. For evaluation, the authors used a large multimedia news dataset and showed that their system outperforms the Story Picturing Engine and Yahoo News Map.
In [141], the authors focused on the issue of identifying the textual topics of clusters of spatial objects with text descriptions. They presented combined methods, based on a clustering method and a topic model, to discover textual object clusters from documents with geo-locations; specifically, they used a probabilistic generative model (LDA) and the DBSCAN algorithm to find topics in documents, and they utilized Reuters-21578 as the dataset for analyzing their methods. In [142], the authors presented a study on characterizing significant reports from Twitter. They introduced a probabilistic model for topic discovery in the geographical topic area that can find hidden significant events on Twitter, and they applied stochastic variational inference (SVI) for gradient ascent on the variational objective with LDA. They collected 2,535 geo-tagged tweets from the Upper Manhattan area of New York and found that KL divergence is a good metric for identifying significant tweet events, although for a large dataset of news articles the result can be negative.
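The KL-divergence signal used for event detection is easy to state concretely; the sketch below compares an observed topic distribution against a background one (the distributions are made up, and the smoothing constant is an assumption to avoid log-of-zero issues):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete topic distributions.

    A larger value means the observed tweet-topic distribution p has
    drifted further from the background distribution q, the kind of
    signal used to flag a significant event.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

background = [0.25, 0.25, 0.25, 0.25]  # typical day over 4 topics
event_day = [0.70, 0.10, 0.10, 0.10]   # one topic suddenly dominates
drift = kl_divergence(event_day, background)
```

KL divergence is zero only when the two distributions coincide, so thresholding the drift value gives a simple event detector.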
3.5 Software engineering and topic modeling
Software evolution and source code analysis can be effective in solving current and future software engineering problems. Topic modeling has been used in information retrieval and text mining, where it has been applied to the problem of summarizing large text corpora. Recently, many articles have been published on evaluating and mining software using LDA-based topic modeling; Table 7 shows significant LDA-based research on source code analysis and other related issues in software engineering. In [8], the authors used LDA for the first time to extract topics from source code and to visualize software similarity; in other words, LDA offers an intuitive approach to calculating the similarity between source files by comparing their respective distributions over topics. They applied their method to 1,555 software projects from Apache and SourceForge, comprising 19 million source lines of code (SLOC), and demonstrated that the approach can be effective for project organization and software refactoring. In addition, [9] introduced a new coupling metric called Relational Topic-based Coupling (RTC), based on Relational Topic Models (RTM), which can identify latent topics and analyze the relationships between latent topic distributions in software data; RTM is an extension of LDA. The authors used thirteen open source software systems to evaluate this metric and demonstrated that RTC is useful and valuable for the analysis of large software systems.
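The similarity idea in [8], comparing source files through their per-file topic distributions, can be illustrated with a standard distance between probability vectors. Hellinger distance is used here as one common choice (the paper's exact measure may differ), and the topic vectors below are invented:

```python
import math

def hellinger_similarity(theta_a, theta_b):
    """Similarity of two source files from their LDA topic distributions.

    Hellinger distance lies in [0, 1] for probability vectors, so
    1 - distance gives a similarity where 1 means identical mixtures.
    """
    dist = math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(theta_a, theta_b)))
    return 1.0 - dist

# Hypothetical per-file topic mixtures over 3 topics.
parser_file = [0.80, 0.10, 0.10]  # mostly a "parsing" topic
lexer_file = [0.70, 0.20, 0.10]   # similar concern
gui_file = [0.05, 0.05, 0.90]     # unrelated concern

sim_related = hellinger_similarity(parser_file, lexer_file)
sim_unrelated = hellinger_similarity(parser_file, gui_file)
```

Files that share a dominant topic score close to 1, so clustering on this similarity groups code by conceptual concern rather than by token overlap.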
The work in [10] focused on software traceability through topic modeling and proposed an approach combining an LDA model with automated link capture. They applied their method to several data sets, demonstrated how topic modeling increases software traceability, and found that the approach scales to larger numbers of artifacts. The study in [11] examined the challenges of using topic models to mine software repositories and to detect the evolution of topics in source code, and suggested applying statistical topic models (LDA) to the discovery of textual repositories. Statistical topic models have various applications in software engineering, such as bug prediction, traceability link recovery, and software evolution analysis. Chen et al. in [26] used a generative statistical model (LDA) to analyze source code and find relationships between software defects and software development. They showed LDA
24 Hamed Jelodar et al.
Table 7 Impressive LDA-based works in software engineering

Study | Year | Purpose | Dataset
[8] | 2007 | Mining software and extracting concepts from code | SourceForge and Apache (1,555 projects)
[9] | 2010 | Identifying latent topics and their relationships in source code | Thirteen open source software systems
[10] | 2010 | Generating traceability links | ArchStudio software project
[26] | 2012 | Finding relationships between conceptual concerns in source code | Source code entities
[25] | 2008 | Analyzing software evolution | Open source Java projects, Eclipse and ArgoUML
[22] | 2009 | Automatic categorization of software systems | 43 open-source software systems
[23] | 2008 | Source code retrieval for bug localization | Mozilla, Eclipse source code
[24] | 2010 | Automatic bug localization and evaluation of its effectiveness | Open source software (e.g., Rhino and Eclipse)
[144] | 2017 | Detection of malicious Android apps | 1,612 malicious applications
can easily scale to large document collections, applying their approach to three large datasets: Mozilla Firefox, Mylyn, and Eclipse. Linstead et al. in [25] applied Author-Topic (AT) models to source code analysis. AT modeling is an extension of the LDA model that captures the relationship of authors to topics; they applied their method to the Eclipse 3.0 source code, comprising 700,000 lines of code in 2,119 source files from 59 developers. They demonstrated that topic models provide an effective statistical basis for evaluating developer similarity.
The work in [22] introduced an LDA-based method for automatically categorizing software systems, called LACT. To evaluate LACT, they used 43 open-source software systems in different programming languages and showed that LACT can categorize software systems based on the type of programming language. In addition, [23, 34] proposed an LDA-based topic modeling approach for bug localization. They applied the idea to the analysis of the same bugs in Mozilla and Eclipse, and the results showed that their LDA-based approach is better than LSI for evaluating and analyzing bugs in these code bases. In [144], the authors introduced a topic-specific approach that combines app descriptions with sensitive data flow information, using an advanced LDA-based topic model with a genetic algorithm (GA) to understand malicious apps and to cluster apps according to their descriptions. They applied their approach to 3,691 benign and 1,612 malicious applications and found that topic-specific data flow signatures are very efficient and useful in highlighting malicious behavior.
Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey 25
[Figure 3 depicts a pipeline: tweet messages, the user profile, and historical information are preprocessed for topic modeling; LDA with a parameter estimation method produces topic distributions; hashtags are then ranked by a user-interest score and a similarity score and recommended to each user.]

Fig. 3 A simple framework based on LDA to generate tags as a recommendation system on Twitter
3.6 Topic modeling in social network and microblogs
Social networks are a rich source for knowledge discovery and behavior analysis. For example, Twitter is one of the most popular social networks, and its analysis can be very effective for studying user behavior. Figure 3 shows a simple LDA-based design for building a tag recommendation system on Twitter. Recently, researchers have proposed many LDA approaches for analyzing user tweets. Weng et al. concentrated on identifying influential twitterers and proposed an approach based on an extension of the PageRank algorithm to rate users, called TwitterRank, using an LDA model to find latent topic information in a large collection of documents. To evaluate this approach, they prepared a dataset of the top 1,000 Singapore-based twitterers and showed that their approach outperforms related algorithms [145]. Hong et al. examined the problem of predicting message popularity, measured by the count of future retweets. The authors used TF-IDF scores as a baseline and used Latent Dirichlet Allocation (LDA) to calculate the topic distributions of messages. They collected a dataset of 2,541,178 users and 10,612,601 messages and demonstrated that this method can identify messages that will attract thousands of retweets [146]. Table 8 shows some significant LDA-based research on topic modeling and user behavior analysis in social networks and microblogs.
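The hashtag-ranking step of the Fig. 3 pipeline can be sketched as a blend of a content similarity score and a user-interest score over LDA topic vectors. All names, weights, and topic distributions below are illustrative assumptions, not taken from any of the surveyed systems:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two topic vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical LDA topic distributions over 3 topics (illustrative values).
tweet_topics = np.array([0.6, 0.3, 0.1])      # the new tweet to be tagged
user_interest = np.array([0.5, 0.4, 0.1])     # inferred from the user's history
hashtag_topics = {                            # from past tweets carrying each tag
    "#bigdata":  np.array([0.7, 0.2, 0.1]),
    "#fitness":  np.array([0.1, 0.2, 0.7]),
    "#politics": np.array([0.3, 0.6, 0.1]),
}

def rank_hashtags(tweet, interest, candidates, alpha=0.7):
    """Blend content similarity with user interest, as in the Fig. 3 pipeline."""
    scores = {tag: alpha * cosine(tweet, vec) + (1 - alpha) * cosine(interest, vec)
              for tag, vec in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)

print(rank_hashtags(tweet_topics, user_interest, hashtag_topics)[0])  # #bigdata
```

The weight `alpha` controls the trade-off between matching the tweet's content and matching the user's long-term interests; the surveyed systems learn such scores rather than fixing them by hand.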
The authors in [147] focused on topical recommendations on Twitter and presented a novel methodology for discovering the topics of interest of a Twitter user. In fact, they used a Labeled Latent Dirichlet Allocation (L-LDA) model to discover la-
Table 8 Impressive LDA-based works in social networks and microblogs

Study | Year | Purpose | Dataset
[145] | 2010 | Finding influential twitterers on a social network (Twitter) | Top-1000 Singapore-based twitterers
[147] | 2014 | Building topical recommendation systems | A Twitter dataset
[48] | 2014 | A recommendation system for Twitter | A Twitter dataset
[148] | 2012 | Analyzing and discovering events on Twitter | A Twitter dataset
[91] | 2014 | Analyzing public sentiment variations regarding a certain target on Twitter | A Twitter dataset
[49] | 2012 | Analyzing emotional and stylistic distributions on Twitter | A Twitter dataset
[149] | 2016 | A topic-enhanced word embedding for Twitter sentiment classification | SemEval-2014
[150] | 2016 | Categorizing emotion tendency on Sina Weibo | A Sina Weibo dataset
[151] | 2016 | Ranking items and summarizing news based on hLDA and a maximum spanning tree | A large news dataset (CNN, BBC, ABC and Google News)
tent topics between two tweet sets. The authors found that their method outperforms content-based methods for discovering user interests. In addition, [48] proposed an LDA-based recommendation system, called TWILITE, for capturing the behavior of users on Twitter. In more detail, TWILITE calculates the topic distributions of users and tweet messages, and the authors introduced ranking algorithms to recommend the top-K followers for each user on Twitter.
The work in [121] investigated criminal incident prediction on Twitter. The authors proposed an approach for analyzing and understanding Twitter posts based on a probabilistic language model combined with a generalized linear regression model. Their evaluation showed that this approach is capable of predicting hit-and-run crimes using only information contained in the content of the training set of tweets. The paper [152] introduced a novel LDA-based method for hashtag recommendation on Twitter that can categorize posts by their hashtags. Lin et al. in [153] investigated the cold-start problem by using social information for app recommendation on Twitter, applying an LDA model to discover latent groups from "Twitter personalities" for recommendation discovery. For their experiments, they used Apple's iTunes App Store and Twitter as datasets. Experimental results show that their approach is significantly better than other state-of-the-art recommendation techniques.
The authors in [148] presented a technique to analyze and discover events with an LDA model. They found that this method can detect events in topics inferred from tweets by wavelet analysis. For evaluation, they collected 13.6 million tweets from Twitter as a dataset and showed that using both hashtag names and inferred topics has a beneficial effect on the descriptive information for events. In addition, [154] investigated how to effectively discover health-related topics on Twitter and presented an LDA model that identifies latent topic information from a dataset of 2,231,712 messages from 155,508 users. They found that this method may be a valuable tool for monitoring public health on Twitter. Tan
et al. in [91] focused on tracking and modeling public sentiment on Twitter. They suggested a topic model approach based on LDA, Foreground and Background LDA, to distill foreground topics, and proposed a further method, called Reason Candidate and Background LDA (RCB-LDA), for ranking a set of reason candidates expressed in natural language. Their results showed that these models can be used to identify special topics and uncover different aspects. The authors in [49] collected a large Twitter corpus covering seven emotions: disgust, anger, fear, love, joy, sadness, and surprise. They used a probabilistic topic model based on LDA to discover emotions in a corpus of Twitter conversations. Srijith et al. in [155] proposed a probabilistic topic model based on hierarchical Dirichlet processes (HDP) for sub-story detection. They compared HDP with spectral clustering (SC) and locality-sensitive hashing (LSH) and showed that HDP is very effective on story detection datasets, with an improvement of up to 60% in F-score.
The work in [149] proposed a method for Twitter sentiment classification using topic-enhanced word embeddings: an LDA model generates the topic distribution of each tweet, and an SVM performs the sentiment classification. They used the SemEval-2014 dataset from the Twitter Sentiment Analysis Track; experiments show that their model reaches 81.02% in macro F-measure, demonstrating that topic-enhanced word embeddings are very effective for sentiment classification on Twitter. The work in [156] examined the demographic characteristics of Trump followers on Twitter. The authors used a negative binomial regression model to model the "likes" and LDA to extract topics from Trump's tweets. They evaluated on the US2016 (Twitter) dataset, which includes the followers of all the candidates in the 2016 United States presidential election.
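The feature-construction idea behind such topic-enhanced classifiers — concatenating surface features with an LDA topic vector before feeding a classifier — can be sketched as follows. The vectors and vocabulary size are illustrative stand-ins, not the features of [149]:

```python
import numpy as np

# Toy stand-ins: a bag-of-words vector and an LDA topic distribution for one
# tweet. A real system would infer theta with a trained LDA model and use
# word embeddings rather than raw counts.
bow = np.array([1.0, 0.0, 2.0, 1.0])    # term counts over a 4-word vocabulary
theta = np.array([0.8, 0.15, 0.05])     # tweet's distribution over 3 topics

def topic_enhanced_features(bow_vec, topic_vec):
    """Concatenate normalized surface features with topic features,
    producing the input for a downstream classifier such as an SVM."""
    bow_norm = bow_vec / max(bow_vec.sum(), 1.0)   # normalize counts
    return np.concatenate([bow_norm, topic_vec])

x = topic_enhanced_features(bow, theta)
print(x.shape)  # (7,)
```

The appeal of the design is that the topic block injects corpus-level context (what the tweet is about) that short, sparse surface features alone cannot carry.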
3.7 Crime prediction/evaluation
Over time, topic modeling has found applications in an increasing range of sciences. In recent work, some researchers have applied topic modeling methods to crime prediction and analysis. Table 9 shows some significant LDA-based research on topic discovery and crime activity analysis. The work in [119] introduced an early-warning system to detect crime activity intention based on an LDA model and a collaborative representation classifier (CRC). The system has two steps: first, LDA learns and extracts features that represent the article sources; next, the features obtained from LDA are used to classify a new document with the collaborative representation classifier (CRC). The work in [120] used statistical topic modeling based on LDA to identify discussion topics across a major city in the United States, and used kernel density estimation (KDE) techniques for standard crime prediction. Sharma et al. introduced an approach based on a geographical model of crime intensities to detect the safest path between two locations, using a simple naive Bayes classifier with features derived from an LDA model [157].
Table 9 Impressive LDA-based works in crime prediction

Study | Year | Purpose | Dataset
[121] | 2012 | Automatic semantic analysis of Twitter posts | A corpus of tweets from Twitter (manually collected)
[120] | 2014 | Crime prediction using tagged tweets | City of Chicago Data (data.cityofchicago.org)
[119] | 2015 | Detecting crime activity intention | 800 news articles from Yahoo Chinese news
Table 10 Some well-known tools for topic modeling

Tool | Implementation/Language | Inference/Parameter estimation | Source code availability
Mallet [159] | Java | Gibbs sampling | http://mallet.cs.umass.edu/topics.php
TMT [160] | Java | Gibbs sampling | https://nlp.stanford.edu/software/tmt/tmt-0.4/
Mr.LDA [71] | Java | Variational Bayesian inference | https://github.com/lintool/Mr.LDA
JGibbLDA [161] | Java | Gibbs sampling | http://jgibblda.sourceforge.net/
Gensim [162] | Python | Variational Bayesian inference | https://radimrehurek.com/gensim
TopicXP | Java (Eclipse plugin) | — | http://www.cs.wm.edu/semeru/TopicXP/
Matlab Topic Modeling Toolbox [163] | Matlab | Gibbs sampling | http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm
Yahoo LDA [165] | C++ | Gibbs sampling | https://github.com/shravanmn/Yahoo_LDA
lda in R [164] | R | Gibbs sampling | https://cran.r-project.org/web/packages/lda/
4 Open-source libraries, tools, datasets, and software packages for analysis
We need new tools to help us organize, search, and understand these vast amounts of information. In this section, we introduce some impressive datasets and tools for topic modeling.
4.1 Library and Tools
Many tools for topic modeling and analysis are available, including professional and amateur software, commercial software, and open source software; there are also many popular datasets that serve as standard sources for testing and evaluation. Table 10 shows some well-known tools for topic modeling, and Table 11 shows some well-known datasets. For example, the MALLET topic model package incorporates an extremely fast and highly scalable implementation of Gibbs sampling, along with efficient methods for document-topic hyperparameter optimization and for inferring topics in new documents given trained models. Topic models provide a simple way to analyze huge volumes of unlabeled text: a "topic" consists of a group of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings [158].
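As a toy illustration of the collapsed Gibbs sampling that packages such as MALLET implement at scale, the following minimal pure-NumPy sampler recovers two artificial word "themes" from a six-word vocabulary. Hyperparameters and data are illustrative; production tools add sparsity tricks and hyperparameter learning:

```python
import numpy as np

def lda_gibbs(docs, n_topics, n_vocab, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (illustrative, not optimized)."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))   # doc-topic counts
    nkw = np.zeros((n_topics, n_vocab))     # topic-word counts
    nk = np.zeros(n_topics)                 # topic totals
    z = []                                  # current topic of every token
    for d, doc in enumerate(docs):          # random initialization
        zs = rng.integers(n_topics, size=len(doc))
        z.append(zs)
        for w, t in zip(doc, zs):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                 # remove the token's assignment
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # full conditional P(z = t | rest)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_vocab * beta)
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t                 # resample and restore counts
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return ndk, nkw

# Two artificial themes: words 0-2 vs. words 3-5.
docs = [[0, 1, 2, 0, 1], [1, 2, 0, 2], [3, 4, 5, 3], [4, 5, 3, 5, 4]]
ndk, nkw = lda_gibbs(docs, n_topics=2, n_vocab=6)
print(ndk.argmax(axis=1))  # the first two docs typically share one topic, the last two the other
```

Each sweep removes one token's topic assignment, recomputes its full conditional from the remaining counts, and resamples — exactly the loop that Gibbs-based tools in Table 10 run over millions of tokens.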
For evaluation and testing, researchers have released many datasets in various subjects, sizes, and dimensions for public access and other
Table 11 Some well-known datasets for topic modeling

Dataset | Language | Published | Short detail | Availability
Reuters-21578 [166] | English | 1997 | Newsletters in various categories | http://kdd.ics.uci.edu/databases/reuters21578/reuters21578
Reuters RCV1 (Reuters Volume I) [167] | English | 2004 | Newsletters in various categories | —
UDI-TwitterCrawl-Aug2012 [168] | English | 2012 | A Twitter dataset of millions of tweets | https://wiki.illinois.edu//wiki/display/forward/Dataset-UDI-TwitterCrawl-Aug2012
SemEval-2013 Dataset [169] | English | 2013 | A Twitter dataset of millions of tweets | —
Wiki10+ [179] | English | 2009 | Wikipedia documents in various categories | http://nlp.uned.es/social-tagging/wiki10+/
Weibo dataset [170] | Chinese | 2013 | A popular Chinese microblogging network | —
Bag of Words | English | 2008 | A multi-dataset (PubMed abstracts, KOS blog, NYTimes news, NIPS full papers, Enron emails) | https://archive.ics.uci.edu/ml/datasets/Bag+of+Words
CiteULike [171] | English | 2011 | A bibliography-sharing service for academic papers | http://www.citeulike.org/faq/data.adp
DBLP Dataset [172] | English | — | A bibliographic database of computer science journals | https://hpi.de/naumann/projects/repeatability/datasets/dblp-dataset.html
HowNet lexicon | Chinese | 2000-2013 | A Chinese machine-readable dictionary / lexical knowledge base | http://www.keenage.com/html/e_index.html
Virastyar, Persian lexicon [173] | Persian | 2013 | Persian poems and electronic lexica | http://ganjoor.net/, http://www.virastyar.ir/data/
NIPS abstracts | English | 2016 | Word distributions from the full text of NIPS conference papers (1987 to 2015) | https://archive.ics.uci.edu/ml/datasets/NIPS+Conference+Papers+1987-2015
Ch-Wikipedia [114][174] | Chinese | — | A Chinese corpus from Chinese Wikipedia | —
Pascal VOC 2007 [175][176] | English | 2007 | A dataset of natural images | http://host.robots.ox.ac.uk/pascal/VOC/voc2007/
AFP ARB corpus [177] | Arabic | 2001 | A collection of newspaper articles in Arabic from Agence France-Presse | —
20Newsgroups corpus [178] | English | 2008 | Newsletters in various categories | http://qwone.com/~jason/20Newsgroups/
New York Times (NYT) dataset [179] | English | 2008 | Newsletters in various categories | —
future work. Due to the importance of this research, we examined the well-known datasets from previous work; Table 11 lists some famous and popular datasets in various languages.
5 Discussion: seven important challenges
There are open challenges and discussions that can be considered as future work in topic modeling. According to our studies, some issues require further research and could be very fruitful directions. In this section, we discuss seven important issues that, in our view, have not yet been sufficiently solved; these are the gaps in the reviewed work that point to directions for future research.
5.1 Topic modeling in image processing, image classification, and annotation
Image classification and annotation are important problems in computer vision, but they are rarely considered together, and predicting classes calls for an intelligent approach. For example, an image of the class "highway" is more likely to be annotated with the words "road", "traffic", and "car" than with "fish", "scuba", or "boat". The work in [180] developed a new probabilistic model for jointly modeling an image, its annotations, and its class label. Their model treats the class label as a global description of the image and the annotation terms as local descriptions of parts of the image; its underlying probabilistic assumptions naturally integrate these sources of information. They derived approximate inference and estimation algorithms based on variational methods, as well as efficient approximations for annotating and classifying new images, and extended
supervised topic modeling (sLDA) to classification problems.
Lienou et al. in [181] focused on the semantic interpretation of large satellite images using topic modeling, treating each segment of an image as a word and each image as a document. For evaluation, they performed experiments on panchromatic QuickBird images. Philbin et al. in [182] proposed a geometrically consistent latent topic model, geometric Latent Dirichlet Allocation (gLDA), to detect significant objects, and introduced efficient methods for computing a matching graph in which images are the nodes and edge strength reflects shared visual content. The gLDA method can group images of a specific object despite large imaging variations and can also pick out different views of a single object. The work in [183] introduced a semi-automatic approach to latent information retrieval based on the hierarchical structure of images, combining an LDA model with invariant descriptors of image regions for visual scene modeling. Wick et al. in [184] presented an error-correction algorithm using LDA-based topic modeling for optical character recognition (OCR). The algorithm combines two models: a topic model to calculate word probabilities and an OCR model to obtain the probability of character errors. In addition, topic models can be combined with matrix factorization methods for image understanding, tag assignment, and semantic discovery from social image datasets [185-186].
5.2 Audio and music information retrieval and processing
To our knowledge, little research has been done on music information analysis using topic modeling. For example, [187] proposed a modified version of LDA for processing continuous data and for audio retrieval. In this model, each audio document contains various latent topics, and each topic is modeled as a Gaussian distribution over the audio feature data. To evaluate the model's efficiency, they used 1,214 audio documents in various categories (such as rain, bell, river, laugh, gun, and dog). The authors in [188] focused on estimating singing characteristics from audio signals. The paper introduces topic modeling for vocal timbre analysis, in which each song is considered a weighted mixture of multiple topics. In this approach, vocal timbre features are first extracted from polyphonic music, and an LDA model then estimates the mixing weights of the topics. For evaluation, they used 36 songs by 12 Japanese singers.
5.3 Drug safety evaluation and approaches to improving it
Understanding drug safety and performance remains a critical and challenging issue for academia, and it is important in new drug discovery. Topic modeling holds potential for mining biological documents, and given the importance and magnitude of this issue, researchers can consider it for future work. Bisgin
et al. in [189] introduced an "in silico" framework for drug repositioning guided by a probabilistic graphical model, in which a drug is defined as a "document" and each phenotype of a drug as a "word". They applied their approach to the SIDER database to estimate the phenome distribution of drugs, identified 908 SIDER drugs with potential new indications, and demonstrated that the model can be effective for further investigation. Yu et al. in [191] investigated drug-induced acute liver failure (ALF), considering the role of topic modeling in drug safety evaluation; they explored the LiverTox database to discover drugs with the capacity to cause ALF. Yang et al. introduced an automatic approach based on keyphrase extraction to detect consumer health expressions related to adverse drug reactions (ADRs) in social media, using an LDA model for feature-space modeling to build a topic space over the consumer corpus and mine consumer health expressions [189-191].
5.4 Analysis of comments of famous personalities, social demographics
In public social media and micro-blogging services, most notably Twitter, people have found a venue to hear and be heard by their peers without an intermediary. As a consequence, and helped by the public nature of Twitter, political scientists now potentially have the means to evaluate and understand the narratives that organically form, spread, and decline among the public during a political campaign. As an impressive recent example, Wang et al. introduced a framework to derive the topic preferences of Donald Trump's followers on Twitter and used LDA to infer the weighted topic mixture of each Trump tweet. In addition, we can refer to [156, 192-194].
5.5 Group discovery and topic modeling
Graph mining and social network analysis in large graphs is a challenging problem. Group discovery has many applications, such as understanding the social structure of organizations, uncovering criminal organizations, and modeling large-scale social networks in the Internet community. LDA models can be an efficient method for discovering latent group structure in large networks. In [116], the authors proposed a scalable Bayesian alternative based on LDA and graphs for group discovery in a big real-world graph; for evaluation, they collected three datasets from PubMed. In [117], the authors introduced a generative approach using a hierarchical Bayes model for group discovery in social media analysis, called the Group Latent Anomaly Detection (GLAD) model. This model merges ideas from both the LDA model and the Mixed Membership Stochastic Blockmodel (MMSB).
5.6 User Behavior Modeling
Social media provides valuable resources for analyzing user behaviors and capturing user preferences. Since user-generated data (such as user activities and interests) in social media is challenging to model [195-196], topic modeling techniques such as LDA can play an important role in discovering hidden structures related to user behavior. Although some topic modeling approaches have been proposed for user behavior modeling, many open questions and challenges remain. For example, Giri et al. in [197] introduced a novel unsupervised topic model for discovering users' hidden interests and analyzing their browsing behavior in a cellular network, which can be very effective for mobile advertising and online recommendation systems. In addition, we can refer to [198-200].
5.7 Visualizing topic models
Although different approaches, including machine learning, have been investigated to support the visualization of text in large document collections, visualizing big text data remains an open challenge in text mining. Among the few studies in this direction [201-205], Chuang et al. in [205] proposed a topic tool, called Termite, based on a novel visualization technique for evaluating textual topics in topic modeling. The tool visualizes the topic-term distributions of LDA in a matrix layout. The authors used two measures for identifying the useful terms of a topic model, "saliency" and "distinctiveness", obtained from the Kullback-Leibler divergence between the topic distribution given a term and the marginal topic distribution. The tool can improve the interpretability of topical results and produce legible output.
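The two Termite measures can be computed directly from a fitted model: distinctiveness of a word w is KL(P(T|w) || P(T)), and saliency weights it by P(w). Below is a minimal sketch with hypothetical topic-word probabilities (the numbers are illustrative, not from any real corpus):

```python
import numpy as np

# Hypothetical P(w|t) for 2 topics over a 4-word vocabulary, equal topic weights.
phi = np.array([[0.50, 0.30, 0.10, 0.10],
                [0.05, 0.30, 0.15, 0.50]])
p_topic = np.array([0.5, 0.5])

p_word = p_topic @ phi                                # marginal P(w)
p_topic_given_w = (phi * p_topic[:, None]) / p_word   # Bayes: P(t|w)

def distinctiveness(w):
    """KL( P(T|w) || P(T) ): how informative word w is about the topics."""
    p = p_topic_given_w[:, w]
    return float(np.sum(p * np.log(p / p_topic)))

saliency = [p_word[w] * distinctiveness(w) for w in range(phi.shape[1])]

# Word 1 appears with equal probability in both topics, so P(T|w1) = P(T)
# and its distinctiveness is exactly zero — the least useful term to display.
print(min(range(4), key=distinctiveness))  # 1
```

Ranking and filtering the vocabulary by saliency is what lets a matrix-layout tool like Termite show only the terms that actually discriminate between topics.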
6 Conclusion
Topic models play an important role in computer science for text mining. In topic modeling, a topic is a list of words that occur together in statistically meaningful ways. A text can be an email, a book chapter, a blog post, a journal article, or any other kind of unstructured text. Topic models do not understand the meaning of the words in text documents. Instead, they assume that every piece of text is composed by selecting words from probable baskets of words, where each basket corresponds to a topic. The algorithm repeats this process over and over until it settles on the most probable assignment of words into baskets, which we call topics. Topic modeling can provide a useful view of a large collection in terms of the collection as a whole, the individual documents, and the relationships between the documents. In this paper, we investigated highly cited scholarly articles (between 2003 and 2016) related to topic modeling based on LDA in various sciences. Given the importance of the research,
we believe this paper can serve as a significant source and a good starting point for researchers and future work on text mining with LDA-based topic modeling.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (61170035, 61272420, 81674099, 61502233), the Fundamental Research Funds for the Central Universities (30916011328, 30918015103), and the Nanjing Science and Technology Development Plan Project (201805036).
References
1. Blei, D.M., A.Y. Ng, and M.I. Jordan, Latent Dirichlet allocation. Journal of Machine Learning Research, 2003. 3(Jan): p. 993-1022.
2. Sun, S., C. Luo, and J. Chen, A review of natural language processing techniques for opinion mining systems. Information Fusion, 2017. 36: p. 10-25.
3. Xu, Z., et al., Crowdsourcing based social media data analysis of urban emergency events. Multimedia Tools and Applications, 2017. 76(9): p. 11567-11584.
4. Zhang, Y., et al., iDoctor: Personalized and professionalized medical recommendations based on hybrid matrix factorization. Future Generation Computer Systems, 2017. 66: p. 30-35.
5. Jiang, Z., et al. Using link topic model to analyze traditional Chinese medicine clinical symptom-herb regularities. in e-Health Networking, Applications and Services (Healthcom), 2012 IEEE 14th International Conference on. 2012. IEEE.
6. Paul, M.J. and M. Dredze, You are what you Tweet: Analyzing Twitter for public health. ICWSM, 2011. 20: p. 265-272.
7. Wu, Y., et al. Ranking gene-drug relationships in biomedical literature using latent Dirichlet allocation. in Pacific Symposium on Biocomputing. 2012. NIH Public Access.
8. Linstead, E., et al. Mining concepts from code with probabilistic topic models. in Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering. 2007. ACM.
9. Gethers, M. and D. Poshyvanyk. Using relational topic models to capture coupling among classes in object-oriented software systems. in Software Maintenance (ICSM), 2010 IEEE International Conference on. 2010. IEEE.
10. Asuncion, H.U., A.U. Asuncion, and R.N. Taylor. Software traceability with topic modeling. in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1. 2010. ACM.
11. Thomas, S.W. Mining software repositories using topic models. in Proceedings of the 33rd International Conference on Software Engineering. 2011. ACM.
12. Thomas, S.W., et al. Modeling the evolution of topics in source code histories. in Proceedings of the 8th working conference on mining software repositories. 2011. ACM.
13. Cristani, M., et al. Geo-located image analysis using latent representations. in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. 2008. IEEE.
14. Eisenstein, J., et al. A latent variable model for geographic lexical variation. in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. 2010. Association for Computational Linguistics.
15. Tang, H., et al., A multiscale latent Dirichlet allocation model for object-oriented clustering of VHR panchromatic satellite images. IEEE Transactions on Geoscience and Remote Sensing, 2013. 51(3): p. 1680-1692.
16. Yin, Z., et al. Geographical topic discovery and comparison. in Proceedings of the 20th international conference on World Wide Web. 2011. ACM.
17. Sizov, S. Geofolk: latent spatial semantics in web 2.0 social media. in Proceedings of the third ACM international conference on Web search and data mining. 2010. ACM.
18. Chen, B., et al. What Is an Opinion About? Exploring Political Standpoints Using Opinion ScoringModel. in AAAI. 2010.
19. Cohen, R. and D. Ruths. Classifying political orientation on Twitter: It’s not easy! in ICWSM. 2013.20. Greene, D. and J.P. Cross. Unveiling the Political Agenda of the European Parliament Plenary: A
Topical Analysis. in Proceedings of the ACM Web Science Conference. 2015. ACM.21. Fang, Y., et al. Mining contrastive opinions on political texts using cross-perspective topic model. in
Proceedings of the fifth ACM international conference on Web search and data mining. 2012. ACM.22. Tian, K., M. Revelle, and D. Poshyvanyk. Using latent dirichlet allocation for automatic categoriza-
tion of software. in Mining Software Repositories, 2009. MSR’09. 6th IEEE International WorkingConference on. 2009. IEEE.
23. Lukins, S.K., N.A. Kraft, and L.H. Etzkorn. Source code retrieval for bug localization using latent dirichlet allocation. in Reverse Engineering, 2008. WCRE’08. 15th Working Conference on. 2008. IEEE.
24. Lukins, S.K., N.A. Kraft, and L.H. Etzkorn, Bug localization using latent dirichlet allocation. Information and Software Technology, 2010. 52(9): p. 972-990.
25. Linstead, E., C. Lopes, and P. Baldi. An application of latent Dirichlet allocation to analyzing software evolution. in Machine Learning and Applications, 2008. ICMLA’08. Seventh International Conference on. 2008. IEEE.
26. Chen, T.-H., et al. Explaining software defects using topic models. in Mining Software Repositories (MSR), 2012 9th IEEE Working Conference on. 2012. IEEE.
27. Savage, T., et al. Topic XP: Exploring topics in source code using Latent Dirichlet Allocation. in Software Maintenance (ICSM), 2010 IEEE International Conference on. 2010. IEEE.
28. Zheng, X., et al., Incorporating appraisal expression patterns into topic modeling for aspect and sentiment word identification. Knowledge-Based Systems, 2014. 61: p. 29-47.
29. Cheng, V.C., et al., Probabilistic aspect mining model for drug reviews. IEEE Transactions on Knowledge and Data Engineering, 2014. 26(8): p. 2002-2013.
30. Zhai, Z., et al., Constrained LDA for grouping product features in opinion mining. Advances in Knowledge Discovery and Data Mining, 2011: p. 448-459.
31. Bagheri, A., M. Saraee, and F. De Jong, ADM-LDA: An aspect detection model based on topic modelling using the structure of review sentences. Journal of Information Science, 2014. 40(5): p. 621-636.
32. Wang, T., et al., Product aspect extraction supervised with online domain knowledge. Knowledge-Based Systems, 2014. 71: p. 86-100.
33. Xianghua, F., et al., Multi-aspect sentiment analysis for Chinese online social reviews based on topic modeling and HowNet lexicon. Knowledge-Based Systems, 2013. 37: p. 186-195.
34. Jo, Y. and A.H. Oh. Aspect and sentiment unification model for online review analysis. in Proceedings of the fourth ACM international conference on Web search and data mining. 2011. ACM.
35. Paul, M. and R. Girju, A two-dimensional topic-aspect model for discovering multi-faceted topics. Urbana, 2010. 51(61801): p. 36.
36. Titov, I. and R. McDonald. Modeling online reviews with multi-grain topic models. in Proceedings of the 17th international conference on World Wide Web. 2008. ACM.
37. Qian, S., et al., Multi-modal event topic model for social event analysis. IEEE Transactions on Multimedia, 2016. 18(2): p. 233-246.
38. Hu, Y., et al. ET-LDA: Joint Topic Modeling for Aligning Events and their Twitter Feedback. in AAAI. 2012.
39. Weng, J. and B.-S. Lee, Event detection in twitter. ICWSM, 2011. 11: p. 401-408.
40. Lin, C.X., et al. PET: a statistical model for popular events tracking in social communities. in Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. 2010. ACM.
41. Wang, Y. and G. Mori. Max-margin Latent Dirichlet Allocation for Image Classification and Annotation. in BMVC. 2011.
42. Zoghbi, S., I. Vulic, and M.-F. Moens, Latent Dirichlet allocation for linking user-generated content and e-commerce data. Information Sciences, 2016. 367: p. 573-599.
43. Cheng, Z. and J. Shen, On effective location-aware music recommendation. ACM Transactions on Information Systems (TOIS), 2016. 34(2): p. 13.
44. Zhao, F., et al., A personalized hashtag recommendation approach using LDA-based topic model in microblog environment. Future Generation Computer Systems, 2016. 65: p. 196-206.
45. Lu, H.-M. and C.-H. Lee, The Topic-Over-Time Mixed Membership Model (TOT-MMM): A Twitter Hashtag Recommendation Model that Accommodates for Temporal Clustering Effects. IEEE Intelligent Systems, 2015(1): p. 1-1.
46. Wang, J., et al., Image tag refinement by regularized latent Dirichlet allocation. Computer Vision and Image Understanding, 2014. 124: p. 61-70.
47. Yang, M.-C. and H.-C. Rim, Identifying interesting Twitter contents using topical analysis. Expert Systems with Applications, 2014. 41(9): p. 4330-4336.
48. Kim, Y. and K. Shim, TWILITE: A recommendation system for Twitter using a probabilistic model based on latent Dirichlet allocation. Information Systems, 2014. 42: p. 59-77.
49. Roberts, K., et al. EmpaTweet: Annotating and Detecting Emotions on Twitter. in LREC. 2012.
50. Rao, Y., Contextual sentiment topic model for adaptive social emotion classification. IEEE Intelligent Systems, 2016. 31(1): p. 41-47.
51. Rao, Y., et al., Building emotional dictionary for sentiment analysis of online news. World Wide Web, 2014. 17(4): p. 723-742.
52. Chen, T.-H., S.W. Thomas, and A.E. Hassan, A survey on the use of topic models when mining software repositories. Empirical Software Engineering, 2016. 21(5): p. 1843-1919.
53. Daud, A., et al., Knowledge discovery through directed probabilistic topic models: a survey. Frontiers of Computer Science in China, 2010. 4(2): p. 280-301.
54. Liu, B. and L. Zhang, A survey of opinion mining and sentiment analysis, in Mining Text Data. 2012, Springer. p. 415-463.
55. Debortoli, S., et al., Text mining for information systems researchers: an annotated topic modeling tutorial. CAIS, 2016. 39: p. 7.
56. Sun, X., et al. Exploring topic models in software engineering data analysis: A survey. in 2016 17th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD). 2016. IEEE.
57. Minka, T. and J. Lafferty. Expectation-propagation for the generative aspect model. in Proceedings of the Eighteenth conference on Uncertainty in artificial intelligence. 2002. Morgan Kaufmann Publishers Inc.
58. Griffiths, T.L. and M. Steyvers, Finding scientific topics. Proceedings of the National Academy of Sciences, 2004. 101(suppl 1): p. 5228-5235.
59. Xie, W., et al., Topicsketch: Real-time bursty topic detection from twitter. IEEE Transactions on Knowledge and Data Engineering, 2016. 28(8): p. 2216-2229.
60. Lu, H.-M., C.-P. Wei, and F.-Y. Hsiao, Modeling healthcare data using multiple-channel latent Dirichlet allocation. Journal of Biomedical Informatics, 2016. 60: p. 210-223.
61. Yeh, J.-F., Y.-S. Tan, and C.-H. Lee, Topic detection and tracking for conversational content by using conceptual dynamic latent Dirichlet allocation. Neurocomputing, 2016. 216: p. 310-318.
62. Miao, J., J.X. Huang, and J. Zhao, TopPRF: A probabilistic framework for integrating topic space into pseudo relevance feedback. ACM Transactions on Information Systems (TOIS), 2016. 34(4): p. 22.
63. Panichella, A., et al. How to effectively use topic models for software engineering tasks? an approach based on genetic algorithms. in Proceedings of the 2013 International Conference on Software Engineering. 2013. IEEE Press.
64. Zhao, W.X., et al. Comparing twitter and traditional media using topic models. in European Conference on Information Retrieval. 2011. Springer.
65. Jagarlamudi, J. and H. Daume III. Extracting Multilingual Topics from Unaligned Comparable Corpora. in ECIR. 2010. Springer.
66. Ramage, D., et al. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1. 2009. Association for Computational Linguistics.
67. Zhu, J., A. Ahmed, and E.P. Xing. MedLDA: maximum margin supervised topic models for regression and classification. in Proceedings of the 26th annual international conference on machine learning. 2009. ACM.
68. Guo, J., et al. Named entity recognition in query. in Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. 2009. ACM.
69. Chang, J. and D.M. Blei. Relational topic models for document networks. in International conference on artificial intelligence and statistics. 2009.
70. Blei, D.M. and M.I. Jordan. Modeling annotated data. in Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. 2003. ACM.
71. Zhai, K., et al. Mr. LDA: A flexible large scale topic modeling package using variational inference in MapReduce. in Proceedings of the 21st international conference on World Wide Web. 2012. ACM.
72. Chien, J.-T. and C.-H. Chueh, Dirichlet class language models for speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 2011. 19(3): p. 482-495.
73. Blei, D.M. and J.D. Lafferty. Dynamic topic models. in Proceedings of the 23rd international conference on Machine learning. 2006. ACM.
74. Ramage, D., C.D. Manning, and S. Dumais. Partially labeled topic models for interpretable text mining. in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. 2011. ACM.
75. Rosen-Zvi, M., et al. The author-topic model for authors and documents. in Proceedings of the 20th conference on Uncertainty in artificial intelligence. 2004. AUAI Press.
76. McCallum, A., A. Corrada-Emmanuel, and X. Wang, Topic and role discovery in social networks. Computer Science Department Faculty Publication Series, 2005: p. 3.
77. Wang, X. and A. McCallum. Topics over time: a non-Markov continuous-time model of topical trends. in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 2006. ACM.
78. Li, W. and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. in Proceedings of the 23rd international conference on Machine learning. 2006. ACM.
79. Zhang, H., et al. Probabilistic community discovery using hierarchical latent gaussian mixture model. in AAAI. 2007.
80. AlSumait, L., D. Barbara, and C. Domeniconi. On-line lda: Adaptive topic models for mining text streams with applications to topic detection and tracking. in Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on. 2008. IEEE.
81. Wang, C. and D.M. Blei. Decoupling sparsity and smoothness in the discrete hierarchical dirichlet process. in Advances in neural information processing systems. 2009.
82. Lacoste-Julien, S., F. Sha, and M.I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. in Advances in neural information processing systems. 2009.
83. Wallach, H.M., D.M. Mimno, and A. McCallum. Rethinking LDA: Why priors matter. in Advances in neural information processing systems. 2009.
84. Millar, J.R., G.L. Peterson, and M.J. Mendenhall. Document Clustering and Visualization with Latent Dirichlet Allocation and Self-Organizing Maps. in FLAIRS Conference. 2009.
85. Wang, H., et al., Finding complex biological relationships in recent PubMed articles using Bio-LDA. PloS One, 2011. 6(3): p. e17243.
86. Li, F., M. Huang, and X. Zhu. Sentiment Analysis with Global Topics and Local Dependency. in AAAI. 2010.
87. Liu, Z., et al., Plda+: Parallel latent dirichlet allocation with data placement and pipeline processing. ACM Transactions on Intelligent Systems and Technology (TIST), 2011. 2(3): p. 26.
88. Yoshii, K. and M. Goto, A nonparametric Bayesian multipitch analyzer based on infinite latent harmonic allocation. IEEE Transactions on Audio, Speech, and Language Processing, 2012. 20(3): p. 717-730.
89. Li, J., C. Cardie, and S. Li. TopicSpam: a Topic-Model based approach for spam detection. in ACL (2). 2013.
90. Wu, H., et al., Locally discriminative topic modeling. Pattern Recognition, 2012. 45(1): p. 617-625.
91. Tan, S., et al., Interpreting the public sentiment variations on twitter. IEEE Transactions on Knowledge and Data Engineering, 2014. 26(5): p. 1158-1170.
92. Paul, M. and M. Dredze. Factorial LDA: Sparse multi-dimensional text models. in Advances in Neural Information Processing Systems. 2012.
93. Mao, X.-L., et al. SSHLDA: a semi-supervised hierarchical topic model. in Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning. 2012. Association for Computational Linguistics.
94. Choo, J., et al., Utopian: User-driven topic modeling based on interactive nonnegative matrix factorization. IEEE Transactions on Visualization and Computer Graphics, 2013. 19(12): p. 1992-2001.
95. Chen, L., et al. WT-LDA: user tagging augmented LDA for web service clustering. in International Conference on Service-Oriented Computing. 2013. Springer.
96. Cheng, X., et al., Btm: Topic modeling over short texts. IEEE Transactions on Knowledge and Data Engineering, 2014(1): p. 1-1.
97. Yan, X., et al. A biterm topic model for short texts. in Proceedings of the 22nd international conference on World Wide Web. 2013. ACM.
98. Cohen, R., et al., Redundancy-aware topic modeling for patient record notes. PloS One, 2014. 9(2): p. e87555.
99. Xie, P., D. Yang, and E.P. Xing. Incorporating Word Correlation Knowledge into Topic Modeling. in HLT-NAACL. 2015.
100. Yuan, J., et al. Lightlda: Big topic models on modest computer clusters. in Proceedings of the 24th International Conference on World Wide Web. 2015. International World Wide Web Conferences Steering Committee.
101. Nguyen, D.Q., et al., Improving topic models with latent feature word representations. Transactions of the Association for Computational Linguistics, 2015. 3: p. 299-313.
102. Li, C., et al., The author-topic-community model for author interest profiling and community discovery. Knowledge and Information Systems, 2015. 44(2): p. 359-383.
103. Yu, X., J. Yang, and Z.-Q. Xie, A semantic overlapping community detection algorithm based on field sampling. Expert Systems with Applications, 2015. 42(1): p. 366-375.
104. Jiang, D., et al., SG-WSTD: A framework for scalable geographic web search topic discovery. Knowledge-Based Systems, 2015. 84: p. 18-33.
105. Fu, X., et al., Dynamic non-parametric joint sentiment topic mixture model. Knowledge-Based Systems, 2015. 82: p. 102-114.
106. Li, X., J. Ouyang, and X. Zhou, Supervised topic models for multi-label classification. Neurocomputing, 2015. 149: p. 811-819.
107. Hong, L., E. Frias-Martinez, and V. Frias-Martinez. Topic Models to Infer Socio-Economic Maps. in AAAI. 2016.
108. Lee, S., et al., LARGen: automatic signature generation for Malwares using latent Dirichlet allocation. IEEE Transactions on Dependable and Secure Computing, 2016.
109. Liu, Y., J. Wang, and Y. Jiang, PT-LDA: A latent variable model to predict personality traits of social network users. Neurocomputing, 2016. 210: p. 155-163.
110. Li, C., et al., Hierarchical Bayesian nonparametric models for knowledge discovery from electronic medical records. Knowledge-Based Systems, 2016. 99: p. 168-182.
111. Fu, X., et al., Dynamic online HDP model for discovering evolutionary topics from Chinese social texts. Neurocomputing, 2016. 171: p. 412-424.
112. Zeng, J., Z.-Q. Liu, and X.-Q. Cao, Fast online EM for big topic modeling. IEEE Transactions on Knowledge and Data Engineering, 2016. 28(3): p. 675-688.
113. Alam, M.H., W.-J. Ryu, and S. Lee, Joint multi-grain topic sentiment: modeling semantic aspects for online reviews. Information Sciences, 2016. 339: p. 206-223.
114. Qin, Z., Y. Cong, and T. Wan, Topic modeling of Chinese language beyond a bag-of-words. Computer Speech and Language, 2016. 40: p. 60-78.
115. Wang, Y.-C., M. Burke, and R.E. Kraut. Gender, topic, and audience response: an analysis of user-generated content on facebook. in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2013. ACM.
116. Henderson, K. and T. Eliassi-Rad. Applying latent dirichlet allocation to group discovery in large graphs. in Proceedings of the 2009 ACM symposium on Applied Computing. 2009. ACM.
117. Yu, R., X. He, and Y. Liu, Glad: group anomaly detection in social media analysis. ACM Transactions on Knowledge Discovery from Data (TKDD), 2015. 10(2): p. 18.
118. Liu, Y., et al. Fortune Teller: Predicting Your Career Path. in AAAI. 2016.
119. Chen, S.-H., et al. Latent dirichlet allocation based blog analysis for criminal intention detection system. in Security Technology (ICCST), 2015 International Carnahan Conference on. 2015. IEEE.
120. Gerber, M.S., Predicting crime using Twitter and kernel density estimation. Decision Support Systems, 2014. 61: p. 115-125.
121. Wang, X., M.S. Gerber, and D.E. Brown, Automatic Crime Prediction Using Events Extracted from Twitter Posts. SBP, 2012. 12: p. 231-238.
122. Preotiuc-Pietro, D., et al. Beyond binary labels: political ideology prediction of twitter users. in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017.
123. Liu, B., et al., Identifying functional miRNA-mRNA regulatory modules with correspondence latent dirichlet allocation. Bioinformatics, 2010. 26(24): p. 3105-3111.
124. Huang, Z., X. Lu, and H. Duan, Latent treatment pattern discovery for clinical processes. Journal of Medical Systems, 2013. 37(2): p. 9915.
125. Xiao, C., et al. Adverse Drug Reaction Prediction with Symbolic Latent Dirichlet Allocation. in AAAI. 2017.
126. Bauer, S., et al. Talking places: Modelling and analysing linguistic content in foursquare. in Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom). 2012. IEEE.
127. McFarland, D.A., et al., Differentiating language usage through topic models. Poetics, 2013. 41(6): p. 607-625.
128. Eidelman, V., J. Boyd-Graber, and P. Resnik. Topic models for dynamic translation model adaptation. in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2. 2012. Association for Computational Linguistics.
129. Wilson, A.T. and P.A. Chew. Term weighting schemes for latent dirichlet allocation. in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. 2010. Association for Computational Linguistics.
130. Vulic, I., W. De Smet, and M.-F. Moens. Identifying word translations from comparable corpora using latent topic models. in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers-Volume 2. 2011. Association for Computational Linguistics.
131. Lui, M., J.H. Lau, and T. Baldwin, Automatic detection and language identification of multilingual documents. Transactions of the Association for Computational Linguistics, 2014. 2: p. 27-40.
132. Heintz, I., et al. Automatic extraction of linguistic metaphor with lda topic modeling. in Proceedings of the First Workshop on Metaphor in NLP. 2013.
133. Balasubramanyan, R., et al. Modeling polarizing topics: When do different political communities respond differently to the same news? in ICWSM. 2012.
134. Song, M., M.C. Kim, and Y.K. Jeong, Analyzing the political landscape of 2012 korean presidential election in twitter. IEEE Intelligent Systems, 2014. 29(2): p. 18-26.
135. Levy, K.E. and M. Franklin, Driving regulation: using topic models to examine political contention in the US trucking industry. Social Science Computer Review, 2014. 32(2): p. 182-194.
136. Zirn, C. and H. Stuckenschmidt, Multidimensional topic analysis in political texts. Data and Knowledge Engineering, 2014. 90: p. 38-53.
137. Yano, T., W.W. Cohen, and N.A. Smith. Predicting response to political blog posts with topic models. in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. 2009. Association for Computational Linguistics.
138. Yano, T. and N.A. Smith. What’s Worthy of Comment? Content and Comment Volume in Political Blogs. in ICWSM. 2010.
139. Madan, A., et al. Pervasive sensing to model political opinions in face-to-face networks. in International Conference on Pervasive Computing. 2011. Springer.
140. Zhang, X.-p., et al., Topic model for chinese medicine diagnosis and prescription regularities analysis: case on diabetes. Chinese Journal of Integrative Medicine, 2011. 17(4): p. 307-313.
141. Zhang, L., X. Sun, and H. Zhuge, Topic discovery of clusters from documents with geographical location. Concurrency and Computation: Practice and Experience, 2015. 27(15): p. 4015-4038.
142. McInerney, J. and D.M. Blei. Discovering newsworthy tweets with a geographical topic model. in NewsKDD: Data Science for News Publishing workshop, in conjunction with KDD 2014, the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2014.
143. Li, Z., et al., Enhancing news organization for convenient retrieval and browsing. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2013. 10(1): p. 1.
144. Yang, X., et al., Characterizing malicious Android apps by mining topic-specific data flow signatures. Information and Software Technology, 2017.
145. Weng, J., et al. Twitterrank: finding topic-sensitive influential twitterers. in Proceedings of the third ACM international conference on Web search and data mining. 2010. ACM.
146. Hong, L., O. Dan, and B.D. Davison. Predicting popular messages in twitter. in Proceedings of the 20th international conference companion on World Wide Web. 2011. ACM.
147. Bhattacharya, P., et al. Inferring user interests in the twitter social network. in Proceedings of the 8th ACM Conference on Recommender systems. 2014. ACM.
148. Cordeiro, M. Twitter event detection: combining wavelet analysis and topic inference summarization. in Doctoral symposium on informatics engineering. 2012.
149. Ren, Y., R. Wang, and D. Ji, A topic-enhanced word embedding for Twitter sentiment classification. Information Sciences, 2016. 369: p. 188-198.
150. Li, Y., et al., Design and implementation of Weibo sentiment analysis based on LDA and dependency parsing. China Communications, 2016. 13(11): p. 91-105.
151. Li, Z., et al., Multimedia news summarization in search. ACM Transactions on Intelligent Systems and Technology (TIST), 2016. 7(3): p. 33.
152. Godin, F., et al. Using topic models for twitter hashtag recommendation. in Proceedings of the 22nd International Conference on World Wide Web. 2013. ACM.
153. Lin, J., et al. Addressing cold-start in app recommendation: latent user models constructed from twitter followers. in Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. 2013. ACM.
154. Prier, K.W., et al. Identifying health-related topics on twitter. in International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction. 2011. Springer.
155. Srijith, P., et al., Sub-story detection in Twitter with hierarchical Dirichlet processes. Information Processing & Management, 2017. 53(4): p. 989-1003.
156. Wang, Y., et al. Catching Fire via “Likes”: Inferring Topic Preferences of Trump Followers on Twitter. in ICWSM. 2016.
157. Sharma, V., et al. Analyzing Newspaper Crime Reports for Identification of Safe Transit Paths. in HLT-NAACL. 2015.
158. Steyvers, M. and T. Griffiths, Probabilistic topic models. Handbook of Latent Semantic Analysis, 2007. 427(7): p. 424-440.
159. McCallum, A.K., Mallet: A machine learning for language toolkit. 2002.
160. Ramage, D. and E. Rosen, Stanford topic modeling toolbox. 2011, Dec.
161. Phan, X.-H. and C.-T. Nguyen, Jgibblda: A java implementation of latent dirichlet allocation (lda) using gibbs sampling for parameter estimation and inference. 2006.
162. Rehurek, R. and P. Sojka, Gensim-Statistical Semantics in Python. 2011.
163. Steyvers, M. and T. Griffiths, Matlab topic modeling toolbox 1.4. URL http://psiexp.ss.uci.edu/research/programs data/toolbox.htm, 2011.
164. Chang, J., lda: Collapsed Gibbs sampling methods for topic models. 2011, R.
165. Ahmed, A., et al. Scalable inference in latent variable models. in Proceedings of the fifth ACM international conference on Web search and data mining. 2012. ACM.
166. Lewis, D.D., Reuters-21578 text categorization collection. 1997.
167. Lewis, D.D., et al., Rcv1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 2004. 5(Apr): p. 361-397.
168. Li, R., et al. Towards social user profiling: unified and discriminative influence model for inferring home locations. in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. 2012. ACM.
169. Manandhar, S. and D. Yuret, Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013). 2013.
170. Zhang, J., et al. Social Influence Locality for Modeling Retweeting Behaviors. in IJCAI. 2013.
171. Wang, C. and D.M. Blei. Collaborative topic modeling for recommending scientific articles. in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. 2011. ACM.
172. Lange, D. and F. Naumann. Frequency-aware similarity measures: why Arnold Schwarzenegger is always a duplicate. in Proceedings of the 20th ACM international conference on Information and knowledge management. 2011. ACM.
173. Asgari, E. and J.-C. Chappelier. Linguistic Resources and Topic Models for the Analysis of Persian Poems. in CLfL@ NAACL-HLT. 2013.
174. Cong, Y., et al. Cross-Modal Information Retrieval-A Case Study on Chinese Wikipedia. in International Conference on Advanced Data Mining and Applications. 2012. Springer.
175. Everingham, M., et al., The pascal visual object classes challenge 2007 (voc 2007) results (2007). 2008.
176. Everingham, M., et al., The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 2010. 88(2): p. 303-338.
177. Larkey, L.S. and M.E. Connell. Arabic Information Retrieval at UMass in TREC-10. in TREC. 2001.
178. Rennie, J., The 20 Newsgroups data set. 2017, http.
179. Sandhaus, E., The New York Times Annotated Corpus. Philadelphia, PA: Linguistic Data Consortium. 2008.
180. Chong, W., D. Blei, and F.-F. Li. Simultaneous image classification and annotation. in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. 2009. IEEE.
181. Lienou, M., H. Maitre, and M. Datcu, Semantic annotation of satellite images using latent Dirichlet allocation. IEEE Geoscience and Remote Sensing Letters, 2010. 7(1): p. 28-32.
182. Philbin, J., J. Sivic, and A. Zisserman, Geometric latent dirichlet allocation on a matching graph for large-scale image datasets. International Journal of Computer Vision, 2011. 95(2): p. 138-153.
183. Vaduva, C., I. Gavat, and M. Datcu, Latent Dirichlet allocation for spatial analysis of satellite images. IEEE Transactions on Geoscience and Remote Sensing, 2013. 51(5): p. 2770-2786.
184. Wick, M., M. Ross, and E. Learned-Miller. Context-sensitive error correction: Using topic models to improve OCR. in Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on. 2007. IEEE.
185. Li, Z., J. Tang, and T. Mei, Deep collaborative embedding for social image understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
186. Li, Z. and J. Tang, Weakly supervised deep matrix factorization for social image understanding. IEEE Transactions on Image Processing, 2017. 26(1): p. 276-288.
187. Hu, P., et al., Latent topic model for audio retrieval. Pattern Recognition, 2014. 47(3): p. 1138-1143.
188. Nakano, T., K. Yoshii, and M. Goto. Vocal timbre analysis using latent Dirichlet allocation and cross-gender vocal timbre similarity. in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. 2014. IEEE.
189. Bisgin, H., et al., A phenome-guided drug repositioning through a latent variable model. BMC Bioinformatics, 2014. 15(1): p. 267.
190. Yang, M. and M. Kiang. Extracting Consumer Health Expressions of Drug Safety from Web Forum. in System Sciences (HICSS), 2015 48th Hawaii International Conference on. 2015. IEEE.
191. Yu, K., et al., Mining hidden knowledge for drug safety assessment: topic modeling of LiverTox as a case study. BMC Bioinformatics, 2014. 15(17): p. S6.
192. Alashri, S., et al. An analysis of sentiments on facebook during the 2016 US presidential election. in Advances in Social Networks Analysis and Mining (ASONAM), 2016 IEEE/ACM International Conference on. 2016. IEEE.
193. Shi, B., et al. Detecting Common Discussion Topics Across Culture From News Reader Comments. in ACL (1). 2016.
194. Hou, L., et al., Newsminer: Multifaceted news analysis for event search. Knowledge-Based Systems, 2015. 76: p. 17-29.
195. Diao, Q., et al. Finding bursty topics from microblogs. in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. 2012. Association for Computational Linguistics.
196. Yin, H., et al. A temporal context-aware model for user behavior modeling in social media systems. in Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 2014. ACM.
197. Giri, R., et al. User behavior modeling in a cellular network using latent dirichlet allocation. in International Conference on Intelligent Data Engineering and Automated Learning. 2014. Springer.
198. Wang, S., et al. Cross media topic analytics based on synergetic content and user behavior modeling. in Multimedia and Expo (ICME), 2014 IEEE International Conference on. 2014. IEEE.
199. Yuan, B., et al. Mobile Web User Behavior Modeling. in International Conference on Web Information Systems Engineering. 2014. Springer.
200. Siersdorfer, S., et al., Analyzing and mining comments and comment ratings on the social web. ACM Transactions on the Web (TWEB), 2014. 8(3): p. 17.
201. Chaney, A.J.-B. and D.M. Blei. Visualizing Topic Models. in ICWSM. 2012.
202. Kim, M., et al., Topiclens: Efficient multi-level visual topic exploration of large-scale document collections. IEEE Transactions on Visualization and Computer Graphics, 2017. 23(1): p. 151-160.
203. Murdock, J. and C. Allen. Visualization Techniques for Topic Model Checking. in AAAI. 2015.
204. Gretarsson, B., et al., Topicnets: Visual analysis of large text corpora with topic modeling. ACM Transactions on Intelligent Systems and Technology (TIST), 2012. 3(2): p. 23.
205. Chuang, J., C.D. Manning, and J. Heer. Termite: Visualization techniques for assessing textual topic models. in Proceedings of the international working conference on advanced visual interfaces. 2012. ACM.