arXiv:1711.04305v2 [cs.IR] 6 Dec 2018
Latent Dirichlet Allocation (LDA) and Topic modeling:models, applications, a survey
Hamed Jelodar · Yongli Wang · Chi Yuan ·Xia Feng · Xiahui Jiang · Yanchao Li · LiangZhao
Abstract Topic modeling is one of the most powerful techniques in text mining for data mining, latent data discovery, and finding relationships among data and text documents. Researchers have published many articles on topic modeling and applied it in various fields such as software engineering, political science, medical science, and linguistics. Among the many methods for topic modeling, Latent Dirichlet Allocation (LDA) is one of the most popular, and researchers have proposed numerous models based on it. This paper therefore serves as a useful introduction to LDA-based approaches in topic modeling. We investigate highly cited scholarly articles (published between 2003 and 2016) related to LDA-based topic modeling in order to trace the research development, current trends, and intellectual structure of the field. In addition, we summarize open challenges and introduce well-known tools and datasets for LDA-based topic modeling.
Keywords Topic modeling · Latent Dirichlet Allocation · tag recommendation · Semantic web · Gibbs sampling
1 Introduction
Natural language processing (NLP) is a challenging research area in computer science concerned with information management, semantic mining, and enabling computers to derive meaning from human language in text documents. Topic modeling methods are
H. Jelodar, School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing 210094, China. Tel.: +8615298394479. E-mail: [email protected]
Y. Wang, School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing 210094, China. E-mail: [email protected]
powerful techniques widely applied in natural language processing for topic discovery and semantic mining from unordered documents [1]. From a broad perspective, topic modeling methods based on LDA have been applied to natural language processing, text mining, social media analysis, and information retrieval. For example, topic modeling in social media analytics helps in understanding the reactions and conversations among people in online communities, as well as in extracting useful and understandable patterns from their interactions and from what they share on social media websites such as Twitter and Facebook [2-3]. Topic models are prominent for representing discrete data and provide an efficient way to discover hidden structures (semantics) in massive data. There are many papers in this field and we certainly cannot mention all of them, so we selected the more significant ones. Topic models are applied in various fields including medical sciences [4-7], software engineering [8-12], geography [13-17], political science [18-20], etc.
For example, in political science, [20] proposed a new two-layer matrix factorization methodology for identifying topics in large political speech corpora over time, capable of identifying both niche topics related to events at a particular point in time and broad, long-running topics. While that paper focused on European Parliament speeches, the proposed topic modeling method has a number of potential applications in the study of politics, including the analysis of speeches in other parliaments, political manifestos, and other more traditional forms of political texts. [21] suggested a new unsupervised topic model based on LDA for contrastive opinion modeling, called the Cross-Perspective Topic (CPT) model, whose purpose is to find the opinions from multiple views on a given topic and to quantify their differences on that topic. They performed experiments with both qualitative and quantitative measures on two datasets in the political domain: the first consists of statement records of U.S. senators that reflect their political stances; the second consists of world news extracted from three representative media outlets in the U.S. (New York Times), China (Xinhua News), and India (The Hindu). To compare their approach with other models, they used corrLDA and LDA as two baselines.
Another group of researchers focused on topic modeling in software engineering. In [8], LDA was used for the first time to extract topics from source code and to visualize software similarity; in other words, LDA serves as an intuitive approach for calculating the similarity between source files by comparing their respective distributions over topics. They applied their method to 1,555 software projects from Apache and SourceForge comprising 19 million source lines of code (SLOC), and demonstrated that this approach can be effective for project organization and software refactoring. [22] introduced a method based on LDA for automatically categorizing software systems, called LACT. To evaluate LACT, they used 43 open-source software systems in different programming languages and showed that LACT can categorize software systems based on the type of programming language. [23, 24] proposed an LDA-based topic modeling approach for bug localization. They applied their idea to the analysis of the same bugs in Mozilla and Eclipse, and the results showed that their LDA-based approach is better than LSI for evaluating and analyzing bugs in these source codes.
The analysis of geographic information is another application, addressed in [17]. The authors introduced a novel method, called GeoFolk, based on multi-modal Bayesian models that describes social media by merging text features and spatial knowledge. In general terms, this method can be considered an extension of Latent Dirichlet Allocation (LDA). They used the standard CoPhIR dataset, which contains over 54 million Flickr items. The GeoFolk model can be used in quality-oriented applications and can be merged with models from the social Web 2.0. [16] examines the issue of topic modeling for extracting topics from geographic information and GPS-related documents. The authors suggested a new location-text method, called LGTA (Latent Geographical Topic Analysis), that combines topic modeling and geographical clustering. To test their approach, they collected a set of data from the Flickr website covering various topics.
From another point of view, most of the papers studied pursue goals such as source code analysis [8, 9, 22, 24-27], opinion and aspect mining [18, 28-36], event detection [37-40], image classification [13, 41], recommendation systems [42-48], and emotion classification [49-51]. For example, in recommendation systems, [44] proposed a personalized hashtag recommendation approach based on an LDA model, called Hashtag-LDA, that can discover latent topics in microblogs; they ran experiments on "UDI-TwitterCrawl-Aug2012-Tweets" as a real-world Twitter dataset.
1.1 Literature review and related work
Topic models have many applications in natural language processing. Many articles have been published based on topic modeling approaches in various areas such as social networks, software engineering, and linguistics, and some works have surveyed topic modeling itself.
In [52], the authors presented a survey on topic modeling in the software engineering field to specify how topic models have thus far been applied to one or more software repositories. They focused on articles written between Dec 1999 and Dec 2014 and surveyed 167 articles using topic modeling in the software engineering area. They identified and demonstrated the research trends in mining unstructured repositories with topic models, and found that most studies focused on only a limited number of software engineering tasks and used only basic topic models. In [53], the authors surveyed topic models with soft clustering abilities in text corpora and investigated basic concepts, a classification of existing models into various categories, parameter estimation (such as Gibbs sampling), and performance evaluation measures. In addition, the authors presented some applications of topic models for modeling text corpora and discussed several open issues and future directions.
In [54], the authors surveyed the field of opinion mining and sentiment analysis, which helps us extract elements from intimidating volumes of unstructured text. They discussed the most extensively studied subject, subjectivity and sentiment classification, which specifies whether a document is opinionated. They also described aspect-based sentiment analysis, which exploits the full power of the abstract model, and discussed aspect extraction based on topic modeling approaches. In [55], the authors discussed challenges of text mining techniques in information systems research and investigated the practical application of topic modeling in combination with explanatory regression analysis, using online customer reviews as an exemplary data source.
In [56], the authors presented a survey of how topic models have thus far been applied to software engineering tasks, drawing on four SE journals and eleven conference proceedings. They considered 38 selected publications from 2003 to 2015 and found that topic models are used with increasing frequency across various SE tasks, such as social software engineering and developer recommendation. Our research differs from these works in that we conducted a deep study of LDA-based topic modeling approaches covering various aspects such as applications, tools, datasets, and models.
1.2 Motivations and contributions
Topic modeling techniques can be very useful and effective in natural language processing for semantic mining and latent discovery in documents and datasets. Our motivation is therefore to investigate topic modeling approaches across different subjects, covering various aspects such as models, tools, datasets, and applications. The main goal of this work is to provide an overview of LDA-based topic modeling methods. In summary, this paper makes four main contributions:
– We investigate scholarly articles (from 2003 to 2016) which are related to TopicModeling based on LDA to discover the research development, current trends andintellectual structure of topic modeling based on LDA.
– We investigate topic modeling applications in various sciences.
– We summarize challenges in topic modeling, such as image processing, visualizing topic models, group discovery, and user behavior modeling.
– We introduce some of the most famous datasets and tools in topic modeling.
2 Computer science and topic modeling
Topic models have an important role in computer science for text mining and natural language processing. In topic modeling, a topic is a list of words that co-occur in statistically meaningful ways. A text can be an email, a book chapter, a blog post, a journal article, or any other kind of unstructured text. Topic models cannot understand the meaning of words in text documents. Instead, they assume that every piece of text is generated by selecting words from probable baskets of words, where each basket corresponds to a topic. The algorithm iterates through this process until it settles on the most probable distribution of words into baskets, which are called topics. Topic modeling can provide a useful view of a large collection in terms of the collection as a whole, the individual documents, and the relationships between documents. In Figure 1, we provide a taxonomy of LDA-based topic modeling methods drawn from some of the notable works.
2.1 Latent Dirichlet Allocation
LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. Latent Dirichlet allocation (LDA), first introduced by Blei, Ng and Jordan in 2003 [1], is one of the most popular methods in topic modeling. LDA represents topics by word probabilities, and the words with the highest probabilities in each topic usually give a good idea of what the topic is about.
LDA, an unsupervised generative probabilistic method for modeling a corpus, is the most commonly used topic modeling method. LDA assumes that each document can be represented as a probabilistic distribution over latent topics, and that the topic distributions of all documents share a common Dirichlet prior. Each latent topic in the LDA model is likewise represented as a probabilistic distribution over words, and the word distributions of topics share a common Dirichlet prior as well. Given a corpus D consisting of M documents, with document d having N_d words (d ∈ {1, ..., M}), LDA models D according to the following generative process [4]:
(a) Choose a multinomial distribution ϕ_t for each topic t (t ∈ {1, ..., T}) from a Dirichlet distribution with parameter β.
(b) Choose a multinomial distribution θ_d for each document d (d ∈ {1, ..., M}) from a Dirichlet distribution with parameter α.
(c) For each word w_n (n ∈ {1, ..., N_d}) in document d:
    i. Select a topic z_n from θ_d.
    ii. Select a word w_n from ϕ_{z_n}.
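As a concrete aid to the generative process described here, the following is a minimal NumPy sketch of steps (a)-(c); it is not the paper's code, and all names and parameter values are illustrative:

```python
import numpy as np

def generate_corpus(M, T, V, alpha, beta, mean_doc_len, rng=None):
    """Sample a toy corpus from the LDA generative process.

    M: number of documents, T: number of topics, V: vocabulary size.
    alpha, beta: symmetric Dirichlet hyperparameters.
    Returns (phi, theta, docs), where docs is a list of word-id lists.
    """
    rng = np.random.default_rng(rng)
    # (a) one word distribution phi_t per topic, drawn from Dirichlet(beta)
    phi = rng.dirichlet(np.full(V, beta), size=T)
    # (b) one topic distribution theta_d per document, drawn from Dirichlet(alpha)
    theta = rng.dirichlet(np.full(T, alpha), size=M)
    docs = []
    for d in range(M):
        # document length: illustrative choice, not part of the LDA model itself
        N_d = rng.poisson(mean_doc_len) + 1
        # (c) for each word: pick a topic z_n from theta_d, then a word from phi_{z_n}
        z = rng.choice(T, size=N_d, p=theta[d])
        docs.append([rng.choice(V, p=phi[t]) for t in z])
    return phi, theta, docs
```

Inference then runs this process in reverse: given only `docs`, it recovers estimates of `phi` and `theta`.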
In the above generative process, the words in documents are the only observed variables, while ϕ and θ are latent variables and α and β are hyperparameters. In order to infer the latent variables and hyperparameters, the probability of the observed data D is computed and maximized as follows:

p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \Big( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \Big)\, d\theta_d    (1)
Here α parameterizes the Dirichlet prior on the per-document topic distributions, and β parameterizes the Dirichlet prior on the per-topic word distributions. T is the number of topics, M the number of documents, and N the size of the vocabulary. The Dirichlet-multinomial pair (α, θ) governs the corpus-level topic distributions, and the pair (β, ϕ) governs the topic-word distributions. The variables θ_d are document-level variables, sampled once per document. The variables z_{dn} and w_{dn} are word-level variables, sampled once for each word in each document.
LDA is a distinguished tool for inferring the latent topic distribution of a large corpus. It can, for example, identify sub-topics for a technology area composed of many patents and represent each patent as an array of topic proportions. With LDA, the terms in the set of documents form a vocabulary that is then used to discover hidden topics. Documents are treated as mixtures of topics, where a topic is a probability distribution over this set of terms; each document is in turn a probability distribution over the set of topics. We can think of the data as coming from a generative process defined by the joint probability distribution over what is observed and what is hidden.
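To make this concrete, the following is a minimal sketch of fitting LDA to a toy corpus with scikit-learn (not part of the original survey; the corpus and parameter values are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

# LDA operates on bag-of-words counts, not raw text.
vec = CountVectorizer()
X = vec.fit_transform(docs)

# n_components is the number of topics T; doc_topic_prior and
# topic_word_prior correspond to the Dirichlet hyperparameters alpha and beta.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # each row is a per-document topic distribution
```

Each row of `doc_topics` is the estimated θ_d for one document, and `lda.components_` (after normalization) gives the per-topic word distributions ϕ_t.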
2.2 Parameter estimation, inference, and training for LDA
Various methods have been proposed to estimate LDA parameters, such as variational methods [1], expectation propagation [57], and Gibbs sampling [58].
– Gibbs sampling is a Markov chain Monte Carlo algorithm, a powerful technique in statistical inference, and a method for generating a sample from a joint distribution when only the conditional distributions of each variable can be computed efficiently. To our knowledge, it is the method researchers have used most widely for LDA; works based on LDA and Gibbs sampling include [22, 50, 59-66].
– Expectation-Maximization (EM) is a powerful method for obtaining parameter estimates of graphical models and can be used for unsupervised learning. The algorithm finds maximum likelihood estimates of parameters when the data model depends on certain latent variables. The EM algorithm alternates between two steps, the E-step (expectation) and the M-step (maximization). Some researchers have applied this method to LDA training, such as [67-70].
– Variational Bayes inference (VB) can be considered a type of EM extension that uses a parametric approximation to the posterior distribution of both parameters and other latent variables and attempts to optimize its fit (e.g. using KL-divergence) to the observed data. Some researchers have applied this method to LDA training, such as [71-72].
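As an illustration of the Gibbs sampling approach listed above, the following is a minimal, unoptimized collapsed Gibbs sampler for LDA; it is a sketch written for this survey's context, not code from any of the cited papers, and all names and defaults are illustrative:

```python
import numpy as np

def gibbs_lda(docs, V, T, alpha, beta, iters=100, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (illustrative, not optimized).

    docs: list of lists of word ids in [0, V).
    Returns smoothed estimates of phi (topic-word) and theta (doc-topic).
    """
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), T))   # document-topic counts
    nkw = np.zeros((T, V))           # topic-word counts
    nk = np.zeros(T)                 # total words per topic
    z = [rng.integers(T, size=len(doc)) for doc in docs]  # random init
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # remove the current assignment from the counts
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional p(z_dn = k | rest) up to a constant
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(T, p=p / p.sum())
                z[d][n] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    phi = (nkw + beta) / (nkw.sum(axis=1, keepdims=True) + V * beta)
    theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + T * alpha)
    return phi, theta
```

Production implementations add burn-in, multiple chains or recorded sweeps, and vectorized or distributed sampling, but the resampling loop above is the core of the collapsed Gibbs approach.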
2.3 A brief look at past work: Research between 2003 and 2009
LDA was first presented in 2003, and researchers have since tried to provide extended approaches based on it, as shown in Table 1. Undeniably, this period (2003 to 2009) is very important because key baseline approaches were introduced, such as Corr-LDA, the Author-Topic Model, DTM, and RTM.
[Figure 1 shows a timeline of LDA-based models: Corr-LDA and the Author-Topic Model (2003); the Dynamic Topic Model (DTM), Correlated Topic Models (CTM), the Pachinko Allocation Model (PAM), and Topics over Time (TOT) (2006); GWN-LDA (2007); MG-LDA and On-line LDA (OLDA) (2008); MedLDA, Rethinking LDA, Relational Topic Models, LACT, Labeled LDA, HDP-LDA, and LDA-SOM (2009); the Topic-Aspect Model, GeoFolk, TWC-LDA, and Dependency-Sentiment-LDA (2010); Sentence-LDA, MMLDA, Bio-LDA, Twitter-LDA, SHDT, and LogicLDA (2011); Mr. LDA, LDA-G, ET-LDA, and SShLDA (2012); emotion LDA (ELDA), Biterm Topic Modeling, and TWILITE (2014); LightLDA, Latent Feature LDA (LF-LDA), and MRF-LDA (2015); Hashtag-LDA and PMB-LDA (2016); Enriched LDA (ELDA) and iDoctor (2017).]
Fig. 1 A taxonomy of methods based on LDA extensions, covering some of the notable works
The Author-Topic Model [75] is a popular and simple probabilistic model in topic modeling for finding relationships among authors, topics, words, and documents. This model provides a distribution over topics for each author and a distribution over words for each topic. For evaluation, the authors used 1,700 papers from the NIPS conference and 160,000 abstracts from the CiteSeer dataset, applying Gibbs sampling to estimate the topic and author distributions. Their results showed that this approach is significantly predictive of authors' interests as measured by perplexity. The model has attracted much attention from researchers, and many approaches have been proposed based on the ATM, such as [12, 76].
DTM, the Dynamic Topic Model, was introduced by Blei and Lafferty as an extension of LDA that captures the evolution of topics over time in a sequentially arranged corpus of documents; it exhibits the evolution of the word-topic distribution, which makes it easy to visualize topic trends [73]. As an advantage, DTM is very effective for extracting topics from collections that change slowly over a period of time.
Labeled LDA (LLDA) is another LDA extension, which assumes that each document has a set of known labels [66]. The model can be trained with labeled documents and even supports documents with more than one label. Topics are learned from co-occurring terms in places from the same category, with topics approximately capturing different place categories; a separate L-LDA model is trained for each place category and can be used to infer the category of new, previously unseen places. LLDA is a supervised algorithm that builds topics from manually assigned labels, and it can therefore obtain meaningful topics whose words map well to the applied labels. As a disadvantage, Labeled LDA cannot support latent subtopics within a given label or any global topics; to overcome this problem, partially labeled LDA (PLDA) was proposed [74].
MedLDA, the maximum entropy discrimination latent Dirichlet allocation model, combines the mechanism behind hierarchical Bayesian models (such as LDA) with max-margin learning (such as SVMs) in a unified constrained optimization framework. Each data sample is mapped to a point in a finite-dimensional latent space, each feature of which corresponds to a topic, i.e., a unigram distribution over terms in a vocabulary [67]. The MedLDA model can be applied to both classification and regression. As a disadvantage, however, the quality of the topical space learned by MedLDA is heavily affected by the quality of the classifiers learned at each iteration of MedLDA.
Relational Topic Models (RTM) form another extension: RTM is a hierarchical model of networks and per-node attribute data. First, each document is generated from topics as in LDA. Then the connections between documents are modeled as binary variables, one for each pair of documents, distributed according to a distribution that depends on the topics used to generate each of the constituent documents. In this way, the content of the documents is statistically linked to the link structure between them, and the model can be used to summarize a network of documents [69]. The advantage of RTM is that it considers both the links and the document content. The disadvantage of RTM is its scalability: since RTM can only make predictions for a pair of documents, recommending related questions for a newly created one would require computing pairwise RTM responses between the query and every available question in the collection.
2.4 A brief look at past work: Research between 2010 and 2011
Twenty-seven approaches are introduced in Table 1, of which eight were published in 2010 and six in 2011. According to Table 1, the LDA model was used for a variety of subjects, such as scientific topic discovery [35], source code analysis [27], opinion mining [30], event detection [40], and image classification [41].
Sizov et al. [17] introduced a novel method, called GeoFolk, based on multi-modal Bayesian models that describes social media by merging text features and spatial knowledge. As noted earlier, this method can be considered an extension of Latent Dirichlet Allocation (LDA). They used the standard CoPhIR dataset, which contains over 54 million Flickr items. The GeoFolk model can be used in quality-oriented applications and can be merged with models from the social Web 2.0.
Z. Zhai et al. [30] used prior knowledge as constraints in LDA models to improve the grouping of features by LDA. They extract must-link and cannot-link constraints from the corpus: a must-link indicates that two features must be in the same group, while a cannot-link restricts two features from being in the same group. These constraints are extracted automatically. If at least one of the terms of two product features is the same, the features are considered to belong to the same group (must-link). On the other hand, if two features are expressed in the same sentence without the conjunction "and", they are considered different features and should be in different groups (cannot-link).
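The two extraction rules just described can be sketched as follows; this is an illustrative reading of the rules, not the authors' code, and it assumes feature extraction has already been done upstream (the "and"-conjunction filter is modeled by the caller only listing feature pairs not joined by "and"):

```python
def extract_constraints(sentences):
    """Sketch of must-link / cannot-link extraction from feature lists.

    sentences: list of lists; each inner list holds the product-feature
    phrases found in one sentence (not joined by "and").
    Returns (must_links, cannot_links) as sets of frozenset pairs.
    """
    features = {f for sent in sentences for f in sent}
    must, cannot = set(), set()
    # Must-link: two features sharing at least one term belong together.
    for a in features:
        for b in features:
            if a < b and set(a.split()) & set(b.split()):
                must.add(frozenset((a, b)))
    # Cannot-link: two distinct features expressed in the same sentence.
    for sent in sentences:
        for i, a in enumerate(sent):
            for b in sent[i + 1:]:
                if a != b and frozenset((a, b)) not in must:
                    cannot.add(frozenset((a, b)))
    return must, cannot
```

For example, "battery life" and "battery" share the term "battery" and become a must-link, while two unrelated features mentioned in one sentence become a cannot-link.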
Wang et al. [85] suggested an LDA-based approach, called Bio-LDA, that can identify biological terminology to obtain latent topics. The authors showed that this approach can be applied in different studies such as association search, association prediction, and connectivity map generation, and that Bio-LDA can broaden the use of molecular binding techniques such as heat maps.
Table 1 Some impressive articles based on LDA: between 2003-2011

Author-study | Model | Year | Parameter Estimation/Inference | Methods | Problem Domain
[70] | Corr-LDA | 2003 | Variational EM | LDA | Image annotation and retrieval
[75] | Author-Topic Model | 2004 | Gibbs Sampling | LDA | Find the relationships between authors, documents, words, and topics
[76] | Author-Recipient-Topic (ART) | 2005 | Gibbs Sampling | LDA, Author-Topic (AT) | Social network analysis and role discovery
[73] | Dynamic topic model (DTM) | 2006 | Kalman variational algorithm | LDA, Galton-Watson process | Provide a dynamic model for evolution of topics
[77] | Topics over Time (TOT) | 2006 | Gibbs Sampling | LDA | Capture word co-occurrences and localization in continuous time
[78] | Pachinko Allocation Model (PAM) | 2006 | Gibbs Sampling | LDA, directed acyclic graph method | Capture arbitrary topic correlations
[79] | GWN-LDA | 2007 | Gibbs sampling | LDA, hierarchical Bayesian algorithm | Probabilistic community profile discovery in social networks
[80] | On-line LDA (OLDA) | 2008 | Gibbs Sampling | LDA, empirical Bayes method | Topic detection and tracking
[36] | MG-LDA | 2008 | Gibbs sampling | LDA | Multi-aspect sentiment analysis
[66] | Labeled LDA | 2009 | Gibbs Sampling | LDA | Producing a labeled document collection
[69] | Relational Topic Models | 2009 | Expectation-maximization (EM) | LDA | Predictions over nodes, attributes, and link structure
[81] | HDP-LDA | 2009 | Gibbs Sampling | LDA | —
[82] | DiscLDA | 2009 | Gibbs Sampling | LDA | Classification and dimensionality reduction in documents
[17] | GeoFolk | 2010 | Gibbs sampling | LDA | Content management and retrieval of spatial information
[65] | JointLDA | 2010 | Gibbs Sampling | LDA, bag-of-words model | Mining multilingual topics
[35] | Topic-Aspect Model (TAM) | 2010 | Gibbs Sampling | LDA, SVM | Scientific topic discovery
[86] | Dependency-Sentiment-LDA | 2010 | Gibbs Sampling | LDA | Sentiment classification
[27] | TopicXP | 2010 | — | LDA | Source code analysis
[30] | constrained-LDA | 2010 | Gibbs Sampling | LDA | Opinion mining and grouping product features
[40] | PET (Popular Events Tracking) | 2010 | Gibbs Sampling | LDA | Event analysis in social networks
[39, 40] | EDCoW | 2010 | — | LDA, wavelet transformation | Event analysis in Twitter
[85] | Bio-LDA | 2011 | Gibbs Sampling | LDA | Extracting biological terminology
[64] | Twitter-LDA | 2011 | Gibbs Sampling | LDA, author-topic model, PageRank | Extracting topical keyphrases and analyzing Twitter content
[41] | max-margin latent Dirichlet allocation (MMLDA) | 2011 | Variational inference | LDA, SVM | Image classification and annotation
[34] | Sentence-LDA | 2011 | Gibbs sampling | LDA | Aspect and sentiment discovery for web reviews
[87] | PLDA+ | 2011 | Gibbs sampling | LDA, weighted round-robin | Reducing inter-computer communication time
[72] | Dirichlet class language model (DCLM) | 2011 | Variational Bayesian EM (VB-EM) | LDA | Speech recognition and exploitation of language models
2.5 A brief look at past work: Research between 2012 and 2013
According to Table 2, some of the popular works published between 2012 and 2013 focused on a variety of topics, such as music retrieval [88], opinion and aspect mining [89], and event analysis [38].
ET-LDA: in this work, the authors developed a joint Bayesian model that performs event segmentation and topic modeling in one unified framework. They proposed an LDA model, called Event and Tweets LDA (ET-LDA), to obtain an event's topics and analyze tweeting behavior on Twitter, employing Gibbs sampling to estimate the topic distribution [38]. The advantage of ET-LDA is that it can extract the top general topics for an entire tweet collection, which are highly relevant to and influenced by the event.
Mr. LDA: the authors introduced a novel model and a parallelized LDA algorithm in the MapReduce framework, called Mr. LDA. In contrast to other approaches, which use Gibbs sampling for LDA, this model uses variational inference [71].
LDA-GA: the authors focused on the issue of textual analysis in software engineering. They proposed an LDA model based on a genetic algorithm to determine a near-optimal configuration for LDA (LDA-GA), considering three scenarios: (a) traceability link recovery, (b) feature location, and (c) labeling. They applied the Euclidean distance for measuring the distance between documents and used fast collapsed Gibbs sampling to approximate the posterior distributions of the parameters [63].
TopicSpam: the authors proposed a generative LDA-based topic modeling method for detecting fake reviews. Their model can capture the differences between truthful and deceptive reviews and provides a clear probabilistic estimate of how likely a review is to be truthful or deceptive. For evaluation, the authors used 800 reviews of 20 Chicago hotels and showed that TopicSpam slightly outperforms TopicTDB in accuracy.
SSHLDA: the authors proposed a semi-supervised hierarchical topic model, called Semi-Supervised Hierarchical LDA (SSHLDA), whose goal is to explore new topics in the data space while incorporating the information from observed hierarchical labels into the modeling process. They used the label hierarchy as the base hierarchy, called the Base Tree (BT), and then applied hLDA to automatically build a topic hierarchy for each leaf node in the BT, called the Leaf Topic Hierarchy (LTH). One of the benefits of SSHLDA is that it can incorporate labeled topics into the generative process of documents; in addition, SSHLDA can automatically explore latent topics in the data space and extend the existing hierarchy of observed topics.
2.6 A brief look at past work: Research between 2014 and 2015
According to Table 2, some of the popular works published between 2014 and 2015 focused on a variety of topics, such as hashtag discovery [45-46], opinion and aspect mining [28-29, 31-32], and recommendation systems [45, 47-48].
Biterm Topic Modeling (BTM): topic modeling over short texts is an increasingly important task due to the prevalence of short texts on the Web, especially with the emergence of social media, and inferring topics from large-scale short texts has become critical. The authors proposed a novel topic model for short texts, the biterm topic model (BTM), which captures topics within short texts by explicitly modeling word co-occurrence patterns (biterms) over the whole corpus. BTM obtains the underlying topics in a set of text documents and a global topic distribution by analyzing the generation of biterms. Their results showed that BTM produces discriminative topic representations and rather coherent topics on short texts. The main advantage of BTM is that learning a global topic distribution avoids the data sparsity issue [96-97].
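The biterm construction at the heart of BTM can be sketched in a few lines: every unordered pair of words co-occurring in one short text is a biterm, and the pairs are aggregated over the whole corpus (an illustrative sketch, not the authors' implementation):

```python
from itertools import combinations

def extract_biterms(docs):
    """Collect all unordered word pairs (biterms) within each short text.

    docs: list of tokenized short texts.
    Returns one flat list of biterms aggregated over the corpus, which is
    the input BTM models instead of per-document word occurrences.
    """
    biterms = []
    for doc in docs:
        for w1, w2 in combinations(doc, 2):
            biterms.append(tuple(sorted((w1, w2))))
    return biterms

# Two toy "tweets": 3 words yield 3 biterms, 2 words yield 1.
biterms = extract_biterms([["apple", "iphone", "sale"], ["stock", "sale"]])
```

BTM then models each biterm as drawn from a single topic sampled from a corpus-level topic distribution, which is why sparsity at the level of individual short documents is avoided.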
TOT-MMM: the authors introduced a hashtag recommendation model called TOT-MMM. This hybrid model combines a temporal clustering component similar to that of the Topics-over-Time (TOT) model with the Mixed Membership Model (MMM) originally proposed for word-citation co-occurrence. The model can capture the temporal clustering effect in latent topics, thereby improving hashtag modeling and recommendation. The authors developed a collapsed Gibbs sampler (CGS) to approximate the posterior modes of the remaining random variables [45]. The posterior probability that the latent topic equals k for the n-th hashtag in tweet d is given by:
P(z^{(h)}_{dn} = k \mid z^{(m)}, z^{(h)}_{-dn}, w^{(m)}, w^{(h)}, t) \propto
\frac{\beta_h + c^{(h)}_{w^{(h)}_{dn},-dn,k}}{V_h \beta_h + c^{(h)}_{-dn,k}}
\times \frac{\alpha + c^{(dh)}_{-dn,k} + c^{(dm)}_{\cdot,k}}{K\alpha + N_{dm} + N_{dh} - 1}
\times \frac{t_d^{\psi_{k1}-1}(1-t_d)^{\psi_{k2}-1}}{B(\psi_{k1},\psi_{k2})}

where c^{(h)}_{w^{(h)}_{dn},-dn,k} denotes the number of hashtags of type w^{(h)}_{dn} assigned to latent topic k, excluding the hashtag currently being processed; c^{(h)}_{-dn,k} denotes the number of hashtags assigned to latent topic k, excluding the assignment at position dn; c^{(dh)}_{-dn,k} denotes the number of hashtags assigned to latent topic k in tweet d, excluding the hashtag currently being processed; c^{(dm)}_{\cdot,k} denotes the number of words assigned to latent topic k in tweet d; V_h is the number of unique hashtags; t_d is the tweet time stamp, with position subscripts omitted because all words and hashtags in a tweet share the same time stamp; and \psi_{k1}, \psi_{k2} are the parameters of the beta distribution for latent topic k.
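To make the conditional above concrete, the sketch below evaluates its three factors (word term, tweet-level topic term, and beta-distributed time term) for one candidate topic. The count layout, names, and toy numbers are assumptions for illustration only, not the authors' implementation:

```python
from math import gamma

def beta_pdf(t, a, b):
    """Density of Beta(a, b) at t, using B(a, b) = G(a)G(b)/G(a+b)."""
    return t ** (a - 1) * (1 - t) ** (b - 1) * gamma(a + b) / (gamma(a) * gamma(b))

def hashtag_topic_weight(k, w, d, counts, alpha, beta_h, V_h, K, t_d, psi):
    """Unnormalized P(z_dn = k | rest) for one candidate topic k.

    `counts` is a hypothetical dict of the count statistics from the
    equation above, already excluding the current assignment:
      counts['hw'][k][w]  -- hashtags of type w assigned to topic k
      counts['h'][k]      -- hashtags assigned to topic k
      counts['dh'][d][k]  -- hashtags in tweet d assigned to topic k
      counts['dm'][d][k]  -- words in tweet d assigned to topic k
      counts['N_dm'][d], counts['N_dh'][d] -- word/hashtag totals of tweet d
    """
    word_term = (beta_h + counts['hw'][k][w]) / (V_h * beta_h + counts['h'][k])
    topic_term = (alpha + counts['dh'][d][k] + counts['dm'][d][k]) / \
                 (K * alpha + counts['N_dm'][d] + counts['N_dh'][d] - 1)
    time_term = beta_pdf(t_d, psi[k][0], psi[k][1])
    return word_term * topic_term * time_term

# Toy example: K = 2 topics, V_h = 2 hashtag types, made-up counts.
counts = {'hw': [[1, 0], [0, 2]], 'h': [1, 2],
          'dh': [[1, 0]], 'dm': [[2, 1]],
          'N_dm': [3], 'N_dh': [1]}
weights = [hashtag_topic_weight(k, w=0, d=0, counts=counts, alpha=0.1,
                                beta_h=0.01, V_h=2, K=2, t_d=0.5,
                                psi=[(2.0, 2.0), (2.0, 2.0)])
           for k in range(2)]
```

In a full sampler these unnormalized weights would be normalized over k and a new topic drawn from the resulting distribution.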
The probability of a hashtag given the observed words and time stamps is:

p(w^{(h)}_{vn} \mid w^{(m)}_{v\cdot}, t_v)
= \int p(w^{(h)}_{vn} \mid \theta^{(v)})\, p(\theta^{(v)} \mid w^{(m)}_{v\cdot}, t_v)\, d\theta^{(v)}
\approx \frac{1}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} \phi^{(s)}_{h,k,w^{(h)}_{vn}}\, \theta^{(v)(s)}_{k},

where S is the total number of recorded sweeps and the superscript s marks parameters computed from a specific recorded sweep. To provide the top N predictions, they ranked these probabilities from largest to smallest and output the first N hashtags.
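Concretely, the averaging over recorded sweeps and the top-N ranking can be sketched as follows (the phi and theta values are made up for illustration):

```python
import numpy as np

# Hypothetical posterior estimates from S = 2 recorded Gibbs sweeps:
# phi[s, k, h] is the probability of hashtag h under topic k in sweep s,
# theta[s, k] is the topic distribution of the target tweet in sweep s.
phi = np.array([[[0.7, 0.2, 0.1],
                 [0.1, 0.1, 0.8]],
                [[0.6, 0.3, 0.1],
                 [0.2, 0.1, 0.7]]])
theta = np.array([[0.9, 0.1],
                  [0.8, 0.2]])

# p(h) = (1/S) * sum_s sum_k phi[s, k, h] * theta[s, k]
scores = np.einsum('skh,sk->h', phi, theta) / phi.shape[0]

top_n = np.argsort(scores)[::-1][:2]  # indices of the two best hashtags
```

Because each sweep's phi rows and theta rows are probability distributions, the averaged scores again sum to one over the hashtag vocabulary.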
rLDA. The authors introduced a novel probabilistic formulation that estimates the relevance of a tag by considering all the other images and their tags, and proposed a model called regularized latent Dirichlet allocation (rLDA). The model estimates the latent topics of each document while making use of the other documents. They used a collective inference scheme to estimate the distribution of latent topics and applied a deep network structure to analyze the benefit of regularized LDA [45-46].
2.7 A brief look at some impressive past works: research in 2016
According to Table 2, some of the popular works published in this year focused on a variety of topics, such as recommendation systems [42-44] and opinion mining and aspect mining [28-32].
A bursty topic on Twitter is one that triggers a surge of relevant tweets within a short period of time, which often reflects important events of mass interest. How to leverage Twitter for early detection of bursty topics has therefore become an important research problem with immense practical value. In [59], the authors proposed a sketch-based topic model, together with a set of techniques for achieving real-time bursty topic detection from the perspective of topic modeling, called TopicSketch.
Bursty topics are often triggered by events, such as breaking news or a compelling basketball game, that attract a lot of attention and "force" people to tweet about them intensely. By analogy with physics, this "force" can be expressed as "acceleration", which in this setting describes the change of "velocity", i.e., the arrival rate of tweets. Bursty topics show significant acceleration while they are bursting, whereas general topics usually have nearly zero acceleration, so the acceleration signal can be used to preserve information about bursty topics while filtering out the others. Equation (1) shows how the "velocity" v(t) of words is calculated; the "acceleration" a(t) is its rate of change.
v_{\Delta T}(t) = \frac{\sum_{t_i \le t} X_i \exp\left((t_i - t)/\Delta T\right)}{\Delta T} \quad (1)
In Equation (1), X_i is the frequency of a word (or a pair or triple of words) in the i-th tweet, and t_i is its timestamp. The exponential term in v_{\Delta T}(t) works like a soft moving window, giving recent terms high weight and distant ones low weight; the smoothing parameter \Delta T is the window size. The authors also proposed a novel data sketch that efficiently maintains, at low computational cost, three quantities: the total number of tweets, the occurrences of each word, and the occurrences of each word pair. Low computation cost is therefore one of the advantages of this approach.
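The velocity formula is straightforward to transcribe; the sketch below is a direct reading of Equation (1), with names and toy data chosen for illustration:

```python
from math import exp

def velocity(tweets, t, delta_t):
    """TopicSketch-style 'velocity' of a term at time t.

    `tweets` is a hypothetical list of (X_i, t_i) pairs, where X_i is
    the term's frequency in the i-th tweet and t_i its timestamp.
    Recent tweets get weight close to 1, older ones decay
    exponentially; delta_t is the soft window size.
    """
    return sum(x * exp((t_i - t) / delta_t)
               for x, t_i in tweets if t_i <= t) / delta_t

# A term appearing once now and once 10 time units ago, window = 10:
v = velocity([(1, 0.0), (1, 10.0)], t=10.0, delta_t=10.0)
```

Acceleration can then be estimated as the difference of velocities computed with two window sizes or at two time points, which is the signal TopicSketch thresholds to flag bursts.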
Hashtag-LDA. The authors introduced a personalized hashtag recommendation approach based on the latent topical information in untagged microblogs. The model enhances the influence of hashtags on latent-topic generation by jointly modeling hashtags and words in microblogs. Latent topics are inferred by Gibbs sampling, and a real-world Twitter dataset was used to evaluate the approach [44]. CDLDA proposed a conceptual dynamic latent Dirichlet allocation model for topic detection and tracking in conversational communication, particularly spoken interactions. The model can extract dependencies between topics and speech acts; it applies hypernym information and speech acts for topic detection and tracking in conversations, and it captures contextual information from transitions, incorporating concept features and speech acts [61].
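Since collapsed Gibbs sampling is the inference method listed for most models in Table 2, a minimal pure-Python sampler for vanilla LDA may help fix ideas. This is a toy sketch with symmetric priors and an invented corpus, not an implementation of any of the surveyed systems:

```python
import random

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampler for vanilla LDA on integer-coded docs."""
    rng = random.Random(seed)
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    ndk = [[0] * K for _ in docs]       # tokens in doc d assigned to topic k
    nkw = [[0] * V for _ in range(K)]   # occurrences of word w in topic k
    nk = [0] * K                        # total tokens assigned to topic k
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]             # remove the current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Unnormalized full conditional p(z_dn = j | rest).
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta)
                           / (nk[j] + V * beta) for j in range(K)]
                r = rng.random() * sum(weights)
                k = 0
                while r > weights[k]:
                    r -= weights[k]
                    k += 1
                z[d][n] = k             # record the new assignment
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return z, ndk, nkw

# Two toy "documents" over a 4-word vocabulary: words 0-1 vs. words 2-3.
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 2]]
z, ndk, nkw = lda_gibbs(docs, K=2, V=4)
```

After sampling, ndk and nkw (plus the priors) yield the usual point estimates of the document-topic and topic-word distributions; the extensions surveyed above modify the count structure and the conditional, not this overall loop.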
Table 2 Some impressive articles based on LDA: between 2012-2016

Study | Model | Year | Parameter Estimation/Inference | Methods | Problem Domain
[90] | locally discriminative topic model (LDTM) | 2012 | Expectation-maximization (EM) | LDA | Document semantic analysis
[38] | ET-LDA | 2012 | Gibbs sampling | LDA | Event segmentation (Twitter)
[88] | infinite latent harmonic allocation (iLHA) | 2012 | Expectation-maximization (EM) algorithm, Variational Bayes (VB) | LDA, HDP (Hierarchical Dirichlet processes) | Multipitch analysis and music information retrieval
[71] | Mr. LDA | 2012 | Variational Bayes inference | LDA, Newton-Raphson method, MapReduce algorithm | Exploring document collections at large scale
[91] | FB-LDA, RCB-LDA | 2012 | Gibbs sampling | LDA | Analyze and track public sentiment variations (on Twitter)
[92] | factorial LDA | 2012 | Gibbs sampling | LDA | Analysis of text in a multi-dimensional structure
[93] | SShLDA | 2012 | Gibbs sampling | LDA, hLDA | Topic discovery in data space
[94] | Utopian | 2013 | Gibbs sampling | LDA | Visual text analytics
[63] | LDA-GA | 2013 | Gibbs sampling | LDA, Genetic Algorithm | Software textual retrieval and analysis
[33] | Multi-aspect Sentiment Analysis for Chinese Online Social Reviews (MSA-COSRs) | 2013 | Gibbs sampling | LDA | Sentiment analysis and aspect mining of Chinese social reviews
[89] | TopicSpam | 2013 | Gibbs sampling | LDA | Opinion spam detection
[95] | WT-LDA | 2013 | Gibbs sampling | LDA | Web service clustering
[51] | emotion-LDA (ELDA) | 2014 | Gibbs sampling | LDA | Social emotion classification of online news
[48] | TWILITE | 2014 | EM algorithm | LDA | Recommendation system for Twitter
[98] | Red-LDA | 2014 | Gibbs sampling | LDA | Extracting information and data modeling in patient record notes
[96,97] | Biterm Topic Model (BTM) | 2014 | Gibbs sampling | LDA | Document clustering for short text
[47] | Trend Sensitive-Latent Dirichlet Allocation (TS-LDA) | 2014 | Gibbs sampling | LDA, Normalized Discounted Cumulative Gain (nDCG), Amazon Mechanical Turk (AMT) platform | Interesting-tweet discovery for users, recommendation system
[32] | Fine-grained Labeled LDA (FL-LDA), Unified Fine-grained Labeled LDA (UFL-LDA) | 2014 | Gibbs sampling | LDA | Aspect extraction and review mining
[45,46] | Regularized latent Dirichlet allocation (rLDA) | 2014 | Variational Bayes inference | LDA | Automatic image tagging or tag recommendation
[29] | generative probabilistic aspect mining model (PAMM) | 2014 | Expectation-maximization (EM) | LDA | Opinion mining and grouping of drug reviews, aspect mining
[28] | AEP-based Latent Dirichlet Allocation (AEP-LDA) | 2014 | Gibbs sampling | LDA | Opinion/aspect mining and sentiment word identification
[31] | ADM-LDA | 2014 | Gibbs sampling | LDA, Markov chain | Aspect mining and sentiment analysis
[99] | MRF-LDA | 2015 | EM algorithm | LDA, Markov Random Field | Exploiting word correlation knowledge
[100] | LightLDA | 2015 | Gibbs sampling | LDA | Topic modeling for very large data sizes
[101] | LF-LDA, LF-DMM | 2015 | Gibbs sampling | LDA | Document clustering for short text
[103] | LFT | 2015 | Gibbs sampling | LDA | Semantic community detection
[102] | ATC | 2015 | EM | LDA | Author community discovery
[106] | FLDA, DFLDA | 2015 | Gibbs sampling | LDA | Multi-label document categorization
[41] | Hashtag-LDA | 2016 | Gibbs sampling | LDA | Hashtag recommendation and finding relationships between topics and hashtags
[44] | PMB-LDA | 2016 | Expectation-maximization (EM) | LDA | Extracting population mobility behaviors at large scale
[108] | Automatic Rule Generation (LARGen) | 2016 | Gibbs sampling | LDA | Malware analysis and automatic signature generation
[99] | PT-LDA | 2016 | Gibbs-EM algorithm | LDA | Personality recognition in social networks
[110] | Corr-wddCRF | 2016 | Gibbs sampling | LDA | Knowledge discovery in electronic medical records
[42] | multi-idiomatic LDA model (MiLDA) | 2016 | Gibbs sampling | LDA | Content-based recommendation and automatic linking
[43] | Location-aware Topic Model (LTM) | 2016 | Gibbs sampling | LDA | Music recommendation
[62] | TopPRF | 2016 | Gibbs sampling | LDA | Evaluating the relevancy between feedback documents
[50] | contextual sentiment topic model (CSTM) | 2016 | Expectation-maximization (EM) | LDA | Emotion classification in social networks
[61] | conceptual dynamic latent Dirichlet allocation (CDLDA) | 2016 | Gibbs sampling | LDA | Topic detection in conversations
[60] | multiple-channel latent Dirichlet allocation (MCLDA) | 2016 | Gibbs sampling | LDA | Finding relations between diagnoses and medications from healthcare data
[37] | multi-modal event topic model (mmETM) | 2016 | Gibbs sampling | LDA | Tracking and social event analysis
[111] | Dynamic Online Hierarchical Dirichlet Process model (DOHDP) | 2016 | Gibbs sampling | LDA | Dynamic topic evolutionary discovery for Chinese social media
[59] | TopicSketch | 2016 | Gibbs sampling | LDA, tensor decomposition algorithm, Count-Min algorithm | Real-time detection and bursty topic discovery from Twitter
[112] | fast online EM (FOEM) | 2016 | Expectation-maximization (batch EM) | LDA | Big topic modeling
[113] | Joint Multi-grain Topic Sentiment (JMTS) | 2016 | Gibbs sampling | LDA | Extracting semantic aspects from online reviews
[114] | Character word topic model (CWTM) | 2016 | Gibbs sampling | LDA | Capturing the semantic content of text documents (Chinese language)
Fig. 2 A vision of the application of Topic modeling in various sciences (based on previous works)
mmETM. The authors proposed a novel multi-modal social event tracking model to capture the evolutionary trends of social events and to generate effective event summaries over time. mmETM models the multi-modal property of social events and learns the correlations between visual and textual modalities, separating visual-representative topics from non-visual-representative ones. The model works in an online mode for events consisting of many stories over time, which is a great advantage for capturing evolutionary trends in event tracking and evolution [37].
3 In which areas is Topic Modeling used?
With the passage of time, the importance of topic modeling in different disciplines will increase. According to previous studies, we present a taxonomy of current LDA-based topic model approaches across different subjects, such as social networks [76, 115-118], software engineering [8, 9, 25-26], and crime science [119-121], and also in the areas of geography [13, 14-17], political science [19-20, 122], medicine/biomedicine [4, 123-125], and linguistic science [126-130], as illustrated in Figure 2.
3.1 Topic modeling in Linguistic science
LDA is an advanced textual analysis technique, grounded in computational linguistics research, that calculates the statistical correlations among words in a large set of documents to identify and quantify the underlying topics in those documents. In this subsection, we examine some topic modeling methodology from computational linguistics research; significant studies are shown in Table 3. In [130], the authors employed the distributional hypothesis in various directions, aiming to remove the requirement of a seed lexicon as an essential prerequisite for building a bilingual vocabulary, and introduced various ways to identify the translation of words across languages. In [128], the authors introduced a method that guides machine translation systems to relevant translations based on topic-specific contexts, using the topic distributions to obtain
Table 3 Impressive LDA-based works in linguistic science

Study | Year | Purpose | Dataset
[130] | 2011 | Introduce various ways to identify the translation of words among languages [BiLDA] | A Wikipedia dataset (Arabic, Spanish, French, Russian and English)
[129] | 2010 | Obtain term weighting based on LDA | A multilingual dataset
[127] | 2013 | Present a diversity of new visualization techniques to make sense of topic solutions | Dissertation abstracts (1980 to 2010); 1 million abstracts
[126] | 2012 | A topic modeling approach that considers geographic information | Foursquare dataset
[131] | 2014 | An approach capable of handling documents in different languages | ALTW2010
[132] | 2013 | A method for discovering linguistic and conceptual metaphor resources | Wikipedia
topic-dependent lexical weighting probabilities. They trained a topic model on the training data and adapted the translation model accordingly. To evaluate their approach, they performed experiments on Chinese-to-English machine translation and showed that the approach can be an effective strategy for dynamically biasing a statistical machine translation system towards relevant translations.
In [127], the authors presented a diversity of new visualization techniques for making sense of topic solutions and introduced new forms of supervised LDA; for evaluation they considered a corpus of dissertation abstracts from 1980 to 2010 belonging to 240 universities in the United States. In [126], the authors developed a standard topic modeling approach that considers geographic and temporal information; the approach was applied to Foursquare data to discover the dominant topics in the proximity of a city. The researchers also showed that the abundance of data available in location-based social networks (LBSNs) enables such models to capture the topical dynamics of urban environments. In [132], the authors introduced a method for discovering linguistic and conceptual metaphor resources: they built an LDA model on Wikipedia, aligned its topics to possible source and target concepts, and used both target and source domains to identify potentially metaphorical sentences. In [131], the authors presented an LDA-based approach that can take a document containing different languages, identify the languages present, and then calculate their relative proportions; the ALTW2010 dataset was used to evaluate their method.
3.2 Topic modeling in political science
Some topic modeling methods have been adopted in the political science literature toanalyze political attention. In settings where politicians have limited time-resourcesto express their views, such as the plenary sessions in parliaments, politicians mustdecide what topics to address. Analyzing such speeches can thus provide insight intothe political priorities of the politician under consideration. Single membership topic
Table 4 Impressive LDA-based works in political science

Study | Year | Purpose | Dataset
[19] | 2013 | Evaluate the behavioral effects of different databases for political-orientation classifiers | Political Figures Dataset; Politically Active Dataset; Politically Modest Dataset; Conover 2011 Dataset (C2D)
[21] | | Introduce a topic model for contrastive opinion modeling | Statement records of U.S. senators
[133] | 2012 | Detect topics that evoke different reactions from communities across the political spectrum | A collection of blog posts from five blogs: 1. Carpetbagger (CB) thecarpetbaggerreport.com; 2. Daily Kos (DK) dailykos.com; 3. Matthew Yglesias (MY) yglesias.thinkprogress.org; 4. Red State (RS) redstate.com; 5. Right Wing News (RWN) rightwingnews.com
[18] | 2010 | Discover the hidden relationships between opinion words and topic words | The statement records of senators through Project Vote Smart (http://www.votesmart.org)
[134] | 2014 | Analyze issues related to Korea's presidential election | Project Vote Smart website (https://votesmart.org/)
[135] | 2014 | Examine political contention in the U.S. trucking industry | Regulations.gov online portal
[136] | 2015 | A method for multi-dimensional analysis of political documents | Three German national elections (2002, 2005 and 2009)
models that assume each speech relates to one topic have successfully been applied to plenary speeches made in the 105th to 108th U.S. Senates in order to trace the political attention of Senators over time [18]. In addition, [20] proposed a new two-layer matrix factorization methodology for identifying topics in large political speech corpora over time, capturing both niche topics related to events at a particular point in time and broad, long-running topics. That paper focused on European Parliament speeches; the proposed topic modeling method has a number of potential applications in the study of politics, including the analysis of speeches in other parliaments, political manifestos, and other more traditional forms of political text. The purpose of [19] is to examine the various effects of dataset selection on political-orientation classifiers; the authors built three datasets, each consisting of a collection of Twitter users with a political orientation. In this approach, the output of an LDA model is used as one of many features fed to an SVM classifier, while another part of the method uses a Labeled LDA (LLDA) as a stand-alone classifier. Their assessment showed that there are some limitations in building labels for non-political user categories. Significant LDA-based research in political science is shown in Table 4.
Fang et al. [21] suggested a new unsupervised topic model based on LDA for contrastive opinion modeling, whose purpose is to find the opinions on a given topic from multiple views and to quantify their differences with qualifying criteria; the model is called the Cross-Perspective Topic (CPT) model. They performed experiments with both qualitative and quantitative measures on two datasets in the political area: the first consists of statement records of U.S. senators, which reveal the political stances of the senators, and the second consists of extracts from world news media, from three representative outlets in the U.S. (New York Times), China (Xinhua News), and India (The Hindu). To compare their approach with other models, they used corrLDA and LDA as two baselines. Yano et al. [137, 138] applied several probabilistic models based on LDA to predict responses to political blog posts. In more detail, they used the topic models LinkLDA and CommentLDA to generate blog data (topics, words of posts), with which they could find relationships between posts, commentators, and their responses. To evaluate their models, they gathered comments and blog posts focused on American politics from 40 blog sites.
In [139], the authors introduced a new application of universal sensing based on mobile phone sensors, using an LDA topic model to discover patterns and analyze the behavior of people who changed their political opinions; they also evaluated the various political opinions of individual residents, with a measure of dynamic homophily that reveals patterns around external political events. To collect data and apply their approach, they provided a mobile sensing platform that captured social interactions and dependent variables during the American presidential campaigns of John McCain and Barack Obama in the last three months of 2008. In [133], the authors analyzed emotional reactions and suggested a novel multi-target model, Multi Community Response LDA (MCR-LDA), which uses sLDA and support vector machine classification to predict comment polarity from post content. To evaluate their approach, they used a dataset of blog posts from five blogs focused on US politics, created by [137].
In [18], the authors suggested a generative model for automatically discovering the latent associations between opinion words and topics, which can be useful for extracting political standpoints, and used an LDA model to reduce the set of adjective words; they showed that the sentences extracted by their model can effectively represent different opinions. They focused on statement records of senators, comprising 15,512 statements from 88 senators, from the Project Vote Smart website. In [134], the authors examined how social and political issues related to the 2012 South Korean presidential election appeared on Twitter and used an LDA method to evaluate the relationship between topics extracted from events and tweets. In [136], the authors proposed a method for evaluating and comparing documents based on an extension of LDA, using the LogicLDA and Labeled LDA approaches for topic modeling. They considered German national elections since 1990 as a dataset and showed that their method consistently outperforms a baseline that simulates manual annotation based on text and keyword evaluation.
Table 5 Impressive LDA-based works in medical and biomedical science

Study | Year | Purpose/problem domain | Dataset
[124] | 2017 | Presented three LDA-based models for adverse drug reaction prediction | ADReCS database
[84] | 2011 | Extract biological terminology | MEDLINE and Bio-Terms Extraction; Chem2Bio2Rdf
[4] | 2017 | User preference distribution discovery and identity distribution of doctor features | Yelp dataset (Yelp.com)
[7] | 2012 | Ranking gene-drug relations; detecting relationships between genes and drugs | National Library of Medicine
[6] | 2011 | Analyzing public health information on Twitter | 20 disease articles of Twitter data
[115] | 2013 | Analysis of user-generated content from social networking sites | One million English posts from Facebook's server logs
[124] | 2013 | Pattern discovery and extraction for clinical processes | A dataset from Zhejiang Huzhou Central Hospital of China
[123] | 2011 | Identifying miRNA-mRNA in functional miRNA regulatory modules | Mouse mammary dataset
[140] | 2011 | Extract common relationships | T2DM clinical dataset
[5] | 2012 | Extract the latent topics in Traditional Chinese Medicine documents | T2DM clinical dataset
3.3 Topic modeling in medical and biomedical
Topic models have been applied to pure biomedical and medical text mining, and researchers have introduced this approach into the fields of medical science; topic modeling can be advantageously applied to the large datasets of biomedical/medical research. Significant LDA-based research on analyzing and evaluating medical and biomedical content is shown in Table 5. In [125], the authors introduced three LDA-like models and found that they have higher accuracy than the state-of-the-art alternatives. The authors demonstrated that this LDA-based approach can successfully uncover the probabilistic patterns between adverse drug reaction (ADR) topics, and they used the ADReCS database to evaluate their approach; their aim was to predict ADRs from a large number of ADR candidates in order to obtain drug targets. In [4], the authors focused on the issue of professionalized medical recommendations and introduced a new healthcare recommendation system called iDoctor, which uses hybrid matrix factorization for personalized doctor recommendation. They adopted an LDA topic model to extract topics from doctor features and to analyze document similarity. The dataset for that article was collected from a crowd-sourced website called Yelp. Their results showed that iDoctor can increase the accuracy of health recommendations and achieves higher prediction quality for user ratings.
In [85], the authors suggested an LDA-based approach called Bio-LDA that can identify biological terminology to obtain latent topics. They showed that the approach can be applied in different studies, such as association search, association prediction, and connectivity map generation, and that Bio-LDA can enhance the application of molecular-binding techniques such as heat maps. In [7], the authors proposed a topic model for ranking gene-drug relations using probabilistic KL distance and LDA, called LDA-PKL, and showed that the suggested model achieved a higher Mean Average Precision (MAP) for ranking and detecting pharmacogenomics (PGx) relations. To analyze and apply their approach, they used a dataset from the National Library of Medicine. In [6], the authors presented the Ailment Topic Aspect Model (ATAM) for the analysis of more than one and a half million tweets on public health, focusing on a specific question with specific models: "What public health information can be learned from Twitter?"
In [124], the authors introduced an LDA-based method to discover patterns of internal treatment in clinical processes (CPs); detecting such hidden patterns is currently one of the most serious elements of clinical process evaluation. Their main approach is to obtain care-flow logs and estimate hidden patterns from the gathered logs based on LDA. The identified patterns can be applied to classify and discover clinical activities with the same medical treatment. To test the potential of their approach, they used a dataset collected from Zhejiang Huzhou Central Hospital of China. In [123], the authors introduced a model for the discovery of functional miRNA regulatory modules (FMRMs) that merges heterogeneous datasets, including expression profiles of both miRNAs and mRNAs, with or without using prior target-binding information. The model is based on Correspondence Latent Dirichlet Allocation (Corr-LDA). As an evaluation dataset, they applied their method to mouse-model expression datasets to study human breast cancer, and found that the model is able to capture different biologically meaningful relationships. In [140], the authors studied Chinese medicine (CM) diagnosis via topic modeling and introduced a model based on the Author-Topic model, called the Symptom-Herb-Diagnosis topic (SHDT) model, to detect CM diagnoses from the clinical information of diabetes patients. The evaluation dataset was collected from 328 diabetes patients. The results indicated that the SHDT model can discover typical symptom and herb-prescription topics for a number of important comorbid diseases (such as heart disease and diabetic kidney disease) [140].
3.4 Topic modeling in geographical and locations
There is a significant body of research on geographical topic modeling. Past work has shown that topic modeling based on location and textual information can be effective for discovering geographical topics and for geographical topic analysis. Table 6 shows significant LDA-based research on topic modeling and content analysis for geographical issues. In [16], the authors examine the issue of extracting topics from geographic information and GPS-related documents. They suggested a new location-text method, a combination of topic modeling and geographical clustering, called LGTA (Latent Geographical Topic Analysis). To test their approach, they collected a set of data from the Flickr website covering various topics. In [17], the authors introduced a novel method
Table 6 Impressive LDA-based works in geographical applications and locations

Study | Year | Purpose | Dataset
[17] | 2010 | Discovering multi-faceted summaries of documents | CoPhIR dataset
[16] | 2011 | Content management and retrieval | Flickr dataset
[15] | 2013 | Semantic clustering in very high resolution panchromatic satellite images | A QuickBird image of a suburban area
[14] | 2010 | Data discovery, evaluation of geographically coherent linguistic regions, and finding the relationship between topic variation and region | A Twitter dataset
[13] | 2008 | Geo-located image categorization and geo-recognition | 3013 Panoramio images in France
[141] | 2015 | Cluster discovery in geo-locations | Reuters-21578
[142] | 2014 | Discovering newsworthy information from Twitter | A small Twitter dataset
based on multi-modal Bayesian models, called GeoFolk, to describe social media by merging text features and spatial knowledge. In general terms, this method can be considered an extension of Latent Dirichlet Allocation (LDA). They used the available standard CoPhIR dataset, which contains over 54 million Flickr entries. The GeoFolk model can be used in quality-oriented applications and can be merged with models from the Web 2.0 social domain. In [15], the authors proposed a multiscale LDA model, a combination of multiscale image representation and a probabilistic topic model, to achieve effective clustering of VHR satellite images.
In [14], the authors introduced a model that includes two sources of lexical variation, geographical area and topic; in other words, the model can discover geographically coherent words in different linguistic regions and find relationships between regions and topic variation. To test their model, they gathered a dataset from Twitter; the model can also infer an author's geographic location from raw text [15-17]. In [13], the authors suggested a statistical model for the classification of geo-located images based on latent representations. In this model, the content of a geo-located database can be visualized by means of a few selected images for each geo-category. The model can be considered an extension of probabilistic Latent Semantic Analysis (pLSA). They built a geo-located image database containing 3013 Panoramio images related to southeastern France.
In addition, [143] designed a browsing system (GeoVisNews) and proposed an extended matrix factorization method based on geographic information, called Ordinal Correlation Consistent Matrix Factorization (OCCMF), for ranking locations, analyzing location relevance, and obtaining intra-relations between documents and locations. For evaluation, the authors used a large multimedia news dataset and showed that their system outperforms the Story Picturing Engine and Yahoo News Map.
In [141], the authors focused on the issue of identifying the textual topics of clusters of spatial objects with text descriptions. They presented combined methods, based on a clustering method and a topic model, to discover textual object clusters from documents with geo-locations; specifically, they used a probabilistic generative model (LDA) and the DBSCAN algorithm to find topics in documents, and they utilized Reuters-21578 as the dataset for analyzing their methods. In [142], the authors presented a study on characterizing significant reports from Twitter. They introduced a probabilistic model for topic discovery in the geographical topic area that can find hidden significant events on Twitter, and they applied stochastic variational inference (SVI) for gradient ascent on the variational objective with LDA. They collected 2,535 geo-tagged tweets from the Upper Manhattan area of New York and found that KL divergence is a good metric for identifying significant tweet events, although for a large dataset of news articles the result can be negative.
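The KL-divergence signal used for event detection is easy to state concretely; the sketch below compares an observed topic distribution against a background one (the distributions are made up, and the smoothing constant is an assumption to avoid log-of-zero issues):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete topic distributions.

    A larger value means the observed tweet-topic distribution p has
    drifted further from the background distribution q, the kind of
    signal used to flag a significant event.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

background = [0.25, 0.25, 0.25, 0.25]  # typical day over 4 topics
event_day = [0.70, 0.10, 0.10, 0.10]   # one topic suddenly dominates
drift = kl_divergence(event_day, background)
```

KL divergence is zero only when the two distributions coincide, so thresholding the drift value gives a simple event detector.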
3.5 Software engineering and topic modeling
Software evolution and source code analysis can be effective in solving current and future software engineering problems. Topic modeling has been used in information retrieval and text mining, where it has been applied to the problem of summarizing large text corpora. Recently, many articles have been published on evaluating and mining software using LDA-based topic modeling; Table 7 shows significant LDA-based research on source code analysis and other related issues in software engineering. In [8], the authors used LDA for the first time to extract topics from source code and to visualize software similarity; in other words, LDA offers an intuitive approach to calculating the similarity between source files by comparing their respective distributions over topics. They applied their method to 1,555 software projects from Apache and SourceForge, comprising 19 million source lines of code (SLOC), and demonstrated that the approach can be effective for project organization and software refactoring. In addition, [9] introduced a new coupling metric called Relational Topic-based Coupling (RTC), based on Relational Topic Models (RTM), which can identify latent topics and analyze the relationships between latent topic distributions in software data; RTM is an extension of LDA. The authors used thirteen open source software systems to evaluate this metric and demonstrated that RTC is useful and valuable for the analysis of large software systems.
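The similarity idea in [8], comparing source files through their per-file topic distributions, can be illustrated with a standard distance between probability vectors. Hellinger distance is used here as one common choice (the paper's exact measure may differ), and the topic vectors below are invented:

```python
import math

def hellinger_similarity(theta_a, theta_b):
    """Similarity of two source files from their LDA topic distributions.

    Hellinger distance lies in [0, 1] for probability vectors, so
    1 - distance gives a similarity where 1 means identical mixtures.
    """
    dist = math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(theta_a, theta_b)))
    return 1.0 - dist

# Hypothetical per-file topic mixtures over 3 topics.
parser_file = [0.80, 0.10, 0.10]  # mostly a "parsing" topic
lexer_file = [0.70, 0.20, 0.10]   # similar concern
gui_file = [0.05, 0.05, 0.90]     # unrelated concern

sim_related = hellinger_similarity(parser_file, lexer_file)
sim_unrelated = hellinger_similarity(parser_file, gui_file)
```

Files that share a dominant topic score close to 1, so clustering on this similarity groups code by conceptual concern rather than by token overlap.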
The work in [10] focused on software traceability through topic modeling and proposed an approach combining an LDA model with automated link capture. They applied their method to several data sets, demonstrated how topic modeling increases software traceability, and found that the approach scales to larger numbers of artifacts. The study in [11] examined the challenges of using topic models to mine software repositories and to detect the evolution of topics in source code, and suggested applying statistical topic models (LDA) to the discovery of textual repositories. Statistical topic models have various applications in software engineering, such as bug prediction, traceability link recovery, and software evolution analysis. Chen et al. in [26] used a generative statistical model (LDA) to analyze source code and find relationships between software defects and software development. They showed LDA
24 Hamed Jelodar et al.
Table 7 Impressive LDA-based works in software engineering

Study | Year | Purpose | Dataset
[8] | 2007 | Mining software and extracting concepts from code | SourceForge and Apache (1,555 projects)
[9] | 2010 | Identifying latent topics and their relationships in source code | Thirteen open source software systems
[10] | 2010 | Generating traceability links | ArchStudio software project
[26] | 2012 | Finding relationships between conceptual concerns in source code | Source code entities
[25] | 2008 | Analyzing software evolution | Open source Java projects, Eclipse and ArgoUML
[22] | 2009 | Automatic categorization of software systems | 43 open-source software systems
[23] | 2008 | Source code retrieval for bug localization | Mozilla, Eclipse source code
[24] | 2010 | Automatic bug localization and evaluation of its effectiveness | Open source software (e.g., Rhino and Eclipse)
[144] | 2017 | Detection of malicious Android apps | 1,612 malicious applications
can easily scale to large document collections, applying their approach to three large datasets: Mozilla Firefox, Mylyn, and Eclipse. Linstead et al. in [25] applied Author-Topic (AT) models to source code analysis. AT modeling is an extension of the LDA model that captures the relationship of authors to topics; they applied their method to the Eclipse 3.0 source code, comprising 700,000 lines of code in 2,119 source files from 59 developers. They demonstrated that topic models provide an effective statistical basis for evaluating developer similarity.
The work in [22] introduced an LDA-based method for automatically categorizing software systems, called LACT. To evaluate LACT, they used 43 open-source software systems in different programming languages and showed that LACT can categorize software systems based on the type of programming language. In addition, [23, 34] proposed an LDA-based topic modeling approach for bug localization. They applied the idea to the analysis of the same bugs in Mozilla and Eclipse, and the results showed that their LDA-based approach is better than LSI for evaluating and analyzing bugs in these code bases. In [144], the authors introduced a topic-specific approach that combines app descriptions with sensitive data flow information, using an advanced LDA-based topic model with a genetic algorithm (GA) to understand malicious apps and to cluster apps according to their descriptions. They applied their approach to 3,691 benign and 1,612 malicious applications and found that topic-specific data flow signatures are very efficient and useful in highlighting malicious behavior.
Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey 25
[Figure 3 depicts a pipeline: tweet messages, the user profile, and historical information are preprocessed for topic modeling; LDA with a parameter estimation method produces topic distributions; hashtags are then ranked by a user-interest score and a similarity score and recommended to each user.]

Fig. 3 A simple framework based on LDA to generate tags as a recommendation system on Twitter
3.6 Topic modeling in social network and microblogs
Social networks are a rich source for knowledge discovery and behavior analysis. For example, Twitter is one of the most popular social networks, and its analysis can be very effective for studying user behavior. Figure 3 shows a simple LDA-based design for building a tag recommendation system on Twitter. Recently, researchers have proposed many LDA approaches for analyzing user tweets. Weng et al. concentrated on identifying influential twitterers and proposed an approach based on an extension of the PageRank algorithm to rate users, called TwitterRank, using an LDA model to find latent topic information in a large collection of documents. To evaluate this approach, they prepared a dataset of the top 1,000 Singapore-based twitterers and showed that their approach outperforms related algorithms [145]. Hong et al. examined the problem of predicting message popularity, measured by the count of future retweets. The authors used TF-IDF scores as a baseline and used Latent Dirichlet Allocation (LDA) to calculate the topic distributions of messages. They collected a dataset of 2,541,178 users and 10,612,601 messages and demonstrated that this method can identify messages that will attract thousands of retweets [146]. Table 8 shows some significant LDA-based research on topic modeling and user behavior analysis in social networks and microblogs.
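The hashtag-ranking step of the Fig. 3 pipeline can be sketched as a blend of a content similarity score and a user-interest score over LDA topic vectors. All names, weights, and topic distributions below are illustrative assumptions, not taken from any of the surveyed systems:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two topic vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical LDA topic distributions over 3 topics (illustrative values).
tweet_topics = np.array([0.6, 0.3, 0.1])      # the new tweet to be tagged
user_interest = np.array([0.5, 0.4, 0.1])     # inferred from the user's history
hashtag_topics = {                            # from past tweets carrying each tag
    "#bigdata":  np.array([0.7, 0.2, 0.1]),
    "#fitness":  np.array([0.1, 0.2, 0.7]),
    "#politics": np.array([0.3, 0.6, 0.1]),
}

def rank_hashtags(tweet, interest, candidates, alpha=0.7):
    """Blend content similarity with user interest, as in the Fig. 3 pipeline."""
    scores = {tag: alpha * cosine(tweet, vec) + (1 - alpha) * cosine(interest, vec)
              for tag, vec in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)

print(rank_hashtags(tweet_topics, user_interest, hashtag_topics)[0])  # #bigdata
```

The weight `alpha` controls the trade-off between matching the tweet's content and matching the user's long-term interests; the surveyed systems learn such scores rather than fixing them by hand.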
The authors in [147] focused on topical recommendations on Twitter and presented a novel methodology for discovering the topics of interest of a Twitter user. In fact, they used a Labeled Latent Dirichlet Allocation (L-LDA) model to discover la-
Table 8 Impressive LDA-based works in social networks and microblogs

Study | Year | Purpose | Dataset
[145] | 2010 | Finding influential twitterers on a social network (Twitter) | Top-1000 Singapore-based twitterers
[147] | 2014 | Building topical recommendation systems | A Twitter dataset
[48] | 2014 | A recommendation system for Twitter | A Twitter dataset
[148] | 2012 | Analyzing and discovering events on Twitter | A Twitter dataset
[91] | 2014 | Analyzing public sentiment variations regarding a certain target on Twitter | A Twitter dataset
[49] | 2012 | Analyzing emotional and stylistic distributions on Twitter | A Twitter dataset
[149] | 2016 | A topic-enhanced word embedding for Twitter sentiment classification | SemEval-2014
[150] | 2016 | Categorizing emotion tendency on Sina Weibo | A Sina Weibo dataset
[151] | 2016 | Ranking items and summarizing news based on hLDA and a maximum spanning tree | A large news dataset (CNN, BBC, ABC and Google News)
tent topics between two tweet sets. The authors found that their method outperforms content-based methods for discovering user interests. In addition, [48] proposed an LDA-based recommendation system, called TWILITE, for capturing the behavior of users on Twitter. In more detail, TWILITE calculates the topic distributions of users and tweet messages, and the authors introduced ranking algorithms to recommend the top-K followers for each user on Twitter.
The work in [121] investigated criminal incident prediction on Twitter. The authors proposed an approach for analyzing and understanding Twitter posts based on a probabilistic language model combined with a generalized linear regression model. Their evaluation showed that this approach is capable of predicting hit-and-run crimes using only information contained in the content of the training set of tweets. The paper [152] introduced a novel LDA-based method for hashtag recommendation on Twitter that can categorize posts by their hashtags. Lin et al. in [153] investigated the cold-start problem by using social information for app recommendation on Twitter, applying an LDA model to discover latent groups from "Twitter personalities" for recommendation discovery. For their experiments, they used Apple's iTunes App Store and Twitter as datasets. Experimental results show that their approach is significantly better than other state-of-the-art recommendation techniques.
The authors in [148] presented a technique to analyze and discover events with an LDA model. They found that this method can detect events in topics inferred from tweets by wavelet analysis. For evaluation, they collected 13.6 million tweets from Twitter as a dataset and showed that using both hashtag names and inferred topics has a beneficial effect on the descriptive information for events. In addition, [154] investigated how to effectively discover health-related topics on Twitter and presented an LDA model that identifies latent topic information from a dataset of 2,231,712 messages from 155,508 users. They found that this method may be a valuable tool for monitoring public health on Twitter. Tan
et al. in [91] focused on tracking and modeling public sentiment on Twitter. They suggested a topic model approach based on LDA, Foreground and Background LDA, to distill foreground topics, and proposed a further method, called Reason Candidate and Background LDA (RCB-LDA), for ranking a set of reason candidates expressed in natural language. Their results showed that these models can be used to identify special topics and uncover different aspects. The authors in [49] collected a large Twitter corpus covering seven emotions: disgust, anger, fear, love, joy, sadness, and surprise. They used a probabilistic topic model based on LDA to discover emotions in a corpus of Twitter conversations. Srijith et al. in [155] proposed a probabilistic topic model based on hierarchical Dirichlet processes (HDP) for sub-story detection. They compared HDP with spectral clustering (SC) and locality-sensitive hashing (LSH) and showed that HDP is very effective on story detection datasets, with an improvement of up to 60% in F-score.
The work in [149] proposed a method for Twitter sentiment classification using topic-enhanced word embeddings: an LDA model generates the topic distribution of each tweet, and an SVM performs the sentiment classification. They used the SemEval-2014 dataset from the Twitter Sentiment Analysis Track; experiments show that their model reaches 81.02% in macro F-measure, demonstrating that topic-enhanced word embeddings are very effective for sentiment classification on Twitter. The work in [156] examined the demographic characteristics of Trump followers on Twitter. The authors used a negative binomial regression model to model the "likes" and LDA to extract topics from Trump's tweets. They evaluated on the US2016 (Twitter) dataset, which includes the followers of all the candidates in the 2016 United States presidential election.
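The feature-construction idea behind such topic-enhanced classifiers — concatenating surface features with an LDA topic vector before feeding a classifier — can be sketched as follows. The vectors and vocabulary size are illustrative stand-ins, not the features of [149]:

```python
import numpy as np

# Toy stand-ins: a bag-of-words vector and an LDA topic distribution for one
# tweet. A real system would infer theta with a trained LDA model and use
# word embeddings rather than raw counts.
bow = np.array([1.0, 0.0, 2.0, 1.0])    # term counts over a 4-word vocabulary
theta = np.array([0.8, 0.15, 0.05])     # tweet's distribution over 3 topics

def topic_enhanced_features(bow_vec, topic_vec):
    """Concatenate normalized surface features with topic features,
    producing the input for a downstream classifier such as an SVM."""
    bow_norm = bow_vec / max(bow_vec.sum(), 1.0)   # normalize counts
    return np.concatenate([bow_norm, topic_vec])

x = topic_enhanced_features(bow, theta)
print(x.shape)  # (7,)
```

The appeal of the design is that the topic block injects corpus-level context (what the tweet is about) that short, sparse surface features alone cannot carry.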
3.7 Crime prediction/evaluation
Over time, topic modeling has found applications in an increasing range of sciences. In recent work, some researchers have applied topic modeling methods to crime prediction and analysis. Table 9 shows some significant LDA-based research on topic discovery and crime activity analysis. The work in [119] introduced an early-warning system to detect crime activity intention based on an LDA model and a collaborative representation classifier (CRC). The system has two steps: first, LDA learns and extracts features that represent the article sources; next, the features obtained from LDA are used to classify a new document with the collaborative representation classifier (CRC). The work in [120] used statistical topic modeling based on LDA to identify discussion topics across a major city in the United States, and used kernel density estimation (KDE) techniques for standard crime prediction. Sharma et al. introduced an approach based on a geographical model of crime intensities to detect the safest path between two locations, using a simple naive Bayes classifier with features derived from an LDA model [157].
Table 9 Impressive LDA-based works in crime prediction

Study | Year | Purpose | Dataset
[121] | 2012 | Automatic semantic analysis of Twitter posts | A corpus of tweets from Twitter (manually collected)
[120] | 2014 | Crime prediction using tagged tweets | City of Chicago Data (data.cityofchicago.org)
[119] | 2015 | Detecting crime activity intention | 800 news articles from Yahoo Chinese news
Table 10 Some well-known tools for topic modeling

Tool | Implementation/Language | Inference/Parameter estimation | Source code availability
Mallet [159] | Java | Gibbs sampling | http://mallet.cs.umass.edu/topics.php
TMT [160] | Java | Gibbs sampling | https://nlp.stanford.edu/software/tmt/tmt-0.4/
Mr.LDA [71] | Java | Variational Bayesian inference | https://github.com/lintool/Mr.LDA
JGibbLDA [161] | Java | Gibbs sampling | http://jgibblda.sourceforge.net/
Gensim [162] | Python | Variational Bayesian inference | https://radimrehurek.com/gensim
TopicXP | Java (Eclipse plugin) | — | http://www.cs.wm.edu/semeru/TopicXP/
Matlab Topic Modeling Toolbox [163] | Matlab | Gibbs sampling | http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm
Yahoo LDA [165] | C++ | Gibbs sampling | https://github.com/shravanmn/Yahoo_LDA
lda in R [164] | R | Gibbs sampling | https://cran.r-project.org/web/packages/lda/
4 Open-source libraries, tools, datasets, and software packages for analysis
We need new tools to help us organize, search, and understand these vast amounts of information. In this section, we introduce some impressive datasets and tools for topic modeling.
4.1 Library and Tools
Many tools for topic modeling and analysis are available, including professional and amateur software, commercial software, and open source software; there are also many popular datasets that serve as standard sources for testing and evaluation. Table 10 shows some well-known tools for topic modeling, and Table 11 shows some well-known datasets. For example, the MALLET topic model package incorporates an extremely fast and highly scalable implementation of Gibbs sampling, along with efficient methods for document-topic hyperparameter optimization and for inferring topics in new documents given trained models. Topic models provide a simple way to analyze huge volumes of unlabeled text: a "topic" consists of a group of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings [158].
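As a toy illustration of the collapsed Gibbs sampling that packages such as MALLET implement at scale, the following minimal pure-NumPy sampler recovers two artificial word "themes" from a six-word vocabulary. Hyperparameters and data are illustrative; production tools add sparsity tricks and hyperparameter learning:

```python
import numpy as np

def lda_gibbs(docs, n_topics, n_vocab, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (illustrative, not optimized)."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))   # doc-topic counts
    nkw = np.zeros((n_topics, n_vocab))     # topic-word counts
    nk = np.zeros(n_topics)                 # topic totals
    z = []                                  # current topic of every token
    for d, doc in enumerate(docs):          # random initialization
        zs = rng.integers(n_topics, size=len(doc))
        z.append(zs)
        for w, t in zip(doc, zs):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                 # remove the token's assignment
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # full conditional P(z = t | rest)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_vocab * beta)
                t = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = t                 # resample and restore counts
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return ndk, nkw

# Two artificial themes: words 0-2 vs. words 3-5.
docs = [[0, 1, 2, 0, 1], [1, 2, 0, 2], [3, 4, 5, 3], [4, 5, 3, 5, 4]]
ndk, nkw = lda_gibbs(docs, n_topics=2, n_vocab=6)
print(ndk.argmax(axis=1))  # the first two docs typically share one topic, the last two the other
```

Each sweep removes one token's topic assignment, recomputes its full conditional from the remaining counts, and resamples — exactly the loop that Gibbs-based tools in Table 10 run over millions of tokens.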
For evaluation and testing, researchers have released many datasets in various subjects, sizes, and dimensions for public access and other
Table 11 Some well-known datasets for topic modeling

Dataset | Language | Published | Short detail | Availability
Reuters-21578 [166] | English | 1997 | Newsletters in various categories | http://kdd.ics.uci.edu/databases/reuters21578/reuters21578
Reuters RCV1 (Reuters Volume I) [167] | English | 2004 | Newsletters in various categories | —
UDI-TwitterCrawl-Aug2012 [168] | English | 2012 | A Twitter dataset of millions of tweets | https://wiki.illinois.edu//wiki/display/forward/Dataset-UDI-TwitterCrawl-Aug2012
SemEval-2013 Dataset [169] | English | 2013 | A Twitter dataset of millions of tweets | —
Wiki10+ [179] | English | 2009 | Wikipedia documents in various categories | http://nlp.uned.es/social-tagging/wiki10+/
Weibo dataset [170] | Chinese | 2013 | A popular Chinese microblogging network | —
Bag of Words | English | 2008 | A multi-dataset (PubMed abstracts, KOS blog, NYTimes news, NIPS full papers, Enron emails) | https://archive.ics.uci.edu/ml/datasets/Bag+of+Words
CiteULike [171] | English | 2011 | A bibliography-sharing service for academic papers | http://www.citeulike.org/faq/data.adp
DBLP Dataset [172] | English | — | A bibliographic database of computer science journals | https://hpi.de/naumann/projects/repeatability/datasets/dblp-dataset.html
HowNet lexicon | Chinese | 2000-2013 | A Chinese machine-readable dictionary / lexical knowledge base | http://www.keenage.com/html/e_index.html
Virastyar, Persian lexicon [173] | Persian | 2013 | Persian poems and electronic lexica | http://ganjoor.net/, http://www.virastyar.ir/data/
NIPS abstracts | English | 2016 | Word distributions from the full text of NIPS conference papers (1987 to 2015) | https://archive.ics.uci.edu/ml/datasets/NIPS+Conference+Papers+1987-2015
Ch-Wikipedia [114][174] | Chinese | — | A Chinese corpus from Chinese Wikipedia | —
Pascal VOC 2007 [175][176] | English | 2007 | A dataset of natural images | http://host.robots.ox.ac.uk/pascal/VOC/voc2007/
AFP ARB corpus [177] | Arabic | 2001 | A collection of newspaper articles in Arabic from Agence France-Presse | —
20Newsgroups corpus [178] | English | 2008 | Newsletters in various categories | http://qwone.com/~jason/20Newsgroups/
New York Times (NYT) dataset [179] | English | 2008 | Newsletters in various categories | —
future work. Due to the importance of this research, we examined the well-known datasets from previous work; Table 11 lists some famous and popular datasets in various languages.
5 Discussion: seven important challenges
There are open challenges and discussions that can be considered as future work in topic modeling. According to our studies, some issues require further research and could be very fruitful directions. In this section, we discuss seven important issues that, in our view, have not yet been sufficiently solved; these are the gaps in the reviewed work that point to directions for future research.
5.1 Topic modeling in image processing, image classification, and annotation
Image classification and annotation are important problems in computer vision, but they are rarely considered together, and predicting classes calls for an intelligent approach. For example, an image of the class "highway" is more likely to be annotated with the words "road", "traffic", and "car" than with "fish", "scuba", or "boat". The work in [180] developed a new probabilistic model for jointly modeling an image, its annotations, and its class label. Their model treats the class label as a global description of the image and the annotation terms as local descriptions of parts of the image; its underlying probabilistic assumptions naturally integrate these sources of information. They derived approximate inference and estimation algorithms based on variational methods, as well as efficient approximations for annotating and classifying new images, and extended
supervised topic modeling (sLDA) to classification problems.
Lienou et al. in [181] focused on the semantic interpretation of large satellite images using topic modeling, treating each segment of an image as a word and each image as a document. For evaluation, they performed experiments on panchromatic QuickBird images. Philbin et al. in [182] proposed a geometrically consistent latent topic model, geometric Latent Dirichlet Allocation (gLDA), to detect significant objects, and introduced efficient methods for computing a matching graph in which images are the nodes and edge strength reflects shared visual content. The gLDA method can group images of a specific object despite large imaging variations and can also pick out different views of a single object. The work in [183] introduced a semi-automatic approach to latent information retrieval based on the hierarchical structure of images, combining an LDA model with invariant descriptors of image regions for visual scene modeling. Wick et al. in [184] presented an error-correction algorithm using LDA-based topic modeling for optical character recognition (OCR). The algorithm combines two models: a topic model to calculate word probabilities and an OCR model to obtain the probability of character errors. In addition, topic models can be combined with matrix factorization methods for image understanding, tag assignment, and semantic discovery from social image datasets [185-186].
5.2 Audio and music information retrieval and processing
To our knowledge, little research has been done on music information analysis using topic modeling. For example, [187] proposed a modified version of LDA for processing continuous data and for audio retrieval. In this model, each audio document contains various latent topics, and each topic is modeled as a Gaussian distribution over the audio feature data. To evaluate the model's efficiency, they used 1,214 audio documents in various categories (such as rain, bell, river, laugh, gun, and dog). The authors in [188] focused on estimating singing characteristics from audio signals. The paper introduces topic modeling for vocal timbre analysis, in which each song is considered a weighted mixture of multiple topics. In this approach, vocal timbre features are first extracted from polyphonic music, and an LDA model then estimates the mixing weights of the topics. For evaluation, they used 36 songs by 12 Japanese singers.
5.3 Drug safety evaluation and approaches to improving it
Understanding drug safety and performance remains a critical and challenging issue for academia, and it is important in new drug discovery. Topic modeling holds potential for mining biological documents, and given the importance and magnitude of this issue, researchers can consider it for future work. Bisgin
et al. in [189] introduced an "in silico" framework for drug repositioning guided by a probabilistic graphical model, in which a drug is defined as a "document" and each phenotype of a drug as a "word". They applied their approach to the SIDER database to estimate the phenome distribution of drugs, identified 908 SIDER drugs with potential new indications, and demonstrated that the model can be effective for further investigation. Yu et al. in [191] investigated drug-induced acute liver failure (ALF), considering the role of topic modeling in drug safety evaluation; they explored the LiverTox database to discover drugs with the capacity to cause ALF. Yang et al. introduced an automatic approach based on keyphrase extraction to detect consumer health expressions related to adverse drug reactions (ADRs) in social media, using an LDA model for feature-space modeling to build a topic space over the consumer corpus and mine consumer health expressions [189-191].
5.4 Analysis of comments of famous personalities, social demographics
In public social media and micro-blogging services, most notably Twitter, people have found a venue to hear and be heard by their peers without an intermediary. As a consequence, and helped by the public nature of Twitter, political scientists now potentially have the means to evaluate and understand the narratives that organically form, spread, and decline among the public during a political campaign. As an impressive recent example, Wang et al. introduced a framework to derive the topic preferences of Donald Trump's followers on Twitter and used LDA to infer the weighted topic mixture of each Trump tweet. In addition, we can refer to [156, 192-194].
5.5 Group discovery and topic modeling
Graph mining and social network analysis in large graphs is a challenging problem. Group discovery has many applications, such as understanding the social structure of organizations, uncovering criminal organizations, and modeling large-scale social networks in the Internet community. LDA models can be an efficient method for discovering latent group structure in large networks. In [116], the authors proposed a scalable Bayesian alternative based on LDA and graphs for group discovery in a big real-world graph; for evaluation, they collected three datasets from PubMed. In [117], the authors introduced a generative approach using a hierarchical Bayes model for group discovery in social media analysis, called the Group Latent Anomaly Detection (GLAD) model. This model merges ideas from both the LDA model and the Mixed Membership Stochastic Blockmodel (MMSB).
5.6 User Behavior Modeling
Social media provides valuable resources for analyzing user behaviors and capturing user preferences. Since user-generated data (such as user activities and interests) in social media is challenging to model [195-196], topic modeling techniques such as LDA can play an important role in discovering hidden structures related to user behavior. Although some topic modeling approaches have been proposed for user behavior modeling, many open questions and challenges remain. For example, Giri et al. in [197] introduced a novel unsupervised topic model for discovering users' hidden interests and analyzing their browsing behavior in a cellular network, which can be very effective for mobile advertising and online recommendation systems. In addition, we can refer to [198-200].
5.7 Visualizing topic models
Although different approaches, including machine learning, have been investigated to support the visualization of text in large document collections, visualizing big text data remains an open challenge in text mining. Among the few studies in this direction [201-205], Chuang et al. in [205] proposed a topic tool, called Termite, based on a novel visualization technique for evaluating textual topics in topic modeling. The tool visualizes the topic-term distributions of LDA in a matrix layout. The authors used two measures for identifying the useful terms of a topic model, "saliency" and "distinctiveness", obtained from the Kullback-Leibler divergence between the topic distribution given a term and the marginal topic distribution. The tool can improve the interpretability of topical results and produce legible output.
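The two Termite measures can be computed directly from a fitted model: distinctiveness of a word w is KL(P(T|w) || P(T)), and saliency weights it by P(w). Below is a minimal sketch with hypothetical topic-word probabilities (the numbers are illustrative, not from any real corpus):

```python
import numpy as np

# Hypothetical P(w|t) for 2 topics over a 4-word vocabulary, equal topic weights.
phi = np.array([[0.50, 0.30, 0.10, 0.10],
                [0.05, 0.30, 0.15, 0.50]])
p_topic = np.array([0.5, 0.5])

p_word = p_topic @ phi                                # marginal P(w)
p_topic_given_w = (phi * p_topic[:, None]) / p_word   # Bayes: P(t|w)

def distinctiveness(w):
    """KL( P(T|w) || P(T) ): how informative word w is about the topics."""
    p = p_topic_given_w[:, w]
    return float(np.sum(p * np.log(p / p_topic)))

saliency = [p_word[w] * distinctiveness(w) for w in range(phi.shape[1])]

# Word 1 appears with equal probability in both topics, so P(T|w1) = P(T)
# and its distinctiveness is exactly zero — the least useful term to display.
print(min(range(4), key=distinctiveness))  # 1
```

Ranking and filtering the vocabulary by saliency is what lets a matrix-layout tool like Termite show only the terms that actually discriminate between topics.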
6 Conclusion
Topic models play an important role in computer science for text mining. In topic modeling, a topic is a list of words that occur together in statistically meaningful ways. A text can be an email, a book chapter, a blog post, a journal article, or any other kind of unstructured text. Topic models do not understand the meaning of the words in text documents. Instead, they assume that every piece of text is composed by selecting words from probable baskets of words, where each basket corresponds to a topic. The algorithm repeats this process over and over until it settles on the most probable assignment of words into baskets, which we call topics. Topic modeling can provide a useful view of a large collection in terms of the collection as a whole, the individual documents, and the relationships between the documents. In this paper, we investigated highly cited scholarly articles (between 2003 and 2016) related to topic modeling based on LDA in various sciences. Given the importance of the research,
we believe this paper can serve as a significant source and a good starting point for researchers and future work on text mining with LDA-based topic modeling.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (61170035, 61272420, 81674099, 61502233), the Fundamental Research Funds for the Central Universities (30916011328, 30918015103), and the Nanjing Science and Technology Development Plan Project (201805036).
References
1. Blei, D.M., A.Y. Ng, and M.I. Jordan, Latent Dirichlet allocation. Journal of Machine Learning Research, 2003. 3(Jan): p. 993-1022.
2. Sun, S., C. Luo, and J. Chen, A review of natural language processing techniques for opinion mining systems. Information Fusion, 2017. 36: p. 10-25.
3. Xu, Z., et al., Crowdsourcing based social media data analysis of urban emergency events. Multimedia Tools and Applications, 2017. 76(9): p. 11567-11584.
4. Zhang, Y., et al., iDoctor: Personalized and professionalized medical recommendations based on hybrid matrix factorization. Future Generation Computer Systems, 2017. 66: p. 30-35.
5. Jiang, Z., et al. Using link topic model to analyze traditional Chinese medicine clinical symptom-herb regularities. in e-Health Networking, Applications and Services (Healthcom), 2012 IEEE 14th International Conference on. 2012. IEEE.
6. Paul, M.J. and M. Dredze, You are what you Tweet: Analyzing Twitter for public health. ICWSM, 2011. 20: p. 265-272.
7. Wu, Y., et al. Ranking gene-drug relationships in biomedical literature using latent Dirichlet allocation. in Pacific Symposium on Biocomputing. 2012. NIH Public Access.
8. Linstead, E., et al. Mining concepts from code with probabilistic topic models. in Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering. 2007. ACM.
9. Gethers, M. and D. Poshyvanyk. Using relational topic models to capture coupling among classes in object-oriented software systems. in Software Maintenance (ICSM), 2010 IEEE International Conference on. 2010. IEEE.
10. Asuncion, H.U., A.U. Asuncion, and R.N. Taylor. Software traceability with topic modeling. in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1. 2010. ACM.
11. Thomas, S.W. Mining software repositories using topic models. in Proceedings of the 33rd International Conference on Software Engineering. 2011. ACM.
12. Thomas, S.W., et al. Modeling the evolution of topics in source code histories. in Proceedings of the 8th working conference on mining software repositories. 2011. ACM.
13. Cristani, M., et al. Geo-located image analysis using latent representations. in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. 2008. IEEE.
14. Eisenstein, J., et al. A latent variable model for geographic lexical variation. in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. 2010. Association for Computational Linguistics.
15. Tang, H., et al., A multiscale latent Dirichlet allocation model for object-oriented clustering of VHR panchromatic satellite images. IEEE Transactions on Geoscience and Remote Sensing, 2013. 51(3): p. 1680-1692.
16. Yin, Z., et al. Geographical topic discovery and comparison. in Proceedings of the 20th international conference on World Wide Web. 2011. ACM.
17. Sizov, S. Geofolk: latent spatial semantics in web 2.0 social media. in Proceedings of the third ACM international conference on Web search and data mining. 2010. ACM.
18. Chen, B., et al. What Is an Opinion About? Exploring Political Standpoints Using Opinion ScoringModel. in AAAI. 2010.
19. Cohen, R. and D. Ruths. Classifying political orientation on Twitter: It’s not easy! in ICWSM. 2013.20. Greene, D. and J.P. Cross. Unveiling the Political Agenda of the European Parliament Plenary: A
Topical Analysis. in Proceedings of the ACM Web Science Conference. 2015. ACM.21. Fang, Y., et al. Mining contrastive opinions on political texts using cross-perspective topic model. in
Proceedings of the fifth ACM international conference on Web search and data mining. 2012. ACM.22. Tian, K., M. Revelle, and D. Poshyvanyk. Using latent dirichlet allocation for automatic categoriza-
tion of software. in Mining Software Repositories, 2009. MSR’09. 6th IEEE International WorkingConference on. 2009. IEEE.
23. Lukins, S.K., N.A. Kraft, and L.H. Etzkorn. Source code retrieval for bug localization using latent dirichlet allocation. in Reverse Engineering, 2008. WCRE’08. 15th Working Conference on. 2008. IEEE.
24. Lukins, S.K., N.A. Kraft, and L.H. Etzkorn, Bug localization using latent dirichlet allocation. Information and Software Technology, 2010. 52(9): p. 972-990.
25. Linstead, E., C. Lopes, and P. Baldi. An application of latent Dirichlet allocation to analyzing software evolution. in Machine Learning and Applications, 2008. ICMLA’08. Seventh International Conference on. 2008. IEEE.
26. Chen, T.-H., et al. Explaining software defects using topic models. in Mining Software Repositories (MSR), 2012 9th IEEE Working Conference on. 2012. IEEE.
27. Savage, T., et al. Topic XP: Exploring topics in source code using Latent Dirichlet Allocation. in Software Maintenance (ICSM), 2010 IEEE International Conference on. 2010. IEEE.
28. Zheng, X., et al., Incorporating appraisal expression patterns into topic modeling for aspect and sentiment word identification. Knowledge-Based Systems, 2014. 61: p. 29-47.
29. Cheng, V.C., et al., Probabilistic aspect mining model for drug reviews. IEEE Transactions on Knowledge and Data Engineering, 2014. 26(8): p. 2002-2013.
30. Zhai, Z., et al., Constrained LDA for grouping product features in opinion mining. Advances in Knowledge Discovery and Data Mining, 2011: p. 448-459.
31. Bagheri, A., M. Saraee, and F. De Jong, ADM-LDA: An aspect detection model based on topic modelling using the structure of review sentences. Journal of Information Science, 2014. 40(5): p. 621-636.
32. Wang, T., et al., Product aspect extraction supervised with online domain knowledge. Knowledge-Based Systems, 2014. 71: p. 86-100.
33. Xianghua, F., et al., Multi-aspect sentiment analysis for Chinese online social reviews based on topic modeling and HowNet lexicon. Knowledge-Based Systems, 2013. 37: p. 186-195.
34. Jo, Y. and A.H. Oh. Aspect and sentiment unification model for online review analysis. in Proceedings of the fourth ACM international conference on Web search and data mining. 2011. ACM.
35. Paul, M. and R. Girju, A two-dimensional topic-aspect model for discovering multi-faceted topics. Urbana, 2010. 51(61801): p. 36.
36. Titov, I. and R. McDonald. Modeling online reviews with multi-grain topic models. in Proceedings of the 17th international conference on World Wide Web. 2008. ACM.
37. Qian, S., et al., Multi-modal event topic model for social event analysis. IEEE Transactions on Multimedia, 2016. 18(2): p. 233-246.
38. Hu, Y., et al. ET-LDA: Joint Topic Modeling for Aligning Events and their Twitter Feedback. in AAAI. 2012.
39. Weng, J. and B.-S. Lee, Event detection in twitter. ICWSM, 2011. 11: p. 401-408.
40. Lin, C.X., et al. PET: a statistical model for popular events tracking in social communities. in Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. 2010. ACM.
41. Wang, Y. and G. Mori. Max-margin Latent Dirichlet Allocation for Image Classification and Annotation. in BMVC. 2011.
42. Zoghbi, S., I. Vulic, and M.-F. Moens, Latent Dirichlet allocation for linking user-generated content and e-commerce data. Information Sciences, 2016. 367: p. 573-599.
43. Cheng, Z. and J. Shen, On effective location-aware music recommendation. ACM Transactions on Information Systems (TOIS), 2016. 34(2): p. 13.
44. Zhao, F., et al., A personalized hashtag recommendation approach using LDA-based topic model in microblog environment. Future Generation Computer Systems, 2016. 65: p. 196-206.
45. Lu, H.-M. and C.-H. Lee, The Topic-Over-Time Mixed Membership Model (TOT-MMM): A Twitter Hashtag Recommendation Model that Accommodates for Temporal Clustering Effects. IEEE Intelligent Systems, 2015(1): p. 1-1.
46. Wang, J., et al., Image tag refinement by regularized latent Dirichlet allocation. Computer Vision and Image Understanding, 2014. 124: p. 61-70.
47. Yang, M.-C. and H.-C. Rim, Identifying interesting Twitter contents using topical analysis. Expert Systems with Applications, 2014. 41(9): p. 4330-4336.
48. Kim, Y. and K. Shim, TWILITE: A recommendation system for Twitter using a probabilistic model based on latent Dirichlet allocation. Information Systems, 2014. 42: p. 59-77.
49. Roberts, K., et al. EmpaTweet: Annotating and Detecting Emotions on Twitter. in LREC. 2012.
50. Rao, Y., Contextual sentiment topic model for adaptive social emotion classification. IEEE Intelligent Systems, 2016. 31(1): p. 41-47.
51. Rao, Y., et al., Building emotional dictionary for sentiment analysis of online news. World Wide Web, 2014. 17(4): p. 723-742.
52. Chen, T.-H., S.W. Thomas, and A.E. Hassan, A survey on the use of topic models when mining software repositories. Empirical Software Engineering, 2016. 21(5): p. 1843-1919.
53. Daud, A., et al., Knowledge discovery through directed probabilistic topic models: a survey. Frontiers of Computer Science in China, 2010. 4(2): p. 280-301.
54. Liu, B. and L. Zhang, A survey of opinion mining and sentiment analysis, in Mining Text Data. 2012, Springer. p. 415-463.
55. Debortoli, S., et al., Text mining for information systems researchers: an annotated topic modeling tutorial. CAIS, 2016. 39: p. 7.
56. Sun, X., et al. Exploring topic models in software engineering data analysis: A survey. in 2016 17th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD). 2016. IEEE.
57. Minka, T. and J. Lafferty. Expectation-propagation for the generative aspect model. in Proceedings of the Eighteenth conference on Uncertainty in artificial intelligence. 2002. Morgan Kaufmann Publishers Inc.
58. Griffiths, T.L. and M. Steyvers, Finding scientific topics. Proceedings of the National Academy of Sciences, 2004. 101(suppl 1): p. 5228-5235.
59. Xie, W., et al., Topicsketch: Real-time bursty topic detection from twitter. IEEE Transactions on Knowledge and Data Engineering, 2016. 28(8): p. 2216-2229.
60. Lu, H.-M., C.-P. Wei, and F.-Y. Hsiao, Modeling healthcare data using multiple-channel latent Dirichlet allocation. Journal of Biomedical Informatics, 2016. 60: p. 210-223.
61. Yeh, J.-F., Y.-S. Tan, and C.-H. Lee, Topic detection and tracking for conversational content by using conceptual dynamic latent Dirichlet allocation. Neurocomputing, 2016. 216: p. 310-318.
62. Miao, J., J.X. Huang, and J. Zhao, TopPRF: A probabilistic framework for integrating topic space into pseudo relevance feedback. ACM Transactions on Information Systems (TOIS), 2016. 34(4): p. 22.
63. Panichella, A., et al. How to effectively use topic models for software engineering tasks? an approach based on genetic algorithms. in Proceedings of the 2013 International Conference on Software Engineering. 2013. IEEE Press.
64. Zhao, W.X., et al. Comparing twitter and traditional media using topic models. in European Conference on Information Retrieval. 2011. Springer.
65. Jagarlamudi, J. and H. Daume III. Extracting Multilingual Topics from Unaligned Comparable Corpora. in ECIR. 2010. Springer.
66. Ramage, D., et al. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1. 2009. Association for Computational Linguistics.
67. Zhu, J., A. Ahmed, and E.P. Xing. MedLDA: maximum margin supervised topic models for regression and classification. in Proceedings of the 26th annual international conference on machine learning. 2009. ACM.
68. Guo, J., et al. Named entity recognition in query. in Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. 2009. ACM.
69. Chang, J. and D.M. Blei. Relational topic models for document networks. in International conference on artificial intelligence and statistics. 2009.
70. Blei, D.M. and M.I. Jordan. Modeling annotated data. in Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. 2003. ACM.
71. Zhai, K., et al. Mr. LDA: A flexible large scale topic modeling package using variational inference in MapReduce. in Proceedings of the 21st international conference on World Wide Web. 2012. ACM.
72. Chien, J.-T. and C.-H. Chueh, Dirichlet class language models for speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 2011. 19(3): p. 482-495.
73. Blei, D.M. and J.D. Lafferty. Dynamic topic models. in Proceedings of the 23rd international conference on Machine learning. 2006. ACM.
74. Ramage, D., C.D. Manning, and S. Dumais. Partially labeled topic models for interpretable text mining. in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. 2011. ACM.
75. Rosen-Zvi, M., et al. The author-topic model for authors and documents. in Proceedings of the 20th conference on Uncertainty in artificial intelligence. 2004. AUAI Press.
76. McCallum, A., A. Corrada-Emmanuel, and X. Wang, Topic and role discovery in social networks. Computer Science Department Faculty Publication Series, 2005: p. 3.
77. Wang, X. and A. McCallum. Topics over time: a non-Markov continuous-time model of topical trends. in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 2006. ACM.
78. Li, W. and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. in Proceedings of the 23rd international conference on Machine learning. 2006. ACM.
79. Zhang, H., et al. Probabilistic community discovery using hierarchical latent gaussian mixture model. in AAAI. 2007.
80. AlSumait, L., D. Barbara, and C. Domeniconi. On-line lda: Adaptive topic models for mining text streams with applications to topic detection and tracking. in Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on. 2008. IEEE.
81. Wang, C. and D.M. Blei. Decoupling sparsity and smoothness in the discrete hierarchical dirichlet process. in Advances in neural information processing systems. 2009.
82. Lacoste-Julien, S., F. Sha, and M.I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. in Advances in neural information processing systems. 2009.
83. Wallach, H.M., D.M. Mimno, and A. McCallum. Rethinking LDA: Why priors matter. in Advances in neural information processing systems. 2009.
84. Millar, J.R., G.L. Peterson, and M.J. Mendenhall. Document Clustering and Visualization with Latent Dirichlet Allocation and Self-Organizing Maps. in FLAIRS Conference. 2009.
85. Wang, H., et al., Finding complex biological relationships in recent PubMed articles using Bio-LDA. PloS One, 2011. 6(3): p. e17243.
86. Li, F., M. Huang, and X. Zhu. Sentiment Analysis with Global Topics and Local Dependency. in AAAI. 2010.
87. Liu, Z., et al., Plda+: Parallel latent dirichlet allocation with data placement and pipeline processing. ACM Transactions on Intelligent Systems and Technology (TIST), 2011. 2(3): p. 26.
88. Yoshii, K. and M. Goto, A nonparametric Bayesian multipitch analyzer based on infinite latent harmonic allocation. IEEE Transactions on Audio, Speech, and Language Processing, 2012. 20(3): p. 717-730.
89. Li, J., C. Cardie, and S. Li. TopicSpam: a Topic-Model based approach for spam detection. in ACL (2). 2013.
90. Wu, H., et al., Locally discriminative topic modeling. Pattern Recognition, 2012. 45(1): p. 617-625.
91. Tan, S., et al., Interpreting the public sentiment variations on twitter. IEEE Transactions on Knowledge and Data Engineering, 2014. 26(5): p. 1158-1170.
92. Paul, M. and M. Dredze. Factorial LDA: Sparse multi-dimensional text models. in Advances in Neural Information Processing Systems. 2012.
93. Mao, X.-L., et al. SSHLDA: a semi-supervised hierarchical topic model. in Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning. 2012. Association for Computational Linguistics.
94. Choo, J., et al., Utopian: User-driven topic modeling based on interactive nonnegative matrix factorization. IEEE Transactions on Visualization and Computer Graphics, 2013. 19(12): p. 1992-2001.
95. Chen, L., et al. WT-LDA: user tagging augmented LDA for web service clustering. in International Conference on Service-Oriented Computing. 2013. Springer.
96. Cheng, X., et al., Btm: Topic modeling over short texts. IEEE Transactions on Knowledge and Data Engineering, 2014(1): p. 1-1.
97. Yan, X., et al. A biterm topic model for short texts. in Proceedings of the 22nd international conference on World Wide Web. 2013. ACM.
98. Cohen, R., et al., Redundancy-aware topic modeling for patient record notes. PloS One, 2014. 9(2): p. e87555.
99. Xie, P., D. Yang, and E.P. Xing. Incorporating Word Correlation Knowledge into Topic Modeling. in HLT-NAACL. 2015.
100. Yuan, J., et al. Lightlda: Big topic models on modest computer clusters. in Proceedings of the 24th International Conference on World Wide Web. 2015. International World Wide Web Conferences Steering Committee.
101. Nguyen, D.Q., et al., Improving topic models with latent feature word representations. Transactions of the Association for Computational Linguistics, 2015. 3: p. 299-313.
102. Li, C., et al., The author-topic-community model for author interest profiling and community discovery. Knowledge and Information Systems, 2015. 44(2): p. 359-383.
103. Yu, X., J. Yang, and Z.-Q. Xie, A semantic overlapping community detection algorithm based on field sampling. Expert Systems with Applications, 2015. 42(1): p. 366-375.
104. Jiang, D., et al., SG-WSTD: A framework for scalable geographic web search topic discovery. Knowledge-Based Systems, 2015. 84: p. 18-33.
105. Fu, X., et al., Dynamic non-parametric joint sentiment topic mixture model. Knowledge-Based Systems, 2015. 82: p. 102-114.
106. Li, X., J. Ouyang, and X. Zhou, Supervised topic models for multi-label classification. Neurocomputing, 2015. 149: p. 811-819.
107. Hong, L., E. Frias-Martinez, and V. Frias-Martinez. Topic Models to Infer Socio-Economic Maps. in AAAI. 2016.
108. Lee, S., et al., LARGen: automatic signature generation for Malwares using latent Dirichlet allocation. IEEE Transactions on Dependable and Secure Computing, 2016.
109. Liu, Y., J. Wang, and Y. Jiang, PT-LDA: A latent variable model to predict personality traits of social network users. Neurocomputing, 2016. 210: p. 155-163.
110. Li, C., et al., Hierarchical Bayesian nonparametric models for knowledge discovery from electronic medical records. Knowledge-Based Systems, 2016. 99: p. 168-182.
111. Fu, X., et al., Dynamic online HDP model for discovering evolutionary topics from Chinese social texts. Neurocomputing, 2016. 171: p. 412-424.
112. Zeng, J., Z.-Q. Liu, and X.-Q. Cao, Fast online EM for big topic modeling. IEEE Transactions on Knowledge and Data Engineering, 2016. 28(3): p. 675-688.
113. Alam, M.H., W.-J. Ryu, and S. Lee, Joint multi-grain topic sentiment: modeling semantic aspects for online reviews. Information Sciences, 2016. 339: p. 206-223.
114. Qin, Z., Y. Cong, and T. Wan, Topic modeling of Chinese language beyond a bag-of-words. Computer Speech and Language, 2016. 40: p. 60-78.
115. Wang, Y.-C., M. Burke, and R.E. Kraut. Gender, topic, and audience response: an analysis of user-generated content on facebook. in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2013. ACM.
116. Henderson, K. and T. Eliassi-Rad. Applying latent dirichlet allocation to group discovery in large graphs. in Proceedings of the 2009 ACM symposium on Applied Computing. 2009. ACM.
117. Yu, R., X. He, and Y. Liu, Glad: group anomaly detection in social media analysis. ACM Transactions on Knowledge Discovery from Data (TKDD), 2015. 10(2): p. 18.
118. Liu, Y., et al. Fortune Teller: Predicting Your Career Path. in AAAI. 2016.
119. Chen, S.-H., et al. Latent dirichlet allocation based blog analysis for criminal intention detection system. in Security Technology (ICCST), 2015 International Carnahan Conference on. 2015. IEEE.
120. Gerber, M.S., Predicting crime using Twitter and kernel density estimation. Decision Support Systems, 2014. 61: p. 115-125.
121. Wang, X., M.S. Gerber, and D.E. Brown, Automatic Crime Prediction Using Events Extracted from Twitter Posts. SBP, 2012. 12: p. 231-238.
122. Preotiuc-Pietro, D., et al. Beyond binary labels: political ideology prediction of twitter users. in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017.
123. Liu, B., et al., Identifying functional miRNA-mRNA regulatory modules with correspondence latent dirichlet allocation. Bioinformatics, 2010. 26(24): p. 3105-3111.
124. Huang, Z., X. Lu, and H. Duan, Latent treatment pattern discovery for clinical processes. Journal of Medical Systems, 2013. 37(2): p. 9915.
125. Xiao, C., et al. Adverse Drug Reaction Prediction with Symbolic Latent Dirichlet Allocation. in AAAI. 2017.
126. Bauer, S., et al. Talking places: Modelling and analysing linguistic content in foursquare. in Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom). 2012. IEEE.
127. McFarland, D.A., et al., Differentiating language usage through topic models. Poetics, 2013. 41(6): p. 607-625.
128. Eidelman, V., J. Boyd-Graber, and P. Resnik. Topic models for dynamic translation model adaptation. in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2. 2012. Association for Computational Linguistics.
129. Wilson, A.T. and P.A. Chew. Term weighting schemes for latent dirichlet allocation. in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. 2010. Association for Computational Linguistics.
130. Vulic, I., W. De Smet, and M.-F. Moens. Identifying word translations from comparable corpora using latent topic models. in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers-Volume 2. 2011. Association for Computational Linguistics.
131. Lui, M., J.H. Lau, and T. Baldwin, Automatic detection and language identification of multilingual documents. Transactions of the Association for Computational Linguistics, 2014. 2: p. 27-40.
132. Heintz, I., et al. Automatic extraction of linguistic metaphor with lda topic modeling. in Proceedings of the First Workshop on Metaphor in NLP. 2013.
133. Balasubramanyan, R., et al. Modeling polarizing topics: When do different political communities respond differently to the same news? in ICWSM. 2012.
134. Song, M., M.C. Kim, and Y.K. Jeong, Analyzing the political landscape of 2012 korean presidential election in twitter. IEEE Intelligent Systems, 2014. 29(2): p. 18-26.
135. Levy, K.E. and M. Franklin, Driving regulation: using topic models to examine political contention in the US trucking industry. Social Science Computer Review, 2014. 32(2): p. 182-194.
136. Zirn, C. and H. Stuckenschmidt, Multidimensional topic analysis in political texts. Data and Knowledge Engineering, 2014. 90: p. 38-53.
137. Yano, T., W.W. Cohen, and N.A. Smith. Predicting response to political blog posts with topic models. in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. 2009. Association for Computational Linguistics.
138. Yano, T. and N.A. Smith. What’s Worthy of Comment? Content and Comment Volume in Political Blogs. in ICWSM. 2010.
139. Madan, A., et al. Pervasive sensing to model political opinions in face-to-face networks. in International Conference on Pervasive Computing. 2011. Springer.
140. Zhang, X.-p., et al., Topic model for chinese medicine diagnosis and prescription regularities analysis: case on diabetes. Chinese Journal of Integrative Medicine, 2011. 17(4): p. 307-313.
141. Zhang, L., X. Sun, and H. Zhuge, Topic discovery of clusters from documents with geographical location. Concurrency and Computation: Practice and Experience, 2015. 27(15): p. 4015-4038.
142. McInerney, J. and D.M. Blei. Discovering newsworthy tweets with a geographical topic model. in NewsKDD: Data Science for News Publishing workshop, in conjunction with KDD 2014, the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2014.
143. Li, Z., et al., Enhancing news organization for convenient retrieval and browsing. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2013. 10(1): p. 1.
144. Yang, X., et al., Characterizing malicious Android apps by mining topic-specific data flow signatures. Information and Software Technology, 2017.
145. Weng, J., et al. Twitterrank: finding topic-sensitive influential twitterers. in Proceedings of the third ACM international conference on Web search and data mining. 2010. ACM.
146. Hong, L., O. Dan, and B.D. Davison. Predicting popular messages in twitter. in Proceedings of the 20th international conference companion on World Wide Web. 2011. ACM.
147. Bhattacharya, P., et al. Inferring user interests in the twitter social network. in Proceedings of the 8th ACM Conference on Recommender systems. 2014. ACM.
148. Cordeiro, M. Twitter event detection: combining wavelet analysis and topic inference summarization. in Doctoral symposium on informatics engineering. 2012.
149. Ren, Y., R. Wang, and D. Ji, A topic-enhanced word embedding for Twitter sentiment classification. Information Sciences, 2016. 369: p. 188-198.
150. Li, Y., et al., Design and implementation of Weibo sentiment analysis based on LDA and dependency parsing. China Communications, 2016. 13(11): p. 91-105.
151. Li, Z., et al., Multimedia news summarization in search. ACM Transactions on Intelligent Systems and Technology (TIST), 2016. 7(3): p. 33.
152. Godin, F., et al. Using topic models for twitter hashtag recommendation. in Proceedings of the 22nd International Conference on World Wide Web. 2013. ACM.
153. Lin, J., et al. Addressing cold-start in app recommendation: latent user models constructed from twitter followers. in Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. 2013. ACM.
154. Prier, K.W., et al. Identifying health-related topics on twitter. in International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction. 2011. Springer.
155. Srijith, P., et al., Sub-story detection in Twitter with hierarchical Dirichlet processes. Information Processing & Management, 2017. 53(4): p. 989-1003.
156. Wang, Y., et al. Catching Fire via “Likes”: Inferring Topic Preferences of Trump Followers on Twitter. in ICWSM. 2016.
157. Sharma, V., et al. Analyzing Newspaper Crime Reports for Identification of Safe Transit Paths. in HLT-NAACL. 2015.
158. Steyvers, M. and T. Griffiths, Probabilistic topic models. Handbook of Latent Semantic Analysis, 2007. 427(7): p. 424-440.
159. McCallum, A.K., Mallet: A machine learning for language toolkit. 2002.
160. Ramage, D. and E. Rosen, Stanford topic modeling toolbox. 2011, Dec.
161. Phan, X.-H. and C.-T. Nguyen, Jgibblda: A java implementation of latent dirichlet allocation (lda) using gibbs sampling for parameter estimation and inference. 2006.
162. Rehurek, R. and P. Sojka, Gensim-Statistical Semantics in Python. 2011.
163. Steyvers, M. and T. Griffiths, Matlab topic modeling toolbox 1.4. URL http://psiexp.ss.uci.edu/research/programs data/toolbox.htm, 2011.
164. Chang, J., lda: Collapsed Gibbs sampling methods for topic models. 2011, R.
165. Ahmed, A., et al. Scalable inference in latent variable models. in Proceedings of the fifth ACM international conference on Web search and data mining. 2012. ACM.
166. Lewis, D.D., Reuters-21578 text categorization collection. 1997.
167. Lewis, D.D., et al., Rcv1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 2004. 5(Apr): p. 361-397.
168. Li, R., et al. Towards social user profiling: unified and discriminative influence model for inferring home locations. in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. 2012. ACM.
169. Manandhar, S. and D. Yuret, Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013). 2013.
170. Zhang, J., et al. Social Influence Locality for Modeling Retweeting Behaviors. in IJCAI. 2013.
171. Wang, C. and D.M. Blei. Collaborative topic modeling for recommending scientific articles. in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. 2011. ACM.
172. Lange, D. and F. Naumann. Frequency-aware similarity measures: why Arnold Schwarzenegger is always a duplicate. in Proceedings of the 20th ACM international conference on Information and knowledge management. 2011. ACM.
173. Asgari, E. and J.-C. Chappelier. Linguistic Resources and Topic Models for the Analysis of Persian Poems. in CLfL@ NAACL-HLT. 2013.
174. Cong, Y., et al. Cross-Modal Information Retrieval-A Case Study on Chinese Wikipedia. in International Conference on Advanced Data Mining and Applications. 2012. Springer.
175. Everingham, M., et al., The pascal visual object classes challenge 2007 (voc 2007) results (2007). 2008.
176. Everingham, M., et al., The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 2010. 88(2): p. 303-338.
177. Larkey, L.S. and M.E. Connell. Arabic Information Retrieval at UMass in TREC-10. in TREC. 2001.
178. Rennie, J., The 20 Newsgroups data set. 2017, http.
179. Sandhaus, E., The New York Times Annotated Corpus. Philadelphia, PA: Linguistic Data Consortium. 2008.
180. Chong, W., D. Blei, and F.-F. Li. Simultaneous image classification and annotation. in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. 2009. IEEE.
181. Lienou, M., H. Maitre, and M. Datcu, Semantic annotation of satellite images using latent Dirichlet allocation. IEEE Geoscience and Remote Sensing Letters, 2010. 7(1): p. 28-32.
182. Philbin, J., J. Sivic, and A. Zisserman, Geometric latent dirichlet allocation on a matching graph for large-scale image datasets. International Journal of Computer Vision, 2011. 95(2): p. 138-153.
183. Vaduva, C., I. Gavat, and M. Datcu, Latent Dirichlet allocation for spatial analysis of satellite images. IEEE Transactions on Geoscience and Remote Sensing, 2013. 51(5): p. 2770-2786.
184. Wick, M., M. Ross, and E. Learned-Miller. Context-sensitive error correction: Using topic models to improve OCR. in Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on. 2007. IEEE.
185. Li, Z., J. Tang, and T. Mei, Deep collaborative embedding for social image understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
186. Li, Z. and J. Tang, Weakly supervised deep matrix factorization for social image understanding. IEEE Transactions on Image Processing, 2017. 26(1): p. 276-288.
187. Hu, P., et al., Latent topic model for audio retrieval. Pattern Recognition, 2014. 47(3): p. 1138-1143.
188. Nakano, T., K. Yoshii, and M. Goto. Vocal timbre analysis using latent Dirichlet allocation and cross-gender vocal timbre similarity. in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. 2014. IEEE.
189. Bisgin, H., et al., A phenome-guided drug repositioning through a latent variable model. BMC Bioinformatics, 2014. 15(1): p. 267.
190. Yang, M. and M. Kiang. Extracting Consumer Health Expressions of Drug Safety from Web Forum. in System Sciences (HICSS), 2015 48th Hawaii International Conference on. 2015. IEEE.
191. Yu, K., et al., Mining hidden knowledge for drug safety assessment: topic modeling of LiverTox as a case study. BMC Bioinformatics, 2014. 15(17): p. S6.
192. Alashri, S., et al. An analysis of sentiments on facebook during the 2016 US presidential election. in Advances in Social Networks Analysis and Mining (ASONAM), 2016 IEEE/ACM International Conference on. 2016. IEEE.
193. Shi, B., et al. Detecting Common Discussion Topics Across Culture From News Reader Comments. in ACL (1). 2016.
194. Hou, L., et al., Newsminer: Multifaceted news analysis for event search. Knowledge-Based Systems, 2015. 76: p. 17-29.
195. Diao, Q., et al. Finding bursty topics from microblogs. in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. 2012. Association for Computational Linguistics.
196. Yin, H., et al. A temporal context-aware model for user behavior modeling in social media systems. in Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 2014. ACM.
197. Giri, R., et al. User behavior modeling in a cellular network using latent dirichlet allocation. in International Conference on Intelligent Data Engineering and Automated Learning. 2014. Springer.
198. Wang, S., et al. Cross media topic analytics based on synergetic content and user behavior modeling. in Multimedia and Expo (ICME), 2014 IEEE International Conference on. 2014. IEEE.
199. Yuan, B., et al. Mobile Web User Behavior Modeling. in International Conference on Web Information Systems Engineering. 2014. Springer.
200. Siersdorfer, S., et al., Analyzing and mining comments and comment ratings on the social web. ACM Transactions on the Web (TWEB), 2014. 8(3): p. 17.
201. Chaney, A.J.-B. and D.M. Blei. Visualizing Topic Models. in ICWSM. 2012.
202. Kim, M., et al., Topiclens: Efficient multi-level visual topic exploration of large-scale document collections. IEEE Transactions on Visualization and Computer Graphics, 2017. 23(1): p. 151-160.
203. Murdock, J. and C. Allen. Visualization Techniques for Topic Model Checking. in AAAI. 2015.
204. Gretarsson, B., et al., Topicnets: Visual analysis of large text corpora with topic modeling. ACM Transactions on Intelligent Systems and Technology (TIST), 2012. 3(2): p. 23.
205. Chuang, J., C.D. Manning, and J. Heer. Termite: Visualization techniques for assessing textual topic models. in Proceedings of the international working conference on advanced visual interfaces. 2012. ACM.