Encoding Lineage in Scholarly Articles - The … Lineage in Scholarly Articles Sheikh Motahar Naim1,...
Transcript of Encoding Lineage in Scholarly Articles - The … Lineage in Scholarly Articles Sheikh Motahar Naim1,...
Encoding Lineage in Scholarly Articles
Sheikh Motahar Naim1, Md Abdul Kader1, Arnold P. Boedihardjo2, M. Shahriar Hossain11Department of Computer Science, University of Texas, El Paso, TX 79968
2U. S. Army Corps of Engineers, Alexandria, VA 22315{snaim, mkader}@miners.utep.edu, [email protected], [email protected]
Abstract
The development of new scientific concepts today is an out-come of the accumulated knowledge built over time. Everyscientific domain requires understanding of the trends of thedependencies between its subdomains. Analyses of trendsto capture such dependencies using conventional documentmodeling techniques is a challenging task due to two rea-sons: (1) conventional vector-space modeling based represen-tation of documents does not realize the history of the con-tent, and (2) neither feature-level nor document-level causal-ity is provided with any digital library metadata or citationnetwork. In this paper, we propose an intuitive temporal rep-resentation of a scientific article that encodes inherent his-toric characteristics of the content. This intuitive representa-tion of each document is then leveraged to discover causalrelationships between scientific articles. In addition, we pro-vide a mechanism to explore the lineage of each document interms of other previously published documents, which illus-trates how the theme of the document under analysis evolvedover time. Empirical studies reported in the paper show thatthe proposed technique identifies meaningful causal relation-ships and discovers meaningful lineage in the scientific litera-ture that could not be discovered through the citation networkof the articles.
1 IntroductionWith the rapid growth in electronic publication of scientificarticles, we now have many digital libraries richer than ever.For example, both IEEE Xplore (ieeexplore.ieee.org) andPubmed Central (ncbi.nlm.nih.gov/pmc) have over 3.6 mil-lion full-text articles in their collection while ACM DigitalLibrary (dl.acm.org) contains more than 400 thousands pub-lications in computing and information domain. Due to suchvast collection of searchable scientific articles, researchersin any domain find it much easier to retrieve documents re-lated to a particular topic in the literature. Almost all of theavailable digital libraries return search results based on thetextual similarity of the initial query. However, results re-turned against a search query are not enough to realize howthe topic evolved and conceptually diffused over time fromanother topic. The topics of other domains that caused or in-fluenced the state in the past may not contain the same key-words that were used for the query. In this paper, we describe
Copyright c� 2015, Association for the Advancement of ArtificialIntelligence (www.aaai.org). All rights reserved.
a mechanism that goes beyond similarity to capture the ac-tual dependencies between scientific articles to understandhow a document published recently has reached its state.
There are a few attempts to capture the evolution of thecurrent state both at document (Shaparenko and Joachims2007; El-Arini and Guestrin 2011; Hasan et al. 2009) andconcept levels (Blei and Lafferty 2006; Mei and Zhai 2005;Wang and McCallum 2006). Shaparenko and Joachims(Shaparenko and Joachims 2007) explains a mechanismnamed information genealogy that heavily depends on tex-tual similarity to compute dependence between documentswhile forming a lineage. As a result, the approach does notcapture influential documents that do not contain much tex-tual overlap with the initial set of documents. In practice,two scientific articles may contain different textual contentbut one may influence or relate to the other historically. Forexample, with the use of terms related to bipolar junctionsin physics the vocabulary in electrical engineering startedto change which eventually resulted in a strong branch ofscholarly endeavor — computer science. Similarly, the inno-vations as well as the vocabulary in laser physics fostered thearea of cancer treatment influencing the literature in biomed-ical science. The approach we describe in this paper encodeshistorical trends of the entities used in a paper in the form ofa temporal series that is used to detect causality with otherarticles avoiding direct similarity computation between doc-uments while discovering a lineage. There are a few systemsthat have the ability to provide an illustration of the con-ceptual evolution of the literature, such as, Dynamic TopicModeling (Blei and Lafferty 2006), Online LDA (AlSumait,Barbara, and Domeniconi 2008), and Mei and Zhai’s evo-lutionary theme pattern (Mei and Zhai 2005). While theseapproaches focus on lineage at conceptual level for an entirecorpus, our center of attention in this paper is at documentlevel lineage construction based on the latent causality be-tween the documents of a corpus.
Our framework, through a number of experiments con-ducted over 400 thousand publication abstracts from IEEEXplore digital library, discovers lineage of documents thatsimilarity-dependent methods cannot detect. The main con-tributions of this paper are as follows:
1. We represent a document as a time series that encodeshistoric importance of the terms of the document from theentire vocabulary perspective.
2. Our framework provides a mechanism to cluster docu-ments with high causal relationship.
3. We propose a systematic way to track the lineage of anypublished article in the form of a chain of causal docu-ments.
2 Related WorkBig collections of scholarly articles from different digital li-braries have been exploited in multitude of applications in-cluding keyword extraction (Caragea et al. 2014), citationrecommendation (Kataria, Mitra, and Bhatia 2010) and sum-marization of new contributions (Teufel and Moens 2002).In an attempt to find the influential articles of a current docu-ment, Shaparenko and Joachims (Shaparenko and Joachims2007) try to explain the content of a document using the tex-tual content of previously published articles. Though thismethod performs well to identify influential articles in theliterature, it is heavily dependent on textual similarity ofthe documents, and therefore, fail to identify true influ-ence where there is less or no textual overlap. El-Arini andGuestrin (El-Arini and Guestrin 2011) and Hasan et. al.(Hasan et al. 2009) go beyond the keyword matching andutilize other meta-information like citations to find the re-lated articles for a given set of papers. Another path of worktowards understanding the evolution of research has beenthrough incorporating temporal information into topic mod-eling. Mei and Zhai (Mei and Zhai 2005) find the impor-tant themes in every time frame of the corpus and connectthose themes based on their thematic similarity to show theevolutionary transitions of different topics. Dynamic TopicModel (Blei and Lafferty 2006) and Online LDA (AlSumait,Barbara, and Domeniconi 2008) also track the topics overtime and provide a means to identifying emerging topics invarious time points. In this paper we introduce the conceptof causality to understanding the evolutionary nature of re-search. Though the idea of causality has been successfullyused in the fields of economics (Granger 1969; Cheng andLai 1997) and neuroscience (Roebroeck, Formisano, andGoebel 2005), this is the first attempt, to the best of theauthors’ knowledge, to exploit Granger causality (Granger1969) for understanding the lineage of scientific documents.
3 Problem FormulationLet D = {d1, d2, . . . , dN} be the scholarly dataset of N ar-ticles containing M words W = {w1, w2, . . . , wM}. Eacharticle may contain an arbitrary number of words in any se-quence. The publication dates of the documents span over atime frame Y = {y1, y2, . . . , yl} where yi is ith year andyi+1 = yi+1. We denote the set of articles published in andbefore year yi as Di. The tasks are:• Identify the set of all causal dependencies in the corpus,
R = {(di, dj) : dj has a causal dependence on di}.
• Construct a causal chain Hi = {h1i , h
2i , . . . , h
|Hi
|i } for
each article di 2 D where h1i = {di} and hq
i is the set ofdocuments each of which has causal influence on at leastone of the documents in hq�1
i .• Compute a clustering, C = {C1, C2, . . . , C|C|} of the cor-
pus where documents in each group Ci 2 C demonstratestrong causal dependence among themselves.
0.0E+00
5.0E-06
1.0E-05
1.5E-05
2.0E-05
2.5E-05
1950 1955 1960 1965 1970 1975 1980
Estim
ated
per
turb
atio
n, π
Ideal documentReal document
-30.0
-20.0
-10.0
0.0
10.0
20.0
30.0
40.0
50.0
60.0
1950 1955 1960 1965 1970 1975 1980
Diff
eren
ce o
f ang
les (
degr
ee)
Tum
ors
Tum
ors R
adia
tion
Mam
m. c
ells
Rad
iatio
n Sq
uam
ous
carc
inom
a
Squamous carcinoma Radiation Drugs Combined treatment
Tum
ors
Mam
mal
ian
Cel
ls Bl
eom
ycin
Figure 1: Time series generation for a document. (a) Perturbationcaused by a real and the corresponding ideal document. (b) Gener-ated signal for the real document.
4 MethodologyThe objective of the proposed framework is to identifycausal dependency structure of all the publications. The fol-lowing subsections describe the functionalities of the pro-posed framework: 1) represent documents as time series,2) identify causal groups of articles, and 3) discover causalchains.
4.1 Document Representation as a Time SeriesOur framework captures the change of the distribution of thevocabulary W over time by reading the articles in a sequenceas they were published. We compute the term distribution atthe end of every year to capture this evolving nature of thevocabulary. Let the term distribution after the ith year be�i = {�1i , �2i , . . . , �Mi } where �ji is the frequency of the jthterm of the vocabulary W in all the document published inor before the ith year and
PMj=1 �
ji = 1.
This evolving distribution of the vocabulary enables us togenerate a time series for each document based on the rel-ative novelty of that article. If an article d 2 D publishedin the jth year contains entities that have already been pub-lished in the literature in any ith year where i < j, thend will not introduce much change in �i since the distribu-tion already contains some values for those entities. Basedon this concept, we replicate document d and place it in thepartial corpus Di. Let the extended corpus for document din the ith year be Di = Di [ d and the term distribution ofthe extended corpus Di be �i. The amount of perturbationthat document d creates in the distribution of the ith yearis estimated as the Kullback-Leibler divergence between theoriginal distribution �i and the distribution of the extendedcorpus �i.
⇡id =
lX
i=1
�i ln�i�i
(1)
The order of two distributions �i and �i in Equation 1 isimportant due to the asymmetric nature of KL-divergence.Since our intention is to measure the extra information addedto the corpus Di by the inclusion of document d, we alwayscompute the KL-divergence from �i to �i, not the other wayaround.
By placing d in each year yi 2 Y we get a time series⇡d = {⇡1
d,⇡2d, . . . ,⇡
ld} for d that represents the document’s
influence on the vocabulary over all the years. The red linein Figure 1(a) shows this perturbation-based time series of asample document.
In the next step, we create a synthetic document d0 thatis an identical copy of d but each of the terms being en-
coded by a unique identifier. That is, the terms in the syn-thetic document do not appear in any of the documents in theentire corpus D. This ensures that d0 independently holdsthe properties of d as a single document but it is a novel-ized one since none of the documents contains the syntheticterms. Now following the same procedure as we used ford, we add d0 to Di to form an extended corpus D0
i, calcu-late the term distribution �0i for that corpus, and estimatethe amount of perturbation d0 could introduce. We construct⇡d0 = {⇡1
d0 ,⇡2d0 , . . . ,⇡l
d0} for d0 the same way as we com-puted ⇡d. The blue line in Figure 1(a) is the signal generatedby placing d0 in every year. Since the content of d0 is unique,the line for d0 will always have larger (or equal) value inevery year than the line for the real document d.
The time series ⇡d0 is a representation of the degree of in-fluence the document d would have had on the vocabularyhad it be a completely unique document. Therefore, the dif-ference between the two time series ⇡d and ⇡d0 gives us anestimation of the amount of uniqueness of document d inevery year of the corpus — the smaller the difference thecloser it is to the extremely unique document. Figure 1(b)shows the signal for d that is derived from the differencebetween ⇡d and ⇡d0 . We denote this signal as Sd and calcu-late its magnitude in ith year as the difference between theangles created by the time series ⇡d and ⇡0
d at the ith year.
4.2 Identification of Causal ClustersThe information provided by an article published today is anoutcome of knowledge accumulated over time in the past.Each article in a corpus of a certain domain is very likelyto have a few articles to which it is causally dependent.We use the signal Sd generated for each document d 2 Dto compute the causalities between all articles. We lever-age Granger causality test in this purpose. Granger causal-ity (Granger 1969) is a statistical hypothesis test which es-timates the usefulness of one time series in predicting thefuture values of another time series. A time series X is saidto Granger-cause another time series Y if it can be shownthat those X values provide statistically significant informa-tion about future values of Y . We use Granger causality totest if there exists any causal relationship between two docu-ments. Assuming that the document di was published beforethe document dj , we test the following hypothesis to identifya causal effect from di to dj :
P[dj(t+ 1) 2 A|I(t)] 6= P[dj(t+ 1) 2 A|I�di
(t)] (2)
where P indicates probability, A is an arbitrary non-emptyset, and I(t) and I�d
i
(t) respectively denote the informa-tion available as of time t in the corpus, and that in the corpusexcluding di. If the above hypothesis is accepted, we say thatdocument di Granger causes document dj . Figure 2 demon-strates the causal relationship between two documents basedon their time series.
Once we identify the set of all the causal relationships Rin the corpus, we can build a causality network or a causalitymatrix. In practice, for large corpus we limit causality com-putation for any document to the documents published in rprevious years only. The set of IEEE publications that weuse in the experimental results section spans over 54 years,
-50
0
50
100
0 2 4 6 8 10 12 14 16 18 20
Time
Document studies history of microprocessors
4004, 8086, ALU MC68000, 32 bit microprocessorProtected mode, 80286
-50
0
50
100
0 2 4 6 8 10 12 14 16 18 20
Time
Document studies history of operating systems Calculator, Nintendo, Batch processing, CP/M Windows NT, Mac OS, GUI
Multi-user multitasking systems
Figure 2: An illustrative example of causality between two docu-ments based on their signals.
Algorithm 1: ComputeLineage – algorithm to com-pute the causal chain of a document.
input : Document diparameter: Set of causal pairs, R
Look-back threshold, rBranching factor, b
output : Lineage of di, hi
1 create an empty list hi;2 append {di} to hi;3 do4 create an empty set Sp;5 S read the last set from hi;6 for each document d 2 S do7 Rd {(dp, dq) : (dp, dq) 2 Rd and dq = d}
if |Rd| > b then8 Rd top b pairs of Rd;9 end
10 for each causal pair (dp, dq) 2 Rd do11 if yearOf(dq) - yearOf(dp) r then12 add dp to Sp;13 end14 end15 end16 if |Sp| 6= 0 then17 append Sp to hi;18 end19 while |Sp| 6= 0;20 return hi
and we vary the value of that look-back threshold r from 8to 10 years for our experiments.
The strengths of the causal relationships in R allow us touse clustering algorithms to find causal groups of articles.We use a density based clustering algorithm DBSCAN togroup causal documents. The motivation behind the use of adensity based clustering is that this specific family of clus-tering algorithms does not require the number of clusters kas an input (unlike k-means clustering). DBSCAN is a logi-cal choice for causal partitioning since there is no metadatathat can help us identify possible number of causal clusters.
4.3 Computing Causal ChainsAlgorithm 1 provides a chain of causal documents for agiven document di. Each chain is a graph or tree containingcausality flows from leaf documents toward the documentfor which the lineage is generated. The chain is initialized inline 1-2 with the given document di. We expand the chainby adding new set of documents that have causal influenceon the documents of the previous set. Each document d in a
level is expanded to b articles with highest causalities on dwhich were published within r years of the publication dateof d. The procedure terminates when there is no more causalparent left to expand from a certain level.
5 EvaluationWe evaluate our framework by evaluating the quality ofcausal clusters, how much information is diffused over timein a lineage, and comparing the lineage produced by our ap-proach with a similarity based model, information geneal-ogy (Shaparenko and Joachims 2007), and a citation net-work based lineage.
One of the evaluation metrics we use is diffusion coeffi-cient, which estimates a quantity of how the theme changedover time. The basic assumption here is that a lineage ofscholarly articles should diffuse a concept over the yearsand generate new ideas. Let P = {p1, p2, . . . , p|P |} bethe set of paths in the lineage of a document d 2 D, andpk = {dpk
0 , dpk
1 , . . . , dpk
n�1} be the kth path in P containingn documents and dpk
0 refers to the source document d. Dif-fusion coefficient D(d) of lineage of the document d is thendefined as
D(d) =1
|P |
|P |X
k=1
1� 1
n� 2
n�3X
i=0
n�1X
j=i+2
disp(dpk
i , dpk
j )
!
(3)where
disp(dpk
i , dpk
j ) =
⇢1
n+i�j if dpk
i and dpk
j have term overlap0 otherwise.
Larger values of D(d) indicate better diffusion over timewhile smaller values will refer to lesser conceptual drift.
In addition, we examine how much of the lineage gener-ated by our approach overlaps with the citation tree formedfor a document. Let T citation(d) and T causality(d) are theancestor trees of depth L of a document d 2 D generated us-ing the actual citations and causal relationships, respectively.Then a citation overlap score for article d can be defined as:
Acausality(d) =LX
l=1
���⌘Tcitation(d)
l \⌘Tcausality(d)
l
���⇥l
L(4)
where ⌘Tcitation(d)
l and ⌘T causality(d)l are the set of refer-
ences of documents at level l of the tree T citation(d) andT causality(d) respectively. The same equation can be usedto compute citation overlap score Asimilarity for a similar-ity based lineage as well as Agenealogy for an informationgenealogy based approach.
We express the degree of causality within the documentsin the same topical group t as causal density score ⇢(t):
⇢(t) =2⇥ |Rt|nt ⇥ b
(5)
where b is the maximum number of causal parents consid-ered for each document, Rt ⇢ R is the set of causal rela-tions within topic t and nt is the number of documents in t.We also calculate the causal influence of a topic t1 on an-other topic t2 as:
�(t1, t2) =|{(di, dj) 2 R : di 2 Dt1 , dj 2 Dt2}|
nt1 + nt1
(6)
A cluster of documents with large number of term overlaps
A cluster of documents with smaller number of terms overlaps
05
101520253035404550
clus
ter
stru
ctur
ete
xtfe
atur
eco
mpl
exity
spac
eef
fect
iven
ess
orde
rcl
uste
ring
feat
ure
spac
ecl
assi
ficat
ion
proc
ess
term eb
data
set
mix
ture
resp
ect
web
dim
ensi
onal
itysi
mila
rity
Perc
enta
ge o
f doc
umen
ts
with
term
ove
rlap
Cluster Size=22
05
101520253035404550
way
pres
ent
tree
topi
cre
latio
nshi
pru
leor
der
feat
ure
pref
ace
know
ledg
epo
int
proc
ess
data
set
cano
nica
lne
twor
kex
pert
cont
ent
disc
over
ysc
alab
ility
diff
icul
tyPerc
enta
ge o
f doc
umen
ts
with
term
ove
rlap
Cluster Size=24
Figure 3: Comparison of two highly causal clusters in term of thetextual similarities among the documents of each cluster.where Dt1 and Dt2 are the set of documents in topic t1 andt2 respectively. Finally, the causal dominance (t) of a topicover the other topics in the corpus is calculated as
(t) =1
nt
X
d2Dt
X
t02T
X
d02Dt
0
⇢1 (d, d0) 2 R0 otherwise.
(7)
where T is the set of all topics
6 Experimental ResultsIn this section, we seek to answer the following questionsto justify the capabilities and correctness of the proposedmodel.
1. How does the causality based clustering mechanism com-pare to a baseline similarity based clustering approach?(Section 6.1)
2. Can the proposed lineage formation approach identify ge-nealogies drifted from other topics? (Section 6.2)
3. Which topical groups have most causal influence on theother topical groups in the entire corpus? (Section 6.3)We collected a publication dataset of titles and abstracts
along with some meta-data that includes publication yearand citations of each paper. The collection contains 412,484computer science articles from the IEEE Xplore digital li-brary. The computer science articles are recognized by en-tries available in the DBLP computer science bibliographydatabase. This collection contains documents from the year1961 to 2014. We extracted over eight hundred thousand en-tities from the titles and the abstracts using lingpipe, Stan-ford NER and openNLP entity detectors (Hossain et al.2012). The entities were then tokenized to construct the fea-ture set for each document.6.1 Causality based vs. Similarity based
clusteringOne may argue that one particular scholarly article is onlymotivated by similar articles published in the past. Whilesimilarity is a good way to discover articles published onthe same topic, the inter-topic influence cannot be captured
0
10
20
30
40
50
60
70
80
0 2 4 6 8 10 12 14 16
Perc
enta
ge o
f doc
umen
t cl
uste
rs d
istri
bute
d to
T te
rm
clus
ters
Number of Term Cluster, T
Causality based document clusteringBaseline (Similarity based clustering)
2-5 4-7 6-9 8-11 10-13 12-15 14-17 16-19
Figure 4: Comparison between causality based and similarity baseddocument clustering with respect to the causality based term clus-tering.
Spatio-Temporal Topic Modeling in
Mobile Social Media for Location
RecommendationAbstract—Mobile networks enable users to post on social media services (e.g., Twitter)
A Spatio-Temporal Simulation Model
for Movement Data Generation
Abstract—The real-world process of generating a largespatio-temporal data collection
Viral Marketing for Multiple Products
Abstract—Viral Marketing, the idea of exploiting socialinteractions of users to propagate
A Compact Representation of
Spatio-Temporal DataAbstract—As technology advances we encounter moreavailable data on moving objects
Indexing the Trajectories of
Moving Objects in Networks
Abstract—The management of moving objects has been intensively
Developing Data Allocation Schemes byIncremental Mining of User Moving Patterns
Abstract—In this paper, we present a new data mining algorithm which involves incremental mining
Parallel Mining of Association Rules
Abstract—We consider the problem of mining association rules on a shared nothing multiprocessor
Incremental implementation model
for relational databases with
transaction timeAbstract—An implementation model for the standard relational data model extended with
2012
2009 2008
2006
2004
2003
1996 1991
Epidemic spreading in real networks: an
eigenvalue viewpoint
Abstract—How will a virus propagate in a real network? Does an epidemic threshold exist for a finite graph?
2001
Measuring and modeling computer
virus prevalence
Abstract—To understand the current extent of the computer virus problem and predict its future course, the authors have
1993
Figure 5: The causal chain of a document discovered by the pro-posed method.using similarity search. To verify whether highly causal doc-ument clusters are always similar or not, we analyze the sim-ilarity between the documents of each of the causal clustersdetected by density based clustering (as described in Section4.2). Figure 3 shows the percentage of documents that sharecertain terms in two different causal clusters. The terms inthe x-axis are ordered based on percentage of documents ina cluster containing those terms. The top twenty frequentterms are shown for each cluster. The plot in the left showsthat the documents in a highly causal cluster have largeamount of term overlaps. For example, 32% and 23% of thedocuments contain the terms cluster and structure, respec-tively. In contrast, the plot at the right side illustrates thatthe documents in a highly causal cluster can have compara-tively low textual similarity. The most frequent term “way”appears in just 12.5% of the documents of the causal clus-ter represented by the plot at the right side of Figure 3. Thisstipulates that a causal cluster of documents may or may notbe similar in terms of textual contents.
While Figure 3 shows an evidence that similarity is notthe key indicator of causality, it does not confirm that highlycausal documents grouped in a cluster are causal at the fea-ture level. To examine whether a causality based groupingof documents brings more causal items together than a sim-ilarity based grouping, we cluster all the documents basedon causality and similarity separately. Then we cluster allthe terms of the corpus based on term causality consideringthe frequency of each term in each year as the amplitude ofthe corresponding term signal. Additionally, we apply topicmodeling to the corpus and assign a topic to each of the arti-
0
0.1
0.2
0.3
0.4
0.5
0.6
0 2 4 6 8 10
Aver
age
cita
tion
over
lap
scor
e
Tree depth, L
Causality basedSimilarity basedGenealogy based 0
0.005
0.01
0.015
0.02
0.025
0.03
0.035
0.04
3 3.5 4 4.5 5 5.5 6 6.5 7
Aver
age
diffu
sion
coef
ficie
nt
Length of a lineage chain
Causality basedSimilarity basedCitation basedGenealogy based
Figure 6: (left) Average citation overlap scores for three ap-proaches: 1) our causality based approach, 2) similarity based lin-eage, and 3) information genealogy based approach. (right) Ourcausality based framework demonstrates the highest form of con-ceptual diffusion over time.
cles. We seek to verify the causality of the terms of the doc-uments of a causality based cluster by examining the domi-nant terms of the topics of the documents of that causal clus-ter. Those dominating terms should come from a small num-ber of causality based term clusters. In contrast, the domi-nant topical terms of a similarity based cluster of articles willtend to come from multiple causal term clusters if causal-ity and similarity are less relational. Figure 4 shows that thedominant topical terms of causality based document clustersare distributed to small number of causal term clusters in-dicating that causality based document clusters are highlycausal at term level. The similarity based document clus-ters exhibit a different trend. The topical terms of the simi-larity based article clusters spread into comparatively largenumber of causal term clusters. This indicates that the doc-uments inside the causal groups identified by our proposedframework are more causal at feature level than a groupingdiscovered by a similarity based baseline approach.6.2 Lineage FormationFigure 5 shows an example of a lineage identified by Algo-rithm 1 for a document in the IEEE Xplore library (ArticleReference Number: 6729600). The branching factor we usedfor this lineage is b = 4 and the look-back threshold r = 10.The initial document for which the chain is formed embedsspatio-temporal data, social media, and recommender sys-tem in one paper. The causal paths shown in the lineage il-lustrates that the topic of this document was influenced bydifferent areas of research including relational database withtemporal aspects, association rule mining, network analy-sis, and viral marketing in social media. The chain discoversmore causal articles than similar ones to form the lineage.
Overall, we evaluate the lineage in terms of diffusion co-efficient (Equation 3) and citation overlap (Equation 4). Fig-ure 6(left) compares average citation overlap scores at dif-ferent levels of all the lineages detected for all documentsusing three approaches (1) our causality based approach, (2)similarity based approach, and (3) information genealogybased approach. The figure shows that the similarity basedapproach has the highest overlap with the citation lineage.This matches the fact that citations are generally outcomesof keyword search resulting in similar topics. Our approachexhibits the least amount of overlaps with the citation net-work among three approaches. In contrast, Figure 6(right)shows that our approach has the highest average conceptualdiffusion over time than all other methods, even when com-pared to the original citation network. This indicates that ourapproach has the ability to detect how the lineage driftedfrom another topic and formed the current literature, whichother methods do not possess.6.3 Analyzing Topical CausalityIn this section we infer the notion of document-level causal-ity to understand topic-level influential relationships in theliterature. For the experiments in this section we used thedocuments published in ICDM conferences and workshopsfrom 2001 to 2014, which is a subset of the IEEE datasetused in the preceding experiments. We apply LDA (Blei, Ng,and Jordan 2003) with 20 topics on this ICDM data set to getthe distribution of topics in each document. We categorize a
0
0.05
0.1
0.15
0.2
0.25
0.3
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Causal0Density0of0Topic
Topic0ID
0.23%
0.19%
Topic%13%Topic%10%
Topic%7%
Topic%19%
Topic%4%
Topic%16%
0.16%
0.09%
0.15%
0.04%
0" 0.5" 1" 1.5" 2" 2.5" 3" 3.5" 4" 4.5" 5"Causal"strength"
Topic"ID"
10"
16"
8"7"
2"
Figure 7: (left) Within-topic causality score of different topics. (middle) Topic pairs with most causal relation between them. (right) Topicswith most causal strengths.
document to be under the topic which has the highest prob-ability in that document.
If the documents under a particular topic are mostlycaused by the other documents in that topic, the topic issupposed to depict more causal density (Equation 5). Figure7(left) shows the causal density of different topics for b = 5.From the figure we can see that documents related to outliersdetection (topic no. 8) and association rule and subgraphmining (topic no. 10) are mostly caused by the documentsunder the same topic. On the other hand, the research ar-eas that inspired or were inspired by a variety of other fieldsshould demonstrate strong causal relations across the topics.Figure 7(middle) shows the pairs of topics that have mostcausal influence calculated by Equation 6. Each edge in thefigure is labeled with the value of �(t1, t2). There are someareas in the literature that are more dominant than others andhence cause more documents within and across the topics.Figure 7(right) shows the top five dominant topics with theircausal strengths computed by Equation 7.
Based on the causal relationships found among the top-ical groups of documents, our observation is that the top-ics with higher cross-topic causality has low causal density(e.g. topic number 5 and 18). Topic 10 is an exception in thatsense, which is the reason behind its being the most domi-nant topic in the entire corpus.
7 ConclusionIn this paper we present a novel time series based represen-tation for scientific articles that enables searching beyondmere content similarity. Though such representation is in-tuitive and has shown superior ability in identifying actualcausal relationships between the documents, we are yet tocomprehend all of the characteristics of these signals. Onedirection of our research is going towards having a better un-derstanding of these document signals and applying them toidentify even more subtle causalities. We also have shown amechanism to find a causal chain of documents for any givendocument and then inferred those chains to topical groups ofdocuments in order to see the causal relationships betweendifferent topics. However, a direct approach towards under-standing the topical evolution – how one concept begetsother new ideas, gets merged with some other concepts, orsimply vanishes away at some point – and the causal factorsbehind those behaviors of topics would be more interesting.
ReferencesAlSumait, L.; Barbara, D.; and Domeniconi, C. 2008. On-line lda: Adaptive topic models for mining text streams with
applications to topic detection and tracking. In ICDM’08,3–12.Blei, D. M., and Lafferty, J. D. 2006. Dynamic topic models.In ICML’06, 113–120.Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latentdirichlet allocation. Journal of machine Learning research3:993–1022.Caragea, C.; Bulgarov, F.; Godea, A.; and Gollapalli, S. D.2014. Citation-enhanced keyphrase extraction from researchpapers: A supervised approach. In EMNLP’14.Cheng, B. S., and Lai, T. W. 1997. An investigation of co-integration and causality between energy consumption andeconomic activity in taiwan. Energy Economics 19(4):435–444.El-Arini, K., and Guestrin, C. 2011. Beyond keywordsearch: discovering relevant scientific literature. In KDD’11,439–447.Granger, C. W. 1969. Investigating causal relations byeconometric models and cross-spectral methods. Economet-rica: Journal of the Econometric Society 424–438.Hasan, M. A.; Spangler, W. S.; Griffin, T.; and Alba, A.2009. Coa: Finding novel patents through text analysis. InKDD’09, 1175–1184.Hossain, M. S.; Butler, P.; Boedihardjo, A. P.; and Ramakr-ishnan, N. 2012. Storytelling in entity networks to supportintelligence analysts. In KDD’12, 1375–1383.Kataria, S.; Mitra, P.; and Bhatia, S. 2010. Utilizing con-text in generative bayesian models for linked corpus. InAAAI’10, volume 10, 1.Mei, Q., and Zhai, C. 2005. Discovering evolutionary themepatterns from text: an exploration of temporal text mining. InKDD’05, 198–207.Roebroeck, A.; Formisano, E.; and Goebel, R. 2005. Map-ping directed influence over the brain using granger causal-ity and fmri. Neuroimage 25(1):230–242.Shaparenko, B., and Joachims, T. 2007. Information ge-nealogy: uncovering the flow of ideas in non-hyperlinkeddocument databases. In KDD’07, 619–628.Teufel, S., and Moens, M. 2002. Summarizing scientific arti-cles: experiments with relevance and rhetorical status. Com-putational linguistics 28(4):409–445.Wang, X., and McCallum, A. 2006. Topics over time: anon-markov continuous-time model of topical trends. InKDD’06, 424–433.