Encoding Lineage in Scholarly Articles - The … Lineage in Scholarly Articles Sheikh Motahar Naim1,...

6
Encoding Lineage in Scholarly Articles Sheikh Motahar Naim 1 , Md Abdul Kader 1 , Arnold P. Boedihardjo 2 , M. Shahriar Hossain 1 1 Department of Computer Science, University of Texas, El Paso, TX 79968 2 U. S. Army Corps of Engineers, Alexandria, VA 22315 {snaim, mkader}@miners.utep.edu, [email protected], [email protected] Abstract The development of new scientific concepts today is an out- come of the accumulated knowledge built over time. Every scientific domain requires understanding of the trends of the dependencies between its subdomains. Analyses of trends to capture such dependencies using conventional document modeling techniques is a challenging task due to two rea- sons: (1) conventional vector-space modeling based represen- tation of documents does not realize the history of the con- tent, and (2) neither feature-level nor document-level causal- ity is provided with any digital library metadata or citation network. In this paper, we propose an intuitive temporal rep- resentation of a scientific article that encodes inherent his- toric characteristics of the content. This intuitive representa- tion of each document is then leveraged to discover causal relationships between scientific articles. In addition, we pro- vide a mechanism to explore the lineage of each document in terms of other previously published documents, which illus- trates how the theme of the document under analysis evolved over time. Empirical studies reported in the paper show that the proposed technique identifies meaningful causal relation- ships and discovers meaningful lineage in the scientific litera- ture that could not be discovered through the citation network of the articles. 1 Introduction With the rapid growth in electronic publication of scientific articles, we now have many digital libraries richer than ever. For example, both IEEE Xplore (ieeexplore.ieee.org) and Pubmed Central (ncbi.nlm.nih.gov/pmc) have over 3.6 mil- lion full-text articles in their collection while ACM Digital Library (dl.acm.org) contains more than 400 thousands pub- lications in computing and information domain. Due to such vast collection of searchable scientific articles, researchers in any domain find it much easier to retrieve documents re- lated to a particular topic in the literature. Almost all of the available digital libraries return search results based on the textual similarity of the initial query. However, results re- turned against a search query are not enough to realize how the topic evolved and conceptually diffused over time from another topic. The topics of other domains that caused or in- fluenced the state in the past may not contain the same key- words that were used for the query. In this paper, we describe Copyright c 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. a mechanism that goes beyond similarity to capture the ac- tual dependencies between scientific articles to understand how a document published recently has reached its state. There are a few attempts to capture the evolution of the current state both at document (Shaparenko and Joachims 2007; El-Arini and Guestrin 2011; Hasan et al. 2009) and concept levels (Blei and Lafferty 2006; Mei and Zhai 2005; Wang and McCallum 2006). Shaparenko and Joachims (Shaparenko and Joachims 2007) explains a mechanism named information genealogy that heavily depends on tex- tual similarity to compute dependence between documents while forming a lineage. As a result, the approach does not capture influential documents that do not contain much tex- tual overlap with the initial set of documents. In practice, two scientific articles may contain different textual content but one may influence or relate to the other historically. For example, with the use of terms related to bipolar junctions in physics the vocabulary in electrical engineering started to change which eventually resulted in a strong branch of scholarly endeavor — computer science. Similarly, the inno- vations as well as the vocabulary in laser physics fostered the area of cancer treatment influencing the literature in biomed- ical science. The approach we describe in this paper encodes historical trends of the entities used in a paper in the form of a temporal series that is used to detect causality with other articles avoiding direct similarity computation between doc- uments while discovering a lineage. There are a few systems that have the ability to provide an illustration of the con- ceptual evolution of the literature, such as, Dynamic Topic Modeling (Blei and Lafferty 2006), Online LDA (AlSumait, Barbar´ a, and Domeniconi 2008), and Mei and Zhai’s evo- lutionary theme pattern (Mei and Zhai 2005). While these approaches focus on lineage at conceptual level for an entire corpus, our center of attention in this paper is at document level lineage construction based on the latent causality be- tween the documents of a corpus. Our framework, through a number of experiments con- ducted over 400 thousand publication abstracts from IEEE Xplore digital library, discovers lineage of documents that similarity-dependent methods cannot detect. The main con- tributions of this paper are as follows: 1. We represent a document as a time series that encodes historic importance of the terms of the document from the entire vocabulary perspective.

Transcript of Encoding Lineage in Scholarly Articles - The … Lineage in Scholarly Articles Sheikh Motahar Naim1,...

Page 1: Encoding Lineage in Scholarly Articles - The … Lineage in Scholarly Articles Sheikh Motahar Naim1, Md Abdul Kader1, Arnold P. Boedihardjo2, M. Shahriar Hossain1 ... Barbara, and

Encoding Lineage in Scholarly Articles

Sheikh Motahar Naim1, Md Abdul Kader1, Arnold P. Boedihardjo2, M. Shahriar Hossain11Department of Computer Science, University of Texas, El Paso, TX 79968

2U. S. Army Corps of Engineers, Alexandria, VA 22315{snaim, mkader}@miners.utep.edu, [email protected], [email protected]

Abstract

The development of new scientific concepts today is an out-come of the accumulated knowledge built over time. Everyscientific domain requires understanding of the trends of thedependencies between its subdomains. Analyses of trendsto capture such dependencies using conventional documentmodeling techniques is a challenging task due to two rea-sons: (1) conventional vector-space modeling based represen-tation of documents does not realize the history of the con-tent, and (2) neither feature-level nor document-level causal-ity is provided with any digital library metadata or citationnetwork. In this paper, we propose an intuitive temporal rep-resentation of a scientific article that encodes inherent his-toric characteristics of the content. This intuitive representa-tion of each document is then leveraged to discover causalrelationships between scientific articles. In addition, we pro-vide a mechanism to explore the lineage of each document interms of other previously published documents, which illus-trates how the theme of the document under analysis evolvedover time. Empirical studies reported in the paper show thatthe proposed technique identifies meaningful causal relation-ships and discovers meaningful lineage in the scientific litera-ture that could not be discovered through the citation networkof the articles.

1 IntroductionWith the rapid growth in electronic publication of scientificarticles, we now have many digital libraries richer than ever.For example, both IEEE Xplore (ieeexplore.ieee.org) andPubmed Central (ncbi.nlm.nih.gov/pmc) have over 3.6 mil-lion full-text articles in their collection while ACM DigitalLibrary (dl.acm.org) contains more than 400 thousands pub-lications in computing and information domain. Due to suchvast collection of searchable scientific articles, researchersin any domain find it much easier to retrieve documents re-lated to a particular topic in the literature. Almost all of theavailable digital libraries return search results based on thetextual similarity of the initial query. However, results re-turned against a search query are not enough to realize howthe topic evolved and conceptually diffused over time fromanother topic. The topics of other domains that caused or in-fluenced the state in the past may not contain the same key-words that were used for the query. In this paper, we describe

Copyright c� 2015, Association for the Advancement of ArtificialIntelligence (www.aaai.org). All rights reserved.

a mechanism that goes beyond similarity to capture the ac-tual dependencies between scientific articles to understandhow a document published recently has reached its state.

There are a few attempts to capture the evolution of thecurrent state both at document (Shaparenko and Joachims2007; El-Arini and Guestrin 2011; Hasan et al. 2009) andconcept levels (Blei and Lafferty 2006; Mei and Zhai 2005;Wang and McCallum 2006). Shaparenko and Joachims(Shaparenko and Joachims 2007) explains a mechanismnamed information genealogy that heavily depends on tex-tual similarity to compute dependence between documentswhile forming a lineage. As a result, the approach does notcapture influential documents that do not contain much tex-tual overlap with the initial set of documents. In practice,two scientific articles may contain different textual contentbut one may influence or relate to the other historically. Forexample, with the use of terms related to bipolar junctionsin physics the vocabulary in electrical engineering startedto change which eventually resulted in a strong branch ofscholarly endeavor — computer science. Similarly, the inno-vations as well as the vocabulary in laser physics fostered thearea of cancer treatment influencing the literature in biomed-ical science. The approach we describe in this paper encodeshistorical trends of the entities used in a paper in the form ofa temporal series that is used to detect causality with otherarticles avoiding direct similarity computation between doc-uments while discovering a lineage. There are a few systemsthat have the ability to provide an illustration of the con-ceptual evolution of the literature, such as, Dynamic TopicModeling (Blei and Lafferty 2006), Online LDA (AlSumait,Barbara, and Domeniconi 2008), and Mei and Zhai’s evo-lutionary theme pattern (Mei and Zhai 2005). While theseapproaches focus on lineage at conceptual level for an entirecorpus, our center of attention in this paper is at documentlevel lineage construction based on the latent causality be-tween the documents of a corpus.

Our framework, through a number of experiments con-ducted over 400 thousand publication abstracts from IEEEXplore digital library, discovers lineage of documents thatsimilarity-dependent methods cannot detect. The main con-tributions of this paper are as follows:

1. We represent a document as a time series that encodeshistoric importance of the terms of the document from theentire vocabulary perspective.

Page 2: Encoding Lineage in Scholarly Articles - The … Lineage in Scholarly Articles Sheikh Motahar Naim1, Md Abdul Kader1, Arnold P. Boedihardjo2, M. Shahriar Hossain1 ... Barbara, and

2. Our framework provides a mechanism to cluster docu-ments with high causal relationship.

3. We propose a systematic way to track the lineage of anypublished article in the form of a chain of causal docu-ments.

2 Related WorkBig collections of scholarly articles from different digital li-braries have been exploited in multitude of applications in-cluding keyword extraction (Caragea et al. 2014), citationrecommendation (Kataria, Mitra, and Bhatia 2010) and sum-marization of new contributions (Teufel and Moens 2002).In an attempt to find the influential articles of a current docu-ment, Shaparenko and Joachims (Shaparenko and Joachims2007) try to explain the content of a document using the tex-tual content of previously published articles. Though thismethod performs well to identify influential articles in theliterature, it is heavily dependent on textual similarity ofthe documents, and therefore, fail to identify true influ-ence where there is less or no textual overlap. El-Arini andGuestrin (El-Arini and Guestrin 2011) and Hasan et. al.(Hasan et al. 2009) go beyond the keyword matching andutilize other meta-information like citations to find the re-lated articles for a given set of papers. Another path of worktowards understanding the evolution of research has beenthrough incorporating temporal information into topic mod-eling. Mei and Zhai (Mei and Zhai 2005) find the impor-tant themes in every time frame of the corpus and connectthose themes based on their thematic similarity to show theevolutionary transitions of different topics. Dynamic TopicModel (Blei and Lafferty 2006) and Online LDA (AlSumait,Barbara, and Domeniconi 2008) also track the topics overtime and provide a means to identifying emerging topics invarious time points. In this paper we introduce the conceptof causality to understanding the evolutionary nature of re-search. Though the idea of causality has been successfullyused in the fields of economics (Granger 1969; Cheng andLai 1997) and neuroscience (Roebroeck, Formisano, andGoebel 2005), this is the first attempt, to the best of theauthors’ knowledge, to exploit Granger causality (Granger1969) for understanding the lineage of scientific documents.

3 Problem FormulationLet D = {d1, d2, . . . , dN} be the scholarly dataset of N ar-ticles containing M words W = {w1, w2, . . . , wM}. Eacharticle may contain an arbitrary number of words in any se-quence. The publication dates of the documents span over atime frame Y = {y1, y2, . . . , yl} where yi is ith year andyi+1 = yi+1. We denote the set of articles published in andbefore year yi as Di. The tasks are:• Identify the set of all causal dependencies in the corpus,

R = {(di, dj) : dj has a causal dependence on di}.

• Construct a causal chain Hi = {h1i , h

2i , . . . , h

|Hi

|i } for

each article di 2 D where h1i = {di} and hq

i is the set ofdocuments each of which has causal influence on at leastone of the documents in hq�1

i .• Compute a clustering, C = {C1, C2, . . . , C|C|} of the cor-

pus where documents in each group Ci 2 C demonstratestrong causal dependence among themselves.

0.0E+00

5.0E-06

1.0E-05

1.5E-05

2.0E-05

2.5E-05

1950 1955 1960 1965 1970 1975 1980

Estim

ated

per

turb

atio

n, π

Ideal documentReal document

-30.0

-20.0

-10.0

0.0

10.0

20.0

30.0

40.0

50.0

60.0

1950 1955 1960 1965 1970 1975 1980

Diff

eren

ce o

f ang

les (

degr

ee)

Tum

ors

Tum

ors R

adia

tion

Mam

m. c

ells

Rad

iatio

n Sq

uam

ous

carc

inom

a

Squamous carcinoma Radiation Drugs Combined treatment

Tum

ors

Mam

mal

ian

Cel

ls Bl

eom

ycin

Figure 1: Time series generation for a document. (a) Perturbationcaused by a real and the corresponding ideal document. (b) Gener-ated signal for the real document.

4 MethodologyThe objective of the proposed framework is to identifycausal dependency structure of all the publications. The fol-lowing subsections describe the functionalities of the pro-posed framework: 1) represent documents as time series,2) identify causal groups of articles, and 3) discover causalchains.

4.1 Document Representation as a Time SeriesOur framework captures the change of the distribution of thevocabulary W over time by reading the articles in a sequenceas they were published. We compute the term distribution atthe end of every year to capture this evolving nature of thevocabulary. Let the term distribution after the ith year be�i = {�1i , �2i , . . . , �Mi } where �ji is the frequency of the jthterm of the vocabulary W in all the document published inor before the ith year and

PMj=1 �

ji = 1.

This evolving distribution of the vocabulary enables us togenerate a time series for each document based on the rel-ative novelty of that article. If an article d 2 D publishedin the jth year contains entities that have already been pub-lished in the literature in any ith year where i < j, thend will not introduce much change in �i since the distribu-tion already contains some values for those entities. Basedon this concept, we replicate document d and place it in thepartial corpus Di. Let the extended corpus for document din the ith year be Di = Di [ d and the term distribution ofthe extended corpus Di be �i. The amount of perturbationthat document d creates in the distribution of the ith yearis estimated as the Kullback-Leibler divergence between theoriginal distribution �i and the distribution of the extendedcorpus �i.

⇡id =

lX

i=1

�i ln�i�i

(1)

The order of two distributions �i and �i in Equation 1 isimportant due to the asymmetric nature of KL-divergence.Since our intention is to measure the extra information addedto the corpus Di by the inclusion of document d, we alwayscompute the KL-divergence from �i to �i, not the other wayaround.

By placing d in each year yi 2 Y we get a time series⇡d = {⇡1

d,⇡2d, . . . ,⇡

ld} for d that represents the document’s

influence on the vocabulary over all the years. The red linein Figure 1(a) shows this perturbation-based time series of asample document.

In the next step, we create a synthetic document d0 thatis an identical copy of d but each of the terms being en-

Page 3: Encoding Lineage in Scholarly Articles - The … Lineage in Scholarly Articles Sheikh Motahar Naim1, Md Abdul Kader1, Arnold P. Boedihardjo2, M. Shahriar Hossain1 ... Barbara, and

coded by a unique identifier. That is, the terms in the syn-thetic document do not appear in any of the documents in theentire corpus D. This ensures that d0 independently holdsthe properties of d as a single document but it is a novel-ized one since none of the documents contains the syntheticterms. Now following the same procedure as we used ford, we add d0 to Di to form an extended corpus D0

i, calcu-late the term distribution �0i for that corpus, and estimatethe amount of perturbation d0 could introduce. We construct⇡d0 = {⇡1

d0 ,⇡2d0 , . . . ,⇡l

d0} for d0 the same way as we com-puted ⇡d. The blue line in Figure 1(a) is the signal generatedby placing d0 in every year. Since the content of d0 is unique,the line for d0 will always have larger (or equal) value inevery year than the line for the real document d.

The time series ⇡d0 is a representation of the degree of in-fluence the document d would have had on the vocabularyhad it be a completely unique document. Therefore, the dif-ference between the two time series ⇡d and ⇡d0 gives us anestimation of the amount of uniqueness of document d inevery year of the corpus — the smaller the difference thecloser it is to the extremely unique document. Figure 1(b)shows the signal for d that is derived from the differencebetween ⇡d and ⇡d0 . We denote this signal as Sd and calcu-late its magnitude in ith year as the difference between theangles created by the time series ⇡d and ⇡0

d at the ith year.

4.2 Identification of Causal ClustersThe information provided by an article published today is anoutcome of knowledge accumulated over time in the past.Each article in a corpus of a certain domain is very likelyto have a few articles to which it is causally dependent.We use the signal Sd generated for each document d 2 Dto compute the causalities between all articles. We lever-age Granger causality test in this purpose. Granger causal-ity (Granger 1969) is a statistical hypothesis test which es-timates the usefulness of one time series in predicting thefuture values of another time series. A time series X is saidto Granger-cause another time series Y if it can be shownthat those X values provide statistically significant informa-tion about future values of Y . We use Granger causality totest if there exists any causal relationship between two docu-ments. Assuming that the document di was published beforethe document dj , we test the following hypothesis to identifya causal effect from di to dj :

P[dj(t+ 1) 2 A|I(t)] 6= P[dj(t+ 1) 2 A|I�di

(t)] (2)

where P indicates probability, A is an arbitrary non-emptyset, and I(t) and I�d

i

(t) respectively denote the informa-tion available as of time t in the corpus, and that in the corpusexcluding di. If the above hypothesis is accepted, we say thatdocument di Granger causes document dj . Figure 2 demon-strates the causal relationship between two documents basedon their time series.

Once we identify the set of all the causal relationships Rin the corpus, we can build a causality network or a causalitymatrix. In practice, for large corpus we limit causality com-putation for any document to the documents published in rprevious years only. The set of IEEE publications that weuse in the experimental results section spans over 54 years,

-50

0

50

100

0 2 4 6 8 10 12 14 16 18 20

Time

Document studies history of microprocessors

4004, 8086, ALU MC68000, 32 bit microprocessorProtected mode, 80286

-50

0

50

100

0 2 4 6 8 10 12 14 16 18 20

Time

Document studies history of operating systems Calculator, Nintendo, Batch processing, CP/M Windows NT, Mac OS, GUI

Multi-user multitasking systems

Figure 2: An illustrative example of causality between two docu-ments based on their signals.

Algorithm 1: ComputeLineage – algorithm to com-pute the causal chain of a document.

input : Document diparameter: Set of causal pairs, R

Look-back threshold, rBranching factor, b

output : Lineage of di, hi

1 create an empty list hi;2 append {di} to hi;3 do4 create an empty set Sp;5 S read the last set from hi;6 for each document d 2 S do7 Rd {(dp, dq) : (dp, dq) 2 Rd and dq = d}

if |Rd| > b then8 Rd top b pairs of Rd;9 end

10 for each causal pair (dp, dq) 2 Rd do11 if yearOf(dq) - yearOf(dp) r then12 add dp to Sp;13 end14 end15 end16 if |Sp| 6= 0 then17 append Sp to hi;18 end19 while |Sp| 6= 0;20 return hi

and we vary the value of that look-back threshold r from 8to 10 years for our experiments.

The strengths of the causal relationships in R allow us touse clustering algorithms to find causal groups of articles.We use a density based clustering algorithm DBSCAN togroup causal documents. The motivation behind the use of adensity based clustering is that this specific family of clus-tering algorithms does not require the number of clusters kas an input (unlike k-means clustering). DBSCAN is a logi-cal choice for causal partitioning since there is no metadatathat can help us identify possible number of causal clusters.

4.3 Computing Causal ChainsAlgorithm 1 provides a chain of causal documents for agiven document di. Each chain is a graph or tree containingcausality flows from leaf documents toward the documentfor which the lineage is generated. The chain is initialized inline 1-2 with the given document di. We expand the chainby adding new set of documents that have causal influenceon the documents of the previous set. Each document d in a

Page 4: Encoding Lineage in Scholarly Articles - The … Lineage in Scholarly Articles Sheikh Motahar Naim1, Md Abdul Kader1, Arnold P. Boedihardjo2, M. Shahriar Hossain1 ... Barbara, and

level is expanded to b articles with highest causalities on dwhich were published within r years of the publication dateof d. The procedure terminates when there is no more causalparent left to expand from a certain level.

5 EvaluationWe evaluate our framework by evaluating the quality ofcausal clusters, how much information is diffused over timein a lineage, and comparing the lineage produced by our ap-proach with a similarity based model, information geneal-ogy (Shaparenko and Joachims 2007), and a citation net-work based lineage.

One of the evaluation metrics we use is diffusion coeffi-cient, which estimates a quantity of how the theme changedover time. The basic assumption here is that a lineage ofscholarly articles should diffuse a concept over the yearsand generate new ideas. Let P = {p1, p2, . . . , p|P |} bethe set of paths in the lineage of a document d 2 D, andpk = {dpk

0 , dpk

1 , . . . , dpk

n�1} be the kth path in P containingn documents and dpk

0 refers to the source document d. Dif-fusion coefficient D(d) of lineage of the document d is thendefined as

D(d) =1

|P |

|P |X

k=1

1� 1

n� 2

n�3X

i=0

n�1X

j=i+2

disp(dpk

i , dpk

j )

!

(3)where

disp(dpk

i , dpk

j ) =

⇢1

n+i�j if dpk

i and dpk

j have term overlap0 otherwise.

Larger values of D(d) indicate better diffusion over timewhile smaller values will refer to lesser conceptual drift.

In addition, we examine how much of the lineage gener-ated by our approach overlaps with the citation tree formedfor a document. Let T citation(d) and T causality(d) are theancestor trees of depth L of a document d 2 D generated us-ing the actual citations and causal relationships, respectively.Then a citation overlap score for article d can be defined as:

Acausality(d) =LX

l=1

���⌘Tcitation(d)

l \⌘Tcausality(d)

l

���⇥l

L(4)

where ⌘Tcitation(d)

l and ⌘T causality(d)l are the set of refer-

ences of documents at level l of the tree T citation(d) andT causality(d) respectively. The same equation can be usedto compute citation overlap score Asimilarity for a similar-ity based lineage as well as Agenealogy for an informationgenealogy based approach.

We express the degree of causality within the documentsin the same topical group t as causal density score ⇢(t):

⇢(t) =2⇥ |Rt|nt ⇥ b

(5)

where b is the maximum number of causal parents consid-ered for each document, Rt ⇢ R is the set of causal rela-tions within topic t and nt is the number of documents in t.We also calculate the causal influence of a topic t1 on an-other topic t2 as:

�(t1, t2) =|{(di, dj) 2 R : di 2 Dt1 , dj 2 Dt2}|

nt1 + nt1

(6)

A cluster of documents with large number of term overlaps

A cluster of documents with smaller number of terms overlaps

05

101520253035404550

clus

ter

stru

ctur

ete

xtfe

atur

eco

mpl

exity

spac

eef

fect

iven

ess

orde

rcl

uste

ring

feat

ure

spac

ecl

assi

ficat

ion

proc

ess

term eb

data

set

mix

ture

resp

ect

web

dim

ensi

onal

itysi

mila

rity

Perc

enta

ge o

f doc

umen

ts

with

term

ove

rlap

Cluster Size=22

05

101520253035404550

way

pres

ent

tree

topi

cre

latio

nshi

pru

leor

der

feat

ure

pref

ace

know

ledg

epo

int

proc

ess

data

set

cano

nica

lne

twor

kex

pert

cont

ent

disc

over

ysc

alab

ility

diff

icul

tyPerc

enta

ge o

f doc

umen

ts

with

term

ove

rlap

Cluster Size=24

Figure 3: Comparison of two highly causal clusters in term of thetextual similarities among the documents of each cluster.where Dt1 and Dt2 are the set of documents in topic t1 andt2 respectively. Finally, the causal dominance (t) of a topicover the other topics in the corpus is calculated as

(t) =1

nt

X

d2Dt

X

t02T

X

d02Dt

0

⇢1 (d, d0) 2 R0 otherwise.

(7)

where T is the set of all topics

6 Experimental ResultsIn this section, we seek to answer the following questionsto justify the capabilities and correctness of the proposedmodel.

1. How does the causality based clustering mechanism com-pare to a baseline similarity based clustering approach?(Section 6.1)

2. Can the proposed lineage formation approach identify ge-nealogies drifted from other topics? (Section 6.2)

3. Which topical groups have most causal influence on theother topical groups in the entire corpus? (Section 6.3)We collected a publication dataset of titles and abstracts

along with some meta-data that includes publication yearand citations of each paper. The collection contains 412,484computer science articles from the IEEE Xplore digital li-brary. The computer science articles are recognized by en-tries available in the DBLP computer science bibliographydatabase. This collection contains documents from the year1961 to 2014. We extracted over eight hundred thousand en-tities from the titles and the abstracts using lingpipe, Stan-ford NER and openNLP entity detectors (Hossain et al.2012). The entities were then tokenized to construct the fea-ture set for each document.6.1 Causality based vs. Similarity based

clusteringOne may argue that one particular scholarly article is onlymotivated by similar articles published in the past. Whilesimilarity is a good way to discover articles published onthe same topic, the inter-topic influence cannot be captured

0

10

20

30

40

50

60

70

80

0 2 4 6 8 10 12 14 16

Perc

enta

ge o

f doc

umen

t cl

uste

rs d

istri

bute

d to

T te

rm

clus

ters

Number of Term Cluster, T

Causality based document clusteringBaseline (Similarity based clustering)

2-5 4-7 6-9 8-11 10-13 12-15 14-17 16-19

Figure 4: Comparison between causality based and similarity baseddocument clustering with respect to the causality based term clus-tering.

Page 5: Encoding Lineage in Scholarly Articles - The … Lineage in Scholarly Articles Sheikh Motahar Naim1, Md Abdul Kader1, Arnold P. Boedihardjo2, M. Shahriar Hossain1 ... Barbara, and

Spatio-Temporal Topic Modeling in

Mobile Social Media for Location

RecommendationAbstract—Mobile networks enable users to post on social media services (e.g., Twitter)

A Spatio-Temporal Simulation Model

for Movement Data Generation

Abstract—The real-world process of generating a largespatio-temporal data collection

Viral Marketing for Multiple Products

Abstract—Viral Marketing, the idea of exploiting socialinteractions of users to propagate

A Compact Representation of

Spatio-Temporal DataAbstract—As technology advances we encounter moreavailable data on moving objects

Indexing the Trajectories of

Moving Objects in Networks

Abstract—The management of moving objects has been intensively

Developing Data Allocation Schemes byIncremental Mining of User Moving Patterns

Abstract—In this paper, we present a new data mining algorithm which involves incremental mining

Parallel Mining of Association Rules

Abstract—We consider the problem of mining association rules on a shared nothing multiprocessor

Incremental implementation model

for relational databases with

transaction timeAbstract—An implementation model for the standard relational data model extended with

2012

2009 2008

2006

2004

2003

1996 1991

Epidemic spreading in real networks: an

eigenvalue viewpoint

Abstract—How will a virus propagate in a real network? Does an epidemic threshold exist for a finite graph?

2001

Measuring and modeling computer

virus prevalence

Abstract—To understand the current extent of the computer virus problem and predict its future course, the authors have

1993

Figure 5: The causal chain of a document discovered by the pro-posed method.using similarity search. To verify whether highly causal doc-ument clusters are always similar or not, we analyze the sim-ilarity between the documents of each of the causal clustersdetected by density based clustering (as described in Section4.2). Figure 3 shows the percentage of documents that sharecertain terms in two different causal clusters. The terms inthe x-axis are ordered based on percentage of documents ina cluster containing those terms. The top twenty frequentterms are shown for each cluster. The plot in the left showsthat the documents in a highly causal cluster have largeamount of term overlaps. For example, 32% and 23% of thedocuments contain the terms cluster and structure, respec-tively. In contrast, the plot at the right side illustrates thatthe documents in a highly causal cluster can have compara-tively low textual similarity. The most frequent term “way”appears in just 12.5% of the documents of the causal clus-ter represented by the plot at the right side of Figure 3. Thisstipulates that a causal cluster of documents may or may notbe similar in terms of textual contents.

While Figure 3 shows an evidence that similarity is notthe key indicator of causality, it does not confirm that highlycausal documents grouped in a cluster are causal at the fea-ture level. To examine whether a causality based groupingof documents brings more causal items together than a sim-ilarity based grouping, we cluster all the documents basedon causality and similarity separately. Then we cluster allthe terms of the corpus based on term causality consideringthe frequency of each term in each year as the amplitude ofthe corresponding term signal. Additionally, we apply topicmodeling to the corpus and assign a topic to each of the arti-

0

0.1

0.2

0.3

0.4

0.5

0.6

0 2 4 6 8 10

Aver

age

cita

tion

over

lap

scor

e

Tree depth, L

Causality basedSimilarity basedGenealogy based 0

0.005

0.01

0.015

0.02

0.025

0.03

0.035

0.04

3 3.5 4 4.5 5 5.5 6 6.5 7

Aver

age

diffu

sion

coef

ficie

nt

Length of a lineage chain

Causality basedSimilarity basedCitation basedGenealogy based

Figure 6: (left) Average citation overlap scores for three ap-proaches: 1) our causality based approach, 2) similarity based lin-eage, and 3) information genealogy based approach. (right) Ourcausality based framework demonstrates the highest form of con-ceptual diffusion over time.

cles. We seek to verify the causality of the terms of the doc-uments of a causality based cluster by examining the domi-nant terms of the topics of the documents of that causal clus-ter. Those dominating terms should come from a small num-ber of causality based term clusters. In contrast, the domi-nant topical terms of a similarity based cluster of articles willtend to come from multiple causal term clusters if causal-ity and similarity are less relational. Figure 4 shows that thedominant topical terms of causality based document clustersare distributed to small number of causal term clusters in-dicating that causality based document clusters are highlycausal at term level. The similarity based document clus-ters exhibit a different trend. The topical terms of the simi-larity based article clusters spread into comparatively largenumber of causal term clusters. This indicates that the doc-uments inside the causal groups identified by our proposedframework are more causal at feature level than a groupingdiscovered by a similarity based baseline approach.6.2 Lineage FormationFigure 5 shows an example of a lineage identified by Algo-rithm 1 for a document in the IEEE Xplore library (ArticleReference Number: 6729600). The branching factor we usedfor this lineage is b = 4 and the look-back threshold r = 10.The initial document for which the chain is formed embedsspatio-temporal data, social media, and recommender sys-tem in one paper. The causal paths shown in the lineage il-lustrates that the topic of this document was influenced bydifferent areas of research including relational database withtemporal aspects, association rule mining, network analy-sis, and viral marketing in social media. The chain discoversmore causal articles than similar ones to form the lineage.

Overall, we evaluate the lineage in terms of diffusion co-efficient (Equation 3) and citation overlap (Equation 4). Fig-ure 6(left) compares average citation overlap scores at dif-ferent levels of all the lineages detected for all documentsusing three approaches (1) our causality based approach, (2)similarity based approach, and (3) information genealogybased approach. The figure shows that the similarity basedapproach has the highest overlap with the citation lineage.This matches the fact that citations are generally outcomesof keyword search resulting in similar topics. Our approachexhibits the least amount of overlaps with the citation net-work among three approaches. In contrast, Figure 6(right)shows that our approach has the highest average conceptualdiffusion over time than all other methods, even when com-pared to the original citation network. This indicates that ourapproach has the ability to detect how the lineage driftedfrom another topic and formed the current literature, whichother methods do not possess.6.3 Analyzing Topical CausalityIn this section we infer the notion of document-level causal-ity to understand topic-level influential relationships in theliterature. For the experiments in this section we used thedocuments published in ICDM conferences and workshopsfrom 2001 to 2014, which is a subset of the IEEE datasetused in the preceding experiments. We apply LDA (Blei, Ng,and Jordan 2003) with 20 topics on this ICDM data set to getthe distribution of topics in each document. We categorize a

Page 6: Encoding Lineage in Scholarly Articles - The … Lineage in Scholarly Articles Sheikh Motahar Naim1, Md Abdul Kader1, Arnold P. Boedihardjo2, M. Shahriar Hossain1 ... Barbara, and

0

0.05

0.1

0.15

0.2

0.25

0.3

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Causal0Density0of0Topic

Topic0ID

0.23%

0.19%

Topic%13%Topic%10%

Topic%7%

Topic%19%

Topic%4%

Topic%16%

0.16%

0.09%

0.15%

0.04%

0" 0.5" 1" 1.5" 2" 2.5" 3" 3.5" 4" 4.5" 5"Causal"strength"

Topic"ID"

10"

16"

8"7"

2"

Figure 7: (left) Within-topic causality score of different topics. (middle) Topic pairs with most causal relation between them. (right) Topicswith most causal strengths.

document to be under the topic which has the highest prob-ability in that document.

If the documents under a particular topic are mostlycaused by the other documents in that topic, the topic issupposed to depict more causal density (Equation 5). Figure7(left) shows the causal density of different topics for b = 5.From the figure we can see that documents related to outliersdetection (topic no. 8) and association rule and subgraphmining (topic no. 10) are mostly caused by the documentsunder the same topic. On the other hand, the research ar-eas that inspired or were inspired by a variety of other fieldsshould demonstrate strong causal relations across the topics.Figure 7(middle) shows the pairs of topics that have mostcausal influence calculated by Equation 6. Each edge in thefigure is labeled with the value of �(t1, t2). There are someareas in the literature that are more dominant than others andhence cause more documents within and across the topics.Figure 7(right) shows the top five dominant topics with theircausal strengths computed by Equation 7.

Based on the causal relationships found among the top-ical groups of documents, our observation is that the top-ics with higher cross-topic causality has low causal density(e.g. topic number 5 and 18). Topic 10 is an exception in thatsense, which is the reason behind its being the most domi-nant topic in the entire corpus.

7 ConclusionIn this paper we present a novel time series based represen-tation for scientific articles that enables searching beyondmere content similarity. Though such representation is in-tuitive and has shown superior ability in identifying actualcausal relationships between the documents, we are yet tocomprehend all of the characteristics of these signals. Onedirection of our research is going towards having a better un-derstanding of these document signals and applying them toidentify even more subtle causalities. We also have shown amechanism to find a causal chain of documents for any givendocument and then inferred those chains to topical groups ofdocuments in order to see the causal relationships betweendifferent topics. However, a direct approach towards under-standing the topical evolution – how one concept begetsother new ideas, gets merged with some other concepts, orsimply vanishes away at some point – and the causal factorsbehind those behaviors of topics would be more interesting.

ReferencesAlSumait, L.; Barbara, D.; and Domeniconi, C. 2008. On-line lda: Adaptive topic models for mining text streams with

applications to topic detection and tracking. In ICDM’08,3–12.Blei, D. M., and Lafferty, J. D. 2006. Dynamic topic models.In ICML’06, 113–120.Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latentdirichlet allocation. Journal of machine Learning research3:993–1022.Caragea, C.; Bulgarov, F.; Godea, A.; and Gollapalli, S. D.2014. Citation-enhanced keyphrase extraction from researchpapers: A supervised approach. In EMNLP’14.Cheng, B. S., and Lai, T. W. 1997. An investigation of co-integration and causality between energy consumption andeconomic activity in taiwan. Energy Economics 19(4):435–444.El-Arini, K., and Guestrin, C. 2011. Beyond keywordsearch: discovering relevant scientific literature. In KDD’11,439–447.Granger, C. W. 1969. Investigating causal relations byeconometric models and cross-spectral methods. Economet-rica: Journal of the Econometric Society 424–438.Hasan, M. A.; Spangler, W. S.; Griffin, T.; and Alba, A.2009. Coa: Finding novel patents through text analysis. InKDD’09, 1175–1184.Hossain, M. S.; Butler, P.; Boedihardjo, A. P.; and Ramakr-ishnan, N. 2012. Storytelling in entity networks to supportintelligence analysts. In KDD’12, 1375–1383.Kataria, S.; Mitra, P.; and Bhatia, S. 2010. Utilizing con-text in generative bayesian models for linked corpus. InAAAI’10, volume 10, 1.Mei, Q., and Zhai, C. 2005. Discovering evolutionary themepatterns from text: an exploration of temporal text mining. InKDD’05, 198–207.Roebroeck, A.; Formisano, E.; and Goebel, R. 2005. Map-ping directed influence over the brain using granger causal-ity and fmri. Neuroimage 25(1):230–242.Shaparenko, B., and Joachims, T. 2007. Information ge-nealogy: uncovering the flow of ideas in non-hyperlinkeddocument databases. In KDD’07, 619–628.Teufel, S., and Moens, M. 2002. Summarizing scientific arti-cles: experiments with relevance and rhetorical status. Com-putational linguistics 28(4):409–445.Wang, X., and McCallum, A. 2006. Topics over time: anon-markov continuous-time model of topical trends. InKDD’06, 424–433.