(121013) #fitalk locating the source of diffusion in large-scale network

2nd Division, 2nd R&D Institute, Agency for Defense Development
Donghwan Lee
A Quick Introduction to Network Science for Security Guys


Page 1:

2nd Division, 2nd R&D Institute, Agency for Defense Development

Donghwan Lee

A Quick Introduction to Network Science for Security Guys

Page 2:

Mafiaboy & Project Rivolta

• In February 2000, large commercial websites, including Yahoo, FIFA, Amazon, Dell, E*TRADE, eBay, and CNN, were shut down

• The suspect turned out to be a high-school student who attacked the listed sites with only basic knowledge and techniques, causing $1.2 billion in global economic damage

Page 3:

Science

• “Science should make the wonderful and complex understandable and simple but not less wonderful” - Herbert A. Simon

• “Science is built of facts the way a house is built of bricks; but an accumulation of facts is no more science than a pile of bricks is a house” - Henri Poincaré

Page 4:

Pareto law

• 80% of results come from 20% of the sources

Page 5:

Small-World Experiment

• Frigyes Karinthy, “Láncszemek,” in Minden másképpen van, 1929.

• “We could name any person among earth’s one and a half billion inhabitants and through at most five acquaintances, one of which he knew personally, he could link to the chosen one”

Page 6:

6 Degrees of Separation

• Milgram’s Experiment: counting the number of ties between any two people

• Result: people in the United States are separated by about six people on average

• Jeffrey Travers & Stanley Milgram. 1969. "An Experimental Study of the Small World Problem." Sociometry, Vol. 32, No. 4, pp. 425-443.

Page 7:

Random Tie

• Erdős–Rényi Model: random graph (1960)

• As the network gets larger, the fraction of random edges required to connect it decreases dramatically
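
A minimal sketch of this effect, assuming "required" means the random edges drawn before the graph becomes connected. The function names `edges_until_connected` and `required_fraction`, the seeds, and the network sizes are illustrative choices, not part of the talk.

```python
import random

def edges_until_connected(n, rng):
    """Add uniformly random edges to n isolated nodes until the graph is
    connected; return how many edges were drawn (repeats included)."""
    parent = list(range(n))              # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    components, drawn = n, 0
    while components > 1:
        u, v = rng.sample(range(n), 2)       # random pair of distinct nodes
        drawn += 1
        ru, rv = find(u), find(v)
        if ru != rv:                         # edge merges two components
            parent[ru] = rv
            components -= 1
    return drawn

def required_fraction(n, seed=0):
    """Edges drawn until connectivity, as a fraction of all n(n-1)/2 pairs."""
    rng = random.Random(seed)
    return edges_until_connected(n, rng) / (n * (n - 1) / 2)

if __name__ == "__main__":
    for n in (20, 200, 2000):
        print(n, round(required_fraction(n), 4))
```

The absolute number of edges still grows with the network; it is the share of all possible edges that shrinks.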

Page 8:

Weak Tie

• From whom do people often get jobs?

• Mark Granovetter, “The Strength of Weak Ties,” American Journal of Sociology, Vol. 78, No. 6, pp. 1360-1380, May 1973.

• Capability-based Security

Page 9:

Small World: Watts-Strogatz

• D. J. Watts & S. H. Strogatz, “Collective dynamics of ‘small-world’ networks,” Nature, vol. 393, no. 6684, pp. 440–442, 1998.

Page 11:

Scale-free Examples

• Power grid

• Transportation networks

• Metabolic networks

• Food webs

• Worm epidemics

Page 12:

Scale-free Examples

• River networks

• Synchronization networks of fireflies or hand claps

Page 13:

Scale-free Examples

• Social Networks!

Page 14:

Brain vs. The Universe

Page 15:

Internet: revisited

• Paul Baran: father of ARPANET, designed the survivable architecture of the Internet

Page 16:

Internet: revisited

• But the result shows...

Page 17:

Power-Law Topology of the Internet

[Four log-log plots, x-axis from 1 to 100 and y-axis from 1 to 10000, each shown with its fitted power law:]

"971108.out": exp(7.68585) * x ** ( -2.15632 )

"980410.out": exp(7.89793) * x ** ( -2.16356 )

"981205.out": exp(8.11393) * x ** ( -2.20288 )

"routes.out": exp(8.52124) * x ** ( -2.48626 )

• Number of routers vs. number of nodes (links)
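
The fits on this slide have the form exp(a) * x ** b, which is a straight line on log-log axes: log y = a + b log x. The following is a minimal sketch of recovering a and b by least squares on the logarithms. The data points are synthetic, generated from the fit reported for "971108.out" purely for illustration; they are not the original measurements, and `fit_power_law` is a hypothetical helper name.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = a + b*log(x); returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx                 # slope on log-log axes = power-law exponent
    a = mean_y - b * mean_x       # intercept = log of the prefactor
    return a, b

# Synthetic points lying exactly on the fit reported for "971108.out".
A_FIT, B_FIT = 7.68585, -2.15632
xs = [1, 2, 5, 10, 20, 50, 100]
ys = [math.exp(A_FIT) * x ** B_FIT for x in xs]

a, b = fit_power_law(xs, ys)
print(f"a = {a:.5f}, b = {b:.5f}")
```

With real measurements the points scatter around the line, and the exponent b is only as reliable as the range of x over which the straight-line behaviour holds.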

Page 18:

Diameter of WWW

• WWW is a scale-free network

• Diameter of WWW = 19 (avg. 19 clicks between web pages)

• R. Albert, H. Jeong, and A.-L. Barabási, “The diameter of the WWW,” Nature, vol. 401, no. 6749, pp. 130-131, 1999.

Page 19:

And How the Web Grows...

• Rich-get-richer

• A.-L. Barabási and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, pp. 509-512, October 15, 1999.
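
A minimal sketch of rich-get-richer growth in the spirit of the preferential-attachment model cited above: each new node links to existing nodes with probability proportional to their current degree. The seed graph (a small clique), the parameters `n` and `m`, and the function names are assumptions made for illustration.

```python
import random

def preferential_attachment(n, m, seed=0):
    """Grow a graph to n nodes; each new node links to m distinct existing
    nodes chosen with probability proportional to their degree."""
    rng = random.Random(seed)
    # Seed graph (an assumption): a clique on m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears in `ends` once per incident edge, so a uniform draw
    # from `ends` picks a node with probability proportional to its degree.
    ends = [v for e in edges for v in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))
        for t in sorted(targets):
            edges.append((new, t))
            ends.extend((new, t))
    return edges

def degrees(edges):
    """Map each node to its degree."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

if __name__ == "__main__":
    deg = degrees(preferential_attachment(2000, 2))
    values = sorted(deg.values())
    print("max degree:", values[-1], "median degree:", values[len(values) // 2])
```

Early nodes accumulate links faster than late ones, so a few hubs end up with degrees far above the median.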

Page 20:

2nd Division, 2nd R&D Institute, Agency for Defense Development

Donghwan Lee

Locating the Source of Diffusion in Large-Scale Network

Page 21:

Locating Source of Diffusion

• Forward problem: understanding the diffusion process and its dependence on the rates of infection and cure

• But now we consider the inverse problem: inferring the original source of diffusion

Page 22:

Network Model

Locating the Source of Diffusion in Large-Scale Networks

Pedro C. Pinto,1 Patrick Thiran,1 Martin Vetterli1

1École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland

How can we localize the source of diffusion in a complex network? Due to the tremendous size of many real networks, such as the Internet or the human social graph, it is usually infeasible to observe the state of all nodes in a network. We show that it is fundamentally possible to estimate the location of the source from measurements collected by sparsely-placed observers. We present a strategy that is optimal for arbitrary trees, achieving maximum probability of correct localization. We describe efficient implementations with complexity O(N^α), where α = 1 for arbitrary trees, and α = 3 for arbitrary graphs. In the context of several case studies, we determine how localization accuracy is affected by various system parameters, including the structure of the network, the density of observers, and the number of observed cascades.

Localizing the source of a contaminant or a virus is an extremely desirable but challenging task. In nature, many animals are intrinsically capable of performing source localization. Through chemotaxis, for example, certain bacteria can analyze concentration gradients around them in order to quickly move towards the source of a nutrient, or quickly avoid the source of a poison [1,2]. Animals such as the Pacific salmon and the green sea turtles are capable of using olfaction to navigate in odor plumes, for foraging or reproductive activities [3,4]. In certain systems, however, the task of localizing the source has to be performed in a network, rather than in the continuous space. This is the case, for example, when an infectious disease spreads through human populations across a large region, as observed with the worldwide H1N1 virus pandemic in 2009. Here the system is more conveniently modelled as a network of interconnected people, and source localization reduces to identifying which person in the network was first infected.

In recent years, there has been significant effort in studying the dynamics of epidemic outbreaks on networks [5–11]. In particular, the focus has been on the forward problem of epidemics: understanding the diffusion process and its dependence on the rates of infection and cure, as well as on the structure of the network. In this letter, we focus on the inverse problem of inferring the original source of diffusion, given the infection data gathered at some of the nodes in the network. The ability to estimate the source is invaluable in helping authorities contain the epidemic or contamination. In this context, the inference of the underlying propagation network was studied in [12], while the inference of the unknown source was analyzed in [13], in both cases assuming that we know the state of all nodes in the network. More recently, the controllability of complex networks was considered in [14], using appropriately selected driver nodes. Here, our goal is to locate the source of diffusion under the practical constraint that only a small fraction of nodes can be observed. This is the case, for example, when locating a spammer who is sending undesired emails over the Internet, where it is clearly impossible to monitor all the nodes. Thus, the main difficulty is to develop tractable estimators that can be efficiently implemented (i.e., with sub-exponential complexity), and that perform well on multiple topologies.

[Figure 1: graph diagram with a legend for "observer" and "information source", showing the source s*, the observers o1, o2, o3, their neighbours v1 to v5, and the arrival times t_{v1,o1}, t_{v2,o1}, t_{v3,o2}, t_{v4,o3}, t_{v5,o3}.]

Figure 1. Source estimation on an arbitrary graph G. At the unknown time t = t*, the information source s* initiates the diffusion. The blue edges denote those over which information has already propagated. In this example, there are three observers, which measure from which neighbours and at what time they received the information. The goal is to estimate, from these observations, which node in G is the information source.

We first introduce our network model. The underlying network on which diffusion takes place is modeled by a finite, undirected graph G = {V, E}, where the vertex set V has N nodes, and the edge set E has L edges (Fig. 1). The graph G is assumed to be known, at least approximately, as is often verified in practice (e.g., rumors spreading in a social network, or electrical perturbations propagating on the electrical grid). The information source, s* ∈ G, is the vertex that originates the information and initiates the diffusion. We model s* as a random variable (RV) whose prior distribution is uniform over the set V, i.e., any node in the network is equally likely to be the source a priori.

The diffusion process is modeled as follows. At time t, each vertex u ∈ G has one of two possible states: i) informed, if it has already received the information from any neighbour; or ii) ignorant, if it has not been informed so far. Let V(u) denote the set of vertices directly connected to u, i.e., the neighbourhood or vicinity of u. Suppose u is in the ignorant state and, at time t_u, receives the information for the first time from one neighbour, say s, thus becoming informed. Then,

• For a finite, undirected graph


Page 23:

Observers & Observation Set

• If t_{v,o} denotes the absolute time at which observer o receives the information from its neighbour v, then the observation set is composed of tuples of direction and time measurements, i.e., {(o, v, t_{v,o})}, for all o ∈ O and v ∈ V(o)
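
The observation set can be illustrated with a small forward simulation. This is a sketch under stated assumptions, not the authors' code: the graph is a plain adjacency dictionary, the per-edge delays are given as fixed non-negative numbers rather than drawn from a distribution, and `simulate_diffusion` is a hypothetical name. Each node retransmits to all neighbours other than the one it first heard from, and every observer records from which neighbour and at what time the information arrives.

```python
import heapq

def simulate_diffusion(graph, delay, source, observers, t_start=0.0):
    """Spread information from `source` and return the observation set as
    sorted tuples (o, v, t_vo): observer o heard from neighbour v at t_vo.

    graph: dict node -> list of neighbours (undirected)
    delay: dict frozenset({u, v}) -> non-negative propagation delay
    """
    informed_at = {}    # first time each node receives the information
    heard_from = {}     # neighbour that first informed each node
    heap = [(t_start, source, None)]
    while heap:
        t, u, sender = heapq.heappop(heap)
        if u in informed_at:
            continue                      # u was already informed earlier
        informed_at[u], heard_from[u] = t, sender
        for v in graph[u]:
            if v != sender:               # retransmit to all *other* neighbours
                heapq.heappush(heap, (t + delay[frozenset((u, v))], v, u))
    observations = []
    for o in observers:
        for v in graph[o]:
            # v transmits to o unless v itself first heard the news from o
            if v in informed_at and heard_from[v] != o:
                t_vo = informed_at[v] + delay[frozenset((v, o))]
                observations.append((o, v, t_vo))
    return sorted(observations)

# Path graph 0 - 1 - 2 - 3 with unit delays, source 0, single observer 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
unit = {frozenset((i, i + 1)): 1.0 for i in range(3)}
print(simulate_diffusion(path, unit, source=0, observers=[3]))  # [(3, 2, 3.0)]
```

On a tree each observer hears from exactly one neighbour; on a graph with cycles the same observer can appear in several tuples.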

u will retransmit the information to all its other neighbours, so that each neighbour v ∈ V(u)\s receives the information at time t_u + θ_uv, where θ_uv denotes the random propagation delay associated with edge uv. The RVs {θ_uv} for different edges uv have a known, arbitrary joint distribution. The diffusion process is initiated by the source s* at an unknown time t = t*. This diffusion model is general enough to accommodate various scenarios encountered in practice.

Let O ≜ {o_k}_{k=1}^{K} ⊆ G denote the set of K observers, whose location on G is chosen or known. Each observer measures from which neighbour and at what time it received the information. Specifically, if t_{v,o} denotes the absolute time at which observer o receives the information from its neighbour v, then the observation set is composed of tuples of direction and time measurements, i.e., 𝒪 ≜ {(o, v, t_{v,o})}, for all o ∈ O and v ∈ V(o).

How is the source location recovered from the measurements taken at the observers? We adopt a maximum probability of localization criterion, which corresponds to designing an estimator ŝ(·) such that the localization probability P_loc ≜ P(ŝ(𝒪) = s*) is maximized. Since we consider s* to be uniformly random over G, the optimal estimator is the maximum likelihood (ML) estimator,

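\[
\begin{aligned}
\hat{s}(\mathcal{O}) &= \arg\max_{s \in G} P(\mathcal{O} \mid s^* = s) \\
&= \arg\max_{s \in G} \sum_{\Pi_s} P(\Pi_s \mid s^* = s) \times \int \cdots \int g(\theta_1, \ldots, \theta_L, \mathcal{O}, \Pi_s, s) \, d\theta_1 \cdots d\theta_L.
\end{aligned}
\tag{1}
\]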

Here, Π_s is the set of all possible paths {P_{s,o_k}}_{k=1}^{K} between the source s and the observers in the graph G; the set {θ_l}_{l=1}^{L} represents the random propagation delays for all L edges of graph G; and g is a deterministic function that depends on the joint distribution of the propagation delays in a complicated way. In essence, the estimator in (1) is performing averages over two different sources of randomness: a) the uncertainty in the paths that the information takes to reach the observers, and b) the uncertainty in the time that the information takes to cross the edges of G. Due to the combinatorial nature of (1), its complexity increases exponentially with the number of nodes in G, and is therefore intractable. In what follows, we propose a strategy of complexity O(N) that is optimal for general trees, and a strategy of complexity O(N^3) that is sub-optimal for general graphs.

Consider first the case of an underlying tree T. Because a tree does not contain cycles, only a subset O_a ⊆ O of the observers will receive information emitted by the unknown source. We call O_a = {o_k}_{k=1}^{K_a} the set of K_a active observers. The observations made by the nodes in O_a provide two types of information: a) the direction in which information arrives to the active observers, which uniquely determines a subset T_a ⊆ T of regular nodes (called active subtree, Fig. 2a); and b) the timing at which the information arrives to the active observers, denoted by {t_k}_{k=1}^{K_a}, which is used to localize the source within the set T_a. It is also convenient to label the edges of T_a as E(T_a) = {1, 2, ..., E_a}, so that the propagation delay associated with edge i ∈ E is denoted by the RV θ_i (Fig. 2a).

[Figure 2: (a) tree diagram with observers o1 (reference), o2 and o3, edge delays θ1 to θ10, and a two-component vector next to each candidate source; (b) contour plot with axes d1 and d2, each spanning −5 to 5.]

Figure 2. (a) Active tree T_a. The vector next to each candidate source s is the normalized deterministic delay μ̃_s ≜ μ_s/μ. The normalized delay covariance for this tree is Λ̃ ≜ Λ/σ² = [5, 2; 2, 4]. (b) Equiprobability contours of the PDFs P(d | s* = s) for all s ∈ T_a, and the corresponding decision regions. For a given observation d, the optimal estimator chooses the source s that maximizes P(d | s* = s).

We consider that the propagation delays associated with the edges of T are independent identically distributed (i.i.d.) RVs with Gaussian distribution N(μ, σ²), where the mean μ and variance σ² are known [15]. With these definitions, we have the following result.

Proposition 1 (Optimal Estimation in General Trees): For a general propagation tree T, the optimal estimator is given by

\[
\hat{s} = \arg\max_{s \in T_a} \mu_s^{T} \Lambda^{-1} \left( d - \tfrac{1}{2} \mu_s \right)
\tag{2}
\]

where d is the observed delay, μ_s is the deterministic delay, and Λ is the delay covariance, given by

\[
[d]_k = t_{k+1} - t_1, \tag{3}
\]
\[
[\mu_s]_k = \mu \cdot \left( |P(s, o_{k+1})| - |P(s, o_1)| \right), \tag{4}
\]
\[
[\Lambda]_{k,i} = \sigma^2 \cdot
\begin{cases}
|P(o_1, o_{k+1})|, & k = i, \\
|P(o_1, o_{k+1}) \cap P(o_1, o_{i+1})|, & k \neq i,
\end{cases}
\tag{5}
\]

for k, i = 1, ..., K_a − 1, with |P(u, v)| denoting the number of edges (length) of the path connecting vertices u and v.
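
The estimator of Proposition 1 can be sketched in a few lines. This is an illustrative implementation under simplifying assumptions, not the authors' code: arrival differences are taken relative to the reference observer o1, every node of the tree is treated as a candidate (the paper restricts the search to the active subtree T_a), paths are recomputed by breadth-first search for clarity, so the sketch does not reach the O(N) complexity quoted in the text, and the names `path_edges`, `solve` and `locate_source` are hypothetical.

```python
from collections import deque

def path_edges(tree, u, v):
    """Set of edges on the unique u-v path of a tree (adjacency dict)."""
    parent = {u: None}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in tree[x]:
            if y not in parent:
                parent[y] = x
                queue.append(y)
    edges = set()
    while parent[v] is not None:
        edges.add(frozenset((v, parent[v])))
        v = parent[v]
    return edges

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def locate_source(tree, observers, times, mu, sigma2):
    """Return the node maximizing mu_s^T Lambda^{-1} (d - mu_s / 2)."""
    o1, rest = observers[0], observers[1:]
    d = [t - times[0] for t in times[1:]]                   # eq. (3)
    ref = [path_edges(tree, o1, o) for o in rest]
    lam = [[sigma2 * len(p & q) for q in ref] for p in ref]  # eq. (5)
    best, best_score = None, float("-inf")
    for s in tree:                                          # assumption: all nodes
        base = len(path_edges(tree, s, o1))
        mu_s = [mu * (len(path_edges(tree, s, o)) - base) for o in rest]  # eq. (4)
        x = solve(lam, [dk - 0.5 * m for dk, m in zip(d, mu_s)])
        score = sum(m * xk for m, xk in zip(mu_s, x))       # eq. (2)
        if score > best_score:
            best, best_score = s, score
    return best

# Toy tree: a1 - a2 - c - b1 - b2, with a branch c - d1.
tree = {"a1": ["a2"], "a2": ["a1", "c"], "c": ["a2", "b1", "d1"],
        "b1": ["c", "b2"], "b2": ["b1"], "d1": ["c"]}
# Noise-free arrival times for source b1 with unit delay per edge.
print(locate_source(tree, ["a1", "b2", "d1"], [3.0, 1.0, 2.0], mu=1.0, sigma2=1.0))
```

In this toy case the observed delay vector equals the deterministic delay of b1 exactly, so b1 attains the highest score; with noisy delays the covariance weighting decides between neighbouring candidates.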

Intuitively, μ_s and Λ represent, respectively, the mean and covariance of the observed delay d (a random vector), when node s is chosen as the source (see Figure 2 for visual interpretation). The full proof of Proposition 1 is given in [16, sec. S1].

Proposition 1 essentially reduces the estimation formula in (1) to a tractable expression whose parameters can be simply obtained from path lengths in the tree T. Furthermore, it is easy to show that the complexity of (2)-(5) scales as O(N) with the number of nodes N in the tree [16, sec. S2]. In practice, the Gaussian condition for the propagation delays can often be relaxed to non-Gaussian scenarios. The estimator in Proposition 1 can be shown to be near-optimal (see [16, sec. S3] for a concrete example), as long as the observers are sparse (which is often verified in practice) and the propagation delays have finite moments. The sparsity implies that the distance between observers is large, and so is the number of

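Page 24:

The Estimator

• For all candidate sources s ∈ G, the maximum likelihood estimator is

\[
\hat{s}(\mathcal{O}) = \arg\max_{s \in G} \sum_{\Pi_s} P(\Pi_s \mid s^* = s) \times \int \cdots \int g(\theta_1, \ldots, \theta_L, \mathcal{O}, \Pi_s, s) \, d\theta_1 \cdots d\theta_L,
\]

where Π_s is the set of all possible paths between the source s and the observers in the graph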

2

u will retransmit the information to all its other neighbours,so that each neighbour v ! V(u)\s receives the informationat time tu + !uv , where !uv denotes the random propagationdelay associated with edge uv. The RVs {!uv} for differentedges uv have a known, arbitrary joint distribution. Thediffusion process is initiated by the source s! at an unknowntime t = t!. This diffusion model is general enough toaccommodate various scenarios encountered in practice.Let O ! {oi}Kk=1 " G denote the set of K observers,

whose location on G is chosen or known. Each observermeasures from which neighbour and at what time it receivedthe information. Specifically, if tv,o denotes the absolutetime at which observer o receives the information from itsneighbour v, then the observation set is composed of tuplesof direction and time measurements, i.e., O ! {(o, v, tv,o)},for all o ! O and v ! V(o).How is the source location recovered from the measure-

ments taken at the observers? We adopt a maximum probabilityof localization criterion, which corresponds to designing anestimator s(·) such that the localization probability Ploc !

P(s(O) = s!) is maximized. Since we consider s! tobe uniformly random over G, the optimal estimator is themaximum likelihood (ML) estimator,

s(O) = argmaxs"G

P(O|s! = s)

= argmaxs"G

!

!s

P(!s|s! = s)#

"

· · ·

"

g(!1, · · · , !L,O,!s, s)d!1 · · · d!L. (1)

Here, !s is the set of all possible paths {Ps,ok}Kk=1 between

the source s and the observers in the graph G; the set {!l}Ll=1represents the random propagation delays for all L edges ofgraph G; and g is a deterministic function that depends on thejoint distribution of the propagation delays in a complicatedway. In essence, the estimator in (1) is performing averagesover two different sources of randomness: a) the uncertaintyin the paths that the information takes to reach the observers,and b) the uncertainty in the time that the information takes tocross the edges of G. Due the combinatorial nature of (1), itscomplexity increases exponentially with the number of nodesin G, and is therefore intractable. In what follows, we proposea strategy of complexityO(N) that is optimal for general trees,and a strategy of complexity O(N3) that is sub-optimal forgeneral graphs.Consider first the case of an underlying tree T . Because a

tree does not contain cycles, only a subset Oa $ O of theobservers will receive information emitted by the unknownsource. We call Oa = {ok}

Ka

k=1 the set of Ka active observers.The observations made by the nodes in Oa provide two typesof information: a) the direction in which information arrivesto the active observers, which uniquely determines a subsetTa $ T of regular nodes (called active subtree, Fig. 2a); andb) the timing at which the information arrives to the activeobservers, denoted by {tk}

Ka

k=1, which is used to localize thesource within the set Ta. It is also convenient to label the edgesof Ta as E(Ta) = {1, 2, . . . , Ea}, so that the propagation delayassociated with edge i ! E is denoted by the RV !i (Fig. 2a).

o1 (reference)

o2

o3

!10

!1

!2

!3

!4

!5

!6!7

!8

!9

#

32

$

#

10

$

#

–10

$

#

–30

$

#

1–2$

#

–10

$

#

–10

$

#

–10

$

(a)

−5 −4 −3 −2 −1 0 1 2 3 4 5−5

−4

−3

−2

−1

0

1

2

3

4

5

d1

d2

#

32

$

#

10

$#

–10

$#

–30

$

#

1–2$

(b)

Figure 2. (a) Active tree Ta. The vector next to each candidate source s is thenormalized deterministic delay µ

s! µ

s/µ. The normalized delay covariance

for this tree is ! ! !/!2 = [5, 2; 2, 4]. (b) Equiprobability countours of thePDFs P(d|s! = s) for all s ! Ta, and the corresponding decision regions.For a given observation d, the optimal estimator chooses the source s thatmaximizes P(d|s! = s).

We consider that the propagation delays associated with theedges of T are independent identically distributed (i.i.d.) RVswith Gaussian distribution N (µ,"2), where the mean µ andvariance "2 are known [15]. With these definitions, we havethe following result.Proposition 1 (Optimal Estimation in General Trees): For

a general propagation tree T , the optimal estimator is givenby

s = argmaxs"Ta

µTs !

#1

%

d%1

2µs

&

(2)

where d is the observed delay, µs is the deterministic delay,and ! is the delay covariance, given by

[d]k = tk+1 % tk, (3)[µs]k = µ · (|P(s, ok+1)|% |P(s, o1)|) , (4)

[!]k,i = "2 ·

'

|P(o1, ok+1)|, k = i,

|P(o1, ok+1) & P(o1, oi+1)|, k '= i,(5)

for k, i = 1, . . . ,Ka % 1, with |P(u, v)| denoting the numberof edges (length) of the path connecting vertices u and v.

Intuitively, µs and ! represent, respectively, the mean andcovariance of the observed delay d (a random vector), whennode s is chosen as the source (see Figure 2 for visualinterpretation). The full proof of Proposition 1 is given in [16,sec. S1].Proposition 1 essentially reduces the estimation formula in

(1) to a tractable expression whose parameters can be simplyobtained from path lengths in the tree T . Furthermore, it iseasy to show that the complexity of (2)-(5) scales as O(N)with the number of nodes N in the tree [16, sec. S2]. Inpractice, the Gaussian condition for the propagation delayscan often be relaxed to non-Gaussian scenarios. The estimatorin Proposition 1 can be shown to be near-optimal (see [16,sec. S3] for a concrete example), as long as the observers aresparse—which is often verified in practice—and the propaga-tion delays have finite moments. The sparsity implies that thedistance between observers is large, and so is the number of

2

u will retransmit the information to all its other neighbours,so that each neighbour v ! V(u)\s receives the informationat time tu + !uv , where !uv denotes the random propagationdelay associated with edge uv. The RVs {!uv} for differentedges uv have a known, arbitrary joint distribution. Thediffusion process is initiated by the source s! at an unknowntime t = t!. This diffusion model is general enough toaccommodate various scenarios encountered in practice.Let O ! {oi}Kk=1 " G denote the set of K observers,

whose location on G is chosen or known. Each observermeasures from which neighbour and at what time it receivedthe information. Specifically, if tv,o denotes the absolutetime at which observer o receives the information from itsneighbour v, then the observation set is composed of tuplesof direction and time measurements, i.e., O ! {(o, v, tv,o)},for all o ! O and v ! V(o).How is the source location recovered from the measure-

ments taken at the observers? We adopt a maximum probabilityof localization criterion, which corresponds to designing anestimator s(·) such that the localization probability Ploc !

P(s(O) = s!) is maximized. Since we consider s! tobe uniformly random over G, the optimal estimator is themaximum likelihood (ML) estimator,

s(O) = argmaxs"G

P(O|s! = s)

= argmaxs"G

!

!s

P(!s|s! = s)#

"

· · ·

"

g(!1, · · · , !L,O,!s, s)d!1 · · · d!L. (1)

Here, !s is the set of all possible paths {Ps,ok}Kk=1 between

the source s and the observers in the graph G; the set {!l}Ll=1represents the random propagation delays for all L edges ofgraph G; and g is a deterministic function that depends on thejoint distribution of the propagation delays in a complicatedway. In essence, the estimator in (1) is performing averagesover two different sources of randomness: a) the uncertaintyin the paths that the information takes to reach the observers,and b) the uncertainty in the time that the information takes tocross the edges of G. Due the combinatorial nature of (1), itscomplexity increases exponentially with the number of nodesin G, and is therefore intractable. In what follows, we proposea strategy of complexityO(N) that is optimal for general trees,and a strategy of complexity O(N3) that is sub-optimal forgeneral graphs.Consider first the case of an underlying tree T . Because a

tree does not contain cycles, only a subset Oa $ O of theobservers will receive information emitted by the unknownsource. We call Oa = {ok}

Ka

k=1 the set of Ka active observers.The observations made by the nodes in Oa provide two typesof information: a) the direction in which information arrivesto the active observers, which uniquely determines a subsetTa $ T of regular nodes (called active subtree, Fig. 2a); andb) the timing at which the information arrives to the activeobservers, denoted by {tk}

Ka

k=1, which is used to localize thesource within the set Ta. It is also convenient to label the edgesof Ta as E(Ta) = {1, 2, . . . , Ea}, so that the propagation delayassociated with edge i ! E is denoted by the RV !i (Fig. 2a).

o1 (reference)

o2

o3

!10

!1

!2

!3

!4

!5

!6!7

!8

!9

#

32

$

#

10

$

#

–10

$

#

–30

$

#

1–2$

#

–10

$

#

–10

$

#

–10

$

(a)

−5 −4 −3 −2 −1 0 1 2 3 4 5−5

−4

−3

−2

−1

0

1

2

3

4

5

d1

d2

#

32

$

#

10

$#

–10

$#

–30

$

#

1–2$

(b)

Figure 2. (a) Active tree Ta. The vector next to each candidate source s is thenormalized deterministic delay µ

s! µ

s/µ. The normalized delay covariance

for this tree is ! ! !/!2 = [5, 2; 2, 4]. (b) Equiprobability countours of thePDFs P(d|s! = s) for all s ! Ta, and the corresponding decision regions.For a given observation d, the optimal estimator chooses the source s thatmaximizes P(d|s! = s).

We consider that the propagation delays associated with the edges of $T$ are independent identically distributed (i.i.d.) RVs with Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$, where the mean $\mu$ and variance $\sigma^2$ are known [15]. With these definitions, we have the following result.

Proposition 1 (Optimal Estimation in General Trees): For a general propagation tree $T$, the optimal estimator is given by

$$\hat{s} = \arg\max_{s \in T_a} \; \mu_s^T \Lambda^{-1} \left( d - \tfrac{1}{2} \mu_s \right) \qquad (2)$$

where $d$ is the observed delay, $\mu_s$ is the deterministic delay, and $\Lambda$ is the delay covariance, given by

$$[d]_k = t_{k+1} - t_1, \qquad (3)$$

$$[\mu_s]_k = \mu \cdot \left( |P(s, o_{k+1})| - |P(s, o_1)| \right), \qquad (4)$$

$$[\Lambda]_{k,i} = \sigma^2 \cdot \begin{cases} |P(o_1, o_{k+1})|, & k = i, \\ |P(o_1, o_{k+1}) \cap P(o_1, o_{i+1})|, & k \neq i, \end{cases} \qquad (5)$$

for $k, i = 1, \ldots, K_a - 1$, with $|P(u, v)|$ denoting the number of edges (length) of the path connecting vertices $u$ and $v$.
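As a rough illustration of how (2)-(5) turn into a computation, here is a minimal Python sketch. It is not the authors' implementation. The function names, the adjacency-dict representation of the tree, and the use of the first listed observer as the time reference are assumptions made for the example. It also runs one breadth-first search per candidate, so it is simpler but slower than the O(N) procedure the text refers to.

```python
from collections import deque


def bfs(adj, root):
    """Return (parent, depth) dicts for the tree `adj` rooted at `root`."""
    parent, depth = {root: None}, {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v], depth[v] = u, depth[u] + 1
                queue.append(v)
    return parent, depth


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def locate_source(adj, observers, times, mu, sigma2, candidates=None):
    """Hypothetical helper: pick the candidate maximizing the score in (2).

    adj       -- tree as {node: [neighbours]}
    observers -- active observers; observers[0] is the reference o1 (assumption)
    times     -- arrival times, same order as `observers`
    """
    o1, others = observers[0], observers[1:]
    parent, _ = bfs(adj, o1)

    def path_edges(v):
        # Edges on the path o1 -> v, each identified by its endpoint away from o1.
        edges = set()
        while parent[v] is not None:
            edges.add(v)
            v = parent[v]
        return edges

    paths = [path_edges(o) for o in others]
    lam = [[sigma2 * len(pk & pi) for pi in paths] for pk in paths]  # eq. (5)
    d = [t - times[0] for t in times[1:]]                            # eq. (3)

    best, best_score = None, float("-inf")
    for s in (adj if candidates is None else candidates):
        _, dist = bfs(adj, s)
        mu_s = [mu * (dist[o] - dist[o1]) for o in others]           # eq. (4)
        y = solve(lam, [dk - 0.5 * m for dk, m in zip(d, mu_s)])
        score = sum(m * yk for m, yk in zip(mu_s, y))                # eq. (2)
        if score > best_score:
            best, best_score = s, score
    return best
```

With noise-free arrival times the observed delay equals the deterministic delay of the true source, so the score is largest there. That gives a quick sanity check on a small tree before trying noisy data.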

Intuitively, $\mu_s$ and $\Lambda$ represent, respectively, the mean and covariance of the observed delay $d$ (a random vector), when node $s$ is chosen as the source (see Figure 2 for visual interpretation). The full proof of Proposition 1 is given in [16, sec. S1].

Proposition 1 essentially reduces the estimation formula in (1) to a tractable expression whose parameters can be simply obtained from path lengths in the tree $T$. Furthermore, it is easy to show that the complexity of (2)-(5) scales as $O(N)$ with the number of nodes $N$ in the tree [16, sec. S2]. In practice, the Gaussian condition for the propagation delays can often be relaxed to non-Gaussian scenarios. The estimator in Proposition 1 can be shown to be near-optimal (see [16, sec. S3] for a concrete example), as long as the observers are sparse, which is often verified in practice, and the propagation delays have finite moments. The sparsity implies that the distance between observers is large, and so is the number of


Page 25: (121013) #fitalk   locating the source of diffusion in large-scale network

Complexity & Tractability

• Complexity increases exponentially with the number of nodes in G

• Finding a solution in polynomial time

• Consider a tree

• Using direction & timing information from the sensors, build an active tree
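The last bullet can be sketched in code. The idea used below is a plain consequence of working on a tree: if an observer received the information from a given neighbour, the source must lie on that neighbour's side of the connecting edge, so intersecting these sides over all active observers gives the region where the source can be. The function name and the `(observer, neighbour, time)` tuple layout are assumptions for illustration, and the result may not coincide exactly with the talk's definition of the active tree.

```python
def active_candidates(adj, observations):
    """Hypothetical helper: nodes consistent with every observed arrival direction.

    adj          -- tree as {node: [neighbours]}
    observations -- iterable of (o, v, t): observer o got the information
                    from neighbour v at time t (t is not used here)
    """
    def side_of(v, o):
        # All nodes reachable from v without passing through observer o.
        seen, stack = {v}, [v]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w != o and w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    region = set(adj)
    for o, v, _t in observations:
        region &= side_of(v, o)
    return region
```

Timing information then ranks the nodes inside this region. The direction step only narrows down where to look.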


Page 26: (121013) #fitalk   locating the source of diffusion in large-scale network

Example


Complexity = O(N)

Page 27: (121013) #fitalk   locating the source of diffusion in large-scale network

Optimal Estimation in General Trees

u will retransmit the information to all its other neighbours, so that each neighbour $v \in V(u) \setminus s$ receives the information at time $t_u + \theta_{uv}$, where $\theta_{uv}$ denotes the random propagation delay associated with edge $uv$. The RVs $\{\theta_{uv}\}$ for different edges $uv$ have a known, arbitrary joint distribution. The diffusion process is initiated by the source $s^*$ at an unknown time $t = t^*$. This diffusion model is general enough to accommodate various scenarios encountered in practice.

Let $\mathcal{O} \triangleq \{o_k\}_{k=1}^{K} \subseteq G$ denote the set of $K$ observers, whose location on $G$ is chosen or known. Each observer measures from which neighbour and at what time it received the information. Specifically, if $t_{v,o}$ denotes the absolute time at which observer $o$ receives the information from its neighbour $v$, then the observation set is composed of tuples of direction and time measurements, i.e., $\mathbb{O} \triangleq \{(o, v, t_{v,o})\}$, for all $o \in \mathcal{O}$ and $v \in V(o)$.

How is the source location recovered from the measurements taken at the observers? We adopt a maximum probability of localization criterion, which corresponds to designing an estimator $\hat{s}(\cdot)$ such that the localization probability $P_{loc} \triangleq P(\hat{s}(\mathbb{O}) = s^*)$ is maximized. Since we consider $s^*$ to be uniformly random over $G$, the optimal estimator is the maximum likelihood (ML) estimator,

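$$\hat{s}(\mathbb{O}) = \arg\max_{s \in G} P(\mathbb{O} \mid s^* = s) = \arg\max_{s \in G} \sum_{\Pi_s} P(\Pi_s \mid s^* = s) \times \int \cdots \int g(\theta_1, \cdots, \theta_L, \mathbb{O}, \Pi_s, s) \, d\theta_1 \cdots d\theta_L. \qquad (1)$$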

Here, $\Pi_s$ is the set of all possible paths $\{P_{s,o_k}\}_{k=1}^{K}$ between the source $s$ and the observers in the graph $G$; the set $\{\theta_l\}_{l=1}^{L}$ represents the random propagation delays for all $L$ edges of graph $G$; and $g$ is a deterministic function that depends on the joint distribution of the propagation delays in a complicated way. In essence, the estimator in (1) is performing averages over two different sources of randomness: a) the uncertainty in the paths that the information takes to reach the observers, and b) the uncertainty in the time that the information takes to cross the edges of $G$. Due to the combinatorial nature of (1), its complexity increases exponentially with the number of nodes in $G$, and is therefore intractable. In what follows, we propose a strategy of complexity $O(N)$ that is optimal for general trees, and a strategy of complexity $O(N^3)$ that is sub-optimal for general graphs.

Consider first the case of an underlying tree $T$. Because a tree does not contain cycles, only a subset $\mathcal{O}_a \subseteq \mathcal{O}$ of the observers will receive information emitted by the unknown source. We call $\mathcal{O}_a = \{o_k\}_{k=1}^{K_a}$ the set of $K_a$ active observers. The observations made by the nodes in $\mathcal{O}_a$ provide two types of information: a) the direction in which information arrives to the active observers, which uniquely determines a subset $T_a \subseteq T$ of regular nodes (called active subtree, Fig. 2a); and b) the timing at which the information arrives to the active observers, denoted by $\{t_k\}_{k=1}^{K_a}$, which is used to localize the source within the set $T_a$. It is also convenient to label the edges of $T_a$ as $E(T_a) = \{1, 2, \ldots, E_a\}$, so that the propagation delay associated with edge $i \in E$ is denoted by the RV $\theta_i$ (Fig. 2a).

[Figure 2: panel (a) shows the active tree with observers o1 (reference), o2 and o3, edge delays θ1 to θ10, and candidate sources labeled with the vectors (3, 2), (1, 0), (-1, 0), (-3, 0) and (1, -2); panel (b) is a plot over the axes d1 and d2, each running from -5 to 5.]

Figure 2. (a) Active tree $T_a$. The vector next to each candidate source $s$ is the normalized deterministic delay $\tilde{\boldsymbol{\mu}}_s \triangleq \boldsymbol{\mu}_s/\mu$. The normalized delay covariance for this tree is $\tilde{\boldsymbol{\Lambda}} \triangleq \boldsymbol{\Lambda}/\sigma^2 = [5, 2; 2, 4]$. (b) Equiprobability contours of the PDFs $P(\mathbf{d} \mid s^* = s)$ for all $s \in T_a$, and the corresponding decision regions. For a given observation $\mathbf{d}$, the optimal estimator chooses the source $s$ that maximizes $P(\mathbf{d} \mid s^* = s)$.

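We consider that the propagation delays associated with the edges of $T$ are independent identically distributed (i.i.d.) RVs with Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$, where the mean $\mu$ and variance $\sigma^2$ are known [15]. With these definitions, we have the following result.

Proposition 1 (Optimal Estimation in General Trees): For a general propagation tree $T$, the optimal estimator is given by

$$\hat{s} = \arg\max_{s \in T_a} \boldsymbol{\mu}_s^{T} \boldsymbol{\Lambda}^{-1} \left(\mathbf{d} - \tfrac{1}{2}\boldsymbol{\mu}_s\right) \tag{2}$$

where $\mathbf{d}$ is the observed delay, $\boldsymbol{\mu}_s$ is the deterministic delay, and $\boldsymbol{\Lambda}$ is the delay covariance, given by

$$[\mathbf{d}]_k = t_{k+1} - t_1, \tag{3}$$

$$[\boldsymbol{\mu}_s]_k = \mu \cdot \left(|\mathcal{P}(s, o_{k+1})| - |\mathcal{P}(s, o_1)|\right), \tag{4}$$

$$[\boldsymbol{\Lambda}]_{k,i} = \sigma^2 \cdot \begin{cases} |\mathcal{P}(o_1, o_{k+1})|, & k = i, \\ |\mathcal{P}(o_1, o_{k+1}) \cap \mathcal{P}(o_1, o_{i+1})|, & k \neq i, \end{cases} \tag{5}$$

for $k, i = 1, \ldots, K_a - 1$, with $|\mathcal{P}(u, v)|$ denoting the number of edges (length) of the path connecting vertices $u$ and $v$.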
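To make equations (2)-(5) concrete, the following Python sketch evaluates the estimator on a small tree. It is only an illustration under stated assumptions: the topology (nodes `a` to `e` between three observers) is hypothetical and was chosen so that its normalized covariance equals the [5, 2; 2, 4] quoted in the Figure 2 caption, the candidate set is passed in by hand instead of being derived from direction measurements, and the names `tree_path`, `solve` and `locate_source` are not from the paper.

```python
from collections import deque

def tree_path(adj, u, v):
    """Edges (as frozensets) of the unique u-v path in a tree."""
    parent = {u: None}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in parent:
                parent[y] = x
                queue.append(y)
    edges = []
    while v != u:
        edges.append(frozenset((v, parent[v])))
        v = parent[v]
    return edges

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def locate_source(adj, observers, times, mu, sigma2, candidates):
    """Return the candidate maximizing mu_s^T Lambda^{-1} (d - mu_s / 2)."""
    o1, rest = observers[0], observers[1:]
    d = [t - times[0] for t in times[1:]]                      # eq. (3)
    ref = [set(tree_path(adj, o1, o)) for o in rest]
    cov = [[sigma2 * len(pk & pi) for pi in ref] for pk in ref]  # eq. (5)
    best, best_score = None, float("-inf")
    for s in candidates:
        to_o1 = len(tree_path(adj, s, o1))
        mu_s = [mu * (len(tree_path(adj, s, o)) - to_o1) for o in rest]  # eq. (4)
        x = solve(cov, [dk - 0.5 * m for dk, m in zip(d, mu_s)])
        score = sum(m * xi for m, xi in zip(mu_s, x))          # eq. (2)
        if score > best_score:
            best, best_score = s, score
    return best

# Hypothetical tree: o1 - a - b - c - d - o2, with a branch b - e - o3.
edges = [("o1", "a"), ("a", "b"), ("b", "c"), ("c", "d"),
         ("d", "o2"), ("b", "e"), ("e", "o3")]
adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)

observers = ["o1", "o2", "o3"]
candidates = ["a", "b", "c", "d", "e"]
# Noise-free arrival times for a message emitted by node b at time 10 (mu = 1).
times = [12.0, 13.0, 12.0]
estimate = locate_source(adj, observers, times, mu=1.0, sigma2=0.25,
                         candidates=candidates)
print(estimate)
```

Because every candidate shares the same covariance on a tree, maximizing (2) is equivalent to picking the candidate whose deterministic delay vector is closest to the observed one in the metric defined by the inverse covariance; with noise-free times the sketch therefore recovers the emitting node.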

Intuitively, $\boldsymbol{\mu}_s$ and $\boldsymbol{\Lambda}$ represent, respectively, the mean and covariance of the observed delay $\mathbf{d}$ (a random vector), when node $s$ is chosen as the source (see Figure 2 for visual interpretation). The full proof of Proposition 1 is given in [16, sec. S1].

Proposition 1 essentially reduces the estimation formula in (1) to a tractable expression whose parameters can be simply obtained from path lengths in the tree $T$. Furthermore, it is easy to show that the complexity of (2)-(5) scales as $O(N)$ with the number of nodes $N$ in the tree [16, sec. S2]. In practice, the Gaussian condition for the propagation delays can often be relaxed to non-Gaussian scenarios. The estimator in Proposition 1 can be shown to be near-optimal (see [16, sec. S3] for a concrete example), as long as the observers are sparse (which is often verified in practice) and the propagation delays have finite moments. The sparsity implies that the distance between observers is large, and so is the number of

Page 28: (121013) #fitalk   locating the source of diffusion in large-scale network

Optimal Estimation in General Graphs

• Relaxation: assume that the actual diffusion tree is a breadth-first-search (BFS) tree

RVs of the sum

$$d_k = t_{k+1} - t_1 = \sum_{i \in \mathcal{P}(s^*, o_{k+1})} \theta_i - \sum_{i \in \mathcal{P}(s^*, o_1)} \theta_i.$$

Then, the observer delay vector $\mathbf{d}$ can be closely approximated by a Gaussian random vector, due to the central limit theorem.

We now consider the most general case of source estimation on an arbitrary graph $G$. When the information is diffused on the network, there is a tree corresponding to the first time each node gets informed, which spans all nodes in $G$. Since the number of spanning trees can be exponentially large, we introduce an approximation by assuming that the actual diffusion tree is a breadth-first search (BFS) tree. This corresponds to assuming that the information travels from the source to each observer along a minimum-length path, which is intuitively satisfying. The resulting estimator can be written as

$$\hat{s} = \arg\max_{s \in G} S(s, \mathbf{d}, T_{\mathrm{bfs},s}), \tag{6}$$

where $S = \boldsymbol{\mu}_s^{T} \boldsymbol{\Lambda}_s^{-1} \left(\mathbf{d} - \tfrac{1}{2}\boldsymbol{\mu}_s\right)$, with parameters $\boldsymbol{\mu}_s$ and $\boldsymbol{\Lambda}_s$ computed with respect to the BFS tree $T_{\mathrm{bfs},s}$ rooted at $s$. It can easily be shown that the complexity of (6) scales sub-exponentially with $N$, as $O(N^3)$ [16, sec. S2].

We now turn our attention to the localization performance

and its dependence on: i) the structure of the network, ii) the density and placement of the observers, and iii) the observation of multiple information cascades. We first apply the proposed estimator to various synthetic networks, shown in Table 1. Clearly, the estimator performs the best in scale-free networks (such as the Barabási-Albert [17][18] and the Apollonian models [19–21]), in some cases requiring as few as 4% of observers to achieve a localization probability of 90%. This is because scale-free networks exhibit "hubs" with large degrees, which can be picked as observers and are able to receive a large amount of information about the source. If the network is not scale-free (such as the Erdős-Rényi model), or the observers are placed uniformly at random, then more observers are necessary to achieve the same localization performance.

So far we assumed that the source of information transmits only one message. However, in many scenarios, the source emits different messages over time, which diffuse independently over the network. These information cascades can be gathered and exploited by the observers, as revealed by the following proposition.

Proposition 2 (Effect of Multiple Cascades): If the source $s^*$ transmits $C$ independent cascades of information on a tree $T$, then the probability of correct localization $P_{\mathrm{loc}}$ achieved by the optimal estimator is given by

$$P_{\mathrm{loc}} = P_{\max} - O\left(e^{-aC}\right),$$

where $P_{\max}$ is the maximum probability of localization achieved under deterministic propagation, and $a$ is a constant.

The full proof is given in [16, sec. S4]. The proposition shows that as the observers collect more information from successive cascades, they can average out the variance associated with random propagation, and approach the localization performance of the deterministic scenario ($P_{\max}$) at a rate that is at least exponential. We can think of this phenomenon as a time-resolution tradeoff: the observers can achieve higher accuracy of localization by waiting for a longer time, over which they can observe more cascades.

Table 1. Percentage $K/N$ of observers necessary to achieve $P_{\mathrm{loc}} = 90\%$, for different networks and observer placements. The "high-degree" placement picks the highest-degree nodes as observers, while the "random" placement picks the observers randomly. We consider $N = 100$ nodes, and propagation ratio $\mu/\sigma = 4$.

                            Observer placement
Network                     High-degree    Random
Apollonian                  4%             25%
Barabási-Albert             18%            41%
Erdős-Rényi (Np = 0.5)      34%            49%
Erdős-Rényi (Np = 2)        32%            44%

Lastly, we test the effectiveness of the proposed algorithm with real, measured data. We consider the well-documented case of the cholera outbreak that occurred in the KwaZulu-Natal province, South Africa, in 2000 (Fig. 3a). The epidemic was caused by a strain of Vibrio cholerae, which colonizes the human intestine and is transmitted through contamination of aquatic environments. The dataset was provided by the KwaZulu-Natal Health Department, and consists of each single cholera case, specified by the date and health subdistrict where it occurred. To perform source localization, we consider a network model of the basin (Fig. 3b) developed in [10]. The nodes represent human communities and associated water reservoirs, in which the disease can be diffused and grow. The edges of the graph represent hydrological links between the communities. The propagation parameters for this bacterium were obtained from past epidemics [16, sec. S5][22]. Source localization is performed by monitoring the daily cholera cases reported in $K$ communities (the observers). These are selected uniformly at random, due to the lack of a priori information about the source location. Table 3c shows that by monitoring only 20% of the communities, we achieve an average error of less than 4 hops between the estimated source and the first infected community. This small distance error may enable a faster emergency response from the authorities in order to contain an outbreak.

To conclude, the results in this paper suggest that a sparse deployment of observers may provide an effective alternative to the individual monitoring (either human or automatic) of all nodes in a network. However, several challenges remain. First, in some scenarios, it may be difficult to exactly determine the underlying graph on which diffusion occurs. In a cholera outbreak, for example, the diffusion of the bacteria is also influenced by the long-range movement of infected individuals, in addition to the basic hydrological transport. Since this mobility network cannot be reliably measured, further study is needed to determine the robustness of our framework to inaccuracies in the underlying graph. Second, the choice of observers in the network strongly affects the performance of the proposed algorithm. Optimal strategies for observer placement need to be further investigated. Nevertheless, our results indicate that source localization in large networks ...

Complexity = O(N^3)
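The BFS relaxation on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: for every candidate node it builds a BFS tree rooted at that node, computes the deterministic delay vector and covariance with respect to that tree, and keeps the node with the highest score S from (6). The function names and the toy graph (a four-node cycle with one observer hanging off three of its nodes) are assumptions made for this example, and no attempt is made to reach the stated complexity.

```python
from collections import deque

def bfs_tree(adj, root):
    """Breadth-first search from root; returns parent pointers and depths."""
    parent, depth = {root: None}, {root: 0}
    queue = deque([root])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in parent:
                parent[y], depth[y] = x, depth[x] + 1
                queue.append(y)
    return parent, depth

def edges_to_root(parent, x):
    """Set of (child, parent) tree edges on the path from x up to the root."""
    edges = set()
    while parent[x] is not None:
        edges.add((x, parent[x]))
        x = parent[x]
    return edges

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def bfs_score(adj, s, observers, d, mu, sigma2):
    """Score S = mu_s^T Lambda_s^{-1} (d - mu_s / 2) on the BFS tree rooted at s."""
    parent, depth = bfs_tree(adj, s)
    o1, rest = observers[0], observers[1:]
    up1 = edges_to_root(parent, o1)
    # Tree path o1 -> o: the edges lying on exactly one of the two root paths.
    paths = [up1 ^ edges_to_root(parent, o) for o in rest]
    cov = [[sigma2 * len(pk & pi) for pi in paths] for pk in paths]
    mu_s = [mu * (depth[o] - depth[o1]) for o in rest]
    x = solve(cov, [dk - 0.5 * m for dk, m in zip(d, mu_s)])
    return sum(m * xi for m, xi in zip(mu_s, x))

def locate_source_bfs(adj, observers, times, mu, sigma2):
    """Return the node of the (connected) graph with the highest BFS score."""
    d = [t - times[0] for t in times[1:]]
    return max(adj, key=lambda s: bfs_score(adj, s, observers, d, mu, sigma2))

# Toy graph with one cycle p-q-r-t-p and observers attached to p, q and r.
edges = [("p", "q"), ("q", "r"), ("r", "t"), ("t", "p"),
         ("o1", "p"), ("o2", "q"), ("o3", "r")]
adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)

# Noise-free arrival times for a message emitted by node t at time 0 (mu = 1).
estimate = locate_source_bfs(adj, ["o1", "o2", "o3"], [2.0, 3.0, 2.0],
                             mu=1.0, sigma2=0.25)
print(estimate)
```

Note that the covariance now depends on the candidate, because each candidate has its own BFS tree; this is one reason the general-graph estimator is described as sub-optimal rather than as an exact maximum-likelihood rule.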

Page 29: (121013) #fitalk   locating the source of diffusion in large-scale network

Effect of Multiple Cascades

• Time-resolution tradeoff
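The time-resolution tradeoff can be illustrated with a small Monte Carlo experiment. The setup below is an assumption made for this sketch, not the paper's experiment: a path graph with nodes 0..n, observers at both ends, i.i.d. Gaussian edge delays, and a source drawn uniformly from the interior nodes. For C independent cascades from the same source, the Gaussian likelihoods multiply, so the estimator only needs the delay difference averaged over the cascades; with two observers the rule from Proposition 1 then reduces to choosing the candidate whose deterministic delay is closest to that average.

```python
import random

def simulate_accuracy(num_cascades, n=10, mu=1.0, sigma=1.0, trials=300, seed=0):
    """Fraction of trials in which the source on a path 0..n is found exactly."""
    rng = random.Random(seed)
    candidates = range(1, n)          # interior nodes; observers sit at 0 and n
    hits = 0
    for _ in range(trials):
        source = rng.choice(candidates)
        total = 0.0
        for _ in range(num_cascades):
            # Travel time to each end is a sum of i.i.d. Gaussian edge delays.
            t_left = sum(rng.gauss(mu, sigma) for _ in range(source))
            t_right = sum(rng.gauss(mu, sigma) for _ in range(n - source))
            total += t_right - t_left
        d_bar = total / num_cascades  # averaged observed delay
        # Deterministic delay for candidate s is mu * ((n - s) - s).
        estimate = min(candidates, key=lambda s: abs(d_bar - mu * (n - 2 * s)))
        hits += estimate == source
    return hits / trials

acc_one = simulate_accuracy(1)
acc_many = simulate_accuracy(100)
print(f"1 cascade: {acc_one:.2f}   100 cascades: {acc_many:.2f}")
```

In this setup deterministic propagation would locate every interior source exactly, so the accuracy with many cascades should approach 1, while a single noisy cascade leaves most sources ambiguous.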


Page 30: (121013) #fitalk   locating the source of diffusion in large-scale network

Performance Results

• Fraction K/N of observers necessary to achieve P_loc = 90%
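The two observer placements compared in these results can be sketched as follows. The helper names, the tie-breaking rule and the toy adjacency list are assumptions for illustration; the source only states that the "high-degree" placement picks the highest-degree nodes and the "random" placement picks observers uniformly at random.

```python
import random

def high_degree_observers(adj, k):
    """Pick the k nodes with the largest degree (ties broken by node name)."""
    return sorted(adj, key=lambda v: (-len(adj[v]), v))[:k]

def random_observers(adj, k, seed=None):
    """Pick k distinct nodes uniformly at random."""
    return random.Random(seed).sample(sorted(adj), k)

# Toy graph: "hub" is linked to every other node, plus one extra edge b-c.
adj = {
    "hub": ["a", "b", "c", "d"],
    "a": ["hub"],
    "b": ["hub", "c"],
    "c": ["hub", "b"],
    "d": ["hub"],
}

print(high_degree_observers(adj, 2))
print(random_observers(adj, 2, seed=1))
```

On a graph with pronounced hubs, the first function concentrates the observers on the nodes that most diffusion paths pass through, which matches the explanation given for the lower K/N values in the high-degree column.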


Page 31: (121013) #fitalk   locating the source of diffusion in large-scale network

Conclusion

• We can now locate the origin of diffusion in a network

with only O(N^3) complexity and with comparatively few sensors deployed