03/22/11

I2.2: Analysis of significant substructures in time-varying networks

Ambuj Singh

(in collaboration with P. Bogdanov, M. Mongiovi, X. Yang)

NS-CTA INARC Mid-Year Review, March 2011

Dynamic networks

• Dynamic networks are commonplace
  o online interaction networks: Twitter, Wikipedia, LinkedIn, Facebook, ...
  o mobile networks: cyber-physical scenario (EDIN, INARC), virus propagation (E2.1)
• Generative models to explain the network structure
  o preferential attachment [Barabasi '99]
  o forest-fire [Leskovec '09]
• Markov chain models (discrete, continuous)
  o when, where, what changes [Avin '08, Clementi '09]
• Latent space / context models [Zheng '05]
• Network flow/traffic [Daganzo '94, Bickel '01, Stoev '09]
• Disease propagation, blog cascade, SIS [Leskovec '07]
• Stochastic actor-based models [Snijders '09]

Our focus

• Dynamic edge attributes
• Simplest case
  o edge is +1 or -1
  o +1 means flow of interest: congestion, flow above historical threshold
  o real values are a general case and can also be considered
• Query: find highest scoring substructures in graph over time
  o combines graph structure and time

Motivation: traffic congestion

Re-tweet rate of #music in Twitter

Outline

• Motivation
• Problem definition
• Solving for a fixed time interval
• Heuristic for multiple time intervals
• Path Forward

Problem definition

[Figure: example graph whose edges are labeled with values in {-1, 1}, next to a table of edge values at times t1 to t4]

t1  t2  t3  t4
 1  -1   1  -1
 1  -1  -1  -1
-1  -1  -1  -1
 1   1   1  -1
 1  -1   1  -1

• A time-evolving graph $G = (V, E, F_t(e))$
  o $V$: set of nodes
  o $E$: set of edges
  o $F_t(e)$: mapping of edges to $\{-1, 1\}$
• Score of an edge $e$ in interval $[t_1, t_2]$: $\mathrm{score}(e) = \sum_{t=t_1}^{t_2} F_t(e)$
• Score of a subgraph in the interval: $\sum_{e} \mathrm{score}(e)$, for all $e$ in the subgraph
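
The two score definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the node names `a` to `d` are hypothetical, the three series are borrowed from rows of the table, and time indices are 0-based and inclusive.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def edge_score(series: List[int], t1: int, t2: int) -> int:
    """Sum of F_t(e) for t in [t1, t2] (0-based, both ends inclusive)."""
    return sum(series[t1:t2 + 1])


def subgraph_score(F: Dict[Edge, List[int]], edges: Iterable[Edge],
                   t1: int, t2: int) -> int:
    """Sum of the edge scores over all edges of the subgraph."""
    return sum(edge_score(F[e], t1, t2) for e in edges)


# Hypothetical toy instance: edge -> values of F_t(e) at t1..t4.
F = {
    ("a", "b"): [1, -1, 1, -1],
    ("b", "c"): [1, -1, -1, -1],
    ("c", "d"): [1, 1, 1, -1],
}

cd_score = edge_score(F[("c", "d")], 0, 2)
pair_score = subgraph_score(F, [("a", "b"), ("c", "d")], 0, 2)
```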

Prize-collecting Steiner Tree (PCST)

• Given a graph $G = (V, E)$ with positive node weights (prizes) $p(v)$ and negative edge weights (costs) $c(e)$, find a subtree $T' = (V', E')$ that optimizes one of:
  o Goemans-Williamson Minimization (GW-PCST): minimize $GW(T') = \sum_{e \in E'} c(e) + \sum_{v \notin V'} p(v)$
  o Net Worth Maximization (NW-PCST): maximize $NW(T') = \sum_{v \in V'} p(v) - \sum_{e \in E'} c(e)$
• Both are NP-hard (equivalent objective functions) [Johnson '00]
  o GW-PCST has an approximation factor of $2 - 1/(n-1)$
  o The rooted version of NW-PCST is NP-hard to approximate within any constant factor [Feig '01]
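
A small sketch of the two objectives may help show why they are equivalent as optimization targets: for any subtree, GW(T') + NW(T') equals the total prize of the graph, so minimizing one maximizes the other. The toy prizes and costs below are hypothetical, and `c` is assumed to hold cost magnitudes.

```python
def gw_objective(p, c, tree_nodes, tree_edges):
    """Cost of the chosen edges plus the prizes of the nodes left out."""
    paid = sum(c[e] for e in tree_edges)
    missed = sum(w for v, w in p.items() if v not in tree_nodes)
    return paid + missed


def nw_objective(p, c, tree_nodes, tree_edges):
    """Prizes of the chosen nodes minus the cost of the chosen edges."""
    return sum(p[v] for v in tree_nodes) - sum(c[e] for e in tree_edges)


# Hypothetical instance: node prizes and edge cost magnitudes.
p = {"a": 5, "b": 1, "c": 4}
c = {("a", "b"): 2, ("b", "c"): 3}

tree_nodes = {"a", "b"}
tree_edges = [("a", "b")]

gw = gw_objective(p, c, tree_nodes, tree_edges)
nw = nw_objective(p, c, tree_nodes, tree_edges)
total_prize = sum(p.values())
```

The identity concerns the optimal solutions only; it does not carry approximation ratios from one objective to the other.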

Why the same guarantee doesn’t hold for NW?

In this specific example (ratios are OPT/APX):

GW-PCST
• APX = 3*(k-1)
• OPT = 2*k
• ratio ≈ 2/3

NW-PCST
• OPT = k
• APX = 3
• ratio = k/3

Optimal solution: the whole graph

[Figure: example graph with the OPT and APX solutions marked]
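
Written out with the values from the example, and taking each ratio as OPT/APX, the contrast is that the GW ratio stays bounded as k grows while the NW ratio does not:

```latex
\[
\text{GW-PCST: } \frac{\mathrm{OPT}}{\mathrm{APX}} = \frac{2k}{3(k-1)}
  \xrightarrow{\;k \to \infty\;} \frac{2}{3},
\qquad
\text{NW-PCST: } \frac{\mathrm{OPT}}{\mathrm{APX}} = \frac{k}{3}
  \xrightarrow{\;k \to \infty\;} \infty .
\]
```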

Merge-and-refine approximation

• Merge nodes into clusters in a bottom-up fashion
  o shortest-path metric graph using edge costs
• Merge triangle and star structures considering both node values and interconnect cost
• Multiple refinement iterations
• Approximation quality
  o OPT <= APX + c*N(OPT), where N is the cost of interconnection
  o Good approximation for instances in which there are cheaply connected clusters of high-prize nodes
• Challenges
  o Relatively high computational cost due to all-pairs shortest path computation
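
To make the cost of the shortest-path metric concrete, here is a minimal sketch of an all-pairs shortest path computation. The slides do not say which APSP algorithm is used; Floyd-Warshall is chosen here only for brevity, and its cubic running time in the number of nodes illustrates why this step can dominate. The toy graph is hypothetical.

```python
import math


def metric_closure(nodes, cost):
    """All-pairs shortest path distances on an undirected graph.

    nodes: list of node ids; cost: dict (u, v) -> non-negative edge cost.
    Returns dist[u][v] for every pair (Floyd-Warshall, O(N^3)).
    """
    dist = {u: {v: (0.0 if u == v else math.inf) for v in nodes} for u in nodes}
    for (u, v), w in cost.items():
        if w < dist[u][v]:
            dist[u][v] = dist[v][u] = w
    for k in nodes:
        for i in nodes:
            d_ik = dist[i][k]
            if d_ik == math.inf:
                continue  # k is unreachable from i, nothing to relax
            for j in nodes:
                candidate = d_ik + dist[k][j]
                if candidate < dist[i][j]:
                    dist[i][j] = candidate
    return dist


# Hypothetical toy graph.
nodes = ["a", "b", "c", "d"]
cost = {("a", "b"): 1, ("b", "c"): 2, ("a", "c"): 5, ("c", "d"): 1}
dist = metric_closure(nodes, cost)
```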

An example

• Aggregate edge values within the interval
• Transform the edge-weighted graph into NW-PCST
• Apply the Merge-and-refine approximation
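
The slides do not spell out the transformation, so the following is only one plausible sketch under a stated assumption: each edge with a positive aggregated score is subdivided by a new prize node connected to its endpoints at zero cost, each non-positive edge becomes a cost equal to the magnitude of its score, and original nodes carry zero prize. The function names and the toy data are hypothetical.

```python
def aggregate(F, t1, t2):
    """Aggregated edge values within [t1, t2] (0-based, inclusive)."""
    return {e: sum(series[t1:t2 + 1]) for e, series in F.items()}


def to_nw_pcst(agg):
    """Turn aggregated edge scores into node prizes and edge costs.

    Assumption (not from the slides): a positive edge (u, v) with score s
    becomes a new node ("edge", u, v) with prize s, linked to u and v by
    zero-cost edges; any other edge keeps cost -s.
    """
    prizes, costs = {}, {}
    for (u, v), s in agg.items():
        prizes.setdefault(u, 0)
        prizes.setdefault(v, 0)
        if s > 0:
            mid = ("edge", u, v)
            prizes[mid] = s
            costs[(u, mid)] = 0
            costs[(mid, v)] = 0
        else:
            costs[(u, v)] = -s
    return prizes, costs


# Hypothetical toy instance.
F = {
    ("a", "b"): [1, 1, 1],
    ("b", "c"): [-1, -1, 1],
    ("c", "d"): [1, -1, -1],
}
agg = aggregate(F, 0, 2)
prizes, costs = to_nw_pcst(agg)
```

Any NW-PCST solver, such as the Merge-and-refine approximation, could then be run on `prizes` and `costs`.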

Running time of merge-and-refine 

• APSP (all-pairs shortest paths) comprises 90% of the approximation running time
• Takes more than a second for N=360 for one interval

Baseline solution across time

• Find the best subgraph in time by exhaustive enumeration
  o Consider all O(t^2) intervals
  o Apply the solution for a fixed interval in each
  o Take the best obtained subgraph in all intervals
• Polynomial cost, but impractical for real-world problems
  o The highway system of Southern California has ~4k edges with live-traffic measurements
  o The Autonomous Systems (AS)-level Internet backbone has hundreds of thousands of links
  o The baseline solution would not be practical for networks of this scale
• Need for scalable solutions of acceptable quality
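
The baseline can be sketched as a double loop over interval endpoints with a pluggable fixed-interval solver. In this sketch, prefix sums make each interval's aggregated edge values cheap to obtain, and `best_single_edge` is a hypothetical stand-in for the real fixed-interval solver (Merge-and-refine in the slides), used only so the example runs.

```python
from itertools import accumulate


def prefix_sums(F):
    """Per-edge prefix sums so that any interval sum costs O(1)."""
    return {e: [0] + list(accumulate(series)) for e, series in F.items()}


def interval_scores(prefix, t1, t2):
    """Aggregated value of every edge within [t1, t2] (inclusive)."""
    return {e: ps[t2 + 1] - ps[t1] for e, ps in prefix.items()}


def baseline(F, T, solve_fixed_interval):
    """Try all O(T^2) intervals and keep the best (score, interval, subgraph)."""
    prefix = prefix_sums(F)
    best = (float("-inf"), None, None)
    for t1 in range(T):
        for t2 in range(t1, T):
            score, sub = solve_fixed_interval(interval_scores(prefix, t1, t2))
            if score > best[0]:
                best = (score, (t1, t2), sub)
    return best


def best_single_edge(agg):
    """Stand-in solver (assumption): returns the single best edge."""
    e = max(agg, key=agg.get)
    return agg[e], [e]


# Hypothetical toy instance.
F = {
    ("a", "b"): [1, -1, 1, -1],
    ("b", "c"): [1, -1, -1, -1],
    ("c", "d"): [1, 1, 1, -1],
}
result = baseline(F, 4, best_single_edge)
```

With T time steps the solver is called T(T+1)/2 times, which is what makes this approach impractical when each call is expensive.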

Best-first approach using bounds

• Idea: reduce the number of calls to Merge-and-refine
  o Estimate solutions for different intervals
  o Evaluate the most "promising" intervals first
  o Prune intervals that do not contain the best solution
• Bound the solution in an interval
  o Computationally simple to compute
  o Effective in terms of pruning power
• Best-first procedure
  o Order intervals by their upper bound
  o Prune infeasible intervals using lower bound
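
A minimal sketch of the best-first procedure, assuming the bounds and the solver are supplied as callables (`upper_bound`, `lower_bound`, and `solve` are hypothetical names). Intervals are popped in decreasing order of upper bound, and the search stops as soon as the next upper bound cannot beat the incumbent. The toy setting simplifies "best subgraph" to "best single edge" so that the bounds are easy to verify.

```python
import heapq


def best_first(intervals, upper_bound, lower_bound, solve):
    """Return (best score, interval, number of solver calls)."""
    # The best lower bound is a feasible score, so it seeds the incumbent.
    incumbent, incumbent_iv = max((lower_bound(iv), iv) for iv in intervals)
    heap = [(-upper_bound(iv), iv) for iv in intervals]
    heapq.heapify(heap)
    calls = 0
    while heap:
        neg_ub, iv = heapq.heappop(heap)
        if -neg_ub <= incumbent:
            break  # no remaining interval can improve on the incumbent
        score = solve(iv)
        calls += 1
        if score > incumbent:
            incumbent, incumbent_iv = score, iv
    return incumbent, incumbent_iv, calls


# Hypothetical toy instance.
F = {
    ("a", "b"): [1, -1, 1, -1],
    ("b", "c"): [1, -1, -1, -1],
    ("c", "d"): [1, 1, 1, -1],
}
T = 4
intervals = [(a, b) for a in range(T) for b in range(a, T)]


def total(e, iv):
    return sum(F[e][iv[0]:iv[1] + 1])


def solve(iv):  # exact answer in the toy setting: best single edge
    return max(total(e, iv) for e in F)


def upper_bound(iv):  # count of +1 values can only overestimate a sum
    return max(sum(1 for x in F[e][iv[0]:iv[1] + 1] if x > 0) for e in F)


def lower_bound(iv):  # score of one fixed feasible solution
    return total(("a", "b"), iv)


result = best_first(intervals, upper_bound, lower_bound, solve)
```

In this toy run a single solver call suffices, compared with ten for exhaustive enumeration.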

Upper bound (UB)

• Offline:
  o Consider a hyper-graph in which original edges become nodes and original nodes become hyper-edges
  o Split the original edges into k partitions via hyper-graph partitioning
  o Maintain edges at partition "boundaries"
• Online UB estimation for a fixed interval:
  o UB of a partition is the aggregate of its positive edges
  o Edges between partitions: 0 cost if there is at least one positive boundary edge, cheapest boundary edge otherwise
  o Solve the NW-PCST on the obtained coarse-level graph
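
The online step can be sketched as building the coarse graph from a precomputed partitioning. The data layout here is an assumption: `part_of` assigns every edge to a partition, and `boundary` lists, for each pair of adjacent partitions, the edges kept on their shared boundary. The hyper-graph partitioning itself and the final NW-PCST solve are outside this sketch.

```python
def coarse_graph(agg, part_of, boundary):
    """Build partition prizes and inter-partition costs for the upper bound.

    agg: edge -> aggregated score in the interval.
    part_of: edge -> partition id (assumed layout).
    boundary: (p, q) -> list of boundary edges between p and q (assumed layout).
    """
    prizes = {}
    for e, part in part_of.items():
        # A partition's prize is the aggregate of its positive edges.
        prizes[part] = prizes.get(part, 0) + max(agg[e], 0)
    costs = {}
    for (p, q), edges in boundary.items():
        scores = [agg[e] for e in edges]
        if any(s > 0 for s in scores):
            costs[(p, q)] = 0  # a positive boundary edge connects for free
        else:
            costs[(p, q)] = min(-s for s in scores)  # cheapest boundary edge
    return prizes, costs


# Hypothetical toy instance with three partitions.
agg = {"e1": 3, "e2": -2, "e3": 1, "e4": -1, "e5": -4}
part_of = {"e1": 0, "e2": 0, "e3": 1, "e4": 1, "e5": 2}
boundary = {(0, 1): ["e2", "e3"], (1, 2): ["e4", "e5"]}

prizes, costs = coarse_graph(agg, part_of, boundary)
```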

Upper bound example

Upper bound effectiveness

• The upper bound is more effective if:
  o Partitions are well connected (small diameter)
  o Edges within partitions are correlated
  o Boundary edges are minimal and have expected value closer to -1 than within-partition edges
• The upper bound is a coarse aggregation of the original graph
  o Coarseness is controlled by the number of partitions
  o Trade-off between efficiency and effectiveness

Upper bound quality

• Random Markovian graph (N=150, M=180, T=300)
• Number of partitions: 2-64
• "Random 64" is a random partitioning of edges into 64 partitions

Lower bound

• Local iterative search in the solution space within an interval
  o Simulated Annealing (SA) procedure that grows/shrinks a subgraph within an interval
  o Possible moves: add/remove an edge from an existing solution
  o Allow sub-optimal moves according to an annealing schedule
• Better quality than a simple greedy algorithm
  o Due to sub-optimal moves, high-score clusters can be joined even if they are more than 2 hops apart
• Better running time than Merge-and-refine
  o No computation of all-pairs shortest paths
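
The following is a minimal sketch of such a simulated-annealing search, not the authors' procedure. The geometric cooling schedule, the parameter values, the choice to start from the best single edge, and the rule that the subgraph must stay connected are all assumptions made for illustration.

```python
import math
import random


def connected(edges):
    """True if the edge set forms a single connected component."""
    edges = list(edges)
    if not edges:
        return True
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    start = edges[0][0]
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return len(seen) == len(adj)


def anneal(agg, steps=2000, temp0=2.0, cooling=0.995, seed=0):
    """Grow/shrink a connected subgraph; returns (best score, best edge set)."""
    rng = random.Random(seed)
    all_edges = list(agg)
    start = max(all_edges, key=agg.get)  # assumption: start at the best edge
    current, cur_score = {start}, agg[start]
    best, best_score = set(current), cur_score
    temp = temp0
    for _ in range(steps):
        nodes = {x for e in current for x in e}
        frontier = [e for e in all_edges
                    if e not in current and (e[0] in nodes or e[1] in nodes)]
        removable = [e for e in current
                     if len(current) > 1 and connected(current - {e})]
        moves = [("add", e) for e in frontier] + [("remove", e) for e in removable]
        if not moves:
            break
        kind, e = rng.choice(moves)
        delta = agg[e] if kind == "add" else -agg[e]
        # Always accept improvements; accept worsening moves with a
        # probability that shrinks as the temperature drops.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            if kind == "add":
                current.add(e)
            else:
                current.remove(e)
            cur_score += delta
            if cur_score > best_score:
                best, best_score = set(current), cur_score
        temp *= cooling
    return best_score, best


# Hypothetical path graph: two high-score edges separated by two negative ones.
agg = {("a", "b"): 3, ("b", "c"): -1, ("c", "d"): -1, ("d", "e"): 3}
best_score, best_edges = anneal(agg)
```

Because the returned score belongs to an actual connected subgraph, it is a valid lower bound for the interval regardless of how well the search converges.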

Summary

• Dynamic graphs with changing edge attributes

• Simplest query: find the highest scoring substructure

• Heuristics under development
• Approximation guarantee

• Empirical validation on
  o traffic network
  o Twitter messages

Path forward

• Maximal scoring subgraph is a building block for richer queries and analyses
  o What is the structure of a congestion? Global (short and large), longitudinal (prolonged and localized), or a combination of both?
  o What characterizes the evolution of a network?
  o How do different network regions compare?
  o Is evolution similar across networks of different genres?
• Index structures
  o Use statistical models for indexing real-world networks: exploit locality within the network and locality in time; represent the network at different levels of coarseness
  o Queries constrained by time and neighborhood
  o Similarity queries

Connections

• Queries/analysis of information flow (E2.1)

• Queries on mobile networks (E2.2, E2.3)

• Formal modeling of time (E1.1)

• Dynamic network models (E2.1)

Army relevance

• Query/analysis of mobility networks

• Cyber-physical scenario

• Query/analysis of evolving networks

• Patterns of behavior in composite networks

• Find terrorist groups using temporal interactions

Publications

• P. Bogdanov, B. Baumer, P. Basu, A. Singh, and A. Bar-Noy, "Discovering Influential Groups of Agents Using Composite Network Analysis," submitted to NetSci 2011.

• P. Bogdanov, Nicholas D. Larusso, and Ambuj K. Singh, "Towards Community Discovery in Signed Collaborative Interaction Networks," published in SIASP at the 2010 IEEE International Conference on Data Mining, 2010.

• K. Macropol and A. Singh, "Content-based Modeling and Prediction of Information Dissemination," submitted to ASONAM 2011.

• M. Mongiovi, A. Singh, X. Yan, B. Zong, and K. Psounis, "An Indexing System for Mobility-aware Information Management," submitted to VLDB.

• Ziyu Guan, Jian Wu, Zheng Yun, Ambuj K. Singh, and Xifeng Yan, "Assessing and Ranking Structural Correlations in Graphs," to appear at SIGMOD 2011.

• Nicholas D. Larusso and Ambuj K. Singh, "Synopses for Probabilistic Data over Large Domains," in EDBT 2011.

THANK YOU!

Markovian dynamic models

• Markovian: the graph state is a Markov chain
  o Fixed set of nodes
  o Edges at time t depend on edges at time t-1
• Cover Time of Dynamic Graphs [Avin '08]
  o Introduction of Markovian dynamic graphs
  o Exponential cover time
  o Lazy random walks
• Information spread in Markovian graphs [Clementi '09]
  o Edge-Markovian
  o Geometric Markovian: node mobility
• Evolving range-dependent graphs [Grindrod '09]
  o Edge dynamics as a birth/death process
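
As an illustration of the edge-Markovian idea, here is a minimal simulation sketch in which every possible edge evolves independently: an absent edge appears with a birth probability and a present edge disappears with a death probability. The function name, parameter names, and the all-absent initial state are assumptions made for this example.

```python
import random


def edge_markovian(edges, T, p_birth, p_death, seed=0):
    """Simulate T steps; returns a list of edge sets, one per time step."""
    rng = random.Random(seed)
    alive = {e: False for e in edges}  # assumption: start with no edges
    history = []
    for _ in range(T):
        nxt = {}
        for e, is_alive in alive.items():
            if is_alive:
                nxt[e] = rng.random() >= p_death  # survives unless it dies
            else:
                nxt[e] = rng.random() < p_birth   # appears with p_birth
        alive = nxt
        history.append({e for e, a in alive.items() if a})
    return history


# Hypothetical candidate edges on four nodes.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
always_on = edge_markovian(edges, 5, p_birth=1.0, p_death=0.0)
always_off = edge_markovian(edges, 5, p_birth=0.0, p_death=1.0)
sample = edge_markovian(edges, 5, p_birth=0.3, p_death=0.2)
```

The state at each step depends only on the state at the previous step, which is the Markov property stated in the first bullet.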

Dynamic models of traffic

• The cell transmission model (CTM) [Daganzo '94]
  o Dynamic model of highway traffic
  o Inspired by hydrodynamic theory
• Traffic Flow on a Freeway Network [Bickel '01]
  o Time and context Markovian model of the traffic flow
  o The state of a segment at time t depends on the state of its neighbors and itself at time t-1
  o Model of a single highway. How about junctions?
• Computer Network Traffic [Stoev '09]
  o Statistical model of traffic flow across all links
  o Applied to traffic prediction

Background literature

[Avin '08] Chen Avin and Zvi Lotker. "How to Explore a Fast-Changing World." 2008.

[Bickel '01] Peter Bickel, Chao Chen, Jaimyoung Kwon, and John Rice. "Traffic Flow on a Freeway Network." Electrical Engineering, 2001.

[Clementi '09] Andrea Clementi, Angelo Monti, Francesco Pasquale, and Riccardo Silvestri. "Information Spreading in Stationary Markovian Evolving Graphs." Informatica, 2009.

[Feig '01] J. Feigenbaum, C. Papadimitriou, and S. Shenker. "Sharing the Cost of Multicast Transmissions." JCSS, 63, 21-41, 2001.

[Grindrod '09] Peter Grindrod and Desmond J. Higham. "Evolving Graphs: Dynamical Models, Inverse Problems and Propagation." 2009.

[Johnson '00] D. Johnson, M. Minkoff, and S. Phillips. "The Prize Collecting Steiner Tree Problem: Theory and Practice." ACM SODA, 2000.

[Leskovec '07] Jure Leskovec, Mary McGlohon, Christos Faloutsos, Natalie Glance, and Matthew Hurst. "Cascading Behavior in Large Blog Graphs: Patterns and a Model." SDM, 2007.

[Ribeiro '11] B. Ribeiro, D. Figueiredo, E. de Souza e Silva, and D. Towsley, "Characterizing Dynamic Graphs with Continuous-time Random Walks" SIGMETRICS 2011.

[Snijders '09] Tom A.B. Snijders, Gerhard G. van de Bunt, Christian E.G. Steglich, "Introduction to Stochastic Actor-Based Models for Network Dynamics", Social Networks, 2009

[Stoev '09] Stilian A. Stoev, George Michailidis, and Joel Vaughan. "Global Modeling and Prediction of Computer Network", Arxiv 2009

[Zheng '05] A. X. Zheng and A. Goldenberg. "A Generative Model for Dynamic Contextual Friendship Networks." Learning, 2005.
