March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM)...

26
March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan (UCSB)

Transcript of March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM)...

Page 1: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

March, 2011

I2.2 Large-Scale Information Network Processing

Mid-Year Report

Charu Aggarwal (IBM)Christos Faloutsos (CMU)

Ambuj Singh (UCSB)Xifeng Yan (UCSB)

Page 2: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

Task Setting

2

Indexing, Partitioning, and Distributed Processing on Time-Varying Networks

Page 3: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

INARC I2.2 Mid-Year Report Objectives

– Novel graph index model and advanced graph distributed computing theory to facilitate processing of (military) linked data that becomes a bottleneck for many research tasks in network science

Key Technical Innovations:

− Dynamic graph indexing models and structures

− Scalable graph processing

− Graph partition overlapping and re-balancing theory Primary Members

– Xifeng Yan (UCSB), Ambuj Singh (UCSB), Charu Aggarwal (IBM), Christos Faloutsos (CMU)

Collaborative Members– Z. Wen IBM/SCNARC, J. Bao RPI/IRC, J. Han UIUC/INARC, M. Srivatsa

IBM/INARC, V. Kawadia BBN/IRC, S. Desai Army

3

Page 4: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

I2.2: Large-Scale Information Network Processing

Key Objective:− Novel graph index model and advanced graph distributed

computing theory to facilitate processing of (military) linked data that becomes a bottleneck for many research tasks in network science

Deliverables:Q1: Data collection and cleaning for graph indexing and

distributed graph computing

Q2: Design graph indices with time-varying concerns

Q3: Design and test distributed graph computing strategies

Q4: Hypotheses validation and research paper submission

Impact:− Provide fast, scalable, and linked information access to

soldiers and commanders

Key Technical Innovations− Dynamic graph indexing models and structures to resolve

graph queries in time-varying information networks

− Query cost models for distributed graph processing

− Graph partition overlapping and re-balancing theory to (1)

improve locality of data for parallel computing, and (2)

accommodate dynamic network data updates and query

workload changes

− Self evolving distributed graph processing environment to

adjust graph partitions dynamically

Role ResearchersLead Xifeng Yan, UCSB

Primary Ambuj Singh, UCSB

Primary Charu Aggarwal, IBM

Primary Christos Faloutsos, CMU

Collab Z. Wen, IBM, SCNARC

Collab G. Cao, PSU, CNARC

Collab J. Han, UIUC, INARC

Total $326.7K

4

Page 5: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

5

Large-Scale Information Network Processing: Invent scalable information network infrastructure

Facilitate processing of (military) linked data that becomes a bottleneck for many research tasks in network science

Advance our understanding of scalability challenges, not only for information networks but also for other genres of complex networks

The models and the proposed experimental systems provide fundamental analysis of – How indexing of dynamic network data affects query performance, – How graph partitioning schemes affect distributed query processing,– How the models and laws of real networks affect the design of graph

indexing and partitioning strategy

5

Advance State-of-the-Art Network Science

5

Page 6: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

6

Subtask 1: Graph Index and Search (UCSB, IBM)– Fast access and processing of time-varying information networks is the

key for tasks such as intelligence service and query processing. Simply speaking, we cannot access networks nodes by nodes!

Subtask 2: Graph over MapReduce (CMU)– To process overwhelming amount of data on the Web, social networks,

emails, telecommunications, to distill important information such as people’s opinion about extremists, to find potential radical groups, to identify influential nodes, we need powerful graph processing methods.

– Needed by any large-scale network data processing including information, social and communication networks

Subtask 3: Graph Partitioning/Distributed Graph Processing (UCSB, CMU)– Military information is often distributed in many devices, distributed

graph processing run graph algorithms without putting all data together in the same machine

6

Military Relevance

6

Page 7: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

Subtask 1: Graph Index and Search

Indexing Methods for Large Scale Static and Dynamic Networks Methods for Indexing Massive Disk-Resident Graphs

(Aggarwal (IBM), Zhao (UIUC), and Han (UIUC)) Methods for Indexing Dynamic Network Streams (Aggarwal

(IBM), Khan (UCSB), Yan (UCSB)) Dynamic structural index for label-based queries (Aggarwal

(IBM) and Li (UCSB)): SDM 2011 accepted. Analysis of significant substructures in time-varying networks

(Singh (UCSB) et al.) Find highest scoring substructures combines structure and

time

7

Page 8: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

gDensity: Model-Based Indexing

Problem definition (labeled proximity search)– Label-based graph proximity search, seeks to find the top-k vertex subsets with the smallest diameters, for a given query of distinct labels. Each subset must cover all the labels specified in the query.

8

Q=(a, b, c)

d=3d=3d=2d=2

Q=(“reconnaissance”, “biometric matching”, “failure modeling”)

Nan Li et al., Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal

Page 9: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

gDensity: Ideas and Results

Can we do better?

9

Which one is more promising?

u’s densitydistribution

v’s densitydistribution

10 – 300 times faster

Nan Li et al., Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal

Page 10: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

10

Align two networks

Graph Search: a Model-Based Approach

Ideas Use information propagation

model to propagate labels in information networks

Convert vertices to vectors Align sets of vectors

Query Speed: 0.1 sec for WebGraph:10M vertices, 213M edges

Information Propagation Model

A. Khan et al., Neighborhood Based Fast Graph Search in Large Networks, SIGMOD’11

(a) linkedin (b) facebook

Page 11: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

11

SEARCH ALGORITHM

Step 1: Match a node u of target graph G with some node v of query graph Q, if L(v) ⊆ L(u) and cost(u,v) is less than a predefined cost threshold ε.

Step 2: Discard the labels of the unmatched nodes in the target graph.

Step 3: Propagate the labels only among the matched nodes from the previous step. Repeat steps 1 and 2 until no node can be discarded further.

G Q

v1

v2 v3v4

u1 u2

u3u4 u5

u6

f

Page 12: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

12

Dynamic Updates

Dynamic Update in Index vs. Re-indexing (DBLP)

Indexing is performed for h=2 hops.

Page 13: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

13

Investigate graph properties and graph algorithms using MapReduce– Spectral Analysis of Billion-Scale Graphs– Patterns on the Connected Components of Terabyte-Scale

Graphs

Study the limitation of the MapReduce architecture on processing network-centric data– Using the discovered patterns of terabyte-scale real-life

graphs.

13

Subtask 2: Graph Over MapReduce

13

Page 14: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

Spectral Analysis of Billion-Scale Graphs

Billion-Scale Eigen-solver Computes top-k eigen-values and eigenvectors Find anomalies in large graphs. Many application: SVD, triangle counting, spectral

clustering, … A careful implementation of Lanczos on hadoop

can give excellent accuracy as well as scalabil-ity

Contribution: HEigen: a billion-scale eigensolver which can

handle 1000x larger matrices than previous methods

Application of the eigensolver on the twitter graph helps us spot abnormal users (adult ad-vertisers)

14U Kang, et al. Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation, PAKDD'11

Page 15: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

Patterns on the Connected Components of Terabyte-Scale Graphs

A large graph is composed of many connected components– Q1: static patterns?– Q2: evolution patterns?– Q3: model?

15U Kang, et al. Patterns on the Connected Components of Terabyte-Scale Graphs. ICDM 2010

Size

Count

YahooWeb graph|V| = 1.4 billion|E| = 6.7 billion

120 GBytes

Metric:Graph Fractal Dimension(G): log |E| / log |V|

Page 16: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

Subtask 3: Graph Partition for Distributed Graph Computing

16Shengqi Yang, et al., Managing Large-Scale Graphs for Efficient Distributed Processingsubmitted to VLDB 2011

Are typical techniques efficient for graph queries?

Graph partitioning and distribution techniques (e.g., Pregel) Limitations:– Unavailable to the public– Unbalanced workload due to

skewed uniformly distributed graph queries.

– Communication overhead due to inter‐machine (cross partition) communication. Sedge: distributed graph processing

– Model-based Graph Partitioning Techniques– First-of-Its Kind Distributed Graph Computing Platform

for Information, Social, and Communication Networks

Page 17: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

Graph Partition Models

17Shengqi Yang, et al., Managing Large-Scale Graphs for Efficient Distributed Processingsubmitted to VLDB 2011

Dynamic Workload: Replicate Partitions- Replicate partitions that are intensively

accessed by many queries

Dynamic Workload: New Partitions- Generate new partitions that are

intensively accessed by many cross-partition queries

Complementary Partitions- Generate partitions sets that are

complementary to each other

Page 18: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

Graph partitioning with region constraint

Optimal Solution:

Where:

NP-hard

18

Page 19: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

Global Optimization

Before each iteration, increase the weight of edges in each region wrt. its priority

… Iteratively repartition

the graph

19

Page 20: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

10,000 random queries. Increase partition number by adding more machines.

# of Machines vs. Throughput Improvement Ratio

Graph Partition for Distributed Graph Computing

20

Page 21: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

Collaborations and Path Ahead

Collaborations within I2– Monthly meeting– Strong connection between I2.1 and I2.2: One problem,

two sides. information network processing on DTN and Clusters

– (I2.1) Work with Arun Iyengar and Mudahakar Srivatsa (IBM), who has done much work on DTN and Storage. Shengqi Yang will intern at IBM this summer.

Collaborations with researchers in other networks– (S1.1) Work with Zhen Wen (IBM), on the social network

application of graph density indexing. U Kang was a summer intern at IBM

– (E1.1, R2.3) Work with Jie Bao (RPI), on RDF queries using neighborhood-based graph search.

– (T2.3) Work with Vikas Kawadia (BBN), on using graph query processing for distributed trust computing. Ziyu Guan is collaborating with Vikas

– Graph search has connection with (T2.4) M. Goldberg’s work on trust structure.

– Work with Sachi Desai (Army) on graph query language/system.

2121

Page 22: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

Next Six Months and Path Ahead to 2012

Continue research on large-scale information network processing (more specific)(1)Graph indexing on multiple time-varying graph

snapshots(2)Compression-based, Model-based Info Network

Processing(3) Edge lay-out on Hadoop file system for better

compression and better performance(4) Complementary graph partitioning theories.

Other research topics planned– Models and methods for building complex graph

queries

– Models and methods for routing complex graph queries to data sources (for both I2.1 and I2.2)

– Tensor analysis on Hadoop 2222

Page 23: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

Research Papers (Accepted/Published) A. Khan, N. Li, Z. Guan, X. Yan, S. Chakraborty, and S. Tao,

Neighborhood Based Fast Graph Search in Large Networks, Proc. 2011 Int. Conf.  on Management of Data (SIGMOD'11), 2011.

Nicholas D Larusso and Ambuj K. Singh, "Synopses for Probabilistic Data over Large Domains", EDBT'11

C. C. Aggarwal, N. Li, On Dynamic Node-Classification in Content-based Networks, SIAM International Conference on Data Mining (SDM) 2011

U Kang, Mary McGlohon, Leman Akoglu, and Christos Faloutsos. Patterns on the Connected Components of Terabyte-Scale Graphs. IEEE International Conference on Data Mining (ICDM) 2010, Sydney, Australia. 

U Kang, Brendan Meeder, Christos Faloutsos, Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation, PAKDD'11

U Kang, Duen Horng Chau, and Christos Faloutsos. Mining Large Graphs: Algorithms, Inference, and Discoveries. IEEE International Conference on Data Engineering (ICDE) 2011, Hannover, Germany.

23

Page 24: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

Research Papers Shengqi Yang, Bo Zong, Arijit Khan, Ben Zhao, Xifeng Yan, Managing Large-

Scale Graphs for Efficient Distributed Processing, submitted to VLDB 2011 Nan Li, Arijit Khan, Xifeng Yan, and Zhen Wen, Density Index and Proximity

Search on Large Graphs, to be submitted to VLDB Journal Petko Bogdanov, Misael Mogiovi, Ambuj Singh, Mining Heavy-Edges Subnetworks

in Time, to be submitted to VLDB Journal C. C. Aggarwal, P. Zhao, J. Han. On Shortest-Path Indexing of Massive Disk

Resident Graphs, Research Report, to be submitted to VLDB Journal C. C. Aggarwal, A. Khan, X. Yan. A Probabilistic Index for Massive and Dynamic

Graph Streams, Research Report, to be submitted to VLDB Journal

24

Page 25: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

Big Picture

25

Stage 1: How to distribute graphs (we are here)Stage 2: How to construct queriesStage 3: How to execute/route queries

Make Information Network Accessible by Soldiers and Commanders

Page 26: March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM) Christos Faloutsos (CMU) Ambuj Singh (UCSB) Xifeng Yan.

Questions?

26