March, 2011 I2.2 Large-Scale Information Network Processing Mid-Year Report Charu Aggarwal (IBM)...

March, 2011

I2.2 Large-Scale Information Network Processing

Mid-Year Report

Charu Aggarwal (IBM), Christos Faloutsos (CMU), Ambuj Singh (UCSB), Xifeng Yan (UCSB)

Task Setting

Indexing, Partitioning, and Distributed Processing on Time-Varying Networks

Objectives

– Novel graph index model and advanced graph distributed computing theory to facilitate processing of (military) linked data that becomes a bottleneck for many research tasks in network science

Key Technical Innovations

– Dynamic graph indexing models and structures
– Scalable graph processing
– Graph partition overlapping and re-balancing theory

Primary Members

– Xifeng Yan (UCSB), Ambuj Singh (UCSB), Charu Aggarwal (IBM), Christos Faloutsos (CMU)

Collaborative Members

– Z. Wen (IBM/SCNARC), J. Bao (RPI/IRC), J. Han (UIUC/INARC), M. Srivatsa (IBM/INARC), V. Kawadia (BBN/IRC), S. Desai (Army)

I2.2: Large-Scale Information Network Processing

Key Objective:
– Novel graph index model and advanced graph distributed computing theory to facilitate processing of (military) linked data that becomes a bottleneck for many research tasks in network science

Deliverables:
– Q1: Data collection and cleaning for graph indexing and distributed graph computing
– Q2: Design graph indices with time-varying concerns
– Q3: Design and test distributed graph computing strategies
– Q4: Hypotheses validation and research paper submission

Impact:
– Provide fast, scalable, and linked information access to soldiers and commanders

Key Technical Innovations:
– Dynamic graph indexing models and structures to resolve graph queries in time-varying information networks
– Query cost models for distributed graph processing
– Graph partition overlapping and re-balancing theory to (1) improve locality of data for parallel computing, and (2) accommodate dynamic network data updates and query workload changes
– Self-evolving distributed graph processing environment to adjust graph partitions dynamically

Role      Researchers
Lead      Xifeng Yan, UCSB
Primary   Ambuj Singh, UCSB
Primary   Charu Aggarwal, IBM
Primary   Christos Faloutsos, CMU
Collab    Z. Wen, IBM, SCNARC
Collab    G. Cao, PSU, CNARC
Collab    J. Han, UIUC, INARC
Total     $326.7K

Advance State-of-the-Art Network Science

Large-Scale Information Network Processing: Invent scalable information network infrastructure

– Facilitate processing of (military) linked data that becomes a bottleneck for many research tasks in network science
– Advance our understanding of scalability challenges, not only for information networks but also for other genres of complex networks

The models and the proposed experimental systems provide fundamental analysis of:
– How indexing of dynamic network data affects query performance
– How graph partitioning schemes affect distributed query processing
– How the models and laws of real networks affect the design of graph indexing and partitioning strategies

Military Relevance

Subtask 1: Graph Index and Search (UCSB, IBM)
– Fast access to and processing of time-varying information networks is key for tasks such as intelligence services and query processing. Simply put, we cannot afford to access networks node by node!

Subtask 2: Graph over MapReduce (CMU)
– To process the overwhelming amount of data from the Web, social networks, emails, and telecommunications, and to distill important information (such as people’s opinions about extremists), find potential radical groups, and identify influential nodes, we need powerful graph processing methods.
– Needed by any large-scale network data processing, including information, social, and communication networks

Subtask 3: Graph Partitioning/Distributed Graph Processing (UCSB, CMU)
– Military information is often distributed across many devices; distributed graph processing runs graph algorithms without gathering all the data on a single machine.

Subtask 1: Graph Index and Search

Indexing Methods for Large-Scale Static and Dynamic Networks
– Methods for Indexing Massive Disk-Resident Graphs (Aggarwal (IBM), Zhao (UIUC), and Han (UIUC))
– Methods for Indexing Dynamic Network Streams (Aggarwal (IBM), Khan (UCSB), Yan (UCSB))
– Dynamic structural index for label-based queries (Aggarwal (IBM) and Li (UCSB)): accepted at SDM 2011
– Analysis of significant substructures in time-varying networks (Singh (UCSB) et al.): find the highest-scoring substructures, combining structure and time

gDensity: Model-Based Indexing

Problem definition (labeled proximity search)
– Label-based graph proximity search seeks to find the top-k vertex subsets with the smallest diameters for a given query of distinct labels. Each subset must cover all the labels specified in the query.

[Figure: example query Q = (a, b, c) with four candidate vertex subsets of diameter d = 3, d = 3, d = 2, and d = 2]

Q = (“reconnaissance”, “biometric matching”, “failure modeling”)

Nan Li et al., Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal
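The problem definition above can be made concrete with a brute-force baseline. The following is only a minimal sketch, not the gDensity index itself: it assumes an unweighted graph stored as an adjacency dictionary, measures a subset's diameter as the largest hop distance between its members in the whole graph, and enumerates every label-covering combination. The names `bfs_distances` and `top_k_proximity` and the toy path graph are illustrative assumptions.

```python
from collections import deque
from itertools import product
import math


def bfs_distances(adj, source):
    """Hop distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def top_k_proximity(adj, labels, query, k):
    """Brute-force top-k label-covering vertex subsets, smallest diameter first."""
    # One candidate list per query label.
    candidates = [[v for v in adj if lab in labels[v]] for lab in query]
    dist = {v: bfs_distances(adj, v) for cand in candidates for v in cand}
    diameters = {}
    for combo in product(*candidates):
        subset = frozenset(combo)
        if subset not in diameters:
            # Assumed diameter: largest pairwise hop distance (inf if disconnected).
            diameters[subset] = max(
                dist[a].get(b, math.inf) for a in subset for b in subset
            )
    # Ties are broken by the sorted vertex ids to keep the output deterministic.
    ranked = sorted(diameters.items(), key=lambda item: (item[1], sorted(item[0])))
    return ranked[:k]


# Toy path graph 1-2-3-4-5-6 with hypothetical labels.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
labels = {1: {"a"}, 2: {"b"}, 3: {"c"}, 4: set(), 5: {"a"}, 6: {"b"}}
results = top_k_proximity(adj, labels, ("a", "b", "c"), k=2)
```

The enumeration grows with the product of the candidate list sizes, which is the cost that an index over large graphs would need to avoid.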

gDensity: Ideas and Results

Can we do better? Which one is more promising?

[Figure: u’s density distribution vs. v’s density distribution]

10–300 times faster

Nan Li et al., Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal

Graph Search: a Model-Based Approach

Ideas
– Use an information propagation model to propagate labels in information networks
– Convert vertices to vectors
– Align sets of vectors

Query speed: 0.1 sec for WebGraph (10M vertices, 213M edges)

[Figures: aligning two networks; information propagation model, with panels (a) linkedin and (b) facebook]

A. Khan et al., Neighborhood Based Fast Graph Search in Large Networks, SIGMOD’11

SEARCH ALGORITHM

Step 1: Match a node u of target graph G with some node v of query graph Q, if L(v) ⊆ L(u) and cost(u,v) is less than a predefined cost threshold ε.

Step 2: Discard the labels of the unmatched nodes in the target graph.

Step 3: Propagate the labels only among the matched nodes from the previous step. Repeat steps 1 and 2 until no node can be discarded further.
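The three steps above can be sketched as a small pruning loop. This is a simplified illustration, not the published algorithm: the slide does not define the propagation model or cost(u,v), so the sketch assumes that each labeled node contributes `alpha ** distance` to nodes within `h` hops, and that the cost is the total shortfall of the target node's label strengths relative to the query node's. The names `neighborhood_vectors`, `cost`, `candidate_matches`, and the toy graphs are assumptions.

```python
from collections import deque


def neighborhood_vectors(adj, labels, h, alpha):
    """Per node: label -> propagated strength from labeled nodes within h hops."""
    vectors = {}
    for source in adj:
        vec, dist, queue = {}, {source: 0}, deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == h:
                continue
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                    for lab in labels[w]:
                        vec[lab] = vec.get(lab, 0.0) + alpha ** dist[w]
        vectors[source] = vec
    return vectors


def cost(target_vec, query_vec):
    """Assumed cost: how much label strength the target node is missing."""
    return sum(max(0.0, need - target_vec.get(lab, 0.0))
               for lab, need in query_vec.items())


def candidate_matches(g_adj, g_labels, q_adj, q_labels, eps=1e-9, h=2, alpha=0.5):
    g_labels = {u: set(ls) for u, ls in g_labels.items()}  # working copy
    q_vecs = neighborhood_vectors(q_adj, q_labels, h, alpha)
    while True:
        # Step 3: propagate using only the labels that are still kept.
        g_vecs = neighborhood_vectors(g_adj, g_labels, h, alpha)
        # Step 1: label containment plus cost below the threshold.
        matches = {
            u: [v for v in q_adj
                if q_labels[v] <= g_labels[u] and cost(g_vecs[u], q_vecs[v]) < eps]
            for u in g_adj
        }
        # Step 2: discard the labels of nodes that matched nothing.
        unmatched = [u for u in g_adj if g_labels[u] and not matches[u]]
        if not unmatched:
            return {u: vs for u, vs in matches.items() if vs}
        for u in unmatched:
            g_labels[u] = set()


# Query: a triangle labeled a, b, c. Target: a matching triangle plus extra nodes.
q_adj = {"v1": ["v2", "v3"], "v2": ["v1", "v3"], "v3": ["v1", "v2"]}
q_labels = {"v1": {"a"}, "v2": {"b"}, "v3": {"c"}}
g_adj = {"u1": ["u2", "u3", "u5"], "u2": ["u1", "u3", "u4"], "u3": ["u1", "u2", "u4"],
         "u4": ["u2", "u3"], "u5": ["u1", "u6"], "u6": ["u5"]}
g_labels = {"u1": {"a"}, "u2": {"b"}, "u3": {"c"}, "u4": {"a"}, "u5": {"b"}, "u6": {"c"}}
matched = candidate_matches(g_adj, g_labels, q_adj, q_labels)
```

In this toy run, u6 fails in the first round; once its label is discarded, u5 loses the support it relied on and fails in the second round, which is the cascading effect the repetition in Step 3 is meant to capture.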

[Figure: target graph G (nodes u1–u6) and query graph Q (nodes v1–v4), with a mapping labeled f]

Dynamic Updates

Dynamic Update in Index vs. Re-indexing (DBLP)

Indexing is performed for h=2 hops.


Subtask 2: Graph Over MapReduce

Investigate graph properties and graph algorithms using MapReduce
– Spectral Analysis of Billion-Scale Graphs
– Patterns on the Connected Components of Terabyte-Scale Graphs

Study the limitations of the MapReduce architecture for processing network-centric data
– Using the discovered patterns of terabyte-scale real-life graphs

Spectral Analysis of Billion-Scale Graphs

Billion-Scale Eigensolver
– Computes the top-k eigenvalues and eigenvectors
– Finds anomalies in large graphs
– Many applications: SVD, triangle counting, spectral clustering, …
– A careful implementation of Lanczos on Hadoop can give excellent accuracy as well as scalability

Contribution: HEigen, a billion-scale eigensolver that can handle 1000x larger matrices than previous methods

Applying the eigensolver to the Twitter graph helps us spot abnormal users (adult advertisers)

U Kang, et al. Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation, PAKDD'11
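To illustrate the Lanczos iteration mentioned above, here is a single-machine sketch in plain Python. It is not HEigen and says nothing about the Hadoop implementation: it assumes a small undirected graph stored as neighbor lists, uses full reorthogonalization (a simple choice made for this example), and finds the largest eigenvalue of the resulting tridiagonal matrix by Sturm-count bisection. All function names are hypothetical.

```python
import math
import random


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def lanczos(adj, steps, seed=0):
    """Return the diagonal and off-diagonal of the Lanczos tridiagonal matrix."""
    n = len(adj)
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    scale = math.sqrt(dot(v, v))
    basis = [[x / scale for x in v]]
    alphas, betas = [], []
    for j in range(steps):
        # Sparse matrix-vector product with the adjacency matrix.
        w = [sum(basis[j][nbr] for nbr in adj[i]) for i in range(n)]
        alphas.append(dot(w, basis[j]))
        # Full reorthogonalization against every basis vector built so far.
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        beta = math.sqrt(dot(w, w))
        if beta < 1e-10 or j == steps - 1:
            break  # invariant subspace found, or step budget exhausted
        betas.append(beta)
        basis.append([wi / beta for wi in w])
    return alphas, betas


def count_below(alphas, betas, x):
    """Number of eigenvalues of the tridiagonal matrix below x (Sturm count)."""
    count, d = 0, 1.0
    for i, a in enumerate(alphas):
        off = betas[i - 1] ** 2 if i > 0 else 0.0
        d = (a - x) - off / d
        if d == 0.0:
            d = -1e-30  # avoid dividing by zero in the next step
        if d < 0.0:
            count += 1
    return count


def largest_eigenvalue(alphas, betas, tol=1e-10):
    # Gershgorin-style bound on the spectrum of the tridiagonal matrix.
    radius = max(abs(a) for a in alphas) + 2.0 * max(betas, default=0.0)
    lo, hi = -radius, radius
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if count_below(alphas, betas, mid) == len(alphas):
            hi = mid  # every eigenvalue lies below mid
        else:
            lo = mid
    return (lo + hi) / 2.0


# Star graph with four leaves; its adjacency eigenvalues are 2, 0, 0, 0, -2.
star = [[1, 2, 3, 4], [0], [0], [0], [0]]
alphas, betas = lanczos(star, steps=5)
top = largest_eigenvalue(alphas, betas)
```

The only operation that touches the graph is the matrix-vector product, which is why the method is a natural fit for a distributed setting; the tridiagonal problem stays small.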

Patterns on the Connected Components of Terabyte-Scale Graphs

A large graph is composed of many connected components
– Q1: static patterns?
– Q2: evolution patterns?
– Q3: model?

[Figure: component size vs. count for the YahooWeb graph (|V| = 1.4 billion, |E| = 6.7 billion, 120 GBytes)]

Metric: Graph Fractal Dimension(G) = log |E| / log |V|

U Kang, et al. Patterns on the Connected Components of Terabyte-Scale Graphs. ICDM 2010
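As a small worked illustration of the size/count statistic and the metric above, the sketch below finds connected components with union-find on one machine and evaluates log |E| / log |V| per component. This is a stand-in for exposition, not the MapReduce method used in the paper; the helper names and the six-vertex toy graph are assumptions, and the metric is left undefined (None) when |V| < 2 or |E| < 1.

```python
import math
from collections import Counter


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def component_stats(num_vertices, edges):
    """Map each component root to (vertex count, edge count)."""
    parent = list(range(num_vertices))
    for u, v in edges:
        root_u, root_v = find(parent, u), find(parent, v)
        if root_u != root_v:
            parent[root_u] = root_v
    vertex_count = Counter(find(parent, v) for v in range(num_vertices))
    edge_count = Counter(find(parent, u) for u, _ in edges)
    return {root: (vertex_count[root], edge_count[root]) for root in vertex_count}


def graph_fractal_dimension(num_vertices, num_edges):
    """log |E| / log |V|, or None where the ratio is undefined."""
    if num_vertices < 2 or num_edges < 1:
        return None
    return math.log(num_edges) / math.log(num_vertices)


# Toy graph: a triangle, a single edge, and one isolated vertex.
edges = [(0, 1), (1, 2), (0, 2), (3, 4)]
stats = component_stats(6, edges)
size_histogram = Counter(n for n, _ in stats.values())  # component size -> count
gfd_by_size = {n: graph_fractal_dimension(n, m) for n, m in stats.values()}
```

Here `size_histogram` plays the role of the Size vs. Count plot on the slide, at toy scale.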

Subtask 3: Graph Partition for Distributed Graph Computing

Are typical techniques efficient for graph queries?

Graph partitioning and distribution techniques (e.g., Pregel)

Limitations:
– Unavailable to the public
– Unbalanced workload due to skewed, non-uniformly distributed graph queries
– Communication overhead due to inter-machine (cross-partition) communication

Sedge: distributed graph processing
– Model-based graph partitioning techniques
– First-of-its-kind distributed graph computing platform for information, social, and communication networks

Shengqi Yang, et al., Managing Large-Scale Graphs for Efficient Distributed Processing, submitted to VLDB 2011

Graph Partition Models

Dynamic Workload: Replicate Partitions
– Replicate partitions that are intensively accessed by many queries

Dynamic Workload: New Partitions
– Generate new partitions for areas that are intensively accessed by many cross-partition queries

Complementary Partitions
– Generate partition sets that are complementary to each other

Shengqi Yang, et al., Managing Large-Scale Graphs for Efficient Distributed Processing, submitted to VLDB 2011
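One way to read the partition models above is as a routing problem: a query that straddles a boundary in one partition set may fall entirely inside a partition of a complementary set, and partitions that receive many queries become candidates for replication. The sketch below illustrates that reading only; it is not Sedge's implementation, and `owning_partition`, `route`, `hot_partitions`, the replication threshold, and the eight-vertex example are all assumptions.

```python
from collections import Counter


def owning_partition(partition_of, vertices):
    """Partition id holding all vertices, or None if they span partitions."""
    parts = {partition_of[v] for v in vertices}
    return next(iter(parts)) if len(parts) == 1 else None


def route(partition_sets, vertices):
    """Return (partition set index, partition id) that serves the query locally."""
    for set_id, partition_of in enumerate(partition_sets):
        pid = owning_partition(partition_of, vertices)
        if pid is not None:
            return set_id, pid
    return None  # the query needs cross-partition communication


def hot_partitions(routed, threshold):
    """Partitions hit by at least `threshold` queries: replication candidates."""
    load = Counter(r for r in routed if r is not None)
    return sorted(p for p, hits in load.items() if hits >= threshold)


# Vertices 0..7 on a path. The primary set cuts between 3 and 4; the
# complementary set keeps 2..5 together, so that boundary becomes internal.
primary = {v: 0 if v < 4 else 1 for v in range(8)}
complementary = {v: 0 if 2 <= v <= 5 else 1 for v in range(8)}
queries = [{3, 4}, {0, 1}, {3, 4, 5}, {0, 5}, {2, 3, 4}]
routed = [route([primary, complementary], q) for q in queries]
replicate = hot_partitions(routed, threshold=3)
```

In this example three of the five queries land on the same complementary partition, so it is the one flagged for replication, while the query {0, 5} cannot be served locally by either set.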

Graph partitioning with region constraint

Optimal Solution:

Where:

NP-hard


Global Optimization

– Before each iteration, increase the weight of the edges in each region with respect to its priority
– Iteratively repartition the graph

Graph Partition for Distributed Graph Computing

10,000 random queries. The number of partitions is increased by adding more machines.

[Figure: # of Machines vs. Throughput Improvement Ratio]

Collaborations and Path Ahead

Collaborations within I2
– Monthly meeting
– Strong connection between I2.1 and I2.2: one problem, two sides. Information network processing on DTN and clusters
– (I2.1) Work with Arun Iyengar and Mudhakar Srivatsa (IBM), who have done much work on DTN and storage. Shengqi Yang will intern at IBM this summer.

Collaborations with researchers in other networks
– (S1.1) Work with Zhen Wen (IBM) on the social network application of graph density indexing. U Kang was a summer intern at IBM.
– (E1.1, R2.3) Work with Jie Bao (RPI) on RDF queries using neighborhood-based graph search.
– (T2.3) Work with Vikas Kawadia (BBN) on using graph query processing for distributed trust computing. Ziyu Guan is collaborating with Vikas.
– Graph search is connected to M. Goldberg’s work on trust structure (T2.4).
– Work with Sachi Desai (Army) on a graph query language/system.

Next Six Months and Path Ahead to 2012

Continue research on large-scale information network processing (more specific)
(1) Graph indexing on multiple time-varying graph snapshots
(2) Compression-based, model-based information network processing
(3) Edge layout on the Hadoop file system for better compression and better performance
(4) Complementary graph partitioning theories

Other research topics planned
– Models and methods for building complex graph queries
– Models and methods for routing complex graph queries to data sources (for both I2.1 and I2.2)
– Tensor analysis on Hadoop

Research Papers (Accepted/Published)

A. Khan, N. Li, Z. Guan, X. Yan, S. Chakraborty, and S. Tao, Neighborhood Based Fast Graph Search in Large Networks, Proc. 2011 Int. Conf. on Management of Data (SIGMOD'11), 2011.

Nicholas D Larusso and Ambuj K. Singh, "Synopses for Probabilistic Data over Large Domains", EDBT'11

C. C. Aggarwal, N. Li, On Dynamic Node-Classification in Content-based Networks, SIAM International Conference on Data Mining (SDM) 2011

U Kang, Mary McGlohon, Leman Akoglu, and Christos Faloutsos. Patterns on the Connected Components of Terabyte-Scale Graphs. IEEE International Conference on Data Mining (ICDM) 2010, Sydney, Australia.

U Kang, Brendan Meeder, Christos Faloutsos, Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation, PAKDD'11

U Kang, Duen Horng Chau, and Christos Faloutsos. Mining Large Graphs: Algorithms, Inference, and Discoveries. IEEE International Conference on Data Engineering (ICDE) 2011, Hannover, Germany.


https://agora.cs.illinois.edu/download/attachments/30425499/Ness.pdf?version=1&modificationDate=1300041240000

https://agora.cs.illinois.edu/download/attachments/30425499/larusso_edbt_2011.pdf?version=1&modificationDate=1300894257800

https://agora.cs.illinois.edu/download/attachments/30425499/collective_SDM.pdf?version=1&modificationDate=1300041213000

https://agora.cs.illinois.edu/download/attachments/30425499/patternsindisconnectedcomps.pdf?version=1&modificationDate=1300040153000

Research Papers

Shengqi Yang, Bo Zong, Arijit Khan, Ben Zhao, Xifeng Yan, Managing Large-Scale Graphs for Efficient Distributed Processing, submitted to VLDB 2011

Nan Li, Arijit Khan, Xifeng Yan, and Zhen Wen, Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal

Petko Bogdanov, Misael Mongiovi, Ambuj Singh, Mining Heavy-Edges Subnetworks in Time, to be submitted to VLDB Journal

C. C. Aggarwal, P. Zhao, J. Han. On Shortest-Path Indexing of Massive Disk Resident Graphs, Research Report, to be submitted to VLDB Journal

C. C. Aggarwal, A. Khan, X. Yan. A Probabilistic Index for Massive and Dynamic Graph Streams, Research Report, to be submitted to VLDB Journal

Big Picture

Stage 1: How to distribute graphs (we are here)
Stage 2: How to construct queries
Stage 3: How to execute/route queries

Make Information Networks Accessible to Soldiers and Commanders

Questions?
