Graph Analytics on Massive Collections of Small Graphs

Graph Analytics on Massive Collections of Small Graphs Dritan Bleco Yannis Kotidis Department of Informatics Athens University Of Economics and Business EDBT 2014 - Athens [email protected] [email protected]


Page 1: Graph Analytics on Massive Collections of Small Graphs

Graph Analytics on Massive Collections of Small Graphs

Dritan Bleco Yannis Kotidis

Department of Informatics, Athens University of Economics and Business

EDBT 2014 - Athens

[email protected] [email protected]

Page 2: Graph Analytics on Massive Collections of Small Graphs

Outline

• Motivation
• Graph Records & Queries
• Storage of Graph Records and Indexing using a Column Store
• Graph View Materialization
• Selection of Graph Views
• Extensions
• Experiments
• Conclusions

Page 3: Graph Analytics on Massive Collections of Small Graphs

Motivational Example

• Focus on small graphs that are generated continuously
  – Examples: data from CRM, WMS and SCM applications
• Difference between our targeted applications and other applications of graphs (e.g. social web, biology)
  – Not a single massive graph but a massive collection of smaller graphs
  – Nodes/edges are mapped to real-world entities
    • Thus, no need for isomorphism discovery

Page 4: Graph Analytics on Massive Collections of Small Graphs

Framework Overview

• Our framework puts together three different techniques
  – A column-oriented relational backend to permit a flat description of the graph records
    • Alleviates recursion and costly joins for path calculations (required in a straightforward relational implementation)
  – A very efficient indexing mechanism using bitmap columns
    • Analogous to bitmap indexes frequently used in DWs
    • This model is generic and can accommodate specialized graph indexes (for example the gIndex)
  – A framework that permits the creation and reuse of materialized graph views of different types
    • These views improve query times, especially for aggregation queries

Page 5: Graph Analytics on Massive Collections of Small Graphs

[Figure: example graph over nodes A–K, with groups labelled Production Lines, Hubs and Customer Locations, areas labelled Region1 and Region2, and edges marked as Own Route or Leased Route]

QUERIES

• Delivery Time for products shipped via [A, D, E, G, I] path
• Delivery Cost for products shipped using Leased Routes
• The longest delay for products shipped from Region 1 to Location I via Hubs of Region 2

Page 6: Graph Analytics on Massive Collections of Small Graphs

Primitive Query Types

• Graph Queries
  – Find records that contain a given query graph Gq
  – The result is the record id with the respective measures of each matching record
  – For example, return delivery times along all hops in [A, D, E, G, I]
• Aggregate Graph Queries
  – A Graph Query Gq with the addition of a user-defined aggregate function f
  – The result is the aggregation of the measures along all maximal paths (paths connecting source and terminal nodes in Gq)
  – E.g. total delivery time for all shipments via [A, D, E, G, I]

Page 7: Graph Analytics on Massive Collections of Small Graphs

Graph Queries

[Figure: three graph records over nodes A–G; each edge is annotated as edge id:measure]

Record 1: 1:3 2:4 3:2 4:1 5:2
Record 2: 2:1 3:2 4:2 5:3 6:4 7:1
Record 3: 4:5 5:4 6:3 7:1

Edge  Edge Id
AB    1
AC    2
CE    3
AD    4
DE    5
EF    6
FG    7

Find records that follow path [ACEF]

Result: r2, AC:1, CE:2, EF:4 (record id, related measures)

Page 8: Graph Analytics on Massive Collections of Small Graphs

Graph Aggregate Queries

[Figure: three graph records over nodes A–G; each edge is annotated as edge id:measure]

Record 1: 1:3 2:4 3:2 4:1 5:2
Record 2: 2:1 3:2 4:2 5:3 6:4 7:1
Record 3: 4:5 5:4 6:3 7:1

Edge ids: AB 1, AC 2, CE 3, AD 4, DE 5, EF 6, FG 7

Find records and the total (sum) cost for path [ADEF]

Result (record id, aggregated measures):
r2, ADEF:9
r3, ADEF:12
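The aggregate query on this slide can be reproduced with a small sketch. It is illustrative only and not the authors' implementation: the dict-based record format and the helper name `aggregate_path` are assumptions made here, and the sketch handles only a single path rather than general query graphs.

```python
# Graph records from the slide, written as {edge: measure}.
records = {
    1: {"AB": 3, "AC": 4, "CE": 2, "AD": 1, "DE": 2},
    2: {"AC": 1, "CE": 2, "AD": 2, "DE": 3, "EF": 4, "FG": 1},
    3: {"AD": 5, "DE": 4, "EF": 3, "FG": 1},
}

def aggregate_path(records, path, f=sum):
    """Apply f to the measures along `path` for every record containing all of its edges."""
    edges = [a + b for a, b in zip(path, path[1:])]  # "ADEF" -> AD, DE, EF
    return {
        rec_id: f(rec[e] for e in edges)
        for rec_id, rec in records.items()
        if all(e in rec for e in edges)
    }

result = aggregate_path(records, "ADEF")
```

Passing another aggregate, for example `f=max`, plays the role of the user-defined function f.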

Page 9: Graph Analytics on Massive Collections of Small Graphs

Storage Model

[Figure: three graph records over nodes A–G; each edge is annotated as edge id:measure]

Record 1: 1:3 2:4 3:2 4:1 5:2
Record 2: 2:1 3:2 4:2 5:3 6:4 7:1
Record 3: 4:5 5:4 6:3 7:1

Edge ids: AB 1, AC 2, CE 3, AD 4, DE 5, EF 6, FG 7

rec Id  m1    m2    m3    m4  m5  m6    m7
1       3     4     2     1   2   Null  Null
2       Null  1     2     2   3   4     1
3       Null  Null  Null  5   4   3     1
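The flat layout can be sketched in a few lines. This is only an illustration under assumptions made here: the names `EDGE_IDS` and `flatten` and the dict-based input format are hypothetical, while the edge ids and measures are the ones shown on the slide.

```python
# Edge-id table from the slide: every distinct edge owns one column position.
EDGE_IDS = {"AB": 1, "AC": 2, "CE": 3, "AD": 4, "DE": 5, "EF": 6, "FG": 7}

def flatten(record):
    """Map a graph record {edge: measure} to a flat row [m1..m7]; absent edges stay None (Null)."""
    row = [None] * len(EDGE_IDS)
    for edge, measure in record.items():
        row[EDGE_IDS[edge] - 1] = measure
    return row

records = {
    1: {"AB": 3, "AC": 4, "CE": 2, "AD": 1, "DE": 2},
    2: {"AC": 1, "CE": 2, "AD": 2, "DE": 3, "EF": 4, "FG": 1},
    3: {"AD": 5, "DE": 4, "EF": 3, "FG": 1},
}
table = {rec_id: flatten(rec) for rec_id, rec in records.items()}
```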

Page 10: Graph Analytics on Massive Collections of Small Graphs

Bitmap Columns – a simple index

[Figure: three graph records over nodes A–G; each edge is annotated as edge id:measure]

Record 1: 1:3 2:4 3:2 4:1 5:2
Record 2: 2:1 3:2 4:2 5:3 6:4 7:1
Record 3: 4:5 5:4 6:3 7:1

Edge ids: AB 1, AC 2, CE 3, AD 4, DE 5, EF 6, FG 7

rec Id  m1    m2    m3    m4  m5  m6    m7    b1  b2  b3  b4  b5  b6  b7
1       3     4     2     1   2   Null  Null  1   1   1   1   1   0   0
2       Null  1     2     2   3   4     1     0   1   1   1   1   1   1
3       Null  Null  Null  5   4   3     1     0   0   0   1   1   1   1
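In the table on this slide, each bitmap column bi is 1 exactly when the measure mi is present. A minimal sketch of that derivation follows; the function name `bitmap_columns` is an assumption introduced for illustration.

```python
# Measure rows [m1..m7] from the slide; None stands for Null.
rows = {
    1: [3, 4, 2, 1, 2, None, None],
    2: [None, 1, 2, 2, 3, 4, 1],
    3: [None, None, None, 5, 4, 3, 1],
}

def bitmap_columns(measures):
    """Return [b1..b7]: 1 where the record contains the edge, 0 where the measure is Null."""
    return [0 if m is None else 1 for m in measures]

bitmaps = {rec_id: bitmap_columns(m) for rec_id, m in rows.items()}
```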

Page 11: Graph Analytics on Massive Collections of Small Graphs

Queries using Bitmap Columns

[Figure: query graph over nodes A–G]

Edge ids: AB 1, AC 2, CE 3, AD 4, DE 5, EF 6, FG 7

[Table: the same relation with columns m1–m7 and b1–b7 as on Page 10]

Graph Query

Get the cost (delay) of each edge on the [ACEF] path
Select recid, m2, m3, m6 where b2=1 AND b3=1 AND b6=1

Graph Aggregate Query

Get the total cost (delay) of the [ACEF] path
Select recid, m2 + m3 + m6 where b2=1 AND b3=1 AND b6=1
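The two statements above omit the FROM clause. A runnable version is sketched below using Python's built-in `sqlite3` module. Two assumptions are made here: the table name `graph_records` is hypothetical, and SQLite is a row store standing in for the column store used in the talk, so the sketch shows the query logic only, not its performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cols = [f"m{i}" for i in range(1, 8)] + [f"b{i}" for i in range(1, 8)]
conn.execute(
    f"CREATE TABLE graph_records (recid INTEGER, {', '.join(c + ' INTEGER' for c in cols)})"
)
# Rows copied from the slide: recid, m1..m7, b1..b7 (None stands for Null).
rows = [
    (1, 3, 4, 2, 1, 2, None, None, 1, 1, 1, 1, 1, 0, 0),
    (2, None, 1, 2, 2, 3, 4, 1, 0, 1, 1, 1, 1, 1, 1),
    (3, None, None, None, 5, 4, 3, 1, 0, 0, 0, 1, 1, 1, 1),
]
conn.executemany(f"INSERT INTO graph_records VALUES ({', '.join('?' * 15)})", rows)

# Graph query: per-edge measures of records that contain path [ACEF].
graph_query = conn.execute(
    "SELECT recid, m2, m3, m6 FROM graph_records WHERE b2=1 AND b3=1 AND b6=1"
).fetchall()

# Aggregate graph query: total measure along [ACEF].
aggregate_query = conn.execute(
    "SELECT recid, m2 + m3 + m6 FROM graph_records WHERE b2=1 AND b3=1 AND b6=1"
).fetchall()
```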

Page 12: Graph Analytics on Massive Collections of Small Graphs

Graph View Materialization

• Materialized Graph Views
  – Used for Graph Queries / Aggregate Graph Queries
  – Implemented as bitmaps resulting from ANDing the edges of a subgraph derived (by our techniques) from a set of graph queries
  – These bitmaps are added as new columns in the database
• Materialized Aggregate Graph Views
  – Used for Graph Queries / Aggregate Graph Queries
  – A bitmap (as in a Graph View) and pre-computed aggregates
    • The bitmap is the corresponding materialized Graph View
    • Aggregates are derived from the measures stored in graph records

Page 13: Graph Analytics on Massive Collections of Small Graphs

Materialized Graph Views

[Figure: query graph over nodes A–G]

Edge ids: AB 1, AC 2, CE 3, AD 4, DE 5, EF 6, FG 7

rec Id  m1    m2    m3    m4  m5  m6    m7    b1  b2  b3  b4  b5  b6  b7  bq1
1       3     4     2     1   2   Null  Null  1   1   1   1   1   0   0   0
2       Null  1     2     2   3   4     1     0   1   1   1   1   1   1   1
3       Null  Null  Null  5   4   3     1     0   0   0   1   1   1   1   0

Query

Q1 = Get the cost (delay) of each edge on the [ACEF] path

Select recid, m2, m3, m6 where bq1=1 (replaces the condition b2=1 AND b3=1 AND b6=1)

Materialized View for Q1: bq1 = b2 AND b3 AND b6

Page 14: Graph Analytics on Massive Collections of Small Graphs

Materialized Aggregate Views

[Figure: query graph over nodes A–G]

Edge ids: AB 1, AC 2, CE 3, AD 4, DE 5, EF 6, FG 7

rec Id  m1    m2    m3    m4  m5  m6    m7    mq1   b1  b2  b3  b4  b5  b6  b7  bq1
1       3     4     2     1   2   Null  Null  Null  1   1   1   1   1   0   0   0
2       Null  1     2     2   3   4     1     7     0   1   1   1   1   1   1   1
3       Null  Null  Null  5   4   3     1     Null  0   0   0   1   1   1   1   0

Query

Q1 = Get the total cost of the [ACEF] path

Select recid, mq1 where bq1=1 (replaces: Select recid, m2 + m3 + m6 where b2=1 AND b3=1 AND b6=1)

Path Aggregated Q1: bq1 = b2 AND b3 AND b6
mq1 = m2 + m3 + m6
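The two view columns for Q1 can be derived from the base columns as sketched below. This is an illustration, not the authors' code: the row layout and the helper name `materialize` are assumptions, and `sum` stands in for whatever aggregate the view pre-computes.

```python
# Base rows from the slide: measures [m1..m7] and bitmaps [b1..b7].
rows = [
    {"recid": 1, "m": [3, 4, 2, 1, 2, None, None], "b": [1, 1, 1, 1, 1, 0, 0]},
    {"recid": 2, "m": [None, 1, 2, 2, 3, 4, 1], "b": [0, 1, 1, 1, 1, 1, 1]},
    {"recid": 3, "m": [None, None, None, 5, 4, 3, 1], "b": [0, 0, 0, 1, 1, 1, 1]},
]

def materialize(row, edge_ids):
    """Return (bitmap, aggregate) of the view over `edge_ids` for one record."""
    bit = int(all(row["b"][e - 1] for e in edge_ids))  # AND of the edge bitmaps
    agg = sum(row["m"][e - 1] for e in edge_ids) if bit else None
    return bit, agg

# View for Q1 = path [ACEF] = edges 2, 3, 6.
for row in rows:
    row["bq1"], row["mq1"] = materialize(row, (2, 3, 6))

q1 = [(r["recid"], r["mq1"]) for r in rows if r["bq1"] == 1]
```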

Page 15: Graph Analytics on Massive Collections of Small Graphs

[Figure: query graph over nodes A–G]

Edge ids: AB 1, AC 2, CE 3, AD 4, DE 5, EF 6, FG 7

[Table: the same relation with columns m1–m7, mq1, b1–b7 and bq1 as on Page 14]

Another query can use the materialization of Q1

Q2 = Get the total cost (delay) of the [ACEFG] path

Select recid, mq1 + m7 where bq1=1 AND b7=1 (replaces: Select recid, m2 + m3 + m6 + m7 where b2=1 AND b3=1 AND b6=1 AND b7=1)

Aggregated Q1: bq1 = b2 AND b3 AND b6
mq1 = m2 + m3 + m6
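The rewrite of Q2 can be checked against the direct formulation with a short `sqlite3` sketch. The table name `graph_records` is a hypothetical stand-in, only the columns needed by Q2 are loaded, and SQLite is used purely to make the statements runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE graph_records (recid, m2, m3, m6, m7, b2, b3, b6, b7)")
conn.executemany(
    "INSERT INTO graph_records VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 4, 2, None, None, 1, 1, 0, 0),
        (2, 1, 2, 4, 1, 1, 1, 1, 1),
        (3, None, None, 3, 1, 0, 0, 1, 1),
    ],
)

# Materialize the aggregate view for Q1 as two extra columns.
# A Null measure makes the sum Null, matching the Null entries of mq1 on the slide.
conn.execute("ALTER TABLE graph_records ADD COLUMN bq1")
conn.execute("ALTER TABLE graph_records ADD COLUMN mq1")
conn.execute("UPDATE graph_records SET bq1 = b2 & b3 & b6, mq1 = m2 + m3 + m6")

# Q2 answered from base columns (4 bitmaps) and from the view (2 bitmaps).
direct = conn.execute(
    "SELECT recid, m2 + m3 + m6 + m7 FROM graph_records "
    "WHERE b2=1 AND b3=1 AND b6=1 AND b7=1"
).fetchall()
rewritten = conn.execute(
    "SELECT recid, mq1 + m7 FROM graph_records WHERE bq1=1 AND b7=1"
).fetchall()
```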

Page 16: Graph Analytics on Massive Collections of Small Graphs

Re-use of materialized graph views

• See our past work "Business Intelligence on Complex Graph Data", BEWEB, Berlin, Germany, March 2012
  – How to formulate complex graph expressions using a set of intuitive operators we define
• How to best answer a user query using materialized (Aggregate or not) Graph Views?
  – A simple cost model based on the number of bitmaps required for answering a query
  – Mapped to a set cover problem
  – Solved via a greedy algorithm
  – Details are in the paper
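A generic greedy set cover heuristic for this step might look like the sketch below. It is not taken from the paper: the function name `greedy_cover`, the view name `bq9` and the representation of views as sets of edge ids are assumptions made for illustration. The cost being reduced is the number of bitmaps that have to be ANDed.

```python
def greedy_cover(query_edges, views):
    """Greedily pick bitmaps (views or single edges) whose AND answers the query."""
    uncovered = set(query_edges)
    # A view is usable only if all of its edges belong to the query.
    usable = {name: set(edges) for name, edges in views.items() if set(edges) <= uncovered}
    # Single-edge bitmaps are always available, so a cover always exists.
    usable.update({f"b{e}": {e} for e in query_edges})
    chosen = []
    while uncovered:
        # Take the bitmap that covers the most still-uncovered edges.
        best = max(usable, key=lambda name: len(usable[name] & uncovered))
        chosen.append(best)
        uncovered -= usable[best]
    return chosen

# Q2 = [ACEFG] = edges 2, 3, 6, 7; bq1 covers edges 2, 3, 6; bq9 is not contained in Q2.
plan = greedy_cover({2, 3, 6, 7}, {"bq1": {2, 3, 6}, "bq9": {4, 5}})
```

Here the plan uses two bitmaps instead of the four edge bitmaps of the direct formulation.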

Page 17: Graph Analytics on Massive Collections of Small Graphs

What to materialize?

• Aggressive materialization: materialize whole queries
  – Often not possible due to space limitations
• Our approach: Query Driven Graph View Selection
• First need to derive a set of candidate views
  – Naïve approach: consider all subsets of the edges in the union of all Query Graphs
    • Exponential number of candidates (thus not feasible)
    • Many redundant Views
  – Intuition: prune candidates based on a monotonicity property

Page 18: Graph Analytics on Massive Collections of Small Graphs

Candidate Generation

[Figure: graph over nodes A, B, C, D, E, F, G, H, J]

Frequent Query Set: {[ACEFGHJ], [ADEFGHJ]}

Monotonicity Property: Graph View Gv′ supersedes Graph View Gv iff Gv ⊂ Gv′ and ∀ Gq: Gv ⊆ Gq ⇒ Gv′ ⊆ Gq

Based on this property we only consider the following candidates:
1. Each query graph: {[ACEFGHJ], [ADEFGHJ]}
2. All the subgraphs that are the intersection of 2 query graphs: {[EFGHJ]}
3. All the subgraphs that are the intersection of 2 graphs of the previous step, until no more new views are created

View selection from the candidate set is mapped to a set cover problem
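The three candidate-generation steps amount to closing the query set under pairwise intersection. A minimal sketch follows, assuming query graphs are represented as sets of edges; the helper names `path_edges` and `candidates` are hypothetical and the paper's actual procedure may differ in detail.

```python
from itertools import combinations

def path_edges(path):
    """Turn a path such as "ACEF" into its edge set {AC, CE, EF}."""
    return frozenset(a + b for a, b in zip(path, path[1:]))

def candidates(queries):
    """Query graphs plus all non-empty subgraphs obtained by repeated pairwise intersection."""
    views = set(queries)
    while True:
        new = {a & b for a, b in combinations(views, 2)} - views - {frozenset()}
        if not new:  # fixpoint: no more new views are created
            return views
        views |= new

frequent = {path_edges("ACEFGHJ"), path_edges("ADEFGHJ")}
views = candidates(frequent)
```

For the frequent query set on this slide the only new candidate is [EFGHJ], so three views are produced.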

Page 19: Graph Analytics on Massive Collections of Small Graphs

Extensions

All data can be stored in a single relation

rec Id  m1    m2    m3    m4  m5  m6    m7    b1  b2  b3  b4  b5  b6  b7
1       3     4     2     1   2   Null  Null  1   1   1   1   1   0   0
2       Null  1     2     2   3   4     1     0   1   1   1   1   1   1
3       Null  Null  Null  5   4   3     1     0   0   0   1   1   1   1

But the data can also be partitioned into more than one relation

rec Id  m1    m2    m3    b1  b2  b3
1       3     4     2     1   1   1
2       Null  1     2     0   1   1
3       Null  Null  Null  0   0   0

rec Id  m4  m5  m6    m7    b4  b5  b6  b7
1       1   2   Null  Null  1   1   0   0
2       2   3   4     1     1   1   1   1
3       5   4   3     1     1   1   1   1

Can easily incorporate specialized graph indexes (for example the gIndex)

Page 20: Graph Analytics on Massive Collections of Small Graphs

Experiments

• Graph records from two datasets
  1. NY*: depicts New York roads, and
  2. Gnutella**: describes connections among Gnutella hosts from August 2002
• Experimental evaluation among 4 systems
  – Commercial Row Store Relational DB
  – Column Store Relational DB
  – Neo4j
  – Commercial Native RDF DB

* http://www.dis.uniroma1.it/~challenge9/download.shtml
** http://snap.stanford.edu/data/p2p-Gnutella05.html

Page 21: Graph Analytics on Massive Collections of Small Graphs

Comparison to alternative Systems (no views)

• Our system provides almost constant query times with increasing graph query size, as fewer records are retrieved (even though more bitmaps are being used)

• The column store is not affected by increasing density (% of edges in a record)

Page 22: Graph Analytics on Massive Collections of Small Graphs

Benefit of Using Graph Views

• Graph views provide savings of up to 32% in query times
  – There is a mandatory cost for fetching the records that is not affected by materialization
• Thus, more savings are seen in aggregate queries
  – Using 100 aggregate graph views reduces the execution time by 89%
• Larger gains when queries exhibit skew (graphs in the paper)

[Figures: Runtime for 100 uniform Graph Queries; Runtime for 100 uniform Aggregate Graph Queries]

Page 23: Graph Analytics on Massive Collections of Small Graphs

Using Additional Indexes

• gIndex (record driven): trained the index using records that are part of the query result set
  – It took about 24 hours to process about 100,000 records
• Graph views (query driven) result in up to 6 times faster query processing times
  – It ran in less than one second

[Figures: gIndex in 100 uniform Graph Queries; gIndex in 100 uniform Aggregate Graph Queries]

Page 24: Graph Analytics on Massive Collections of Small Graphs

Conclusions

• Presented a framework where both data and queries are modeled as abstract graph structures
  – Abstracted two primitive query graphs
  – Introduced two types of Graph Views for expediting queries
  – Discussed an efficient mechanism for selecting a set of non-redundant views
  – Answering queries using Graph Views by solving an instance of a set cover problem
• Argued for a simple yet effective representation of graph records using a flat relational model implemented in a column store
  – Introduced bitmap indexes for efficient query processing
  – Graph Views are stored within the same relational schema
• Presented experimental results using datasets consisting of hundreds of millions of graph records
  – Experimental results show that our platform is orders of magnitude faster than
    • A straightforward relational implementation
    • Alternative systems that natively handle graph data

Page 25: Graph Analytics on Massive Collections of Small Graphs

Thank you,

Dritan Bleco

Questions?