SIAM Annual Meeting 2012: Streaming Graph Analytics for Massive Graphs
Streaming Graph Analytics for Massive Graphs
Jason Riedy, David A. Bader, David Ediger
Georgia Institute of Technology
10 July 2012
Outline
Motivation
• Where graphs appear... (hint: everywhere)
• Data volumes and rates of change
• Why analyze data streams?

Technical
• Overall streaming approach
• Data structure benchmark
• Clustering coefficients
• Connected components
• Common aspects and questions
Session outline
SIAM AN 2012 - Streaming Graph Analytics for Massive Graphs - Jason Riedy
Exascale Data Analysis
• Health care: finding outbreaks, population epidemiology
• Social networks: advertising, searching, grouping
• Intelligence: decisions at scale, regulating algorithms
• Systems biology: understanding interactions, drug design
• Power grid: disruptions, conservation
• Simulation: discrete events, cracking meshes

The data is full of semantically rich relationships. Graphs! Graphs! Graphs!
Graphs are pervasive
• Sources of massive data: petascale simulations, experimental devices, the Internet, scientific applications.
• New challenges for analysis: data sizes, heterogeneity, uncertainty, data quality.

• Astrophysics. Problem: outlier detection. Challenges: massive datasets, temporal variation. Graph problems: matching, clustering.
• Bioinformatics. Problem: identifying target proteins. Challenges: data heterogeneity, quality. Graph problems: centrality, clustering.
• Social informatics. Problem: emergent behavior, information spread. Challenges: new analysis, data uncertainty. Graph problems: clustering, flows, shortest paths.
These are not easy graphs.
Yifan Hu's (AT&T) visualization of the LiveJournal data set
But no shortage of structure...
Protein interactions: Giot et al., "A Protein Interaction Map of Drosophila melanogaster", Science 302, 1722-1736, 2003.
Jason's network via LinkedIn Labs

• Globally, there rarely are good, balanced separators in the scientific computing sense.
• Locally, there are clusters or communities and many levels of detail.
Also no shortage of data...
Existing (some out-of-date) data volumes
NYSE: 1.5 TB generated daily into a maintained 8 PB archive
Google: "several dozen" 1 PB data sets (CACM, Jan. 2010)
LHC: 15 PB per year (avg. 21 TB daily), http://public.web.cern.ch/public/en/lhc/Computing-en.html
Wal-Mart: 536 TB, 1B entries daily (2006)
eBay: 2 PB traditional DB and 6.5 PB streaming; 17 trillion records, 1.5B records/day; each web click is 50-150 details. http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/
Facebook: 845M users... and growing.

• All data is rich and semantic (graphs!) and changing.
• Base data rates include items and not relationships.
General approaches
• High-performance static graph analysis
  • Develop techniques that apply to unchanging massive graphs.
  • Provides useful after-the-fact information and starting points.
  • Serves many existing applications well: market research, much bioinformatics, ...
• High-performance streaming graph analysis
  • Focus on the dynamic changes within massive graphs.
  • Find trends or new information as they appear.
  • Serves upcoming applications: fault or threat detection, trend analysis, ...

Both are very important to different areas. The remaining focus is on streaming.
Note: not streaming in the CS-theory sense, but analysis of streaming data.
Why analyze data streams?
Data volumes
NYSE: 1.5 TB daily
LHC: 41 TB daily
Facebook: who knows?

Data transfer
• 1 Gb Ethernet: 8.7 TB daily at 100%; 5-6 TB daily realistic
• Multi-TB storage on 10GE: 300 TB daily read, 90 TB daily write
• CPU to memory (QPI, HT): 2 PB/day at 100%

Data growth
• Facebook: > 2×/yr
• Twitter: > 10×/yr
• Growing sources: bioinformatics, µsensors, security

Speed growth
• Ethernet/IB/etc.: 4× in the next 2 years. Maybe.
• Flash storage, direct: 10× write, 4× read. Relatively huge cost.
Overall streaming approach
Assumptions
• A graph represents some real-world phenomenon.
  • But not necessarily exactly!
  • Noise comes from lost updates, partial information, ...
Overall streaming approach
Assumptions
• We target massive, "social network" graphs.
  • Small diameter, power-law degrees
• Small changes in massive graphs often are unrelated.
Overall streaming approach
Assumptions
• The graph changes, but we don't need a continuous view.
  • We can accumulate changes into batches...
  • But not so many that it impedes responsiveness.
Difficulties for performance
• What partitioning methods apply?
  • Geometric? Nope.
  • Balanced? Nope.
  • Is there a single, useful decomposition? Not likely.
• Some partitions exist, but they don't often help with balanced bisection or memory locality.
• Performance needs new approaches, not just standard scientific computing methods.
STING's focus
[Diagram: source data flows into STING; analysis kernels produce summaries that feed visualization, simulation/query, prediction, and control actions.]
• STING manages queries against changing graph data.
• Visualization and control often are application-specific.
• Ideal: maintain many persistent graph analysis kernels.
  • Keep one current snapshot of the graph resident.
  • Let kernels maintain smaller histories.
  • Also (a harder goal), coordinate the kernels' cooperation.
• Gather data into a typed graph structure, STINGER.
STINGER
STING Extensible Representation:
• Rule #1: no explicit locking.
  • Rely on atomic operations.
• Massive graph: scattered updates and scattered reads rarely conflict.
• Use time stamps for some view of time.
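As a sketch of the idea only (not STINGER's actual C API; all names here are hypothetical), a single-threaded Python toy of a timestamped adjacency structure:

```python
from dataclasses import dataclass, field

@dataclass
class TimestampedEdge:
    dest: int
    first_ts: int   # time stamp when the edge first appeared
    recent_ts: int  # most recent time the edge was touched

@dataclass
class ToyStinger:
    # adjacency: vertex -> {neighbor: TimestampedEdge}
    adj: dict = field(default_factory=dict)

    def insert_edge(self, u, v, ts):
        # undirected: store both directions
        for a, b in ((u, v), (v, u)):
            edges = self.adj.setdefault(a, {})
            e = edges.get(b)
            if e is None:
                edges[b] = TimestampedEdge(b, ts, ts)
            else:
                e.recent_ts = ts  # re-insertion only refreshes the time stamp

    def delete_edge(self, u, v):
        for a, b in ((u, v), (v, u)):
            self.adj.get(a, {}).pop(b, None)

    def neighbors(self, u):
        return self.adj.get(u, {}).keys()
```

The real STINGER stores neighbors in fixed-size edge blocks and relies on atomic fetch-and-add rather than Python dictionaries; this toy only illustrates the timestamped, lock-free-friendly update pattern.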
Initial results
Prototype STING and STINGER
Background benchmark for STINGER maintenance, plus monitoring the following properties:
1. clustering coefficients, and
2. connected components.

High-level

• Support high rates of change, over 10k updates per second.
• Performance scales somewhat with available processing.
• Gut feeling: scales as much with sockets as with cores.
http://www.cc.gatech.edu/stinger/
Experimental setup
Unless otherwise noted
Line      Model    Speed (GHz)  Sockets  Cores
Nehalem   X5570    2.93         2        4
Westmere  E7-8870  2.40         4        10

• Westmere loaned by Intel (thank you!)
• All memory: 1067 MHz DDR3, installed appropriately
• Implementations: OpenMP, gcc 4.6.1, Linux ~3.0 kernel
• Artificial graph and edge stream generated by R-MAT [Chakrabarti, Zhan, & Faloutsos].
  • Scale x, edge factor f: 2^x vertices, about f · 2^x edges.
  • Edge actions: 7/8 insertions, 1/8 deletions.
  • Results over five batches of edge actions.
• Caveat: no vector instructions or low-level optimizations yet.
• Portable: OpenMP, Cray XMT family
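The R-MAT stream can be sketched as follows. This is a minimal generator, not the code used for the experiments; the quadrant probabilities (0.57, 0.19, 0.19) are commonly used defaults, not values taken from the slides.

```python
import random

def rmat_edges(scale, edge_factor, a=0.57, b=0.19, c=0.19, seed=0):
    """Generate edge_factor * 2^scale directed edges over 2^scale vertices
    by recursively picking an adjacency-matrix quadrant with
    probabilities (a, b, c, 1-a-b-c)."""
    rng = random.Random(seed)
    n = 1 << scale
    edges = []
    for _ in range(edge_factor * n):
        u = v = 0
        half = n >> 1
        while half >= 1:
            r = rng.random()
            if r < a:
                pass                       # top-left quadrant
            elif r < a + b:
                v += half                  # top-right
            elif r < a + b + c:
                u += half                  # bottom-left
            else:
                u += half; v += half       # bottom-right
            half >>= 1
        edges.append((u, v))
    return edges
```

At scale 22 and edge factor 16 this yields the roughly 4M-vertex, 67M-edge graphs used in the experiments.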
STINGER insertion / removal rates
Edges per second
On the E7-8870: 80 hardware threads across four sockets.

[Figure: aggregate STINGER updates per second (10^4 to 10^7, log scale) vs. number of threads (20 to 80); one series per batch size from 100 to 3,000,000 edge actions.]
STINGER insertion / removal rates
Seconds per batch, "latency"
On the E7-8870: 80 hardware threads across four sockets.

[Figure: seconds per batch processed (10^-3 to 10^0, log scale) vs. number of threads (20 to 80); one series per batch size from 100 to 3,000,000 edge actions.]
Clustering coefficients
• Used to measure "small-world-ness" [Watts & Strogatz] and potential community structure.
• A larger clustering coefficient means more inter-connected.
• Roughly the ratio of the number of actual to potential triangles.

[Figure: a vertex v with neighbors i, j, m, n illustrating open and closed triplets.]

• Defined in terms of triplets.
• i - v - j is a closed triplet (triangle).
• m - v - n is an open triplet.
• Clustering coefficient: (# of closed triplets) / (total # of triplets).
• Computed locally around v or globally for the entire graph.
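The global definition above can be checked with a short sketch (a hypothetical helper, with the adjacency given as vertex -> neighbor set):

```python
def global_clustering(adj):
    """adj: dict vertex -> set of neighbors (undirected, no self loops).
    Returns (# closed triplets) / (total # triplets), counting triplets
    centered at each vertex."""
    closed = 0
    total = 0
    for v, nbrs in adj.items():
        d = len(nbrs)
        total += d * (d - 1) // 2          # triplets centered at v
        nl = list(nbrs)
        for i in range(d):
            for j in range(i + 1, d):
                if nl[j] in adj[nl[i]]:    # the two neighbors are adjacent
                    closed += 1            # closed triplet centered at v
    return closed / total if total else 0.0
```

A triangle gives coefficient 1.0 (all three triplets closed); a three-vertex path gives 0.0 (its single triplet is open).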
Updating triangle counts
Given: edge {u, v} to be inserted (+) or deleted (-).
Approach: search for vertices adjacent to both u and v; update counts on those and on u and v.

Three methods

Brute force: intersect the neighbors of u and v by iterating over each, O(d_u d_v) time.
Sorted list: sort u's neighbors. For each neighbor of v, check whether it is in the sorted list.
Compressed bits: summarize u's neighbors in a bit array, reducing the check for v's neighbors to O(1) time each. Approximate with Bloom filters. [MTAAP10]
All rely on atomic addition.
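A minimal single-threaded sketch of the sorted-list method (hypothetical names; the parallel code performs the count updates with atomic additions):

```python
import bisect

def update_triangles(adj_sorted, tri_count, u, v, insert=True):
    """adj_sorted: vertex -> sorted list of neighbors.
    tri_count: vertex -> local triangle count (updated in place).
    On inserting/deleting edge {u, v}, every common neighbor w of u and v
    gains/loses one triangle, as do u and v themselves."""
    delta = 1 if insert else -1
    common = 0
    vs = adj_sorted.get(v, [])
    for w in adj_sorted.get(u, []):        # scan u's neighbors
        i = bisect.bisect_left(vs, w)      # binary search in v's sorted list
        if i < len(vs) and vs[i] == w:     # w is adjacent to both u and v
            tri_count[w] = tri_count.get(w, 0) + delta
            common += 1
    tri_count[u] = tri_count.get(u, 0) + delta * common
    tri_count[v] = tri_count.get(v, 0) + delta * common
    # finally apply the edge change to the adjacency itself
    for a, b in ((u, v), (v, u)):
        lst = adj_sorted.setdefault(a, [])
        if insert:
            bisect.insort(lst, b)
        else:
            lst.remove(b)
```

Iterating over the smaller of the two neighbor lists (not shown) further reduces the work in practice.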
Batches of 10k actions
Graph size: scale 22, edge factor 16.

[Figure: updates per second, both metric and STINGER, vs. number of threads on 4 x E7-8870 and 2 x X5570, for brute force, Bloom filter, and sorted list. Annotated peak rates: brute force up to 3.4e4, Bloom filter up to 1.3e5, sorted list up to 1.4e5 updates per second.]
Different batch sizes
Graph size: scale 22, edge factor 16.

[Figure: updates per second, both metric and STINGER, vs. number of threads for batch sizes 100, 1000, and 10000, comparing brute force, Bloom filter, and sorted list on 4 x E7-8870 and 2 x X5570.]
Different batch sizes: Latency
Graph size: scale 22, edge factor 16.

[Figure: seconds between updates, both metric and STINGER, vs. number of threads for batch sizes 100, 1000, and 10000, comparing brute force, Bloom filter, and sorted list on both machines.]
Connected components
• Maintain a mapping from vertex to component.
• A global property, unlike triangle counts.
• In "scale-free" social networks:
  • often one big component, and
  • many tiny ones.
• Edge changes often sit within components.
• Remaining insertions merge components.
• Deletions are more difficult...
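A toy sketch of the insertion-only update (hypothetical names; the parallel implementation differs): an edge inside a component costs nothing, and a cross-component edge merges the two by relabeling the smaller side.

```python
from collections import defaultdict

def apply_insertions(component, edges):
    """component: vertex -> component id (updated in place).
    For each inserted edge whose endpoints lie in different components,
    merge by relabeling the smaller component. Returns merges performed."""
    members = defaultdict(set)
    for v, c in component.items():
        members[c].add(v)
    merges = 0
    for u, v in edges:
        cu, cv = component[u], component[v]
        if cu == cv:
            continue                    # intra-component edge: no work
        if len(members[cu]) < len(members[cv]):
            cu, cv = cv, cu             # keep the larger label cu
        for w in members[cv]:
            component[w] = cu           # relabel the smaller component
        members[cu] |= members.pop(cv)
        merges += 1
    return merges
```

In a social network with one giant component, most inserted edges hit the `continue` branch, which is why insertions are cheap on average.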
Connected components: Deleted edges
The difficult case
• Very few deletions matter.
• Determining which matter may require a large graph search.
  • Re-running static component detection.
  • (Long history; see related work in [MTAAP11].)
• Coping mechanisms:
  • heuristics, and
  • a second level of batching.
Deletion heuristics
Rule out effect-less deletions
• Use the spanning-tree by-product of static connected component algorithms.
• Ignore a deletion when any of the following holds:
  1. The deleted edge is not in the spanning tree.
  2. The endpoints share a common neighbor†.
  3. The loose endpoint can reach the root†.
• In the last two (†), also fix the spanning tree.

Rules out 99.7% of deletions.
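The three rules can be sketched as follows (a hypothetical helper; fixing the spanning tree after rules 2 and 3, and the real parallel search, are omitted). Rules 1 and 2 are cheap; rule 3 degenerates to the expensive graph search in the worst case.

```python
def deletion_matters(adj, parent, u, v):
    """Return True only if deleting edge {u, v} might split a component.
    adj: vertex -> set of neighbors, with {u, v} already removed.
    parent: spanning-tree parent map; the root maps to itself."""
    if parent.get(u) != v and parent.get(v) != u:
        return False                        # rule 1: not a tree edge
    if adj[u] & adj[v]:
        return False                        # rule 2: common neighbor survives
    loose = u if parent.get(u) == v else v  # child side lost its tree link
    root = next(x for x in parent if parent[x] == x)
    # rule 3: search from the loose endpoint; reaching the root means
    # the component is still connected
    seen, stack = {loose}, [loose]
    while stack:
        x = stack.pop()
        if x == root:
            return False
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return True                             # may split: run full detection
```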
Connected components: Performance
On the E7-8870.

[Figure: connected-component updates per second (10^3 to 10^5, log scale) vs. number of threads (20 to 80); one series per batch size from 100 to 3,000,000 edge actions.]
Connected components: Latency
On the E7-8870.

[Figure: seconds per batch processed (roughly 10^-1.5 to 10^1, log scale) vs. number of threads (20 to 80); one series per batch size from 100 to 3,000,000 edge actions.]
Common aspects
• Each parallelizes sufficiently well over the affected vertices V', those touched by new or removed edges.
• The total amount of work is O(Vol(V')) = O(Σ_{v ∈ V'} deg(v)).
• Our in-progress work on refining or re-agglomerating communities with updates also is O(Vol(V')).
• How many interesting graph properties can be updated with O(Vol(V')) work?
• Do these parallelize well?
• The hidden constant, and how quickly performance becomes asymptotic, determine the metric update rate. What implementation techniques bash down the constant?
• How sensitive are these metrics to noise and error?
• How quickly can we "forget" data and still maintain metrics?
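For concreteness, a tiny sketch (hypothetical helper) of the Vol(V') quantity for one batch of edge actions:

```python
def affected_volume(adj, batch):
    """Vol(V') = sum of degrees over vertices touched by the batch.
    adj: vertex -> set of neighbors; batch: iterable of (u, v) edge actions."""
    affected = set()
    for u, v in batch:
        affected.add(u)
        affected.add(v)
    return sum(len(adj.get(v, ())) for v in affected)
```

In a scale-free graph a small batch usually touches a tiny fraction of the volume, which is what makes O(Vol(V')) updates so much cheaper than static recomputation.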
Session outline
• Fast Counting of Patterns in Graphs
  Ali Pinar, C. Seshadhri, and Tamara G. Kolda, Sandia National Laboratories, USA
• Statistical Models and Methods for Anomaly Detection in Large Graphs
  Nicholas Arcolano, Massachusetts Institute of Technology, USA
• Perfect Power Law Graphs: Generation, Sampling, Construction and Fitting
  Jeremy Kepner, Massachusetts Institute of Technology, USA
Acknowledgment of support