Graph Analysis Beyond Linear Algebra
Transcript of Graph Analysis Beyond Linear Algebra
![Page 1: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/1.jpg)
graph analysis beyond linear algebra
E. Jason Riedy
DMML, 24 October 2015
HPC Lab, School of Computational Science and Engineering
Georgia Institute of Technology
![Page 2: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/2.jpg)
motivation and applications
![Page 3: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/3.jpg)
(insert prefix here)-scale data analysis
Cyber-security: Identify anomalies, malicious actors
Health care: Finding outbreaks, population epidemiology
Social networks: Advertising, searching, grouping
Intelligence: Decisions at scale, regulating algorithms
Systems biology: Understanding interactions, drug design
Power grid: Disruptions, conservation
Simulation: Discrete events, cracking meshes
• Graphs are a motif / theme in data analysis.
• Changing and dynamic graphs are important!
![Page 4: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/4.jpg)
outline
1. Motivation and background
2. Linear algebra leads to a better graph algorithm: incremental PageRank
3. Sparse linear algebra techniques lead to a scoop: community detection
4. And something else: connected components
![Page 5: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/5.jpg)
why graphs?
Another tool, like dense and sparse linear algebra.
• Combine things with pairwise relationships
• Smaller, more generic than raw data.
• Taught (roughly) to all CS students...
• Semantic attributions can capture essential relationships.
• Traversals can be faster than filtering DB joins.
• Provide clear phrasing for queries about relationships.
![Page 6: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/6.jpg)
potential applications
• Social Networks
  • Identify communities, influences, bridges, trends, anomalies (trends before they happen)...
  • Potential to help social sciences, city planning, and others with large-scale data.
• Cybersecurity
  • Determine if new connections can access a device or represent new threat in < 5ms...
  • Is the transfer by a virus / persistent threat?
• Bioinformatics, health
  • Construct gene sequences, analyze protein interactions, map brain interactions
• Credit fraud forensics ⇒ detection ⇒ monitoring
  • Integrate all the customer’s data, identify in real-time
![Page 7: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/7.jpg)
streaming graph data
Network data rates:
• Gigabit ethernet: 81k – 1.5M packets per second
• Over 130 000 flows per second on 10 GigE (< 7.7 µs per flow)
Person-level data rates:
• 500M posts per day on Twitter (6k / sec) [1]
• 3M posts per minute on Facebook (50k / sec) [2]
We need to analyze only the changes and not the entire graph.
Throughput & latency trade off and expose different levels of concurrency.
[1] www.internetlivestats.com/twitter-statistics/
[2] www.jeffbullas.com/2015/04/17/21-awesome-facebook-facts-and-statistics-you-need-to-check-out/
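The per-second figures follow from simple unit conversion. A short check, using only the rates quoted on the slide (the variable names are illustrative):

```python
# Convert the quoted rates to per-second figures and per-event time budgets.
SECONDS_PER_DAY = 24 * 60 * 60

twitter_posts_per_sec = 500e6 / SECONDS_PER_DAY   # 500M posts per day
facebook_posts_per_sec = 3e6 / 60                 # 3M posts per minute
flow_budget_us = 1e6 / 130_000                    # microseconds available per flow

print(f"Twitter:  {twitter_posts_per_sec:,.0f} posts/s")
print(f"Facebook: {facebook_posts_per_sec:,.0f} posts/s")
print(f"10 GigE:  {flow_budget_us:.2f} us per flow")
```

The Twitter figure lands a little under 6k posts per second, which the slide rounds up.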
![Page 8: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/8.jpg)
incremental pagerank
![Page 9: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/9.jpg)
pagerank
Everyone’s “favorite” metric: PageRank.
• Stationary distribution of the random surfer model.
• Eigenvalue problem can be re-phrased as a linear system
$(I - \alpha A^T D^{-1})\, x = k v$, with
  $\alpha$: teleportation constant, much < 1
  $A$: adjacency matrix
  $D$: diagonal matrix of out degrees, with $x/0 = x$ (self-loop)
  $v$: personalization vector, here $1/|V|$
  $k$: irrelevant scaling constant
• Amenable to analysis, etc.
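A minimal sketch of solving this linear system by fixed-point iteration, under several assumptions that are not from the slides: a small dense NumPy matrix, `alpha = 0.85` as an illustrative value, `k = 1 - alpha` so that the solution sums to one, and reading "x/0 = x (self-loop)" as giving each vertex without out-edges a self-loop. The function name `pagerank_jacobi` is hypothetical. The iteration coincides with Jacobi when the graph has no self-loops.

```python
import numpy as np

def pagerank_jacobi(A, alpha=0.85, tol=1e-12, max_iter=1000):
    """Iterate x <- alpha * A^T D^-1 x + (1 - alpha) * v until x stops changing."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    # Assumed reading of "x/0 = x": a vertex with no out-edges gets a self-loop.
    idx = np.flatnonzero(A.sum(axis=1) == 0)
    A[idx, idx] = 1.0
    M = A.T / A.sum(axis=1)          # M = A^T D^-1; every column sums to 1
    v = np.full(n, 1.0 / n)          # uniform personalization vector
    b = (1.0 - alpha) * v            # right-hand side k v with k = 1 - alpha
    x = v.copy()
    for _ in range(max_iter):
        x_next = alpha * (M @ x) + b
        if np.abs(x_next - x).sum() < tol:
            return x_next
        x = x_next
    return x

# Hypothetical 3-vertex example: A[i, j] = 1 means an edge from i to j.
A = np.array([[0, 1, 1], [1, 0, 0], [0, 1, 0]], dtype=float)
x = pagerank_jacobi(A)
print(x, x.sum())
```

Because the 1-norm of `alpha * M` is `alpha < 1`, the iteration contracts and converges from any starting vector.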
![Page 10: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/10.jpg)
incremental pagerank
• Streaming data setting, update PageRank without touching the entire graph.
• Existing methods maintain databases of walks, etc.
• Let $A_\Delta = A + \Delta A$, $D_\Delta = D + \Delta D$ for the new graph, want to solve for $x + \Delta x$.
• Simple algebra:
$(I - \alpha A_\Delta^T D_\Delta^{-1})\, \Delta x = \alpha (A_\Delta^T D_\Delta^{-1} - A^T D^{-1})\, x$,
and the right-hand side is sparse.
• Re-arrange for Jacobi,
$\Delta x^{(k+1)} = \alpha A_\Delta^T D_\Delta^{-1} \Delta x^{(k)} + \alpha (A_\Delta^T D_\Delta^{-1} - A^T D^{-1})\, x$,
iterate, ...
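A small numerical check of this update, as a sketch only. It assumes dense NumPy matrices, `alpha = 0.85`, `k = 1 - alpha`, a graph in which every vertex has an out-edge, and an exactly solved old system; the helper names `transition` and `delta_pagerank` and the example graph are hypothetical.

```python
import numpy as np

def transition(A):
    """Return A^T D^-1; assumes every vertex has at least one out-edge."""
    A = np.asarray(A, dtype=float)
    return A.T / A.sum(axis=1)

def delta_pagerank(M_old, M_new, x, alpha=0.85, iters=200):
    """Iterate dx <- alpha * M_new dx + alpha * (M_new - M_old) x from dx = 0."""
    rhs = alpha * ((M_new - M_old) @ x)   # nonzero only near the changed edges
    dx = np.zeros_like(x)
    for _ in range(iters):
        dx = alpha * (M_new @ dx) + rhs
    return dx

# Hypothetical 4-vertex graph; the update adds the edge 1 -> 3.
A_old = np.array([[0, 1, 1, 0], [1, 0, 0, 0], [0, 1, 0, 1], [1, 0, 0, 0]])
A_new = A_old.copy()
A_new[1, 3] = 1
M_old, M_new = transition(A_old), transition(A_new)

alpha, n = 0.85, 4
b = np.full(n, (1 - alpha) / n)                          # k v with k = 1 - alpha
x_old = np.linalg.solve(np.eye(n) - alpha * M_old, b)    # exact old solution
x_new = np.linalg.solve(np.eye(n) - alpha * M_new, b)    # exact new solution
dx = delta_pagerank(M_old, M_new, x_old, alpha)
print(np.abs(x_old + dx - x_new).max())
```

The agreement relies on `x_old` solving the old system exactly. When the stored solution is only approximate, its error is never corrected by this update, which is one plausible reading of the drift reported on the next slide.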
![Page 11: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/11.jpg)
incremental pagerank: whoops
[Figure: "val" (log scale, tick marks at 1e−12 and 1e−10) against k from 0 to 10000, in three column panels labeled 1000, 100, and 10, with one row per graph: caidaRouterLevel, coPapersCiteseer, coPapersDBLP, great_britain.osm, PGPgiantcompo, power.]
• And fail. The updated solution wanders away from the true solution. Top rankings stay the same...
![Page 12: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/12.jpg)
incremental pagerank: think instead
• The old solution x is an ok, not exact, solution to the original problem, now a nearby problem.
• How close? Residual:
$r' = k v - x + \alpha A_\Delta^T D_\Delta^{-1} x = r + \alpha (A_\Delta^T D_\Delta^{-1} - A^T D^{-1})\, x$.
• Solve $(I - \alpha A_\Delta^T D_\Delta^{-1})\, \Delta x = r'$.
• Cheat by not refining all of $r'$, only region growing around the changes.
• (Also cheat by updating $r$ rather than recomputing at the changes.)
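A sketch of the residual-based correction, again with assumed dense NumPy matrices, `alpha = 0.85`, and `k = 1 - alpha`. The names `residual` and `refine` are hypothetical. This version refines all of r′; the region-growing and residual-updating shortcuts from the slide are not implemented here.

```python
import numpy as np

def residual(M, x, b, alpha=0.85):
    """r = b - (I - alpha * M) x, where b stands for k v."""
    return b - x + alpha * (M @ x)

def refine(M_new, x, b, alpha=0.85, iters=200):
    """Solve (I - alpha * M_new) dx = r' by fixed-point iteration; return x + dx."""
    r_new = residual(M_new, x, b, alpha)
    dx = np.zeros_like(x)
    for _ in range(iters):
        dx = alpha * (M_new @ dx) + r_new
    return x + dx

# Hypothetical updated graph and a deliberately inexact "old" solution.
A_new = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [0, 1, 0, 1], [1, 0, 0, 0]], dtype=float)
M_new = A_new.T / A_new.sum(axis=1)      # A^T D^-1 for the new graph
b = np.full(4, 0.15 / 4)
x_stale = np.full(4, 0.25)               # stand-in for an old, inexact solution
x_updated = refine(M_new, x_stale, b)
print(np.abs(residual(M_new, x_updated, b)).max())
```

Because the right-hand side is the residual of the solution actually held, any error already present in `x_stale` is corrected along with the effect of the graph change.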
![Page 13: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/13.jpg)
incremental pagerank: works
[Figure: relative 1-norm-wise backward error v. restarted PageRank (log scale) against number of edges added (0 to 20000), one column panel each for belgium.osm, caidaRouterLevel, coPapersDBLP, and luxembourg.osm, rows labeled 100, 1000, and 10000; legend "alg": dpr, dprheld, pr, pr_restart.]
• Thinking about the numerical linear algebra issues can lead to better graph algorithms.
![Page 14: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/14.jpg)
community detection
![Page 15: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/15.jpg)
graphs: big, nasty hairballs
Yifan Hu’s (AT&T) visualization of the in-2004 data set
http://www2.research.att.com/~yifanhu/gallery.html
![Page 16: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/16.jpg)
but no shortage of structure...
Protein interactions, Giot et al., “A Protein Interaction Map of Drosophila melanogaster”, Science 302, 1722-1736, 2003.
Jason’s network via LinkedIn Labs
• Locally, there are clusters or communities.
• Until 2011, no parallel method for community detection.
• But, gee, the problem looks familiar...
![Page 17: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/17.jpg)
community detection
• Partition a graph’s vertices into disjoint communities.
• A community locally optimizes some metric, NP-hard.
• Trying to capture that vertices are more similar within one community than between communities.
Jason’s network via LinkedIn Labs
![Page 18: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/18.jpg)
common community metric: modularity
• Modularity: Deviation of connectivity in the community induced by a vertex set S from some expected background model of connectivity.
• Newman’s uniform model, modularity of a cluster is
  fraction of edges in the community − fraction expected from uniformly sampling graphs with the same degree sequence:
  $Q_S = (m_S - x_S^2 / 4m) / m$
• Modularity: sum of cluster contributions
• “Sufficiently large” modularity ⇒ some structure
• Known issues: Resolution limit, NP, etc.
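A small sketch of the formula, assuming an undirected, unweighted graph given as an edge list, with m the total number of edges, m_S the number of edges with both ends in S, and x_S the total degree of the vertices in S. The slide does not define these symbols, so that reading (and the function name `modularity`) is an assumption.

```python
from collections import defaultdict

def modularity(edges, community):
    """Sum of Q_S = (m_S - x_S**2 / (4 m)) / m over all communities.

    edges: list of undirected (u, v) pairs; community: dict vertex -> label.
    """
    m = len(edges)
    inside = defaultdict(int)   # m_S: edges with both ends in S
    volume = defaultdict(int)   # x_S: total degree of the vertices in S
    for u, v in edges:
        volume[community[u]] += 1
        volume[community[v]] += 1
        if community[u] == community[v]:
            inside[community[u]] += 1
    return sum((inside[s] - volume[s] ** 2 / (4 * m)) / m for s in volume)

# Two triangles joined by one bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {v: "all" for v in range(6)}
print(modularity(edges, two_groups), modularity(edges, one_group))
```

For the two-triangle split, each community has m_S = 3 and x_S = 7 with m = 7, so each contributes (3 − 49/28)/7 and the total is 2.5/7. Placing every vertex in one community gives exactly zero.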
![Page 19: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/19.jpg)
sequential agglomerative method
[Figure: example graph on vertices A through G]
• A common method (e.g. Clauset, Newman, & Moore, 2004) agglomerates vertices into communities.
• Each vertex begins in its own community.
• An edge is chosen to contract.
  • Merging maximally increases modularity.
  • Priority queue.
• Known often to fall into an $O(n^2)$ performance trap with modularity (Wakita & Tsurumi ’07).
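A toy sketch of the sequential loop, not the Clauset, Newman, & Moore implementation: it rescans every community pair on each pass instead of maintaining a priority queue. The merge gain follows from the per-community modularity formula, Q(S∪T) − Q(S) − Q(T) = (e_ST − x_S x_T / 2m)/m, where e_ST counts edges between S and T. The function name `agglomerate` and the example graph are hypothetical.

```python
from collections import defaultdict

def agglomerate(edges):
    """Repeatedly merge the community pair with the largest modularity gain."""
    m = len(edges)
    label = {v: v for edge in edges for v in edge}   # each vertex starts alone
    while True:
        volume = defaultdict(int)    # x_S per community
        between = defaultdict(int)   # edge counts between community pairs
        for u, v in edges:
            a, b = label[u], label[v]
            volume[a] += 1
            volume[b] += 1
            if a != b:
                between[frozenset((a, b))] += 1
        best_gain, best_pair = 0.0, None
        for pair, e_ab in between.items():
            a, b = tuple(pair)
            gain = (e_ab - volume[a] * volume[b] / (2 * m)) / m
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:
            return label             # no merge increases modularity
        keep, drop = best_pair
        for v in label:
            if label[v] == drop:
                label[v] = keep

# Two triangles joined by one bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
label = agglomerate(edges)
print(label)
```

On this graph the loop stops with the two triangles as separate communities, because merging them across the single bridge edge would lower modularity.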
![Page 23: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/23.jpg)
parallel agglomerative method
[Figure: example graph on vertices A through G]
• Use a matching to avoid the queue.
• Compute a heavy weight matching.
  • Simple greedy, maximal algorithm.
  • Within factor of 2 from heaviest.
• Merge all communities at once.
• Maintains some balance.
• Produces different results.
• Agnostic to weighting, matching
• Up until 2011, no one tried this...
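A sequential sketch of the greedy idea behind the heavy matching, not the parallel implementation: scan edges by decreasing weight and keep an edge when both endpoints are still free. The function name `greedy_matching` and the example weights are hypothetical; in the agglomerative setting the weight of an edge would be the score (for example, the modularity gain) of merging its two communities.

```python
def greedy_matching(weighted_edges):
    """Greedy maximal matching over (u, v, weight) triples, heaviest first."""
    matched = set()
    matching = []
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2], reverse=True):
        if u != v and u not in matched and v not in matched:
            matching.append((u, v, w))
            matched.update((u, v))
    return matching

# Path a - b - c - d with the heaviest edge in the middle.
edges = [("a", "b", 2), ("b", "c", 3), ("c", "d", 2)]
print(greedy_matching(edges))
```

Here greedy takes only the middle edge (weight 3), while the heaviest matching takes the two outer edges (weight 4). The result is still within the factor of 2 stated on the slide, and every matched pair can be contracted in the same step.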
![Page 26: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/26.jpg)
parallel agglomerative community detection
Graph             |V|          |E|            Reference
soc-LiveJournal1  4 847 571    68 993 773     “SNAP”
uk-2007-05        105 896 555  3 301 876 564  Ubicrawler

Peak processing rates in edges/second:

Platform  Mem     soc-LiveJournal1  uk-2007-05
E7-8870   256GiB  6.90 × 10^6       6.54 × 10^6
XMT2      2TiB    1.73 × 10^6       3.11 × 10^6

Clustering: Sufficiently good. Won 10th DIMACS Implementation Challenge’s mix category in 2012.
Later: Fagginger Auer & Bisseling (2012), add star detection. LaSalle and Karypis (2014), add m-l “refinement.”
![Page 27: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/27.jpg)
what about streaming?
Data and plots from Pushkar Godbolé.
Preliminary experiments...
Simple re-agglomeration: Fast, decreasing modularity.
“Backtracking” appears to work, but carries more data (see also Görke, et al. at KIT).
Clusterings are very sensitive.
![Page 28: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/28.jpg)
connected components
![Page 29: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/29.jpg)
sensitivity and components
• Ok, clusterings are optimizing over a bumpy surface, of course they’re sensitive... (Streaming exacerbates.)
• Pick a clean problem: connected components
• Where could errors occur?
  • Streaming: Dropped, forgotten information
  • Computing: Stop for energy or time, thresholds
  • Real-life: Surveys not returned
• How do you even measure errors?
  • Pairwise co-membership counts
  • Empirical distributions from vertex membership
  • No one measure...
  • All need the true solution...
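One way to make "pairwise co-membership counts" concrete, as a sketch: label components with union-find on the full graph and on a graph with a dropped edge, then count the vertex pairs that are grouped together in one labeling but not the other. The function names and the example graph are hypothetical, and the pair count is quadratic in the number of vertices, so this suits small illustrations only.

```python
from itertools import combinations

def components(n, edges):
    """Label connected components of an undirected graph with union-find."""
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    for u, v in edges:
        parent[find(u)] = find(v)
    return [find(a) for a in range(n)]

def comembership_disagreements(labels_a, labels_b):
    """Count vertex pairs grouped together in one labeling but not the other."""
    return sum(
        (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
        for i, j in combinations(range(len(labels_a)), 2)
    )

true_edges = [(0, 1), (1, 2), (3, 4)]
seen_edges = [(0, 1), (3, 4)]               # the edge (1, 2) was dropped
errors = comembership_disagreements(
    components(5, true_edges), components(5, seen_edges)
)
print(errors)
```

Dropping the edge (1, 2) separates vertex 2 from vertices 0 and 1, so exactly two pairs disagree. As the slide notes, this measure needs the true solution to compare against.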
![Page 30: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/30.jpg)
sensitivity of connected components
[Figure: error (vertical axis, − to +) against fraction of graph remaining (kinda) (horizontal axis, − to +)]
From Zakrzewska & Bader, “Measuring the Sensitivity of Graph Metrics to Missing Data,” PPAM 2013
![Page 31: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/31.jpg)
questions / future
Can graph analysis learn from linear algebra and numerical analysis?
• Are there relevant concepts of backward error?
  • Don’t need the true solution to evaluate (or estimate) some distance.
• Should graph analysis look more in the statistical direction?
  • Moderate graphs hit converging limits.
• Are there other easy analogies / low-hanging fruit?
• Environments for playing with large graphs?
  • Sane threading and atomic operations (not data)
Feel free to join in...
![Page 32: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/32.jpg)
acknowledgements
![Page 33: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/33.jpg)
hpc lab people
Faculty:
• David A. Bader
• Oded Green (was student)
Data:
• Pushkar Godbolé
• Anita Zakrzewska
STINGER:
• Robert McColl,
• James Fairbanks,
• Adam McLaughlin,
• Daniel Henderson,
• David Ediger (now GTRI),
• Jason Poovey (GTRI),
• Karl Jiang, and
• feedback from users in industry, government, academia
![Page 34: Graph Analysis Beyond Linear Algebra](https://reader034.fdocuments.us/reader034/viewer/2022042611/5872c3d71a28ab0c718b5baf/html5/thumbnails/34.jpg)
stinger: where do you get it?
Home: www.cc.gatech.edu/stinger/
Code: git.cc.gatech.edu/git/u/eriedy3/stinger.git/
Gateway to
• code,
• development,
• documentation,
• presentations...
Remember: Academic code, but maturing with contributions.
Users / contributors / questioners: Georgia Tech, PNNL, CMU, Berkeley, Intel, Cray, NVIDIA, IBM, Federal Government, Ionic Security, Citi, ...