GraphChi: Big Data - small machine (A. Kyrola). What is it good for, and what's new? Aapo Kyrölä, Ph.D. candidate @ CMU. http://www.cs.cmu.edu/~akyrola Twitter: @kyrpov

Transcript of GraphChi: Big Data - small machine (A. Kyrola)

  • Slide 1
  • GraphChi: Big Data - small machine. What is it good for, and what's new? Aapo Kyrölä, Ph.D. candidate @ CMU. http://www.cs.cmu.edu/~akyrola Twitter: @kyrpov
  • Slide 2
  • GraphChi can compute on the full Twitter follow-graph with just a standard laptop, about as fast as a very large Hadoop cluster! (Size of the graph in Fall 2013: > 20B edges [Gupta et al. 2013].)
  • Slide 3
  • What is GraphChi? Both in OSDI '12!
  • Slide 4
  • Parallel Sliding Windows. Only P large reads for each interval (sub-graph); P^2 reads on one full pass. Details: Kyrola, Blelloch, Guestrin: Large-scale graph computation on just a PC (OSDI 2012).
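The read counts on this slide can be illustrated with a small in-memory sketch. It assumes the layout described in the cited OSDI 2012 paper: vertices are split into P intervals, shard p holds the edges whose destination lies in interval p, and each shard is sorted by source. The function names `build_shards` and `psw_pass` are hypothetical, and plain Python lists stand in for on-disk files, so this only mimics the access pattern rather than GraphChi's actual I/O.

```python
def build_shards(edges, num_vertices, P):
    """Split vertices into P equal intervals; shard p gets the edges whose
    destination is in interval p, sorted by source vertex."""
    size = -(-num_vertices // P)  # ceiling division: vertices per interval
    shards = [[] for _ in range(P)]
    for src, dst in edges:
        shards[dst // size].append((src, dst))
    for shard in shards:
        shard.sort()
    return shards, size


def psw_pass(shards, size):
    """Visit every interval once. Returns the (in_edges, out_edges) loaded for
    each interval and the number of block reads (lists stand in for disk)."""
    P = len(shards)
    reads = 0
    subgraphs = []
    for p in range(P):
        lo, hi = p * size, (p + 1) * size
        in_edges = list(shards[p])  # one full read of the interval's own shard
        reads += 1
        out_edges = []
        for q in range(P):
            if q == p:
                continue
            # Shards are sorted by source, so the edges leaving interval p form
            # one contiguous window in shard q; the filter stands in for
            # reading that window sequentially.
            out_edges.extend(e for e in shards[q] if lo <= e[0] < hi)
            reads += 1
        subgraphs.append((in_edges, out_edges))
    return subgraphs, reads


if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (4, 5), (5, 0)]
    shards, size = build_shards(edges, num_vertices=6, P=3)
    _, reads = psw_pass(shards, size)
    print(reads)  # P reads per interval, P intervals
```

With P = 3 the pass performs 3 reads per interval and 9 in total, matching the P and P^2 counts on the slide.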
  • Slide 5
  • Why GraphChi? Performance. Great scalability. GraphLab/Pregel-style programming + with hacking + more. Applications. Easy!
  • Slide 6
  • Performance Comparison. On a Mac Mini: GraphChi can solve as big problems as existing large-scale systems, with comparable performance. Benchmarks: PageRank; WebGraph Belief Propagation (U Kang et al.); Matrix Factorization (Alt. Least Sqr.); Triangle Counting. Notes: comparison results do not include the time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk. GraphChi computes asynchronously, while all but GraphLab compute synchronously. See the paper for more comparisons.
  • Slide 7
  • PowerGraph Comparison (PowerGraph vs. GraphChi). PowerGraph / GraphLab 2 (OSDI '12) outperforms previous systems by a wide margin on natural graphs. With 64x more machines and 512x more CPUs: PageRank is 40x faster than GraphChi; triangle counting is 30x faster than GraphChi. GraphChi has state-of-the-art performance / CPU.
  • Slide 8
  • Scalability / Input Size [SSD]. Throughput: number of edges processed / second. Conclusion: the throughput remains roughly constant when the graph size is increased. No worries of running out of memory, or buying more machines when your data grows. (Plot: performance vs. graph size.)
  • Slide 9
  • GraphChi^2. (Figure: tasks completed in time T on 6 machines vs. 12 machines.) Distributed graph system: (significantly) less than 2x throughput with 2x machines. Single-computer system (capable of big tasks): exactly 2x throughput with 2x machines.
  • Slide 11
  • Applications for GraphChi. Graph Mining: Connected components, Approx. shortest paths, Triangle counting, Community Detection. SpMV PageRank Generic Recommendations Random walks. Collaborative Filtering (by Danny Bickson): ALS, SGD, Sparse-ALS, SVD, SVD++, Item-CF + many more. Probabilistic Graphical Models: Belief Propagation.
  • Slide 12
  • Programming + Special Features. Similar programming model as GraphLab version 1. Dynamic graphs: streaming graphs while computing. Graph contraction algorithms (new): minimum spanning forest.
  • Slide 13
  • Easy to Get Started. Java and C++ versions available. No installation, just run. Any machine, SSD or HD. http://graphchi.org http://code.google.com/p/graphchi http://code.google.com/p/graphchi-java
  • Slide 14
  • What's New
  • Slide 15
  • Extensions. 1. Dynamic Edge and Vertex Values: divide shards into small (4 MB) blocks that can be resized separately. (Figure: Shard(j) split into Block 1, Block 2, Block 3, ..., Block N.) 2. Integration with Hadoop / Pig. 3. Fast neighborhood queries over shards: sparse indices. 4. DrunkardMob: Random Walks (next).
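The slide only names "sparse indices" for neighborhood queries, so the following is a generic sketch of that technique rather than GraphChi's implementation. It assumes a shard is a list of (source, destination) edges sorted by source; the functions `build_sparse_index` and `out_neighbors` and the `every` parameter are hypothetical names chosen for illustration.

```python
from bisect import bisect_left


def build_sparse_index(shard, every=4):
    """Record (source, position) for every `every`-th edge of a shard that is
    sorted by source. The index is much smaller than the shard itself."""
    return [(shard[i][0], i) for i in range(0, len(shard), every)]


def out_neighbors(shard, index, v):
    """Return the destinations of v's edges in this shard, scanning only from
    the nearest indexed position instead of from the start."""
    keys = [src for src, _ in index]
    # Last indexed entry whose source is strictly smaller than v: starting the
    # scan there cannot skip any edge of v.
    k = bisect_left(keys, v) - 1
    pos = index[k][1] if k >= 0 else 0
    result = []
    while pos < len(shard) and shard[pos][0] <= v:
        if shard[pos][0] == v:
            result.append(shard[pos][1])
        pos += 1
    return result


if __name__ == "__main__":
    shard = sorted([(0, 1), (0, 2), (1, 3), (2, 0), (2, 4),
                    (2, 5), (3, 1), (5, 2), (5, 3)])
    index = build_sparse_index(shard, every=4)
    print(out_neighbors(shard, index, 2))
```

A larger `every` shrinks the index but lengthens the scan after the binary search; the value 4 here is arbitrary.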
  • Slide 16
  • Random Walk Simulations. Personalized PageRank. Problem: using the power method would require O(V^2) of memory to compute for all vertices. It can be approximated by simulating random walks and computing the sample distribution. Other applications: recommender systems (FolkRank (Hotho 2006), finding candidates); knowledge-base inference (Lao, Cohen 2009).
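As a rough illustration of approximating personalized PageRank by simulation, the sketch below runs walks from one source on a small in-memory graph and normalizes the visit counts into a sample distribution. The function name, the reset probability of 0.15, the walk length, and the choice to restart at the source on dead ends are all assumptions made for the example, not parameters taken from the talk.

```python
import random
from collections import Counter


def personalized_pagerank(out_neighbors, source, num_walks=2000,
                          walk_length=10, reset_prob=0.15, seed=0):
    """Estimate personalized PageRank for `source` as the fraction of walk
    visits that land on each vertex."""
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(num_walks):
        current = source
        for _ in range(walk_length):
            visits[current] += 1
            neighbors = out_neighbors[current]
            # Jump back to the source on a reset or when stuck at a dead end.
            if not neighbors or rng.random() < reset_prob:
                current = source
            else:
                current = rng.choice(neighbors)
    total = sum(visits.values())
    return {vertex: count / total for vertex, count in visits.items()}


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
    estimate = personalized_pagerank(graph, source=0)
    for vertex in sorted(estimate):
        print(vertex, round(estimate[vertex], 3))
```

Only the visit counts for one source are kept at a time, which is what avoids the O(V^2) memory of computing every personalized vector with the power method.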
  • Slide 17
  • Random walk in an in-memory graph. Compute one walk at a time (multiple in parallel, of course). Extremely slow in GraphChi / PSW! Each hop might require loading of a new interval.
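To make the cost concrete, here is a small sketch that counts how often a single walk would force a different vertex interval to be loaded. It assumes vertices are assigned to intervals by integer division of the vertex id, and the function name `interval_loads` is hypothetical.

```python
def interval_loads(path, interval_size):
    """Count how many times following `path` one hop at a time requires a
    vertex interval other than the one currently loaded."""
    loads = 0
    loaded = None
    for vertex in path:
        interval = vertex // interval_size
        if interval != loaded:
            loads += 1
            loaded = interval
    return loads


if __name__ == "__main__":
    # A walk that jumps around the id space touches a new interval on almost
    # every hop.
    print(interval_loads([0, 57, 3, 99, 98, 12], interval_size=10))
```

In this example five of the six vertices on the path trigger a load, which is the behaviour the slide describes as extremely slow on disk.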
  • Slide 18
  • Random walks in GraphChi: the DrunkardMob algorithm. Reverse thinking:
        parfor vertex in graph:
            mywalks = walkManager.getWalksAtVertex(vertex.id)
            foreach walk in mywalks:
                walkManager.addHop(walk, vertex.randomNeighbor())
    Need to encode only the current vertex and the source vertex for each walk: a 4-byte integer is sufficient per walk. With 144 GB RAM, could run 15 billion walks simultaneously (on Java): recommendations for 15 million users.
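The pseudocode on this slide can be turned into a runnable single-threaded sketch. It assumes walks are bucketed by their current vertex and that each bucket entry stores only the walk's source id, so the current vertex is implied by the bucket. The function `drunkardmob_iteration`, the dictionary layout, and the restart-at-source rule for dead ends are illustrative assumptions, not the actual WalkManager API, and the sequential loop stands in for the `parfor`.

```python
import random
from collections import defaultdict


def drunkardmob_iteration(out_neighbors, walks_at, rng):
    """Advance every walk by one hop in a single pass over the vertices.

    walks_at maps a vertex id to the list of source ids of the walks that are
    currently at that vertex.
    """
    next_walks = defaultdict(list)
    for vertex in sorted(out_neighbors):  # stands in for interval order
        neighbors = out_neighbors[vertex]
        for source in walks_at.get(vertex, []):
            # Assumption: a walk stuck at a dead end restarts at its source.
            target = rng.choice(neighbors) if neighbors else source
            next_walks[target].append(source)
    return dict(next_walks)


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [2], 2: [0]}
    walks = {0: [0, 0, 0], 1: [1, 1]}  # three walks from 0, two from 1
    rng = random.Random(1)
    for _ in range(3):
        walks = drunkardmob_iteration(graph, walks, rng)
    print(sum(len(w) for w in walks.values()))  # walks are never lost
```

Each pass touches every vertex once and moves all walks forward together, instead of chasing one walk across intervals. At 4 bytes per walk, 15 billion walks come to about 60 GB, which fits within the 144 GB quoted on the slide.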
  • Slide 19
  • Keeping track of walks. (Figure labels: GraphChi; execution interval; vertex walks table (WalkManager); Walk Distribution Tracker (DrunkardCompanion) with top-N visits per source: Source A, Source B.)
  • Slide 20
  • Keeping track of walks. (Figure labels as on Slide 19, with the Source A and Source B top-N visits repeated.)
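A minimal sketch of the bookkeeping named in the figure: one visit counter per source, from which the top-N visited vertices are read. The class `VisitTracker` and its methods are hypothetical stand-ins for illustration and do not reflect the actual DrunkardCompanion interface.

```python
from collections import Counter, defaultdict


class VisitTracker:
    """Keeps, for every walk source, how often its walks visited each vertex."""

    def __init__(self):
        self.visits = defaultdict(Counter)

    def record(self, source, vertex):
        """Register one visit to `vertex` by a walk that started at `source`."""
        self.visits[source][vertex] += 1

    def top_n(self, source, n):
        """Return the n most visited vertices for `source` with their counts."""
        return self.visits[source].most_common(n)


if __name__ == "__main__":
    tracker = VisitTracker()
    for vertex in ["b", "c", "c", "d", "c", "b"]:
        tracker.record("A", vertex)
    print(tracker.top_n("A", 2))
```

Keeping only a bounded top-N list per source, rather than full counters, would be the natural next step when the number of sources is large; the sketch keeps full counters for simplicity.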
  • Slide 21
  • Application: Twitter's Who-to-Follow. Based on the WWW '13 paper by Gupta et al. Step 1: Compute the Circle of Trust (CoT) for each user. Step 2: Build a bipartite graph with the CoT + the CoT's followees. Step 3: Compute SALSA and pick the top-scored users as recommendations. (Labels: DrunkardMob; neighborhood queries over shards.)
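Steps 2 and 3 can be sketched on a toy example. The code below assumes the Circle of Trust is already known, builds the bipartite graph with CoT members as hubs and their followees as authorities, and runs the standard SALSA authority iteration. The function names and the toy data are hypothetical, and details of the production system described by Gupta et al. (for example, filtering out accounts the user already follows) are not modeled.

```python
from collections import defaultdict


def salsa_authority_scores(follows, iterations=20):
    """follows maps each hub to the list of authorities it follows (no
    duplicates). Returns SALSA authority scores that sum to 1."""
    followers = defaultdict(list)
    for hub, authorities in follows.items():
        for authority in authorities:
            followers[authority].append(hub)
    score = {a: 1.0 / len(followers) for a in followers}
    for _ in range(iterations):
        # Backward step: each authority splits its score among its followers.
        hub_score = {h: sum(score[a] / len(followers[a]) for a in auths)
                     for h, auths in follows.items()}
        # Forward step: each hub splits its score among the users it follows.
        score = {a: sum(hub_score[h] / len(follows[h]) for h in hubs)
                 for a, hubs in followers.items()}
    return score


def who_to_follow(circle_of_trust, followees, k=2):
    """Build the bipartite graph from the CoT and rank its followees."""
    follows = {user: followees.get(user, []) for user in circle_of_trust}
    scores = salsa_authority_scores(follows)
    return sorted(scores, key=lambda a: (-scores[a], a))[:k]


if __name__ == "__main__":
    followees = {"A": ["x", "y"], "B": ["y", "z"], "C": ["q"]}
    print(who_to_follow(["A", "B"], followees, k=1))
```

On a connected bipartite graph these scores settle at values proportional to how many CoT members follow each candidate, so the toy example ranks `y` first.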
  • Slide 22
  • Conclusion. GraphChi can run your favorite graph computation on extremely large graphs on your laptop. Unique features such as random walk simulations and dynamic graphs. Most popular: Collaborative Filtering toolkit (by Danny Bickson).
  • Slide 23
  • Thank you! Aapo Kyrölä, Ph.D. candidate @ CMU, soon to graduate! http://www.cs.cmu.edu/~akyrola Twitter: @kyrpov