NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System
![Page 1: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/1.jpg)
NUMA-optimized Parallel Breadth-first
Search on Multicore Single-node System
Mohammad Opada Al-Bosh Mohammad Tahsin Al-Shalabi
Ruba Break Mariam Al-kassar Nagham Ballan
![Page 2: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/2.jpg)
Outline
Background
Breadth-first Search (BFS)
NUMA-optimized parallel BFS
Numerical Results
Conclusion
![Page 3: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/3.jpg)
Background
Large-scale graphs in various fields
US road network:
58 million edges
Twitter follower network:
1.47 billion edges
Neuronal network:
100 trillion edges
![Page 4: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/4.jpg)
Background
Fast and scalable graph processing using HPC
![Page 5: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/5.jpg)
Importance of graph processing
Application fields
Transportation
Social network
Cyber-security
Bioinformatics
- Step 3:
• concurrent search (breadth first search)
• optimization (single source shortest path)
• edge-oriented (maximal independent set)
![Page 6: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/6.jpg)
Breadth first search
BFS is an important and fundamental graph-processing kernel
- Obtains distance (hop count) relationships as a standalone kernel
- Building block of many algorithms (betweenness centrality, max-flow, maximal independent set)
Challenges for fast and scalable BFS computation
- low arithmetic intensity
- irregular memory accesses
![Page 7: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/7.jpg)
Graph500 Benchmark
Measures computer performance by the TEPS ratio in graph
processing such as BFS (breadth-first search)
TEPS ratio = number of traversed edges per second
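The TEPS definition above can be written as a one-line calculation. The following is a minimal sketch; the function name `teps` and the numbers in the usage line are illustrative assumptions, not measurements from this work.

```python
def teps(traversed_edges: int, seconds: float) -> float:
    """Traversed edges per second for a single BFS run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return traversed_edges / seconds


# Hypothetical run: 1,000,000,000 edges traversed in 0.5 seconds.
rate = teps(1_000_000_000, 0.5)
print(f"{rate / 1e9:.2f} GTEPS")  # prints "2.00 GTEPS"
```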
![Page 8: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/8.jpg)
Contribution
Efficient hybrid BFS algorithm [Beamer2011,2012]
- Reduces unnecessary edge traversals
Our proposal
- NUMA-optimized hybrid algorithm
- Improves locality of memory accesses
. A library for handling NUMA carefully
. Column-wise graph partitioning
![Page 9: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/9.jpg)
Example
4-way Intel Xeon E5 (64 CPU cores)
• Scalable: scales well up to 64 threads.
• Fast: 11.15 GTEPS, a 2.2x speedup over the
original hybrid algorithm
![Page 10: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/10.jpg)
Outline
Background
Breadth-first Search (BFS)
NUMA-optimized parallel BFS
Numerical Results
Conclusion
![Page 11: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/11.jpg)
Breadth-first Search (BFS)
Obtains the level of each vertex from the source vertex
Level = number of hops from the source
Input: graph G and source vertex
Output: BFS tree rooted at the source
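As a small illustration of the input and output described above, here is a sequential BFS sketch that returns both the level of each vertex and the BFS tree as a parent map. The adjacency-dictionary representation and the name `bfs` are assumptions made for this example.

```python
from collections import deque


def bfs(adj, source):
    """Return (level, parent) for every vertex reachable from source."""
    level = {source: 0}        # hops from the source
    parent = {source: source}  # BFS tree; the root is its own parent
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in level:          # first time w is reached
                level[w] = level[v] + 1
                parent[w] = v
                queue.append(w)
    return level, parent


# Small undirected example graph.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
level, parent = bfs(adj, 0)
print(level)   # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
print(parent)  # {0: 0, 1: 0, 2: 0, 3: 1, 4: 3}
```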
![Page 12: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/12.jpg)
Hybrid BFS for low-diameter graphs
Efficient for low-diameter graphs
- Graphs with scale-free and/or small-world properties, such as social
networks
Used by the higher-ranked entries in the Graph500 benchmark
Hybrid algorithm
- Combines the top-down and bottom-up algorithms
- Reduces unnecessary edge traversals
![Page 13: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/13.jpg)
Hybrid algorithm
Top-down algorithm: efficient for a small frontier
Bottom-up algorithm: efficient for a large frontier
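The direction choice can be sketched as follows. This is a simplified illustration, not the published heuristic: it switches to bottom-up when the frontier holds more than a fixed fraction of the vertices, and the parameter `switch_fraction` is an assumption introduced here. The graph is assumed to be undirected and stored as adjacency lists.

```python
def hybrid_bfs(adj, source, switch_fraction=0.05):
    """Level-synchronous BFS that picks a direction per level.

    Returns (parent, directions), where directions records the choice made
    at each level. The vertex-count threshold is an illustrative assumption.
    """
    n = len(adj)
    parent = [-1] * n
    parent[source] = source
    frontier = [source]
    directions = []
    while frontier:
        next_frontier = []
        if len(frontier) > switch_fraction * n:
            directions.append("bottom-up")
            in_frontier = [False] * n
            for v in frontier:
                in_frontier[v] = True
            for w in range(n):                 # scan unvisited vertices
                if parent[w] == -1:
                    for v in adj[w]:
                        if in_frontier[v]:     # found a parent, stop early
                            parent[w] = v
                            next_frontier.append(w)
                            break
        else:
            directions.append("top-down")
            for v in frontier:                 # scan frontier vertices
                for w in adj[v]:
                    if parent[w] == -1:
                        parent[w] = v
                        next_frontier.append(w)
        frontier = next_frontier
    return parent, directions


# Small tree-shaped example; threshold is 0.25 * 8 = 2 vertices.
adj = [[1, 2, 3], [0, 4, 5], [0, 6], [0, 7], [1], [1], [2], [3]]
parent, directions = hybrid_bfs(adj, 0, switch_fraction=0.25)
print(parent)      # [0, 0, 0, 0, 1, 1, 2, 3]
print(directions)  # ['top-down', 'bottom-up', 'bottom-up']
```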
![Page 14: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/14.jpg)
Top-down algorithm
Explores the outgoing edges of each vertex in the frontier queue QF
Appends unvisited vertices to the neighbor queue QN
Efficient for a small frontier
• Performs unnecessary edge traversals for a large frontier
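A single top-down level can be sketched like this. The function also counts scanned edges so that the wasted work on a large frontier becomes visible; the name `top_down_step` and the use of -1 for "unvisited" are assumptions for this example.

```python
def top_down_step(adj, frontier, parent):
    """Expand one level from QF; return (QN, number of edges scanned)."""
    neighbors = []                    # QN
    scanned = 0
    for v in frontier:                # every vertex in QF
        for w in adj[v]:              # every outgoing edge of v
            scanned += 1
            if parent[w] == -1:       # w has not been visited yet
                parent[w] = v
                neighbors.append(w)
    return neighbors, scanned


# Star graph: vertex 0 is connected to vertices 1..4.
adj = [[1, 2, 3, 4], [0], [0], [0], [0]]
parent = [0, -1, -1, -1, -1]
qn, scanned = top_down_step(adj, [0], parent)
print(qn, scanned)    # [1, 2, 3, 4] 4

# The next level has a large frontier: four edges are scanned, none useful.
qn2, scanned2 = top_down_step(adj, qn, parent)
print(qn2, scanned2)  # [] 4
```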
![Page 15: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/15.jpg)
Bottom-up algorithm
Searches from each unvisited vertex for a neighbor in the frontier queue QF
Appends each unvisited vertex that has such a neighbor to the neighbor queue QN
Efficient for a large frontier
• Performs unnecessary edge traversals for a small frontier
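A single bottom-up level can be sketched in the same style. The frontier is assumed to be given as a boolean array, and `adj_in` is assumed to hold the incoming neighbors of each vertex; both choices, and the name `bottom_up_step`, are illustrative.

```python
def bottom_up_step(adj_in, in_frontier, parent):
    """Scan unvisited vertices for a frontier neighbor; return (QN, edges scanned)."""
    neighbors = []                         # QN
    scanned = 0
    for w in range(len(adj_in)):
        if parent[w] != -1:                # already visited
            continue
        for v in adj_in[w]:
            scanned += 1
            if in_frontier[v]:             # first frontier neighbor becomes the parent
                parent[w] = v
                neighbors.append(w)
                break                      # remaining edges of w are skipped
    return neighbors, scanned


# Path graph 0-1-2-3-4 with a small frontier {0}.
adj_in = [[1], [0, 2], [1, 3], [2, 4], [3]]
parent = [0, -1, -1, -1, -1]
in_frontier = [True, False, False, False, False]
qn, scanned = bottom_up_step(adj_in, in_frontier, parent)
print(qn, scanned)  # [1] 6
```

With this small frontier, six edges are scanned to discover a single vertex, whereas a top-down step from vertex 0 would scan only one edge.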
![Page 16: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/16.jpg)
Outline
Background
Breadth-first Search (BFS)
NUMA-optimized parallel BFS
Numerical Results
Conclusion
![Page 17: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/17.jpg)
How to speed up the hybrid algorithm?
NUMA architecture
- Non-uniform memory access
- Each CPU socket has its own local RAM
- Local RAM is fast; non-local RAM is slow
4-socket Intel Xeon E5 system
![Page 18: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/18.jpg)
Frequent non-local memory accesses on a NUMA
architecture
![Page 19: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/19.jpg)
Difficulties in exploiting a NUMA architecture
1. How to distribute the graph and data to each local RAM?
2. How to bind each partial graph and its data to a NUMA unit?
![Page 20: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/20.jpg)
ULIBC: Ubiquity Library for Intelligently Binding Cores
1. NUMACTL (command-line tool, library for C/C++)
2. Intel compiler Thread Affinity Interface (API)
3. ULIBC (our library, library for C/C++)
- Processor ID: index of the logical processor core
- Package ID: index of the CPU socket
- Core ID: index of the physical core within each CPU socket
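To make the three IDs concrete, the sketch below reads them from Linux sysfs. It does not use ULIBC and is not meant to reflect its API; it only assumes the standard Linux files `physical_package_id` and `core_id` under each CPU's `topology` directory. The function name `read_topology` is hypothetical.

```python
import os
import re


def read_topology(root="/sys/devices/system/cpu"):
    """Map each logical processor ID to (package ID, core ID) via Linux sysfs."""
    topology = {}
    if not os.path.isdir(root):
        return topology                     # not Linux, or sysfs unavailable
    for name in os.listdir(root):
        match = re.fullmatch(r"cpu(\d+)", name)
        if not match:
            continue                        # skip entries such as "cpufreq"
        topo_dir = os.path.join(root, name, "topology")
        try:
            with open(os.path.join(topo_dir, "physical_package_id")) as f:
                package_id = int(f.read())
            with open(os.path.join(topo_dir, "core_id")) as f:
                core_id = int(f.read())
        except (OSError, ValueError):
            continue                        # offline CPU or unexpected contents
        topology[int(match.group(1))] = (package_id, core_id)
    return topology


topology = read_topology()
print(len(topology), "logical processors found")
```

Two logical processors that report the same package ID and core ID share one physical core, which is the case when simultaneous multithreading is enabled.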
![Page 21: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/21.jpg)
NUMA-optimized column-wise graph partitioning
. Divides G=(V,A) into partial graphs Gk=(Vk,Ak) and binds each Gk to local RAM k
- Ak is the set of adjacency lists that hold the incoming edges to Vk
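The partitioning can be illustrated with a short sketch. It assumes that each Vk is a contiguous range of vertex IDs of nearly equal size, which is a choice made for this example; the actual binding of each part to a NUMA unit's memory is outside the scope of plain Python and is not shown.

```python
def partition_columns(adj_in, num_parts):
    """Split the graph into parts (V_k, A_k).

    V_k is a contiguous vertex range (an assumption of this sketch) and
    A_k maps each vertex in V_k to its list of incoming neighbors.
    """
    n = len(adj_in)
    size = (n + num_parts - 1) // num_parts          # ceil(n / num_parts)
    parts = []
    for k in range(num_parts):
        lo = min(k * size, n)
        hi = min((k + 1) * size, n)
        vertices = range(lo, hi)                     # V_k
        incoming = {w: adj_in[w] for w in vertices}  # A_k
        parts.append((vertices, incoming))
    return parts


# Path graph 0-1-2-3-4 split into two parts.
adj_in = [[1], [0, 2], [1, 3], [2, 4], [3]]
parts = partition_columns(adj_in, 2)
for k, (vertices, incoming) in enumerate(parts):
    print(k, list(vertices), incoming)
# 0 [0, 1, 2] {0: [1], 1: [0, 2], 2: [1, 3]}
# 1 [3, 4] {3: [2, 4], 4: [3]}
```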
![Page 22: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/22.jpg)
NUMA-optimized Top-down
Explores the outgoing edges of the frontier queue QF.
Appends unvisited vertices to the neighbor queue QN.
Efficient for a small frontier
• Performs unnecessary edge traversals for a large frontier
![Page 23: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/23.jpg)
Details of NUMA-optimized Top-down
![Page 24: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/24.jpg)
NUMA-optimized Bottom-up
Searches from each unvisited vertex for a neighbor in the frontier queue QF.
Appends each unvisited vertex that has such a neighbor to the neighbor queue QN.
Efficient for a large frontier
• Performs unnecessary edge traversals for a small frontier
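One way to read the combination of bottom-up search with column-wise partitioning is sketched below: each partition scans only its own vertex range and writes only to parent entries inside that range. This is an interpretation under stated assumptions, not the authors' implementation. The loop over `ranges` runs sequentially here, whereas a real run would assign one NUMA unit to each range.

```python
def partitioned_bottom_up_step(adj_in, ranges, in_frontier, parent):
    """One bottom-up level; partition k touches only vertices in ranges[k].

    Returns one neighbor list per partition. The (lo, hi) ranges are an
    assumption of this sketch.
    """
    next_per_part = []
    for lo, hi in ranges:                  # in a real run: one NUMA unit per range
        local_next = []
        for w in range(lo, hi):
            if parent[w] != -1:            # already visited
                continue
            for v in adj_in[w]:
                if in_frontier[v]:
                    parent[w] = v          # write stays inside [lo, hi)
                    local_next.append(w)
                    break
        next_per_part.append(local_next)
    return next_per_part


# Path graph 0-1-2-3-4, vertices 0..2 already visited, frontier {2}.
adj_in = [[1], [0, 2], [1, 3], [2, 4], [3]]
parent = [0, 0, 1, -1, -1]
in_frontier = [False, False, True, False, False]
result = partitioned_bottom_up_step(adj_in, [(0, 3), (3, 5)], in_frontier, parent)
print(result)  # [[], [3]]
```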
![Page 25: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/25.jpg)
Details of NUMA-optimized Bottom-up
![Page 26: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/26.jpg)
Outline
Background
Breadth-first Search (BFS)
NUMA-optimized parallel BFS
Numerical Results
Conclusion
![Page 27: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/27.jpg)
Machine specification
4-way Intel Xeon E5
- CentOS 6.4 (Kernel 2.6.32)
- GCC 4.4.7
- 64 logical CPU cores
- 4 NUMA units x 16 logical cores
4-way AMD Opteron 6174
- Fedora 19 (Kernel 3.11.2)
- GCC 4.8.1
- 48 CPU cores
- 8 NUMA units x 6 cores
![Page 28: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/28.jpg)
TEPS ratio as a function of problem size
The NUMA-optimized version achieves a 2.2x speedup over the original hybrid algorithm
The NUMA-optimized version achieves 11.15 GTEPS for a Kronecker graph (SCALE 26).
![Page 29: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/29.jpg)
Strong scaling on Intel/AMD systems
Scales well until the number of threads reaches the number of cores
![Page 30: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/30.jpg)
Twitter network
![Page 31: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/31.jpg)
Graph500 benchmark
Fastest single-node entry on the 4th list (June 2012)
Fastest CPU-based single-node entry on the 6th list (June 2013)
![Page 32: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/32.jpg)
1st Green Graph500 list (June 2013)
Measures power efficiency using the TEPS/W ratio
Results on various systems such as Android, Linux, and Mac.
NUMA
Small Data category
![Page 33: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/33.jpg)
Outline
Background
Breadth-first Search (BFS)
NUMA-optimized parallel BFS
Numerical Results
Conclusion
![Page 34: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/34.jpg)
Conclusion
NUMA-optimized Hybrid BFS algorithm
- Reduces unnecessary edge traversals and remote RAM accesses
by carefully considering NUMA
Numerical results on 4-way Intel Xeon
- Scales well up to 64 threads (scalable)
- Achieves 11.15 GTEPS (fast)
- 2.2x speedup compared with the original hybrid algorithm
Graph500 and Green Graph500
- Fastest single-node entry in June 2012
- Most power-efficient in June 2013
![Page 35: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/35.jpg)
Future work
Further optimization of the NUMA-optimized BFS
Distributed-memory parallel computation
![Page 36: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/36.jpg)
References
Parallel Breadth-First Search on Distributed Memory Systems
[Aydın Buluç, Kamesh Madduri. Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA]
A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L
[Andy Yoo, Edmond Chow, Keith Henderson, William McLendon, Bruce Hendrickson, Ümit Çatalyürek]
Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search
[Scott Beamer, Aydın Buluç, Krste Asanović, David A. Patterson]
Distributed Breadth First Search
[CSE 6220, Spring 2013, Georgia Institute of Technology, April 18. Guest lecture by Anita Zakrzewska]
Evaluation and Optimization of Breadth-First Search on NUMA Cluster
[Zehan Cui, Licheng Chen, Mingyu Chen, Yungang Bao, Yongbing Huang, Huiwei Lv. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Graduate School of Chinese Academy of Sciences]
![Page 37: NUMA optimized Parallel Breadth first Search on Multicore Single node System](https://reader034.fdocuments.us/reader034/viewer/2022042818/55c384d4bb61ebe57a8b45c8/html5/thumbnails/37.jpg)
References
Scaling Techniques for Massive Scale-Free Graphs in Distributed (External) Memory
[Roger Pearce, Maya Gokhale, Nancy M. Amato. Parasol Laboratory, Dept. of Computer Science and Engineering, Texas A&M University, College Station, TX; Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA]
Reducing Communication in Parallel Breadth-First Search on Distributed Memory Systems
[Huiwei Lu, Guangming Tan, Mingyu Chen, Ninghui Sun. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Argonne National Laboratory]
Level-Synchronous Parallel Breadth-First Search Algorithms For Multicore and Multiprocessor Systems
[Rudolf Berrendorf, Matthias Makulla. Computer Science Department, Bonn-Rhein-Sieg University, Sankt Augustin, Germany]