NUMA-optimized Parallel Breadth-first
Search on Multicore Single-node System
Mohammad Opada Al-Bosh Mohammad Tahsin Al-Shalabi
Ruba Break Mariam Al-kassar Nagham Ballan
Outline
Background
Breadth-first Search (BFS)
NUMA-optimized parallel BFS
Numerical Results
Conclusion
Background
Large-scale graphs in various fields
US road network:
58 million edges
Twitter follower network:
1.47 billion edges
Neuronal network:
100 trillion edges
Background
Fast and scalable graph processing by using HPC
Importance of graph processing
Application fields
Transportation
Social network
Cyber-security
Bioinformatics
- Step 3:
• concurrent search (breadth first search)
• optimization (single source shortest path)
• edge-oriented (maximal independent set)
Breadth-first search
BFS is an important and fundamental graph processing kernel
– Obtains distance (hop count) relationships when used stand-alone
– Building block of many algorithms (BC, max-flow, maximal independent set)
Problems in fast and scalable BFS computation
- low arithmetic intensity
- irregular memory accesses
Graph500 Benchmark
Measures computer performance in graph processing such as
BFS (breadth-first search) using the TEPS ratio
TEPS ratio = number of traversed edges per second
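The TEPS definition above can be written out as a small calculation. This is a minimal sketch; the function names `teps` and `gteps` are illustrative and are not part of the Graph500 reference code.

```python
def teps(traversed_edges, seconds):
    """Traversed edges per second for one BFS run."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return traversed_edges / seconds


def gteps(traversed_edges, seconds):
    """Same ratio expressed in giga-TEPS (10^9 edges per second)."""
    return teps(traversed_edges, seconds) / 1e9
```

For example, a run that traverses 11,150,000,000 edges in one second corresponds to 11.15 GTEPS.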
Contribution
Efficient hybrid BFS algorithm [Beamer2011,2012]
- Reduces unnecessary edge traversals
Our proposal
- NUMA-optimized hybrid algorithm
- Improves locality of memory access
. Library for handling NUMA carefully
. Column-wise graph partitioning
Example
4-way Intel Xeon E5 (64 CPU cores)
• Scalable: scales well up to 64 threads
• Fast: 11.15 GTEPS and a 2.2x speedup compared with
the original hybrid algorithm
Breadth-first Search (BFS)
Obtains the level of each vertex from the source vertex
Level = number of hops from the source
Input: graph G and source vertex
Output: BFS tree rooted at the source
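The input/output contract above can be illustrated with a plain sequential BFS. This is a minimal sketch, assuming the graph is given as a dictionary of adjacency lists; the name `bfs_levels` and the dictionary-based representation are choices made for this example only.

```python
from collections import deque


def bfs_levels(adj, source):
    """Return (level, parent) dictionaries for every vertex reachable from source.

    level[v]  : number of hops from source to v
    parent[v] : predecessor of v in the BFS tree (the source is its own parent)
    """
    level = {source: 0}
    parent = {source: source}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in level:          # first visit fixes level and parent
                level[w] = level[v] + 1
                parent[w] = v
                queue.append(w)
    return level, parent
```

The `parent` dictionary encodes the output tree, and `level` gives the hop count of each reached vertex.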
Hybrid BFS for low-diameter graphs
Efficient for low-diameter graphs
– graphs with scale-free and/or small-world properties, such as
social networks
Used by higher-ranked entries in the Graph500 benchmark
Hybrid algorithm
- combines the top-down and bottom-up algorithms
– reduces unnecessary edge traversals
Hybrid algorithm
Top-down algorithm: efficient for a small frontier
Bottom-up algorithm: efficient for a large frontier
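The slides do not state the rule used to switch between the two directions. The sketch below is loosely modeled on the edge-count heuristic described in the hybrid BFS work cited as [Beamer2011,2012]; the function name `choose_direction`, the parameters `alpha` and `beta`, and their default values are assumptions made for this example.

```python
def choose_direction(current, frontier_edges, unvisited_edges,
                     frontier_size, num_vertices, alpha=14, beta=24):
    """Pick "top-down" or "bottom-up" for the next BFS level.

    current         : direction used for the current level
    frontier_edges  : number of edges leaving the frontier
    unvisited_edges : number of edges incident to unvisited vertices
    frontier_size   : number of vertices in the frontier
    alpha, beta     : tuning thresholds (assumed values, not from the slides)
    """
    if current == "top-down":
        # A large frontier makes scanning from unvisited vertices cheaper.
        if frontier_edges > unvisited_edges / alpha:
            return "bottom-up"
        return "top-down"
    # Already bottom-up: go back once the frontier has shrunk again.
    if frontier_size < num_vertices / beta:
        return "top-down"
    return "bottom-up"
```

The thresholds trade off the two kinds of wasted edge traversals and would need tuning per machine and graph.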
Top-down algorithm
Explores outgoing edges of the frontier queue QF
Appends unvisited vertices to the neighbor queue QN
Efficient for a small frontier
• Performs unnecessary edge traversals for a large frontier
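One level of the top-down search can be sketched as follows. This is an illustrative sequential version, assuming adjacency lists indexed by vertex number and a `parent` array in which -1 marks an unvisited vertex; the name `top_down_step` is hypothetical.

```python
def top_down_step(adj, frontier, parent):
    """Expand one BFS level from the frontier queue QF.

    adj      : adj[v] is the list of vertices reached by outgoing edges of v
    frontier : list of vertices in the current frontier (QF)
    parent   : parent[w] == -1 means w is still unvisited; updated in place
    Returns the neighbor queue QN for the next level.
    """
    next_frontier = []
    for v in frontier:
        for w in adj[v]:
            if parent[w] == -1:      # every edge of the frontier is checked
                parent[w] = v
                next_frontier.append(w)
    return next_frontier
```

When the frontier is large, most of the checked edges lead to vertices that are already visited, which is the unnecessary traversal noted above.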
Bottom-up algorithm
Searches for frontier vertices (QF) starting from the unvisited vertices.
Appends vertices adjacent to the frontier to the neighbor queue QN
Efficient for a large frontier
• Performs unnecessary edge traversals for a small frontier
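One level of the bottom-up search can be sketched in the same style. This is an illustrative sequential version, assuming an undirected graph (so the adjacency list of a vertex also lists its incoming edges), a boolean frontier array, and -1 as the "unvisited" marker; the name `bottom_up_step` is hypothetical.

```python
def bottom_up_step(adj, in_frontier, parent):
    """Expand one BFS level by scanning unvisited vertices.

    adj         : adj[w] lists the neighbors of w (incoming edges)
    in_frontier : in_frontier[v] is True if v is in the current frontier (QF)
    parent      : parent[w] == -1 means w is still unvisited; updated in place
    Returns a boolean array marking the next frontier (QN).
    """
    n = len(adj)
    next_frontier = [False] * n
    for w in range(n):
        if parent[w] != -1:
            continue                 # already visited
        for v in adj[w]:
            if in_frontier[v]:
                parent[w] = v
                next_frontier[w] = True
                break                # stop at the first frontier neighbor
    return next_frontier
```

The early `break` is what saves edge traversals when the frontier is large; with a small frontier, most unvisited vertices scan their whole adjacency list without finding a parent.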
How to speed up the hybrid algorithm?
NUMA architecture
– Non-uniform memory access
– Each CPU socket has its own local RAM
– Local RAM is fast; non-local RAM is slow
4-socket Intel Xeon E5 system
Frequent non-local memory accesses on the NUMA
architecture
Difficulty of considering the NUMA architecture
1. How to distribute the graph and data to each local RAM?
2. How to bind the partial graph and data to each NUMA unit?
ULIBC: Ubiquity Library for Intelligently Binding Cores
1. NUMACTL (command-line tool, library for C/C++)
2. Intel compiler Thread Affinity Interface (API)
3. ULIBC (our library, library for C/C++)
– Processor ID: index of logical processor core
– Package ID: index of CPU socket
– Core ID: index of physical core in each CPU socket
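The three IDs above are enough to decide which logical processors belong to which socket. The sketch below is not ULIBC's API; it only illustrates, with a hypothetical function `group_processors_by_package`, how a list of (processor ID, package ID, core ID) triples could be grouped per CPU socket before binding threads.

```python
def group_processors_by_package(topology):
    """Group logical processors by CPU socket.

    topology : iterable of (processor_id, package_id, core_id) triples
    Returns {package_id: [processor_id, ...]} with each list ordered by
    core ID, then processor ID.
    """
    packages = {}
    for processor_id, package_id, core_id in topology:
        packages.setdefault(package_id, []).append((core_id, processor_id))
    return {
        package_id: [proc for _core, proc in sorted(members)]
        for package_id, members in sorted(packages.items())
    }
```

With such a grouping, the threads assigned to package k can be pinned to the processors listed for k, so that they work on data placed in the local RAM of that socket.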
NUMA-optimized column-wise graph partitioning
. Divides G=(V,A) into partial graphs Gk=(Vk,Ak) and binds each Gk to local RAM k
- Ak is the set of adjacency lists that hold the incoming edges to Vk
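The partitioning can be sketched as follows. The slides do not say how V is split into the sets Vk, so this example assumes contiguous, equally sized vertex ranges; the function name `partition_columnwise` and the edge-list input format are also assumptions.

```python
def partition_columnwise(num_vertices, edges, num_parts):
    """Split a directed graph into num_parts partial graphs.

    edges : iterable of (v, w) pairs, meaning an edge from v to w
    Returns a list `parts` where parts[k][w] is the list of sources v of
    the incoming edges (v, w) for every vertex w in Vk.
    """
    # Assumed layout: Vk = [k * size, (k + 1) * size)
    size = max(1, (num_vertices + num_parts - 1) // num_parts)
    parts = []
    for k in range(num_parts):
        lo, hi = k * size, min((k + 1) * size, num_vertices)
        parts.append({w: [] for w in range(lo, hi)})
    for v, w in edges:
        parts[w // size][w].append(v)   # edge is stored with its destination
    return parts
```

Because every incoming edge of a vertex in Vk is stored in partition k, a thread bound to NUMA unit k can examine and update the vertices of Vk using only that partition's adjacency lists.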
NUMA-optimized Top-down
Explores outgoing edges of the frontier queue QF.
Appends unvisited vertices to the neighbor queue QN.
Efficient for a small frontier
• Performs unnecessary edge traversals for a large frontier
Details of NUMA-optimized Top-down
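The figure for this slide did not survive, so the following is only a sketch of one plausible reading: each NUMA unit k scans the whole frontier but follows only the edges whose destination lies in its own vertex set Vk, so all writes stay local. The function name `numa_top_down_step` and the layout `forward_parts[k][v]` (targets of v that belong to Vk) are assumptions. The loop over k is sequential here; in a real implementation each iteration would run on threads pinned to NUMA unit k.

```python
def numa_top_down_step(forward_parts, frontier, parent):
    """One top-down level over a column-wise partitioned graph.

    forward_parts : forward_parts[k][v] lists the targets w of v with w in Vk
    frontier      : list of frontier vertices (QF), shared by all units
    parent        : parent[w] == -1 means w is unvisited; updated in place
    Returns one local neighbor queue per NUMA unit.
    """
    next_local = [[] for _ in forward_parts]
    for k, local_adj in enumerate(forward_parts):   # unit k's share of work
        for v in frontier:
            for w in local_adj.get(v, []):
                if parent[w] == -1:                 # w belongs to unit k
                    parent[w] = v
                    next_local[k].append(w)
    return next_local
```

Concatenating the per-unit queues gives the frontier for the next level.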
NUMA-optimized Bottom-up
Searches for frontier vertices (QF) starting from the unvisited vertices.
Appends vertices adjacent to the frontier to the neighbor queue QN.
Efficient for a large frontier.
• Performs unnecessary edge traversals for a small frontier
Details of NUMA-optimized Bottom-up
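The figure for this slide is also missing, so this is a sketch under assumptions rather than the authors' implementation. It assumes each NUMA unit k holds a dictionary mapping every vertex w in Vk to the sources of its incoming edges, as in the column-wise partitioning described earlier; the name `numa_bottom_up_step` is hypothetical. The loop over k is sequential here; in a real implementation each iteration would run on threads pinned to NUMA unit k.

```python
def numa_bottom_up_step(parts, in_frontier, parent):
    """One bottom-up level over a column-wise partitioned graph.

    parts       : parts[k][w] lists the sources of incoming edges of w in Vk
    in_frontier : in_frontier[v] is True if v is in the current frontier
    parent      : parent[w] == -1 means w is unvisited; updated in place
    Returns a boolean array marking the next frontier.
    """
    next_frontier = [False] * len(parent)
    for local in parts:                     # unit k touches only Vk
        for w, sources in local.items():
            if parent[w] != -1:
                continue
            for v in sources:
                if in_frontier[v]:
                    parent[w] = v
                    next_frontier[w] = True
                    break                   # first frontier neighbor suffices
    return next_frontier
```

Each unit scans only its own unvisited vertices and adjacency lists; only the lookups in `in_frontier` may refer to vertices owned by another unit.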
Machine specification
4-way Intel Xeon E5
– CentOS 6.4 (Kernel 2.6.32)
– GCC 4.4.7
– 64 logical CPU cores
– 4 NUMA units x 16 logical cores
4-way AMD Opteron 6174
– Fedora 19 (Kernel 3.11.2)
– GCC 4.8.1
– 48 CPU cores
– 8 NUMA units x 6 cores
TEPS ratio varied with problem size
NUMA-optimized version: 2.2x speedup compared with the original hybrid algorithm
NUMA-optimized version achieves 11.15 GTEPS for a Kronecker graph (SCALE 26).
Strong scaling on Intel/AMD systems
Scales well as the number of threads increases up to the number of cores
Twitter network
Graph500 benchmark
Fastest single-node entry on the 4th list (June 2012)
Fastest CPU-based single-node entry on the 6th list (June 2013)
1st Green Graph500 list (June 2013)
Measures power efficiency using the TEPS/W ratio
Results on various systems such as Android, Linux, and Mac.
NUMA
Small Data category
Conclusion
NUMA-optimized hybrid BFS algorithm
– Reduces unnecessary edge traversals and remote RAM accesses
by carefully considering NUMA
Numerical results on 4-way Intel Xeon
– scales well up to 64 threads (scalable)
– achieves 11.15 GTEPS (fast)
– 2.2x speedup compared with the original hybrid algorithm
Graph500 and Green Graph500
– Fastest single-node in June 2012
– Most power-efficient in June 2013
Future work
– Further optimizing the NUMA-optimized BFS
– Distributed-memory parallel computation