NUMA-Optimized Parallel Breadth-First Search on a Multicore Single-Node System


Mohammad Opada Al-Bosh Mohammad Tahsin Al-Shalabi

Ruba Break Mariam Al-kassar Nagham Ballan

Outline

Background

Breadth-first Search (BFS)

NUMA-optimized parallel BFS

Numerical Results

Conclusion

Background

Large-scale graphs in various fields

US Road network:

58 million edges

Twitter follow-ship:

1.47 billion edges

Neuronal network:

100 trillion edges


Background

Fast and scalable graph processing by using HPC

Importance of graph processing

Application fields

Transportation

Social network

Cyber-security

Bioinformatics

- Step 3:

• concurrent search (breadth-first search)

• optimization (single-source shortest path)

• edge-oriented (maximal independent set)

Breadth-first search

BFS is an important and fundamental graph-processing kernel

– As a stand-alone kernel: obtains the distance (in hops) from the source

– As a building block of many algorithms (BC, max-flow, maximal independent set)

Problems of fast and scalable BFS computation

- low arithmetic intensity

- irregular memory accesses

Graph500 Benchmark

Measures computer performance in graph processing such as BFS (breadth-first search) using the TEPS ratio

TEPS ratio = number of traversed edges per second
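As a small worked example of the TEPS definition above, the sketch below divides the number of traversed edges by the elapsed time. The function name `teps` and the sample numbers are illustrative assumptions, not part of the Graph500 reference code or of the measurements reported in these slides.

```python
def teps(traversed_edges: int, seconds: float) -> float:
    """Traversed edges per second for a single BFS run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return traversed_edges / seconds


# Hypothetical run: 2**30 edges traversed in 0.5 s (made-up numbers).
rate = teps(2**30, 0.5)
print(f"{rate / 1e9:.2f} GTEPS")  # prints "2.15 GTEPS"
```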

Contribution

Efficient hybrid algorithm of BFS [Beamer2011,2012]

reduces unnecessary edge traversal

Our proposal

- NUMA-optimized hybrid algorithm

- Improves locality of memory access

• Library for considering NUMA carefully

• Column-wise graph partitioning

Example

4-way Intel Xeon E5 (64 CPU cores)

• Scalable: scales well up to 64 threads

• Fast: 11.15 GTEPS and a 2.2x speedup compared with the original hybrid algorithm


Breadth-first Search (BFS)

Obtains the level of each vertex from the source vertex

Level = number of hops away from the source

Input: graph G and source vertex

Output: tree rooted at the source
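The input/output contract above can be sketched as a plain level-synchronous BFS. This is a minimal sequential sketch, not the parallel implementation discussed in these slides; the adjacency-list format, the `-1` marker for unreached vertices, the convention that the source is its own parent, and the example graph are all assumptions made for illustration.

```python
def bfs(adj, source):
    """Level-synchronous BFS.

    adj[v] is the list of neighbors of vertex v.
    Returns (level, parent); unreached vertices keep -1 in both lists.
    """
    n = len(adj)
    level = [-1] * n
    parent = [-1] * n
    level[source] = 0
    parent[source] = source  # assumed convention: the root is its own parent
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for v in frontier:
            for w in adj[v]:
                if parent[w] == -1:  # w has not been visited yet
                    parent[w] = v
                    level[w] = depth
                    next_frontier.append(w)
        frontier = next_frontier
    return level, parent


# Hypothetical undirected graph: 0-1, 0-2, 1-3, 2-3, 3-4; vertex 5 is isolated.
adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3], []]
level, parent = bfs(adj, 0)
print(level)   # [0, 1, 1, 2, 3, -1]
print(parent)  # [0, 0, 0, 1, 3, -1]
```

The `parent` list encodes the output tree; `level` gives the hop count of each vertex.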

Hybrid BFS for low-diameter graphs

Efficient for low-diameter graphs

– with scale-free and/or small-world properties, such as social networks

Used by higher-ranked entries in the Graph500 benchmark

Hybrid algorithm

- combines the top-down algorithm and the bottom-up algorithm

– reduces unnecessary edge traversals

Hybrid algorithm

Top-down algorithm: efficient for a small frontier

Bottom-up algorithm: efficient for a large frontier

Top-down algorithm

Explores the outgoing edges of the vertices in the frontier queue QF

Appends unvisited neighbors to the neighbor queue QN

Efficient for a small frontier

• Performs unnecessary edge traversals for a large frontier

Bottom-up algorithm

Searches the frontier queue QF starting from the unvisited vertices

Appends each unvisited vertex that is adjacent to the frontier to the neighbor queue QN

Efficient for a large frontier

• Performs unnecessary edge traversals for a small frontier
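The top-down and bottom-up steps described above, and a way to switch between them, can be sketched as follows. This is an illustrative sequential sketch for an undirected graph; the function names and the vertex-count `threshold` are assumptions made here and are not the switching rule of the cited hybrid algorithm [Beamer2011,2012].

```python
def top_down_step(adj, frontier, parent):
    """Scan the outgoing edges of every frontier vertex."""
    next_frontier = []
    for v in frontier:
        for w in adj[v]:
            if parent[w] == -1:
                parent[w] = v
                next_frontier.append(w)
    return next_frontier


def bottom_up_step(adj, frontier, parent):
    """Let every unvisited vertex look for one neighbor in the frontier."""
    in_frontier = set(frontier)
    next_frontier = []
    for w in range(len(adj)):
        if parent[w] != -1:
            continue
        for v in adj[w]:
            if v in in_frontier:
                parent[w] = v
                next_frontier.append(w)
                break  # the remaining edges of w are never traversed
    return next_frontier


def hybrid_bfs(adj, source, threshold=0.05):
    """Pick a direction per level; threshold is a hypothetical tuning knob."""
    n = len(adj)
    parent = [-1] * n
    parent[source] = source
    frontier = [source]
    directions = []
    while frontier:
        if len(frontier) > threshold * n:  # large frontier: go bottom-up
            directions.append("bottom-up")
            frontier = bottom_up_step(adj, frontier, parent)
        else:                              # small frontier: go top-down
            directions.append("top-down")
            frontier = top_down_step(adj, frontier, parent)
    return parent, directions


# Hypothetical undirected graph: 0-1, 0-2, 1-3, 2-3, 3-4; vertex 5 is isolated.
adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3], []]
parent, directions = hybrid_bfs(adj, 0, threshold=0.3)
print(parent)      # [0, 0, 0, 1, 3, -1]
print(directions)  # ['top-down', 'bottom-up', 'top-down', 'top-down']
```

The `break` in `bottom_up_step` is where edge traversals are saved for a large frontier: once an unvisited vertex finds any parent in the frontier, its remaining edges are skipped.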


How to speed up the hybrid algorithm?

NUMA architecture

– Non-uniform memory access

– Each CPU socket has a local RAM

– Fast local RAM and slow non-local RAM

4-socket Intel Xeon E5 system

Frequent non-local memory accesses on the NUMA architecture

Difficulty of considering the NUMA architecture

1. How to distribute the graph and data to each local RAM?

2. How to bind each partial graph and its data to a NUMA unit?

ULIBC: Ubiquity Library for Intelligently Binding Cores

1. NUMACTL (command-line tool, library for C/C++)

2. Intel compiler Thread Affinity Interface (API)

3. ULIBC (our library, library for C/C++)

– Processor ID: index of logical processor core

– Package ID: index of CPU socket

– Core ID: index of physical core in each CPU socket

NUMA-optimized column-wise graph partitioning

• Divides G = (V, A) into partial graphs Gk = (Vk, Ak) and binds Gk to local RAM k

– Ak is the set of adjacency lists that hold the incoming edges to Vk
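The partitioning described above can be sketched as follows. The slides do not say how V is split, so contiguous, equally sized vertex ranges are an assumption here, as are the function name and the data layout. Plain Python cannot bind memory to a NUMA unit, so the sketch only shows which edges would belong to which partition k.

```python
def partition_columnwise(n, edges, num_parts):
    """Split vertices into contiguous ranges V_k and build A_k.

    edges: iterable of directed (src, dst) pairs.
    Returns (ranges, parts): ranges[k] = (lo, hi) describes
    V_k = {lo, ..., hi - 1}, and parts[k][w] lists the sources of the
    edges entering w, for every w in V_k.
    """
    size = max(1, (n + num_parts - 1) // num_parts)  # ceil(n / num_parts)
    ranges = [(min(k * size, n), min((k + 1) * size, n))
              for k in range(num_parts)]
    parts = [{w: [] for w in range(lo, hi)} for lo, hi in ranges]
    for src, dst in edges:
        k = dst // size  # the partition that owns the destination vertex
        parts[k][dst].append(src)
    return ranges, parts


# Hypothetical undirected graph 0-1, 0-2, 1-3, 2-3, 3-4 stored as directed pairs.
undirected = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
edges = [e for (u, v) in undirected for e in ((u, v), (v, u))]
ranges, parts = partition_columnwise(6, edges, 2)
print(ranges)    # [(0, 3), (3, 6)]
print(parts[1])  # {3: [1, 2, 4], 4: [3], 5: []}
```

Because every edge is stored with the owner of its destination, a search that writes only to vertices in Vk needs only the adjacency lists in Ak.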

NUMA-optimized top-down

Explores the outgoing edges of the vertices in the frontier queue QF

Appends unvisited neighbors to the neighbor queue QN

Efficient for a small frontier

• Performs unnecessary edge traversals for a large frontier

Details of NUMA-optimized top-down

NUMA-optimized bottom-up

Searches the frontier queue QF starting from the unvisited vertices

Appends each unvisited vertex that is adjacent to the frontier to the neighbor queue QN

Efficient for a large frontier

• Performs unnecessary edge traversals for a small frontier

Details of NUMA-optimized bottom-up
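The details of the NUMA-optimized steps are given only as figures, so the following is a hedged sketch of how a bottom-up step might use a column-wise partition. It assumes partition k owns a contiguous vertex range Vk and the incoming adjacency lists Ak, and that each partition fills its own neighbor queue. The loop over partitions runs sequentially here; in a real implementation each iteration would be executed by threads pinned to NUMA unit k. All names and the example data are illustrative.

```python
def partitioned_bottom_up_step(ranges, parts, in_frontier, parent):
    """One bottom-up step in which partition k touches only V_k and A_k.

    ranges[k] = (lo, hi) describes V_k; parts[k][w] lists the sources of
    the edges entering w. in_frontier holds one flag per vertex and is
    only read. Returns one neighbor queue per partition.
    """
    local_queues = []
    for (lo, hi), incoming in zip(ranges, parts):  # one iteration per NUMA unit
        qn_k = []
        for w in range(lo, hi):
            if parent[w] != -1:
                continue
            for v in incoming[w]:
                if in_frontier[v]:
                    parent[w] = v  # the write stays inside V_k
                    qn_k.append(w)
                    break
        local_queues.append(qn_k)
    return local_queues


# Hypothetical state: graph 0-1, 0-2, 1-3, 2-3, 3-4 split into two partitions,
# source 0 already expanded, current frontier = {1, 2}.
ranges = [(0, 3), (3, 6)]
parts = [{0: [1, 2], 1: [0, 3], 2: [0, 3]},
         {3: [1, 2, 4], 4: [3], 5: []}]
parent = [0, 0, 0, -1, -1, -1]
in_frontier = [False, True, True, False, False, False]
queues = partitioned_bottom_up_step(ranges, parts, in_frontier, parent)
print(queues)  # [[], [3]]
print(parent)  # [0, 0, 0, 1, -1, -1]
```

In this layout the adjacency lists, the parent entries that are written, and the output queue of partition k would all reside in one local RAM; only the frontier flags are read across partitions.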


Machine specification

4-way Intel Xeon E5

– CentOS 6.4 (Kernel 2.6.32)

– GCC 4.4.7

– 64 logical CPU cores

– 4 NUMA units x 16 logical cores

4-way AMD Opteron 6174

– Fedora 19 (Kernel 3.11.2)

– GCC 4.8.1

– 48 CPU cores

– 8 NUMA units x 6 cores

TEPS ratio for varying problem size

The NUMA-optimized algorithm achieves a 2.2x speedup compared with the original hybrid algorithm

The NUMA-optimized algorithm achieves 11.15 GTEPS for a Kronecker graph (SCALE 26)

Strong scaling on Intel/AMD systems

Scales well until the number of threads reaches the number of cores

Twitter network

Graph500 benchmark

Fastest single-node entry on the 4th list (June 2012)

Fastest CPU-based single-node entry on the 6th list (June 2013)

1st Green Graph500 list (June 2013)

Measures power efficiency using the TEPS/W ratio

Results on various systems such as Android, Linux, and Mac

NUMA

Small Data category


Conclusion

NUMA-optimized hybrid BFS algorithm

– Reduces unnecessary edge traversals and remote RAM accesses by carefully considering NUMA

Numerical results on 4-way Intel Xeon

– scales well up to 64 threads (scalable)

– achieves 11.15 GTEPS (fast)

– 2.2x speedup compared with the original hybrid algorithm

Graph500 and Green Graph500

– Fastest single-node in June 2012

– Most power-efficient in June 2013

Future work

Further optimization of the NUMA-optimized BFS

Distributed-memory parallel computation

References

Parallel Breadth-First Search on Distributed Memory Systems

[Aydın Buluç, Kamesh Madduri. Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA]

A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L

[Andy Yoo, Edmond Chow, Keith Henderson, William McLendon, Bruce Hendrickson, Ümit Çatalyürek]

Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search

[Scott Beamer, Aydın Buluç, Krste Asanović, David A. Patterson]

Distributed Breadth First Search

[CSE 6220, Spring 2013, Georgia Institute of Technology, April 18. Guest lecture by Anita Zakrzewska]

Evaluation and Optimization of Breadth-First Search on NUMA Cluster

[Zehan Cui, Licheng Chen, Mingyu Chen, Yungang Bao, Yongbing Huang, Huiwei Lv. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Graduate School of Chinese Academy of Sciences]

Scaling Techniques for Massive Scale-Free Graphs in Distributed (External) Memory

[Roger Pearce, Maya Gokhale, Nancy M. Amato. Parasol Laboratory, Dept. of Computer Science and Engineering, Texas A&M University, College Station, TX; Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA]

Reducing Communication in Parallel Breadth-First Search on Distributed Memory Systems

[Huiwei Lu, Guangming Tan, Mingyu Chen, Ninghui Sun. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Argonne National Laboratory]

Level-Synchronous Parallel Breadth-First Search Algorithms For Multicore and Multiprocessor Systems

[Rudolf Berrendorf and Matthias Makulla. Computer Science Department, Bonn-Rhein-Sieg University, Sankt Augustin, Germany]


