Transcript of GraphM sc'19
LCCG: A Locality-Centric Hardware Accelerator for High Throughput of Concurrent Graph Processing
Jin Zhao1, Yu Zhang1, Xiaofei Liao1, Ligang He2, Bingsheng He3, Hai Jin1, Haikun Liu1
1 SCTS/CGCL, Huazhong University of Science and Technology, China
2 University of Warwick, UK
3 National University of Singapore, Singapore
Outline
• Background and Challenges
• Locality-centric Accelerator
• Hardware Design
• Experimental Results
• Conclusion
Applications of Concurrent Graph Processing
• Example applications: customer segmentation, computing the landmark labeling, computing the network community profile, random walk, k-core, friend suggestion, BFS, PageRank, WCC, social information monitoring
Existing Hardware-software Solutions
• Existing general-purpose many-core processors
  - under-utilization of cache and memory bandwidth
  - serious cache thrashing and intense resource contention
• Specific hardware accelerators
  - designed to serve a single job
  - inefficient for handling the uncoordinated CGP jobs
Analysis on Existing General Processors
• Which stage costs the most time?
  - The cycle stack shows that most time is spent on DRAM and last-level cache (LLC) accesses.
  - What is the problem?
Challenges: Irregular Graph Traversal
• Example: two BFS jobs
  - Such irregular graph traversals of the jobs cause redundant memory traffic among these jobs.
  [Figure: a. An example graph; b. Execution of two BFS jobs on the graph; c. Memory access pattern of two BFS jobs within the second iteration; d. Ratio of fetched graph structure data shared by different numbers of jobs]
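The redundancy above can be reproduced in a few lines of software. This is a toy sketch, not the paper's setup: the graph, the two source vertices, and the idea of counting one "memory access" per adjacency-list fetch are all illustrative assumptions.

```python
from collections import deque

# Toy graph as adjacency lists (an assumed example, not the paper's data).
graph = {
    'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': ['E'],
    'E': ['F', 'G'], 'F': [], 'G': [],
}

def bfs_fetches(source):
    """Run a BFS and record every adjacency-list fetch (one memory access)."""
    visited, frontier, fetches = {source}, deque([source]), []
    while frontier:
        v = frontier.popleft()
        fetches.append(v)              # v's neighbor list is loaded from memory
        for u in graph[v]:
            if u not in visited:
                visited.add(u)
                frontier.append(u)
    return fetches

# Two uncoordinated BFS jobs fetch much of the same structure data twice.
f1, f2 = bfs_fetches('A'), bfs_fetches('C')
shared = set(f1) & set(f2)             # adjacency lists loaded by both jobs
```

Here 5 of the 12 total adjacency-list fetches are duplicates of data another job already loaded, which is exactly the redundant traffic the figure quantifies.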
Challenges: Intense Resource Contention
• Example: multiple jobs executed concurrently
  - A low ratio of useful vertex state data implies ineffective use of the cache and memory bandwidth.
  [Figure: ratio of the fetched useful vertex state data to all fetched vertex state data]
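The claim can be made concrete with a back-of-the-envelope calculation; the cache-line and state sizes below are assumptions for illustration, not figures from the talk.

```python
# A 64-byte cache line holds 8 eight-byte vertex states. If an irregular
# access consumes only one of them, the useful-data ratio of that fetch
# is 1/8: seven-eighths of the fetched bandwidth is wasted.
LINE_BYTES, STATE_BYTES = 64, 8
states_per_line = LINE_BYTES // STATE_BYTES
useful_states = 1                      # states actually consumed from the line
ratio = useful_states / states_per_line
```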
Motivations: Inter-job Locality of the CGP Jobs
• Inter-job temporal locality
• Inter-job spatial locality
Exploiting them can reduce the memory traffic and achieve higher utilization of the cache and memory bandwidth.
[Figure: a. An example graph; b. Execution of two BFS jobs on the graph; c. Real experiment: CGP jobs on Ligra-M]
Topology-aware Execution
• Topology-aware Regularization of Data Accesses
  - BFS 1 and BFS 2 traverse the same graph structure separately, starting from different graph vertices.
  - After several cycles, their traversals converge onto a common traversal path: e.g., the active vertices of both BFS 1 and BFS 2 are E and G.
Topology-aware Regularization of Data Accesses
  - e.g., the active vertices of both BFS 1 and BFS 2 are B and F.
  - Mining the temporal locality of the concurrent jobs' data accesses maximizes the sharing rate of the loaded graph data: when BFS 1 accesses its own state of B, the state of B for BFS 2 can also be fetched into the LLC for BFS 2's execution.
  - A dedicated data structure maintains and indexes the hot vertices' states.
  - Mining the spatial locality of the concurrent jobs' data accesses improves cache and bandwidth utilization and minimizes resource contention.
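The regularization idea can be sketched in software: traverse the union of all jobs' frontiers, so an adjacency list on a common traversal path is loaded once and shared. The toy graph, sources, and union-frontier loop below are illustrative assumptions, not the hardware algorithm.

```python
from collections import defaultdict

# Toy graph (assumed example): two BFS jobs converge onto a common path.
graph = {
    'B': ['D'], 'C': ['D'], 'D': ['E'], 'E': ['F', 'G'], 'F': [], 'G': [],
}

def concurrent_bfs(sources):
    """Traverse the union of all jobs' frontiers; each adjacency list on
    a common traversal path is loaded once and shared by every job."""
    levels = {j: {s: 0} for j, s in enumerate(sources)}
    frontiers = {j: {s} for j, s in enumerate(sources)}
    loads = iteration = 0
    while any(frontiers.values()):
        iteration += 1
        union = set().union(*frontiers.values())   # common traversal path
        next_frontiers = defaultdict(set)
        for v in union:
            loads += 1                             # one load serves all jobs
            for j, frontier in frontiers.items():
                if v not in frontier:
                    continue
                for u in graph[v]:
                    if u not in levels[j]:
                        levels[j][u] = iteration
                        next_frontiers[j].add(u)
        frontiers = dict(next_frontiers)
    return levels, loads

levels, loads = concurrent_bfs(['B', 'C'])
```

On this example the union traversal needs 6 adjacency-list loads, versus 10 when the two BFS jobs run separately (5 fetches each), while both jobs still compute correct levels.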
Topology-aware Execution
• Vertex States Coalescing
  - Common traversal paths
  - Extra instructions?
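The coalescing idea follows from the earlier observation that one fetch of vertex B's state can serve both jobs: interleave the per-job states so all jobs' states of a vertex sit in adjacent slots. The layout and sizes below are an illustrative assumption.

```python
# Separate per-job state arrays scatter the J states of a vertex across
# J distant allocations; a coalesced layout interleaves them so all
# jobs' states of a vertex land in adjacent slots (often one cache line).
NUM_JOBS, NUM_VERTICES = 4, 8          # assumed sizes

coalesced = [0.0] * (NUM_JOBS * NUM_VERTICES)

def state_index(v, job):
    """All NUM_JOBS states of vertex v are contiguous."""
    return v * NUM_JOBS + job

coalesced[state_index(3, 0)] = 1.5     # job 0's state of vertex 3
coalesced[state_index(3, 1)] = 2.5     # job 1's state lands in the next slot
```

With this layout, fetching the line that holds vertex 3's states brings in every concurrent job's state of vertex 3 at once, raising the useful-data ratio of each fetch.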
LCCG Design
• System architecture
  [Figure: LCCG architecture — a runtime system for high-concurrency graph analysis services; an existing software graph processing framework runs the concurrent graph jobs (PageRank, BFS, SSSP, WCC, ...) on the CPU; memory holds the shared graph structure data and the state data of each job's graph vertices]
TATR: Topology-Aware Traversal Regularization
• Key Data Structures
  - active_all array
  - Intermediate_queue array
  - WorkList array
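The slide only names these three arrays; the semantics sketched below (a per-vertex job bitmask feeding a deduplicated queue that is ordered into a worklist) are an assumption made for illustration, not a description of the actual unit.

```python
# Assumed semantics of the three TATR structures:
# active_all[v]      - bitmask of jobs for which vertex v is active
# intermediate_queue - vertices newly activated in the current iteration
# worklist           - the regularized order in which pipelines visit them
NUM_VERTICES = 6
active_all = [0] * NUM_VERTICES
intermediate_queue = []

def activate(v, job):
    """Mark v active for `job`; enqueue v only on its first activation."""
    if active_all[v] == 0:
        intermediate_queue.append(v)
    active_all[v] |= 1 << job

activate(4, 2)
activate(2, 0)
activate(2, 1)                          # a second job causes no duplicate enqueue
worklist = sorted(intermediate_queue)   # one ordered pass over active vertices
```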
TATR: Topology-Aware Traversal Regularization
• Exploration of Common Traversal Paths
  [Figure: microarchitecture of the TATR unit]
TATR: Topology-Aware Traversal Regularization
• Enhancement of Traversal Parallelism
  - TATR uses multiple pipelines to explore the common traversal paths.
  - A pipeline that finishes early steals half of the remaining unvisited vertices from another pipeline to balance the load.
  [Figure: microarchitecture of the TATR unit]
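The stealing rule above can be sketched in a few lines; the split point and the queue contents are illustrative assumptions (the real unit does this in hardware across its pipelines).

```python
# When a traversal pipeline drains its own share, it steals half of the
# remaining unvisited vertices from a busy pipeline.
def steal_half(busy_pending):
    """Split a busy pipeline's pending vertices; the idle one takes the back half."""
    keep = len(busy_pending) // 2
    return busy_pending[keep:], busy_pending[:keep]

stolen, remaining = steal_half([10, 11, 12, 13, 14])
```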
TATR: Topology-Aware Traversal Regularization
• Partition Organization
  - Enables the loaded graph data to be reused by the CGP jobs.
  - Counts the number of edges of the vertices.
  [Figure: microarchitecture of the TATR unit]
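One plausible reading of edge counting here is capacity-bounded partitioning: group vertices so each partition's total edge count stays under a budget, so a partition loaded once can be reused by all jobs before eviction. The greedy scheme, degrees, and budget below are assumptions for illustration.

```python
# Greedy edge-count-based partitioning (illustrative values).
degrees = [3, 1, 4, 2, 5, 1, 2]        # number of edges per vertex
BUDGET = 6                             # edge budget per partition

partitions, current, current_edges = [], [], 0
for v, d in enumerate(degrees):
    if current and current_edges + d > BUDGET:
        partitions.append(current)     # close the full partition
        current, current_edges = [], 0
    current.append(v)
    current_edges += d
partitions.append(current)
```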
PI: Prefetching and Indexing
• Key Data Structures
  - The Consolidated states array, sized σ × |V|, where |V| is the total number of vertices; σ is set to 0.75 by default.
  - The Job offset array
  - State offset arrays
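A quick sanity check of that sizing rule, with the stated default σ = 0.75 and an assumed graph of one million vertices:

```python
# Capacity of the Consolidated states array: sigma * |V| entries.
sigma = 0.75                  # stated default
num_vertices = 1_000_000      # assumed graph size
capacity = int(sigma * num_vertices)
```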
PI: Prefetching and Indexing
• State Indexing of Hot Vertices
  - When a job needs to access the state of a vertex v:
    1. Check whether v is a hot vertex (against a threshold T).
    2. Add the hot vertex's information to the Consolidated states table, so that subsequent accesses to it are efficient.
    3. Apply partition-synchronous control to avoid cache thrashing.
  - This achieves higher utilization of the cache and memory bandwidth.
  [Figure: microarchitecture of the PI unit]
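Steps 1 and 2 can be sketched as follows; the access counter, the value of T, and the per-job state layout are illustrative assumptions, not the PI unit's actual implementation.

```python
# Count accesses to each vertex; once a vertex crosses threshold T,
# consolidate its per-job states so later accesses hit the table.
T = 3                                   # assumed hotness threshold
access_count = {}
consolidated = {}                       # vertex -> [state of each job]

def access_state(v, job, states):
    """Return `job`'s state of v, consolidating v once it becomes hot."""
    access_count[v] = access_count.get(v, 0) + 1
    if v in consolidated:
        return consolidated[v][job]     # indexed fast path
    if access_count[v] >= T:            # step 1: hot-vertex check
        consolidated[v] = [s[v] for s in states]   # step 2: consolidate
    return states[job][v]

states = [[0, 1, 2, 3], [10, 11, 12, 13]]   # toy per-job state arrays
for _ in range(T):
    value = access_state(2, 0, states)
```

After vertex 2 turns hot, an access by the other job (`access_state(2, 1, states)`) is served from the consolidated entry instead of its distant state array.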
Experiment Setup
• Platform
  - Zsim
• Typical graph processing algorithms
  - PageRank (PR), Random Walk (RW), Weakly Connected Components (WCC), Maximal Independent Sets (MIS), Label Propagation (LP), k-core, Single Source Shortest Path (SSSP), Breadth-First Search (BFS)
• Datasets
  - 5 real-world datasets
• Evaluation Methodology
  - Baseline software system: Ligra-M
  - Ours: software-only LCCG (LCCG-S) and hardware-implemented LCCG (LCCG-H); LCCG-T enables only TATR in LCCG-H
  - State-of-the-art: HATS, Minnow, PHI
Evaluation: Overall Performance
• Comparison with Software Approaches
  - LCCG-S outperforms Ligra-M only slightly: LCCG-S incurs high bookkeeping cost, which impedes its overall efficiency.
  - LCCG-H improves the throughput by 11.3~23.9 times over Ligra-M.
  - Through regularized graph traversal and memory sharing, both LCCG-S and LCCG-H significantly reduce memory accesses; LCCG-H reduces the accesses to memory by 82.3% compared to Ligra-M.
  [Figures: normalized execution time; memory accesses breakdown]
Evaluation: Overall Performance
• Comparison with Hardware Accelerators
  - LCCG-H improves the throughput over HATS, Minnow, and PHI by 4.7~10.3, 5.5~13.2, and 3.8~8.4 times, respectively.
  - LCCG-H generates less memory traffic for handling multiple jobs than the three state-of-the-art hardware accelerators (HATS, Minnow, and PHI).
  [Figures: normalized execution time; main memory accesses]
Conclusion
➢ What does LCCG bring to graph processing?
  • Regularizing the graph traversals
  • Consolidating the storage and accesses of the graph data
  • A locality-centric programmable accelerator, LCCG
  • Minimizing the data access cost of the CGP jobs while achieving higher utilization of the cores
➢ Future work
  • How to integrate existing hardware techniques into LCCG for better performance
  • How to prevent the leakage of the jobs' private information in LCCG