Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue...

Post on 26-Jun-2020

4 views 0 download

Transcript of Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue...

Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs

Yanhao Chen1, Ari B. Hayes1, Chi Zhang2, Timothy Salmon1, Eddy Z. Zhang1

1. Rutgers University2. University of Pittsburgh

2018 USENIX Annual Technical Conference

Sparse Matrix• Sparse Linear Systems

- CG- GMRES- …

• Physics Based Simulations- CFD

• Deep Learning Optimizations- Pruned Neural Networks

• ……

2018 USENIX Annual Technical Conference

Sparse Matrix Operation

y=A xSparse Matrix

AInput Vector

xOutput Vector

y

&' = ()*+,) { .'/ ⨀ 1/}

Binary Operator3 ∈ 1, … ,8 , 9 ∈ [1, … , ;]

2018 USENIX Annual Technical Conference

Sparse Matrix Vector Multiplication (SpMV)

!"#$%" = sum⨀ = ∗

,- = ./0{2-,4 ∗ 54}

78,8 78,9 78,: 78,;

< < < <

7:,8 < 7:,: <

< 7;,9 < 7;,;

=8=9=:=;

>8>9>:>;

* =

A x y

,- = !"#$%"{2-4⨀54}

,: = 2:,8 ∗ 58 + 2:,: ∗ 5:@ ∈ 1, … ,D , E ∈ [1, … , G]

2018 USENIX Annual Technical Conference

Single Source Shortest Path (SSSP)

!"#$%" = min⨀ = +

,- = ./0{23- + 43}

,- = !"#$%"{2-3⨀43}

S

2

1

3

4

5

S 1 2 3 4 5S12345

67 = 89:{;7 , 2=,7+;= , 2>,7+;> }

Adjacent Matrix A

y: distance vector of j th iteration

x ∶ (previous) distance vector of j-1 th iterationA ∈ 1, … ,E , F ∈ [1, … , H]

2018 USENIX Annual Technical Conference

Problem with Sparsity on GPUs• Low data reuse is always a big problem

• e.g. SpMV

– The input vector and the output vector can be reused a lot

– They are usually too large to fit into GPU’s cache

– The sparsity of the matrix causes irregular accesses of the vectors

– This means low reuse of the data in the cache

!" = $%&{(",* ∗ ,*}

2018 USENIX Annual Technical Conference

Existing Methods to Improve Data Reuse on GPUs

• Warp Scheduling Policy– Throttling concurrent threads– Limits the number of active warps [Rogers+, MICRO’12]– DYNCTA: controls the number of CTAs [Kayiran+,PACT’13]

• Computation and Data Layout Transformation– Reduce irregular memory accesses– Improve Memory Coalescing [Zhang+, ICS’10]

Need Hardware Modification!

Only focus on Spatial Data Reuse!

2018 USENIX Annual Technical Conference

Ø Is the First Software Throttling implementation

Ø Is focused on Temporal Data Reuse

Ø Exploits the Trade-off between throttling performance and GPU throughput

2018 USENIX Annual Technical Conference

SpMV with Software Throttling !"!#!$!%

&"&#&$&%

X =

X Y

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

2018 USENIX Annual Technical Conference

SpMV with Software Throttling !"!#!$!%

&"&#&$&%

X =

X Y

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

Matrix A is bypassing the cache

2018 USENIX Annual Technical Conference

SpMV with Software Throttling

{ < !" #" > < !$ #" > < !% #" > < !& #" > < !" #% > < !% #% > < !$ #& > < !& #& > }

Original Case

!"!#!$!%

&"&#&$&%

X =

X Y

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

Matrix A is bypassing the cache

Running at one time

2018 USENIX Annual Technical Conference

SpMV with Software Throttling

{ < !" #" > < !$ #" > < !% #" > < !& #" > < !" #% > < !% #% > < !$ #& > < !& #& > }

Original Case

Cache Capacity: 4

!"!#!$!%

&"&#&$&%

X =

X Y

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

!" #" !$ !%!& #% #&

Matrix A is bypassing the cache

Running at one time

2018 USENIX Annual Technical Conference

SpMV with Software Throttling

< !" #" > < !" #$ > < !$ #" > < !$ #$ >

< !% #" > < !% #& > < !& #" > < !& #& >

Tim

e

Throttling Phase 1

Cache Capacity: 4

!"!#!$!%

&"&#&$&%

X =

X Y

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

2018 USENIX Annual Technical Conference

SpMV with Software Throttling

< !" #" > < !" #$ > < !$ #" > < !$ #$ >

Throttling Phase 1

< !% #" > < !% #& > < !& #" > < !& #& >

Cache Capacity: 4

Tim

e

!" #" !$ #$

All Data Can Fit into the Cache

!"!#!$!%

&"&#&$&%

X =

X Y

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

2018 USENIX Annual Technical Conference

SpMV with Software Throttling

< !" #" > < !" #$ > < !$ #" > < !$ #$ >

Throttling Phase 2

< !% #" > < !% #& > < !& #" > < !& #& >

Cache Capacity: 4

Tim

e

!"!#!$!%

&"&#&$&%

X =

X Y

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

2018 USENIX Annual Technical Conference

SpMV with Software Throttling

< !" #" > < !" #$ > < !$ #" > < !$ #$ >

Throttling Phase 2

< !% #" > < !% #& > < !& #" > < !& #& >

Cache Capacity: 4

Tim

e

!% #" !& #&

All Data Can Fit into the Cache

!"!#!$!%

&"&#&$&%

X =

X Y

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

2018 USENIX Annual Technical Conference

What We Need for Software Throttling• An effective partitioning algorithm

– All data items in one partition can fit into the cache– The interaction between different partitions are minimized

• Applicable scheduling methods– Handle the trade-off between throttling and throughput

2018 USENIX Annual Technical Conference

What We Need for Software Throttling• An effective partitioning algorithm

– All data items in one partition can fit into the cache– The interaction between different partitions are minimized

• Applicable scheduling methods– Handle the trade-off between throttling and throughput

2018 USENIX Annual Technical Conference

Graph Representation• Graph Edge Partition Model

– Places an emphasis on Data– Node → Data object– Edge → Interaction between data

1 2 3 4

1 !"," !",$ !",% !",&2 ' ' ' '

3 !%," ' !%,% '

4 ' !&,$ ' !&,&

("($(%(&

)")$)%)&

X =

A X YX1 X2 X3 X4

Y1 Y2 Y3 Y4

2018 USENIX Annual Technical Conference

Why Edge Partition Model?1. Better load balancing

– PowerGraph [OSDI’12], Streaming Edge Partition [KDD’14], SPAC[SIGMETRICS’17]

• Balanced vertex partition is sometimes NOT equal to balanced workload

2. Quantifying the communication cost

3. Applies to a large class of parallel applications– N-body, CFD, Sparse Linear Algebra, Graph Analytics, …

20

2018 USENIX Annual Technical Conference

Different Edge Partition ModelsLoad Balanced Partition Data Balanced Partition

X1 X2 X3 X4

Y1 Y2 Y3 Y4

Partition 1 Partition 2

X1 X2 X3 X4

Y1 Y2 Y3 Y4

Partition 1 Partition 2

# Nodes: 4 # Nodes: 4# Edges: 3 # Edges: 3 Cache-Fit Partition

2018 USENIX Annual Technical Conference

Data Balanced Partition• Recursive Bisection Framework

– 2-way Load Balanced Edge Partition• SPAC [Li+,SIGMETRICS’17]

– Minimize vertex replica (data copy)

X1 X2 X3 X4

Y1 Y2 Y3 Y4

Cache Capacity: 4

2018 USENIX Annual Technical Conference

Data Balanced Partition• Recursive Bisection Framework

– 2-way Load Balanced Edge Partition• SPAC [Li+,SIGMETRICS’17]

– Minimize vertex replica (data copy)

X1 X2 X3 X4

Y1 Y2 Y3 Y4

Partition A Partition B

Cache Capacity: 4

2018 USENIX Annual Technical Conference

Data Balanced Partition• Recursive Bisection Framework

– 2-way Load Balanced Edge Partition• SPAC [Li+,SIGMETRICS’17]

– Minimize vertex replica (data copy)

X1 X2 X3 X4

Y1 Y2 Y3 Y4

Partition A Partition B

Cache Capacity: 4

2018 USENIX Annual Technical Conference

Data Balanced Partition• Recursive Bisection Framework

– 2-way Load Balanced Edge Partition• SPAC [Li+,SIGMETRICS’17]

– Minimize vertex replica (data copy)

Cache Capacity: 4

X2 X3 X4

Y2 Y3 Y4

Partition B Partition C

2018 USENIX Annual Technical Conference

Data Balanced Partition• Recursive Bisection Framework

– 2-way Load Balanced Edge Partition• SPAC [Li+,SIGMETRICS’17]

– Minimize vertex replica (data copy)

Cache Capacity: 4

Partition B

Partition C

X1 X2 X3 X4

Y1 Y2 Y3 Y4

Partition A

2018 USENIX Annual Technical Conference

What We Need for Software Throttling• A good partitioning algorithm

– All data items in one partition can fit into the cache– The interaction between different partitions are minimized

• Applicable scheduling methods– Handle the trade-off between throttling and throughput

2018 USENIX Annual Technical Conference

Effective scheduling methods• Four different scheduling methods

– Cache-Fit (CF)– Cache-Fit Queue (CF-Q)– Split-Join (SJ)– Split-Join Queue (SJ-Q)

2018 USENIX Annual Technical Conference

Effective scheduling methods• Four different scheduling methods

– Cache-Fit (CF)– Cache-Fit Queue (CF-Q)– Split-Join (SJ)– Split-Join Queue (SJ-Q)

2018 USENIX Annual Technical Conference

Cache-Fit (CF) Scheduling• Isolate the computation of different Cache-Fit Partitions• Run one Cache-Fit Partition at one time

TL: tuple list

N: # of tuples

TL’: new tuple list

Pi: # of tuples in TL[i]

Strict Barriers

CUDA Function

2018 USENIX Annual Technical Conference

Low Pipeline Utilization

Kernel 1 -- Partition ACache Capacity: 4

Partition A Partition B

Partition C

X1 X2 X3 X4

Y1 Y2 Y3 Y4

X1

Y1

X1

Y2

X2

Y1

4 Working Threads

X2

Y2

2018 USENIX Annual Technical Conference

Low Throughput

Cache Capacity: 4

Partition A Partition B

Partition C

X1 X2 X3 X4

Y1 Y2 Y3 Y4

X2

Y3

X3

Y3

4 Working Threads

Kernel 2 -- Partition B

IDLE

2018 USENIX Annual Technical Conference

Low Pipeline Utilization

Cache Capacity: 4

Partition A Partition B

Partition C

X1 X2 X3 X4

Y1 Y2 Y3 Y4

X4

Y3

X4

Y4

4 Working Threads

Kernel 3 -- Partition C

IDLE

low pipeline utilization

2018 USENIX Annual Technical Conference

Effective scheduling methods• Four different scheduling methods

– Cache-Fit (CF)– Cache-Fit Queue (CF-Q)– Split-Join (SJ)– Split-Join Queue (SJ-Q)

2018 USENIX Annual Technical Conference

Cache-Fit Queue (CF-Q) Scheduling• Invoke a single kernel call but still

enable throttling

• Set up a FIFO queue • Each entry corresponds to a chunk

– A chunk is part of a cache-fit partition– Adjacent chunks are from the same

Cache-Fit Partition• Each warp fetches a chunk from the

queue and processes it

Cache-Fit Partition 1

Chunk 1

…Chunk 2

Chunk N

Que

ue

Cache-Fit Partition 2

Chunk 1

…Chunk 2

Chunk M

2018 USENIX Annual Technical Conference

Cache-Fit Queue (CF-Q) Scheduling cont.

Cache-Fit Partition 1

Chunk 1

…Chunk 2

Chunk N

Que

ue

Cache-Fit Partition 2

Chunk 1

…Chunk 2

Chunk M

Tim

e

Warp 1 Warp 2

… …Chunk 1

Chunk 4

Chunk 2

Chunk 3

Chunk NChunk 1

Chunk 2Chunk 3

Relaxed Barrier

Chunk 3

Chunk 4

2018 USENIX Annual Technical Conference

Effective scheduling methods• Four different scheduling methods

– Cache-Fit (CF)– Cache-Fit Queue (CF-Q)– Split-Join (SJ)– Split-Join Queue (SJ-Q)

2018 USENIX Annual Technical Conference

Split-Join (SJ) Scheduling• Dynamically merge Cache-Fit Partitions

• Perform an Online Profiling to decide which partitions should be merged

• Use the Tree Representation of the data balanced partition to help the Online Profiling

2018 USENIX Annual Technical Conference

Split-Join (SJ) Scheduling

Cache Capacity: 4

Partition A Partition B

Partition C

X1 X2 X3 X4

Y1 Y2 Y3 Y4

2018 USENIX Annual Technical Conference

Split-Join (SJ) Scheduling

A.stime: 0.5A.btime: 0.5

stime: measured standalone running timebtime: optimal running time on this node

B.stime: 0.2B.btime: 0.2

C.stime: 0.2C.btime: 0.2

All Edges

Partition A Partition B, C

Partition B Partition C

2018 USENIX Annual Technical Conference

Split-Join (SJ) Schedulingstime: measured standalone running timebtime: optimal running time on this node

B.stime: 0.2B.btime: 0.2

C.stime: 0.2C.btime: 0.2

BC.stime: 0.3BC.btime: 0.3

All.stime: 1.2All.btime: 0.8

BC.stime < B.btime + C.btime

All Edges

Partition A Partition B, C

Partition B Partition C

All.stime > A.btime + BC.btime

A.stime: 0.5A.btime: 0.5

2018 USENIX Annual Technical Conference

Effective scheduling methods• Four different scheduling methods

– Cache-Fit (CF)– Cache-Fit Queue (CF-Q)– Split-Join (SJ)– Split-Join Queue (SJ-Q)

2018 USENIX Annual Technical Conference

Split-Join Queue (SJ-Q) Scheduling• Provide strict barriers between different merged partitions

• No barriers inside a merged partition of SJ– No guarantee of the execution order

• Set up one FIFO queue for each merged partition– Provide relaxed barriers between cache-fit partitions from the same

merged partition

2018 USENIX Annual Technical Conference

Four Scheduling Methods Summary

Methods Pipeline Utilization Profiling Barriers Queue Code

ChangeCF Low No Strict No No

CF-Q High No Relaxed Yes YesSJ High Yes Strict No No

SJ-Q High Yes Strict / Relaxed Yes Yes

2018 USENIX Annual Technical Conference

Software Throttling Performance• Experiment Settings

GPU Model Titan X GTX 745Architecture Pascal Maxwell

Core # 5376 576L2 Cache 3MB 2MB

CUDA Version CUDA 8.0 CUDA 8.0

CPU Intel XeonE5-2620

Intel Core i7-4790

2018 USENIX Annual Technical Conference

Benchmarks• Sparse Linear Algebra Workloads

– Sparse Matrix Vector Multiplication (SPMV)

• Graph Processing Workloads– Bellman-Ford (BMF)– Page Rank (PR)

• Neural Network Optimization– Deep Compression: Pruned AlexNet

2018 USENIX Annual Technical Conference

Sparse Matrix Vector Multiplication• Baseline: CUSP• Matrices: Florida Sparse Matrix Collection

– Focus on large matrices: working set cannot fit into L2 cache– 228 large matrices on GTX 745– 192 large matrices on Titan X

2018 USENIX Annual Technical Conference

Overall SPMV Performance

2.018

0.81

1.21.41.61.8

2

Org + R CF SJ CF-Q SJ-Q

Norm

alize

d Sp

eedu

p

Average Speedup For SPMV

GTX 745Titan X

2018 USENIX Annual Technical Conference

Effect of Cache Hit Rate

0

0.5

1

1.5

2

2.5

3

Org + R CF SJ CF-Q SJ-Q

Nor

mal

ized

Spe

edup

40% - 100%30% - 40%20% - 30%10% - 20%0% - 10%

4.16 4.24Titan X

2018 USENIX Annual Technical Conference

Effect of Working Set Size

0

0.5

1

1.5

2

2.5

3

Org + R CF SJ CF-Q SJ-Q

Nor

mal

ized

Spe

edup 1X - 2X (cache size) 2X - 4X

4X - 8X 8X ~ INF

3.02

Titan X

2018 USENIX Annual Technical Conference

Graph Application Performance

0

0.5

1

1.5

2

2.5

3

3.5

Pokec WebGoogle Wikipedia* WikiTalk IMDB RoadCentral RoadCal

Nor

mal

ized

Spee

dup

BMF & PR Performance using SJ w/ overhead

BMFPR

RoadCal is a small Graph

������������ � �� �������

Titan X

2018 USENIX Annual Technical Conference

Graph Application Performance cont.

0102030405060

Pokec

WebGoogle

Wiki

pedia*

Wiki

TalkIM

DB

RoadCentral

RoadCalL2 C

ache

Hit

Rate

(%)

BMF Cache Hit Rate

Original SJ

05

101520253035

Pokec

WebGoogle

Wiki

pedia*

Wiki

TalkIM

DB

RoadCentral

RoadCalL2 C

ache

Hit

Rate

(%)

PR Cache Hit Rate

Original SJ

Titan XTitan X

RoadCal is a small Graph

2018 USENIX Annual Technical Conference

Deep Learning Benchmark• Deep Compression [Han+,ICLR’16]

– Prune AlexNet to remove low weight elements in fully connected layers

– Deep Compression provide us sparse matrices

0

0.5

1

1.5

2

fc6 fc7 fc8 Combined

Norm

alize

d Sp

eedu

p

Speedup for AlexNet using Deep Compression

CFSJCF-QSJ-Q

Titan X

fc7 and fc8 has small vector size

2018 USENIX Annual Technical Conference

Conclusion• We proposed the first locality-aware Software

Throttling framework for GPUs

• Our framework can increase data reuse by improving Temporal Locality

• We exploited the Trade-off between cache performance and pipeline utilization

2018 USENIX Annual Technical Conference