Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue...

Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs

Yanhao Chen1, Ari B. Hayes1, Chi Zhang2, Timothy Salmon1, Eddy Z. Zhang1

1. Rutgers University2. University of Pittsburgh

2018 USENIX Annual Technical Conference

Sparse Matrix• Sparse Linear Systems

- CG- GMRES- …

• Physics Based Simulations- CFD

• Deep Learning Optimizations- Pruned Neural Networks

• ……

Sparse Matrix Operation

y=A xSparse Matrix

AInput Vector

xOutput Vector

&' = ()*+,) { .'/ ⨀ 1/}

Binary Operator3 ∈ 1, … ,8 , 9 ∈ [1, … , ;]

Sparse Matrix Vector Multiplication (SpMV)

!"#$%" = sum⨀ = ∗

,- = ./0{2-,4 ∗ 54}

78,8 78,9 78,: 78,;

< < < <

7:,8 < 7:,: <

< 7;,9 < 7;,;

=8=9=:=;

>8>9>:>;

,- = !"#$%"{2-4⨀54}

,: = 2:,8 ∗ 58 + 2:,: ∗ 5:@ ∈ 1, … ,D , E ∈ [1, … , G]

Single Source Shortest Path (SSSP)

!"#$%" = min⨀ = +

,- = ./0{23- + 43}

,- = !"#$%"{2-3⨀43}

S 1 2 3 4 5S12345

67 = 89:{;7 , 2=,7+;= , 2>,7+;> }

Adjacent Matrix A

y: distance vector of j th iteration

x ∶ (previous) distance vector of j-1 th iterationA ∈ 1, … ,E , F ∈ [1, … , H]

Problem with Sparsity on GPUs• Low data reuse is always a big problem

• e.g. SpMV

– The input vector and the output vector can be reused a lot

– They are usually too large to fit into GPU’s cache

– The sparsity of the matrix causes irregular accesses of the vectors

– This means low reuse of the data in the cache

!" = $%&{(",* ∗ ,*}

Existing Methods to Improve Data Reuse on GPUs

• Warp Scheduling Policy– Throttling concurrent threads– Limits the number of active warps [Rogers+, MICRO’12]– DYNCTA: controls the number of CTAs [Kayiran+,PACT’13]

• Computation and Data Layout Transformation– Reduce irregular memory accesses– Improve Memory Coalescing [Zhang+, ICS’10]

Need Hardware Modification!

Only focus on Spatial Data Reuse!

Ø Is the First Software Throttling implementation

Ø Is focused on Temporal Data Reuse

Ø Exploits the Trade-off between throttling performance and GPU throughput

SpMV with Software Throttling !"!#!$!%

&"&#&$&%

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

SpMV with Software Throttling !"!#!$!%

&"&#&$&%

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

Matrix A is bypassing the cache

SpMV with Software Throttling

{ < !" #" > < !$ #" > < !% #" > < !& #" > < !" #% > < !% #% > < !$ #& > < !& #& > }

Original Case

!"!#!$!%

&"&#&$&%

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

Running at one time

{ < !" #" > < !$ #" > < !% #" > < !& #" > < !" #% > < !% #% > < !$ #& > < !& #& > }

Original Case

Cache Capacity: 4

!"!#!$!%

&"&#&$&%

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

!" #" !$ !%!& #% #&

Running at one time

< !" #" > < !" #$ > < !$ #" > < !$ #$ >

< !% #" > < !% #& > < !& #" > < !& #& >

Throttling Phase 1

Cache Capacity: 4

!"!#!$!%

&"&#&$&%

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

< !" #" > < !" #$ > < !$ #" > < !$ #$ >

Throttling Phase 1

< !% #" > < !% #& > < !& #" > < !& #& >

Cache Capacity: 4

!" #" !$ #$

All Data Can Fit into the Cache

!"!#!$!%

&"&#&$&%

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

< !" #" > < !" #$ > < !$ #" > < !$ #$ >

Throttling Phase 2

< !% #" > < !% #& > < !& #" > < !& #& >

Cache Capacity: 4

!"!#!$!%

&"&#&$&%

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

< !" #" > < !" #$ > < !$ #" > < !$ #$ >

Throttling Phase 2

< !% #" > < !% #& > < !& #" > < !& #& >

Cache Capacity: 4

!% #" !& #&

All Data Can Fit into the Cache

!"!#!$!%

&"&#&$&%

'"," '",# '",$ '",%

) ) ) )

'$," ) '$,$ )

) '%,# ) '%,%A

What We Need for Software Throttling• An effective partitioning algorithm

– All data items in one partition can fit into the cache– The interaction between different partitions are minimized

• Applicable scheduling methods– Handle the trade-off between throttling and throughput

What We Need for Software Throttling• An effective partitioning algorithm

Graph Representation• Graph Edge Partition Model

– Places an emphasis on Data– Node → Data object– Edge → Interaction between data

1 2 3 4

1 !"," !",$ !",% !",&2 ' ' ' '

3 !%," ' !%,% '

4 ' !&,$ ' !&,&

("($(%(&

)")$)%)&

A X YX1 X2 X3 X4

Y1 Y2 Y3 Y4

Why Edge Partition Model?1. Better load balancing

– PowerGraph [OSDI’12], Streaming Edge Partition [KDD’14], SPAC[SIGMETRICS’17]

• Balanced vertex partition is sometimes NOT equal to balanced workload

2. Quantifying the communication cost

3. Applies to a large class of parallel applications– N-body, CFD, Sparse Linear Algebra, Graph Analytics, …

Different Edge Partition ModelsLoad Balanced Partition Data Balanced Partition

X1 X2 X3 X4

Y1 Y2 Y3 Y4

Partition 1 Partition 2

X1 X2 X3 X4

Y1 Y2 Y3 Y4

Partition 1 Partition 2

# Nodes: 4 # Nodes: 4# Edges: 3 # Edges: 3 Cache-Fit Partition

Data Balanced Partition• Recursive Bisection Framework

– 2-way Load Balanced Edge Partition• SPAC [Li+,SIGMETRICS’17]

– Minimize vertex replica (data copy)

X1 X2 X3 X4

Y1 Y2 Y3 Y4

Cache Capacity: 4

X1 X2 X3 X4

Y1 Y2 Y3 Y4

Partition A Partition B

Cache Capacity: 4

X1 X2 X3 X4

Y1 Y2 Y3 Y4

Cache Capacity: 4

X2 X3 X4

Y2 Y3 Y4

Partition B Partition C

Cache Capacity: 4

Partition B

Partition C

X1 X2 X3 X4

Y1 Y2 Y3 Y4

Partition A

What We Need for Software Throttling• A good partitioning algorithm

Effective scheduling methods• Four different scheduling methods

– Cache-Fit (CF)– Cache-Fit Queue (CF-Q)– Split-Join (SJ)– Split-Join Queue (SJ-Q)

Cache-Fit (CF) Scheduling• Isolate the computation of different Cache-Fit Partitions• Run one Cache-Fit Partition at one time

TL: tuple list

N: # of tuples

TL’: new tuple list

Pi: # of tuples in TL[i]

Strict Barriers

CUDA Function

Low Pipeline Utilization

Kernel 1 -- Partition ACache Capacity: 4

Partition C

X1 X2 X3 X4

Y1 Y2 Y3 Y4

4 Working Threads

Low Throughput

Cache Capacity: 4

Partition C

X1 X2 X3 X4

Y1 Y2 Y3 Y4

4 Working Threads

Kernel 2 -- Partition B

Low Pipeline Utilization

Cache Capacity: 4

Partition C

X1 X2 X3 X4

Y1 Y2 Y3 Y4

4 Working Threads

Kernel 3 -- Partition C

low pipeline utilization

Cache-Fit Queue (CF-Q) Scheduling• Invoke a single kernel call but still

enable throttling

• Set up a FIFO queue • Each entry corresponds to a chunk

– A chunk is part of a cache-fit partition– Adjacent chunks are from the same

Cache-Fit Partition• Each warp fetches a chunk from the

queue and processes it

Cache-Fit Partition 1

Chunk 1

…Chunk 2

Chunk N

Chunk 1

…Chunk 2

Chunk M

Cache-Fit Queue (CF-Q) Scheduling cont.

Chunk 1

…Chunk 2

Chunk N

Chunk 1

…Chunk 2

Chunk M

Warp 1 Warp 2

… …Chunk 1

Chunk 4

Chunk 2

Chunk 3

Chunk NChunk 1

Chunk 2Chunk 3

Relaxed Barrier

Chunk 3

Chunk 4

Split-Join (SJ) Scheduling• Dynamically merge Cache-Fit Partitions

• Perform an Online Profiling to decide which partitions should be merged

• Use the Tree Representation of the data balanced partition to help the Online Profiling

Split-Join (SJ) Scheduling

Cache Capacity: 4

Partition C

X1 X2 X3 X4

Y1 Y2 Y3 Y4

Split-Join (SJ) Scheduling

A.stime: 0.5A.btime: 0.5

stime: measured standalone running timebtime: optimal running time on this node

B.stime: 0.2B.btime: 0.2

C.stime: 0.2C.btime: 0.2

All Edges

Partition A Partition B, C

Split-Join (SJ) Schedulingstime: measured standalone running timebtime: optimal running time on this node

B.stime: 0.2B.btime: 0.2

C.stime: 0.2C.btime: 0.2

BC.stime: 0.3BC.btime: 0.3

All.stime: 1.2All.btime: 0.8

BC.stime < B.btime + C.btime

All Edges

Partition A Partition B, C

All.stime > A.btime + BC.btime

A.stime: 0.5A.btime: 0.5

Split-Join Queue (SJ-Q) Scheduling• Provide strict barriers between different merged partitions

• No barriers inside a merged partition of SJ– No guarantee of the execution order

• Set up one FIFO queue for each merged partition– Provide relaxed barriers between cache-fit partitions from the same

merged partition

Four Scheduling Methods Summary

Methods Pipeline Utilization Profiling Barriers Queue Code

ChangeCF Low No Strict No No

CF-Q High No Relaxed Yes YesSJ High Yes Strict No No

SJ-Q High Yes Strict / Relaxed Yes Yes

Software Throttling Performance• Experiment Settings

GPU Model Titan X GTX 745Architecture Pascal Maxwell

Core # 5376 576L2 Cache 3MB 2MB

CUDA Version CUDA 8.0 CUDA 8.0

CPU Intel XeonE5-2620

Intel Core i7-4790

Benchmarks• Sparse Linear Algebra Workloads

– Sparse Matrix Vector Multiplication (SPMV)

• Graph Processing Workloads– Bellman-Ford (BMF)– Page Rank (PR)

• Neural Network Optimization– Deep Compression: Pruned AlexNet

Sparse Matrix Vector Multiplication• Baseline: CUSP• Matrices: Florida Sparse Matrix Collection

– Focus on large matrices: working set cannot fit into L2 cache– 228 large matrices on GTX 745– 192 large matrices on Titan X

Overall SPMV Performance

1.21.41.61.8

Org + R CF SJ CF-Q SJ-Q

Average Speedup For SPMV

GTX 745Titan X

Effect of Cache Hit Rate

40% - 100%30% - 40%20% - 30%10% - 20%0% - 10%

4.16 4.24Titan X

Effect of Working Set Size

edup 1X - 2X (cache size) 2X - 4X

4X - 8X 8X ~ INF

Titan X

Graph Application Performance

Pokec WebGoogle Wikipedia* WikiTalk IMDB RoadCentral RoadCal

BMF & PR Performance using SJ w/ overhead

RoadCal is a small Graph

��

Titan X

Graph Application Performance cont.

0102030405060

WebGoogle

pedia*

TalkIM

RoadCentral

RoadCalL2 C

BMF Cache Hit Rate

Original SJ

101520253035

WebGoogle

pedia*

TalkIM

RoadCentral

RoadCalL2 C

PR Cache Hit Rate

Original SJ

Titan XTitan X

RoadCal is a small Graph

Deep Learning Benchmark• Deep Compression [Han+,ICLR’16]

– Prune AlexNet to remove low weight elements in fully connected layers

– Deep Compression provide us sparse matrices

fc6 fc7 fc8 Combined

Speedup for AlexNet using Deep Compression

CFSJCF-QSJ-Q

Titan X

fc7 and fc8 has small vector size

Conclusion• We proposed the first locality-aware Software

Throttling framework for GPUs

• Our framework can increase data reuse by improving Temporal Locality

• We exploited the Trade-off between cache performance and pipeline utilization

Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue...

Documents

Transcript of Locality-Aware Software Throttling for Sparse Matrix ... · –Split-Join (SJ) –Split-Join Queue...

Staffing Approaches: Split Staffing · 2020. 12. 3. · Staffing Approaches: Split Staffing Session 3 | Friday , November 20, 2020 As you join us on this webinar, please: 1. Mute

CMU SCS 15-721 :: Join Algorithms (Hashing)CMU 15-721 (Spring 2016) PARTITION PHASE . Split the input relations into partitioned buffers by hashing the tuples’ join key(s). →The

Sj Streets

Recognition of split-graphic sequences · 2015. 2. 15. · Recognition of split-graphic sequences 253 The join [12,28,66] of two graphs Gand His denoted by G+ H. It has the following

Surftest SJ-210 / SJ-310 - TranscatJ-2 Surftest SJ-210 / SJ-310 SERIES 178 — Portable Surface Roughness Tester FEATURES: SJ-210 sThe 2.4-inch color graphic LCD provides excellent

The Java Fork-Join Pool Framework: Work Stealingschmidt/cs891f/2019-PDFs/10.3.2-ForkJoin… · This behavior arises from “divide & conquer” nature of fork-join tasks that split

Surftest SJ-210 / SJ-310 - MSI Viking...J-2 Surftest SJ-210 / SJ-310 SERIES 178 — Portable Surface Roughness Tester FEATURES: SJ-210 s The 2.4-inch color graphic LCD provides excellent

Pull up a chair. Take a taste. Come join us. Life is so ...€¦ · Leche Flan Banana Split Needs no introduction. It’s a banana split. Cotton Candy Dessert Cup Choco Fudge Mud

Sj Engineering

MODEL SJ-FP624V SJ-FB624V SJ-FP676Vsupport.sharp.net.au/downloads/opmanuals/SJFP624V-SJFB624V-SJ… · Door alarm indication This indication shows door alarm “ON”. Feature icons

Parallel Programming Course Threading Building Blocks … · Parallel Programming Course Threading Building Blocks ... split/join nodes ... [, task_group_context& group] ) #include

SC193 Perth - Inverness - WhatDoTheyKnow · Stanley Junction SB (SJ) NRN 094 P3 UM DM P3 157 SJ l3 SJ l4 158 UM DM SJ l5 SJ l6 Stanley Junction SB 7m 0160yds Tokenless Block Stanley

Totnes & ridgetown Parish Magazine, March 2018...Easter Day TOTNES JO SJ JO/SJ JO/SJ JO 8.00 JL SJ SJ JL JL+ TG Iona BRIDGE-JL TOWN 9.30 /11.15 TOTNES JO/SJ JO JO/SJ JO/SJ JO 11.15

SJ Presentation

SJ PROPERTIES SUITES, BUYCO, EHF., SJ … · uni ted s at s di rict cour eastern district of wisconsin sj properties suites, buyco, ehf., sj-fasteignir, ehf, and, askar capital, hf,

SJ 07_20160620_0004

SJ in brief SJ SJ in brief…rs-och...SJ AB Annual Report 2010 Annual Review SJ AB Annual Report 2010 Annual Review SJ SJ SJ in brief SJ is a customer-oriented, modern and profitable

SPONGE-JET 170-SJ Feed Unit 470-SJ Feed Unit

Brainrulespzreview sj

SJ-48H-S SJ-51H-S SJ-55H-S SERVICE MANUAL...1 SJ-48H-S SJ-51H-S SJ-55H-S This equipment complies with the requirements of Directives 89/336/EEC and 73/23/EEC as amended by 93/68/EEC.