Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops...

34
Alternative GPU friendly assignment algorithms Paul Richmond and Peter Heywood Department of Computer Science The University of Sheffield

Transcript of Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops...

Page 1: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Alternative GPU friendly assignment algorithms

Paul Richmond and Peter HeywoodDepartment of Computer ScienceThe University of Sheffield

Page 2: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Graphics Processing Units (GPUs)

Page 3: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Context: GPU Performance

384 GigaFLOPS

8.74 TeraFLOPS

4992 cores

Serial ComputingParallel Computing

~40 GigaFLOPS

Accelerated Computing

16 cores

1 core

Page 4: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

0.0 TFlops

1.0 TFlops

2.0 TFlops

3.0 TFlops

4.0 TFlops

5.0 TFlops

6.0 TFlops

7.0 TFlops

8.0 TFlops

9.0 TFlops

10.0 TFlops

1 CPU Core GPU (4992 cores)

8.74 TeraFLOPS

~40 GigaFLOPS

6 hours CPU timevs.

1 minute GPUtime

Page 5: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

• Much of the functionality of CPUs is unused for HPC

• Complex Pipelines, Branch prediction, out of order execution, etc.

• Ideally for HPC we want: Simple, Low Power and Highly Parallel cores

Accelerators

Page 6: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

An accelerated system

• Co-processor not a CPU replacement

DRAM GDRAM

CPU GPU/Accelerator

I/O I/OPCIe

Page 7: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Thinking Parallel

• Hardware considerations

• High Memory Latency (PCI-e)

• Huge Numbers of processing cores

• Algorithmic changes required

• High level of parallelism required

• Data parallel != task parallel

“If your problem is not parallel then think again”

Page 8: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Amdahl’s Law

• Speedup of a program is limited by the proportion than can be parallelised

• Addition of processing cores gives diminishing returns

𝑆𝑝𝑒𝑒𝑑𝑢𝑝 𝑆 =1

𝑃𝑁 − (1 − 𝑃)

0

5

10

15

20

25

Sp

ee

du

p (

S)

Number of Processors (N)

P = 25%

P = 50%

P = 90%

P= 95%

Page 9: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

SATALL Optimisation

Page 10: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Profile the Application

• 11 hour runtime

• Function A

• 97.4% runtime

• 2000 calls

• Hardware• Intel Core i7-4770k 3.50GHz

• 16GB DDR3

• Nvidia GeForce Titan X

97.4%

0.3% 0.3% 0.1% 0.1% 0.1%0

5000

10000

15000

20000

25000

30000

35000

40000

45000

A B C D E F

Tim

e (

s)

Function

Time per function for Largest Network (LoHAM)

Page 11: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Function A

• Input

• Network (directed weighted graph)

• Origin-Destination Matrix

• Output

• Traffic flow per edge

• 2 Distinct Steps

1. Single Source Shortest Path (SSSP)All-or-Nothing Path

For each origin in the O-D matrix

2. Flow AccumulationApply the OD value for each trip to each link on the route

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

Serial

Distribution of Runtime (LoHAM)

Flow SSSP

Page 12: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Single Source Shortest Path

For a single Origin Vertex (Centroid)

Find the route to each Destination Vertex

With the Lowest Cumulative Weight (Cost)

Page 13: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Serial SSSP Algorithm

• D’Esopo-Pape (1974)

• Maintains a priority queue of vertices to explore

Highly Serial

• Not a Data-Parallel Algorithm

• We must change algorithm to match the hardware

Pape, U. "Implementation and efficiency of Moore-algorithms for the shortest route problem." Mathematical Programming 7.1 (1974): 212-222.

Page 14: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Parallel SSSP Algorithm

• Bellman-Ford Algorithm (1956)

• Poor serial performance & time complexity

• Performs significantly more work

• Highly Parallel

• Suitable for GPU acceleration

Bellman, Richard. On a routing problem. 1956. 0

5

10

15

20

25

30

Nu

mb

er

of

Ed

ges

Total Edges Considered

Bellman-Ford Desopo-Pape

Page 15: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Implementation

• A2 - Naïve Bellman-Ford using Cuda

• Up to 369x slower

• Striped bars continue off the scale

0.00

10.00

20.00

30.00

40.00

50.00

60.00

70.00

Serial A2

Tim

e (

sec

on

ds)

Optimisation

Time to compute all requied SSSP results per model

Derby CLoHAM LoHAM

Derby 36.5s

CLoHAM 1777.2s

LoHAM 5712.6s

Page 16: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Implementation

• Followed iterative cycle of performance optimisations

• A3 – Early Termination

• A4 – Node Frontier

• A8 – Multiple origins Concurrently

• SSSP for each Origin in the OD matrix

• A10 – Improved load Balancing

• Cooperative Thread Array

• A11 – Improved array access

0.00

10.00

20.00

30.00

40.00

50.00

60.00

70.00

Serial A2 A3 A4 A8 A10 A11

Tim

e (

sec

on

ds)

Optimisation

Time to compute all requied SSSP results per model

Derby CLoHAM LoHAM

Page 17: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Limiting Factor (Function A)

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

Serial

Distribution of Runtime (CLoHAM)

Flow SSSP

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

Serial

Distribution of Runtime (LoHAM)

Flow SSSP

Page 18: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Limiting Factor (Function A)

• Limiting Factor has now changed

• Need to parallelise Flow Accumulation

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

Serial A11

Distribution of Runtime (CLoHAM)

Flow SSSP

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

Serial A11

Distribution of Runtime (LoHAM)

Flow SSSP

Page 19: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Flow Accumulation

• Shortest Path + OD = Flow-per-link

For each origin-destination pair

Trace the route from the destination to the originincreasing the flow value for each link visited

• Parallel problem

• But requires synchronised access to shared data structure for all trips (atomic operations)

0 1 2 …

0 0 2 3 …

1 1 0 4 …

2 2 5 0 …

… … … …

Link 0 1 2 3 …

Flow 9 8 6 9 …

Page 20: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

• Problem: • A12 - lots of atomic operations serialise

the execution

0

1

2

3

4

5

6

Serial A12

Tim

e (

Se

co

nd

s)

Time to compute Flow values per link

Derby CLoHAM LoHAM

Flow Accumulation

Page 21: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Flow Accumulation

• Problem: • A12 - lots of atomic operations serialise

the execution

• Solutions: • A15 - Reduce number of atomic

operations

• Solve in batches using parallel reduction

• A14 - Use fast hardware-supported single precision atomics

• Minimise loss of precision using multiple

32-bit summations

• 0.000022% total error

0

1

2

3

4

5

6

Serial A12 A15(double)

A14(single)

Tim

e (

Se

co

nd

s)

Time to compute Flow values per link

Derby CLoHAM LoHAM

Page 22: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Integrated Results

Page 23: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Assignment Speedup relative to Serial

Serial

• LoHAM – 12h 12m

Double precision

• LoHAM – 35m 22s

Single precision

• Reduced loss of precision

• LoHAM – 17m 32s

Hardware:

• Intel Core i7-4770K

• 16GB DDR3

• Nvidia GeForce Titan X 1.0

0

3.4

2

3.7

3

6.1

4

1.0

0

5.5

1

11.2

4

18.2

4

1.0

0

6.8

1

20

.70

41.

77

0

5

10

15

20

25

30

35

40

45

Serial Multicore A15 A14(single)

Ass

ign

me

nt

Tim

e r

ela

tive

to

se

rial

Relative Assignment Runtime Performance vs Serial

Derby CLoHAM LoHAM

Page 24: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Assignment Speedup relative to Multicore

Multicore

• LoHAM – 1h 47m

Double precision

• LoHAM – 35m 22s

Single precision

• Reduced loss of precision

• LoHAM – 17m 32s

Hardware:

• Intel Core i7-4770K

• 16GB DDR3

• Nvidia GeForce Titan X

0.2

9

1.0

0

1.0

9

1.79

0.1

8

1.0

0

2.0

4

3.3

1

0.1

5

1.0

0

3.0

4

6.1

4

0

1

2

3

4

5

6

7

Serial Multicore A15 A14(single)

Ass

ign

me

nt

Tim

e r

ela

tive

to

mu

ltic

ore

Relative Assignment Runtime Performance vs Multicore

Derby CLoHAM LoHAM

Page 25: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

GPU Computing at UoS

Page 26: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Expertise at Sheffield• Specialists in GPU Computing and

performance optimisation

• Complex Systems Simulations via FLAME and FLAME GPU

• Visual Simulation, Computer Graphics and Virtual Reality

• Training and Education for GPU Computing

Page 27: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Thank You

Paul Richmond

[email protected]

paulrichmond.shef.ac.uk

Peter Heywood

[email protected]

ptheywood.uk

RuntimeSpeedup

SerialSpeedup

Multicore

Serial 12:12:24 1.00 0.15

Multicore 01:47:36 6.81 1.00

A15 (double precision)

00:35:22 20.70 3.04

A14 (single precision)

00:17:32 41.77 6.14

Largest Model (LoHAM) results

Page 28: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Backup Slides

Page 29: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Benchmark Models

• 3 Benchmark networks

• Range of sizes

• Small to V. Large

• Up to 12 hour runtime

Model Vertices (Nodes)

Edges (Links)

O-D trips

Derby 2700 25385 547²

CLoHAM 15179 132600 2548²

LoHAM 18427 192711 5194²0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

Serial Salford Multicore

Tim

e (

s)

Version

Benchmark model performance

Derby CLoHAM LoHAM

Page 30: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Edges considered per algorithm

0

1

2

3

4

5

6

7

0 1 2 3 4

Nu

mb

er

of

Ed

ges

Iteration

Edges Considered Per Iteration

Bellman-Ford Desopo-Pape

0

5

10

15

20

25

30

Nu

mb

er

of

Ed

ges

Total Edges Considered

Bellman-Ford Desopo-Pape

Page 31: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Vertex Frontier (A4)

• Only Vertices which were updated in the previous iteration can result in an update

• Much fewer threads launched per iteration

• Up to 2500 instead of 18427 per iteration

0

500

1000

1500

2000

2500

3000

Fro

nti

er

Siz

eIteration

Frontier size for each element for source 1

Derby CLoHAM LoHAM

Page 32: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Multiple Concurrent Origins (A8)

0

500

1000

1500

2000

2500

3000

Fro

nti

er

Siz

e

Iteration

Frontier size for each element for source 1

Derby CLoHAM LoHAM

0

2000000

4000000

6000000

8000000

10000000

12000000

14000000

16000000

18000000

Fro

nti

er

Siz

e

Iteration

Frontier Size for all concurrent sources

Derby CLoHAM LoHAM

Page 33: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Atomic Contention

• Atomic operations are guaranteed to occur

• Atomic Contention

• multiple threads atomically modify same address

• Serialised!

• atomicAdd(double) not implemented in hardware

• Not yet

• Solutions

1. Algorithmic change to minimise atomic contention

2. Single precision

0

5000000

10000000

15000000

20000000

25000000

30000000

0 10 20 30 40 50 60 70 80 90 100

Ato

mic

Co

nte

stan

ce

Atomic Contestance per Iteration

Cloham Loham

Page 34: Alternative GPU friendly assignment algorithms€¦ · 0.0 TFlops 1.0 TFlops 2.0 TFlops 3.0 TFlops 4.0 TFlops 5.0 TFlops 6.0 TFlops 7.0 TFlops 8.0 TFlops 9.0 TFlops 10.0 TFlops 1

Raw Performance

Hardware:

• Intel Core i7-4770K

• 16GB DDR3

• Nvidia GeForce Titan X

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

Serial Multicore A8 A10 A11 A15 A13(single)

A14(single)

Tim

e (

sec

on

ds)

Assignment runtime per algorithm

Derby CLoHAM LoHAM