
Alternative GPU friendly assignment algorithms

Paul Richmond and Peter Heywood, Department of Computer Science, The University of Sheffield

Graphics Processing Units (GPUs)

Context: GPU Performance

Figure: Serial computing (1 core, ~40 GigaFLOPS), parallel computing (16 cores, 384 GigaFLOPS) and accelerated computing (GPU, 4992 cores, 8.74 TeraFLOPS); bar chart comparing 1 CPU core (~40 GigaFLOPS) with a 4992-core GPU (8.74 TeraFLOPS) on a 0-10 TFlops scale; example workload: 6 hours of CPU time vs. 1 minute of GPU time.

• Much of the functionality of CPUs is unused for HPC

• Complex pipelines, branch prediction, out-of-order execution, etc.

• Ideally for HPC we want simple, low-power and highly parallel cores

Accelerators

An accelerated system

• Co-processor not a CPU replacement

Diagram: CPU (with DRAM and I/O) connected over PCIe to a GPU/accelerator (with GDRAM and I/O).

Thinking Parallel

• Hardware considerations

• High Memory Latency (PCI-e)

• Huge Numbers of processing cores

• Algorithmic changes required

• High level of parallelism required

• Data parallel != task parallel

“If your problem is not parallel then think again”

Amdahl’s Law

• Speedup of a program is limited by the proportion that can be parallelised

• Addition of processing cores gives diminishing returns

Speedup: $S = \dfrac{1}{(1 - P) + \frac{P}{N}}$, where $P$ is the parallelisable proportion and $N$ the number of processors.

Figure: Amdahl's law speedup S against number of processors N, plotted for P = 25%, 50%, 90% and 95%.
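A worked example (illustrative only, combining the P = 95% curve above with the 4992-core GPU from the earlier slide): even with thousands of cores, a 95% parallel program cannot exceed a 20x speedup.

\[
S = \frac{1}{(1 - P) + \frac{P}{N}} = \frac{1}{0.05 + \frac{0.95}{4992}} \approx 19.9,
\qquad
\lim_{N \to \infty} S = \frac{1}{1 - P} = 20
\]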

SATALL Optimisation

Profile the Application

• 11 hour runtime

• Function A

• 97.4% runtime

• 2000 calls

• Hardware

• Intel Core i7-4770K 3.50GHz

• 16GB DDR3

• Nvidia GeForce Titan X

Chart: Time per function for the largest network (LoHAM). Function A accounts for 97.4% of runtime; functions B-F each account for 0.3% or less.

Function A

• Input

• Network (directed weighted graph)

• Origin-Destination Matrix

• Output

• Traffic flow per edge

• 2 Distinct Steps

1. Single Source Shortest Path (SSSP), all-or-nothing path, for each origin in the O-D matrix

2. Flow Accumulation: apply the OD value for each trip to each link on the route (both steps are sketched below)
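A minimal structural sketch of these two steps in their serial form (illustrative only; the names sssp, pred_vertex, pred_edge and od are assumptions, not the SATALL source):

    #include <vector>

    // Step 1 is assumed to be provided elsewhere: for a given origin it fills, for every
    // destination, the predecessor vertex and predecessor edge on the shortest path.
    void sssp(int origin, std::vector<int>& pred_vertex, std::vector<int>& pred_edge);

    // Function A outline: one SSSP per origin, then flow accumulation along each route.
    void assign_all_or_nothing(int n_origins, int n_dests,
                               const std::vector<std::vector<double>>& od,  // O-D demand matrix
                               std::vector<double>& link_flow) {            // output: flow per link
        std::vector<int> pred_vertex(n_dests), pred_edge(n_dests);
        for (int origin = 0; origin < n_origins; ++origin) {
            sssp(origin, pred_vertex, pred_edge);                 // 1. all-or-nothing shortest paths
            for (int dest = 0; dest < n_dests; ++dest) {          // 2. flow accumulation
                for (int v = dest; v != origin; v = pred_vertex[v])
                    link_flow[pred_edge[v]] += od[origin][dest];  // add the trip's demand to each link
            }
        }
    }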

Chart: Distribution of runtime (LoHAM), serial version, split between Flow and SSSP.

Single Source Shortest Path

For a single origin vertex (centroid), find the route to each destination vertex with the lowest cumulative weight (cost).

Serial SSSP Algorithm

• D’Esopo-Pape (1974)

• Maintains a double-ended queue of vertices to explore (sketched below)

Highly Serial

• Not a Data-Parallel Algorithm

• We must change the algorithm to match the hardware

Pape, U. "Implementation and efficiency of Moore-algorithms for the shortest route problem." Mathematical Programming 7.1 (1974): 212-222.
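A compact serial sketch of the D'Esopo-Pape approach (assuming a CSR graph layout and illustrative names; not the SATALL implementation). The sequential, queue-driven control flow is what makes it a poor match for thousands of GPU threads:

    #include <deque>
    #include <vector>
    #include <limits>

    // D'Esopo-Pape label-correcting SSSP: a vertex whose label improves is (re-)queued,
    // at the front of the queue if it has already been scanned once before.
    void desopo_pape(int n_vertices, const std::vector<int>& row_offset,
                     const std::vector<int>& col_index, const std::vector<double>& weight,
                     int source, std::vector<double>& dist) {
        dist.assign(n_vertices, std::numeric_limits<double>::infinity());
        std::vector<int> state(n_vertices, 0);  // 0: never queued, 1: in queue, 2: previously queued
        std::deque<int> queue;
        dist[source] = 0.0;
        queue.push_back(source);
        state[source] = 1;
        while (!queue.empty()) {
            int u = queue.front();
            queue.pop_front();
            state[u] = 2;
            for (int e = row_offset[u]; e < row_offset[u + 1]; ++e) {
                int v = col_index[e];
                double candidate = dist[u] + weight[e];
                if (candidate < dist[v]) {
                    dist[v] = candidate;
                    if (state[v] == 0)      queue.push_back(v);   // first time: back of queue
                    else if (state[v] == 2) queue.push_front(v);  // seen before: front of queue
                    state[v] = 1;
                }
            }
        }
    }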

Parallel SSSP Algorithm

• Bellman-Ford Algorithm (1956)

• Poor serial performance & time complexity

• Performs significantly more work

• Highly Parallel

• Suitable for GPU acceleration

Bellman, Richard. On a routing problem. 1956.

Chart: Total number of edges considered, Bellman-Ford vs D'Esopo-Pape.

Implementation

• A2 – Naïve Bellman-Ford using CUDA (see the kernel sketch below)

• Up to 369x slower

• Striped bars continue off the scale
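A minimal sketch of what an A2-style naïve Bellman-Ford relaxation kernel could look like (one thread per edge; integer edge costs are assumed so atomicMin can be used; all names are illustrative, not the SATALL code):

    #include <climits>

    // Relax every edge in parallel; launched repeatedly from the host until nothing changes.
    __global__ void relax_all_edges(int n_edges, const int* src, const int* dst,
                                    const int* weight, int* dist, int* changed) {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= n_edges) return;
        int du = dist[src[e]];
        if (du == INT_MAX) return;                // source of this edge not reached yet
        int candidate = du + weight[e];
        if (candidate < dist[dst[e]]) {
            atomicMin(&dist[dst[e]], candidate);  // safe under concurrent updates to the same vertex
            *changed = 1;                         // flag that another iteration is needed
        }
    }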

Chart: Time to compute all required SSSP results per model (seconds, 0-70 s axis), Serial vs A2, for Derby, CLoHAM and LoHAM (data labels: Derby 36.5s, CLoHAM 1777.2s, LoHAM 5712.6s).

Implementation

• Followed an iterative cycle of performance optimisations (an A3-style host loop is sketched after this list)

• A3 – Early termination

• A4 – Node frontier

• A8 – Multiple origins concurrently

• SSSP for each origin in the OD matrix

• A10 – Improved load balancing

• Cooperative Thread Array

• A11 – Improved array access
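An A3-style early-termination driver could then look like the host loop below (a sketch under the same assumptions, reusing the relax_all_edges kernel from the earlier sketch; the loop stops as soon as a full pass makes no updates):

    // Iterate edge relaxation until a whole pass changes nothing (early termination).
    // Assumes d_dist was initialised to INT_MAX everywhere except 0 at the origin vertex.
    void sssp_early_termination(int n_edges, const int* d_src, const int* d_dst,
                                const int* d_weight, int* d_dist) {
        int* d_changed;
        cudaMalloc((void**)&d_changed, sizeof(int));
        const int block = 256;
        const int grid = (n_edges + block - 1) / block;
        int h_changed = 1;
        while (h_changed) {
            cudaMemset(d_changed, 0, sizeof(int));
            relax_all_edges<<<grid, block>>>(n_edges, d_src, d_dst, d_weight, d_dist, d_changed);
            cudaMemcpy(&h_changed, d_changed, sizeof(int), cudaMemcpyDeviceToHost);
        }
        cudaFree(d_changed);
    }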

Chart: Time to compute all required SSSP results per model (seconds) for Serial, A2, A3, A4, A8, A10 and A11, for Derby, CLoHAM and LoHAM.

Limiting Factor (Function A)

Charts: Distribution of serial runtime between Flow and SSSP for CLoHAM and LoHAM.

Limiting Factor (Function A)

• Limiting Factor has now changed

• Need to parallelise Flow Accumulation

Charts: Distribution of runtime between Flow and SSSP, Serial vs A11, for CLoHAM and LoHAM.

Flow Accumulation

• Shortest Path + OD = Flow-per-link

For each origin-destination pair, trace the route from the destination to the origin, increasing the flow value for each link visited.

• Parallel problem

• But requires synchronised access to shared data structure for all trips (atomic operations)

Example O-D matrix (rows: origins, columns: destinations):

      0  1  2  …
  0   0  2  3  …
  1   1  0  4  …
  2   2  5  0  …
  …   …  …  …

Example resulting flow per link:

  Link  0  1  2  3  …
  Flow  9  8  6  9  …

• Problem: A12 – lots of atomic operations serialise the execution (a naïve atomic kernel is sketched below)
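An A12-style naïve kernel might look like the sketch below (one thread per destination of the current origin, single-precision flows for brevity; every thread whose route shares a link hits the same atomicAdd, which is the serialisation problem described above; all names are assumptions):

    // Walk each destination's route back to the origin, adding its O-D demand to every link.
    __global__ void accumulate_flow_naive(int n_dests, int origin, const int* pred_vertex,
                                          const int* pred_edge, const float* od_demand,
                                          float* link_flow) {
        int d = blockIdx.x * blockDim.x + threadIdx.x;
        if (d >= n_dests || d == origin) return;
        float demand = od_demand[d];
        for (int v = d; v != origin; v = pred_vertex[v])
            atomicAdd(&link_flow[pred_edge[v]], demand);  // heavily contended on busy links
    }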

Chart: Time to compute flow values per link (seconds), Serial vs A12, for Derby, CLoHAM and LoHAM.

Flow Accumulation

• Problem: A12 – lots of atomic operations serialise the execution

• Solutions:

• A15 – Reduce the number of atomic operations

• Solve in batches using parallel reduction

• A14 – Use fast hardware-supported single-precision atomics (one reading of this is sketched below)

• Minimise loss of precision using multiple 32-bit summations

• 0.000022% total error
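One possible reading of the "multiple 32-bit summations" idea (an interpretive sketch only; the bin count and layout are assumptions): spread each link's contributions over several single-precision partial sums, which also dilutes contention, then combine them in double precision.

    #define N_BINS 8  // number of 32-bit partial sums kept per link (assumed)

    // Single-precision atomics into one of several partial accumulators per link.
    __global__ void accumulate_flow_binned(int n_dests, int origin, const int* pred_vertex,
                                           const int* pred_edge, const float* od_demand,
                                           float* link_flow_bins) {
        int d = blockIdx.x * blockDim.x + threadIdx.x;
        if (d >= n_dests || d == origin) return;
        int bin = d % N_BINS;                   // spread trips across the partial sums
        float demand = od_demand[d];
        for (int v = d; v != origin; v = pred_vertex[v])
            atomicAdd(&link_flow_bins[pred_edge[v] * N_BINS + bin], demand);
    }

    // Combine the partial sums per link in double precision to limit rounding error.
    __global__ void reduce_link_bins(int n_links, const float* link_flow_bins, double* link_flow) {
        int l = blockIdx.x * blockDim.x + threadIdx.x;
        if (l >= n_links) return;
        double total = 0.0;
        for (int b = 0; b < N_BINS; ++b)
            total += link_flow_bins[l * N_BINS + b];
        link_flow[l] = total;
    }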

Chart: Time to compute flow values per link (seconds) for Serial, A12, A15 (double) and A14 (single), for Derby, CLoHAM and LoHAM.

Integrated Results

Assignment Speedup relative to Serial

Serial

• LoHAM – 12h 12m

Double precision

• LoHAM – 35m 22s

Single precision

• Reduced loss of precision

• LoHAM – 17m 32s

Hardware:

• Intel Core i7-4770K

• 16GB DDR3

• Nvidia GeForce Titan X

Chart: Relative assignment runtime performance vs Serial (speedup factors):

  Derby: Serial 1.00, Multicore 3.42, A15 3.73, A14 (single) 6.14
  CLoHAM: Serial 1.00, Multicore 5.51, A15 11.24, A14 (single) 18.24
  LoHAM: Serial 1.00, Multicore 6.81, A15 20.70, A14 (single) 41.77

Assignment Speedup relative to Multicore

Multicore

• LoHAM – 1h 47m

Double precision

• LoHAM – 35m 22s

Single precision

• Reduced loss of precision

• LoHAM – 17m 32s

Hardware:

• Intel Core i7-4770K

• 16GB DDR3

• Nvidia GeForce Titan X

Chart: Relative assignment runtime performance vs Multicore (speedup factors):

  Derby: Serial 0.29, Multicore 1.00, A15 1.09, A14 (single) 1.79
  CLoHAM: Serial 0.18, Multicore 1.00, A15 2.04, A14 (single) 3.31
  LoHAM: Serial 0.15, Multicore 1.00, A15 3.04, A14 (single) 6.14

GPU Computing at UoS

Expertise at Sheffield

• Specialists in GPU Computing and performance optimisation

• Complex Systems Simulations via FLAME and FLAME GPU

• Visual Simulation, Computer Graphics and Virtual Reality

• Training and Education for GPU Computing

Thank You

Paul Richmond

p.richmond@sheffield.ac.uk

paulrichmond.shef.ac.uk

Peter Heywood

p.heywood@sheffield.ac.uk

ptheywood.uk

Largest Model (LoHAM) results:

  Version                   Runtime    Speedup vs Serial   Speedup vs Multicore
  Serial                    12:12:24   1.00                0.15
  Multicore                 01:47:36   6.81                1.00
  A15 (double precision)    00:35:22   20.70               3.04
  A14 (single precision)    00:17:32   41.77               6.14

Backup Slides

Benchmark Models

• 3 Benchmark networks

• Range of sizes

• Small to very large

• Up to 12 hour runtime

  Model    Vertices (Nodes)   Edges (Links)   O-D trips
  Derby    2700               25385           547²
  CLoHAM   15179              132600          2548²
  LoHAM    18427              192711          5194²

Chart: Benchmark model performance, time (s) per version (Serial, Salford, Multicore) for Derby, CLoHAM and LoHAM.

Edges considered per algorithm

Charts: Edges considered per iteration and total edges considered, Bellman-Ford vs D'Esopo-Pape.

Vertex Frontier (A4)

• Only vertices that were updated in the previous iteration can result in an update

• Far fewer threads launched per iteration (see the kernel sketch below)

• Up to 2500 instead of 18427 per iteration
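A minimal sketch of a frontier-based relaxation step (A4 style), again with illustrative names, integer costs and a CSR layout; duplicate frontier entries are tolerated to keep the sketch short:

    #include <climits>

    // Relax only the out-edges of vertices whose distance changed in the previous iteration.
    __global__ void relax_frontier(int frontier_size, const int* frontier,
                                   const int* row_offset, const int* col_index,
                                   const int* weight, int* dist,
                                   int* next_frontier, int* next_size) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= frontier_size) return;
        int u = frontier[i];
        int du = dist[u];
        for (int e = row_offset[u]; e < row_offset[u + 1]; ++e) {
            int v = col_index[e];
            int candidate = du + weight[e];
            int old = atomicMin(&dist[v], candidate);
            if (candidate < old) {                 // this thread improved v's label
                int slot = atomicAdd(next_size, 1);
                next_frontier[slot] = v;           // queue v for the next iteration
            }
        }
    }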

Chart: Frontier size per iteration for source 1, for Derby, CLoHAM and LoHAM.

Multiple Concurrent Origins (A8)

Charts: Frontier size per iteration for source 1 and combined frontier size for all concurrent sources, for Derby, CLoHAM and LoHAM.

Atomic Contention

• Atomic operations are guaranteed to complete without interference from other threads

• Atomic Contention

• multiple threads atomically modify same address

• Serialised!

• atomicAdd(double) not implemented in hardware (emulated with atomicCAS; see the sketch below)

• Not yet

• Solutions

1. Algorithmic change to minimise atomic contention

2. Single precision
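For reference, the standard software fallback for a double-precision atomicAdd on hardware that lacks it is a 64-bit compare-and-swap loop (the pattern documented in the CUDA C Programming Guide); every retry under contention adds to the serialisation:

    // Emulate atomicAdd on a double using a 64-bit atomicCAS loop.
    __device__ double atomicAddDouble(double* address, double val) {
        unsigned long long int* address_as_ull = (unsigned long long int*)address;
        unsigned long long int old = *address_as_ull, assumed;
        do {
            assumed = old;
            old = atomicCAS(address_as_ull, assumed,
                            __double_as_longlong(val + __longlong_as_double(assumed)));
        } while (assumed != old);  // retry if another thread updated the value in between
        return __longlong_as_double(old);
    }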

Chart: Atomic contention per iteration for CLoHAM and LoHAM.

Raw Performance

Hardware:

• Intel Core i7-4770K

• 16GB DDR3

• Nvidia GeForce Titan X

Chart: Assignment runtime per algorithm (seconds) for Serial, Multicore, A8, A10, A11, A15, A13 (single) and A14 (single), for Derby, CLoHAM and LoHAM.