Harvesting the Opportunity of GPU-Based Acceleration for Data-Intensive Applications


Transcript of Harvesting the Opportunity of GPU-Based Acceleration for Data-Intensive Applications

1

Harvesting the Opportunity of GPU-Based Acceleration for Data-Intensive Applications

Matei Ripeanu
Networked Systems Laboratory (NetSysLab)

University of British Columbia

Joint work with: Abdullah Gharaibeh, Samer Al-Kiswany

2

A golf course …

… a (nudist) beach

(… and 199 days of rain each year)

Networked Systems Laboratory (NetSysLab)
University of British Columbia

3

Hybrid architectures in Top 500 [Nov’10]

4

• Hybrid architectures
  – High compute power / memory bandwidth
  – Energy efficient
  – [operated today at low overall efficiency]

• Agenda for this talk
  – GPU architecture intuition: what generates the above characteristics?
  – Progress on efficiently harnessing hybrid (GPU-based) architectures

Slides 5-11: borrowed from a presentation by Kayvon Fatahalian

12

Idea #3: Feed the cores with data

The processing elements are data hungry!
=> Wide, high-throughput memory bus

13

Idea #4: Hide memory access latency

10,000x parallelism!
=> Hardware-supported multithreading

14

The Resulting GPU Architecture

[Figure: a GPU with N multiprocessors; each multiprocessor contains M cores with per-core registers, a shared instruction unit, and on-chip shared memory; all multiprocessors access device-wide global, texture, and constant memories; the GPU connects to the host machine's memory.]

NVIDIA Tesla C2050:

448 cores

Four 'memories':
• Shared: fast (~4 cycles), small (48KB)
• Global: slow (400-600 cycles), large (up to 3GB), high throughput (150GB/s)
• Texture: read only
• Constant: read only

Hybrid: connected to the host over PCIe x16 (~4GB/s)
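To ground these numbers, the following minimal CUDA sketch (an illustration added here, not from the original slides) shows the access pattern this hierarchy rewards: each thread block stages a tile of data from slow global memory into fast on-chip shared memory once, then all of its threads reuse the tile.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each block stages a tile of the input into on-chip shared memory
// (~4-cycle latency) so the repeated reads below do not each pay the
// 400-600 cycle cost of a global-memory access.
__global__ void windowSum(const float *in, float *out, int n) {
    __shared__ float tile[256 + 2];              // block size plus one halo element per side
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    if (gid < n) tile[lid] = in[gid];            // one global load per thread
    if (threadIdx.x == 0)                        // first thread loads the left halo
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)           // last thread loads the right halo
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();                             // tile now visible to the whole block

    if (gid < n)                                 // three shared-memory reads, one global write
        out[gid] = tile[lid - 1] + tile[lid] + tile[lid + 1];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_in = (float *)malloc(bytes), *h_out = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);                    // global (device) memory
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);   // crosses PCIe

    windowSum<<<n / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost); // synchronizes

    printf("out[42] = %.1f\n", h_out[42]);       // expect 3.0
    cudaFree(d_in); cudaFree(d_out); free(h_in); free(h_out);
    return 0;
}

Each input element is read from global memory once but used three times from shared memory; at 400-600 cycles per global access versus ~4 for shared, this reuse is where the bandwidth advantage comes from.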

15

GPU characteristics

Advantages:
• High peak compute power
• High peak memory bandwidth

Drawbacks:
• High host-device communication overhead
• Complex to program (SIMD, co-processor model)
• Limited memory space

16

Roadmap: Two Projects

StoreGPU
Context: distributed storage systems
Motivating question: Does the 10x lower computation cost offered by GPUs change the way we design (distributed) systems?

MummerGPU++
Context: porting a bioinformatics application (sequence alignment) -- a string-matching problem, data-intensive (~10^2 GB)
Motivating question: How should one design/port applications to efficiently exploit GPU characteristics?

17

Computationally Intensive Operations in Distributed (Storage) Systems

Operations: hashing, erasure coding, encryption/decryption, membership testing (Bloom filter), compression.

Techniques they enable: similarity detection (deduplication), content addressability, security, integrity checks, redundancy, load balancing, summary cache, storage efficiency.

These operations are computationally intensive and limit performance. (A sketch of one of them, Bloom-filter membership testing, follows.)
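As a concrete instance of the pattern these operations share -- embarrassingly parallel, one independent unit of work per item -- here is a minimal CUDA sketch of batched Bloom-filter membership testing (an illustration added here, not the talk's implementation; the hash function is a toy stand-in).

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BITS (1u << 20)   // bit-array size
#define K 4               // hash probes per key

// Toy multiplicative hash; a real system would use stronger hashes.
__host__ __device__ unsigned hashK(unsigned key, unsigned seed) {
    unsigned h = (key ^ seed) * 2654435761u;
    h ^= h >> 16;
    return h % BITS;
}

// One thread per key: test all K bits; hit[i]=1 means "possibly present".
__global__ void bloomQuery(const unsigned *bits, const unsigned *keys,
                           int n, unsigned char *hit) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned char all = 1;
    for (unsigned s = 0; s < K; s++) {
        unsigned b = hashK(keys[i], s);
        all &= (unsigned char)((bits[b / 32] >> (b % 32)) & 1u);
    }
    hit[i] = all;
}

int main() {
    const int n = 1 << 16;
    // Build the filter on the CPU: insert the even keys 0, 2, 4, ...
    unsigned *h_bits = (unsigned *)calloc(BITS / 32, sizeof(unsigned));
    for (unsigned key = 0; key < 2u * n; key += 2)
        for (unsigned s = 0; s < K; s++) {
            unsigned b = hashK(key, s);
            h_bits[b / 32] |= 1u << (b % 32);
        }

    unsigned *h_keys = (unsigned *)malloc(n * sizeof(unsigned));
    for (int i = 0; i < n; i++) h_keys[i] = (unsigned)i;  // members and non-members

    unsigned *d_bits, *d_keys; unsigned char *d_hit;
    cudaMalloc(&d_bits, BITS / 8);
    cudaMalloc(&d_keys, n * sizeof(unsigned));
    cudaMalloc(&d_hit, n);
    cudaMemcpy(d_bits, h_bits, BITS / 8, cudaMemcpyHostToDevice);
    cudaMemcpy(d_keys, h_keys, n * sizeof(unsigned), cudaMemcpyHostToDevice);

    bloomQuery<<<(n + 255) / 256, 256>>>(d_bits, d_keys, n, d_hit);

    unsigned char *h_hit = (unsigned char *)malloc(n);
    cudaMemcpy(h_hit, d_hit, n, cudaMemcpyDeviceToHost);
    printf("key 10: %d (member), key 11: %d (non-member, barring a false positive)\n",
           h_hit[10], h_hit[11]);

    cudaFree(d_bits); cudaFree(d_keys); cudaFree(d_hit);
    free(h_bits); free(h_keys); free(h_hit);
    return 0;
}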

18

Distributed Storage System Architecture

[Figure: MosaStore (http://mosastore.net). An application on the client uses the FS API; the client's access module divides files into a stream of blocks (b1, b2, b3, ..., bn) and applies techniques to improve performance/reliability -- deduplication, security, integrity checks, redundancy -- built on enabling operations -- hashing, compression, encoding/decoding, encryption/decryption -- which an offloading layer can execute on either the CPU or the GPU. The client interacts with a metadata manager and with storage nodes.]

19

GPU-accelerated deduplication: a design and prototype implementation that integrates similarity detection and GPU support.

End-to-end system evaluation: 2x throughput improvement for a realistic checkpointing workload.

20

Challenges

Integration challenges:
• Minimizing the integration effort
• Transparency
• Separation of concerns

Extracting major performance gains:
• Hiding memory allocation overheads
• Hiding data transfer overheads
• Efficient utilization of the GPU memory units
• Use of multi-GPU systems

[Figure: files are divided into a stream of blocks (b1, b2, b3, ..., bn); similarity detection offloads hashing to the GPU through the offloading layer.]

21

Hashing on GPUs

HashGPU(1): a library that exploits GPUs to support specialized use of hashing in distributed storage systems.

[Figure: HashGPU hashes a stream of blocks (b1, b2, b3, ..., bn) on the GPU.]

One performance data point: HashGPU accelerates hashing by up to 5x compared to a single-core CPU.

However, significant speedup is achieved only for large blocks (>16MB), so it is not suitable for efficient similarity detection.

(1) Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, M. Ripeanu, HPDC '08
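The structure of such a library can be sketched as follows (an illustration added here, not the HashGPU code: a trivial FNV-1a hash stands in for HashGPU's MD5/SHA1 kernels, and each data block is hashed by a single thread, whereas HashGPU also parallelizes within a block).

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One thread per data block: thread i hashes bytes [i*blockSize, (i+1)*blockSize).
// A real fingerprinting kernel (MD5/SHA1) would also parallelize within a block.
__global__ void hashBlocks(const unsigned char *data, size_t blockSize,
                           int numBlocks, unsigned *digests) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numBlocks) return;
    unsigned h = 2166136261u;                    // FNV-1a, toy stand-in
    const unsigned char *p = data + i * blockSize;
    for (size_t j = 0; j < blockSize; j++)
        h = (h ^ p[j]) * 16777619u;
    digests[i] = h;
}

int main() {
    const size_t blockSize = 64 * 1024;          // storage-system block size
    const int numBlocks = 1024;                  // stream of blocks b1..bn
    const size_t bytes = blockSize * numBlocks;

    unsigned char *h_data = (unsigned char *)malloc(bytes);
    for (size_t i = 0; i < bytes; i++) h_data[i] = (unsigned char)i;

    unsigned char *d_data; unsigned *d_dig;
    cudaMalloc(&d_data, bytes);
    cudaMalloc(&d_dig, numBlocks * sizeof(unsigned));
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    hashBlocks<<<(numBlocks + 255) / 256, 256>>>(d_data, blockSize,
                                                 numBlocks, d_dig);

    unsigned h_dig[4];                           // fetch a few digests to inspect
    cudaMemcpy(h_dig, d_dig, sizeof(h_dig), cudaMemcpyDeviceToHost);
    printf("digest of b1: %08x\n", h_dig[0]);

    cudaFree(d_data); cudaFree(d_dig); free(h_data);
    return 0;
}

Note where the time goes: for small blocks, the cudaMalloc and cudaMemcpy calls around the kernel dominate, which is exactly the overhead profiled on the next slide.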

22

Profiling HashGPU

[Figure: breakdown of HashGPU execution time; memory allocation and data transfers account for at least 75% of the total.]

Amortizing memory allocation and overlapping data transfers with computation may bring important benefits.

23

CrystalGPU: a layer of abstraction that transparently enables common GPU optimizations

[Figure: as before, files are divided into a stream of blocks (b1, b2, b3, ..., bn) and similarity detection offloads hashing to the GPU; HashGPU now runs on top of CrystalGPU inside the offloading layer.]

One performance data point: CrystalGPU can improve the speedup of hashing by more than 10x.

24

CrystalGPU Opportunities and Enablers

Opportunity: reusing GPU memory buffers
Enabler: a high-level memory manager

Opportunity: overlapping communication and computation
Enabler: double buffering and asynchronous kernel launch

Opportunity: multi-GPU systems (e.g., GeForce 9800 GX2 and GPU clusters)
Enabler: a task queue manager

[Figure: CrystalGPU sits between HashGPU and the GPU, providing the memory manager, the task queue, and double buffering.]

A sketch of the first two enablers follows.
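The following minimal CUDA sketch (an illustration added here, not CrystalGPU's API) shows the two enablers that matter most for hashing: GPU buffers allocated once and reused across tasks, and double buffering with asynchronous copies so that the transfer of one chunk overlaps the computation on another.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void process(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;            // stand-in for the real kernel
}

int main() {
    const int chunk = 1 << 20, numChunks = 16;
    const size_t bytes = chunk * sizeof(float);

    // Pinned host memory makes cudaMemcpyAsync truly asynchronous.
    float *h_in, *h_out;
    cudaMallocHost(&h_in, numChunks * bytes);
    cudaMallocHost(&h_out, numChunks * bytes);
    for (int i = 0; i < numChunks * chunk; i++) h_in[i] = 1.0f;

    // Two reusable buffer pairs (the memory-manager idea: allocate once,
    // reuse for every task) and two streams (the double buffering: while
    // stream 0 computes on chunk k, stream 1 transfers chunk k+1).
    float *d_in[2], *d_out[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; b++) {
        cudaMalloc(&d_in[b], bytes);
        cudaMalloc(&d_out[b], bytes);
        cudaStreamCreate(&s[b]);
    }

    for (int k = 0; k < numChunks; k++) {
        int b = k % 2;                            // alternate buffers/streams
        cudaMemcpyAsync(d_in[b], h_in + (size_t)k * chunk, bytes,
                        cudaMemcpyHostToDevice, s[b]);
        process<<<(chunk + 255) / 256, 256, 0, s[b]>>>(d_in[b], d_out[b], chunk);
        cudaMemcpyAsync(h_out + (size_t)k * chunk, d_out[b], bytes,
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();                      // wait for both pipelines

    printf("h_out[0] = %.1f\n", h_out[0]);        // expect 2.0
    for (int b = 0; b < 2; b++) {
        cudaFree(d_in[b]); cudaFree(d_out[b]); cudaStreamDestroy(s[b]);
    }
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}

With the buffers allocated once outside the loop, the per-task allocation cost profiled earlier is amortized away; the two streams provide the transfer/compute overlap. The same structure generalizes to the task-queue enabler: one queue feeding two such pipelines drives a multi-GPU system like the 9800 GX2.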

25

HashGPU Performance on top of CrystalGPU

[Figure: hashing speedup with the optimizations enabled; baseline: a single CPU core.]

The gains enabled by the three optimizations can be realized!

26

End-to-end system evaluation

27

End-to-End System Evaluation

Testbed:
• Four storage nodes and one metadata server
• One client with a 9800 GX2 GPU

Three configurations:
• No similarity detection (without-SD)
• Similarity detection on the CPU, 4 cores @ 2.6GHz (SD-CPU)
• Similarity detection on the GPU, 9800 GX2 (SD-GPU)

Three workloads:
• A real checkpointing workload
• Completely similar files: maximum gains in terms of data savings
• Completely different files: only overheads, no gains

Success metrics:
• System throughput
• Impact on a competing application (compute- or I/O-intensive)

A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC '10

28

System Throughput (Checkpointing Workload)

[Figure: throughput of the without-SD, SD-CPU, and SD-GPU configurations; annotation: 1.8x improvement.]

The integrated system preserves the throughput gains on a realistic workload!

29

System Throughput (Synthetic Workload of Similar Files)

[Figure: throughput of the three configurations; annotation: room for 2x improvement.]

Offloading to the GPU enables close-to-optimal performance!

30

Impact on a Competing (Compute-Intensive) Application

[Figure: checkpoints written back to back while a compute-intensive application runs; annotations: 2x throughput improvement, 7% reduction in the competing application's performance.]

Offloading frees CPU resources for competing applications while preserving the throughput gains!

31

Summary

32

Distributed Storage System Architecture

[Figure: the MosaStore architecture, as before: an application on the client accesses storage through the FS API and access module; metadata manager; storage nodes. http://mosastore.net]

33

StoreGPU Summary

Motivating question: Does the 10x lower computation cost offered by GPUs change the way we design (distributed storage) systems?

Results so far:
• StoreGPU: a storage-system prototype that offloads to the GPU
• An evaluation of the feasibility of GPU offloading and of the impact on competing applications

[Figure: the MosaStore architecture with the CPU/GPU offloading layer, as before: files divided into a stream of blocks; deduplication, security, integrity checks, and redundancy built on hashing, compression, encoding/decoding, and encryption/decryption.]

34

Roadmap: Two Projects

StoreGPU (above)
Context: distributed storage systems
Motivating question: Does the 10x lower computation cost offered by GPUs change the way we design (distributed) systems?

MummerGPU++ (next)
Context: porting a bioinformatics application (sequence alignment) -- a string-matching problem, data-intensive (~10^2 GB)
Motivating question: How should one design/port applications to efficiently exploit GPU characteristics?

35

Background: Sequence Alignment Problem

[Figure: short query reads (CCAT, GGCT, CGCCCTA, GCAATTT, GCGG, TAGGC, TGCGC, CGGCA, ...) aligned against the reference ..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG..]

Problem: find where each query most likely originated from.
• Queries: ~10^8 queries, 10^1 to 10^2 symbols per query
• Reference: 10^6 to 10^11 symbols (up to ~400GB)

36

Sequence Alignment on GPUs

MUMmerGPU [Schatz 07, Trapnell 09]: a GPU port of the sequence-alignment tool MUMmer [Kurtz 04]. Based on a suffix tree, it achieves good speedup compared to the CPU version. However, it suffers from significant communication and post-processing overheads (more than 50% of total runtime).

MUMmerGPU++ [Gharaibeh 10]: uses a space-efficient data structure (though from a higher computational-complexity class), the suffix array, and achieves significant speedup compared to the suffix-tree-based GPU implementation.

Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC '10
Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing, Jan/Feb 2011

37

Speedup Evaluation

Workload: human, ~10M queries, ~30M reference length

[Figure: runtime comparison of two suffix-tree versions (MUMmerGPU) and the suffix-array version (MUMmerGPU++); over 60% improvement.]

38

Space/Time Trade-off Analysis

39

GPU Offloading: Addressing the Challenges

• Data-intensive problem and limited memory space
  → divide and compute in rounds
  → search-optimized data structures
• Large output size
  → compressed output representation (decompressed on the CPU)

High-level algorithm (executed on the host):

subrefs = DivideRef(ref)                   // reference split to fit GPU memory
subqrysets = DivideQrys(qrys)              // queries split into batches
foreach subqryset in subqrysets {
    results = NULL
    CopyToGPU(subqryset)                   // the query batch stays resident
    foreach subref in subrefs {
        CopyToGPU(subref)                  // stream reference chunks through the GPU
        MatchKernel(subqryset, subref)     // match the batch against this chunk
        CopyFromGPU(results)               // compressed match results
    }
    Decompress(results)                    // post-processing on the CPU
}

40

The core data structure

A massive number of queries and a long reference => pre-process the reference into an index.

[Figure: the suffix tree of TACACA$, with leaves 0-5 and edges labeled with substrings such as CA$, TACACA$, and $.]

Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
• Search: O(qry_len) per query
• Space: O(ref_len), but the constant is high, ~20 x ref_len
• Post-processing: a DFS traversal for each query, O(4^(qry_len - min_match_len))

41

The core data structure (continued)

[The slide repeats the high-level host algorithm and marks which stages the suffix-tree index (MUMmerGPU [Schatz 07]) makes expensive:]

• Search: O(qry_len) per query -- efficient
• Space: O(ref_len), but the constant is high (~20 x ref_len) -- expensive
• Post-processing: O(4^(qry_len - min_match_len)), DFS traversal per query -- expensive

42

A better matching data structure?

[Figure: the suffix tree of TACACA$ (as above) next to its suffix array: 0 A$, 1 ACA$, 2 ACACA$, 3 CA$, 4 CACA$, 5 TACACA$.]

                 Suffix Tree                        Suffix Array
Space            O(ref_len), ~20 x ref_len          O(ref_len), ~4 x ref_len
Search           O(qry_len)                         O(qry_len x log ref_len)
Post-processing  O(4^(qry_len - min_match_len))     O(qry_len - min_match_len)

Impact 1: reduced communication -- less data to transfer. (A sketch of the suffix-array search follows.)
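To make the table concrete, here is a minimal host-side sketch (an illustration added here, not the MUMmerGPU++ code) that builds the suffix array of the slide's example string naively and answers a query with the O(qry_len x log ref_len) binary search:

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    const std::string ref = "TACACA$";   // the slide's running example

    // Build: sort suffix start positions lexicographically. (Naive
    // O(n^2 log n) build for clarity; production builds run in
    // O(n log n) or O(n).)
    std::vector<int> sa(ref.size());
    for (size_t i = 0; i < sa.size(); i++) sa[i] = (int)i;
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return ref.compare(a, std::string::npos, ref, b, std::string::npos) < 0;
    });

    for (int p : sa) printf("%d %s\n", p, ref.c_str() + p);

    // Search: two binary searches bound the range of suffixes that start
    // with the query -- log(ref_len) probes, qry_len comparisons per probe.
    const std::string qry = "CA";
    auto lo = std::lower_bound(sa.begin(), sa.end(), qry,
        [&](int pos, const std::string &q) {
            return ref.compare(pos, q.size(), q) < 0;   // suffix prefix < query
        });
    auto hi = std::upper_bound(sa.begin(), sa.end(), qry,
        [&](const std::string &q, int pos) {
            return ref.compare(pos, q.size(), q) > 0;   // suffix prefix > query
        });

    printf("'%s' occurs at reference positions:", qry.c_str());
    for (auto it = lo; it != hi; ++it) printf(" %d", *it);
    printf("\n");   // expect positions 4 and 2 (suffixes CA$ and CACA$)
    return 0;
}

The per-query search gains a log(ref_len) factor, but the index is roughly five times smaller and post-processing becomes linear in the query length -- the trade the next slides quantify.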

43

A better matching data structure

[The slide repeats the suffix-tree/suffix-array figure and comparison table above.]

Impact 2: better data locality, achieved at the cost of additional per-thread processing time. The smaller index leaves space for longer sub-references => fewer processing rounds.

44

A better matching data structure

[The slide repeats the suffix-tree/suffix-array figure and comparison table above.]

Impact 3: lower post-processing overhead.

45

Evaluation

46

Evaluation setup

Testbed:
• Low-end: GeForce 9800 GX2 GPU (512MB)
• High-end: Tesla C1060 (4GB)

Baseline: suffix tree on the GPU (MUMmerGPU [Schatz 07, 09])

Success metrics: performance, energy consumption

Workloads (NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces):

Workload / Species             Reference length   # of queries   Avg. read length
HS1 - Human (chromosome 2)     ~238M              ~78M           ~200
HS2 - Human (chromosome 3)     ~100M              ~2M            ~700
MONO - L. monocytogenes        ~3M                ~6M            ~120
SUIS - S. suis                 ~2M                ~26M           ~36

47

Speedup: array-based over tree-based

48

Dissecting the overheads

Workload: HS1, ~78M queries, ~238M reference length, on the GeForce

[Figure: breakdown of execution time across data transfer, compute, and post-processing stages.]

Consequences:
• The focus shifts to optimizing the compute stage
• Opportunity to exploit multi-GPU systems (as I/O is less of a bottleneck)

49

MummerGPU++ Summary

Motivating question: How should one design/port applications to efficiently exploit GPU characteristics?

The choice of an appropriate data structure can be crucial when porting applications to the GPU. A good matching data structure ensures:
• Low communication overhead
• Data locality (achievable at the cost of additional per-thread processing time)
• Low post-processing overhead

50

Summary

Hybrid platforms will gain wider adoption. Unifying theme: making the use of hybrid architectures (e.g., GPU-based platforms) simple and effective.

StoreGPU -- Does the 10x lower computation cost offered by GPUs change the way we design (distributed) systems?
MummerGPU++ -- How should one design/port applications to efficiently exploit GPU characteristics?

51

Code, benchmarks and papers available at: netsyslab.ece.ubc.ca

52

Projects at NetSysLab@UBC (http://netsyslab.ece.ubc.ca)

Accelerated storage systems:
• A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC '10
• On GPU's Viability as a Middleware Accelerator, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, M. Ripeanu, JoCC '08

Porting applications to efficiently exploit GPU characteristics:
• Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC '10
• Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing Magazine, January/February 2011

Middleware runtime support to simplify application development:
• CrystalGPU: Transparent and Efficient Utilization of GPU Power, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, Technical Report

GPU-optimized building blocks, data structures and libraries:
• Hashing, Bloom filters, suffix arrays