Harvesting the Opportunity of GPU-Based Acceleration for Data-Intensive Applications


Transcript of Harvesting the Opportunity of GPU-Based Acceleration for Data-Intensive Applications

1

Harvesting the Opportunity of GPU-Based Acceleration for Data-Intensive Applications

Matei Ripeanu
Networked Systems Laboratory (NetSysLab)

University of British Columbia

Joint work with: Abdullah Gharaibeh, Samer Al-Kiswany

2

A golf course …

… a (nudist) beach

(… and 199 days of rain each year)

Networked Systems Laboratory (NetSysLab)
University of British Columbia

3

Hybrid architectures in Top 500 [Nov’10]

4

• Hybrid architectures
  – High compute power / memory bandwidth
  – Energy efficient
  – [operated today at low overall efficiency]

• Agenda for this talk
  – GPU architecture intuition: what generates the above characteristics?
  – Progress on efficiently harnessing hybrid (GPU-based) architectures

Slides 5-11: borrowed from a presentation by Kayvon Fatahalian

12

Idea #3: Feed the cores with data

The processing elements are data hungry!
=> Wide, high-throughput memory bus

13

Idea #4: Hide memory access latency

10,000x parallelism!
=> Hardware-supported multithreading

14

The Resulting GPU Architecture

[Figure: a GPU with N multiprocessors; each multiprocessor contains M cores with per-core registers, a shared instruction unit, and on-chip shared memory; all multiprocessors access device-wide global, texture, and constant memories; the GPU connects to the host machine's memory.]

NVIDIA Tesla C2050:

448 cores

Four 'memories':
• Shared: fast (~4 cycles), small (48KB)
• Global: slow (400-600 cycles), large (up to 3GB), high throughput (150GB/s)
• Texture: read only
• Constant: read only

Hybrid: connected to the host over PCIe x16 (~4GB/s)
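To ground these numbers, the following minimal CUDA sketch (an illustration added here, not from the original slides) shows the access pattern this hierarchy rewards: each thread block stages a tile of data from slow global memory into fast on-chip shared memory once, then all of its threads reuse the tile.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each block stages a tile of the input into on-chip shared memory
// (~4-cycle latency) so the repeated reads below do not each pay the
// 400-600 cycle cost of a global-memory access.
__global__ void windowSum(const float *in, float *out, int n) {
    __shared__ float tile[256 + 2];              // block size plus one halo element per side
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    if (gid < n) tile[lid] = in[gid];            // one global load per thread
    if (threadIdx.x == 0)                        // first thread loads the left halo
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)           // last thread loads the right halo
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();                             // tile now visible to the whole block

    if (gid < n)                                 // three shared-memory reads, one global write
        out[gid] = tile[lid - 1] + tile[lid] + tile[lid + 1];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_in = (float *)malloc(bytes), *h_out = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);                    // global (device) memory
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);   // crosses PCIe

    windowSum<<<n / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost); // synchronizes

    printf("out[42] = %.1f\n", h_out[42]);       // expect 3.0
    cudaFree(d_in); cudaFree(d_out); free(h_in); free(h_out);
    return 0;
}

Each input element is read from global memory once but used three times from shared memory; at 400-600 cycles per global access versus ~4 for shared, this reuse is where the bandwidth advantage comes from.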

15

GPU characteristics

Advantages:
• High peak compute power
• High peak memory bandwidth

Drawbacks:
• High host-device communication overhead
• Complex to program (SIMD, co-processor model)
• Limited memory space

16

Roadmap: Two Projects

StoreGPU
Context: distributed storage systems
Motivating question: Does the 10x lower computation cost offered by GPUs change the way we design (distributed) systems?

MummerGPU++
Context: porting a bioinformatics application (sequence alignment) -- a string-matching problem, data-intensive (~10^2 GB)
Motivating question: How should one design/port applications to efficiently exploit GPU characteristics?

17

Computationally Intensive Operations in Distributed (Storage) Systems

Operations: hashing, erasure coding, encryption/decryption, membership testing (Bloom filter), compression.

Techniques they enable: similarity detection (deduplication), content addressability, security, integrity checks, redundancy, load balancing, summary cache, storage efficiency.

These operations are computationally intensive and limit performance. (A sketch of one of them, Bloom-filter membership testing, follows.)
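As a concrete instance of the pattern these operations share -- embarrassingly parallel, one independent unit of work per item -- here is a minimal CUDA sketch of batched Bloom-filter membership testing (an illustration added here, not the talk's implementation; the hash function is a toy stand-in).

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BITS (1u << 20)   // bit-array size
#define K 4               // hash probes per key

// Toy multiplicative hash; a real system would use stronger hashes.
__host__ __device__ unsigned hashK(unsigned key, unsigned seed) {
    unsigned h = (key ^ seed) * 2654435761u;
    h ^= h >> 16;
    return h % BITS;
}

// One thread per key: test all K bits; hit[i]=1 means "possibly present".
__global__ void bloomQuery(const unsigned *bits, const unsigned *keys,
                           int n, unsigned char *hit) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned char all = 1;
    for (unsigned s = 0; s < K; s++) {
        unsigned b = hashK(keys[i], s);
        all &= (unsigned char)((bits[b / 32] >> (b % 32)) & 1u);
    }
    hit[i] = all;
}

int main() {
    const int n = 1 << 16;
    // Build the filter on the CPU: insert the even keys 0, 2, 4, ...
    unsigned *h_bits = (unsigned *)calloc(BITS / 32, sizeof(unsigned));
    for (unsigned key = 0; key < 2u * n; key += 2)
        for (unsigned s = 0; s < K; s++) {
            unsigned b = hashK(key, s);
            h_bits[b / 32] |= 1u << (b % 32);
        }

    unsigned *h_keys = (unsigned *)malloc(n * sizeof(unsigned));
    for (int i = 0; i < n; i++) h_keys[i] = (unsigned)i;  // members and non-members

    unsigned *d_bits, *d_keys; unsigned char *d_hit;
    cudaMalloc(&d_bits, BITS / 8);
    cudaMalloc(&d_keys, n * sizeof(unsigned));
    cudaMalloc(&d_hit, n);
    cudaMemcpy(d_bits, h_bits, BITS / 8, cudaMemcpyHostToDevice);
    cudaMemcpy(d_keys, h_keys, n * sizeof(unsigned), cudaMemcpyHostToDevice);

    bloomQuery<<<(n + 255) / 256, 256>>>(d_bits, d_keys, n, d_hit);

    unsigned char *h_hit = (unsigned char *)malloc(n);
    cudaMemcpy(h_hit, d_hit, n, cudaMemcpyDeviceToHost);
    printf("key 10: %d (member), key 11: %d (non-member, barring a false positive)\n",
           h_hit[10], h_hit[11]);

    cudaFree(d_bits); cudaFree(d_keys); cudaFree(d_hit);
    free(h_bits); free(h_keys); free(h_hit);
    return 0;
}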

18

Distributed Storage System Architecture

[Figure: MosaStore (http://mosastore.net). An application on the client uses the FS API; the client's access module divides files into a stream of blocks (b1, b2, b3, ..., bn) and applies techniques to improve performance/reliability -- deduplication, security, integrity checks, redundancy -- built on enabling operations -- hashing, compression, encoding/decoding, encryption/decryption -- which an offloading layer can execute on either the CPU or the GPU. The client interacts with a metadata manager and with storage nodes.]

19

GPU-accelerated deduplication: a design and prototype implementation that integrates similarity detection and GPU support.

End-to-end system evaluation: 2x throughput improvement for a realistic checkpointing workload.

20

Challenges

Integration challenges:
• Minimizing the integration effort
• Transparency
• Separation of concerns

Extracting major performance gains:
• Hiding memory allocation overheads
• Hiding data transfer overheads
• Efficient utilization of the GPU memory units
• Use of multi-GPU systems

[Figure: files are divided into a stream of blocks (b1, b2, b3, ..., bn); similarity detection offloads hashing to the GPU through the offloading layer.]

21

Hashing on GPUs

HashGPU(1): a library that exploits GPUs to support specialized use of hashing in distributed storage systems.

[Figure: HashGPU hashes a stream of blocks (b1, b2, b3, ..., bn) on the GPU.]

One performance data point: HashGPU accelerates hashing by up to 5x compared to a single-core CPU.

However, significant speedup is achieved only for large blocks (>16MB), so it is not suitable for efficient similarity detection.

(1) Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, M. Ripeanu, HPDC '08
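The structure of such a library can be sketched as follows (an illustration added here, not the HashGPU code: a trivial FNV-1a hash stands in for HashGPU's MD5/SHA1 kernels, and each data block is hashed by a single thread, whereas HashGPU also parallelizes within a block).

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One thread per data block: thread i hashes bytes [i*blockSize, (i+1)*blockSize).
// A real fingerprinting kernel (MD5/SHA1) would also parallelize within a block.
__global__ void hashBlocks(const unsigned char *data, size_t blockSize,
                           int numBlocks, unsigned *digests) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numBlocks) return;
    unsigned h = 2166136261u;                    // FNV-1a, toy stand-in
    const unsigned char *p = data + i * blockSize;
    for (size_t j = 0; j < blockSize; j++)
        h = (h ^ p[j]) * 16777619u;
    digests[i] = h;
}

int main() {
    const size_t blockSize = 64 * 1024;          // storage-system block size
    const int numBlocks = 1024;                  // stream of blocks b1..bn
    const size_t bytes = blockSize * numBlocks;

    unsigned char *h_data = (unsigned char *)malloc(bytes);
    for (size_t i = 0; i < bytes; i++) h_data[i] = (unsigned char)i;

    unsigned char *d_data; unsigned *d_dig;
    cudaMalloc(&d_data, bytes);
    cudaMalloc(&d_dig, numBlocks * sizeof(unsigned));
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    hashBlocks<<<(numBlocks + 255) / 256, 256>>>(d_data, blockSize,
                                                 numBlocks, d_dig);

    unsigned h_dig[4];                           // fetch a few digests to inspect
    cudaMemcpy(h_dig, d_dig, sizeof(h_dig), cudaMemcpyDeviceToHost);
    printf("digest of b1: %08x\n", h_dig[0]);

    cudaFree(d_data); cudaFree(d_dig); free(h_data);
    return 0;
}

Note where the time goes: for small blocks, the cudaMalloc and cudaMemcpy calls around the kernel dominate, which is exactly the overhead profiled on the next slide.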

22

Profiling HashGPU

[Figure: breakdown of HashGPU execution time; memory allocation and data transfers account for at least 75% of the total.]

Amortizing memory allocation and overlapping data transfers with computation may bring important benefits.

23

CrystalGPU: a layer of abstraction that transparently enables common GPU optimizations

[Figure: as before, files are divided into a stream of blocks (b1, b2, b3, ..., bn) and similarity detection offloads hashing to the GPU; HashGPU now runs on top of CrystalGPU inside the offloading layer.]

One performance data point: CrystalGPU can improve the speedup of hashing by more than 10x.

24

CrystalGPU Opportunities and Enablers

Opportunity: reusing GPU memory buffers
Enabler: a high-level memory manager

Opportunity: overlapping communication and computation
Enabler: double buffering and asynchronous kernel launch

Opportunity: multi-GPU systems (e.g., GeForce 9800 GX2 and GPU clusters)
Enabler: a task queue manager

[Figure: CrystalGPU sits between HashGPU and the GPU, providing the memory manager, the task queue, and double buffering.]

A sketch of the first two enablers follows.
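The following minimal CUDA sketch (an illustration added here, not CrystalGPU's API) shows the two enablers that matter most for hashing: GPU buffers allocated once and reused across tasks, and double buffering with asynchronous copies so that the transfer of one chunk overlaps the computation on another.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void process(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;            // stand-in for the real kernel
}

int main() {
    const int chunk = 1 << 20, numChunks = 16;
    const size_t bytes = chunk * sizeof(float);

    // Pinned host memory makes cudaMemcpyAsync truly asynchronous.
    float *h_in, *h_out;
    cudaMallocHost(&h_in, numChunks * bytes);
    cudaMallocHost(&h_out, numChunks * bytes);
    for (int i = 0; i < numChunks * chunk; i++) h_in[i] = 1.0f;

    // Two reusable buffer pairs (the memory-manager idea: allocate once,
    // reuse for every task) and two streams (the double buffering: while
    // stream 0 computes on chunk k, stream 1 transfers chunk k+1).
    float *d_in[2], *d_out[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; b++) {
        cudaMalloc(&d_in[b], bytes);
        cudaMalloc(&d_out[b], bytes);
        cudaStreamCreate(&s[b]);
    }

    for (int k = 0; k < numChunks; k++) {
        int b = k % 2;                            // alternate buffers/streams
        cudaMemcpyAsync(d_in[b], h_in + (size_t)k * chunk, bytes,
                        cudaMemcpyHostToDevice, s[b]);
        process<<<(chunk + 255) / 256, 256, 0, s[b]>>>(d_in[b], d_out[b], chunk);
        cudaMemcpyAsync(h_out + (size_t)k * chunk, d_out[b], bytes,
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();                      // wait for both pipelines

    printf("h_out[0] = %.1f\n", h_out[0]);        // expect 2.0
    for (int b = 0; b < 2; b++) {
        cudaFree(d_in[b]); cudaFree(d_out[b]); cudaStreamDestroy(s[b]);
    }
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}

With the buffers allocated once outside the loop, the per-task allocation cost profiled earlier is amortized away; the two streams provide the transfer/compute overlap. The same structure generalizes to the task-queue enabler: one queue feeding two such pipelines drives a multi-GPU system like the 9800 GX2.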

25

HashGPU Performance on top of CrystalGPU

[Figure: hashing speedup with the optimizations enabled; baseline: a single CPU core.]

The gains enabled by the three optimizations can be realized!

26

End-to-end system evaluation

27

End-to-End System Evaluation

Testbed:
• Four storage nodes and one metadata server
• One client with a 9800 GX2 GPU

Three configurations:
• No similarity detection (without-SD)
• Similarity detection on the CPU, 4 cores @ 2.6GHz (SD-CPU)
• Similarity detection on the GPU, 9800 GX2 (SD-GPU)

Three workloads:
• A real checkpointing workload
• Completely similar files: maximum gains in terms of data savings
• Completely different files: only overheads, no gains

Success metrics:
• System throughput
• Impact on a competing application (compute- or I/O-intensive)

A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC '10

28

System Throughput (Checkpointing Workload)

[Figure: throughput of the without-SD, SD-CPU, and SD-GPU configurations; annotation: 1.8x improvement.]

The integrated system preserves the throughput gains on a realistic workload!

29

System Throughput (Synthetic Workload of Similar Files)

[Figure: throughput of the three configurations; annotation: room for 2x improvement.]

Offloading to the GPU enables close-to-optimal performance!

30

Impact on a Competing (Compute-Intensive) Application

[Figure: checkpoints written back to back while a compute-intensive application runs; annotations: 2x throughput improvement, 7% reduction in the competing application's performance.]

Offloading frees CPU resources for competing applications while preserving the throughput gains!

31

Summary

32

Distributed Storage System Architecture

[Figure: the MosaStore architecture, as before: an application on the client accesses storage through the FS API and access module; metadata manager; storage nodes. http://mosastore.net]

33

StoreGPU Summary

Motivating question: Does the 10x lower computation cost offered by GPUs change the way we design (distributed storage) systems?

Results so far:
• StoreGPU: a storage-system prototype that offloads to the GPU
• An evaluation of the feasibility of GPU offloading and of the impact on competing applications

[Figure: the MosaStore architecture with the CPU/GPU offloading layer, as before: files divided into a stream of blocks; deduplication, security, integrity checks, and redundancy built on hashing, compression, encoding/decoding, and encryption/decryption.]

34

Roadmap: Two Projects

StoreGPU (above)
Context: distributed storage systems
Motivating question: Does the 10x lower computation cost offered by GPUs change the way we design (distributed) systems?

MummerGPU++ (next)
Context: porting a bioinformatics application (sequence alignment) -- a string-matching problem, data-intensive (~10^2 GB)
Motivating question: How should one design/port applications to efficiently exploit GPU characteristics?

35

Background: Sequence Alignment Problem

[Figure: short query reads (CCAT, GGCT, CGCCCTA, GCAATTT, GCGG, TAGGC, TGCGC, CGGCA, ...) aligned against the reference ..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG..]

Problem: find where each query most likely originated from.
• Queries: ~10^8 queries, 10^1 to 10^2 symbols per query
• Reference: 10^6 to 10^11 symbols (up to ~400GB)

36

Sequence Alignment on GPUs

MUMmerGPU [Schatz 07, Trapnell 09]: a GPU port of the sequence-alignment tool MUMmer [Kurtz 04]. Based on a suffix tree, it achieves good speedup compared to the CPU version. However, it suffers from significant communication and post-processing overheads (more than 50% of total runtime).

MUMmerGPU++ [Gharaibeh 10]: uses a space-efficient data structure (though from a higher computational-complexity class), the suffix array, and achieves significant speedup compared to the suffix-tree-based GPU implementation.

Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC '10
Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing, Jan/Feb 2011

37

Speedup Evaluation

Workload: human, ~10M queries, ~30M reference length

[Figure: runtime comparison of two suffix-tree versions (MUMmerGPU) and the suffix-array version (MUMmerGPU++); over 60% improvement.]

38

Space/Time Trade-off Analysis

39

GPU Offloading: Addressing the Challenges

• Data-intensive problem and limited memory space
  → divide and compute in rounds
  → search-optimized data structures
• Large output size
  → compressed output representation (decompressed on the CPU)

High-level algorithm (executed on the host):

subrefs = DivideRef(ref)                   // reference split to fit GPU memory
subqrysets = DivideQrys(qrys)              // queries split into batches
foreach subqryset in subqrysets {
    results = NULL
    CopyToGPU(subqryset)                   // the query batch stays resident
    foreach subref in subrefs {
        CopyToGPU(subref)                  // stream reference chunks through the GPU
        MatchKernel(subqryset, subref)     // match the batch against this chunk
        CopyFromGPU(results)               // compressed match results
    }
    Decompress(results)                    // post-processing on the CPU
}

40

The core data structure

A massive number of queries and a long reference => pre-process the reference into an index.

[Figure: the suffix tree of TACACA$, with leaves 0-5 and edges labeled with substrings such as CA$, TACACA$, and $.]

Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
• Search: O(qry_len) per query
• Space: O(ref_len), but the constant is high, ~20 x ref_len
• Post-processing: a DFS traversal for each query, O(4^(qry_len - min_match_len))

41

The core data structure (continued)

[The slide repeats the high-level host algorithm and marks which stages the suffix-tree index (MUMmerGPU [Schatz 07]) makes expensive:]

• Search: O(qry_len) per query -- efficient
• Space: O(ref_len), but the constant is high (~20 x ref_len) -- expensive
• Post-processing: O(4^(qry_len - min_match_len)), DFS traversal per query -- expensive

42

A better matching data structure?

[Figure: the suffix tree of TACACA$ (as above) next to its suffix array: 0 A$, 1 ACA$, 2 ACACA$, 3 CA$, 4 CACA$, 5 TACACA$.]

                 Suffix Tree                        Suffix Array
Space            O(ref_len), ~20 x ref_len          O(ref_len), ~4 x ref_len
Search           O(qry_len)                         O(qry_len x log ref_len)
Post-processing  O(4^(qry_len - min_match_len))     O(qry_len - min_match_len)

Impact 1: reduced communication -- less data to transfer. (A sketch of the suffix-array search follows.)
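To make the table concrete, here is a minimal host-side sketch (an illustration added here, not the MUMmerGPU++ code) that builds the suffix array of the slide's example string naively and answers a query with the O(qry_len x log ref_len) binary search:

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    const std::string ref = "TACACA$";   // the slide's running example

    // Build: sort suffix start positions lexicographically. (Naive
    // O(n^2 log n) build for clarity; production builds run in
    // O(n log n) or O(n).)
    std::vector<int> sa(ref.size());
    for (size_t i = 0; i < sa.size(); i++) sa[i] = (int)i;
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return ref.compare(a, std::string::npos, ref, b, std::string::npos) < 0;
    });

    for (int p : sa) printf("%d %s\n", p, ref.c_str() + p);

    // Search: two binary searches bound the range of suffixes that start
    // with the query -- log(ref_len) probes, qry_len comparisons per probe.
    const std::string qry = "CA";
    auto lo = std::lower_bound(sa.begin(), sa.end(), qry,
        [&](int pos, const std::string &q) {
            return ref.compare(pos, q.size(), q) < 0;   // suffix prefix < query
        });
    auto hi = std::upper_bound(sa.begin(), sa.end(), qry,
        [&](const std::string &q, int pos) {
            return ref.compare(pos, q.size(), q) > 0;   // suffix prefix > query
        });

    printf("'%s' occurs at reference positions:", qry.c_str());
    for (auto it = lo; it != hi; ++it) printf(" %d", *it);
    printf("\n");   // expect positions 4 and 2 (suffixes CA$ and CACA$)
    return 0;
}

The per-query search gains a log(ref_len) factor, but the index is roughly five times smaller and post-processing becomes linear in the query length -- the trade the next slides quantify.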

43

A better matching data structure

[The slide repeats the suffix-tree/suffix-array figure and comparison table above.]

Impact 2: better data locality, achieved at the cost of additional per-thread processing time. The smaller index leaves space for longer sub-references => fewer processing rounds.

44

A better matching data structure

[The slide repeats the suffix-tree/suffix-array figure and comparison table above.]

Impact 3: lower post-processing overhead.

45

Evaluation

46

Evaluation setup

Testbed:
• Low-end: GeForce 9800 GX2 GPU (512MB)
• High-end: Tesla C1060 (4GB)

Baseline: suffix tree on the GPU (MUMmerGPU [Schatz 07, 09])

Success metrics: performance, energy consumption

Workloads (NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces):

Workload / Species             Reference length   # of queries   Avg. read length
HS1 - Human (chromosome 2)     ~238M              ~78M           ~200
HS2 - Human (chromosome 3)     ~100M              ~2M            ~700
MONO - L. monocytogenes        ~3M                ~6M            ~120
SUIS - S. suis                 ~2M                ~26M           ~36

47

Speedup: array-based over tree-based

48

Dissecting the overheads

Workload: HS1, ~78M queries, ~238M reference length, on the GeForce

[Figure: breakdown of execution time across data transfer, compute, and post-processing stages.]

Consequences:
• The focus shifts to optimizing the compute stage
• Opportunity to exploit multi-GPU systems (as I/O is less of a bottleneck)

49

MummerGPU++ Summary

Motivating question: How should one design/port applications to efficiently exploit GPU characteristics?

The choice of an appropriate data structure can be crucial when porting applications to the GPU. A good matching data structure ensures:
• Low communication overhead
• Data locality (achievable at the cost of additional per-thread processing time)
• Low post-processing overhead

50

Summary

Hybrid platforms will gain wider adoption. Unifying theme: making the use of hybrid architectures (e.g., GPU-based platforms) simple and effective.

StoreGPU -- Does the 10x lower computation cost offered by GPUs change the way we design (distributed) systems?
MummerGPU++ -- How should one design/port applications to efficiently exploit GPU characteristics?

51

Code, benchmarks and papers available at: netsyslab.ece.ubc.ca

52

Projects at NetSysLab@UBC (http://netsyslab.ece.ubc.ca)

Accelerated storage systems:
• A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC '10
• On GPU's Viability as a Middleware Accelerator, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, M. Ripeanu, JoCC '08

Porting applications to efficiently exploit GPU characteristics:
• Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC '10
• Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing Magazine, January/February 2011

Middleware runtime support to simplify application development:
• CrystalGPU: Transparent and Efficient Utilization of GPU Power, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, Technical Report

GPU-optimized building blocks, data structures and libraries:
• Hashing, Bloom filters, suffix arrays