Harvesting the Opportunity of GPU-Based Acceleration for Data-Intensive Applications


Transcript of Harvesting the Opportunity of GPU-Based Acceleration for Data-Intensive Applications

Page 1: Harvesting the Opportunity of  GPU-Based Acceleration for  Data-Intensive Applications


Harvesting the Opportunity of GPU-Based Acceleration for Data-Intensive Applications

Matei Ripeanu
Networked Systems Laboratory (NetSysLab)

University of British Columbia

Joint work with: Abdullah Gharaibeh, Samer Al-Kiswany

Page 2

A golf course …

… a (nudist) beach

(… and 199 days of rain each year)

Networked Systems Laboratory (NetSysLab)
University of British Columbia

Page 3

Hybrid architectures in Top 500 [Nov’10]

Page 4

• Hybrid architectures
  – High compute power / memory bandwidth
  – Energy efficient
  [operated today at low overall efficiency]

• Agenda for this talk
  – GPU architecture intuition
    • What generates the above characteristics?
  – Progress on efficiently harnessing hybrid (GPU-based) architectures

Pages 5-11

Acknowledgement: Slides 5-11 are borrowed from a presentation by Kayvon Fatahalian.

Page 12

Idea #3: Feed the cores with data

The processing elements are data hungry!
=> Wide, high-throughput memory bus

Page 13

Idea #4: Hide memory access latency

10,000x parallelism!
=> Hardware-supported multithreading

Page 14

The Resulting GPU Architecture

[Diagram: a GPU with N multiprocessors; each multiprocessor contains M cores with per-core registers, an instruction unit, and shared memory; the GPU has global, texture, and constant memories and connects to the host machine's memory.]

NVIDIA Tesla C2050: 448 cores

Four 'memories':
• Shared: fast (4 cycles), small (48KB)
• Global: slow (400-600 cycles), large (up to 3GB), high throughput (150GB/s)
• Texture: read only
• Constant: read only

Hybrid:
• PCIe x16: 4GB/s to host memory

Page 15

GPU characteristics

High peak compute power

High host-device communication overhead

Complex to program (SIMD, co-processor model)

High peak memory bandwidth

Limited memory space

Page 16

Roadmap: Two Projects

StoreGPU
Context: distributed storage systems
Motivating question: Does the 10x lower computation cost offered by GPUs change the way we design (distributed) systems?

MummerGPU++
Context: porting a bioinformatics application (sequence alignment) - a string matching problem, data intensive (10^2 GB)
Motivating question: How should one design/port applications to efficiently exploit GPU characteristics?

Page 17

Computationally Intensive Operations in Distributed (Storage) Systems

Operations (computationally intensive; they limit performance):
• Hashing
• Erasure coding
• Encryption/decryption
• Membership testing (Bloom filter)
• Compression

Techniques they enable:
• Similarity detection (deduplication)
• Content addressability
• Security
• Integrity checks
• Redundancy
• Load balancing
• Summary cache
• Storage efficiency
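The hashing-driven similarity detection above can be sketched host-side. This is a minimal, illustrative single-machine sketch: `std::hash` stands in for the cryptographic (MD5/SHA-1-style) hashing that the storage system offloads to the GPU, and all function names are assumptions, not the system's API.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Split a file's contents into fixed-size blocks (b1, b2, ..., bn).
std::vector<std::string> divide_into_blocks(const std::string& data, size_t block_size) {
    std::vector<std::string> blocks;
    for (size_t off = 0; off < data.size(); off += block_size)
        blocks.push_back(data.substr(off, block_size));
    return blocks;
}

// Similarity detection: store only blocks whose content hash has not been
// seen before. Returns the number of blocks actually stored.
size_t store_with_dedup(const std::vector<std::string>& blocks,
                        std::unordered_set<uint64_t>& stored_hashes) {
    size_t stored = 0;
    for (const auto& b : blocks) {
        uint64_t h = std::hash<std::string>{}(b);  // stand-in for GPU-side MD5/SHA-1
        if (stored_hashes.insert(h).second)
            ++stored;  // new content: would be written to a storage node
    }
    return stored;
}
```

For a checkpointing workload, where successive files are largely similar, only the changed blocks are stored; the second write of an identical file stores nothing.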

Page 18

Distributed Storage System Architecture

[Diagram: an Application accesses the system through an FS API exposed by the client's Access Module; files are divided into a stream of blocks (b1, b2, b3, ..., bn); an offloading layer dispatches the enabling operations (hashing, compression, encoding/decoding, encryption/decryption) to the CPU or GPU; these support techniques to improve performance/reliability (deduplication, security/integrity checks, redundancy); the client interacts with the Metadata Manager and the Storage Nodes.]

MosaStore: http://mosastore.net

Page 19

GPU-accelerated deduplication: a design / prototype implementation that integrates similarity detection and GPU support.

End-to-end system evaluation: 2x throughput improvement for a realistic checkpointing workload.

Page 20

Challenges

Integration challenges:
– Minimizing the integration effort
– Transparency
– Separation of concerns

Extracting major performance gains:
– Hiding memory allocation overheads
– Hiding data transfer overheads
– Efficient utilization of the GPU memory units
– Use of multi-GPU systems

[Diagram: files divided into a stream of blocks (b1 ... bn); the offloading layer runs similarity detection by hashing the blocks on the GPU.]

Page 21

Hashing on GPUs

HashGPU [1]: a library that exploits GPUs to support specialized use of hashing in distributed storage systems.

[Diagram: HashGPU hashing a stream of blocks (b1 ... bn) on the GPU.]

One performance data point: accelerates hashing by up to 5x compared to a single-core CPU.

However, significant speedup is achieved only for large blocks (>16MB), so HashGPU alone is not suitable for efficient similarity detection.

[1] Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, M. Ripeanu, HPDC'08

Page 22

Profiling HashGPU

At least 75% of the execution time is overhead.

Amortizing memory allocation and overlapping data transfers with computation may bring important benefits.

Page 23

CrystalGPU: a layer of abstraction that transparently enables common GPU optimizations.

[Diagram: files divided into a stream of blocks (b1 ... bn); in the offloading layer, HashGPU now runs on top of CrystalGPU, which manages the GPU.]

One performance data point: CrystalGPU can improve the speedup of hashing by more than 10x.

Page 24

CrystalGPU Opportunities and Enablers

• Opportunity: reusing GPU memory buffers
  Enabler: a high-level memory manager
• Opportunity: overlapping communication and computation
  Enabler: double buffering and asynchronous kernel launch
• Opportunity: multi-GPU systems (e.g., GeForce 9800 GX2 and GPU clusters)
  Enabler: a task queue manager

[Diagram: CrystalGPU (memory manager, task queue, double buffering) sits between HashGPU and the GPU in the offloading layer.]
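The task-queue enabler can be sketched in host code. This is a minimal sketch under stated assumptions, not CrystalGPU's actual API: each worker thread stands in for one GPU, and each submitted task would wrap a copy-in / kernel-launch / copy-out sequence on a real system.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal task-queue manager in the spirit of the slide: tasks are queued by
// the host and drained by worker threads (one per GPU in a multi-GPU system).
class TaskQueue {
public:
    explicit TaskQueue(int workers) {
        for (int i = 0; i < workers; ++i)
            threads_.emplace_back([this] { run(); });
    }
    ~TaskQueue() {  // drain remaining tasks, then stop the workers
        { std::lock_guard<std::mutex> l(m_); done_ = true; }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> l(m_); tasks_.push(std::move(task)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> l(m_);
                cv_.wait(l, [this] { return done_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // done_ set and nothing left
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // on a real system: copy-in, kernel launch, copy-out
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> threads_;
    bool done_ = false;
};
```

Buffer reuse (the memory-manager opportunity) would slot in naturally: instead of allocating per task, tasks borrow pre-allocated device buffers from a free list and return them on completion.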

Page 25

HashGPU Performance on top of CrystalGPU

The gains enabled by the three optimizations can be realized!

Baseline: single-core CPU

Page 26

End-to-end system evaluation

Page 27

End-to-End System Evaluation

Testbed:
– Four storage nodes and one metadata server
– One client with a 9800 GX2 GPU

Three configurations:
– No similarity detection (without-SD)
– Similarity detection on the CPU (4 cores @ 2.6GHz) (SD-CPU)
– Similarity detection on the GPU (9800 GX2) (SD-GPU)

Three workloads:
– Real checkpointing workload
– Completely similar files: maximum gains in terms of data savings
– Completely different files: only overheads, no gains

Success metrics:
– System throughput
– Impact on a competing application: compute- or I/O-intensive

• A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC'10

Page 28

System Throughput (Checkpointing Workload)

The integrated system preserves the throughput gains on a realistic workload!

1.8x improvement

Page 29

System Throughput (Synthetic Workload of Similar Files)

Offloading to the GPU enables close to optimal performance!

Room for 2x improvement

Page 30

Impact on a Competing (Compute-Intensive) Application

Workload: writing checkpoints back to back.

Frees resources (CPU) for competing applications while preserving throughput gains! [Chart annotations: 2x throughput improvement; 7% reduction.]

Page 31

Summary

Page 32

Distributed Storage System Architecture

[Diagram: the MosaStore architecture from Page 18 - Application, Access Module, Metadata Manager, Storage Nodes.]

MosaStore: http://mosastore.net

Page 33

Does the 10x lower computation cost offered by GPUs change the way we design (distributed storage) systems?

Motivating Question

StoreGPU Summary

[Diagram: the MosaStore architecture from Page 18, with the offloading layer dispatching hashing, compression, encoding/decoding, and encryption/decryption to the CPU or GPU.]

Results so far:
• StoreGPU: a storage system prototype that offloads to the GPU
• An evaluation of the feasibility of GPU offloading and of its impact on competing applications

Page 34

Roadmap: Two Projects

StoreGPU
Context: distributed storage systems
Motivating question: Does the 10x lower computation cost offered by GPUs change the way we design (distributed) systems?

MummerGPU++
Context: porting a bioinformatics application (sequence alignment) - a string matching problem, data intensive (10^2 GB)
Motivating question: How should one design/port applications to efficiently exploit GPU characteristics?

Page 35

Background: Sequence Alignment Problem

[Diagram: query reads (CCAT, GGCT, ..., CAATT) aligned against the reference ..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG..]

Problem: find where each query most likely originated from.

Queries: ~10^8 queries, 10^1 to 10^2 symbols per query
Reference: 10^6 to 10^11 symbols (up to ~400GB)

Page 36

Sequence Alignment on GPUs

MUMmerGPU [Schatz 07, Trapnell 09]:
• A GPU port of the sequence alignment tool MUMmer [Kurtz 04]
• Achieves good speedup compared to the CPU version
• Based on a suffix tree
• However, suffers from significant communication and post-processing overheads (>50% overhead)

MUMmerGPU++ [Gharaibeh 10]:
• Uses a space-efficient data structure (though from a higher computational-complexity class): the suffix array
• Achieves significant speedup compared to the suffix tree-based GPU implementation

Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing, Jan/Feb 2011.

Page 37

Speedup Evaluation

Workload: Human, ~10M queries, ~30M ref. length

[Chart series: Suffix Tree, Suffix Tree, Suffix Array]

Over 60% improvement

Page 38

Space/Time Trade-off Analysis

Page 39

GPU Offloading: Addressing the Challenges

High-level algorithm (executed on the host):

subrefs = DivideRef(ref)
subqrysets = DivideQrys(qrys)
foreach subqryset in subqrysets {
    results = NULL
    CopyToGPU(subqryset)
    foreach subref in subrefs {
        CopyToGPU(subref)
        MatchKernel(subqryset, subref)
        CopyFromGPU(results)
    }
    Decompress(results)
}

• Data-intensive problem and limited memory space
  => divide and compute in rounds
  => search-optimized data structures
• Large output size
  => compressed output representation (decompressed on the CPU)
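The divide-and-compute-in-rounds step can be sketched as a small planning helper. This is illustrative only: the 1/2 and 1/4 memory-budget splits are assumptions for the sketch, not values from the talk.

```cpp
#include <cassert>
#include <cstddef>

// Sketch of round planning: with only `device_mem` bytes on the GPU, split the
// reference and the query set so that one sub-reference plus one query batch
// (plus room for results) fits in device memory at a time.
struct Rounds { size_t ref_chunks, qry_chunks; };

Rounds plan_rounds(size_t ref_len, size_t qry_len, size_t device_mem) {
    size_t ref_budget = device_mem / 2;   // assumed: half for the reference index
    size_t qry_budget = device_mem / 4;   // assumed: a quarter for queries, rest for results
    size_t ref_chunks = (ref_len + ref_budget - 1) / ref_budget;  // ceiling division
    size_t qry_chunks = (qry_len + qry_budget - 1) / qry_budget;
    return {ref_chunks, qry_chunks};
}
// Total kernel launches = ref_chunks * qry_chunks, since each query batch is
// matched against every sub-reference, as in the host loop above.
```

This makes the space/time trade-off on the later slides concrete: a smaller index constant (suffix array vs. suffix tree) means larger sub-references fit per round, so fewer rounds and fewer transfers.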

Page 40

The core data structure

A massive number of queries and a long reference => pre-process the reference into an index.

[Diagram: suffix tree of "TACACA$".]

Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
• Search: O(qry_len) per query
• Space: O(ref_len), but the constant is high: ~20 x ref_len
• Post-processing: a DFS traversal for each query, O(4^(qry_len - min_match_len))

Page 41

The core data structure

A massive number of queries and a long reference => pre-process the reference into an index.

Past work: build a suffix tree (MUMmerGPU [Schatz 07])
• Search: O(qry_len) per query - efficient
• Space: O(ref_len), but the constant is high: ~20 x ref_len - expensive
• Post-processing: O(4^(qry_len - min_match_len)), a DFS traversal per query - expensive

[The slide annotates these costs on the host algorithm from Page 39.]

Page 42

A better matching data structure?

[Diagram: suffix tree of "TACACA$" vs. its suffix array: 0 A$, 1 ACA$, 2 ACACA$, 3 CA$, 4 CACA$, 5 TACACA$.]

              | Suffix Tree                    | Suffix Array
Space         | O(ref_len), ~20 x ref_len      | O(ref_len), ~4 x ref_len
Search        | O(qry_len)                     | O(qry_len x log ref_len)
Post-process  | O(4^(qry_len - min_match_len)) | O(qry_len - min_match_len)

Impact 1: Reduced communication - less data to transfer.
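The suffix-array search cost in the table can be illustrated with a short host-side sketch (illustrative code, not MUMmerGPU++'s implementation; a naive construction by sorting suffixes stands in for a production builder):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Build a suffix array by sorting suffix start positions. Plain string
// comparison makes this O(n^2 log n); fine for a sketch, while production
// code would use an O(n log n) construction.
std::vector<size_t> build_suffix_array(const std::string& ref) {
    std::vector<size_t> sa(ref.size());
    for (size_t i = 0; i < sa.size(); ++i) sa[i] = i;
    std::sort(sa.begin(), sa.end(), [&ref](size_t a, size_t b) {
        return ref.compare(a, std::string::npos, ref, b, std::string::npos) < 0;
    });
    return sa;
}

// Find one occurrence of `qry` via binary search over the sorted suffixes:
// O(qry_len x log ref_len) per query, matching the table's search cost.
// Returns a match position in `ref`, or std::string::npos if absent.
size_t find(const std::string& ref, const std::vector<size_t>& sa,
            const std::string& qry) {
    size_t lo = 0, hi = sa.size();
    while (lo < hi) {  // lower_bound over suffix prefixes of length |qry|
        size_t mid = (lo + hi) / 2;
        if (ref.compare(sa[mid], qry.size(), qry) < 0) lo = mid + 1;
        else hi = mid;
    }
    if (lo < sa.size() && ref.compare(sa[lo], qry.size(), qry) == 0)
        return sa[lo];
    return std::string::npos;
}
```

The search does more comparisons than a suffix-tree walk, but the index is only ~4 bytes per reference symbol, which is exactly the space/time trade-off the table captures.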

Page 43

A better matching data structure

[Slide repeats the suffix tree vs. suffix array comparison from Page 42.]

Impact 2: Better data locality, achieved at the cost of additional per-thread processing time. Space for longer sub-references => fewer processing rounds.

Page 44

A better matching data structure

[Slide repeats the suffix tree vs. suffix array comparison from Page 42.]

Impact 3: Lower post-processing overhead.

Page 45

Evaluation

Page 46

Evaluation setup

Workloads (NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces):

Workload / Species           | Reference length | # of queries | Avg. read length
HS1 - Human (chromosome 2)   | ~238M            | ~78M         | ~200
HS2 - Human (chromosome 3)   | ~100M            | ~2M          | ~700
MONO - L. monocytogenes      | ~3M              | ~6M          | ~120
SUIS - S. suis               | ~2M              | ~26M         | ~36

Testbed:
• Low-end GeForce 9800 GX2 GPU (512MB)
• High-end Tesla C1060 (4GB)

Baseline: suffix tree on the GPU (MUMmerGPU [Schatz 07, 09])

Success metrics:
• Performance
• Energy consumption

Page 47

Speedup: array-based over tree-based

Page 48

Dissecting the overheads

Workload: HS1, ~78M queries, ~238M reference length, on the GeForce 9800 GX2.

Consequences:
• Focus shifts to optimizing the compute stage
• Opportunity to exploit multi-GPU systems (as I/O is less of a bottleneck)

Page 49

MummerGPU++ Summary

Motivating question: How should one design/port applications to efficiently exploit GPU characteristics?

The choice of data structure can be crucial when porting applications to the GPU. A good matching data structure ensures:
• Low communication overhead
• Data locality (can be achieved at the cost of additional per-thread processing time)
• Low post-processing overhead

Page 50

Unifying theme: making the use of hybrid architectures (e.g., GPU-based platforms) simple and effective.

Hybrid platforms will gain wider adoption.

StoreGPU - motivating question: Does the 10x lower computation cost offered by GPUs change the way we design (distributed) systems?

MummerGPU++ - motivating question: How should one design/port applications to efficiently exploit GPU characteristics?

Page 51

Code, benchmarks and papers available at: netsyslab.ece.ubc.ca

Page 52

Projects at NetSysLab@UBC - http://netsyslab.ece.ubc.ca

Accelerated storage systems
• A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC'10
• On GPU's Viability as a Middleware Accelerator, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, M. Ripeanu, JoCC'08

Porting applications to efficiently exploit GPU characteristics
• Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
• Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing Magazine, January/February 2011

Middleware runtime support to simplify application development
• CrystalGPU: Transparent and Efficient Utilization of GPU Power, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, Technical Report

GPU-optimized building blocks: data structures and libraries
• Hashing, Bloom filters, suffix arrays