
Size Matters: Space/Time Tradeoffs to Improve GPGPU Application Performance

Abdullah Gharaibeh, Matei Ripeanu

NetSysLab, The University of British Columbia

2

GPUs offer different characteristics

High peak compute power

High communication overhead

High peak memory bandwidth

Limited memory space

Implication: careful tradeoff analysis is needed when porting applications to GPU-based platforms

3

Motivating Question: How should we design applications to efficiently exploit GPU characteristics?

Context: A bioinformatics problem: Sequence Alignment

A string-matching problem; data intensive (~10^2 GB)

4

Past work on sequence alignment on GPUs: MUMmerGPU [Schatz 07, Trapnell 09]

A GPU port of the sequence alignment tool MUMmer [Kurtz 04]; ~4x end-to-end speedup compared to the CPU version

Hypothesis: mismatch between the core data structure (suffix tree) and GPU characteristics

>50% of the runtime is spent on overheads (data transfers and post-processing)

5

Idea: trade time for space. Use a space-efficient data structure (though from a higher computational complexity class): the suffix array

~4x speedup compared to the suffix-tree-based GPU implementation

Consequences:
Opportunity to exploit multi-GPU systems, as I/O is less of a bottleneck
Focus shifts towards optimizing the compute stage
Significant overhead reduction

6

Outline

Sequence alignment: background and offloading to GPU

Space/Time trade-off analysis

Evaluation

7

Background: sequence alignment problem

Find where each query most likely originated from

Queries: ~10^8 queries, 10^1 to 10^2 symbols per query

Reference: 10^6 to 10^11 symbols

[Figure: many short query fragments aligned to positions along a long reference sequence ..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG..]

8

GPU Offloading: opportunity and challenges

Opportunity
Sequence alignment: easy to partition, memory intensive
GPU: massively parallel, high memory bandwidth

Challenges
Sequence alignment: data intensive, large output size
GPU: limited memory space, no direct access to other I/O devices (e.g., disk)

9

GPU Offloading: addressing the challenges

• Data-intensive problem and limited memory space → divide and compute in rounds

• Large output size → compressed output representation (decompress on the CPU)

High-level algorithm (executed on the host):

subrefs = DivideRef(ref)
subqrysets = DivideQrys(qrys)
foreach subqryset in subqrysets {
    results = NULL
    CopyToGPU(subqryset)
    foreach subref in subrefs {
        CopyToGPU(subref)
        MatchKernel(subqryset, subref)
        CopyFromGPU(results)
    }
    Decompress(results)
}
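To make the round-based structure concrete, here is a minimal CPU-only sketch in Python of the same host loop. The helper names (divide, match_on_gpu, align) and the toy strings are illustrative stand-ins, not MUMmerGPU's API; in the real implementation the matching step is a CUDA kernel launch over an index, bracketed by host-to-device and device-to-host copies, and the reference chunks overlap by the maximum match length so boundary-spanning matches are not lost (omitted here for brevity).

def divide(seq, chunk_size):
    # Split the reference (or the query batch) into pieces that fit in GPU memory.
    return [seq[i:i + chunk_size] for i in range(0, len(seq), chunk_size)]

def match_on_gpu(sub_queries, sub_ref):
    # Stand-in for CopyToGPU + MatchKernel + CopyFromGPU: per query, return the
    # chunk-local offsets where it occurs (a naive scan here; the real kernel
    # walks a pre-built index instead).
    return {q: [i for i in range(len(sub_ref) - len(q) + 1)
                if sub_ref[i:i + len(q)] == q]
            for q in sub_queries}

def align(ref, queries, ref_chunk, qry_batch):
    results = {}
    for sub_queries in divide(queries, qry_batch):                       # rounds over query batches
        for round_no, sub_ref in enumerate(divide(ref, ref_chunk)):      # rounds over reference chunks
            for q, hits in match_on_gpu(sub_queries, sub_ref).items():
                # "Decompress" on the host: translate chunk-local offsets to global ones.
                results.setdefault(q, []).extend(h + round_no * ref_chunk for h in hits)
    return results

print(align("CCATAGGCTATATGCGCC", ["TAGG", "GCGC"], ref_chunk=9, qry_batch=2))
# {'TAGG': [3], 'GCGC': [13]}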

10

Space/Time Trade-off Analysis

11

The core data structure

Massive number of queries and long reference => pre-process the reference into an index

[Figure: suffix tree for the example reference TACACA$]

Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])

Search: O(qry_len) per query

Space: O(ref_len), but the constant is high: ~20x ref_len

Post-processing: O(4^(qry_len - min_match_len)), DFS traversal per query

12

The core data structure (continued)

Mapping the suffix-tree costs onto the offloading algorithm from before: the tree's large space footprint and its exponential post-processing make the data-transfer (CopyToGPU / CopyFromGPU) and Decompress stages expensive, while the GPU matching kernel itself is efficient.

13

A better matching data structure

[Figure: suffix tree vs. suffix array for the example reference TACACA$; the suffix array stores the sorted suffixes: A$, ACA$, ACACA$, CA$, CACA$, TACACA$]

              Suffix Tree                        Suffix Array
Space         O(ref_len), ~20x ref_len           O(ref_len), ~4x ref_len
Search        O(qry_len)                         O(qry_len x log ref_len)
Post-process  O(4^(qry_len - min_match_len))     O(qry_len - min_match_len)

Impact 1: reduced communication (less data to transfer)
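As a concrete illustration of the suffix-array column, here is a small CPU-side sketch in Python (not MUMmerGPU code; the function names and toy strings are illustrative). The index is just one offset per reference position, which is roughly where the ~4x ref_len figure comes from if each offset is a 32-bit integer, and each query is located with a binary search costing O(qry_len x log ref_len) character comparisons.

def build_suffix_array(ref):
    # One entry (a suffix start offset) per position of the reference.
    # Naive construction for clarity; production tools use much faster builders.
    return sorted(range(len(ref)), key=lambda i: ref[i:])

def find_matches(ref, sa, qry):
    # Lower-bound binary search for the first suffix that starts with qry:
    # O(qry_len * log ref_len) comparisons, then collect the run of matches.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ref[sa[mid]:sa[mid] + len(qry)] < qry:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and ref[sa[lo]:sa[lo] + len(qry)] == qry:
        hits.append(sa[lo])
        lo += 1
    return hits

ref = "TACACA$"
sa = build_suffix_array(ref)        # [6, 5, 3, 1, 4, 2, 0]: $, A$, ACA$, ACACA$, CA$, CACA$, TACACA$
print(find_matches(ref, sa, "CA"))  # [4, 2]: "CA" starts at reference offsets 4 and 2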

14

A better matching data structure (continued)

(Same suffix tree vs. suffix array comparison as on the previous slide.)

Impact 2: better data locality is achieved at the cost of additional per-thread processing time

Space for longer sub-references => fewer processing rounds
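A rough back-of-the-envelope illustration of why this matters for the number of rounds, using only the ~20x and ~4x constants from the comparison table and the ~238M-symbol HS1 reference from the evaluation section; the exact per-symbol byte counts depend on the implementation, so treat these as order-of-magnitude numbers.

ref_len = 238_000_000  # HS1: human chromosome 2, ~238M symbols
for index, bytes_per_symbol in [("suffix tree", 20), ("suffix array", 4)]:
    print(f"{index}: ~{ref_len * bytes_per_symbol / 2**30:.1f} GB for the full reference index")
# suffix tree:  ~4.4 GB -> exceeds even the 4 GB Tesla; many small sub-references per round
# suffix array: ~0.9 GB -> fits comfortably; longer sub-references, fewer rounds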

15

A better matching data structure (continued)

(Same suffix tree vs. suffix array comparison as above.)

Impact 3: lower post-processing overhead

16

Evaluation

17

Evaluation setup

Workload   Species                  Reference length   # of queries   Avg. read length
HS1        Human (chromosome 2)     ~238M              ~78M           ~200
HS2        Human (chromosome 3)     ~100M              ~2M            ~700
MONO       L. monocytogenes         ~3M                ~6M            ~120
SUIS       S. suis                  ~2M                ~26M           ~36

Testbed: low-end GeForce 9800 GX2 GPU (512 MB) and high-end Tesla C1060 (4 GB)

Baseline: suffix tree on the GPU (MUMmerGPU [Schatz 07, 09])

Success metrics: performance and energy consumption

Workloads (NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces)

18

Speedup: array-based over tree-based

19

Dissecting the overheads

Significant reduction in data transfers and post-processing

Workload: HS1 (~78M queries, ~238M reference length) on the GeForce 9800 GX2

20

Summary

GPUs have drastically different performance characteristics

Reconsidering the choice of data structure is necessary when porting applications to the GPU

A good matching data structure ensures:
Low communication overhead
Data locality (possibly achieved at the cost of additional per-thread processing time)
Low post-processing overhead

21

Code available at: netsyslab.ece.ubc.ca