1
Harvesting the Opportunity of GPU-based Acceleration
Matei Ripeanu
Networked Systems Laboratory (NetSysLab)
University of British Columbia
Joint work with Abdullah Gharaibeh and Samer Al-Kiswany
2
A golf course …
… a (nudist) beach
(… and 199 days of rain each year)
3
Hybrid architectures in Top 500 [Nov’10]
4
• Hybrid architectures
  – High compute power / memory bandwidth
  – Energy efficient
  [operated today at low efficiency]
• Agenda for this talk
  – GPU architecture intuition
    • What generates the above characteristics?
  – Progress on efficiently harnessing hybrid (GPU-based) architectures
5–11
Acknowledgement: Slides 5–11 borrowed from a presentation by Kayvon Fatahalian
12
Idea #3: Feed the cores with data
The processing elements are data hungry!
=> Wide, high-throughput memory bus
13
Idea #4: Hide memory access latency
10,000x parallelism!
=> Hardware-supported multithreading
14
The Resulting GPU Architecture

[Diagram: GPU with N multiprocessors; each multiprocessor holds M cores with
per-core registers, a shared memory, and an instruction unit; all
multiprocessors access global, texture, and constant memory; the GPU connects
to host memory on the host machine over PCIe]

NVIDIA Tesla C2050
448 cores
Four 'memories':
• Shared: fast (~4 cycles), small (48KB)
• Global: slow (400-600 cycles), large (up to 3GB), high throughput (~150GB/s)
• Texture: read-only
• Constant: read-only
Hybrid:
• PCIe 16x: ~4GB/s to host memory
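The bandwidth gap above is worth quantifying. A back-of-the-envelope sketch in Python, using the slide's figures (~4GB/s over PCIe 16x vs. ~150GB/s device global memory; the 3GB size is simply the card's capacity used as an example):

```python
# Idealized transfer-time estimate: ignores latency, pinning, and overlap.
PCIE_GBPS = 4.0      # host <-> device over PCIe 16x (figure from the slide)
GLOBAL_GBPS = 150.0  # device global memory throughput (figure from the slide)

def transfer_seconds(size_gb, bandwidth_gbps):
    """Time to move `size_gb` gigabytes at `bandwidth_gbps` GB/s."""
    return size_gb / bandwidth_gbps

# Moving a full 3GB card's worth of data:
host_to_device = transfer_seconds(3.0, PCIE_GBPS)    # 0.75 s over PCIe
on_device_read = transfer_seconds(3.0, GLOBAL_GBPS)  # 0.02 s from global memory
print(f"PCIe is {host_to_device / on_device_read:.1f}x slower")
```

This order-of-magnitude gap between host-device and on-device bandwidth is what makes data transfers the dominant overhead later in the talk.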
15
GPUs offer different characteristics
• High peak compute power
• High peak memory bandwidth
• High host-device communication overhead
• Limited memory space
• Complex to program
16
Projects at NetSysLab@UBC: http://netsyslab.ece.ubc.ca

Porting applications to efficiently exploit GPU characteristics
• Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
• Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing Magazine, January/February 2011

Middleware runtime support to simplify application development
• CrystalGPU: Transparent and Efficient Utilization of GPU Power, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, TR

GPU-optimized building blocks: data structures and libraries
• GPU Support for Batch Oriented Workloads, L. Costa, S. Al-Kiswany, M. Ripeanu, IPCCC'09
• Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
• A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC'10
• On GPU's Viability as a Middleware Accelerator, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, M. Ripeanu, JoCC'08
17
Motivating Question: How should we design applications to efficiently exploit GPU characteristics?
Context: a bioinformatics problem, sequence alignment
• A string-matching problem
• Data intensive (~10^2 GB)

Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
18
Past work: sequence alignment on GPUs

MUMmerGPU [Schatz 07, Trapnell 09]:
• A GPU port of the sequence alignment tool MUMmer [Kurtz 04]
• ~4x (end-to-end) speedup compared to the CPU version
• [Chart: > 50% of runtime is overhead]

Hypothesis: mismatch between the core data structure (suffix tree) and GPU characteristics
19
Idea: trade off time for space

Use a space-efficient data structure (though from a higher computational complexity class): the suffix array
=> 4x speedup compared to the suffix-tree-based GPU implementation

Consequences:
• Significant overhead reduction
• Focus shifts toward optimizing the compute stage
• Opportunity to exploit multi-GPU systems, as I/O is less of a bottleneck
20
Outline for the rest of this talk
Sequence alignment: background and offloading to GPU
Space/Time trade-off analysis
Evaluation
21
Background: the sequence alignment problem

[Figure: short query reads (CCAT, GGCT, CGCCCTA, GCAATTT, ...) aligned against
the reference ..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG..]

Problem: find where each query most likely originated from
• Queries: ~10^8 queries, 10^1 to 10^2 symbols per query
• Reference: 10^6 to 10^11 symbols (up to ~400GB)
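To make the problem concrete, here is a toy Python sketch of the matching task (exact matching only; real aligners such as MUMmer report maximal matches, and the function name is my own):

```python
# Toy illustration of the alignment problem: for each short query read,
# report every position where it occurs verbatim in the reference.
def find_origins(reference, queries):
    """Map each query to the list of positions where it occurs in the reference."""
    hits = {}
    for q in queries:
        positions = []
        start = reference.find(q)
        while start != -1:
            positions.append(start)
            start = reference.find(q, start + 1)  # continue past this hit
        hits[q] = positions
    return hits

ref = "CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG"  # reference fragment from the slide
print(find_origins(ref, ["GGCT", "GCGG"]))
```

At the scales on the slide (~10^8 queries against up to 10^11 symbols), a scan like this is hopeless; hence the index structures discussed next.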
22
GPU Offloading: Opportunity and Challenges

Opportunity:
• Sequence alignment is easy to partition and memory intensive
• GPUs are massively parallel, with high memory bandwidth

Challenges:
• Data-intensive problem, large output size
• Limited GPU memory space; no direct access to other I/O devices (e.g., disk)
23
GPU Offloading: addressing the challenges

• Data-intensive problem and limited memory space
  => divide and compute in rounds
  => search-optimized data structures
• Large output size
  => compressed output representation (decompressed on the CPU)

High-level algorithm (executed on the host):

subrefs = DivideRef(ref)
subqrysets = DivideQrys(qrys)
foreach subqryset in subqrysets {
    results = NULL
    CopyToGPU(subqryset)
    foreach subref in subrefs {
        CopyToGPU(subref)
        MatchKernel(subqryset, subref)
        CopyFromGPU(results)
    }
    Decompress(results)
}
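The round-based host loop can be sketched as CPU-only Python (all names are stand-ins: `match_kernel` is a plain substring search in place of the GPU kernel, and this simple chunking misses matches that straddle sub-reference boundaries, which a real implementation handles by overlapping chunks):

```python
def chunks(seq, size):
    """Split a sequence into fixed-size pieces (the Divide* steps)."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def match_kernel(queries, subref, offset):
    """Stand-in for the GPU kernel: exact hits of each query in one sub-reference."""
    return [(q, offset + p)
            for q in queries
            for p in range(len(subref) - len(q) + 1)
            if subref[p:p + len(q)] == q]

def align(reference, queries, ref_chunk, qry_chunk):
    """Divide queries and reference, then compute in rounds as on the slide."""
    results = []
    for qset in chunks(queries, qry_chunk):              # one round per query batch
        for i, subref in enumerate(chunks(reference, ref_chunk)):
            # in the real system: CopyToGPU / MatchKernel / CopyFromGPU
            results += match_kernel(qset, subref, i * ref_chunk)
    return results

print(align("ABABAB", ["AB"], ref_chunk=6, qry_chunk=2))
```

The chunk sizes are the tuning knob: larger sub-references mean fewer rounds, which is exactly where the space savings of the suffix array pay off later.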
24
Space/Time Trade-off Analysis
25
The core data structure

Massive number of queries and a long reference => pre-process the reference into an index

[Figure: suffix tree for TACACA$, with leaves labeled by suffix start positions 0-5]

Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
• Search: O(qry_len) per query
• Space: O(ref_len), but the constant is high: ~20 x ref_len
• Post-processing: DFS traversal for each query, O(4^(qry_len - min_match_len))
26
The core data structure

Massive number of queries and a long reference => pre-process the reference into an index

Past work: build a suffix tree (MUMmerGPU [Schatz 07])
• Search: O(qry_len) per query [Efficient]
• Space: O(ref_len), but the constant is high: ~20 x ref_len [Expensive]
• Post-processing: O(4^(qry_len - min_match_len)), DFS traversal per query [Expensive]
27
A better matching data structure?

[Figure: suffix tree for TACACA$ vs. its suffix array:
0 A$ | 1 ACA$ | 2 ACACA$ | 3 CA$ | 4 CACA$ | 5 TACACA$]

              Suffix Tree                       Suffix Array
Space         O(ref_len), 20 x ref_len         O(ref_len), 4 x ref_len
Search        O(qry_len)                       O(qry_len x log ref_len)
Post-process  O(4^(qry_len - min_match_len))   O(qry_len - min_match_len)

Impact 1: Reduced communication (less data to transfer)
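The suffix-array column of the table can be sketched in a few lines of Python: a naive build (a production system would use an O(ref_len log ref_len) or linear-time construction) plus the O(qry_len x log ref_len) binary search:

```python
def build_suffix_array(text):
    """Sorted start positions of all suffixes (naive O(n^2 log n) build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def search(text, sa, query):
    """Binary search over the suffix array: O(qry_len x log ref_len) per query."""
    lo, hi = 0, len(sa)
    while lo < hi:  # leftmost suffix whose qry_len-prefix is >= query
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # one past the last suffix prefixed by query
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(query)] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])  # match positions in the reference

text = "TACACA$"                # the slide's running example
sa = build_suffix_array(text)   # [6, 5, 3, 1, 4, 2, 0]
print(search(text, sa, "CA"))   # positions 2 and 4
```

Each entry is one integer (4 bytes in practice), which is where the ~4 x ref_len space figure in the table comes from, versus ~20 x ref_len for the pointer-heavy tree.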
28
A better matching data structure
(Suffix tree vs. suffix array comparison repeated from the previous slide.)

Impact 2: Better data locality, achieved at the cost of additional per-thread processing time
Space freed for longer sub-references => fewer processing rounds
29
A better matching data structure
(Suffix tree vs. suffix array comparison repeated from the previous slide.)

Impact 3: Lower post-processing overhead
30
Evaluation
31
Evaluation setup

Workloads (NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces):

Workload / Species              Reference length   # of queries   Avg. read length
HS1  - Human (chromosome 2)     ~238M              ~78M           ~200
HS2  - Human (chromosome 3)     ~100M              ~2M            ~700
MONO - L. monocytogenes         ~3M                ~6M            ~120
SUIS - S. suis                  ~2M                ~26M           ~36

Testbed:
• Low-end: GeForce 9800 GX2 GPU (512MB)
• High-end: Tesla C1060 (4GB)

Baseline: suffix tree on GPU (MUMmerGPU [Schatz 07, 09])

Success metrics: performance, energy consumption
32
Speedup: array-based over tree-based
33
Dissecting the overheads
Significant reduction in data transfers and post-processing
Workload: HS1, ~78M queries, ~238M ref. length on GeForce
34
Comparing with CPU performance [baseline: single-core CPU performance]
[Chart: bars labeled Suffix tree, Suffix tree, Suffix array]
35
Summary

GPUs have drastically different performance characteristics.
Reconsidering the choice of data structure is necessary when porting applications to the GPU.
A good matching data structure ensures:
• Low communication overhead
• Data locality (possibly achieved at the cost of additional per-thread processing time)
• Low post-processing overhead
36
Code, benchmarks and papers available at: netsyslab.ece.ubc.ca