GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
PGconf.EU 2015 Wien
NEC Business Creation Division
The PG-Strom Project
KaiGai Kohei <[email protected]>
2
Today’s topic
PGconf.EU 2015 - GPGPU Accelerates PostgreSQL
GPGPU: very powerful computing capability
PostgreSQL: very functional and widely used database
PG-Strom: the glue between these two promising technologies
3
Self Introduction
▌Name: KaiGai Kohei
▌Works for: NEC
▌Role:
Lead of the PG-Strom project
Contribution to PostgreSQL and other OSS projects
Making business opportunity using this technology
▌PG-Strom Project
Mission: To deliver the power of semiconductor evolution to everyone who wants it
Launched in Jan-2012 as a personal development project
Project is now funded by NEC
Fully open source project
4
Agenda
1. GPU characteristics and why?
2. What intermediates PostgreSQL and GPU
3. How GPU processes SQL workloads
4. Performance results and analysis
5. Further development
5
Features of GPU (Graphics Processing Unit)
▌Characteristics
Massive parallel cores
Much higher DRAM bandwidth
Better price / performance ratio
Advantageous for massive but simple arithmetic operations
Requires a different way of writing software algorithms
SOURCE: CUDA C Programming Guide
                               GPU                         CPU
Model                          Nvidia GTX TITAN X          Intel Xeon E5-2690 v3
Architecture                   Maxwell                     Haswell
Launch                         Mar-2015                    Sep-2014
# of transistors               8.0 billion                 3.84 billion
# of cores                     3072 (simple)               12 (functional)
Core clock                     1.0GHz                      2.6GHz, up to 3.5GHz
Peak FLOPS (single precision)  6.6TFLOPS                   998.4GFLOPS (with AVX2)
DRAM size                      12GB, GDDR5 (384-bit bus)   768GB/socket, DDR4
Memory bandwidth               336.5GB/s                   68GB/s
Power consumption              250W                        135W
Price                          $999                        $2,094
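The peak-FLOPS and bandwidth rows can be sanity-checked with simple arithmetic. The sketch below assumes 32 single-precision FLOPs per cycle per Haswell core (two 8-wide AVX2 FMA units) and 2 FLOPs per cycle per GPU core (one FMA). These per-cycle figures are assumptions, not values from the slide, and the function name `peak_gflops` is illustrative only.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak = cores x clock (GHz) x FLOPs issued per core per cycle."""
    return cores * clock_ghz * flops_per_cycle


# CPU: 12 cores at 2.6GHz; 32 FLOPs/cycle/core is an assumption (AVX2 with FMA).
cpu_gflops = peak_gflops(12, 2.6, 32)

# GPU at the 1.0GHz clock from the table; 2 FLOPs/cycle/core is an assumption (FMA).
gpu_gflops_base = peak_gflops(3072, 1.0, 2)

# Ratios taken directly from the table rows.
bandwidth_ratio = 336.5 / 68   # GPU memory bandwidth / CPU memory bandwidth
price_ratio = 2094 / 999       # CPU price / GPU price

print(f"CPU peak: {cpu_gflops:.1f} GFLOPS")
print(f"GPU peak at 1.0GHz: {gpu_gflops_base / 1000:.2f} TFLOPS")
print(f"Memory bandwidth ratio (GPU/CPU): {bandwidth_ratio:.1f}x")
print(f"Price ratio (CPU/GPU): {price_ratio:.1f}x")
```

Under these assumptions the CPU figure reproduces the 998.4GFLOPS in the table. The GPU product at 1.0GHz comes out near 6.1TFLOPS; the 6.6TFLOPS in the table presumably reflects a higher boost clock, which is a guess rather than something the slide states.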
6
How GPU cores work
[Figure: calculation of Σ item[i] (i = 0…N−1) with GPU cores. Sixteen inputs, item[0] to item[15], are combined pairwise over step.1 to step.4.]
Sum of items[] by log2N steps
Inter-core synchronization by HW functionality
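The log2(N)-step summation can be modeled on the CPU as follows. This is a minimal sketch, not PG-Strom code: the inner `for` loop stands in for GPU cores that would run concurrently, and each pass of the outer loop stands in for one hardware-synchronized step. The name `tree_reduce_sum` is made up for illustration.

```python
def tree_reduce_sum(items):
    """Sum items pairwise; returns (total, number_of_steps)."""
    values = list(items)
    n = len(values)
    stride = 1
    steps = 0
    while stride < n:
        # On a GPU, each iteration of this inner loop would run on its own core.
        for i in range(0, n - stride, 2 * stride):
            values[i] += values[i + stride]
        stride *= 2
        steps += 1  # one inter-core synchronization between steps
    return (values[0] if values else 0), steps


total, steps = tree_reduce_sum(range(16))
print(total, steps)  # 16 items are summed in 4 steps
```

For 16 items the loop finishes in 4 steps, matching step.1 to step.4 in the figure; a sequential loop would need 15 additions one after another.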
7
Dark Silicon and Semiconductor Trend
▌Shrinking the process also increases power leakage.
▌Dark Silicon: chip area that cannot be powered on because of thermal limits.
▌People asked: how can we utilize these extra transistors?
Domain specific processors, like GPU
[Figure: active cores vs. Dark Silicon area across successive process generations ("Next Gen"), shown next to the Intel Haswell on-chip layout (4-core system).]
8
Towards heterogeneous era...
Software must be designed to mediate between CPU and GPU
9
Agenda
1. GPU characteristics and why?
2. What intermediates PostgreSQL and GPU
3. How GPU processes SQL workloads
4. Performance results and analysis
5. Further development
10
What is PG-Strom
▌Core ideas
① GPU native code generation on the fly
② Asynchronous massive parallel execution
▌Advantages
Transparent acceleration with 100% query compatibility
Commodity H/W and lower system integration cost
[Figure: PostgreSQL Parser → Planner → Executor. PG-Strom attaches through the Custom-Scan/Join Interface; labels on the PG-Strom side: CUDA source code, nvrtc, CUDA driver, DMA data transfer, massive parallel execution.]
Query: SELECT * FROM l_tbl JOIN r_tbl on l_tbl.lid = r_tbl.rid;
11
How a query is processed
▌Custom-Scan/Join interface (v9.5 new!)
▌PG-Strom provides its own scan/join logic using GPUs
▌It generates the same results, but in a different way
[Figure: SQL (text) → Query Parser → Parse Tree → Query Optimizer → Query Plan → Query Executor. PG-Strom attaches through the Custom Scan/Join Interface (interactions between PostgreSQL and extensions); labels on the PG-Strom side: CUDA code, Data Chunk, NVIDIA Run-Time Compiler.]
12
(OT) Dive into Custom-Join
[Figure: Built-in JOIN vs. Custom JOIN.
Built-in JOIN: a Join node (NestedLoop, MergeJoin or HashJoin) takes X-tuples and Y-tuples from the Inner Scan and Outer Scan over table-X and table-Y (TupleDesc of table-X, TupleDesc of table-Y) and emits joined-tuples (TupleDesc of Join X-Y).
Custom JOIN: a CustomScan (scanrelid==0) whose logic is implemented by the extension. It loads data from table-X, table-Y or some other data source in whatever way suits the extension, and emits joined-tuples in a compatible tuple format, with the definition fixed at planning time.]
13
SQL-to-GPU native code generation (1/3)
postgres=# EXPLAIN VERBOSE
           SELECT cat, count(*), avg(x) FROM t0 WHERE x between y and y + 20.0 GROUP BY cat;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 HashAggregate  (cost=167646.33..167646.36 rows=3 width=12)
   Output: cat, pgstrom.count((pgstrom.nrows())), pgstrom.avg((pgstrom.nrows((x IS NOT NULL))), (pgstrom.psum(x)))
   Group Key: t0.cat
   ->  Custom Scan (GpuPreAgg)  (cost=24150.22..163315.53 rows=258 width=48)
         Output: cat, pgstrom.nrows(), pgstrom.nrows((x IS NOT NULL)), pgstrom.psum(x)
         Bulkload: On (density: 100.00%)
         Reduction: Local + Global
         Device Filter: ((t0.x >= t0.y) AND (t0.x <= (t0.y + '20'::double precision)))
         Features: format: tuple-slot, bulkload: unsupported
         Kernel Source: /opt/pgsql/base/pgsql_tmp/pgsql_tmp_strom_24905.4.gpu
         ->  Custom Scan (BulkScan) on public.t0  (cost=21150.22..159312.94 rows=10000060 width=12)
               Output: id, cat, aid, bid, cid, did, eid, x, y, z
               Features: format: heap-tuple, bulkload: supported
(13 rows)
14
SQL-to-GPU native code generation (2/3)
$ less /opt/pgsql/base/pgsql_tmp/pgsql_tmp_strom_24905.4.gpu
    :
STATIC_FUNCTION(bool)
gpupreagg_qual_eval(kern_context *kcxt,
                    kern_data_store *kds,
                    kern_data_store *ktoast,
                    size_t kds_index)
{
  pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1);            /* constant 20.0 */
  pg_float8_t KVAR_8 = pg_float8_vref(kds,kcxt,7,kds_index); /* var of X */
  pg_float8_t KVAR_9 = pg_float8_vref(kds,kcxt,8,kds_index); /* var of Y */

  return EVAL(pgfn_boolop_and_2(kcxt,
                pgfn_float8ge(kcxt, KVAR_8, KVAR_9),
                pgfn_float8le(kcxt, KVAR_8,
                  pgfn_float8pl(kcxt, KVAR_9, KPARAM_1))));
}
    :
formula expression based on WHERE clause
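How a WHERE clause could be turned into such an expression can be sketched with a small recursive emitter. This is an illustration only, not PG-Strom's actual generator: the tuple-based expression tree, the `DEVICE_FUNCS` table and the `emit` function are hypothetical, while the `pgfn_*` names, `KVAR_8`, `KVAR_9` and `KPARAM_1` are copied from the generated source shown above.

```python
# Hypothetical mapping from SQL operators to the device function names on the slide.
DEVICE_FUNCS = {
    ">=": "pgfn_float8ge",
    "<=": "pgfn_float8le",
    "+": "pgfn_float8pl",
    "AND": "pgfn_boolop_and_2",
}


def emit(expr):
    """Recursively turn a nested (op, left, right) tuple into a C expression string."""
    if isinstance(expr, str):
        return expr  # a variable or parameter reference such as KVAR_8
    op, left, right = expr
    return f"{DEVICE_FUNCS[op]}(kcxt, {emit(left)}, {emit(right)})"


# WHERE x BETWEEN y AND y + 20.0, i.e. (x >= y) AND (x <= y + 20.0)
where_clause = (
    "AND",
    (">=", "KVAR_8", "KVAR_9"),
    ("<=", "KVAR_8", ("+", "KVAR_9", "KPARAM_1")),
)
code = f"return EVAL({emit(where_clause)});"
print(code)
```

The printed line has the same shape as the `return` statement in the generated kernel source, which is the per-query part of the GPU code.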
15
SQL-to-GPU native code generation (3/3)
[Figure: the PG-Strom GPU code generator turns the SQL query into dynamically constructed GPU code (the per-query variant) and adds it to statically constructed GPU code (the algorithm core: GpuScan, GpuHashJoin, GpuPreAgg, GpuSort, ...). The NVIDIA Run-Time Compiler (nvrtc) builds a GPU binary from both, which is used for execution and cached for next time.]
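The "cached GPU binary" idea can be sketched as a cache keyed by a digest of the combined source, so that an identical query variant skips compilation the next time. Everything here is hypothetical: `KernelCache` and `compile_fn` are illustrative names, the compiler is a stand-in callable rather than nvrtc, and the slides do not say how PG-Strom keys its cache.

```python
import hashlib


class KernelCache:
    """Cache compiled binaries by a digest of static + dynamic source text."""

    def __init__(self, compile_fn):
        self._compile = compile_fn   # stand-in for the run-time compiler
        self._binaries = {}
        self.compile_count = 0

    def get(self, static_src, dynamic_src):
        source = static_src + "\n" + dynamic_src
        key = hashlib.sha256(source.encode("utf-8")).hexdigest()
        if key not in self._binaries:
            # Cache miss: compile once and keep the result for next time.
            self.compile_count += 1
            self._binaries[key] = self._compile(source)
        return self._binaries[key]


# Fake compiler for demonstration: "compiles" by upper-casing the source.
cache = KernelCache(lambda src: src.upper())
first = cache.get("/* algorithm core */", "/* qualifier for query A */")
second = cache.get("/* algorithm core */", "/* qualifier for query A */")
print(first is second, cache.compile_count)
```

Running the same query variant twice triggers a single compilation; a different dynamic part produces a new key and a new compilation.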
16
PG-Strom functionality as of Sep-2015
▌Logic
GpuScan ... Simple loop extraction by GPU multithreading
GpuHashJoin ... GPU multithreaded N-way hash join
GpuNestLoop ... GPU multithreaded N-way nested loop
GpuPreAgg ... Row reduction prior to CPU aggregation
GpuSort ... Hybrid sorting: GPU bitonic sort + CPU merge
▌Data Types
Numeric ... int2/4/8, float4/8, numeric
Date and Time ... date, time(tz), timestamp(tz), interval
Text ... simple matching only, at this moment
others ... currency
▌Functions
Comparison operators ... <, <=, !=, =, >=, >
Arithmetic operators ... +, -, *, /, %, ...
Mathematical functions ... sqrt, log, exp, ...
Aggregate functions ... min, max, sum, avg, stddev, ...
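The hybrid approach listed for GpuSort (bitonic sort on the GPU, merge on the CPU) can be illustrated on the CPU alone. In this sketch each chunk is sorted with a bitonic sorting network, whose compare-and-swap pairs within one pass are independent and could therefore each run on a GPU thread, and the sorted chunks are merged with `heapq.merge`. The function names and the power-of-two chunk restriction are assumptions of this sketch, not details from the slides.

```python
import heapq


def bitonic_sort(values):
    """Sort a chunk with a bitonic network; the length must be a power of two."""
    a = list(values)
    n = len(a)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("chunk length must be a power of two")
    k = 2
    while k <= n:
        j = k // 2
        while j > 0:
            # All (i, partner) pairs in this pass are independent of each other.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


def hybrid_sort(chunks):
    """Sort each chunk independently (GPU side), then merge them (CPU side)."""
    sorted_chunks = [bitonic_sort(chunk) for chunk in chunks]
    return list(heapq.merge(*sorted_chunks))


print(hybrid_sort([[5, 1, 4, 8], [7, 3, 6, 2]]))
```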
17
Agenda
1. GPU characteristics and why?
2. What intermediates PostgreSQL and GPU
3. How GPU processes SQL workloads
4. Performance results and analysis
5. Further development
18
Asynchronous & Pipelining Execution (1/3)
▌Karma of discrete GPU device
We have to move the data to be processed.
Thus, people say data intensive tasks are not good for GPU.
Correct, but we can reduce the impact.
Our solution: Pipelining
[Figure: timeline of CPU and GPU activity, showing DMA Send, GPU kernel execution and DMA Recv over time.]
19
Asynchronous & Pipelining Execution (2/3)
[Figure: the table scan moves to the next chunk (chunk-i, chunk-(i+1), chunk-(i+2), chunk-(i+3), ...) at every step. Each chunk passes through Buffer Read → DMA Send → GPU Kernel Exec → DMA Recv, one stage per step, so from Step-i to Step-(i+4) several chunks are in flight at different stages.]
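The benefit of overlapping the four stages can be estimated with a simple timing model. The sketch below assumes every stage has its own resource and that stages of different chunks overlap perfectly; the stage durations are invented for illustration and are not measurements from the talk.

```python
def sequential_time(n_chunks, stage_times):
    """Each chunk runs all stages before the next chunk starts."""
    return n_chunks * sum(stage_times)


def pipelined_time(n_chunks, stage_times):
    """Fill the pipeline once, then finish one chunk per slowest-stage interval."""
    if n_chunks == 0:
        return 0
    return sum(stage_times) + (n_chunks - 1) * max(stage_times)


# Hypothetical per-chunk durations (arbitrary units):
# Buffer Read, DMA Send, GPU Kernel Exec, DMA Recv
stages = [4, 2, 3, 1]

for n in (1, 10, 100):
    print(n, sequential_time(n, stages), pipelined_time(n, stages))
```

With a single chunk the two models give the same time, so very small workloads gain nothing from pipelining; with many chunks the total approaches the cost of the slowest stage alone.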
20
Asynchronous & Pipelining Execution (3/3)
▌Karma of discrete GPU device
We have to move the data to be processed.
Thus, people say data intensive tasks are not good for GPU.
Correct, but we can reduce the impact
Our solution: Pipelining
▌Downside of the pipelining
Too small workloads
Pipeline hazard
[Figure: timeline of CPU and GPU activity, showing DMA Send, GPU kernel execution and DMA Recv over time.]
21
SQL Logic on GPU (1/3) – GpuHashJoin
[Figure: Built-in Hash-Join vs. GpuHashJoin of PG-Strom. In both, the inner relation is loaded into a hash table, the outer relation is probed against it, and the result goes to the next step.
Built-in Hash-Join: sequential hash-table search by CPU, sequential projection by CPU.
GpuHashJoin of PG-Strom: parallel hash-table search by GPU, parallel projection by GPU. All the CPU does is reference the result of the join.]
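The reason hash-table search parallelizes well is that each outer row is probed independently of every other outer row. A minimal CPU-side sketch follows; the table and column names (`l_tbl`, `r_tbl`, `lid`, `rid`) follow the example query earlier in the deck, while the sample rows, the extra columns and the function names are made up for illustration.

```python
from collections import defaultdict


def build_hash_table(inner_rows, key):
    """Build phase: group inner rows by join key."""
    table = defaultdict(list)
    for row in inner_rows:
        table[row[key]].append(row)
    return table


def probe_one(outer_row, table, key):
    """Probe + projection for one outer row; needs no other outer row."""
    return [{**outer_row, **match} for match in table.get(outer_row[key], [])]


def hash_join(outer_rows, inner_rows, outer_key, inner_key):
    table = build_hash_table(inner_rows, inner_key)
    results = []
    # On a GPU, each iteration of this loop could be handled by its own thread.
    for row in outer_rows:
        results.extend(probe_one(row, table, outer_key))
    return results


l_tbl = [{"lid": 1, "lval": "a"}, {"lid": 2, "lval": "b"}, {"lid": 3, "lval": "c"}]
r_tbl = [{"rid": 1, "rval": "x"}, {"rid": 3, "rval": "y"}, {"rid": 3, "rval": "z"}]
for joined in hash_join(l_tbl, r_tbl, "lid", "rid"):
    print(joined)
```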
22
OT: Non-uniformly Distributed Hash-Table
[Figure: GPU Core-0 to Core-M each take one item of the outer relation (Item[0] to Item[N]) and probe the inner hash table, which maps hash values from the hash function to items of the inner relation. When many inner items share one hash value, a single outer item (Item[3] in the figure) matches all of them, which overflows the result buffer.]
23
OT: Non-uniformly Distributed Hash-Table
[Figure: the same layout, with the inner hash table split into virtual partitions with extra randomness. The matches for Item[3] within one virtual partition just fit in the result buffer.]
Retry for each virtual partition
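The retry-per-virtual-partition idea can be modeled as follows: if one pass would produce more rows than the result buffer holds, the inner side is split into more partitions and each partition is joined separately. This sketch uses round-robin assignment by position as a stand-in for the "extra randomness" on the slide, and doubles the partition count on overflow; both choices, and all names, are assumptions rather than PG-Strom's actual policy.

```python
def join_with_virtual_partitions(outer, inner, buffer_capacity):
    """Equi-join two lists of keys; returns (result_pairs, partitions_used)."""
    n_parts = 1
    while True:
        results = []
        overflow = False
        for p in range(n_parts):
            # Partition by position, not by key, so duplicates of a hot key spread out.
            part = [v for i, v in enumerate(inner) if i % n_parts == p]
            buffer = [(o, v) for o in outer for v in part if o == v]
            if len(buffer) > buffer_capacity:
                overflow = True   # result buffer too small for this partition
                break
            results.extend(buffer)
        if not overflow:
            return results, n_parts
        if n_parts >= len(inner):
            raise RuntimeError("a single inner item already overflows the buffer")
        n_parts *= 2              # retry with finer virtual partitions


outer_keys = [3, 3]
inner_keys = [3] * 10 + [1, 2]    # skewed: ten inner items share key 3
pairs, parts = join_with_virtual_partitions(outer_keys, inner_keys, buffer_capacity=8)
print(len(pairs), parts)
```

Every failed attempt wastes a full pass, which is why a better up-front estimate of the result size reduces the number of retries.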
24
SQL Logic on GPU (2/3) – GpuNestLoop (unparameterized)
[Figure: two-dimensional GPU kernel launch. The outer relation (Nx: usually larger; split chunk-by-chunk on demand) runs along blockDim.x and the inner relation (Ny: relatively small) runs along blockDim.y. Each thread, e.g. Thread (X=2, Y=3), evaluates one pair.]
Only the edge threads reference DRAM to fetch values.
Nx:32 x Ny:32 = 1024
A 32 x 32 matrix can be evaluated with only 64 DRAM accesses
25
SQL Logic on GPU (3/3) – GpuPreAgg
▌GpuPreAgg, placed before Aggregate, reduces the number of rows to be processed.
▌The query is rewritten to produce equivalent results from the intermediate aggregation.
SELECT count(*), AVG(X), Y FROM Tbl_1 GROUP BY Y;
[Figure: two plan trees for this query.
Without GpuPreAgg: Tbl_1 (X, Y, Z; table definition) → SeqScan (X, Y; several million rows) → Sort (X, Y; several million rows) → GroupAggregate (Y, count(*), avg(X)) → Result Set (several rows).
With GpuPreAgg: Tbl_1 (X, Y, Z; table definition) → GpuScan (X, Y; several million rows) → GpuPreAgg (Y, nrows, psum_X; several hundred rows) → Sort (Y, nrows, psum_X; several hundred rows) → GroupAggregate (Y, sum(nrows), avg_ex(nrows, psum)) → Result Set (several rows).
GpuPreAgg reduces the number of rows to be processed.]
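The rewrite works because count and average can be rebuilt from per-chunk partial results: each chunk emits one (Y, nrows, psum_X) row per group, and the final aggregate sums those. The sketch below models this on the CPU; the function names and the sample rows are illustrative, and `final_aggregate` only mirrors what `sum(nrows)` and `avg_ex(nrows, psum)` appear to compute according to the plan on the slide.

```python
from collections import defaultdict


def pre_aggregate(chunk):
    """Partial aggregation of one chunk: returns rows of (Y, nrows, psum_X)."""
    partial = defaultdict(lambda: [0, 0.0])
    for x, y in chunk:
        partial[y][0] += 1     # nrows
        partial[y][1] += x     # psum_X
    return [(y, nrows, psum) for y, (nrows, psum) in partial.items()]


def final_aggregate(partial_rows):
    """Final aggregation: count = sum(nrows), avg = sum(psum) / sum(nrows)."""
    totals = defaultdict(lambda: [0, 0.0])
    for y, nrows, psum in partial_rows:
        totals[y][0] += nrows
        totals[y][1] += psum
    return {y: (nrows, psum / nrows) for y, (nrows, psum) in totals.items()}


# Rows of (X, Y), processed as two chunks.
rows = [(1.0, "a"), (3.0, "a"), (5.0, "b"), (7.0, "a"), (9.0, "b")]
chunks = [rows[:3], rows[3:]]

partials = [p for chunk in chunks for p in pre_aggregate(chunk)]
print(len(rows), "input rows ->", len(partials), "partial rows")
print(final_aggregate(partials))
```

The final step sees only as many rows as there are (chunk, group) combinations, which is the row reduction shown in the plan tree.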
26
Agenda
1. GPU characteristics and why?
2. What intermediates PostgreSQL and GPU
3. How GPU processes SQL workloads
4. Performance results and analysis
5. Further development
27
Microbenchmark (1/2) – total response time
▌SELECT cat, AVG(x) FROM t0 NATURAL JOIN t1 [, ...] GROUP BY cat;
Measures query response time as the number of inner relations increases
t0: 100M rows; t1 to t10: 100K rows each; all data was preloaded.
▌Environment:
PostgreSQL v9.5beta1 + PG-Strom (22-Oct), CUDA 7.0 + RHEL6.6 (x86_64)
CPU: Xeon E5-2670v3, RAM: 384GB, GPU: NVIDIA TESLA K20c (2496cores)
Query Response Time vs. Number of Tables Joined
# of tables involved          2       3       4       5       6       7       8
PostgreSQL 9.5β [sec]     58.89   82.51  102.15  125.46  159.83  194.50  241.88
PG-Strom [sec]            12.44   17.88   17.56   21.98   33.33   38.14   50.29
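The acceleration ratio per configuration follows directly from the chart values. The short script below assumes that the two series pair up with 2 to 8 tables in the order they appear in the transcript, with the larger values belonging to PostgreSQL; that pairing is read off the extracted chart labels and is not stated explicitly.

```python
tables = [2, 3, 4, 5, 6, 7, 8]
postgres_sec = [58.89, 82.51, 102.15, 125.46, 159.83, 194.50, 241.88]
pgstrom_sec = [12.44, 17.88, 17.56, 21.98, 33.33, 38.14, 50.29]

# Speedup = PostgreSQL response time / PG-Strom response time
speedup = {
    n: pg / strom
    for n, pg, strom in zip(tables, postgres_sec, pgstrom_sec)
}

for n, ratio in speedup.items():
    print(f"{n} tables: {ratio:.2f}x")
```

With this pairing every configuration lands between roughly 4.6x and 5.8x, with the highest ratio at 4 tables.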
28
Microbenchmark (2/2) – Time consumption per component
▌Good acceleration for JOIN and AGGREGATION
▌Lower acceleration ratio when N is larger than 6
Caused by inaccurate problem size estimation. Why?
[Figure: stacked bar chart, "Time consumption per component (PostgreSQL v9.5β vs PG-Strom)". X axis: # of tables involved (2 to 8), with one PgSQL bar and one Strom bar each. Y axis: query response time [sec], 0 to 300. Components: Scan, Join, Aggregate, Others.]
29
OT: Why accurate problem size estimation is important
▌GpuJoin results must be smaller than GPU RAM size.
▌Virtual inner-partition reduces the size of the Join results.
Better size estimation reduces the number of retries.
We can determine the problem size using dynamic parallelism
downside: supported only by Kepler or later generations.
[Figure: GpuJoin with virtual inner-partition and iteration. outerRel × innerRel (depth-1) × innerRel (depth-2) × innerRel (depth-3) × innerRel (depth-4) = Results, shown twice: once with results larger than GPU RAM, once with results within GPU RAM size.]
30
TPC-DS Benchmark (1/2) – Total Results
▌What is TPC-DS?
Successor of TPC-H; defines 103 (99+4) kinds of queries
Simulates decision support queries in the retail industry
▌About this benchmark
Ran the series of TPC-DS queries sequentially.
We could collect results for 96 of the 103 queries.
Environment:
• PostgreSQL v9.5beta1 + PG-Strom (22-Oct), CUDA 7.0 + RHEL6.6 (x86_64)
• CPU: Xeon E5-2670v3, RAM: 384GB, GPU: NVIDIA TESLA K20c (2496cores)
[Figure: bar chart of query response time [sec], 0 to 2000, PostgreSQL v9.5β vs PG-Strom; one outlier is marked "anomaly?".]
31
TPC-DS Benchmark (2/2) – Total Result
▌Summary
PG-Strom beats PostgreSQL: 76 of 96
PostgreSQL beats PG-Strom: 20 of 96
Shows advantages on most of the simple queries
▌Challenges
Wiser plan construction, to keep GpuXXX nodes out of places where they do not belong
Functionality improvement for Aggregation and Scan
[Figure: bar chart of query response time [sec], 0 to 500, PostgreSQL v9.5β vs PG-Strom.]
32
Time consumption per workloads (1/3) – PostgreSQL v9.5
▌Scan, Join and Aggregation clearly account for most of the time consumed in TPC-DS workloads
[Figure: 100% stacked bar chart, "Time consumption per workloads (PostgreSQL v9.5beta)". Y axis: 0% to 100%. Components: Scan, Join, Aggregate, Others.]
33
Time consumption per workloads (2/3) – PG-Strom
▌PG-Strom dramatically reduced time for Joins.
▌Share of time spent in Joins: 37% → 17%
[Figure: 100% stacked bar chart, titled "Time consumption per workloads (PostgreSQL v9.5beta)" in the transcript. Y axis: 0% to 100%. Components: Scan, Join, Aggregate, Others.]
34
Time consumption per workloads (3/3) – Wow! GpuJoin
GpuJoin explains 84% of the performance improvement
[Figure: scatter plot with linear fit y = 0.962x - 7.289, R² = 0.8442.
X axis: Improvement of Join Time [sec] = (Join time of PgSQL) – (Join time of PG-Strom), -50 to 250.
Y axis: Improvement of Total Time [sec] = (Total time of PgSQL) – (Total time of PG-Strom), -50 to 250.]
35
Agenda
1. GPU characteristics and why?
2. What intermediates PostgreSQL and GPU
3. How GPU processes SQL workloads
4. Performance results and analysis
5. Further development
36
What we learned from TPC-DS
▌GpuJoin may be a killer feature of PG-Strom
....as long as problem size estimation is accurate.
The planner needs to be wiser and must not inject GpuJoin into situations where it is not suitable.
▌For Aggregation
GpuPreAgg did not appear as often as we expected
Is there anything else the GPU can do to help CPU aggregation?
▌For Scan
Scan performance == data stream bandwidth of GPU input
A single thread limits the bandwidth of data supply
37
Feature enhancement / improvement
▌Target-list calculation on GPU
Pre-calculation of hash-value for HashAggregate
Projection off-load for complicated mathematical expressions
▌Custom-Scan under Gather node
Allows multiple workers to provide data stream to GPUs
▌Partial Aggregation before Join
Wider utilization of GpuPreAgg to reduce number of rows to be joined by CPU
▌Utilization of dynamic parallelism
A smarter approach to the result buffer overflow problem in Join
▌... and more...
38
Short term development
1. Software quality improvement
2. Corner case handling
3. Documentation
4. Target-List calculation in GPU
5. Software quality improvement
39
WIP Project – In-place Computing
Why export the entire data set to run a mathematical algorithm?
▌Runs complicated mathematical expressions on GPUs, then the CPU just references the result
Algorithms written in SQL (even partially) automatically move to GPUs
WIP: Data-Clustering algorithms, like Min-Max, K-means, ...
[Figure: Data Export vs. Principle of Near-Data Computing. In both cases a mathematical algorithm turns data into a result (the things one wants to know). With data export, the data leaves the database before the algorithm runs; with near-data computing, the algorithm runs where the data is. One side is labeled "Limited computing capability".]
40
Future Direction of PG-Strom
Real-life workloads are the only compass for development
41
WE WANT YOU TO GET INVOLVED
▌PostgreSQL Users
who are concerned about query execution performance on their workloads
▌Application Developers
who are interested in trying yet another database acceleration methodology
▌PostgreSQL Developers
who are willing to bet on the future of the heterogeneous era
42
Resources
▌Github
https://github.com/pg-strom/devel
https://github.com/pg-strom/devel/issues
▌Documentation
Under construction.... sorry
▌Today’s slides
Will be uploaded soon; check #pgconfeu on Twitter
▌Author’s contact
e-mail: [email protected]
Tw: @kkaigai
43