GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~

1

GPGPU Accelerates PostgreSQL

~Unlock the power of multi-thousand cores~

PGconf.EU 2015 Wien

NEC Business Creation Division

The PG-Strom Project

KaiGai Kohei <[email protected]>

2

Today’s topic

PGconf.EU 2015 - GPGPU Accelerates PostgreSQL

Very powerful computing capability

Very functional & well-used

database

PG-Strom:

Glue of the both promising technologies

GPGPU

3

Self Introduction

▌Name: KaiGai Kohei

▌Works for: NEC

▌Role:

Lead of the PG-Strom project

Contribution to PostgreSQL and other OSS projects

Making business opportunity using this technology

▌PG-Strom Project

Mission: To deliver the power of semiconductor evolution for every people who want

Launched at Jan-2012, as a personal development project

Project is now funded by NEC

Fully open source project


4

Agenda

1. GPU characteristics and why?

2. What intermediates PostgreSQL and GPU

3. How GPU processes SQL workloads

4. Performance results and analysis

5. Further development

5

Features of GPU (Graphic Processor Unit)

▌Characteristics

Massive parallel cores

Much higher DRAM bandwidth

Better price / performance ratio

advantages for massive but simple arithmetic operations

Different manner to write software algorithm


SOURCE: CUDA C Programming Guide

GPU CPU

Model Nvidia GTX

TITAN X Intel Xeon E5-2690 v3

Architecture Maxwell Haswell

Launch Mar-2015 Sep-2014

# of transistors 8.0billion 3.84billion

# of cores 3072

(simple) 12

(functional)

Core clock 1.0GHz 2.6GHz,

up to 3.5GHz

Peak Flops (single

precision) 6.6TFLOPS

998.4GFLOPS (with AVX2)

DRAM size 12GB, GDDR5 (384bits bus)

768GB/socket, DDR4

Memory band 336.5GB/s 68GB/s

Power consumption

250W 135W

Price $999 $2,094

6

How GPU cores works


● item[0]

step.1 step.2 step.4 step.3

Calculation of

𝑖𝑡𝑒𝑚[𝑖]

𝑖=0…𝑁−1

with GPU cores

◆

●

▲ ■ ★

● ◆

●

● ◆ ▲

●

● ◆

●

● ◆ ▲ ■

●

● ◆

●

● ◆ ▲

●

● ◆

●

item[1]

item[2]

item[3]

item[4]

item[5]

item[6]

item[7]

item[8]

item[9]

item[10]

item[11]

item[12]

item[13]

item[14]

item[15]

Sum of items[] by log2N steps

Inter-core synchronization by HW functionality

7

Dark Silicon and Semiconductor Trend

▌Processor shrink also increases power leakage.

▌Dark Silicon: area cannot power-on due to thermal problem.

▌People thought: How to utilize these extra transistors?

Domain specific processors, like GPU


Core Core Core Core

Core Core Core Core

Dark Silicon Core Core

Core Core

Dark Silicon

Core

Core

Core

Core

Core

Core

Next Gen Next Gen

Intel Haswell on-chip layout (4core system)

8

Towards heterogeneous era...

Software must be designed to intermediate CPU and GPU


9

Agenda






10

What is PG-Strom

▌Core ideas

① GPU native code generation on the fly

② Asynchronous massive parallel execution

▌Advantages

Transparent acceleration with 100% query compatibility

Commodity H/W and less system integration cost


Parser

Planner

Executor

Custom- Scan/Join Interface

Query: SELECT * FROM l_tbl JOIN r_tbl on l_tbl.lid = r_tbl.rid;

PG-Strom

CUDA driver

nvrtc

DMA Data Transfer

CUDA Source code Massive

Parallel Execution

11

How query is processed

▌Custom-Scan/Join interface (v9.5 new!)

▌PG-Strom provides own scan/join logics using GPUs

▌It generates same results, but different way


Query Parser

Query Optimizer

Query Executor

SQL (text)

Parse Tree

Query Plan

Custom Scan/Join Interface

PG-Strom

Interactions between

PostgreSQL and

Extensions

CUDA code

Data Chunk

NVIDIA Run-Time Compiler

12

definition at planning

time

(OT) Dive into Custom-Join


Built-in JOIN Custom JOIN

CustomScan (scanrelid==0)

Logic implemented by extension

Inner Scan Outer Scan

table-X table-Y

Join

• NestedLoop • MergeJoin • HashJoin

TupleDesc of table-X

TupleDesc of table-Y

X-tuple Y-tuple

joined-tuple

TupleDesc of Join X-Y

table-Y Something other data source

table-X

Data Load with extension’s comfortable way

joined-tuple

compatible tuple format

13

SQL-to-GPU native code generation (1/3)

postgres=# EXPLAIN VERBOSE

SELECT cat, count(*), avg(x) FROM t0 WHERE x between y and y + 20.0 GROUP BY cat; QUERY PLAN

-------------------------------------------------------------------------------------

HashAggregate (cost=167646.33..167646.36 rows=3 width=12)

Output: cat, pgstrom.count((pgstrom.nrows())), pgstrom.avg((pgstrom.nrows((x IS NOT NULL))), (pgstrom.psum(x)))

Group Key: t0.cat

-> Custom Scan (GpuPreAgg) (cost=24150.22..163315.53 rows=258 width=48)

Output: cat, pgstrom.nrows(), pgstrom.nrows((x IS NOT NULL)), pgstrom.psum(x)

Bulkload: On (density: 100.00%)

Reduction: Local + Global

Device Filter: ((t0.x >= t0.y) AND (t0.x <= (t0.y + '20'::double precision)))

Features: format: tuple-slot, bulkload: unsupported

Kernel Source: /opt/pgsql/base/pgsql_tmp/pgsql_tmp_strom_24905.4.gpu -> Custom Scan (BulkScan) on public.t0 (cost=21150.22..159312.94 rows=10000060 width=12)

Output: id, cat, aid, bid, cid, did, eid, x, y, z

Features: format: heap-tuple, bulkload: supported

(13 rows)


14


$ less /opt/pgsql/base/pgsql_tmp/pgsql_tmp_strom_24905.4.gpu

:

STATIC_FUNCTION(bool)

gpupreagg_qual_eval(kern_context *kcxt,

kern_data_store *kds,

kern_data_store *ktoast,

size_t kds_index)

{

pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1); /* constant 20.0 */

pg_float8_t KVAR_8 = pg_float8_vref(kds,kcxt,7,kds_index); /* var of X */

pg_float8_t KVAR_9 = pg_float8_vref(kds,kcxt,8,kds_index); /* var of Y */

return EVAL(pgfn_boolop_and_2(kcxt,

pgfn_float8ge(kcxt, KVAR_8, KVAR_9), pgfn_float8le(kcxt, KVAR_8, pgfn_float8pl(kcxt, KVAR_9, KPARAM_1))));

}

:


formula expression based on WHERE clause

15



SQL Query

Dynamically constructed GPU code

------ Per Query Variant

Statically constructed GPU code

------ Algorithm Core

NVIDIA Run-Time Compiler (nvrtc)

Cached GPU Binary

+

PG-Strom GPU code generator

For Execution

For Next Time GpuScan,

GpuHashJoin, GpuPreAgg, GpuSort, ...

16

PG-Strom functionalities at Sep-2015

▌Logics

GpuScan ... Simple loop extraction by GPU multithread

GpuHashJoin ... GPU multithread based N-way hash-join

GpuNestLoop ... GPU multithread based N-way nested-loop

GpuPreAgg ... Row reduction prior to CPU aggregation

GpuSort ... GPU bitonic + CPU merge, hybrid sorting

▌Data Types

Numeric ... int2/4/8, float4/8, numeric

Date and Time ... date, time(tz), timestamp(tz), interval

Text ... simple matching only, at this moment

others ... currency

▌Functions

Comparison operator ... <, <=, !=, =, >=, >

Arithmetic operators ... +, -, *, /, %, ...

Mathematical functions ... sqrt, log, exp, ...

Aggregate functions ... min, max, sum, avg, stddev, ...


17

Agenda






18

Asynchronous & Pipelining Execution (1/3)

▌Karma of discrete GPU device

We have to move the data to be processed.

Thus, people say data intensive tasks are not good for GPU.

Correct, but we can reduce the impact.

Our solution: Pipelining


GPU

CPU

GPU kernel execution

time

DMA Send

DMA Recv

19



chunk-(i+1)

chunk-(i+2)

chunk-i

chunk-(i+3)

table

scan

Move to Next

DMA Recv

GPU Kernel Exec

DMA Send

Buffer Read

DMA Recv

GPU Kernel Exec

DMA Send

Buffer Read

GPU Kernel Exec

DMA Send

Buffer Read

DMA Send

Buffer Read

Step-(i+4) Step-(i+3) Step-(i+2) Step-(i+1) Step-i

20


▌Karma of discrete GPU device

We have to move the data to be processed.

Thus, people say data intensive tasks are not good for GPU.

Correct, but we can reduce the impact

Our solution: Pipelining

▌Downside of the pipelining

Too small workloads

Pipeline hazard


GPU

CPU

GPU kernel execution

time

DMA Send

DMA Recv

21

SQL Logic on GPU (1/3) – GpuHashJoin


Inner relation

Outer relation

Inner relation

Outer relation

Hash table Hash table

Next step Next step

All CPU does is just references the result of relations join

Sequential Hash table search by CPU

Sequential Projection by CPU

Parallel Projection by GPU

Parallel Hash-table search by GPU

Built-in Hash-Join GpuHashJoin of PG-Strom

22

Inner Hash table

OT: Ununiformly Distributed Hash-Table


GPU Core-0

GPU Core-1

GPU Core-2

GPU Core-3

GPU Core-4

GPU Core-5

GPU Core-M

Outer relation

Item[0]

Item[1]

Item[2]

Item[3]

Item[4]

Item[5]

Item[N]

Item[M]

Item Item Item Item Item Item Item Item Item Item

Inner relation

Hash Value

Hash Function

Item[3] Item

Item[3] Item

Item[3] Item

Item[3] Item

Item[3] Item

Item[3] Item

Item[3] Item

Item[3] Item

Item[3] Item

Item[3] Item

Overflow of Result Buffer

23

Inner Hash table

OT: Ununiformly Distributed Hash-Table


GPU Core-0

GPU Core-1

GPU Core-2

GPU Core-3

GPU Core-4

GPU Core-5

GPU Core-M

Outer relation

Item[0]

Item[1]

Item[2]

Item[3]

Item[4]

Item[5]

Item[N]

Item[M]

Item Item Item Item Item Item Item Item Item Item

Inner relation

Hash Value

Hash Function

Item[3] Item Item[3] Item

Virtual Partition with Extra Randomness

Just fit for Result Buffer

x Retry for each virtual partition

24

SQL Logic on GPU (2/3) – GpuNestLoop (unparametalized)

Oute

r-Rela

tion

(Nx: u

sually

larg

er)

※ s

plit to

chunk-b

y-c

hunk o

n

dem

and

●

●

●

●

●

●

●

●

●

●

● ●

●

●

●

●

●

Two dimensional GPU kernel

launch

blockDim.x

blockDim.y

Ny

Nx Thread

(X=2, Y=3)

Inner-Relation (Ny: relatively small)

Only edge thread references DRAM to fetch values.

Nx:32 x Ny:32 = 1024

A matrix can be evaluated with only 64 times DRAM accesses


25

SQL Logic on GPU (3/3) – GpuPreAgg

▌GpuPreAgg, prior to Aggregate, reduces number of rows to be processed.

▌Query rewrite to make equivalent results from intermediate aggregation.


GroupAggregate

Sort

SeqScan

Tbl_1

Result Set

several rows

several millions rows


Y, count(*), avg(X)

X, Y

X, Y

X, Y, Z (table definition)

GroupAggregate

Sort

GpuScan

Tbl_1

Result Set

several rows

several hundreds rows


Y, sum(nrows), avg_ex(nrows, psum)

Y, nrows, psum_X

X, Y

X, Y, Z (table definition)

GpuPreAgg Y, nrows, psum_X

several hundreds rows

SELECT count(*), AVG(X), Y FROM Tbl_1 GROUP BY Y;

reduction of rows to be

processed

26

Agenda






27

Microbenchmark (1/2) – total response time

▌SELECT cat, AVG(x) FROM t0 NATURAL JOIN t1 [, ...] GROUP BY cat;

measurement of query response time with increasing of inner relations

t0: 100M rows, t1~t10: 100K rows for each, all the data was preloaded.

▌Environment:

PostgreSQL v9.5beta1 + PG-Strom (22-Oct), CUDA 7.0 + RHEL6.6 (x86_64)

CPU: Xeon E5-2670v3, RAM: 384GB, GPU: NVIDIA TESLA K20c (2496cores)


58.89 82.51

102.15 125.46

159.83

194.50

241.88

12.44 17.88 17.56 21.98 33.33 38.14

50.29

0

50

100

150

200

250

300

2 3 4 5 6 7 8

Qu

ery

Res

po

nse

Tim

e [s

ec]

# of tables involved

Query Response Time towards Numer of Tables Joined

PostgreSQL 9.5β PG-Strom

28

Microbenchmark (2/2) – Time consumption per component

▌Good acceleration results towards JOIN and AGGREGATION

▌Less acceleration ratio when N is larger than 6

By inaccurate problem size estimation, why?


0

50

100

150

200

250

300

PgSQL Strom PgSQL Strom PgSQL Strom PgSQL Strom PgSQL Strom PgSQL Strom PgSQL Strom

2 3 4 5 6 7 8

Qu

ery

Res

po

nse

Tim

e [s

ec]

# of tables involved

Time consumption per component (PostgreSQL v9.5β vs PG-Strom)

Scan Join Aggregate Others

29

OT: Why accurate problem size estimation is important

▌GpuJoin results must be smaller than GPU RAM size. ▌Virtual inner-partition saves scale of the Join results. Better size estimation reduces number of retries. We can determine the problem size using dynamic parallelism

downside: Supported by only Kepler or later generation.


innerRel (depth-1)

innerRel (depth-2)

innerRel (depth-3)

innerRel (depth-4)

outerRel Results

× × × × =

× × × × =

GpuJoin with virtual inner-partition and iteration

Larger than

GPU RAM

Within GPU RAM

Size

30

TPC-DS Benchmark (1/2) – Total Results

▌What is TPC-DS? Successor of TPC-H, defines 103 (99+4) kind of queries Simulation of decision support queries in retail industry

▌About this benchmark Run the series of TPC-DS queries sequentially. We could collect 96 results of 103. Environment: • PostgreSQL v9.5beta1 + PG-Strom (22-Oct), CUDA 7.0 + RHEL6.6 (x86_64) • CPU: Xeon E5-2670v3, RAM: 384GB, GPU: NVIDIA TESLA K20c (2496cores)


0

500

1000

1500

2000

Qu

ery

Re

spo

nse

Tim

e [

sec]

PostgreSQL v9.5β PG-Strom

anomaly?

31

TPC-DS Benchmark (2/2) – Total Result

▌Summary

PG-Strom wins to PostgreSQL: 76 of 96

PostgreSQL wins to PG-Strom: 20 of 96

shows advantages on most of simple queries

▌Challenges

Wise plan construction, to avoid GpuXXX node not to be here

Functionality improvement for Aggregation and Scan


0

100

200

300

400

500

Qu

ery

Re

spo

nse

Tim

e [

sec]

PostgreSQL v9.5β PG-Strom

32

Time consumption per workloads (1/3) – PostgreSQL v9.5

▌Scan, Join and Aggregation provides a clear explanation for the time consumption factor in TPC-DS workloads


0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

Time consumption per workloads (PostgreSQL v9.5beta)


33

Time consumption per workloads (2/3) – PG-Strom

▌PG-Strom dramatically reduced time for Joins.

▌Time for Joins are: 37% 17%


0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

Time consumption per workloads (PostgreSQL v9.5beta)


34

Time consumption per workloads (3/3) – Wow! GpuJoin

GpuJoin can explain 84% of performance improvement


y = 0.962x - 7.289 R² = 0.8442

-50.00

0.00

50.00

100.00

150.00

200.00

250.00

-50.00 0.00 50.00 100.00 150.00 200.00 250.00

Imp

rove

me

nt

of

Tota

l Tim

e [

sec]

(T

ota

l tim

e o

f P

gSQ

L) –

(To

tal t

ime

of

PG

-Str

om

)

Improvement of Join Time [sec] (Join time of PgSQL) – (Join time of PG-Strom)

35

Agenda






36

What we learn from TPC-DS

▌GpuJoin may be a killer feature of PG-Strom

....unless problem size estimation is accurate.

Planner needs to be wiser, not to inject GpuJoin towards the situation not suitable.

▌For Aggregation

GpuPreAgg didn’t appear as we expected

Anything GPU can help CPU aggregation?

▌For Scan

Scan performance == data stream bandwidth of GPU input

Single thread is a restriction for bandwidth of data supply


37

Features enhancement / improvement

▌Target-list calculation on GPU

Pre-calculation of hash-value for HashAggregate

Projection off-load if complicated mathematical expression

▌Custom-Scan under Gather node

Allows multiple workers to provide data stream to GPUs

▌Partial Aggregation before Join

Wider utilization of GpuPreAgg to reduce number of rows to be joined by CPU

▌Utilization of dynamic parallelism

Wise approach towards result buffer overflow problem in Join

▌... and more...


38

Short term development

1. Software quality improvement

2. Corner case handling

3. Documentation

4. Target-List calculation in GPU

5. Software quality improvement


39

WIP Project – In-place Computing

Why export entire data to run mathematical algorithm?

▌Runs complicated mathematical expression on GPUs, then CPU just references the result

Algorithms in SQL (even partially) automatically moves to GPUs

WIP: Data-Clustering algorithms, like Min-Max, K-means, ...


mathematical algorithm

Things wants to

know

Things wants to

know

result

result

mathematical algorithm

Principle of Near-Data Computing

Limited computing capability

Data Export

40

Future Direction of PG-Strom

Real-life workloads are the only compass for development


41

WE WANT YOU GET INVOLVED

▌PostgreSQL Users

who concern about query execution performance on your workloads

▌Application Developers

who are interested in trying yet another database acceleration methodology

▌PostgreSQL Developers

who can bet the future of heterogeneous era


42

Resources

▌Github

https://github.com/pg-strom/devel

https://github.com/pg-strom/devel/issues

▌Documentation

Under construction.... sorry

▌Today’s slides

Uploaded soon, check #pgconfeu on Tw

▌Author’s contact

e-mail: [email protected]

Tw: @kkaigai


GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~

Software

Transcript of GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~