Introduction to Presto at Treasure Data


Transcript of Introduction to Presto at Treasure Data

Page 1: Introduction to Presto at Treasure Data

Introduction to Presto Making SQL Scalable

Taro L. Saito [email protected]

Treasure Data, Inc.

Page 2: Introduction to Presto at Treasure Data

How do we make SQL scalable?

• Problem
  • Count the access logs of each web page:
    SELECT page, count(*) FROM weblog GROUP BY page
• A Challenge
  • How do you process millions of records in a second?
  • Making SQL scalable enough to handle large data sets

Page 3: Introduction to Presto at Treasure Data

Hive

• Translates SQL into MapReduce (Hadoop) programs
• MapReduce
  • Does the same job by using many machines

[Diagram: "Single CPU Job" vs. "Distributed Processing". The input tables (A, B) on HDFS are split into chunks (A0, A1, A2, B0, B1, B2, B3), processed by parallel map tasks, merged, and reduced, with the results written back to HDFS.]

Page 4: Introduction to Presto at Treasure Data

SQL to MapReduce

• Mapping SQL stages onto a MapReduce program
• SELECT page, count(*) FROM weblog GROUP BY page

[Diagram: the same split, map, merge, reduce pipeline, annotated with the corresponding SQL stages: TableScan(weblog), GroupBy(hash(page)), count(weblog of a page), result, reading from and writing to HDFS.]

Page 5: Introduction to Presto at Treasure Data

HDFS is the bottleneck

• HDFS (Hadoop File System)
  • Used for storing intermediate results
  • Provides fault tolerance, but slow

[Diagram: the same MapReduce pipeline, highlighting the HDFS reads and writes of intermediate results between the stages as the bottleneck.]

Page 6: Introduction to Presto at Treasure Data

Presto

• Distributed query engine developed by Facebook
• Uses HTTP for data transfer
• No intermediate storage like HDFS
• No fault tolerance (but the failure rate is less than 0.2%)
• Pipelining of data transfer and data processing

[Diagram: the same query, TableScan(weblog), GroupBy(hash(page)), count(weblog of a page), result, executed without intermediate HDFS storage; data is streamed directly between operators.]

Page 7: Introduction to Presto at Treasure Data

Architecture Comparison

• Performance: Hive is slow; Presto and Spark are fast; BigQuery is ultra fast (using many disks)
• Intermediate storage: Hive uses HDFS; Presto uses none; Spark uses memory/disk; BigQuery uses Colossus (?)
• Data transfer: Hive, Presto, and Spark use HTTP; BigQuery: ?
• Query execution: Hive is stage-wise MapReduce; Presto runs all stages at once (pipelining); Spark is stage-wise; BigQuery: ?
• Fault tolerance: Hive: yes; Presto: none (but TD will retry the query from scratch); Spark: yes, but limited; BigQuery: ?
• Multiple job support: Hive is good (can handle many jobs); Presto is limited (~5 concurrent queries per account in TD); Spark requires another resource manager (e.g., YARN, Mesos); BigQuery is limited (query queue)

Page 8: Introduction to Presto at Treasure Data

Presto Usage Stats

• More than 99.8% of queries finish without any error
• 90%+ of queries finish within 1 minute
• Treasure Data Presto stats
  • Processing more than 100,000 queries / day
  • Processing 15 trillion records / day
• Facebook's stats
  • 30,000~100,000 queries / day
  • 1 trillion records / day
• Treasure Data is the No. 1 Presto user in the world

Page 9: Introduction to Presto at Treasure Data

Presto can process more than 1M rows/sec.

Page 10: Introduction to Presto at Treasure Data

Presto Overview

• A distributed SQL engine developed by Facebook
  • For interactive analysis on peta-scale datasets
  • As a replacement for Hive
• Nov. 2013: open sourced on GitHub
• Facebook now has 12 engineers working on Presto
• Code
  • In-memory query engine, written in Java
  • Based on ANSI SQL syntax
  • Isolates the query execution layer from the storage access layer
  • A connector provides the data access methods (see the sketch below)
    • Cassandra / Hive / JMX / Kafka / MySQL / PostgreSQL / MongoDB / System / TPCH connectors
  • td-presto is our connector for accessing PlazmaDB (a columnar MessagePack database)
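Each connector is mounted as a catalog, and tables are addressed as catalog.schema.table. A minimal sketch of what this looks like to a user; the catalog, schema, and table names here are hypothetical and depend on how a deployment is configured:

  -- 'hive' and 'mysql' are hypothetical catalog names backed by the
  -- Hive and MySQL connectors, respectively.
  SELECT u.name, count(*) AS page_views
  FROM hive.weblog_db.weblog w
  JOIN mysql.example_db.users u ON w.user_id = u.id
  GROUP BY u.name;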

Page 11: Introduction to Presto at Treasure Data

Architectural overview

[Diagram: Presto's architecture, shown with the Hive connector. See https://prestodb.io/overview.html]

Page 12: Introduction to Presto at Treasure Data

Presto Users

• Facebook

Page 13: Introduction to Presto at Treasure Data

• Dropbox


Page 14: Introduction to Presto at Treasure Data

• Airbnb


Page 15: Introduction to Presto at Treasure Data

Interactive Analysis with TD Presto + Jupyter


• https://github.com/treasure-data/td-jupyter-notebooks/blob/master/imported/pandas-td-tutorial.ipynb

Page 16: Introduction to Presto at Treasure Data

Presto Internal: Query Execution

Page 17: Introduction to Presto at Treasure Data

Presto Architecture

[Diagram: a query is decomposed into stages: Stage 2 performs TableScan (FROM), Stage 1 performs Aggregation (GROUP BY), and Stage 0 produces the Output. Each stage runs as parallel tasks (Task 0.0; Task 1.0, 1.1, 1.2; Task 2.0, 2.1, 2.2) placed on workers (@worker#0, @worker#2, @worker#3), and each task processes one or more splits of the input data.]
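To see this decomposition, Presto can print both the logical and the distributed plan. A minimal sketch, assuming the weblog table from the earlier pages exists in the current catalog:

  -- Logical plan (the form shown on the following pages):
  EXPLAIN SELECT page, count(*) FROM weblog GROUP BY page;

  -- Distributed plan, broken into the stages pictured above:
  EXPLAIN (TYPE DISTRIBUTED) SELECT page, count(*) FROM weblog GROUP BY page;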

Page 18: Introduction to Presto at Treasure Data

Logical Query Plan

select c.nationkey, count(1)
from orders o join customer c on o.custkey = c.custkey
where o.orderpriority = '1-URGENT'
group by c.nationkey

Output[nationkey, _col1] => [nationkey:bigint, count:bigint]
    _col1 := count
  Exchange[GATHER] => nationkey:bigint, count:bigint
    Aggregate(FINAL)[nationkey] => [nationkey:bigint, count:bigint]
        count := "count"("count_15")
      Exchange[REPARTITION] => nationkey:bigint, count_15:bigint
        Aggregate(PARTIAL)[nationkey] => [nationkey:bigint, count_15:bigint]
            count_15 := "count"("expr")
          Project => [nationkey:bigint, expr:bigint]
              expr := 1
            InnerJoin[("custkey" = "custkey_0")] => [custkey:bigint, custkey_0:bigint, nationkey:bigint]
              Project => [custkey:bigint]
                Filter[("orderpriority" = '1-URGENT')] => [custkey:bigint, orderpriority:varchar]
                  TableScan[tpch:tpch:orders:sf0.01, original constraint=('1-URGENT' = "orderpriority")] => [custkey:bigint, orderpriority:varchar]
                      custkey := tpch:custkey:1
                      orderpriority := tpch:orderpriority:5
              Exchange[REPLICATE] => custkey_0:bigint, nationkey:bigint
                TableScan[tpch:tpch:customer:sf0.01, original constraint=true] => [custkey_0:bigint, nationkey:bigint]
                    custkey_0 := tpch:custkey:0
                    nationkey := tpch:nationkey:3

Page 19: Introduction to Presto at Treasure Data

Table Scan

(The same logical plan as on the previous page, now annotated with stage boundaries: the customer TableScan below the Exchange[REPLICATE] becomes its own leaf stage, Stage 3.)

Page 20: Introduction to Presto at Treasure Data

Logical Plan Optimization

(Same plan, continuing the stage assignment: the orders TableScan, Filter, Project, InnerJoin, and partial Aggregate below the Exchange[REPARTITION] form Stage 2.)

Page 21: Introduction to Presto at Treasure Data

(Same plan: the final Aggregate below the Exchange[GATHER] forms Stage 1.)

Page 22: Introduction to Presto at Treasure Data

Output Query Results (JSON)

(Same plan: the topmost Output operator forms Stage 0, which gathers the final results and returns them to the client as JSON.)

Page 23: Introduction to Presto at Treasure Data

TD Storage Architecture

[Diagram: logs stream into Real-Time Storage as many small log files. A log merge job (Hadoop MapReduce) compacts them into Archive Storage as 1-hour partitions, using time column-based partitioning (2015-09-29 01:00:00, 02:00:00, 03:00:00, …). Hive and Presto, the distributed SQL query engines, query both storage layers.]

Page 24: Introduction to Presto at Treasure Data

Utilizing Time Index

• Partial scan: TD_TIME_RANGE(time, '2015-09-29 02:00:00', '2015-09-29 03:00:00') lets Hive/Presto read only the 1-hour partitions that fall within the given range.
• Full scan: TD_TIME_RANGE(non_time_column, '2015-09-29 02:00:00', '2015-09-29 03:00:00') cannot use the time index, so the whole data set is scanned (see the sketch below).
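A minimal sketch of the two cases, assuming a weblog table partitioned on its time column; created_at is a hypothetical non-time column:

  -- Partial scan: the predicate is on the time column, so only the
  -- 1-hour partitions overlapping the range are read.
  SELECT page, count(*) FROM weblog
  WHERE TD_TIME_RANGE(time, '2015-09-29 02:00:00', '2015-09-29 03:00:00')
  GROUP BY page;

  -- Full scan: the predicate is on a non-time column, so the time
  -- index cannot prune any partitions.
  SELECT page, count(*) FROM weblog
  WHERE TD_TIME_RANGE(created_at, '2015-09-29 02:00:00', '2015-09-29 03:00:00')
  GROUP BY page;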

Page 25: Introduction to Presto at Treasure Data

Queries with huge results

• SELECT col1, col2, col3, … FROM …
  • The result (header, col1, col2, …) is written as msgpack.gz on Amazon S3, and reading the query results back as JSON is a single-threaded task: slow
• INSERT INTO (table) SELECT col1, col2, …, or CREATE TABLE AS
  • Directly creates 1-hour partitions on S3 from the query results
  • Runs in parallel: fast (see the sketch below)
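A minimal sketch contrasting the two patterns; the table names are illustrative:

  -- Slow: a huge bare SELECT funnels every row through the
  -- single-threaded result output.
  SELECT col1, col2, col3 FROM access_log;

  -- Fast: materialize the result as a table, so partitions are
  -- written to S3 in parallel.
  CREATE TABLE access_log_copy AS
  SELECT col1, col2, col3 FROM access_log;

  -- Or append into an existing table:
  INSERT INTO access_log_copy
  SELECT col1, col2, col3 FROM access_log;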

Page 26: Introduction to Presto at Treasure Data

Memory Consuming Operators

• DISTINCT col1, col2, … (duplicate elimination)
  • Needs to store the whole data set in a single node
• COUNT(DISTINCT col1), etc.
  • Use approx_distinct(col1) instead
• ORDER BY col1, col2, …
  • A single-node task (in Presto)
• UNION
  • Performs duplicate elimination (single node)
  • Use UNION ALL instead (see the sketch below)
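A minimal sketch of the memory-friendly alternatives; the weblog tables and the user_id column are illustrative:

  -- Exact distinct count: all values must be gathered, memory-hungry.
  SELECT count(DISTINCT user_id) FROM weblog;

  -- Approximate distinct count: each worker keeps only a small sketch.
  SELECT approx_distinct(user_id) FROM weblog;

  -- UNION deduplicates rows on a single node; prefer UNION ALL when
  -- duplicates are acceptable.
  SELECT page FROM weblog_2014
  UNION ALL
  SELECT page FROM weblog_2015;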

Page 27: Introduction to Presto at Treasure Data

Finding bottlenecks

• Table scan range
  • Check the TD_TIME_RANGE condition
• DISTINCT
  • Duplicate elimination over all selected columns (single node)
  • Slow and memory consuming
• Huge result output
  • The output stage (Stage 0) becomes the bottleneck
  • Use DROP TABLE IF EXISTS …, then CREATE TABLE AS SELECT … (see the sketch below)
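Putting the checklist together, a sketch of a query tuned against all three bottlenecks; the table and column names are illustrative:

  -- Bound the table scan, avoid exact DISTINCT, and materialize the
  -- output instead of returning a huge result set.
  DROP TABLE IF EXISTS page_stats;
  CREATE TABLE page_stats AS
  SELECT page, approx_distinct(user_id) AS unique_users
  FROM weblog
  WHERE TD_TIME_RANGE(time, '2015-09-29 00:00:00', '2015-09-30 00:00:00')
  GROUP BY page;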

Page 28: Introduction to Presto at Treasure Data

Resources

• Presto Query FAQs
  • https://docs.treasuredata.com/articles/presto-query-faq
• Presto Documentation
  • https://prestodb.io/docs