Presto Strata Hadoop SJ 2016 short talk

Presto SQL Engine: What’s New?

Strata Hadoop 2016, San Jose, CA

What is Presto?

• 100% open source distributed SQL query engine
• Originally developed by Facebook
• Key differentiators:
– Performance & scale
– Cross-platform query capability, not only SQL on Hadoop
– Apache licensed, hosted on GitHub
– Certified distro & support from Teradata

Brief history of Presto

• Fall 2008: Facebook open sources Hive
• Fall 2012: 6 developers start Presto development
• Spring 2013: Presto rolled out within Facebook
• Fall 2013: Facebook open sources Presto
• Fall 2014: 88 releases, 41 contributors, 3943 commits
• Spring 2016: 141 releases, 116 contributors, 6879 commits

Presto in Production

• Facebook
– Multiple production clusters (100s of nodes total)
  - Massive 300 PB Hadoop data warehouse
  - Very large sharded MySQL installation
  - Growing usage of Raptor SSD-based storage
– 1000s of internal daily active users
– 10-100s of concurrent queries
• Netflix
– Over 200-node production cluster on EC2
– Over 25 PB in S3 (Parquet format)
– Over 350 active users and 3K queries daily

Presto Architecture

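[Architecture diagram: a Client submits queries to the Coordinator, which contains the Parser/analyzer, Planner and Scheduler; the Scheduler distributes work to multiple Workers. Three pluggable APIs are shown: Metadata API, Data location API and Data stream API (one per Worker).]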

Presto Extensibility – connectors

[Diagram: Coordinator (Parser/analyzer, Planner, Scheduler) and Worker, with each of the three pluggable APIs (Metadata API, Data location API, Data stream API) backed by connectors for HDFS / S3, NoSQL, DBMS, Custom, …]
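A minimal sketch of how the three pluggable APIs on this slide could fit together, written in Python for brevity rather than in Presto's Java. Every class and method name below (Connector, list_columns, get_splits, read_split, InMemoryConnector, scan) is hypothetical and is not Presto's actual connector interface. The pairing of each API with a coordinator or worker role follows the usual description of Presto's architecture and is an assumption here.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Tuple

Row = Tuple  # one row of column values


class Connector(ABC):
    """Hypothetical connector: one method per pluggable API on the slide."""

    @abstractmethod
    def list_columns(self, table: str) -> List[str]:
        """Metadata API: assumed to be consulted during analysis and planning."""

    @abstractmethod
    def get_splits(self, table: str) -> List[int]:
        """Data location API: assumed to tell the scheduler what units of work exist."""

    @abstractmethod
    def read_split(self, table: str, split: int) -> Iterator[Row]:
        """Data stream API: assumed to be called by a worker to read one split."""


class InMemoryConnector(Connector):
    """Toy data source standing in for HDFS / S3, a NoSQL store, a DBMS, ..."""

    def __init__(self, tables: Dict[str, Tuple[List[str], List[List[Row]]]]):
        # table name -> (column names, list of splits, each split a list of rows)
        self._tables = tables

    def list_columns(self, table: str) -> List[str]:
        return self._tables[table][0]

    def get_splits(self, table: str) -> List[int]:
        return list(range(len(self._tables[table][1])))

    def read_split(self, table: str, split: int) -> Iterator[Row]:
        return iter(self._tables[table][1][split])


def scan(connector: Connector, table: str) -> List[Row]:
    """Enumerate the splits of a table, then stream the rows of each split."""
    rows: List[Row] = []
    for split in connector.get_splits(table):
        rows.extend(connector.read_split(table, split))
    return rows


# Made-up table with two splits.
orders = InMemoryConnector({
    "orders": (["id", "total"], [[(1, 10.0), (2, 5.5)], [(3, 7.25)]]),
})
print(orders.list_columns("orders"))
print(scan(orders, "orders"))
```

The point of the split is that the engine code (`scan`) only talks to the abstract interface, so adding a new data source means adding a new connector class rather than changing the engine.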

Supported data sources & file formats

• Hadoop/Hive connector & file formats:
– HDFS & S3 + HCatalog
– ORC, RCFile, Parquet, SequenceFile, Text
• Open source data stores:
– MySQL & PostgreSQL (non-parallel)
– Cassandra
– Kafka
– Redis
• In development by community:
– MongoDB
– Elasticsearch
– HBase

Presto – Query Execution Performance

• In-memory processing
• Pipelined execution across nodes, MPP-style
• Vectorized columnar processing
• Multithreaded execution keeps all CPU cores busy
• Presto is written in highly tuned Java
– Efficient flat-memory data structures (minimizes GC)
– Very careful coding of inner loops
– Runtime bytecode generation
• Optimized ORC & Parquet readers
• Excellent performance with interactive SQL analytics
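To make "columnar processing" and "flat-memory data structures" more concrete, here is a small Python illustration of the idea, not of Presto's implementation (Presto is written in Java and generates bytecode at runtime). The data and all names (sum_rows, sum_column, pages, filter_page) are made up for this sketch. It contrasts a row-at-a-time loop over tuples with a loop over one flat typed array, and then processes that array page by page as a stand-in for pipelined execution.

```python
from array import array

# Row-oriented layout: one Python tuple per row (many small objects).
rows = [(1, 20.0), (2, 35.5), (3, 12.5), (4, 50.0)]

# Columnar layout: one flat, typed array per column.
ids = array("q", [r[0] for r in rows])     # 64-bit integers
prices = array("d", [r[1] for r in rows])  # 64-bit floats


def sum_rows(rows, threshold):
    """Row-at-a-time: unpack every row object, even unused fields."""
    total = 0.0
    for _id, price in rows:
        if price > threshold:
            total += price
    return total


def sum_column(prices, threshold):
    """Column-at-a-time: a tight loop over a single flat array."""
    total = 0.0
    for price in prices:
        if price > threshold:
            total += price
    return total


def pages(column, page_size):
    """Yield fixed-size blocks of a column, one block at a time."""
    for start in range(0, len(column), page_size):
        yield column[start:start + page_size]


def filter_page(page, threshold):
    """Filter one block and return a new flat array of the same type."""
    return array(page.typecode, (v for v in page if v > threshold))


# Pipelined variant: each page flows through filter and sum before the next is read.
pipelined_total = sum(sum(filter_page(p, 15.0)) for p in pages(prices, 2))

print(sum_rows(rows, 15.0), sum_column(prices, 15.0), pipelined_total)
```

All three computations return the same total; the difference is only in how the data is laid out and traversed, which is what the bullets on this slide are about.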

ANSI SQL Support

[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expr [, ...]
[ FROM table1 [ [ INNER | OUTER ] JOIN table2 ON (…) ] ]
[ WHERE condition ]
[ GROUP BY expression [, ...] ]
[ HAVING condition ]
[ UNION [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ LIMIT [ count | ALL ] ]

In addition:
• Windowing functions
• Statistical and approximate aggregate functions
• UNNEST, TABLESAMPLE

In development:
• Complex subqueries
• EXISTS, INTERSECT, EXCEPT
• ROLLUP, CUBE
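As an illustration of the SELECT grammar on this slide, the sketch below runs one statement that uses most of its clauses (WITH, JOIN ... ON, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT). It uses Python's built-in sqlite3 module purely as a stand-in engine so the example can run locally; it is not a Presto client, and the tables, columns and data are hypothetical.

```python
import sqlite3

# SQLite is only a stand-in engine here; tables and data are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (user_id INTEGER, country TEXT);
    CREATE TABLE visits (user_id INTEGER, pages INTEGER);
    INSERT INTO users  VALUES (1, 'US'), (2, 'US'), (3, 'PL');
    INSERT INTO visits VALUES (1, 5), (1, 7), (2, 1), (3, 4), (3, 6);
""")

query = """
WITH per_user AS (                                -- WITH with_query
    SELECT user_id, SUM(pages) AS pages
    FROM visits
    GROUP BY user_id
)
SELECT u.country, SUM(p.pages) AS total_pages     -- SELECT select_expr [, ...]
FROM users u
INNER JOIN per_user p ON (u.user_id = p.user_id)  -- JOIN table2 ON (...)
WHERE p.pages > 0                                 -- WHERE condition
GROUP BY u.country                                -- GROUP BY expression
HAVING SUM(p.pages) >= 10                         -- HAVING condition
ORDER BY total_pages DESC                         -- ORDER BY expression DESC
LIMIT 10                                          -- LIMIT count
"""

result = conn.execute(query).fetchall()
print(result)
```

The statement sticks to clauses listed in the grammar above and avoids the items marked "In addition" or "In development", since those are the parts most likely to differ between engines.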

Deployment models

• Cluster deployment models for Presto:
– on premises (appliance or commodity clusters)
– VM (OpenStack, etc.)
– cloud (Amazon, etc.)
• Types of Hadoop deployments:
– on Hadoop/YARN cluster (all or subset of nodes)
– on a dedicated cluster
– mixed

Open source initiative

• Announced in June 2015 at Hadoop Summit
– Growing interest and adoption
• Collaboration with Facebook and Presto community
– Joint development, conference talks, meetups and webinars
• Major commitment from Teradata Labs:
– 20 full-time engineers
– Free and open source contributions
– Enterprise-ready distribution

"A special shout out goes to Teradata — which joined the Presto community this year with a focus on enhancing enterprise features and providing support — for having seven of our top 10 external contributors." — Facebook

Teradata Contributions to Presto

• Phase 1 (June 8, 2015): Implement
– Installer
– Documentation
– Monitoring & Support Tools
• Phase 2 (Q4 2015): Integrate
– Management Tool Integration
– YARN Integration
– ODBC Driver
• Phase 3 (2016): Proliferate
– JDBC Driver
– BI Certification
– Security
– Cloud features
• Also shown alongside the phases: Commercial Support, Expanding ANSI SQL Coverage

Recent developments and roadmap

• Q1 release:
– Fully-featured ODBC & JDBC drivers
– Kerberos support
– DECIMAL support
• Later in 2016:
– BI tools certification
– TPC-H and TPC-DS unmodified
– Spill to disk

BI Tools certifications

Teradata QueryGrid™ - Multi-System Analytics

[Diagram: QueryGrid entry points (Teradata Database, Aster Analytics, Presto Hadoop) and targets (Teradata Database, Aster Analytics, Presto Hadoop, Hive / HDFS, Hadoop, other databases, NoSQL databases), grouped as Integrated Data Warehouses, Multi-Genre Advanced Analytics™, Multiple Hadoop SQL Query Engines and Distributions, 3rd Party Relational DBs and Non-Relational DBs. Presto connectors shown, each marked as either Teradata Certified or Community Supported: Apache Kafka, Apache Cassandra, MySQL, PostgreSQL, Presto API, Redis.]

More information

• Certified Distro: www.teradata.com/presto
• Website: www.prestodb.io
• Presto Users Group: www.groups.google.com/group/presto-users
• GitHub:
– www.github.com/prestodb/presto
– www.github.com/Teradata/presto
– www.github.com/prestodb

www.teradata.com/presto
