Presto at Hadoop Summit 2016

What's New in SQL-on-Hadoop and Beyond. Martin Traverso, Facebook; Kamil Bajda-Pawlikowski, Teradata. Hadoop Summit 2016, San Jose, CA

Transcript of Presto at Hadoop Summit 2016

Page 1: Presto at Hadoop Summit 2016

What's New in SQL-on-Hadoop and Beyond

Martin Traverso, Facebook
Kamil Bajda-Pawlikowski, Teradata
Hadoop Summit 2016, San Jose, CA

Page 2: Presto at Hadoop Summit 2016

Agenda
● Introduction
● Presto at Facebook
● Presto users and use cases
● New features
● Roadmap

Page 3: Presto at Hadoop Summit 2016

Introduction

Page 4: Presto at Hadoop Summit 2016

What is Presto
● Open source distributed SQL engine
● ANSI SQL syntax
● Custom built for interactive analytic queries
● Queries data across multiple data stores
● Flexible deployment (on premise or cloud)
● Extensible

Page 5: Presto at Hadoop Summit 2016
Page 6: Presto at Hadoop Summit 2016

Presto at Facebook

Page 7: Presto at Hadoop Summit 2016

Presto @ Facebook
● Ad-hoc/interactive queries for Hadoop warehouse
● Batch processing for Hadoop warehouse
● Analytics for user-facing products
● Analytics over various specialized stores

Page 8: Presto at Hadoop Summit 2016

Hadoop Warehouse - Stats
● 1000s of internal daily active users
● Millions of queries each month
● Scan PBs of data every day
● Process trillions of rows every day
● 10s of concurrent queries

Page 9: Presto at Hadoop Summit 2016

Hadoop Warehouse - Batch

Page 10: Presto at Hadoop Summit 2016

Presto for User-facing Products
● Requirements
○ Hundreds of ms to seconds latency, low variability
○ Availability
○ Update semantics
○ 10 - 15 way joins
● Stats
○ > 99.99% query success rate
○ 100% system availability
○ 25 - 200 concurrent queries
○ 1 - 20 queries per second
○ <100ms - 5s latency

Page 11: Presto at Hadoop Summit 2016

Presto with Raptor
● Large data sets (petabytes)
● Milliseconds to seconds latency
● Predictable performance
● 5-15 minute load latency
● Reliable data loads (no duplicates, no missing data)
● High availability
● 10s of concurrent queries

Page 12: Presto at Hadoop Summit 2016

Presto users and use cases

Page 13: Presto at Hadoop Summit 2016

Presto users

See more at https://github.com/prestodb/presto/wiki/Presto-Users

Page 14: Presto at Hadoop Summit 2016

Netflix stats

Interactive, reporting, and app-driven queries

Data warehouse: 40PB in S3

~250 nodes across multiple clusters

~650 users with ~6K+ queries/day

Page 15: Presto at Hadoop Summit 2016

Twitter stats

Ad-hoc and low-latency queries

~200 nodes dedicated to Presto

Parquet with nested data structures

Page 16: Presto at Hadoop Summit 2016

Uber stats

2 clusters

100+ machines

2000+ queries per day

HDFS on premise

Page 17: Presto at Hadoop Summit 2016

FINRA stats

120+ EC2 nodes (r3.4xlarge)

2+ PBs of data on S3 (bzip2 & ORC)

200+ users

Distro supported by Teradata

Page 18: Presto at Hadoop Summit 2016

New features

Page 19: Presto at Hadoop Summit 2016

SQL features
● DDL syntax

CREATE / ALTER / DROP TABLE

● DML syntax

INSERT / DELETE

● SQL features:

Data types: DECIMAL, VARCHAR(n), INT, SMALLINT, TINYINT

CUBE, ROLLUP, GROUPING SETS

INTERSECT

Non-equi joins

Uncorrelated subqueries
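
As a rough illustration of what CUBE, ROLLUP and GROUPING SETS compute, the sketch below expands ROLLUP and CUBE into their equivalent grouping sets and aggregates once per set. It is plain Python rather than Presto, and the sample rows, column names and helper functions (`rollup`, `cube`, `aggregate`) are all hypothetical, chosen only for this example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample rows: (region, product, sales). Not from the talk.
ROWS = [
    ("east", "a", 10),
    ("east", "b", 20),
    ("west", "a", 5),
]
COLUMNS = ("region", "product")


def rollup(columns):
    """ROLLUP(c1, ..., cn): one grouping set per prefix, longest first."""
    return [tuple(columns[:i]) for i in range(len(columns), -1, -1)]


def cube(columns):
    """CUBE(c1, ..., cn): one grouping set per subset of the columns."""
    return [
        subset
        for size in range(len(columns), -1, -1)
        for subset in combinations(columns, size)
    ]


def aggregate(rows, grouping_sets):
    """Sum sales once per grouping set.

    Columns that are not part of the current grouping set are reported
    as None, mirroring the NULLs that SQL puts in the subtotal rows.
    """
    totals = defaultdict(int)
    for grouping_set in grouping_sets:
        for region, product, sales in rows:
            values = {"region": region, "product": product}
            key = tuple(values[c] if c in grouping_set else None for c in COLUMNS)
            totals[key] += sales
    return dict(totals)


rollup_totals = aggregate(ROWS, rollup(COLUMNS))
cube_totals = aggregate(ROWS, cube(COLUMNS))
```

Here `rollup_totals` holds the per-(region, product) sums, the per-region subtotals and the grand total, while `cube_totals` additionally holds the per-product subtotals. An explicit GROUPING SETS clause corresponds to passing a hand-written list of tuples to `aggregate`.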

Page 20: Presto at Hadoop Summit 2016

Other features
● Performance

Join and aggregation optimizations

● Connectors

Redis

MongoDB

● Kerberos
● Presto-Admin
● Ambari and YARN (via Apache Slider)

Page 21: Presto at Hadoop Summit 2016

Drivers and BI tools
● Enterprise-grade ODBC & JDBC drivers
● BI tools certifications

Information Builders, Looker, MicroStrategy, MS Power BI, Qlik, Tableau, ZoomData

Page 22: Presto at Hadoop Summit 2016

Roadmap

Page 23: Presto at Hadoop Summit 2016

Short term
● LDAP
● SQL features

Data types: FLOAT, CHAR(n), VAR/BINARY(n)

EXISTS, EXCEPT

Correlated subqueries

Lambda expressions

Prepared statements

● Connectors

Accumulo (by Bloomberg)
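
To make the planned EXISTS, EXCEPT and correlated-subquery features concrete, here is a minimal sketch of the three query shapes. It runs against an in-memory SQLite database through Python's standard `sqlite3` module as a stand-in, not against Presto, and the `users` and `orders` tables and their contents are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in SQL engine (assumption:
# these standard query shapes behave the same way for this simple data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT);
    CREATE TABLE orders (user_id INTEGER, total INTEGER);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO orders VALUES (1, 50), (1, 70), (3, 20);
""")

# EXISTS with a correlated subquery: the inner query refers to u.id,
# so it is evaluated per outer row.
with_orders = [row[0] for row in conn.execute("""
    SELECT u.name
    FROM users u
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id)
    ORDER BY u.name
""")]

# EXCEPT: user ids that never appear in orders.
without_orders = [row[0] for row in conn.execute("""
    SELECT id FROM users
    EXCEPT
    SELECT user_id FROM orders
    ORDER BY id
""")]

# Correlated scalar subquery: per-user order total, NULL when there are none.
totals = conn.execute("""
    SELECT u.name,
           (SELECT SUM(o.total) FROM orders o WHERE o.user_id = u.id)
    FROM users u
    ORDER BY u.name
""").fetchall()

conn.close()
```

The correlated forms differ from the uncorrelated subqueries already listed under new features in that the inner query depends on a column of the outer row.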

Page 24: Presto at Hadoop Summit 2016

Long term
● Materialized Query Tables
● Workload management
● Spill to disk
● Cost-based Optimizer

See more at https://github.com/prestodb/presto/wiki/Roadmap

Page 25: Presto at Hadoop Summit 2016

More about Presto

GitHub: https://github.com/prestodb & https://github.com/Teradata/presto

Website: http://prestodb.io

Group: https://groups.google.com/group/presto-users

Distro: http://www.teradata.com/presto