Presto at Hadoop Summit 2016
What's New in SQL-on-Hadoop and Beyond
Martin Traverso, Facebook
Kamil Bajda-Pawlikowski, Teradata
Hadoop Summit 2016, San Jose, CA
Agenda
● Introduction
● Presto at Facebook
● Presto users and use cases
● New features
● Roadmap
Introduction
What is Presto
● Open source distributed SQL engine
● ANSI SQL syntax
● Custom built for interactive analytic queries
● Queries data across multiple data stores
● Flexible deployment (on premise or cloud)
● Extensible
Presto at Facebook
Presto @ Facebook
● Ad-hoc/interactive queries for Hadoop warehouse
● Batch processing for Hadoop warehouse
● Analytics for user-facing products
● Analytics over various specialized stores
Hadoop Warehouse - Stats
● 1000s of internal daily active users
● Millions of queries each month
● Scan PBs of data every day
● Process trillions of rows every day
● 10s of concurrent queries
Hadoop Warehouse - Batch
Presto for User-facing Products
● Requirements
○ Hundreds of ms to seconds latency, low variability
○ Availability
○ Update semantics
○ 10 - 15 way joins
● Stats
○ > 99.99% query success rate
○ 100% system availability
○ 25 - 200 concurrent queries
○ 1 - 20 queries per second
○ <100ms - 5s latency
Presto with Raptor
● Large data sets (petabytes)
● Milliseconds to seconds latency
● Predictable performance
● 5-15 minute load latency
● Reliable data loads (no duplicates, no missing data)
● High availability
● 10s of concurrent queries
Presto users and use cases
Presto users
See more at https://github.com/prestodb/presto/wiki/Presto-Users
Netflix stats
Interactive, reporting, and app-driven queries
Data warehouse: 40PB in S3
~250 nodes across multiple clusters
~650 users with ~6K+ queries/day
Twitter stats
Ad-hoc and low-latency queries
~200 nodes dedicated to Presto
Parquet with nested data structures
Uber stats
2 clusters
100+ machines
2000+ queries per day
HDFS on premise
FINRA stats
120+ EC2 nodes (r3.4xlarge)
2+ PBs of data on S3 (bzip2 & ORC)
200+ users
Distro supported by Teradata
New features
SQL features
● DDL syntax
○ CREATE / ALTER / DROP TABLE
● DML syntax
○ INSERT / DELETE
● SQL features:
○ Data types: DECIMAL, VARCHAR(n), INT, SMALLINT, TINYINT
○ CUBE, ROLLUP, GROUPING SETS
○ INTERSECT
○ Non-equi joins
○ Uncorrelated subqueries
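As a rough illustration of the CUBE, ROLLUP, and GROUPING SETS entry above, the following Python sketch (not Presto code) shows how standard SQL expands ROLLUP and CUBE into explicit grouping sets, then aggregates once per set. The function names, the `region`/`device`/`clicks` columns, and the sample rows are hypothetical, and the sketch assumes the input data itself contains no NULLs.

```python
from itertools import combinations


def rollup_grouping_sets(columns):
    """ROLLUP(c1, ..., cn): every prefix of the column list, longest first, ending with ()."""
    return [tuple(columns[:k]) for k in range(len(columns), -1, -1)]


def cube_grouping_sets(columns):
    """CUBE(c1, ..., cn): every subset of the columns, largest first."""
    sets = []
    for k in range(len(columns), -1, -1):
        sets.extend(combinations(columns, k))
    return sets


def aggregate(rows, columns, grouping_sets, measure):
    """Sum `measure` once per grouping set.

    Columns that are not part of the current grouping set are reported
    as None, mirroring the NULLs SQL emits for rolled-up columns.
    """
    totals = {}
    for gset in grouping_sets:
        for row in rows:
            key = tuple(row[c] if c in gset else None for c in columns)
            totals[key] = totals.get(key, 0) + row[measure]
    return totals


# Hypothetical sample data, for illustration only.
rows = [
    {"region": "US", "device": "web", "clicks": 3},
    {"region": "US", "device": "mobile", "clicks": 5},
    {"region": "EU", "device": "web", "clicks": 2},
]
columns = ["region", "device"]

for key, total in aggregate(rows, columns, rollup_grouping_sets(columns), "clicks").items():
    print(key, total)
```

ROLLUP over n columns yields n + 1 grouping sets and CUBE yields 2^n, so CUBE output grows quickly as columns are added.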
Other features
● Performance
○ Join and aggregation optimizations
● Connectors
○ Redis
○ MongoDB
● Kerberos
● Presto-Admin
● Ambari and YARN (via Apache Slider)
● Enterprise-grade ODBC & JDBC drivers
● BI tools certifications
○ Information Builders, Looker, MicroStrategy, MS Power BI, Qlik, Tableau, ZoomData
Drivers and BI tools
Roadmap
Short term
● LDAP
● SQL features
○ Data types: FLOAT, CHAR(n), VAR/BINARY(n)
○ EXISTS, EXCEPT
○ Correlated subqueries
○ Lambda expressions
○ Prepared statements
● Connectors
○ Accumulo (by Bloomberg)
Long term
● Materialized Query Tables
● Workload management
● Spill to disk
● Cost-based Optimizer
See more at https://github.com/prestodb/presto/wiki/Roadmap
More about Presto
GitHub: https://github.com/prestodb & https://github.com/Teradata/presto
Website: http://prestodb.io
Group: https://groups.google.com/group/presto-users
Distro: http://www.teradata.com/presto