Post on 31-Aug-2019

The Enterprise Presto Company

STARBURST

Wojciech Biela, Grzegorz Kokosiński

Presto: SQL-on-Anything

©2018 Starburst Data, Inc. All Rights Reserved

Starburst in a nutshell

We are the Presto company!

● Largest team of Presto contributors outside of Facebook
● Led Presto initiative at Teradata for past 3 years
● Working in the SQL-on-Hadoop space since 2011

We offer:

● A production-ready distribution of Presto
● Professional enterprise support
● Presto managed services

Presto is SQL-on-Anything.

Deploy Anywhere, Query Anything

Analyst Tools

Data Sources

Presto Community

More at https://github.com/prestodb/presto/wiki/Presto-Users

Why Presto?

● 100% open source distributed ANSI SQL engine
  ○ Originally developed by Facebook
  ○ Introduced to Fortune 500 by Teradata
  ○ Commercialized by Starburst
● Presto is SQL-on-Anything:
  ○ Deploy anywhere
  ○ Query anything

Presto Highlights

● Community-driven open source project
● No vendor lock-in
  ○ No Hadoop distro vendor lock-in
  ○ No storage engine vendor lock-in
  ○ No cloud vendor lock-in
● Query data where it lives
  ○ No ETL or data integration necessary to get to insights
● Proven scalability
● High concurrency
● Interactive ANSI SQL queries

Presto in Production

More at https://github.com/prestodb/presto/wiki/Presto-Users

Presto in Production

* Multiple clusters (1000s of nodes)

* 300PB in HDFS, MySQL, and Raptor

* 1000s of users, 100s of concurrent queries

Facebook - Data warehouse

● Hive + HDFS + ORC
● Multiple clusters
● Thousands of users, 300PB, 1000s of nodes
● PBs of data scanned, O(100k) queries every day
● 100s of concurrent queries

Facebook - User facing

● Usage
  ○ Reporting backend for ad campaign analytics
● Sharded MySQL storage
● Relatively small data (10s to 100s of TBs)
● 0.1-5 seconds latency
● Support for data updates
● Highly available (different DCs)
● 10-15 way joins

Presto in Production

* 250+ AWS nodes

* 100+ PB in S3 (Parquet)

* 650+ users with 6K+ queries daily

Netflix data pipeline

[Diagram. Labels: TVs, mobile, laptop; events; dimensions; Suro / Kafka; Cassandra; Ursula; Aegisthus; Amazon S3; Teradata]

Presto in Production

* 150+ PB HDFS

* 800+ nodes (2 clusters on prem)

* 200K+ queries/day

Presto in Production

* 800+ nodes (on premise)

* Parquet data

Presto in Production

* 120+ nodes in AWS

* 4PB in S3

* 200+ users

* Starburst support

Presto in Production

* Presto for interactive workload

* 200 nodes on AWS

* 20k+ queries / day

* 20PB+ data on S3

Lyft ecosystem

[Diagram. Stages: Ingest, Storage, Compute, Visualisation. Labels: Events; MongoDB; Other DS; AWS S3; Hive; Redshift; Superset; Other tools]

Presto in Production

* 100 Presto VMs (on premises)

* 1K+ HDFS nodes

* ORC data

* Starburst support

Presto in Production

* 200+ nodes (on premises)

* HDFS, ObjectStore, and Cassandra

* Starburst support

Logical Data Warehouse

[Diagram: Yahoo! Japan DWH. Labels: Operational; Teradata DWH; Operations (RDBs); Data Lake (Hadoop); Copy & Load; ETL; QG; Presto; RDBs; NoSQL; Data Lake; Hadoop; S3]

Architecture

[Diagram. Labels: Client; Coordinator (Parser/analyzer, Planner, Scheduler); Workers; Pluggable: Metadata API, Data location API, Data stream API]

Presto Connectors

Amazon S3

Architecture

● Core Presto
  ○ parser, planner, optimizer and scheduler
  ○ execution engine
  ○ stateless
● Plugins
  ○ connectors - data + metadata
  ○ user defined functions
  ○ user defined types
  ○ event listeners
  ○ authentication and authorization
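
The split between core engine and connector plugins can be sketched roughly as follows. This is a minimal Python illustration, not Presto's actual SPI (which is written in Java): the names `Connector`, `InMemoryConnector`, `Engine`, `list_columns`, `get_splits` and `read_split` are all hypothetical, chosen to mirror the metadata, data location and data stream roles a connector plays.

```python
from abc import ABC, abstractmethod


class Connector(ABC):
    """Hypothetical plugin interface; names are illustrative, not Presto's SPI."""

    @abstractmethod
    def list_columns(self, table):
        """Metadata: which columns does the table have?"""

    @abstractmethod
    def get_splits(self, table):
        """Data location: which independent chunks make up the table?"""

    @abstractmethod
    def read_split(self, split):
        """Data stream: yield the rows of one chunk."""


class InMemoryConnector(Connector):
    """Toy connector serving tables held in a dict of name -> (columns, rows)."""

    def __init__(self, tables, split_size=2):
        self._tables = tables
        self._split_size = split_size

    def list_columns(self, table):
        return self._tables[table][0]

    def get_splits(self, table):
        rows = self._tables[table][1]
        return [(table, start, min(start + self._split_size, len(rows)))
                for start in range(0, len(rows), self._split_size)]

    def read_split(self, split):
        table, start, end = split
        yield from self._tables[table][1][start:end]


class Engine:
    """Core engine: knows data sources only through the interface above."""

    def __init__(self):
        self._catalogs = {}

    def register(self, catalog, connector):
        self._catalogs[catalog] = connector

    def scan(self, catalog, table):
        connector = self._catalogs[catalog]
        for split in connector.get_splits(table):
            yield from connector.read_split(split)


connector = InMemoryConnector(
    {"users": (["id", "name"], [(1, "ann"), (2, "bob"), (3, "cy")])}
)
engine = Engine()
engine.register("memory", connector)
rows = list(engine.scan("memory", "users"))
print(connector.list_columns("users"), rows)
```

Adding a new data source in this sketch means writing another `Connector` subclass and registering it under a catalog name; the engine code does not change.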

Connectors - Hive

● Table metadata read from Hive catalog
● Multiple filesystems
  ○ HDFS, S3
● Supported file formats
  ○ ORC (optimized reader, optimized writer)
  ○ RCFile (optimized reader, optimized writer)
  ○ Parquet (optimized reader)
  ○ Avro
  ○ all other Hive formats

Easy deployment

● self-contained (RPM/tar.gz)
● worker auto discovery
● trivial dependencies
  ○ just a recent JVM
● single-port network communication
  ○ easy firewall/network setup
● even easier with presto-admin

Hardware agnostic

● Infrastructure agnostic
  ○ on premises (appliance or commodity clusters)
  ○ VM (OpenStack, etc.)
  ○ cloud (Amazon, Azure, etc.)
    ■ pure EC2
    ■ EMR
    ■ AWS Athena (pay-as-you-go)

Deployment for HDFS

[Diagram: three layouts of Presto nodes and HDFS DataNodes (DN): Separate nodes; Shared nodes; Mixed (rack local)]

SQL support

● ANSI SQL support (good for BI tools)
  ○ all standard data types
  ○ complex subqueries support (e.g. correlated)
● Structural types
  ○ map, array, row
  ○ JSON
● Lambda expressions
  ○ SELECT transform(ARRAY['dog', 'whale'], x -> length(x))
    ■ [3, 5]
  ○ SELECT reduce(ARRAY[5, 20, 50], 0, (s, x) -> s + x, s -> s)
    ■ 75
● Spatial joins, functions and data types
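
For readers more used to a general-purpose language, the two lambda examples above behave roughly like the Python sketch below. The helper names `transform` and `presto_reduce` are stand-ins written for this illustration; they only mimic the shape of the SQL functions (array, optional initial state, step lambda, final lambda) and are not part of any Presto client library.

```python
from functools import reduce as fold


def transform(array, func):
    """Apply func to every element, like transform(array, x -> ...)."""
    return [func(x) for x in array]


def presto_reduce(array, initial, step, finish):
    """Fold the array with step(state, element), then apply finish(state)."""
    return finish(fold(step, array, initial))


lengths = transform(["dog", "whale"], lambda x: len(x))
total = presto_reduce([5, 20, 50], 0, lambda s, x: s + x, lambda s: s)

print(lengths)  # [3, 5]
print(total)    # 75
```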

SQL support

● All standard DDL/DML is supported
  ○ CREATE TABLE / CREATE TABLE AS
    ■ connector specific extensions supported via WITH clause
  ○ DROP TABLE
  ○ INSERT
  ○ DELETE
  ○ GRANT / REVOKE
● Set of supported features depends on connector
  ○ richest support for Hive connector

Performance

● MPP style
  ○ Operator pipelines and data streaming
  ○ In-memory execution
● Columnar data processing
● Highly tuned Java
  ○ Query-to-bytecode compilation
  ○ Memory-efficient structures that minimize GC
  ○ Careful inner loop implementation
● Multi-threaded execution keeps CPU busy
  ○ Focus on being versatile: supports both a single (long-running) query at a time and (interactive) highly concurrent workloads
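
The idea of an operator pipeline with data streaming can be illustrated with Python generators. This is only a toy sketch under simplifying assumptions: it is row-oriented and single-threaded, whereas the slide describes columnar, multi-threaded, bytecode-compiled execution in Java. The operator names (`scan`, `filter_op`, `project_op`) and the sample data are made up for the example. What it does show is that each operator consumes small pages from its upstream operator and passes results downstream without materializing the whole input.

```python
from itertools import islice


def scan(rows, page_size=2):
    """Source operator: emit rows in small pages rather than all at once."""
    iterator = iter(rows)
    while True:
        page = list(islice(iterator, page_size))
        if not page:
            return
        yield page


def filter_op(pages, predicate):
    """Filter operator: keep matching rows, one page at a time."""
    for page in pages:
        kept = [row for row in page if predicate(row)]
        if kept:
            yield kept


def project_op(pages, columns):
    """Projection operator: keep only the requested columns."""
    for page in pages:
        yield [tuple(row[c] for c in columns) for row in page]


# Made-up input data for the illustration.
rows = [
    {"city": "Warsaw", "orders": 12},
    {"city": "Boston", "orders": 7},
    {"city": "Tokyo", "orders": 30},
    {"city": "Lima", "orders": 3},
    {"city": "Oslo", "orders": 15},
]

# Roughly: SELECT city FROM rows WHERE orders > 10
pipeline = project_op(filter_op(scan(rows), lambda r: r["orders"] > 10), ["city"])
result = [row for page in pipeline for row in page]
print(result)  # [('Warsaw',), ('Tokyo',), ('Oslo',)]
```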

Cost-Based Optimizer!

● Optimizer in ‘vanilla’ Presto
  ○ currently rule based
● Exploit statistics provided by connectors
  ○ Leverage existing Hive statistics
  ○ Selectivity estimates and statistics of plan fragments
  ○ Cost calculation of plan variants
● Cost-based decisions (current Starburst release 195e)
  ○ Join type selection
  ○ Join reordering
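
To make cost-based join reordering concrete, here is a deliberately simplified Python sketch. It is not the Starburst or Presto optimizer: the table names, row counts and selectivities are invented, the cost model (sum of intermediate result sizes) is an assumption chosen for brevity, and only left-deep plans are enumerated by brute force.

```python
from itertools import permutations

# Hypothetical row-count statistics, standing in for what a connector reports.
ROW_COUNTS = {"orders": 1_000_000, "customers": 50_000, "nations": 25}

# Hypothetical join selectivities: fraction of the cross product that survives.
SELECTIVITY = {
    frozenset({"orders", "customers"}): 1 / 50_000,
    frozenset({"customers", "nations"}): 1 / 25,
}


def join_selectivity(joined, table):
    """Combine the selectivities of all join conditions linking table to the
    already-joined tables; 1.0 means no condition applies (a cross join)."""
    selectivity = 1.0
    for other in joined:
        selectivity *= SELECTIVITY.get(frozenset({other, table}), 1.0)
    return selectivity


def plan_cost(order):
    """Assumed cost model: total estimated rows produced by all joins."""
    joined = [order[0]]
    rows = ROW_COUNTS[order[0]]
    cost = 0.0
    for table in order[1:]:
        rows = rows * ROW_COUNTS[table] * join_selectivity(joined, table)
        cost += rows
        joined.append(table)
    return cost


best_order = min(permutations(ROW_COUNTS), key=plan_cost)
worst_order = max(permutations(ROW_COUNTS), key=plan_cost)
print(best_order, plan_cost(best_order))
print(worst_order, plan_cost(worst_order))
```

With these made-up numbers, the cheapest plans join the two small tables first and bring in `orders` last, while the most expensive ones start with a cross join between `orders` and `nations`. A rule-based optimizer that simply follows the order written in the query has no way to tell these apart.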

Cost-Based Optimizer - results

Cost-Based Optimizer in action

● 13x max speed-up
● >50% 2-5x boost
● ~10% 6-10x boost

For more see: https://www.starburstdata.com/technical-blog

Connectivity

● Presto CLI
● Enterprise JDBC/ODBC drivers
  ○ full JDBC and ODBC specs compliance
  ○ Kerberos authentication
  ○ LDAP authentication
● Open source JDBC driver
  ○ requires Java 8
  ○ limited support for authentication
● Language specific bindings (R, Python, Go, Ruby, …)

BI Tool Support

Security

● User authentication (CLI/ODBC/JDBC)
  ○ Basic
  ○ Kerberos
  ○ LDAP
● Pluggable user authorization schemes (access control)
● Connector level authorization
  ○ E.g. grants information stored in Hive catalog
● Support for kerberized HDFS/Hive metastore
● SSL on the wire
  ○ client to Presto
  ○ between Presto nodes

Key contributions from our team

● ANSI SQL syntax enhancements to fully support TPC-H and TPC-DS

● Spill to disk capabilities for large intermediate data sets

● Distributed sorting to handle ORDER BY for large datasets

● Security Integrations such as Kerberos, LDAP, and in-transit encryption

● Cost-Based Optimizer and other query performance improvements

● ODBC and JDBC drivers to enable BI tools such as Tableau, Qlik, etc.

● Presto connectors for SQL Server and Cassandra

● Presto-Admin for easy installation & management of Presto

Enterprise Support for Presto

PrestoCare™

Administration, monitoring and support of the Presto Platform and Services

Enterprise Support

24/7 Enterprise support of Presto on-premises or in the cloud. Installation and tuning assistance. Product roadmap influence.

Professional Services

Presto architecture, tuning, integrations, implementation, and other development.

Starburst Presto Roadmap

● Kafka connector improvements

● HDFS wire encryption

● Further Cost-Based Optimizer extensions

● Execution engine improvements

● Planner improvements

● Better Avro support

● Support for Oracle Linux

● Support for Azure Cloud

More information

Certified Distro: www.starburstdata.com/presto

Project Website: www.prestodb.io

Presto Users Group: www.groups.google.com/group/presto-users

GitHub:
www.github.com/prestodb/presto
www.github.com/starburstdata/presto

Learn more at:

www.starburstdata.com

Wojciech.Biela@starburstdata.com
Grzegorz.Kokosinski@starburstdata.com