with Presto Interactive BI Analytics · 2020. 12. 3. · with Presto Big Data Conference Europe...

Interactive BI Analytics with Presto Big Data Conference Europe 2020 Karol Sobczak, Software Engineer, Starburst Łukasz Osipiuk, Software Engineer, Starburst



Interactive BI Analytics with Presto

Big Data Conference Europe 2020 Karol Sobczak, Software Engineer, Starburst

Łukasz Osipiuk, Software Engineer, Starburst


Łukasz Osipiuk, Software Engineer, Starburst

@losipiuk


/in/lukasz-osipiuk-781903


Karol Sobczak, Software Engineer, Co-founder, Starburst

/in/karol-sobczak-a7b19a10

@sopel39



Agenda

1. Introduction to Presto
2. Presto in data analysis ecosystem
3. Under the hood
4. Demo


What is Presto?

Community-driven open source project

High performance MPP SQL engine
• Interactive ANSI SQL queries
• Proven scalability
• High concurrency

Deploy Anywhere
• Kubernetes
• Cloud
• On premises

Separation of compute & storage
• Scale storage & compute independently
• SQL-on-anything
• Federated queries


Presto Users

High Tech

Facebook: 10,000+ nodes, 1000s of users
Uber: 2,000+ nodes, 160K+ queries daily
LinkedIn: 500+ nodes, 200K+ queries daily
Lyft: 400+ nodes, 100K+ queries daily

Retail Media Finance


What is Starburst?
● Starburst Enterprise Presto - distribution
● Open core model
● Biggest open source Presto contributor
● Headquartered in Boston, MA
● Regional presence in EMEA - offices in Warsaw and London


Starburst Enterprise Presto


Data sources

...

...

...


Presto in data analysis ecosystem


Data Ingestion and Analytics Ecosystem

Data producers
Data storage
Machine learning/AI
SQL Analytics and BI reporting
Realtime SQL engine


Starburst Platform


Under the hood


Static partition pruning

SELECT cust_country, sum(sale_price)
FROM customer JOIN sales ON cust_id = sale_cust_fk
WHERE sale_date >= date '2012-08-01' and sale_date <= date '2012-08-31'
GROUP BY cust_country

Static partition pruning saves the day!

Only 31 partitions of the sales table will be scanned

customer: cust_id, cust_country, ...
sales: sale_id, sale_cust_fk, sale_date, sale_price, ...

sales table partitioned by sale_date
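
The pruning step on this slide can be sketched in a few lines of Python. This is a toy model, not Presto's planner code: it assumes one partition per calendar day of 2012, and the helper name prune_partitions is hypothetical. Because the WHERE clause constrains the partition column directly, the set of partitions to scan is known before any data is read.

```python
from datetime import date, timedelta

def prune_partitions(partitions, lower, upper):
    """Keep only partitions whose sale_date lies in [lower, upper]."""
    return [p for p in partitions if lower <= p <= upper]

# Assumed layout: one sales partition per day of 2012 (a leap year).
all_partitions = [date(2012, 1, 1) + timedelta(days=i) for i in range(366)]

# The bounds come straight from the WHERE clause of the query.
selected = prune_partitions(all_partitions, date(2012, 8, 1), date(2012, 8, 31))
print(f"scanning {len(selected)} of {len(all_partitions)} partitions")
```

Under these assumptions, only the August partitions survive, which matches the 31 partitions quoted on the slide.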


Dynamic partition pruning

sales table partitioned by sale_date_fk

SELECT cust_country, sum(sale_price)
FROM customer
JOIN sales ON cust_id = sale_cust_fk
JOIN date_dim ON sale_date_fk = date_id
WHERE date_date >= date '2012-08-01' and date_date <= date '2012-08-31'
GROUP BY cust_country

date_dim: date_id, date_date, ...
customer: cust_id, cust_country, ...
sales: sale_id, sale_cust_fk, sale_date_fk, sale_price, ...

Dynamic partition pruning saves the day!

Only ~31 partitions of sales table will be scanned
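
A rough Python sketch of the idea, under stated assumptions: here the filter is on date_dim, not on the partition column of sales, so the matching keys are only known at run time. The sketch first evaluates the dimension-side predicate, collects the surviving date_id values, and then uses that set to skip sales partitions. The data layout (one date_dim row and one sales partition per day of 2012) and both function names are hypothetical; this is not Presto's implementation.

```python
from datetime import date, timedelta

# Assumed date_dim contents: date_id -> date_date, one row per day of 2012.
date_dim = {i: date(2012, 1, 1) + timedelta(days=i) for i in range(366)}

# Assumed sales layout: one partition per sale_date_fk value.
sales_partitions = list(date_dim.keys())

def build_dynamic_filter(dim, lower, upper):
    """Evaluate the dimension-side predicate and collect matching join keys."""
    return {date_id for date_id, d in dim.items() if lower <= d <= upper}

def prune_with_dynamic_filter(partitions, allowed_keys):
    """Skip fact-table partitions whose key cannot match any dimension row."""
    return [p for p in partitions if p in allowed_keys]

keys = build_dynamic_filter(date_dim, date(2012, 8, 1), date(2012, 8, 31))
scanned = prune_with_dynamic_filter(sales_partitions, keys)
print(f"scanning {len(scanned)} of {len(sales_partitions)} partitions")
```

The difference from static pruning is only where the key set comes from: it is produced while the query runs, from the filtered dimension table, instead of being read off the query text.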


Cost based optimizer

TPC-DS queries, with CBO on and off


Cost based optimizer

Cost-Based Optimizer includes:

• join reordering based on selectivity estimates and cost
• automatic join type selection (repartitioned vs broadcast)
• automatic left/right side selection for joined tables
• support for statistics stored in Hive Metastore

https://www.starburstdata.com/technical-blog/


Join left/right side decision
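
As an illustration only, the side decision can be sketched in Python. The sketch assumes a hash join that builds an in-memory table from one input (the build side) and streams the other input past it (the probe side), so the smaller estimated input is the cheaper one to build from. The row counts and the function name choose_sides are hypothetical.

```python
def choose_sides(left, right):
    """Return (probe, build), putting the smaller estimated input on the build side."""
    if left["rows"] < right["rows"]:
        return right, left
    return left, right

# Hypothetical row-count estimates, for illustration only.
customer = {"name": "customer", "rows": 1_500}
sales = {"name": "sales", "rows": 6_000_000}

probe, build = choose_sides(customer, sales)
print(f"probe={probe['name']} build={build['name']}")
```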


Join type selection
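
A toy Python model of the broadcast versus repartitioned choice. The cost formulas are simplifying assumptions, not Presto's actual cost model: a broadcast join is assumed to copy the build side to every worker, while a repartitioned join is assumed to shuffle both inputs once. The function name, the parameters, and the byte sizes are all hypothetical.

```python
def choose_join_distribution(build_bytes, probe_bytes, workers, max_broadcast_bytes):
    """Pick a join distribution from estimated input sizes (toy cost model)."""
    broadcast_cost = build_bytes * workers        # build side sent to every worker
    repartition_cost = build_bytes + probe_bytes  # both sides shuffled once
    fits_in_memory = build_bytes <= max_broadcast_bytes
    if fits_in_memory and broadcast_cost < repartition_cost:
        return "broadcast"
    return "repartitioned"

MB = 1024 ** 2
GB = 1024 ** 3

# Small dimension table joined to a large fact table.
small_build = choose_join_distribution(10 * MB, 10 * GB, workers=10, max_broadcast_bytes=100 * MB)
# Two large tables: the build side is too big to replicate.
large_build = choose_join_distribution(5 * GB, 10 * GB, workers=10, max_broadcast_bytes=100 * MB)
print(small_build, large_build)
```

The point of the sketch is the trade-off itself: replication avoids shuffling the large side but multiplies the cost of the small side by the number of workers.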


Join reordering

"Which customers are spending the most at our shop?"

SELECT c.custkey, sum(l.price)
FROM customer c
JOIN orders o ON c.custkey = o.custkey
JOIN lineitem l ON o.orderkey = l.orderkey
GROUP BY c.custkey
ORDER BY sum(l.price) DESC;


Join reordering


Join reordering with filter

"Which customers are spending the most on coffee?"

SELECT c.custkey, sum(l.price)
FROM customer c
JOIN orders o ON c.custkey = o.custkey
JOIN lineitem l ON o.orderkey = l.orderkey
WHERE l.item = 'coffee'
GROUP BY c.custkey
ORDER BY sum(l.price) DESC;
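
To make the effect of the filter on join order concrete, here is a toy enumerator in Python. It is a sketch under loud assumptions: the row counts, the distinct-key counts, the 1% selectivity for item = 'coffee', and the cost function (sum of estimated intermediate result sizes) are all invented for illustration and are not taken from the talk or from Presto's optimizer. It considers only left-deep orders that follow the join conditions of the query.

```python
from itertools import permutations

# Join edges from the query, with an assumed number of distinct join keys.
EDGES = {
    frozenset(["customer", "orders"]): 1_500,    # custkey
    frozenset(["orders", "lineitem"]): 15_000,   # orderkey
}

def best_join_order(rows):
    """Return (cost, order) for the cheapest left-deep order without cross joins."""
    best = None
    for order in permutations(rows):
        joined = {order[0]}
        size = rows[order[0]]
        cost = 0
        valid = True
        for table in order[1:]:
            # Find a join condition linking this table to what is already joined.
            keys = [k for e, k in EDGES.items() if table in e and (e - {table}) <= joined]
            if not keys:
                valid = False  # would need a cross join, skip this order
                break
            size = size * rows[table] / keys[0]  # textbook join-size estimate
            cost += size
            joined.add(table)
        if valid and (best is None or cost < best[0]):
            best = (cost, order)
    return best

# Assumed statistics without the filter.
unfiltered = {"customer": 1_500, "orders": 15_000, "lineitem": 60_000}
# Same tables, assuming item = 'coffee' keeps 1% of lineitem rows.
filtered = dict(unfiltered, lineitem=600)

print(best_join_order(unfiltered))
print(best_join_order(filtered))
```

With these made-up numbers, the unfiltered query joins customer and orders first, while the filtered query joins the heavily reduced lineitem with orders first and brings in customer last. Real optimizers use richer statistics and cost terms, but the mechanism is the same: a selective filter changes the estimated sizes, and the cheapest order changes with them.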


Join reordering with filter


Demo time


TPC-DS schema

store_sales (fact)

Oracle
date_dim (dimension)
time_dim (dimension)
customer (dimension)

customer_address (dimension)

item (dimension)