Driving towards the intersection of capacity and demand with dynamic Presto scaling

Puneet Jaiswal

Software Engineer, Data Infra - Interactive Querying

06.20.2019

Mission

Improve people’s lives with the world’s best transportation

Agenda

● Presto Infra @ Lyft

● Gateway

● Schedule based Scaling

● Further Perf improvement

● GSheets Connector

● Future work

● Questions

Presto Infra @ Lyft

● PrestoSQL version 309

● 40 PB queryable event data

● 100K (peaks) daily queries (1.5M monthly)

● 950 DAUs

● 240 - 500 worker nodes total

○ 55 TB total available mem in peak time

○ 24K vCPUs

○ Worker node type - m5.12xlarge - 48 vcpu / 192 GB mem

● Schedule based scaling
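As a quick sanity check, the fleet figures above can be cross-checked with simple arithmetic. The sketch below uses only numbers from the slide, plus one assumption (a 30-day month) to turn the monthly query count into a daily average.

```python
# Figures taken from the slide; worker node type is m5.12xlarge (48 vCPUs).
MAX_WORKERS = 500
VCPUS_PER_WORKER = 48
QUERIES_PER_MONTH = 1_500_000
DAYS_PER_MONTH = 30  # assumption: a 30-day month

# vCPUs available when the fleet is at its upper bound of 500 workers.
peak_vcpus = MAX_WORKERS * VCPUS_PER_WORKER

# Average daily query volume implied by the monthly total.
avg_daily_queries = QUERIES_PER_MONTH / DAYS_PER_MONTH

print(peak_vcpus)         # 24000, matching the "24K vCPUs" bullet
print(avg_daily_queries)  # 50000.0, half of the 100K daily peak
```

The gap between the average day and the peak day is part of what makes a fixed-size fleet wasteful.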

Presto infra stack

Clients

Presto Gateway

Presto - load balancer

Presto 1

Presto 2

Multiple Presto clusters

Presto-Gateway

proxy/gateway/load balancer for presto

https://github.com/lyft/presto-gateway

Problems

● Single Presto Coordinator

● Scale down was not easy - worker reduction affected running queries

● Upgrade requires downtime

● Single cluster vs multi cluster

● Clients (tableau / mode / looker etc) with single connection

○ Do not pass session user with each query - bad resource / queue isolation.

Presto Gateway

● Transparent API layer to access presto clusters without changing the protocol.

● Separate proxy end-point to access each presto cluster.

● API to activate / deactivate presto clusters

● Monitoring and alerting (Email / PagerDuty)

● Fast access query UI to trace queries

● Recovery speed - it is easier to block a bad cluster than to fix it on the spot.
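The activate / deactivate idea can be illustrated with a small in-memory registry. This is a minimal sketch, not the presto-gateway implementation: the class names, the `proxy_to` field, and the example URLs are all hypothetical. It assumes that deactivating a cluster only removes it from the pool that receives new queries.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Backend:
    name: str            # cluster name, e.g. "presto1"
    proxy_to: str        # coordinator URL (hypothetical field name)
    active: bool = True  # inactive clusters receive no new queries


@dataclass
class BackendRegistry:
    backends: Dict[str, Backend] = field(default_factory=dict)

    def add(self, name: str, proxy_to: str) -> None:
        self.backends[name] = Backend(name, proxy_to)

    def set_active(self, name: str, active: bool) -> None:
        # Flip the routing flag; the cluster itself is left untouched.
        if name not in self.backends:
            raise KeyError(f"unknown backend: {name}")
        self.backends[name].active = active

    def active_backends(self) -> List[str]:
        return [b.name for b in self.backends.values() if b.active]


registry = BackendRegistry()
registry.add("presto1", "http://presto1.example.internal:8080")  # made-up URL
registry.add("presto2", "http://presto2.example.internal:8080")  # made-up URL

# Block a misbehaving cluster instead of debugging it under load.
registry.set_active("presto1", False)
print(registry.active_backends())  # ['presto2']
```

Because blocking is a flag flip rather than a teardown, the cluster can be investigated or upgraded offline and re-activated afterwards.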

Query Routing

Currently round robin routing for all queries

Future:

Source / use case / available resources based routing

We can add simple rules to route queries selectively.
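Round robin routing with optional selective rules could look roughly like the sketch below. It is an illustration under assumptions, not the gateway's code: the `Router` class, the dictionary shape of a query, and the `airflow` source name are hypothetical.

```python
from itertools import count
from typing import Callable, List, Optional

# A rule inspects a query and returns a cluster name, or None to abstain.
Rule = Callable[[dict], Optional[str]]


class Router:
    def __init__(self, clusters: List[str], rules: Optional[List[Rule]] = None):
        self.clusters = list(clusters)
        self.rules = rules or []
        self._counter = count()

    def route(self, query: dict) -> str:
        if not self.clusters:
            raise RuntimeError("no active clusters")
        # Rules win when they name a known cluster.
        for rule in self.rules:
            target = rule(query)
            if target in self.clusters:
                return target
        # Otherwise fall back to plain round robin.
        return self.clusters[next(self._counter) % len(self.clusters)]


def scheduled_to_presto2(query: dict) -> Optional[str]:
    # Hypothetical rule: send scheduled (Airflow) queries to one cluster.
    return "presto2" if query.get("source") == "airflow" else None


router = Router(["presto1", "presto2"], rules=[scheduled_to_presto2])
print(router.route({"source": "tableau"}))  # presto1
print(router.route({"source": "airflow"}))  # presto2
```

Keeping rules as plain functions means source-based or use-case-based routing can be added without touching the round robin fallback.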

Presto gateway UI

Shows last N queries

Links to access query details page in native presto cluster

Shows active available clusters

Prestoadm tool

CLI tool for easy activate / deactivate operations

Schedule based Scaling

Steady growth in users & queries - scaling required

● Presto gateway as load balancer

● Presto cluster is unit of scaling

● Gateway APIs to activate/deactivate backend clusters

● Scaling is triggered based on schedule
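The points above can be sketched as a small planner that decides which clusters to activate or deactivate at a given time. The schedule values, the weekend rule, and the function names are hypothetical. A real setup would call the gateway APIs with the resulting lists.

```python
from datetime import datetime
from typing import List, Set, Tuple

# Hypothetical weekday schedule: (start_hour, end_hour, active cluster count).
SCHEDULE = [
    (0, 8, 1),
    (8, 20, 2),
    (20, 24, 1),
]


def desired_cluster_count(now: datetime) -> int:
    if now.weekday() >= 5:  # assumption: weekends run at minimum capacity
        return 1
    for start, end, clusters in SCHEDULE:
        if start <= now.hour < end:
            return clusters
    raise ValueError("schedule does not cover this hour")


def plan(all_clusters: List[str], active: Set[str],
         now: datetime) -> Tuple[List[str], List[str]]:
    """Return (clusters to activate, clusters to deactivate)."""
    target = set(all_clusters[:desired_cluster_count(now)])
    return sorted(target - active), sorted(active - target)


# Thursday morning: scale up from one cluster to two.
print(plan(["presto1", "presto2"], {"presto1"}, datetime(2019, 6, 20, 10)))
```

Since the whole cluster is the unit of scaling, scale-down becomes a routing change at the gateway rather than a worker reduction inside a cluster serving queries.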

Query volume pattern

Granularity - minute

Raw data scanned rate (GB/s)

Peaks at 300 GB/s

Scheduled scaling - nodes vs query volume

Cutting 50% of infra in non-work hours resulted in a 30% reduction in cost.
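One way to read these two percentages is as a relationship between capacity cut and off-hours share. The sketch assumes cost is proportional to node-hours, which the slide does not state. Under that assumption, a 50% cut yields a 30% saving when about 60% of all hours are off-hours.

```python
HOURS_PER_WEEK = 7 * 24


def cost_saving(capacity_cut: float, off_hours_fraction: float) -> float:
    # Saving as a fraction of total cost, assuming cost ~ node-hours.
    return capacity_cut * off_hours_fraction


def implied_off_hours_fraction(saving: float, capacity_cut: float) -> float:
    # Invert the relationship to see what share of hours must be off-hours.
    return saving / capacity_cut


fraction = implied_off_hours_fraction(saving=0.30, capacity_cut=0.50)
print(round(fraction, 2))                   # share of hours at reduced capacity
print(round(fraction * HOURS_PER_WEEK, 1))  # the same share in hours per week
```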

Perf improvements

Performance boost - timeline

The longer the cluster ran, the slower it became.

P75 - query wall time (seconds)

Weekly Rolling P75

This graph visualizes trends in query performance over a 7-day rolling window.
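A 7-day rolling P75 can be computed as in the sketch below. It assumes per-day lists of query wall times in seconds and uses the nearest-rank percentile definition. Both the input shape and that definition are assumptions, since the slide does not say how the metric was produced.

```python
import math
from datetime import date, timedelta
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    # Nearest-rank percentile: smallest value with at least pct% at or below it.
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def rolling_p75(daily: Dict[date, List[float]],
                window_days: int = 7) -> Dict[date, float]:
    result = {}
    for day in sorted(daily):
        # Pool all wall times from this day and the preceding window.
        window: List[float] = []
        for offset in range(window_days):
            window.extend(daily.get(day - timedelta(days=offset), []))
        if window:
            result[day] = percentile(window, 75)
    return result


wall_times = {
    date(2019, 6, 1): [1.0, 2.0, 3.0, 4.0],
    date(2019, 6, 2): [5.0, 6.0, 7.0, 8.0],
}
print(rolling_p75(wall_times))
```

Pooling a week of queries smooths out daily noise, so a slow upward drift stands out more clearly.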

To avoid long GC pauses, we started recycling nodes every day

Java 11 upgrade

To tackle increased load we added more clusters

P99 total query time (exec + queued) (seconds)

Daily query volume

Google Sheets plugin

Why? Business data is usually maintained in sheets; with the plugin it can be joined with Hive tables.

v 0.1

● All columns are varchar type

● First line in sheet is treated as column name

● Sheet to table name mapping stored in a metadata sheet

● Presto connects to gsheets api using service account credentials

● The service account user is given view access to all the sheets
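The v 0.1 conventions above can be sketched without touching the real Sheets API. The code below works on rows already fetched as lists of cell values. The function names and the metadata column names `table_name` and `sheet_id` are hypothetical, and padding short rows with `None` is an assumption.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def sheet_to_table(rows: List[List[object]]) -> List[Row]:
    """First row becomes column names; every other cell becomes a string."""
    if not rows:
        return []
    header = [str(cell).strip() for cell in rows[0]]
    table: List[Row] = []
    for raw in rows[1:]:
        # Pad short rows so every column is present (assumption).
        padded = list(raw) + [None] * (len(header) - len(raw))
        table.append({col: None if value is None else str(value)
                      for col, value in zip(header, padded)})
    return table


def resolve_sheet(metadata_rows: List[List[object]], table_name: str) -> str:
    """Look up the sheet backing a table name in the metadata sheet."""
    for entry in sheet_to_table(metadata_rows):
        if entry.get("table_name") == table_name:
            return entry["sheet_id"]
    raise KeyError(f"no sheet mapped to table: {table_name}")


metadata = [["table_name", "sheet_id"], ["city_targets", "sheet-123"]]
data = [["city", "rides"], ["SF", 120], ["NYC"]]

print(resolve_sheet(metadata, "city_targets"))  # sheet-123
print(sheet_to_table(data))
```

Treating every column as a string keeps the first version simple, at the cost of explicit casts in queries.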

Table sheet mapping (metadata sheet)

Data Table

Querying

Apache Superset

Sends EXPLAIN (TYPE VALIDATE) queries to deep-validate columns, UDFs, tables, etc.

Pre query validation & SQL IDE experience
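The pre-query validation step amounts to wrapping the user's statement before sending it to Presto. The helper below is a minimal sketch with a hypothetical name. It only builds the validation statement and does not show how Superset submits it.

```python
def validation_query(sql: str) -> str:
    """Wrap a statement so Presto validates it without executing it."""
    statement = sql.strip().rstrip(";").strip()
    if not statement:
        raise ValueError("empty statement")
    return f"EXPLAIN (TYPE VALIDATE) {statement}"


print(validation_query("SELECT city, count(*) FROM rides GROUP BY city;"))
```

Running the wrapped statement first lets the IDE surface errors such as unknown columns or tables before the real query is queued.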

Future work

● GSheets connector plugin

○ open source

○ auto column type detection

○ easy sheet onboarding.

● Apache Superset

○ Showing query cost as user runs the presto query

● Scheduled vs adhoc query routing

● Gateway - HA

Questions?

Credits

Data Infra - Interactive Querying

Thank you Presto dev community !
