Presto at Facebook - Presto Meetup @ Boston (10/6/2015)

Presto @ Facebook: Martin Traverso and Dain Sundstrom

Transcript of Presto at Facebook - Presto Meetup @ Boston (10/6/2015)

Page 1

Presto @ Facebook

Martin Traverso and Dain Sundstrom

Page 2

Presto @ Facebook

• Ad-hoc/interactive queries for warehouse

• Batch processing for warehouse

• Analytics for user-facing products

• Analytics over various specialized stores

Page 3

Analytics for Warehouse

Page 4

Architecture

[Diagram: UI, CLI, dashboards, and other tools send queries through a Gateway, which routes them to one of several Presto warehouse clusters]

Page 5

Deployment

[Diagram: Presto processes are co-located with HDFS Datanodes and MapReduce (MR) tasks; Presto runs on a subset of the warehouse machines]

Page 6

Stats

• 1000s of internal daily active users

• Millions of queries each month

• Scan PBs of data every day

• Process trillions of rows every day

• 10s of concurrent queries

Page 7

Features

• Pipelined partition/split enumeration

• Streaming

• Admission control

• Resource management

• System reliability

Page 8

Batch workloads

Page 9

Batch Requirements

• INSERT OVERWRITE

• More data types

• UDFs

• Physical properties (partitioning, etc.)

Page 10

Analytics for User-facing Products

Page 11

Requirements

• Hundreds of ms to seconds latency, low variability

• Availability

• Update semantics

• 10- to 15-way joins

Page 12

Architecture

[Diagram: clients query Presto workers, which read from sharded MySQL instances; loader processes write data into the MySQL shards]

Page 13

Stats

• > 99.99% query success rate

• 100% system availability

• 25 - 200 concurrent queries

• 1 - 20 queries per second

• <100ms - 5s latency

Page 14

Presto Raptor

Page 15

Requirements

• Large data sets

• Seconds to minutes latency

• Predictable performance

• 5-15 minute load latency

• Reliable data loads (no duplicates, no missing data)

• 10s of concurrent queries

Page 16

Basic Architecture

[Diagram: clients connect to a Coordinator backed by MySQL metadata; Presto workers store data on local flash]

Page 17

But isn’t that exactly what Hive does?

Page 18

Additional Features

• Full-featured and atomic DDL

• Table statistics

• Tiered storage

• Atomic data loads

• Physical organization

Page 19

Table Statistics

• Table is divided into shards

• Each shard is stored in a separate replication unit (i.e., file)

• Typically 1 to 10 million rows

• Node assignment and stats stored in MySQL

Page 20

Table Schema in MySQL

Tables
  id  name
  1   orders
  2   line_items
  3   parts

table1 shards
  uuid  nodes  c1_min  c1_max  c2_min   c2_max   c3_min  c3_max
  43a5  A      30      90      cat      dog      2014    2014
  6701  C      34      45      apple    banana   2005    2015
  9c0f  A,D    25      26      cheese   cracker  1982    1994
  df31  B      23      71      tiger    zebra    1999    2006
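The point of keeping per-shard min/max statistics in MySQL is shard pruning: the coordinator can discard shards whose value range cannot match a query predicate before any data is read. The following is a hypothetical Python sketch of that idea, using the sample shard rows from the table above; the function name and layout are illustrative, not Raptor's actual code.

```python
# Raptor-style shard pruning sketch: each shard row carries min/max stats for
# three columns, and a predicate can skip shards whose range cannot match.

# (uuid, nodes, c1_min, c1_max, c2_min, c2_max, c3_min, c3_max)
SHARDS = [
    ("43a5", "A",   30, 90, "cat",    "dog",     2014, 2014),
    ("6701", "C",   34, 45, "apple",  "banana",  2005, 2015),
    ("9c0f", "A,D", 25, 26, "cheese", "cracker", 1982, 1994),
    ("df31", "B",   23, 71, "tiger",  "zebra",   1999, 2006),
]

def shards_for_predicate(shards, col, value):
    """Return uuids of shards whose [min, max] range for `col` may contain value."""
    lo_idx, hi_idx = {"c1": (2, 3), "c2": (4, 5), "c3": (6, 7)}[col]
    return [s[0] for s in shards if s[lo_idx] <= value <= s[hi_idx]]

# For "WHERE c3 = 2014", only the shards whose c3 range covers 2014 are read.
print(shards_for_predicate(SHARDS, "c3", 2014))  # ['43a5', '6701']
```

Since the stats also record node assignments, the same metadata lookup tells the coordinator both which shards to scan and where they live.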

Page 21

Tiered Storage

[Diagram: the basic architecture (Coordinator with MySQL metadata, workers with local flash) extended with a durable Backup tier behind the workers]

Page 22

Tiered Storage

• One copy in local, expensive, flash

• Backup copy in cheap durable backup tier

• Currently Gluster internally, but can be anything durable

• Only assumes GET and PUT methods with client-assigned IDs
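Because the backup tier only needs GET and PUT keyed by a client-assigned ID, almost any durable blob store can fill the role. A minimal sketch of that contract, with an in-memory stand-in for the durable store (class and method names are illustrative):

```python
# Backup-tier contract sketch: only put(id, bytes) and get(id) are assumed,
# with the client (not the store) choosing the ID. An in-memory dict stands
# in for a real durable store such as Gluster.
import uuid

class InMemoryBackupStore:
    """Stand-in for a durable backup tier; a real tier persists the bytes."""
    def __init__(self):
        self._blobs = {}

    def put(self, shard_id: str, data: bytes) -> None:
        # Client-assigned IDs make a retried PUT an idempotent overwrite.
        self._blobs[shard_id] = data

    def get(self, shard_id: str) -> bytes:
        return self._blobs[shard_id]

store = InMemoryBackupStore()
shard_id = str(uuid.uuid4())      # client-assigned ID, as on the slide
store.put(shard_id, b"shard file bytes")
assert store.get(shard_id) == b"shard file bytes"
```

Keeping the interface this small is what lets "anything durable" back the tier: no listing, renaming, or server-side ID generation is required.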

Page 23

Atomic Data Loads

• Import data periodically from streaming event system

• Internally a Scribe-based system, similar to Kafka or Kinesis

• Provides continuation tokens

• Loads performed using SQL

Page 24

Atomic Data Loads

INSERT INTO target
SELECT * FROM source_stream
WHERE token BETWEEN ${last_token} AND ${next_token}

Page 25

Loader Process

1. Record new job with “now” token in MySQL

2. Execute INSERT from last committed token to “now” token with external batch id

3. Wait for INSERT to commit (check external batch status)

4. Record job complete

5. Repeat
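The five-step loop above can be sketched in a few lines of Python. This is a simplified model, with in-memory dicts standing in for the MySQL job records and for the warehouse; all names and token values are illustrative.

```python
# Loader-loop sketch: record the job, run the token-bounded INSERT, confirm
# the external batch committed, then mark the job complete and advance the
# committed token so the next cycle starts where this one ended.
jobs = {}                 # job_id -> job record (stand-in for MySQL)
committed_token = 100     # last committed continuation token
stream_now_token = 175    # "now" token reported by the event stream

def run_load_cycle(job_id, batch_id, now_token):
    global committed_token
    # 1. Record a new job with the "now" token.
    jobs[job_id] = {"range": (committed_token, now_token),
                    "batch_id": batch_id, "state": "RUNNING"}
    # 2. Execute INSERT ... WHERE token BETWEEN last AND now (simulated here).
    # 3. Wait for the INSERT to commit; a real loader polls the external
    #    batch id to learn whether the write actually landed.
    insert_committed = True
    if insert_committed:
        # 4. Record the job complete and advance the committed token.
        jobs[job_id]["state"] = "COMPLETE"
        committed_token = now_token
    # 5. The caller repeats with the next "now" token.

run_load_cycle("job-1", "batch-1", stream_now_token)
print(jobs["job-1"]["state"], committed_token)  # COMPLETE 175
```

The continuation tokens are what make the load exactly-once: every row is covered by exactly one committed token range, so there are no duplicates and no gaps.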

Page 26

Failure Recovery

• Loader crash
  • Check status of jobs using the external batch id
• INSERT hang
  • Cancel the query and roll back the job (verify status to avoid a race)
• Duplicate loader processes
  • The process guarantees only one job can complete
• Monitor for lack of progress (also catches the no-loaders case)
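One way the "only one job can complete" guarantee can be enforced is a conditional update on the job row in the metadata store: whichever loader's UPDATE matches the still-running row wins, and every duplicate sees zero rows changed. A sketch of that pattern, using sqlite3 as a stand-in for MySQL (table and column names are illustrative):

```python
# Compare-and-set job completion: the WHERE clause only matches while the
# job is RUNNING, so of any number of racing loaders exactly one succeeds.
# sqlite3 stands in for the MySQL metadata store.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, state TEXT)")
db.execute("INSERT INTO jobs VALUES ('job-1', 'RUNNING')")

def try_complete(job_id):
    cur = db.execute(
        "UPDATE jobs SET state = 'COMPLETE' WHERE id = ? AND state = 'RUNNING'",
        (job_id,))
    db.commit()
    # rowcount is 1 only for the loader that actually flipped the state.
    return cur.rowcount == 1

print(try_complete("job-1"))  # True  (first loader wins)
print(try_complete("job-1"))  # False (duplicate loader loses)
```

The losing loader simply abandons its work; since the token range was recorded with the job, no data is duplicated or lost either way.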

Page 27

Physical Organization

• Temporal organization
  • Ensures files don’t cross temporal boundaries
  • Matches a common filter clause
  • Eases retention policies
• Sorted files
  • Can reduce file sections processed (local stats)
  • Can reduce shards processed
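The retention-policy benefit follows directly from files not crossing temporal boundaries: expiring old data becomes a metadata-only drop of whole files, with no rewriting. A small illustrative sketch (file records and day values are made up):

```python
# Retention under temporal organization: because every file's rows fall
# inside one temporal bucket, retention is just "drop files whose time
# range ended before the horizon" -- no file ever needs to be rewritten.
files = [
    {"name": "f1", "day_min": 1, "day_max": 1},
    {"name": "f2", "day_min": 2, "day_max": 2},
    {"name": "f3", "day_min": 3, "day_max": 3},
]

def apply_retention(files, oldest_day_to_keep):
    """Keep only files whose range reaches the retention horizon."""
    return [f for f in files if f["day_max"] >= oldest_day_to_keep]

print([f["name"] for f in apply_retention(files, 2)])  # ['f2', 'f3']
```

If files did cross temporal boundaries, the same policy would force partial rewrites of every straddling file, which is exactly what the organization rule avoids.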

Page 28

Unorganized Data

[Diagram: axes are sort columns vs. time; values are scattered randomly across files, with no ordering]

Page 29

Organized Data

[Diagram: axes are sort columns vs. time; values are sorted within each temporal bucket]

Page 30

Background Organization

• Compaction
• Data balancing
• Eager data recovery (from backup)
• Garbage collection
  • Cleans up junk created by compaction, delete, balance, and recovery

Page 31

Future Use Cases

• Hot data cache for Hadoop data
  • 0-N local copies of the “backup” tier

• Query results cache

• Raw (not rolled-up) data store for sharded MySQL customers

• Materialized view store