Lessons Learned from Building and Operating Scuba

Lessons learned from building and operating Scuba. Ciprian Gerea, Facebook

Transcript of Lessons Learned from Building and Operating Scuba

Page 1: Lessons Learned from Building and Operating Scuba

Lessons learned from building and operating Scuba

Ciprian Gerea, Facebook

Page 2: Lessons Learned from Building and Operating Scuba

[Diagram. Systems: ODS; Scuba; Presto & Hive. Axis labels: events / metrics; live / historical.]

Page 3: Lessons Learned from Building and Operating Scuba

Demo time

• Getting started
  – Writing to Scuba

Page 4: Lessons Learned from Building and Operating Scuba

What is Scuba

• Database
  – Real-time ingestion & queries
  – Simple query model: rollups, no joins
  – Simple data model, flexible schema
• UI platform
• Service
  – Runs its own ETL
  – Demand control
    • Retention
    • Queries

Page 5: Lessons Learned from Building and Operating Scuba

Scuba system architecture

[Diagram. Labels: Scribe logs from servers; combined logs for each scribe category; Ptail; Tupperware; manage perfpipe tailer; Tailer; Data storage: rockfortexpress.wildcard SMC tier in PRN1; Scuba backend: root aggregator, leaf; add directly to leaf servers; queries; results; Scuba GUI; `scuba` CLI; Scuba gauge; Script; Alerts; Sparkle; Table insertion counts; validation; dataswarm; HiveToScuba.]

Page 6: Lessons Learned from Building and Operating Scuba

Scuba DB

• Data lives in tables
• Columns can be: int/string/vector<string>
• Can change schema on the fly
• Shared-nothing storage in memory & flash
• Data sharded at random
• Only supports rollup queries: sum/avg/percentile
• Best-effort queries: skip bad nodes
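The combination of random sharding, rollup-only queries, and best-effort execution can be sketched in a few lines. This is a minimal illustration, not Scuba's actual code: the `Leaf` class, `insert`, and `rollup_avg` are hypothetical names, and the coverage fraction returned with the result is an assumption about how skipped nodes might be surfaced to the caller.

```python
import random
from collections import defaultdict


class Leaf:
    """Hypothetical leaf server holding a random shard of rows in memory."""

    def __init__(self, healthy=True):
        self.rows = []          # each row is a dict; columns may differ per row
        self.healthy = healthy  # an unhealthy leaf fails every query

    def partial_rollup(self, group_col, value_col):
        """Return {group: [sum, count]} for the rows stored on this leaf."""
        if not self.healthy:
            raise RuntimeError("leaf unavailable")
        acc = defaultdict(lambda: [0, 0])
        for row in self.rows:
            # Flexible schema: rows missing either column are simply skipped.
            if group_col in row and value_col in row:
                entry = acc[row[group_col]]
                entry[0] += row[value_col]
                entry[1] += 1
        return acc


def insert(leaves, row, rng=random):
    """Shard at random: any leaf may receive any row."""
    rng.choice(leaves).rows.append(row)


def rollup_avg(leaves, group_col, value_col):
    """Root-aggregator sketch: merge partial sums, skipping bad leaves.

    Returns (averages, coverage), where coverage is the fraction of
    leaves that answered (an assumed way to report best-effort results).
    """
    totals = defaultdict(lambda: [0, 0])
    answered = 0
    for leaf in leaves:
        try:
            partial = leaf.partial_rollup(group_col, value_col)
        except RuntimeError:
            continue  # best effort: skip the bad node
        answered += 1
        for key, (s, c) in partial.items():
            totals[key][0] += s
            totals[key][1] += c
    averages = {k: s / c for k, (s, c) in totals.items()}
    return averages, answered / len(leaves)
```

Because rows are placed at random, each healthy leaf holds an unbiased sample of the table, so an average computed without the failed leaves is still a reasonable estimate. Sums and counts, by contrast, would be undercounted roughly in proportion to the missing coverage.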

Page 7: Lessons Learned from Building and Operating Scuba

Demo time

• Let’s run some queries
• Customize the UI
• ETL control

Page 8: Lessons Learned from Building and Operating Scuba

How we keep it running

• ODS metrics for everything
• Scuba data sets for queries & subsystems we’re actively debugging
• Dashboards
  – Cubism is king!
  – Unidash for niche cases
• Active management of demand
  – Table size quotas
  – CPU load -> push to stream processing
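A table size quota can be illustrated with a small sketch. It assumes, without confirmation from the slides, that the quota is enforced by evicting the oldest rows once a table exceeds its byte budget and that rows arrive in timestamp order. `QuotaTable` and its fields are hypothetical names.

```python
from collections import deque


class QuotaTable:
    """Hypothetical table that drops its oldest rows when over quota."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.rows = deque()  # (timestamp, size_bytes, row), oldest first

    def add(self, timestamp, row, size_bytes):
        """Append a row, then evict oldest rows until under quota.

        Assumes rows are added in timestamp order. Returns the number
        of rows evicted by this call.
        """
        self.rows.append((timestamp, size_bytes, row))
        self.used_bytes += size_bytes
        evicted = 0
        while self.used_bytes > self.max_bytes and self.rows:
            _, old_size, _ = self.rows.popleft()
            self.used_bytes -= old_size
            evicted += 1
        return evicted
```

Under this assumed policy, a table that logs faster simply retains a shorter time window, so heavy writers cannot consume space that belongs to other tables.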

Page 9: Lessons Learned from Building and Operating Scuba

Root cause for outages

• Other systems
  – Scribe
  – Hosting layer
  – Deployment mechanism
• Media failures: high on disk, low on flash
• Queries of doom
• High load
  – DoS workloads
  – Load shedding bugs

Page 10: Lessons Learned from Building and Operating Scuba

Why it is successful

• Scuba’s niche:
  – Easy to get started
  – Fast: <50 ms P50 wall time
  – Smooth learning curve
  – UI is customizable (~1k custom presenters!)
  – Its flaws are acceptable
    • Not everyone needs transactions from the beginning
    • Users are OK with retrying queries
• Other tools don’t serve this niche well

Page 11: Lessons Learned from Building and Operating Scuba

What could be better

• Customers ask for
  – More space
  – More consistent results
  – More expressive queries
• Sharding
• Better persistent storage
• Better support for time series

Page 12: Lessons Learned from Building and Operating Scuba

Q & A

Page 13: Lessons Learned from Building and Operating Scuba

Cubism Intro

• Horizon charts
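A horizon chart saves vertical space by slicing a value into fixed-height bands and drawing the bands on top of each other in progressively darker shades, with negative values mirrored in a second color. The folding step can be sketched as follows. `horizon_bands` is a hypothetical helper written for illustration and is not part of Cubism's API.

```python
def horizon_bands(value, band_height, num_bands):
    """Fold a value into horizon-chart bands.

    Returns (sign, fills), where sign is "pos" or "neg" and fills[i] is
    the fraction (0.0 to 1.0) of band i that is filled. Values beyond
    band_height * num_bands are clipped to a fully filled chart.
    """
    magnitude = min(abs(value), band_height * num_bands)
    fills = []
    for i in range(num_bands):
        # Portion of the magnitude that falls inside band i.
        fill = min(max(magnitude - i * band_height, 0.0), band_height)
        fills.append(fill / band_height)
    return ("neg" if value < 0 else "pos"), fills
```

For example, with three bands of height 10, a value of 25 fills the first two bands completely and the third halfway, so the chart cell needs only one third of the height of a conventional line chart.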