
Lessons from building large-scale, multi-cloud, SaaS software at Databricks
Jeff Pang, Principal Software Engineer @ Databricks

Who am I?

▪ Jeff Pang, Principal Software Engineer, Databricks

▪ Databricks Platform Engineering: to help data teams solve the world's toughest problems, the Databricks Platform team provides the world-class, multi-cloud platform that enables us to expand fast and iterate quickly

http://databricks.com/careers

About

▪ Founded in 2013 by the original creators of Apache Spark

▪ Data and AI platform as a service for 5000+ customers

▪ 1000+ employees, 200+ engineers, >$200M annual recurring revenue

Our product

For data scientists, data engineers, and business users

Agenda

▪ The architecture: inside the Unified Analytics Platform

▪ Challenges & lessons: growing a SaaS data platform, operating on multiple clouds, and accelerating a data platform with data & AI

The architecture: inside the Unified Analytics Platform

Simple data engineering architecture

A single cluster ingests raw files (CSV, JSON, TXT, ...) from the data lake (S3, HDFS, blob store, etc.) into Bronze tables (raw ingestion), refines them into Silver tables (filtered, cleaned, augmented), and rolls them up into Gold tables (business-level aggregates) that feed reporting and analytics.
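To make the flow concrete, here is a minimal Spark (Scala) sketch of a batch Bronze → Silver → Gold pipeline. The paths, column names, and the choice of Delta as the table format are illustrative assumptions, not the pipeline from the talk.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object MedallionBatch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("medallion-batch").getOrCreate()
    import spark.implicits._

    // Bronze: raw ingestion of files from the data lake, stored as-is.
    val raw = spark.read.option("header", "true").csv("s3://example-lake/raw/events/")
    raw.write.format("delta").mode("append").save("s3://example-lake/bronze/events")

    // Silver: filtered, cleaned, augmented records.
    val silver = spark.read.format("delta").load("s3://example-lake/bronze/events")
      .filter($"event_type".isNotNull)
      .withColumn("event_date", to_date($"event_ts"))
    silver.write.format("delta").mode("overwrite").save("s3://example-lake/silver/events")

    // Gold: business-level aggregates for reporting and analytics.
    val gold = silver
      .groupBy($"event_date", $"event_type")
      .agg(count(lit(1)).as("events"))
    gold.write.format("delta").mode("overwrite").save("s3://example-lake/gold/daily_event_counts")

    spark.stop()
  }
}
```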

Modern data engineering architecture

The same Bronze → Silver → Gold tables, now fed by both raw files (CSV, JSON, TXT, ...) in the data lake and streaming sources such as Kinesis. Clusters are provisioned by cluster management and orchestrated by workflow scheduling, everything runs inside the customer network, and the results power reporting, notebooks, AI, and streaming analytics.
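For the streaming leg, a minimal Structured Streaming (Scala) sketch that continuously lands a stream into a Bronze table. The built-in `rate` source stands in for Kinesis here, since the Kinesis connector and its options depend on the deployment; the paths and the Delta sink are likewise assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StreamingBronze {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("streaming-bronze").getOrCreate()

    // Stand-in streaming source; in production this would be the Kinesis /
    // Event Hubs connector configured for the customer's stream.
    val events = spark.readStream.format("rate").option("rowsPerSecond", "100").load()

    // Bronze: append the raw stream with ingestion metadata, no transformation.
    val bronze = events.withColumn("ingest_ts", current_timestamp())

    val query = bronze.writeStream
      .format("delta")                                        // assumed Delta sink
      .option("checkpointLocation", "/tmp/checkpoints/bronze_events")
      .outputMode("append")
      .start("/tmp/tables/bronze_events")

    query.awaitTermination()
  }
}
```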

Multiply by thousands of customers...

Each customer network has its own data lake (CSV, JSON, TXT, ...) and streaming sources such as Kinesis. All of them connect to a shared control plane that provides collaborative notebooks, AI, streaming analytics, workflow scheduling, cluster management, admin & security, and reporting & business insights.

...across many regions...

...on multiple clouds...

→ millions of VMs managed per day

That’s the Databricks control plane

What did we learn from building a large-scale, multi-cloud data platform?

▪ 100,000s of users
▪ 100,000s of Spark clusters per day
▪ Millions of VMs launched per day
▪ Exabytes of data processed per day

Growing a SaaS data platform

Evolution of the Databricks control plane

We didn’t start with a global-scale, multi-cloud data platform

Challenge: Scaling a data platform from one customer to 5000+

Lesson: The factory that builds and evolves the data platform is more important than the data platform itself

Fast time to market

Databricks control plane “in-a-box”
▪ Need to deliver value quickly
▪ Need to iterate quickly
▪ Can’t break things while iterating!

Keys to success:
▪ Modern CI
▪ Fast developer tools
▪ Testing, testing, testing

▪ 25-500x Scala build speedups
▪ 10s of millions of tests per day
▪ 100s of Databricks “in-a-box” test environments per day

Expand the total addressable market

Replicating control planes quickly
▪ Need different configurations for different environments
▪ Need to update many environments
▪ Can’t slow down platform development!

Keys to success:
▪ Declarative infrastructure (jsonnet), as sketched below
▪ Modern CD infrastructure

▪ 10 million lines
▪ 250k lines (jsonnet)
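Databricks does this with jsonnet; purely as an illustration of the idea (small per-environment overrides expanded against shared defaults into full configurations), here is a hypothetical Scala sketch. Every field and environment name in it is invented.

```scala
// Hypothetical illustration of declarative, per-environment configuration:
// each environment states only what differs from shared defaults, and the
// full concrete config is rendered from that (the role jsonnet plays here).
case class ServiceConfig(
    cloud: String,
    region: String,
    replicas: Int,
    dbInstanceClass: String
)

object Environments {
  private val defaults =
    ServiceConfig(cloud = "aws", region = "us-west-2", replicas = 3, dbInstanceClass = "medium")

  // Per-environment overrides, kept deliberately small.
  val overrides: Map[String, ServiceConfig => ServiceConfig] = Map(
    "dev"        -> ((c: ServiceConfig) => c.copy(replicas = 1)),
    "prod-us"    -> ((c: ServiceConfig) => c.copy(replicas = 10, dbInstanceClass = "xlarge")),
    "prod-azure" -> ((c: ServiceConfig) => c.copy(cloud = "azure", region = "eastus", replicas = 10))
  )

  def render(env: String): ServiceConfig =
    overrides.getOrElse(env, (c: ServiceConfig) => c)(defaults)
}

object RenderAll extends App {
  // Expands every environment into its full, concrete configuration.
  Environments.overrides.keys.foreach(env => println(s"$env -> ${Environments.render(env)}"))
}
```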

Land and expand workloads

Scaling the control plane
▪ Need to support more users & workloads
▪ Need to build more features that scale
▪ Don’t want devs to reinvent the wheel!

Keys to success:
▪ A service framework to do the hard stuff (sketched below)
▪ Decompose monoliths to microservices

Service framework: container & replica management, APIs & RPCs, rate limits, metrics, logging, secrets & security, ...
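A hypothetical sketch of what “a service framework to do the hard stuff” can look like: the framework wraps every request with cross-cutting concerns (rate limiting, metrics, logging) so each microservice implements only its business logic. All names here are invented for illustration.

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical service framework sketch: the framework owns cross-cutting
// concerns (rate limits, metrics, logging); services only implement handle().
trait Service {
  def name: String
  def handle(request: String): String
}

final class ServiceRuntime(service: Service, maxInFlight: Int) {
  private val inFlight = new AtomicLong(0)
  private val served   = new AtomicLong(0)

  def dispatch(request: String): Either[String, String] = {
    if (inFlight.incrementAndGet() > maxInFlight) {             // crude rate limit
      inFlight.decrementAndGet()
      Left(s"${service.name}: rate limited")
    } else {
      try {
        val response = service.handle(request)                  // business logic only
        served.incrementAndGet()                                 // metric
        println(s"[${service.name}] served=${served.get}")       // log line
        Right(response)
      } finally inFlight.decrementAndGet()
    }
  }
}

// Example microservice: cares only about its own logic.
object ClusterSizer extends Service {
  val name = "cluster-sizer"
  def handle(request: String): String = s"suggested size for $request: 8 workers"
}

object FrameworkDemo extends App {
  val runtime = new ServiceRuntime(ClusterSizer, maxInFlight = 100)
  println(runtime.dispatch("workspace-42"))
}
```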

Example: evolving the cluster manager
▪ Version 1: a single Cluster Manager service talks to the cloud VM APIs and manages customer clusters
▪ Version 3: API servers front a CM master that delegates to worker processes, and the CM master is split into shards (CM shards) behind multiple API server replicas, scaling with usage across the data platform
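Sharding the cluster manager implies some deterministic routing from a workspace (or cluster) to its owning shard. Below is a hypothetical sketch of the simplest such scheme; the actual assignment logic is not described in the talk.

```scala
// Hypothetical shard routing sketch: an API server maps each workspace to one
// of N cluster-manager shards so that a single shard owns all of its clusters.
final case class Shard(id: Int, endpoint: String)

final class ShardRouter(shards: Vector[Shard]) {
  require(shards.nonEmpty, "need at least one CM shard")

  // Deterministic routing: the same workspace always lands on the same shard.
  def shardFor(workspaceId: String): Shard = {
    val idx = math.abs(workspaceId.hashCode % shards.size)
    shards(idx)
  }
}

object ShardRouterDemo extends App {
  val router = new ShardRouter(Vector(
    Shard(0, "cm-shard-0.internal:8080"),
    Shard(1, "cm-shard-1.internal:8080"),
    Shard(2, "cm-shard-2.internal:8080")
  ))
  println(router.shardFor("workspace-42"))  // always the same shard for this workspace
}
```

Consistent hashing or an explicit assignment table would avoid moving most workspaces when shards are added; hash-mod is just the simplest deterministic choice.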

The Databricks data platform factory

The same control plane services (API servers, CM masters, CM shards, workers) serve many customer networks. They run on Kubernetes with Envoy and GraphQL at the edge, shared infrastructure such as HCVault, Consul, Prometheus, ELK, Jaeger, Grafana, common IAM, onboarding, and billing, and underlying cloud VMs, network, storage, and databases.

Operating on multiple clouds

Why multi-cloud?

The data platform needs to be where the data is
▪ Performance, latency, egress data costs
▪ Cloud-specific integrations
▪ Data governance policies

Challenge: Supporting multiple clouds without sacrificing dev velocity

Lesson: A cloud-agnostic layer is key to dev velocity, but it also needs to integrate with the standards of each cloud and deal with their quirks

Challenge: dev velocity on multiple clouds

Services? Many cloud services have no direct equivalents
▪ DynamoDB vs ?
▪ CosmosDB vs ?
▪ Aurora vs ?
▪ SQL DW vs ?

APIs? Cloud APIs don’t look like each other
▪ SDK: no common interfaces
▪ Auth: IAM vs AAD
▪ ACLs: IAM vs Azure RBAC

Ops? Operational tools for each cloud are very different
▪ Templates: CloudFormation vs ARM templates
▪ Logs: CloudWatch vs Azure Monitor

Approach: cloud agnostic dev framework

Use lowest common denominator cloud services behind a service framework API
▪ Kubernetes: EKS ≈ AKS
▪ Compute & network: EC2 / VPC ≈ Azure Compute / VNet
▪ Databases: RDS MySQL/Postgres ≈ Azure Database for MySQL/Postgres
▪ Load balancing: ELB ≈ Azure Load Balancer
▪ Cloud-agnostic infrastructure everywhere: HCVault, Consul, Prometheus, ELK, Jaeger, Grafana, Envoy, common IAM, onboarding, billing, ...
The same control plane services (API servers, CM masters, CM shards, workers) run on either cloud.
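One common way to realize such a cloud-agnostic layer is a narrow internal interface with one implementation per cloud, so services never branch on the cloud. A hypothetical Scala sketch (the stubbed bodies stand in for real AWS/Azure SDK calls, which are not shown in the talk):

```scala
// Hypothetical cloud-agnostic layer: services program against this narrow
// interface; one implementation per cloud hides the SDK, auth, and ACL quirks.
final case class VmSpec(instanceType: String, region: String)
final case class VmHandle(cloud: String, id: String)

trait CloudProvider {
  def launchVm(spec: VmSpec): VmHandle
  def terminateVm(vm: VmHandle): Unit
}

final class AwsProvider extends CloudProvider {
  // Real code would call the AWS SDK (EC2) using IAM credentials.
  def launchVm(spec: VmSpec): VmHandle = VmHandle("aws", s"i-${spec.region}-stub")
  def terminateVm(vm: VmHandle): Unit  = println(s"terminating ${vm.id} via EC2")
}

final class AzureProvider extends CloudProvider {
  // Real code would call the Azure Compute SDK using AAD auth.
  def launchVm(spec: VmSpec): VmHandle = VmHandle("azure", s"vm-${spec.region}-stub")
  def terminateVm(vm: VmHandle): Unit  = println(s"terminating ${vm.id} via Azure Compute")
}

object ClusterLauncher extends App {
  // The cluster manager never branches on the cloud; the provider is injected.
  def launchCluster(cloud: CloudProvider, workers: Int): Seq[VmHandle] =
    (1 to workers).map(_ => cloud.launchVm(VmSpec("standard-8core", "us-west-2")))

  println(launchCluster(new AwsProvider, workers = 2))
}
```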

Challenge: not everything can be cloud agnostic

▪ Customers want to integrate with the standards of each cloud
▪ “Equivalent” cloud services have implementation quirks

Approach: abstraction layer for key integrations

Key integrations get an abstraction layer with a cloud-specific backend on each cloud:
▪ AuthN / AuthZ / Identity: IAM roles, Okta, OneLogin, etc. vs Azure Active Directory
▪ Bring your own key encryption: KMS vs Azure Key Vault
▪ Billing: a unified usage service over AWS Marketplace / custom billing vs Azure Commerce Billing
▪ Storage: the Databricks file system over S3 (plus an S3 commit service) vs Azure Storage
▪ Container orchestration: Kubernetes on Fargate vs AKS
The lowest-common-denominator layers (EC2 / VPC ≈ Azure Compute / VNet, RDS ≈ Azure Database for MySQL/Postgres, ELB ≈ Azure Load Balancer) and the control plane services stay the same.

Approach: harmonize “equivalent” cloud service quirks

Virtual machines: the promise of elastic compute is unevenly distributed
▪ Provisioning speed differs
▪ Deletion speed differs (i.e., speed to refill quota)
→ Need to adapt to cloud resource and API limits (see the sketch below)

Network: TCP connections are hard
▪ “Invisible” NATs have connection & timeout limits
→ Need tuned keep-alive and connection limit configs
▪ A kernel TCP SACK bug caused API hangs in one cloud only
→ Need deep robustness testing against both clouds (ex: poor NIC reliability)

Databases: when MySQL != MySQL
▪ Host OS matters (ex: case sensitivity defaults)
▪ Default DB params matter (ex: tablespace config → 100x difference in recovery time)
→ Need expertise in DB tuning to ensure equivalence
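As a concrete illustration of “adapt to cloud resource and API limits” and “tuned keep-alive and connection limit configs”, here is a hypothetical sketch in which those limits are per-cloud parameters and cloud API calls are retried with backoff. All the numbers are invented.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical per-cloud tuning: the same control-plane code runs everywhere,
// but limits and timeouts are parameters that differ per cloud.
final case class CloudTuning(
    maxVmApiCallsPerMinute: Int,
    tcpKeepAliveSeconds: Int,
    maxConnectionsPerHost: Int,
    provisioningTimeoutSeconds: Int
)

object CloudTuning {
  val byCloud: Map[String, CloudTuning] = Map(
    "aws"   -> CloudTuning(maxVmApiCallsPerMinute = 600, tcpKeepAliveSeconds = 60,
                           maxConnectionsPerHost = 256, provisioningTimeoutSeconds = 300),
    "azure" -> CloudTuning(maxVmApiCallsPerMinute = 200, tcpKeepAliveSeconds = 30,
                           maxConnectionsPerHost = 128, provisioningTimeoutSeconds = 600)
  )
}

object Retries {
  // Generic retry with exponential backoff for throttled or flaky cloud API calls.
  def withBackoff[T](attempts: Int, initialDelayMs: Long)(call: => T): Try[T] = {
    def loop(remaining: Int, delayMs: Long): Try[T] = Try(call) match {
      case s @ Success(_)              => s
      case Failure(_) if remaining > 1 =>
        Thread.sleep(delayMs)
        loop(remaining - 1, delayMs * 2)
      case f @ Failure(_)              => f
    }
    loop(attempts, initialDelayMs)
  }
}

object TuningDemo extends App {
  val tuning = CloudTuning.byCloud("azure")
  println(s"keep-alive: ${tuning.tcpKeepAliveSeconds}s, max conns/host: ${tuning.maxConnectionsPerHost}")
  println(Retries.withBackoff(attempts = 5, initialDelayMs = 200L) { "DescribeInstances OK" })
}
```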

Accelerating a data platform with data & AI

Inception: Improving a data platform with data & AI

We are one of our biggest customers

Challenge: Building a data platform is hard without a data platform
▪ Need data to track usage, maintain security
▪ Need data to observe and improve how users use the data platform
▪ Need data to keep the data platform up and running

Lesson: Data & AI can accelerate data platform features, product analytics, and devops

How we use Databricks to accelerate itself

Key platform features
▪ Usage and billing reports
▪ Audit logs

Essential product analytics
▪ Feature usage, trends, prediction
▪ Growth and churn forecast, models

Mission critical devops
▪ Service KPIs and SLAs (see the KPI sketch below)
▪ API and application structured logs
▪ Spark debug logs
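A minimal Spark (Scala) sketch of the “service KPIs and SLAs from structured logs” idea: per-service daily request volume, error rate, and p99 latency computed from an assumed API request log table. The table path and column names are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ServiceKpis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("service-kpis").getOrCreate()

    // Assumed structured API log table: (service, status_code, latency_ms, ts).
    val logs = spark.read.format("delta").load("/tables/api_request_logs")

    // Per-service, per-day KPIs: request volume, error rate, p99 latency.
    val kpis = logs
      .withColumn("day", to_date(col("ts")))
      .groupBy(col("service"), col("day"))
      .agg(
        count(lit(1)).as("requests"),
        avg(when(col("status_code") >= 500, 1).otherwise(0)).as("error_rate"),
        expr("percentile_approx(latency_ms, 0.99)").as("p99_latency_ms")
      )

    kpis.write.format("delta").mode("overwrite").save("/tables/service_kpis_daily")
    spark.stop()
  }
}
```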

Data foundation & analytics

Our distributed data pipelines

▪ 100s of TB of logs per day
▪ Millions of time series per second
▪ Time series, raw logs, request tracing, dashboards
▪ Ingestion from Kinesis and Event Hubs
▪ Declarative data pipeline deployments
▪ Real-time streaming

Takeaways

The architecture: managing millions of VMs around the world in multiple clouds

Challenges & lessons:
▪ The factory that builds and evolves the data platform is more important than the data platform itself
▪ A cloud-agnostic platform that integrates with cloud standards and quirks is the key to multi-cloud
▪ Data & AI accelerates data platform features, product analytics, and devops

Join us! http://databricks.com/careers

Feedback

Your feedback is important to us.

Don’t forget to rate and review the sessions.


Our Product

Built around open source. The Databricks Service provides interactive data science, scheduled jobs, and a SQL frontend for data scientists, data engineers, and business users, on top of compute clusters running the Databricks Runtime and cloud storage in the customer's cloud account.
