Lessons from building large-scale, multi-cloud, SaaS software at Databricks
Jeff Pang, Principal Software Engineer @ Databricks
Who am I?
▪ Jeff Pang, Principal Software Engineer, Databricks
▪ Databricks Platform Engineering: to help data teams solve the world’s toughest problems, the Databricks Platform team provides the world-class, multi-cloud platform that enables us to expand fast and iterate quickly
http://databricks.com/careers
About
▪ Founded in 2013 by the original creators of Apache Spark
▪ Data and AI platform as a service for 5000+ customers
▪ 1000+ employees, 200+ engineers, >$200M annual recurring revenue
Our product
[Diagram: the platform serves data scientists, data engineers, and business users]
Agenda
The architecture: inside the Unified Analytics Platform
Challenges & lessons:
▪ Growing a SaaS data platform
▪ Operating on multiple clouds
▪ Accelerating a data platform with data & AI
The architecture: inside the Unified Analytics Platform
Simple data engineering architecture
[Diagram: a cluster ingests raw files (CSV, JSON, TXT, …) from a data lake (S3, HDFS, blob store, etc.) and refines them through Bronze (raw ingestion) → Silver (filtered, cleaned, augmented) → Gold (business-level aggregates), which feed reporting and analytics]
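The Bronze → Silver → Gold flow above can be sketched as plain Python transformations (a toy sketch with invented field names; real pipelines would use Spark DataFrames):

```python
# Toy sketch of a medallion (bronze/silver/gold) pipeline.
# Plain dicts keep it self-contained; field names are illustrative.

def bronze(raw_rows):
    """Raw ingestion: keep everything, tag each row with its source."""
    return [dict(r, _source="csv") for r in raw_rows]

def silver(bronze_rows):
    """Filter, clean, augment: drop bad rows, normalize types."""
    cleaned = [r for r in bronze_rows if r.get("amount") is not None]
    for r in cleaned:
        r["amount"] = float(r["amount"])
    return cleaned

def gold(silver_rows):
    """Business-level aggregate: total amount per customer."""
    totals = {}
    for r in silver_rows:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

raw = [
    {"customer": "a", "amount": "10"},
    {"customer": "a", "amount": None},   # dropped in the silver stage
    {"customer": "b", "amount": "5"},
]
print(gold(silver(bronze(raw))))  # {'a': 10.0, 'b': 5.0}
```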
Modern data engineering architecture
[Diagram: streaming sources (Kinesis) and raw files (CSV, JSON, TXT, …) land in the data lake and flow through Bronze → Silver → Gold, feeding streaming analytics and reporting, notebooks, and AI; clusters in the customer network are driven by workflow scheduling and cluster management]
Multiply by thousands of customers...
[Diagram: many customer networks, each with its own data lake (CSV, JSON, TXT, … plus Kinesis streams), all served by a single control plane providing collaborative notebooks, AI, streaming analytics, workflow scheduling, cluster management, admin & security, and reporting / business insights]
...across many regions...
...on multiple clouds...
→ millions of VMs managed per day
That’s the Databricks control plane
What did we learn from building a large-scale, multi-cloud data platform?
▪ 100,000s of users
▪ 100,000s of Spark clusters per day
▪ Millions of VMs launched per day
▪ Exabytes of data processed per day
Growing a SaaS data platform
Evolution of the Databricks control plane
We didn’t start with a global-scale, multi-cloud data platform
Challenge: Scaling a data platform from one customer to 5000+
Lesson: The factory that builds and evolves the data platform is more important than the data platform itself
Fast time to market
Databricks control plane “in-a-box”
▪ Need to deliver value quickly
▪ Need to iterate quickly
▪ Can’t break things while iterating!
Keys to success:
▪ Modern CI
▪ Fast developer tools
▪ Testing, testing, testing
▪ 25–500x Scala build speedups (V1 → V2)
▪ 10s of millions of tests per day
▪ 100s of Databricks “in-a-box” test envs per day
Expand the total addressable market
Replicating control planes quickly
▪ Need different configurations for different environments
▪ Need to update many environments
▪ Can’t slow down platform development!
Keys to success:
▪ Declarative infrastructure (jsonnet)
▪ Modern CD infrastructure
(250k lines of jsonnet, 10 million lines of configuration)
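The declarative-infrastructure idea can be sketched in Python (the slide names jsonnet as the real tool; this toy deep-merge of a base template with per-environment overrides, with invented config keys, only illustrates the pattern):

```python
# Toy sketch of declarative, per-environment config generation.
# Databricks uses jsonnet for this; merging a shared base template
# with small per-environment overrides captures the same idea.
import json

BASE = {
    "replicas": 2,
    "image": "control-plane:latest",
    "limits": {"cpu": "1", "memory": "2Gi"},
}

OVERRIDES = {
    "dev":  {"replicas": 1},
    "prod": {"replicas": 10, "limits": {"cpu": "4", "memory": "8Gi"}},
}

def render(env):
    """Deep-merge the base template with one environment's overrides."""
    def merge(base, over):
        out = dict(base)
        for k, v in over.items():
            if isinstance(v, dict) and isinstance(out.get(k), dict):
                out[k] = merge(out[k], v)
            else:
                out[k] = v
        return out
    return merge(BASE, OVERRIDES.get(env, {}))

print(json.dumps(render("prod"), indent=2))
```

A small amount of declarative source can fan out into full configs for every environment, which is how a few hundred thousand source lines can generate millions of lines of configuration.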
Land and expand workloads
Scaling the control plane
▪ Need to support more users & workloads
▪ Need to build more features that scale
▪ Don’t want devs to reinvent the wheel!
Keys to success:
▪ A service framework to do the hard stuff: container & replica management, APIs & RPCs, rate limits, metrics, logging, secrets & security, ...
▪ Decompose monoliths to microservices
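One of the cross-cutting concerns listed above, rate limiting, can be sketched as a token bucket (an illustrative implementation, not Databricks' actual framework code):

```python
# Token-bucket rate limiter: the kind of cross-cutting concern a
# service framework provides once so every microservice doesn't
# reimplement it.
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        """Refill based on elapsed time, then try to spend one token."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=2)
results = [bucket.allow() for _ in range(4)]
print(results)  # [True, True, False, False]: burst of 2, then rejected
```

Putting this behind the framework's RPC layer lets every service get per-caller limits from configuration instead of code.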
[Diagram: cluster manager evolution. Version 1: a single Cluster Manager talks to the cloud VM API and customer clusters. Version 2: a CM Master plus workers behind an API Server. Version 3: sharded CM Masters behind replicated API Servers, with usage data flowing into the data platform]
The Databricks data platform factory
[Diagram: many customer networks served by control planes built on a common stack: cloud VMs, network, storage, and databases; Kubernetes; Envoy and GraphQL; HCVault, Consul, Prometheus, ELK, Jaeger, Grafana, common IAM, onboarding, billing; and services such as sharded CM Masters behind replicated API Servers]
Operating on multiple clouds
Why multi-cloud?
The data platform needs to be where the data is
▪ Performance, latency, egress data costs
▪ Cloud-specific integrations
▪ Data governance policies
Challenge: Supporting multiple clouds without sacrificing dev velocity
Lesson: A cloud-agnostic layer is key to dev velocity, but it also needs to integrate with the standards of each cloud and deal with their quirks
Challenge: dev velocity on multiple clouds
Services? Many cloud services have no direct equivalents
▪ DynamoDB vs ?
▪ CosmosDB vs ?
▪ Aurora vs ?
▪ SQL DW vs ?
APIs? Cloud APIs don’t look like each other
▪ SDK: no common interfaces
▪ Auth: IAM vs AAD
▪ ACLs: IAM vs Azure RBAC
Ops? Operational tools for each cloud are very different
▪ Templates: CloudFormation vs ARM templates
▪ Logs: CloudWatch vs Azure Monitor
Approach: cloud agnostic dev framework
Use lowest-common-denominator cloud services:
▪ EKS ← Kubernetes → AKS
▪ EC2 ≈ Azure Compute
▪ VPC ≈ VNet
▪ RDS MySQL/Postgres ≈ Azure Database for MySQL/Postgres
▪ ELB ≈ Azure Load Balancer
On top: the service framework API, plus Envoy and HCVault, Consul, Prometheus, ELK, Jaeger, Grafana, common IAM, onboarding, billing, ...
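A cloud-agnostic layer like this typically boils down to one interface with per-cloud implementations, so control-plane logic never branches on the cloud (an illustrative sketch; the class and method names are invented, not Databricks' API):

```python
# Illustrative cloud-agnostic layer: services code against one
# interface; per-cloud adapters hide EC2 vs. Azure Compute details.
from abc import ABC, abstractmethod

class CloudVMs(ABC):
    @abstractmethod
    def launch(self, count: int) -> list:
        """Launch `count` VMs, returning their IDs."""

class AwsVMs(CloudVMs):
    def launch(self, count):
        # Real code would call the EC2 RunInstances API here.
        return ["i-%08x" % n for n in range(count)]

class AzureVMs(CloudVMs):
    def launch(self, count):
        # Real code would call the Azure Compute API here.
        return ["vm-%d" % n for n in range(count)]

def provision_cluster(cloud: CloudVMs, workers: int) -> list:
    """Control-plane logic stays identical on every cloud."""
    return cloud.launch(workers)

print(provision_cluster(AwsVMs(), 2))    # ['i-00000000', 'i-00000001']
print(provision_cluster(AzureVMs(), 2))  # ['vm-0', 'vm-1']
```

The lowest-common-denominator services on the slide (Kubernetes, MySQL/Postgres, load balancers) are what make such adapters thin enough to maintain.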
Challenge: not everything can be cloud agnostic
▪ Customers want to integrate with the standards of each cloud
▪ “Equivalent” cloud services have implementation quirks
Approach: abstraction layer for key integrations
[Diagram: per-cloud integrations behind common abstractions:
▪ Managed containers: Fargate ← Kubernetes → AKS
▪ AuthN / AuthZ / Identity: Okta, OneLogin, etc. and IAM roles ≈ Azure Active Directory
▪ Bring-your-own-key encryption: KMS ≈ Azure Key Vault
▪ Billing: unified usage service over AWS Marketplace / custom billing ≈ Azure Commerce Billing
▪ Databricks file system: S3 (with an S3 commit service) ≈ Azure Storage
▪ Load balancing: ELB ≈ Azure Load Balancer
all layered over the cloud-agnostic base (EC2/VPC/RDS vs. their Azure equivalents)]
Approach: harmonize “equivalent” cloud service quirks
Virtual machines: the promise of elastic compute is unevenly distributed
▪ Provisioning speed differs
▪ Deletion speed differs (speed to refill quota)
→ Need to adapt to cloud resource and API limits
Network: TCP connections are hard
▪ “Invisible” NATs have connection & timeout limits → need tuned keep-alive and connection-limit configs
▪ A kernel TCP SACK bug caused API hangs in one cloud only → need deep robustness testing against both clouds (e.g., poor NIC reliability)
Databases: when MySQL != MySQL
▪ Host OS matters (e.g., case-sensitivity defaults)
▪ Default DB params matter (e.g., tablespace config → 100x difference in recovery time)
→ Need expertise in DB tuning to ensure equivalence
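The keep-alive tuning called out above can be sketched with Python's socket options (the option names are Linux-specific and the values are illustrative, not Databricks' settings):

```python
# Enable and tune TCP keep-alive so idle connections through
# "invisible" NATs are probed before the NAT silently drops them.
# TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT are Linux option names.
import socket

def tuned_socket(idle=60, interval=10, probes=5):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):  # only present on Linux
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return s

s = tuned_socket()
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # nonzero = enabled
s.close()
```

The right idle/interval values depend on each cloud's NAT timeout, which is exactly the kind of per-cloud quirk the slide describes.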
Accelerating a data platformwith data & AI
Inception: improving a data platform with data & AI
We are one of our biggest customers
Challenge: building a data platform is hard without a data platform
▪ Need data to track usage, maintain security
▪ Need data to observe and improve how users use the data platform
▪ Need data to keep the data platform up and running
Lesson: data & AI can accelerate data platform features, product analytics, and devops
How we use Databricks to accelerate itself:
Key platform features
▪ Usage and billing reports
▪ Audit logs
Essential product analytics
▪ Feature usage, trends, prediction
▪ Growth and churn forecasts, models
Mission-critical devops
▪ Service KPIs and SLAs
▪ API and application structured logs
▪ Spark debug logs
Data foundation & analytics
Our distributed data pipelines
▪ 100s of TB of logs per day
▪ Millions of time series per second
▪ Time-series, raw logs, request tracing, dashboards
▪ Sources: Kinesis, Event Hubs
▪ Declarative data pipeline deployments
▪ Real-time streaming
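A time-series path like the one above can be sketched as a windowed streaming aggregation (a toy sketch with an invented metric name; real pipelines would use Spark Structured Streaming reading from Kinesis or Event Hubs):

```python
# Toy sketch of a streaming time-series aggregation: bucket metric
# events into 10-second windows and count events per window.
from collections import defaultdict

def window_counts(events, window_s=10):
    """events: iterable of (unix_timestamp, metric_name) pairs."""
    counts = defaultdict(int)
    for ts, metric in events:
        bucket = (ts // window_s) * window_s  # start of the window
        counts[(bucket, metric)] += 1
    return dict(counts)

events = [(3, "api.requests"), (7, "api.requests"), (12, "api.requests")]
print(window_counts(events))
# {(0, 'api.requests'): 2, (10, 'api.requests'): 1}
```

At the scale on the slide (millions of time series per second), the same grouping runs distributed and incrementally rather than over an in-memory list.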
Takeaways
The architecture: managing millions of VMs around the world in multiple clouds
Challenges & lessons:
▪ The factory that builds and evolves the data platform is more important than the data platform itself
▪ A cloud-agnostic platform that integrates with cloud standards and quirks is the key to multi-cloud
▪ Data & AI accelerate data platform features, product analytics, and devops
Join us!http://databricks.com/careers
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
Our Product
[Diagram: built around open source. Interactive data science, scheduled jobs, and a SQL frontend serve data scientists, data engineers, and business users; compute clusters run the Databricks Runtime over cloud storage, split between the customer's cloud account and the Databricks service]