Building a data warehouse using Spark SQL
![Page 1](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/1.jpg)

Building a data warehouse using Spark SQL
Budapest Data Forum 2018
Gabor Ratky, CTO
![Page 2](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/2.jpg)

About me
- Hands-on CTO at Secret Sauce to this day
- Software engineer at heart
- Made enough bad decisions to know that everything is a trade-off
- Code quality and maintainability above all
- Not writing code when I don't have to
- Not building distributed systems when I don't have to
- Not a data warehouse guy, but ❤ data

Simple is better than complex. Complex is better than complicated.
The Zen of Python, by Tim Peters
![Page 3](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/3.jpg)

About Secret Sauce
- SV startup in Budapest
- B2B2C apparel e-commerce company
- Data-driven products to help merchandising
- Services built on top of the data we collect
- Cloud-based infrastructure (AWS)
- Small, effective teams
- Strong engineering culture
  - Code quality
  - Code reviews
  - Testability

Everybody needs data to do their jobs
![Page 4](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/4.jpg)

Early days
[Diagram: Partner data, MongoDB, `$ mongoimport`]
![Page 5](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/5.jpg)

Event analytics
[Diagram: MongoDB, Redshift, S3, PostgreSQL, Partner data, Kafka]
![Page 6](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/6.jpg)

Data warehousing
[Diagram: MongoDB, Databricks, S3, PostgreSQL, Partner data, Kafka]
![Page 7](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/7.jpg)

Why Databricks and Spark?
- Storage and compute are separate
- Managed clusters operated by Databricks
- Fits into and runs as part of our existing infrastructure (AWS)
- Right tool for the job
  - Data engineers use pyspark
  - Data analysts use SQL
  - Data scientists use Python, R, SQL, H2O, Pandas, scikit-learn, dist-keras
- Shared metastore (databases and tables)
- Collaborative, interactive notebooks
- GitHub integration and flow
- Automated jobs and schedules
- Programmatic API

![Page 8](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/8.jpg)

Clusters
![Page 9](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/9.jpg)

Workspace
![Page 10](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/10.jpg)

Notebooks
![Page 11](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/11.jpg)

Jobs
![Page 12](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/12.jpg)

Analytics
![Page 13](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/13.jpg)

Analytics
![Page 14](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/14.jpg)

Build vs buy
BUY
![Page 15](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/15.jpg)

Cost
![Page 16](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/16.jpg)

Cost (Redshift)
- Persistent data warehouse
- 4x ds2.xlarge nodes (8TB, 16 vCPU, 124GB RAM)
- On-demand price: $0.85/hr/node
- 1 month ~ 732 hours

$2,488
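
The monthly Redshift figure follows directly from the inputs listed on the slide. A minimal sketch of that arithmetic, using only the slide's numbers (the variable names are illustrative, and the slide appears to drop the cents):

```python
# Redshift monthly cost, using only the figures quoted on the slide.
NODES = 4                    # 4x ds2.xlarge
PRICE_PER_NODE_HOUR = 0.85   # on-demand price, USD per node per hour
HOURS_PER_MONTH = 732        # persistent cluster, billed around the clock

redshift_monthly = NODES * PRICE_PER_NODE_HOUR * HOURS_PER_MONTH
print(f"${redshift_monthly:,.2f}")  # roughly $2,488.80
```

Because the warehouse is persistent, every hour of the month is billed whether or not anyone is querying it.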
![Page 17](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/17.jpg)

Cost (Databricks)
- Ephemeral, interactive, multi-tenant cluster
- 8TB storage (S3)
- i3.xlarge driver node (4 vCPU, 30.5GB RAM)
- 4x i3.xlarge worker nodes (16 vCPU, 122GB RAM)
- Compute: $0.712/hr
  - $0.312/hr on-demand price
  - 4x $0.1/hr spot price
- Databricks: $2/hr
  - $0.4/hr/node
- Storage: $188/mo + change
- 1 month ~ 22 workdays ~ 176 hours

$665
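
The Databricks total can be reproduced the same way. This is a sketch under the slide's own assumptions: one on-demand driver, four spot workers, a per-node Databricks fee, and a cluster that only runs during working hours. The 8-hour workday is inferred from 176 hours over 22 workdays, and the "+ change" on storage is ignored; the variable names are illustrative.

```python
# Databricks monthly cost, using only the figures quoted on the slide.
DRIVER_ON_DEMAND = 0.312     # i3.xlarge driver, USD per hour (on-demand)
WORKERS = 4                  # 4x i3.xlarge workers
WORKER_SPOT = 0.10           # USD per worker per hour (spot)
DATABRICKS_PER_NODE = 0.40   # Databricks fee, USD per node per hour
STORAGE_MONTHLY = 188        # S3 storage, USD per month ("+ change" ignored)
HOURS_PER_MONTH = 176        # 22 workdays x 8 hours (8 h/day is inferred)

nodes = 1 + WORKERS                                        # driver + workers
compute_hourly = DRIVER_ON_DEMAND + WORKERS * WORKER_SPOT  # 0.712 USD/hr
databricks_hourly = nodes * DATABRICKS_PER_NODE            # 2.0 USD/hr

databricks_monthly = (
    (compute_hourly + databricks_hourly) * HOURS_PER_MONTH + STORAGE_MONTHLY
)
print(f"${databricks_monthly:,.2f}")  # roughly $665.31
```

Compared with the $2,488 Redshift figure, the inputs that differ most are the billed hours (176 vs. 732) and the spot-priced workers.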
![Page 18](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/18.jpg)

Utilization (Redshift)
![Page 19](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/19.jpg)

Utilization (Redshift)
![Page 20](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/20.jpg)

Utilization (Databricks)
~34 DBU/day, ~4.5 DBU/hr
~11.5 DBU/day
![Page 21](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/21.jpg)

Our experience so far
- Started using Databricks in January 2018
- Quick adoption across the whole company
- Fast turnaround on data requests
- Easy collaboration between technical and non-technical people
- Databricks allows us to focus on data engineering, not data infrastructure
- GitHub integration is not perfect, but fits into our workflow
- Partitioning and schema evolution need a lot of attention
- Databricks is an implementation detail, pick your poison
- Everything is a trade-off, make the right ones

![Page 22](https://reader033.fdocuments.us/reader033/viewer/2022053022/604ee1915d501a05ad215876/html5/thumbnails/22.jpg)

NIHS*
\* not invented here syndrome