Building an open source data lake at scale in the cloud
Adrian Woodhead, Principal Engineer
Expedia Group Proprietary and Confidential
Agenda
Background
Data Lake foundation: data + metadata
High Availability and Disaster Recovery
Data federation
Event-based data processing
Data Lake journey
• “traditional” RDBMS Data Warehouse
• Introduced on-premises Hadoop + Hive cluster
• RDBMS SQL replaced by SQL from Hive
• Slow at busy times
• Painful upgrade path (software and hardware)
• Migration to “Cloud” as primary data lake
Cloud Data Lake Foundation
Cloud Data Lake High Availability
Cloud Data Lake Redundancy
Redundancy by replication
• Data and Metadata
• Co-ordinated
• Data consistency during replication
• No partial reads
• Completeness more important than latency
Circus Train – Hive dataset replicator
• https://github.com/HotelsDotCom/circus-train/
• Metadata only available after data
• Supports HDFS, S3, GCS etc.
• Standard “distcp” and optimised copiers
• Plugin architecture – Notifications, Copiers, Metadata transformations
• Selective data replication – custom filters, “Hive Diff”
• https://github.com/HotelsDotCom/shunting-yard
• Event-driven Circus Train
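The "metadata only available after data" ordering can be sketched as below. This is a minimal illustration under stated assumptions, not Circus Train's actual code: `copy_files`, `register_partition`, the in-memory stand-ins and the bucket path are all hypothetical.

```python
from typing import Callable, Dict, List


def replicate_partition(
    source_files: List[str],
    replica_location: str,
    copy_files: Callable[[List[str], str], int],
    register_partition: Callable[[str], None],
) -> bool:
    """Copy data first; publish metadata only if every file arrived."""
    copied = copy_files(source_files, replica_location)
    if copied != len(source_files):
        # Incomplete copy: leave the replica metastore untouched so that
        # readers never see a partially replicated partition.
        return False
    register_partition(replica_location)
    return True


# Hypothetical in-memory stand-ins, used only to exercise the ordering.
replica_store: Dict[str, List[str]] = {}
replica_catalog: List[str] = []


def fake_copy(files: List[str], location: str) -> int:
    replica_store[location] = list(files)
    return len(files)


def fake_register(location: str) -> None:
    replica_catalog.append(location)


ok = replicate_partition(
    ["part-0000", "part-0001"],
    "s3://replica-bucket/db/table/dt=2020-01-01",
    fake_copy,
    fake_register,
)
```

Because the metastore entry is written last, a failed or slow copy costs latency but never exposes a partial read, which matches the "completeness more important than latency" goal.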
Data Lake Silos
Data Lake Silo Solutions
• Move back to a single data lake
  • Scalability issues
  • Increased “blast radius”
• Replicate shared data sets between data lakes
  • Cost of maintaining replication jobs
  • Increased file storage costs
  • Increased network transfer costs
Federated Cloud Data Lake
• https://github.com/HotelsDotCom/waggle-dance/
• Waggle Dance – a Hive Thrift metastore proxy
• Configure it with “downstream” Hive metastores
• Configure S3 bucket access permissions
• Set “hive.metastore.uris” to Waggle Dance server
• Use as you would Hive metastore in any client app
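Pointing a client at Waggle Dance is then a one-property change. The sketch below renders a `hive-site.xml` fragment with `hive.metastore.uris` set; the host name and port are assumptions for illustration, so substitute whatever your Waggle Dance server actually listens on.

```python
import xml.etree.ElementTree as ET


def hive_site_fragment(metastore_uri: str) -> str:
    """Render a hive-site.xml document that sets hive.metastore.uris."""
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = "hive.metastore.uris"
    ET.SubElement(prop, "value").text = metastore_uri
    return ET.tostring(configuration, encoding="unicode")


# Hypothetical host and port for the Waggle Dance server.
xml_text = hive_site_fragment("thrift://waggle-dance.example.internal:48869")
print(xml_text)
```

Since Waggle Dance speaks the Hive Thrift metastore protocol, no other client-side change should be needed beyond this URI.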
Waggle Dance Overview
Multi-Region Federated Cloud Data Lake
[Diagram: data lakes in US_WEST_2 and US_EAST_1, with flows labelled “Federate” and “Replicate”]
Federated Cloud Data Lake Best Practices
• Expose read-only endpoints to “external” users
• Separate critical path infrastructure
• Federate data for access within a region
• Replicate data for access in a different region
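The last two practices amount to a simple rule of thumb, sketched below. The function names, dataset names and region strings are hypothetical; real decisions would also weigh cost, latency and criticality.

```python
from typing import Dict


def access_strategy(consumer_region: str, data_region: str) -> str:
    """Return "federate" for same-region access, otherwise "replicate"."""
    if consumer_region == data_region:
        return "federate"
    return "replicate"


def plan_access(consumer_region: str, dataset_regions: Dict[str, str]) -> Dict[str, str]:
    """Choose a strategy for every dataset a consumer needs."""
    return {
        name: access_strategy(consumer_region, region)
        for name, region in dataset_regions.items()
    }


# Hypothetical datasets and their home regions.
plan = plan_access("us-east-1", {"bookings": "us-east-1", "clicks": "us-west-2"})
print(plan)
```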
Federated Cloud Data Lake Alternative
• Presto – distributed SQL query engine for big data
• Federate Hive, MySQL, PostgreSQL and many others
• https://github.com/prestodb/presto
OR
• https://github.com/prestosql/presto
?
Apiary - Cloud Data Lake Components
• https://github.com/ExpediaGroup/apiary
• Various components for a federated cloud data lake
• Docker images for all services
• Terraform deployment scripts
• Ranger for authorization
• Various optional extensions
Apiary – Metadata Events
• https://github.com/ExpediaGroup/apiary-extensions/tree/master/apiary-metastore-events
• Events for tables/partitions CRUD operations
• Hive MetaStoreEventListener implementations
  • Kafka
  • AWS SNS
• Enable downstream data processing use cases
  • ETL, Governance, Lineage, etc.
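A downstream consumer of these events might look roughly like the sketch below. The JSON field names (`eventType`, `dbName`, `tableName`) and the handler are assumptions made for illustration; check the apiary-metastore-events documentation for the real message schema. Transport (Kafka or SNS) is left out so the dispatch logic stands alone.

```python
import json
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


def dispatch(message: str, handlers: Dict[str, List[Handler]]) -> int:
    """Parse one JSON event and call every handler registered for its type.

    Returns the number of handlers invoked.
    """
    event = json.loads(message)
    matched = handlers.get(event.get("eventType", ""), [])
    for handler in matched:
        handler(event)
    return len(matched)


# Hypothetical lineage use case: remember which tables received partitions.
lineage_log: List[str] = []


def record_lineage(event: dict) -> None:
    lineage_log.append(f'{event["dbName"]}.{event["tableName"]}')


handlers = {"ADD_PARTITION": [record_lineage]}
sample = json.dumps(
    {"eventType": "ADD_PARTITION", "dbName": "sales", "tableName": "bookings"}
)
invoked = dispatch(sample, handlers)
```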
Problem – rewriting data at scale
• Changes to existing data
• Read isolation for long running queries
• Always create new folders for updates
• Repoint Hive data locations
• How to expire “orphaned data”?
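The "new folder, then repoint" pattern can be sketched as follows. The helper names, bucket path and table name are hypothetical, and the DDL string is built by plain interpolation for illustration only; a real job would run it through its Hive client and validate its inputs.

```python
import uuid


def new_snapshot_location(table_root: str, partition_path: str) -> str:
    """Return a fresh, never-reused folder for this write of the partition."""
    return f"{table_root.rstrip('/')}/{partition_path}/{uuid.uuid4().hex}"


def repoint_statement(table: str, partition_spec: str, location: str) -> str:
    """Build the Hive DDL that repoints a partition at the new folder."""
    return f"ALTER TABLE {table} PARTITION ({partition_spec}) SET LOCATION '{location}'"


# Write the updated data to `location` first, then run `ddl` to switch readers over.
location = new_snapshot_location("s3://lake-bucket/sales/bookings", "dt=2020-01-01")
ddl = repoint_statement("sales.bookings", "dt='2020-01-01'", location)
print(ddl)
```

Queries that started before the repoint keep reading the old folder, which gives read isolation but leaves that folder orphaned afterwards.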
Beekeeper – orphaned data cleanup
• https://github.com/ExpediaGroup/beekeeper/
• Hive table parameter: beekeeper.remove.unreferenced.data=true
• Apiary event listener
• Detects data re-writes
• Schedules old data for deletion in future
• Periodically performs the data deletions
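The detect, schedule and delete cycle can be illustrated with a small in-memory model. This is not Beekeeper's implementation: the class, the three-day delay and the paths are assumptions chosen to show why deletion is deferred rather than immediate.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, List


@dataclass
class CleanupScheduler:
    delay: timedelta
    pending: List[tuple] = field(default_factory=list)

    def on_location_changed(self, old_location: str, now: datetime) -> None:
        """Record a re-written location for deletion after the delay."""
        self.pending.append((now + self.delay, old_location))

    def run_cleanup(self, now: datetime, delete: Callable[[str], None]) -> int:
        """Delete every location whose scheduled time has passed."""
        due = [item for item in self.pending if item[0] <= now]
        self.pending = [item for item in self.pending if item[0] > now]
        for _, location in due:
            delete(location)
        return len(due)


deleted: List[str] = []
scheduler = CleanupScheduler(delay=timedelta(days=3))  # assumed delay
start = datetime(2020, 1, 1)
scheduler.on_location_changed("s3://lake-bucket/sales/bookings/dt=2020-01-01/old", start)
too_early = scheduler.run_cleanup(start + timedelta(days=1), deleted.append)
on_time = scheduler.run_cleanup(start + timedelta(days=3), deleted.append)
```

The delay gives long-running queries that still reference the old location time to finish before the files disappear.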
Consistent CRUD alternatives
• http://hive.apache.org/ - Hive 3.1.x with ACID
• https://iceberg.incubator.apache.org/ - Iceberg
• https://delta.io/ - Delta Lake
• https://hudi.apache.org/ - Hudi
Don’t forget to test
• https://github.com/klarna/HiveRunner/ - Hive SQL unit tests
• https://github.com/HotelsDotCom/mutant-swarm/ - Code coverage for HiveRunner
• https://github.com/HotelsDotCom/beeju - Unit tests for Thrift Hive metastore service and HiveServer2
Where to next?
• Hybrid cloud
  • Best of both worlds but increased complexity
• Multi-cloud
  • Best of breed but increased complexity
• Docker + Kubernetes
  • Reduce vendor lock-in
  • Massive scale without too much effort
  • Minimal changes for on-prem/EKS/GKE/AKS etc.
Open Source Data Lake Components
Hive Federation
https://github.com/HotelsDotCom/waggle-dance
Hive Replication
https://github.com/HotelsDotCom/circus-train
https://github.com/ExpediaGroup/shunting-yard
Cloud Data Lake
https://github.com/ExpediaGroup/apiary
Hive Cleanup
https://github.com/ExpediaGroup/beekeeper