Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith...
Transcript of Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith...
![Page 1: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/1.jpg)
Big Data Vilnius
Agile Data Architecture
![Page 2: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/2.jpg)
About me
![Page 3: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/3.jpg)
What is an agile data warehouse?
![Page 4: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/4.jpg)
Design “lock-in”
![Page 5: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/5.jpg)
Design “lock-in”
primary_key customer_name email1 Joe Smith [email protected] Kerry Jones [email protected] Mary Woods [email protected]
primary_key customer_name email start_date end_date is_active1 Joe Smith [email protected] 2017-01-05 2017-12-24 N4 Joe Smith [email protected] 2017-12-24 NULL Y2 Kerry Jones [email protected] 2017-06-05 2017-09-02 N5 Kerry Smith [email protected] 2017-09-02 NULL Y
Slowly Changing Dimension Type 2
Slowly Changing Dimension Type 1
![Page 6: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/6.jpg)
Complexity: doing too many things at once
INSERT INTO customer_dimSELECT source_cust_id, first_name, last_name, eff_date, end_date, current_flagFROM( MERGE customer_dim cm USING customer_source cs ON cm.source_cust_id = cs.source_cust_id WHEN NOT MATCHED THEN INSERT VALUES (cs.source_cust_id, cs.first_name, cs.last_name, convert(char(10), getdate()-1, 101), ’12/31/2199′, ‘y’) WHEN MATCHED AND cm.current_flag = ‘y’ and cm.last_name <> cs.last_name THEN UPDATE SET cm.current_flag = ‘n’, cm.end_date = convert(char(10), getdate()- 2, 101) OUTPUT $Action action_out, cs.source_cust_id, cs.first_name, cs.last_name, convert(char(10), getdate()-1, 101) eff_date, ’12/31/2199′ end_date, ‘y’ current_flag) AS merge_outWHERE merge_out.action_out = ‘UPDATE’;
![Page 7: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/7.jpg)
Data volumes
• Failure to meet SLA
![Page 8: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/8.jpg)
Data volumes
• Failure to meet SLA
• Long user wait times
![Page 9: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/9.jpg)
Data volumes
• Failure to meet SLA
• Long user wait times
• Reports generate high strain → slow
![Page 10: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/10.jpg)
Reproducibility
• Rerun parts of your data pipeline without thinking?
![Page 11: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/11.jpg)
Reproducibility
• Rerun parts of your data pipeline without thinking?• Can you regenerate your entire warehouse (in principle)?
![Page 12: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/12.jpg)
Reproducibility
• Rerun parts of your data pipeline without thinking?• Can you regenerate your entire warehouse (in principle)?• → Easy to solve bugs
![Page 13: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/13.jpg)
Limitations of ETL tooling
• Focused on a specific database
![Page 14: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/14.jpg)
Limitations of ETL tooling
• Focused on a specific database• Not scalable
![Page 15: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/15.jpg)
Limitations of ETL tooling
• Focused on a specific database• Not scalable• Difficult to synchronize with other scheduled pipelines
![Page 16: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/16.jpg)
Limitations of ETL tooling
• Focused on a specific database• Not scalable• Difficult to synchronize with other scheduled pipelines• Not built from a functional philosophy
![Page 17: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/17.jpg)
Limitations of ETL tooling
• Focused on a specific database• Not scalable• Difficult to synchronize with other scheduled pipelines• Not built from a functional philosophy• Not extendable as a platform
![Page 18: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/18.jpg)
Engineering is a methodological process of stages requiring:
● engineering skills
● knowing what to do when
● and what NOT to do when
![Page 19: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/19.jpg)
The concept
![Page 20: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/20.jpg)
Contextualization
![Page 21: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/21.jpg)
The underlayer
![Page 22: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/22.jpg)
![Page 23: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/23.jpg)
Finished underlayer
![Page 24: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/24.jpg)
Painting the final layer
![Page 25: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/25.jpg)
Reductive as well as additive
![Page 26: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/26.jpg)
Add color
![Page 27: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/27.jpg)
![Page 28: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/28.jpg)
Agile Data Architecture
• Choose good ETL tools with generalized, reusable components• Apply design decisions at the right stage, not too early• Build ‘functional’ data pipelines• Make pipelines reproducible → immutable partitions• Decouple the source schema from your warehouse schema• Don’t (immediately) resist more persistent copies of the data in
pipeline
![Page 29: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/29.jpg)
The overall data pipeline
![Page 30: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/30.jpg)
What is a (raw) data vault 2.0?
![Page 31: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/31.jpg)
Step 1: Break data into reloadable partitions
![Page 32: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/32.jpg)
Step 1: Break data into reloadable partitionsgs://datalake/datavault/psa/crm/customer/2018/11/26/data.avro partition_a (customer)
gs://datalake/datavault/psa/crm/customer/2018/11/27/data.avro partition_b (customer)
gs://datalake/datavault/psa/crm/customer/2018/11/28/data.avro partition_c (customer)
gs://datalake/datavault/psa/crm/product/2018/11/26/data.avro partition_a (product)
gs://datalake/datavault/psa/crm/product/2018/11/27/data.avro partition_b (product)
gs://datalake/datavault/psa/crm/product/2018/11/28/data.avro partition_c (product)
![Page 33: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/33.jpg)
The overall architecture
![Page 34: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/34.jpg)
Step 2: Load raw datavault
![Page 35: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/35.jpg)
Do this for all entities
![Page 36: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/36.jpg)
The overall architecture
![Page 37: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/37.jpg)
The business vault (data interpretation)
Customer deduplication:
Bridge & PIT tables:
Calculations, data cleaning, data quality, soft business rules
![Page 38: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/38.jpg)
The PIT table (data interpretation)
name_dtm
address_dtm
contact_dtm
compensation_dtm
sal_sum_dtm
?
![Page 39: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/39.jpg)
The Bridge table (data interpretation)
![Page 40: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/40.jpg)
The overall architecture
![Page 41: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/41.jpg)
The star schema (data interpretation)
![Page 42: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/42.jpg)
The overall architecture
![Page 43: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/43.jpg)
Avoid complication with “data lanes”Merge data from different sources as late as possible
HR: role, salary, contract Hours Staging of
sources
Monthly org. view and salaries
Calc hours per type
Work on the data
Cost per hour Hours Match on finest grain
cost per …. filtered by ...
Combine or virtualize
Planning
Calc. hours per day
Hours
Planning capacity
![Page 44: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/44.jpg)
A possible architecture
![Page 45: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/45.jpg)
Join at slido.com: #bigdata2018
![Page 46: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com](https://reader036.fdocuments.us/reader036/viewer/2022071002/5fbe918c843bc97b87510201/html5/thumbnails/46.jpg)
+31 (0)1 - 68479294
Coltbaan 4C, Nieuwegein
@bigdatarep
www.bigdatarepublic.nl
/company/bigdata-republic
DATA SCIENCE | BIG DATA ANALYTICS | BIG DATA ARCHITECTURES