Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith...

46
Big Data Vilnius Agile Data Architecture

Transcript of Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith...

Page 1: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Big Data Vilnius

Agile Data Architecture

Page 2: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

About me

Page 3: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

What is an agile data warehouse?

Page 4: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Design “lock-in”

Page 5: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Design “lock-in”

primary_key customer_name email1 Joe Smith [email protected] Kerry Jones [email protected] Mary Woods [email protected]

primary_key customer_name email start_date end_date is_active1 Joe Smith [email protected] 2017-01-05 2017-12-24 N4 Joe Smith [email protected] 2017-12-24 NULL Y2 Kerry Jones [email protected] 2017-06-05 2017-09-02 N5 Kerry Smith [email protected] 2017-09-02 NULL Y

Slowly Changing Dimension Type 2

Slowly Changing Dimension Type 1

Page 6: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Complexity: doing too many things at once

INSERT INTO customer_dimSELECT source_cust_id, first_name, last_name, eff_date, end_date, current_flagFROM( MERGE customer_dim cm USING customer_source cs ON cm.source_cust_id = cs.source_cust_id WHEN NOT MATCHED THEN INSERT VALUES (cs.source_cust_id, cs.first_name, cs.last_name, convert(char(10), getdate()-1, 101), ’12/31/2199′, ‘y’) WHEN MATCHED AND cm.current_flag = ‘y’ and cm.last_name <> cs.last_name THEN UPDATE SET cm.current_flag = ‘n’, cm.end_date = convert(char(10), getdate()- 2, 101) OUTPUT $Action action_out, cs.source_cust_id, cs.first_name, cs.last_name, convert(char(10), getdate()-1, 101) eff_date, ’12/31/2199′ end_date, ‘y’ current_flag) AS merge_outWHERE merge_out.action_out = ‘UPDATE’;

Page 7: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Data volumes

• Failure to meet SLA

Page 8: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Data volumes

• Failure to meet SLA

• Long user wait times

Page 9: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Data volumes

• Failure to meet SLA

• Long user wait times

• Reports generate high strain → slow

Page 10: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Reproducibility

• Rerun parts of your data pipeline without thinking?

Page 11: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Reproducibility

• Rerun parts of your data pipeline without thinking?• Can you regenerate your entire warehouse (in principle)?

Page 12: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Reproducibility

• Rerun parts of your data pipeline without thinking?• Can you regenerate your entire warehouse (in principle)?• → Easy to solve bugs

Page 13: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Limitations of ETL tooling

• Focused on a specific database

Page 14: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Limitations of ETL tooling

• Focused on a specific database• Not scalable

Page 15: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Limitations of ETL tooling

• Focused on a specific database• Not scalable• Difficult to synchronize with other scheduled pipelines

Page 16: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Limitations of ETL tooling

• Focused on a specific database• Not scalable• Difficult to synchronize with other scheduled pipelines• Not built from a functional philosophy

Page 17: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Limitations of ETL tooling

• Focused on a specific database• Not scalable• Difficult to synchronize with other scheduled pipelines• Not built from a functional philosophy• Not extendable as a platform

Page 18: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Engineering is a methodological process of stages requiring:

● engineering skills

● knowing what to do when

● and what NOT to do when

Page 19: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

The concept

Page 20: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Contextualization

Page 21: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

The underlayer

Page 22: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com
Page 23: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Finished underlayer

Page 24: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Painting the final layer

Page 25: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Reductive as well as additive

Page 26: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Add color

Page 27: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com
Page 28: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Agile Data Architecture

• Choose good ETL tools with generalized, reusable components• Apply design decisions at the right stage, not too early• Build ‘functional’ data pipelines• Make pipelines reproducible → immutable partitions• Decouple the source schema from your warehouse schema• Don’t (immediately) resist more persistent copies of the data in

pipeline

Page 29: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

The overall data pipeline

Page 30: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

What is a (raw) data vault 2.0?

Page 31: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Step 1: Break data into reloadable partitions

Page 32: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Step 1: Break data into reloadable partitionsgs://datalake/datavault/psa/crm/customer/2018/11/26/data.avro partition_a (customer)

gs://datalake/datavault/psa/crm/customer/2018/11/27/data.avro partition_b (customer)

gs://datalake/datavault/psa/crm/customer/2018/11/28/data.avro partition_c (customer)

gs://datalake/datavault/psa/crm/product/2018/11/26/data.avro partition_a (product)

gs://datalake/datavault/psa/crm/product/2018/11/27/data.avro partition_b (product)

gs://datalake/datavault/psa/crm/product/2018/11/28/data.avro partition_c (product)

Page 33: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

The overall architecture

Page 34: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Step 2: Load raw datavault

Page 35: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Do this for all entities

Page 36: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

The overall architecture

Page 37: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

The business vault (data interpretation)

Customer deduplication:

Bridge & PIT tables:

Calculations, data cleaning, data quality, soft business rules

Page 38: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

The PIT table (data interpretation)

name_dtm

address_dtm

contact_dtm

compensation_dtm

sal_sum_dtm

?

Page 39: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

The Bridge table (data interpretation)

Page 40: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

The overall architecture

Page 41: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

The star schema (data interpretation)

Page 42: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

The overall architecture

Page 43: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Avoid complication with “data lanes”Merge data from different sources as late as possible

HR: role, salary, contract Hours Staging of

sources

Monthly org. view and salaries

Calc hours per type

Work on the data

Cost per hour Hours Match on finest grain

cost per …. filtered by ...

Combine or virtualize

Planning

Calc. hours per day

Hours

Planning capacity

Page 44: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

A possible architecture

Page 45: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

Join at slido.com: #bigdata2018

Page 46: Big Data Vilnius · Design “lock-in” primary_key customer_name email 1 Joe Smith joe.smith@aol.com 2 Kerry Jones kerry.jones@gmail.com 3 Mary Woods mary.woods@hotmail.com

+31 (0)1 - 68479294

Coltbaan 4C, Nieuwegein

@bigdatarep

www.bigdatarepublic.nl

/company/bigdata-republic

[email protected]

DATA SCIENCE | BIG DATA ANALYTICS | BIG DATA ARCHITECTURES