Angel D’az Hierarchy of Needs - Airflow Summit 2020


Page 1:

Hierarchy of Needs

Angel D’az

Data Engineering

Page 2:

Self-Intro

Data Engineering Consultant

Tools
● Python, AWS, Airflow, Ansible

Business Problems
● Batch Processing Workflows ELT / ETL
● Ground-Up Data Infrastructures

Page 11:

But why another mental model for data?

Page 12:

Focus is on fundamentals

Reasoning > Principles > Tools

Page 13:

Automation

01.

Page 14:

Without Automation

Page 15:

Why Automation first?

Page 16:

What is a good baseline for Automation?

Page 17:

What is a good baseline for Automation?

Scripts
● Source control and schedule the scripts below
  ○ Script existing manual and predictable data wrangling
  ○ Move legacy click-and-drag workflows over to scripts
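As a rough illustration of the "script existing manual data wrangling" bullet, here is a minimal sketch that uses only the Python standard library. The column names (`customer`, `amount`) and the cleanup rules are assumptions made up for this example, not something from the talk.

```python
import csv
import io


def clean_rows(raw_csv: str) -> list[dict]:
    """Apply the same predictable cleanup a person might do by hand in a spreadsheet."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    cleaned = []
    for row in reader:
        # Trim stray whitespace from every header and value.
        row = {key.strip(): value.strip() for key, value in row.items()}
        # Drop rows that have no customer name (hypothetical rule).
        if not row["customer"]:
            continue
        row["customer"] = row["customer"].title()
        row["amount"] = float(row["amount"])
        cleaned.append(row)
    return cleaned


# Hypothetical messy export, standing in for a manually maintained file.
RAW = "customer, amount\n  alice smith ,10.50\n,3.00\nBOB JONES, 7\n"

rows = clean_rows(RAW)
for row in rows:
    print(row)
```

Once a step like this lives in a file, it can be committed to source control and run on a schedule (for example with cron) instead of being repeated by hand.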

Page 18:

What does robust Automation look like?

More layers of complexity

Page 19:

What does robust Automation look like?

More layers of complexity
● Infrastructure as Code (IaC)

Page 20:

What does robust Automation look like?

More layers of complexity
● Infrastructure as Code (IaC)
● Data Workflow Orchestration
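The core idea behind data workflow orchestration, running tasks in dependency order, can be sketched with the standard library alone. This is not Airflow's API; the task names and the `run_pipeline` helper are hypothetical, and a real orchestrator adds scheduling, retries, and monitoring on top.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

run_log = []


def make_task(name):
    """Return a stand-in task that only records that it ran."""
    def task():
        run_log.append(name)
    return task


# Hypothetical pipeline steps.
TASKS = {name: make_task(name) for name in ("extract", "load", "transform", "report")}

# Task name -> set of tasks that must finish before it starts.
DEPENDENCIES = {
    "extract": set(),
    "load": {"extract"},
    "transform": {"load"},
    "report": {"transform"},
}


def run_pipeline(tasks, dependencies):
    """Run every task once, never before the tasks it depends on."""
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()


run_pipeline(TASKS, DEPENDENCIES)
print(run_log)
```

Declaring dependencies as data rather than as call order is what lets an orchestrator decide what to run, rerun, or skip.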

Page 21:

Why Airflow? It’s Extensible

Engineering Talent
● Leverages Python language as the analytics standard

Technical
● Connections to any data source
● Lightweight backend works on any Linux/Unix Server
● Code as Abstraction Layer

Page 22:

Extract

02.

Page 23:

Extract (v.)

Page 24:

Extract (v.)

Without Extraction, our analysts have no ingredients to work with.

Without ingredients, any optimization is premature.

Page 25:

Extract (v.)

Either use a no-code Data Integration SaaS solution

Or

Fully automate your Data Source connections in code
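One way to picture "data source connections in code" is a small extract function. The sketch below uses an in-memory SQLite database as a stand-in source; the `orders` table, its columns, and both helper functions are assumptions for illustration only.

```python
import sqlite3


def make_demo_source() -> sqlite3.Connection:
    """Build a throwaway source database so the example is self-contained."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders (customer, amount) VALUES (?, ?)",
        [("alice", 10.5), ("bob", 7.0)],
    )
    conn.commit()
    return conn


def extract_table(conn: sqlite3.Connection, table: str) -> list[dict]:
    """Pull every row of one table as a list of dictionaries."""
    # Table names cannot be bound as SQL parameters, so validate before formatting.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    conn.row_factory = sqlite3.Row
    cursor = conn.execute(f"SELECT * FROM {table}")
    return [dict(row) for row in cursor]


source = make_demo_source()
extracted = extract_table(source, "orders")
print(extracted)
```

With a real source, only the connection setup would change; the extract step itself stays a plain function that can be versioned and scheduled.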

Page 26:

Load

03.

Page 27:

Load

Cheaper storage killed ETL.

And ELT took its place.

Page 28:

Load

Data Lakes
● Raw data will be in a rough state.
● Cloud Storage allows Analysts to query
  ○ Queries may be complex

Page 29:

Load

Data Lakes
● Raw data will be in a rough state.
● Cloud Storage allows Analysts to query
  ○ Queries may be complex
● Daily Snapshots (more info)

Page 30:

Load

Data Lakes
● Raw data will be in a rough state.
● Cloud Storage allows Analysts to query
  ○ Queries may be complex
● Daily Snapshots (more info)
● Optimize with Parquet files
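A daily snapshot layout can be sketched as one directory per table per day. The `snapshot_date=` naming, the `write_daily_snapshot` helper, and the sample row are assumptions for this example; a local temporary directory stands in for cloud storage, and CSV stands in for Parquet because writing Parquet needs a third-party library such as pyarrow.

```python
import csv
import tempfile
from datetime import date
from pathlib import Path


def write_daily_snapshot(lake_root: Path, table: str, rows: list[dict], day: date) -> Path:
    """Write the full table as it looked on `day` into its own partition folder."""
    partition = lake_root / table / f"snapshot_date={day.isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / "data.csv"
    with target.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return target


with tempfile.TemporaryDirectory() as tmp:
    path = write_daily_snapshot(
        Path(tmp), "orders", [{"id": 1, "amount": 10.5}], date(2020, 7, 1)
    )
    relative = path.relative_to(tmp).as_posix()
    lines = path.read_text().splitlines()

print(relative)
print(lines)
```

Because each day lands in its own folder, rerunning a day overwrites only that day's snapshot, and analysts can compare any two dates by path.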

Page 31:

Transform

04.

Page 32:

Transform

Page 33:

Transform

Data Work that can be kept in SQL only.

Why?

Page 34:

Why SQL only?

1. Maintainable Workflows

Page 35:

Why SQL only?

1. Maintainable Workflows
2. More Complexity
   a. Remove Data Silos

Page 36:

Why SQL only?

1. Maintainable Workflows
2. More Complexity
   a. Remove Data Silos
   b. Parameterize your SQL

Page 37:

Parameterize your SQL

SELECT {{ cols }}
FROM tbl
{{ where }}
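The `{{ }}` placeholders on this slide resemble Jinja templating, which Airflow uses for templated fields. The sketch below is a small regex stand-in for a template engine, not Jinja itself; `render`, `build_query`, the column whitelist, and the sample table are all hypothetical. Column names are checked against a whitelist and values go through bound parameters, since pasting raw input into SQL invites injection.

```python
import re
import sqlite3

TEMPLATE = "SELECT {{ cols }} FROM tbl {{ where }}"
ALLOWED_COLUMNS = {"id", "customer", "amount"}  # hypothetical whitelist


def render(template: str, params: dict) -> str:
    """Replace each {{ name }} placeholder with params[name]."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}", lambda match: params[match.group(1)], template
    ).strip()


def build_query(cols, min_amount=None):
    """Return (sql, bind_values) for the requested columns and optional filter."""
    unknown = set(cols) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    where, binds = "", ()
    if min_amount is not None:
        # The value itself is bound by the driver, never formatted into the SQL.
        where, binds = "WHERE amount >= ?", (min_amount,)
    return render(TEMPLATE, {"cols": ", ".join(cols), "where": where}), binds


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", [(1, "alice", 10.5), (2, "bob", 7.0)])

sql, binds = build_query(["customer", "amount"], min_amount=8)
result = conn.execute(sql, binds).fetchall()
print(sql)
print(result)
```

The same template then serves several reports by changing only the parameters, which keeps the SQL itself in one reviewed place.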

Page 38:

Why SQL only?

1. Maintainable Workflows
2. More Complexity
   a. Remove Data Silos
   b. Parameterize your SQL
   c. Data Quality Testing
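Data quality testing can itself stay in SQL: each check is a query that counts offending rows, and zero means the check passes. The checks, the `orders` table, and the deliberately bad sample rows below are assumptions chosen for illustration.

```python
import sqlite3

# Each check returns the number of rows that violate a rule; 0 means pass.
CHECKS = {
    "customer_not_null": "SELECT COUNT(*) FROM orders WHERE customer IS NULL",
    "id_unique": (
        "SELECT COUNT(*) FROM "
        "(SELECT id FROM orders GROUP BY id HAVING COUNT(*) > 1)"
    ),
    "amount_non_negative": "SELECT COUNT(*) FROM orders WHERE amount < 0",
}


def run_checks(conn: sqlite3.Connection) -> dict:
    """Run every check and return {check_name: violation_count}."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
# Sample data with one violation of each rule on purpose.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 10.5), (2, None, 7.0), (2, "carol", -1.0)],
)

failures = {name: count for name, count in run_checks(conn).items() if count}
print(failures)
```

A scheduled job could fail the pipeline run whenever `failures` is non-empty, so bad data stops before it reaches a report.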

Page 39:

Optimize Analysis

05.

Page 40:

Optimize Analysis

Page 41:

Optimize Analysis

Time Sensitive Reporting
● Spark

Page 42:

Optimize Analysis

Time Sensitive Reporting
● Spark

Custom Data Transformations
● Jupyter Notebooks

Page 43:

Optimize Analysis

Time Sensitive Reporting
● Spark

Custom Data Transformations
● Jupyter Notebooks

Large Scale Processes
● Reduce Computational Cost with Systems Engineering

Page 44:

Machine Learning

06.

Page 47:

Streaming

07.

Page 48:

Streaming

Streaming for Data Analysis, alone, is rare.

Page 49:

Conclusion

Big “Why?”s

Page 50:

Transparency and Reproducibility

Page 51:

Enabling Ethics

Page 52:

Thank you!

Say hi! Ask questions!

Writing: angelddaz.substack.com
Contact: [email protected]