Building Robust Pipelines with Airflow
![Page 1: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/1.jpg)
@erinshellman Wrangle Conf July 20th, 2017
Building Robust Pipelines with Airflow
![Page 2: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/2.jpg)
Zymology is the science of fermentation, and it is applied to make materials and molecules:
Beer
Insulin
Food additives
Plastics
![Page 3: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/3.jpg)
![Page 4: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/4.jpg)
Zymergen provides a platform for rapid improvement of microbial strains through genetic engineering.
![Page 5: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/5.jpg)
Robotic automation
Our experimentation is increasingly orchestrated with robotics and machine learning.
![Page 6: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/6.jpg)
Learning how to efficiently navigate the genome is the mission of data science at Zymergen.
![Page 7: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/7.jpg)
Blocker: process failure
Orchestrating complex experiments with robots is hard, and there are process failures. These failures often cause sporadic, extreme measurement values.
![Page 8: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/8.jpg)
Blocker: batch effects
We see temporal effects in the measurements that depend on when the experiments were performed.
![Page 9: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/9.jpg)
Blocker: different interpretations of results
We’re building a platform that can support any microbe and any molecule.
Sometimes that results in a proliferation of solutions with disagreement on which is best.
![Page 10: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/10.jpg)
Processing pipeline
1. Identify process failures
2. Quantify and remove process-related bias
3. Identify strains that show improvement using consistent criteria
Clean model inputs
Outlier detection
Normalization
Hit detection
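The three stages above can be sketched in plain Python. This is only an illustration under stated assumptions: the slides do not say which statistics Zymergen uses, so the modified z-score rule, the per-batch median centering, the `min_improvement` threshold, and the example records are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def group_by(records, key):
    """Group a list of record dicts by the value stored under `key`."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    return groups


def remove_outliers(records, threshold=3.5):
    """Step 1 (assumed rule): drop values with a large modified z-score within their batch."""
    kept = []
    for batch_records in group_by(records, "batch").values():
        values = [r["value"] for r in batch_records]
        center = median(values)
        mad = median([abs(v - center) for v in values])
        for r in batch_records:
            score = 0.6745 * abs(r["value"] - center) / mad if mad else 0.0
            if score <= threshold:
                kept.append(r)
    return kept


def normalize(records):
    """Step 2 (assumed rule): subtract each batch's median to remove batch-level bias."""
    out = []
    for batch_records in group_by(records, "batch").values():
        center = median([r["value"] for r in batch_records])
        out.extend({**r, "value": r["value"] - center} for r in batch_records)
    return out


def detect_hits(records, min_improvement=1.0):
    """Step 3 (assumed rule): a strain is a hit if its mean normalized value beats the threshold."""
    by_strain = group_by(records, "strain")
    return sorted(
        strain
        for strain, recs in by_strain.items()
        if mean([r["value"] for r in recs]) > min_improvement
    )


# Hypothetical measurements: batch b2 reads about 10 units higher than b1,
# and strain C in b1 is a process failure.
raw = [
    {"strain": "control", "batch": "b1", "value": 10.0},
    {"strain": "A", "batch": "b1", "value": 10.4},
    {"strain": "B", "batch": "b1", "value": 11.8},
    {"strain": "C", "batch": "b1", "value": 95.0},
    {"strain": "D", "batch": "b1", "value": 9.6},
    {"strain": "control", "batch": "b2", "value": 20.0},
    {"strain": "A", "batch": "b2", "value": 20.4},
    {"strain": "B", "batch": "b2", "value": 21.8},
    {"strain": "C", "batch": "b2", "value": 19.8},
    {"strain": "D", "batch": "b2", "value": 19.6},
]

clean = remove_outliers(raw)
hits = detect_hits(normalize(clean))
print(hits)
```

Running the stages in this order matters: the extreme reading is removed before the batch median is computed, so it cannot distort the normalization that hit detection relies on.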
![Page 11: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/11.jpg)
Rolling our own ETL pipeline
There are many ways to measure the concentration of a molecule.
Any microbe, any molecule… any experiment, many data formats.
![Page 12: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/12.jpg)
Rolling our own ETL pipeline
Describing complex processing dependencies is hard.
![Page 13: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/13.jpg)
Airflow
https://airflow.incubator.apache.org/
“Airflow is a platform to programmatically author, schedule and monitor workflows.”
Airflow gives us the flexibility to apply a common set of processing steps to variable data inputs and to schedule complex processing workflows, and it has become a delivery mechanism for our products.
![Page 14: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/14.jpg)
Structure and Flexibility
![Page 15: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/15.jpg)
e.g. Normalization
Airflow workflows are described as directed acyclic graphs (DAGs).
Each task node in the DAG is an operator.
![Page 16: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/16.jpg)
The anatomy of a DAG
Custom operators
Ordering
Instantiate DAG
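The code on this slide did not survive in the transcript. As a rough stand-in, the three labelled parts can be shown with a toy model in plain Python (3.9+ for `graphlib`). It is not Airflow's API: `Operator`, `DAG`, `LoadOperator`, and `NormalizeOperator` here are hypothetical classes written for this sketch. They borrow two conventions from Airflow: custom operators implement `execute(context)`, and `>>` declares ordering.

```python
from graphlib import TopologicalSorter


class DAG:
    """Toy DAG: a named collection of tasks (not airflow.DAG)."""

    def __init__(self, dag_id):
        self.dag_id = dag_id
        self.tasks = {}

    def run(self):
        """Execute every task after all of its upstream tasks."""
        graph = {tid: task.upstream for tid, task in self.tasks.items()}
        order = list(TopologicalSorter(graph).static_order())
        context = {}
        for tid in order:
            self.tasks[tid].execute(context)
        return order, context


class Operator:
    """Toy base operator: registers itself on the DAG and tracks upstream ids."""

    def __init__(self, task_id, dag):
        self.task_id = task_id
        self.upstream = set()
        dag.tasks[task_id] = self

    def __rshift__(self, other):
        # `a >> b` means: run a before b.
        other.upstream.add(self.task_id)
        return other

    def execute(self, context):
        raise NotImplementedError


# Custom operators
class LoadOperator(Operator):
    def execute(self, context):
        context["values"] = [2.0, 4.0, 6.0]  # hypothetical measurements


class NormalizeOperator(Operator):
    def execute(self, context):
        values = context["values"]
        center = sum(values) / len(values)
        context["normalized"] = [v - center for v in values]


# Instantiate DAG
dag = DAG("normalization")
load = LoadOperator("load", dag)
normalize = NormalizeOperator("normalize", dag)

# Ordering
load >> normalize

order, context = dag.run()
print(order, context["normalized"])
```

In a real Airflow deployment the scheduler, not a `run()` method, decides when each task executes, and tasks usually exchange data through external storage rather than a shared dictionary.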
![Page 17: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/17.jpg)
Modularity and flexibility
![Page 18: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/18.jpg)
Airflow + PyStan
With Bayesian hierarchical models we estimate (and monitor) the distribution of batch effects.
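The model itself is not shown in the transcript. To give a feel for what a hierarchical model does with batch effects, here is the simplest case worked by hand instead of with PyStan: a normal-normal model in which each batch effect has prior variance `tau2`, the measurement noise variance `sigma2` is known, and the overall mean is approximated by the grand mean. All of these are assumptions made for the sketch, and the function name `batch_effects` is hypothetical.

```python
from statistics import mean


def batch_effects(measurements, sigma2, tau2):
    """Posterior-mean batch effects under a normal-normal model with known variances.

    measurements: dict mapping batch name -> list of measured values.
    sigma2: assumed within-batch noise variance.
    tau2: assumed prior variance of the batch effects.
    """
    all_values = [v for vals in measurements.values() for v in vals]
    grand_mean = mean(all_values)
    effects = {}
    for batch, vals in measurements.items():
        n = len(vals)
        # Shrinkage factor: small or noisy batches are pulled harder toward zero.
        shrink = n * tau2 / (n * tau2 + sigma2)
        effects[batch] = shrink * (mean(vals) - grand_mean)
    return effects


effects = batch_effects({"a": [1.0, 3.0], "b": [5.0, 7.0]}, sigma2=2.0, tau2=1.0)
print(effects)
```

A full Stan model would estimate `sigma2` and `tau2` from the data and return whole posterior distributions, which is what makes monitoring the distribution of batch effects possible.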
Experimental bias
![Page 19: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/19.jpg)
DropBox
• Scientists at Zymergen work with data using many different tools including JMP, SQL, and Excel.
• We use a custom DropBox hook to make quick data ingestion pipelines.
![Page 20: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/20.jpg)
Alerting / Communication
![Page 21: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/21.jpg)
3rd-party hooks & operators
![Page 22: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/22.jpg)
Operator
![Page 23: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/23.jpg)
Pairs well with Superset!
“Apache Superset is a modern, enterprise-ready business intelligence web application”
https://github.com/apache/incubator-superset
![Page 24: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/24.jpg)
Constructing machine learning workflows
![Page 25: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/25.jpg)
Fairflow: Functional Airflow
• The core of Fairflow is an abstract base class foperator that takes care of instantiating your Airflow operators and setting their dependencies.
• In Fairflow, DAGs are constructed from foperators that create the upstream operators when the final foperator is called.
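The idea in these two bullets can be illustrated with a toy version in plain Python. This is not Fairflow's actual code or API: `FOperator`, `Task`, `ToyDAG`, and `Step` are hypothetical names, and `Task` only stands in for a real Airflow operator.

```python
from abc import ABC, abstractmethod


class Task:
    """Stand-in for an instantiated Airflow operator."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream_ids = []


class ToyDAG:
    """Stand-in for a DAG: just a registry of created tasks."""

    def __init__(self):
        self.tasks = {}


class FOperator(ABC):
    """Describes a task and its upstream FOperators; creates nothing until called."""

    def __init__(self, task_id, *upstream):
        self.task_id = task_id
        self.upstream = upstream

    @abstractmethod
    def make_task(self):
        """Return the concrete task for this step."""

    def __call__(self, dag):
        # Reuse the task if another branch already created it.
        if self.task_id in dag.tasks:
            return dag.tasks[self.task_id]
        # Calling this foperator first creates everything upstream of it.
        parents = [up(dag) for up in self.upstream]
        task = self.make_task()
        task.upstream_ids = [p.task_id for p in parents]
        dag.tasks[self.task_id] = task
        return task


class Step(FOperator):
    def make_task(self):
        return Task(self.task_id)


load = Step("load")
fit_a = Step("fit_a", load)
fit_b = Step("fit_b", load)
compare = Step("compare", fit_a, fit_b)

dag = ToyDAG()
compare(dag)  # only the final foperator is called
print(list(dag.tasks))
```

Because the graph is described by composing objects and is only materialized at the final call, the same workflow description can be reused on different DAGs.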
![Page 26: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/26.jpg)
Configuring complex ML workflows… functionally
![Page 27: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/27.jpg)
Defining ML workflows
In the DAG definition, create an instance of the task.
Then, instantiate a DAG like usual and call the compare task on the DAG.
![Page 28: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/28.jpg)
Defining ML workflows
The design allows for simple creation of complicated experimental workflows with arbitrary sets of models, parameters, and evaluation metrics.
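As a sketch of how arbitrary sets of models, parameters, and metrics might expand into a task graph, the function below generates task ids and dependency edges from a small configuration. The function `build_experiment`, the task-naming scheme, and the example models and metrics are hypothetical and are not taken from the talk.

```python
def build_experiment(models, metrics):
    """Expand a configuration into tasks and dependency edges.

    models: dict mapping model name -> list of parameter dicts.
    metrics: list of metric names evaluated for every fitted model.
    Returns (tasks, edges), where tasks maps task id -> parameters and
    edges is a list of (upstream_id, downstream_id) pairs.
    """
    tasks = {"load_data": {}, "compare": {}}
    edges = []
    for name, grid in models.items():
        for i, params in enumerate(grid):
            fit_id = f"fit_{name}_{i}"
            tasks[fit_id] = params
            edges.append(("load_data", fit_id))
            for metric in metrics:
                eval_id = f"{metric}_{name}_{i}"
                tasks[eval_id] = {"metric": metric}
                edges.append((fit_id, eval_id))
                edges.append((eval_id, "compare"))
    return tasks, edges


tasks, edges = build_experiment(
    models={
        "ridge": [{"alpha": 0.1}, {"alpha": 1.0}],
        "forest": [{"n_trees": 100}],
    },
    metrics=["rmse", "r2"],
)
print(len(tasks), len(edges))
```

Adding a model, a parameter setting, or a metric changes only the configuration; the shape of the graph follows from it.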
![Page 29: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/29.jpg)
Is Airflow for you?
Do you have heterogeneous data sources?
Do you have complex dependencies between processing tasks?
Do you have data with different velocities?
Do you have constraints on your time?
Probably!
![Page 30: Building Robust Pipelines with Airflow](https://reader034.fdocuments.us/reader034/viewer/2022051318/5a64d8197f8b9a4d0b8b4ae1/html5/thumbnails/30.jpg)
Thanks team!