Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps...
Transcript of Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps...
![Page 1: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/1.jpg)
ML Ops and Kubeflow PipelinesSolutions and Best Practices for DevOps of Production ML ServicesKaz Sato, Developer Advocate, Google Cloud
![Page 2: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/2.jpg)
Kaz Sato
Developer Advocate
Data & Analytics
Google Cloud
@kazunori_279
LOGO OR ICON HERE
![Page 3: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/3.jpg)
1 What is "ML Ops"?DevOps for ML
![Page 4: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/4.jpg)
Launching is easy,Operating is hard.
"The real problems with a ML system will be found while you are continuously operating it for the long term"
CC0 images from pixabay.com
![Page 5: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/5.jpg)
What is DevOps?
“DevOps is a software engineering culture and practice that aims at unifying software development (Dev) and software operation (Ops).”
“(DevOps is to) strongly advocate automation and monitoring at all steps of software construction, from integration, testing, releasing to deployment and infrastructure management.”
- Wikipedia
![Page 6: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/6.jpg)
What is ML Ops?
ML Ops is a software engineering culture and practice that aims at unifying ML system development (Dev) and ML system operation (Ops).
(ML Ops is to) strongly advocate automation and monitoring at all steps of ML system construction, from integration, testing, releasing to deployment and infrastructure management.
![Page 7: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/7.jpg)
![Page 8: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/8.jpg)
![Page 9: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/9.jpg)
Agenda
Development anti-patterns
Deployment anti-patterns
Operation anti-patterns
![Page 10: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/10.jpg)
"Depending on a ML superhero"
A ML superhero is:
ML Researcher
Data engineer
Infra and Ops engineer
Product Manager
A partner to execs
From PoC to production
![Page 11: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/11.jpg)
Solution: split the roles, build a scalable team
Split the roles to:
ML Researcher
Data engineer
Infra and Ops engineer
Product Manager
Business decision maker
![Page 13: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/13.jpg)
Example: Candy Sorter demo at I/O and Next
The team:
Researcher: ML models
Software engineer: software integration
Hardware engineer: robot SDK control
Project manager: leading team
Business decision maker: me
![Page 14: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/14.jpg)
"A black box that nobody understands"
Scenario:
Researcher creates a prototype
Engineer refactors it, brings it to production
Researcher doesn't understand the code
Engineer doesn't understand the model
![Page 15: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/15.jpg)
ML Tech Debt paper says: “At Google, a hybrid research
approach where engineers and researchers are embedded together on the same teams has helped reduce this source of friction significantly”
![Page 16: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/16.jpg)
Example: Engineer intros scalable platform
Scenario:
Researcher creates scikit-learn model
Engineer ports it to
Cloud AI Platform
![Page 18: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/18.jpg)
Changing Anything Changes Everything
Entanglement of ML system
A change to one feature could affect all of the other features
A change of a hyper-param could affect the whole result (regularization, learning rates, sampling, thresholds, etc.)
"Launching is easy, operating is hard"
"I'm just changing one feature"
![Page 19: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/19.jpg)
Rules of ML paper says: “Rule #14: Starting with an
interpretable model makes debugging easier”
“Rule #40: Keep ensembles simple”
![Page 20: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/20.jpg)
Solution: use simple model and feature
Use complex model judiciously:
Linear v. Deep
Convex v. Non-convex
Interpretable v. black box
![Page 21: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/21.jpg)
Solution: use ensembled model
Rules for using ensembled model:
Use either one of:A model taking input features: parts factoryA model assembles those models: assembly plant
Use semantically interpretable modelfor better robustness and easier troubleshooting
![Page 23: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/23.jpg)
Ensemble model in Aucnet's used car classifier
![Page 24: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/24.jpg)
"Lack of data validation"
In IT system:
The behavior of the system is defined by code
Validating functionality of your system with unit tests
In ML system:
The behavior of the system is defined by data
Validating functionality of your system with what?
"Data is the code"
![Page 25: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/25.jpg)
"Training-serving skew"
Cause:
Any differences (data, preprocessing, window etc) between training and serving
Result:
Accuracy drops when serving
![Page 26: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/26.jpg)
Solution: TensorFlow Extended (TFX)
Data Ingestion
TensorFlow Data Validation
TensorFlow Transform
Estimator or Keras Model
TensorFlow Model Analysis
TensorFlow Serving Logging
Shared Utilities for Garbage Collection, Data Access Controls
Pipeline Storage
Shared Configuration Framework and Job Orchestration
Integrated Frontend for Job Management, Monitoring, Debugging, Data/Model/Evaluation Visualization
An end-to-end tool for deploying production ML system
tensorflow.org/tfx
![Page 27: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/27.jpg)
TensorFlow Data Validation (TFDV)
Helps developers understand, validate, and monitor their ML data at scale
Used analyze and validate petabytes of data at Google every day
Has a proven track record in maintaining the health of production ML pipelines
![Page 28: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/28.jpg)
TFX paper says: “We want the user to treat data errors with the same rigor and care that they deal with bugs in code.”Google Play app install rate improved 2% after introducing data validation, finding stale table
![Page 29: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/29.jpg)
TensorFlow Data Validation
Demo
![Page 31: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/31.jpg)
"Lack of continuous monitoring"
Scenario:
Model accuracy drops over time
No practice for continuous monitoring
End users are frustrated with the experience
Business team notices it
Director asks the researcher to update the model ASAP
Don't you know what's happening now?!
![Page 32: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/32.jpg)
"Not knowing the freshness requirements"
Different freshness for different applications:
News aggregation: 5 min
E-commerce item recommend: 1 day/week
NLP for CSAT measurement: 1 month
Voice recognition: years?
Object detection for event: every setup
![Page 33: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/33.jpg)
Rules of ML paper says: “Rule #8: Know the freshness
requirements of your system”
![Page 34: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/34.jpg)
TensorFlow Model Analysis (TFMA)
Compute and visualize evaluation metrics for ML models
Ensure to meet specific quality thresholds and behaves as expected for all relevant slices of data
Provide tools to create a deep understanding of model performance
![Page 35: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/35.jpg)
Measure the delta between models
The delta defines the refresh requirement
![Page 36: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/36.jpg)
Use "sliced" metrics for better model analysis
![Page 37: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/37.jpg)
TensorFlow Model Analysis
Demo
![Page 39: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/39.jpg)
![Page 40: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/40.jpg)
ML Fairness: Fairness Indicator
![Page 41: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/41.jpg)
"Lack of ML lifecycle management"
Scenario:
Researcher creates a Notebook
He/she does everything on it from PoC to production
Data prep, transform, train, validation, serving, and deploy. Got high accuracy on prod service. Yay!
… and forget about the project
![Page 42: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/42.jpg)
"Lack of ML lifecycle management"
One year later, somebody found the accuracy had been dropping slowly
The director asks the researcher to update the model ASAP
The researcher somehow finds the old Notebook on laptop. Tries to remember how to go through every process manually
And wonders, why am I doing the emergency plumbing?? Is this my job?
![Page 43: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/43.jpg)
Solution: ML lifecycle management
![Page 44: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/44.jpg)
Kubeflow Pipelines
Enable developers to build custom ML workflows by easily “stitching” and connecting various components like building blocks.
![Page 45: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/45.jpg)
What Constitutes a Kubeflow Pipeline
Containerized implementations of ML Tasks• Containers provide portability, repeatability
and encapsulation• A containerized task can invoke other services like
AI Platform Training and Prediction, Dataflow or Dataproc• Customers can add custom tasks
Specification of the sequence of steps• Specified via Python DSL
Input Parameters• A “Job” = Pipeline invoked w/ specific parameters
![Page 46: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/46.jpg)
Visual depiction ofpipeline topology
Label 1
Label 2
Label 3
Label 4
Data pre-processing
& validation
Feature engineering
Training models in parallel
Score on test set & pick best
model
Deploy for serving
![Page 47: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/47.jpg)
Easy comparisonand analysis of runs
Label 1
Label 2
Label 3
![Page 49: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/49.jpg)
5
Summary
![Page 50: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/50.jpg)
IMAGE HEREIMAGE HERE
Development anti-patterns
![Page 51: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/51.jpg)
Deployment anti-patterns
![Page 52: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/52.jpg)
Operation anti-patterns
![Page 53: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/53.jpg)
References:
[1] Machine Learning: the high interest credit card of Technical Debt,
D. Sculley et al.
[2] Rules of Machine Learning, Martin Zinkevich
[3] TFX: A TensorFlow based production-scale machine learning platform,
Denis Bayor et al.
[4] Introducing TensorFlow Model Analysis, Clemens Mewald
![Page 54: Kubeflow Pipelines ML Ops and - USENIX...Kubeflow Pipelines Solutions and Best Practices for DevOps of Production ML Services Kaz Sato, Developer Advocate, Google Cloud ... “DevOps](https://reader034.fdocuments.us/reader034/viewer/2022050406/5f83542b78338a6fe06e99f6/html5/thumbnails/54.jpg)
Thank you!