How to Build Consistent and Scalable Workspaces for Data Science Teams
How to build consistent, scalable workspaces for data science teams
Elaine Lee
Data science is hard. Doing data science is even harder.
- Ensuring enough resources
- Managing dependencies
http://www.seriouseats.com/assets_c/2014/06/20140525-294370-best-deep-dish-pizza-art-of-pizza-primary-thumb-1500xauto-404176.jpg
https://s-media-cache-ak0.pinimg.com/736x/91/6b/f0/916bf0f23660fc7019353800668060af.jpg
Nail it down
- Identify system requirements for the base Docker image
- Stabilize dependencies for the data science work environment
- Increase test coverage
- Get the continuous integration (CI) platform on the same page
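One way to picture "stabilize dependencies" is a check that every dependency is pinned to an exact version and that the environment matches the pins. The sketch below is only an illustration: the slides do not say how pinning was done, and the helper names (`parse_pins`, `find_drift`) and the package versions are hypothetical.

```python
def parse_pins(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, version = line.partition("==")
        if not sep:
            # Anything without '==' is not pinned to an exact version.
            raise ValueError(f"unpinned dependency: {line!r}")
        pins[name.strip().lower()] = version.strip()
    return pins


def find_drift(pins, installed):
    """Return {name: (pinned, installed_or_None)} for every mismatch."""
    drift = {}
    for name, wanted in pins.items():
        have = installed.get(name)
        if have != wanted:
            drift[name] = (wanted, have)
    return drift


# Hypothetical pin file contents; the versions are placeholders.
PINNED = """
numpy==1.9.2
pandas==0.16.0  # example only
"""
```

A check like this could run inside the Docker image as part of the test suite, so that a drifting dependency fails the build instead of surfacing later in someone's analysis.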
Scale it up
- Create a pool of worker machines ready to accept jobs
- Set up an asynchronous task queue
- Provide a simple command line interface for data scientists
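The three pieces above (worker pool, asynchronous queue, command line interface) can be sketched in miniature with the Python standard library. This is a single-process stand-in, not the setup from the talk: the slides do not name the queue technology, threads take the place of worker machines, and the `submit` program name and its arguments are assumptions.

```python
import argparse
import queue
import threading


def worker(tasks, results):
    """Pull jobs until a None sentinel arrives; record one result per job."""
    while True:
        job = tasks.get()
        if job is None:
            tasks.task_done()
            break
        commit, command = job
        # A real worker would run `command` inside the container for `commit`.
        results.put((commit, tuple(command), "done"))
        tasks.task_done()


def run_pool(jobs, n_workers=2):
    """Start workers, enqueue jobs, wait for completion, return results."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for job in jobs:
        tasks.put(job)
    for _ in threads:
        tasks.put(None)  # one stop sentinel per worker
    tasks.join()
    for t in threads:
        t.join()
    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected


def parse_args(argv):
    """Hypothetical CLI: submit <commit> <command...>"""
    parser = argparse.ArgumentParser(prog="submit")
    parser.add_argument("commit", help="commit whose image the task runs in")
    parser.add_argument("command", nargs="+", help="command to run")
    return parser.parse_args(argv)
```

For example, `parse_args(["abc1234", "python", "train.py"])` yields a commit and a command that can be passed to `run_pool` as one job. In a multi-machine deployment the in-memory queue would be replaced by a networked one.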
Putting it all together
[Diagram]
Commit pushed to GitHub
- Pull changes
- Start Docker container
- Run test suite
- Report pass/fail
- Export image for commit
Request arrives in queue (workers)
- Get image for commit
- Start container from image
- Run task
- Report result
S3
- Images per commit: 123abc…
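The flow in the diagram can be sketched as the shell commands each side might issue. The slides do not show the actual commands, so everything specific here is an assumption: the bucket layout (`images/<commit>.tar`), the `workspace:<commit>` tag, the `/tmp` path, and the choice of `docker save` / `docker load` with `aws s3 cp` for moving images. The functions only build argument lists; nothing is executed.

```python
def image_uri(bucket, commit):
    """Hypothetical S3 location for the image exported at a commit."""
    return f"s3://{bucket}/images/{commit}.tar"


def ci_export_commands(bucket, commit):
    """After tests pass: export the tested image and copy it to S3."""
    local = f"/tmp/{commit}.tar"
    tag = f"workspace:{commit}"  # assumed tagging scheme
    return [
        ["docker", "save", "--output", local, tag],
        ["aws", "s3", "cp", local, image_uri(bucket, commit)],
    ]


def worker_commands(bucket, commit, task):
    """On a worker: fetch the image for the commit, load it, run the task."""
    local = f"/tmp/{commit}.tar"
    tag = f"workspace:{commit}"
    return [
        ["aws", "s3", "cp", image_uri(bucket, commit), local],
        ["docker", "load", "--input", local],
        ["docker", "run", "--rm", tag] + list(task),
    ]
```

Each list could be passed to `subprocess.run(cmd, check=True)`. Keying the image by commit is what would let a worker reproduce the environment that was tested.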
Benefits
Flexible to any composition of EC2 instances
- Extensible to EMR
Task environment guaranteed
- Isolated from other tasks
- Identical to conditions at time of development
One-time configuration
- EC2 AMI
Extensible command line interface
- R interface
- Cluster management
- Job monitoring
Use case: Quality assurance
CI testing
Other tests
- Data validation
- Model consistency
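As a rough illustration of the two "other tests", the sketch below checks rows against an expected schema and compares a candidate model's predictions to a stored baseline. The slides do not describe how these checks were implemented; the function names, the schema format, and the tolerance are all assumptions.

```python
import math


def validate_rows(rows, schema):
    """Return (row_index, column, problem) for each schema violation."""
    problems = []
    for i, row in enumerate(rows):
        for column, expected_type in schema.items():
            if column not in row or row[column] is None:
                problems.append((i, column, "missing"))
            elif not isinstance(row[column], expected_type):
                problems.append((i, column, "wrong type"))
    return problems


def predictions_consistent(baseline, candidate, tol=1e-6):
    """True if both runs give the same number of predictions within tol."""
    if len(baseline) != len(candidate):
        return False
    return all(math.isclose(b, c, abs_tol=tol)
               for b, c in zip(baseline, candidate))
```

Because tasks run in the image built for a given commit, a failure in either check is more likely to point at a data or code change than at a difference between machines.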
http://img.pandawhale.com/post-52368-thanks-obama-making-sandwich-m-whnc.jpeg
Use case: Parallelizable tasks
Data manipulation
- Feature engineering
Model builds
- Advanced machine learning algorithms
- Hyperparameter search
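Hyperparameter search parallelizes naturally because each parameter combination can be evaluated independently. A minimal sketch, assuming a plain grid search (the slides do not say which search strategy was used) and hypothetical parameter names:

```python
import itertools


def expand_grid(grid):
    """Expand {name: [values]} into one dict per parameter combination."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values))
            for values in itertools.product(*value_lists)]


def split_round_robin(items, n_chunks):
    """Deal items into n_chunks lists, one chunk per worker."""
    return [items[i::n_chunks] for i in range(n_chunks)]


# Hypothetical grid: 2 depths x 2 learning rates = 4 independent tasks.
GRID = {"depth": [2, 4], "rate": [0.1, 0.01]}
```

Each combination, or each chunk of combinations, could then be submitted to the task queue as its own job against the same commit's image.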
https://pbs.twimg.com/media/Buw8Bz6IIAAxgxg.png