Analyzing data with docker v4
Analyzing Data With Docker
Andreas Dewes (@japh44)
EuroPython 2016 - Bilbao
Outline
Data Analysis: Small & Large-Scale, Easy & Difficult
Introduction To Docker
Containerizing our Data Analysis
Possible Approaches
Relevant Technologies & Outlook
Data Analysis: Use Cases
Two axes: small-scale vs. large-scale, interactive vs. automated
Small-scale, interactive: UI-based analysis (e.g. IPython notebook)
Small-scale, automated: analysis scripts using local data sources (e.g. databases)
Large-scale, automated: non-interactive analysis pipelines (e.g. Apache Hadoop)
Large-scale, interactive: interactive "Big Data" tools, e.g. Apache Spark or Google BigQuery
So what's so difficult about data analysis?
Sharing Data & Tools
Reproducibility
Scaling
Enter Docker....
What is Docker?
A tool that allows us to deploy applications inside "software containers".
Containers work at the process level and isolate the view of the operating system (i.e. the processes, resources and files an application sees).
Provides a high-level API to manage, version-control, deploy and network containers.
Docker Swarm
Docker Core Concepts
[Diagram: CLI, Docker API, Docker Engine, Registry, Images, Containers]
Images Are Space-Efficient (or at least more efficient than VMs)
Containers Have Little Overhead
https://domino.research.ibm.com/library/cyberdig.nsf/papers/0929052195DD819C85257D2300681E7B/$File/rc25482.pdf
Containers Are Self-Sufficient
Containers Are "Lego" For Data Analytics!
[Diagram: a container with inputs, configuration, data, networked containers and output]
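The building-block picture above can be sketched in Python as a small helper that assembles a `docker run` command line from inputs, configuration and an output location. This is only an illustration: the image name, paths and environment variable are hypothetical, and the helper builds the argument list without starting a container. The `--rm`, `-v` and `-e` flags are standard Docker CLI options.

```python
def build_run_command(image, inputs=None, output=None, config=None, command=()):
    """Assemble a `docker run` argument list for one analysis step.

    inputs: dict mapping host paths to container paths (mounted read-only)
    output: optional (host_path, container_path) pair (mounted read-write)
    config: dict of environment variables passed to the container
    """
    args = ["docker", "run", "--rm"]
    for host_path, container_path in sorted((inputs or {}).items()):
        args += ["-v", f"{host_path}:{container_path}:ro"]
    if output is not None:
        host_path, container_path = output
        args += ["-v", f"{host_path}:{container_path}"]
    for key, value in sorted((config or {}).items()):
        args += ["-e", f"{key}={value}"]
    args.append(image)
    args.extend(command)
    return args


# Hypothetical image name and paths, for illustration only.
cmd = build_run_command(
    "example/analysis:1.0",
    inputs={"/data/logs": "/in"},
    output=("/data/results", "/out"),
    config={"CHUNK_SIZE": "1000"},
    command=["python", "analyze.py"],
)
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run` on a machine where Docker is installed.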
We Can Build Reproducible Data-Analysis Workflows With Them
[Diagram: workflow built from containers: Map Apache logs, Map Nginx logs, Aggregate results, Filtering, BI, Monitoring, Archiving]
Example: Analyzing Github Data
[Diagram: log files from Github, analysis script, analysis process(es), output]
Repository with code: https://github.com/adewes/docker-map-reduce-example
Live Demo (fingers crossed)
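A minimal sketch of what such a map-reduce style analysis script might look like. It assumes the log files are JSON-lines event dumps in which each event carries a "type" field; that format, and the function names, are assumptions for illustration and are not taken from the linked repository.

```python
import json
from collections import Counter
from functools import reduce


def map_events(lines):
    """Map step: count event types in one chunk of JSON-lines log data."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        counts[event.get("type", "unknown")] += 1
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counters into one total."""
    return reduce(lambda a, b: a + b, partials, Counter())


# Two tiny in-memory chunks standing in for downloaded log files.
chunk_a = ['{"type": "PushEvent"}', '{"type": "WatchEvent"}']
chunk_b = ['{"type": "PushEvent"}', ""]
totals = reduce_counts(map_events(chunk) for chunk in (chunk_a, chunk_b))
print(totals.most_common())
```

Because each map call only needs its own chunk, the map step can run in separate processes, and the partial counters are merged afterwards.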
Containerizing Our Analysis
[Diagram: log files from Github, analysis script, image, supervisor, analysis containers, output]
Live demo (what could go wrong?)
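One way a supervisor could spread log files over several analysis containers is sketched below. This is an assumed design, not the demo's actual code: the image name, file names and the shared `/data` mount are hypothetical, and the functions only plan the `docker run` invocations instead of executing them.

```python
def partition(items, n_workers):
    """Split items round-robin into n_workers lists (some may be empty)."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    shards = [[] for _ in range(n_workers)]
    for index, item in enumerate(items):
        shards[index % n_workers].append(item)
    return shards


def plan_containers(image, files, n_workers):
    """Return one `docker run` argument list per non-empty shard."""
    plans = []
    for shard in partition(files, n_workers):
        if shard:
            # /data is a hypothetical mount point shared by all workers.
            plans.append(
                ["docker", "run", "--rm", "-v", "/data:/data", image] + shard
            )
    return plans


# Hypothetical file names, for illustration only.
files = [f"/data/2016-07-01-{hour}.json.gz" for hour in range(5)]
for plan in plan_containers("example/analysis:1.0", files, n_workers=2):
    print(" ".join(plan))
```

Each container receives its own list of files as arguments, so every analysis step stays self-sufficient and the supervisor only has to collect the outputs.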
Advantages
Easy to share
Each analysis step is self-sufficient
Analysis components are "plug & play"
Easy to parallelize (for the right problems)
Versioning included
Disadvantages
Containers have to be prepared
Requires Docker on each machine
Slightly decreases interactivity & flexibility
Which Parts Are Missing?
Orchestration
Dependency Management
Resource Management
Rouster: A Python Tool for Containerized Data Analysis
Built on top of the Docker API
"Make for Docker"
Resource Management
Container Orchestration
Dependency Management
Rouster Uses Recipes to Describe Data Analysis Workflows
Resources (including dependencies): versioning, dependency calculation, backup / copying, distribution, ...
Services: startup (including dependencies), resource provisioning, networking, ...
Actions: scheduling, monitoring, exception handling, logging, ...
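The "Make"-style dependency handling mentioned above can be illustrated with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The recipe dictionary and its entry names are hypothetical and do not reflect Rouster's actual recipe format.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical recipe: each entry lists the entries it depends on.
recipe = {
    "csv_file": [],
    "postgres": [],
    "import_csv": ["csv_file", "postgres"],
    "report": ["import_csv"],
}


def execution_order(dependencies):
    """Return the entries ordered so that dependencies come first."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(recipe)
print(order)
```

If the recipe contained a circular dependency, `static_order()` would raise `graphlib.CycleError` instead of returning an order.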
Live Demo: CSV -> Postgres
Open Questions
How to handle communication between containers (through files, network, ...)?
How to provide resources/data to containers in a distributed environment?
Other relevant technologies
Pachyderm
Pachyderm is a data lake that offers complete version control for data and leverages the container ecosystem to provide reproducible data processing. Built on top of Kubernetes.
http://www.pachyderm.io
Luigi
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, etc. It also comes with Hadoop support built in.
https://github.com/spotify/luigi
Summary & Outlook
Containers are here to stay!
They are useful in various data analysis contexts.
They don't solve all our problems though.
We need additional tools to use them effectively.
Thanks!
Want to contribute?
https://github.com/7scientists/rouster
Andreas Dewes (@japh44)
Image Licenses:
https://commons.wikimedia.org/wiki/File:Matryoshka_dolls_(3671820040)_(2).jpg
https://pixabay.com/de/nordlichter-lager-zelt-abenteuer-1203289/
https://en.wikipedia.org/wiki/Orchestra
https://de.wikipedia.org/wiki/Graph_(Graphentheorie)
http://www.library.illinois.edu/prescons/disaster_response/high_density_storage_disaster_plan/
https://brookeborel.com/2011/06/02/363/
https://en.wikipedia.org/wiki/Data_sharing