Migrating pipelines into Docker


Transcript of Migrating pipelines into Docker

Migrating pipelines into Docker

Noa Resare, Spotify, @blippie

Welcome!

‣(let’s define pipeline)

‣Background

‣Docker improving engineering experience

‣Docker as a piece of the puzzle to handle growth

‣Practical advice

Spotify & me

‣Spotify

Streaming music
Celebrates 10 years this summer
30M subscribers, most users on the free tier
Millions of concurrent users

‣...me: at Spotify for 6 years
Less than 50 engineers then, now more than 1000
Operations engineering
Backend development
Free Software
Data Infrastructure

Big Data at Spotify

Humble beginnings

‣Counting stream playbacks

‣Stack of servers in the Fußball-room

‣Hadoop Streaming, Python

‣Quick excursion to Amazon in 2012

The new cluster

‣One large cluster, early 2012

‣60 nodes!

‣Luigi development starts
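
Luigi models a pipeline as tasks that declare their own inputs and outputs, and a task only runs when its output is missing. A minimal, hypothetical sketch in that style (the task names, paths and counting logic are illustrative, not Spotify's actual jobs):

    import luigi


    class PlaybackLogs(luigi.ExternalTask):
        """Raw playback logs assumed to already exist (illustrative path)."""
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget(f"logs/playbacks/{self.date}.tsv")


    class PlaybackCounts(luigi.Task):
        """Count playbacks per track for one day."""
        date = luigi.DateParameter()

        def requires(self):
            # Pull-based dependency: the task declares what it needs,
            # and Luigi runs missing upstream tasks first.
            return PlaybackLogs(date=self.date)

        def output(self):
            # If this target already exists, Luigi skips the task entirely.
            return luigi.LocalTarget(f"counts/playbacks/{self.date}.tsv")

        def run(self):
            counts = {}
            with self.input().open("r") as logs:
                for line in logs:
                    track_id = line.split("\t", 1)[0]
                    counts[track_id] = counts.get(track_id, 0) + 1
            with self.output().open("w") as out:
                for track_id, n in sorted(counts.items()):
                    out.write(f"{track_id}\t{n}\n")


    if __name__ == "__main__":
        luigi.run()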

More technologies

‣Python code

‣Pure Java MapReduce

‣Apache Crunch

‣Scala, Scalding

Spotify engineering org

‣A lot of autonomy

‣Big data touches many different teams
Finance
Analytics
Feature development (A/B testing)
Recommendations
Payments and fraud

Shared resources, packaging

‣Started out with some shared edge nodes; chaos ensued

‣More edge nodes! More chaos? More chaos!

‣Shared execution environment: from .deb to .jar, still a lot of one-off edge nodes

Docker for pipelines

Brief introduction to Docker

‣Containers seem like virtual machines

‣docker run -it <image_name>

‣Filesystem reset between invocations

‣Typically built using a Dockerfile

‣Image inheritance
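
A minimal illustration of the concepts above, using made-up names: an image is typically described by a Dockerfile, inherits everything from the parent image named in FROM, and every docker run starts from the image's pristine filesystem, so changes made inside one container are gone on the next invocation.

    # Dockerfile (hypothetical example)
    FROM python:3                       # image inheritance: build on a parent image
    COPY pipeline.py /app/pipeline.py   # bake the pipeline code into the image
    CMD ["python", "/app/pipeline.py"]  # default command when the container starts

    # Build the image, then run it interactively:
    #   docker build -t my-pipeline .
    #   docker run -it my-pipeline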

Docker at Spotify

‣Big bet on Docker for services: Helios

‣Lots of useful infrastructure

‣Solves some immediate packaging problems

What does Docker provide?

‣Useful abstraction to reason about

‣An incremental way out of dependency hell

‣Artefact distribution, caching

‣Image inheritance mechanism for sharing infrastructure

Switching to Docker in practice

‣Previously
Maven project with Java, Python, a cron file
Build step to upload the resulting jar to Artifactory
Build step to copy the cron file to the execution cluster

‣Now (sketched below)
Add a Dockerfile based on the data infrastructure base image
Build step to build and upload the image
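
A sketch of what the "now" flow can look like; the registry and base image names here are placeholders, not the actual Spotify infrastructure. The only addition to the existing Maven project is a Dockerfile that inherits the shared data infrastructure base image, and CI builds and pushes that image instead of uploading jars and copying cron files.

    # Dockerfile added to the existing Maven project (hypothetical)
    FROM registry.example.com/data-infra-base:7   # shared base image for pipelines
    COPY target/my-pipeline.jar /usr/share/pipeline/my-pipeline.jar
    COPY run_pipeline.py /usr/share/pipeline/run_pipeline.py
    CMD ["python", "/usr/share/pipeline/run_pipeline.py"]

    # CI build step (replaces the jar upload and cron copy):
    #   docker build -t registry.example.com/my-pipeline:1.0.0 .
    #   docker push registry.example.com/my-pipeline:1.0.0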

Problems with cron cluster execution

‣Implicit deployment via CI/CD declaration

‣Status reported via output materialising

‣Who / what triggered job X?

‣Where does it run?

‣Debugging is a pain

Our solution: execution as a service

‣RESTful API for pipeline execution

‣List your job invocations

‣Explicitly schedule execution on node

‣Don’t rerun successful execution

‣Interface: docker image
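
The talk does not show the API itself, so the following is only a hypothetical illustration of the idea: you submit a Docker image plus parameters, the service records the execution so it can be listed later, and re-submitting an execution that already succeeded is a no-op.

    # Hypothetical endpoints and fields, for illustration only.
    # Schedule an execution of a pipeline image for one partition:
    curl -X POST https://execution.example.com/api/v1/executions \
      -H 'Content-Type: application/json' \
      -d '{"image": "registry.example.com/my-pipeline:1.0.0",
           "args": ["--date", "2016-06-01"]}'

    # List your job invocations and their status:
    curl https://execution.example.com/api/v1/executions?owner=my-squad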

Data growth, or cluster day of doom

Scaling is hard

‣2000 nodes

‣100 PB storage

‣800 000 000 files in HDFS

‣180GB heap, 10G young generation

‣Adding 100TB data per day

Docker as vehicle for migration

‣Our path forward: Google Cloud

‣Decouple storage from compute

‣Transparent switch from on-premise Hadoop to Dataproc and Cloud Storage

‣Entry point executable in base image

‣Auth, config, dynamic cluster allocation
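
One way to read the last two bullets (a sketch, not Spotify's actual script): the base image ships an entry point wrapper that handles credentials, configuration and cluster allocation before handing over to the pipeline's own command, so individual pipelines do not care whether they run on premise or on Dataproc.

    #!/bin/sh
    # Hypothetical entry point baked into the data infrastructure base image.
    set -e

    # Auth and config: use a service account key mounted into the container.
    gcloud auth activate-service-account --key-file=/etc/keys/pipeline.json
    gcloud config set project my-data-project   # placeholder project id

    # Dynamic cluster allocation: a short-lived Dataproc cluster per run.
    CLUSTER="pipeline-$(date +%s)"
    gcloud dataproc clusters create "$CLUSTER" --region=europe-west1 --num-workers=10
    trap 'gcloud dataproc clusters delete "$CLUSTER" --region=europe-west1 --quiet' EXIT

    # Hand over to whatever command the pipeline image declared.
    "$@"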

Where are we now?

‣Two squads are using dockerized pipelines in production

‣Still using Luigi, pull-based dependencies

‣Styx, execution as a service, soon in prod

‣Google Cloud migration as we speak

‣Docker drives transparent migration

Some practical Docker advice

‣Reproducible, normalised builds

‣Explicit versioning

‣Split code, configuration, secrets

‣github.com/spotify/dockerfile-maven
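
dockerfile-maven ties the image build into the normal Maven lifecycle, which supports the reproducible build and explicit versioning points above. A typical plugin configuration looks roughly like this; the plugin version and repository name are placeholders:

    <plugin>
      <groupId>com.spotify</groupId>
      <artifactId>dockerfile-maven-plugin</artifactId>
      <version>1.4.13</version><!-- pick the current release -->
      <executions>
        <execution>
          <id>build-and-push</id>
          <goals>
            <goal>build</goal>
            <goal>push</goal>
          </goals>
        </execution>
      </executions>
      <configuration>
        <!-- image name and tag follow the Maven coordinates -->
        <repository>registry.example.com/my-pipeline</repository>
        <tag>${project.version}</tag>
      </configuration>
    </plugin>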

Thank you!

Don’t be a [email protected] / @blippie