Balancing Infrastructure with Optimization and Problem Formulation

Transcript of Balancing Infrastructure with Optimization and Problem Formulation

Page 1: Balancing Infrastructure with Optimization and Problem Formulation
Page 2: Balancing Infrastructure with Optimization and Problem Formulation

Balancing Infrastructure with Optimization and Problem Formulation

Sailthru Data Science

Page 3: Balancing Infrastructure with Optimization and Problem Formulation

How do we think about and practice

Data Science

Page 4: Balancing Infrastructure with Optimization and Problem Formulation

Talk Outline

Part 1:

● What is Data Science?
● Where should we spend our time as data scientists?

Part 2:

● How we balance infrastructure, optimization and problem formulation at Sailthru.

Page 5: Balancing Infrastructure with Optimization and Problem Formulation

What is Data Science?!

Page 6: Balancing Infrastructure with Optimization and Problem Formulation

“Data Science is the extraction of knowledge from data”

… Wikipedia

Page 7: Balancing Infrastructure with Optimization and Problem Formulation

http://drewconway.com/

Page 8: Balancing Infrastructure with Optimization and Problem Formulation

wikibooks.org

Page 9: Balancing Infrastructure with Optimization and Problem Formulation

These Interpretations Suggest:

Data Scientists are good at …

Page 10: Balancing Infrastructure with Optimization and Problem Formulation

These Interpretations Suggest:

Data Scientists are good at structuring problems, then solving and optimizing them.

Page 11: Balancing Infrastructure with Optimization and Problem Formulation

So what’s missing here?

zorger.com

Page 12: Balancing Infrastructure with Optimization and Problem Formulation

The title of this talk mentions...

● problem formulation

● optimization

● infrastructure

Page 14: Balancing Infrastructure with Optimization and Problem Formulation

Infrastructure

“the basic physical and organizational structures and facilities needed for the operation of a society or enterprise”

… Wikipedia

Page 15: Balancing Infrastructure with Optimization and Problem Formulation

Infrastructure: Often under-appreciated or undervalued by Data Scientists

Page 16: Balancing Infrastructure with Optimization and Problem Formulation

A Data Scientist’s infrastructure?

Page 17: Balancing Infrastructure with Optimization and Problem Formulation

Infrastructure: Something we become intimately familiar with

Page 18: Balancing Infrastructure with Optimization and Problem Formulation

Infrastructure: A mission-critical component of our work!

Page 19: Balancing Infrastructure with Optimization and Problem Formulation

Components of a Solid Infrastructure

● Lots of machinery: VMs, containers

● Machines require coordination, redundancy and fault tolerance: CAP theorem

Page 20: Balancing Infrastructure with Optimization and Problem Formulation

Components of a Solid Infrastructure

● Resource allocation: fair scheduling, bin packing

● Control strategies: auto scaling, feedback, PID

● Communication algorithms: gossip, Paxos, ...

● Configuration: dynamic persistence, namespaces

● Monitoring: anomaly detection, visualization

● Data storage: relational, graph, key-value

● SO MANY TOOLS!

Page 21: Balancing Infrastructure with Optimization and Problem Formulation

So What is Data Science?

Problem Formulation

Infrastructure

Optimization

Page 22: Balancing Infrastructure with Optimization and Problem Formulation

Central Question

As a data scientist, how do I choose where to spend my time?

Page 23: Balancing Infrastructure with Optimization and Problem Formulation

As a Data Scientist, ...

...when do I:

○ build infrastructure that supports my ideas
○ optimize my existing models and problems
○ find new problems to work on

Page 24: Balancing Infrastructure with Optimization and Problem Formulation

Part 2!

Page 25: Balancing Infrastructure with Optimization and Problem Formulation

Here’s a glimpse of how we tackle these choices at Sailthru.

Page 26: Balancing Infrastructure with Optimization and Problem Formulation
Page 27: Balancing Infrastructure with Optimization and Problem Formulation

● Sailthru is a personalization platform.

● We help our clients communicate with their customers.

● Our goal is to maximize the lifetime value of these customers so that our clients do well, customers are happy, and Sailthru is successful.

Page 28: Balancing Infrastructure with Optimization and Problem Formulation

Sailthru Sightlines: User Predictions

Page 29: Balancing Infrastructure with Optimization and Problem Formulation

Sightlines - Example Use Cases

Incentivize users with low chance of purchasing

Personalize discounts above expected order value

Suppress users likely to opt-out of messages

Engage users unlikely to open on other channels

Page 30: Balancing Infrastructure with Optimization and Problem Formulation

Sightlines - How it Works

Page 31: Balancing Infrastructure with Optimization and Problem Formulation

Computational Challenges

● Feature Engineering + ML

● Run many dependent jobs at scale

● Resource allocator

● Auto Scaler

Page 32: Balancing Infrastructure with Optimization and Problem Formulation

Computational Challenges

● Feature Engineering + ML → Tidyjson & GBMs

● Run many dependent jobs at scale → Stolos

● Resource allocator → Mesos + AWS Spot Instances

● Auto Scaler → Relay.Mesos

Page 33: Balancing Infrastructure with Optimization and Problem Formulation

github.com/sailthru/stolos

STOLOS

Page 34: Balancing Infrastructure with Optimization and Problem Formulation

What problem does it solve?

A Directed Acyclic Multi-Graph task dependency scheduler designed to simplify complex, distributed pipelines.

It creates application queues that can be consumed from in any order.
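As an illustration of dependency-aware scheduling, here is a minimal Python sketch. It is not Stolos's API: the job names, the `DEPENDENCIES` mapping and the `run_order` helper are all hypothetical, the real dependency structure of the pipeline is not given in these slides, and the sketch models a plain DAG rather than a multi-graph. It only shows the core idea that a job becomes eligible to be queued once all of its parents have completed.

```python
from collections import deque

# Hypothetical pipeline: each job maps to the jobs it depends on.
DEPENDENCIES = {
    "sample": [],
    "assemble": ["sample"],
    "grid": ["assemble"],
    "build": ["grid"],
    "predict": ["build"],
}


def run_order(dependencies):
    """Return one valid execution order using Kahn's algorithm."""
    # Parents each job is still waiting on.
    remaining = {job: set(parents) for job, parents in dependencies.items()}
    # Reverse edges: which jobs are unblocked when a given job finishes.
    children = {job: [] for job in dependencies}
    for job, parents in dependencies.items():
        for parent in parents:
            children[parent].append(job)

    # Jobs with no parents can be queued immediately.
    queue = deque(sorted(job for job, parents in remaining.items() if not parents))
    order = []
    while queue:
        job = queue.popleft()
        order.append(job)
        for child in children[job]:
            remaining[child].discard(job)
            if not remaining[child]:
                queue.append(child)

    if len(order) != len(dependencies):
        raise ValueError("dependency cycle detected")
    return order
```

Calling `run_order(DEPENDENCIES)` yields the jobs in an order that respects every dependency. A production scheduler would also need persistence, retries and distributed locking, none of which is modeled here.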

Page 35: Balancing Infrastructure with Optimization and Problem Formulation

Sightlines - Stolos Pipeline

450 * 20

Each node is a job

Page 36: Balancing Infrastructure with Optimization and Problem Formulation

Sightlines - Stolos Pipeline

Repeats over time (currently, 1 day periods)

Page 37: Balancing Infrastructure with Optimization and Problem Formulation

github.com/sailthru/relay

github.com/sailthru/relay.mesos

Relay.Mesos

Page 38: Balancing Infrastructure with Optimization and Problem Formulation

What problem does it solve?

Relay actively minimizes the difference between a measured signal and a target signal.

Relay.Mesos plugs Relay into a tool called Mesos. → Lets us auto-scale consumers of queued Stolos jobs

Page 39: Balancing Infrastructure with Optimization and Problem Formulation

FFT Visualization

Page 40: Balancing Infrastructure with Optimization and Problem Formulation

FFT Visualization

[Figure: a signal and its FFT, decomposed into frequency components f1, f2, f3 and f4, with panels labeled k=0, k=1 and k=2.]
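The figure for this slide did not survive extraction, and the slides do not say how the FFT is used in Relay. As a generic illustration only, the Python sketch below computes a naive discrete Fourier transform (an FFT produces the same values, just faster) and picks out the dominant frequency of a signal. The `dft` and `dominant_frequency` helpers and the sample signal are hypothetical.

```python
import cmath
import math


def dft(samples):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]


def dominant_frequency(samples):
    """Return the index k (cycles per window) of the largest non-constant component."""
    spectrum = dft(samples)
    half = len(samples) // 2
    return max(range(1, half + 1), key=lambda k: abs(spectrum[k]))


# Hypothetical signal: three full cycles of a sine wave over 32 samples.
signal = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
```

Here `dominant_frequency(signal)` identifies the component at three cycles per window, which is the frequency the signal was built from.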

Page 42: Balancing Infrastructure with Optimization and Problem Formulation

The PID Algorithm

PV = Process Variable (Signal)
SP = Set Point (Target)
MV = Manipulated Variable (Output)
t = index on timesteps

**The “D” in PID is excluded here
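The formula shown on this slide did not survive extraction. As an assumption, the textbook discrete-time form of a controller without the derivative term, written with the symbols defined above, is given below. The error symbol e and the gains K_p and K_i are standard notation, not names taken from the slide.

```latex
e_t = SP_t - PV_t, \qquad
MV_t = K_p \, e_t + K_i \sum_{j=0}^{t} e_j
```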

Page 43: Balancing Infrastructure with Optimization and Problem Formulation

The PID Algorithm

PV = Process Variable (Signal)
SP = Set Point (Target)
MV = Manipulated Variable (Output)
t = index on timesteps

**The “D” in PID is excluded here

+ Kd · Δ(SP - PV)/Δt   (the derivative term, not used here)
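A minimal Python sketch of a controller without the derivative term follows. It is a textbook PI loop under stated assumptions, not Relay's implementation: the `PIController` class, the gain values and the toy process in `simulate` are all hypothetical.

```python
class PIController:
    """Textbook discrete PI controller (no derivative term)."""

    def __init__(self, kp, ki):
        self.kp = kp            # proportional gain
        self.ki = ki            # integral gain
        self.error_sum = 0.0    # running sum of past errors

    def update(self, set_point, process_variable):
        """Return the manipulated variable MV for the current timestep."""
        error = set_point - process_variable
        self.error_sum += error
        return self.kp * error + self.ki * self.error_sum


def simulate(steps=200):
    """Drive a toy process toward a set point of 10 and return the final PV."""
    controller = PIController(kp=0.5, ki=0.1)
    pv = 0.0
    for _ in range(steps):
        mv = controller.update(set_point=10.0, process_variable=pv)
        pv += 0.5 * mv  # hypothetical process: PV moves by half of the output
    return pv
```

With these gains the toy process settles at the set point. In an auto-scaling setting, PV might be a measured queue size and MV the number of consumers to add or remove, but that mapping is an assumption and is not modeled here.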

Page 44: Balancing Infrastructure with Optimization and Problem Formulation

Sightlines - Relay

Page 45: Balancing Infrastructure with Optimization and Problem Formulation

Thank You! Our team:

Page 46: Balancing Infrastructure with Optimization and Problem Formulation

Check out our open sourced tools!

Tidyjson: github.com/sailthru/tidyjson
Stolos: github.com/sailthru/stolos
Relay: github.com/sailthru/relay
Relay.Mesos: github.com/sailthru/relay.mesos
Consulconf: github.com/sailthru/consulconf

With more in progress!

Page 47: Balancing Infrastructure with Optimization and Problem Formulation
Page 48: Balancing Infrastructure with Optimization and Problem Formulation

Sightlines - On Mesos

[Figure: two panels of cluster resource usage, each with CPU units on the horizontal axis and RAM on the vertical axis.]

Page 49: Balancing Infrastructure with Optimization and Problem Formulation

Sightlines - Stages

Stages: Sample & Assemble, Grid, Build, Predict, API Push

Page 50: Balancing Infrastructure with Optimization and Problem Formulation

Sightlines - Stages

Stages: Sample & Assemble, Grid, Build, Predict, API Push

Sample & Assemble: Build train and test sets from a sample of data
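How the sampling is actually done is not described in these slides. As a generic sketch only, the Python helper below (the name `train_test_split`, the 20% test fraction and the fixed seed are all assumptions) shuffles a sample of rows and splits it into train and test sets.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

For example, `train_test_split(range(100))` returns 80 training rows and 20 test rows, with every input row appearing in exactly one of the two sets.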

Page 51: Balancing Infrastructure with Optimization and Problem Formulation

Sightlines - Stages

Stages: Sample & Assemble, Grid, Build, Predict, API Push

Grid: Run grid search to identify hyperparameters for the model
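A grid search can be sketched in a few lines of Python. This is an illustration under assumptions rather than the Sightlines code: the `grid_search` helper, the parameter names in `GRID` and the stand-in `toy_score` function are hypothetical. A real score would come from training a model with each parameter combination and evaluating it on held-out data.

```python
from itertools import product


def grid_search(param_grid, score):
    """Evaluate every combination in param_grid; return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Hypothetical grid for a gradient boosting model.
GRID = {"learning_rate": [0.01, 0.1, 0.3], "max_depth": [2, 4, 6]}


def toy_score(params):
    # Stand-in for a validation metric; peaks at learning_rate=0.1, max_depth=4.
    return -abs(params["learning_rate"] - 0.1) - abs(params["max_depth"] - 4)
```

`grid_search(GRID, toy_score)` tries all nine combinations and returns the one with the highest score. Because each combination is independent, the evaluations are a natural fit for running as separate jobs.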

Page 52: Balancing Infrastructure with Optimization and Problem Formulation

Sightlines - Stages

Stages: Sample & Assemble, Grid, Build, Predict, API Push

Build: Build the model

Page 53: Balancing Infrastructure with Optimization and Problem Formulation

Sightlines - Stages

Stages: Sample & Assemble, Grid, Build, Predict, API Push

Predict: Generate predictions for all relevant models

Page 54: Balancing Infrastructure with Optimization and Problem Formulation

Sightlines - Pipeline

[Figure: pipeline diagram with Database nodes and the stages Sample & Assemble, Grid, Build, Predict and API Push.]

Page 55: Balancing Infrastructure with Optimization and Problem Formulation

Sightlines - Pipeline

[Figure: pipeline diagram with Database nodes and the stages Sample & Assemble, Grid, Build, Predict and API Push.]

○ Upper branch: once per (client, day, model)
○ Lower branch: once per (client, day)
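To make the two granularities concrete, the Python sketch below enumerates one job per (client, day, model) for the upper branch and one per (client, day) for the lower branch. The `job_ids` helper and the underscore-separated identifier format are assumptions for illustration, not the scheme the pipeline actually uses.

```python
from itertools import product


def job_ids(clients, days, models):
    """Enumerate hypothetical job identifiers for both branches of the pipeline."""
    # Upper branch: one job for every (client, day, model) combination.
    upper = [f"{c}_{d}_{m}" for c, d, m in product(clients, days, models)]
    # Lower branch: one job for every (client, day) combination.
    lower = [f"{c}_{d}" for c, d in product(clients, days)]
    return upper, lower
```

With 2 clients, 1 day and 3 models, this gives 6 upper-branch jobs and 2 lower-branch jobs, so the upper branch grows with the number of models while the lower branch does not.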