Balancing Infrastructure with Optimization and Problem Formulation
Uploaded by: alex-d-gaudio
Category: Data & Analytics
Sailthru Data Science
How do we think about and practice
Data Science
Talk Outline
Part 1:
● What is Data Science?
● Where should we spend our time as data scientists?
Part 2:
● How we balance infrastructure, optimization and problem formulation at Sailthru.
What is Data Science?!
“Data Science is the extraction of knowledge from data” … Wikipedia
http://drewconway.com/
wikibooks.org
Data Scientists are good at …
These Interpretations Suggest:
Data Scientists are good at structuring problems, and solving for and optimizing them.
So what’s missing here?
zorger.com
The title of this talk mentions...
● problem formulation
● optimization
● infrastructure
Infrastructure
“the basic physical and organizational structures and facilities needed for the operation of a society or enterprise” … Wikipedia
Infrastructure: Often under-appreciated or undervalued by Data Scientists
A Data Scientist’s infrastructure?
Infrastructure: Something we become intimately familiar with
Infrastructure: A mission-critical component of our work!
Components of a Solid Infrastructure
● Lots of Machinery: VMs, Containers
● Machines require coordination, redundancy and fault tolerance: CAP Theorem
● Resource Allocation: Fair Scheduling, Bin Packing
● Control strategies: Auto Scaling, Feedback, PID
● Communication algorithms: Gossip, Paxos, ...
● Configuration: Dynamic Persistence, Namespaces
● Monitoring: Anomaly Detection, Visualization
● Data Storage: Relational, Graph, Key-Value
● SO MANY TOOLS!
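As one illustration of the "Bin Packing" item above, here is a minimal sketch of the first-fit decreasing heuristic. The function name, the task sizes and the machine capacity are all made up for this example; a real resource allocator would also weigh memory, fairness and other constraints.

```python
def first_fit_decreasing(task_cpus, machine_capacity):
    """Place tasks on machines with the first-fit decreasing heuristic."""
    machines = []  # each machine is the list of task sizes placed on it
    free = []      # remaining capacity per machine, parallel to `machines`
    for cpu in sorted(task_cpus, reverse=True):
        if cpu > machine_capacity:
            raise ValueError(f"task needing {cpu} CPUs exceeds machine capacity")
        for i, remaining in enumerate(free):
            if cpu <= remaining:
                # First machine with enough room takes the task.
                machines[i].append(cpu)
                free[i] -= cpu
                break
        else:
            # No existing machine fits, so start a new one.
            machines.append([cpu])
            free.append(machine_capacity - cpu)
    return machines


# Hypothetical workload: six tasks, machines with 10 CPU units each.
placement = first_fit_decreasing([4, 8, 1, 4, 2, 1], machine_capacity=10)
```

Sorting the largest tasks first tends to leave small tasks to fill the gaps, which is why this simple heuristic usually packs tightly.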
So What is Data Science?
Problem Formulation
Infrastructure
Optimization
Central Question
As a data scientist, how do I choose where to
spend my time?
As a Data Scientist, ...
...when do I:
○ build infrastructure that supports my ideas
○ optimize my existing models and problems
○ find new problems to work on
Part 2!
Here’s a glimpse of how we tackle these choices at Sailthru.
● Sailthru is a personalization platform.
● We help our clients communicate with their customers.
● Our goal is to maximize the lifetime value of these customers so that our clients do well, customers are happy, and Sailthru is successful.
Sailthru Sightlines: User Predictions
Sightlines - Example Use Cases
Incentivize users with low chance of purchasing
Personalize discounts above expected order value
Suppress users likely to opt-out of messages
Engage users unlikely to open on other channels
Sightlines - How it Works
Computational Challenges
● Feature Engineering + ML → Tidyjson & GBMs
● Run many dependent jobs at scale → Stolos
● Resource allocator → Mesos + AWS Spot Instances
● Auto Scaler → Relay.Mesos
github.com/sailthru/stolos
STOLOS
What problem does it solve?
A Directed Acyclic Multi-Graph task dependency scheduler designed to simplify complex, distributed pipelines.
It creates application queues that can be consumed from in any order.
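To make the idea of a task dependency scheduler concrete, here is a minimal sketch using Python's standard-library `graphlib`. It is not the Stolos API: the job names and the `run_in_dependency_order` helper are hypothetical, and Stolos itself distributes work through queues rather than running jobs in a single process.

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: each job maps to the set of jobs it depends on.
dependencies = {
    "assemble": {"sample"},
    "grid": {"assemble"},
    "build": {"grid"},
    "predict": {"build"},
    "api_push": {"predict"},
}


def run_in_dependency_order(deps, run_job):
    """Run each job only after all of its dependencies have completed."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    completed = []
    while sorter.is_active():
        # Every job returned here is unblocked; these could run in any order
        # or in parallel on separate workers.
        for job in sorter.get_ready():
            run_job(job)
            completed.append(job)
            sorter.done(job)
    return completed


order = run_in_dependency_order(dependencies, run_job=lambda job: None)
```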
Sightlines - Stolos Pipeline
450 * 20
Each node is a job
Repeats over time (currently, 1-day periods)
github.com/sailthru/relay
github.com/sailthru/relay.mesos
Relay.Mesos
What problem does it solve?
Relay actively minimizes the difference between a measured signal and a target signal.
Relay.Mesos plugs Relay into a tool called Mesos. → Lets us auto-scale consumers of queued Stolos jobs
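One common way to minimize the difference between a measured signal and a target is a proportional-integral feedback loop. The sketch below is an assumption-laden toy, not Relay's implementation: the gains, the queue model and every name in it are invented for illustration.

```python
def make_pi_controller(kp, ki):
    """Build a discrete PI controller: MV = kp * e + ki * sum(e), e = SP - PV."""
    accumulated_error = 0.0

    def step(measured, target):
        nonlocal accumulated_error
        error = target - measured
        accumulated_error += error
        return kp * error + ki * accumulated_error

    return step


def simulate(steps=50, arrivals_per_step=20, jobs_per_consumer=5, target_queue=10):
    """Toy queue: the controller picks how many consumers to run each step."""
    # Negative gains, because adding consumers lowers the measured queue size.
    controller = make_pi_controller(kp=-0.05, ki=-0.02)
    queue, history = 100, []
    for _ in range(steps):
        output = controller(measured=queue, target=target_queue)
        consumers = max(0, round(output))  # cannot run a negative number
        queue = max(0, queue + arrivals_per_step - consumers * jobs_per_consumer)
        history.append((queue, consumers))
    return history


history = simulate()
```

With these made-up numbers the backlog drains and the loop settles at the target queue size with just enough consumers to match the arrival rate.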
FFT Visualization
[Figure: a signal and its FFT, decomposed into frequency components f1, f2, f3 and f4; panels labeled k=0, k=1, k=2]
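To illustrate what decomposing a signal into frequency components means, here is a small sketch using a naive discrete Fourier transform, which computes the same values an FFT does, only more slowly. The test signal and its two frequencies are made up for this example and are not taken from the slides.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


n = 64
# Hypothetical signal: two sine waves completing 3 and 10 cycles over the window.
signal = [
    math.sin(2 * math.pi * 3 * t / n) + 0.5 * math.sin(2 * math.pi * 10 * t / n)
    for t in range(n)
]

# Amplitude of each frequency bin below the Nyquist limit. The factor of 2
# accounts for the mirrored negative-frequency half (it does not apply to k=0,
# which is zero for this signal).
amplitudes = [2 * abs(c) / n for c in dft(signal)[: n // 2]]

# The two strongest bins recover the frequencies that were mixed in.
dominant = sorted(range(n // 2), key=lambda k: amplitudes[k], reverse=True)[:2]
```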
The PID Algorithm
PV = Process Variable (Signal)
SP = Set Point (Target)
MV = Manipulated Variable (Output)
t = index on timesteps
**The “D” in PID is excluded here
Derivative term shown when the “D” is included: + Kd Δ dt
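The slide's equation did not survive in this transcript. For reference, the standard textbook discrete form, written with the symbols defined above, is sketched below. The error term e_t, the gains K_p, K_i, K_d and the timestep length Δt are conventional names assumed here, and the slide's exact notation may differ.

```latex
\begin{align}
e_t &= SP_t - PV_t \\
% PI form, with the derivative term excluded
MV_t &= K_p \, e_t + K_i \sum_{\tau = 0}^{t} e_\tau \, \Delta t \\
% Full PID form, with the derivative term included
MV_t &= K_p \, e_t + K_i \sum_{\tau = 0}^{t} e_\tau \, \Delta t
        + K_d \, \frac{e_t - e_{t-1}}{\Delta t}
\end{align}
```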
Sightlines - Relay
Thank You! Our team:
Check out our open sourced tools!
● Tidyjson: github.com/sailthru/tidyjson
● Stolos: github.com/sailthru/stolos
● Relay: github.com/sailthru/relay
● Relay.Mesos: github.com/sailthru/relay.mesos
● Consulconf: github.com/sailthru/consulconf
With more in progress!
Sightlines - On Mesos
[Figure: resource usage on Mesos; horizontal axis: CPU Units, vertical axis: RAM]
Sightlines - Stages
Sample & Assemble → Grid → Build → Predict → API Push
● Sample & Assemble: Build train and test sets from a sample of data
● Grid: Run Grid Search to identify hyperparameters for the model
● Build: Build the model
● Predict: Generate predictions for all relevant models
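The sample, grid search, build and predict stages can be sketched end to end on toy data. Everything here is a stand-in: the slides name GBMs as the model, while this example uses a tiny nearest-neighbour classifier so it runs with the standard library alone, and the data, function names and grid values are hypothetical.

```python
import itertools
import random


def sample_and_assemble(rows, sample_size, test_fraction, seed=0):
    """Draw a random sample and split it into train and test sets."""
    rng = random.Random(seed)
    sample = rng.sample(rows, sample_size)
    cut = int(len(sample) * (1 - test_fraction))
    return sample[:cut], sample[cut:]


def predict_knn(train, k, x):
    """Toy stand-in model: majority label among the k nearest training points."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(label for _, label in nearest) * 2 > len(nearest)


def accuracy(train, k, test):
    """Fraction of test rows whose label the model predicts correctly."""
    return sum(predict_knn(train, k, x) == label for x, label in test) / len(test)


def grid_search(train, test, grid):
    """Score every hyperparameter combination and return the best one."""
    names = sorted(grid)
    candidates = [
        dict(zip(names, values))
        for values in itertools.product(*(grid[name] for name in names))
    ]
    return max(candidates, key=lambda params: accuracy(train, params["k"], test))


# Hypothetical data: one feature in [0, 1); the label is True when it is >= 0.5.
rows = [(i / 200, i >= 100) for i in range(200)]

train, test = sample_and_assemble(rows, sample_size=80, test_fraction=0.25)
best = grid_search(train, test, grid={"k": [1, 3, 5]})       # Grid stage
predictions = [predict_knn(train, best["k"], x) for x, _ in rows]  # Predict stage
```

For a nearest-neighbour model the Build stage is trivial, since the training set is the model; with a GBM it would be a separate fit on the chosen hyperparameters.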
Sightlines - Pipeline
[Figure: pipeline diagram with nodes Database, Sample, Assemble, Grid, Build, Predict and API Push, split into an upper and a lower branch]
○ Upper branch: once per (client, day, model)
○ Lower branch: once per (client, day)
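The two branch granularities can be sketched as job identifiers. The client and model names below are invented, and `jobs_for_day` is a hypothetical helper, not part of Stolos.

```python
import itertools


def jobs_for_day(clients, models, day):
    """List one day's job ids for both branch granularities."""
    # Lower branch: one job per (client, day).
    per_client = [(client, day) for client in clients]
    # Upper branch: one job per (client, day, model).
    per_model = [
        (client, day, model)
        for client, model in itertools.product(clients, models)
    ]
    return per_client, per_model


# Hypothetical inputs: three clients and two prediction models.
lower, upper = jobs_for_day(
    clients=["client_a", "client_b", "client_c"],
    models=["purchase", "opt_out"],
    day="2015-01-01",
)
```

The upper branch grows with the product of clients and models, so it dominates the job count as either number increases.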