[Research] deploying predictive models with the actor framework - Brian Gawalt
PAPIs 2015
Akka & Data Science: Making real-time predictions
Brian Gawalt
2nd International Conference on Predictive APIs and Apps
August 7, 2015
[A] Sometimes, data scientists need to worry about throughput.
[B] One way to increase throughput is with concurrency.
[C] The Actor Model is an easy way to build a concurrent system.
[D] Scala+Akka provides an easy-to-use Actor Model context.
[A + B + C + D ⇒ E] Data scientists should check out Scala+Akka.
Consider:
● building a model,
● vs. using a model
Lots of ways to practice building a model
The Classic Process
1. Load your data set’s raw materials
2. Produce feature vectors:
○ Training,
○ Validation,
○ Testing
3. Build the model with training and validation vectors
4. Use the model to …
The Classic Process: One-time Testing
Build: Load train/valid./test materials → Make train/valid./test feature vectors → Train Model
Use: Make test predictions
The Classic Process: Repeated Testing
Build: Load train/valid. materials → Make train/valid. feature vectors → Train Model → (saved model)
Use (repeat every K minutes): Load test/new materials → Make test/new feature vectors → Make test/new predictions
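The build/use split in the repeated-testing process can be sketched in a few lines. This is only an illustration, written in Python with the standard library so it runs anywhere: the nearest-centroid "model", the function names (`train_model`, `build`, `use`), and the pickle file are all assumptions, not anything from the talk, and the validation step is left out for brevity.

```python
import pickle
import tempfile
from pathlib import Path


def train_model(vectors, labels):
    """Toy stand-in for training: store one centroid per label."""
    centroids = {}
    for label in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(model, vector):
    """Return the label whose centroid is closest to the vector."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


def build(train_vectors, train_labels, model_path):
    """Build phase: train once, then save the model to disk."""
    model = train_model(train_vectors, train_labels)
    model_path.write_bytes(pickle.dumps(model))


def use(new_vectors, model_path):
    """Use phase: reload the saved model and score a fresh batch."""
    model = pickle.loads(model_path.read_bytes())
    return [predict(model, v) for v in new_vectors]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "model.pkl"
        build([[0, 0], [1, 1], [9, 9], [10, 10]], ["no", "no", "yes", "yes"], path)
        # In the repeated-testing setup, this call would run every K minutes on new data.
        return use([[1, 0], [8, 9]], path)
```

Only `use` sits on the repeating path, so its running time is what bounds how often predictions can be refreshed.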
Sometimes my tasks work like that, too!
But this talk is about the other kind of tasks.
[A] Sometimes, data scientists need to worry about throughput.
Example: Freelancer availability on Upwork
Hiring Freelancers on Upwork
1. Post a job
2. Search for freelancers
3. Find someone you like
4. Ask them to interview
○ Request Accepted!
○ or rejected/ignored...
THE TASK:
Look at recent freelancer behavior and predict, at the time of Step 2, who’s likely to accept an invite at the time of Step 4.
Building this model is business as usual:
Building Availability Model
1. Load raw materials:
○ Examples of accepts/rejects
○ Histories of freelancer site activity: job applications sent or received, hours worked, click logs, profile updates
Data sources shown: Greenplum, Amazon S3, Internal Service
2. Produce feature vectors:
Using Availability Model
Build: Load train/valid. materials → Make train/valid. feature vectors → Train Model → (saved model)
Use (repeat every 60 minutes): Load test/new materials → Make test/new feature vectors → Make test/new predictions
Using Availability Model
Use, with the saved model (repeat every 60 minutes): Load test/new materials → Make test/new feature vectors → Make test/new predictions
Load test/new materials consists of:
● Load job app data (4 min.)
● Load click log data (30 min.)
● Load work hours data (5 min.)
● Load profile data (20 ms/profile)
Using Availability Model
● Left with under 21 minutes to collect profile data
○ Rate limit: 20 ms/profile
○ At most, 63K profiles per hour
● Six million freelancers who need avail. predictions: expect ~90 hours between re-scoring any individual
● Still need to spend time actually building vectors and exporting scores!
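The time budget above is plain arithmetic, sketched here in Python. The function name `profile_budget` and its parameters are made up for illustration; the inputs (a 60-minute cycle, loads of 4, 30, and 5 minutes, 20 ms per profile, six million freelancers) are the figures quoted on the slides. With those inputs the re-scoring gap comes out closer to 95 hours, the same ballpark as the slide's ~90.

```python
def profile_budget(cycle_min, fixed_load_min, ms_per_profile, population):
    """Return (minutes left for profiles, profiles per cycle, hours between re-scores)."""
    # Time left in the cycle once the fixed-cost loads are done.
    remaining_min = cycle_min - sum(fixed_load_min)
    # The rate limit caps how many profiles fit in that remaining time.
    profiles_per_cycle = remaining_min * 60 * 1000 // ms_per_profile
    # Cycles needed to touch everyone once, converted to hours.
    hours_between_rescoring = population / profiles_per_cycle * cycle_min / 60
    return remaining_min, profiles_per_cycle, hours_between_rescoring


remaining, per_hour, gap_hours = profile_budget(
    cycle_min=60,
    fixed_load_min=[4, 30, 5],   # job apps, click log, work hours
    ms_per_profile=20,
    population=6_000_000,
)
print(remaining, per_hour, round(gap_hours, 1))
```

This is an upper bound on throughput: it still ignores the time needed to build vectors and export scores.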
[B] One way to increase throughput is with concurrency.
Expensive Option: Major infrastructure overhaul
… but that takes a lot of time, attention, and cooperation…
Simpler Option: The Actor Model
[C] The Actor Model is an easy way to build a concurrent system.
An Actor
● Imagine a mailbox with a brain
● Computation only begins when/if a message arrives
● Keeps its thoughts private:
○ No other actor can actively read this actor’s state
○ Other actors will have to wait to hear a message from this actor
The Actor Model of Concurrency
● Lots of Actors, and each has:
○ Private message queue
○ Private state, shared only by sending more messages
● Execution context:
○ Manages threading of each Actor’s computation
○ Handles asynch. message routing
○ Can send prescheduled messages
● Each received message’s computation is fully completed before Actor moves on to next message in queue
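The properties listed above (a private mailbox, private state, one message fully handled before the next) can be sketched without any actor library. What follows is a toy in Python using `queue.Queue` and `threading.Thread`, not Akka: the `Actor` and `Counter` classes and the `tell` method name are illustrative assumptions, and each actor here owns a dedicated thread, whereas a real execution context would multiplex many actors over a thread pool.

```python
import queue
import threading


class Actor:
    """A mailbox plus a worker thread that handles one message at a time."""

    _STOP = object()  # private sentinel used to end the worker loop

    def __init__(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        """Asynchronous send: enqueue and return immediately."""
        self._mailbox.put(message)

    def stop(self):
        """Ask the actor to finish its queued messages, then wait for it."""
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            self.receive(message)  # runs to completion before the next get()

    def receive(self, message):
        raise NotImplementedError


class Counter(Actor):
    def __init__(self):
        self._count = 0  # private state, set before the worker thread starts
        super().__init__()

    def receive(self, message):
        kind, payload = message
        if kind == "add":
            self._count += payload
        elif kind == "report":
            payload.put(self._count)  # state leaves only as a reply message


def demo():
    counter = Counter()

    def send_many():
        for _ in range(1000):
            counter.tell(("add", 1))

    senders = [threading.Thread(target=send_many) for _ in range(4)]
    for t in senders:
        t.start()
    for t in senders:
        t.join()
    reply = queue.Queue()
    counter.tell(("report", reply))
    total = reply.get(timeout=5)
    counter.stop()
    return total
```

Four threads send concurrently, yet `_count` needs no lock: only the actor's own thread ever touches it, and it does so one message at a time.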
The Actor Model of Concurrency
[Diagram; labeled: Execution Context]
Parallelizing predictions
● Refresh work hours (Schedule: Fetch once per hour)
● Refresh job apps (Schedule: Fetch once per hour)
● Refresh click log (Schedule: Fetch once per hour)
● Fetch 10 profiles (Schedule: Fetch every 300ms)
● Vectorizer (receives raw data):
○ Keep copies of raw data
○ Emit vector for each new profile received
● Apply model; export prediction
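One way to picture the layout above is as a chain of actors: refreshers and the profile fetcher send messages to a vectorizer, which forwards vectors to a scorer. The Python sketch below is an assumption-heavy toy, not the talk's 445-line Scala implementation: the message shapes, the `Vectorizer` and `Scorer` classes, the `skills` field, and the "model" (a plain sum) are all invented for illustration, and the scheduler is replaced by direct `tell` calls so the run is deterministic.

```python
import queue
import threading


class Actor:
    """Minimal actor: one mailbox, one thread, one message at a time."""

    _STOP = object()

    def __init__(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        self._mailbox.put(message)

    def stop(self):
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            self.receive(message)


class Scorer(Actor):
    """Applies the model and exports each prediction."""

    def __init__(self, export):
        self._export = export  # hypothetical export callback
        super().__init__()

    def receive(self, message):
        user_id, vector = message
        score = sum(vector)  # placeholder for a real model
        self._export((user_id, score))


class Vectorizer(Actor):
    """Keeps copies of raw data; emits a vector per profile received."""

    def __init__(self, scorer):
        self._raw = {"work_hours": {}, "job_apps": {}, "click_log": {}}
        self._scorer = scorer
        super().__init__()

    def receive(self, message):
        kind, payload = message
        if kind in self._raw:
            self._raw[kind] = payload  # hourly refresh replaces the copy
        elif kind == "profile":
            user_id, profile = payload
            vector = [self._raw[src].get(user_id, 0.0) for src in sorted(self._raw)]
            vector.append(float(profile["skills"]))
            self._scorer.tell((user_id, vector))


def run_pipeline():
    exported = []
    scorer = Scorer(export=exported.append)
    vectorizer = Vectorizer(scorer)
    # Stand-ins for the three once-per-hour refresh messages.
    vectorizer.tell(("work_hours", {"u1": 12.0, "u2": 0.0}))
    vectorizer.tell(("job_apps", {"u1": 3.0}))
    vectorizer.tell(("click_log", {"u2": 40.0}))
    # Stand-ins for the frequent profile fetches.
    vectorizer.tell(("profile", ("u1", {"skills": 5})))
    vectorizer.tell(("profile", ("u2", {"skills": 2})))
    vectorizer.stop()  # drain upstream first, then downstream
    scorer.stop()
    return exported
```

A slow hourly refresh would run in its own actor, so it does not block profile fetching; the vectorizer simply uses whichever raw-data copy it last received.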
Serial processing (repeat every 60 minutes)
● Load data: 55 min
○ Refresh job apps: 4 min
○ Refresh work hours: 5 min
○ Refresh click log: 30 min
○ Fetch ~50K profiles: 55 - 4 - 5 - 30 = 16 min
● Make feature vectors, export predictions: 5 min
Throughput: 48K users/hr
Parallel Processing with Actors
[Diagram: message timeline across four threads (Thr. 1, Thr. 2, Thr. 3, Thr. 4); msg. processing time not to scale. Legend: msg. sent, msg. rx’d. Tasks shown: Refresh job apps, Refresh click log, Refresh work hrs., Fetch pro., Rx data, Store, Vectorize, Export. Rate labels: 1/hr., 3/sec., (as rx’ed).]
Throughput: 180K users/hr
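The two throughput figures can be checked against the 20 ms/profile rate limit. The short Python sketch below does that arithmetic; reading 180K users/hr as "profile fetching runs for the full 60 minutes at the rate limit" is an inference from the numbers rather than something the slides state, and the six million population is carried over from the earlier slide.

```python
MS_PER_PROFILE = 20        # rate limit quoted on the slides
FREELANCERS = 6_000_000    # population quoted on the slides


def profiles_fetched(minutes_fetching):
    """Profiles retrievable in the given number of minutes at the rate limit."""
    return minutes_fetching * 60 * 1000 // MS_PER_PROFILE


# Serial: profile fetching only gets what is left of the 55-minute load window.
serial_per_hour = profiles_fetched(55 - 4 - 5 - 30)
# Parallel (assumed reading): fetching proceeds for the whole hour.
parallel_per_hour = profiles_fetched(60)

speedup = parallel_per_hour / serial_per_hour
serial_gap_hours = FREELANCERS / serial_per_hour
parallel_gap_hours = FREELANCERS / parallel_per_hour
print(serial_per_hour, parallel_per_hour, speedup)
```

Under these assumptions the gap between re-scores of any one freelancer drops from about 125 hours to about 33.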
[D] Scala+Akka provides an easy-to-use Actor Model context.
Message passing, scheduling, & computation behavior defined in 445 lines.
Scala+Akka Actors
● Create Scala class, mix in Actor trait
● Implement the required partial function: receive: PartialFunction[Any, Unit]
● Define family of message objects this actor’s planning to handle
● Define behavior for each message case in receive
Scala+Akka Actors
[Code listing with annotations:]
● Mix in the same code used for export in the non-Actor version
● Private, mutable state: stored scores
● Private, mutable state: time of last export
● If receiving new scores: store them!
● If storing lots of scores, or if it’s been a while: upload what’s stored, then erase them
● If told to shut down, stop accepting new scores
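The behavior those annotations describe can be sketched as a single `receive` method that branches on message type. The version below is Python rather than the Scala shown in the talk, and nearly everything specific in it is an assumption: the message classes `NewScores` and `Shutdown`, the thresholds `max_stored` and `max_age_sec`, the injected `upload` callback, and the choice to flush any leftover scores on shutdown.

```python
import time
from dataclasses import dataclass


@dataclass
class NewScores:
    scores: dict  # user id -> score


@dataclass
class Shutdown:
    pass


class ScoreExporter:
    """Buffers scores and uploads them in batches."""

    def __init__(self, upload, max_stored=1000, max_age_sec=60.0, clock=time.monotonic):
        self._upload = upload            # hypothetical export function
        self._max_stored = max_stored    # "lots of scores" threshold (assumed)
        self._max_age_sec = max_age_sec  # "it's been a while" threshold (assumed)
        self._clock = clock
        self._stored = {}                # private, mutable state: stored scores
        self._last_export = clock()      # private, mutable state: time of last export
        self._accepting = True

    def receive(self, message):
        if isinstance(message, NewScores) and self._accepting:
            self._stored.update(message.scores)
            too_many = len(self._stored) >= self._max_stored
            too_old = self._clock() - self._last_export >= self._max_age_sec
            if too_many or too_old:
                self._flush()
        elif isinstance(message, Shutdown):
            self._accepting = False
            self._flush()  # assumption: upload leftovers before stopping

    def _flush(self):
        if self._stored:
            self._upload(dict(self._stored))  # upload what's stored...
        self._stored.clear()                  # ...then erase it
        self._last_export = self._clock()
```

Because an actor handles one message at a time, `_stored` and `_last_export` need no locks, and passing the clock in as a parameter makes the "it's been a while" branch easy to exercise in a test.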
Scala+Akka Pros
● Easy to get productive in the Scala language
● SBT dependency management makes it easy to move to any box with a JRE
● No global interpreter lock!
Scala+Akka Cons
● Moderate Scala learning curve
● Object representation on the JVM has pretty lousy memory efficiency
● Not a lot of great options for building models in Scala (compared to R, Python, Julia)
[A] Sometimes, data scientists need to worry about throughput.
[B] One way to increase throughput is with concurrency.
[C] The Actor Model is an easy way to build a concurrent system.
[D] Scala+Akka provides an easy-to-use Actor Model context.
[A + B + C + D ⇒ Z] Data scientists should check out Scala+Akka.
Thanks! Questions?
bgawalt@{upwork, gmail}.com
twitter.com/bgawalt