Large-Scale CTR Prediction: Lessons Learned (Florian Hartl)
pydata
CTR Prediction
CTR: Click-Through Rate
pCTR: predicted CTR
Question: How likely is the user to click on the ad?
Why: Proxy for relevance
[Figure: example CTR values of 5.5%, 0.8%, and 9.2%, plus one unknown (?)]
Current pCTR Model: Kuvasz
Logistic regression with thousands of features, trained and tested on millions of samples.
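To make the model description concrete, here is a minimal sketch of how a logistic regression turns a sparse feature dictionary into a pCTR. The feature names, weights, and bias are hypothetical and chosen only for illustration; they are not taken from the talk.

```python
import math

def predict_ctr(weights, bias, features):
    """Return pCTR = sigmoid(bias + sum of weight * value) for a sparse feature dict."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and features, for illustration only.
weights = {"ad_category_match": 1.2, "distance_km": -0.3}
bias = -3.0
features = {"ad_category_match": 1.0, "distance_km": 2.0}

pctr = predict_ctr(weights, bias, features)
print(f"pCTR = {pctr:.4f}")
```

With thousands of mostly absent features, a dictionary lookup per present feature keeps scoring cheap; unknown features simply contribute zero.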
pCTR Model History
(CC) from Flickr: "Wednesday Freedom 11"by Parker Knight
(CC) from Flickr: "Icelandig sheepdog" by Thomas Quine
(CC) from Flickr: by Craige Moore
Model names: French Brittany, Icelandic Sheepdog, Jindo, Kuvasz
(CC) from Flickr: "The huge crossing" by Miroslav Petrasko
Infrastructure
(CC) from Flickr: "KOGI and WEL" by luckyno3
Logging
Log at source of online prediction
→ Prevents downstream modifications of data
[Diagram labels: user, feedback, service, logs]
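One way to read "log at source" is to write the exact model inputs and output in the same call that produces the prediction. The sketch below assumes a JSON-lines log and a hypothetical `predict_and_log` helper; the record fields are illustrative, not the schema used in the talk.

```python
import io
import json
import time

def predict_and_log(model_version, features, predict_fn, log_file):
    """Score one request and log inputs and output at the point of prediction."""
    pctr = predict_fn(features)
    record = {
        "ts": time.time(),
        "model_version": model_version,
        "features": features,  # exactly what the model saw
        "pctr": pctr,          # exactly what the model returned
    }
    log_file.write(json.dumps(record, sort_keys=True) + "\n")
    return pctr

log = io.StringIO()  # stand-in for a real log sink
pctr = predict_and_log("demo-model", {"distance_km": 2.0}, lambda f: 0.05, log)
logged = json.loads(log.getvalue())
```

Because the record is written before any other system touches the data, offline training can use the same feature values the online model used.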
Iterations
Make offline training iterations fast & scalable
Automation is key
→ end-to-end pipeline
→ automated visualizations
Tools: mrjob, Spark
[Diagram labels: data, model, logs, prediction verification; fast, scalable]
Offline Training at Yelp
Pipeline stages: merge logs → sampling → feature extraction → model training → evaluation
Tools: mrjob on AWS EMR, Spark
[Diagram labels: daily scheduled pipeline, kicked off manually, new features]
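The five stages can be sketched end to end in plain Python. This is a toy, single-machine illustration under stated assumptions: the data is synthetic, the single `relevance` feature is hypothetical, sampling is uniform, and training is plain batch gradient descent. The pipeline described in the talk runs on mrjob, AWS EMR, and Spark, none of which is used here.

```python
import math
import random

def merge_logs(impressions, clicked_ids):
    """Label each impression with whether a click was logged for it."""
    clicked = set(clicked_ids)
    return [dict(imp, label=int(imp["id"] in clicked)) for imp in impressions]

def sample(rows, rate, rng):
    """Keep a uniform random fraction of the rows."""
    return [r for r in rows if rng.random() < rate]

def extract_features(rows):
    """Turn raw rows into ([bias_term, relevance], label) pairs."""
    return [([1.0, r["relevance"]], r["label"]) for r in rows]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, lr=0.5):
    """Fit logistic regression weights with batch gradient descent."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in examples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for i, xi in enumerate(x):
                grad[i] += (p - y) * xi
        w = [wi - lr * g / len(examples) for wi, g in zip(w, grad)]
    return w

def evaluate(w, examples, eps=1e-12):
    """Mean log loss of the model on the given examples."""
    total = 0.0
    for x, y in examples:
        p = min(max(sigmoid(sum(wi * xi for wi, xi in zip(w, x))), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(examples)

def run_pipeline(impressions, clicked_ids, seed=0):
    rng = random.Random(seed)
    rows = sample(merge_logs(impressions, clicked_ids), 0.5, rng)
    examples = extract_features(rows)
    w = train(examples)
    # Evaluated on the training examples for brevity; a real pipeline
    # would evaluate on held-out data.
    return w, evaluate(w, examples)

# Synthetic logs: click probability grows with the hypothetical relevance feature.
data_rng = random.Random(42)
impressions = [{"id": i, "relevance": data_rng.random()} for i in range(500)]
clicked_ids = [imp["id"] for imp in impressions
               if data_rng.random() < 0.05 + 0.3 * imp["relevance"]]

weights, log_loss = run_pipeline(impressions, clicked_ids)
print(weights, log_loss)
```

Wrapping the stages in one `run_pipeline` call is the point of the sketch: a new feature only requires changing `extract_features` and rerunning a single command.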
(CC) from Flickr: "Cloud" by Jason Pratt
Lessons Learned
Infrastructure
  Log at source of online prediction
  Verify predictions
  Make offline iterations fast & scalable
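"Verify predictions" can be read as recomputing each logged prediction offline and flagging disagreements. The sketch below assumes log records that carry the features and the online pCTR; the `verify_predictions` helper, the toy model, and the tolerance are all hypothetical.

```python
import math

def verify_predictions(records, offline_predict, tol=1e-6):
    """Return (record, offline_pctr) pairs where online and offline pCTR disagree."""
    mismatches = []
    for rec in records:
        offline = offline_predict(rec["features"])
        if abs(offline - rec["pctr"]) > tol:
            mismatches.append((rec, offline))
    return mismatches

def offline_predict(features):
    """Toy offline model used only for this illustration."""
    z = -3.0 - 0.3 * features.get("distance_km", 0.0)
    return 1.0 / (1.0 + math.exp(-z))

records = [
    # Online value matches the offline recomputation.
    {"features": {"distance_km": 2.0}, "pctr": offline_predict({"distance_km": 2.0})},
    # Online value deliberately wrong, to show a detected mismatch.
    {"features": {"distance_km": 5.0}, "pctr": 0.5},
]
mismatches = verify_predictions(records, offline_predict)
```

A non-empty result points at a difference between the online and offline code paths, such as a feature computed differently in each.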
Evaluation
Focus on a single metric (but don't trust it blindly)
Create helpful visualizations
Tools: Zeppelin
[Diagram labels: data, model, prediction verification, evaluation; fast, scalable]
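The slides do not name the metric, so the sketch below assumes mean log loss as the single metric and adds two cheap checks against trusting it blindly: a comparison with a constant base-rate predictor and the ratio of mean pCTR to observed CTR. The labels and predictions are made up.

```python
import math

def log_loss(labels, preds, eps=1e-12):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def mean_bias(labels, preds):
    """Mean pCTR divided by observed CTR; 1.0 means calibrated on average."""
    return (sum(preds) / len(preds)) / (sum(labels) / len(labels))

# Made-up labels (1 = click) and predictions, for illustration only.
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
preds = [0.6, 0.1, 0.2, 0.1, 0.5, 0.1, 0.1, 0.2, 0.1, 0.1]

base_rate = sum(labels) / len(labels)
model_loss = log_loss(labels, preds)
baseline_loss = log_loss(labels, [base_rate] * len(labels))
print(model_loss, baseline_loss, mean_bias(labels, preds))
```

A model that cannot beat the base-rate baseline, or whose mean bias drifts far from 1.0, deserves a closer look even if the headline metric improved.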
Visualizations...
Feature contributions
sd(feature) * coef
[Chart: feature contribution for feature 1, feature 2, feature 3, ...]
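The formula sd(feature) * coef can be computed directly from a sample of training rows and the fitted coefficients. The sketch below assumes the population standard deviation and uses made-up rows and coefficients; the feature names follow the placeholder names on the slide.

```python
import statistics

def feature_contributions(rows, coefs):
    """Return {feature: sd(feature) * coef}, ordered by absolute contribution."""
    contrib = {}
    for name, coef in coefs.items():
        values = [row.get(name, 0.0) for row in rows]
        contrib[name] = statistics.pstdev(values) * coef
    return dict(sorted(contrib.items(), key=lambda kv: abs(kv[1]), reverse=True))

# Made-up data: feature_2 has the larger coefficient but never varies.
rows = [
    {"feature_1": 0.0, "feature_2": 10.0},
    {"feature_1": 1.0, "feature_2": 10.0},
    {"feature_1": 0.0, "feature_2": 10.0},
    {"feature_1": 1.0, "feature_2": 10.0},
]
coefs = {"feature_1": 2.0, "feature_2": 5.0}

contributions = feature_contributions(rows, coefs)
print(contributions)
```

Scaling by the standard deviation makes coefficients comparable across features with different units: a large coefficient on a feature that barely varies moves predictions very little.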
Feature value vs. CTR count
[Chart axes: feature value, CTR]
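The data behind such a chart is a table of impression count and CTR per feature-value bin. The sketch below assumes fixed-width bins and a hypothetical `distance_km` feature with made-up rows; plotting is left out.

```python
from collections import defaultdict

def ctr_by_feature_value(rows, feature, bin_width):
    """Return {bin_start: (impression_count, ctr)} for one numeric feature."""
    counts = defaultdict(lambda: [0, 0])  # bin_start -> [impressions, clicks]
    for row in rows:
        bin_start = (row[feature] // bin_width) * bin_width
        counts[bin_start][0] += 1
        counts[bin_start][1] += row["clicked"]
    return {b: (n, clicks / n) for b, (n, clicks) in sorted(counts.items())}

# Made-up impressions, for illustration only.
rows = [
    {"distance_km": 0.5, "clicked": 1},
    {"distance_km": 0.7, "clicked": 0},
    {"distance_km": 1.2, "clicked": 0},
    {"distance_km": 1.9, "clicked": 0},
    {"distance_km": 1.5, "clicked": 1},
    {"distance_km": 1.1, "clicked": 0},
]
table = ctr_by_feature_value(rows, "distance_km", bin_width=1.0)
print(table)
```

Keeping the count next to the CTR matters: a bin with a striking CTR but only a handful of impressions is mostly noise.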
Thresholds
Beware of biased training data
→ offline != online
→ pCTR threshold
[Diagram labels: user, feedback, service, logs]
pCTR Threshold
[Chart labels: time, training data, Model 1, Model 2, Model 3, Model 4; CTR, pCTR]
Idea: Frequent retraining
Better: Deliberate sampling of bad ads
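Assuming the threshold means that ads with a pCTR below it are not shown, those ads never produce feedback, and the training data only covers what earlier models already liked. The slide's remedy is deliberate sampling of bad ads; the sketch below shows one possible mechanism, which is an assumption and not the talk's implementation: show below-threshold ads with a small probability and record an inverse-probability weight for later training.

```python
import random

def should_show(pctr, threshold, explore_rate, rng):
    """Return (show, weight); below-threshold ads are shown with probability explore_rate."""
    if pctr >= threshold:
        return True, 1.0
    if rng.random() < explore_rate:
        # Weight = 1 / sampling probability, so each explored ad stands in
        # for the below-threshold ads that were not shown.
        return True, 1.0 / explore_rate
    return False, 0.0

rng = random.Random(0)
threshold, explore_rate = 0.02, 0.1  # hypothetical values

shown = []
for pctr in [0.01, 0.05] * 1000:  # one "bad" and one "good" ad, 1000 requests each
    show, weight = should_show(pctr, threshold, explore_rate, rng)
    if show:
        shown.append((pctr, weight))

below = [(p, w) for p, w in shown if p < threshold]
above = [(p, w) for p, w in shown if p >= threshold]
print(len(above), len(below))
```

The explored impressions give the next model labeled examples from the region the threshold would otherwise hide, at the cost of occasionally showing a weak ad.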
Lessons Learned
Infrastructure
  Log at source of online prediction
  Verify predictions
  Make offline iterations fast & scalable
Model Comprehension
  Evaluate, evaluate, evaluate
  Be aware of threshold effects
[Diagram labels: user, feedback, service; online, offline; data, model, logs, prediction verification, evaluation; fast, scalable]
Simplicity
rule-based approach
simple models
Occam's razor
appropriate metric
documentation
"Simple Made Easy"
[Diagram labels: user, feedback, service; online, offline; data, model, logs, prediction verification, evaluation; fast, scalable, well documented; simplicity]
Lessons Learned
Infrastructure
  Log at source of online prediction
  Verify predictions
  Make offline iterations fast & scalable
Model Comprehension
  Evaluate, evaluate, evaluate
  Be aware of threshold effects
Above all, keep it simple.