RecSys 2015: Large-scale real-time product recommendation at Criteo
Copyright © 2015 Criteo
Large-Scale Real-Time Product Recommendation at Criteo
Romain Lerallut, Diane Gasselin
RecSys Vienna, Sept 18, 2015
« The largest internet company you’ve never heard of »
• Founded in 2005, in the adtech business since 2008
• Recommendation was our first product
• Disruptive business models
• 1,700 people worldwide (more than 50% joined less than a year ago)
• 300+ engineers
• 26 offices
• Live in 130 countries
• 1B unique users
We buy
• Inventory ! (ad spaces)
• Billions of times a day
• All over the Internet
• For 95% of the population
=> Funding the Web
A technology company first and foremost
We sell
• Clicks!
• (that convert)
• (that convert a lot)
=> Delight for our clients!
We take the risk
You pay only for what you get
Learn on huge volumes of data
10 000 displays
leads to
50 clicks
leads to
1 sale
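The funnel implies very small rates, which is why such large volumes are needed to learn anything. A quick check of the arithmetic, using only the figures on the slide (the variable names are illustrative):

```python
# Funnel figures taken from the slide.
displays = 10_000
clicks = 50
sales = 1

click_through_rate = clicks / displays   # clicks per display
conversion_rate = sales / clicks         # sales per click
sales_per_display = sales / displays     # sales per display

print(f"Click-through rate: {click_through_rate:.2%}")
print(f"Conversion rate:    {conversion_rate:.2%}")
print(f"Sales per display:  {sales_per_display:.4%}")
```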
Physical infrastructure
7 in-house data centers on 3 continents
~ 15000 servers, largest Hadoop cluster in Europe
More than 35 PB of storage (Big Data)
Traffic
800k HTTP requests / sec (peak activity)
29000 impressions / sec (peak activity)
<10 ms to process RTB request
<100 ms to process reco request
(Big) Data Sources
Ad display data: 20B events / day
User behavior data: 2B events / day
Catalog data: 1M+ products / client
10k clients
How do we do it ?
Recommend products for a user
• What we want: reco(user) = products
• But 1B users x 3B products !
• And we need to scale and keep it fresh
• What we can do :
• Pre-select products offline (source)
• Refine recommendation online
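To see why scoring every user against every product is out of reach, here is a back-of-the-envelope sketch. The user and product counts come from the slide; the 4 bytes per stored score is an assumption made only for this estimate.

```python
# Figures from the slide.
users = 1_000_000_000       # 1B users
products = 3_000_000_000    # 3B products

pairs = users * products    # every (user, product) combination

# Assumption: one 4-byte score (e.g. a 32-bit float) per pair.
BYTES_PER_SCORE = 4
PETABYTE = 10**15
table_pb = pairs * BYTES_PER_SCORE / PETABYTE

print(f"{pairs:.1e} pairs, about {table_pb:,.0f} PB at {BYTES_PER_SCORE} bytes per score")
```

Under that assumption the full table would be several hundred times larger than the 35 PB of storage quoted earlier, before even considering how often it would need to be refreshed. Hence the split into offline pre-selection and online refinement.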
Offline: prepare sources
[Diagram: advertiser events are turned into sources in two ways]
• Co-events (Item View – Item View, Item Sale – Item Sale): similarities, complementarities
• Top N: best of, best of by category
• Figures shown on the slide: 50B; 350M keys / 12B values; 50M keys / 1B values
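The co-event idea can be sketched in memory as follows. This is a minimal illustration, not the production pipeline (which runs as Map-Reduce jobs): it assumes events are already grouped per user, and it uses raw co-occurrence counts as a stand-in because the slides do not specify the similarity measure. The function name and toy data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def top_n_co_viewed(views_by_user, n=2):
    """For each product, return the n products most often viewed by the same users.

    views_by_user: dict mapping a user id to the list of product ids viewed.
    Raw co-occurrence counts stand in for the (unspecified) similarity measure.
    """
    co_counts = defaultdict(Counter)
    for viewed in views_by_user.values():
        # Count each unordered pair of distinct products once per user.
        for a, b in combinations(sorted(set(viewed)), 2):
            co_counts[a][b] += 1
            co_counts[b][a] += 1

    # Keep only the top n neighbours per product: one key, a short list of values.
    return {
        product: [other for other, _ in neighbours.most_common(n)]
        for product, neighbours in co_counts.items()
    }


# Toy data (hypothetical users and products).
views = {
    "u1": ["orange_shoes", "red_shoes", "socks"],
    "u2": ["orange_shoes", "red_shoes"],
    "u3": ["orange_shoes", "socks", "hat"],
    "u4": ["orange_shoes", "red_shoes"],
}
similar = top_n_co_viewed(views, n=2)
print(similar["orange_shoes"])
```

The result has the same key/value shape as the sources on the slide: one key per product, and a truncated list of related products as values.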
Offline: prepare sources
Example: user X saw orange shoes. Some candidate products for user X:
• Historical
• Similar
• Complementary
• Best-of (other users: most viewed products on the client website)
Reco overview
[Diagram. Components: advertiser events; source computation (OFFLINE, Map-Reduce jobs); sources; catalog; prediction models; display, click, and sale logs; Recommendation Service. Figures shown on the slide: 12h, 4h, 6h; 4.5B; 500M; 50B; 100K qps]
ML model
• Logistic regression models because:
• They scale
• They are fast
• They can handle lots of features (with a bit of magic)
Features: product-specific, user-specific, user-product interactions, display-specific
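The slides do not say what the "bit of magic" is. One common way to let logistic regression handle a very large number of sparse categorical features is the hashing trick, and the sketch below assumes that technique purely for illustration. The feature names, table size, and learning rate are all hypothetical.

```python
import math
import zlib

NUM_BUCKETS = 2**20  # assumed size of the hashed weight vector


def hashed_indices(features):
    """Map each 'name=value' pair to a weight index with a stable hash."""
    return [
        zlib.crc32(f"{name}={value}".encode("utf-8")) % NUM_BUCKETS
        for name, value in features.items()
    ]


def predict(weights, bias, features):
    """Logistic regression: sigmoid of the summed weights of the active features."""
    z = bias + sum(weights.get(i, 0.0) for i in hashed_indices(features))
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(weights, bias, features, label, lr=0.1):
    """One stochastic gradient step on the log loss; returns the updated bias."""
    error = predict(weights, bias, features) - label
    for i in hashed_indices(features):
        weights[i] = weights.get(i, 0.0) - lr * error
    return bias - lr * error


# Hypothetical example with one feature from each family on the slide.
example = {
    "product_category": "shoes",     # product-specific
    "user_recency_days": "3",        # user-specific
    "user_viewed_product": "yes",    # user-product interaction
    "banner_size": "300x250",        # display-specific
}
weights, bias = {}, 0.0
before = predict(weights, bias, example)
for _ in range(20):
    bias = sgd_step(weights, bias, example, label=1)
after = predict(weights, bias, example)
print(before, after)
```

Because the weights live in a sparse dictionary indexed by hash bucket, new feature values need no dictionary rebuild, and prediction costs one lookup per active feature.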
Online: sources
Similarities Most viewed Most bought
Online: merge of products
Similarities Most viewed Most bought
Online: scoring
Similarities Most viewed Most bought
0.02 0.12 0.06 0.18 0.03 0.05 0.01 0.005 0.011 0.013 0.004 0.007
Online: scoring (sorted by score)
Similarities Most viewed Most bought
0.18 0.12 0.06 0.05 0.03 0.02 0.013 0.011 0.01 0.007 0.005 0.004
Online: candidates
0.18 0.12 0.06 0.05 0.03 0.02 0.013 0.011 0.01 0.007 0.005 0.004
[Figure: ad banner with four SHOP buttons and a -50% label]
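The online steps on these slides (fetch the sources, merge, score, sort, keep the best candidates) can be sketched as follows. The product ids are hypothetical, and the twelve scores from the slide are assigned to them in slide order only for illustration; in production the scores would come from the prediction model. The four banner slots are an assumption based on the four SHOP buttons.

```python
def recommend(sources, score, banner_slots=4):
    """Merge candidate lists, score each product once, and return the best ones."""
    # Merge: union of all sources, dropping duplicates but keeping first-seen order.
    merged = list(dict.fromkeys(p for products in sources.values() for p in products))
    # Score and sort by decreasing score.
    ranked = sorted(merged, key=score, reverse=True)
    return ranked[:banner_slots]


# Hypothetical candidates per source.
sources = {
    "similarities": ["p1", "p2", "p3", "p4"],
    "most_viewed": ["p5", "p6", "p7", "p8"],
    "most_bought": ["p9", "p10", "p11", "p12"],
}
# Scores from the slide, in slide order.
slide_scores = [0.02, 0.12, 0.06, 0.18, 0.03, 0.05,
                0.01, 0.005, 0.011, 0.013, 0.004, 0.007]
score_table = {f"p{i + 1}": s for i, s in enumerate(slide_scores)}

banner = recommend(sources, score_table.get, banner_slots=4)
print(banner)
```

Deduplicating before scoring matters because the same product can appear in several sources, and each model call counts against the latency budget.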
Evaluation
The basics: online A/B testing
• It is the only truth we have
• 50% of users on model A
• 50% of users on model B
• But it is onerous:
• If the new model is not good, we lose money, fast!
• Tests are long (~2 weeks needed to get good confidence intervals)
• Code has to be production-ready (no bugs, good performance), since we run 24/7
• Can be heavy on the infrastructure
• And it does not take long-term effects into account
[Figure: two "My company" banners with BUY! buttons, one per model]
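Two pieces of an online A/B test can be sketched briefly: a deterministic 50/50 user split, and a confidence interval on the difference in sale rates. This is a minimal illustration under stated assumptions, not Criteo's test framework: the salt, function names, and all counts are made up, and the interval uses the standard normal approximation for two proportions.

```python
import hashlib
import math


def assign_group(user_id, salt="test-42"):
    """Deterministically put a user in group A or B (about 50% each)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return "A" if digest[0] % 2 == 0 else "B"


def diff_confidence_interval(sales_a, n_a, sales_b, n_b, z=1.96):
    """Normal-approximation 95% interval for (rate_b - rate_a)."""
    rate_a, rate_b = sales_a / n_a, sales_b / n_b
    std_err = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    diff = rate_b - rate_a
    return diff - z * std_err, diff + z * std_err


# Made-up counts: the same 10% observed uplift on a small and a large sample.
small = diff_confidence_interval(100, 1_000_000, 110, 1_000_000)
large = diff_confidence_interval(10_000, 100_000_000, 11_000, 100_000_000)
print(small)
print(large)
```

With roughly one sale per 10,000 displays, the interval on the smaller sample still contains zero, while the same uplift on a sample 100 times larger does not. This is one way to see why tests need to run for a long time.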
The test framework for prediction
• ALTERNATIVE: a framework that replays production logs (offline)
• 30,000 tests / year
• Replay ~x100
• BUT: we only have data on the products we displayed (exploration is costly)
• SO: we can only make sure we are not completely mistaken
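A minimal sketch of the replay idea: run a candidate model over logged displays and compare its predictions with what actually happened. The record layout and both toy models are hypothetical, and log loss is used only as an example metric, since the slides do not say which metrics the framework reports.

```python
import math


def replay_log_loss(model, logged_displays):
    """Average log loss of a click model over logged (features, clicked) pairs."""
    total = 0.0
    for features, clicked in logged_displays:
        # Clamp the prediction to avoid log(0).
        p = min(max(model(features), 1e-12), 1 - 1e-12)
        total += -math.log(p) if clicked else -math.log(1 - p)
    return total / len(logged_displays)


# Hypothetical production log: only products that were actually displayed appear.
log = [
    ({"viewed_before": True}, 1),
    ({"viewed_before": True}, 0),
    ({"viewed_before": False}, 0),
    ({"viewed_before": False}, 0),
]


def model_a(features):
    """Constant baseline: the overall click rate in the log."""
    return 0.25


def model_b(features):
    """Toy model that uses one feature."""
    return 0.5 if features["viewed_before"] else 0.05


loss_a = replay_log_loss(model_a, log)
loss_b = replay_log_loss(model_b, log)
print(loss_a, loss_b)
```

The limitation stated on the slide is visible in the data structure itself: the log contains no outcome for products the production system chose not to display, so a model that would have recommended them cannot be fully judged this way.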
Ultimate solution: offline ab-testing
• Find the best offline predictor for online performance
• Counterfactual Reasoning and Learning Systems
Léon Bottou (Microsoft Research, Redmond, WA), Jonas Peters (Max Planck Institute, Tübingen),
Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly,
Dipankar Ray, Patrice Simard, Ed Snelson
• But we haven’t succeeded in making it precisely match reality... yet
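A central tool in the cited paper is importance sampling: each logged outcome is reweighted by the ratio between the probability that the new policy would have made the logged choice and the probability that the production policy made it. A minimal sketch of that estimator follows, assuming the display probabilities were logged. The record layout, policy, and numbers are hypothetical, and this is not presented as Criteo's implementation.

```python
def ips_estimate(logs, new_policy_prob):
    """Inverse-propensity estimate of the new policy's average reward.

    logs: iterable of (context, action, reward, logging_prob), where
          logging_prob is the probability the production policy chose action.
    new_policy_prob(context, action): probability the new policy chooses action.
    """
    total = 0.0
    count = 0
    for context, action, reward, logging_prob in logs:
        weight = new_policy_prob(context, action) / logging_prob
        total += weight * reward
        count += 1
    return total / count


# Hypothetical logs: production showed product "a" or "b" with equal probability.
logs = [
    ("user1", "a", 1.0, 0.5),
    ("user2", "a", 0.0, 0.5),
    ("user3", "b", 0.0, 0.5),
    ("user4", "b", 0.0, 0.5),
]


def always_a(context, action):
    """Candidate policy that always shows product "a"."""
    return 1.0 if action == "a" else 0.0


estimate = ips_estimate(logs, always_a)
print(estimate)
```

The weights become large, and the estimate noisy, whenever the production policy rarely made the choices the new policy prefers, which is one reason such estimates can drift from what an online test later measures.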
What’s next ?
What’s next for us: upcoming challenges
• Long(er)-term user profiles
• More and better product information (images, semantics, NLP)
• Instant update of similarities
• (because batch computation is soooo last year)
• Joint product scoring
• (score the full banner rather than each product independently)
What’s next for you: fancy a try?
With us!
http://labs.criteo.com/jobs/
On your own:
• We published datasets for click prediction
• 4 GB of display-click data: Kaggle challenge in 2014, http://bit.ly/1vgw2XC
• 1 TB of display-click data (the industry’s largest dataset): http://bit.ly/1PyH4Vq
• 4 billion observations
• 156 billion feature-values
• available on Microsoft Azure
• used by edX (UC Berkeley)
• We would be happy to share reco-centric data!
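For readers who want to try the Kaggle dataset, here is a minimal parsing sketch. It assumes the layout described for that challenge (tab-separated lines with a 0/1 click label, 13 integer features, then 26 hashed categorical features, with empty fields for missing values). That layout is not stated in these slides, so check the dataset's own documentation before relying on it. The sample line is synthetic.

```python
NUM_INT = 13   # assumed number of integer features
NUM_CAT = 26   # assumed number of categorical features


def parse_line(line):
    """Split one tab-separated record into (label, integer features, categorical features)."""
    fields = line.rstrip("\n").split("\t")
    expected = 1 + NUM_INT + NUM_CAT
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    label = int(fields[0])
    # Empty fields are treated as missing values.
    ints = [int(v) if v else None for v in fields[1:1 + NUM_INT]]
    cats = [v if v else None for v in fields[1 + NUM_INT:]]
    return label, ints, cats


# Synthetic sample line: one missing integer and one missing categorical value.
sample = "\t".join(["1"] + ["5"] * 12 + [""] + ["a1b2c3d4"] * 25 + [""]) + "\n"
label, ints, cats = parse_line(sample)
print(label, ints[:3], cats[:2])
```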
Questions?