Contextual Advertising by Combining Relevance with Click Feedback
D. Chakrabarti, D. Agarwal, V. Josifovski
Motivation
Match ads to queries.
Sponsored Search: the query is a short piece of text input by the user.
Content Match: the query is a webpage on which ads can be displayed.
Motivation
Relevance-based
1. Uses IR measures of match (cosine similarity, BM25)
2. Uses domain knowledge
3. Gives a score
Click-based
1. Uses ML methods to learn a good matching function (e.g., Maximum Entropy)
2. Uses existing data; improves over time
3. Typically gives a probability of click
Motivation
Relevance-based
4. Very low training cost: at most one or two params, which can be set by cross-validation
5. Simple computations at testing time, using the Weighted AND (WAND) algorithm
Click-based
4. Training is complicated: scalability concerns; extremely imbalanced class sizes; problems interpreting non-clicks; sampling methods heavily affect accuracy
5. All features must be computed at test time; good feature engineering is critical
Motivation
Relevance-based: uses domain knowledge; very low training cost; simple computations at testing time.
Click-based: uses existing data, improves over time; training is complicated; efficiency concerns during testing.
Combine the two: get the benefits of both, while controlling the drawbacks.
Motivation
We want a system for computing matches over all ads (~millions), NOT a re-ranking of filtered results of some other matching algo.
Training: can be done offline; should be parallelizable (for scalability).
Testing: must be as fast and scalable as WAND, with accurate results.
WAND Background
Query = Red Ball
Word posting lists, one cursor per word:
Red → Ad 1, Ad 5, Ad 8
Ball → Ad 7, Ad 8, Ad 9
Cursors skip over ads that cannot match both words; candidate results = Ad 8, …
More generally, query words are weighted, and upper bounds on the score are computed to decide skips.
WAND Background
Efficiency comes from cursor skipping, so upper bounds must be computable quickly.
The match scoring formula should not use features of the form ("word X in query AND word Y in ad"): such pairwise ("cross-product") checks can become very costly.
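The cursor-skipping idea above can be sketched as follows. This is a minimal, simplified WAND: per-term score upper bounds, a fixed threshold, and no final full-scoring step; the production implementation differs.

```python
# Minimal WAND-style cursor skipping (illustrative sketch, not the
# production system). Each term has a doc-id-sorted posting list and a
# precomputed upper bound on its score contribution. A document is only
# emitted as a candidate when the accumulated upper bounds of the terms
# that could match it exceed the threshold.

def wand_candidates(postings, upper_bounds, threshold):
    """postings: {term: sorted list of doc ids}
    upper_bounds: {term: max score contribution of this term}
    Yields candidate doc ids whose upper-bound score clears the threshold."""
    cursors = {t: 0 for t in postings}
    while True:
        # Terms whose cursor has not run off the end, sorted by current doc id
        live = [(postings[t][cursors[t]], t) for t in postings
                if cursors[t] < len(postings[t])]
        if not live:
            return
        live.sort()
        # Find the pivot: the first doc id at which the accumulated
        # upper bounds could exceed the threshold
        acc, pivot_doc = 0.0, None
        for doc, t in live:
            acc += upper_bounds[t]
            if acc > threshold:
                pivot_doc = doc
                break
        if pivot_doc is None:
            return  # even all remaining terms together cannot beat the threshold
        if live[0][0] == pivot_doc:
            yield pivot_doc  # candidate: would be fully scored in a real system
            for doc, t in live:
                if doc == pivot_doc:
                    cursors[t] += 1
        else:
            # Skip: advance lagging cursors directly to the pivot doc
            for doc, t in live:
                if doc < pivot_doc:
                    lst = postings[t]
                    while cursors[t] < len(lst) and lst[cursors[t]] < pivot_doc:
                        cursors[t] += 1
```

On the slide's example (Red → Ads 1, 5, 8; Ball → Ads 7, 8, 9), with equal unit upper bounds and a threshold requiring both words, the cursors skip Ads 1, 5, and 7 and only Ad 8 surfaces as a candidate.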
Proposed Method
Only use features of the form ("word X in both query AND ad").
Learn to predict click data using such features.
Add in some function of IR scores as extra features. What function?
Proposed Method
A logistic regression model for CTR
CTR is modeled as the combination of three components, each with its own model parameters:
Main effect for the page (how good is the page)
Main effect for the ad (how good is the ad)
Interaction effect (words shared by page and ad)
Proposed Method
M_{p,w} = tf_{p,w}
M_{a,w} = tf_{a,w}
I_{p,a,w} = tf_{p,w} × tf_{a,w}
So, IR-based term-frequency measures are taken into account.
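Putting the three components together, the model can be sketched as follows. This is a hedged reconstruction from the feature definitions above: the intercept μ and the per-word coefficient names φ_w, ψ_w, δ_w are illustrative, not necessarily the paper's notation; p_{p,a} denotes the click probability of ad a on page p.

```latex
\operatorname{logit}(p_{p,a}) \;=\; \mu
  \;+\; \sum_{w} \phi_w\, M_{p,w}      % page main effect
  \;+\; \sum_{w} \psi_w\, M_{a,w}      % ad main effect
  \;+\; \sum_{w} \delta_w\, I_{p,a,w}  % page-ad interaction
```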
Proposed Method
Four sources of complexity:
1. Adding in IR scores
2. Word selection for efficient learning
3. Finer resolutions than page-level or ad-level
4. Fast implementation for training and testing
Proposed Method
How can IR scores fit into the model? What is the relationship between logit(p_ij) and the cosine score?
[Plot: logit(p_ij) vs. cosine score, showing a quadratic relationship]
Proposed Method
How can IR scores fit into the model? The quadratic relationship can be used in two ways:
Put in cosine and cosine^2 as features
Use it as a prior
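The first option, adding cosine and cosine² as features, is mechanically simple. A minimal sketch (function name and list-of-lists feature layout are illustrative, not from the paper):

```python
# Augment per-(page, ad) feature rows with the IR cosine score and its
# square, exploiting the observed quadratic relationship between
# logit(CTR) and cosine similarity.

def add_cosine_features(rows, cosine_scores):
    """rows: list of feature lists, one per (page, ad) example.
    cosine_scores: precomputed cosine score for each example."""
    return [row + [c, c * c] for row, c in zip(rows, cosine_scores)]
```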
Proposed Method
We tried both ways of using the quadratic relationship, and they give very similar results.
Proposed Method
Four sources of complexity:
1. Adding in IR scores
2. Word selection for efficient learning
3. Finer resolutions than page-level or ad-level
4. Fast implementation for training and testing
Proposed Method
Word selection
Overall, there are nearly 110k words in the corpus. Learning parameters for each word would be very expensive, would require a huge amount of data, and would suffer from diminishing returns.
So we want to select the ~1k top words which will have the most impact.
Proposed Method
Word selection: two methods
Data-based: define an interaction measure for each word, with higher values for words which have a higher-than-expected CTR when they occur on both page and ad.
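One simple instance of such a data-based measure is the ratio of the CTR observed when the word occurs on both page and ad to the overall baseline CTR. The exact measure in the paper may differ; this sketch just captures the "higher-than-expected CTR" idea.

```python
# Hedged sketch of a data-based interaction measure: words scoring
# above 1.0 have higher-than-expected CTR when they occur on BOTH the
# page and the ad.

def interaction_measure(events, word):
    """events: iterable of (page_words, ad_words, clicked) tuples,
    with clicked in {0, 1}. Returns observed/baseline CTR ratio."""
    both_clicks = both_views = total_clicks = total_views = 0
    for page_words, ad_words, clicked in events:
        total_views += 1
        total_clicks += clicked
        if word in page_words and word in ad_words:
            both_views += 1
            both_clicks += clicked
    if both_views == 0 or total_clicks == 0:
        return 0.0  # no co-occurrence data (or no clicks at all)
    observed_ctr = both_clicks / both_views
    baseline_ctr = total_clicks / total_views
    return observed_ctr / baseline_ctr
```

In practice this would need smoothing for rare words; the raw ratio is shown only to make the measure concrete.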
Proposed Method
Word selection: two methods
Relevance-based: compute the average tf-idf score of each word over all pages and ads; higher values imply higher relevance.
Proposed Method
Word selection: two methods, data-based and relevance-based.
We picked the top 1000 words by each measure. Data-based methods give better results.
[Precision-recall plot comparing the two word-selection methods]
Proposed Method
Four sources of complexity:
1. Adding in IR scores
2. Word selection for efficient learning
3. Finer resolutions than page-level or ad-level
4. Fast implementation for training and testing
Proposed Method
Finer resolutions than page-level or ad-level
The data has finer granularity: words are in "regions", such as title, headers, boldface, metadata, etc. Word matches in the title can be more important than in the body. This leads to a simple extension of the model to region-specific features.
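The extension can be sketched as keeping one interaction feature (and hence one learned weight) per (word, region) pair instead of one per word. The region names and the page-side-only region structure here are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch of region-specific interaction features: a word match
# in the title gets its own feature (and learned weight), separate from
# a match in the body.

REGIONS = ("title", "header", "body", "metadata")  # illustrative region set

def region_features(page_tf, ad_tf, selected_words):
    """page_tf: {region: {word: tf}} for the page.
    ad_tf: {word: tf} for the ad.
    Returns {(word, region): tf_page_region * tf_ad} interaction features."""
    feats = {}
    for region in REGIONS:
        region_tf = page_tf.get(region, {})
        for w in selected_words:
            if w in region_tf and w in ad_tf:
                feats[(w, region)] = region_tf[w] * ad_tf[w]
    return feats
```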
Proposed Method
Four sources of complexity:
1. Adding in IR scores
2. Word selection for efficient learning
3. Finer resolutions than page-level or ad-level
4. Fast implementation for training and testing
Proposed Method
Fast implementation. Training: a Hadoop implementation of logistic regression, fit by iterative Newton-Raphson.
The data is randomly split; each split yields mean and variance estimates of the parameters; these estimates are combined into the learned model params.
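The combine step can be sketched as follows. Inverse-variance weighting is one natural way to merge per-split estimates; the paper's exact combination rule may differ, so treat this as an assumption.

```python
# Hedged sketch of the reduce/combine step: each data split yields a
# per-parameter mean and variance; combine them with inverse-variance
# weights, so lower-variance (more certain) splits count more.

def combine_estimates(means, variances):
    """means, variances: lists of per-split parameter lists,
    shape (n_splits, n_params). Returns combined parameter list."""
    n_params = len(means[0])
    combined = []
    for j in range(n_params):
        num = sum(m[j] / v[j] for m, v in zip(means, variances))
        den = sum(1.0 / v[j] for v in variances)
        combined.append(num / den)
    return combined
```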
Proposed Method
Fast implementation. Testing (building the postings lists):
The main effect for ads is used in the ordering of ads in the postings list (static).
The interaction effect is used to modify the idf table of words (static).
The main effect for pages does not play a role in ad serving (the page is given).
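One way to read the two "static" bullets is that the learned weights are folded into the serving-side tables offline, so no new runtime code path is needed. The additive idf adjustment below is an assumption for illustration; the paper only says the idf table is "modified".

```python
# Hedged sketch: fold learned model weights into static serving tables.
# - the ad main effect becomes each ad's static score, used to order
#   ads within postings lists;
# - each word's interaction weight is folded into its idf entry
#   (additive adjustment assumed here for illustration).

def build_serving_tables(ad_main_effect, word_interaction, idf):
    """ad_main_effect: {ad: weight}; word_interaction: {word: weight};
    idf: {word: idf value}. Returns (static ad scores, adjusted idf)."""
    static_ad_score = dict(ad_main_effect)
    adjusted_idf = {w: idf[w] + word_interaction.get(w, 0.0) for w in idf}
    return static_ad_score, adjusted_idf
```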
Proposed Method
Fast implementation. Testing:
The model can be integrated into existing code, with no loss of performance or scalability of the existing system.
Proposed Method
Four sources of complexity:
1. Adding in IR scores
2. Word selection for efficient learning
3. Finer resolutions than page-level or ad-level
4. Fast implementation for training and testing
Experiments
[Precision-recall plot, with magnification of the low-recall region]
25% lift in precision at 10% recall.
Experiments
Increasing the number of words from 1000 to 3400 led to only marginal improvement (diminishing returns): the system already performs close to its limit, without needing more training.
Conclusions
Relevance-based: uses domain knowledge; very low training cost; simple computations at testing time.
Click-based: uses existing data, improves over time; training is complicated; efficiency concerns during testing.
Combine the two: parallel code for parameter fitting; uses the existing serving system, with no code changes or efficiency bottlenecks.