
Finding the Right Consumer: Optimizing for Conversion in Display Advertising Campaigns

Yandong Liu‡∗, Sandeep Pandey†, Deepak Agarwal†, Vanja Josifovski†

‡ Carnegie Mellon, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA
† Yahoo! Research, 4301 Great America Parkway, Santa Clara, CA 95054, USA

[email protected] | {spandey | dagarwal | vanjaj}@yahoo-inc.com

ABSTRACT

The ultimate goal of advertisers is conversions, representing desired user actions on the advertisers' websites in the form of purchases and product information requests. In this paper we address the problem of finding the right audience for display campaigns by finding the users that are most likely to convert. This challenging problem is at the heart of display campaign optimization and has to deal with several issues, such as the very small percentage of converters in the general population, the high-dimensional representation of user profiles, and the high churn rate of users and advertisers. To overcome these difficulties, our approach uses two sources of information: a seed set of users that have converted for a campaign in the past, and a description of the campaign based on the advertiser's website. We explore the importance of the information provided by each of these two sources in a principled manner and then combine them to propose models for predicting converters. In particular, we show how the seed set can be used to capture the campaign-specific targeting constraints, while the campaign metadata allows sharing targeting knowledge across campaigns. We give methods for learning these models and perform experiments on real-world advertising campaigns. Our findings show that the seed set and the campaign metadata are complementary to each other, and both sources provide valuable information for conversion optimization.

Categories and Subject Descriptors
H.4.m [Information Systems]: Miscellaneous

General Terms
Algorithms, Performance, Experimentation

Keywords
conversions, modeling, advertising

∗Work done while at Yahoo! Research

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
WSDM'12, February 8–12, 2012, Seattle, Washington, USA.
Copyright 2012 ACM 978-1-4503-0747-5/12/02 ...$10.00.

1. INTRODUCTION

Businesses, small and large alike, seek to expand by reaching out to users who can be their potential customers. Online advertising is becoming one of the main advertising channels, accounting for an estimated $80 billion in spending in 2011 and growing at a rate of 17% annually,^1 with display advertising being a big part of it. To make their campaigns effective, display advertisers target specific users based on their historical pattern of activity or behavior, i.e., behavioral targeting. Different advertisers target different kinds of users, e.g., a cellular company would be interested in users looking to subscribe to a cell phone plan or buy a handset, an online trading/investing company looks for finance-savvy users interested in buying/selling shares, and a travel agency wants to find customers for purchasing flight tickets and booking hotel rooms. While in some cases targeting criteria can be specified as a simple condition/function of past user behavior, to get the best performance, increasingly sophisticated modeling techniques are being applied to detect the behavior patterns indicative of user interest in a particular campaign/product.

Behavioral targeting aids advertisers in finding the right audience for their campaigns by characterizing and targeting users who fit their needs. Intermediaries such as ad networks, online exchanges, and demand-side platforms perform behavioral targeting by bringing together the three parties involved: users, publishers, and advertisers. While the complete details of the advertising ecosystem are beyond the scope of this paper, a simplified view showing the interaction between advertisers and publishers is given in Figure 1. Here we use the term ad broker to refer to the intermediaries that (a) facilitate the collection of user data to build profiles and (b) select the best advertising campaign to display on a given Web page being viewed (an impression) by a user on the publisher site. The ad broker constructs profiles for users based on their past online activities, such as Web pages viewed at the participating publishers, Web search queries, vertical searches, etc. These profiles are then leveraged to learn an advertiser-specific segment/model describing the desired target audience in terms of the pattern of user behavior. The focus of this paper is on producing such effective models for display advertising.

Most prior work on building behavioral targeting models focuses on maximizing clicks [6, 22], that is, constructing models to identify those users who are most likely to click on ads when shown.

^1 According to a study by emarketer.com.


Figure 1: Behavioral Targeting.

While clicks serve as a natural proxy for the user's interest in the advertised product, they can often be misleading, e.g., due to click fraud and bounce clicks [7, 12, 23]. Hence, in this work our goal is to target users for conversions, representing user activity on the advertiser's website beyond the ad click [2, 4, 14]. Conversions are a more tangible indication of user interest in the advertiser than clicks, separating incidental and casual interest from purchasing intent.

Building conversion models is extremely challenging for many reasons. Usually, only a very small portion of the users that click eventually convert; thus, conversions are very rare events. This constrains the modeling techniques to work parsimoniously with the data. On the user side, user profiles are high-dimensional, consisting of several different kinds of activities ranging from user demographics to search queries and page browsing. Dealing with such different activities in the presence of limited target/conversion information is non-trivial. To add to this, the data is highly volatile due to cookie churn, changes in campaigns, variability in user interests, and other temporal effects that do not allow accumulating long-standing data and require the modeling approach to have a quick start and to adapt dynamically over time as new data comes in.

Our Approach. In view of these challenges, we propose a novel approach for conversion prediction that relies on two distinct sources of information: (a) the metadata associated with the advertising campaign, such as the ad creative, the landing page, etc.; and (b) seed users who have converted or viewed ads for the advertiser in the past. These two sources are complementary in the way they guide the modeling process. For a new advertiser, since its ads have not yet been shown by the advertising network, the ad broker does not have any record of past converted users. As a result, initially, the ad broker must rely on the campaign's metadata to find the right users for it. The campaign metadata quickly helps in understanding what the advertising campaign is about and thus in identifying the potential targeting set. The metadata can be leveraged both in an unsupervised manner (e.g., target sports enthusiasts for sports-related advertising) and in a supervised manner (e.g., pool the seed sets of campaigns using campaign metadata; more details in Section 3).

Subsequently, when the network has shown ads to enough users for the campaign, some of these users will have converted. This information can be used to refine the initial model: the converted users make up the positive instances, while the unconverted users can be treated as the negatives. This can be modeled as a regression/classification task: predicting the conversion likelihood based on user profiles.

We give a principled approach to combining the two sources of information and propose a series of models. In particular, we show how the seed set can be used to capture the campaign-specific targeting constraints (the local component), while the campaign metadata allows sharing targeting knowledge across campaigns (the global component). For example, a "nike" campaign can learn from, or teach, an "adidas" campaign which users to target. Also, we give methods for learning these models in a joint manner that simultaneously optimizes the local and global components, as well as a two-step approach that performs this optimization in two stages.

In doing so we investigate several technical questions. For example, how do we represent users and campaigns in a succinct manner? How useful are the two sources of information, and how do they interact with each other? How can we combine them in a principled manner? We answer these questions and make interesting observations through real advertising data collected from a large ad network. For example, we found that, contrary to popular belief, learning models for large campaigns (in terms of the number of conversions) can be more difficult than for smaller campaigns, and that the metadata can help in dealing with this issue.

Contributions. In summary, we make the following contributions in this paper:

• We propose to predict conversions using two sources: campaign metadata and seed users. We examine their relative value for campaigns with different characteristics (e.g., large and small, new and old).

• We give a set of modeling techniques that combine the two sources of information for optimal performance (through the local and global components). To the best of our knowledge, this is the first study on behavioral targeting to model the global component.

• Using the campaign metadata, we propose a method to bootstrap the prediction model for a campaign by exploiting information from related campaigns.

• We conduct extensive experiments and report results on a real-life advertising dataset to confirm the validity of our approach.

Finally, we note that, although the experiments are focused on conversion maximization for performance-based display advertisers, the principles described are applicable in a broader context, as user profiles are the basis for audience selection in almost any setting of online advertising targeting and content personalization.

2. PROBLEM DEFINITION

As in traditional brand and performance advertising, display advertisers aim to target the users that might be interested in their products, either to promote their brand or to get a direct response from the users.


Depending on their brand or performance inclination, advertisers can set up their campaigns with different goals. While brand advertisers are primarily interested in the number of ad views (impressions) by the targeted audience, performance advertisers usually set either click or conversion goals. In this paper we focus on performance advertisers with conversion optimization goals. (More details on performance advertising and its technical difficulties are given in Section 5.1.)

Mathematically, the conversion optimization task can be formulated as follows. Let $c \in C$ denote a campaign and $z_c$ the feature representation of its metadata. Let $\text{seed}_c$ denote the seed set for campaign $c$, with labeled converters and non-converters. For each campaign $c \in C$, the goal is to learn a model to differentiate between converters and non-converters. Let $x_u$ denote the feature profile of user $u$; in our setting this is a high-dimensional vector where each coordinate is binary. We want to learn a function $f(x_u, z_c, c)$ that helps us estimate the propensity of user $u$ to convert on campaign $c$.

We can achieve this through a classification approach where we learn a function $f$ that classifies a user $u$ as a converter for campaign $c$ if $f(x_u, z_c, c) > T$, where $T$ is a threshold (possibly campaign-specific) decided based on the costs of false positives and false negatives. An alternative approach could learn $f$ as the odds of user $u$ converting on campaign $c$; the output scores can then be used to perform classification.
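To make the decision rule concrete, here is a minimal illustrative sketch in Python; the scoring function f and the default threshold are placeholders of ours, not the paper's implementation:

def is_predicted_converter(f, x_u, z_c, c, T=0.0):
    # Classify user u as a converter for campaign c when the score
    # clears the threshold; raising T trades recall for precision.
    return f(x_u, z_c, c) > T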

We discuss the class of functions $f$ in the modeling section (Section 3). Next we discuss how to derive the user and campaign vectors ($x_u$ and $z_c$) that are used in the subsequent modeling tasks.

2.1 User Representation

As in offline advertising, user profiles are constructed from known past user activity in order to infer user interests. User activity is tracked by advertisers, publishers, and third parties through browser cookies that uniquely identify the user. For each user, the ad broker may store the history of page visits, ad views, and search queries, and based on the content of these events compose the profile used for targeting. Most of these events have textual content that can be analyzed using established text processing techniques: for example, the text of search queries issued, the ids of ads viewed, the content of the pages viewed, etc. To represent event content we employ the bag-of-words method, which uses unigrams and bigrams, as well as nodes of a topical taxonomy that represent more general text categories. This gives us a feature vector representing a user, $x_u$, for modeling purposes (see Figure 2). The weight of each feature can be binary, or it can be computed based on its intensity in the user's activities.

2.2 Campaign Representation

As mentioned before, we employ two sources of information to represent campaigns, as shown in Figure 2. First, we derive metadata from the campaign definition. Each campaign is composed of multiple ad creatives. An ad creative is an image or text snippet that is displayed to the user. Upon a click on the ad, the user is taken to a web page associated with this creative, also called a landing page. The creatives and the landing pages give a succinct characterization of the advertising campaign, and they can be useful for inferring the domain of the campaign. In our approach, we construct a campaign metadata feature vector, $z_c$, using the creatives and landing page content. One of the challenges in creating this feature vector is that campaigns can have multiple creatives and creatives can be associated with multiple campaigns (see Figure 2). To produce reasonable campaign metadata, we considered several variants; in our experiments we adopted the approach of merging the content of all landing pages connected to a campaign as its source of features. (More details are given in the experiment section.)

The second data source that characterizes the campaign is the set of seed users. The seed set is composed of positive and negative examples with regard to the given campaign. The positive set is composed of users that have converted for this campaign in the past, while the negative set represents the non-converted users. We provide more details on this in our experimental evaluation section.

For new campaigns, there is no seed set available, and thus we must rely on campaign metadata to characterize the targeting requirements of the campaign. Over time the seed set expands as the campaign is exposed to more and more users. Note that each user in the seed set comes at the cost of allocating one or more ad impressions to her.

3. MODELING APPROACHES

Recall from Section 2 that our goal is to learn a function $f(x_u, z_c, c)$ that helps us estimate the propensity of user $u$ to convert on campaign $c$. We confine ourselves to a class of functions $f$ that can be decomposed additively as $f(x_u, z_c, c) = g(x_u, z_c) + f_c(x_u)$, where $g$ is a function of user features that depends on campaign $c$ only through the metadata $z_c$, and $f_c$ is a campaign-specific function of user features.

Function Class. We consider three choices for the function class $f$ in this paper: (a) Linear Support Vector Machine (L-SVM), (b) Logistic Regression (LR), and (c) Naive Bayes (NB). Both SVM and logistic regression are known to provide good performance in advertising and many other applications; Naive Bayes was chosen due to its ability to substantially reduce variance at the expense of incurring bias from the independence assumption. On noisy conversion data, it provides a good baseline against which to compare the more advanced methods, Linear SVM and logistic regression.

Linear SVM. Let $y_{u,c}$ denote a binary indicator that takes value $+1$ if user $u$ converts on campaign $c$, and $-1$ otherwise. SVM then estimates the function $f$ that minimizes either the hinge loss $\sum_{u,c} \max(0,\, 1 - y_{u,c} \cdot f(x_u, z_c, c))$ (called L1-SVM) or the squared hinge loss $\sum_{u,c} \max(0,\, 1 - y_{u,c} \cdot f(x_u, z_c, c))^2$ (called L2-SVM). If $f$ is too flexible, it tends to overfit the data, so an additional penalty term constraining $f$ is often used. Hence, we assume $f$ to be linear in the known variables ($x_u$ and $z_c$).

Logistic Regression. The hinge loss in this case is replaced by the sigmoid loss function $\sum_{u,c} \log(1 + \exp(-y_{u,c} \cdot f(x_u, z_c, c)))$. As in SVM, we confine ourselves to a linear function and perform appropriate regularization. One can impose either the L1 or the L2 norm constraint on the unknown coefficients; the former also ensures sparse solutions, i.e., many irrelevant variables get zero coefficients.

Naive Bayes. Naive Bayes builds $f$ by estimating the joint density of features (user and/or campaign) separately in the converting and non-converting classes.


Figure 2: User and campaign representation.

By Bayes' theorem, the log-odds of the probability of conversion is given by

$$\log \frac{D_c(x_u \mid y_{u,c} = 1)}{D_c(x_u \mid y_{u,c} = -1)} + \log \frac{P(y_{u,c} = 1)}{P(y_{u,c} = -1)}$$

where $D_c$ denotes the appropriate joint density of user features in campaign $c$. The second term, the prior log-odds of conversion for the campaign, is constant when classifying a user as converter/non-converter on a given campaign, and hence can be ignored. Estimating the notoriously high-dimensional density is the crucial task here; Naive Bayes simplifies it through the independence assumption, which expresses the joint estimates as products of one-dimensional marginals. Although the one-dimensional marginals are estimated with precision, reducing variance, the bias incurred by the independence assumption may degrade performance.

For a given function class $f$, we next describe how we use the seed set and the campaign metadata to build these models.

3.1 Local Models Using the Seed Set

Given that each campaign targets a different set of users, the first approach is to build a separate local model for each campaign. In particular, given the seed set, we exploit the positive and negative examples to learn a campaign-specific targeting function. In other words, we ignore $g(x_u, z_c)$ and assume $f(x_u, z_c, c) = f_c(x_u)$. For training L-SVM and logistic regression, $f_c(x_u) = x_u'\beta_c$, where $\beta_c$ is an unknown vector that is to be estimated from training data. Formally stated, we obtain $\beta_c$ as the solution to the following optimization problem:

$$\arg\min_{\beta_c} \sum_{u \in \text{seed}_c} L(x_u, y_{u,c}, \beta_c) + \lambda_c \|\beta_c\|_p \qquad (1)$$

where $L$ is either the hinge loss $\max(0,\, 1 - y_{u,c} \cdot x_u'\beta_c)$ or the squared hinge loss $\max(0,\, 1 - y_{u,c} \cdot x_u'\beta_c)^2$ for L-SVM, and the sigmoid loss $\log(1 + \exp(-y_{u,c} \cdot x_u'\beta_c))$ for logistic regression. For L-SVM we use $p = 2$, while for logistic regression we use two values of $p$ for the penalty: $p = 1$ for L1 regularization and $p = 2$ for L2 regularization. The parameter $\lambda_c (\geq 0)$ determines the relative weight of the penalty term and is estimated through cross-validation; larger values of $\lambda_c$ imply more penalty on the parameters. We note that if the user vector $x_u$ contains $m$ features, the dimension of $\beta_c$ is also $m$. This leads to too many parameters for large values of $m$, often the case in advertising applications. The lack of conversions per campaign further exacerbates the situation; in fact, in most settings the number of conversions can be much smaller than $m$. This makes penalization important to avoid overfitting. Hence the right choice of $\lambda_c$ is essential for good performance (see Section 4.3). We select this parameter by extensive cross-validation.

Expanding $x_u = (x_{u1}, \cdots, x_{um})$, for Naive Bayes $f_c(x_u)$ is given by

$$f_c(x_u) = \sum_{i=1}^{m} \log \frac{D_i(x_{ui} \mid y_{uc} = 1)}{D_i(x_{ui} \mid y_{uc} = -1)} \qquad (2)$$

where $D_i(x_{ui} \mid y_{uc})$ is the conditional marginal density of the $i$th user feature in class $y_{uc}$. The estimation of the marginal density $D_i(\cdot \mid \cdot)$ depends on the nature of the feature. In this paper, since all our user features are binary, this density is obtained simply by counting; for instance, $D_i(x_{ui} = 1 \mid y_{uc} = 1)$ is simply the fraction of converters on campaign $c$ who possess feature $i$. To avoid unreliable estimates for campaigns with a small number of conversions and/or features that occur rarely, we perform mild smoothing. In particular, we estimate $D_i(x_{ui} = 1 \mid y_{uc} = 1)$ as

$$\frac{|\{u : x_{ui} = 1,\, y_{uc} = 1\}| + (a \cdot p_{ic})}{|\{u : y_{uc} = 1\}| + a}$$

where $a$ is a positive smoothing constant that can be interpreted as a pseudo-count of conversions, and $p_{ic}$ is the fraction of users who possess feature $i$ in campaign $c$. We choose small values of $a$ (e.g., $a \in [1, 5]$). Our experiments showed that Naive Bayes is very robust, with little sensitivity in performance to the choice of $a$.
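A small sketch of this smoothed counting estimator, under the assumption that user features arrive as a dense 0/1 matrix (the function and variable names are ours, not the paper's):

import numpy as np

def smoothed_marginal(X_pos, p_ic, a=2.0):
    """Estimate D_i(x_ui = 1 | y_uc = 1) for every feature i.

    X_pos: (n_converters, m) 0/1 matrix of converter profiles for campaign c.
    p_ic:  (m,) fraction of all users in campaign c that have feature i.
    a:     pseudo-count of conversions; small, e.g., in [1, 5].
    """
    counts = X_pos.sum(axis=0)               # |{u : x_ui = 1, y_uc = 1}|
    return (counts + a * p_ic) / (X_pos.shape[0] + a)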

We also note that learning local models is computationally efficient, since the computation can be done separately for each campaign, in parallel. All our computations with SVM and logistic regression were done with LIBLINEAR [19].
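For concreteness, a per-campaign local model could be trained along the following lines; this is a sketch using scikit-learn's liblinear-backed logistic regression as a stand-in for the paper's direct use of LIBLINEAR, with the grid of C values (the inverse of the penalty weight $\lambda_c$) chosen arbitrarily:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def fit_local_model(X_seed, y_seed):
    # X_seed: (n_users, m) binary profiles from one campaign's seed set.
    # y_seed: +1 for converters, -1 for non-converters.
    grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # C plays the role of 1 / lambda_c
    search = GridSearchCV(
        LogisticRegression(penalty="l2", solver="liblinear"),
        grid, scoring="roc_auc", cv=3)     # cross-validate the penalty
    return search.fit(X_seed, y_seed)

Since the campaigns share nothing at this stage, calls to fit_local_model can run in parallel, one per campaign.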

3.2 Global Models Using the Campaign Metadata

Per-campaign local models work well for campaigns with a large seed set, i.e., a large number of conversions. For new campaigns, or those with little training data, the performance is not satisfactory due to data sparsity. To mitigate this, we employ the campaign metadata information ($z_c$).


One option is to obtain user and campaign similarity using their features, as in traditional information retrieval. However, this does not work in our case, since user and campaign features are not mapped to the same semantic space. The other idea is to exploit campaign metadata to correlate learning across campaigns and thereby perform better prediction for those campaigns that lack enough data.

We explore several different ways of performing learning across campaigns in this paper. The first approach shares the model coefficients $\beta_c$ for user features across campaigns; we call this the Merge-based Global model. The second approach extends the merge model with an additional component that models user and campaign interaction through user features and campaign metadata; we call this the Interaction-based Global model. Finally, we explore our most complex model, which extends the interaction model with a campaign-specific local model; we call this the Global+Local model.

3.2.1 Merge-based Global Model

Here, we merge the seed users from all campaigns to learn a global model. In other words, $f_c(x_u) = x_u'\beta$ for all campaigns $c$, where $\beta$ is the global weight vector. Thus, one single set of coefficients is estimated for all campaigns. This reduces the number of parameters dramatically and allows us to derive more precise estimates of the global coefficients. However, this model does not capture any user and campaign specific interactions. Instead, it captures the user's propensity to convert in general, which can be useful in many real scenarios. For example, the willingness of a user to conduct online transactions affects each campaign in the same way and can be learned/exploited globally.

To build this global model, we put together the seed sets of all campaigns. Even if the goal is only to learn a user's global propensity to convert, one has to be careful in training this model. For example, campaigns with large seed sets can easily dominate the global model and bias the model estimates. This is counter-productive, since such a global model would not perform well on campaigns with a small seed set, which are the ones that should benefit the most from such a collapsed model. We address this by weighting each campaign's seed set equally, both in the positive and negative classes.
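One plausible way to realize this equal weighting is through per-example sample weights, sketched below under the assumption that examples from all campaigns are stacked into one training set (the array names are ours):

import numpy as np

def equal_campaign_weights(campaign_ids, labels):
    # Give every (campaign, class) group the same total weight, so that
    # campaigns with large seed sets cannot dominate the merged model.
    w = np.zeros(len(labels), dtype=float)
    for c in np.unique(campaign_ids):
        for y in (+1, -1):
            mask = (campaign_ids == c) & (labels == y)
            if mask.any():
                w[mask] = 1.0 / mask.sum()
    return w

# e.g., model.fit(X_merged, y_merged,
#                 sample_weight=equal_campaign_weights(cids, y_merged))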

3.2.2 Interaction-based Global Model

The merge model estimates the global propensity of a user to convert. However, it does not capture any affinity of campaigns to each other. For instance, if a certain kind of user has a high propensity to convert on travel campaigns, we can perhaps use this information to recommend these users to a new travel campaign. One way to perform such cross-campaign learning is by positing a linear model that is a function of both user and campaign features. We accomplish this through our interaction model by assuming $g(x_u, z_c) = x_u' D z_c$, where $D$ is a matrix of unknowns to be estimated by pooling data across all campaigns. Hence, $f(x_u, z_c, c) = x_u' D z_c + x_u'\beta$. Mathematically, we obtain $D$ by solving the following optimization problem:

$$\min_{D,\,\beta} \sum_{u,c} L(y_{u,c},\; x_u' D z_c + x_u'\beta) + \lambda (\|D\|_p + \|\beta\|_p) \qquad (3)$$

Note that this optimization is performed by pooling data across all campaigns. Also, the number of parameters to be estimated in $(D, \beta)$ is $m \cdot (n + 1)$, where $m$ and $n$ are the numbers of user and campaign features, respectively. When $m$ and/or $n$ is large, this leads to a high-dimensional optimization problem that is difficult to solve; in our experiments we deal with $m = 70\text{k}$ and $n = 500$, which gives an optimization problem with about 35M parameters. Also, the computations are not separable by campaign (unlike Section 3.1). Further, several user and campaign features are non-informative and noisy. To keep the interaction matrix manageable, we use a simple a-priori feature filtering procedure that removes irrelevant user features (since the number of campaign features is relatively small, we do not perform any such filtering there). This filtering is performed through a Kullback-Leibler divergence measure, as described below.
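As an implementation aside before the filtering details: since $x_u' D z_c$ is linear in the entries of $D$, the bilinear term can be fit with any off-the-shelf linear learner by giving each training row the flattened outer product of the (filtered) user and campaign vectors. A sketch under that assumption (all names are ours):

import numpy as np

def interaction_row(x_filtered, z_c):
    # x_u' D z_c = <vec(D), vec(x_u z_c')>, so the bilinear term becomes
    # an ordinary linear term over the outer-product features.
    return np.outer(x_filtered, z_c).ravel()

# Full design row for the interaction model of Equation (3):
# row = np.concatenate([interaction_row(x_filtered, z_c), x_u])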

Variable filtering. We describe a variable importance score for a user feature $i$. Let $q_{i,c}$ denote the conversion probability for feature $i$ on campaign $c$, and $q_{i\cdot}$ the overall conversion probability for feature $i$. We compute the values of $q_{i,c}$ and $q_{i\cdot}$ using their maximum-likelihood estimates and, to avoid unreliable estimates, we perform mild smoothing similar in spirit to that described in the context of Naive Bayes. The variable score for $i$ is then given by the Kullback-Leibler divergence $\sum_c q_{i,c} \cdot \log(q_{i,c} / q_{i\cdot})$. Note that a value of 0 implies no interaction of feature $i$ with campaigns, and hence such a feature is not important to include in our model.
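This score transcribes directly into code (a sketch; smoothing is assumed to have been applied to the inputs already, and the names are ours):

import numpy as np

def kl_variable_score(q_ic, q_i):
    # q_ic: (n_campaigns,) smoothed conversion probability of feature i
    #       on each campaign; q_i: its overall conversion probability.
    return float(np.sum(q_ic * np.log(q_ic / q_i)))

# A score near 0 means feature i interacts with no campaign and can be
# dropped before fitting the interaction model.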

3.3 Global + Local Model

Local models have the advantage that they capture the campaign-specific effect. Global models capture the global targeting constraints and have the advantage that they can generalize well even when campaign-specific data is lacking. To get the best of both worlds, we build models that include both the global and local components. More specifically, we assume $f(x_u, z_c, c) = X_u' D z_c + x_u'\beta_g + x_u'\beta_c$ and solve for $(D, \beta_g, \beta_{c:c \in C})$ via an appropriate L-SVM or logistic regression task. Mathematically, we obtain the parameters by solving the following optimization problem:

$$\min_{D,\,\beta_g,\,\beta_{c:c\in C}} \sum_{u,c} L(y_{u,c},\; X_u' D z_c + x_u'\beta_g + x_u'\beta_c) + \lambda\|D\|_p + \lambda\|\beta_g\|_p + \sum_c \lambda_c \|\beta_c\|_p$$

Here $X_u$ denotes the user features used for the interaction component after a-priori filtering, so $x_u$ can be different from $X_u$. Note that this formulation involves a separate $\lambda_c$ for each campaign and another $\lambda$ for the transfer-learning parameters. Tuning so many parameters when jointly optimizing $(D, \beta_g, \beta_{c:c\in C})$ by pooling data across all campaigns is difficult in practice and computationally challenging.

We address the computational challenges using two different approaches:

• Simultaneously learn the global and local components (the joint approach), but assume $\lambda_c = \lambda$.

• Learn the global component first, then fit a separate campaign-specific local model using $X_u'\hat{D} z_c + x_u'\hat{\beta}_g$ as a known constant, where $\hat{D}$ and $\hat{\beta}_g$ are the global component estimates (the offset approach). This entails changing $L(x_u, y_{u,c}, \beta_c)$ to $L(x_u, y_{u,c}, X_u'\hat{D} z_c + x_u'\hat{\beta}_g + \beta_c)$ in Equation 1.
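A sketch of step two of the offset approach, assuming the global estimates are already in hand; we use the offset argument of statsmodels' GLM, which matches the idea of holding the global score fixed while estimating $\beta_c$ (the function names, labels, and penalty value are ours, not the paper's):

import statsmodels.api as sm

def fit_local_with_offset(x_users, y01, global_score, alpha=1.0):
    # x_users: (n, m) profiles for campaign c; y01: 0/1 conversion labels;
    # global_score: (n,) values of X_u' D_hat z_c + x_u' beta_g_hat.
    model = sm.GLM(y01, x_users, family=sm.families.Binomial(),
                   offset=global_score)
    # Elastic net with L1_wt=0 gives a ridge-style (L2) penalty on beta_c.
    return model.fit_regularized(alpha=alpha, L1_wt=0.0)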


Figure 3: Campaign sizes in terms of the number of conversions (y-axis: number of converters, 0 to 6000; x-axis: campaigns 1 to 10).

This model performs well on both mature and new campaigns. For the latter, the transfer-learning component based on campaign features plays an important role, while for the former the additional local term adds a correction that helps it converge to the local campaign models.

4. EXPERIMENTS

Next we evaluate our proposed models on a real dataset collected from a large advertising network.

4.1 Dataset

We constructed a dataset of user profiles, labeled as positive or negative depending on whether the user converted on a given campaign. Any potentially personally identifiable information was removed, and all the data was anonymized. All datasets were compliant with the company privacy policy. The users were drawn from 10 randomly selected display ad campaigns registered on a major US advertising network in 2011. The dataset spans more than 300,000 users, allowing us to draw meaningful conclusions from these experiments. All these campaigns are performance-based, i.e., advertisers pay the advertising network only for actual conversions. Of the 10 campaigns, some are fairly small in terms of the number of conversions (with 10 to 20 conversions per week, on average), while others are large and receive many thousands of conversions every week, as shown in Figure 3. For these campaigns we obtained a log of ad activity for the four-week period from 03/18/2011 to 04/15/2011. The log contains fully anonymized ids of users who viewed, clicked, or converted on the ads from these campaigns.

For each campaign c we construct a dataset with those users that were shown one or more ads from the campaign during the study period. Using the conversion data described above, we give binary labels to these user profiles: users who converted make up the positive instances, while the rest make up the negative examples. Since the number of negative examples can be huge (many millions) for large campaigns, we down-sample them to keep about 30,000 examples per campaign. We perform 3-fold cross-validation on this dataset in our experiments and report the performance averaged over the 3 folds.

User profiles. For each user observed in the period above, we take the four weeks of her online activity preceding the conversion to construct the user profile, as described in Section 2. These activities include page visits, ad views, and search queries. Note that when predicting a test instance, say on day t, we allow the prediction models to access user history only up to day t−1; hence, the prediction method does not use any future information.

Campaign metadata. For the 10 campaigns in our experiments we collected the ad creatives associated with them (see Figure 2). This gives us about 15,000 creatives, most of which are images and do not have associated text. However, each creative contains a landing page that denotes the URL of the page to which the user is directed after clicking on the ad. We crawl each landing page, parse it, and attribute the extracted content to its corresponding creative(s).

Since the relationship between campaigns and creatives is many-to-many, it is not clear how to propagate a creative's content to a campaign. The experiments reported in this paper use a simple approach where, for a given campaign, we weight all the connected creatives equally and put their content together. An alternative is to differentiate the creatives and their contributions to a given campaign using the creative-campaign graph, e.g., give less weight to those creatives which are shared by more campaigns, and vice-versa. We leave this exploration for future work. Finally, we perform some feature selection over campaign features using the number of landing pages a feature appears in.

Seed set. We compose the seed set of positive and negative examples as described earlier in this section. Recall that the seed set denotes the labeled users available for training a campaign's model at a given time. In our experiments we explore how the prediction power of the models depends on the size of the seed set. Thus, we simulate different seed set sizes by sampling from the full training seed set of a campaign. To explain further, let local(c, x) denote the local model built for campaign c using x positive examples (since negative examples are easy to acquire, even for a new campaign, we do not vary them). This allows us to simulate a campaign at different stages of its life: to simulate a new campaign we set x to 0; to simulate slightly more mature campaigns, we set x to 30 or 70 conversions; for long-standing campaigns we set x to a large value, which means that all the positive examples for the campaign in the training folds are used.

4.2 Evaluation Metric

We use the Receiver Operating Characteristic (ROC) curve to evaluate the ranked list of users produced by the different targeting models. An ROC curve plots the true positive rate against the false positive rate at different classification thresholds. The area under the ROC curve is particularly interesting due to its probabilistic interpretation: the Area Under the Curve (AUC) gives the probability that the audience selection method assigns a higher score to a random positive example than to a random negative example (i.e., the probability of concordance) [8, 10]. So, a purely random selection method has an AUC of exactly 0.5, while an algorithm that achieves an AUC of 0.6 can distinguish a positive user from a negative user with 60% probability, and is thus better than the random method by 20%.


Figure 4: Performance comparison of SVM, Logistic, and Naive-Bayes based local models (y-axis: area under the ROC curve; x-axis: campaign index, sorted in decreasing order by size).

An alternative metric would be to measure precision/recall at a certain rank in the list. Note, however, that different campaigns may have different requirements in terms of precision and recall: for example, a small campaign whose reach is limited would prefer higher recall, while a large campaign that reaches out to many users might prefer higher precision. Consequently, it is not possible to select a single rank at which to evaluate precision that would suit all campaigns. Instead, we use AUC, since it combines the prediction performance over all ranks into a single number.
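For reference, the metric as we use it can be computed directly from labels and model scores; the numbers below are made up:

from sklearn.metrics import roc_auc_score

y_true  = [1, 1, -1, -1, -1]           # converter / non-converter labels
y_score = [0.9, 0.4, 0.8, 0.35, 0.1]   # hypothetical model scores
print(roc_auc_score(y_true, y_score))  # 1.0 = perfect ranking, 0.5 = random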

4.3 Local Models Using the Seed Set

For this experiment we build a separate model for each campaign using all the positive examples from the training folds. Hence, large campaigns get to use more than 5000 conversions, while the small campaigns learn from some 500 conversions. First, in Figure 4 we show the performance comparison of the SVM, Logistic (with L2 regularization), and Naive-Bayes models. The x-axis is the campaign index, where the campaigns have been sorted in decreasing order by the number of conversions (i.e., smaller indices denote larger campaigns). On the y-axis we plot the best AUC obtained for each campaign, averaged over 3 folds, after varying the model parameters: for SVM and Logistic we varied the regularization constant, and for Naive-Bayes the smoothing constant. We note that Logistic and SVM perform very similarly, while Naive-Bayes is slightly worse. However, we observed that Naive-Bayes was not very sensitive to any learning parameters (such as the smoothing constant); the same is not true for SVM and Logistic. In Figure 5 we show the performance of the SVM models for 5 of the 10 campaigns in our dataset, where the x-axis denotes the value of the regularization constant. As we can see, the performance depends quite significantly on the regularization constant: when the constant is too small, many useful features get eliminated due to severe penalization; when it is too large, the model starts overfitting and does not perform well on the test folds. For brevity we use SVM to report the results for the remaining experiments.

Figure 5: Effect of the regularization constant on SVM models (y-axis: area under the ROC curve; x-axis: regularization constant, log scale; one curve per campaign: 1, 3, 4, 5, 7).

Impact of the number of conversions in the training set. We also note that the best performance can differ quite a bit across campaigns; AUC varies from 0.55 to 0.75 in Figure 4. To analyze this further, we performed the following experiment: we varied the number of positive examples allowed for training while learning the model for a campaign. We call this local(c, x), where c denotes the campaign and x the allowed number of positive examples in the seed/training set. Note that x is an upper bound; we vary x up to 5000, but for small campaigns the seed set does not change beyond x = 540, since that is the maximum number of conversions we have for the small campaigns in our dataset. As mentioned earlier, this experiment aims to simulate a campaign at different stages of its life cycle: for a new campaign x is close to 0, while for a mature campaign x can be in the range of a couple of hundred, depending on the size of the campaign. The results of this experiment are shown in Figure 6. To avoid clutter, we grouped campaigns by their sizes in the figure: the largest 2 campaigns in terms of the number of conversions were put into the "large" category, the middle 3 campaigns into "medium", and the bottom 5 into "small".

On first inspection the results look contrary to what one might expect: the small campaigns not only outperform the large campaigns when x is small, but they also perform better for large values of x. For example, when x = 5000, large campaigns use all their 5000 conversions as positive examples for training, while small campaigns use only x = 540 positive examples, and they still perform better. From a careful investigation of the data, we identified the following reason. The small campaigns are quite restrictive in their definition of conversion (which is partly why they are small), e.g., they require an email sign-up, a click on the ad, and reaching the order completion page. The large campaigns, on the other hand, have a more relaxed definition of conversion, e.g., a view of the ad followed by some kind of positive activity on the website. The converters for such large campaigns are very heterogeneous and noisy; as a result, even for large values of x (i.e., a large number of positive examples in the training set) it is difficult to discriminate between converters and non-converters. Small campaigns, due to their restrictive conversion definition and specific targeting segment, are easier to learn and have a quick start, i.e., even with a few positive examples the models can be learned quite well.


Figure 6: Performance evaluation of the local models for different numbers of positives (x) in the seed set (y-axis: area under the ROC curve; x-axis: number of positive examples in training, log scale; one curve per campaign group: small, medium, large). Note that for the small campaigns the maximum number of conversions is x = 540 in our dataset.

4.4 Global Models

We start the evaluation of global models by studying the merge-based model. Recall from Section 3.2.1 that the merge model pools the seed sets from all campaigns while learning the model for a given campaign. In particular, let merge(c, x) denote the merge model for campaign c when c's seed set is of size x. To learn merge(c, x), we employ all the data from the other campaigns, {C − c}, and from campaign c we use x positive examples. This simulates the setting where campaign c is new and the campaigns in C − c are old/mature campaigns that the ad matching platform has run in the past.

The results are shown in Figure 7, where we compare merge(c, x) with local(c, x) for small, medium, and large campaign sizes. We note that for large campaigns the merge model performs better than the local model for small values of x; in particular, up to x = 200 conversions the merge model achieves higher AUC than the local model. This is expected in view of the previous section, where we observed that the large campaigns have a slow start. Using the seed sets of other campaigns thus allows the merge-based approach to learn more robust models. For small/medium campaigns the local models get better quite quickly (after 30 to 50 conversions). However, note that collecting 50 conversions may take a couple of days (if not weeks) for small, vertically focused campaigns in a real-world environment where multiple advertisers compete for the attention of a relatively small set of targeted users.

Interaction-based global model. Next we study the interaction-based global model from Section 3.2.2. We performed feature selection using KL-divergence to keep 3000 user features. On the campaign side, we kept 50 features per campaign by selecting the most frequent words in the landing pages. In Table 1 we compare the interaction-based and merge-based global models. We note that overall the interaction approach is better, as expected.

Campaign size    Improvement of interaction-based model
Small            6.06%
Medium           0.4%
Large            3.93%

Table 1: Performance improvement achieved by the interaction-based model over the merge-based model.

Campaign size    Global    Global+Local
Small            -10.5%    -8.5%
Medium           6.4%      6.1%
Large            70%       78.6%

Table 2: Performance improvement of the global and global+local models over the local models for different campaign sizes.

We expect the improvement to increase with the number of campaigns pooled together, as that allows better learning of the weights for the user-campaign interaction features.

As shown in the table, the largest gain comes for the small campaigns. Compared to the large campaigns, which can have many thousands of creatives and landing pages, the small campaigns have few creatives, providing us with a high-quality, homogeneous set of campaign features.

4.5 Global + Local Models

In this experiment we study the global+local models from Section 3.3. For brevity, we focus only on the joint optimization approach. In Table 2 we compare the joint model against local(c, x) and merge(c, x) for x = 30. As discussed in Section 3.3, the global+local model, in theory, subsumes both the local and global models and is a strict generalization of them. In practice, this seems to hold true except for the small campaigns, where the joint global+local model is slightly worse than the local model. We believe this is because the regularization constant is shared across all campaigns in the joint approach, while the local models have per-campaign tuning of the regularization constant. In fact, we observed that if we force the local models to share the same regularization constant, the performance difference between the local and joint models on the small campaigns is marginal.

More importantly, we observe that the global+local approach brings a large improvement (about 78%) over the local models on large campaigns. Recall that large campaigns suffer from a slow start, so this is greatly beneficial to them. This demonstrates how the campaign metadata can be useful in sharing targeting knowledge across campaigns.

5. BACKGROUND AND RELATED WORK

First, we give some background on conversion optimization in display advertising. Then we compare our work against the existing literature.

5.1 Conversion Optimization in Display Advertising

Display advertising takes the form of graphical ads displayed on web pages alongside the original content.


Figure 7: Performance comparison of the local model with the merge-based model for different numbers of positives (x) in the training set (y-axis: area under the ROC curve; x-axis: number of positive examples in training, log scale). The three panels correspond to the three campaign groups based on size: small, medium, and large.

From its meager beginnings in the 1990s, display advertising has grown to an estimated $12.33 billion industry in 2011, according to the latest IDC report [21]. Display ads today are standardized in size, and are created, sold, traded, and placed by an industry composed of hundreds of companies that include publishers, publisher optimizers, ad agencies, ad exchanges, ad networks, advertisers, and others.

Conversions are user actions indicating that the ad was successfully perceived by the user, such as making a purchase, requesting a price quote, or signing up for an account. Maximizing conversions is one of the key challenges for today's display ad brokers, posing several technical difficulties. First, conversions are very rare events: only a small percentage of users who see an ad will click on it, and only a small percentage of those users convert. Conversions often occur in the range of one out of $10^5$–$10^6$ ad views. Furthermore, conversions represent a diverse set of events, and there is no single definition of conversion, as it varies among advertisers. For example, certain advertisers define conversion as the event when a user purchases a product, while others may count a subscription to mails/alerts or the filling out of a form as a conversion. Thus the conversion rate can vary significantly across advertisers.

Lastly, conversions are tracked through a conversion pixel, which is JavaScript code embedded in the advertiser's conversion page (e.g., the order completion page). The code gets triggered to notify the advertising network when a user who was shown ads by the advertising network reaches the conversion page. A campaign can encompass multiple conversion pixels in the creative landing pages. Conceptually, the conversion pixels represent the goals of the campaign, while the creatives are the means of achieving those goals. There can be multiple means to achieve the same goal, and the same means can lead to multiple goals. This further complicates the issue, making the task of predicting conversions even harder.

5.2 Prior Work

Our work is related to modeling of user behavior and targeting based on observed past events. User behavior has been studied to understand users' querying patterns [20], news browsing behavior [11], interest inference [17], and personalized search [18]. In contrast, our focus is on online advertising, and behavioral targeting in particular. Initial behavioral targeting studies mostly focused on predicting clicks on ads [6, 22]; clicks are used simply because they are available at a large scale while other information is not. While clicks do represent user intent, they are known to suffer from fraud issues [7, 12, 23]. Recently, however, advertisers have been willing to share feedback (through conversion pixeling) at the level of individual users, telling publishers which of the users who saw the ad actually purchased the product [2, 3, 4, 14]. Since conversions are the ultimate goal of advertisers, we focus on conversion optimization in this study.

Previous work on conversion optimization uses only the seed set and learns local models [2, 3, 4], but as we showed in our experiments, the model performance depends significantly on the size of the seed set. Hence, we employ campaign metadata along with the seed set and propose a series of global models that pool campaigns together using their metadata. When campaigns are new and the seed set is small or unavailable, we show that the metadata can be significantly useful. To the best of our knowledge, this is the first study in the behavioral targeting framework to employ such collaborative modeling. Our work is also related to the cold-start problem in the collaborative filtering literature, where movies/items play the role of campaigns [1, 9, 15, 16]. However, there are a couple of major differences: (a) users interact with and give thumbs up to many movies/items, while they do not convert on that many campaigns, and (b) the number of campaigns is not of the order of the number of items. Both these limitations restrict the factor models in our setting [1, 15].

Another area of online advertising is social targeting, where the goal is to identify users having strong influence over others [5]. Since brand affinity is likely to be shared between socially connected users, Provost et al. [13] identified regions of the social network that may be more susceptible to a given brand message. This approach relies on the existence of a social network, although in their work [13] the network was approximated using co-visitations.

6. CONCLUSIONS

Advertisers want more bang per buck, and as a result, conversion optimization in display campaigns is receiving increasing attention. The task is very challenging for several reasons, such as the very low conversion rate, the high variance in the number of conversions across campaigns, and the diverse set of events logged as conversions. In addition, on the user side we have sparse and noisy profile activities.


In this paper we proposed a two-pronged approach for conversion optimization, whereby we use a seed set of converters to capture the campaign-specific or local targeting criteria (e.g., interest in finance, shares, mortgages), and the campaign metadata to share targeting knowledge across campaigns (i.e., the global component). To learn the local models we experimented with SVM, Logistic, and Naive-Bayes models. We showed that SVM and Logistic perform better than Naive-Bayes; however, they are sensitive to the setting of model parameters, while Naive-Bayes is fairly resilient. We found it surprising that campaigns with a higher number of conversions are actually harder to model, due to the heterogeneity among the converted users.

To investigate the global models, we proposed merge-based and interaction-based models for pooling together the information from different campaigns. We showed how global models improve the prediction performance for large campaigns, which suffer from a slow learning rate during the initial phase. Next, we showed how our global+local models capture the best of both worlds: they include the campaign-specific weight vector for user features (local), the shared weight vector for user features (merge-based), and the interaction of user and campaign features (interaction-based). To learn these models we give a joint optimization approach which simultaneously accounts for both the local and global components; the offset approach, on the other hand, performs the optimization in two steps, but allows per-campaign regularization, which is beneficial.

While in this work we focused on advertising, user profiling and targeting are required in a wide range of web applications beyond advertising, such as content recommendation and search personalization. The principles of user profile generation and two-pronged targeting described in this paper are not specific to advertising, and are therefore applicable to these other applications.

7. REFERENCES

[1] D. Agarwal and B.-C. Chen. Regression-based latent factor models. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.
[2] N. Archak, V. S. Mirrokni, and S. Muthukrishnan. Mining advertiser-specific user behavior using adfactors. In Proceedings of the Nineteenth International World Wide Web Conference, 2010.
[3] A. Bagherjeiran, A. O. Hatch, and A. Ratnaparkhi. Ranking for the conversion funnel. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 146–153, 2010.
[4] A. Bagherjeiran, A. O. Hatch, A. Ratnaparkhi, and R. Parekh. Large-scale customized models for advertisers. In Proceedings of the IEEE International Conference on Data Mining Workshops, 2010.
[5] R. Bhatt, V. Chaoji, and R. Parekh. Predicting product adoption in large-scale social networks. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, 2010.
[6] Y. Chen, D. Pavlov, and J. Canny. Large-scale behavioral targeting. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.
[7] Click Forensics, Inc. Click fraud index. http://www.clickforensics.com/resources/click-fraud-index.html, 2010.
[8] M. Gonen. Receiver operating characteristic (ROC) curves. SAS Users Group International (SUGI), 31:210–231, 2006.
[9] N. Good, J. B. Schafer, J. A. Konstan, A. Borchers, B. Sarwar, J. Herlocker, and J. Riedl. Combining collaborative filtering with personal agents for better recommendations. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, 1999.
[10] M. Greiner, D. Pfeiffer, and R. D. Smith. Receiver operating characteristic (ROC) curves. Preventive Veterinary Medicine, 45:23–41, 2000.
[11] J. Liu, P. Dolan, and E. R. Pedersen. Personalized news recommendation based on click behavior. In Proceedings of the 15th International Conference on Intelligent User Interfaces, 2010.
[12] Y. Peng, L. Zhang, M. Chang, and Y. Guan. An effective method for combating malicious scripts clickbots. In Proceedings of the 14th European Symposium on Research in Computer Security (ESORICS), 2009.
[13] F. Provost, B. Dalessandro, R. Hook, X. Zhang, and A. Murray. Audience selection for on-line brand advertising: privacy-friendly social network targeting. In Proceedings of KDD, 2009.
[14] B. Rey and A. Kannan. Conversion rate based bid adjustment for sponsored search auctions. In Proceedings of the Nineteenth International World Wide Web Conference, 2010.
[15] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, 2008.
[16] A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock. Methods and metrics for cold-start recommendations. In Proceedings of the 25th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2002.
[17] M. Shmueli-Scheuer, H. Roitman, D. Carmel, Y. Mass, and D. Konopnicki. Extracting user profiles from large scale data. In Proceedings of MDAC, 2010.
[18] K. Sugiyama, K. Hatano, and M. Yoshikawa. Adaptive web search based on user profile constructed without any effort from users. In Proceedings of WWW, 2004.
[19] X.-R. Wang, K.-W. Chang, C.-J. Hsieh, R.-E. Fan, G.-X. Yuan, H.-F. Yu, F.-L. Huang, and C.-J. Lin. LIBLINEAR – a library for large linear classification. http://www.csie.ntu.edu.tw/~cjlin/liblinear/.
[20] S. Wedig and O. Madani. A large-scale analysis of query logs for assessing personalization opportunities. In Proceedings of KDD, 2006.
[21] K. Weide. Worldwide and U.S. internet ad spend report: Growth accelerates, but dark clouds gather. http://www.idc.com/getdoc.jsp?containerId=224593 (visited on 10/19/2010), 2010.
[22] J. Yan, N. Liu, G. Wang, W. Zhang, Y. Jiang, and Z. Chen. How much can behavioral targeting help online advertising? In Proceedings of the 18th International Conference on World Wide Web, 2009.
[23] L. Zhang and Y. Guan. Detecting click fraud in pay-per-click streams of online advertising networks. In Proceedings of the 28th IEEE International Conference on Distributed Computing Systems, 2008.
