Optimizing Sponsored Search Ranking Strategy by Deep Reinforcement Learning

Li He hl121322@alibaba-inc.com
Liang Wang liangbo.wl@alibaba-inc.com
Kaipeng Liu
Bo Wu
Weinan Zhang

ABSTRACT
Sponsored search is an indispensable business model and a major revenue contributor for almost all search engines. From the advertisers' side, participating in ranking the search results by paying for sponsored search advertisements to attract more awareness and purchases facilitates their commercial goals. From the users' side, presenting personalized advertisements reflecting their propensity makes their online search experience more satisfactory. Sponsored search platforms rank the advertisements by a ranking function to determine the list of advertisements to show and the charging price for the advertisers. Hence, it is crucial to find a good ranking function which can simultaneously satisfy the platform, the users and the advertisers. Moreover, advertisement showing positions under different queries from different users may be associated with advertisement candidates of different bid price distributions and click probability distributions, which requires the ranking functions to be optimized adaptively to the traffic characteristics. In this work, we propose a generic framework to optimize the ranking functions by deep reinforcement learning methods. The framework is composed of two parts: an offline learning part, which initializes the ranking functions by learning from a simulated advertising environment, allowing adequate exploration of the ranking function parameter space without hurting the performance of the commercial platform, and an online learning part, which further optimizes the ranking functions by adapting to the online data distribution. Experimental results on a large-scale sponsored search platform confirm the effectiveness of the proposed method.

KEYWORDS
Sponsored search, ranking strategy, reinforcement learning

1 INTRODUCTION
Sponsored search is a multi-billion dollar business model which has been widely used in industry [5, 11]. In the commonly employed pay-per-click model, advertisers are charged for users' clicks on their advertisements. The sponsored search platform ranks the advertisements by a ranking function and selects the top ranked ones to present to the users. The price charged to the advertisers of these presented advertisements is computed by the generalized second price (GSP) auction mechanism [8, 24] as the smallest price which is sufficient to maintain their allocated advertisement showing positions. Traditionally, the ranking score of an advertisement is set to be its expected revenue to the sponsored search platform, computed as the product of the advertiser's bid price and the predicted click-through rate (CTR) of the user.

The expected revenue based ranking function dominates the sponsored search area, and most of the existing methods focus on designing elaborate models to predict the CTR [11, 16, 20, 39]. In this work, instead of designing a CTR prediction model, we try an alternative way to add more flexibility to the ranking function for balancing the gain between the users, the advertisers and our advertisement platform. For the user experience, we recognize the engagement of users as their CTR on the advertisements, and a term which is monotonic in the users' CTR is added to the ranking function. To improve the advertisers' return on their spending, we add a term related to the users' expected purchase amount. The ranking function is then computed as the weighted sum of these two terms together with the expected revenue term.

Although the ranking function is designed to represent the benefits of the platform, the user and the advertisers, its terms may not be directly related to the real benefits of these players. First of all, owing to the second price auction mechanism [8], the advertisers are charged the minimum amount of money they need to keep their advertisement positions, not the amount they bid. The price is determined by the competitiveness of the underlying advertisement candidates. Second, the CTR and CVR (i.e., conversion rate, the ratio of buying behavior after each advertisement click) are predicted by a prediction model, which is generally biased and prone to noise because it is trained on a biasedly distributed dataset [12]. To link the real-world benefits of the users, advertisers and platform directly to the ranking function, we propose a reinforcement learning framework to learn the ranking functions based on the observed gain in a 'trial-and-error' manner. By treating the ranking function parameter tuning as a machine learning problem, we are able to deal with more complex problems, including higher-dimensional parameter spaces and traffic-characteristic-based tuning. Advertisement showing positions associated with different queries and different users exhibit diverse characteristics in terms of the distributions of bid price and CTR/CVR. Tuning the parameters according to traffic characteristics would definitely improve the performance of the ranking function.

Reinforcement learning [33] is generally used in sequential decision making, following an explore-and-exploit strategy to optimize the control functions of agents.

Since its proposal, it has mostly been used in games and robotics applications [18, 23, 30], where the exploration can be done in a simulated or artificial environment. When utilizing these algorithms on advertising platforms, we need to consider the cost of exploration and try to perform exploration in a simulated environment. However, different from games, simulating the advertisement serving and the rewards (clicks, user purchases, etc.) is hard due to the large space of controlling factors such as user intention, traffic distribution changes, advertisers' budget limitations, etc. In this work, we build a simulated sponsored search environment by making the historical advertisement serving replayable. Specifically, the simulated environment is composed of an advertisement replay dataset (states in the terminology of reinforcement learning), which, for each advertisement showing chance, stores the full list of advertisement candidates together with their predicted CTR, CVR and the advertisers' bidding prices, and a set of 'virtual' exploration agents which simulate the advertisement results and users' responses under different ranking functions (i.e., actions in reinforcement learning terminology). The reward for the exploration is computed by reward shaping methods [26].

However, there are two problems with the simulated environment: (1) the simulated state-action-reward tuples are temporally independent: it is hard to simulate the temporal correlation between the response of a user to one advertisement and the response to the next presented advertisement; (2) the simulated response is inconsistent with the online user response to some extent. To solve problem (1), we employ an off-policy reinforcement learning architecture [19] to optimize the advertisement ranking function, which does not require observing the next state's action and reward. For problem (2), we use an offline calibration method to adjust the simulated rewards according to the online observed rewards. Moreover, after the offline reinforcement learning, we employ an online learning module to further tune the reinforcement learning model to better fit the real-time market dynamics.

Contributions. In this work, we present our approach to optimizing the advertisement ranking functions on a popular mobile sponsored search platform. The platform shows the search results to users in a streaming fashion, and the advertisements are plugged into fixed positions within the streamed content. Our main contributions are summarized as follows:

• We introduce a new advertisement ranking strategy which involves more business factors like the user engagement and advertiser return, and a reinforcement learning framework to optimize the parameters of the new ranking function;

• We propose to initialize the ranking function by conducting reinforcement learning in a simulated sponsored search environment. In this way, the reinforcement learning can explore adequately without hurting the performance of the commercial platform;

• We further present an online learning module to optimize the ranking function adaptively to the online data distribution.

2 RELATED WORKS
Auction mechanisms have been widely used by Internet companies like Google, Yahoo [8] and Facebook [34] to allocate the advertisement showing positions for the advertisers. The Internet advertising auction generally works in the following procedure.

An advertiser submits a bid price stating their willingness to pay for an advertisement response (a click for a performance-based campaign, a view for a branding campaign, etc.). The publisher ranks the advertisements according to their quality scores and bid prices, and presents the top ranked ones to the users together with the organic contents. The advertisers are then charged, for the response of the users, the minimum amount of dollars needed to keep the showing positions of their advertisements. The auction mechanism has been extensively studied in the literature both theoretically and empirically: Edelman et al. in [8] investigate the properties of the generalized second price (GSP) auction and compare it with the VCG [34] mechanism in terms of equilibrium behavior. In [21], the authors formulate the advertisement allocation problem as a matching problem with budget constraints, and provide theoretical proof that their algorithm can achieve a higher competitive ratio than the greedy algorithm. To solve the ineffectiveness of the next-price auction, the authors in [1] design a truth-telling keyword auction mechanism. In [24], followed by [9], the reserve price problem is studied, including its welfare effects and its relation to equilibrium selection criteria. A field analysis of setting reserve prices on the sponsored search platforms of Internet companies is presented in [27]. Existing work generally focuses on the revenue effect of the auction mechanism, making it efficient in the bidding process and capable of generating more revenue. In our work, we use the generalized second price (GSP) auction for pricing, but since we are working on an industrial sponsored search platform and consider the long-term return, instead of maximizing the platform revenue only, we also add user experience and advertiser utility terms to the ranking function.

A reinforcement learning problem is basically modeled as a Markov Decision Process [33], which concerns how agents adjust their policies when interacting with the environment so as to maximize certain cumulative rewards. Recently, in combination with deep neural networks [14], reinforcement learning methods have become able to work in environments with high-dimensional observations by feeding in large-scale experience data and training on powerful computational machines [18], and have made breakthroughs in many different areas including the game of Go [30], video games [22, 38], natural language processing systems [32, 37] and robotics [15, 29]. However, most of the existing applications are conducted on simulated, non-profitable platforms, where the experience data are easy to acquire and there is no restriction on trying agent policies and training schemes. In a commercial system, however, the exploration of reinforcement learning may bring uncertainty into the platform's behavior and is prone to causing revenue loss; thus offline methods are a practical solution [40]. In [17], Li, Lipton et al. design a user simulator for a movie booking scenario. The simulator is designed on some rules and data observations. However, on our platform, there are far more factors (users, advertisers) to simulate. For online advertising with reinforcement learning, the authors in [2] first propose to tune sponsored search keyword bid prices in an MDP framework, where the state space is represented by the auction information and the advertisement campaign's remaining budget and lifetime, while the actions refer to the bid prices to set. Then in [4], the authors formulate the sequential bid decision making process in real-time bidding display advertising as a reinforcement learning problem.

Figure 1: System flow chart.

The method is based on the assumptions that the winning rate depends only on the bid price and that the actual clicks can be well estimated by the predicted CTR; hence, the best bidding strategy can be computed in an offline fashion. In our work, we initialize the reinforcement learning model using the offline simulated data, and combine the reward shaping method and an online model update procedure to make the model consistent with the online data distribution.

3 SYSTEM OVERVIEW
As shown in Fig. 1 (highlighted in orange), the whole system is composed of three modules: the offline sponsored search environment simulation module, the offline reinforcement learning module and the online reinforcement learning module. The environment simulation module is used to simulate the effect caused by changing the ranking function parameters, including re-ranking the advertisement candidates, showing the new top-ranked advertisement, and generating the users' response with respect to such changes. To allow adequate exploration, the offline reinforcement learning module collects training data by deploying randomly generated ranking functions on the simulated environment. An actor-critic deep reinforcement learner [19] is then trained on top of these training data. To bridge the gap between the offline simulated data and the online user-advertiser-platform interaction, we build an online learning module to update the model by the online serving results.

3.1 Ranking Strategy Formulation
We use the following ranking function to compute the rank score for advertisement ad:

ϕ(s, a, ad) = f_{a_1}(CTR) · bid + a_2 · f_{a_3}(CTR, CVR) + a_4 · f_{a_5}(CVR, price)    (1)

where the three terms correspond to the platform, the user and the advertiser, respectively. Here s represents the search context, including the search query, user demographic information and the status of the advertisement candidates for the advertisement showing chance; bid and price are the bid price and product price set by the advertiser for the advertisement ad on the current query; CTR and CVR of ad are predicted by the platform; f_{a_i} (i ∈ {1, 3, 5}) performs a nonlinear monotonic projection on CTR and CVR; and the scalars a_2, a_4 balance the weights between the three terms. Because our sponsored search platform charges the advertisers per click, the first term f_{a_1}(CTR) · bid can be seen as the expected revenue of the platform. We regard the users' preference for the presented advertisements as their response ratios (CTR and CVR); hence, the second term indicates the engagement of the users. The third term computes the expected return (expected user purchase amount) of the advertiser from showing the advertisement, which measures the gain of the advertisers.
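For concreteness, the following is a minimal Python sketch of Eq. (1); the power-function forms standing in for the monotonic projections f_{a_i} and the candidate fields are illustrative assumptions, not the platform's actual implementation.

```python
def rank_score(ctr, cvr, bid, price, a):
    """Rank score of Eq. (1) for one advertisement; a = (a1, ..., a5).

    The power functions used for the monotonic projections f_{a_i}
    are an illustrative assumption; any monotonic form would do.
    """
    a1, a2, a3, a4, a5 = a
    platform = (ctr ** a1) * bid               # expected platform revenue
    user = a2 * ((ctr * cvr) ** a3)            # user engagement term
    advertiser = a4 * ((cvr * price) ** a5)    # expected advertiser return
    return platform + user + advertiser

def rank_ads(candidates, a):
    """Order candidate ads (dicts with ctr, cvr, bid, price) by rank score."""
    return sorted(candidates,
                  key=lambda ad: rank_score(ad["ctr"], ad["cvr"],
                                            ad["bid"], ad["price"], a),
                  reverse=True)
```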

According to the GSP rule, if the advertisement ad is shown, its advertiser is charged

click price = [ϕ(s, a, ad′) − (a_2 · f_{a_3}(CTR, CVR) + a_4 · f_{a_5}(CVR, price))] / f_{a_1}(CTR)    (2)

where ad′ is the advertisement ranked just below ad. However, there is a possibility that the numerator is negative. In our implementation, we solve this problem by imposing a lower bound on the click price, just like the reserve price [27]; since the platform revenue is always one of the optimization goals, the numerator is rarely negative in our experiments. Compared with the ranking function proposed in [13], which is generally used by Google and Yahoo, our ranking function (1) involves more parameters and commercial factors, which can be used to optimize for more comprehensive commercial goals.
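Continuing the sketch above, the GSP charge of Eq. (2) with the lower bound can be computed as follows; the reserve price value is a placeholder and the helper rank_score is the one sketched earlier.

```python
def click_price(ad, next_ad, a, reserve_price=0.01):
    """GSP charge of Eq. (2): the smallest price keeping `ad` ranked
    above `next_ad`, floored by a reserve price so that a negative
    numerator cannot yield a non-positive charge (cf. [27]).
    """
    a1, a2, a3, a4, a5 = a
    numerator = (rank_score(next_ad["ctr"], next_ad["cvr"],
                            next_ad["bid"], next_ad["price"], a)
                 - a2 * ((ad["ctr"] * ad["cvr"]) ** a3)
                 - a4 * ((ad["cvr"] * ad["price"]) ** a5))
    return max(numerator / (ad["ctr"] ** a1), reserve_price)
```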

The ranking function is parameterized by a in consideration of the following issues: first, the predicted CTR and CVR are generally biased due to imbalanced training data, and need to be calibrated to be consistent with the online user response; second, we charge the advertisers according to the second price auction mechanism, which is lower than the bid price and computed according to the next ranked advertisement; third, the three terms may not be on the same numeric scale. The ranking function optimization problem can be formulated as predicting the best parameter a given the search context s as

π(s) = arg max_a R(ϕ(s, a))    (3)

where R(ϕ(s, a)) is the reward given the ranking function ϕ(s, a). The reward can be defined as the sum of the purchase amount, the number of clicks and the platform revenue, or any weighted combination of the three terms during a certain period after the ranking function is deployed, depending on the platform performance goal. Since in the reinforcement learning literature a is used to represent the action of the learning agent, we adopt this notation to align with the literature.

3.2 Long-term Reward Oriented Ranking Function Optimization

Reinforcement learning methods are designed to solve the sequential decision making problem, in order to maximize a cumulative reward. In the literature, it is generally formulated as optimizing a Markov decision process (MDP) which is composed of: a state space S, an action space A, a reward space R with a reward function R : S × A → R, and a stationary transition dynamics distribution with conditional density p(s_{t+1} | s_t, a_t) which satisfies the Markov property p(s_{t+1} | s_1, a_1, ..., s_t, a_t) = p(s_{t+1} | s_t, a_t) for any state-action transition process s_1, a_1, ..., s_t, a_t. The action decision is made through a policy function π_θ : S → A parameterized by θ. The reinforcement learning agent interacts with the environment according to π_θ, giving rise to a state-action transition process s_1, a_1, ..., s_t, a_t, and the goal of the learning agent is to find a π*_θ which maximizes the expected discounted cumulative reward

r_γ = Σ_{s_0 ∼ p_0(s_0)} Σ_{k=0}^{∞} γ^k · R(s_k, a_k)

where 0 < γ < 1 and p_0(s_0) represents the initial state distribution.

The ranking function learning of a sponsored search platform has many special characteristics making it suitable to be formulated under the reinforcement learning framework. First of all, during one user search session, the sponsored search platform sequentially makes decisions on choosing ranking functions to present advertisements for the user. Second, during the interaction with users, the platform collects users' responses as rewards and balances between exploration and exploitation to maximize the long-term cumulative rewards. Third, since the data distribution changes over time, online learning is required for adaptation. In this work, we choose the reinforcement learning methodology to learn the ranking function in Eq. (3) for continuously improving the long-term rewards. Specifically, in our scenario, the state s is defined as the search context of a user query including the query terms, query categories, user demographic information and online behavior. The sponsored search platform (i.e., the reinforcement learning agent in our scenario) uses the ranking function as an action a to rank the advertisements and interacts with a user to get the reward r = R(s, a) as a combination of platform revenue, user engagement (quantified as user clicks, purchases, etc.) and the advertiser's sale amount. Then the environment transits to the next state s'. The reinforcement learning method optimizes the ranking function parameters through exploring and exploiting on the observed state-action-reward-state (s, a, r, s') tuples.
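The sketch below shows the shape of the data this formulation produces, assuming a gym-style reset()/step() interface as a stand-in for the serving system; on the real platform the tuples are assembled from logged impressions rather than an explicit environment object.

```python
from collections import namedtuple

# One observed interaction of the ranking agent with the search traffic.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

def collect_session(env, policy):
    """Collect (s, a, r, s') tuples for one user search session."""
    transitions = []
    state, done = env.reset(), False
    while not done:
        action = policy(state)                       # ranking parameter vector a
        next_state, reward, done = env.step(action)  # assumed step() signature
        transitions.append(Transition(state, action, reward, next_state))
        state = next_state
    return transitions
```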

4 METHOD
In the following sections, we introduce the detailed algorithms of the three modules in Fig. 1.

4.1 Environment Simulation Module
On one hand, the performance of reinforcement learning is guaranteed by adequate exploration; on the other hand, the exploration may bring uncertainty into the ranking functions' behavior and have a performance cost. Since the algorithm is designed to run on a commercial sponsored search platform, we minimize the exploration cost by building a simulated sponsored search environment to train the reinforcement learning model in an offline manner.

The sponsored search procedure is affected by many factors including the advertisers' budgets and bidding prices and the users' propensity and intention, making the procedure hard to simulate from scratch. In this work, instead of generating the whole sponsored search procedure, we propose to do the simulation by replaying the existing advertisement serving processes and making them adjustable in terms of the ranking function setting. Given a user query, the platform proceeds by retrieving a large set of advertisements according to a semantic matching criterion, predicting the CTR and CVR for these advertisements, and computing their rank scores for advertisement selection and click price estimation. To make the advertisement ranking process replayable, for each advertisement showing chance, we store the bidding information and the predicted CTR and CVR of all the associated advertisement candidates. According to Eqs. (1) and (2), the ranking orders and click prices for these advertisement candidates can be computed from this replay information.

Figure 2: Illustration of the difference between the predicted CTR and the groundtruth value on two types of devices and two showing positions.

Figure 3: Illustration of the Actor and Critic network architectures used in our work.

The reward for showing an advertisement is simulated by the reward shaping approach [26] as the user response (like clicking the advertisement or purchasing the product). For example, if our goal is to get more platform revenue and user clicks, the intermediate reward can be computed as

r(s_t, a_t) = CTR · click price + δ · CTR    (4)

where δ is a manually tunable parameter that balances between the expectation of more user engagement (click behavior) and more platform revenue. To the best of our knowledge, most prediction algorithms concentrate on ordering the response rates of the advertisements instead of predicting their true values, and are trained on biased data [12]. As a result, there is a gap between the ground truth user response rates and the predicted ones. To guarantee that the reinforcement learning method optimizes towards the right direction, we minimize the gap by CTR and CVR calibration.
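As a direct transcription of Eq. (4), the simulated reward for one replayed impression can be computed as below; the variable names are illustrative.

```python
def shaped_reward(calibrated_ctr, click_price, delta):
    """Intermediate reward of Eq. (4): expected revenue (CTR x click price)
    plus a user-engagement term (delta x CTR), where delta is the manually
    tuned trade-off parameter.
    """
    return calibrated_ctr * click_price + delta * calibrated_ctr
```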

4.1.1 Reward Calibration. In Fig. 2, we illustrate the difference between the predicted CTR and the groundtruth CTR on different device types and different advertisement showing positions.

The advertisement showing chances (impressions) are grouped into bins by discretizing their predicted CTR values, and the groundtruth CTR of a bin is computed as the number of clicks achieved by the impressions in this bin divided by the number of impressions. It can be seen that the predicted CTR and the groundtruth CTR exhibit diverse mapping relations in different contexts (e.g., device type, location). We manually select some context features which affect the mapping relations most, and calibrate the predicted CTR as

Γ(CTR, F) = \overline{CTR}    (5)

where F refers to the manually selected feature fields like time and device type, and \overline{CTR} is the averaged ground-truth CTR. To maintain the ordering relations of the predicted CTR on the set of advertisements, we employ the isotonic regression method [3] to compute the calibrated values for each <CTR, F> combination. The method automatically divides the CTR values into bins, targeting the minimization of the least squares error between the predicted CTRs and the calibrated ones. By grouping the advertisement showing results into bins according to the binned CTR values, the ground-truth CTR for a bin is computed as the number of observed clicks in this bin divided by the number of presented advertisements. CVR is calibrated in the same way.
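A possible implementation of this calibration step uses scikit-learn's isotonic regression (an assumed tooling choice); the context keys and data layout are illustrative.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_calibrators(predicted_ctr, clicked, context_keys):
    """Fit one isotonic mapping per selected context combination F.

    predicted_ctr: predicted CTRs of logged impressions; clicked: 0/1 click
    labels; context_keys: the value of the manually selected feature fields
    (e.g. "device1-pos1") for each impression. Isotonic regression preserves
    the ordering of the predicted CTRs while minimizing squared error
    against the observed clicks.
    """
    predicted_ctr = np.asarray(predicted_ctr, dtype=float)
    clicked = np.asarray(clicked, dtype=float)
    context_keys = np.asarray(context_keys)
    calibrators = {}
    for key in np.unique(context_keys):
        mask = context_keys == key
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        iso.fit(predicted_ctr[mask], clicked[mask])
        calibrators[key] = iso
    return calibrators

# Usage: calibrated = calibrators["device1-pos1"].predict([0.03])[0]
```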

4.2 Offline Reinforcement Learning for Ranking Function Initialization

In this section, we introduce the reinforcement learning problem formulation, the model architecture and the training method based on the simulated environment introduced above. Regarding the MDP representation in Section 3.2, for each advertisement showing chance, we define the state s_t as the search context composed of three types of features: (1) query related features like the query ID and query category ID; (2) user demographic and behavior features, including age, gender and the aggregated click number on certain advertisements; (3) advertisement related features, e.g., the advertisement position. The action a_t is the ranking function parameter vector a in Eq. (1), and the reward is defined as in Eq. (4). Since user intention is difficult to model correctly, it is hard to predict whether a user would switch to another specific query after seeing the current advertisement. Thus we focus on the state transitions within each search session and omit the inter-session ones. To generate the next state s_{t+1}, we simplify by assuming there is no change in the query related features. The user behavior features are updated by adding the expected behavior calculated from the predicted CTR and CVR with calibration (refer to Section 4.1.1). For the advertisement related features, we assume the user continuously reads the streamed contents and advertisements one by one, and the advertisement related features are updated accordingly.
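As an illustration of the state transition just described, the following sketch advances a simulated state within a session; the dictionary field names are hypothetical.

```python
def simulate_next_state(state, calibrated_ctr, calibrated_cvr):
    """Simulated within-session state transition.

    Query features are kept unchanged, user behavior counters advance by
    the expected (calibrated) response, and the advertisement position
    moves to the next slot in the stream.
    """
    next_s = dict(state)
    next_s["user_clicks"] = state["user_clicks"] + calibrated_ctr
    next_s["user_purchases"] = state["user_purchases"] + calibrated_ctr * calibrated_cvr
    next_s["ad_position"] = state["ad_position"] + 1
    return next_s
```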

Learning from the simulated environment introduced above gives rise to several special requirements for the learning algorithm. First of all, the simulation method above can only generate temporally independent state-action pairs, and lacks the capability of simulating the interaction between search sessions. This is because the state-action sequence of the user-platform interaction is tangled: the current user behavior is correlated with the previously presented advertisements and the user responses that occurred. The temporal independence of the training data requires the reinforcement learning method to support off-policy learning [7].

Second, because the action space A is continuous (refer to Eq. (3)), it is practical to define a deterministic policy function [31]. Moreover, the learning method should be capable of dealing with the complex mapping relation between actions and rewards caused by the discontinuously distributed bid prices. Taking these requirements into consideration, we use the Deep Deterministic Policy Gradient (DDPG) learning method [19] as the learner to optimize our ranking function. The method supports off-policy learning, and combines the learning power of deep neural networks with the deterministic policy function property of the Actor-Critic architecture.

The DDPG algorithm iterates between the value network (critic network) learning step and the policy network (actor network) learning step. The value network estimates the expectation of the discounted cumulative reward from the current time t as (an example reward function is shown in Eq. (4))

Q_{θ_Q}(s_t, a_t) = E[r_t + γ · r_{t+1} + γ² · r_{t+2} + ... | s_t, a_t]    (6)

and the policy network calculates the best ranking function strategy given the current search context as

a_t = π_{θ_π}(s_t).    (7)

The parameters θ_Q and θ_π refer to the weights and bias terms of the deep networks.

4.2.1 DDPG Network Architecture. The architectures of the value network and policy network in our DDPG based reinforcement learning model are shown in Fig. 3. Referring to Eqs. (6) and (7), both networks take the current state s_t as input. We represent all the features of s_t (refer to Section 4.2) as ID features, use a shared embedding layer to convert each of these ID features into a fixed-length dense vector, and concatenate them to form the feature representation of s_t. The embedding vectors are initialized as random vectors and updated during the reinforcement learning process. For the policy network, we connect the embedding layer to a fully-connected hidden layer with the Exponential Linear Unit (ELU) [6] as the activation function. In the experiments, we found that when using activation functions like Sigmoid and ReLU, the outputs of the nodes easily move into numeric ranges with zero gradients, hindering the propagation of gradients through the network. By employing the ELU as the activation function, the networks converge much faster. The hidden layer in the policy network is then connected to the output layer with a sigmoid activation function. We also use a clipping method as in [23] to clip the output into a valid range to avoid over-learning.

For the value network, its inputs are composed of the state features (s_t) and the action features (the ranking function parameter vector a_t). Different from the state features, the action features are continuous. We connect each of them to an independent fully-connected hidden layer with ELU as the activation function. The two hidden layers have the same number of nodes and are connected together to a higher level fully-connected hidden layer. In the output layer, we utilize a dueling network architecture [35] which divides the value function Q(s, a) in Eq. (6) into the sum of a state value function V(s) and a state-dependent action advantage function A(s, a), such that Q(s, a) = V(s) + A(s, a). According to the insight of [35], the dueling architecture makes the learning of the value network efficient by identifying the highly rewarded states and the states where the selected actions do not affect the rewards much.

Figure 4: Illustration of the offline reinforcement learning framework.

In our work, since our reward is defined on click prices (and product prices), it varies strongly across states and is discontinuous with changes in the ordering of the advertisement candidates. The variance in the rewards poses a big challenge for learning a stable policy function. In our experiments (refer to Section 5.1 and Fig. 6), we observed that the dueling architecture makes the learning process converge more quickly. This observation coincides with the conclusion in [36]. Parameter setups of the networks will be discussed in the experiments.
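For concreteness, the sketch below mirrors the described architecture with PyTorch as an assumed framework: ID features are embedded and concatenated for the actor, while the dueling critic combines state and action branches and outputs Q(s, a) = V(s) + A(s, a). Layer sizes, vocabulary sizes and the per-field embedding tables are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    def __init__(self, vocab_sizes, emb_dim=8, hidden=100, action_dim=5):
        super().__init__()
        # One embedding table per ID feature of the state (query ID, position, ...).
        self.embeddings = nn.ModuleList([nn.Embedding(v, emb_dim) for v in vocab_sizes])
        self.fc = nn.Linear(emb_dim * len(vocab_sizes), hidden)
        self.out = nn.Linear(hidden, action_dim)

    def forward(self, id_features):                   # LongTensor of shape [B, n_fields]
        s = torch.cat([emb(id_features[:, i]) for i, emb in enumerate(self.embeddings)], dim=-1)
        h = F.elu(self.fc(s))                         # ELU avoids dead-gradient regions
        return torch.sigmoid(self.out(h))             # action kept in a valid range

class DuelingCritic(nn.Module):
    def __init__(self, vocab_sizes, emb_dim=8, hidden=500, action_dim=5):
        super().__init__()
        self.embeddings = nn.ModuleList([nn.Embedding(v, emb_dim) for v in vocab_sizes])
        self.state_fc = nn.Linear(emb_dim * len(vocab_sizes), hidden)
        self.action_fc = nn.Linear(action_dim, hidden)
        self.joint_fc = nn.Linear(2 * hidden, hidden)
        self.v_head = nn.Linear(hidden, 1)             # state value V(s)
        self.a_head = nn.Linear(hidden, 1)             # action advantage A(s, a)

    def forward(self, id_features, action):
        s = torch.cat([emb(id_features[:, i]) for i, emb in enumerate(self.embeddings)], dim=-1)
        hs, ha = F.elu(self.state_fc(s)), F.elu(self.action_fc(action))
        h = F.elu(self.joint_fc(torch.cat([hs, ha], dim=-1)))
        return self.v_head(h) + self.a_head(h)         # Q(s, a) = V(s) + A(s, a)
```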

4.2.2 Learning the Ranking Function from the Simulated Environment. We employ the asynchronous training strategy [25] to train the DDPG model introduced above. As shown in Fig. 4, the training data is sampled by multiple independent exploration agents. These agents interact with the simulated sponsored search environment (Section 4.1) by sampling the advertisement serving replay data, trying different ranking functions (actions a) on the sampled replay data, and collecting the simulated rewards (Section 4.1.1) and state transitions (Section 4.2) for these actions to build training tuples of the form ⟨s_t, a_t, r_t, s_{t+1}⟩. The training tuples are then sent to local DDPG reinforcement learners to calculate the gradients of the value network parameters and policy network parameters and update these network parameters individually. The local learners also send their 'local' gradients to update the 'global' model parameters asynchronously. The algorithm is shown in Algorithm 1. To guarantee the independence and adequate exploration of the local reinforcement learners, we set the exploration agents to act by uniformly sampling from the ranking function parameter space.

In our implementation, we use a distributed computing platform with around 100 GPU machines, and 5 central parameter servers to store the global model parameters. On the distributed computing platform, each process maintains an asynchronous DDPG learner, an exploration agent and a local model (copied from the global model). After several rounds of gradient computation from the simulated training data, the gradients are sent asynchronously to the central parameter servers, and the updated model parameters are sent back to update the local learners' copies.

Algorithm 1: Asynchronous DDPG Learning
Input: Simulated transition tuple set T in the form ψ = <s_t, a_t, r_t, s_{t+1}>
Output: Strategy network π_{θ_π}(s_t)
1  Initialize the critic network Q_{θ_Q}(s_t, a_t) with parameters θ_Q and the actor network π_{θ_π}(s_t) with parameters θ_π;
2  Initialize the target networks Q′, π′ with weights θ_{Q′} ← θ_Q, θ_{π′} ← θ_π;
3  repeat
4      Update the network parameters θ_Q, θ_{Q′}, θ_π and θ_{π′} from the parameter server;
5      Sample a subset Ψ = {ψ_1, ψ_2, ..., ψ_m} from T;
6      For each ψ_i, calculate Q* = r_t + γ · Q′(s_{t+1}, π′(s_{t+1}));
7      Calculate the critic loss L = Σ_{ψ_i ∈ Ψ} ½ · (Q* − Q(s_t, a_t))²;
8      Compute the gradient of Q with respect to θ_Q as ∇_{θ_Q} Q = ∂L / ∂θ_Q;
9      Compute the gradient of π with respect to θ_π as ∇_{θ_π} π = Σ_{ψ_i ∈ Ψ} ∂Q(s_t, π(s_t)) / ∂π(s_t) · ∂π(s_t) / ∂θ_π = Σ_{ψ_i ∈ Ψ} ∂A(s_t, π(s_t)) / ∂π(s_t) · ∂π(s_t) / ∂θ_π;
10     Send the gradients ∇_{θ_Q} Q and ∇_{θ_π} π to the parameter server;
11     Update θ_Q and θ_π with ∇_{θ_Q} Q and ∇_{θ_π} π every N global steps by a gradient method;
12     Update θ_{Q′} and θ_{π′} by θ_{Q′} ← τ θ_{Q′} + (1 − τ) θ_Q, θ_{π′} ← τ θ_{π′} + (1 − τ) θ_π;
13 until convergence;
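A minimal sketch of the local learner's update in Algorithm 1, assuming PyTorch and per-network optimizers (both assumptions); in the actual system the computed gradients are shipped to the parameter servers rather than applied locally as done here.

```python
import torch

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.99):
    """One DDPG step on a batch of simulated (s, a, r, s') tuples.

    s, s_next: state feature tensors; a: action tensor; r: reward of shape [B, 1].
    """
    s, a, r, s_next = batch

    # Critic: regress Q(s, a) toward the one-step target Q* (lines 6-8 of Algorithm 1).
    with torch.no_grad():
        q_target = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = 0.5 * (q_target - critic(s, a)).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: deterministic policy gradient, i.e. maximize Q(s, pi(s)) (line 9).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft target update, theta' <- tau * theta' + (1 - tau) * theta (line 12).
    for net, target in ((critic, target_critic), (actor, target_actor)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(tau).add_((1.0 - tau) * p.data)
```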

Figure 5: Illustration of the evolution strategy based online reinforcement learning method.

4.3 Online Reinforcement Learning for Ranking Function Updating

Despite the effort of reward calibration, the offline simulated environment is still inconsistent with the real online environment due to the dynamic data distribution and the sequential correlation between continuous user behaviors. This inconsistency poses the online update requirement for the learned ranking function.

However, directly using the asynchronous training framework in Section 4.2.2 is not appropriate due to the specialities of online updating: (1) the data distribution is different, where the online rewards are sparse and discrete (e.g., click or non-click, not the click expectation in the simulated environment); (2) there is latency in reward collection, for instance, a user clicks an advertisement immediately, but the purchase behavior may be postponed for several days [?].

Regarding these specialities, we introduce the evolution strategy [28] to update the parameters of the policy model. The evolution strategy based online updating method is illustrated in Fig. 5. We perform the following steps to update the policy network π_{θ_π}(s_t) online: (i) stochastically perturb the parameters θ_π with a Gaussian noise generator with zero mean and variance σ². Denote the set of n perturbed parameters as Θ_{π,ϵ} = {θ_π + ϵ_1, θ_π + ϵ_2, ..., θ_π + ϵ_n}. (ii) Hash the online traffic into bins according to dimensions like user ID and IP address. For each parameter θ_{π,i} ∈ Θ_{π,ϵ}, we deploy a policy network π_{θ_{π,i}}(s_t) on a traffic bin and get the reward according to Eq. (4) as the weighted sum of the platform revenue and the click number in this bin, R_i = total click price + λ · click number. However, in reality, the number of advertisement showings is not exactly the same for each bin, so we compute the relative value of the reward by dividing it by the number of served advertisements, R_i = R_i / served ad number. (iii) Update the parameter θ_π by the weighted sum of the perturbations as

θ′_π = θ_π + η · (1 / (nσ)) · Σ_{i=1}^{n} R_i ϵ_i    (8)

where η is the learning rate. It should be noted that, with regard to online stability, we only use a small percentage of traffic (2% of the overall online traffic in total) for testing the performance of Θ_{π,ϵ}.
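A minimal NumPy sketch of the update in Eq. (8), assuming the perturbations and the normalized per-bin rewards have already been collected from the traffic bins.

```python
import numpy as np

def es_update(theta, epsilons, rewards, sigma, eta):
    """Evolution-strategy update of Eq. (8) on the policy parameters.

    theta: flattened policy-network parameters; epsilons: the n Gaussian
    perturbations epsilon_i that were deployed; rewards: the normalized
    reward R_i observed for each perturbed policy on its traffic bin.
    sigma is the perturbation scale, eta the learning rate.
    """
    epsilons = np.asarray(epsilons)               # shape (n, d)
    rewards = np.asarray(rewards).reshape(-1, 1)  # shape (n, 1)
    n = len(rewards)
    return theta + eta / (n * sigma) * np.sum(rewards * epsilons, axis=0)
```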

The evolution strategy based method has several merits in our scenario. First of all, it is derivative-free: since the rewards are discrete, it is hard to compute the gradient from the reward to the policy network parameters. Second, by fixing the seed of the random number generator, we only need to communicate the reward (a scalar) between the policy networks in the local traffic bins and the central parameter servers. Thirdly, the method does not require intermediate rewards, due to the homogeneity of the online traffic bins; thus it can also be deployed to optimize conversion related performance.

5 EXPERIMENTAL RESULTS
We conduct experiments on a popular commercial mobile sponsored search platform which serves hundreds of millions of active users monthly and covers a wide range of search categories. To fully study the effectiveness of the proposed ranking function learning method, both analytical experiments on offline data and empirical experiments deploying the learned ranking functions online are carried out. On the platform, the search results are presented in a streaming fashion, and the advertisements are shown at fixed positions within the streamed content. Since the search results are tangled with the advertisements, besides the platform revenue, one important issue we need to deal with is the user experience. In the experiments, we design the immediate reward r as

r = click price · is_click + λ · is_click    (9)

where click price is the amount we charge the advertisers according to the generalized second price auction, and is_click is a binary number indicating whether the advertisement is clicked (1) or not (0). λ is manually set according to the average click price to balance between the platform revenue goal and the user experience goal. We could also add an advertiser satisfaction term like the purchase price, but because we test our model on a small percentage of online traffic (2%), the purchase amount is highly varied according to our observation. In the current experiments, we do not show the purchase-optimized results and leave them as future work when we ramp up our test traffic.

5.1 Experiments on Offline Data
Since we are learning on biasedly sampled data, the mapping relation between the reward and the ranking function parameters is complex due to the highly varied distribution of bidding prices, so the convergence property of the proposed method is worth studying. In the offline experiments, we study the convergence property of the proposed method and the effect of using different architectures and different hyperparameters on the speed of convergence. We employ an analytical method to verify whether the proposed method can converge to the 'right' ranking function. In the experiment, a simple state representation (only query + advertisement position) is utilized such that, from the simulated data, it is computationally feasible to perform a brute force search to find the best parameters of the ranking function (refer to Eq. (1)). The brute force method proceeds by uniformly sampling the parameters at a fixed step size for each replay sample, computing the rewards (refer to Eq. (9)) based on the method in Section 4.2.2, and finding the best parameters θ* from these samples according to the aggregated rewards. For training the reinforcement learning model, we encode both the query IDs and the advertisement position IDs into 8-dimensional embedding vectors. As a result, our embedding layer produces a 16-dimensional feature vector. For the critic hidden layers, we utilize two fully-connected units, each of which has 500 nodes, with ELU [6] as the activation function. We use the same settings for the actor hidden layers except using 100 nodes in each layer. The learning rate for the network parameters, the learning rate for the target network parameters and the regularization loss penalty factor are set to 1.0e-5, 0.01 and 1.0e-5 respectively. λ is set to the average of the click prices calculated from the data log. Experimental results are presented in Fig. 6. The performance of the proposed method is measured by the squared error between the learnt ranking function parameters and the 'best' parameters found by the brute force method. From the results, we can see that the proposed method converges gradually to the best ranking function as the training process goes on.

We also evaluate the performance improvement brought by the dueling architecture in Fig. 6. Comparing the results of using the dueling architecture (annotated by 'dueling' in the figure) and not ('without dueling'), it can be seen that the dueling architecture improves the convergence speed dramatically. The intuition behind this result is that the dueling architecture helps remove, via V(s), the reward variance of the same action under different states, and guides the advantage network A(s, a) to focus more on differentiating between different actions. As a result, the policy network learning is accelerated.

Figure 6: Comparison of the convergence speed of training by utilizing different network architectures. (1) Averaged squared error difference between the strategy parameters of DDPG and the searched results; (2) advertisement impression weighted squared error difference between the strategy parameters of DDPG and the searched results. 'dueling' is the result trained with the dueling network structure and 'without dueling' is the result without the dueling network structure.

Figure 7: Comparison of the convergence speed of training by using different hyperparameters: effects of learning rate, batch size and regularization. The hyperparameter descriptions are shown in Table 1.

Table 1: Hyperparameter setup of Fig. 7.

ID             Learning rate   Regularization   Batch size
base           1.0e-5          1.0e-5           50k
learning rate  1.0e-4          1.0e-5           50k
batch size     1.0e-5          1.0e-5           10k
regular        1.0e-5          1.0e-3           50k

In Fig. 7, we evaluate the influences of different hyperparameters for training the model, including the learning rate, regularization penalty and batch size. The parameter setups are listed in Table 1. As can be observed, a lower learning rate ('base' vs. 'learning rate') makes the learning method converge more smoothly, and a larger batch size ('base' vs. 'batch size') and a lower regularization penalty ('base' vs. 'regular') make the learning method converge more closely to the optimal solution. This is because there is strong variance in the rewards: a lower learning rate and a larger batch size help to reduce the variance in the batched training data, and a lower regularization penalty allows a larger search space for the variables.

5.2 Online Serving Experiments
In this section, we present the experimental results of conducting a bucket experiment on a popular sponsored search platform. The sponsored search platform charges the advertisers for clicks on their advertisements according to the GSP auction mechanism. We split a small percentage (about 2%) of traffic from the whole online traffic by hashing the user IDs, IP addresses, etc., and deploy the learned ranking function online for advertisement selection and pricing. The following business metrics are measured to assess the improvement brought by the proposed method: (1) Revenue-Per-Mille (RPM): the revenue generated per thousand impressions; (2) Price-Per-Click (PPC): the average price per click determined by the auction; (3) Click-Through-Rate (CTR). We employ these three business metrics because we are optimizing towards the platform revenue and the user experience as in Eq. (9). RPM is determined by the product of CTR and PPC. From the changes of CTR and PPC, we can also infer the improvement of the advertisers' selling efficiency, i.e., an increase in CTR and a decrease in PPC mean the advertisers can attract more customers with less spending on advertisement serving.
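For reference, the bucket-level metrics can be computed as below; the decomposition RPM = CTR × PPC (per mille) is what allows RPM changes to be attributed to CTR and PPC changes.

```python
def bucket_metrics(revenue, clicks, impressions):
    """Business metrics for one traffic bucket: RPM (revenue per thousand
    impressions), CTR and PPC. Note RPM = CTR * PPC * 1000.
    """
    ctr = clicks / impressions
    ppc = revenue / clicks
    rpm = 1000.0 * revenue / impressions
    return rpm, ctr, ppc
```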

In this work, a new ranking function is proposed by adding terms reflecting the user engagement and the advertisers' gain, and a reinforcement learning method is presented for learning the optimal parameters of this ranking function. In the online experiments, we want to verify both the performance improvement brought by the new ranking function and the improvement from the reinforcement learning method. To evaluate the effectiveness of the new ranking function, we compare it with the one proposed by Lahaie and McAfee [13]. The ranking function in [13] has been employed by many companies and proved efficient in online advertising auctions. In the experiment, we use this ranking function as the baseline and set its parameter (the exponential term) by brute force search in the same manner as in Section 5.1.

Table 2: Experimental results comparing the performance of the ranking function in [13], the brute force searched ranking function in Section 5.1 and the one learned by the reinforcement learning method in Section 4.2.2.

Metrics (%)        Δrpm   Δctr   Δppc
McAfee [13]        0.00   0.00   0.00
brute force        2.55   1.12   1.45
offline learning   2.52   2.08   0.26

For the proposed method, we use both the ranking function obtained by the brute force method in Section 5.1 and the ranking function learned by the reinforcement learning method. The comparison results on two consecutive days' data are shown in Table 2. As observed, compared to the method of [13], our ranking function with parameters found by brute force search is capable of delivering 2.5% RPM growth with a 1.1% CTR increase and a 1.4% PPC increase. This observation confirms the effectiveness of the proposed ranking function in improving the platform's performance. For the ranking function learned by the reinforcement learning method, we observe a 2.5% RPM growth which arises mainly from the CTR increase (a 2.0% CTR gain). We attribute the difference between the reinforcement learning method and the brute force method to the fact that the reinforcement learning method does not converge to the exact values found by brute force, and that it uses a more elaborate set of features for learning. From the business side, we find that the new ranking function improves user engagement by attracting more user clicks on the presented advertisements. For the platform, the increase in RPM brings more efficiency in earning platform profit; for the advertisers, the RPM growth is driven by the CTR increase while there is only a little increase in PPC (for the learned ranking function), which means the advertisers only need to pay a little more money to attract more potential buyers.

Section 4.3 introduces the online evolution strategy method for tuning the policy network based on online data. In this experiment, we evaluate the online performance changes over seven consecutive days to confirm the performance increase brought by the online learning method. To perturb the strategy actions of π_{θ_π}(s), we add Gaussian noise G(0, δ²) with mean 0 and variance δ² = 0.01 to the parameters θ_π of π_{θ_π}(s). We split the traffic of the test bucket into a number of splits and apply each perturbed policy network to one of them. The buckets and the user feedback history are collected from the data logs to compute the updates in Eq. (8). The average performance of the test bucket is shown in Fig. 8. As can be seen, all three business metrics improve over the days. Compared with the baseline ranking function [13] (whose parameters stay unchanged during these days), the RPM grows from 2% to 4%, and the CTR grows from 2% to about 3%. The results indicate the effectiveness of the online updating method. We also find that the PPC grows because there is no constraint on it. In future work, we will try to add constraints to limit the PPC increase so as to generate more return for the advertisers.

6 CONCLUSIONS
It is commonly accepted that, when building a commercial sponsored search platform, besides the intermediate platform revenue, the users' engagement and the advertisers' return are also important to the long-term profit of the commercial platform.

Figure 8: Experimental results illustrating the business metric changes (Δrpm, Δctr, Δppc) during the online update of the proposed method in Section 4.3.

In this work, we design a new ranking function by incorporating these factors together. However, these additional terms increase the complexity of the ranking function. We therefore propose a reinforcement learning framework to optimize the ranking function towards the optimal long-term profit of the platform. As is known, reinforcement learning works in a trial-and-error manner. To allow adequate exploration without hurting the performance of the commercial platform, we propose to initialize the ranking function by an offline learning procedure conducted in a simulated sponsored search environment, followed by an online learning module which updates the model adaptively to the online data distribution. Experimental results confirm the effectiveness of the proposed method.

In the future, we will focus on the following directions: (1) Sequential user behavior simulation. The environment simulation method introduced in Section 4.1 is limited to one-time advertisement serving without considering the correlation between sequential user behaviors. We plan to try generative models like GANs [10] to model continuous user behaviors. (2) The proposed method has the ability to improve the advertisers' return-per-cost by adding the purchase amount to the reward term. Due to traffic limitations, we did not investigate this effect; in future work, we will increase the online testing traffic to measure the gain brought by adding the purchase amount reward.

REFERENCES
[1] Gagan Aggarwal, Ashish Goel, and Rajeev Motwani. 2006. Truthful auctions for pricing search keywords. In Proceedings of the 7th ACM Conference on Electronic Commerce.
[2] Kareem Amin, Michael Kearns, Peter Key, and Anton Schwaighofer. 2012. Budget optimization for sponsored search: Censored learning in MDPs. In UAI (2012).
[3] Michael Best and Nilotpal Chakravarti. 1990. Active set algorithms for isotonic regression: A unifying framework. Mathematical Programming 1–3 (1990), 425–429.
[4] Han Cai, Kan Ren, Weinan Zhang, Kleanthis Malialis, Jun Wang, Yong Yu, and Defeng Guo. 2017. Real-Time Bidding by Reinforcement Learning in Display Advertising. In Proceedings of the ACM International Conference on Web Search and Data Mining. 661–670.

[5] Deepayan Chakrabarti, Deepak Agarwal, and Vanja Josifovski. 2008. Contextual advertising by combining relevance with click feedback. In Proceedings of the International Conference on World Wide Web. 417–426.
[6] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2015. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). Computer Science (2015).
[7] Thomas Degris, Martha White, and Richard Sutton. 2012. Off-Policy Actor-Critic. In Proceedings of the International Conference on Machine Learning.
[8] Benjamin Edelman, Michael Ostrovsky, and Michael Schwarz. 2007. Internet Advertising and the Generalized Second-Price Auction: Selling Billions of Dollars Worth of Keywords. American Economic Review 97, 1 (2007), 242–259.
[9] Benjamin Edelman and Michael Schwarz. 2010. Optimal auction design and equilibrium selection in sponsored search auctions. The American Economic Review 100, 2 (2010), 597–602.
[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proceedings of Advances in Neural Information Processing Systems. 2672–2680.
[11] Haibin Cheng and Erick Cantu-Paz. 2010. Personalized click prediction in sponsored search. In Proceedings of the ACM International Conference on Web Search and Data Mining.
[12] Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased learning-to-rank with biased feedback. In Proceedings of the ACM International Conference on Web Search and Data Mining.
[13] Sebastien Lahaie and R. Preston McAfee. 2011. Efficient Ranking in Sponsored Search. (2011), 254–265.
[14] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436–444.
[15] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. 2016. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research 17, 39 (2016), 1–40.
[16] Wei Li, Xuerui Wang, Ruofei Zhang, Ying Cui, Jianchang Mao, and Rong Jin. 2010. Exploitation and exploration in a performance based contextual advertising system. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 27–36.
[17] Xiujun Li, Zachary C. Lipton, Bhuwan Dhingra, Lihong Li, Jianfeng Gao, and Yun-Nung Chen. 2016. A user simulator for task-completion dialogues. arXiv preprint arXiv:1612.05688 (2016).
[18] Yuxi Li. 2017. Deep Reinforcement Learning: An Overview. arXiv preprint arXiv:1701.07274 (2017).
[19] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous Control with Deep Reinforcement Learning. In Proceedings of the International Conference on Learning Representations. 1–14.
[20] H. Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al. 2013. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1222–1230.
[21] Aranyak Mehta, Amin Saberi, Umesh Vazirani, and Vijay Vazirani. 2007. Adwords and Generalized Online Matching. J. ACM 54, 5 (2007).
[22] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning. 1928–1937.
[23] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.
[24] Roger Myerson. 1981. Optimal Auction Design. Mathematics of Operations Research 6, 1 (1981), 58–73.
[25] Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, et al. 2015. Massively Parallel Methods for Deep Reinforcement Learning. (2015).
[26] Andrew Ng, Daishi Harada, and Stuart Russell. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the International Conference on Machine Learning, Vol. 99.
[27] Michael Ostrovsky and Michael Schwarz. 2011. Reserve prices in internet advertising auctions: A field experiment. (2011).
[28] Tim Salimans, Jonathan Ho, Xi Chen, and Ilya Sutskever. 2017. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864 (2017).
[29] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning. 1889–1897.
[30] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot. 2016. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature 529, 7587 (2016), 484–489.
[31] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning. 387–395.
[32] Pei-Hao Su, Milica Gasic, Nikola Mrksic, Lina Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. On-line active reward learning for policy optimisation in spoken dialogue systems. arXiv preprint arXiv:1605.07669 (2016).
[33] Richard Sutton and Andrew Barto. 1998. Reinforcement Learning: An Introduction. Cambridge: MIT Press.
[34] William Vickrey. 1961. Counterspeculation, auctions, and competitive sealed tenders. The Journal of Finance 16, 1 (1961), 8–37.
[35] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. 2015. Dueling network architectures for deep reinforcement learning. (2015), 1995–2003.
[36] Ronald Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 3–4 (1992), 229–256.
[37] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).
[38] Y. Wu and Y. Tian. 2017. Training Agent for First-person Shooter Game with Actor-Critic Curriculum Learning. In Proceedings of the International Conference on Learning Representations. 1–8.
[39] H. Yu, C. Hsieh, K. Chang, and C. Lin. 2012. Large linear classification when data cannot fit in memory. ACM Transactions on Knowledge Discovery from Data 4 (2012), 23–30.
[40] Weinan Zhang, Ulrich Paquet, and Katja Hofmann. 2016. Collective Noise Contrastive Estimation for Policy Transfer Learning. In AAAI. 1408–1414.