Statistically adaptive learning for a general class of..

7
arXiv:1209.0029v3 [cs.LG] 5 Sep 2012 Statistically adaptive learning for a general class of cost functions (SA L-BFGS) Stephen Purpura †§ [email protected] Jim Walsh § [email protected] Dustin Hillard § [email protected] Scott Golder § [email protected] Mark Hubenthal ‡§ [email protected] Scott Smith § [email protected] Abstract We present a system that enables rapid model experimentation for tera-scale machine learn- ing with trillions of non-zero features, billions of training examples, and millions of param- eters. Our contribution to the literature is a new method (SA L-BFGS) for changing batch L-BFGS to perform in near real-time by using statistical tools to balance the contributions of previous weights, old training examples, and new training examples to achieve fast conver- gence with few iterations. The result is, to our knowledge, the most scalable and flexible linear learning system reported in the literature, beating standard practice with the current best system (Vowpal Wabbit and AllReduce). Using the KDD Cup 2012 data set from Tencent, Inc. we provide experimental results to verify the performance of this method. 1 Introduction The demand for analysis and predictive modeling derived from very large data sets has grown immensely in recent years. One of the big problems in meeting this demand is the fact that data has grown faster than the availability of raw computational speed. As such, it has been necessary to use intelligent and efficient approaches when tackling the data training process. Specifically, there has been much focus on problems of the form min θR l m i=1 l(θ T x (i) ; y (i) )+ λS (θ), (1) * The material of this work is patent pending. in absentia at Department of Information Science, Cornell University, Ithaca, NY, 14850 recent graduate of Department of Mathematics, University of Washington, Seattle, WA, 98195 § Context Relevant, Inc., Seattle, WA, 98121 where x (i) R l is the feature vector of the ith exam- ple, y (i) ∈{0, 1} is the label, θ R l is the vector of fitting parameters, l is a smooth convex loss function and S a regularizer. Some of the more popular methods for determining θ include linear and logistic regression, respectively. The optimal such θ corresponds to a linear predictor function p θ (x)= θ T x that best fits the data in some appropriate sense, depending on l and S. We remark that such cost functions in (1) have a structure which is naturally decomposable over the given training examples, so that all computations can potentially be run in parallel over a distributed environment. However, in practice it is often the case that the model must be updated accordingly as new data is acquired. That is, we want to answer the question of how θ changes in the presence of new training examples. One naive approach would be to completely redo the entire data analysis process from scratch on the larger data set. The current fastest method in such a case utilizes the L-BFGS quasi-Newton minimization algorithm with AllReduce along a distributed cluster, (Agarwal et al., 2012). The other extreme is to apply the method of online learning, which considers one data point at a time and updates the parameters θ according to some form of gradient descent, see (Langford et al., 2009), (Duchi et al., 2010). How- ever, on the tera-scale, neither approach is as appealing or as fast as we can achieve with our method. We describe in a bit more detail these recent approaches to solving (1) in Section 3. For completeness, we also refer the reader to recent work relating to large-scale optimization contained in (Schraudolph et al., 2007) and (Bottou, 2010). Our approach in simple terms lies somewhere be- tween pure L-BFGS minimization (widely accepted as the fastest brute force optimization algorithm whenever the function is convex and smooth) and online learning. While L-BFGS offers accuracy and robustness with a relatively small number of iterations, it fails to take di- rect advantage of situations where the new data is not

Transcript of Statistically adaptive learning for a general class of..

Page 1: Statistically adaptive learning for a general class of..

arX

iv:1

209.

0029

v3 [

cs.L

G]

5 S

ep 2

012

Statistically adaptive learning for a general class ofcost functions (SA L-BFGS)∗

Stephen Purpura †§

[email protected]

Jim Walsh §

[email protected]

Dustin Hillard §

[email protected]

Scott Golder §

[email protected]

Mark Hubenthal ‡§

[email protected]

Scott Smith §

[email protected]

Abstract

We present a system that enables rapid modelexperimentation for tera-scale machine learn-ing with trillions of non-zero features, billionsof training examples, and millions of param-eters. Our contribution to the literature is anew method (SA L-BFGS) for changing batchL-BFGS to perform in near real-time by usingstatistical tools to balance the contributions ofprevious weights, old training examples, andnew training examples to achieve fast conver-gence with few iterations. The result is, toour knowledge, the most scalable and flexiblelinear learning system reported in the literature,beating standard practice with the current bestsystem (Vowpal Wabbit and AllReduce). Usingthe KDD Cup 2012 data set from Tencent, Inc.we provide experimental results to verify theperformance of this method.

1 Introduction

The demand for analysis and predictive modeling derivedfrom very large data sets has grown immensely in recentyears. One of the big problems in meeting this demand isthe fact that data has grown faster than the availability ofraw computational speed. As such, it has been necessaryto use intelligent and efficient approaches when tacklingthe data training process. Specifically, there has beenmuch focus on problems of the form

minθ∈Rl

m∑

i=1

l(θTx(i); y(i)) + λS(θ), (1)

∗The material of this work is patent pending.†in absentia at Department of Information Science, Cornell

University, Ithaca, NY, 14850‡recent graduate of Department of Mathematics, University

of Washington, Seattle, WA, 98195§Context Relevant, Inc., Seattle, WA, 98121

wherex(i) ∈ Rl is the feature vector of theith exam-

ple, y(i) ∈ {0, 1} is the label,θ ∈ Rl is the vector

of fitting parameters,l is a smooth convex loss functionandS a regularizer. Some of the more popular methodsfor determiningθ include linear and logistic regression,respectively. The optimal suchθ corresponds to a linearpredictor functionpθ(x) = θTx that bestfits the datain some appropriate sense, depending onl andS. Weremark that such cost functions in (1) have a structurewhich is naturally decomposable over the given trainingexamples, so that all computations can potentially be runin parallel over a distributed environment.

However, in practice it is often the case that the modelmust be updated accordingly as new data is acquired.That is, we want to answer the question of howθ changesin the presence of new training examples. One naiveapproach would be to completely redo the entire dataanalysis process from scratch on the larger data set. Thecurrent fastest method in such a case utilizes the L-BFGSquasi-Newton minimization algorithm with AllReducealong a distributed cluster, (Agarwal et al., 2012). Theother extreme is to apply the method of online learning,which considers one data point at a time and updates theparametersθ according to some form of gradient descent,see (Langford et al., 2009), (Duchi et al., 2010). How-ever, on the tera-scale, neither approach is as appealing oras fast as we can achieve with our method. We describe ina bit more detail these recent approaches to solving (1) inSection 3. For completeness, we also refer the reader torecent work relating to large-scale optimization containedin (Schraudolph et al., 2007) and (Bottou, 2010).

Our approach in simple terms lies somewhere be-tween pure L-BFGS minimization (widely accepted asthe fastest brute force optimization algorithm wheneverthe function is convex and smooth) and online learning.While L-BFGS offers accuracy and robustness with arelatively small number of iterations, it fails to take di-rect advantage of situations where the new data is not

Page 2: Statistically adaptive learning for a general class of..

very different from that acquired previously or situationswhere the new data is extremely different than the olddata. Certainly, one can initiate a new optimization jobon the larger data set with the parameterθ initialized tothe previous result. But we are left with the problem ofoptimizing over increasingly larger training sets at onetime. Similarly, online learning methods only considerone data point at a time and cannot reasonably changethe parameterθ by too much at any given step withoutrisk of severely increasing the regret. It also cannot typ-ically reach as small of an error count as that of a globalgradient descent approach. On the other hand, we willshow that it is possible to combine the advantages ofboth methods: in particular the small number of iterationsand speed of L-BFGS when applied to reasonably sizedbatches, and the ability of online learning to “forget” pre-vious data when the new data has changed significantly.

The outline of the paper is as follows. In Section 2we describe the general problem of interest. In Sec-tion 3 we briefly mention current widely used methodsof solving (1). In Section 4 we outline the statisticallyadaptive learning algorithm. Finally, in Section 5 webenchmark the performance of our two related methods(Context Relevant FAST L-BFGS and SA L-BFGS, re-spectively) against Vowpal Wabbit - one of the fastestcurrently available routines which incorporates the workof (Agarwal et al., 2012). We also include the associatedArea Under Curve (AUC) rating, which roughly speak-ing, is a number in[0, 1], where a value of1 indicatesperfect prediction by the model.

2 Background and Problem Setup

In this paper the underlying problem is as follows. Sup-pose we have a sequence of time-indexed data sets{Xt, Yt} whereXt = {x

(i)t }

mt

i=1, Yt = {y(i)t }

mt

i=1, t =0, 1, . . . , tf is the time index, andmt ∈ N is the batchsize (typically independent oft). Such data is given se-quentially as it is acquired (e.g.t could represent days),so that att = 0 one only has possession of{X0, Y0}. Al-ternatively, if we are given a large data set all at once, wecould divide it into batches indexed sequentially byt. Ingeneral, we use the notationxt with subscriptt to denotea time-dependent vector at timet, and we writext,j todenote thejth component ofxt. For eacht = 0, . . . , tfwe define

ft(θ) =

mt∑

i=1

l(θTx(i)t ; y

(i)t )

φt(θ) = ft(θ) + λS(θ), (2)

where as before,l is a given smooth convex loss functionandS is a regularizer. Also letθt be the parameter vectorobtained at timet, which in practice will approximatelyminimizeλS(θ)+

∑ts=0 fs(θ). We define the regret with

respect to a fixed (optimal) parameterθ∗ (in practice wedon’t know the true value ofθ∗) by

Rφ(t) = rφ(θ∗, t) :=

t∑

s=0

[φs(θs)− φs(θ∗)] (3)

=t

s=0

[fs(θs) + λS(θs)− fs(θ∗)− λS(θ∗)].

An effective algorithm is then one in which the sequence{θt}

tft=0 suffers sub-linear regret, i.e.,Rφ(t) = o(t).

As mentioned earlier, there has been much work doneregarding how to solve (1) with a variety of meth-ods. Before proceeding with a basic overview of thetwo most popular approaches to large-scale machinelearning in Section 3, it is important to understand theunderlying assumptions and implications of the pre-vious body of work. In particular, we mention thework of Leon Bottou in (Bottou, 2010) regarding large-scale optimization with stochastic gradient descent, and(Schraudolph et al., 2007) regarding stochastic online L-BFGS (oL-BFGS) optimization. Such works and othershave demonstrated that with a lot of randomly shuffleddata, a variety of methods (oL-BFGS, 2nd order stochas-tic gradient descent and averaged stochastic gradient de-scent) can work in fewer iterations than L-BFGS because:

(a) Small data learning problems are fundamentally dif-ferent from large data learning problems;

(b) The cost functions as framed in the literature havewell suited curvature near the global minimum.

We remark that the key problem for all quasi-Newtonbased optimization methods (including L-BFGS) hasbeen that noise associated with the approximation pro-cess – with specific properties dependent on each learningproblem – causes adverse conditions which can make L-BFGS (and its variants) fail. However, the problem ofnoise leading to non-positive curvature near the minimumcan be averted if the data is appropriately shaped (i.e. fea-ture selection plus proper data transformations). For nowthough, we ignore the issue and assume we already havea methodology for “feature shaping” that assures underoperational conditions that the curvature of the resultinglearning problem is well-suited to the algorithm that wedescribe.

3 Previous Work

3.1 Online Updates

In online learning, the problem of storage is completelyaverted as each data point is discarded once it is read.We remark that one can essentially view this approachas a special case of the statistically adaptive method de-scribed in Section 4 with a batch size of1. Such algo-rithms iteratively make a predictionθt ∈ R

l and then

Page 3: Statistically adaptive learning for a general class of..

receive a convex loss functionφt as in (2). Typically,φt(θ) = l(θTxt; yt) + λR(θ), where(xt, yt) is the datapoint read at timet. We then make an update to obtainθt+1 using a rule that is typically based on the gradi-ent of l(θTxt, yt) in θ. Indeed, the simplest approach(with no regularization term) would be the update ruleθt+1 = θt − ηt∇θl(θ

Tt xt, yt).

However, there currently exist more sophisticated up-date schemes which can achieve better regret bounds for(3). In particular, the work of Duchi, Hazan, and Singeris a type of subgradient method with adaptive proximalfunctions. It is proven that theirADAGRAD algorithmcan achieve theoretical regret bounds of the form

Rφ(t) = O(‖θ∗‖2tr(G1/2t )) and (4)

Rφ(t) = O

Åmaxs≤t‖θs − θ∗‖2tr(G

1/2t )

ã,

where in general,Gt =∑t

s=0 gsgTs is an outer prod-

uct matrix generated by the sequence of gradientsgs =∇θfs(θs) (Duchi et al., 2010). We remark that since theloss function gradients converge to zero underidealcon-ditions, the estimate (4) is indeed sublinear, because thedecay of the gradients, however slow, counters the lineargrowth in the size ofG1/2

t .

3.2 Vowpal Wabbit with Gradient Descent

Vowpal Wabbit is a freely available software pack-age which implements the method described briefly in(Agarwal et al., 2012). In particular, it combines onlinelearning and brute force gradient-descent optimization ina slightly different way. First, one does a single passover the whole data set using online learning to obtaina rough choice of parameterθ. Then, L-BFGS optimiza-tion of the cost function is initiated with the data splitacross a cluster. The cost function and its gradient arecomputed locally and AllReduce is utilized to collect theglobal function values and gradients in order to updateθ.The main improvement of this algorithm over previousmethods is the use of AllReduce with the Hadoop filestructure, which significantly cuts down on communi-cation time as is the case with MapReduce. Moreover,the baseline online learning step is done with a learn-ing rate chosen in an optimal manner as discussed in(Karampatziakis and Langford, 2011).

4 Our Approach

4.1 Least Squares Digression

Before we describe the statistically adaptive approach forminimizing a generic cost function, consider the follow-ing simpler scenario in the context of least squares regres-sion. Given data{X,Y } with X ∈ R

m×l andY ∈ Rm,

we want to chooseθ that solvesminθ∈Rl ‖Xθ − Y ‖22.

Assuming invertibility ofXTX , it is well known that thesolution is given by

θ = (XTX)−1XTY. (5)

Now, suppose that we have time indexed data{Xs, Ys}

Ts=0 with Xs ∈ R

ms×l andYs ∈ Rms . In order

to updateθt given{Xt+1, Yt+1}, first we must check howwell θt fits the newly augmented data set. We do this byevaluating

t+1∑

s=0

‖Xsθ − Ys‖22 (6)

with θ = θt. Depending on the result, we choose aparameterλ ∈ [0, 1] that determines how much weightto give the previous data when computingθt+1. Thatis, λ represents how much we would like to “forget” theprevious data (or emphasize the new data), with a value ofλ = 1 indicating that all previous data has been thrownout. Similarly, the caseλ = 1

2 corresponds to the casewhen past and present are weighed equally, and the caseλ = 1 corresponds to the case whenθt fits the new dataperfectly (i.e. (6) is equal to zero).

Let X[0,t] be the∑t

s=0 ms × l matrix[XT

0 , XT1 , . . . , X

Tt ]

T obtained by concatenation,and similarly define the length-

∑ts=0 ms vector

Y[0,t] := [Y T0 , Y T

1 , . . . , Y Tt ]T . Then (6) is equivalent to

‖X[0,t+1]θ − Y[0,t+1]‖22. Now, when using a particular

second order Newton method for minimizing a smoothconvex function, the computation of the inverse Hessianmatrix is analogous to computing(XT

[0,t]X[0,t])−1

above. Ast grows large, the cumulative normal matrixXT

[0,t]X[0,t] becomes increasingly costly to computefrom scratch, as does its inverse. Fortunately, we observethat

XT[0,t+1]X[0,t+1] = XT

[0,t]X[0,t] +XTt+1Xt+1. (7)

However, if we want to incorporate the flexibility toweigh current data differently relative to previous data,we need to abandon the exact computation of (7). In-stead, lettingAt denote the approximate analogue ofXT

[0,t]X[0,t], we introduce the update

At+1 ←2

1 + µ2

(

µ2At +XTt+1Xt+1

)

(8)

whereµ satisfiesλ = µ2

1+µ2 .The actual update ofθt is performed as follows. Define

‹Y[0,t] := XT[0,t]Y[0,t] = [XT

0 , . . . , XTt ]

Y0

Y1

...Yt

=t

s=0

XTs Ys.

Page 4: Statistically adaptive learning for a general class of..

Up to time t, the standard solution to the least squaresproblem on the data{X[0,t], Y[0,t]} is then

θ = (XT[0,t]X[0,t])

−1‹Y[0,t]. (9)

Now letBt be an approximation to‹Y[0,t]. We defineBt+1

by the update

Bt+1 =2

1 + µ2

(

µ2Bt +XTt+1Yt+1

)

.

Finally, we set

θt+1 := A−1t+1Bt+1. (10)

It is easily verified that (10) coincides with the standardupdate (9) whenµ = 1.

4.2 Statistically Adaptive Learning

Returning to our original problem, we start with the pa-rameterθ0 obtained from some initial pass through of{X0, Y0}, typically using a particular gradient descent al-gorithm. In what follows, we will need to define an easilyevaluated error function to be applied at each iteration,mildly related to the cumulative regret (3):

I(t, θ) :=

∑ts=0

∑ms

i=1 |pθ(x(i)t )− y

(i)t |

∑ts=0 |Xs|

(11)

We remark thatI(t, θ) represents the relative number ofincorrect predictions associated with the parameterθ overall data points from times = 0 to s = t. Moreover,becausepθ is a linear function ofx, I is very fast toevaluate (essentiallyO(m) wherem =

∑ts=0 ms).

Given θt, we computeI(t + 1, θt). There are twoextremal possibilities:

1. I(t+1, θt) is significantlylarger thanI(t, θt). Moreprecisely, we mean thatI(t + 1, θt) − I(t, θt) >

σ(t), where σ(t) is the standard deviation of{I(s, θs)}

ts=0. In this case the data has significantly

changed, and soθ must be modified.

2. Otherwise, there is no need to changeθ and we setθt+1 = θt.

In the former case, we use the magnitude ofI(t+1, θt)−I(t, θt) to determine a subsample of the old and new datawith Mold and Mnew points chosen, respectively (seeFigure 1). Roughly speaking, the larger the difference themore weight will be given to the most recent data points.The sampling of previous data points serves to anchorthe model so that the parameters do not over fit to thenew batch at the expense of significantly increasing theglobal regret. This is a generalization of online learningmethods where only the most recent single data point

batch

1,2,...,t-1

batch t

data points selected

from previous batches

data points selected

from current batch

Figure 1: Subsampling of the partitioned data stream attime t and times0, 1, . . . , t− 1, respectively.

is used to updateθ. From the subsample chosen, wethen apply a gradient descent optimization routine wherethe initialization of the associated starting parameters isgenerated from those stored from the previous iteration.In the case of L-BFGS, the rank1 matrices used to ap-proximate the inverse Hessian stored from the previousiteration are used to initialize the new descent routine.We summarize the process in Algorithm 1.

Algorithm 1 Statistically Adaptive Learning Method (SAL-BFGS)Require: Error checking functionI(t, θ)

Given data{Xs, Ys}tfs=0

Run gradient descent optimization on{X0, Y0} tocomputeθ0for t = 1 to tf do

if I(t+ 1, θt)− I(t, θt) > σ(t) thenChooseMold andMnew

Subsample dataRun L-BFGS with initial parameterθt to ob-

tainθt+1

else θt+1 ← θtend if

end for

As a typical example, at some timet we might haveMold = 1000,Mnew = 100, 000, and

∑tft=0 mt = 1·109.

This would be indicative of a batch{Xt+1, Yt+1} withsignificantly higher error using the current parameterθtthan for previously analyzed batches.

We remark that when learning on each new batch ofdata, there are two main aspects that can be parallelized.First, the batch itself can be partitioned and distributedamong nodes in a cluster via AllReduce to significantlyspeed up the evaluation of the cost function and its gra-dient as is done in (Agarwal et al., 2012). Furthermore,one can run multiple independent optimization routinesin parallel where the distribution used to subsample fromXt+1 and∪ts=0Xs is varied. The resulting parametersθ obtained from each separate instance can then be sta-

Page 5: Statistically adaptive learning for a general class of..

tistically compared so as to make sure that the model isnot overly sensitive to the choice of sampling distribu-tion. Otherwise, havingθ be too highly dependent on thechoice of subsample would invalidate using a stochas-tic gradient descent-based approach. A bi-product ofthis ability to simultaneously experiment with differentsamplings is that it provides a quick means to check theconsistency of the data.

Finally, we remark that the SA L-BFGS method can bereasonably adapted to account for changes in the selectedfeatures as new data is acquired. Indeed, it is very ap-pealing within the industry to be able to experiment withdifferent choices of features in order to find those thatmatter most, while still being able to use the previouslycomputed parametersθt to speed up the optimization onthe new data. Of course, it is possible to directly ap-ply an online learning approach in this situation, sinceprevious data points have already been discarded. Buttypical gradient descent algorithms do not a priori havethe flexibility to be directly applied in such cases and theytypically perform worse than batch methods such as L-BFGS(Agarwal et al., 2012).

5 Experiments

5.1 Description of Dataset and Features

We consider data used to predict the click-through-rate(pCTR) of online ads. An accurate model is necessaryin the search advertising market in order to appropriatelyrank ads and price clicks. The data contains11 variablesand 1 output, corresponding to the number of times agiven ad was clicked by the user among the number oftimes it was displayed. In order to reduce the data size,instances with the same user id, ad id, query, and settingare combined, so that the output may take on any posi-tive integer value. For each instance (training example),the input variables serve to classify various properties ofthe ad displayed, in addition to the specific search queryentered. This data was acquired from sessions of the Ten-cent proprietary search engine and was posted publicly onwww.kddcup.2012.org (Tencent, 2012).

For these experiments we build a basic model thatlearns from the identifiers provided in the training set.These include unique identifiers for each query, ad, key-word, advertiser, title, description, display url, user, adposition, and ad depth (further details available in theKDD documentation). We compute a position and depthnormalized click through rate for each identifier, as wellas combinations (conjunctions) of these identifiers. Thenat training and testing time we annotate each examplewith these normalized click through rates. Additionally,before running the optimization, it is necessary to buildappropriate feature vectors (i.e. shape the data). We willnot go into detail regarding how this is done, except to

mention that the number of features generated is on theorder of1000.

5.2 Model 1 Results

For our first set of experiments, we compare the perfor-mance of Vowpal Wabbit (VW) using its L-BFGS im-plementation and the Context Relevant Flexible Analyt-ics and Statistics TechnologyTM L-BFGS implementationrunning on10 Amazon m1.xlarge instances. The timemeasured (in seconds) is only the time required to trainthe models. The features are generated and cached foreach implementation prior to training.

Performance was measured using the Area UnderCurve (AUC) metric because this was the methodologyused in (Tencent, 2012). In short, the AUC is equal to theprobability that a classifier will rank a randomly chosenpositive instance higher than a randomly chosen negativeone. More precisely, it is computed via Algorithm 3 in(Fawcett, 2004). We compute our AUC results over aportion of the public section of the test set that has about2 million examples.

Model 1 includes only the basic id features, with noconjunction features, and achieves an AUC of 0.748 asshown in Table 1. A simple baseline performance, whichcan be generated by predicting the mean ctr for eachad id would perform at approximately an AUC of 0.71.The winner of (Tencent, 2012) performed at an AUC of0.80. However, the winning model was substantiallymore complicated and used many additional features thatwere excluded from this simple demonstration. In futurework, we will explore more sophisticated feature sets.

The Context Relevant and VW models both achieve thesame AUC on the test set, which validates that the basicgradient descent and L-BFGS implementations are func-tionally equivalent. The Context Relevant model com-pletes learning between four and five times more quickly.Our implementation is heavily optimized to reduce com-putation time as well as memory footprint. In addition,our implementation utilizes an underlying MapReduceimplementation that provides robustness to job and nodefailures1.

5.3 Model 2 Results

Context Relevant’s implementation of SA L-BFGS isdesigned to accelerate and simplify learning iterativechanges to models. Using information gleaned from theinitial L-BFGS pass, SA L-BFGS develops a sampling

1Context Relevant had to re-write the AllReduce networkimplementation to add error checking so that the AllReduce sys-tem was robust to errors that were encountered during normalexecution of experiments on Amazon’s EC2 systems. Withoutthese changes, we could not keep AllReduce from hanging dur-ing the experiments. There is no graceful recovery from the lossof a single node.

Page 6: Statistically adaptive learning for a general class of..

Table 1: Model 1 Results For Different Learning Mech-anisms (VW = Vowpal Wabbit; CR = Context RelevantFAST L-BFGS

VW CRL-BFGS L-BFGS

seconds 490 114AUC .748 .748

strategy to minimize sampling induced noise when learn-ing new models that are derived from previous models.The larger the divergence in the models, the less speed-up is likely. For this set of experiments, we add a con-junction feature that captures the interaction between aquery id and an ad id, which has frequently been animportant feature in well known advertising systems. Wethen compare the speed and accuracy of Vowpal Wab-bit (VW) using its L-BFGS implementation; the ContextRelevant Flexible Analytics and Statistics TechnologyTM

L-BFGS implementation, and the Context Relevant Flex-ible Analytics and Statistics TechnologyTM SA L-BFGSimplementation (SA) running on10 Amazon X1.Largeinstances. Here the baseline L-BFGS models are trainedwith the standard practice for adding a new feature, themodels are retrained on the entire dataset. The time mea-sured (in seconds) is only the time required to train themodels. The features are generated and cached for eachimplementation prior to training.

Again, performance was measured using the Area Un-der Curve (AUC) metric. Table 2 lists the results for eachalgorithm, and shows that AUC improves in comparisonto Model 1 when the new feature is added. As with Model1, the VW and CR models achieve similar AUC and thebasic L-BFGS CR learning time is significantly faster.Furthermore, we show that SA L-BFGS also achievessimilar AUC, but in less than one tenth of the time (whichlikewise implies one tenth of the compute cost required).This speed up can enable a large increase in the numberof experiments, without requiring additional compute ortime. It is important to note that the speed of L-BFGSand SA L-BFGS is essentially tied to the number offeatures and the number of examples for each iteration.The primary performance gains that can be found are: (a)reducing the number of iterations; (b) reducing the num-ber of examples; or (c) reducing the number of featureswith non-zero weights. SA L-BFGS adopts the formertwo strategies. A reduction in the number of featureswith non-zero weights can be forced through aggressiveregularization, but at the expense of specificity.

6 Conclusions

We have presented a new tera-scale machine learningsystem that enables rapid model experimentation. The

Table 2: Model 2 Results For Different Learning Mech-anisms (VW = Vowpal Wabbit; CR = Context RelevantFAST L-BFGS; SA = Context Relevant FAST SA L-BFGS)

VW CR CRL-BFGS L-BFGS SA L-BFGS

seconds 515 115 9AUC .750 .752 .751

system uses a new version of L-BFGS to combine therobustness and accuracy of second order gradient descentoptimization methods with the memory advantages ofonline learning. This provides a model building envi-ronment that significantly lowers the time and computecost of asking new questions. The ability to quickly askand answer experimental questions vastly expands to setof questions that can be asked, and therefore the spaceof models that can be explored to discover the optimalsolution.

SA L-BFGS is also well suited to environments wherethe underlying distribution of the data provided to a learn-ing algorithm is shifting. Whether the shift is causedby changes in user behavior, changes in market pricing,or changes in term usage, SA L-BFGS can be empiri-cally tuned to dynamically adjust to the changing con-ditions. One can utilize the parallelized approach in(Agarwal et al., 2012) on each batch of data in the timevariable, with the additional freedom to choose the batchsize. Furthermore, the statistical aspects of the algorithmprovide a useful way to check the consistency of the datain real time. However, like all L-BFGS implementa-tions that rely on small, reduced, or sampled data sets,increased sampling noise from the L-BFGS estimationprocess affects the quality of the resulting learning al-gorithm. Users of this new algorithm must take care toprovide smooth convex loss functions for optimization.The Context Relevant Flexible Analytics and StatisticsTechnologyTM is designed to provide such functions foroptimization.

7 About Context Relevant and the Authors

Context Relevant was founded in March 2012 by StephenPurpura and Christian Metcalfe. The company was ini-tially funded by friends, family, Seattle angel investors,and Madrona Venture Group.

Stephen Purpura – CEO & Co-Founder – Stephenworks as the CEO and CTO of the company. He has morethan 20 years of experience, is listed as an inventor offive issued United States patents and served as programmanager of Microsoft Windows and the Chief SecurityOfficer of MSFDC, one of the first Internet bill paymentsystems. Stephen is recognized as a leading expert in the

Page 7: Statistically adaptive learning for a general class of..

fields of machine learning, political micro-targeting andpredictive analytics.

Stephen received a bachelor’s degree from the Uni-versity of Washington, a master’s degree from Harvard,where he was part of the Program on Networked Gover-nance at the John F. Kennedy School of Government, andhe is set to be granted a PhD in information science fromCornell.

Jim Walsh – VP Engineering – Jim brings 23 yearsof experience in technology innovation and engineeringmanagement to Context Relevant. He founded, built andled the Cosmos team – Microsoft’s massive scale dis-tributed data storage and analysis environment, under-pinning virtually all Microsoft products including Bing– and founded the Bing multimedia search team. In hisfinal position at Microsoft, Jim served as the principalarchitect for Microsoft advertising platforms.

Prior to joining Microsoft, Jim created two softwarestartups, wrote the second-ever Windows application andcreated custom computer animation software for the tele-vision broadcast industry. He is recognized as one ofthe world’s leading network performance experts and hasbeen granted twenty two technology patents that rangefrom performance to user interface design. Jim earned abachelor of science degree in computing science from theUniversity of Alberta.

Dustin Rigg Hillard – Director of Engineering andData Scientist – Dustin is a recognized data science andmachine learning expert who has published more than 30papers in these areas. Previously at Microsoft and Ya-hoo!, he spent the last decade building systems that sig-nificantly improve large-scale processing and machine-learning for advertising, natural language and speech.

Prior to joining Context Relevant, Dustin Hillardworked for Microsoft, where he worked to improvespeech understanding for mobile applications and XBoxKinect. Before that he was at Yahoo! for threeyears, where he focused on improving ad relevance.His research in graduate school focused on automaticspeech recognition and statistical translation. Dustinincorporates approaches from these and other fields tolearn from massive amounts of data with supervised,semi-supervised, and unsupervised machine learning ap-proaches.

Dustin holds a bachelor’s and master’s degree as wellas a PhD in Electrical Engineering from the University ofWashington.

Scott Golder – Data Scientist & Staff Sociologist –Scott mines social networking data to investigate broadquestions such as when people are happiest (morningsand weekends) and how Twitter users form new socialties. His work has been published in the journal Scienceas well as top computer science conferences by the ACMand IEEE, and has been covered in media outlets such

as MSNBC, The New York Times, The Washington Postand National Public Radio. He has also been profiled byLiveScience’s “ScienceLives”.

He has worked as a research scientist in the SocialComputing Lab at HP Labs and has interned at Google,IBM and Microsoft. Scott holds a master’s degree fromMIT, where he worked with the Media Laboratory’s So-ciable Media Group, and graduated from Harvard Univer-sity, where he studied Linguistics and Computer Science.Scott is currently on leave from the PhD program in So-ciology at Cornell University.

Mark Hubenthal – Member of the Technical Staff –Mark recently received his PhD in Mathematics from theUniversity of Washington. He works on inverse problemsapplicable to medical imaging and geophysics.

Scott Smith - Principal Engineer and Architect - Scotthas experience with distributed computing, compiler de-sign, and performance optimization. At Akamai, hehelped design and implement the load balancing andfailover logic for the first CDN. At Clustrix, he built theSQL optimizer and compiler for a distributed RDBMS.He holds a masters degree in computer science from MIT.

References

[Agarwal et al.2012] A. Agarwal, O. Chapelle, M. Dudik,and J. Langford. 2012. A reliable effective terascalelinear learning system. Feb. arXiv:1110.4198v2.

[Bottou2010] L. Bottou. 2010. Large-scale machinelearning with stochastic gradient descent. pages 177–187.

[Duchi et al.2010] J. Duchi, E. Hazan, and Y. Singer.2010. Adaptive subgradient methods for online learn-ing and stochastic optimization.

[Fawcett2004] T. Fawcett. 2004. ROC graphs: Notes andpractical considerations for researchers.

[Karampatziakis and Langford2011] N. Karampatziakisand J. Langford. 2011. Online importance weightaware updates.

[Langford et al.2009] J. Langford, L. Li, and T. Zhang.2009. Sparse online learning via truncated gradient.Journal of Machine Learning Research, 10:777–801.

[Schraudolph et al.2007] N. Schraudolph, J. Yu, andS. Gunter. 2007. A stochastic quasi-newton methodfor online convex optimization.

[Tencent2012] Tencent. 2012.KDD Cup 2012, Track 2 Data.http://www.kddcup2012.org/c/kddcup2012-track2/data