
On Empirical Comparisons of Optimizers for Deep Learning

Dami Choi 1 2  Christopher J. Shallue 1  Zachary Nado 1  Jaehoon Lee 1  Chris J. Maddison 3 4  George E. Dahl 1

1 Google Research, Brain Team  2 Vector Institute and University of Toronto  3 DeepMind  4 Institute for Advanced Study, Princeton, NJ. Correspondence to: Dami Choi <[email protected]>.

Abstract

Selecting an optimizer is a central step in the contemporary deep learning pipeline. In this paper, we demonstrate the sensitivity of optimizer comparisons to the hyperparameter tuning protocol. Our findings suggest that the hyperparameter search space may be the single most important factor explaining the rankings obtained by recent empirical comparisons in the literature. In fact, we show that these results can be contradicted when hyperparameter search spaces are changed. As tuning effort grows without bound, more general optimizers should never underperform the ones they can approximate (i.e., Adam should never perform worse than momentum), but recent attempts to compare optimizers either assume these inclusion relationships are not practically relevant or restrict the hyperparameters in ways that break the inclusions. In our experiments, we find that inclusion relationships between optimizers matter in practice and always predict optimizer comparisons. In particular, we find that the popular adaptive gradient methods never underperform momentum or gradient descent. We also report practical tips around tuning often-ignored hyperparameters of adaptive gradient methods and raise concerns about fairly benchmarking optimizers for neural network training.

1. Introduction

The optimization algorithm chosen by a deep learning practitioner determines the training speed and the final predictive performance of their model. To date, there is no theory that adequately explains how to make this choice. Instead, our community relies on empirical studies (Wilson et al., 2017) and benchmarking (Schneider et al., 2019). Indeed, it is the de facto standard that papers introducing new optimizers report extensive comparisons across a large number of workloads. Therefore, to maximize scientific progress, we must have confidence in our ability to make empirical comparisons between optimization algorithms.

Although there is no theory guiding us when comparing optimizers, the popular first-order optimizers form a natural inclusion hierarchy. For example, ADAM (Kingma & Ba, 2015) and RMSPROP (Tieleman & Hinton, 2012) can approximately simulate MOMENTUM (Polyak, 1964) if the ε term in the denominator of their parameter updates is allowed to grow very large. However, these relationships may not matter in practice. For example, the settings of ADAM's hyperparameters that allow it to match the performance of MOMENTUM may be too difficult to find (for instance, they may be infinite).

In this paper, we demonstrate two important and interrelated points about empirical comparisons of neural network optimizers. First, we show that inclusion relationships between optimizers actually matter in practice; in our experiments, more general optimizers never underperform special cases. Despite conventional wisdom (Wilson et al., 2017; Balles & Hennig, 2017), we find that when carefully tuned, ADAM and other adaptive gradient methods never underperform MOMENTUM or SGD. Second, we demonstrate the sensitivity of optimizer comparisons to the hyperparameter tuning protocol. By comparing to previous experimental evaluations, we show how easy it is to change optimizer rankings on a given workload (model and dataset pair) by changing the hyperparameter tuning protocol, with optimizer rankings stabilizing according to inclusion relationships as we spend more and more effort tuning. Our findings raise serious questions about the practical relevance of conclusions drawn from these sorts of empirical comparisons.

The remainder of this paper is structured as follows. In Section 2, we review related work, focusing on papers that make explicit claims about optimizer comparisons in deep learning and application papers that provide evidence about the tuning protocols of practitioners. We develop our definition of first-order optimizers in Section 3 along with a notion of inclusion relationships between optimizers. We present our experimental results in Section 4. Despite thorny methodological issues over how to avoid biases in comparisons due to search spaces that favor one optimizer over another, we believe that our experimental methodology is an acceptable compromise and has substantial practical relevance. Among other results, we show that the inclusion hierarchy of update rules is almost entirely predictive of optimizer comparisons. In particular, NADAM (Dozat, 2016) achieves the best top-1 validation accuracy for ResNet-50 on ImageNet in our experiments. The 77.1% we obtain with NADAM, although not as good as the 77.6% obtained using learned data augmentation by Cubuk et al. (2018), is better than the best existing published results using any of the more standard pre-processing pipelines (76.5%, due to Goyal et al. (2017) using MOMENTUM).

2. Background and Related Work

Our work was inspired by the recent studies of neural network optimizers by Wilson et al. (2017) and Schneider et al. (2019). Wilson et al. (2017) constructed a simple classification problem in which adaptive gradient methods (e.g. ADAM) converge to provably worse solutions than standard gradient methods. However, crucially, their analysis ignored the ε parameter in the denominator of some adaptive gradient methods. Wilson et al. (2017) also presented experiments in which ADAM produced worse validation accuracy than SGD across all deep learning workloads considered. However, they only tuned over the learning rate and learning rate decay scheme in their experiments, leaving all other parameters of ADAM at fixed default values. Despite these findings, adaptive gradient methods have remained popular since the work of Wilson et al. (2017). Schneider et al. (2019) presented a benchmark suite (DEEPOBS) for deep learning optimizers and reported that there was no single best optimizer across the workloads they considered. Yet Schneider et al. (2019) only tuned the learning rate of each optimizer and left all other hyperparameters at fixed default values.

As we discuss in Section 4.3, the choices of hyperparameter tuning protocols in Wilson et al. (2017) and Schneider et al. (2019) may be the most important factor preventing their results from being relevant to practical choices about which optimizer to use. Hyperparameter tuning is a crucial step of the deep learning pipeline (Bergstra & Bengio, 2012; Snoek et al., 2012; Sutskever et al., 2013; Smith, 2018), so it is critical for papers studying optimizers to match as closely as possible the tuning protocols of an ideal practitioner. Yet, tuning protocols often differ between works studying neural network optimizers and works concerned with training neural networks to solve specific problems.

Recent papers that study or introduce optimization algorithms tend to compare to ADAM and RMSPROP without tuning their respective ε hyperparameters (see Table 1 for notation), presumably to simplify their experiments. It is standard to leave ε at the common default value of 10^-8 for ADAM and 10^-10 for RMSPROP (Tieleman & Hinton, 2012; Kingma & Ba, 2015; Dozat, 2016; Balles & Hennig, 2017; Loshchilov & Hutter, 2017; Zou & Shen, 2018; Ma & Yarats, 2018; Bernstein et al., 2018; Chen et al., 2019; Zou et al., 2019). Others do not even report the value of ε used (Balles & Hennig, 2017; Zhang & Mitliagkas, 2017; Keskar & Socher, 2017; Chen et al., 2018; Zhou et al., 2018; Aitchison, 2018; Reddi et al., 2019; Luo et al., 2019). There are exceptions. Zaheer et al. (2018) and Liu et al. (2019) considered ε values orders of magnitude larger than the standard default. However, the experiments in both papers gave only limited consideration to ε, testing at most two values while tuning ADAM. De et al. (2018) is the only work we found that considered a broad range of values for ε. Both Zaheer et al. (2018) and De et al. (2018) found that non-default values of ε outperformed the default.

While it is also extremely common in applications to use a default value of ε, some notable papers tuned ε and selected values up to eight orders of magnitude away from the common defaults. Szegedy et al. (2016) used ε = 1 for RMSPROP; Liu et al. (2019) reported that their results were sensitive to ε and set ε = 10^-6 for ADAM; Tan et al. (2019) and Tan & Le (2019) set ε = 10^-3 for RMSPROP, the latter achieving state-of-the-art ImageNet top-1 accuracy. In reinforcement learning, Hessel et al. (2017) set ε = 1.5 × 10^-4.

Despite being introduced solely to prevent division by zero,1 ADAM's ε can be interpreted in ways that suggest the optimal choice is problem-dependent. If ADAM is interpreted as an empirical, diagonal approximation to natural gradient descent (Kingma & Ba, 2015), ε can be viewed as a multi-purpose damping term whose role is to improve the conditioning of the Fisher, in analogy to the approximate second-order method considered by Becker & Le Cun (1988). We can also view ε as setting a trust region radius (Martens & Grosse, 2015; Adolphs et al., 2019) and controlling an interpolation between momentum and diagonal natural gradient descent, by either diminishing or increasing the effect of vt on the update direction. Under either interpretation, the best value for ε will be problem-dependent and likely benefit from tuning.
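To make this interpolation concrete, the toy sketch below (ours, not from the paper) prints Adam's effective per-parameter step size α/(√v + ε) for a few gradient scales: with a tiny ε the update is roughly gradient-normalized, while with a very large ε every parameter receives the same scale α/ε, i.e. momentum-like behavior.

# Illustrative sketch (not from the paper): how Adam's effective per-parameter
# step size depends on epsilon. With per-parameter gradient scale sqrt(v),
# the step scale is roughly alpha / (sqrt(v) + eps):
#   - eps << sqrt(v): the step is ~alpha/sqrt(v), i.e. gradient-normalized.
#   - eps >> sqrt(v): the step is ~alpha/eps, a constant scale, so Adam behaves
#     like (bias-corrected) momentum with learning rate alpha/eps.
import numpy as np

grad_scales = np.array([1e-4, 1e-2, 1.0])   # sqrt(v) for three hypothetical parameters
alpha = 1e-3
for eps in [1e-8, 1e-3, 1e2]:
    effective_step = alpha / (grad_scales + eps)
    print(f"eps={eps:g}: effective per-parameter step scales {effective_step}")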

3. What is an optimizer?

Optimization algorithms are typically defined by their update rule, which is controlled by hyperparameters that determine its behavior (e.g. the learning rate).

1 TensorFlow currently refers to ε as "a small constant for numerical stability"; https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/train/AdamOptimizer.


Table 1: Update rules considered in this work. SGD is due to Robbins & Monro (1951), MOMENTUM to Polyak (1964), NESTEROV to Nesterov (1983), RMSPROP to Tieleman & Hinton (2012), ADAM to Kingma & Ba (2015), and NADAM to Dozat (2016). All operations are taken component-wise for vectors. In particular, for x ∈ R^d, x^2 is a component-wise power.

SGD(Ht, ηt):
    θt+1 = θt − ηt ∇ℓ(θt)

MOMENTUM(Ht, ηt, γ):
    v0 = 0
    vt+1 = γ vt + ∇ℓ(θt)
    θt+1 = θt − ηt vt+1

NESTEROV(Ht, ηt, γ):
    v0 = 0
    vt+1 = γ vt + ∇ℓ(θt)
    θt+1 = θt − ηt (γ vt+1 + ∇ℓ(θt))

RMSPROP(Ht, ηt, γ, ρ, ε):
    v0 = 1, m0 = 0
    vt+1 = ρ vt + (1 − ρ) ∇ℓ(θt)^2
    mt+1 = γ mt + (ηt / √(vt+1 + ε)) ∇ℓ(θt)
    θt+1 = θt − mt+1

ADAM(Ht, αt, β1, β2, ε):
    m0 = 0, v0 = 0
    mt+1 = β1 mt + (1 − β1) ∇ℓ(θt)
    vt+1 = β2 vt + (1 − β2) ∇ℓ(θt)^2
    bt+1 = √(1 − β2^(t+1)) / (1 − β1^(t+1))
    θt+1 = θt − αt (mt+1 / (√vt+1 + ε)) bt+1

NADAM(Ht, αt, β1, β2, ε):
    m0 = 0, v0 = 0
    mt+1 = β1 mt + (1 − β1) ∇ℓ(θt)
    vt+1 = β2 vt + (1 − β2) ∇ℓ(θt)^2
    bt+1 = √(1 − β2^(t+1)) / (1 − β1^(t+1))
    θt+1 = θt − αt ((β1 mt+1 + (1 − β1) ∇ℓ(θt)) / (√vt+1 + ε)) bt+1
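For concreteness, the following is a minimal NumPy sketch (ours, not the authors' implementation) of two of the update rules exactly as written in Table 1; grad_fn stands in for a (possibly stochastic) gradient of the loss.

# Minimal NumPy sketch of two update rules from Table 1 (not the authors' code).
# `grad_fn(theta)` is a placeholder for the (possibly stochastic) gradient of the loss.
import numpy as np

def momentum_step(theta, v, grad_fn, lr, gamma):
    """MOMENTUM(H_t, eta_t, gamma): v <- gamma*v + grad; theta <- theta - lr*v."""
    g = grad_fn(theta)
    v = gamma * v + g
    return theta - lr * v, v

def adam_step(theta, m, v, t, grad_fn, alpha, beta1, beta2, eps):
    """ADAM(H_t, alpha_t, beta1, beta2, eps), with the bias correction b_{t+1}
    and the epsilon placement (eps added outside the square root) as in Table 1."""
    g = grad_fn(theta)
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g ** 2
    b = np.sqrt(1.0 - beta2 ** (t + 1)) / (1.0 - beta1 ** (t + 1))
    theta = theta - alpha * m / (np.sqrt(v) + eps) * b
    return theta, m, v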

Consider a differentiable loss function ℓ : R^d → R whose vector of first partial derivatives is given by ∇ℓ(θ) (more generally, ∇ℓ(θ) might be a stochastic estimate of the true gradient). In our context, ℓ represents the loss function computed over an entire dataset by a neural network and θ ∈ R^d represents the vector of model parameters. The optimization problem is to find a point that (at least locally) minimizes ℓ. First-order iterative methods for this problem (Nesterov, 2018) construct a sequence θt of iterates converging to a local minimum θ* using queries to ℓ and ∇ℓ. The sequence θt is constructed by an update rule M, which determines the next iterate θt+1 from the history Ht of previous iterates along with their function and gradient values, Ht = {θs, ∇ℓ(θs), ℓ(θs)} for s = 0, ..., t, and a setting of hyperparameters φ : N → R^n. Given an initial parameter value θ0 ∈ R^d, the sequence of points visited by an optimizer with update rule M is given by

θt+1 = M(Ht, φt).

The stochastic gradient descent algorithm (SGD; Robbins & Monro, 1951) is one of the simplest such methods used for training neural networks. SGD is initialized with θ0 ∈ R^d, and its hyperparameter is a learning rate schedule η : N → (0,∞). The SGD update rule is given by SGD(Ht, ηt) = θt − ηt ∇ℓ(θt). The MOMENTUM method due to Polyak (1964) generalizes the SGD method by linearly combining the gradient direction with a constant multiple of the previous parameter update. Its hyperparameters are a learning rate schedule η : N → (0,∞) and a momentum parameter γ ∈ [0,∞),

MOMENTUM(Ht, ηt, γ) = θt − ηt ∇ℓ(θt) + γ(θt − θt−1).

There has been an explosion of novel first-order methods in deep learning, all of which fall into this standard first-order scheme. In Table 1 we list the first-order update rules considered in this paper.

The difference between optimizers is entirely captured by the choice of update rule M and hyperparameters φ. Since the roles of optimizer hyperparameters on neural network loss functions are not well understood, most practitioners tune a subset of the hyperparameters to maximize performance over a validation set, while leaving some hyperparameters at fixed default values. The choice of which hyperparameters to tune determines an effective family of update rules, and this family is the critical object from a practitioner's perspective. Thus, in analogy to (overloaded) function declarations in C++, we define an optimizer by an update rule "signature": the update rule name together with the free hyperparameter arguments.


For example, in this definition MOMENTUM(·, ηt, γ) is not the same optimizer as MOMENTUM(·, ηt, 0.9), because the former has two free hyperparameters while the latter has only one. ADAM with the default ε is "different" from ADAM with tuned ε.
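In code, this distinction is exactly that between an update rule that still exposes a hyperparameter as an argument and one where it has been partially applied; a small sketch (ours) in Python:

# Sketch (ours) of the "signature" view: an optimizer is an update rule plus the
# set of hyperparameters left free for tuning. Fixing gamma = 0.9 yields a different
# optimizer than leaving gamma free, even though the update rule is the same.
from functools import partial

def momentum_update(theta, v, grad, lr, gamma):
    v = gamma * v + grad
    return theta - lr * v, v

momentum_lr_and_gamma = momentum_update                  # MOMENTUM(·, eta_t, gamma): two free hyperparameters
momentum_lr_only = partial(momentum_update, gamma=0.9)   # MOMENTUM(·, eta_t, 0.9): one free hyperparameter
sgd = partial(momentum_update, gamma=0.0)                # SGD(·, eta_t) as the gamma = 0 specialization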

3.1. The taxonomy of first-order methods

The basic observation of this section is that some optimizers can approximately simulate others (i.e., optimizer A might be able to approximately simulate the trajectory of optimizer B for any particular setting of B's hyperparameters). This is important knowledge because, as a hyperparameter tuning protocol approaches optimality, a more expressive optimizer can never underperform any of its specializations. To capture the concept of one optimizer approximating another, we define the following inclusion relationship between optimizers.

Definition 1 (Inclusion relationship). Let M, N be update rules for use in a first-order optimization method. M is a subset or specialization of N if, for all φ : N → R^n, there exists a sequence ψ^i : N → R^m such that for all t ∈ [0,∞) and histories Ht,

lim_{i→∞} N(Ht, ψ^i_t) = M(Ht, φt).

This is denoted M ⊆ N, with equality M = N iff M ⊆ N and N ⊆ M.

Evidently SGD ⊆ MOMENTUM, since SGD(Ht, ηt) = MOMENTUM(Ht, ηt, 0). Many well-known optimizers fall naturally into this taxonomy. In particular, we consider RMSPROP with momentum (Tieleman & Hinton, 2012), ADAM (Kingma & Ba, 2015) and NADAM (Dozat, 2016) (see Table 1) and show the following inclusions in the appendix.

SGD ⊆ MOMENTUM ⊆ RMSPROP

SGD ⊆ MOMENTUM ⊆ ADAM

SGD ⊆ NESTEROV ⊆ NADAM

Note that some of these inclusions make use of the flexibility of hyperparameter schedules (the dependence of ψ^i on t). In particular, to approximate MOMENTUM with ADAM, one needs to choose a learning rate schedule that accounts for ADAM's bias correction.
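To illustrate that such limits can be exercised numerically (a toy check of ours, not an experiment from the paper), the sketch below runs MOMENTUM and ADAM with β2 = 0, a very large ε, and the bias-correction-aware learning rate schedule used in Appendix A on a small quadratic; the two trajectories agree closely.

# Toy numerical check (ours, not an experiment from the paper): on a quadratic loss,
# ADAM with beta2 = 0, a huge epsilon, and the learning-rate schedule
# alpha_t = eps * eta * (1 - gamma**(t+1)) / (1 - gamma) closely tracks MOMENTUM.
import numpy as np

A = np.diag([1.0, 10.0])                 # simple quadratic loss 0.5 * theta^T A theta
grad = lambda theta: A @ theta

eta, gamma, eps, steps = 0.01, 0.9, 1e8, 200
theta_mom = np.array([1.0, 1.0]); v_mom = np.zeros(2)
theta_adam = np.array([1.0, 1.0]); m = np.zeros(2); v = np.zeros(2)

for t in range(steps):
    # MOMENTUM (Table 1)
    v_mom = gamma * v_mom + grad(theta_mom)
    theta_mom = theta_mom - eta * v_mom
    # ADAM (Table 1) with beta1 = gamma, beta2 = 0, and alpha_t chosen as in Appendix A
    g = grad(theta_adam)
    m = gamma * m + (1 - gamma) * g
    v = 0.0 * v + 1.0 * g ** 2
    b = np.sqrt(1 - 0.0 ** (t + 1)) / (1 - gamma ** (t + 1))
    alpha_t = eps * eta * (1 - gamma ** (t + 1)) / (1 - gamma)
    theta_adam = theta_adam - alpha_t * m / (np.sqrt(v) + eps) * b

print("max parameter gap:", np.abs(theta_mom - theta_adam).max())  # should be tiny for large eps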

If two optimizers have an inclusion relationship, the more general optimizer can never be worse with respect to any metric of interest, provided the hyperparameters are sufficiently tuned to optimize that metric. Optimally-tuned MOMENTUM cannot underperform optimally-tuned SGD, because setting γ = 0 in MOMENTUM recovers SGD. However, optimizers with more hyperparameters might be more expensive to tune, so we should have a theoretical or experimental reason for using (or creating) a more general optimizer. For example, MOMENTUM improves local convergence rates over SGD on twice-differentiable functions that are smooth and strongly convex (Polyak, 1964), and NESTEROV has globally optimal convergence rates within the class of smooth and strongly convex functions (Nesterov, 1983; 2018).

At first glance, the taxonomy of optimizer inclusions appears to resolve many optimizer comparison questions. However, for a deep learning practitioner, there is no guarantee that the inclusion hierarchy is at all meaningful in practice. For example, the hyperparameters that allow ADAM to match or outperform MOMENTUM might not be easily accessible. They might exist only in the limit of very large values, or be so difficult to find that only practitioners with huge computational budgets can hope to discover them. Indeed, empirical studies and conventional wisdom hold that the inclusion hierarchy does not predict optimizer performance for many practical workloads (Wilson et al., 2017; Balles & Hennig, 2017; Schneider et al., 2019). Either these experimental investigations are too limited, or the taxonomy of this section is of limited practical interest and provides no guidance about which optimizer to use on a real workload. In the following section we attempt to answer this question experimentally, and show that these inclusion relationships are meaningful in practice.

4. Experiments

An empirical comparison of optimizers should aim to inform a careful practitioner. Accordingly, we model our protocol on a practitioner who is allowed to vary all optimization hyperparameters for each optimizer (e.g. αt, β1, β2, ε for ADAM) in addition to a parameterized learning rate decay schedule, in contrast to studies that fix a subset of the optimization hyperparameters to their default values (e.g. Wilson et al., 2017; Schneider et al., 2019). There is no standard method for selecting the values of these hyperparameters, but most practitioners tune at least a subset of the optimization hyperparameters by running a set of trials to maximize performance over the validation set. In our experiments, we run tens to hundreds of individual trials per workload. Given the variety of workloads we consider, this trial budget covers a wide range of computational budgets.

Selecting the hyperparameter search space for each optimizer is a key methodological choice for any empirical comparison of optimizers. Prior studies have attempted to treat each optimizer fairly by using the "same" search space for all optimizers (e.g. Wilson et al., 2017; Schneider et al., 2019). However, this requires the assumption that similarly-named hyperparameters should take similar values between optimizers, which is not generally true. For example, MOMENTUM and NESTEROV both have similar-looking momentum and learning rate hyperparameters, but NESTEROV tolerates larger values of its momentum (Sutskever et al., 2013).


Table 2: Summary of workloads used in experiments.

Task                 | Evaluation metric    | Model       | Dataset       | Target error | Batch size | Budget
Image classification | Classification error | Simple CNN  | Fashion MNIST | 6.6%         | 256        | 10k steps
Image classification | Classification error | ResNet-32   | CIFAR-10      | 7%           | 256        | 50k steps
Image classification | Classification error | CNN         | CIFAR-100     | –            | 256        | 350 epochs
Image classification | Classification error | VGG-16      | CIFAR-10      | –            | 128        | 250 epochs
Image classification | Classification error | ResNet-50   | ImageNet      | 24%          | 1024       | 150k steps
Language modeling    | Classification error | LSTM        | War and Peace | –            | 50         | 200 epochs
Language modeling    | Cross entropy        | Transformer | LM1B          | 3.45         | 256        | 750k steps

The situation worsens with less closely related optimizers: similarly-named hyperparameters could have totally different units, making it impossible to equate search volumes. Despite coming with its own set of challenges, it is most informative to compare optimizers assuming the practitioner is allowed to tune hyperparameters for different optimizers independently by way of optimizer-specific search spaces.

In our experiments, we chose the search space for each optimizer by running an initial set of experiments over a relatively large search space. In a typical case, we ran a single set of initial trials per optimizer to select the final search space. However, in some cases we chose one of the search spaces poorly, so we ran another set of experiments to select the final search space. We include these initial search spaces in Appendix D for the sake of transparency. The effort required to choose each search space cannot easily be quantified: our initial guesses were inevitably informed by prior experience with particular models and optimizers. This is true of all search spaces in the literature: tuning practices tend to be refined over many experiments and across many workloads, representing the sum total of our community's experience.

Following standard practice, we tuned the hyperparameters η0, 1 − γ, 1 − β1, and 1 − β2 from Table 1 on a log scale by searching over the logarithms of their values. Additionally, we decoupled ε from the initial learning rate α0 by searching over the logarithms of (ε, α0/√ε) for RMSPROP and (ε, α0/ε) for ADAM and NADAM, instead of (ε, α0). The fact that ε is coupled with the learning rate, with larger values of ε generally requiring larger learning rates, was shown for ADAM by Savarese et al. (2019). These search space transforms merely served to make our hyperparameter search more efficient; in principle, our results would be the same if we used a larger trial budget and naively searched all hyperparameters on a linear scale.
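As a concrete illustration of these transforms, the sketch below (ours, not the paper's tuning code) draws ADAM hyperparameters on a log scale with the (ε, α0/ε) decoupling; the ranges are placeholders rather than the search spaces of Appendix D, and we use ordinary pseudo-random sampling where the paper uses quasi-random search.

# Sketch (ours) of log-scale sampling with the (eps, alpha_0/eps) decoupling for ADAM.
# The ranges below are illustrative placeholders, not the search spaces from Appendix D.
import numpy as np

rng = np.random.default_rng(0)

def log_uniform(low, high):
    return float(np.exp(rng.uniform(np.log(low), np.log(high))))

def sample_adam_hparams():
    eps = log_uniform(1e-10, 1e3)
    ratio = log_uniform(1e-6, 1e-1)           # alpha_0 / eps, sampled instead of alpha_0 itself
    return {
        "alpha_0": ratio * eps,               # recover the initial learning rate
        "one_minus_beta1": log_uniform(1e-3, 1.0),
        "one_minus_beta2": log_uniform(1e-4, 1.0),
        "eps": eps,
    }

print(sample_adam_hparams())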

We validated our final search spaces by checking that the optimal hyperparameter values were away from the search space boundaries for all optimizers in all experiments (see Figure 5 in Appendix E). We provide our final search spaces for all experiments in Appendix D. The fact that our final error rates compare favorably to prior published results – including reaching state-of-the-art for our particular configuration of ResNet-50 on ImageNet (see Section 4.2) – supports our claim that our methodology is highly competitive with expert tuning procedures.

4.1. Overview of Workloads and Experimental Details

We investigated the relative performance of optimizers across a variety of image classification and language modeling tasks. For image classification, we trained a simple convolutional neural network (Simple CNN) on Fashion MNIST (Xiao et al., 2017); ResNet-32 (He et al., 2016a) on CIFAR-10 (Krizhevsky, 2009); a CNN on CIFAR-100; VGG-16 (Simonyan & Zisserman, 2014) on CIFAR-10; and ResNet-50 on ImageNet (Russakovsky et al., 2015). For language modeling, we trained a 2-layer LSTM model (Hochreiter & Schmidhuber, 1997) on Tolstoy's War and Peace, and a Transformer (Vaswani et al., 2017) on LM1B (Chelba et al., 2014). We used a linear learning rate decay schedule parameterized the same way as Shallue et al. (2019) for all workloads. We used a fixed batch size and a fixed budget of training steps for each workload, independent of the optimizer. Table 2 summarizes these workloads and Appendix B provides the full details.

Given a search space, our tuning protocol sought to model a practitioner trying to achieve the best outcome with a fixed budget of trials (10, 50, or 100 depending on the workload).2 A feasible trial is any trial that achieves finite training loss.

2 In retrospect, the best validation error across tuning trials converged quite quickly for our final search spaces, producing similar results with fewer than 20 trials in many cases. See Figures 6–8 in Appendix E.


[Figure 1: eight panels, two per workload (CNN on Fashion MNIST, ResNet-32 on CIFAR-10, ResNet-50 on ImageNet, Transformer on LM1B), showing final validation error/cross entropy (top row) and test error/cross entropy (bottom row) for SGD, Momentum, Nesterov, RMSProp, Adam, and NAdam.]

Figure 1: The relative performance of optimizers is consistent with the inclusion relationships, regardless of whether we compare final validation error (top) or test error (bottom). For all workloads, we tuned the hyperparameters of each optimizer separately, and selected the trial that achieved the lowest final validation error. Optimizers appear in the same order as the legend in all plots in this paper.

[Figure 2: four panels (CNN on Fashion MNIST, ResNet-32 on CIFAR-10, ResNet-50 on ImageNet, Transformer on LM1B) showing the number of training steps required to reach the validation target for SGD, Momentum, Nesterov, RMSProp, Adam, and NAdam.]

Figure 2: The relative training speed of optimizers is consistent with the inclusion relationships. We measured (idealized) training speed as the number of training steps required to reach a target validation error (see Table 2 for the error targets).

We used quasi-random uniform search (Bousquet et al., 2017), and continued the search until we obtained a fixed number of feasible trials. From those trials we considered two statistics. The first, in order to characterize the best outcome, is a metric of interest (e.g. test accuracy) corresponding to the trial achieving the optimum of some other metric (e.g. validation accuracy). The second, in order to characterize the speed of training, is the number of steps required to reach a fixed validation target, conditional on at least one trial in the search having reached that target. We chose the target for each workload based on initial experiments and known values from the literature (see Table 2). We estimated means and uncertainties using the bootstrap procedure described in Appendix C.
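To make the two statistics concrete, here is a minimal sketch (ours, not the authors' analysis code); the trial-record field names are hypothetical, and we take the "speed" statistic to be the minimum steps-to-target over the trials in a search, which is one natural reading of the description above (the paper's exact estimator is defined by the bootstrap procedure of Appendix C).

# Sketch (ours) of the two per-search statistics described above. The trial record
# fields ("valid_error", "test_error", "steps_to_target") are hypothetical names.
def best_outcome(trials, select_key="valid_error", report_key="test_error"):
    """Metric of interest at the trial that optimizes another metric
    (e.g. test error at the best-validation-error trial)."""
    best = min(trials, key=lambda tr: tr[select_key])
    return best[report_key]

def speed_statistic(trials):
    """Fewest steps any trial needed to reach the validation target, conditional on
    at least one trial having reached it (None otherwise)."""
    reached = [tr["steps_to_target"] for tr in trials if tr["steps_to_target"] is not None]
    return min(reached) if reached else None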

4.2. Inclusion relationships matter in practice

Figure 1 shows the final predictive performance of six optimizers on four different workloads after tuning hyperparameters to minimize validation error. Regardless of whether we compare final validation error or test error, the inclusion relationships hold in all cases – a more general optimizer never underperforms any of its specializations within the error bars. Similar results hold for training error (see Figure 9 in Appendix E). Training speed is also an important consideration, and Figure 2 demonstrates that the inclusion relationships also hold within error bars when we compare the number of steps required to reach a target validation error. Moreover, these results confirming the relevance of optimizer inclusion relationships do not depend on the exact step budgets or error targets we chose (see Figure 10 in Appendix E), although large changes to these values would require new experiments.


[Figure 3: final test error for VGG on CIFAR-10. Left panel (using code from Wilson et al., 2017): grid search over the learning rate vs. tuning the learning rate plus {γ, ε}. Right panel (using our code): tuning the learning rate schedule vs. tuning the learning rate schedule plus {γ, β1, β2, ε} and L2 regularization. Optimizers: SGD, Momentum, RMSProp, Adam.]

Figure 3: Tuning more hyperparameters removes the differences in test error between optimizers observed by Wilson et al. (2017). Tuning a subset of optimizer hyperparameters and the initial learning rate is sufficient to equalize performance between all optimizers (left). More extensive hyperparameter tuning in our setup, including the learning rate schedule, improves results for all optimizers and still does not produce any differences between optimizer performances (right).

Of course, just because a more general optimizer is no worse than any of its specializations doesn't mean the choice of optimizer makes a large difference on all workloads. For some workloads in Figures 1 and 2, all optimizers perform about the same, while other workloads have a clear ranking or even dramatic differences. For example, the choice of optimizer seems to make little difference for ResNet-32 on CIFAR-10; all optimizers achieve similar predictive performance and training speed. On the other hand, Transformer on LM1B exhibits a clear ranking in terms of predictive performance and training speed. For this workload, ADAM needs only 60% as many steps as MOMENTUM to reach our target error, and only 25% as many steps to get the same final result as SGD (see Figure 10 in the Appendix). These differences are clearly large enough to matter to a practitioner, and highlight the practical importance of choosing the right optimizer for some workloads.

The most general optimizers we considered were RMSPROP, ADAM, and NADAM, which do not include each other as special cases, and whose relative performance is not predicted by inclusion relationships. Across the workloads we considered, none of these optimizers emerged as the clear winner, although ADAM and NADAM generally seemed to have an edge over RMSPROP. For all of these optimizers, we sometimes had to set the ε parameter orders of magnitude larger than the default value in order to get good results. In particular, we achieved a validation accuracy of 77.1% for ResNet-50 on ImageNet using NADAM with ε = 9475, a result that exceeds the 76.5% achieved by Goyal et al. (2017) using MOMENTUM. Across just these four workloads, the range of the optimal values of the ε parameter spanned 10 orders of magnitude.

4.3. Reconciling disagreements with previous work

In order to confirm that differences in hyperparameter tuning protocols explain the differences between our conclusions and those of Wilson et al. (2017) and Schneider et al. (2019), we reproduced a representative subset of their results and then inverted, or at least collapsed, the ranking over optimizers just by expanding the hyperparameter search space.

The left pane of Figure 3 shows our experiments on VGG on CIFAR-10 using code released by Wilson et al. (2017). When we match their protocol and perform their grid search over the initial learning rate with no other tuning, we reproduce their original result showing worse test error for RMSPROP and ADAM. However, when we tune the momentum parameter and ε with random search, all four optimizers reach nearly identical test error rates.3 With our learning rate schedule search space, merely tuning the learning rate schedule was enough to make all optimizers reach the same test error within error bars. When we additionally tuned the optimization hyperparameters and weight decay in our setup, we also got similar results for all optimizers, removing any evidence that the inclusion relationships might be violated in practice.

Figure 4 shows our results with different tuning protocols for a CNN on CIFAR-100 and an LSTM language model trained on War and Peace, to match the experiments in Schneider et al. (2019). As reported by Schneider et al. (2019), if we only tune the learning rate, without tuning the decay schedule or other optimizer hyperparameters, ADAM does worse than MOMENTUM for the CNN, and SGD performs slightly better than ADAM and MOMENTUM on the War and Peace dataset, although Schneider et al. (2019) found a larger advantage for SGD.

3 Wilson et al. (2017) selected trials to minimize the training loss and then reported test set results. As Figure 3 shows, removing this somewhat non-standard choice by tuning on a validation set and reporting test set results does not change anything.


[Figure 4: final test error for a CNN on CIFAR-100 (columns: tuning a constant learning rate; tuning the learning rate schedule; tuning a constant learning rate plus {γ, β1, β2, ε}; tuning the learning rate schedule plus {γ, β1, β2, ε}) and an LSTM on War and Peace (columns: tuning a constant learning rate; tuning the learning rate schedule plus {γ, β1, β2, ε}). Optimizers: SGD, Momentum, Adam.]

Figure 4: Tuning more hyperparameters changes optimizer rankings from Schneider et al. (2019) to rankings that are consistent with the inclusion relationships. The leftmost columns for each workload reproduce the rankings from Schneider et al. (2019), while the remaining columns tune over increasingly general search spaces. All columns use our random search tuning protocol.

However, once we tune all the optimizer hyperparameters, ADAM does better than MOMENTUM, which does better than SGD, as predicted by the inclusion relationships.

We conclude that the reason both Schneider et al. (2019) and Wilson et al. (2017) observed a ranking that, at first glance, contradicts the inclusion relationships is that they were not tuning enough of the hyperparameters. If we recast their results in our terminology, where ADAM with default ε is a different optimizer than ADAM with ε tuned, then there is no contradiction with our results, and it becomes immediately clear that they do not consider the most interesting form of ADAM for practitioners.

5. Conclusions

Inspired by the recent efforts of Wilson et al. (2017) and Schneider et al. (2019), we set out to provide a detailed empirical characterization of the optimizer selection process in deep learning. Our central finding is that inclusion relationships between optimizers are meaningful in practice. When tuning all available hyperparameters under a realistic protocol at scales common in deep learning, we find that more general optimizers never underperform their special cases. In particular, we found that RMSPROP, ADAM, and NADAM never underperformed SGD, NESTEROV, or MOMENTUM under our most exhaustive tuning protocol. We did not find consistent trends when comparing optimizers that could not approximate each other. We also found workloads for which there was not a statistically significant separation in the optimizer ranking.

Our experiments have some important limitations and we should be careful not to overgeneralize from our results. The first major caveat is that we did not measure the effects of varying the batch size. Recent empirical work (Shallue et al., 2019; Zhang et al., 2019) has shown that increasing the batch size can increase the gaps between training times for different optimizers, with the gap from SGD to MOMENTUM (Shallue et al., 2019) and from MOMENTUM to ADAM (Zhang et al., 2019) increasing with the batch size. Nevertheless, we strongly suspect that the inclusion relations would be predictive at any batch size under a tuning protocol similar to the one we used. The second important caveat of our results is that they inevitably depend on the tuning protocol and workloads that we considered. Although we made every attempt to conduct realistic experiments, we should only expect our detailed findings to hold for similar workloads under similar protocols, namely uniform quasi-random tuning for tens to hundreds of trials, over hypercube search spaces, and with our specific learning rate schedule parameterization. Nevertheless, these caveats reinforce our central point: all empirical comparisons of neural network optimizers depend heavily on the hyperparameter tuning protocol, perhaps far more than we are used to with comparisons between model architectures.

If we were to extract "best practices" from our findings, we would suggest the following. If we can afford tens or more runs of our code, we should tune all of the hyperparameters of the popular adaptive gradient methods. Just because two hyperparameters have a similar role in two different update rules doesn't mean they should take similar values: optimization hyperparameters tend to be coupled, and the optimal value for one may depend on how the others are set. Our results also confirm that the optimal value of Adam's ε is problem-dependent, so the onus is on empirical studies that fix ε = 10^-8 to defend that choice. Finally, we should be skeptical of empirical comparisons of optimizers in papers, especially if an optimizer underperforms any of its specializations. When we do inevitably compare optimizers, we should report search spaces and highlight decisions about what hyperparameters were tuned when interpreting results.

References

Adolphs, L., Kohler, J., and Lucchi, A. Ellipsoidal trust region methods and the marginal value of Hessian information for neural network training. arXiv preprint arXiv:1905.09201, 2019.
Aitchison, L. A unified theory of adaptive stochastic gradient descent as Bayesian filtering. arXiv preprint arXiv:1807.07540, 2018.
Balles, L. and Hennig, P. Dissecting Adam: The sign, magnitude and variance of stochastic gradients. arXiv preprint arXiv:1705.07774, 2017.
Becker, S. and Le Cun, Y. Improving the convergence of the backpropagation learning with second order methods. Morgan Kaufmann, San Mateo, CA, 1988.
Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A. signSGD: Compressed optimisation for non-convex problems. arXiv preprint arXiv:1802.04434, 2018.
Bousquet, O., Gelly, S., Kurach, K., Teytaud, O., and Vincent, D. Critical hyper-parameters: no random, no cry. arXiv preprint arXiv:1706.03200, 2017.
Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P., and Robinson, T. One billion word benchmark for measuring progress in statistical language modeling. In Conference of the International Speech Communication Association, 2014.
Chen, J., Zhao, L., Qiao, X., and Fu, Y. NAMSG: An efficient method for training neural networks. arXiv preprint arXiv:1905.01422, 2019.
Chen, X., Liu, S., Sun, R., and Hong, M. On the convergence of a class of Adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941, 2018.
Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. AutoAugment: Learning augmentation policies from data. In Conference on Computer Vision and Pattern Recognition, 2018.
De, S., Mukherjee, A., and Ullah, E. Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration. arXiv preprint arXiv:1807.06766, 2018.
Dozat, T. Incorporating Nesterov momentum into Adam. In ICLR Workshops, 2016.
Goyal, P., Dollar, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition, pp. 770–778. IEEE, 2016a.
He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016b.
Hessel, M., Modayil, J., and van Hasselt, H. Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298, 2017.
Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
Hoffer, E., Hubara, I., and Soudry, D. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pp. 1731–1741, 2017.
Ioffe, S. and Szegedy, C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015.
Keskar, N. S. and Socher, R. Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628, 2017.
Kingma, D. P. and Ba, J. Adam: a method for stochastic optimization. In ICLR, 2015.
Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. URL http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
Loshchilov, I. and Hutter, F. Fixing weight decay regularization in Adam. arXiv preprint arXiv:1711.05101, 2017.
Luo, L., Xiong, Y., Liu, Y., and Sun, X. Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843, 2019.
Ma, J. and Yarats, D. Quasi-hyperbolic momentum and Adam for deep learning. arXiv preprint arXiv:1810.06801, 2018.
Martens, J. and Grosse, R. Optimizing neural networks with Kronecker-factored approximate curvature. arXiv preprint arXiv:1503.05671, 2015.
Nesterov, Y. Lectures on Convex Optimization, volume 137. Springer, 2018.
Nesterov, Y. E. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pp. 543–547, 1983.
Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
Reddi, S. J., Kale, S., and Kumar, S. On the convergence of Adam and beyond. In ICLR, 2019.
Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
Savarese, P., McAllester, D., Babu, S., and Maire, M. Domain-independent dominance of adaptive methods. arXiv preprint arXiv:1912.01823, 2019.
Schneider, F., Balles, L., and Hennig, P. DeepOBS: a deep learning optimizer benchmark suite. arXiv preprint arXiv:1903.05499, 2019.
Shallue, C. J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., and Dahl, G. E. Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research, 20(112):1–49, 2019.
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Smith, L. N. A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820, 2018.
Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pp. 2951–2959, 2012.
Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. In ICML, 2013.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the Inception architecture for computer vision. In Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.
Tan, M. and Le, Q. V. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., and Le, Q. V. MnasNet: Platform-aware neural architecture search for mobile. In Conference on Computer Vision and Pattern Recognition, pp. 2820–2828, 2019.
Tieleman, T. and Hinton, G. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems 30, pp. 4148–4158, 2017.
Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
Zaheer, M., Reddi, S., Sachan, D., Kale, S., and Kumar, S. Adaptive methods for nonconvex optimization. In Advances in Neural Information Processing Systems 31, pp. 9793–9803, 2018.
Zhang, G., Li, L., Nado, Z., Martens, J., Sachdeva, S., Dahl, G. E., Shallue, C. J., and Grosse, R. Which algorithmic choices matter at which batch sizes? Insights from a noisy quadratic model. arXiv preprint arXiv:1907.04164, 2019.
Zhang, J. and Mitliagkas, I. YellowFin and the art of momentum tuning. arXiv preprint arXiv:1706.03471, 2017.
Zhou, Z., Zhang, Q., Lu, G., Wang, H., Zhang, W., and Yu, Y. AdaShift: Decorrelation and convergence of adaptive learning rate methods. arXiv preprint arXiv:1810.00143, 2018.
Zou, F. and Shen, L. On the convergence of weighted AdaGrad with momentum for training deep neural networks. arXiv preprint arXiv:1808.03408, 2018.
Zou, F., Shen, L., Jie, Z., Zhang, W., and Liu, W. A sufficient condition for convergences of Adam and RMSProp. In Conference on Computer Vision and Pattern Recognition, pp. 11127–11135, 2019.


A. Optimizer inclusions

Table 1 summarizes the update rules for the optimizers we consider in this work. We assume update rules as implemented in TensorFlow r1.15. Here we prove their inclusion relationships; see Definition 1.

MOMENTUM can exactly implement SGD

MOMENTUM(Ht, ηt, 0) = SGD(Ht, ηt), so SGD ⊆ MOMENTUM.

NESTEROV can exactly implement SGD

NESTEROV(Ht, ηt, 0) = SGD(Ht, ηt), so SGD ⊆ NESTEROV.

RMSPROP with momentum can exactly implement MOMENTUM

Consider RMSPROP(Ht, ηt, γ, ρ = 1, ε = 0), so that

mt+1 = γ mt + ηt ∇ℓ(θt),
θt+1 = θt − mt+1.

This is equivalent to MOMENTUM, since m^(RMSPROP)_{t+1} ≡ ηt v^(MOMENTUM)_{t+1}. Thus RMSPROP(Ht, ηt, γ, 1, 0) = MOMENTUM(Ht, ηt, γ), so MOMENTUM ⊆ RMSPROP.

RMSTEROV (RMSPROP with Nesterov-style momentum) can exactly implement NESTEROV

Consider RMSTEROV(Ht, ηt, γ, ρ = 1, ε = 0), so that

mt+1 = γ mt + ηt ∇ℓ(θt),
θt+1 = θt − [γ mt+1 + ηt ∇ℓ(θt)].

This is equivalent to NESTEROV, since m^(RMSTEROV)_{t+1} ≡ ηt v^(NESTEROV)_{t+1}. Thus RMSTEROV(Ht, ηt, γ, 1, 0) = NESTEROV(Ht, ηt, γ), so NESTEROV ⊆ RMSTEROV.

ADAM can approximate MOMENTUM for large ε

Consider ADAM(Ht, αt = ε ηt (1 − γ^(t+1)) / (1 − γ), β1 = γ, β2 = 0, ε), so that

mt+1 = γ mt + (1 − γ) ∇ℓ(θt),
θt+1 = θt − (ηt / (1 − γ)) · mt+1 / (|∇ℓ(θt)| / ε + 1).

If ε is large, so that |∇ℓ(θt)| / ε ≪ 1, then

mt+1 = γ mt + (1 − γ) ∇ℓ(θt),
θt+1 = θt − ηt mt+1 / (1 − γ).

This is equivalent to MOMENTUM, since m^(ADAM)_t ≡ (1 − γ) v^(MOMENTUM)_t. Thus lim_{ε→∞} ADAM(Ht, ε ηt (1 − γ^(t+1)) / (1 − γ), γ, 0, ε) = MOMENTUM(Ht, ηt, γ), so MOMENTUM ⊆ ADAM.

NADAM can approximate NESTEROV for large ε

Consider NADAM(Ht, αt = ε ηt (1 − γ^(t+1)) / (1 − γ), β1 = γ, β2 = 0, ε), so that

mt+1 = γ mt + (1 − γ) ∇ℓ(θt),
θt+1 = θt − (ηt / (1 − γ)) · [γ mt+1 + (1 − γ) ∇ℓ(θt)] / (|∇ℓ(θt)| / ε + 1).

If ε is large, so that |∇ℓ(θt)| / ε ≪ 1, then

mt+1 = γ mt + (1 − γ) ∇ℓ(θt),
θt+1 = θt − ηt [γ mt+1 / (1 − γ) + ∇ℓ(θt)].

This is equivalent to NESTEROV, since m^(NADAM)_t ≡ (1 − γ) v^(NESTEROV)_t. Thus lim_{ε→∞} NADAM(Ht, ε ηt (1 − γ^(t+1)) / (1 − γ), γ, 0, ε) = NESTEROV(Ht, ηt, γ), so NESTEROV ⊆ NADAM.
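As a quick numerical check of the exact inclusions above (our toy example, not part of the paper's experiments), the following sketch confirms that RMSPROP with ρ = 1 and ε = 0 traces the same trajectory as MOMENTUM on a small quadratic when the learning rate is constant.

# Numerical confirmation (ours) that RMSPROP(H_t, eta, gamma, rho=1, eps=0) traces the
# same trajectory as MOMENTUM(H_t, eta, gamma) when the learning rate is constant.
import numpy as np

A = np.diag([1.0, 5.0])
grad = lambda theta: A @ theta

eta, gamma, steps = 0.05, 0.9, 100
theta_mom = np.array([1.0, -1.0]); v_mom = np.zeros(2)
theta_rms = np.array([1.0, -1.0]); v_rms = np.ones(2); m_rms = np.zeros(2)

for _ in range(steps):
    # MOMENTUM (Table 1)
    v_mom = gamma * v_mom + grad(theta_mom)
    theta_mom = theta_mom - eta * v_mom
    # RMSPROP with momentum, rho = 1, eps = 0 (Table 1)
    g = grad(theta_rms)
    v_rms = 1.0 * v_rms + 0.0 * g ** 2              # stays at its initial value of 1
    m_rms = gamma * m_rms + eta / np.sqrt(v_rms + 0.0) * g
    theta_rms = theta_rms - m_rms

print("max trajectory gap:", np.abs(theta_mom - theta_rms).max())  # zero up to float error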

B. Workload details

This section details the datasets and models summarized in Table 2.

B.1. Dataset Descriptions

For Fashion MNIST, CIFAR-10, ImageNet, and LM1B, our setup was identical to Shallue et al. (2019) except for the image pre-processing details described below. For War and Peace, our setup was identical to the "Tolstoi" dataset of Schneider et al. (2019).

CIFAR-10/100: We pre-processed images by subtracting the average value across all pixels and channels and dividing by the standard deviation.4 For experiments with the ResNet-32 and CNN models, we followed the standard data augmentation scheme used in He et al. (2016a): 4 pixels padded on each side, with a single random crop from the padded image or its horizontal reflection. We did not use random cropping for experiments with VGG, for consistency with Wilson et al. (2017).

ImageNet: We augmented images at training time by resizing each image, taking a random crop of 224 × 224 pixels, randomly horizontally reflecting the cropped images, and randomly distorting the image colors. At evaluation time, we performed a single central crop of 224 × 224 pixels. In both training and evaluation, we then subtracted the global mean RGB value from each pixel using the values computed by Simonyan & Zisserman (2014).5

B.2. Model Descriptions

Simple CNN is identical to the base model described in Shallue et al. (2019). It consists of 2 convolutional layers with max pooling followed by 1 fully connected layer. The convolutional layers use 5 × 5 filters with stride 1, "same" padding, and the ReLU activation function. Max pooling uses a 2 × 2 window with stride 2. The convolutional layers have 32 and 64 filters each, and the fully connected layer has 1024 units. It does not use batch normalization.
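A minimal tf.keras sketch (ours) consistent with this description is shown below; the 10-way output layer and any details not stated above (e.g. the dense layer's activation) are assumptions rather than the authors' exact implementation.

# Minimal tf.keras sketch (ours) of the Simple CNN described above. The final
# 10-way output layer and other unstated details are assumptions.
import tensorflow as tf

def simple_cnn(num_classes=10):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 5, strides=1, padding="same", activation="relu"),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        tf.keras.layers.Conv2D(64, 5, strides=1, padding="same", activation="relu"),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dense(num_classes),
    ])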

CNN is the "All-CNN-C" model from Springenberg et al. (2014), as used in Schneider et al. (2019). The model consists of 3 convolutional layer blocks with max pooling. The convolutional layers use 5 × 5 filters with stride 1, "same" padding, and the ReLU activation function. Max pooling uses a 2 × 2 window with stride 2. The convolutional layer blocks have 96, 192 and 192 filters each. As in Schneider et al. (2019), we used L2 regularization of 5 × 10^-4.

ResNet is described in He et al. (2016a). We used the improved residual block described in He et al. (2016b). We used batch normalization (Ioffe & Szegedy, 2015) with an exponential moving average (EMA) decay of 0.997 for ResNet-32, and ghost batch normalization (Hoffer et al., 2017) with a ghost batch size of 32 and an EMA decay of 0.9 for ResNet-50.

VGG is based on “model C” from Simonyan & Zisserman (2014). It consists of 13 convolutional layers followed by 3 fully connected hidden layers. We followed the modification used by Wilson et al. (2017) with batch normalization layers.

LSTM is a two hidden-layer LSTM model (Hochreiter & Schmidhuber, 1997) identical to the model used in Schneider et al. (2019). It uses 128 embedding dimensions and 128 hidden units.

Transformer is the “base” model described in Vaswani et al. (2017). We used it as an autoregressive language model by applying the decoder directly to the sequence of word embeddings for each sentence. Unlike the default implementation, we removed dropout regularization and used separate weight matrices for the input embedding layer and the pre-softmax linear transformation, as we observed these choices led to better performing models.

4 We used the TensorFlow op tf.image.per_image_standardization.
5 See https://gist.github.com/ksimonyan/211839e770f7b538e2d8#description for the mean RGB values used.

C. Estimating trial outcomes via bootstrap

Our tuning protocol corresponds to running trials with quasi-random hyperparameter values sampled uniformly from the search space until K feasible trials are obtained, with K depending on the workload. We then select the best trial, based on our statistic of interest, over those K trials.

We used the following bootstrap procedure to estimate means and uncertainties of our tuning protocol. We ran N > K trials, with N depending on the workload. Then, for each bootstrap sample, we resampled the dataset of N trials with replacement and computed our statistic on the first K trials of the resampled dataset. We collected 100 such bootstrap samples each time, and from those computed the means, 5th percentiles, and 95th percentiles of the bootstrap distribution. We used this procedure to generate the means and error bars for each plot.
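As a concrete reference, the sketch below implements this estimator in NumPy (ours, for illustration); the synthetic trial outcomes and the use of validation error as the statistic of interest are stand-ins.

```python
import numpy as np

def bootstrap_best_of_k(trial_errors, k, num_bootstrap=100, seed=0):
    """Mean and 5th/95th percentiles of the best-of-K outcome from N >= K trials."""
    rng = np.random.default_rng(seed)
    trial_errors = np.asarray(trial_errors)
    n = len(trial_errors)
    stats = []
    for _ in range(num_bootstrap):
        # Resample all N trials with replacement, then take the best of the first K.
        resampled = trial_errors[rng.integers(0, n, size=n)]
        stats.append(resampled[:k].min())
    stats = np.array(stats)
    return stats.mean(), np.percentile(stats, 5), np.percentile(stats, 95)

# Illustrative call with fake validation errors for N = 500 trials and K = 100.
fake_errors = np.random.default_rng(1).uniform(0.20, 0.35, size=500)
print(bootstrap_best_of_k(fake_errors, k=100))
```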

Simple CNN on Fashion MNIST used (K, N) = (100, 500); ResNet-32 on CIFAR-10 used (K, N) = (100, 500); ResNet-50 on ImageNet used (K, N) = (50, 250); Transformer on LM1B used (K, N) = (50, 250); VGG on CIFAR-10 with our code used (K, N) = (50, 250) for tuning the learning rate schedule and (K, N) = (100, 500) for tuning the learning rate schedule, {γ, β1, β2, ε}, and L2 regularization; CNN on CIFAR-100 used (K, N) = (100, 500); LSTM on War and Peace used (K, N) = (10, 50) for tuning just the learning rate and (K, N) = (100, 500) for tuning the learning rate schedule and {γ, β1, β2, ε}.

The sole exceptions to this bootstrap procedure are the two left panels of Figure 3, for which we used a similar procedure to Wilson et al. (2017) to ensure comparability. For each optimizer, we selected the trial that minimized validation error in our final search space and ran the same hyperparameter values 5 times, reporting the mean, minimum, and maximum test error over those 5 runs in Figure 3. This is slightly different to Wilson et al. (2017), who chose the trial that minimized training error and reported validation error. When tuning the learning rate and {γ, ε}, we used 24 trials per optimizer in the initial search space (which we used to select the final search space), and 16 trials per optimizer in the final search space.

D. Hyperparameter Search Spaces

Below we report the search spaces used for our experiments. We include both the initial search spaces, which we used to refine the final search spaces, and the final spaces used to generate the plots. When only one search space was used, we denote the initial space as final. η0, α0, 1−γ, 1−β1, 1−β2, ε, and combinations thereof are always tuned on a log scale. The number of samples from each search space is specified in Appendix C.
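For illustration, the sketch below samples a search space on a log scale; a plain pseudo-random generator stands in for the quasi-random sequence of Appendix C, and the example ranges are the final ADAM space for Transformer on LM1B (Table 25).

```python
import numpy as np

def sample_log_uniform(low, high, num_samples, rng):
    """Sample uniformly in log10 space within [low, high]."""
    return 10.0 ** rng.uniform(np.log10(low), np.log10(high), size=num_samples)

rng = np.random.default_rng(0)
samples = {
    "alpha0_over_eps": sample_log_uniform(1e1, 1e3, 5, rng),
    "one_minus_beta1": sample_log_uniform(1e-3, 1e0, 5, rng),
    "one_minus_beta2": sample_log_uniform(1e-6, 1e-2, 5, rng),
    "eps": sample_log_uniform(1e-7, 1e-3, 5, rng),
}
# alpha0 itself is recovered from the sampled ratio and epsilon.
alpha0 = samples["alpha0_over_eps"] * samples["eps"]
```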


D.1. CNN on Fashion MNIST

We used linear learning rate decay for all experiments. We tuned the number of decay steps within [0.5, 1.0] times the number of training steps and the learning rate decay factor within {10^-3, 10^-2, 10^-1}. We did not use L2 regularization or weight decay.
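The sketch below states how we read this schedule: the learning rate falls linearly from η0 to f·η0 over the tuned number of decay steps and is held constant afterwards. The exact endpoint convention is our assumption, and the default arguments are placeholders.

```python
def linear_decay_lr(step, base_lr, total_steps, decay_fraction=0.75, decay_factor=1e-2):
    """Linear decay from base_lr to decay_factor * base_lr over the first
    decay_fraction * total_steps steps, constant thereafter."""
    decay_steps = decay_fraction * total_steps
    progress = min(step, decay_steps) / decay_steps
    return base_lr * (1.0 - (1.0 - decay_factor) * progress)
```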

          η0
initial   [10^-2, 10^2]
final     [10^-2, 1]

Table 3: SGD

          η0               1−γ
initial   [10^-4, 10^2]    [10^-4, 1]
final     [10^-3, 10^-1]   [10^-3, 1]

Table 4: MOMENTUM

          η0               1−γ
initial   [10^-4, 10^2]    [10^-4, 1]
final     [10^-3, 10^-1]   [10^-3, 1]

Table 5: NESTEROV

          η0/√ε            1−γ          1−ρ          ε
initial   [10^-2, 10^4]    [10^-3, 1]   [10^-4, 1]   [10^-10, 10^10]
final     [10^-2, 1]       [10^-3, 1]   [10^-4, 1]   [10^-10, 10^-6]

Table 6: RMSPROP

          α0/ε             1−β1         1−β2         ε
initial   [10^-2, 10^4]    [10^-3, 1]   [10^-4, 1]   [10^-10, 10^10]
final     [10^-1, 10^1]    [10^-3, 1]   [10^-4, 1]   [10^-6, 10^-2]

Table 7: ADAM

          α0/ε             1−β1         1−β2         ε
initial   [10^-2, 10^4]    [10^-3, 1]   [10^-4, 1]   [10^-10, 10^10]
final     [10^-1, 10^1]    [10^-3, 1]   [10^-4, 1]   [10^-6, 10^-2]

Table 8: NADAM


D.2. ResNet-32 on CIFAR-10

We used linear learning rate decay for all experiments. We tuned the number of decay steps within [0.5, 1.0] times the number of training steps and the learning rate decay factor f within the values shown in the tables below. λ_L2 denotes the L2 regularization coefficient.

          η0               λ_L2                            f
initial   [10^-2, 10^2]    {10^-5, 10^-4, 10^-3, 10^-2}    {10^-4, 10^-3, 10^-2, 10^-1}
final     [10^-1, 10^1]    {10^-5, 10^-4, 10^-3, 10^-2}    {10^-4, 10^-3, 10^-2, 10^-1}

Table 9: SGD

          η0               1−γ          λ_L2                            f
initial   [10^-4, 10^2]    [10^-3, 1]   {10^-5, 10^-4, 10^-3, 10^-2}    {10^-4, 10^-3, 10^-2, 10^-1}
final     [10^-2, 1]       [10^-3, 1]   {10^-5, 10^-4, 10^-3, 10^-2}    {10^-4, 10^-3, 10^-2, 10^-1}

Table 10: MOMENTUM

          η0               1−γ             λ_L2                            f
initial   [10^-4, 10^2]    [10^-4, 10^1]   10^-4                           {10^-3, 10^-2, 10^-1}
final     [10^-2, 1]       [10^-3, 1]      {10^-5, 10^-4, 10^-3, 10^-2}    {10^-4, 10^-3, 10^-2, 10^-1}

Table 11: NESTEROV

          η0/√ε            1−γ          1−ρ          ε                  λ_L2                            f
initial   [10^-2, 10^4]    [10^-3, 1]   [10^-4, 1]   [10^-10, 10^10]    10^-4                           {10^-3, 10^-2, 10^-1}
final     [10^-2, 1]       [10^-3, 1]   [10^-4, 1]   [10^-4, 1]         {10^-5, 10^-4, 10^-3, 10^-2}    {10^-4, 10^-3, 10^-2, 10^-1}

Table 12: RMSPROP

          α0/ε             1−β1               1−β2             ε                λ_L2                            f
initial   [10^-4, 10^-1]   [10^-3, 5×10^-1]   [10^-4, 10^-1]   [10^-9, 10^-5]   10^-4                           {10^-3, 10^-2, 10^-1}
final     [10^-1, 10^1]    [10^-3, 1]         [10^-4, 1]       [10^-3, 10^1]    {10^-5, 10^-4, 10^-3, 10^-2}    {10^-4, 10^-3, 10^-2, 10^-1}

Table 13: ADAM


          α0/ε             1−β1         1−β2         ε                  λ_L2                            f
initial   [10^-2, 10^4]    [10^-3, 1]   [10^-4, 1]   [10^-10, 10^10]    {10^-5, 10^-4, 10^-3, 10^-2}    {10^-4, 10^-3, 10^-2, 10^-1}
final     [10^-2, 1]       [10^-3, 1]   [10^-4, 1]   [1, 10^4]          {10^-5, 10^-4, 10^-3, 10^-2}    {10^-4, 10^-3, 10^-2, 10^-1}

Table 14: NADAM


D.3. ResNet-50 on ImageNet

We used linear learning rate decay for all experiments. We tuned the number of decay steps within [0.5, 1.0] times the number of training steps and the learning rate decay factor f within the values shown in the tables below. λ_wd denotes the weight decay coefficient and τ denotes the label smoothing coefficient.

          η0               λ_wd             τ                   f
initial   [10^-2, 10^1]    [10^-5, 10^-2]   {0, 10^-2, 10^-1}   {10^-4, 10^-3, 10^-2, 10^-1}
final     [1, 10^2]        [10^-4, 10^-3]   10^-1               {10^-4, 10^-3, 10^-2, 10^-1}

Table 15: SGD

          η0           1−γ          λ_wd             τ                   f
initial   [10^-3, 1]   [10^-3, 1]   [10^-5, 10^-2]   {0, 10^-2, 10^-1}   {10^-4, 10^-3, 10^-2, 10^-1}
final     [10^-2, 1]   [10^-2, 1]   [10^-4, 10^-3]   10^-2               {10^-4, 10^-3, 10^-2, 10^-1}

Table 16: MOMENTUM

          η0           1−γ          λ_wd             τ                   f
initial   [10^-3, 1]   [10^-3, 1]   [10^-5, 10^-2]   {0, 10^-2, 10^-1}   10^-3
final     [10^-2, 1]   [10^-3, 1]   [10^-4, 10^-3]   0                   {10^-4, 10^-3, 10^-2, 10^-1}

Table 17: NESTEROV

          η0/√ε            1−γ    1−ρ          ε                  λ_wd             τ                   f
initial   [10^-2, 10^4]    0.1    [10^-4, 1]   [10^-10, 10^10]    [10^-5, 10^-2]   {0, 10^-2, 10^-1}   10^-3
final     [10^-2, 1]       0.1    [10^-2, 1]   [10^-8, 10^-3]     [10^-4, 10^-3]   0                   {10^-4, 10^-3, 10^-2, 10^-1}

Table 18: RMSPROP

          α0/ε         1−β1         ε               λ_wd             τ                   f
initial   [1, 10^2]    [10^-3, 1]   [1, 10^4]       [10^-5, 10^-3]   {0, 10^-2, 10^-1}   10^-3
final     [1, 10^2]    [10^-2, 1]   [10^-2, 10^2]   10^-4            10^-1               {10^-4, 10^-3, 10^-2, 10^-1}

Table 19: ADAM

          α0/ε             1−β1         ε                λ_wd             τ                   f
initial   [10^-1, 10^3]    [10^-3, 1]   [10^-2, 10^10]   [10^-5, 10^-2]   {0, 10^-2, 10^-1}   10^-3
final     [1, 10^2]        [10^-3, 1]   [10^3, 10^7]     10^-4            10^-1               10^-3

Table 20: NADAM


D.4. Transformer on LM1B

We used linear learning rate decay for all experiments. We tuned the number of decay steps within [0.5, 1.0] times the number of training steps and the learning rate decay factor within {10^-4, 10^-3, 10^-2, 10^-1, 1}.

          η0
initial   [10^-4, 10^-1]
final     [10^-3, 10^-1]

Table 21: SGD

          η0               1−γ
initial   [10^-4, 10^-1]   [10^-4, 1]
final     [10^-4, 10^-2]   [10^-3, 1]

Table 22: MOMENTUM

          η0               1−γ
initial   [10^-4, 10^-1]   [10^-4, 1]
final     [10^-4, 10^-2]   [10^-3, 1]

Table 23: NESTEROV

          η0/√ε            1−γ          1−ρ          ε
initial   [10^-2, 10^4]    [10^-3, 1]   [10^-4, 1]   [10^-10, 10^10]
final     [10^-1, 10^1]    [10^-3, 1]   [10^-4, 1]   [10^-9, 10^-5]

Table 24: RMSPROP

          α0/ε             1−β1         1−β2             ε
initial   [10^-2, 10^4]    [10^-3, 1]   [10^-4, 1]       [10^-10, 10^10]
final     [10^1, 10^3]     [10^-3, 1]   [10^-6, 10^-2]   [10^-7, 10^-3]

Table 25: ADAM

          α0/ε             1−β1         1−β2             ε
initial   [10^-2, 10^4]    [10^-3, 1]   [10^-4, 1]       [10^-10, 10^10]
final     [1, 10^2]        [10^-3, 1]   [10^-6, 10^-2]   [10^-6, 10^-2]

Table 26: NADAM


D.5. VGG on CIFAR-10 Using Code from Wilson et al. (2017)

D.5.1. GRID SEARCH OVER LEARNING RATE

We tuned over the same grid of initial learning rate values for each optimizer as Wilson et al. (2017). As in Wilson et al. (2017), we decayed the initial learning rate by a factor of 0.5 every 25 epochs and used a fixed L2 regularization coefficient of 0.0005.
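In code, this schedule is a simple epoch-indexed step decay; the sketch below is ours and only restates that rule.

```python
def step_decay_lr(epoch, base_lr):
    """Halve the learning rate every 25 epochs, as in Wilson et al. (2017)."""
    return base_lr * 0.5 ** (epoch // 25)
```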

D.5.2. TUNING LEARNING RATE & {γ, ε}

We used our quasi-random tuning protocol to tune over the initial learning rate, MOMENTUM's γ, RMSPROP's ε, and ADAM's ε. As in Wilson et al. (2017), we decayed the initial learning rate by a factor of 0.5 every 25 epochs and used a fixed L2 regularization coefficient of 0.0005.

          η0
initial   [10^-3, 1]
final     [10^-1, 10^1]

Table 27: SGD

          η0               1−γ
initial   [10^-3, 1]       [10^-3, 1]
final     [10^-1, 10^1]    [10^-1, 1]

Table 28: MOMENTUM

          ε                  α0/√ε
initial   [10^-10, 10^10]    [10^-2, 10^4]
final     [10^-2, 10^2]      [10^-1, 10^1]

Table 29: RMSPROP

          ε                  α0/ε
initial   [10^-10, 10^10]    [10^-2, 10^4]
final     [10^6, 10^10]      [10^-1, 10^1]

Table 30: ADAM


D.6. VGG on CIFAR-10 Using our Code

We used linear learning rate decay for all experiments. We tuned the number of decay steps within [0.5, 1.0] times the number of training steps and the learning rate decay factor within {10^-4, 10^-3, 10^-2, 10^-1}.

D.6.1. TUNING LEARNING RATE SCHEDULE

We fixed all optimizer hyperparameters excluding the learning rate to match those specified in Wilson et al. (2017). As in Wilson et al. (2017), we used a fixed L2 regularization coefficient of 0.0005.

          η0 (SGD)         η0 (MOMENTUM)   η0 (RMSPROP)     α0 (ADAM)
initial   [10^-3, 10^1]    [10^-3, 10^1]   [10^-5, 10^-1]   [10^-5, 10^-1]
final     1.0              [10^-2, 1]      [10^-4, 10^-2]   [10^-5, 10^-1]

Table 31: Learning rate search ranges.

D.6.2. TUNING LEARNING RATE SCHEDULE & {γ, β1, β2, ε, λL2}

          η0               λ_L2
initial   [10^-3, 10^1]    [10^-5, 10^-2]
final     [10^-2, 1]       [10^-3, 10^-1]

Table 32: SGD

          η0               1−γ          λ_L2
initial   [10^-3, 10^1]    [10^-3, 1]   [10^-5, 10^-2]
final     [10^-2, 1]       [10^-1, 1]   [10^-3, 10^-1]

Table 33: MOMENTUM

          α0/√ε            1−γ          1−ρ              ε                  λ_L2
initial   [10^-2, 10^4]    [10^-3, 1]   [10^-3, 1]       [10^-10, 10^10]    [10^-5, 10^-2]
final     [10^-2, 1]       [10^-1, 1]   [10^-3, 10^-2]   [10^2, 10^6]       [10^-3, 10^-1]

Table 34: RMSPROP

          α0/ε             1−β1         1−β2             ε                  λ_L2
initial   [10^-2, 10^4]    [10^-3, 1]   [10^-4, 10^-1]   [10^-10, 10^10]    [10^-5, 10^-2]
final     [10^-2, 10^1]    [10^-1, 1]   [10^-4, 10^-1]   [10^6, 10^10]      [10^-3, 10^-1]

Table 35: ADAM


D.7. CNN on CIFAR-100

D.7.1. TUNING CONSTANT LEARNING RATE

We fixed all optimizer hyperparameters excluding the learning rate to match those specified in Schneider et al. (2019).

          η (SGD)       η (MOMENTUM)     α (ADAM)
initial   [10^-2, 1]    [10^-4, 1]       [10^-5, 10^-2]
final     [10^-1, 1]    [10^-3, 10^-2]   [10^-4, 10^-3]

Table 36: Learning rate search ranges.

D.7.2. TUNING LEARNING RATE SCHEDULE

We used linear learning rate decay, and tuned the number of decay steps within [0.5, 1.0] times the number of training steps and the learning rate decay factor within {10^-4, 10^-3, 10^-2, 10^-1}.

          η0 (SGD)      η0 (MOMENTUM)    α0 (ADAM)
initial   [10^-2, 1]    [10^-4, 1]       [10^-5, 10^-2]
final     [10^-1, 1]    [10^-3, 10^-1]   [10^-4, 10^-3]

Table 37: Learning rate search ranges.

D.7.3. TUNING CONSTANT LEARNING RATE & {γ, β1, β2, ε}

For SGD, we reused the results from Appendix D.7.1, since there were no additional hyperparameters to tune.

          η             1−γ
final     [10^-4, 1]    [10^-3, 1]

Table 38: MOMENTUM

          α/ε              1−β1         1−β2             ε
initial   [10^-2, 10^-2]   [10^-3, 1]   [10^-4, 10^-1]   [10^-10, 10^-10]
final     [10^-1, 10^-1]   [10^-2, 1]   [10^-4, 10^-1]   [10^2, 10^6]

Table 39: ADAM

D.7.4. TUNING LEARNING RATE SCHEDULE & {γ, β1, β2, ε}

We used linear learning rate decay, and tuned the number of decay steps within [0.5, 1.0] times the number of training steps and the learning rate decay factor within {10^-4, 10^-3, 10^-2, 10^-1}. For SGD, we reused the results from Appendix D.7.2, since there were no additional hyperparameters to tune.

          η0               1−γ
initial   [10^-4, 1]       [10^-2, 1]
final     [10^-3, 10^-1]   [10^-3, 10^-1]

Table 40: MOMENTUM


          α0/ε             1−β1         1−β2             ε
initial   [10^-1, 10^1]    [10^-3, 1]   [10^-4, 10^-1]   [10^2, 10^6]
final     [10^-1, 10^1]    [10^-3, 1]   [10^-5, 10^-2]   [10^2, 10^6]

Table 41: ADAM


D.8. LSTM on War and Peace

D.8.1. TUNING CONSTANT LEARNING RATE

          η
final     [10^-2, 10^1]

Table 42: SGD

          η             1−γ
final     [10^-4, 1]    0.99

Table 43: MOMENTUM

          α/ε               1−β1    1−β2     ε
final     [10^-5, 10^-2]    0.9     0.999    10^-8

Table 44: ADAM

D.8.2. TUNING LEARNING RATE SCHEDULE & {γ, β1, β2, ε}

We used linear learning rate decay, and tuned the number of decay steps within [0.5, 1.0] times the number of training steps and the learning rate decay factor f within the values shown in the tables below.

          η0               f
initial   [10^-3, 10^1]    {10^-4, 10^-3, 10^-2, 10^-1}
final     [1, 10^1]        {10^-4, 10^-3, 10^-2, 10^-1}

Table 45: SGD

          η0               1−γ          f
initial   [10^-4, 1]       [10^-3, 1]   {10^-4, 10^-3, 10^-2, 10^-1}
final     [10^-1, 10^1]    [10^-2, 1]   {10^-4, 10^-3, 10^-2, 10^-1}

Table 46: MOMENTUM

          α0/ε             1−β1         1−β2             ε                  f
initial   [10^-2, 10^4]    [10^-3, 1]   [10^-4, 10^-1]   [10^-10, 10^10]    {10^-4, 10^-3, 10^-2, 10^-1}
final     [1, 10^2]        [10^-2, 1]   0.999            [1, 10^4]          10^-3

Table 47: ADAM


E. Additional plots

[Figure 5 consists of two panels for CNN on CIFAR-100 with Momentum, plotting final validation error (roughly 0.40 to 0.54) against the learning rate (10^-4 to 10^0, left) and against 1−γ (10^-3 to 10^0, right).]

Figure 5: Example plot of final validation error projected onto the axes of the hyperparameter space. We consider this search space to be appropriate because the optimal values are away from the search space boundaries.

[Figure 6 consists of six panels (SGD, Momentum, Nesterov, RMSProp, Adam, NAdam) plotting final validation cross entropy (roughly 3.3 to 4.0) against the number of independent trials, 2^0 to 2^7.]

Figure 6: Validation performance of the best trial mostly converges with as few as 2^4 hyperparameter tuning trials for Transformer on LM1B. Shaded regions indicate 5th and 95th percentiles estimated with bootstrap sampling (see Appendix C). The search spaces can be found in Appendix D.4.


[Figure 7 consists of six panels (SGD, Momentum, Nesterov, RMSProp, Adam, NAdam) plotting final validation error (roughly 0.23 to 0.27) against the number of independent trials, 2^0 to 2^7.]

Figure 7: Validation performance of the best trial mostly converges with as few as 2^4 hyperparameter tuning trials for ResNet-50 on ImageNet. Shaded regions indicate 5th and 95th percentiles estimated with bootstrap sampling (see Appendix C). The search spaces can be found in Appendix D.3.

[Figure 8 consists of three panels (SGD, Momentum, Adam) plotting final test error (roughly 0.35 to 0.70) against the number of independent trials, 2^0 to 2^7.]

Figure 8: Test performance of the best trial mostly converges with as few as 2^3 hyperparameter tuning trials for a 2-layer LSTM on War and Peace. Shaded regions indicate 5th and 95th percentiles estimated with bootstrap sampling (see Appendix C). The search spaces can be found in Appendix D.8.2.


[Figure 9 consists of four panels plotting final training loss (or training cross entropy) for each of SGD, Momentum, Nesterov, RMSProp, Adam, and NAdam on CNN on Fashion MNIST, ResNet-32 on CIFAR-10, ResNet-50 on ImageNet, and Transformer on LM1B.]

Figure 9: The relative performance of optimizers is consistent with the inclusion relationships when we select for lowest training loss. Note that SGD, ADAM, and NADAM for ResNet-50 on ImageNet used label smoothing in their final search spaces (see Section D.3), which makes their loss values incommensurate with the other optimizers. This is because their final search spaces were optimized to minimize validation error; if we had optimized their search spaces to minimize training error instead, we would not have used label smoothing, and we expect their training loss values would be consistent with the inclusion relationships.
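For readers unfamiliar with why label smoothing shifts the loss, the sketch below shows the usual construction with smoothing coefficient τ (whether the paper used exactly this variant is not stated in this appendix). Because the smoothed target is not one-hot, the achievable minimum cross entropy is strictly positive whenever τ > 0, so these values are not comparable to runs trained with τ = 0.

```python
import numpy as np

def smoothed_cross_entropy(logits, label, tau, num_classes):
    """Cross entropy against a label-smoothed target: the true class gets
    probability 1 - tau + tau/num_classes, every other class tau/num_classes."""
    z = logits - np.max(logits)
    log_probs = z - np.log(np.sum(np.exp(z)))
    target = np.full(num_classes, tau / num_classes)
    target[label] += 1.0 - tau
    return -np.sum(target * log_probs)
```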

[Figure 10 plots, for each of SGD, Momentum, Nesterov, RMSProp, Adam, and NAdam: validation cross entropy against the step budget and steps to threshold against the validation cross entropy target for Transformer on LM1B, and validation error against the step budget and steps to threshold against the validation error target for ResNet-32 on CIFAR-10.]

Figure 10: Our results confirming the relevance of optimizer inclusion relationships do not depend on the exact step budgets or error targets we chose.