Post on 02-Jun-2020
BackgroundAdaptive Basis Regression with Deep Neural Networks
Experiments
Scalable Bayesian Optimization Using Deep Neural Networks
Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat and Ryan P. Adams
June 27, 2016
Discussion by Ikenna Odinaka, Duke University
Snoek et al.
Outline
1 Background
2 Adaptive Basis Regression with Deep Neural Networks
    Model Details
    Incorporating Input Space Constraints
    Parallel Bayesian Optimization
3 Experiments
Bayesian Optimization in a nutshell
Global optimization aims to solve the minimization problem
$$x^* = \arg\min_{x \in \chi} f(x) \tag{1}$$
where $\chi$ is a compact subset of $\mathbb{R}^K$.
When f(x) is noisy and expensive (and x is intrinsically low-dimensional), Bayesian optimization is a natural fit
Goal: find the global optimum in as few steps as possible, since f(x) is expensive
Principled modeling of uncertainty -> balancing of exploration and exploitation during search
Bayesian optimization uses a surrogate probabilistic model
Acquisition functions are used to search for the next x
Acquisition functions balance exploration and exploitation. Their optima are located
    Where the model prediction is low (exploitation)
    Where the uncertainty of the surrogate model is large (exploration)
Authors’ Contribution in a nutshell
State-of-the-art approach (Spearmint) uses a Gaussian process as the surrogate model
Authors’ contribution is to improve scalability while maintaining principled modeling of uncertainty
    Replace the surrogate model with an adaptive basis regression
    Adaptive basis provided by deep neural network
    Bayesian linear regression performed on the last hidden layer
Example of Global Optimization: Hyperparameter Tuning
Era of big data, more computational power, and ambitious applications
This means more sophisticated machine learning models
Complex models mean more hyperparameters to tune
For example,
    Design decisions e.g. shape of neural network architecture: # of hidden layers, # of neurons in each hidden layer, choice of activation functions
    Regularization parameters e.g. dropout rate, weight decay ($\ell_2$ regularizer) coefficient
    Optimization parameters e.g. learning rate, size of mini-batch, momentum coefficient
Hyperparameters need to be set properly for good performance
Complex models also mean more function evaluations needed to get a good enough solution
Need scalable surrogate model
Approaches to Global Optimization
Non model-based approaches (aka cross validation): grid or random search
Model-based approaches
    Random forests (SMAC)
    Tree Parzen estimator (TPE)
    Gaussian process (GP)
    Authors’ method: Deep Networks for Global Optimization (DNGO)
GPs are widely used because they are simple and flexible w.r.t. conditioning and inference, and have well-calibrated uncertainty
DNGO maintains these simplicity, flexibility, and uncertainty properties
Details of Bayesian Optimization
Choose prior over functional form of f(x)
Given a set of N observations of input-target pairs $\mathcal{D} := \{(x_n, y_n = f(x_n))\}_{n=1}^N \subset \mathbb{R}^K \times \mathbb{R}$
Construct probabilistic regression model (a distribution over objective functions)
Query surrogate model (cheaper than f(x)) to determine where to find the optimum
Optimize over acquisition function to determine the next x to evaluate
Augment $\mathcal{D}$ and repeat
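Acquisition Functions: Expected Improvement (EI)
Let $\mu(x; \mathcal{D}, \Theta)$ and $\sigma^2(x; \mathcal{D}, \Theta)$ be the predictive mean and variance of the surrogate model
Define
$$\gamma(x) = \frac{f(x_{\text{best}}) - \mu(x; \mathcal{D}, \Theta)}{\sigma(x; \mathcal{D}, \Theta)} \tag{2}$$
where $f(x_{\text{best}}) = \min_n y_n$ is the lowest observed value
The expected improvement is given as
$$a_{\text{EI}}(x; \mathcal{D}, \Theta) = \sigma(x; \mathcal{D}, \Theta)\,[\gamma(x)\Phi(\gamma(x)) + \mathcal{N}(\gamma(x); 0, 1)] \tag{3}$$
where $\Phi(\cdot)$ and $\mathcal{N}(\cdot; 0, 1)$ are the CDF and PDF of a standard normal, respectively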
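As a minimal sketch of Equations 2 and 3, the function below evaluates expected improvement for a single candidate from its predictive mean and standard deviation. The function name, the argument names, and the handling of the zero-variance case are assumptions of this example, not something the slides specify.

```python
import math


def expected_improvement(mu, sigma, f_best):
    """Expected improvement for minimization (Equations 2 and 3).

    mu, sigma: predictive mean and standard deviation at a candidate x.
    f_best: lowest objective value observed so far.
    """
    if sigma <= 0.0:
        # Assumption of this sketch: with no predictive uncertainty the
        # improvement is deterministic.
        return max(f_best - mu, 0.0)
    gamma = (f_best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(gamma / math.sqrt(2.0)))  # standard normal CDF
    pdf = math.exp(-0.5 * gamma * gamma) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return sigma * (gamma * cdf + pdf)
```

A candidate with a lower predicted mean or a larger predictive standard deviation receives a larger value, which is the exploitation and exploration trade-off described earlier.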
Bayesian Neural Networks (BNNs)
BNNs try to uncover the full posterior over the network weights so as to
    Capture uncertainty
    Act as a regularizer
    Provide a framework for comparing different models
Full posterior is intractable for most neural networks -> expensive approximate inference or MCMC
Recent trend
    Variational approaches e.g. use another neural network to approximate the posterior over the network weights
    Perform full or approximate inference on a small part of the network e.g. the last layer of the network
Authors pursue the latter approach
Model DetailsIncorporating Input Space ConstraintsParallel Bayesian Optimization
Adaptive Basis Regression
GP-based Bayesian optimization scales cubically in the number of observations N
Limited to applications where f(x) requires a small number of observations to optimize
Need to replace GP with a regressor that keeps GP’s desirable properties
    Flexible w.r.t. conditioning and inference
    Well-calibrated uncertainty
Theoretical relationship between GPs and infinite Bayesian neural networks (BNNs) makes BNNs a natural choice
BNNs are computationally expensive
Practical approach -> Adaptive basis regression:
    Train deep neural network using a linear output layer for regression
    All weights estimated via MAP
    After training, replace the output layer with a Bayesian linear regressor
    Marginalize the output weights
Adaptive basis regression has cost cubic in D and linear in N, where D << N is the number of nodes in the last hidden layer
Basis Functions
Scale input space to unit hypercube
Deep neural network trained on $\mathcal{D}$
The vector of outputs from the last hidden layer is denoted by $\phi(\cdot) = [\phi_1(\cdot), \ldots, \phi_D(\cdot)]^T$
The output vectors from each training sample form the set of basis functions
The resulting design matrix is denoted by $\Phi = [\Phi_{nd} = \phi_d(x_n)]$, $n = 1, \ldots, N$, $d = 1, \ldots, D$
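To make the shapes concrete, here is a small NumPy sketch that builds a design matrix from the last hidden layer of a feed-forward network. The weights are random rather than trained, and the tanh activation, the layer sizes, and all names (`last_hidden_layer`, `weights`, `biases`) are assumptions chosen only for illustration.

```python
import numpy as np


def last_hidden_layer(X, weights, biases):
    """Forward pass through the hidden layers; row n of the result is phi(x_n)."""
    H = X
    for W, b in zip(weights, biases):
        H = np.tanh(H @ W + b)  # bounded activation, assumed for this sketch
    return H


rng = np.random.default_rng(0)
N, K, D = 20, 2, 50  # observations, input dimension, basis functions
X = rng.uniform(0.0, 1.0, size=(N, K))  # inputs scaled to the unit hypercube

# Untrained placeholder weights for three hidden layers of width D.
sizes = [K, D, D, D]
weights = [rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

Phi = last_hidden_layer(X, weights, biases)  # design matrix, shape (N, D)
```

In the method described on the slides the weights would come from training the network on the observed data; only the final shape of `Phi` carries over from this sketch.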
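Bayesian Linear Regression
y is the stacked target vector, X is the concatenated input vectors
Predictive mean $\mu(x; \mathcal{D}, \Theta)$ and variance $\sigma^2(x; \mathcal{D}, \Theta)$ of Bayesian linear regression are given by
$$\mu(x; \mathcal{D}, \Theta) = m^T \phi(x) + \eta(x), \tag{4}$$
$$\sigma^2(x; \mathcal{D}, \Theta) = \phi(x)^T K^{-1} \phi(x) + \frac{1}{\beta}, \tag{5}$$
where
$$m = \beta K^{-1} \Phi^T \tilde{y} \in \mathbb{R}^D, \tag{6}$$
$$K = \beta \Phi^T \Phi + I\alpha \in \mathbb{R}^{D \times D}, \tag{7}$$
$$\tilde{y} = y - \eta(x),$$
$\eta(x)$ is a prior mean function, and $\alpha, \beta \in \Theta$ are regression model hyperparameters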
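The following NumPy sketch computes the quantities in Equations 4 to 7 for a given design matrix. The function names and the convention of passing the prior mean as precomputed values (`eta_train`, `eta_x`) are assumptions of this example; a linear solve is used in place of an explicit inverse of K.

```python
import numpy as np


def blr_fit(Phi, y, alpha, beta, eta_train=None):
    """Return m (Equation 6) and K (Equation 7).

    eta_train: prior mean evaluated at the training inputs (zeros if omitted).
    """
    D = Phi.shape[1]
    y_tilde = y if eta_train is None else y - eta_train
    K = beta * Phi.T @ Phi + alpha * np.eye(D)
    m = beta * np.linalg.solve(K, Phi.T @ y_tilde)
    return m, K


def blr_predict(phi_x, m, K, beta, eta_x=0.0):
    """Predictive mean (Equation 4) and variance (Equation 5) at one input."""
    mean = m @ phi_x + eta_x
    var = phi_x @ np.linalg.solve(K, phi_x) + 1.0 / beta
    return mean, var
```

Because K is only D by D, the solve costs on the order of D cubed regardless of how many observations have been collected.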
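Bayesian Linear Regression Contd
Marginal log-likelihood is given by
$$\log p(y \mid X, \alpha, \beta) = \frac{D}{2}\log\alpha + \frac{N}{2}\log\beta - \frac{N}{2}\log(2\pi) - \frac{\beta}{2}\|\tilde{y} - \Phi m\|^2 - \frac{\alpha}{2} m^T m - \frac{1}{2}\log|K| \tag{8}$$
where $\tilde{y} = y - \eta(x)$ is the target vector with the prior mean subtracted
$\alpha$, $\beta$, and parameters of $\eta(x)$ are integrated out using slice sampling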
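Equation 8 can be evaluated directly from the design matrix, as in this NumPy sketch. The function name is hypothetical, and the input `y_tilde` is assumed to be the targets with the prior mean already subtracted.

```python
import numpy as np


def log_marginal_likelihood(Phi, y_tilde, alpha, beta):
    """Marginal log-likelihood of Bayesian linear regression (Equation 8)."""
    N, D = Phi.shape
    K = beta * Phi.T @ Phi + alpha * np.eye(D)
    m = beta * np.linalg.solve(K, Phi.T @ y_tilde)
    resid = y_tilde - Phi @ m
    _, logdet_K = np.linalg.slogdet(K)  # log-determinant without overflow
    return (0.5 * D * np.log(alpha)
            + 0.5 * N * np.log(beta)
            - 0.5 * N * np.log(2.0 * np.pi)
            - 0.5 * beta * resid @ resid
            - 0.5 * alpha * m @ m
            - 0.5 * logdet_K)
```

A sampler over alpha and beta, such as the slice sampling mentioned above, would call a function like this once per proposed hyperparameter setting.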
Scalability Comparison Between GP and ABR
GP scales cubically with N
Adaptive Basis Regression (DNGO) scales linearly with N, cubically with D; D is fixed and small.
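As a rough worked example, take N = 2500 (the total number of experiments reported for the image caption task) and D = 50 (the hidden units per layer reported for the chosen architecture). Counting about N D^2 operations to form the D by D matrix K and D^3 to solve with it, and ignoring constant factors, is a simplifying assumption of this estimate.

```latex
% Order-of-magnitude operation counts; constant factors ignored.
\begin{align*}
N &= 2500, \qquad D = 50 \\
\text{GP:} \quad N^3 &= 2500^3 \approx 1.56 \times 10^{10} \\
\text{ABR:} \quad N D^2 + D^3 &= 2500 \cdot 2500 + 125000 = 6.375 \times 10^{6}
\end{align*}
```

Under these assumptions the adaptive basis regression needs about three orders of magnitude fewer operations per model fit.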
Network Architecture
Need an architecture that generalizes across optimization problems
Important to choose the right activation function
    Interestingly, ReLU is a poor choice for last hidden layer
    Unbounded activation functions lead to poor uncertainty estimates
    Unnecessary exploration (more expensive function evaluations)
Network Architecture
Need an architecture that generalizes across optimization problems
Important to choose the right activation function
Minimize average relative loss on HPOLib benchmark problems
Choice between 1 to 4 hidden layers
GP-based Bayesian optimization (Spearmint) was used to tune other hyperparameters
    Learning rate, momentum
    Width of each hidden layer
    Dropout rates, $\ell_2$ normalization coefficient
Optimal configuration had no dropout and small $\ell_2$ normalization coefficient
Spearmint restricted capacity via a small number of hidden units (50 hidden units per layer)
Network Architecture
3 hidden layers chosen
Same architecture used in all experiments
Marginal Likelihood vs MAP Estimate
Standard approach is to maximize Equation 8 with respect to basis parameters (weights of network)
Computing gradient of $\log p(y \mid X, \alpha, \beta)$ requires inverting K at each iteration; expensive
Authors’ approach:
    Optimize basis using MAP point estimate
    Apply Bayesian linear regression layer, after the fact
Quadratic Prior
The prior mean function was chosen as
$$\eta(x) = \lambda + (x - c)^T \Lambda (x - c) \tag{9}$$
where $\lambda$ is the offset, c is the center of the quadratic, and $\Lambda$ is a diagonal scaling matrix
    $c \sim \mathcal{N}(0.5\,\mathbf{1}, I)$
    $\Lambda_{kk} \sim$ Horseshoe sparsifying prior, $\forall k \in \{1, \ldots, K\}$
Reasons for horseshoe sparsifying prior:
    Positive support -> convex functions
    Large spike at 0 with a heavy tail -> strong shrinkage for small values, preserving large ones
    Shrinkage allows quadratic part of Equation 9 to disappear if the model is misspecified
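Because the scaling matrix is diagonal, Equation 9 reduces to a weighted sum of squared offsets from the center. The short sketch below evaluates it in plain Python; the function and argument names are hypothetical, and the parameter values are taken as given rather than sampled from their priors.

```python
def quadratic_prior_mean(x, lam, c, lambda_diag):
    """Quadratic prior mean eta(x) of Equation 9 with a diagonal scaling matrix.

    x, c: sequences of length K (input and center of the quadratic).
    lam: scalar offset.
    lambda_diag: the K diagonal entries of the scaling matrix.
    """
    return lam + sum(l * (xi - ci) ** 2 for l, xi, ci in zip(lambda_diag, x, c))
```

Setting every diagonal entry to zero leaves only the constant offset, which is the situation the horseshoe prior's shrinkage allows when a quadratic bowl does not describe the objective.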
Handling Input Space Constraints in DNGO
Create a constraint classifier
Let $c_n \in \{0, 1\}$ be an indicator of the validity of $x_n$
Let $\mathcal{V} = \{(x_n, y_n) \mid c_n = 1\}$ and $\mathcal{I} = \{(x_n, y_n) \mid c_n = 0\}$ be sets of valid and invalid inputs, respectively; $\mathcal{D} := \mathcal{V} \cup \mathcal{I}$.
Let $\Psi$ be the set of hyperparameters for the constraint classifier
The expected improvement function in Equation 3 is modified to give
$$a_{\text{CEI}}(x; \mathcal{D}, \Theta, \Psi) = a_{\text{EI}}(x; \mathcal{V}, \Theta)\, P[c = 1 \mid x, \mathcal{D}, \Psi]$$
where
$$P[c = 1 \mid x, \mathcal{D}, \Psi] = \int_w P[c = 1 \mid x, \mathcal{D}, w, \Psi]\, P(w; \Psi)\, dw \tag{10}$$
is obtained by integrating out the output layer weights of the adaptive basis model
For noisy constraints, a logistic likelihood function is used for $P[c = 1 \mid x, \mathcal{D}, w, \Psi]$
For noiseless constraints, a step function is used instead
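One simple way to approximate the integral in Equation 10 is to average the logistic likelihood over sampled output-layer weights. The sketch below does that and then scales an expected-improvement value by the result. The Monte Carlo approach, the function names, and the assumption that weight samples are supplied by the caller are all choices of this example, not details given on the slides.

```python
import math


def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0.0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def prob_valid(phi_x, weight_samples):
    """Monte Carlo estimate of P[c = 1 | x] with a logistic likelihood.

    phi_x: basis-function values at the candidate input.
    weight_samples: output-layer weight vectors, assumed drawn from P(w; Psi).
    """
    total = 0.0
    for w in weight_samples:
        activation = sum(wi * pi for wi, pi in zip(w, phi_x))
        total += sigmoid(activation)
    return total / len(weight_samples)


def constrained_ei(ei_value, p_valid):
    """Constrained expected improvement: EI weighted by the validity probability."""
    return ei_value * p_valid
```

A candidate that looks promising under EI but is unlikely to satisfy the constraint is therefore down-weighted rather than excluded outright.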
Parallel DNGO
Intractable to create joint acquisition function across multiple inputs
Acquisitions are in general sequential
However, one can utilize fantasies from experiments that are running in parallel to aid the next choice of x
Idea:
    Use posterior predictive distribution in Equations 4 and 5 to generate a set of fantasy outcomes y for each running experiment
    Average fantasy outcomes to get a fantasy outcome for each running experiment
    Augment dataset D
    Marginalize out fantasies
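Parallel DNGO Contd
Given J currently running jobs with inputs $\{x_j\}_{j=1}^J$, the marginalized acquisition function is
$$a_{\text{MCEI}}(x; \mathcal{D}, \{x_j\}_{j=1}^J, \Theta, \Psi) = \int a_{\text{CEI}}(x; \mathcal{D} \cup \{(x_j, y_j)\}_{j=1}^J, \Theta, \Psi) \times P[\{c_j, y_j\}_{j=1}^J \mid \mathcal{D}, \{x_j\}_{j=1}^J]\, dy_1 \ldots dy_J\, dc_1 \ldots dc_J$$
Next input $x^*$ is chosen as
$$x^* = \arg\max_x a_{\text{MCEI}}(x; \mathcal{D}, \{x_j\}_{j=1}^J), \tag{11}$$
where
$$a_{\text{MCEI}}(x; \mathcal{D}, \{x_j\}_{j=1}^J) = \int a_{\text{MCEI}}(x; \mathcal{D}, \{x_j\}_{j=1}^J, \Theta, \Psi)\, d\Theta\, d\Psi \tag{12}$$
is the integrated acquisition function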
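An integral over fantasy outcomes like the one above is commonly approximated by Monte Carlo. The expression below is a sketch of that idea, not necessarily the computation used by the authors: it assumes S fantasy draws per running job from the Gaussian predictive distribution of Equations 4 and 5, and for simplicity it treats every pending job as valid, so the constraint indicators are dropped.

```latex
% Monte Carlo sketch; S and the superscript (s) are notation introduced here.
\begin{align*}
y_j^{(s)} &\sim \mathcal{N}\!\left(\mu(x_j; \mathcal{D}, \Theta),\ \sigma^2(x_j; \mathcal{D}, \Theta)\right),
  \qquad j = 1, \ldots, J, \quad s = 1, \ldots, S \\
a_{\text{MCEI}}(x; \mathcal{D}, \{x_j\}_{j=1}^J, \Theta, \Psi)
  &\approx \frac{1}{S} \sum_{s=1}^{S}
  a_{\text{CEI}}\!\left(x; \mathcal{D} \cup \{(x_j, y_j^{(s)})\}_{j=1}^J, \Theta, \Psi\right)
\end{align*}
```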
HPOLib Benchmarks
DNGO was compared to other methods for global optimization on a benchmark set of problems
TPE and SMAC are scalable, but have ad-hoc estimates of uncertainty
Spearmint is based on a standard GP, so it is not scalable
Image Caption Generation: Description
Using BLEU-4 metric on the Microsoft COCO 2014 test set
Captioning model tuned with DNGO is a log-bilinear model (LBL), which is a simpler model relative to LSTM
Each evaluation of LBL model took 26.6 hours
Tuned learning rate, momentum, batch size, dropout rate and weight decay for word and image representations, context size, size of word embeddings, etc.
Between 300 and 800 experiments run in parallel
Total of 2500 experiments (2700 CPU days) ran in less than 1 week
Distinct local optima in hyperparameter space may explain dramatic improvement in combining top 2 and 3 models
Image Caption Generation: Results
Deep Convolutional Neural Networks: Architecture
DNGO used to tune a deep CNN for visual object recognition on the CIFAR-10 and CIFAR-100 datasets
Same architecture (from Springenberg et al., 2014) for both datasets
Deep Convolutional Neural Networks: Results
40 experiments in parallel
Tuned momentum, learning rate, $\ell_2$ weight decay coefficients, dropout rates, standard deviations of random i.i.d. Gaussian weight initializations, etc.