Transcript of STA141C: Big Data & High Performance Statistical Computing...
STA141C: Big Data & High Performance Statistical Computing
Lecture 8: Optimization
Cho-Jui Hsieh, UC Davis
May 9, 2017
Optimization
Numerical Optimization
Numerical Optimization:
min_X f(X)
Can be applied to many areas.
Machine Learning: find a model that minimizes the prediction error.
Properties of the Function
Smooth function: functions with continuous derivative.
Example: ridge regression
argmin_w (1/2)‖Xw − y‖^2 + (λ/2)‖w‖^2
Non-smooth function: Lasso, primal SVM
Lasso: argmin_w (1/2)‖Xw − y‖^2 + λ‖w‖_1
SVM: argmin_w ∑_{i=1}^n max(0, 1 − y_i w^T x_i) + (λ/2)‖w‖^2
Convex Functions
A function is convex if:
∀x_1, x_2, ∀t ∈ [0, 1]: f(t x_1 + (1 − t) x_2) ≤ t f(x_1) + (1 − t) f(x_2)
No local optimum that is not a global optimum (why?)
Figure from Wikipedia
Convex Functions
If f(x) is twice differentiable, then
f is convex if and only if ∇²f(x) ⪰ 0
The optimal solution may not be unique: there is a set of optimal solutions S
Gradient: captures the first-order change of f:
f(x + αd) = f(x) + α∇f(x)^T d + O(α^2)
If f is convex and differentiable, we have the following optimality condition:
x^* ∈ S if and only if ∇f(x^*) = 0
Nonconvex Functions
If f is nonconvex, most algorithms can only converge to stationary points
x̄ is a stationary point if and only if ∇f(x̄) = 0
Three types of stationary points:
(1) Global optimum (2) Local optimum (3) Saddle point
Example: matrix completion, neural network, . . .
Example: f(x, y) = (1/2)(xy − a)^2
Coordinate Descent
Coordinate Descent
Update one variable at a time
Coordinate Descent: repeatedly perform the following loop
Step 1: pick an index i
Step 2: compute a step size δ* by (approximately) minimizing argmin_δ f(x + δ e_i)
Step 3: x_i ← x_i + δ*
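As a concrete illustration (not from the slides), here is a minimal sketch of cyclic coordinate descent for the ridge regression objective (1/2)‖Xw − y‖^2 + (λ/2)‖w‖^2, where each single-coordinate subproblem has a closed-form minimizer; the function name, iteration count, and the assumption λ > 0 are illustrative.

```python
import numpy as np

def ridge_coordinate_descent(X, y, lam, n_iters=50):
    """Cyclic coordinate descent sketch for 1/2 ||Xw - y||^2 + lam/2 ||w||^2.
    Each coordinate minimization has a closed form."""
    n, d = X.shape
    w = np.zeros(d)
    r = y - X @ w                      # residual y - Xw, maintained incrementally
    col_sq = (X ** 2).sum(axis=0)      # precompute x_j^T x_j for each column
    for _ in range(n_iters):
        for j in range(d):             # cyclic order; could also randomly permute
            # exact minimizer over coordinate j with the others fixed
            w_new = (X[:, j] @ r + col_sq[j] * w[j]) / (col_sq[j] + lam)
            r += X[:, j] * (w[j] - w_new)   # update residual in O(n)
            w[j] = w_new
    return w
```

Maintaining the residual r = y − Xw makes each coordinate update cost O(n) instead of O(nd).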
Coordinate Descent (update sequence)
Three types of updating order:
Cyclic: update sequence
x_1, x_2, . . . , x_n (1st outer iteration), x_1, x_2, . . . , x_n (2nd outer iteration), . . .
Randomly permute the sequence for each outer iteration (faster convergence in practice)
Some interesting theoretical analysis for this recently:
C.-P. Lee and S. J. Wright, Random Permutations Fix a Worst Case for Cyclic Coordinate Descent. 2016
Random: each time pick a random coordinate to update
Typical way: sample from the uniform distribution
Sample from the uniform distribution vs. sample from a biased distribution:
P. Zhao and T. Zhang, Stochastic Optimization with Importance Sampling for Regularized Loss Minimization. In ICML 2015
D. Csiba, Z. Qu and P. Richtarik, Stochastic Dual Coordinate Ascent with Adaptive Probabilities. In ICML 2015
Greedy Coordinate Descent
Greedy: choose the most “important” coordinate to update
How to measure the importance?
By first derivative: |∇_i f(x)|
By first and second derivative: |∇_i f(x) / ∇²_ii f(x)|
By maximum reduction of the objective function:
i* = argmax_{i=1,...,n} ( f(x) − min_δ f(x + δ e_i) )
Need to consider the time complexity for variable selection
Useful for kernel SVM (see lecture 6)
Extension: block coordinate descent
Variables are divided into blocks {X_1, . . . , X_p}, where each X_i is a subset of variables and
X_1 ∪ X_2 ∪ · · · ∪ X_p = {1, . . . , n}, X_i ∩ X_j = ∅ ∀i ≠ j
Each time, update one X_i by (approximately) solving the subproblem within the block
Example: alternating minimization for matrix completion (2 blocks). (See lecture 7.)
Coordinate Descent (convergence)
Converges to the optimum if f(·) is convex and smooth
Has a linear convergence rate if f(·) is strongly convex
Linear convergence: the error f(x^t) − f(x^*) decays as
β, β^2, β^3, . . .
for some β < 1.
Local linear convergence: an algorithm converges linearly once ‖x − x^*‖ ≤ K for some K > 0
Coordinate Descent: other names
Alternating minimization (matrix completion)
Iterative scaling (for log-linear models)
Decomposition method (for kernel SVM)
Gauss-Seidel (for linear systems when the matrix is positive definite)
. . .
Gradient Descent
Gradient Descent
Gradient descent algorithm: repeatedly perform the following update:
x^{t+1} ← x^t − α∇f(x^t)
where α > 0 is the step size
It is a fixed-point iteration method:
x − α∇f(x) = x if and only if x is an optimal solution
Step size too large ⇒ divergence; too small ⇒ slow convergence
Gradient Descent: step size
A twice-differentiable function has an L-Lipschitz continuous gradient if and only if
∇²f(x) ⪯ LI ∀x
In this case, Condition 2 is satisfied if α < 1/L
Theorem: gradient descent converges if α < 1/L
Theorem: gradient descent converges linearly with α < 1/L if f is strongly convex
Gradient Descent
In practice, we do not know L . . .
Step size α too large: the algorithm diverges
Step size α too small: the algorithm converges very slowly
Gradient Descent: line search
d* is a “descent direction” if and only if (d*)^T ∇f(x) < 0
Armijo-rule backtracking line search:
Try α = 1, 1/2, 1/4, . . . until it satisfies
f(x + αd*) ≤ f(x) + γα (d*)^T ∇f(x)
where 0 < γ < 1
Figure from http://ool.sourceforge.net/ool-ref.html
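A minimal sketch of this backtracking rule, assuming f and grad_f are callables for the objective and its gradient (the names, the default γ, and the cap on the number of halvings are illustrative):

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, d, gamma=0.01, max_halvings=50):
    """Armijo backtracking: try alpha = 1, 1/2, 1/4, ... until
    f(x + alpha*d) <= f(x) + gamma * alpha * d^T grad_f(x)."""
    fx = f(x)
    slope = d @ grad_f(x)          # negative when d is a descent direction
    alpha = 1.0
    for _ in range(max_halvings):
        if f(x + alpha * d) <= fx + gamma * alpha * slope:
            return alpha
        alpha *= 0.5
    return alpha
```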
Gradient Descent: line search
Gradient descent with line search:
Converges to optimal solutions if f is smooth
Converges linearly if f is strongly convex
However, each iteration requires evaluating f several times
Several other step-size selection approaches
Gradient Descent: applying to ridge regression
Input: X ∈ R^{N×d}, y ∈ R^N, initial w^(0)
Output: Solution w* := argmin_w (1/2)‖Xw − y‖^2 + (λ/2)‖w‖^2
1: t = 0
2: while not converged do
3:   Compute the gradient g = X^T(Xw − y) + λw
4:   Choose step size α_t
5:   Update w ← w − α_t g
6:   t ← t + 1
7: end while
Time complexity: O(nnz(X )) per iteration
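A hedged NumPy sketch of this procedure, using a fixed step size α = 1/L with L = ‖X‖_2^2 + λ (the Lipschitz constant of the gradient); the function name and iteration count are illustrative:

```python
import numpy as np

def ridge_gradient_descent(X, y, lam, alpha=None, n_iters=500):
    """Gradient descent sketch for 1/2 ||Xw - y||^2 + lam/2 ||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    if alpha is None:
        L = np.linalg.norm(X, 2) ** 2 + lam   # Lipschitz constant of the gradient
        alpha = 1.0 / L                       # safe fixed step size
    for _ in range(n_iters):
        g = X.T @ (X @ w - y) + lam * w       # gradient: X^T(Xw - y) + lam*w
        w -= alpha * g
    return w
```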
Proximal Gradient Descent
How can we apply gradient descent to solve the Lasso problem?
argmin_w (1/2)‖Xw − y‖^2 + λ‖w‖_1   (the λ‖w‖_1 term is non-differentiable)
General composite function minimization:
argmin_x f(x) := g(x) + h(x)
where g is smooth and convex, h is convex but may be non-differentiable
Usually assume h is simple (for computational efficiency)
Proximal Gradient Descent: successive approximation
At each iteration, form an approximation of f (·):
f(x^t + d) ≈ f̃_{x^t}(d) := g(x^t) + ∇g(x^t)^T d + (1/2) d^T ((1/α)I) d + h(x^t + d)
           = g(x^t) + ∇g(x^t)^T d + (1/(2α)) d^T d + h(x^t + d)
Update the solution by x^{t+1} ← x^t + argmin_d f̃_{x^t}(d)
This is called “proximal” gradient descent
Usually d* = argmin_d f̃_{x^t}(d) has a closed-form solution
Proximal Gradient Descent: ℓ1-regularization
The subproblem:
x^{t+1} = x^t + argmin_d ∇g(x^t)^T d + (1/(2α)) d^T d + λ‖x^t + d‖_1
        = argmin_u (1/2)‖u − (x^t − α∇g(x^t))‖^2 + λα‖u‖_1
        = S(x^t − α∇g(x^t), αλ),
where S is the soft-thresholding operator defined by
S(a, z) = a − z if a > z;  a + z if a < −z;  0 if a ∈ [−z, z]
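A one-line NumPy sketch of this operator, applied elementwise (the function name is illustrative):

```python
import numpy as np

def soft_threshold(a, z):
    """Soft-thresholding S(a, z): shrink a toward 0 by z, elementwise."""
    return np.sign(a) * np.maximum(np.abs(a) - z, 0.0)
```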
Proximal Gradient: soft-thresholding
Figure from http://jocelynchi.com/soft-thresholding-operator-and-the-lasso-solution/
Proximal Gradient Descent for Lasso
Input: X ∈ R^{N×d}, y ∈ R^N, initial w^(0)
Output: Solution w* := argmin_w (1/2)‖Xw − y‖^2 + λ‖w‖_1
1: t = 0
2: while not converged do
3:   Compute the gradient g = X^T(Xw − y)
4:   Choose step size α_t
5:   Update w ← S(w − α_t g, α_t λ)
6:   t ← t + 1
7: end while
Time complexity: O(nnz(X )) per iteration
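A hedged NumPy sketch of this loop, with the soft-thresholding step written inline and a fixed step size α = 1/‖X‖_2^2; the function name and iteration count are illustrative:

```python
import numpy as np

def lasso_proximal_gradient(X, y, lam, alpha=None, n_iters=500):
    """Proximal gradient sketch for 1/2 ||Xw - y||^2 + lam ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    if alpha is None:
        alpha = 1.0 / (np.linalg.norm(X, 2) ** 2)   # 1/L for the smooth part
    for _ in range(n_iters):
        g = X.T @ (X @ w - y)                       # gradient of the smooth part
        z = w - alpha * g
        # soft-thresholding S(z, alpha*lam), elementwise
        w = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)
    return w
```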
Newton’s Method
Newton’s Method
Iteratively perform the following update:
x ← x − α ∇²f(x)^{-1} ∇f(x)
where α is the step size
If α = 1: converges quadratically when x^t is close enough to x^*:
‖x^{t+1} − x^*‖ ≤ K‖x^t − x^*‖^2
for some constant K. This means the error f(x^t) − f(x^*) decays quadratically:
β, β^2, β^4, β^8, β^16, . . .
Only a few iterations are needed to converge in this “quadratic convergence region”
Newton’s Method
However, Newton’s update rule is more expensive than gradient descent/coordinate descent
Newton’s Method
Need to compute ∇²f(x)^{-1} ∇f(x)
Closed-form solution: O(d^3) for solving a d-dimensional linear system
Usually solved by another iterative solver:
gradient descent
coordinate descent
conjugate gradient method
. . .
Examples: primal L2-SVM/logistic regression, ℓ1-regularized logistic regression, . . .
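As a hedged sketch, the following applies Newton’s method (with α = 1 and a dense linear solve for the Newton system) to L2-regularized logistic regression; for large d one would replace np.linalg.solve with one of the iterative inner solvers listed above. The function name and iteration count are illustrative.

```python
import numpy as np

def newton_logistic_l2(X, y, lam, n_iters=20):
    """Newton's method sketch (step size alpha = 1) for
    sum_i log(1 + exp(-y_i * w^T x_i)) + lam/2 * ||w||^2, with y_i in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        z = y * (X @ w)
        p = 1.0 / (1.0 + np.exp(z))              # sigma(-y_i w^T x_i)
        grad = X.T @ (-y * p) + lam * w          # gradient
        D = p * (1.0 - p)                        # per-sample Hessian weights
        H = X.T @ (X * D[:, None]) + lam * np.eye(d)   # Hessian
        w -= np.linalg.solve(H, grad)            # Newton step: H^{-1} grad
    return w
```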
Stochastic Gradient Method
Stochastic Gradient Method: Motivation
Widely used for machine learning problems (with a large number of samples)
Given training samples x_1, . . . , x_n, we usually want to solve the following empirical risk minimization (ERM) problem:
argmin_w ∑_{i=1}^n ℓ_i(w),
where each ℓ_i(·) is the loss function on sample x_i
Minimize the summation of the individual losses on each sample
Stochastic Gradient Method
Assume the objective function can be written as
f(x) = (1/n) ∑_{i=1}^n f_i(x)
Stochastic gradient method: iteratively conducts the following updates
1. Choose an index i uniformly at random
2. x^{t+1} ← x^t − η_t ∇f_i(x^t)
η_t > 0 is the step size
Why does SG work?
E_i[∇f_i(x)] = (1/n) ∑_{i=1}^n ∇f_i(x) = ∇f(x)
Is it a fixed-point method? No if η > 0, because x^* − η∇f_i(x^*) ≠ x^*
Is it a descent method? No, because f(x^{t+1}) is not necessarily less than f(x^t)
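A minimal generic sketch of this update loop, assuming grad_fi(x, i) returns ∇f_i(x); the step size must decay (as noted below), and the decay schedule η_t = η_0 t^{-a} and default constants here are illustrative:

```python
import numpy as np

def sgd(grad_fi, n, x0, eta0=0.1, a=0.5, n_iters=10000, seed=0):
    """Stochastic gradient sketch: pick i uniformly at random, then
    x <- x - eta_t * grad_fi(x, i), with decaying step size eta_t = eta0 * t^(-a)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for t in range(1, n_iters + 1):
        i = rng.integers(n)          # uniform random index
        eta = eta0 * t ** (-a)       # step size decaying to 0
        x -= eta * grad_fi(x, i)
    return x
```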
Stochastic Gradient
Step size η has to decay to 0
(e.g., η_t = C t^{-a} for some constants a, C)
SGD converges sub-linearly
Many variants proposed recently
SVRG, SAGA (2013, 2014): variance reduction
AdaGrad (2011): adaptive learning rate
RMSProp (2012): estimate the learning rate by a running average of gradients
Adam (2015): adaptive moment estimation
Widely used in machine learning
SGD in Deep Learning
AdaGrad: adaptive step size for each parameter
Update rule at the t-th iteration:
Compute g = ∇f(x)
Estimate the second moment: G_ii = G_ii + g_i^2 for all i
Parameter update: x_i ← x_i − (η/√G_ii) g_i for all i
Proposed for convex optimization in:
“Adaptive subgradient methods for online learning and stochastic optimization” (JMLR 2011)
“Adaptive Bound Optimization for Online Convex Optimization” (COLT 2010)
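A hedged sketch of this update rule, where grad_f(x) returns a (stochastic) gradient; the small eps added before dividing by √G_ii, used to avoid division by zero, is a standard implementation detail not shown on the slide:

```python
import numpy as np

def adagrad(grad_f, x0, eta=0.01, eps=1e-8, n_iters=1000):
    """AdaGrad sketch: per-coordinate step size eta / sqrt(G_ii), where
    G_ii accumulates the squared gradients of coordinate i."""
    x = np.array(x0, dtype=float)
    G = np.zeros_like(x)
    for _ in range(n_iters):
        g = grad_f(x)
        G += g ** 2                        # accumulate second moment per coordinate
        x -= eta / (np.sqrt(G) + eps) * g  # adaptive per-parameter update
    return x
```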
Stochastic Gradient: applying to ridge regression
Objective function:
argmin_w (1/n) ∑_{i=1}^n (w^T x_i − y_i)^2 + λ‖w‖^2
How to write this as argmin_w (1/n) ∑_{i=1}^n f_i(w)? How to decompose it into n components?
First approach: f_i(w) = (w^T x_i − y_i)^2 + λ‖w‖^2
Update rule:
w^{t+1} ← w^t − 2η_t(w^T x_i − y_i) x_i − 2η_t λ w
        = (1 − 2η_t λ) w − 2η_t (w^T x_i − y_i) x_i
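A hedged sketch of this first approach, sampling i uniformly and using a decaying step size η_t = C t^{-a}; the constants C, a and the epoch count are illustrative:

```python
import numpy as np

def ridge_sgd(X, y, lam, n_epochs=50, C=0.1, a=0.6, seed=0):
    """SGD sketch for (1/n) sum_i (w^T x_i - y_i)^2 + lam ||w||^2,
    with f_i(w) = (w^T x_i - y_i)^2 + lam ||w||^2 and eta_t = C * t^(-a)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 1
    for _ in range(n_epochs):
        for i in rng.integers(0, n, size=n):   # sample indices uniformly
            eta = C * t ** (-a)                # decaying step size
            r = X[i] @ w - y[i]                # residual on sample i
            w = (1 - 2 * eta * lam) * w - 2 * eta * r * X[i]
            t += 1
    return w
```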
Coming up
Other classification models
Questions?