Recent Progresses in Stochastic Algorithms for Big Data Optimization
Transcript of slides: bicmr.pku.edu.cn/conference/opt-2014/slides/Tong-Zhang.pdf
Tong Zhang
Rutgers University & Baidu Inc.
collaborators: Shai Shalev-Shwartz, Rie Johnson, Lin Xiao, Ohad Shamir and Nathan Srebro
T. Zhang Big Data Optimization 1 / 36
Outline
Background: big data optimization problem; 1st order stochastic gradient versus batch gradient: pros and cons
Stochastic gradient algorithms with variance reduction:
  algorithm 1: SVRG (Stochastic Variance Reduced Gradient)
  algorithm 2: SDCA (Stochastic Dual Coordinate Ascent)
  algorithm 3: accelerated SDCA (with Nesterov acceleration)
Strategies for distributed computing:
  algorithm 4: DANE (Distributed Approximate NEwton-type method); behaves like 2nd order stochastic sampling
Mathematical Problem
Big Data Optimization Problem in machine learning:
$$\min_w f(w), \qquad f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w)$$
Special structure: sum over data.
Big data (n large) requires distributed training.
Assumptions on loss function
λ-strong convexity:
$$f(w') \ge f(w) + \nabla f(w)^\top (w' - w) + \frac{\lambda}{2}\|w' - w\|_2^2$$
L-smoothness:
$$f_i(w') \le f_i(w) + \nabla f_i(w)^\top (w' - w) + \frac{L}{2}\|w' - w\|_2^2$$
Example: Computational Advertising
Large scale regularized logistic regression
$$\min_w \frac{1}{n}\sum_{i=1}^n \underbrace{\ln(1 + e^{-w^\top x_i y_i}) + \frac{\lambda}{2}\|w\|_2^2}_{f_i(w)}$$

data (x_i, y_i) with y_i ∈ {±1}; parameter vector w.
The objective is λ-strongly convex and L-smooth with L = 0.25 max_i ‖x_i‖₂² + λ.
big data: n ∼ 10–100 billion
high dimension: dim(x_i) ∼ 10–100 billion
How to solve big optimization problems efficiently?
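As a concrete sketch (not from the slides), the per-example loss above, its gradient, and the quoted smoothness constant can be written directly; the function names and toy shapes are illustrative assumptions.

```python
import numpy as np

# Sketch of the per-example objective above: regularized logistic loss
# f_i(w) = ln(1 + exp(-y_i w·x_i)) + (lam/2)||w||^2, its gradient, and the
# smoothness constant L = 0.25 max_i ||x_i||^2 + lam quoted on the slide.

def f_i(w, x, y, lam):
    return np.log1p(np.exp(-y * w.dot(x))) + 0.5 * lam * w.dot(w)

def grad_f_i(w, x, y, lam):
    # d/dz ln(1 + e^{-z}) = -1 / (1 + e^{z}), evaluated at z = y w·x
    return (-y / (1.0 + np.exp(y * w.dot(x)))) * x + lam * w

def smoothness_bound(X, lam):
    return 0.25 * max(x.dot(x) for x in X) + lam
```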
Statistical Thinking: sampling
Objective function:
$$f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w)$$

sample objective function: only optimize an approximate objective

1st order gradient:
$$\nabla f(w) = \frac{1}{n}\sum_{i=1}^n \nabla f_i(w)$$
sample 1st order gradient (stochastic gradient): converges to the exact optimum; with variance reduction: fast rate

2nd order gradient:
$$\nabla^2 f(w) = \frac{1}{n}\sum_{i=1}^n \nabla^2 f_i(w)$$
sample 2nd order gradient (stochastic Newton): converges to the exact optimum with fast rate, suited to distributed computing
Batch Optimization Method: Gradient Descent
Solve
$$w_* = \arg\min_w f(w), \qquad f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w).$$

Gradient Descent (GD):

$$w_k = w_{k-1} - \eta_k \nabla f(w_{k-1}).$$
How fast does this method converge to the optimal solution?
General result: converges to a local minimum under suitable conditions. The convergence rate depends on the conditions satisfied by f(·). For λ-strongly convex and L-smooth problems, the rate is linear:

$$f(w_k) - f(w_*) = O((1-\rho)^k),$$

where ρ = O(λ/L) is the inverse condition number.
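A minimal runnable sketch of batch GD on a strongly convex, smooth least-squares objective; the toy data, the constant step size η = 1/L, and the iteration count are illustrative assumptions.

```python
import numpy as np

# Batch gradient descent on f(w) = (1/(2n))||Xw - y||^2 + (lam/2)||w||^2,
# which is lam-strongly convex and L-smooth.

rng = np.random.default_rng(0)
n, d, lam = 100, 5, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def grad_f(w):
    # full gradient: one pass over all n examples per step
    return X.T.dot(X.dot(w) - y) / n + lam * w

L = np.linalg.eigvalsh(X.T @ X / n).max() + lam   # smoothness constant
w = np.zeros(d)
for k in range(500):
    w = w - (1.0 / L) * grad_f(w)                 # constant step eta = 1/L
```

With these conditions the iterates contract linearly toward the unique minimizer, matching the O((1-ρ)^k) rate on the slide.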
Stochastic Approximate Gradient Computation
If
$$f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w),$$

GD requires the computation of the full gradient, which is extremely costly:

$$\nabla f(w) = \frac{1}{n}\sum_{i=1}^n \nabla f_i(w)$$

Idea: stochastic optimization employs a random sample (mini-batch) B to approximate

$$\nabla f(w) \approx \frac{1}{|B|}\sum_{i\in B} \nabla f_i(w)$$

It is an unbiased estimator: more efficient computation, but it introduces variance.
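The mini-batch estimator can be sketched on a least-squares example; the data sizes and batch size below are illustrative assumptions, not from the slides.

```python
import numpy as np

# Mini-batch gradient estimate for f(w) = (1/n) sum_i f_i(w), using
# f_i(w) = (x_i·w - y_i)^2 / 2 as the component loss.

rng = np.random.default_rng(1)
n, d = 1000, 4
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def full_grad(w):
    return X.T.dot(X.dot(w) - y) / n

def minibatch_grad(w, batch_size):
    # uniform sampling with replacement -> unbiased estimator of full_grad(w)
    B = rng.integers(0, n, size=batch_size)
    return X[B].T.dot(X[B].dot(w) - y[B]) / batch_size
```

Averaging many independent mini-batch estimates recovers the full gradient, which is the unbiasedness claim above; a single estimate costs |B| gradients instead of n but carries variance.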
SGD versus GD
SGD: faster computation per step; sublinear convergence, due to the variance of the gradient approximation:

$$f(w_t) - f(w_*) = O(1/t).$$

GD: slower computation per step; linear convergence:

$$f(w_t) - f(w_*) = O((1-\rho)^t).$$
Improving SGD via Variance Reduction
GD converges fast but computation is slow. SGD computation is fast but converges slowly; the slow convergence is due to inherent variance.

SGD as a statistical estimator of the gradient: let g_i = ∇f_i.

unbiasedness (expectation over a uniformly random index i):
$$\mathbb{E}\, g_i = \frac{1}{n}\sum_{i=1}^n g_i = \nabla f.$$

error of using g_i to approximate ∇f: the variance $\mathbb{E}\|g_i - \mathbb{E}\, g_i\|_2^2$.

Statistical thinking: relate variance to optimization; design other unbiased gradient estimators with smaller variance.
Relating Statistical Variance to Optimization
Want to optimize

$$\min_w f(w)$$

Full gradient: ∇f(w).

Given an unbiased random estimator g_i of ∇f(w) and the SGD rule

$$w \to w - \eta g_i,$$

the reduction of the objective is

$$\mathbb{E} f(w - \eta g_i) \le f(w) - \underbrace{(\eta - \eta^2 L/2)\,\|\nabla f(w)\|_2^2}_{\text{non-random}} + \underbrace{\frac{\eta^2 L}{2}\,\mathbb{E}\|g_i - \mathbb{E}\, g_i\|_2^2}_{\text{variance}}.$$

Smaller variance implies a bigger reduction.
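The bound above follows from the L-smoothness inequality stated earlier plus the bias-variance decomposition of E‖g_i‖²; a sketch of the step:

```latex
% L-smoothness of f applied at w with increment -\eta g_i:
f(w - \eta g_i) \le f(w) - \eta\, \nabla f(w)^\top g_i + \frac{\eta^2 L}{2}\,\|g_i\|_2^2 .
% Take expectations, using \mathbb{E}\, g_i = \nabla f(w) and
% \mathbb{E}\|g_i\|_2^2 = \|\nabla f(w)\|_2^2 + \mathbb{E}\|g_i - \mathbb{E}\, g_i\|_2^2 :
\mathbb{E} f(w - \eta g_i)
  \le f(w) - (\eta - \eta^2 L/2)\,\|\nabla f(w)\|_2^2
    + \frac{\eta^2 L}{2}\,\mathbb{E}\|g_i - \mathbb{E}\, g_i\|_2^2 .
```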
Stochastic Variance Reduced Gradient (SVRG) I
Objective function
$$f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w) = \frac{1}{n}\sum_{i=1}^n \tilde{f}_i(w),$$

where

$$\tilde{f}_i(w) = f_i(w) - \underbrace{(\nabla f_i(\tilde{w}) - \nabla f(\tilde{w}))^\top w}_{\text{sum to zero over } i}.$$

Pick w̃ to be an approximate solution (close to w_*). The SVRG rule (control variates) is

$$w_t = w_{t-1} - \eta_t \nabla \tilde{f}_i(w_{t-1}) = w_{t-1} - \eta_t \underbrace{[\nabla f_i(w_{t-1}) - \nabla f_i(\tilde{w}) + \nabla f(\tilde{w})]}_{\text{small variance}}.$$
Stochastic Variance Reduced Gradient (SVRG) II
Assume that w̃ ≈ w_* and w_{t-1} ≈ w_*. Then

$$\nabla f(\tilde{w}) \approx \nabla f(w_*) = 0, \qquad \nabla f_i(w_{t-1}) \approx \nabla f_i(\tilde{w}).$$

This means

$$\nabla f_i(w_{t-1}) - \nabla f_i(\tilde{w}) + \nabla f(\tilde{w}) \to 0.$$

It is possible to choose a constant step size η_t = η instead of requiring η_t → 0. One can achieve comparable linear convergence with SVRG:

$$\mathbb{E} f(w_t) - f(w_*) = O((1-\rho)^t),$$

where ρ = O(λn/(L + λn)); convergence is faster than GD.
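The SVRG rule above can be sketched on a ridge-regression objective; the stage length, step size, row normalization, and toy data are illustrative assumptions.

```python
import numpy as np

# SVRG sketch on f_i(w) = (x_i·w - y_i)^2/2 + (lam/2)||w||^2.
# Rows are normalized so a constant step size is easy to pick.

rng = np.random.default_rng(2)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # ||x_i|| = 1
y = rng.normal(size=n)

def grad_i(w, i):
    return (X[i].dot(w) - y[i]) * X[i] + lam * w

def full_grad(w):
    return X.T.dot(X.dot(w) - y) / n + lam * w

def svrg(w, eta=0.1, stages=30, m=2 * n):
    for _ in range(stages):
        w_snap = w.copy()
        mu = full_grad(w_snap)            # one full gradient per stage
        for _ in range(m):
            i = rng.integers(n)
            # variance-reduced estimate: unbiased, -> 0 as w, w_snap -> w*
            g = grad_i(w, i) - grad_i(w_snap, i) + mu
            w = w - eta * g               # constant step size
    return w

w_hat = svrg(np.zeros(d))
```

Each stage costs one full gradient plus m cheap stochastic steps, which is the source of the (n + L/λ) log(1/ε) complexity quoted on the next slide.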
Compare SVRG to Batch Gradient Descent Algorithm
Number of examples needed to achieve ε accuracy:
Batch GD: O(n · L/λ log(1/ε))
SVRG: O((n + L/λ) log(1/ε))
Assume an L-smooth loss:

$$\|\nabla f_i(w) - \nabla f_i(w')\| \le L\|w - w'\|$$

and a λ-strongly convex objective:

$$\|\nabla f(w) - \nabla f(w')\| \ge \lambda\|w - w'\|$$
The gain of SVRG over batch algorithm is significant when n is large.
Motivation of SDCA: regularized loss minimization
Assume we want to solve the Lasso problem:
$$\min_w \left[\frac{1}{n}\sum_{i=1}^n (w^\top x_i - y_i)^2 + \lambda\|w\|_1\right]$$
or the ridge regression problem:
$$\min_w \underbrace{\frac{1}{n}\sum_{i=1}^n (w^\top x_i - y_i)^2}_{\text{loss}} + \underbrace{\frac{\lambda}{2}\|w\|_2^2}_{\text{regularization}}$$
Our goal: solve regularized loss minimization problems as fast as we can.

A good solution leads to a stochastic algorithm called proximal Stochastic Dual Coordinate Ascent (Prox-SDCA). We show: fast convergence of SDCA for many regularized loss minimization problems in machine learning.
General Problem
Want to solve:
$$\min_w P(w) := \left[\frac{1}{n}\sum_{i=1}^n \phi_i(X_i^\top w) + \lambda g(w)\right],$$

where the X_i are matrices and g(·) is strongly convex.

Examples:

Multi-class logistic loss:

$$\phi_i(X_i^\top w) = \ln \sum_{\ell=1}^K \exp(w^\top X_{i,\ell}) - w^\top X_{i,y_i}.$$

L1-L2 regularization:

$$g(w) = \frac{1}{2}\|w\|_2^2 + \frac{\sigma}{\lambda}\|w\|_1$$
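The two examples can be written directly as functions; the shape conventions here (X_i as a d×K matrix with one column per class, y_i the correct class index) and the function names are illustrative assumptions.

```python
import numpy as np

# The slide's two example ingredients as plain functions.

def phi_multiclass(w, X_i, y_i):
    scores = X_i.T.dot(w)                       # w·X_{i,l} for each class l
    m = scores.max()                            # stabilized log-sum-exp
    return m + np.log(np.exp(scores - m).sum()) - scores[y_i]

def g_l1l2(w, sigma, lam):
    # g(w) = (1/2)||w||_2^2 + (sigma/lam)||w||_1
    return 0.5 * w.dot(w) + (sigma / lam) * np.abs(w).sum()
```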
Dual Formulation
Primal:
$$\min_w P(w) := \left[\frac{1}{n}\sum_{i=1}^n \phi_i(X_i^\top w) + \lambda g(w)\right],$$

Dual:

$$\max_\alpha D(\alpha) := \left[\frac{1}{n}\sum_{i=1}^n -\phi_i^*(-\alpha_i) - \lambda g^*\!\left(\frac{1}{\lambda n}\sum_{i=1}^n X_i\alpha_i\right)\right]$$

with the relationship

$$w = \nabla g^*\!\left(\frac{1}{\lambda n}\sum_{i=1}^n X_i\alpha_i\right).$$

The convex conjugate (dual) is defined as $\phi_i^*(a) = \sup_z (az - \phi_i(z))$.

SDCA: randomly pick i and optimize D(α) by varying α_i while keeping the other dual variables fixed.
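For squared loss the single-coordinate maximization of D(α) has a closed form, which makes a compact SDCA sketch possible; the update formula below is the standard one for ridge regression with g(w) = ‖w‖²/2 (so w = v), and the toy data and iteration count are illustrative assumptions.

```python
import numpy as np

# SDCA sketch for ridge regression, phi_i(z) = (z - y_i)^2 / 2.
# Each step exactly maximizes D(alpha) in one coordinate alpha_i.

rng = np.random.default_rng(3)
n, d, lam = 150, 5, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

alpha = np.zeros(n)
w = np.zeros(d)          # maintained as w = (1/(lam*n)) sum_i alpha_i x_i
for _ in range(30 * n):
    i = rng.integers(n)
    # closed-form coordinate maximizer for squared loss
    delta = (y[i] - X[i].dot(w) - alpha[i]) / (1.0 + X[i].dot(X[i]) / (lam * n))
    alpha[i] += delta
    w += (delta / (lam * n)) * X[i]
```

Maintaining w incrementally keeps each coordinate step O(d), which is what makes the per-iteration cost comparable to one SGD step.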
Example: L1 − L2 Regularized Logistic Regression
Primal:
$$P(w) = \frac{1}{n}\sum_{i=1}^n \underbrace{\ln(1 + e^{-w^\top X_i Y_i})}_{\phi_i(w)} + \underbrace{\frac{\lambda}{2} w^\top w + \sigma\|w\|_1}_{\lambda g(w)}.$$

Dual: with α_i Y_i ∈ [0, 1],

$$D(\alpha) = \frac{1}{n}\sum_{i=1}^n \underbrace{\left[-\alpha_i Y_i \ln(\alpha_i Y_i) - (1 - \alpha_i Y_i)\ln(1 - \alpha_i Y_i)\right]}_{-\phi_i^*(-\alpha_i)} - \frac{\lambda}{2}\|\mathrm{trunc}(v, \sigma/\lambda)\|_2^2$$

$$\text{s.t.}\quad v = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i X_i; \qquad w = \mathrm{trunc}(v, \sigma/\lambda)$$

where

$$\mathrm{trunc}(u, \delta)_j = \begin{cases} u_j - \delta & \text{if } u_j > \delta \\ 0 & \text{if } |u_j| \le \delta \\ u_j + \delta & \text{if } u_j < -\delta \end{cases}$$
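The trunc operator is elementwise soft-thresholding; a one-line NumPy version (the function name mirrors the slide):

```python
import numpy as np

# Elementwise soft-thresholding, matching the case definition above:
# shrink each coordinate toward zero by delta; zero out entries within delta.

def trunc(u, delta):
    return np.sign(u) * np.maximum(np.abs(u) - delta, 0.0)
```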
Proximal-SDCA for L1-L2 Regularization
Algorithm:

Keep dual α and v = (λn)⁻¹ ∑_i α_i X_i.
Randomly pick i.
Find Δ_i by approximately maximizing

$$-\phi_i^*(\alpha_i + \Delta_i) - \mathrm{trunc}(v, \sigma/\lambda)^\top X_i \Delta_i - \frac{1}{2\lambda n}\|X_i\|_2^2 \Delta_i^2,$$

where $\phi_i^*(\alpha_i + \Delta) = (\alpha_i + \Delta)Y_i \ln((\alpha_i + \Delta)Y_i) + (1 - (\alpha_i + \Delta)Y_i)\ln(1 - (\alpha_i + \Delta)Y_i)$.

Update α = α + Δ_i · e_i and v = v + (λn)⁻¹ Δ_i · X_i.
Let w = trunc(v, σ/λ).
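One coordinate step of the procedure above can be sketched as follows. The grid search is a crude stand-in for the "approximately maximizing" step (a real implementation would solve the 1-D problem by, e.g., scalar Newton), and the helper names, grid resolution, and toy data are illustrative assumptions.

```python
import numpy as np

def trunc(u, delta):
    return np.sign(u) * np.maximum(np.abs(u) - delta, 0.0)

def phi_star(a, y):
    # phi_i^*(alpha) for logistic loss, for t = alpha * y in [0, 1]
    t = a * y
    if t <= 0.0 or t >= 1.0:
        return 0.0          # entropy terms vanish at the endpoints
    return t * np.log(t) + (1.0 - t) * np.log(1.0 - t)

def sdca_step(i, alpha, v, X, Y, lam, sig, n):
    w = trunc(v, sig / lam)
    # candidate Deltas keeping (alpha_i + Delta) * Y_i in [0, 1]; Delta = 0
    # is included so the surrogate (hence the dual) never decreases
    deltas = np.append(np.linspace(0.0, 1.0, 201) * Y[i] - alpha[i], 0.0)
    sq = X[i].dot(X[i])
    obj = [-phi_star(alpha[i] + d, Y[i]) - w.dot(X[i]) * d
           - sq * d * d / (2 * lam * n) for d in deltas]
    d_best = deltas[int(np.argmax(obj))]
    alpha[i] += d_best
    v += (d_best / (lam * n)) * X[i]

# illustrative driver on separable toy data
rng = np.random.default_rng(4)
n, d, lam, sig = 50, 4, 0.1, 0.01
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
Y = np.sign(X.dot(w_true))
alpha = np.zeros(n)
v = np.zeros(d)
for _ in range(20 * n):
    sdca_step(rng.integers(n), alpha, v, X, Y, lam, sig, n)
w = trunc(v, sig / lam)
```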
Convergence rate
The number of iterations needed to achieve ε accuracy:

For (1/γ)-smooth loss:

$$O\left(\left(n + \frac{1}{\gamma\lambda}\right)\log\frac{1}{\epsilon}\right)$$

For L-Lipschitz loss:

$$O\left(n + \frac{L^2}{\lambda\epsilon}\right)$$
Solving L1 with Smooth Loss
Assume we want to solve L1 regularization to accuracy ε with smooth φ_i:

$$\frac{1}{n}\sum_{i=1}^n \phi_i(w) + \sigma\|w\|_1.$$

Apply Prox-SDCA with the extra term 0.5λ‖w‖₂², where λ = O(ε):
number of iterations needed is O(n + 1/ε).

Compare to Dual Averaging SGD (Xiao):
number of iterations needed is O(1/ε²).

Compare to batch FISTA (Nesterov's accelerated proximal gradient):
number of iterations needed is O(n/√ε).

Prox-SDCA wins in the statistically interesting regime: ε > Ω(1/n²).

One can design an accelerated Prox-SDCA procedure that is always superior to FISTA.
T. Zhang Big Data Optimization 23 / 36
![Page 41: Recent Progresses in Stochastic Algorithms for Big Data ...bicmr.pku.edu.cn/conference/opt-2014/slides/Tong-Zhang.pdf · Recent Progresses in Stochastic Algorithms for Big Data Optimization](https://reader033.fdocuments.us/reader033/viewer/2022060212/5f04ef3e7e708231d410700a/html5/thumbnails/41.jpg)
Solving L1 with Smooth Loss
Assume we want to solve L1 regularization to accuracy ε with smoothφi :
1n
n∑i=1
φi(w) + σ‖w‖1.
Apply Prox-SDCA with extra term 0.5λ‖w‖22, where λ = O(ε):
number of iterations needed is O(n + 1/ε).Compare to Dual Averaging SGD (Xiao):
number of iterations needed is O(1/ε2).
Compare to batch FISTA (Nesterov’s accelerated proximal gradient):number of iterations needed is O(n/
√ε).
Prox-SDCA wins in the statistically interesting regime: ε > Ω(1/n2)
Can design an accelerated prox-SDCA procedure always superior toFISTA
T. Zhang Big Data Optimization 23 / 36
Outline
Background:
  big data optimization problem
  1st order stochastic gradient versus batch gradient: pros and cons

Stochastic gradient algorithms with variance reduction:
  algorithm 1: SVRG (Stochastic Variance Reduced Gradient)
  algorithm 2: SDCA (Stochastic Dual Coordinate Ascent)
  algorithm 3: accelerated SDCA (with Nesterov acceleration)

Strategies for distributed computing:
  algorithm 4: DANE (Distributed Approximate NEwton-type method)
  behaves like 2nd order stochastic sampling
T. Zhang Big Data Optimization 24 / 36
Accelerated Prox-SDCA
Solving:

  P(w) := (1/n) ∑_{i=1}^n φ_i(X_i^⊤ w) + λ g(w)

Convergence rate of Prox-SDCA depends on O(1/λ).
Inferior to acceleration when λ is very small, O(1/n); acceleration has an O(1/√λ) dependency.

Inner-Outer Iteration Accelerated Prox-SDCA:
  Pick a suitable κ = Θ(1/n) and β
  For t = 2, 3, ... (outer iter):
    Let g_t(w) = λ g(w) + 0.5 κ ‖w‖₂² − κ w^⊤ y^(t−1)   (κ-strongly convex)
    Let P_t(w) = P(w) − λ g(w) + g_t(w)   (redefine P(·); κ-strongly convex)
    Approximately solve P_t(w) for (w^(t), α^(t)) with Prox-SDCA   (inner iter)
    Let y^(t) = w^(t) + β (w^(t) − w^(t−1))   (acceleration)
T. Zhang Big Data Optimization 25 / 36
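A minimal sketch of the inner-outer structure, with plain gradient descent standing in for the Prox-SDCA inner solver and illustrative values for κ and β (the actual analysis picks these constants more carefully):

```python
import numpy as np

def accelerated_outer_loop(grad_P, d, n, T=50, lr=0.1, inner_iters=100):
    """Sketch of the inner-outer scheme: each outer round approximately
    minimizes P_t(w) = P(w) + 0.5*kappa*||w||^2 - kappa*w'y  (kappa-strongly
    convex), then extrapolates y = w + beta*(w - w_prev)."""
    kappa = 1.0 / n                  # Theta(1/n), as on the slide
    beta = 0.5                       # illustrative momentum constant
    w_prev = np.zeros(d)
    w = np.zeros(d)
    y = np.zeros(d)
    for t in range(T):
        # inner solver: plain gradient descent stands in for Prox-SDCA
        for _ in range(inner_iters):
            w = w - lr * (grad_P(w) + kappa * (w - y))   # gradient of P_t
        y = w + beta * (w - w_prev)                      # extrapolation
        w_prev = w.copy()
    return w
```

The point of the construction: the added κ-strongly-convex term keeps every inner problem in the regime where Prox-SDCA is fast, and the extrapolation step recovers the accelerated O(1/√λ)-type dependency overall.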
Performance Comparisons
Problem           | Algorithm                      | Runtime
------------------|--------------------------------|------------------------------
SVM               | SGD                            | d/(λε)
                  | AGD (Nesterov)                 | dn √(1/(λε))
                  | Acc-Prox-SDCA                  | d(n + min{1/(λε), √(n/(λε))})
Lasso             | SGD and variants               | d/ε²
                  | Stochastic Coordinate Descent  | dn/ε
                  | FISTA                          | dn √(1/ε)
                  | Acc-Prox-SDCA                  | d(n + min{1/ε, √(n/ε)})
Ridge Regression  | Exact                          | d²n + d³
                  | SGD, SDCA                      | d(n + 1/λ)
                  | AGD                            | dn √(1/λ)
                  | Acc-Prox-SDCA                  | d(n + min{1/λ, √(n/λ)})
T. Zhang Big Data Optimization 26 / 36
Summary of Sampling
1st order gradient:

  ∇f(w) = (1/n) ∑_{i=1}^n ∇f_i(w)

Sample the 1st order gradient (stochastic gradient): variance reduction leads to fast linear convergence.

2nd order gradient:

  ∇²f(w) = (1/n) ∑_{i=1}^n ∇²f_i(w)

Sample the 2nd order gradient (stochastic Newton): converges to the exact optimum at a fast rate in the distributed computing setting.
T. Zhang Big Data Optimization 27 / 36
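The 1st-order line above can be made concrete with the SVRG estimator covered earlier in the talk: ∇f_i(w) − ∇f_i(w̃) + ∇f(w̃) is unbiased, and its variance vanishes as w and the snapshot w̃ approach the optimum. A least-squares sketch with illustrative step size and epoch length:

```python
import numpy as np

def svrg_least_squares(X, y, eta=0.02, epochs=50, rng=None):
    """SVRG on f(w) = (1/2n) sum_i (x_i'w - y_i)^2: each epoch computes
    one batch gradient at a snapshot w_tilde, then takes n cheap steps
    using the variance-reduced estimator g_i(w) - g_i(w_tilde) + full_grad."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        w_tilde = w.copy()
        full_grad = X.T @ (X @ w_tilde - y) / n       # batch gradient at snapshot
        for _ in range(n):
            i = rng.integers(n)
            gi = X[i] * (X[i] @ w - y[i])             # stochastic gradient at w
            gi_tilde = X[i] * (X[i] @ w_tilde - y[i]) # same sample at snapshot
            w = w - eta * (gi - gi_tilde + full_grad)
    return w
```

Each epoch costs one batch gradient plus n cheap stochastic steps, yet the iterates converge linearly, unlike plain SGD, whose fixed-variance gradient noise forces a decaying step size.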
Outline
Background:
  big data optimization problem
  1st order stochastic gradient versus batch gradient: pros and cons

Stochastic gradient algorithms with variance reduction:
  algorithm 1: SVRG (Stochastic Variance Reduced Gradient)
  algorithm 2: SDCA (Stochastic Dual Coordinate Ascent)
  algorithm 3: accelerated SDCA (with Nesterov acceleration)

Strategies for distributed computing:
  algorithm 4: DANE (Distributed Approximate NEwton-type method)
  behaves like 2nd order stochastic sampling
T. Zhang Big Data Optimization 28 / 36
Distributed Computing
Assume: data distributed over machines
  m processors
  each has n/m examples

Simple Computational Strategy: One Shot Averaging (OSA)
  run optimization on the m machines separately, obtaining parameters w^(1), ..., w^(m)
  average the parameters: w̄ = m⁻¹ ∑_{i=1}^m w^(i)
T. Zhang Big Data Optimization 29 / 36
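The OSA strategy in a few lines, taking local ridge regression as the per-machine problem so that each local solve is exact (an illustrative sketch, not from the slides):

```python
import numpy as np

def one_shot_average(X_parts, y_parts, lam=0.1):
    """OSA: each machine solves its local ridge regression exactly,
    then the m parameter vectors are averaged once."""
    ws = []
    for Xl, yl in zip(X_parts, y_parts):
        nl, d = Xl.shape
        # local solution of (1/2n_l)||X w - y||^2 + (lam/2)||w||^2
        wl = np.linalg.solve(Xl.T @ Xl / nl + lam * np.eye(d), Xl.T @ yl / nl)
        ws.append(wl)
    return np.mean(ws, axis=0)
```

A single round of communication (the averaging step) is all that is needed, which is what makes OSA attractive despite its limitations discussed next.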
Improvement
OSA Strategy's advantages:
  machines run independently
  simple and computationally efficient; asymptotically good in theory
Disadvantage:
  practically inferior to training all examples on a single machine

Traditional solution in optimization: ADMM

New idea: via 2nd order gradient sampling
  Distributed Approximate NEwton (DANE)
T. Zhang Big Data Optimization 30 / 36
Distribution Scheme
Assume: data distributed over machines with the decomposed problem

  f(w) = ∑_{ℓ=1}^m f^(ℓ)(w).

  m processors
  each f^(ℓ)(w) has n/m randomly partitioned examples
  each machine holds a complete set of parameters
T. Zhang Big Data Optimization 31 / 36
DANE
Start with w̄ obtained by OSA.
Iterate:
  Take w̄ and define

    f̃^(ℓ)(w) = f^(ℓ)(w) − (∇f^(ℓ)(w̄) − ∇f(w̄))^⊤ w

  Each machine solves

    w^(ℓ) = arg min_w f̃^(ℓ)(w)

  independently.
  Take the partial average as the next w̄.

The gradient correction is similar to SVRG.
T. Zhang Big Data Optimization 32 / 36
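For quadratic local losses the DANE subproblem has a closed form, which makes the structure of one round easy to see. A sketch under that assumption (the published DANE also carries a regularization term that this simplified form omits):

```python
import numpy as np

def dane_round(A_list, b_list, w_bar):
    """One DANE round when each machine's loss is the quadratic
    f_l(w) = 0.5*w'A_l w - b_l'w. Minimizing the gradient-corrected
    local objective gives w_l = w_bar - A_l^{-1} (A w_bar - b): a
    Newton-like step preconditioned by the *local* Hessian A_l.
    The local solutions are then averaged."""
    A = np.mean(A_list, axis=0)       # global Hessian
    b = np.mean(b_list, axis=0)
    g = A @ w_bar - b                 # global gradient at w_bar
    ws = [w_bar - np.linalg.solve(Al, g) for Al in A_list]
    return np.mean(ws, axis=0)
```

Each round needs only two communication steps (broadcast w̄ and the global gradient, then average the local solutions), and when the random data splits make each local Hessian close to the global one, the iteration contracts rapidly.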
Relationship to 2nd Order Gradient Sampling
Each f^(ℓ)(w) takes n/m out of the n samples from f(w) = n⁻¹ ∑_i f_i(w).

  f̃^(ℓ)(w) = f^(ℓ)(w) − (∇f^(ℓ)(w̄) − ∇f(w̄))^⊤ w

Expanding f̃^(ℓ)(w) around w̄, we have

  f̃^(ℓ)(w) ≈ f^(ℓ)(w̄) + ∇f^(ℓ)(w̄)^⊤ (w − w̄) + (1/2)(w − w̄)^⊤ ∇²f^(ℓ)(w̄)(w − w̄)
              − (∇f^(ℓ)(w̄) − ∇f(w̄))^⊤ w.

So min_w f̃^(ℓ)(w) is equivalent to the approximate minimization of

  min_w  f^(ℓ)(w̄) + ∇f(w̄)^⊤ (w − w̄) + (1/2)(w − w̄)^⊤ ∇²f^(ℓ)(w̄)(w − w̄)

where ∇²f^(ℓ)(w̄) is a 2nd order gradient sample of ∇²f(w̄).

This leads to fast convergence.
T. Zhang Big Data Optimization 33 / 36
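For quadratics the Taylor expansion above is exact, so minimizing the corrected local objective f̃^(ℓ) coincides with a Newton step that uses the local (sampled) Hessian in place of the global one. A small numerical check with illustrative matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
# one machine's quadratic f_l and the global quadratic f
A_l = np.diag([1.0, 2.0, 3.0]); b_l = rng.normal(size=d)
A   = np.diag([2.0, 2.0, 2.0]); b   = rng.normal(size=d)
w_bar = rng.normal(size=d)

g_l = lambda w: A_l @ w - b_l          # grad f_l
g_global = A @ w_bar - b               # grad f(w_bar)

# minimize the corrected objective f~_l by gradient descent
w = w_bar.copy()
for _ in range(2000):
    w -= 0.1 * (g_l(w) - (g_l(w_bar) - g_global))
# the slide's claim: this equals the sampled-Newton step
w_newton = w_bar - np.linalg.solve(A_l, g_global)
assert np.allclose(w, w_newton, atol=1e-8)
```

Note also that by construction ∇f̃^(ℓ)(w̄) = ∇f(w̄): the correction makes every machine start its local minimization pointed in the globally correct direction.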
Comparisons
[Figure: convergence curves (value vs. iteration t) on three datasets: COV1, ASTRO, and MNIST-47. Legend: DANE, ADMM, OSA, Opt.]
T. Zhang Big Data Optimization 34 / 36
Summary
Optimization in machine learning: sum-over-data structure
  suitable for statistical sampling
Traditional methods: gradient-based batch algorithms
  do not take advantage of this special structure

Recent progress: stochastic optimization (focusing on my own work)
  sampling of the 1st order gradient: variance reduction leads to a fast linear convergence rate
  sampling of the 2nd order gradient: leads to fast distributed algorithms
T. Zhang Big Data Optimization 35 / 36
![Page 65: Recent Progresses in Stochastic Algorithms for Big Data ...bicmr.pku.edu.cn/conference/opt-2014/slides/Tong-Zhang.pdf · Recent Progresses in Stochastic Algorithms for Big Data Optimization](https://reader033.fdocuments.us/reader033/viewer/2022060212/5f04ef3e7e708231d410700a/html5/thumbnails/65.jpg)
Summary
Optimization in machine learning: sum over data structuresuitable for statistical sampling
Traditional methods: gradient based batch algorithmsdo not take advantage of special structure
Recent progress: stochastic optimization (focusing on my own work)
Sampling of 1st order gradientvariance reduction leads to fast linear convergence rate
Sampling of 2nd order gradientleads to fast distributed algorithm
T. Zhang Big Data Optimization 35 / 36
![Page 66: Recent Progresses in Stochastic Algorithms for Big Data ...bicmr.pku.edu.cn/conference/opt-2014/slides/Tong-Zhang.pdf · Recent Progresses in Stochastic Algorithms for Big Data Optimization](https://reader033.fdocuments.us/reader033/viewer/2022060212/5f04ef3e7e708231d410700a/html5/thumbnails/66.jpg)
References
Rie Johnson and T. Zhang. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. NIPS 2013.
Lin Xiao and T. Zhang. A Proximal Stochastic Gradient Method with Progressive Variance Reduction. Tech report arXiv:1403.4699, March 2014.
Shai Shalev-Shwartz and T. Zhang. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. JMLR 14:567-599, 2013.
Shai Shalev-Shwartz and T. Zhang. Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization. Tech report arXiv:1309.2375, September 2013.
Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication-Efficient Distributed Optimization using an Approximate Newton-type Method. ICML 2014.

My friend Lin Xiao, with collaborators, has made further improvements on multiple fronts in this thread of big-data optimization research ...
T. Zhang Big Data Optimization 36 / 36