Stochastic Optimization for Large-scale Machine Learning · 2020-07-15
Introduction Time Reduction Space Reduction Conclusion
Stochastic Optimization for Large-scale Machine Learning
Lijun Zhang
LAMDA group, Nanjing University, China
The 13th Chinese Workshop on Machine Learning and Applications
http://cs.nju.edu.cn/zlj Stochastic Optimization
Learning And Mining from DatA (LAMDA), Nanjing University
Outline
1. Introduction: Definition, Advantages
2. Time Reduction: Background, Mixed Gradient Descent
3. Space Reduction: Background, Stochastic Proximal Gradient Descent
4. Conclusion
What is Stochastic Optimization? (I)
Definition 1: A Special Objective [Nemirovski et al., 2009]

    min_{w ∈ W} f(w) = E_ξ[ℓ(w, ξ)] = ∫_Ξ ℓ(w, ξ) dP(ξ)

where ξ is a random variable, and it is possible to generate an i.i.d. sample ξ₁, ξ₂, …

Risk Minimization

    min_{w ∈ W} f(w) = E_{(x,y)}[ℓ(w, (x, y))]

ℓ(·, ·) is a loss function, e.g., the hinge loss ℓ(u, v) = max(0, 1 − uv)
Training data (x₁, y₁), …, (x_n, y_n) are i.i.d.
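The hinge loss on the slide can be sketched in a few lines (a minimal illustration, not from the slides; here u is the true label and v the prediction score):

```python
# Hinge loss ℓ(u, v) = max(0, 1 − u·v): zero when the prediction is on the
# correct side of the margin, linear in the violation otherwise.
def hinge_loss(u, v):
    return max(0.0, 1.0 - u * v)

print(hinge_loss(+1, 2.0))   # correct side of the margin -> 0.0
print(hinge_loss(+1, 0.3))   # inside the margin -> 0.7
print(hinge_loss(-1, 0.3))   # wrong side -> 1.3
```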
What is Stochastic Optimization? (II)
Definition 2: A Special Access Model [Hazan and Kale, 2011]

    min_{w ∈ W} f(w)

There exists a stochastic oracle o(·) that produces an unbiased gradient:

    E[o(w)] ∈ ∂f(w)  or  E[o(w)] = ∇f(w)

Empirical Risk Minimization

    min_{w ∈ W} f(w) = (1/n) ∑_{i=1}^{n} ℓ(y_i, x_i^⊤ w)

Sampling (x_t, y_t) from {(x_i, y_i)}_{i=1}^{n} uniformly at random gives

    E[∂ℓ(y_t, x_t^⊤ w)] ⊆ ∂( (1/n) ∑_{i=1}^{n} ℓ(y_i, x_i^⊤ w) )
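The oracle property is easy to verify numerically. A minimal numpy sketch (the squared loss is an assumption for illustration, so the subgradient is an ordinary gradient): averaging the single-example gradient over all n equally likely choices recovers the full gradient of f(w) = (1/n) ∑ᵢ (xᵢ⊤w − yᵢ)²/2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
w = rng.normal(size=d)

def full_grad(w):
    return X.T @ (X @ w - y) / n           # ∇f(w): an O(nd) computation

def oracle(w, i):
    return X[i] * (X[i] @ w - y[i])        # gradient of the i-th loss: O(d)

mean_oracle = np.mean([oracle(w, i) for i in range(n)], axis=0)
print(np.allclose(mean_oracle, full_grad(w)))   # True: E[o(w)] = ∇f(w)
```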
Why Stochastic Optimization? (I)
https://infographiclist.files.wordpress.com/2011/09/world-of-data.jpeg
Why Stochastic Optimization? (II)
Empirical Risk Minimization

    min_{w ∈ W ⊆ R^d} f(w) = (1/n) ∑_{i=1}^{n} ℓ(y_i, x_i^⊤ w)

n is the number of training examples; d is the dimensionality.

Deterministic Optimization: Gradient Descent (GD)

1: for t = 1, 2, …, T do
2:   w′_{t+1} = w_t − η_t ( (1/n) ∑_{i=1}^{n} ∇ℓ(y_i, x_i^⊤ w_t) )
3:   w_{t+1} = Π_W(w′_{t+1})
4: end for

The Challenge

Time complexity per iteration: O(nd) + O(poly(d))
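A minimal numpy sketch of projected GD (the squared loss and the Euclidean-ball domain are assumptions for illustration; the slides use a generic loss ℓ and domain W). Each iteration touches all n examples, which is exactly the O(nd) bottleneck:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, R = 200, 10, 10.0
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def project(w):                             # Π_W: Euclidean ball of radius R
    norm = np.linalg.norm(w)
    return w if norm <= R else w * (R / norm)

w = np.zeros(d)
for t in range(100):
    grad = X.T @ (X @ w - y) / n            # full gradient: one O(nd) pass
    w = project(w - 0.1 * grad)

print(np.mean((X @ w - y) ** 2))            # small training error
```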
Why Stochastic Optimization? (III)
Empirical Risk Minimization

    min_{w ∈ W ⊆ R^d} f(w) = (1/n) ∑_{i=1}^{n} ℓ(y_i, x_i^⊤ w)

Stochastic Optimization: Stochastic Gradient Descent (SGD)

1: for t = 1, 2, …, T do
2:   Sample a training example (x_t, y_t) randomly
3:   w′_{t+1} = w_t − η_t ∇ℓ(y_t, x_t^⊤ w_t)
4:   w_{t+1} = Π_W(w′_{t+1})
5: end for

The Advantage: Time Reduction

Time complexity per iteration: O(d) + O(poly(d))
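The same problem solved by SGD can be sketched as follows (squared loss, step-size constant, and the omitted projection are assumptions for illustration). Each step uses one random example, so the per-iteration cost is O(d) rather than O(nd):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)
for t in range(1, 4001):
    i = rng.integers(n)                     # sample a training example at random
    grad = X[i] * (X[i] @ w - y[i])         # stochastic gradient: O(d) work
    w -= grad / (20.0 * np.sqrt(t))         # decreasing step size η_t ∝ 1/√t
    # (the projection Π_W is omitted here: W = R^d)

print(np.mean((X @ w - y) ** 2))            # approaches the optimum, slowly
```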
Why Stochastic Optimization? (IV)
Optimization over Large Matrices

    min_{W ∈ W ⊆ R^{m×n}} F(W)

Examples: matrix completion, distance metric learning, multi-task learning, multi-class learning

Deterministic Optimization: Gradient Descent (GD)

1: for t = 1, 2, …, T do
2:   W′_{t+1} = W_t − η_t ∇F(W_t)
3:   W_{t+1} = Π_W(W′_{t+1})
4: end for

The Challenge

Space complexity per iteration: O(mn)
Why Stochastic Optimization? (V)
Optimization over Large Matrices

    min_{W ∈ W ⊆ R^{m×n}} F(W)

Stochastic Optimization: Stochastic Gradient Descent (SGD)

1: for t = 1, 2, …, T do
2:   Generate a low-rank stochastic gradient G_t of F(·) at W_t
3:   W′_{t+1} = W_t − η_t G_t
4:   W_{t+1} = Π_W(W′_{t+1})
5: end for

The Advantage: Space Reduction

Space complexity per iteration could be O((m + n)r), where r is the rank of G_t
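Where the O((m + n)r) figure comes from can be sketched directly (an assumption about the mechanism: a rank-r stochastic gradient G = U V⊤ is kept in factored form and never materialized as an m×n array; whether the projection Π_W preserves this form depends on W, hence "could be" on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 1000, 800, 3

# G = U @ V.T is never formed explicitly.
U, V = rng.normal(size=(m, r)), rng.normal(size=(n, r))
dense_floats = m * n
factored_floats = (m + n) * r
print(dense_floats, factored_floats)       # 800000 5400

# Applying G to a vector also never materializes the m×n matrix:
x = rng.normal(size=n)
Gx = U @ (V.T @ x)                         # O((m + n)r) time and space
print(np.allclose(Gx, (U @ V.T) @ x))      # True: same result as the dense product
```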
Stochastic Gradient Descent (SGD)
Empirical Risk Minimization

    min_{w ∈ W ⊆ R^d} f(w) = (1/n) ∑_{i=1}^{n} ℓ(y_i, x_i^⊤ w)

The Algorithm

1: for t = 1, 2, …, T do
2:   Sample a training example (x_t, y_t) randomly
3:   w′_{t+1} = w_t − η_t ∇ℓ(y_t, x_t^⊤ w_t)
4:   w_{t+1} = Π_W(w′_{t+1})
5: end for

Advantage vs. Disadvantage

The time complexity per iteration is low: O(d) + O(poly(d))
The iteration complexity is much higher than that of GD
The Problem
Iteration Complexity

The number of iterations T needed to ensure

    f(w_T) − min_{w ∈ W} f(w) ≤ ε

Iteration Complexity of GD and SGD

|     | Convex & Smooth | Strongly Convex & Smooth |
|-----|-----------------|--------------------------|
| GD  | O(1/√ε)         | O(√κ · log(1/ε))         |
| SGD | O(1/ε²)         | O(1/(με))                |

Total Time Complexity of GD and SGD

|     | Convex & Smooth | Strongly Convex & Smooth |
|-----|-----------------|--------------------------|
| GD  | O(n/√ε)         | O(n√κ · log(1/ε))        |
| SGD | O(1/ε²)         | O(1/(με))                |
The Problem (ε = 10⁻⁶)

Iteration Complexity of GD and SGD, instantiated at ε = 10⁻⁶

|     | Convex & Smooth  | Strongly Convex & Smooth  |
|-----|------------------|---------------------------|
| GD  | O(1/√ε) ≈ 10³    | O(√κ · log(1/ε)) ≈ 6√κ    |
| SGD | O(1/ε²) ≈ 10¹²   | O(1/(με)) ≈ 10⁶/μ         |

Total Time Complexity of GD and SGD, instantiated at ε = 10⁻⁶

|     | Convex & Smooth  | Strongly Convex & Smooth  |
|-----|------------------|---------------------------|
| GD  | O(n/√ε) ≈ 10³n   | O(n√κ · log(1/ε)) ≈ 6n√κ  |
| SGD | O(1/ε²) ≈ 10¹²   | O(1/(με)) ≈ 10⁶/μ         |

SGD may be slower than GD if ε is very small.
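The ε = 10⁻⁶ numbers above follow by direct substitution (constants inside O(·) ignored; the logarithm is base 10, which matches the slide's factor of 6 in 6√κ):

```python
import math

eps = 1e-6
print(round(1 / math.sqrt(eps)))    # 1000: GD iterations in the convex case (10³)
print(f"{1 / eps**2:.0e}")          # 1e+12: SGD iterations in the convex case
print(round(math.log10(1 / eps)))   # 6: the log(1/ε) factor in 6√κ and 6n√κ
```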
Motivations
Reason for the Slow Convergence Rate

The step size of SGD is a decreasing sequence:
η_t = 1/√t for convex functions
η_t = 1/t for strongly convex functions

Reason for the Decreasing Step Size

    w′_{t+1} = w_t − η_t ∇ℓ(y_t, x_t^⊤ w_t)

Stochastic gradients introduce an error of constant variance, so the step size must shrink for the iterates to converge.

The Key Idea

Control the variance of the stochastic gradients
Choose a fixed step size η
Mixed Gradient Descent (I)
Mixed Gradient of w_t

    m(w_t) = ∇ℓ(y_t, x_t^⊤ w_t) − ∇ℓ(y_t, x_t^⊤ w_0) + ∇f(w_0)

where (x_t, y_t) is a random sample, w_0 is an initial solution, and

    ∇f(w_0) = (1/n) ∑_{i=1}^{n} ∇ℓ(y_i, x_i^⊤ w_0)

The Properties of the Mixed Gradient

It is still an unbiased estimate of the true gradient:

    E[m(w_t)] = (1/n) ∑_{i=1}^{n} ∇ℓ(y_i, x_i^⊤ w_t) = ∇f(w_t)

Its variance is controlled by the distance to w_0, since by smoothness

    ‖∇ℓ(y_t, x_t^⊤ w_t) − ∇ℓ(y_t, x_t^⊤ w_0)‖₂ ≤ L‖w_t − w_0‖₂
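The unbiasedness claim can be checked numerically (squared loss assumed for illustration): averaging m(w) over all n possible samples gives ∇f(w) exactly, because the two per-example terms average to ∇f(w) − ∇f(w₀), and the shift by ∇f(w₀) cancels the second one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 6
X, y = rng.normal(size=(n, d)), rng.normal(size=n)

def grad_i(w, i):                          # ∇ℓ(y_i, x_i^⊤ w) for the squared loss
    return X[i] * (X[i] @ w - y[i])

def full_grad(w):
    return X.T @ (X @ w - y) / n

w0 = rng.normal(size=d)                    # initial solution
w  = rng.normal(size=d)                    # current iterate
gf0 = full_grad(w0)                        # ∇f(w_0), computed once

mixed = [grad_i(w, i) - grad_i(w0, i) + gf0 for i in range(n)]
print(np.allclose(np.mean(mixed, axis=0), full_grad(w)))   # True
```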
Mixed Gradient Descent (II)
The Algorithm [Zhang et al., 2013]

1: Compute the true gradient at w_0: ∇f(w_0) = (1/n) ∑_{i=1}^{n} ∇ℓ(y_i, x_i^⊤ w_0)
2: for t = 1, 2, …, T do
3:   Sample a training example (x_t, y_t) randomly
4:   Compute the mixed gradient m(w_t) = ∇ℓ(y_t, x_t^⊤ w_t) − ∇ℓ(y_t, x_t^⊤ w_0) + ∇f(w_0)
5:   w′_{t+1} = w_t − η m(w_t)
6:   w_{t+1} = Π_W(w′_{t+1})
7: end for
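A minimal numpy sketch of the algorithm above (the squared loss is an assumption; the slides use a generic ℓ). The pseudocode describes one stage; the O(log 1/ε) true gradients in the theorem that follows come from repeating stages with the anchor w₀ reset to the current iterate, which is the epoch-based variant sketched here:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

def grad_i(w, i):                          # ∇ℓ(y_i, x_i^⊤ w) for the squared loss
    return X[i] * (X[i] @ w - y[i])

def full_grad(w):                          # ∇f(w): one full pass over the data
    return X.T @ (X @ w - y) / n

w = np.zeros(d)
eta, inner, stages = 0.01, 1000, 10
for s in range(stages):
    w0, gf0 = w.copy(), full_grad(w)       # step 1: one true gradient per stage
    for t in range(inner):
        i = rng.integers(n)                          # step 3: sample an example
        m = grad_i(w, i) - grad_i(w0, i) + gf0       # step 4: mixed gradient
        w = w - eta * m                              # step 5: fixed step size η
        # step 6 (projection Π_W) omitted: W = R^d here

print(np.linalg.norm(w - w_true))          # small: the iterate is close to w_true
```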
Theoretical Guarantees
Theorem 1 ([Zhang et al., 2013])

Suppose the objective function is smooth and strongly convex. To find an ε-optimal solution, mixed gradient descent needs

|     | True Gradients | Stochastic Gradients |
|-----|----------------|----------------------|
| MGD | O(log(1/ε))    | O(κ² · log(1/ε))     |

In contrast, SGD needs O(1/(με)) stochastic gradients.

Extensions

For an unbounded domain, O(κ² log(1/ε)) can be improved to O(κ log(1/ε)) [Johnson and Zhang, 2013]
For smooth and convex functions, O(log(1/ε)) true gradients and O(1/ε) stochastic gradients are needed [Mahdavi et al., 2013]
Experimental Results (I)
Reuters Corpus Volume I (RCV1) data set: the optimization error.

[Figure: training loss minus optimum (log scale, 10⁻¹⁰ to 10⁻⁵) versus number of gradients (0 to 200), comparing EMGD and SGD]
Introduction Time Reduction Space Reduction Conclusion Background Mixed Gradient Descent
Experimental Results (II)
Reuters Corpus Volume I (RCV1) data set
The variance of mixed gradient
[Figure: variance of the mixed gradient (log scale) versus number of gradients, comparing EMGD and SGD]
Introduction Time Reduction Space Reduction Conclusion Background Stochastic Proximal Gradient Descent
Outline
1 Introduction: Definition; Advantages
2 Time Reduction: Background; Mixed Gradient Descent
3 Space Reduction: Background; Stochastic Proximal Gradient Descent
4 Conclusion
Introduction Time Reduction Space Reduction Conclusion Background Stochastic Proximal Gradient Descent
Stochastic Gradient Descent (SGD)

Nuclear Norm Regularization (Composite Optimization)

    min_{W ∈ R^{m×n}} F(W) = f(W) + λ‖W‖_*

The optimal solution W_* is low-rank.

The Algorithm [Avron et al., 2012]
1: for t = 1, 2, ..., T do
2:   Generate a low-rank stochastic gradient G_t of F(·) at W_t
3:   W'_{t+1} = W_t − η_t G_t
4:   W_{t+1} = Π_r(W'_{t+1})   (truncate small singular values)
5: end for

Efficient Implementation
W_{t+1} = Π_r(W_t − η_t G_t) can be computed by incremental SVD with O((m + n)r) space and O((m + n)r²) time.
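The truncation step Π_r can be sketched with a plain SVD in NumPy (a minimal illustration only; the efficient implementation referred to above maintains an incremental SVD rather than recomputing a full one each iteration):

```python
import numpy as np

def truncate_rank(W, r):
    """Pi_r(W): best rank-r approximation of W, obtained by
    keeping only the r largest singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
```

One SGD step of the algorithm above is then `W = truncate_rank(W - eta * G, r)`.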
Introduction Time Reduction Space Reduction Conclusion Background Stochastic Proximal Gradient Descent
Stochastic Gradient Descent (SGD): Advantage vs. Disadvantage

+ Space complexity per iteration is low: O((m + n)r)
− There is no theoretical guarantee, because of the truncation step
Introduction Time Reduction Space Reduction Conclusion Background Stochastic Proximal Gradient Descent
Outline
1 Introduction: Definition; Advantages
2 Time Reduction: Background; Mixed Gradient Descent
3 Space Reduction: Background; Stochastic Proximal Gradient Descent
4 Conclusion
Introduction Time Reduction Space Reduction Conclusion Background Stochastic Proximal Gradient Descent
Stochastic Proximal Gradient Descent (SPGD)

Nuclear Norm Regularization

    min_{W ∈ R^{m×n}} F(W) = f(W) + λ‖W‖_*

The Algorithm [Zhang et al., 2015b]
1: for t = 1, 2, ..., T do
2:   Generate a low-rank stochastic gradient G_t of f(·) at W_t
3:   W_{t+1} = argmin_{W ∈ R^{m×n}} (1/2)‖W − (W_t − η_t G_t)‖_F² + η_t λ‖W‖_*
4: end for
5: return W_{T+1}

Efficient Implementation
The proximal step above can also be computed by incremental SVD with O((m + n)r) space and O((m + n)r²) time.
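Step 3 is the proximal operator of the nuclear norm, which has a closed form: soft-thresholding of the singular values. A minimal NumPy sketch (dense SVD for clarity; the efficient version above uses incremental SVD):

```python
import numpy as np

def prox_nuclear(A, tau):
    """argmin_W 0.5*||W - A||_F^2 + tau*||W||_* :
    shrink each singular value of A by tau, dropping those that hit zero."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    keep = s > 0
    return U[:, keep] @ np.diag(s[keep]) @ Vt[keep, :]

def spgd_step(W, G, eta, lam):
    # one SPGD update: gradient step on f, then the nuclear-norm prox
    return prox_nuclear(W - eta * G, eta * lam)
```

Because the thresholding zeroes out small singular values, the iterates stay low-rank, which is what keeps the per-iteration space at O((m + n)r).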
Introduction Time Reduction Space Reduction Conclusion Background Stochastic Proximal Gradient Descent
Stochastic Proximal Gradient Descent (SPGD): Advantages

- Space complexity per iteration is still low: O((m + n)r)
- It is supported by solid theoretical guarantees
Introduction Time Reduction Space Reduction Conclusion Background Stochastic Proximal Gradient Descent
Theoretical Guarantees

Theorem 2 (General convex functions [Zhang et al., 2015b])
Assume E[‖G_t‖_F²] is upper bounded. Setting η_t = 1/√T, we have

    E[F(W_T) − F(W_*)] ≤ O(log T / √T)

where W_* is the optimal solution.

Theorem 3 (Strongly convex functions [Zhang et al., 2015b])
Suppose f(·) is µ-strongly convex, and E[‖G_t‖_F²] is upper bounded. Setting η_t = 1/(µt), we have

    E[F(W_T) − F(W_*)] ≤ O(log T / T)

where W_* is the optimal solution.
Introduction Time Reduction Space Reduction Conclusion Background Stochastic Proximal Gradient Descent
Application to Kernel PCA

Traditional Algorithm for Kernel PCA
1. Construct the kernel matrix K ∈ R^{n×n} with K_ij = κ(x_i, x_j)
2. Calculate the top eigenvectors and eigenvalues of K

The Challenge
Space complexity is O(n²).

The Question
Is it possible to find the top eigensystem of K without constructing K explicitly? Yes, by SPGD.
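The traditional route, as a minimal NumPy sketch (illustrative only; `kappa` is any kernel function, and kernel centering is omitted for brevity):

```python
import numpy as np

def kernel_pca_exact(X, kappa, k):
    """Classical kernel PCA core: build the full n x n kernel matrix
    (the O(n^2)-space bottleneck), then take its top-k eigensystem."""
    K = np.array([[kappa(xi, xj) for xj in X] for xi in X])
    vals, vecs = np.linalg.eigh(K)              # ascending eigenvalues
    return vals[::-1][:k], vecs[:, ::-1][:, :k]
```

The explicit K is exactly what the SPGD approach below avoids materializing.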
Introduction Time Reduction Space Reduction Conclusion Background Stochastic Proximal Gradient Descent
Application to Kernel PCA (I)

Nuclear Norm Regularized Least Squares

    min_{W ∈ R^{n×n}} (1/2)‖W − K‖_F² + λ‖W‖_*

The top eigensystem of K can be recovered from the optimal solution W_*.

Proximal Gradient Descent
1: for t = 1, 2, ..., T do
2:   Calculate the gradient G_t = W_t − K
3:   W_{t+1} = argmin_{W ∈ R^{n×n}} (1/2)‖W − (W_t − η_t G_t)‖_F² + η_t λ‖W‖_*
4: end for
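A runnable sketch of this deterministic loop (my own illustration, not the authors' code). For a symmetric matrix the nuclear-norm prox reduces to soft-thresholding the eigenvalues, and since the smooth part is 1-strongly convex the iterates converge linearly to W_* = prox_{λ‖·‖_*}(K), whose leading eigenvectors coincide with those of K:

```python
import numpy as np

def prox_nuclear_sym(A, tau):
    # nuclear-norm prox of a symmetric matrix: soft-threshold eigenvalues
    vals, vecs = np.linalg.eigh(A)
    shrunk = np.sign(vals) * np.maximum(np.abs(vals) - tau, 0.0)
    return (vecs * shrunk) @ vecs.T

def pgd_kernel_pca(K, lam, eta=0.5, T=60):
    """Proximal gradient descent on 0.5*||W - K||_F^2 + lam*||W||_*."""
    W = np.zeros_like(K)
    for _ in range(T):
        G = W - K                         # gradient of the smooth part
        W = prox_nuclear_sym(W - eta * G, eta * lam)
    return W
```

Eigenvectors of W_* whose eigenvalues survive the shrinkage are exactly the top eigenvectors of K, which is how the eigensystem is recovered.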
Introduction Time Reduction Space Reduction Conclusion Background Stochastic Proximal Gradient Descent
Application to Kernel PCA (II)

Nuclear Norm Regularized Least Squares

    min_{W ∈ R^{n×n}} (1/2)‖W − K‖_F² + λ‖W‖_*

Stochastic Proximal Gradient Descent [Zhang et al., 2015a]
1: for t = 1, 2, ..., T do
2:   Calculate a low-rank stochastic gradient G_t = W_t − ξ
3:   W_{t+1} = argmin_{W ∈ R^{n×n}} (1/2)‖W − (W_t − η_t G_t)‖_F² + η_t λ‖W‖_*
4: end for

Here ξ is low-rank with E[ξ] = K, obtained e.g. via
- Random Fourier features [Rahimi and Recht, 2008]
- Constructing columns/rows of K randomly
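For the Gaussian kernel k(x, y) = exp(−‖x − y‖² / (2σ²)), random Fourier features yield such an unbiased low-rank estimate. A sketch under that specific kernel assumption (D is the number of features, so rank(ξ) ≤ D):

```python
import numpy as np

def rff_kernel_estimate(X, sigma=1.0, D=100, rng=None):
    """Low-rank unbiased estimate xi = Z Z^T of the Gaussian kernel matrix
    K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)): sample frequencies from
    the kernel's spectral density N(0, I / sigma^2) [Rahimi and Recht, 2008]."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    W = rng.normal(0.0, 1.0 / sigma, size=(d, D))   # spectral samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)       # random phases
    Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)        # n x D feature map
    return Z @ Z.T                                  # rank <= D, E[xi] = K
```

Because ξ = Z Zᵀ is stored through its n × D factor, the stochastic gradient never requires the full n × n kernel matrix.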
Introduction Time Reduction Space Reduction Conclusion Background Stochastic Proximal Gradient Descent
Experimental Results (I)
The Magic data set
The rank of each iterate
[Figure: rank(W_t) versus iteration t, for λ ∈ {10, 100} and k ∈ {5, 50}]
Introduction Time Reduction Space Reduction Conclusion Background Stochastic Proximal Gradient Descent
Experimental Results (II)
The Magic data set
The optimization error

[Figure: ‖W_t − W_*‖_F² / n² versus iteration t, for λ ∈ {10, 100} and k ∈ {5, 50}, with a 0.05/t reference curve]
Introduction Time Reduction Space Reduction Conclusion
Conclusion and Future Work
Time Reduction by Stochastic Optimization
- A novel algorithm named Mixed Gradient Descent (MGD) is proposed to reduce the iteration complexity
- Future work: extend MGD to other scenarios, such as the distributed setting and non-convex functions

Space Reduction by Stochastic Optimization
- A simple algorithm based on Stochastic Proximal Gradient Descent (SPGD) is proposed to reduce the space complexity
- Future work: apply it to more applications, such as matrix completion and multi-task learning
Introduction Time Reduction Space Reduction Conclusion
Special Issue on "Multi-instance Learning in Pattern Recognition and Vision"

URL: http://www.journals.elsevier.com/pattern-recognition/call-for-papers/special-issue-on-multiple-instance-learning-in-pattern-recog
Dates:
First submission date: Mar. 1, 2016
Final paper notification: Dec. 1, 2016
Guest editors:
Jianxin Wu, Nanjing University
Xiang Bai, Huazhong University of Science and Technology
Marco Loog, Delft University of Technology
Fabio Roli, University of Cagliari
Zhi-Hua Zhou, Nanjing University
Introduction Time Reduction Space Reduction Conclusion
Reference I
Avron, H., Kale, S., Kasiviswanathan, S., and Sindhwani, V. (2012). Efficient and practical stochastic subgradient descent for nuclear norm regularization. In Proceedings of the 29th International Conference on Machine Learning, pages 1231–1238.

Hazan, E. and Kale, S. (2011). Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory, pages 421–436.

Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323.

Mahdavi, M., Zhang, L., and Jin, R. (2013). Mixed optimization for smooth functions. In Advances in Neural Information Processing Systems 26, pages 674–682.
Thanks!
Introduction Time Reduction Space Reduction Conclusion
Reference II
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609.

Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, pages 1177–1184.

Zhang, L., Mahdavi, M., and Jin, R. (2013). Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems 26, pages 980–988.

Zhang, L., Yang, T., Jin, R., and Zhou, Z.-H. (2015a). Stochastic optimization for kernel PCA. Submitted.

Zhang, L., Yang, T., Jin, R., and Zhou, Z.-H. (2015b). Stochastic proximal gradient descent for nuclear norm regularization. ArXiv e-prints, arXiv:1511.01664.