Stochastic Optimization for Large-scale Learning
Stochastic Optimization for Large-scale Machine Learning
Lijun Zhang
LAMDA group, Nanjing University, China
The 13th Chinese Workshop on Machine Learning and Applications
Outline
1 Introduction: Definition, Advantages
2 Time Reduction: Background, Mixed Gradient Descent
3 Space Reduction: Background, Stochastic Proximal Gradient Descent
4 Conclusion
What is Stochastic Optimization? (I)
Definition 1: A Special Objective [Nemirovski et al., 2009]
min_{w∈W} f(w) = E_ξ[ℓ(w, ξ)] = ∫_Ξ ℓ(w, ξ) dP(ξ)
where ξ is a random variable
It is possible to generate an i.i.d. sample ξ_1, ξ_2, ...
Risk Minimization
min_{w∈W} f(w) = E_{(x,y)}[ℓ(w, (x, y))]
ℓ(·, ·) is a loss function, e.g., the hinge loss ℓ(u, v) = max(0, 1 − uv)
Training data (x_1, y_1), ..., (x_n, y_n) are i.i.d.
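To make the risk-minimization objective concrete, here is a small NumPy sketch (my own illustration, not part of the talk) that evaluates the empirical hinge-loss risk of a linear model on a synthetic i.i.d. sample; the data and all names are made up for the example.

import numpy as np

def hinge_loss(u, v):
    # ℓ(u, v) = max(0, 1 - u*v), applied elementwise
    return np.maximum(0.0, 1.0 - u * v)

def empirical_risk(w, X, y):
    # (1/n) Σ_i ℓ(y_i, x_i^T w): a sample average approximating E[ℓ(w, (x, y))]
    return hinge_loss(y, X @ w).mean()

# hypothetical i.i.d. sample (x_i, y_i) with labels y_i in {-1, +1}
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
w = np.zeros(20)
print(empirical_risk(w, X, y))   # equals 1.0 at w = 0, since every margin is zero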
What is Stochastic Optimization? (II)
Definition 2: A Special Access Model [Hazan and Kale, 2011]
min_{w∈W} f(w)
There exists a stochastic oracle that produces an unbiased gradient o(·):
E[o(w)] ∈ ∂f(w) or E[o(w)] = ∇f(w)
Empirical Risk Minimization
min_{w∈W} f(w) = (1/n) Σ_{i=1}^n ℓ(y_i, x_i^⊤ w)
Sampling (x_t, y_t) from {(x_i, y_i)}_{i=1}^n uniformly at random gives
E[∂ℓ(y_t, x_t^⊤ w)] ⊆ ∂ (1/n) Σ_{i=1}^n ℓ(y_i, x_i^⊤ w)
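As an illustration of such an oracle for the ERM objective (a sketch under my own assumptions: a smooth logistic loss so the gradient is unique, and hypothetical variable names), sampling one index uniformly and returning that example's gradient yields a vector whose expectation over the random index is the full gradient ∇f(w).

import numpy as np

def full_gradient(w, X, y):
    # ∇f(w) = (1/n) Σ_i ∇ℓ(y_i, x_i^T w) for the logistic loss ℓ(y, z) = log(1 + exp(-y z))
    z = X @ w
    coeff = -y / (1.0 + np.exp(y * z))            # dℓ/dz for each example
    return (X * coeff[:, None]).mean(axis=0)

def stochastic_oracle(w, X, y, rng):
    # returns ∇ℓ(y_i, x_i^T w) for a uniformly random i, so E[o(w)] = ∇f(w)
    i = rng.integers(len(y))
    z = X[i] @ w
    return (-y[i] / (1.0 + np.exp(y[i] * z))) * X[i]

Averaging many oracle calls at a fixed w approaches full_gradient(w, X, y), which is exactly the unbiasedness property used throughout the talk.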
Why Stochastic Optimization? (I)
https://infographiclist.files.wordpress.com/2011/09/world-of-data.jpeg
Why Stochastic Optimization? (II)
Empirical Risk Minimization
min_{w∈W⊆R^d} f(w) = (1/n) Σ_{i=1}^n ℓ(y_i, x_i^⊤ w)
n is the number of training examples
d is the dimensionality
Deterministic Optimization—Gradient Descent (GD)
1: for t = 1, 2, ..., T do
2:   w'_{t+1} = w_t − η_t (1/n) Σ_{i=1}^n ∇ℓ(y_i, x_i^⊤ w_t)
3:   w_{t+1} = Π_W(w'_{t+1})
4: end for
The Challenge
Time complexity per iteration: O(nd) + O(poly(d))
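A plain NumPy sketch of this projected GD loop (illustrative only; the logistic loss and the Euclidean-ball domain W = {w : ‖w‖_2 ≤ R} are my assumptions). Each gradient evaluation touches all n examples, which is the O(nd) per-iteration cost noted above.

import numpy as np

def project_l2_ball(w, R=10.0):
    # Π_W for W = {w : ||w||_2 <= R}
    norm = np.linalg.norm(w)
    return w if norm <= R else (R / norm) * w

def gradient_descent(X, y, T=100, eta=0.1):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        z = X @ w
        grad = (X * (-y / (1.0 + np.exp(y * z)))[:, None]).mean(axis=0)   # O(nd) work
        w = project_l2_ball(w - eta * grad)                               # projection step
    return w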
Why Stochastic Optimization? (III)
Empirical Risk Minimization
min_{w∈W⊆R^d} f(w) = (1/n) Σ_{i=1}^n ℓ(y_i, x_i^⊤ w)
Stochastic Optimization—Stochastic Gradient Descent (SGD)
1: for t = 1, 2, ..., T do
2:   Sample a training example (x_t, y_t) randomly
3:   w'_{t+1} = w_t − η_t ∇ℓ(y_t, x_t^⊤ w_t)
4:   w_{t+1} = Π_W(w'_{t+1})
5: end for
The Advantage—Time Reduction
Time complexity per iteration: O(d) + O(poly(d))
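The matching SGD sketch (same hypothetical loss and ball projection as in the GD sketch above): one randomly drawn example per iteration brings the per-iteration cost down to O(d), with the decreasing step size η_t = 1/√t used for convex losses.

import numpy as np

def sgd(X, y, T=10000, R=10.0, rng=None):
    n, d = X.shape
    w = np.zeros(d)
    rng = rng or np.random.default_rng(0)
    for t in range(1, T + 1):
        i = rng.integers(n)                                          # sample (x_t, y_t) uniformly
        grad = (-y[i] / (1.0 + np.exp(y[i] * (X[i] @ w)))) * X[i]    # O(d) work
        w = w - (1.0 / np.sqrt(t)) * grad                            # eta_t = 1/sqrt(t)
        norm = np.linalg.norm(w)                                     # projection onto ||w||_2 <= R
        if norm > R:
            w *= R / norm
    return w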
Why Stochastic Optimization? (IV)
Optimization over Large Matrices
min_{W∈W⊆R^{m×n}} F(W)
Matrix completion, distance metric learning, multi-task learning, multi-class learning
Deterministic Optimization—Gradient Descent (GD)
1: for t = 1, 2, ..., T do
2:   W'_{t+1} = W_t − η_t ∇F(W_t)
3:   W_{t+1} = Π_W(W'_{t+1})
4: end for
The Challenge
Space complexity per iteration: O(mn)
Why Stochastic Optimization? (V)
Optimization over Large Matrices
min_{W∈W⊆R^{m×n}} F(W)
Stochastic Optimization—Stochastic Gradient Descent (SGD)
1: for t = 1, 2, ..., T do
2:   Generate a low-rank stochastic gradient G_t of F(·) at W_t
3:   W'_{t+1} = W_t − η_t G_t
4:   W_{t+1} = Π_W(W'_{t+1})
5: end for
The Advantage—Space Reduction
Space complexity per iteration could be: O((m + n)r)
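To see where the O((m + n)r) figure can come from, a hypothetical bookkeeping sketch (my illustration, not the slide's algorithm): if the stochastic gradient happens to be rank one, G_t = a b^⊤, the iterate can be stored as a factorization U V^⊤ and re-truncated to rank r after each step, so the full m×n matrix is never formed.

import numpy as np

def low_rank_sgd_step(U, V, a, b, eta, r):
    # iterate W_t ≈ U @ V.T with U: m×r, V: n×r; stochastic gradient G_t = a b^T (rank one)
    # W_t - eta*G_t = [U, -eta*a] [V, b]^T has rank at most r + 1
    U_aug = np.hstack([U, -eta * a[:, None]])
    V_aug = np.hstack([V, b[:, None]])
    # re-truncate to rank r with a small SVD: O((m + n)(r + 1)^2) time, O((m + n)r) memory
    Qu, Ru = np.linalg.qr(U_aug)
    Qv, Rv = np.linalg.qr(V_aug)
    Us, s, Vst = np.linalg.svd(Ru @ Rv.T)
    U_new = (Qu @ Us[:, :r]) * s[:r]     # absorb singular values into the left factor
    V_new = Qv @ Vst[:r].T
    return U_new, V_new                  # W_{t+1} ≈ U_new @ V_new.T, rank <= r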
Stochastic Gradient Descent (SGD)
Empirical Risk Minimization
min_{w∈W⊆R^d} f(w) = (1/n) Σ_{i=1}^n ℓ(y_i, x_i^⊤ w)
The Algorithm
1: for t = 1, 2, ..., T do
2:   Sample a training example (x_t, y_t) randomly
3:   w'_{t+1} = w_t − η_t ∇ℓ(y_t, x_t^⊤ w_t)
4:   w_{t+1} = Π_W(w'_{t+1})
5: end for
Advantage vs. Disadvantage
Time complexity per iteration is low: O(d) + O(poly(d))
The iteration complexity is much higher than that of GD
The Problem
Iteration Complexity
The number of iterations T to ensure
f(w_T) − min_{w∈Ω} f(w) ≤ ε

Iteration Complexity of GD and SGD (ε = 10^−6)
        Convex & Smooth          Strongly Convex & Smooth
GD      O(1/√ε) ≈ 10^3           O(√κ log(1/ε)) ≈ 6√κ
SGD     O(1/ε²) ≈ 10^12          O(1/(με)) ≈ 10^6/μ

Total Time Complexity of GD and SGD (ε = 10^−6)
        Convex & Smooth          Strongly Convex & Smooth
GD      O(n/√ε) ≈ 10^3 n         O(n√κ log(1/ε)) ≈ 6n√κ
SGD     O(1/ε²) ≈ 10^12          O(1/(με)) ≈ 10^6/μ

SGD may be slower than GD if ε is very small.
Motivations
Reason for the Slow Convergence Rate
The step size of SGD is a decreasing sequence
η_t = 1/√t for convex functions
η_t = 1/t for strongly convex functions
Reason for the Decreasing Step Size
w'_{t+1} = w_t − η_t ∇ℓ(y_t, x_t^⊤ w_t)
Stochastic gradients introduce a constant error
The Key Idea
Control the variance of stochastic gradients
Choose a fixed step size η
Mixed Gradient Descent (I)
Mixed Gradient of w_t
m(w_t) = ∇ℓ(y_t, x_t^⊤ w_t) − ∇ℓ(y_t, x_t^⊤ w_0) + ∇f(w_0)
where (x_t, y_t) is a random sample, w_0 is an initial solution, and
∇f(w_0) = (1/n) Σ_{i=1}^n ∇ℓ(y_i, x_i^⊤ w_0)
The Properties of Mixed Gradient
It is still an unbiased estimate of the true gradient:
E[m(w_t)] = (1/n) Σ_{i=1}^n ∇ℓ(y_i, x_i^⊤ w_t) = ∇f(w_t)
The variance is controlled by the distance to w_0:
‖∇ℓ(y_t, x_t^⊤ w_t) − ∇ℓ(y_t, x_t^⊤ w_0)‖_2 ≤ L‖w_t − w_0‖_2
Mixed Gradient Descent (II)
The Algorithm [Zhang et al., 2013]
1: Compute the true gradient at w_0: ∇f(w_0) = (1/n) Σ_{i=1}^n ∇ℓ(y_i, x_i^⊤ w_0)
2: for t = 1, 2, ..., T do
3:   Sample a training example (x_t, y_t) randomly
4:   Compute the mixed gradient at w_t: m(w_t) = ∇ℓ(y_t, x_t^⊤ w_t) − ∇ℓ(y_t, x_t^⊤ w_0) + ∇f(w_0)
5:   w'_{t+1} = w_t − η m(w_t)
6:   w_{t+1} = Π_W(w'_{t+1})
7: end for
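A rough NumPy rendering of this loop (a sketch under my own assumptions: logistic loss and an unconstrained domain, so the projection Π_W is dropped). The full gradient is computed once at w_0; every subsequent iteration touches a single example and corrects it with the stored anchor terms, which is what keeps the variance under control.

import numpy as np

def single_grad(w, x, y):
    # ∇ℓ(y, x^T w) for the logistic loss
    return (-y / (1.0 + np.exp(y * (x @ w)))) * x

def mixed_gradient_descent(X, y, w0, T=1000, eta=0.1, rng=None):
    n, _ = X.shape
    rng = rng or np.random.default_rng(0)
    # step 1: one full pass to get ∇f(w0) = (1/n) Σ_i ∇ℓ(y_i, x_i^T w0)
    full_grad_w0 = np.mean([single_grad(w0, X[i], y[i]) for i in range(n)], axis=0)
    w = w0.copy()
    for _ in range(T):
        i = rng.integers(n)                                               # sample (x_t, y_t)
        m = single_grad(w, X[i], y[i]) - single_grad(w0, X[i], y[i]) + full_grad_w0
        w = w - eta * m                                                   # fixed step size η
    return w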
Theoretical Guarantees
Theorem 1 ([Zhang et al., 2013])
Suppose the objective function is smooth and strongly convex. To find an ε-optimal solution, mixed gradient descent needs
        True Gradients        Stochastic Gradients
MGD     O(log(1/ε))           O(κ² log(1/ε))
In contrast, SGD needs O(1/(με)) stochastic gradients.
Extensions
For an unbounded domain, O(κ² log(1/ε)) can be improved to O(κ log(1/ε)) [Johnson and Zhang, 2013]
For smooth and convex functions, O(log(1/ε)) true gradients and O(1/ε) stochastic gradients are needed [Mahdavi et al., 2013]
Experimental Results (I)
Reuters Corpus Volume I (RCV1) data set
The optimization error
[Figure: training loss minus optimum (log scale) vs. number of gradients, comparing EMGD and SGD]
Experimental Results (II)
Reuters Corpus Volume I (RCV1) data set
The variance of mixed gradient
[Figure: variance of the gradient estimate (log scale) vs. number of gradients, comparing EMGD and SGD]
Stochastic Gradient Descent (SGD)
Nuclear Norm Regularization (Composite Optimization)
min_{W∈R^{m×n}} F(W) = f(W) + λ‖W‖_*
The optimal solution W_* is low-rank
The Algorithm [Avron et al., 2012]
1: for t = 1, 2, ..., T do
2:   Generate a low-rank stochastic gradient G_t of F(·) at W_t
3:   W'_{t+1} = W_t − η_t G_t
4:   W_{t+1} = Π_r(W'_{t+1})  (truncate small singular values)
5: end for
Efficient Implementation
W_{t+1} = Π_r(W_t − η_t G_t) can be solved by incremental SVD with O((m + n)r) space and O((m + n)r²) time
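For intuition, a simplified dense sketch of the truncation step Π_r (my illustration; the actual implementation keeps W_t in factored form and updates the SVD incrementally instead of recomputing it from the full matrix):

import numpy as np

def truncated_step(W, G, eta, r):
    # Π_r(W - eta*G): keep only the r largest singular triplets (hard truncation)
    U, s, Vt = np.linalg.svd(W - eta * G, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]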
Advantage vs. Disadvantage
Space complexity per iteration is low: O((m + n)r)
There is no theoretical guarantee, due to the truncation
Stochastic Proximal Gradient Descent (SPGD)
Nuclear Norm Regularization
min_{W∈R^{m×n}} F(W) = f(W) + λ‖W‖_*
The Algorithm [Zhang et al., 2015b]
1: for t = 1, 2, ..., T do
2:   Generate a low-rank stochastic gradient G_t of f(·) at W_t
3:   W_{t+1} = argmin_{W∈R^{m×n}} (1/2)‖W − (W_t − η_t G_t)‖_F² + η_t λ‖W‖_*
4: end for
5: return W_{T+1}
Efficient Implementation
The above problem can also be solved by incremental SVD with O((m + n)r) space and O((m + n)r²) time
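The inner argmin is the proximal operator of the nuclear norm, which soft-thresholds the singular values. A dense sketch for clarity (again my illustration; an O((m + n)r) implementation would operate on a low-rank factorization rather than the full matrix):

import numpy as np

def prox_nuclear(A, tau):
    # argmin_W (1/2)||W - A||_F^2 + tau*||W||_*  == singular value soft-thresholding
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    keep = s_shrunk > 0
    return (U[:, keep] * s_shrunk[keep]) @ Vt[keep]

def spgd_step(W, G, eta, lam):
    # one SPGD update: W_{t+1} = prox_{eta*lam}(W_t - eta*G_t)
    return prox_nuclear(W - eta * G, eta * lam)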
Advantages
Space complexity per iteration is still low: O((m + n)r)
It is supported by solid theoretical guarantees
Theoretical Guarantees
Theorem 2 (General convex functions [Zhang et al., 2015b])
Assume E[‖G_t‖_F²] is upper bounded. Setting η_t = 1/√T, we have
E[F(W_T) − F(W_*)] ≤ O(log T / √T)
where W_* is the optimal solution.
Theorem 3 (Strongly convex functions [Zhang et al., 2015b])
Suppose f(·) is strongly convex and E[‖G_t‖_F²] is upper bounded. Setting η_t = 1/(μt), we have
E[F(W_T) − F(W_*)] ≤ O(log T / T)
where W_* is the optimal solution.
Application to Kernel PCA
Traditional Algorithm of Kernel PCA
1 Construct the kernel matrix K ∈ R^{n×n} with K_ij = κ(x_i, x_j)
2 Calculate the top eigenvectors and eigenvalues of K
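For reference, the traditional route as a short sketch (an RBF kernel is my assumed κ, and centering of K is omitted for brevity): the n×n kernel matrix is materialized explicitly, which leads directly to the memory challenge below.

import numpy as np

def kernel_pca_topk(X, k, gamma=1.0):
    # K_ij = κ(x_i, x_j) with the RBF kernel; storing K costs O(n^2)
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
    # top-k eigenvalues/eigenvectors of the symmetric matrix K
    vals, vecs = np.linalg.eigh(K)                 # returned in ascending order
    return vals[::-1][:k], vecs[:, ::-1][:, :k]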
The Challenge
Space complexity is O(n²)
The Question
Is it possible to find the top eigensystem of K without constructing K explicitly?
Yes, by SPGD.
Application to Kernel PCA (I)
Nuclear Norm Regularized Least Squares
min_{W∈R^{n×n}} (1/2)‖W − K‖_F² + λ‖W‖_*
The top eigensystem of K can be recovered from the optimal solution W_*
Proximal Gradient Descent
1: for t = 1, 2, ..., T do
2:   Calculate the gradient G_t = W_t − K
3:   W_{t+1} = argmin_{W∈R^{n×n}} (1/2)‖W − (W_t − η_t G_t)‖_F² + η_t λ‖W‖_*
4: end for
Application to Kernel PCA (II)
Nuclear Norm Regularized Least Squares
min_{W∈R^{n×n}} (1/2)‖W − K‖_F² + λ‖W‖_*
Stochastic Proximal Gradient Descent [Zhang et al., 2015a]
1: for t = 1, 2, ..., T do
2:   Calculate a low-rank stochastic gradient G_t = W_t − ξ
3:   W_{t+1} = argmin_{W∈R^{n×n}} (1/2)‖W − (W_t − η_t G_t)‖_F² + η_t λ‖W‖_*
4: end for
ξ is low-rank and E[ξ] = K
Random Fourier features [Rahimi and Recht, 2008]
Constructing columns/rows of K randomly
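One concrete way to build the low-rank unbiased estimate ξ is sketched below with random Fourier features for a Gaussian kernel κ(x, x') = exp(−γ‖x − x'‖²) (an illustration; D, γ, and the names are mine). Since E[z(x)^⊤ z(x')] = κ(x, x') over the random feature draw, ξ = Z Z^⊤ satisfies E[ξ] = K and has rank at most D, and it never has to be stored as an n×n matrix.

import numpy as np

def rff_factor(X, D=100, gamma=1.0, rng=None):
    # random Fourier features z(x) = sqrt(2/D) * cos(Omega^T x + b),
    # with Omega ~ N(0, 2*gamma*I) and b ~ Uniform[0, 2*pi],
    # so that E[z(x) . z(x')] = exp(-gamma * ||x - x'||^2)
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    Omega = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    Z = np.sqrt(2.0 / D) * np.cos(X @ Omega + b)
    return Z            # ξ = Z @ Z.T is a rank-<=D, unbiased estimate of K (never formed)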
Experimental Results (I)
The Magic data set
The rank of each iterate
[Figure: Rank(W_t) vs. iteration t, for λ ∈ {10, 100} and k ∈ {5, 50}]
Experimental Results (II)
The Magic data set
The optimization error
[Figure: ‖W_t − W_*‖_F²/n² vs. iteration t, for λ ∈ {10, 100} and k ∈ {5, 50}]
Conclusion and Future Work
Time Reduction by Stochastic Optimization
A novel algorithm named Mixed Gradient Descent (MGD) is proposed to reduce the iteration complexity
Future work: extend MGD to other scenarios, such as the distributed setting and non-convex functions
Space Reduction by Stochastic Optimization
A simple algorithm based on Stochastic Proximal Gradient Descent (SPGD) is proposed to reduce the space complexity
Future work: apply SPGD to more applications, such as matrix completion and multi-task learning
Special Issue on “Multi-instance Learning in Pattern Recognition and Vision”
URL: http://www.journals.elsevier.com/pattern-recognition/call-for-papers/special-issue-on-multiple-instance-learning-in-pattern-recog
Dates:
First submission date: Mar. 1, 2016
Final paper notification: Dec. 1, 2016
Guest editors:
Jianxin Wu, Nanjing University
Xiang Bai, Huazhong University of Science and Technology
Marco Loog, Delft University of Technology
Fabio Roli, University of Cagliari
Zhi-Hua Zhou, Nanjing University
Reference I
Avron, H., Kale, S., Kasiviswanathan, S., and Sindhwani, V. (2012). Efficient and practical stochastic subgradient descent for nuclear norm regularization. In Proceedings of the 29th International Conference on Machine Learning, pages 1231–1238.
Hazan, E. and Kale, S. (2011). Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory, pages 421–436.
Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323.
Mahdavi, M., Zhang, L., and Jin, R. (2013). Mixed optimization for smooth functions. In Advances in Neural Information Processing Systems 26, pages 674–682.
Reference II
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609.
Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, pages 1177–1184.
Zhang, L., Mahdavi, M., and Jin, R. (2013). Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems 26, pages 980–988.
Zhang, L., Yang, T., Jin, R., and Zhou, Z.-H. (2015a). Stochastic optimization for kernel PCA. Submitted.
Zhang, L., Yang, T., Jin, R., and Zhou, Z.-H. (2015b). Stochastic proximal gradient descent for nuclear norm regularization. ArXiv e-prints, arXiv:1511.01664.