Stochastic Optimization for Large-scale Learning
Stochastic Optimization for Large-scale Machine Learning
Lijun Zhang
LAMDA group, Nanjing University, China
The 13th Chinese Workshop on Machine Learning and Applications
Outline
1 Introduction: Definition, Advantages
2 Time Reduction: Background, Mixed Gradient Descent
3 Space Reduction: Background, Stochastic Proximal Gradient Descent
4 Conclusion
What is Stochastic Optimization? (I)
Definition 1: A Special Objective [Nemirovski et al., 2009]
min_{w∈W} f(w) = E_ξ[ℓ(w, ξ)] = ∫_Ξ ℓ(w, ξ) dP(ξ)
where ξ is a random variable
It is possible to generate an i.i.d. sample ξ_1, ξ_2, ...
Risk Minimization
min_{w∈W} f(w) = E_{(x,y)}[ℓ(w, (x, y))]
ℓ(·, ·) is a loss function, e.g., the hinge loss ℓ(u, v) = max(0, 1 − uv)
Training data (x_1, y_1), ..., (x_n, y_n) are i.i.d.
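To make the risk-minimization objective concrete, here is a small NumPy sketch (my own illustration, not part of the talk) that evaluates the empirical hinge-loss risk of a linear model on a synthetic i.i.d. sample; the data and all names are made up for the example.

import numpy as np

def hinge_loss(u, v):
    # ℓ(u, v) = max(0, 1 - u*v), applied elementwise
    return np.maximum(0.0, 1.0 - u * v)

def empirical_risk(w, X, y):
    # (1/n) Σ_i ℓ(y_i, x_i^T w): a sample average approximating E[ℓ(w, (x, y))]
    return hinge_loss(y, X @ w).mean()

# hypothetical i.i.d. sample (x_i, y_i) with labels y_i in {-1, +1}
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
w = np.zeros(20)
print(empirical_risk(w, X, y))   # equals 1.0 at w = 0, since every margin is zero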
What is Stochastic Optimization? (II)
Definition 2: A Special Access Model [Hazan and Kale, 2011]
min_{w∈W} f(w)
There exists a stochastic oracle that produces an unbiased gradient o(·):
E[o(w)] ∈ ∂f(w) or E[o(w)] = ∇f(w)
Empirical Risk Minimization
min_{w∈W} f(w) = (1/n) Σ_{i=1}^n ℓ(y_i, x_i^⊤ w)
Sampling (x_t, y_t) from {(x_i, y_i)}_{i=1}^n uniformly at random gives
E[∂ℓ(y_t, x_t^⊤ w)] ⊆ ∂ (1/n) Σ_{i=1}^n ℓ(y_i, x_i^⊤ w)
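As an illustration of such an oracle for the ERM objective (a sketch under my own assumptions: a smooth logistic loss so the gradient is unique, and hypothetical variable names), sampling one index uniformly and returning that example's gradient yields a vector whose expectation over the random index is the full gradient ∇f(w).

import numpy as np

def full_gradient(w, X, y):
    # ∇f(w) = (1/n) Σ_i ∇ℓ(y_i, x_i^T w) for the logistic loss ℓ(y, z) = log(1 + exp(-y z))
    z = X @ w
    coeff = -y / (1.0 + np.exp(y * z))            # dℓ/dz for each example
    return (X * coeff[:, None]).mean(axis=0)

def stochastic_oracle(w, X, y, rng):
    # returns ∇ℓ(y_i, x_i^T w) for a uniformly random i, so E[o(w)] = ∇f(w)
    i = rng.integers(len(y))
    z = X[i] @ w
    return (-y[i] / (1.0 + np.exp(y[i] * z))) * X[i]

Averaging many oracle calls at a fixed w approaches full_gradient(w, X, y), which is exactly the unbiasedness property used throughout the talk.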
Why Stochastic Optimization? (I)
https://infographiclist.files.wordpress.com/2011/09/world-of-data.jpeg
Why Stochastic Optimization? (II)
Empirical Risk Minimization
min_{w∈W⊆R^d} f(w) = (1/n) Σ_{i=1}^n ℓ(y_i, x_i^⊤ w)
n is the number of training examples
d is the dimensionality
Deterministic Optimization—Gradient Descent (GD)
1: for t = 1, 2, ..., T do
2:   w'_{t+1} = w_t − η_t (1/n) Σ_{i=1}^n ∇ℓ(y_i, x_i^⊤ w_t)
3:   w_{t+1} = Π_W(w'_{t+1})
4: end for
The Challenge
Time complexity per iteration: O(nd) + O(poly(d))
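A plain NumPy sketch of this projected GD loop (illustrative only; the logistic loss and the Euclidean-ball domain W = {w : ‖w‖_2 ≤ R} are my assumptions). Each gradient evaluation touches all n examples, which is the O(nd) per-iteration cost noted above.

import numpy as np

def project_l2_ball(w, R=10.0):
    # Π_W for W = {w : ||w||_2 <= R}
    norm = np.linalg.norm(w)
    return w if norm <= R else (R / norm) * w

def gradient_descent(X, y, T=100, eta=0.1):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        z = X @ w
        grad = (X * (-y / (1.0 + np.exp(y * z)))[:, None]).mean(axis=0)   # O(nd) work
        w = project_l2_ball(w - eta * grad)                               # projection step
    return w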
Why Stochastic Optimization? (III)
Empirical Risk Minimization
min_{w∈W⊆R^d} f(w) = (1/n) Σ_{i=1}^n ℓ(y_i, x_i^⊤ w)
Stochastic Optimization—Stochastic Gradient Descent (SGD)
1: for t = 1, 2, ..., T do
2:   Sample a training example (x_t, y_t) randomly
3:   w'_{t+1} = w_t − η_t ∇ℓ(y_t, x_t^⊤ w_t)
4:   w_{t+1} = Π_W(w'_{t+1})
5: end for
The Advantage—Time Reduction
Time complexity per iteration: O(d) + O(poly(d))
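The matching SGD sketch (same hypothetical loss and ball projection as in the GD sketch above): one randomly drawn example per iteration brings the per-iteration cost down to O(d), with the decreasing step size η_t = 1/√t used for convex losses.

import numpy as np

def sgd(X, y, T=10000, R=10.0, rng=None):
    n, d = X.shape
    w = np.zeros(d)
    rng = rng or np.random.default_rng(0)
    for t in range(1, T + 1):
        i = rng.integers(n)                                          # sample (x_t, y_t) uniformly
        grad = (-y[i] / (1.0 + np.exp(y[i] * (X[i] @ w)))) * X[i]    # O(d) work
        w = w - (1.0 / np.sqrt(t)) * grad                            # eta_t = 1/sqrt(t)
        norm = np.linalg.norm(w)                                     # projection onto ||w||_2 <= R
        if norm > R:
            w *= R / norm
    return w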
Why Stochastic Optimization? (IV)
Optimization over Large Matrices
min_{W∈W⊆R^{m×n}} F(W)
Matrix completion, distance metric learning, multi-task learning, multi-class learning
Deterministic Optimization—Gradient Descent (GD)
1: for t = 1, 2, ..., T do
2:   W'_{t+1} = W_t − η_t ∇F(W_t)
3:   W_{t+1} = Π_W(W'_{t+1})
4: end for
The Challenge
Space complexity per iteration: O(mn)
Why Stochastic Optimization? (V)
Optimization over Large Matrices
min_{W∈W⊆R^{m×n}} F(W)
Stochastic Optimization—Stochastic Gradient Descent (SGD)
1: for t = 1, 2, ..., T do
2:   Generate a low-rank stochastic gradient G_t of F(·) at W_t
3:   W'_{t+1} = W_t − η_t G_t
4:   W_{t+1} = Π_W(W'_{t+1})
5: end for
The Advantage—Space Reduction
Space complexity per iteration could be: O((m + n)r)
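To see where the O((m + n)r) figure can come from, a hypothetical bookkeeping sketch (my illustration, not the slide's algorithm): if the stochastic gradient happens to be rank one, G_t = a b^⊤, the iterate can be stored as a factorization U V^⊤ and re-truncated to rank r after each step, so the full m×n matrix is never formed.

import numpy as np

def low_rank_sgd_step(U, V, a, b, eta, r):
    # iterate W_t ≈ U @ V.T with U: m×r, V: n×r; stochastic gradient G_t = a b^T (rank one)
    # W_t - eta*G_t = [U, -eta*a] [V, b]^T has rank at most r + 1
    U_aug = np.hstack([U, -eta * a[:, None]])
    V_aug = np.hstack([V, b[:, None]])
    # re-truncate to rank r with a small SVD: O((m + n)(r + 1)^2) time, O((m + n)r) memory
    Qu, Ru = np.linalg.qr(U_aug)
    Qv, Rv = np.linalg.qr(V_aug)
    Us, s, Vst = np.linalg.svd(Ru @ Rv.T)
    U_new = (Qu @ Us[:, :r]) * s[:r]     # absorb singular values into the left factor
    V_new = Qv @ Vst[:r].T
    return U_new, V_new                  # W_{t+1} ≈ U_new @ V_new.T, rank <= r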
Stochastic Gradient Descent (SGD)
Empirical Risk Minimization
min_{w∈W⊆R^d} f(w) = (1/n) Σ_{i=1}^n ℓ(y_i, x_i^⊤ w)
The Algorithm
1: for t = 1, 2, ..., T do
2:   Sample a training example (x_t, y_t) randomly
3:   w'_{t+1} = w_t − η_t ∇ℓ(y_t, x_t^⊤ w_t)
4:   w_{t+1} = Π_W(w'_{t+1})
5: end for
Advantage vs. Disadvantage
Time complexity per iteration is low: O(d) + O(poly(d))
The iteration complexity is much higher than that of GD
The Problem
Iteration Complexity
The number of iterations T to ensure
f(w_T) − min_{w∈Ω} f(w) ≤ ε

Iteration Complexity of GD and SGD (ε = 10^−6)
        Convex & Smooth          Strongly Convex & Smooth
GD      O(1/√ε) ≈ 10^3           O(√κ log(1/ε)) ≈ 6√κ
SGD     O(1/ε²) ≈ 10^12          O(1/(με)) ≈ 10^6/μ

Total Time Complexity of GD and SGD (ε = 10^−6)
        Convex & Smooth          Strongly Convex & Smooth
GD      O(n/√ε) ≈ 10^3 n         O(n√κ log(1/ε)) ≈ 6n√κ
SGD     O(1/ε²) ≈ 10^12          O(1/(με)) ≈ 10^6/μ

SGD may be slower than GD if ε is very small.
Motivations
Reason for the Slow Convergence Rate
The step size of SGD is a decreasing sequence
η_t = 1/√t for convex functions
η_t = 1/t for strongly convex functions
Reason for the Decreasing Step Size
w'_{t+1} = w_t − η_t ∇ℓ(y_t, x_t^⊤ w_t)
Stochastic gradients introduce a constant error
The Key Idea
Control the variance of stochastic gradients
Choose a fixed step size η
Mixed Gradient Descent (I)
Mixed Gradient of w_t
m(w_t) = ∇ℓ(y_t, x_t^⊤ w_t) − ∇ℓ(y_t, x_t^⊤ w_0) + ∇f(w_0)
where (x_t, y_t) is a random sample, w_0 is an initial solution, and
∇f(w_0) = (1/n) Σ_{i=1}^n ∇ℓ(y_i, x_i^⊤ w_0)
The Properties of Mixed Gradient
It is still an unbiased estimate of the true gradient:
E[m(w_t)] = (1/n) Σ_{i=1}^n ∇ℓ(y_i, x_i^⊤ w_t) = ∇f(w_t)
The variance is controlled by the distance to w_0:
‖∇ℓ(y_t, x_t^⊤ w_t) − ∇ℓ(y_t, x_t^⊤ w_0)‖_2 ≤ L‖w_t − w_0‖_2
Mixed Gradient Descent (II)
The Algorithm [Zhang et al., 2013]
1: Compute the true gradient at w_0: ∇f(w_0) = (1/n) Σ_{i=1}^n ∇ℓ(y_i, x_i^⊤ w_0)
2: for t = 1, 2, ..., T do
3:   Sample a training example (x_t, y_t) randomly
4:   Compute the mixed gradient at w_t: m(w_t) = ∇ℓ(y_t, x_t^⊤ w_t) − ∇ℓ(y_t, x_t^⊤ w_0) + ∇f(w_0)
5:   w'_{t+1} = w_t − η m(w_t)
6:   w_{t+1} = Π_W(w'_{t+1})
7: end for
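A rough NumPy rendering of this loop (a sketch under my own assumptions: logistic loss and an unconstrained domain, so the projection Π_W is dropped). The full gradient is computed once at w_0; every subsequent iteration touches a single example and corrects it with the stored anchor terms, which is what keeps the variance under control.

import numpy as np

def single_grad(w, x, y):
    # ∇ℓ(y, x^T w) for the logistic loss
    return (-y / (1.0 + np.exp(y * (x @ w)))) * x

def mixed_gradient_descent(X, y, w0, T=1000, eta=0.1, rng=None):
    n, _ = X.shape
    rng = rng or np.random.default_rng(0)
    # step 1: one full pass to get ∇f(w0) = (1/n) Σ_i ∇ℓ(y_i, x_i^T w0)
    full_grad_w0 = np.mean([single_grad(w0, X[i], y[i]) for i in range(n)], axis=0)
    w = w0.copy()
    for _ in range(T):
        i = rng.integers(n)                                               # sample (x_t, y_t)
        m = single_grad(w, X[i], y[i]) - single_grad(w0, X[i], y[i]) + full_grad_w0
        w = w - eta * m                                                   # fixed step size η
    return w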
Theoretical Guarantees
Theorem 1 ([Zhang et al., 2013])
Suppose the objective function is smooth and strongly convex. To find an ε-optimal solution, mixed gradient descent needs
        True Gradients        Stochastic Gradients
MGD     O(log(1/ε))           O(κ² log(1/ε))
In contrast, SGD needs O(1/(με)) stochastic gradients.
Extensions
For an unbounded domain, O(κ² log(1/ε)) can be improved to O(κ log(1/ε)) [Johnson and Zhang, 2013]
For smooth and convex functions, O(log(1/ε)) true gradients and O(1/ε) stochastic gradients are needed [Mahdavi et al., 2013]
Experimental Results (I)
Reuters Corpus Volume I (RCV1) data set
The optimization error
[Figure: training loss minus optimum (log scale) vs. number of gradients, comparing EMGD and SGD]
Experimental Results (II)
Reuters Corpus Volume I (RCV1) data set
The variance of mixed gradient
[Figure: variance of the gradient estimate (log scale) vs. number of gradients, comparing EMGD and SGD]
Stochastic Gradient Descent (SGD)
Nuclear Norm Regularization (Composite Optimization)
min_{W∈R^{m×n}} F(W) = f(W) + λ‖W‖_*
The optimal solution W_* is low-rank
The Algorithm [Avron et al., 2012]
1: for t = 1, 2, ..., T do
2:   Generate a low-rank stochastic gradient G_t of F(·) at W_t
3:   W'_{t+1} = W_t − η_t G_t
4:   W_{t+1} = Π_r(W'_{t+1})  (truncate small singular values)
5: end for
Efficient Implementation
W_{t+1} = Π_r(W_t − η_t G_t) can be solved by incremental SVD with O((m + n)r) space and O((m + n)r²) time
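For intuition, a simplified dense sketch of the truncation step Π_r (my illustration; the actual implementation keeps W_t in factored form and updates the SVD incrementally instead of recomputing it from the full matrix):

import numpy as np

def truncated_step(W, G, eta, r):
    # Π_r(W - eta*G): keep only the r largest singular triplets (hard truncation)
    U, s, Vt = np.linalg.svd(W - eta * G, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]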
Advantage vs. Disadvantage
Space complexity per iteration is low: O((m + n)r)
There is no theoretical guarantee, due to the truncation
Stochastic Proximal Gradient Descent (SPGD)
Nuclear Norm Regularization
min_{W∈R^{m×n}} F(W) = f(W) + λ‖W‖_*
The Algorithm [Zhang et al., 2015b]
1: for t = 1, 2, ..., T do
2:   Generate a low-rank stochastic gradient G_t of f(·) at W_t
3:   W_{t+1} = argmin_{W∈R^{m×n}} (1/2)‖W − (W_t − η_t G_t)‖_F² + η_t λ‖W‖_*
4: end for
5: return W_{T+1}
Efficient Implementation
The above problem can also be solved by incremental SVD with O((m + n)r) space and O((m + n)r²) time
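The inner argmin is the proximal operator of the nuclear norm, which soft-thresholds the singular values. A dense sketch for clarity (again my illustration; an O((m + n)r) implementation would operate on a low-rank factorization rather than the full matrix):

import numpy as np

def prox_nuclear(A, tau):
    # argmin_W (1/2)||W - A||_F^2 + tau*||W||_*  == singular value soft-thresholding
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    keep = s_shrunk > 0
    return (U[:, keep] * s_shrunk[keep]) @ Vt[keep]

def spgd_step(W, G, eta, lam):
    # one SPGD update: W_{t+1} = prox_{eta*lam}(W_t - eta*G_t)
    return prox_nuclear(W - eta * G, eta * lam)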
Advantages
Space complexity per iteration is still low: O((m + n)r)
It is supported by solid theoretical guarantees
Theoretical Guarantees
Theorem 2 (General convex functions [Zhang et al., 2015b])
Assume E[‖G_t‖_F²] is upper bounded. Setting η_t = 1/√T, we have
E[F(W_T) − F(W_*)] ≤ O(log T / √T)
where W_* is the optimal solution.
Theorem 3 (Strongly convex functions [Zhang et al., 2015b])
Suppose f(·) is strongly convex and E[‖G_t‖_F²] is upper bounded. Setting η_t = 1/(μt), we have
E[F(W_T) − F(W_*)] ≤ O(log T / T)
where W_* is the optimal solution.
Application to Kernel PCA
Traditional Algorithm of Kernel PCA
1 Construct the kernel matrix K ∈ R^{n×n} with K_ij = κ(x_i, x_j)
2 Calculate the top eigenvectors and eigenvalues of K
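For reference, the traditional route as a short sketch (an RBF kernel is my assumed κ, and centering of K is omitted for brevity): the n×n kernel matrix is materialized explicitly, which leads directly to the memory challenge below.

import numpy as np

def kernel_pca_topk(X, k, gamma=1.0):
    # K_ij = κ(x_i, x_j) with the RBF kernel; storing K costs O(n^2)
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
    # top-k eigenvalues/eigenvectors of the symmetric matrix K
    vals, vecs = np.linalg.eigh(K)                 # returned in ascending order
    return vals[::-1][:k], vecs[:, ::-1][:, :k]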
The Challenge
Space complexity is O(n²)
The Question
Is it possible to find the top eigensystem of K without constructing K explicitly?
Yes, by SPGD.
Application to Kernel PCA (I)
Nuclear Norm Regularized Least Squares
min_{W∈R^{n×n}} (1/2)‖W − K‖_F² + λ‖W‖_*
The top eigensystem of K can be recovered from the optimal solution W_*
Proximal Gradient Descent
1: for t = 1, 2, ..., T do
2:   Calculate the gradient G_t = W_t − K
3:   W_{t+1} = argmin_{W∈R^{n×n}} (1/2)‖W − (W_t − η_t G_t)‖_F² + η_t λ‖W‖_*
4: end for
Application to Kernel PCA (II)
Nuclear Norm Regularized Least Squares
min_{W∈R^{n×n}} (1/2)‖W − K‖_F² + λ‖W‖_*
Stochastic Proximal Gradient Descent [Zhang et al., 2015a]
1: for t = 1, 2, ..., T do
2:   Calculate a low-rank stochastic gradient G_t = W_t − ξ
3:   W_{t+1} = argmin_{W∈R^{n×n}} (1/2)‖W − (W_t − η_t G_t)‖_F² + η_t λ‖W‖_*
4: end for
ξ is low-rank and E[ξ] = K
Random Fourier features [Rahimi and Recht, 2008]
Constructing columns/rows of K randomly
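One concrete way to build the low-rank unbiased estimate ξ is sketched below with random Fourier features for a Gaussian kernel κ(x, x') = exp(−γ‖x − x'‖²) (an illustration; D, γ, and the names are mine). Since E[z(x)^⊤ z(x')] = κ(x, x') over the random feature draw, ξ = Z Z^⊤ satisfies E[ξ] = K and has rank at most D, and it never has to be stored as an n×n matrix.

import numpy as np

def rff_factor(X, D=100, gamma=1.0, rng=None):
    # random Fourier features z(x) = sqrt(2/D) * cos(Omega^T x + b),
    # with Omega ~ N(0, 2*gamma*I) and b ~ Uniform[0, 2*pi],
    # so that E[z(x) . z(x')] = exp(-gamma * ||x - x'||^2)
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    Omega = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    Z = np.sqrt(2.0 / D) * np.cos(X @ Omega + b)
    return Z            # ξ = Z @ Z.T is a rank-<=D, unbiased estimate of K (never formed)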
Experimental Results (I)
The Magic data set
The rank of each iterate
[Figure: Rank(W_t) vs. iteration t, for λ ∈ {10, 100} and k ∈ {5, 50}]
Experimental Results (II)
The Magic data set
The optimization error
[Figure: ‖W_t − W_*‖_F²/n² vs. iteration t, for λ ∈ {10, 100} and k ∈ {5, 50}]
Conclusion and Future Work
Time Reduction by Stochastic Optimization
A novel algorithm named Mixed Gradient Descent (MGD) is proposed to reduce the iteration complexity
Future work: extend MGD to other scenarios, such as the distributed setting and non-convex functions
Space Reduction by Stochastic Optimization
A simple algorithm based on Stochastic Proximal Gradient Descent (SPGD) is proposed to reduce the space complexity
Future work: apply SPGD to more applications, such as matrix completion and multi-task learning
Special Issue on “Multi-instance Learning in Pattern Recognition and Vision”
URL: http://www.journals.elsevier.com/pattern-recognition/call-for-papers/special-issue-on-multiple-instance-learning-in-pattern-recog
Dates:
First submission date: Mar. 1, 2016
Final paper notification: Dec. 1, 2016
Guest editors:
Jianxin Wu, Nanjing University
Xiang Bai, Huazhong University of Science and Technology
Marco Loog, Delft University of Technology
Fabio Roli, University of Cagliari
Zhi-Hua Zhou, Nanjing University
Reference I
Avron, H., Kale, S., Kasiviswanathan, S., and Sindhwani, V. (2012). Efficient and practical stochastic subgradient descent for nuclear norm regularization. In Proceedings of the 29th International Conference on Machine Learning, pages 1231–1238.
Hazan, E. and Kale, S. (2011). Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory, pages 421–436.
Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323.
Mahdavi, M., Zhang, L., and Jin, R. (2013). Mixed optimization for smooth functions. In Advances in Neural Information Processing Systems 26, pages 674–682.
Reference II
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609.
Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, pages 1177–1184.
Zhang, L., Mahdavi, M., and Jin, R. (2013). Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems 26, pages 980–988.
Zhang, L., Yang, T., Jin, R., and Zhou, Z.-H. (2015a). Stochastic optimization for kernel PCA. Submitted.
Zhang, L., Yang, T., Jin, R., and Zhou, Z.-H. (2015b). Stochastic proximal gradient descent for nuclear norm regularization. ArXiv e-prints, arXiv:1511.01664.