Peter Richtárik: Randomized Dual Coordinate Ascent with Arbitrary Sampling, 1st UCL Workshop on the Theory of Big Data, London, January 2015


Page 1:

Peter Richtárik

Randomized Dual Coordinate Ascent with Arbitrary Sampling

1st UCL Workshop on the Theory of Big Data, London, January 2015

Page 2:

Coauthors

Zheng Qu (Edinburgh, Mathematics)

Tong Zhang (Rutgers, Statistics; Baidu, Big Data Lab)

Zheng Qu, P.R. and Tong Zhang, Randomized dual coordinate ascent with arbitrary sampling, arXiv:1411.5873, 2014

Page 3:

Part 1: MACHINE LEARNING

Page 4:

Statistical nature of data

Data (e.g., image, text, measurements, …)

Label

Page 5:

Prediction of labels from data

Find

Such that when (data, label) pair is drawn from the distribution

Then

Predicted label ≈ True label

Linear predictor

Page 6:

Measure of Success

We want the expected loss (=risk) to be small:

data label

Page 7:

Finding a Linear Predictor via Empirical Risk Minimization

Draw i.i.d. data (samples) from the distribution

Output predictor which minimizes the empirical risk:
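The empirical-risk formula on this slide did not survive extraction. As a minimal sketch, assuming scalar labels, a squared loss, and a linear predictor w (the names `squared_loss` and `empirical_risk` are illustrative and not taken from the talk), the empirical risk is the average loss over the n drawn samples:

```python
def squared_loss(prediction, label):
    """Assumed loss for illustration; the talk allows any smooth convex loss."""
    return 0.5 * (prediction - label) ** 2


def empirical_risk(w, samples, loss=squared_loss):
    """Average loss of the linear predictor w over (features, label) pairs."""
    total = 0.0
    for features, label in samples:
        prediction = sum(wj * xj for wj, xj in zip(w, features))
        total += loss(prediction, label)
    return total / len(samples)


# Three toy samples generated exactly by w = (1, 2).
samples = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 3.0)]
risk_perfect = empirical_risk([1.0, 2.0], samples)  # the generating predictor
risk_zero = empirical_risk([0.0, 0.0], samples)     # the all-zero predictor
```

Empirical risk minimization then means choosing the w that makes this average as small as possible.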

Page 8:

Part 2: OPTIMIZATION

Page 9:

Primal Problem

d = # features (parameters)

n = # samples

Loss functions: smooth & convex

Regularizer: 1-strongly convex function

Regularization parameter
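The objective on this slide was lost in extraction. A reconstruction consistent with the labels above and with the cited paper (arXiv:1411.5873) might read as follows; the symbols $\phi_i$ (loss functions), $A_i$ (data), $g$ (regularizer) and $\lambda$ (regularization parameter) follow that paper and are assumed notation here:

```latex
\min_{w \in \mathbb{R}^d} \; P(w) := \frac{1}{n} \sum_{i=1}^{n} \phi_i(A_i^\top w) + \lambda\, g(w)
```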

Page 10:

Assumption 1

Loss functions have Lipschitz gradient

Lipschitz constant
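The inequality itself is missing from the transcript. Assuming the notation of arXiv:1411.5873, where the Lipschitz constant is written as $1/\gamma$, the assumption can be sketched as:

```latex
\|\nabla \phi_i(a) - \nabla \phi_i(a')\| \le \frac{1}{\gamma}\, \|a - a'\|
\quad \text{for all } a, a' \in \mathbb{R}^m
```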

Page 11:

Assumption 2

Regularizer is 1-strongly convex

subgradient
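The inequality is missing from the transcript. A standard way to state 1-strong convexity, with $\nabla g(w')$ denoting a subgradient of $g$ at $w'$ (notation assumed from arXiv:1411.5873), is:

```latex
g(w) \ge g(w') + \langle \nabla g(w'),\, w - w' \rangle + \frac{1}{2}\, \|w - w'\|^2
\quad \text{for all } w, w' \in \mathbb{R}^d
```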

Page 12:

Dual Problem

strongly convex

1-smooth & convex
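The dual objective was lost in extraction. Assuming the notation of arXiv:1411.5873, with $\phi_i$ the loss functions, $g$ the regularizer, $\lambda$ the regularization parameter, and $g^*$, $\phi_i^*$ their convex conjugates, a reconstruction might read:

```latex
\max_{\alpha = (\alpha_1, \dots, \alpha_n)} \; D(\alpha) :=
  -\lambda\, g^*\!\left( \frac{1}{\lambda n} \sum_{i=1}^{n} A_i \alpha_i \right)
  - \frac{1}{n} \sum_{i=1}^{n} \phi_i^*(-\alpha_i)
```

In this form the first term inherits 1-smoothness from the 1-strong convexity of $g$, and each $\phi_i^*$ is $\gamma$-strongly convex when $\phi_i$ has a $(1/\gamma)$-Lipschitz gradient, which matches the two labels on the slide.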

Page 13:

Part 3: ALGORITHM

Page 14:

Quartz

Page 15:

Fenchel Duality

Weak duality

Optimality conditions

Page 16:

The Algorithm

Page 17:

Quartz: Bird’s Eye View

STEP 1: PRIMAL UPDATE

STEP 2: DUAL UPDATE

Page 18:

The Algorithm

STEP 1

STEP 2

Convex combination constant
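The update formulas for the two steps are missing from the transcript. The sketch below follows the Quartz updates as described in arXiv:1411.5873, specialized under stated assumptions: ridge regression (squared loss, so gamma = 1, and g(w) = ||w||^2 / 2, so the gradient of g* is the identity), scalar labels, and serial uniform sampling with ESO parameters v_i = ||A_i||^2. The function names and the toy data are illustrative only.

```python
import random


def quartz_ridge(A, y, lam, iters, seed=0):
    """Sketch of Quartz for ridge regression with serial uniform sampling.

    A is a list of n feature vectors, y a list of n scalar labels.
    """
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    gamma = 1.0                      # 0.5*(a - y)^2 has a 1-Lipschitz gradient
    p = [1.0 / n] * n                # uniform serial sampling (assumed)
    v = [sum(a * a for a in A[i]) for i in range(n)]
    # Convex combination constant theta.
    theta = min(p[i] * lam * gamma * n / (v[i] + lam * gamma * n)
                for i in range(n))
    alpha = [0.0] * n
    w = [0.0] * d
    abar = [0.0] * d                 # (1 / (lam * n)) * sum_i A_i * alpha_i
    for _ in range(iters):
        # STEP 1 (primal update): convex combination of w and grad g*(abar).
        w = [(1 - theta) * wj + theta * aj for wj, aj in zip(w, abar)]
        # STEP 2 (dual update): update one randomly chosen dual variable.
        i = rng.randrange(n)
        residual = sum(a * wj for a, wj in zip(A[i], w)) - y[i]  # grad phi_i
        step = theta / p[i]
        new_alpha = (1 - step) * alpha[i] - step * residual
        delta = new_alpha - alpha[i]
        alpha[i] = new_alpha
        abar = [aj + delta * a / (lam * n) for aj, a in zip(abar, A[i])]
    return w, alpha


def duality_gap(A, y, lam, w, alpha):
    """P(w) - D(alpha) for the ridge instance above."""
    n, d = len(A), len(A[0])
    primal = sum(0.5 * (sum(a * wj for a, wj in zip(A[i], w)) - y[i]) ** 2
                 for i in range(n)) / n + 0.5 * lam * sum(wj * wj for wj in w)
    abar = [sum(A[i][j] * alpha[i] for i in range(n)) / (lam * n)
            for j in range(d)]
    dual = (-0.5 * lam * sum(x * x for x in abar)
            - sum(0.5 * alpha[i] ** 2 - alpha[i] * y[i] for i in range(n)) / n)
    return primal - dual


A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [-1.0, 0.5]]
y = [1.0, 2.0, 3.0, 0.0, -1.0]
initial_gap = duality_gap(A, y, 0.1, [0.0, 0.0], [0.0] * 5)
w, alpha = quartz_ridge(A, y, lam=0.1, iters=3000)
final_gap = duality_gap(A, y, 0.1, w, alpha)
```

On this toy instance the duality gap starts at 1.5 and should shrink to roughly machine precision, in line with the linear convergence in expectation that the talk claims.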

Page 19:

Randomized Primal-Dual Methods

SDCA: S Shalev-Shwartz & T Zhang, 09/2012

mSDCA: M Takáč, A Bijral, P R & N Srebro, 03/2013

ASDCA: S Shalev-Shwartz & T Zhang, 05/2013

AccProx-SDCA: S Shalev-Shwartz & T Zhang, 10/2013

DisDCA: T Yang, 2013

Iprox-SDCA: P Zhao & T Zhang, 01/2014

APCG: Q Lin, Z Lu & L Xiao, 07/2014

SPDC: Y Zhang & L Xiao, 09/2014

Quartz: Z Qu, P R & T Zhang, 11/2014

Page 20:

Part 4: MAIN RESULT

Page 21:

Assumption 3 (Expected Separable Overapproximation)

inequality must hold for all
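The inequality is missing from the transcript. Assuming the notation of arXiv:1411.5873, where $\hat{S}$ is the random sampling, $p_i = \mathbf{P}(i \in \hat{S})$, and $v_1, \dots, v_n$ are the ESO parameters, it can be sketched as:

```latex
\mathbb{E}\left[ \Big\| \sum_{i \in \hat{S}} A_i h_i \Big\|^2 \right]
\le \sum_{i=1}^{n} p_i v_i \|h_i\|^2
\quad \text{for all } h_1, \dots, h_n \in \mathbb{R}^m
```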

Page 22:

Complexity Theorem (QRZ’14)
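The statement of the theorem is missing from the transcript. As a reconstruction from arXiv:1411.5873 (with $p_i$ the sampling probabilities, $v_i$ the ESO parameters, $1/\gamma$ the Lipschitz constant of the loss gradients, and $\lambda$ the regularization parameter, all assumed notation), the result reads roughly:

```latex
T \ge \max_{i} \left( \frac{1}{p_i} + \frac{v_i}{p_i \lambda \gamma n} \right)
      \log\!\left( \frac{P(w^0) - D(\alpha^0)}{\epsilon} \right)
\quad \Longrightarrow \quad
\mathbb{E}\left[ P(w^T) - D(\alpha^T) \right] \le \epsilon
```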

Page 23:

Part 5a: UPDATING ONE DUAL VARIABLE AT A TIME

Page 24:

Complexity of Quartz specialized to serial sampling
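The table on this slide was lost. As a hedged illustration, the sketch below evaluates the leading term max_i (1/p_i + v_i / (p_i * lam * gamma * n)) of the complexity bound from arXiv:1411.5873 for two serial samplings: uniform probabilities, and probabilities proportional to v_i + lam * gamma * n (assumed here to be the "optimal" serial sampling). The function names and the values of v are made up for the example.

```python
def leading_term(p, v, lam, gamma):
    """max_i (1/p_i + v_i / (p_i * lam * gamma * n)), the factor before the log."""
    n = len(p)
    return max(1.0 / p[i] + v[i] / (p[i] * lam * gamma * n) for i in range(n))


def uniform_probabilities(v):
    n = len(v)
    return [1.0 / n] * n


def importance_probabilities(v, lam, gamma):
    """p_i proportional to v_i + lam * gamma * n (assumed optimal serial sampling)."""
    n = len(v)
    weights = [vi + lam * gamma * n for vi in v]
    total = sum(weights)
    return [wt / total for wt in weights]


# Hypothetical ESO parameters: one sample is much "heavier" than the rest.
v = [1.0, 1.0, 1.0, 10.0]
lam, gamma = 0.1, 1.0
uniform_term = leading_term(uniform_probabilities(v), v, lam, gamma)
importance_term = leading_term(importance_probabilities(v, lam, gamma), v, lam, gamma)
```

With uniform sampling the term equals n + max_i v_i / (lam * gamma), here 104; with the importance probabilities it equals n + mean(v) / (lam * gamma), here 36.5, so the bound depends on the average rather than the worst v_i.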

Page 25:

Data

Page 26:

Experiment: Quartz vs SDCA, uniform vs optimal sampling

Panels: standard primal update; “aggressive” primal update

Page 27:

Part 5b: TAU-NICE SAMPLING (STANDARD MINIBATCHING)
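A tau-nice sampling picks a subset of exactly tau of the n dual variables, uniformly at random among all subsets of that size, so each variable is selected with probability tau / n. A minimal sketch (the helper name `tau_nice_sample` is illustrative):

```python
import random


def tau_nice_sample(n, tau, rng=random):
    """Return a uniformly random subset of {0, ..., n-1} with exactly tau elements."""
    if not 1 <= tau <= n:
        raise ValueError("tau must satisfy 1 <= tau <= n")
    return set(rng.sample(range(n), tau))


minibatch = tau_nice_sample(10, 3, random.Random(0))
```

Setting tau = 1 recovers the serial uniform sampling of Part 5a.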

Page 28:

Data sparsity

A normalized measure of average sparsity of the data

“Fully sparse data” “Fully dense data”

Page 29:

Complexity of Quartz

Page 30:

Speedup

Assume the data is normalized:

Then:

Linear speedup up to a certain data-independent minibatch size:

Further data-dependent speedup, up to the extreme case:
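The formulas on this slide were lost. The sketch below assumes that, for normalized data and tau-nice sampling, the leading term of the bound has the form given in arXiv:1411.5873, namely n / tau + (1 + (omega - 1)(tau - 1) / (n - 1)) / (lam * gamma * tau), where omega stands for the normalized sparsity measure (1 for fully sparse data, n for fully dense data). The function names and parameter values are illustrative.

```python
def leading_term_tau_nice(n, tau, omega, lam, gamma):
    """Assumed leading term of the Quartz bound under tau-nice sampling."""
    beta = 1.0 + (omega - 1.0) * (tau - 1.0) / (n - 1.0)
    return n / tau + beta / (lam * gamma * tau)


def speedup(n, tau, omega, lam, gamma):
    """Ratio of the serial (tau = 1) leading term to the minibatch leading term."""
    return (leading_term_tau_nice(n, 1, omega, lam, gamma)
            / leading_term_tau_nice(n, tau, omega, lam, gamma))


n, lam, gamma = 1000, 0.01, 1.0
sparse_speedup = speedup(n, 8, 1.0, lam, gamma)        # fully sparse data
dense_speedup = speedup(n, 10, float(n), lam, gamma)   # fully dense data
```

Under this assumed form, fully sparse data gives a speedup equal to tau for every tau, while fully dense data still gives a speedup of at least tau / 2 as long as tau <= 2 + lam * gamma * n, which is the data-independent linear-speedup regime the slide refers to.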

Page 31:

Speedup: sparse data

Page 32:

Speedup: denser data

Page 33:

Speedup: fully dense data

Page 34:

astro_ph: n = 29,882; density = 0.08%

Page 35:

CCAT: n = 781,265; density = 0.16%

Page 36:

Primal-dual methods with tau-nice sampling

S Shalev-Shwartz & T Zhang ’13

S Shalev-Shwartz & T Zhang ’13

Y Zhang & L Xiao ’14

Page 37:

For sufficiently sparse data, Quartz wins even when compared against accelerated methods

Accelerated

Page 38:

Part 6: ESO

Zheng Qu and P.R., Coordinate descent with arbitrary sampling II: expected separable overapproximation, arXiv:1412.8063, 2014

Page 39:

Computation of ESO parameters

Lemma (QR’14b) {For simplicity, assume that m = 1}

Page 40:

ESO

Theorem (QR’14b)

For any sampling, ESO holds with

where

Page 41:

ESO

Page 42:

Part 7: DISTRIBUTED QUARTZ

Page 43:

Distributed Quartz: Perform the Dual Updates in a Distributed Manner

Quartz STEP 2: DUAL UPDATE

Data required to compute the update

Page 44:

Distribution of Data

n = # dual variables

Data matrix

Page 45:

Distributed sampling

Page 46:

Distributed sampling

Random set of dual variables
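The picture defining the sampling is missing. As a sketch of the distributed sampling described in arXiv:1411.5873, assume the n dual variables are split into c contiguous blocks of equal size, one per node, and each node independently draws a tau-nice sample from its own block. The function name and the contiguous partition are assumptions made for the illustration.

```python
import random


def distributed_sample(n, c, tau, rng=random):
    """Each of c nodes picks tau of its n/c dual variables uniformly at random."""
    if n % c != 0:
        raise ValueError("n must be divisible by the number of nodes c")
    block = n // c
    if not 1 <= tau <= block:
        raise ValueError("tau must satisfy 1 <= tau <= n / c")
    chosen = set()
    for node in range(c):
        start = node * block  # node owns indices start, ..., start + block - 1
        chosen.update(rng.sample(range(start, start + block), tau))
    return chosen


picked = distributed_sample(12, 3, 2, random.Random(1))
per_node = [sum(1 for i in picked if i // 4 == node) for node in range(3)]
```

The resulting random set always contains c * tau dual variables, with exactly tau of them owned by each node, so every node performs the same amount of work per iteration.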

Page 47:

Distributed sampling & distributed coordinate descent

Previously studied (not in the primal-dual setup):

P.R. and Martin Takáč, Distributed coordinate descent method for learning with big data, arXiv:1310.2059, 2013

Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, 2014 IEEE Int Workshop on Machine Learning for Signal Processing, May 2014

Jakub Mareček, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing partially separable functions, arXiv:1406.0238, June 2014

strongly convex & smooth

convex & smooth

Page 48:

Complexity of distributed Quartz

Page 49:

Reallocating load: theoretical speedup

n = 1,000,000; density = 100%

n = 1,000,000; density = 0.01%

Page 50:

Reallocating load: experiment

Data: webspam; n = 350,000; density = 33.51%

Page 51:

Experiment

Machine: 128 nodes of HECToR Supercomputer (4096 cores)

Problem: LASSO, n = 1 billion, d = 0.5 billion, 3 TB

Algorithm: Hydra with c = 512

P.R. and Martin Takáč, Distributed coordinate descent method for learning with big data, arXiv:1310.2059, 2013

Page 52:

LASSO: 3TB data + 128 nodes

Page 53:

Experiment

Machine: 128 nodes of ARCHER Supercomputer

Problem: LASSO, n = 5 million, d = 50 billion, 5 TB (60,000 nnz per row of A)

Algorithm: Hydra2 with c = 256

Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, IEEE Int Workshop on Machine Learning for Signal Processing, 2014

Page 54:

LASSO: 5TB data (d = 50b) + 128 nodes

Page 55:

Related Work

Zheng Qu and P.R., Coordinate descent with arbitrary sampling II: expected separable overapproximation, arXiv:1412.8063, 2014

Zheng Qu and P.R., Coordinate descent with arbitrary sampling I: algorithms and complexity, arXiv:1412.8060, 2014

P.R. and Martin Takáč, On optimal probabilities in stochastic coordinate descent methods, arXiv:1310.3438, 2013

Page 56:

END