Peter Richtárik Randomized Dual Coordinate Ascent with Arbitrary Sampling 1st UCL Workshop on the...

Peter Richtárik

Randomized Dual Coordinate Ascent with Arbitrary Sampling

1st UCL Workshop on the Theory of Big Data, London, January 2015

Coauthors

Zheng Qu (Edinburgh, Mathematics)

Tong Zhang (Rutgers, Statistics; Baidu, Big Data Lab)

Zheng Qu, P.R. and Tong Zhang, Randomized dual coordinate ascent with arbitrary sampling, arXiv:1411.5873, 2014

Part 1: MACHINE LEARNING

Statistical nature of data

Data (e.g., image, text, measurements, …)

Label

Prediction of labels from data

Find a linear predictor such that, when a (data, label) pair is drawn from the distribution, the predicted label is close to the true label.

Measure of Success

We want the expected loss (= risk) to be small, where the expectation is taken over a (data, label) pair drawn from the distribution.

Finding a Linear Predictor via Empirical Risk Minimization

Draw i.i.d. data (samples) from the distribution

Output predictor which minimizes the empirical risk:
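A minimal sketch of the empirical risk of a linear predictor, written in plain Python. The squared loss and the helper names `dot`, `squared_loss` and `empirical_risk` are illustrative assumptions; the slides do not fix a particular loss here.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def squared_loss(prediction, label):
    """Illustrative loss (an assumption): half the squared error."""
    return 0.5 * (prediction - label) ** 2


def empirical_risk(w, samples, loss=squared_loss):
    """Average loss of the linear predictor a -> <a, w> over the drawn samples.

    samples is a list of (data_vector, label) pairs.
    """
    return sum(loss(dot(a, w), y) for a, y in samples) / len(samples)


# Two samples in R^2; the predictor w = [1, -1] fits both labels exactly.
samples = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
risk_at_fit = empirical_risk([1.0, -1.0], samples)
risk_at_zero = empirical_risk([0.0, 0.0], samples)
```

Empirical risk minimization then searches over `w` for the smallest value of `empirical_risk`, usually with a regularizer added, as in Part 2.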

Part 2: OPTIMIZATION

Primal Problem

d = # features (parameters)

n = # samples

regularizer: 1-strongly convex function

loss functions: smooth & convex

regularization parameter
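For reference, the primal problem in the accompanying paper (arXiv:1411.5873) can be written as below. The symbols $w$, $A_i$, $\phi_i$, $g$, $\lambda$ and $\gamma$ are taken from that paper, not from the slide text, so read this as a hedged reconstruction.

```latex
\min_{w \in \mathbb{R}^d} \; P(w) := \frac{1}{n} \sum_{i=1}^{n} \phi_i(A_i^\top w) + \lambda\, g(w)
% A_i \in \mathbb{R}^{d \times m}: data for sample i
% \phi_i : \mathbb{R}^m \to \mathbb{R}: convex and (1/\gamma)-smooth loss
% g : \mathbb{R}^d \to \mathbb{R}: 1-strongly convex regularizer
% \lambda > 0: regularization parameter
```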

Assumption 1

Loss functions have Lipschitz gradient

Lipschitz constant

Assumption 2

Regularizer is 1-strongly convex

subgradient
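Written out in the notation of arXiv:1411.5873 (an assumption, since the slide formulas are not part of this text), the two assumptions read as follows. Here $\phi_i$ are the loss functions, $1/\gamma$ is the Lipschitz constant of their gradients, $g$ is the regularizer, and $\nabla g(w')$ denotes a subgradient of $g$ at $w'$.

```latex
% Assumption 1: loss functions have (1/\gamma)-Lipschitz gradient
\|\nabla \phi_i(a) - \nabla \phi_i(a')\| \le \frac{1}{\gamma}\, \|a - a'\|
\quad \text{for all } a, a' \in \mathbb{R}^m

% Assumption 2: the regularizer is 1-strongly convex
g(w) \ge g(w') + \langle \nabla g(w'),\, w - w' \rangle + \frac{1}{2}\, \|w - w'\|^2
\quad \text{for all } w, w' \in \mathbb{R}^d
```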

Dual Problem

strongly convex (conjugates of the loss functions)

1-smooth & convex (conjugate of the regularizer)
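A hedged reconstruction of the dual problem, following arXiv:1411.5873. The notation is assumed from that paper: $\phi_i^*$ and $g^*$ are the convex conjugates of the losses $\phi_i$ and of the regularizer $g$, $A_i$ is the data for sample $i$, and $\lambda$ is the regularization parameter.

```latex
\max_{\alpha = (\alpha_1, \dots, \alpha_n) \in \mathbb{R}^{nm}} \;
D(\alpha) := -\lambda\, g^*\!\Big( \frac{1}{\lambda n} \sum_{i=1}^{n} A_i \alpha_i \Big)
             - \frac{1}{n} \sum_{i=1}^{n} \phi_i^*(-\alpha_i)
% \phi_i (1/\gamma)-smooth     =>  \phi_i^* is \gamma-strongly convex
% g 1-strongly convex           =>  g^* is 1-smooth and convex
```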

Part 3: ALGORITHM

Quartz

Fenchel Duality

Weak duality

Optimality conditions
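The duality gap can be split into two Fenchel-Young terms, which gives both weak duality and the optimality conditions. This is a sketch in the notation of arXiv:1411.5873 (primal $P$, dual $D$, losses $\phi_i$, regularizer $g$, data $A_i$), with $\bar{\alpha}$ introduced here as shorthand.

```latex
\bar{\alpha} := \frac{1}{\lambda n} \sum_{i=1}^{n} A_i \alpha_i

P(w) - D(\alpha)
  = \lambda \big( g(w) + g^*(\bar{\alpha}) - \langle w, \bar{\alpha} \rangle \big)
  + \frac{1}{n} \sum_{i=1}^{n} \big( \phi_i(A_i^\top w) + \phi_i^*(-\alpha_i)
  + \langle A_i^\top w, \alpha_i \rangle \big) \;\ge\; 0

% Both brackets are nonnegative by the Fenchel-Young inequality (weak duality).
% The gap is zero exactly when both hold with equality (optimality conditions):
w = \nabla g^*(\bar{\alpha}), \qquad \alpha_i = -\nabla \phi_i(A_i^\top w), \quad i = 1, \dots, n
```

The identity uses $\lambda \langle w, \bar{\alpha} \rangle = \frac{1}{n} \sum_i \langle A_i^\top w, \alpha_i \rangle$, so the two inner-product terms cancel.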

The Algorithm

Quartz: Bird’s Eye View

STEP 1: PRIMAL UPDATE

STEP 2: DUAL UPDATE

The Algorithm

STEP 1

STEP 2

Convex combination constant
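The two steps can be sketched in plain Python for one concrete instance. Everything specific here is an assumption chosen for illustration, not the slides' general setting: squared loss $\phi_i(a) = \frac{1}{2}(a - y_i)^2$ (so $\gamma = 1$), regularizer $g(w) = \frac{1}{2}\|w\|^2$ (so $\nabla g^*(u) = u$), $m = 1$, uniform serial sampling with $v_i = \|A_i\|^2$, and the update formulas as stated in arXiv:1411.5873. The function names are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def quartz_serial(A, y, lam, iters, seed=0):
    """Sketch of Quartz with uniform serial sampling on ridge regression.

    A is a list of n examples (each a list of d features), y the labels.
    """
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    gamma = 1.0                      # squared loss is 1-smooth (assumption)
    p = 1.0 / n                      # uniform serial sampling
    v = [dot(a, a) for a in A]       # ESO parameters for serial sampling
    theta = min(p * lam * gamma * n / (vi + lam * gamma * n) for vi in v)
    w = [0.0] * d
    alpha = [0.0] * n
    alpha_bar = [0.0] * d            # (1 / (lam * n)) * sum_i A_i * alpha_i
    for _ in range(iters):
        # STEP 1: primal update, a convex combination with constant theta
        w = [(1 - theta) * wj + theta * aj for wj, aj in zip(w, alpha_bar)]
        # STEP 2: dual update of one randomly chosen dual variable
        i = rng.randrange(n)
        grad = dot(A[i], w) - y[i]   # gradient of the squared loss
        new_alpha_i = (1 - theta / p) * alpha[i] - (theta / p) * grad
        delta = new_alpha_i - alpha[i]
        alpha[i] = new_alpha_i
        alpha_bar = [aj + delta * aij / (lam * n)
                     for aj, aij in zip(alpha_bar, A[i])]
    return w, alpha


def duality_gap(A, y, lam, w, alpha):
    """P(w) - D(alpha) for the ridge regression instance above."""
    n, d = len(A), len(w)
    primal = (sum(0.5 * (dot(a, w) - yi) ** 2 for a, yi in zip(A, y)) / n
              + 0.5 * lam * dot(w, w))
    alpha_bar = [sum(A[i][j] * alpha[i] for i in range(n)) / (lam * n)
                 for j in range(d)]
    dual = (-0.5 * lam * dot(alpha_bar, alpha_bar)
            - sum(0.5 * ai ** 2 - ai * yi for ai, yi in zip(alpha, y)) / n)
    return primal - dual
```

Calling `quartz_serial` on a small normalized dataset and evaluating `duality_gap` before and after is a quick sanity check: the gap should shrink roughly like $(1 - \theta)^t$ in expectation.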

Randomized Primal-Dual Methods

SDCA: SS Shwartz & T Zhang, 09/2012
mSDCA: M Takac, A Bijral, P R & N Srebro, 03/2013
ASDCA: SS Shwartz & T Zhang, 05/2013
AccProx-SDCA: SS Shwartz & T Zhang, 10/2013
DisDCA: T Yang, 2013
Iprox-SDCA: P Zhao & T Zhang, 01/2014
APCG: Q Lin, Z Lu & L Xiao, 07/2014
SPDC: Y Zhang & L Xiao, 09/2014
Quartz: Z Qu, P R & T Zhang, 11/2014

Part 4: MAIN RESULT

Assumption 3 (Expected Separable Overapproximation)

inequality must hold for all

Complexity Theorem (QRZ’14)
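As a hedged reconstruction from arXiv:1411.5873, the ESO assumption and the complexity result can be stated as follows. The notation is assumed from that paper: $\hat{S}$ is the random sampling, $p_i$ the probability that it contains $i$, $v_i$ the ESO parameters, $P$ and $D$ the primal and dual objectives, and $w^t$, $\alpha^t$ the iterates.

```latex
% Assumption 3 (ESO): for all h = (h_1, \dots, h_n)
\mathbb{E}\Big[ \Big\| \sum_{i \in \hat{S}} A_i h_i \Big\|^2 \Big]
  \le \sum_{i=1}^{n} p_i v_i \|h_i\|^2,
\qquad p_i := \mathbb{P}(i \in \hat{S})

% Complexity theorem
\theta := \min_i \frac{p_i \lambda \gamma n}{v_i + \lambda \gamma n},
\qquad
\mathbb{E}\big[ P(w^t) - D(\alpha^t) \big] \le (1 - \theta)^t \big( P(w^0) - D(\alpha^0) \big)

% Hence the expected duality gap is at most \epsilon as soon as
T \ge \max_i \Big( \frac{1}{p_i} + \frac{v_i}{p_i \lambda \gamma n} \Big)
      \log\Big( \frac{P(w^0) - D(\alpha^0)}{\epsilon} \Big)
```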

Part 5a: UPDATING ONE DUAL VARIABLE AT A TIME

Complexity of Quartz specialized to serial sampling
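To make the effect of the sampling probabilities concrete, the sketch below evaluates the leading complexity factor $\max_i \big(1/p_i + v_i/(p_i \lambda \gamma n)\big)$ from arXiv:1411.5873 for two serial samplings: uniform, and probabilities proportional to $v_i + \lambda \gamma n$, which equalize all terms in the maximum. The function names and the toy values of `v` are assumptions for illustration.

```python
def leading_factor(p, v, lam, gamma):
    """max_i (1/p_i + v_i / (p_i * lam * gamma * n)) for a serial sampling."""
    n = len(p)
    return max(1.0 / pi + vi / (pi * lam * gamma * n) for pi, vi in zip(p, v))


def uniform_probs(n):
    return [1.0 / n] * n


def optimal_serial_probs(v, lam, gamma):
    """Probabilities proportional to v_i + lam * gamma * n."""
    n = len(v)
    weights = [vi + lam * gamma * n for vi in v]
    total = sum(weights)
    return [wt / total for wt in weights]


# Toy ESO parameters: one example has a much larger norm than the others.
v = [1.0, 1.0, 4.0]
lam, gamma = 0.5, 1.0
uniform = leading_factor(uniform_probs(len(v)), v, lam, gamma)
optimal = leading_factor(optimal_serial_probs(v, lam, gamma), v, lam, gamma)
```

In this toy case the uniform factor is $n + \max_i v_i / (\lambda \gamma)$ and the optimal one is $n + (\frac{1}{n} \sum_i v_i) / (\lambda \gamma)$, so importance sampling replaces the maximum by the average.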

Standard primal update

Experiment: Quartz vs SDCA, uniform vs optimal sampling

“Aggressive” primal update

Part 5b: TAU-NICE SAMPLING (STANDARD MINIBATCHING)

Data sparsity

A normalized measure of average sparsity of the data

Extreme cases: “fully sparse data” and “fully dense data”

Complexity of Quartz

Speedup

Assume the data is normalized:

Then:

Linear speedup up to a certain data-independent minibatch size:

Further data-dependent speedup, up to the extreme case:

Speedup: sparse data

Speedup: denser data

Speedup: fully dense data
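The three regimes can be checked by hand from the leading complexity factor $\max_i \big(1/p_i + v_i/(p_i \lambda \gamma n)\big)$ of arXiv:1411.5873. The sketch below assumes $m = 1$, tau-nice sampling with $p_i = \tau/n$, and the per-feature counts $\omega_j$ (the number of examples in which feature $j$ is nonzero); it works with $\omega_j$ directly rather than with the aggregate sparsity measure shown on the slide.

```latex
T(\tau) = \frac{n}{\tau} + \frac{\max_i v_i}{\lambda \gamma \tau},
\qquad
v_i = \sum_{j=1}^{d} \Big( 1 + \frac{(\omega_j - 1)(\tau - 1)}{\max\{n - 1, 1\}} \Big) A_{ji}^2

% Normalized data, \|A_i\| = 1 for all i:
% fully sparse (\omega_j = 1 for all j): v_i = 1
T(\tau) = \frac{1}{\tau} \Big( n + \frac{1}{\lambda \gamma} \Big) = \frac{T(1)}{\tau}

% fully dense (\omega_j = n for all j): v_i = \tau
T(\tau) = \frac{n}{\tau} + \frac{1}{\lambda \gamma}
```

For fully sparse data the speedup is linear in $\tau$ for every minibatch size, while for fully dense data only the first term improves with $\tau$.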

astro_ph: n = 29,882, density = 0.08%

CCAT: n = 781,265, density = 0.16%

Primal-dual methods with tau-nice sampling

SS-Shwartz & T Zhang ‘13

SS-Shwartz & T Zhang ‘13

Y Zhang & L Xiao ‘14

For sufficiently sparse data, Quartz wins even when compared against accelerated methods

Accelerated

Part 6: ESO

Zheng Qu and P.R., Coordinate Descent with Arbitrary Sampling II: Expected Separable Overapproximation, arXiv:1412.8063, 2014

Computation of ESO parameters

Lemma (QR’14b) {For simplicity, assume that m = 1}

ESO

For any sampling, ESO holds with

Theorem (QR’14b)

where
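One way to build intuition for ESO parameters is to check the inequality numerically on a tiny example. The sketch below does this for tau-nice sampling with $m = 1$, using the parameters $v_i = \sum_j \big(1 + (\omega_j - 1)(\tau - 1)/\max\{n - 1, 1\}\big) A_{ji}^2$ and computing the expectation exactly by enumerating all subsets of size $\tau$. The function names and the small data matrix are assumptions for illustration; this is not the general formula of the theorem, which covers arbitrary samplings.

```python
from itertools import combinations


def tau_nice_eso_params(A, tau):
    """ESO parameters v_i for tau-nice sampling (m = 1).

    A is a list of n examples, each a list of d features.
    """
    n, d = len(A), len(A[0])
    # omega[j] = number of examples in which feature j is nonzero
    omega = [sum(1 for i in range(n) if A[i][j] != 0) for j in range(d)]
    denom = max(n - 1, 1)
    return [sum((1 + (omega[j] - 1) * (tau - 1) / denom) * A[i][j] ** 2
                for j in range(d))
            for i in range(n)]


def eso_sides(A, h, tau):
    """Return (lhs, rhs) of the ESO inequality for tau-nice sampling."""
    n, d = len(A), len(A[0])
    subsets = list(combinations(range(n), tau))
    lhs = 0.0
    for S in subsets:
        vec = [sum(A[i][j] * h[i] for i in S) for j in range(d)]
        lhs += sum(x * x for x in vec)
    lhs /= len(subsets)              # exact expectation over all subsets
    v = tau_nice_eso_params(A, tau)
    rhs = sum((tau / n) * v[i] * h[i] ** 2 for i in range(n))
    return lhs, rhs


A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [1.0, -2.0, 0.5]
lhs_serial, rhs_serial = eso_sides(A, h, tau=1)
lhs_batch, rhs_batch = eso_sides(A, h, tau=2)
```

For `tau=1` the two sides coincide, and for larger `tau` the right-hand side is an upper bound whose slack depends on how sparse the features are.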

Part 7: DISTRIBUTED QUARTZ

Distributed Quartz: Perform the Dual Updates in a Distributed Manner

Quartz STEP 2: DUAL UPDATE

Data required to compute the update

Distribution of Data

n = # dual variables

Data matrix

Distributed sampling


Random set of dual variables

Distributed sampling & distributed coordinate descent
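A minimal sketch of a distributed sampling in the sense used in this line of work: the n dual variables are partitioned evenly across c nodes, and each node independently picks tau of its own variables uniformly at random. The function names, the contiguous partition, and the requirement that c divides n are simplifying assumptions.

```python
import random


def partition(n, c):
    """Split indices 0..n-1 into c contiguous blocks of equal size."""
    if n % c != 0:
        raise ValueError("this sketch assumes that c divides n")
    size = n // c
    return [list(range(k * size, (k + 1) * size)) for k in range(c)]


def distributed_sample(blocks, tau, rng):
    """Each node draws tau distinct indices uniformly from its own block."""
    return [sorted(rng.sample(block, tau)) for block in blocks]


rng = random.Random(0)
blocks = partition(12, 3)            # 12 dual variables on 3 nodes
chosen = distributed_sample(blocks, 2, rng)
# The random set of dual variables updated in this iteration:
selected = [i for node_choice in chosen for i in node_choice]
```

Each node only needs the data columns of its own block to compute its part of the dual update, which is what makes the dual step easy to distribute.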

Previously studied (not in the primal-dual setup):

P.R. and Martin Takáč, Distributed coordinate descent method for learning with big data, arXiv:1310.2059, 2013

Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, 2014 IEEE Int Workshop on Machine Learning for Signal Processing, May 2014

Jakub Mareček, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing partially separable functions, arXiv:1406.0238, June 2014


strongly convex & smooth

convex & smooth

Complexity of distributed Quartz

Reallocating load: theoretical speedup

n = 1,000,000, density = 100%

n = 1,000,000, density = 0.01%

Reallocating load: experiment

Data: webspam, n = 350,000, density = 33.51%

Experiment

Machine: 128 nodes of Hector Supercomputer (4096 cores)

Problem: LASSO, n = 1 billion, d = 0.5 billion, 3 TB

Algorithm: Hydra with c = 512

P.R. and Martin Takáč, Distributed coordinate descent method for learning with big data, arXiv:1310.2059, 2013

LASSO: 3TB data + 128 nodes

Experiment

Machine: 128 nodes of Archer Supercomputer

Problem: LASSO, n = 5 million, d = 50 billion, 5 TB (60,000 nnz per row of A)

Algorithm: Hydra2 with c = 256

Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, IEEE Int Workshop on Machine Learning for Signal Processing, 2014

LASSO: 5TB data (d = 50b) + 128 nodes

Related Work

Zheng Qu and P.R., Coordinate descent with arbitrary sampling II: expected separable overapproximation, arXiv:1412.8063, 2014

Zheng Qu and P.R., Coordinate descent with arbitrary sampling I: algorithms and complexity, arXiv:1412.8060, 2014

P.R. and Martin Takáč, On optimal probabilities in stochastic coordinate descent methods, arXiv:1310.3438, 2013
