Download - Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Splash

A User-friendly Programming Interface forParallelizing Stochastic Algorithms

Yuchen Zhang and Michael Jordan

AMP Lab, UC Berkeley

AMP Lab Splash April 2015 1 / 1

Page 2: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Batch Algorithms vs. Stochastic Algorithms

Consider minimizing a loss function L(w) := 1n

∑ni=1 `i (w).

Gradient Descent: iteratively update

wt+1 = wt − ηt∇L(wt).

Pros: Easy to parallelize (via Spark).Cons: May need hundreds of iterations to converge.

running time (seconds)

0 50 100 150 200 250

loss

funct

ion

0.55

0.6

0.65

0.7Gradient Descent - 64 threads

Page 3: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

∑ni=1 `i (w).

0 50 100 150 200 250

loss

funct

ion

0.55

0.6

0.65

Page 4: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

∑ni=1 `i (w).

0 50 100 150 200 250

loss

funct

ion

0.55

0.6

0.65

Page 5: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Batch Algorithm v.s. Stochastic Algorithm

∑ni=1 `i (w).

Stochastic Gradient Descent (SGD): randomly draw `t , then

wt+1 = wt − ηt∇`t(wt).

Pros: Much faster convergence.Cons: Sequential algorithm, non-trivial to parallelize.

0 50 100 150 200 250

loss

funct

ion

0.55

0.6

0.65

Stochastic Gradient Descent

Page 6: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Batch Algorithm v.s. Stochastic Algorithm

∑ni=1 `i (w).

Stochastic Gradient Descent (SGD): randomly draw `t , then

wt+1 = wt − ηt∇`t(wt).

Pros: Much faster convergence.Cons: Sequential algorithm, non-trivial to parallelize.

0 50 100 150 200 250

loss

funct

ion

0.55

0.6

0.65

Stochastic Gradient Descent

Page 7: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

More Stochastic Algorithms

Convex Optimization

Adaptive SGD (Duchi et al.)

Stochastic Average Gradient Method (Schmidt et al.)

Stochastic Dual Coordinate Ascent (Shalev-Shwartz and Zhang)

Probabilistic Model Inference

Markov chain Monte Carlo (e.g., Gibbs sampling)

Expectation propagation (Minka)

Stochastic variational inference (Hoffman et al.)

SGD variants for

Matrix factorization

Learning neural networks

Learning denoising auto-encoder

How to parallelize these algorithms?

Page 8: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Convex Optimization

SGD variants for

Page 9: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Convex Optimization

SGD variants for

Page 10: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Convex Optimization

SGD variants for

Page 11: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Naive Attempt

After processing a subsequence of random samples...Single-thread Algorithm: incremental update w ← w + ∆.

Parallel Algorithm:

Thread 1 (on 1/m of samples): w ← w + ∆1.Thread 2 (on 1/m of samples): w ← w + ∆2.. . .Thread m (on 1/m of samples): w ← w + ∆m.

Aggregate parallel updates w ← w + ∆1 + · · ·+ ∆m.

0 20 40 60

loss

fu

nct

ion

0

20

40

60

80

100Single-thread SGD

Parallel SGD - 64 threads

Doesn’t work for SGD!

Page 12: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Naive Attempt

After processing a subsequence of random samples...Single-thread Algorithm: incremental update w ← w + ∆.Parallel Algorithm:

0 20 40 60

loss

fu

nct

ion

0

20

40

60

80

Page 13: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Naive Attempt

0 20 40 60

loss

fu

nct

ion

0

20

40

60

80

Page 14: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Naive Attempt

0 20 40 60

loss

fu

nct

ion

0

20

40

60

80

Page 15: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Conflicts in Parallel Updates

Reason for failure: ∆1, . . . ,∆m simultaneously manipulate the samevariable w , causing conflicts in parallel updates.

How to resolve conflicts1 Frequent communication between threads:

Pros: general approach to resolving conflict.Cons: inter-node (asynchronous) communication is expensive!

2 Carefully partition the data to avoid having threads simultaneouslymanipulate the same variable:

Pros: doesn’t need frequent communication.Cons: need problem-specific partitioning schemes; only works for asubset of problems.

Page 16: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

How to resolve conflicts

1 Frequent communication between threads:

Page 17: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Page 18: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Page 19: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Splash: An Omnibus Solution

Splash is

A programming interface for developing stochastic algorithms

An execution engine for running stochastic algorithms on distributedsystems.

Features of Splash include:

Easy Programming: User develop single-thread algorithms viaSplash: no communication protocol, no conflict management, no datapartitioning, no hyper-parameter tuning.

Fast Performance: Splash adopts novel strategy for automaticparallelization with infrequent communication. Communication is nolonger a performance bottleneck.

Integration with Spark: Splash takes an RDD as input and returnsan RDD as output. It works with KeystoneML, MLlib and other dataanalysis tools on Spark.

Page 20: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Splash is

Page 21: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Splash is

Page 22: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Splash is

Programming Interface

Programming with Splash

Splash users implement the following function:

def process(sample: Any, weight: Int, var: VariableSet){/*implement stochastic algorithm*/

}

where

sample — a random sample from the dataset.

weight — the sample is conceptually duplicated weight times.

var — set of all shared variables.

Page 25: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Example: SGD for Linear Regression

Goal: find w∗ = arg minw1n

∑ni=1(wxi − yi )

2.

SGD update: randomly draw (xi , yi ), then w ← w − η∇w (wxi − yi )2.

Splash implementation:

def process(sample: Any, weight: Int, var: VariableSet){val stepsize = var.get(“eta”) * weightval gradient = sample.x * (var.get(“w”) * sample.x - sample.y)var.add(“w”, - stepsize * gradient)

}

Supported operations: get, add, multiply, delayedAdd.

Page 26: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

2.

}

Page 27: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

2.

}

Page 28: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Get Operations

Get the value of the variable (Double or Array[Double]).

get(key) returns var[key]

getArray(key) returns varArray[key]

getArrayElement(key, index) returns varArray[key][index]

getArrayElements(key, indices) returns varArray[key][indices]

Array-based operations are more efficient than element-wise operations,because the key-value retrieval is executed only once when accessing anarray.

Page 29: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Add Operations

Add a quantity δ to the variable.

add(key, delta): var[key] += delta

addArray(key, deltaArray): varArray[key] += deltaArray

addArrayElement(key, index, delta): varArray[key][index] += delta

addArrayElements(key, indices, deltaArrayElements):varArray[key][indices] += deltaArrayElements

Page 30: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Multiply Operations

Multiply the variable v by a quantity γ.

multiply(key, gamma): var[key] *= gamma

multiplyArray(key, gamma): varArray[key] *= gamma

We have optimized the implementation so that the time complexity ofmultiplyArray is O(1), independent of the array dimension.

Example: SGD with sparse features and `2-norm regularization.

w ← (1− λ) ∗ w (multiply operation) (1)

w ← w − η∇f (w) (addArrayElements operation) (2)

Time complexity of (1) = O(1); Time complexity of (2) = nnz(∇f (w)).

Page 31: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Multiply Operations

Multiply the variable v by a quantity γ.

multiply(key, gamma): var[key] *= gamma

multiplyArray(key, gamma): varArray[key] *= gamma

We have optimized the implementation so that the time complexity ofmultiplyArray is O(1), independent of the array dimension.

Example: SGD with sparse features and `2-norm regularization.

w ← (1− λ) ∗ w (multiply operation) (1)

w ← w − η∇f (w) (addArrayElements operation) (2)

Time complexity of (1) = O(1); Time complexity of (2) = nnz(∇f (w)).

Page 32: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Delayed Add Operations

Add a quantity δ to the variable v . The operation is not executed until thenext time the same sample is processed by the system.

delayedAdd(key, delta): var[key] += delta

delayedAddArray(key, deltaArray): varArray[key] += deltaArray

delayedAddArrayElement(key, index, delta):varArray[key][index] += delta

Example: Collapsed Gibbs Sampling for LDA – update the count nwkwhen topic k is assigned to word w .

nwk ← nwk + weight (add operation) (3)

nwk ← nwk − weight (delayed add operation) (4)

(3) executed instantly; (4) will be executed at the next time before a newtopic is sampled for the same word.

Page 33: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Delayed Add Operations

Add a quantity δ to the variable v . The operation is not executed until thenext time the same sample is processed by the system.

delayedAdd(key, delta): var[key] += delta

delayedAddArray(key, deltaArray): varArray[key] += deltaArray

delayedAddArrayElement(key, index, delta):varArray[key][index] += delta

Example: Collapsed Gibbs Sampling for LDA – update the count nwkwhen topic k is assigned to word w .

nwk ← nwk + weight (add operation) (3)

nwk ← nwk − weight (delayed add operation) (4)

(3) executed instantly; (4) will be executed at the next time before a newtopic is sampled for the same word.

Page 34: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Running a Stochastic Algorithm

Three simple steps:

1 Convert RDD dataset to a Parametrized RDD:

val paramRdd = new ParametrizedRDD(rdd)

2 Set a function that implements the algorithm:

paramRdd.setProcessFunction(process)

3 Start running:

paramRdd.run()

Page 35: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Three simple steps:

3 Start running:

paramRdd.run()

Page 36: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Three simple steps:

3 Start running:

paramRdd.run()

Page 37: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Execution Engine

Page 38: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

How does Splash work?

In each iteration, the execution engine does:

1 Propose candidate degrees of parallelism m1, . . . ,mk such that∑ki mi = m := (# of cores). For each i ∈ [k], collect mi cores and

do:1 Each core gets a sub-sequence of samples (by default 1

m of the fulldata). They process the samples sequentially using the processfunction. Every sample is weighted by mi .

2 Combine the updates of all mi cores to get the global update. Thereare different strategies for combining different types of updates. Foradd operations, the updates are averaged.

2 If k > 1, select the best mi via a parallel cross-validation procedure.

3 Broadcast the update to all machines to apply this update. Thenproceed to the next iteration.

Page 39: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Page 40: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Page 41: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Why Reweighting?

Recall that each thread processes a subsequence of samples.

Without reweighting, the averaged updates make little progresscomparing to the full sequential update, because the subsequence isshorter than the full sequence.

With reweighting, the weighted subsequence approximates thedistribution of the full sequence, so that local updates are nearlyunbiased estimates of the full update.

Averaging reduces the variance of local updates.

Theorem

With m cores, this strategy achieves m times speedup over thesingle-thread SGD if the objective function is smooth and strongly convex.The communication can be arbitrarily infrequent.

Page 42: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Why Reweighting?

Theorem

Page 43: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Why Reweighting?

Theorem

Page 44: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Why Reweighting?

Theorem

Page 45: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Example: Reweighting for SGD

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2-1

-0.8

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

(a) Optimal solution

(b) Solution with full update

(c) Local solutions with unit-weight update

Page 46: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2-1

-0.8

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

(d) Average local solutions in (c)

(e) Aggregate local solutions in (c)

(29,8)

Page 47: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2-1

-0.8

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

(f) Local solutions with weighted update

(29,8)

Page 48: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2-1

-0.8

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

(f) Local solutions with weighted update

(g) Average local solutions in (f)

(29,8)

Page 49: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Experiments

Page 50: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Experimental Setup

System: Amazon EC2 cluster with 8 workers. Each worker has 8Intel Xeon E5-2665 cores and 30 GBs of memory and was connectedto a commodity 1GB network

Algorithms: SGD for logistic regression; Gibbs Sampling andStochastic Variational Inference for topic modelling; BayesianPersonalized Ranking for recommendation system.

Datasets: Covtype, RCV1 and MNIST 8M for logistic regression;NIPS, Enron and NYTimes for topic modelling; Netflix forrecommendation system.

Page 51: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Logistic Regression

Covtype RCV1 MNIST 8M

run

nin

g t

ime

(sec

)

0

500

1000

1500

2000

2500

SGD (1 thread)Splash + SGD (64 threads)

Single-thread SGD is much faster than Batch GD.

Splash is 16x-37x faster than SGD.

Page 52: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Topic Modelling with Gibbs Sampling

NIPS Enron NYTimes

runnin

g t

ime

(sec

)

×104

0

0.5

1

1.5

2

2.5

3

Gibbs Sampling (1 thread)Splash + Gibbs (64 threads)

Splash is 30x-149x faster than Single-thread Gibbs Sampling.

Page 53: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Topic Modelling with Variational Inference

NIPS Enron NYTimes

runnin

g t

ime

(sec

)

×104

0

0.5

1

1.5

2

2.5

SVI (1 thread)Batch VI (64 threads)Splash + SVI (64 threads)

Splash is 3x – 18x faster than Parallel Batch Algorithm.

Splash is 6x – 20x faster than Single-thread Stochastic Algorithm.

Page 54: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Netflix Movie Recommendation

AUC = 0.91 AUC = 0.94

runnin

g t

ime

(sec

)

0

200

400

600

800

1000

1200

1400

Stochastic (1 thread)Batch (64 threads)Splash + Stochastic (64 threads)

Splash is 3x – 6x faster than Parallel Batch Algorithm.

Splash is 12x – 20x faster than Single-thread Stochastic Algorithm.

Page 55: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Machine Learning Package

Page 56: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Preimplement Machine Learning Algorithms on Splash

Integrated with other tools in the Spark ecosystem. Ease of use withone line of code.

Parallel AdaGrad SGD: faster than MLlib on large dataset.

(MNIST 8M dataset, 64 cores, 10-class logistic regression)

Parallel Gibbs Sampling for LDA.

Will implement more algorithms in the future...

Page 57: Splash: User-friendly Programming Interface for Parallelizing Stochastic Learning Algorithms

Summary

Splash is a general-purpose programming interface for developingstochastic algorithms.

Splash is also an execution engine for automatic parallelizingstochastic algorithms.

Reweighting is the key to retaining communication efficiency and thusfast performance.

We observe good empirical performance and we have theoreticalguarantees for SGD.

Splash is online at http://zhangyuc.github.io/splash/.