A Method for Parallel Online Learning

John Langford, Yahoo! Research

(on joint work with Daniel Hsu & Alex Smola & Martin Zinkevich & others)

MMDS 2010

My Personal Supercomputer

An RCV1-derived binary classification task:

1 424MB Gzip compressed

2 781K examples

3 60M (nonunique) features

How long does it take to learn a good predictor?

20 seconds (1.2 seconds on desktop) = 3M features/second.

Other systems:

1 AD-LDA (2000 topics) (KDD 2008): 205K features/second using 1000 nodes.

2 PSVM (2007): 23K features/second using 500 nodes (on RCV1).

3 PLANET (depth 10 tree) (VLDB 2009): 3M features/second using 200 nodes.


How does Vowpal Wabbit work?

Start with $\forall i : w_i = 0$. Repeatedly:

1 Get example $x \in \mathbb{R}^*$.

2 Make prediction $\hat{y} = \frac{\sum_i w_i x_i}{\sqrt{|\{i : x_i \neq 0\}|}}$, clipped to the interval $[0, 1]$.

3 Learn truth $y \in [0, 1]$ with importance $I$, or go to (1).

4 Update $w_i \leftarrow w_i + \eta \, \frac{2(y - \hat{y})\, I\, x_i}{\sqrt{|\{i : x_i \neq 0\}|}}$ and go to (1).

This is routine, but with old and new optimization tricks like hashing. This is open source @ http://hunch.net/~vw. Also reimplemented in the Torch, Streams, and Mahout projects.
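As a minimal sketch of the loop above (in Python rather than the actual C++ implementation; the sparse-dict example format and the step size are illustrative assumptions):

```python
# Sketch of the normalized online update described above; not the actual
# Vowpal Wabbit implementation.
import math

def predict(w, x):
    """x is a sparse example: dict mapping feature index -> value."""
    norm = math.sqrt(len(x)) or 1.0          # sqrt(|{i : x_i != 0}|)
    raw = sum(w.get(i, 0.0) * v for i, v in x.items()) / norm
    return min(max(raw, 0.0), 1.0)           # clip to [0, 1]

def update(w, x, y, importance=1.0, eta=0.1):
    """One step on squared loss, scaled by the importance weight I."""
    y_hat = predict(w, x)
    norm = math.sqrt(len(x)) or 1.0
    g = eta * 2.0 * (y - y_hat) * importance / norm
    for i, v in x.items():
        w[i] = w.get(i, 0.0) + g * v
    return y_hat

# Usage: stream examples through the learner, one pass, constant memory.
w = {}
for x, y in [({0: 1.0, 3: 1.0}, 1.0), ({1: 1.0, 3: 1.0}, 0.0)]:
    update(w, x, y)
```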


Why I’m dissatisfied, and What I’ve learned so far.

A 1 GHz processor should imply 1B features/second, and it's easy to imagine datasets with 1P features. How can we deal with such large datasets?

Core Problem for Learning on much data = Bandwidth limits. 1 Gb/s Ethernet = 450 GB/hour ⇒ 1T features is reasonable.
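The bandwidth figure is simple arithmetic; the bytes-per-feature value below is an illustrative assumption, not a number from the talk:

```python
# 1 Gb/s Ethernet expressed in bytes per hour.
bytes_per_hour = 1e9 / 8 * 3600        # 4.5e11 bytes = 450 GB/hour
print(bytes_per_hour / 1e9)            # 450.0 (GB/hour)

# Assuming a couple of bytes per (compressed) nonzero feature (an assumption
# for illustration), a terafeature dataset streams through in a few hours.
bytes_per_feature = 2.0
print(1e12 * bytes_per_feature / bytes_per_hour)   # ~4.4 hours
```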

Outline

1 Multicore parallelization

2 Multinode parallelization


How do we parallelize across multiple cores?

Answer 1: It's no use, because it doesn't address the bandwidth problem. But there's a trick. Sometimes you care about the interaction of two sets of features (queries with results, for example). Tweak the algorithm so as to specify (query features, result features), then use a fast hash to compute the outer product in the core.

Possibilities:

1 Example Sharding: Each core handles an example subset.

2 Feature Sharding: Each core handles a feature subset.

Empirically: Feature Sharding > Example Sharding. Both work on two cores, but Example Sharding doesn't scale; Feature Sharding provides about a 3x speedup on 4 cores. But, again, this is just for a special case. Need multinode parallelization to address data scaling.
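A hedged sketch of the hashed outer-product trick (the hash function, feature naming, and table size below are illustrative assumptions, not VW's exact scheme):

```python
# Sketch of hashed outer-product (cross) features between two feature sets.
import zlib

NUM_BINS = 1 << 18   # size of the hashed weight vector (assumption)

def hashed_cross(query_feats, result_feats):
    """Map each (query feature, result feature) pair to a hashed index."""
    crossed = {}
    for qf, qv in query_feats.items():
        for rf, rv in result_feats.items():
            idx = zlib.crc32(f"{qf}^{rf}".encode()) % NUM_BINS
            crossed[idx] = crossed.get(idx, 0.0) + qv * rv
    return crossed

# Usage: the crossed features feed the same linear learner as before.
q = {"query:cheap": 1.0, "query:flights": 1.0}
r = {"ad:airline": 1.0, "ad:discount": 1.0}
x = hashed_cross(q, r)   # 4 hashed pair features
```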


Algorithms for Speed

1 Multicore parallelization

2 Multinode parallelization

Multinode = inevitable delay

Ethernet latency = 0.1 milliseconds = $10^5$ cycles = many examples.

1 Example Sharding ⇒ weights out of sync by delay factor (see the sketch after this list).

2 Feature Sharding ⇒ global predictions delayed by delay factor.
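A toy illustration of the first point above: gradients end up being computed against weights that are roughly `delay` updates stale. The loss, step size, and bookkeeping are assumptions for the sketch, not the analyzed algorithms.

```python
# Toy sketch: gradients computed against stale weights, as when updates from
# other example shards arrive `delay` steps late. Illustrative only.
def delayed_sgd(examples, delay=10, eta=0.01):
    w = 0.0
    stale = [0.0] * (delay + 1)          # circular buffer of past weights
    for t, (x, y) in enumerate(examples):
        w_old = stale[t % (delay + 1)]   # weights from ~`delay` steps ago
        grad = 2.0 * (w_old * x - y) * x # squared-loss gradient at stale weights
        w -= eta * grad                  # ...applied to the current weights
        stale[t % (delay + 1)] = w
    return w

# With delay=0 this is plain SGD; compare convergence with and without delay.
print(delayed_sgd([(1.0, 0.5)] * 300, delay=0))
print(delayed_sgd([(1.0, 0.5)] * 300, delay=10))
```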

How bad is delay?

Theorem: (Mesterharm 2005) Delayed updates reduce convergenceby delay factor in worst case for expert algorithms.Theorem: (LSZ NIPS 2009) Same for linear predictors.(Caveat: there are some special cases where you can do better.)Empirically: Delay can hurt substantially when examples arestructured.What do we do?


How can we avoid delay?

[Diagram: the (Features, Label) stream is split across Feature Shard nodes; each shard runs Predict & Learn on its feature subset, and the shards' Predictions feed a final Predict & Learn stage.]

Observations about Feed Forward

1 No longer the same algorithm; it's designed for parallel environments.

2 Bandwidth = a few bytes per example, per node ⇒ Tera-example feasible with a single master, arbitrarily more with a hierarchical structure.

3 No delay.

4 Feature Shard is stateless ⇒ parallelizable & cacheable (the scheme is sketched below).
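A minimal sketch of this feed-forward arrangement, assuming two feature shards, a simple modulo split, and a master that learns on the shards' predictions. The learner class, split rule, and single-process simulation are illustrative assumptions, not the actual multinode implementation.

```python
# Sketch of the feed-forward scheme: each node learns on its own feature
# shard; the master learns on the shards' predictions. Illustrative only.
import math

class LinearLearner:
    """Tiny online squared-loss learner over sparse dict examples."""
    def __init__(self, eta=0.1):
        self.w, self.eta = {}, eta
    def predict(self, x):
        norm = math.sqrt(len(x)) or 1.0
        raw = sum(self.w.get(i, 0.0) * v for i, v in x.items()) / norm
        return min(max(raw, 0.0), 1.0)
    def update(self, x, y):
        y_hat = self.predict(x)
        norm = math.sqrt(len(x)) or 1.0
        g = self.eta * 2.0 * (y - y_hat) / norm
        for i, v in x.items():
            self.w[i] = self.w.get(i, 0.0) + g * v
        return y_hat

shards = [LinearLearner(), LinearLearner()]   # one learner per feature shard
master = LinearLearner()                      # learns on shard predictions

def shard_of(feature_index):                  # illustrative split rule
    return feature_index % len(shards)

def feed_forward(x, y):
    # Each shard predicts (and learns) on its slice of the features...
    parts = [{i: v for i, v in x.items() if shard_of(i) == s}
             for s in range(len(shards))]
    preds = {s: (shards[s].update(parts[s], y) if parts[s] else 0.5)
             for s in range(len(shards))}
    # ...and the master treats those predictions as its input features.
    return master.update(preds, y)
```

In a real deployment each shard would run on its own node, and only its few-byte prediction would cross the network, which is where the bandwidth saving in observation 2 comes from.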

Bad News: Feed Forward can't compete with general linear predictors

Probability   y   x1   x2   x3
0.25          1    1    1    0
0.125         1    1    0    1
0.125         1    0    1    1
0.25          0    0    0    1
0.125         0    1    0    0
0.125         0    0    1    0

Features 1 & 2 are imperfect predictors. Feature 3 is uncorrelated with the truth. The optimal predictor = majority vote on all 3 features.
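A short check of these claims, computed directly from the table above (an illustrative verification, not from the talk):

```python
# Each of x1, x2 alone is only 75% accurate, x3 alone is 50% (uncorrelated),
# while majority vote over all three features is perfect.
rows = [  # (probability, y, x1, x2, x3)
    (0.25, 1, 1, 1, 0), (0.125, 1, 1, 0, 1), (0.125, 1, 0, 1, 1),
    (0.25, 0, 0, 0, 1), (0.125, 0, 1, 0, 0), (0.125, 0, 0, 1, 0),
]

def accuracy(predictor):
    return sum(p for p, y, *x in rows if predictor(x) == y)

print(accuracy(lambda x: x[0]))              # 0.75  (x1 alone)
print(accuracy(lambda x: x[1]))              # 0.75  (x2 alone)
print(accuracy(lambda x: x[2]))              # 0.5   (x3 alone)
print(accuracy(lambda x: int(sum(x) >= 2)))  # 1.0   (majority vote)
```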

Good news

If the Naive Bayes assumption holds, P(x1 | y) P(x2 | y) = P(x1, x2 | y), you win.

Better news: x1 = first shard, x2 = second shard.

Even better: There are more complex problem classes for which this also works.
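A one-step justification of why the condition suffices (a standard Bayes-rule manipulation, not from the slides): under conditional independence the joint posterior factors into the per-shard posteriors, so a combiner that only sees $P(y \mid x_1)$ and $P(y \mid x_2)$ loses nothing:

$$P(y \mid x_1, x_2) \;\propto\; P(y)\,P(x_1, x_2 \mid y) \;=\; P(y)\,P(x_1 \mid y)\,P(x_2 \mid y) \;\propto\; \frac{P(y \mid x_1)\,P(y \mid x_2)}{P(y)}.$$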

Initial experiments on a medium size text Ad dataset @ Yahoo!

1 ∼100GB when gzip compressed.

2 ∼10M examples.

3 ∼125G nonzero features

4 Uses outer-product features

Relative progressive validation (BKL, COLT 1999) squared loss & relative wall-clock time reported.

[Plot: Initial Experiments, Sharding & Training. Relative squared loss and relative time vs. shard count (1, 2, 4, 8).]

[Plot: Initial Experiments, Training & Combining. Relative squared loss and relative time vs. shard count (1, 2, 4, 8).]

Final thoughts

About a 6x speedup achieved over the sequential system so far.

This general approach, unlike averaging approaches, is fully applicable to nonlinear systems.

Code at: http://github.com/JohnLangford/vowpal_wabbit. Patches welcome. Much more work needs to be done.

Some further discussion @ http://hunch.net