A Method for Parallel Online Learning

John Langford, Yahoo! Research

(on joint work with Daniel Hsu & Alex Smola & Martin Zinkevich & others)

MMDS 2010

My Personal Supercomputer

An RCV1-derived binary classification task:

1 424MB Gzip compressed

2 781K examples

3 60M (nonunique) features

How long does it take to learn a good predictor?

20 seconds (1.2 seconds on desktop) = 3M features/second.

Other systems:

1 AD-LDA (2000 topics) (KDD 2008): 205K features/second using 1000 nodes.

2 PSVM (2007): 23K features/second using 500 nodes (on RCV1).

3 PLANET (depth 10 tree) (VLDB 2009): 3M features/second using 200 nodes.


How does Vowpal Wabbit work?

Start with $\forall i : w_i = 0$. Repeatedly:

1 Get example $x \in \mathbb{R}^*$.

2 Make prediction $\hat{y} = \frac{\sum_i w_i x_i}{\sqrt{|\{i : x_i \neq 0\}|}}$, clipped to the interval $[0, 1]$.

3 Learn truth $y \in [0, 1]$ with importance $I$, or go to (1).

4 Update $w_i \leftarrow w_i + \eta \, \frac{2(y - \hat{y})\, I\, x_i}{\sqrt{|\{i : x_i \neq 0\}|}}$ and go to (1).

This is routine, but with old and new optimization tricks like hashing. This is open source @ http://hunch.net/~vw. Also reimplemented in the Torch, Streams, and Mahout projects.
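As a minimal sketch of the loop above (in Python rather than the actual C++ implementation; the sparse-dict example format and the step size are illustrative assumptions):

```python
# Sketch of the normalized online update described above; not the actual
# Vowpal Wabbit implementation.
import math

def predict(w, x):
    """x is a sparse example: dict mapping feature index -> value."""
    norm = math.sqrt(len(x)) or 1.0          # sqrt(|{i : x_i != 0}|)
    raw = sum(w.get(i, 0.0) * v for i, v in x.items()) / norm
    return min(max(raw, 0.0), 1.0)           # clip to [0, 1]

def update(w, x, y, importance=1.0, eta=0.1):
    """One step on squared loss, scaled by the importance weight I."""
    y_hat = predict(w, x)
    norm = math.sqrt(len(x)) or 1.0
    g = eta * 2.0 * (y - y_hat) * importance / norm
    for i, v in x.items():
        w[i] = w.get(i, 0.0) + g * v
    return y_hat

# Usage: stream examples through the learner, one pass, constant memory.
w = {}
for x, y in [({0: 1.0, 3: 1.0}, 1.0), ({1: 1.0, 3: 1.0}, 0.0)]:
    update(w, x, y)
```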


Why I’m dissatisfied, and What I’ve learned so far.

A 1 GHz processor should imply 1B features/second, and it's easy to imagine datasets with 1P features. How can we deal with such large datasets?

Core Problem for Learning on much data = Bandwidth limits. 1 Gb/s Ethernet = 450 GB/hour ⇒ 1T features is reasonable.
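The bandwidth figure is simple arithmetic; the bytes-per-feature value below is an illustrative assumption, not a number from the talk:

```python
# 1 Gb/s Ethernet expressed in bytes per hour.
bytes_per_hour = 1e9 / 8 * 3600        # 4.5e11 bytes = 450 GB/hour
print(bytes_per_hour / 1e9)            # 450.0 (GB/hour)

# Assuming a couple of bytes per (compressed) nonzero feature (an assumption
# for illustration), a terafeature dataset streams through in a few hours.
bytes_per_feature = 2.0
print(1e12 * bytes_per_feature / bytes_per_hour)   # ~4.4 hours
```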

Outline

1 Multicore parallelization

2 Multinode parallelization


How do we parallelize across multiple cores?

Answer 1: It's no use, because it doesn't address the bandwidth problem. But there's a trick. Sometimes you care about the interaction of two sets of features (queries with results, for example). Tweak the algorithm so as to specify (query features, result features), then use a fast hash to compute the outer product in the core.

Possibilities:

1 Example Sharding: Each core handles an example subset.

2 Feature Sharding: Each core handles a feature subset.

Empirically: Feature Sharding > Example Sharding. Both work on two cores, but Example Sharding doesn't scale; Feature Sharding provides about a 3x speedup on 4 cores. But, again, this is just for a special case. Need multinode parallelization to address data scaling.
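A hedged sketch of the hashed outer-product trick (the hash function, feature naming, and table size below are illustrative assumptions, not VW's exact scheme):

```python
# Sketch of hashed outer-product (cross) features between two feature sets.
import zlib

NUM_BINS = 1 << 18   # size of the hashed weight vector (assumption)

def hashed_cross(query_feats, result_feats):
    """Map each (query feature, result feature) pair to a hashed index."""
    crossed = {}
    for qf, qv in query_feats.items():
        for rf, rv in result_feats.items():
            idx = zlib.crc32(f"{qf}^{rf}".encode()) % NUM_BINS
            crossed[idx] = crossed.get(idx, 0.0) + qv * rv
    return crossed

# Usage: the crossed features feed the same linear learner as before.
q = {"query:cheap": 1.0, "query:flights": 1.0}
r = {"ad:airline": 1.0, "ad:discount": 1.0}
x = hashed_cross(q, r)   # 4 hashed pair features
```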


Algorithms for Speed

1 Multicore parallelization

2 Multinode parallelization

Multinode = inevitable delay

Ethernet latency = 0.1 milliseconds = $10^5$ cycles = many examples.

1 Example Sharding ⇒ weights out of sync by delay factor (see the sketch after this list).

2 Feature Sharding ⇒ global predictions delayed by delay factor.
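A toy illustration of the first point above: gradients end up being computed against weights that are roughly `delay` updates stale. The loss, step size, and bookkeeping are assumptions for the sketch, not the analyzed algorithms.

```python
# Toy sketch: gradients computed against stale weights, as when updates from
# other example shards arrive `delay` steps late. Illustrative only.
def delayed_sgd(examples, delay=10, eta=0.01):
    w = 0.0
    stale = [0.0] * (delay + 1)          # circular buffer of past weights
    for t, (x, y) in enumerate(examples):
        w_old = stale[t % (delay + 1)]   # weights from ~`delay` steps ago
        grad = 2.0 * (w_old * x - y) * x # squared-loss gradient at stale weights
        w -= eta * grad                  # ...applied to the current weights
        stale[t % (delay + 1)] = w
    return w

# With delay=0 this is plain SGD; compare convergence with and without delay.
print(delayed_sgd([(1.0, 0.5)] * 300, delay=0))
print(delayed_sgd([(1.0, 0.5)] * 300, delay=10))
```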

How bad is delay?

Theorem: (Mesterharm 2005) Delayed updates reduce convergenceby delay factor in worst case for expert algorithms.Theorem: (LSZ NIPS 2009) Same for linear predictors.(Caveat: there are some special cases where you can do better.)Empirically: Delay can hurt substantially when examples arestructured.What do we do?


How can we avoid delay?

[Diagram: the (Features, Label) stream is split across Feature Shard nodes; each shard runs Predict & Learn on its feature subset, and the shards' Predictions feed a final Predict & Learn stage.]

Observations about Feed Forward

1 No longer the same algorithm; it's designed for parallel environments.

2 Bandwidth = a few bytes per example, per node ⇒ Tera-example feasible with a single master, arbitrarily more with a hierarchical structure.

3 No delay.

4 Feature Shard is stateless ⇒ parallelizable & cacheable (the scheme is sketched below).
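A minimal sketch of this feed-forward arrangement, assuming two feature shards, a simple modulo split, and a master that learns on the shards' predictions. The learner class, split rule, and single-process simulation are illustrative assumptions, not the actual multinode implementation.

```python
# Sketch of the feed-forward scheme: each node learns on its own feature
# shard; the master learns on the shards' predictions. Illustrative only.
import math

class LinearLearner:
    """Tiny online squared-loss learner over sparse dict examples."""
    def __init__(self, eta=0.1):
        self.w, self.eta = {}, eta
    def predict(self, x):
        norm = math.sqrt(len(x)) or 1.0
        raw = sum(self.w.get(i, 0.0) * v for i, v in x.items()) / norm
        return min(max(raw, 0.0), 1.0)
    def update(self, x, y):
        y_hat = self.predict(x)
        norm = math.sqrt(len(x)) or 1.0
        g = self.eta * 2.0 * (y - y_hat) / norm
        for i, v in x.items():
            self.w[i] = self.w.get(i, 0.0) + g * v
        return y_hat

shards = [LinearLearner(), LinearLearner()]   # one learner per feature shard
master = LinearLearner()                      # learns on shard predictions

def shard_of(feature_index):                  # illustrative split rule
    return feature_index % len(shards)

def feed_forward(x, y):
    # Each shard predicts (and learns) on its slice of the features...
    parts = [{i: v for i, v in x.items() if shard_of(i) == s}
             for s in range(len(shards))]
    preds = {s: (shards[s].update(parts[s], y) if parts[s] else 0.5)
             for s in range(len(shards))}
    # ...and the master treats those predictions as its input features.
    return master.update(preds, y)
```

In a real deployment each shard would run on its own node, and only its few-byte prediction would cross the network, which is where the bandwidth saving in observation 2 comes from.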

Bad News: Feed Forward can't compete with general linear predictors

Probability   y   x1   x2   x3
0.25          1    1    1    0
0.125         1    1    0    1
0.125         1    0    1    1
0.25          0    0    0    1
0.125         0    1    0    0
0.125         0    0    1    0

Features 1 & 2 are imperfect predictors. Feature 3 is uncorrelated with the truth. The optimal predictor = majority vote on all 3 features.
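A short check of these claims, computed directly from the table above (an illustrative verification, not from the talk):

```python
# Each of x1, x2 alone is only 75% accurate, x3 alone is 50% (uncorrelated),
# while majority vote over all three features is perfect.
rows = [  # (probability, y, x1, x2, x3)
    (0.25, 1, 1, 1, 0), (0.125, 1, 1, 0, 1), (0.125, 1, 0, 1, 1),
    (0.25, 0, 0, 0, 1), (0.125, 0, 1, 0, 0), (0.125, 0, 0, 1, 0),
]

def accuracy(predictor):
    return sum(p for p, y, *x in rows if predictor(x) == y)

print(accuracy(lambda x: x[0]))              # 0.75  (x1 alone)
print(accuracy(lambda x: x[1]))              # 0.75  (x2 alone)
print(accuracy(lambda x: x[2]))              # 0.5   (x3 alone)
print(accuracy(lambda x: int(sum(x) >= 2)))  # 1.0   (majority vote)
```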

Good news

If the Naive Bayes assumption holds, P(x1 | y) P(x2 | y) = P(x1, x2 | y), you win.

Better news: x1 = first shard, x2 = second shard.

Even better: There are more complex problem classes for which this also works.
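A one-step justification of why the condition suffices (a standard Bayes-rule manipulation, not from the slides): under conditional independence the joint posterior factors into the per-shard posteriors, so a combiner that only sees $P(y \mid x_1)$ and $P(y \mid x_2)$ loses nothing:

$$P(y \mid x_1, x_2) \;\propto\; P(y)\,P(x_1, x_2 \mid y) \;=\; P(y)\,P(x_1 \mid y)\,P(x_2 \mid y) \;\propto\; \frac{P(y \mid x_1)\,P(y \mid x_2)}{P(y)}.$$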

Initial experiments on a medium size text Ad dataset @ Yahoo!

1 ∼100GB when gzip compressed.

2 ∼10M examples.

3 ∼125G nonzero features

4 Uses outer-product features

Relative progressive validation (BKL, COLT 1999) squared loss & relative wall-clock time reported.

[Plot: Initial Experiments, Sharding & Training. Relative squared loss and relative time vs. shard count (1, 2, 4, 8).]

[Plot: Initial Experiments, Training & Combining. Relative squared loss and relative time vs. shard count (1, 2, 4, 8).]

Final thoughts

About a 6x speedup achieved over the sequential system so far.

This general approach, unlike averaging approaches, is fully applicable to nonlinear systems.

Code at: http://github.com/JohnLangford/vowpal_wabbit. Patches welcome. Much more work needs to be done.

Some further discussion @ http://hunch.net