Vowpal Wabbit A Machine Learning System
Vowpal Wabbit: A Machine Learning System
John Langford, Microsoft Research
http://hunch.net/~vw/
git clone git://github.com/JohnLangford/vowpal_wabbit.git
Why does Vowpal Wabbit exist?
1. To prove research ideas.
2. Curiosity.
3. Perfectionism.
4. To solve problems better.
A user base becomes addictive
1. Mailing list of >400.
2. The official strawman for large-scale logistic regression @ NIPS :-)
An example
wget http://hunch.net/~jl/VW_raw.tar.gz
vw -c rcv1.train.raw.txt -b 22 --ngram 2 --skips 4 -l 0.25 --binary
provides stellar performance in 12 seconds.
Surface details
1. BSD license, automated test suite, github repository.
2. VW supports all I/O modes: executable, library, port, daemon, service (see next).
3. VW has a reasonable++ input format: sparse, dense, namespaces, etc.
4. Mostly C++, but bindings in other languages of varying maturity (python, C#, Java good).
5. A substantial user base + developer base. Thanks to many who have helped.
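For reference on item 3, a minimal sketch of the input format: a label, an optional importance weight, then `|`-delimited namespaces with sparse features (these particular tokens and namespace names are invented for illustration):

```
1 |subject meeting at noon |body please review the attached
-1 2.5 |subject votre apotheke en ligne |body euro pharmacy
```

Here 1/-1 are binary labels and 2.5 is an importance weight on the second example.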
VW service
http://tinyurl.com/vw-azureml
Problem: How to deploy a model for large-scale use?
Solution:
This Tutorial in 4 parts
How do you:
1. use all the data?
2. solve the right problem?
3. solve complex joint problems?
4. solve interactive problems?
Using all the data: Step 1
Small RAM + large data ⇒ Online Learning
An active research area; 4-5 papers related to online learning algorithms in VW.
Using all the data: Step 2
1. 3.2 × 10^6 labeled emails.
2. 433,167 users.
3. ~40 × 10^6 unique tokens.
How do we construct a spam filter which is personalized, yet uses global information?
Bad answer: Construct a global filter + 433,167 personalized filters using a conventional hashmap to specify features. This might require 433167 × 40 × 10^6 × 4 bytes ≈ 70 terabytes of RAM.
Using Hashing
Use hashing to predict according to: ⟨w, φ(x)⟩ + ⟨w, φ_u(x)⟩
[diagram: a text document (email) is tokenized into a bag of words, each token duplicated with a user-specific prefix (NEU, Votre, Apotheke, en, ligne, Euro, ... alongside USER123_NEU, USER123_Votre, USER123_Apotheke, ...), then hashed into a sparse vector for linear classification w^T x_h]
(in VW: specify the userid as a feature and use -q)
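The construction above can be sketched in a few lines of Python. This is an illustrative sketch, not VW's actual implementation: the hash function (crc32), the bit width, and the `USER123_`-style token prefixing are assumptions made for the example.

```python
# Sketch of personalized feature hashing: each token is hashed twice,
# once globally and once prefixed with the user id, into one shared
# weight vector of size 2^b.

import zlib

def hashed_features(tokens, user, b=22):
    """Map tokens to sparse indices in a 2^b-dimensional space."""
    mask = (1 << b) - 1
    indices = []
    for tok in tokens:
        indices.append(zlib.crc32(tok.encode()) & mask)                  # global feature
        indices.append(zlib.crc32((user + "_" + tok).encode()) & mask)   # personalized feature
    return indices

def predict(w, indices):
    # Linear score <w, phi(x)> + <w, phi_u(x)>: both parts live in the same hashed vector.
    return sum(w[i] for i in indices)
```

With b = 26 the shared table holds 2^26 weights, matching the 64M-parameter figure on the Results slide.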
Results
[bar chart: spam filtering error relative to baseline as a function of b, the number of bits in the hashtable (18-26), comparing global-only, personalized, and baseline variants]
2^26 parameters = 64M parameters = 256MB of RAM.
A roughly 270,000× savings in RAM requirements.
Applying for a fellowship in 1997
Interviewer: So, what do you want to do?
John: I'd like to solve AI.
I: How?
J: I want to use parallel learning algorithms to create fantastic learning machines!
I: You fool! The only thing parallel machines are good for is computational windtunnels!
The worst part: he had a point.
Using all the data: Step 3
Given 2.1 terafeatures of data, how can you learn a good linear predictor f_w(x) = Σ_i w_i x_i?
17B examples, 16M parameters, 1K nodes. How long does it take?
70 minutes = 500M features/second: faster than the I/O bandwidth of a single machine ⇒ faster than all possible single-machine linear learning algorithms.
MPI-style AllReduce
AllReduce = Reduce + Broadcast.
[diagram sequence: seven nodes holding the values 1-7 are arranged into a binary tree; reducing sums the values up the tree until the root holds 28; broadcasting then pushes 28 back down, so in the final state every node holds 28]
Properties:
1. How long does it take? O(1) time(*)
2. How much bandwidth? O(1) bits(*)
3. How hard to program? Very easy
(*) When done right.
An Example Algorithm: Weight averaging
n = AllReduce(1)
While (pass number < max)
1. While (examples left)
   1.1 Do online update.
2. AllReduce(weights)
3. For each weight w ← w/n
Code tour
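The averaging scheme above can be simulated in a single process. In this hypothetical sketch, `allreduce` just sums across simulated nodes, and `sgd_pass` is a stand-in online learner, not VW's update rule:

```python
# Simulate the weight-averaging algorithm: each node trains online on its
# own shard, then AllReduce sums the weights and each node divides by n.

def allreduce(values):
    """Reduce (sum) + broadcast: every node receives the total."""
    total = sum(values)
    return [total for _ in values]

def sgd_pass(w, shard, lr=0.1):
    # One online pass of least-squares SGD on (x, y) pairs with scalar x.
    for x, y in shard:
        w -= lr * (w * x - y) * x
    return w

shards = [[(1.0, 2.0)] * 50, [(1.0, 4.0)] * 50]   # two simulated nodes
n = allreduce([1] * len(shards))[0]                # n = AllReduce(1) counts nodes
weights = [sgd_pass(0.0, s) for s in shards]       # local online updates
weights = [w / n for w in allreduce(weights)]      # average via AllReduce
```

After averaging, every simulated node holds the same weight vector, as in the final AllReduce state.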
What is Hadoop AllReduce?
1. "Map" job moves the program to the data.
2. Delayed initialization: Most failures are disk failures. First read (and cache) all data, before initializing allreduce.
3. Speculative execution: In a busy cluster, one node is often slow. Use speculative execution to start additional mappers.
Robustness & Speedup
[figure: speedup vs. number of nodes (10-100); curves Average_10, Min_10, Max_10, and a linear reference]
Splice Site Recognition
[figures: auPRC vs. iteration, comparing Online, L-BFGS w/ 5 online passes, L-BFGS w/ 1 online pass, and L-BFGS; the right panel zooms in on auPRC 0.466-0.484 over the first 20 iterations]
Splice Site Recognition
[figure: auPRC vs. effective number of passes over the data, comparing L-BFGS w/ one online pass, Zinkevich et al., and Dekel et al.]
This Tutorial in 4 parts
How do you:
1. use all the data?
2. solve the right problem?
3. solve complex joint problems easily?
4. solve interactive problems?
Applying Machine Learning in Practice
1. Ignore the mismatch. Often faster.
2. Understand the problem and find a more suitable tool. Often better.
Importance-Weighted Classification
- Given training data {(x_1, y_1, c_1), ..., (x_n, y_n, c_n)}, produce a classifier h : X → {0, 1}.
- Unknown underlying distribution D over X × {0, 1} × [0, ∞).
- Find h with small expected cost: ℓ(h, D) = E_{(x,y,c)∼D}[c · 1(h(x) ≠ y)]
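A minimal sketch of learning under importance weights, where the cost c scales each update. The perceptron-style rule here is an illustration of the idea, not VW's update:

```python
# Importance-weighted online linear classifier: the cost c of mispredicting
# an example scales its gradient step, so costly examples matter more.

def train(examples, dim, lr=0.5, passes=10):
    w = [0.0] * dim
    for _ in range(passes):
        for x, y, c in examples:            # x: feature vector, y in {0,1}, c >= 0
            score = sum(wi * xi for wi, xi in zip(w, x))
            pred = 1 if score > 0 else 0
            if pred != y:                   # importance c scales the update
                sign = 1 if y == 1 else -1
                w = [wi + lr * c * sign * xi for wi, xi in zip(w, x)]
    return w
```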
Where does this come up?
1. Spam Prediction (Ham predicted as Spam is much worse than Spam predicted as Ham.)
2. Distribution Shifts (Optimize search engine results for monetizing queries.)
3. Boosting (Reweight problem examples for residual learning.)
4. Large Scale Learning (Downsample the common class and importance weight to compensate.)
Multiclass Classification
Distribution D over X × Y, where Y = {1, ..., k}. Find a classifier h : X → Y minimizing the multi-class loss on D:
ℓ_k(h, D) = Pr_{(x,y)∼D}[h(x) ≠ y]
1. Categorization: Which of k things is it?
2. Actions: Which of k choices should be made?
Use in VW
Multiclass label format: Label [Importance] ['Tag]
Methods:
--oaa k: one-against-all prediction. O(k) time. The baseline.
--ect k: error correcting tournament. O(log(k)) time.
--log_multi n: adaptive log time. O(log(n)) time.
One-Against-All (OAA)
Create k binary problems, one per class. For class i predict "Is the label i or not?"
(x, y) ↦ (x, 1(y = 1)), (x, 1(y = 2)), ..., (x, 1(y = k))
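The reduction above can be sketched as follows (hypothetical helper names; a simple perceptron stands in for the underlying binary learner):

```python
# One-against-all: train k binary scorers, predict the argmax score.

def train_oaa(examples, k, dim, lr=0.5, passes=20):
    W = [[0.0] * dim for _ in range(k)]        # one weight vector per class
    for _ in range(passes):
        for x, y in examples:                  # y in {0, ..., k-1}
            for i in range(k):
                target = 1 if y == i else -1   # binary problem "label i or not"
                score = sum(wi * xi for wi, xi in zip(W[i], x))
                if target * score <= 0:        # perceptron update on mistake
                    W[i] = [wi + lr * target * xi for wi, xi in zip(W[i], x)]
    return W

def predict_oaa(W, x):
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    return max(range(len(W)), key=lambda i: scores[i])
```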
The inconsistency problem
Given an optimal binary classifier, one-against-all doesn't produce an optimal multiclass classifier.

Prob(label|features): | 1/2 − δ | 1/4 + δ/2 | 1/4 + δ/2 | Prediction
1v23                  |    1    |     0     |     0     |     0
2v13                  |    0    |     1     |     0     |     0
3v12                  |    0    |     0     |     1     |     0

Solution: always use one-against-all regression.
Cost-sensitive Multiclass
Cost-sensitive multiclass classification: Distribution D over X × [0, 1]^k, where a vector in [0, 1]^k specifies the cost of each of the k choices. Find a classifier h : X → {1, ..., k} minimizing the expected cost:
cost(h, D) = E_{(x,c)∼D}[c_{h(x)}]
1. Is this packet {normal, error, attack}?
2. A subroutine used later...
Use in VW
Label information via sparse vector.
A test example:
|Namespace Feature
A test example with only classes 1, 2, 4 valid:
1: 2: 4: |Namespace Feature
A training example with only classes 1, 2, 4 valid:
1:0.4 2:3.1 4:2.2 |Namespace Feature
Methods:
--csoaa k: cost-sensitive OAA prediction. O(k) time.
--csoaa_ldf: label-dependent features OAA.
--wap_ldf: LDF weighted-all-pairs.
Code Tour
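The csoaa idea, regress the cost of each class and predict the argmin, can be sketched like this (illustrative squared-loss regressors; the names and learning rate are invented for the example, and this is not VW's reduction code):

```python
# Cost-sensitive one-against-all: one cost regressor per class,
# prediction is the argmin of predicted costs.

def train_csoaa(examples, k, dim, lr=0.1, passes=50):
    W = [[0.0] * dim for _ in range(k)]          # per-class cost regressor
    for _ in range(passes):
        for x, costs in examples:                # costs: length-k vector in [0, 1]
            for i in range(k):
                pred = sum(wi * xi for wi, xi in zip(W[i], x))
                grad = pred - costs[i]           # squared-loss gradient
                W[i] = [wi - lr * grad * xi for wi, xi in zip(W[i], x)]
    return W

def predict_csoaa(W, x):
    costs = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    return min(range(len(W)), key=lambda i: costs[i])
```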
This Tutorial in 4 parts
How do you:
1. use all the data?
2. solve the right problem?
3. solve complex joint problems easily?
4. solve interactive problems?
The Problem: Joint Prediction
How?
1. Each prediction is independent.
2. Multitask learning.
3. Assume a tractable graphical model, optimize.
4. Hand-crafted approaches.
What makes a good solution?
1. Programming complexity. Most complex problems are addressed independently; it is too complex to do otherwise.
2. Prediction accuracy. It had better work well.
3. Train speed. Debug/development productivity + maximum data input.
4. Test speed. Application efficiency.
A program complexity comparison
[bar chart, log scale from 1 to 1000: lines of code for POS tagging, comparing CRFSGD, CRF++, S-SVM, and Search]
POS Tagging (tuned hps)
[figure: per-word accuracy vs. training time (seconds to hours, log scale) for OAA, L2S, L2S (ft), CRFsgd, CRF++, StrPerc, StrSVM, StrSVM2; accuracies range from 90.7 to 96.6]
Prediction (test-time) Speed
[bar chart: thousands of tokens per second on NER and POS for L2S, L2S (ft), CRFsgd, CRF++, StrPerc, StrSVM, StrSVM2]
How do you program?
Sequential_RUN(examples)
1: for i = 1 to len(examples) do
2:   prediction ← predict(examples[i], examples[i].label)
3:   loss(prediction ≠ examples[i].label)
4: end for
In essence, write the decoder, providing a little bit of side information for training.
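A runnable toy version of this pattern, where the same decoding loop serves both training and testing (the memorizing `Searcher` below is a deliberately trivial stand-in for a learned predictor, not VW's learning-to-search machinery):

```python
# Minimal imitation of the "write the decoder once" pattern: the same loop
# runs for training (reference labels available, model updated) and testing.

class Searcher:
    def __init__(self):
        self.memory = {}          # trivially memorizing "model" for illustration

    def predict(self, features, ref=None):
        if ref is not None:                      # training: learn from the reference
            self.memory[features] = ref
        return self.memory.get(features, 0)

def sequential_run(searcher, examples, train=True):
    total_loss = 0
    for features, label in examples:
        pred = searcher.predict(features, label if train else None)
        total_loss += int(pred != label)         # Hamming loss per position
    return total_loss
```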
RunParser(sentence)

1: stack S ← {Root}
2: buffer B ← [words in sentence]
3: arcs A ← ∅
4: while B ≠ ∅ or |S| > 1 do
5:   ValidActs ← GetValidActions(S, B)
6:   features ← GetFeat(S, B, A)
7:   ref ← GetGoldAction(S, B)
8:   action ← predict(features, ref, ValidActs)
9:   S, B, A ← Transition(S, B, A, action)
10: end while
11: loss(A[w] ≠ A*[w], ∀w ∈ sentence)
12: return output
![Page 82: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/82.jpg)
How does it work?
An application of "Learning to Search" algorithms (e.g. Searn, DAgger, LOLS [ICML 2015]).

The decoder is run many times at train time to optimize predict(...) for loss(...). See the tutorial with Hal Daumé III @ ICML 2015 and the LOLS paper @ ICML 2015.
![Page 83: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/83.jpg)
Named Entity Recognition
Is this word part of an organization, person, or not?
[Chart: Named entity recognition (tuned hps). F-score (per entity) vs. training time (log scale, 10^-1 to 10^1 minutes; marks at 10s, 1m, 10m); F-scores range from 73.3 to 80.0.]
![Page 84: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/84.jpg)
Entity Relation
Goal: find the Entities and then find their Relations.

Method          Entity F1   Relation F1   Train Time
Structured SVM  88.00       50.04         300 seconds
L2S             92.51       52.03         13 seconds

Requires about 100 LOC.
![Page 85: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/85.jpg)
Dependency Parsing
Goal: Find dependency structure of sentence.
[Bar chart: UAS (higher = better), range roughly 70-95, per language — Ar*, Bu, Ch, Da, Du, En, Ja, Po*, Sl*, Sw, Avg — for L2S, Dyna, and SNN.]
Requires about 300 LOC.
![Page 86: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/86.jpg)
A demonstration
wget http://bilbo.cs.uiuc.edu/~kchang10/tmp/wsj.vw.zip
vw -b 24 -d wsj.train.vw -c --search_task sequence --search 45 --search_alpha 1e-8 --search_neighbor_features -1:w,1:w --affix -1w,+1w -f foo.reg
vw -t -i foo.reg wsj.test.vw
![Page 87: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/87.jpg)
This Tutorial in 4 parts
How do you:
1. use all the data?
2. solve the right problem?
3. solve complex joint problems easily?
4. solve interactive problems?
![Page 88: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/88.jpg)
Examples of Interactive Learning
Repeatedly:
1. A user comes to Microsoft (with a history of previous visits, IP address, data related to an account)

2. Microsoft chooses information to present (urls, ads, news stories)

3. The user reacts to the presented information (clicks on something, comes back and clicks again, ...)

Microsoft wants to interactively choose content and use the observed feedback to improve future content choices.
![Page 89: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/89.jpg)
Another Example: Clinical Decision Making
Repeatedly:
1. A patient comes to a doctor with symptoms, medical history, test results
2. The doctor chooses a treatment
3. The patient responds to it
The doctor wants a policy for choosing targeted treatments for individual patients.
![Page 90: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/90.jpg)
The Contextual Bandit Setting
For t = 1, . . . ,T :
1. The world produces some context x ∈ X
2. The learner chooses an action a ∈ A
3. The world reacts with reward ra ∈ [0, 1]
Goal: Learn a good policy for choosing actionsgiven context.
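The protocol above can be simulated directly. The context set, reward table (values borrowed from the direct-method example in this deck), and uniform-random policy below are illustrative stand-ins, not part of VW:

```python
import random

random.seed(0)
ACTIONS = [0, 1]

# True expected rewards per (context, action) — unknown to the learner.
TRUE_REWARD = {("x1", 0): 0.8, ("x1", 1): 1.0,
               ("x2", 0): 0.3, ("x2", 1): 0.2}

def world_context():
    return random.choice(["x1", "x2"])

def uniform_policy(x):
    # A deliberately naive policy: ignore the context entirely.
    return random.choice(ACTIONS)

total = 0.0
T = 1000
for t in range(T):
    x = world_context()        # 1. the world produces some context x
    a = uniform_policy(x)      # 2. the learner chooses an action a
    r = TRUE_REWARD[(x, a)]    # 3. the world reacts with reward in [0, 1]
    total += r

print(total / T)  # average reward of the uniform policy
```

The goal of contextual bandit learning is to do better than this baseline by mapping contexts to actions, using only the rewards of the actions actually taken.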
![Page 91: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/91.jpg)
The "Direct method"

Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).

Example: Deployed policy always takes a1 on x1 and a2 on x2.

Observed/Estimated/True

        a1              a2
x1   .8 / .8 / .8     ? /  -  / 1
x2    ? /  -  / .3   .2 /  -  / .2
![Page 93: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/93.jpg)
The "Direct method"

Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).

Example: Deployed policy always takes a1 on x1 and a2 on x2.

Observed

        a1    a2
x1     .8     ?
x2      ?    .2
![Page 94: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/94.jpg)
The "Direct method"

Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).

Example: Deployed policy always takes a1 on x1 and a2 on x2.

Observed/Estimated/True

        a1             a2
x1   .8 / .8 / .8    ? / .5 / 1
x2    ? / .5 / .3   .2 / .2 / .2
![Page 95: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/95.jpg)
The "Direct method"

Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).

Example: Deployed policy always takes a1 on x1 and a2 on x2.

Observed/Estimated/True

        a1             a2
x1   .8 / .8 / .8    ? / .5 / 1
x2   .3 / .5 / .3   .2 / .2 / .2
![Page 96: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/96.jpg)
The "Direct method"

Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).

Example: Deployed policy always takes a1 on x1 and a2 on x2.

Observed/Estimated/True

        a1              a2
x1   .8 / .8 / .8    ? / .514 / 1
x2   .3 / .3 / .3   .2 / .014 / .2
![Page 97: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/97.jpg)
The "Direct method"

Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).

Example: Deployed policy always takes a1 on x1 and a2 on x2.

Observed/Estimated/True

        a1              a2
x1   .8 / .8 / .8    ? / .514 / 1
x2   .3 / .3 / .3   .2 / .014 / .2
![Page 98: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/98.jpg)
The "Direct method"

Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).

Example: Deployed policy always takes a1 on x1 and a2 on x2.

Observed/Estimated/True

        a1              a2
x1   .8 / .8 / .8    ? / .514 / 1
x2   .3 / .3 / .3   .2 / .014 / .2

Basic observation 1: Generalization insufficient.
![Page 99: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/99.jpg)
The "Direct method"

Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).

Example: Deployed policy always takes a1 on x1 and a2 on x2.

Observed/Estimated/True

        a1              a2
x1   .8 / .8 / .8    ? / .514 / 1
x2   .3 / .3 / .3   .2 / .014 / .2

Basic observation 2: Exploration required.
![Page 100: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/100.jpg)
The "Direct method"

Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).

Example: Deployed policy always takes a1 on x1 and a2 on x2.

Observed/Estimated/True

        a1              a2
x1   .8 / .8 / .8    ? / .514 / 1
x2   .3 / .3 / .3   .2 / .014 / .2

Basic observation 3: Errors ≠ exploration.
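The failure mode in this table can be checked numerically. The dictionaries below just memorize the table's estimated and true rewards (including the estimates .514 and .014 that the regressor fills in for the unobserved cells):

```python
# Estimated and true rewards from the slides' final table.
estimated = {("x1", "a1"): 0.8, ("x1", "a2"): 0.514,
             ("x2", "a1"): 0.3, ("x2", "a2"): 0.014}
true_r    = {("x1", "a1"): 0.8, ("x1", "a2"): 1.0,
             ("x2", "a1"): 0.3, ("x2", "a2"): 0.2}

def direct_method(x):
    # Act greedily on the learned reward predictor r_hat.
    return max(["a1", "a2"], key=lambda a: estimated[(x, a)])

for x in ["x1", "x2"]:
    chosen = direct_method(x)
    best = max(["a1", "a2"], key=lambda a: true_r[(x, a)])
    print(x, chosen, best)
# On x1 the direct method keeps taking a1 (estimated .8 > .514) even
# though a2's true reward is 1: generalization from logged data alone
# never corrects the unexplored cell, so the error persists forever.
```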
![Page 101: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/101.jpg)
The Evaluation Problem
Let π : X → A be a policy mapping features to actions. How do we evaluate it?
Method 1: Deploy algorithm in the world.
Very Expensive!
![Page 103: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/103.jpg)
The Importance Weighting Trick
Let π : X → A be a policy mapping features to actions. How do we evaluate it?

One answer: Collect T exploration samples (x, a, ra, pa), where
  x = context
  a = action
  ra = reward for action
  pa = probability of action a

then evaluate:

Value(π) = Average( ra · 1(π(x) = a) / pa )
![Page 105: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/105.jpg)
The Importance Weighting Trick
Theorem
For all policies π, for all IID data distributions D, Value(π) is an unbiased estimate of the expected reward of π:

E_{(x, r⃗) ∼ D}[ r_{π(x)} ] = E[ Value(π) ]

Proof: E_{a ∼ p}[ ra · 1(π(x) = a) / pa ] = Σ_a pa · ra · 1(π(x) = a) / pa = r_{π(x)}

Example:

Action        1       2
Reward       0.5      1
Probability  1/4     3/4
Estimate    2 | 0   0 | 4/3

(Each Estimate cell reads "value if π picks action 1 | value if π picks action 2", given which action was logged.)
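The example's estimates can be reproduced in a few lines; this just evaluates the importance-weighted estimator on each possible logged action and then averages over the logging distribution to check unbiasedness:

```python
# Two actions with rewards (0.5, 1) logged with probabilities (1/4, 3/4),
# as in the example. For a logged sample (a, r_a, p_a), the importance-
# weighted estimate for a policy choosing pi_action is
# r_a * 1(pi_action == a) / p_a.
reward = {1: 0.5, 2: 1.0}
prob   = {1: 0.25, 2: 0.75}

def estimate(pi_action, logged_action):
    r, p = reward[logged_action], prob[logged_action]
    return r * (pi_action == logged_action) / p

# Rows of the "Estimate" table: logged action 1 gives (2, 0),
# logged action 2 gives (0, 4/3).
print([estimate(pi, 1) for pi in (1, 2)])   # [2.0, 0.0]
print([estimate(pi, 2) for pi in (1, 2)])   # [0.0, 1.333...]

# Unbiasedness: averaging over the logging distribution recovers
# each policy's true reward (0.5 for pi=1, 1 for pi=2).
for pi in (1, 2):
    value = sum(prob[a] * estimate(pi, a) for a in (1, 2))
    print(pi, value)
```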
![Page 108: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/108.jpg)
How do you test things?
Use the format:
action:cost:probability | features

Example:
1:1:0.5 | tuesday year million short compan vehicl line stat financ commit exchang plan corp subsid credit issu debt pay gold bureau prelimin refin billion telephon time draw basic relat file spokesm reut secur acquir form prospect period interview regist toront resourc barrick ontario qualif bln prospectus convertibl vinc borg arequip...
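A logged example in this format is easy to parse outside of VW as well. A minimal reader (an illustrative sketch, not VW's own parser; it handles the single-action case shown above):

```python
def parse_cb_line(line):
    """Parse a contextual-bandit line: 'action:cost:probability | features'."""
    label_part, feature_part = line.split("|", 1)
    action, cost, prob = label_part.strip().split(":")
    return {"action": int(action),
            "cost": float(cost),
            "probability": float(prob),
            "features": feature_part.split()}

ex = parse_cb_line("1:1:0.5 | tuesday year million short compan vehicl")
print(ex["action"], ex["cost"], ex["probability"], len(ex["features"]))
```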
![Page 109: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/109.jpg)
How do you train?
Reduce to cost-sensitive classification.

vw --cb 2 --cb_type dr rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.25
Progressive 0/1 loss: 0.04582

vw --cb 2 --cb_type ips rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.125
Progressive 0/1 loss: 0.05065

vw --cb 2 --cb_type dm rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.125
Progressive 0/1 loss: 0.04679
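--cb_type selects how the reduction estimates each action's cost: ips (inverse propensity score), dm (direct method, a regression estimate), or dr (doubly robust, combining the two). A sketch of the three estimators for a single logged sample; `cost_hat` stands in for a hypothetical regressor's per-action cost estimates:

```python
def ips(a, cost, p, logged_a):
    # Inverse propensity score: unbiased, but high variance when p is small.
    return cost / p if a == logged_a else 0.0

def dm(a, cost_hat):
    # Direct method: low variance, but biased if the regressor is wrong.
    return cost_hat[a]

def dr(a, cost, p, logged_a, cost_hat):
    # Doubly robust: regression estimate plus an IPS correction on the
    # observed action; accurate if either component is accurate.
    correction = (cost - cost_hat[a]) / p if a == logged_a else 0.0
    return cost_hat[a] + correction

# One logged sample: action 1 taken with probability 0.5 at cost 1.0.
cost_hat = {1: 0.8, 2: 0.3}
print(ips(1, 1.0, 0.5, 1))           # 2.0
print(dm(1, cost_hat))               # 0.8
print(dr(1, 1.0, 0.5, 1, cost_hat))  # 0.8 + (1.0 - 0.8)/0.5 ≈ 1.2
```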
![Page 112: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/112.jpg)
Reminder: Contextual Bandit Setting
For t = 1, . . . ,T :
1. The world produces some context x ∈ X
2. The learner chooses an action a ∈ A
3. The world reacts with reward ra ∈ [0, 1]
Goal: Learn a good policy for choosing actions given context.
What does learning mean? Efficiently competing with some large reference class of policies Π = {π : X → A}:

Regret = max_{π ∈ Π} average_t ( r_{π(x)} − r_a )
![Page 113: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/113.jpg)
Building an Algorithm
Let Q1 = uniform distribution
For t = 1, . . . ,T :
1. The world produces some context x ∈ X
2. Draw π ∼ Qt
3. The learner chooses an action a ∈ A using π(x).
4. The world reacts with reward ra ∈ [0, 1]
5. Update Qt+1
![Page 115: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/115.jpg)
What is good Qt?
▶ Exploration: Qt allows discovery of good policies

▶ Exploitation: Qt small on bad policies
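The simplest distribution that trades these off is epsilon-greedy (one of the baselines in the charts that follow): put most of the mass on the empirically best choice and spread the rest uniformly. A sketch over actions, as an illustrative special case of the policy-distribution view:

```python
def epsilon_greedy(avg_reward, n_actions, epsilon=0.1):
    """Distribution over actions: mass 1 - epsilon on the empirically
    best action, epsilon spread uniformly, so every action keeps
    probability >= epsilon / n_actions (enabling importance weighting)."""
    greedy = max(range(n_actions), key=lambda a: avg_reward.get(a, 0.0))
    q = [epsilon / n_actions] * n_actions
    q[greedy] += 1 - epsilon
    return q

q = epsilon_greedy({0: 0.2, 1: 0.7}, n_actions=2, epsilon=0.1)
print(q)  # mass ~0.95 on the greedy action, ~0.05 on the other
```

Because every action retains nonzero probability, the logged (action, reward, probability) triples support the importance-weighted evaluation described earlier.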
![Page 116: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/116.jpg)
How do you find Qt?
by Reduction ... [details complex, but coded]
![Page 118: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/118.jpg)
How well does this work?
[Bar chart: losses on the CCAT RCV1 problem for eps-greedy, tau-first, LinUCB*, and Cover; loss axis 0 to 0.12.]
![Page 119: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/119.jpg)
How long does this take?
[Bar chart: running times on the CCAT RCV1 problem for eps-greedy, tau-first, LinUCB*, and Cover; seconds, log scale 1 to 10^6.]
![Page 120: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/120.jpg)
Next for Vowpal Wabbit
Version 8 series just started.
Primary goal: New research (as always) + tackle deployability:
1. Backwards compatibility of model to VW.
2. More serious testing.
3. More serious documentation.
4. More and better library interfaces.
5. Dynamic module loading.
![Page 121: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/121.jpg)
Further reading
VW wiki: https://github.com/JohnLangford/vowpal_wabbit/wiki
Search: NYU large scale learning class
NIPS tutorial on Exploration: http://hunch.net/~jl/interact.pdf
ICML Tutorial on Learning to Search: ... comingsoon.
![Page 122: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/122.jpg)
Bibliography
Release Vowpal Wabbit, 2007, http://github.com/JohnLangford/vowpal_wabbit/wiki.
Terascale A. Agarwal et al, A Reliable Effective Terascale Linear Learning System, http://arxiv.org/abs/1110.4198

Reduce A. Beygelzimer et al, Learning Reductions that Really Work, http://arxiv.org/abs/1502.02704

LOLS K. Chang et al, Learning to Search Better Than Your Teacher, ICML 2015, http://arxiv.org/abs/1502.02206
Explore A. Agarwal et al, Taming the Monster...,http://arxiv.org/abs/1402.0555