Vowpal Wabbit A Machine Learning System
Vowpal Wabbit: A Machine Learning System
John Langford, Microsoft Research
http://hunch.net/~vw/
git clone git://github.com/JohnLangford/vowpal_wabbit.git
Why does Vowpal Wabbit exist?
1. To prove research ideas.
2. Curiosity.
3. Perfectionism.
4. To solve problems better.
A user base becomes addictive
1. Mailing list of >400.
2. The official strawman for large-scale logistic regression @ NIPS :-)
An example
wget http://hunch.net/~jl/VW_raw.tar.gz
vw -c rcv1.train.raw.txt -b 22 --ngram 2 --skips 4 -l 0.25 --binary
provides stellar performance in 12 seconds.
Surface details
1. BSD license, automated test suite, github repository.
2. VW supports all I/O modes: executable, library, port, daemon, service (see next).
3. VW has a reasonable++ input format: sparse, dense, namespaces, etc.
4. Mostly C++, but bindings in other languages of varying maturity (python, C#, Java good).
5. A substantial user base + developer base. Thanks to many who have helped.
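For reference on item 3, a minimal sketch of the input format: a label, an optional importance weight, then `|`-delimited namespaces with sparse features (these particular tokens and namespace names are invented for illustration):

```
1 |subject meeting at noon |body please review the attached
-1 2.5 |subject votre apotheke en ligne |body euro pharmacy
```

Here 1/-1 are binary labels and 2.5 is an importance weight on the second example.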
VW service
http://tinyurl.com/vw-azureml
Problem: How to deploy a model for large-scale use?
Solution:
This Tutorial in 4 parts
How do you:
1. use all the data?
2. solve the right problem?
3. solve complex joint problems?
4. solve interactive problems?
Using all the data: Step 1
Small RAM + large data ⇒ Online Learning
An active research area; 4-5 papers related to online learning algorithms in VW.
Using all the data: Step 2
1. 3.2 × 10^6 labeled emails.
2. 433,167 users.
3. ~40 × 10^6 unique tokens.
How do we construct a spam filter which is personalized, yet uses global information?
Bad answer: Construct a global filter + 433,167 personalized filters using a conventional hashmap to specify features. This might require 433167 × 40 × 10^6 × 4 bytes ≈ 70 terabytes of RAM.
Using Hashing
Use hashing to predict according to: ⟨w, φ(x)⟩ + ⟨w, φ_u(x)⟩
[diagram: a text document (email) is tokenized into a bag of words, each token duplicated with a user-specific prefix (NEU, Votre, Apotheke, en, ligne, Euro, ... alongside USER123_NEU, USER123_Votre, USER123_Apotheke, ...), then hashed into a sparse vector for linear classification w^T x_h]
(in VW: specify the userid as a feature and use -q)
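The construction above can be sketched in a few lines of Python. This is an illustrative sketch, not VW's actual implementation: the hash function (crc32), the bit width, and the `USER123_`-style token prefixing are assumptions made for the example.

```python
# Sketch of personalized feature hashing: each token is hashed twice,
# once globally and once prefixed with the user id, into one shared
# weight vector of size 2^b.

import zlib

def hashed_features(tokens, user, b=22):
    """Map tokens to sparse indices in a 2^b-dimensional space."""
    mask = (1 << b) - 1
    indices = []
    for tok in tokens:
        indices.append(zlib.crc32(tok.encode()) & mask)                  # global feature
        indices.append(zlib.crc32((user + "_" + tok).encode()) & mask)   # personalized feature
    return indices

def predict(w, indices):
    # Linear score <w, phi(x)> + <w, phi_u(x)>: both parts live in the same hashed vector.
    return sum(w[i] for i in indices)
```

With b = 26 the shared table holds 2^26 weights, matching the 64M-parameter figure on the Results slide.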
Results
[bar chart: spam filtering error relative to baseline as a function of b, the number of bits in the hashtable (18-26), comparing global-only, personalized, and baseline variants]
2^26 parameters = 64M parameters = 256MB of RAM.
A roughly 270,000× savings in RAM requirements.
Applying for a fellowship in 1997
Interviewer: So, what do you want to do?
John: I'd like to solve AI.
I: How?
J: I want to use parallel learning algorithms to create fantastic learning machines!
I: You fool! The only thing parallel machines are good for is computational windtunnels!
The worst part: he had a point.
Using all the data: Step 3
Given 2.1 terafeatures of data, how can you learn a good linear predictor f_w(x) = Σ_i w_i x_i?
17B examples, 16M parameters, 1K nodes. How long does it take?
70 minutes = 500M features/second: faster than the I/O bandwidth of a single machine ⇒ faster than all possible single-machine linear learning algorithms.
MPI-style AllReduce
AllReduce = Reduce + Broadcast.
[diagram sequence: seven nodes holding the values 1-7 are arranged into a binary tree; reducing sums the values up the tree until the root holds 28; broadcasting then pushes 28 back down, so in the final state every node holds 28]
Properties:
1. How long does it take? O(1) time(*)
2. How much bandwidth? O(1) bits(*)
3. How hard to program? Very easy
(*) When done right.
An Example Algorithm: Weight averaging
n = AllReduce(1)
While (pass number < max)
1. While (examples left)
   1.1 Do online update.
2. AllReduce(weights)
3. For each weight w ← w/n
Code tour
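The averaging scheme above can be simulated in a single process. In this hypothetical sketch, `allreduce` just sums across simulated nodes, and `sgd_pass` is a stand-in online learner, not VW's update rule:

```python
# Simulate the weight-averaging algorithm: each node trains online on its
# own shard, then AllReduce sums the weights and each node divides by n.

def allreduce(values):
    """Reduce (sum) + broadcast: every node receives the total."""
    total = sum(values)
    return [total for _ in values]

def sgd_pass(w, shard, lr=0.1):
    # One online pass of least-squares SGD on (x, y) pairs with scalar x.
    for x, y in shard:
        w -= lr * (w * x - y) * x
    return w

shards = [[(1.0, 2.0)] * 50, [(1.0, 4.0)] * 50]   # two simulated nodes
n = allreduce([1] * len(shards))[0]                # n = AllReduce(1) counts nodes
weights = [sgd_pass(0.0, s) for s in shards]       # local online updates
weights = [w / n for w in allreduce(weights)]      # average via AllReduce
```

After averaging, every simulated node holds the same weight vector, as in the final AllReduce state.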
What is Hadoop AllReduce?
1. "Map" job moves the program to the data.
2. Delayed initialization: Most failures are disk failures. First read (and cache) all data, before initializing allreduce.
3. Speculative execution: In a busy cluster, one node is often slow. Use speculative execution to start additional mappers.
Robustness & Speedup
[figure: speedup vs. number of nodes (10-100); curves Average_10, Min_10, Max_10, and a linear reference]
Splice Site Recognition
[figures: auPRC vs. iteration, comparing Online, L-BFGS w/ 5 online passes, L-BFGS w/ 1 online pass, and L-BFGS; the right panel zooms in on auPRC 0.466-0.484 over the first 20 iterations]
Splice Site Recognition
[figure: auPRC vs. effective number of passes over the data, comparing L-BFGS w/ one online pass, Zinkevich et al., and Dekel et al.]
This Tutorial in 4 parts
How do you:
1. use all the data?
2. solve the right problem?
3. solve complex joint problems easily?
4. solve interactive problems?
Applying Machine Learning in Practice
1. Ignore the mismatch. Often faster.
2. Understand the problem and find a more suitable tool. Often better.
Importance-Weighted Classification
- Given training data {(x_1, y_1, c_1), ..., (x_n, y_n, c_n)}, produce a classifier h : X → {0, 1}.
- Unknown underlying distribution D over X × {0, 1} × [0, ∞).
- Find h with small expected cost: ℓ(h, D) = E_{(x,y,c)∼D}[c · 1(h(x) ≠ y)]
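A minimal sketch of learning under importance weights, where the cost c scales each update. The perceptron-style rule here is an illustration of the idea, not VW's update:

```python
# Importance-weighted online linear classifier: the cost c of mispredicting
# an example scales its gradient step, so costly examples matter more.

def train(examples, dim, lr=0.5, passes=10):
    w = [0.0] * dim
    for _ in range(passes):
        for x, y, c in examples:            # x: feature vector, y in {0,1}, c >= 0
            score = sum(wi * xi for wi, xi in zip(w, x))
            pred = 1 if score > 0 else 0
            if pred != y:                   # importance c scales the update
                sign = 1 if y == 1 else -1
                w = [wi + lr * c * sign * xi for wi, xi in zip(w, x)]
    return w
```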
Where does this come up?
1. Spam Prediction (Ham predicted as Spam is much worse than Spam predicted as Ham.)
2. Distribution Shifts (Optimize search engine results for monetizing queries.)
3. Boosting (Reweight problem examples for residual learning.)
4. Large Scale Learning (Downsample the common class and importance weight to compensate.)
Multiclass Classification
Distribution D over X × Y, where Y = {1, ..., k}. Find a classifier h : X → Y minimizing the multi-class loss on D:
ℓ_k(h, D) = Pr_{(x,y)∼D}[h(x) ≠ y]
1. Categorization: Which of k things is it?
2. Actions: Which of k choices should be made?
Use in VW
Multiclass label format: Label [Importance] ['Tag]
Methods:
--oaa k: one-against-all prediction. O(k) time. The baseline.
--ect k: error correcting tournament. O(log(k)) time.
--log_multi n: adaptive log time. O(log(n)) time.
One-Against-All (OAA)
Create k binary problems, one per class. For class i predict "Is the label i or not?"
(x, y) ↦ (x, 1(y = 1)), (x, 1(y = 2)), ..., (x, 1(y = k))
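The reduction above can be sketched as follows (hypothetical helper names; a simple perceptron stands in for the underlying binary learner):

```python
# One-against-all: train k binary scorers, predict the argmax score.

def train_oaa(examples, k, dim, lr=0.5, passes=20):
    W = [[0.0] * dim for _ in range(k)]        # one weight vector per class
    for _ in range(passes):
        for x, y in examples:                  # y in {0, ..., k-1}
            for i in range(k):
                target = 1 if y == i else -1   # binary problem "label i or not"
                score = sum(wi * xi for wi, xi in zip(W[i], x))
                if target * score <= 0:        # perceptron update on mistake
                    W[i] = [wi + lr * target * xi for wi, xi in zip(W[i], x)]
    return W

def predict_oaa(W, x):
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    return max(range(len(W)), key=lambda i: scores[i])
```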
The inconsistency problem
Given an optimal binary classifier, one-against-all doesn't produce an optimal multiclass classifier.

Prob(label|features): | 1/2 − δ | 1/4 + δ/2 | 1/4 + δ/2 | Prediction
1v23                  |    1    |     0     |     0     |     0
2v13                  |    0    |     1     |     0     |     0
3v12                  |    0    |     0     |     1     |     0

Solution: always use one-against-all regression.
Cost-sensitive Multiclass
Cost-sensitive multiclass classification: Distribution D over X × [0, 1]^k, where a vector in [0, 1]^k specifies the cost of each of the k choices. Find a classifier h : X → {1, ..., k} minimizing the expected cost:
cost(h, D) = E_{(x,c)∼D}[c_{h(x)}]
1. Is this packet {normal, error, attack}?
2. A subroutine used later...
Use in VW
Label information via sparse vector.
A test example:
|Namespace Feature
A test example with only classes 1, 2, 4 valid:
1: 2: 4: |Namespace Feature
A training example with only classes 1, 2, 4 valid:
1:0.4 2:3.1 4:2.2 |Namespace Feature
Methods:
--csoaa k: cost-sensitive OAA prediction. O(k) time.
--csoaa_ldf: label-dependent features OAA.
--wap_ldf: LDF weighted-all-pairs.
Code Tour
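The csoaa idea, regress the cost of each class and predict the argmin, can be sketched like this (illustrative squared-loss regressors; the names and learning rate are invented for the example, and this is not VW's reduction code):

```python
# Cost-sensitive one-against-all: one cost regressor per class,
# prediction is the argmin of predicted costs.

def train_csoaa(examples, k, dim, lr=0.1, passes=50):
    W = [[0.0] * dim for _ in range(k)]          # per-class cost regressor
    for _ in range(passes):
        for x, costs in examples:                # costs: length-k vector in [0, 1]
            for i in range(k):
                pred = sum(wi * xi for wi, xi in zip(W[i], x))
                grad = pred - costs[i]           # squared-loss gradient
                W[i] = [wi - lr * grad * xi for wi, xi in zip(W[i], x)]
    return W

def predict_csoaa(W, x):
    costs = [sum(wi * xi for wi, xi in zip(w, x)) for w in W]
    return min(range(len(W)), key=lambda i: costs[i])
```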
This Tutorial in 4 parts
How do you:
1. use all the data?
2. solve the right problem?
3. solve complex joint problems easily?
4. solve interactive problems?
The Problem: Joint Prediction
How?
1. Each prediction is independent.
2. Multitask learning.
3. Assume a tractable graphical model, optimize.
4. Hand-crafted approaches.
What makes a good solution?
1. Programming complexity. Most complex problems are addressed independently; it is too complex to do otherwise.
2. Prediction accuracy. It had better work well.
3. Train speed. Debug/development productivity + maximum data input.
4. Test speed. Application efficiency.
A program complexity comparison
[bar chart, log scale from 1 to 1000: lines of code for POS tagging, comparing CRFSGD, CRF++, S-SVM, and Search]
POS Tagging (tuned hps)
[figure: per-word accuracy vs. training time (seconds to hours, log scale) for OAA, L2S, L2S (ft), CRFsgd, CRF++, StrPerc, StrSVM, StrSVM2; accuracies range from 90.7 to 96.6]
Prediction (test-time) Speed
[bar chart: thousands of tokens per second on NER and POS for L2S, L2S (ft), CRFsgd, CRF++, StrPerc, StrSVM, StrSVM2]
How do you program?
Sequential_RUN(examples)
1: for i = 1 to len(examples) do
2:   prediction ← predict(examples[i], examples[i].label)
3:   loss(prediction ≠ examples[i].label)
4: end for
In essence, write the decoder, providing a little bit of side information for training.
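A runnable toy version of this pattern, where the same decoding loop serves both training and testing (the memorizing `Searcher` below is a deliberately trivial stand-in for a learned predictor, not VW's learning-to-search machinery):

```python
# Minimal imitation of the "write the decoder once" pattern: the same loop
# runs for training (reference labels available, model updated) and testing.

class Searcher:
    def __init__(self):
        self.memory = {}          # trivially memorizing "model" for illustration

    def predict(self, features, ref=None):
        if ref is not None:                      # training: learn from the reference
            self.memory[features] = ref
        return self.memory.get(features, 0)

def sequential_run(searcher, examples, train=True):
    total_loss = 0
    for features, label in examples:
        pred = searcher.predict(features, label if train else None)
        total_loss += int(pred != label)         # Hamming loss per position
    return total_loss
```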
RunParser(sentence)

1: stack S ← {Root}
2: buffer B ← [words in sentence]
3: arcs A ← ∅
4: while B ≠ ∅ or |S| > 1 do
5:   ValidActs ← GetValidActions(S, B)
6:   features ← GetFeat(S, B, A)
7:   ref ← GetGoldAction(S, B)
8:   action ← predict(features, ref, ValidActs)
9:   S, B, A ← Transition(S, B, A, action)
10: end while
11: loss(A[w] ≠ A*[w], ∀w ∈ sentence)
12: return output
![Page 82: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/82.jpg)
How does it work?
An application of "Learning to Search" algorithms (e.g. Searn, DAgger, LOLS [ICML 2015]).

The decoder is run many times at train time to optimize predict(...) for loss(...). See the tutorial with Hal Daumé III @ ICML 2015 and the LOLS paper @ ICML 2015.
![Page 83: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/83.jpg)
Named Entity Recognition
Is this word part of an organization, person, or not?
[Chart: Named entity recognition (tuned hps). F-score (per entity) vs. training time (log scale, 10^-1 to 10^1 minutes; marks at 10s, 1m, 10m); F-scores range from 73.3 to 80.0.]
![Page 84: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/84.jpg)
Entity Relation
Goal: find the Entities and then find their Relations.

Method          Entity F1   Relation F1   Train Time
Structured SVM  88.00       50.04         300 seconds
L2S             92.51       52.03         13 seconds

Requires about 100 LOC.
![Page 85: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/85.jpg)
Dependency Parsing
Goal: Find dependency structure of sentence.
[Bar chart: UAS (higher = better), range roughly 70-95, per language — Ar*, Bu, Ch, Da, Du, En, Ja, Po*, Sl*, Sw, Avg — for L2S, Dyna, and SNN.]
Requires about 300 LOC.
![Page 86: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/86.jpg)
A demonstration
wget http://bilbo.cs.uiuc.edu/~kchang10/tmp/wsj.vw.zip
vw -b 24 -d wsj.train.vw -c --search_task sequence --search 45 --search_alpha 1e-8 --search_neighbor_features -1:w,1:w --affix -1w,+1w -f foo.reg
vw -t -i foo.reg wsj.test.vw
![Page 87: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/87.jpg)
This Tutorial in 4 parts
How do you:
1. use all the data?
2. solve the right problem?
3. solve complex joint problems easily?
4. solve interactive problems?
![Page 88: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/88.jpg)
Examples of Interactive Learning
Repeatedly:
1. A user comes to Microsoft (with a history of previous visits, IP address, data related to an account)

2. Microsoft chooses information to present (urls, ads, news stories)

3. The user reacts to the presented information (clicks on something, comes back and clicks again, ...)

Microsoft wants to interactively choose content and use the observed feedback to improve future content choices.
![Page 89: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/89.jpg)
Another Example: Clinical Decision Making
Repeatedly:
1. A patient comes to a doctor with symptoms, medical history, test results
2. The doctor chooses a treatment
3. The patient responds to it
The doctor wants a policy for choosing targeted treatments for individual patients.
![Page 90: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/90.jpg)
The Contextual Bandit Setting
For t = 1, . . . ,T :
1. The world produces some context x ∈ X
2. The learner chooses an action a ∈ A
3. The world reacts with reward ra ∈ [0, 1]
Goal: Learn a good policy for choosing actionsgiven context.
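The protocol above can be simulated directly. The context set, reward table (values borrowed from the direct-method example in this deck), and uniform-random policy below are illustrative stand-ins, not part of VW:

```python
import random

random.seed(0)
ACTIONS = [0, 1]

# True expected rewards per (context, action) — unknown to the learner.
TRUE_REWARD = {("x1", 0): 0.8, ("x1", 1): 1.0,
               ("x2", 0): 0.3, ("x2", 1): 0.2}

def world_context():
    return random.choice(["x1", "x2"])

def uniform_policy(x):
    # A deliberately naive policy: ignore the context entirely.
    return random.choice(ACTIONS)

total = 0.0
T = 1000
for t in range(T):
    x = world_context()        # 1. the world produces some context x
    a = uniform_policy(x)      # 2. the learner chooses an action a
    r = TRUE_REWARD[(x, a)]    # 3. the world reacts with reward in [0, 1]
    total += r

print(total / T)  # average reward of the uniform policy
```

The goal of contextual bandit learning is to do better than this baseline by mapping contexts to actions, using only the rewards of the actions actually taken.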
![Page 91: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/91.jpg)
The "Direct method"

Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).

Example: Deployed policy always takes a1 on x1 and a2 on x2.

Observed/Estimated/True

        a1              a2
x1   .8 / .8 / .8     ? /  -  / 1
x2    ? /  -  / .3   .2 /  -  / .2
![Page 93: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/93.jpg)
The "Direct method"

Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).

Example: Deployed policy always takes a1 on x1 and a2 on x2.

Observed

        a1    a2
x1     .8     ?
x2      ?    .2
![Page 94: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/94.jpg)
The "Direct method"

Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).

Example: Deployed policy always takes a1 on x1 and a2 on x2.

Observed/Estimated/True

        a1             a2
x1   .8 / .8 / .8    ? / .5 / 1
x2    ? / .5 / .3   .2 / .2 / .2
![Page 95: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/95.jpg)
The "Direct method"

Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).

Example: Deployed policy always takes a1 on x1 and a2 on x2.

Observed/Estimated/True

        a1             a2
x1   .8 / .8 / .8    ? / .5 / 1
x2   .3 / .5 / .3   .2 / .2 / .2
![Page 96: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/96.jpg)
The "Direct method"

Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).

Example: Deployed policy always takes a1 on x1 and a2 on x2.

Observed/Estimated/True

        a1              a2
x1   .8 / .8 / .8    ? / .514 / 1
x2   .3 / .3 / .3   .2 / .014 / .2
![Page 97: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/97.jpg)
The "Direct method"

Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).

Example: Deployed policy always takes a1 on x1 and a2 on x2.

Observed/Estimated/True

        a1              a2
x1   .8 / .8 / .8    ? / .514 / 1
x2   .3 / .3 / .3   .2 / .014 / .2
![Page 98: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/98.jpg)
The "Direct method"

Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).

Example: Deployed policy always takes a1 on x1 and a2 on x2.

Observed/Estimated/True

        a1              a2
x1   .8 / .8 / .8    ? / .514 / 1
x2   .3 / .3 / .3   .2 / .014 / .2

Basic observation 1: Generalization insufficient.
![Page 99: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/99.jpg)
The "Direct method"

Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).

Example: Deployed policy always takes a1 on x1 and a2 on x2.

Observed/Estimated/True

        a1              a2
x1   .8 / .8 / .8    ? / .514 / 1
x2   .3 / .3 / .3   .2 / .014 / .2

Basic observation 2: Exploration required.
![Page 100: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/100.jpg)
The "Direct method"

Use past data to learn a reward predictor r̂(x, a), and act according to arg max_a r̂(x, a).

Example: Deployed policy always takes a1 on x1 and a2 on x2.

Observed/Estimated/True

        a1              a2
x1   .8 / .8 / .8    ? / .514 / 1
x2   .3 / .3 / .3   .2 / .014 / .2

Basic observation 3: Errors ≠ exploration.
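The failure mode in this table can be checked numerically. The dictionaries below just memorize the table's estimated and true rewards (including the estimates .514 and .014 that the regressor fills in for the unobserved cells):

```python
# Estimated and true rewards from the slides' final table.
estimated = {("x1", "a1"): 0.8, ("x1", "a2"): 0.514,
             ("x2", "a1"): 0.3, ("x2", "a2"): 0.014}
true_r    = {("x1", "a1"): 0.8, ("x1", "a2"): 1.0,
             ("x2", "a1"): 0.3, ("x2", "a2"): 0.2}

def direct_method(x):
    # Act greedily on the learned reward predictor r_hat.
    return max(["a1", "a2"], key=lambda a: estimated[(x, a)])

for x in ["x1", "x2"]:
    chosen = direct_method(x)
    best = max(["a1", "a2"], key=lambda a: true_r[(x, a)])
    print(x, chosen, best)
# On x1 the direct method keeps taking a1 (estimated .8 > .514) even
# though a2's true reward is 1: generalization from logged data alone
# never corrects the unexplored cell, so the error persists forever.
```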
![Page 101: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/101.jpg)
The Evaluation Problem
Let π : X → A be a policy mapping features to actions. How do we evaluate it?
Method 1: Deploy algorithm in the world.
Very Expensive!
![Page 103: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/103.jpg)
The Importance Weighting Trick
Let π : X → A be a policy mapping features to actions. How do we evaluate it?

One answer: Collect T exploration samples (x, a, ra, pa), where
  x = context
  a = action
  ra = reward for action
  pa = probability of action a

then evaluate:

Value(π) = Average( ra · 1(π(x) = a) / pa )
![Page 105: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/105.jpg)
The Importance Weighting Trick
Theorem
For all policies π, for all IID data distributions D, Value(π) is an unbiased estimate of the expected reward of π:

E_{(x, r⃗) ∼ D}[ r_{π(x)} ] = E[ Value(π) ]

Proof: E_{a ∼ p}[ ra · 1(π(x) = a) / pa ] = Σ_a pa · ra · 1(π(x) = a) / pa = r_{π(x)}

Example:

Action        1       2
Reward       0.5      1
Probability  1/4     3/4
Estimate    2 | 0   0 | 4/3

(Each Estimate cell reads "value if π picks action 1 | value if π picks action 2", given which action was logged.)
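The example's estimates can be reproduced in a few lines; this just evaluates the importance-weighted estimator on each possible logged action and then averages over the logging distribution to check unbiasedness:

```python
# Two actions with rewards (0.5, 1) logged with probabilities (1/4, 3/4),
# as in the example. For a logged sample (a, r_a, p_a), the importance-
# weighted estimate for a policy choosing pi_action is
# r_a * 1(pi_action == a) / p_a.
reward = {1: 0.5, 2: 1.0}
prob   = {1: 0.25, 2: 0.75}

def estimate(pi_action, logged_action):
    r, p = reward[logged_action], prob[logged_action]
    return r * (pi_action == logged_action) / p

# Rows of the "Estimate" table: logged action 1 gives (2, 0),
# logged action 2 gives (0, 4/3).
print([estimate(pi, 1) for pi in (1, 2)])   # [2.0, 0.0]
print([estimate(pi, 2) for pi in (1, 2)])   # [0.0, 1.333...]

# Unbiasedness: averaging over the logging distribution recovers
# each policy's true reward (0.5 for pi=1, 1 for pi=2).
for pi in (1, 2):
    value = sum(prob[a] * estimate(pi, a) for a in (1, 2))
    print(pi, value)
```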
![Page 108: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/108.jpg)
How do you test things?
Use the format:
action:cost:probability | features

Example:
1:1:0.5 | tuesday year million short compan vehicl line stat financ commit exchang plan corp subsid credit issu debt pay gold bureau prelimin refin billion telephon time draw basic relat file spokesm reut secur acquir form prospect period interview regist toront resourc barrick ontario qualif bln prospectus convertibl vinc borg arequip...
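A logged example in this format is easy to parse outside of VW as well. A minimal reader (an illustrative sketch, not VW's own parser; it handles the single-action case shown above):

```python
def parse_cb_line(line):
    """Parse a contextual-bandit line: 'action:cost:probability | features'."""
    label_part, feature_part = line.split("|", 1)
    action, cost, prob = label_part.strip().split(":")
    return {"action": int(action),
            "cost": float(cost),
            "probability": float(prob),
            "features": feature_part.split()}

ex = parse_cb_line("1:1:0.5 | tuesday year million short compan vehicl")
print(ex["action"], ex["cost"], ex["probability"], len(ex["features"]))
```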
![Page 109: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/109.jpg)
How do you train?
Reduce to cost-sensitive classification.

vw --cb 2 --cb_type dr rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.25
Progressive 0/1 loss: 0.04582

vw --cb 2 --cb_type ips rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.125
Progressive 0/1 loss: 0.05065

vw --cb 2 --cb_type dm rcv1.train.txt.gz -c --ngram 2 --skips 4 -b 24 -l 0.125
Progressive 0/1 loss: 0.04679
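--cb_type selects how the reduction estimates each action's cost: ips (inverse propensity score), dm (direct method, a regression estimate), or dr (doubly robust, combining the two). A sketch of the three estimators for a single logged sample; `cost_hat` stands in for a hypothetical regressor's per-action cost estimates:

```python
def ips(a, cost, p, logged_a):
    # Inverse propensity score: unbiased, but high variance when p is small.
    return cost / p if a == logged_a else 0.0

def dm(a, cost_hat):
    # Direct method: low variance, but biased if the regressor is wrong.
    return cost_hat[a]

def dr(a, cost, p, logged_a, cost_hat):
    # Doubly robust: regression estimate plus an IPS correction on the
    # observed action; accurate if either component is accurate.
    correction = (cost - cost_hat[a]) / p if a == logged_a else 0.0
    return cost_hat[a] + correction

# One logged sample: action 1 taken with probability 0.5 at cost 1.0.
cost_hat = {1: 0.8, 2: 0.3}
print(ips(1, 1.0, 0.5, 1))           # 2.0
print(dm(1, cost_hat))               # 0.8
print(dr(1, 1.0, 0.5, 1, cost_hat))  # 0.8 + (1.0 - 0.8)/0.5 ≈ 1.2
```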
![Page 112: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/112.jpg)
Reminder: Contextual Bandit Setting
For t = 1, . . . ,T :
1. The world produces some context x ∈ X
2. The learner chooses an action a ∈ A
3. The world reacts with reward ra ∈ [0, 1]
Goal: Learn a good policy for choosing actions given context.
What does learning mean? Efficiently competing with some large reference class of policies Π = {π : X → A}:

Regret = max_{π ∈ Π} average_t ( r_{π(x)} − r_a )
![Page 113: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/113.jpg)
Building an Algorithm
Let Q1 = uniform distribution
For t = 1, . . . ,T :
1. The world produces some context x ∈ X
2. Draw π ∼ Qt
3. The learner chooses an action a ∈ A using π(x).
4. The world reacts with reward ra ∈ [0, 1]
5. Update Qt+1
![Page 115: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/115.jpg)
What is good Qt?
▶ Exploration: Qt allows discovery of good policies

▶ Exploitation: Qt small on bad policies
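The simplest distribution that trades these off is epsilon-greedy (one of the baselines in the charts that follow): put most of the mass on the empirically best choice and spread the rest uniformly. A sketch over actions, as an illustrative special case of the policy-distribution view:

```python
def epsilon_greedy(avg_reward, n_actions, epsilon=0.1):
    """Distribution over actions: mass 1 - epsilon on the empirically
    best action, epsilon spread uniformly, so every action keeps
    probability >= epsilon / n_actions (enabling importance weighting)."""
    greedy = max(range(n_actions), key=lambda a: avg_reward.get(a, 0.0))
    q = [epsilon / n_actions] * n_actions
    q[greedy] += 1 - epsilon
    return q

q = epsilon_greedy({0: 0.2, 1: 0.7}, n_actions=2, epsilon=0.1)
print(q)  # mass ~0.95 on the greedy action, ~0.05 on the other
```

Because every action retains nonzero probability, the logged (action, reward, probability) triples support the importance-weighted evaluation described earlier.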
![Page 116: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/116.jpg)
How do you find Qt?
by Reduction ... [details complex, but coded]
![Page 118: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/118.jpg)
How well does this work?
[Bar chart: losses on the CCAT RCV1 problem for eps-greedy, tau-first, LinUCB*, and Cover; loss axis 0 to 0.12.]
![Page 119: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/119.jpg)
How long does this take?
[Bar chart: running times on the CCAT RCV1 problem for eps-greedy, tau-first, LinUCB*, and Cover; seconds, log scale 1 to 10^6.]
![Page 120: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/120.jpg)
Next for Vowpal Wabbit
Version 8 series just started.
Primary goal: New research (as always) + tackle deployability:
1. Backwards compatibility of model to VW.
2. More serious testing.
3. More serious documentation.
4. More and better library interfaces.
5. Dynamic module loading.
![Page 121: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/121.jpg)
Further reading
VW wiki: https://github.com/JohnLangford/vowpal_wabbit/wiki
Search: NYU large scale learning class
NIPS tutorial on Exploration: http://hunch.net/~jl/interact.pdf
ICML Tutorial on Learning to Search: ... comingsoon.
![Page 122: Vowpal Wabbit A Machine Learning System](https://reader030.fdocuments.us/reader030/viewer/2022020123/55bed2b7bb61eb133d8b45a0/html5/thumbnails/122.jpg)
Bibliography
Release Vowpal Wabbit, 2007, http://github.com/JohnLangford/vowpal_wabbit/wiki.
Terascale A. Agarwal et al, A Reliable Effective Terascale Linear Learning System, http://arxiv.org/abs/1110.4198

Reduce A. Beygelzimer et al, Learning Reductions that Really Work, http://arxiv.org/abs/1502.02704

LOLS K. Chang et al, Learning to Search Better Than Your Teacher, ICML 2015, http://arxiv.org/abs/1502.02206
Explore A. Agarwal et al, Taming the Monster...,http://arxiv.org/abs/1402.0555