Boosting

Instructor: Pranjal Awasthi

Transcript of "Boosting" (pa336/mlS16/ml-sp16-lec15.pdf)

Page 1:

Boosting

Instructor: Pranjal Awasthi

Page 2:

Motivating Example

Want to predict winner of 2016 general election

Page 3:

Motivating Example

• Step 1: Collect training data

– 2012,

– 2008,

– 2004,

– 2000,

– 1996,

– 1992,

– …

• Step 2: Find a function that gets high accuracy (~99%) on the training set.

What if, instead, we have good "rules of thumb" that often work well?

Page 4:

Motivating Example

• Step 1: Collect training data

– 2012,

– 2008,

– 2004,

– 2000,

– 1996,

– 1992,

– …

• h_1 will do well on a subset of the training set, and will do better than half overall, say 51% accuracy.

h_1 = If the sitting president is running, go with his/her party; otherwise, predict randomly.

Page 5:

Motivating Example

• Step 1: Collect training data

– 2008,

– 2000,

– …

h_1 = If the sitting president is running, go with his/her party; otherwise, predict randomly.

What if we focus only on a subset of the examples? h_1 will do poorly.

Page 6:

Motivating Example

• Step 1: Collect training data

– 2008,

– 2000,

– …

• h_2 will do well, say 51% accuracy.

h_2 = If a CNN poll gives a 20-point lead to someone, go with his/her party; otherwise, predict randomly.

What if we focus only on a subset of the examples? h_1 will do poorly, but h_2 will do well.

Page 7:

Motivating Example

• Step 1: Collect training data

– 2012,

– 2008,

– 2004,

– 2000,

– 1996,

– 1992,

– …

• Suppose we can consistently produce rules in H that do slightly better than random guessing.

[Figure: rules h_1, h_2, h_3, … inside the hypothesis class H]

Page 8:

Motivating Example

• Step 1: Collect training data

– 2012,

– 2008,

– 2004,

– 2000,

– 1996,

– 1992,

– …

• Output a single rule that gets high accuracy on the training set.

[Figure: rules h_1, h_2, h_3, … inside the hypothesis class H]

Page 9:

Boosting

[Diagram: Weak Learning Algorithm → Strong Learning Algorithm]

Boosting: a general method for converting "rules of thumb" into highly accurate predictions.

Page 10:

History of Boosting

• Originally studied from a purely theoretical perspective

– Is weak learning equivalent to strong learning?

– Answered positively by Rob Schapire in 1989 via the boosting algorithm.

• AdaBoost: a highly practical version developed in 1995

– One of the most popular ML (meta-)algorithms

Page 11:

History of Boosting

• Has connections to many different areas

– Empirical risk minimization

– Game theory

– Convex optimization

– Computational complexity theory

Page 12:

Strong Learning

• To get strong learning, we need to:

– Design an algorithm that can achieve any arbitrary error ε > 0 on the training set S_m

– Do a VC-dimension analysis to see how large m needs to be for generalization

Page 13:

Theory of Boosting

• Weak Learning Assumption:

– There exists an algorithm A that can consistently produce weak classifiers on the training set

S_m = {(X_1, y_1), (X_2, y_2), …}

– Weak Classifier:

• For every weighting W of S_m, A outputs h_W such that

π‘’π‘Ÿπ‘Ÿπ‘†π‘š,π‘Šβ„Žπ‘Š ≀

1

2βˆ’ 𝛾, 𝛾 > 0

β€’ π‘’π‘Ÿπ‘Ÿπ‘†π‘š,π‘Š β„Žπ‘Š = σ𝑖=1π‘š 𝑀𝑖𝐼 β„Žπ‘Š 𝑋𝑖 ≠𝑦𝑖

σ𝑖=1π‘š 𝑀𝑖

Page 14:

Theory of Boosting

• Weak Learning Assumption:

– There exists an algorithm A that can consistently produce weak classifiers on the training set S_m

• Boosting Guarantee: use A to get an algorithm B such that

– For every ε > 0, B outputs f with err_{S_m}(f) ≤ ε

Page 15:

AdaBoost

• Run A on S_m to get h_1
• Use h_1 to get W_2
• Run A on S_m with weights W_2 to get h_2
• Repeat T times
• Combine h_1, h_2, … h_T to get the final classifier.

Q1: How to choose weights?
Q2: How to combine classifiers?
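A minimal Python sketch of this loop (the names are my own; the two hooks correspond exactly to Q1 and Q2, which the next slides answer):

```python
def boosting_loop(A, S, T, choose_weights, combine):
    """A: weak learner, A(S, W) -> hypothesis; S: training set S_m as a list
    of (x, y) pairs; T: number of rounds."""
    m = len(S)
    W = [1.0 / m] * m                 # round 1: uniform weights
    hypotheses = []
    for _ in range(T):
        h = A(S, W)                   # run A on S_m with current weights
        hypotheses.append(h)
        W = choose_weights(S, W, h)   # Q1: how to choose weights?
    return combine(hypotheses)        # Q2: how to combine classifiers?
```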

Page 16:

Choosing weights

• Run A on S_m to get h_1
• Use h_1 to get W_2
• Run A on S_m with weights W_2 to get h_2
• Repeat T times
• Combine h_1, h_2, … h_T to get the final classifier.

Idea 1: Let W_t be uniform over the examples that h_{t−1} got wrong.

Page 17:

Choosing weights

• Run A on S_m to get h_1
• Use h_1 to get W_2
• Run A on S_m with weights W_2 to get h_2
• Repeat T times
• Combine h_1, h_2, … h_T to get the final classifier.

Idea 2: Let W_t be such that the error of h_{t−1} becomes 1/2.

Page 18:

Choosing weights

• Run A on S_m to get h_1
• Use h_1 to get W_2
• Run A on S_m with weights W_2 to get h_2
• Repeat T times
• Combine h_1, h_2, … h_T to get the final classifier.

• If err_{S_m, W_{t−1}}(h_{t−1}) = 1/2 − γ, then

– W_{t,i} = W_{t−1,i}, if h_{t−1} is incorrect on X_i

– W_{t,i} = W_{t−1,i} · (1/2 − γ)/(1/2 + γ), if h_{t−1} is correct on X_i
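To see why this update achieves Idea 2: under W_{t−1}, the mistakes of h_{t−1} carry total weight fraction 1/2 − γ and the correct examples carry 1/2 + γ. The update keeps the mistake weights and shrinks each correct weight by (1/2 − γ)/(1/2 + γ), so under W_t

$$\mathrm{err}_{S_m,W_t}(h_{t-1}) = \frac{\tfrac{1}{2}-\gamma}{\left(\tfrac{1}{2}-\gamma\right) + \left(\tfrac{1}{2}+\gamma\right)\cdot\frac{1/2-\gamma}{1/2+\gamma}} = \frac{\tfrac{1}{2}-\gamma}{2\left(\tfrac{1}{2}-\gamma\right)} = \frac{1}{2}.$$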

Page 19:

Combining Classifiers

• Run A on S_m to get h_1
• Use h_1 to get W_2
• Run A on S_m with weights W_2 to get h_2
• Repeat T times
• Combine h_1, h_2, … h_T to get the final classifier.

Take majority vote: f(X) = MAJ(h_1(X), h_2(X), …, h_T(X))

Page 20:

Combining Classifiers

• Run A on S_m to get h_1
• Use h_1 to get W_2
• Run A on S_m with weights W_2 to get h_2
• Repeat T times
• Combine h_1, h_2, … h_T to get the final classifier.

Take majority vote: f(X) = MAJ(h_1(X), h_2(X), …, h_T(X))

Theorem: If the error of each h_i is exactly 1/2 − γ, then

$$\mathrm{err}_{S_m}(f) \le e^{-2T\gamma^2}$$
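For a sense of scale, here is a worked instance of the bound using the 51%-accuracy rule of thumb from the motivating example, i.e. γ = 0.01: the theorem gives err_{S_m}(f) ≤ e^{−2T(0.01)²} = e^{−0.0002T}, so pushing the training error below 1% requires e^{−0.0002T} ≤ 0.01, i.e. T ≥ ln(100)/0.0002 ≈ 23,000 rounds. A stronger weak learner with γ = 0.1 would need only T ≥ ln(100)/0.02 ≈ 230.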

Page 21:

Analysis

Theorem: If the error of each h_i is exactly 1/2 − γ, then

$$\mathrm{err}_{S_m}(f) \le e^{-2T\gamma^2}$$


Page 24:

Analysis

Theorem: If the error of each h_i is at most 1/2 − γ, then

$$\mathrm{err}_{S_m}(f) \le e^{-2T\gamma^2}$$

Let W_t be such that the error of h_{t−1} becomes 1/2.

Take a (weighted) majority vote.
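Putting both answers together, here is a minimal runnable sketch of the scheme analyzed above. It is an illustration under stated assumptions (labels in {−1, +1}, and a weak learner A(X, y, w) whose weighted error is exactly 1/2 − γ each round); all names are my own rather than the slides'.

```python
import numpy as np

def simplified_boost(A, X, y, T, gamma):
    """X: list of examples; y: labels in {-1, +1}; T: number of rounds.
    A(X, y, w) must return a hypothesis h with weighted error 1/2 - gamma."""
    m = len(X)
    w = np.full(m, 1.0 / m)                  # round 1: uniform weights
    hypotheses = []
    for _ in range(T):
        h = A(X, y, w)                       # run the weak learner on (S_m, w)
        hypotheses.append(h)
        correct = np.array([h(x_i) == y_i for x_i, y_i in zip(X, y)])
        # Update from the "Choosing weights" slide: keep mistake weights,
        # shrink correct weights so h's error under the new weights is 1/2.
        w = np.where(correct, w * (0.5 - gamma) / (0.5 + gamma), w)

    def f(x):
        # Majority vote over the T weak hypotheses.
        return 1 if sum(h(x) for h in hypotheses) >= 0 else -1

    return f
```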

Page 25:

Strong Learning

• To get strong learning, we need to:

– Design an algorithm that can achieve any arbitrary error ε > 0 on the training set S_m

– Do a VC-dimension analysis to see how large m needs to be for generalization

Page 26:

Generalization Bound

[Figure: rules h_1, h_2, h_3, … inside the hypothesis class H]

Page 27:

Applications: Decision Stumps
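The slide's figures did not survive extraction, but the standard setup is worth sketching: a decision stump thresholds a single feature, and the weighted-error-minimizing stump can be found by brute force. A minimal sketch (my own code, assuming labels in {−1, +1}):

```python
import numpy as np

def fit_stump(X, y, w):
    """Return the decision stump h(x) = (sign if x[j] <= theta else -sign)
    minimizing weighted training error. X: (m, d) array; y: labels in
    {-1, +1}; w: nonnegative example weights."""
    m, d = X.shape
    best_err, best_stump = np.inf, None
    for j in range(d):                      # feature to split on
        for theta in np.unique(X[:, j]):    # candidate thresholds
            for sign in (+1, -1):           # orientation of the split
                pred = np.where(X[:, j] <= theta, sign, -sign)
                err = np.sum(w * (pred != y)) / np.sum(w)
                if err < best_err:
                    best_err, best_stump = err, (j, theta, sign)
    j, theta, sign = best_stump
    return lambda x: sign if x[j] <= theta else -sign
```

Used as the weak learner A inside the boosting loop above, this is essentially boosting with decision stumps, one of the most common instantiations in practice.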

Page 28:

Advantages of AdaBoost

• Helps learn complex models from simple ones without overfitting

• A general paradigm and easy to implement

• Parameter-free, no tuning

• Often not very sensitive to the number of rounds T

• Cons:

– Does not help much if the weak learners are already complex.

– Sensitive to noise in the data.