Transcript of: Introduction to Machine Learning CMU-10701, 11. Learning Theory (Barnabás Póczos)
Source: alex.smola.org/.../slides/11_Learning_Theory.pdf · 2013-09-09 · 48 slides

Page 1:

Introduction to Machine Learning CMU-10701

11. Learning Theory

Barnabás Póczos

Page 2:

We have explored many ways of learning from data. But…

– How good is our classifier, really?

– How much data do we need to make it “good enough”?

Learning Theory

Page 3:

Please ask questions and give us feedback!

Page 4:

Review of what we have learned so far

Page 5:

Notation

This is what the learning algorithm produces

We will need these definitions, so please copy them!

Page 6:

Big Picture

[Figure: the true risk of the learned classifier decomposes into the estimation error (within the model class) plus the approximation error (distance of the best classifier in the class from the Bayes risk). Ultimate goal: drive both toward zero.]
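The formulas on this slide are not captured in the transcript; the standard decomposition they illustrate (writing R for the true risk, R^* for the Bayes risk, \mathcal{F} for the hypothesis class, and \hat{f}_n for the classifier the learning algorithm produces; notation assumed here) is

R(\hat{f}_n) - R^* = \underbrace{\big(R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f)\big)}_{\text{estimation error}} + \underbrace{\big(\inf_{f \in \mathcal{F}} R(f) - R^*\big)}_{\text{approximation error}}

Most of this lecture is about the estimation error; the approximation error depends on how rich the class \mathcal{F} is.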

Page 7:

Big Picture

[Figure: the same risk decomposition diagram, labeling the Bayes risk and the estimation and approximation errors.]

Page 8:

Big Picture

[Figure: a further build of the same risk decomposition diagram.]

Page 9:

Big Picture: Illustration of Risks

[Figure: illustration of the risks, with an upper bound on the deviation marked. Goal of learning: make the risk of the learned classifier close to the Bayes risk.]

Page 10:

11. Learning Theory

Page 11:

Outline

Theorem: From Hoeffding's inequality, we have seen that [a bound holds for a finite class of N functions; a standard form is sketched below].

These results are useless if N is big or infinite (e.g., the class of all possible hyperplanes).

Today we will see how to fix this with the shattering coefficient and VC dimension.
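The theorem itself is not captured in the transcript; a standard form of the Hoeffding + union bound for a finite class of N classifiers (with R_n the empirical risk on n samples, R the true risk, 0-1 loss; notation assumed) is

P\Big( \max_{f \in \{f_1,\dots,f_N\}} \big| R_n(f) - R(f) \big| > \varepsilon \Big) \le 2N e^{-2n\varepsilon^2}.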

Page 12:

Outline

Theorem: From Hoeffding's inequality, we have seen that [the same finite-class bound as on the previous page].

After this fix, we can say something meaningful about this too: the true risk of the classifier that the learning algorithm produces.

Page 13:

Hoeffding inequality


Theorem:

Observation:
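Neither formula is captured in the transcript; a standard statement of Hoeffding's inequality, and the observation it supports here, are presumably along these lines: if Z_1, \dots, Z_n are independent with a_i \le Z_i \le b_i and \bar{Z} = \frac{1}{n}\sum_i Z_i, then

P\big( |\bar{Z} - E[\bar{Z}]| > \varepsilon \big) \le 2 \exp\Big( - \frac{2 n^2 \varepsilon^2}{\sum_{i=1}^n (b_i - a_i)^2} \Big).

Observation: for a fixed classifier f, the empirical risk R_n(f) is an average of i.i.d. \{0,1\}-valued losses with mean R(f), so P(|R_n(f) - R(f)| > \varepsilon) \le 2 e^{-2n\varepsilon^2}.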

Page 14:

McDiarmid’s Bounded Difference Inequality

It follows that
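The inequality and its consequence are not captured in the transcript; a standard statement (the slide's constants may differ) is: if X_1, \dots, X_n are independent and g satisfies the bounded difference condition

|g(x_1, \dots, x_i, \dots, x_n) - g(x_1, \dots, x_i', \dots, x_n)| \le c_i \quad \text{for all } i \text{ and all arguments},

then

P\big( |g(X_1, \dots, X_n) - E[g]| > \varepsilon \big) \le 2 \exp\Big( - \frac{2\varepsilon^2}{\sum_{i=1}^n c_i^2} \Big).

Taking g to be the sample mean of variables with Z_i \in [a_i, b_i] gives c_i = (b_i - a_i)/n and recovers Hoeffding's inequality as a special case.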


Page 15:

Bounded Difference Condition

Our main goal is to bound [the deviation of the empirical risk from the true risk, uniformly over the class].

Lemma:

Let g denote the following function:

Observation:

Proof:

⇒ McDiarmid can be applied to g!
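The definitions on this slide are not captured in the transcript; in the standard version of this argument (notation assumed) the function is the uniform deviation

g(D_n) = \sup_{f \in \mathcal{F}} \big| R_n(f) - R(f) \big|, \quad D_n = \{(X_1,Y_1),\dots,(X_n,Y_n)\},

and the observation is that, with the 0-1 loss, replacing any single sample point changes every R_n(f) by at most 1/n, hence changes the supremum by at most 1/n, so g satisfies the bounded difference condition with c_i = 1/n.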

Page 16:

Bounded Difference Condition


Corollary:

The Vapnik-Chervonenkis inequality does that with the shatter coefficient (and VC dimension)!
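The corollary's formula is not captured in the transcript; applying McDiarmid with c_i = 1/n presumably gives

P\big( |g - E[g]| > \varepsilon \big) \le 2 e^{-2n\varepsilon^2},

i.e. the uniform deviation concentrates sharply around its expectation. What remains is to bound the expected value E[g] itself, and that is what the Vapnik-Chervonenkis inequality does.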

Page 17:

Concentration and Expected Value


Page 18:

Vapnik-Chervonenkis inequality


We already know:

Vapnik-Chervonenkis inequality:

Corollary: Vapnik-Chervonenkis theorem:

Our main goal is to bound
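The formulas are not captured in the transcript. Commonly cited forms (constants, and whether the shatter coefficient is evaluated at n or 2n, vary between textbooks and may differ from the slide) are:

Vapnik-Chervonenkis inequality: E\Big[ \sup_{f \in \mathcal{F}} |R_n(f) - R(f)| \Big] \le 2 \sqrt{ \frac{2 \log\big(2\, S_{\mathcal{F}}(n)\big)}{n} }.

Vapnik-Chervonenkis theorem: P\Big( \sup_{f \in \mathcal{F}} |R_n(f) - R(f)| > \varepsilon \Big) \le 8\, S_{\mathcal{F}}(n)\, e^{-n\varepsilon^2/32}.

Here S_{\mathcal{F}}(n) is the shatter coefficient defined on the following slides.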

Page 19:

Shattering


Page 20:

How many points can a linear boundary classify exactly in 1D?

[Figure: labelings of 2 and 3 points on the line by a single threshold. There exists a placement of 2 points such that all labelings can be classified; for 3 points, the alternating labeling (-, +, -) cannot be (the "??" case).]

The answer is 2
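A small brute-force illustration, not from the slides (the point placements and the candidate-threshold construction are choices made here):

def threshold_behaviors(points):
    """All labelings of `points` realizable by a 1D linear boundary,
    i.e. x -> [x > t] or x -> [x <= t] for some threshold t."""
    xs = sorted(points)
    # Candidate thresholds: below all points, between neighbors, above all points.
    cuts = [xs[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    behaviors = set()
    for t in cuts:
        behaviors.add(tuple(int(x > t) for x in points))   # '+' on the right of t
        behaviors.add(tuple(int(x <= t) for x in points))  # '+' on the left of t
    return behaviors

def is_shattered(points):
    """True if every one of the 2^n labelings of `points` is realizable."""
    return len(threshold_behaviors(points)) == 2 ** len(points)

print(is_shattered([0.0, 1.0]))        # True: a 2-point set can be shattered
print(is_shattered([0.0, 1.0, 2.0]))   # False: the labeling (-, +, -) is impossible
# No placement of 3 points on the line can be shattered, so the VC dimension is 2.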

Page 21:

How many points can a linear boundary classify exactly in 2D?

[Figure: labelings of 3 and 4 points in the plane by a line. There exists a placement of 3 points such that all labelings can be classified; for 4 points, the alternating labeling (the "??" case) cannot be.]

The answer is 3

Page 22:

How many points can a linear boundary classify exactly in d-dim?

The answer is d+1

How many points can a linear boundary classify exactly in 3D? The answer is 4

[Figure: 4 points at the vertices of a tetrahedron, with the labelings separated by planes.]

Page 23:

Growth function, Shatter coefficient

Definition

[Matrix of the distinct label patterns (behaviors) of the class on a sample of 3 points; there are 5 distinct rows, i.e. 5 behaviors in this example.]

Growth function, Shatter coefficient: the maximum number of behaviors on n points.
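The definition's formula is not captured in the transcript; the standard definition of the growth function / shatter coefficient of a class \mathcal{F} of binary classifiers (notation assumed) is

S_{\mathcal{F}}(n) = \max_{x_1, \dots, x_n} \big| \{ (f(x_1), \dots, f(x_n)) : f \in \mathcal{F} \} \big|,

the maximum, over all samples of size n, of the number of distinct behaviors (label patterns) the class produces; the matrix above shows 5 behaviors on one particular sample of 3 points.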

Page 24:

Growth function, Shatter coefficient

Definition

Growth function, Shatter coefficient: the maximum number of behaviors on n points.

Example: Half spaces in 2D. [Figure: several labelings of three points in the plane, each realized by a half plane.]

Page 25:

VC-dimension

Definition

Growth function, Shatter coefficient: the maximum number of behaviors (# behaviors) on n points.

Definition: Shattering

Definition: VC-dimension

Note:
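The two definitions are not captured in the transcript; the standard versions are:

Shattering: a set \{x_1, \dots, x_n\} is shattered by \mathcal{F} if every one of the 2^n labelings is realized by some f \in \mathcal{F}, i.e. |\{ (f(x_1), \dots, f(x_n)) : f \in \mathcal{F} \}| = 2^n.

VC-dimension: VC(\mathcal{F}) is the size of the largest set that \mathcal{F} shatters, equivalently the largest n with S_{\mathcal{F}}(n) = 2^n (infinite if there is no largest such n). In particular S_{\mathcal{F}}(n) = 2^n for every n \le VC(\mathcal{F}), and S_{\mathcal{F}}(n) < 2^n beyond that.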

Page 26:

VC-dimension

Definition


Page 27:

VC-dimension

[Figure: an example labeling pattern (-, +, -, +).]

Page 28:

Examples


Page 29:

VC dim of decision stumps (axis aligned linear separator) in 2d

What’s the VC dim. of decision stumps in 2d?

[Figure: a placement of 3 points in the plane, with each of the 2^3 labelings realized by some axis-aligned decision stump.]

There is a placement of 3 pts that can be shattered ⇒ VC dim ≥ 3

Page 30:

What’s the VC dim. of decision stumps in 2d? If VC dim = 3, then for all placements of 4 pts, there exists a labeling that

can’t be shattered

3 collinear 1 in convex hull of other 3

quadrilateral

- + - - +

-

- +

-

+ -

VC dim of decision stumps (axis aligned linear separator) in 2d

30

Page 31:

VC dim. of axis parallel rectangles in 2d

What’s the VC dim. of axis parallel rectangles in 2d?

[Figure: a placement of 3 points, with every labeling realized by some axis-parallel rectangle.]

There is a placement of 3 pts that can be shattered ⇒ VC dim ≥ 3

Page 32:

VC dim. of axis parallel rectangles in 2d

There is a placement of 4 pts that can be shattered ⇒ VC dim ≥ 4

Page 33:

VC dim. of axis parallel rectangles in 2d

What’s the VC dim. of axis parallel rectangles in 2d?

To show VC dim = 4, we must show that for all placements of 5 pts, there exists a labeling that can't be realized.

Four cases: 4 points collinear, 2 points in the convex hull of the others, 1 point in the convex hull of the others, or the 5 points form a pentagon. [Figure: in each case, a labeling that no axis-parallel rectangle can realize.]

Page 34:

Sauer’s Lemma


The VC dimension can be used to upper bound the shattering coefficient.

Sauer’s lemma:

Corollary:

We already know that [the trivial bound, which is exponential in n].

[Sauer's lemma replaces it with a bound that is polynomial in n; see below.]
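The formulas are not captured in the transcript; standard forms are:

We already know that S_{\mathcal{F}}(n) \le 2^n (exponential in n).

Sauer's lemma: if d = VC(\mathcal{F}) < \infty, then S_{\mathcal{F}}(n) \le \sum_{i=0}^{d} \binom{n}{i}.

Corollary: S_{\mathcal{F}}(n) \le (n+1)^d, and S_{\mathcal{F}}(n) \le (en/d)^d for n \ge d, i.e. only polynomial in n.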

Page 35:

Proof of Sauer’s Lemma

Write all different behaviors on a sample (x1,x2,…xn) in a matrix:

0 0 0
0 1 0
1 1 1
1 0 0
0 1 1

(5 distinct behaviors of the class on a sample of n = 3 points)

Page 36:

Proof of Sauer’s Lemma

We will prove that [the number of distinct rows, i.e. behaviors, is at most \sum_{i=0}^{d} \binom{n}{i}]

[the behavior matrix from Page 35]

Shattered subsets of columns:

Therefore,
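The lists on this slide are not captured in the transcript. Assuming the 5 x 3 behavior matrix reconstructed under Page 35, the shattered subsets of columns are

\emptyset, \{1\}, \{2\}, \{3\}, \{1,2\}, \{1,3\},

six in total: \{2,3\} is not shattered because the pattern (0,1) never occurs in columns 2 and 3, and \{1,2,3\} cannot be shattered by only 5 < 2^3 rows. Therefore 5 (rows, i.e. behaviors) \le 6 (shattered subsets) \le 1 + 3 + 3 = 7.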

Page 37:

Proof of Sauer’s Lemma

Lemma 1 (in this example: 6 ≤ 1 + 3 + 3 = 7)

Lemma 2, for any binary matrix with no repeated rows (in this example: 5 ≤ 6)

[the behavior matrix from Page 35]

Shattered subsets of columns: [see the list under Page 36]
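The lemma statements are not captured in the transcript; a reconstruction consistent with the worked numbers (6 ≤ 1 + 3 + 3 = 7 on the Lemma 1 slides, 5 ≤ 6 for Lemma 2) is:

Lemma 1: every shattered subset of columns has size at most d = VC dimension, so the number of shattered column subsets is at most \sum_{i=0}^{d} \binom{n}{i} (here 6 \le \binom{3}{0} + \binom{3}{1} + \binom{3}{2} = 7).

Lemma 2: any binary matrix with no repeated rows has at most as many rows as shattered column subsets, the empty subset included (here 5 \le 6).

Chaining the two: the number of behaviors on any n points is at most \sum_{i=0}^{d} \binom{n}{i}, which is Sauer's lemma.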

Page 38:

Proof of Lemma 1

Lemma 1

[the behavior matrix from Page 35]

Shattered subsets of columns:

Proof

In this example: 6 ≤ 1 + 3 + 3 = 7

Page 39:

Proof of Lemma 2

Lemma 2 (for any binary matrix with no repeated rows)

Proof: by induction on the number of columns.

Base case: A has one column. There are three cases (the column contains only 0s, only 1s, or both values):
⇒ 1 ≤ 1, ⇒ 1 ≤ 1, ⇒ 2 ≤ 2

Page 40:

Proof of Lemma 2


Inductive case: A has at least two columns.

We have,

By induction (fewer columns)

[the behavior matrix from Page 35]

Page 41:

Proof of Lemma 2

because

[the behavior matrix from Page 35]

Page 42:

Vapnik-Chervonenkis inequality


Vapnik-Chervonenkis inequality:

From Sauer’s lemma:

Since

Therefore,

[We don’t prove this]

Estimation error
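The chain of inequalities is not captured in the transcript; using the forms quoted under Page 18 (whose constants may differ from the slide) and Sauer's lemma S_{\mathcal{F}}(n) \le (n+1)^d,

E\Big[ \sup_{f \in \mathcal{F}} |R_n(f) - R(f)| \Big] \le 2 \sqrt{ \frac{2 \log\big(2 (n+1)^d\big)}{n} } = O\Big( \sqrt{ \frac{d \log n}{n} } \Big),

and since the estimation error of empirical risk minimization is at most twice this uniform deviation, it tends to 0 as n \to \infty whenever the VC dimension d is finite.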

Page 43:

Linear (hyperplane) classifiers

Estimation error

We already know that
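The formulas are not captured in the transcript; presumably the slide combines two facts already established: hyperplanes in \mathbb{R}^d have VC dimension d + 1 (Page 22), so by the previous page the estimation error of empirical risk minimization over hyperplanes is of order \sqrt{(d+1) \log n / n}, which tends to 0 as the sample size grows.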

Page 44:

Vapnik-Chervonenkis Theorem


We already know from McDiarmid:

Corollary: Vapnik-Chervonenkis theorem: [We don’t prove them]

Vapnik-Chervonenkis inequality:

Hoeffding + Union bound for finite function class:

Page 45:

PAC Bound for the Estimation Error


VC theorem:

Inversion:

Estimation error
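The bound and its inversion are not captured in the transcript; a standard calculation, starting from the VC-theorem form quoted under Page 18 (constants may differ from the slide), is to set 8 S_{\mathcal{F}}(n) e^{-n\varepsilon^2/32} = \delta and solve for \varepsilon. Then with probability at least 1 - \delta,

\sup_{f \in \mathcal{F}} |R_n(f) - R(f)| \le \sqrt{ \frac{32 \big( \log S_{\mathcal{F}}(n) + \log(8/\delta) \big)}{n} },

and the estimation error of empirical risk minimization is at most twice this quantity with the same probability.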

Page 46:

Structural Risk Minimization

[Figure: the risk decomposition from Page 6 again: Bayes risk, estimation error, approximation error. Ultimate goal: make both errors small.]

So far we studied when the estimation error → 0, but we also want the approximation error → 0.

Many different variants… penalize overly complex models to avoid overfitting.
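A minimal sketch of one such variant, not from the slides (the penalty shape, the constants, and the candidate models are illustrative assumptions):

import math

def vc_penalty(vc_dim, n, delta=0.05):
    # One common penalty shape: sqrt((VC-dim * log n + log(1/delta)) / n).
    # The exact constants differ from bound to bound; this is only indicative.
    return math.sqrt((vc_dim * math.log(n) + math.log(1.0 / delta)) / n)

def structural_risk_minimization(candidates, n):
    """candidates: (name, vc_dim, empirical_risk) for each class in a nested sequence F_1, F_2, ..."""
    scored = [(err + vc_penalty(d, n), name) for (name, d, err) in candidates]
    return min(scored)  # smallest penalized empirical risk wins

# Richer classes fit the training data better but pay a larger complexity penalty.
models = [("linear", 3, 0.20), ("quadratic", 6, 0.12), ("degree-10 poly", 66, 0.05)]
print(structural_risk_minimization(models, n=500))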

Page 47:

What you need to know

Complexity of the classifier depends on the number of points that can be classified exactly

Finite case – number of hypotheses

Infinite case – shattering coefficient, VC dimension

PAC bounds on true error in terms of empirical/training error and complexity of hypothesis space

Empirical and Structural Risk Minimization


Page 48:

Thanks for your attention
