
Page 1:

T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi:

General Conditions for Predictivity in Learning Theory

Michael Pfeiffer

pfeiffer@igi.tugraz.at

25.11.2004

Page 2:

Motivation

Supervised Learning: learn functional relationships from a finite set of labelled training examples

Generalization: How well does the learned function perform on unseen test examples?

Central question in supervised learning

Page 3:

What you will hear

New Idea: Stability implies predictivity. A learning algorithm is stable if small perturbations of the training set do not change the hypothesis much.

Conditions for generalization are placed on the learning map rather than the hypothesis space, in contrast to VC-analysis.

Page 4:

Agenda

Introduction Problem Definition Classical Results Stability Criteria Conclusion

Page 5:

Some Definitions 1/2

Training Data: S = {z_1 = (x_1, y_1), ..., z_n = (x_n, y_n)} ⊂ Z = X × Y, drawn from an unknown distribution μ(x, y)

Hypothesis Space: H; Hypothesis f_S ∈ H, f_S: X → Y

Learning Algorithm: L: Z^n → H, f_S = L(S) = L(z_1 = (x_1, y_1), ..., z_n = (x_n, y_n))

Regression: f_S is real-valued; Classification: f_S is binary

Symmetric learning algorithm (ordering of the training examples is irrelevant)

Page 6:

Some Definitions 2/2

Loss Function: V: H × Z → ℝ, e.g. V(f, z) = (f(x) - y)²

Assume that V is bounded

Empirical Error (Training Error): I_S[f] = (1/n) Σ_{i=1}^n V(f, z_i)

Expected Error (True Error): I[f] = ∫_Z V(f, z) dμ(z)
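To make these two error measures concrete, here is a minimal Python sketch (my own illustration, not part of the slides): it fixes a hypothesis f, draws a training sample from an assumed uniform input distribution with noisy linear labels, computes the empirical error I_S[f] under the squared loss, and approximates the expected error I[f] by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # a fixed hypothesis f: X -> Y (here a simple linear function)
    return 2.0 * x + 0.5

def target(x, rng):
    # the unknown relationship generating the labels y (with noise)
    return 2.0 * x + 0.3 + 0.1 * rng.normal(size=x.shape)

def V(fx, y):
    # squared loss V(f, z) = (f(x) - y)^2
    return (fx - y) ** 2

# training sample S = {(x_i, y_i)}_{i=1..n} drawn from mu
n = 50
x_train = rng.uniform(0, 1, n)
y_train = target(x_train, rng)

# empirical error I_S[f] = (1/n) * sum_i V(f, z_i)
I_S = np.mean(V(f(x_train), y_train))

# expected error I[f] = integral of V(f, z) dmu(z), approximated by Monte Carlo
x_big = rng.uniform(0, 1, 200_000)
y_big = target(x_big, rng)
I_true = np.mean(V(f(x_big), y_big))

print(f"empirical error I_S[f] = {I_S:.4f}")
print(f"expected error  I[f]  ~= {I_true:.4f}")
```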

Page 7:

Generalization and Consistency

Convergence in Probability: X_n → X in probability iff for all ε > 0: lim_{n→∞} P(|X_n - X| > ε) = 0

Generalization: performance on the training examples must be a good indicator of performance on future examples: lim_{n→∞} |I_S[f_S] - I[f_S]| = 0 in probability

Consistency: the expected error converges to the most accurate one in H: for all ε > 0: lim_{n→∞} P_μ( I[f_S] - inf_{f∈H} I[f] > ε ) = 0
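A small sketch (my own illustration, assuming uniform random variables, not part of the slides) of what convergence in probability means for the empirical mean: the probability of an ε-deviation from the true mean shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 0.05
true_mean = 0.5          # E[X] for X ~ Uniform(0, 1)

for n in [10, 100, 1000, 10000]:
    # draw many independent samples of size n and compute the empirical mean of each
    means = rng.uniform(0, 1, size=(1000, n)).mean(axis=1)
    # fraction of runs where the empirical mean deviates from E[X] by more than eps
    p_dev = np.mean(np.abs(means - true_mean) > eps)
    print(f"n={n:6d}   P(|mean - {true_mean}| > {eps}) ~= {p_dev:.3f}")
```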

Page 8:

Agenda

Introduction Problem Definition Classical Results Stability Criteria Conclusion

Page 9:

Empirical Risk Minimization (ERM)

Focus of classical learning theory research: exact and almost-ERM

Minimize the training error over H, i.e. take the best hypothesis on the training data: I_S[f_S] = min_{f∈H} I_S[f]

For ERM: Generalization ⇔ Consistency
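A minimal ERM sketch (my own illustration): the hypothesis space is assumed to be the affine functions f(x) = a·x + b, and the algorithm returns the hypothesis with minimal empirical squared-loss error on S, found in closed form by least squares.

```python
import numpy as np

rng = np.random.default_rng(2)

# training data S drawn from an unknown relationship
n = 100
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)

# hypothesis space H: affine functions f(x) = a*x + b
# ERM: choose f_S minimizing I_S[f] = (1/n) sum_i (f(x_i) - y_i)^2
A = np.column_stack([x, np.ones(n)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
f_S = lambda t: a * t + b

train_err = np.mean((f_S(x) - y) ** 2)
print(f"ERM hypothesis: f_S(x) = {a:.3f} * x + {b:.3f}")
print(f"training error I_S[f_S] = {train_err:.4f}")
```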

Page 10:

What algorithms are ERM? All of these belong to the class of ERM algorithms:

Least Squares Regression, Decision Trees, ANN Backpropagation (?), ...

Are all learning algorithms ERM? NO!

Support Vector Machines, k-Nearest Neighbour, Bagging, Boosting, Regularization, ...

Page 11:

Vapnik asked

What property must the hypothesis space H have to ensure good generalization of ERM?

Page 12:

Classical Results for ERM1

Theorem: A necessary and sufficient condition for generalization and consistency of ERM is that H is a uniform Glivenko-Cantelli (uGC) class:

convergence of the empirical mean to the true expected value: uniform convergence in probability for the class of loss functions induced by H and V

for all ε > 0: lim_{n→∞} sup_μ P_μ( sup_{f∈H} | (1/n) Σ_{i=1}^n f(x_i) - ∫_X f(x) dμ(x) | > ε ) = 0

1 e.g. Alon, Ben-David, Cesa-Bianchi, Haussler: Scale-sensitive dimensions, uniform convergence and learnability, Journal of the ACM 44, 1997

Page 13:

VC-Dimension

Binary functions f: X → {0, 1}. VC-dim(H) = size of the largest finite set in X that can be shattered by H; e.g. linear separation in 2D yields VC-dim = 3

Theorem: Let H be a class of binary valued hypotheses, then H is a uGC-class if and only if VC-dim(H) is finite1.

1 Alon, Ben-David, Cesa-Bianchi, Haussler: Scale-sensitive dimensions, uniform convergence and learnability, Journal of the ACM 44, 1997
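A brute-force illustration of the VC-dim = 3 claim for linear separation in 2D (my own sketch; it uses an LP feasibility check via scipy.optimize.linprog as the separability test): three points in general position can be shattered, while the XOR-like labelling of four points on a square cannot be separated. Note that checking one 4-point set does not prove VC-dim ≤ 3; it only shows the familiar failure case.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(points, labels):
    # Feasibility LP: is there (w, b) with labels[i] * (w @ points[i] + b) >= 1 for all i?
    # Rewritten as -labels[i] * (w . x_i + b) <= -1, variables [w1, w2, b] unbounded.
    A_ub = -labels[:, None] * np.column_stack([points, np.ones(len(points))])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 3)
    return res.success

def shattered(points):
    # a set is shattered if every +1/-1 labelling can be realized by a linear separator
    return all(separable(points, np.array(lab, dtype=float))
               for lab in product([-1.0, 1.0], repeat=len(points)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])              # general position
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])   # square (XOR labelling fails)

print("3 points shattered:", shattered(three))   # expected: True  -> VC-dim >= 3
print("4 points shattered:", shattered(four))    # expected: False
```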

Page 14:

Achievements of Classical Learning Theory

Complete characterization of necessary and sufficient conditions for generalization and consistency of ERM

Remaining questions: What about non-ERM algorithms? Can we establish criteria not only for the hypothesis space?

Page 15:

Agenda

Introduction Problem Definition Classical Results Stability Criteria Conclusion

Page 16:

Poggio et al. asked

What property must the learning map L have for good generalization of general algorithms?

Can a new theory subsume the classical results for ERM?

Page 17:

Stability

Small perturbations of the training set should not change the hypothesis much, especially deleting one training example: S^i = S \ {z_i}

How can this be mathematically defined?

[Diagram: the learning map takes both the original training set S and the perturbed training set S^i into the hypothesis space]

Page 18:

Uniform Stability1

A learning algorithm L is uniformly stable if

for all S ∈ Z^n and all i ∈ {1, ..., n}: sup_{z∈Z} | V(f_S, z) - V(f_{S^i}, z) | ≤ K/n

After deleting one training sample the change must be small at all points z ∈ Z

Uniform stability implies generalization. The requirement is too strong:

Most algorithms (e.g. ERM) are not uniformly stable

1 Bousquet, Elisseeff: Stability and Generalization, JMLR 2, 2001

Page 19:

CVloo stability1

Cross-Validation leave-one-out stability

considers only errors at removed training points

strictly weaker than uniform stability

[Diagram: remove z_i from the training set and compare the error at x_i before and after]

lim_{n→∞} sup_{i∈{1,...,n}} | V(f_S, z_i) - V(f_{S^i}, z_i) | = 0 in probability

1 Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003
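A sketch of how this CVloo quantity can be estimated empirically (my own illustration, using regularized least squares / ridge regression as the learner, with an assumed penalty λ = 1): train on S and on every leave-one-out set S^i and record sup_i |V(f_S, z_i) - V(f_{S^i}, z_i)|; for this stable learner the quantity shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(3)

def ridge_fit(X, y, lam=1.0):
    # regularized least squares: w = (X^T X + lam*I)^-1 X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def loss(w, x, y):
    # squared loss V(f, z) for a linear hypothesis w
    return (x @ w - y) ** 2

for n in [20, 80, 320]:
    X = np.column_stack([rng.uniform(0, 1, n), np.ones(n)])
    y = 2.0 * X[:, 0] + 0.3 + 0.1 * rng.normal(size=n)

    w_S = ridge_fit(X, y)
    diffs = []
    for i in range(n):
        # leave out z_i = (x_i, y_i) and retrain on S^i
        mask = np.arange(n) != i
        w_Si = ridge_fit(X[mask], y[mask])
        diffs.append(abs(loss(w_S, X[i], y[i]) - loss(w_Si, X[i], y[i])))
    print(f"n={n:4d}   sup_i |V(f_S, z_i) - V(f_S^i, z_i)| = {max(diffs):.5f}")
```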

Page 20:

Equivalence for ERM1

Theorem: For "good" loss functions the following statements are equivalent for ERM:
L is distribution-independent CVloo stable
ERM generalizes and is universally consistent
H is a uGC class

Question: Does CVloo stability ensure generalization for all learning algorithms?

1 Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003

Page 21:

CVloo Counterexample1

X uniform on [0, 1], Y ∈ {-1, +1}, target f*(x) = 1. Learning algorithm L:

f_S(x) = 1 if x is a training point (x ∈ {x_1, ..., x_n}), -1 otherwise

f_{S^i}(x) = f_S(x) for x = x_i, i.e. the hypothesis value at the removed training point does not change

No change at removed training point → CVloo stable. Algorithm does not generalize at all!

1 Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003
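A sketch of the pathological "memorize the training inputs" learner from this slide (my own illustration; the 0-1 loss is an assumption, and the code only demonstrates the failure to generalize, i.e. that the gap between I_S[f_S] and I[f_S] does not vanish).

```python
import numpy as np

rng = np.random.default_rng(4)

def learn(x_train):
    # pathological learner: +1 at the memorized training inputs, -1 everywhere else
    memorized = set(np.round(x_train, 12))
    return lambda x: np.where(np.isin(np.round(x, 12), list(memorized)), 1.0, -1.0)

def zero_one_loss(pred, y):
    return (pred != y).astype(float)

n = 200
x_train = rng.uniform(0, 1, n)
y_train = np.ones(n)                      # target f*(x) = 1 everywhere

f_S = learn(x_train)

train_err = zero_one_loss(f_S(x_train), y_train).mean()              # I_S[f_S]
x_test = rng.uniform(0, 1, 100_000)                                  # fresh points, almost surely unseen
test_err = zero_one_loss(f_S(x_test), np.ones_like(x_test)).mean()   # ~ I[f_S]

print(f"training error I_S[f_S] = {train_err:.3f}")   # 0.0
print(f"expected error I[f_S]  ~= {test_err:.3f}")    # ~1.0 -> the gap does not vanish
```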

Page 22:

Additional Stability Criteria

Error (Eloo) stability: lim_{n→∞} sup_{i∈{1,...,n}} | I[f_S] - I[f_{S^i}] | = 0 in probability

Empirical Error (EEloo) stability: lim_{n→∞} sup_{i∈{1,...,n}} | I_S[f_S] - I_{S^i}[f_{S^i}] | = 0 in probability

Weak conditions, satisfied by most reasonable learning algorithms (e.g. ERM)

Not sufficient for generalization
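As a companion to the earlier CVloo sketch, the following illustration (my own, again using ridge regression with an assumed λ = 1 and a Monte-Carlo test set in place of the true expected error) estimates all three leave-one-out quantities for a single training set.

```python
import numpy as np

rng = np.random.default_rng(6)

def ridge_fit(X, y, lam=1.0):
    # regularized least squares: w = (X^T X + lam*I)^-1 X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def sq_loss(w, X, y):
    return (X @ w - y) ** 2

n = 200
X = np.column_stack([rng.uniform(0, 1, n), np.ones(n)])
y = 2.0 * X[:, 0] + 0.3 + 0.1 * rng.normal(size=n)
X_test = np.column_stack([rng.uniform(0, 1, 50_000), np.ones(50_000)])
y_test = 2.0 * X_test[:, 0] + 0.3 + 0.1 * rng.normal(size=50_000)

w_S = ridge_fit(X, y)
I_S = sq_loss(w_S, X, y).mean()                 # empirical error I_S[f_S]
I_true = sq_loss(w_S, X_test, y_test).mean()    # Monte-Carlo estimate of I[f_S]

cv, e, ee = [], [], []
for i in range(n):
    mask = np.arange(n) != i
    w_Si = ridge_fit(X[mask], y[mask])
    cv.append(abs(sq_loss(w_S, X[i:i+1], y[i:i+1])[0]
                  - sq_loss(w_Si, X[i:i+1], y[i:i+1])[0]))          # CVloo quantity
    e.append(abs(I_true - sq_loss(w_Si, X_test, y_test).mean()))    # Eloo quantity (estimated)
    ee.append(abs(I_S - sq_loss(w_Si, X[mask], y[mask]).mean()))    # EEloo quantity

print(f"CVloo: sup_i |V(f_S,z_i) - V(f_S^i,z_i)|    = {max(cv):.5f}")
print(f"Eloo:  sup_i |I[f_S] - I[f_S^i]| (estimate) = {max(e):.5f}")
print(f"EEloo: sup_i |I_S[f_S] - I_S^i[f_S^i]|      = {max(ee):.5f}")
```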

Page 23:

CVEEEloo Stability

Learning Map L is CVEEEloo stable if it is CVloo stable

and Eloo stable

and EEloo stable

Question: Does this imply generalization for all L?

Page 24:

CVEEEloo implies Generalization1

Theorem: If L is CVEEEloo stable and the loss function is bounded, then fS generalizes

Remarks: No single condition (CVloo, Eloo, EEloo) is sufficient by itself; even Eloo and EEloo stability together are not sufficient

For ERM CVloo stability alone is necessary and sufficient for generalization and consistency

1 Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003

Page 25:

Consistency

CVEEEloo stability in general does NOT guarantee consistency

Good generalization does NOT necessarily mean good prediction, but poor expected performance is indicated by poor training performance

Page 26:

CVEEEloo stable algorithms

Support Vector Machines and Regularization
k-Nearest Neighbour (k increasing with n)
Bagging (number of regressors increasing with n)
More results to come (e.g. AdaBoost)

For some of these algorithms a 'VC-style' analysis is impossible (e.g. k-NN)

For all these algorithms generalization is guaranteed by the shown theorems!
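The same leave-one-out check as in the earlier ridge sketch, applied here to k-nearest-neighbour regression with k growing with n (my own illustration; the choice k = ⌈√n⌉ is an assumption for the demo): the change in loss at the removed point shrinks as n and k grow.

```python
import numpy as np

rng = np.random.default_rng(5)

def knn_predict(x_query, x_train, y_train, k):
    # k-nearest-neighbour regression in 1D: average the labels of the k closest training points
    idx = np.argsort(np.abs(x_train - x_query))[:k]
    return y_train[idx].mean()

for n in [50, 200, 800]:
    k = int(np.ceil(np.sqrt(n)))        # k grows with n
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)

    diffs = []
    for i in range(n):
        pred_S = knn_predict(x[i], x, y, k)                   # trained on all of S
        mask = np.arange(n) != i
        pred_Si = knn_predict(x[i], x[mask], y[mask], k)      # trained on S^i = S \ {z_i}
        diffs.append(abs((pred_S - y[i]) ** 2 - (pred_Si - y[i]) ** 2))
    print(f"n={n:4d}  k={k:3d}  sup_i |V(f_S,z_i) - V(f_S^i,z_i)| = {max(diffs):.4f}")
```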

Page 27:

Agenda

Introduction Problem Definition Classical Results Stability Criteria Conclusion

Page 28:

Implications

Classical 'VC-style' conditions → Occam's Razor: prefer simple hypotheses

CVloo stability → incremental change → online algorithms

Inverse Problems: stability ↔ well-posedness; condition numbers characterize stability

Stability-based learning may have more direct connections with the brain's learning mechanisms (a condition on the learning machinery)

Page 29:

Language Learning

Goal: learn grammars from sentences
Hypothesis Space: class of all learnable grammars

What is easier to characterize and gives more insight into real language learning? The language learning algorithm, or the class of all learnable grammars?

Focusing on algorithms shifts the focus to stability

Page 30:

Conclusion

Stability implies generalization: intuitive (CVloo) and technical (Eloo, EEloo) criteria

Theory subsumes classical ERM results; generalization criteria also for non-ERM algorithms

Restrictions on learning map rather than hypothesis space

New approach for designing learning algorithms

Page 31:

Open Questions

Easier / other necessary and sufficient conditions for generalization

Conditions for general consistency
Tight bounds for sample complexity
Applications of the theory for new algorithms
Stability proofs for existing algorithms

Page 32:

Thank you!

Page 33:

Sources

T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi: General conditions for predictivity in learning theory, Nature Vol. 428, pp. 419-422, 2004

S. Mukherjee, P. Niyogi, T. Poggio, R. Rifkin: Statistical Learning: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, AI Memo 2002-024, MIT, 2003

T. Mitchell: Machine Learning, McGraw-Hill, 1997

C. Tomasi: Past performance and future results, Nature Vol. 428, p. 378, 2004

N. Alon, S. Ben-David, N. Cesa-Bianchi, D. Haussler: Scale-sensitive Dimensions, Uniform Convergence, and Learnability, Journal of the ACM 44(4), 1997