
Page 1:

T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi:

General Conditions for Predictivity in Learning Theory

Michael Pfeiffer

pfeiffer@igi.tugraz.at

25.11.2004

Page 2:

Motivation

Supervised Learning: learn functional relationships from a finite set of labelled training examples

Generalization: How well does the learned function perform on unseen test examples?

Central question in supervised learning

Page 3:

What you will hear

New Idea: Stability implies predictivity. A learning algorithm is stable if small perturbations of the training set do not change the hypothesis much.

Conditions for generalization are placed on the learning map rather than the hypothesis space, in contrast to VC-analysis.

Page 4:

Agenda

Introduction Problem Definition Classical Results Stability Criteria Conclusion

Page 5:

Some Definitions 1/2

Training Data: S = {z_1 = (x_1, y_1), ..., z_n = (x_n, y_n)} ⊂ Z = X × Y, drawn from an unknown distribution μ(x, y)

Hypothesis Space: H; Hypothesis f_S ∈ H, f_S: X → Y

Learning Algorithm: L: Z^n → H, f_S = L(S) = L(z_1 = (x_1, y_1), ..., z_n = (x_n, y_n))

Regression: f_S is real-valued; Classification: f_S is binary

Symmetric learning algorithm (ordering of the training examples is irrelevant)

Page 6:

Some Definitions 2/2

Loss Function: V: H × Z → ℝ, e.g. V(f, z) = (f(x) - y)²

Assume that V is bounded

Empirical Error (Training Error): I_S[f] = (1/n) Σ_{i=1}^n V(f, z_i)

Expected Error (True Error): I[f] = ∫_Z V(f, z) dμ(z)
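To make these two error measures concrete, here is a minimal Python sketch (my own illustration, not part of the slides): it fixes a hypothesis f, draws a training sample from an assumed uniform input distribution with noisy linear labels, computes the empirical error I_S[f] under the squared loss, and approximates the expected error I[f] by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # a fixed hypothesis f: X -> Y (here a simple linear function)
    return 2.0 * x + 0.5

def target(x, rng):
    # the unknown relationship generating the labels y (with noise)
    return 2.0 * x + 0.3 + 0.1 * rng.normal(size=x.shape)

def V(fx, y):
    # squared loss V(f, z) = (f(x) - y)^2
    return (fx - y) ** 2

# training sample S = {(x_i, y_i)}_{i=1..n} drawn from mu
n = 50
x_train = rng.uniform(0, 1, n)
y_train = target(x_train, rng)

# empirical error I_S[f] = (1/n) * sum_i V(f, z_i)
I_S = np.mean(V(f(x_train), y_train))

# expected error I[f] = integral of V(f, z) dmu(z), approximated by Monte Carlo
x_big = rng.uniform(0, 1, 200_000)
y_big = target(x_big, rng)
I_true = np.mean(V(f(x_big), y_big))

print(f"empirical error I_S[f] = {I_S:.4f}")
print(f"expected error  I[f]  ~= {I_true:.4f}")
```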

Page 7:

Generalization and Consistency

Convergence in Probability: X_n → X in probability iff for all ε > 0: lim_{n→∞} P(|X_n - X| > ε) = 0

Generalization: performance on the training examples must be a good indicator of performance on future examples: lim_{n→∞} |I_S[f_S] - I[f_S]| = 0 in probability

Consistency: the expected error converges to the most accurate one in H: for all ε > 0: lim_{n→∞} P_μ( I[f_S] - inf_{f∈H} I[f] > ε ) = 0
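A small sketch (my own illustration, assuming uniform random variables, not part of the slides) of what convergence in probability means for the empirical mean: the probability of an ε-deviation from the true mean shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 0.05
true_mean = 0.5          # E[X] for X ~ Uniform(0, 1)

for n in [10, 100, 1000, 10000]:
    # draw many independent samples of size n and compute the empirical mean of each
    means = rng.uniform(0, 1, size=(1000, n)).mean(axis=1)
    # fraction of runs where the empirical mean deviates from E[X] by more than eps
    p_dev = np.mean(np.abs(means - true_mean) > eps)
    print(f"n={n:6d}   P(|mean - {true_mean}| > {eps}) ~= {p_dev:.3f}")
```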

Page 8:

Agenda

Introduction Problem Definition Classical Results Stability Criteria Conclusion

Page 9:

Empirical Risk Minimization (ERM)

Focus of classical learning theory research: exact and almost-ERM

Minimize the training error over H, i.e. take the best hypothesis on the training data: I_S[f_S] = min_{f∈H} I_S[f]

For ERM: Generalization ⇔ Consistency
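A minimal ERM sketch (my own illustration): the hypothesis space is assumed to be the affine functions f(x) = a·x + b, and the algorithm returns the hypothesis with minimal empirical squared-loss error on S, found in closed form by least squares.

```python
import numpy as np

rng = np.random.default_rng(2)

# training data S drawn from an unknown relationship
n = 100
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)

# hypothesis space H: affine functions f(x) = a*x + b
# ERM: choose f_S minimizing I_S[f] = (1/n) sum_i (f(x_i) - y_i)^2
A = np.column_stack([x, np.ones(n)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
f_S = lambda t: a * t + b

train_err = np.mean((f_S(x) - y) ** 2)
print(f"ERM hypothesis: f_S(x) = {a:.3f} * x + {b:.3f}")
print(f"training error I_S[f_S] = {train_err:.4f}")
```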

Page 10:

What algorithms are ERM? All of these belong to the class of ERM algorithms:

Least Squares Regression, Decision Trees, ANN Backpropagation (?), ...

Are all learning algorithms ERM? NO!

Support Vector Machines, k-Nearest Neighbour, Bagging, Boosting, Regularization, ...

Page 11:

Vapnik asked

What property must the hypothesis space H have to ensure good generalization of ERM?

Page 12:

Classical Results for ERM1

Theorem: A necessary and sufficient condition for generalization and consistency of ERM is that H is a uniform Glivenko-Cantelli (uGC) class:

convergence of the empirical mean to the true expected value: uniform convergence in probability for the class of loss functions induced by H and V

for all ε > 0: lim_{n→∞} sup_μ P_μ( sup_{f∈H} | (1/n) Σ_{i=1}^n f(x_i) - ∫_X f(x) dμ(x) | > ε ) = 0

1 e.g. Alon, Ben-David, Cesa-Bianchi, Haussler: Scale-sensitive dimensions, uniform convergence and learnability, Journal of the ACM 44, 1997

Page 13:

VC-Dimension

Binary functions f: X → {0, 1}. VC-dim(H) = size of the largest finite set in X that can be shattered by H; e.g. linear separation in 2D yields VC-dim = 3

Theorem: Let H be a class of binary valued hypotheses, then H is a uGC-class if and only if VC-dim(H) is finite1.

1 Alon, Ben-David, Cesa-Bianchi, Haussler: Scale-sensitive dimensions, uniform convergence and learnability, Journal of the ACM 44, 1997
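A brute-force illustration of the VC-dim = 3 claim for linear separation in 2D (my own sketch; it uses an LP feasibility check via scipy.optimize.linprog as the separability test): three points in general position can be shattered, while the XOR-like labelling of four points on a square cannot be separated. Note that checking one 4-point set does not prove VC-dim ≤ 3; it only shows the familiar failure case.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(points, labels):
    # Feasibility LP: is there (w, b) with labels[i] * (w @ points[i] + b) >= 1 for all i?
    # Rewritten as -labels[i] * (w . x_i + b) <= -1, variables [w1, w2, b] unbounded.
    A_ub = -labels[:, None] * np.column_stack([points, np.ones(len(points))])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 3)
    return res.success

def shattered(points):
    # a set is shattered if every +1/-1 labelling can be realized by a linear separator
    return all(separable(points, np.array(lab, dtype=float))
               for lab in product([-1.0, 1.0], repeat=len(points)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])              # general position
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])   # square (XOR labelling fails)

print("3 points shattered:", shattered(three))   # expected: True  -> VC-dim >= 3
print("4 points shattered:", shattered(four))    # expected: False
```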

Page 14:

Achievements of Classical Learning Theory

Complete characterization of necessary and sufficient conditions for generalization and consistency of ERM

Remaining questions: What about non-ERM algorithms? Can we establish criteria not only for the hypothesis space?

Page 15:

Agenda

Introduction Problem Definition Classical Results Stability Criteria Conclusion

Page 16:

Poggio et al. asked

What property must the learning map L have for good generalization of general algorithms?

Can a new theory subsume the classical results for ERM?

Page 17:

Stability

Small perturbations of the training set should not change the hypothesis much, especially deleting one training example: S^i = S \ {z_i}

How can this be mathematically defined?

[Diagram: the learning map takes both the original training set S and the perturbed training set S^i into the hypothesis space]

Page 18:

Uniform Stability1

A learning algorithm L is uniformly stable if

for all S ∈ Z^n and all i ∈ {1, ..., n}: sup_{z∈Z} | V(f_S, z) - V(f_{S^i}, z) | ≤ K/n

After deleting one training sample the change must be small at all points z ∈ Z

Uniform stability implies generalization. The requirement is too strong:

Most algorithms (e.g. ERM) are not uniformly stable

1 Bousquet, Elisseeff: Stability and Generalization, JMLR 2, 2001

Page 19:

CVloo stability1

Cross-Validation leave-one-out stability

considers only errors at removed training points

strictly weaker than uniform stability

[Diagram: remove z_i from the training set and compare the error at x_i before and after]

lim_{n→∞} sup_{i∈{1,...,n}} | V(f_S, z_i) - V(f_{S^i}, z_i) | = 0 in probability

1 Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003
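A sketch of how this CVloo quantity can be estimated empirically (my own illustration, using regularized least squares / ridge regression as the learner, with an assumed penalty λ = 1): train on S and on every leave-one-out set S^i and record sup_i |V(f_S, z_i) - V(f_{S^i}, z_i)|; for this stable learner the quantity shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(3)

def ridge_fit(X, y, lam=1.0):
    # regularized least squares: w = (X^T X + lam*I)^-1 X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def loss(w, x, y):
    # squared loss V(f, z) for a linear hypothesis w
    return (x @ w - y) ** 2

for n in [20, 80, 320]:
    X = np.column_stack([rng.uniform(0, 1, n), np.ones(n)])
    y = 2.0 * X[:, 0] + 0.3 + 0.1 * rng.normal(size=n)

    w_S = ridge_fit(X, y)
    diffs = []
    for i in range(n):
        # leave out z_i = (x_i, y_i) and retrain on S^i
        mask = np.arange(n) != i
        w_Si = ridge_fit(X[mask], y[mask])
        diffs.append(abs(loss(w_S, X[i], y[i]) - loss(w_Si, X[i], y[i])))
    print(f"n={n:4d}   sup_i |V(f_S, z_i) - V(f_S^i, z_i)| = {max(diffs):.5f}")
```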

Page 20:

Equivalence for ERM1

Theorem: For "good" loss functions the following statements are equivalent for ERM:
L is distribution-independent CVloo stable
ERM generalizes and is universally consistent
H is a uGC class

Question: Does CVloo stability ensure generalization for all learning algorithms?

1 Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003

Page 21:

CVloo Counterexample1

X uniform on [0, 1], Y ∈ {-1, +1}, target f*(x) = 1. Learning algorithm L:

f_S(x) = 1 if x is a training point (x ∈ {x_1, ..., x_n}), -1 otherwise

f_{S^i}(x) = f_S(x) for x = x_i, i.e. the hypothesis value at the removed training point does not change

No change at removed training point → CVloo stable. Algorithm does not generalize at all!

1 Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003
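A sketch of the pathological "memorize the training inputs" learner from this slide (my own illustration; the 0-1 loss is an assumption, and the code only demonstrates the failure to generalize, i.e. that the gap between I_S[f_S] and I[f_S] does not vanish).

```python
import numpy as np

rng = np.random.default_rng(4)

def learn(x_train):
    # pathological learner: +1 at the memorized training inputs, -1 everywhere else
    memorized = set(np.round(x_train, 12))
    return lambda x: np.where(np.isin(np.round(x, 12), list(memorized)), 1.0, -1.0)

def zero_one_loss(pred, y):
    return (pred != y).astype(float)

n = 200
x_train = rng.uniform(0, 1, n)
y_train = np.ones(n)                      # target f*(x) = 1 everywhere

f_S = learn(x_train)

train_err = zero_one_loss(f_S(x_train), y_train).mean()              # I_S[f_S]
x_test = rng.uniform(0, 1, 100_000)                                  # fresh points, almost surely unseen
test_err = zero_one_loss(f_S(x_test), np.ones_like(x_test)).mean()   # ~ I[f_S]

print(f"training error I_S[f_S] = {train_err:.3f}")   # 0.0
print(f"expected error I[f_S]  ~= {test_err:.3f}")    # ~1.0 -> the gap does not vanish
```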

Page 22:

Additional Stability Criteria

Error (Eloo) stability: lim_{n→∞} sup_{i∈{1,...,n}} | I[f_S] - I[f_{S^i}] | = 0 in probability

Empirical Error (EEloo) stability: lim_{n→∞} sup_{i∈{1,...,n}} | I_S[f_S] - I_{S^i}[f_{S^i}] | = 0 in probability

Weak conditions, satisfied by most reasonable learning algorithms (e.g. ERM)

Not sufficient for generalization
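As a companion to the earlier CVloo sketch, the following illustration (my own, again using ridge regression with an assumed λ = 1 and a Monte-Carlo test set in place of the true expected error) estimates all three leave-one-out quantities for a single training set.

```python
import numpy as np

rng = np.random.default_rng(6)

def ridge_fit(X, y, lam=1.0):
    # regularized least squares: w = (X^T X + lam*I)^-1 X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def sq_loss(w, X, y):
    return (X @ w - y) ** 2

n = 200
X = np.column_stack([rng.uniform(0, 1, n), np.ones(n)])
y = 2.0 * X[:, 0] + 0.3 + 0.1 * rng.normal(size=n)
X_test = np.column_stack([rng.uniform(0, 1, 50_000), np.ones(50_000)])
y_test = 2.0 * X_test[:, 0] + 0.3 + 0.1 * rng.normal(size=50_000)

w_S = ridge_fit(X, y)
I_S = sq_loss(w_S, X, y).mean()                 # empirical error I_S[f_S]
I_true = sq_loss(w_S, X_test, y_test).mean()    # Monte-Carlo estimate of I[f_S]

cv, e, ee = [], [], []
for i in range(n):
    mask = np.arange(n) != i
    w_Si = ridge_fit(X[mask], y[mask])
    cv.append(abs(sq_loss(w_S, X[i:i+1], y[i:i+1])[0]
                  - sq_loss(w_Si, X[i:i+1], y[i:i+1])[0]))          # CVloo quantity
    e.append(abs(I_true - sq_loss(w_Si, X_test, y_test).mean()))    # Eloo quantity (estimated)
    ee.append(abs(I_S - sq_loss(w_Si, X[mask], y[mask]).mean()))    # EEloo quantity

print(f"CVloo: sup_i |V(f_S,z_i) - V(f_S^i,z_i)|    = {max(cv):.5f}")
print(f"Eloo:  sup_i |I[f_S] - I[f_S^i]| (estimate) = {max(e):.5f}")
print(f"EEloo: sup_i |I_S[f_S] - I_S^i[f_S^i]|      = {max(ee):.5f}")
```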

Page 23:

CVEEEloo Stability

Learning Map L is CVEEEloo stable if it is CVloo stable

and Eloo stable

and EEloo stable

Question: Does this imply generalization for all L?

Page 24:

CVEEEloo implies Generalization1

Theorem: If L is CVEEEloo stable and the loss function is bounded, then fS generalizes

Remarks: No single condition (CVloo, Eloo, EEloo) is sufficient by itself; even Eloo and EEloo stability together are not sufficient

For ERM CVloo stability alone is necessary and sufficient for generalization and consistency

1 Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003

Page 25:

Consistency

CVEEEloo stability in general does NOT guarantee consistency

Good generalization does NOT necessarily mean good prediction, but poor expected performance is indicated by poor training performance

Page 26:

CVEEEloo stable algorithms

Support Vector Machines and Regularization
k-Nearest Neighbour (k increasing with n)
Bagging (number of regressors increasing with n)
More results to come (e.g. AdaBoost)

For some of these algorithms a 'VC-style' analysis is impossible (e.g. k-NN)

For all these algorithms generalization is guaranteed by the shown theorems!
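The same leave-one-out check as in the earlier ridge sketch, applied here to k-nearest-neighbour regression with k growing with n (my own illustration; the choice k = ⌈√n⌉ is an assumption for the demo): the change in loss at the removed point shrinks as n and k grow.

```python
import numpy as np

rng = np.random.default_rng(5)

def knn_predict(x_query, x_train, y_train, k):
    # k-nearest-neighbour regression in 1D: average the labels of the k closest training points
    idx = np.argsort(np.abs(x_train - x_query))[:k]
    return y_train[idx].mean()

for n in [50, 200, 800]:
    k = int(np.ceil(np.sqrt(n)))        # k grows with n
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)

    diffs = []
    for i in range(n):
        pred_S = knn_predict(x[i], x, y, k)                   # trained on all of S
        mask = np.arange(n) != i
        pred_Si = knn_predict(x[i], x[mask], y[mask], k)      # trained on S^i = S \ {z_i}
        diffs.append(abs((pred_S - y[i]) ** 2 - (pred_Si - y[i]) ** 2))
    print(f"n={n:4d}  k={k:3d}  sup_i |V(f_S,z_i) - V(f_S^i,z_i)| = {max(diffs):.4f}")
```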

Page 27:

Agenda

Introduction Problem Definition Classical Results Stability Criteria Conclusion

Page 28:

Implications

Classical 'VC-style' conditions → Occam's Razor: prefer simple hypotheses

CVloo stability → incremental change → online algorithms

Inverse Problems: stability ↔ well-posedness; condition numbers characterize stability

Stability-based learning may have more direct connections with the brain's learning mechanisms (a condition on the learning machinery)

Page 29:

Language Learning

Goal: learn grammars from sentences
Hypothesis Space: class of all learnable grammars

What is easier to characterize and gives more insight into real language learning? The language learning algorithm, or the class of all learnable grammars?

Focusing on algorithms shifts the focus to stability

Page 30:

Conclusion

Stability implies generalization: intuitive (CVloo) and technical (Eloo, EEloo) criteria

Theory subsumes classical ERM results; generalization criteria also for non-ERM algorithms

Restrictions on learning map rather than hypothesis space

New approach for designing learning algorithms

Page 31:

Open Questions

Easier / other necessary and sufficient conditions for generalization

Conditions for general consistency
Tight bounds for sample complexity
Applications of the theory for new algorithms
Stability proofs for existing algorithms

Page 32:

Thank you!

Page 33:

Sources

T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi: General conditions for predictivity in learning theory, Nature Vol. 428, pp. 419-422, 2004

S. Mukherjee, P. Niyogi, T. Poggio, R. Rifkin: Statistical Learning: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, AI Memo 2002-024, MIT, 2003

T. Mitchell: Machine Learning, McGraw-Hill, 1997

C. Tomasi: Past performance and future results, Nature Vol. 428, p. 378, 2004

N. Alon, S. Ben-David, N. Cesa-Bianchi, D. Haussler: Scale-sensitive Dimensions, Uniform Convergence, and Learnability, Journal of the ACM 44(4), 1997