Introduction to Predictive Learning


Transcript of Introduction to Predictive Learning

  • Slide 2

    Introduction to Predictive Learning
    Electrical and Computer Engineering
    LECTURE SET 4
    Statistical Learning Theory

  • Slide 3

    OUTLINE of Set 4
    - Objectives and Overview
    - Inductive Learning Problem Setting
    - Keep-It-Direct Principle
    - Analysis of ERM
    - VC-dimension
    - Generalization Bounds
    - Structural Risk Minimization (SRM)
    - Summary and Discussion

  • Slide 4

    Objectives
    Problems with philosophical approaches:
    - lack of quantitative description/characterization of ideas
    - no real predictive power (as in the Natural Sciences)
    - no agreement on basic definitions/concepts (as in the Natural Sciences)
    Goal: to introduce Predictive Learning as a scientific discipline

  • Slide 5

    Characteristics of a Scientific Theory
    - Problem setting
    - Solution approach
    - Math proofs (technical analysis)
    - Constructive methods
    - Applications
    Note: Problem Setting and Solution Approach are independent (of each other)

  • Slide 6

    History and Overview
    SLT, aka VC-theory (Vapnik-Chervonenkis): a theory for estimating dependencies from finite samples (the predictive learning setting)
    Based on the risk minimization approach
    All main results originally developed in the 1970s for classification (pattern recognition) - why? - but remained largely unknown
    Recent renewed interest due to the practical success of Support Vector Machines (SVM)

  • Slide 7

    History and Overview (cont'd)
    MAIN CONCEPTUAL CONTRIBUTIONS
    - Distinction between problem setting, inductive principle and learning algorithms
    - Direct approach to estimation with finite data (KID principle)
    - Math analysis of ERM (standard inductive setting)
    - Two factors responsible for generalization:
      - empirical risk (fitting error)
      - complexity (capacity) of approximating functions

  • Slide 8

    Importance of VC-theory
    Math results addressing the main question:
    - under what general conditions does the ERM approach lead to (good) generalization?
    New approach to induction:
    - predictive vs generative modeling (in classical statistics)
    Connection to philosophy of science:
    - VC-theory developed for binary classification (pattern recognition) ~ the simplest generalization problem
    - natural sciences: from observations to scientific law
    VC-theoretical results can be interpreted using general philosophical principles of induction, and vice versa.

  • Slide 11

    Empirical Risk Minimization
    ERM principle in model-based learning:
    - Model parameterization: f(x, w)
    - Loss function: L(f(x, w), y)
    - Estimate risk from the data:
      R_emp(w) = (1/n) Σ_{i=1..n} L(f(x_i, w), y_i)
    - Choose w* that minimizes R_emp
    Statistical Learning Theory developed from the theoretical analysis of the ERM principle under finite sample settings.
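
    As a minimal sketch of the ERM principle above (a linear model with squared loss on synthetic data; all names and values here are illustrative, not from the slides):

    ```python
    import numpy as np

    # Minimal ERM sketch: linear model f(x, w) = X @ w with squared loss.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(50, 3))                 # 50 samples, 3 features
    y = X @ np.array([1.5, -2.0, 0.7]) + 0.1 * rng.standard_normal(50)

    def empirical_risk(w, X, y):
        """R_emp(w) = (1/n) * sum_i L(f(x_i, w), y_i) with squared loss."""
        return np.mean((X @ w - y) ** 2)

    # For squared loss and a linear parameterization, the minimizer w* of
    # R_emp has a closed form (ordinary least squares).
    w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("w* =", w_star, " R_emp(w*) =", empirical_risk(w_star, X, y))
    ```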

  • Slide 12

    Probabilistic Modeling vs ERM

  • Slide 13

    Probabilistic Modeling vs ERM: Example
    (Figure: two-class data in the (x1, x2) plane; known class distribution → optimal decision boundary.)

  • Slide 14

    Probabilistic Approach
    Estimate parameters of Gaussian class distributions, and plug them into the quadratic decision boundary.
    (Figure: estimated quadratic decision boundary in the (x1, x2) plane.)

  • Slide 15

    ERM Approach
    Quadratic and linear decision boundaries estimated via minimization of squared loss.
    (Figure: quadratic and linear boundaries estimated by ERM in the (x1, x2) plane.)
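
    A rough sketch of this contrast, assuming two synthetic 2-D Gaussian classes with equal priors; the means, covariances, and sample sizes are made-up illustrations rather than the data shown in the figures:

    ```python
    import numpy as np

    # Two synthetic Gaussian classes in 2-D (illustrative only).
    rng = np.random.default_rng(1)
    n = 100
    x_pos = rng.multivariate_normal([2, 2], [[1, 0.3], [0.3, 1]], n)
    x_neg = rng.multivariate_normal([6, 0], [[1, -0.2], [-0.2, 2]], n)
    X = np.vstack([x_pos, x_neg])
    y = np.hstack([np.ones(n), -np.ones(n)])

    # Probabilistic (generative) approach: estimate each class density,
    # then plug the estimates into the Bayes-style quadratic rule (equal priors assumed).
    def gaussian_log_density(x, mu, cov):
        d = x - mu
        return -0.5 * (d @ np.linalg.solve(cov, d) + np.log(np.linalg.det(cov)))

    mu_p, cov_p = x_pos.mean(0), np.cov(x_pos.T)
    mu_n, cov_n = x_neg.mean(0), np.cov(x_neg.T)
    def plug_in_rule(x):
        return 1 if gaussian_log_density(x, mu_p, cov_p) >= gaussian_log_density(x, mu_n, cov_n) else -1

    # ERM approach: directly minimize squared loss of a linear decision function
    # D(x) = w.x + b, then classify by its sign (no density estimation at all).
    A = np.hstack([X, np.ones((2 * n, 1))])
    w_b, *_ = np.linalg.lstsq(A, y, rcond=None)
    def erm_rule(x):
        return 1 if np.append(x, 1.0) @ w_b >= 0 else -1

    err_plug = np.mean([plug_in_rule(x) != t for x, t in zip(X, y)])
    err_erm = np.mean([erm_rule(x) != t for x, t in zip(X, y)])
    print("training error, plug-in:", err_plug, " ERM (squared loss):", err_erm)
    ```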

  • Slide 16

    Estimation of multivariate functions
    Is it possible to estimate a function from finite data?
    Simplified problem: estimation of an unknown continuous function from noise-free samples.
    Many results from function approximation theory:
    - To estimate a d-dimensional function accurately, one needs O(n^d) data points.
    - For example, if 3 points are needed to estimate a 2nd-order polynomial for d=1, then 3^10 points are needed to estimate a 2nd-order polynomial in 10-dimensional space.
    - Similar results in signal processing.
    There are never enough data points to estimate multivariate functions in most practical applications (image recognition, genomics etc.).
    For multivariate function estimation, the number of free parameters increases exponentially with problem dimensionality (the Curse of Dimensionality).
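
    The growth quoted above is easy to make concrete; a tiny sketch assuming the slide's rule of thumb of roughly 3 points per dimension:

    ```python
    # Rough illustration of the O(n^d) sample requirement: if ~3 points per
    # dimension suffice for a 2nd-order polynomial in d=1, a tensor-product
    # grid in d dimensions needs about 3**d points.
    for d in (1, 2, 5, 10, 20):
        print(f"d = {d:2d}:  ~{3 ** d:,} samples")
    # d = 10 already requires ~59,049 samples; d = 20 needs ~3.5 billion.
    ```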

  • Slide 18

    OUTLINE of Set 4
    - Objectives and Overview
    - Inductive Learning Problem Setting
    - Keep-It-Direct Principle
    - Analysis of ERM
    - VC-dimension
    - Generalization Bounds
    - Structural Risk Minimization (SRM)
    - Summary and Discussion

  • Slide 19

    Keep-It-Direct Principle
    The goal of learning is generalization rather than estimation of the true function (system identification).
    Keep-It-Direct Principle (Vapnik, 1995):
    Do not solve an estimation problem of interest by solving a more general (harder) problem as an intermediate step.
    A good predictive model reflects some properties of the unknown distribution P(x, y):
    R(w) = ∫ L(y, f(x, w)) dP(x, y) → min
    Since model estimation with finite data is ill-posed, one should never try to solve a more general problem than required by the given application.
    → Importance of formalizing application requirements as a learning problem.

  • Slide 20

    Learning vs System Identification
    Consider the regression problem y = g(x) + noise, where the unknown target function is g(x) = E(y|x).
    Goal 1: Prediction
    R(w) = ∫ (y − f(x, w))² dP(x, y) → min
    Goal 2: Function approximation (system identification)
    R(w) = ∫ (f(x, w) − g(x))² dx → min, or (f(x, w) − E(y|x))² → min
    Admissible models: algebraic polynomials
    Purpose of comparison: contrast goals (1) and (2)
    NOTE: most applications assume Goal 2, i.e. Noisy Data ~ true signal + noise
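
    A small simulation can make the contrast between goals (1) and (2) concrete; the target function, noise level, and non-uniform input density below are assumptions chosen only to expose the gap:

    ```python
    import numpy as np

    # Contrast Goal 1 (prediction under P(x)) with Goal 2 (approximating g
    # everywhere), using a hypothetical target and a non-uniform P(x).
    rng = np.random.default_rng(2)
    g = lambda x: np.sin(2 * np.pi * x)            # assumed target function
    x_train = rng.beta(5, 1, size=40)              # P(x) concentrated near x = 1
    y_train = g(x_train) + 0.1 * rng.standard_normal(40)

    # ERM fit: 3rd-degree polynomial minimizing empirical (prediction) risk.
    f = np.poly1d(np.polyfit(x_train, y_train, deg=3))

    x_test_pred = rng.beta(5, 1, size=5000)        # test points drawn from P(x)
    x_test_unif = rng.uniform(0, 1, size=5000)     # uniform coverage of the input space

    print("prediction risk  E_P[(g - f)^2]      =", np.mean((g(x_test_pred) - f(x_test_pred)) ** 2))
    print("approximation risk  int (g - f)^2 dx =", np.mean((g(x_test_unif) - f(x_test_unif)) ** 2))
    # The same model can have small prediction risk yet approximate g poorly
    # in regions where P(x) puts little probability mass.
    ```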

  • Slide 23

    Regression estimates (2 typical realizations of data):
    Dotted line ~ estimate obtained using the predictive learning setting
    Dashed line ~ estimate obtained via the function approximation setting

  • Slide 24

    Conclusion
    The goal of prediction (1) is different from (less demanding than) the goal of estimating the true target function (2) everywhere in the input space.
    The curse of dimensionality applies to the system identification setting (2), but may not hold under the predictive setting (1).
    Both settings coincide if the input distribution is uniform (e.g., in signal and image denoising applications).

  • Slide 25

    Philosophical Interpretation of KID
    Interpretation of predictive models:
    - Realism ~ objective truth (hidden in Nature)
    - Instrumentalism ~ creation of the human mind (imposed on the data) → favored by KID
    Objective evaluation is still possible (via prediction risk reflecting application needs) → Natural Science
    Methodological implications:
    - Importance of good learning formulations (asking the right question)
    - Accounts for 80% of success in applications

  • Slide 26

    OUTLINE of Set 4
    - Objectives and Overview
    - Inductive Learning Problem Setting
    - Keep-It-Direct Principle
    - Analysis of ERM
    - VC-dimension
    - Generalization Bounds
    - Structural Risk Minimization (SRM)
    - Summary and Discussion

  • Slide 27

    VC-theory has 4 parts:
    1. Analysis of consistency/convergence of ERM
    2. Generalization bounds
    3. Inductive principles (for finite samples)
    4. Constructive methods (learning algorithms) for implementing (3)
    NOTE: (1) → (2) → (3) → (4)
    ERM: R_emp(w) = (1/n) Σ_{i=1..n} L(y_i, f(x_i, w)) → min

  • Slide 28

    Consistency/Convergence of ERM
    The empirical risk is known, but the expected risk is unknown.
    Asymptotic consistency requirement: under what (general) conditions will models providing minimum Empirical Risk also provide minimum Prediction Risk, as the number of samples grows large?
    Why is asymptotic analysis needed?
    - it helps to develop useful concepts
    - necessary and sufficient conditions ensure that VC-theory is general and cannot be improved

  • Slide 30

    Conditions for Consistency of ERM
    Main insight: consistency is not possible without restricting the set of possible models.
    Example: the 1-nearest-neighbor classification method - is it consistent?
    Consider binary decision functions (classification).
    How to measure their flexibility, or their ability to explain the training data (for binary classification)?
    This complexity index for indicator functions:
    - is independent of the unknown data distribution;
    - measures the capacity of a set of possible models, rather than characteristics of the true model.

  • Slide 31

    OUTLINE of Set 4
    - Objectives and Overview
    - Inductive Learning Problem Setting
    - Keep-It-Direct Principle
    - Analysis of ERM
    - VC-dimension
    - Generalization Bounds
    - Structural Risk Minimization (SRM)
    - Summary and Discussion

  • Slide 32

    SHATTERING
    Linear indicator functions: can split 3 data points in 2D in all 2^3 = 8 possible binary partitions.
    If a set of n samples can be separated by a set of functions in all 2^n possible ways, this sample is said to be shattered (by the set of functions).
    Shattering ~ a set of models can explain a given sample of size n (for all possible labelings).
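
    A brute-force check of the 2-D example above: enumerate all 2^3 labelings of three non-collinear points and verify that each can be realized by a linear indicator function. Using the perceptron rule as the separability test is just one convenient choice, not part of the slides:

    ```python
    import itertools
    import numpy as np

    # Shattering check: can linear indicator functions I(w.x + b > 0) realize
    # all 2^3 labelings of 3 (non-collinear) points in the plane?
    points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

    def linearly_separable(X, y, epochs=1000, lr=0.1):
        """Crude separability test via the perceptron rule; returns True if some
        (w, b) classifies every point correctly within the given epochs."""
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            mistakes = 0
            for x, t in zip(X, y):
                if t * (x @ w + b) <= 0:        # misclassified (or on the boundary)
                    w, b = w + lr * t * x, b + lr * t
                    mistakes += 1
            if mistakes == 0:
                return True
        return False

    labelings = list(itertools.product([-1, 1], repeat=len(points)))
    shattered = all(linearly_separable(points, np.array(lab)) for lab in labelings)
    print(f"{len(labelings)} labelings, all separable: {shattered}")   # expect True
    ```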

  • Slide 33

    VC DIMENSION
    Definition: a set of functions has VC-dimension h if there exist h samples that can be shattered by this set of functions, but there are no h+1 samples that can be shattered.
    For linear indicator functions, VC-dimension h = 3 in 2D (h = d + 1 in general).
    VC-dimension is a positive integer (a combinatorial index).
    What is the VC-dimension of the 1-nearest-neighbor classifier?

  • Slide 34

    VC-dimension and Consistency of ERM
    VC-dimension is infinite if a sample of size n can be split in all 2^n possible ways (in this case, no valid generalization is possible).
    Finite VC-dimension gives necessary and sufficient conditions for:
    (1) consistency of ERM-based learning
    (2) fast rate of convergence
    (these conditions are distribution-independent)
    Interpretation of the VC-dimension via falsifiability: functions with small VC-dimension can be easily falsified.

  • Slide 35

    VC-dimension and Falsifiability
    A set of functions has VC-dimension h if
    (a) it can explain (shatter) a set of h samples ~ there exist h samples that cannot falsify it, and
    (b) it cannot shatter h+1 samples ~ any h+1 samples falsify this set.
    Finiteness of the VC-dimension is a necessary and sufficient condition for generalization (for any learning method based on ERM).

  • Slide 36

    Recall Occam's Razor
    Main problem in predictive learning:
    - complexity control (model selection)
    - how to measure complexity?
    Interpretation of Occam's razor (in Statistics):
    - Entities ~ model parameters
    - Complexity ~ degrees-of-freedom
    - Necessity ~ explaining (fitting) available data
    → Model complexity = number of parameters (DoF)
    Consistent with the classical statistical view: learning = function approximation / density estimation

  • Slide 37

    Philosophical Principle of VC-falsifiability
    Occam's Razor: select the model that explains the available data and has the smallest number of free parameters (entities).
    VC theory: select the model that explains the available data and has low VC-dimension (i.e., can be easily falsified).
    → New principle of VC-falsifiability

  • Slide 38

    Calculating the VC-dimension
    How to estimate the VC-dimension (for a given set of functions)?
    Apply the definition (via shattering) to derive analytic estimates → works for simple sets of functions.
    Generally, such analytic estimates are not possible for complex nonlinear parameterizations (i.e., for practical machine learning and statistical methods).

  • Slide 39

    Example: VC-dimension of spherical indicator functions
    Consider spherical decision surfaces in a d-dimensional x-space, parameterized by center c and radius r:
    f(x, c, r) = I( ‖x − c‖² ≤ r² )
    In a 2-dimensional space (d = 2) there exist 3 points that can be shattered, but 4 points cannot be shattered → h = 3

  • Slide 40

    Example: VC-dimension of a linear combination of fixed basis functions (e.g., polynomials, Fourier expansion, etc.)
    Assuming that the basis functions are linearly independent, the VC-dimension equals the number of basis functions (free parameters).
    Example: a single parameter but infinite VC-dimension:
    f(x, w) = I( sin(wx) > 0 )

  • Slide 41

    Example: Wide linear decision boundaries
    Consider linear decision functions D(x) = w·x + b such that the distance between D(x) and the closest data sample is larger than a given value Δ.
    Then the VC-dimension depends on the width (margin) parameter Δ, rather than on dimensionality d (as for ordinary linear models):
    h ≤ min( R²/Δ² , d ) + 1
    where R is the radius of the smallest sphere containing the data.

  • Slide 43

    VC-dimension vs number of parameters
    - VC-dimension can be equal to DoF (number of parameters). Example: linear estimators
    - VC-dimension can be smaller than DoF. Example: penalized estimators
    - VC-dimension can be larger than DoF. Example: feature selection; sin(wx)

  • Slide 44

    VC-dimension for Regression Problems
    VC-dimension was defined for indicator functions.
    It can be extended to real-valued functions, e.g., a third-order polynomial for univariate regression:
    f(x, b, w) = w₃x³ + w₂x² + w₁x + b
    linear parameterization → VC-dim = 4
    Qualitatively, the VC-dimension ~ the ability to fit (or explain) finite training data for regression.
    This can also be related to falsifiability.

  • Slide 46

    OUTLINE of Set 4
    - Objectives and Overview
    - Inductive Learning Problem Setting
    - Keep-It-Direct Principle
    - Analysis of ERM
    - VC-dimension
    - Generalization Bounds
    - Structural Risk Minimization (SRM)
    - Summary and Discussion

  • Slide 47

    Recall consistency of ERM
    Two types of VC-bounds:
    (1) How close is the empirical risk R_emp(w*) to the true risk R(w*)?
    (2) How close is the empirical risk R_emp(w*) to the minimal possible risk?

  • Slide 48

    Generalization Bounds
    Bounds for learning machines (implementing ERM) evaluate the difference between the (unknown) true risk and the known empirical risk, as a function of sample size n and general properties of admissible models (their VC-dimension).
    Classification: the following bound holds with probability 1 − η for all approximating functions:
    R(w) ≤ R_emp(w) + Φ(n/h, ln η / n)
    where Φ is called the confidence interval.
    Regression: the following bound holds with probability 1 − η for all approximating functions:
    R(w) ≤ R_emp(w) / ( 1 − c·√Φ )₊
    where Φ = a₁·[ h·(ln(a₂·n/h) + 1) − ln(η/4) ] / n
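
    A sketch of how the regression bound behaves numerically, assuming the common practical choice a₁ = a₂ = c = 1 and a confidence level η = 0.1 (these constants are assumptions, not values fixed by the slide):

    ```python
    import numpy as np

    def vc_confidence_term(n, h, eta=0.1, a1=1.0, a2=1.0):
        """Phi = a1 * (h * (ln(a2*n/h) + 1) - ln(eta/4)) / n  (confidence term)."""
        return a1 * (h * (np.log(a2 * n / h) + 1.0) - np.log(eta / 4.0)) / n

    def vc_regression_bound(r_emp, n, h, eta=0.1, c=1.0):
        """Regression bound: R <= R_emp / (1 - c*sqrt(Phi))_+ ;
        the bound is vacuous (infinite) when the denominator is not positive."""
        phi = vc_confidence_term(n, h, eta)
        denom = 1.0 - c * np.sqrt(phi)
        return np.inf if denom <= 0 else r_emp / denom

    # With n fixed, the bound loosens as h grows (i.e., as n/h shrinks):
    for h in (2, 5, 10, 20):
        print(f"n = 100, h = {h:2d}:  bound factor = {vc_regression_bound(1.0, 100, h):.3f}")
    ```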

  • Slide 49

    Practical VC Bound for regression
    A practical regression bound can be obtained by setting the confidence level η = min(4/√n, 1) and the theoretical constants:
    R(h) ≤ R_emp(h) / [ 1 − √( p − p·ln p + (ln n)/(2n) ) ] ,  where p = h/n
    This bound can be used for model selection (examples given later).
    Compare to the analytic bounds (SC, FPE) in Lecture Set 2.
    Analysis (of the denominator) shows that h < 0.8·n for any estimator; in practice, h < 0.5·n for any estimator.
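
    A sketch of how this practical bound might drive model selection, in the spirit of the polynomial example on the next slide; taking DoF = degree + 1 as the estimate of h, and the toy data itself, are assumptions:

    ```python
    import numpy as np

    def vc_penalty(n, h):
        """Practical VC penalization factor: 1 / (1 - sqrt(p - p*ln(p) + ln(n)/(2n))),
        with p = h/n; returns inf when the denominator is not positive."""
        p = h / n
        denom = 1.0 - np.sqrt(p - p * np.log(p) + np.log(n) / (2.0 * n))
        return np.inf if denom <= 0 else 1.0 / denom

    def select_polynomial_degree(x, y, max_degree=10):
        """Pick the degree minimizing (training MSE) * (VC penalty), using
        DoF = degree + 1 as a stand-in for the VC-dimension h."""
        n, best_deg, best_bound = len(x), None, np.inf
        for deg in range(1, max_degree + 1):
            coeffs = np.polyfit(x, y, deg)
            mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
            bound = mse * vc_penalty(n, deg + 1)
            if bound < best_bound:
                best_deg, best_bound = deg, bound
        return best_deg, best_bound

    # Toy data loosely mimicking the next slide's set-up (25 noisy samples of a
    # sine-squared target); the noise level here is an assumption.
    rng = np.random.default_rng(3)
    x = rng.uniform(0, 1, 25)
    y = np.sin(np.pi * x) ** 2 + 0.1 * rng.standard_normal(25)
    print(select_polynomial_degree(x, y))
    ```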

  • Slide 50

    VC Regression Bound for model selection
    The VC-bound can be used for analytic model selection (if the VC-dimension is known).
    Example: polynomial regression for estimating a Sine_Squared target function from 25 noisy samples.
    Optimal model found: 6th-degree polynomial (no resampling needed).
    (Figure: noisy samples on [0, 1] and the selected 6th-degree polynomial fit.)

  • Slide 51

    Modeling pure noise with x in [0, 1] via polynomial regression; sample size n = 30, noise σ = 1.
    Comparison of different model selection methods:
    - prediction risk (MSE)
    - selected DoF (~ h)
    (Figure: prediction risk (MSE, log scale) and selected degrees of freedom for the fpe, gcv, vc, and cv model selection methods.)

  • Slide 53

    Structural Risk Minimization
    Analysis of the generalization bound
    R(w) ≤ R_emp(w) + Φ(n/h, ln η / n)
    suggests that when n/h is large, the confidence term is small and R(w) ≈ R_emp(w).
    This leads to the parametric modeling approach (ERM).
    When n/h is not large (say, less than 20), both terms on the right-hand side of the VC-bound need to be minimized → make the VC-dimension a controlling variable.
    SRM = a formal mechanism for controlling model complexity.
    The set of admissible models f(x, w) has a nested structure
    S₁ ⊂ S₂ ⊂ ... ⊂ S_k ⊂ ...,  with  h₁ ≤ h₂ ≤ ... ≤ h_k ≤ ...,
    so that the structure formally defines a complexity ordering.

  • Slide 54

    Structural Risk Minimization
    An upper bound on the true risk and the empirical risk, as a function of VC-dimension h (for fixed sample size n).

  • Slide 55

    SRM vs ERM modeling

  • Slide 56

    SRM Approach
    Use the VC-dimension as a controlling parameter for minimizing the VC bound:
    R(w) ≤ R_emp(w) + Φ(n/h)
    Two general strategies for implementing SRM:
    1. Keep Φ(n/h) fixed and minimize R_emp(w) (most statistical and neural network methods)
    2. Keep R_emp(w) fixed and minimize Φ(n/h) (Support Vector Machines)

  • Slide 57

    Common SRM structures: Dictionary structure
    A set of algebraic polynomials
    f_m(x, w) = b + Σ_{i=1..m} w_i xⁱ
    is a structure, since f₁ ⊂ f₂ ⊂ ... ⊂ f_k ⊂ ...
    More generally,
    f_m(x, w, V) = b + Σ_{i=1..m} w_i g(x, v_i)
    where the {g(x, v_i)} are a set of basis functions (a dictionary).
    The number of terms (basis functions) m specifies an element of the structure.
    For fixed basis functions, VC-dim ~ number of parameters.
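
    A minimal sketch of a dictionary structure, assuming a fixed dictionary of monomials; each element S_m is fit by least squares, and increasing m simply enlarges the set of candidate models:

    ```python
    import numpy as np

    # Dictionary structure: nested elements S_m with f_m(x, w) = b + sum_i w_i * g_i(x),
    # using a hypothetical dictionary of monomials g_i(x) = x**i.
    def design_matrix(x, m):
        """Columns [1, x, x^2, ..., x^m] defining element S_m of the structure."""
        return np.vander(x, m + 1, increasing=True)

    def fit_element(x, y, m):
        """Least-squares fit within S_m; for fixed monomials, VC-dim ~ m + 1 parameters."""
        w, *_ = np.linalg.lstsq(design_matrix(x, m), y, rcond=None)
        return w

    rng = np.random.default_rng(4)
    x = rng.uniform(0, 1, 40)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(40)
    for m in (1, 3, 5):           # S_1 is contained in S_3 is contained in S_5
        w = fit_element(x, y, m)
        print(m, round(float(np.mean((design_matrix(x, m) @ w - y) ** 2)), 4))
    ```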

  • Slide 58

    Common SRM structures: Feature selection (aka subset selection)
    Consider sparse polynomials of degree m:
    for m = 1:  f₁(x, b, w, k₁) = b + w₁·x^k₁
    for m = 2:  f₂(x, b, w, k₁, k₂) = b + w₁·x^k₁ + w₂·x^k₂
    etc.
    Each monomial is a feature. The goal is to select the set of m features providing minimum empirical risk (MSE).
    This is a structure, since f₁ ⊂ f₂ ⊂ ... ⊂ f_m ⊂ ...
    More generally, f_m(x, w, V) = Σ_{i=0..m} w_i g(x, v_i), where the m basis functions are selected from a (large) set of M functions.
    Note: nonlinear optimization; the VC-dimension is unknown.
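
    A brute-force sketch of this feature-selection structure for a small problem: each element fixes the number m of monomial features, and the best subset of m monomials is chosen by empirical risk (the maximum power and the toy data are assumptions):

    ```python
    import itertools
    import numpy as np

    def best_subset(x, y, m, max_power=5):
        """Brute-force feature selection: among all subsets of m monomial powers,
        return the one whose least-squares fit gives the smallest empirical risk."""
        best_mse, best_powers = np.inf, None
        for powers in itertools.combinations(range(1, max_power + 1), m):
            Z = np.column_stack([np.ones_like(x)] + [x ** k for k in powers])
            w, *_ = np.linalg.lstsq(Z, y, rcond=None)
            mse = np.mean((Z @ w - y) ** 2)
            if mse < best_mse:
                best_mse, best_powers = mse, powers
        return best_powers, best_mse

    rng = np.random.default_rng(6)
    x = rng.uniform(0, 1, 40)
    y = 0.2 * x ** 2 + 0.5 * x + 0.05 * rng.standard_normal(40)
    print(best_subset(x, y, m=2))   # expected to recover something like powers (1, 2)
    ```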

  • Slide 59

    Common SRM structures: Penalization
    Consider algebraic polynomials of fixed degree,
    f(x, w) = Σ_{i=0..10} w_i xⁱ  with  ‖w‖² ≤ c_k
    For each (positive) value c_k, this set of functions specifies an element of a structure
    S_k = { f(x, w) : ‖w‖² ≤ c_k },  with  c₁ < c₂ < c₃ < ...
    Minimization of the empirical risk (MSE) on each element of the structure is a constrained minimization problem.
    This optimization problem can be equivalently stated as minimization of the penalized empirical risk functional
    R_pen(w, λ_k) = R_emp(w) + λ_k·‖w‖²
    where the choice of λ_k corresponds to c_k.
    Note: the VC-dimension is unknown.
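
    A minimal ridge-regression sketch of the penalization structure: the polynomial degree is fixed and each value of λ (equivalently, each constraint c_k) defines one element of the structure; the degree, λ values, and data are illustrative assumptions:

    ```python
    import numpy as np

    def fit_penalized(x, y, degree=10, lam=1e-3):
        """Minimize R_pen(w, lam) = R_emp(w) + lam * ||w||^2 for a fixed-degree
        polynomial (ridge regression); each lam corresponds to one element S_k."""
        Z = np.vander(x, degree + 1, increasing=True)
        # Closed-form minimizer of (1/n)*||Z w - y||^2 + lam*||w||^2.
        return np.linalg.solve(Z.T @ Z + lam * len(x) * np.eye(degree + 1), Z.T @ y)

    rng = np.random.default_rng(5)
    x = rng.uniform(0, 1, 40)
    y = np.sin(2 * np.pi * x) + 0.05 * rng.standard_normal(40)
    for lam in (1e-6, 1e-3, 1e-1):    # larger lam ~ smaller constraint value c_k
        w = fit_penalized(x, y, lam=lam)
        print(f"lambda = {lam:g}:  ||w||^2 = {float(w @ w):.2f}")
    ```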

  • Slide 60

    Example: SRM structures for regression
    Regression data set:
    - x-values ~ uniformly sampled in [0, 1]
    - y-values ~ target function y = 0.8·sin(2πx) + 0.2·x² + 0.5·x plus additive Gaussian noise with st. dev. 0.05
    Experimental set-up:
    - training set ~ 40 samples
    - validation set ~ 40 samples (for model selection)
    SRM structures defined on algebraic polynomials:
    - dictionary (polynomial degrees 1 to 10)
    - penalization (fixed degree-10 polynomial)
    - sparse polynomials (degree 1 to 5)

  • Slide 61

    Estimated models using different SRM structures:
    - dictionary: 5th-degree polynomial with coefficients of magnitude 0.4078, 6.4198, 68.2162, 163.7679, 158.3952, 55.9565 (constant through x⁵ term)
    - penalization: lambda = 1.013e-005
    - sparse polynomial: constant 0.6186 and terms x², x³, x⁴ with coefficients of magnitude 22.7337, 41.1772, 19.2736
    Visual results: target function ~ red line, feature selection ~ black solid, dictionary ~ green, penalization ~ yellow line.
    (Figure: estimated models and target function on [0, 1].)

  • Slide 62

    SRM Summary
    SRM structure ~ a complexity ordering on a set of admissible models (approximating functions).
    Many different structures can be defined on the same set of approximating functions (possible models).
    How to choose the best structure?
    - depends on application data
    - VC theory cannot provide the answer
    SRM = mechanism for complexity control:
    - selecting optimal complexity for a given data set
    - new measure of complexity: VC-dimension
    - model selection using analytic VC-bounds

  • Slide 64

    Summary and Discussion: VC-theory
    Methodology:
    - learning problem setting (KID principle)
    - concepts (risk minimization, VC-dimension, structure)
    Interpretation/evaluation of existing methods
    Model selection using VC-bounds
    New types of inference (TBD later)
    What the theory cannot do:
    - provide a formalization (for a given application)
    - select a good structure
    - there is always a gap between theory and applications