Introduction to Predictive Learning


Transcript of Introduction to Predictive Learning

  • Slide 2

    Introduction to Predictive Learning
    Electrical and Computer Engineering
    LECTURE SET 4
    Statistical Learning Theory

  • Slide 3

    OUTLINE of Set 4
    - Objectives and Overview
    - Inductive Learning Problem Setting
    - Keep-It-Direct Principle
    - Analysis of ERM
    - VC-dimension
    - Generalization Bounds
    - Structural Risk Minimization (SRM)
    - Summary and Discussion

  • Slide 4

    Objectives
    Problems with philosophical approaches:
    - lack of quantitative description/characterization of ideas
    - no real predictive power (as in the Natural Sciences)
    - no agreement on basic definitions/concepts (as in the Natural Sciences)
    Goal: to introduce Predictive Learning as a scientific discipline

  • Slide 5

    Characteristics of a Scientific Theory
    - Problem setting
    - Solution approach
    - Math proofs (technical analysis)
    - Constructive methods
    - Applications
    Note: Problem Setting and Solution Approach are independent (of each other)

  • Slide 6

    History and Overview
    SLT, aka VC-theory (Vapnik-Chervonenkis): a theory for estimating dependencies from finite samples (the predictive learning setting)
    Based on the risk minimization approach
    All main results originally developed in the 1970s for classification (pattern recognition) - why? - but remained largely unknown
    Recent renewed interest due to the practical success of Support Vector Machines (SVM)

  • Slide 7

    History and Overview (cont'd)
    MAIN CONCEPTUAL CONTRIBUTIONS
    - Distinction between problem setting, inductive principle and learning algorithms
    - Direct approach to estimation with finite data (KID principle)
    - Math analysis of ERM (standard inductive setting)
    - Two factors responsible for generalization:
      - empirical risk (fitting error)
      - complexity (capacity) of approximating functions

  • Slide 8

    Importance of VC-theory
    Math results addressing the main question:
    - under what general conditions does the ERM approach lead to (good) generalization?
    New approach to induction:
    - predictive vs generative modeling (in classical statistics)
    Connection to philosophy of science:
    - VC-theory developed for binary classification (pattern recognition) ~ the simplest generalization problem
    - natural sciences: from observations to scientific law
    VC-theoretical results can be interpreted using general philosophical principles of induction, and vice versa.

  • Slide 11

    Empirical Risk Minimization
    ERM principle in model-based learning:
    - Model parameterization: f(x, w)
    - Loss function: L(f(x, w), y)
    - Estimate risk from the data:
      R_emp(w) = (1/n) Σ_{i=1..n} L(f(x_i, w), y_i)
    - Choose w* that minimizes R_emp
    Statistical Learning Theory developed from the theoretical analysis of the ERM principle under finite sample settings.
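
    As a minimal sketch of the ERM principle above (a linear model with squared loss on synthetic data; all names and values here are illustrative, not from the slides):

    ```python
    import numpy as np

    # Minimal ERM sketch: linear model f(x, w) = X @ w with squared loss.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(50, 3))                 # 50 samples, 3 features
    y = X @ np.array([1.5, -2.0, 0.7]) + 0.1 * rng.standard_normal(50)

    def empirical_risk(w, X, y):
        """R_emp(w) = (1/n) * sum_i L(f(x_i, w), y_i) with squared loss."""
        return np.mean((X @ w - y) ** 2)

    # For squared loss and a linear parameterization, the minimizer w* of
    # R_emp has a closed form (ordinary least squares).
    w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("w* =", w_star, " R_emp(w*) =", empirical_risk(w_star, X, y))
    ```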

  • Slide 12

    Probabilistic Modeling vs ERM

  • Slide 13

    Probabilistic Modeling vs ERM: Example
    (Figure: two-class data in the (x1, x2) plane; known class distribution → optimal decision boundary.)

  • Slide 14

    Probabilistic Approach
    Estimate parameters of Gaussian class distributions, and plug them into the quadratic decision boundary.
    (Figure: estimated quadratic decision boundary in the (x1, x2) plane.)

  • Slide 15

    ERM Approach
    Quadratic and linear decision boundaries estimated via minimization of squared loss.
    (Figure: quadratic and linear boundaries estimated by ERM in the (x1, x2) plane.)
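
    A rough sketch of this contrast, assuming two synthetic 2-D Gaussian classes with equal priors; the means, covariances, and sample sizes are made-up illustrations rather than the data shown in the figures:

    ```python
    import numpy as np

    # Two synthetic Gaussian classes in 2-D (illustrative only).
    rng = np.random.default_rng(1)
    n = 100
    x_pos = rng.multivariate_normal([2, 2], [[1, 0.3], [0.3, 1]], n)
    x_neg = rng.multivariate_normal([6, 0], [[1, -0.2], [-0.2, 2]], n)
    X = np.vstack([x_pos, x_neg])
    y = np.hstack([np.ones(n), -np.ones(n)])

    # Probabilistic (generative) approach: estimate each class density,
    # then plug the estimates into the Bayes-style quadratic rule (equal priors assumed).
    def gaussian_log_density(x, mu, cov):
        d = x - mu
        return -0.5 * (d @ np.linalg.solve(cov, d) + np.log(np.linalg.det(cov)))

    mu_p, cov_p = x_pos.mean(0), np.cov(x_pos.T)
    mu_n, cov_n = x_neg.mean(0), np.cov(x_neg.T)
    def plug_in_rule(x):
        return 1 if gaussian_log_density(x, mu_p, cov_p) >= gaussian_log_density(x, mu_n, cov_n) else -1

    # ERM approach: directly minimize squared loss of a linear decision function
    # D(x) = w.x + b, then classify by its sign (no density estimation at all).
    A = np.hstack([X, np.ones((2 * n, 1))])
    w_b, *_ = np.linalg.lstsq(A, y, rcond=None)
    def erm_rule(x):
        return 1 if np.append(x, 1.0) @ w_b >= 0 else -1

    err_plug = np.mean([plug_in_rule(x) != t for x, t in zip(X, y)])
    err_erm = np.mean([erm_rule(x) != t for x, t in zip(X, y)])
    print("training error, plug-in:", err_plug, " ERM (squared loss):", err_erm)
    ```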

  • Slide 16

    Estimation of multivariate functions
    Is it possible to estimate a function from finite data?
    Simplified problem: estimation of an unknown continuous function from noise-free samples.
    Many results from function approximation theory:
    - To estimate a d-dimensional function accurately, one needs O(n^d) data points.
    - For example, if 3 points are needed to estimate a 2nd-order polynomial for d=1, then 3^10 points are needed to estimate a 2nd-order polynomial in 10-dimensional space.
    - Similar results in signal processing.
    There are never enough data points to estimate multivariate functions in most practical applications (image recognition, genomics etc.).
    For multivariate function estimation, the number of free parameters increases exponentially with problem dimensionality (the Curse of Dimensionality).
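
    The growth quoted above is easy to make concrete; a tiny sketch assuming the slide's rule of thumb of roughly 3 points per dimension:

    ```python
    # Rough illustration of the O(n^d) sample requirement: if ~3 points per
    # dimension suffice for a 2nd-order polynomial in d=1, a tensor-product
    # grid in d dimensions needs about 3**d points.
    for d in (1, 2, 5, 10, 20):
        print(f"d = {d:2d}:  ~{3 ** d:,} samples")
    # d = 10 already requires ~59,049 samples; d = 20 needs ~3.5 billion.
    ```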

  • Slide 18

    OUTLINE of Set 4
    - Objectives and Overview
    - Inductive Learning Problem Setting
    - Keep-It-Direct Principle
    - Analysis of ERM
    - VC-dimension
    - Generalization Bounds
    - Structural Risk Minimization (SRM)
    - Summary and Discussion

  • Slide 19

    Keep-It-Direct Principle
    The goal of learning is generalization rather than estimation of the true function (system identification).
    Keep-It-Direct Principle (Vapnik, 1995):
    Do not solve an estimation problem of interest by solving a more general (harder) problem as an intermediate step.
    A good predictive model reflects some properties of the unknown distribution P(x, y):
    R(w) = ∫ L(y, f(x, w)) dP(x, y) → min
    Since model estimation with finite data is ill-posed, one should never try to solve a more general problem than required by the given application.
    → Importance of formalizing application requirements as a learning problem.

  • Slide 20

    Learning vs System Identification
    Consider the regression problem y = g(x) + noise, where the unknown target function is g(x) = E(y|x).
    Goal 1: Prediction
    R(w) = ∫ (y − f(x, w))² dP(x, y) → min
    Goal 2: Function approximation (system identification)
    R(w) = ∫ (f(x, w) − g(x))² dx → min, or (f(x, w) − E(y|x))² → min
    Admissible models: algebraic polynomials
    Purpose of comparison: contrast goals (1) and (2)
    NOTE: most applications assume Goal 2, i.e. Noisy Data ~ true signal + noise
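
    A small simulation can make the contrast between goals (1) and (2) concrete; the target function, noise level, and non-uniform input density below are assumptions chosen only to expose the gap:

    ```python
    import numpy as np

    # Contrast Goal 1 (prediction under P(x)) with Goal 2 (approximating g
    # everywhere), using a hypothetical target and a non-uniform P(x).
    rng = np.random.default_rng(2)
    g = lambda x: np.sin(2 * np.pi * x)            # assumed target function
    x_train = rng.beta(5, 1, size=40)              # P(x) concentrated near x = 1
    y_train = g(x_train) + 0.1 * rng.standard_normal(40)

    # ERM fit: 3rd-degree polynomial minimizing empirical (prediction) risk.
    f = np.poly1d(np.polyfit(x_train, y_train, deg=3))

    x_test_pred = rng.beta(5, 1, size=5000)        # test points drawn from P(x)
    x_test_unif = rng.uniform(0, 1, size=5000)     # uniform coverage of the input space

    print("prediction risk  E_P[(g - f)^2]      =", np.mean((g(x_test_pred) - f(x_test_pred)) ** 2))
    print("approximation risk  int (g - f)^2 dx =", np.mean((g(x_test_unif) - f(x_test_unif)) ** 2))
    # The same model can have small prediction risk yet approximate g poorly
    # in regions where P(x) puts little probability mass.
    ```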

  • Slide 23

    Regression estimates (2 typical realizations of data):
    Dotted line ~ estimate obtained using the predictive learning setting
    Dashed line ~ estimate obtained via the function approximation setting

  • Slide 24

    Conclusion
    The goal of prediction (1) is different from (less demanding than) the goal of estimating the true target function (2) everywhere in the input space.
    The curse of dimensionality applies to the system identification setting (2), but may not hold under the predictive setting (1).
    Both settings coincide if the input distribution is uniform (e.g., in signal and image denoising applications).

  • Slide 25

    Philosophical Interpretation of KID
    Interpretation of predictive models:
    - Realism ~ objective truth (hidden in Nature)
    - Instrumentalism ~ creation of the human mind (imposed on the data) → favored by KID
    Objective evaluation is still possible (via prediction risk reflecting application needs) → Natural Science
    Methodological implications:
    - Importance of good learning formulations (asking the right question)
    - Accounts for 80% of success in applications

  • Slide 26

    OUTLINE of Set 4
    - Objectives and Overview
    - Inductive Learning Problem Setting
    - Keep-It-Direct Principle
    - Analysis of ERM
    - VC-dimension
    - Generalization Bounds
    - Structural Risk Minimization (SRM)
    - Summary and Discussion

  • Slide 27

    VC-theory has 4 parts:
    1. Analysis of consistency/convergence of ERM
    2. Generalization bounds
    3. Inductive principles (for finite samples)
    4. Constructive methods (learning algorithms) for implementing (3)
    NOTE: (1) → (2) → (3) → (4)
    ERM: R_emp(w) = (1/n) Σ_{i=1..n} L(y_i, f(x_i, w)) → min

  • Slide 28

    Consistency/Convergence of ERM
    The empirical risk is known, but the expected risk is unknown.
    Asymptotic consistency requirement: under what (general) conditions will models providing minimum Empirical Risk also provide minimum Prediction Risk, as the number of samples grows large?
    Why is asymptotic analysis needed?
    - it helps to develop useful concepts
    - necessary and sufficient conditions ensure that VC-theory is general and cannot be improved

  • Slide 30

    Conditions for Consistency of ERM
    Main insight: consistency is not possible without restricting the set of possible models.
    Example: the 1-nearest-neighbor classification method - is it consistent?
    Consider binary decision functions (classification).
    How to measure their flexibility, or their ability to explain the training data (for binary classification)?
    This complexity index for indicator functions:
    - is independent of the unknown data distribution;
    - measures the capacity of a set of possible models, rather than characteristics of the true model.

  • Slide 31

    OUTLINE of Set 4
    - Objectives and Overview
    - Inductive Learning Problem Setting
    - Keep-It-Direct Principle
    - Analysis of ERM
    - VC-dimension
    - Generalization Bounds
    - Structural Risk Minimization (SRM)
    - Summary and Discussion

  • Slide 32

    SHATTERING
    Linear indicator functions: can split 3 data points in 2D in all 2^3 = 8 possible binary partitions.
    If a set of n samples can be separated by a set of functions in all 2^n possible ways, this sample is said to be shattered (by the set of functions).
    Shattering ~ a set of models can explain a given sample of size n (for all possible labelings).
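
    A brute-force check of the 2-D example above: enumerate all 2^3 labelings of three non-collinear points and verify that each can be realized by a linear indicator function. Using the perceptron rule as the separability test is just one convenient choice, not part of the slides:

    ```python
    import itertools
    import numpy as np

    # Shattering check: can linear indicator functions I(w.x + b > 0) realize
    # all 2^3 labelings of 3 (non-collinear) points in the plane?
    points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

    def linearly_separable(X, y, epochs=1000, lr=0.1):
        """Crude separability test via the perceptron rule; returns True if some
        (w, b) classifies every point correctly within the given epochs."""
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            mistakes = 0
            for x, t in zip(X, y):
                if t * (x @ w + b) <= 0:        # misclassified (or on the boundary)
                    w, b = w + lr * t * x, b + lr * t
                    mistakes += 1
            if mistakes == 0:
                return True
        return False

    labelings = list(itertools.product([-1, 1], repeat=len(points)))
    shattered = all(linearly_separable(points, np.array(lab)) for lab in labelings)
    print(f"{len(labelings)} labelings, all separable: {shattered}")   # expect True
    ```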

  • Slide 33

    VC DIMENSION
    Definition: a set of functions has VC-dimension h if there exist h samples that can be shattered by this set of functions, but there are no h+1 samples that can be shattered.
    For linear indicator functions, VC-dimension h = 3 in 2D (h = d + 1 in general).
    VC-dimension is a positive integer (a combinatorial index).
    What is the VC-dimension of the 1-nearest-neighbor classifier?

  • Slide 34

    VC-dimension and Consistency of ERM
    VC-dimension is infinite if a sample of size n can be split in all 2^n possible ways (in this case, no valid generalization is possible).
    Finite VC-dimension gives necessary and sufficient conditions for:
    (1) consistency of ERM-based learning
    (2) fast rate of convergence
    (these conditions are distribution-independent)
    Interpretation of the VC-dimension via falsifiability: functions with small VC-dimension can be easily falsified.

  • Slide 35

    VC-dimension and Falsifiability
    A set of functions has VC-dimension h if
    (a) it can explain (shatter) a set of h samples ~ there exist h samples that cannot falsify it, and
    (b) it cannot shatter h+1 samples ~ any h+1 samples falsify this set.
    Finiteness of the VC-dimension is a necessary and sufficient condition for generalization (for any learning method based on ERM).

  • Slide 36

    Recall Occam's Razor
    Main problem in predictive learning:
    - complexity control (model selection)
    - how to measure complexity?
    Interpretation of Occam's razor (in Statistics):
    - Entities ~ model parameters
    - Complexity ~ degrees-of-freedom
    - Necessity ~ explaining (fitting) available data
    → Model complexity = number of parameters (DoF)
    Consistent with the classical statistical view: learning = function approximation / density estimation

  • Slide 37

    Philosophical Principle of VC-falsifiability
    Occam's Razor: select the model that explains the available data and has the smallest number of free parameters (entities).
    VC theory: select the model that explains the available data and has low VC-dimension (i.e., can be easily falsified).
    → New principle of VC-falsifiability

  • Slide 38

    Calculating the VC-dimension
    How to estimate the VC-dimension (for a given set of functions)?
    Apply the definition (via shattering) to derive analytic estimates → works for simple sets of functions.
    Generally, such analytic estimates are not possible for complex nonlinear parameterizations (i.e., for practical machine learning and statistical methods).

  • Slide 39

    Example: VC-dimension of spherical indicator functions
    Consider spherical decision surfaces in a d-dimensional x-space, parameterized by center c and radius r:
    f(x, c, r) = I( ‖x − c‖² ≤ r² )
    In a 2-dimensional space (d = 2) there exist 3 points that can be shattered, but 4 points cannot be shattered → h = 3

  • Slide 40

    Example: VC-dimension of a linear combination of fixed basis functions (e.g., polynomials, Fourier expansion, etc.)
    Assuming that the basis functions are linearly independent, the VC-dimension equals the number of basis functions (free parameters).
    Example: a single parameter but infinite VC-dimension:
    f(x, w) = I( sin(wx) > 0 )

  • Slide 41

    Example: Wide linear decision boundaries
    Consider linear decision functions D(x) = w·x + b such that the distance between D(x) and the closest data sample is larger than a given value Δ.
    Then the VC-dimension depends on the width (margin) parameter Δ, rather than on dimensionality d (as for ordinary linear models):
    h ≤ min( R²/Δ² , d ) + 1
    where R is the radius of the smallest sphere containing the data.

  • Slide 43

    VC-dimension vs number of parameters
    - VC-dimension can be equal to DoF (number of parameters). Example: linear estimators
    - VC-dimension can be smaller than DoF. Example: penalized estimators
    - VC-dimension can be larger than DoF. Example: feature selection; sin(wx)

  • Slide 44

    VC-dimension for Regression Problems
    VC-dimension was defined for indicator functions.
    It can be extended to real-valued functions, e.g., a third-order polynomial for univariate regression:
    f(x, b, w) = w₃x³ + w₂x² + w₁x + b
    linear parameterization → VC-dim = 4
    Qualitatively, the VC-dimension ~ the ability to fit (or explain) finite training data for regression.
    This can also be related to falsifiability.

  • Slide 46

    OUTLINE of Set 4
    - Objectives and Overview
    - Inductive Learning Problem Setting
    - Keep-It-Direct Principle
    - Analysis of ERM
    - VC-dimension
    - Generalization Bounds
    - Structural Risk Minimization (SRM)
    - Summary and Discussion

  • Slide 47

    Recall consistency of ERM
    Two types of VC-bounds:
    (1) How close is the empirical risk R_emp(w*) to the true risk R(w*)?
    (2) How close is the empirical risk R_emp(w*) to the minimal possible risk?

  • Slide 48

    Generalization Bounds
    Bounds for learning machines (implementing ERM) evaluate the difference between the (unknown) true risk and the known empirical risk, as a function of sample size n and general properties of admissible models (their VC-dimension).
    Classification: the following bound holds with probability 1 − η for all approximating functions:
    R(w) ≤ R_emp(w) + Φ(n/h, ln η / n)
    where Φ is called the confidence interval.
    Regression: the following bound holds with probability 1 − η for all approximating functions:
    R(w) ≤ R_emp(w) / ( 1 − c·√Φ )₊
    where Φ = a₁·[ h·(ln(a₂·n/h) + 1) − ln(η/4) ] / n
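
    A sketch of how the regression bound behaves numerically, assuming the common practical choice a₁ = a₂ = c = 1 and a confidence level η = 0.1 (these constants are assumptions, not values fixed by the slide):

    ```python
    import numpy as np

    def vc_confidence_term(n, h, eta=0.1, a1=1.0, a2=1.0):
        """Phi = a1 * (h * (ln(a2*n/h) + 1) - ln(eta/4)) / n  (confidence term)."""
        return a1 * (h * (np.log(a2 * n / h) + 1.0) - np.log(eta / 4.0)) / n

    def vc_regression_bound(r_emp, n, h, eta=0.1, c=1.0):
        """Regression bound: R <= R_emp / (1 - c*sqrt(Phi))_+ ;
        the bound is vacuous (infinite) when the denominator is not positive."""
        phi = vc_confidence_term(n, h, eta)
        denom = 1.0 - c * np.sqrt(phi)
        return np.inf if denom <= 0 else r_emp / denom

    # With n fixed, the bound loosens as h grows (i.e., as n/h shrinks):
    for h in (2, 5, 10, 20):
        print(f"n = 100, h = {h:2d}:  bound factor = {vc_regression_bound(1.0, 100, h):.3f}")
    ```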

  • Slide 49

    Practical VC Bound for regression
    A practical regression bound can be obtained by setting the confidence level η = min(4/√n, 1) and the theoretical constants:
    R(h) ≤ R_emp(h) / [ 1 − √( p − p·ln p + (ln n)/(2n) ) ] ,  where p = h/n
    This bound can be used for model selection (examples given later).
    Compare to the analytic bounds (SC, FPE) in Lecture Set 2.
    Analysis (of the denominator) shows that h < 0.8·n for any estimator; in practice, h < 0.5·n for any estimator.
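
    A sketch of how this practical bound might drive model selection, in the spirit of the polynomial example on the next slide; taking DoF = degree + 1 as the estimate of h, and the toy data itself, are assumptions:

    ```python
    import numpy as np

    def vc_penalty(n, h):
        """Practical VC penalization factor: 1 / (1 - sqrt(p - p*ln(p) + ln(n)/(2n))),
        with p = h/n; returns inf when the denominator is not positive."""
        p = h / n
        denom = 1.0 - np.sqrt(p - p * np.log(p) + np.log(n) / (2.0 * n))
        return np.inf if denom <= 0 else 1.0 / denom

    def select_polynomial_degree(x, y, max_degree=10):
        """Pick the degree minimizing (training MSE) * (VC penalty), using
        DoF = degree + 1 as a stand-in for the VC-dimension h."""
        n, best_deg, best_bound = len(x), None, np.inf
        for deg in range(1, max_degree + 1):
            coeffs = np.polyfit(x, y, deg)
            mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
            bound = mse * vc_penalty(n, deg + 1)
            if bound < best_bound:
                best_deg, best_bound = deg, bound
        return best_deg, best_bound

    # Toy data loosely mimicking the next slide's set-up (25 noisy samples of a
    # sine-squared target); the noise level here is an assumption.
    rng = np.random.default_rng(3)
    x = rng.uniform(0, 1, 25)
    y = np.sin(np.pi * x) ** 2 + 0.1 * rng.standard_normal(25)
    print(select_polynomial_degree(x, y))
    ```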

  • Slide 50

    VC Regression Bound for model selection
    The VC-bound can be used for analytic model selection (if the VC-dimension is known).
    Example: polynomial regression for estimating a Sine_Squared target function from 25 noisy samples.
    Optimal model found: 6th-degree polynomial (no resampling needed).
    (Figure: noisy samples on [0, 1] and the selected 6th-degree polynomial fit.)

  • Slide 51

    Modeling pure noise with x in [0, 1] via polynomial regression; sample size n = 30, noise σ = 1.
    Comparison of different model selection methods:
    - prediction risk (MSE)
    - selected DoF (~ h)
    (Figure: prediction risk (MSE, log scale) and selected degrees of freedom for the fpe, gcv, vc, and cv model selection methods.)

  • Slide 53

    Structural Risk Minimization
    Analysis of the generalization bound
    R(w) ≤ R_emp(w) + Φ(n/h, ln η / n)
    suggests that when n/h is large, the confidence term is small and R(w) ≈ R_emp(w).
    This leads to the parametric modeling approach (ERM).
    When n/h is not large (say, less than 20), both terms on the right-hand side of the VC-bound need to be minimized → make the VC-dimension a controlling variable.
    SRM = a formal mechanism for controlling model complexity.
    The set of admissible models f(x, w) has a nested structure
    S₁ ⊂ S₂ ⊂ ... ⊂ S_k ⊂ ...,  with  h₁ ≤ h₂ ≤ ... ≤ h_k ≤ ...,
    so that the structure formally defines a complexity ordering.

  • Slide 54

    Structural Risk Minimization
    An upper bound on the true risk and the empirical risk, as a function of VC-dimension h (for fixed sample size n).

  • Slide 55

    SRM vs ERM modeling

  • Slide 56

    SRM Approach
    Use the VC-dimension as a controlling parameter for minimizing the VC bound:
    R(w) ≤ R_emp(w) + Φ(n/h)
    Two general strategies for implementing SRM:
    1. Keep Φ(n/h) fixed and minimize R_emp(w) (most statistical and neural network methods)
    2. Keep R_emp(w) fixed and minimize Φ(n/h) (Support Vector Machines)

  • Slide 57

    Common SRM structures: Dictionary structure
    A set of algebraic polynomials
    f_m(x, w) = b + Σ_{i=1..m} w_i xⁱ
    is a structure, since f₁ ⊂ f₂ ⊂ ... ⊂ f_k ⊂ ...
    More generally,
    f_m(x, w, V) = b + Σ_{i=1..m} w_i g(x, v_i)
    where the {g(x, v_i)} are a set of basis functions (a dictionary).
    The number of terms (basis functions) m specifies an element of the structure.
    For fixed basis functions, VC-dim ~ number of parameters.
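
    A minimal sketch of a dictionary structure, assuming a fixed dictionary of monomials; each element S_m is fit by least squares, and increasing m simply enlarges the set of candidate models:

    ```python
    import numpy as np

    # Dictionary structure: nested elements S_m with f_m(x, w) = b + sum_i w_i * g_i(x),
    # using a hypothetical dictionary of monomials g_i(x) = x**i.
    def design_matrix(x, m):
        """Columns [1, x, x^2, ..., x^m] defining element S_m of the structure."""
        return np.vander(x, m + 1, increasing=True)

    def fit_element(x, y, m):
        """Least-squares fit within S_m; for fixed monomials, VC-dim ~ m + 1 parameters."""
        w, *_ = np.linalg.lstsq(design_matrix(x, m), y, rcond=None)
        return w

    rng = np.random.default_rng(4)
    x = rng.uniform(0, 1, 40)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(40)
    for m in (1, 3, 5):           # S_1 is contained in S_3 is contained in S_5
        w = fit_element(x, y, m)
        print(m, round(float(np.mean((design_matrix(x, m) @ w - y) ** 2)), 4))
    ```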

  • Slide 58

    Common SRM structures: Feature selection (aka subset selection)
    Consider sparse polynomials of degree m:
    for m = 1:  f₁(x, b, w, k₁) = b + w₁·x^k₁
    for m = 2:  f₂(x, b, w, k₁, k₂) = b + w₁·x^k₁ + w₂·x^k₂
    etc.
    Each monomial is a feature. The goal is to select the set of m features providing minimum empirical risk (MSE).
    This is a structure, since f₁ ⊂ f₂ ⊂ ... ⊂ f_m ⊂ ...
    More generally, f_m(x, w, V) = Σ_{i=0..m} w_i g(x, v_i), where the m basis functions are selected from a (large) set of M functions.
    Note: nonlinear optimization; the VC-dimension is unknown.
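
    A brute-force sketch of this feature-selection structure for a small problem: each element fixes the number m of monomial features, and the best subset of m monomials is chosen by empirical risk (the maximum power and the toy data are assumptions):

    ```python
    import itertools
    import numpy as np

    def best_subset(x, y, m, max_power=5):
        """Brute-force feature selection: among all subsets of m monomial powers,
        return the one whose least-squares fit gives the smallest empirical risk."""
        best_mse, best_powers = np.inf, None
        for powers in itertools.combinations(range(1, max_power + 1), m):
            Z = np.column_stack([np.ones_like(x)] + [x ** k for k in powers])
            w, *_ = np.linalg.lstsq(Z, y, rcond=None)
            mse = np.mean((Z @ w - y) ** 2)
            if mse < best_mse:
                best_mse, best_powers = mse, powers
        return best_powers, best_mse

    rng = np.random.default_rng(6)
    x = rng.uniform(0, 1, 40)
    y = 0.2 * x ** 2 + 0.5 * x + 0.05 * rng.standard_normal(40)
    print(best_subset(x, y, m=2))   # expected to recover something like powers (1, 2)
    ```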

  • Slide 59

    Common SRM structures: Penalization
    Consider algebraic polynomials of fixed degree,
    f(x, w) = Σ_{i=0..10} w_i xⁱ  with  ‖w‖² ≤ c_k
    For each (positive) value c_k, this set of functions specifies an element of a structure
    S_k = { f(x, w) : ‖w‖² ≤ c_k },  with  c₁ < c₂ < c₃ < ...
    Minimization of the empirical risk (MSE) on each element of the structure is a constrained minimization problem.
    This optimization problem can be equivalently stated as minimization of the penalized empirical risk functional
    R_pen(w, λ_k) = R_emp(w) + λ_k·‖w‖²
    where the choice of λ_k corresponds to c_k.
    Note: the VC-dimension is unknown.
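
    A minimal ridge-regression sketch of the penalization structure: the polynomial degree is fixed and each value of λ (equivalently, each constraint c_k) defines one element of the structure; the degree, λ values, and data are illustrative assumptions:

    ```python
    import numpy as np

    def fit_penalized(x, y, degree=10, lam=1e-3):
        """Minimize R_pen(w, lam) = R_emp(w) + lam * ||w||^2 for a fixed-degree
        polynomial (ridge regression); each lam corresponds to one element S_k."""
        Z = np.vander(x, degree + 1, increasing=True)
        # Closed-form minimizer of (1/n)*||Z w - y||^2 + lam*||w||^2.
        return np.linalg.solve(Z.T @ Z + lam * len(x) * np.eye(degree + 1), Z.T @ y)

    rng = np.random.default_rng(5)
    x = rng.uniform(0, 1, 40)
    y = np.sin(2 * np.pi * x) + 0.05 * rng.standard_normal(40)
    for lam in (1e-6, 1e-3, 1e-1):    # larger lam ~ smaller constraint value c_k
        w = fit_penalized(x, y, lam=lam)
        print(f"lambda = {lam:g}:  ||w||^2 = {float(w @ w):.2f}")
    ```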

  • Slide 60

    Example: SRM structures for regression
    Regression data set:
    - x-values ~ uniformly sampled in [0, 1]
    - y-values ~ target function y = 0.8·sin(2πx) + 0.2·x² + 0.5·x plus additive Gaussian noise with st. dev. 0.05
    Experimental set-up:
    - training set ~ 40 samples
    - validation set ~ 40 samples (for model selection)
    SRM structures defined on algebraic polynomials:
    - dictionary (polynomial degrees 1 to 10)
    - penalization (fixed degree-10 polynomial)
    - sparse polynomials (degree 1 to 5)

  • Slide 61

    Estimated models using different SRM structures:
    - dictionary: 5th-degree polynomial with coefficients of magnitude 0.4078, 6.4198, 68.2162, 163.7679, 158.3952, 55.9565 (constant through x⁵ term)
    - penalization: lambda = 1.013e-005
    - sparse polynomial: constant 0.6186 and terms x², x³, x⁴ with coefficients of magnitude 22.7337, 41.1772, 19.2736
    Visual results: target function ~ red line, feature selection ~ black solid, dictionary ~ green, penalization ~ yellow line.
    (Figure: estimated models and target function on [0, 1].)

  • Slide 62

    SRM Summary
    SRM structure ~ a complexity ordering on a set of admissible models (approximating functions).
    Many different structures can be defined on the same set of approximating functions (possible models).
    How to choose the best structure?
    - depends on application data
    - VC theory cannot provide the answer
    SRM = mechanism for complexity control:
    - selecting optimal complexity for a given data set
    - new measure of complexity: VC-dimension
    - model selection using analytic VC-bounds

  • Slide 64

    Summary and Discussion: VC-theory
    Methodology:
    - learning problem setting (KID principle)
    - concepts (risk minimization, VC-dimension, structure)
    Interpretation/evaluation of existing methods
    Model selection using VC-bounds
    New types of inference (TBD later)
    What the theory cannot do:
    - provide a formalization (for a given application)
    - select a good structure
    - there is always a gap between theory and applications