Learning theory: generalization and VC dimension
Yifeng Tao
School of Computer Science
Carnegie Mellon University
Slides adapted from Eric Xing
Yifeng Tao, Carnegie Mellon University
Introduction to Machine Learning
Outline
o Computational learning theories
o PAC framework
o Agnostic framework
o VC dimension
Roadmap: the PAC and the agnostic frameworks, first for finite |H|, then for |H| infinite but VC(H) finite.
Generalizability of Learning
o In machine learning it is really the generalization error that we care about, but most learning algorithms fit their models to the training set.
o Why should doing well on the training set tell us anything about generalization error? Specifically, can we relate error on the training set to generalization error?
o Are there conditions under which we can actually prove that learning algorithms will work well?
o Lecture 1:
[Slide from Eric Xing]
What General Laws Constrain Inductive Learning?
o Want theory to relate:
  o Training examples: m
  o Complexity of hypothesis/concept space: H
  o Accuracy of approximation to target concept: ε
  o Probability of successful learning: 1 − δ
o All the results are in O(…) form.
Prototypical concept learning task
o Binary classification.
o Everything we say here generalizes to other settings, including regression and multi-class classification.
o Given:
  o Instances X: possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast
  o Target function c: EnjoySport : X → {0, 1}
  o Hypothesis space H: conjunctions of literals, e.g. (?, Cold, High, ?, ?, ¬EnjoySport)
  o Training examples S: iid positive and negative examples of the target function, (x1, c(x1)), ..., (xm, c(xm))
o Determine:
  o A hypothesis h in H such that h(x) is "good" w.r.t. c(x) for all x in S?
  o A hypothesis h in H such that h(x) is "good" w.r.t. c(x) for all x in the true distribution D?
Sample Complexity
o How many training examples m are sufficient to learn the target concept?
o Training scenarios:
  o If the learner proposes instances as queries to the teacher: the learner proposes instance x, the teacher provides c(x).
  o If the teacher (who knows c) provides training examples: the teacher provides a sequence of examples of the form (x, c(x)).
  o If some random process (e.g., nature) proposes instances: instance x is generated randomly, the teacher provides c(x).
Two Basic Competing Models
Protocol
True error of a hypothesis
o Definition: the true error (denoted ε_D(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D:
  ε_D(h) = Pr_{x∼D}[h(x) ≠ c(x)]
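Because ε_D(h) is a probability under D, it can be estimated by sampling. A minimal sketch, where the distribution, target concept, and hypothesis are all made-up toys:

```python
import random

def true_error_mc(h, c, sample, n=100_000, seed=0):
    """Monte Carlo estimate of eps_D(h) = Pr_{x~D}[h(x) != c(x)]."""
    rng = random.Random(seed)
    return sum(h(x) != c(x) for x in (sample(rng) for _ in range(n))) / n

# Toy setup (made up): D is uniform on [0, 1), the target concept is
# c(x) = 1 iff x < 0.5, and h uses a slightly wrong threshold of 0.6.
est = true_error_mc(lambda x: x < 0.6,
                    lambda x: x < 0.5,
                    lambda rng: rng.random())
# h and c disagree exactly on [0.5, 0.6), so eps_D(h) = 0.1.
```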
Two notions of error
o Training error (a.k.a. empirical risk or empirical error) of hypothesis h with respect to target concept c: how often h(x) ≠ c(x) over training instances from S.
o True error (a.k.a. generalization error, test error) of hypothesis h with respect to c: how often h(x) ≠ c(x) over future random instances drawn iid from D.
The Union Bound
o For any k events A1, ..., Ak: P(A1 ∪ ... ∪ Ak) ≤ P(A1) + ... + P(Ak).
Hoeffding inequality
o Let Z1, ..., Zm be iid Bernoulli(φ) random variables and let φ̂ = (1/m) Σi Zi be their mean. Then for any fixed γ > 0,
  P(|φ − φ̂| > γ) ≤ 2 exp(−2γ²m).
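The Hoeffding bound can be sanity-checked by simulation. A sketch with arbitrary values φ = 0.3, m = 200, γ = 0.1:

```python
import math
import random

def hoeffding_violation_rate(phi=0.3, m=200, gamma=0.1, trials=2000, seed=0):
    """Empirical frequency of |phi_hat - phi| > gamma over repeated samples."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        phi_hat = sum(rng.random() < phi for _ in range(m)) / m
        if abs(phi_hat - phi) > gamma:
            bad += 1
    return bad / trials

bound = 2 * math.exp(-2 * 0.1**2 * 200)   # Hoeffding: 2 exp(-2 gamma^2 m)
rate = hoeffding_violation_rate()         # observed rate stays below the bound
```

The bound is loose here: the observed violation rate is far below it, which is typical.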
Version Space
o The version space VS_{H,S} is the set of all hypotheses in H consistent with the training data S: VS_{H,S} = {h ∈ H : h(x) = c(x) for all (x, c(x)) ∈ S}.
Consistent Learner
o A learner is consistent if it outputs a hypothesis that perfectly fits the training data.
o This is a quite reasonable learning strategy.
o Every consistent learner outputs a hypothesis belonging to the version space.
o We want to know how such a hypothesis generalizes.
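For a small hypothesis space, the version space can be enumerated directly. A sketch using a hypothetical space of conjunctions over three Boolean attributes (the sample S is invented):

```python
from itertools import product

# A hypothesis is a tuple over {0, 1, None}, one slot per attribute,
# where None means "don't care"; h(x) = 1 iff every constrained slot matches.
def predict(h, x):
    return all(hi is None or hi == xi for hi, xi in zip(h, x))

H = list(product([0, 1, None], repeat=3))   # |H| = 3^3 = 27 conjunctions

# A small labeled sample (hypothetical target: x[0] == 1).
S = [((1, 0, 1), True), ((1, 1, 0), True), ((0, 1, 1), False)]

version_space = [h for h in H if all(predict(h, x) == y for x, y in S)]
# Every consistent learner must output some hypothesis from version_space.
```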
Probably Approximately Correct
o Double "hedging":
  o Approximately
  o Probably
o Need both!
Exhausting the version space
o The version space VS_{H,S} is ε-exhausted with respect to c and D if every hypothesis h in VS_{H,S} has true error less than ε with respect to c and D.
How many examples will ε-exhaust the VS
Proof
o Fix a hypothesis h with true error at least ε. It is consistent with a single random example with probability at most 1 − ε, hence with all m iid examples with probability at most (1 − ε)^m.
o By the union bound over the at most |H| such "bad" hypotheses, the probability that any of them remains in the version space is at most |H|(1 − ε)^m ≤ |H|e^{−εm}.
[Slide from Eric Xing and David Sontag]
What it means
o [Haussler, 1988]: the probability that the version space is not ε-exhausted after m training examples is at most |H|e^{−εm}.
o Suppose we want this probability to be at most δ.
o How many training examples suffice? m ≥ (1/ε)(ln|H| + ln(1/δ)).
o If errortrain(h) = 0, then errortrue(h) ≤ (1/m)(ln|H| + ln(1/δ)).
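Haussler's bound is easy to evaluate. A minimal sketch (the choice of conjunctions over 10 Boolean attributes with ε = 0.1, δ = 0.05 is just an example):

```python
import math

def pac_sample_complexity(h_size, eps, delta):
    """Smallest m satisfying m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# Example: conjunctions over n = 10 Boolean attributes, so |H| = 3**10.
m = pac_sample_complexity(3**10, eps=0.1, delta=0.05)   # m == 140
```

Note the logarithmic dependence on |H| and 1/δ: even this exponentially large hypothesis space needs only 140 examples.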
Learning Conjunctions of Boolean Literals
o For conjunctions over n Boolean variables, |H| = 3^n (each variable appears positively, appears negatively, or is absent), so m ≥ (1/ε)(n ln 3 + ln(1/δ)) examples suffice.
PAC Learnability
o A learning algorithm is PAC learnable if it
  o requires no more than polynomial computation per training example, and
  o no more than a polynomial number of samples.
o Theorem: the class of conjunctions of Boolean literals is PAC learnable.
How about EnjoySport?
PAC-Learning
Agnostic Learning
o So far we assumed c ∈ H. In the agnostic setting we drop that assumption: H may not contain the target concept, and the learner simply seeks the hypothesis in H with the smallest error.
Empirical Risk Minimization Paradigm
o Empirical risk minimization (ERM): output the hypothesis with the smallest training error, ĥ = argmin_{h∈H} ε̂_S(h).
The Case of Finite H
o H = {h1, ..., hk} consists of k hypotheses.
o We would like to give guarantees on the generalization error of ĥ.
o First, we will show that ε̂(h) is a reliable estimate of ε(h) for all h.
o Second, we will show that this implies an upper bound on the generalization error of ĥ.
Misclassification Probability
o Fix a hypothesis h and let Zi = 1{h(xi) ≠ c(xi)} for each training example; the Zi are iid Bernoulli(ε(h)).
o The training error ε̂(h) = (1/m) Σi Zi is exactly the mean of the Zi.
o By the Hoeffding inequality, for a single fixed h, P(|ε(h) − ε̂(h)| > γ) ≤ 2 exp(−2γ²m).
Uniform Convergence
o We need the estimate to be reliable not just for one fixed h but simultaneously for all h ∈ H.
o By the union bound over the k hypotheses, P(∃h ∈ H : |ε(h) − ε̂(h)| > γ) ≤ 2k exp(−2γ²m).
o Equivalently, with probability at least 1 − 2k exp(−2γ²m), |ε(h) − ε̂(h)| ≤ γ holds for all h ∈ H: uniform convergence.
Sample Complexity
o Setting 2k exp(−2γ²m) = δ and solving for m: uniform convergence with margin γ holds with probability at least 1 − δ provided m ≥ (1/(2γ²)) ln(2k/δ).
Generalization Error Bound
o Under uniform convergence, the ERM hypothesis ĥ satisfies, with probability at least 1 − δ, ε(ĥ) ≤ (min_{h∈H} ε(h)) + 2 √((1/(2m)) ln(2k/δ)).
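In the agnostic finite-H setting, the uniform-convergence margin γ = √((1/(2m)) ln(2k/δ)) bounds the gap between training and true error for every h, so ERM pays at most 2γ over the best hypothesis in H. A small calculator (the values of k, m, δ below are arbitrary):

```python
import math

def agnostic_gap(k, m, delta):
    """gamma = sqrt((1/(2m)) ln(2k/delta)); with probability >= 1 - delta
    every h in H has |eps(h) - eps_hat(h)| <= gamma, so the ERM hypothesis
    satisfies eps(h_erm) <= min_h eps(h) + 2 * gamma."""
    return math.sqrt(math.log(2 * k / delta) / (2 * m))

gamma = agnostic_gap(k=1000, m=5000, delta=0.05)   # about 0.033
```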
Agnostic framework
What if H is not finite?
o We cannot use our result for infinite H.
o We need some other measure of the complexity of H: the Vapnik-Chervonenkis (VC) dimension!
How do we characterize "power"?
o Different machines have different amounts of "power".
o Tradeoff:
  o More power: can model more complex classifiers, but might overfit.
  o Less power: not going to overfit, but restricted in what it can model.
o How do we characterize the amount of power?
Shattering a Set of Instances
o A dichotomy of a set S is a partition of S into two disjoint subsets.
o A set of instances S is shattered by hypothesis space H iff for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.
Three Instances Shattered
The Vapnik-Chervonenkis Dimension
o The VC dimension VC(H) of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC(H) = ∞.
VC dimension: examples
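As one concrete example (not necessarily the slide's): indicator functions of closed intervals on the real line shatter some set of two points but no set of three, so VC(intervals) = 2. A brute-force check:

```python
def interval_labelings(points):
    """All labelings of `points` achievable by h_{a,b}(x) = 1 if a <= x <= b."""
    # Endpoints only matter up to which points they separate, so trying the
    # points themselves plus outside sentinels realizes every labeling.
    candidates = [min(points) - 1] + sorted(points) + [max(points) + 1]
    return {tuple(a <= x <= b for x in points)
            for a in candidates for b in candidates}

two = interval_labelings([0.0, 1.0])         # all 4 labelings: shattered
three = interval_labelings([0.0, 1.0, 2.0])  # (1, 0, 1) is unachievable
# Intervals shatter two points but never three, so VC(intervals) = 2.
```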
The VC Dimension and the Number of Parameters
o The VC dimension thus gives concreteness to the notion of the capacity of a given set of hypotheses.
o Is it true that learning machines with many parameters have high VC dimension, while learning machines with few parameters have low VC dimension?
o No: there is an infinite-VC function with just one parameter!
An infinite-VC function with just one parameter
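The classic example is h_w(x) = sign(sin(wx)): despite its single parameter w, it shatters arbitrarily large (suitably chosen) point sets. A small numerical check on three points (the points and the grid over w are arbitrary):

```python
import math

# Points chosen so the sine arguments are t, 2t, 4t for t = 0.125 * w.
points = [0.125, 0.25, 0.5]
found = set()
w = 0.01
while w < 60.0:
    found.add(tuple(math.sin(w * x) > 0 for x in points))
    w += 0.01
# All 2^3 = 8 sign patterns appear for some value of the single parameter w.
```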
Sample Complexity from VC Dimension
o How many randomly drawn examples suffice to ε-exhaust VS_{H,S} with probability at least 1 − δ? m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε)).
Consistency
o A learning process (model) is said to be consistent if the model error, measured on new data sampled from the same underlying distribution as the original sample, converges, as the original sample size increases, to the model error measured on the original sample.
Vapnik main theorem
o The ERM learning process is consistent, independently of the underlying distribution, if and only if the VC dimension of the hypothesis family is finite.
Agnostic Learning: VC Bounds
o With probability at least 1 − δ, every h ∈ H satisfies
  ε(h) ≤ ε̂(h) + √( (VC(H)(ln(2m/VC(H)) + 1) + ln(4/δ)) / m ).
Model convergence speed
How to control model generalization capacity
o Risk Expectation = Empirical Risk + Confidence Interval.
o Minimizing the empirical risk alone will not always give good generalization capacity: one wants to minimize the sum of the empirical risk and the confidence interval.
o What matters is not the numerical value of the Vapnik limit, most often too large to be of any practical use, but the fact that this limit is a non-decreasing function of the "richness" of the model family.
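The "empirical risk + confidence interval" recipe can be illustrated by model selection over polynomial degree. This is a stylized sketch: the capacity penalty sqrt((d+1)/m) below stands in for the confidence interval and is not Vapnik's actual bound:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
x = rng.uniform(-1, 1, m)
y = 1 - 2 * x + 3 * x**2 + rng.normal(0, 0.1, m)    # true model: degree 2

def train_mse(d):
    """Least-squares fit of a degree-d polynomial; returns training MSE."""
    X = np.vander(x, d + 1)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((X @ coef - y) ** 2))

mses = [train_mse(d) for d in range(9)]
# Structural risk: empirical risk plus a capacity penalty growing with the
# number of parameters (illustrative stand-in for the confidence interval).
scores = [mses[d] + np.sqrt((d + 1) / m) for d in range(9)]
best = int(np.argmin(scores))
# Raw training error keeps shrinking with degree; the penalized score
# is minimized at the true capacity, degree 2.
```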
Structural Risk Minimization
SRM strategy
Putting SRM into action: linear models case
o There are many SRM-based strategies to build models. In the case of linear models,
  y = wᵀx + b,
one wants to make ||w|| a controlled parameter: let H_C be the linear model family satisfying the constraint ||w|| < C.
o Vapnik's major theorem: when C decreases, the VC dimension d(H_C) decreases.
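The constraint ||w|| < C is most easily explored through its penalized (ridge) form, where a larger penalty λ corresponds to a smaller C. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 40, 5
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + rng.normal(0, 0.5, m)

def ridge(lam):
    """Penalized counterpart of ||w|| < C: larger lam ~ smaller C."""
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

norms, errs = [], []
for lam in [0.0, 1.0, 10.0, 100.0]:
    w = ridge(lam)
    norms.append(float(np.linalg.norm(w)))
    errs.append(float(np.mean((X @ w - y) ** 2)))
# Tightening the constraint shrinks ||w|| and can only increase training
# error: the family H_C, and hence its capacity, is getting smaller.
```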
Putting SRM into action: linear models case (continued)
(figure: residuals y_i − wᵀx_i − b around the fitted linear model wᵀx_i + b)
Take away message
o Sample complexity varies with the learning setting:
  o learner actively queries trainer
  o examples provided at random
o Within the PAC learning setting, we can bound the probability that the learner will output a hypothesis with a given error:
  o for ANY consistent learner (the case where c ∈ H)
  o for ANY "best fit" hypothesis (agnostic learning, where perhaps c ∉ H)
o VC dimension as a measure of the complexity of H.
o Quantitative bounds characterizing bias/variance in the choice of H.
Take home message
[Slide from Matt Gormley]
References
oEric Xing, Ziv Bar-Joseph. 10701 Introduction to Machine Learning: http://www.cs.cmu.edu/~epxing/Class/10701/
oMatt Gormley. 10601 Introduction to Machine Learning: http://www.cs.cmu.edu/~mgormley/courses/10601/index.html
oDavid Sontag. Introduction To Machine Learning. https://people.csail.mit.edu/dsontag/courses/ml12/slides/lecture14.pdf