
Advanced Introduction to Machine Learning (CMU 10-715)

Risk Minimization

Barnabás Póczos

What have we seen so far?

Several classification & regression algorithms seem to work fine on training datasets:

• Linear regression

• Logistic regression

• Gaussian Processes

• Naïve Bayes classifier

• Support Vector Machines

How good are these algorithms on unknown test sets?

How many training samples do we need to achieve small error?

What is the smallest possible error we can achieve?

⇒ Learning Theory

Outline

• Risk and loss
  – Loss functions
  – Risk
  – Empirical risk vs. true risk
  – Empirical risk minimization
• Underfitting and overfitting
• Classification
• Regression

Supervised Learning Setup

Generative model of the data (train and test data): i.i.d. samples $(X_1, Y_1), \dots, (X_n, Y_n) \sim P(X, Y)$.

Regression: $Y \in \mathbb{R}$

Classification: $Y \in \{0, 1\}$ (or another finite label set)

Loss

The loss measures how good we are on a particular (x, y) pair.

Loss function: $\ell(f(x), y) \in \mathbb{R}$, where $f$ is the prediction function.

Loss Examples

Classification (0-1) loss: $\ell(f(x), y) = \mathbb{I}\{f(x) \neq y\}$

L2 loss for regression: $\ell(f(x), y) = (f(x) - y)^2$

L1 loss for regression: $\ell(f(x), y) = |f(x) - y|$

Running example (regression): predict house prices; the label y is the price.

Squared loss (L2 loss). [Picture from Alex]

L1 loss. [Picture from Alex]

ε-insensitive loss: zero within an ε-tube around the target, linear outside. [Picture from Alex]

Huber's robust loss: quadratic for small errors, linear for large ones. [Picture from Alex]
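To make these concrete, here is a minimal NumPy sketch of the losses above (function names, default parameters, and the sample values are my own illustration, not from the slides):

```python
import numpy as np

def zero_one_loss(y_pred, y):           # classification loss
    return (y_pred != y).astype(float)

def l2_loss(y_pred, y):                 # squared loss
    return (y_pred - y) ** 2

def l1_loss(y_pred, y):                 # absolute loss
    return np.abs(y_pred - y)

def eps_insensitive_loss(y_pred, y, eps=0.5):
    # zero inside the eps-tube around the target, linear outside
    return np.maximum(0.0, np.abs(y_pred - y) - eps)

def huber_loss(y_pred, y, delta=1.0):
    # quadratic for small residuals, linear for large ones (robust to outliers)
    r = np.abs(y_pred - y)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

y_true = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.1, 1.5, 6.0])       # the last prediction is an outlier
print(l2_loss(y_hat, y_true))           # outlier dominates: [0.01 0.25 9.]
print(huber_loss(y_hat, y_true))        # outlier penalized only linearly
```

Note how the L2 loss lets the single outlier dominate the total, which is exactly why the L1 and Huber losses are called robust.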

Risk

Risk of a classification/regression function f = the expected loss:

$R(f) = \mathbb{E}_{(X,Y) \sim P}[\ell(f(X), Y)]$

Why do we care about this?

Why do we care about risk?

Risk of f = the expected loss, $R(f) = \mathbb{E}[\ell(f(X), Y)]$.

Our true goal is to minimize the loss on the test points! Usually we don't know the test points and their labels in advance, but by the law of large numbers the average test loss converges to the risk:

$\frac{1}{m} \sum_{i=1}^{m} \ell(f(X_i^{test}), Y_i^{test}) \to \mathbb{E}[\ell(f(X), Y)] = R(f)$   (LLN)

That is why our goal is to minimize the risk.

Risk Examples

Risk = the expected loss, $R(f) = \mathbb{E}[\ell(f(X), Y)]$.

Classification (0-1) loss: $\ell(f(x), y) = \mathbb{I}\{f(x) \neq y\}$
Risk of the classification loss: $R(f) = P(f(X) \neq Y)$, the probability of misclassification.

L2 loss for regression: $\ell(f(x), y) = (f(x) - y)^2$
Risk of the L2 loss: $R(f) = \mathbb{E}[(f(X) - Y)^2]$, the mean squared error.
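As a quick illustration of risk under the 0-1 loss, here is a sketch with a made-up generative model (the model, the 10% label-flip rate, and the classifier are all assumptions for the example, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical generative model P(X, Y): X ~ Uniform(0, 1),
# Y = 1{X > 0.5}, with the label flipped with probability 0.1.
def sample(n):
    x = rng.uniform(0.0, 1.0, n)
    y = (x > 0.5).astype(int)
    flip = rng.uniform(0.0, 1.0, n) < 0.1
    return x, np.where(flip, 1 - y, y)

f = lambda x: (x > 0.5).astype(int)   # a fixed classifier

# Risk under the 0-1 loss is P(f(X) != Y); here it is exactly 0.1.
# The Monte Carlo average of the loss converges to it (LLN).
x, y = sample(1_000_000)
print(np.mean(f(x) != y))             # ~0.1
```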

Bayes Risk

Definition (Bayes risk): $R^* = \inf_f R(f)$. We consider all possible functions f here.

We don't know P, but we have i.i.d. training data D sampled from P!

Goal of learning: construct a function $f_D$ from the training data such that its risk $R(f_D)$ is close to the Bayes risk $R^*$.

Consistency of learning methods

The risk $R(f_{D_n})$ is a random variable: it depends on the random training data $D_n$.

Definition (universal consistency): a learning method is universally consistent if $R(f_{D_n}) \to R^*$ (in probability) as $n \to \infty$, for every distribution P(X,Y).

Stone's theorem (1977): many classification and regression algorithms are universally consistent for certain loss functions under certain conditions: kNN, Parzen kernel regression, SVM, …

Yay! But wait: this doesn't tell us anything about the rates…

No Free Lunch!

Devroye (1982): for every consistent learning method and for every fixed convergence rate $a_n$, there exists a distribution P(X,Y) such that the convergence rate of this learning method on P(X,Y)-distributed data is slower than $a_n$.

What can we do now?

What do we mean by rate?

Notation (stochastic rate, stochastic little-o and big-O):

$X_n = o_P(a_n)$ means that $X_n / a_n \to 0$ in probability.

Definition (stochastically bounded): $X_n = O_P(a_n)$ means that for every $\varepsilon > 0$ there exists $M > 0$ such that $\sup_n P(|X_n / a_n| > M) < \varepsilon$.

Example (CLT): $\bar{X}_n - \mu = O_P(1/\sqrt{n})$.
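A small simulation of the CLT example (the Exponential(1) distribution, the 200 repetitions, and the 95% quantile are my choices for illustration): the scaled deviations $\sqrt{n}(\bar{X}_n - \mu)$ remain stochastically bounded as n grows, which is exactly what $\bar{X}_n - \mu = O_P(1/\sqrt{n})$ asserts.

```python
import numpy as np

rng = np.random.default_rng(1)

# Exponential(1) samples: mu = 1, sigma^2 = 1.
# sqrt(n) * (Xbar_n - mu) converges to N(0, 1), so the 95% quantile of its
# absolute value should stabilize near 1.96 instead of growing with n.
for n in [100, 10_000, 1_000_000]:
    devs = np.array([np.sqrt(n) * (rng.exponential(1.0, n).mean() - 1.0)
                     for _ in range(200)])
    print(n, np.quantile(np.abs(devs), 0.95))
```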

Empirical Risk and True Risk

We cannot compute the risk, because we don't know P. Let us use its empirical counterpart instead.

Shorthand: write $D_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$ for the training data.

True risk of f (deterministic): $R(f) = \mathbb{E}[\ell(f(X), Y)]$

Bayes risk: $R^* = \inf_f R(f)$

Empirical risk (a random variable): $\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(X_i), Y_i)$

Empirical Risk Minimization

Empirical risk minimization (ERM): $\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f)$

Law of large numbers: for any fixed f, the empirical risk converges to the true risk, $\hat{R}_n(f) \to R(f)$.
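Here is a sketch of ERM over a small finite class of threshold classifiers $f_t(x) = \mathbb{I}\{x > t\}$, reusing the label-flip toy model from the risk example above (the class and all constants are illustrative assumptions, not the slides' example):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    # Toy model: Y = 1{X > 0.5}, label flipped with probability 0.1.
    x = rng.uniform(0.0, 1.0, n)
    y = (x > 0.5).astype(int)
    flip = rng.uniform(0.0, 1.0, n) < 0.1
    return x, np.where(flip, 1 - y, y)

thresholds = np.linspace(0.0, 1.0, 101)   # the function class F, |F| = 101
x, y = sample(200)

# Empirical risk of every f in F under the 0-1 loss.
emp_risk = [np.mean((x > t).astype(int) != y) for t in thresholds]
t_hat = thresholds[int(np.argmin(emp_risk))]   # the empirical risk minimizer

print("ERM threshold:", t_hat)            # close to the optimal 0.5
print("empirical risk:", min(emp_risk))   # can dip below the true risk 0.1
```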

Overfitting in Classification with ERM

Generative model, its Bayes classifier, and the Bayes risk. [Pictures from David Pál]

Function class: nth-order thresholded polynomials. ERM over this class can drive the empirical risk to zero on the training data.

Is the following predictor a good one?

What is its empirical risk (its performance on the training data)? Zero!

What about its true risk? Greater than zero: it will predict very poorly on new random test points. Large generalization error!

Overfitting in Regression with ERM

[Figure: polynomial regression fits of increasing order on the same training data; panels k=1, k=2, k=3, k=7.]

If we allow very complicated predictors, we can overfit the training data.

Example: regression with polynomials of degree k-1 (k coefficients).

Overfitting in Regression

[Figure: fits with constant, linear, quadratic, and 6th-order polynomials.]
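A sketch reproducing this effect with least-squares polynomial fitting (the sine target, the noise level, and the particular degrees are my choices, not the slides' exact data):

```python
import numpy as np

rng = np.random.default_rng(3)

# 12 noisy training points from a smooth target, plus a large test set.
n = 12
x_train = rng.uniform(0.0, 1.0, n)
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.normal(size=n)
x_test = rng.uniform(0.0, 1.0, 2000)
y_test = np.sin(2 * np.pi * x_test) + 0.3 * rng.normal(size=2000)

for deg in [0, 1, 2, 6, 11]:
    coeffs = np.polyfit(x_train, y_train, deg)    # ERM with the L2 loss
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {deg:2d}: train {train_mse:.3f}  test {test_mse:.3f}")

# The training error decreases monotonically with the degree; degree 11
# interpolates all 12 points (train error ~ 0) while the test error blows up.
```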

Solutions to Overfitting

Structural Risk Minimization

Notation: restrict ERM to a function class $\mathcal{F}$: $\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f)$.

1st issue (model error, approximation error): the best risk attainable in $\mathcal{F}$, $\inf_{f \in \mathcal{F}} R(f)$, may be larger than the Bayes risk $R^*$.

Solution: structural risk minimization (SRM), i.e., let the complexity of the function class grow with the sample size (a nested sequence $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots$).

Approximation Error, Estimation Error, and the PAC Framework

Decompose the excess risk of a classifier f chosen from the class $\mathcal{F}$:

$R(f) - R^* = \underbrace{\left(R(f) - \inf_{g \in \mathcal{F}} R(g)\right)}_{\text{estimation error}} + \underbrace{\left(\inf_{g \in \mathcal{F}} R(g) - R^*\right)}_{\text{approximation error}}$

where $R^*$ is the Bayes risk.

Probably Approximately Correct (PAC) learning framework: show that with probability at least $1 - \delta$ the estimation error is at most $\varepsilon$.

Big Picture

[Figure: the risk of a classifier decomposed into Bayes risk + approximation error + estimation error.]

Ultimate goal: drive both the estimation error and the approximation error to zero, so that the risk converges to the Bayes risk.

Solution to Overfitting

2nd issue: the empirical risk with the 0-1 classification loss is hard to minimize directly (it is non-convex and non-differentiable).

Solution: minimize a convex surrogate loss instead.

Approximation of the 0-1 loss with the hinge loss and the quadratic loss. [Picture taken from R. Herbrich]

Effect of Model Complexity

With a fixed amount of training data, if we allow very complicated predictors we can overfit the training data: the empirical risk is no longer a good indicator of the true risk.

[Figure: prediction error on the training data keeps decreasing with model complexity, while the true risk eventually increases.]

Underfitting

[Figure: example distribution with Bayes risk = 0.1.]

Underfitting

Best linear classifier: the empirical risk of the best linear classifier is larger than the Bayes risk ⇒ the linear class underfits.

Best quadratic classifier: its risk is the same as the Bayes risk ⇒ good fit!

Classification

Using the classification (0-1) loss.

The Bayes Classifier

For binary classification, let $\eta(x) = P(Y = 1 \mid X = x)$. The Bayes classifier is

$g^*(x) = \mathbb{I}\{\eta(x) \geq 1/2\}$

Lemma I: the Bayes risk satisfies $R(g^*) = \mathbb{E}[\min(\eta(X), 1 - \eta(X))]$.

Lemma II: the Bayes classifier is optimal: $R(g^*) \leq R(g)$ for every classifier g.

Proofs

Lemma I: trivial from the definition.

Lemma II: a surprisingly long calculation.
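A sketch of the Bayes classifier and the Lemma I formula for a toy model in which $\eta(x)$ is known exactly (the particular $\eta$, matching the earlier label-flip example, is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# eta(x) = P(Y = 1 | X = x): 0.9 right of 0.5, 0.1 left of it.
eta = lambda x: 0.9 * (x > 0.5) + 0.1 * (x <= 0.5)
g_star = lambda x: (eta(x) >= 0.5).astype(int)   # Bayes classifier

x = rng.uniform(0.0, 1.0, 1_000_000)             # X ~ Uniform(0, 1)
y = (rng.uniform(0.0, 1.0, x.size) < eta(x)).astype(int)  # Y|X ~ Bernoulli(eta(X))

print(np.mean(g_star(x) != y))                    # risk of g*: ~0.1
print(np.mean(np.minimum(eta(x), 1.0 - eta(x))))  # Lemma I formula: exactly 0.1
```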

The Bayes Classifier

We will need these definitions, so please copy them:

• What the learning algorithm produces (ERM over $\mathcal{F}$): $\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f)$
• Best function in the class: $\bar{f} = \arg\min_{f \in \mathcal{F}} R(f)$
• Bayes risk: $R^* = \inf_f R(f)$

Theorem I (bound on the estimation error): the true risk of what the learning algorithm produces satisfies

$R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) \leq 2 \sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|$

Theorem II: $|\hat{R}_n(\hat{f}_n) - R(\hat{f}_n)| \leq \sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|$

Proofs

Theorem I: a not-so-long calculation. Write $R(\hat{f}_n) - R(\bar{f}) = [R(\hat{f}_n) - \hat{R}_n(\hat{f}_n)] + [\hat{R}_n(\hat{f}_n) - \hat{R}_n(\bar{f})] + [\hat{R}_n(\bar{f}) - R(\bar{f})]$; the middle term is $\leq 0$ by the definition of ERM, and each of the other two terms is at most $\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|$.

Theorem II: trivial.

Main message: it's enough to derive upper bounds for $\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|$.

Corollary: any high-probability upper bound on this supremum immediately bounds the estimation error of ERM.

Illustration of the Risks

[Figure: true risk, empirical risk, Bayes risk, and the estimation and approximation errors.]

It's enough to derive upper bounds for $\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|$. It is a random variable that we need to bound, and we will bound it with tail bounds!

Hoeffding's inequality (1963)

Let $Z_1, \dots, Z_n$ be independent random variables with $Z_i \in [a_i, b_i]$. Then for every $\varepsilon > 0$,

$P\left(\left|\frac{1}{n}\sum_{i=1}^{n}(Z_i - \mathbb{E}Z_i)\right| \geq \varepsilon\right) \leq 2\exp\left(-\frac{2 n^2 \varepsilon^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right)$

Special case: binomial distributions. Our goal is to bound $|\hat{R}_n(f) - R(f)|$ for a fixed f. Under the 0-1 loss, $\ell(f(X_i), Y_i) \sim \text{Bernoulli}(p)$ with $p = R(f)$. Therefore, from Hoeffding we have:

$P(|\hat{R}_n(f) - R(f)| \geq \varepsilon) \leq 2 e^{-2 n \varepsilon^2}$

Yuppie!!!
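A quick empirical sanity check of the Bernoulli special case (all constants are assumed for illustration): the observed tail probability should sit below the Hoeffding bound.

```python
import numpy as np

rng = np.random.default_rng(5)

p, n, eps, trials = 0.1, 200, 0.05, 100_000

# Empirical risks of a fixed classifier with true risk p: Binomial(n, p) / n.
emp_means = rng.binomial(n, p, trials) / n
tail = np.mean(np.abs(emp_means - p) >= eps)   # P(|R_hat_n - R| >= eps)
bound = 2 * np.exp(-2 * n * eps ** 2)

print(f"empirical tail {tail:.4f} <= Hoeffding bound {bound:.4f}")
```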

Inversion

From Hoeffding we have: $P(|\hat{R}_n(f) - R(f)| \geq \varepsilon) \leq 2 e^{-2 n \varepsilon^2}$.

Therefore, setting $\delta = 2 e^{-2 n \varepsilon^2}$ and solving for $\varepsilon$: with probability at least $1 - \delta$,

$|\hat{R}_n(f) - R(f)| \leq \sqrt{\frac{\log(2/\delta)}{2n}}$
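The inversion as a one-line helper (the example values n = 1000 and δ = 0.05 are mine):

```python
import numpy as np

def hoeffding_eps(n, delta):
    # Width of the (1 - delta) confidence interval for one fixed f.
    return np.sqrt(np.log(2.0 / delta) / (2.0 * n))

# With 1000 samples, with probability >= 0.95, the empirical risk of a
# fixed classifier is within ~0.043 of its true risk.
print(hoeffding_eps(1000, 0.05))
```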

Union Bound

Our goal is to bound the worst-case deviation over a finite class $\mathcal{F} = \{f_1, \dots, f_N\}$: $\max_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|$.

We already know (for each fixed f): $P(|\hat{R}_n(f) - R(f)| \geq \varepsilon) \leq 2 e^{-2 n \varepsilon^2}$.

Theorem (tail bound on the 'deviation' in the worst case):

$P\left(\max_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| \geq \varepsilon\right) \leq 2 N e^{-2 n \varepsilon^2}$

Proof: the probability that at least one of the N bad events occurs is at most the sum of their probabilities.

Note: this is not the worst classifier in terms of classification accuracy! Worst case means the classifier f whose empirical risk is furthest from its true risk.

Inversion of Union Bound

We already know: $P(\max_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| \geq \varepsilon) \leq 2 N e^{-2 n \varepsilon^2}$.

Therefore, with probability at least $1 - \delta$, simultaneously for all $f \in \mathcal{F}$:

$|\hat{R}_n(f) - R(f)| \leq \sqrt{\frac{\log(2N/\delta)}{2n}}$

• The larger N is, the looser the bound.
• This result is distribution free: it is true for all P(X,Y) distributions.
• It is useless if N is big or infinite… (e.g., the class of all possible hyperplanes).

We will see later how to fix that. (Hint: McDiarmid, VC dimension, …)
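The inverted union bound as code, plus the sample size it implies (helper names and example values are my own):

```python
import numpy as np

def uniform_eps(n, N, delta):
    # Uniform deviation bound over a finite class of size N.
    return np.sqrt(np.log(2.0 * N / delta) / (2.0 * n))

def samples_needed(N, delta, eps):
    # Smallest n with uniform_eps(n, N, delta) <= eps.
    return int(np.ceil(np.log(2.0 * N / delta) / (2.0 * eps ** 2)))

print(uniform_eps(1000, N=100, delta=0.05))        # ~0.064, grows with log N
print(samples_needed(N=100, delta=0.05, eps=0.05)) # ~1660 samples
```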

The Expected Error

Our goal is to bound the expected worst-case deviation: $\mathbb{E}\left[\max_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|\right]$.

We already know the tail bound (concentration inequality): $P(\max_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| \geq \varepsilon) \leq 2 N e^{-2 n \varepsilon^2}$.

Theorem (expected 'deviation' in the worst case):

$\mathbb{E}\left[\max_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|\right] \leq \sqrt{\frac{\log(2N)}{2n}}$

Proof: via the maximal inequality for sub-Gaussian random variables; we already know a tail bound, and integrating it directly actually gives a slightly weaker inequality… oh well.

As before, this is not the worst classifier in terms of classification accuracy! Worst case means the classifier f whose empirical risk is furthest from its true risk.
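A simulation sketch comparing the expected worst-case deviation against the $\sqrt{\log(2N)/(2n)}$ rate; for simplicity the N empirical risks are drawn independently here, which the bound does not actually require (all constants are mine):

```python
import numpy as np

rng = np.random.default_rng(6)

n, N, trials = 500, 50, 2000
p = rng.uniform(0.05, 0.45, N)          # true risks R(f_i) of N "classifiers"

max_dev = np.empty(trials)
for t in range(trials):
    emp = rng.binomial(n, p) / n        # empirical risks on a fresh sample
    max_dev[t] = np.max(np.abs(emp - p))

print("simulated E[max deviation]:", max_dev.mean())
print("sqrt(log(2N) / (2n)) bound:", np.sqrt(np.log(2 * N) / (2 * n)))
```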

Thanks for your attention
