Transcript of Loss Functions for Regression and Classification

Page 1

Loss Functions for Regression and Classification

David Rosenberg

New York University

February 11, 2015


Page 2

Regression Loss Functions

Loss Functions for Regression

In general, a loss function may take the form
$$(\hat{y}, y) \mapsto \ell(\hat{y}, y).$$

Regression losses usually depend only on the residual
$$r = \hat{y} - y,$$
so that
$$(\hat{y}, y) \mapsto \ell(r) = \ell(\hat{y} - y).$$

When would you not want a translation-invariant loss?

Can you transform your response y so that the loss you want is translation-invariant?


Page 3

Regression Loss Functions

Some Losses for Regression

Square or $\ell_2$ loss: $\ell(r) = r^2$ (not robust)

Absolute or Laplace or $\ell_1$ loss: $\ell(r) = |r|$ (not differentiable); gives median regression

Huber loss: quadratic for $|r| \le \delta$ and linear for $|r| > \delta$ (robust and differentiable)

[Figures: linear data with noise and outliers fit by least squares vs. Laplace (absolute loss) regression; and the $\ell_2$, $\ell_1$, and Huber losses plotted as functions of the residual. KPM Figure 7.6]
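A minimal sketch of these three regression losses as functions of the residual, for concreteness. The `delta` default and the 1/2 scaling in the Huber loss follow one common convention and are not taken from the slide.

```python
import numpy as np

def squared_loss(r):
    """Square (L2) loss: grows quadratically, so it is not robust to outliers."""
    return r ** 2

def absolute_loss(r):
    """Absolute (Laplace / L1) loss: robust, but not differentiable at r = 0."""
    return np.abs(r)

def huber_loss(r, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear for |r| > delta.

    The 0.5 factor and delta=1.0 are a common convention (assumption here).
    """
    quad = 0.5 * r ** 2
    lin = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quad, lin)

residuals = np.linspace(-3, 3, 7)
print(squared_loss(residuals))
print(absolute_loss(residuals))
print(huber_loss(residuals, delta=1.0))
```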


Page 4

Classification Loss Functions

The Classification Problem

Action space $\mathcal{A} = \{-1, 1\}$, output space $\mathcal{Y} = \{-1, 1\}$.

0-1 loss for $f: \mathcal{X} \to \{-1, 1\}$:
$$\ell(f(x), y) = \mathbf{1}(f(x) \ne y)$$

But let's allow real-valued predictions $f: \mathcal{X} \to \mathbb{R}$:

$f(x) > 0 \implies$ predict $1$
$f(x) < 0 \implies$ predict $-1$
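As a small illustration of turning real-valued scores into $\{-1, +1\}$ predictions and evaluating them with the 0-1 loss. The tie-breaking at a score of exactly 0 is an arbitrary choice, not specified on the slide.

```python
import numpy as np

def predict_class(scores):
    """Convert real-valued scores f(x) into predictions in {-1, +1}.

    Scores of exactly 0 are mapped to +1 here; this tie-break is arbitrary.
    """
    return np.where(scores >= 0, 1, -1)

def zero_one_loss(y_pred, y_true):
    """0-1 loss: 1 if the prediction differs from the label, else 0."""
    return (y_pred != y_true).astype(float)

scores = np.array([2.3, -0.7, 0.1, -1.5])
labels = np.array([1, 1, -1, -1])
print(zero_one_loss(predict_class(scores), labels))  # [0. 1. 1. 0.]
```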


Page 5

Classification Loss Functions

The Classification Problem: Real-Valued Predictions

Action space $\mathcal{A} = \mathbb{R}$, output space $\mathcal{Y} = \{-1, 1\}$, prediction function $f: \mathcal{X} \to \mathbb{R}$.

Definition. The value $f(x)$ is called the score for the input $x$. Generally, the magnitude of the score represents the confidence of our prediction.

Definition. The margin on an example $(x, y)$ is $y f(x)$. The margin is a measure of how correct we are.

We want to maximize the margin. Most classification losses depend only on the margin.


Page 6

Classification Loss Functions

The Classification Problem: Real-Valued Predictions

Empirical risk for 0−1 loss:

$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(y_i f(x_i) \le 0)$$

Minimizing the empirical 0-1 risk is not computationally feasible:

$\hat{R}_n(f)$ is non-convex and not differentiable (in fact, discontinuous!). Optimization is NP-hard.
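A minimal sketch of evaluating the empirical 0-1 risk via margins, on made-up toy data. Note that a margin of exactly 0 counts as an error, matching the indicator $\mathbf{1}(y_i f(x_i) \le 0)$.

```python
import numpy as np

def empirical_zero_one_risk(f, X, y):
    """Empirical 0-1 risk: fraction of points with margin y_i * f(x_i) <= 0."""
    margins = y * f(X)
    return np.mean(margins <= 0)

# Hypothetical linear score function f(x) = w^T x on toy data.
w = np.array([1.0, -2.0])
f = lambda X: X @ w
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
y = np.array([1, -1, 1])
print(empirical_zero_one_risk(f, X, y))  # 0.333... (third point has margin 0)
```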


Page 7

Classification Loss Functions

Classification Losses

Zero-one loss: $\ell_{0\text{-}1}(m) = \mathbf{1}(m \le 0)$, where $m = y f(x)$ is the margin.


Page 8

Classification Loss Functions

Classification Losses

SVM/Hinge loss: $\ell_{\text{Hinge}}(m) = \max\{1 - m, 0\} = (1 - m)_+$

Hinge loss is a convex upper bound on the 0-1 loss. It is not differentiable at $m = 1$. We have a "margin error" when $m < 1$.


Page 9

Classification Loss Functions

Classification Losses

Logistic/Log loss: $\ell_{\text{Logistic}}(m) = \log(1 + e^{-m})$

Logistic loss is differentiable. There is never enough margin for the logistic loss: it is strictly positive for every $m$. How many support vectors would that give?
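The three margin-based classification losses side by side, as a rough sketch. A numerically careful implementation would guard the exponential against overflow for very negative margins; this one does not.

```python
import numpy as np

def zero_one(m):
    """0-1 loss as a function of the margin m = y * f(x)."""
    return (m <= 0).astype(float)

def hinge(m):
    """Hinge loss: (1 - m)_+, a convex upper bound on the 0-1 loss."""
    return np.maximum(1.0 - m, 0.0)

def logistic(m):
    """Logistic loss: log(1 + exp(-m)), differentiable and strictly positive."""
    return np.log1p(np.exp(-m))  # may overflow for very negative m

margins = np.linspace(-2, 3, 6)
for loss in (zero_one, hinge, logistic):
    print(loss.__name__, loss(margins))
```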


Page 10

Classification Loss Functions

(Soft Margin) Linear Support Vector Machine

Hypothesis space $\mathcal{F} = \{f(x) = w^T x \mid w \in \mathbb{R}^d\}$.

Loss $\ell(m) = (1 - m)_+$ with $\ell_2$ regularization:

$$\min_{w \in \mathbb{R}^d} \sum_{i=1}^n (1 - y_i f_w(x_i))_+ + \lambda \|w\|_2^2$$
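A direct translation of this objective into a function one could hand to a generic optimizer; the toy data and the value of $\lambda$ below are made up for illustration.

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """Soft-margin linear SVM objective:
    sum_i (1 - y_i w^T x_i)_+ + lam * ||w||_2^2
    """
    margins = y * (X @ w)
    hinge_sum = np.sum(np.maximum(1.0 - margins, 0.0))
    return hinge_sum + lam * np.dot(w, w)

# Toy evaluation with hypothetical data and lambda.
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, -1.0]])
y = np.array([1, -1, -1])
w = np.array([0.5, -0.25])
print(svm_objective(w, X, y, lam=0.1))  # 2.65625
```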


Page 11

Classification Loss Functions

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent:

  initialize $w = 0$
  repeat
    randomly choose a training point $(x_i, y_i) \in \mathcal{D}_n$
    $w \leftarrow w - \eta \, \nabla_w \ell(f_w(x_i), y_i)$   (gradient of the loss on the $i$-th example)
  until stopping criteria met
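A sketch of this loop in Python, with a fixed number of steps standing in for a real stopping criterion; `grad_loss`, `eta`, and `n_steps` are placeholder names of my own, not from the slide.

```python
import numpy as np

def sgd(grad_loss, X, y, eta=0.1, n_steps=1000, rng=None):
    """Plain SGD: repeatedly pick a random training point and step against
    the gradient of the loss on that single example.

    grad_loss(w, x_i, y_i) must return the per-example gradient w.r.t. w;
    eta and n_steps are illustrative defaults.
    """
    rng = np.random.default_rng(rng)
    w = np.zeros(X.shape[1])          # initialize w = 0
    for _ in range(n_steps):          # stands in for "until stopping criteria met"
        i = rng.integers(len(X))      # randomly chosen training point
        w = w - eta * grad_loss(w, X[i], y[i])
    return w
```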


Page 12

Classification Loss Functions

SGD for Hinge Loss and Linear Predictors

Consider the linear hypothesis space: $f_w(x) = w^T x$. Gradient of the hinge loss at $(x, y)$:

$$\nabla_w \ell_{\text{Hinge}}(y w^T x) = \begin{cases} -yx & \text{if } y f_w(x) < 1 \\ 0 & \text{if } y f_w(x) > 1 \\ \text{undefined} & \text{if } y f_w(x) = 1 \end{cases}$$

A point with margin $m = y f_w(x) = 1$ is correctly classified.

We can skip the SGD update for these points. The rigorous approach: subgradient descent.
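For the hinge loss, the subgradient approach amounts to picking any valid element of the subdifferential at the kink; returning 0 there matches the "skip the update" choice. A hypothetical helper, sketched under that assumption:

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """A subgradient of the hinge loss (1 - y w^T x)_+ with respect to w.

    At the kink y w^T x = 1 the loss is not differentiable; returning 0
    there is one valid subgradient choice (i.e. skip the update).
    """
    if y * np.dot(w, x) < 1:
        return -y * x
    return np.zeros_like(w)
```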


Page 13

Classification Loss Functions

SGD for Hinge Loss and Linear Predictors

For step $t+1$ of SGD, we select a random training point $(x, y)$ and set

$$w^{(t+1)} = \begin{cases} w^{(t)} + \eta^{(t)} y x & \text{if } y f_{w^{(t)}}(x) < 1 \\ w^{(t)} & \text{otherwise} \end{cases}$$

$w^{(T)}$ is a linear combination of the $x_i$'s that had a margin error when selected. Any $x_i$ in the expansion of $w^{(T)}$ is called a support vector. We can write

$$w = \sum_{i=1}^s a_i x^{(i)},$$

where $x^{(1)}, \ldots, x^{(s)}$ are the support vectors. Having zero gradient for $m > 1$ allows the set of support vectors to be sparse.
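Putting the pieces together, a rough sketch of the update above that also records which points triggered an update (the support vectors). The constant step size and the step count are illustrative assumptions; the slide's $\eta^{(t)}$ may vary with $t$.

```python
import numpy as np

def sgd_hinge(X, y, eta=0.1, n_steps=1000, rng=None):
    """SGD for the (unregularized) hinge loss with a linear predictor.

    Returns the weight vector and the indices of points that triggered an
    update (had a margin error when selected) -- the support vectors in the
    expansion w = sum_i a_i x^(i).
    """
    rng = np.random.default_rng(rng)
    w = np.zeros(X.shape[1])
    support = set()
    for _ in range(n_steps):
        i = rng.integers(len(X))
        if y[i] * np.dot(w, X[i]) < 1:        # margin error => update
            w = w + eta * y[i] * X[i]
            support.add(i)
        # otherwise margin >= 1: zero (sub)gradient, skip the update
    return w, sorted(support)

# Toy usage on made-up data.
X = np.array([[1.0, 0.5], [-1.0, 1.0], [0.2, -2.0], [2.0, 2.0]])
y = np.array([1, -1, -1, 1])
w, sv = sgd_hinge(X, y, rng=0)
print(w, sv)
```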


Page 14

Classification Loss Functions

Population Minimizers

The population minimizer is another name for the risk minimizer. It's the "infinite data" case.

Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, 2nd ed., Table 12.1.
