Loss Functions for Regression and Classification
David Rosenberg
New York University
February 11, 2015
David Rosenberg (New York University) DS-GA 1003 February 11, 2015 1 / 14
Regression Loss Functions
Loss Functions for Regression
In general, a loss function may take the form
(ŷ, y) ↦ ℓ(ŷ, y)
Regression losses usually depend only on the residual
r = y − ŷ
(ŷ, y) ↦ ℓ(r) = ℓ(y − ŷ)
When would you not want a translation-invariant loss?
Can you transform your response y so that the loss you want is translation-invariant?
Regression Loss Functions
Some Losses for Regression
Square or ℓ2 loss: ℓ(r) = r² (not robust)
Absolute or Laplace or ℓ1 loss: ℓ(r) = |r| (not differentiable); gives median regression
Huber loss: quadratic for |r| ≤ δ and linear for |r| > δ (robust and differentiable)
[Figure: linear data with noise and outliers, fit by least squares and by Laplace (absolute) loss; and a comparison of the L2, L1, and Huber losses. KPM Figure 7.6]
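As a sketch (not course code), the three regression losses can be written as NumPy functions; `delta` is the Huber threshold δ, and the quadratic piece uses the conventional 1/2 factor so the two pieces match in value and slope at |r| = δ:

```python
import numpy as np

def square_loss(r):
    """Square (l2) loss: heavily penalizes large residuals, so not robust to outliers."""
    return r ** 2

def absolute_loss(r):
    """Absolute (l1 / Laplace) loss: robust, but not differentiable at r = 0."""
    return np.abs(r)

def huber_loss(r, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond (robust and differentiable)."""
    quad = 0.5 * r ** 2
    lin = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quad, lin)
```

For a large residual like r = 3 with δ = 1, the square loss gives 9 while Huber gives only 2.5, which is the sense in which Huber is robust.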
Classification Loss Functions
The Classification Problem
Action space A = {−1, 1}, output space Y = {−1, 1}.
0-1 loss for f : X → {−1, 1}:
ℓ(f(x), y) = 1(f(x) ≠ y)
But let's allow real-valued predictions f : X → R:
f(x) > 0 ⟹ predict 1
f(x) < 0 ⟹ predict −1
Classification Loss Functions
The Classification Problem: Real-Valued Predictions
Action space A = R, output space Y = {−1, 1}, prediction function f : X → R.
Definition: The value f(x) is called the score for the input x. Generally, the magnitude of the score represents the confidence of our prediction.
Definition: The margin on an example (x, y) is yf(x). The margin is a measure of how correct we are: it is positive exactly when the prediction has the correct sign.
We want to maximize the margin. Most classification losses depend only on the margin.
Classification Loss Functions
The Classification Problem: Real-Valued Predictions
Empirical risk for 0-1 loss:
R̂_n(f) = (1/n) Σᵢ₌₁ⁿ 1(yᵢ f(xᵢ) ≤ 0)
Minimizing empirical 0-1 risk is not computationally feasible:
R̂_n(f) is non-convex and not differentiable (in fact, discontinuous!). Optimization is NP-hard.
Classification Loss Functions
Classification Losses
Zero-One loss: ℓ_{0-1}(m) = 1(m ≤ 0)
Classification Loss Functions
Classification Losses
SVM/Hinge loss: ℓ_Hinge(m) = max{1 − m, 0} = (1 − m)₊
Hinge is a convex upper bound on the 0-1 loss. Not differentiable at m = 1. We have a "margin error" when m < 1.
Classification Loss Functions
Classification Losses
Logistic/Log loss: ℓ_Logistic(m) = log(1 + e⁻ᵐ)
Logistic loss is differentiable, and it is strictly positive for every m: there is never "enough margin" for logistic loss, so every training point keeps influencing the fit. How many support vectors?
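The three margin-based losses above can be sketched as functions of the margin m (a sketch, not course code); `np.logaddexp(0, -m)` computes log(1 + e⁻ᵐ) without overflow for large |m|:

```python
import numpy as np

def zero_one(m):
    """0-1 loss as a function of the margin: 1(m <= 0)."""
    return (m <= 0).astype(float)

def hinge(m):
    """Hinge loss: (1 - m)_+, a convex upper bound on 0-1 loss."""
    return np.maximum(1.0 - m, 0.0)

def logistic(m):
    """Logistic loss: log(1 + e^{-m}), computed stably."""
    return np.logaddexp(0.0, -m)
```

Note that hinge is exactly 0 once m ≥ 1, while logistic is positive for every finite m; this difference is what drives the sparsity discussion on the following slides.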
Classification Loss Functions
(Soft Margin) Linear Support Vector Machine
Hypothesis space F = { f(x) = wᵀx : w ∈ Rᵈ }.
Loss ℓ(m) = (1 − m)₊ with ℓ2 regularization:
min_{w ∈ Rᵈ}  Σᵢ₌₁ⁿ (1 − yᵢ f_w(xᵢ))₊ + λ‖w‖₂²
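The objective above is easy to evaluate directly. A sketch with illustrative names: `X` is the n × d data matrix, `y` the {−1, 1} labels, and `lam` stands for λ:

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """Regularized hinge objective: sum_i (1 - y_i w^T x_i)_+ + lam * ||w||_2^2."""
    margins = y * (X @ w)                       # y_i * f_w(x_i) for all i
    hinge_sum = np.sum(np.maximum(1.0 - margins, 0.0))
    return hinge_sum + lam * np.dot(w, w)
```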
Classification Loss Functions
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent
  initialize w = 0
  repeat
    randomly choose a training point (xᵢ, yᵢ) ∈ D_n
    w ← w − η ∇_w ℓ(f_w(xᵢ), yᵢ)    [gradient of the loss on the i'th example]
  until stopping criteria met
Classification Loss Functions
SGD for Hinge Loss and Linear Predictors
Consider the linear hypothesis space: f_w(x) = wᵀx. Gradient of the hinge loss at (x, y):
∇_w ℓ_Hinge(y wᵀx) =
  −yx        if y f_w(x) < 1
  0          if y f_w(x) > 1
  undefined  if y f_w(x) = 1
A point with margin m = y f_w(x) = 1 is correctly classified, so we can skip the SGD update for these points.
Rigorous approach: subgradient descent
Classification Loss Functions
SGD for Hinge Loss and Linear Predictors
For step t + 1 of SGD, we select a random training point (x, y) and set
  w⁽ᵗ⁺¹⁾ = w⁽ᵗ⁾ + η⁽ᵗ⁾ y x   if y f_{w⁽ᵗ⁾}(x) < 1
  w⁽ᵗ⁺¹⁾ = w⁽ᵗ⁾              otherwise
So w⁽ᵀ⁾ is a linear combination of the xᵢ's that had a margin error when selected. Any xᵢ in the expansion of w⁽ᵀ⁾ is called a support vector. We can write
  w = Σᵢ₌₁ˢ aᵢ x⁽ⁱ⁾,
where x⁽¹⁾, …, x⁽ˢ⁾ are the support vectors. Having 0 gradient for m ≥ 1 is what allows the support vectors to be sparse.
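The update rule above can be sketched as a short NumPy loop. This is a minimal illustration of the unregularized hinge update (the λ‖w‖² term is dropped for simplicity); names like `eta` and `n_steps` are illustrative:

```python
import numpy as np

def sgd_hinge(X, y, eta=0.1, n_steps=1000, seed=0):
    """SGD on the hinge loss for a linear predictor f_w(x) = w^T x.

    Returns the weights and the set of indices that ever triggered an
    update, i.e. had a margin error when selected (the support vectors).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    support = set()
    for _ in range(n_steps):
        i = rng.integers(n)                # pick a random training point
        if y[i] * (X[i] @ w) < 1:          # margin error: subgradient is -y_i x_i
            w = w + eta * y[i] * X[i]
            support.add(i)
        # margin >= 1: subgradient is 0, so no update
    return w, support

# Hypothetical tiny separable dataset in 1-d:
w, sv = sgd_hinge(np.array([[2.0], [-2.0]]), np.array([1.0, -1.0]),
                  eta=0.1, n_steps=200, seed=0)
```

On this toy data every update adds 0.2 to w, and updates stop once both margins reach 1; points never selected while w was small would not enter `sv`, which is the sparsity the slide describes.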
Classification Loss Functions
Population Minimizers
The population minimizer is another name for the risk minimizer: the function minimizing the expected loss in the "infinite data" case.
Hastie, Tibshirani, Friedman The Elements of Statistical Learning, 2nd Ed. Table 12.1