
Matching Loss

Manfred K. Warmuth, Maya Hristakeva

UC Santa Cruz

October 23, 2009


Two setups

[Diagram: a single linear neuron. The input $x$ is mapped by the weights $w$ to the linear activation $\hat a = w \cdot x$; the transfer function $h(z)$ turns it into the prediction $\hat y = h(\hat a)$. The true label $y$ corresponds to the activation $a = h^{-1}(y)$.]

Post: loss between $y$ and $\hat y$

Pre: loss between $a$ and $\hat a$


Two setups continued

Regression
  Examples are tuples $(x_t, a_t)$
  $x_t \in \mathbb{R}^n$: data point (example)
  $a_t \in \mathbb{R}$: true concentration (activity)
  Linear activation label estimate: $\hat a_t = w \cdot x_t$
  Loss between $a_t$ and $\hat a_t$

Classification
  Examples are tuples $(x_t, y_t)$
  $x_t \in \mathbb{R}^n$: data point (example)
  $y_t \in [0, 1]$: true probability (label)
  Probability label estimate: $\hat y_t = h(\hat a_t)$
  Loss between $y_t$ and $\hat y_t$


Why are we doing this?

Want a framework for designing non-symmetric loss functions

Loss functions should be steep in important regions and flat in unimportant ones

Need a flexible method for designing loss functions

Running example: the Clarke Error Grid for measuring glucose


Clarke Grid

[Figure: the Clarke Error Grid, with zones A–E plotted over reference values (x-axis, 0–400) and predictions (y-axis, 0–400).]


Designing Loss for Clarke Grid

Goal: Accurately predict glucose levels of people with diabetes

Non-symmetric loss: asymmetry is needed ...

Low concentrations are more important than high ones

The Clarke Grid is the standard loss

How do you optimize such a loss?


Clarke Grid vs. square loss

[Figure: Clarke Grid with square loss level curves, plotted over reference values vs. predictions (zones A–E).]


Clarke Grid versus our loss

[Figure: Clarke Grid with matching loss level curves, plotted over reference values vs. predictions (zones A–E).]


Single neuron again

[Diagram: the same single neuron, with $\hat a = w \cdot x$, $\hat y = h(\hat a)$, and $a = h^{-1}(y)$.]

Post: $\Delta_{h^{-1}}(y, \hat y)$

Pre: $\Delta_h(\hat a, a)$


Pre Matching Loss

[Figure: the transfer function $h(z)$ with the activations $a$ and $\hat a$ and the values $h(a)$ and $h(\hat a)$ marked; the shaded area between the curve and the level $h(a)$ is the loss.]

$$\Delta_h(\hat a, a) = \int_a^{\hat a} \bigl(h(z) - h(a)\bigr)\, dz$$
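As a quick illustration (not part of the slides), here is a minimal numerical sketch of this definition; the helper name `pre_matching_loss` is mine, and quadrature is done with `scipy.integrate.quad`:

```python
from scipy.integrate import quad

def pre_matching_loss(h, a_hat, a):
    """Pre matching loss Delta_h(a_hat, a) = integral from a to a_hat of (h(z) - h(a)) dz."""
    value, _ = quad(lambda z: h(z) - h(a), a, a_hat)
    return value

# With the identity transfer h(z) = z this is half the square loss:
print(pre_matching_loss(lambda z: z, 3.0, 1.0))  # 0.5 * (3 - 1)**2 = 2.0
```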


Pre Matching Loss Examples

$$\Delta_h(\hat a, a) = \int_a^{\hat a} \bigl(h(z) - h(a)\bigr)\, dz$$

Square Loss: $h(z) = z$

$$\Delta_h(\hat a, a) = \tfrac{1}{2}(\hat a - a)^2$$

Pre Logistic Loss: $h(z) = \frac{e^z}{1 + e^z}$

$$\Delta_h(\hat a, a) = \ln(1 + e^{\hat a}) - \ln(1 + e^{a}) - (\hat a - a)\underbrace{\frac{e^{a}}{1 + e^{a}}}_{y}$$
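A quick sanity check of the pre logistic closed form against the integral definition (a sketch; the helper names are mine, not the slides'):

```python
import numpy as np
from scipy.integrate import quad

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pre_logistic_loss(a_hat, a):
    """Closed form: ln(1 + e^a_hat) - ln(1 + e^a) - (a_hat - a) * sigmoid(a)."""
    return np.log1p(np.exp(a_hat)) - np.log1p(np.exp(a)) - (a_hat - a) * sigmoid(a)

a_hat, a = 2.0, -1.0
closed = pre_logistic_loss(a_hat, a)
numeric, _ = quad(lambda z: sigmoid(z) - sigmoid(a), a, a_hat)
print(closed, numeric)  # both ~1.0068, equal up to quadrature error
```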


Post Matching Loss Examples

$$\Delta_{h^{-1}}(y, \hat y) = \int_{\hat y}^{y} \bigl(h^{-1}(p) - h^{-1}(\hat y)\bigr)\, dp$$

Square Loss: $h(z) = h^{-1}(z) = z$

$$\Delta_{h^{-1}}(y, \hat y) = \tfrac{1}{2}(y - \hat y)^2 = \tfrac{1}{2}(\hat a - a)^2 = \Delta_h(\hat a, a)$$

Logistic Loss: $y = h(z) = \frac{e^z}{1 + e^z}$ and $h^{-1}(p) = \ln\frac{p}{1-p}$

$$\Delta_{h^{-1}}(y, \hat y) = y \ln\frac{y}{\hat y} + (1 - y) \ln\frac{1 - y}{1 - \hat y}$$
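The logistic post loss is the binary relative entropy; a small sketch (hypothetical helper name, assumes $y, \hat y$ strictly inside $(0, 1)$):

```python
import numpy as np

def post_logistic_loss(y, y_hat):
    """Binary relative entropy Delta_{h^-1}(y, y_hat) for the sigmoid transfer.
    Assumes 0 < y, y_hat < 1 so both logarithms are finite."""
    return y * np.log(y / y_hat) + (1.0 - y) * np.log((1.0 - y) / (1.0 - y_hat))

print(post_logistic_loss(0.9, 0.5))  # ~0.368
print(post_logistic_loss(0.5, 0.5))  # 0.0: the loss vanishes when y_hat = y
```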


Dual View of Matching Loss

$$\begin{aligned}
\Delta_h(\hat a, a) &= \int_a^{\hat a} \bigl(h(z) - h(a)\bigr)\, dz &&\text{Pre} \\
&= \int_{h(a)}^{h(\hat a)} \bigl(p - h(a)\bigr)\,(h^{-1}(p))'\, dp &&\text{substitute } h(z) = p,\ h^{-1}(p) = z,\ dz = (h^{-1}(p))'\, dp \\
&= \Bigl[(p - h(a))\, h^{-1}(p)\Bigr]_{h(a)}^{h(\hat a)} - \int_{h(a)}^{h(\hat a)} h^{-1}(p)\, dp &&\text{integration by parts} \\
&= (\hat y - y)\, h^{-1}(\hat y) - \int_{y}^{\hat y} h^{-1}(p)\, dp &&\text{with } y = h(a),\ \hat y = h(\hat a) \\
&= \int_{\hat y}^{y} \bigl(h^{-1}(p) - h^{-1}(\hat y)\bigr)\, dp \\
&= \Delta_{h^{-1}}(y, \hat y) &&\text{Post}
\end{aligned}$$
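The identity is easy to check numerically; a sketch (names are mine) comparing the two integrals for the sigmoid transfer:

```python
import numpy as np
from scipy.integrate import quad

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))  # inverse of the sigmoid

a_hat, a = 1.5, -0.5
y_hat, y = sigmoid(a_hat), sigmoid(a)

pre, _ = quad(lambda z: sigmoid(z) - sigmoid(a), a, a_hat)    # Delta_h(a_hat, a)
post, _ = quad(lambda p: logit(p) - logit(y_hat), y_hat, y)   # Delta_{h^-1}(y, y_hat)
print(pre, post)  # the dual view says these agree (~0.472 each)
```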


Two domains

Pre domain:

  Examples: $(x, a)$, for $a \in \mathbb{R}$
  Prediction: $\hat a = x \cdot w$
  Loss: $\Delta_h(\hat a, a) = \int_a^{\hat a} (h(z) - h(a))\, dz$

Post domain:

  Examples: $(x, y)$, for $y \in [0, 1]$
  Prediction: $\hat y = h(\hat a)$
  Loss: $\Delta_{h^{-1}}(y, \hat y) = \int_{\hat y}^{y} (h^{-1}(p) - h^{-1}(\hat y))\, dp$


Why are we doing this?

Want to design good matching losses given a problem

Post Domain:

Shifting and scaling w results in use of a different part of the transfer function
Shifting and scaling the transfer function can be undone by shifting and scaling w

Pre Domain:

Shifting and scaling the transfer function cannot be undone by shifting and scaling w
Loss is “fixed” by choosing the transfer function
Allows for design of “fancy” losses


Scaling and Shifting the Sigmoid

Define the transfer function $h(a)$ as

$$h(a) = \frac{e^{\alpha(a + \beta)}}{1 + e^{\alpha(a + \beta)}}, \qquad a = w \cdot x,$$

where $\alpha$ scales the sigmoid and $\beta$ shifts it

For the Clarke Grid, use a piece of the sigmoid that is steep on the small activations and then flattens out
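A sketch of this family of transfer functions; the specific values of $\alpha$ and $\beta$ below are illustrative guesses, not the ones used in the slides:

```python
import numpy as np

def scaled_shifted_sigmoid(a, alpha, beta):
    """h_{alpha,beta}(a) = sigmoid(alpha * (a + beta)): alpha scales, beta shifts."""
    return 1.0 / (1.0 + np.exp(-alpha * (a + beta)))

# E.g. a piece that is steep for small activations and flat for large ones:
print(scaled_shifted_sigmoid(np.array([0.0, 100.0, 400.0]), alpha=0.02, beta=-50.0))
```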


Different Parts of Sigmoid

[Figure: six panels. Top row: the transfer functions $h(a) = s(a)$, $h(a) = s(a/2 - 5)$, and $h(a) = s(a/2 + 2)$, each plotted with its derivative $h'(a)$ over $-6 \le a \le 6$. Bottom row: the corresponding losses $\Delta_h(\hat a, a)$ as functions of $\hat a$ for $a = -3, 0, 3$.]

In the bottom row we plot $\Delta_h(\hat a, a)$ as a function of the estimate $\hat a$ for fixed activities $a = -3, 0, 3$. Note that locally the losses are quadratic and the steepness of the bowl is determined by $h'(a)$.
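The locally quadratic claim can be checked with a finite-difference second derivative at $\hat a = a$ (a sketch for the sigmoid transfer; helper names are mine):

```python
import numpy as np
from scipy.integrate import quad

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pre_loss(a_hat, a):
    value, _ = quad(lambda z: sigmoid(z) - sigmoid(a), a, a_hat)
    return value

a, eps = -3.0, 1e-3
# Central second difference of the loss in a_hat at the minimum a_hat = a:
curvature = (pre_loss(a + eps, a) - 2.0 * pre_loss(a, a) + pre_loss(a - eps, a)) / eps**2
print(curvature, sigmoid(a) * (1.0 - sigmoid(a)))  # both ~ h'(a) = 0.0452
```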

3D View of the Loss

[Figure: surface plots of the loss $\Delta_h(\hat a, a)$ over the $(a, \hat a)$ plane for the Regular Sigmoid, the Left Piece of the Sigmoid, and the Right Piece of the Sigmoid.]

Right Piece of Sigmoid Contours

[Figure: Clarke Grid with matching loss level curves for the right piece of the sigmoid, plotted over reference values vs. predictions (zones A–E).]


Shift formulas

$$h_{\alpha,\beta}(a) := h(\alpha(a + \beta)) \tag{1}$$

$$h_{\alpha,\beta}\!\left(\tfrac{1}{\alpha}a - \beta\right) = h(a) \tag{2}$$

$$h_{\alpha,\beta}^{-1}(y) = \tfrac{1}{\alpha}\, h^{-1}(y) - \beta \tag{3}$$

$$h_{\alpha,\beta}\bigl(h_{\alpha,\beta}^{-1}(y)\bigr) \overset{(3)}{=} h_{\alpha,\beta}\!\left(\tfrac{1}{\alpha}\, h^{-1}(y) - \beta\right) \overset{(2)}{=} h(h^{-1}(y)) = y$$

$$h_{\alpha,\beta}^{-1}\bigl(h_{\alpha,\beta}(a)\bigr) \overset{(1)}{=} h_{\alpha,\beta}^{-1}\bigl(h(\alpha(a + \beta))\bigr) \overset{(3)}{=} \tfrac{1}{\alpha}\, h^{-1}\bigl(h(\alpha(a + \beta))\bigr) - \beta = a$$
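These identities are easy to verify numerically; a sketch with the sigmoid as base transfer (the parameter values are mine):

```python
import numpy as np

alpha, beta = 2.0, 0.5

def h(a):                      # base transfer: sigmoid
    return 1.0 / (1.0 + np.exp(-a))

def h_ab(a):                   # (1): h_{alpha,beta}(a) = h(alpha * (a + beta))
    return h(alpha * (a + beta))

def h_ab_inv(y):               # (3): h_{alpha,beta}^{-1}(y) = (1/alpha) * h^{-1}(y) - beta
    return np.log(y / (1.0 - y)) / alpha - beta

a, y = 0.7, 0.3
print(h_ab(h_ab_inv(y)), y)    # the pair composes to the identity in both orders
print(h_ab_inv(h_ab(a)), a)
```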


Post loss unaffected by shifting and scaling of transfer function

Can be undone by shifting and scaling the linear activation

$$\begin{aligned}
\Delta_{h_{\alpha,\beta}^{-1}}\Bigl(y,\ \overbrace{h_{\alpha,\beta}\bigl(\tfrac{1}{\alpha}a - \beta\bigr)}^{h(a)}\Bigr)
&= \int_{h(a)}^{y} \Bigl(h_{\alpha,\beta}^{-1}(p) - h_{\alpha,\beta}^{-1}\bigl(h_{\alpha,\beta}\bigl(\tfrac{1}{\alpha}a - \beta\bigr)\bigr)\Bigr)\, dp \\
&= \int_{h(a)}^{y} \Bigl(\tfrac{1}{\alpha}\, h^{-1}(p) - \beta - \bigl(\tfrac{1}{\alpha}a - \beta\bigr)\Bigr)\, dp &&\text{the $\beta$'s cancel} \\
&= \frac{1}{\alpha} \int_{h(a)}^{y} \bigl(h^{-1}(p) - h^{-1}(h(a))\bigr)\, dp \\
&= \frac{1}{\alpha}\, \Delta_{h^{-1}}\bigl(y, h(a)\bigr)
\end{aligned}$$
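A numeric check of this invariance (a sketch; parameter values and helper names are mine, quadrature as before):

```python
import numpy as np
from scipy.integrate import quad

alpha, beta = 2.0, 0.5

def logit(p):
    return np.log(p / (1.0 - p))       # h^{-1} for the sigmoid

def logit_ab(p):
    return logit(p) / alpha - beta     # h_{alpha,beta}^{-1}

def post_loss(inv, y, y_hat):
    value, _ = quad(lambda p: inv(p) - inv(y_hat), y_hat, y)
    return value

y, a = 0.8, -0.3
y_hat = 1.0 / (1.0 + np.exp(-a))       # h(a), the common prediction
print(post_loss(logit_ab, y, y_hat))   # Delta_{h_{alpha,beta}^{-1}}(y, h(a))
print(post_loss(logit, y, y_hat) / alpha)  # (1/alpha) * Delta_{h^{-1}}(y, h(a)): same value
```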


Summarize: why are we doing this?

Design good loss functions for a given problem

What is the best loss?

Square loss and logistic loss are bad gold standards


Gradient and Hessian of the Pre Loss

Work in the $(\hat a, a)$ domain

Loss

$$\Delta_h(\overbrace{w \cdot x}^{\hat a},\ a) = \int_a^{\hat a} \bigl(h(z) - h(a)\bigr)\, dz$$

1st derivatives

$$\frac{\partial \Delta_h(\hat a, a)}{\partial w} = \bigl(h(\hat a) - h(a)\bigr)\, x$$

2nd derivatives

$$\frac{\partial^2 \Delta_h(\hat a, a)}{(\partial w)^2} = h'(\hat a)\, x\, x^{\top}$$
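The gradient formula can be sanity-checked against finite differences (a sketch with made-up data; helper names are mine):

```python
import numpy as np
from scipy.integrate import quad

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pre_loss(w, x, a):
    """Delta_h(w.x, a) for the sigmoid transfer."""
    value, _ = quad(lambda z: sigmoid(z) - sigmoid(a), a, np.dot(w, x))
    return value

w = np.array([0.3, -0.2])
x = np.array([1.0, 2.0])
a = 0.5
a_hat = np.dot(w, x)

analytic = (sigmoid(a_hat) - sigmoid(a)) * x            # (h(a_hat) - h(a)) x
eps = 1e-6
numeric = np.array([(pre_loss(w + eps * e, x, a) - pre_loss(w - eps * e, x, a)) / (2 * eps)
                    for e in np.eye(2)])
print(analytic, numeric)  # should agree to high precision
```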


Weight updates

Pre Loss Gradient Descent (explicit)

$$w_t = w_{t-1} - \eta \underbrace{\frac{\partial \Delta_h(\hat a_t, a_t)}{\partial w}}_{\text{old gradient}} = w_{t-1} - \eta\, \bigl(h(\hat a_t) - h(a_t)\bigr)\, x_t, \quad \text{where } \hat a_t = w_{t-1} \cdot x_t$$

Post Loss Gradient Descent (explicit)

$$w_t = w_{t-1} - \eta \underbrace{\frac{\partial \Delta_{h^{-1}}(y_t, h(\hat a_t))}{\partial w}}_{\text{old gradient}} = w_{t-1} - \eta\, \frac{\partial \Delta_h(\hat a_t, h^{-1}(y_t))}{\partial w} = w_{t-1} - \eta\, \bigl(h(\hat a_t) - y_t\bigr)\, x_t$$
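A compact sketch of both explicit updates for the sigmoid transfer (the data and learning rate below are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pre_gd_step(w, x, a, eta):
    """Explicit GD on the pre loss: w - eta * (h(a_hat) - h(a)) * x."""
    a_hat = np.dot(w, x)
    return w - eta * (sigmoid(a_hat) - sigmoid(a)) * x

def post_gd_step(w, x, y, eta):
    """Explicit GD on the post loss: w - eta * (h(a_hat) - y) * x."""
    a_hat = np.dot(w, x)
    return w - eta * (sigmoid(a_hat) - y) * x

w = np.zeros(2)
for x_t, y_t in [(np.array([1.0, 0.5]), 0.9), (np.array([-0.5, 1.0]), 0.2)]:
    w = post_gd_step(w, x_t, y_t, eta=0.1)
print(w)
```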
