
Calibrating Deep Neural Networks using Focal Loss

Jishnu Mukhoti*,1,2, Viveka Kulharia*,2, Amartya Sanyal2,3, Stuart Golodetz1, Philip H. S. Torr1,2, and Puneet K. Dokania1,2

1FiveAI Ltd.  2University of Oxford

3The Alan Turing Institute, London

Abstract

Miscalibration – a mismatch between a model's confidence and its correctness – of Deep Neural Networks (DNNs) makes their predictions hard to rely on. Ideally, we want networks to be accurate, calibrated and confident. We show that, as opposed to the standard cross-entropy loss, focal loss [Lin et al., 2017] allows us to learn models that are already very well calibrated. When combined with temperature scaling, whilst preserving accuracy, it yields state-of-the-art calibrated models. We provide a thorough analysis of the factors causing miscalibration, and use the insights we glean from this to justify the empirically excellent performance of focal loss. To facilitate the use of focal loss in practice, we also provide a principled approach to automatically select the hyperparameter involved in the loss function. We perform extensive experiments on a variety of computer vision and NLP datasets, and with a wide variety of network architectures, and show that our approach achieves state-of-the-art accuracy and calibration in almost all cases.

1 Introduction

Deep neural networks have dominated computer vision and machine learning in recent years, and this has led to their widespread deployment in real-world systems [Cao et al., 2018, Chen et al., 2018, Kamilaris and Prenafeta-Boldú, 2018, Ker et al., 2018, Wang et al., 2018]. However, many current multi-class classification networks in particular are poorly calibrated, in the sense that the probability values that they associate with the class labels they predict overestimate the likelihoods of those class labels being correct in the real world. This is a major problem, since if networks are routinely overconfident, then downstream components cannot trust their predictions. The underlying cause is hypothesised to be that these networks' high capacity leaves them vulnerable to overfitting on the negative log-likelihood (NLL) loss they conventionally use during training [Guo et al., 2017].

Given the importance of this problem, numerous suggestions for how to address it have been proposed. Much work has been inspired by approaches that were not originally formulated in a deep learning context, such as Platt scaling [Platt, 1999], histogram binning [Zadrozny and Elkan, 2001], isotonic regression [Zadrozny and Elkan, 2002], and Bayesian binning and averaging [Naeini et al., 2015, Naeini and Cooper, 2016]. As deep learning has become more dominant, however, various works have begun to directly target the calibration of deep networks. For example, Guo et al. [2017] have popularised a modern variant of Platt scaling known as temperature scaling, which works by dividing a network's logits by a scalar T > 0 (learnt on a validation subset) prior to performing softmax. Temperature scaling has the desirable property that it can improve the calibration of a network without in any way affecting its accuracy. However, whilst its simplicity and effectiveness have made it a popular network calibration method, it does have downsides.

∗Joint first authors


For example, whilst it scales the logits to reduce the network's confidence in incorrect predictions, this also slightly reduces the network's confidence in predictions that were correct.

By contrast, Kumar et al. [2018] initially eschew temperature scaling in favour of minimising a differentiable proxy for calibration error at training time, called Maximum Mean Calibration Error (MMCE), although they do later also use temperature scaling as a post-processing step to obtain better results than cross-entropy followed by temperature scaling [Guo et al., 2017]. Separately, Müller et al. [2019] propose training models on cross-entropy loss with label smoothing instead of one-hot labels, and show that label smoothing has a very favourable effect on model calibration.

In this paper, we propose a technique for improving network calibration that works by replacing the cross-entropy loss conventionally used when training classification networks with the focal loss proposed by Lin et al. [2017]. We observe that unlike cross-entropy, which minimises the KL divergence between the predicted (softmax) distribution and the target distribution (one-hot encoding in classification tasks) over classes, focal loss minimises a regularised KL divergence between these two distributions, which ensures minimisation of the KL divergence whilst increasing the entropy of the predicted distribution, thereby preventing the model from becoming overconfident. Since focal loss, as shown in §4, is dependent on a hyperparameter, γ, that needs to be cross-validated, we also provide a method of choosing γ automatically for each sample, and show that it outperforms all the baseline models.

The intuition behind using focal loss is to direct the network's attention during training towards samples for which it is currently predicting a low probability for the correct class, since trying to reduce the NLL on samples for which it is already predicting a high probability for the correct class is liable to lead to NLL overfitting, and thereby miscalibration [Guo et al., 2017]. More formally, we show in §4 that focal loss can be seen as implicitly regularising the weights of the network during training by causing the gradient norms for confident samples to be lower than they would have been with cross-entropy, which we would expect to reduce overfitting and improve the network's calibration.

Overall, we make the following contributions:

1. In §3, we study in detail the link that Guo et al. [2017] observed between miscalibration and NLL overfitting, and show that the overfitting is associated with the predicted distributions for misclassified test samples becoming peakier as the optimiser tries to increase the magnitude of the network's weights to reduce the training NLL.

2. In §4, we propose the use of focal loss for training better-calibrated networks, and provide both theoretical and empirical justifications for this approach. In addition, we provide a principled method of automatically choosing γ for each sample during training.

3. In §5.1, we show, via experiments on a variety of classification datasets and network architectures, that DNNs trained with focal loss are more calibrated than those trained with cross-entropy loss (both with and without label smoothing), MMCE or Brier loss [Brier, 1950].

4. Finally, in §5.2, we show that a network trained using focal loss is able to improve on both the BLEU and ECE scores of baseline models trained using cross-entropy (with one-hot labels and label smoothing) on the WMT 2014 English-to-German translation dataset, thereby showing the practical impact of the calibration improvements we are able to achieve for performance on downstream tasks.

2 Problem Formulation

Let $D = \langle (x_i, y_i) \rangle_{i=1}^{N}$ denote a dataset consisting of N samples from a joint distribution $\mathcal{D}(\mathcal{X}, \mathcal{Y})$, where for each sample i, $x_i \in \mathcal{X}$ is the input and $y_i \in \mathcal{Y} = \{1, 2, \dots, K\}$ is the ground-truth class label. Let $p_{i,y} = f_{\theta}(y|x_i)$ be the probability that a neural network f with model parameters θ predicts for a class y on a given input $x_i$. The class that f predicts for $x_i$ is computed as $\hat{y}_i = \arg\max_{y \in \mathcal{Y}} p_{i,y}$, and the predicted confidence as $\hat{p}_i = \max_{y \in \mathcal{Y}} p_{i,y}$. The network is said to be perfectly calibrated when, for each sample $(x, y) \in D$, the confidence $\hat{p}$ is equal to the model accuracy $\mathbb{P}(\hat{y} = y \mid \hat{p})$, i.e. the probability that the predicted class is correct.

Figure 1: The confidence values for training samples at different epochs during the NLL training of a ResNet-50 on CIFAR-10 (see §3). Top row: reliability plots using 25 confidence bins; bottom row: % of samples in each bin. As training progresses, the model gradually shifts all training samples to the highest confidence bin. Notably, it continues to do so even after achieving 100% training accuracy by the 300-epoch point.

For instance, of all the samples to which a perfectly calibrated neural network assigns a confidence of 0.8, 80% should be correctly predicted.

A popular metric used to measure model calibration is the expected calibration error (ECE) [Naeini et al., 2015], defined as the expected absolute difference between the model's confidence and its accuracy, i.e. $\mathbb{E}_{\hat{p}}\big[\,|\mathbb{P}(\hat{y} = y \mid \hat{p}) - \hat{p}|\,\big]$. Since we only have finite samples, the ECE cannot in practice be computed using this definition. Instead, we divide the interval [0, 1] into M equispaced bins, where the i-th bin is the interval $\big(\frac{i-1}{M}, \frac{i}{M}\big]$. Let $B_i$ denote the set of samples with confidences belonging to the i-th bin. The accuracy $A_i$ of this bin is computed as $A_i = \frac{1}{|B_i|} \sum_{j \in B_i} \mathbb{1}(\hat{y}_j = y_j)$, where $\mathbb{1}$ is the indicator function, and $\hat{y}_j$ and $y_j$ are the predicted and ground-truth labels for the j-th sample. Similarly, the confidence $C_i$ of the i-th bin is computed as $C_i = \frac{1}{|B_i|} \sum_{j \in B_i} \hat{p}_j$, i.e. $C_i$ is the average confidence of all samples in the bin. The ECE can be approximated as a weighted average of the absolute difference between the accuracy and confidence of each bin:

$$\mathrm{ECE} = \sum_{i=1}^{M} \frac{|B_i|}{N} \left| A_i - C_i \right| \qquad (1)$$

A similar metric, the maximum calibration error (MCE) [Naeini et al., 2015], is defined as the maximum absolute difference between the confidence and accuracy of each bin:

$$\mathrm{MCE} = \max_{i \in \{1, \dots, M\}} \left| A_i - C_i \right| \qquad (2)$$

A common way of visually exploring the calibration of a model is to use a reliability plot [Niculescu-Mizil and Caruana, 2005], which plots the accuracies of the various confidence bins as a bar chart (e.g. see Figure 1). Reliability plots also capture whether or not a model is under-confident or over-confident in general. For a perfectly calibrated model, the accuracy for each bin will match the confidence, and hence all of the bars will lie on a diagonal. By contrast, if most of the bars lie above the diagonal, meaning that the model is more accurate than it expects, then it is said to be under-confident, and if most of the bars lie below the diagonal, then it is said to be over-confident.

AdaECE: One disadvantage of ECE is the uniform bin width. Once a model is trained, most of the samples lie within the highest confidence bins, and hence these bins dominate the value of the ECE, as the contribution of each bin is weighted by the number of samples it contains. To mitigate this, we thus also consider another metric, which we call AdaECE (Adaptive ECE), for which bin sizes are calculated so as to evenly distribute samples between bins:

$$\mathrm{AdaECE} = \sum_{i=1}^{M} \frac{|B_i|}{N} \left| A_i - C_i \right| \quad \text{s.t.} \quad \forall i, j \;\; |B_i| = |B_j| \qquad (3)$$
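For concreteness, the sketch below shows one way Equations (1) and (3) might be computed, assuming `confidences`, `predictions` and `labels` are NumPy arrays holding the per-sample confidence, predicted label and ground-truth label; it is an illustrative implementation, not the authors' released code.

```python
import numpy as np

def ece(confidences, predictions, labels, n_bins=15):
    """Equal-width-bin ECE (Eq. 1): the i-th bin is ((i-1)/M, i/M]."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n, err = len(confidences), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = (predictions[in_bin] == labels[in_bin]).mean()   # A_i
            conf = confidences[in_bin].mean()                      # C_i
            err += (in_bin.sum() / n) * abs(acc - conf)
    return err

def ada_ece(confidences, predictions, labels, n_bins=15):
    """Adaptive-bin ECE (Eq. 3): sort by confidence and use equal-count bins."""
    order = np.argsort(confidences)
    n, err = len(confidences), 0.0
    for idx in np.array_split(order, n_bins):                      # |B_i| ~ N / M for every bin
        acc = (predictions[idx] == labels[idx]).mean()
        conf = confidences[idx].mean()
        err += (len(idx) / n) * abs(acc - conf)
    return err
```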

Figure 2: How metrics related to model calibration change whilst training a ResNet-50 network on CIFAR-10. (a): train NLL and train entropy (overall, and separately for correctly and incorrectly classified samples); (b): test NLL and test entropy (overall, and for correctly and incorrectly classified samples); (c): train classification error, test classification error and test ECE.

3 What Causes Miscalibration?

We now discuss why high-capacity neural networks, despite achieving low classification errors on well-known datasets, tend to be miscalibrated. A key empirical observation made by Guo et al. [2017] was that poor calibration of such networks appears to be linked to overfitting on the negative log-likelihood (NLL) during training. In this section, we further inspect this observation to provide new insights.

For the analysis, we train a high-capacity ResNet-50 network on CIFAR-10 with state-of-the-art performance settings [PyTorch-CIFAR]. We use Stochastic Gradient Descent (SGD) with a mini-batch size of 128, momentum of 0.9, and a learning rate schedule of {0.1, 0.01, 0.001} for the first 150, next 100, and last 100 epochs, respectively. We minimise the cross-entropy loss (a.k.a. NLL) $\mathcal{L}_c$, which, in a standard classification context, is $-\log p_{i,y_i}$, where $p_{i,y_i}$ is the probability assigned by the network to the correct class $y_i$ for the i-th sample. Note that the NLL is minimised when, for each training sample i, $p_{i,y_i} = 1$, whereas the classification error is minimised when $p_{i,y_i} > p_{i,y}$ for all $y \neq y_i$. This indicates that even when the classification error is 0, the NLL can be positive, and the optimisation algorithm can still try to reduce it to 0 by further increasing the value of $p_{i,y_i}$ for each sample. This can be empirically observed in Figure 1, where we plot reliability diagrams and percentages of samples in each confidence bin for different training epochs.
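As an illustration, the training configuration described above could be set up in PyTorch roughly as follows. This is a hedged sketch: `train_loader` is assumed to be a CIFAR-10 DataLoader with batch size 128, the torchvision ResNet-50 is a stand-in for the CIFAR variant used in [PyTorch-CIFAR], and the weight decay value is an assumption rather than something stated in this section.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision.models import resnet50

model = resnet50(num_classes=10).cuda()
criterion = nn.CrossEntropyLoss()          # cross-entropy, i.e. the NLL -log p_{i,y_i}
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# lr = 0.1 for epochs 0-149, 0.01 for epochs 150-249, 0.001 for epochs 250-349
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 250], gamma=0.1)

for epoch in range(350):
    for images, targets in train_loader:   # assumed CIFAR-10 loader, mini-batch size 128
        optimizer.zero_grad()
        loss = criterion(model(images.cuda()), targets.cuda())
        loss.backward()
        optimizer.step()
    scheduler.step()
```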

To study how miscalibration occurs during training, we plot the average NLL for the train and test sets at each training epoch in Figures 2(a) and 2(b). We also plot the average NLL and the entropy of the softmax distribution produced by the network for the correctly and incorrectly classified samples. In Figure 2(c), we plot the classification errors on the train and test sets, along with the test set ECE.

Curse of misclassified samples: Figures 2(a) and 2(b) show that although the average train NLL (for both correctly and incorrectly classified training samples) broadly decreases throughout training, after the 150th epoch (where the learning rate drops by a factor of 10), there is a marked rise in the average test NLL, indicating that the network starts to overfit on average NLL. Interestingly, the increase in average test NLL is caused only by the incorrectly classified samples, as the average NLL for the correctly classified samples continues to decrease even after the 150th epoch. We also observe that after epoch 150, the test set ECE rises, indicating that the network is becoming miscalibrated. This corroborates the observation in Guo et al. [2017] that miscalibration and NLL overfitting are linked.

Peak at the wrong place: We further observe that the entropies of the softmax distributions for both the correctly and incorrectly classified test samples decrease throughout training (in other words, the distributions get peakier). This observation, coupled with the one we made above, indicates that for the wrongly classified test samples, the network gradually becomes more and more confident about its incorrect predictions.

Weight magnification: The increase in confidence of the network's predictions can happen if the network increases the norm of its weights W to increase the magnitudes of the logits. In fact, cross-entropy loss is minimised when, for each training sample i, $p_{i,y_i} = 1$, which is possible only when ||W|| → ∞. Cross-entropy loss thus inherently induces this tendency of weight magnification in neural network optimisation.

Figure 3: How metrics related to model calibration change whilst training several ResNet-50 networks on CIFAR-10, using either cross-entropy loss, or focal loss with γ set to 1, 2 or 3. (a): test NLL; (b): test NLL for correctly classified samples; (c): test NLL for misclassified samples; (d): test entropy for misclassified samples; (e): weight norm of the last linear layer.

The promising performance of weight decay (regulating the norm of weights) on the calibration of neural networks can perhaps be explained using this. We explore this further in §4. This increase in the network's confidence during training is one of the key causes of miscalibration.

4 Improving Calibration using Focal Loss

As discussed in §3, overfitting on NLL, which is observed as the network grows more confident on all of its predictions irrespective of their correctness, is strongly related to poor calibration. One cause of this is that the cross-entropy objective minimises the difference between the softmax distribution and the ground-truth one-hot encoding over an entire mini-batch, irrespective of how well a network classifies individual samples in the mini-batch. In this work, we study an alternative loss function, popularly known as focal loss [Lin et al., 2017], that tackles this by weighting loss components generated from individual samples in a mini-batch by how well the model classifies them. For classification tasks where the target distribution is a one-hot encoding, it is defined as $\mathcal{L}_f = -(1 - p_{i,y_i})^{\gamma} \log p_{i,y_i}$, where γ is a user-defined hyperparameter.¹
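As a concrete reference, a minimal PyTorch sketch of this loss (for hard, one-hot targets) might look as follows; the function name and signature are illustrative rather than taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=3.0):
    """L_f = -(1 - p_{i,y_i})^gamma * log p_{i,y_i}, averaged over the mini-batch."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_{i,y_i}
    pt = log_pt.exp()                                          # p_{i,y_i}
    return (-(1.0 - pt) ** gamma * log_pt).mean()
```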

Why might focal loss improve calibration? We know that cross-entropy forms an upper bound on the KL-divergence between the target distribution q and the predicted distribution p, i.e. $\mathcal{L}_c \geq \mathrm{KL}(q \| p)$, so minimising cross-entropy results in minimising $\mathrm{KL}(q \| p)$. Interestingly, a general form of focal loss can be shown to be an upper bound on the regularised KL-divergence, where the regulariser is the negative entropy of the predicted distribution p, and the regularisation parameter is γ, the hyperparameter of focal loss (a proof of this can be found in Appendix A):

$$\mathcal{L}_f \geq \mathrm{KL}(q \| p) - \gamma \mathbb{H}[p] \qquad (4)$$

The most interesting property of this upper bound is that it shows that replacing cross-entropy with focal loss has the effect of adding a maximum-entropy regulariser [Pereyra et al., 2017] to the implicit minimisation that was previously being performed. In other words, trying to minimise it means trying to minimise the KL divergence between p and q, whilst simultaneously trying to increase the entropy of the predicted distribution p. Encouraging the predicted distribution to have higher entropy can help avoid the overconfident predictions produced by modern neural networks (see the 'Peak at the wrong place' paragraph of Section 3), and thereby improve calibration.

Empirical observations: To analyse the behaviour of neural networks trained on focal loss, we use the same framework as mentioned above, and train four ResNet-50 networks on CIFAR-10, one using cross-entropy loss, and three using focal loss with γ = 1, 2 and 3. Figure 3(a) shows that the test NLL for the cross-entropy model significantly increases towards the end of training (before saturating), whereas the NLLs for the focal loss models remain low. To better understand this, we analyse the behaviour of these models for correctly and incorrectly classified samples. Figure 3(b) shows that even though the NLLs for the correctly classified samples broadly-speaking decrease over the course of training for all the models, the NLLs for the focal loss models remain consistently higher than that for the cross-entropy model throughout training, implying that the focal loss models are relatively less confident than the cross-entropy model for samples that they predict correctly. This is important, as we have already discussed that it is overconfidence that normally makes deep neural networks miscalibrated. Figure 3(c) shows that in contrast to the cross-entropy model, for which the NLL for misclassified test samples increases significantly after epoch 150, the rise in this value for the focal loss models is much less severe. Additionally, in Figure 3(d), we notice that the entropy of the softmax distribution for misclassified test samples is consistently (if marginally) higher for focal loss than for cross-entropy (consistent with Equation 4).

¹ We note in passing that unlike cross-entropy loss, focal loss in its general form is not a proper loss function, as minimising it does not always lead to the predicted distribution p being equal to the target distribution q (see Appendix A for the relevant definition and a longer discussion). However, when q is a one-hot encoding (as in our case, and for most classification tasks), minimising focal loss does lead to p being equal to q.

Figure 4: (a): g(p, γ) vs. p; (b)–(d): histograms of the gradient norms of the last linear layer for both cross-entropy and focal loss.

As per §3, an increase in the test NLL and a decrease in the test entropy for misclassified samples, taken together with no corresponding increase in the test NLL for the correctly classified samples, can be interpreted as the network starting to predict softmax distributions for the misclassified samples that are ever more peaky in the wrong place. Notably, our results in Figures 3(b), 3(c) and 3(d) clearly show that this effect is significantly reduced when training with focal loss rather than cross-entropy, leading to a network that is less peaky in the wrong place and better calibrated.

Theoretical justification: As mentioned previously, once a model trained using cross-entropy reaches high accuracy on the training set, the optimiser may try to further reduce the training NLL by increasing the confidences for the correctly classified samples. One way it could achieve this would be to increase the weights of the network to increase the magnitudes of the logits. In fact, this hypothesis would help to explain the observation of Guo et al. [2017] that models trained using some form of weight decay are relatively better calibrated. To verify this, we plot the L2 norm of the weights of the last linear layer for all four networks as a function of the training epoch (see Figure 3(e)). Notably, although the norms of the weights for the models trained on focal loss are initially higher than that for the cross-entropy model, a complete reversal in the ordering of the weight norms occurs between epochs 150 and 250. In other words, as the networks start to become miscalibrated, the weight norm for the cross-entropy model also starts to become greater than those for the focal loss models. In practice, this is because focal loss, by design, starts to act as a regulariser on the network's weights once the model has gained a certain amount of confidence in its predictions. This behaviour of focal loss can be observed even on a much simpler setup like a linear model (see Appendix B). To better understand this, we start by considering the following proposition:

Proposition 1. For focal loss $\mathcal{L}_f$ and cross-entropy $\mathcal{L}_c$, the gradients satisfy $\frac{\partial \mathcal{L}_f}{\partial w} = \frac{\partial \mathcal{L}_c}{\partial w}\, g(p_{i,y_i}, \gamma)$, where $g(p, \gamma) = (1-p)^{\gamma} - \gamma p (1-p)^{\gamma - 1} \log(p)$, $\gamma \in \mathbb{R}^{+}$ is the focal loss hyperparameter, and w denotes the parameters of the last linear layer. Thus $\left\| \frac{\partial \mathcal{L}_f}{\partial w} \right\| \leq \left\| \frac{\partial \mathcal{L}_c}{\partial w} \right\|$ if $g(p_{i,y_i}, \gamma) \in [0, 1]$.

Proof. See Appendix C.

Proposition 1 shows the relationship between the norms of the gradients of the last linear layer for focal loss and cross-entropy loss, for the same network architecture. Note that this relation depends on a function g(p, γ), which we plot in Figure 4(a) to understand its behaviour. It is clear that for every γ, there exists a (different) threshold $p_0$ such that for all $p \in [0, p_0]$, g(p, γ) ≥ 1, and for all $p \in (p_0, 1]$, g(p, γ) < 1. (For example, for γ = 1, $p_0 \approx 0.4$.) We use this insight to further explain why focal loss provides implicit weight regularisation.
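The threshold behaviour described above is easy to reproduce numerically. The short sketch below (an illustration, not the authors' code) evaluates g(p, γ) on a grid and reports the first p at which it drops below 1 for a few values of γ.

```python
import numpy as np

def g(p, gamma):
    """g(p, gamma) = (1 - p)^gamma - gamma * p * (1 - p)^(gamma - 1) * log(p)."""
    return (1 - p) ** gamma - gamma * p * (1 - p) ** (gamma - 1) * np.log(p)

p = np.linspace(1e-3, 1 - 1e-6, 100000)
for gamma in (1.0, 2.0, 3.0):
    p0 = p[np.argmax(g(p, gamma) < 1.0)]   # first grid point where g falls below 1
    print(f"gamma = {gamma}: g(p, gamma) crosses 1 at p0 ~ {p0:.2f}")
```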

Implicit weight regularisation: For a network trained using focal loss with a fixed γ, during the initial stages of training, when $p_{i,y_i} \in (0, p_0)$, $g(p_{i,y_i}, \gamma) > 1$. This implies that the confidences of the focal loss model's predictions will initially increase faster than they would for cross-entropy. However, as soon as $p_{i,y_i}$ crosses the threshold $p_0$, $g(p_{i,y_i}, \gamma)$ falls below 1 and reduces the size of the gradient updates made to the network weights, thereby having a regularising effect on the weights. This is why, in Figure 3(e), we find that the weight norms of the models trained with focal loss are initially higher than that for the model trained using cross-entropy. However, as training progresses, we find that the ordering of the weight norms reverses, as focal loss starts regularising the network weights. Moreover, we can draw similar insights from Figures 4(b), 4(c) and 4(d), in which we plot histograms of the gradient norms of the last linear layer (over all samples in the training set) at epochs 10, 100 and 200, respectively. At epoch 10, the gradient norms for cross-entropy and focal loss are similar, but as training progresses, those for cross-entropy decrease less rapidly than those for focal loss, indicating that the gradient norms for focal loss are consistently lower than those for cross-entropy throughout training.

Finally, observe in Figure 4(a) that for higher γ values, the fall in g(p, γ) is steeper. We would thus expect a greater weight regularisation effect for models that use higher values of γ. This explains why, of the three models that we trained using focal loss, the one with γ = 3 outperforms (in terms of calibration) the one with γ = 2, which in turn outperforms the model with γ = 1. Based on this observation, one might think that, in general, a higher value of γ would lead to a more calibrated model. However, this is not the case, as we notice from Figure 4(a) that for γ ≥ 7, g(p, γ) reduces to nearly 0 for a relatively low value of p (around 0.5). As a result, using values of γ that are too high will cause the gradients to die (i.e. reduce to nearly 0) early, at a point at which the network's predictions remain ambiguous, thereby causing the training process to fail.

How to choose γ: As discussed, focal loss provides implicit entropy and weight regularisation, both of which heavily depend on the value of γ. Finding an appropriate γ is normally done using cross-validation. Also, traditionally, γ is fixed for all samples in the dataset. However, as shown, the regularisation effect for a sample i depends on $p_{i,y_i}$, i.e. the predicted probability for the ground-truth label of the sample. It thus makes sense to choose γ for each sample based on the value of $p_{i,y_i}$. To this end, we provide Proposition 2, which we use to find a solution to this problem:

Proposition 2. Given a $p_0$, for $1 \geq p \geq p_0 > 0$, $g(p, \gamma) \leq 1$ for all $\gamma \geq \gamma^{*} = \frac{a}{b} + \frac{1}{\log a} W_{-1}\!\left( -\frac{a^{(1 - a/b)}}{b} \log a \right)$, where $a = 1 - p_0$, $b = p_0 \log p_0$, and $W_{-1}$ is the Lambert-W function [Corless et al., 1996]. Moreover, for $p \geq p_0 > 0$ and $\gamma \geq \gamma^{*}$, the equality $g(p, \gamma) = 1$ holds only for $p = p_0$ and $\gamma = \gamma^{*}$.

Proof. See Appendix C. Note that there exist multiple values of γ for which g(p, γ) ≤ 1 for all p ≥ p0.
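Since the Lambert-W function is available in standard libraries (see Appendix C), γ* can be computed directly. The following sketch (illustrative, using SciPy's lambertw on the k = −1 branch) evaluates the expression in Proposition 2; it gives γ* ≈ 3.0 for p0 = 0.25 and γ* ≈ 4.9 for p0 = 0.2, consistent with the γ = 5, γ = 3 schedule discussed below.

```python
import numpy as np
from scipy.special import lambertw

def gamma_star(p0):
    """Smallest gamma for which g(p, gamma) <= 1 for all p >= p0 (Proposition 2)."""
    a, b = 1.0 - p0, p0 * np.log(p0)
    arg = -(a ** (1.0 - a / b) / b) * np.log(a)
    return a / b + lambertw(arg, k=-1).real / np.log(a)

for p0 in (0.2, 0.25, 0.5):
    print(f"p0 = {p0}: gamma* ~ {gamma_star(p0):.2f}")
```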

For a given $p_0$, this allows us to compute γ such that (i) $g(p_0, \gamma) = 1$; (ii) $g(p, \gamma) \geq 1$ for $p \in [0, p_0)$; and (iii) $g(p, \gamma) < 1$ for $p \in (p_0, 1]$. This allows us to control the magnitude of the gradients for a particular sample i based on the current value of $p_{i,y_i}$, and gives us a way of obtaining an informed value of γ for each sample. For instance, a reasonable policy might be to choose γ such that $g(p_{i,y_i}, \gamma) > 1$ if $p_{i,y_i}$ is small (say less than 0.25), and $g(p_{i,y_i}, \gamma) < 1$ otherwise. Such a policy will have the effect of making the weight updates larger for samples having a low predicted probability for the correct class, and smaller for samples with a relatively higher predicted probability for the correct class.

Following the aforementioned arguments, we choose a threshold $p_0$ of 0.25, and use Proposition 2 to obtain a γ policy such that g(p, γ) is observably greater than 1 for p ∈ [0, 0.25) and g(p, γ) < 1 for p ∈ (0.25, 1]. In particular, we use the following schedule: if $p_{i,y_i} \in [0, 0.2)$, then γ = 5, otherwise γ = 3 (note that g(0.2, 5) ≈ 1 and g(0.25, 3) ≈ 1; see Figure 4(a)). We find this schedule to perform consistently well across multiple classification datasets and network architectures. Having said that, one can calculate multiple such schedules for γ following Proposition 2, using the intuition of having a relatively high γ for low values of $p_{i,y_i}$ and a relatively low γ for high values of $p_{i,y_i}$.
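In code, this sample-dependent schedule amounts to a small change to the fixed-γ focal loss sketched in §4: γ is chosen per sample from the current $p_{i,y_i}$. Again, this is an illustrative sketch rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def flsd_53_loss(logits, targets):
    """Sample-dependent focal loss: gamma = 5 when p_{i,y_i} < 0.2, gamma = 3 otherwise."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    gamma = torch.where(pt < 0.2, torch.full_like(pt, 5.0), torch.full_like(pt, 3.0))
    return (-(1.0 - pt) ** gamma * log_pt).mean()
```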


5 Experiments

5.1 Image and Document Classification

We use multiple image and document classification datasets to verify the effectiveness of focal loss for training calibrated models. For our image classification experiments, we use CIFAR-10/100 [Krizhevsky, 2009] and Tiny-ImageNet [Deng et al., 2009], and for document classification, we use 20 Newsgroups [Lang, 1995] and the Stanford Sentiment Treebank (SST) [Socher et al., 2013]. We provide details regarding these datasets and their train/validation/test splits in Appendix D. On CIFAR-10/100, we train ResNet-50, ResNet-110 [He et al., 2016], Wide-ResNet-26-10 [Zagoruyko and Komodakis, 2016] and DenseNet-121 [Huang et al., 2017]. On Tiny-ImageNet, we train a ResNet-50 network. We train a Global Pooling Convolutional Network [Lin et al., 2014] on 20 Newsgroups [Lang, 1995] and a Tree-LSTM [Tai et al., 2015] on the SST Binary dataset. We provide implementation details for training each of these models in Appendix D. We use the following loss functions for training the above models:

Baselines: Along with cross-entropy loss, we compare our method against the following strong baselines (with and without temperature scaling):

1. MMCE (Maximum Mean Calibration Error) [Kumar et al., 2018]: a continuous and differentiable proxy for calibration error that is normally used as a regulariser alongside cross-entropy.

2. Label smoothing (LS) [Müller et al., 2019]: given a one-hot ground-truth label distribution q and a smoothing factor α (a hyperparameter), the smoothed vector s is obtained as $s_i = (1 - \alpha) q_i + \alpha (1 - q_i)/(K - 1)$, where $s_i$ and $q_i$ denote the i-th elements of the vectors s and q respectively, and K is the number of classes. Instead of q, the smoothed vector s is now treated as the ground truth. In our experiments, we train models using smoothing factors α = 0.05 and α = 0.1, but find α = 0.05 to perform better. We thus report the results obtained with α = 0.05 (denoted LS-0.05).

3. Brier score [Brier, 1950]: computed as the squared error between the predicted softmax vector and the one-hot ground-truth encoding. The Brier score is a particularly relevant baseline for calibration, as it can be decomposed into calibration and refinement components [DeGroot and Fienberg, 1983, Snoek et al., 2019]. Moreover, it has a distinct penalty on incorrect class probabilities. (A short sketch of the label smoothing and Brier score baselines follows this list.)
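The sketch below illustrates, under the definitions above, how the smoothed targets of baseline 2 and the Brier score of baseline 3 might be implemented in PyTorch; the function names are illustrative, not taken from the authors' code.

```python
import torch.nn.functional as F

def smoothed_targets(labels, num_classes, alpha=0.05):
    """s_i = (1 - alpha) q_i + alpha (1 - q_i) / (K - 1), with q the one-hot encoding."""
    q = F.one_hot(labels, num_classes).float()
    return (1.0 - alpha) * q + alpha * (1.0 - q) / (num_classes - 1)

def label_smoothing_loss(logits, labels, alpha=0.05):
    """Cross-entropy against the smoothed target distribution."""
    s = smoothed_targets(labels, logits.shape[-1], alpha)
    return -(s * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

def brier_loss(logits, labels):
    """Squared error between the predicted softmax vector and the one-hot encoding."""
    p = F.softmax(logits, dim=-1)
    q = F.one_hot(labels, logits.shape[-1]).float()
    return ((p - q) ** 2).sum(dim=-1).mean()
```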

Focal Loss (Sample-Dependent γ 5, 3): As mentioned in §4, we use the sample-dependent schedule FLSD-53: γ = 5 for $p_y \in [0, 0.2)$, and γ = 3 for $p_y \in [0.2, 1]$, which we find to consistently perform well across all the classification datasets and network architectures we experiment on.

In addition to the aforementioned sample-dependent γ approach, we also train other focal loss baselines: models trained on focal loss with γ fixed to 1, 2 and 3, and, as a simplification of the sample-dependent γ approach, models trained using a training epoch-dependent schedule for γ. We describe these in more detail and report the results in Appendix E.

Temperature Scaling: We compute the optimal temperature using two different approaches: (a) learning the optimal temperature by optimising NLL over a validation set, and (b) performing grid search over temperature values between 0 and 10, with a step of 0.1, and choosing the temperature that minimises the validation set ECE. We find the second approach to produce stronger baselines. Since we report ECE and AdaECE as the performance metrics and grid search does not require a differentiable objective function, we directly minimise ECE over the validation set during grid search.
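As a rough sketch of approach (b), assuming held-out validation logits and labels as CPU tensors and reusing an ECE helper such as the one sketched in §2, the grid search might look like this (illustrative code, not the authors' implementation):

```python
import numpy as np
import torch.nn.functional as F

def grid_search_temperature(val_logits, val_labels, ece_fn, n_bins=15):
    """Pick T in (0, 10] (step 0.1) that minimises ECE on the validation set."""
    best_t, best_ece = 1.0, float("inf")
    for t in np.arange(0.1, 10.01, 0.1):
        probs = F.softmax(val_logits / t, dim=-1)
        conf, pred = probs.max(dim=-1)
        e = ece_fn(conf.numpy(), pred.numpy(), val_labels.numpy(), n_bins)
        if e < best_ece:
            best_t, best_ece = float(t), e
    return best_t
```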

Performance Gains: We report the optimal temperatures and their corresponding ECE% and AdaECE% (computed using 15 bins) in Tables 1 and 2. Full results (ECE, AdaECE, MCE, NLL and test error) for all approaches are reported in Appendix E.

Firstly, for all dataset-network pairs, we obtain state-of-the-art classification accuracies (shown in Table 4 in the appendix).


Dataset | Model | Cross-Entropy Pre T / Post T | Brier Loss Pre T / Post T | MMCE Pre T / Post T | LS-0.05 Pre T / Post T | FLSD-53 (Ours) Pre T / Post T
CIFAR-10 | ResNet 50 | 4.35 / 1.35 (2.5) | 1.82 / 1.08 (1.1) | 4.56 / 1.19 (2.6) | 2.96 / 1.67 (0.9) | 1.55 / 0.95 (1.1)
CIFAR-10 | ResNet 110 | 4.41 / 1.09 (2.8) | 2.56 / 1.25 (1.2) | 5.08 / 1.42 (2.8) | 2.09 / 2.09 (1) | 1.87 / 1.07 (1.1)
CIFAR-10 | Wide ResNet 26-10 | 3.23 / 0.92 (2.2) | 1.25 / 1.25 (1) | 3.29 / 0.86 (2.2) | 4.26 / 1.84 (0.8) | 1.56 / 0.84 (0.9)
CIFAR-10 | DenseNet 121 | 4.52 / 1.31 (2.4) | 1.53 / 1.53 (1) | 5.1 / 1.61 (2.5) | 1.88 / 1.82 (0.9) | 1.22 / 1.22 (1)
CIFAR-100 | ResNet 50 | 17.52 / 3.42 (2.1) | 6.52 / 3.64 (1.1) | 15.32 / 2.38 (1.8) | 7.81 / 4.01 (1.1) | 4.5 / 2.0 (1.1)
CIFAR-100 | ResNet 110 | 19.05 / 4.43 (2.3) | 7.88 / 4.65 (1.2) | 19.14 / 3.86 (2.3) | 11.02 / 5.89 (1.1) | 8.56 / 4.12 (1.2)
CIFAR-100 | Wide ResNet 26-10 | 15.33 / 2.88 (2.2) | 4.31 / 2.7 (1.1) | 13.17 / 4.37 (1.9) | 4.84 / 4.84 (1) | 3.03 / 1.64 (1.1)
CIFAR-100 | DenseNet 121 | 20.98 / 4.27 (2.3) | 5.17 / 2.29 (1.1) | 19.13 / 3.06 (2.1) | 12.89 / 7.52 (1.2) | 3.73 / 1.31 (1.1)
Tiny-ImageNet | ResNet 50 | 15.32 / 5.48 (1.4) | 4.44 / 4.13 (0.9) | 13.01 / 5.55 (1.3) | 15.23 / 6.51 (0.7) | 1.76 / 1.76 (1)
20 Newsgroups | Global Pooling CNN | 17.92 / 2.39 (3.4) | 13.58 / 3.22 (2.3) | 15.48 / 6.78 (2.2) | 4.79 / 2.54 (1.1) | 6.92 / 2.19 (1.5)
SST Binary | Tree LSTM | 7.37 / 2.62 (1.8) | 9.01 / 2.79 (2.5) | 5.03 / 4.02 (1.5) | 4.84 / 4.11 (1.2) | 9.19 / 1.83 (0.7)

Table 1: ECE (%) computed for different approaches, both pre and post temperature scaling (cross-validating T on ECE). The optimal temperature for each method is indicated in brackets.

Dataset | Model | Cross-Entropy Pre T / Post T | Brier Loss Pre T / Post T | MMCE Pre T / Post T | LS-0.05 Pre T / Post T | FLSD-53 (Ours) Pre T / Post T
CIFAR-10 | ResNet 50 | 4.33 / 2.14 (2.5) | 1.74 / 1.23 (1.1) | 4.55 / 2.16 (2.6) | 3.89 / 2.92 (0.9) | 1.56 / 1.26 (1.1)
CIFAR-10 | ResNet 110 | 4.4 / 1.99 (2.8) | 2.6 / 1.7 (1.2) | 5.06 / 2.52 (2.8) | 4.44 / 4.44 (1) | 2.07 / 1.67 (1.1)
CIFAR-10 | Wide ResNet 26-10 | 3.23 / 1.69 (2.2) | 1.7 / 1.7 (1) | 3.29 / 1.6 (2.2) | 4.27 / 2.44 (0.8) | 1.52 / 1.38 (0.9)
CIFAR-10 | DenseNet 121 | 4.51 / 2.13 (2.4) | 2.03 / 2.03 (1) | 5.1 / 2.29 (2.5) | 4.42 / 3.33 (0.9) | 1.42 / 1.42 (1)
CIFAR-100 | ResNet 50 | 17.52 / 3.42 (2.1) | 6.52 / 3.64 (1.1) | 15.32 / 2.38 (1.8) | 7.81 / 4.01 (1.1) | 4.5 / 2.0 (1.1)
CIFAR-100 | ResNet 110 | 19.05 / 5.86 (2.3) | 7.73 / 4.53 (1.2) | 19.14 / 4.85 (2.3) | 11.12 / 8.59 (1.1) | 8.55 / 3.96 (1.2)
CIFAR-100 | Wide ResNet 26-10 | 15.33 / 2.89 (2.2) | 4.22 / 2.81 (1.1) | 13.16 / 4.25 (1.9) | 5.1 / 5.1 (1) | 2.75 / 1.63 (1.1)
CIFAR-100 | DenseNet 121 | 20.98 / 5.09 (2.3) | 5.04 / 2.56 (1.1) | 19.13 / 3.07 (2.1) | 12.83 / 8.92 (1.2) | 3.55 / 1.24 (1.1)
Tiny-ImageNet | ResNet 50 | 15.23 / 5.41 (1.4) | 4.37 / 4.07 (0.9) | 13.0 / 5.56 (1.3) | 15.28 / 6.29 (0.7) | 1.42 / 1.42 (1)
20 Newsgroups | Global Pooling CNN | 17.91 / 2.23 (3.4) | 13.57 / 3.11 (2.3) | 15.21 / 6.47 (2.2) | 4.39 / 2.63 (1.1) | 6.92 / 2.35 (1.5)
SST Binary | Tree LSTM | 7.27 / 3.39 (1.8) | 8.12 / 2.84 (2.5) | 5.01 / 4.32 (1.5) | 5.14 / 4.23 (1.2) | 9.15 / 1.92 (0.7)

Table 2: Adaptive ECE (%) computed for different approaches, both pre and post temperature scaling (cross-validating T on ECE). The optimal temperature for each method is indicated in brackets.

Metric | CE (α = 0.0) | LS (α = 0.1) | FL (γ = 1.0)
ECE% Pre T / Post T / T | 10.16 / 2.59 / 1.2 | 3.25 / 3.25 / 1.0 | 1.69 / 1.69 / 1.0
BLEU Pre T / Post T | 26.31 / 26.21 | 26.33 / 26.33 | 26.39 / 26.39

Table 3: Test set ECE and BLEU score, both pre and post temperature scaling, for cross-entropy (CE) with hard targets, cross-entropy with label smoothing (LS) (α = 0.1) and focal loss (FL) (γ = 1).

It is clear from Tables 1 and 2 that focal loss with sample-dependent γ outperforms all the baselines: cross-entropy, label smoothing, Brier score and MMCE. It broadly produces the lowest ECE and AdaECE values both before and after temperature scaling. This observation is particularly encouraging, as it indicates that a principled method of obtaining values of γ for focal loss can work well. Furthermore, Tables 5 and 6 in the appendix show that other focal loss based approaches are also very competitive. Finally, we observe that there are cases where ECE might be low, implying that the model is well calibrated, whereas AdaECE evaluated on the same model might be high. For example, in the case of Wide-ResNet on CIFAR-10 for cross-entropy, the best ECE obtained after temperature scaling is 0.92, whereas the AdaECE of the same model at the same temperature is 1.69.

Confident and Calibrated Models: It is interesting to note that for focal loss with sample-based γ (see Tables 1 and 2), the optimal temperatures are very close to 1, mostly lying between 0.9 and 1.1. This property is shown by the Brier score and label smoothing models as well. By contrast, the optimal temperatures for the baselines (cross-entropy with hard targets and MMCE) are significantly higher, with values lying between 2.0 and 2.8. An optimal temperature close to 1 indicates that the model is innately calibrated and cannot be made significantly more calibrated by temperature scaling. Furthermore, an optimal temperature that is much greater than 1 can make the network underconfident in general, as its outputs are temperature-scaled irrespective of their correctness. We provide additional experimental and qualitative results to support this claim in Appendix F.

Figure 5: Changes in test set BLEU score and validation set ECE with temperature, for models trained using (a) cross-entropy with hard targets (CE), (b) cross-entropy with label smoothing (LS) (α = 0.1), and (c) focal loss (FL) (γ = 1).

5.2 Machine Translation: A Downstream Task

In order to observe the performance of focal loss on a downstream task, where a calibrated model can potentially improve the performance on the task, we conduct an experiment on machine translation with beam search. Following the setup described in Müller et al. [2019], we train the Transformer architecture [Vaswani et al., 2017] on the standard WMT 2014 English-to-German translation dataset. The settings used for training (optimiser, learning rate schedule, number of training iterations, etc.) are exactly the same as those mentioned in Vaswani et al. [2017]. The intuition behind choosing machine translation as the downstream task lies in the fact that in translation, the softmax vectors produced by the transformer model are directly fed into the beam search algorithm, and hence softmax vectors from a more calibrated model should ideally produce better translations and a better BLEU score.

We train three transformer models: one on cross-entropy with hard target labels, the second on cross-entropy with label smoothing (with smoothing factor α = 0.1), and the third on focal loss with γ = 1. In order to compare these models in terms of calibration, we report the test set ECE (%) both before and after temperature scaling in the first row of Table 3. Furthermore, to evaluate their performance on the English-to-German translation task, we also report the test set BLEU scores of these models in the second row of Table 3. Finally, to study the variation of test set BLEU score and validation set ECE with temperature, we plot them against temperature for all three models in Figure 5.

It is clear from Table 3 that the model trained on focal loss outperforms its competitors on both ECE and BLEU score. The focal loss model also has an optimal temperature of 1, just like the model trained on cross-entropy with label smoothing. From Figure 5, we can see that the models obtain their highest BLEU scores at around the same temperatures at which they obtain low ECEs, thereby confirming our initial notion that a more calibrated model provides better translations. However, since the optimal temperatures are tuned on the validation set, they do not always correspond to the best BLEU scores on the test set. On the test set, the highest BLEU scores we observe are 26.33 for cross-entropy, 26.36 for cross-entropy with label smoothing, and 26.39 for focal loss. Thus, focal loss obtains both the lowest ECE and the highest BLEU.

6 Conclusion

In this paper, we have shown that training using focal loss can yield multi-class classification networks that are more naturally calibrated than those trained using the more conventional cross-entropy loss. There are sound theoretical reasons to expect this: in particular, as we show in §4, focal loss implicitly maximises entropy while minimising the KL divergence between the predicted and the target distributions. We also show that, because of its design, it naturally regularises the weights of a network during training, reducing NLL overfitting and thereby improving calibration. Extensive experiments on a variety of computer vision (CIFAR-10/100/Tiny-ImageNet) and NLP (20 Newsgroups/SST) datasets, with a wide variety of different network architectures, show that this expectation is also borne out in practice. Our results show that in almost all cases, networks trained with focal loss are more calibrated than those trained with cross-entropy loss, label smoothing, Brier score and MMCE, whilst having similar levels of accuracy, making their predictions much easier for downstream components to trust. Finally, we verify this by showing the superior performance of focal loss on the downstream task of English-to-German translation.

7 Acknowledgements

This work was started whilst J. Mukhoti was at FiveAI, and completed after he moved to the University of Oxford. V. Kulharia is wholly funded by a Toyota Research Institute grant. A. Sanyal acknowledges support from The Alan Turing Institute under the Turing Doctoral Studentship grant TU/C/000023. This work was also supported by ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1, EPSRC/MURI grant EP/N019474/1, and the Royal Academy of Engineering.

References

Glenn W Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 1950. 2, 8

Chensi Cao, Feng Liu, Hai Tan, Deshou Song, Wenjie Shu, Weizhong Li, Yiming Zhou, Xiaochen Bo, and Zhi Xie. Deep Learning and Its Applications in Biomedicine. Genomics, Proteomics & Bioinformatics, 16(1):17–32, 2018. 1

Hongming Chen, Ola Engkvist, Yinhai Wang, Marcus Olivecrona, and Thomas Blaschke. The rise of deep learning in drugdiscovery. Drug Discovery Today, 23(36):1241–1250, 2018. 1

Robert M Corless, Gaston H Gonnet, David EG Hare, David J Jeffrey, and Donald E Knuth. On the Lambert W function. Advances in Computational Mathematics, 5(1):329–359, 1996. 7, 15, 16

Morris H DeGroot and Stephen E Fienberg. The comparison and evaluation of forecasters. Journal of the Royal StatisticalSociety: Series D (The Statistician), 1983. 8

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database.In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 8, 17

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On Calibration of Modern Neural Networks. In ICML, 2017. 1,2, 4, 6

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 8

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. InCVPR, 2017. 8

Andreas Kamilaris and Francesc X Prenafeta-Boldú. Deep learning in agriculture: A survey. Computers and Electronics inAgriculture, 147:70–90, 2018. 1

Justin Ker, Lipo Wang, Jai Rao, and Tchoyoson Lim. Deep Learning Applications in Medical Image Analysis. IEEE Access,6:9375–9389, 2018. 1

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009. 8, 17

Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable Calibration Measures For Neural Networks From Kernel MeanEmbeddings. In ICML, 2018. 2, 8, 19

Ken Lang. NewsWeeder: Learning to Filter Netnews. In Machine Learning Proceedings 1995, pages 331–339. Elsevier,1995. 8, 17

Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. In ICLR, 2014. 8, 17

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal Loss for Dense Object Detection. In ICCV,2017. 1, 2, 5


Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help? In Advances in NeuralInformation Processing Systems, pages 4696–4705, 2019. 2, 8, 10

Mahdi Pakdaman Naeini and Gregory F Cooper. Binary Classifier Calibration using an Ensemble of Near Isotonic RegressionModels. In ICDM, 2016. 1

Mahdi Pakdaman Naeini, Gregory F Cooper, and Milos Hauskrecht. Obtaining Well Calibrated Probabilities Using BayesianBinning. In AAAI, 2015. 1, 3

Ng. 20 Newsgroups. https://github.com/aviralkumar2907/MMCE/, 2018. Accessed: 2019-05-22. 17

Alexandru Niculescu-Mizil and Rich Caruana. Predicting Good Probabilities With Supervised Learning. In ICML, 2005. 3

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. InEmpirical Methods in Natural Language Processing (EMNLP), 2014. 18

Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks bypenalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017. 5

John Platt. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods.Advances in Large Margin Classifiers, 10(3):61–74, 1999. 1

PyTorch-CIFAR. Train CIFAR10 with PyTorch, 2019. Available online at https://github.com/kuangliu/pytorch-cifar (as of 22nd May 2019). 4

Jasper Snoek, Yaniv Ovadia, Emily Fertig, Balaji Lakshminarayanan, Sebastian Nowozin, D Sculley, Joshua Dillon, Jie Ren,and Zachary Nado. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. InAdvances in Neural Information Processing Systems, pages 13969–13980, 2019. 8

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts.Recursive deep models for semantic compositionality over a sentiment treebank. In Empirical Methods in NaturalLanguage Processing (EMNLP), 2013. 8, 17

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured longshort-term memory networks. In Association for Computational Linguistics (ACL), 2015. 8, 18

TreeLSTM. TreeLSTM. https://github.com/ttpro1995/TreeLSTMSentiment/, 2015. Accessed: 2019-05-22. 18

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and IlliaPolosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017. 10

Jinjiang Wang, Yulin Ma, Laibin Zhang, Robert X Gao, and Dazhong Wu. Deep learning for smart manufacturing: Methodsand applications. JMSY, 48:144–156, 2018. 1

Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesianclassifiers. In ICML, 2001. 1

Bianca Zadrozny and Charles Elkan. Transforming Classifier Scores into Accurate Multiclass Probability Estimates. In SIGKDD, 2002. 1

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proceedings of the British Machine Vision Conference(BMVC), 2016. 8


Appendix

In §A, we discuss the relation between focal loss and a regularised KL divergence where the regulariser is the entropy of the predicted distribution. In §B, we discuss the regularisation effect of focal loss in a simple setup, i.e. a generalised linear model trained on a simple data distribution. In §C, we give the proofs of the two propositions formulated in the main text. We then describe all the datasets and implementation details for our experiments in §D. In §E, we discuss additional approaches for training using focal loss, along with the results we obtain from these approaches. We also provide Top-5 accuracies of several models to possibly hint at their calibration. We further provide results from evaluating our models using various metrics other than ECE and AdaECE (such as MCE and NLL). In §F, we provide empirical and qualitative results to show that models trained using focal loss are calibrated while maintaining their confidence on correct predictions. Finally, in §G, we briefly extend our discussion of Figure 3(e) in the main paper with a plot of the L2 norms of features obtained from the last ResNet block during training.

A Relation between Focal Loss and Entropy Regularised KL Divergence

Here we show why focal loss favours accurate but relatively less confident solutions. We show that it inherently provides a trade-off between minimising the KL divergence and maximising the entropy, depending on the strength of γ. We use $\mathcal{L}_f$ and $\mathcal{L}_c$ to denote the focal loss with parameter γ and the cross entropy between p and q, respectively. K denotes the number of classes and $q_y$ denotes the ground-truth probability assigned to the y-th class (similarly for $p_y$). We consider the following simple extension of focal loss:

$$
\begin{aligned}
\mathcal{L}_f &= -\sum_{y=1}^{K} (1 - p_y)^{\gamma} q_y \log p_y \\
&\geq -\sum_{y=1}^{K} (1 - \gamma p_y) q_y \log p_y && \text{by Bernoulli's inequality, } \forall \gamma \geq 1 \text{; note } p_y \in [0, 1] \\
&= -\sum_{y=1}^{K} q_y \log p_y - \gamma \left| \sum_{y=1}^{K} q_y p_y \log p_y \right| && \forall y, \; \log p_y \leq 0 \\
&\geq -\sum_{y=1}^{K} q_y \log p_y - \gamma \max_j q_j \sum_{y=1}^{K} |p_y \log p_y| && \text{by Hölder's inequality } \|fg\|_1 \leq \|f\|_{\infty} \|g\|_1 \\
&\geq -\sum_{y=1}^{K} q_y \log p_y + \gamma \sum_{y=1}^{K} p_y \log p_y && \forall j, \; q_j \in [0, 1] \\
&= \mathcal{L}_c - \gamma \mathbb{H}[p].
\end{aligned}
$$

We know that $\mathcal{L}_c = \mathrm{KL}(q \| p) + \mathbb{H}[q]$; thus, combining this equality with the above inequality leads to:

$$\mathcal{L}_f \geq \mathrm{KL}(q \| p) + \underbrace{\mathbb{H}[q]}_{\text{constant}} - \gamma \mathbb{H}[p].$$

In the case of a one-hot encoding (a delta distribution for q), focal loss would maximise $-p_y \log p_y$ (let y be the ground-truth class index), the component of the entropy of p corresponding to the ground-truth index. Thus, it would prefer learning p such that $p_y$ is assigned a higher value (because of the KL term) but not too high a value (because of the entropy term), which eventually avoids preferring overconfident models (as opposed to the cross-entropy loss). Experimentally, we found the solution of the cross entropy and focal loss equations, i.e. the value of the predicted probability p which minimises the loss, for various values of q in a binary classification problem (i.e. K = 2), and plotted it in Figure 6. As expected, focal loss favours a more entropic solution p that is closer to 0.5. In other words, as Figure 6 shows, solutions to focal loss (Eqn 5) will always have higher entropy than those of cross entropy, depending on the value of γ.

Figure 6: Optimal p for various values of q.

$$p = \arg\min_{x} \; -(1 - x)^{\gamma} q \log x - x^{\gamma} (1 - q) \log(1 - x), \qquad 0 \leq x \leq 1 \qquad (5)$$
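Under the assumptions of this binary example, the minimiser of Equation (5) can be found numerically; the sketch below does so by a simple grid search (γ = 0 recovers the cross-entropy solution p = q, while larger γ pulls the solution towards 0.5, as in Figure 6).

```python
import numpy as np

def optimal_p(q, gamma, grid=np.linspace(1e-4, 1 - 1e-4, 100000)):
    """Minimise Eq. (5) over x in (0, 1) by grid search for the binary (K = 2) case."""
    loss = (-(1 - grid) ** gamma * q * np.log(grid)
            - grid ** gamma * (1 - q) * np.log(1 - grid))
    return grid[np.argmin(loss)]

for q in (0.6, 0.8, 1.0):
    print(q, optimal_p(q, gamma=0.0), optimal_p(q, gamma=2.0))  # gamma = 0 is cross-entropy
```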

B Focal Loss and Cross-Entropy on a Linear Model

Figure 7: (a) Norm of the logits and (b) norm of the weights over training iterations, for cross entropy (CE) and focal loss (FL).

The behaviour of deep neural networks is generally quite different from that of linear models, and the problem of calibration is more pronounced for deep neural networks, which is why the paper focuses on analysing the calibration of deep networks. However, to understand the effect of focal loss in a simpler setting, we also conducted experiments on a generalised linear model using a simple data distribution.

Setup We consider a binary classification problem. The data matrix X ∈ R^{2×N} is created by assigning each class two normally distributed clusters whose means are linearly separable. The cluster means are situated on the vertices of a two-dimensional hypercube of side length 4. The standard deviation of each cluster is 1, and the samples within each cluster are randomly linearly combined in order to add covariance. Furthermore, the labels of 10% of the data points are flipped. 4,000 samples are used for training and 1,000 samples for testing. The model is a simple 2-parameter logistic regression model, f_w(x) = σ(w1 x1 + w2 x2). We train this model using both cross-entropy and focal loss with γ = 1.
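The following is a minimal PyTorch sketch of this setup. The particular assignment of cluster means to classes, the random mixing matrix used to add covariance, and the number of gradient-descent iterations are our assumptions, so the exact numbers will differ from the runs behind Figures 7 and 8; the point of interest is the qualitative gap in weight norm between cross-entropy and focal loss.

```python
import numpy as np
import torch

def make_data(n, rng):
    # Two Gaussian clusters (std 1) per class; means on the vertices of a 2-D
    # hypercube of side 4, assigned so that the two classes are linearly separable.
    means = {0: [(-2.0, -2.0), (-2.0, 2.0)], 1: [(2.0, -2.0), (2.0, 2.0)]}
    X, y = [], []
    for label, centres in means.items():
        for centre in centres:
            dev = rng.normal(0.0, 1.0, size=(n // 4, 2))
            dev = dev @ rng.uniform(-1.0, 1.0, size=(2, 2))  # random mixing to add covariance
            X.append(dev + np.array(centre))
            y.append(np.full(n // 4, label, dtype=float))
    X, y = np.concatenate(X), np.concatenate(y)
    flip = rng.random(len(y)) < 0.1                          # flip 10% of the labels
    y[flip] = 1.0 - y[flip]
    return (torch.tensor(X, dtype=torch.float32),
            torch.tensor(y, dtype=torch.float32))

def train(X, y, gamma=None, iters=20000, lr=0.1):
    # 2-parameter logistic regression f_w(x) = sigmoid(w1*x1 + w2*x2), trained by
    # full-batch gradient descent on cross-entropy (gamma=None) or focal loss.
    w = torch.zeros(2, requires_grad=True)
    for _ in range(iters):
        p = torch.sigmoid(X @ w)
        p_true = torch.where(y > 0.5, p, 1.0 - p)            # probability of the true class
        loss = -torch.log(p_true + 1e-12)
        if gamma is not None:
            loss = (1.0 - p_true) ** gamma * loss            # focal modulation
        loss = loss.mean()
        loss.backward()
        with torch.no_grad():
            w -= lr * w.grad
        w.grad.zero_()
    return w.detach()

rng = np.random.default_rng(0)
X, y = make_data(4000, rng)
w_ce, w_fl = train(X, y), train(X, y, gamma=1.0)
print(f"||w|| cross-entropy: {w_ce.norm().item():.2f}   ||w|| focal (gamma=1): {w_fl.norm().item():.2f}")
```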

Weight Magnification We have argued that focal loss implicitly regularises the weights of the model by providing smaller gradients than cross-entropy. This helps calibration because, if all the weights are large, the logits are large, and thus the confidence of the network is high on all test points, even on the misclassified ones: when the model misclassifies, it misclassifies with high confidence. Figure 7 shows, for a generalised linear model, that the norms of the logits and of the weights blow up for cross entropy compared to focal loss.


High Confidence for mistakes Figures 8 (b) and (c) show that running gradient descent with cross-entropy (CE) and with focal loss (FL) gives the same decision regions, i.e. the weight vector points in the same region for both FL and CE. However, since the norm of the weights is much larger for CE than for FL, we would expect the confidence on misclassified test points to be larger for CE than for FL. Figure 8(a) plots a histogram of the confidence of the misclassified points: CE misclassifies almost always with greater than 90% confidence, whereas FL misclassifies with much lower confidence.

Figure 8: (a) Confidence of misclassifications (density histogram for CE and FL); (b) decision boundary of the linear classifier trained using cross entropy; (c) decision boundary of the linear classifier trained using focal loss.

C Proofs

Here we provide the proofs of both propositions presented in the main text. While Proposition 1 helps us understand the regularization effect of focal loss, Proposition 2 provides the γ values in a principled, sample-dependent way. Implementing the sample-dependent γ is straightforward, as implementations of the Lambert-W function [Corless et al., 1996] are available in standard libraries (e.g. Python's scipy).

Proposition 1. For focal loss Lf and cross-entropy Lc, the gradients satisfy ∂Lf/∂w = (∂Lc/∂w) g(p_{i,y_i}, γ), where g(p, γ) = (1 − p)^γ − γp(1 − p)^{γ−1} log(p), γ ∈ R+ is the focal loss hyperparameter, and w denotes the parameters of the last linear layer. Thus ‖∂Lf/∂w‖ ≤ ‖∂Lc/∂w‖ if g(p_{i,y_i}, γ) ∈ [0, 1].

Proof. Let w be the parameters of the linear layer connecting the feature map to the logit s. Then, using the chain rule, ∂Lf/∂w = (∂s/∂w)(∂p_{i,y_i}/∂s)(∂Lf/∂p_{i,y_i}). Similarly, ∂Lc/∂w = (∂s/∂w)(∂p_{i,y_i}/∂s)(∂Lc/∂p_{i,y_i}). The derivative of the focal loss with respect to p_{i,y_i}, the softmax output of the network for the true class yi, takes the form

$$
\frac{\partial L_f}{\partial p_{i,y_i}} = -\frac{1}{p_{i,y_i}} \Big( (1 - p_{i,y_i})^{\gamma} - \gamma p_{i,y_i} (1 - p_{i,y_i})^{\gamma - 1} \log p_{i,y_i} \Big) = \frac{\partial L_c}{\partial p_{i,y_i}}\, g(p_{i,y_i}, \gamma),
$$

in which g(p_{i,y_i}, γ) = (1 − p_{i,y_i})^γ − γ p_{i,y_i} (1 − p_{i,y_i})^{γ−1} log(p_{i,y_i}) and ∂Lc/∂p_{i,y_i} = −1/p_{i,y_i}. It is thus straightforward to verify that if g(p_{i,y_i}, γ) ∈ [0, 1], then ‖∂Lf/∂p_{i,y_i}‖ ≤ ‖∂Lc/∂p_{i,y_i}‖, which in turn implies that ‖∂Lf/∂w‖ ≤ ‖∂Lc/∂w‖.

Proposition 2. Given a p0, for 1 ≥ p ≥ p0 > 0, g(p, γ) ≤ 1 for all γ ≥ γ∗ = a/b + (1/log a) W−1(−(a^{(1−a/b)}/b) log a), where a = 1 − p0, b = p0 log p0, and W−1 is the negative branch of the Lambert-W function [Corless et al., 1996]. Moreover, for p ≥ p0 > 0 and γ ≥ γ∗, the equality g(p, γ) = 1 holds only for p = p0 and γ = γ∗.


Proof. We derive the value of γ > 0 for which g(p0, γ) = 1 for a given p0 ∈ [0, 1]. From Proposition 1, we already know that

$$
\frac{\partial L_f}{\partial p_{i,y_i}} = \frac{\partial L_c}{\partial p_{i,y_i}}\, g(p_{i,y_i}, \gamma), \qquad (6)
$$

where Lf is the focal loss, Lc is the cross-entropy loss, p_{i,y_i} is the probability assigned by the model to the ground-truth class of the i-th sample, and

$$
g(p_{i,y_i}, \gamma) = (1 - p_{i,y_i})^{\gamma} - \gamma p_{i,y_i} (1 - p_{i,y_i})^{\gamma - 1} \log p_{i,y_i}. \qquad (7)
$$

For p ∈ [0, 1], if we look at the function g(p, γ), we can clearly see that g(p, γ) → 1 as p → 0, and that g(p, γ) = 0 when p = 1. To observe the behaviour of g(p, γ) for intermediate values of p, we first take its derivative with respect to p:

$$
\frac{\partial g(p, \gamma)}{\partial p} = \gamma (1 - p)^{\gamma - 2} \Big[ -2(1 - p) - (1 - p)\log p + (\gamma - 1)\, p \log p \Big] \qquad (8)
$$

In Equation 8, γ(1 − p)^{γ−2} > 0 except when p = 1 (in which case γ(1 − p)^{γ−2} = 0). Thus, to observe the sign of the gradient ∂g(p, γ)/∂p, we focus on the term

$$
-2(1 - p) - (1 - p)\log p + (\gamma - 1)\, p \log p. \qquad (9)
$$

Dividing Equation 9 by (− log p), the sign remains unchanged and we get

$$
k(p, \gamma) = \frac{2(1 - p)}{\log p} + 1 - \gamma p. \qquad (10)
$$

We can see that k(p, γ) → 1 as p → 0 and k(p, γ) → −(1 + γ) as p → 1 (using l'Hôpital's rule). Furthermore, k(p, γ) is monotonically decreasing for p ∈ [0, 1]. Thus, as the gradient ∂g(p, γ)/∂p monotonically decreases from a positive value at p = 0 to a negative value at p = 1, g(p, γ) first monotonically increases, starting from 1 (as p → 0), and then monotonically decreases down to 0 (at p = 1). Thus, if for some threshold p0 > 0 and some γ > 0 we have g(p0, γ) = 1, then ∀p > p0, g(p, γ) < 1. We now want to find a γ such that ∀p ≥ p0, g(p, γ) ≤ 1. First, let a = (1 − p0) and b = p0 log p0. Then:

$$
\begin{aligned}
& g(p_0, \gamma) = (1 - p_0)^{\gamma} - \gamma p_0 (1 - p_0)^{\gamma - 1} \log p_0 \leq 1 \\
\implies\ & (1 - p_0)^{\gamma - 1}\big[(1 - p_0) - \gamma p_0 \log p_0\big] \leq 1 \\
\implies\ & a^{\gamma - 1}(a - \gamma b) \leq 1 \\
\implies\ & (\gamma - 1)\log a + \log(a - \gamma b) \leq 0 \\
\implies\ & \Big(\gamma - \frac{a}{b}\Big)\log a + \log(a - \gamma b) \leq \Big(1 - \frac{a}{b}\Big)\log a \\
\implies\ & (a - \gamma b)\, e^{(\gamma - a/b)\log a} \leq a^{(1 - a/b)} \\
\implies\ & \Big(\gamma - \frac{a}{b}\Big) e^{(\gamma - a/b)\log a} \leq -\frac{a^{(1 - a/b)}}{b} \\
\implies\ & \Big(\Big(\gamma - \frac{a}{b}\Big)\log a\Big)\, e^{(\gamma - a/b)\log a} \geq -\frac{a^{(1 - a/b)}}{b}\log a
\end{aligned}
\qquad (11)
$$

where a = (1 − p0) and b = p0 log p0. We know that the inverse of y = xe^x is defined as x = W(y), where W is the Lambert-W function [Corless et al., 1996]. Furthermore, the r.h.s. of the inequality in Equation 11 is always negative, with a minimum possible value of −1/e that occurs at p0 = 0.5. Therefore, applying the Lambert-W function to the r.h.s. will yield two real solutions (corresponding to the principal branch, denoted by W0, and the negative branch, denoted by W−1). We first consider the solution corresponding to the negative branch


(which is the smaller of the two solutions):

$$
\Big(\gamma - \frac{a}{b}\Big)\log a \;\leq\; W_{-1}\!\left(-\frac{a^{(1 - a/b)}}{b}\log a\right)
\;\implies\;
\gamma \;\geq\; \frac{a}{b} + \frac{1}{\log a}\, W_{-1}\!\left(-\frac{a^{(1 - a/b)}}{b}\log a\right)
\qquad (12)
$$

If we consider the principal branch, the solution is

$$
\gamma \;\leq\; \frac{a}{b} + \frac{1}{\log a}\, W_{0}\!\left(-\frac{a^{(1 - a/b)}}{b}\log a\right),
\qquad (13)
$$

which yields a negative value for γ that we discard. Thus Equation 12 gives the values of γ for which, if p > p0, then g(p, γ) < 1. In other words, at γ = γ∗ we have g(p0, γ∗) = 1, and for any p < p0, g(p, γ∗) > 1.
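As noted at the start of this section, γ∗ is easy to compute in practice with the Lambert-W implementation in scipy. The following is a small illustrative sketch of Equation 12 (our own code, not the authors' released implementation).

```python
import numpy as np
from scipy.special import lambertw

def gamma_star(p0):
    # Threshold gamma* from Proposition 2 / Equation (12).
    a = 1.0 - p0
    b = p0 * np.log(p0)
    arg = -(a ** (1.0 - a / b) / b) * np.log(a)              # r.h.s. of Equation (11)
    return a / b + lambertw(arg, k=-1).real / np.log(a)      # negative (k = -1) branch

for p0 in (0.2, 0.25, 0.4):
    print(f"p0 = {p0:.2f}  ->  gamma* = {gamma_star(p0):.2f}")
# gamma* is roughly 5 at p0 = 0.2 and roughly 3 at p0 = 0.25, consistent with the
# sample-dependent policy FLSD-53 used in the experiments.
```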

D Dataset Description and Implementation Details

We use the following image and document classification datasets in our experiments:

1. CIFAR-10 [Krizhevsky, 2009]: This dataset has 60,000 colour images of size 32 × 32, divided equally into 10 classes. We use a train/validation/test split of 45,000/5,000/10,000 images.

2. CIFAR-100 [Krizhevsky, 2009]: This dataset has 60,000 colour images of size 32 × 32, divided equally into 100 classes. (Note that the images in this dataset are not the same images as in CIFAR-10.) We again use a train/validation/test split of 45,000/5,000/10,000 images.

3. Tiny-ImageNet [Deng et al., 2009]: Tiny-ImageNet is a subset of ImageNet with 64 × 64 images, 200 classes, 500 images per class in the training set and 50 images per class in the validation set. The image dimensions of Tiny-ImageNet are twice those of the CIFAR-10/100 images.

4. 20 Newsgroups [Lang, 1995]: This dataset contains 20,000 news articles, categorised evenly into 20 different newsgroups based on their content. It is a popular dataset for text classification. Whilst some of the newsgroups are very related (e.g. rec.motorcycles and rec.autos), others are quite unrelated (e.g. sci.space and misc.forsale). We use a train/validation/test split of 15,098/900/3,999 documents.

5. Stanford Sentiment Treebank (SST) [Socher et al., 2013]: This dataset contains movie reviews in the form of sentence parse trees, where each node is annotated by sentiment. We use the dataset version with binary labels, for which 6,920/872/1,821 documents are used as the training/validation/test split. In the training set, each node of a parse tree is annotated as positive, neutral or negative. At test time, the evaluation is done based on the model classification at the root node, i.e. considering the whole sentence, which contains only positive or negative sentiment.

For training networks on CIFAR-10 and CIFAR-100, we use SGD with a momentum of 0.9 as our optimiser, and train the networks for 350 epochs, with a learning rate of 0.1 for the first 150 epochs, 0.01 for the next 100 epochs, and 0.001 for the last 100 epochs. We use a training batch size of 128. Furthermore, we augment the training images by applying random crops and random horizontal flips. For Tiny-ImageNet, we train for 100 epochs with a learning rate of 0.1 for the first 40 epochs, 0.01 for the next 20 epochs and 0.001 for the last 40 epochs. We use a training batch size of 64. It should be noted that for Tiny-ImageNet, we saved 50 samples per class (i.e., a total of 10,000 samples) from the training set as our own validation set to fine-tune the temperature parameter (hence, we trained on 90,000 images), and we use the Tiny-ImageNet validation set as our test set.
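The following sketch shows how such a CIFAR training configuration looks in PyTorch. The crop padding of 4, the omitted per-channel normalisation and the use of torchvision's ImageNet-style resnet50 as a stand-in for the CIFAR-style architectures are simplifications of ours; the optimiser, schedule, batch size and augmentations follow the description above.

```python
import torch
import torchvision
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),          # padding value is our assumption
    T.RandomHorizontalFlip(),
    T.ToTensor(),                         # per-channel normalisation omitted for brevity
])
train_set = torchvision.datasets.CIFAR10('./data', train=True, download=True,
                                         transform=train_transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = torchvision.models.resnet50(num_classes=10)   # stand-in for the CIFAR-style ResNet-50
criterion = torch.nn.CrossEntropyLoss()               # a focal-loss criterion can be substituted here

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# lr = 0.1 for epochs 0-149, 0.01 for epochs 150-249, 0.001 for epochs 250-349
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 250], gamma=0.1)

for epoch in range(350):
    for images, labels in train_loader:
        optimizer.zero_grad()
        criterion(model(images), labels).backward()
        optimizer.step()
    scheduler.step()
```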

For 20 Newsgroups, we train the Global Pooling Convolutional Network [Lin et al., 2014] using the Adam optimiser, with learning rate 0.001, and betas 0.9 and 0.999. The code is a PyTorch adaptation of Ng. We used


Glove word embeddings [Pennington et al., 2014] to train the network. We trained all the models for 50 epochs and used the models with the best validation accuracy.

For the SST Binary dataset, we train the Tree-LSTM [Tai et al., 2015] using the AdaGrad optimiser with a learning rate of 0.05 and a weight decay of 10^{-4}, as suggested by the authors. We used the constituency model, which considers binary parse trees of the data and trains a binary tree LSTM on them. The Glove word embeddings [Pennington et al., 2014] were also tuned for best results. The code framework we used is inspired by TreeLSTM. We trained these models for 25 epochs and used the models with the best validation accuracy.

For all our models, we use the PyTorch framework, setting any hyperparameters not explicitly mentioned to the default values used in the standard models. For MMCE, we used λ = 2 for all the image-classification tasks, whilst we found λ = 8 to perform better for document classification. A calibrated model which does not generalise well to an unseen test set is not very useful. Hence, for all the experiments, we set the training parameters in a way such that we get state-of-the-art test set accuracies on all datasets for each model.
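For reference, a multi-class focal loss with a fixed γ can be written as a PyTorch criterion in a few lines. The following is our own illustrative sketch, not the implementation released with the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    def __init__(self, gamma: float = 3.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # log p_{i, y_i}: log-probability assigned to the ground-truth class of each sample
        log_p = F.log_softmax(logits, dim=-1)
        log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)
        pt = log_pt.exp()
        # L_f = -(1 - p_t)^gamma * log p_t, averaged over the batch
        return (-((1.0 - pt) ** self.gamma) * log_pt).mean()

# usage: criterion = FocalLoss(gamma=3.0); loss = criterion(model(x), y)
```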

E Additional Results

In addition to the sample-dependent γ approach, we try the following focal loss approaches as well:

Focal Loss (Fixed γ): We trained models on focal loss with γ fixed to 1, 2 and 3. We found γ = 3 to produce the best ECE among models trained using a fixed γ. This corroborates the observation we made in §4 of the main paper that γ = 3 should produce better results than γ = 1 or γ = 2, as the regularising effect for γ = 3 is higher.

Focal Loss (Scheduled γ): As a simplification to the sample-dependent γ approach, we also tried using a schedule for γ during training, as we expect the value of p_{i,y_i} to increase in general for all samples over time. In particular, we report results for two different schedules: (a) Focal Loss (scheduled γ 5,3,2): γ = 5 for the first 100 epochs, γ = 3 for the next 150 epochs, and γ = 2 for the last 100 epochs, and (b) Focal Loss (scheduled γ 5,3,1): γ = 5 for the first 100 epochs, γ = 3 for the next 150 epochs, and γ = 1 for the last 100 epochs. We also tried various other schedules, but found these two to produce the best results on the validation sets.

Finally, for the sample-dependent γ approach, we also found the policy Focal Loss (sample-dependent γ 5,3,2), with γ = 5 for p_{i,y_i} ∈ [0, 0.2), γ = 3 for p_{i,y_i} ∈ [0.2, 0.5) and γ = 2 for p_{i,y_i} ∈ [0.5, 1], to produce competitive results.
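Implementing a sample-dependent γ only requires selecting γ per sample from its predicted ground-truth probability. The sketch below (our own code) implements the 5,3,2 policy just described; the FLSD-53 variant is obtained by merging the last two brackets into γ = 3.

```python
import torch
import torch.nn.functional as F

def focal_loss_sample_dependent(logits, target,
                                thresholds=(0.2, 0.5), gammas=(5.0, 3.0, 2.0)):
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)   # log p_{i, y_i}
    pt = log_pt.exp()
    # Pick gamma per sample from the bracket that p_{i, y_i} falls into:
    # gamma = 5 for [0, 0.2), 3 for [0.2, 0.5), 2 for [0.5, 1].
    gamma = torch.full_like(pt, gammas[2])
    gamma = torch.where(pt < thresholds[1], torch.full_like(pt, gammas[1]), gamma)
    gamma = torch.where(pt < thresholds[0], torch.full_like(pt, gammas[0]), gamma)
    return (-((1.0 - pt) ** gamma) * log_pt).mean()
```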

In Table 4 we present the classification errors on the test datasets for all the major loss functions we considered. Moreover, we also report the classification errors for the different focal loss approaches in Table 7. We also report the ECE and Ada-ECE for all the focal loss approaches in Table 5 and Table 6.

Finally, calibrated models should have a higher logit score (or softmax probability) on the correct class even when they misclassify, as compared to models which are less calibrated. Thus, intuitively, such models should have a higher Top-5 accuracy. In Table 8, we report the Top-5 accuracies for all our models on datasets where the number of classes is relatively high (i.e., on CIFAR-100 with 100 classes and Tiny-ImageNet with 200 classes). We observe focal loss with sample-dependent γ to produce the highest Top-5 accuracies on all models trained on CIFAR-100 and the second-best Top-5 accuracy (only marginally below the highest) on Tiny-ImageNet.

In addition to ECE and Ada-ECE, we use various other metrics to compare the proposed methods with the baselines (i.e. cross-entropy, Brier score, MMCE and Label Smoothing). We present the test NLL % before and after temperature scaling in Tables 9 and 10, respectively. We report the test set MCE % before and after temperature scaling in Tables 11 and 12, respectively.
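For completeness, ECE and MCE can be computed from the same equal-width confidence binning; the sketch below is our own, and the choice of 15 bins is an assumption rather than the exact evaluation setting. Here `confidences` are the maximum softmax probabilities on the test set and `correct` the corresponding 0/1 indicators of whether each prediction matches the label.

```python
import numpy as np

def ece_mce(confidences, correct, n_bins=15):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap      # weighted by the fraction of samples in the bin
        mce = max(mce, gap)           # worst-calibrated bin
    return 100.0 * ece, 100.0 * mce   # reported as percentages, as in the tables
```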

We use the following abbreviations to report results on different varieties of focal loss. FL-1 refers to Focal Loss (fixed γ 1), FL-2 refers to Focal Loss (fixed γ 2), FL-3 refers to Focal Loss (fixed γ 3), FLSc-531 refers to Focal Loss (scheduled γ 5,3,1), FLSc-532 refers to Focal Loss (scheduled γ 5,3,2), FLSD-532 refers to Focal Loss (sample-dependent γ 5,3,2) and FLSD-53 refers to Focal Loss (sample-dependent γ 5,3).


Dataset Model Cross-Entropy Brier Loss MMCE LS-0.05 FL-3 FLSc-532 FLSD-53

CIFAR-10

ResNet 50 4.95 5.0 4.99 5.29 5.25 5.63 4.98
ResNet 110 4.89 5.48 5.4 5.52 5.08 5.71 5.42
Wide ResNet 26-10 3.86 4.08 3.91 4.2 4.13 4.46 4.01
DenseNet 121 5.0 5.11 5.41 5.09 5.33 5.65 5.46

CIFAR-100

ResNet 50 23.3 23.39 23.2 23.43 22.75 23.24 23.22
ResNet 110 22.73 25.1 23.07 23.43 22.92 22.96 22.51
Wide ResNet 26-10 20.7 20.59 20.73 21.19 19.69 20.13 20.11
DenseNet 121 24.52 23.75 24.0 24.05 23.25 23.72 22.67

Tiny Imagenet ResNet 50 49.81 53.2 51.31 47.12 49.69 49.83 49.06

20 Newsgroups Global Pooling CNN 26.68 27.06 27.23 26.03 29.26 28.16 27.98

SST Binary Tree LSTM 12.85 12.85 11.86 13.23 12.19 13.07 12.8

Table 4: Error (%) computed for different approaches. In this table, FL-3 denotes Focal Loss (fixed γ 3), FLSc-532 denotes Focal Loss (scheduled γ 5,3,2) and FLSD-53 denotes Focal Loss (sample-dependent γ 5,3), i.e. focal loss with sample-dependent γ, with γ = 5 for p_{i,y_i} ∈ [0, 0.2] and γ = 3 for p_{i,y_i} ∈ (0.2, 1].

Dataset Model FL-1 (Pre T, Post T) FL-2 (Pre T, Post T) FL-3 (Pre T, Post T) FLSc-531 (Pre T, Post T) FLSc-532 (Pre T, Post T) FLSD-532 (Pre T, Post T) FLSD-53 (Pre T, Post T)

CIFAR-10

ResNet 50 3.42 1.08(1.6) 2.36 0.91(1.2) 1.48 1.42(1.1) 4.06 1.53(1.6) 2.97 1.53(1.2) 2.52 0.88(1.3) 1.55 0.95(1.1)

ResNet 110 3.46 1.2(1.6) 2.7 0.89(1.3) 1.55 1.02(1.1) 4.92 1.5(1.7) 3.33 1.36(1.3) 2.82 0.97(1.3) 1.87 1.07(1.1)

Wide ResNet 26-10 2.69 1.46(1.3) 1.42 1.03(1.1) 1.69 0.97(0.9) 2.81 0.96(1.4) 1.82 1.45(1.1) 1.31 0.87(1.1) 1.56 0.84(0.9)

DenseNet 121 3.44 1.63(1.4) 1.93 1.04(1.1) 1.32 1.26(0.9) 4.12 1.65(1.5) 2.22 1.34(1.1) 2.45 1.31(1.2) 1.22 1.22(1)

CIFAR-100

ResNet 50 12.86 2.3(1.5) 8.61 2.24(1.3) 5.13 1.97(1.1) 11.63 2.09(1.4) 8.47 2.13(1.3) 9.09 1.61(1.3) 4.5 2.(1.1)

ResNet 110 15.08 4.55(1.5) 11.57 3.73(1.3) 8.64 3.95(1.2) 14.99 4.56(1.5) 11.2 3.43(1.3) 11.74 3.64(1.3) 8.56 4.12(1.2)

Wide ResNet 26-10 8.93 2.53(1.4) 4.64 2.93(1.2) 2.13 2.13(1) 9.36 2.48(1.4) 4.98 1.94(1.2) 4.98 2.55(1.2) 3.03 1.64(1.1)

DenseNet 121 14.24 2.8(1.5) 7.9 2.33(1.2) 4.15 1.25(1.1) 13.05 2.08(1.5) 7.63 1.96(1.2) 8.14 2.35(1.3) 3.73 1.31(1.1)

Tiny Imagenet ResNet 50 7.61 3.29(1.2) 3.02 3.02(1) 1.87 1.87(1) 7.77 3.07(1.2) 3.62 2.54(1.1) 2.81 2.57(1.1) 1.76 1.76(1)

20 Newsgroups Global Pooling CNN 15.06 2.14(2.6) 12.1 3.22(1.6) 8.67 3.51(1.5) 13.55 4.32(1.7) 12.13 2.47(1.8) 12.2 2.39(2) 6.92 2.19(1.5)

SST Binary Tree LSTM 6.78 3.29(1.6) 3.05 3.05(1) 16.05 1.78(0.5) 4.66 3.36(1.4) 3.91 2.64(0.9) 4.47 2.77(0.9) 9.19 1.83(0.7)

Table 5: ECE (%) computed for different focal loss approaches both pre and post temperature scaling (cross-validating T on ECE). Optimal temperature for each method is indicated in brackets.

Dataset Model FL-1 (Pre T, Post T) FL-2 (Pre T, Post T) FL-3 (Pre T, Post T) FLSc-531 (Pre T, Post T) FLSc-532 (Pre T, Post T) FLSD-532 (Pre T, Post T) FLSD-53 (Pre T, Post T)

CIFAR-10

ResNet 50 3.42 1.51(1.6) 2.37 1.69(1.2) 1.95 1.83(1.1) 4.06 2.43(1.6) 2.95 2.18(1.2) 2.5 1.23(1.3) 1.56 1.26(1.1)

ResNet 110 3.42 1.57(1.6) 2.69 1.29(1.3) 1.62 1.44(1.1) 4.91 2.61(1.7) 3.32 1.92(1.3) 2.78 1.58(1.3) 2.07 1.67(1.1)

Wide ResNet 26-10 2.7 1.71(1.3) 1.64 1.47(1.1) 1.84 1.54(0.9) 2.75 1.85(1.4) 2.04 1.9(1.1) 1.68 1.49(1.1) 1.52 1.38(0.9)

DenseNet 121 3.44 1.85(1.4) 1.8 1.39(1.1) 1.22 1.48(0.9) 4.11 2.2(1.5) 2.19 1.64(1.1) 2.44 1.6(1.2) 1.42 1.42(1)

CIFAR-100

ResNet 50 12.86 2.54(1.5) 8.55 2.44(1.3) 5.08 2.02(1.1) 11.58 2.01(1.4) 8.41 2.25(1.3) 9.08 1.94(1.3) 4.39 2.33(1.1)

ResNet 110 15.08 5.16(1.5) 11.57 4.46(1.3) 8.64 4.14(1.2) 14.98 4.97(1.5) 11.18 3.68(1.3) 11.74 4.21(1.3) 8.55 3.96(1.2)

Wide ResNet 26-10 8.93 2.74(1.4) 4.65 2.96(1.2) 2.08 2.08(1) 9.2 2.52(1.4) 5 2.11(1.2) 5 2.58(1.2) 2.75 1.63(1.1)

DenseNet 121 14.24 2.71(1.5) 7.9 2.36(1.2) 4.15 1.23(1.1) 13.01 2.18(1.5) 7.61 2.04(1.2) 8.04 2.1(1.3) 3.55 1.24(1.1)

Tiny Imagenet ResNet 50 7.56 2.95(1.2) 3.15 3.15(1) 1.88 1.88(1) 7.7 2.9(1.2) 3.76 2.4(1.1) 2.81 2.6(1.1) 1.42 1.42(1)

20 Newsgroups Global Pooling CNN 15.06 2.22(2.6) 12.1 3.33(1.6) 8.65 3.78(1.5) 13.55 4.58(1.7) 12.13 2.49(1.8) 12.19 2.37(2) 6.92 2.35(1.5)

SST Binary Tree LSTM 6.27 4.59(1.6) 3.69 3.69(1) 16.01 2.16(0.5) 4.43 3.57(1.4) 3.37 2.46(0.9) 4.42 2.96(0.9) 9.15 1.92(0.7)

Table 6: AdaECE (%) computed for different focal loss approaches both pre and post temperature scaling (cross-validating T on ECE). Optimal temperature for each method is indicated in brackets.

F Focal Loss is Confident and Calibrated

As an extension to what we present in Section 5 of the main paper, we also follow the approach adopted in Kumar et al. [2018], and measure the percentage of test samples that are predicted with a confidence of 0.99 or more (we call this set of test samples S99). In Table 13, we report |S99| as a percentage of the total number of test samples, along with the accuracy of the samples in S99, for ResNet-50 and ResNet-110 trained on CIFAR-10,


Dataset Model FL-1 FL-2 FL-3 FLSc-531 FLSc-532 FLSD-532 FLSD-53

CIFAR-10

ResNet 50 4.93 4.98 5.25 5.66 5.63 5.24 4.98

ResNet 110 4.78 5.06 5.08 6.13 5.71 5.19 5.42

Wide ResNet 26-10 4.27 4.27 4.13 4.11 4.46 4.14 4.01

DenseNet 121 5.09 4.84 5.33 5.46 5.65 5.46 5.46

CIFAR-100

ResNet 50 22.8 23.15 22.75 23.49 23.24 23.55 23.22

ResNet 110 22.36 22.53 22.92 22.81 22.96 22.93 22.51

Wide ResNet 26-10 19.61 20.01 19.69 20.13 20.13 19.71 20.11

DenseNet 121 23.82 23.19 23.25 23.69 23.72 22.41 22.67

Tiny Imagenet ResNet 50 50.06 47.7 49.69 50.49 49.83 48.95 49.06

20 Newsgroups Global Pooling CNN 26.13 28.23 29.26 29.16 28.16 27.26 27.98

SST Binary Tree LSTM 12.63 12.3 12.19 12.36 13.07 12.3 12.8

Table 7: Error (%) computed for different focal loss approaches.

Dataset Model Cross-Entropy (Top-1, Top-5) Brier Loss (Top-1, Top-5) MMCE (Top-1, Top-5) LS-0.05 (Top-1, Top-5) FLSD-53 (Ours) (Top-1, Top-5)

CIFAR-100

ResNet 50 76.7 93.77 76.61 93.24 76.8 93.69 76.57 92.86 76.78 94.44
ResNet 110 77.27 93.79 74.9 92.44 76.93 93.78 76.57 92.27 77.49 94.78
Wide ResNet 26-10 79.3 93.96 79.41 94.56 79.27 94.11 78.81 93.18 79.89 95.2
DenseNet 121 75.48 91.33 76.25 92.76 76 91.96 75.95 89.51 77.33 94.49

Tiny-ImageNet ResNet 50 50.19 74.24 46.8 70.34 48.69 73.52 52.88 76.15 50.94 76.07

Table 8: Top-1 and Top-5 accuracies computed for different approaches.

Dataset Model Cross-Entropy Brier Loss MMCE LS-0.05 FL-1 FL-2 FL-3 FLSc-531 FLSc-532 FLSD-532 FLSD-53

CIFAR-10

ResNet 50 41.21 18.67 44.83 27.68 22.67 18.6 18.43 25.32 20.5 18.69 17.55

ResNet 110 47.51 20.44 55.71 29.88 22.54 19.19 17.8 32.77 22.48 19.39 18.54

Wide ResNet 26-10 26.75 15.85 28.47 21.71 17.66 14.96 15.2 18.5 15.57 14.78 14.55

DenseNet 121 42.93 19.11 52.14 28.7 22.5 17.56 18.02 27.41 19.5 20.14 18.39

CIFAR-100

ResNet 50 153.67 99.63 125.28 121.02 105.61 92.82 87.52 100.09 92.66 94.1 88.03

ResNet 110 179.21 110.72 180.54 133.11 114.18 96.74 90.9 112.46 95.85 97.97 89.92

Wide ResNet 26-10 140.1 84.62 119.58 108.06 87.56 77.8 74.66 88.61 78.52 78.86 76.92

DenseNet 121 205.61 98.31 166.65 142.04 115.5 93.11 87.13 107.91 93.12 91.14 85.47

Tiny Imagenet ResNet 50 232.85 240.32 234.29 235.04 219.07 202.92 207.2 217.52 211.42 204.71 204.97

20 Newsgroups Global Pooling CNN 176.57 130.41 158.7 90.95 140.4 115.97 109.62 128.75 123.72 124.03 109.17

SST Binary Tree LSTM 50.2 54.96 37.28 44.34 53.9 47.72 50.29 50.25 53.13 45.08 49.23

Table 9: NLL (%) computed for different approaches pre temperature scaling.

using cross-entropy loss, MMCE loss, and focal loss. We observe that |S99| for the focal loss model is much lower than for the cross-entropy and MMCE models before temperature scaling. However, after temperature scaling, |S99| for focal loss is significantly higher than for both MMCE and cross-entropy. The reason is that, with an optimal temperature of 1.1, the confidence of the temperature-scaled focal loss model does not reduce as much as that of the cross-entropy and MMCE models, whose optimal temperatures lie between 2.5 and 2.8. We thus conclude that models trained with focal loss are not only better calibrated, but also better preserve their confidence on predictions, even after being post-processed with temperature scaling.
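Computing the S99 statistic reported in Table 13 is straightforward. The following sketch (our own) takes the softmax outputs and labels of a test set and returns |S99| as a percentage of the test set together with the accuracy on that subset.

```python
import numpy as np

def s99_stats(probs, labels, threshold=0.99):
    probs = np.asarray(probs)                  # (N, num_classes) softmax outputs
    labels = np.asarray(labels)
    confidences = probs.max(axis=1)
    preds = probs.argmax(axis=1)
    mask = confidences >= threshold            # samples predicted with confidence >= 0.99
    frac = 100.0 * mask.mean()
    acc = 100.0 * (preds[mask] == labels[mask]).mean() if mask.any() else float('nan')
    return frac, acc
```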

In Figure 9, we present some qualitative results to support this claim and to show the improvement in the confidence estimates of focal loss in comparison to the other baselines (i.e., cross entropy, MMCE and Brier score). For this, we


Dataset Model Cross-Entropy Brier Loss MMCE LS-0.05 FL-1 FL-2 FL-3 FLSc-531 FLSc-532 FLSD-532 FLSD-53

CIFAR-10

ResNet 50 20.38 18.36 21.58 27.69 17.56 17.67 18.34 19.93 19.25 17.28 17.37

ResNet 110 21.52 19.60 24.61 29.88 17.32 17.53 17.62 23.79 20.21 17.78 18.24

Wide ResNet 26-10 15.33 15.85 16.16 21.19 15.48 14.85 15.06 15.81 15.38 14.69 14.23

DenseNet 121 21.77 19.11 24.88 28.95 18.71 17.21 18.10 21.65 19.04 19.27 18.39

CIFAR-100

ResNet 50 106.83 99.57 101.92 120.19 94.58 91.80 87.37 92.77 91.58 92.83 88.27

ResNet 110 104.63 111.81 106.73 129.76 94.65 91.24 89.92 93.73 91.30 92.29 88.93

Wide ResNet 26-10 97.10 85.77 95.92 108.06 83.68 80.44 74.66 84.11 80.01 80.40 78.14

DenseNet 121 119.23 98.74 113.24 136.28 100.81 91.35 87.55 98.16 91.55 90.57 86.06

Tiny Imagenet ResNet 50 220.98 238.98 226.29 214.95 217.51 202.92 207.20 215.37 211.57 205.42 204.97

20 Newsgroups Global Pooling CNN 87.95 93.11 99.74 90.42 87.24 93.60 94.69 97.89 93.66 91.73 93.98

SST Binary Tree LSTM 41.05 38.27 36.37 43.45 45.67 47.72 45.96 45.82 54.52 45.36 49.69

Table 10: NLL (%) computed for different approaches post temperature scaling (cross-validating T on ECE).

Dataset Model Cross-Entropy Brier Loss MMCE LS-0.05 FL-1 FL-2 FL-3 FLSc-531 FLSc-532 FLSD-532 FLSD-53

CIFAR-10

ResNet 50 38.65 31.54 60.06 35.61 31.75 25 21.83 30.54 23.57 25.45 14.89

ResNet 110 44.25 25.18 67.52 45.72 73.35 25.92 25.15 34.18 30.38 30.8 18.95

Wide ResNet 26-10 48.17 77.15 36.82 24.89 29.17 30.17 23.86 37.57 30.65 18.51 74.07

DenseNet 121 45.19 19.39 43.92 45.5 38.03 29.59 77.08 33.5 16.47 17.85 13.36

CIFAR-100

ResNet 50 44.34 36.75 39.53 26.11 33.22 21.03 13.02 26.76 23.56 22.4 16.12

ResNet 110 55.92 24.85 50.69 36.23 40.49 32.57 26 37.24 29.56 34.73 22.57

Wide ResNet 26-10 49.36 14.68 40.13 23.79 27 15.14 9.96 27.81 17.59 13.64 10.17

DenseNet 121 56.28 15.47 49.97 43.59 35.45 21.7 11.61 38.68 18.91 21.34 9.68

Tiny Imagenet ResNet 50 30.83 8.41 26.48 25.48 20.7 8.47 6.11 16.03 9.28 8.97 3.76

20 Newsgroups Global Pooling CNN 36.91 31.35 34.72 8.93 34.28 24.1 18.85 26.02 25.02 24.29 17.44

SST Binary Tree LSTM 71.08 92.62 68.43 39.39 95.48 86.21 22.32 76.28 86.93 80.85 73.7

Table 11: MCE (%) computed for different approaches pre temperature scaling.

Dataset Model Cross-Entropy Brier Loss MMCE LS-0.05 FL-1 FL-2 FL-3 FLSc-531 FLSc-532 FLSD-532 FLSD-53

CIFAR-10

ResNet 50 20.6 22.46 23.6 40.51 25.86 28.17 15.76 22.05 23.85 24.76 26.37

ResNet 110 29.98 22.73 31.87 45.72 29.74 23.82 37.61 26.25 25.94 11.59 17.35

Wide ResNet 26-10 26.63 77.15 32.33 37.53 74.58 29.58 25.64 28.63 20.23 19.68 36.56

DenseNet 121 32.52 19.39 27.03 53.57 19.68 22.71 76.27 21.05 32.76 35.06 13.36

CIFAR-100

ResNet 50 12.75 21.61 11.99 18.58 8.92 8.86 6.76 7.46 6.76 5.24 27.18

ResNet 110 22.65 13.56 19.23 30.46 20.13 12 13.06 18.28 13.72 15.89 10.94

Wide ResNet 26-10 14.18 13.42 16.5 23.79 10.28 18.32 9.96 13.18 11.01 12.5 9.73

DenseNet 121 21.63 8.55 13.02 29.95 10.49 11.63 6.17 6.21 6.48 9.41 5.68

Tiny Imagenet ResNet 50 13.33 12.82 12.52 17.2 6.5 8.47 6.11 5.97 7.01 5.73 3.76

20 Newsgroups Global Pooling CNN 36.91 31.35 34.72 8.93 34.28 24.1 18.85 26.02 25.02 24.29 17.44

SST Binary Tree LSTM 88.48 91.86 32.92 35.72 87.77 86.21 74.52 54.27 88.85 82.42 76.71

Table 12: MCE (%) computed for different approaches post temperature scaling (cross-validating T on ECE).

take ResNet-50 networks trained on CIFAR-10 using the four loss functions (cross entropy, MMCE, Brier score and focal loss with sample-dependent γ 5,3) and measure the confidence of their predictions for four correctly and four incorrectly classified test samples. We report these confidences both before and after temperature scaling. It is clear from Figure 9 that, for all the correctly classified samples, the model trained using focal loss has very confident predictions both pre and post temperature scaling. However, on misclassified samples, we observe a very low confidence for the focal loss model. The ResNet-50 network trained using cross entropy is very confident even on the misclassified samples, particularly before temperature scaling. Apart from focal loss, the only model which has relatively low confidences on misclassified test samples is the one trained using Brier score. These observations support our claim that focal loss produces not only a calibrated model but also one which is confident on its correct predictions.


Dataset Model Cross-Entropy Pre T (|S99|%, Accuracy) Cross-Entropy Post T (|S99|%, Accuracy) MMCE Pre T (|S99|%, Accuracy) MMCE Post T (|S99|%, Accuracy) Focal Loss Pre T (|S99|%, Accuracy) Focal Loss Post T (|S99|%, Accuracy)

CIFAR-10 ResNet 110 97.11 96.33 11.5 97.39 97.65 96.72 10.62 99.83 61.41 99.51 31.10 99.68
CIFAR-10 ResNet 50 95.93 96.72 7.33 99.73 92.33 98.24 4.21 100 46.31 99.57 14.27 99.93

Table 13: Percentage of test samples predicted with confidence higher than 99% and the corresponding accuracy, for cross entropy, MMCE and focal loss, computed both pre and post temperature scaling (represented in the table as Pre T and Post T respectively).

Figure 9: Qualitative results showing the performance of Cross Entropy, Brier Score, MMCE and Focal Loss (sample-dependent γ 5,3) for a ResNet-50 trained on CIFAR-10. The first row of images has been correctly classified by the networks trained on all four loss functions, and the second row of images has been incorrectly classified by all of them. For each image, we present the actual label, the predicted label and the confidence of the prediction, both before and after temperature scaling.


Figure 10: L2 norm of the features obtained from the last ResNet block (before the linear layer) of ResNet-50, averaged over the entire CIFAR-10 training set using a batch size of 128, for cross entropy and focal loss with γ = 1, 2 and 3.

G Ordering of Feature Norms

As an extension to the discussion related to Figure 3(e) in the main paper, we plot the L2 norm of the features/activations obtained from the last ResNet block (right before the linear layer is applied to these features to get the logits). We plot these norms throughout the training period for networks trained using cross-entropy and focal loss with γ set to 1, 2 and 3 in Figure 10. We observe a distinct ordering of the feature norms for the four models: cross-entropy has the highest feature norm, followed by focal loss with γ = 1, then focal loss with γ = 2, and finally focal loss with γ = 3. Furthermore, this ordering is preserved throughout training. As we saw from Figure 3(e) in the main paper, from epoch 150 onwards (i.e., the epoch from which the networks start getting miscalibrated), there is a flip in the ordering of the weight norms of the last linear layer. From epoch 150 onwards, the weight norms also follow exactly the same ordering that we observe in Figure 10 here. This shows that, throughout training, the weights of the earlier layers (before the last linear layer) of the network trained using focal loss are also regularised to favour output features with a lower norm, thus possibly leading to less peaky final predictions compared to those of the cross-entropy loss (see the 'Peak at the wrong place' paragraph of Section 3 of the main paper).
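These feature norms can be recorded with a standard PyTorch forward hook. The sketch below uses torchvision's resnet50 as a stand-in (the CIFAR-style ResNet-50 used in the paper differs) and hooks the pooled features that are fed to the final linear layer; in practice the loop would run over the whole CIFAR-10 training set at the end of each epoch.

```python
import torch
import torchvision

model = torchvision.models.resnet50(num_classes=10).eval()   # stand-in architecture
feature_norms = []

def record_feature_norm(module, inputs, output):
    feats = torch.flatten(output, 1)          # features fed to the final linear layer
    feature_norms.append(feats.norm(dim=1).mean().item())

handle = model.avgpool.register_forward_hook(record_feature_norm)
with torch.no_grad():
    model(torch.randn(128, 3, 32, 32))        # one batch of CIFAR-sized inputs as a placeholder
handle.remove()
print("mean feature norm:", sum(feature_norms) / len(feature_norms))
```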
