Calibrating Deep Neural Networks using Focal Loss (NeurIPS’20)

What we want

What we do

Fig. 1: Reliability diagrams for a ResNet110 trained on CIFAR-100

What Causes Miscalibration?

Fig. 2: NLL and softmax entropy computed over the CIFAR-10 train and test sets (for correctly and incorrectly classified samples) over the course of training a ResNet-50 using cross-entropy loss.

Two observations are worth noting from the above plots:

  1. Curse of misclassified samples: NLL overfitting is apparent after training epoch 150, and the rise in test set NLL is due entirely to misclassified test samples.
  2. Peak at the wrong place: While the test set NLL rises, the test set entropy decreases throughout training, even for misclassified samples. Hence, the model becomes more and more confident in its predictions irrespective of their correctness.
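
The two quantities tracked in these observations can be sketched in a few lines. This is a minimal illustration, not the code behind Fig. 2: the logits below are hypothetical values for a single misclassified 3-class sample, and scaling them is only a stand-in for a model growing more confident during training.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll(probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])

def entropy(probs):
    """Shannon entropy (in nats) of a predicted distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical logits for one misclassified sample whose true class is 0.
# The model's top prediction is class 1, so the sample stays misclassified.
base_logits = (0.5, 2.0, 0.0)
true_label = 0

for scale in (1.0, 2.0, 4.0):
    probs = softmax([scale * z for z in base_logits])
    print(f"scale={scale}: NLL={nll(probs, true_label):.3f}, "
          f"entropy={entropy(probs):.3f}")
```

As the logits are scaled up, the NLL of this misclassified sample grows while its entropy shrinks, which is the pattern the two observations describe.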

We posit the reason behind the above observations as follows:

Improving Calibration using Focal Loss

We explore an alternative loss function, focal loss. Below are the forms of the two loss functions, cross-entropy (CE) and focal loss (Focal), assuming one-hot ground-truth encodings:

$$\mathcal{L}_{\mathrm{CE}} = -\log \hat{p}, \qquad \mathcal{L}_{\mathrm{Focal}} = -(1 - \hat{p})^{\gamma} \log \hat{p}$$

In the above equations, $\hat{p}$ is the probability assigned by the model to the ground-truth correct class. Compared with cross-entropy, focal loss has an added factor $(1 - \hat{p})^{\gamma}$. The idea behind this factor is to give more weight to samples for which the model places less probability mass on the correct class. $\gamma$ is a hyperparameter.
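
A minimal sketch of the two losses for a single sample, assuming the standard focal loss form $-(1 - \hat{p})^{\gamma} \log \hat{p}$, where $\hat{p}$ is the probability of the correct class. The function names and the probabilities used below are illustrative and are not taken from the released code.

```python
import math

def cross_entropy(p_correct):
    """Cross-entropy for one sample with a one-hot target."""
    return -math.log(p_correct)

def focal_loss(p_correct, gamma):
    """Focal loss for one sample: cross-entropy scaled by (1 - p)^gamma."""
    return -((1.0 - p_correct) ** gamma) * math.log(p_correct)

# Illustrative correct-class probabilities, from poorly to well classified.
for p in (0.1, 0.5, 0.9):
    ce = cross_entropy(p)
    fl = focal_loss(p, gamma=3.0)
    print(f"p={p}: CE={ce:.4f}, focal={fl:.4f}, focal/CE={fl / ce:.4f}")
```

The ratio focal/CE equals $(1 - \hat{p})^{\gamma}$, so confidently correct samples are down-weighted far more than samples with low correct-class probability. With $\gamma = 0$ the two losses coincide.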

Why might focal loss improve calibration?

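Focal loss minimises a regularised KL divergence. Cross-entropy loss minimises the Kullback-Leibler (KL) divergence between the target distribution over classes $q$ and the predicted softmax distribution $\hat{p}$ (here $\hat{p}$ denotes the full predicted distribution). As it turns out, focal loss minimises a regularised KL divergence between the target and predicted distributions, as shown below for one-hot $q$:

$$\mathcal{L}_{\mathrm{Focal}} \geq \mathrm{KL}(q \,\|\, \hat{p}) - \gamma \, \mathbb{H}[\hat{p}]$$

Here $\mathbb{H}[\hat{p}]$ is the entropy of the predicted distribution, so minimising this bound reduces the KL divergence while encouraging higher predictive entropy.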
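
The regularised-KL view can be checked numerically. The sketch below assumes a one-hot target, for which $\mathrm{KL}(q \,\|\, \hat{p})$ reduces to the negative log of the correct-class probability, and assumes $\gamma \geq 1$. It tests the bound $\mathcal{L}_{\mathrm{Focal}} \geq \mathrm{KL}(q \,\|\, \hat{p}) - \gamma \, \mathbb{H}[\hat{p}]$ on random distributions. The helper names and the random test setup are hypothetical.

```python
import math
import random

def entropy(probs):
    """Shannon entropy (in nats) of a predicted distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def focal_loss(probs, label, gamma):
    """Focal loss for one sample with a one-hot target on `label`."""
    p = probs[label]
    return -((1.0 - p) ** gamma) * math.log(p)

def kl_from_one_hot(probs, label):
    """KL(q || p_hat) when q is one-hot on `label`; equals -log p_hat[label]."""
    return -math.log(probs[label])

def bound_gap(probs, label, gamma):
    """Focal loss minus the regularised-KL lower bound; expected to be >= 0."""
    lower_bound = kl_from_one_hot(probs, label) - gamma * entropy(probs)
    return focal_loss(probs, label, gamma) - lower_bound

def random_distribution(rng, n_classes):
    """Draw a random probability vector (assumption: no exact zeros)."""
    raw = [rng.random() + 1e-9 for _ in range(n_classes)]
    total = sum(raw)
    return [r / total for r in raw]

rng = random.Random(0)
gaps = []
for _ in range(1000):
    probs = random_distribution(rng, 10)
    label = rng.randrange(10)
    gamma = rng.choice([1.0, 2.0, 3.0])
    gaps.append(bound_gap(probs, label, gamma))

print(f"smallest gap over 1000 random draws: {min(gaps):.4f}")
```

A non-negative smallest gap is consistent with the bound. It is a sanity check on random inputs, not a proof.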

How does focal loss fare on NLL overfitting?

On the same training setup as above (i.e., ResNet-50 on CIFAR-10), if we use focal loss with $\gamma$ set to 1, 2 and 3, the plot of test set NLL against training epochs is as follows.

Fig. 3: Test set NLL computed over different training epochs for ResNet-50 models trained using cross-entropy and focal loss with $\gamma$ values 1, 2 and 3.

Does focal loss regularise model weights?

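If we consider the gradient of focal loss with respect to the last-layer model weights $w$ and the same for cross-entropy, we find that, for a single sample, the following relation holds:

$$\frac{\partial \mathcal{L}_{\mathrm{Focal}}}{\partial w} = g(\hat{p}, \gamma) \, \frac{\partial \mathcal{L}_{\mathrm{CE}}}{\partial w}, \qquad g(\hat{p}, \gamma) = (1 - \hat{p})^{\gamma} - \gamma \, \hat{p} \, (1 - \hat{p})^{\gamma - 1} \log \hat{p}$$

The factor $g(\hat{p}, \gamma)$ is a function of both the predicted correct-class probability $\hat{p}$ and $\gamma$. On plotting $g(\hat{p}, \gamma)$ against $\hat{p}$ for different values of $\gamma$, we get the following plot.

Fig. 4: $g(\hat{p}, \gamma)$ vs $\hat{p}$ plot.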
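
The gradient scaling factor can be explored numerically. The sketch below assumes the factor has the form $g(\hat{p}, \gamma) = (1 - \hat{p})^{\gamma} - \gamma \hat{p} (1 - \hat{p})^{\gamma - 1} \log \hat{p}$, which follows from differentiating $-(1 - \hat{p})^{\gamma} \log \hat{p}$ and $-\log \hat{p}$ with respect to $\hat{p}$ and taking the ratio. The `threshold` helper is a hypothetical addition that uses bisection to locate the probability at which the factor crosses 1.

```python
import math

def g(p, gamma):
    """Ratio of the focal-loss gradient to the cross-entropy gradient."""
    return (1.0 - p) ** gamma - gamma * p * (1.0 - p) ** (gamma - 1.0) * math.log(p)

def threshold(gamma, lo=1e-6, hi=1.0 - 1e-6, iters=100):
    """Bisection for p0 with g(p0, gamma) = 1.

    Assumes g > 1 at `lo` and g < 1 at `hi`, which holds for gamma >= 1.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid, gamma) > 1.0:
            lo = mid  # still in the region where focal gradients are larger
        else:
            hi = mid  # already in the region where focal gradients are smaller
    return 0.5 * (lo + hi)

for gamma in (1.0, 2.0, 3.0):
    p0 = threshold(gamma)
    print(f"gamma={gamma}: g > 1 below p0={p0:.3f}, g < 1 above it")
```

Below the crossing point, focal loss applies larger gradients than cross-entropy; above it, the gradients are smaller, which matches the idea that the damping starts only once the model is reasonably confident.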

Fig. 5: Weight norm of the last linear layer over the course of training.

This provides strong evidence in favour of our hypothesis that focal loss regularises weight norms in the network once the network achieves a certain level of confidence on its predictions.

Empirical Results

Classification Results (Calibration and Accuracy)

Fig. 6: Error bar plots with confidence intervals for ECE, AdaECE and Classwise-ECE computed for ResNet-50 (first 3 from the left) and ResNet-110 (first 3 from the right) on CIFAR-10. T: post-temperature scaling
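
For reference, a minimal sketch of how plain ECE is commonly computed: predictions are grouped into equal-width confidence bins, and the gap between accuracy and mean confidence in each bin is averaged, weighted by bin size. The bin count of 15 and the half-open binning convention are assumptions made here. AdaECE and Classwise-ECE are not covered, and this is not the evaluation code used for Fig. 6.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Plain ECE with equal-width bins over [0, 1].

    confidences: top-class probability per sample.
    correct: True/False per sample, whether the prediction was right.
    """
    n = len(confidences)
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    for conf, hit in zip(confidences, correct):
        # Assumption: bins are [k/n_bins, (k+1)/n_bins); conf == 1.0 goes to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hit_sum[idx] += 1.0 if hit else 0.0
    # |sum(hits) - sum(conf)| / n equals (bin size / n) * |accuracy - mean confidence|.
    return sum(abs(h - c) for h, c in zip(hit_sum, conf_sum)) / n

# Hypothetical overconfident model: 90% confidence but only 50% accuracy.
confs = [0.9] * 10
hits = [True] * 5 + [False] * 5
print(f"ECE = {expected_calibration_error(confs, hits):.3f}")
```

A perfectly calibrated set of predictions (for example, 75% confidence with three of four predictions correct) yields an ECE of zero under this definition.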

| Model Architecture | Cross-Entropy | Brier Loss | MMCE | LS-0.05 | FL-3 (Ours) | FLSD-53 (Ours) |
|---|---|---|---|---|---|---|
| ResNet-50 | 4.95 | 5.0 | 4.99 | 5.29 | 5.25 | 4.98 |
| ResNet-110 | 4.89 | 5.48 | 5.4 | 5.52 | 5.08 | 5.42 |

Detection of OOD Samples

We run the models trained on CIFAR-10 on test data drawn from the SVHN dataset, which is out-of-distribution, and use the softmax entropy of these networks as a measure of uncertainty. For ResNet-110, the ROC plots obtained from this experiment are provided below.

Fig. 7: ROC plots obtained from ResNet-110 trained on CIFAR-10 and tested on SVHN. Left: pre-temperature scaling, right: post-temperature scaling.
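
The evaluation idea can be sketched as follows: score every sample by the entropy of its softmax output, then measure how well that score separates in-distribution from OOD samples using the area under the ROC curve. The AUROC here is computed by direct pairwise comparison, and the softmax outputs are hypothetical values rather than results from the trained models.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a softmax output."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def auroc(in_scores, ood_scores):
    """Probability that a random OOD sample scores higher than a random
    in-distribution sample, counting ties as one half."""
    wins = 0.0
    for o in ood_scores:
        for i in in_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(in_scores) * len(ood_scores))

# Hypothetical softmax outputs: peaked for in-distribution, flatter for OOD.
in_dist_probs = [[0.97, 0.02, 0.01], [0.90, 0.05, 0.05], [0.60, 0.30, 0.10]]
ood_probs = [[0.40, 0.35, 0.25], [0.50, 0.30, 0.20], [0.85, 0.10, 0.05]]

in_scores = [entropy(p) for p in in_dist_probs]
ood_scores = [entropy(p) for p in ood_probs]
print(f"AUROC = {auroc(in_scores, ood_scores):.3f}")
```

An AUROC of 1.0 means entropy separates the two sets perfectly, while 0.5 means it does no better than chance. The pairwise loop is quadratic in the number of samples, so a rank-based implementation would be preferable for full test sets.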

Thus, focal loss seems to be a good alternative to the conventionally used cross-entropy loss for producing confident and calibrated models without compromising on classification accuracy. Our paper provides many more experiments and analyses; please have a look. The code, along with all the pretrained models, is also available.

Citation and Contact

If the code or the paper has been useful in your research, please add a citation to our work:

@inproceedings{mukhoti2020calibrating,
  title={Calibrating Deep Neural Networks using Focal Loss},
  author={Mukhoti, Jishnu and Kulharia, Viveka and Sanyal, Amartya and Golodetz, Stuart and Torr, Philip HS and Dokania, Puneet K},
  booktitle={Advances in Neural Information Processing Systems},
  year={2020}
}

For any questions related to this work, please feel free to reach out to us at or