
What is Label Smoothing in Machine Learning?


Nowadays we rely on deep neural networks, or deep learning, for a lot of tasks. While working with deep learning we very often run into two issues: overfitting and overconfidence. Overfitting is relatively well studied and can be tackled with several strategies like dropout, weight regularization, early stopping, etc. We have tools for tackling overconfidence too, and with label smoothing we can handle both of these problems.

What is Label Smoothing?

It is a method of making a model more robust so that it generalizes well. Label smoothing is a loss-function modification that has been shown to be highly beneficial for training deep neural networks. It improves performance on image classification, translation, and even speech recognition. To build a more solid understanding, let’s start with some preliminaries.

Initial intuitions

Suppose we are dealing with a classification problem with \( K \) candidate classes, labeled \( 1, 2, \ldots, K \). We can get the predictions from the last layer of a neural network as:
\[
p_k = \frac{\exp(x^T w_k)}{\sum_{l=1}^{K}\exp(x^T w_l)}
\]
Here, \( p_k \) is the likelihood the model assigns to label \( k \), \( w_k \) is the weight vector of the last layer corresponding to class \( k \), and \( x \) is the vector of activations of the penultimate layer. We can then use cross-entropy as the loss function and try to minimize it.
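
To make the setup concrete, here is a minimal NumPy sketch of this softmax and the cross-entropy loss for a single example; the logit values and the number of classes are made up purely for illustration.

```python
import numpy as np

# Hypothetical logits x^T w_k for one example with K = 4 classes.
logits = np.array([2.0, 0.5, -1.0, 0.1])

# Softmax: p_k = exp(x^T w_k) / sum_l exp(x^T w_l)
# (subtract the max logit first for numerical stability).
p = np.exp(logits - logits.max())
p /= p.sum()

# Hard (one-hot) target: the true class is class 0.
y = np.array([1.0, 0.0, 0.0, 0.0])

# Cross-entropy loss: -sum_k y_k * log(p_k).
loss = -np.sum(y * np.log(p))
print(p, loss)
```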

We have been doing this for a long time, right? So what’s the problem now? The problem is with hard targets. To match a one-hot target, the model has to produce a very large logit for the correct label: gradient descent pushes \( p_k \) as close to the target as possible and keeps widening the gap between the largest logit and all the other logits. This reduces the ability of the model to adapt, making it overconfident, and it may also lead the model to overfit.

Label Smoothing – the tool we need

Label smoothing addresses exactly this problem. We can define the smoothed targets as:
\[
y_k^{LS} = y_k(1-\alpha) + \frac{\alpha}{K}
\]
We can call this a soft target. Here, \( \alpha \) is a hyperparameter that determines the amount of smoothing; for \( \alpha = 0 \) we recover the hard target.
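
As a sketch of what this formula does in practice, the snippet below builds soft targets from a one-hot vector; the class count and the value of \( \alpha \) are only illustrative.

```python
import numpy as np

def smooth_labels(y, alpha):
    """Soft targets: y_k^LS = y_k * (1 - alpha) + alpha / K."""
    K = y.shape[-1]
    return y * (1.0 - alpha) + alpha / K

# One-hot (hard) target for K = 4 classes, true class 0.
y_hard = np.array([1.0, 0.0, 0.0, 0.0])

y_soft = smooth_labels(y_hard, alpha=0.1)
print(y_soft)  # [0.925 0.025 0.025 0.025]; alpha = 0 gives back y_hard
```

In practice you rarely need to build the soft targets by hand; for example, recent versions of PyTorch expose this directly through the label_smoothing argument of nn.CrossEntropyLoss.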


That’s all for this post. I hope it helps you enrich your understanding of label smoothing in machine learning and deep learning. I am a learner who is picking up new things and trying to share them with others. Let me know your thoughts on this post. You can find more machine learning-related posts here.
