Understand Why Cross Entropy Is Used as the Loss Function in Classification Problems

In most classification problems (text classification, sentiment classification), researchers use cross entropy as the loss function for their models. Do you understand why?

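Cross entropy between a true distribution p and a predicted distribution q is computed as:

H(p, q) = -Σ_x p(x)log(q(x))

It can also be written as:

H(p, q) = H(p) + D_KL(p||q)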

Cross entropy is therefore closely related to entropy and Kullback-Leibler divergence.
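The relation between the three quantities can be checked numerically. The following is a minimal sketch in plain Python; the function names and the two example distributions p and q are assumptions made for illustration, and the natural logarithm is assumed.

```python
import math

def entropy(p):
    """H(p) = -sum p_i log p_i, with 0 * log(0) taken to be 0."""
    return sum(-pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p_i log q_i; terms with p_i = 0 contribute 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf  # p puts mass where q puts none
        total -= pi * math.log(qi)
    return total

def kl_divergence(p, q):
    """D_KL(p||q) = sum p_i log(p_i / q_i), same conventions as above."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

# Hypothetical example distributions over three classes.
p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(lhs, rhs)  # the two values agree up to floating-point rounding
```

Because D_KL(p||q) is never negative, the cross entropy is never smaller than the entropy of p.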

Why is cross entropy so often used as the loss function of deep learning models in classification problems? We can analyze it from its equation.

1. Consider only one class.

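(1) y is the class label, given as a one-hot vector, for example:

y = [0, 0, 0, 0, 1]

We can compute its entropy. Because y puts all of its probability on one class, and 0log(0) is taken to be 0:

H(y) = -1log(1) = 0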
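As a quick check of this value, here is a small sketch that computes the entropy of the one-hot label and, for contrast, of a uniform distribution over the same five classes. The helper name entropy and the uniform vector are assumptions added for illustration.

```python
import math

def entropy(p):
    """H(p) = -sum p_i log p_i, skipping zero entries (0 * log(0) = 0)."""
    return sum(-pi * math.log(pi) for pi in p if pi > 0)

y = [0, 0, 0, 0, 1]        # one-hot label from the text
uniform = [0.2] * 5        # hypothetical maximally uncertain distribution

h_onehot = entropy(y)       # only the term -1 * log(1) survives
h_uniform = entropy(uniform)

print(h_onehot, h_uniform)
```

The one-hot label carries no uncertainty, so its entropy is zero, while the uniform distribution reaches log(5), the largest entropy possible with five classes.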

(2) y_pred is the class distribution predicted by the model, for example:

y_pred = [0.1, 0.15, 0.2, 0.35, 0.2]

For the model to predict the class label correctly, the error between y and y_pred should be as small as possible.

It means we minimize:

error = f(y,y_pred)

How do we minimize the error? In other words, what should f be?

The best value of y_pred is the label itself:

y_pred = [0, 0, 0, 0, 1]

However, it is hard for a model to produce exactly this value.

To minimize the error, we use cross entropy as f:

f(y,y_pred) = H(y,y_pred)

However, we can not swap the arguments and use:

f(y_pred, y) = H(y_pred, y)

Cross entropy is not symmetric:

H(y,y_pred) ≠ H(y_pred, y)

To see why H(y,y_pred) works, decompose it:

H(y,y_pred) = H(y) + D_KL(y||y_pred)

Since H(y) = 0:

H(y,y_pred) = D_KL(y||y_pred)

For y = [0, 0, 0, 0, 1]:

H(y,y_pred) = D_KL(y||y_pred) = 0log(0/y_pred[0]) + 0log(0/y_pred[1]) + 0log(0/y_pred[2]) + 0log(0/y_pred[3]) + 1log(1/y_pred[4])

= 0 + log(1/y_pred[4]) = -log(y_pred[4])

Each 0log(0/...) term is taken to be 0. It means: to minimize the loss we only need to make y_pred[4] as large as possible; the best is y_pred[4] ≈ 1.
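The effect of y_pred[4] on the loss can be seen with the vectors from the text. This is a minimal sketch assuming the natural logarithm; the second prediction, named better, is a hypothetical example added for comparison.

```python
import math

def cross_entropy(y, y_pred):
    """H(y, y_pred) = -sum y_i log(y_pred_i); terms with y_i = 0 contribute 0."""
    return sum(-yi * math.log(pi) for yi, pi in zip(y, y_pred) if yi > 0)

y = [0, 0, 0, 0, 1]
y_pred = [0.1, 0.15, 0.2, 0.35, 0.2]   # prediction from the text
better = [0.02, 0.03, 0.05, 0.1, 0.8]  # hypothetical prediction with larger y_pred[4]

loss = cross_entropy(y, y_pred)         # equals -log(0.2)
loss_better = cross_entropy(y, better)  # equals -log(0.8)

print(loss, loss_better)
```

Only the probability assigned to the true class enters the loss, and the loss shrinks toward 0 as that probability approaches 1.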

Why can we not use H(y_pred, y)?

For y = [0, 0, 0, 0, 1]:

H(y_pred, y) = H(y_pred) + D_KL(y_pred||y) = H(y_pred) + y_pred[0]log(y_pred[0]/0) + y_pred[1]log(y_pred[1]/0) + y_pred[2]log(y_pred[2]/0) + y_pred[3]log(y_pred[3]/0) + y_pred[4]log(y_pred[4]/1)

= H(y_pred) + ∞ + y_pred[4]log(y_pred[4]/1)

Whenever any of y_pred[0] to y_pred[3] is greater than 0, the division by 0 makes H(y_pred, y) infinite, so it can not be minimized.
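The asymmetry is easy to reproduce. In the sketch below, the function cross_entropy and the vector almost_perfect are assumptions for illustration; a predicted probability that meets a 0 in the second argument is reported as infinity instead of raising an error.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum p_i log q_i, with 0 * log(...) taken to be 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # this term contributes nothing
        if qi == 0:
            return math.inf   # -p_i * log(0) is infinite when p_i > 0
        total -= pi * math.log(qi)
    return total

y = [0, 0, 0, 0, 1]
y_pred = [0.1, 0.15, 0.2, 0.35, 0.2]
almost_perfect = [0.001, 0.001, 0.001, 0.001, 0.996]  # hypothetical near-ideal prediction

forward = cross_entropy(y, y_pred)                   # finite: -log(0.2)
reverse = cross_entropy(y_pred, y)                   # infinite
reverse_near_ideal = cross_entropy(almost_perfect, y)  # still infinite

print(forward, reverse, reverse_near_ideal)
```

Even a prediction that is almost identical to the label gives an infinite H(y_pred, y), so this ordering tells the model nothing about whether it is improving.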

2. Consider multi classes.

The loss function value is the sum of the cross entropy of each class.
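The text does not spell out the formula for this case. One possible reading, assumed here, is that each class i contributes the term -y[i]log(y_pred[i]) and the loss is the sum of these terms. The label below, which spreads its probability over two classes, is a hypothetical example.

```python
import math

def per_class_terms(y, y_pred):
    """Contribution -y_i * log(y_pred_i) of each class; 0 where y_i = 0."""
    return [-yi * math.log(pi) if yi > 0 else 0.0 for yi, pi in zip(y, y_pred)]

# Hypothetical label with probability mass on two classes (assumption).
y = [0, 0.5, 0, 0, 0.5]
y_pred = [0.1, 0.15, 0.2, 0.35, 0.2]

terms = per_class_terms(y, y_pred)
total_loss = sum(terms)

print(terms)
print(total_loss)
```

With a one-hot label all terms but one vanish, and this sum reduces to the single-class result -log(y_pred[4]).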

To summarize:

Using cross entropy as the loss function in a classification problem lets the model learn to classify the classes.
We can use H(y,y_pred), but we can not use H(y_pred,y).
