In most Classification Problem (Text Classification, Sentiment Classification), Searchers often use Cross Entropy as loss function for their models. Do you understand why?
Cross Entropy is computed as:
It is also defined as:
Here you can read the relation of Cross Entropy, Entropy and Kullback-Leibler Divergence.
Why cross entropy is often be used as loss function of deep learning model in classification problem? We analysis it from its equation.
1. Consider only one class.
(1) y is the class label, it is [0, 0 , 0, 0, 1]
It means y = [0, 0 , 0, 0, 1]
We can compute its entropy:
H(y) = 0
(2) ypred is the predicted class label computed by model, it may be [0.1, 0.15, 0.2, 0.35, 0.2]
It means ypred = [0.1, 0.15, 0.2, 0.35, 0.2]
To make our model can predict the class label more correctly, we should be sure that the error is minimum between y and ypred.
It means:
error = f(y,ypred)
How to minimize the error? It means what is f?
The best value of ypred is [0, 0 , 0, 0, 1]
It means:
ypred = [0, 0 , 0, 0, 1]
However, it is hard to get this best value for ypred by model.
To minimize the error, we use cross entropy as f.
It means:
f(y,ypred) = H(y,ypred)
However, we can not use:
f(ypred, y) = H(ypred, y)
H(y,ypred) ≠H(ypred, y)
Because:
H(y,ypred) = H(y) +DKL(y||ypred)
H(y) = 0
H(y,ypred) = DKL(y||ypred)
As to y = [0, 0 , 0, 0, 1]
H(y,ypred) = DKL(y||ypred) = 0log(0/ypred[0]) + 0log(0/ypred[1]) + 0log(0/ypred[2]) + 0log(0/ypred[3]) + 1log(1/ypred[4])
= 0 + log(1/ypred[4])
It means: we only make sure the value of ypred[4] is maximum, the best is ypred[4] ≈ 1
Why we can not use H(ypred, y)?
As to y = [0, 0 , 0, 0, 1]
H(ypred, y) =H(ypred) + DKL(ypred||y) = H(ypred) + ypred[0]log(ypred[0]/0) + ypred[1]log(ypred[1]/0) + ypred[2]log(ypred[2]/0) +ypred[3]log(ypred[3]/0) + ypred[4]log(ypred[4]/1)
=H(ypred) + ∞ + ypred[4]log(ypred[4]/1)
It can not minimize H(ypred, y).
2. Consider multi classes.
The loss function vlaue is the sum of cross entropy of each class.
To sumary:
- Use cross entropy as loss function in classification problem in model can classify the classes.
- We can use H(y,ypred), but we can not use H(ypred,y)