Cross-Entropy Loss

The Measuring Stick for Classification: A Deep Dive into Cross-Entropy Loss

In machine learning, how do we measure how “wrong” a model’s predictions are? For classification problems, the most common answer is Cross-Entropy Loss.

Imagine you’re a weather forecaster. Today you predict an 80% chance of rain, and it actually rains. Is your prediction good? Now imagine another day when you predict a 99% chance of sunshine, but it rains. Which prediction is worse?

Cross-entropy loss is the mathematical tool for quantifying this “gap between prediction and reality.”

Starting from Information Theory

Cross-entropy originates from information theory. To understand it, let’s look at a few key concepts:

Self-Information

The information content of an event is inversely related to its probability:

I(x) = -\log P(x)
  • Certain events (P=1) have zero information
  • Less likely events carry more information

Entropy

Entropy is the average information content of a distribution, measuring uncertainty:

H(P) = -\sum_{x} P(x) \log P(x)

Cross-Entropy

Cross-entropy measures the average number of bits (or nats, if natural logarithms are used) needed to encode samples drawn from distribution P using a code optimized for distribution Q:

H(P, Q) = -\sum_{x} P(x) \log Q(x)
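
To make the two definitions concrete, here is a minimal NumPy sketch. The distributions `p` and `q` are made-up toy values, and natural logs are used, so the results are in nats rather than bits.

```python
import numpy as np

def entropy(p):
    """H(P) = -sum_x P(x) log P(x), in nats."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x), in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

p = np.array([0.7, 0.2, 0.1])   # "true" distribution (toy example)
q = np.array([0.5, 0.3, 0.2])   # model's guess (toy example)

print(entropy(p))           # ~0.802 nats
print(cross_entropy(p, q))  # ~0.887 nats, always >= H(P)
```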

Cross-Entropy Loss in Classification

In classification tasks:

  • P is the true distribution (one-hot labels)
  • Q is the model's predicted probability distribution (softmax output)

Binary Cross-Entropy

L = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]

Where:

  • y \in \{0, 1\} is the true label
  • \hat{y} \in (0, 1) is the predicted probability
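
A minimal NumPy sketch of this formula; the clipping constant `eps` is an implementation choice to keep the log finite, not part of the definition.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """BCE for a single prediction or an array of predictions."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # keep log() finite
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(binary_cross_entropy(1, 0.9))   # ~0.105: confident and correct
print(binary_cross_entropy(1, 0.1))   # ~2.303: confident and wrong
```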

Categorical Cross-Entropy

L = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)

Since true labels are one-hot, only the log probability of the correct class is computed.
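
A sketch of the same computation for a single sample, assuming `y_hat` already sums to 1 (e.g. it came out of a softmax); the toy values here are illustrative.

```python
import numpy as np

def categorical_cross_entropy(y_onehot, y_hat, eps=1e-12):
    """CE for one sample: -sum_c y_c * log(y_hat_c)."""
    y_hat = np.clip(y_hat, eps, 1.0)
    return -np.sum(y_onehot * np.log(y_hat))

y = np.array([1, 0, 0])            # true class: index 0
y_hat = np.array([0.6, 0.3, 0.1])  # softmax output (toy values)
print(categorical_cross_entropy(y, y_hat))  # -log(0.6) ~ 0.511
```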

Why Cross-Entropy Instead of MSE?

You might ask: why not just use Mean Squared Error (MSE) to measure classification loss?

Problem 1: Vanishing Gradients

With MSE + sigmoid, the gradient can shrink even when the prediction is far from correct:

\frac{\partial L}{\partial z} = (\hat{y} - y) \cdot \hat{y}(1-\hat{y})

When \hat{y} is close to 0 or 1, the factor \hat{y}(1-\hat{y}) approaches 0, so a confidently wrong prediction learns very slowly.
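
A small numeric illustration of that factor, assuming a single sigmoid output and the squared-error loss L = ½(ŷ − y)² that the gradient above corresponds to; the probe values for `y_hat` are arbitrary.

```python
def mse_sigmoid_grad(y, y_hat):
    """dL/dz for L = 0.5 * (y_hat - y)**2 with y_hat = sigmoid(z)."""
    return (y_hat - y) * y_hat * (1 - y_hat)

# True label is 1; the model is increasingly (and wrongly) confident of 0.
for y_hat in [0.5, 0.1, 0.01, 0.001]:
    print(y_hat, mse_sigmoid_grad(1.0, y_hat))
# The gradient magnitude shrinks toward 0 as the prediction gets *more* wrong.
```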

Problem 2: Probabilistic Interpretation

Cross-entropy directly relates to probability; minimizing cross-entropy is equivalent to maximum likelihood estimation.

Advantages of Cross-Entropy

Using Cross-Entropy + Softmax, the gradient is very clean:

\frac{\partial L}{\partial z_i} = \hat{y}_i - y_i

The more wrong the prediction, the larger the gradient and the faster the learning!
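
A quick finite-difference check of this identity, using made-up logits and class 0 as the true label.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def ce_from_logits(z, y_onehot):
    return -np.sum(y_onehot * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])      # toy logits
y = np.array([1.0, 0.0, 0.0])      # true class: 0

analytic = softmax(z) - y          # claimed gradient: y_hat - y

numeric = np.zeros_like(z)         # central finite differences
h = 1e-5
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += h
    zm[i] -= h
    numeric[i] = (ce_from_logits(zp, y) - ce_from_logits(zm, y)) / (2 * h)

print(analytic)  # ~[-0.341, 0.242, 0.099]
print(numeric)   # matches to high precision
```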

Intuitive Understanding of Cross-Entropy

Let’s understand cross-entropy behavior with examples:

Case 1: Perfect Prediction

  • True: Cat (1,0,0)
  • Predicted: (0.99, 0.005, 0.005)
  • Loss: -\log(0.99) \approx 0.01

Case 2: Uncertain Prediction

  • True: Cat (1,0,0)
  • Predicted: (0.6, 0.3, 0.1)
  • Loss: -\log(0.6) \approx 0.51

Case 3: Wrong Prediction

  • True: Cat (1,0,0)
  • Predicted: (0.1, 0.8, 0.1)
  • Loss: -\log(0.1) \approx 2.30

Case 4: Extremely Wrong

  • True: Cat (1,0,0)
  • Predicted: (0.01, 0.98, 0.01)
  • Loss: -\log(0.01) \approx 4.61

As you can see, the more confidently wrong the prediction, the greater the penalty!
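
The four cases can be reproduced in a few lines; the probability vectors are the toy values from the text.

```python
import numpy as np

y = np.array([1, 0, 0])  # true class: cat
cases = {
    "perfect":         [0.99, 0.005, 0.005],
    "uncertain":       [0.6, 0.3, 0.1],
    "wrong":           [0.1, 0.8, 0.1],
    "extremely wrong": [0.01, 0.98, 0.01],
}
for name, y_hat in cases.items():
    loss = -np.sum(y * np.log(y_hat))   # only the true class contributes
    print(f"{name:16s} loss = {loss:.2f}")
# perfect 0.01, uncertain 0.51, wrong 2.30, extremely wrong 4.61
```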

Numerical Stability

When implementing, directly computing \log(\hat{y}) can cause problems:

  • When \hat{y} is close to 0, \log(\hat{y}) \rightarrow -\infty

Solutions:

  1. Clip probabilities: \hat{y} = \text{clip}(\hat{y}, \epsilon, 1-\epsilon)
  2. Combined computation: merge softmax and cross-entropy and work in log-softmax space (see the sketch below)
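
Here is a sketch of the second approach: cross-entropy is computed directly from logits via a log-sum-exp style log-softmax, so no logarithm is ever taken of a near-zero probability. Frameworks typically expose this as a fused operation (e.g. PyTorch's `nn.CrossEntropyLoss`); the helper below is only an illustration.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax via the log-sum-exp trick."""
    z = z - np.max(z)                       # shift so the largest logit is 0
    return z - np.log(np.sum(np.exp(z)))

def ce_from_logits(z, true_class):
    """Cross-entropy straight from logits; no explicit probabilities."""
    return -log_softmax(z)[true_class]

z = np.array([1000.0, -5.0, 2.0])  # logits this large would overflow a naive exp()
print(ce_from_logits(z, 0))         # tiny loss for the dominant class, no inf/nan
print(ce_from_logits(z, 1))         # huge but finite loss for the wrong class
```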

Variants and Extensions

1. Weighted Cross-Entropy
Handles class imbalance:

L = -\sum_{c} w_c \cdot y_c \log(\hat{y}_c)
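
A per-sample sketch with made-up class weights; here the rare class is simply weighted 5x.

```python
import numpy as np

def weighted_cross_entropy(y_onehot, y_hat, weights, eps=1e-12):
    """-sum_c w_c * y_c * log(y_hat_c); weights holds one value per class."""
    y_hat = np.clip(y_hat, eps, 1.0)
    return -np.sum(weights * y_onehot * np.log(y_hat))

weights = np.array([1.0, 5.0])   # class 1 is rare -> weigh it 5x (illustrative)
y = np.array([0, 1])             # true class: the rare one
y_hat = np.array([0.7, 0.3])
print(weighted_cross_entropy(y, y_hat, weights))  # 5 * -log(0.3) ~ 6.02
```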

2. Focal Loss
Focuses on hard-to-classify samples:

L = -\sum_{c} (1-\hat{y}_c)^\gamma \cdot y_c \log(\hat{y}_c)
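
A per-sample sketch; \gamma = 2 follows the value commonly used in the original Focal Loss paper, and the "easy" and "hard" probability vectors are toy examples.

```python
import numpy as np

def focal_loss(y_onehot, y_hat, gamma=2.0, eps=1e-12):
    """-sum_c (1 - y_hat_c)^gamma * y_c * log(y_hat_c)."""
    y_hat = np.clip(y_hat, eps, 1.0)
    return -np.sum((1 - y_hat) ** gamma * y_onehot * np.log(y_hat))

y = np.array([1, 0, 0])
easy = np.array([0.9, 0.05, 0.05])   # already classified well
hard = np.array([0.3, 0.6, 0.1])     # misclassified

# Focal loss nearly zeroes out the easy sample but keeps most of the
# cross-entropy signal for the hard one.
print(focal_loss(y, easy), -np.log(0.9))   # ~0.001 vs ~0.105
print(focal_loss(y, hard), -np.log(0.3))   # ~0.590 vs ~1.204
```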

3. Label Smoothing
Prevents overconfidence:

y_{\text{smooth}} = (1-\epsilon) \cdot y + \frac{\epsilon}{C}
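
A sketch of how the target changes; \epsilon = 0.1 is a typical choice, assumed here for illustration.

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    """(1 - eps) * y + eps / C, where C is the number of classes."""
    C = y_onehot.shape[-1]
    return (1 - eps) * y_onehot + eps / C

y = np.array([1.0, 0.0, 0.0])
print(smooth_labels(y))   # [0.9333, 0.0333, 0.0333]
# The model is now trained toward ~0.93 confidence instead of 1.0,
# so it is never pushed to produce an unbounded logit gap.
```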

Cross-Entropy vs KL Divergence

Cross-entropy is closely related to KL divergence (relative entropy):

D_{KL}(P \| Q) = H(P, Q) - H(P)

Since H(P) is constant with respect to the model (for one-hot labels it is exactly 0), minimizing cross-entropy is equivalent to minimizing KL divergence.
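
A numeric check of the identity on two toy distributions (the same made-up `p` and `q` as earlier):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

H_p  = -np.sum(p * np.log(p))       # entropy of P
H_pq = -np.sum(p * np.log(q))       # cross-entropy H(P, Q)
kl   = np.sum(p * np.log(p / q))    # KL(P || Q)

print(np.isclose(kl, H_pq - H_p))   # True
```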

Applications

Cross-entropy loss is widely used in:

  • Image classification
  • Text classification
  • Language models
  • Object detection
  • Semantic segmentation
  • And almost all classification tasks

Understanding cross-entropy loss is a key step in mastering deep learning classification models. It connects information theory with machine learning, providing both theoretical foundation and practical guidance for model training.