Cross-Entropy Loss

The Measuring Stick for Classification: A Deep Dive into Cross-Entropy Loss

In machine learning, how do we measure how “wrong” a model’s predictions are? For classification problems, the most common answer is Cross-Entropy Loss.

Imagine you’re a weather forecaster. Today you predict an 80% chance of rain, and it actually rains. Is your prediction good? Now imagine another day when you predict a 99% chance of sunshine, but it rains. Which prediction is worse?

Cross-entropy loss is the mathematical tool for quantifying this “gap between prediction and reality.”

Starting from Information Theory

Cross-entropy originates from information theory. To understand it, let’s look at a few key concepts:

Self-Information

The information content of an event is inversely related to its probability:

I(x) = -\log P(x)
  • Certain events (P=1) have zero information
  • Less likely events carry more information

Entropy

Entropy is the average information content of a distribution, measuring uncertainty:

H(P) = -\sum_{x} P(x) \log P(x)

Cross-Entropy

Cross-entropy measures the average number of bits (or nats, if natural logarithms are used) needed to encode samples drawn from distribution P using a code optimized for distribution Q:

H(P, Q) = -\sum_{x} P(x) \log Q(x)
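
To make the two definitions concrete, here is a minimal NumPy sketch. The distributions `p` and `q` are made-up toy values, and natural logs are used, so the results are in nats rather than bits.

```python
import numpy as np

def entropy(p):
    """H(P) = -sum_x P(x) log P(x), in nats."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x), in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

p = np.array([0.7, 0.2, 0.1])   # "true" distribution (toy example)
q = np.array([0.5, 0.3, 0.2])   # model's guess (toy example)

print(entropy(p))           # ~0.802 nats
print(cross_entropy(p, q))  # ~0.887 nats, always >= H(P)
```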

Cross-Entropy Loss in Classification

In classification tasks:

  • P is the true distribution (one-hot labels)
  • Q is the model's predicted probability distribution (softmax output)

Binary Cross-Entropy

L = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]

Where:

  • y \in \{0, 1\} is the true label
  • \hat{y} \in (0, 1) is the predicted probability
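
A minimal NumPy sketch of this formula; the clipping constant `eps` is an implementation choice to keep the log finite, not part of the definition.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """BCE for a single prediction or an array of predictions."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # keep log() finite
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(binary_cross_entropy(1, 0.9))   # ~0.105: confident and correct
print(binary_cross_entropy(1, 0.1))   # ~2.303: confident and wrong
```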

Categorical Cross-Entropy

L = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)

Since true labels are one-hot, only the log probability of the correct class is computed.
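
A sketch of the same computation for a single sample, assuming `y_hat` already sums to 1 (e.g. it came out of a softmax); the toy values here are illustrative.

```python
import numpy as np

def categorical_cross_entropy(y_onehot, y_hat, eps=1e-12):
    """CE for one sample: -sum_c y_c * log(y_hat_c)."""
    y_hat = np.clip(y_hat, eps, 1.0)
    return -np.sum(y_onehot * np.log(y_hat))

y = np.array([1, 0, 0])            # true class: index 0
y_hat = np.array([0.6, 0.3, 0.1])  # softmax output (toy values)
print(categorical_cross_entropy(y, y_hat))  # -log(0.6) ~ 0.511
```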

Why Cross-Entropy Instead of MSE?

You might ask: why not just use Mean Squared Error (MSE) to measure classification loss?

Problem 1: Vanishing Gradients

With MSE + sigmoid, the gradient can shrink even when the prediction is far from correct:

\frac{\partial L}{\partial z} = (\hat{y} - y) \cdot \hat{y}(1-\hat{y})

When \hat{y} is close to 0 or 1, the factor \hat{y}(1-\hat{y}) approaches 0, so a confidently wrong prediction learns very slowly.
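
A small numeric illustration of that factor, assuming a single sigmoid output and the squared-error loss L = ½(ŷ − y)² that the gradient above corresponds to; the probe values for `y_hat` are arbitrary.

```python
def mse_sigmoid_grad(y, y_hat):
    """dL/dz for L = 0.5 * (y_hat - y)**2 with y_hat = sigmoid(z)."""
    return (y_hat - y) * y_hat * (1 - y_hat)

# True label is 1; the model is increasingly (and wrongly) confident of 0.
for y_hat in [0.5, 0.1, 0.01, 0.001]:
    print(y_hat, mse_sigmoid_grad(1.0, y_hat))
# The gradient magnitude shrinks toward 0 as the prediction gets *more* wrong.
```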

Problem 2: Probabilistic Interpretation

Cross-entropy directly relates to probability; minimizing cross-entropy is equivalent to maximum likelihood estimation.

Advantages of Cross-Entropy

Using Cross-Entropy + Softmax, the gradient is very clean:

\frac{\partial L}{\partial z_i} = \hat{y}_i - y_i

The more wrong the prediction, the larger the gradient and the faster the learning!
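
A quick finite-difference check of this identity, using made-up logits and class 0 as the true label.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def ce_from_logits(z, y_onehot):
    return -np.sum(y_onehot * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])      # toy logits
y = np.array([1.0, 0.0, 0.0])      # true class: 0

analytic = softmax(z) - y          # claimed gradient: y_hat - y

numeric = np.zeros_like(z)         # central finite differences
h = 1e-5
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += h
    zm[i] -= h
    numeric[i] = (ce_from_logits(zp, y) - ce_from_logits(zm, y)) / (2 * h)

print(analytic)  # ~[-0.341, 0.242, 0.099]
print(numeric)   # matches to high precision
```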

Intuitive Understanding of Cross-Entropy

Let’s understand cross-entropy behavior with examples:

Case 1: Perfect Prediction

  • True: Cat (1,0,0)
  • Predicted: (0.99, 0.005, 0.005)
  • Loss: -\log(0.99) \approx 0.01

Case 2: Uncertain Prediction

  • True: Cat (1,0,0)
  • Predicted: (0.6, 0.3, 0.1)
  • Loss: -\log(0.6) \approx 0.51

Case 3: Wrong Prediction

  • True: Cat (1,0,0)
  • Predicted: (0.1, 0.8, 0.1)
  • Loss: -\log(0.1) \approx 2.30

Case 4: Extremely Wrong

  • True: Cat (1,0,0)
  • Predicted: (0.01, 0.98, 0.01)
  • Loss: -\log(0.01) \approx 4.61

As you can see, the more confidently wrong the prediction, the greater the penalty!
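
The four cases can be reproduced in a few lines; the probability vectors are the toy values from the text.

```python
import numpy as np

y = np.array([1, 0, 0])  # true class: cat
cases = {
    "perfect":         [0.99, 0.005, 0.005],
    "uncertain":       [0.6, 0.3, 0.1],
    "wrong":           [0.1, 0.8, 0.1],
    "extremely wrong": [0.01, 0.98, 0.01],
}
for name, y_hat in cases.items():
    loss = -np.sum(y * np.log(y_hat))   # only the true class contributes
    print(f"{name:16s} loss = {loss:.2f}")
# perfect 0.01, uncertain 0.51, wrong 2.30, extremely wrong 4.61
```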

Numerical Stability

When implementing, directly computing \log(\hat{y}) can cause problems:

  • When \hat{y} is close to 0, \log(\hat{y}) \rightarrow -\infty

Solutions:

  1. Clip probabilities: \hat{y} = \text{clip}(\hat{y}, \epsilon, 1-\epsilon)
  2. Combined computation: merge softmax and cross-entropy and work in log-softmax space (see the sketch below)
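
Here is a sketch of the second approach: cross-entropy is computed directly from logits via a log-sum-exp style log-softmax, so no logarithm is ever taken of a near-zero probability. Frameworks typically expose this as a fused operation (e.g. PyTorch's `nn.CrossEntropyLoss`); the helper below is only an illustration.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax via the log-sum-exp trick."""
    z = z - np.max(z)                       # shift so the largest logit is 0
    return z - np.log(np.sum(np.exp(z)))

def ce_from_logits(z, true_class):
    """Cross-entropy straight from logits; no explicit probabilities."""
    return -log_softmax(z)[true_class]

z = np.array([1000.0, -5.0, 2.0])  # logits this large would overflow a naive exp()
print(ce_from_logits(z, 0))         # tiny loss for the dominant class, no inf/nan
print(ce_from_logits(z, 1))         # huge but finite loss for the wrong class
```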

Variants and Extensions

1. Weighted Cross-Entropy
Handles class imbalance:

L = -\sum_{c} w_c \cdot y_c \log(\hat{y}_c)
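
A per-sample sketch with made-up class weights; here the rare class is simply weighted 5x.

```python
import numpy as np

def weighted_cross_entropy(y_onehot, y_hat, weights, eps=1e-12):
    """-sum_c w_c * y_c * log(y_hat_c); weights holds one value per class."""
    y_hat = np.clip(y_hat, eps, 1.0)
    return -np.sum(weights * y_onehot * np.log(y_hat))

weights = np.array([1.0, 5.0])   # class 1 is rare -> weigh it 5x (illustrative)
y = np.array([0, 1])             # true class: the rare one
y_hat = np.array([0.7, 0.3])
print(weighted_cross_entropy(y, y_hat, weights))  # 5 * -log(0.3) ~ 6.02
```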

2. Focal Loss
Focuses on hard-to-classify samples:

L = -\sum_{c} (1-\hat{y}_c)^\gamma \cdot y_c \log(\hat{y}_c)
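
A per-sample sketch; \gamma = 2 follows the value commonly used in the original Focal Loss paper, and the "easy" and "hard" probability vectors are toy examples.

```python
import numpy as np

def focal_loss(y_onehot, y_hat, gamma=2.0, eps=1e-12):
    """-sum_c (1 - y_hat_c)^gamma * y_c * log(y_hat_c)."""
    y_hat = np.clip(y_hat, eps, 1.0)
    return -np.sum((1 - y_hat) ** gamma * y_onehot * np.log(y_hat))

y = np.array([1, 0, 0])
easy = np.array([0.9, 0.05, 0.05])   # already classified well
hard = np.array([0.3, 0.6, 0.1])     # misclassified

# Focal loss nearly zeroes out the easy sample but keeps most of the
# cross-entropy signal for the hard one.
print(focal_loss(y, easy), -np.log(0.9))   # ~0.001 vs ~0.105
print(focal_loss(y, hard), -np.log(0.3))   # ~0.590 vs ~1.204
```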

3. Label Smoothing
Prevents overconfidence:

y_{\text{smooth}} = (1-\epsilon) \cdot y + \frac{\epsilon}{C}
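
A sketch of how the target changes; \epsilon = 0.1 is a typical choice, assumed here for illustration.

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    """(1 - eps) * y + eps / C, where C is the number of classes."""
    C = y_onehot.shape[-1]
    return (1 - eps) * y_onehot + eps / C

y = np.array([1.0, 0.0, 0.0])
print(smooth_labels(y))   # [0.9333, 0.0333, 0.0333]
# The model is now trained toward ~0.93 confidence instead of 1.0,
# so it is never pushed to produce an unbounded logit gap.
```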

Cross-Entropy vs KL Divergence

Cross-entropy is closely related to KL divergence (relative entropy):

D_{KL}(P \| Q) = H(P, Q) - H(P)

Since H(P) is constant with respect to the model (for one-hot labels it is exactly 0), minimizing cross-entropy is equivalent to minimizing KL divergence.
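
A numeric check of the identity on two toy distributions (the same made-up `p` and `q` as earlier):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

H_p  = -np.sum(p * np.log(p))       # entropy of P
H_pq = -np.sum(p * np.log(q))       # cross-entropy H(P, Q)
kl   = np.sum(p * np.log(p / q))    # KL(P || Q)

print(np.isclose(kl, H_pq - H_p))   # True
```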

Applications

Cross-entropy loss is widely used in:

  • Image classification
  • Text classification
  • Language models
  • Object detection
  • Semantic segmentation
  • And almost all classification tasks

Understanding cross-entropy loss is a key step in mastering deep learning classification models. It connects information theory with machine learning, providing both theoretical foundation and practical guidance for model training.