Preventing Models from “Memorizing”: Understanding L1/L2 Regularization
In machine learning, there’s a classic problem: a model performs well on training data but poorly on new data. This phenomenon is called Overfitting. It’s like a student who memorizes the answers to every practice problem but is stumped when the exam presents an unfamiliar question type.
Regularization is an important weapon against overfitting. The two most classic methods are L1 regularization and L2 regularization.
What is Regularization?
The core idea of regularization is simple: penalize model complexity.
Original loss function:

$$L_{\text{data}}(w) = \frac{1}{N}\sum_{i=1}^{N} \ell\big(f(x_i; w),\, y_i\big)$$

With regularization:

$$L_{\text{total}}(w) = L_{\text{data}}(w) + \lambda\, R(w)$$

Where $\lambda$ is the regularization strength, controlling the penalty intensity, and $R(w)$ is the penalty term.
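A minimal, framework-agnostic sketch of this objective (the helper names `data_loss`, `l1_penalty`, `l2_penalty` and the toy data are illustrative, not from any particular library):

```python
import numpy as np

def data_loss(w, X, y):
    """Mean squared error of a linear model -- stands in for any data loss."""
    return np.mean((X @ w - y) ** 2)

def l2_penalty(w):
    return np.sum(w ** 2)

def l1_penalty(w):
    return np.sum(np.abs(w))

def total_loss(w, X, y, lam, penalty):
    """Regularized objective: data loss plus lambda times the penalty term."""
    return data_loss(w, X, y) + lam * penalty(w)

# Toy usage
rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(100, 5)), rng.normal(size=100), rng.normal(size=5)
print(total_loss(w, X, y, lam=0.01, penalty=l2_penalty))
print(total_loss(w, X, y, lam=0.01, penalty=l1_penalty))
```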
L2 Regularization (Ridge / Weight Decay)
L2 regularization penalizes the sum of squared weights:

$$L_{\text{total}} = L_{\text{data}} + \frac{\lambda}{2} \sum_i w_i^2$$

(the factor of $\tfrac{1}{2}$ is a common convention that keeps the gradient clean)
Characteristics:
- Weights are “pushed toward” smaller values but rarely become exactly 0
- All features are retained, just with smaller weights
- Gradient of the penalty: $\dfrac{\partial}{\partial w_i}\Big(\dfrac{\lambda}{2} w_i^2\Big) = \lambda w_i$, i.e., proportional to the weight itself
Intuitive Understanding:
Imagine each weight is connected to a spring, with the other end fixed at the origin. The spring force is proportional to weight magnitude, constantly pulling weights back to zero. But as weights approach zero, the force becomes smaller, making it hard to reach exactly zero.
L2 regularization is also called Weight Decay because each gradient-descent update becomes:

$$w \leftarrow w - \eta\big(\nabla L_{\text{data}} + \lambda w\big) = (1 - \eta\lambda)\, w - \eta\, \nabla L_{\text{data}}$$

Since $(1 - \eta\lambda) < 1$, the weights continuously “decay” toward zero at every step.
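A minimal sketch of this update in plain NumPy (the quadratic toy loss, learning rate `lr`, and strength `lam` are illustrative assumptions):

```python
import numpy as np

# Toy data loss: 0.5 * ||w - w_star||^2, whose gradient is simply (w - w_star)
w_star = np.array([3.0, -2.0, 0.5])
w = np.zeros(3)

lr, lam = 0.1, 0.01  # illustrative learning rate and L2 strength

for _ in range(1000):
    grad_data = w - w_star      # gradient of the toy data loss
    grad_penalty = lam * w      # gradient of (lam/2) * ||w||^2
    w -= lr * (grad_data + grad_penalty)
    # equivalently: w = (1 - lr*lam) * w - lr * grad_data

print(w)  # close to w_star, but shrunk slightly toward zero
```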
L1 Regularization (Lasso)
L1 regularization penalizes the sum of absolute weight values:

$$L_{\text{total}} = L_{\text{data}} + \lambda \sum_i |w_i|$$
Characteristics:
- Tends to produce sparse weights (many weights exactly equal to 0)
- Automatically performs feature selection
- Gradient of the penalty: $\dfrac{\partial}{\partial w_i}\big(\lambda |w_i|\big) = \lambda\, \mathrm{sign}(w_i)$ (undefined at $w_i = 0$), with constant magnitude no matter how small the weight is
Intuitive Understanding:
L1’s penalty force is independent of weight magnitude (only looks at sign). No matter how small the weight, the “push” is constant. This means small weights can easily be pushed all the way to exact zero.
This is why L1 produces sparse models—unimportant features are directly “turned off.”
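The contrast (“proportional pull” vs. “constant push”) can be seen in a tiny numerical sketch; the values are illustrative, and the clamp at zero in the L1 update mirrors the soft-thresholding used by practical L1 solvers (a raw sign-based step would otherwise oscillate around zero):

```python
# Apply only the penalty gradients to a small positive weight and watch it shrink.
lr, lam = 0.1, 0.1
w_l2 = w_l1 = 0.05

for _ in range(200):
    # L2: pull proportional to the weight -- shrinks geometrically, never hits 0
    w_l2 -= lr * lam * w_l2
    # L1: constant step of size lr*lam toward zero, clamped so it stops at exactly 0
    w_l1 = max(0.0, w_l1 - lr * lam)

print(f"L2 weight after 200 steps: {w_l2:.6f}")  # small but still nonzero
print(f"L1 weight after 200 steps: {w_l1:.6f}")  # exactly 0.0
```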
L1 vs L2: Geometric Perspective
From a geometric perspective, L1 and L2 regularization have different constraint region shapes:
L2 Constraint: Circle (2D) or hypersphere (high-D)
L1 Constraint: Diamond (2D) or cross-polytope (high-D), with sharp corners lying on the coordinate axes
When the loss function’s contour lines first touch the constraint region, the point of contact under the L1 constraint is much more likely to be one of those corners (i.e., a point where some $w_i = 0$), which explains why L1 produces sparse solutions.
Choosing Between L1 and L2 in Practice
| Situation | Recommendation |
|---|---|
| Want to keep all features | L2 |
| Want automatic feature selection | L1 |
| Features are highly correlated | L2 (L1 may select only one) |
| Need interpretability | L1 (sparse models easier to understand) |
| Neural networks | Usually L2 (Weight Decay) |
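To make the first two rows concrete, here is a sketch using scikit-learn (the dataset parameters and `alpha` value are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 50 features, but only 5 of them actually influence the target
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1
ridge = Ridge(alpha=1.0).fit(X, y)   # L2

print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # typically most of the 50
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically none
```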
Elastic Net: L1 + L2
Elastic Net combines the advantages of both L1 and L2 by applying both penalties at once:

$$L_{\text{total}} = L_{\text{data}} + \lambda_1 \sum_i |w_i| + \lambda_2 \sum_i w_i^2$$
It can produce sparsity while handling correlated features.
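A minimal scikit-learn sketch; note that scikit-learn parameterizes the mix with a single `alpha` plus an `l1_ratio` between 0 and 1 rather than two separate λ values (all numbers here are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

# l1_ratio=0.5 gives equal weight to the L1 and L2 parts of the penalty
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print("Zero coefficients:", (enet.coef_ == 0).sum())  # sparse, but less aggressively so than pure Lasso
```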
Regularization in Deep Learning
In neural networks, L2 regularization (weight decay) is most commonly used. A typical PyTorch setup looks like the following (the learning rate and decay values are illustrative, not universal recommendations):

```python
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,            # illustrative learning rate
                             weight_decay=1e-4)  # L2 strength (defaults to 0)
```
Other regularization techniques:
- Dropout: Randomly zero out neuron activations during training (see the sketch after this list)
- Batch Normalization: Indirect regularization effect
- Early Stopping: Stop training early
- Data Augmentation: Increase training data diversity
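As a concrete illustration of the Dropout item above, a minimal PyTorch sketch (layer sizes and the dropout probability are arbitrary illustrative choices):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # in training mode, each activation is zeroed with probability 0.5
    nn.Linear(64, 10),
)

model.train()                           # dropout active
out_train = model(torch.randn(32, 128))

model.eval()                            # dropout disabled at inference time
out_eval = model(torch.randn(32, 128))
```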
How to Choose Regularization Strength λ?
If λ is too small, the regularization effect is barely noticeable; if λ is too large, the model underfits. Selection methods:
- Cross-validation: Try multiple values, choose best on validation set
- Grid search: Systematically search parameter space
- Rule of thumb: try values spanning several orders of magnitude on a log scale (a minimal sweep sketch follows this list)
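A sketch of such a sweep with scikit-learn’s GridSearchCV, where Ridge’s `alpha` plays the role of λ (the candidate grid and dataset are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# Candidate regularization strengths, spaced evenly on a log scale
param_grid = {"alpha": np.logspace(-4, 2, 13)}

search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print("Best alpha:", search.best_params_["alpha"])
```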
Mathematical Deep Dive: Bayesian Perspective
From a Bayesian perspective:
- L2 regularization ≈ Gaussian prior on weights
- L1 regularization ≈ Laplace prior on weights
Regularization is actually doing Maximum A Posteriori (MAP) estimation rather than Maximum Likelihood.
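A one-line sketch of why this holds for L2, assuming a zero-mean Gaussian prior with variance $\sigma^2$ (the L1/Laplace case is analogous, with λ set by the Laplace scale):

$$\hat{w}_{\text{MAP}} = \arg\max_w \big[\log p(\mathcal{D} \mid w) + \log p(w)\big] = \arg\min_w \Big[-\log p(\mathcal{D} \mid w) + \tfrac{1}{2\sigma^2}\, \lVert w \rVert_2^2\Big]$$

so maximizing the posterior under the prior $p(w) \propto \exp\!\big(-\lVert w \rVert_2^2 / 2\sigma^2\big)$ is the same as minimizing the data loss plus an L2 penalty with $\lambda = 1/\sigma^2$.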
Summary
| Property | L1 | L2 |
|---|---|---|
| Penalty term | $\lambda \sum_i \lvert w_i \rvert$ | $\tfrac{\lambda}{2} \sum_i w_i^2$ |
| Sparsity | High | Low |
| Feature selection | Yes | No |
| Computational efficiency | Lower | Higher |
| Common name | Lasso | Ridge / Weight Decay |
Regularization is a fundamental tool in the machine learning toolbox. Understanding the difference between L1 and L2 helps you better control model complexity and build models with stronger generalization ability.