The Wisdom of Image Compression: Understanding Pooling Layers
Imagine you’re looking at a huge mural from a distance. You don’t need to see every detail—just grasping the main color blocks and shapes lets you understand the painting’s content. This strategy of “capturing the big picture while ignoring small details” is the core idea behind Pooling Layers in convolutional neural networks.
What is Pooling?
Pooling is a downsampling operation. It divides the input feature map into several small regions, then replaces each region with a representative value (such as maximum or average).
With the typical 2×2 window and stride 2, the feature map's height and width are halved, but the most salient information is preserved.
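In PyTorch terms (the tensor sizes below are arbitrary, chosen only for illustration), a 2×2 max-pooling layer simply halves the spatial dimensions of a feature map:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)      # (batch, channels, height, width)
pool = nn.MaxPool2d(kernel_size=2)  # stride defaults to kernel_size, i.e. stride=2
print(pool(x).shape)                # torch.Size([1, 64, 16, 16])
```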
Types of Pooling
1. Max Pooling
Takes the maximum value from each region. This is the most commonly used pooling method.
For example, applying 2×2 max pooling with stride 2 to a 4×4 input (the numbers are illustrative):

```
Input (4×4):      Output (2×2):
1  3  2  4
5  6  7  8        6  8
3  2  1  0        3  6
1  2  3  6
```
Max pooling intuition: Keep only the strongest activation signals, ignore weak responses.
2. Average Pooling
Takes the average value of each region.
For example, applying 2×2 average pooling with stride 2 to a 4×4 input (the numbers are illustrative):

```
Input (4×4):      Output (2×2):
1  3  2  4
5  7  6  8        4  5
4  2  1  3        2  4
2  0  5  7
```
Average pooling preserves the overall intensity information of regions.
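The two toy examples above can be checked directly with PyTorch's functional pooling ops:

```python
import torch
import torch.nn.functional as F

x_max = torch.tensor([[1., 3., 2., 4.],
                      [5., 6., 7., 8.],
                      [3., 2., 1., 0.],
                      [1., 2., 3., 6.]]).reshape(1, 1, 4, 4)
x_avg = torch.tensor([[1., 3., 2., 4.],
                      [5., 7., 6., 8.],
                      [4., 2., 1., 3.],
                      [2., 0., 5., 7.]]).reshape(1, 1, 4, 4)

print(F.max_pool2d(x_max, kernel_size=2))  # [[6., 8.], [3., 6.]]
print(F.avg_pool2d(x_avg, kernel_size=2))  # [[4., 5.], [2., 4.]]
```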
3. Global Pooling
Pools over the entire spatial extent of the feature map, producing a single value per channel (a short sketch follows the list below).
- Global Average Pooling (GAP): Often used in the last layer of classification networks, replacing fully connected layers
- Global Max Pooling (GMP): Takes the maximum value of each channel of the feature map
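A minimal sketch of both global variants. In PyTorch there is no dedicated "global pooling" layer; the usual idiom is adaptive pooling with an output size of 1×1 (the feature-map size below is just an example):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 512, 7, 7)  # e.g. the last feature map of a classification backbone

gap = nn.AdaptiveAvgPool2d(1)  # global average pooling
gmp = nn.AdaptiveMaxPool2d(1)  # global max pooling

print(gap(x).shape)  # torch.Size([1, 512, 1, 1]) -- one value per channel
print(gmp(x).shape)  # torch.Size([1, 512, 1, 1])

# GAP is literally the per-channel spatial mean
print(torch.allclose(gap(x).flatten(), x.mean(dim=(2, 3)).flatten()))  # True
```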
Purposes of Pooling
1. Reduce Computation
Smaller feature maps significantly reduce computation in subsequent layers.
2×2 pooling with stride 2 reduces the number of spatial positions, and hence the downstream computation, to roughly 1/4.
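A rough back-of-the-envelope count (the layer sizes here are made up for illustration): the same 3×3 convolution costs about four times as many multiply-accumulates before a 2×2 pooling as after it.

```python
# Multiply-accumulate count for a 3x3 conv, 64 -> 64 channels,
# on a 56x56 map versus the 28x28 map left after 2x2 / stride-2 pooling.
k, c_in, c_out = 3, 64, 64

def conv_macs(h, w):
    # one MAC per kernel element, per input channel, per output channel, per output position
    return h * w * c_in * c_out * k * k

before = conv_macs(56, 56)  # ~115.6 million MACs
after = conv_macs(28, 28)   # ~28.9 million MACs
print(before / after)       # 4.0 -- pooling quarters the spatial positions
```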
2. Increase Receptive Field
After pooling, the same-sized convolution kernel can “see” a larger area of the original image.
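A quick receptive-field calculation makes this concrete, using the standard layer-by-layer recurrence (rf ← rf + (k − 1)·jump, jump ← jump·stride); the specific layer stack below is only an example:

```python
# Receptive field of a small stack, computed layer by layer.
# Each entry is (kernel_size, stride).
def receptive_field(layers):
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each new tap reaches `jump` input pixels further
        jump *= s             # stride compounds the spacing between taps
    return rf

print(receptive_field([(3, 1), (3, 1)]))          # 5 : two 3x3 convs, no pooling
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8 : same convs with a 2x2 pool in between
```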
3. Provide Translation Invariance
When an object moves slightly in the image, the pooled features remain similar. Each pooling layer only adds a small amount of shift tolerance, but stacked over many layers this helps the network recognize "a cat" regardless of its exact position.
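A tiny demonstration of that small-shift tolerance (the tensors are made up): a strong activation moved by one pixel, as long as it stays inside the same pooling window, produces an identical pooled output.

```python
import torch
import torch.nn.functional as F

x = torch.zeros(1, 1, 6, 6)
x_shifted = torch.zeros(1, 1, 6, 6)
x[0, 0, 2, 2] = 1.0          # a single strong activation ...
x_shifted[0, 0, 2, 3] = 1.0  # ... shifted right by one pixel, still inside the same 2x2 window

out = F.max_pool2d(x, kernel_size=2)
out_shifted = F.max_pool2d(x_shifted, kernel_size=2)
print(torch.equal(out, out_shifted))  # True: the 1-pixel shift is absorbed by the pooling window
```

Shifts that cross a window boundary do change the output, which is why the invariance is only approximate and accumulates gradually over depth.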
4. Control Overfitting
By shrinking the feature maps (and therefore the number of activations any subsequent fully connected layer has to consume), pooling provides a mild regularization effect.
Pooling Parameters
Pool Size
- Commonly 2×2, halving the size each time
- Larger windows lose more details
Stride
- Usually equals window size (no overlap)
- Stride smaller than window size creates overlap
Padding
- Usually not used
- “Same”-style padding is occasionally used so that border pixels are not dropped when the input size is not a multiple of the window (see the sketch after this list)
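A sketch of how the three parameters interact; the output size follows the usual formula out = ⌊(H + 2·padding − kernel) / stride⌋ + 1 (the tensor sizes are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)

print(nn.MaxPool2d(kernel_size=2)(x).shape)                       # [1, 16, 16, 16] : 2x2, stride 2 (default)
print(nn.MaxPool2d(kernel_size=3, stride=1)(x).shape)             # [1, 16, 30, 30] : overlapping windows
print(nn.MaxPool2d(kernel_size=3, stride=2, padding=1)(x).shape)  # [1, 16, 16, 16] : padded, halves the size
```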
Rethinking Pooling
In recent years, the necessity of pooling layers has been questioned:
Alternative 1: Strided Convolution
Use convolution with stride 2 instead of pooling:
```
Conv(stride=2)   replaces   Conv(stride=1) + MaxPool(2)
```
This lets the network learn how to downsample itself.
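A shape-level comparison of the two options (the channel counts are arbitrary); both halve the feature map, but the strided convolution learns its own downsampling weights:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# Option A: convolution followed by max pooling
conv_pool = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
    nn.MaxPool2d(kernel_size=2),
)

# Option B: a single strided convolution
strided = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)

print(conv_pool(x).shape)  # torch.Size([1, 128, 16, 16])
print(strided(x).shape)    # torch.Size([1, 128, 16, 16])
```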
Alternative 2: Dilated Convolution
Use dilated convolution to increase receptive field without reducing feature map size.
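A short sketch of the dilated alternative: with dilation 2 and matching padding, a 3×3 convolution covers a 5×5 neighbourhood while the spatial size stays unchanged (channel counts are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# A 3x3 kernel with dilation=2 has an effective extent of 5, so padding=2 preserves the size
dilated = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)
print(dilated(x).shape)  # torch.Size([1, 64, 32, 32]) -- larger receptive field, no downsampling
```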
Alternative 3: Global Average Pooling
Use GAP directly at the network’s end, avoiding fully connected layers.
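A sketch of a GAP classification head next to a flatten-and-FC head (the channel count, spatial size, and number of classes are made up). GAP removes the dependence on a fixed spatial size and drastically cuts the head's parameter count:

```python
import torch
import torch.nn as nn

num_classes, channels = 10, 512
x = torch.randn(1, channels, 7, 7)  # final backbone feature map

# Flatten + fully connected head: parameter count scales with the spatial size
fc_head = nn.Sequential(nn.Flatten(), nn.Linear(channels * 7 * 7, num_classes))

# GAP head: one value per channel, then a small linear layer
gap_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, num_classes))

print(sum(p.numel() for p in fc_head.parameters()))   # 250890 parameters
print(sum(p.numel() for p in gap_head.parameters()))  # 5130 parameters
print(gap_head(x).shape)                              # torch.Size([1, 10])
```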
Pooling in Classic Architectures
VGGNet
- Uses 2×2 max pooling after each conv block
- Five 2×2 poolings take the spatial size from 224 down to 7 (224 → 112 → 56 → 28 → 14 → 7)
ResNet
- Uses a single 3×3, stride-2 max pooling right after the initial 7×7 convolution; later stages downsample with stride-2 convolutions instead
- Uses global average pooling at the end
Inception/GoogLeNet
- Uses max pooling branch in Inception modules
- Uses global average pooling at the end
Modern Trends
- More use of strided convolutions
- Avoid excessive pooling in tasks requiring fine spatial information (like segmentation)
Code Example
```python
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # 2×2 max pooling; halves height and width
```
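Putting the pieces together, here is a minimal sketch of a small classifier that downsamples with max pooling and finishes with global average pooling (the layer sizes and class count are illustrative, not taken from the text above):

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Toy classifier: two conv blocks with 2x2 max pooling, then a GAP head."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),                 # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),                 # 16x16 -> 8x8
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                     # global average pooling
            nn.Flatten(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))

model = SmallCNN()
print(model(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 10])
```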
Although simple, pooling layers have played an important role in the success of CNNs. Understanding the principles and trade-offs of pooling will help you design and optimize convolutional neural networks more effectively.