Memory Access Optimization

Memory Access Optimization: Clearing AI Computing’s “Traffic Bottleneck”

In the world of GPU computing, there’s a harsh reality: computation speed far exceeds data transfer speed. It’s like a super factory where production lines run at full speed, but raw material transport can’t keep up, leaving workers waiting. Memory access optimization is the key technology for solving this “traffic bottleneck.”

Computation vs Memory Access: Which is the Bottleneck?

Let’s look at some comparative data:

Hardware                 Performance
A100 GPU compute         312 TFLOPS (FP16)
A100 memory bandwidth    2 TB/s

Simple calculation:

  • Assume each computation needs to read 2 data items and write 1 result
  • Each FP16 data = 2 bytes
  • Data per operation = 6 bytes
  • Bandwidth needed for 312 TFLOPS = 312 × 10¹² × 6 = 1872 TB/s

Actual bandwidth is only 2 TB/s—nearly 1000x difference!

This means: if your program frequently accesses GPU memory, the powerful compute capability can’t be utilized—most time is spent waiting for data.

What is Memory Access Optimization?

Memory Access Optimization refers to using various techniques to reduce or accelerate memory access, so computation no longer “waits for ingredients.”

Core goals:

  1. Reduce access count: Don’t read if you don’t have to
  2. Speed up access: When you must read, use the fastest method
  3. Hide access latency: Compute other things while reading data

GPU Memory Hierarchy Review

To understand memory optimization, first know the GPU’s “storage map”:

Slow ← ─────────────────────────────────────────────── → Fast

Global memory (HBM)    L2 cache    Shared memory / L1 (SMEM)    Registers
~2 TB/s                ~4 TB/s     ~19 TB/s                     Fastest
16-80 GB               ~40 MB      ~164 KB/SM                   ~256 KB/SM

Golden rule: Keep data on the right side (fast) as much as possible.

Core Optimization Techniques

1. Memory Coalescing

GPU reads memory in “batches”—128 bytes at a time. If 32 threads (one Warp) access consecutive addresses, everything can be read at once.

Bad example:

// Strided access - inefficient
data[threadIdx.x * stride]   // e.g. stride = 128: thread 0 → element 0, thread 1 → 128, thread 2 → 256 ...
// Requires multiple memory transactions per warp

Correct approach:

// Consecutive access - efficient
data[threadIdx.x] // Thread 0 accesses 0, thread 1 accesses 1...
// One memory transaction handles all

Performance difference: Coalesced access can be 10x faster than non-coalesced.
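
A minimal, self-contained sketch of the two patterns (the kernel names and the stride parameter are illustrative, not from a particular codebase):

// Coalesced copy: consecutive threads touch consecutive floats, so a warp's
// 32 loads fall into one contiguous 128-byte segment.
__global__ void copyCoalesced(const float* __restrict__ in, float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided copy: consecutive threads touch addresses `stride` elements apart,
// so a warp's 32 loads can require up to 32 separate memory transactions.
__global__ void copyStrided(const float* __restrict__ in, float* __restrict__ out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}

Timing the two kernels over the same amount of data (for example with cudaEvent timers) is a quick way to see the gap on your own GPU.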

2. Data Reuse

If the same data is used multiple times, load it to fast storage (shared memory) and reuse it.

Matrix multiplication example:

Naive implementation: Each element computation reads from global memory
C[i][j] = A[i][0]*B[0][j] + A[i][1]*B[1][j] + ...
→ Every multiply-add accesses global memory, super slow

Tiling optimization: Load small data blocks to shared memory
1. Load a tile of A to shared memory
2. Load a tile of B to shared memory
3. Complete all computations in shared memory
4. Load next tile...
→ Global memory access reduced by N times (N = tile size)

Code sketch:

__global__ void matmulTiled(const float* A, const float* B, float* C, int N) {
    __shared__ float tileA[TILE][TILE];
    __shared__ float tileB[TILE][TILE];
    int ty = threadIdx.y, tx = threadIdx.x;
    int row = blockIdx.y * TILE + ty;          // output row this thread computes
    int col = blockIdx.x * TILE + tx;          // output column this thread computes
    float sum = 0.0f;

    // Walk over tiles of the shared (K) dimension; assumes N is a multiple of TILE
    for (int t = 0; t < N / TILE; t++) {
        // Load one tile of A and B (each element read from global memory once)
        tileA[ty][tx] = A[row * N + t * TILE + tx];
        tileB[ty][tx] = B[(t * TILE + ty) * N + col];
        __syncthreads();

        // Compute on shared memory (each loaded element reused TILE times)
        for (int k = 0; k < TILE; k++) {
            sum += tileA[ty][k] * tileB[k][tx];
        }
        __syncthreads();   // don't overwrite the tiles while other threads still read them
    }
    C[row * N + col] = sum;
}
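
For context, a hypothetical launch configuration for the sketch above (TILE, N, and the device-pointer names dA/dB/dC are assumptions for illustration):

// TILE must be a compile-time constant visible to the kernel, e.g. constexpr int TILE = 32;
dim3 block(TILE, TILE);                        // one thread per output element of the tile
dim3 grid(N / TILE, N / TILE);                 // assumes N is a multiple of TILE
matmulTiled<<<grid, block>>>(dA, dB, dC, N);   // dA/dB/dC are device pointers

With this layout each block reads TILE times fewer values from global memory than the naive kernel, which matches the "reduced by N times (N = tile size)" estimate above.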

3. Prefetching

Load data to cache before it’s needed.

Principle:

Traditional: Compute A → Wait load B → Compute B → Wait load C → ...
Prefetch: Compute A + Prefetch B → Compute B + Prefetch C → ...

Computation and data loading overlap, hiding memory latency.
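
A common way to express this on the GPU is software double buffering in shared memory: while tile t+1 is being loaded into one buffer, the block computes on tile t in the other. A minimal sketch (the kernel name, the 256-element tile width, and the data layout are assumptions for illustration):

// Each block accumulates `numTiles` tiles of 256 floats; while it sums tile t
// from one shared-memory buffer, it prefetches tile t+1 into the other.
__global__ void sumTilesPrefetch(const float* in, float* out, int numTiles) {
    __shared__ float buf[2][256];                 // double buffer; assumes blockDim.x == 256
    int tid = threadIdx.x;
    int cur = 0;

    buf[cur][tid] = in[blockIdx.x * numTiles * 256 + tid];   // prime buffer 0 with tile 0
    __syncthreads();

    float acc = 0.0f;
    for (int t = 0; t < numTiles; t++) {
        int nxt = cur ^ 1;
        if (t + 1 < numTiles)                     // issue the prefetch first ...
            buf[nxt][tid] = in[(blockIdx.x * numTiles + t + 1) * 256 + tid];
        acc += buf[cur][tid];                     // ... then compute on the current tile
        __syncthreads();                          // tile t+1 is now loaded, tile t is done
        cur = nxt;
    }
    out[blockIdx.x * 256 + tid] = acc;            // one partial sum per thread
}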

4. Avoiding Bank Conflicts

Shared memory is divided into 32 banks. If multiple threads simultaneously access different addresses in the same bank, conflicts occur.

Conflict example:

// 32 threads all access different addresses in bank 0
shared_mem[threadIdx.x * 32] // All conflict, serial execution

No-conflict example:

// 32 threads access 32 different banks
shared_mem[threadIdx.x] // No conflict, parallel execution

Solution: Add padding to offset banks.

// Original
__shared__ float data[32][32]; // Each row starts at bank 0

// Optimized
__shared__ float data[32][33]; // Padding, each row offset by 1 bank
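
Why this works: with float data[32][32], a column access such as data[threadIdx.x][c] has a stride of 32 four-byte words, so all 32 threads of a warp map to the same bank (word index mod 32 is identical). With data[32][33] the stride becomes 33 words, and since 33 mod 32 = 1, consecutive rows land in consecutive banks and the column access becomes conflict-free.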

5. Vectorized Access

Use vector types (float4, etc.) to read multiple data at once.

// Scalar access - 4 memory transactions
float a = data[idx];
float b = data[idx+1];
float c = data[idx+2];
float d = data[idx+3];

// Vector access - 1 memory transaction
// (requires data + idx to be 16-byte aligned and idx to be a multiple of 4)
float4 vec = reinterpret_cast<float4*>(data)[idx / 4];
// vec.x, vec.y, vec.z, vec.w directly available

Arithmetic Intensity and Memory Optimization

Arithmetic Intensity = Computation / Memory Access

Operation type             Intensity   Optimization strategy
Element-wise (ReLU)        Very low    Kernel fusion
Matrix-vector multiply     Low         Batch processing
Matrix multiplication      High        Tiling
Convolution                High        im2col + GEMM

Low arithmetic intensity operations need memory optimization most because they “compute little, read much.”
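
A quick worked example, using the same style of estimate as above:

  • FP32 ReLU: 1 FLOP per element, 8 bytes moved (4 read + 4 written) → about 0.125 FLOP/byte
  • N×N matrix multiplication: about 2N³ FLOPs over about 3 × N² × 4 bytes → roughly N/6 FLOP/byte

GEMM's intensity grows with N, so with tiling it can be made compute-bound; element-wise ops stay memory-bound no matter how they are written, which is why fusion (doing more work per byte moved) is their main remedy.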

Real Case: Flash Attention

Flash Attention is a classic memory optimization case:

Traditional Attention problem:

Q × K^T → Store to global memory → Softmax → Store to global memory → × V
Intermediate matrix (N×N) is huge, memory access becomes bottleneck
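
To make that concrete: at sequence length N = 4096 in FP16, the N×N score matrix alone is 4096² × 2 bytes = 32 MB per attention head, far larger than on-chip shared memory, so it has to round-trip through HBM.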

Flash Attention optimization:

1. Block computation (Tiling)
2. Complete Softmax in SRAM (shared memory)
3. Don't store complete N×N intermediate matrix
4. Use Online Softmax trick to avoid multiple passes
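
The online-softmax idea in step 4, in equation form: keep a running maximum m and running sum s, and when a new score x arrives, update

m_new = max(m, x)
s_new = s * exp(m - m_new) + exp(x - m_new)

rescaling the partial output by exp(m - m_new) at the same time. This lets softmax (and the weighted sum with V) be computed in a single pass over blocks, without ever materializing the full row of scores.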

Results:

  • Memory usage: From O(N²) to O(N)
  • Speed: 2-4x improvement

Performance Analysis Methods

Use NVIDIA tools to analyze memory bottlenecks:

## Analyze memory throughput
ncu --metrics l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.per_second \
--metrics dram__bytes_read.sum.per_second \
./my_program

## Check memory coalescing efficiency
ncu --metrics smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct \
./my_program
## Ideal value close to 100%, below 50% indicates serious non-coalesced access
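
If you would rather not memorize individual metric names, recent Nsight Compute versions also bundle them into named sections; the exact section names can be checked on your install (this is an assumption about your ncu version, verify with the list command):

## Memory workload analysis section: bandwidth, coalescing, cache hit rates
ncu --section MemoryWorkloadAnalysis ./my_program

## List the sections available in your ncu version
ncu --list-sections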

Optimization Checklist

Check item       Symptom                            Solution
Coalescing       Memory efficiency < 50%            Adjust access pattern
Data reuse       Too many global memory accesses    Use shared memory
Bank conflicts   Low shared-memory bandwidth        Add padding
Vectorization    Too many scalar accesses           Use float4, etc.
Occupancy        Low SM utilization                 Adjust thread config

Summary

Memory access optimization is key to unleashing GPU’s true computing power. In the reality of “fast compute, slow memory,” whoever better manages data flow achieves higher performance.

Core points:

  1. Coalesced access: Have adjacent threads access adjacent addresses
  2. Data reuse: Keep hot data in fast storage
  3. Prefetch overlap: Parallelize computation and memory access
  4. Avoid conflicts: Bank conflict, cache miss
  5. Increase intensity: Kernel fusion, batch processing

Remember: The best memory access is no memory access. Through fusion, reuse, and caching, minimize data “travel.”