分布式推理

分布式推理:当一张 GPU 装不下大模型

GPT-4、LLaMA-70B、Mixtral-8x22B……这些大模型的参数量动辄数百亿甚至万亿,单张 GPU 根本装不下。分布式推理技术应运而生,让多张 GPU 协同工作,共同完成一个超大模型的推理任务。

为什么需要分布式推理?

单卡的困境

让我们算一笔账:

模型 | 参数量 | FP16 显存需求 | INT8 显存需求
LLaMA-7B | 70亿 | ~14GB | ~7GB
LLaMA-70B | 700亿 | ~140GB | ~70GB
GPT-3 | 1750亿 | ~350GB | ~175GB

目前最强的消费级 GPU(RTX 4090)只有 24GB 显存,即使是专业级的 A100 也只有 80GB。

结论: 大模型必须”分家”部署到多张卡上。
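
上表的显存需求可以用一个很粗略的脚本估算出来(仅为示意:只算权重本身,按 1 GB = 10⁹ 字节折算,未计入 KV Cache、激活和框架开销):

def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """只估算权重占用:参数量 × 每参数字节数(FP16 为 2,INT8 为 1)。"""
    return num_params * bytes_per_param / 1e9

for name, params in [("LLaMA-7B", 7e9), ("LLaMA-70B", 70e9), ("GPT-3", 175e9)]:
    print(f"{name}: FP16 ≈ {weight_memory_gb(params, 2):.0f} GB, "
          f"INT8 ≈ {weight_memory_gb(params, 1):.0f} GB")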

分布式推理的三种策略

就像搬一座大山有多种方法,分布式推理也有多种策略:

1. 张量并行(Tensor Parallelism, TP)

思想: 把单个计算任务”横切”分给多张 GPU。

类比: 一道超大的数学题,把它拆成几部分,每个人算一部分,最后汇总答案。

原始矩阵        ┌──────────────────────────┐
                │      A (4096 × 4096)     │
                └──────────────────────────┘
                        ↓ 横向切分
        ┌─────────────┐          ┌─────────────┐
GPU 0   │ A[:, :2048] │          │ A[:, 2048:] │   GPU 1
        └─────────────┘          └─────────────┘
               ↓                        ↓
           计算部分1                计算部分2
               ↓                        ↓
               └─────── AllReduce ──────┘
                            ↓
                         最终结果

特点:

  • 每层计算都需要 GPU 间通信(AllReduce)
  • 对通信带宽要求高(需要 NVLink 等高速互连)
  • 适合单机多卡场景
  • 可以降低延迟(每张卡计算量减少)
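
下面用 numpy 模拟"各卡算部分结果 + AllReduce 求和"的过程(仅为示意:这里按内部维度切分权重,真实框架中两份部分和位于不同 GPU,通过 NCCL AllReduce 跨卡相加):

import numpy as np

X = np.random.randn(8, 4096)          # 输入激活
W = np.random.randn(4096, 4096)       # 完整权重矩阵
Y_ref = X @ W                         # 单卡基准结果

# TP=2:按内部维度把权重切成两半,每张"卡"只持有一半
W0, W1 = W[:2048, :], W[2048:, :]
X0, X1 = X[:, :2048], X[:, 2048:]

partial0 = X0 @ W0                    # GPU 0 上的部分和
partial1 = X1 @ W1                    # GPU 1 上的部分和

Y_tp = partial0 + partial1            # AllReduce:把部分和相加
assert np.allclose(Y_ref, Y_tp)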

2. 流水线并行(Pipeline Parallelism, PP)

思想: 把模型按层”纵向切分”,不同 GPU 负责不同的层。

类比: 工厂流水线,每个工人负责一道工序,产品依次传递。

输入 → [GPU 0: 层1-8] → [GPU 1: 层9-16] → [GPU 2: 层17-24] → 输出

时间 →
GPU 0: [Batch1层1-8] [Batch2层1-8] [Batch3层1-8] ...
GPU 1: [Batch1层9-16] [Batch2层9-16] ...
GPU 2: [Batch1层17-24] ...

特点:

  • GPU 间通信少(只在层间传递激活值)
  • 存在”流水线气泡”(部分 GPU 空闲等待)
  • 适合跨机器部署
  • 吞吐量高,但单请求延迟可能增加
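
"流水线气泡"的大小有一个常见的近似估算:气泡占比 ≈ (p − 1) / (m + p − 1),其中 p 是流水线阶段数、m 是微批(micro-batch)数量。下面的小脚本按这个公式算一算(假设 GPipe 式顺序调度、各阶段耗时均匀):

def pipeline_bubble_ratio(num_stages: int, num_microbatches: int) -> float:
    """估算流水线气泡占比:(p - 1) / (m + p - 1)。"""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

print(f"{pipeline_bubble_ratio(4, 4):.2f}")    # 0.43:微批太少,近一半时间在空等
print(f"{pipeline_bubble_ratio(4, 32):.2f}")   # 0.09:微批足够多,气泡基本可以忽略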

3. 数据并行(Data Parallelism, DP)

思想: 每张 GPU 都有完整的模型副本,各自处理不同的请求。

类比: 开多家连锁店,每家店都能提供完整服务。

请求1 → GPU 0 (完整模型) → 结果1
请求2 → GPU 1 (完整模型) → 结果2
请求3 → GPU 2 (完整模型) → 结果3

特点:

  • 实现简单
  • 无 GPU 间通信开销(推理时)
  • 前提:单卡能装下整个模型
  • 吞吐量线性扩展

混合并行策略

实际生产中,往往组合使用多种并行策略:

                     数据并行 (DP=2)
        ┌─────────────────┴─────────────────┐
        ▼                                   ▼
   ┌─────────┐                         ┌─────────┐
   │  副本 1 │                         │  副本 2 │
   └─────────┘                         └─────────┘
        │                                   │
        ▼                                   ▼
  流水线并行 (PP=2)                    流水线并行 (PP=2)
┌─────────┬─────────┐               ┌─────────┬─────────┐
│ Stage 0 │ Stage 1 │               │ Stage 0 │ Stage 1 │
└─────────┴─────────┘               └─────────┴─────────┘
     │         │                         │         │
     ▼         ▼                         ▼         ▼
  张量并行 (TP=2)                     张量并行 (TP=2)
┌────┬────┐ ┌────┬────┐           ┌────┬────┐ ┌────┬────┐
│GPU0│GPU1│ │GPU2│GPU3│           │GPU4│GPU5│ │GPU6│GPU7│
└────┴────┘ └────┴────┘           └────┴────┘ └────┴────┘

组合示例:

  • 8 卡服务器
  • TP=4(4卡做一层的张量并行)
  • PP=2(2个流水线阶段)
  • 总共:4 × 2 = 8 卡
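
规划并行度时,唯一的硬约束是各并行度的乘积必须等于总卡数。下面是一个简单的检查函数(仅为示意,函数名为本文虚构):

def check_parallel_plan(total_gpus: int, tp: int, pp: int, dp: int = 1) -> dict:
    assert tp * pp * dp == total_gpus, "TP × PP × DP 必须等于总卡数"
    return {"tensor_parallel": tp, "pipeline_parallel": pp, "data_parallel": dp}

print(check_parallel_plan(8, tp=4, pp=2))          # 对应上面的 8 卡组合示例
print(check_parallel_plan(8, tp=2, pp=2, dp=2))    # 对应混合并行示意图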

技术挑战与解决方案

挑战 1:通信开销

问题: GPU 间数据传输可能成为瓶颈。

解决方案:

  • 使用 NVLink(比 PCIe 快 5-10 倍)
  • 使用 NVSwitch(全互连)
  • 优化通信与计算的重叠

挑战 2:负载均衡

问题: 流水线并行中,不同阶段计算量可能不均。

解决方案:

  • 合理划分层数
  • 使用 Interleaved Schedule(交错调度)

挑战 3:KV Cache 管理

问题: 分布式环境下 KV Cache 管理更复杂。

解决方案:

  • 分布式 KV Cache 池
  • 跨节点的 PagedAttention

主流框架支持

框架 | 张量并行 | 流水线并行 | 备注
vLLM | ✅ | ✅ | 开箱即用
TensorRT-LLM | ✅ | ✅ | 高性能
DeepSpeed | ✅ | ✅ | 灵活配置
Megatron-LM | ✅ | ✅ | 大规模训练/推理

使用示例

vLLM 分布式推理

## 4卡张量并行
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-70b-chat-hf \
--tensor-parallel-size 4

TensorRT-LLM 分布式推理

## 构建支持张量并行的引擎
trtllm-build --checkpoint_dir ./llama-70b \
--output_dir ./engine \
--tp_size 4 \
--pp_size 2

## 运行(使用 mpirun)
mpirun -n 8 python run_inference.py

性能考量

张量并行 vs 流水线并行

特性 | 张量并行 (TP) | 流水线并行 (PP)
通信频率 | 每层都通信 | 仅层间通信
通信量 | 较大 | 较小
延迟 | 较低 | 较高(气泡)
带宽需求 | 高(需 NVLink) | 较低
适用场景 | 单机多卡 | 跨机器

如何选择?

单机多卡 (如 8×A100-SXM4):
→ 优先使用 TP,充分利用 NVLink

多机部署:
→ PP 跨机,TP 机内
→ 例如:2 机 × 4 卡 = PP2 × TP4

模型较小,请求量大:
→ 考虑数据并行 (DP)

实际部署建议

  1. 硬件选择:

    • 优先选择有 NVLink 的机器(如 DGX、HGX)
    • 跨机器用高速网络(InfiniBand)
  2. 并行度规划:

    • TP 度不超过单机卡数
    • PP 度 = 总卡数 / TP 度
  3. 监控指标:

    • 各 GPU 利用率是否均衡
    • 通信时间占比
    • 流水线气泡率

总结

分布式推理是运行超大模型的必备技术。三种核心策略——张量并行、流水线并行、数据并行——各有特点,实际应用中常常组合使用。

关键要点:

  1. 张量并行 (TP): 横切计算,通信密集,适合单机
  2. 流水线并行 (PP): 纵切模型,通信少,适合跨机
  3. 数据并行 (DP): 复制模型,扩展吞吐量
  4. 混合并行: 生产环境的最佳实践

理解分布式推理,你就能部署任意规模的大模型服务。

Distributed Inference: When One GPU Can’t Fit a Large Model

GPT-4, LLaMA-70B, Mixtral-8x22B… these large models have parameters ranging from hundreds of billions to trillions—a single GPU simply cannot hold them. Distributed inference technology emerged to enable multiple GPUs to work together, jointly completing inference tasks for ultra-large models.

Why Do We Need Distributed Inference?

The Single-Card Dilemma

Let’s do some math:

Model | Parameters | FP16 Memory | INT8 Memory
LLaMA-7B | 7B | ~14GB | ~7GB
LLaMA-70B | 70B | ~140GB | ~70GB
GPT-3 | 175B | ~350GB | ~175GB

The most powerful consumer GPU (RTX 4090) only has 24GB of memory, and even professional A100s only have 80GB.

Conclusion: Large models must be “split” and deployed across multiple cards.

Three Strategies for Distributed Inference

Just as there are multiple ways to move a mountain, distributed inference has multiple strategies:

1. Tensor Parallelism (TP)

Idea: “Horizontally slice” a single computation task across multiple GPUs.

Analogy: A super-large math problem is split into parts, each person computes one part, and answers are combined at the end.

Original        ┌──────────────────────────┐
Matrix          │      A (4096 × 4096)     │
                └──────────────────────────┘
                        ↓ Horizontal split
        ┌─────────────┐          ┌─────────────┐
GPU 0   │ A[:, :2048] │          │ A[:, 2048:] │   GPU 1
        └─────────────┘          └─────────────┘
               ↓                        ↓
        Compute Part 1           Compute Part 2
               ↓                        ↓
               └─────── AllReduce ──────┘
                            ↓
                       Final Result

Characteristics:

  • Each layer computation requires GPU communication (AllReduce)
  • High communication bandwidth requirements (needs NVLink or similar)
  • Suitable for single-machine multi-GPU scenarios
  • Can reduce latency (each card computes less)

2. Pipeline Parallelism (PP)

Idea: “Vertically slice” the model by layers, different GPUs handle different layers.

Analogy: Factory assembly line—each worker handles one process, products pass through sequentially.

Input → [GPU 0: Layers 1-8] → [GPU 1: Layers 9-16] → [GPU 2: Layers 17-24] → Output

Time →
GPU 0: [Batch1 L1-8] [Batch2 L1-8] [Batch3 L1-8] ...
GPU 1: [Batch1 L9-16] [Batch2 L9-16] ...
GPU 2: [Batch1 L17-24] ...

Characteristics:

  • Less inter-GPU communication (only activations passed between stages)
  • “Pipeline bubbles” exist (some GPUs idle waiting)
  • Suitable for cross-machine deployment
  • High throughput, but single-request latency may increase

3. Data Parallelism (DP)

Idea: Each GPU has a complete model replica, each processing different requests.

Analogy: Opening multiple franchise stores—each store can provide complete service.

Request1 → GPU 0 (complete model) → Result1
Request2 → GPU 1 (complete model) → Result2
Request3 → GPU 2 (complete model) → Result3

Characteristics:

  • Simple implementation
  • No inter-GPU communication overhead (during inference)
  • Prerequisite: Single card can fit the entire model
  • Throughput scales linearly

Hybrid Parallel Strategies

In production, multiple parallel strategies are often combined:

                   Data Parallel (DP=2)
        ┌─────────────────┴─────────────────┐
        ▼                                   ▼
   ┌───────────┐                       ┌───────────┐
   │ Replica 1 │                       │ Replica 2 │
   └───────────┘                       └───────────┘
        │                                   │
        ▼                                   ▼
 Pipeline Parallel (PP=2)          Pipeline Parallel (PP=2)
┌─────────┬─────────┐               ┌─────────┬─────────┐
│ Stage 0 │ Stage 1 │               │ Stage 0 │ Stage 1 │
└─────────┴─────────┘               └─────────┴─────────┘
     │         │                         │         │
     ▼         ▼                         ▼         ▼
 Tensor Parallel (TP=2)            Tensor Parallel (TP=2)
┌────┬────┐ ┌────┬────┐           ┌────┬────┐ ┌────┬────┐
│GPU0│GPU1│ │GPU2│GPU3│           │GPU4│GPU5│ │GPU6│GPU7│
└────┴────┘ └────┴────┘           └────┴────┘ └────┴────┘

Combination example:

  • 8-GPU server
  • TP=4 (4 cards for tensor parallelism per layer)
  • PP=2 (2 pipeline stages)
  • Total: 4 × 2 = 8 cards

Technical Challenges and Solutions

Challenge 1: Communication Overhead

Problem: Data transfer between GPUs can become a bottleneck.

Solutions:

  • Use NVLink (5-10x faster than PCIe)
  • Use NVSwitch (full interconnect)
  • Optimize overlap of communication and computation

Challenge 2: Load Balancing

Problem: In pipeline parallelism, different stages may have unequal computation loads.

Solutions:

  • Reasonable layer partitioning
  • Use Interleaved Schedule

Challenge 3: KV Cache Management

Problem: KV Cache management is more complex in distributed environments.

Solutions:

  • Distributed KV Cache pool
  • Cross-node PagedAttention

Framework Support

Framework | Tensor Parallel | Pipeline Parallel | Notes
vLLM | ✅ | ✅ | Out-of-box
TensorRT-LLM | ✅ | ✅ | High performance
DeepSpeed | ✅ | ✅ | Flexible config
Megatron-LM | ✅ | ✅ | Large-scale training/inference

Usage Examples

vLLM Distributed Inference

## 4-card tensor parallelism
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-70b-chat-hf \
--tensor-parallel-size 4

TensorRT-LLM Distributed Inference

## Build engine with tensor parallelism support
trtllm-build --checkpoint_dir ./llama-70b \
--output_dir ./engine \
--tp_size 4 \
--pp_size 2

## Run (using mpirun)
mpirun -n 8 python run_inference.py

Performance Considerations

Tensor Parallel vs Pipeline Parallel

Feature | Tensor Parallel (TP) | Pipeline Parallel (PP)
Comm Frequency | Every layer | Only between stages
Comm Volume | Larger | Smaller
Latency | Lower | Higher (bubbles)
Bandwidth Need | High (needs NVLink) | Lower
Use Case | Single-machine multi-GPU | Cross-machine

How to Choose?

Single machine multi-GPU (e.g., 8×A100-SXM4):
→ Prefer TP, fully utilize NVLink

Multi-machine deployment:
→ PP across machines, TP within machine
→ Example: 2 machines × 4 cards = PP2 × TP4

Smaller model, high request volume:
→ Consider Data Parallelism (DP)

Practical Deployment Recommendations

  1. Hardware Selection:

    • Prefer machines with NVLink (e.g., DGX, HGX)
    • High-speed network for cross-machine (InfiniBand)
  2. Parallelism Planning:

    • TP degree should not exceed single-machine card count
    • PP degree = total cards / TP degree
  3. Monitoring Metrics:

    • GPU utilization balance
    • Communication time ratio
    • Pipeline bubble rate

Summary

Distributed inference is essential technology for running ultra-large models. Three core strategies—Tensor Parallelism, Pipeline Parallelism, Data Parallelism—each have their characteristics, and are often combined in practice.

Key points:

  1. Tensor Parallel (TP): Horizontal computation split, communication-intensive, suitable for single machine
  2. Pipeline Parallel (PP): Vertical model split, less communication, suitable for cross-machine
  3. Data Parallel (DP): Model replication, throughput scaling
  4. Hybrid Parallel: Best practice for production environments

Understand distributed inference, and you can deploy large model services of any scale.

算子融合

算子融合:让 AI 计算少跑”冤枉路”

在深度学习的世界里,一个看似简单的操作,背后可能是无数次的数据搬运。算子融合(Kernel Fusion)技术就像一位聪明的快递员,把原本需要多次往返的包裹合并成一趟送完,大幅提升了 AI 计算的效率。

什么是算子(Kernel)?

在 GPU 计算中,算子(也叫 Kernel)是运行在 GPU 上的一个计算函数。常见的算子包括:

  • 矩阵乘法(MatMul)
  • 激活函数(ReLU、GELU)
  • 归一化(LayerNorm、BatchNorm)
  • 逐元素运算(加法、乘法)

每个算子执行时,通常需要:

  1. 从显存(GPU 内存)读取数据
  2. 执行计算
  3. 把结果写回显存

问题:反复读写显存太慢了!

假设我们要计算:y = ReLU(x + bias)

不融合的做法:

算子1: Add
- 从显存读取 x 和 bias
- 计算 x + bias
- 把结果写回显存(临时存储 temp)

算子2: ReLU
- 从显存读取 temp
- 计算 ReLU(temp)
- 把结果写回显存(最终结果 y)

问题在哪?

  • 中间结果 temp 被写入显存,紧接着又被读出来
  • 显存访问是 GPU 计算的最大瓶颈
  • 这次”往返”完全是浪费时间!

算子融合的解决方案

融合后的做法:

融合算子: Add_ReLU
- 从显存读取 x 和 bias
- 计算 x + bias
- 直接在寄存器/共享内存中计算 ReLU
- 把最终结果 y 写回显存

效果:

  • 省掉了中间结果的读写
  • 显存访问次数减半
  • 速度大幅提升!

形象类比

不融合: 你要寄三个包裹去三个地址。

  • 去邮局寄第一个,回家
  • 再去邮局寄第二个,回家
  • 再去邮局寄第三个

融合: 你带着三个包裹一趟全寄完。

省的就是那些”回家再出门”的时间。在 GPU 世界,”去邮局”就是访问显存,这个时间成本非常高!

常见的融合模式

1. 激活函数融合

最常见的融合是把激活函数合并到前一个算子:

MatMul + ReLU → MatMul_ReLU
Conv + BatchNorm + ReLU → Conv_BN_ReLU

2. 归一化融合

LayerNorm = Mean + Variance + Normalize + Scale + Shift
→ 融合成一个 FusedLayerNorm 算子

3. 注意力机制融合

Flash Attention 就是经典的算子融合案例:

传统 Attention:
Q × K^T → Softmax → × V
(每步都读写显存)

Flash Attention:
融合成一个算子,中间结果保留在 SRAM 中

4. 逐元素操作链融合

x → Add(bias) → Multiply(scale) → Tanh → Dropout
→ 融合成一个算子

技术实现

手动融合

在 CUDA 中手动实现融合算子:

__global__ void fused_add_relu(float* x, float* bias, float* y, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // 一次读取,一次计算,一次写入
        float val = x[idx] + bias[idx];
        y[idx] = val > 0 ? val : 0;  // ReLU
    }
}

自动融合

现代深度学习编译器可以自动进行算子融合:

工具/框架 融合能力
TensorRT 自动融合常见模式
XLA TensorFlow 的编译器
TorchScript/torch.compile PyTorch 的 JIT 编译
Triton 用户友好的自定义 Kernel
ONNX Runtime 图优化器自动融合

PyTorch 示例:

import torch

## 定义模型
class MyModel(torch.nn.Module):
    def forward(self, x, bias):
        return torch.relu(x + bias)

model = MyModel()

## 使用 torch.compile 自动优化(包括算子融合)
optimized_model = torch.compile(model)
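
编译后的模型用法与原模型一致。下面是一个调用示意(假设有可用的 CUDA 设备;首次调用会触发编译,之后复用融合后的 Kernel):

x = torch.randn(1024, 1024, device="cuda")
bias = torch.randn(1024, device="cuda")

y = optimized_model(x, bias)   # 第一次调用较慢(编译),后续调用直接走融合后的 Kernel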

融合的收益分析

以 Transformer 中的 FFN 层为例:

传统实现:
Linear1 → 写显存 → 读显存 → GELU → 写显存 → 读显存 → Linear2

融合实现:
Linear1 → GELU → Linear2 (中间结果留在快速存储中)

性能对比:

指标 | 未融合 | 融合后
显存访问次数 | 6次 | 2次
内存带宽占用 | 高 | 低
计算效率 | 30-50% | 70-90%
延迟 | 基准 | 降低 30-50%

融合的限制

并非所有算子都能融合,需要满足一些条件:

1. 数据依赖

只有串行依赖的算子才能融合。如果两个算子是并行关系,融合反而可能降低效率。

2. 计算/访存比

融合的收益取决于节省的访存时间 vs 增加的复杂度

  • 如果计算量很大、访存很少,融合收益不明显
  • 如果访存多、计算少(如逐元素操作),融合收益很大

3. 共享内存限制

融合后的算子可能需要更多共享内存和寄存器,可能超过 GPU 限制。

实际应用案例

Flash Attention

## 传统注意力 - 多个独立算子
attn_weights = torch.matmul(Q, K.transpose(-2, -1))
attn_weights = torch.softmax(attn_weights / math.sqrt(d_k), dim=-1)
output = torch.matmul(attn_weights, V)

## Flash Attention - 融合算子
from flash_attn import flash_attn_func
output = flash_attn_func(Q, K, V) # 一个融合 Kernel

效果: 速度提升 2-4 倍,显存降低 5-20 倍。

TensorRT 自动融合

import tensorrt as trt

## TensorRT 会自动识别并融合:
## Conv + BatchNorm + ReLU → CBR融合算子
## MatMul + Add + GELU → 融合算子

如何判断是否需要融合?

使用性能分析工具:

## 查看 Kernel 调用情况
nsys profile python my_model.py

## 如果看到大量小 Kernel 连续调用,考虑融合
## Kernel: add_kernel (0.1ms)
## Kernel: relu_kernel (0.1ms)
## Kernel: mul_kernel (0.1ms)
## → 可以融合成一个 Kernel

总结

算子融合是深度学习性能优化的核心技术之一。它的本质是减少显存访问,通过把多个小操作合并成一个大操作,避免中间结果的反复读写。

关键要点:

  1. 显存访问是瓶颈: 计算快,读写慢
  2. 融合减少往返: 数据留在快速存储中处理
  3. 自动融合工具: TensorRT、torch.compile 等
  4. 经典案例: Flash Attention

理解算子融合,你就理解了 GPU 加速的核心奥秘之一。

Kernel Fusion: Making AI Computing Take Fewer “Unnecessary Trips”

In the world of deep learning, a seemingly simple operation might involve countless data transfers behind the scenes. Kernel Fusion technology is like a smart delivery driver who combines packages that would require multiple trips into one delivery, dramatically improving AI computing efficiency.

What is a Kernel?

In GPU computing, a Kernel is a computational function that runs on the GPU. Common kernels include:

  • Matrix multiplication (MatMul)
  • Activation functions (ReLU, GELU)
  • Normalization (LayerNorm, BatchNorm)
  • Element-wise operations (addition, multiplication)

When each kernel executes, it typically needs to:

  1. Read data from GPU memory (VRAM)
  2. Perform computation
  3. Write results back to GPU memory

The Problem: Repeated Memory Access is Too Slow!

Suppose we need to compute: y = ReLU(x + bias)

Without fusion:

Kernel 1: Add
- Read x and bias from GPU memory
- Compute x + bias
- Write result to GPU memory (temporary storage: temp)

Kernel 2: ReLU
- Read temp from GPU memory
- Compute ReLU(temp)
- Write result to GPU memory (final result: y)

Where’s the problem?

  • Intermediate result temp is written to GPU memory, then immediately read back
  • Memory access is the biggest bottleneck in GPU computing
  • This “round trip” is a complete waste of time!

Kernel Fusion’s Solution

After fusion:

Fused Kernel: Add_ReLU
- Read x and bias from GPU memory
- Compute x + bias
- Compute ReLU directly in registers/shared memory
- Write final result y to GPU memory

Effect:

  • Eliminated intermediate result read/write
  • Memory accesses cut in half
  • Significant speed improvement!

An Analogy

Without fusion: You need to mail three packages to three addresses.

  • Go to post office to send first one, go home
  • Go to post office again to send second one, go home
  • Go to post office again to send third one

With fusion: You bring all three packages and mail them in one trip.

What you save is all that “going home and going out again” time. In the GPU world, “going to the post office” is accessing GPU memory—this time cost is very high!

Common Fusion Patterns

1. Activation Function Fusion

The most common fusion is merging activation functions into the preceding operator:

MatMul + ReLU → MatMul_ReLU
Conv + BatchNorm + ReLU → Conv_BN_ReLU

2. Normalization Fusion

LayerNorm = Mean + Variance + Normalize + Scale + Shift
→ Fused into one FusedLayerNorm kernel

3. Attention Mechanism Fusion

Flash Attention is a classic kernel fusion case:

Traditional Attention:
Q × K^T → Softmax → × V
(Each step reads/writes GPU memory)

Flash Attention:
Fused into one kernel, intermediate results stay in SRAM

4. Element-wise Operation Chain Fusion

x → Add(bias) → Multiply(scale) → Tanh → Dropout
→ Fused into one kernel

Technical Implementation

Manual Fusion

Manually implementing a fused kernel in CUDA:

__global__ void fused_add_relu(float* x, float* bias, float* y, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // One read, one compute, one write
        float val = x[idx] + bias[idx];
        y[idx] = val > 0 ? val : 0;  // ReLU
    }
}

Automatic Fusion

Modern deep learning compilers can automatically perform kernel fusion:

Tool/Framework Fusion Capability
TensorRT Automatic fusion of common patterns
XLA TensorFlow’s compiler
TorchScript/torch.compile PyTorch’s JIT compilation
Triton User-friendly custom kernels
ONNX Runtime Graph optimizer automatic fusion

PyTorch example:

import torch

## Define model
class MyModel(torch.nn.Module):
    def forward(self, x, bias):
        return torch.relu(x + bias)

model = MyModel()

## Use torch.compile for automatic optimization (including kernel fusion)
optimized_model = torch.compile(model)

Fusion Benefit Analysis

Using Transformer’s FFN layer as an example:

Traditional implementation:
Linear1 → write memory → read memory → GELU → write memory → read memory → Linear2

Fused implementation:
Linear1 → GELU → Linear2 (intermediate results stay in fast storage)

Performance comparison:

Metric | Unfused | Fused
Memory accesses | 6 times | 2 times
Memory bandwidth usage | High | Low
Compute efficiency | 30-50% | 70-90%
Latency | Baseline | 30-50% lower

Fusion Limitations

Not all operators can be fused; some conditions must be met:

1. Data Dependencies

Only serially dependent operators can be fused. If two operators are parallel, fusion might actually reduce efficiency.

2. Compute-to-Memory Ratio

Fusion benefits depend on saved memory access time vs. added complexity:

  • If computation is heavy and memory access is light, fusion benefits are minimal
  • If memory access is heavy and computation is light (like element-wise ops), fusion benefits are significant

3. Shared Memory Limits

Fused kernels might need more shared memory and registers, potentially exceeding GPU limits.

Practical Application Cases

Flash Attention

## Traditional attention - multiple separate kernels
attn_weights = torch.matmul(Q, K.transpose(-2, -1))
attn_weights = torch.softmax(attn_weights / math.sqrt(d_k), dim=-1)
output = torch.matmul(attn_weights, V)

## Flash Attention - fused kernel
from flash_attn import flash_attn_func
output = flash_attn_func(Q, K, V) # One fused kernel

Effect: 2-4x speed improvement, 5-20x memory reduction.

TensorRT Automatic Fusion

import tensorrt as trt

## TensorRT automatically identifies and fuses:
## Conv + BatchNorm + ReLU → CBR fused kernel
## MatMul + Add + GELU → Fused kernel

How to Determine if Fusion is Needed?

Use performance analysis tools:

## View kernel call patterns
nsys profile python my_model.py

## If you see many small kernels called consecutively, consider fusion
## Kernel: add_kernel (0.1ms)
## Kernel: relu_kernel (0.1ms)
## Kernel: mul_kernel (0.1ms)
## → Can be fused into one kernel

Summary

Kernel fusion is one of the core technologies in deep learning performance optimization. Its essence is reducing memory access by combining multiple small operations into one large operation, avoiding repeated read/write of intermediate results.

Key points:

  1. Memory access is the bottleneck: Computing is fast, reading/writing is slow
  2. Fusion reduces round trips: Data stays in fast storage for processing
  3. Automatic fusion tools: TensorRT, torch.compile, etc.
  4. Classic example: Flash Attention

Understand kernel fusion, and you understand one of the core secrets of GPU acceleration.

CUDA性能优化

CUDA 性能优化:让 GPU 飞起来的秘密

如果说大模型是一辆超级跑车,那么 CUDA 优化就是让这辆跑车发挥极致性能的调校技术。今天,我们用通俗易懂的方式,来揭秘如何让 GPU 跑得更快。

什么是 CUDA?

CUDA(Compute Unified Device Architecture)是 NVIDIA 推出的并行计算平台和编程模型。简单说,它是让我们能够在 GPU 上写程序的”语言”和”工具箱”。

为什么 GPU 这么重要?

对比 CPU GPU
核心数 8-64 个强力核心 数千个小核心
擅长 复杂逻辑、串行任务 大规模并行计算
类比 一个数学教授 一万个小学生

计算 1+1=? 一个教授秒答。但要计算一亿道加法题?一万个小学生同时做,比一个教授快多了!

深度学习中的矩阵运算正是这种”大量简单计算”的典型场景,所以 GPU 成了 AI 的主力。

为什么需要 CUDA 优化?

“GPU 核心多不就够了吗?”——并不是。

没有优化的 GPU 程序,就像:

  • 雇了一万个员工,但只有 1000 人在干活,其他人在摸鱼
  • 买了 8 车道高速公路,但大家都挤在一条道上
  • 请了顶级厨师,但食材供应跟不上

CUDA 优化的目标: 让每个 GPU 核心都忙起来,让数据流转无阻塞。

CUDA 优化的核心概念

1. 理解 GPU 架构

GPU 由多层结构组成:

GPU
├── SM (Streaming Multiprocessor) × N ← 多个流处理器
│ ├── CUDA Core × M ← 计算核心
│ ├── Shared Memory ← 共享内存(很快)
│ ├── L1 Cache ← 一级缓存
│ └── Registers ← 寄存器(最快)
├── L2 Cache ← 二级缓存
└── Global Memory (HBM) ← 显存(大但慢)

关键洞察: 数据离计算核心越近,访问越快。优化的核心就是尽量让数据待在”快”的地方。

2. 内存层级与访问速度

内存类型 | 大小 | 速度 | 作用域
寄存器 | KB级 | 最快 | 单个线程
共享内存 | 几十KB | 非常快 | 同一线程块
L1/L2 Cache | MB级 | 快 | 自动管理
全局内存 (HBM) | GB级 | 慢(相对) | 所有线程

类比:

  • 寄存器 = 你手上拿着的笔(秒取)
  • 共享内存 = 你桌上的文具盒(抬手就到)
  • L2 Cache = 你身后的书架(转身取)
  • 全局内存 = 图书馆仓库(要走一趟)

核心优化技术

1. 内存合并访问(Memory Coalescing)

GPU 一次读取内存是按”块”读的(128 字节)。如果多个线程访问的数据正好在一块里,一次就能全读出来。

反例(低效):

线程0 访问 地址 0
线程1 访问 地址 1000
线程2 访问 地址 2000
→ 需要 3 次内存访问

正例(高效):

线程0 访问 地址 0
线程1 访问 地址 4
线程2 访问 地址 8
→ 1 次内存访问搞定

优化方法: 让相邻线程访问相邻内存地址。

2. 使用共享内存(Shared Memory)

对于需要多次访问的数据,先从全局内存加载到共享内存,再重复使用:

__global__ void matmul_optimized(float* A, float* B, float* C) {
    __shared__ float tile_A[TILE_SIZE][TILE_SIZE];
    __shared__ float tile_B[TILE_SIZE][TILE_SIZE];

    // 1. 把数据块加载到共享内存
    tile_A[ty][tx] = A[row * N + tx];
    tile_B[ty][tx] = B[ty * N + col];
    __syncthreads(); // 等待所有线程加载完成

    // 2. 在共享内存中计算(快!)
    for (int k = 0; k < TILE_SIZE; k++) {
        sum += tile_A[ty][k] * tile_B[k][tx];
    }
}

效果: 减少对慢速全局内存的访问次数。

3. 避免 Bank Conflict(存储体冲突)

共享内存被分成 32 个 Bank。如果多个线程同时访问同一个 Bank 的不同地址,就会产生冲突,被迫串行执行。

类比: 32 个人同时去 32 个不同的 ATM 取钱 = 没冲突。32 个人抢 1 台 ATM = 排队等死。

解决方法: 合理设计数据布局,让线程访问分散到不同 Bank。

4. 优化线程配置

线程组织成三层结构:

  • Thread(线程): 最小执行单位
  • Block(线程块): 一组线程,共享共享内存
  • Grid(网格): 所有线程块的集合

优化要点:

  • 每个 Block 的线程数应是 32 的倍数(Warp 大小)
  • 通常选择 128 或 256 线程每 Block
  • 确保有足够的 Block 数让 GPU 所有 SM 都有活干
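
Grid 大小通常按"向上取整"从数据规模推出来。下面是一个计算一维 kernel 启动配置的小工具(仅为示意):

def launch_config(n_elements: int, threads_per_block: int = 256):
    """根据元素数量计算一维 kernel 的 (gridDim, blockDim)。"""
    assert threads_per_block % 32 == 0, "线程数应是 Warp 大小(32)的倍数"
    num_blocks = (n_elements + threads_per_block - 1) // threads_per_block  # 向上取整
    return num_blocks, threads_per_block

print(launch_config(1_000_000))   # (3907, 256)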

5. 指令级优化

使用快速数学函数:

// 慢
float result = sin(x);

// 快(精度略低但够用)
float result = __sinf(x);

利用 Tensor Core(张量核心):
现代 NVIDIA GPU 有专门的 Tensor Core 做矩阵运算,比普通 CUDA Core 快得多:

// 使用 Tensor Core 的矩阵乘法
wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

实战:矩阵乘法优化对比

优化阶段              | 性能 (相对)
---------------------|------------
朴素实现 | 1x
+ 内存合并访问 | 3x
+ 共享内存 tiling | 10x
+ 避免 bank conflict | 12x
+ Tensor Core | 50x+

性能分析工具

优化离不开测量。NVIDIA 提供了强大的分析工具:

工具 用途
nsys 系统级分析,查看整体执行情况
ncu Kernel 级分析,深入分析单个函数
nvprof 传统分析工具

使用示例:

## 分析程序整体性能
nsys profile ./my_cuda_program

## 深入分析某个 kernel
ncu --set full ./my_cuda_program

常见优化清单

优化项 检查点
内存合并 相邻线程是否访问相邻地址?
占用率 GPU SM 是否充分利用?
共享内存 是否用共享内存减少全局访问?
Bank Conflict 共享内存访问是否有冲突?
分支发散 同一 Warp 内是否有不同分支?
指令吞吐 是否使用了高效指令?

在 AI 框架中的应用

PyTorch、TensorFlow 等框架底层大量使用优化过的 CUDA Kernel:

  • cuBLAS: 矩阵运算库
  • cuDNN: 深度学习原语库
  • Flash Attention: 优化的注意力机制实现

当你调用 torch.matmul() 时,背后是无数工程师精心优化的 CUDA 代码在工作。

总结

CUDA 性能优化是让 GPU 发挥真正实力的关键技术。核心思想包括:

  1. 减少内存访问: 内存是瓶颈,尽量复用数据
  2. 利用内存层级: 把热数据放到更快的存储层
  3. 保持并行度: 让所有计算单元都忙起来
  4. 使用专用硬件: Tensor Core 等加速单元

掌握这些技术,你就掌握了让 AI 模型”飞”起来的魔法。

CUDA Performance Optimization: The Secret to Making GPUs Fly

If large models are a supercar, then CUDA optimization is the tuning technique that unleashes the car’s ultimate performance. Today, we’ll reveal how to make GPUs run faster in an easy-to-understand way.

What is CUDA?

CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform and programming model. Simply put, it’s the “language” and “toolbox” that allows us to write programs on GPUs.

Why are GPUs so important?

Comparison CPU GPU
Core Count 8-64 powerful cores Thousands of small cores
Good at Complex logic, serial tasks Massive parallel computing
Analogy One math professor Ten thousand elementary students

Computing 1+1=? A professor answers instantly. But computing a hundred million addition problems? Ten thousand students working simultaneously is much faster than one professor!

Matrix operations in deep learning are exactly this type of “massive simple computations” scenario, which is why GPUs became the workhorse of AI.

Why Do We Need CUDA Optimization?

“Isn’t having lots of GPU cores enough?”—Not really.

An unoptimized GPU program is like:

  • Hiring ten thousand employees, but only 1000 are working while others are slacking
  • Buying an 8-lane highway, but everyone’s crammed into one lane
  • Hiring a top chef, but ingredient supply can’t keep up

The goal of CUDA optimization: Keep every GPU core busy, and let data flow without blockages.

Core Concepts of CUDA Optimization

1. Understanding GPU Architecture

GPUs have a multi-layered structure:

GPU
├── SM (Streaming Multiprocessor) × N ← Multiple streaming processors
│ ├── CUDA Core × M ← Compute cores
│ ├── Shared Memory ← Shared memory (very fast)
│ ├── L1 Cache ← Level 1 cache
│ └── Registers ← Registers (fastest)
├── L2 Cache ← Level 2 cache
└── Global Memory (HBM) ← Video memory (large but slow)

Key insight: The closer data is to compute cores, the faster the access. The core of optimization is keeping data in “fast” places as much as possible.

2. Memory Hierarchy and Access Speed

Memory Type | Size | Speed | Scope
Registers | KB level | Fastest | Single thread
Shared Memory | Tens of KB | Very fast | Same thread block
L1/L2 Cache | MB level | Fast | Auto-managed
Global Memory (HBM) | GB level | Slow (relatively) | All threads

Analogy:

  • Registers = Pen in your hand (instant)
  • Shared Memory = Pencil case on your desk (reach over)
  • L2 Cache = Bookshelf behind you (turn around)
  • Global Memory = Library warehouse (need to walk there)

Core Optimization Techniques

1. Memory Coalescing

GPU reads memory in “chunks” (128 bytes). If multiple threads access data that happens to be in one chunk, it can all be read at once.

Bad example (inefficient):

Thread 0 accesses address 0
Thread 1 accesses address 1000
Thread 2 accesses address 2000
→ Requires 3 memory accesses

Good example (efficient):

Thread 0 accesses address 0
Thread 1 accesses address 4
Thread 2 accesses address 8
→ 1 memory access handles all

Optimization method: Have adjacent threads access adjacent memory addresses.

2. Using Shared Memory

For data that needs to be accessed multiple times, first load from global memory to shared memory, then reuse:

__global__ void matmul_optimized(float* A, float* B, float* C) {
    __shared__ float tile_A[TILE_SIZE][TILE_SIZE];
    __shared__ float tile_B[TILE_SIZE][TILE_SIZE];

    // 1. Load data blocks to shared memory
    tile_A[ty][tx] = A[row * N + tx];
    tile_B[ty][tx] = B[ty * N + col];
    __syncthreads(); // Wait for all threads to finish loading

    // 2. Compute in shared memory (fast!)
    for (int k = 0; k < TILE_SIZE; k++) {
        sum += tile_A[ty][k] * tile_B[k][tx];
    }
}

Effect: Reduces accesses to slow global memory.

3. Avoiding Bank Conflicts

Shared memory is divided into 32 banks. If multiple threads simultaneously access different addresses in the same bank, conflicts occur and execution becomes serial.

Analogy: 32 people going to 32 different ATMs simultaneously = no conflict. 32 people fighting for 1 ATM = waiting in line forever.

Solution: Design data layout carefully so thread accesses are distributed across different banks.

4. Optimizing Thread Configuration

Threads are organized in three levels:

  • Thread: Smallest execution unit
  • Block: A group of threads sharing shared memory
  • Grid: Collection of all thread blocks

Optimization points:

  • Threads per block should be multiples of 32 (Warp size)
  • Usually choose 128 or 256 threads per block
  • Ensure enough blocks so all GPU SMs have work to do

5. Instruction-level Optimization

Use fast math functions:

// Slow
float result = sin(x);

// Fast (slightly less precise but sufficient)
float result = __sinf(x);

Leverage Tensor Cores:
Modern NVIDIA GPUs have dedicated Tensor Cores for matrix operations, much faster than regular CUDA Cores:

// Matrix multiplication using Tensor Core
wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

Practical: Matrix Multiplication Optimization Comparison

Optimization Stage       | Performance (relative)
------------------------|----------------------
Naive implementation | 1x
+ Memory coalescing | 3x
+ Shared memory tiling | 10x
+ Avoid bank conflict | 12x
+ Tensor Core | 50x+

Performance Analysis Tools

Optimization requires measurement. NVIDIA provides powerful analysis tools:

Tool Purpose
nsys System-level profiling, view overall execution
ncu Kernel-level profiling, deep analysis of individual functions
nvprof Traditional profiling tool

Usage example:

## Profile overall program performance
nsys profile ./my_cuda_program

## Deep analysis of a specific kernel
ncu --set full ./my_cuda_program

Common Optimization Checklist

Optimization Checkpoint
Memory Coalescing Do adjacent threads access adjacent addresses?
Occupancy Are GPU SMs fully utilized?
Shared Memory Using shared memory to reduce global access?
Bank Conflict Any conflicts in shared memory access?
Branch Divergence Different branches within same Warp?
Instruction Throughput Using efficient instructions?

Application in AI Frameworks

Frameworks like PyTorch and TensorFlow extensively use optimized CUDA kernels under the hood:

  • cuBLAS: Matrix operation library
  • cuDNN: Deep learning primitives library
  • Flash Attention: Optimized attention mechanism implementation

When you call torch.matmul(), countless engineer-optimized CUDA code is working behind the scenes.

Summary

CUDA performance optimization is the key technology to unleash GPU’s true power. Core ideas include:

  1. Reduce memory access: Memory is the bottleneck, reuse data as much as possible
  2. Utilize memory hierarchy: Put hot data in faster storage layers
  3. Maintain parallelism: Keep all compute units busy
  4. Use specialized hardware: Tensor Cores and other acceleration units

Master these techniques, and you’ll master the magic of making AI models “fly.”

Continuous Batching

Continuous Batching:让大模型服务”永不闲置”的黑科技

当你和 ChatGPT 聊天时,背后的服务器可能同时在服务成千上万的用户。如何高效地处理这些并发请求?Continuous Batching(连续批处理)技术应运而生,它让 GPU 保持满负荷运转,大幅提升了大模型服务的吞吐量。

先理解传统批处理的问题

什么是批处理(Batching)?

在深度学习中,批处理是指把多个请求打包在一起同时处理,而不是一个一个处理。这样做能更好地利用 GPU 的并行计算能力。

类比: 就像餐厅炒菜,一次炒一盘效率低,一次炒四盘(如果锅够大)效率更高。

传统静态批处理的困境

在传统的 Static Batching(静态批处理)中:

  1. 服务器等待收集到一批请求(比如 8 个)
  2. 把这 8 个请求一起送入模型处理
  3. 等到 所有请求都完成 后,才处理下一批

问题来了:

假设这 8 个请求中:

  • 用户 A 问:”1+1=?” → 生成 3 个 token 就完成了
  • 用户 B 问:”给我写一篇 1000 字的文章” → 需要生成 500 个 token

静态批处理的结果:

时间线:
[Token 1] A✓ B○ C○ D○ E○ F○ G○ H○
[Token 2] A✓ B○ C✓ D○ E○ F○ G○ H○
[Token 3] A✓ B○ C✓ D✓ E✓ F○ G○ H○ ← A、C、D、E 已完成
...
[Token 500] 等... B○ 等... 等... 等... 等... 等... 等... ← 大家都在等 B

用户 A 早就该收到回复了,却要等用户 B 的长文章写完!这期间:

  • GPU 资源浪费: 已完成的请求仍占着位置
  • 延迟增加: 短请求被长请求”拖后腿”
  • 用户体验差: 明明可以秒回的问题,等了半天
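
可以粗算一下这种浪费有多大(生成长度为假设数据):

gen_lengths = [3, 500, 12, 40, 8, 120, 25, 60]   # 批内每个请求实际生成的 token 数(假设)

batch_steps = max(gen_lengths)                    # 静态批处理要一直跑到最长的那个:500 步
useful_tokens = sum(gen_lengths)                  # 真正有用的 token:768 个
utilization = useful_tokens / (batch_steps * len(gen_lengths))

print(f"{utilization:.1%}")                       # ≈19.2%:八成以上的批次槽位都在空转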

Continuous Batching 如何解决问题

Continuous Batching(也叫 Iteration-level Batching 或 In-Flight Batching)的核心思想是:

在每一步迭代(每生成一个 token)时重新调度批次

工作原理

初始状态:正在处理 [A, B, C, D]

Token 1 后:
- A 完成 ✓ → 移出批次,立即返回结果给用户 A
- B, C, D 继续
- 新请求 E 加入 → 批次变成 [B, C, D, E]

Token 2 后:
- C 完成 ✓ → 移出,返回
- 新请求 F, G 加入 → 批次变成 [B, D, E, F, G]

...以此类推

效果:

  • 完成的请求 立即释放 资源
  • 新请求 随时加入 正在运行的批次
  • GPU 始终满载 运行,没有空闲

形象类比

静态批处理像传统公交车:

  • 每站必须等所有人上下车完毕
  • 车满就发车,到终点站所有人一起下
  • 有人中途想下车?抱歉,等到终点吧

Continuous Batching 像”云轨”或”无人出租车”:

  • 乘客随时可以上下
  • 到达目的地立即下车,座位立即空出
  • 新乘客马上就能坐上空座
  • 车辆永远保持高效运转

技术实现要点

1. 序列级别的调度

传统批处理以”批次”为单位调度,Continuous Batching 以”序列”为单位调度:

class ContinuousBatcher:
    def __init__(self, max_batch_size=32):
        self.running_sequences = []   # 正在生成的序列
        self.waiting_queue = []       # 等待队列
        self.max_batch_size = max_batch_size

    def step(self):
        # 1. 移除已完成的序列
        completed = [s for s in self.running_sequences if s.is_done()]
        for seq in completed:
            self.running_sequences.remove(seq)
            self.return_result(seq)

        # 2. 填充空位
        while len(self.running_sequences) < self.max_batch_size:
            if self.waiting_queue:
                new_seq = self.waiting_queue.pop(0)
                self.running_sequences.append(new_seq)
            else:
                break

        # 3. 执行一步推理
        self.forward_one_step(self.running_sequences)

2. 动态内存管理

由于批次组成随时变化,内存管理变得复杂:

  • 需要 PagedAttention 等技术配合
  • KV Cache 按需分配和释放
  • 支持不同长度的序列共存

3. 前缀共享优化

当多个请求有相同前缀(如相同的 System Prompt)时:

  • 可以共享前缀的 KV Cache
  • 新请求加入时直接复用

Continuous Batching 的优势

优势 说明
更高吞吐量 GPU 利用率从 30-50% 提升到 90%+
更低延迟 短请求不用等长请求完成
更好伸缩性 动态适应负载变化
更高并发 同时服务更多用户

性能对比

以 LLaMA-7B 模型为例:

指标 静态批处理 Continuous Batching
吞吐量 (tokens/s) 1000 3000-5000
平均延迟 高,受最长请求影响 低,按实际生成长度
GPU 利用率 30-50% 85-95%
尾延迟 (P99) 很高 可控

主流框架支持

以下推理框架都实现了 Continuous Batching:

框架 实现名称
vLLM Continuous Batching + PagedAttention
TensorRT-LLM In-Flight Batching
SGLang Continuous Batching + RadixAttention
Text Generation Inference Continuous Batching

使用示例(vLLM)

from vllm import LLM, SamplingParams

## 初始化模型(自动启用 Continuous Batching)
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

## 批量请求 - 自动调度,短的先返回
prompts = [
"What is 1+1?", # 短请求
"Write a 500-word essay about AI...", # 长请求
"Hello!", # 短请求
]

outputs = llm.generate(prompts, SamplingParams(max_tokens=1024))

## 注意:虽然代码看起来是批量的,
## 但内部使用 Continuous Batching 优化调度

挑战与解决方案

挑战 1:调度复杂度增加

  • 问题: 每步都要判断哪些完成、哪些继续
  • 解决: 高效的数据结构和调度算法

挑战 2:内存碎片化

  • 问题: 序列长短不一,内存分配零散
  • 解决: PagedAttention 解决碎片问题

挑战 3:请求饥饿

  • 问题: 长请求可能一直被短请求”插队”
  • 解决: 优先级调度、公平调度策略

总结

Continuous Batching 是现代 LLM 推理服务的核心技术之一。它打破了传统批处理”等齐再走”的局限,实现了请求级别的动态调度。配合 PagedAttention 等内存优化技术,Continuous Batching 让大模型服务的吞吐量提升数倍,同时降低了用户等待时间。

如果你正在部署 LLM 服务,选择支持 Continuous Batching 的推理框架(如 vLLM、TensorRT-LLM、SGLang)将是明智之选。

Continuous Batching: The Black Magic That Keeps Large Model Services “Never Idle”

When you chat with ChatGPT, the servers behind it might be serving thousands of users simultaneously. How do we efficiently handle these concurrent requests? Continuous Batching technology emerged to keep GPUs running at full capacity, dramatically improving the throughput of large model services.

First, Understanding the Problem with Traditional Batching

What is Batching?

In deep learning, batching means packaging multiple requests together for simultaneous processing, rather than processing them one by one. This better utilizes the GPU’s parallel computing capabilities.

Analogy: It’s like cooking at a restaurant—stir-frying one dish at a time is inefficient, but doing four at once (if the wok is big enough) is more efficient.

The Dilemma of Traditional Static Batching

In traditional Static Batching:

  1. The server waits to collect a batch of requests (say, 8)
  2. These 8 requests are sent to the model together
  3. It waits until all requests complete before processing the next batch

Here’s the problem:

Suppose among these 8 requests:

  • User A asks: “What’s 1+1?” → Completes after generating 3 tokens
  • User B asks: “Write me a 1000-word article” → Needs to generate 500 tokens

Result with Static Batching:

Timeline:
[Token 1] A✓ B○ C○ D○ E○ F○ G○ H○
[Token 2] A✓ B○ C✓ D○ E○ F○ G○ H○
[Token 3] A✓ B○ C✓ D✓ E✓ F○ G○ H○ ← A, C, D, E completed
...
[Token 500] waiting... B○ waiting... waiting... ← Everyone waiting for B

User A should have received their response long ago, but has to wait for User B’s long article to finish! During this time:

  • GPU resources wasted: Completed requests still occupy slots
  • Increased latency: Short requests are “held back” by long ones
  • Poor user experience: Questions that could be answered instantly take forever

How Continuous Batching Solves the Problem

Continuous Batching (also called Iteration-level Batching or In-Flight Batching) has a core idea:

Reschedule the batch at each iteration step (every token generated)

How It Works

Initial state: Processing [A, B, C, D]

After Token 1:
- A completed ✓ → Remove from batch, immediately return result to User A
- B, C, D continue
- New request E joins → Batch becomes [B, C, D, E]

After Token 2:
- C completed ✓ → Remove, return
- New requests F, G join → Batch becomes [B, D, E, F, G]

...and so on

Effects:

  • Completed requests immediately release resources
  • New requests join anytime to the running batch
  • GPU always runs at full load, no idle time

A Visual Analogy

Static Batching is like a traditional bus:

  • Must wait for everyone to board/disembark at each stop
  • Departs when full, everyone gets off together at the terminal
  • Want to get off midway? Sorry, wait until the end

Continuous Batching is like a “cloud rail” or “robo-taxi”:

  • Passengers can get on and off anytime
  • Get off immediately upon reaching destination, seat is freed immediately
  • New passengers can immediately take empty seats
  • Vehicle always operates efficiently

Key Technical Implementation Points

1. Sequence-level Scheduling

Traditional batching schedules by “batch”; Continuous Batching schedules by “sequence”:

class ContinuousBatcher:
    def __init__(self, max_batch_size=32):
        self.running_sequences = []   # Sequences currently generating
        self.waiting_queue = []       # Waiting queue
        self.max_batch_size = max_batch_size

    def step(self):
        # 1. Remove completed sequences
        completed = [s for s in self.running_sequences if s.is_done()]
        for seq in completed:
            self.running_sequences.remove(seq)
            self.return_result(seq)

        # 2. Fill empty slots
        while len(self.running_sequences) < self.max_batch_size:
            if self.waiting_queue:
                new_seq = self.waiting_queue.pop(0)
                self.running_sequences.append(new_seq)
            else:
                break

        # 3. Execute one inference step
        self.forward_one_step(self.running_sequences)

2. Dynamic Memory Management

Since batch composition changes constantly, memory management becomes complex:

  • Requires technologies like PagedAttention
  • KV Cache allocated and freed on demand
  • Supports sequences of different lengths coexisting

3. Prefix Sharing Optimization

When multiple requests share the same prefix (like the same System Prompt):

  • Can share prefix KV Cache
  • New requests joining directly reuse it

Advantages of Continuous Batching

Advantage Description
Higher Throughput GPU utilization increases from 30-50% to 90%+
Lower Latency Short requests don’t wait for long ones to complete
Better Scalability Dynamically adapts to load changes
Higher Concurrency Serves more users simultaneously

Performance Comparison

Using LLaMA-7B model as an example:

Metric Static Batching Continuous Batching
Throughput (tokens/s) 1000 3000-5000
Average Latency High, affected by longest request Low, based on actual generation length
GPU Utilization 30-50% 85-95%
Tail Latency (P99) Very high Controllable

Mainstream Framework Support

The following inference frameworks implement Continuous Batching:

Framework Implementation Name
vLLM Continuous Batching + PagedAttention
TensorRT-LLM In-Flight Batching
SGLang Continuous Batching + RadixAttention
Text Generation Inference Continuous Batching

Usage Example (vLLM)

from vllm import LLM, SamplingParams

## Initialize model (automatically enables Continuous Batching)
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

## Batch requests - automatically scheduled, short ones return first
prompts = [
"What is 1+1?", # Short request
"Write a 500-word essay about AI...", # Long request
"Hello!", # Short request
]

outputs = llm.generate(prompts, SamplingParams(max_tokens=1024))

## Note: Although the code looks like batch processing,
## internally it uses Continuous Batching for optimized scheduling

Challenges and Solutions

Challenge 1: Increased Scheduling Complexity

  • Problem: Each step must determine which are complete, which continue
  • Solution: Efficient data structures and scheduling algorithms

Challenge 2: Memory Fragmentation

  • Problem: Varying sequence lengths cause scattered memory allocation
  • Solution: PagedAttention solves fragmentation issues

Challenge 3: Request Starvation

  • Problem: Long requests might keep getting “cut in line” by short ones
  • Solution: Priority scheduling, fair scheduling policies

Summary

Continuous Batching is one of the core technologies in modern LLM inference services. It breaks the traditional batching limitation of “wait until all are ready,” achieving dynamic scheduling at the request level. Combined with memory optimization techniques like PagedAttention, Continuous Batching increases large model service throughput by several times while reducing user wait times.

If you’re deploying LLM services, choosing an inference framework that supports Continuous Batching (like vLLM, TensorRT-LLM, SGLang) would be a wise choice.

PagedAttention

PagedAttention:让大模型推理告别”内存焦虑”

在大语言模型(LLM)推理的世界里,内存管理一直是个让工程师头疼的问题。今天我们要介绍的 PagedAttention 技术,借鉴了操作系统的经典思想,巧妙地解决了 LLM 推理中的内存难题。这项技术是 vLLM 框架的核心创新,让我们来深入了解它。

从一个痛点说起

想象你正在运营一个 AI 聊天服务,同时有 100 个用户在和 AI 对话。每个用户的对话长度不一样——有人只问了一句”今天天气怎么样”,有人却写了一篇 2000 字的小说让 AI 续写。

传统做法的问题:

传统的 LLM 推理需要为每个用户预先分配一块固定大小的内存(用于存储 KV Cache)。为了应对”写小说”的用户,你不得不给每个人都分配能容纳 2000 字的空间。

结果呢?

  • 那个只问天气的用户,分配的内存 99% 都浪费了
  • GPU 显存很快就被”空气”占满了
  • 本来能服务 100 个用户,现在只能服务 20 个

这就是 PagedAttention 要解决的问题。

什么是 KV Cache?

在理解 PagedAttention 之前,我们先搞清楚什么是 KV Cache

当 LLM 生成文本时,它是一个字一个字”吐”出来的(自回归生成)。每生成一个新字,模型需要”回顾”之前所有已生成的内容。

如果每次都从头计算,效率太低。所以我们把之前计算过的注意力机制中的 KeyValue 向量存起来,这就是 KV Cache

类比: 就像你写一篇长文章,每写一句话都要从第一句开始读一遍,太累了。KV Cache 就像你的”记忆笔记”,把之前的关键信息记下来,写新句子时只需要翻笔记就行。

PagedAttention 的核心思想

PagedAttention 的灵感来自于操作系统中的虚拟内存分页技术。

操作系统是怎么做的?

在现代操作系统中,程序需要的内存不是一次性全部分配的,而是分成一个个小”页面”(Page),按需分配:

  • 程序说”我可能需要 8GB 内存”
  • 操作系统说”好的,但你现在只用了 100MB,我先给你这些”
  • 当程序真正需要更多时,再分配新的页面

PagedAttention 对 KV Cache 做了同样的事情:

  1. 分页存储: 把连续的 KV Cache 分割成固定大小的”页块”(Block)
  2. 按需分配: 生成新 token 时,才分配新的页块
  3. 非连续存储: 页块不需要在物理内存中连续,用”页表”来管理映射关系

图解 PagedAttention

传统方式:

用户A: [████████████____________________] ← 预分配大空间,大量浪费
用户B: [████________________________] ← 同样浪费
用户C: [██████████████████______________] ← 还是浪费

PagedAttention:

页块池: [A1][B1][A2][C1][B2][C2][A3][C3][空][空][空]...

用户A → 页表: [A1, A2, A3] ← 只用了3个页块
用户B → 页表: [B1, B2] ← 只用了2个页块
用户C → 页表: [C1, C2, C3] ← 只用了3个页块

效果: 没有浪费,每个用户精确使用所需的内存。

技术细节

1. Block(页块)

  • 每个 Block 包含固定数量的 token 对应的 KV 向量
  • 典型的 Block 大小:16 个 token
  • Block 是内存分配的最小单位
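
按 Block 分配后,一个序列需要的 Block 数就是"向上取整",浪费最多不到一个 Block 的空间(示意计算):

import math

BLOCK_SIZE = 16   # 每个 Block 容纳 16 个 token 的 KV

for num_tokens in [5, 16, 17, 1000]:
    blocks = math.ceil(num_tokens / BLOCK_SIZE)
    wasted = blocks * BLOCK_SIZE - num_tokens      # 只有最后一个 Block 可能没填满
    print(f"{num_tokens:>5} tokens → {blocks:>3} blocks, 浪费 {wasted} 个 token 的空间")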

2. Block Table(页表)

  • 每个请求维护一个 Block Table
  • 记录该请求的 KV Cache 使用了哪些物理 Block
  • 类似操作系统的页表,实现虚拟到物理的映射

3. 按需分配

时间线:
T1: 用户输入 "你好" → 分配 Block #1
T2: 模型生成 "," → Block #1 还有空间,继续使用
T3: 模型生成 "我是..." → Block #1 满了,分配 Block #2
...

4. 内存释放

当请求完成或被中断时,其占用的 Block 立即归还到空闲池,供其他请求使用。

PagedAttention 的优势

1. 接近零浪费

研究表明,传统方式的内存浪费率高达 60-80%。PagedAttention 将浪费率降到接近 0%(仅最后一个 Block 可能有少量未使用空间)。

2. 更高并发

同样的 GPU 显存,可以同时处理 2-4 倍 更多的请求。

3. 支持更长上下文

内存利用率提高后,同样的硬件可以支持更长的对话上下文。

4. 灵活的内存共享

PagedAttention 还支持一个高级特性:Copy-on-Write(写时复制)

当多个请求共享相同的前缀(如相同的系统提示词)时:

  • 它们可以共享同一批 Block
  • 只有当需要修改时,才复制出新的 Block
  • 进一步节省内存
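
写时复制通常靠引用计数实现:一个 Block 被几个请求共享就计数几次,只有在"有人要写、且计数大于 1"时才复制一份(极简示意,变量与函数名为本文虚构):

ref_count = {"block_7": 3}    # 某个系统提示词前缀的 Block 被 3 个请求共享

def write_block(block_id, allocate_free_block):
    """返回真正可写入的 Block:独占时原地写,共享时先复制一份再写。"""
    if ref_count.get(block_id, 1) > 1:
        new_block = allocate_free_block()   # 从空闲池拿一个新 Block 并复制内容
        ref_count[block_id] -= 1
        ref_count[new_block] = 1
        return new_block
    return block_id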

代码概念示例

虽然实际实现很复杂,但核心思想可以这样理解:

class PagedKVCache:
    def __init__(self, block_size=16, num_blocks=1000):
        self.block_size = block_size
        # 预分配 Block 池
        self.block_pool = [Block() for _ in range(num_blocks)]
        self.free_blocks = list(range(num_blocks))

    def allocate_block(self):
        """按需分配一个新 Block"""
        if self.free_blocks:
            return self.free_blocks.pop()
        raise MemoryError("No free blocks!")

    def free_block(self, block_id):
        """释放 Block 回池中"""
        self.free_blocks.append(block_id)

class Request:
    def __init__(self, cache):
        self.cache = cache
        self.block_table = []   # 页表
        self.num_tokens = 0

    def add_token(self, kv_data):
        # 检查当前 Block 是否已满
        if self.num_tokens % self.cache.block_size == 0:
            # 需要新 Block
            new_block = self.cache.allocate_block()
            self.block_table.append(new_block)

        # 写入 KV 数据
        current_block = self.block_table[-1]
        # ... 写入逻辑
        self.num_tokens += 1

在 vLLM 中的应用

vLLM 是第一个将 PagedAttention 付诸实践的推理框架:

## 安装 vLLM
pip install vllm

## 启动服务(自动使用 PagedAttention)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-chat-hf

用户无需任何额外配置,vLLM 自动应用 PagedAttention 优化。

性能对比

指标 传统方式 PagedAttention
内存利用率 20-40% 95%+
最大并发数 基准 2-4x
内存浪费 60-80% <5%
动态伸缩 困难 容易

总结

PagedAttention 是 LLM 推理领域的一项重要创新。它借鉴操作系统的分页内存管理思想,将 KV Cache 的内存管理从”静态预分配”变为”动态按需分配”,从而大幅提高了内存利用率和系统吞吐量。

这项技术告诉我们:有时候,解决新问题的最佳方案,就藏在经典技术的智慧之中。

PagedAttention: Freeing Large Model Inference from “Memory Anxiety”

In the world of Large Language Model (LLM) inference, memory management has always been a headache for engineers. Today’s topic, PagedAttention technology, cleverly solves the memory challenges in LLM inference by borrowing classic ideas from operating systems. This technology is the core innovation of the vLLM framework. Let’s dive deep into understanding it.

Starting from a Pain Point

Imagine you’re running an AI chat service with 100 users simultaneously talking to the AI. Each user’s conversation varies in length—some only ask “What’s the weather today?”, while others write a 2000-word story asking the AI to continue it.

Problems with Traditional Approaches:

Traditional LLM inference requires pre-allocating a fixed-size memory block for each user (for storing KV Cache). To accommodate the “novel writer” user, you have to allocate space for 2000 words to everyone.

The result?

  • The user who only asked about weather has 99% of their allocated memory wasted
  • GPU memory quickly fills up with “air”
  • Instead of serving 100 users, you can only serve 20

This is the problem PagedAttention solves.

What is KV Cache?

Before understanding PagedAttention, let’s clarify what KV Cache is.

When an LLM generates text, it outputs one token at a time (autoregressive generation). For each new token generated, the model needs to “look back” at all previously generated content.

Computing from scratch each time is too inefficient. So we store the Key and Value vectors from the attention mechanism that were previously computed—this is the KV Cache.

Analogy: It’s like writing a long article where you have to read from the first sentence every time you write a new one—too exhausting. KV Cache is like your “memory notes,” recording key information from before, so you only need to check your notes when writing new sentences.

Core Ideas of PagedAttention

PagedAttention’s inspiration comes from virtual memory paging technology in operating systems.

How Do Operating Systems Do It?

In modern operating systems, memory needed by programs isn’t allocated all at once, but divided into small “pages” allocated on demand:

  • Program says: “I might need 8GB of memory”
  • OS says: “Okay, but you’re only using 100MB now, I’ll give you that first”
  • When the program actually needs more, new pages are allocated

PagedAttention does the same thing for KV Cache:

  1. Paged Storage: Splits continuous KV Cache into fixed-size “blocks”
  2. On-demand Allocation: New blocks are allocated only when generating new tokens
  3. Non-contiguous Storage: Blocks don’t need to be contiguous in physical memory; a “page table” manages the mapping

Visualizing PagedAttention

Traditional Approach:

User A: [████████████____________________] ← Large pre-allocation, lots of waste
User B: [████________________________] ← Same waste
User C: [██████████████████______________] ← Still wasted

PagedAttention:

Block Pool: [A1][B1][A2][C1][B2][C2][A3][C3][free][free][free]...

User A → Page Table: [A1, A2, A3] ← Only used 3 blocks
User B → Page Table: [B1, B2] ← Only used 2 blocks
User C → Page Table: [C1, C2, C3] ← Only used 3 blocks

Effect: No waste—each user uses exactly the memory they need.

Technical Details

1. Block

  • Each Block contains KV vectors for a fixed number of tokens
  • Typical Block size: 16 tokens
  • Block is the minimum unit of memory allocation

2. Block Table (Page Table)

  • Each request maintains a Block Table
  • Records which physical Blocks the request’s KV Cache uses
  • Similar to OS page tables, implementing virtual-to-physical mapping

3. On-demand Allocation

Timeline:
T1: User inputs "Hello" → Allocate Block #1
T2: Model generates "," → Block #1 still has space, continue using
T3: Model generates "I am..." → Block #1 is full, allocate Block #2
...

4. Memory Release

When a request completes or is interrupted, its occupied Blocks are immediately returned to the free pool for other requests.

Advantages of PagedAttention

1. Near-zero Waste

Research shows traditional methods have memory waste rates of 60-80%. PagedAttention reduces waste to nearly 0% (only the last Block might have some unused space).

2. Higher Concurrency

Same GPU memory can handle 2-4x more concurrent requests.

3. Longer Context Support

With improved memory utilization, the same hardware can support longer conversation contexts.

4. Flexible Memory Sharing

PagedAttention also supports an advanced feature: Copy-on-Write

When multiple requests share the same prefix (like the same system prompt):

  • They can share the same Blocks
  • New Blocks are copied only when modification is needed
  • Further memory savings

Conceptual Code Example

Although actual implementation is complex, the core idea can be understood like this:

class PagedKVCache:
    def __init__(self, block_size=16, num_blocks=1000):
        self.block_size = block_size
        # Pre-allocate Block pool
        self.block_pool = [Block() for _ in range(num_blocks)]
        self.free_blocks = list(range(num_blocks))

    def allocate_block(self):
        """Allocate a new Block on demand"""
        if self.free_blocks:
            return self.free_blocks.pop()
        raise MemoryError("No free blocks!")

    def free_block(self, block_id):
        """Release Block back to pool"""
        self.free_blocks.append(block_id)

class Request:
    def __init__(self, cache):
        self.cache = cache
        self.block_table = []   # Page table
        self.num_tokens = 0

    def add_token(self, kv_data):
        # Check if current Block is full
        if self.num_tokens % self.cache.block_size == 0:
            # Need new Block
            new_block = self.cache.allocate_block()
            self.block_table.append(new_block)

        # Write KV data
        current_block = self.block_table[-1]
        # ... write logic
        self.num_tokens += 1

Application in vLLM

vLLM is the first inference framework to implement PagedAttention:

## Install vLLM
pip install vllm

## Start server (automatically uses PagedAttention)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-chat-hf

Users don’t need any extra configuration—vLLM automatically applies PagedAttention optimization.

Performance Comparison

Metric Traditional PagedAttention
Memory Utilization 20-40% 95%+
Max Concurrency Baseline 2-4x
Memory Waste 60-80% <5%
Dynamic Scaling Difficult Easy

Summary

PagedAttention is an important innovation in the LLM inference field. By borrowing the paged memory management concept from operating systems, it transforms KV Cache memory management from “static pre-allocation” to “dynamic on-demand allocation,” dramatically improving memory utilization and system throughput.

This technology teaches us that sometimes the best solution to new problems lies hidden in the wisdom of classic techniques.

SGLang

SGLang:让大模型”结构化输出”的推理利器

在与大语言模型(LLM)打交道时,你是否遇到过这样的困扰:明明只想让模型输出一个 JSON 格式的数据,它却天马行空地”自由发挥”?SGLang 正是为解决这类问题而生的新一代 LLM 推理框架。让我们一起来认识这个来自 UC Berkeley 的创新工具。

什么是 SGLang?

SGLang(Structured Generation Language)是由 UC Berkeley 团队开发的高性能 LLM 推理引擎和编程语言。它的核心目标是:让大模型的输出更加可控结构化,同时保持极高的推理性能。

一句话概括: SGLang = 高性能推理 + 结构化输出控制

为什么需要 SGLang?

想象一下这个场景:

你让 ChatGPT 帮你提取一篇文章中的关键信息,并以 JSON 格式返回:

请提取以下信息并以JSON格式返回:
- 作者姓名
- 发布日期
- 主要观点

模型可能返回:

好的,我来帮你提取信息:

{
"author": "张三",
"date": "2024年1月1日",
"main_points": ["观点1", "观点2"]
}

希望这对你有帮助!如果需要更多信息,请告诉我。

问题来了——模型在 JSON 前后加了”废话”,你的程序解析 JSON 时直接崩溃了。

SGLang 的解决方案: 通过约束生成,强制模型只输出符合预期格式的内容。

SGLang 的核心特性

1. 结构化生成(Constrained Decoding)

SGLang 最强大的功能是约束解码。你可以精确控制模型的输出格式:

from sglang import gen, select

## 强制模型在固定选项中选择
answer = select("是", "否", "不确定")

## 强制输出符合正则表达式的内容
phone = gen(regex=r"\d{3}-\d{4}-\d{4}")

## 强制输出有效的 JSON
data = gen(json_schema={
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer"}
}
})

类比: 这就像给学生发一张选择题试卷,而不是让他写作文。答案必须在 A、B、C、D 中选择,不能”自由发挥”。

2. RadixAttention:智能缓存复用

SGLang 引入了创新的 RadixAttention 机制:

  • 使用 Radix Tree(基数树)来管理 KV Cache
  • 自动识别和复用相同前缀的缓存
  • 多个请求共享公共部分,大幅提升效率

举个例子:

假设有 100 个用户都在使用同一个系统提示词(System Prompt),传统方法需要计算 100 次。而 RadixAttention 只计算一次,然后复用给所有用户。
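
前缀复用节省的量可以直接估算:公共前缀的 prefill 只需计算一次,其余请求全部命中缓存(数字为假设):

system_prompt_tokens = 500     # 公共系统提示词长度(假设)
num_requests = 100

without_cache = system_prompt_tokens * num_requests   # 50000 个 token 的重复 prefill
with_radix_cache = system_prompt_tokens * 1           # 只算一次,其余请求直接复用 KV

print(f"节省的前缀 prefill 计算:{1 - with_radix_cache / without_cache:.0%}")   # 99%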

3. 前端编程语言

SGLang 提供了一种直观的编程方式来编排复杂的 LLM 调用:

@sgl.function
def multi_turn_qa(s, question1, question2):
    s += sgl.system("你是一个有帮助的助手。")
    s += sgl.user(question1)
    s += sgl.assistant(sgl.gen("answer1", max_tokens=100))
    s += sgl.user(question2)
    s += sgl.assistant(sgl.gen("answer2", max_tokens=100))

这种方式比传统的字符串拼接更清晰、更易维护。

4. 高性能推理后端

SGLang 的推理速度非常快,主要得益于:

  • 连续批处理(Continuous Batching)
  • 优化的 CUDA Kernel
  • 高效的内存管理
  • 推测解码(Speculative Decoding)支持

SGLang vs 其他框架

特性 SGLang vLLM TensorRT-LLM
结构化输出 ✅ 原生支持 ⚠️ 有限 ⚠️ 有限
约束解码 ✅ JSON/Regex
前缀缓存 ✅ RadixAttention
推理性能 优秀 优秀 极致
易用性 ✅ Python原生 ⚠️ 需要编译

实际应用场景

场景一:API 数据提取

@sgl.function
def extract_info(s, text):
    s += f"从以下文本中提取结构化信息:\n{text}"
    s += sgl.gen("result", json_schema={
        "type": "object",
        "properties": {
            "entities": {"type": "array"},
            "sentiment": {"enum": ["positive", "negative", "neutral"]},
            "summary": {"type": "string"}
        }
    })

场景二:多选题问答

@sgl.function
def multiple_choice(s, question, options):
    s += f"问题:{question}\n选项:{options}\n请选择正确答案:"
    s += sgl.select(["A", "B", "C", "D"], name="answer")

场景三:代码生成

@sgl.function
def generate_function(s, description):
    s += f"根据以下描述生成Python函数:\n{description}"
    s += sgl.gen("code", regex=r"def \w+\([^)]*\):[\s\S]+")

快速上手

安装

pip install sglang[all]

启动服务

python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000

使用示例

import sglang as sgl

@sgl.function
def simple_qa(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))

## 运行
state = simple_qa.run(question="什么是机器学习?")
print(state["answer"])

性能表现

根据官方基准测试,SGLang 在多种场景下表现出色:

  • JSON 模式生成: 比其他框架快 3-5 倍
  • 共享前缀场景: RadixAttention 带来 2-4 倍加速
  • 多轮对话: 缓存复用显著降低延迟

适用场景

SGLang 特别适合以下需求:

  • ✅ 需要结构化 JSON 输出的 API 服务
  • ✅ 需要约束模型在固定选项中选择
  • ✅ 多个请求共享相同前缀(如系统提示)
  • ✅ 复杂的多轮对话和推理链
  • ✅ 需要高性能的生产环境部署

总结

SGLang 是一个将结构化输出控制高性能推理完美结合的 LLM 推理框架。通过约束解码技术,它让大模型的输出变得可控、可预测;通过 RadixAttention 等优化,它在保持灵活性的同时实现了极高的推理效率。如果你的应用需要可靠的结构化输出,SGLang 是一个非常值得尝试的选择。

SGLang: The Inference Powerhouse for “Structured Output” from Large Models

When working with Large Language Models (LLMs), have you ever encountered this frustration: you only want the model to output data in JSON format, but it goes off on a tangent with “creative freedom”? SGLang is a next-generation LLM inference framework designed precisely to solve such problems. Let’s get to know this innovative tool from UC Berkeley.

What is SGLang?

SGLang (Structured Generation Language) is a high-performance LLM inference engine and programming language developed by the UC Berkeley team. Its core goal is: to make LLM output more controllable and structured while maintaining extremely high inference performance.

In one sentence: SGLang = High-performance inference + Structured output control

Why Do We Need SGLang?

Imagine this scenario:

You ask ChatGPT to extract key information from an article and return it in JSON format:

Please extract the following information and return in JSON format:
- Author name
- Publication date
- Main points

The model might return:

Okay, let me help you extract the information:

{
"author": "John Smith",
"date": "January 1, 2024",
"main_points": ["Point 1", "Point 2"]
}

Hope this helps! Let me know if you need more information.

The problem—the model added “fluff” before and after the JSON, and your program crashes when trying to parse the JSON.

SGLang’s Solution: Through constrained generation, force the model to output only content that matches the expected format.

Core Features of SGLang

1. Structured Generation (Constrained Decoding)

SGLang’s most powerful feature is constrained decoding. You can precisely control the model’s output format:

from sglang import gen, select

## Force model to choose from fixed options
answer = select("Yes", "No", "Uncertain")

## Force output matching a regex pattern
phone = gen(regex=r"\d{3}-\d{4}-\d{4}")

## Force output of valid JSON
data = gen(json_schema={
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"}
    }
})

Analogy: This is like giving students a multiple-choice test instead of asking them to write an essay. Answers must be chosen from A, B, C, D—no “creative freedom” allowed.
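
To make the mechanism less abstract, here is a minimal, framework-agnostic sketch of the underlying idea (toy numbers, not SGLang's implementation): before sampling each token, the engine masks out every token that would violate the constraint, so only valid continuations can ever be chosen.

import math

# Conceptual sketch of constrained decoding via logit masking (toy vocabulary).
vocab = ["Yes", "No", "Uncertain", "Maybe", "Hello"]
logits = [2.1, 0.3, -0.5, 1.7, 3.0]          # raw scores from the model (made-up numbers)
allowed = {"Yes", "No", "Uncertain"}          # the constraint: a fixed set of options

# Disallowed tokens get -inf, so their probability becomes exactly zero
masked = [l if tok in allowed else -math.inf for tok, l in zip(vocab, logits)]

exp = [math.exp(l) for l in masked]           # softmax over the masked logits
probs = [e / sum(exp) for e in exp]
print(dict(zip(vocab, (round(p, 3) for p in probs))))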

2. RadixAttention: Smart Cache Reuse

SGLang introduces the innovative RadixAttention mechanism:

  • Uses a Radix Tree to manage KV Cache
  • Automatically identifies and reuses caches with the same prefix
  • Multiple requests share common parts, dramatically improving efficiency

Example:

Suppose 100 users are all using the same system prompt. Traditional methods need to compute it 100 times. RadixAttention computes it once and reuses it for all users.
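
The sketch below illustrates the prefix-reuse idea with a plain dictionary keyed on token prefixes; SGLang's actual RadixAttention maintains a radix tree over the KV cache, so treat this only as a conceptual illustration.

# Conceptual sketch only -- not SGLang's RadixAttention code.
# Idea: key cached KV state by token prefixes so requests that share a
# prefix (e.g. the same system prompt) reuse work already done.

class PrefixCache:
    def __init__(self):
        self.cache = {}  # maps a token-id tuple (prefix) -> cached "KV state" placeholder

    def longest_cached_prefix(self, tokens):
        """Return the longest prefix of `tokens` that is already cached."""
        for end in range(len(tokens), 0, -1):
            key = tuple(tokens[:end])
            if key in self.cache:
                return key
        return tuple()

    def insert(self, tokens, kv_state):
        self.cache[tuple(tokens)] = kv_state


cache = PrefixCache()
system_prompt = [101, 7592, 2088]           # pretend token ids for a shared system prompt
cache.insert(system_prompt, "KV for system prompt")

request = system_prompt + [2023, 2003]      # a new request that extends the shared prompt
hit = cache.longest_cached_prefix(request)
print(f"reused {len(hit)} of {len(request)} tokens from cache")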

3. Frontend Programming Language

SGLang provides an intuitive programming approach to orchestrate complex LLM calls:

@sgl.function
def multi_turn_qa(s, question1, question2):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question1)
    s += sgl.assistant(sgl.gen("answer1", max_tokens=100))
    s += sgl.user(question2)
    s += sgl.assistant(sgl.gen("answer2", max_tokens=100))

This approach is cleaner and more maintainable than traditional string concatenation.

4. High-Performance Inference Backend

SGLang’s inference is very fast, thanks to:

  • Continuous Batching
  • Optimized CUDA Kernels
  • Efficient memory management
  • Speculative Decoding support

SGLang vs Other Frameworks

| Feature | SGLang | vLLM | TensorRT-LLM |
|---------|--------|------|--------------|
| Structured Output | ✅ Native support | ⚠️ Limited | ⚠️ Limited |
| Constrained Decoding | ✅ JSON/Regex | | |
| Prefix Caching | ✅ RadixAttention | | |
| Inference Performance | Excellent | Excellent | Extreme |
| Ease of Use | ✅ Python native | | ⚠️ Requires compilation |

Practical Application Scenarios

Scenario 1: API Data Extraction

@sgl.function
def extract_info(s, text):
    s += f"Extract structured information from the following text:\n{text}"
    s += sgl.gen("result", json_schema={
        "type": "object",
        "properties": {
            "entities": {"type": "array"},
            "sentiment": {"enum": ["positive", "negative", "neutral"]},
            "summary": {"type": "string"}
        }
    })

Scenario 2: Multiple Choice Q&A

@sgl.function
def multiple_choice(s, question, options):
    s += f"Question: {question}\nOptions: {options}\nPlease select the correct answer:"
    s += sgl.select(["A", "B", "C", "D"], name="answer")

Scenario 3: Code Generation

@sgl.function
def generate_function(s, description):
    s += f"Generate a Python function based on the following description:\n{description}"
    s += sgl.gen("code", regex=r"def \w+\([^)]*\):[\s\S]+")

Getting Started

Installation

pip install sglang[all]

Launch Server

python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000

Usage Example

import sglang as sgl

@sgl.function
def simple_qa(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))

## Run
state = simple_qa.run(question="What is machine learning?")
print(state["answer"])
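
When many questions need answering at once, the same decorated function can also be driven in batches. The sketch below is a hedged example that reuses the `simple_qa` function defined above and assumes the frontend's `run_batch` helper described in SGLang's documentation:

# Sketch: batch execution of the simple_qa function defined above
# (assumes SGLang's frontend exposes run_batch over a list of keyword-argument dicts).
questions = [
    {"question": "What is machine learning?"},
    {"question": "What is deep learning?"},
]
states = simple_qa.run_batch(questions)
for q, state in zip(questions, states):
    print(q["question"], "->", state["answer"])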

Performance

According to official benchmarks, SGLang performs excellently in various scenarios:

  • JSON Mode Generation: 3-5x faster than other frameworks
  • Shared Prefix Scenarios: RadixAttention provides 2-4x speedup
  • Multi-turn Dialogue: Cache reuse significantly reduces latency

Use Cases

SGLang is particularly suitable for:

  • ✅ API services requiring structured JSON output
  • ✅ Constraining models to choose from fixed options
  • ✅ Multiple requests sharing the same prefix (like system prompts)
  • ✅ Complex multi-turn dialogues and reasoning chains
  • ✅ High-performance production deployments

Summary

SGLang is an LLM inference framework that perfectly combines structured output control with high-performance inference. Through constrained decoding technology, it makes LLM output controllable and predictable; through optimizations like RadixAttention, it achieves extremely high inference efficiency while maintaining flexibility. If your application requires reliable structured output, SGLang is definitely worth trying.

TensorRT-LLM

TensorRT-LLM:NVIDIA 专为大语言模型打造的推理加速引擎

在大语言模型(LLM)席卷全球的今天,如何让这些”大块头”跑得更快、更省钱,成为了业界最关心的问题之一。NVIDIA 推出的 TensorRT-LLM 正是为解决这一问题而生的”神器”。今天,让我们用通俗易懂的方式,来了解这个强大的工具。

什么是 TensorRT-LLM?

TensorRT-LLM 是 NVIDIA 专门为大语言模型推理优化而设计的开源库。你可以把它理解为 TensorRT(NVIDIA 的通用深度学习推理优化器)的”LLM 特化版”。

打个比方:

  • TensorRT 就像一辆性能出色的”多功能越野车”,什么路都能跑。
  • TensorRT-LLM 则是一辆专门为”高速公路”(LLM推理场景)设计的”超级跑车”,在这条赛道上,它跑得比谁都快。

为什么需要 TensorRT-LLM?

大语言模型(如 GPT、LLaMA、Mistral 等)有几个显著特点:

  1. 参数量巨大: 动辄几十亿、几百亿甚至上万亿参数
  2. 自回归生成: 一个字一个字地”吐”出来,每生成一个 token 都需要完整的计算
  3. 显存占用高: 存储模型参数和中间状态需要大量 GPU 显存
  4. 延迟敏感: 用户希望对话响应越快越好

普通的推理方式根本”喂不饱”这些大模型,而 TensorRT-LLM 通过一系列黑科技,让 LLM 推理变得又快又省。

TensorRT-LLM 的核心技术

1. In-Flight Batching(动态批处理)

传统批处理要等一批请求都准备好才能一起处理。但用户的问题有长有短,有的三个字,有的三百字,等齐了再处理效率太低。

TensorRT-LLM 采用 In-Flight Batching

  • 新请求随时可以”插队”加入正在处理的批次
  • 已完成的请求立即释放资源
  • GPU 利用率大幅提升

类比: 就像餐厅不用等一桌客人都点完菜才下单,而是谁点好就先做谁的,厨房永远在忙碌。

2. Paged KV Cache(分页键值缓存)

LLM 推理时需要存储大量的 Key-Value 缓存(用于记住之前生成的内容)。传统方式需要预先分配固定大小的显存,很容易造成浪费。

TensorRT-LLM 借鉴操作系统的虚拟内存分页思想:

  • KV Cache 按需分配,用多少占多少
  • 避免显存碎片化
  • 支持更长的上下文和更多并发请求
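
下面用几行 Python 伪代码示意"按块分配、按需取用"的思路(纯概念演示,块大小与数量均为虚构,并非 TensorRT-LLM 的真实实现):

# 概念演示:把 KV Cache 切成固定大小的"块",按需分配,用完立刻归还
BLOCK_SIZE = 16                      # 每块可容纳的 token 数(示意值)
free_blocks = list(range(64))        # 空闲块编号池(示意值)

def allocate_blocks(num_tokens):
    """为 num_tokens 个 token 按需分配块,返回块编号列表。"""
    needed = -(-num_tokens // BLOCK_SIZE)          # 向上取整
    assert len(free_blocks) >= needed, "显存块不足"
    return [free_blocks.pop() for _ in range(needed)]

def release_blocks(blocks):
    """请求结束后立刻归还块,供其他请求复用。"""
    free_blocks.extend(blocks)

req_a = allocate_blocks(40)   # 只占 3 块,而不是预留整段上下文
req_b = allocate_blocks(10)   # 新请求继续从空闲池取块
release_blocks(req_a)         # 请求 A 完成,显存立即可复用
print(len(free_blocks), "blocks free")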

3. 多种量化技术

TensorRT-LLM 支持多种量化方案来压缩模型:

  • FP8 量化: 在 Hopper 架构 GPU 上效果最佳
  • INT8/INT4 量化: 大幅减少显存占用
  • AWQ、GPTQ、SmoothQuant: 各种先进的量化算法

量化就像把”高清大图”压缩成”缩略图”,虽然损失一点细节,但存储空间和传输速度都大幅改善。
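
以最基础的对称 INT8 量化为例,可以用几行 NumPy 代码体会这种"压缩再还原"的过程(仅为原理演示,TensorRT-LLM 实际使用的 AWQ、GPTQ 等算法要复杂得多):

import numpy as np

# 原理演示:对称 INT8 量化 = 用一个缩放因子把 FP 权重映射到 [-127, 127]
weights = np.array([0.42, -1.37, 2.05, -0.003, 0.88], dtype=np.float32)

scale = np.abs(weights).max() / 127.0                                # 缩放因子
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)    # 量化:以 int8 存储
deq = q.astype(np.float32) * scale                                   # 反量化:推理时还原近似值

print("int8:", q)
print("最大误差:", np.abs(weights - deq).max())   # 精度损失很小,显存占用却约为 FP32 的 1/4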

4. 张量并行与流水线并行

当一张 GPU 放不下整个模型时,TensorRT-LLM 支持:

  • 张量并行: 把一个计算任务”横切”分给多张 GPU
  • 流水线并行: 把模型”竖切”成多段,像流水线一样处理

这让超大模型也能高效运行。

5. 优化的注意力机制

TensorRT-LLM 集成了多种高效注意力实现:

  • Flash Attention: 减少显存访问,加速计算
  • Multi-Query Attention (MQA) 与 Grouped-Query Attention (GQA): 减少 KV Cache 占用

TensorRT-LLM vs 其他推理框架

| 特性 | TensorRT-LLM | vLLM | 原生 PyTorch |
|------|--------------|------|--------------|
| 开发者 | NVIDIA | UC Berkeley | Meta |
| 动态批处理 | ✅ In-Flight | ✅ Continuous | |
| 分页 KV Cache | ✅ | ✅ PagedAttention | |
| 量化支持 | FP8/INT8/INT4 | INT8/INT4 | 有限 |
| 硬件优化 | NVIDIA GPU 深度优化 | 通用 | 通用 |
| 性能 | 极致 | 优秀 | 基准 |

使用场景

TensorRT-LLM 特别适合以下场景:

  • 大规模在线服务: 需要处理大量并发请求的 ChatBot
  • 低延迟应用: 对响应时间要求苛刻的实时对话系统
  • 成本敏感场景: 希望用更少的 GPU 服务更多用户
  • 私有化部署: 在企业内部部署大模型服务

快速上手

使用 TensorRT-LLM 的基本流程:

  1. 安装: 通过 pip 或 Docker 安装
  2. 转换模型: 将 HuggingFace 模型转换为 TensorRT-LLM 格式
  3. 构建引擎: 编译生成优化后的推理引擎
  4. 部署服务: 使用 Triton Inference Server 或自定义服务
## 示例:转换 LLaMA 模型
python convert_checkpoint.py --model_dir ./llama-7b \
--output_dir ./trt_ckpt \
--dtype float16

## 构建引擎
trtllm-build --checkpoint_dir ./trt_ckpt \
--output_dir ./trt_engine \
--gemm_plugin float16

性能表现

根据 NVIDIA 官方数据,TensorRT-LLM 相比原生实现可以获得:

  • 吞吐量提升: 2-5 倍甚至更高
  • 延迟降低: 首 token 延迟和生成延迟都显著降低
  • 显存效率: 支持更长上下文和更多并发

总结

TensorRT-LLM 是 NVIDIA 为大语言模型推理打造的”专用跑车”。它通过动态批处理、分页 KV Cache、多种量化技术和硬件级优化,让 LLM 推理变得更快、更省、更强。如果你正在部署大模型服务,TensorRT-LLM 绝对值得一试。

TensorRT-LLM: NVIDIA’s Inference Acceleration Engine Built for Large Language Models

As Large Language Models (LLMs) sweep across the globe, how to make these “giants” run faster and cheaper has become one of the industry’s top concerns. NVIDIA’s TensorRT-LLM is a powerful tool designed precisely to solve this problem. Today, let’s understand this tool in an easy-to-understand way.

What is TensorRT-LLM?

TensorRT-LLM is an open-source library designed by NVIDIA specifically for LLM inference optimization. You can think of it as the “LLM-specialized version” of TensorRT (NVIDIA’s general deep learning inference optimizer).

An analogy:

  • TensorRT is like a high-performance “multi-purpose SUV” that can handle any road.
  • TensorRT-LLM is a “supercar” specifically designed for the “highway” (LLM inference scenarios)—on this track, it runs faster than anything else.

Why Do We Need TensorRT-LLM?

Large Language Models (such as GPT, LLaMA, Mistral, etc.) have several notable characteristics:

  1. Massive Parameters: Often billions, hundreds of billions, or even trillions of parameters
  2. Autoregressive Generation: Outputs tokens one by one, requiring complete computation for each generated token
  3. High Memory Usage: Storing model parameters and intermediate states requires significant GPU memory
  4. Latency Sensitive: Users want conversational responses as fast as possible

Ordinary inference methods simply cannot “feed” these large models efficiently, while TensorRT-LLM uses a series of advanced technologies to make LLM inference faster and more efficient.

Core Technologies of TensorRT-LLM

1. In-Flight Batching (Dynamic Batching)

Traditional batching waits until a batch of requests is ready before processing them together. But user queries vary in length—some are three words, others are three hundred—waiting for them all is too inefficient.

TensorRT-LLM uses In-Flight Batching:

  • New requests can “jump in” and join an ongoing batch at any time
  • Completed requests immediately release resources
  • GPU utilization improves dramatically

Analogy: It’s like a restaurant that doesn’t wait for everyone at a table to finish ordering before sending orders to the kitchen—whoever orders first gets served first, keeping the kitchen always busy.
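
A toy scheduler loop makes the contrast with static batching concrete; this is only a conceptual sketch of the idea, not TensorRT-LLM's scheduler, and the request sizes below are made up.

from collections import deque

# Conceptual sketch of in-flight (continuous) batching: the batch is rebuilt at every
# decoding step, so finished requests leave and waiting ones join immediately.
waiting = deque([("req1", 3), ("req2", 6), ("req3", 2)])  # (id, tokens left to generate)
active = {}
MAX_BATCH = 2

step = 0
while waiting or active:
    # Admit new requests whenever a slot is free -- no waiting for the batch to drain
    while waiting and len(active) < MAX_BATCH:
        rid, remaining = waiting.popleft()
        active[rid] = remaining

    # One decode step for every active request (this is where the GPU kernel would run)
    for rid in list(active):
        active[rid] -= 1
        if active[rid] == 0:
            del active[rid]          # a finished request frees its slot right away

    step += 1
    print(f"step {step}: active={list(active)} waiting={[r for r, _ in waiting]}")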

2. Paged KV Cache

During LLM inference, large amounts of Key-Value cache need to be stored (to remember previously generated content). Traditional methods require pre-allocating fixed-size memory, which easily leads to waste.

TensorRT-LLM borrows the virtual memory paging concept from operating systems:

  • KV Cache is allocated on demand—use only what you need
  • Avoids memory fragmentation
  • Supports longer contexts and more concurrent requests

3. Multiple Quantization Techniques

TensorRT-LLM supports various quantization schemes to compress models:

  • FP8 Quantization: Best performance on Hopper architecture GPUs
  • INT8/INT4 Quantization: Significantly reduces memory usage
  • AWQ, GPTQ, SmoothQuant: Various advanced quantization algorithms

Quantization is like compressing a “high-resolution image” into a “thumbnail”—you lose some detail, but storage space and transfer speed improve dramatically.

4. Tensor Parallelism and Pipeline Parallelism

When a single GPU cannot fit the entire model, TensorRT-LLM supports:

  • Tensor Parallelism: “Horizontally slices” a computation task across multiple GPUs
  • Pipeline Parallelism: “Vertically slices” the model into multiple stages, processing like an assembly line

This allows ultra-large models to run efficiently.

5. Optimized Attention Mechanisms

TensorRT-LLM integrates multiple efficient attention implementations:

  • Flash Attention: Reduces memory access, accelerates computation
  • Multi-Query Attention (MQA) and Grouped-Query Attention (GQA): Reduces KV Cache usage

TensorRT-LLM vs Other Inference Frameworks

| Feature | TensorRT-LLM | vLLM | Native PyTorch |
|---------|--------------|------|----------------|
| Developer | NVIDIA | UC Berkeley | Meta |
| Dynamic Batching | ✅ In-Flight | ✅ Continuous | |
| Paged KV Cache | ✅ | ✅ PagedAttention | |
| Quantization Support | FP8/INT8/INT4 | INT8/INT4 | Limited |
| Hardware Optimization | Deep NVIDIA GPU optimization | General | General |
| Performance | Extreme | Excellent | Baseline |

Use Cases

TensorRT-LLM is particularly suitable for the following scenarios:

  • Large-scale Online Services: ChatBots that need to handle many concurrent requests
  • Low-latency Applications: Real-time conversation systems with strict response time requirements
  • Cost-sensitive Scenarios: Serving more users with fewer GPUs
  • Private Deployment: Deploying LLM services within enterprises

Getting Started

The basic workflow for using TensorRT-LLM:

  1. Installation: Install via pip or Docker
  2. Convert Model: Convert HuggingFace models to TensorRT-LLM format
  3. Build Engine: Compile to generate optimized inference engine
  4. Deploy Service: Use Triton Inference Server or custom service
## Example: Converting LLaMA model
python convert_checkpoint.py --model_dir ./llama-7b \
--output_dir ./trt_ckpt \
--dtype float16

## Build engine
trtllm-build --checkpoint_dir ./trt_ckpt \
--output_dir ./trt_engine \
--gemm_plugin float16

Performance

According to NVIDIA’s official data, compared to native implementations TensorRT-LLM can achieve:

  • Throughput Improvement: 2-5x or even higher
  • Latency Reduction: Both time-to-first-token and generation latency significantly reduced
  • Memory Efficiency: Supports longer contexts and more concurrency

Summary

TensorRT-LLM is NVIDIA’s “dedicated supercar” built for LLM inference. Through dynamic batching, paged KV cache, various quantization techniques, and hardware-level optimizations, it makes LLM inference faster, more efficient, and more powerful. If you’re deploying LLM services, TensorRT-LLM is definitely worth trying.

Sampling method - UniPC AYS

揭秘 AI 绘画的“加速引擎”:UniPC AYS 采样方法

在当今的人工智能世界里,AI 绘画(如 Stable Diffusion)已经变得非常普遍。你输入一段话,AI 就能画出一幅精美的图片。但是,你是否想过,AI 是如何从一片模糊的噪点中“变”出画面的?在这个过程中,有一个叫做 “采样器”(Sampler) 的关键角色。

今天我们要介绍的主角,就是采样器家族中的一位新晋明星——UniPC AYS


1. 基础概念:AI 绘画是如何工作的?

在理解 UniPC AYS 之前,我们需要先打个比方来理解 AI 绘画的过程。

想象你在雕刻一块石头

甚至更简单一点,想象你在擦除雾气

  • 初始状态(高斯噪声): AI 刚开始时,就像面对充满浓雾的窗户,全是杂乱无章的像素点,你什么都看不清。
  • 生成过程(去噪): AI 收到你的指令(比如“一只猫”),它就开始一步步擦除雾气,把不属于“猫”的杂点去掉,慢慢显露出猫的轮廓、毛发和眼睛。
  • 采样步数(Steps): 你擦除雾气的动作次数。动作越多(步数越多),画面通常越精细,但花费的时间也越长。

采样器(Sampler) 就是那个决定“怎么擦”的策略师。有的采样器动作慢但细致,有的动作快但可能粗糙。


2. 什么是 UniPC?(全能的预测者)

UniPC(Unified Predictor-Corrector,统一预测-校正器)是近年来非常强力的采样算法。

类比:经验丰富的老画家

如果你让一个新手画家画画,他可能每画一笔都要停下来看看模特,生怕画错,速度很慢。
UniPC 就像一位经验丰富的老画家。他看一眼模特,就能精准地预测接下来好几笔该怎么画(Predictor),画完之后,他还会快速扫一眼进行微调(Corrector)。

  • 核心优势: UniPC 极其高效,它不仅画得快,而且在步数很少的情况下(比如只需 10 步)就能画出极高质量的图,而传统方法可能需要 20 到 50 步。

3. 什么是 AYS?(自适应的时刻表)

AYS 代表 Align Your Steps(意为“对齐你的采样步数”)。这是最近与 NVIDIA 相关的研究提出的一种采样调度优化策略。

类比:聪明的旅行规划师

假设你要从起点(全是噪点)走到终点(完美的画)。

  • 传统的调度(Standard Schedule): 就像坐一辆每站必停的公交车。不管这一段路是直道还是弯道,不管风景重不重要,它都匀速前进,每隔固定的距离停一下。这很稳,但可能浪费时间。
  • AYS 调度(Adaptive Schedule): 它就像一位聪明的专车司机。
    • 在路况复杂(画面很难处理,还是模糊的时候),它会开得慢一点,多停几次,多做几次处理。
    • 在路况简单(画面已经基本成型,只是微调细节)时,它会一脚油门踩过去,跳过不必要的停顿。

AYS 的核心思想是:把宝贵的计算资源(步数)花在对画质提升最关键的时刻。


4. 强强联手:UniPC AYS

当你把 UniPC(那个下笔如有神的老画家)和 AYS(那个聪明的旅行规划师)结合在一起时,神奇的事情发生了。

UniPC AYS = 极速 + 高质量

在 Stable Diffusion WebUI 或 ComfyUI 等软件中选择 UniPC AYS 作为采样器,意味着:

  1. 速度极快: 通常只需要 10 步甚至更少 就能生成一张可用的、细节丰富的图片。
  2. 效率极高: 它通过 AYS 优化了每一步的“去噪力度”,配合 UniPC 强大的预测能力,用最少的动作完成了最复杂的绘画任务。

这对用户意味着什么?

  • 省时间: 以前生成一张好图可能要 5-10 秒,现在可能只需要 2-3 秒。
  • 省显卡: 生成同样的图片,计算量减少了,你的显卡不用那么累。
  • 无需纠结: 你不再需要为了画质把步数拉到 50 或 100,UniPC AYS 告诉你:少即是多。

总结图表

为了更直观地理解,我们可以看下面这个对比表:

| 采样方法 | 角色类比 | 步数需求 | 速度 | 质量 | 就像 |
|----------|----------|----------|------|------|------|
| Euler A | 勤恳的学生 | 20-40 步 | 中等 | 多样性好 | 每一笔都认真思考,稍微有点犹豫。 |
| DPM++ 2M Karras | 严谨的工程师 | 20-30 步 | 快 | 很高 | 按部就班,效率很高,目前的主流。 |
| UniPC (标准) | 天才画家 | 15-25 步 | 很快 | 极高 | 拥有超强预判能力,下笔精准。 |
| UniPC AYS | 带着智能导航的天才画家 | 10 步左右 | 极速 | 极高 | 在最关键的地方精雕细琢,其他地方一笔带过,只需几秒完成大作。 |

结语

技术总是在向着“更快、更好、更省”的方向发展。UniPC AYS 就是这一趋势的典型代表。它证明了 AI 不仅是在堆砌算力,更是在学习如何“聪明地”工作。下次你在使用 AI 绘图软件时,不妨把采样器切换到 UniPC 并配合 AYS 调度,亲身体验一下这种“飞一般”的感觉吧!

Unveiling the “Turbo Engine” of AI Art: UniPC AYS Sampling Method

In the world of Artificial Intelligence today, AI Art generation (like Stable Diffusion) has become ubiquitous. You type in a sentence, and the AI conjures up a beautiful image. But have you ever wondered how AI transforms a field of blurry static into a masterpiece? In this process, there is a key player called the “Sampler.”

Today, we are introducing a rising star in the sampler family: UniPC AYS.


1. The Basics: How Does AI Art Work?

To understand UniPC AYS, we first need an analogy for the AI painting process.

Imagine sculpting a block of stone

Or even simpler, imagine wiping fog off a window.

  • Initial State (Gaussian Noise): When the AI starts, it’s like facing a window covered in thick fog (static pixels). You can’t see anything clearly.
  • Generation Process (Denoising): The AI receives your command (e.g., “a cat”). It starts wiping away the fog step by step, removing the random dots that don’t belong to a “cat,” slowly revealing the outline, fur, and eyes.
  • Sampling Steps: The number of times you wipe the fog. More wipes (more steps) usually mean a more detailed image, but it also takes longer.

The Sampler is the strategist who creates the plan for “how to wipe.” Some samplers are slow but meticulous; others are fast but might be rough.


2. What is UniPC? (The Unified Predictor)

UniPC (Unified Predictor-Corrector) is a powerful sampling algorithm developed in recent years.

Analogy: The Experienced Master Painter

If you ask a novice to paint, they might stop after every brushstroke to check the model, afraid of making a mistake. It’s slow.
UniPC, however, is like an old master painter. He looks at the model once and can accurately predict how the next several brushstrokes should flow (Predictor). After painting, he takes a quick glance to make minor adjustments (Corrector).

  • Core Advantage: UniPC is extremely efficient. Not only does it paint fast, but it can produce very high-quality images in very few steps (e.g., only 10 steps), whereas traditional methods might need 20 to 50 steps.

3. What is AYS? (The Adaptive Schedule)

AYS stands for Align Your Steps. It is a sampling-schedule optimization strategy proposed in recent research associated with NVIDIA.

Analogy: The Smart Travel Planner

Suppose you want to travel from the starting point (pure noise) to the destination (a perfect picture).

  • Standard Schedule: This is like a bus that stops at every single station. It doesn’t matter if the road is straight or curved, or if the scenery is important; it moves at a constant speed and stops at fixed intervals. It’s stable, but it wastes time.
  • AYS Schedule (Adaptive): This is like a smart private driver.
    • When the road conditions are complex (when the image is still blurry and the foundational structure is being formed), it drives slower and stops more often to process details.
    • When the road is simple (when the image is basically formed and just needs slight polishing), it steps on the gas and skips unnecessary stops.

The core idea of AYS is: Spend precious computing resources (steps) on the moments that matter most for image quality.
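
As a purely illustrative sketch (the numbers below are invented for comparison, not the published AYS schedules), here is how a uniform 10-step schedule differs from one that spends more of its steps in the noisy early phase, where the image structure is still forming:

import numpy as np

# Illustrative only: uniform vs. non-uniform 10-step schedules over 1000 training steps.
# Real AYS schedules are optimized per model; these numbers are made up for comparison.
uniform = np.linspace(999, 0, 10).round().astype(int)

# Hand-tuned example: dense where the image is still noisy, sparse near the end
non_uniform = np.array([999, 960, 910, 850, 770, 660, 520, 350, 170, 0])

print("uniform:    ", uniform.tolist())
print("non-uniform:", non_uniform.tolist())
print("early-phase steps (t > 500):", (uniform > 500).sum(), "vs", (non_uniform > 500).sum())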


4. The Power Combo: UniPC AYS

When you combine UniPC (the master painter with intuitive foresight) with AYS (the smart travel planner), magic happens.

UniPC AYS = Extreme Speed + High Quality

Selecting UniPC AYS as your sampler in software like Stable Diffusion WebUI or ComfyUI means:

  1. Blazing Speed: It often produces a usable, detailed image in 10 steps or even fewer.
  2. High Efficiency: It uses AYS to optimize the “denoising strength” of each step, combined with UniPC’s powerful predictive capabilities, completing the most complex painting tasks with the minimum number of actions.

What does this mean for the user?

  • Save Time: Generating a good image used to take 5-10 seconds; now it might only take 2-3 seconds.
  • Save GPU Power: Generating the same image requires less calculation, so your graphics card doesn’t have to work as hard.
  • No More Guesswork: You no longer need to crank the steps up to 50 or 100 for quality. UniPC AYS tells you: Less is more.

Summary Chart

For a more intuitive understanding, let’s look at this comparison table:

| Sampling Method | Role Analogy | Typical Steps | Speed | Quality | Ideally Like |
|-----------------|--------------|---------------|-------|---------|--------------|
| Euler A | Diligent Student | 20-40 steps | Medium | Good Diversity | Thinks carefully about every stroke, slightly hesitant. |
| DPM++ 2M Karras | Rigorous Engineer | 20-30 steps | Fast | Very High | Methodical, highly efficient, currently the mainstream. |
| UniPC (Standard) | Genius Painter | 15-25 steps | Very Fast | Extremely High | Has super prediction skills, precise strokes. |
| UniPC AYS | Genius Painter with a Smart GPS | Around 10 steps | Extreme | Extremely High | Focuses effort on critical areas, speeds through the easy parts. A masterpiece in seconds. |

Conclusion

Technology is always evolving towards being “faster, better, and cheaper.” UniPC AYS is a prime example of this trend. It proves that AI isn’t just about piling up raw computing power, but about learning how to work “smartly.” Next time you are using AI art software, try switching the sampler to UniPC combined with the AYS schedule, and experience that “flying” sensation for yourself.

Sampling method - UniPC Trailing

AI 绘图的幕后英雄:采样方法 UniPC 详解

AI Art’s Unsung Hero: Explaining the UniPC Sampling Method

在人工智能生成图像(AI Art)的魔法世界里,我们输入一段文字,几秒钟后,一幅精美的画作就诞生了。但你知道吗?在这个过程中,有一个至关重要的步骤叫做**“采样” (Sampling)**。今天我们要介绍的主角,就是近年来备受瞩目的一种高效采样方法——UniPC。

In the magical world of AI-generated imagery (AI Art), we type a few words, and seconds later, a masterpiece appears. But did you know there is a crucial step in this process called “Sampling”? Today, our protagonist is a highly efficient sampling method that has garnered much attention recently—UniPC.


1. 什么是“采样”?(What is “Sampling”?)

在深入了解 UniPC 之前,我们需要先明白 AI 是如何画画的。目前最流行的 AI 绘画技术叫做**“扩散模型” (Diffusion Model)**。

Before diving into UniPC, we need to understand how AI paints. The most popular technology currently used is called the “Diffusion Model.”

形象的比喻:从满屏雪花到清晰照片

The Analogy: From Static Noise to a Clear Photo

想象一下,你有一张清晰的照片,然后你慢慢地往上面撒沙子(噪点)。撒得越多,照片就越模糊,直到最后变成了一片完全随机的杂乱沙堆(这种状态在 AI 里叫做“高斯噪声”)。

Imagine you have a clear photo, and you slowly sprinkle sand (noise) over it. The more sand you add, the blurrier the photo becomes, until finally, it’s just a completely random pile of sand (in AI terms, this is called “Gaussian Noise”).

AI 的绘画过程,其实就是这个过程的逆向操作

  1. AI 面对的是一堆毫无意义的随机噪点(像老电视的雪花屏)。
  2. 它开始一步步地“扫去沙子”。
  3. 它每扫一步,都要猜测:“这下面的图案原来应该是什么样子的?”
  4. 经过几十步甚至上百步的清理和修正,一幅清晰的图像就显露出来了。

AI’s painting process is essentially the reverse operation.

  1. The AI faces a pile of meaningless random noise (like the static on an old TV).
  2. It starts to “sweep away the sand” step by step.
  3. With each sweep, it guesses: “What should the pattern underneath look like?”
  4. After dozens or even hundreds of steps of cleaning and correcting, a clear image is revealed.

这个**“一步步去除噪点,逐渐还原图像”的过程,就是采样 (Sampling)。而采样器 (Sampler)** 就是那个负责执行清理工作的“清洁工”。

This process of “removing noise step by step to gradually restore the image” is Sampling. The Sampler is the “cleaner” responsible for executing this work.


2. 为什么我们需要 UniPC?(Why Do We Need UniPC?)

传统的采样方法虽然有效,但有一个大问题:速度太慢。

Traditional sampling methods work, but they have a big problem: they are slow.

如果要画出一张完美的图,可能需要那个“清洁工”扫 50 次甚至 100 次以确保每一个细节都对。这就意味着生成一张图可能需要很久。我们总是希望 AI 能画得又快又好

To create a perfect image, the “cleaner” might need to sweep 50 or even 100 times to ensure every detail is correct. This means generating one image can take a long time. We always want AI to paint fast and well.

UniPC 的出现,就是为了解决速度问题。

UniPC appeared to solve the speed problem.


3. UniPC 的绝技:预估与修正 (The Concept: Predictor-Corrector)

UniPC 的全称是 Unified Predictor-Corrector(统一预测-修正器)。听起来很复杂,但它的核心原理可以这样理解:

UniPC stands for Unified Predictor-Corrector. It sounds complex, but its core principle can be understood this way:

形象的比喻:老司机过弯道

The Analogy: A Veteran Driver on a Curved Road

想象你在开一辆车(这就是图像生成的路径),前面的路是弯弯曲曲的(图像从模糊变清晰的过程是非线性的)。

Imagine you are driving a car (this is the image generation path), and the road ahead is winding (the process of the image going from blurry to clear is non-linear).

  • 普通的采样器(新手司机):
    每开一小步,就要停下来仔细看地图,计算下一步怎么走,非常谨慎,所以开得很慢。

    Ordinary Sampler (Novice Driver):
    Stops every few feet to check the map carefully and calculate the next move. Very cautious, therefore very slow.

  • UniPC(老司机):
    它具备一种强大的“预判”能力。它看一眼路况,就能预测 (Predict) 接下来的弯大概是多大,然后直接打方向盘冲过去。
    但如果只是盲目冲刺可能会翻车(画崩了),所以它还有一个修正 (Correct) 机制。一旦发现实际路况跟预测的稍微有点偏差,它会立刻微调方向盘,把车拉回正轨。

    UniPC (Veteran Driver):
    It possesses a powerful “anticipation” ability. With one look at the road, it can Predict the curvature of the turn and steer through it confidently.
    However, blindly rushing could lead to a crash (ruined image), so it also has a Correct mechanism. As soon as it senses a slight deviation between the actual road and its prediction, it immediately fine-tunes the steering wheel to pull the car back on track.

UniPC 的两大杀手锏:

UniPC’s Two Killer Features:

  1. 统一性 (Unified): 它可以兼容各种类型的扩散模型,不管是画二次元的,还是画写实照片的,它都能驾驭。

  2. 极少步数 (Few Steps): 因为它预测得准,修正得快,别人要走 50 步才能画好的图,UniPC 可能只需要 10 步!

  1. Unified: It is compatible with various types of diffusion models, whether it’s for anime style or photorealistic photos, it can handle them all.

  2. Few Steps: Because it predicts accurately and corrects quickly, while others might need 50 steps to finish a drawing, UniPC might only need 10 steps!
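
如果想亲自试一试,可以在 Hugging Face diffusers 中把调度器换成 UniPC 并把步数调低(下面只是一个入门示意,模型与参数可按需替换)。
If you want to try it yourself, you can swap in UniPC with a low step count in Hugging Face diffusers; the sketch below assumes a Stable Diffusion 1.5 checkpoint and a GPU, and is a starting point rather than a tuned recipe.

# Sketch: trying UniPC with few steps in diffusers (adjust model id and device to your setup)
import torch
from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap the default scheduler for UniPC, keeping the model's noise-schedule config
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe("a cat sitting on a windowsill", num_inference_steps=10).images[0]
image.save("cat_unipc_10steps.png")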


4. 什么是 “Trailing”?(What is “Trailing”?)

在很多 AI 软件(如 Stable Diffusion WebUI)中,你可能会看到 UniPC 后面并没有跟着 Trailing 这个词,但在学术论文或某些特定实现中,会讨论到 UniPC 处理时间的策略。

In many AI software interfaces (like Stable Diffusion WebUI), you might not see the word Trailing directly after UniPC. However, in academic papers or specific implementations, strategies regarding how UniPC handles time steps are discussed.

虽然 “Trailing” 不是 UniPC 名字的一部分,但它描述了 UniPC 在处理数学序列时的一种特性。如果把生成过程看作是一个时间轴:t1, t2, t3...

Although “Trailing” isn’t part of the UniPC name itself, it relates to how UniPC handles mathematical sequences. If we view the generation process as a timeline: t1, t2, t3....

Trailing(追踪/拖尾) 在这里可以理解为利用过去的经验来辅助当前的决策。

Trailing here can be understood as using past experience to assist current decisions.

比喻:看着后视镜开车

Analogy: Driving Using the Rearview Mirror

UniPC 在预测下一步怎么走时,不仅仅只看当前的位置,它还会参考之前走过的几个点(Previous Time Steps)。这就好比司机如果不确定前面的弯有多急,他会回想一下刚刚经过的那段路的弯度变化趋势。

When UniPC predicts the next move, it doesn’t just look at the current position; it also references several points it has already passed (Previous Time Steps). It’s like a driver who, if unsure how sharp the turn ahead is, recalls the curvature trend of the road segment they just drove through.

通过分析“之前的数据轨迹/尾迹 (Trailing points)”,UniPC 能画出一条更平滑、更精准的曲线,从而在极少的步数内直达终点(生成完美的图像)。

By analyzing the “past data trajectory/trailing points,” UniPC can draw a smoother, more precise curve, thereby reaching the destination (generating a perfect image) in very few steps.


5. 总结:它对你意味着什么?(Summary: What Does This Mean for You?)

如果你是一名 AI 绘画的使用者,选择 UniPC 采样器意味着:

  1. 速度飞快:生成图片的时间大幅缩短。别人生成一张图你可以生成两张。
  2. 质量不减:尽管速度快,但画面的细节和清晰度依然保持顶尖水平。
  3. 计算省力:对于显卡配置不高的电脑,UniPC 是一个非常友好的选择,因为它用更少的计算量就能达到同样的效果。

If you are an AI art user, choosing the UniPC sampler means:

  1. Blazing Speed: The time to generate images is drastically reduced. While others generate one image, you can generate two.
  2. Uncompromised Quality: Despite the speed, the details and clarity of the image remain top-tier.
  3. Computational Efficiency: For computers with lower-end graphics cards, UniPC is a very friendly choice because it achieves the same result with less computational power.

简单来说,UniPC 就像是给了 AI 绘画引擎装上了一个涡轮增压器,让创作变得既轻松又高效。

In short, UniPC is like installing a turbocharger on the AI painting engine, making creation both effortless and efficient.


Sampling method - DDIM Trailing

被遗忘的时光倒流者:深入浅出 DDIM Trailing

The Forgotten Time Traveler: Demystifying DDIM Trailing

在人工智能绘画的奇妙世界里,我们输入一段文字,AI 就能变出一幅惊艳的画作。这背后离不开一种叫“扩散模型”(Diffusion Model)的技术。而在这种技术中,如何从一堆杂乱的噪点恢复成清晰图像,取决于一种名为**“采样方法”(Sampling Method)**的策略。

今天我们要揭秘的,是一个特定且稍显冷门的概念:DDIM Trailing

不用担心那些复杂的公式,我们就用**“复原古画”**的例子来聊聊它。


1. 基础概念:扩散模型就像“泼墨与复原”

想象一下,你有一幅精美的油画(比如《蒙娜丽莎》)。

  1. 加噪(泼墨): 如果我们往画上撒一点点沙子,画变模糊了。再撒一点,更模糊了。重复一千次,最后这幅画就变成了一堆毫无意义的沙砾(纯噪声)。这个过程叫“扩散”。
  2. 去噪(复原): AI 学习的就是如何“逆转”这个过程。它看着那堆沙砾,试图猜出这一步之前沙子是怎么分布的,然后一点点把沙子拿走,直到变回《蒙娜丽莎》。

这个“一点点拿走沙子”的过程,就是采样(Sampling)

2. 什么是 DDIM?(快速通道)

最原始的复原方法(DDPM)非常慢,就像一个强迫症工匠,必须严格按照泼沙子的反向步骤,一步步慢慢清理,可能需要走1000步。

DDIM (Denoising Diffusion Implicit Models) 就像是一个经验丰富的大师。它发现其实不需要每一步都走。它可以“跳步”。比如,它看了一眼现在的沙子分布,直接预测出10步之后的样子,甚至直接大致猜出原画的样子,从而大大加快了作画速度。原本1000步的工作,它只要50步就能完成。

3. 核心主角:DDIM Trailing (DDIM 拖尾)

那么,什么是 Trailing(拖尾) 呢?

这就涉及到底层代码中一个非常微妙的时间步(Timestep)对齐问题。

形象比喻:倒计时跳格游戏

想象你在玩一个“时光倒流”的跳格游戏。

  • 起点:第1000格(全是沙子)。
  • 终点:第0格(清晰的画)。
  • 规则:你不需要每格都跳,你可以大步跳跃。比如每一步跳20格。

你有一个任务表(Timesteps Schedule),告诉你接下来要踩在哪一格上。假设你要用10步走完这1000格。

任务表可能是这样的: [999, 899, 799, ..., 99, 0]

在这个过程中,AI 需要做两件事:

  1. 看一看:现在的格子什么样?
  2. 算一算:下一个目标格子该长什么样?

DDIM Trailing 的关键在于:在计算下一个格子时,我参考的“时间刻度”和实际跳过去的“物理落点”是否有一点点偏差?

两种模式的对比

我们用更生活化的**“公交车报站”**来类比:

A. 无拖尾(Standard / Leading):以此为准

这就好比你是公交车司机。你要每隔10分钟报一次站。

  • 现在是 10:00。
  • 系统逻辑:“我现在就在 10:00 这一站,请计算去 09:50 的路线。”
  • 这是一种很直观、标准的对齐方式。这也是很多现代采样器的默认逻辑。

B. 拖尾(Trailing):向后参考

DDIM Trailing 是一种早期的、特定的实现逻辑(源自原始的 DDIM 代码库)。它的逻辑有点像你在用一个稍微延迟的旧手表

  • 虽然物理上你在 10:00 这一站。
  • 但算法在抓取参数时,实际上参考的是上一步留下的时间戳索引(Trailing index)。它像是在说:“虽然我在这一站,但我计算跨度时,要从这一个区间的末尾开始算起。”

从数学实现上讲,如果你要把 1000 个原本的步骤压缩成 50 个步骤:

  • Trailing 开启时:生成的时间步序列可能会让你感觉像是把采样的终点“拖”在后面。它在转换连续的时间(0到1)到离散的步骤(0到1000)时,会采用一种向下取整或特定偏移的方式。
  • 结果差异:这会导致 AI 在去噪的最后几步(最接近成画的时候)处理方式不同。
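
在 Hugging Face diffusers 里,这个差异大致对应调度器配置中的 timestep_spacing 选项(以下为示意写法,具体数值会因版本和配置而略有不同):

from diffusers import DDIMScheduler

# 示意:同一套噪声调度,分别按 "leading" 与 "trailing" 两种方式离散化时间步
leading = DDIMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler", timestep_spacing="leading"
)
trailing = DDIMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler", timestep_spacing="trailing"
)

leading.set_timesteps(10)
trailing.set_timesteps(10)
print("leading :", leading.timesteps.tolist())   # 起点一般不落在最后一个训练步上
print("trailing:", trailing.timesteps.tolist())  # 从最后一个训练步开始,往回"拖"着取点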

4. 为什么要注意 DDIM Trailing?

你可能会问:“这不就是代码写法的细微区别吗?对我有影响吗?”

有,主要体现在“还原度”和“确定性”上。

  1. 确定性(Determinism): DDIM 的一大卖点是,给定相同的随机种子(Seed),它生成的图应该是一模一样的。但是,如果你的软件(比如 Stable Diffusion WebUI)和别人的软件在 “Trailing” 设置上不一致,哪怕所有参数都一样,你们跑出来的图也会有细微差别(构图相似,但细节不同)。
  2. 最后一步的精度: 很多研究发现,Trailing 的处理方式会影响图片生成最后一步是否能完美收敛到“纯净图像”。如果处理不好(比如时间步没对齐),图片可能会残留一点点灰蒙蒙的噪点,或者亮度和对比度略微不对劲。

5. 总结

在大多数现代 AI 绘画软件中,这些复杂的数学细节已经被优化和隐藏了。现在的调度器(Scheduler)大多使用更精确的数学对齐(Linspace 等),不再需要手动去纠结 Trailing。

但是,了解 DDIM Trailing 能让你明白:

  • AI 并不神奇,它全是数学:哪怕是“时间步”怎么数这么小的问题,都会影响最终画作的每一笔。
  • Trailing 就像是“复原古画”时的节奏感:它是那种老式的、特定的跳步节奏。虽然现在有了更精密的电子节拍器,但那种老派的节奏,是属于早期 DDIM 算法独特的标记。

下次如果你看到生成的图片和教程里有一点点细节对不上,也许就是这个名为“Trailing”的时间幽灵在跟你开的一个小玩笑。

The Forgotten Time Traveler: Demystifying DDIM Trailing

In the wondrous world of AI art, we type in text, and the AI conjures up a stunning image. This magic relies on a technology known as “Diffusion Models.” Within this technology, the strategy for restoring a clear image from a mess of noise is called a “Sampling Method.”

Today, we are going to demystify a specific and somewhat niche concept: DDIM Trailing.

Don’t worry about complex formulas; we will use the analogy of “Restoring an Ancient Painting” to explain it.


1. The Basic Concept: Diffusion Models are like “Spilling Ink and Restoring”

Imagine you have an exquisite oil painting (like the Mona Lisa).

  1. Adding Noise (Spilling Sand): If we sprinkle a little sand on the painting, it becomes blurred. Sprinkle more, and it gets blurrier. Repeat this a thousand times, and the painting eventually becomes a meaningless pile of gravel (pure noise). This process is called “Diffusion.”
  2. Denoising (Restoring): What the AI learns is how to “reverse” this process. It looks at the pile of gravel, tries to guess how the sand was distributed one step before, and removes the sand bit by bit until the Mona Lisa reappears.

This process of “removing sand bit by bit” is Sampling.

2. What is DDIM? (The Fast Lane)

The original restoration method (DDPM) is very slow, like an obsessive craftsman who must strictly follow the reverse steps of spilling the sand, cleaning up step-by-step. It might take 1,000 steps.

DDIM (Denoising Diffusion Implicit Models) is like an experienced master. It realizes that you don’t actually need to take every single step. It can “skip steps.” For example, by looking at the current sand distribution, it can predict what it will look like 10 steps later, or even roughly guess the original painting immediately, thus greatly speeding up the process. A task that took 1,000 steps can now be done in just 50.

3. The Protagonist: DDIM Trailing

So, what is Trailing?

This involves a very subtle alignment issue with Timesteps in the underlying code.

Metaphor: The Hopscotch Countdown

Imagine playing a “Time Travel” hopscotch game.

  • Start: Square #1000 (Full of sand).
  • End: Square #0 (Clear painting).
  • Rule: You don’t need to jump on every square; you can take giant leaps. For example, jump 20 squares at a time.

You have a task list (Timesteps Schedule) telling you which square to land on next. Suppose you want to finish these 1000 squares in 10 steps.

The schedule might look like this: [999, 899, 799, ..., 99, 0]

During this process, the AI needs to do two things:

  1. Observe: What does the current square look like?
  2. Calculate: What should the next target square look like?

The key to DDIM Trailing lies here: When calculating the next square, is there a slight deviation between the “time scale” I reference and the actual “physical landing spot” I jump to?

Comparing the Two Modes

Let’s use a more everyday analogy: Bus Stop Announcements.

A. No Trailing (Standard / Leading): “As Is”

This is like being a bus driver. You have to announce a stop every 10 minutes.

  • Current time: 10:00.
  • System Logic: “I am currently at the 10:00 station exactly. Please calculate the route to 09:50.”
  • This is a very intuitive, standard alignment. It is the default logic for many modern samplers.

B. Trailing: Referencing Backwards

DDIM Trailing is an early, specific implementation logic (originating from the original DDIM codebase). Its logic is a bit like using a slightly delayed old watch.

  • Physically, you are at the 10:00 station.
  • However, when the algorithm grabs parameters, it actually references the previous step’s trailing index. It’s like saying: “Although I am at this station, when I calculate the span, I count from the tail end of the previous interval.”

Mathematically, if you are compressing 1,000 original steps into 50 steps:

  • When Trailing is On: The sequence of time steps generated might feel like the sampling endpoint is “trailing” behind. When converting continuous time (0 to 1) to discrete steps (0 to 1000), it uses a form of floor rounding or specific offset.
  • Resulting Difference: This causes the AI to handle the very last few steps (when the image is closest to completion) differently.
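
The difference is easiest to see in the index arithmetic itself. The sketch below is a simplified version of the two spacing conventions (exact offsets vary between implementations such as diffusers):

import numpy as np

train_steps, infer_steps = 1000, 10
ratio = train_steps // infer_steps

# "Leading"-style spacing: count forward from 0 in equal strides, then reverse
leading = (np.arange(0, infer_steps) * ratio)[::-1]          # 900, 800, ..., 0

# "Trailing"-style spacing: count backwards from the final training step
trailing = np.round(np.arange(train_steps, 0, -ratio)) - 1   # 999, 899, ..., 99

print("leading :", leading.tolist())
print("trailing:", trailing.astype(int).tolist())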

4. Why Does DDIM Trailing Matter?

You might ask: “Isn’t this just a tiny difference in code? Does it affect me?”

Yes, primarily in terms of ‘Reproduction’ and ‘Determinism’.

  1. Determinism: A big selling point of DDIM is that given the same random Seed, the image generated should be identical. However, if your software (like Stable Diffusion WebUI) and someone else’s software differ in their “Trailing” settings, you will get slightly different images (similar composition, but different details), even if all other parameters are the same.
  2. Precision of the Last Step: Many studies have found that how Trailing is handled affects whether the final step of image generation converges perfectly to a “pure image.” If handled poorly (e.g., timesteps are misaligned), the image might retain a slight foggy noise, or the brightness and contrast might be slightly off.

5. Conclusion

In most modern AI art software, these complex mathematical details have been optimized and hidden. Current Schedulers mostly use more precise mathematical alignment (like Linspace), and there is usually no need to manually fuss over Trailing.

However, understanding DDIM Trailing helps you realize:

  • AI isn’t magic; it’s all math: Even a problem as small as “how to count time steps” affects every stroke of the final painting.
  • Trailing is like the rhythm in ‘Restoring Ancient Paintings’: It is that old-school, specific skipping rhythm. Although we now have more precise electronic metronomes, that old-school rhythm is a unique signature of the early DDIM algorithms.

Next time, if you see that a generated image doesn’t quite match the details in a tutorial, perhaps it’s the time ghost named “Trailing” playing a little joke.