Prefill-Decode Separation: Letting Large Model Inference “Walk on Two Legs”
When you ask ChatGPT a question, the AI’s response doesn’t happen all at once. It actually goes through two distinctly different phases: Prefill and Decode. Understanding these two phases and optimizing them specifically is one of the core ideas in modern LLM inference system design.
Two Phases of LLM Inference
Phase One: Prefill
What it does: Processes all user input content and generates the first output token.
Characteristics:
- Processes the entire input sequence at once
- High computational load, but highly parallel
- Generates KV Cache for all tokens
- Similar to the “reading the question” phase
Input: "Please explain what artificial intelligence is?"
Phase Two: Decode
What it does: Generates subsequent tokens one by one until completion.
Characteristics:
- Generates only one token at a time
- Low computational load, but sequential
- Reuses KV Cache from Prefill phase
- Similar to the “writing the answer” phase
1 | "Artificial" → "intelligence" → "is" → "a" → "..." → "[END]" |
Essential Differences Between the Two Phases
| Feature | Prefill | Decode |
|---|---|---|
| Tokens Processed | Many (entire input) | 1 (each time) |
| Compute Intensity | High (compute-bound) | Low (memory-bound) |
| Parallelism | High | Low |
| Bottleneck | Compute power | Memory bandwidth |
| GPU Utilization | High | Low |
Core insight: Prefill is “compute-intensive,” Decode is “memory-intensive”—they need completely different hardware resources!
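A back-of-the-envelope estimate shows why the split in the table happens. The figures below (a 7B-parameter FP16 model and a 1,000-token prompt) are rough illustrative assumptions, not measurements:

```python
# Roofline-style estimate: FLOPs performed vs. bytes of weights read per pass.
PARAMS = 7e9                  # 7B-parameter model (assumption)
BYTES_PER_PARAM = 2           # FP16 weights
PROMPT_TOKENS = 1000          # example prompt length

flops_per_token = 2 * PARAMS            # ~2 FLOPs per parameter per token
weight_bytes = PARAMS * BYTES_PER_PARAM # every pass streams all weights once

# Prefill: all prompt tokens share a single weight read -> high intensity.
prefill_intensity = flops_per_token * PROMPT_TOKENS / weight_bytes
# Decode: each step handles 1 token but still reads every weight.
decode_intensity = flops_per_token / weight_bytes

print(f"Prefill: ~{prefill_intensity:,.0f} FLOPs per byte")   # ~1,000
print(f"Decode:  ~{decode_intensity:,.0f} FLOPs per byte")    # ~1
```

A modern accelerator needs on the order of 100-300 FLOPs of work per byte of memory traffic to stay compute-bound, so prefill easily saturates the ALUs while decode is limited by how fast the weights (plus the growing KV cache) can be streamed from HBM.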
Problems with Traditional Approach
Traditional LLM inference runs Prefill and Decode on the same GPUs in the same batch: a new request's long prefill pass executes alongside other requests' single-token decode steps, and the two very different workloads contend for the same hardware.
Problems:
- Resource waste: GPU fully loaded during Prefill, idle during Decode
- Mutual interference: Long Prefill blocks short Decode
- Uneven latency: Unstable user experience
Prefill-Decode Separation Architecture
Core idea: Separate the two phases and handle them with different hardware or scheduling strategies.
Architecture Option 1: Physical Separation
Workflow (a minimal code sketch of the handoff follows this list):
- The Prefill cluster processes the input and builds the KV Cache
- The KV Cache is transferred to the Decode cluster (or placed in shared storage)
- The Decode cluster generates the output token by token
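A hedged sketch of this handoff, using an in-process queue to stand in for the network link; the `Request` fields and worker functions are illustrative, and real systems move the cache over NVLink/InfiniBand or through a shared KV store:

```python
# Minimal sketch of the prefill -> decode handoff (all names illustrative).
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class Request:
    request_id: int
    prompt: str
    kv_cache: list = field(default_factory=list)  # filled by the prefill worker
    first_token: str = ""

transfer_queue: Queue = Queue()              # stands in for the cluster interconnect

def prefill_worker(req: Request) -> None:
    """Runs on the prefill cluster: full-prompt forward pass, then ship the cache."""
    req.kv_cache = [f"kv[{i}]" for i, _ in enumerate(req.prompt.split())]
    req.first_token = "Artificial"           # first generated token
    transfer_queue.put(req)

def decode_worker() -> list[str]:
    """Runs on the decode cluster: consume the cache, emit tokens one by one."""
    req = transfer_queue.get()
    tokens = [req.first_token]
    for step in range(3):                    # toy fixed-length generation
        # A real decoder attends over req.kv_cache here and appends to it.
        tokens.append(f"token{step}")
    return tokens

prefill_worker(Request(1, "Please explain what artificial intelligence is"))
print(decode_worker())
```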
Architecture Option 2: Logical Separation
On the same set of GPUs, separation is achieved through scheduling: for example, long prefill passes are split into chunks and interleaved with decode batches (chunked prefill), or the two phases are given dedicated time slices, so decode steps are never starved by a long prompt.
Advantages of Separation
1. Hardware-Specific Optimization
Prefill Cluster:
- Use GPUs with strong compute power (like H100)
- Can use lower memory bandwidth
- Suitable for batch processing
Decode Cluster:
- Prioritize memory bandwidth (HBM3)
- Can use more but weaker GPUs
- Suitable for streaming processing (a hypothetical deployment config is sketched below)
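As a concrete (hypothetical) illustration of this split, a disaggregated deployment could be described by a config like the one below. The field names and values are assumptions, not any serving framework's actual schema:

```python
# Hypothetical deployment config for a disaggregated cluster (illustrative only).
deployment = {
    "prefill_pool": {
        "gpu_type": "compute-optimized",    # high-TFLOPS parts
        "num_gpus": 8,
        "max_batch_tokens": 16384,          # big batches amortize weight reads
        "scheduling": "batched",
    },
    "decode_pool": {
        "gpu_type": "bandwidth-optimized",  # HBM bandwidth and capacity first
        "num_gpus": 24,
        "max_concurrent_sequences": 512,    # many streams, one token each per step
        "scheduling": "continuous",
    },
    "kv_transfer": {
        "transport": "rdma",                # NVLink / InfiniBand class link
        "compression": "int8",              # optional quantized KV cache transfer
    },
}
```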
2. Better Resource Utilization
With mixed scheduling, GPUs swing between compute-saturated prefill bursts and memory-bound decode stretches, so average utilization stays low; with separation, each pool runs a homogeneous workload and stays close to its own bottleneck's peak.
3. Latency Optimization
- Decode not blocked by Prefill
- Time-to-first-token (TTFT) more controllable
- Smoother user experience
4. Elastic Scaling
Can independently scale based on load characteristics:
- Many Prefill requests → Scale Prefill cluster
- Many long conversations → Scale Decode cluster
Technical Challenges
Challenge 1: KV Cache Transfer
After Prefill completes, the KV Cache has to be handed over to the Decode side.
Solutions:
- High-speed network transfer (NVLink, InfiniBand)
- Shared storage (distributed KV Cache)
- Compressed transfer (quantized KV Cache; a minimal sketch follows this list)
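The last item, quantizing the KV Cache before transfer, can be sketched as a simple per-tensor INT8 scheme with NumPy; production systems usually use finer-grained (per-channel or per-block) scaling, so treat this purely as an illustration:

```python
# Minimal sketch: shrink a KV cache tensor to INT8 before sending it over the wire.
import numpy as np

def quantize_kv(kv: np.ndarray) -> tuple[np.ndarray, float]:
    """FP32/FP16 -> INT8 with one per-tensor scale (illustrative)."""
    scale = float(np.abs(kv).max()) / 127.0 or 1.0
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

kv = np.random.randn(2, 32, 128, 64).astype(np.float32)  # [K/V, heads, seq, head_dim]
q, scale = quantize_kv(kv)
restored = dequantize_kv(q, scale)

print(f"bytes before: {kv.nbytes:,}  after: {q.nbytes:,}")    # 4x smaller than FP32
print(f"max abs error: {np.abs(kv - restored).max():.4f}")
```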
Challenge 2: Scheduling Complexity
An intelligent scheduler has to decide:
- Which requests go to Prefill?
- How to batch Decode requests?
- How to balance load on both sides?
Challenge 3: Consistency Guarantee
Prefill and Decode must run the same model weights and configuration; otherwise the KV Cache produced by one side is not valid for the other.
Real System Cases
Splitwise (Microsoft)
Microsoft’s Splitwise system:
Features:
- Splits the prompt (prefill) and token-generation (decode) phases onto separate machine pools
- Transfers the KV cache from prefill machines to decode machines over a fast interconnect
- Reports higher throughput and better cost and power efficiency at the same latency targets
Mooncake (Moonshot AI)
The serving practice behind Moonshot AI's Kimi:
Features:
- KVCache-centric disaggregated architecture, run in production for Kimi
- Separate prefill and decode clusters
- Pools underutilized CPU, DRAM, and SSD capacity into a distributed KV cache store
DistServe
A disaggregated inference system from academia (OSDI 2024):
Features:
- Disaggregates prefill and decode to remove interference between the two phases
- Co-optimizes resource allocation and parallelism strategy for each phase independently
- Targets "goodput": requests served per GPU while meeting both TTFT and per-token latency SLOs
Implementation Example
Simplified Separation Scheduling Logic
Below is a minimal sketch of what such a scheduler could look like. The two-queue design, the token budget, and all method names apart from the class name PrefillDecodeScheduler are illustrative assumptions rather than any specific framework's API.
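```python
# Simplified prefill/decode scheduling sketch (illustrative, not production code).
from collections import deque

class PrefillDecodeScheduler:
    def __init__(self, max_prefill_tokens=8192, max_decode_batch=256):
        self.prefill_queue = deque()   # new requests waiting for their first pass
        self.decode_queue = deque()    # requests that already have a KV cache
        self.max_prefill_tokens = max_prefill_tokens
        self.max_decode_batch = max_decode_batch

    def add_request(self, request):
        """New requests always enter on the prefill side."""
        self.prefill_queue.append(request)

    def next_prefill_batch(self):
        """Pack prompts into a batch under a token budget (compute-bound side)."""
        batch, tokens = [], 0
        while (self.prefill_queue
               and tokens + self.prefill_queue[0]["prompt_len"] <= self.max_prefill_tokens):
            req = self.prefill_queue.popleft()
            batch.append(req)
            tokens += req["prompt_len"]
        return batch

    def on_prefill_done(self, request):
        """Once prefill finishes, the request (with its KV cache) moves to decode."""
        self.decode_queue.append(request)

    def next_decode_batch(self):
        """Decode batches are capped by sequence count (memory-bound side)."""
        size = min(self.max_decode_batch, len(self.decode_queue))
        return [self.decode_queue.popleft() for _ in range(size)]

# Usage sketch
sched = PrefillDecodeScheduler()
sched.add_request({"id": 1, "prompt_len": 700})
sched.add_request({"id": 2, "prompt_len": 120})
for req in sched.next_prefill_batch():
    sched.on_prefill_done(req)               # pretend the prefill pass finished
print([r["id"] for r in sched.next_decode_batch()])   # -> [1, 2]
```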
Performance Comparison
| Metric | Traditional | Separated |
|---|---|---|
| Time-to-First-Token (TTFT) | High variance | Stable |
| Throughput | Baseline | +30-50% |
| GPU Utilization | 40-60% | 80-95% |
| Latency P99 | High | Controllable |
| Hardware Flexibility | Low | High |
Use Cases
Most suitable scenarios for separation architecture:
- ✅ High-concurrency online services
- ✅ Strict latency requirements
- ✅ Variable input lengths
- ✅ Need elastic scaling
Scenarios that may not need separation:
- ❌ Single-user batch processing
- ❌ Small-scale deployment
- ❌ Latency-insensitive applications
Summary
Prefill-Decode separation is an important paradigm in LLM inference system design. By recognizing the essential differences between the two phases and specifically allocating resources and optimizations, system efficiency and user experience can be significantly improved.
Key points:
- Different phase characteristics: Prefill is compute-intensive, Decode is memory-intensive
- Separation brings advantages: High resource utilization, controllable latency, elastic scaling
- Core challenges: KV Cache transfer, intelligent scheduling
- Practice cases: Splitwise, Mooncake, DistServe
Understand Prefill-Decode separation, and you’ve mastered the core idea of designing high-performance LLM inference systems.