Prefill-Decode Separation: Letting Large Model Inference “Walk on Two Legs”
When you ask ChatGPT a question, the AI’s response doesn’t happen all at once. It actually goes through two distinctly different phases: Prefill and Decode. Understanding these two phases and optimizing them specifically is one of the core ideas in modern LLM inference system design.
Two Phases of LLM Inference
Phase One: Prefill
What it does: Processes all user input content and generates the first output token.
Characteristics:
- Processes the entire input sequence at once
- High computational load, but highly parallel
- Generates KV Cache for all tokens
- Similar to the “reading the question” phase
Input: "Please explain what artificial intelligence is?"
Phase Two: Decode
What it does: Generates subsequent tokens one by one until completion.
Characteristics:
- Generates only one token at a time
- Low computational load, but sequential
- Reuses KV Cache from Prefill phase
- Similar to the “writing the answer” phase
1 | "Artificial" → "intelligence" → "is" → "a" → "..." → "[END]" |
Essential Differences Between the Two Phases
| Feature | Prefill | Decode |
|---|---|---|
| Tokens Processed | Many (entire input) | 1 (each time) |
| Compute Intensity | High (compute-bound) | Low (memory-bound) |
| Parallelism | High | Low |
| Bottleneck | Compute power | Memory bandwidth |
| GPU Utilization | High | Low |
Core insight: Prefill is “compute-intensive,” Decode is “memory-intensive”—they need completely different hardware resources!
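A back-of-the-envelope estimate shows why the split in the table happens. The figures below (a 7B-parameter FP16 model and a 1,000-token prompt) are rough illustrative assumptions, not measurements:

```python
# Roofline-style estimate: FLOPs performed vs. bytes of weights read per pass.
PARAMS = 7e9                  # 7B-parameter model (assumption)
BYTES_PER_PARAM = 2           # FP16 weights
PROMPT_TOKENS = 1000          # example prompt length

flops_per_token = 2 * PARAMS            # ~2 FLOPs per parameter per token
weight_bytes = PARAMS * BYTES_PER_PARAM # every pass streams all weights once

# Prefill: all prompt tokens share a single weight read -> high intensity.
prefill_intensity = flops_per_token * PROMPT_TOKENS / weight_bytes
# Decode: each step handles 1 token but still reads every weight.
decode_intensity = flops_per_token / weight_bytes

print(f"Prefill: ~{prefill_intensity:,.0f} FLOPs per byte")   # ~1,000
print(f"Decode:  ~{decode_intensity:,.0f} FLOPs per byte")    # ~1
```

A modern accelerator needs on the order of 100-300 FLOPs of work per byte of memory traffic to stay compute-bound, so prefill easily saturates the ALUs while decode is limited by how fast the weights (plus the growing KV cache) can be streamed from HBM.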
Problems with Traditional Approach
Traditional LLM inference runs Prefill and Decode on the same GPUs in the same batch: a new request's long prefill pass executes alongside other requests' single-token decode steps, and the two very different workloads contend for the same hardware.
Problems:
- Resource waste: GPU fully loaded during Prefill, idle during Decode
- Mutual interference: Long Prefill blocks short Decode
- Uneven latency: Unstable user experience
Prefill-Decode Separation Architecture
Core idea: Separate the two phases and handle them with different hardware or scheduling strategies.
Architecture Option 1: Physical Separation
Workflow (a minimal code sketch of the handoff follows this list):
- The Prefill cluster processes the input and builds the KV Cache
- The KV Cache is transferred to the Decode cluster (or placed in shared storage)
- The Decode cluster generates the output token by token
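A hedged sketch of this handoff, using an in-process queue to stand in for the network link; the `Request` fields and worker functions are illustrative, and real systems move the cache over NVLink/InfiniBand or through a shared KV store:

```python
# Minimal sketch of the prefill -> decode handoff (all names illustrative).
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class Request:
    request_id: int
    prompt: str
    kv_cache: list = field(default_factory=list)  # filled by the prefill worker
    first_token: str = ""

transfer_queue: Queue = Queue()              # stands in for the cluster interconnect

def prefill_worker(req: Request) -> None:
    """Runs on the prefill cluster: full-prompt forward pass, then ship the cache."""
    req.kv_cache = [f"kv[{i}]" for i, _ in enumerate(req.prompt.split())]
    req.first_token = "Artificial"           # first generated token
    transfer_queue.put(req)

def decode_worker() -> list[str]:
    """Runs on the decode cluster: consume the cache, emit tokens one by one."""
    req = transfer_queue.get()
    tokens = [req.first_token]
    for step in range(3):                    # toy fixed-length generation
        # A real decoder attends over req.kv_cache here and appends to it.
        tokens.append(f"token{step}")
    return tokens

prefill_worker(Request(1, "Please explain what artificial intelligence is"))
print(decode_worker())
```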
Architecture Option 2: Logical Separation
On the same set of GPUs, separation is achieved through scheduling: for example, long prefill passes are split into chunks and interleaved with decode batches (chunked prefill), or the two phases are given dedicated time slices, so decode steps are never starved by a long prompt.
Advantages of Separation
1. Hardware-Specific Optimization
Prefill Cluster:
- Use GPUs with strong compute power (like H100)
- Can use lower memory bandwidth
- Suitable for batch processing
Decode Cluster:
- Prioritize memory bandwidth (HBM3)
- Can use more but weaker GPUs
- Suitable for streaming processing (a hypothetical deployment config is sketched below)
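As a concrete (hypothetical) illustration of this split, a disaggregated deployment could be described by a config like the one below. The field names and values are assumptions, not any serving framework's actual schema:

```python
# Hypothetical deployment config for a disaggregated cluster (illustrative only).
deployment = {
    "prefill_pool": {
        "gpu_type": "compute-optimized",    # high-TFLOPS parts
        "num_gpus": 8,
        "max_batch_tokens": 16384,          # big batches amortize weight reads
        "scheduling": "batched",
    },
    "decode_pool": {
        "gpu_type": "bandwidth-optimized",  # HBM bandwidth and capacity first
        "num_gpus": 24,
        "max_concurrent_sequences": 512,    # many streams, one token each per step
        "scheduling": "continuous",
    },
    "kv_transfer": {
        "transport": "rdma",                # NVLink / InfiniBand class link
        "compression": "int8",              # optional quantized KV cache transfer
    },
}
```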
2. Better Resource Utilization
With mixed scheduling, GPUs swing between compute-saturated prefill bursts and memory-bound decode stretches, so average utilization stays low; with separation, each pool runs a homogeneous workload and stays close to its own bottleneck's peak.
3. Latency Optimization
- Decode not blocked by Prefill
- Time-to-first-token (TTFT) more controllable
- Smoother user experience
4. Elastic Scaling
Can independently scale based on load characteristics:
- Many Prefill requests → Scale Prefill cluster
- Many long conversations → Scale Decode cluster
Technical Challenges
Challenge 1: KV Cache Transfer
After Prefill completes, the KV Cache has to be handed over to the Decode side.
Solutions:
- High-speed network transfer (NVLink, InfiniBand)
- Shared storage (distributed KV Cache)
- Compressed transfer (quantized KV Cache; a minimal sketch follows this list)
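The last item, quantizing the KV Cache before transfer, can be sketched as a simple per-tensor INT8 scheme with NumPy; production systems usually use finer-grained (per-channel or per-block) scaling, so treat this purely as an illustration:

```python
# Minimal sketch: shrink a KV cache tensor to INT8 before sending it over the wire.
import numpy as np

def quantize_kv(kv: np.ndarray) -> tuple[np.ndarray, float]:
    """FP32/FP16 -> INT8 with one per-tensor scale (illustrative)."""
    scale = float(np.abs(kv).max()) / 127.0 or 1.0
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

kv = np.random.randn(2, 32, 128, 64).astype(np.float32)  # [K/V, heads, seq, head_dim]
q, scale = quantize_kv(kv)
restored = dequantize_kv(q, scale)

print(f"bytes before: {kv.nbytes:,}  after: {q.nbytes:,}")    # 4x smaller than FP32
print(f"max abs error: {np.abs(kv - restored).max():.4f}")
```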
Challenge 2: Scheduling Complexity
An intelligent scheduler has to decide:
- Which requests go to Prefill?
- How to batch Decode requests?
- How to balance load on both sides?
Challenge 3: Consistency Guarantee
Prefill and Decode must run the same model weights and configuration; otherwise the KV Cache produced by one side is not valid for the other.
Real System Cases
Splitwise (Microsoft)
Microsoft’s Splitwise system:
Features:
- Splits the prompt (prefill) and token-generation (decode) phases onto separate machine pools
- Transfers the KV cache from prefill machines to decode machines over a fast interconnect
- Reports higher throughput and better cost and power efficiency at the same latency targets
Mooncake (Moonshot AI)
The serving practice behind Moonshot AI's Kimi:
Features:
- KVCache-centric disaggregated architecture, run in production for Kimi
- Separate prefill and decode clusters
- Pools underutilized CPU, DRAM, and SSD capacity into a distributed KV cache store
DistServe
A disaggregated inference system from academia (OSDI 2024):
Features:
- Disaggregates prefill and decode to remove interference between the two phases
- Co-optimizes resource allocation and parallelism strategy for each phase independently
- Targets "goodput": requests served per GPU while meeting both TTFT and per-token latency SLOs
Implementation Example
Simplified Separation Scheduling Logic
Below is a minimal sketch of what such a scheduler could look like. The two-queue design, the token budget, and all method names apart from the class name PrefillDecodeScheduler are illustrative assumptions rather than any specific framework's API.
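```python
# Simplified prefill/decode scheduling sketch (illustrative, not production code).
from collections import deque

class PrefillDecodeScheduler:
    def __init__(self, max_prefill_tokens=8192, max_decode_batch=256):
        self.prefill_queue = deque()   # new requests waiting for their first pass
        self.decode_queue = deque()    # requests that already have a KV cache
        self.max_prefill_tokens = max_prefill_tokens
        self.max_decode_batch = max_decode_batch

    def add_request(self, request):
        """New requests always enter on the prefill side."""
        self.prefill_queue.append(request)

    def next_prefill_batch(self):
        """Pack prompts into a batch under a token budget (compute-bound side)."""
        batch, tokens = [], 0
        while (self.prefill_queue
               and tokens + self.prefill_queue[0]["prompt_len"] <= self.max_prefill_tokens):
            req = self.prefill_queue.popleft()
            batch.append(req)
            tokens += req["prompt_len"]
        return batch

    def on_prefill_done(self, request):
        """Once prefill finishes, the request (with its KV cache) moves to decode."""
        self.decode_queue.append(request)

    def next_decode_batch(self):
        """Decode batches are capped by sequence count (memory-bound side)."""
        size = min(self.max_decode_batch, len(self.decode_queue))
        return [self.decode_queue.popleft() for _ in range(size)]

# Usage sketch
sched = PrefillDecodeScheduler()
sched.add_request({"id": 1, "prompt_len": 700})
sched.add_request({"id": 2, "prompt_len": 120})
for req in sched.next_prefill_batch():
    sched.on_prefill_done(req)               # pretend the prefill pass finished
print([r["id"] for r in sched.next_decode_batch()])   # -> [1, 2]
```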
Performance Comparison
| Metric | Traditional | Separated |
|---|---|---|
| Time-to-First-Token (TTFT) | High variance | Stable |
| Throughput | Baseline | +30-50% |
| GPU Utilization | 40-60% | 80-95% |
| Latency P99 | High | Controllable |
| Hardware Flexibility | Low | High |
Use Cases
Most suitable scenarios for separation architecture:
- ✅ High-concurrency online services
- ✅ Strict latency requirements
- ✅ Variable input lengths
- ✅ Need elastic scaling
Scenarios that may not need separation:
- ❌ Single-user batch processing
- ❌ Small-scale deployment
- ❌ Latency-insensitive applications
Summary
Prefill-Decode separation is an important paradigm in LLM inference system design. By recognizing the essential differences between the two phases and specifically allocating resources and optimizations, system efficiency and user experience can be significantly improved.
Key points:
- Different phase characteristics: Prefill is compute-intensive, Decode is memory-intensive
- Separation brings advantages: High resource utilization, controllable latency, elastic scaling
- Core challenges: KV Cache transfer, intelligent scheduling
- Practice cases: Splitwise, Mooncake, DistServe
Understand Prefill-Decode separation, and you’ve mastered the core idea of designing high-performance LLM inference systems.