Continuous Batching: The Black Magic That Keeps Large Model Services “Never Idle”
When you chat with ChatGPT, the servers behind it might be serving thousands of users simultaneously. How do we efficiently handle these concurrent requests? Continuous Batching technology emerged to keep GPUs running at full capacity, dramatically improving the throughput of large model services.
First, Understanding the Problem with Traditional Batching
What is Batching?
In deep learning, batching means packaging multiple requests together for simultaneous processing, rather than processing them one by one. This better utilizes the GPU’s parallel computing capabilities.
Analogy: It’s like cooking at a restaurant—stir-frying one dish at a time is inefficient, but doing four at once (if the wok is big enough) is more efficient.
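To make the idea concrete, here is a toy PyTorch sketch (added here purely for illustration; it is not part of any serving framework) showing that one batched matrix multiply does the work of many separate ones:

```python
import torch

# One weight matrix standing in for a single layer of a model.
weight = torch.randn(4096, 4096)

# Eight independent "requests", each a single activation vector.
requests = [torch.randn(4096) for _ in range(8)]

# Unbatched: eight separate matrix-vector products, each too small
# to keep the hardware's parallel units busy.
outputs_one_by_one = [weight @ x for x in requests]

# Batched: stack the requests and issue a single matrix-matrix product.
batch = torch.stack(requests)          # shape (8, 4096)
outputs_batched = batch @ weight.T     # shape (8, 4096), same results as above
```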
The Dilemma of Traditional Static Batching
In traditional Static Batching:
- The server waits to collect a batch of requests (say, 8)
- These 8 requests are sent to the model together
- It waits until all requests complete before processing the next batch
Here’s the problem:
Suppose among these 8 requests:
- User A asks: “What’s 1+1?” → Completes after generating 3 tokens
- User B asks: “Write me a 1000-word article” → Needs to generate 500 tokens
Result with Static Batching:
Timeline: all 8 requests start decoding together. User A's answer is ready after 3 steps, but nothing is returned until step 500, when User B's article finally finishes; in the meantime A's finished slot stays occupied in the batch.
User A should have received their response long ago, but has to wait for User B’s long article to finish! During this time:
- GPU resources wasted: Completed requests still occupy slots
- Increased latency: Short requests are “held back” by long ones
- Poor user experience: Questions that could be answered instantly take forever
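To see where the waste comes from, here is a minimal sketch of a static batching loop (hypothetical code; `model.decode_step` and `eos_token_id` are illustrative names, not a real API):

```python
def run_static_batch(model, requests, max_new_tokens=512):
    """Decode a fixed batch until EVERY request has finished.
    Short requests keep occupying their slot, and nothing is
    returned to any user until the longest request completes."""
    finished = [False] * len(requests)
    for _ in range(max_new_tokens):
        # One decode step for the whole batch, finished slots included.
        next_tokens = model.decode_step(requests)
        for i, token in enumerate(next_tokens):
            if not finished[i]:
                requests[i].tokens.append(token)
                finished[i] = (token == model.eos_token_id)
        if all(finished):
            break
    return requests  # results are released here, all at once
```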
How Continuous Batching Solves the Problem
The core idea of Continuous Batching (also called Iteration-level Batching or In-Flight Batching) is:
Reschedule the batch at each iteration step (every token generated)
How It Works
Initial state: the batch is processing [A, B, C, D]. A few iterations later, B finishes and its result is returned immediately; a queued request E takes the freed slot, so the running batch becomes [A, E, C, D]. When A finishes, the next waiting request joins in the same way. The batch composition is re-evaluated at every decoding step.
Effects:
- Completed requests immediately release resources
- New requests join anytime to the running batch
- GPU always runs at full load, no idle time
A Visual Analogy
Static Batching is like a traditional bus:
- Must wait for everyone to board/disembark at each stop
- Departs when full, everyone gets off together at the terminal
- Want to get off midway? Sorry, wait until the end
Continuous Batching is like a “cloud rail” or “robo-taxi”:
- Passengers can get on and off anytime
- Get off immediately upon reaching destination, seat is freed immediately
- New passengers can immediately take empty seats
- Vehicle always operates efficiently
Key Technical Implementation Points
1. Sequence-level Scheduling
Traditional batching schedules work in units of a whole batch; Continuous Batching schedules work in units of individual sequences. A minimal sketch of such a ContinuousBatcher is shown below (illustrative pseudocode only; `model.decode_step`, `seq.finish()`, and the other names are hypothetical, and real schedulers also have to manage the KV cache):
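```python
from collections import deque

class ContinuousBatcher:
    """Iteration-level scheduling: admit, decode, and retire sequences
    on every step instead of waiting for a whole batch to finish."""

    def __init__(self, model, max_batch_size=32):
        self.model = model
        self.max_batch_size = max_batch_size
        self.waiting = deque()   # requests not yet admitted
        self.running = []        # sequences currently being decoded

    def add_request(self, request):
        self.waiting.append(request)

    def step(self):
        # 1. Fill any free slots with waiting requests.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())

        # 2. One decode iteration for every running sequence.
        next_tokens = self.model.decode_step(self.running)

        # 3. Retire finished sequences immediately, freeing their slots.
        still_running = []
        for seq, token in zip(self.running, next_tokens):
            seq.tokens.append(token)
            done = (token == self.model.eos_token_id
                    or len(seq.tokens) >= seq.max_tokens)
            if done:
                seq.finish()      # the result goes back to the user right now
            else:
                still_running.append(seq)
        self.running = still_running
```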
2. Dynamic Memory Management
Since batch composition changes constantly, memory management becomes complex:
- Requires technologies like PagedAttention
- KV Cache allocated and freed on demand (see the sketch after this list)
- Supports sequences of different lengths coexisting
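As a rough illustration of the "allocated and freed on demand" point, here is a toy block-based KV cache bookkeeper in the spirit of PagedAttention (hypothetical code, not vLLM's actual allocator):

```python
class BlockKVCacheAllocator:
    """Toy paged KV cache bookkeeping: memory is handed out in fixed-size
    blocks, so sequences of very different lengths can coexist without
    reserving one large contiguous region per request."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of block indices
        self.lengths = {}        # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve KV-cache room for one more token; grab a new block
        only when the sequence's last block is already full."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; the request must wait")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1

    def free(self, seq_id):
        """Called the moment a sequence finishes: its blocks are reusable at once."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```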
3. Prefix Sharing Optimization
When multiple requests share the same prefix (like the same System Prompt):
- Can share prefix KV Cache
- New requests that join can reuse it directly (see the sketch below)
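Here is a toy reference-counted prefix cache to illustrate the idea (hypothetical code; systems such as SGLang implement this far more generally with a radix tree over token prefixes):

```python
class PrefixCache:
    """Toy prefix sharing: requests with an identical prompt prefix
    (e.g. the same System Prompt) share one KV-cache entry instead of
    recomputing it."""

    def __init__(self):
        self.cache = {}  # tuple of prefix token ids -> (kv_cache, ref_count)

    def get_or_compute(self, prefix_tokens, compute_kv):
        key = tuple(prefix_tokens)
        if key in self.cache:
            kv, refs = self.cache[key]
            self.cache[key] = (kv, refs + 1)  # reuse: skip prefill for the shared prefix
            return kv
        kv = compute_kv(prefix_tokens)        # prefill once for this prefix
        self.cache[key] = (kv, 1)
        return kv

    def release(self, prefix_tokens):
        key = tuple(prefix_tokens)
        kv, refs = self.cache[key]
        if refs <= 1:
            del self.cache[key]               # last user gone, memory can be freed
        else:
            self.cache[key] = (kv, refs - 1)
```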
Advantages of Continuous Batching
| Advantage | Description |
|---|---|
| Higher Throughput | GPU utilization increases from 30-50% to 90%+ |
| Lower Latency | Short requests don’t wait for long ones to complete |
| Better Scalability | Dynamically adapts to load changes |
| Higher Concurrency | Serves more users simultaneously |
Performance Comparison
Using the LLaMA-7B model as an example (representative figures; exact numbers depend on hardware and workload):
| Metric | Static Batching | Continuous Batching |
|---|---|---|
| Throughput (tokens/s) | 1000 | 3000-5000 |
| Average Latency | High, affected by longest request | Low, based on actual generation length |
| GPU Utilization | 30-50% | 85-95% |
| Tail Latency (P99) | Very high | Controllable |
Mainstream Framework Support
The following inference frameworks implement Continuous Batching:
| Framework | Implementation Name |
|---|---|
| vLLM | Continuous Batching + PagedAttention |
| TensorRT-LLM | In-Flight Batching |
| SGLang | Continuous Batching + RadixAttention |
| Text Generation Inference | Continuous Batching |
Usage Example (vLLM)
The snippet below is a typical minimal offline-inference script using vLLM's LLM and SamplingParams API (the model name is only a placeholder); the engine applies continuous batching internally, with no extra configuration required:
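```python
from vllm import LLM, SamplingParams

# Requests of very different lengths can be submitted together; vLLM's
# engine schedules them iteration by iteration under the hood.
prompts = [
    "What's 1+1?",
    "Write me a 1000-word article about batching.",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=512)

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model name
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
```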
Challenges and Solutions
Challenge 1: Increased Scheduling Complexity
- Problem: Every decoding step must decide which sequences have finished and which should continue
- Solution: Efficient data structures and scheduling algorithms
Challenge 2: Memory Fragmentation
- Problem: Varying sequence lengths cause scattered memory allocation
- Solution: PagedAttention solves fragmentation issues
Challenge 3: Request Starvation
- Problem: Long requests might keep getting “cut in line” by short ones
- Solution: Priority scheduling and fair scheduling policies (see the aging sketch below)
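One simple way to picture the fairness fix is priority aging: the longer a request has been waiting, the higher its effective priority, so a steady stream of short requests cannot starve a long one. A hypothetical sketch:

```python
import time

class AgingScheduler:
    """Toy admission queue with priority aging (illustrative only)."""

    def __init__(self, aging_rate=1.0):
        self.aging_rate = aging_rate     # how fast waiting time gains priority
        self.entries = []                # (base_priority, arrival_time, request)

    def submit(self, request, base_priority=0.0):
        self.entries.append((base_priority, time.monotonic(), request))

    def pop_next(self):
        """Pick the request with the highest effective priority, where
        effective priority = base priority + aging_rate * time waited."""
        now = time.monotonic()
        best = max(self.entries,
                   key=lambda e: e[0] + self.aging_rate * (now - e[1]))
        self.entries.remove(best)
        return best[2]
```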
Summary
Continuous Batching is one of the core technologies in modern LLM inference services. It breaks the traditional batching limitation of “wait until all are ready,” achieving dynamic scheduling at the request level. Combined with memory optimization techniques like PagedAttention, Continuous Batching increases large model service throughput by several times while reducing user wait times.
If you’re deploying LLM services, choosing an inference framework that supports Continuous Batching (like vLLM, TensorRT-LLM, SGLang) would be a wise choice.