Continuous Batching: The Black Magic That Keeps Large Model Services “Never Idle”
When you chat with ChatGPT, the servers behind it might be serving thousands of users simultaneously. How do we efficiently handle these concurrent requests? Continuous Batching technology emerged to keep GPUs running at full capacity, dramatically improving the throughput of large model services.
First, Understanding the Problem with Traditional Batching
What is Batching?
In deep learning, batching means packaging multiple requests together for simultaneous processing, rather than processing them one by one. This better utilizes the GPU’s parallel computing capabilities.
Analogy: It’s like cooking at a restaurant—stir-frying one dish at a time is inefficient, but doing four at once (if the wok is big enough) is more efficient.
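To make the idea concrete, here is a toy PyTorch sketch (added here purely for illustration; it is not part of any serving framework) showing that one batched matrix multiply does the work of many separate ones:

```python
import torch

# One weight matrix standing in for a single layer of a model.
weight = torch.randn(4096, 4096)

# Eight independent "requests", each a single activation vector.
requests = [torch.randn(4096) for _ in range(8)]

# Unbatched: eight separate matrix-vector products, each too small
# to keep the hardware's parallel units busy.
outputs_one_by_one = [weight @ x for x in requests]

# Batched: stack the requests and issue a single matrix-matrix product.
batch = torch.stack(requests)          # shape (8, 4096)
outputs_batched = batch @ weight.T     # shape (8, 4096), same results as above
```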
The Dilemma of Traditional Static Batching
In traditional Static Batching:
- The server waits to collect a batch of requests (say, 8)
- These 8 requests are sent to the model together
- It waits until all requests complete before processing the next batch
Here’s the problem:
Suppose among these 8 requests:
- User A asks: “What’s 1+1?” → Completes after generating 3 tokens
- User B asks: “Write me a 1000-word article” → Needs to generate 500 tokens
Result with Static Batching:
Timeline: all 8 requests start decoding together. User A's answer is ready after 3 steps, but nothing is returned until step 500, when User B's article finally finishes; in the meantime A's finished slot stays occupied in the batch.
User A should have received their response long ago, but has to wait for User B’s long article to finish! During this time:
- GPU resources wasted: Completed requests still occupy slots
- Increased latency: Short requests are “held back” by long ones
- Poor user experience: Questions that could be answered instantly take forever
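To see where the waste comes from, here is a minimal sketch of a static batching loop (hypothetical code; `model.decode_step` and `eos_token_id` are illustrative names, not a real API):

```python
def run_static_batch(model, requests, max_new_tokens=512):
    """Decode a fixed batch until EVERY request has finished.
    Short requests keep occupying their slot, and nothing is
    returned to any user until the longest request completes."""
    finished = [False] * len(requests)
    for _ in range(max_new_tokens):
        # One decode step for the whole batch, finished slots included.
        next_tokens = model.decode_step(requests)
        for i, token in enumerate(next_tokens):
            if not finished[i]:
                requests[i].tokens.append(token)
                finished[i] = (token == model.eos_token_id)
        if all(finished):
            break
    return requests  # results are released here, all at once
```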
How Continuous Batching Solves the Problem
The core idea of Continuous Batching (also called Iteration-level Batching or In-Flight Batching) is:
Reschedule the batch at each iteration step (every token generated)
How It Works
Initial state: the batch is processing [A, B, C, D]. A few iterations later, B finishes and its result is returned immediately; a queued request E takes the freed slot, so the running batch becomes [A, E, C, D]. When A finishes, the next waiting request joins in the same way. The batch composition is re-evaluated at every decoding step.
Effects:
- Completed requests immediately release resources
- New requests join anytime to the running batch
- GPU always runs at full load, no idle time
A Visual Analogy
Static Batching is like a traditional bus:
- Must wait for everyone to board/disembark at each stop
- Departs when full, everyone gets off together at the terminal
- Want to get off midway? Sorry, wait until the end
Continuous Batching is like a “cloud rail” or “robo-taxi”:
- Passengers can get on and off anytime
- Get off immediately upon reaching destination, seat is freed immediately
- New passengers can immediately take empty seats
- Vehicle always operates efficiently
Key Technical Implementation Points
1. Sequence-level Scheduling
Traditional batching schedules work in units of a whole batch; Continuous Batching schedules work in units of individual sequences. A minimal sketch of such a ContinuousBatcher is shown below (illustrative pseudocode only; `model.decode_step`, `seq.finish()`, and the other names are hypothetical, and real schedulers also have to manage the KV cache):
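```python
from collections import deque

class ContinuousBatcher:
    """Iteration-level scheduling: admit, decode, and retire sequences
    on every step instead of waiting for a whole batch to finish."""

    def __init__(self, model, max_batch_size=32):
        self.model = model
        self.max_batch_size = max_batch_size
        self.waiting = deque()   # requests not yet admitted
        self.running = []        # sequences currently being decoded

    def add_request(self, request):
        self.waiting.append(request)

    def step(self):
        # 1. Fill any free slots with waiting requests.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())

        # 2. One decode iteration for every running sequence.
        next_tokens = self.model.decode_step(self.running)

        # 3. Retire finished sequences immediately, freeing their slots.
        still_running = []
        for seq, token in zip(self.running, next_tokens):
            seq.tokens.append(token)
            done = (token == self.model.eos_token_id
                    or len(seq.tokens) >= seq.max_tokens)
            if done:
                seq.finish()      # the result goes back to the user right now
            else:
                still_running.append(seq)
        self.running = still_running
```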
2. Dynamic Memory Management
Since batch composition changes constantly, memory management becomes complex:
- Requires technologies like PagedAttention
- KV Cache allocated and freed on demand (see the sketch after this list)
- Supports sequences of different lengths coexisting
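As a rough illustration of the "allocated and freed on demand" point, here is a toy block-based KV cache bookkeeper in the spirit of PagedAttention (hypothetical code, not vLLM's actual allocator):

```python
class BlockKVCacheAllocator:
    """Toy paged KV cache bookkeeping: memory is handed out in fixed-size
    blocks, so sequences of very different lengths can coexist without
    reserving one large contiguous region per request."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of block indices
        self.lengths = {}        # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve KV-cache room for one more token; grab a new block
        only when the sequence's last block is already full."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; the request must wait")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1

    def free(self, seq_id):
        """Called the moment a sequence finishes: its blocks are reusable at once."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```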
3. Prefix Sharing Optimization
When multiple requests share the same prefix (like the same System Prompt):
- Can share prefix KV Cache
- New requests that join can reuse it directly (see the sketch below)
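Here is a toy reference-counted prefix cache to illustrate the idea (hypothetical code; systems such as SGLang implement this far more generally with a radix tree over token prefixes):

```python
class PrefixCache:
    """Toy prefix sharing: requests with an identical prompt prefix
    (e.g. the same System Prompt) share one KV-cache entry instead of
    recomputing it."""

    def __init__(self):
        self.cache = {}  # tuple of prefix token ids -> (kv_cache, ref_count)

    def get_or_compute(self, prefix_tokens, compute_kv):
        key = tuple(prefix_tokens)
        if key in self.cache:
            kv, refs = self.cache[key]
            self.cache[key] = (kv, refs + 1)  # reuse: skip prefill for the shared prefix
            return kv
        kv = compute_kv(prefix_tokens)        # prefill once for this prefix
        self.cache[key] = (kv, 1)
        return kv

    def release(self, prefix_tokens):
        key = tuple(prefix_tokens)
        kv, refs = self.cache[key]
        if refs <= 1:
            del self.cache[key]               # last user gone, memory can be freed
        else:
            self.cache[key] = (kv, refs - 1)
```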
Advantages of Continuous Batching
| Advantage | Description |
|---|---|
| Higher Throughput | GPU utilization increases from 30-50% to 90%+ |
| Lower Latency | Short requests don’t wait for long ones to complete |
| Better Scalability | Dynamically adapts to load changes |
| Higher Concurrency | Serves more users simultaneously |
Performance Comparison
Using the LLaMA-7B model as an example (representative figures; exact numbers depend on hardware and workload):
| Metric | Static Batching | Continuous Batching |
|---|---|---|
| Throughput (tokens/s) | 1000 | 3000-5000 |
| Average Latency | High, affected by longest request | Low, based on actual generation length |
| GPU Utilization | 30-50% | 85-95% |
| Tail Latency (P99) | Very high | Controllable |
Mainstream Framework Support
The following inference frameworks implement Continuous Batching:
| Framework | Implementation Name |
|---|---|
| vLLM | Continuous Batching + PagedAttention |
| TensorRT-LLM | In-Flight Batching |
| SGLang | Continuous Batching + RadixAttention |
| Text Generation Inference | Continuous Batching |
Usage Example (vLLM)
The snippet below is a typical minimal offline-inference script using vLLM's LLM and SamplingParams API (the model name is only a placeholder); the engine applies continuous batching internally, with no extra configuration required:
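```python
from vllm import LLM, SamplingParams

# Requests of very different lengths can be submitted together; vLLM's
# engine schedules them iteration by iteration under the hood.
prompts = [
    "What's 1+1?",
    "Write me a 1000-word article about batching.",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=512)

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model name
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
```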
Challenges and Solutions
Challenge 1: Increased Scheduling Complexity
- Problem: Every decoding step must decide which sequences have finished and which should continue
- Solution: Efficient data structures and scheduling algorithms
Challenge 2: Memory Fragmentation
- Problem: Varying sequence lengths cause scattered memory allocation
- Solution: PagedAttention solves fragmentation issues
Challenge 3: Request Starvation
- Problem: Long requests might keep getting “cut in line” by short ones
- Solution: Priority scheduling and fair scheduling policies (see the aging sketch below)
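One simple way to picture the fairness fix is priority aging: the longer a request has been waiting, the higher its effective priority, so a steady stream of short requests cannot starve a long one. A hypothetical sketch:

```python
import time

class AgingScheduler:
    """Toy admission queue with priority aging (illustrative only)."""

    def __init__(self, aging_rate=1.0):
        self.aging_rate = aging_rate     # how fast waiting time gains priority
        self.entries = []                # (base_priority, arrival_time, request)

    def submit(self, request, base_priority=0.0):
        self.entries.append((base_priority, time.monotonic(), request))

    def pop_next(self):
        """Pick the request with the highest effective priority, where
        effective priority = base priority + aging_rate * time waited."""
        now = time.monotonic()
        best = max(self.entries,
                   key=lambda e: e[0] + self.aging_rate * (now - e[1]))
        self.entries.remove(best)
        return best[2]
```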
Summary
Continuous Batching is one of the core technologies in modern LLM inference services. It breaks the traditional batching limitation of “wait until all are ready,” achieving dynamic scheduling at the request level. Combined with memory optimization techniques like PagedAttention, Continuous Batching increases large model service throughput by several times while reducing user wait times.
If you’re deploying LLM services, choosing an inference framework that supports Continuous Batching (like vLLM, TensorRT-LLM, SGLang) would be a wise choice.