Distributed Inference: When One GPU Can’t Fit a Large Model

GPT-4, LLaMA-70B, Mixtral-8x22B… these models have parameter counts ranging from tens of billions to over a trillion, far more than a single GPU can hold. Distributed inference emerged to let multiple GPUs work together and jointly serve a single ultra-large model.

Why Do We Need Distributed Inference?

The Single-Card Dilemma

Let’s do some math:

Model        Parameters   FP16 Memory   INT8 Memory
LLaMA-7B     7B           ~14 GB        ~7 GB
LLaMA-70B    70B          ~140 GB       ~70 GB
GPT-3        175B         ~350 GB       ~175 GB
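
These figures come straight from bytes per parameter: FP16 uses 2 bytes and INT8 uses 1 byte per weight. A minimal sketch of the arithmetic (weights only; a real deployment also needs memory for the KV cache, activations, and framework overhead):

def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    # Parameters (in billions) × bytes per parameter, reported in GB (10^9 bytes).
    num_params = params_billion * 1e9
    return num_params * bytes_per_param / 1e9

for name, params in [("LLaMA-7B", 7), ("LLaMA-70B", 70), ("GPT-3", 175)]:
    print(f"{name}: FP16 ~{weight_memory_gb(params, 2):.0f} GB, "
          f"INT8 ~{weight_memory_gb(params, 1):.0f} GB")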

The most powerful consumer GPU (the RTX 4090) has only 24 GB of memory, and even a data-center A100 tops out at 80 GB.

Conclusion: Large models must be “split” and deployed across multiple cards.

Three Strategies for Distributed Inference

Just as there are multiple ways to move a mountain, distributed inference has multiple strategies:

1. Tensor Parallelism (TP)

Idea: “Horizontally slice” a single computation task across multiple GPUs.

Analogy: A super-large math problem is split into parts, each person computes one part, and answers are combined at the end.

                  ┌────────────────────────┐
Original matrix   │    A (4096 × 4096)     │
                  └────────────────────────┘
                        ↓ Horizontal split
          ┌─────────────┐            ┌─────────────┐
  GPU 0   │ A[:, :2048] │            │ A[:, 2048:] │   GPU 1
          └─────────────┘            └─────────────┘
                 ↓                          ↓
          Compute part 1             Compute part 2
                 ↓                          ↓
                 └──────── AllReduce ───────┘
                              ↓
                         Final result

Characteristics:

  • Each layer computation requires GPU communication (AllReduce)
  • High communication bandwidth requirements (needs NVLink or similar)
  • Suitable for single-machine multi-GPU scenarios
  • Can reduce latency (each card computes less)
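
To make the AllReduce step concrete, here is a toy NumPy sketch that simulates the row-parallel case on a single machine: each "GPU" holds a slice of the weight matrix, computes a partial product, and the element-wise sum of the partial products (what AllReduce would compute across real devices) equals the full result. (The diagram above shows a column split; frameworks such as Megatron-LM mix column- and row-wise splits, and it is the row-wise half that ends in an AllReduce.)

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4096))       # one token's hidden state
W = rng.standard_normal((4096, 4096))    # full weight matrix

# "GPU 0" and "GPU 1" each hold half of W's rows and the matching slice of x.
partial_0 = x[:, :2048] @ W[:2048, :]
partial_1 = x[:, 2048:] @ W[2048:, :]

# AllReduce = element-wise sum of the partial results across devices.
y_parallel = partial_0 + partial_1
print(np.allclose(y_parallel, x @ W))    # True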

2. Pipeline Parallelism (PP)

Idea: “Vertically slice” the model by layers, with each GPU responsible for a different range of layers.

Analogy: a factory assembly line, where each worker handles one step and the product moves from station to station.

Input → [GPU 0: Layers 1-8] → [GPU 1: Layers 9-16] → [GPU 2: Layers 17-24] → Output

Time →
GPU 0: [Batch1 L1-8]  [Batch2 L1-8]   [Batch3 L1-8]   ...
GPU 1:                [Batch1 L9-16]  [Batch2 L9-16]  ...
GPU 2:                                [Batch1 L17-24] ...

Characteristics:

  • Less inter-GPU communication (only activations passed between stages)
  • “Pipeline bubbles” exist (some GPUs idle waiting)
  • Suitable for cross-machine deployment
  • High throughput, but single-request latency may increase
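
How big the bubble is depends on how many micro-batches are in flight relative to the number of pipeline stages. A back-of-the-envelope sketch, assuming equal-cost stages as in a GPipe-style schedule:

def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    # Idle fraction of an idealized schedule with equal-cost stages:
    # (p - 1) / (m + p - 1), where p = stages and m = micro-batches.
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

print(pipeline_bubble_fraction(3, 1))    # ≈ 0.67: a single batch leaves GPUs mostly idle
print(pipeline_bubble_fraction(3, 12))   # ≈ 0.14: more micro-batches shrink the bubble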

3. Data Parallelism (DP)

Idea: Each GPU holds a complete replica of the model and serves its own stream of requests.

Analogy: opening multiple franchise stores, each of which provides the full service on its own.

Request1 → GPU 0 (complete model) → Result1
Request2 → GPU 1 (complete model) → Result2
Request3 → GPU 2 (complete model) → Result3

Characteristics:

  • Simple implementation
  • No inter-GPU communication overhead (during inference)
  • Prerequisite: Single card can fit the entire model
  • Throughput scales linearly
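
Since the replicas are independent, the serving side reduces to routing requests among them. A toy round-robin sketch (the replica names are placeholders; real systems put a load balancer or request router in front of the replicas):

from itertools import cycle

# Placeholder stand-ins for three independent model replicas, one per GPU.
replicas = cycle(["gpu0-replica", "gpu1-replica", "gpu2-replica"])

def dispatch(request: str) -> str:
    # Round-robin routing: each replica serves complete requests on its own.
    return f"{request} -> {next(replicas)}"

for req in ["Request1", "Request2", "Request3", "Request4"]:
    print(dispatch(req))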

Hybrid Parallel Strategies

In production, multiple parallel strategies are often combined:

                      Data Parallel (DP=2)
           ┌────────────────────┴────────────────────┐
           ▼                                         ▼
      ┌───────────┐                            ┌───────────┐
      │ Replica 1 │                            │ Replica 2 │
      └───────────┘                            └───────────┘
           │                                         │
           ▼                                         ▼
   Pipeline Parallel (PP=2)                Pipeline Parallel (PP=2)
   ┌─────────┬─────────┐                   ┌─────────┬─────────┐
   │ Stage 0 │ Stage 1 │                   │ Stage 0 │ Stage 1 │
   └─────────┴─────────┘                   └─────────┴─────────┘
        │         │                             │         │
        ▼         ▼                             ▼         ▼
   Tensor Parallel (TP=2)                  Tensor Parallel (TP=2)
 ┌────┬────┐  ┌────┬────┐                ┌────┬────┐  ┌────┬────┐
 │GPU0│GPU1│  │GPU2│GPU3│                │GPU4│GPU5│  │GPU6│GPU7│
 └────┴────┘  └────┴────┘                └────┴────┘  └────┴────┘

Combination example:

  • 8-GPU server
  • TP=4 (4 cards for tensor parallelism per layer)
  • PP=2 (2 pipeline stages)
  • Total: 4 × 2 = 8 cards
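
One way to reason about such a layout is to map every GPU rank to its (TP, PP, DP) coordinates. The sketch below uses a common convention of grouping TP ranks innermost; the actual mapping is framework-specific, so treat this as an illustration rather than any particular framework's scheme:

def rank_layout(world_size: int, tp: int, pp: int) -> dict:
    # Map each global rank to (tp_rank, pp_rank, dp_rank),
    # with TP innermost, then PP, then DP outermost.
    assert world_size % (tp * pp) == 0, "world size must be divisible by tp * pp"
    return {
        rank: {"tp": rank % tp, "pp": (rank // tp) % pp, "dp": rank // (tp * pp)}
        for rank in range(world_size)
    }

# Matches the diagram above: 8 GPUs = DP 2 × PP 2 × TP 2.
for rank, coords in rank_layout(8, tp=2, pp=2).items():
    print(rank, coords)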

Technical Challenges and Solutions

Challenge 1: Communication Overhead

Problem: Data transfer between GPUs can become a bottleneck.

Solutions:

  • Use NVLink (5-10x faster than PCIe)
  • Use NVSwitch (full interconnect)
  • Optimize overlap of communication and computation

Challenge 2: Load Balancing

Problem: In pipeline parallelism, different stages may have unequal computation loads.

Solutions:

  • Reasonable layer partitioning
  • Use Interleaved Schedule
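
“Reasonable layer partitioning” typically starts from an even split of contiguous layers across stages and is then adjusted for stages that carry extra work (embeddings, the LM head). A minimal even-split sketch, assuming uniform per-layer cost:

def partition_layers(num_layers: int, num_stages: int) -> list:
    # Split layers into contiguous, nearly equal groups, one group per stage.
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)
        stages.append(list(range(start, start + size)))
        start += size
    return stages

print(partition_layers(24, 3))   # three stages of 8 contiguous layers each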

Challenge 3: KV Cache Management

Problem: KV Cache management is more complex in distributed environments.

Solutions:

  • Distributed KV Cache pool
  • Cross-node PagedAttention

Framework Support

Framework      Tensor Parallel   Pipeline Parallel   Notes
vLLM           ✓                 ✓                   Works out of the box
TensorRT-LLM   ✓                 ✓                   High performance
DeepSpeed      ✓                 ✓                   Flexible configuration
Megatron-LM    ✓                 ✓                   Large-scale training/inference

Usage Examples

vLLM Distributed Inference

# 4-GPU tensor parallelism
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-chat-hf \
    --tensor-parallel-size 4
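
The same 4-GPU setup can also be used from vLLM's offline Python API; a minimal sketch (the prompt and sampling settings are just illustrative):

from vllm import LLM, SamplingParams

# Shard the model across 4 GPUs with tensor parallelism.
llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=4)
outputs = llm.generate(["Explain tensor parallelism in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)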

TensorRT-LLM Distributed Inference

# Build an engine with tensor and pipeline parallelism
trtllm-build --checkpoint_dir ./llama-70b \
    --output_dir ./engine \
    --tp_size 4 \
    --pp_size 2

# Run (using mpirun; 8 ranks = tp_size × pp_size)
mpirun -n 8 python run_inference.py

Performance Considerations

Tensor Parallel vs Pipeline Parallel

Feature          Tensor Parallel (TP)       Pipeline Parallel (PP)
Comm frequency   Every layer                Only between stages
Comm volume      Larger                     Smaller
Latency          Lower                      Higher (bubbles)
Bandwidth need   High (needs NVLink)        Lower
Use case         Single-machine multi-GPU   Cross-machine

How to Choose?

Single machine multi-GPU (e.g., 8×A100-SXM4):
→ Prefer TP, fully utilize NVLink

Multi-machine deployment:
→ PP across machines, TP within machine
→ Example: 2 machines × 4 cards = PP2 × TP4

Smaller model, high request volume:
→ Consider Data Parallelism (DP)
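
The same decision logic can be written down as a small planning helper. The heuristic below is an illustrative assumption rather than a hard rule: use DP when the model fits on one GPU, otherwise keep TP inside a node and run PP across nodes:

def plan_parallelism(model_gb: float, gpu_gb: float,
                     gpus_per_node: int, num_nodes: int) -> dict:
    # Illustrative heuristic: DP if the weights fit on one GPU,
    # otherwise TP within a node (fast links) and PP across nodes.
    total_gpus = gpus_per_node * num_nodes
    if model_gb <= gpu_gb:
        return {"dp": total_gpus, "tp": 1, "pp": 1}
    tp, pp = gpus_per_node, num_nodes
    assert model_gb <= tp * pp * gpu_gb, "not enough aggregate memory for the weights"
    return {"dp": total_gpus // (tp * pp), "tp": tp, "pp": pp}

print(plan_parallelism(model_gb=140, gpu_gb=80, gpus_per_node=4, num_nodes=2))
# {'dp': 1, 'tp': 4, 'pp': 2}, i.e. the "2 machines × 4 cards = PP2 × TP4" example above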

Practical Deployment Recommendations

  1. Hardware Selection:

    • Prefer machines with NVLink (e.g., DGX, HGX)
    • High-speed network for cross-machine (InfiniBand)
  2. Parallelism Planning:

    • TP degree should not exceed single-machine card count
    • PP degree = total cards / TP degree
  3. Monitoring Metrics:

    • GPU utilization balance
    • Communication time ratio
    • Pipeline bubble rate

Summary

Distributed inference is essential for running very large models. The three core strategies (tensor parallelism, pipeline parallelism, and data parallelism) each have their own strengths and are often combined in practice.

Key points:

  1. Tensor Parallel (TP): Horizontal computation split, communication-intensive, suitable for single machine
  2. Pipeline Parallel (PP): Vertical model split, less communication, suitable for cross-machine
  3. Data Parallel (DP): Model replication, throughput scaling
  4. Hybrid Parallel: Best practice for production environments

Understand distributed inference, and you can deploy large model services of any scale.