PagedAttention: Freeing Large Model Inference from “Memory Anxiety”
In the world of Large Language Model (LLM) inference, memory management has always been a headache for engineers. Today’s topic, PagedAttention technology, cleverly solves the memory challenges in LLM inference by borrowing classic ideas from operating systems. This technology is the core innovation of the vLLM framework. Let’s dive deep into understanding it.
Starting from a Pain Point
Imagine you’re running an AI chat service with 100 users simultaneously talking to the AI. Each user’s conversation varies in length—some only ask “What’s the weather today?”, while others write a 2000-word story asking the AI to continue it.
Problems with Traditional Approaches:
Traditional LLM inference requires pre-allocating a fixed-size memory block for each user (for storing KV Cache). To accommodate the “novel writer” user, you have to allocate space for 2000 words to everyone.
The result?
- The user who only asked about weather has 99% of their allocated memory wasted
- GPU memory quickly fills up with “air”
- Instead of serving 100 users, you can only serve 20
This is the problem PagedAttention solves.
What is KV Cache?
Before understanding PagedAttention, let’s clarify what KV Cache is.
When an LLM generates text, it outputs one token at a time (autoregressive generation). For each new token generated, the model needs to “look back” at all previously generated content.
Computing from scratch each time is too inefficient. So we store the Key and Value vectors from the attention mechanism that were previously computed—this is the KV Cache.
Analogy: It’s like writing a long article where you have to read from the first sentence every time you write a new one—too exhausting. KV Cache is like your “memory notes,” recording key information from before, so you only need to check your notes when writing new sentences.
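To make this concrete, here is a minimal single-head attention loop in Python. It is a toy with random weights, not a real model, but it shows the one thing that matters: at each decoding step only the new token's Key and Value are computed, and they are appended to the cache that earlier steps already filled.

```python
import numpy as np

def attend(q, K, V):
    """One query attending over all cached Keys and Values (single head)."""
    scores = K @ q / np.sqrt(q.shape[-1])   # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over cached positions
    return weights @ V                      # (d,)

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []                   # the KV Cache: one entry per token so far
x = rng.normal(size=d)                      # embedding of the current token
for step in range(5):
    q, k, v = Wq @ x, Wk @ x, Wv @ x        # only the NEW token's projections are computed
    k_cache.append(k)                       # ...and appended to the cache
    v_cache.append(v)
    out = attend(q, np.stack(k_cache), np.stack(v_cache))
    x = out                                 # stand-in for "embedding of the next token"
```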
Core Ideas of PagedAttention
PagedAttention’s inspiration comes from virtual memory paging technology in operating systems.
How Do Operating Systems Do It?
In modern operating systems, memory needed by programs isn’t allocated all at once, but divided into small “pages” allocated on demand:
- Program says: “I might need 8GB of memory”
- OS says: “Okay, but you’re only using 100MB now, I’ll give you that first”
- When the program actually needs more, new pages are allocated
PagedAttention does the same thing for KV Cache:
- Paged Storage: Splits the logically contiguous KV Cache into fixed-size “blocks”
- On-demand Allocation: New blocks are allocated only when generating new tokens
- Non-contiguous Storage: Blocks don’t need to be contiguous in physical memory; a “page table” manages the mapping
Visualizing PagedAttention
Traditional Approach:
```text
User A: [████████████____________________] ← Large pre-allocation, lots of waste
```
PagedAttention:
```text
Block Pool: [A1][B1][A2][C1][B2][C2][A3][C3][free][free][free]...
```
Effect: No waste—each user uses exactly the memory they need.
Technical Details
1. Block
- Each Block contains KV vectors for a fixed number of tokens
- Typical Block size: 16 tokens
- Block is the minimum unit of memory allocation
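As a quick sanity check on what "minimum unit of allocation" means in practice, here is the rounding arithmetic with the typical 16-token Block:

```python
import math

BLOCK_SIZE = 16   # tokens per Block (the typical size mentioned above)

def blocks_needed(num_tokens: int) -> int:
    """Blocks are the minimum allocation unit, so round up to a whole Block."""
    return math.ceil(num_tokens / BLOCK_SIZE)

print(blocks_needed(7))      # 1  (a short prompt still occupies one Block)
print(blocks_needed(1000))   # 63 (only the last Block is partially filled)
```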
2. Block Table (Page Table)
- Each request maintains a Block Table
- Records which physical Blocks the request’s KV Cache uses
- Similar to OS page tables, implementing virtual-to-physical mapping
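Putting the pieces together, a Block Table is just an ordered list of physical block IDs, and translating a token position works like a tiny page-table walk. The numbers below are made up purely for illustration:

```python
BLOCK_SIZE = 16

# Hypothetical state for one request: logical block i is stored in physical
# block block_table[i] of the shared pool (order in the pool is arbitrary).
block_table = [7, 2, 9]

def physical_location(position: int) -> tuple[int, int]:
    """Translate a token position into (physical block ID, slot), like a page-table walk."""
    logical_block, slot = divmod(position, BLOCK_SIZE)
    return block_table[logical_block], slot

print(physical_location(20))   # token 20 lives in physical block 2, slot 4
```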
3. On-demand Allocation
```text
Timeline (Block size = 16 tokens):

t0       prompt of 7 tokens arrives   → allocate Block #0   (7/16 slots used)
t1-t9    generate 9 more tokens       → Block #0 fills up   (16/16 slots used)
t10      generate token #17           → allocate Block #1   (1/16 slots used)
...      a new Block is allocated only when the previous one is full
```
4. Memory Release
When a request completes or is interrupted, its occupied Blocks are immediately returned to the free pool for other requests.
Advantages of PagedAttention
1. Near-zero Waste
Research shows traditional methods have memory waste rates of 60-80%. PagedAttention reduces waste to nearly 0% (only the last Block might have some unused space).
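A back-of-the-envelope check of those numbers, assuming a 16-token Block and a request that actually generates 1,000 tokens against a 2,048-token static reservation (both figures chosen for illustration):

```python
import math

BLOCK_SIZE = 16
actual_tokens = 1_000      # what the request really used
reserved_tokens = 2_048    # what a static pre-allocation would set aside

# Static pre-allocation: everything beyond the actual sequence is wasted.
static_waste = (reserved_tokens - actual_tokens) / reserved_tokens     # ~51%

# Paged allocation: only the unused tail of the last Block is wasted.
allocated = math.ceil(actual_tokens / BLOCK_SIZE) * BLOCK_SIZE         # 1,008 slots
paged_waste = (allocated - actual_tokens) / allocated                  # ~0.8%

print(f"static: {static_waste:.1%}, paged: {paged_waste:.1%}")
```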
2. Higher Concurrency
The same GPU memory can handle 2-4x more concurrent requests.
3. Longer Context Support
With improved memory utilization, the same hardware can support longer conversation contexts.
4. Flexible Memory Sharing
PagedAttention also supports an advanced feature: Copy-on-Write
When multiple requests share the same prefix (like the same system prompt):
- They can share the same Blocks
- New Blocks are copied only when modification is needed
- Further memory savings
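A minimal sketch of how this can be tracked with reference counts (the Block type and helper names here are hypothetical, not vLLM's internals): a shared prefix Block is duplicated only at the moment a request needs to write into it.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    """A physical Block holding KV vectors, shared via reference counting."""
    kv: list = field(default_factory=list)   # placeholder for the actual KV data
    ref_count: int = 1

def share(block: Block) -> Block:
    """Another request reuses the same prefix Block: no copy, just bump the count."""
    block.ref_count += 1
    return block

def prepare_for_write(block: Block) -> Block:
    """Copy-on-write: duplicate the Block only if someone else still references it."""
    if block.ref_count > 1:
        block.ref_count -= 1
        return Block(kv=list(block.kv))      # private copy for the writing request
    return block                             # sole owner, safe to write in place
```

In this sketch, `share` is what happens when two requests arrive with the same system prompt, and `prepare_for_write` is called right before a request appends its own tokens to a prefix Block.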
Conceptual Code Example
Although actual implementation is complex, the core idea can be understood like this:
```python
class PagedKVCache:
    """Conceptual sketch of paged KV Cache bookkeeping.

    A real engine manages GPU tensors; here Python lists stand in for them.
    """

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # pool of free physical block IDs
        self.block_tables = {}                       # request_id -> list of physical block IDs
        self.lengths = {}                            # request_id -> tokens written so far

    def append_token(self, request_id: str) -> None:
        """Reserve space for one more token, allocating a new Block only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % self.block_size == 0:            # current Block is full (or none yet)
            table.append(self.free_blocks.pop())     # on-demand allocation from the pool
        self.lengths[request_id] = length + 1

    def free(self, request_id: str) -> None:
        """Request finished: return all of its Blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)
```
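A tiny run of the sketch above, showing a Block being handed out every 16 tokens and everything returning to the pool when the request finishes:

```python
cache = PagedKVCache(num_blocks=4)

for _ in range(20):                      # 20 tokens -> 2 Blocks (16 + 4)
    cache.append_token("request-1")
print(cache.block_tables["request-1"])   # e.g. [3, 2]: two physical Blocks
print(len(cache.free_blocks))            # 2 Blocks left in the pool

cache.free("request-1")
print(len(cache.free_blocks))            # back to 4: memory released immediately
```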
Application in vLLM
vLLM is the first inference framework to implement PagedAttention:
```bash
# Install vLLM
pip install vllm
```
Users don’t need any extra configuration—vLLM automatically applies PagedAttention optimization.
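For reference, a minimal offline-inference script with vLLM's Python API looks like the following; the model name is only an example, and no PagedAttention-specific flags are needed:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")          # any supported model works here
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)             # generated continuation
```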
Performance Comparison
| Metric | Traditional | PagedAttention |
|---|---|---|
| Memory Utilization | 20-40% | 95%+ |
| Max Concurrency | Baseline | 2-4x |
| Memory Waste | 60-80% | <5% |
| Dynamic Scaling | Difficult | Easy |
Summary
PagedAttention is an important innovation in the LLM inference field. By borrowing the paged memory management concept from operating systems, it transforms KV Cache memory management from “static pre-allocation” to “dynamic on-demand allocation,” dramatically improving memory utilization and system throughput.
This technology teaches us that sometimes the best solution to new problems lies hidden in the wisdom of classic techniques.