Low-bit Quantization: Slimming Large Models to the Extreme
When we say “quantization,” we usually mean INT8 or FP16. But on the path to extreme compression, engineers have gone further—low-bit quantization (such as INT4, INT2, or even binary networks) compresses large models to astonishing degrees.
What is Low-bit Quantization?
Quantization is a technique for representing numbers with fewer bits.
| Data Type | Bits | Range | Relative Size |
|---|---|---|---|
| FP32 | 32 bits | Huge | 100% |
| FP16 | 16 bits | Large | 50% |
| INT8 | 8 bits | -128 ~ 127 | 25% |
| INT4 | 4 bits | -8 ~ 7 or 0~15 | 12.5% |
| INT2 | 2 bits | 0 ~ 3 | 6.25% |
| Binary | 1 bit | ±1 (or 0/1) | 3.125% |
Low-bit quantization specifically refers to using 4 bits or fewer to represent model parameters.
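To make the table concrete, here is a minimal sketch (plain PyTorch, not tied to any quantization library) of a symmetric INT4 round trip; the tensor `W` and the scale rule are illustrative choices:

```python
import torch

# Toy tensor standing in for one layer's weights.
W = torch.randn(4, 8)

# Symmetric INT4: integers in [-8, 7], one scale shared by the whole tensor.
scale = W.abs().max() / 7                             # map the largest magnitude onto the grid
W_int4 = torch.clamp(torch.round(W / scale), -8, 7)   # only 16 possible values
W_deq = W_int4 * scale                                # what the model "sees" at inference time

print("max round-trip error:", (W - W_deq).abs().max().item())
```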
Why Use Low-bit Quantization?
1. Memory Savings
Using LLaMA-70B as an example:
| Precision | Model Size | GPUs Needed |
|---|---|---|
| FP16 | 140 GB | 2× A100-80G |
| INT8 | 70 GB | 1× A100-80G |
| INT4 | 35 GB | 1× A100-40G or consumer GPU |
| INT2 | 17.5 GB | 1× RTX 4090 |
Low-bit quantization allows previously “unreachable” large models to run on regular hardware!
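These sizes are simply parameter count times bits per parameter; a quick check of the table's arithmetic (ignoring the small overhead of scales and zero points):

```python
params = 70e9                                  # LLaMA-70B parameter count (approximate)
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    gigabytes = params * bits / 8 / 1e9        # bits -> bytes -> GB
    print(f"{name}: ~{gigabytes:.1f} GB")      # 140.0, 70.0, 35.0, 17.5
```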
2. Inference Acceleration
Fewer bits mean:
- Less memory bandwidth consumption
- Faster data transfer
- Significant speedup for memory-bound operations
3. Cost Reduction
- Fewer GPUs needed
- Lower energy consumption
- Cheaper hardware configurations
Challenges of Low-bit Quantization
Using 4 bits or fewer to represent numbers sounds “unreliable”—
INT4 only has 16 possible values!
Values that FP32 could carry with high precision, such as 3.14159… or 2.71828…, must now be mapped onto one of only 16 integer levels. The information loss is substantial.
Core challenge: How to maintain model performance at extremely low precision?
Mainstream Low-bit Quantization Techniques
1. GPTQ: Pioneer of Post-training Quantization
GPTQ (GPT Quantization) was one of the first methods to successfully quantize large language models down to INT4 without retraining.
Core idea:
- Layer-by-layer quantization using a small calibration set
- Hessian-weighted minimization of quantization error
- No model retraining needed
Workflow:
```
Original model (FP16)
        |  feed a small calibration dataset
        v
Quantize weights layer by layer, minimizing Hessian-weighted quantization error
        |
        v
Quantized model (INT4), no retraining required
```
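For intuition, here is a heavily simplified sketch of GPTQ's error-compensation loop for a single weight row. It follows the shape of the paper's algorithm but omits the Cholesky factorization, blocking, dampening, and grouping; `Hinv` (the inverse Hessian built from calibration activations) and `scale` are assumed to be given.

```python
import torch

def gptq_like_row(w: torch.Tensor, Hinv: torch.Tensor, scale: float) -> torch.Tensor:
    """Quantize one weight row to INT4, pushing each rounding error onto the
    weights that have not been quantized yet, weighted by the inverse Hessian."""
    w = w.clone()
    q = torch.empty_like(w)
    for j in range(w.numel()):
        # round-to-nearest onto the symmetric INT4 grid
        q[j] = torch.clamp(torch.round(w[j] / scale), -8, 7) * scale
        # columns the loss is less sensitive to absorb more of the error
        err = (w[j] - q[j]) / Hinv[j, j]
        w[j + 1:] -= err * Hinv[j, j + 1:]
    return q
```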
Usage example:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a checkpoint that was already quantized with GPTQ; the quantization
# config stored in the repo is picked up automatically. The repo name below
# is only an example.
model_id = "TheBloke/Llama-2-7B-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```
2. AWQ: Activation-aware Quantization
AWQ (Activation-aware Weight Quantization) starts from an observation: not all weights are equally important.
Core idea:
- Use activation statistics to identify the "important" (salient) weight channels: those that multiply large activations
- Protect those channels by scaling them up before rounding, so their relative quantization error shrinks
- Compensate with an inverse scale on the activation side, keeping all weights in INT4
Formula intuition:
```
Traditional: W_q = round(W / scale)
AWQ:         W_q = round((W * s) / scale), with activations divided by s
             (salient channels get s > 1, shrinking their relative error)
```
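A minimal sketch of that scaling trick, assuming a fixed exponent on per-channel activation magnitude (the real AWQ searches for this exponent on calibration data):

```python
import torch

def awq_like_quantize(W: torch.Tensor, X: torch.Tensor, alpha: float = 0.5):
    """W: (out_features, in_features) weights; X: (tokens, in_features)
    calibration activations. Scale salient input channels up before rounding
    so their relative quantization error shrinks."""
    act_mag = X.abs().mean(dim=0)                   # per-input-channel importance proxy
    s = act_mag.clamp(min=1e-5) ** alpha            # larger activations -> larger s
    q_scale = (W * s).abs().max() / 7               # shared symmetric INT4 scale
    Wq = torch.clamp(torch.round(W * s / q_scale), -8, 7) * q_scale
    return Wq / s                                   # fold the scale back out for this demo
```

In a real deployment the scaled weights stay in INT4 and the division by `s` is fused into the preceding operator; dividing it back out here just makes the result a drop-in replacement for `W`.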
Advantages:
- Faster than GPTQ (no complex Hessian computation)
- Less accuracy loss
- Hardware-friendly
3. QLoRA: Combining Training and Quantization
QLoRA combines quantization with LoRA fine-tuning:
```
Base model:    INT4 (NF4) quantized, frozen
LoRA adapters: small low-rank matrices kept in 16-bit, trainable
Gradients flow through the frozen 4-bit weights and update only the adapters
```
Advantage: Fine-tune on 4-bit quantized models with minimal memory requirements.
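A minimal QLoRA-style setup sketch with transformers, bitsandbytes, and peft; the checkpoint name, target modules, and LoRA hyperparameters are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # frozen NF4 base weights
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # example checkpoint
    quantization_config=bnb,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)         # only the LoRA adapters are trainable
model.print_trainable_parameters()
```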
4. NF4: Designed for Normal Distributions
NF4 (4-bit NormalFloat) is a data type specifically designed for neural network weight distributions.
Principle: Model weights typically follow a normal distribution. NF4’s quantization points are distributed according to normal distribution quantiles, minimizing quantization error.
```
Traditional INT4: 16 quantization points, uniformly spaced
NF4:              16 points placed at quantiles of a standard normal
                  distribution -> denser near 0, where most weights live
```
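A sketch of the intuition: build a 16-level codebook from normal quantiles and compare its round-trip error against a uniform grid. This is a simplified illustration, not bitsandbytes' exact NF4 codebook (which also pins the endpoints and an exact zero).

```python
import torch

w = torch.randn(100_000)                               # weights are roughly normal

# Uniform 16-level grid over the observed range
uniform = torch.linspace(w.min().item(), w.max().item(), 16)

# Quantile-based 16-level grid (NF4-like): equal probability mass per level
probs = (torch.arange(16, dtype=torch.float32) + 0.5) / 16
quantile = torch.distributions.Normal(0.0, 1.0).icdf(probs) * w.std()

def mean_roundtrip_error(w, codebook):
    idx = (w[:, None] - codebook[None, :]).abs().argmin(dim=1)   # nearest level
    return (w - codebook[idx]).abs().mean().item()

print("uniform grid :", mean_roundtrip_error(w, uniform))
print("quantile grid:", mean_roundtrip_error(w, quantile))       # noticeably smaller
```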
More Extreme: 2-bit and 1-bit Quantization
INT2 Quantization
Only 4 possible values (0, 1, 2, 3), extremely challenging:
Technical approaches:
- Use group quantization (each group has independent scale)
- Combine with sparsification techniques
- Requires more calibration data
Binary Networks (1-bit)
Weights are only +1 and -1:
```
Traditional: y = W × x                  (floating-point multiply-accumulate)
Binary:      y ≈ alpha * sign(W) × x    (sign flips and additions; with binary
                                         activations, XNOR + popcount)
```
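A sketch of classic (XNOR-Net-style) weight binarization, where each output channel keeps a single floating-point scale alpha = mean(|W|) to limit the damage; this is the textbook formulation, not any specific 1-bit LLM recipe:

```python
import torch

def binarize_weights(W: torch.Tensor):
    """Replace every weight with +1 or -1, plus one scale per output channel."""
    alpha = W.abs().mean(dim=1, keepdim=True)    # per-row scale
    B = torch.sign(W)
    B[B == 0] = 1.0                              # keep strictly ±1
    return B, alpha

W = torch.randn(4, 8)
B, alpha = binarize_weights(W)
x = torch.randn(8)
y_fp  = W @ x
y_bin = (alpha * B) @ x                          # multiplications become add/subtract
print("approximation error:", (y_fp - y_bin).abs().max().item())
```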
Advantages:
- Minimal storage
- Can use efficient bit operations
Disadvantages:
- Severe accuracy loss
- Mainly used for specific scenarios (e.g., edge devices)
Key Technical Details of Quantization
Group Quantization
Instead of sharing one scale for the entire layer, divide into small groups:
```
Per-tensor:  the whole layer shares 1 scale
Group-wise:  every 128 consecutive weights share 1 scale
             (a 4096 x 4096 layer then stores 4096 * 4096 / 128 = 131,072 scales)
```
Common configurations:
- GPTQ: group_size = 128
- AWQ: group_size = 128
- Smaller groups = higher precision, but more overhead
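A minimal sketch of group-wise symmetric INT4 quantization with `group_size = 128`; the layout and scale storage are simplified for illustration:

```python
import torch

def quantize_groupwise(W: torch.Tensor, group_size: int = 128, bits: int = 4):
    """Each run of `group_size` weights along the input dimension gets its own scale."""
    qmax = 2 ** (bits - 1) - 1                                # 7 for INT4
    out_f, in_f = W.shape
    Wg = W.reshape(out_f, in_f // group_size, group_size)
    scales = Wg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    Wq = torch.clamp(torch.round(Wg / scales), -qmax - 1, qmax)
    W_deq = (Wq * scales).reshape(out_f, in_f)
    return W_deq, scales

W = torch.randn(64, 512)
W_deq, scales = quantize_groupwise(W)
print("scales stored:", scales.numel())                       # 64 * (512 / 128) = 256
print("max error:", (W - W_deq).abs().max().item())
```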
Zero Point
Handling asymmetric distributions:
```
Symmetric:  integer range -8 ~ 7, integer 0 maps to float 0.0
Asymmetric: integer range 0 ~ 15 plus a zero point z
            W_q = round(W / scale) + z,   W ≈ (W_q - z) * scale
```
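A sketch of asymmetric INT4 quantization over the 0 ~ 15 range (per-tensor here for brevity); the skewed input tensor is an illustrative example:

```python
import torch

def quantize_asymmetric(W: torch.Tensor, bits: int = 4):
    qmin, qmax = 0, 2 ** bits - 1                      # 0 .. 15 for INT4
    w_min, w_max = W.min(), W.max()
    scale = (w_max - w_min) / (qmax - qmin)
    zero_point = torch.round(-w_min / scale).clamp(qmin, qmax)
    Wq = torch.clamp(torch.round(W / scale) + zero_point, qmin, qmax)
    W_deq = (Wq - zero_point) * scale
    return Wq, scale, zero_point, W_deq

# Skewed weights: a symmetric grid centered on 0 would waste levels on one side.
W = torch.randn(4, 8) - 1.0
Wq, scale, zp, W_deq = quantize_asymmetric(W)
print("zero point:", int(zp), "max error:", (W - W_deq).abs().max().item())
```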
Performance Comparison
Representative numbers for LLaMA-2-70B (the FP16 baseline does not fit on a single GPU; the quantized variants can):
| Method | Precision | Memory | Perplexity Loss | Speed |
|---|---|---|---|---|
| FP16 | 16-bit | 140GB | Baseline | Baseline |
| GPTQ-4bit | 4-bit | 35GB | +0.1-0.3 | 1.5x |
| AWQ-4bit | 4-bit | 35GB | +0.05-0.2 | 1.5x |
| GGUF-Q4 | 4-bit | ~35GB | +0.1-0.3 | 1.3x |
| 2-bit | 2-bit | ~17GB | +1.0-3.0 | 1.8x |
Practical Recommendations
How to Choose a Quantization Method?
```
Scenario 1: Pursuing highest accuracy       -> 4-bit AWQ or GPTQ (group_size 128)
Scenario 2: Fine-tuning with little memory  -> QLoRA (NF4 via bitsandbytes)
Scenario 3: CPU / local deployment          -> llama.cpp with a GGUF Q4 variant
Scenario 4: Extreme memory pressure         -> 2-bit, accepting a clear quality drop
```
Common Tools
| Tool | Supported Formats | Features |
|---|---|---|
| AutoGPTQ | GPTQ | Easy to use, HuggingFace integrated |
| AutoAWQ | AWQ | Fast, hardware-friendly |
| llama.cpp | GGUF | CPU-friendly, multiple quantization levels |
| bitsandbytes | NF4/INT8 | Commonly used for QLoRA |
| exllama/exllamav2 | GPTQ, EXL2 (v2) | High-performance GPU inference |
Usage Example
Loading a 4-bit Model
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",                 # NF4 data type (see above)
    bnb_4bit_compute_dtype=torch.bfloat16,     # matmuls run in bf16
    bnb_4bit_use_double_quant=True,            # also quantize the scales
)

# Any causal-LM checkpoint works; the name below is just an example.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```
Summary
Low-bit quantization is a key technology for running large models under resource constraints. From INT4 to INT2 to binary networks, each bit reduction represents a difficult balance between precision and efficiency.
Key points:
- INT4 is mainstream: GPTQ, AWQ are mature and usable
- Group quantization: Key technique for improving precision
- Activation-aware: Methods like AWQ identify important weights
- Rich tooling: AutoGPTQ, llama.cpp, etc. are ready to use
Low-bit quantization brings large models “into ordinary homes,” serving as an important driver for AI democratization.