Low-bit Quantization: Slimming Large Models to the Extreme

When we say “quantization,” we usually mean reducing precision to FP16 or INT8. But on the path to extreme compression, engineers have gone further: low-bit quantization (INT4, INT2, or even binary networks) shrinks the size and compute cost of large models to an astonishing degree.

What is Low-bit Quantization?

Quantization is a technique for representing numbers with fewer bits.

| Data Type | Bits | Range | Relative Size |
| --- | --- | --- | --- |
| FP32 | 32 bits | ±3.4 × 10^38 | 100% |
| FP16 | 16 bits | ±65,504 | 50% |
| INT8 | 8 bits | -128 ~ 127 | 25% |
| INT4 | 4 bits | -8 ~ 7 or 0 ~ 15 | 12.5% |
| INT2 | 2 bits | 0 ~ 3 | 6.25% |
| Binary | 1 bit | 0 or 1 | 3.125% |

Low-bit quantization specifically refers to using 4 bits or fewer to represent model parameters.

Why Use Low-bit Quantization?

1. Memory Savings

Using LLaMA-70B as an example:

| Precision | Model Size | GPUs Needed |
| --- | --- | --- |
| FP16 | 140 GB | 2× A100-80G |
| INT8 | 70 GB | 1× A100-80G |
| INT4 | 35 GB | 1× A100-40G or a consumer GPU |
| INT2 | 17.5 GB | 1× RTX 4090 |

Low-bit quantization allows previously “unreachable” large models to run on regular hardware!
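The sizes in the table above follow directly from parameter count × bits per parameter (ignoring the small overhead of quantization scales and zero points). A quick back-of-the-envelope check in Python:

params = 70e9                      # LLaMA-70B parameter count
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    print(f"{name}: {params * bits / 8 / 1e9:.1f} GB")   # 140.0 / 70.0 / 35.0 / 17.5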

2. Inference Acceleration

Fewer bits mean:

  • Less memory bandwidth consumption
  • Faster data transfer
  • Significant speedup for memory-bound operations

3. Cost Reduction

  • Fewer GPUs needed
  • Lower energy consumption
  • Cheaper hardware configurations

Challenges of Low-bit Quantization

Using 4 bits or fewer to represent numbers sounds “unreliable”—

INT4 only has 16 possible values!

Values that FP32 could represent precisely, such as 3.14159… or 2.71828…, must now be approximated by one of just 16 integers between 0 and 15. The information loss is substantial.

Core challenge: How to maintain model performance at extremely low precision?

Mainstream Low-bit Quantization Techniques

1. GPTQ: Pioneer of Post-training Quantization

GPTQ (GPT Quantization) was the first method to successfully quantize large language models to INT4.

Core idea:

  • Layer-by-layer quantization using a small calibration set
  • Hessian-weighted minimization of quantization error
  • No model retraining needed

Workflow:

Original model (FP16)
  ↓ Layer-by-layer quantization
  ↓ Use calibration data (hundreds of samples)
  ↓ Calculate optimal quantization parameters
  ↓ Minimize output error
INT4 model

Usage example:

from auto_gptq import AutoGPTQForCausalLM

# Load a GPTQ-quantized model from the Hugging Face Hub
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-70B-GPTQ",
    device_map="auto"
)
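To quantize a model yourself instead of downloading a prequantized checkpoint, transformers also exposes the algorithm through GPTQConfig (backed by auto-gptq/optimum). A minimal sketch, using a small model purely to keep the example light:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"   # small model, chosen only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)

# "c4" selects a built-in calibration dataset; a custom list of strings also works
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer)

# Quantization runs layer by layer while loading, so this step takes a while
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
    torch_dtype=torch.float16,
)
model.save_pretrained("opt-125m-gptq-4bit")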

2. AWQ: Activation-aware Quantization

AWQ (Activation-aware Weight Quantization) starts from the observation that not all weights are equally important.

Core idea:

  • Identify “important” weights (those with large impact on activations)
  • Maintain higher precision for important weights
  • Balance quantization error through scaling tricks

Formula intuition:

Traditional quantization: W_q = round(W / scale)
AWQ: first estimate a per-channel importance scale s, then W_q = round(W × s / scale), with the activations divided by s so that the product W·x is preserved
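To make the scaling trick concrete, here is a toy sketch with made-up shapes and a simple importance heuristic (illustrative only, not the actual AutoAWQ implementation):

import torch

torch.manual_seed(0)
W = torch.randn(8, 16)            # weight matrix: (out_features, in_features)
X = torch.randn(32, 16)           # calibration activations: (samples, in_features)
X[:, :4] *= 10.0                  # pretend a few input channels carry much larger activations

def quant_int4(w):
    scale = w.abs().max() / 7     # symmetric 4-bit: integer levels -8..7
    return torch.clamp((w / scale).round(), -8, 7) * scale

# Per-input-channel importance derived from activation magnitudes
s = X.abs().mean(dim=0).clamp(min=1e-5).sqrt()

W_plain = quant_int4(W)           # ordinary quantization
W_awq = quant_int4(W * s) / s     # scale salient channels up, quantize, scale back

y_ref = X @ W.T
print("plain error: ", (y_ref - X @ W_plain.T).abs().mean().item())
print("scaled error:", (y_ref - X @ W_awq.T).abs().mean().item())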

Advantages:

  • Faster than GPTQ (no complex Hessian computation)
  • Less accuracy loss
  • Hardware-friendly

3. QLoRA: Combining Training and Quantization

QLoRA combines quantization with LoRA fine-tuning:

Base model: 4-bit NF4 quantized (frozen)
        +
LoRA adapters: FP16 (trainable)
        ↓
Output: high quality and efficient

Advantage: Fine-tune on 4-bit quantized models with minimal memory requirements.
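In code, this amounts to loading the base model in 4-bit (see the loading example near the end of this section) and attaching LoRA adapters with peft. A minimal sketch; the target_modules names assume a LLaMA-style architecture:

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# `model` is assumed to be a causal LM already loaded with load_in_4bit=True
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                   # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # assumption: LLaMA-style attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA weights are trainable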

4. NF4: Designed for Normal Distributions

NF4 (4-bit NormalFloat) is a data type specifically designed for neural network weight distributions.

Principle: Model weights typically follow a normal distribution. NF4’s quantization points are distributed according to normal distribution quantiles, minimizing quantization error.

Traditional INT4: Uniformly distributed quantization points
NF4: Quantization points distributed by normal distribution quantiles
→ Better coverage of common weight values
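A rough way to see the principle in code; this is a simplified sketch of the idea, not the actual bitsandbytes NF4 implementation (which, among other things, reserves an exact zero level):

import torch

def nf4_like_levels(num_levels: int = 16) -> torch.Tensor:
    # Place quantization levels at quantiles of a standard normal distribution
    probs = torch.linspace(0, 1, num_levels + 2)[1:-1]    # avoid 0 and 1 (infinite quantiles)
    levels = torch.distributions.Normal(0.0, 1.0).icdf(probs)
    return levels / levels.abs().max()                    # normalize to [-1, 1]

def quantize(weights, levels):
    absmax = weights.abs().max()
    w = weights / absmax                                   # scale the block into [-1, 1]
    idx = (w.unsqueeze(-1) - levels).abs().argmin(dim=-1)  # snap to the nearest level
    return idx.to(torch.uint8), absmax

def dequantize(idx, absmax, levels):
    return levels[idx.long()] * absmax

levels = nf4_like_levels()
w = torch.randn(64)                                        # one block of (roughly normal) weights
idx, absmax = quantize(w, levels)
print((w - dequantize(idx, absmax, levels)).abs().mean())  # small reconstruction error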

More Extreme: 2-bit and 1-bit Quantization

INT2 Quantization

With only 4 possible values per weight (e.g., 0, 1, 2, 3), INT2 quantization is extremely challenging:

Technical approaches:

  • Use group quantization (each group has independent scale)
  • Combine with sparsification techniques
  • Requires more calibration data

Binary Networks (1-bit)

Weights are only +1 and -1:

Traditional: y = W × x    (floating-point multiplication)
Binary: y = sign(W) × x (can use bit operations!)
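A tiny illustration of why the multiplications disappear; this sketches BinaryConnect/XNOR-style weight binarization with one real-valued scale per output row, not any specific library:

import torch

W = torch.randn(8, 16)
x = torch.randn(16)

alpha = W.abs().mean(dim=1, keepdim=True)   # per-row scale, kept in floating point
W_bin = torch.sign(W)                       # weights reduced to {-1, +1}

y_ref = W @ x
y_bin = (alpha * W_bin) @ x                 # matmul degenerates to additions/subtractions
print((y_ref - y_bin).abs().mean())         # accuracy cost of discarding weight magnitudes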

Advantages:

  • Minimal storage
  • Can use efficient bit operations

Disadvantages:

  • Severe accuracy loss
  • Mainly used for specific scenarios (e.g., edge devices)

Key Technical Details of Quantization

Group Quantization

Instead of sharing one scale for the entire layer, divide into small groups:

Traditional: Whole layer scale = 1 parameter
Group: Every 128 weights form a group, each group has independent scale
→ Finer quantization, higher precision

Common configurations:

  • GPTQ: group_size = 128
  • AWQ: group_size = 128
  • Smaller groups = higher precision, but more overhead
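A bare-bones sketch of group-wise symmetric INT4 quantization with group_size = 128 (real kernels additionally pack two 4-bit values per byte and store the scales in FP16):

import torch

def quantize_grouped(w: torch.Tensor, group_size: int = 128):
    groups = w.reshape(-1, group_size)                        # every group shares one scale
    scale = groups.abs().max(dim=1, keepdim=True).values / 7  # symmetric: integer levels -8..7
    q = torch.clamp((groups / scale).round(), -8, 7).to(torch.int8)
    return q, scale

def dequantize_grouped(q, scale, shape):
    return (q.float() * scale).reshape(shape)

w = torch.randn(4096, 4096)
q, scale = quantize_grouped(w.flatten())                      # 4096 × 4096 is divisible by 128
w_hat = dequantize_grouped(q, scale, w.shape)
print("max abs error:", (w - w_hat).abs().max().item())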

Zero Point

Handling asymmetric distributions:

Symmetric quantization: levels -8 ~ 7, float 0 maps exactly to integer 0
Asymmetric quantization: levels 0 ~ 15, shifted by a zero point
actual value = (quantized value - zero_point) × scale
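A minimal sketch of per-tensor asymmetric 4-bit quantization with a zero point (for illustration only):

import torch

def quantize_asymmetric(w: torch.Tensor, n_levels: int = 16):
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / (n_levels - 1)
    zero_point = (-w_min / scale).round()                     # integer level that represents float 0
    q = torch.clamp((w / scale + zero_point).round(), 0, n_levels - 1)
    return q.to(torch.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.float() - zero_point) * scale                   # actual value = (q - zero_point) × scale

w = torch.randn(1024) * 0.5 + 1.0          # weights clustered away from zero (asymmetric case)
q, scale, zp = quantize_asymmetric(w)
print((w - dequantize(q, scale, zp)).abs().max())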

Performance Comparison

Representative numbers for LLaMA-2-70B (FP16 as the baseline):

| Method | Precision | Memory | Perplexity Loss | Speed |
| --- | --- | --- | --- | --- |
| FP16 | 16-bit | 140 GB | baseline | baseline |
| GPTQ-4bit | 4-bit | 35 GB | +0.1-0.3 | 1.5x |
| AWQ-4bit | 4-bit | 35 GB | +0.05-0.2 | 1.5x |
| GGUF-Q4 | 4-bit | ~35 GB | +0.1-0.3 | 1.3x |
| 2-bit | 2-bit | ~17 GB | +1.0-3.0 | 1.8x |

Practical Recommendations

How to Choose a Quantization Method?

Scenario 1: Pursuing highest accuracy
→ AWQ or GPTQ, 4-bit group=128

Scenario 2: Extreme compression, acceptable accuracy loss
→ 3-bit or 2-bit quantization

Scenario 3: Need fine-tuning
→ QLoRA (NF4 + LoRA)

Scenario 4: Edge deployment
→ GGUF format (llama.cpp)

Common Tools

| Tool | Supported Formats | Features |
| --- | --- | --- |
| AutoGPTQ | GPTQ | Easy to use, HuggingFace integrated |
| AutoAWQ | AWQ | Fast, hardware-friendly |
| llama.cpp | GGUF | CPU-friendly, multiple quantization levels |
| bitsandbytes | NF4/INT8 | Commonly used for QLoRA |
| exllama/exllamav2 | GPTQ | High-performance inference |

Usage Example

Loading a 4-bit Model

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # use NF4
    bnb_4bit_compute_dtype=torch.float16    # run matmuls in FP16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
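A quick smoke test that the quantized model generates sensible text (this reuses the model object loaded above and assumes enough GPU memory for inference):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
inputs = tokenizer("Low-bit quantization lets large models", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))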

Summary

Low-bit quantization is a key technology for running large models under resource constraints. From INT4 to INT2 to binary networks, each bit reduction represents a difficult balance between precision and efficiency.

Key points:

  1. INT4 is mainstream: GPTQ, AWQ are mature and usable
  2. Group quantization: Key technique for improving precision
  3. Activation-aware: Methods like AWQ identify important weights
  4. Rich tooling: AutoGPTQ, llama.cpp, etc. are ready to use

Low-bit quantization brings large models “into ordinary homes,” serving as an important driver for AI democratization.