AutoRound: Advanced Weight-Only Quantization Algorithm for LLMs
AutoRound is an advanced weight-only quantization algorithm based on SignRound. It is tailored to a wide range of models and consistently delivers noticeable improvements, often significantly outperforming SignRound at the cost of additional tuning time for quantization.
Prerequisites
- Python 3.9 or higher
Installation
Build from Source
pip install -r requirements.txt
python setup.py install
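A quick sanity check that the build succeeded, assuming the package exposes AutoRound at the top level as in the usage example below:
# Should import without errors once the build has been installed.
from auto_round import AutoRound
print(AutoRound)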
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound
model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(
model_name, low_cpu_mem_usage=True, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
bits, group_size, scheme = 4, 128, "asym"
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, scheme=scheme)
autoround.quantize()
Detailed Hyperparameters
- model: The PyTorch model to be quantized.
- tokenizer: An optional tokenizer for processing input data. If none is provided, a dataloader must be supplied.
- bits (int): Number of bits for quantization (default is 4).
- group_size (int): Size of the quantization group (default is 128).
- scheme (str): The quantization scheme (symmetric/asymmetric) to be used (default is "asym").
- use_quant_input (bool): Whether to use the output of the previous quantized block as the input for the current block (default is True).
- enable_minmax_tuning (bool): Whether to enable weight min-max tuning (default is True).
- iters (int): Number of tuning iterations (default is 200).
- lr (float): The learning rate for the rounding value (default is None; it will be set to 1.0/iters automatically).
- minmax_lr (float): The learning rate for min-max tuning (default is None; it will be set to lr automatically).
- n_samples (int): Number of samples for tuning (default is 512).
- seqlen (int): Sequence length of the tuning data (default is 2048).
- bs (int): Batch size for training (default is 8).
- amp (bool): Whether to use automatic mixed precision (default is True).
- n_blocks (int): Number of blocks packed together for tuning (default is 1).
- gradient_accumulate_steps (int): Number of gradient accumulation steps (default is 1).
- low_gpu_mem_usage (bool): Whether to save GPU memory at the cost of a little tuning time (default is True).
- dataset_name (str): The default dataset name for tuning (default is "NeelNanda/pile-10k").
- dataset_split (str): The split of the dataset to be used for tuning (default is "train").
- dataloader: The dataloader for tuning data.
- weight_config (dict): Configuration for weight quantization (default is an empty dictionary), mainly for mixed bits or mixed precision.
- device: The device to be used for tuning (default is "cuda:0").
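To illustrate how these arguments fit together, the sketch below overrides a handful of the defaults listed above. The parameter names come from this list; the specific values are arbitrary illustrations, not tuned recommendations.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,                  # number of bits for quantization
    group_size=128,          # size of the quantization group
    scheme="asym",           # asymmetric quantization
    iters=400,               # more tuning iterations than the default 200
    n_samples=512,           # number of calibration samples
    seqlen=2048,             # sequence length of the tuning data
    low_gpu_mem_usage=True,  # save GPU memory at the cost of a little tuning time
    device="cuda:0",         # tuning device
)
autoround.quantize()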
Validated Models
For wikitext2/ptb-new/c4-new perplexity (ppl), we follow the GPTQ evaluation code and set the sequence length to 2048. For lm-eval wikitext ppl, we adopt lm-eval. The quantization configuration is W4G128.
Model | Method | Acc AVG. | MMLU | Lamb. | Hella. | Wino. | Piqa | Truth. | Open. | Boolq | RTE | ARC-e | ARC-c. | wikitext2 ppl | ptb_new ppl | c4_new ppl | lm_eval wikitext ppl
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Intel/neural-chat-7b-v3 | FP16 | 67.92 | 61.13 | 73.03 | 66.39 | 76.40 | 81.01 | 47.37 | 38.8 | 86.97 | 75.81 | 82.66 | 57.51 | 6.00 | 48.96 | 9.65 | -
Intel/neural-chat-7b-v3 | Ours | 66.90 | 60.56 | 72.19 | 65.28 | 75.37 | 81.18 | 46.76 | 36.0 | 86.91 | 73.29 | 81.73 | 56.66 | 6.21 | 59.78 | 10.01 | -
Intel/neural-chat-7b-v3 | Ours iters1K, disable use_quant_input, minmax_lr 0.002 | 67.70 | 60.57 | 73.74 | 65.62 | 77.43 | 80.85 | 47.61 | 36.8 | 86.94 | 75.09 | 82.66 | 57.34 | 6.17 | 59.12 | 9.83 | -
mistralai/Mixtral-8x7B-v0.1 | BF16 | 67.16 | 69.83 | 78.44 | 64.89 | 76.40 | 82.43 | 34.15 | 35.40 | 84.98 | 71.12 | 84.22 | 56.91 | 3.84 | 19.22 | 7.41 | -
mistralai/Mixtral-8x7B-v0.1 | Ours | 65.98 | 68.90 | 78.11 | 64.31 | 74.27 | 82.10 | 30.97 | 34.20 | 84.57 | 67.87 | 83.96 | 56.57 | 4.08 | 354 | 7.56 | -
mistralai/Mixtral-8x7B-v0.1 | Ours iters1K, disable use_quant_input | 66.78 | 68.68 | 78.61 | 64.40 | 76.56 | 81.99 | 32.56 | 34.80 | 85.96 | 70.76 | 83.96 | 56.31 | 3.99 | 17.65 | 7.52 | -
microsoft/phi-2 | FP16 | 61.80 | 56.40 | 62.78 | 55.83 | 75.77 | 78.67 | 31.21 | 40.40 | 83.36 | 62.45 | 80.05 | 52.90 | 9.71 | 18.16 | 14.12 | 11.05
microsoft/phi-2 | AutoRound | 61.67 | 54.57 | 61.32 | 55.04 | 76.48 | 78.89 | 29.74 | 40.60 | 83.24 | 66.43 | 79.76 | 52.30 | 9.98 | 18.67 | 14.39 | 11.37
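For reference, the "Ours iters1K, disable use_quant_input" rows above correspond to a configuration along the lines of the sketch below; it is assembled from the hyperparameters documented earlier rather than copied from the experiment scripts, so treat it as an approximation.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Intel/neural-chat-7b-v3"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 1000 iterations, previous-block quantized input disabled, and (for this model)
# a min-max learning rate of 0.002, as noted in the table.
autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    scheme="asym",
    iters=1000,
    use_quant_input=False,
    minmax_lr=0.002,
)
autoround.quantize()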
We provide a comparative analysis with other methods in our accuracy data section. Notably, our approach outperformed GPTQ in 30 of 32 settings and AWQ in 27 of 32 settings across LLaMA-v1/LLaMA-v2/Mistral-7B at W4G-1, W4G128, W3G128, and W2G128, with comparable tuning costs.
Models that passed the smoke test
LaMini-GPT-124M; QWEN1-8B; OPT-125M; Bloom-560m; falcon-7b; gpt-neo-125m; stablelm-base-alpha-3b; dolly-v2-3b; mpt-7b; gpt-j-6b; chatglm2-6b
Tips
1. Consider increasing the number of tuning steps to achieve better results, albeit with increased tuning time.
2. Leverage AutoGPTQ to evaluate the model on GPU:
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound
model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(
model_name, low_cpu_mem_usage=True, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, scheme="asym")
autoround.quantize()
## export to autogptq
# please install auto-gptq first: https://github.com/AutoGPTQ/
output_dir = "/path/to/quantized_model"
autoround.export(output_dir, target="auto_gptq", use_triton=True)
# then follow auto-gptq to load the quantized model and run inference (see the sketch below)
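For reference, the loading step mentioned in the last comment could look like the sketch below. It assumes a recent auto-gptq release that exposes AutoGPTQForCausalLM.from_quantized and reuses the output_dir exported above; the tokenizer is reloaded from the original model name since the export step above does not necessarily save it.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Load the checkpoint exported by AutoRound with auto-gptq and run a short generation.
output_dir = "/path/to/quantized_model"
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")  # original model's tokenizer
model = AutoGPTQForCausalLM.from_quantized(output_dir, device="cuda:0", use_triton=True)

inputs = tokenizer("AutoRound is", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))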
Known Issues
- Random issues may occur when tuning Qwen models
- ChatGLM-V1 is not supported
Examples
Enter the examples folder and install lm-eval to run the evaluation:
pip install -r requirements.txt
- Default Settings:
CUDA_VISIBLE_DEVICES=0 python3 main.py --model_name facebook/opt-125m --amp --bits 4 --group_size -1 --enable_minmax_tuning --use_quant_input
- Reduced GPU Memory Usage and Adjusted Training Batch Size:
CUDA_VISIBLE_DEVICES=0 python3 main.py --model_name facebook/opt-125m --amp --bits 4 --group_size -1 --low_gpu_mem_usage --train_bs 1 --gradient_accumulate_steps 8
- Utilizing the AdamW Optimizer: include the flag --adam. Note that AdamW is less effective than sign gradient descent in many of the scenarios we tested.
- Running the Original SignRound (see the Python sketch after this list):
CUDA_VISIBLE_DEVICES=0 python3 main.py --model_name facebook/opt-125m --amp --bits 4 --group_size -1 --iters 400 --lr 0.0025 --minmax_lr 0.0025
--enable_minmax_tuning is strongly recommended.
- The required transformers version varies across model types. The transformers versions used when running the models in our experiments are provided below for reference.

Model | Transformers version
---|---
EleutherAI/gpt-j-6b | 4.28/4.30/4.34/4.36
huggyllama/llama-7b | 4.28/4.30/4.34/4.36
meta-llama/Llama-2-7b-hf | 4.30/4.34/4.36
facebook/opt-6.7b | 4.28/4.30/4.34/4.36
tiiuae/falcon-7b | 4.28/4.30/4.34/4.36
mosaicml/mpt-7b | 4.28/4.30/4.34/4.36
bigscience/bloom-7b1 | 4.28/4.30/4.34/4.36
baichuan-inc/Baichuan-7B | 4.28/4.30
Qwen/Qwen-7B | 4.28/4.30/4.34/4.36
THUDM/chatglm3-6b | 4.34/4.36
mistralai/Mistral-7B-v0.1 | 4.34/4.36
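As referenced in the SignRound item above, the same setting can also be expressed through the Python API. This is a minimal sketch assuming the CLI flags map one-to-one onto the constructor arguments from Detailed Hyperparameters; in particular, enable_minmax_tuning is turned off here because the original SignRound command above does not pass it.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# SignRound-style configuration: explicit learning rates, 400 iterations,
# group_size -1 as on the command line, and min-max tuning disabled.
autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=-1,
    iters=400,
    lr=0.0025,
    minmax_lr=0.0025,
    enable_minmax_tuning=False,
    amp=True,
)
autoround.quantize()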
Reference
If you find SignRound useful for your research, please cite our paper:
@article{cheng2023optimize,
title={Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao},
journal={arXiv preprint arXiv:2309.05516},
year={2023}
}