1. TensorRT-LLM Installation
OS: Ubuntu 22.04
1.1 Setting up the Docker environment
Switch to the root user:
sudo passwd root
Update apt:
sudo apt-get update --fix-missing
Update Docker:
sudo apt-get upgrade docker-ce
Install the NVIDIA container runtime to avoid the following error:
Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
- For installation instructions, see Installing the NVIDIA Container Toolkit.
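As of this writing, the core commands from that guide are roughly the following (the apt repository-setup steps are omitted here; follow the linked guide for the authoritative instructions):
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker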
Run the NVIDIA CUDA image:
docker run --runtime=nvidia --gpus all --name tllm --entrypoint /bin/bash -it nvidia/cuda:12.3.0-devel-ubuntu22.04
Note: the CUDA version inside the container does not need to match the host's CUDA version.
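Before installing anything in the container, it is worth confirming that the GPU is visible; if the NVIDIA runtime is configured correctly, this should list the host's GPUs:
nvidia-smi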
1.2 Installing TensorRT-LLM
Install the Python environment:
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs
Install tensorrt_llm:
pip3 install tensorrt_llm==0.10.0 -U --extra-index-url https://pypi.nvidia.com
Note: as of June 2024, version 0.10.0 is the recommended release.
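A quick way to confirm the installation succeeded is to import the package and print its version:
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"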
2. Quantization
Weight-only quantization
Convert the Hugging Face model into a TensorRT-LLM checkpoint:
CUDA_VISIBLE_DEVICES=0 python convert_checkpoint.py \
--model_version v2_7b --model_dir ./baichuan7B \
--dtype float16 --output_dir ./ckpt --use_weight_only
Compile the model:
trtllm-build --checkpoint_dir /ckpt \
    --output_dir /engine --gemm_plugin float16 \
    --max_batch_size=1 --max_input_len=2048 \
    --max_output_len=128
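To smoke-test the built engine, the run.py script shipped in the TensorRT-LLM examples can be used; the prompt and paths below are placeholders matching this walkthrough, so adjust them to your setup:
python3 run.py --engine_dir /engine --tokenizer_dir ./baichuan7B \
    --input_text "Hello" --max_output_len 64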
Layer-wise quantization
At conversion time, modify weight_only_quantize_dict() (around line 65) in
/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/convert_utils.py
Here, exclusive lists the decoder blocks you do not want to quantize:
def weight_only_quantize_dict(weights: Dict[str, torch.Tensor],
                              quant_algo: str,
                              quant_weights=[
                                  'qkv.weight', 'dense.weight', 'fc.weight',
                                  'proj.weight', 'gate.weight'
                              ],
                              plugin: bool = True):
    exclusive = ['24', '25', '26', '27', '28', '29', '30', '31']  # <- Here
    if quant_algo not in [QuantAlgo.W4A16, QuantAlgo.W8A16]:
        return weights
    for name in list(weights):
        # Skip weights belonging to excluded decoder blocks.
        conti = False
        for exlu in exclusive:
            if exlu in name:
                conti = True
                print("exclu: ", exlu)
        if conti:
            continue
        # Original code below.
        if any([_name in name for _name in quant_weights
                ]) and weights[name].dtype != torch.int8:
            quant_weight, quant_scale = weight_only_quantize(
                weight=weights[name], quant_algo=quant_algo, plugin=plugin)
            weights[name] = quant_weight
            weights[name.replace('.weight',
                                 '.per_channel_scale')] = quant_scale
    return weights
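One caveat worth noting (my observation, not part of the original patch): exlu in name is a plain substring match against weight names such as transformer.layers.24.attention.qkv.weight, so a short index like '2' would also match layers 12 and 20-29. A sketch of a more defensive check:
def is_excluded(name: str, exclusive: list) -> bool:
    # Match only the delimited layer index, e.g. '.24.' in
    # 'transformer.layers.24.attention.qkv.weight'.
    return any(f'.{idx}.' in name for idx in exclusive)

print(is_excluded('transformer.layers.24.attention.qkv.weight', ['24']))  # True
print(is_excluded('transformer.layers.2.attention.qkv.weight', ['24']))   # False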
Then, before running trtllm-build, also modify
/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize.py
Likewise, exclude_modules lists the layers you do not want to quantize:
def weight_only_quantize(model,
                         quant_config: QuantConfig,
                         current_key_name=None):
    assert quant_config.quant_mode.is_weight_only()
    exclude_modules = quant_config.exclude_modules or [
        'lm_head', 'router', '20', '21', '22', '23', '24', '25', '26',
        '27', '28', '29', '30', '31'
    ]
    for name, module in model.named_children():
        if current_key_name is None:
            current_key_name = []
        current_key_name.append(name)
        print(current_key_name)
        if len(list(module.children())) > 0:
            weight_only_quantize(module, quant_config, current_key_name)
        # The current_key_name[2] check is the added layer-index filter.
        if isinstance(module, ColumnLinear) and name not in exclude_modules \
                and current_key_name[2] not in exclude_modules:
            # ... (remainder of the original function unchanged)
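The current_key_name[2] test assumes a module hierarchy of the form transformer.layers.<idx>.attention..., so the third path element is the decoder-layer index. A minimal illustration (the example path is assumed, not read from an actual model):
parts = 'transformer.layers.24.attention.qkv'.split('.')
# parts == ['transformer', 'layers', '24', 'attention', 'qkv']
print(parts[2])  # -> '24', the value compared against exclude_modules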
Accuracy comparison

| Quantized layers | Accuracy |
| --- | --- |
| All 32 layers | 98.78% |
| Layers 0-27 | 98.87% |
| Layers 0-23 | 99.06% |
| fp16 (no quantization) | 99.26% |
As the table shows, skipping quantization of the last few layers gives a slight improvement in model accuracy.
SmoothQuant quantization
python convert_checkpoint.py --model_version v2_7b \
    --model_dir ./baichuan7b --dtype float16 --output_dir /ckpt \
    -sq 0.5 --per_token --per_channel
CUDA_VISIBLE_DEVICES=0 trtllm-build --checkpoint_dir /ckpt \
    --output_dir /engine --gemm_plugin float16 \
    --max_batch_size=1 --max_input_len=1024 --max_output_len=256
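To reproduce an accuracy comparison like the table above, the summarize.py script from the TensorRT-LLM examples can score the engine against the Hugging Face baseline (ROUGE on a summarization set); the invocation below is a sketch assuming the paths used in this walkthrough:
python3 summarize.py --test_trt_llm --engine_dir /engine \
    --hf_model_dir ./baichuan7b --data_type fp16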