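Batch inference
Multi-GPU deployment
Using Hugging Face
Reference: "[AI Large Models] Transformers library (7): single-machine multi-GPU inference with device_map", CSDN blog
First, restrict which GPUs the process can see, either on the command line: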
CUDA_VISIBLE_DEVICES=1,2,3 python
or inside the script, with os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2".
Install the transformers and accelerate libraries:
pip install transformers -i https://mirrors.cloud.tencent.com/pypi/simple
pip install accelerate -i https://mirrors.cloud.tencent.com/pypi/simple
Then load the model:
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    model_dir, device_map="auto", trust_remote_code=True, torch_dtype=torch.float16)
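To tie the steps above together for batch inference, here is a minimal sketch. It assumes a decoder-only model whose tokenizer may lack a pad token, and it uses GPUs 0,1,2 only because the note mentions them. The helper names batched and run_batch_inference, the batch size, and max_new_tokens are all hypothetical choices of mine, not something from the referenced article.

```python
import os

# Set this before torch initialises CUDA; once CUDA is initialised,
# changing the variable no longer has any effect in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"


def batched(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_batch_inference(model_dir, prompts, batch_size=8, max_new_tokens=64):
    """Hypothetical helper: generate a completion for every prompt, batch by batch."""
    # Imported here so the CUDA_VISIBLE_DEVICES assignment above always comes first.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
    # Decoder-only models should be padded on the left for batched generation.
    tokenizer.padding_side = "left"
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS as padding

    model = AutoModelForCausalLM.from_pretrained(
        model_dir,
        device_map="auto",
        trust_remote_code=True,
        torch_dtype=torch.float16,
    )
    # Shows which module ended up on which device.
    print(model.hf_device_map)

    results = []
    for batch in batched(prompts, batch_size):
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to(model.device)
        with torch.no_grad():
            generated = model.generate(**inputs, max_new_tokens=max_new_tokens)
        # With left padding, everything after the input length is newly generated.
        new_tokens = generated[:, inputs["input_ids"].shape[1]:]
        results.extend(tokenizer.batch_decode(new_tokens, skip_special_tokens=True))
    return results
```

Note that device_map="auto" places different layers on different cards, so only one card is computing at any moment for a given batch. It helps a model fit in memory; it does not by itself make inference faster.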
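You can also split the model's layers across devices by hand and deploy it that way, as the article does.
Reference: "Hugging Face Transformers + Accelerate multi-GPU inference in practice (specifying GPUs and maximum GPU memory)", Zhihu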
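A rough sketch of both ideas, capping memory per card and splitting layers by hand. The module names below (model.embed_tokens, model.layers.N, model.norm, lm_head) are an assumption based on Llama-style models; other architectures use different names and may have extra modules, so check model.named_modules() or a printed hf_device_map first. The helper names and the "20GiB" figure are hypothetical.

```python
def build_max_memory(gpu_ids, per_gpu="20GiB"):
    """Build a max_memory dict such as {0: "20GiB", 1: "20GiB"}."""
    return {gpu: per_gpu for gpu in gpu_ids}


def split_layers(num_layers, gpu_ids):
    """Assign decoder layers to GPUs in contiguous, near-equal chunks.

    ASSUMPTION: Llama-style module names. A manual device_map must cover
    every module of the real model, so adapt the names before using it.
    """
    per_gpu = -(-num_layers // len(gpu_ids))  # ceiling division
    device_map = {"model.embed_tokens": gpu_ids[0]}
    for layer in range(num_layers):
        device_map[f"model.layers.{layer}"] = gpu_ids[layer // per_gpu]
    # Keep the final norm and the output head with the last block of layers.
    device_map["model.norm"] = gpu_ids[-1]
    device_map["lm_head"] = gpu_ids[-1]
    return device_map


if __name__ == "__main__":
    # GPU indices are relative to CUDA_VISIBLE_DEVICES, not physical IDs.
    print(build_max_memory([0, 1]))
    print(split_layers(num_layers=4, gpu_ids=[0, 1]))
    # Either dict is then passed to from_pretrained, for example:
    #   AutoModelForCausalLM.from_pretrained(
    #       model_dir, device_map="auto", max_memory=build_max_memory([0, 1]))
    #   AutoModelForCausalLM.from_pretrained(
    #       model_dir, device_map=split_layers(32, [0, 1]))
```

max_memory with device_map="auto" is the lower-effort option, since accelerate still works out the placement. The manual map is only worth it when you need control over exactly where the split falls.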
Using PyTorch's built-in DDP (DistributedDataParallel) and DP (DataParallel)
Don't use DP; it is inefficient.
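DP runs a single process that scatters inputs and gathers outputs on every forward pass, which is why the PyTorch documentation recommends DistributedDataParallel instead. For pure inference, and assuming the model fits on one card, the same one-process-per-GPU idea can be had without wrapping the model at all: each process loads a full copy and handles its own slice of the data. The sketch below is that simplification, not DistributedDataParallel itself. It relies on the RANK, LOCAL_RANK and WORLD_SIZE environment variables that torchrun sets; the function names are hypothetical.

```python
import os


def shard_for_rank(items, rank, world_size):
    """Return the items this process is responsible for (round-robin split)."""
    return items[rank::world_size]


def worker_setup(prompts):
    """Work out this process's device and data shard from torchrun's env vars."""
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = f"cuda:{local_rank}"
    return device, shard_for_rank(prompts, rank, world_size)


if __name__ == "__main__":
    all_prompts = [f"prompt {i}" for i in range(10)]  # placeholder data
    device, my_prompts = worker_setup(all_prompts)
    # Each process would now load the whole model onto `device`, run
    # generation over `my_prompts`, and write results to its own file.
    print(device, len(my_prompts))
```

Launched with something like torchrun --nproc_per_node=2 infer.py (the script name is made up), every GPU stays busy on its own shard instead of waiting for another card's layers to finish.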
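In practice
Letting transformers allocate GPU memory with device_map="auto"
Throughput is surprisingly poor: these 2,000 samples would take 13 hours, whereas a single card previously got through a hundred-odd thousand samples in only 44 hours.
About 4 hours on a single card.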
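To make the gap concrete, the two timings can be converted to seconds per sample. The earlier run's exact sample count is not recorded, so 100,000 is used here as a conservative stand-in for "a hundred-odd thousand"; that figure is an assumption, and a larger real count would only widen the gap.

```python
def seconds_per_sample(hours, samples):
    """Average wall-clock seconds spent on one sample."""
    return hours * 3600 / samples


# Multi-GPU run with device_map="auto": 2,000 samples in 13 hours.
multi_gpu = seconds_per_sample(13, 2_000)

# Earlier single-card run: 44 hours for a hundred-odd thousand samples.
# ASSUMPTION: 100,000 is used as a lower bound on that count.
single_gpu = seconds_per_sample(44, 100_000)

slowdown = multi_gpu / single_gpu
print(f"multi-GPU:  {multi_gpu:.1f} s/sample")
print(f"single-GPU: {single_gpu:.2f} s/sample (at most, under the assumption)")
print(f"slowdown:   at least {slowdown:.1f}x")
```

That works out to 23.4 s per sample against roughly 1.6 s, so the multi-GPU run is about 15 times slower per sample under this assumption.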
First, this warning shows up:
We've detected an older driver with an RTX 4000 series GPU. These drivers have issues with P2P. This can affect the multi-gpu inference when using accelerate device_map.Please make sure to update your driver to the latest version which resolves this.
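Second, I was using GPU0 and GPU4, which are not on the same PCIe board (the link between them is SYS, i.e. it crosses NUMA nodes):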
(TinyRAG) jsh@user-ESC8000A-E11:/data/jsh/code/TinyRAG$ nvidia-smi topo -m
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity    NUMA Affinity  GPU NUMA ID
GPU0  X     NODE  NODE  NODE  SYS   SYS   SYS   SYS   0-63,128-191    0              N/A
GPU1  NODE  X     NODE  NODE  SYS   SYS   SYS   SYS   0-63,128-191    0              N/A
GPU2  NODE  NODE  X     NODE  SYS   SYS   SYS   SYS   0-63,128-191    0              N/A
GPU3  NODE  NODE  NODE  X     SYS   SYS   SYS   SYS   0-63,128-191    0              N/A
GPU4  SYS   SYS   SYS   SYS   X     NODE  NODE  NODE  64-127,192-255  1              N/A
GPU5  SYS   SYS   SYS   SYS   NODE  X     NODE  NODE  64-127,192-255  1              N/A
GPU6  SYS   SYS   SYS   SYS   NODE  NODE  X     NODE  64-127,192-255  1              N/A
GPU7  SYS   SYS   SYS   SYS   NODE  NODE  NODE  X     64-127,192-255  1              N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
Next, try GPU4 and GPU7, which are on the same NUMA node.
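Picking such a pair can be done by reading the matrix, or with a small parser like the sketch below. It assumes the matrix text has been saved with whitespace-separated columns, as in the output above; the function names are hypothetical, and treating anything other than SYS as "same NUMA node" is my reading of the legend.

```python
TOPO = """\
GPU0 X NODE NODE NODE SYS SYS SYS SYS 0-63,128-191 0 N/A
GPU1 NODE X NODE NODE SYS SYS SYS SYS 0-63,128-191 0 N/A
GPU2 NODE NODE X NODE SYS SYS SYS SYS 0-63,128-191 0 N/A
GPU3 NODE NODE NODE X SYS SYS SYS SYS 0-63,128-191 0 N/A
GPU4 SYS SYS SYS SYS X NODE NODE NODE 64-127,192-255 1 N/A
GPU5 SYS SYS SYS SYS NODE X NODE NODE 64-127,192-255 1 N/A
GPU6 SYS SYS SYS SYS NODE NODE X NODE 64-127,192-255 1 N/A
GPU7 SYS SYS SYS SYS NODE NODE NODE X 64-127,192-255 1 N/A
"""


def parse_topo(text):
    """Map (gpu_a, gpu_b) to the link type from `nvidia-smi topo -m` rows."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        # Matrix rows start with a GPU name and contain the self marker "X";
        # the header row and the legend lines do not match both conditions.
        if fields and fields[0].startswith("GPU") and "X" in fields:
            rows.append(fields)
    names = [row[0] for row in rows]
    links = {}
    for row in rows:
        for other, link in zip(names, row[1:1 + len(names)]):
            links[(row[0], other)] = link
    return links


def same_node_pairs(links):
    """Return GPU pairs whose link does not cross NUMA nodes (anything but SYS)."""
    return sorted(
        (a, b) for (a, b), link in links.items()
        if a < b and link not in ("X", "SYS")
    )


if __name__ == "__main__":
    links = parse_topo(TOPO)
    print(links[("GPU0", "GPU4")])  # link type for the pair used originally
    print(same_node_pairs(links))
```

With a pair chosen, the run would be restricted to it in the same way as before, for example CUDA_VISIBLE_DEVICES=4,7.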