Preface
When vLLM runs with multi-GPU parallelism enabled (-tp 2 or --tensor-parallel-size 2), it fails with the following error:
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/vllm/entrypoints/openai/api_server.py", line 236, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 628, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 321, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 369, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 126, in __init__
    self._init_workers_ray(placement_group)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 304, in _init_workers_ray
    self._run_workers("init_model",
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 1041, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/usr/local/lib/python3.8/dist-packages/vllm/worker/worker.py", line 94, in init_model
    init_distributed_environment(self.parallel_config, self.rank,
  File "/usr/local/lib/python3.8/dist-packages/vllm/worker/worker.py", line 275, in init_distributed_environment
    cupy_utils.init_process_group(
  File "/usr/local/lib/python3.8/dist-packages/vllm/model_executor/parallel_utils/cupy_utils.py", line 79, in init_process_group
    raise ImportError(
ImportError: NCCLBackend is not available. Please install cupy.
This error usually has one of three causes:
- cuda-toolkit is not installed
- The CUDA and cupy versions do not match
- The CUDA environment variable LD_LIBRARY_PATH is not set
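Before reinstalling anything, it can help to see which of the three causes applies. The following is a rough diagnostic sketch, not a definitive check. The helper names (path_list_contains, report_cupy_causes) and the CUDA_DIR variable are made up for illustration, and /usr/local/cuda is only an assumed default install location.

```shell
# Rough diagnostic for the three causes listed above.
# Assumption: CUDA is installed under /usr/local/cuda; set CUDA_DIR otherwise.

# Return 0 if the colon-separated list $1 contains directory $2 exactly.
path_list_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}

report_cupy_causes() {
    cuda_dir="${1:-/usr/local/cuda}"

    # Cause 1: is the CUDA toolkit (nvcc) present?
    if [ -x "$cuda_dir/bin/nvcc" ]; then
        echo "toolkit: found $cuda_dir/bin/nvcc"
    else
        echo "toolkit: nvcc not found under $cuda_dir"
    fi

    # Cause 2: which cupy package (if any) does pip know about?
    if command -v pip >/dev/null 2>&1; then
        pip list 2>/dev/null | grep -i '^cupy' \
            || echo "cupy: no cupy package listed by pip"
    else
        echo "cupy: pip not found on PATH"
    fi

    # Cause 3: does LD_LIBRARY_PATH include the CUDA lib64 directory?
    if path_list_contains "${LD_LIBRARY_PATH:-}" "$cuda_dir/lib64"; then
        echo "LD_LIBRARY_PATH: contains $cuda_dir/lib64"
    else
        echo "LD_LIBRARY_PATH: missing $cuda_dir/lib64"
    fi
}

report_cupy_causes "${CUDA_DIR:-/usr/local/cuda}"
```

The script only reports what it finds and changes nothing on the system.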
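Installing cuda-toolkit
Any Ubuntu 20.04 desktop install ships with the nouveau driver. It is an open-source, third-party driver that was popular years ago, but NVIDIA's own drivers are now well supported, so we won't use nouveau. Uninstall and disable it right away!
If you skip this step, the CUDA environment can behave unpredictably later on, and you may run into "unsupported device" errors.
First, open a terminal and remove any old drivers that are already installed: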
dpkg -l | grep -i nvidia
sudo apt-get purge 'nvidia*' 'libnvidia*' -y
sudo apt autoremove && sudo reboot
Next, disable the nouveau driver completely.
Open the blacklist.conf file:
sudo vi /etc/modprobe.d/blacklist.conf
Add the line blacklist nouveau, then save and quit with :wq
blacklist nouveau
Then run the following command:
echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
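The two configuration edits above can also be made without opening an editor. This is a minimal sketch, assuming the same two files as above. The helper name append_line_once and the SUDO variable are hypothetical. The helper skips the write when the exact line already exists, so running it twice does not create duplicates.

```shell
# Append a line to a file only if that exact line is not already present.
# $1 = line to add, $2 = target file.
# Set SUDO=sudo when the target file is owned by root.
append_line_once() {
    if [ -f "$2" ] && grep -qxF -- "$1" "$2"; then
        return 0    # already there, nothing to do
    fi
    printf '%s\n' "$1" | ${SUDO:-} tee -a "$2" >/dev/null
}

# Intended usage for the two nouveau settings (left commented out so that
# sourcing this file changes nothing until you call it yourself):
# SUDO=sudo append_line_once "blacklist nouveau" /etc/modprobe.d/blacklist.conf
# SUDO=sudo append_line_once "options nouveau modeset=0" /etc/modprobe.d/nouveau-kms.conf
```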
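Then regenerate the initramfs and reboot:
sudo update-initramfs -u && sudo reboot
After the reboot, check whether it worked. Run the command below; if it prints nothing, nouveau is disabled.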
lsmod | grep nouveau
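The check above can be turned into an explicit yes/no message. This is a small sketch; the function name nouveau_loaded is made up for illustration. It matches only the module name in the first column of the lsmod output, so a line that merely mentions nouveau elsewhere does not count.

```shell
# Read lsmod-style output on stdin; succeed only if a module
# named exactly "nouveau" appears in the first column.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

if command -v lsmod >/dev/null 2>&1; then
    if lsmod | nouveau_loaded; then
        echo "nouveau is still loaded: recheck the blacklist files and the initramfs"
    else
        echo "nouveau is not loaded"
    fi
fi
```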
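Next, install the GPU driver.
The driver and CUDA are different things! CUDA is a parallel computing platform and programming model; to compute on the GPU, CUDA has to go through the driver to reach the hardware.
Each CUDA release has a matching driver version, and the CUDA installer already bundles that driver.
Installing them together is recommended, to avoid version conflicts.
Download the installer from the link below. Don't pick the newest release! Many tools have not caught up with it yet and will throw errors.
Official site: https://developer.nvidia.com/cuda-toolkit-archive
Choose the CUDA toolkit version you want to download; this guide uses 12.1.0 as the example.
Copy and paste the commands shown under Base Installer: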
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda_12.1.0_530.30.02_linux.run
sudo sh cuda_12.1.0_530.30.02_linux.run
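Enter the commands above in the terminal and the download starts automatically. It takes a while, so be patient.
When the license screen appears, type accept and press Enter to proceed with the installation.
On the configuration page, confirm the options as shown in the screenshot, then select Install.
This takes about 2 minutes. A message is displayed when the installation succeeds.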
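Once the installer reports success, you can confirm which toolkit release ended up on disk and which cupy wheel family goes with it, since a CUDA/cupy mismatch is one of the causes listed earlier. The sketch below makes several assumptions: nvcc lives at /usr/local/cuda-12.1/bin/nvcc, the helper names parse_nvcc_release and cupy_wheel_for_cuda are made up, and the mapping covers only the cupy-cuda11x (CUDA 11.2 to 11.8) and cupy-cuda12x (CUDA 12.x) wheels. vLLM may pin a specific cupy version in its own requirements, so check those before installing.

```shell
# Extract the release number (e.g. "12.1") from `nvcc --version` output on stdin.
parse_nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Map a CUDA release to a cupy wheel family; fail for releases not covered here.
cupy_wheel_for_cuda() {
    case "$1" in
        12|12.*)               echo "cupy-cuda12x" ;;
        11.[2-8]|11.[2-8].*)   echo "cupy-cuda11x" ;;
        *)                     return 1 ;;
    esac
}

NVCC="${NVCC:-/usr/local/cuda-12.1/bin/nvcc}"   # assumed install path
if [ -x "$NVCC" ]; then
    release="$("$NVCC" --version | parse_nvcc_release)"
    echo "CUDA toolkit release: $release"
    if wheel="$(cupy_wheel_for_cuda "$release")"; then
        echo "matching cupy wheel family: $wheel"
    else
        echo "no wheel mapping for CUDA $release in this sketch"
    fi
else
    echo "nvcc not found at $NVCC"
fi
```

Running nvidia-smi is a separate check that the driver itself loaded; it does not tell you whether the toolkit is installed.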
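Finally, run the following command to set the environment variable:
export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH
- /usr/local/cuda-12.1/lib64 must match your actual CUDA path (12.1 here, to match the toolkit installed above)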
Print the variable to verify it:
echo $LD_LIBRARY_PATH
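An export typed into a terminal lasts only for that shell session, and running it repeatedly keeps stacking the same directory onto the variable. Here is a hedged sketch of an idempotent version. The helper name prepend_path_once and the CUDA_LIB_DIR variable are hypothetical, and /usr/local/cuda-12.1/lib64 is an assumed path that you should replace with your own.

```shell
# Print list $2 with directory $1 prepended, unless $1 is already in it.
# Also avoids a trailing ":" when the list starts out empty (the dynamic
# linker treats an empty entry as the current directory).
prepend_path_once() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;
        *)
            if [ -n "$2" ]; then
                printf '%s\n' "$1:$2"
            else
                printf '%s\n' "$1"
            fi
            ;;
    esac
}

CUDA_LIB_DIR="${CUDA_LIB_DIR:-/usr/local/cuda-12.1/lib64}"   # assumed path
LD_LIBRARY_PATH="$(prepend_path_once "$CUDA_LIB_DIR" "${LD_LIBRARY_PATH:-}")"
export LD_LIBRARY_PATH
echo "$LD_LIBRARY_PATH"
```

To keep the setting across logins, these lines could be placed in ~/.bashrc, assuming bash is your login shell.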