While training with DeepSpeed today, I wanted to use GPUs 4, 5, 6 and 7, but setting the following environment variable had no effect:
export CUDA_VISIBLE_DEVICES=4,5,6,7
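I only got it working by selecting the GPUs in the DeepSpeed launch configuration (the shell script that builds the deepspeed command). Along the way I hit this error: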
[2023-07-29 09:29:29,308] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Traceback (most recent call last):
File "/home/mapengsen/anaconda3/envs/drug/bin/deepspeed", line 6, in <module>
main()
File "/home/mapengsen/anaconda3/envs/drug/lib/python3.8/site-packages/deepspeed/launcher/runner.py", line 405, in main
raise ValueError("Cannot specify num_nodes/gpus with include/exclude")
ValueError: Cannot specify num_nodes/gpus with include/exclude
The error was caused by a misconfiguration in the shell script that launches deepspeed:
Incorrect:
run_cmd="deepspeed --include localhost:4,5,6,7 --num_nodes ${DLWS_NUM_WORKER} --master_port=${MASTER_PORT} --num_gpus ${DLWS_NUM_GPU_PER_WORKER} train_retrieval.py ${full_options} ${custom_train_options}"
Correct:
run_cmd="deepspeed --include localhost:4,5,6,7 --master_port=${MASTER_PORT} train_retrieval.py ${full_options} ${custom_train_options}"
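The reason is that once you have specified the GPU IDs explicitly with "--include", you no longer need to set "--num_nodes" or "--num_gpus"; the launcher refuses the combination and raises the ValueError shown above.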
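To avoid running into this again, the launch script can build the command through a small guard. The following is only a minimal sketch: build_deepspeed_cmd is a hypothetical helper name, not part of DeepSpeed, and the port 29500 is an assumed placeholder value. It only assembles and prints the command string; it does not start training.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DeepSpeed): build a launch command that
# pins training to specific local GPUs via --include.
build_deepspeed_cmd() {
    local gpu_ids="$1"      # comma-separated GPU IDs, e.g. "4,5,6,7"
    local master_port="$2"  # assumed placeholder port, e.g. 29500
    shift 2
    local arg
    # Reject the flags that the launcher refuses to combine with --include.
    for arg in "$@"; do
        case "$arg" in
            --num_nodes|--num_nodes=*|--num_gpus|--num_gpus=*)
                echo "error: do not combine $arg with --include" >&2
                return 1
                ;;
        esac
    done
    # Remaining arguments are the training script and its options.
    echo "deepspeed --include localhost:${gpu_ids} --master_port=${master_port} $*"
}

run_cmd="$(build_deepspeed_cmd 4,5,6,7 29500 train_retrieval.py)"
echo "${run_cmd}"
```

In a real script the remaining arguments would be train_retrieval.py followed by ${full_options} and ${custom_train_options}, and the resulting run_cmd would be executed the same way as before.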