Megatron-DeepSpeed on CUDA: Multi-Node Training
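- 1. Pull the pytorch:24.03-py3 image from NGC
- 2. Install nvidia-docker and create the container
- 3. Set up the Megatron-DeepSpeed environment
- 4. Install OpenMPI and the SSH service
- 5. Copy the public key
- 6. Install pdsh
- 7. Upgrade protobuf
- 8. Prepare the dataset
- 9. Create the configuration file
- 10. Run the test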
This article walks through the steps for multi-node GPU training with Megatron-DeepSpeed.
1. Pull the pytorch:24.03-py3 image from NGC
docker pull nvcr.io/nvidia/pytorch:24.03-py3
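Because every node in a multi-node run needs the same image, it can help to confirm the pull succeeded before moving on. The following is a minimal sketch, not part of the original procedure: `image_status` is a hypothetical helper name, and it relies only on `docker image inspect`, which exits non-zero when the image is not available locally (or when Docker itself is unavailable).

```shell
# Hypothetical helper (not from the original article): report whether an
# image is available in the local Docker image store.
image_status() {
    image="$1"
    # `docker image inspect` exits non-zero if the image is absent
    # (or if the docker CLI/daemon cannot be reached).
    if docker image inspect "$image" >/dev/null 2>&1; then
        echo "present"
    else
        echo "missing"
    fi
}

image_status "nvcr.io/nvidia/pytorch:24.03-py3"
```

Running this on each node should print `present` once the pull has completed there.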
2. Install nvidia-docker and create the container
cd /mnt
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list
apt-get update
apt-get install -y nvidia-docker2
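The repository setup above builds a `distribution` string from `ID` and `VERSION_ID` in `/etc/os-release` and uses it to select the `nvidia-docker.list` file. As a rough sketch for previewing which URL a given host would request, the function below repeats that logic without writing anything to `/etc/apt`. The name `nvidia_docker_list_url` and its optional file argument are assumptions added for illustration, and the sketch does not check that the resulting URL actually exists for your distribution.

```shell
# Hypothetical helper (not from the original article): print the apt source
# list URL that the repository setup step would fetch for this host.
# Optional argument: path to an os-release file (default: /etc/os-release).
nvidia_docker_list_url() {
    os_release="${1:-/etc/os-release}"
    # Source the file inside a command substitution (a subshell) so that
    # ID and VERSION_ID do not leak into the calling shell.
    distribution=$(. "$os_release"; echo "${ID}${VERSION_ID}")
    echo "https://nvidia.github.io/nvidia-docker/${distribution}/nvidia-docker.list"
}

# Preview the URL for the current host when os-release is readable.
if [ -r /etc/os-release ]; then
    nvidia_docker_list_url
fi
```

If the printed distribution string looks wrong (for example, an empty `VERSION_ID` on a rolling-release system), the `curl` step would likely fetch an error page rather than a valid source list, so it is worth checking before running `apt-get update`.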
nvidia-docker run -ti -e NVIDIA_VISIBLE_DEVICES=all --privileged --net=host -v