1. DataParallel
DataParallel is the easier option to use: you only need to wrap the single-GPU model.
model = nn.DataParallel(model)
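It runs in a single process that keeps the master copy of the model on GPU 0. For every batch, the model parameters are broadcast from GPU 0 to the other GPUs and the input batch is split across them. Each GPU runs the forward pass on its slice, the outputs are gathered on GPU 0 to compute the loss, and in the backward pass each GPU computes its own gradients, which are then reduced onto GPU 0. The optimizer updates the parameters on GPU 0, and the updated parameters are broadcast to the other GPUs again for the next batch.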
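The per-batch cycle described above can be imitated in plain Python. This is only a toy sketch under stated assumptions: the "model" is a single weight w with loss (w*x - y)^2, the "GPUs" are list entries, and the names scatter, grad_sum and data_parallel_step are hypothetical, not PyTorch APIs.

```python
def scatter(batch, n_devices):
    """Split a batch into contiguous chunks, one per simulated device."""
    size = -(-len(batch) // n_devices)  # ceiling division
    return [batch[i:i + size] for i in range(0, len(batch), size)]


def grad_sum(w, chunk):
    """Sum of d/dw (w*x - y)^2 over the samples in one chunk."""
    return sum(2 * (w * x - y) * x for x, y in chunk)


def data_parallel_step(w_gpu0, batch, n_devices, lr):
    """One simulated step: broadcast, scatter, per-device gradients, reduce, update."""
    chunks = scatter(batch, n_devices)
    replicas = [w_gpu0] * len(chunks)           # broadcast parameters from "GPU 0"
    partial = [grad_sum(w, c) for w, c in zip(replicas, chunks)]
    grad = sum(partial) / len(batch)            # reduce onto "GPU 0" (mean loss)
    return w_gpu0 - lr * grad                   # only "GPU 0" runs the optimizer


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_multi = data_parallel_step(0.0, batch, n_devices=2, lr=0.01)
w_single = data_parallel_step(0.0, batch, n_devices=1, lr=0.01)
print(w_multi, w_single)
```

The two results agree, which illustrates that the scheme computes the same update as single-GPU training; the cost lies in repeating the broadcast and reduce through GPU 0 on every batch.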
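Features:
(1) The model parameters are broadcast on every batch, so it is slow and inefficient
(2) Simple to use
As a result, communication quickly becomes the bottleneck and GPU utilization is usually low. nn.DataParallel also requires all GPUs to be on the same node (it does not support distributed training), and it cannot be used with Apex for mixed-precision training.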
https://zhuanlan.zhihu.com/p/113694038
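1. DistributedDataParallel supports model parallelism and DataParallel does not, which means that when a model is too large for the memory of a single GPU, only the former can be used;
2. DataParallel is single-process and multi-threaded and only works on a single machine, whereas DistributedDataParallel is multi-process and works on a single machine as well as across machines, so it provides true distributed training;
3. DistributedDataParallel trains more efficiently: every process has its own Python interpreter, which avoids GIL contention, and the communication cost is lower, so training is faster. The PyTorch documentation recommends DistributedDataParallel over DataParallel even on a single node, so DataParallel has largely fallen out of use;
4. Note that in DistributedDataParallel every process has its own optimizer and runs its own update step, but because the gradients are synchronized across processes through communication, every process applies the same update;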
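Point 4 can be illustrated with a small pure-Python simulation. It is a sketch under assumptions, not DistributedDataParallel itself: each "process" is a list entry holding its own copy of a single weight, all_reduce_mean stands in for the gradient synchronization, and all function names are hypothetical.

```python
def local_gradient(w, shard):
    """Mean gradient of (w*x - y)^2 over the data shard owned by one process."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for gradient synchronization: every process receives the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def ddp_step(weights, shards, lr):
    """Each process computes a local gradient, syncs it, then updates its own copy."""
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    synced = all_reduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, synced)]


# Two simulated processes, identical initial weights, different data shards.
weights = [0.0, 0.0]
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
for _ in range(100):
    weights = ddp_step(weights, shards, lr=0.01)
print(weights)
```

Because every process starts from the same weight and applies the same synchronized gradient, the copies never diverge, and no parameter broadcast is needed after each step.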
2. DistributedDataParallel
https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel
(link to the official documentation)
# args, train_set, valid_set, batch_size, valid_batch_size, pad_collate, WAP_model and max_epochs are defined elsewhere
import os
import time
import torch
import torch.distributed as dist

main_proc = True
device = torch.device("cuda")
is_distributed = os.environ.get("LOCAL_RANK")  # If local rank exists, distributed env
print("distributed: ", is_distributed)
if is_distributed:
    device_id = args.local_rank
    torch.cuda.set_device(device_id)
    print(f"Setting CUDA Device to {device_id}")
    os.environ['NCCL_IB_DISABLE'] = '0'
    dist.init_process_group(backend="nccl")
    print("distributed finished........")
    main_proc = device_id == 0  # Main process handles saving of models and reporting

if is_distributed:
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_set, shuffle=True)
    #train_sampler = db2sampler(SequentialSampler(train_set), batch_size, False, bucket_size_multiplier=len(train_set)//batch_size)
else:
    train_sampler = torch.utils.data.RandomSampler(train_set)
    #train_sampler = db1sampler(SequentialSampler(train_set), batch_size, False, bucket_size_multiplier=len(train_set)//batch_size)
train_loader = torch.utils.data.DataLoader(train_set, batch_size, sampler=train_sampler, num_workers=args.workers, collate_fn=pad_collate)
valid_loader = torch.utils.data.DataLoader(valid_set, valid_batch_size, num_workers=args.workers, collate_fn=pad_collate)

if is_distributed:
    WAP_model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(WAP_model)  # resolves the BatchNorm issue
if is_distributed:
    WAP_model = torch.nn.parallel.DistributedDataParallel(WAP_model, device_ids=[device_id], find_unused_parameters=True)

for eidx in range(max_epochs):
    n_samples = 0
    ud_epoch = time.time()
    if is_distributed:
        train_sampler.set_epoch(epoch=eidx)
    for i, (x, y, x_idx, x_name) in enumerate(train_loader):
        WAP_model.train()
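The code above detects distributed mode from the LOCAL_RANK environment variable but takes the device index from args.local_rank. A minimal sketch of deriving both from the environment, assuming the job is started by a launcher such as torchrun that exports LOCAL_RANK; the helper get_local_rank is hypothetical and the dictionary passed in only stands in for os.environ:

```python
import os


def get_local_rank(environ=None):
    """Return LOCAL_RANK as an int, or None when it is not set (hypothetical helper)."""
    if environ is None:
        environ = os.environ
    value = environ.get("LOCAL_RANK")
    if value is None or value == "":
        return None
    return int(value)


# Simulate the environment a launcher would give to the second process on a node.
local_rank = get_local_rank({"LOCAL_RANK": "1"})
is_distributed = local_rank is not None
main_proc = (not is_distributed) or local_rank == 0  # rank 0 saves models and reports
print(is_distributed, main_proc)
```

Returning an int rather than the raw string makes the value directly usable as a device index, and keeps the single-GPU case (variable unset) as the main process.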
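Note: with DataParallel, the batch size given to the DataLoader is split across the GPUs, so it has to be set to n times the single-GPU batch size (n = number of GPUs). With DistributedDataParallel, every process has its own DataLoader, so the batch size is set to the same value as for a single GPU.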
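The arithmetic behind this note, as a small sketch. The function names are hypothetical, and the assumption is that DataParallel splits each loader batch into near-equal chunks while every DistributedDataParallel process loads its own batch:

```python
def per_gpu_batch_data_parallel(loader_batch_size, n_gpus):
    """DataParallel: one loader batch is split across the GPUs (largest chunk shown)."""
    return -(-loader_batch_size // n_gpus)  # ceiling division


def global_batch_ddp(loader_batch_size, world_size):
    """DistributedDataParallel: every process loads its own batch of this size."""
    return loader_batch_size * world_size


single_gpu_batch = 64
n_gpus = 4

# DataParallel: pass n * 64 to the DataLoader to keep 64 samples per GPU.
print(per_gpu_batch_data_parallel(single_gpu_batch * n_gpus, n_gpus))
# DistributedDataParallel: pass 64 to each DataLoader; 256 samples are used per step overall.
print(global_batch_ddp(single_gpu_batch, n_gpus))
```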
Compared with DataParallel, DistributedDataParallel cut the training time severalfold.
Always use DistributedDataParallel.
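Remember to call set_epoch at the start of every epoch; otherwise DistributedSampler reuses the same shuffling order in every epoch:
if is_distributed:
    train_sampler.set_epoch(epoch=eidx)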
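Why set_epoch matters can be seen in a pure-Python imitation of the sampler's index logic. This is a simplified sketch, not the PyTorch class: ToyDistributedSampler is a hypothetical name, and the assumption is that the shuffle is seeded with seed + epoch and each rank takes every num_replicas-th index.

```python
import random


class ToyDistributedSampler:
    """Simplified imitation of a distributed sampler (not torch's DistributedSampler)."""

    def __init__(self, n_items, num_replicas, rank, seed=0):
        self.n_items = n_items
        self.num_replicas = num_replicas
        self.rank = rank
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        self.epoch = epoch

    def __iter__(self):
        indices = list(range(self.n_items))
        # The shuffle depends only on seed + epoch, so all ranks agree on the order.
        random.Random(self.seed + self.epoch).shuffle(indices)
        # Pad so that every rank receives the same number of samples.
        per_rank = -(-self.n_items // self.num_replicas)
        total = per_rank * self.num_replicas
        indices += indices[: total - len(indices)]
        # Each rank takes a disjoint, strided slice of the shared order.
        return iter(indices[self.rank::self.num_replicas])


samplers = [ToyDistributedSampler(100, num_replicas=4, rank=r) for r in range(4)]
epoch0 = [list(s) for s in samplers]
for s in samplers:
    s.set_epoch(1)
epoch1 = [list(s) for s in samplers]
print(epoch0[0][:5], epoch1[0][:5])
```

Without the set_epoch call, iterating the sampler again yields exactly the same indices, so every epoch would see the data in the same order.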
https://zhuanlan.zhihu.com/p/97115875
PyTorch (distributed) data parallelism: a personal practice summary
Pitfalls:
(1) With DistributedDataParallel, set the batch size to the same value as for a single GPU