MMCV 1.6.0 Runner/Hook: OptimizerHook (backpropagation + parameter update), Fp16OptimizerHook, custom optimizers, and config settings

OptimizerHook

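The code below defines OptimizerHook, a hook that holds custom operations for the optimizer. After every training iteration it runs backpropagation and the parameter update, and it can optionally clip gradients and detect anomalous parameters. Both options help stabilize training and debug models.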

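Class definition
OptimizerHook inherits from Hook and implements custom operations related to the optimizer.
Parameters
grad_clip: a dict that configures gradient clipping; its entries are passed as keyword arguments to clip_grad_norm_. Default: None.
detect_anomalous_params: a bool used only for debugging; it slows training down. When enabled, the hook detects anomalous parameters that are not part of the computational graph rooted at the loss. Default: False.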

@HOOKS.register_module()
class OptimizerHook(Hook):
    """A hook contains custom operations for the optimizer.

    Args:
        grad_clip (dict, optional): A config dict to control the clip_grad.
            Default: None.
        detect_anomalous_params (bool): This option is only used for
            debugging which will slow down the training speed.
            Detect anomalous parameters that are not included in
            the computational graph with `loss` as the root.
            There are two cases

                - Parameters were not used during
                  forward pass.
                - Parameters were not used to produce
                  loss.
            Default: False.
    """

    def __init__(self,
                 grad_clip: Optional[dict] = None,
                 detect_anomalous_params: bool = False):
        self.grad_clip = grad_clip
        self.detect_anomalous_params = detect_anomalous_params

    def clip_grads(self, params):
        params = list(
            filter(lambda p: p.requires_grad and p.grad is not None, params))
        if len(params) > 0:
            return clip_grad.clip_grad_norm_(params, **self.grad_clip)

    def after_train_iter(self, runner):
        runner.optimizer.zero_grad()
        if self.detect_anomalous_params:
            self.detect_anomalous_parameters(runner.outputs['loss'], runner)
        runner.outputs['loss'].backward()

        if self.grad_clip is not None:
            grad_norm = self.clip_grads(runner.model.parameters())
            if grad_norm is not None:
                # Add grad norm to the logger
                runner.log_buffer.update({'grad_norm': float(grad_norm)},
                                         runner.outputs['num_samples'])
        runner.optimizer.step()

    def detect_anomalous_parameters(self, loss: Tensor, runner) -> None:
        logger = runner.logger
        parameters_in_graph = set()
        visited = set()

        def traverse(grad_fn):
            if grad_fn is None:
                return
            if grad_fn not in visited:
                visited.add(grad_fn)
                if hasattr(grad_fn, 'variable'):
                    parameters_in_graph.add(grad_fn.variable)
                parents = grad_fn.next_functions
                if parents is not None:
                    for parent in parents:
                        grad_fn = parent[0]
                        traverse(grad_fn)

        traverse(loss.grad_fn)
        for n, p in runner.model.named_parameters():
            if p not in parameters_in_graph and p.requires_grad:
                logger.log(
                    level=logging.ERROR,
                    msg=f'{n} with shape {p.size()} is not '
                    f'in the computational graph \n')

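Main logic
Initialization

Accepts two optional arguments, grad_clip and detect_anomalous_params, and stores them as instance attributes.

clip_grads method

Keeps only the parameters that have requires_grad set and whose grad is not None.
If any such parameters remain, clips their gradients with clip_grad.clip_grad_norm_ and returns the norm that function reports.

after_train_iter method

Called after every training iteration.
Zeroes the optimizer's gradients.
If anomalous-parameter detection is enabled, calls detect_anomalous_parameters.
Runs backpropagation to compute the gradients.
If gradient clipping is enabled, calls clip_grads and logs the returned gradient norm. clip_grad_norm_ returns the total norm computed before clipping, so the logged value is the pre-clipping norm.
Calls optimizer.step() to update the model parameters.

detect_anomalous_parameters method

Detects anomalous parameters that are not included in the computational graph.
Walks the computational graph of the loss and collects the parameters found in it.
Compares the model's parameters with those found in the graph and logs an error for every parameter that requires gradients but is missing from the graph.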
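The graph walk used by detect_anomalous_parameters can be tried on a toy model. The following is a minimal sketch, not the hook itself: TwoBranch and params_outside_graph are hypothetical names introduced for illustration, and the traversal is written iteratively instead of recursively.

```python
import torch
from torch import nn


class TwoBranch(nn.Module):
    """Toy model (hypothetical): `unused` is never called in forward."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 1)
        self.unused = nn.Linear(4, 1)

    def forward(self, x):
        return self.used(x)


def params_outside_graph(loss, model):
    """Return names of trainable parameters unreachable from loss.grad_fn."""
    in_graph, visited = set(), set()
    stack = [loss.grad_fn]
    while stack:
        fn = stack.pop()
        if fn is None or fn in visited:
            continue
        visited.add(fn)
        # Leaf nodes (AccumulateGrad) expose their parameter as `.variable`.
        if hasattr(fn, 'variable'):
            in_graph.add(fn.variable)
        stack.extend(parent for parent, _ in fn.next_functions)
    return [
        name for name, p in model.named_parameters()
        if p.requires_grad and p not in in_graph
    ]


model = TwoBranch()
loss = model(torch.randn(2, 4)).sum()
missing = params_outside_graph(loss, model)
print(missing)  # ['unused.weight', 'unused.bias']
```

Both parameters of the branch that forward never calls are reported, which corresponds to the first case named in the docstring (parameters not used during the forward pass).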
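Summary
OptimizerHook gives a flexible way to manage and debug the optimizer step. Gradient clipping guards against exploding gradients, and anomalous-parameter detection helps find parameters that did not take part in computing the loss. Both are particularly useful when training and debugging large deep learning models.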
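To see what a grad_clip dict does to the gradients, here is a small sketch that calls torch.nn.utils.clip_grad_norm_ directly on a single parameter. The values max_norm=1.0 and the hand-built loss are chosen only for illustration.

```python
import torch
from torch import nn
from torch.nn.utils import clip_grad_norm_

weight = nn.Parameter(torch.zeros(2))
# loss = 3*w0 + 4*w1, so the gradient is [3, 4] and its L2 norm is 5.
loss = (weight * torch.tensor([3.0, 4.0])).sum()
loss.backward()

grad_clip = dict(max_norm=1.0, norm_type=2)  # illustrative values
# The return value is the total norm measured BEFORE clipping.
norm_before = float(clip_grad_norm_([weight], **grad_clip))
# The gradients themselves are rescaled in place to (about) max_norm.
norm_after = float(weight.grad.norm())
print(norm_before, norm_after)
```

The returned value stays at the original norm (about 5 here) while the stored gradient is scaled down to roughly max_norm, which is why the grad_norm written to the log can exceed max_norm.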

Fp16OptimizerHook (optimizer hook with FP16 support)

The code below defines Fp16OptimizerHook, which inherits from OptimizerHook and adds FP16 support. It is used for mixed precision training, which speeds up training and reduces GPU memory usage.

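Class definition
Fp16OptimizerHook inherits from OptimizerHook and implements the custom operations needed for FP16 training.
Parameters
grad_clip: a dict that configures gradient clipping. Default: None.
coalesce: a bool indicating whether to coalesce small gradient tensors to improve communication efficiency. Default: True.
bucket_size_mb: an int giving the gradient bucket size in MB. Default: -1.
loss_scale: a float, str, or dict that configures loss scaling. A float selects static loss scaling with that value; a str must be 'dynamic' and selects dynamic loss scaling; a dict holds the arguments for GradScaler. Default: 512.
distributed: a bool indicating whether distributed training is used. Default: True.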
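The three accepted forms of loss_scale can be summarized with a small pure-Python sketch that mirrors the branching in the constructor. resolve_loss_scale is a hypothetical helper written for this article; it returns the keyword arguments that would go to GradScaler instead of creating one, so it runs without a GPU.

```python
def resolve_loss_scale(loss_scale):
    """Return (grad_scaler_kwargs, scale_update_param) for a loss_scale value.

    Mirrors the if/elif chain of Fp16OptimizerHook.__init__ without
    importing torch.
    """
    if loss_scale == 'dynamic':
        return {}, None                      # GradScaler() with defaults
    elif isinstance(loss_scale, float):
        # Static scaling: the same value is passed to update() every step.
        return {'init_scale': loss_scale}, loss_scale
    elif isinstance(loss_scale, dict):
        return dict(loss_scale), None        # GradScaler(**loss_scale)
    raise ValueError('loss_scale must be of type float, dict, or '
                     f'"dynamic", got {loss_scale}')


static_cfg = resolve_loss_scale(512.)
dynamic_cfg = resolve_loss_scale('dynamic')
dict_cfg = resolve_loss_scale(dict(init_scale=65536.0, growth_interval=2000))

# An int is not a float, so 512 (without the trailing dot) is rejected.
try:
    resolve_loss_scale(512)
    int_accepted = True
except ValueError:
    int_accepted = False
```

One practical consequence of the isinstance(loss_scale, float) check is that the config must say 512. rather than 512.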

@HOOKS.register_module()
class Fp16OptimizerHook(OptimizerHook):
    """FP16 optimizer hook (using PyTorch's implementation).

    If you are using PyTorch >= 1.6, torch.cuda.amp is used as the backend,
    to take care of the optimization procedure.

    Args:
        loss_scale (float | str | dict): Scale factor configuration.
            If loss_scale is a float, static loss scaling will be used with
            the specified scale. If loss_scale is a string, it must be
            'dynamic', then dynamic loss scaling will be used.
            It can also be a dict containing arguments of GradScalar.
            Defaults to 512. For Pytorch >= 1.6, mmcv uses official
            implementation of GradScaler. If you use a dict version of
            loss_scale to create GradScaler, please refer to:
            https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler
            for the parameters.

    Examples:
        >>> loss_scale = dict(
        ...     init_scale=65536.0,
        ...     growth_factor=2.0,
        ...     backoff_factor=0.5,
        ...     growth_interval=2000
        ... )
        >>> optimizer_hook = Fp16OptimizerHook(loss_scale=loss_scale)
    """

    def __init__(self,
                 grad_clip: Optional[dict] = None,
                 coalesce: bool = True,
                 bucket_size_mb: int = -1,
                 loss_scale: Union[float, str, dict] = 512.,
                 distributed: bool = True):
        self.grad_clip = grad_clip
        self.coalesce = coalesce
        self.bucket_size_mb = bucket_size_mb
        self.distributed = distributed
        self._scale_update_param = None
        if loss_scale == 'dynamic':
            self.loss_scaler = GradScaler()
        elif isinstance(loss_scale, float):
            self._scale_update_param = loss_scale
            self.loss_scaler = GradScaler(init_scale=loss_scale)
        elif isinstance(loss_scale, dict):
            self.loss_scaler = GradScaler(**loss_scale)
        else:
            raise ValueError('loss_scale must be of type float, dict, or '
                             f'"dynamic", got {loss_scale}')

    def before_run(self, runner) -> None:
        """Preparing steps before Mixed Precision Training."""
        # wrap model mode to fp16
        wrap_fp16_model(runner.model)
        # resume from state dict
        if 'fp16' in runner.meta and 'loss_scaler' in runner.meta['fp16']:
            scaler_state_dict = runner.meta['fp16']['loss_scaler']
            self.loss_scaler.load_state_dict(scaler_state_dict)

    def copy_grads_to_fp32(self, fp16_net: nn.Module,
                           fp32_weights: Tensor) -> None:
        """Copy gradients from fp16 model to fp32 weight copy."""
        for fp32_param, fp16_param in zip(fp32_weights,
                                          fp16_net.parameters()):
            if fp16_param.grad is not None:
                if fp32_param.grad is None:
                    fp32_param.grad = fp32_param.data.new(
                        fp32_param.size())
                fp32_param.grad.copy_(fp16_param.grad)

    def copy_params_to_fp16(self, fp16_net: nn.Module,
                            fp32_weights: Tensor) -> None:
        """Copy updated params from fp32 weight copy to fp16 model."""
        for fp16_param, fp32_param in zip(fp16_net.parameters(),
                                          fp32_weights):
            fp16_param.data.copy_(fp32_param.data)

    def after_train_iter(self, runner) -> None:
        """Backward optimization steps for Mixed Precision Training. For
        dynamic loss scaling, please refer to
        https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler.

        1. Scale the loss by a scale factor.
        2. Backward the loss to obtain the gradients.
        3. Unscale the optimizer’s gradient tensors.
        4. Call optimizer.step() and update scale factor.
        5. Save loss_scaler state_dict for resume purpose.
        """
        # clear grads of last iteration
        runner.model.zero_grad()
        runner.optimizer.zero_grad()

        self.loss_scaler.scale(runner.outputs['loss']).backward()
        self.loss_scaler.unscale_(runner.optimizer)
        # grad clip
        if self.grad_clip is not None:
            grad_norm = self.clip_grads(runner.model.parameters())
            if grad_norm is not None:
                # Add grad norm to the logger
                runner.log_buffer.update({'grad_norm': float(grad_norm)},
                                         runner.outputs['num_samples'])
        # backward and update scaler
        self.loss_scaler.step(runner.optimizer)
        self.loss_scaler.update(self._scale_update_param)

        # save state_dict of loss_scaler
        runner.meta.setdefault(
            'fp16', {})['loss_scaler'] = self.loss_scaler.state_dict()

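Main logic
Initialization

Accepts five arguments: grad_clip, coalesce, bucket_size_mb, loss_scale, and distributed. All except loss_scale are stored as instance attributes; in the code shown here, coalesce, bucket_size_mb, and distributed are not used afterwards.
Creates a GradScaler according to the type of loss_scale.

before_run method

Preparation steps before mixed precision training starts.
Wraps the model for FP16 with wrap_fp16_model.
When resuming, restores the loss_scaler state from the state dict stored in runner.meta.

copy_grads_to_fp32 method

Copies the gradients of the FP16 model to the FP32 weight copy.

copy_params_to_fp16 method

Copies the updated parameters from the FP32 weight copy back to the FP16 model.

after_train_iter method

Called after every training iteration.
Clears the gradients of the previous iteration.
Scales the loss and runs backpropagation to compute the gradients.
Unscales the optimizer's gradient tensors.
If gradient clipping is enabled, calls clip_grads and logs the returned gradient norm.
Calls optimizer.step() through the scaler and updates the scale factor.
Saves the loss_scaler state dict in runner.meta so that it can be restored on resume.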
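Why scale the loss at all? The sketch below illustrates the underflow problem with the standard library only, using the half-precision format code 'e' of the struct module to round values to FP16. The gradient value 1e-8 is hypothetical, and the example models only the rounding, not the actual GradScaler machinery.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('e', struct.pack('e', value))[0]


LOSS_SCALE = 512.0   # the hook's default static scale
true_grad = 1e-8     # hypothetical gradient, below the FP16 subnormal range

# Without scaling, the gradient is rounded to zero when stored in FP16.
unscaled = to_fp16(true_grad)

# Scaling the loss scales every gradient by the same factor, so the value
# survives in FP16; dividing by the scale afterwards (in higher precision)
# recovers an approximation of the true gradient.
recovered = to_fp16(true_grad * LOSS_SCALE) / LOSS_SCALE

print(unscaled, recovered)
```

This corresponds to steps 1 to 3 of the after_train_iter docstring: scale the loss, backpropagate, then unscale the gradients before clipping and the optimizer step.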
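Summary
Fp16OptimizerHook supports mixed precision training by delegating to PyTorch's torch.cuda.amp module, which can speed up training considerably and reduce GPU memory usage. It keeps the gradient clipping of OptimizerHook. Note that the after_train_iter shown above does not call detect_anomalous_parameters, so anomalous-parameter detection is not run by this hook.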

# fp16 settings
fp16 = dict(loss_scale=512.)

You can also set fp16 = dict(loss_scale='dynamic') to enable dynamic loss scaling.
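Because loss_scale also accepts a dict of GradScaler arguments, the same config field can presumably carry one. The fragment below is a sketch that reuses the values from the Fp16OptimizerHook docstring example; it assumes that the fp16 dict is forwarded to Fp16OptimizerHook as keyword arguments, which this article does not show.

```python
# fp16 settings with explicit GradScaler arguments
# (values copied from the docstring example, not tuned recommendations)
fp16 = dict(
    loss_scale=dict(
        init_scale=65536.0,
        growth_factor=2.0,
        backoff_factor=0.5,
        growth_interval=2000,
    ))
```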

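Customize optimizers supported by PyTorch

All optimizers implemented by PyTorch are already supported; the only change needed is the optimizer field of the config file. For example, to use Adam (note that performance may drop considerably), the change looks like this: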

optimizer = dict(type='Adam', lr=0.0003, weight_decay=0.0001)

To change the model's learning rate, simply modify lr in the optimizer config. The other arguments can be set directly according to PyTorch's API documentation.
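Conceptually, the type field names a class in torch.optim and the remaining fields become its constructor arguments. The helper below is a simplified stand-in written for illustration (build_torch_optimizer is a hypothetical name); the real lookup goes through the optimizer registry and an optimizer constructor rather than getattr.

```python
import torch
from torch import nn


def build_torch_optimizer(model, optimizer_cfg):
    """Hypothetical helper: instantiate a torch.optim class from a config."""
    cfg = dict(optimizer_cfg)  # copy so the config itself is not mutated
    optim_cls = getattr(torch.optim, cfg.pop('type'))
    return optim_cls(model.parameters(), **cfg)


model = nn.Linear(4, 2)
optimizer = build_torch_optimizer(
    model, dict(type='Adam', lr=0.0003, weight_decay=0.0001))
print(type(optimizer).__name__, optimizer.defaults['lr'])
```

Any keyword accepted by the chosen torch.optim class (for Adam, e.g. betas or eps) can be added to the config dict in the same way.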

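Customize self-implemented optimizers

1. Define a new optimizer

A custom optimizer can be defined as follows. Suppose you want to add an optimizer named MyOptimizer with arguments a, b, and c. Create a new directory named mmdet/core/optimizer, then implement the new optimizer in a file, e.g. mmdet/core/optimizer/my_optimizer.py: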

from torch.optim import Optimizer

from .registry import OPTIMIZERS


@OPTIMIZERS.register_module()
class MyOptimizer(Optimizer):

    def __init__(self, a, b, c):
        # Implement the optimizer here.
        ...

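2. Add the optimizer to the registry

For the module defined above to be found, it must first be imported into the main namespace. There are two ways to do this.
(1) Modify mmdet/core/optimizer/__init__.py to import it. The newly defined module should be imported in mmdet/core/optimizer/__init__.py so that the registry finds the new module and adds it: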

from .my_optimizer import MyOptimizer

(2) Use custom_imports in the config to manually import it

custom_imports = dict(imports=['mmdet.core.optimizer.my_optimizer'], allow_failed_imports=False)

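The module mmdet.core.optimizer.my_optimizer is imported at the beginning of the program, and the class MyOptimizer is then registered automatically. Note that only the package containing the class MyOptimizer should be imported; mmdet.core.optimizer.my_optimizer.MyOptimizer cannot be imported directly.
In fact, with this import method users can use a completely different directory structure, as long as the module root can be located in PYTHONPATH.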
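The reason the import matters is that registration is a side effect of executing the decorator. The toy Registry below is a sketch written for this article, not mmcv's implementation; it only shows that a class can be built from a config once the code defining it has run.

```python
class Registry:
    """Toy stand-in for a registry; not mmcv's implementation."""

    def __init__(self, name):
        self.name = name
        self._modules = {}

    def register_module(self):
        def decorator(cls):
            # Runs when the class definition is executed, i.e. at import.
            self._modules[cls.__name__] = cls
            return cls
        return decorator

    def build(self, cfg):
        args = dict(cfg)
        type_name = args.pop('type')
        if type_name not in self._modules:
            raise KeyError(f'{type_name} is not in the {self.name} registry')
        return self._modules[type_name](**args)


OPTIMIZERS = Registry('optimizer')
cfg = dict(type='MyOptimizer', a=1, b=2, c=3)

# Before the defining code has run, the name is unknown to the registry.
try:
    OPTIMIZERS.build(cfg)
    found_before = True
except KeyError:
    found_before = False


@OPTIMIZERS.register_module()
class MyOptimizer:
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c


# After the class definition has executed, the same config can be built.
optimizer = OPTIMIZERS.build(cfg)
```

Importing the module in __init__.py or listing it in custom_imports plays the role of "executing the class definition" in this sketch.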

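3. Specify the optimizer in the config file

You can then use MyOptimizer in the optimizer field of the config file. In the configs, the optimizer is defined by the optimizer field, as follows: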

optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)

To use your own optimizer, the field can be changed to

optimizer = dict(type='MyOptimizer', a=a_value, b=b_value, c=c_value)

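Customize optimizer constructor

Some models need parameter-specific optimization settings, e.g. weight decay for BatchNorm layers. Such fine-grained parameter tuning can be done with a custom optimizer constructor.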
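As an illustration of the BatchNorm case, the sketch below builds an optimizer whose normalization-layer parameters get zero weight decay. NoNormDecayConstructor is a hypothetical class that follows the (optimizer_cfg, paramwise_cfg) / __call__(model) interface but is not registered with mmcv; in a real project it would carry the OPTIMIZER_BUILDERS.register_module() decorator.

```python
import torch
from torch import nn

NORM_TYPES = (nn.modules.batchnorm._BatchNorm, nn.GroupNorm, nn.LayerNorm)


class NoNormDecayConstructor:
    """Hypothetical constructor: disables weight decay for norm layers."""

    def __init__(self, optimizer_cfg, paramwise_cfg=None):
        self.optimizer_cfg = optimizer_cfg
        self.paramwise_cfg = paramwise_cfg or {}

    def __call__(self, model):
        norm_params, other_params = [], []
        for module in model.modules():
            # recurse=False pairs each parameter with its owning module.
            for param in module.parameters(recurse=False):
                if not param.requires_grad:
                    continue
                if isinstance(module, NORM_TYPES):
                    norm_params.append(param)
                else:
                    other_params.append(param)

        cfg = dict(self.optimizer_cfg)  # copy before popping 'type'
        optim_cls = getattr(torch.optim, cfg.pop('type'))
        param_groups = [
            dict(params=other_params),                   # default settings
            dict(params=norm_params, weight_decay=0.0),  # no decay
        ]
        return optim_cls(param_groups, **cfg)


model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))
constructor = NoNormDecayConstructor(
    dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001))
optimizer = constructor(model)
decays = [group['weight_decay'] for group in optimizer.param_groups]
print(decays)  # [0.0001, 0.0]
```

The same effect is available from the default constructor through paramwise_cfg=dict(norm_decay_mult=0.), so a custom constructor is only needed for rules that the default one cannot express.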

from mmcv.utils import build_from_cfg
from mmcv.runner.optimizer import OPTIMIZER_BUILDERS, OPTIMIZERS
from mmdet.utils import get_root_logger
from .my_optimizer import MyOptimizer


@OPTIMIZER_BUILDERS.register_module()
class MyOptimizerConstructor(object):

    def __init__(self, optimizer_cfg, paramwise_cfg=None):
        # Store or validate the configs here.
        pass

    def __call__(self, model):
        # Build the optimizer for `model` here, then return it.
        return my_optimizer

The default optimizer constructor is implemented in mmcv/mmcv/runner/optimizer/default_constructor.py; it can also serve as a template for new optimizer constructors.

@OPTIMIZER_BUILDERS.register_module()
class DefaultOptimizerConstructor:
    """Default constructor for optimizers.

    By default each parameter share the same optimizer settings, and we
    provide an argument ``paramwise_cfg`` to specify parameter-wise settings.
    It is a dict and may contain the following fields:

    - ``custom_keys`` (dict): Specified parameters-wise settings by keys. If
      one of the keys in ``custom_keys`` is a substring of the name of one
      parameter, then the setting of the parameter will be specified by
      ``custom_keys[key]`` and other setting like ``bias_lr_mult`` etc. will
      be ignored. It should be noted that the aforementioned ``key`` is the
      longest key that is a substring of the name of the parameter. If there
      are multiple matched keys with the same length, then the key with lower
      alphabet order will be chosen.
      ``custom_keys[key]`` should be a dict and may contain fields ``lr_mult``
      and ``decay_mult``. See Example 2 below.
    - ``bias_lr_mult`` (float): It will be multiplied to the learning
      rate for all bias parameters (except for those in normalization
      layers).
    - ``bias_decay_mult`` (float): It will be multiplied to the weight
      decay for all bias parameters (except for those in
      normalization layers and depthwise conv layers).
    - ``norm_decay_mult`` (float): It will be multiplied to the weight
      decay for all weight and bias parameters of normalization
      layers.
    - ``dwconv_decay_mult`` (float): It will be multiplied to the weight
      decay for all weight and bias parameters of depthwise conv
      layers.
    - ``bypass_duplicate`` (bool): If true, the duplicate parameters
      would not be added into optimizer. Default: False.

    Args:
        model (:obj:`nn.Module`): The model with parameters to be optimized.
        optimizer_cfg (dict): The config dict of the optimizer.
            Positional fields are

                - `type`: class name of the optimizer.

            Optional fields are

                - any arguments of the corresponding optimizer type, e.g.,
                  lr, weight_decay, momentum, etc.
        paramwise_cfg (dict, optional): Parameter-wise options.

    Example 1:
        >>> model = torch.nn.modules.Conv1d(1, 1, 1)
        >>> optimizer_cfg = dict(type='SGD', lr=0.01, momentum=0.9,
        >>>                      weight_decay=0.0001)
        >>> paramwise_cfg = dict(norm_decay_mult=0.)
        >>> optim_builder = DefaultOptimizerConstructor(
        >>>     optimizer_cfg, paramwise_cfg)
        >>> optimizer = optim_builder(model)

    Example 2:
        >>> # assume model have attribute model.backbone and model.cls_head
        >>> optimizer_cfg = dict(type='SGD', lr=0.01, weight_decay=0.95)
        >>> paramwise_cfg = dict(custom_keys={
        ...     '.backbone': dict(lr_mult=0.1, decay_mult=0.9)})
        >>> optim_builder = DefaultOptimizerConstructor(
        >>>     optimizer_cfg, paramwise_cfg)
        >>> optimizer = optim_builder(model)
        >>> # Then the `lr` and `weight_decay` for model.backbone is
        >>> # (0.01 * 0.1, 0.95 * 0.9). `lr` and `weight_decay` for
        >>> # model.cls_head is (0.01, 0.95).
    """

    def __init__(self, optimizer_cfg, paramwise_cfg=None):
        if not isinstance(optimizer_cfg, dict):
            raise TypeError('optimizer_cfg should be a dict',
                            f'but got {type(optimizer_cfg)}')
        self.optimizer_cfg = optimizer_cfg
        self.paramwise_cfg = {} if paramwise_cfg is None else paramwise_cfg
        self.base_lr = optimizer_cfg.get('lr', None)
        self.base_wd = optimizer_cfg.get('weight_decay', None)
        self._validate_cfg()
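The custom_keys matching rule in the docstring (longest matching key wins; ties go to the key that sorts first alphabetically) can be sketched in plain Python. resolve_param_settings is a hypothetical helper, and the parameter names and keys below are invented for the example; only the base values 0.01 and 0.95 and the multipliers 0.1 and 0.9 come from Example 2.

```python
def resolve_param_settings(param_name, base_lr, base_wd, custom_keys):
    """Return (matched_key, lr, weight_decay) for one parameter name."""
    # Sort alphabetically first, then by length (longest first). Python's
    # sort is stable, so equal-length keys stay in alphabetical order.
    sorted_keys = sorted(sorted(custom_keys), key=len, reverse=True)
    for key in sorted_keys:
        if key in param_name:  # substring match
            mult = custom_keys[key]
            return (key,
                    base_lr * mult.get('lr_mult', 1.0),
                    base_wd * mult.get('decay_mult', 1.0))
    return None, base_lr, base_wd


custom_keys = {
    'backbone': dict(lr_mult=0.1, decay_mult=0.9),
    'backbone.layer0': dict(lr_mult=0.0),  # hypothetical: freeze first stage
}

first = resolve_param_settings(
    'backbone.layer0.conv.weight', 0.01, 0.95, custom_keys)
second = resolve_param_settings(
    'backbone.layer1.conv.weight', 0.01, 0.95, custom_keys)
head = resolve_param_settings('cls_head.fc.weight', 0.01, 0.95, custom_keys)

# Tie-break: 'conv' and 'head' have the same length and both match.
tie = resolve_param_settings(
    'head.conv.weight', 1.0, 1.0,
    {'head': dict(lr_mult=2.0), 'conv': dict(lr_mult=3.0)})
```

Here first matches the longer key 'backbone.layer0', second falls back to 'backbone' and gets (0.01 * 0.1, 0.95 * 0.9), head keeps the base values, and tie resolves to 'conv'.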

Additional settings

Use gradient clipping to stabilize training:

Some models need gradient clipping to stabilize the training process. An example:

optimizer_config = dict(_delete_=True, grad_clip=dict(max_norm=35, norm_type=2))

If your config inherits a base config that already sets optimizer_config, you may need _delete_=True to override the unnecessary settings.
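The effect of _delete_=True can be illustrated with a simplified merge function. This is a sketch of the inheritance idea only, not mmcv's Config code; merge_cfg and the base config contents (HypotheticalHook, extra_option) are made up for the example.

```python
def merge_cfg(child, base):
    """Simplified config inheritance: child values override base values."""
    merged = dict(base)
    for key, value in child.items():
        if isinstance(value, dict):
            value = dict(value)  # copy so popping does not mutate the input
            delete = value.pop('_delete_', False)
            if not delete and isinstance(merged.get(key), dict):
                # Default behaviour: merge key by key into the base dict.
                merged[key] = merge_cfg(value, merged[key])
                continue
        # With _delete_=True (or for non-dict values): replace entirely.
        merged[key] = value
    return merged


# Hypothetical base config with settings the child does not want to keep.
base = dict(optimizer_config=dict(type='HypotheticalHook', extra_option=1))

grad_clip = dict(max_norm=35, norm_type=2)
without_delete = merge_cfg(
    dict(optimizer_config=dict(grad_clip=grad_clip)), base)
with_delete = merge_cfg(
    dict(optimizer_config=dict(_delete_=True, grad_clip=grad_clip)), base)

print(without_delete['optimizer_config'])  # base keys are still present
print(with_delete['optimizer_config'])     # only grad_clip remains
```

Without the flag the inherited keys survive the merge; with it the child dict replaces the inherited one wholesale.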

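Use momentum scheduling to accelerate model convergence:

Momentum scheduling is supported: it adjusts the model's momentum according to the learning rate so that the model converges faster. A momentum scheduler is usually used together with an LR scheduler; for example, the following config is used in 3D detection to accelerate convergence. For more details, see the implementations of CyclicLrUpdater and CyclicMomentumUpdater.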

lr_config = dict(
    policy='cyclic',
    target_ratio=(10, 1e-4),
    cyclic_times=1,
    step_ratio_up=0.4,
)
momentum_config = dict(
    policy='cyclic',
    target_ratio=(0.85 / 0.95, 1),
    cyclic_times=1,
    step_ratio_up=0.4,
)
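To get a feeling for the numbers in this config, the sketch below computes a one-cycle schedule by hand. It assumes cosine annealing between the phase endpoints and a single cycle over the whole run; the base learning rate 0.001, base momentum 0.95, and 1000 iterations are hypothetical, and the actual hooks offer more options than this simplified version.

```python
import math


def annealing_cos(start, end, factor):
    """Cosine interpolation from start (factor=0) to end (factor=1)."""
    return end + 0.5 * (start - end) * (math.cos(math.pi * factor) + 1)


def cyclic_value(base, target_ratio, step_ratio_up, cur_iter, max_iters):
    """Value of a one-cycle schedule at iteration cur_iter."""
    up_iters = int(step_ratio_up * max_iters)
    peak, final = base * target_ratio[0], base * target_ratio[1]
    if cur_iter < up_iters:
        # Phase 1: move from the base value to base * target_ratio[0].
        return annealing_cos(base, peak, cur_iter / up_iters)
    # Phase 2: move on to base * target_ratio[1].
    progress = (cur_iter - up_iters) / (max_iters - up_iters)
    return annealing_cos(peak, final, progress)


MAX_ITERS = 1000                     # hypothetical run length
BASE_LR, BASE_MOMENTUM = 0.001, 0.95  # hypothetical base values


def lr_at(i):
    return cyclic_value(BASE_LR, (10, 1e-4), 0.4, i, MAX_ITERS)


def momentum_at(i):
    return cyclic_value(BASE_MOMENTUM, (0.85 / 0.95, 1), 0.4, i, MAX_ITERS)


for i in (0, 400, 1000):
    print(i, lr_at(i), momentum_at(i))
```

Under these assumptions the learning rate rises to ten times its base value over the first 40% of the iterations and then decays to 1e-4 of the base, while the momentum moves the opposite way, dipping from 0.95 to about 0.85 at the learning-rate peak and returning to 0.95.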