apply in PyTorch

Abstract

nn.Module[List].apply(callable)
Tensor.apply_(callable) → Tensor
Function.apply(Tensor...)

nn.Module[List].apply(callable)

Source code:

def apply(self: T, fn: Callable[['Module'], None]) -> T:
    """Typical use includes initializing the parameters of a model

    Args:
        fn: function to be applied to each submodule

    Returns:
        self
    """
    for module in self.children():
        module.apply(fn)  # the submodules are processed first
    fn(self)  # the root module itself comes last
    return self

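nn.ModuleList is PyTorch's container for holding submodules. It inherits apply() from nn.Module, so the method behaves the same way on any module: it recursively applies the given function to every submodule, to each submodule's own submodules, and finally to the module itself. The call looks like this: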

nn.ModuleList.apply(fn)

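Here fn is the function to apply; it takes a single Module argument and returns nothing. Calling apply() walks through every submodule in the ModuleList and calls fn on each one.

For example, suppose a ModuleList holds several linear layers (Linear) and we want to initialize all of their weights to 0. We can use apply():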

import torch
import torch.nn as nn

# Create a ModuleList containing two linear layers
module_list = nn.ModuleList([nn.Linear(10, 5), nn.Linear(5, 2)])

# Define a function that initializes weights to 0
def init_weights(module):
    if isinstance(module, nn.Linear):
        module.weight.data.fill_(0)

# Apply the function to every submodule of the ModuleList
module_list.apply(init_weights)

# Print the weight of each linear layer
for module in module_list:
    print(module.weight)

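In this example we define init_weights, which sets the weight of any nn.Linear module it receives to 0. apply() then calls it on every linear layer in the ModuleList (and on the ModuleList itself, where the isinstance check makes it a no-op), and finally we print each layer's weight.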
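The traversal order described in the source code (children first, root last) can be observed directly. The following is a minimal sketch; the nested nn.Sequential model and the record helper are made up for illustration and are not part of the original example.

```python
import torch.nn as nn

# A small nested model (hypothetical, for illustration only)
net = nn.Sequential(
    nn.Linear(4, 3),
    nn.Sequential(nn.ReLU(), nn.Linear(3, 2)),
)

visited = []

def record(module):
    # Remember the class name of each module in the order apply() visits it
    visited.append(type(module).__name__)

returned = net.apply(record)

print(visited)
# ['Linear', 'ReLU', 'Linear', 'Sequential', 'Sequential']
print(returned is net)
# True: apply() returns the module it was called on
```

Each container appears only after all of its children, which is why an initialization function passed to apply() sees the leaf layers before the modules that wrap them.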

Tensor.apply_(callable) → Tensor

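Applies callable to every element of the tensor. The operation is in-place: it modifies the tensor and returns that same tensor rather than a new one.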

import torch

def add(x):
    return x + 1

a = torch.randn(2, 3)
print(a)
# tensor([[-1.6572, -0.7502, -0.9984],
#         [ 0.3035, -0.6085, -0.1091]])

b = a.apply_(add)
print(a)
print(b)
# tensor([[-0.6572,  0.2498,  0.0016],
#         [ 1.3035,  0.3915,  0.8909]])
# tensor([[-0.6572,  0.2498,  0.0016],
#         [ 1.3035,  0.3915,  0.8909]])

print(b is a)
# True: a.apply_(add) returns a itself rather than a new tensor, i.e. it is in-place

NOTE
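apply_ only works on CPU tensors and should not be used in code that needs to be fast. That is the official guidance; the likely reason is that the Python callable is invoked once per element, so nothing is vectorized.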

a = torch.randn(2, 3, device='cuda:0')
a.apply_(lambda x: x + 1)
# TypeError: apply_ is only implemented on CPU tensors

NOTE
There does not seem to be an out-of-place counterpart.

a.apply(lambda x: x + 1)
# AttributeError: 'Tensor' object has no attribute 'apply'. Did you mean: 'apply_'?
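Since there is no Tensor.apply, one way to get an out-of-place result is to clone the tensor first and call apply_ on the copy. The sketch below shows that workaround next to the vectorized expression, which is usually preferable when the function can be written with tensor operations. The values are chosen by hand for illustration.

```python
import torch

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

# Out-of-place workaround: copy first, then modify the copy in place
b = a.clone().apply_(lambda x: x + 1)

# Vectorized equivalent, which avoids the per-element Python call
c = a + 1

print(a)
# tensor([[1., 2.],
#         [3., 4.]])
print(b)
# tensor([[2., 3.],
#         [4., 5.]])
print(torch.equal(b, c))
# True
```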

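Function.apply(Tensor...)

Both apply functions above are called on an object (a Module or a Tensor) and take a Callable as their argument. Function.apply(Tensor...) is different: it is called on a Function, takes tensors as arguments, and serves to "run forward". First, here is how ReLU can define its own derivative: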

import torch
from torch import autograd

class CustomReLUFunction(autograd.Function):
    @staticmethod
    def forward(ctx, *args, **kwargs):
        x = args[0]
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, *grad_outputs):
        x, = ctx.saved_tensors
        grad_output = grad_outputs[0]
        grad_input = grad_output.clone()  # clone so the incoming grad_output is not modified in place
        grad_input[x < 0] = 0
        return grad_input

# Use the custom ReLU activation
custom_relu = CustomReLUFunction.apply  # note the apply here
a = torch.randn(5, requires_grad=True)
output = CustomReLUFunction.apply(a)
output.backward(torch.ones_like(a))

print(a)
print(output)
print(a.grad)
#########################
tensor([-1.8688, -0.0540, -0.6364, -0.9364,  1.2601], requires_grad=True)
tensor([0.0000, 0.0000, 0.0000, 0.0000, 1.2601], grad_fn=<CustomReLUFunctionBackward>)
tensor([0., 0., 0., 0., 1.])

Sure enough, apply shows up in the code. To understand it, we need to look at torch.autograd.

Extending torch.autograd

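PyTorch implements automatic differentiation with a dynamic computation graph. Roughly speaking, the Tensors are the nodes of the graph and the edges connecting them are objects called Function. Most built-in PyTorch operations can be differentiated automatically, which is why optimizing a model takes only three lines of code: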

optimizer.zero_grad()
loss.backward()
optimizer.step()

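These three lines are enough to carry out a gradient descent step. But what if an operation is not differentiable? Examples include computing certain integrals, the fairly simple ReLU function (which is not differentiable at 0), or a computation whose trainable part goes through NumPy or another library. In those cases we have to implement the derivative ourselves. The way to do it is to subclass torch.autograd.Function and implement three of its methods: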

def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any
def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any)
def backward(ctx: Any, *grad_outputs: Any) -> Any

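The operation is then exposed through Function.apply. CustomReLUFunction above does this, but in the old style; the new style (pytorch>=2.0) recommends using these three methods. Here is the official example first: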

from torch import autograd

class LinearFunction(autograd.Function):
    # Note that forward, setup_context, and backward are @staticmethods
    @staticmethod
    def forward(input, weight, bias):
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    @staticmethod
    # inputs is a Tuple of all of the inputs passed to forward.
    # output is the output of the forward().
    def setup_context(ctx, inputs, output):  # output is not used here
        input, weight, bias = inputs
        ctx.save_for_backward(input, weight, bias)

    # This function has only a single output, so it gets only one gradient
    @staticmethod
    def backward(ctx, grad_output):
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None

        # These needs_input_grad checks are optional and there only to
        # improve efficiency. If you want to make your code simpler, you can
        # skip them. Returning gradients for inputs that don't require it is
        # not an error.
        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output.sum(0)

        return grad_input, grad_weight, grad_bias

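After that, the operation is run with LinearFunction.apply(input, weight, bias) rather than by calling forward directly. apply executes forward and, through setup_context, stores the state of the computation (the inputs and so on) in the ctx object for backward to use during backpropagation.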
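A hand-written backward is easy to get wrong, so it can help to check it numerically. The sketch below repeats the LinearFunction definition so that it runs on its own, compares its output with torch.nn.functional.linear, and verifies the gradients with torch.autograd.gradcheck. It assumes pytorch>=2.0 (for setup_context); the shapes, seed, and tolerances are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F
from torch.autograd import Function, gradcheck

class LinearFunction(Function):
    @staticmethod
    def forward(input, weight, bias):
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    @staticmethod
    def setup_context(ctx, inputs, output):
        input, weight, bias = inputs
        ctx.save_for_backward(input, weight, bias)

    @staticmethod
    def backward(ctx, grad_output):
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None
        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output.sum(0)
        return grad_input, grad_weight, grad_bias

torch.manual_seed(0)
# gradcheck compares against finite differences, so double precision is used
inputs = (
    torch.randn(5, 3, dtype=torch.double, requires_grad=True),  # input
    torch.randn(4, 3, dtype=torch.double, requires_grad=True),  # weight
    torch.randn(4, dtype=torch.double, requires_grad=True),     # bias
)

same_output = torch.allclose(LinearFunction.apply(*inputs), F.linear(*inputs))
grads_ok = gradcheck(LinearFunction.apply, inputs, eps=1e-6, atol=1e-4)

print(same_output)  # True
print(grads_ok)     # True
```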

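Differences between the old and new styles:
In the old style, def forward(ctx, *args, **kwargs) takes ctx as its first argument, and the context has to be saved inside forward;
In the new style, def forward(*args, **kwargs) receives only the inputs, and saving the context is left to setup_context(ctx, inputs, output);
Either way, the caller of apply does not need to care about the difference.
The new style is recommended because it is closer to how PyTorch's built-in operators work and is therefore more compatible with the rest of PyTorch.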
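To make the difference concrete, here is a sketch of the earlier ReLU example rewritten in the new style. The class name ReLUNewStyle and the input values are chosen for illustration; the code assumes pytorch>=2.0.

```python
import torch
from torch.autograd import Function

class ReLUNewStyle(Function):
    @staticmethod
    def forward(x):
        # No ctx here: forward only computes the result
        return x.clamp(min=0)

    @staticmethod
    def setup_context(ctx, inputs, output):
        # Saving state for backward happens here instead of in forward
        (x,) = inputs
        ctx.save_for_backward(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[x < 0] = 0
        return grad_input

a = torch.tensor([-1.5, 0.5, 2.0], requires_grad=True)
out = ReLUNewStyle.apply(a)
out.backward(torch.ones_like(a))

print(out.detach())
# tensor([0.0000, 0.5000, 2.0000])
print(a.grad)
# tensor([0., 1., 1.])
```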

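One thing to watch regarding argument counts: the numbers of arguments and return values of forward and backward mirror each other. The number of outputs of forward equals the number of gradient arguments of backward (not counting ctx), and the number of values returned by backward equals the number of arguments of forward. This makes sense: one pass runs forward and the other runs in reverse, and every tensor is paired with its gradient.

The gradient of every non-Tensor argument of forward must be None. It cannot be omitted, because the counts have to match.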

from torch.autograd import Function

class MulConstant(Function):
    @staticmethod
    def forward(tensor, constant):
        return tensor * constant

    @staticmethod
    def setup_context(ctx, inputs, output):
        # ctx is a context object that can be used to stash information
        # for backward computation
        tensor, constant = inputs
        ctx.constant = constant  # a non-Tensor is stored directly on ctx, not via save_for_backward

    @staticmethod
    def backward(ctx, grad_output):
        # We return as many input gradients as there were arguments.
        # Gradients of non-Tensor arguments to forward must be None.
        return grad_output * ctx.constant, None  # None is the gradient of constant

Note that non-tensors should be stored directly on ctx, as in ctx.constant = constant.
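The counting rule is easier to see with a Function that has two inputs and two outputs. The following sketch uses a made-up ScaleAndShift function (not from the official examples): forward takes a tensor and a float and returns two tensors, so backward receives two gradients and returns two values, the second being None for the float. It assumes pytorch>=2.0.

```python
import torch
from torch.autograd import Function

class ScaleAndShift(Function):
    @staticmethod
    def forward(x, k):
        # Two inputs (Tensor, float), two outputs
        return x * k, x + k

    @staticmethod
    def setup_context(ctx, inputs, output):
        _, k = inputs
        ctx.k = k  # non-Tensor: stored directly on ctx

    @staticmethod
    def backward(ctx, grad_scaled, grad_shifted):
        # Two incoming gradients (one per output), two returned values (one per input)
        grad_x = grad_scaled * ctx.k + grad_shifted
        return grad_x, None  # None for the non-Tensor argument k

x = torch.tensor([1.0, 2.0], requires_grad=True)
scaled, shifted = ScaleAndShift.apply(x, 3.0)
loss = scaled.sum() + 2 * shifted.sum()
loss.backward()

print(x.grad)
# tensor([5., 5.])  (3 from the scaled branch plus 2 from the shifted branch)
```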

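set_materialize_grads(False) tells the autograd engine not to materialize the gradients passed to backward: a gradient that is missing because the corresponding output was not used arrives as None instead of being expanded into a tensor of zeros. This saves work in such cases, but backward then has to handle None itself.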

class MulConstant(Function):
    @staticmethod
    def forward(tensor, constant):
        return tensor * constant

    @staticmethod
    def setup_context(ctx, inputs, output):
        tensor, constant = inputs
        ctx.set_materialize_grads(False)  # do not turn missing (None) gradients into zero tensors
        ctx.constant = constant

    @staticmethod
    def backward(ctx, grad_output):
        # Here we must handle None grad_output tensor. In this case we
        # can skip unnecessary computations and just return None.
        if grad_output is None:
            return None, None

        # We return as many input gradients as there were arguments.
        # Gradients of non-Tensor arguments to forward must be None.
        return grad_output * ctx.constant, None

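Here "materialize" means actually creating a tensor: by default, a missing gradient is turned into a real tensor filled with zeros before backward is called.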
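The effect is easiest to observe with a Function that has two outputs, only one of which takes part in the loss. The sketch below uses a hypothetical TwoOutputs function and a received list (both made up for illustration) to record whether each incoming gradient is None. It assumes pytorch>=2.0.

```python
import torch
from torch.autograd import Function

received = []  # records (grad_double is None, grad_triple is None) per backward call

class TwoOutputs(Function):
    @staticmethod
    def forward(x):
        return x * 2, x * 3

    @staticmethod
    def setup_context(ctx, inputs, output):
        ctx.set_materialize_grads(False)

    @staticmethod
    def backward(ctx, grad_double, grad_triple):
        received.append((grad_double is None, grad_triple is None))
        grad_x = None
        if grad_double is not None:
            grad_x = grad_double * 2
        if grad_triple is not None:
            contribution = grad_triple * 3
            grad_x = contribution if grad_x is None else grad_x + contribution
        return grad_x

x = torch.ones(2, requires_grad=True)
double, triple = TwoOutputs.apply(x)
double.sum().backward()  # triple never contributes to the result

print(received)
# [(False, True)]: the gradient for the unused output arrives as None
print(x.grad)
# tensor([2., 2.])
```

Without the set_materialize_grads(False) call, the second gradient would arrive as a tensor of zeros and the None checks would be unnecessary.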

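Understanding loss.backward()

Perhaps all you know is that loss.backward() computes gradients. Why, then, must a tensor with the same shape as the output be passed in when loss is not a scalar? And what happens to that tensor once it is passed in?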

import torch

x = torch.randn(2, 3, requires_grad=True)
y = torch.norm(x, dim=1)  # a vector with shape (2,)

y.retain_grad()
# The grad of y. Calling loss.backward() with no argument is really
# loss.backward(torch.tensor(1.0)), i.e. the grad of loss itself.
grad = torch.randn(2)
# backward runs the backward of y.grad_fn and propagates back through the graph by the chain rule
y.backward(grad)

print(grad)
print(y.grad_fn)
print(y.grad)
print(x.grad)

# %%
x = torch.randn(2, 3, requires_grad=True)
z = torch.norm(x)

z.retain_grad()
grad = torch.tensor(1.0)
z.backward(grad)  # for a scalar, this is what z.backward() does implicitly

print(z.grad_fn)
print(z.grad)
print(x.grad)

The tensor grad_of_xxx passed to xxx.backward(grad_of_xxx) is the grad of xxx itself, which the chain rule needs as its starting point. To see it, print *grad_output inside LinearFunction.backward:

    @staticmethod
    def backward(ctx, *grad_output):  # tensors were stored with save_for_backward, so backward still needs ctx, unlike forward
        print(grad_output)  # check what .backward(grad) passes in
        x, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None  # default to None, so inputs that need no gradient get None
        if ctx.needs_input_grad[0]:
            grad_input = grad_output[0].mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output[0].t().mm(x)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output[0].sum(0)
        return grad_input, grad_weight, grad_bias

The printed *grad_output:

linear = LinearFunction.apply
a = torch.randn(2, 3)
w = torch.randn(4, 3, requires_grad=True)
b = torch.randn(4, requires_grad=True)

ln = linear(a, w, b)
ln.backward(torch.ones(2, 4))
##################################
(tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.]]),)
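To make the role of the argument to backward concrete, the following sketch uses hand-picked numbers so the result can be checked by hand. For y = norm(x, dim=1), row i of the gradient is v[i] * x[i] / y[i], where v is the tensor passed to backward. Passing v is also equivalent to calling backward on the scalar (y * v).sum(). The variable names are chosen for illustration.

```python
import torch

x = torch.tensor([[3.0, 4.0], [6.0, 8.0]], requires_grad=True)
y = torch.norm(x, dim=1)          # tensor([ 5., 10.])
v = torch.tensor([1.0, 2.0])      # the "grad of y" fed into the chain rule

y.backward(v)

# Hand-derived result: v[i] * x[i] / ||x[i]|| for each row i
expected = v[:, None] * x.detach() / y.detach()[:, None]
print(x.grad)
# tensor([[0.6000, 0.8000],
#         [1.2000, 1.6000]])
print(torch.allclose(x.grad, expected))
# True

# Same gradient via an explicit scalar loss
x2 = x.detach().clone().requires_grad_(True)
(torch.norm(x2, dim=1) * v).sum().backward()
print(torch.allclose(x2.grad, x.grad))
# True
```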

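Summary
Exactly how LinearFunction.apply works internally is hard to tell, because a lot of source code is involved. What matters is that it does more than a direct call to forward: it also does the bookkeeping that backpropagation needs later.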
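One visible part of that extra work can be checked without reading the source. In the sketch below, a made-up Square function (new style, assuming pytorch>=2.0) counts how often its own backward runs. Going through apply uses the custom backward; calling the static forward directly just executes ordinary tensor operations, so autograd differentiates those instead and the custom backward is never invoked.

```python
import torch
from torch.autograd import Function

calls = []  # one entry per invocation of the custom backward

class Square(Function):
    @staticmethod
    def forward(x):
        return x * x

    @staticmethod
    def setup_context(ctx, inputs, output):
        (x,) = inputs
        ctx.save_for_backward(x)

    @staticmethod
    def backward(ctx, grad_output):
        calls.append("custom backward")
        (x,) = ctx.saved_tensors
        return 2 * x * grad_output

x = torch.tensor([1.0, 2.0], requires_grad=True)

via_apply = Square.apply(x)
via_apply.sum().backward()
calls_after_apply = len(calls)

direct = Square.forward(x)  # bypasses apply
direct.sum().backward()
calls_after_direct = len(calls)

print(calls_after_apply)   # 1
print(calls_after_direct)  # 1 (unchanged: the custom backward was not used)
print(x.grad)
# tensor([4., 8.])  (both backward passes accumulated 2 * x)
```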

Function.apply Q&A

How the old and new styles store parameters

Suppose I need to keep a value gamma in a Function. How is that done in each style?
Old style:

# Legacy style with instance methods; recent PyTorch versions no longer accept it
class F(torch.autograd.Function):
    def __init__(self, gamma=0.1):
        super().__init__()
        self.gamma = gamma

    def forward(self, args):
        pass

    def backward(self, args):
        pass
#################################
F(gamma)(inp)

New style:

class F_new(torch.autograd.Function):
    @staticmethod
    def forward(ctx, args, gamma):
        ctx.gamma = gamma
        pass

    @staticmethod
    def backward(ctx, args):
        pass
####################################
F_new.apply(inp, gamma)
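The skeleton above can be filled in as follows. This is only a sketch: ScaleGrad is a made-up function that passes its input through unchanged and multiplies the gradient by gamma on the way back. It is called twice with different gamma values to show that each call keeps its own value on its own ctx.

```python
import torch
from torch.autograd import Function

class ScaleGrad(Function):
    @staticmethod
    def forward(ctx, x, gamma):
        ctx.gamma = gamma  # non-Tensor value stored on this call's ctx
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # One return value per forward argument: x gets a gradient, gamma gets None
        return grad_output * ctx.gamma, None

x = torch.ones(3, requires_grad=True)
a = ScaleGrad.apply(x, 0.1)
b = ScaleGrad.apply(x, 10.0)
(a.sum() + b.sum()).backward()

print(x.grad)
# tensor([10.1000, 10.1000, 10.1000])  (0.1 from the first call plus 10.0 from the second)
```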

Q: Does every call to F.apply create a new "instance" with its own context?
A: Yes, every call to .apply gets a different context, so you can safely store anything in it without risk.
Q: Can I save intermediary results with a statement like ctx.intermediary = intermediary?
A: For intermediary results, you can save them as attributes of ctx.
Q: Why is save_for_backward needed? Is it just a convention, or does it perform extra checks?
I tried to save intermediary tensors with save_for_backward, but that failed, so I stored them as attributes on self (ctx now) instead.
A: Yes, save_for_backward is just for inputs and outputs, and it performs extra checks (to make sure that you don't create non-collectable cycles). For intermediary results, you can save them as attributes of the context. [I remember reading that the variables used to compute gradients must be inputs or outputs.]