Official documentation link

Profiling your PyTorch Module — PyTorch Tutorials 2.0.1+cu117 documentation

Profiling a PyTorch Module

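PyTorch includes a profiler API for identifying the time and memory costs of the various PyTorch operations in your code. The profiler is easy to integrate into existing code, and its results can be printed as a table or returned as a JSON trace file.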
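As a minimal sketch of the two output forms mentioned above, the snippet below profiles a tiny CPU-only workload, prints the tabular summary, and writes a JSON trace. The tensor size and the trace file name are illustrative assumptions, not part of the tutorial.

```python
import os
import tempfile

import torch
import torch.autograd.profiler as profiler

# Small CPU-only workload so the sketch runs without a GPU.
x = torch.rand(64, 64)

with profiler.profile() as prof:
    y = torch.mm(x, x)
    z = y.relu().sum()

# Tabular summary, aggregated by operator name.
table = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5)
print(table)

# JSON trace file; "trace.json" is a hypothetical name chosen for this sketch.
trace_path = os.path.join(tempfile.mkdtemp(), "trace.json")
prof.export_chrome_trace(trace_path)
```

The exported file can be loaded into a trace viewer such as chrome://tracing to inspect the timeline.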

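The profiler supports multithreaded models. It runs in the same thread as the operation being profiled, but it also profiles child operators that may run in another thread. Profilers running concurrently are scoped to their own threads so that their results do not mix.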

PyTorch 1.8 introduced a new API, torch.profiler, that will replace the older profiler API in future releases. See its documentation page for details.
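For orientation, a roughly equivalent use of the newer torch.profiler API might look like the following sketch. It runs on the CPU only, and the label "MATMUL BLOCK" and the tensor size are made up for illustration.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

x = torch.rand(64, 64)

# Only CPU activity is requested so the sketch does not need a GPU.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    with record_function("MATMUL BLOCK"):  # hypothetical label
        y = torch.mm(x, x)

# Collect the aggregated event names, then print the usual summary table.
event_names = [evt.key for evt in prof.key_averages()]
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```

The rest of this post keeps using the older torch.autograd.profiler interface, as the original tutorial does.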

For a quicker walkthrough of Profiler API usage, see the PyTorch Profiler recipe.

import torch
import numpy as np
from torch import nn
import torch.autograd.profiler as profiler

Performance debugging using the profiler

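The profiler can help identify performance bottlenecks in your models. In this example, we build a custom module that performs two subtasks:

a linear transformation on the input, and
use of the transformation result to get indices on a mask tensor.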

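We wrap the code for each subtask in a separate labelled context manager using profiler.record_function("label"). In the profiler output, the aggregate performance metrics of all operations in a subtask appear under its label.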
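To see the labelling mechanism in isolation, here is a small CPU-only sketch. TinyModule, its layer sizes, and the "REDUCE" label are hypothetical stand-ins chosen for this illustration; only record_function itself comes from the text above.

```python
import torch
from torch import nn
import torch.autograd.profiler as profiler


class TinyModule(nn.Module):  # hypothetical stand-in, not the tutorial's module
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 4)

    def forward(self, x):
        # Everything inside each block is reported under that block's label.
        with profiler.record_function("LINEAR PASS"):
            out = self.linear(x)
        with profiler.record_function("REDUCE"):
            total = out.sum()
        return total


model = TinyModule()
with profiler.profile() as prof:
    model(torch.rand(16, 8))

# The labels appear as rows next to the built-in aten:: operators.
labels = {evt.key for evt in prof.key_averages()}
print(sorted(name for name in labels if not name.startswith("aten::")))
```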

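Note that using the profiler incurs some overhead, so it is best used only when investigating code. Remember to remove it if you are benchmarking runtimes.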

class MyModule(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super(MyModule, self).__init__()
        self.linear = nn.Linear(in_features, out_features, bias)

    def forward(self, input, mask):
        with profiler.record_function("LINEAR PASS"):
            out = self.linear(input)

        with profiler.record_function("MASK INDICES"):
            threshold = out.sum(axis=1).mean().item()
            hi_idx = np.argwhere(mask.cpu().numpy() > threshold)
            hi_idx = torch.from_numpy(hi_idx).cuda()

        return out, hi_idx

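Profiling the forward pass

We initialize a random input tensor, a random mask tensor, and the model.

Before running the profiler, we warm up CUDA to ensure accurate performance benchmarking. We then wrap the forward pass of the module in the profiler.profile context manager. The with_stack=True parameter appends the file and line number of each operation to the trace.

with_stack=True incurs additional overhead and is better suited to investigating code. Remember to remove it if you are benchmarking performance.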

model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.double).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)
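Because both the profiler and with_stack=True add overhead, one option is to make profiling switchable rather than deleting it by hand. The helper below is only a sketch of that idea: the function run and its enable_profiling flag are hypothetical names, and it uses contextlib.nullcontext from the standard library as the "profiling off" branch.

```python
import contextlib

import torch
import torch.autograd.profiler as profiler


def run(fn, *args, enable_profiling=False):
    """Call fn(*args), optionally under the profiler.

    Returns (result, prof), where prof is None when profiling is disabled.
    """
    if enable_profiling:
        ctx = profiler.profile(with_stack=True)
    else:
        ctx = contextlib.nullcontext()  # does nothing and yields None
    with ctx as prof:
        result = fn(*args)
    return result, prof


# Same computation with and without the profiler attached.
plain_result, plain_prof = run(torch.sum, torch.ones(4))
profiled_result, profiled_prof = run(torch.sum, torch.ones(4), enable_profiling=True)
```

With the flag off, the call path contains no profiler code at all, which is what you want when benchmarking.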

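Printing the profiler results

Finally, we print the profiler results. profiler.key_averages aggregates the results by operator name and, optionally, by input shapes and/or stack trace events. Grouping by input shapes is useful for identifying which tensor shapes the model uses.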
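The effect of grouping by input shape can be seen in a small CPU-only sketch. The matrix sizes here are arbitrary assumptions; record_shapes=True is needed so that the profiler stores shapes in the first place.

```python
import torch
import torch.autograd.profiler as profiler

a = torch.rand(32, 16)
b = torch.rand(16, 8)
c = torch.rand(8, 4)

with profiler.profile(record_shapes=True) as prof:
    ab = torch.mm(a, b)    # inputs [32, 16] and [16, 8]
    abc = torch.mm(ab, c)  # inputs [32, 8] and [8, 4]

# Default grouping: one row per operator name.
by_name = prof.key_averages()
# Shape-aware grouping: one row per (operator name, input shapes) pair.
by_shape = prof.key_averages(group_by_input_shape=True)

mm_rows_by_name = [evt for evt in by_name if evt.key == "aten::mm"]
mm_rows_by_shape = [evt for evt in by_shape if evt.key == "aten::mm"]
print(len(mm_rows_by_name), len(mm_rows_by_shape))
print([evt.input_shapes for evt in mm_rows_by_shape])
```

The two mm calls collapse into a single row by name, but stay separate once shapes are part of the grouping key.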

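Here we use group_by_stack_n=5, which aggregates runtimes by the operation and its traceback (truncated to the most recent 5 events). By default the events are displayed in the order they were registered. The table can also be sorted by passing a sort_by argument, as the call below does with 'self_cpu_time_total' (see the docs for valid sorting keys).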
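As a sketch of how the sort key changes the view, the snippet below renders the same CPU-only profile twice. "cpu_time_total" and "self_cpu_memory_usage" are two of the documented sort keys; the workload itself is an arbitrary assumption.

```python
import torch
import torch.autograd.profiler as profiler

x = torch.rand(256, 256)

# profile_memory=True so that the memory columns are populated.
with profiler.profile(profile_memory=True) as prof:
    y = torch.mm(x, x)
    z = y.relu()

averages = prof.key_averages()

# Same data, two orderings: total CPU time versus self CPU memory.
by_total_time = averages.table(sort_by="cpu_time_total", row_limit=3)
by_memory = averages.table(sort_by="self_cpu_memory_usage", row_limit=3)
print(by_total_time)
print(by_memory)
```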

When running the profiler in a notebook, you may see entries like <ipython-input-18-193a910735e8>(13): forward in the stack trace instead of filenames. These correspond to <notebook-cell>(line number): calling-function.

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

"""
(Some columns are omitted)

-------------  ------------  ------------  ------------  ---------------------------------
         Name    Self CPU %      Self CPU  Self CPU Mem   Source Location
-------------  ------------  ------------  ------------  ---------------------------------
 MASK INDICES        87.88%        5.212s    -953.67 Mb  /mnt/xarfuse/.../torch/au
                                                         <ipython-input-...>(10): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/

  aten::copy_        12.07%     715.848ms           0 b  <ipython-input-...>(12): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/
                                                         /mnt/xarfuse/.../IPython/

  LINEAR PASS         0.01%     350.151us         -20 b  /mnt/xarfuse/.../torch/au
                                                         <ipython-input-...>(7): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/

  aten::addmm         0.00%     293.342us           0 b  /mnt/xarfuse/.../torch/nn
                                                         /mnt/xarfuse/.../torch/nn
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(8): forward
                                                         /mnt/xarfuse/.../torch/nn

   aten::mean         0.00%     235.095us           0 b  <ipython-input-...>(11): forward
                                                         /mnt/xarfuse/.../torch/nn
                                                         <ipython-input-...>(9): <module>
                                                         /mnt/xarfuse/.../IPython/
                                                         /mnt/xarfuse/.../IPython/

-----------------------------  ------------  ---------- ----------------------------------
Self CPU time total: 5.931s
"""

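Improving memory performance

Note that the most expensive operations, in terms of both memory and time, are at forward (10), which represents the operations within MASK INDICES. Let's tackle the memory consumption first. We can see that the .to() operation at line 12 (the mask.cpu() call in our code) consumes 953.67 Mb. This operation copies mask to the CPU. mask is initialized with the torch.double datatype. Can we reduce the memory footprint by casting it to torch.float instead?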

model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.float).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

"""
(Some columns are omitted)

-----------------  ------------  ------------  ------------  --------------------------------
             Name    Self CPU %      Self CPU  Self CPU Mem   Source Location
-----------------  ------------  ------------  ------------  --------------------------------
     MASK INDICES        93.61%        5.006s    -476.84 Mb  /mnt/xarfuse/.../torch/au
                                                             <ipython-input-...>(10): forward
                                                             /mnt/xarfuse/  /torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/

      aten::copy_         6.34%     338.759ms           0 b  <ipython-input-...>(12): forward
                                                             /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/
                                                             /mnt/xarfuse/.../IPython/

 aten::as_strided         0.01%     281.808us           0 b  <ipython-input-...>(11): forward
                                                             /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/
                                                             /mnt/xarfuse/.../IPython/

      aten::addmm         0.01%     275.721us           0 b  /mnt/xarfuse/.../torch/nn
                                                             /mnt/xarfuse/.../torch/nn
                                                             /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(8): forward
                                                             /mnt/xarfuse/.../torch/nn

     aten::_local        0.01%     268.650us           0 b  <ipython-input-...>(11): forward
    _scalar_dense                                            /mnt/xarfuse/.../torch/nn
                                                             <ipython-input-...>(9): <module>
                                                             /mnt/xarfuse/.../IPython/
                                                             /mnt/xarfuse/.../IPython/

-----------------  ------------  ------------  ------------  --------------------------------
Self CPU time total: 5.347s
"""

The CPU memory footprint of this operation has halved, from 953.67 Mb to 476.84 Mb.
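The two figures line up with a back-of-the-envelope calculation. The sketch below assumes that the profiler's "Mb" means 2**20 bytes, and uses the standard element sizes of 8 bytes for torch.double and 4 bytes for torch.float.

```python
BYTES_PER_MB = 2 ** 20  # assumption: the profiler reports binary megabytes

num_elements = 500 * 500 * 500  # shape of mask

double_mb = num_elements * 8 / BYTES_PER_MB  # torch.double: 8 bytes per element
float_mb = num_elements * 4 / BYTES_PER_MB   # torch.float: 4 bytes per element

print(f"double: {double_mb:.2f} Mb, float: {float_mb:.2f} Mb")
# prints "double: 953.67 Mb, float: 476.84 Mb"
```

So the halving is what one would expect from the change in element size alone.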

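Improving time performance

While the time consumed has also decreased a bit, it is still too high. It turns out that copying a matrix from CUDA to the CPU is quite expensive. The aten::copy_ operator at forward (12) copies mask to the CPU so that it can use the NumPy argwhere function. The aten::copy_ at forward (13) copies the array back to CUDA as a tensor. We can eliminate both copies by using the torch function nonzero() here instead.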
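Before swapping one call for the other, it may help to check that they select the same indices. The sketch below does this on a small CPU tensor; the shape, seed, and threshold are arbitrary assumptions. Note that nonzero(as_tuple=True) returns one 1-D index tensor per dimension instead of a single (N, ndim) tensor, so the return type of hi_idx changes with this rewrite.

```python
import numpy as np
import torch

torch.manual_seed(0)
mask = torch.rand(4, 5, 6)  # small stand-in for the (500, 500, 500) mask
threshold = 0.5

# NumPy route: needs the data on the CPU as an ndarray.
via_numpy = torch.from_numpy(np.argwhere(mask.numpy() > threshold)).to(torch.int64)

# torch route: stays a tensor on whatever device mask lives on.
via_torch = (mask > threshold).nonzero()
via_tuple = (mask > threshold).nonzero(as_tuple=True)

# Both routes give the same rows of indices, in the same order.
same_rows = torch.equal(via_numpy, via_torch)
# Stacking the per-dimension tensors recovers the (N, ndim) layout.
same_tuple = torch.equal(torch.stack(via_tuple, dim=1), via_torch)
print(same_rows, same_tuple, len(via_tuple))
```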

class MyModule(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super(MyModule, self).__init__()
        self.linear = nn.Linear(in_features, out_features, bias)

    def forward(self, input, mask):
        with profiler.record_function("LINEAR PASS"):
            out = self.linear(input)

        with profiler.record_function("MASK INDICES"):
            threshold = out.sum(axis=1).mean()
            hi_idx = (mask > threshold).nonzero(as_tuple=True)

        return out, hi_idx


model = MyModule(500, 10).cuda()
input = torch.rand(128, 500).cuda()
mask = torch.rand((500, 500, 500), dtype=torch.float).cuda()

# warm-up
model(input, mask)

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    out, idx = model(input, mask)

print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total', row_limit=5))

"""
(Some columns are omitted)

--------------  ------------  ------------  ------------  ---------------------------------
          Name    Self CPU %      Self CPU  Self CPU Mem   Source Location
--------------  ------------  ------------  ------------  ---------------------------------
      aten::gt        57.17%     129.089ms           0 b  <ipython-input-...>(12): forward
                                                          /mnt/xarfuse/.../torch/nn
                                                          <ipython-input-...>(25): <module>
                                                          /mnt/xarfuse/.../IPython/
                                                          /mnt/xarfuse/.../IPython/

 aten::nonzero        37.38%      84.402ms           0 b  <ipython-input-...>(12): forward
                                                          /mnt/xarfuse/.../torch/nn
                                                          <ipython-input-...>(25): <module>
                                                          /mnt/xarfuse/.../IPython/
                                                          /mnt/xarfuse/.../IPython/

   INDEX SCORE         3.32%       7.491ms    -119.21 Mb  /mnt/xarfuse/.../torch/au
                                                          <ipython-input-...>(10): forward
                                                          /mnt/xarfuse/.../torch/nn
                                                          <ipython-input-...>(25): <module>
                                                          /mnt/xarfuse/.../IPython/

aten::as_strided         0.20%    441.587us          0 b  <ipython-input-...>(12): forward
                                                          /mnt/xarfuse/.../torch/nn
                                                          <ipython-input-...>(25): <module>
                                                          /mnt/xarfuse/.../IPython/
                                                          /mnt/xarfuse/.../IPython/

aten::nonzero_numpy             0.18%     395.602us           0 b  <ipython-input-...>(12): forward
                                                                   /mnt/xarfuse/.../torch/nn
                                                                   <ipython-input-...>(25): <module>
                                                                   /mnt/xarfuse/.../IPython/
                                                                   /mnt/xarfuse/.../IPython/

--------------  ------------  ------------  ------------  ---------------------------------
Self CPU time total: 225.801ms
"""