How does Nsight Compute calculate the Roofline?

  • 1. References
  • 2. Summary
  • 3. How does Nsight Compute calculate the Roofline?
  • 4. Generating the test program
  • 5. Performance at size 8192
  • 6. Computing the Roofline
  • 7. Metric explanation
  • 8. Performance at size 1024
  • 9. Performance at size 128
  • 10. RTX 3060 baseline capability test
  • 11. sm__inst_executed.avg.pct_of_peak_sustained_active
  • 12. Full analysis

While analyzing PyTorch and Triton kernels with the Roofline model, I found that the Peak Work reported by Nsight Compute did not match the GPU's theoretical peak compute. This post digs into why.

1. References

  • Metrics smsp__sass_thread_inst_executed_op
  • sm__sass_thread_inst_executed_op_ffma_
  • Building a roofline model with Nsight Compute
  • NsightComputeCli
  • Roofline_model
  • Analyzing PyTorch and Triton kernels with the Roofline model
  • H800 baseline capability test
  • nvtx-include

2. Summary

  • Theoretical compute: 3584 × 1.852 × 2 = 13.275 TFLOPS
  • Hardware theoretical arithmetic intensity: 36.87 FLOP/byte
  • Measured compute for this testcase via PyTorch timing: 4147.84 GFLOPS (31.2% of peak); measured bandwidth: 6.07 GB/s. (Treated as a black box, the timed window includes both compute and IO, so these numbers are not precise.)
  • The testcase's arithmetic intensity: 682.66 (> 36.87), so it is compute bound
  • By Nsight Compute's formula, PeakWork (FFMA): 9.46 TFLOPS (independent of problem size)
  • By Nsight Compute's formula, PeakTraffic: 349.92 GB/s
  • By Nsight Compute's formula, AchievedWork: 6.02 TFLOPS, i.e. 63% of PeakWork
  • By Nsight Compute's formula, AchievedTraffic: 42.82 Gbyte/second
  • sgemm single kernels: 9936.39 GFLOPS
  • sgemm N=10 without streams: 10260.6 GFLOPS
  • sgemm N=10 with stream: 10339.9 GFLOPS
  • sgemm N=10 batched: 8482.82 GFLOPS
  • Based on this kernel's occupancy, each scheduler can theoretically issue 4.00 warps, below the hardware maximum of 12. The kernel's theoretical occupancy (33.3%) is limited by the number of registers required.

3. How does Nsight Compute calculate the Roofline?

The formulas come from this section file: C:\Program Files\NVIDIA Corporation\Nsight Compute 2024.1.1\sections\SpeedOfLight_HierarchicalSingleRooflineChart.section
Its relevant content is as follows (the actual file contains more):

MetricDefinitions {
  MetricDefinitions {
    Name: "derived__sm__sass_thread_inst_executed_op_ffma_pred_on_x2"
    Expression: "sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained * 2"
  }
  MetricDefinitions {
    Name: "derived__smsp__sass_thread_inst_executed_op_ffma_pred_on_x2"
    Expression: "smsp__sass_thread_inst_executed_op_ffma_pred_on.sum.per_cycle_elapsed * 2"
  }
}
Rooflines {
  PeakWork {
    ValueCyclesPerSecondExpression {
      ValuePerCycleMetrics {
        Label: "Theoretical Predicated-On FFMA Operations"
        Name: "derived__sm__sass_thread_inst_executed_op_ffma_pred_on_x2"
      }
      CyclesPerSecondMetric {
        Label: "SM Frequency"
        Name: "sm__cycles_elapsed.avg.per_second"
      }
    }
  }
  PeakTraffic {
    ValueCyclesPerSecondExpression {
      ValuePerCycleMetrics {
        Label: "Theoretical DRAM Bytes Accessible"
        Name: "dram__bytes.sum.peak_sustained"
      }
      CyclesPerSecondMetric {
        Label: "DRAM Frequency"
        Name: "dram__cycles_elapsed.avg.per_second"
      }
    }
  }
  Options {
    Label: "DRAM Roofline"
  }
  AchievedValues {
    AchievedWork {
      ValueCyclesPerSecondExpression {
        ValuePerCycleMetrics {
          Label: "Predicated-On FFMA Operations Per Cycle"
          Name: "derived__smsp__sass_thread_inst_executed_op_ffma_pred_on_x2"
        }
        CyclesPerSecondMetric {
          Label: "SM Frequency"
          Name: "smsp__cycles_elapsed.avg.per_second"
        }
      }
    }
    AchievedTraffic {
      Metric {
        Label: "DRAM Bandwidth"
        Name: "dram__bytes.sum.per_second"
        Filter {
          MaxArch: CC_70
        }
      }
    }
  }
}

4. Generating the test program

tee Theoretical_FLOPS.py <<-'EOF'
import sys
import torch
import torch.nn as nn
from fvcore.nn import FlopCountAnalysis, ActivationCountAnalysis

# Define a test model: a single bias-free Linear layer
class SimpleModel(nn.Module):
    def __init__(self, input_features, output_features):
        super(SimpleModel, self).__init__()
        self.fc1 = torch.nn.utils.skip_init(nn.Linear, input_features, output_features, bias=False)

    def forward(self, x):
        x = self.fc1(x)
        return x

input_features = int(sys.argv[1])
output_features = input_features
batch_size = input_features

model = SimpleModel(input_features, output_features).cuda()
input_data = torch.ones(batch_size, input_features).cuda()
test_count = 10

# Count FLOPs and activations
flops = FlopCountAnalysis(model, input_data).total() * test_count
activations = ActivationCountAnalysis(model, input_data).total() + input_data.numel()
print("activations:", activations)

# Parameter count
params = sum(p.numel() for p in model.parameters())

# Memory traffic: activations and params times 4 bytes (assuming float32)
activation_memory_access = activations * 4
params_memory_access = params * 4
memory_access = activation_memory_access + params_memory_access
memory_access = memory_access * test_count

# warmup
output = model(input_data)
torch.cuda.synchronize()

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
for i in range(test_count):
    output = model(input_data)
end_event.record()
torch.cuda.synchronize()
total_cuda_time = start_event.elapsed_time(end_event) / 1000  # convert to seconds

# FLOPs in GFLOPs
flops_measured_glops = flops / 1e9

# Measured memory bandwidth
memory_access_gb = memory_access / 1e9
bandwidth_measured = memory_access_gb / total_cuda_time  # GB/s
arithmetic_intensity_measured = flops_measured_glops / memory_access_gb  # GFLOPs/GB (a static property of the algorithm)
flops_measured = arithmetic_intensity_measured * bandwidth_measured

# RTX 3060 peak performance and bandwidth
peak_performance = 13.275136 * 1e3  # GFLOPs
memory_bandwidth = 360.0            # GB/s
print("arithmetic_intensity:", peak_performance / memory_bandwidth)
print("flops_measured:", flops_measured, flops_measured / peak_performance)
print("bandwidth_measured:", bandwidth_measured)
print("total_cuda_time:", total_cuda_time)
print("arithmetic_intensity_measured:", arithmetic_intensity_measured)

# ncu starts collecting metrics from here
import nvtx
with nvtx.annotate("kernel_prof", color="blue"):
    output = model(input_data)
torch.cuda.synchronize()
EOF

5. Performance at size 8192

/usr/local/cuda/bin/ncu --nvtx --nvtx-include "kernel_prof/"  --metrics sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained,smsp__cycles_elapsed.avg.per_second,smsp__sass_thread_inst_executed_op_ffma_pred_on.sum.per_cycle_elapsed,sm__cycles_elapsed.avg.per_second,dram__bytes.sum.peak_sustained,dram__bytes.sum.per_second,dram__cycles_elapsed.avg.per_second python Theoretical_FLOPS.py  8192

Output:

activations: 134217728
arithmetic_intensity: 36.87537777777778
flops_measured: 4147.841730822858 0.3124519199519205
bandwidth_measured: 6.075940035385045
total_cuda_time: 1.325402099609375
arithmetic_intensity_measured: 682.6666666666667
==PROF== Profiling "ampere_sgemm_128x128_tn" - 0: 0%....50%....100% - 3 passes
==PROF== Disconnected from process 266138
[266138] python3.10@127.0.0.1
  ampere_sgemm_128x128_tn (64, 64, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 8.6
    NVTX Push/Pop Stack for Thread 266138:
      <default domain>
        <0,kernel_prof>
          RGB: 0xff
          REGISTERED: kernel_prof
    Section: Command line profiler metrics
    --------------------------------------------------------------------- ------------- ------------
    Metric Name                                                             Metric Unit Metric Value
    --------------------------------------------------------------------- ------------- ------------
    dram__bytes.sum.peak_sustained                                           byte/cycle           48
    dram__bytes.sum.per_second                                             Gbyte/second        42.84
    dram__cycles_elapsed.avg.per_second                                   cycle/nsecond         7.29
    sm__cycles_elapsed.avg.per_second                                     cycle/nsecond         1.32
    sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained         inst/cycle        3,584
    smsp__cycles_elapsed.avg.per_second                                   cycle/nsecond         1.32
    smsp__sass_thread_inst_executed_op_ffma_pred_on.sum.per_cycle_elapsed    inst/cycle     2,282.58
    --------------------------------------------------------------------- ------------- ------------

6. Computing the Roofline

# Peak performance and bandwidth
PeakWork = sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained * 2 * sm__cycles_elapsed.avg.per_second = 3584 * 2 * 1.32 inst/nsecond = 9.46 TFLOPS
PeakTraffic = dram__bytes.sum.peak_sustained * dram__cycles_elapsed.avg.per_second = 48 * 7.29 byte/nsecond = 349.92 GB/s

# Achieved performance and bandwidth
AchievedWork = smsp__sass_thread_inst_executed_op_ffma_pred_on.sum.per_cycle_elapsed * 2 * smsp__cycles_elapsed.avg.per_second = 2282.58 * 2 * 1.32 inst/nsecond = 6.02 TFLOPS
AchievedTraffic = dram__bytes.sum.per_second = 42.82 Gbyte/second
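The same arithmetic in executable form, plugging in the metric values collected at size 8192 (a sketch; the constants are the measured values from section 5):

```python
# Roofline values reconstructed from the ncu metrics at size 8192.
ffma_peak_per_cycle  = 3584      # sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained
ffma_per_cycle       = 2282.58   # smsp__sass_thread_inst_executed_op_ffma_pred_on.sum.per_cycle_elapsed
sm_ghz               = 1.32      # sm__cycles_elapsed.avg.per_second (cycle/nsecond == GHz)
dram_bytes_per_cycle = 48        # dram__bytes.sum.peak_sustained
dram_ghz             = 7.29      # dram__cycles_elapsed.avg.per_second

peak_work_tflops     = ffma_peak_per_cycle * 2 * sm_ghz / 1000   # ~9.46
peak_traffic_gbs     = dram_bytes_per_cycle * dram_ghz           # ~349.92
achieved_work_tflops = ffma_per_cycle * 2 * sm_ghz / 1000        # ~6.03
utilization          = ffma_per_cycle / ffma_peak_per_cycle      # ~0.64, the ~63% in the summary

print(peak_work_tflops, peak_traffic_gbs, achieved_work_tflops, utilization)
```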

7. Metric explanation

Compared with sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained, the key difference of smsp__sass_thread_inst_executed_op_ffma_pred_on.sum.per_cycle_elapsed is that it measures the rate actually achieved per elapsed clock cycle rather than the sustained peak. The two metrics describe the GPU's FFMA execution from different angles. The names break down as follows:
smsp: the metric is collected per SM sub-partition; each Ampere SM contains four sub-partitions (one per warp scheduler), while sm-prefixed metrics aggregate over the whole Streaming Multiprocessor.
sass: Shader Assembly, NVIDIA's native GPU instruction set — the instructions actually executed on the hardware.
thread_inst_executed: the number of instructions executed by GPU threads.
op_ffma: fused multiply-add (FFMA) operations, which perform a multiply and an add in a single instruction; they dominate floating-point workloads, and each one counts as two FLOPs.
pred_on: only instructions whose predicate (condition) evaluated to true are counted.
sum: the values are summed over the collection units (e.g. all sub-partitions).
per_cycle_elapsed: the sum is normalized per elapsed GPU clock cycle, giving the average number of FFMA thread-instructions executed per cycle — a measure of realized efficiency.
peak_sustained: the maximum rate the hardware can sustain per cycle. This is a hardware constant (the tests below show it does not change with problem size), i.e. the theoretical per-cycle ceiling.
Achieved vs. peak: per_cycle_elapsed is the achieved average rate, while peak_sustained is the ceiling. In other words, smsp__sass_thread_inst_executed_op_ffma_pred_on.sum.per_cycle_elapsed tells you how efficiently the code used each cycle, and sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained tells you the most the hardware could do. Their ratio is the FFMA utilization, and multiplying each by the SM frequency (and by 2 FLOPs per FFMA) yields the achieved and peak lines of the roofline, so the two together give a complete picture of GPU performance.

8. Performance at size 1024

/usr/local/cuda/bin/ncu --nvtx --nvtx-include "kernel_prof/"  --metrics sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained,smsp__cycles_elapsed.avg.per_second,smsp__sass_thread_inst_executed_op_ffma_pred_on.sum.per_cycle_elapsed,sm__cycles_elapsed.avg.per_second,dram__bytes.sum.peak_sustained,dram__bytes.sum.per_second,dram__cycles_elapsed.avg.per_second python Theoretical_FLOPS.py  1024

Output:

activations: 2097152
arithmetic_intensity: 36.87537777777778
flops_measured: 1470.6536158405618 0.1107825649274374
bandwidth_measured: 17.23422206063158
total_cuda_time: 0.007301119804382325
arithmetic_intensity_measured: 85.33333333333334
==PROF== Profiling "ampere_sgemm_128x64_tn" - 0: 0%....50%....100% - 3 passes
==PROF== Disconnected from process 267209
[267209] python3.10@127.0.0.1
  ampere_sgemm_128x64_tn (8, 16, 3)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.6
    NVTX Push/Pop Stack for Thread 267209:
      <default domain>
        <0,kernel_prof>
          RGB: 0xff
          REGISTERED: kernel_prof
    Section: Command line profiler metrics
    --------------------------------------------------------------------- ------------- ------------
    Metric Name                                                             Metric Unit Metric Value
    --------------------------------------------------------------------- ------------- ------------
    dram__bytes.sum.peak_sustained                                           byte/cycle           48
    dram__bytes.sum.per_second                                             Gbyte/second       100.81
    dram__cycles_elapsed.avg.per_second                                   cycle/nsecond         7.29
    sm__cycles_elapsed.avg.per_second                                     cycle/nsecond         1.32
    sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained         inst/cycle        3,584
    smsp__cycles_elapsed.avg.per_second                                   cycle/nsecond         1.32
    smsp__sass_thread_inst_executed_op_ffma_pred_on.sum.per_cycle_elapsed    inst/cycle     2,057.05
    --------------------------------------------------------------------- ------------- ------------

9. Performance at size 128

/usr/local/cuda/bin/ncu --nvtx --nvtx-include "kernel_prof/"  --metrics sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained,smsp__cycles_elapsed.avg.per_second,smsp__sass_thread_inst_executed_op_ffma_pred_on.sum.per_cycle_elapsed,sm__cycles_elapsed.avg.per_second,dram__bytes.sum.peak_sustained,dram__bytes.sum.per_second,dram__cycles_elapsed.avg.per_second python Theoretical_FLOPS.py  128

Output:

activations: 32768
arithmetic_intensity: 36.87537777777778
flops_measured: 3.9713012060185076 0.0002991533349276804
bandwidth_measured: 0.37230948806423503
total_cuda_time: 0.005280767917633057
arithmetic_intensity_measured: 10.666666666666668
==PROF== Profiling "ampere_sgemm_32x32_sliced1x4_tn" - 0: 0%....50%....100% - 3 passes
==PROF== Disconnected from process 267388
[267388] python3.10@127.0.0.1
  ampere_sgemm_32x32_sliced1x4_tn (4, 4, 1)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.6
    NVTX Push/Pop Stack for Thread 267388:
      <default domain>
        <0,kernel_prof>
          RGB: 0xff
          REGISTERED: kernel_prof
    Section: Command line profiler metrics
    --------------------------------------------------------------------- ------------- ------------
    Metric Name                                                             Metric Unit Metric Value
    --------------------------------------------------------------------- ------------- ------------
    dram__bytes.sum.peak_sustained                                           byte/cycle           48
    dram__bytes.sum.per_second                                             Gbyte/second        17.21
    dram__cycles_elapsed.avg.per_second                                   cycle/nsecond         7.24
    sm__cycles_elapsed.avg.per_second                                     cycle/nsecond         1.31
    sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained         inst/cycle        3,584
    smsp__cycles_elapsed.avg.per_second                                   cycle/nsecond         1.31
    smsp__sass_thread_inst_executed_op_ffma_pred_on.sum.per_cycle_elapsed    inst/cycle       185.00
    --------------------------------------------------------------------- ------------- ------------

Testing at different problem sizes shows that dram__bytes.sum.peak_sustained and sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained do not change with size: the peak_sustained metrics are hardware constants, independent of the workload.
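Since the peak metrics are fixed, only the achieved side of the roofline moves with problem size; converting each run's per_cycle_elapsed value makes the drop-off explicit (a sketch using the measured values from sections 5, 8 and 9):

```python
# AchievedWork = per_cycle_elapsed * 2 * SM GHz, for each of the three runs.
runs = {8192: (2282.58, 1.32), 1024: (2057.05, 1.32), 128: (185.00, 1.31)}
ffma_peak_per_cycle = 3584  # hardware constant, identical in all three runs

for n, (ffma_per_cycle, sm_ghz) in runs.items():
    tflops = ffma_per_cycle * 2 * sm_ghz / 1000
    util = ffma_per_cycle / ffma_peak_per_cycle
    print(f"N={n:5d}: {tflops:5.2f} TFLOPS achieved ({util:.1%} of theoretical FFMA issue)")
```

Even at N=1024 the kernel still reaches over half of theoretical FFMA issue, but at N=128 the GPU is nearly idle: the achieved point slides far below the roof while both roofs stay put.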

10. RTX 3060 baseline capability test

git clone https://www.github.com/nvidia/cuda-samples
cd cuda-samples/Samples/1_Utilities/deviceQuery
make clean && make
./deviceQuery
cd ../bandwidthTest/
make clean && make
./bandwidthTest
cd ../../4_CUDA_Libraries/batchCUBLAS/
make clean && make
./batchCUBLAS -m8192 -n8192 -k8192 --device=0
 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 3060"
  CUDA Driver Version / Runtime Version          12.2 / 12.1
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 12044 MBytes (12629377024 bytes)
  (028) Multiprocessors, (128) CUDA Cores/MP:    3584 CUDA Cores
  GPU Max Clock rate:                            1852 MHz (1.85 GHz)
  Memory Clock rate:                             7501 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 2359296 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 12.1, NumDevs = 1
Result = PASS

Running on...
 Device 0: NVIDIA GeForce RTX 3060
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     12.0

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     13.2

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     326.3

Result = PASS

gpuDeviceInit() CUDA Device [0]: "Ampere

==== Running single kernels ====
Testing sgemm
#### args: ta=0 tb=0 m=8192 n=8192 k=8192  alpha = (0xbf800000, -1) beta= (0x40000000, 2)
#### args: lda=8192 ldb=8192 ldc=8192
^^^^ elapsed = 0.11065507 sec  GFLOPS=9936.39
@@@@ sgemm test OK==== Running N=10 without streams ====
Testing sgemm
#### args: ta=0 tb=0 m=8192 n=8192 k=8192  alpha = (0xbf800000, -1) beta= (0x00000000, 0)
#### args: lda=8192 ldb=8192 ldc=8192
^^^^ elapsed = 1.07158208 sec  GFLOPS=10260.6
@@@@ sgemm test OK==== Running N=10 with streams ====
Testing sgemm
#### args: ta=0 tb=0 m=8192 n=8192 k=8192  alpha = (0xbf800000, -1) beta= (0x00000000, 0)
#### args: lda=8192 ldb=8192 ldc=8192
^^^^ elapsed = 1.06336808 sec  GFLOPS=10339.9
@@@@ sgemm test OK==== Running N=10 batched ====
Testing sgemm
#### args: ta=0 tb=0 m=8192 n=8192 k=8192  alpha = (0x40000000, 2) beta= (0x40000000, 2)
#### args: lda=8192 ldb=8192 ldc=8192
^^^^ elapsed = 1.29616284 sec  GFLOPS=8482.82
@@@@ sgemm test OK

FP32 theoretical compute

(028) Multiprocessors, (128) CUDA Cores/MP:    3584 CUDA Cores
GPU Max Clock rate:                            1852 MHz (1.85 GHz)
3584 * 1.85 GHz * 2 (FMA) = 13.26 TFLOPS
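The peak DRAM bandwidth can be derived from the deviceQuery output the same way. A sketch (assuming the reported 7501 MHz memory clock is the double-data-rate base, so the effective rate carries a factor of 2):

```python
# Peak compute and peak DRAM bandwidth from the deviceQuery numbers above.
cuda_cores = 28 * 128                # 28 SMs * 128 FP32 cores/SM = 3584
boost_ghz  = 1.852
fp32_gflops = cuda_cores * boost_ghz * 2          # FMA = 2 FLOPs -> ~13275 GFLOPS

mem_clock_mhz = 7501                 # "Memory Clock rate" from deviceQuery
bus_bytes = 192 // 8                 # 192-bit memory bus
dram_gbs = mem_clock_mhz * 2 * bus_bytes / 1000   # ~360 GB/s, matching the script's assumption

print(fp32_gflops, dram_gbs)
```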

11.sm__inst_executed.avg.pct_of_peak_sustained_active

/usr/local/cuda/bin/ncu --nvtx --nvtx-include "kernel_prof/"  --metrics sm__inst_executed.avg.pct_of_peak_sustained_active python Theoretical_FLOPS.py  8192

Output:

ampere_sgemm_128x128_tn (64, 64, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 8.6
  NVTX Push/Pop Stack for Thread 270100:
    <default domain>
      <0,kernel_prof>
        RGB: 0xff
        REGISTERED: kernel_prof
  Section: Command line profiler metrics
  -------------------------------------------------- ----------- ------------
  Metric Name                                        Metric Unit Metric Value
  -------------------------------------------------- ----------- ------------
  sm__inst_executed.avg.pct_of_peak_sustained_active           %        73.47
  -------------------------------------------------- ----------- ------------

12. Full analysis

/usr/local/cuda/bin/ncu  --nvtx --nvtx-include "kernel_prof/"   -f --set full --export roofline_report python Theoretical_FLOPS.py  8192

Based on this kernel's occupancy, each scheduler can theoretically issue 4.00 warps, below the hardware maximum of 12. The kernel's theoretical occupancy (33.3%) is limited by the number of registers required.
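The 33.3% figure can be reconstructed from the deviceQuery limits (a sketch; the 4 schedulers per SM is an Ampere architectural constant, not shown in the output above):

```python
# Reconstruct ncu's theoretical-occupancy statement for this kernel.
max_threads_per_sm = 1536             # from deviceQuery
warp_size = 32
schedulers_per_sm = 4                 # Ampere SMs have 4 warp schedulers (sub-partitions)

max_warps_per_sm = max_threads_per_sm // warp_size                # 48
max_warps_per_scheduler = max_warps_per_sm // schedulers_per_sm   # 12, the hardware max ncu cites
issuable_warps_per_scheduler = 4.0    # limited by register usage, per ncu

occupancy = issuable_warps_per_scheduler / max_warps_per_scheduler
print(f"{occupancy:.1%}")  # 33.3%
```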
