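[Notes] ONNX Export Notes [in progress...]

1. Advantages of ONNX models:

ONNX offers convenient cross-platform compatibility and portability. Through its operator sets (opsets) it can take in models from other frameworks, for example .pt => .onnx. On the deployment side, a .onnx model relies on the Runtime specification and its extension interfaces (execution providers in ONNX Runtime) to reach a wide range of hardware platforms. Chip vendors, for example NPU suppliers, can use these Runtime extension interfaces to provide execution of .onnx opset-based models on their hardware.

An ONNX opset is a versioned set of operators supported by the ONNX format. The ONNX standard releases new opset versions over time, and each opset version defines the supported operators and their specifications. When an ONNX model is created or exported, a specific opset version determines which operators the model may use. ONNX Runtime interprets and executes the model's operators according to the opset version recorded in the model.
ONNX Runtime is an inference engine for executing and optimizing ONNX models. It provides a high-performance runtime environment for deploying and running exported ONNX models on different hardware platforms. ONNX Runtime interprets and executes the operations of the ONNX model, aiming for the best performance and efficiency on the target.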

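2. Current ONNX model support on RK3588

The RKNN software stack consists of four parts:

RKNN-Toolkit2 is a software development kit for model conversion, inference, and performance evaluation on a PC and on Rockchip NPU platforms.
RKNN-Toolkit-Lite2 provides a Python programming interface for Rockchip NPU platforms, making it easier to deploy RKNN models and speed up AI application development.
RKNN Runtime provides a C/C++ programming interface for deploying RKNN models on Rockchip NPU platforms and accelerating AI applications.
The RKNPU kernel driver interacts with the NPU hardware. It is open source and can be obtained from Rockchip's kernel code.

As of 2024-06-20, the release notes of the RKNN model conversion toolkit (GitHub - airockchip/rknn-toolkit2) mention: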

v2.0.0-beta0

Support pytorch 2.1
Improve support for QAT models of pytorch and onnx

3. ONNX-level model quantization guide

Quantize ONNX models | onnxruntime

The basic quantization formula is:

val_fp32 = scale * (val_quantized - zero_point)

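Quantization uses zero_point and scale to map an fp32 real value onto a fixed-point integer range such as uint8 or uint16. val_fp32 and scale are real numbers; val_quantized and zero_point are integers in the quantized type, and zero_point is the quantized value that represents the real value 0.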
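
As a minimal sketch of the formula above, the following pure-Python functions quantize a value to uint8 and map it back. The helper names (`quantize`, `dequantize`, `params_from_range`) and the min/max way of deriving scale and zero_point are illustrative assumptions, not ONNX Runtime's API.

```python
def quantize(val_fp32, scale, zero_point, qmin=0, qmax=255):
    """Map a real value to an unsigned 8-bit integer, saturating at the range ends."""
    q = round(val_fp32 / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(val_quantized, scale, zero_point):
    """Recover the approximate real value: val_fp32 = scale * (val_quantized - zero_point)."""
    return scale * (val_quantized - zero_point)


def params_from_range(rmin, rmax, qmin=0, qmax=255):
    """Derive scale and zero_point so that [rmin, rmax] spans [qmin, qmax].

    Assumes rmin < rmax; the range is widened to include 0.0 so that
    real zero is exactly representable.
    """
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = round(qmin - rmin / scale)
    return scale, zero_point


# Hypothetical tensor range [-1.0, 3.0]
scale, zero_point = params_from_range(-1.0, 3.0)
q = quantize(0.5, scale, zero_point)
print(scale, zero_point, q, dequantize(q, scale, zero_point))
```

For values inside the calibrated range, the round trip is off by at most half a quantization step (scale / 2); values outside the range are clamped.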
ONNX has two representations for quantized models:

There are two ways to represent quantized ONNX models:

Operator-oriented (QOperator) :
All the quantized operators have their own ONNX definitions, like QLinearConv, MatMulInteger and etc.
Tensor-oriented (QDQ; Quantize and DeQuantize) :
This format inserts DeQuantizeLinear(QuantizeLinear(tensor)) between the original operators to simulate the quantization and dequantization process.
In Static Quantization, the QuantizeLinear and DeQuantizeLinear operators also carry the quantization parameters.
In Dynamic Quantization, a ComputeQuantizationParameters function proto is inserted to calculate quantization parameters on the fly.

Pre-processing the model before quantization helps: for example, a Convolution operator followed by BatchNormalization can be fused into one during the optimization, which can be quantized very efficiently.
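
The Conv + BatchNormalization fusion mentioned above is plain arithmetic on the per-channel parameters. The sketch below folds the BatchNormalization parameters into a single scalar weight and bias for one output channel; the function name `fold_batchnorm` and all numeric values are made up for illustration.

```python
import math


def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel BatchNormalization parameters into the preceding conv's weight and bias."""
    factor = gamma / math.sqrt(var + eps)
    return weight * factor, (bias - mean) * factor + beta


# One output channel with a single scalar weight, to keep the arithmetic visible.
w, b = 0.8, 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.4, 0.25
x = 2.0

# Conv followed by BatchNormalization, computed as two separate steps.
conv_out = w * x + b
separate = gamma * (conv_out - mean) / math.sqrt(var + 1e-5) + beta

# The same result from a single fused conv.
w_fused, b_fused = fold_batchnorm(w, b, gamma, beta, mean, var)
fused = w_fused * x + b_fused
print(separate, fused)
```

After fusion there is only one operator left to quantize, and no intermediate tensor between Conv and BatchNormalization that would need its own scale and zero_point.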

In general, it is recommended to use dynamic quantization for RNNs and transformer-based models, and static quantization for CNN models.

If neither post-training quantization method can meet your accuracy goal, you can try using quantization-aware training (QAT) to retrain the model. ONNX Runtime does not provide retraining at this time, but you can retrain your models with the original framework and convert them back to ONNX.

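The essential difference between dynamic and static quantization is how the scale and zero point of the activations are determined. Static quantization computes them ahead of time from a calibration data set and stores them in the model. Dynamic quantization computes them on the fly for each input at inference time. The weights are fixed in both cases. CNN activations usually have stable value ranges, so fixed, calibrated parameters work well. RNNs and transformer-based models process sequence data whose activation ranges vary more with the input, so computing the parameters at run time tends to preserve accuracy better.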
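
The difference can be sketched in a few lines of pure Python. This is a simplified min/max model of how the parameters are chosen, not ONNX Runtime's actual calibration code; the function names and sample values are assumptions for illustration.

```python
def params_from_range(rmin, rmax, qmin=0, qmax=255):
    """uint8 scale and zero_point for a real range that is widened to include 0.0."""
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin)
    return scale, round(qmin - rmin / scale)


def static_params(calibration_batches):
    """Static: one (scale, zero_point) fixed ahead of time from all calibration data."""
    values = [v for batch in calibration_batches for v in batch]
    return params_from_range(min(values), max(values))


def dynamic_params(activation):
    """Dynamic: (scale, zero_point) recomputed from each tensor at inference time."""
    return params_from_range(min(activation), max(activation))


calibration = [[-1.0, 0.5, 2.0], [-0.5, 1.5, 3.0]]
fixed = static_params(calibration)

# The static parameters never change; the dynamic ones follow each input.
for activation in ([0.1, 0.2, 0.3], [-4.0, 0.0, 8.0]):
    print("static:", fixed, "dynamic:", dynamic_params(activation))
```

The second input lies outside the calibrated range, so static parameters would clamp it, while the dynamic parameters adapt at the cost of extra work per inference.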

Also, a model must be opset 10 or higher to be quantized.

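3.1 U8S8 vs. U8U8

The issue here is saturation (overflow) during the model's integer arithmetic. U8S8 (uint8 activations, int8 weights) is the faster format on x86-64 CPUs with AVX2 or AVX512, because ONNX Runtime uses the VPMADDUBSW instruction for it. That instruction sums products into 16-bit integers with saturation, so an intermediate result that does not fit is clamped, which can occasionally cost accuracy. If a large accuracy drop appears, try reduce_range or the U8U8 format, which does not have the saturation issue. x86-64 with VNNI and ARM are not affected.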
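
To make the saturation concrete, here is a small pure-Python model of a single 16-bit lane of VPMADDUBSW: two unsigned-by-signed 8-bit products are added and the sum is clamped to the signed 16-bit range. The function names are illustrative; the second call uses weights limited to 7 bits, which is the idea behind reduce_range.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturate_int16(value):
    """Clamp an integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, value))


def maddubs_pair(a_u8, b_s8):
    """Model one 16-bit lane: two u8*s8 products added with signed 16-bit saturation.

    Returns (exact sum, saturated sum) so the two can be compared.
    """
    assert all(0 <= a <= 255 for a in a_u8)
    assert all(-128 <= b <= 127 for b in b_s8)
    exact = a_u8[0] * b_s8[0] + a_u8[1] * b_s8[1]
    return exact, saturate_int16(exact)


# Worst case with full 8-bit weights: the sum no longer fits in 16 bits.
print(maddubs_pair((255, 255), (127, 127)))
# Same activations with 7-bit weights (at most 63): the sum fits.
print(maddubs_pair((255, 255), (63, 63)))
```

The first case loses almost half of the true value to clamping, while the 7-bit case stays exact, which is why narrowing the weight range removes the problem at the price of one bit of weight precision.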

3.2 Additional optimization steps

There are specific optimizations for transformer-based models, such as QAttention for quantization of attention layers. In order to leverage these optimizations, you need to optimize your models using the Transformer Model Optimization Tool before quantizing the model.

This notebook demonstrates the process.

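3.3 Hardware implementation of quantization

The essence of quantization is to use simpler data types that the hardware supports natively in order to accelerate matrix operations (SIMD, single instruction multiple data). The quantization workflow that ONNX Runtime documents for GPUs with the TensorRT EP consists of: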

Implement a CalibrationDataReader.
Compute quantization parameters using a calibration data set. Note: In order to include all tensors from the model for better calibration, please run symbolic_shape_infer.py first. Please refer to here for details.
Save quantization parameters into a flatbuffer file
Load model and quantization parameter file and run with the TensorRT EP.
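
The first step above can be sketched as follows. `CalibrationDataReader` is the base class exported by `onnxruntime.quantization`, and its `get_next()` method is expected to return a dict of input name to array, or None when the data runs out. The class name `ListCalibrationReader`, the input name, and the sample data are hypothetical, and the import fallback exists only so the sketch also runs where onnxruntime is not installed.

```python
try:
    from onnxruntime.quantization import CalibrationDataReader
except ImportError:  # fallback so the sketch runs without onnxruntime installed
    CalibrationDataReader = object


class ListCalibrationReader(CalibrationDataReader):
    """Feeds pre-built calibration samples to the calibrator, one per call."""

    def __init__(self, input_name, samples):
        # input_name must match the model's graph input; samples are
        # normally numpy arrays with the model's input shape and dtype.
        self._batches = iter([{input_name: sample} for sample in samples])

    def get_next(self):
        # Return {input_name: sample}, or None once the data is exhausted.
        return next(self._batches, None)


# Plain lists stand in for real input tensors here.
reader = ListCalibrationReader("input", [[0.1, 0.2], [0.3, 0.4]])
print(reader.get_next())
```

In a real run, an instance like this would be passed as the calibration_data_reader argument of `onnxruntime.quantization.quantize_static`, together with the input and output model paths.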

4. RKNN acceleration guidance and additional speed-ups

4.1

4.FAQ

4.1 Why are operators like MaxPool not quantized?

8-bit type support for certain operators such as MaxPool was added in ONNX opset 12. Please check your model version and upgrade it to opset 12 and above.
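
A quick way to check the model version is to read the opset recorded in the model. The sketch below only relies on the `opset_import` field of an ONNX ModelProto (each entry has `domain` and `version`); the helper names are illustrative, and a `SimpleNamespace` stand-in is used so the example runs without the onnx package.

```python
from types import SimpleNamespace


def default_opset(model):
    """Return the opset version of the default ONNX domain, or None if absent."""
    for opset in model.opset_import:
        if opset.domain in ("", "ai.onnx"):
            return opset.version
    return None


def needs_upgrade(model, minimum=12):
    """True when the model's default-domain opset is below the required minimum."""
    version = default_opset(model)
    return version is None or version < minimum


# Stand-in with the same fields as a ModelProto; with the onnx package
# installed, a real model would come from onnx.load("model.onnx").
fake_model = SimpleNamespace(opset_import=[SimpleNamespace(domain="", version=11)])
print(default_opset(fake_model), needs_upgrade(fake_model))
```

If the check reports an older opset, the usual route is to re-export the model from the original framework with a higher opset version; `onnx.version_converter.convert_version` is another option, though it may not handle every operator.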

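4.2 Should I choose static or dynamic quantization?

Use dynamic quantization for RNNs and transformer-based models, and static quantization for CNNs.

4.3 What if the accuracy is insufficient after quantization?

If the accuracy loss is large, try per-channel quantization.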
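
Why per-channel helps can be seen with a small experiment: when channels have very different weight magnitudes, a single per-tensor scale wastes almost all integer levels on the small channel. The sketch below uses symmetric int8 quantization and made-up weights; the function names are illustrative only.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude onto the int8 limit."""
    return max(abs(v) for v in values) / qmax


def roundtrip_error(values, scale, qmax=127):
    """Largest absolute error after quantizing to int8 and dequantizing."""
    worst = 0.0
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        worst = max(worst, abs(v - q * scale))
    return worst


# Two output channels whose weight magnitudes differ by about 100x (made-up values).
channels = [[0.010, -0.004, 0.007], [1.00, -0.62, 0.35]]

per_tensor = symmetric_scale([v for ch in channels for v in ch])
for ch in channels:
    print("per-tensor error:", roundtrip_error(ch, per_tensor),
          "per-channel error:", roundtrip_error(ch, symmetric_scale(ch)))
```

With one shared scale the small channel collapses to just a couple of integer levels, while a scale per channel keeps its error within half a step of its own, much finer, grid.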

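4.4 I ran the quantized model on a CPU and saw no speed-up. Why?

On older hardware, quantization may bring no speed-up at all and can even degrade performance: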

Old hardware has none or few of the instructions needed to perform efficient inference in int8. And quantization has overhead (from quantizing and dequantizing), so it is not rare to get worse performance on old devices.

x86-64 with VNNI, GPU with Tensor Core int8 support and ARM with dot-product instructions can get better performance in general.
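
On Linux, one rough way to check for these CPU features is to look at the flags in /proc/cpuinfo. The sketch below parses cpuinfo-style text; the flag names it searches for (`avx512_vnni`, `avx_vnni` on x86-64 and `asimddp` on ARM64) are assumed to be how the kernel reports VNNI and dot-product support, and the sample text is made up. It does not cover GPUs.

```python
def int8_features(cpuinfo_text):
    """Return the int8-related CPU flags found in /proc/cpuinfo-style text."""
    wanted = {"avx512_vnni", "avx_vnni", "asimddp"}  # assumed flag names
    found = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # x86 lists them under "flags", ARM64 under "Features".
        if key.strip().lower() in ("flags", "features"):
            found |= wanted & set(value.split())
    return found


# Made-up excerpt in the style of an ARM64 /proc/cpuinfo.
sample = "processor : 0\nFeatures : fp asimd aes crc32 atomics asimddp\n"
print(int8_features(sample))
```

On a real machine the text would come from reading /proc/cpuinfo; an empty result suggests that int8 inference may not be faster than fp32 on that CPU.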

4.5 Are there any quantization examples?

Basic steps:

Export a PyTorch model to ONNX — PyTorch Tutorials 2.3.0+cu121 documentation

There are also more complete step-by-step examples for specific models: We provide two end-to-end examples:

Yolo V3 and resnet50.

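4.6 What is QAT?

QAT builds the quantization process into the model during training, so that the weights are learned with the quantization error already taken into account: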

If neither post-training quantization method can meet your accuracy goal, you can try using quantization-aware training (QAT) to retrain the model. ONNX Runtime does not provide retraining at this time, but you can retrain your models with the original framework and convert them back to ONNX.
