A Quick Read of the autoware.universe Source (3.3) -- perception: tensorrt_yolo

Module overview
cuda_utils (CUDA interface)
calibrator
- ImageStream
- Int8EntropyCalibrator
mish (Mish activation function, CUDA-based)
mish_plugin (Mish activation plugin)
- MishPlugin
- MishPluginCreator
nms (non-maximum suppression)
nms_plugin (NMS plugin)
yolo_layer (YOLO layer)
yolo_layer_plugin (YOLO layer plugin)
trt_yolo (TensorRT-based YOLO pipeline)
nodelet
Summary

Module overview

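Having just gone over the YOLO family, let's look directly at how the YOLO module is invoked here. As in the other packages, the node is a class derived from rclcpp::Node, and the lib folder holds the implementation of the actual functionality. Judging by the names, mish handles the activation function, nms is non-maximum suppression, and trt_yolo is the main body of YOLO. The overall composition is roughly as in the figure below; the following sections go through the pieces one by one.
[Figure: module composition of tensorrt_yolo]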

cuda_utils (CUDA interface)

This is mainly a set of helper functions written to wrap CUDA; there is nothing in particular to say about it.

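calibrator

The calibrator has two parts. One is ImageStream, which is presumably the input image stream. The other is Int8EntropyCalibrator, which inherits from TensorRT's nvinfer1::IInt8EntropyCalibrator2, so the actual calibration steps should be carried out there.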

ImageStream

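This class mainly preprocesses the input image stream. batch_ appears to hold the data processed in one go, so batch_size_ should be the number of images per batch. There is also input_dims_, which should describe the shape of the input image. Judging by this line in the constructor: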

batch_.resize(batch_size_ * input_dims_.d[1] * input_dims_.d[2] * input_dims_.d[3]);

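my guess is that the first dimension corresponds to the batch, and the last three are presumably the channel count, image height and image width (NCHW order; this matches the later call preprocess(in_img, input_dims.at(0), input_dims.at(2), input_dims.at(1)) in trt_yolo, where the last dimension is passed as w).
The stream is traversed through the next function, which is simple: it iterates over calibration_images_ and appends each preprocessed image to the end of batch_. The preprocessing itself lives in the preprocess function: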

std::vector<float> preprocess(const cv::Mat & in_img, const int c, const int w, const int h) const
{
  cv::Mat rgb;
  cv::cvtColor(in_img, rgb, cv::COLOR_BGR2RGB);  // BGR to RGB
  cv::resize(rgb, rgb, cv::Size(w, h));          // Resize
  cv::Mat img_float;
  rgb.convertTo(img_float, CV_32FC3, 1 / 255.0);  // Normalize
  // HWC TO CHW
  std::vector<cv::Mat> input_channels(c);
  cv::split(img_float, input_channels);  // Split based on channels
  std::vector<float> result(h * w * c);
  auto data = result.data();
  int channel_length = h * w;
  for (int i = 0; i < c; ++i) {
    memcpy(data, input_channels[i].data, channel_length * sizeof(float));
    data += channel_length;
  }
  return result;
}

Int8EntropyCalibrator

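This class implements TensorRT's INT8 calibration mechanism, which runs calibration data through the network to determine the dynamic ranges used to quantize activations to INT8, so that inference accuracy is preserved as far as possible. The constructor allocates memory according to the stream passed in; the remaining functions are implementations of the virtual functions of nvinfer1::IInt8EntropyCalibrator2.

getBatch: fetches the data of the current batch (by calling the image stream's next function), copies it from CPU to GPU memory, and updates the bindings array to point at the device memory.
readCalibrationCache: reads calibration data from the cache file, returning a pointer to the data and reporting its length through the length output parameter.
writeCalibrationCache: writes the calibration data to the cache file.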

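mish (Mish activation function, CUDA-based)

There is not much to say about the purpose of this file: it implements the computation of the Mish activation function. It does touch on CUDA function qualifiers (the post points to a separate article for background): a function marked __device__ runs on the GPU and is called from GPU code, while __global__ marks a kernel that runs on the GPU but is launched from the CPU. That is the arrangement in this file: the core formula is in __device__ float mish(float x), but what actually gets launched is __global__ void mishKernel(const T * input, T * output, int num_elem).

One more point: the body of mish looks different from the textbook definition x * tanh(softplus(x)), so at first glance it may seem to be a simplified variant: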

__device__ float mish(float x)
{
  float e = __expf(x);
  float n = e * e + 2 * e;
  if (x <= -0.6f) return x * __fdividef(n, n + 2);
  return x - 2 * __fdividef(x, n + 2);
}

The textbook form would look like this:

__device__ float mish_standard(float x)
{
  return x * tanh(log1p(exp(x)));
}

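The two forms are in fact algebraically identical. With e = exp(x) and n = e^2 + 2e, tanh(ln(1 + e)) = ((1 + e)^2 - 1) / ((1 + e)^2 + 1) = n / (n + 2), so mish(x) = x * n / (n + 2) = x - 2x / (n + 2). The rewritten form needs only one fast exponential and one fast division instead of exp, log1p and tanh. The two branches are presumably there for numerical reasons: for large x, n overflows and n / (n + 2) is no longer meaningful while x - 2x / (n + 2) still tends to x, and for strongly negative x the product form avoids subtracting two nearly equal numbers. Since this only runs at inference time, small differences from the fast intrinsics should not matter much.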
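To compare the two forms of Mish numerically, here is a small host-side sketch. It assumes std::exp and ordinary division as stand-ins for the CUDA intrinsics __expf and __fdividef, so it checks the algebra rather than the exact device results; the function names are made up for this example.

```cpp
#include <cmath>

// Host-side re-implementation of the rewritten formula, using std::exp and
// plain division in place of the CUDA intrinsics __expf and __fdividef.
float mish_rewritten(float x)
{
  const float e = std::exp(x);
  const float n = e * e + 2.0f * e;
  if (x <= -0.6f) return x * (n / (n + 2.0f));
  return x - 2.0f * (x / (n + 2.0f));
}

// Textbook definition: x * tanh(softplus(x)).
float mish_reference(float x)
{
  return x * std::tanh(std::log1p(std::exp(x)));
}
```

Evaluating both at a handful of points such as -5, -1, 0, 1 and 5 should give values that agree to within float rounding. For very large inputs the second branch of mish_rewritten degrades gracefully: once n overflows to infinity, the correction term becomes zero and the function returns x itself.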

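mish_plugin (Mish activation plugin)

This adds an activation-function plugin to TensorRT. It is again split into two classes: MishPlugin, which should be the plugin itself and inherits from nvinfer1::IPluginV2DynamicExt, and MishPluginCreator, whose job appears to be creating instances of the plugin.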

MishPlugin

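This class is a custom TensorRT plugin that applies the Mish activation during inference. It implements the methods of the TensorRT plugin interface so that it can be integrated into the inference engine.
Many of the functions involved contain lines like this: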

(void)inputTypes;

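After some searching, the purpose seems to be to explicitly mark a parameter as intentionally unused, so that the compiler does not emit an unused-parameter warning even if the parameter is never referenced afterwards. In this class, the part that actually runs the Mish activation is in the enqueue function: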

int MishPlugin::enqueue(
  const PluginTensorDesc * inputDesc, const PluginTensorDesc * outputDesc,
  const void * const * inputs, void * const * outputs, void * workspace,
  cudaStream_t stream) noexcept
{
  (void)inputDesc;
  (void)outputDesc;
  (void)workspace;
  const int input_volume = volume(inputDesc[0].dims);
  int status = -1;
  const float * input = static_cast<const float *>(inputs[0]);
  float * output = static_cast<float *>(outputs[0]);
  status = mish(stream, input, output, input_volume);
  return status;
}

MishPluginCreator

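This class implements the TensorRT plugin creator interface and is used to create and deserialize MishPlugin instances. It is part of TensorRT's plugin mechanism: it provides the plugin's metadata and creates plugin instances on demand. It has two static member variables: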

PluginFieldCollection MishPluginCreator::mFC{};
std::vector<PluginField> MishPluginCreator::mPluginAttributes;

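mFC is a PluginFieldCollection that stores the plugin's field information (that is, the plugin's parameters). mPluginAttributes is a vector of PluginField holding the individual fields. The constructor clears mPluginAttributes and registers its contents in mFC. The createPlugin function builds the plugin; in code, that means allocating a new MishPlugin and setting its namespace:

MishPlugin * obj = new MishPlugin();
obj->setPluginNamespace(mNamespace.c_str());
return obj;

There is also a deserialization function. Data is usually serialized into a binary-like form for transfer and storage, so this function turns a previously serialized object back into a plugin instance:

// This object will be deleted when the network is destroyed, which will
// call MishPlugin::destroy()
MishPlugin * obj = new MishPlugin(serialData, serialLength);
obj->setPluginNamespace(mNamespace.c_str());
return obj;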

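nms (non-maximum suppression)

As mentioned before, non-maximum suppression is the step that prevents one object from being reported multiple times after YOLO has detected it. In the source file the core computation is a __global__ function that examines the extracted boxes. The loops run over the num_detections detections, with each thread handling num_per_thread of them; boxes of the same class are compared, and if the overlap exceeds the configured threshold the score is set to 0 (so there must be an earlier step that sorts by score).
A CUDA function also appears here: __syncthreads(). It acts as a barrier: every thread in the block must reach __syncthreads() before any of them continues with the statements after it.
As with the Mish activation, the host-side nms function is written separately, but its logic is considerably more involved than that of Mish.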

First, the workspace size is computed. Flagged and SortPairsDescending are each queried for their temporary storage requirement, and the larger of the two is added to the total:

if (!workspace || !workspace_size) {
  // Return required scratch space size cub style
  workspace_size = cuda::get_size_aligned<bool>(count);    // flags
  workspace_size += cuda::get_size_aligned<int>(count);    // indices
  workspace_size += cuda::get_size_aligned<int>(count);    // indices_sorted
  workspace_size += cuda::get_size_aligned<float>(count);  // scores
  workspace_size += cuda::get_size_aligned<float>(count);  // scores_sorted
  size_t temp_size_flag = 0;
  cub::DeviceSelect::Flagged(
    (void *)nullptr, temp_size_flag, cub::CountingInputIterator<int>(count), (bool *)nullptr,
    (int *)nullptr, (int *)nullptr, count);
  size_t temp_size_sort = 0;
  cub::DeviceRadixSort::SortPairsDescending(
    (void *)nullptr, temp_size_sort, (float *)nullptr, (float *)nullptr, (int *)nullptr,
    (int *)nullptr, count);
  workspace_size += std::max(temp_size_flag, temp_size_sort);
  return workspace_size;
}

Next, pointers into the workspace are handed out for the temporary arrays. thrust::cuda::par.on(stream) is a Thrust execution policy: it tells the Thrust algorithms that receive it to run on the GPU in the given CUDA stream.

auto on_stream = thrust::cuda::par.on(stream);
auto flags = cuda::get_next_ptr<bool>(count, workspace, workspace_size);
auto indices = cuda::get_next_ptr<int>(count, workspace, workspace_size);
auto indices_sorted = cuda::get_next_ptr<int>(count, workspace, workspace_size);
auto scores = cuda::get_next_ptr<float>(count, workspace, workspace_size);
auto scores_sorted = cuda::get_next_ptr<float>(count, workspace, workspace_size);
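The size query and the pointer hand-out above follow a simple bump-allocation pattern. The sketch below re-creates it on the host under stated assumptions: size_aligned and next_ptr are hypothetical stand-ins for cuda::get_size_aligned and cuda::get_next_ptr, and the 256-byte alignment is an assumed value, not taken from the source.

```cpp
#include <cstddef>
#include <stdexcept>

// Assumed alignment for this sketch; the real value lives in cuda_utils.
constexpr std::size_t kAlign = 256;

// Bytes needed for num_elem objects of type T, rounded up to kAlign.
template <typename T>
std::size_t size_aligned(std::size_t num_elem)
{
  const std::size_t size = num_elem * sizeof(T);
  return (size + kAlign - 1) / kAlign * kAlign;
}

// Carves the next aligned sub-buffer out of workspace and advances both
// the workspace pointer and the remaining size.
template <typename T>
T * next_ptr(std::size_t num_elem, void *& workspace, std::size_t & workspace_size)
{
  const std::size_t size = size_aligned<T>(num_elem);
  if (size > workspace_size) {
    throw std::runtime_error("workspace is too small");
  }
  T * ptr = static_cast<T *>(workspace);
  workspace = static_cast<char *>(workspace) + size;
  workspace_size -= size;
  return ptr;
}
```

Because the first call (with a null workspace) sums exactly the sizes that later calls consume, a caller can allocate one buffer up front and every temporary array ends up at an aligned offset inside it.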

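An important next step is filtering out results whose score is 0. The quoted line only builds the flags array; the actual compaction of the surviving indices is presumably done by the cub::DeviceSelect::Flagged call whose scratch size was queried above.

// Discard null scores
thrust::transform(on_stream, in_scores, in_scores + count, flags, thrust::placeholders::_1 > 0.0f);

As expected, a score-sorting step follows. The Thrust documentation describes gather as follows: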

gather copies elements from a source array into a destination range according to a map. For each input iterator i in the range [map_first, map_last), the value input_first[*i] is assigned to *(result + (i - map_first)). RandomAccessIterator must permit random access.

The sort itself uses SortPairsDescending:

// Sort scores and corresponding indices
thrust::gather(on_stream, indices, indices + num_detections, in_scores, scores);
cub::DeviceRadixSort::SortPairsDescending(
  workspace, workspace_size, scores, scores_sorted, indices, indices_sorted, num_detections, 0,
  sizeof(*scores) * 8, stream);

Then the core NMS kernel is launched to perform the suppression:

// Launch actual NMS kernel - 1 block with each thread handling n detections
const int max_threads = 1024;
int num_per_thread = ceil((float)num_detections / max_threads);
nms_kernel<<<1, max_threads, 0, stream>>>(
  num_per_thread, nms_thresh, num_detections, indices_sorted, scores_sorted, in_classes,
  in_boxes);

The results are sorted again after NMS, since suppressed entries now have a score of 0:

// Re-sort with updated scores
cub::DeviceRadixSort::SortPairsDescending(
  workspace, workspace_size, scores_sorted, scores, indices_sorted, indices, num_detections, 0,
  sizeof(*scores) * 8, stream);

Finally, the outputs are assembled:

// Gather filtered scores, boxes, classes
num_detections = min(detections_per_im, num_detections);
cudaMemcpyAsync(
  out_scores, scores, num_detections * sizeof *scores, cudaMemcpyDeviceToDevice, stream);
if (num_detections < detections_per_im) {
  thrust::fill_n(on_stream, out_scores + num_detections, detections_per_im - num_detections, 0);
}
thrust::gather(on_stream, indices, indices + num_detections, in_boxes, out_boxes);
thrust::gather(on_stream, indices, indices + num_detections, in_classes, out_classes);

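The overall logic of the function is fairly simple, but it involves a fair amount of CUDA programming and memory handling, so it can still be hard going to read.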
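Stripped of the CUDA machinery, the steps above (drop zero scores, sort, suppress within a class, keep the best) can be sketched sequentially on the CPU. This is only an analogue under stated assumptions: the Box type with corner coordinates, the iou helper and nms_cpu are all hypothetical, and the real kernel distributes the comparisons across threads rather than looping.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical box type for this sketch: corner coordinates.
struct Box { float x1, y1, x2, y2; };

// Intersection over union of two axis-aligned boxes.
float iou(const Box & a, const Box & b)
{
  const float iw = std::max(0.0f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
  const float ih = std::max(0.0f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
  const float inter = iw * ih;
  const float uni = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
  return uni > 0.0f ? inter / uni : 0.0f;
}

// Returns the indices of the kept detections, best score first.
std::vector<int> nms_cpu(
  const std::vector<Box> & boxes, std::vector<float> scores, const std::vector<int> & classes,
  float nms_thresh, std::size_t detections_per_im)
{
  // 1. Discard null scores.
  std::vector<int> order;
  for (std::size_t i = 0; i < scores.size(); ++i) {
    if (scores[i] > 0.0f) order.push_back(static_cast<int>(i));
  }
  // 2. Sort the remaining indices by descending score.
  std::stable_sort(
    order.begin(), order.end(), [&scores](int a, int b) { return scores[a] > scores[b]; });
  // 3. Zero the score of any lower-ranked box of the same class that overlaps too much.
  for (std::size_t m = 0; m < order.size(); ++m) {
    const int best = order[m];
    if (scores[best] == 0.0f) continue;  // already suppressed
    for (std::size_t n = m + 1; n < order.size(); ++n) {
      const int other = order[n];
      if (classes[other] == classes[best] && iou(boxes[best], boxes[other]) > nms_thresh) {
        scores[other] = 0.0f;
      }
    }
  }
  // 4. Keep the survivors, capped at detections_per_im.
  std::vector<int> keep;
  for (int i : order) {
    if (scores[i] > 0.0f && keep.size() < detections_per_im) keep.push_back(i);
  }
  return keep;
}
```

Zeroing scores instead of erasing elements is what makes the second sort in the GPU version necessary: suppressed entries sink to the end, and the first detections_per_im entries are the final result.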

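nms_plugin (NMS plugin)

This plugin is much the same as the Mish plugin described earlier: it is split into the plugin class itself and a creator class, and the core NMS call again sits in the enqueue function. Since there is little that is new, it is not recorded in detail here.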

yolo_layer (YOLO layer)

This part converts the YOLO output features into bounding boxes, classes and scores and stores them in the output arrays. It starts by defining two activation functions: a plain sigmoid and a scaled version, scaleSigmoid.

inline __device__ float sigmoid(float x) { return 1.0f / (1.0f + __expf(-x)); }

inline __device__ float scaleSigmoid(float x, float scale)
{
  return scale * sigmoid(x) - (scale - 1.0f) * 0.5f;
}

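Next comes yoloLayerKernel, which should be the core of the YOLO layer.

input: the input feature map containing the raw predictions.
out_scores: output score array, one score per detection.
out_boxes: output bounding-box array, one box per detection.
out_classes: output class array, one class per detection.
grid_width: grid width of the feature map.
grid_height: grid height of the feature map.
num_classes: number of classes.
num_anchors: number of anchors.
anchors: anchor array holding the width and height of each anchor.
input_width: width of the input image.
input_height: height of the input image.
scale_x_y: scaling factor for the box center coordinates.
score_thresh: score threshold; detections below it are filtered out.
use_darknet_layer: whether to use Darknet-style decoding (to support different YOLO implementations).

It begins with some setup:

int idx = threadIdx.x + TPB * blockIdx.x;
int total_grids = grid_width * grid_height;
if (idx >= total_grids * num_anchors) return;
auto out_score = (out_scores) + idx;
auto out_box = (out_boxes) + idx;
auto out_class = (out_classes) + idx;
int anchor_idx = idx / total_grids;    // anchor index
idx = idx - total_grids * anchor_idx;  // idx now acts as a local index within the anchor
int info_len = 5 + num_classes;
auto cur_input = static_cast<const float *>(input) + anchor_idx * (info_len * total_grids);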
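The indexing above implies a channel-major layout: per anchor there are 5 + num_classes planes of grid_width * grid_height values. The sketch below mirrors that on the host. CellIndex, decompose and input_offset are hypothetical names, and deriving row and col from the local index in row-major order is an assumption, since that part of the kernel is not quoted.

```cpp
#include <cstddef>

// Hypothetical host-side mirror of the kernel's indexing.
struct CellIndex { int anchor; int row; int col; };

// Splits a flat thread index into anchor, grid row and grid column.
// The row-major split of the local index is an assumption of this sketch.
CellIndex decompose(int idx, int grid_width, int grid_height)
{
  const int total_grids = grid_width * grid_height;
  const int anchor = idx / total_grids;
  const int local = idx - anchor * total_grids;
  return {anchor, local / grid_width, local % grid_width};
}

// Offset of attribute `attr` for the cell addressed by idx, where attr 0..3
// are the box terms, 4 is objectness and 5 onwards are the class logits.
std::size_t input_offset(int idx, int attr, int grid_width, int grid_height, int num_classes)
{
  const int total_grids = grid_width * grid_height;
  const int info_len = 5 + num_classes;
  const int anchor = idx / total_grids;
  const int local = idx - anchor * total_grids;
  return static_cast<std::size_t>(anchor) * info_len * total_grids +
         static_cast<std::size_t>(attr) * total_grids + local;
}
```

In the kernel the anchor term is folded into cur_input once, which is why the later reads only need idx + attr * total_grids.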

The class with the largest logit is selected, and both the class probability and the objectness are obtained by applying a sigmoid to the raw outputs:

int class_id;
float max_cls_logit = -CUDART_INF_F;  // minus infinity
for (int i = 5; i < info_len; ++i) {
  float l = cur_input[idx + i * total_grids];
  if (l > max_cls_logit) {
    max_cls_logit = l;
    class_id = i - 5;
  }
}
float max_cls_prob = sigmoid(max_cls_logit);
float objectness = sigmoid(cur_input[idx + 4 * total_grids]);

Then, depending on whether use_darknet_layer is set, x, y, w and h are computed in one of two ways:

if (use_darknet_layer) {
  x = (col + scaleSigmoid(cur_input[idx + 0 * total_grids], scale_x_y)) / grid_width;    // [0, 1]
  y = (row + scaleSigmoid(cur_input[idx + 1 * total_grids], scale_x_y)) / grid_height;   // [0, 1]
  w = __expf(cur_input[idx + 2 * total_grids]) * anchors[2 * anchor_idx] / input_width;  // [0, 1]
  h = __expf(cur_input[idx + 3 * total_grids]) * anchors[2 * anchor_idx + 1] /
      input_height;  // [0, 1]
} else {
  x = (col + sigmoid(cur_input[idx + 0 * total_grids]) * 2 - 0.5) / grid_width;   // [0, 1]
  y = (row + sigmoid(cur_input[idx + 1 * total_grids]) * 2 - 0.5) / grid_height;  // [0, 1]
  w = (sigmoid(cur_input[idx + 2 * total_grids]) * 2) *
      (sigmoid(cur_input[idx + 2 * total_grids]) * 2) * anchors[2 * anchor_idx] /
      input_width;  // [0, 1]
  h = (sigmoid(cur_input[idx + 3 * total_grids]) * 2) *
      (sigmoid(cur_input[idx + 3 * total_grids]) * 2) * anchors[2 * anchor_idx + 1] /
      input_height;  // [0, 1]
}

Finally, the results are written to the corresponding outputs:

*out_box = make_float4(x, y, w, h);
*out_class = class_id;
*out_score = objectness < score_thresh ? 0.0 : max_cls_prob * objectness;

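As for the host-side wrapper, yoloLayer runs this stage of inference for the YOLO detector: it launches the yoloLayerKernel kernel for each item in the batch to produce bounding boxes, classes and scores.

batch_size: batch size.
inputs: array of input data pointers.
outputs: array of output data pointers.
grid_width: grid width of the feature map.
grid_height: grid height of the feature map.
num_classes: number of classes.
num_anchors: number of anchors.
anchors: list of anchor widths and heights.
input_width: width of the input image.
input_height: height of the input image.
scale_x_y: scaling factor for the box center coordinates.
score_thresh: score threshold; detections below it are filtered out.
use_darknet_layer: whether to use Darknet-style decoding (to support different YOLO implementations).
workspace: workspace pointer.
workspace_size: workspace size.
stream: CUDA stream.

It first checks whether the workspace is valid (returning the required size if it is not), then copies the anchors to the GPU:

if (!workspace || !workspace_size) {
  workspace_size = cuda::get_size_aligned<float>(anchors.size());
  return workspace_size;
}
auto anchors_d = cuda::get_next_ptr<float>(anchors.size(), workspace, workspace_size);
cudaMemcpyAsync(
  anchors_d, anchors.data(), anchors.size() * sizeof *anchors_d, cudaMemcpyHostToDevice, stream);

The thread-block size and grid size are then computed: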

int num_elements = num_anchors * grid_width * grid_height;
constexpr int block_size = 256;
const int grid_size = (num_elements + block_size - 1) / block_size;
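The grid_size expression is the usual integer ceiling division. A minimal sketch, with ceil_div as a hypothetical helper name:

```cpp
// Number of blocks of block_size threads needed to cover num_elements items.
// Adding block_size - 1 before dividing rounds the quotient up.
constexpr int ceil_div(int num_elements, int block_size)
{
  return (num_elements + block_size - 1) / block_size;
}
```

When num_elements is not a multiple of the block size, the last block contains surplus threads, which is why the kernel starts with a bounds check on idx and returns early. The NMS launch computes its per-thread count the same way in spirit, using ceil on a floating-point quotient instead.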

Each item in the batch is then processed in a loop that launches yoloLayerKernel:

for (int batch = 0; batch < batch_size; ++batch) {
  auto input = static_cast<const float *>(inputs[0]) +
               batch * num_anchors * (num_classes + 5) * grid_width * grid_height;
  auto out_scores = static_cast<float *>(outputs[0]) + batch * num_elements;
  auto out_boxes = static_cast<float4 *>(outputs[1]) + batch * num_elements;
  auto out_classes = static_cast<float *>(outputs[2]) + batch * num_elements;
  yoloLayerKernel<block_size><<<grid_size, block_size, 0, stream>>>(
    input, out_scores, out_boxes, out_classes, grid_width, grid_height, num_classes, num_anchors,
    anchors_d, input_width, input_height, scale_x_y, score_thresh, use_darknet_layer);
}

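yolo_layer_plugin (YOLO layer plugin)

This one differs little from the two plugin classes above, so it is not recorded here either.

trt_yolo (TensorRT-based YOLO pipeline)

This file appears to hold the main part of YOLO: it contains a Config struct for the configuration and a Net class. First, the configuration parameters:

struct Config
{
  int num_anchors;               // number of anchors
  std::vector<float> anchors;    // list of anchor widths and heights
  std::vector<float> scale_x_y;  // list of coordinate scaling factors
  float score_thresh;            // score threshold; detections below it are filtered out
  float iou_thresh;              // IoU threshold used for non-maximum suppression
  int detections_per_im;         // maximum number of detections per image
  bool use_darknet_layer;        // whether to use Darknet-style decoding
  float ignore_thresh;           // ignore threshold; detections below it are ignored
};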

The Net class has two constructors. The first takes very few parameters and should be the one used when an engine file is supplied directly. The second takes many more and appears to accept an ONNX file, so it first goes through a model-conversion step.
With an engine as input things are simple; the model only needs to be loaded:

Logger logger(verbose);
runtime_ = unique_ptr<nvinfer1::IRuntime>(nvinfer1::createInferRuntime(logger));
load(path);
if (!prepare()) {
  std::cout << "Fail to prepare engine" << std::endl;
  return;
}

Otherwise things get considerably more involved. The model-conversion helpers defined earlier in the common package are not reused; the conversion is written out again here, possibly because the output layers need special handling. The core of the conversion looks much like the earlier code:

std::cout << "Building " << precision << " core model..." << std::endl;
const auto flag =
  1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
auto network = unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(flag));
if (!network) {
  std::cout << "Fail to create network" << std::endl;
  return;
}
auto parser = unique_ptr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, logger));
if (!parser) {
  std::cout << "Fail to create parser" << std::endl;
  return;
}
parser->parseFromFile(
  onnx_file_path.c_str(), static_cast<int>(nvinfer1::ILogger::Severity::kERROR));

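The plugin classes are added to the network here as well. Note that the snippet constructs yolo::NMSPlugin directly and passes it to addPluginV2, rather than going through the plugin creator: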

auto nmsPlugin = yolo::NMSPlugin(yolo_config.iou_thresh, yolo_config.detections_per_im);
auto layer = network->addPluginV2(concat.data(), concat.size(), nmsPlugin);
for (int i = 0; i < layer->getNbOutputs(); i++) {
  auto output = layer->getOutput(i);
  network->markOutput(*output);
}

Finally, the engine file is generated.

The actual inference call goes through detect, and more specifically through the infer function.

detect is the concrete interface: the input is an image directly, and there are three outputs: scores, boxes and classes. The input image is preprocessed first:

const auto input = preprocess(in_img, input_dims.at(0), input_dims.at(2), input_dims.at(1));

The main job of the preprocessing is to convert the image from HWC to CHW, after which infer is executed:

std::vector<void *> buffers = {
  input_d_.get(), out_scores_d_.get(), out_boxes_d_.get(), out_classes_d_.get()};
try {
  infer(buffers, 1);
} catch (const std::runtime_error & e) {
  return false;
}

In infer, buffers is the vector containing the input and output buffers, and the other argument is the amount of data processed at once (the batch size). The function also calls cudaStreamSynchronize, which blocks until all operations queued on the CUDA stream have completed. This guarantees that inference has finished and that the output buffers hold up-to-date data.

void Net::infer(std::vector<void *> & buffers, const int batch_size)
{
  if (!context_) {
    throw std::runtime_error("Fail to create context");
  }
  auto input_dims = engine_->getBindingDimensions(0);
  context_->setBindingDimensions(
    0, nvinfer1::Dims4(batch_size, input_dims.d[1], input_dims.d[2], input_dims.d[3]));
  context_->enqueueV2(buffers.data(), stream_, nullptr);
  cudaStreamSynchronize(stream_);
}

nodelet

Finally, let's look at the node class, TensorrtYoloNodelet. As usual the constructor first loads a number of parameters, and then initializes the YOLO network object:

RCLCPP_INFO(this->get_logger(), "Found %s", engine_file.c_str());
net_ptr_.reset(new yolo::Net(engine_file, false));

Next come the publishers, along with the host-side output buffers:

objects_pub_ = this->create_publisher<tier4_perception_msgs::msg::DetectedObjectsWithFeature>(
  "out/objects", 1);
image_pub_ = image_transport::create_publisher(this, "out/image");
out_scores_ =
  std::make_unique<float[]>(net_ptr_->getMaxBatchSize() * net_ptr_->getMaxDetections());
out_boxes_ =
  std::make_unique<float[]>(net_ptr_->getMaxBatchSize() * net_ptr_->getMaxDetections() * 4);
out_classes_ =
  std::make_unique<float[]>(net_ptr_->getMaxBatchSize() * net_ptr_->getMaxDetections());

What is interesting here is that the image subscription is not created directly in the constructor as before. It lives in the connectCb function, and a timer_ set up in the constructor drives that function. The effect is that image_sub_ is opened or shut down dynamically according to the number of subscribers on the outputs, which saves resources. (PS: I have seen this logic in many hardware drivers.)

void TensorrtYoloNodelet::connectCb()
{
  using std::placeholders::_1;
  std::lock_guard<std::mutex> lock(connect_mutex_);
  if (objects_pub_->get_subscription_count() == 0 && image_pub_.getNumSubscribers() == 0) {
    image_sub_.shutdown();
  } else if (!image_sub_) {
    image_sub_ = image_transport::create_subscription(
      this, "in/image", std::bind(&TensorrtYoloNodelet::callback, this, _1), "raw",
      rmw_qos_profile_sensor_data);
  }
}