YOLOv8 | Understanding the Detection Head

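Code:

The last layer in the YAML structure takes the outputs of three earlier layers (15, 18, and 21) as input, which gives three detection heads: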

# YOLOv8.0n head
head:
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 6], 1, Concat, [1]] # cat backbone P4
  - [-1, 3, C2f, [512]] # 12

  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [[-1, 4], 1, Concat, [1]] # cat backbone P3
  - [-1, 3, C2f, [256]] # 15 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 12], 1, Concat, [1]] # cat head P4
  - [-1, 3, C2f, [512]] # 18 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 9], 1, Concat, [1]] # cat head P5
  - [-1, 3, C2f, [1024]] # 21 (P5/32-large)

  - [[15, 18, 21], 1, Detect, [nc]] # Detect(P3, P4, P5)
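
To make the wiring concrete, here is a minimal sketch that resolves each entry's `from` field into absolute layer indices. It assumes the head starts at layer 10, which is inferred from the `# 12` comment on the third entry; `resolve_sources` and `FIRST_HEAD_INDEX` are hypothetical names used only for illustration, not part of the Ultralytics code.

```python
# Head entries as (from, module) pairs, copied from the YAML above.
HEAD = [
    (-1, "nn.Upsample"),
    ([-1, 6], "Concat"),
    (-1, "C2f"),
    (-1, "nn.Upsample"),
    ([-1, 4], "Concat"),
    (-1, "C2f"),
    (-1, "Conv"),
    ([-1, 12], "Concat"),
    (-1, "C2f"),
    (-1, "Conv"),
    ([-1, 9], "Concat"),
    (-1, "C2f"),
    ([15, 18, 21], "Detect"),
]

# Assumption: inferred from the "# 12" comment on the third head entry.
FIRST_HEAD_INDEX = 10


def resolve_sources(from_field, layer_index):
    """Turn a `from` field into absolute layer indices (hypothetical helper)."""
    refs = from_field if isinstance(from_field, list) else [from_field]
    # A negative value is relative to the current layer; others are absolute.
    return [layer_index + r if r < 0 else r for r in refs]


resolved = {
    FIRST_HEAD_INDEX + n: (module, resolve_sources(src, FIRST_HEAD_INDEX + n))
    for n, (src, module) in enumerate(HEAD)
}

for index, (module, sources) in resolved.items():
    print(index, module, sources)
```

Under these assumptions the last entry is layer 22, a `Detect` module fed by layers 15, 18, and 21, which matches the P3/P4/P5 comments in the YAML.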

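Key points of the detection head code:

1. Initialization (`__init__` method)

The initialization method sets up the model's parameters and network structure.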

def __init__(self, nc=80, ch=()):
    """Initializes the YOLOv8 detection layer with specified number of classes and channels."""
    super().__init__()
    self.nc = nc  # number of classes
    self.nl = len(ch)  # number of detection layers
    self.reg_max = 16  # DFL channels (ch[0] // 16 to scale 4/8/12/16/20 for n/s/m/l/x)
    self.no = nc + self.reg_max * 4  # number of outputs per anchor
    self.stride = torch.zeros(self.nl)  # strides computed during build
    c2, c3 = max((16, ch[0] // 4, self.reg_max * 4)), max(ch[0], min(self.nc, 100))  # channels
    self.cv2 = nn.ModuleList(
        nn.Sequential(Conv(x, c2, 3), Conv(c2, c2, 3), nn.Conv2d(c2, 4 * self.reg_max, 1)) for x in ch
    )
    self.cv3 = nn.ModuleList(
        nn.Sequential(Conv(x, c3, 3), Conv(c3, c3, 3), nn.Conv2d(c3, self.nc, 1)) for x in ch
    )
    self.dfl = DFL(self.reg_max) if self.reg_max > 1 else nn.Identity()

    if self.end2end:
        self.one2one_cv2 = copy.deepcopy(self.cv2)
        self.one2one_cv3 = copy.deepcopy(self.cv3)

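Key points:

self.nc: the number of classes the model predicts.
self.nl: the number of detection layers, equal to the number of input feature maps (len(ch)).
self.reg_max: the number of DFL channels (bins) per box side, used to improve bounding box regression precision.
self.no: the number of outputs per anchor (number of classes + 4 * reg_max regression channels).
c2 and c3: the intermediate channel counts for box regression and class classification, respectively.
self.cv2 and self.cv3: the convolution sequences for box regression and class classification, one per detection layer.
self.dfl: the DFL module, which converts the multi-channel (per-bin) representation into one distance value per box side, from which the actual coordinates are decoded.
In end-to-end mode (end2end), cv2 and cv3 are deep-copied to create one2one_cv2 and one2one_cv3.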
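
The channel arithmetic in `__init__` can be checked by hand. The sketch below reproduces only the formulas for `no`, `c2`, and `c3` in plain Python; `detect_head_channels` is a hypothetical helper, and the input channel tuples `(64, 128, 256)` and `(320, 640, 640)` are assumed values for a small and a large model variant rather than something stated in the text above.

```python
def detect_head_channels(nc, ch, reg_max=16):
    """Reproduce the channel formulas from Detect.__init__ (illustrative helper)."""
    no = nc + reg_max * 4                  # outputs per anchor: classes + 4 sides * reg_max bins
    c2 = max(16, ch[0] // 4, reg_max * 4)  # hidden width of the box-regression branch
    c3 = max(ch[0], min(nc, 100))          # hidden width of the classification branch
    return no, c2, c3


# Assumed input channels for a small variant (P3, P4, P5).
small = detect_head_channels(nc=80, ch=(64, 128, 256))
# Assumed input channels for a large variant.
large = detect_head_channels(nc=80, ch=(320, 640, 640))

print(small)
print(large)
```

With 80 classes and `reg_max = 16`, every anchor carries 144 output channels: 64 for the box distribution and 80 for the class scores. Note that `c2` never drops below `4 * reg_max`, so the regression branch is at least as wide as its final output.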

2. Forward pass (`forward` method)

The forward method processes the input feature maps and produces the detection outputs.

def forward(self, x):
    """Concatenates and returns predicted bounding boxes and class probabilities."""
    if self.end2end:
        return self.forward_end2end(x)

    for i in range(self.nl):
        x[i] = torch.cat((self.cv2[i](x[i]), self.cv3[i](x[i])), 1)
    if self.training:  # Training path
        return x
    y = self._inference(x)
    return y if self.export else (y, x)

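Purpose of the return value

The concatenated feature maps x are passed to the loss module, which computes the loss and backpropagates gradients to update the network parameters. The individual steps (loss computation, backpropagation, parameter update) are listed after the key points.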

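Key points:
- In end-to-end mode (end2end), forward_end2end is called instead.
- Otherwise, each detection layer runs its forward pass, and the box regression and class classification outputs are concatenated.
- In training mode, the concatenated result is returned directly.
- In inference mode, the _inference method decodes the predictions and the final detection result is returned.
- For each of the self.nl detection layers, the outputs of cv2 and cv3 are concatenated.
- cv2[i] is the convolution sequence for box regression, and cv3[i] is the convolution sequence for class classification.
- The concatenated result is a feature map that holds both box regression and class classification information.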
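
To see what the concatenated feature maps look like, the following sketch computes their shapes in plain Python. It assumes a 640x640 input, strides of 8, 16, and 32 (taken from the P3/8, P4/16, P5/32 comments in the YAML), and `no = 144` (80 classes with `reg_max = 16`); `training_output_shapes` is a hypothetical helper.

```python
def training_output_shapes(batch, img_size, strides, no):
    """Shape (B, no, H, W) of each concatenated map returned in training mode."""
    return [(batch, no, img_size // s, img_size // s) for s in strides]


shapes = training_output_shapes(batch=1, img_size=640, strides=(8, 16, 32), no=144)
# Each spatial cell is one anchor point, so the total is the sum of H * W.
total_anchors = sum(h * w for _, _, h, w in shapes)

for shape in shapes:
    print(shape)
print(total_anchors)
```

Under these assumptions the three maps are 80x80, 40x40, and 20x20, for 8400 anchor points in total. This is also the length of the last dimension after `_inference` flattens and concatenates the maps.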
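Return value in training mode

In training mode, the concatenated result x is returned directly. It consists of the cv2 and cv3 outputs joined into one feature map per detection layer. Specifically:
The output of cv2[i] is the box regression feature map.
The output of cv3[i] is the class classification feature map.
The two are concatenated along the channel dimension, forming one feature map that holds both box and class information.
Loss computation:
- The concatenated feature maps x contain the predicted boxes and class scores.
- The loss is computed between these predictions and the ground-truth labels. It typically includes box regression losses (e.g., DFL loss, which YOLOv8 combines with an IoU-based loss) and a classification loss (e.g., binary cross-entropy).
Backpropagation:
- From the computed loss, backpropagation derives the gradient of every parameter.
- It starts at the loss function and computes the gradients layer by layer, moving backward through the network.
Parameter update:
- An optimizer (e.g., SGD or Adam) updates the network's weights and biases according to the computed gradients.
- The optimizer's learning rate controls the step size of each update.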
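
The loss, backpropagation, and update cycle described above can be illustrated with a one-parameter toy model. This is a generic gradient descent sketch with a squared-error loss and a hand-derived gradient; it is not the YOLOv8 loss or optimizer, and the values of `x`, `y`, and `lr` are arbitrary.

```python
def sgd_fit(x, y, lr=0.1, steps=50):
    """Fit w in pred = w * x to a target y with plain gradient descent."""
    w = 0.0
    for _ in range(steps):
        pred = w * x                # forward pass
        grad = 2 * (pred - y) * x   # d/dw of the loss (pred - y) ** 2
        w -= lr * grad              # step size is controlled by the learning rate
    return w


w = sgd_fit(x=2.0, y=4.0)
print(w)
```

Here the weight approaches 2.0, the value at which `w * x` equals `y`. A larger learning rate takes bigger steps and can overshoot; a smaller one converges more slowly.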

3. End-to-end forward pass (`forward_end2end` method)

This method is used in end-to-end mode and produces two sets of detection outputs: one2many and one2one.

def forward_end2end(self, x):
    """
    Performs forward pass of the v10Detect module.

    Args:
        x (tensor): Input tensor.

    Returns:
        (dict, tensor): If not in training mode, returns a dictionary containing the outputs of both one2many and one2one detections.
            If in training mode, returns a dictionary containing the outputs of one2many and one2one detections separately.
    """
    x_detach = [xi.detach() for xi in x]
    one2one = [
        torch.cat((self.one2one_cv2[i](x_detach[i]), self.one2one_cv3[i](x_detach[i])), 1) for i in range(self.nl)
    ]
    for i in range(self.nl):
        x[i] = torch.cat((self.cv2[i](x[i]), self.cv3[i](x[i])), 1)
    if self.training:  # Training path
        return {"one2many": x, "one2one": one2one}

    y = self._inference(one2one)
    y = self.postprocess(y.permute(0, 2, 1), self.max_det, self.nc)
    return y if self.export else (y, {"one2many": x, "one2one": one2one})

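Key points:
- The input feature maps are detached (x_detach) and fed to the one2one branch, so gradients from that branch do not flow back into the earlier layers; the one2many branch uses the original feature maps.
- In training mode, both sets of outputs are returned.
- In inference mode, only the one2one output is decoded and post-processed to produce the final detection result.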

4. Inference path (`_inference` method)

The inference path decodes the predicted bounding boxes and class probabilities.

def _inference(self, x):
    """Decode predicted bounding boxes and class probabilities based on multiple-level feature maps."""
    shape = x[0].shape  # BCHW
    x_cat = torch.cat([xi.view(shape[0], self.no, -1) for xi in x], 2)
    if self.dynamic or self.shape != shape:
        self.anchors, self.strides = (x.transpose(0, 1) for x in make_anchors(x, self.stride, 0.5))
        self.shape = shape

    if self.export and self.format in {"saved_model", "pb", "tflite", "edgetpu", "tfjs"}:  # avoid TF FlexSplitV ops
        box = x_cat[:, : self.reg_max * 4]
        cls = x_cat[:, self.reg_max * 4 :]
    else:
        box, cls = x_cat.split((self.reg_max * 4, self.nc), 1)

    if self.export and self.format in {"tflite", "edgetpu"}:
        grid_h = shape[2]
        grid_w = shape[3]
        grid_size = torch.tensor([grid_w, grid_h, grid_w, grid_h], device=box.device).reshape(1, 4, 1)
        norm = self.strides / (self.stride[0] * grid_size)
        dbox = self.decode_bboxes(self.dfl(box) * norm, self.anchors.unsqueeze(0) * norm[:, :2])
    else:
        dbox = self.decode_bboxes(self.dfl(box), self.anchors.unsqueeze(0)) * self.strides

    return torch.cat((dbox, cls.sigmoid()), 1)

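Key points:
- The predictions from all feature maps are flattened and concatenated.
- Anchor points and strides are computed, and recomputed whenever the input shape changes or dynamic mode is on.
- The concatenated tensor is split into box channels and class channels.
- DFL converts the multi-channel representation into distances, which decode_bboxes turns into box coordinates that are then scaled by the strides.
- The decoded boxes are returned together with the sigmoid class probabilities.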
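
Two of the steps above are easy to work through by hand: building the anchor points and turning one side's DFL bins into a distance. The sketch below does both in plain Python for a single feature map and a single box side. It assumes that DFL applies a softmax over the `reg_max` bins and takes the expected bin index, and that anchor points are cell centers in row-major order; `make_anchor_points` and `dfl_expectation` are hypothetical helpers, not the library functions.

```python
import math


def make_anchor_points(h, w, offset=0.5):
    """Cell-center coordinates (x, y) in grid units, row-major order."""
    return [(x + offset, y + offset) for y in range(h) for x in range(w)]


def dfl_expectation(logits):
    """Softmax over the bins, then the expected bin index (a distance in grid units)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return sum(i * e / total for i, e in enumerate(exps))


anchors = make_anchor_points(2, 2)

uniform = dfl_expectation([0.0] * 16)  # no bin preferred
peaked_logits = [0.0] * 16
peaked_logits[3] = 50.0                # strongly prefers bin 3
peaked = dfl_expectation(peaked_logits)

print(anchors)
print(uniform, peaked)
```

With uniform logits the expected distance is the midpoint of the bin range (7.5 for 16 bins); with one dominant bin it is very close to that bin's index. Because the result is an expectation, the predicted distance can fall between integer bins.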

5. Decoding bounding boxes (`decode_bboxes` method)

This method converts the encoded bounding boxes into actual bounding box coordinates.

def decode_bboxes(self, bboxes, anchors):
    """Decode bounding boxes."""
    return dist2bbox(bboxes, anchors, xywh=not self.end2end, dim=1)

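Key points:
- The dist2bbox function converts the encoded boxes (distances from an anchor point to the four box sides) into actual box coordinates: xywh in the default mode, xyxy in end-to-end mode.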
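
As a worked example of this conversion, the sketch below decodes one box in plain Python. It assumes the distances are ordered left, top, right, bottom and are measured in grid units from the anchor point, and that the result is multiplied by the stride afterwards, as in `_inference`. `dist2bbox_single` is a hypothetical scalar stand-in for the tensor-based `dist2bbox`.

```python
def dist2bbox_single(dist, anchor, xywh=True):
    """Convert (left, top, right, bottom) distances around an anchor into a box."""
    left, top, right, bottom = dist
    ax, ay = anchor
    x1, y1 = ax - left, ay - top       # top-left corner
    x2, y2 = ax + right, ay + bottom   # bottom-right corner
    if xywh:
        return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)
    return (x1, y1, x2, y2)


stride = 8
anchor = (10.5, 20.5)         # center of grid cell (10, 20)
dist = (2.0, 3.0, 4.0, 5.0)   # arbitrary example distances

box_xywh = tuple(v * stride for v in dist2bbox_single(dist, anchor, xywh=True))
box_xyxy = tuple(v * stride for v in dist2bbox_single(dist, anchor, xywh=False))

print(box_xywh)
print(box_xyxy)
```

Because the left and right distances differ, the decoded box center (11.5, 21.5) does not coincide with the anchor point (10.5, 20.5). Multiplying by the stride maps grid units back to input-image pixels.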

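6. Post-processing (`postprocess` method)

The post-processing method keeps the highest-scoring boxes (up to max_det) and returns the final detection result.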

@staticmethod
def postprocess(preds: torch.Tensor, max_det: int, nc: int = 80):
    """
    Post-processes the predictions obtained from a YOLOv10 model.

    Args:
        preds (torch.Tensor): The predictions obtained from the model. It should have a shape of (batch_size, num_boxes, 4 + num_classes).
        max_det (int): The maximum number of detections to keep.
        nc (int, optional): The number of classes. Defaults to 80.

    Returns:
        (torch.Tensor): The post-processed predictions with shape (batch_size, max_det, 6),
            including bounding boxes, scores and cls.
    """
    assert 4 + nc == preds.shape[-1]
    boxes, scores = preds.split([4, nc], dim=-1)
    max_scores, index = torch.topk(scores.amax(dim=-1), min(max_det, scores.shape[1]), axis=-1)
    index = index.unsqueeze(-1)
    boxes = torch.gather(boxes, dim=1, index=index.repeat(1, 1, boxes.shape[-1]))
    scores = torch.gather(scores, dim=1, index=index.repeat(1, 1, scores.shape[-1]))

    scores, index = torch.topk(scores.flatten(1), max_det, axis=-1)
    labels = index % nc
    index = index // nc
    boxes = boxes.gather(dim=1, index=index.unsqueeze(-1).repeat(1, 1, boxes.shape[-1]))

    return torch.cat([boxes, scores.unsqueeze(-1), labels.unsqueeze(-1).to(boxes.dtype)], dim=-1)
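
The two-stage top-k selection in `postprocess` is easier to follow on a tiny example. The sketch below mirrors the logic for a single image using plain Python lists instead of tensors. `postprocess_single` is a hypothetical helper; unlike the original, it also caps the second top-k at the number of available scores so that small inputs do not fail.

```python
def postprocess_single(boxes, scores, max_det):
    """Two-stage top-k for one image; boxes are 4-element lists, scores are per-class lists."""
    nc = len(scores[0])
    # Stage 1: keep the anchors whose best class score is highest.
    k = min(max_det, len(scores))
    keep = sorted(range(len(scores)), key=lambda i: max(scores[i]), reverse=True)[:k]
    boxes = [boxes[i] for i in keep]
    scores = [scores[i] for i in keep]
    # Stage 2: flatten all (anchor, class) pairs and take the top scores overall.
    flat = [s for row in scores for s in row]
    top = sorted(range(len(flat)), key=lambda j: flat[j], reverse=True)[: min(max_det, len(flat))]
    # A flat index j maps back to anchor j // nc and class label j % nc.
    return [boxes[j // nc] + [flat[j], float(j % nc)] for j in top]


boxes = [[0, 0, 1, 1], [1, 1, 2, 2], [2, 2, 3, 3]]
scores = [[0.1, 0.9], [0.8, 0.7], [0.2, 0.3]]

top2 = postprocess_single(boxes, scores, max_det=2)
top3 = postprocess_single(boxes, scores, max_det=3)

print(top2)
print(top3)
```

With `max_det=3`, the second box appears twice, once per class, because the second stage ranks (anchor, class) pairs rather than anchors. No NMS step is involved; the selection relies purely on scores.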