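[Object Detection] YOLOv7 Implementation (2): Positive Sample Matching (SimOTA) and Loss Computation

This series documents how I taught myself the YOLO family of object detection algorithms during my master's studies and implemented them in code. The implementation borrows from the ultralytics YOLO source code on GitHub, with parts of the original removed to fit my own research needs.
Building on the YOLOv5 implementation, this article continues with YOLOv7. Compared with YOLOv5, the main differences are:

Model structure: a more efficient feature extraction module (ELAN), a downsampling module (MP), a different spatial pooling layer (SPPCSPC), and re-parameterized convolution (RepConv)
Positive sample matching: combines the positive sample matching method of YOLOv5 with the positive sample filtering method of YOLOX (SimOTA)

Articles in this series:
YOLOv7 Implementation (1): Model Construction
YOLOv7 Implementation (2): Positive Sample Matching (SimOTA) and Loss Computation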

Contents

0 Introduction
1 Positive Sample Matching
2 Loss Computation
3 Code Implementation
- 3.1 Positive Sample Matching
- 3.2 Loss Computation

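0 Introduction

YOLOv7 keeps the positive sample matching of YOLOv5 and then filters the matched candidates with SimOTA. The loss computation pipeline is shown in Figure 1.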

Figure 1. YOLOv7 loss computation pipeline

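1 Positive Sample Matching

The YOLOv5 positive sample matching method is described in the article "YOLOv5 Implementation (4): Loss Computation". On each feature map, YOLOv5 matches a target with at most three prediction cells, chosen from the position of the target center. In each of those cells it matches the target with at most three anchors, chosen by width/height ratio. After YOLOv5 matching, a target therefore has at most 27 candidate samples (3 feature maps × 3 cells × 3 anchors).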
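To make the upper bound of 27 concrete, the sketch below applies the cell rule and the anchor rule to a single target on a single feature map, in plain Python. The function names `candidate_cells` and `matching_anchors`, the example anchors, and the threshold `anchor_t=4.0` are assumptions chosen for illustration. The vectorized version actually used is `find_3_positive` in Section 3.1.

```python
def candidate_cells(gx, gy, grid_w, grid_h, g=0.5):
    """Return the (gi, gj) cells that may predict a target centered at (gx, gy).

    Coordinates are in feature-map units. The center cell is always used; a
    neighbour is added when the center lies in the half of the cell facing it.
    """
    cells = [(int(gx), int(gy))]
    if gx % 1.0 < g and gx > 1.0:                        # left half -> left neighbour
        cells.append((int(gx - g), int(gy)))
    if gy % 1.0 < g and gy > 1.0:                        # upper half -> upper neighbour
        cells.append((int(gx), int(gy - g)))
    if (grid_w - gx) % 1.0 < g and (grid_w - gx) > 1.0:  # right half -> right neighbour
        cells.append((int(gx + g), int(gy)))
    if (grid_h - gy) % 1.0 < g and (grid_h - gy) > 1.0:  # lower half -> lower neighbour
        cells.append((int(gx), int(gy + g)))
    return cells


def matching_anchors(gw, gh, anchors, anchor_t=4.0):
    """Indices of anchors whose width/height ratio to the target stays below anchor_t."""
    keep = []
    for idx, (aw, ah) in enumerate(anchors):
        ratio = max(gw / aw, aw / gw, gh / ah, ah / gh)
        if ratio < anchor_t:
            keep.append(idx)
    return keep


# Hypothetical target on a 20 x 20 feature map (all values in feature-map units).
cells = candidate_cells(10.3, 7.8, grid_w=20, grid_h=20)
anchors = [(1.25, 1.625), (2.0, 3.75), (4.125, 2.875)]  # illustrative anchor sizes
anchor_ids = matching_anchors(3.0, 4.0, anchors)

print(cells)                         # center cell plus its left and lower neighbours
print(anchor_ids)                    # all three anchors pass the ratio test
print(len(cells) * len(anchor_ids))  # candidates contributed by this feature map
```

With three cells and three anchors this feature map contributes 9 candidates, so three feature maps give the upper bound of 27.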
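The SimOTA filtering procedure is as follows:

Compute the IoU loss between each ground-truth target (nt in total) and each matched candidate (nt_n in total):
$pair\_wise\_iou\_loss = - \log (iou)$
Compute the class cross-entropy loss between each ground-truth target and each matched candidate, summed over classes (in the code of Section 3.1 the predicted probability $\sigma ({y_{pred}})$ is the square root of the class probability times the objectness probability):
$pair\_wise\_cls\_loss = - y\log (\sigma ({y_{pred}})) - (1 - y)\log (1 - \sigma ({y_{pred}}))$
Determine dynamic_k, the number of candidates kept for each ground-truth target, from the sum of that target's largest IoUs (at most 10 IoUs are summed, the sum is truncated to an integer, and the result is at least 1)
Compute the total cost of each pair:
$pair\_wise\_loss = pair\_wise\_cls\_loss + 3 \cdot pair\_wise\_iou\_loss$
For each target, keep the dynamic_k candidates with the lowest total cost; a candidate selected by several targets is assigned to the target with the lowest cost
Suppose a target (class 3) obtains 7 candidates in a training batch. Figure 2 shows how SimOTA filters them.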


Figure 2. SimOTA worked example
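The filtering steps can also be traced in a few lines of plain Python. This is a minimal sketch under stated assumptions: `simota_select` is a hypothetical helper, the IoU and class-loss values are invented for illustration (they are not the numbers from Figure 2), and nested lists stand in for the `[nt, n_gt]` tensors used in Section 3.1.

```python
import math


def simota_select(pair_iou, pair_cls_loss, max_candidates=10, iou_weight=3.0):
    """Toy SimOTA filter; both inputs are [num_gt][num_candidates] nested lists."""
    num_gt, num_cand = len(pair_iou), len(pair_iou[0])
    # cost = cls_loss + 3 * iou_loss, with iou_loss = -log(iou)
    cost = [[pair_cls_loss[g][c] - iou_weight * math.log(pair_iou[g][c] + 1e-8)
             for c in range(num_cand)] for g in range(num_gt)]

    dynamic_ks = []
    selected = [[False] * num_cand for _ in range(num_gt)]
    for g in range(num_gt):
        # dynamic_k: truncated sum of the largest IoUs, at least 1
        top_ious = sorted(pair_iou[g], reverse=True)[:max_candidates]
        k = max(1, int(sum(top_ious)))
        dynamic_ks.append(k)
        # keep the k candidates with the lowest cost for this target
        for c in sorted(range(num_cand), key=lambda c: cost[g][c])[:k]:
            selected[g][c] = True

    # a candidate claimed by several targets goes to the target with the lowest cost
    matches = {}
    for c in range(num_cand):
        owners = [g for g in range(num_gt) if selected[g][c]]
        if len(owners) > 1:
            matches[c] = min(range(num_gt), key=lambda g: cost[g][c])
        elif owners:
            matches[c] = owners[0]
    return dynamic_ks, matches


# Two hypothetical targets sharing the same 7 candidates.
pair_iou = [
    [0.82, 0.75, 0.60, 0.41, 0.33, 0.20, 0.05],
    [0.10, 0.15, 0.75, 0.80, 0.30, 0.05, 0.02],
]
pair_cls_loss = [
    [0.9, 0.4, 0.7, 0.5, 1.2, 0.3, 0.8],
    [1.0, 1.0, 0.6, 0.5, 1.0, 1.0, 1.0],
]
dynamic_ks, matches = simota_select(pair_iou, pair_cls_loss)
print(dynamic_ks)  # number of candidates each target may keep
print(matches)     # candidate index -> index of the target it is assigned to
```

In this example the IoU sums are 3.16 and 2.17, so the targets keep 3 and 2 candidates. Candidate 2 is selected by both targets and ends up with the second one, whose cost for it is lower.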

2 Loss Computation

YOLOv7 computes the loss in the same way as YOLOv5. It has three parts:

Box loss (positive samples only):
$IoULoss = 1 - CIoU$


Figure 3. Common IoU variants
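For reference, the CIoU term can be written out for one pair of boxes. The sketch below uses the commonly cited definition (IoU minus a normalized center-distance penalty minus an aspect-ratio penalty) in plain Python. The function name `ciou` and the corner-format inputs are assumptions for illustration; the code in Section 3.2 calls the project's own `bbox_iou(..., CIoU=True)` instead.

```python
import math


def ciou(box1, box2, eps=1e-7):
    """Complete IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    x1, y1, x2, y2 = box1
    X1, Y1, X2, Y2 = box2
    w1, h1 = x2 - x1, y2 - y1
    w2, h2 = X2 - X1, Y2 - Y1

    # plain IoU
    inter_w = max(0.0, min(x2, X2) - max(x1, X1))
    inter_h = max(0.0, min(y2, Y2) - max(y1, Y1))
    inter = inter_w * inter_h
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union

    # squared diagonal of the smallest box enclosing both boxes
    cw = max(x2, X2) - min(x1, X1)
    ch = max(y2, Y2) - min(y1, Y1)
    c2 = cw ** 2 + ch ** 2 + eps
    # squared distance between the two box centers
    rho2 = ((x1 + x2 - X1 - X2) ** 2 + (y1 + y2 - Y1 - Y2) ** 2) / 4.0

    # aspect-ratio consistency term and its weight
    v = (4.0 / math.pi ** 2) * (math.atan(w2 / (h2 + eps)) - math.atan(w1 / (h1 + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v


same = ciou((0, 0, 10, 10), (0, 0, 10, 10))
apart = ciou((0, 0, 10, 10), (20, 20, 30, 30))
print(1.0 - same)   # box loss is close to 0 for identical boxes
print(1.0 - apart)  # box loss exceeds 1 for disjoint boxes because of the distance penalty
```

Unlike plain IoU, the loss keeps growing as disjoint boxes move apart, which gives a useful gradient even when the prediction does not overlap the target.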

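Class loss (positive samples only):
$L_{cls} = - \sum\limits_{i = 1}^{nf} \left\{ {1 \over n}\sum\limits_{j = 1}^{n} \left[ {1 \over {nc}}\sum\limits_{k = 1}^{nc} \left( y_{ijk}\log (\sigma (p_{ijk})) + (1 - y_{ijk})\log (1 - \sigma (p_{ijk})) \right) \right] \right\}$
Objectness (confidence) loss (all samples):
$L_{obj} = - \sum\limits_{i = 1}^{nf} \left\{ {1 \over {na}}\sum\limits_{j = 1}^{na} \left[ {1 \over {gridy \times gridx}}\sum\limits_{m = 1}^{gridy} \sum\limits_{n = 1}^{gridx} \left( y_{ijmn}\log (\sigma (p_{ijmn})) + (1 - y_{ijmn})\log (1 - \sigma (p_{ijmn})) \right) \right] \right\}$
Here nf is the number of output feature maps, nc the number of classes, na the number of anchors per cell, and gridy × gridx the size of feature map i. In the class loss, n is the number of positive samples on feature map i. y is the target label and p the predicted logit. In the code of Section 3.2, the objectness term of each feature map is additionally scaled by balance[i], and the three parts are weighted by hyp['box'], hyp['obj'] and hyp['cls'].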
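As a small numeric check of the binary cross-entropy terms, the sketch below evaluates them in plain Python for one positive sample and for a handful of objectness cells. The helper `bce_with_logits` and all logit values are assumptions for illustration. It mirrors the mean-reduced behaviour of a BCE-with-logits loss but ignores options such as `pos_weight`, label smoothing and focal loss that appear in Section 3.2.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over paired logits and targets in [0, 1]."""
    total = 0.0
    for p, y in zip(logits, targets):
        prob = 1.0 / (1.0 + math.exp(-p))  # sigmoid
        total += -(y * math.log(prob) + (1.0 - y) * math.log(1.0 - prob))
    return total / len(logits)


# Class loss for one positive sample with nc = 3 classes; the true class is index 2.
class_logits = [-2.0, -1.0, 3.0]
class_target = [0.0, 0.0, 1.0]
cls_loss = bce_with_logits(class_logits, class_target)

# Objectness loss over four cells: the positive cell uses its IoU (0.8 here)
# as a soft target, the three negative cells use 0.
obj_logits = [2.0, -3.0, -4.0, -2.5]
obj_target = [0.8, 0.0, 0.0, 0.0]
obj_loss = bce_with_logits(obj_logits, obj_target)

print(cls_loss)
print(obj_loss)
```

The class loss averages over the nc classes of each positive sample only, while the objectness loss averages over every cell, which is why the negatives dominate it in practice.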

3 Code Implementation

3.1 Positive Sample Matching

YOLOv5 matching method

    def find_3_positive(self, p, targets):
        # Build targets for compute_loss()
        # targets: [num_gt, (image_index, class, x, y, w, h)], normalized coordinates
        # p: [num_feature_map, bs, na, y, x, (x, y, w, h, obj, classes)], relative coordinates
        # na: number of anchors per feature map; nt: number of ground-truth targets in the batch
        na, nt = self.na, targets.shape[0]  # number of anchors, targets
        indices, anch = [], []
        # gain maps the normalized xywh in targets=[na, nt, 7] onto the feature-map scale
        # layout: image_index + class + xywh + anchor_index
        gain = torch.ones(7, device=targets.device).long()
        ai = torch.arange(na, device=targets.device).float().view(na, 1).repeat(1, nt)  # same as .repeat_interleave(nt)
        # targets: [na, num_gt, (image_index, class, x, y, w, h, anchor_index)]
        # every anchor of a feature map is tried against every target
        targets = torch.cat((targets.repeat(na, 1, 1), ai[:, :, None]), 2)  # append anchor indices

        # neighbouring-cell offsets
        g = 0.5  # bias
        off = torch.tensor([[0, 0],
                            [1, 0], [0, 1], [-1, 0], [0, -1],  # j,k,l,m
                            # [1, 1], [1, -1], [-1, 1], [-1, -1],  # jk,jm,lk,lm
                            ], device=targets.device).float() * g  # offsets

        # match positives on every feature-map scale
        for i in range(self.nl):
            anchors = self.anchors[i]  # anchors of this feature map, in feature-map units
            # xyxy gain: converts the normalized (x, y, w, h) in targets to feature-map coordinates
            gain[2:6] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]  # xyxy gain
            t = targets * gain
            if nt:
                # select anchors by the ratio between target wh and anchor wh
                r = t[:, :, 4:6] / anchors[:, None]  # wh ratio
                # torch.max(r, 1. / r).max(2) -> returns (values, indices)
                j = torch.max(r, 1. / r).max(2)[0] < self.hyp['anchor_t']  # compare
                # j = wh_iou(anchors, t[:, 4:6]) > model.hyp['iou_t']  # iou(3,n)=wh_iou(anchors(3,2), gwh(n,2))
                # keep only the (target, anchor) pairs that pass the ratio test
                t = t[j]

                # Offsets
                gxy = t[:, 2:4]  # center measured from the top-left corner (selects left / upper cells)
                gxi = gain[[2, 3]] - gxy  # center measured from the bottom-right corner (selects right / lower cells)
                j, k = ((gxy % 1. < g) & (gxy > 1.)).T
                l, m = ((gxi % 1. < g) & (gxi > 1.)).T
                j = torch.stack((torch.ones_like(j), j, k, l, m))
                # replicate t five times and keep the copies selected by j
                t = t.repeat((5, 1, 1))[j]
                # offsets in the order [0, 0], [1, 0], [0, 1], [-1, 0], [0, -1]
                offsets = (torch.zeros_like(gxy)[None] + off[:, None])[j]
            else:
                t = targets[0]
                offsets = 0

            # Define
            b, c = t[:, :2].long().T  # image indices, class
            gxy = t[:, 2:4]  # grid xy, in feature-map coordinates
            gwh = t[:, 4:6]  # grid wh
            gij = (gxy - offsets).long()  # subtract the offset to get the matched cell
            gi, gj = gij.T  # grid xy indices

            # Append
            a = t[:, 6].long()  # anchor indices
            # image indices, anchor indices, gj, gi
            indices.append((b, a, gj.clamp_(0, gain[3] - 1), gi.clamp_(0, gain[2] - 1)))
            # anchor size of each positive sample, in units of the current feature map
            anch.append(anchors[a])  # anchors

        return indices, anch

SimOTA positive sample filtering

    # F below is torch.nn.functional; xywh2xyxy and box_iou are the project's box utilities.
    def build_targets(self, p, targets, imgs):
        '''
        :param p: [feature_map indices, bs, na, y, x, (x, y, w, h, obj, num_class)] regression outputs
        :param targets: [num_gt, (image_index, classes, x, y, w, h)] normalized coordinates
        :param imgs: [num_img, 3, y, x]
        '''
        # indices: [feature_map indices, image indices list, anchor indices list, gj, gi]
        # anch: anchor size of each positive sample (in units of its feature map)
        '''
        1. Use the target center (x, y) to choose the positive cells (gj, gi);
           use the ratio between the target w, h and the anchors to choose the anchors matched in each cell
        indices: feature_map_list{image_indices_list, anchor indices_list, gj, gi}
        anch: feature_map_list{anchor_size}
        '''
        indices, anch = self.find_3_positive(p, targets)
        device = torch.device(targets.device)
        '''
        2. Filter the candidates from step 1 with the OTA algorithm
        For each image: nt ground-truth targets, n_gt candidate positive samples
        a. compute the IoU matrix [nt, n_gt] between every target and every candidate prediction, then the IoU loss [nt, n_gt]
        b. compute the class loss matrix [nt, n_gt] between every target and every candidate prediction
        c. determine dynamic_k (how many candidates each target keeps) from the sum of its top IoUs
        d. compute the cost matrix (cls_loss + 3 * iou_loss)
        e. use the cost matrix and dynamic_k to fix the feature_map, gj, gi, anchor of the positives kept for each target
        '''
        matching_bs = [[] for pp in p]  # images
        matching_as = [[] for pp in p]  # anchor
        matching_gjs = [[] for pp in p]  # gj
        matching_gis = [[] for pp in p]  # gi
        matching_targets = [[] for pp in p]  # matched targets
        matching_anchs = [[] for pp in p]  # corresponding anchor sizes
        nl = len(p)  # number of output feature maps

        # match positives image by image
        for batch_idx in range(p[0].shape[0]):
            b_idx = targets[:, 0] == batch_idx
            this_target = targets[b_idx]  # ground-truth targets of the current image
            if this_target.shape[0] == 0:
                continue
            # (x, y, w, h) in input-image pixels -> (xmin, ymin, xmax, ymax)
            txywh = this_target[:, 2:6] * imgs[batch_idx].shape[1]
            txyxy = xywh2xyxy(txywh)

            pxyxys = []  # predicted boxes
            p_cls = []  # predicted class scores
            p_obj = []  # predicted objectness scores
            from_which_layer = []  # feature map each prediction comes from
            all_b = []  # image indices (all feature maps)
            all_a = []  # anchor indices (all feature maps)
            all_gj = []  # gj (all feature maps)
            all_gi = []  # gi (all feature maps)
            all_anch = []  # anchor size (all feature maps)

            # collect the candidates of every feature map for the OTA cost computation
            for i, pi in enumerate(p):
                b, a, gj, gi = indices[i]  # image indices, anchor indices, gj, gi
                idx = (b == batch_idx)  # first-stage candidates that belong to the current image
                b, a, gj, gi = b[idx], a[idx], gj[idx], gi[idx]  # image indices, anchor indices, gj, gi
                all_b.append(b)  # matched image indices on feature map i
                all_a.append(a)  # matched anchor indices on feature map i
                all_gj.append(gj)  # matched gj on feature map i
                all_gi.append(gi)  # matched gi on feature map i
                all_anch.append(anch[i][idx])  # matched anchor sizes on feature map i (feature-map units)
                from_which_layer.append((torch.ones(size=(len(b),)) * i).to(device))  # source feature map

                fg_pred = pi[b, a, gj, gi]  # predictions of the candidates (x, y, w, h, obj, cls)
                p_obj.append(fg_pred[:, 4:5])  # predicted objectness
                p_cls.append(fg_pred[:, 5:])  # predicted class scores

                grid = torch.stack([gi, gj], dim=1)
                # decode (x, y) to input-image pixels
                pxy = (fg_pred[:, :2].sigmoid() * 2. - 0.5 + grid) * self.stride[i]  # / 8.
                # pxy = (fg_pred[:, :2].sigmoid() * 3. - 1. + grid) * self.stride[i]
                # decode (w, h) to input-image pixels
                pwh = (fg_pred[:, 2:4].sigmoid() * 2) ** 2 * anch[i][idx] * self.stride[i]  # / 8.
                # (x, y, w, h) in input-image pixels -> (xmin, ymin, xmax, ymax)
                pxywh = torch.cat([pxy, pwh], dim=-1)
                pxyxy = xywh2xyxy(pxywh)
                pxyxys.append(pxyxy)

            pxyxys = torch.cat(pxyxys, dim=0)  # predicted xyxy in input-image pixels
            if pxyxys.shape[0] == 0:
                continue
            p_obj = torch.cat(p_obj, dim=0)  # predicted objectness
            p_cls = torch.cat(p_cls, dim=0)  # predicted class scores
            from_which_layer = torch.cat(from_which_layer, dim=0)  # source feature map of each prediction
            all_b = torch.cat(all_b, dim=0)  # image index of each prediction
            all_a = torch.cat(all_a, dim=0)  # anchor index of each prediction
            all_gj = torch.cat(all_gj, dim=0)  # gj of each prediction
            all_gi = torch.cat(all_gi, dim=0)  # gi of each prediction
            all_anch = torch.cat(all_anch, dim=0)  # anchor size of each prediction (feature-map units)

            # IoU between txyxy and pxyxys (both in input-image pixels)
            # txyxy: [nt, 4], pxyxys: [np, 4] -> pair_wise_iou: [nt, np]
            pair_wise_iou = box_iou(txyxy, pxyxys)
            # IoU loss
            pair_wise_iou_loss = -torch.log(pair_wise_iou + 1e-8)

            # take at most the 10 largest IoUs of each target
            top_k, _ = torch.topk(pair_wise_iou, min(10, pair_wise_iou.shape[1]), dim=1)
            # dynamic_ks: number of positives kept per target, from the IoU sum; at least one
            dynamic_ks = torch.clamp(top_k.sum(1).int(), min=1)

            # one-hot encode the class labels and repeat them for every candidate
            gt_cls_per_image = (
                F.one_hot(this_target[:, 1].to(torch.int64), self.nc)  # one-hot labels: [nt, nc]
                .float()
                .unsqueeze(1)  # [nt, 1, nc]
                .repeat(1, pxyxys.shape[0], 1)  # [nt, n_gt, nc]
            )

            # number of targets in this image; repeat the predicted score
            # (class probability x objectness probability) for every target
            num_gt = this_target.shape[0]
            cls_preds_ = (
                p_cls.float().unsqueeze(0).repeat(num_gt, 1, 1).sigmoid_()
                * p_obj.unsqueeze(0).repeat(num_gt, 1, 1).sigmoid_()
            )
            y = cls_preds_.sqrt_()
            pair_wise_cls_loss = F.binary_cross_entropy_with_logits(
                torch.log(y / (1 - y)), gt_cls_per_image, reduction="none"
            ).sum(-1)  # class loss
            del cls_preds_

            cost = (
                pair_wise_cls_loss
                + 3.0 * pair_wise_iou_loss
            )
            matching_matrix = torch.zeros_like(cost, device=device)

            # for each target keep the dynamic_k candidates with the lowest cost
            for gt_idx in range(num_gt):
                _, pos_idx = torch.topk(cost[gt_idx], k=dynamic_ks[gt_idx].item(), largest=False)
                matching_matrix[gt_idx][pos_idx] = 1.0
            del top_k, dynamic_ks

            # when one candidate is matched to several targets, keep only the target with the lowest cost
            anchor_matching_gt = matching_matrix.sum(0)
            if (anchor_matching_gt > 1).sum() > 0:
                _, cost_argmin = torch.min(cost[:, anchor_matching_gt > 1], dim=0)
                matching_matrix[:, anchor_matching_gt > 1] *= 0.0
                matching_matrix[cost_argmin, anchor_matching_gt > 1] = 1.0
            fg_mask_inboxes = (matching_matrix.sum(0) > 0.0).to(device)  # candidates that remain positive
            matched_gt_inds = matching_matrix[:, fg_mask_inboxes].argmax(0)  # target index of each remaining candidate

            # keep only the candidates that survive the OTA filtering
            from_which_layer = from_which_layer[fg_mask_inboxes]
            all_b = all_b[fg_mask_inboxes]
            all_a = all_a[fg_mask_inboxes]
            all_gj = all_gj[fg_mask_inboxes]
            all_gi = all_gi[fg_mask_inboxes]
            all_anch = all_anch[fg_mask_inboxes]
            this_target = this_target[matched_gt_inds]

            # split the results by feature map
            for i in range(nl):
                layer_idx = from_which_layer == i
                matching_bs[i].append(all_b[layer_idx])
                matching_as[i].append(all_a[layer_idx])
                matching_gjs[i].append(all_gj[layer_idx])
                matching_gis[i].append(all_gi[layer_idx])
                matching_targets[i].append(this_target[layer_idx])
                matching_anchs[i].append(all_anch[layer_idx])

        # merge the positives of all images
        for i in range(nl):
            if matching_targets[i] != []:
                matching_bs[i] = torch.cat(matching_bs[i], dim=0)
                matching_as[i] = torch.cat(matching_as[i], dim=0)
                matching_gjs[i] = torch.cat(matching_gjs[i], dim=0)
                matching_gis[i] = torch.cat(matching_gis[i], dim=0)
                matching_targets[i] = torch.cat(matching_targets[i], dim=0)
                matching_anchs[i] = torch.cat(matching_anchs[i], dim=0)
            else:
                # use the device of the inputs instead of a hard-coded 'cuda:0' so CPU runs also work
                matching_bs[i] = torch.tensor([], device=device, dtype=torch.int64)
                matching_as[i] = torch.tensor([], device=device, dtype=torch.int64)
                matching_gjs[i] = torch.tensor([], device=device, dtype=torch.int64)
                matching_gis[i] = torch.tensor([], device=device, dtype=torch.int64)
                matching_targets[i] = torch.tensor([], device=device, dtype=torch.int64)
                matching_anchs[i] = torch.tensor([], device=device, dtype=torch.int64)

        return matching_bs, matching_as, matching_gjs, matching_gis, matching_targets, matching_anchs
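One step in `build_targets` is easy to misread: the class score is first formed as a probability, `sqrt(sigmoid(cls) * sigmoid(obj))`, and then passed through `torch.log(y / (1 - y))` before `binary_cross_entropy_with_logits`. That expression is the inverse of the sigmoid, so the loss ends up being evaluated on the combined probability itself. The plain-Python sketch below checks this identity for a single value; the helper names and the two logits are assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_from_prob(prob, target):
    """Binary cross-entropy evaluated directly on a probability."""
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def bce_from_logit(logit, target):
    """Binary cross-entropy evaluated on a logit, as a with-logits loss does."""
    return bce_from_prob(sigmoid(logit), target)


cls_logit, obj_logit = 1.2, 0.4  # hypothetical raw network outputs
y = math.sqrt(sigmoid(cls_logit) * sigmoid(obj_logit))  # combined class score
recovered_logit = math.log(y / (1.0 - y))               # inverse sigmoid of y

print(abs(sigmoid(recovered_logit) - y))  # numerically zero
print(bce_from_logit(recovered_logit, 1.0), bce_from_prob(y, 1.0))  # the two losses agree
```

Using the with-logits form on the recovered logit is mainly a numerical-stability choice; the quantity being scored is still the geometric mean of the class and objectness probabilities.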

3.2 Loss Computation

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    # smooth_BCE, FocalLoss, bbox_iou, box_iou and xywh2xyxy are helper functions from the
    # project's utility modules (as in the reference YOLO source); import them from wherever they live.

    class ComputeLossOTA:
        # Compute losses
        def __init__(self, model, autobalance=False):
            super(ComputeLossOTA, self).__init__()
            device = next(model.parameters()).device  # get model device
            h = model.hyp  # hyperparameters

            # Define criteria
            BCEcls = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([h['cls_pw']], device=device))
            BCEobj = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([h['obj_pw']], device=device))

            # Class label smoothing https://arxiv.org/pdf/1902.04103.pdf eqn 3
            self.cp, self.cn = smooth_BCE(eps=h.get('label_smoothing', 0.0))  # positive, negative BCE targets

            # Focal loss
            g = h['fl_gamma']  # focal loss gamma
            if g > 0:
                BCEcls, BCEobj = FocalLoss(BCEcls, g), FocalLoss(BCEobj, g)

            m = model.model[-1]  # Detect() module
            self.balance = {3: [4.0, 1.0, 0.4]}.get(m.nl, [4.0, 1.0, 0.25, 0.06, .02])  # P3-P7
            self.ssi = list(m.stride).index(16) if autobalance else 0  # stride 16 index
            self.BCEcls, self.BCEobj, self.gr, self.hyp, self.autobalance = BCEcls, BCEobj, 1.0, h, autobalance
            self.na = m.na  # number of anchors
            self.nc = m.nc  # number of classes
            self.nl = m.nl  # number of output feature maps
            self.anchors = m.anchors  # anchors [3, 3, 2], scaled to feature-map units
            self.stride = m.stride  # stride of each output feature map relative to the input image
            self.device = device  # device that holds the data

        def __call__(self, p, targets, imgs):  # predictions, targets, model
            '''
            Match positive samples and compute the loss
            :param p: [num_feature_map, batch_size, num_anchors, y, x, (x + y + w + h + obj + num_class)]
            :param targets: [num_gt, (image indices, classes, x, y, w, h)]
            :param imgs: [num_img, 3, y, x]
            '''
            device = targets.device
            # class loss, box loss, objectness loss
            lcls, lbox, lobj = torch.zeros(1, device=device), torch.zeros(1, device=device), torch.zeros(1, device=device)
            '''
            Positive sample matching:
            1. Use the target center (x, y) to choose the positive cells (gj, gi);
               use the ratio between the target w, h and the anchors to choose the anchors matched in each cell
               input: [nt, 6] output: [nt*cell_num*anchor_num, 6];
            2. Compute the Optimal Transport Assignment (OTA) cost of the candidates from step 1 and filter them further;
            bs: image indices of the positives; as_: anchor indices of the positives; gjs, gis: gj, gi of the predicting cells
            targets: ground-truth target matched to each positive (image indices, class, x, y, w, h), normalized coordinates
            anchors: anchor size of each positive (in units of its feature map)
            '''
            bs, as_, gjs, gis, targets, anchors = self.build_targets(p, targets, imgs)
            # x, y, w, h gain of each feature map
            pre_gen_gains = [torch.tensor(pp.shape, device=device)[[3, 2, 3, 2]] for pp in p]

            # compute the losses from the matched positives
            for i, pi in enumerate(p):  # layer index, layer predictions
                b, a, gj, gi = bs[i], as_[i], gjs[i], gis[i]  # image, anchor, gridy, gridx
                tobj = torch.zeros_like(pi[..., 0], device=device)  # target obj

                n = b.shape[0]  # number of matched positives
                if n:
                    ps = pi[b, a, gj, gi]  # predictions (x, y, w, h, obj, classes)

                    # decode the predictions
                    grid = torch.stack([gi, gj], dim=1)
                    pxy = ps[:, :2].sigmoid() * 2. - 0.5
                    # pxy = ps[:, :2].sigmoid() * 3. - 1.
                    pwh = (ps[:, 2:4].sigmoid() * 2) ** 2 * anchors[i]
                    pbox = torch.cat((pxy, pwh), 1)  # predicted box (feature-map scale)
                    selected_tbox = targets[i][:, 2:6] * pre_gen_gains[i]  # convert to feature-map scale
                    selected_tbox[:, :2] -= grid
                    iou = bbox_iou(pbox, selected_tbox, CIoU=True).squeeze()  # iou(prediction, target)
                    lbox += (1.0 - iou).mean()  # iou loss

                    # objectness target (IoU for positives, 0 for negatives)
                    tobj[b, a, gj, gi] = (1.0 - self.gr) + self.gr * iou.detach().clamp(0).type(tobj.dtype)  # iou ratio

                    # class labels
                    selected_tcls = targets[i][:, 1].long()
                    if self.nc > 1:  # class loss (only with more than one class), positives only
                        t = torch.full_like(ps[:, 5:], self.cn, device=device)  # negative label cn
                        t[range(n), selected_tcls] = self.cp  # positive label cp
                        lcls += self.BCEcls(ps[:, 5:], t)  # BCE

                    # Append targets to text file
                    # with open('targets.txt', 'a') as file:
                    #     [file.write('%11.5g ' * 4 % tuple(x) + '\n') for x in torch.cat((txy[i], twh[i]), 1)]

                obji = self.BCEobj(pi[..., 4], tobj)
                lobj += obji * self.balance[i]  # obj loss
                if self.autobalance:
                    self.balance[i] = self.balance[i] * 0.9999 + 0.0001 / obji.detach().item()

            if self.autobalance:
                self.balance = [x / self.balance[self.ssi] for x in self.balance]
            lbox *= self.hyp['box']
            lobj *= self.hyp['obj']
            lcls *= self.hyp['cls']
            bs = tobj.shape[0]  # batch size

            loss = lbox + lobj + lcls
            # return loss * bs, torch.cat((lbox, lobj, lcls, loss)).detach()
            return {"box_loss": lbox,
                    "obj_loss": lobj,
                    "class_loss": lcls}