Abstract
https://arxiv.org/pdf/2312.17663.pdf
As an important component of the detector's localization branch, the bounding box regression loss plays a significant role in object detection tasks. Existing bounding box regression methods usually consider the geometric relationship between the GT box and the predicted box and compute the loss from the relative positions and shapes of the two boxes, while ignoring the influence of inherent properties of the bounding box itself, such as its shape and scale, on bounding box regression. To compensate for the shortcomings of existing research, this paper proposes a bounding box regression method that focuses on the shape and scale of the bounding box itself. First, we analyze the regression characteristics of bounding boxes and find that the shape and scale of the bounding box itself affect the regression results. Based on this conclusion, we propose the Shape-IoU method, which computes the loss by focusing on the shape and scale of the bounding box itself, thereby making bounding box regression more accurate. Finally, we validate our method through a large number of comparative experiments; the results show that our method effectively improves detection performance, outperforms existing methods, and achieves state-of-the-art performance on different detection tasks. Code is available at https://github.com/malagoutou/Shape-IoU.
Index Terms: object detection, loss function, bounding box regression
I. Introduction
Object detection is one of the fundamental tasks in computer vision, which aims to localize and recognize objects in images. Depending on whether anchors are generated, object detection methods can be divided into anchor-based and anchor-free approaches. Anchor-based algorithms include Faster R-CNN [1], the YOLO (You Only Look Once) series [2], SSD (Single Shot MultiBox Detector) [3], and RetinaNet [4]. Anchor-free detection algorithms include CornerNet [5], CenterNet [6], and FCOS (Fully Convolutional One-Stage Object Detection) [7]. In these detectors, the bounding box regression loss function, as an important component of the localization branch, plays an irreplaceable role.
The most commonly used methods in object detection include IoU [8], GIoU [9], CIoU [10], and SIoU [11]. As the most widely used loss function in the field, IoU [8] has the advantage of accurately describing the degree of matching between the predicted box and the GT box. Its main shortcoming is that when the overlap between the two boxes is 0, it cannot describe the positional relationship between the predicted box and the GT box. GIoU [9] addresses this shortcoming by introducing the minimum enclosing box. CIoU [10] further improves detection accuracy by adding a shape loss term on top of minimizing the normalized distance between the predicted box and the GT box. SIoU [11] proposes to treat the angle of the line connecting the center points of the predicted box and the GT box as a new loss term, so that the degree of matching between the two boxes can be judged more accurately through the change of this angle.
In summary, previous bounding box regression methods mainly achieve more accurate regression by adding new geometric constraints on top of IoU [8]. The above methods consider the influence of the distance, shape, and angle between the GT box and the anchor box on bounding box regression, but ignore the fact that the shape and scale of the bounding box itself also affect bounding box regression. To further improve regression accuracy, we analyze the influence of the shape and scale of the bounding box itself and propose a new generation of bounding box regression loss: Shape-IoU.
The main contributions of this article are as follows:
- We analyze the characteristics of bounding box regression and conclude that, during bounding box regression, the shape and scale factors of the regression samples themselves affect the regression results.
- Based on existing bounding box regression loss functions and taking into account the influence of the shape and scale of the regression samples themselves, we propose the Shape-IoU loss function; for tiny object detection tasks, we further propose the Shape-Dot Distance and Shape-NWD losses.
- We conduct a series of comparative experiments on different detection tasks using state-of-the-art one-stage detectors, and the experimental results show that the detection performance of the proposed method is better than that of existing methods and reaches the state of the art.
II. Related Work
A. IoU-based Metrics in Object Detection
In recent years, with the development of detectors, bounding box regression losses have also developed rapidly. IoU was first proposed to evaluate the state of bounding box regression, and subsequent methods such as GIoU [9], DIoU [10], CIoU [10], EIoU [12], and SIoU [11] have continuously added different constraints on top of IoU to achieve better detection performance.
1) IoU Metric: IoU (Intersection over Union) [8] is the most popular evaluation metric in object detection, defined as follows:

$$IoU=\frac{\left|B \cap B^{gt}\right|}{\left|B \cup B^{gt}\right|} \tag{1}$$

where $B$ and $B^{gt}$ denote the predicted box and the GT box, respectively.
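To make the notation concrete, here is a minimal Python sketch of Eq. (1). The `(x_c, y_c, w, h)` box representation and the `iou` function name are illustrative choices made here, not something fixed by the paper.

```python
def iou(box, gt):
    """IoU of two axis-aligned boxes given as (x_c, y_c, w, h) tuples."""
    (x1, y1, w1, h1), (x2, y2, w2, h2) = box, gt
    # Overlap extents along each axis (zero when the boxes do not overlap).
    iw = max(0.0, min(x1 + w1 / 2, x2 + w2 / 2) - max(x1 - w1 / 2, x2 - w2 / 2))
    ih = max(0.0, min(y1 + h1 / 2, y2 + h2 / 2) - max(y1 - h1 / 2, y2 - h2 / 2))
    inter = iw * ih
    union = w1 * h1 + w2 * h2 - inter
    return inter / union if union > 0 else 0.0

print(iou((5, 5, 4, 4), (6, 6, 4, 4)))  # partially overlapping boxes -> ~0.391
```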
2) GIoU Metric: To address the gradient vanishing problem of the IoU loss when the GT box and the anchor box do not overlap in bounding box regression, GIoU (Generalized Intersection over Union) [9] was proposed. It is defined as follows:

$$GIoU=IoU-\frac{\left|C-B \cup B^{gt}\right|}{|C|} \tag{2}$$

where $C$ denotes the minimum enclosing box of the GT box and the anchor box.
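The GIoU penalty can be sketched with the same hypothetical box representation; the enclosing-box computation below is a plain illustration, not the authors' released implementation.

```python
def giou(box, gt):
    """GIoU = IoU - |C - B ∪ B^gt| / |C|, where C is the minimum enclosing box."""
    (x1, y1, w1, h1), (x2, y2, w2, h2) = box, gt
    iw = max(0.0, min(x1 + w1 / 2, x2 + w2 / 2) - max(x1 - w1 / 2, x2 - w2 / 2))
    ih = max(0.0, min(y1 + h1 / 2, y2 + h2 / 2) - max(y1 - h1 / 2, y2 - h2 / 2))
    inter = iw * ih
    union = w1 * h1 + w2 * h2 - inter
    # Minimum enclosing box C of the two boxes.
    cw = max(x1 + w1 / 2, x2 + w2 / 2) - min(x1 - w1 / 2, x2 - w2 / 2)
    ch = max(y1 + h1 / 2, y2 + h2 / 2) - min(y1 - h1 / 2, y2 - h2 / 2)
    c_area = cw * ch
    return inter / union - (c_area - union) / c_area
```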
3) DIoU Metric: Compared with GIoU, DIoU [10] considers the distance constraint between bounding boxes by adding a normalized center-point distance loss term on top of IoU, which makes the regression result more accurate. It is defined as follows:

$$DIoU=IoU-\frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}}$$

where $b$ and $b^{gt}$ are the center points of the anchor box and the GT box respectively, $\rho(\cdot)$ denotes the Euclidean distance, and $c$ is the diagonal length of the minimum enclosing box covering the two boxes.
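A short sketch of DIoU under the same assumptions, reusing the illustrative `iou()` helper from the IoU sketch above.

```python
def diou(box, gt):
    """DIoU: IoU minus the squared center distance normalized by the squared
    diagonal of the minimum enclosing box."""
    (x1, y1, w1, h1), (x2, y2, w2, h2) = box, gt
    rho2 = (x1 - x2) ** 2 + (y1 - y2) ** 2  # squared center-point distance
    cw = max(x1 + w1 / 2, x2 + w2 / 2) - min(x1 - w1 / 2, x2 - w2 / 2)
    ch = max(y1 + h1 / 2, y2 + h2 / 2) - min(y1 - h1 / 2, y2 - h2 / 2)
    c2 = cw ** 2 + ch ** 2                  # squared enclosing-box diagonal
    return iou(box, gt) - rho2 / c2
```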
CIoU [10] further considers the shape similarity between the GT box and the anchor box by adding a shape loss term to DIoU, which reduces the difference in aspect ratio between the anchor box and the GT box. It is defined as follows:
$$
\begin{array}{c}
CIoU=IoU-\frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}}-\alpha v \\
\alpha=\frac{v}{(1-IoU)+v} \\
v=\frac{4}{\pi^{2}}\left(\arctan \frac{w^{gt}}{h^{gt}}-\arctan \frac{w}{h}\right)^{2}
\end{array}
$$
where $w^{gt}$ and $h^{gt}$ denote the width and height of the GT box, and $w$ and $h$ denote the width and height of the anchor box.
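The CIoU terms map directly onto the formulas above. The sketch below reuses the illustrative `iou()` and `diou()` helpers, and the small `eps` guard is an added assumption for numerical stability.

```python
import math

def ciou(box, gt, eps=1e-9):
    """CIoU: DIoU minus alpha * v, where v measures the aspect-ratio difference."""
    (_, _, w, h), (_, _, wg, hg) = box, gt
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou(box, gt)) + v + eps)
    return diou(box, gt) - alpha * v
```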
4) EIoU Metric: EIoU [12] redefines the shape loss based on CIoU and further improves detection accuracy by directly reducing the width and height differences between the GT box and the anchor box. It is defined as follows:

$$EIoU=IoU-\frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}}-\frac{\rho^{2}\left(w, w^{gt}\right)}{\left(w^{c}\right)^{2}}-\frac{\rho^{2}\left(h, h^{gt}\right)}{\left(h^{c}\right)^{2}}$$

where $w^{c}$ and $h^{c}$ are the width and height of the minimum enclosing box covering the GT box and the anchor box.
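EIoU replaces the aspect-ratio term with direct width and height penalties; a sketch under the same assumptions (reusing the illustrative `iou()` helper) could look as follows.

```python
def eiou(box, gt):
    """EIoU: IoU minus the normalized center distance and the width/height
    differences normalized by the enclosing-box dimensions."""
    (x1, y1, w1, h1), (x2, y2, w2, h2) = box, gt
    cw = max(x1 + w1 / 2, x2 + w2 / 2) - min(x1 - w1 / 2, x2 - w2 / 2)
    ch = max(y1 + h1 / 2, y2 + h2 / 2) - min(y1 - h1 / 2, y2 - h2 / 2)
    rho2 = (x1 - x2) ** 2 + (y1 - y2) ** 2
    return (iou(box, gt)
            - rho2 / (cw ** 2 + ch ** 2)
            - (w1 - w2) ** 2 / cw ** 2
            - (h1 - h2) ** 2 / ch ** 2)
```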
5) SIoU Metric: On the basis of previous research, SIoU [11] further considers the influence of the angle between the bounding boxes on bounding box regression, aiming to accelerate convergence by first pulling the line connecting the centers of the anchor box and the GT box toward the horizontal or vertical direction. It is defined as follows:

$$
\begin{array}{l}
SIoU=IoU-\frac{\Delta+\Omega}{2} \\
\Lambda=\sin \left(2 \arcsin \frac{\min \left(\left|x_{c}^{gt}-x_{c}\right|,\left|y_{c}^{gt}-y_{c}\right|\right)}{\sqrt{\left(x_{c}^{gt}-x_{c}\right)^{2}+\left(y_{c}^{gt}-y_{c}\right)^{2}}+\epsilon}\right) \\
\Delta=\sum_{t=x, y}\left(1-e^{-\gamma \rho_{t}}\right), \quad \gamma=2-\Lambda \\
\rho_{x}=\left(\frac{x_{c}-x_{c}^{gt}}{w_{c}}\right)^{2}, \quad \rho_{y}=\left(\frac{y_{c}-y_{c}^{gt}}{h_{c}}\right)^{2} \\
\Omega=\sum_{t=w, h}\left(1-e^{-\omega_{t}}\right)^{\theta}, \quad \theta=4 \\
\omega_{w}=\frac{\left|w-w^{gt}\right|}{\max \left(w, w^{gt}\right)}, \quad \omega_{h}=\frac{\left|h-h^{gt}\right|}{\max \left(h, h^{gt}\right)}
\end{array}
$$

where $\left(x_{c}, y_{c}\right)$ and $\left(x_{c}^{gt}, y_{c}^{gt}\right)$ are the center points of the anchor box and the GT box, and $w_{c}$ and $h_{c}$ are the width and height of their minimum enclosing box.
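Putting the angle, distance, and shape costs together, a plain-Python sketch of SIoU (reusing the illustrative `iou()` helper; `eps` and the default `theta=4` follow the formulas above) might look like this.

```python
import math

def siou(box, gt, theta=4, eps=1e-7):
    """SIoU: IoU minus (distance cost + shape cost) / 2, with the distance cost
    modulated by the angle term Lambda."""
    (x, y, w, h), (xg, yg, wg, hg) = box, gt
    # Minimum enclosing box, used to normalize the center offsets.
    cw = max(x + w / 2, xg + wg / 2) - min(x - w / 2, xg - wg / 2)
    ch = max(y + h / 2, yg + hg / 2) - min(y - h / 2, yg - hg / 2)
    # Angle cost Lambda.
    dist = math.sqrt((xg - x) ** 2 + (yg - y) ** 2) + eps
    lam = math.sin(2 * math.asin(min(abs(xg - x), abs(yg - y)) / dist))
    # Distance cost Delta.
    gamma = 2 - lam
    rho_x, rho_y = ((x - xg) / cw) ** 2, ((y - yg) / ch) ** 2
    delta = (1 - math.exp(-gamma * rho_x)) + (1 - math.exp(-gamma * rho_y))
    # Shape cost Omega.
    omega_w = abs(w - wg) / max(w, wg)
    omega_h = abs(h - hg) / max(h, hg)
    omega = (1 - math.exp(-omega_w)) ** theta + (1 - math.exp(-omega_h)) ** theta
    return iou(box, gt) - (delta + omega) / 2
```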
B. Metrics in Tiny Object Detection
IoU-based metrics are suitable for general object detection tasks; for tiny object detection, Dot Distance [13] and Normalized Wasserstein Distance (NWD) [14] have been proposed to overcome the sensitivity of IoU to the location deviation of tiny objects.
- Dot Distance:
$$
\begin{array}{c}
D=\sqrt{\left(x_{c}-x_{c}^{gt}\right)^{2}+\left(y_{c}-y_{c}^{gt}\right)^{2}} \\
S=\sqrt{\frac{\sum_{i=1}^{M} \sum_{j=1}^{N_{i}} w_{ij} \times h_{ij}}{\sum_{i=1}^{M} N_{i}}} \\
DotD=e^{-\frac{D}{S}}
\end{array}
$$
where $D$ denotes the Euclidean distance between the center point of the GT box and the center point of the anchor box, and $S$ is the average size of the objects in the dataset. $M$ is the number of images, $N_{i}$ is the number of labeled bounding boxes in the $i$-th image, and $w_{ij}$ and $h_{ij}$ are the width and height of the $j$-th labeled bounding box in the $i$-th image, respectively.
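Since $S$ is a dataset-level statistic, it is natural to precompute it once from the labels and pass it in; the sketch below does exactly that, with `avg_size` as an assumed, caller-supplied value.

```python
import math

def dot_distance(box, gt, avg_size):
    """Dot Distance: exp(-D / S), where D is the center distance and avg_size
    is the precomputed average object size S of the dataset."""
    (x, y, _, _), (xg, yg, _, _) = box, gt
    d = math.sqrt((x - xg) ** 2 + (y - yg) ** 2)
    return math.exp(-d / avg_size)
```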
- Normalized Gaussian Wasserstein Distance:
$$
\begin{array}{c}
D=\sqrt{\left(x_{c}-x_{c}^{gt}\right)^{2}+\left(y_{c}-y_{c}^{gt}\right)^{2}+\frac{\left(w-w^{gt}\right)^{2}+\left(h-h^{gt}\right)^{2}}{\text{weight}^{2}}} \\
NWD=e^{-\frac{D}{C}}
\end{array}
$$

where $\text{weight}=2$ and $C$ is a constant associated with the dataset.
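A matching sketch of the NWD formula, with the dataset constant `C` treated as a caller-supplied assumption.

```python
import math

def nwd(box, gt, C, weight=2.0):
    """Normalized Wasserstein Distance as written above: exp(-D / C)."""
    (x, y, w, h), (xg, yg, wg, hg) = box, gt
    d = math.sqrt((x - xg) ** 2 + (y - yg) ** 2
                  + ((w - wg) ** 2 + (h - hg) ** 2) / weight ** 2)
    return math.exp(-d / C)
```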
III. Methods
A. Analysis of Bounding Box Regression Characteristics
As shown in Fig.2, the scale of the GT box in bounding box regression sample A and B is the same, while the scale of the GT box in C and D is the same. The shape of the GT box in A and D is the same, while the shape of the GT box in B and \mathrm{C} is the same. The scale of the bounding boxes in C and D is greater than the scale of the bounding boxes in A and B. The regression samples for all bounding boxes in Fig.2a have the same deviation, with a shape deviation of 0 . The difference between Fig.2a and Fig.2b is that the shape-deviation of all bounding box regression samples in Fig.2b is the same, with a deviation of 0 .
The deviation between A and B in Fig.2a is the same, but there is a difference in the IoU value.
The deviation between C and D in Fig.2a is the same, but there is a difference in IoU values, and compared to A and B in Fig.2a, the difference in IoU values is not significant.
The shape-deviation of A and B in Fig.2b is the same, but there is a difference in the IoU value.
The shape-deviation of C and D in Fig.2b is the same, but there is a difference in IoU values, and compared to A and B in Fig.2b, the difference in IoU values is not significant.
The reason for the difference in IoU value between A and B in Fig.2a is that their GT boxes have different shapes, and the deviation direction corresponds to their long and short edge directions, respectively. For A, the deviation along the long edge direction of its GT box has a smaller impact on its IoU value, while for B, the deviation along the short edge direction has a greater impact on its IoU value. Compared to large scale bounding boxes, smaller scale bounding boxes are more sensitive to changes in IoU value, and the shape of the GT box has a more significant impact on the IoU value of smaller scale bounding boxes. Because A and B are smaller in scale than C and D, the change in IoU value is more significant when the shape and deviation are the same. Similarly, in Fig.2b, analyzing bounding box regression from the perspective of shape deviation reveals that the shape of the GT box in the regression sample affects its IoU value during the regression process.
Based on the above analysis, the following conclusions can be drawn:
- Assuming that the GT box is not square and has long and short sides, differences in the shape and scale of the bounding boxes in the regression samples will lead to differences in their IoU values when the deviation and shape deviation are the same and not both 0.
- For bounding box regression samples of the same scale, when the deviation and shape deviation of the regression samples are the same and not both 0, the shape of the bounding box has an impact on the IoU value of the regression sample, and the change in IoU value corresponding to deviation and shape deviation along the short edge direction of the bounding box is more significant.
- For regression samples with bounding boxes of the same shape, when the deviation and shape deviation are the same and not both 0, the IoU value of smaller scale bounding box regression samples is more significantly affected by the shape of the GT box than that of larger scale regression samples.
B. Shape-IoU
The formula for Shape-IoU can be derived from Fig.3:

$$
\begin{array}{c}
IoU=\frac{\left|B \cap B^{gt}\right|}{\left|B \cup B^{gt}\right|} \\
ww=\frac{2 \times\left(w^{gt}\right)^{scale}}{\left(w^{gt}\right)^{scale}+\left(h^{gt}\right)^{scale}} \\
hh=\frac{2 \times\left(h^{gt}\right)^{scale}}{\left(w^{gt}\right)^{scale}+\left(h^{gt}\right)^{scale}} \\
distance^{shape}=hh \times \frac{\left(x_{c}-x_{c}^{gt}\right)^{2}}{c^{2}}+ww \times \frac{\left(y_{c}-y_{c}^{gt}\right)^{2}}{c^{2}} \\
\Omega^{shape}=\sum_{t=w, h}\left(1-e^{-\omega_{t}}\right)^{\theta}, \quad \theta=4 \\
\omega_{w}=hh \times \frac{\left|w-w^{gt}\right|}{\max \left(w, w^{gt}\right)}, \quad \omega_{h}=ww \times \frac{\left|h-h^{gt}\right|}{\max \left(h, h^{gt}\right)}
\end{array}
$$

where $scale$ is the scale factor, which is related to the scale of the targets in the dataset, and $ww$ and $hh$ are the weight coefficients in the horizontal and vertical directions respectively, whose values are related to the shape of the GT box. The corresponding bounding box regression loss is as follows:

$$L_{\text{Shape-IoU}}=1-IoU+distance^{shape}+0.5 \times \Omega^{shape}$$
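For reference, the whole loss can be written down compactly. The sketch below follows the equations above, reuses the illustrative `iou()` helper, and treats `scale` as a dataset-dependent hyperparameter; the default value used here is an assumption for illustration, not a recommendation from the paper.

```python
import math

def shape_iou_loss(box, gt, scale=1.0, theta=4):
    """Shape-IoU loss: 1 - IoU + distance^shape + 0.5 * Omega^shape."""
    (x, y, w, h), (xg, yg, wg, hg) = box, gt
    # Shape weights derived from the GT box only.
    ww = 2 * wg ** scale / (wg ** scale + hg ** scale)
    hh = 2 * hg ** scale / (wg ** scale + hg ** scale)
    # Squared diagonal of the minimum enclosing box.
    cw = max(x + w / 2, xg + wg / 2) - min(x - w / 2, xg - wg / 2)
    ch = max(y + h / 2, yg + hg / 2) - min(y - h / 2, yg - hg / 2)
    c2 = cw ** 2 + ch ** 2
    # Shape-weighted, normalized center distance.
    dist_shape = hh * (x - xg) ** 2 / c2 + ww * (y - yg) ** 2 / c2
    # Shape-weighted shape cost.
    omega_w = hh * abs(w - wg) / max(w, wg)
    omega_h = ww * abs(h - hg) / max(h, hg)
    omega_shape = (1 - math.exp(-omega_w)) ** theta + (1 - math.exp(-omega_h)) ** theta
    return 1 - iou(box, gt) + dist_shape + 0.5 * omega_shape
```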
C. Shape-IoU for Small Targets
- Shape-Dot Distance: We integrate the idea of Shape-IoU into Dot Distance to obtain Shape-Dot Distance, which is defined as follows:

$$
\begin{array}{c}
D=\sqrt{hh \times\left(x_{c}-x_{c}^{gt}\right)^{2}+ww \times\left(y_{c}-y_{c}^{gt}\right)^{2}} \\
S=\sqrt{\frac{\sum_{i=1}^{M} \sum_{j=1}^{N_{i}} w_{ij} \times h_{ij}}{\sum_{i=1}^{M} N_{i}}} \\
DotD^{shape}=e^{-\frac{D}{S}}
\end{array}
$$
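A sketch of Shape-Dot Distance under the same assumptions; `avg_size` is the precomputed dataset statistic $S$ and `scale` the Shape-IoU scale factor, both supplied by the caller.

```python
import math

def shape_dot_distance(box, gt, avg_size, scale=1.0):
    """Shape-Dot Distance: the Dot Distance center term reweighted by the
    Shape-IoU coefficients ww and hh."""
    (x, y, _, _), (xg, yg, wg, hg) = box, gt
    ww = 2 * wg ** scale / (wg ** scale + hg ** scale)
    hh = 2 * hg ** scale / (wg ** scale + hg ** scale)
    d = math.sqrt(hh * (x - xg) ** 2 + ww * (y - yg) ** 2)
    return math.exp(-d / avg_size)
```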
- Shape-NWD: Similarly, we integrate the idea of Shape-IoU into NWD to obtain Shape-NWD, which is defined as follows:

$$
\begin{array}{c}
B=\frac{\left(w-w^{gt}\right)^{2}+\left(h-h^{gt}\right)^{2}}{\text{weight}^{2}}, \quad \text{weight}=2 \\
D=\sqrt{hh \times\left(x_{c}-x_{c}^{gt}\right)^{2}+ww \times\left(y_{c}-y_{c}^{gt}\right)^{2}+B} \\
NWD^{shape}=e^{-\frac{D}{C}}
\end{array}
$$
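And similarly for Shape-NWD; `C`, `scale`, and `weight` are caller-supplied assumptions as above.

```python
import math

def shape_nwd(box, gt, C, scale=1.0, weight=2.0):
    """Shape-NWD: NWD with its center-distance term reweighted by the
    Shape-IoU coefficients ww and hh."""
    (x, y, w, h), (xg, yg, wg, hg) = box, gt
    ww = 2 * wg ** scale / (wg ** scale + hg ** scale)
    hh = 2 * hg ** scale / (wg ** scale + hg ** scale)
    b = ((w - wg) ** 2 + (h - hg) ** 2) / weight ** 2
    d = math.sqrt(hh * (x - xg) ** 2 + ww * (y - yg) ** 2 + b)
    return math.exp(-d / C)
```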
IV. Experiments
A. PASCAL VOC on YOLOv8 and YOLOv7
The PASCAL VOC dataset is one of the most popular datasets in the field of object detection. In this article, we use the train and val sets of VOC2007 and VOC2012 as the training set, which includes 16551 images, and the test set of VOC2007 as the test set, which contains 4952 images. In this experiment, we choose the state-of-the-art one-stage detectors YOLOv8s and YOLOv7-tiny to perform comparison experiments on the VOC dataset, with SIoU as the comparison method. The experimental results are shown in TABLE I.
B. VisDrone-2019 on YOLOv8
VisDrone-2019 is the most popular UAV aerial imagery dataset in the field of object detection and contains a large number of small targets compared to general-purpose datasets. In this experiment, YOLOv8s is chosen as the detector and SIoU as the comparison method. The experimental results are as follows:
C. AI-TOD on YOLOv5
AI-TOD is a remote sensing image dataset that differs from general datasets in that it contains a significant number of tiny targets, with an average target size of only 12.8 pixels. In this experiment, YOLOv5s is chosen as the detector and SIoU as the comparison method. The experimental results are shown in TABLE III.
V. Conclusion
In this article, we summarize the advantages and disadvantages of existing bounding box regression methods, pointing out that existing research focuses on the geometric constraints between the GT box and the predicted box while ignoring the influence of geometric factors such as the shape and scale of the bounding box itself on the regression results. Then, by analyzing the regression characteristics of bounding boxes, we identify the rules by which the geometric factors of the bounding boxes themselves affect the regression. Based on the above analysis, we propose the Shape-IoU method, which focuses on the shape and scale of the bounding box itself to calculate the loss, thereby improving regression accuracy. Finally, a series of comparative experiments was conducted with state-of-the-art one-stage detectors on datasets of different scales, and the experimental results show that our method outperforms existing methods and achieves state-of-the-art performance.