ECCV 2016 Workshops
Contents
- 1 Background and Motivation
- 2 Related Work
- 3 Advantages / Contributions
- 4 Method
- 5 Experiments
- 5.1 Datasets and Metrics
- 5.2 The OTB-13 benchmark
- 5.3 The VOT benchmarks
- 5.4 Dataset size
- 6 Conclusion (own) / Future work
1 Background and Motivation
Single-object tracking: since the target is an arbitrary object specified only at test time, "it is impossible to have already gathered data and trained a specific detector".

Drawbacks of online-learning approaches: they "either apply 'shallow' methods (e.g. correlation filters) using the network's internal representation as features or perform SGD (stochastic gradient descent) to fine-tune multiple layers of the network" at test time. Moreover, "a clear deficiency of using data derived exclusively from the current video is that only comparatively simple models can be learnt", and real-time performance may also be an issue.

The authors instead build single-object tracking on a fully-convolutional Siamese network trained offline, so any video object detection dataset can be used for training ("the fairness of training and testing deep models for tracking using videos from the same domain is a point of controversy", which training on ILSVRC video sidesteps).
2 Related Work
- train Recurrent Neural Networks (RNNs) for the problem of object tracking
- track objects with a particle filter that uses a learnt distance metric to compare the current appearance to that of the first frame.
- feasibility of fine-tuning from pre-trained parameters at test time
3 Advantages / Contributions
- we equip a basic tracking algorithm with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video
- frame-rates beyond real-time
- achieves state-of-the-art performance in multiple benchmarks
4 Method
$f(z, x) = g(\varphi(z), \varphi(x))$

- $z$: the exemplar image
- $x$: the candidate (search) image
- $g$: a simple distance or similarity metric
- $\varphi$: the shared Siamese embedding network; its architecture is shown below
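As a quick sanity check on the shapes, we can trace the spatial size through each layer of $\varphi$ (kernel and stride values taken from the paper's AlexNet-style architecture table, with no padding): the 127×127 exemplar maps to a 6×6 embedding, the 255×255 search image to 22×22, and the score map is therefore 17×17.

```python
# Spatial sizes through the SiamFC embedding network (AlexNet-style, no padding).
def out_size(n, kernel, stride):
    """Output size of a valid (no-padding) conv/pool layer."""
    return (n - kernel) // stride + 1

# (kernel, stride) for conv1, pool1, conv2, pool2, conv3, conv4, conv5
LAYERS = [(11, 2), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 1)]

def embed_size(n):
    for k, s in LAYERS:
        n = out_size(n, k, s)
    return n

print(embed_size(127))                        # exemplar z: 127 -> 6
print(embed_size(255))                        # search x: 255 -> 22
print(embed_size(255) - embed_size(127) + 1)  # score map: 17
```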
Details of how the $x$ and $z$ crops are obtained (from the pysot code):
More concretely, the paper writes

$f(z, x) = \varphi(z) * \varphi(x) + b\mathbb{1}$

where $*$ is cross-correlation and "$b\mathbb{1}$ denotes a signal which takes value $b \in \mathbb{R}$ in every location", i.e. the same bias $b$ is added at every spatial position.
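The cross-correlation itself can be sketched in plain NumPy (a toy illustration, not the actual batched implementation): $\varphi(z)$ slides over $\varphi(x)$ as a kernel, summing over channels, with the scalar bias $b$ added everywhere.

```python
import numpy as np

def xcorr(phi_z, phi_x, b=0.0):
    """Score map f = phi(z) * phi(x) + b*1 via naive dense cross-correlation.
    phi_z: (c, hz, wz) exemplar embedding; phi_x: (c, hx, wx) search embedding."""
    c, hz, wz = phi_z.shape
    out_h = phi_x.shape[1] - hz + 1
    out_w = phi_x.shape[2] - wz + 1
    score = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # inner product of the exemplar with one search window, plus bias
            score[i, j] = np.sum(phi_z * phi_x[:, i:i + hz, j:j + wz]) + b
    return score

rng = np.random.default_rng(0)
z_emb = rng.normal(size=(128, 6, 6))     # phi(z), shapes as in SiamFC
x_emb = rng.normal(size=(128, 22, 22))   # phi(x)
print(xcorr(z_emb, x_emb).shape)         # (17, 17)
```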
Loss function: a per-position logistic loss averaged over the score map,

$\ell(y, v) = \log(1 + \exp(-yv)), \qquad L(y, v) = \frac{1}{|D|} \sum_{u \in D} \ell(y[u], v[u])$

- $y$ is the label, $+1$ or $-1$
- $v$ is the real-valued score at a position of the score map
- $u$ is a spatial position and $D$ is the score map (the set of its positions)

A score-map position is labelled positive when it lies within radius $R$ of the ground-truth centre once mapped back to the image by the network stride $k$, i.e. $y[u] = +1$ if $k\|u - c\| \le R$, where $c$ is the centre of the ground-truth bounding box. Training optimises this loss with SGD.
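A minimal NumPy sketch of the training targets and loss; the stride $k = 8$ and radius $R = 16$ used here are the values commonly quoted for SiamFC, so treat them as assumptions:

```python
import numpy as np

def label_map(size, stride=8, R=16):
    """y[u] = +1 if stride * ||u - c|| <= R, else -1, for a target
    centred on the score map (c = centre position)."""
    c = (size - 1) / 2.0
    ii, jj = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    dist = stride * np.hypot(ii - c, jj - c)   # distance in image pixels
    return np.where(dist <= R, 1.0, -1.0)

def siamfc_loss(y, v):
    """Mean logistic loss over the score map D:
    L(y, v) = (1/|D|) * sum_u log(1 + exp(-y[u] * v[u]))."""
    return float(np.mean(np.log1p(np.exp(-y * v))))

y = label_map(17)
print(int((y > 0).sum()))                 # positive positions near the centre
print(siamfc_loss(y, np.zeros_like(y)))   # log(2) ~ 0.693 when all scores are 0
```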
5 Experiments
Training runs for 50 epochs, each consisting of 50,000 sampled pairs.

Two variants are evaluated: SiamFC (Siamese Fully-Convolutional) and SiamFC-3s, which searches over 3 scales instead of 5. (The exact details of the scale search are not clear to me from the paper alone.)
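My rough understanding of the scale search, as a sketch: the search crop is resized at a few scales, each scale produces a score map, and the tracker keeps the scale whose (slightly damped) peak score is highest. The scale step 1.025 and the change penalty below are assumed values, not confirmed by this post.

```python
import numpy as np

# Hypothetical 3-scale pyramid around the current target size.
SCALE_STEP = 1.025                        # assumed value
scales = SCALE_STEP ** np.arange(-1, 2)   # ~[0.976, 1.0, 1.025]

def best_scale(score_maps, change_penalty=0.97):
    """Pick the scale whose score-map peak is highest, slightly
    penalising any change away from the middle (current) scale."""
    peaks = np.array([m.max() for m in score_maps])
    penalties = np.where(np.arange(len(peaks)) == len(peaks) // 2,
                         1.0, change_penalty)
    return int(np.argmax(peaks * penalties))

maps = [np.full((17, 17), 0.5), np.full((17, 17), 0.6), np.full((17, 17), 0.59)]
print(best_scale(maps))  # 1: middle scale wins after the penalty
```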
5.1 Datasets and Metrics
Training set
- ImageNet Video for tracking, 4500 videos
Test sets
- ALOV
- OTB-13
- VOT-14 / VOT-15 / VOT-16
a tracker is successful in a given frame if the intersection over-union (IoU) between its estimate and the ground-truth is above a certain threshold
Three evaluations commonly used on OTB: OPE, TRE, SRE
- OPE (one-pass evaluation): a single run initialised from the first frame; equivalent to one run of TRE.
- TRE (temporal robustness evaluation): the sequence is split into 20 segments; the tracker is initialised at each of these different starting times and run from there.
- SRE (spatial robustness evaluation): the first-frame target position is perturbed with 10% offsets in 12 directions, and tracking accuracy is measured from each perturbed initialisation.
Common metrics
- OP (%): overlap precision. Overlap = intersection area / (predicted box area + ground-truth box area − intersection area), i.e. the IoU.
- CLE (pixels): center location error, the Euclidean distance between the ground-truth and predicted centres.
- DP: distance precision.
- AUC: area under the curve of the success plot.
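The overlap and center-error metrics are straightforward to compute; a small sketch for boxes in (x, y, w, h) form:

```python
def iou(a, b):
    """Overlap = intersection / (area_a + area_b - intersection), boxes (x, y, w, h)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def center_error(a, b):
    """CLE: Euclidean distance between the two box centres, in pixels."""
    dx = (a[0] + a[2] / 2) - (b[0] + b[2] / 2)
    dy = (a[1] + a[3] / 2) - (b[1] + b[3] / 2)
    return (dx * dx + dy * dy) ** 0.5

print(iou((0, 0, 10, 10), (5, 0, 10, 10)))           # 1/3
print(center_error((0, 0, 10, 10), (5, 0, 10, 10)))  # 5.0
```

A frame then counts as a success when `iou` exceeds the chosen threshold, which is how the success plot (and its AUC) is built.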
Some VOT metrics
- Robustness: the larger the value, the less stable the tracker (more failures).
5.2 The OTB-13 benchmark
5.3 The VOT benchmarks
VOT-14
VOT-15
5.4 Dataset size
Qualitative results:

Drawback: the aspect ratio of the predicted box is fixed.
6 Conclusion (own) / Future work
From the paper alone, many implementation details remain unclear to me; I really need to dig into the code.
Deep Siamese conv-nets have previously been applied to tasks such as face verification, keypoint descriptor learning and one-shot character recognition