[Multi-Object Tracking] TrackFormer: a sentence-by-sentence translation that took three days!
TrackFormer: Multi-Object Tracking with Transformers
The challenging task of multi-object tracking (MOT) requires simultaneous reasoning about track initialization, identity, and spatio-temporal trajectories. We formulate this task as a frame-to-frame set prediction problem and introduce TrackFormer, an end-to-end trainable MOT approach based on an encoder-decoder Transformer architecture. Our model achieves data association between frames via attention by evolving a set of track predictions through a video sequence. The Transformer decoder initializes new tracks from static object queries and autoregressively follows existing tracks in space and time with the conceptually new and identity-preserving track queries. Both query types benefit from self- and encoder-decoder attention on global frame-level features, thereby omitting any additional graph optimization or modeling of motion and/or appearance. TrackFormer introduces a new tracking-by-attention paradigm and, while simple in its design, is able to achieve state-of-the-art performance on the task of multi-object tracking (MOT17 and MOT20) and segmentation (MOTS20).
Humans need to focus their attention to track objects in space and time, for example, when playing a game of tennis, golf, or pong. This challenge is only increased when tracking not one, but multiple objects, in crowded and real-world scenarios. Following this analogy, we demonstrate the effectiveness of Transformer [51] attention for the task of multi-object tracking (MOT) in videos.
The goal in MOT is to follow the trajectories of a set of objects, e.g., pedestrians, while keeping their identities discriminated as they are moving throughout a video sequence. Due to the advances in image-level object detection [7, 39], most approaches follow the two-step tracking-by-detection paradigm: (i) detecting objects in individual video frames and (ii) associating sets of detections between frames and thereby creating individual object tracks over time. Traditional tracking-by-detection methods associate detections via temporally sparse [23, 26] or dense [19, 22] graph optimization, or apply convolutional neural networks to predict matching scores between detections [8, 24].
Recent works [4, 6, 29, 69] suggest a variation of the traditional paradigm, coined tracking-by-regression [12]. In this approach, the object detector not only provides frame-wise detections, but replaces the data association step with a continuous regression of each track to the changing position of its object. These approaches achieve track association implicitly, but provide top performance only by relying either on additional graph optimization [6, 29] or motion and appearance models [4]. This is largely due to the isolated and local bounding box regression which lacks any notion of object identity or global communication between tracks.
We present a first straightforward instantiation of tracking-by-attention, TrackFormer, an end-to-end trainable Transformer [51] encoder-decoder architecture. It encodes frame-level features from a convolutional neural network (CNN) [18] and decodes queries into bounding boxes associated with identities. The data association is performed through the novel and simple concept of track queries. Each query represents an object and follows it in space and time over the course of a video sequence in an autoregressive fashion. New objects entering the scene are detected by static object queries as in [7, 71] and subsequently transform to future track queries. At each frame, the encoder-decoder computes attention between the input image features and the track as well as object queries, and outputs bounding boxes with assigned identities. Thereby, TrackFormer performs tracking-by-attention and achieves detection and data association jointly without relying on any additional track matching, graph optimization, or explicit modeling of motion and/or appearance. In contrast to tracking-by-detection/regression, our approach detects and associates tracks simultaneously in a single step via attention (and not regression). TrackFormer extends the recently proposed set prediction objective for object detection [7, 48, 71] to multi-object tracking.
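Summary: the Transformer performs detection and data association purely through attention, rather than through the usual matching and filtering algorithms such as the Hungarian algorithm or the Kalman filter.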
We evaluate TrackFormer on the MOT17 [30] and MOT20 [13] benchmarks where it achieves state-of-the-art performance for public and private detections. Furthermore, we demonstrate the extension with a mask prediction head and show state-of-the-art results on the Multi-Object Tracking and Segmentation (MOTS20) challenge [52]. We hope this simple yet powerful baseline will inspire researchers to explore the potential of the tracking-by-attention paradigm. In summary, we make the following contributions:
- An end-to-end trainable multi-object tracking approach which achieves detection and data association in a new tracking-by-attention paradigm.
- The concept of autoregressive track queries which embed an object’s spatial position and identity, thereby tracking it in space and time.
- New state-of-the-art results on three challenging multi-object tracking (MOT17 and MOT20) and segmentation (MOTS20) benchmarks.
2. Related work
Tracking-by-regression refrains from associating detections between frames but instead accomplishes tracking by regressing past object locations to their new positions in the current frame. Previous efforts [4, 15] use regression heads on region-pooled object features. In [69], objects are represented as center points which allow for an association by a distance-based greedy matching algorithm. To overcome their lacking notion of object identity and global track reasoning, additional re-identification and motion models [4], as well as traditional [29] and learned [6] graph methods have been necessary to achieve top performance.
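To make "distance-based greedy matching" concrete, here is a minimal sketch. It is not the exact procedure of [69]: the function name `greedy_center_matching`, the dictionary inputs, and the `max_dist` gate are all assumptions made for illustration.

```python
import math


def greedy_center_matching(tracks, detections, max_dist):
    """Greedily match track centers to detection centers.

    tracks / detections: dict mapping an integer id to an (x, y) center.
    Returns a dict track_id -> detection_id; anything farther away than
    max_dist (a hypothetical gating threshold) stays unmatched.
    """
    # Collect every candidate pair inside the distance gate, closest first.
    pairs = sorted(
        (math.dist(t_center, d_center), t_id, d_id)
        for t_id, t_center in tracks.items()
        for d_id, d_center in detections.items()
        if math.dist(t_center, d_center) <= max_dist
    )
    matches, used_detections = {}, set()
    for _, t_id, d_id in pairs:
        # Take the closest pair whose track and detection are both still free.
        if t_id not in matches and d_id not in used_detections:
            matches[t_id] = d_id
            used_detections.add(d_id)
    return matches
```

Each decision here only looks at one pair of points, which is the "lacking notion of object identity and global track reasoning" the paragraph above points out.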
Tracking-by-segmentation not only predicts object masks but leverages the pixel-level information to mitigate issues with crowdedness and ambiguous backgrounds. Prior attempts used category-agnostic image segmentation [31], applied Mask R-CNN [17] with 3D convolutions [52], mask pooling layers [38], or represented objects as unordered point clouds [58] and cost volumes [57]. However, the scarcity of annotated MOT segmentation data makes modern approaches still rely on bounding boxes.
In contrast, TrackFormer casts the entire tracking objective into a single set prediction problem, applying attention not only for the association step. It jointly reasons about track initialization, identity, and spatio-temporal trajectories. We only rely on feature-level attention and avoid additional graph optimization and appearance/motion models.
3. TrackFormer
We present TrackFormer, an end-to-end trainable multi-object tracking (MOT) approach based on an encoder-decoder Transformer [51] architecture. This section describes how we cast MOT as a set prediction problem and introduce the new tracking-by-attention paradigm. Furthermore, we explain the concept of track queries and their application for frame-to-frame data association.
3.1. MOT as a set prediction problem
Given a video sequence with $K$ individual object identities, MOT describes the task of generating ordered tracks $T_k = (b^k_{t_1}, b^k_{t_2}, \dots)$ with bounding boxes $b_t$ and track identities $k$. The subset $(t_1, t_2, \dots)$ of total frames $T$ indicates the time span between an object entering and leaving the scene. These include all frames for which an object is occluded by either the background or other objects.
In order to cast MOT as a set prediction problem, we leverage an encoder-decoder Transformer architecture. Our model performs online tracking and yields per-frame object bounding boxes and class predictions associated with identities in four consecutive steps:
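- Frame-level feature extraction with a common CNN backbone, e.g., ResNet-50 [18].
- Encoding of frame features with self-attention in a Transformer encoder.
- Decoding of queries with self- and encoder-decoder attention in a Transformer decoder.
- Mapping of queries to box and class predictions using multilayer perceptrons (MLP).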
Objects are implicitly represented in the decoder queries, which are embeddings used by the decoder to output bounding box coordinates and class predictions. The decoder alternates between two types of attention: (i) self-attention over all queries, which allows for joint reasoning about the objects in a scene and (ii) encoder-decoder attention, which gives queries global access to the visual information of the encoded features. The output embeddings accumulate bounding box and class information over multiple decoding layers. The permutation invariance of Transformers requires additive feature and object encodings for the frame features and decoder queries, respectively.
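The two attention types can be sketched in a few lines of plain Python. This is a deliberately stripped-down toy, not the TrackFormer decoder: it has a single head and no learned projections, layer normalization, or feed-forward block, and the names `attention`, `add_residual`, and `decoder_layer` are made up for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Each output is the attention-weighted average of the value vectors.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


def add_residual(xs, ys):
    """Element-wise sum of two equally shaped lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def decoder_layer(queries, memory):
    """One simplified decoder layer (assumption: no learned weights)."""
    # (i) self-attention: every query looks at all other queries.
    queries = add_residual(queries, attention(queries, queries, queries))
    # (ii) encoder-decoder attention: every query looks at the frame features.
    queries = add_residual(queries, attention(queries, memory, memory))
    return queries
```

Calling `decoder_layer(queries, memory)` with the rows of `memory` in a different order gives the same result up to floating-point rounding. That is the permutation invariance mentioned above, and the reason positional encodings have to be added to the frame features.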
3.2. Tracking-by-attention with queries
The total set of output embeddings is initialized with two types of query encodings: (i) static object queries, which allow the model to initialize tracks at any frame of the video, and (ii) autoregressive track queries, which are responsible for tracking objects across frames.
The simultaneous decoding of object and track queries allows our model to perform detection and tracking in a unified way, thereby introducing a new tracking-by-attention paradigm. Different tracking-by-X approaches are defined by their key component responsible for track generation. For tracking-by-detection, the tracking is performed by computing/modelling distances between frame-wise object detections. The tracking-by-regression paradigm also performs object detection, but tracks are generated by regressing each object box to its new position in the current frame. Technically, our TrackFormer also performs regression in the mapping of object embeddings with MLPs. However, the actual track association happens earlier via attention in the Transformer decoder. A detailed architecture overview which illustrates the integration of track and object queries into the Transformer decoder is shown in the appendix.
For this purpose, each new object detection initializes a track query with the corresponding output embedding of the previous frame. The Transformer encoder-decoder performs attention on frame features and decoder queries, continuously updating the instance-specific representation of an object's identity and location in each track query embedding. Self-attention over the joint set of both query types allows for the detection of new objects while simultaneously avoiding re-detection of already tracked objects.
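The autoregressive loop described above can be sketched roughly as follows. Everything concrete here is an assumption for illustration: `decode` is a hypothetical stand-in for the encoder-decoder that returns one `(output_embedding, score)` pair per query, and the single `score_thresh` used both to start and to end tracks is a simplification that the text above does not specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Query:
    embedding: list
    track_id: Optional[int] = None  # None marks a static object query


def track_step(frame, track_queries, object_queries, decode, next_id,
               score_thresh=0.5):
    """Process one frame and return (new_track_queries, results, next_id)."""
    # Track queries and object queries are decoded jointly.
    queries = track_queries + object_queries
    outputs = decode(frame, queries)  # hypothetical: [(embedding, score), ...]

    new_track_queries, results = [], []
    for query, (embedding, score) in zip(queries, outputs):
        if score < score_thresh:
            # Assumption: a low score ends a track or means "no new object".
            continue
        track_id = query.track_id
        if track_id is None:
            # An object query fired: start a new track with a fresh identity.
            track_id, next_id = next_id, next_id + 1
        # The output embedding becomes the track query for the next frame.
        new_track_queries.append(Query(embedding, track_id))
        results.append((track_id, score))
    return new_track_queries, results, next_id
```

Note that the object queries are passed in unchanged at every frame (they are static), while the track queries returned by one call are fed into the next, which is what makes the scheme autoregressive.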
3.3. TrackFormer training
For track queries to work in interaction with object queries and follow objects to the next frame, TrackFormer requires dedicated frame-to-frame tracking training. As indicated in Figure 2, we train on two adjacent frames and optimize the entire MOT objective at once. The loss for frame $t$ measures the set prediction of all output embeddings $N = N_\text{object} + N_\text{track}$ with respect to the ground truth objects in terms of class and bounding box prediction.
Set prediction loss in two steps: detection + tracking.
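One way to picture the two steps is as a target assignment that runs before the loss is computed. The sketch below is my reading under stated assumptions, not the paper's implementation: track queries are tied to ground truth by identity, the remaining ground-truth objects are matched to object queries by minimum cost, a brute-force search over permutations stands in for the Hungarian algorithm, and the cost is a plain L1 box distance. The names `l1_cost` and `assign_targets` are hypothetical.

```python
from itertools import permutations


def l1_cost(box_a, box_b):
    """L1 distance between two boxes given as equal-length tuples."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def assign_targets(track_preds, object_preds, gt_boxes):
    """Assign a ground-truth identity (or None for background) to every query.

    track_preds:  dict track_id -> predicted box (from track queries)
    object_preds: list of predicted boxes (from object queries)
    gt_boxes:     dict identity -> ground-truth box in the current frame
    """
    # Step 1 (tracking): a track query keeps the identity it already carries;
    # if that object has left the scene, its target is background.
    track_assign = {
        t_id: (t_id if t_id in gt_boxes else None) for t_id in track_preds
    }

    # Step 2 (detection): objects without a track query are new and must be
    # picked up by object queries via a minimum-cost one-to-one matching.
    new_ids = [i for i in gt_boxes if i not in track_preds]
    if len(new_ids) > len(object_preds):
        raise ValueError("more new objects than object queries")
    object_assign = {q: None for q in range(len(object_preds))}
    best_cost, best_perm = None, ()
    for perm in permutations(range(len(object_preds)), len(new_ids)):
        cost = sum(
            l1_cost(object_preds[q], gt_boxes[i]) for q, i in zip(perm, new_ids)
        )
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    for q, i in zip(best_perm, new_ids):
        object_assign[q] = i
    return track_assign, object_assign
```

The two returned dictionaries together cover all $N = N_\text{object} + N_\text{track}$ output embeddings, so a class and box loss can then be summed over every query, with unassigned queries supervised as background.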