
CVPR 2023 


1 Planning-oriented Autonomous Driving (Best paper)


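Modern autonomous driving systems are characterized by modular tasks executed in sequence, i.e., perception, prediction, and planning. To perform a wide diversity of tasks and achieve advanced-level intelligence, contemporary approaches either deploy standalone models for individual tasks or design a multi-task paradigm with separate heads. However, they may suffer from accumulated errors or deficient task coordination. Instead, we argue that a favorable framework should be devised and optimized in pursuit of the ultimate goal, i.e., planning for the self-driving car. Oriented at this, we revisit the key components within perception and prediction, and prioritize the tasks such that all of them contribute to planning. We introduce Unified Autonomous Driving (UniAD), an up-to-date comprehensive framework that incorporates full-stack driving tasks in one network. It is carefully devised to leverage the advantages of each module and to provide complementary feature abstractions for agent interaction from a global perspective. Tasks communicate through unified query interfaces to facilitate each other toward planning. We instantiate UniAD on the challenging nuScenes benchmark. With extensive ablations, the effectiveness of this philosophy is demonstrated by substantially outperforming previous state-of-the-art methods in all aspects.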

2 Dynamically Instance-Guided Adaptation: A Backward-Free Approach for Test-Time Domain Adaptive Semantic Segmentation


In this paper, we study the application of test-time domain adaptation in semantic segmentation (TTDA-Seg), where both efficiency and effectiveness are crucial. Existing methods either have low efficiency (e.g., backward optimization) or ignore semantic adaptation (e.g., distribution alignment). Besides, they suffer from the accumulated errors caused by unstable optimization and abnormal distributions. To solve these problems, we propose a novel backward-free approach for TTDA-Seg, called Dynamically Instance-Guided Adaptation (DIGA). Our principle is to use each instance to dynamically guide its own adaptation in a non-parametric way, which avoids the error accumulation issue and the expensive optimization cost. Specifically, DIGA is composed of a distribution adaptation module (DAM) and a semantic adaptation module (SAM), enabling us to jointly adapt the model in two indispensable aspects. DAM mixes the instance and source BN statistics to encourage the model to capture robust representations. SAM combines the historical prototypes with instance-level prototypes to adjust semantic predictions, which can be associated with the parametric classifier to mutually benefit the final results. Extensive experiments on five target domains demonstrate the effectiveness and efficiency of the proposed method. Our DIGA establishes new state-of-the-art performance in TTDA-Seg. Source code is available at: https://github.com/Waybaba/DIGA.
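To make the DAM idea of mixing instance and source BN statistics concrete, here is a minimal sketch. It assumes a plain linear interpolation with a hypothetical weight `lam`; the function names and the exact mixing rule are illustrative and are not taken from the DIGA code base.

```python
def mix_bn_statistics(source_mean, source_var, instance_mean, instance_var, lam=0.8):
    """Blend per-channel source BN statistics with those of one test instance.

    lam is a hypothetical mixing weight (an assumption of this sketch):
    lam=1.0 keeps the source statistics, lam=0.0 uses only the instance.
    """
    mixed_mean = [lam * s + (1 - lam) * t for s, t in zip(source_mean, instance_mean)]
    mixed_var = [lam * s + (1 - lam) * t for s, t in zip(source_var, instance_var)]
    return mixed_mean, mixed_var


def normalize(channel_values, mean, var, eps=1e-5):
    """Normalize one value per channel with the given statistics."""
    return [(v - m) / (s + eps) ** 0.5 for v, m, s in zip(channel_values, mean, var)]


# Example: one channel, source stats (0, 1), instance stats (2, 3).
mean, var = mix_bn_statistics([0.0], [1.0], [2.0], [3.0], lam=0.5)
```

No gradient step is involved, which is what makes this kind of adaptation backward-free.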


3 Rethinking Federated Learning With Domain Shift: A Prototype View


Federated learning shows great promise as a privacy-preserving collaborative learning technique. However, prevalent solutions mainly assume that all private data are sampled from the same domain. An important challenge arises when distributed data are derived from diverse domains: the private model then shows degraded performance on other domains (domain shift). Therefore, we expect the global model optimized by the federated learning process to generalize stably across multiple domains. In this paper, we propose Federated Prototypes Learning (FPL) for federated learning under domain shift. The core idea is to construct cluster prototypes and unbiased prototypes, providing fruitful domain knowledge and a fair convergent target. On the one hand, we pull the sample embedding closer to cluster prototypes belonging to the same semantics than to cluster prototypes from distinct classes. On the other hand, we introduce consistency regularization to align the local instance with the respective unbiased prototype. Empirical results on Digits and Office Caltech tasks demonstrate the effectiveness of the proposed solution and the efficiency of crucial modules.

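The prototype notion used by FPL can be sketched as follows. This is an illustration under assumptions: a class prototype is taken to be the mean embedding of a class, and the unbiased prototype is taken to be the plain average of that class's cluster prototypes. The clustering step that produces cluster prototypes is omitted, and all function names are hypothetical.

```python
def class_prototypes(embeddings, labels):
    """Return {label: mean embedding} for one client's samples."""
    sums, counts = {}, {}
    for vec, y in zip(embeddings, labels):
        if y not in sums:
            sums[y] = [0.0] * len(vec)
            counts[y] = 0
        sums[y] = [a + b for a, b in zip(sums[y], vec)]
        counts[y] += 1
    return {y: [a / counts[y] for a in sums[y]] for y in sums}


def unbiased_prototype(cluster_prototypes):
    """Average several cluster prototypes of the same class (assumed rule)."""
    n = len(cluster_prototypes)
    dim = len(cluster_prototypes[0])
    return [sum(p[d] for p in cluster_prototypes) / n for d in range(dim)]


# Two samples of class 0 and one of class 1 on a single client.
protos = class_prototypes([[0, 0], [2, 2], [1, 5]], [0, 0, 1])
```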


4 HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation 


Current semantic segmentation models have achieved great success under the independent and identically distributed (i.i.d.) condition. However, in real-world applications, test data might come from a different domain than training data. Therefore, it is important to improve model robustness against domain differences. This work studies semantic segmentation under the domain generalization setting, where a model is trained only on the source domain and tested on the unseen target domain. Existing works show that Vision Transformers are more robust than CNNs and show that this is related to the visual grouping property of self-attention. In this work, we propose a novel hierarchical grouping transformer (HGFormer) to explicitly group pixels to form part-level masks and then whole-level masks. The masks at different scales aim to segment out both parts and a whole of classes. HGFormer combines mask classification results at both scales for class label prediction. We assemble multiple interesting cross-domain settings by using seven public semantic segmentation datasets. Experiments show that HGFormer yields more robust semantic segmentation results than per-pixel classification methods and flat-grouping transformers, and outperforms previous methods significantly. Code will be available at https://github.com/dingjiansw101/HGFormer.



5 Cross-Domain Image Captioning With Discriminative Finetuning


Neural captioners are typically trained to mimic human-generated references without optimizing for any specific communication goal, leading to problems such as the generation of vague captions. In this paper, we show that fine-tuning an out-of-the-box neural captioner with a self-supervised discriminative communication objective helps to recover a plain, visually descriptive language that is more informative about image contents. Given a target image, the system must learn to produce a description that enables an out-of-the-box text-conditioned image retriever to identify that image among a set of candidates. We experiment with the popular ClipCap captioner, also replicating the main results with BLIP. In terms of similarity to ground-truth human descriptions, the captions emerging from discriminative finetuning lag slightly behind those generated by the non-finetuned model, when the latter is trained and tested on the same caption dataset. However, when the model is used without further tuning to generate captions for out-of-domain datasets, our discriminatively-finetuned captioner generates descriptions that resemble human references more than those produced by the same captioner without finetuning. We further show that, on the Conceptual Captions dataset, discriminatively finetuned captions are more helpful than either vanilla ClipCap captions or ground-truth captions for human annotators tasked with an image discrimination task.


6 DaFKD: Domain-Aware Federated Knowledge Distillation 


Federated Distillation (FD) has recently attracted increasing attention for its efficiency in aggregating multiple diverse local models trained from statistically heterogeneous data of distributed clients. Existing FD methods generally treat these models equally by merely computing the average of their output soft predictions for some given input distillation sample, which does not take the diversity across all local models into account, thus leading to degraded performance of the aggregated model, especially when some local models learn little knowledge about the sample. In this paper, we propose a new perspective that treats the local data in each client as a specific domain and design a novel domain knowledge aware federated distillation method, dubbed DaFKD, that can discern the importance of each model to the distillation sample, and thus is able to optimize the ensemble of soft predictions from diverse models. Specifically, we employ a domain discriminator for each client, which is trained to identify the correlation factor between the sample and the corresponding domain. Then, to facilitate the training of the domain discriminator while saving communication costs, we propose sharing its partial parameters with the classification model. Extensive experiments on various datasets and settings show that the proposed method can improve the model accuracy by up to 6.02% compared to state-of-the-art baselines.
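The contrast between plain averaging and importance-aware aggregation can be sketched as below. The sketch assumes each client's domain discriminator yields a non-negative score for the distillation sample, and that scores are normalized into ensemble weights; the normalization rule and the function name are assumptions, not the DaFKD implementation.

```python
def weighted_ensemble(soft_predictions, domain_scores):
    """Combine per-client soft predictions using discriminator scores as weights.

    soft_predictions: one probability vector per client.
    domain_scores: one non-negative score per client (hypothetical discriminator output).
    """
    total = sum(domain_scores)
    if total <= 0:
        raise ValueError("at least one domain score must be positive")
    weights = [s / total for s in domain_scores]
    num_classes = len(soft_predictions[0])
    return [
        sum(w * p[c] for w, p in zip(weights, soft_predictions))
        for c in range(num_classes)
    ]


preds = [[0.9, 0.1], [0.5, 0.5]]
uniform = weighted_ensemble(preds, [1.0, 1.0])   # equal treatment of both clients
weighted = weighted_ensemble(preds, [3.0, 1.0])  # first client knows the sample better
```

With equal scores this reduces to the plain average that existing FD methods use.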


7 Style Projected Clustering for Domain Generalized Semantic Segmentation


Existing semantic segmentation methods improve generalization capability by regularizing various images to a canonical feature space. While this process contributes to generalization, it inevitably weakens the representation. In contrast to existing methods, we instead utilize the difference between images to build a better representation space, where the distinct style features are extracted and stored as the bases of representation. Then, the generalization to unseen image styles is achieved by projecting features to this known space. Specifically, we realize the style projection as a weighted combination of stored bases, where the similarity distances are adopted as the weighting factors. Based on the same concept, we extend this process to the decision part of the model and promote the generalization of semantic prediction. By measuring the similarity distances to semantic bases (i.e., prototypes), we replace the common deterministic prediction with semantic clustering. Comprehensive experiments demonstrate the advantage of the proposed method over the state of the art, with up to 3.6% mIoU improvement on average on unseen scenarios. Code and models are available at https://gitee.com/mindspore/models/tree/master/research/cv/SPC-Net.
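The projection step, a weighted combination of stored bases with distance-derived weights, can be sketched as follows. Euclidean distance, the softmax over negative distances, and the `temperature` parameter are assumptions of this sketch; the paper's own distance measure and weighting may differ.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def project_onto_bases(feature, bases, temperature=1.0):
    """Express a feature as a weighted combination of stored bases.

    Closer bases receive larger weights (softmax over negative distances).
    """
    dists = [math.dist(feature, b) for b in bases]
    weights = softmax([-d / temperature for d in dists])
    dim = len(feature)
    projected = [sum(w * b[i] for w, b in zip(weights, bases)) for i in range(dim)]
    return projected, weights


# A feature near the first base is projected mostly onto that base.
projected, weights = project_onto_bases([0.9, 0.0], [[1.0, 0.0], [-1.0, 0.0]])
```

The same routine could serve the semantic side by treating class prototypes as the bases and reading the weights as class scores.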


8 Revisiting Prototypical Network for Cross Domain Few-Shot Learning


Prototypical Network is a popular few-shot solver that aims at establishing a feature metric generalizable to novel few-shot classification (FSC) tasks using deep neural networks. However, its performance drops dramatically when generalizing to FSC tasks in new domains. In this study, we revisit this problem and argue that the devil lies in the simplicity bias pitfall in neural networks. Specifically, the network tends to focus on some biased shortcut features (e.g., color, shape, etc.) that are sufficient to distinguish the very few classes in the meta-training tasks within a pre-defined domain, but that fail to generalize across domains the way desirable semantic features would. To mitigate this problem, we propose a Local-global Distillation Prototypical Network (LDP-net). Different from the standard Prototypical Network, we establish a two-branch network to classify the query image and its random local crops, respectively. Then, knowledge distillation is conducted between these two branches to enforce their class affiliation consistency. The rationale is that since such a global-local semantic relationship is expected to hold regardless of data domains, the local-global distillation helps exploit cross-domain transferable semantic features for feature metric establishment. Moreover, such local-global semantic consistency is further enforced among different images of the same class to reduce the intra-class semantic variation of the resultant feature. In addition, we propose to update the local branch as an Exponential Moving Average (EMA) over training episodes, which makes it possible to better distill cross-episode knowledge and further enhance the generalization performance. Experiments on eight cross-domain FSC benchmarks empirically support our argument and show the state-of-the-art results of LDP-net. Code is available at https://github.com/NWPUZhoufei/LDP-Net
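The EMA update of one branch from the other can be sketched in a few lines. Parameters are represented as flat lists of floats, and the `momentum` value is a hypothetical choice; this illustrates the generic EMA rule rather than the LDP-net training code.

```python
def ema_update(ema_params, online_params, momentum=0.99):
    """Return new EMA parameters: momentum * old_ema + (1 - momentum) * online."""
    return [momentum * e + (1 - momentum) * o for e, o in zip(ema_params, online_params)]


# Over many episodes the EMA branch drifts toward the online branch.
ema = [1.0, 2.0]
online = [0.0, 4.0]
for _ in range(1000):
    ema = ema_update(ema, online, momentum=0.99)
```

A high momentum makes the averaged branch change slowly, so it aggregates knowledge across episodes instead of following any single one.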


9 Model Barrier: A Compact Un-Transferable Isolation Domain for Model Intellectual Property Protection



As scientific and technological advancements result from human intellectual labor and computational costs, protecting model intellectual property (IP) has become increasingly important to encourage model creators and owners. Model IP protection involves preventing the use of well-trained models on unauthorized domains. To address this issue, we propose a novel approach called Compact Un-Transferable Isolation Domain (CUTI-domain), which acts as a barrier to block illegal transfers from authorized to unauthorized domains. Specifically, CUTI-domain blocks cross-domain transfers by highlighting the private style features of the authorized domain, leading to recognition failure on unauthorized domains with irrelevant private style features. Moreover, we provide two solutions for using CUTI-domain depending on whether the unauthorized domain is known or not: target-specified CUTI-domain and target-free CUTI-domain. Our comprehensive experimental results on four digit datasets, CIFAR10 & STL10, and the VisDA-2017 dataset demonstrate that CUTI-domain can be easily implemented as a plug-and-play module with different backbones, providing an efficient solution for model IP protection.


10 Towards Professional Level Crowd Annotation of Expert Domain Data


Image recognition on expert domains is usually fine-grained and requires expert labeling, which is costly. This limits dataset sizes and the accuracy of learning systems. To address this challenge, we consider annotating expert data with crowdsourcing. This is denoted as PrOfeSsional lEvel cRowd (POSER) annotation. A new approach, based on semi-supervised learning (SSL) and denoted as SSL with human filtering (SSL-HF), is proposed. It is a human-in-the-loop SSL method, where crowd-source workers act as filters of pseudo-labels, replacing the unreliable confidence thresholding used by state-of-the-art SSL methods. To enable annotation by non-experts, classes are specified implicitly, via positive and negative sets of examples, and augmented with deliberative explanations, which highlight regions of class ambiguity. In this way, SSL-HF leverages the strong low-shot learning and confidence estimation ability of humans to create an intuitive but effective labeling experience. Experiments show that SSL-HF significantly outperforms various alternative approaches in several benchmarks.



11 Learning Adaptive Dense Event Stereo From the Image Domain


Recently, event-based stereo matching has been studied due to its robustness in poor light conditions. However, existing event-based stereo networks suffer severe performance degradation when domains shift. Unsupervised domain adaptation (UDA) aims at resolving this problem without using the target domain ground-truth. However, traditional UDA still needs the input event data with ground-truth in the source domain, which is more challenging and costly to obtain than image data. To tackle this issue, we propose a novel unsupervised domain Adaptive Dense Event Stereo (ADES), which resolves gaps between the different domains and input modalities. The proposed ADES framework adapts event-based stereo networks from abundant image datasets with ground-truth on the source domain to event datasets without ground-truth on the target domain, which is a more practical setup. First, we propose a self-supervision module that trains the network on the target domain through image reconstruction, while an artifact prediction network trained on the source domain assists in removing intermittent artifacts in the reconstructed image. Secondly, we utilize the feature-level normalization scheme to align the extracted features along the epipolar line. Finally, we present the motion-invariant consistency module to impose consistent output between perturbed motions. Our experiments demonstrate that our approach achieves remarkable results in the adaptation ability of event-based stereo matching from the image domain.


12 CLIP the Gap: A Single Domain Generalization Approach for Object Detection


Single Domain Generalization (SDG) tackles the problem of training a model on a single source domain so that it generalizes to any unseen target domain. While this has been well studied for image classification, the literature on SDG object detection remains almost non-existent. To address the challenges of simultaneously learning robust object localization and representation, we propose to leverage a pre-trained vision-language model to introduce semantic domain concepts via textual prompts. We achieve this via a semantic augmentation strategy acting on the features extracted by the detector backbone, as well as a text-based classification loss. Our experiments evidence the benefits of our approach, outperforming by 10% the only existing SDG object detection method, Single-DGOD [52], on their own diverse weather-driving benchmark.


13 AutoLabel: CLIP-Based Framework for Open-Set Video Domain Adaptation


Open-set Unsupervised Video Domain Adaptation (OUVDA) deals with the task of adapting an action recognition model from a labelled source domain to an unlabelled target domain that contains "target-private" categories, which are present in the target but absent in the source. In this work we deviate from the prior work of training a specialized open-set classifier or weighted adversarial learning by proposing to use pre-trained Language and Vision Models (CLIP). The CLIP is well suited for OUVDA due to its rich representation and the zero-shot recognition capabilities. However, rejecting target-private instances with the CLIP's zero-shot protocol requires oracle knowledge about the target-private label names. To circumvent the impossibility of the knowledge of label names, we propose AutoLabel that automatically discovers and generates object-centric compositional candidate target-private class names. Despite its simplicity, we show that CLIP when equipped with AutoLabel can satisfactorily reject the target-private instances, thereby facilitating better alignment between the shared classes of the two domains. The code is available. 



14 Domain Generalized Stereo Matching via Hierarchical Visual Transformation

Recently, deep Stereo Matching (SM) networks have shown impressive performance and attracted increasing attention in computer vision. However, existing deep SM networks are prone to learn dataset-dependent shortcuts, which fail to generalize well on unseen realistic datasets. This paper takes a step towards training robust models for the domain generalized SM task, which mainly focuses on learning shortcut-invariant representation from synthetic data to alleviate the domain shifts. Specifically, we propose a Hierarchical Visual Transformation (HVT) network to 1) first transform the training sample hierarchically into new domains with diverse distributions from three levels: Global, Local, and Pixel, 2) then maximize the visual discrepancy between the source domain and new domains, and minimize the cross-domain feature inconsistency to capture domain-invariant features. In this way, we can prevent the model from exploiting the artifacts of synthetic stereo images as shortcut features, thereby estimating the disparity maps more effectively based on the learned robust and shortcut-invariant representation. We integrate our proposed HVT network with SOTA SM networks and evaluate its effectiveness on several public SM benchmark datasets. Extensive experiments clearly show that the HVT network can substantially enhance the performance of existing SM networks in synthetic-to-realistic domain generalization.


15 DA-DETR: Domain Adaptive Detection Transformer With Information Fusion

The recent detection transformer (DETR) simplifies the object detection pipeline by removing hand-crafted designs and hyperparameters as employed in conventional two-stage object detectors. However, how to leverage the simple yet effective DETR architecture in domain adaptive object detection is largely neglected. Inspired by the unique DETR attention mechanisms, we design DA-DETR, a domain adaptive object detection transformer that introduces information fusion for effective transfer from a labeled source domain to an unlabeled target domain. DA-DETR introduces a novel CNN-Transformer Blender (CTBlender) that fuses the CNN features and Transformer features ingeniously for effective feature alignment and knowledge transfer across domains. Specifically, CTBlender employs the Transformer features to modulate the CNN features across multiple scales where the high-level semantic information and the low-level spatial information are fused for accurate object identification and localization. Extensive experiments show that DA-DETR achieves superior detection performance consistently across multiple widely adopted domain adaptation benchmarks. 


16 Patch-Mix Transformer for Unsupervised Domain Adaptation: A Game Perspective

Endeavors have recently been made to leverage the vision transformer (ViT) for the challenging unsupervised domain adaptation (UDA) task. They typically adopt the cross-attention in ViT for direct domain alignment. However, as the performance of cross-attention relies heavily on the quality of pseudo labels for target samples, it becomes less effective when the domain gap becomes large. We solve this problem from a game-theoretic perspective with the proposed model, dubbed PMTrans, which bridges source and target domains with an intermediate domain. Specifically, we propose a novel ViT-based module called PatchMix that effectively builds up the intermediate domain, i.e., probability distribution, by learning to sample patches from both domains based on the game-theoretical models. This way, it learns to mix the patches from the source and target domains to maximize the cross entropy (CE), while exploiting two semi-supervised mixup losses in the feature and label spaces to minimize it. As such, we interpret the process of UDA as a min-max CE game with three players, including the feature extractor, classifier, and PatchMix, to find the Nash Equilibria. Moreover, we leverage attention maps from ViT to re-weight the label of each patch by its importance, making it possible to obtain more domain-discriminative feature representations. We conduct extensive experiments on four benchmark datasets, and the results show that PMTrans significantly surpasses the ViT-based and CNN-based SoTA methods by +3.6% on Office-Home, +1.4% on Office-31, and +17.7% on DomainNet, respectively. https://vlis2022.github.io/cvpr23/PMTrans



17 Upcycling Models Under Domain and Category Shift

Deep neural networks (DNNs) often perform poorly in the presence of domain shift and category shift. How to upcycle DNNs and adapt them to the target task remains an important open problem. Unsupervised Domain Adaptation (UDA), especially recently proposed Source-free Domain Adaptation (SFDA), has become a promising technology to address this issue. Nevertheless, most existing SFDA methods require that the source domain and target domain share the same label space, consequently being only applicable to the vanilla closed-set setting. In this paper, we take one step further and explore the Source-free Universal Domain Adaptation (SF-UniDA). The goal is to identify "known" data samples under both domain and category shift, and reject those "unknown" data samples (not present in source classes), with only the knowledge from standard pre-trained source model. To this end, we introduce an innovative global and local clustering learning technique (GLC). Specifically, we design a novel, adaptive one-vs-all global clustering algorithm to achieve the distinction across different target classes and introduce a local k-NN clustering strategy to alleviate negative transfer. We examine the superiority of our GLC on multiple benchmarks with different category shift scenarios, including partial-set, open-set, and open-partial-set DA. More remarkably, in the most challenging open-partial-set DA scenario, GLC outperforms UMAD by 14.8% on the VisDA benchmark.



18 Domain Expansion of Image Generators

Can one inject new concepts into an already trained generative model, while respecting its existing structure and knowledge? We propose a new task -- domain expansion -- to address this. Given a pretrained generator and novel (but related) domains, we expand the generator to jointly model all domains, old and new, harmoniously. First, we note the generator contains a meaningful, pretrained latent space. Is it possible to minimally perturb this hard-earned representation, while maximally representing the new domains? Interestingly, we find that the latent space offers unused, "dormant" axes, which do not affect the output. This provides an opportunity -- by "repurposing" these axes, we are able to represent new domains, without perturbing the original representation. In fact, we find that pretrained generators have the capacity to add several -- even hundreds -- of new domains! Using our expansion technique, one "expanded" model can supersede numerous domain-specific models, without expanding model size. Additionally, using a single, expanded generator natively supports smooth transitions between and composition of domains.




19 FREDOM: Fairness Domain Adaptation Approach to Semantic Scene Understanding

Although Domain Adaptation in Semantic Scene Segmentation has shown impressive improvement in recent years, the fairness concerns in the domain adaptation have yet to be well defined and addressed. In addition, fairness is one of the most critical aspects when deploying the segmentation models into human-related real-world applications, e.g., autonomous driving, as any unfair predictions could influence human safety. In this paper, we propose a novel Fairness Domain Adaptation (FREDOM) approach to semantic scene segmentation. In particular, from the proposed formulated fairness objective, a new adaptation framework will be introduced based on the fair treatment of class distributions. Moreover, to generally model the context of structural dependency, a new conditional structural constraint is introduced to impose the consistency of predicted segmentation. Thanks to the proposed Conditional Structure Network, the self-attention mechanism has sufficiently modeled the structural information of segmentation. Through the ablation studies, the proposed method has shown the performance improvement of the segmentation models and promoted fairness in the model predictions. The experimental results on the two standard benchmarks, i.e., SYNTHIA -> Cityscapes and GTA5 -> Cityscapes, have shown that our method achieved State-of-the-Art (SOTA) performance.



20 Decompose, Adjust, Compose: Effective Normalization by Playing With Frequency for Domain Generalization

Domain generalization (DG) is a principal task to evaluate the robustness of computer vision models. Many previous studies have used normalization for DG. In normalization, statistics and normalized features are regarded as style and content, respectively. However, normalization suffers from a content variation problem when removing style, because the boundary between content and style is unclear. This study addresses this problem from the frequency domain perspective, where amplitude and phase are considered as style and content, respectively. First, we verify the quantitative phase variation of normalization through the mathematical derivation of the Fourier transform formula. Then, based on this, we propose a novel normalization method, PCNorm, which eliminates only style while preserving content through spectral decomposition. Furthermore, we propose advanced PCNorm variants, CCNorm and SCNorm, which adjust the degrees of variations in content and style, respectively. Thus, they can learn domain-agnostic representations for DG. With the normalization methods, we propose ResNet-variant models, DAC-P and DAC-SC, which are robust to the domain gap. The proposed models outperform other recent DG methods. The DAC-SC achieves an average state-of-the-art performance of 65.6% on five datasets: PACS, VLCS, Office-Home, DomainNet, and TerraIncognita.
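The decompose and compose steps behind the amplitude-as-style, phase-as-content view can be illustrated on a 1-D signal. This sketch uses a naive discrete Fourier transform from the standard library; the paper works on 2-D feature maps, and the function names here are illustrative only.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def decompose(signal):
    """Split a signal into amplitude (style-like) and phase (content-like) spectra."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def compose(amplitude, phase):
    """Rebuild a real signal from amplitude and phase spectra."""
    spectrum = [cmath.rect(a, p) for a, p in zip(amplitude, phase)]
    return [v.real for v in idft(spectrum)]


amplitude, phase = decompose([1.0, 2.0, 3.0, 4.0])
restored = compose(amplitude, phase)  # recovers the input up to rounding
```

Adjusting the amplitude while leaving the phase untouched, then calling `compose`, changes the style-like part of the signal without altering its phase spectrum.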


21 MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation

In unsupervised domain adaptation (UDA), a model trained on source data (e.g. synthetic) is adapted to target data (e.g. real-world) without access to target annotation. Most previous UDA methods struggle with classes that have a similar visual appearance on the target domain as no ground truth is available to learn the slight appearance differences. To address this problem, we propose a Masked Image Consistency (MIC) module to enhance UDA by learning spatial context relations of the target domain as additional clues for robust visual recognition. MIC enforces the consistency between predictions of masked target images, where random patches are withheld, and pseudo-labels that are generated based on the complete image by an exponential moving average teacher. To minimize the consistency loss, the network has to learn to infer the predictions of the masked regions from their context. Due to its simple and universal concept, MIC can be integrated into various UDA methods across different visual recognition tasks such as image classification, semantic segmentation, and object detection. MIC significantly improves the state-of-the-art performance across the different recognition tasks for synthetic-to-real, day-to-nighttime, and clear-to-adverse-weather UDA. For instance, MIC achieves an unprecedented UDA performance of 75.9 mIoU and 92.8% on GTA-to-Cityscapes and VisDA-2017, respectively, which corresponds to an improvement of +2.1 and +3.0 percent points over the previous state of the art. The implementation is available at https://github.com/lhoyer/MIC. 
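The masking step, withholding random patches of a target image, can be sketched as below. The patch size, the mask ratio, and the representation of an image as a nested list are assumptions of this illustration; the teacher network and the consistency loss are omitted.

```python
import random


def random_patch_mask(height, width, patch_size, mask_ratio, rng=None):
    """Return a height x width grid of 0/1 values; each patch is dropped with probability mask_ratio."""
    rng = rng or random.Random()
    mask = [[1] * width for _ in range(height)]
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            if rng.random() < mask_ratio:
                # Zero out the whole patch, clipped at the image border.
                for y in range(top, min(top + patch_size, height)):
                    for x in range(left, min(left + patch_size, width)):
                        mask[y][x] = 0
    return mask


def apply_mask(image, mask):
    """Element-wise product of a single-channel image and a 0/1 mask."""
    return [[v * m for v, m in zip(row, mrow)] for row, mrow in zip(image, mask)]


mask = random_patch_mask(8, 8, patch_size=4, mask_ratio=0.5, rng=random.Random(0))
```

In the setting the abstract describes, the student would predict on the masked image while pseudo-labels come from a teacher that sees the complete image.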



22 Guiding Pseudo-Labels With Uncertainty Estimation for Source-Free Unsupervised Domain Adaptation

Standard Unsupervised Domain Adaptation (UDA) methods assume the availability of both source and target data during the adaptation. In this work, we investigate Source-free Unsupervised Domain Adaptation (SF-UDA), a specific case of UDA where a model is adapted to a target domain without access to source data. We propose a novel approach for the SF-UDA setting based on a loss reweighting strategy that brings robustness against the noise that inevitably affects the pseudo-labels. The classification loss is reweighted based on the reliability of the pseudo-labels that is measured by estimating their uncertainty. Guided by such reweighting strategy, the pseudo-labels are progressively refined by aggregating knowledge from neighbouring samples. Furthermore, a self-supervised contrastive framework is leveraged as a target space regulariser to enhance such knowledge aggregation. A novel negative pairs exclusion strategy is proposed to identify and exclude negative pairs made of samples sharing the same class, even in presence of some noise in the pseudo-labels. Our method outperforms previous methods on three major benchmarks by a large margin. We set the new SF-UDA state-of-the-art on VisDA-C and DomainNet with a performance gain of +1.8% on both benchmarks and on PACS with +12.3% in the single-source setting and +6.6% in multi-target adaptation. Additional analyses demonstrate that the proposed approach is robust to the noise, which results in significantly more accurate pseudo-labels compared to state-of-the-art approaches.
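The loss reweighting described above can be sketched as follows. The abstract does not give the exact weighting function, so this example assumes one plausible choice: pseudo-label probabilities are averaged over neighbouring samples, and the reliability weight is `exp(-H)`, where `H` is the normalized entropy of that average. All function names are hypothetical.

```python
import math


def aggregate_neighbours(neighbour_probs):
    """Average the class probabilities predicted for the neighbouring samples."""
    n = len(neighbour_probs)
    return [sum(column) / n for column in zip(*neighbour_probs)]


def normalized_entropy(probs):
    """Shannon entropy divided by log(num_classes), so the result lies in [0, 1]."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def reliability_weight(probs):
    """Assumed weighting: confident (low-entropy) pseudo-labels get weights near 1."""
    return math.exp(-normalized_entropy(probs))


def reweighted_cross_entropy(pred_probs, pseudo_labels, weights):
    """Mean cross-entropy against pseudo-labels, scaled per sample by its weight."""
    losses = [-w * math.log(max(p[y], 1e-12))
              for p, y, w in zip(pred_probs, pseudo_labels, weights)]
    return sum(losses) / len(losses)


refined = aggregate_neighbours([[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]])
pseudo_label = max(range(len(refined)), key=refined.__getitem__)
weight = reliability_weight(refined)
loss = reweighted_cross_entropy([[0.7, 0.2, 0.1]], [pseudo_label], [weight])
```

Under this assumption, a uniform (maximally uncertain) neighbourhood prediction yields the smallest weight, `exp(-1)`, so noisy pseudo-labels contribute less to the classification loss.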


23 MDL-NAS: A Joint Multi-Domain Learning Framework for Vision Transformer

In this work, we introduce MDL-NAS, a unified framework that integrates multiple vision tasks into a manageable supernet and optimizes these tasks collectively under diverse dataset domains. MDL-NAS is storage-efficient since multiple models with a majority of shared parameters can be deposited into a single one. Technically, MDL-NAS constructs a coarse-to-fine search space, where the coarse search space offers various optimal architectures for different tasks while the fine search space provides fine-grained parameter sharing to tackle the inherent obstacles of multi-domain learning. In the fine search space, we suggest two parameter sharing policies, i.e., sequential sharing policy and mask sharing policy. Compared with previous works, such two sharing policies allow for the partial sharing and non-sharing of parameters at each layer of the network, hence attaining real fine-grained parameter sharing. Finally, we present a joint-subnet search algorithm that finds the optimal architecture and sharing parameters for each task within total resource constraints, challenging the traditional practice that downstream vision tasks are typically equipped with backbone networks designed for image classification. Experimentally, we demonstrate that MDL-NAS families fitted with non-hierarchical or hierarchical transformers deliver competitive performance for all tasks compared with state-of-the-art methods while maintaining efficient storage deployment and computation. We also demonstrate that MDL-NAS allows incremental learning and evades catastrophic forgetting when generalizing to a new task.



24 OSAN: A One-Stage Alignment Network To Unify Multimodal Alignment and Unsupervised Domain Adaptation

Extending from unimodal to multimodal is a critical challenge for unsupervised domain adaptation (UDA). Two major problems emerge in unsupervised multimodal domain adaptation: domain adaptation and modality alignment. An intuitive way to handle these two problems is to fulfill these tasks in two separate stages: aligning modalities followed by domain adaptation, or vice versa. However, domains and modalities are not associated in most existing two-stage studies, and the relationship between them is not leveraged which can provide complementary information to each other. In this paper, we unify these two stages into one to align domains and modalities simultaneously. In our model, a tensor-based alignment module (TAL) is presented to explore the relationship between domains and modalities. By this means, domains and modalities can interact sufficiently and guide them to utilize complementary information for better results. Furthermore, to establish a bridge between domains, a dynamic domain generator (DDG) module is proposed to build transitional samples by mixing the shared information of two domains in a self-supervised manner, which helps our model learn a domain-invariant common representation space. Extensive experiments prove that our method can achieve superior performance in two real-world applications. The code will be publicly available.


25 Spatio-Temporal Pixel-Level Contrastive Learning-Based Source-Free Domain Adaptation for Video Semantic Segmentation

Unsupervised Domain Adaptation (UDA) of semantic segmentation transfers labeled source knowledge to an unlabeled target domain by relying on accessing both the source and target data. However, the access to source data is often restricted or infeasible in real-world scenarios. Under the source data restrictive circumstances, UDA is less practical. To address this, recent works have explored solutions under the Source-Free Domain Adaptation (SFDA) setup, which aims to adapt a source-trained model to the target domain without accessing source data. Still, existing SFDA approaches use only image-level information for adaptation, making them sub-optimal in video applications. This paper studies SFDA for Video Semantic Segmentation (VSS), where temporal information is leveraged to address video adaptation. Specifically, we propose Spatio-Temporal Pixel-Level (STPL) contrastive learning, a novel method that takes full advantage of spatio-temporal information to tackle the absence of source data better. STPL explicitly learns semantic correlations among pixels in the spatio-temporal space, providing strong self-supervision for adaptation to the unlabeled target domain. Extensive experiments show that STPL achieves state-of-the-art performance on VSS benchmarks compared to current UDA and SFDA approaches. Code is available at: https://github.com/shaoyuanlo/STPL


26 Semi-Supervised Domain Adaptation With Source Label Adaptation

Semi-Supervised Domain Adaptation (SSDA) involves learning to classify unseen target data with a few labeled and lots of unlabeled target data, along with many labeled source data from a related domain. Current SSDA approaches usually aim at aligning the target data to the labeled source data with feature space mapping and pseudo-label assignments. Nevertheless, such a source-oriented model can sometimes align the target data to source data of the wrong classes, degrading the classification performance. This paper presents a novel source-adaptive paradigm that adapts the source data to match the target data. Our key idea is to view the source data as a noisily-labeled version of the ideal target data. Then, we propose an SSDA model that cleans up the label noise dynamically with the help of a robust cleaner component designed from the target perspective. Since the paradigm is very different from the core ideas behind existing SSDA approaches, our proposed model can be easily coupled with them to improve their performance. Empirical results on two state-of-the-art SSDA approaches demonstrate that the proposed model effectively cleans up the noise within the source labels and exhibits superior performance over those approaches across benchmark datasets. Our code is available at https://github.com/chu0802/SLA.




DrivingStereo: A large-scale dataset for stereo matching in autonomous driving scenarios.

Supplement: CVPR 2022:

Class-Balanced Pixel-Level Self-Labeling for Domain Adaptive Semantic Segmentation 

Domain adaptive semantic segmentation aims to learn a model with the supervision of source domain data, and produce satisfactory dense predictions on unlabeled target domain. One popular solution to this challenging task is self-training, which selects high-scoring predictions on target samples as pseudo labels for training. However, the produced pseudo labels often contain much noise because the model is biased to source domain as well as majority categories. To address the above issues, we propose to directly explore the intrinsic pixel distributions of target domain data, instead of heavily relying on the source domain. Specifically, we simultaneously cluster pixels and rectify pseudo labels with the obtained cluster assignments. This process is done in an online fashion so that pseudo labels could co-evolve with the segmentation model without extra training rounds. To overcome the class imbalance problem on long-tailed categories, we employ a distribution alignment technique to enforce the marginal class distribution of cluster assignments to be close to that of pseudo labels. The proposed method, namely Class-balanced Pixel-level Self-Labeling (CPSL), improves the segmentation performance on target domain over state-of-the-arts by a large margin, especially on long-tailed categories. The source code is available at https://github.com/lslrh/CPSL.
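The rectification step can be illustrated with a small sketch. It assumes that each pixel's softmax probabilities are reweighted element-wise by a soft cluster assignment before taking the argmax. The abstract does not state the exact rule, so this is only one plausible reading, and both function names are hypothetical. The marginal class distribution is computed here as a simple per-class average, which is the quantity a distribution-alignment step would compare between cluster assignments and pseudo labels.

```python
def rectify_pseudo_labels(pred_probs, cluster_assign):
    """Pick, per pixel, the class maximizing probability times cluster weight."""
    labels = []
    for probs, assign in zip(pred_probs, cluster_assign):
        scores = [p * a for p, a in zip(probs, assign)]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


def marginal_class_distribution(soft_rows):
    """Average soft class scores over all pixels (one row per pixel)."""
    n = len(soft_rows)
    return [sum(column) / n for column in zip(*soft_rows)]


# Two pixels, two classes: the second pixel keeps its label, the first is flipped.
pred_probs = [[0.6, 0.4], [0.9, 0.1]]
cluster_assign = [[0.2, 0.8], [0.7, 0.3]]
labels = rectify_pseudo_labels(pred_probs, cluster_assign)
cluster_marginal = marginal_class_distribution(cluster_assign)
```

Here the first pixel's plain argmax would be class 0, but the cluster assignment favours class 1 strongly enough to change the pseudo label.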







