This paper discusses the results of the third edition of the Monocular Depth Estimation Challenge (MDEC). The challenge focuses on zero-shot generalization to the challenging SYNS-Patches dataset, featuring complex scenes in natural and indoor settings. As with the previous edition, methods can use any form of supervision, i.e. supervised or self-supervised. The challenge received a total of 19 submissions outperforming the baseline on the test set; 10 of them submitted a report describing their approach, highlighting the widespread use of foundation models such as Depth Anything at the core of their methods. The challenge winners drastically improved 3D F-Score performance, from 17.51% to 23.72%.
https://arxiv.org/abs/2404.16831
Our objective is to discover and localize monotonic temporal changes in a sequence of images. To achieve this, we exploit a simple proxy task of ordering a shuffled image sequence, with "time" serving as a supervisory signal since only changes that are monotonic with time can give rise to the correct ordering. We also introduce a flexible transformer-based model for general-purpose ordering of image sequences of arbitrary length with built-in attribution maps. After training, the model successfully discovers and localizes monotonic changes while ignoring cyclic and stochastic ones. We demonstrate applications of the model in multiple video settings covering different scene and object types, discovering both object-level and environmental changes in unseen sequences. We also demonstrate that the attention-based attribution maps function as effective prompts for segmenting the changing regions, and that the learned representations can be used for downstream applications. Finally, we show that the model achieves the state of the art on standard benchmarks for ordering a set of images.
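As a rough illustration of the ordering proxy task, the sketch below trains a transformer to predict each shuffled frame's original position with a cross-entropy loss over position indices; the architecture, sizes, and the name OrderingTransformer are illustrative assumptions, not the paper's model.

```python
# A minimal sketch of the shuffled-sequence ordering proxy task.
# Frame features, model sizes, and names are illustrative, not the paper's.
import torch
import torch.nn as nn

class OrderingTransformer(nn.Module):
    def __init__(self, feat_dim=512, n_heads=8, n_layers=4, max_len=16):
        super().__init__()
        layer = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(feat_dim, max_len)  # logits over positions

    def forward(self, frame_feats):  # (B, T, D), frames in shuffled order
        return self.head(self.encoder(frame_feats))  # (B, T, max_len)

model = OrderingTransformer()
feats = torch.randn(2, 8, 512)  # e.g. pooled per-frame backbone features
perm = torch.stack([torch.randperm(8) for _ in range(2)])  # true positions
logits = model(feats)
loss = nn.functional.cross_entropy(logits.reshape(-1, 16), perm.reshape(-1))
loss.backward()
```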
https://arxiv.org/abs/2404.16828
Unsupervised semantic segmentation aims to automatically partition images into semantically meaningful regions by identifying global categories within an image corpus without any form of annotation. Building upon recent advances in self-supervised representation learning, we focus on how to leverage these large pre-trained models for the downstream task of unsupervised segmentation. We present PriMaPs - Principal Mask Proposals - decomposing images into semantically meaningful masks based on their feature representation. This allows us to realize unsupervised semantic segmentation by fitting class prototypes to PriMaPs with a stochastic expectation-maximization algorithm, PriMaPs-EM. Despite its conceptual simplicity, PriMaPs-EM leads to competitive results across various pre-trained backbone models, including DINO and DINOv2, and across datasets, such as Cityscapes, COCO-Stuff, and Potsdam-3. Importantly, PriMaPs-EM is able to boost results when applied orthogonally to current state-of-the-art unsupervised semantic segmentation pipelines.
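For intuition, here is a minimal sketch of fitting class prototypes to mask-pooled features with an EM-style loop; the hard assignment and momentum update are assumptions standing in for the paper's stochastic EM, and all names are hypothetical.

```python
# A minimal sketch of EM-style prototype fitting over mask-pooled features,
# in the spirit of PriMaPs-EM; the momentum update is an assumption.
import torch
import torch.nn.functional as F

def em_step(mask_feats, prototypes, momentum=0.99):
    """mask_feats: (N, D) features pooled over proposed masks.
    prototypes: (K, D) current class prototypes."""
    mask_feats = F.normalize(mask_feats, dim=1)
    prototypes = F.normalize(prototypes, dim=1)
    # E-step: hard-assign each mask to its most similar prototype.
    assign = (mask_feats @ prototypes.T).argmax(dim=1)  # (N,)
    # M-step: momentum-update each prototype toward its assigned features.
    for k in range(prototypes.size(0)):
        sel = mask_feats[assign == k]
        if len(sel) > 0:
            prototypes[k] = momentum * prototypes[k] + (1 - momentum) * sel.mean(0)
    return F.normalize(prototypes, dim=1), assign
```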
https://arxiv.org/abs/2404.16818
Self-supervised contrastive learning has emerged as one of the most successful deep learning paradigms. In this regard, it has seen extensive use in image registration and, more recently, in the particular field of medical image registration. In this work, we test, extend, and improve ConKeD, a state-of-the-art framework for color fundus image registration. Using the ConKeD framework, we test multiple loss functions, adapting them to the framework and the application domain. Furthermore, we evaluate our models using the standardized benchmark dataset FIRE as well as several datasets that have never been used before for color fundus registration, for which we are releasing the pairing data as well as a standardized evaluation approach. Our work demonstrates state-of-the-art performance across all datasets and metrics, with several advantages over current SOTA color fundus registration methods.
https://arxiv.org/abs/2404.16773
Photometric constraint is indispensable for self-supervised monocular depth estimation. It involves warping a source image onto a target view using the estimated depth and pose, and then minimizing the difference between the warped and target images. However, the endoscope's built-in light causes significant brightness fluctuations, which make the photometric constraint unreliable. Previous efforts mitigate this only by relying on extra models to calibrate image brightness. In this paper, we propose MonoPCC, which addresses the brightness inconsistency at its root by reshaping the photometric constraint into a cycle form. Instead of only warping the source image, MonoPCC constructs a closed loop consisting of two opposite forward-backward warping paths: from target to source and then back to target. Thus, the target image is finally compared against an image cycle-warped from itself, which naturally makes the constraint invariant to brightness changes. Moreover, MonoPCC transplants the source image's phase-frequency into the intermediate warped image to avoid loss of structure, and stabilizes training via an exponential moving average (EMA) strategy to avoid frequent changes in the forward warping. Comprehensive and extensive experimental results on three datasets demonstrate that our proposed MonoPCC shows great robustness to brightness inconsistency and surpasses other state-of-the-art methods, reducing the absolute relative error by at least 7.27%.
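A minimal sketch of the cycle-shaped constraint is given below: the target image is warped into the source view and back, so the comparison involves only the target's own pixel values and is therefore insensitive to source-target brightness shifts. The sampling grids (grid_t2s, grid_s2t), which in the paper would be derived from the estimated depth and pose, are assumed given here.

```python
# A minimal sketch of the cycle-shaped photometric constraint: warp the target
# into the source view and back, then compare the result with the target itself.
# The grids stand in for warps derived from estimated depth and pose.
import torch
import torch.nn.functional as F

def cycle_photometric_loss(target, grid_t2s, grid_s2t):
    """target: (B, 3, H, W); grids: (B, H, W, 2) in [-1, 1] grid_sample format."""
    warped_to_source = F.grid_sample(target, grid_t2s, align_corners=True)
    cycled_target = F.grid_sample(warped_to_source, grid_s2t, align_corners=True)
    # Only the target's own pixel values travel around the loop, so the
    # comparison is unaffected by source-target brightness differences.
    return (cycled_target - target).abs().mean()
```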
https://arxiv.org/abs/2404.16571
Recent advancements in self-supervised learning in the point cloud domain have demonstrated significant potential. However, these methods often suffer from drawbacks, including lengthy pre-training time, the necessity of reconstruction in the input space, or the necessity of additional modalities. In order to address these issues, we introduce Point-JEPA, a joint embedding predictive architecture designed specifically for point cloud data. To this end, we introduce a sequencer that orders point cloud tokens so that token proximity can be efficiently computed and utilized, based on their indices, during target and context selection. The sequencer also allows the token-proximity computations to be shared between context and target selection, further improving efficiency. Experimentally, our method achieves results competitive with state-of-the-art methods while avoiding reconstruction in the input space and the need for additional modalities.
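As an illustrative sketch of what a sequencer might do, the snippet below orders tokens by greedily chaining nearest patch centers so that index proximity tracks spatial proximity; this greedy scheme is an assumption and not necessarily Point-JEPA's actual sequencer.

```python
# A minimal sketch of a sequencer that orders point-cloud tokens so that nearby
# indices correspond to nearby patch centers; greedy nearest-neighbor chaining
# is an illustrative assumption, not necessarily the paper's exact scheme.
import torch

def sequence_tokens(centers):
    """centers: (N, 3) patch-center coordinates; returns an ordering of 0..N-1."""
    n = centers.size(0)
    visited = torch.zeros(n, dtype=torch.bool)
    order = [0]
    visited[0] = True
    for _ in range(n - 1):
        last = centers[order[-1]]
        dists = (centers - last).norm(dim=1)
        dists[visited] = float("inf")  # never revisit a token
        nxt = int(dists.argmin())
        order.append(nxt)
        visited[nxt] = True
    return torch.tensor(order)
```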
https://arxiv.org/abs/2404.16432
Etruscan mirrors constitute a significant category within Etruscan art and, therefore, undergo systematic examinations to obtain insights into ancient times. A crucial aspect of their analysis involves the labor-intensive task of manually tracing engravings from the backside. Additionally, this task is inherently challenging due to the damage these mirrors have sustained, introducing subjectivity into the process. We address these challenges by automating the process through photometric-stereo scanning in conjunction with deep segmentation networks, which, however, requires effective usage of the limited data at hand. We accomplish this by incorporating predictions on a per-patch level, and various data augmentations, as well as exploring self-supervised learning. Compared to our baseline, we improve predictive performance w.r.t. the pseudo-F-Measure by around 16%. When assessing performance on complete mirrors against a human baseline, our approach yields quantitatively similar performance to a human annotator and significantly outperforms existing binarization methods. With our proposed methodology, we streamline the annotation process, enhance its objectivity, and reduce overall workload, offering a valuable contribution to the examination of these historical artifacts and other non-traditional documents.
https://arxiv.org/abs/2404.15903
Humans effortlessly interpret images by parsing them into part-whole hierarchies; deep learning models excel at learning multi-level feature spaces, but they often lack explicit coding of part-whole relations, a prominent property of medical imaging. To overcome this limitation, we introduce Adam-v2, a new self-supervised learning framework extending Adam [79] by explicitly incorporating part-whole hierarchies into its learning objectives through three key branches: (1) Localizability, acquiring discriminative representations to distinguish different anatomical patterns; (2) Composability, learning each anatomical structure in a parts-to-whole manner; and (3) Decomposability, comprehending each anatomical structure in a whole-to-parts manner. Experimental results across 10 tasks, compared against 11 baselines in zero-shot, few-shot transfer, and full fine-tuning settings, showcase Adam-v2's superior performance over large-scale medical models and existing SSL methods across diverse downstream tasks. The higher generality and robustness of Adam-v2's representations originate from its explicit construction of hierarchies for distinct anatomical structures from unlabeled medical images. Adam-v2 preserves a semantic balance of anatomical diversity and harmony in its embedding, yielding representations that are both generic and semantically meaningful, a property overlooked in existing SSL methods. All code and pretrained models are available at this https URL.
https://arxiv.org/abs/2404.15672
The Vision Transformer (ViT) has demonstrated remarkable performance in Self-Supervised Learning (SSL) for 3D medical image analysis. Masked AutoEncoder (MAE) feature pre-training can further unleash the potential of ViT on various medical vision tasks. However, due to the large spatial sizes and much higher dimensionality of 3D medical images, the lack of hierarchical design in MAE may hinder performance on downstream tasks. In this paper, we propose a novel Mask in Mask (MiM) pre-training framework for 3D medical images, which aims to advance MAE by learning discriminative representations from hierarchical visual tokens across varying scales. We introduce multiple levels of granularity for masked inputs from the volume, which are then reconstructed simultaneously at both fine and coarse levels. Additionally, a cross-level alignment mechanism is applied to adjacent-level volumes to enforce anatomical similarity hierarchically. Furthermore, we adopt a hybrid backbone to enhance hierarchical representation learning efficiently during pre-training. MiM was pre-trained on a large scale of available 3D volumetric images, i.e., Computed Tomography (CT) images containing various body parts. Extensive experiments on thirteen public datasets demonstrate the superiority of MiM over other SSL methods in organ/lesion/tumor segmentation and disease classification. We further scale up MiM to large pre-training datasets with more than 10k volumes, showing that large-scale pre-training can further enhance performance on downstream tasks. This improvement suggests that the research community should pay more attention to the scale of the pre-training dataset when building healthcare foundation models for 3D medical images.
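As a small illustration of multi-granularity masking, the sketch below samples a coarse mask over a downsampled token grid and propagates it to the fine grid so the two levels stay consistent; the 2x level ratio and grid sizes are assumptions, not the paper's configuration.

```python
# A minimal sketch of multi-granularity masking for a 3D volume: sample a
# coarse mask, then propagate it to finer tokens so levels stay consistent.
import torch

def hierarchical_masks(grid=(8, 8, 8), mask_ratio=0.6):
    """Returns boolean masks over a coarse token grid and the fine grid."""
    coarse = torch.rand(*(g // 2 for g in grid)) < mask_ratio  # (4, 4, 4)
    # Each coarse token covers a 2x2x2 block of fine tokens.
    fine = (coarse.repeat_interleave(2, 0)
                  .repeat_interleave(2, 1)
                  .repeat_interleave(2, 2))                    # (8, 8, 8)
    return coarse, fine
```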
https://arxiv.org/abs/2404.15580
Understanding videos that contain multiple modalities is crucial, especially in egocentric videos, where combining various sensory inputs significantly improves tasks like action recognition and moment localization. However, real-world applications often face challenges with incomplete modalities due to privacy concerns, efficiency needs, or hardware issues. Current methods, while effective, often necessitate retraining the model entirely to handle missing modalities, making them computationally intensive, particularly with large training datasets. In this study, we propose a novel approach to address this issue at test time without requiring retraining. We frame the problem as a test-time adaptation task, where the model adjusts to the available unlabeled data at test time. Our method, MiDl (Mutual information with self-Distillation), encourages the model to be insensitive to the specific modality source present during testing by minimizing the mutual information between the prediction and the available modality. Additionally, we incorporate self-distillation to maintain the model's original performance when both modalities are available. MiDl represents the first self-supervised, online solution for handling missing modalities exclusively at test time. Through experiments with various pretrained models and datasets, MiDl demonstrates substantial performance improvement without the need for retraining.
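For intuition, a minimal sketch of the two loss terms follows: the mutual information between the prediction and the modality source, measured as the average KL divergence of per-modality predictions from their marginal, plus a self-distillation term toward a frozen teacher when all modalities are present. Shapes and names are illustrative, not the paper's code.

```python
# A minimal sketch of a MiDl-style objective: penalize mutual information
# between the prediction and which modality produced it, plus self-distillation
# toward a frozen copy when all modalities are present.
import torch
import torch.nn.functional as F

def midl_loss(logits_per_modality, teacher_logits=None):
    """logits_per_modality: (M, B, C) predictions from each available modality."""
    probs = logits_per_modality.softmax(dim=-1)        # p(y | m, x)
    marginal = probs.mean(dim=0, keepdim=True)         # p(y | x)
    # MI(prediction; modality) as the average KL to the marginal.
    mi = (probs * (probs.clamp_min(1e-8).log()
                   - marginal.clamp_min(1e-8).log())).sum(-1).mean()
    loss = mi
    if teacher_logits is not None:  # both modalities available: distill
        loss = loss + F.kl_div(probs.mean(0).clamp_min(1e-8).log(),
                               teacher_logits.softmax(-1), reduction="batchmean")
    return loss
```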
https://arxiv.org/abs/2404.15161
Pre-training GNNs to extract transferable knowledge and apply it to downstream tasks has become the de facto standard of graph representation learning. Recent works have focused on designing self-supervised pre-training tasks to extract useful and universal transferable knowledge from large-scale unlabeled data. However, they face an inevitable question: traditional pre-training strategies that aim at extracting information useful for the pre-training tasks may not extract all the information useful for the downstream task. In this paper, we reexamine the pre-training process within traditional pre-training and fine-tuning frameworks from the perspective of the Information Bottleneck (IB) and confirm that the forgetting phenomenon in the pre-training phase can have detrimental effects on downstream tasks. Therefore, we propose a novel Delayed Bottlenecking Pre-training (DBP) framework which maintains as much mutual information as possible between latent representations and training data during the pre-training phase by suppressing the compression operation, and delays the compression operation to the fine-tuning phase to make sure the compression can be guided by labeled fine-tuning data and downstream tasks. To achieve this, we design two information control objectives that can be directly optimized and further integrate them into the actual model design. Extensive experiments on both chemistry and biology domains demonstrate the effectiveness of DBP.
https://arxiv.org/abs/2404.14941
Self-Supervised Learning (SSL) frameworks became the standard for learning robust class representations by benefiting from large unlabeled datasets. For Speaker Verification (SV), most SSL systems rely on contrastive-based loss functions. We explore different ways to improve the performance of these techniques by revisiting the NT-Xent contrastive loss. Our main contribution is the definition of the NT-Xent-AM loss and the study of the importance of Additive Margin (AM) in SimCLR and MoCo SSL methods to further separate positive from negative pairs. Despite class collisions, we show that AM enhances the compactness of same-speaker embeddings and reduces the number of false negatives and false positives on SV. Additionally, we demonstrate the effectiveness of the symmetric contrastive loss, which provides more supervision for the SSL task. Implementing these two modifications to SimCLR improves performance and results in 7.85% EER on VoxCeleb1-O, outperforming other equivalent methods.
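Since the NT-Xent-AM loss is defined concretely, a short sketch is possible: the margin is subtracted from the positive-pair cosine similarity before temperature scaling, and the symmetric variant supervises both view directions. The temperature and margin values below are illustrative.

```python
# A minimal sketch of NT-Xent with an additive margin (AM) on the positive
# pair, plus the symmetric variant; temperature/margin values are illustrative.
import torch
import torch.nn.functional as F

def nt_xent_am(z1, z2, temperature=0.07, margin=0.1):
    """z1, z2: (B, D) embeddings of two views/utterances of the same speakers."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.T                                  # (B, B) cosine similarities
    # Additive margin: make the positive pair harder before the softmax.
    sim = sim - margin * torch.eye(len(z1), device=sim.device)
    labels = torch.arange(len(z1), device=sim.device)
    return F.cross_entropy(sim / temperature, labels)

def symmetric_nt_xent_am(z1, z2, **kw):
    # Symmetric contrastive loss: supervise both view directions.
    return 0.5 * (nt_xent_am(z1, z2, **kw) + nt_xent_am(z2, z1, **kw))
```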
https://arxiv.org/abs/2404.14913
This paper focuses on self-supervised monocular depth estimation in dynamic scenes trained on monocular videos. Existing methods jointly estimate pixel-wise depth and motion, relying mainly on an image reconstruction loss. Dynamic regions remain a critical challenge for these methods due to the inherent ambiguity in depth and motion estimation, resulting in inaccurate depth estimation. This paper proposes a self-supervised training framework exploiting pseudo depth labels for dynamic regions from training data. The key contribution of our framework is to decouple depth estimation for static and dynamic regions of images in the training data. We start with an unsupervised depth estimation approach, which provides reliable depth estimates for static regions and motion cues for dynamic regions and allows us to extract moving object information at the instance level. In the next stage, we use an object network to estimate the depth of those moving objects assuming rigid motions. Then, we propose a new scale alignment module to address the scale ambiguity between estimated depths for static and dynamic regions. We can then use the depth labels generated to train an end-to-end depth estimation network and improve its performance. Extensive experiments on the Cityscapes and KITTI datasets show that our self-training strategy consistently outperforms existing self/unsupervised depth estimation methods.
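As a hedged illustration of what a scale alignment step could look like, the sketch below rescales an object depth estimate by the median ratio to the static-region depth inside the object mask; the median statistic is an assumption, not necessarily the paper's module.

```python
# A minimal sketch of scale alignment between static-region depth and a
# rigid-object depth estimate, using a median ratio over the object mask;
# the median statistic is an illustrative assumption.
import torch

def align_object_scale(static_depth, object_depth, object_mask):
    """static_depth, object_depth: (H, W); object_mask: (H, W) boolean.
    Returns the object depth rescaled to the static depth's scale."""
    ratio = static_depth[object_mask] / object_depth[object_mask].clamp_min(1e-6)
    scale = ratio.median()  # robust to outliers at object boundaries
    return object_depth * scale
```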
https://arxiv.org/abs/2404.14908
Domain adaptive pose estimation aims to enable deep models trained on source domain (synthesized) datasets to produce similar results on target domain (real-world) datasets. Existing methods have made significant progress by conducting image-level or feature-level alignment. However, aligning at a single level is not sufficient to fully bridge the domain gap and achieve excellent domain adaptive results. In this paper, we propose a multi-level domain adaptation approach, which aligns different domains at the image, feature, and pose levels. Specifically, we first utilize image style transfer to ensure that images from the source and target domains have a similar distribution. Subsequently, at the feature level, we employ adversarial training to make the features from the source and target domains preserve domain-invariant characteristics as much as possible. Finally, at the pose level, a self-supervised approach is utilized to enable the model to learn diverse knowledge, implicitly addressing the domain gap. Experimental results demonstrate that significant improvement can be achieved by the proposed multi-level alignment method in pose estimation, which outperforms the previous state of the art by up to 2.4% on human pose estimation and, on animal pose estimation, by up to 3.1% for dogs and 1.4% for sheep.
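The feature-level adversarial training can be illustrated with a standard gradient reversal layer (GRL), one common realization of such alignment; the discriminator architecture and sizes below are assumptions, not the paper's design.

```python
# A minimal sketch of feature-level adversarial alignment via a gradient
# reversal layer (GRL): the feature extractor is pushed toward features the
# domain discriminator cannot separate. Sizes and names are illustrative.
import torch
from torch.autograd import Function

class GradReverse(Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None  # flip the gradient for the extractor

discriminator = torch.nn.Sequential(
    torch.nn.Linear(256, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))

def domain_adv_loss(feats, domain_labels, lam=1.0):
    """feats: (B, 256) pose features; domain_labels: (B,) 0=source, 1=target."""
    logits = discriminator(GradReverse.apply(feats, lam))
    return torch.nn.functional.cross_entropy(logits, domain_labels)
```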
https://arxiv.org/abs/2404.14885
Blocking is a critical step in entity resolution, and the emergence of neural network-based representation models has led to the development of dense blocking as a promising approach for exploring deep semantics in blocking. However, previous advanced self-supervised dense blocking approaches require domain-specific training on the target domain, which limits the benefits and rapid adaptation of these methods. To address this issue, we propose UBlocker, a dense blocker that is pre-trained on a domain-independent, easily-obtainable tabular corpus using self-supervised contrastive learning. By conducting domain-independent pre-training, UBlocker can be adapted to various downstream blocking scenarios without requiring domain-specific fine-tuning. To evaluate the universality of our entity blocker, we also construct a new benchmark covering a wide range of blocking tasks from multiple domains and scenarios. Our experiments show that the proposed UBlocker, without any domain-specific learning, significantly outperforms previous self- and unsupervised dense blocking methods and is comparable and complementary to the state-of-the-art sparse blocking methods.
https://arxiv.org/abs/2404.14831
This research addresses the challenge of estimating bathymetry from imaging sonars, where state-of-the-art works have primarily relied on either supervised learning with ground-truth labels or surface rendering based on the Lambertian assumption. In this letter, we propose a novel, self-supervised framework based on volume rendering for reconstructing bathymetry using forward-looking sonar (FLS) data collected during standard surveys. We represent the seafloor as a neural heightmap encapsulated with a parametric multi-resolution hash encoding scheme, and model the sonar measurements with a differentiable renderer using sonar volumetric rendering with hierarchical sampling. Additionally, we model the horizontal and vertical beam patterns and estimate them jointly with the bathymetry. We evaluate the proposed method quantitatively on simulation and field data collected by remotely operated vehicles (ROVs) during low-altitude surveys. Results show that the proposed method outperforms the current state-of-the-art approaches that use imaging sonars for seabed mapping. We also demonstrate that the proposed approach can potentially be used to increase the resolution of a low-resolution prior map with FLS data from low-altitude surveys.
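To make the seafloor representation concrete, here is a minimal sketch of a neural heightmap queried at horizontal coordinates; a plain MLP stands in for the hash-encoded parametric model, and the differentiable sonar renderer that would supervise it is not shown.

```python
# A minimal sketch of a neural heightmap: a network mapping horizontal
# coordinates to seafloor height. A plain MLP stands in for the paper's
# multi-resolution hash-encoded model; the sonar renderer is omitted.
import torch

heightmap = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1))

xy = torch.rand(1024, 2)  # query points on the horizontal plane
z = heightmap(xy)         # predicted seafloor height at each point
```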
https://arxiv.org/abs/2404.14819
Lane detection enables highly functional autonomous driving systems to understand driving scenes even in complex environments. In this paper, we work towards developing a generalized computer vision system able to detect lanes without using any annotation. We make the following contributions: (i) We illustrate how to perform unsupervised 3D lane segmentation by leveraging the distinctive intensity of lanes in LiDAR point cloud frames, and then obtain noisy lane labels in the 2D plane by projecting the 3D points; (ii) We propose a novel self-supervised training scheme, dubbed LaneCorrect, that automatically corrects the lane labels by learning geometric consistency and instance awareness from adversarial augmentations; (iii) With the self-supervised pre-trained model, we distill it to train a student network for arbitrary target lane (e.g., TuSimple) detection without any human labels; (iv) We thoroughly evaluate our self-supervised method on four major lane detection benchmarks (including TuSimple, CULane, CurveLanes and LLAMAS) and demonstrate excellent performance compared with existing supervised counterparts, whilst showing more effective results on alleviating the domain gap, i.e., training on CULane and testing on TuSimple.
https://arxiv.org/abs/2404.14671
When prompting a language model (LM), users frequently expect the model to adhere to a set of behavioral principles across diverse tasks, such as producing insightful content while avoiding harmful or biased language. Instilling such principles into a model can be resource-intensive and technically challenging, generally requiring human preference labels or examples. We introduce SAMI, a method for teaching a pretrained LM to follow behavioral principles that does not require any preference labels or demonstrations. SAMI is an iterative algorithm that finetunes a pretrained LM to increase the conditional mutual information between constitutions and self-generated responses given queries from a dataset. On single-turn dialogue and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained model, with win rates between 66% and 77%. Strikingly, it also surpasses an instruction-finetuned baseline (mistral-7b-instruct) with win rates between 55% and 57% on single-turn dialogue. SAMI requires a "principle writer" model; to avoid dependence on stronger models, we further evaluate aligning a strong pretrained model (mixtral-8x7b) using constitutions written by a weak instruction-finetuned model (mistral-7b-instruct). The SAMI-trained mixtral-8x7b outperforms both the initial model and the instruction-finetuned model, achieving a 65% win rate on summarization. Our results indicate that a pretrained LM can learn to follow constitutions without using preference labels, demonstrations, or human oversight.
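A hedged sketch of the core objective follows: an InfoNCE-style lower bound on the conditional mutual information between constitutions and responses, built from a matrix of LM log-likelihoods. The toy tensor and the symmetric row/column form are assumptions, not SAMI's exact algorithm.

```python
# A minimal sketch of an InfoNCE-style lower bound on the conditional mutual
# information between constitutions and responses. `logp` would come from the
# LM's log-likelihoods; here it is a toy tensor.
import torch
import torch.nn.functional as F

def sami_style_objective(logp):
    """logp[i, j] = log p(response_j | constitution_i, query), shape (N, N);
    response_i was generated under constitution_i (matched on the diagonal)."""
    targets = torch.arange(logp.size(0))
    # Each response should be most likely under its own constitution, and
    # each constitution should best explain its own response.
    row_loss = F.cross_entropy(logp, targets)    # contrast over responses
    col_loss = F.cross_entropy(logp.T, targets)  # contrast over constitutions
    return 0.5 * (row_loss + col_loss)

loss = sami_style_objective(torch.randn(4, 4))
```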
https://arxiv.org/abs/2404.14313
The performance of image-based Reinforcement Learning (RL) agents can vary depending on the position of the camera used to capture the images. Training on multiple cameras simultaneously, including a first-person egocentric camera, can leverage information from different camera perspectives to improve the performance of RL. However, hardware constraints may limit the availability of multiple cameras in real-world deployment. Additionally, cameras may become damaged in the real world, preventing access to all cameras that were used during training. To overcome these hardware constraints, we propose Multi-View Disentanglement (MVD), which uses multiple cameras to learn a policy that achieves zero-shot generalisation to any single camera from the training set. Our approach is a self-supervised auxiliary task for RL that learns a disentangled representation from multiple cameras, with a shared representation that is aligned across all cameras to allow generalisation to a single camera, and a private representation that is camera-specific. We show experimentally that an RL agent trained on a single third-person camera is unable to learn an optimal policy in many control tasks; but our approach, benefiting from multiple cameras during training, is able to solve the task using only the same single third-person camera.
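For intuition, the sketch below splits each camera's features into shared and private embeddings and aligns the shared parts across cameras, which is the property that lets the policy generalize to any single camera; the linear heads and cosine alignment loss are illustrative assumptions.

```python
# A minimal sketch of multi-view disentanglement: each camera gets a shared
# and a private embedding; shared embeddings are aligned across cameras so the
# policy can run from any single camera. Architecture details are illustrative.
import torch
import torch.nn.functional as F

class MVDEncoder(torch.nn.Module):
    def __init__(self, in_dim=1024, shared_dim=128, private_dim=128):
        super().__init__()
        self.shared = torch.nn.Linear(in_dim, shared_dim)    # camera-agnostic
        self.private = torch.nn.Linear(in_dim, private_dim)  # camera-specific

    def forward(self, x):
        return self.shared(x), self.private(x)

def alignment_loss(shared_per_cam):
    """shared_per_cam: list of (B, D) shared embeddings, one per camera."""
    stacked = torch.stack([F.normalize(s, dim=1) for s in shared_per_cam])
    anchor = stacked.mean(dim=0, keepdim=True)  # consensus representation
    return (1 - F.cosine_similarity(stacked, anchor, dim=-1)).mean()
```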
https://arxiv.org/abs/2404.14064
We introduce a self-supervised pretraining method, called OccFeat, for camera-only Bird's-Eye-View (BEV) segmentation networks. With OccFeat, we pretrain a BEV network via occupancy prediction and feature distillation tasks. Occupancy prediction provides the model with a 3D geometric understanding of the scene. However, the geometry learned is class-agnostic. Hence, we add semantic information to the model in the 3D space through distillation from a self-supervised pretrained image foundation model. Models pretrained with our method exhibit improved BEV semantic segmentation performance, particularly in low-data scenarios. Moreover, empirical results affirm the efficacy of integrating feature distillation with 3D occupancy prediction in our pretraining approach.
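A minimal sketch of how the two pretraining terms could combine is shown below: binary occupancy prediction on a voxel grid plus cosine feature distillation from a frozen image foundation model; heads, shapes, and the specific loss forms are assumptions.

```python
# A minimal sketch of two OccFeat-style pretraining terms: binary occupancy
# prediction plus feature distillation from a frozen image foundation model.
# Shapes and the cosine distillation loss are illustrative assumptions.
import torch
import torch.nn.functional as F

def occfeat_style_loss(occ_logits, occ_target, student_feats, teacher_feats):
    """occ_logits/occ_target: (B, 1, X, Y, Z) voxel occupancy;
    student/teacher feats: (B, C, X, Y, Z), teacher from a frozen SSL model."""
    occ_loss = F.binary_cross_entropy_with_logits(occ_logits, occ_target)
    distill = (1 - F.cosine_similarity(student_feats, teacher_feats, dim=1)).mean()
    return occ_loss + distill
```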
https://arxiv.org/abs/2404.14027