Understanding videos that contain multiple modalities is crucial, especially in egocentric videos, where combining various sensory inputs significantly improves tasks like action recognition and moment localization. However, real-world applications often face challenges with incomplete modalities due to privacy concerns, efficiency needs, or hardware issues. Current methods, while effective, often necessitate retraining the model entirely to handle missing modalities, making them computationally intensive, particularly with large training datasets. In this study, we propose a novel approach to address this issue at test time without requiring retraining. We frame the problem as a test-time adaptation task, where the model adjusts to the available unlabeled data at test time. Our method, MiDl (Mutual information with self-Distillation), encourages the model to be insensitive to the specific modality source present during testing by minimizing the mutual information between the prediction and the available modality. Additionally, we incorporate self-distillation to maintain the model's original performance when both modalities are available. MiDl represents the first self-supervised, online solution for handling missing modalities exclusively at test time. Through experiments with various pretrained models and datasets, MiDl demonstrates substantial performance improvement without the need for retraining.
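Below is a minimal PyTorch sketch of the two MiDl terms described above; the batch-level mutual-information estimate (under a uniform modality prior) and the KL self-distillation term are illustrative stand-ins, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def midl_loss(logits_per_modality, logits_full, logits_full_frozen):
    """Illustrative MiDl objective for one unlabeled test batch.

    logits_per_modality: list of (B, C) logits, one per available modality
        source (e.g., RGB-only, audio-only).
    logits_full / logits_full_frozen: (B, C) logits of the adapting and the
        frozen pretrained model when both modalities are present.
    """
    probs = [F.softmax(l, dim=-1) for l in logits_per_modality]
    marginal = torch.stack(probs).mean(dim=0)  # p(y) under a uniform modality prior
    # I(Y; M) = E_m[ KL(p(y|m) || p(y)) ], minimized so that predictions
    # become insensitive to which modality produced them.
    mi = torch.stack([
        F.kl_div(marginal.log(), p, reduction="batchmean") for p in probs
    ]).mean()
    # Self-distillation: stay close to the pretrained model on complete inputs.
    distill = F.kl_div(
        F.log_softmax(logits_full, dim=-1),
        F.softmax(logits_full_frozen, dim=-1),
        reduction="batchmean",
    )
    return mi + distill
```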
https://arxiv.org/abs/2404.15161
Pre-training GNNs to extract transferable knowledge and apply it to downstream tasks has become the de facto standard of graph representation learning. Recent works have focused on designing self-supervised pre-training tasks to extract useful and universal transferable knowledge from large-scale unlabeled data. However, they face an inevitable question: traditional pre-training strategies, which aim at extracting information useful for the pre-training tasks, may not extract all the information useful for the downstream task. In this paper, we reexamine the pre-training process within traditional pre-training and fine-tuning frameworks from the perspective of the Information Bottleneck (IB) and confirm that the forgetting phenomenon in the pre-training phase may cause detrimental effects on downstream tasks. Therefore, we propose a novel Delayed Bottlenecking Pre-training (DBP) framework, which maintains as much mutual information as possible between latent representations and training data during the pre-training phase by suppressing the compression operation, and delays the compression operation to the fine-tuning phase so that compression can be guided by labeled fine-tuning data and downstream tasks. To achieve this, we design two information control objectives that can be directly optimized and further integrate them into the actual model design. Extensive experiments on both chemistry and biology domains demonstrate the effectiveness of DBP.
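To make the delayed-bottlenecking idea concrete, here is a hedged PyTorch sketch in which a reconstruction term stands in for retaining I(Z; X) during pre-training and a simple norm penalty stands in for the compression objective re-enabled at fine-tuning; both proxies are assumptions, not the paper's information control objectives.

```python
import torch
import torch.nn.functional as F

def dbp_objectives(z, x_recon, x, logits=None, labels=None, beta=1e-3):
    """Illustrative delayed-bottlenecking losses (proxies, not the paper's
    estimators). Pre-training keeps I(Z; X) high via reconstruction and
    applies no compression; fine-tuning re-enables a compression penalty so
    the bottleneck is shaped by labeled downstream data."""
    retention = F.mse_loss(x_recon, x)          # stands in for maximizing I(Z; X)
    if labels is None:                          # pre-training: compression suppressed
        return retention
    task = F.cross_entropy(logits, labels)      # downstream fit on labeled data
    compression = z.pow(2).sum(dim=-1).mean()   # stands in for reducing I(Z; X)
    return task + beta * compression            # compression delayed to fine-tuning
```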
https://arxiv.org/abs/2404.14941
Self-Supervised Learning (SSL) frameworks became the standard for learning robust class representations by benefiting from large unlabeled datasets. For Speaker Verification (SV), most SSL systems rely on contrastive-based loss functions. We explore different ways to improve the performance of these techniques by revisiting the NT-Xent contrastive loss. Our main contribution is the definition of the NT-Xent-AM loss and the study of the importance of Additive Margin (AM) in SimCLR and MoCo SSL methods to further separate positive from negative pairs. Despite class collisions, we show that AM enhances the compactness of same-speaker embeddings and reduces the number of false negatives and false positives on SV. Additionally, we demonstrate the effectiveness of the symmetric contrastive loss, which provides more supervision for the SSL task. Implementing these two modifications to SimCLR improves performance and results in 7.85% EER on VoxCeleb1-O, outperforming other equivalent methods.
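The additive-margin modification is easy to state in code. Below is a sketch of a symmetric NT-Xent loss with an additive margin; the margin and temperature values are placeholders rather than the paper's tuned hyperparameters, and intra-view negatives are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def nt_xent_am(z1, z2, margin=0.1, tau=0.07):
    """Symmetric NT-Xent with an additive margin (AM) on the positive
    cosine similarity (a sketch; hyperparameter values are placeholders)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    b = z1.size(0)
    targets = torch.arange(b, device=z1.device)
    eye = torch.eye(b, device=z1.device)

    def one_direction(a, p):
        # Subtracting the margin from positive-pair similarities forces
        # positives to beat negatives by at least `margin`, tightening
        # same-speaker embeddings.
        logits = (a @ p.t() - margin * eye) / tau
        return F.cross_entropy(logits, targets)

    # Symmetric contrastive loss: both view orders supervise the encoder.
    return 0.5 * (one_direction(z1, z2) + one_direction(z2, z1))
```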
https://arxiv.org/abs/2404.14913
This paper focuses on self-supervised monocular depth estimation in dynamic scenes trained on monocular videos. Existing methods jointly estimate pixel-wise depth and motion, relying mainly on an image reconstruction loss. Dynamic regions remain a critical challenge for these methods due to the inherent ambiguity in depth and motion estimation, resulting in inaccurate depth estimation. This paper proposes a self-supervised training framework exploiting pseudo depth labels for dynamic regions from training data. The key contribution of our framework is to decouple depth estimation for static and dynamic regions of images in the training data. We start with an unsupervised depth estimation approach, which provides reliable depth estimates for static regions and motion cues for dynamic regions and allows us to extract moving object information at the instance level. In the next stage, we use an object network to estimate the depth of those moving objects assuming rigid motions. Then, we propose a new scale alignment module to address the scale ambiguity between estimated depths for static and dynamic regions. We can then use the depth labels generated to train an end-to-end depth estimation network and improve its performance. Extensive experiments on the Cityscapes and KITTI datasets show that our self-training strategy consistently outperforms existing self/unsupervised depth estimation methods.
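As a rough illustration of the scale-alignment step, the median-ratio rule below rescales an object network's depth to agree with the static estimate inside the object mask; the paper's scale alignment module is learned, so this closed-form rule is only a stand-in.

```python
import torch

def align_object_scale(static_depth, object_depth, object_mask, eps=1e-6):
    """Align the scale of an object's estimated depth with the static
    estimate via a median ratio inside the object mask (an illustrative
    stand-in for the paper's learned scale alignment module)."""
    scale = static_depth[object_mask].median() / object_depth[object_mask].median().clamp(min=eps)
    aligned = object_depth.clone()
    aligned[object_mask] = object_depth[object_mask] * scale
    return aligned
```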
https://arxiv.org/abs/2404.14908
Domain adaptive pose estimation aims to enable deep models trained on source domain (synthesized) datasets to produce similar results on target domain (real-world) datasets. Existing methods have made significant progress by conducting image-level or feature-level alignment. However, aligning at only a single level is not sufficient to fully bridge the domain gap and achieve excellent domain adaptive results. In this paper, we propose a multi-level domain adaptation approach, which aligns different domains at the image, feature, and pose levels. Specifically, we first utilize image style transfer to ensure that images from the source and target domains have a similar distribution. Subsequently, at the feature level, we employ adversarial training to make the features from the source and target domains preserve domain-invariant characteristics as much as possible. Finally, at the pose level, a self-supervised approach is utilized to enable the model to learn diverse knowledge, implicitly addressing the domain gap. Experimental results demonstrate that significant improvement can be achieved by the proposed multi-level alignment method in pose estimation, which outperforms the previous state of the art by up to 2.4% on human pose estimation, and on animal pose estimation by up to 3.1% for dogs and 1.4% for sheep.
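The feature-level adversarial alignment can be realized with a standard gradient-reversal layer, as in the sketch below; the classifier sizes are arbitrary placeholders, and this is not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity in the forward pass, negated (scaled)
    gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

# Hypothetical feature dimension; the real pose networks differ.
domain_classifier = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))

def adversarial_alignment_loss(feats, domain_labels, lam=1.0):
    # The backbone is trained to fool the domain classifier, pushing its
    # features toward domain-invariance.
    logits = domain_classifier(GradReverse.apply(feats, lam))
    return F.cross_entropy(logits, domain_labels)
```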
https://arxiv.org/abs/2404.14885
Blocking is a critical step in entity resolution, and the emergence of neural network-based representation models has led to the development of dense blocking as a promising approach for exploring deep semantics in blocking. However, previous advanced self-supervised dense blocking approaches require domain-specific training on the target domain, which limits the benefits and rapid adaptation of these methods. To address this issue, we propose UBlocker, a dense blocker that is pre-trained on a domain-independent, easily-obtainable tabular corpus using self-supervised contrastive learning. By conducting domain-independent pre-training, UBlocker can be adapted to various downstream blocking scenarios without requiring domain-specific fine-tuning. To evaluate the universality of our entity blocker, we also construct a new benchmark covering a wide range of blocking tasks from multiple domains and scenarios. Our experiments show that the proposed UBlocker, without any domain-specific learning, significantly outperforms previous self- and unsupervised dense blocking methods and is comparable and complementary to the state-of-the-art sparse blocking methods.
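Once pre-trained, a dense blocker reduces to embedding both tables and taking nearest neighbours. A sketch follows; the encoder interface and the way records are serialized into its input are assumptions.

```python
import torch
import torch.nn.functional as F

def block_candidates(encoder, left_records, right_records, k=10):
    """Illustrative dense blocking with a pretrained record encoder: embed
    both tables once, then take each left record's top-k cosine neighbours
    as its candidate set."""
    with torch.no_grad():
        a = F.normalize(encoder(left_records), dim=-1)   # (Na, d)
        b = F.normalize(encoder(right_records), dim=-1)  # (Nb, d)
    sim = a @ b.t()                                      # pairwise cosine similarity
    return sim.topk(k, dim=-1).indices                   # (Na, k) candidate indices
```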
https://arxiv.org/abs/2404.14831
This research addresses the challenge of estimating bathymetry from imaging sonars where the state-of-the-art works have primarily relied on either supervised learning with ground-truth labels or surface rendering based on the Lambertian assumption. In this letter, we propose a novel, self-supervised framework based on volume rendering for reconstructing bathymetry using forward-looking sonar (FLS) data collected during standard surveys. We represent the seafloor as a neural heightmap encapsulated with a parametric multi-resolution hash encoding scheme and model the sonar measurements with a differentiable renderer using sonar volumetric rendering employed with hierarchical sampling techniques. Additionally, we model the horizontal and vertical beam patterns and estimate them jointly with the bathymetry. We evaluate the proposed method quantitatively on simulation and field data collected by remotely operated vehicles (ROVs) during low-altitude surveys. Results show that the proposed method outperforms the current state-of-the-art approaches that use imaging sonars for seabed mapping. We also demonstrate that the proposed approach can potentially be used to increase the resolution of a low-resolution prior map with FLS data from low-altitude surveys.
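A hedged sketch of the seafloor representation follows: the paper encodes the neural heightmap with a parametric multi-resolution hash encoding, for which the simpler Fourier-feature MLP below stands in.

```python
import torch
import torch.nn as nn

class NeuralHeightmap(nn.Module):
    """Sketch of the seafloor representation: an MLP mapping horizontal
    position (x, y) to height z. A Fourier-feature encoding stands in for
    the paper's multi-resolution hash encoding (an assumption)."""
    def __init__(self, n_freq=6, hidden=64):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freq))
        self.mlp = nn.Sequential(
            nn.Linear(4 * n_freq, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xy):                        # (N, 2) horizontal coordinates
        ang = xy[..., None] * self.freqs          # (N, 2, n_freq)
        enc = torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1)
        return self.mlp(enc).squeeze(-1)          # (N,) heights
```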
https://arxiv.org/abs/2404.14819
Lane detection has evolved into a core capability of highly functional autonomous driving systems for understanding driving scenes, even in complex environments. In this paper, we work towards developing a generalized computer vision system able to detect lanes without using any annotation. We make the following contributions: (i) We illustrate how to perform unsupervised 3D lane segmentation by leveraging the distinctive intensity of lanes on LiDAR point cloud frames, and then obtain noisy lane labels in the 2D plane by projecting the 3D points (see the sketch below); (ii) We propose a novel self-supervised training scheme, dubbed LaneCorrect, that automatically corrects the lane labels by learning geometric consistency and instance awareness from adversarial augmentations; (iii) With the self-supervised pre-trained model, we distill to train a student network for arbitrary target lane (e.g., TuSimple) detection without any human labels; (iv) We thoroughly evaluate our self-supervised method on four major lane detection benchmarks (including TuSimple, CULane, CurveLanes and LLAMAS) and demonstrate excellent performance compared with existing supervised counterparts, whilst showing more effective results on alleviating the domain gap, i.e., training on CULane and testing on TuSimple.
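Contribution (i) can be pictured as follows: keep high-intensity LiDAR returns (lane paint is strongly retroreflective) and project them into the image to form noisy 2D lane labels. The intensity threshold and matrix conventions below are assumptions.

```python
import numpy as np

def noisy_lane_labels(points, intensity, K, T, thresh=0.8):
    """Sketch of unsupervised noisy lane label generation: select
    high-intensity LiDAR points, transform them with LiDAR-to-camera
    extrinsics T, and project with camera intrinsics K."""
    lane_pts = points[intensity > thresh]            # (M, 3) candidate lane points
    cam = T[:3, :3] @ lane_pts.T + T[:3, 3:4]        # LiDAR frame -> camera frame
    cam = cam[:, cam[2] > 0]                         # keep points in front of the camera
    uv = K @ cam                                     # perspective projection
    return (uv[:2] / uv[2]).T                        # (M', 2) pixel coordinates
```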
https://arxiv.org/abs/2404.14671
When prompting a language model (LM), users frequently expect the model to adhere to a set of behavioral principles across diverse tasks, such as producing insightful content while avoiding harmful or biased language. Instilling such principles into a model can be resource-intensive and technically challenging, generally requiring human preference labels or examples. We introduce SAMI, a method for teaching a pretrained LM to follow behavioral principles that does not require any preference labels or demonstrations. SAMI is an iterative algorithm that finetunes a pretrained LM to increase the conditional mutual information between constitutions and self-generated responses given queries from a dataset. On single-turn dialogue and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained model, with win rates between 66% and 77%. Strikingly, it also surpasses an instruction-finetuned baseline (mistral-7b-instruct) with win rates between 55% and 57% on single-turn dialogue. SAMI requires a "principle writer" model; to avoid dependence on stronger models, we further evaluate aligning a strong pretrained model (mixtral-8x7b) using constitutions written by a weak instruction-finetuned model (mistral-7b-instruct). The SAMI-trained mixtral-8x7b outperforms both the initial model and the instruction-finetuned model, achieving a 65% win rate on summarization. Our results indicate that a pretrained LM can learn to follow constitutions without using preference labels, demonstrations, or human oversight.
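A sketch of a SAMI-style contrastive objective: given a batch of constitution-response pairs for the same queries, scoring each response under each constitution and raising the matched diagonal yields a lower bound on the conditional mutual information. This illustrates the idea, not the authors' exact estimator.

```python
import torch
import torch.nn.functional as F

def sami_loss(logp_matrix):
    """Illustrative SAMI-style objective. logp_matrix[i, j] is the LM's
    total log-likelihood of self-generated response i under constitution j
    (same query). Raising the diagonal against each row and column acts as
    a contrastive lower bound on I(constitution; response | query)."""
    targets = torch.arange(logp_matrix.size(0), device=logp_matrix.device)
    row = F.cross_entropy(logp_matrix, targets)      # response i prefers constitution i
    col = F.cross_entropy(logp_matrix.t(), targets)  # constitution i prefers response i
    return 0.5 * (row + col)
```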
https://arxiv.org/abs/2404.14313
The performance of image-based Reinforcement Learning (RL) agents can vary depending on the position of the camera used to capture the images. Training on multiple cameras simultaneously, including a first-person egocentric camera, can leverage information from different camera perspectives to improve the performance of RL. However, hardware constraints may limit the availability of multiple cameras in real-world deployment. Additionally, cameras may become damaged in the real world, preventing access to all cameras that were used during training. To overcome these hardware constraints, we propose Multi-View Disentanglement (MVD), which uses multiple cameras to learn a policy that achieves zero-shot generalisation to any single camera from the training set. Our approach is a self-supervised auxiliary task for RL that learns a disentangled representation from multiple cameras, with a shared representation that is aligned across all cameras to allow generalisation to a single camera, and a private representation that is camera-specific. We show experimentally that an RL agent trained on a single third-person camera is unable to learn an optimal policy in many control tasks; but, our approach, benefiting from multiple cameras during training, is able to solve the task using only the same single third-person camera.
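A minimal sketch of the disentanglement auxiliary losses, using simple cosine proxies for aligning the shared representations across cameras and separating each camera's private representation; both proxies are assumptions, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def mvd_aux_loss(shared_feats, private_feats):
    """Illustrative MVD auxiliary task: shared embeddings from all cameras
    at the same timestep are pulled together, while each camera's private
    embedding is pushed away from its shared counterpart."""
    n = len(shared_feats)                    # one (B, d) tensor per camera
    align = 0.0
    for i in range(n):
        for j in range(i + 1, n):            # align shared parts pairwise
            align = align + (1 - F.cosine_similarity(
                shared_feats[i], shared_feats[j], dim=-1).mean())
    ortho = 0.0
    for s, p in zip(shared_feats, private_feats):
        # Decorrelate shared and private parts so camera-specific detail
        # does not leak into the representation the policy relies on.
        ortho = ortho + F.cosine_similarity(s, p, dim=-1).abs().mean()
    return align + ortho
```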
https://arxiv.org/abs/2404.14064
We introduce a self-supervised pretraining method, called OccFeat, for camera-only Bird's-Eye-View (BEV) segmentation networks. With OccFeat, we pretrain a BEV network via occupancy prediction and feature distillation tasks. Occupancy prediction provides a 3D geometric understanding of the scene to the model. However, the geometry learned is class-agnostic. Hence, we add semantic information to the model in the 3D space through distillation from a self-supervised pretrained image foundation model. Models pretrained with our method exhibit improved BEV semantic segmentation performance, particularly in low-data scenarios. Moreover, empirical results affirm the efficacy of integrating feature distillation with 3D occupancy prediction in our pretraining approach.
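The two pretraining terms combine naturally, as in this hedged sketch; the tensor layout and the L2 distillation loss are assumptions rather than the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def occfeat_loss(voxel_logits, voxel_feats, occ_targets, teacher_feats, mask):
    """Illustrative combination of the two pretraining terms: binary
    occupancy prediction for class-agnostic geometry, plus regression of a
    self-supervised image foundation model's features into occupied voxels."""
    occ = F.binary_cross_entropy_with_logits(voxel_logits, occ_targets)
    distill = F.mse_loss(voxel_feats[mask], teacher_feats[mask])  # only where teacher features exist
    return occ + distill
```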
https://arxiv.org/abs/2404.14027
By leveraging the blur-noise trade-off, imaging with non-uniform exposures largely extends the image acquisition flexibility in harsh environments. However, the limitation of conventional cameras in perceiving intra-frame dynamic information prevents existing methods from being implemented in the real-world frame acquisition for real-time adaptive camera shutter control. To address this challenge, we propose a novel Neuromorphic Shutter Control (NSC) system to avoid motion blurs and alleviate instant noises, where the extremely low latency of events is leveraged to monitor the real-time motion and facilitate the scene-adaptive exposure. Furthermore, to stabilize the inconsistent Signal-to-Noise Ratio (SNR) caused by the non-uniform exposure times, we propose an event-based image denoising network within a self-supervised learning paradigm, i.e., SEID, exploring the statistics of image noises and inter-frame motion information of events to obtain artificial supervision signals for high-quality imaging in real-world scenes. To illustrate the effectiveness of the proposed NSC, we implement it in hardware by building a hybrid-camera imaging prototype system, with which we collect a real-world dataset containing well-synchronized frames and events in diverse scenarios with different target scenes and motion patterns. Experiments on the synthetic and real-world datasets demonstrate the superiority of our method over state-of-the-art approaches.
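As an illustration of event-driven exposure control, the toy rule below shortens the shutter as the event rate (a proxy for scene motion) rises and lengthens it when the scene is still, trading blur against noise; the inverse-rate law and the constants are assumptions, not the NSC controller itself.

```python
import numpy as np

def choose_exposure(event_rate, base_exposure, max_exposure, k=1e-4):
    """Toy scene-adaptive shutter rule: high event rates signal fast motion,
    so the exposure is shortened to avoid blur; low rates allow longer
    exposures that suppress instant noise."""
    exposure = base_exposure / (1.0 + k * event_rate)
    return float(np.clip(exposure, base_exposure * 0.01, max_exposure))
```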
https://arxiv.org/abs/2404.13972
As a preliminary work, NeRF-Det unifies the tasks of novel view synthesis and 3D perception, demonstrating that perceptual tasks can benefit from novel view synthesis methods like NeRF, significantly improving the performance of indoor multi-view 3D object detection. Using the geometry MLP of NeRF to direct the attention of the detection head to crucial parts, and incorporating a self-supervised loss from novel view rendering, contribute to the achieved improvement. To better leverage the notable advantages of the continuous representation through neural rendering in space, we introduce a novel 3D perception network structure, NeRF-DetS. The key component of NeRF-DetS is the Multi-level Sampling-Adaptive Network, which makes the sampling process adapt from coarse to fine. Also, we propose a superior multi-view information fusion method, known as Multi-head Weighted Fusion. This fusion approach efficiently addresses the challenge of losing multi-view information when using an arithmetic mean, while keeping computational costs low. NeRF-DetS outperforms the competitive NeRF-Det baseline on the ScanNetV2 dataset, achieving improvements of +5.02% and +5.92% in mAP@.25 and mAP@.50, respectively.
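A sketch of what a multi-head weighted fusion over views could look like: per-head softmax weights replace the arithmetic mean, so informative views dominate. Layer sizes and the weighting scheme are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MultiHeadWeightedFusion(nn.Module):
    """Illustrative fusion of per-view features with learned, head-wise
    weights instead of an arithmetic mean."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads, self.dim = heads, dim
        self.score = nn.Linear(dim // heads, 1)   # scores each head of each view

    def forward(self, feats):                     # feats: (V, N, dim), V views
        v, n, _ = feats.shape
        h = feats.view(v, n, self.heads, self.dim // self.heads)
        w = torch.softmax(self.score(h), dim=0)   # per-head weights over views
        return (w * h).sum(dim=0).reshape(n, self.dim)
```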
https://arxiv.org/abs/2404.13921
Nighttime self-supervised monocular depth estimation has received increasing attention in recent years. However, using night images for self-supervision is unreliable because the photometric consistency assumption is usually violated in videos taken under complex lighting conditions. Even with domain adaptation or photometric loss repair, performance is still limited by the poor supervision that night images provide to trainable networks. In this paper, we propose a self-supervised nighttime monocular depth estimation method that does not use any night images during training. Our framework utilizes day images as a stable source of self-supervision and applies physical priors (e.g., wave optics, a reflection model and a read-shot noise model) to compensate for some key day-night differences. With day-to-night data distribution compensation, our framework can be trained in an efficient one-stage self-supervised manner. Although no nighttime images are used during training, qualitative and quantitative results demonstrate that our method achieves state-of-the-art depth estimation results on the challenging nuScenes-Night and RobotCar-Night benchmarks compared with existing methods.
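To illustrate day-to-night distribution compensation, the toy augmentation below darkens a day image and applies a read-and-shot noise model; the constants are placeholders rather than the paper's calibrated physical priors.

```python
import torch

def day_to_night(img, gain=0.1, read_std=0.02, photons=200.0):
    """Toy day-to-night compensation: a global illumination drop followed
    by signal-dependent shot noise and sensor read noise, so training
    statistics resemble night captures."""
    dark = img * gain                                  # global illumination drop
    shot = torch.poisson(dark * photons) / photons     # signal-dependent shot noise
    read = torch.randn_like(img) * read_std            # sensor read noise
    return (shot + read).clamp(0, 1)
```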
https://arxiv.org/abs/2404.13854
Conventional video object segmentation (VOS) methods usually necessitate a substantial volume of pixel-level annotated video data for fully supervised learning. In this paper, we present HVC, a hybrid static-dynamic visual correspondence framework for self-supervised VOS. HVC extracts pseudo-dynamic signals from static images, enabling an efficient and scalable VOS model. Our approach utilizes a minimalist fully-convolutional architecture to capture static-dynamic visual correspondence in image-cropped views. To achieve this objective, we present a unified self-supervised approach to learn visual representations of static-dynamic feature similarity. Firstly, we establish static correspondence by utilizing a priori coordinate information between cropped views to guide the formation of consistent static feature representations. Subsequently, we devise a concise convolutional layer to capture the forward/backward pseudo-dynamic signals between two views, serving as cues for dynamic representations. Finally, we propose a hybrid visual correspondence loss to learn joint static and dynamic consistency representations. Our approach, without bells and whistles, necessitates only one training session using static image data, significantly reducing memory consumption (~16 GB) and training time (~2 h). Moreover, HVC achieves state-of-the-art performance in several self-supervised VOS benchmarks and additional video label propagation tasks.
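A sketch of the static-correspondence term: the crop coordinates define, a priori, where each location of one view falls in the other, and features warped along that mapping should agree. Grid construction from the crop boxes is omitted, and the cosine loss is an assumption.

```python
import torch
import torch.nn.functional as F

def static_correspondence_loss(feat1, feat2, grid_1to2):
    """Illustrative static term: sample view-2 features at the locations
    where view-1 pixels land (given by the a priori crop-coordinate grid)
    and enforce feature agreement.

    feat1, feat2: (B, C, H, W) feature maps of the two cropped views.
    grid_1to2:    (B, H, W, 2) normalized sampling grid from crop boxes.
    """
    warped = F.grid_sample(feat2, grid_1to2, align_corners=False)
    return (1 - F.cosine_similarity(feat1, warped, dim=1)).mean()
```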
https://arxiv.org/abs/2404.13505
The deep learning revolution has strongly impacted low-level image processing tasks such as style/domain transfer, enhancement/restoration, and visual quality assessments. Despite often being treated separately, the aforementioned tasks share a common theme of understanding, editing, or enhancing the appearance of input images without modifying the underlying content. We leverage this observation to develop a novel disentangled representation learning method that decomposes inputs into content and appearance features. The model is trained in a self-supervised manner and we use the learned features to develop a new quality prediction model named DisQUE. We demonstrate through extensive evaluations that DisQUE achieves state-of-the-art accuracy across quality prediction tasks and distortion types. Moreover, we demonstrate that the same features may also be used for image processing tasks such as HDR tone mapping, where the desired output characteristics may be tuned using example input-output pairs.
https://arxiv.org/abs/2404.13484
This document outlines the Text-dependent Speaker Verification (TdSV) Challenge 2024, which centers on analyzing and exploring novel approaches for text-dependent speaker verification. The primary goal of this challenge is to motivate participants to develop single yet competitive systems, conduct thorough analyses, and explore innovative concepts such as multi-task learning, self-supervised learning, and few-shot learning for text-dependent speaker verification.
https://arxiv.org/abs/2404.13428
Despite recent advances in reconstructing an organic model with the neural signed distance function (SDF), the high-fidelity reconstruction of a CAD model directly from low-quality unoriented point clouds remains a significant challenge. In this paper, we address this challenge based on the prior observation that the surface of a CAD model is generally composed of piecewise surface patches, each approximately developable even around the feature line. Our approach, named NeurCADRecon, is self-supervised, and its loss includes a developability term to encourage the Gaussian curvature toward 0 while ensuring fidelity to the input points. Noticing that the Gaussian curvature is non-zero at tip points, we introduce a double-trough curve to tolerate the existence of these tip points. Furthermore, we develop a dynamic sampling strategy to deal with situations where the given points are incomplete or too sparse. Since our resulting neural SDFs can clearly manifest sharp feature points/lines, one can easily extract the feature-aligned triangle mesh from the SDF and then decompose it into smooth surface patches, greatly reducing the difficulty of recovering the parametric CAD design. A comprehensive comparison with existing state-of-the-art methods shows the significant advantage of our approach in reconstructing faithful CAD shapes.
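The developability term can be pictured as a double-trough penalty on Gaussian curvature: cheap near zero (developable patches) and near a tip-point curvature, expensive in between. The exact curve shape below is an assumption; only the two troughs follow the description above.

```python
import torch

def developability_penalty(gauss_curv, k0=1.0):
    """Illustrative double-trough penalty on per-point Gaussian curvature
    values (however they are computed from the SDF). Curvature near 0 and
    near the tip-point magnitude k0 are both tolerated."""
    k = gauss_curv.abs()
    trough_flat = k.pow(2)            # low near K = 0 (developable patches)
    trough_tip = (k - k0).pow(2)      # low near |K| = k0 (legitimate tip points)
    return torch.minimum(trough_flat, trough_tip).mean()
```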
https://arxiv.org/abs/2404.13420
Fundus diseases are major causes of visual impairment and blindness worldwide, especially in underdeveloped regions, where the shortage of ophthalmologists hinders timely diagnosis. AI-assisted fundus image analysis has several advantages, such as high accuracy, reduced workload, and improved accessibility, but it requires a large amount of expert-annotated data to build reliable models. To address this dilemma, we propose a general self-supervised machine learning framework that can handle diverse fundus diseases from unlabeled fundus images. Our method's AUC surpasses existing supervised approaches by 15.7%, and even exceeds performance of a single human expert. Furthermore, our model adapts well to various datasets from different regions, races, and heterogeneous image sources or qualities from multiple cameras or devices. Our method offers a label-free general framework to diagnose fundus diseases, which could potentially benefit telehealth programs for early screening of people at risk of vision loss.
https://arxiv.org/abs/2404.13388
Machine learning-based fundus image diagnosis technologies have triggered worldwide interest owing to benefits such as reducing demands on medical resources and providing objective evaluation results. However, current methods are commonly supervised, imposing a heavy workload on biomedical staff and hence hindering the expansion of effective databases. To address this issue, in this article, we establish a label-free method, named 'SSVT', which can automatically analyze unlabeled fundus images and achieves a high evaluation accuracy of 97.0% on four main eye diseases, based on six public datasets and two datasets collected by Beijing Tongren Hospital. The promising results showcase the effectiveness of the proposed unsupervised learning method and its strong application potential for improving global eye health in regions with a shortage of biomedical resources.
https://arxiv.org/abs/2404.13386