Self-supervised stereo matching holds great promise for application and research due to its independence from expensive labeled data. However, direct self-supervised stereo matching paradigms based on photometric loss functions have consistently struggled with performance issues due to the occlusion challenge. The crux of the occlusion challenge lies in the fact that the positions of occluded pixels consistently align with the epipolar search direction defined by the input stereo images, leading to persistent information loss and erroneous feedback at fixed locations during self-supervised training. In this work, we propose a simple yet highly effective pseudo-stereo inputs strategy to address the core occlusion challenge. This strategy decouples the input and feedback images, compelling the network to probabilistically sample information from both sides of the occluding objects. As a result, the persistent lack of information in the aforementioned fixed occlusion areas is mitigated. Building upon this, we further address feedback conflicts and overfitting issues arising from the strategy. By integrating these components, our method achieves stable and significant performance improvements compared to existing methods. Quantitative experiments are conducted to evaluate the performance. Qualitative experiments further demonstrate accurate disparity inference even at occluded regions. These results demonstrate a significant advancement over previous methods in the field of direct self-supervised stereo matching based on photometric loss. The proposed pseudo-stereo inputs strategy, due to its simplicity and effectiveness, has the potential to serve as a new paradigm for direct self-supervised stereo matching. Code is available at this https URL.
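Implementation details are outside the scope of the abstract; for context, the sketch below shows the standard photometric warping loss that direct self-supervised stereo methods build on, written in PyTorch. The function names, tensor shapes, and the plain L1 penalty are illustrative assumptions rather than the authors' code; the pseudo-stereo strategy described above would additionally decouple the images fed to the network from the image pair used in this feedback term.

```python
# Minimal sketch (not the authors' code): the photometric warping loss that
# direct self-supervised stereo methods rely on. Shapes and names are assumptions.
import torch
import torch.nn.functional as F

def warp_right_to_left(right: torch.Tensor, disp_left: torch.Tensor) -> torch.Tensor:
    """Sample the right image at x - d(x) to synthesize the left view.
    right: (B, C, H, W), disp_left: (B, 1, H, W) in pixels."""
    b, _, h, w = right.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=right.device, dtype=right.dtype),
        torch.arange(w, device=right.device, dtype=right.dtype),
        indexing="ij",
    )
    x_src = xs.unsqueeze(0) - disp_left.squeeze(1)                # (B, H, W)
    grid_x = 2.0 * x_src / (w - 1) - 1.0                          # normalize to [-1, 1]
    grid_y = (2.0 * ys / (h - 1) - 1.0).unsqueeze(0).expand_as(grid_x)
    grid = torch.stack((grid_x, grid_y), dim=-1)                  # (B, H, W, 2)
    return F.grid_sample(right, grid, align_corners=True, padding_mode="border")

def photometric_loss(left, right, disp_left):
    """L1 reconstruction error between the left image and the warped right image.
    Occluded pixels have no true correspondence in the right image, which is the
    failure mode the pseudo-stereo input strategy is meant to mitigate."""
    left_rec = warp_right_to_left(right, disp_left)
    return (left - left_rec).abs().mean()
```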
https://arxiv.org/abs/2410.02534
Contrastive learning has become a dominant approach in self-supervised visual representation learning, with hard negatives (samples that closely resemble the anchor) being key to enhancing the discriminative power of learned representations. However, efficiently leveraging hard negatives remains a challenge due to the difficulty of identifying and incorporating them without significantly increasing computational costs. To address this, we introduce SynCo (Synthetic Negatives in Contrastive learning), a novel contrastive learning approach that improves model performance by generating synthetic hard negatives. Built on the MoCo framework, SynCo introduces six novel strategies for creating diverse synthetic hard negatives that can be generated on the fly with minimal computational overhead. SynCo achieves faster training and better representation learning, reaching a top-1 accuracy of 68.1% in ImageNet linear evaluation after only 200 epochs of pretraining, surpassing MoCo's 67.5% with the same ResNet-50 encoder. Additionally, it transfers more effectively to detection tasks: on PASCAL VOC, it outperforms both the supervised baseline and MoCo, achieving an AP of 82.5%; on the COCO dataset, it sets a new benchmark with 40.4% AP for bounding box detection and 35.4% AP for instance segmentation. Our synthetic hard negative generation procedure significantly enhances the quality of visual representations learned through self-supervised contrastive learning. Code is available at this https URL.
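The abstract does not describe the six generation strategies, so the sketch below shows only one generic way to synthesize hard negatives on the fly (mixing the hardest queue negatives toward the query) and plug them into a MoCo-style InfoNCE loss. All names, the mixing scheme, and the hyperparameters are assumptions for illustration, not SynCo's actual strategies.

```python
# Sketch (assumptions, not SynCo's strategies): synthesize hard negatives by
# mixing the hardest queue negatives with the query, then use them in InfoNCE.
import torch
import torch.nn.functional as F

def synthesize_hard_negatives(q, queue, num_hard=64, alpha=0.5):
    """q: (B, D) L2-normalized queries; queue: (K, D) L2-normalized negatives.
    Returns (B, num_hard, D) synthetic negatives."""
    sim = q @ queue.t()                                   # (B, K) similarities
    hard_idx = sim.topk(num_hard, dim=1).indices          # hardest = most similar
    hard = queue[hard_idx]                                # (B, num_hard, D)
    synth = alpha * hard + (1 - alpha) * q.unsqueeze(1)   # pull toward the query
    return F.normalize(synth, dim=-1)

def info_nce(q, k_pos, queue, synth, temperature=0.2):
    """MoCo-style loss over the positive key, the real queue, and synthetic negatives."""
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)          # (B, 1)
    l_queue = q @ queue.t()                               # (B, K)
    l_synth = torch.einsum("bd,bnd->bn", q, synth)        # (B, num_hard)
    logits = torch.cat([l_pos, l_queue, l_synth], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)                # positive is index 0
```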
https://arxiv.org/abs/2410.02401
In this work, we present BiSSL, a first-of-its-kind training framework that introduces bilevel optimization to enhance the alignment between the pretext pre-training and downstream fine-tuning stages in self-supervised learning. BiSSL formulates the pretext and downstream task objectives as the lower- and upper-level objectives in a bilevel optimization problem and serves as an intermediate training stage within the self-supervised learning pipeline. By more explicitly modeling the interdependence of these training stages, BiSSL facilitates enhanced information sharing between them, ultimately leading to a backbone parameter initialization that is better suited for the downstream task. We propose a training algorithm that alternates between optimizing the two objectives defined in BiSSL. Using a ResNet-18 backbone pre-trained with SimCLR on the STL10 dataset, we demonstrate that our proposed framework consistently achieves improved or competitive classification accuracies across various downstream image classification datasets compared to the conventional self-supervised learning pipeline. Qualitative analyses of the backbone features further suggest that BiSSL enhances the alignment of downstream features in the backbone prior to fine-tuning.
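As a rough picture of what an alternating scheme over two coupled objectives looks like in practice, here is a minimal sketch that alternates a lower-level pretext objective with an upper-level downstream objective on a shared backbone. The loss interfaces, optimizers, and step counts are placeholders; this is not the actual BiSSL algorithm, which models the coupling between the two levels more explicitly.

```python
# Minimal sketch of an alternating bilevel-style loop (placeholder losses and
# models), NOT the exact BiSSL algorithm: the lower level optimizes the pretext
# objective, the upper level optimizes the downstream objective on the same backbone.
import torch
import torch.nn.functional as F

def alternating_bilevel(backbone, pretext_head, downstream_head,
                        pretext_loader, downstream_loader,
                        rounds=10, lower_steps=100, upper_steps=20, lr=1e-3):
    opt_lower = torch.optim.SGD(
        list(backbone.parameters()) + list(pretext_head.parameters()), lr=lr)
    opt_upper = torch.optim.SGD(
        list(backbone.parameters()) + list(downstream_head.parameters()), lr=lr)

    for _ in range(rounds):
        # Lower level: self-supervised pretext objective (hypothetical loss API).
        for _, views in zip(range(lower_steps), pretext_loader):
            loss = pretext_head.pretext_loss(backbone, views)
            opt_lower.zero_grad()
            loss.backward()
            opt_lower.step()
        # Upper level: downstream objective steering the same backbone parameters.
        for _, (x, y) in zip(range(upper_steps), downstream_loader):
            loss = F.cross_entropy(downstream_head(backbone(x)), y)
            opt_upper.zero_grad()
            loss.backward()
            opt_upper.step()
    return backbone
```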
https://arxiv.org/abs/2410.02387
Weakly supervised whole slide image (WSI) classification is challenging due to the lack of patch-level labels and high computational costs. State-of-the-art methods use self-supervised patch-wise feature representations for multiple instance learning (MIL). Recently, methods have been proposed to fine-tune the feature representation on the downstream task using pseudo labeling, but mostly focusing on selecting high-quality positive patches. In this paper, we propose to mine hard negative samples during fine-tuning. This allows us to obtain better feature representations and reduce the training cost. Furthermore, we propose a novel patch-wise ranking loss in MIL to better exploit these hard negative samples. Experiments on two public datasets demonstrate the efficacy of these proposed ideas. Our codes are available at this https URL
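The exact form of the proposed patch-wise ranking loss is not given in the abstract; below is a minimal margin-based sketch of a ranking term that pushes the scores of mined hard-negative patches below those of high-quality positive patches. The margin value and the all-pairs pairing scheme are assumptions.

```python
# Sketch (assumed form, not necessarily the paper's loss): a margin ranking loss
# over patch scores so that positive patches rank above mined hard negatives.
import torch

def patch_ranking_loss(pos_scores: torch.Tensor,
                       hard_neg_scores: torch.Tensor,
                       margin: float = 0.5) -> torch.Tensor:
    """pos_scores: (P,) scores of pseudo-labeled positive patches in a bag.
    hard_neg_scores: (N,) scores of mined hard-negative patches."""
    # All pairwise (positive, hard-negative) score differences.
    diff = pos_scores.unsqueeze(1) - hard_neg_scores.unsqueeze(0)   # (P, N)
    return torch.clamp(margin - diff, min=0).mean()

# Usage idea: combine with the bag-level MIL loss, e.g.
# total = bag_loss + 0.1 * patch_ranking_loss(pos_scores, hard_neg_scores)
```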
https://arxiv.org/abs/2410.02212
This work tackles the information loss bottleneck of vector-quantization (VQ) autoregressive image generation by introducing a novel model architecture called the 2-Dimensional Autoregression (DnD) Transformer. The DnD-Transformer predicts more codes for an image by introducing a new autoregression direction, model depth, along with the sequence length direction. Compared to traditional 1D autoregression and previous work utilizing similar 2D image decomposition such as RQ-Transformer, the DnD-Transformer is an end-to-end model that can generate higher quality images with the same backbone model size and sequence length, opening a new optimization perspective for autoregressive image generation. Furthermore, our experiments reveal that the DnD-Transformer's potential extends beyond generating natural images. It can even generate images with rich text and graphical elements in a self-supervised manner, demonstrating an understanding of these combined modalities. This has not been previously demonstrated for popular vision generative models such as diffusion models, showing a spark of vision-language intelligence when trained solely on images. Code, datasets and models are open at this https URL.
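To make the two-direction factorization concrete, here is a toy sketch of the generation order only: at each sequence position the model emits several codes along a depth direction, each conditioned on everything generated so far. The predictor here is a trivial stand-in, not the DnD-Transformer architecture, and the vocabulary size and loop bounds are arbitrary.

```python
# Toy sketch of a 2D (sequence position x depth) autoregressive generation order.
# `next_code_logits` is a stand-in for the model; everything here is illustrative.
import torch

def generate_codes(next_code_logits, seq_len, depth):
    """Emit `depth` codes per sequence position, each conditioned on every code
    generated so far (raster order over positions, then over depth)."""
    generated = []
    for pos in range(seq_len):
        for d in range(depth):
            context = torch.tensor(generated, dtype=torch.long).unsqueeze(0)  # (1, t)
            logits = next_code_logits(context, pos, d)                        # (1, vocab)
            code = torch.distributions.Categorical(logits=logits).sample()
            generated.append(int(code))
    return torch.tensor(generated).view(seq_len, depth)

# Trivial stand-in predictor (uniform over a 16-code vocabulary) just to run the loop:
codes = generate_codes(lambda ctx, pos, d: torch.zeros(1, 16), seq_len=4, depth=2)
```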
https://arxiv.org/abs/2410.01912
Since videos record objects moving coherently, adjacent video frames have commonness (similar object appearances) and uniqueness (slightly changed postures). To prevent redundant modeling of common video signals, we propose a novel diffusion-based framework, named COMUNI, which decomposes the COMmon and UNIque video signals to enable efficient video generation. Our approach separates the decomposition of video signals from the task of video generation, thus reducing the computation complexity of generative models. In particular, we introduce CU-VAE to decompose video signals and encode them into latent features. To train CU-VAE in a self-supervised manner, we employ a cascading merge module to reconstitute video signals and a time-agnostic video decoder to reconstruct video frames. Then we propose CU-LDM to model latent features for video generation, which adopts two specific diffusion streams to simultaneously model the common and unique latent features. We further utilize additional joint modules for cross modeling of the common and unique latent features, and a novel position embedding method to ensure the content consistency and motion coherence of generated videos. The position embedding method incorporates spatial and temporal absolute position information into the joint modules. Extensive experiments demonstrate the necessity of decomposing common and unique video signals for video generation and the effectiveness and efficiency of our proposed method.
https://arxiv.org/abs/2410.01718
There is a growing need for pluralistic alignment methods that can steer language models towards individual attributes and preferences. One such method, Self-Supervised Alignment with Mutual Information (SAMI), uses conditional mutual information to encourage the connection between behavioral preferences and model responses. We conduct two experiments exploring SAMI in multi-task settings. First, we compare SAMI to Direct Preference Optimization (DPO) on a multi-task benchmark (MT-Bench), using a stronger model to generate training data for a weaker one across diverse categories (humanities, STEM, extraction, coding, math, reasoning, and roleplay). Our results indicate that one iteration of SAMI has a 57% win rate against DPO, with significant variation in performance between task categories. Second, we examine SAMI's impact on mathematical accuracy (GSM-8K) relative to supervised fine-tuning (SFT). While SAMI increases zero-shot performance by 1.1%, SFT is more effective with a 3.2% boost. However, SAMI shows interesting scaling trends. When given 10 attempts, SAMI improves accuracy by 3.9%, while SFT achieves a 10.1% increase. Combining SAMI with SFT yields an additional improvement of 1.3% in multi-attempt settings, though single-attempt accuracy remains unchanged.
https://arxiv.org/abs/2410.01704
Cardiac magnetic resonance imaging (CMR), considered the gold standard for noninvasive cardiac assessment, is a diverse and complex modality requiring a wide variety of image processing tasks for comprehensive assessment of cardiac morphology and function. Advances in deep learning have enabled the development of state-of-the-art (SoTA) models for these tasks. However, model training is challenging due to data and label scarcity, especially in the less common imaging sequences. Moreover, each model is often trained for a specific task, with no connection between related tasks. In this work, we introduce a vision foundation model for CMR assessment, trained in a self-supervised fashion on 36 million CMR images. We then finetune the model in a supervised way for 9 clinical tasks typical of a CMR workflow, spanning classification, segmentation, landmark localization, and pathology detection. We demonstrate improved accuracy and robustness across all tasks, over a range of available labeled dataset sizes. We also demonstrate improved few-shot learning with fewer labeled samples, a common challenge in medical image analyses. We achieve out-of-the-box performance comparable to SoTA for most clinical tasks. The proposed method thus presents a resource-efficient, unified framework for CMR assessment, with the potential to accelerate the development of deep learning-based solutions for image analysis tasks, even with few annotated data available.
https://arxiv.org/abs/2410.01665
Recently, retrieval-based language models (RLMs) have received much attention. However, most of them leverage a pre-trained retriever with fixed parameters, which may not adapt well to causal language models. In this work, we propose Grouped Cross-Attention, a novel module enabling joint pre-training of the retriever and causal LM, and apply it to long-context modeling. For a given input sequence, we split it into chunks and use the current chunk to retrieve past chunks for subsequent text generation. Our innovation allows the retriever to learn how to retrieve past chunks that better minimize the auto-regressive loss of subsequent tokens in an end-to-end manner. By integrating top-k retrieval, our model can be pre-trained efficiently from scratch with context lengths up to 64K tokens. Our experiments show our model, compared with long-range LM baselines, can achieve lower perplexity with comparable or lower pre-training and inference costs.
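The Grouped Cross-Attention module itself is not specified in the abstract; the sketch below illustrates only the surrounding mechanics it describes: splitting the context into chunks and using the current chunk to retrieve the top-k most relevant past chunks before cross-attending to them. Mean-pooled chunk summaries and dot-product scoring are assumptions made for illustration.

```python
# Sketch (illustrative, not the paper's module): split a sequence into chunks,
# score past chunks with the current chunk's summary, and keep the top-k for
# cross-attention. Chunk summaries here are simple mean-pooled hidden states.
import torch

def retrieve_past_chunks(hidden, chunk_size=64, k=4):
    """hidden: (T, D) hidden states of the sequence so far (T >= 2 * chunk_size).
    Returns (k, chunk_size, D) past chunks most relevant to the current chunk."""
    T, D = hidden.shape
    n_chunks = T // chunk_size
    chunks = hidden[: n_chunks * chunk_size].view(n_chunks, chunk_size, D)
    summaries = chunks.mean(dim=1)                       # (n_chunks, D)
    query = summaries[-1]                                # current chunk summary
    past = summaries[:-1]                                # (n_chunks - 1, D)
    if past.size(0) <= k:
        return chunks[:-1]
    scores = past @ query                                # (n_chunks - 1,)
    top = scores.topk(k).indices
    return chunks[top]                                   # fed to cross-attention
```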
https://arxiv.org/abs/2410.01651
Learning motor skills for sports or performance driving is often done with professional instruction from expert human teachers, whose availability is limited. Our goal is to enable automated teaching via a learned model that interacts with the student much like a human teacher. However, training such automated teaching systems is limited by the availability of high-quality annotated datasets of expert teacher and student interactions, which are difficult to collect at scale. To address this data scarcity problem, we propose an approach for training a coaching system for complex motor tasks such as high performance driving via a Multi-Task Imitation Learning (MTIL) paradigm. MTIL allows our model to learn robust representations by utilizing self-supervised training signals from more readily available non-interactive datasets of humans performing the task of interest. We validate our approach with (1) a semi-synthetic dataset created from real human driving trajectories, (2) a professional track driving instruction dataset, (3) a track-racing driving simulator human-subject study, and (4) a system demonstration on an instrumented car at a race track. Our experiments show that the right set of auxiliary machine learning tasks improves performance in predicting teaching instructions. Moreover, in the human-subject study, students exposed to the instructions from our teaching system improve their ability to stay within track limits and show favorable perception of the model's interaction with them in terms of usefulness and satisfaction.
https://arxiv.org/abs/2410.01608
Self-supervised learning has developed rapidly over the last decade and has been applied in many areas of computer vision. Decorrelation-based self-supervised pretraining has shown great promise among non-contrastive algorithms, yielding performance at par with supervised and contrastive self-supervised baselines. In this work, we explore the decorrelation-based paradigm of self-supervised learning and apply it to learning disentangled stroke features for writer identification. We propose a modified formulation of the decorrelation-based framework SWIS, originally proposed for signature verification, that standardizes the features along each dimension on top of the existing framework. We show that the proposed framework outperforms the contemporary self-supervised learning framework on the writer identification benchmark and also outperforms several supervised methods. To the best of our knowledge, this work is the first of its kind to apply self-supervised learning to learning representations for writer verification tasks.
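As a concrete reference point, the sketch below shows a generic decorrelation-based objective with the per-dimension feature standardization the abstract highlights (a Barlow Twins-style cross-correlation loss). It illustrates the family of losses involved, not the exact SWIS formulation, and the weighting constant is an assumption.

```python
# Sketch of a decorrelation-based SSL objective with per-dimension feature
# standardization (generic Barlow Twins-style form; not the exact SWIS loss).
import torch

def decorrelation_loss(z1: torch.Tensor, z2: torch.Tensor, off_diag_weight=5e-3):
    """z1, z2: (N, D) embeddings of two views of the same handwriting samples."""
    # Standardize each feature dimension (zero mean, unit variance) across the batch.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    n, d = z1.shape
    c = (z1.t() @ z2) / n                                        # (D, D) cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # pull correlations to 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelate dimensions
    return on_diag + off_diag_weight * off_diag
```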
https://arxiv.org/abs/2410.01441
Capitalizing on vast amounts of image-text data, large-scale vision-language pre-training has demonstrated remarkable zero-shot capabilities and has been utilized in several applications. However, models trained on general everyday web-crawled data often exhibit sub-optimal performance for specialized domains, likely due to domain shift. Recent works have tackled this problem for some domains (e.g., healthcare) by constructing domain-specialized image-text data. However, constructing a dedicated large-scale image-text dataset for the sustainable area of agriculture and livestock remains an open research problem. Further, this domain calls for fine-grained feature learning due to the subtle nature of the downstream tasks (e.g., nutrient deficiency detection, livestock breed classification). To address this, we present AgriCLIP, a vision-language foundational model dedicated to the domain of agriculture and livestock. First, we propose a large-scale dataset, named ALive, that leverages a customized prompt generation strategy to overcome the scarcity of expert annotations. Our ALive dataset covers crops, livestock, and fishery, with around 600,000 image-text pairs. Second, we propose a training pipeline that integrates both contrastive and self-supervised learning to learn both global semantic and local fine-grained domain-specialized features. Experiments on a diverse set of 20 downstream tasks demonstrate the effectiveness of the AgriCLIP framework, achieving an absolute gain of 7.8% in average zero-shot classification accuracy over standard CLIP adaptation via the domain-specialized ALive dataset. Our ALive dataset and code are accessible on GitHub at this https URL.
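The training pipeline is described as integrating contrastive image-text learning with self-supervised learning; the sketch below shows one way such a combined objective can be written, with a CLIP-style contrastive term and a placeholder for the self-supervised image-only term. The weights, function names, and the choice of a symmetric cross-entropy are assumptions, not the AgriCLIP implementation.

```python
# Sketch of a combined objective: CLIP-style image-text contrastive loss plus a
# self-supervised image-only term computed elsewhere (placeholder value here).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (N, D) L2-normalized embeddings of paired images and captions."""
    logits = img_emb @ txt_emb.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def combined_loss(img_emb, txt_emb, ssl_loss_value, ssl_weight=1.0):
    """Global semantics from the contrastive term; fine-grained local features
    come from the self-supervised term (e.g. a SimCLR/DINO-style loss)."""
    return clip_contrastive_loss(img_emb, txt_emb) + ssl_weight * ssl_loss_value
```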
https://arxiv.org/abs/2410.01407
Generative models can now produce photorealistic synthetic data that is virtually indistinguishable from the real data used to train them. This is a significant evolution over previous models, which could produce reasonable facsimiles of the training data but ones that could be visually distinguished from it by human evaluation. Recent work on OOD detection has raised doubts that generative model likelihoods are optimal OOD detectors, due to issues involving likelihood misestimation, entropy in the generative process, and typicality. We speculate that generative OOD detectors also failed because their models focused on the pixels rather than the semantic content of the data, leading to failures in near-OOD cases where the pixels may be similar but the information content is significantly different. We hypothesize that estimating typical sets using self-supervised learners leads to better OOD detectors. We introduce a novel approach that leverages representation learning, and informative summary statistics based on manifold estimation, to address all of the aforementioned issues. Our method outperforms other unsupervised approaches and achieves state-of-the-art performance on well-established, challenging benchmarks and on new synthetic data detection tasks.
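The manifold-based summary statistics are not spelled out in the abstract; a common, simple proxy for typicality in a self-supervised representation space is the k-nearest-neighbor distance to the training features, sketched below as an illustrative scoring rule rather than the paper's method.

```python
# Sketch: score out-of-distribution-ness as the mean distance to the k nearest
# training representations from a self-supervised encoder. One common typicality
# proxy, not necessarily the summary statistics used in the paper.
import torch

def knn_ood_score(test_feats: torch.Tensor, train_feats: torch.Tensor, k: int = 10):
    """test_feats: (M, D), train_feats: (N, D); larger score = more atypical."""
    d = torch.cdist(test_feats, train_feats)        # (M, N) pairwise distances
    knn_dists, _ = d.topk(k, dim=1, largest=False)  # k smallest distances per sample
    return knn_dists.mean(dim=1)                    # (M,) OOD scores
```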
https://arxiv.org/abs/2410.01322
LiDAR-based 3D object detectors have been largely utilized in various applications, including autonomous vehicles or mobile robots. However, LiDAR-based detectors often fail to adapt well to target domains with different sensor configurations (e.g., types of sensors, spatial resolution, or FOVs) and location shifts. Collecting and annotating datasets in a new setup is commonly required to reduce such gaps, but it is often expensive and time-consuming. Recent studies suggest that pre-trained backbones can be learned in a self-supervised manner with large-scale unlabeled LiDAR frames. However, despite their expressive representations, they remain challenging to generalize well without substantial amounts of data from the target domain. Thus, we propose a novel method, called Domain Adaptive Distill-Tuning (DADT), to adapt a pre-trained model with limited target data (approximately 100 LiDAR frames), retaining its representation power and preventing it from overfitting. Specifically, we use regularizers to align object-level and context-level representations between the pre-trained and finetuned models in a teacher-student architecture. Our experiments with driving benchmarks, i.e., Waymo Open dataset and KITTI, confirm that our method effectively finetunes a pre-trained model, achieving significant gains in accuracy.
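The regularizers are described only as aligning object-level and context-level representations between teacher and student; the sketch below shows a generic version of such a distill-tuning objective. The tensor shapes, the MSE alignment, and the weighting are assumptions, not DADT's actual losses.

```python
# Sketch of teacher-student feature alignment at two granularities (generic
# distillation regularizer; shapes and weighting are assumptions, not DADT's).
import torch
import torch.nn.functional as F

def distill_tuning_loss(student_obj, teacher_obj, student_ctx, teacher_ctx,
                        det_loss, w_obj=1.0, w_ctx=1.0):
    """student/teacher_obj: (N_obj, D) pooled object-level features;
    student/teacher_ctx: (B, C, H, W) context/scene-level feature maps;
    det_loss: the standard detection loss on the limited target-domain labels."""
    obj_align = F.mse_loss(student_obj, teacher_obj.detach())   # object-level alignment
    ctx_align = F.mse_loss(student_ctx, teacher_ctx.detach())   # context-level alignment
    return det_loss + w_obj * obj_align + w_ctx * ctx_align
```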
https://arxiv.org/abs/2410.01319
Given the GPS coordinates of a large collection of human agents over time, how can we model their mobility behavior toward effective anomaly detection (e.g. for bad-actor or malicious behavior detection) without any labeled data? Human mobility and trajectory modeling have been studied extensively with varying capacity to handle complex input, and performance-efficiency trade-offs. With the arrival of more expressive models in machine learning, we attempt to model GPS data as a sequence of stay-point events, each with a set of characterizing spatiotemporal features, and leverage modern sequence models such as Transformers for un/self-supervised training and inference. Notably, driven by the inherent stochasticity of certain individuals' behavior, we equip our model with aleatoric/data uncertainty estimation. In addition, to handle data sparsity of a large variety of behaviors, we incorporate epistemic/model uncertainty into our model. Together, aleatoric and epistemic uncertainty enable a robust loss and training dynamics, as well as uncertainty-aware decision making in anomaly scoring. Experiments on large expert-simulated datasets with tens of thousands of agents demonstrate the effectiveness of our model against both forecasting and anomaly detection baselines.
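The abstract names aleatoric and epistemic uncertainty without specifying the estimators; two standard choices are sketched below, a heteroscedastic Gaussian negative log-likelihood for data uncertainty and Monte Carlo dropout for model uncertainty. Treat them as plausible ingredients of such a pipeline, not the paper's exact design.

```python
# Sketch of standard uncertainty estimators (not necessarily the paper's exact
# choices): Gaussian NLL for aleatoric uncertainty, MC dropout for epistemic.
import torch

def gaussian_nll(pred_mean, pred_log_var, target):
    """Heteroscedastic regression loss: the model predicts a per-feature variance,
    so inherently noisy stay-point features are down-weighted automatically."""
    inv_var = torch.exp(-pred_log_var)
    return 0.5 * (inv_var * (target - pred_mean) ** 2 + pred_log_var).mean()

def mc_dropout_uncertainty(model, x, n_samples=20):
    """Epistemic uncertainty: keep dropout active at inference and measure the
    spread of predictions across stochastic forward passes."""
    model.train()                       # keeps dropout layers active
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(0), preds.var(0)  # predictive mean and epistemic variance
```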
https://arxiv.org/abs/2410.01281
There is a gap in the understanding of occluded objects in existing large-scale visual language multi-modal models. Current state-of-the-art multi-modal models fail to provide satisfactory results when describing occluded objects through universal visual encoders and supervised learning strategies. Therefore, we introduce a multi-modal large language framework and a corresponding self-supervised learning strategy with the support of 3D generation. We start our experiments by comparing with state-of-the-art models on the large-scale SOMVideo dataset [18]. The initial results demonstrate an improvement of 16.92% over state-of-the-art VLM models.
https://arxiv.org/abs/2410.01861
Artificial intelligence algorithms have demonstrated their image classification and segmentation ability over the past decade. However, these algorithms perform worse on actual clinical data than on the data used for simulations. This research aims to present a novel hybrid learning model using self-supervised learning and knowledge distillation that can achieve sufficient generalization and robustness. The self-attention mechanism and tokens employed in ViT, together with the local-to-global learning approach used in the hybrid model, enable the proposed algorithm to extract a high-dimensional and high-quality feature space from images. To demonstrate the proposed neural network's capability in classifying and extracting feature spaces from medical images, we use it on a dataset of Diabetic Retinopathy images, specifically the EyePACS dataset. This dataset is structurally more complex and more challenging with respect to damaged areas than other medical images. For the first time in this study, self-supervised learning and knowledge distillation are used to classify this dataset. In our algorithm, for the first time among all self-supervised learning and knowledge distillation models, the test dataset is 50% larger than the training dataset. Unlike many studies, we have not removed any images from the dataset. Finally, our algorithm achieved an accuracy of 79.1% with the linear classifier and 74.36% with the k-NN algorithm for multiclass classification. Compared to a similar state-of-the-art model, our results achieve higher accuracy and more effective representation spaces.
https://arxiv.org/abs/2410.00779
Self-supervised learning (SSL) methods learn from unlabeled data and achieve high generalization performance on downstream tasks. However, they may also suffer from overfitting to their training data and lose the ability to adapt to new tasks. To investigate this phenomenon, we conduct experiments on various SSL methods and datasets and make two observations: (1) Overfitting occurs abruptly in later layers and epochs, while generalizing features are learned in early layers for all epochs; (2) Coding rate reduction can be used as an indicator to measure the degree of overfitting in SSL models. Based on these observations, we propose Undoing Memorization Mechanism (UMM), a plug-and-play method that mitigates overfitting of the pre-trained feature extractor by aligning the feature distributions of the early and the last layers to maximize the coding rate reduction of the last layer output. The learning process of UMM is a bi-level optimization process. We provide a causal analysis of UMM to explain how UMM can help the pre-trained feature extractor overcome overfitting and recover generalization. We also demonstrate that UMM significantly improves the generalization performance of SSL methods on various downstream tasks.
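The coding rate used as an overfitting indicator is presumably the rate-distortion quantity from the coding-rate-reduction literature; a direct implementation of that quantity is sketched below. Its exact role inside UMM's bi-level optimization is not detailed in the abstract, so this shows only the building block, with epsilon as a placeholder.

```python
# Sketch: the coding rate of a batch of features, as defined in the coding-rate-
# reduction literature (assumed here to be the quantity UMM monitors/maximizes).
import torch

def coding_rate(Z: torch.Tensor, eps: float = 0.5) -> torch.Tensor:
    """Z: (n, d) batch of features. Returns 0.5 * logdet(I + d / (n * eps^2) * Z^T Z)."""
    n, d = Z.shape
    cov = Z.t() @ Z                                   # (d, d) feature second moment
    I = torch.eye(d, device=Z.device, dtype=Z.dtype)
    return 0.5 * torch.logdet(I + (d / (n * eps ** 2)) * cov)
```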
https://arxiv.org/abs/2410.00772
Learning agents with reinforcement learning is difficult when dealing with long trajectories that involve a large number of states. To address these learning problems effectively, the number of states can be reduced by abstract representations that cluster states. In principle, deep reinforcement learning can find abstract states, but end-to-end learning is unstable. We propose contrastive abstraction learning to find abstract states, where we assume that successive states in a trajectory belong to the same abstract state. Such abstract states may be basic locations, achieved subgoals, inventory, or health conditions. Contrastive abstraction learning first constructs clusters of state representations by contrastive learning and then applies modern Hopfield networks to determine the abstract states. The first phase of contrastive abstraction learning is self-supervised learning, where contrastive learning forces states with sequential proximity to have similar representations. The second phase uses modern Hopfield networks to map similar state representations to the same fixed point, i.e., to an abstract state. The level of abstraction can be adjusted by determining the number of fixed points of the modern Hopfield network. Furthermore, contrastive abstraction learning does not require rewards and facilitates efficient reinforcement learning for a wide range of downstream tasks. Our experiments demonstrate the effectiveness of contrastive abstraction learning for reinforcement learning.
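The mapping of similar state representations to a shared fixed point can be illustrated with the standard modern Hopfield retrieval update, sketched below with stored patterns standing in for abstract-state representatives; the inverse temperature beta and the iteration count are placeholders, not values from the paper.

```python
# Sketch of the modern Hopfield retrieval step used to snap a state embedding
# onto a fixed point (an abstract state). Patterns and beta are placeholders.
import torch

def hopfield_abstract_state(query: torch.Tensor, patterns: torch.Tensor,
                            beta: float = 8.0, iters: int = 3) -> torch.Tensor:
    """query: (D,) state embedding; patterns: (N, D) stored pattern vectors.
    Iterates xi <- patterns^T softmax(beta * patterns @ xi); with a large beta the
    iteration converges toward a single stored pattern, i.e. one abstract state."""
    xi = query
    for _ in range(iters):
        xi = patterns.t() @ torch.softmax(beta * (patterns @ xi), dim=0)
    return xi
```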
https://arxiv.org/abs/2410.00704
Visual features, whose description often relies on local intensity and gradient direction, have found wide application in robot navigation and localization in recent years. However, the extraction of visual features is usually disturbed by variations in illumination conditions, making it challenging for real-world applications. Previous works have addressed this issue by establishing datasets with variations in illumination conditions, but this can be costly and time-consuming. This paper proposes a design procedure for an illumination-robust feature extractor, where recently developed relightable 3D reconstruction techniques are adopted for rapid and direct data generation under varying illumination conditions. A self-supervised framework is proposed for extracting features that offer improved keypoint repeatability and descriptor similarity across good and bad illumination conditions. Experiments are conducted to demonstrate the effectiveness of the proposed method for robust feature extraction. Ablation studies also indicate the effectiveness of the self-supervised framework design.
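The training losses are not detailed in the abstract; one plausible form of the descriptor-similarity term across two pixel-aligned renders of the same viewpoint under different illumination is sketched below. The dense-descriptor interface and the cosine penalty are assumptions made for illustration.

```python
# Sketch (assumed formulation): a descriptor similarity loss across two renders
# of the same viewpoint under different illumination, so that matched keypoints
# keep similar descriptors regardless of lighting. Renders assumed pixel-aligned.
import torch
import torch.nn.functional as F

def descriptor_similarity_loss(desc_a, desc_b, keypoints):
    """desc_a, desc_b: (D, H, W) dense descriptor maps of the two renders;
    keypoints: (K, 2) integer (y, x) locations detected in render A."""
    ys, xs = keypoints[:, 0], keypoints[:, 1]
    da = F.normalize(desc_a[:, ys, xs], dim=0)     # (D, K) descriptors under lighting A
    db = F.normalize(desc_b[:, ys, xs], dim=0)     # (D, K) descriptors under lighting B
    return (1 - (da * db).sum(dim=0)).mean()       # 1 - cosine similarity
```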
https://arxiv.org/abs/2410.00629