Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and neighboring documents in context, analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder.
https://arxiv.org/abs/2410.02525
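A minimal sketch of the contextual-batching idea above, assuming a generic bi-encoder and a standard InfoNCE loss; the batching step and tensor names are illustrative, not the paper's code. Batches are assembled from one embedding-space cluster so that in-batch negatives are true neighbors:

```python
import torch
import torch.nn.functional as F

def contextual_infonce(query_emb, doc_emb, temperature=0.02):
    """InfoNCE over a batch whose documents are cluster neighbors."""
    q = F.normalize(query_emb, dim=-1)   # (B, D) query embeddings
    d = F.normalize(doc_emb, dim=-1)     # (B, D) paired positive documents
    logits = q @ d.T / temperature       # (B, B); off-diagonals are negatives
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```

Because every off-diagonal document comes from the same neighborhood, the negatives are hard by construction, which is what an intra-batch contextual loss exploits.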
Contrastive learning has become a dominant approach in self-supervised visual representation learning, with hard negatives (samples that closely resemble the anchor) being key to enhancing the discriminative power of learned representations. However, efficiently leveraging hard negatives remains a challenge due to the difficulty of identifying and incorporating them without significantly increasing computational costs. To address this, we introduce SynCo (Synthetic Negatives in Contrastive learning), a novel contrastive learning approach that improves model performance by generating synthetic hard negatives. Built on the MoCo framework, SynCo introduces six novel strategies for creating diverse synthetic hard negatives that can be generated on the fly with minimal computational overhead. SynCo achieves faster training and better representation learning, reaching a top-1 accuracy of 68.1% in ImageNet linear evaluation after only 200 pretraining epochs, surpassing MoCo's 67.5% with the same ResNet-50 encoder. Additionally, it transfers more effectively to detection tasks: on PASCAL VOC, it outperforms both the supervised baseline and MoCo, achieving an AP of 82.5%; on the COCO dataset, it sets a new benchmark with 40.4% AP for bounding box detection and 35.4% AP for instance segmentation. Our synthetic hard negative generation procedure significantly enhances the quality of visual representations learned through self-supervised contrastive learning. Code is available at this https URL.
https://arxiv.org/abs/2410.02401
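A hedged sketch of one synthetic-hard-negative strategy in the spirit of SynCo: convex mixtures of the hardest queue negatives, generated on the fly. The paper describes six distinct strategies; this mixing variant and its parameter names are illustrative only.

```python
import torch
import torch.nn.functional as F

def synthetic_hard_negatives(q, queue, n_hard=64, n_synth=16, alpha=0.5):
    """Mix pairs of the hardest MoCo-queue negatives into new negatives.
    q: (D,) normalized anchor; queue: (K, D) normalized queue negatives."""
    sims = queue @ q                                 # similarity to the anchor
    hard = queue[sims.topk(n_hard).indices]          # hardest queue negatives
    i = torch.randint(0, n_hard, (n_synth,))
    j = torch.randint(0, n_hard, (n_synth,))
    synth = alpha * hard[i] + (1 - alpha) * hard[j]  # convex mixtures
    return F.normalize(synth, dim=-1)                # back onto the sphere
```

The synthetic negatives are appended to the queue negatives in the contrastive denominator, so the extra cost is a few matrix operations per step.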
Modeling temporal characteristics plays a significant role in the representation learning of audio waveforms. We propose Contrastive Long-form Language-Audio Pretraining (CoLLAP) to significantly extend the perception window for both the input audio (up to 5 minutes) and the language descriptions (exceeding 250 words), while enabling contrastive learning across modalities and temporal dynamics. Leveraging recent Music-LLMs to generate long-form music captions for full-length songs, augmented with musical temporal structures, we collect 51.3K audio-text pairs derived from the large-scale AudioSet training dataset, where the average audio length reaches 288 seconds. We propose a novel contrastive learning architecture that fuses language representations with structured audio representations by segmenting each song into clips and extracting their embeddings. With an attention mechanism, we capture multimodal temporal correlations, allowing the model to automatically weigh and enhance the final fusion score for improved contrastive alignment. Finally, we develop two variants of the CoLLAP model with different types of backbone language models. Through comprehensive experiments on multiple long-form music-text retrieval datasets, we demonstrate consistent improvement in retrieval accuracy compared with baselines. We also show that the pretrained CoLLAP models can be transferred to various music information retrieval tasks with heterogeneous long-form multimodal contexts.
https://arxiv.org/abs/2410.02271
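A minimal sketch of the attention-weighted fusion described above, assuming both encoders already project into a shared space; the exact scoring function in CoLLAP may differ.

```python
import torch
import torch.nn.functional as F

def fused_song_text_score(text_emb, clip_embs, temperature=0.07):
    """The text embedding attends over per-clip audio embeddings; the fused
    score is the attention-weighted sum of clip-level similarities.
    text_emb: (D,); clip_embs: (S, D) for the S segments of one song."""
    t = F.normalize(text_emb, dim=-1)
    c = F.normalize(clip_embs, dim=-1)
    sims = c @ t                                    # (S,) per-clip similarity
    attn = torch.softmax(sims / temperature, dim=0)
    return (attn * sims).sum()                      # final fusion score
```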
Knowledge tracing (KT) is a popular approach for modeling students' learning progress over time, which can enable more personalized and adaptive learning. However, existing KT approaches face two major limitations: (1) they rely heavily on expert-defined knowledge concepts (KCs) in questions, which is time-consuming and prone to errors; and (2) KT methods tend to overlook the semantics of both questions and the given KCs. In this work, we address these challenges and present KCQRL, a framework for automated knowledge concept annotation and question representation learning that can improve the effectiveness of any existing KT model. First, we propose an automated KC annotation process using large language models (LLMs), which generates question solutions and then annotates KCs in each solution step of the questions. Second, we introduce a contrastive learning approach to generate semantically rich embeddings for questions and solution steps, aligning them with their associated KCs via a tailored false negative elimination approach. These embeddings can be readily integrated into existing KT models, replacing their randomly initialized embeddings. We demonstrate the effectiveness of KCQRL across 15 KT algorithms on two large real-world Math learning datasets, where we achieve consistent performance improvements.
https://arxiv.org/abs/2410.01727
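A sketch of the false-negative elimination described above: in-batch negatives that share the anchor's knowledge concept are masked out of the InfoNCE denominator. The tensor names are hypothetical.

```python
import torch
import torch.nn.functional as F

def kc_aligned_contrastive(step_emb, kc_emb, kc_ids, temperature=0.05):
    """step_emb: (B, D) solution-step embeddings; kc_emb: (B, D) embeddings
    of their annotated KCs; kc_ids: (B,) integer KC labels."""
    s = F.normalize(step_emb, dim=-1)
    k = F.normalize(kc_emb, dim=-1)
    logits = s @ k.T / temperature                        # (B, B)
    same_kc = kc_ids[:, None] == kc_ids[None, :]          # shared-KC pairs
    false_neg = same_kc & ~torch.eye(len(kc_ids), dtype=torch.bool,
                                     device=kc_ids.device)
    logits = logits.masked_fill(false_neg, float('-inf'))  # eliminate them
    labels = torch.arange(len(kc_ids), device=kc_ids.device)
    return F.cross_entropy(logits, labels)
```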
Multimodal language modeling constitutes a recent breakthrough which leverages advances in large language models to pretrain capable multimodal models. The integration of natural language during pretraining has been shown to significantly improve learned representations, particularly in computer vision. However, the efficacy of multimodal language modeling in the realm of functional brain data, specifically for advancing pathology detection, remains unexplored. This study pioneers EEG-language models trained on clinical reports and 15000 EEGs. We extend methods for multimodal alignment to this novel domain and investigate which textual information in reports is useful for training EEG-language models. Our results indicate that models learn richer representations from being exposed to a variety of report segments, including the patient's clinical history, description of the EEG, and the physician's interpretation. Compared to models exposed to narrower clinical text information, we find such models to retrieve EEGs based on clinical reports (and vice versa) with substantially higher accuracy. Yet, this is only observed when using a contrastive learning approach. Particularly in regimes with few annotations, we observe that representations of EEG-language models can significantly improve pathology detection compared to those of EEG-only models, as demonstrated by both zero-shot classification and linear probes. In sum, these results highlight the potential of integrating brain activity data with clinical text, suggesting that EEG-language models represent significant progress for clinical applications.
https://arxiv.org/abs/2409.07480
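The contrastive objective behind the retrieval result is CLIP-style symmetric alignment between EEG and report-segment embeddings; a minimal sketch, assuming both encoders output embeddings of the same dimension:

```python
import torch
import torch.nn.functional as F

def eeg_text_contrastive(eeg_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (EEG, report) embeddings."""
    e = F.normalize(eeg_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = e @ t.T / temperature
    labels = torch.arange(e.size(0), device=e.device)
    return 0.5 * (F.cross_entropy(logits, labels)       # EEG -> report
                  + F.cross_entropy(logits.T, labels))  # report -> EEG
```

Zero-shot pathology detection then scores an EEG against encoded class prompts, e.g. the text embeddings of "normal EEG" vs. "abnormal EEG".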
Accurately identifying, understanding, and describing driving safety-critical events (SCEs), including crashes and near-crashes, is crucial for traffic safety, automated driving systems, and advanced driver assistance systems research and application. As SCEs are rare events, most general Vision-Language Models (VLMs) have not been trained sufficiently to link SCE videos and narratives, which could lead to hallucination and missing key safety characteristics. To tackle these challenges, we propose ScVLM, a hybrid approach that combines supervised learning and contrastive learning to improve driving video understanding and event description rationality for VLMs. The proposed approach is trained and evaluated on more than 8,600 SCEs from the Second Strategic Highway Research Program Naturalistic Driving Study dataset, the largest publicly accessible driving dataset with videos and SCE annotations. The results demonstrate the superiority of the proposed approach in generating contextually accurate event descriptions and mitigating hallucinations from VLMs.
https://arxiv.org/abs/2410.00982
Learning agents with reinforcement learning is difficult when dealing with long trajectories that involve a large number of states. To address these learning problems effectively, the number of states can be reduced by abstract representations that cluster states. In principle, deep reinforcement learning can find abstract states, but end-to-end learning is unstable. We propose contrastive abstraction learning to find abstract states, where we assume that successive states in a trajectory belong to the same abstract state. Such abstract states may be basic locations, achieved subgoals, inventory, or health conditions. Contrastive abstraction learning first constructs clusters of state representations by contrastive learning and then applies modern Hopfield networks to determine the abstract states. The first phase of contrastive abstraction learning is self-supervised learning, where contrastive learning forces states with sequential proximity to have similar representations. The second phase uses modern Hopfield networks to map similar state representations to the same fixed point, i.e., to an abstract state. The level of abstraction can be adjusted by determining the number of fixed points of the modern Hopfield network. Furthermore, contrastive abstraction learning does not require rewards and facilitates efficient reinforcement learning for a wide range of downstream tasks. Our experiments demonstrate the effectiveness of contrastive abstraction learning for reinforcement learning.
https://arxiv.org/abs/2410.00704
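The second phase maps similar state representations to fixed points of a modern Hopfield network via the standard update xi <- X^T softmax(beta X xi); the stored patterns X (and beta) determine how many fixed points, i.e. abstract states, exist. A sketch:

```python
import torch

def hopfield_abstract_state(state_repr, patterns, beta=8.0, n_iter=10):
    """Iterate the modern Hopfield update to (approximate) convergence.
    state_repr: (D,) contrastively learned state representation;
    patterns: (N, D) stored pattern matrix X."""
    xi = state_repr
    for _ in range(n_iter):
        xi = patterns.T @ torch.softmax(beta * patterns @ xi, dim=0)
    return xi  # the fixed point serves as the state's abstract state
```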
The scarcity of annotated medical images is a major bottleneck in developing learning models for medical image analysis. Hence, recent studies have focused on pretrained models with fewer annotation requirements that can be fine-tuned for various downstream tasks. However, existing approaches are mainly 3D adaptations of 2D approaches that are ill-suited for 3D medical imaging data. Motivated by this gap, we propose novel domain-aware multi-task learning tasks to pretrain a 3D Swin Transformer for brain magnetic resonance imaging (MRI). Our method considers the domain knowledge in brain MRI by incorporating brain anatomy and morphology as well as standard pretext tasks adapted for 3D imaging in a contrastive learning setting. We pretrain our model using large-scale brain MRI data of 13,687 samples spanning several large-scale databases. Our method outperforms existing supervised and self-supervised methods in three downstream tasks of Alzheimer's disease classification, Parkinson's disease classification, and age prediction tasks. The ablation study of the proposed pretext tasks shows the effectiveness of our pretext tasks.
https://arxiv.org/abs/2410.00410
X-ray image-based medical report generation (MRG) is a pivotal area in artificial intelligence which can significantly reduce diagnostic burdens and patient wait times. Despite significant progress, we believe that the task has reached a bottleneck due to the limited benchmark datasets and the existing large models' insufficient capability enhancements in this specialized domain. Specifically, the recently released CheXpert Plus dataset lacks comparative evaluation algorithms and their results, providing only the dataset itself. This situation makes the training, evaluation, and comparison of subsequent algorithms challenging. Thus, we conduct a comprehensive benchmarking of existing mainstream X-ray report generation models and large language models (LLMs) on the CheXpert Plus dataset. We believe that the proposed benchmark can provide a solid comparative basis for subsequent algorithms and serve as a guide for researchers to quickly grasp the state-of-the-art models in this field. More importantly, we propose a large model for X-ray image report generation using a multi-stage pre-training strategy, including self-supervised autoregressive generation, X-ray-report contrastive learning, and supervised fine-tuning. Extensive experimental results indicate that the autoregressive pre-training based on Mamba effectively encodes X-ray images, and the image-text contrastive pre-training further aligns the feature spaces, achieving better experimental results. Source code can be found at this https URL.
https://arxiv.org/abs/2410.00379
The human visual system is capable of processing continuous streams of visual information, but how the brain encodes and retrieves recent visual memories during continuous visual processing remains unexplored. This study investigates the capacity of working memory to retain past information under continuous visual stimuli. We then propose a new task, Memory Disentangling, which aims to extract and decode past information from fMRI signals. To address the issue of interference from past memory information, we design a disentangled contrastive learning method inspired by the phenomenon of proactive interference. This method separates the information between adjacent fMRI signals into current and past components and decodes them into image descriptions. Experimental results demonstrate that this method effectively disentangles the information within fMRI signals. This research could advance brain-computer interfaces and mitigate the problem of low temporal resolution in fMRI.
https://arxiv.org/abs/2409.20428
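A hedged sketch of the disentangling objective: two projection heads split an fMRI window into "current" and "past" components, each aligned contrastively with the caption embedding of the current or preceding stimulus. The head outputs and names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def infonce(a, b, temperature=0.07):
    """Standard InfoNCE between two batches of paired embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)

def disentangled_loss(cur_head, past_head, caption_t, caption_prev):
    """cur_head/past_head: (B, D) components of the same fMRI signals;
    caption_t/caption_prev: (B, D) current/previous image-caption embeddings."""
    return infonce(cur_head, caption_t) + infonce(past_head, caption_prev)
```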
The application of deep learning in cancer research, particularly in early diagnosis, case understanding, and treatment strategy design, emphasizes the need for high-quality data. Generative AI, especially Generative Adversarial Networks (GANs), has emerged as a leading solution to challenges like class imbalance, robust learning, and model training, while addressing issues stemming from patient privacy and the scarcity of real data. Despite their promise, GANs face several challenges, both inherent and specific to histopathology data. Inherent issues include training imbalance, mode collapse, linear learning from insufficient discriminator feedback, and hard boundary convergence due to stringent feedback. Histopathology data presents a unique challenge with its complex representation, high spatial resolution, and multiscale features. To address these challenges, we propose a framework consisting of two components. First, we introduce a contrastive learning-based Multistage Progressive Finetuning Siamese Neural Network (MFT-SNN) for assessing the similarity between histopathology patches. Second, we implement a Reinforcement Learning-based External Optimizer (RL-EO) within the GAN training loop, serving as a reward signal generator. The modified discriminator loss function incorporates a weighted reward, guiding the GAN to maximize this reward while minimizing loss. This approach offers an external optimization guide to the discriminator, preventing generator overfitting and ensuring smooth convergence. Our proposed solution has been benchmarked against state-of-the-art (SOTA) GANs and a Denoising Diffusion Probabilistic model, outperforming previous SOTA across various metrics, including FID score, KID score, Perceptual Path Length, and downstream classification tasks.
https://arxiv.org/abs/2409.20340
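A minimal sketch of the modified discriminator loss: a standard GAN objective plus a weighted external reward from the RL-based optimizer, so the discriminator maximizes the reward while minimizing its loss. The weighting `lam` is a hypothetical hyperparameter, not from the paper.

```python
import torch
import torch.nn.functional as F

def discriminator_loss_with_reward(d_real, d_fake, reward, lam=0.1):
    """d_real/d_fake: discriminator logits on real/generated patches;
    reward: scalar signal emitted by the external RL optimizer (RL-EO)."""
    bce = F.binary_cross_entropy_with_logits
    loss = (bce(d_real, torch.ones_like(d_real))
            + bce(d_fake, torch.zeros_like(d_fake)))
    return loss - lam * reward  # minimize loss, maximize the weighted reward
```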
Data is crucial for robotic manipulation, as it underpins the development of robotic systems for complex tasks. While high-quality, diverse datasets enhance the performance and adaptability of robotic manipulation policies, collecting extensive expert-level data is resource-intensive. Consequently, many current datasets suffer from quality inconsistencies due to operator variability, highlighting the need for methods to utilize mixed-quality data effectively. To mitigate these issues, we propose "Select Segments to Imitate" (S2I), a framework that selects and optimizes mixed-quality demonstration data at the segment level, while ensuring plug-and-play compatibility with existing robotic manipulation policies. The framework has three components: demonstration segmentation dividing original data into meaningful segments, segment selection using contrastive learning to find high-quality segments, and trajectory optimization to refine suboptimal segments for better policy learning. We evaluate S2I through comprehensive experiments in simulation and real-world environments across six tasks, demonstrating that with only 3 expert demonstrations for reference, S2I can improve the performance of various downstream policies when trained with mixed-quality demonstrations. Project website: this https URL.
https://arxiv.org/abs/2409.19917
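A sketch of the segment-selection step under the setup above: score each mixed-quality segment by its best cosine match to the few expert reference segments (embedded by the contrastively trained segment encoder, assumed here) and keep the top fraction.

```python
import torch
import torch.nn.functional as F

def select_segments(seg_embs, expert_embs, keep_ratio=0.5):
    """seg_embs: (N, D) mixed-quality segment embeddings;
    expert_embs: (M, D) embeddings of expert demonstration segments."""
    s = F.normalize(seg_embs, dim=-1)
    e = F.normalize(expert_embs, dim=-1)
    scores = (s @ e.T).max(dim=1).values       # best expert match per segment
    k = max(1, int(keep_ratio * len(scores)))
    return scores.topk(k).indices              # indices of segments to imitate
```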
Recent works show that assembling multiple off-the-shelf large language models (LLMs) can harness their complementary abilities. To achieve this, routing is a promising method, which learns a router to select the most suitable LLM for each query. However, existing routing models are ineffective when multiple LLMs perform well for a query. To address this problem, in this paper, we propose a method called query-based Router by Dual Contrastive learning (RouterDC). The RouterDC model consists of an encoder and LLM embeddings, and we propose two contrastive learning losses to train the RouterDC model. Experimental results show that RouterDC is effective in assembling LLMs and largely outperforms individual top-performing LLMs as well as existing routing methods on both in-distribution (+2.76%) and out-of-distribution (+1.90%) tasks. Source code is available at this https URL.
https://arxiv.org/abs/2409.19886
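A minimal sketch of the router's structure: a query encoder plus one learnable embedding per candidate LLM, with routing picking the LLM whose embedding best matches the query. The two contrastive training losses are summarized in the closing comment; class and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Router(nn.Module):
    """Query-based router: encoder output scored against LLM embeddings."""
    def __init__(self, encoder, n_llms, dim):
        super().__init__()
        self.encoder = encoder                  # any text encoder -> (B, D)
        self.llm_emb = nn.Parameter(0.02 * torch.randn(n_llms, dim))

    def forward(self, queries):
        q = F.normalize(self.encoder(queries), dim=-1)
        k = F.normalize(self.llm_emb, dim=-1)
        return q @ k.T                          # (B, n_llms) routing scores

# Training (per the paper) uses two contrastive losses: pull each query
# embedding toward the embeddings of LLMs that answer it well, and a
# sample-level loss that keeps similar queries close in the router space.
```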
Multimodal contrastive learning uses various data modalities to create high-quality features, but its reliance on extensive data sources on the Internet makes it vulnerable to backdoor attacks. These attacks insert malicious behaviors during training, which are activated by specific triggers during inference, posing significant security risks. Despite existing countermeasures through fine-tuning that reduce the malicious impacts of such attacks, these defenses frequently necessitate extensive training time and degrade clean accuracy. In this study, we propose an efficient defense mechanism against backdoor threats using a concept known as machine unlearning. This entails strategically creating a small set of poisoned samples to aid the model's rapid unlearning of backdoor vulnerabilities, known as Unlearn Backdoor Threats (UBT). We specifically use overfit training to improve backdoor shortcuts and accurately detect suspicious samples in the potential poisoning data set. Then, we select fewer unlearned samples from suspicious samples for rapid forgetting in order to eliminate the backdoor effect and thus improve backdoor defense efficiency. In the backdoor unlearning process, we present a novel token-based portion unlearning training regime. This technique focuses on the model's compromised elements, dissociating backdoor correlations while maintaining the model's overall integrity. Extensive experimental results show that our method effectively defends against various backdoor attack methods in the CLIP model. Compared to SOTA backdoor defense methods, UBT achieves the lowest attack success rate while maintaining a high clean accuracy of the model (attack success rate decreases by 19% compared to SOTA, while clean accuracy increases by 2.57%).
https://arxiv.org/abs/2409.19526
Multimodal image-text contrastive learning has shown that joint representations can be learned across modalities. Here, we show how leveraging multiple views of image data with contrastive learning can improve downstream fine-grained classification performance for species recognition, even when one view is absent. We propose ContRastive Image-remote Sensing Pre-training (CRISP), a new pre-training task for ground-level and aerial image representation learning of the natural world, and introduce Nature Multi-View (NMV), a dataset of natural world imagery including more than 3 million ground-level and aerial image pairs for over 6,000 plant taxa across the ecologically diverse state of California. The NMV dataset and accompanying material are available at this http URL.
https://arxiv.org/abs/2409.19439
Visual emotion analysis holds significant research value in both computer vision and psychology. However, existing methods for visual emotion analysis suffer from limited generalizability due to the ambiguity of emotion perception and the diversity of data scenarios. To tackle this issue, we introduce UniEmoX, a cross-modal semantic-guided large-scale pretraining framework. Inspired by psychological research emphasizing the inseparability of the emotional exploration process from the interaction between individuals and their environment, UniEmoX integrates scene-centric and person-centric low-level image spatial structural information, aiming to derive more nuanced and discriminative emotional representations. By exploiting the similarity between paired and unpaired image-text samples, UniEmoX distills rich semantic knowledge from the CLIP model to enhance emotional embedding representations more effectively. To the best of our knowledge, this is the first large-scale pretraining framework that integrates psychological theories with contemporary contrastive learning and masked image modeling techniques for emotion analysis across diverse scenarios. Additionally, we develop a visual emotional dataset titled Emo8. Emo8 samples cover a range of domains, including cartoon, natural, realistic, science fiction and advertising cover styles, covering nearly all common emotional scenes. Comprehensive experiments conducted on six benchmark datasets across two downstream tasks validate the effectiveness of UniEmoX. The source code is available at this https URL.
https://arxiv.org/abs/2409.18877
Domain adaptation aims to reduce the model degradation on the target domain caused by the domain shift between the source and target domains. Although encouraging performance has been achieved by combining cognitive learning with the self-training paradigm, they suffer from ambiguous scenarios caused by scale, illumination, or overlapping when deploying deterministic embedding. To address these issues, we propose probabilistic prototypical pixel contrast (PPPC), a universal adaptation framework that models each pixel embedding as a probability via a multivariate Gaussian distribution to fully exploit the uncertainty within them, eventually improving the representation quality of the model. In addition, we derive prototypes from posterior probability estimation, which helps to push the decision boundary away from the ambiguity points. Moreover, we employ an efficient method to compute similarity between distributions, eliminating the need for sampling and reparameterization, thereby significantly reducing computational overhead. Further, we dynamically select the ambiguous crops at the image level to enlarge the number of boundary points involved in contrastive learning, which benefits the establishment of precise distributions for each category. Extensive experimentation demonstrates that PPPC not only helps to address ambiguity at the pixel level, yielding discriminative representations, but also achieves significant improvements in both synthetic-to-real and day-to-night adaptation tasks. It surpasses the previous state-of-the-art (SOTA) by +5.2% mIoU in the most challenging daytime-to-nighttime adaptation scenario, exhibiting stronger generalization on other unseen datasets. The code and models are available at this https URL.
https://arxiv.org/abs/2409.18543
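The efficient distribution similarity can be computed in closed form for diagonal Gaussians, avoiding sampling and reparameterization entirely; a sketch in the spirit of PPPC (the exact form used in the paper may differ):

```python
import torch

def gaussian_similarity(mu1, var1, mu2, var2):
    """Mutual-likelihood-style score between diagonal Gaussian embeddings
    N(mu1, var1) and N(mu2, var2); higher means more similar."""
    var = var1 + var2
    return -0.5 * ((mu1 - mu2).pow(2) / var + var.log()).sum(dim=-1)
```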
Grounding objects in images using visual cues is a well-established approach in computer vision, yet the potential of audio as a modality for object recognition and grounding remains underexplored. We introduce YOSS, "You Only Speak Once to See," to leverage audio for grounding objects in visual scenes, termed Audio Grounding. By integrating pre-trained audio models with visual models using contrastive learning and multi-modal alignment, our approach captures speech commands or descriptions and maps them directly to corresponding objects within images. Experimental results indicate that audio guidance can be effectively applied to object grounding, suggesting that incorporating audio guidance may enhance the precision and robustness of current object grounding methods and improve the performance of robotic systems and computer vision applications. This finding opens new possibilities for advanced object recognition, scene understanding, and the development of more intuitive and capable robotic systems.
https://arxiv.org/abs/2409.18372
Self-supervised pretraining (SSP) has shown promising results in learning from large unlabeled datasets and, thus, could be useful for automated cardiovascular magnetic resonance (CMR) short-axis cine segmentation. However, inconsistent reports of the benefits of SSP for segmentation have made it difficult to apply SSP to CMR. Therefore, this study aimed to evaluate SSP methods for CMR cine segmentation. To this end, short-axis cine stacks of 296 subjects (90,618 2D slices) were used for unlabeled pretraining with four SSP methods: SimCLR, positional contrastive learning, DINO, and masked image modeling (MIM). Subsets of varying numbers of subjects were used for supervised fine-tuning of 2D models for each SSP method, as well as to train a 2D baseline model from scratch. The fine-tuned models were compared to the baseline using the 3D Dice similarity coefficient (DSC) in a test dataset of 140 subjects. The SSP methods showed no performance gains with the largest supervised fine-tuning subset compared to the baseline (DSC = 0.89). When only 10 subjects (231 2D slices) are available for supervised training, SSP using MIM (DSC = 0.86) improves over training from scratch (DSC = 0.82). This study found that SSP is valuable for CMR cine segmentation when labeled training data is scarce, but does not aid state-of-the-art deep learning methods when ample labeled data is available. Moreover, the choice of SSP method is important. The code is publicly available at: this https URL
https://arxiv.org/abs/2409.18100
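Since MIM was the pretraining method that helped most in the low-label regime, here is a minimal sketch of one masked-image-modeling step for 2D cine slices; the model interface and masking scheme are illustrative, not the study's code.

```python
import torch
import torch.nn.functional as F

def mim_step(model, images, mask_ratio=0.6, patch=16):
    """Mask random patches and regress the missing pixels.
    images: (B, C, H, W); model maps masked images to reconstructions."""
    B, _, H, W = images.shape
    grid = torch.rand(B, H // patch, W // patch, device=images.device)
    mask = grid < mask_ratio                               # patch-level mask
    pix = mask.repeat_interleave(patch, 1).repeat_interleave(patch, 2)
    masked = images * (~pix).unsqueeze(1)                  # zero masked pixels
    recon = model(masked)
    pix = pix.unsqueeze(1).expand_as(images)               # pixel-level mask
    return F.mse_loss(recon[pix], images[pix])             # loss on masked only
```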
Reinforcement Learning (RL) has shown its remarkable and generalizable capability in legged locomotion through sim-to-real transfer. However, while adaptive methods like domain randomization are expected to make policy more robust to diverse environments, such comprehensiveness potentially detracts from the policy's performance in any specific environment according to the No Free Lunch theorem, leading to a suboptimal solution once deployed in the real world. To address this issue, we propose a lifelong policy adaptation framework named LoopSR, which utilizes a transformer-based encoder to project real-world trajectories into a latent space, and accordingly reconstruct the real-world environments back in simulation for further improvement. Autoencoder architecture and contrastive learning methods are adopted to better extract the characteristics of real-world dynamics. The simulation parameters for continual training are derived by combining predicted parameters from the decoder with retrieved parameters from the simulation trajectory dataset. By leveraging the continual training, LoopSR achieves superior data efficiency compared with strong baselines, with only a limited amount of data to yield eminent performance in both sim-to-sim and sim-to-real experiments.
https://arxiv.org/abs/2409.17992