Contrastive learning has become a dominant approach in self-supervised visual representation learning, with hard negatives (samples that closely resemble the anchor) being key to enhancing the discriminative power of learned representations. However, efficiently leveraging hard negatives remains a challenge due to the difficulty of identifying and incorporating them without significantly increasing computational costs. To address this, we introduce SynCo (Synthetic Negatives in Contrastive learning), a novel contrastive learning approach that improves model performance by generating synthetic hard negatives. Built on the MoCo framework, SynCo introduces six novel strategies for creating diverse synthetic hard negatives that can be generated on the fly with minimal computational overhead. SynCo achieves faster training and better representation learning, reaching a top-1 accuracy of 68.1% on ImageNet linear evaluation after only 200 epochs of pretraining, surpassing MoCo's 67.5% with the same ResNet-50 encoder. It also transfers more effectively to detection tasks: on PASCAL VOC, it outperforms both the supervised baseline and MoCo with an AP of 82.5%; on COCO, it sets a new benchmark with 40.4% AP for bounding box detection and 35.4% AP for instance segmentation. Our synthetic hard negative generation procedure significantly enhances the quality of visual representations learned through self-supervised contrastive learning. Code is available at this https URL.
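The abstract leaves the six generation strategies unspecified; as a hedged illustration of the general recipe, here is a minimal sketch of one plausible on-the-fly strategy, convexly mixing the queue negatives most similar to the query (function name, shapes, and hyperparameters are assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def synthesize_hard_negatives(query, queue, n_synthetic=64, n_hardest=128):
    """Illustrative sketch (not the paper's exact procedure): create synthetic
    hard negatives by convexly mixing the queue negatives most similar to the
    query, one of the simplest on-the-fly strategies compatible with MoCo."""
    # Rank queue negatives by similarity to the (normalized) query.
    query = F.normalize(query, dim=-1)                # (d,)
    queue = F.normalize(queue, dim=-1)                # (K, d)
    sims = queue @ query                              # (K,)
    hardest = queue[sims.topk(n_hardest).indices]     # (n_hardest, d)

    # Mix random pairs of hard negatives; re-normalize onto the unit sphere.
    i = torch.randint(0, n_hardest, (n_synthetic,))
    j = torch.randint(0, n_hardest, (n_synthetic,))
    alpha = torch.rand(n_synthetic, 1)
    synthetic = alpha * hardest[i] + (1 - alpha) * hardest[j]
    return F.normalize(synthetic, dim=-1)             # appended to the negative set
```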
https://arxiv.org/abs/2410.02401
Modeling temporal characteristics plays a significant role in the representation learning of audio waveforms. We propose Contrastive Long-form Language-Audio Pretraining (CoLLAP) to significantly extend the perception window for both the input audio (up to 5 minutes) and the language descriptions (exceeding 250 words), while enabling contrastive learning across modalities and temporal dynamics. Leveraging recent Music-LLMs to generate long-form music captions for full-length songs, augmented with musical temporal structures, we collect 51.3K audio-text pairs derived from the large-scale AudioSet training dataset, where the average audio length reaches 288 seconds. We propose a novel contrastive learning architecture that fuses language representations with structured audio representations by segmenting each song into clips and extracting their embeddings. With an attention mechanism, we capture multimodal temporal correlations, allowing the model to automatically weigh and enhance the final fusion score for improved contrastive alignment. Finally, we develop two variants of the CoLLAP model with different types of backbone language models. Through comprehensive experiments on multiple long-form music-text retrieval datasets, we demonstrate consistent performance improvement in retrieval accuracy compared with baselines. We also show the pretrained CoLLAP models can be transferred to various music information retrieval tasks, with heterogeneous long-form multimodal contexts.
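To make the fusion step concrete, here is a minimal sketch of an attention-weighted alignment score of the kind the abstract describes, where per-clip similarities to the language embedding are softmax-weighted into a single contrastive score (shapes, the temperature, and the weighting rule are assumptions):

```python
import torch
import torch.nn.functional as F

def fused_alignment_score(clip_embs, text_emb, temperature=0.07):
    """Hedged sketch of the attention-weighted fusion described in the abstract:
    per-clip similarities to the text are softmax-weighted so that the clips
    most relevant to the description dominate the final contrastive score."""
    clip_embs = F.normalize(clip_embs, dim=-1)   # (n_clips, d): one embedding per segment
    text_emb = F.normalize(text_emb, dim=-1)     # (d,)
    sims = clip_embs @ text_emb                  # (n_clips,) per-clip similarity
    attn = torch.softmax(sims / temperature, dim=0)
    return (attn * sims).sum()                   # scalar fusion score for alignment
```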
https://arxiv.org/abs/2410.02271
Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. Although supervised pair preference models (PairPM) can express general preferences, their implementation is highly ad-hoc and cannot guarantee a consistent preference probability of compared pairs. Additionally, they impose high computational costs due to their quadratic query complexity when comparing multiple responses. In this paper, we introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. Additionally, we propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Experimental results show that our General Preference representation model (GPM) outperforms the BT reward model on the RewardBench benchmark with a margin of up to 5.6% and effectively models cyclic preferences where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval2.0 and MT-Bench, after post-training the language model with GPO and our general preference model, reveal substantial performance improvements with margins up to 9.3%. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. The code is available at this https URL.
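A worked sketch, under the assumption that the preference score is an antisymmetric bilinear form on the response embeddings (one plausible way to obtain linear query complexity while permitting cycles; not necessarily the paper's exact parameterization):

```python
import torch

def preference_prob(v1, v2, R):
    """Illustrative sketch of a preference representation score: each response is
    embedded once (linear query complexity), and P(y1 > y2) comes from a
    skew-symmetric bilinear form so that intransitive (cyclic) preferences are
    expressible. The operator construction here is an assumption for exposition."""
    A = R - R.T                       # skew-symmetric: score(y1, y2) = -score(y2, y1)
    score = v1 @ A @ v2               # antisymmetric preference score
    return torch.sigmoid(score)

# With a symmetric form (e.g., a BT-style scalar reward r(y) = w @ v), preferences
# are forced to be transitive; the antisymmetric form above is what permits cycles.
```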
https://arxiv.org/abs/2410.02197
Knowledge tracing (KT) is a popular approach for modeling students' learning progress over time, which can enable more personalized and adaptive learning. However, existing KT approaches face two major limitations: (1) they rely heavily on expert-defined knowledge concepts (KCs) in questions, which is time-consuming and prone to errors; and (2) KT methods tend to overlook the semantics of both questions and the given KCs. In this work, we address these challenges and present KCQRL, a framework for automated knowledge concept annotation and question representation learning that can improve the effectiveness of any existing KT model. First, we propose an automated KC annotation process using large language models (LLMs), which generates question solutions and then annotates KCs in each solution step of the questions. Second, we introduce a contrastive learning approach to generate semantically rich embeddings for questions and solution steps, aligning them with their associated KCs via a tailored false negative elimination approach. These embeddings can be readily integrated into existing KT models, replacing their randomly initialized embeddings. We demonstrate the effectiveness of KCQRL across 15 KT algorithms on two large real-world Math learning datasets, where we achieve consistent performance improvements.
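As a hedged sketch of the second step, an InfoNCE-style loss with false negative elimination might mask out in-batch pairs that share a KC rather than treat them as negatives (batch layout, masking rule, and temperature are assumptions):

```python
import torch
import torch.nn.functional as F

def infonce_with_fn_elimination(q_emb, kc_emb, q_kcs, temperature=0.05):
    """Sketch of contrastive alignment with false-negative elimination: a question
    embedding is pulled toward its own KC embedding, while other in-batch KCs that
    the question is also annotated with are masked out of the denominator rather
    than treated as negatives."""
    q_emb = F.normalize(q_emb, dim=-1)           # (B, d) question/solution-step embeddings
    kc_emb = F.normalize(kc_emb, dim=-1)         # (B, d) embeddings of their paired KCs
    logits = q_emb @ kc_emb.T / temperature      # (B, B)

    # q_kcs[i] is the set of KC ids annotated for question i; mask shared-KC pairs.
    B = logits.size(0)
    mask = torch.zeros(B, B, dtype=torch.bool)
    for i in range(B):
        for j in range(B):
            mask[i, j] = i != j and bool(q_kcs[i] & q_kcs[j])
    logits = logits.masked_fill(mask, float('-inf'))  # eliminate false negatives

    targets = torch.arange(B)
    return F.cross_entropy(logits, targets)
```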
https://arxiv.org/abs/2410.01727
Extensive research has shown that deep neural networks (DNNs) are vulnerable to slight adversarial perturbations: small changes to the input data that appear insignificant but cause the model to produce drastically different outputs. In addition to augmenting training data with adversarial examples generated from a specific attack method, most of the current defense strategies necessitate modifying the original model architecture components to improve robustness or performing test-time data purification to handle adversarial attacks. In this work, we demonstrate that strong feature representation learning during training can significantly enhance the original model's robustness. We propose MOREL, a multi-objective feature representation learning approach, encouraging classification models to produce similar features for inputs within the same class, despite perturbations. Our training method involves an embedding space where cosine similarity loss and multi-positive contrastive loss are used to align natural and adversarial features from the model encoder and ensure tight clustering. Concurrently, the classifier is motivated to achieve accurate predictions. Through extensive experiments, we demonstrate that our approach significantly enhances the robustness of DNNs against white-box and black-box adversarial attacks, outperforming other methods that similarly require no architectural changes or test-time data purification. Our code is available at this https URL
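A minimal sketch of the two embedding-space objectives named in the abstract, assuming a standard multi-positive (supervised-contrastive-style) formulation and equal weighting, neither of which is confirmed by the abstract:

```python
import torch
import torch.nn.functional as F

def morel_style_losses(feat_nat, feat_adv, labels, temperature=0.1):
    """Rough sketch: a cosine-similarity loss aligning each natural feature with
    its adversarial counterpart, plus a multi-positive contrastive loss that
    clusters same-class features. Exact weighting and normalization are assumptions."""
    z_nat = F.normalize(feat_nat, dim=-1)                    # (B, d)
    z_adv = F.normalize(feat_adv, dim=-1)                    # (B, d)

    # 1) Cosine alignment: natural and adversarial views should coincide.
    cos_loss = (1 - (z_nat * z_adv).sum(dim=-1)).mean()

    # 2) Multi-positive contrastive loss: every same-class sample is a positive.
    z = torch.cat([z_nat, z_adv], dim=0)                     # (2B, d)
    y = torch.cat([labels, labels], dim=0)                   # (2B,)
    logits = z @ z.T / temperature
    logits.fill_diagonal_(float('-inf'))                     # drop self-similarity
    pos = (y[:, None] == y[None, :]).float().fill_diagonal_(0.0)
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    contrast_loss = -(pos * log_prob).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return cos_loss + contrast_loss.mean()
```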
https://arxiv.org/abs/2410.01697
Visually-Rich Documents (VRDs), encompassing elements like charts, tables, and references, convey complex information across various fields. However, extracting information from these rich documents is labor-intensive, especially given their inconsistent formats and domain-specific requirements. While pretrained models for VRD Understanding have progressed, their reliance on large, annotated datasets limits scalability. This paper introduces the Domain Adaptive Visually-rich Document Understanding (DAViD) framework, which utilises machine-generated synthetic data for domain adaptation. DAViD integrates fine-grained and coarse-grained document representation learning and employs synthetic annotations to reduce the need for costly manual labelling. By leveraging pretrained models and synthetic data, DAViD achieves competitive performance with minimal annotated datasets. Extensive experiments validate DAViD's effectiveness, demonstrating its ability to efficiently adapt to domain-specific VRDU tasks.
https://arxiv.org/abs/2410.01609
Self-supervised learning has developed rapidly over the last decade and has been applied in many areas of computer vision. Decorrelation-based self-supervised pretraining has shown great promise among non-contrastive algorithms, yielding performance at par with supervised and contrastive self-supervised baselines. In this work, we explore the decorrelation-based paradigm of self-supervised learning and apply it to learning disentangled stroke features for writer identification. We propose a modified formulation of SWIS, a decorrelation-based framework originally proposed for signature verification, standardizing the features along each dimension on top of the existing framework. We show that the proposed framework outperforms the contemporary self-supervised learning framework on the writer identification benchmark and also outperforms several supervised methods. To the best of our knowledge, this work is the first of its kind to apply self-supervised learning to representation learning for writer verification tasks.
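For illustration, a decorrelation objective with the per-dimension standardization the abstract mentions could look like the Barlow-Twins-style sketch below (the exact SWIS formulation may differ):

```python
import torch

def decorrelation_loss(z1, z2, lambd=5e-3, eps=1e-6):
    """Sketch of a decorrelation objective with per-dimension standardization:
    features of two views are standardized along each dimension, then the
    cross-correlation matrix is pushed toward the identity."""
    # Standardize each feature dimension across the batch.
    z1 = (z1 - z1.mean(dim=0)) / (z1.std(dim=0) + eps)   # (B, d)
    z2 = (z2 - z2.mean(dim=0)) / (z2.std(dim=0) + eps)

    B, d = z1.shape
    c = (z1.T @ z2) / B                                  # (d, d) cross-correlation

    on_diag = (torch.diagonal(c) - 1).pow(2).sum()       # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy term
    return on_diag + lambd * off_diag
```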
https://arxiv.org/abs/2410.01441
Generative models can now produce photorealistic synthetic data that is virtually indistinguishable from the real data used to train them. This is a significant evolution over previous models, which could produce reasonable facsimiles of the training data, but ones that human evaluation could visually distinguish from it. Recent work on OOD detection has raised doubts that generative model likelihoods are optimal OOD detectors due to issues involving likelihood misestimation, entropy in the generative process, and typicality. We speculate that generative OOD detectors also failed because their models focused on the pixels rather than the semantic content of the data, leading to failures in near-OOD cases where the pixels may be similar but the information content is significantly different. We hypothesize that estimating typical sets using self-supervised learners leads to better OOD detectors. We introduce a novel approach that leverages representation learning, and informative summary statistics based on manifold estimation, to address all of the aforementioned issues. Our method outperforms other unsupervised approaches and achieves state-of-the-art performance on well-established, challenging benchmarks and new synthetic data detection tasks.
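One simple instance of a manifold-based summary statistic over a self-supervised feature space, offered as a sketch rather than the paper's actual statistic, is the k-NN distance to the training set:

```python
import numpy as np

def knn_ood_scores(train_feats, test_feats, k=10):
    """Minimal sketch of an OOD score built from a self-supervised feature space:
    the distance to the k-th nearest training neighbor serves as a summary
    statistic of how far a sample lies from the estimated data manifold."""
    # Normalize so distances reflect semantic (angular) similarity.
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)

    scores = []
    for x in test:
        d = np.linalg.norm(train - x, axis=1)       # distances to all train points
        scores.append(np.sort(d)[k - 1])            # k-NN distance: higher = more OOD
    return np.array(scores)
```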
https://arxiv.org/abs/2410.01322
This paper introduces a novel hierarchical autoencoder that maps 3D models into a highly compressed latent space. The hierarchical autoencoder is specifically designed to tackle the challenges arising from large-scale datasets and generative modeling using diffusion. Different from previous approaches that only work on a regular image or volume grid, our hierarchical autoencoder operates on unordered sets of vectors. Each level of the autoencoder controls different geometric levels of detail. We show that the model can be used to represent a wide range of 3D models while faithfully representing high-resolution geometry details. The training of the new architecture takes 0.70x the time and 0.58x the memory of the baseline. We also explore how the new representation can be used for generative modeling. Specifically, we propose a cascaded diffusion framework where each stage is conditioned on the previous stage. Our design extends existing cascaded designs for image and volume grids to vector sets.
https://arxiv.org/abs/2410.01295
We propose a novel neural network architecture, the normalized Transformer (nGPT), with representation learning on the hypersphere. In nGPT, all vectors forming the embeddings, MLP, attention matrices, and hidden states are normalized to unit norm. The input stream of tokens travels on the surface of a hypersphere, with each layer contributing a displacement towards the target output predictions. These displacements are defined by the MLP and attention blocks, whose vector components also reside on the same hypersphere. Experiments show that nGPT learns much faster, reducing the number of training steps required to achieve the same accuracy by a factor of 4 to 20, depending on the sequence length.
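The abstract's update rule can be sketched directly: the hidden state stays on the unit hypersphere and each block contributes a displacement toward its normalized output (the step size alpha, learnable in practice, is a constant here):

```python
import torch
import torch.nn.functional as F

def hypersphere_update(h, block_out, alpha=0.05):
    """Sketch of a normalized-Transformer-style layer update: the hidden state
    lives on the unit hypersphere, and each block contributes a displacement
    toward its (also normalized) output, followed by re-normalization."""
    h = F.normalize(h, dim=-1)                  # token states on the sphere
    target = F.normalize(block_out, dim=-1)     # attention/MLP block suggestion
    return F.normalize(h + alpha * (target - h), dim=-1)
```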
https://arxiv.org/abs/2410.01131
In this paper, we demonstrate how to enhance the validity of causal inference with unstructured high-dimensional treatments like texts, by leveraging the power of generative Artificial Intelligence. Specifically, we propose to use a deep generative model such as large language models (LLMs) to efficiently generate treatments and use their internal representation for subsequent causal effect estimation. We show that the knowledge of this true internal representation helps separate the treatment features of interest, such as specific sentiments and certain topics, from other possibly unknown confounding features. Unlike the existing methods, our proposed approach eliminates the need to learn causal representation from the data and hence produces more accurate and efficient estimates. We formally establish the conditions required for the nonparametric identification of the average treatment effect, propose an estimation strategy that avoids the violation of the overlap assumption, and derive the asymptotic properties of the proposed estimator through the application of double machine learning. Finally, using an instrumental variables approach, we extend the proposed methodology to settings in which the treatment feature is based on human perception rather than assumed to be fixed given the treatment object. We conduct simulation studies using the generated text data with an open-source LLM, Llama3, to illustrate the advantages of our estimator over the state-of-the-art causal representation learning algorithms.
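As a rough sketch of the estimation step, assuming the internal representation R serves as the deconfounder, a double-machine-learning / AIPW-style estimate of the average treatment effect could be computed as follows (cross-fitting omitted for brevity; the nuisance model choices are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

def dml_ate(R, T, Y):
    """Hedged sketch of the double machine learning step: with the generative
    model's internal representation R as deconfounder, fit nuisance models for
    treatment and outcome, then combine them in an AIPW-style estimator of the
    average treatment effect."""
    e = LogisticRegression(max_iter=1000).fit(R, T).predict_proba(R)[:, 1]   # propensity
    mu1 = LinearRegression().fit(R[T == 1], Y[T == 1]).predict(R)            # E[Y | R, T=1]
    mu0 = LinearRegression().fit(R[T == 0], Y[T == 0]).predict(R)            # E[Y | R, T=0]
    psi = mu1 - mu0 + T * (Y - mu1) / e - (1 - T) * (Y - mu0) / (1 - e)
    return psi.mean()
```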
https://arxiv.org/abs/2410.00903
Artificial intelligence algorithms have demonstrated their image classification and segmentation ability in the past decade. However, these algorithms perform worse on actual clinical data than on the data used in simulations. This research aims to present a novel hybrid learning model using self-supervised learning and knowledge distillation, which can achieve sufficient generalization and robustness. The self-attention mechanism and tokens employed in ViT, together with the local-to-global learning approach used in the hybrid model, enable the proposed algorithm to extract a high-dimensional and high-quality feature space from images. To demonstrate the proposed neural network's capability in classifying and extracting feature spaces from medical images, we use it on a dataset of Diabetic Retinopathy images, specifically the EyePACS dataset. This dataset is structurally more complex and more challenging regarding damaged areas than other medical images. For the first time in this study, self-supervised learning and knowledge distillation are used to classify this dataset. In our algorithm, for the first time among all self-supervised learning and knowledge distillation models, the test dataset is 50% larger than the training dataset. Unlike many studies, we have not removed any images from the dataset. Finally, our algorithm achieved an accuracy of 79.1% with the linear classifier and 74.36% with the k-NN algorithm for multiclass classification. Compared to a similar state-of-the-art model, our results achieved higher accuracy and more effective representation spaces.
https://arxiv.org/abs/2410.00779
To effectively study complex causal systems, it is often useful to construct representations that simplify parts of the system by discarding irrelevant details while preserving key features. The Information Bottleneck (IB) method is a widely used approach in representation learning that compresses random variables while retaining information about a target variable. Traditional methods like IB are purely statistical and ignore underlying causal structures, making them ill-suited for causal tasks. We propose the Causal Information Bottleneck (CIB), a causal extension of the IB, which compresses a set of chosen variables while maintaining causal control over a target variable. This method produces representations which are causally interpretable, and which can be used when reasoning about interventions. We present experimental results demonstrating that the learned representations accurately capture causality as intended.
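For reference, the classical IB objective trades compression of X against prediction of Y; the causal variant presumably swaps the predictive term for one measuring causal control, sketched here in illustrative notation that may differ from the paper's:

```latex
% Classical Information Bottleneck: compress X into Z, keep information about Y.
\min_{q(z \mid x)} \; I(X; Z) - \beta \, I(Z; Y)

% Causal variant as suggested by the abstract: retain *causal control* over Y,
% e.g. an interventional dependence term (exact definition per the paper).
\min_{q(z \mid x)} \; I(X; Z) - \gamma \, I(Z \rightarrow Y)
```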
https://arxiv.org/abs/2410.00535
This paper introduces an innovative approach to the Medical Vision-Language Pre-training (Med-VLP) area in the specialized context of radiograph representation learning. While conventional methods frequently merge textual annotations into unified reports, we acknowledge the intrinsic hierarchical relationship between the findings and impression sections in radiograph datasets. To establish a targeted correspondence between images and texts, we propose a novel HybridMED framework to align global-level visual representations with the impression and token-level visual representations with the findings. Moreover, our framework incorporates a generation decoder that employs two proxy tasks, responsible for generating the impression from (1) images, via a captioning branch, and (2) findings, through a summarization branch. Additionally, knowledge distillation is leveraged to facilitate the training process. Experiments on the MIMIC-CXR dataset reveal that our summarization branch effectively distills knowledge to the captioning branch, enhancing model performance without significantly increasing parameter requirements due to the shared self-attention and feed-forward architecture.
https://arxiv.org/abs/2410.00448
Zero-shot (ZS) 3D anomaly detection is a crucial yet unexplored field that addresses scenarios where target 3D training samples are unavailable due to practical concerns like privacy protection. This paper introduces PointAD, a novel approach that transfers the strong generalization capabilities of CLIP for recognizing 3D anomalies on unseen objects. PointAD provides a unified framework to comprehend 3D anomalies from both points and pixels. In this framework, PointAD renders 3D anomalies into multiple 2D renderings and projects them back into 3D space. To capture the generic anomaly semantics into PointAD, we propose hybrid representation learning that optimizes the learnable text prompts from 3D and 2D through auxiliary point clouds. The collaboration optimization between point and pixel representations jointly facilitates our model to grasp underlying 3D anomaly patterns, contributing to detecting and segmenting anomalies of unseen diverse 3D objects. Through the alignment of 3D and 2D space, our model can directly integrate RGB information, further enhancing the understanding of 3D anomalies in a plug-and-play manner. Extensive experiments show the superiority of PointAD in ZS 3D anomaly detection across diverse unseen objects.
https://arxiv.org/abs/2410.00320
Major solar flares are abrupt surges in the Sun's magnetic flux, presenting significant risks to technological infrastructure. In view of this, effectively predicting major flares from solar active region magnetic field data through machine learning methods becomes highly important in space weather research. Magnetic field data can be represented in multivariate time series modality where the data displays an extreme class imbalance due to the rarity of major flare events. In time series classification-based flare prediction, the use of contrastive representation learning methods has been relatively limited. In this paper, we introduce CONTREX, a novel contrastive representation learning approach for multivariate time series data, addressing challenges of temporal dependencies and extreme class imbalance. Our method involves extracting dynamic features from the multivariate time series instances, deriving two extremes from positive and negative class feature vectors that provide maximum separation capability, and training a sequence representation embedding module with the original multivariate time series data guided by our novel contrastive reconstruction loss to generate embeddings aligned with the extreme points. These embeddings capture essential time series characteristics and enhance discriminative power. Our approach shows promising solar flare prediction results on the Space Weather Analytics for Solar Flares (SWAN-SF) multivariate time series benchmark dataset against baseline methods.
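A loose sketch of an extreme-anchored objective of the kind the abstract describes, where embeddings are pulled toward their own class extreme and pushed from the other (how the extremes are derived and the exact loss form are assumptions):

```python
import torch
import torch.nn.functional as F

def extreme_anchored_loss(emb, labels, pos_extreme, neg_extreme):
    """Loose sketch: embeddings of flaring (positive) instances are pulled
    toward a positive extreme vector and pushed from the negative one, and
    vice versa for non-flaring instances."""
    emb = F.normalize(emb, dim=-1)                           # (B, d)
    pos = F.normalize(pos_extreme, dim=-1)                   # (d,)
    neg = F.normalize(neg_extreme, dim=-1)
    sim_pos, sim_neg = emb @ pos, emb @ neg                  # (B,), (B,)
    y = labels.float()                                       # 1 = flare, 0 = quiet
    # Pull toward the own-class extreme, push away from the other.
    loss = y * (1 - sim_pos + sim_neg) + (1 - y) * (1 - sim_neg + sim_pos)
    return loss.clamp(min=0).mean()
```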
https://arxiv.org/abs/2410.00312
This paper offers a roadmap for the development of scalable aligned artificial intelligence (AI) from first-principles descriptions of natural intelligence. In brief, a possible path toward scalable aligned AI rests upon enabling artificial agents to learn a good model of the world that includes a good model of our preferences. For this, the main objective is creating agents that learn to represent the world and other agents' world models; a problem that falls under structure learning (a.k.a. causal representation learning). We expose the structure learning and alignment problems with this goal in mind, as well as principles to guide us forward, synthesizing various ideas across mathematics, statistics, and cognitive science. 1) We discuss the essential role of core knowledge, information geometry and model reduction in structure learning, and suggest core structural modules to learn a wide range of naturalistic worlds. 2) We outline a way toward aligned agents through structure learning and theory of mind. As an illustrative example, we mathematically sketch Asimov's Laws of Robotics, which prescribe agents to act cautiously to minimize the ill-being of other agents. We supplement this example by proposing refined approaches to alignment. These observations may guide the development of artificial intelligence in helping to scale existing, or design new, aligned structure learning systems.
https://arxiv.org/abs/2410.00258
Modern QA systems entail retrieval-augmented generation (RAG) for accurate and trustworthy responses. However, the inherent gap between user queries and relevant documents hinders precise matching. Motivated by our conical distribution hypothesis, which posits that potential queries and documents form a cone-like structure in the embedding space, we introduce QAEncoder, a training-free approach to bridge this gap. Specifically, QAEncoder estimates the expectation of potential queries in the embedding space as a robust surrogate for the document embedding, and attaches document fingerprints to effectively distinguish these embeddings. Extensive experiments on fourteen embedding models across six languages and eight datasets validate QAEncoder's alignment capability, which offers a plug-and-play solution that seamlessly integrates with existing RAG architectures and training-based methods.
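A minimal sketch of the training-free recipe, assuming a hypothetical `generate_queries` LLM call and a simple blend weight `w` (both illustrative; the paper's fingerprinting scheme may differ):

```python
import numpy as np

def qa_encoder_embedding(doc_text, embed, generate_queries, w=0.5):
    """Sketch of the training-free idea in the abstract: approximate the
    expectation of potential query embeddings for a document (via a handful of
    generated queries) and blend it with a document fingerprint so distinct
    documents stay separable."""
    queries = generate_queries(doc_text)                     # n pseudo-queries (LLM call)
    q_embs = np.stack([embed(q) for q in queries])           # (n, d)
    q_mean = q_embs.mean(axis=0)                             # expectation surrogate
    q_mean /= np.linalg.norm(q_mean)

    doc_emb = embed(doc_text)                                # document fingerprint
    doc_emb = doc_emb / np.linalg.norm(doc_emb)

    fused = (1 - w) * q_mean + w * doc_emb                   # query-aligned, doc-distinct
    return fused / np.linalg.norm(fused)
```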
https://arxiv.org/abs/2409.20434
Singlish, or formally Colloquial Singapore English, is an English-based creole language originating from the SouthEast Asian country Singapore. The language contains influences from Sinitic languages such as Chinese dialects, as well as Malay, Tamil and so forth. A fundamental task to understanding Singlish is to first understand the pragmatic functions of its discourse particles, upon which Singlish relies heavily to convey meaning. This work offers a preliminary effort to disentangle the Singlish discourse particles (lah, meh and hor) with task-driven representation learning. After disentanglement, we cluster these discourse particles to differentiate their pragmatic functions, and perform Singlish-to-English machine translation. Our work provides a computational method to understanding Singlish discourse particles, and opens avenues towards a deeper comprehension of the language and its usage.
https://arxiv.org/abs/2409.20366
Survival prediction is a crucial task associated with cancer diagnosis and treatment planning. This paper presents a novel approach to survival prediction by harnessing comprehensive information from CT and PET scans, along with associated Genomic data. Current methods rely on either a single modality or the integration of multiple modalities for prediction without adequately addressing associations across patients or modalities. We aim to develop a robust predictive model for survival outcomes by integrating multi-modal imaging data with genetic information while accounting for associations across patients and modalities. We learn representations for each modality via a self-supervised module and harness the semantic similarities across the patients to ensure the embeddings are aligned closely. However, optimizing solely for global relevance is inadequate, as many pairs sharing similar high-level semantics, such as tumor type, are inadvertently pushed apart in the embedding space. To address this issue, we use a cross-patient module (CPM) designed to harness inter-subject correspondences. The CPM module aims to bring together embeddings from patients with similar disease characteristics. Our experimental evaluation on a dataset of Non-Small Cell Lung Cancer (NSCLC) patients demonstrates the effectiveness of our approach in predicting survival outcomes, outperforming state-of-the-art methods.
https://arxiv.org/abs/2409.20179