Protein language models (pLMs) pre-trained on vast protein sequence databases excel at various downstream tasks but lack the structural knowledge essential for many biological applications. To address this, we integrate structural insights from pre-trained protein graph neural networks (pGNNs) into pLMs through a latent-level contrastive learning task. This task aligns residue representations from pLMs with those from pGNNs across multiple proteins, enriching pLMs with inter-protein structural knowledge. Additionally, we incorporate a physical-level task that infuses intra-protein structural knowledge by optimizing pLMs to predict structural tokens. The proposed dual-task framework effectively incorporates both inter-protein and intra-protein structural knowledge into pLMs. Given the variability in the quality of protein structures in the PDB, we further introduce a residue loss selection module, which uses a small model trained on high-quality structures to select reliable yet challenging residue losses for the pLM to learn. Applying our structure alignment method to the state-of-the-art ESM2 and AMPLIFY models yields notable performance gains across a wide range of tasks, including a 12.7% increase in ESM2 contact prediction. The data, code, and resulting SaESM2 and SaAMPLIFY models will be released on Hugging Face.
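As a rough illustration of the latent-level alignment task, the sketch below computes a symmetric InfoNCE loss between residue embeddings produced by a pLM and a pGNN for the same residues across a batch of proteins; the projection dimensions and temperature are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch: residue embeddings from a pLM and a pGNN are projected into a
# shared space and aligned with a symmetric InfoNCE loss over all residues in the
# batch, contrasting each residue against residues from other proteins as well.
import torch
import torch.nn.functional as F

def residue_contrastive_loss(plm_repr, pgnn_repr, temperature=0.07):
    """plm_repr, pgnn_repr: (num_residues, dim) embeddings of the same residues
    (flattened across all proteins in the batch) from the two encoders."""
    z1 = F.normalize(plm_repr, dim=-1)
    z2 = F.normalize(pgnn_repr, dim=-1)
    logits = z1 @ z2.t() / temperature          # similarity of every residue pair
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetric cross-entropy: each residue should match its structural counterpart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random features standing in for encoder outputs.
plm = torch.randn(128, 256)    # e.g. projected pLM residue states
pgnn = torch.randn(128, 256)   # e.g. projected pGNN residue states
print(residue_contrastive_loss(plm, pgnn).item())
```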
https://arxiv.org/abs/2505.16896
Earth observation foundation models have shown strong generalization across multiple Earth observation tasks, but their robustness under real-world perturbations remains underexplored. To bridge this gap, we introduce REOBench, the first comprehensive benchmark for evaluating the robustness of Earth observation foundation models across six tasks and twelve types of image corruptions, including both appearance-based and geometric perturbations. To ensure realistic and fine-grained evaluation, our benchmark focuses on high-resolution optical remote sensing images, which are widely used in critical applications such as urban planning and disaster response. We conduct a systematic evaluation of a broad range of models trained using masked image modeling, contrastive learning, and vision-language pre-training paradigms. Our results reveal that (1) existing Earth observation foundation models experience significant performance degradation when exposed to input corruptions; (2) the severity of degradation varies across tasks, model architectures, backbone sizes, and types of corruption, with performance drops ranging from less than 1% to over 20%; and (3) vision-language models show enhanced robustness, particularly in multimodal tasks. REOBench underscores the vulnerability of current Earth observation foundation models to real-world corruptions and provides actionable insights for developing more robust and reliable models.
https://arxiv.org/abs/2505.16793
Despite providing superior performance, open-source large language models (LLMs) are vulnerable to abusive usage. To address this issue, recent works propose LLM fingerprinting methods to identify the specific source LLMs behind suspect applications. However, these methods fail to provide stealthy and robust fingerprint verification. In this paper, we propose a novel LLM fingerprinting scheme, namely CoTSRF, which utilizes the Chain of Thought (CoT) as the fingerprint of an LLM. CoTSRF first collects responses from the source LLM by querying it with crafted CoT queries. Then, it applies contrastive learning to train a CoT extractor that extracts the CoT feature (i.e., fingerprint) from the responses. Finally, CoTSRF conducts fingerprint verification by comparing the Kullback-Leibler divergence between the CoT features of the source and suspect LLMs against an empirical threshold. Extensive experiments demonstrate the advantages of CoTSRF for fingerprinting LLMs, particularly its stealthy and robust fingerprint verification.
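A minimal sketch of the verification step described above: the CoT features of the source and suspect models are summarized as probability distributions and compared with a KL divergence against an empirical threshold. The softmax summary and the threshold value are assumptions for illustration, not the paper's exact procedure.

```python
# Minimal sketch of KL-based fingerprint verification over extracted CoT features.
import torch
import torch.nn.functional as F

def fingerprint_match(source_feats, suspect_feats, threshold=0.05):
    """source_feats, suspect_feats: (num_queries, dim) CoT features from the extractor."""
    p = F.softmax(source_feats, dim=-1).mean(dim=0)   # empirical distribution, source LLM
    q = F.softmax(suspect_feats, dim=-1).mean(dim=0)  # empirical distribution, suspect LLM
    kl = F.kl_div(q.log(), p, reduction="sum")        # KL(p || q)
    return kl.item(), kl.item() < threshold           # small divergence -> same source LLM

src = torch.randn(32, 64)
sus = src + 0.01 * torch.randn(32, 64)                # a near-copy of the source features
print(fingerprint_match(src, sus))
```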
https://arxiv.org/abs/2505.16785
Vision-Language Models (VLMs) are essential for multimodal tasks, especially compositional reasoning (CR) tasks, which require distinguishing fine-grained semantic differences between visual and textual embeddings. However, existing methods primarily fine-tune the model by generating text-based hard negative samples, neglecting the importance of image-based negative samples, which results in insufficient training of the visual encoder and ultimately impacts the overall performance of the model. Moreover, negative samples are typically treated uniformly, without considering their difficulty levels, and the alignment of positive samples is insufficient, which leads to challenges in aligning difficult sample pairs. To address these issues, we propose Adaptive Hard Negative Perturbation Learning (AHNPL). AHNPL translates text-based hard negatives into the visual domain to generate semantically disturbed image-based negatives for training the model, thereby enhancing its overall performance. AHNPL also introduces a contrastive learning approach using a multimodal hard negative loss to improve the model's discrimination of hard negatives within each modality and a dynamic margin loss that adjusts the contrastive margin according to sample difficulty to enhance the distinction of challenging sample pairs. Experiments on three public datasets demonstrate that our method effectively boosts VLMs' performance on complex CR tasks. The source code is available at this https URL.
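The dynamic margin idea can be sketched as a triplet-style loss whose margin grows with the estimated difficulty of the hard negative (here, its similarity to the anchor); the difficulty proxy and the margin range are illustrative assumptions, not AHNPL's exact formulation.

```python
# Minimal sketch of a dynamic-margin contrastive term: harder negatives demand a
# larger separation between the matched pair and the perturbed negative.
import torch
import torch.nn.functional as F

def dynamic_margin_loss(anchor, positive, hard_negative,
                        base_margin=0.2, max_margin=0.5):
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(hard_negative, dim=-1)
    sim_pos = (a * p).sum(-1)
    sim_neg = (a * n).sum(-1)
    # Harder negatives (more similar to the anchor) get a larger required margin.
    difficulty = (sim_neg + 1) / 2                     # map [-1, 1] -> [0, 1]
    margin = base_margin + (max_margin - base_margin) * difficulty
    return F.relu(margin - (sim_pos - sim_neg)).mean()

img = torch.randn(8, 512)      # e.g. visual embeddings
txt = torch.randn(8, 512)      # matched captions
neg = torch.randn(8, 512)      # semantically perturbed image-based hard negatives
print(dynamic_margin_loss(img, txt, neg).item())
```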
https://arxiv.org/abs/2505.15576
Deploying machine learning models in resource-constrained environments, such as edge devices or rapid prototyping scenarios, increasingly demands distillation of large datasets into significantly smaller yet informative synthetic datasets. Current dataset distillation techniques, particularly Trajectory Matching methods, optimize synthetic data so that the model's training trajectory on synthetic samples mirrors that on real data. While demonstrating efficacy on medium-scale synthetic datasets, these methods fail to adequately preserve semantic richness under extreme sample scarcity. To address this limitation, we propose a novel dataset distillation method that integrates contrastive learning during image synthesis. By explicitly maximizing instance-level feature discrimination, our approach produces more informative and diverse synthetic samples, even when dataset sizes are significantly constrained. Experimental results demonstrate that incorporating contrastive learning substantially enhances the performance of models trained on very small-scale synthetic datasets: the integration not only guides more effective feature representation but also significantly improves the visual fidelity of the synthesized images. Overall, our method achieves notable performance improvements over existing distillation techniques, especially in scenarios with extremely limited synthetic data.
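A minimal sketch of the instance-level discrimination term that would be added to the trajectory-matching objective: two augmented views of each synthetic image are pulled together and pushed away from all other synthetic instances with an NT-Xent-style loss. The encoder, the augmentations, and the weighting against the distillation loss are assumptions.

```python
# Minimal sketch of an instance-discrimination (SimCLR/NT-Xent style) term over the
# synthetic set, added alongside the trajectory-matching loss during synthesis.
import torch
import torch.nn.functional as F

def instance_contrastive_loss(view1_feats, view2_feats, temperature=0.5):
    """view1_feats, view2_feats: (num_synthetic, dim) features of two augmented views."""
    z = F.normalize(torch.cat([view1_feats, view2_feats]), dim=-1)
    n = view1_feats.size(0)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float("-inf"))                  # never contrast a view with itself
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

feats_a = torch.randn(16, 128)   # encoder features of augmentation A of the synthetic images
feats_b = torch.randn(16, 128)   # encoder features of augmentation B
print(instance_contrastive_loss(feats_a, feats_b).item())
```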
https://arxiv.org/abs/2505.15267
Customizable multilingual zero-shot singing voice synthesis (SVS) has various potential applications in music composition and short video dubbing. However, existing SVS models overly depend on phoneme and note boundary annotations, limiting their robustness in zero-shot scenarios and producing poor transitions between phonemes and notes. Moreover, they also lack effective multi-level style control via diverse prompts. To overcome these challenges, we introduce TCSinger 2, a multi-task multilingual zero-shot SVS model with style transfer and style control based on various prompts. TCSinger 2 mainly includes three key modules: 1) the Blurred Boundary Content (BBC) Encoder, which predicts duration, extends content embeddings, and applies masking to the boundaries to enable smooth transitions; 2) the Custom Audio Encoder, which uses contrastive learning to extract aligned representations from singing, speech, and textual prompts; and 3) the Flow-based Custom Transformer, which leverages Cus-MOE with F0 supervision to enhance both the synthesis quality and style modeling of the generated singing voice. Experimental results show that TCSinger 2 outperforms baseline models in both subjective and objective metrics across multiple related tasks.
https://arxiv.org/abs/2505.14910
This paper proposes a single-stage training approach that semantically aligns three modalities (audio, visual, and text) using a contrastive learning framework. Contrastive training has gained prominence for multimodal alignment, utilizing large-scale unlabeled data to learn shared representations. Existing deep learning approaches for trimodal alignment involve two stages that separately align the visual-text and audio-text modalities. This approach suffers from mismatched data distributions, resulting in suboptimal alignment. Leveraging the AVCaps dataset, which provides audio, visual, and audio-visual captions for video clips, our method jointly optimizes the representations of all modalities using contrastive training. Our results demonstrate that the single-stage approach outperforms the two-stage method, achieving a two-fold improvement in audio-based visual retrieval and highlighting the advantages of unified multimodal representation learning.
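A minimal sketch of the single-stage objective: a symmetric InfoNCE loss is applied to every pair of modalities in the same batch and the three terms are summed, so the audio, visual, and text encoders are optimized jointly. The encoders and the temperature are illustrative assumptions.

```python
# Minimal sketch of joint trimodal contrastive training with pairwise InfoNCE losses.
import torch
import torch.nn.functional as F

def pair_nce(x, y, temperature=0.07):
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature
    targets = torch.arange(x.size(0), device=x.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def trimodal_loss(audio, visual, text):
    # All three pairwise terms are optimized in a single stage.
    return (pair_nce(audio, visual) +
            pair_nce(audio, text) +
            pair_nce(visual, text))

a, v, t = torch.randn(32, 256), torch.randn(32, 256), torch.randn(32, 256)
print(trimodal_loss(a, v, t).item())
```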
https://arxiv.org/abs/2505.14562
Citation classification, which identifies the intention behind academic citations, is pivotal for scholarly analysis. Previous works suggest fine-tuning pretrained language models (PLMs) on citation classification datasets, reaping the reward of the linguistic knowledge they gained during pretraining. However, directly fine-tuning for citation classification is challenging due to labeled data scarcity, contextual noise, and spurious keyphrase correlations. In this paper, we present a novel framework, Citss, that adapts PLMs to overcome these challenges. Citss introduces self-supervised contrastive learning to alleviate data scarcity and is equipped with two specialized strategies to obtain the contrastive pairs: sentence-level cropping, which enhances focus on target citations within long contexts, and keyphrase perturbation, which mitigates reliance on specific keyphrases. Compared with previous works that are designed only for encoder-based PLMs, Citss is carefully developed to be compatible with both encoder-based PLMs and decoder-based LLMs, to embrace the benefits of enlarged pretraining. Experiments on three benchmark datasets with both encoder-based PLMs and decoder-based LLMs demonstrate our superiority over the previous state of the art. Our code is available at: this http URL
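A rough sketch of the two strategies for building contrastive pairs, assuming a simple sentence splitter, a [CIT] citation marker, and a precomputed keyphrase list (all illustrative choices, not the paper's implementation):

```python
# Minimal sketch of sentence-level cropping and keyphrase perturbation as
# augmentations that produce contrastive views of a citation context.
import random
import re

def sentence_crop(context: str, citation_marker: str = "[CIT]", window: int = 1) -> str:
    """Keep a window of sentences around the sentence containing the target citation."""
    sentences = re.split(r"(?<=[.!?])\s+", context)
    idx = next(i for i, s in enumerate(sentences) if citation_marker in s)
    lo, hi = max(0, idx - window), min(len(sentences), idx + window + 1)
    return " ".join(sentences[lo:hi])

def keyphrase_perturb(context: str, keyphrases: list, p: float = 0.5) -> str:
    """Randomly mask candidate keyphrases to discourage spurious keyphrase reliance."""
    for kp in keyphrases:
        if random.random() < p:
            context = context.replace(kp, "[MASK]")
    return context

text = ("Prior work studied citation intent. We build on BERT [CIT] for classification. "
        "Results improved markedly.")
print(sentence_crop(text))
print(keyphrase_perturb(text, ["BERT", "citation intent"]))
```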
https://arxiv.org/abs/2505.14471
Graphical user interface (GUI) agents powered by multimodal large language models (MLLMs) have shown great promise for human interaction. However, due to the high fine-tuning cost, users often rely on open-source GUI agents or APIs offered by AI providers, which introduces a critical but underexplored supply chain threat: backdoor attacks. In this work, we first unveil that MLLM-powered GUI agents naturally expose multiple interaction-level triggers, such as historical steps, environment states, and task progress. Based on this observation, we introduce AgentGhost, an effective and stealthy framework for red-teaming backdoor attacks. Specifically, we first construct composite triggers by combining goal and interaction levels, allowing GUI agents to unintentionally activate backdoors while ensuring task utility. Then, we formulate backdoor injection as a Min-Max optimization problem that uses supervised contrastive learning to maximize the feature difference across sample classes in the representation space, improving the flexibility of the backdoor. Meanwhile, it adopts supervised fine-tuning to minimize the discrepancy between backdoor and clean behavior generation, enhancing effectiveness and utility. Extensive evaluations of various agent models on two established mobile benchmarks show that AgentGhost is effective and generic, with attack accuracy that reaches 99.7% on three attack objectives, and shows stealthiness with only 1% utility degradation. Furthermore, we tailor a defense method against AgentGhost that reduces the attack accuracy to 22.1%. Our code is available at anonymous.
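The representation-space objective can be sketched with a standard supervised contrastive (SupCon-style) loss that maximizes the feature difference across sample classes; the feature source and the two-class setup below are illustrative assumptions.

```python
# Minimal sketch of a supervised contrastive loss that separates sample classes
# (e.g. clean vs. backdoor-triggered) in the representation space.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """features: (batch, dim); labels: (batch,) integer class ids."""
    z = F.normalize(features, dim=-1)
    sim = z @ z.t() / temperature
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))      # never contrast a sample with itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    # Average log-probability of same-class pairs for each anchor that has positives.
    per_anchor = torch.where(pos_mask, log_prob, torch.zeros_like(log_prob)).sum(1)
    per_anchor = per_anchor / pos_mask.sum(1).clamp(min=1)
    return -per_anchor[pos_mask.any(1)].mean()

feats = torch.randn(16, 128)                             # representation-space features
labels = torch.randint(0, 2, (16,))                      # 0 = clean behaviour, 1 = backdoor class
print(supervised_contrastive_loss(feats, labels).item())
```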
https://arxiv.org/abs/2505.14418
Vision-language modeling (VLM) aims to bridge the information gap between images and natural language. Under the new paradigm of first pre-training on massive image-text pairs and then fine-tuning on task-specific data, VLM in the remote sensing domain has made significant progress. The resulting models benefit from the absorption of extensive general knowledge and demonstrate strong performance across a variety of remote sensing data analysis tasks. Moreover, they are capable of interacting with users in a conversational manner. In this paper, we aim to provide the remote sensing community with a timely and comprehensive review of the developments in VLM using the two-stage paradigm. Specifically, we first cover a taxonomy of VLM in remote sensing: contrastive learning, visual instruction tuning, and text-conditioned image generation. For each category, we detail the commonly used network architecture and pre-training objectives. Second, we conduct a thorough review of existing works, examining foundation models and task-specific adaptation methods in contrastive-based VLM, architectural upgrades, training strategies and model capabilities in instruction-based VLM, as well as generative foundation models with their representative downstream applications. Third, we summarize datasets used for VLM pre-training, fine-tuning, and evaluation, with an analysis of their construction methodologies (including image sources and caption generation) and key properties, such as scale and task adaptability. Finally, we conclude this survey with insights and discussions on future research directions: cross-modal representation alignment, vague requirement comprehension, explanation-driven model reliability, continually scalable model capabilities, and large-scale datasets featuring richer modalities and greater challenges.
https://arxiv.org/abs/2505.14361
This paper explores the use of contrastive learning and generative adversarial networks for generating realistic underwater images from synthetic images with uniform lighting. We investigate the performance of image translation models for generating realistic underwater images using the VAROS dataset. Two key evaluation metrics, Fréchet Inception Distance (FID) and Structural Similarity Index Measure (SSIM), provide insights into the trade-offs between perceptual quality and structural preservation. For paired image translation, pix2pix achieves the best FID scores due to its paired supervision and PatchGAN discriminator, while the autoencoder model attains the highest SSIM, suggesting better structural fidelity despite producing blurrier outputs. Among unpaired methods, CycleGAN achieves a competitive FID score by leveraging cycle-consistency loss, whereas CUT, which replaces cycle-consistency with contrastive learning, attains higher SSIM, indicating improved spatial similarity retention. Notably, incorporating depth information into CUT results in the lowest overall FID score, demonstrating that depth cues enhance realism. However, the slight decrease in SSIM suggests that depth-aware learning may introduce structural variations.
https://arxiv.org/abs/2505.14296
The growing collaboration between humans and AI models in generative tasks has introduced new challenges in distinguishing between human-written, AI-generated, and human-AI collaborative texts. In this work, we collect FAIDSet, a multilingual, multi-domain, multi-generator dataset. We further introduce FAID, a fine-grained detection framework that classifies text into these three categories while also identifying the underlying AI model family. Unlike existing binary classifiers, FAID is built to capture both authorship and model-specific characteristics. Our method combines multi-level contrastive learning with multi-task auxiliary classification to learn subtle stylistic cues. By modeling AI families as distinct stylistic entities, FAID offers improved interpretability. We incorporate an adaptation mechanism to address distributional shifts on unseen data without retraining. Experimental results demonstrate that FAID outperforms several baseline approaches, particularly enhancing generalization accuracy on unseen domains and new AI models. It provides a potential solution for improving transparency and accountability in AI-assisted writing.
https://arxiv.org/abs/2505.14271
Session search involves a series of interactive queries and actions to fulfill a user's complex information needs. Current strategies typically prioritize sequential modeling for deep semantic understanding, overlooking the graph structure in interactions. While some approaches focus on capturing structural information, they use a generalized representation for documents, neglecting word-level semantic modeling. In this paper, we propose the Symbolic Graph Ranker (SGR), which aims to take advantage of both text-based and graph-based approaches by leveraging the power of recent Large Language Models (LLMs). Concretely, we first introduce a set of symbolic grammar rules to convert a session graph into text. This allows session history, the interaction process, and task instructions to be seamlessly integrated as inputs for the LLM. Moreover, given the natural discrepancy between LLMs pre-trained on textual corpora and the symbolic language we produce using our graph-to-text grammar, our objective is to enhance LLMs' ability to capture graph structures within a textual format. To achieve this, we introduce a set of self-supervised symbolic learning tasks, including link prediction, node content generation, and generative contrastive learning, to enable LLMs to capture topological information from coarse-grained to fine-grained. Experimental results and comprehensive analysis on two benchmark datasets, AOL and Tiangong-ST, confirm the superiority of our approach. Our paradigm also offers a novel and effective methodology that bridges the gap between traditional search strategies and modern LLMs.
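A toy sketch of the graph-to-text step, with a made-up node/edge schema and grammar standing in for the paper's symbolic rules:

```python
# Minimal sketch: a session graph of queries, documents, and interaction edges is
# linearized with simple symbolic rules so it can be passed to an LLM as plain text.
def session_graph_to_text(nodes: dict, edges: list) -> str:
    """nodes: id -> {"type": "query"|"doc", "text": ...}; edges: (src, relation, dst)."""
    lines = []
    for nid, node in nodes.items():
        lines.append(f"[{node['type'].upper()} {nid}] {node['text']}")
    for src, rel, dst in edges:
        lines.append(f"[EDGE] {src} -{rel}-> {dst}")
    return "\n".join(lines)

nodes = {
    "q1": {"type": "query", "text": "python pandas merge"},
    "d1": {"type": "doc", "text": "pandas.DataFrame.merge documentation"},
    "q2": {"type": "query", "text": "pandas merge on multiple columns"},
}
edges = [("q1", "click", "d1"), ("q1", "reformulate", "q2")]
print(session_graph_to_text(nodes, edges))
```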
https://arxiv.org/abs/2505.14156
Biological brains learn continually from a stream of unlabeled data, while integrating specialized information from sparsely labeled examples without compromising their ability to generalize. Meanwhile, machine learning methods are susceptible to catastrophic forgetting in this natural learning setting, as supervised specialist fine-tuning degrades performance on the original task. We introduce task-modulated contrastive learning (TMCL), which takes inspiration from the biophysical machinery in the neocortex, using predictive coding principles to integrate top-down information continually and without supervision. We follow the idea that these principles build a view-invariant representation space, and that this can be implemented using a contrastive loss. Then, whenever labeled samples of a new class occur, new affine modulations are learned that improve separation of the new class from all others, without affecting feedforward weights. By co-opting the view-invariance learning mechanism, we then train feedforward weights to match the unmodulated representation of a data sample to its modulated counterparts. This introduces modulation invariance into the representation space, and, by also using past modulations, stabilizes it. Our experiments show improvements in both class-incremental and transfer learning over state-of-the-art unsupervised approaches, as well as over comparable supervised approaches, using as few as 1% of available labels. Taken together, our work suggests that top-down modulations play a crucial role in balancing stability and plasticity.
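A minimal sketch of the modulation idea: each new class learns an affine modulation (scale and shift) on top of the feedforward representation, and the backbone can then be trained to pull the unmodulated representation toward its modulated counterparts. Dimensions and the matching loss below are illustrative assumptions.

```python
# Minimal sketch of task modulation: per-class affine modulations applied to a
# feedforward representation, plus a modulation-invariance matching term.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassModulation(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))    # learned per-class scale
        self.beta = nn.Parameter(torch.zeros(dim))    # learned per-class shift

    def forward(self, h):
        return self.gamma * h + self.beta

dim = 128
h = torch.randn(16, dim)                              # unmodulated feedforward representation
mod = ClassModulation(dim)
with torch.no_grad():
    mod.gamma.add_(0.1 * torch.randn(dim))            # pretend the modulation was trained for a new class
h_mod = mod(h)
# Modulation-invariance term: the backbone is trained so the plain representation
# matches the modulated view (cosine similarity used here as a stand-in).
loss = 1 - F.cosine_similarity(h, h_mod.detach(), dim=-1).mean()
print(loss.item())
```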
https://arxiv.org/abs/2505.14125
We present Sat2Sound, a multimodal representation learning framework for soundscape mapping, designed to predict the distribution of sounds at any location on Earth. Existing methods for this task rely on satellite images and paired geotagged audio samples, which often fail to capture the diversity of sound sources at a given location. To address this limitation, we enhance existing datasets by leveraging a Vision-Language Model (VLM) to generate semantically rich soundscape descriptions for locations depicted in satellite images. Our approach incorporates contrastive learning across audio, audio captions, satellite images, and satellite image captions. We hypothesize that there is a fixed set of soundscape concepts shared across modalities. To this end, we learn a shared codebook of soundscape concepts and represent each sample as a weighted average of these concepts. Sat2Sound achieves state-of-the-art performance in cross-modal retrieval between satellite image and audio on two datasets: GeoSound and SoundingEarth. Additionally, building on Sat2Sound's ability to retrieve detailed soundscape captions, we introduce a novel application: location-based soundscape synthesis, which enables immersive acoustic experiences. Our code and models will be publicly available.
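The shared-codebook idea sketched below re-expresses any sample embedding as a softmax-weighted average of learned soundscape concept vectors; the codebook size, dimensionality, and temperature are illustrative assumptions.

```python
# Minimal sketch of a shared soundscape-concept codebook: each sample (from any
# modality) becomes a weighted average of learned concept vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoundscapeCodebook(nn.Module):
    def __init__(self, num_concepts=64, dim=256, temperature=0.07):
        super().__init__()
        self.concepts = nn.Parameter(torch.randn(num_concepts, dim))
        self.temperature = temperature

    def forward(self, x):                       # x: (batch, dim) from any encoder
        sims = F.normalize(x, dim=-1) @ F.normalize(self.concepts, dim=-1).t()
        weights = F.softmax(sims / self.temperature, dim=-1)   # concept mixture weights
        return weights @ self.concepts, weights                # weighted-average embedding

codebook = SoundscapeCodebook()
audio_emb = torch.randn(8, 256)                 # stand-in for audio encoder outputs
recon, w = codebook(audio_emb)
print(recon.shape, w.shape)
```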
https://arxiv.org/abs/2505.13777
Graphs serve as versatile data structures in numerous real-world domains-including social networks, molecular biology, and knowledge graphs-by capturing intricate relational information among entities. Among graph-based learning techniques, Graph Contrastive Learning (GCL) has gained significant attention for its ability to derive robust, self-supervised graph representations through the contrasting of positive and negative sample pairs. However, a critical challenge lies in ensuring high-quality positive pairs so that the intrinsic semantic and structural properties of the original graph are preserved rather than distorted. To address this issue, we propose SRGCL (Self-Reinforced Graph Contrastive Learning), a novel framework that leverages the model's own encoder to dynamically evaluate and select high-quality positive pairs. We designed a unified positive pair generator employing multiple augmentation strategies, and a selector guided by the manifold hypothesis to maintain the underlying geometry of the latent space. By adopting a probabilistic mechanism for selecting positive pairs, SRGCL iteratively refines its assessment of pair quality as the encoder's representational power improves. Extensive experiments on diverse graph-level classification tasks demonstrate that SRGCL, as a plug-in module, consistently outperforms state-of-the-art GCL methods, underscoring its adaptability and efficacy across various domains.
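A minimal sketch of the selection step: candidate positives produced by different augmentations are scored with the model's own encoder (similarity to the anchor embedding) and one is drawn probabilistically, so that higher-quality candidates are favored more as the encoder improves. The scoring rule and sampling temperature are assumptions, not SRGCL's exact selector.

```python
# Minimal sketch of encoder-guided, probabilistic selection of a positive pair
# from several augmented candidates of the same graph.
import torch
import torch.nn.functional as F

def select_positive(anchor_emb, candidate_embs, temperature=0.2):
    """anchor_emb: (dim,); candidate_embs: (num_candidates, dim) from augmentations."""
    sims = F.cosine_similarity(anchor_emb[None, :], candidate_embs, dim=-1)
    probs = F.softmax(sims / temperature, dim=0)       # quality-weighted selection
    idx = torch.multinomial(probs, 1).item()
    return idx, probs

anchor = torch.randn(64)                               # encoder embedding of the original graph
candidates = torch.randn(5, 64)                        # e.g. edge-drop / feature-mask views
print(select_positive(anchor, candidates))
```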
https://arxiv.org/abs/2505.13650
Few-shot cross-modal retrieval focuses on learning cross-modal representations with limited training samples, enabling the model to handle unseen classes during inference. Unlike traditional cross-modal retrieval tasks, which assume that both training and testing data share the same class distribution, few-shot retrieval involves data with sparse representations across modalities. Existing methods often fail to adequately model the multi-peak distribution of few-shot cross-modal data, resulting in two main biases in the latent semantic space: intra-modal bias, where sparse samples fail to capture intra-class diversity, and inter-modal bias, where misalignments between image and text distributions exacerbate the semantic gap. These biases hinder retrieval accuracy. To address these issues, we propose a novel method, GCRDP, for few-shot cross-modal retrieval. This approach effectively captures the complex multi-peak distribution of data using a Gaussian Mixture Model (GMM) and incorporates a multi-positive sample contrastive learning mechanism for comprehensive feature modeling. Additionally, we introduce a new strategy for cross-modal semantic alignment, which constrains the relative distances between image and text feature distributions, thereby improving the accuracy of cross-modal representations. We validate our approach through extensive experiments on four benchmark datasets, demonstrating superior performance over six state-of-the-art methods.
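A rough sketch of the distribution-modeling step, assuming precomputed image and text features: a Gaussian Mixture Model captures the multi-peak structure, its assignments can supply multiple positives per anchor, and the gap between the modalities' component statistics gives a simple alignment signal (the sorting trick below is purely illustrative, not the paper's constraint).

```python
# Minimal sketch: GMMs model the multi-peak feature distributions of each modality;
# shared-component samples can serve as extra positives for contrastive training.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
image_feats = rng.normal(size=(200, 64))        # stand-in for image embeddings
text_feats = rng.normal(size=(200, 64))         # stand-in for text embeddings

img_gmm = GaussianMixture(n_components=4, covariance_type="diag").fit(image_feats)
txt_gmm = GaussianMixture(n_components=4, covariance_type="diag").fit(text_feats)

# Samples sharing a mixture component can be treated as extra positives, and the
# gap between the two modalities' component means gives a crude alignment signal.
img_assign = img_gmm.predict(image_feats)
mean_gap = np.linalg.norm(np.sort(img_gmm.means_, axis=0) -
                          np.sort(txt_gmm.means_, axis=0))
print(img_assign[:10], round(float(mean_gap), 3))
```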
https://arxiv.org/abs/2505.13306
Vocal feedback (e.g., 'mhm', 'yeah', 'okay') is an important component of spoken dialogue and is crucial to ensuring common ground in conversational systems. The exact meaning of such feedback is conveyed through both lexical and prosodic form. In this work, we investigate the perceived prosodic similarity of vocal feedback with the same lexical form, and to what extent existing speech representations reflect such similarities. A triadic comparison task with recruited participants is used to measure the perceived similarity of feedback responses taken from two different datasets. We find that spectral and self-supervised speech representations encode prosody better than extracted pitch features, especially in the case of feedback from the same speaker. We also find that it is possible to further condense and align the representations with human perception through contrastive learning.
https://arxiv.org/abs/2505.13268
Advances in computer vision and deep learning have blurred the line between deepfakes and authentic media, undermining multimedia credibility through audio-visual forgery. Current multimodal detection methods remain limited by unbalanced learning between modalities. To tackle this issue, we propose an Audio-Visual Joint Learning Method (MACB-DF) to better mitigate modality conflicts and neglect by leveraging contrastive learning to assist in multi-level and cross-modal fusion, thereby fully balancing and exploiting information from each modality. Additionally, we designed an orthogonalization-multimodal pareto module that preserves unimodal information while addressing gradient conflicts in audio-video encoders caused by differing optimization targets of the loss functions. Extensive experiments and ablation studies conducted on mainstream deepfake datasets demonstrate consistent performance gains of our model across key evaluation metrics, achieving an average accuracy of 95.5% across multiple datasets. Notably, our method exhibits superior cross-dataset generalization capabilities, with absolute improvements of 8.0% and 7.7% in ACC scores over the previous best-performing approach when trained on DFDC and tested on DefakeAVMiT and FakeAVCeleb datasets.
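The gradient-conflict issue can be illustrated with a PCGrad-style projection, used here only as a simple stand-in for the paper's orthogonalization-multimodal Pareto module: when the gradient of one loss term opposes another, the conflicting component is projected away.

```python
# Minimal sketch of gradient-conflict mitigation by orthogonal projection
# (PCGrad-style; a stand-in, not the paper's Pareto module).
import torch

def resolve_conflict(grad_a: torch.Tensor, grad_b: torch.Tensor) -> torch.Tensor:
    """Return grad_a with any component that opposes grad_b projected away."""
    dot = torch.dot(grad_a.flatten(), grad_b.flatten())
    if dot < 0:                                           # gradients conflict
        grad_a = grad_a - dot / grad_b.norm().pow(2) * grad_b
    return grad_a

g_fusion = torch.tensor([1.0, -2.0])                      # gradient from one loss term
g_unimodal = torch.tensor([1.0, 1.0])                     # gradient from another loss term
# Result is orthogonal to g_unimodal: tensor([1.5, -1.5]).
print(resolve_conflict(g_fusion, g_unimodal))
```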
https://arxiv.org/abs/2505.12966
The increasing level of sound pollution in marine environments poses a growing threat to ocean health, making it crucial to monitor underwater noise. By monitoring this noise, the sources responsible for the pollution can be mapped. Monitoring is performed by passively listening to these sounds, which generates a large amount of data records capturing a mix of sound sources such as ship activities and marine mammal vocalizations. Although machine learning offers a promising solution for automatic sound classification, current state-of-the-art methods implement supervised learning, which requires a large amount of high-quality labeled data that is not publicly available. In contrast, a massive amount of lower-quality unlabeled data is publicly available, offering the opportunity to explore unsupervised learning techniques. This research explores this possibility by implementing an unsupervised contrastive learning approach: a Conformer-based encoder is optimized with the Variance-Invariance-Covariance Regularization (VICReg) loss function on the lower-quality unlabeled data and then transferred to the labeled data. Through classification tasks involving recognizing ship types and marine mammal vocalizations, our method is shown to produce robust and generalizable embeddings. This demonstrates the potential of unsupervised methods for various automatic underwater acoustic analysis tasks.
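A minimal sketch of the VICReg objective used to train the encoder on unlabeled recordings: an invariance (MSE) term between two views, a variance term that keeps each embedding dimension spread out, and a covariance term that decorrelates dimensions. The loss weights follow the original VICReg defaults and are an assumption here.

```python
# Minimal sketch of the Variance-Invariance-Covariance Regularization (VICReg) loss.
import torch
import torch.nn.functional as F

def vicreg_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    n, d = z1.shape
    inv = F.mse_loss(z1, z2)                                        # invariance term
    def var_term(z):
        std = torch.sqrt(z.var(dim=0) + eps)
        return F.relu(1.0 - std).mean()                             # hinge on per-dim std
    def cov_term(z):
        z = z - z.mean(dim=0)
        cov = (z.t() @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d                            # decorrelation term
    return (sim_w * inv +
            var_w * (var_term(z1) + var_term(z2)) +
            cov_w * (cov_term(z1) + cov_term(z2)))

z_a = torch.randn(64, 128)     # encoder outputs for two augmented views of the
z_b = torch.randn(64, 128)     # same underwater recording segment
print(vicreg_loss(z_a, z_b).item())
```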
https://arxiv.org/abs/2505.12904