The rapid emergence of image synthesis models poses challenges to the generalization of AI-generated image detectors. Existing methods often rely on model-specific features, leading to overfitting and poor generalization. In this paper, we introduce the Multi-Cue Aggregation Network (MCAN), a novel framework that integrates different yet complementary cues in a unified network. MCAN employs a mixture-of-encoders adapter to dynamically process these cues, enabling more adaptive and robust feature representation. Our cues include the input image itself, which represents the overall content, and high-frequency components that emphasize edge details. Additionally, we introduce a Chromatic Inconsistency (CI) cue, which normalizes intensity values and captures noise information introduced during the image acquisition process in real images, making these noise patterns more distinguishable from those in AI-generated content. Unlike prior methods, MCAN's novelty lies in its unified multi-cue aggregation framework, which integrates spatial, frequency-domain, and chromaticity-based information for enhanced representation learning. These cues are intrinsically more indicative of real images, enhancing cross-model generalization. Extensive experiments on the GenImage, Chameleon, and UniversalFakeDetect benchmarks validate the state-of-the-art performance of MCAN. On the GenImage dataset, MCAN outperforms the best state-of-the-art method by up to 7.4% in average ACC across eight different image generators.
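The three cues can be made concrete with a small numpy sketch. The filters below are illustrative guesses at the design (a box-blur high-pass for the frequency cue and an intensity-normalized chromaticity map for the CI cue), not the paper's published definitions:

```python
import numpy as np

def high_frequency_cue(img):
    """High-pass residual: the image minus a 3x3 box-blurred copy, emphasizing edges."""
    h, w = img.shape[:2]
    pad = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="edge")
    blur = np.zeros_like(img, dtype=float)
    for dy in range(3):
        for dx in range(3):
            blur += pad[dy:dy + h, dx:dx + w]
    return img - blur / 9.0

def chromatic_cue(img, eps=1e-6):
    """Divide each pixel's RGB by its channel sum, normalizing out intensity so
    that chromaticity (and acquisition-noise deviations) remains."""
    s = img.sum(axis=-1, keepdims=True) + eps
    return img / s

def extract_cues(img):
    """Return the three complementary cues fed to the mixture-of-encoders adapter."""
    img = img.astype(float)
    return {"spatial": img,
            "high_freq": high_frequency_cue(img),
            "chromatic": chromatic_cue(img)}
```

A flat image yields a zero high-frequency cue, and every chromatic pixel sums to one, which is the intensity-invariance the CI cue relies on.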
https://arxiv.org/abs/2601.08790
Generative recommendation systems have achieved significant advances by leveraging semantic IDs to represent items. However, existing approaches that tokenize each modality independently face two critical limitations: (1) redundancy across modalities that reduces efficiency, and (2) failure to capture inter-modal interactions that limits item representation. We introduce FusID, a modality-fused semantic ID framework that addresses these limitations through three key components: (i) multimodal fusion that learns unified representations by jointly encoding information across modalities, (ii) representation learning that brings frequently co-occurring item embeddings closer while maintaining distinctiveness and preventing feature redundancy, and (iii) product quantization that converts the fused continuous embeddings into multiple discrete tokens to mitigate ID conflict. Evaluated on a multimodal next-song recommendation (i.e., playlist continuation) benchmark, FusID achieves zero ID conflicts, ensuring that each token sequence maps to exactly one song, mitigates codebook underutilization, and outperforms baselines in terms of MRR and Recall@k (k = 1, 5, 10, 20).
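The third component, product quantization, is standard enough to sketch in numpy: the fused embedding is split into sub-vectors and each sub-vector is mapped to its nearest codebook centroid, giving a multi-token semantic ID per item. Codebook contents here are placeholders; the paper's learned codebooks and fusion encoder are not reproduced:

```python
import numpy as np

def product_quantize(vecs, codebooks):
    """Product quantization: split each fused embedding into M sub-vectors and
    assign each to its nearest centroid in the corresponding codebook, yielding
    an M-token discrete ID per item."""
    subs = np.split(np.asarray(vecs, dtype=float), len(codebooks), axis=-1)
    tokens = []
    for sub, book in zip(subs, codebooks):
        dists = ((sub[:, None, :] - book[None, :, :]) ** 2).sum(axis=-1)
        tokens.append(dists.argmin(axis=1))
    return np.stack(tokens, axis=1)  # shape: (n_items, n_codebooks)
```

Because each item receives a tuple of tokens rather than a single code, the combined ID space grows multiplicatively, which is what mitigates ID conflicts.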
https://arxiv.org/abs/2601.08764
We introduce a two-stage multitask learning framework for analyzing Electroencephalography (EEG) signals that integrates denoising, dynamical modeling, and representation learning. In the first stage, a denoising autoencoder is trained to suppress artifacts and stabilize temporal dynamics, providing robust signal representations. In the second stage, a multitask architecture processes these denoised signals to achieve three objectives: motor imagery classification, chaotic versus non-chaotic regime discrimination using Lyapunov exponent-based labels, and self-supervised contrastive representation learning with NT-Xent loss. A convolutional backbone combined with a Transformer encoder captures spatial-temporal structure, while the dynamical task encourages sensitivity to nonlinear brain dynamics. This staged design mitigates interference between reconstruction and discriminative goals, improves stability across datasets, and supports reproducible training by clearly separating noise reduction from higher-level feature learning. Empirical studies show that our framework not only enhances robustness and generalization but also surpasses strong baselines and recent state-of-the-art methods in EEG decoding, highlighting the effectiveness of combining denoising, dynamical features, and self-supervised learning.
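The contrastive objective named here, NT-Xent, is the standard SimCLR-style loss: each embedding treats its augmented partner as the positive and all other in-batch embeddings as negatives. A minimal numpy version (not the paper's training code):

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss over a batch of paired views z1[i] <-> z2[i]."""
    z = np.concatenate([z1, z2])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                 # exclude self-similarity
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # partner index
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

When the two views agree the loss is small; when positives are mismatched it grows, which is the gradient signal the representation stage learns from.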
https://arxiv.org/abs/2601.08549
Medical contrastive vision-language pre-training (VLP) has demonstrated significant potential in improving performance on downstream tasks. Traditional approaches typically employ contrastive learning, treating paired image-report samples as positives and unpaired ones as negatives. However, in medical datasets, there can be substantial similarities between images or reports from different patients. Rigidly treating all unpaired samples as negatives can disrupt the underlying semantic structure and negatively impact the quality of the learned representations. In this paper, we propose a multi-level alignment framework, Representation Learning with Semantic-aware Instance and Sparse Token Alignments (SISTA), by exploiting the semantic correspondence between medical images and radiology reports at two levels, i.e., the image-report and patch-word levels. Specifically, we improve conventional contrastive learning by incorporating inter-report similarity to eliminate false negatives, and introduce a method to effectively align image patches with relevant word tokens. Experimental results demonstrate the effectiveness of the proposed framework in improving transfer performance across different datasets on three downstream tasks: image classification, image segmentation, and object detection. Notably, our framework achieves significant improvements in fine-grained tasks even with limited labeled data. Codes and pre-trained models will be made available.
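The false-negative fix can be sketched as soft contrastive targets derived from inter-report similarity: instead of a one-hot label per image, reports that nearly duplicate each other share target probability mass. The temperature and weighting below are assumptions for illustration, not the exact SISTA formulation:

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def soft_targets(report_emb, tau=0.1):
    """Inter-report similarity turned into a target distribution: a report that
    nearly duplicates another keeps mass instead of being a pure negative."""
    r = report_emb / np.linalg.norm(report_emb, axis=1, keepdims=True)
    return np.exp(log_softmax(r @ r.T / tau))

def semantic_infonce(img_emb, report_emb, tau=0.1):
    """Image-to-report contrastive loss with soft, similarity-aware targets
    replacing the usual one-hot InfoNCE labels."""
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    r = report_emb / np.linalg.norm(report_emb, axis=1, keepdims=True)
    t = soft_targets(report_emb, tau)
    return -(t * log_softmax(v @ r.T / tau)).sum(axis=1).mean()
```

With two identical reports in the batch, the target mass splits evenly between them rather than penalizing one as a negative.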
https://arxiv.org/abs/2601.08165
Geometric Representation Learning (GRL) aims to approximate the non-Euclidean topology of high-dimensional data through discrete graph structures, grounded in the manifold hypothesis. However, traditional static graph construction methods based on Euclidean distance often fail to capture the intrinsic curvature characteristics of the data manifold. Although Ollivier-Ricci Curvature Flow (OCF) has proven to be a powerful tool for dynamic topological optimization, its core reliance on Optimal Transport (Wasserstein distance) leads to prohibitive computational complexity, severely limiting its application in large-scale datasets and deep learning frameworks. To break this bottleneck, this paper proposes a novel geometric evolution framework: Resistance Curvature Flow (RCF). Leveraging the concept of effective resistance from circuit physics, RCF transforms expensive curvature optimization into efficient matrix operations. This approach achieves over 100x computational acceleration while maintaining geometric optimization capabilities comparable to OCF. We provide an in-depth exploration of the theoretical foundations and dynamical principles of RCF, elucidating how it guides the redistribution of edge weights via curvature gradients to eliminate topological noise and strengthen local cluster structures. Furthermore, we provide a mechanistic explanation of RCF's role in manifold enhancement and noise suppression, as well as its compatibility with deep learning models. We design a graph optimization algorithm, DGSL-RCF, based on this framework. Experimental results across deep metric learning, manifold learning, and graph structure learning demonstrate that DGSL-RCF significantly improves representation quality and downstream task performance.
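The core trick, replacing optimal transport with effective resistance, reduces to pseudoinverse algebra on the graph Laplacian. The flow rule below uses kappa = 1 - w_ij * R_ij as a stand-in curvature and renormalizes total weight each step; it is a sketch of the idea only, not the paper's DGSL-RCF algorithm:

```python
import numpy as np

def effective_resistance(W):
    """All-pairs effective resistance from the Moore-Penrose pseudoinverse of
    the graph Laplacian -- one matrix operation instead of optimal transport."""
    L = np.diag(W.sum(axis=1)) - W
    Lp = np.linalg.pinv(L)
    d = np.diag(Lp)
    return d[:, None] + d[None, :] - 2.0 * Lp

def resistance_curvature_flow(W, steps=5, lr=0.2):
    """Illustrative flow step: grow positively curved (intra-cluster) edges,
    leave bridge-like edges behind, then renormalize total weight."""
    total = W.sum()
    for _ in range(steps):
        R = effective_resistance(W)
        kappa = np.where(W > 0, 1.0 - W * R, 0.0)   # resistance-curvature proxy
        W = np.clip(np.where(W > 0, W * (1.0 + lr * kappa), 0.0), 0.0, None)
        W *= total / W.sum()
    return W
```

On two triangles joined by a bridge, the bridge edge (a cut edge, so w*R = 1 and kappa = 0) loses relative weight while intra-cluster edges gain it, which is the cluster-strengthening, noise-eliminating behavior described above.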
https://arxiv.org/abs/2601.08149
Recent works such as REPA have shown that guiding diffusion models with external semantic features (e.g., DINO) can significantly accelerate the training of diffusion transformers (DiTs). However, this requires the use of pretrained external networks, introducing additional dependencies and reducing flexibility. In this work, we argue that DiTs actually have the power to guide their own training, and propose \textbf{Self-Transcendence}, a simple yet effective method that achieves fast convergence using internal feature supervision only. We find that the slow convergence in DiT training primarily stems from the difficulty of representation learning in shallow layers. To address this, we initially train the DiT model by aligning its shallow features with the latent representations from the pretrained VAE for a short phase (e.g., 40 epochs), then apply classifier-free guidance to the intermediate features, enhancing their discriminative capability and semantic expressiveness. These enriched internal features, learned entirely within the model, are used as supervision signals to guide a new DiT training. Compared to existing self-contained methods, our approach brings a significant performance boost. It can even surpass REPA in terms of generation quality and convergence speed, but without the need for any external pretrained models. Our method is not only more flexible for different backbones but also has the potential to be adopted for a wider range of diffusion-based generative tasks. The source code of our method can be found at this https URL.
https://arxiv.org/abs/2601.07773
In recent years, self-supervised representation learning for skeleton-based action recognition has advanced with the development of contrastive learning methods. However, most contrastive paradigms are inherently discriminative and often struggle to capture the variability and uncertainty intrinsic to human motion. To address this issue, we propose a variational contrastive learning framework that integrates probabilistic latent modeling with contrastive self-supervised learning. This formulation enables the learning of structured and semantically meaningful representations that generalize across different datasets and supervision levels. Extensive experiments on three widely used skeleton-based action recognition benchmarks show that our proposed method consistently outperforms existing approaches, particularly in low-label regimes. Moreover, qualitative analyses show that, compared to other methods, the features produced by our method are more relevant to the motion and sample characteristics, with more focus on important skeleton joints.
https://arxiv.org/abs/2601.07666
Reliable learning on low-quality multimodal data is a widespread concern, especially in safety-critical applications. However, multimodal noise poses a major challenge in this domain and leads existing methods to suffer from two key limitations. First, they struggle to reliably remove heterogeneous data noise, hindering robust multimodal representation learning. Second, they exhibit limited adaptability and generalization when encountering previously unseen noise. To address these issues, we propose the Test-time Adaptive Hierarchical Co-enhanced Denoising Network (TAHCD). On one hand, TAHCD introduces Adaptive Stable Subspace Alignment and Sample-Adaptive Confidence Alignment to reliably remove heterogeneous noise. These components account for noise at both the global and instance levels and enable joint removal of modality-specific and cross-modality noise, achieving robust learning. On the other hand, TAHCD introduces test-time cooperative enhancement, which adaptively updates the model in response to input noise in a label-free manner, improving adaptability and generalization. This is achieved by collaboratively strengthening the joint removal of modality-specific and cross-modality noise at the global and instance levels, according to the noise present in each sample. Experiments on multiple benchmarks demonstrate that the proposed method achieves superior classification performance, robustness, and generalization compared with state-of-the-art reliable multimodal learning approaches.
https://arxiv.org/abs/2601.07163
Psychological research has long utilized circumplex models to structure emotions, placing similar emotions adjacently and opposing ones diagonally. Although frequently used to interpret deep learning representations, these models are rarely directly incorporated into the representation learning of language models, leaving their geometric validity unexplored. This paper proposes a method to induce circular emotion representations within language model embeddings via contrastive learning on a hypersphere. We show that while this circular alignment offers superior interpretability and robustness against dimensionality reduction, it underperforms compared to conventional designs in high-dimensional settings and fine-grained classification. Our findings elucidate the trade-offs involved in applying psychological circumplex models to deep learning architectures.
https://arxiv.org/abs/2601.06575
Current Retrieval-Augmented Generation (RAG) systems typically employ a traditional two-stage pipeline: an embedding model for initial retrieval followed by a reranker for refinement. However, this paradigm suffers from significant inefficiency due to the lack of shared information between stages, leading to substantial redundant computation. To address this limitation, we propose \textbf{State-Centric Retrieval}, a unified retrieval paradigm that utilizes "states" as a bridge to connect embedding models and rerankers. First, we perform state representation learning by fine-tuning an RWKV-based LLM, transforming it into \textbf{EmbeddingRWKV}, a unified model that serves as both an embedding model and a state backbone for extracting compact, reusable states. Building upon these reusable states, we further design a state-based reranker to fully leverage precomputed information. During reranking, the model processes only query tokens, decoupling inference cost from document length and yielding a 5.4$\times$--44.8$\times$ speedup. Furthermore, we observe that retaining all intermediate layer states is unnecessary; with a uniform layer selection strategy, our model maintains 98.62\% of full-model performance using only 25\% of the layers. Extensive experiments demonstrate that State-Centric Retrieval achieves high-quality retrieval and reranking results while significantly enhancing overall system efficiency. Code is available at \href{this https URL}{our GitHub repository}.
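Two of the efficiency claims reduce to simple arithmetic: evenly spaced layer retention (25% of layers), and a per-pair cost model in which the state-based reranker processes only query tokens while a cross-encoder re-reads document plus query. Both functions below are illustrative models of those claims, not the paper's measured pipeline:

```python
def uniform_layer_selection(n_layers, keep_ratio=0.25):
    """Keep an evenly spaced subset of layer states (the last layer of each
    group), matching the observation that ~25% of layers suffice."""
    step = max(1, round(1 / keep_ratio))
    return list(range(step - 1, n_layers, step))

def rerank_speedup(doc_tokens, query_tokens):
    """Rough per-pair cost ratio: a cross-encoder touches doc + query tokens,
    a state-based reranker only the query tokens."""
    return (doc_tokens + query_tokens) / query_tokens
```

With 1000-token documents and 50-token queries this naive model already gives a 21x ratio, in the same regime as the reported 5.4x to 44.8x range.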
https://arxiv.org/abs/2601.07861
We release Pantagruel models, a new family of self-supervised encoder models for French text and speech. Instead of predicting modality-tailored targets such as textual tokens or speech units, Pantagruel learns contextualized target representations in the feature space, allowing modality-specific encoders to capture linguistic and acoustic regularities more effectively. Separate models are pre-trained on large-scale French corpora, including Wikipedia, OSCAR and CroissantLLM for text, together with MultilingualLibriSpeech, LeBenchmark, and INA-100k for speech. INA-100k is a newly introduced 100,000-hour corpus of French audio derived from the archives of the Institut National de l'Audiovisuel (INA), the national repository of French radio and television broadcasts, providing highly diverse audio data. We evaluate Pantagruel across a broad range of downstream tasks spanning both modalities, including those from the standard French benchmarks such as FLUE or LeBenchmark. Across these tasks, Pantagruel models show competitive or superior performance compared to strong French baselines such as CamemBERT, FlauBERT, and LeBenchmark2.0, while maintaining a shared architecture that can seamlessly handle either speech or text inputs. These results confirm the effectiveness of feature-space self-supervised objectives for French representation learning and highlight Pantagruel as a robust foundation for multimodal speech-text understanding.
https://arxiv.org/abs/2601.05911
Latent Diffusion Models (LDMs) generate high-quality images by operating in a compressed latent space, typically obtained through image tokenizers such as Variational Autoencoders (VAEs). In pursuit of a generation-friendly VAE, recent studies have explored leveraging Vision Foundation Models (VFMs) as representation alignment targets for VAEs, mirroring the approach commonly adopted for LDMs. Although this yields certain performance gains, using the same alignment target for both VAEs and LDMs overlooks their fundamentally different representational requirements. We advocate that while LDMs benefit from latents retaining high-level semantic concepts, VAEs should excel in semantic disentanglement, enabling encoding of attribute-level information in a structured way. To address this, we propose the Semantic disentangled VAE (Send-VAE), explicitly optimized for disentangled representation learning through aligning its latent space with the semantic hierarchy of pre-trained VFMs. Our approach employs a non-linear mapper network to transform VAE latents, aligning them with VFMs to bridge the gap between attribute-level disentanglement and high-level semantics, facilitating effective guidance for VAE learning. We evaluate semantic disentanglement via linear probing on attribute prediction tasks, showing strong correlation with improved generation performance. Finally, using Send-VAE, we train flow-based transformers SiTs; experiments show Send-VAE significantly speeds up training and achieves a state-of-the-art FID of 1.21 and 1.75 with and without classifier-free guidance on ImageNet 256x256.
https://arxiv.org/abs/2601.05823
Multi-view multi-label learning frequently suffers from simultaneous feature absence and incomplete annotations, due to challenges in data acquisition and cost-intensive supervision. To tackle the complex yet highly practical problem while overcoming the existing limitations of feature recovery, representation disentanglement, and label semantics modeling, we propose an Adaptive Disentangled Representation Learning method (ADRL). ADRL achieves robust view completion by propagating feature-level affinity across modalities with neighborhood awareness, and reinforces reconstruction effectiveness by leveraging a stochastic masking strategy. Through disseminating category-level association across label distributions, ADRL refines distribution parameters for capturing interdependent label prototypes. Besides, we formulate a mutual-information-based objective to promote consistency among shared representations and suppress information overlap between view-specific representation and other modalities. Theoretically, we derive the tractable bounds to train the dual-channel network. Moreover, ADRL performs prototype-specific feature selection by enabling independent interactions between label embeddings and view representations, accompanied by the generation of pseudo-labels for each category. The structural characteristics of the pseudo-label space are then exploited to guide a discriminative trade-off during view fusion. Finally, extensive experiments on public datasets and real-world applications demonstrate the superior performance of ADRL.
https://arxiv.org/abs/2601.05785
Recent end-to-end autonomous driving approaches have leveraged Vision-Language Models (VLMs) to enhance planning capabilities in complex driving scenarios. However, VLMs are inherently trained as generalist models, lacking specialized understanding of driving-specific reasoning in 3D space and time. When applied to autonomous driving, these models struggle to establish structured spatial-temporal representations that capture geometric relationships, scene context, and motion patterns critical for safe trajectory planning. To address these limitations, we propose SGDrive, a novel framework that explicitly structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition: drivers first perceive the overall environment (scene context), then attend to safety-critical agents and their behaviors, and finally formulate short-term goals before executing actions. This hierarchical decomposition provides the structured spatial-temporal representation that generalist VLMs lack, integrating multi-level information into a compact yet comprehensive format for trajectory planning. Extensive experiments on the NAVSIM benchmark demonstrate that SGDrive achieves state-of-the-art performance among camera-only methods on both PDMS and EPDMS, validating the effectiveness of hierarchical knowledge structuring for adapting generalist VLMs to autonomous driving.
https://arxiv.org/abs/2601.05640
Grokking is a puzzling phenomenon in neural networks where full generalization occurs only after a substantial delay following the complete memorization of the training data. Previous research has linked this delayed generalization to representation learning driven by weight decay, but the precise underlying dynamics remain elusive. In this paper, we argue that post-memorization learning can be understood through the lens of constrained optimization: gradient descent effectively minimizes the weight norm on the zero-loss manifold. We formally prove this in the limit of infinitesimally small learning rates and weight decay coefficients. To further dissect this regime, we introduce an approximation that decouples the learning dynamics of a subset of parameters from the rest of the network. Applying this framework, we derive a closed-form expression for the post-memorization dynamics of the first layer in a two-layer network. Experiments confirm that simulating the training process using our predicted gradients reproduces both the delayed generalization and representation learning characteristic of grokking.
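The constrained-optimization claim can be checked on a linear toy model: with a small learning rate and weight decay, gradient descent keeps the iterate near the zero-loss manifold while draining its null-space (excess-norm) component, drifting toward the minimum-norm interpolant. This is an illustration of the proved limit on an underdetermined least-squares problem, not the paper's network experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))            # 5 samples, 20 parameters: many zero-loss w
y = rng.normal(size=5)

w_min = np.linalg.pinv(X) @ y           # minimum-norm zero-loss solution
null_dir = np.linalg.svd(X)[2][-1]      # direction with X @ null_dir ~ 0
w = w_min + 3.0 * null_dir              # memorizes the data, but with excess norm
w0_norm = np.linalg.norm(w)

lr, wd = 0.02, 5e-3                     # small step size and decay coefficient
for _ in range(50_000):
    # data-fit gradient pins w to the manifold; decay shrinks the null component
    w -= lr * (X.T @ (X @ w - y) + wd * w)
```

The fit stays essentially exact throughout, yet the weight norm falls from its inflated starting value to roughly that of the minimum-norm solution, reproducing the "minimize the norm on the zero-loss manifold" dynamics in miniature.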
https://arxiv.org/abs/2511.01938
Recently, Quantum Visual Fields (QVFs) have shown promising improvements in model compactness and convergence speed for learning the provided 2D or 3D signals. Meanwhile, novel-view synthesis has seen major advances with Neural Radiance Fields (NeRFs), where models learn a compact representation from 2D images to render 3D scenes, albeit at the cost of larger models and intensive training. In this work, we extend the approach of QVFs by introducing QNeRF, the first hybrid quantum-classical model designed for novel-view synthesis from 2D images. QNeRF leverages parameterised quantum circuits to encode spatial and view-dependent information via quantum superposition and entanglement, resulting in more compact models compared to the classical counterpart. We present two architectural variants. Full QNeRF maximally exploits all quantum amplitudes to enhance representational capabilities. In contrast, Dual-Branch QNeRF introduces a task-informed inductive bias by branching spatial and view-dependent quantum state preparations, drastically reducing the complexity of this operation and ensuring scalability and potential hardware compatibility. Our experiments demonstrate that -- when trained on images of moderate resolution -- QNeRF matches or outperforms classical NeRF baselines while using less than half the number of parameters. These results suggest that quantum machine learning can serve as a competitive alternative for continuous signal representation in mid-level tasks in computer vision, such as 3D representation learning from 2D observations.
https://arxiv.org/abs/2601.05250
Visual question answering for crop disease analysis requires accurate visual understanding and reliable language generation. This work presents a lightweight vision-language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence-to-sequence language decoders. A two-stage training strategy is adopted to improve visual representation learning and cross-modal alignment. The model is evaluated on a large-scale crop disease dataset using classification and natural language generation metrics. Experimental results show high accuracy for both crop and disease identification. The framework also achieves strong performance on BLEU, ROUGE and BERTScore. Our proposed models outperform large-scale vision-language baselines while using significantly fewer parameters. Explainability is assessed using Grad-CAM and token-level attribution. Qualitative results demonstrate robust performance under diverse user-driven queries. These findings highlight the effectiveness of task-specific visual pretraining for crop disease visual question answering.
https://arxiv.org/abs/2601.05143
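Since the framework's generation quality is scored with BLEU among other metrics, a minimal sketch of unigram BLEU (BLEU-1) with the standard brevity penalty; the example sentences are invented for illustration, and this is not the paper's evaluation code:

```python
from collections import Counter
import math

def bleu1(candidate, reference):
    """Clipped unigram precision times the brevity penalty
    exp(1 - |ref|/|cand|) when the candidate is shorter."""
    cand, ref = candidate.split(), reference.split()
    overlap = sum((Counter(cand) & Counter(ref)).values())  # clipped matches
    precision = overlap / len(cand)
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(round(bleu1("leaf shows early blight symptoms",
                  "the leaf shows early blight symptoms"), 3))  # 0.819
```

Full corpus BLEU additionally combines n-gram precisions up to 4-grams as a geometric mean; the clipping and brevity penalty shown here are the same.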
In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in $\textbf{2B}$ and $\textbf{8B}$ parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of $\textbf{77.8}$ on MMEB-V2, ranking first among all models (as of January 8, 2025). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.
在这份报告中,我们介绍了Qwen3-VL-Embedding和Qwen3-VL-Reranker模型系列,这是基于Qwen3-VL基础模型构建的Qwen家族最新的扩展。这两个模型系列共同提供了一条端到端的高精度多模态搜索管道,通过将包括文本、图像、文档图像以及视频在内的多种模态映射到统一的表示空间来实现这一点。 Qwen3-VL-Embedding模型采用了一个多阶段训练范式,从大规模对比预训练进展到重排序模型蒸馏,生成语义丰富的高维向量。该模型支持Matryoshka表示学习,允许灵活的嵌入维度,并能够处理多达32k标记的输入。而Qwen3-VL-Reranker则使用带有跨注意力机制的交叉编码器架构来为查询-文档对进行精细化的相关性评估。 这两个模型系列继承了Qwen3-VL的多语言能力,支持超过30种语言,并以2B和8B参数大小发布,以适应多样化的部署需求。实证研究表明,Qwen3-VL-Embedding系列在各种多模态嵌入评估基准上取得了最先进的结果。特别地,Qwen3-VL-Embedding-8B在MMEB-V2上的综合得分为77.8分,在所有模型中排名第一(截至2025年1月8日)。 该报告详细介绍了这两个系列的架构、训练方法及其实际能力,并展示了它们在包括图像-文本检索、视觉问答和视频-文本匹配在内的多种多模态检索任务中的有效性。
https://arxiv.org/abs/2601.04720
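Matryoshka Representation Learning trains embeddings so that prefixes of the vector remain usable on their own; a minimal sketch of the inference-side trick (truncate to the desired dimension, then L2-renormalize before cosine scoring), with seeded random vectors standing in for real model outputs:

```python
import numpy as np

def truncate_embed(v, dim):
    """Matryoshka-style use of an embedding: keep the first `dim`
    coordinates, then L2-renormalize so cosine scores stay comparable."""
    head = np.asarray(v, dtype=float)[:dim]
    return head / np.linalg.norm(head)

rng = np.random.default_rng(0)                     # stand-in "model outputs"
a, b = rng.normal(size=256), rng.normal(size=256)
full = float(truncate_embed(a, 256) @ truncate_embed(b, 256))
short = float(truncate_embed(a, 64) @ truncate_embed(b, 64))
print(round(full, 3), round(short, 3))             # two cosine scores in [-1, 1]
```

The practical payoff is that one stored vector serves multiple retrieval budgets: a short prefix for a cheap first pass, the full dimension for precise scoring.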
Contact-rich manipulation requires reliable estimation of extrinsic contacts: the interactions between a grasped object and its environment that provide essential contextual information for planning, control, and policy learning. However, existing approaches often rely on restrictive assumptions, such as predefined contact types, fixed grasp configurations, or camera calibration, that hinder generalization to novel objects and deployment in unstructured environments. In this paper, we present UNIC, a unified multimodal framework for extrinsic contact estimation that operates without any prior knowledge or camera calibration. UNIC directly encodes visual observations in the camera frame and integrates them with proprioceptive and tactile modalities in a fully data-driven manner. It introduces a unified contact representation based on scene affordance maps that captures diverse contact formations and employs a multimodal fusion mechanism with random masking, enabling robust multimodal representation learning. Extensive experiments demonstrate that UNIC performs reliably. It achieves a 9.6 mm average Chamfer distance error on unseen contact locations, performs well on unseen objects, remains robust under missing modalities, and adapts to dynamic camera viewpoints. These results establish extrinsic contact estimation as a practical and versatile capability for contact-rich manipulation.
接触丰富的操作需要可靠地估计外部接触——即抓取物体与其环境之间的相互作用,这些互动为规划、控制和策略学习提供了重要的上下文信息。然而,现有的方法往往依赖于严格的假设,例如预定义的接触类型、固定的抓握配置或相机校准,这限制了其在面对新颖对象时的一般化能力,并阻碍了其在非结构化环境中的部署。在这篇论文中,我们提出了UNIC,这是一种统一的多模态框架,用于外部接触估计,它无需任何先验知识或相机校准即可运行。 UNIC直接编码摄像机帧内的视觉观察结果,并以完全数据驱动的方式将其与本体感觉和触觉模态进行整合。它引入了一种基于场景可供性图(scene affordance maps)的统一接触表示,能够捕捉各种各样的接触形式,并采用带随机掩码的多模态融合机制,从而实现稳健的多模态表示学习。 大量的实验表明,UNIC运行可靠:在未见过的接触位置上实现了9.6毫米的平均倒角距离(Chamfer distance)误差;在处理未见过的对象时表现良好;即使缺失某些模态仍然保持稳定性能;并且能够适应动态相机视角的变化。这些结果确立了外部接触估计作为接触丰富的操作的一种实用且多功能的能力。
https://arxiv.org/abs/2601.04356
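The 9.6 mm figure above is a Chamfer distance between predicted and ground-truth contact point sets; a minimal sketch of one common symmetric formulation (mean nearest-neighbour distance in each direction, summed) on toy 3D points; the point sets are illustrative:

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets of shape (N,3), (M,3):
    mean nearest-neighbour distance from a to b plus from b to a."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

pred = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
true = [[0.0, 0.0, 0.1], [1.0, 0.0, 0.0]]
print(round(chamfer(pred, true), 3))  # 0.1
```

Conventions vary: some papers average the two directional terms or use squared distances, so reported numbers are only comparable under the same definition.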
Job scheduling is widely used in real-world manufacturing systems to assign ordered job operations to machines under various constraints. Existing solutions remain limited by long running time or insufficient schedule quality, especially when problem scale increases. In this paper, we propose ReLA, a reinforcement-learning (RL) scheduler built on structured representation learning and aggregation. ReLA first learns diverse representations from scheduling entities, including job operations and machines, using two intra-entity learning modules (self-attention and convolution) and one inter-entity learning module (cross-attention). These modules are applied in a multi-scale architecture, and their outputs are aggregated to support RL decision-making. Across experiments on small, medium, and large job instances, ReLA achieves the best makespan in most tested settings over the latest solutions. On non-large instances, ReLA reduces the optimality gap of the SOTA baseline by 13.0%, while on large-scale instances it reduces the gap by 78.6%, with the average optimality gaps lowered to 7.3% and 2.1%, respectively. These results confirm that ReLA's learned representations and aggregation provide strong decision support for RL scheduling, and enable fast job completion and decision-making for real-world applications.
工作调度在实际制造系统中广泛用于将有序的工作操作分配给机器,且需要考虑各种约束条件。现有的解决方案因运行时间长或调度质量不足而受限,尤其是在问题规模增大时。本文提出了ReLA(基于结构化表示学习和聚合的强化学习调度器)。ReLA首先使用两个带有自注意力和卷积的内部实体学习模块以及一个带有跨注意力的实体间学习模块从调度实体(包括作业操作和机器)中学习多样化的表示。这些模块在一个多尺度架构中应用,其输出被汇总以支持RL决策制定。 在针对小、中、大规模工作实例进行的各种实验中,在大多数测试设置下,ReLA优于最新的解决方案,实现了最佳的完工时间。对于非大规模实例,与最先进的基线相比,ReLA将最优性差距减少了13.0%,而对于大规模实例则减少了78.6%;平均最优性差距分别降低到了7.3%和2.1%。这些结果证实了ReLA学习到的表示及其聚合为RL调度提供了强大的决策支持,并且能够加速实际应用中的作业完成和决策制定。
https://arxiv.org/abs/2601.03646
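The optimality gaps above are relative deviations of a schedule's makespan from a best-known or optimal makespan; a minimal sketch with illustrative numbers (not instances from the paper):

```python
def optimality_gap(makespan, best_known):
    """Relative gap to a best-known/optimal makespan, as a percentage."""
    return 100.0 * (makespan - best_known) / best_known

# e.g. a schedule finishing at time 1073 against a best-known makespan of 1000
print(round(optimality_gap(1073, 1000), 1))  # 7.3
```

A gap of 0% means the scheduler matched the best-known makespan; "reducing the gap by 78.6%" compares two such percentages across methods rather than raw makespans.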