Machine learning approaches to spatiotemporal physical systems have primarily focused on next-frame prediction, with the goal of learning an accurate emulator for the system's evolution in time. However, these emulators are computationally expensive to train and are subject to performance pitfalls, such as compounding errors during autoregressive rollout. In this work, we take a different perspective and look at scientific tasks further downstream of predicting the next frame, such as estimation of a system's governing physical parameters. Accuracy on these tasks offers a uniquely quantifiable glimpse into the physical relevance of the representations of these models. We evaluate the effectiveness of general-purpose self-supervised methods in learning physics-grounded representations that are useful for downstream scientific tasks. Surprisingly, we find that not all methods designed for physical modeling outperform generic self-supervised learning methods on these tasks, and methods that learn in the latent space (e.g., joint embedding predictive architectures, or JEPAs) outperform those optimizing pixel-level prediction objectives. Code is available at this https URL.
https://arxiv.org/abs/2603.13227
We propose glaucoma lesion evaluation and analysis with multimodal imaging (GLEAM), the first publicly available tri-modal glaucoma dataset, comprising scanning laser ophthalmoscopy fundus images, circumpapillary OCT images, and visual field pattern deviation maps annotated with four disease stages. GLEAM enables effective exploitation of complementary multimodal information, facilitating accurate diagnosis and treatment across disease stages. To integrate cross-modal information effectively, we propose hierarchical attentive masked modeling (HAMM) for multimodal glaucoma classification. Our framework pairs hierarchical attentive encoders with lightweight decoders, concentrating cross-modal representation learning in the encoder.
https://arxiv.org/abs/2603.12800
We propose TerraFlow, a novel approach to multimodal, multitemporal learning for Earth observation. TerraFlow builds on temporal training objectives that enable sequence-aware learning across space, time, and modality, while remaining robust to the variable-length inputs commonly encountered in real-world Earth observation data. Our experiments demonstrate the superiority of TerraFlow over state-of-the-art foundation models for Earth observation across all temporal tasks of the GEO-Bench-2 benchmark. We additionally demonstrate that TerraFlow takes initial steps toward deep-learning-based risk map prediction for natural disasters -- a task on which other state-of-the-art foundation models frequently collapse. TerraFlow outperforms state-of-the-art foundation models by up to 50% in F1 score and 24% in Brier score.
https://arxiv.org/abs/2603.12762
This paper studies unsupervised cross-domain image retrieval (UCDIR), which aims to retrieve images of the same category across different domains without relying on labeled data. Existing methods typically utilize pseudo-labels, derived from clustering algorithms, as supervisory signals for intra-domain representation learning and cross-domain feature alignment. However, these discrete pseudo-labels often fail to provide accurate and comprehensive semantic guidance. Moreover, the alignment process frequently overlooks the entanglement between domain-specific and semantic information, leading to semantic degradation in the learned representations and ultimately impairing retrieval performance. This paper addresses these limitations by proposing a Text-Phase Synergy Network with Dual Priors (TPSNet). Specifically, we first employ CLIP to generate a set of class-specific prompts per domain, termed domain prompts, which serve as a text prior offering more precise semantic supervision. In parallel, we introduce a phase prior, represented by domain-invariant phase features, which is integrated into the original image representations to bridge domain distribution gaps while preserving semantic integrity. Leveraging the synergy of these dual priors, TPSNet significantly outperforms state-of-the-art methods on UCDIR benchmarks.
https://arxiv.org/abs/2603.12711
Connections between statistical mechanics and machine learning have repeatedly proven fruitful, providing insight into optimization, generalization, and representation learning. In this work, we follow this tradition by leveraging results from non-equilibrium thermodynamics to formalize curriculum learning in reinforcement learning (RL). In particular, we propose a geometric framework for RL by interpreting reward parameters as coordinates on a task manifold. We show that, by minimizing the excess thermodynamic work, optimal curricula correspond to geodesics in this task space. As an application of this framework, we provide an algorithm, "MEW" (Minimum Excess Work), to derive a principled schedule for temperature annealing in maximum-entropy RL.
https://arxiv.org/abs/2603.12324
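As a toy illustration of the MEW idea, the sketch below discretizes a geodesic on a one-dimensional task manifold: it places annealing temperatures so that each step covers equal arc length under a scalar metric. The metric g(T) = 1/T^2 and all function names here are illustrative assumptions, not the paper's implementation.

```python
import math

def geodesic_schedule(metric, t_start, t_end, n_steps, n_grid=10_000):
    """Place n_steps+1 temperatures between t_start and t_end so that each
    step covers equal arc length under ds = sqrt(metric(t)) |dt|, i.e. a
    discretized geodesic on a 1-D task manifold."""
    ts = [t_start + (t_end - t_start) * i / n_grid for i in range(n_grid + 1)]
    speeds = [math.sqrt(metric(t)) for t in ts]
    # Cumulative arc length on the fine grid (trapezoidal rule).
    arc = [0.0]
    for i in range(n_grid):
        arc.append(arc[-1] + 0.5 * (speeds[i] + speeds[i + 1]) * abs(ts[i + 1] - ts[i]))
    total = arc[-1]
    # Invert the arc-length function at equally spaced targets.
    schedule, j = [], 0
    for k in range(n_steps + 1):
        target = total * k / n_steps
        while j < n_grid - 1 and arc[j + 1] < target:
            j += 1
        span = arc[j + 1] - arc[j]
        frac = 0.0 if span == 0.0 else (target - arc[j]) / span
        schedule.append(ts[j] + frac * (ts[j + 1] - ts[j]))
    return schedule

# A thermodynamic-length-style metric g(T) = 1/T^2 yields a log-uniform
# annealing schedule from T=10 down to T=0.1.
sched = geodesic_schedule(lambda t: 1.0 / t**2, t_start=10.0, t_end=0.1, n_steps=5)
```

With this particular metric, equal arc length corresponds to equal steps in log T, recovering the familiar geometric annealing schedule; other metrics would bend the schedule accordingly.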
Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), our model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache. We pre-train OmniStream using a synergistic multi-task framework coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment on 29 datasets. Extensive evaluations show that, even with a strictly frozen backbone, OmniStream achieves consistently competitive performance with specialized experts across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, as well as robotic manipulation (unseen at training). Rather than pursuing benchmark-specific dominance, our work demonstrates the viability of training a single, versatile vision backbone that generalizes across semantic, spatial, and temporal reasoning, i.e., a more meaningful step toward general-purpose visual understanding for interactive and embodied agents.
https://arxiv.org/abs/2603.12265
During music listening, cortical activity encodes both acoustic and expectation-related information. Prior work has shown that ANN representations resemble cortical representations and can serve as supervisory signals for EEG recognition. Here we show that distinguishing acoustic and expectation-related ANN representations as teacher targets improves EEG-based music identification. Models pretrained to predict either representation outperform non-pretrained baselines, and combining them yields complementary gains that exceed strong seed ensembles formed by varying random initializations. These findings show that teacher representation type shapes downstream performance and that representation learning can be guided by neural encoding. This work points toward advances in predictive music cognition and neural decoding. Our expectation representation, computed directly from raw signals without manual labels, reflects predictive structure beyond onset or pitch, enabling investigation of multilayer predictive encoding across diverse stimuli. Its scalability to large, diverse datasets further suggests potential for developing general-purpose EEG models grounded in cortical encoding principles.
https://arxiv.org/abs/2603.03190
Emotion recognition in in-the-wild video data remains a challenging problem due to large variations in facial appearance, head pose, illumination, background noise, and the inherently dynamic nature of human affect. Relying on a single modality, such as facial expressions or speech, is often insufficient to capture these complex emotional cues. To address this issue, we propose a multimodal emotion recognition framework for the Expression (EXPR) Recognition task in the 10th Affective Behavior Analysis in-the-wild (ABAW) Challenge. Our approach leverages large-scale pre-trained models, namely CLIP for visual encoding and Wav2Vec 2.0 for audio representation learning, as frozen backbone networks. To model temporal dependencies in facial expression sequences, we employ a Temporal Convolutional Network (TCN) over fixed-length video windows. In addition, we introduce a bi-directional cross-attention fusion module, in which visual and audio features interact symmetrically to enhance cross-modal contextualization and capture complementary emotional information. A lightweight classification head is then used for final emotion prediction. We further incorporate a text-guided contrastive objective based on CLIP text features to encourage semantically aligned visual representations. Experimental results on the ABAW 10th EXPR benchmark show that the proposed framework provides a strong multimodal baseline and achieves improved performance over unimodal modeling. These results demonstrate the effectiveness of combining temporal visual modeling, audio representation learning, and cross-modal fusion for robust emotion recognition in unconstrained real-world environments.
https://arxiv.org/abs/2603.11971
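The bi-directional cross-attention fusion described above can be sketched in miniature. The snippet below is a hedged, dependency-free illustration (plain lists instead of tensors, a single head, no learned projections), not the authors' implementation: each modality's tokens query the other's, and the attended features are added back residually.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(queries, keys, values):
    """Scaled dot-product attention: each query gathers a weighted mix of values."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

def bidirectional_fusion(vis, aud):
    """Symmetric cross-attention sketch: visual tokens query audio tokens and
    vice versa; each stream keeps a residual connection to its own features."""
    vis2aud = attend(vis, aud, aud)   # visual queries, audio keys/values
    aud2vis = attend(aud, vis, vis)   # audio queries, visual keys/values
    fused_vis = [[a + b for a, b in zip(x, y)] for x, y in zip(vis, vis2aud)]
    fused_aud = [[a + b for a, b in zip(x, y)] for x, y in zip(aud, aud2vis)]
    return fused_vis, fused_aud

vis = [[1.0, 0.0], [0.0, 1.0]]   # two visual tokens (d=2)
aud = [[1.0, 1.0]]               # one audio token
fused_vis, fused_aud = bidirectional_fusion(vis, aud)
```

In the real framework the attended features would pass through learned projections and a classification head; the symmetry (both directions computed, neither modality privileged) is the point being illustrated.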
Depression is a severe mental disorder, and reliable identification plays a critical role in early intervention and treatment. Multimodal depression detection aims to improve diagnostic performance by jointly modeling complementary information from multiple modalities. Recently, numerous multimodal learning approaches have been proposed for depression analysis; however, these methods suffer from the following limitations: 1) inter-modal inconsistency and depression-unrelated interference, where depression-related cues may conflict across modalities while substantial irrelevant content obscures critical depressive signals, and 2) diverse individual depressive presentations, leading to individual differences in modality and cue importance that hinder reliable fusion. To address these issues, we propose Individual-aware Multimodal Depression-related Representation Learning Framework (IDRL) for robust depression diagnosis. Specifically, IDRL 1) disentangles multimodal representations into a modality-common depression space, a modality-specific depression space, and a depression-unrelated space to enhance modality alignment while suppressing irrelevant information, and 2) introduces an individual-aware modality-fusion module (IAF) that dynamically adjusts the weights of disentangled depression-related features based on their predictive significance, thereby achieving adaptive cross-modal fusion for different individuals. Extensive experiments demonstrate that IDRL achieves superior and robust performance for multimodal depression detection.
https://arxiv.org/abs/2603.11644
End-to-end autonomous driving models are typically trained on multi-city datasets using supervised ImageNet-pretrained backbones, yet their ability to generalize to unseen cities remains largely unexamined. When training and evaluation data are geographically mixed, models may implicitly rely on city-specific cues, masking failure modes that would occur under real domain shifts when generalizing to new locations. In this work we investigate zero-shot cross-city generalization in end-to-end trajectory planning and ask whether self-supervised visual representations improve transfer across cities. We conduct a comprehensive study by integrating self-supervised backbones (I-JEPA, DINOv2, and MAE) into planning frameworks. We evaluate performance under strict geographic splits on nuScenes in the open-loop setting and on NAVSIM in the closed-loop evaluation protocol. Our experiments reveal a substantial generalization gap when transferring models relying on traditional supervised backbones across cities with different road topologies and driving conventions, particularly when transferring from right-side to left-side driving environments. Self-supervised representation learning reduces this gap. In open-loop evaluation, a supervised backbone exhibits severe inflation when transferring from Boston to Singapore (L2 displacement ratio 9.77x, collision ratio 19.43x), whereas domain-specific self-supervised pretraining reduces this to 1.20x and 0.75x respectively. In closed-loop evaluation, self-supervised pretraining improves PDMS by up to 4 percent for all single-city training cities. These results show that representation learning strongly influences the robustness of cross-city planning and establish zero-shot geographic transfer as a necessary test for evaluating end-to-end autonomous driving systems.
https://arxiv.org/abs/2603.11417
Large Multimodal Models (LMMs) struggle to adapt to varying computational budgets due to their numerous visual tokens. Previous methods attempted to reduce the number of visual tokens before or within LLMs. However, these strategies inevitably result in the loss of visual semantics. To address these issues, we introduce FMVR, a plug-and-play and extremely simple Frequency-Modulated Visual Restoration strategy to boost the reasoning ability of LMMs under visual token reduction. Specifically, FMVR disentangles the visual representation of fewer visual tokens into low- and high-frequency components through AvgPool and MaxPool. The derived frequencies are subsequently modulated using lightweight learnable parameters. The high-frequency component from MaxPool acts as a saliency filter to enhance salient visual semantics, while the low-frequency component from AvgPool acts as an anti-saliency filter to strengthen weak visual semantics. This enables the preservation of visual semantics dominated by a few visual tokens and the restoration of diluted visual semantics. Additionally, we inject FMVR into Matryoshka Representation Learning to learn coarse-to-fine visual token sets, enabling the number of visual tokens to be adjusted elastically during inference while maintaining comparable performance. Experiments across 10 image-based and 4 video-based benchmarks demonstrate that FMVR-LLaVA reduces the FLOPs of LLaVA-1.5-7B by 89% while maintaining almost 100% of the original accuracy. The code will be released.
https://arxiv.org/abs/2603.11220
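One plausible reading of the FMVR decomposition, sketched on a 1-D token signal with plain Python lists; the scalar per-token features, the window pooling, and the scalar modulation weights are all simplifying assumptions, not the paper's architecture.

```python
def avgpool1d(x, k=3):
    """Sliding-window average over a token sequence (same length, edge-clipped)."""
    n = len(x)
    out = []
    for i in range(n):
        lo, hi = max(0, i - k // 2), min(n, i + k // 2 + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out

def maxpool1d(x, k=3):
    """Sliding-window max: preserves salient peaks in the signal."""
    n = len(x)
    return [max(x[max(0, i - k // 2):min(n, i + k // 2 + 1)]) for i in range(n)]

def fmvr_restore(tokens, alpha=1.0, beta=1.0, k=3):
    """Hypothetical FMVR-style restoration: split the signal into a smooth
    (AvgPool, low-frequency) branch and a peak-residual (MaxPool-derived,
    high-frequency) branch, modulate each with a learnable scalar, recombine."""
    smooth = avgpool1d(tokens, k)                 # low-frequency branch
    peaks = maxpool1d(tokens, k)                  # saliency-preserving branch
    return [alpha * s + beta * (p - s) for s, p in zip(smooth, peaks)]
```

With alpha=1 and beta=0 only the smoothed branch survives; raising beta re-amplifies peak structure, which is the intuition behind "restoring" semantics diluted by token reduction.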
The world is inherently dynamic, and continual learning aims to enable models to adapt to ever-evolving data streams. While pre-trained models have shown powerful performance in continual learning, they still require finetuning to adapt effectively to downstream tasks. However, prevailing Parameter-Efficient Fine-Tuning (PEFT) methods operate through empirical, black-box optimization at the weight level. These approaches lack explicit control over representation drift, leading to sensitivity to domain shifts and catastrophic forgetting in continual learning scenarios. In this work, we introduce Continual Representation Learning (CoRe), a novel framework that for the first time shifts the finetuning paradigm from weight space to representation space. Unlike conventional methods, CoRe performs task-specific interventions within a low-rank linear subspace of hidden representations, adopting a learning process with explicit objectives, which ensures stability for past tasks while maintaining plasticity for new ones. By constraining updates to a low-rank subspace, CoRe achieves exceptional parameter efficiency. Extensive experiments across multiple continual learning benchmarks demonstrate that CoRe not only preserves parameter efficiency but also significantly outperforms existing state-of-the-art methods. Our work introduces representation finetuning as a new, more effective and interpretable paradigm for continual learning.
https://arxiv.org/abs/2603.11201
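The core mechanism (a task-specific edit confined to a low-rank subspace of the hidden representation, rather than a weight update) can be sketched as follows. The additive form h' = h + U(Vh), the Gaussian initialization, and all names are illustrative assumptions, not the paper's exact parameterization.

```python
import random

def matvec(M, v):
    """Plain-list matrix-vector product."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

class LowRankIntervention:
    """Hypothetical CoRe-style intervention: the hidden state is edited by a
    rank-r update, h' = h + U (V h). Only U and V (2*d*r values) are trained
    per task, so the edit is confined to an r-dimensional subspace."""
    def __init__(self, d, r, seed=0):
        rng = random.Random(seed)
        self.V = [[rng.gauss(0, 1 / d) for _ in range(d)] for _ in range(r)]  # d -> r
        self.U = [[rng.gauss(0, 1 / r) for _ in range(r)] for _ in range(d)]  # r -> d
    def __call__(self, h):
        z = matvec(self.V, h)      # project into the r-dim task subspace
        delta = matvec(self.U, z)  # map the edit back to d dims
        return [h_i + d_i for h_i, d_i in zip(h, delta)]

edit = LowRankIntervention(d=8, r=2)
h_new = edit([1.0] * 8)
```

Per task this trains only 2*d*r values, which is where the parameter efficiency comes from; constraining or regularizing U and V per task is one way such a design could bound representation drift explicitly.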
Foundation models for point cloud data have recently grown in capability, often leveraging extensive representation learning from language or vision. In this work, we take a more controlled approach by introducing a lightweight transformer-based point cloud architecture. In contrast to the heavy reliance on cross-modal supervision, our model is trained on only 39k point clouds - yet it outperforms several larger foundation models trained on over 200k training samples. Interestingly, our method approaches state-of-the-art results from models that have seen over a million point clouds, images, and text samples, demonstrating the value of a carefully curated training setup and architecture. To ensure rigorous evaluation, we conduct a comprehensive replication study that standardizes the training regime and benchmarks across multiple point cloud architectures. This unified experimental framework isolates the impact of architectural choices, allowing for transparent comparisons and highlighting the benefits of our design and other tokenizer-free architectures. Our results show that simple backbones can deliver results competitive with more complex or data-rich strategies. The implementation, including code, pre-trained models, and training protocols, is available at this https URL.
https://arxiv.org/abs/2603.10963
The landscape of skeleton-based action representation learning has evolved from Contrastive Learning (CL) to Masked Auto-Encoder (MAE) architectures. However, each paradigm faces inherent limitations: CL often overlooks fine-grained local details, while MAE is burdened by computationally heavy decoders. Moreover, MAE suffers from severe computational asymmetry -- benefiting from efficient masking during pre-training but requiring exhaustive full-sequence processing for downstream tasks. To resolve these bottlenecks, we propose SLiM (Skeleton Less is More), a novel unified framework that harmonizes masked modeling with contrastive learning via a shared encoder. By eschewing the reconstruction decoder, SLiM not only eliminates computational redundancy but also compels the encoder to capture discriminative features directly. SLiM is the first framework to perform decoder-free masked modeling for representation learning. Crucially, to prevent trivial reconstruction arising from high skeletal-temporal correlation, we introduce semantic tube masking, alongside skeletal-aware augmentations designed to ensure anatomical consistency across diverse temporal granularities. Extensive experiments demonstrate that SLiM consistently achieves state-of-the-art performance across all downstream protocols. Notably, our method delivers this superior accuracy with exceptional efficiency, reducing inference computational cost by 7.89x compared to existing MAE methods.
https://arxiv.org/abs/2603.10648
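A minimal sketch of what tube masking could look like on skeleton data: whole joints are masked for the entire clip, so the model cannot trivially copy a masked joint from a temporally adjacent frame. The joint-level granularity and masking ratio here are assumptions for illustration, not the paper's exact scheme.

```python
import random

def tube_mask(n_frames, n_joints, joint_ratio=0.4, seed=0):
    """Pick a subset of joints and mask them across *all* frames of the clip,
    defeating trivial reconstruction from high skeletal-temporal correlation.
    Returns mask[t][j] == True where joint j is hidden at frame t."""
    rng = random.Random(seed)
    n_masked = max(1, int(round(joint_ratio * n_joints)))
    masked_joints = set(rng.sample(range(n_joints), n_masked))
    return [[j in masked_joints for j in range(n_joints)] for _ in range(n_frames)]

mask = tube_mask(n_frames=4, n_joints=25)
```

Because the same joints are hidden in every frame, the encoder must infer them from anatomically related visible joints rather than from neighboring timesteps.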
Whether uniquely quantum resources confer advantages in fully classical, competitive environments remains an open question. Competitive zero-sum reinforcement learning is particularly challenging, as success requires modelling dynamic interactions between opposing agents rather than static state-action mappings. Here, we conduct a controlled study isolating the role of quantum entanglement in a quantum-classical hybrid agent trained on Pong, a competitive Markov game. An 8-qubit parameterised quantum circuit serves as a feature extractor within a proximal policy optimisation framework, allowing direct comparison between separable circuits and architectures incorporating fixed (CZ) or trainable (IsingZZ) entangling gates. Entangled circuits consistently outperform separable counterparts with comparable parameter counts and, in low-capacity regimes, match or exceed classical multilayer perceptron baselines. Representation similarity analysis further shows that entangled circuits learn structurally distinct features, consistent with improved modelling of interacting state variables. These findings establish entanglement as a functional resource for representation learning in competitive reinforcement learning.
https://arxiv.org/abs/2603.10289
Brain imaging classification is commonly approached from two perspectives: modeling the full image volume to capture global anatomical context, or constructing ROI-based graphs to encode localized and topological interactions. Although both representations have demonstrated independent efficacy, their relative contributions and potential complementarity remain insufficiently understood. Existing fusion approaches are typically task-specific and do not enable controlled evaluation of each representation under consistent training settings. To address this gap, we propose a unified cross-view contrastive framework for joint imaging-ROI representation learning. Our method learns subject-level global (imaging) and local (ROI-graph) embeddings and aligns them in a shared latent space using a bidirectional contrastive objective, encouraging representations from the same subject to converge while separating those from different subjects. This alignment produces comparable embeddings suitable for downstream fusion and enables systematic evaluation of imaging-only, ROI-only, and joint configurations within a unified training protocol. Extensive experiments on the ADHD-200 and ABIDE datasets demonstrate that joint learning consistently improves classification performance over either branch alone across multiple backbone choices. Moreover, interpretability analyses reveal that imaging-based and ROI-based branches emphasize distinct yet complementary discriminative patterns, explaining the observed performance gains. These findings provide principled evidence that explicitly integrating global volumetric and ROI-level representations is a promising direction for neuroimaging-based brain disorder classification. The source code is available at this https URL.
https://arxiv.org/abs/2603.10253
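The bidirectional contrastive objective can be written as a symmetric InfoNCE loss over subject-paired embeddings. The dependency-free sketch below (plain lists, cosine similarity, temperature tau) illustrates the idea; the function names and temperature value are assumptions, not the authors' code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(queries, keys, tau=0.1):
    """InfoNCE over a batch: queries[i] should match keys[i] against all others."""
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / tau for k in keys]
        log_z = math.log(sum(math.exp(l) for l in logits))
        loss += log_z - logits[i]
    return loss / len(queries)

def bidirectional_contrastive(img_emb, roi_emb, tau=0.1):
    """Symmetric objective: align each subject's imaging and ROI-graph
    embeddings in both retrieval directions (image->ROI and ROI->image)."""
    return 0.5 * (info_nce(img_emb, roi_emb, tau) + info_nce(roi_emb, img_emb, tau))
```

Aligned imaging/ROI pairs drive the loss toward zero, while mismatched pairings are penalized in both directions, which is what pulls same-subject embeddings together in the shared latent space.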
Text embeddings have become central to computational social science and psychology, enabling scalable measurement of meaning and mixed-method inference. Yet most representation learning is optimized and evaluated for prediction and retrieval, yielding a prediction-measurement gap: representations that perform well as features may be poorly suited as scientific instruments. The paper argues that scientific meaning analysis motivates a distinct family of objectives - scientific usability - emphasizing geometric legibility, interpretability and traceability to linguistic evidence, robustness to non-semantic confounds, and compatibility with regression-style inference over semantic directions. Grounded in cognitive and neuro-psychological views of meaning, the paper assesses static word embeddings and contextual transformer representations against these requirements: static spaces remain attractive for transparent measurement, whereas contextual spaces offer richer semantics but entangle meaning with other signals and exhibit geometric and interpretability issues that complicate inference. The paper then outlines a course-setting agenda around (i) geometry-first design for gradients and abstraction, including hierarchy-aware spaces constrained by psychologically privileged levels; (ii) invertible post-hoc transformations that recondition embedding geometry and reduce nuisance influence; and (iii) meaning atlases and measurement-oriented evaluation protocols for reliable and traceable semantic inference. As the field debates the limits of scale-first progress, measurement-ready representations offer a principled new frontier.
https://arxiv.org/abs/2603.10130
Conventional clinical CMR pipelines rely on a sequential "reconstruct-then-analyze" paradigm, forcing an ill-posed intermediate step that introduces avoidable artifacts and information bottlenecks. This creates a fundamental mathematical paradox: it attempts to recover high-dimensional pixel arrays (i.e., images) from undersampled k-space, rather than directly extracting the low-dimensional physiological labels actually required for diagnosis. To unlock the direct diagnostic potential of k-space, we propose k-MTR (k-space Multi-Task Representation), a k-space representation learning framework that aligns undersampled k-space data and fully-sampled images into a shared semantic manifold. Leveraging a large-scale controlled simulation of 42,000 subjects, k-MTR forces the k-space encoder to restore anatomical information lost to undersampling directly within the latent space, bypassing the explicit inverse problem for downstream analysis. We demonstrate that this latent alignment yields a dense latent space embedded with high-level physiological semantics derived directly from undersampled frequencies. Across continuous phenotype regression, disease classification, and anatomical segmentation, k-MTR achieves highly competitive performance against state-of-the-art image-domain baselines. By showcasing that precise spatial geometries and multi-task features can be successfully recovered directly from the k-space representations, k-MTR provides a robust architectural blueprint for task-aware cardiac MRI workflows.
https://arxiv.org/abs/2603.09945
Visual Question Answering (VQA) is a fundamental multimodal task that requires models to jointly understand visual and textual information. Early VQA systems relied heavily on language biases, motivating subsequent work to emphasize visual grounding and balanced datasets. With the success of large-scale pre-trained transformers for both text and vision domains -- such as PhoBERT for Vietnamese language understanding and Vision Transformers (ViT) for image representation learning -- multimodal fusion has achieved remarkable progress. For Vietnamese VQA, several datasets have been introduced to promote research in low-resource multimodal learning, including ViVQA, OpenViVQA, and the recently proposed ViTextVQA. These resources enable benchmarking of models that integrate linguistic and visual features in the Vietnamese context. Evaluation of VQA systems often employs automatic metrics originally designed for image captioning or machine translation, such as BLEU, METEOR, CIDEr, Recall, Precision, and F1-score. However, recent research suggests that large language models can further improve the alignment between automatic evaluation and human judgment in VQA tasks. In this work, we explore Vietnamese Visual Question Answering using transformer-based architectures, leveraging both textual and visual pre-training while systematically comparing automatic evaluation metrics under multilingual settings.
https://arxiv.org/abs/2603.09689
Multimodal graphs, where nodes contain heterogeneous features such as images and text, are increasingly common in real-world applications. Effectively learning on such graphs requires both adaptive intra-modal message passing and efficient inter-modal aggregation. However, most existing approaches to multimodal graph learning are typically extended from conventional graph neural networks and rely on static structures or dense attention, which limit flexibility and expressive node embedding learning. In this paper, we propose a novel multimodal graph representation learning framework with Dynamic information Pathways (DiP). By introducing modality-specific pseudo nodes, DiP enables dynamic message routing within each modality via proximity-guided pseudo-node interactions and captures inter-modality dependence through efficient information pathways in a shared state space. This design achieves adaptive, expressive, and sparse message propagation across modalities with linear complexity. We conduct the link prediction and node classification tasks to evaluate performance and carry out full experimental analyses. Extensive experiments across multiple benchmarks demonstrate that DiP consistently outperforms baselines.
https://arxiv.org/abs/2603.09258