The success of CLIP has driven substantial progress in text-video retrieval. However, current methods often suffer from "blind" feature interaction, where the model struggles to discern key visual information from background noise due to the sparsity of textual queries. To bridge this gap, we draw inspiration from human cognitive behavior and propose the Human Vision-Driven (HVD) model. Our framework establishes a coarse-to-fine alignment mechanism comprising two key components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM mimics the human macro-perception ability by selecting key frames to eliminate temporal redundancy. Subsequently, PFCM simulates micro-perception by aggregating patch features into salient visual entities through an advanced attention mechanism, enabling precise entity-level matching. Extensive experiments on five benchmarks demonstrate that HVD not only captures human-like visual focus but also achieves state-of-the-art performance.
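As a rough illustration of FFSM-style key-frame selection (the function name and the top-k rule below are our own; the paper's scoring may differ), one can rank CLIP frame embeddings by similarity to the text query and keep the top-k:

```python
import torch
import torch.nn.functional as F

def select_key_frames(frame_feats: torch.Tensor, text_feat: torch.Tensor, k: int = 4):
    """Keep the k frames most similar to the query text.

    frame_feats: (num_frames, dim) CLIP frame embeddings
    text_feat:   (dim,) CLIP text embedding
    Returns the selected (k, dim) features and their indices.
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    sims = frame_feats @ text_feat            # (num_frames,) cosine similarities
    idx = sims.topk(k).indices                # indices of the top-k frames
    return frame_feats[idx], idx

# Example: 12 frames of 512-d features, keep the 4 most query-relevant ones
frames = torch.randn(12, 512)
query = torch.randn(512)
selected, idx = select_key_frames(frames, query)
print(selected.shape, idx.tolist())  # torch.Size([4, 512])
```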
https://arxiv.org/abs/2601.16155
Few-shot recognition in synthetic aperture radar (SAR) imagery remains a critical bottleneck for real-world applications due to extreme data scarcity. A promising strategy involves synthesizing a large dataset with a generative adversarial network (GAN), pre-training a model via self-supervised learning (SSL), and then fine-tuning on the few labeled samples. However, this approach faces a fundamental paradox: conventional GANs themselves require abundant data for stable training, contradicting the premise of few-shot learning. To resolve this, we propose the consistency-regularized generative adversarial network (Cr-GAN), a novel framework designed to synthesize diverse, high-fidelity samples even when trained under these severe data limitations. Cr-GAN introduces a dual-branch discriminator that decouples adversarial training from representation learning. This architecture enables a channel-wise feature interpolation strategy to create novel latent features, complemented by a dual-domain cycle consistency mechanism that ensures semantic integrity. Our Cr-GAN framework is adaptable to various GAN architectures, and its synthesized data effectively boosts multiple SSL algorithms. Extensive experiments on the MSTAR and SRSDD datasets validate our approach, with Cr-GAN achieving highly competitive accuracies of 71.21% and 51.64%, respectively, in the 8-shot setting, significantly outperforming leading baselines while requiring only ~5% of the parameters of state-of-the-art diffusion models. Code is available at: this https URL.
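A minimal sketch of what channel-wise feature interpolation might look like, assuming per-channel convex mixing of two real samples' latent features (the mixing rule is our guess, not the paper's exact design):

```python
import torch

def channelwise_interpolate(f_a: torch.Tensor, f_b: torch.Tensor) -> torch.Tensor:
    """Mix two latent feature maps channel by channel.

    f_a, f_b: (batch, channels, h, w) features of two real samples.
    Each channel of the output is a random convex combination of the
    corresponding input channels, yielding a novel but plausible latent.
    """
    b, c = f_a.shape[:2]
    lam = torch.rand(b, c, 1, 1, device=f_a.device)  # per-channel mixing weight
    return lam * f_a + (1.0 - lam) * f_b

feats_a, feats_b = torch.randn(2, 64, 8, 8), torch.randn(2, 64, 8, 8)
novel = channelwise_interpolate(feats_a, feats_b)
print(novel.shape)  # torch.Size([2, 64, 8, 8])
```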
https://arxiv.org/abs/2601.15681
Novel view synthesis from low dynamic range (LDR) blurry images, which are common in the wild, struggles to recover high dynamic range (HDR) and sharp 3D representations in extreme lighting conditions. Although existing methods employ event data to address this issue, they ignore the sensor-physics mismatches between the camera output and physical world radiance, resulting in suboptimal HDR and deblurring results. To cope with this problem, we propose a unified sensor-physics grounded NeRF framework for sharp HDR novel view synthesis from single-exposure blurry LDR images and corresponding events. We employ NeRF to directly represent the actual radiance of the 3D scene in the HDR domain and model raw HDR scene rays hitting the sensor pixels as in the physical world. A pixel-wise RGB mapping field is introduced to align the above rendered pixel values with the sensor-recorded LDR pixel values of the input images. A novel event mapping field is also designed to bridge the physical scene dynamics and actual event sensor output. The two mapping fields are jointly optimized with the NeRF network, leveraging the spatial and temporal dynamic information in events to enhance the sharp HDR 3D representation learning. Experiments on the collected and public datasets demonstrate that our method can achieve state-of-the-art deblurring HDR novel view synthesis results with single-exposure blurry LDR images and corresponding events.
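The pixel-wise RGB mapping field can be pictured as a small learned camera-response network; the toy version below is our own construction with assumed input conventions, mapping rendered HDR radiance plus pixel coordinates to sensor-like LDR values:

```python
import torch
import torch.nn as nn

class RGBMappingField(nn.Module):
    """Toy pixel-wise mapping from rendered HDR radiance to LDR pixel values.

    Takes the NeRF-rendered HDR radiance plus a pixel-coordinate embedding,
    so the learned camera response can vary spatially, and squashes the
    output to [0, 1] like a sensor-recorded LDR value.
    """
    def __init__(self, coord_dim: int = 2, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + coord_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, hdr_rgb: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([hdr_rgb, coords], dim=-1))

field = RGBMappingField()
ldr = field(torch.rand(1024, 3) * 10.0, torch.rand(1024, 2))  # 1024 sampled rays
print(ldr.shape, float(ldr.min()), float(ldr.max()))
```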
https://arxiv.org/abs/2601.15475
Models for image representation learning are typically designed for either recognition or generation. Various forms of contrastive learning help models learn to convert images to embeddings that are useful for classification, detection, and segmentation. On the other hand, models can be trained to reconstruct images with pixel-wise, perceptual, and adversarial losses in order to learn a latent space that is useful for image generation. We seek to unify these two directions with a first-of-its-kind model that learns representations which are simultaneously useful for recognition and generation. We train our model as a hyper-network for implicit neural representations (INRs), which learns to map images to model weights for fast, accurate reconstruction. We further integrate our INR hyper-network with knowledge distillation to improve its generalization and performance. Beyond the novel training design, the model also learns an unprecedented compressed embedding space with outstanding performance for various visual tasks. The complete model competes with state-of-the-art results for image representation learning, while also enabling generative capabilities with its high-quality tiny embeddings. The code is available at this https URL.
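To make the hyper-network idea concrete, here is a minimal sketch in which a linear head maps an image embedding to the weights of a tiny coordinate-to-RGB INR (layer sizes and the SIREN-style activation are assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class TinyINRHyperNet(nn.Module):
    """Map an image embedding to the weights of a one-hidden-layer INR.

    The INR maps (x, y) pixel coordinates to RGB; its weights are produced
    per image by a linear hyper-head, so reconstruction is a single forward
    pass rather than per-image optimization.
    """
    def __init__(self, embed_dim: int = 256, hidden: int = 32):
        super().__init__()
        self.hidden = hidden
        # parameters: layer1 (2->hidden) weights+bias, layer2 (hidden->3) weights+bias
        n_params = (2 * hidden + hidden) + (hidden * 3 + 3)
        self.head = nn.Linear(embed_dim, n_params)

    def forward(self, embedding: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        h = self.hidden
        p = self.head(embedding)
        w1 = p[: 2 * h].view(h, 2); b1 = p[2 * h : 3 * h]
        w2 = p[3 * h : 6 * h].view(3, h); b2 = p[-3:]
        feat = torch.sin(coords @ w1.T + b1)        # SIREN-style activation
        return torch.sigmoid(feat @ w2.T + b2)      # RGB in [0, 1]

net = TinyINRHyperNet()
coords = torch.rand(4096, 2)                        # query pixel grid
rgb = net(torch.randn(256), coords)
print(rgb.shape)  # torch.Size([4096, 3])
```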
https://arxiv.org/abs/2601.14256
Accurate prediction of drug response in precision medicine requires models that capture how specific chemical substructures interact with cellular pathway states. However, most existing deep learning approaches treat chemical and transcriptomic modalities independently or combine them only at late stages, limiting their ability to model fine-grained, context-dependent mechanisms of drug action. In addition, standard attention mechanisms are often sensitive to noise and sparsity in high-dimensional biological networks, hindering both generalization and interpretability. We present DiSPA, a representation learning framework that explicitly disentangles structure-driven and context-driven mechanisms of drug response through bidirectional conditioning between chemical substructures and pathway-level gene expression. DiSPA introduces a differential cross-attention module that suppresses spurious pathway-substructure associations while amplifying contextually relevant interactions. Across multiple evaluation settings on the GDSC benchmark, DiSPA achieves state-of-the-art performance, with particularly strong improvements in the disjoint-set setting, which assesses generalization to unseen drug-cell combinations. Beyond predictive accuracy, DiSPA yields mechanistically informative representations: learned attention patterns recover known pharmacophores, distinguish structure-driven from context-dependent compounds, and exhibit coherent organization across biological pathways. Furthermore, we demonstrate that DiSPA trained solely on bulk RNA-seq data enables zero-shot transfer to spatial transcriptomics, revealing region-specific drug sensitivity patterns without retraining. Together, these results establish DiSPA as a robust and interpretable framework for integrative pharmacogenomic modeling, enabling principled analysis of drug response mechanisms beyond post hoc interpretation.
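Differential attention is commonly realized as the difference of two softmax maps; the sketch below follows that pattern for the substructure-pathway case (the parameterization and the lambda scaling are our assumptions, not DiSPA's exact module):

```python
import torch
import torch.nn.functional as F

def differential_cross_attention(q1, q2, k1, k2, v, lam: float = 0.5):
    """Cross-attention as the difference of two softmax attention maps.

    Substructure queries attend to pathway keys twice; subtracting the
    second, scaled map cancels attention mass that both copies assign to
    irrelevant (noisy) pathways, sharpening the surviving associations.
    q1, q2: (n_sub, d)   two query projections of substructure tokens
    k1, k2: (n_path, d)  two key projections of pathway tokens
    v:      (n_path, d)  pathway values
    """
    d = q1.shape[-1]
    a1 = F.softmax(q1 @ k1.T / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.T / d**0.5, dim=-1)
    return (a1 - lam * a2) @ v

out = differential_cross_attention(
    torch.randn(10, 64), torch.randn(10, 64),
    torch.randn(50, 64), torch.randn(50, 64), torch.randn(50, 64),
)
print(out.shape)  # torch.Size([10, 64])
```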
https://arxiv.org/abs/2601.14346
The quality of data augmentation serves as a critical determinant for the performance of contrastive learning in EEG tasks. Although this paradigm is promising for utilizing unlabeled data, static or random augmentation strategies often fail to preserve intrinsic information due to the non-stationarity of EEG signals, whose statistical properties change over time. To address this, we propose RL-BioAug, a framework that leverages a label-efficient reinforcement learning (RL) agent to autonomously determine optimal augmentation policies. While utilizing only a minimal fraction (10%) of labeled data to guide the agent's policy, our method enables the encoder to learn robust representations in a strictly self-supervised manner. Experimental results demonstrate that RL-BioAug significantly outperforms the random selection strategy, achieving substantial improvements of 9.69% and 8.80% in Macro-F1 score on the Sleep-EDFX and CHB-MIT datasets, respectively. Notably, the agent converged on task-specific strategies -- for example, Time Masking with a 62% probability for sleep stage classification and Crop & Resize with a 77% probability for seizure detection. Our framework shows the potential to replace conventional heuristic-based augmentations and establish a new autonomous paradigm for data augmentation. The source code is available at this https URL.
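A stripped-down stand-in for the RL agent, modeled here as an epsilon-greedy bandit whose reward would come from probing on the 10% labeled subset (the bandit formulation and the reward values below are illustrative, not the paper's agent):

```python
import random

class AugmentationBandit:
    """Epsilon-greedy bandit over candidate EEG augmentations.

    Each arm is an augmentation; the reward is a score (e.g., macro-F1 of
    a probe trained on the small labeled subset); arm values are running
    averages of observed rewards.
    """
    def __init__(self, arms, epsilon: float = 0.1):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.counts = {a: 0 for a in self.arms}
        self.values = {a: 0.0 for a in self.arms}

    def select(self) -> str:
        if random.random() < self.epsilon:
            return random.choice(self.arms)         # explore
        return max(self.arms, key=self.values.get)  # exploit best arm

    def update(self, arm: str, reward: float) -> None:
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

agent = AugmentationBandit(["time_mask", "crop_resize", "jitter", "channel_drop"])
for _ in range(200):
    arm = agent.select()
    # hypothetical noisy rewards favoring time masking
    reward = {"time_mask": 0.62, "crop_resize": 0.48,
              "jitter": 0.40, "channel_drop": 0.35}[arm] + random.gauss(0, 0.05)
    agent.update(arm, reward)
print(max(agent.values, key=agent.values.get))  # typically "time_mask"
```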
https://arxiv.org/abs/2601.13964
Current visual representation learning remains bifurcated: vision-language models (e.g., CLIP) excel at global semantic alignment but lack spatial precision, while self-supervised methods (e.g., MAE, DINO) capture intricate local structures yet struggle with high-level semantic context. We argue that these paradigms are fundamentally complementary and can be integrated into a principled multi-task framework, further enhanced by dense spatial supervision. We introduce MTV, a multi-task visual pretraining framework that jointly optimizes a shared backbone across vision-language contrastive, self-supervised, and dense spatial objectives. To mitigate the need for manual annotations, we leverage high-capacity "expert" models -- such as Depth Anything V2 and OWLv2 -- to synthesize dense, structured pseudo-labels at scale. Beyond the framework, we provide a systematic investigation into the mechanics of multi-task visual learning, analyzing: (i) the marginal gain of each objective, (ii) task synergies versus interference, and (iii) scaling behavior across varying data and model scales. Our results demonstrate that MTV achieves "best-of-both-worlds" performance, significantly enhancing fine-grained spatial reasoning without compromising global semantic understanding. Our findings suggest that multi-task learning, fueled by high-quality pseudo-supervision, is a scalable path toward more general visual encoders.
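At its simplest, such a multi-task objective is a weighted sum over the three losses on a shared backbone; the helper below is a hypothetical sketch (the weights and any scheduling are not from the paper):

```python
import torch

def mtv_total_loss(l_contrastive: torch.Tensor,
                   l_self_supervised: torch.Tensor,
                   l_dense: torch.Tensor,
                   weights: tuple = (1.0, 1.0, 0.5)) -> torch.Tensor:
    """Shared-backbone multi-task objective as a plain weighted sum.

    l_contrastive:     CLIP-style image-text alignment loss
    l_self_supervised: MAE/DINO-style self-supervised loss
    l_dense:           dense-prediction loss against expert pseudo-labels
                       (e.g., depth from Depth Anything V2, boxes from OWLv2)
    """
    return (weights[0] * l_contrastive
            + weights[1] * l_self_supervised
            + weights[2] * l_dense)

total = mtv_total_loss(torch.tensor(2.1), torch.tensor(0.8), torch.tensor(1.3))
print(float(total))  # ~3.55
```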
https://arxiv.org/abs/2601.13886
Multimodal Sentiment Analysis integrates linguistic, visual, and acoustic modalities. Mainstream approaches, based on modality-invariant and modality-specific factorization or on complex fusion, still rely on spatiotemporally mixed modeling. This ignores spatiotemporal heterogeneity, leading to spatiotemporal information asymmetry and thus limited performance. Hence, we propose TSDA, Temporal-Spatial Decouple before Act, which explicitly decouples each modality into temporal dynamics and spatial structural context before any interaction. For every modality, a temporal encoder and a spatial encoder project signals into separate temporal and spatial subspaces. Factor-Consistent Cross-Modal Alignment then aligns temporal features only with their temporal counterparts across modalities, and spatial features only with their spatial counterparts. Factor-specific supervision and decorrelation regularization reduce cross-factor leakage while preserving complementarity. A Gated Recouple module subsequently recouples the aligned streams for the downstream task. Extensive experiments show that TSDA outperforms baselines. Ablation studies confirm the necessity and interpretability of the design.
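A small sketch of the decorrelation regularizer described above, assuming it penalizes the cross-covariance between a modality's temporal and spatial features (the exact penalty in the paper may differ):

```python
import torch

def decorrelation_loss(temporal: torch.Tensor, spatial: torch.Tensor) -> torch.Tensor:
    """Penalize cross-correlation between temporal and spatial factors.

    temporal, spatial: (batch, dim) features from the two encoders of one
    modality. Driving their cross-covariance toward zero discourages the
    temporal stream from smuggling in spatial information and vice versa.
    """
    t = (temporal - temporal.mean(0)) / (temporal.std(0) + 1e-6)
    s = (spatial - spatial.mean(0)) / (spatial.std(0) + 1e-6)
    cross_cov = (t.T @ s) / t.size(0)          # (dim, dim) cross-covariance
    return cross_cov.pow(2).mean()

loss = decorrelation_loss(torch.randn(32, 128), torch.randn(32, 128))
print(float(loss))  # ~1/32 for independent random features
```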
https://arxiv.org/abs/2601.13659
Human-centric visual analysis plays a pivotal role in diverse applications, including surveillance, healthcare, and human-computer interaction. With the emergence of large-scale unlabeled human image datasets, there is an increasing need for a general unsupervised pre-training model capable of supporting diverse human-centric downstream tasks. To achieve this goal, we propose CLASP (CLIP-guided Adaptable Self-suPervised learning), a novel framework designed for unsupervised pre-training in human-centric visual tasks. CLASP leverages the powerful vision-language model CLIP to generate both low-level (e.g., body parts) and high-level (e.g., attributes) semantic pseudo-labels. These multi-level semantic cues are then integrated into the learned visual representations, enriching their expressiveness and generalizability. Recognizing that different downstream tasks demand varying levels of semantic granularity, CLASP incorporates a Prompt-Controlled Mixture-of-Experts (MoE) module. MoE dynamically adapts feature extraction based on task-specific prompts, mitigating potential feature conflicts and enhancing transferability. Furthermore, CLASP employs a multi-task pre-training strategy, where part- and attribute-level pseudo-labels derived from CLIP guide the representation learning process. Extensive experiments across multiple benchmarks demonstrate that CLASP consistently outperforms existing unsupervised pre-training methods, advancing the field of human-centric visual analysis.
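A minimal Prompt-Controlled MoE might look like the following, where a task-prompt embedding drives a softmax gate over expert MLPs (dimensions, expert design, and soft routing are assumptions):

```python
import torch
import torch.nn as nn

class PromptControlledMoE(nn.Module):
    """Mixture-of-Experts whose gate is driven by a task-prompt embedding.

    Each expert is a small MLP over backbone features; the gate turns a
    task prompt into mixing weights, so different downstream tasks can
    emphasize different semantic granularities without separate backbones.
    """
    def __init__(self, dim: int = 256, prompt_dim: int = 64, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(prompt_dim, n_experts)

    def forward(self, feats: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(prompt), dim=-1)        # (batch, E)
        outs = torch.stack([e(feats) for e in self.experts], 1)   # (batch, E, dim)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)

moe = PromptControlledMoE()
y = moe(torch.randn(8, 256), torch.randn(8, 64))  # features + task prompts
print(y.shape)  # torch.Size([8, 256])
```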
https://arxiv.org/abs/2601.13133
Crack detection is critical for concrete infrastructure safety, but real-world cracks often appear in low-light environments like tunnels and bridge undersides, degrading computer vision segmentation accuracy. Pixel-level annotation of low-light crack images is extremely time-consuming, yet most deep learning methods require large, well-illuminated datasets. We propose a dual-branch prototype learning network integrating Retinex theory with few-shot learning for low-light crack segmentation. Retinex-based reflectance components guide illumination-invariant global representation learning, while metric learning reduces dependence on large annotated datasets. We introduce a cross-similarity prior mask generation module that computes high-dimensional similarities between query and support features to capture crack location and structure, and a multi-scale feature enhancement module that fuses multi-scale features with the prior mask to alleviate spatial inconsistency. Extensive experiments on multiple benchmarks demonstrate consistent state-of-the-art performance under low-light conditions. Code: this https URL.
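The cross-similarity prior mask can be sketched as a best-match cosine similarity between query features and masked support features, in the spirit of few-shot prior masks (the normalization and max-pooling choices below are assumptions):

```python
import torch
import torch.nn.functional as F

def cross_similarity_prior(query_feat, support_feat, support_mask):
    """Prior mask from query/support feature similarity (few-shot style).

    query_feat:   (c, h, w) query-image features
    support_feat: (c, h, w) support-image features
    support_mask: (h, w) binary crack mask of the support image
    Each query location is scored by its best cosine similarity to any
    masked (crack) support location, then min-max normalized to [0, 1].
    """
    c, h, w = query_feat.shape
    q = F.normalize(query_feat.reshape(c, -1), dim=0)    # (c, hw)
    s = F.normalize(support_feat.reshape(c, -1), dim=0)  # (c, hw)
    sim = q.T @ s                                        # (hw_q, hw_s)
    sim = sim.masked_fill(support_mask.reshape(1, -1) < 0.5, -1.0)
    prior = sim.max(dim=1).values                        # best match per query pixel
    prior = (prior - prior.min()) / (prior.max() - prior.min() + 1e-6)
    return prior.reshape(h, w)

prior = cross_similarity_prior(torch.randn(64, 32, 32), torch.randn(64, 32, 32),
                               (torch.rand(32, 32) > 0.8).float())
print(prior.shape)  # torch.Size([32, 32])
```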
https://arxiv.org/abs/2601.13059
Knowledge Graphs (KGs) often suffer from unreliable knowledge, which restricts their utility. Triple Classification (TC) aims to determine the validity of triples from KGs. Recently, text-based methods learn entity and relation representations from natural language descriptions, significantly improving the generalization capabilities of TC models and setting new benchmarks in performance. However, there are still two critical challenges. First, existing methods often ignore the effective semantic interaction among different KG components. Second, most approaches adopt a single binary-classification training objective, leading to insufficient semantic representation learning. To address these challenges, we propose SASA, a novel framework designed to enhance TC models via a separated attention mechanism and semantic-aware contrastive learning (CL). Specifically, we first propose a separated attention mechanism to encode triples into decoupled contextual representations and then fuse them in a more effective, interactive way. Then, we introduce semantic-aware hierarchical CL as an auxiliary training objective to guide models toward improved discriminative capability and sufficient semantic learning, considering CL at both the local and global levels. Experimental results across two benchmark datasets demonstrate that SASA significantly outperforms state-of-the-art methods. In terms of accuracy, we advance the state-of-the-art by +5.9% on FB15k-237 and +3.4% on YAGO3-10.
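Auxiliary CL terms of this kind are typically built on in-batch InfoNCE; the snippet below shows that standard form (the pairing strategy across local and global levels is our illustration, not SASA's exact objective):

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, tau: float = 0.07):
    """In-batch InfoNCE, the usual building block for auxiliary CL terms.

    anchor, positive: (batch, dim) paired triple representations, e.g. two
    views of the same triple (local level) or a triple and its relation
    prototype (global level). Other rows in the batch act as negatives.
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / tau                      # (batch, batch) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)      # diagonal entries are positives

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
print(float(loss))
```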
https://arxiv.org/abs/2601.13035
Self-supervised pretraining in remote sensing is mostly done using mid-spatial resolution (MR) image datasets due to their high availability. Given the release of high-resolution (HR) datasets, we ask how HR datasets can be included in self-supervised pretraining to enhance MR image representation learning and downstream segmentation performance on MR tasks. We design a spatial affinity component that can be added to existing self-supervised learning frameworks and that uses HR imagery to learn better representations of MR imagery. We test the spatial affinity component on two self-supervised learning frameworks and show that it outperforms models pretrained on HR or MR images alone.
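Since the abstract leaves the component's internals open, here is one plausible and clearly hypothetical reading: pool HR features onto the MR grid and maximize the per-cell cosine affinity as an auxiliary loss.

```python
import torch
import torch.nn.functional as F

def spatial_affinity_loss(mr_feats: torch.Tensor, hr_feats: torch.Tensor) -> torch.Tensor:
    """Pull MR features toward pooled HR features of the same ground area.

    mr_feats: (b, c, h, w) features of an MR image
    hr_feats: (b, c, H, W) features of the co-registered HR image, with
              H and W integer multiples of h and w.
    HR features are average-pooled to the MR grid so each MR cell is
    matched with the HR content it covers; the loss maximizes their
    cosine affinity.
    """
    pooled = F.adaptive_avg_pool2d(hr_feats, mr_feats.shape[-2:])
    affinity = F.cosine_similarity(mr_feats, pooled, dim=1)  # (b, h, w)
    return (1.0 - affinity).mean()

loss = spatial_affinity_loss(torch.randn(2, 64, 16, 16), torch.randn(2, 64, 64, 64))
print(float(loss))  # ~1.0 for unrelated random features
```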
https://arxiv.org/abs/2601.12964
The "You Only Look Once" (YOLO) framework has long served as the benchmark for real-time object detection, yet traditional iterations (YOLOv1 through YOLO11) remain constrained by the latency and hyperparameter sensitivity of Non-Maximum Suppression (NMS) post-processing. This paper analyzes a comprehensive analysis of YOLO26, an architecture that fundamentally redefines this paradigm by eliminating NMS in favor of a native end-to-end learning strategy. This study examines the critical innovations that enable this transition, specifically the introduction of the MuSGD optimizer for stabilizing lightweight backbones, STAL for small-target-aware assignment, and ProgLoss for dynamic supervision. Through a systematic review of official performance benchmarks, the results demonstrate that YOLO26 establishes a new Pareto front, outperforming a comprehensive suite of predecessors and state-of-the-art competitors (including RTMDet and DAMO-YOLO) in both inference speed and detection accuracy. The analysis confirms that by decoupling representation learning from heuristic post-processing, YOLOv26 successfully resolves the historical trade-off between latency and precision, signaling the next evolutionary step in edge-based computer vision.
https://arxiv.org/abs/2601.12882
Total-body PET/CT enables system-wide molecular imaging, but heterogeneous anatomical and metabolic signals, approximately 2 m axial coverage, and structured radiology semantics challenge existing medical AI models that assume single-modality inputs, localized fields of view, and coarse image-text alignment. We introduce SDF-HOLO (Systemic Dual-stream Fusion Holo Model), a multimodal foundation model for holistic total-body PET/CT, pre-trained on more than 10,000 patients. SDF-HOLO decouples CT and PET representation learning with dual-stream encoders and couples them through a cross-modal interaction module, allowing anatomical context to refine PET aggregation while metabolic saliency guides subtle morphological reasoning. To model long-range dependencies across the body, hierarchical context modeling combines efficient local windows with global attention. To bridge voxels and clinical language, we use anatomical segmentation masks as explicit semantic anchors and perform voxel-mask-text alignment during pre-training. Across tumor segmentation, low-dose lesion detection, and multilingual diagnostic report generation, SDF-HOLO outperforms strong task-specific and clinical-reference baselines while reducing localization errors and hallucinated findings. Beyond focal interpretation, the model enables system-wide metabolic profiling and reveals tumor-associated fingerprints of inter-organ metabolic network interactions, providing a scalable computational foundation for total-body PET/CT diagnostics and system-level precision oncology.
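The cross-modal interaction module can be approximated by two symmetric cross-attention passes with residual connections, as in this sketch (the module's real internals are not specified at this level of detail; dimensions are assumed):

```python
import torch
import torch.nn as nn

class CrossModalInteraction(nn.Module):
    """Couple dual-stream CT/PET tokens with two cross-attention passes.

    CT tokens attend to PET tokens (metabolic saliency refines anatomy)
    and PET tokens attend to CT tokens (anatomical context refines PET
    aggregation); both streams keep their own residual pathway.
    """
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.ct_from_pet = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pet_from_ct = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, ct: torch.Tensor, pet: torch.Tensor):
        ct_out, _ = self.ct_from_pet(query=ct, key=pet, value=pet)
        pet_out, _ = self.pet_from_ct(query=pet, key=ct, value=ct)
        return ct + ct_out, pet + pet_out        # residual coupling

cmi = CrossModalInteraction()
ct, pet = torch.randn(1, 196, 256), torch.randn(1, 196, 256)
ct2, pet2 = cmi(ct, pet)
print(ct2.shape, pet2.shape)  # torch.Size([1, 196, 256]) each
```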
https://arxiv.org/abs/2601.12820
Pre-trained transformers demonstrate remarkable generalization ability in natural image processing. However, directly transferring them to magnetic resonance images faces two key challenges: the inability to adapt to the specificity of medical anatomical structures, and the limitations brought about by the privacy and scarcity of medical data. To address these issues, this paper proposes a Self-Supervised Pretrained Transformer (SSPFormer) for MRI images, which effectively learns domain-specific feature representations of medical images by leveraging unlabeled raw imaging data. To tackle the domain gap and data scarcity, we introduce inverse frequency projection masking, which prioritizes the reconstruction of high-frequency anatomical regions to enforce structure-aware representation learning. Simultaneously, to enhance robustness against real-world MRI artifacts, we employ frequency-weighted FFT noise enhancement that injects physiologically realistic noise into the Fourier domain. Together, these strategies enable the model to learn domain-invariant and artifact-robust features directly from raw scans. Through extensive experiments on segmentation, super-resolution, and denoising tasks, the proposed SSPFormer achieves state-of-the-art performance, fully verifying its ability to capture fine-grained MRI image fidelity and adapt to clinical application requirements.
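A sketch of frequency-weighted FFT noise enhancement, assuming complex Gaussian noise scaled by radial frequency in k-space (the paper's physiological noise model is surely more refined):

```python
import torch

def fft_noise_augment(img: torch.Tensor, strength: float = 0.05) -> torch.Tensor:
    """Inject frequency-weighted noise in the Fourier domain (a sketch).

    img: (h, w) single-channel MRI slice. Complex Gaussian noise is added
    to the k-space representation, weighted to grow with spatial frequency,
    loosely mimicking acquisition noise dominating fine detail.
    """
    h, w = img.shape
    k = torch.fft.fftshift(torch.fft.fft2(img))
    fy = torch.fft.fftshift(torch.fft.fftfreq(h)).abs().unsqueeze(1)
    fx = torch.fft.fftshift(torch.fft.fftfreq(w)).abs().unsqueeze(0)
    weight = (fx**2 + fy**2).sqrt()            # radial frequency magnitude
    noise = torch.randn(h, w, dtype=torch.cfloat) * weight * strength * k.abs().mean()
    return torch.fft.ifft2(torch.fft.ifftshift(k + noise)).real

noisy = fft_noise_augment(torch.rand(128, 128))
print(noisy.shape)  # torch.Size([128, 128])
```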
https://arxiv.org/abs/2601.12747
Transformer-based models have shown strong performance in time-series forecasting by leveraging self-attention to model long-range temporal dependencies. However, their effectiveness depends critically on the quality and structure of input representations derived from raw multivariate time-series data. This work proposes a two-stage forecasting framework that explicitly separates local temporal representation learning from global dependency modelling. In the first stage, a convolutional neural network (CNN) operates on fixed-length temporal patches to extract short-range temporal dynamics and non-linear feature interactions, producing compact patch-level token embeddings. Token-level self-attention is subsequently applied during representation learning to refine these embeddings by enabling interactions across temporal patches. In the second stage, a Transformer encoder processes the resulting token sequence to model inter-patch temporal dependencies and generate per-patch forecasts. Experiments conducted on synthetic multivariate time-series data with controlled static and dynamic factors demonstrate that the proposed patch-based tokenization strategy achieves competitive forecasting performance compared to convolutional and patch-based Transformer baselines. The results highlight the importance of structured temporal representations and show that decoupling local temporal encoding from global attention-based modelling yields more effective and stable time-series forecasting.
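The two-stage design maps naturally onto a small PyTorch module: a per-patch CNN tokenizer followed by a Transformer encoder with a per-patch forecasting head (all sizes below are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn

class PatchCNNTransformer(nn.Module):
    """Two-stage sketch: CNN patch tokenizer, then Transformer encoder.

    Stage 1 slices each series into fixed-length patches and runs a small
    CNN per patch to get token embeddings (local dynamics); stage 2 applies
    self-attention across patch tokens (global dependencies) and emits a
    forecast per patch.
    """
    def __init__(self, n_vars=4, patch_len=16, d_model=64, horizon=16):
        super().__init__()
        self.patch_len = patch_len
        self.tokenizer = nn.Sequential(
            nn.Conv1d(n_vars, d_model, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                 # one token per patch
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, horizon * n_vars)

    def forward(self, x):                            # x: (batch, time, n_vars)
        b, t, v = x.shape
        p = x.unfold(1, self.patch_len, self.patch_len)  # (b, n_patches, v, patch_len)
        n = p.shape[1]
        tokens = self.tokenizer(p.reshape(b * n, v, self.patch_len)).squeeze(-1)
        tokens = tokens.view(b, n, -1)
        return self.head(self.encoder(tokens))       # (b, n_patches, horizon*n_vars)

model = PatchCNNTransformer()
out = model(torch.randn(2, 128, 4))                  # 128 steps, 4 variables
print(out.shape)  # torch.Size([2, 8, 64])
```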
https://arxiv.org/abs/2601.12467
Wearable foundation models have the potential to transform digital health by learning transferable representations from large-scale biosignals collected in everyday settings. While recent progress has been made in large-scale pretraining, most approaches overlook the spectral structure of photoplethysmography (PPG) signals, wherein physiological rhythms unfold across multiple frequency bands. Motivated by the insight that many downstream health-related tasks depend on multi-resolution features spanning fine-grained waveform morphology to global rhythmic dynamics, we introduce Masked Multiscale Reconstruction (MMR) for PPG representation learning - a self-supervised pretraining framework that explicitly learns from hierarchical time-frequency scales of PPG data. The pretraining task is designed to reconstruct randomly masked out coefficients obtained from a wavelet-based multiresolution decomposition of PPG signals, forcing the transformer encoder to integrate information across temporal and spectral scales. We pretrain our model with MMR using ~17 million unlabeled 10-second PPG segments from ~32,000 smartwatch users. On 17 of 19 diverse health-related tasks, MMR trained on large-scale wearable PPG data improves over or matches state-of-the-art open-source PPG foundation models, time-series foundation models, and other self-supervised baselines. Extensive analysis of our learned embeddings and systematic ablations underscores the value of wavelet-based representations, showing that they capture robust and physiologically-grounded features. Together, these results highlight the potential of MMR as a step toward generalizable PPG foundation models.
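The pretext task can be sketched with PyWavelets: decompose a PPG segment, randomly zero coefficients at every scale, and keep the originals as reconstruction targets (the masking scheme and hyperparameters below are our assumptions):

```python
import numpy as np
import pywt

def masked_wavelet_reconstruction_target(ppg: np.ndarray, mask_ratio: float = 0.3,
                                         wavelet: str = "db4", level: int = 4):
    """Build an MMR-style pretext pair: masked coefficients plus originals.

    The PPG segment is decomposed into multiresolution wavelet coefficients;
    a random subset at every scale is zeroed out. A model would see the
    masked version (here, its inverse transform) and predict the hidden
    coefficients, forcing it to integrate fine morphology and slow rhythms.
    """
    coeffs = pywt.wavedec(ppg, wavelet, level=level)  # [cA_L, cD_L, ..., cD_1]
    masked, targets = [], []
    rng = np.random.default_rng(0)
    for c in coeffs:
        m = rng.random(c.shape) < mask_ratio          # True where hidden
        masked.append(np.where(m, 0.0, c))
        targets.append((c, m))
    corrupted = pywt.waverec(masked, wavelet)         # model input signal
    return corrupted, targets

ppg = np.sin(np.linspace(0, 20 * np.pi, 640)) + 0.1 * np.random.randn(640)
corrupted, targets = masked_wavelet_reconstruction_target(ppg)
print(corrupted.shape, len(targets))  # ~(640,) and 5 coefficient arrays
```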
https://arxiv.org/abs/2601.12215
The core of video-based visible-infrared person re-identification (VVI-ReID) lies in learning sequence-level modal-invariant representations across different modalities. Recent research tends to use modality-shared language prompts generated by CLIP to guide the learning of modal-invariant representations. Despite achieving optimal performance, such methods still face limitations in efficient spatial-temporal modeling, sufficient cross-modal interaction, and explicit modality-level loss guidance. To address these issues, we propose the language-driven sequence-level modal-invariant representation learning (LSMRL) method, which includes a spatial-temporal feature learning (STFL) module, a semantic diffusion (SD) module, and a cross-modal interaction (CMI) module. To enable parameter- and computation-efficient spatial-temporal modeling, the STFL module is built upon CLIP with minimal modifications. To achieve sufficient cross-modal interaction and enhance the learning of modal-invariant features, the SD module is proposed to diffuse modality-shared language prompts into visible and infrared features to establish preliminary modal consistency. The CMI module is further developed to leverage bidirectional cross-modal self-attention to eliminate residual modality gaps and refine modal-invariant representations. To explicitly enhance the learning of modal-invariant representations, two modality-level losses are introduced to improve the features' discriminative ability and their generalization to unseen categories. Extensive experiments on large-scale VVI-ReID datasets demonstrate the superiority of LSMRL over state-of-the-art methods.
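One way to realize semantic diffusion plus bidirectional cross-modal attention in a single block is joint self-attention over concatenated prompt, visible, and infrared tokens, as in this speculative sketch:

```python
import torch
import torch.nn as nn

class SemanticDiffusionBlock(nn.Module):
    """Diffuse shared language prompts into both modalities via joint attention.

    Prompt, visible, and infrared tokens are concatenated into one sequence
    and processed with self-attention, so prompt semantics flow into both
    visual streams and the two streams interact bidirectionally in one pass.
    """
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, prompt, vis, ir):
        n_p, n_v = prompt.size(1), vis.size(1)
        seq = torch.cat([prompt, vis, ir], dim=1)
        out, _ = self.attn(seq, seq, seq)
        out = seq + out                                    # residual
        return out[:, n_p:n_p + n_v], out[:, n_p + n_v:]   # updated vis / ir tokens

block = SemanticDiffusionBlock()
vis, ir = block(torch.randn(2, 4, 512), torch.randn(2, 50, 512), torch.randn(2, 50, 512))
print(vis.shape, ir.shape)  # torch.Size([2, 50, 512]) each
```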
https://arxiv.org/abs/2601.12062
Optical remote sensing imagery is indispensable for Earth observation, yet persistent cloud occlusion limits its downstream utility. Most cloud removal (CR) methods are optimized for low-level fidelity and can over-smooth textures and boundaries that are critical for analysis-ready data (ARD), leading to a mismatch between visually plausible restoration and semantic utility. To bridge this gap, we propose TDP-CR, a task-driven multimodal framework that jointly performs cloud removal and land-cover segmentation. Central to our approach is a Prompt-Guided Fusion (PGF) mechanism, which utilizes a learnable degradation prompt to encode cloud thickness and spatial uncertainty. By combining global channel context with local prompt-conditioned spatial bias, PGF adaptively integrates Synthetic Aperture Radar (SAR) information only where optical data is corrupted. We further introduce a parameter-efficient two-phase training strategy that decouples reconstruction and semantic representation learning. Experiments on the LuojiaSET-OSFCR dataset demonstrate the superiority of our framework: TDP-CR surpasses heavy state-of-the-art baselines by 0.18 dB in PSNR while using only 15% of the parameters, and consistently achieves a 1.4% improvement in mIoU over multi-task competitors, effectively delivering analysis-ready data.
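A toy version of prompt-guided fusion: a learnable degradation prompt helps predict a per-pixel gate so SAR features replace optical ones only where clouds corrupt them (the layer choices below are assumptions, not the PGF design):

```python
import torch
import torch.nn as nn

class PromptGuidedFusion(nn.Module):
    """Gate SAR features into optical features only where clouds corrupt them.

    A learnable degradation prompt is combined with the optical features to
    predict a per-pixel gate in [0, 1] (a cloud thickness/uncertainty proxy);
    the fused feature falls back to SAR exactly where the gate is high.
    """
    def __init__(self, dim: int = 64, prompt_dim: int = 16):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_dim))  # learnable degradation prompt
        self.gate = nn.Sequential(
            nn.Conv2d(dim + prompt_dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, 1, 1), nn.Sigmoid(),
        )

    def forward(self, optical: torch.Tensor, sar: torch.Tensor) -> torch.Tensor:
        b, _, h, w = optical.shape
        p = self.prompt.view(1, -1, 1, 1).expand(b, -1, h, w)
        g = self.gate(torch.cat([optical, p], dim=1))   # (b, 1, h, w) corruption gate
        return (1.0 - g) * optical + g * sar            # SAR only where optical fails

fuse = PromptGuidedFusion()
out = fuse(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```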
https://arxiv.org/abs/2601.12052
Learning structured task representations from human demonstrations is essential for understanding long-horizon manipulation behaviors, particularly in bimanual settings where action ordering, object involvement, and interaction geometry can vary significantly. A key challenge lies in jointly capturing the discrete semantic structure of tasks and the temporal evolution of object-centric geometric relations in a form that supports reasoning over task progression. In this work, we introduce a semantic-geometric task-graph representation that encodes object identities, inter-object relations, and their temporal geometric evolution from human demonstrations. Building on this formulation, we propose a learning framework that combines a Message Passing Neural Network (MPNN) encoder with a Transformer-based decoder, decoupling scene representation learning from action-conditioned reasoning about task progression. The encoder operates solely on temporal scene graphs to learn structured representations, while the decoder conditions on action-context to predict future action sequences, associated objects, and object motions over extended time horizons. Through extensive evaluation on human demonstration datasets, we show that semantic-geometric task-graph representations are particularly beneficial for tasks with high action and object variability, where simpler sequence-based models struggle to capture task progression. Finally, we demonstrate that task-graph representations can be transferred to a physical bimanual robot and used for online action selection, highlighting their potential as reusable task abstractions for downstream decision-making in manipulation systems.
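A toy encoding of one such graph frame, with object positions as node attributes and symbolic relations plus relative displacements as edges (the dataclass layout is ours, for illustration only):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class SceneGraphFrame:
    """One time step of a semantic-geometric task graph (a toy encoding).

    Nodes carry object identities and 3D positions; edges carry a symbolic
    relation plus the relative displacement between the two objects, so a
    sequence of frames captures how interaction geometry evolves.
    """
    objects: Dict[str, Tuple[float, float, float]]            # name -> xyz
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def edge_geometry(self, a: str, b: str):
        pa, pb = self.objects[a], self.objects[b]
        return tuple(x - y for x, y in zip(pb, pa))           # displacement a -> b

# Two frames of a bimanual handover: the cup moves toward the right hand
t0 = SceneGraphFrame({"left_hand": (0.1, 0.4, 0.9), "cup": (0.1, 0.4, 0.85),
                      "right_hand": (0.6, 0.4, 0.9)},
                     [("left_hand", "grasping", "cup")])
t1 = SceneGraphFrame({"left_hand": (0.3, 0.4, 0.9), "cup": (0.3, 0.4, 0.88),
                      "right_hand": (0.5, 0.4, 0.9)},
                     [("left_hand", "grasping", "cup"),
                      ("cup", "approaching", "right_hand")])
print(t0.edge_geometry("cup", "right_hand"))  # (0.5, 0.0, ~0.05)
print(t1.edge_geometry("cup", "right_hand"))  # (0.2, 0.0, ~0.02): the gap shrinks
```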
https://arxiv.org/abs/2601.11460