Ensuring equitable public transit access remains challenging, particularly in densely populated cities like New York City (NYC), where low-income and minority communities often face limited transit accessibility. Bike-sharing systems (BSS) can bridge these equity gaps by providing affordable first- and last-mile connections. However, strategically expanding BSS into underserved neighborhoods is difficult due to uncertain bike-sharing demand at newly planned ("cold-start") station locations and limitations in traditional accessibility metrics that may overlook realistic bike usage potential. We introduce Transit for All (TFA), a spatial computing framework designed to guide the equitable expansion of BSS through three components: (1) spatially-informed bike-sharing demand prediction at cold-start stations using region representation learning that integrates multimodal geospatial data, (2) comprehensive transit accessibility assessment leveraging our novel weighted Public Transport Accessibility Level (wPTAL) by combining predicted bike-sharing demand with conventional transit accessibility metrics, and (3) strategic recommendations for new bike station placements that consider potential ridership and equity enhancement. Using NYC as a case study, we identify transit accessibility gaps that disproportionately impact low-income and minority communities in historically underserved neighborhoods. Our results show that strategically placing new stations guided by wPTAL notably reduces disparities in transit access related to economic and demographic factors. From our study, we demonstrate that TFA provides practical guidance for urban planners to promote equitable transit and enhance the quality of life in underserved urban communities.
https://arxiv.org/abs/2506.15113
Effective reinforcement learning (RL) for sepsis treatment depends on learning stable, clinically meaningful state representations from irregular ICU time series. While previous works have explored representation learning for this task, the critical challenge of training instability in sequential representations and its detrimental impact on policy performance has been overlooked. This work demonstrates that a Controlled Differential Equation (CDE) state representation can yield strong RL policies when two key conditions are met: (1) training stability is ensured through early stopping or stabilization methods, and (2) acuity-aware representations are enforced by correlation regularization with clinical scores (SOFA, SAPS-II, OASIS). Experiments on the MIMIC-III sepsis cohort reveal that a stable CDE autoencoder produces representations strongly correlated with acuity scores and enables RL policies with superior performance (WIS return $> 0.9$). In contrast, an unstable CDE representation leads to degraded representations and policy failure (WIS return $\sim$ 0). Visualizations of the latent space show that stable CDEs not only separate survivor and non-survivor trajectories but also reveal clear acuity score gradients, whereas unstable training fails to capture either pattern. These findings highlight practical guidelines for using CDEs to encode irregular medical time series in clinical RL, emphasizing the need for training stability in sequential representation learning.
https://arxiv.org/abs/2506.15019
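The WIS returns quoted in the sepsis entry above refer to weighted importance sampling, a standard off-policy evaluation estimator. The following is a minimal sketch of that estimator only, not the authors' evaluation code. The trajectory layout (a list of `(pi_e_prob, pi_b_prob, reward)` steps) and the function name `wis_return` are assumptions made for illustration.

```python
def wis_return(trajectories, gamma=1.0):
    """Weighted importance sampling estimate of the evaluation policy's return.

    Each trajectory is a list of (pi_e_prob, pi_b_prob, reward) tuples, where
    pi_e_prob / pi_b_prob are the probabilities that the evaluation and
    behaviour policies assign to the logged action (hypothetical layout).
    """
    weights, returns = [], []
    for traj in trajectories:
        weight, ret, discount = 1.0, 0.0, 1.0
        for pi_e_prob, pi_b_prob, reward in traj:
            weight *= pi_e_prob / pi_b_prob   # cumulative importance ratio
            ret += discount * reward          # discounted return of the logged trajectory
            discount *= gamma
        weights.append(weight)
        returns.append(ret)
    total = sum(weights)
    if total == 0.0:
        return 0.0                            # no trajectory is supported by the evaluation policy
    # Normalising by the sum of weights (not the count) is what makes this "weighted" IS.
    return sum(w * g for w, g in zip(weights, returns)) / total
```

Because the weights are normalised, the estimate always lies within the range of observed returns, which is why values near the maximum return (such as the reported $> 0.9$) are interpretable on a fixed scale.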
Image-based cell profiling aims to create informative representations of cell images. This technique is critical in drug discovery and has greatly advanced with recent improvements in computer vision. Inspired by recent developments in non-contrastive Self-Supervised Learning (SSL), this paper provides an initial exploration into training a generalizable feature extractor for cell images using such methods. However, there are two major challenges: 1) There is a large difference between the distributions of cell images and natural images, causing the view-generation process in existing SSL methods to fail; and 2) Unlike typical scenarios where each representation is based on a single image, cell profiling often involves multiple input images, making it difficult to effectively combine all available information. To overcome these challenges, we propose SSLProfiler, a non-contrastive SSL framework specifically designed for cell profiling. We introduce specialized data augmentation and representation post-processing methods tailored to cell images, which effectively address the issues mentioned above and result in a robust feature extractor. With these improvements, SSLProfiler won the Cell Line Transferability challenge at CVPR 2025.
https://arxiv.org/abs/2506.14265
Predicting individuals' next locations is a core task in human mobility modelling, with wide-ranging implications for urban planning, transportation, public policy and personalised mobility services. Traditional approaches largely depend on location embeddings learned from historical mobility patterns, limiting their ability to encode explicit spatial information, integrate rich urban semantic context, and accommodate previously unseen locations. To address these challenges, we explore the application of CaLLiPer -- a multimodal representation learning framework that fuses spatial coordinates and semantic features of points of interest through contrastive learning -- for location embedding in individual mobility prediction. CaLLiPer's embeddings are spatially explicit, semantically enriched, and inductive by design, enabling robust prediction performance even in scenarios involving emerging locations. Through extensive experiments on four public mobility datasets under both conventional and inductive settings, we demonstrate that CaLLiPer consistently outperforms strong baselines, particularly excelling in inductive scenarios. Our findings highlight the potential of multimodal, inductive location embeddings to advance the capabilities of human mobility prediction systems. We also release the code and data (this https URL) to foster reproducibility and future research.
https://arxiv.org/abs/2506.14070
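The CaLLiPer entry above describes aligning coordinate embeddings with POI semantic embeddings through contrastive learning. The abstract does not state the exact objective, so the sketch below assumes a one-directional InfoNCE loss with cosine similarity. The function name, the temperature value, and the pairing convention (row i of each list describes the same location) are all illustrative assumptions.

```python
import math


def info_nce(coord_emb, poi_emb, temperature=0.1):
    """One-directional InfoNCE loss: row i of coord_emb should match row i of poi_emb."""

    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]        # assumes non-zero vectors

    coords = [normalize(v) for v in coord_emb]
    pois = [normalize(v) for v in poi_emb]
    loss = 0.0
    for i, c in enumerate(coords):
        # Cosine similarity of this coordinate embedding to every POI embedding.
        logits = [sum(a * b for a, b in zip(c, p)) / temperature for p in pois]
        peak = max(logits)                    # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[i]   # negative log-probability of the true pair
    return loss / len(coords)
```

A coordinate encoder trained this way can embed any latitude/longitude pair, including locations absent from the mobility history, which is the inductive property the abstract emphasises.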
Contrastive self-supervised learning based on point-wise comparisons has been widely studied for vision tasks. In the visual cortex of the brain, neuronal responses to distinct stimulus classes are organized into geometric structures known as neural manifolds. Accurate classification of stimuli can be achieved by effectively separating these manifolds, akin to solving a packing problem. We introduce Contrastive Learning As Manifold Packing (CLAMP), a self-supervised framework that recasts representation learning as a manifold packing problem. CLAMP introduces a loss function inspired by the potential energy of short-range repulsive particle systems, such as those encountered in the physics of simple liquids and jammed packings. In this framework, each class consists of sub-manifolds embedding multiple augmented views of a single image. The sizes and positions of the sub-manifolds are dynamically optimized by following the gradient of a packing loss. This approach yields interpretable dynamics in the embedding space that parallel jamming physics, and introduces geometrically meaningful hyperparameters within the loss function. Under the standard linear evaluation protocol, which freezes the backbone and trains only a linear classifier, CLAMP achieves performance competitive with state-of-the-art self-supervised models. Furthermore, our analysis reveals that neural manifolds corresponding to different categories emerge naturally and are effectively separated in the learned representation space, highlighting the potential of CLAMP to bridge insights from physics, neuroscience, and machine learning.
https://arxiv.org/abs/2506.13717
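To make the packing analogy in the CLAMP entry above concrete, the sketch below evaluates a short-range repulsive energy of the kind used for soft spheres in jamming studies. It is an assumption that each sub-manifold can be summarised by a centre and a radius and that the energy is the harmonic overlap form; the abstract does not give CLAMP's actual loss, and `packing_energy` is a hypothetical name.

```python
import math


def packing_energy(centers, radii):
    """Harmonic soft-sphere repulsion: only overlapping pairs contribute."""
    energy = 0.0
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            distance = math.dist(centers[i], centers[j])
            contact = radii[i] + radii[j]       # distance at which the two balls just touch
            overlap = 1.0 - distance / contact  # positive only when the balls interpenetrate
            if overlap > 0.0:
                energy += 0.5 * overlap ** 2
    return energy
```

The energy is exactly zero once every pair is separated by at least the sum of its radii, so following its gradient pushes overlapping sub-manifolds apart and then stops, which mirrors the short-range character described in the abstract.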
Owing to its rapid progress and broad application prospects, few-shot action recognition has attracted considerable interest. However, current methods are predominantly based on limited single-modal data, which does not fully exploit the potential of multimodal information. This paper presents a novel framework that actively identifies reliable modalities for each sample using task-specific contextual cues, thus significantly improving recognition performance. Our framework integrates an Active Sample Inference (ASI) module, which utilizes active inference to predict reliable modalities based on posterior distributions and subsequently organizes them accordingly. Unlike reinforcement learning, active inference replaces rewards with evidence-based preferences, making more stable predictions. Additionally, we introduce an active mutual distillation module that enhances the representation learning of less reliable modalities by transferring knowledge from more reliable ones. Adaptive multimodal inference is employed during the meta-test to assign higher weights to reliable modalities. Extensive experiments across multiple benchmarks demonstrate that our method significantly outperforms existing approaches.
https://arxiv.org/abs/2506.13322
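The entry above says that adaptive multimodal inference assigns higher weights to reliable modalities at meta-test time. The abstract does not specify how the weights are computed, so the sketch below simply assumes a softmax over per-modality reliability scores followed by a weighted sum of class scores. The modality names and the function `fuse_modalities` are hypothetical.

```python
import math


def fuse_modalities(scores_by_modality, reliability):
    """Fuse per-modality class scores using softmax-normalised reliability weights."""
    peak = max(reliability.values())
    exps = {name: math.exp(value - peak) for name, value in reliability.items()}
    total = sum(exps.values())
    weights = {name: e / total for name, e in exps.items()}

    num_classes = len(next(iter(scores_by_modality.values())))
    fused = [
        sum(weights[name] * scores_by_modality[name][c] for name in scores_by_modality)
        for c in range(num_classes)
    ]
    return fused, weights


# Example: the RGB stream is judged more reliable than optical flow for this sample.
fused, weights = fuse_modalities(
    {"rgb": [0.9, 0.1], "flow": [0.2, 0.8]},
    {"rgb": 2.0, "flow": 0.0},
)
```

In the paper the reliability estimate comes from active inference over posterior distributions; here it is just an input number, so the sketch covers only the final weighting step.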
World models are critical for autonomous driving to simulate environmental dynamics and generate synthetic data. Existing methods struggle to disentangle ego-vehicle motion (perspective shifts) from scene evolution (agent interactions), leading to suboptimal predictions. Instead, we propose to separate environmental changes from ego-motion by leveraging scene-centric coordinate systems. In this paper, we introduce COME: a framework that integrates scene-centric forecasting Control into the Occupancy world ModEl. Specifically, COME first generates ego-irrelevant, spatially consistent future features through a scene-centric prediction branch, which are then converted into scene condition features using a tailored ControlNet. These condition features are subsequently injected into the occupancy world model, enabling more accurate and controllable future occupancy predictions. Experimental results on the nuScenes-Occ3D dataset show that COME achieves consistent and significant improvements over state-of-the-art (SOTA) methods across diverse configurations, including different input sources (ground-truth, camera-based, fusion-based occupancy) and prediction horizons (3s and 8s). For example, under the same settings, COME achieves a 26.3% better mIoU than DOME and a 23.7% better mIoU than UniScene. These results highlight the efficacy of disentangled representation learning in enhancing spatio-temporal prediction fidelity for world models. Code and videos will be available at this https URL.
https://arxiv.org/abs/2506.13260
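The COME results above are reported in mIoU, the mean intersection-over-union across semantic classes. A minimal sketch of the metric on flattened voxel labels follows. The function name and the choice to skip classes absent from both prediction and ground truth are assumptions, and benchmark implementations may handle ignored labels differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:                 # skip classes absent from both label sets
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, class 0 scores 1/2 and class 1 scores 2/3, giving an mIoU of 7/12.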
In recent years, the performance of lightweight Single-Image Super-Resolution (SISR) has improved significantly with the application of Convolutional Neural Networks (CNNs) and Large Kernel Attention (LKA). However, existing information distillation modules for lightweight SISR struggle to map inputs into High-Dimensional Non-Linear (HDNL) feature spaces, limiting their representation learning. Their LKA modules also have limited ability to capture multi-shape, multi-scale information for long-range dependencies, and they incur a quadratic increase in computational burden as the kernel size of the depth-wise convolutional layer grows. To address these issues, we first propose a Star Distillation Module (SDM) to enhance discriminative representation learning via information distillation in HDNL feature spaces. In addition, we present a Multi-shape Multi-scale Large Kernel Attention (MM-LKA) module that learns representative long-range dependencies with low computational and memory footprints, significantly improving the performance of CNN-based self-attention. Integrating SDM and MM-LKA, we develop a Residual Star Distillation Attention Module (RSDAM) and use it as the building block of the proposed efficient Star Distillation Attention Network (SDAN), which efficiently recovers a higher-quality image from its low-resolution (LR) counterpart. Extensive experiments show that, compared with other lightweight state-of-the-art SISR methods, our SDAN achieves superior quantitative and visual performance at low model complexity.
https://arxiv.org/abs/2506.12475
Multivariate Time Series Forecasting (MTSF) involves predicting future values of multiple interrelated time series. Recently, deep learning-based MTSF models have gained significant attention for their promising ability to mine semantics (global and local information) within MTS data. However, these models are pervasively susceptible to missing values caused by malfunctioning data collectors. These missing values not only disrupt the semantics of MTS, but their distribution also changes over time. Nevertheless, existing models lack robustness to such issues, leading to suboptimal forecasting performance. To this end, in this paper, we propose Multi-View Representation Learning (Merlin), which can help existing models achieve semantic alignment between incomplete observations with different missing rates and complete observations in MTS. Specifically, Merlin consists of two key modules: offline knowledge distillation and multi-view contrastive learning. The former utilizes a teacher model to guide a student model in mining semantics from incomplete observations, similar to those obtainable from complete observations. The latter improves the student model's robustness by learning from positive/negative data pairs constructed from incomplete observations with different missing rates, ensuring semantic alignment across different missing rates. Therefore, Merlin is capable of effectively enhancing the robustness of existing models against unfixed missing rates while preserving forecasting accuracy. Experiments on four real-world datasets demonstrate the superiority of Merlin.
https://arxiv.org/abs/2506.12459
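The Merlin entry above relies on views of the same observations at different missing rates. The sketch below shows one way such views could be generated for a single series by masking a fixed fraction of points at random. The masking scheme, the use of `None` as the missing marker, and the function names are assumptions, since the abstract does not describe how the pairs are built.

```python
import random


def mask_series(series, missing_rate, rng):
    """Return a copy of the series with a fraction of the values replaced by None."""
    num_missing = round(len(series) * missing_rate)
    missing_idx = set(rng.sample(range(len(series)), num_missing))
    return [None if i in missing_idx else value for i, value in enumerate(series)]


def make_views(series, missing_rates, seed=0):
    """Build one incomplete view of the same window per missing rate."""
    rng = random.Random(seed)   # seeded so the views are reproducible
    return {rate: mask_series(series, rate, rng) for rate in missing_rates}
```

Under this reading, views of the same window at different rates would act as positives for the contrastive module, while the unmasked window is what the teacher model sees during distillation.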
Effective multimodal reasoning depends on the alignment of visual and linguistic representations, yet the mechanisms by which vision-language models (VLMs) achieve this alignment remain poorly understood. We introduce a methodological framework that deliberately maintains a frozen large language model (LLM) and a frozen vision transformer (ViT), connected solely by training a linear adapter during visual instruction tuning. This design is fundamental to our approach: by keeping the language model frozen, we ensure it maintains its original language representations without adaptation to visual data. Consequently, the linear adapter must map visual features directly into the LLM's existing representational space rather than allowing the language model to develop specialized visual understanding through fine-tuning. Our experimental design uniquely enables the use of pre-trained sparse autoencoders (SAEs) of the LLM as analytical probes. These SAEs remain perfectly aligned with the unchanged language model and serve as a snapshot of the learned language feature representations. Through systematic analysis of SAE reconstruction error, sparsity patterns, and SAE feature descriptions, we reveal the layer-wise progression through which visual representations gradually align with language feature representations, converging in middle-to-later layers. This suggests a fundamental misalignment between ViT outputs and early LLM layers, raising important questions about whether current adapter-based architectures optimally facilitate cross-modal representation learning.
https://arxiv.org/abs/2506.11976
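The probing procedure in the entry above measures how well a frozen SAE reconstructs an activation and how many features fire. The sketch below assumes the common SAE form with a ReLU encoder and a linear decoder. The parameter names and the toy shapes are illustrative and are not taken from the paper.

```python
def sae_probe(x, w_enc, b_enc, w_dec, b_dec):
    """Return (reconstruction MSE, number of active features) for one activation vector."""
    # Encoder: one ReLU feature per row of w_enc.
    features = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + bias)
        for row, bias in zip(w_enc, b_enc)
    ]
    # Decoder: linear combination of per-feature directions (rows of w_dec).
    reconstruction = [
        sum(features[k] * w_dec[k][d] for k in range(len(features))) + b_dec[d]
        for d in range(len(x))
    ]
    mse = sum((a - b) ** 2 for a, b in zip(x, reconstruction)) / len(x)
    active = sum(1 for f in features if f > 0.0)   # L0 sparsity
    return mse, active
```

Running the same probe on visual-token activations at each LLM layer would give the layer-wise error and sparsity curves the abstract refers to: high error suggests the activation lies outside the space the SAE learned from text.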
Large Language Models (LLMs) are powerful tools with profound societal impacts, yet their ability to generate responses to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks. While existing defenses often struggle to generalize across varying attack types, recent advancements in representation engineering offer promising alternatives. In this work, we propose a defense framework that formulates model defense as a contrastive representation learning (CRL) problem. Our method fine-tunes a model using a triplet-based loss combined with adversarial hard negative mining to encourage separation between benign and harmful representations. Our experimental results across multiple models demonstrate that our approach outperforms prior representation engineering-based defenses, improving robustness against both input-level and embedding-space attacks without compromising standard performance. Our code is available at this https URL.
https://arxiv.org/abs/2506.11938
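The defense described above combines a triplet loss with hard negative mining. The sketch below uses the standard triplet margin loss and, as a simplification, picks the hardest negative from a fixed pool by nearest distance. The paper mines negatives adversarially, which this sketch does not reproduce, and the function name and margin value are assumptions.

```python
import math


def triplet_loss_hard_negative(anchor, positive, negatives, margin=1.0):
    """Triplet margin loss using the negative closest to the anchor."""
    # "Hard" negative: the one the current representation separates least well.
    hardest = min(negatives, key=lambda n: math.dist(anchor, n))
    gap = math.dist(anchor, positive) - math.dist(anchor, hardest)
    return max(0.0, gap + margin)   # zero once the negative is farther by at least the margin
```

With anchor `(0, 0)`, positive `(1, 0)`, and negatives `(5, 0)` and `(1.5, 0)`, the closer negative is selected and the loss is 0.5. With only the distant negative the loss is already zero, which is why mining hard negatives provides a more useful training signal.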
Self-supervised learning (SSL) has achieved major advances in natural images and video understanding, but challenges remain in domains like echocardiography (heart ultrasound) due to subtle anatomical structures, complex temporal dynamics, and the current lack of domain-specific pre-trained models. Existing SSL approaches such as contrastive, masked modeling, and clustering-based methods struggle with high intersample similarity, sensitivity to low PSNR inputs common in ultrasound, or aggressive augmentations that distort clinically relevant features. We present DISCOVR (Distilled Image Supervision for Cross Modal Video Representation), a self-supervised dual branch framework for cardiac ultrasound video representation learning. DISCOVR combines a clustering-based video encoder that models temporal dynamics with an online image encoder that extracts fine-grained spatial semantics. These branches are connected through a semantic cluster distillation loss that transfers anatomical knowledge from the evolving image encoder to the video encoder, enabling temporally coherent representations enriched with fine-grained semantic understanding. Evaluated on six echocardiography datasets spanning fetal, pediatric, and adult populations, DISCOVR outperforms both specialized video anomaly detection methods and state-of-the-art video-SSL baselines in zero-shot and linear probing setups, and achieves superior segmentation transfer.
https://arxiv.org/abs/2506.11777
Building energy management (BEM) tasks require processing and learning from a variety of time-series data. Existing solutions rely on bespoke task- and data-specific models to perform these tasks, limiting their broader applicability. Inspired by the transformative success of Large Language Models (LLMs), Time-Series Foundation Models (TSFMs), trained on diverse datasets, have the potential to change this. Were TSFMs to achieve a level of generalizability across tasks and contexts akin to LLMs, they could fundamentally address the scalability challenges pervasive in BEM. To understand where they stand today, we evaluate TSFMs across four dimensions: (1) generalizability in zero-shot univariate forecasting, (2) forecasting with covariates for thermal behavior modeling, (3) zero-shot representation learning for classification tasks, and (4) robustness to performance metrics and varying operational conditions. Our results reveal that TSFMs exhibit limited generalizability, performing only marginally better than statistical models on unseen datasets and modalities for univariate forecasting. Similarly, inclusion of covariates in TSFMs does not yield performance improvements, and their performance remains inferior to conventional models that utilize covariates. While TSFMs generate effective zero-shot representations for downstream classification tasks, they may remain inferior to statistical models in forecasting when statistical models perform test-time fitting. Moreover, TSFMs' forecasting performance is sensitive to evaluation metrics, and they struggle in more complex building environments compared to statistical models. These findings underscore the need for targeted advancements in TSFM design, particularly in handling covariates and incorporating context and temporal dynamics into prediction mechanisms, to develop more adaptable and scalable solutions for BEM.
https://arxiv.org/abs/2506.11250
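Two points in the BEM entry above lend themselves to a small illustration: statistical baselines fitted at test time, and the sensitivity of rankings to the evaluation metric. The sketch below assumes a seasonal-naive baseline and the MAE and MAPE metrics. These particular choices and the toy numbers are illustrative and are not taken from the study.

```python
def seasonal_naive(history, horizon, season):
    """Forecast by repeating the last observed season."""
    start = len(history) - season
    return [history[start + (h % season)] for h in range(horizon)]


def mae(actual, forecast):
    """Mean absolute error, in the units of the series."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    """Mean absolute percentage error; assumes no zero values in `actual`."""
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)


# Toy load values: one high-demand and one low-demand interval.
actual = [100.0, 10.0]
forecast_a = [90.0, 10.0]    # misses the large value by 10
forecast_b = [100.0, 14.0]   # misses the small value by 4
```

Here `forecast_b` has the lower MAE (2 versus 5) while `forecast_a` has the lower MAPE (5% versus 20%), so the preferred model changes with the metric. This is the kind of sensitivity the abstract cautions about.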
Learning medical visual representations from image-report pairs through joint learning has garnered increasing research attention due to its potential to alleviate the data scarcity problem in the medical domain. The primary challenges stem from the lengthy reports that feature complex discourse relations and semantic pathologies. Previous works have predominantly focused on instance-wise or token-wise cross-modal alignment, often neglecting the importance of pathological-level consistency. This paper presents a novel framework PLACE that promotes the Pathological-Level Alignment and enriches the fine-grained details via Correlation Exploration without additional human annotations. Specifically, we propose a novel pathological-level cross-modal alignment (PCMA) approach to maximize the consistency of pathology observations from both images and reports. To facilitate this, a Visual Pathology Observation Extractor is introduced to extract visual pathological observation representations from localized tokens. The PCMA module operates independently of any external disease annotations, enhancing the generalizability and robustness of our methods. Furthermore, we design a proxy task that enforces the model to identify correlations among image patches, thereby enriching the fine-grained details crucial for various downstream tasks. Experimental results demonstrate that our proposed framework achieves new state-of-the-art performance on multiple downstream tasks, including classification, image-to-text retrieval, semantic segmentation, object detection and report generation.
https://arxiv.org/abs/2506.10573
Faithful evaluation of language model capabilities is crucial for deriving actionable insights that can inform model development. However, rigorous causal evaluations in this domain face significant methodological challenges, including complex confounding effects and prohibitive computational costs associated with extensive retraining. To tackle these challenges, we propose a causal representation learning framework wherein observed benchmark performance is modeled as a linear transformation of a few latent capability factors. Crucially, these latent factors are identified as causally interrelated after appropriately controlling for the base model as a common confounder. Applying this approach to a comprehensive dataset encompassing over 1500 models evaluated across six benchmarks from the Open LLM Leaderboard, we identify a concise three-node linear causal structure that reliably explains the observed performance variations. Further interpretation of this causal structure provides substantial scientific insights beyond simple numerical rankings: specifically, we reveal a clear causal direction starting from general problem-solving capabilities, advancing through instruction-following proficiency, and culminating in mathematical reasoning ability. Our results underscore the essential role of carefully controlling base model variations during evaluation, a step critical to accurately uncovering the underlying causal relationships among latent model capabilities.
https://arxiv.org/abs/2506.10378
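The structure described above, a three-node linear causal chain over latent capabilities that is linearly mixed into benchmark scores, can be sketched as a small generative model. The edge coefficients `a` and `b`, the mixing matrix, and the function names are hypothetical placeholders, since the abstract reports the structure but not the fitted values.

```python
def latent_capabilities(noise, a, b):
    """Linear chain: general problem-solving -> instruction-following -> math reasoning."""
    e_general, e_instruction, e_math = noise
    general = e_general
    instruction = a * general + e_instruction
    math_reasoning = b * instruction + e_math
    return [general, instruction, math_reasoning]


def benchmark_scores(latents, mixing):
    """Observed scores as a linear transformation of the latent factors (one row per benchmark)."""
    return [sum(g * z for g, z in zip(row, latents)) for row in mixing]


# Hypothetical 6 x 3 mixing matrix: six benchmarks, three latent capabilities.
mixing = [
    [1.0, 0.0, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
    [0.3, 0.0, 0.7],
]
scores = benchmark_scores(latent_capabilities((1.0, 0.0, 0.0), a=0.5, b=2.0), mixing)
```

In this model a change to the first noise term propagates to all three capabilities, whereas a change to the last term affects only mathematical reasoning, which is what the stated causal direction implies.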
Behavioral cloning (BC) methods trained with supervised learning (SL) are an effective way to learn policies from human demonstrations in domains like robotics. Goal-conditioning these policies enables a single generalist policy to capture diverse behaviors contained within an offline dataset. While goal-conditioned behavior cloning (GCBC) methods can perform well on in-distribution training tasks, they do not necessarily generalize zero-shot to tasks that require conditioning on novel state-goal pairs, i.e., combinatorial generalization. In part, this limitation can be attributed to a lack of temporal consistency in the state representation learned by BC; if temporally related states are encoded to similar latent representations, then the out-of-distribution gap for novel state-goal pairs would be reduced. Hence, encouraging this temporal consistency in the representation space should facilitate combinatorial generalization. Successor representations, which encode the distribution of future states visited from the current state, nicely encapsulate this property. However, previous methods for learning successor representations have relied on contrastive samples, temporal-difference (TD) learning, or both. In this work, we propose a simple yet effective representation learning objective, $\text{BYOL-}\gamma$ augmented GCBC, which not only theoretically approximates the successor representation in the finite MDP case without contrastive samples or TD learning, but also achieves competitive empirical performance across a suite of challenging tasks requiring combinatorial generalization.
https://arxiv.org/abs/2506.10137
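As a minimal illustration of the successor representation mentioned in the abstract above (not the paper's $\text{BYOL-}\gamma$ objective), the sketch below computes the discounted successor representation of a small Markov chain by iterating the fixed point $M = I + \gamma P M$, whose solution is $(I - \gamma P)^{-1}$. The 3-state transition matrix `P`, the discount `gamma`, and the function name `successor_representation` are hypothetical choices made only for illustration.

```python
def successor_representation(P, gamma, iters=500):
    """Iterate M <- I + gamma * P @ M for a row-stochastic matrix P (list of lists)."""
    n = len(P)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [row[:] for row in identity]
    for _ in range(iters):
        # After k iterations, M equals the partial sum of gamma^t * P^t for t = 0..k.
        M = [
            [
                identity[i][j] + gamma * sum(P[i][k] * M[k][j] for k in range(n))
                for j in range(n)
            ]
            for i in range(n)
        ]
    return M


# Hypothetical 3-state chain: 0 -> 1 -> 2, with state 2 absorbing.
P = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
]
gamma = 0.9
M = successor_representation(P, gamma)
```

In this toy chain every row of `M` sums to $1/(1-\gamma)$, and the rows for states 0 and 1 are both dominated by the entry for state 2, which is one way to picture temporally related states receiving similar representations.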
Conditional diffusion models (CDMs) have shown impressive performance across a range of generative tasks. Their ability to model the full data distribution has opened new avenues for analysis-by-synthesis in downstream discriminative learning. However, this same modeling capacity causes CDMs to entangle the class-defining features with irrelevant context, posing challenges to extracting robust and interpretable representations. To this end, we identify Canonical LAtent Representations (CLAReps), latent codes whose internal CDM features preserve essential categorical information while discarding non-discriminative signals. When decoded, CLAReps produce representative samples for each class, offering an interpretable and compact summary of the core class semantics with minimal irrelevant details. Exploiting CLAReps, we develop a novel diffusion-based feature-distillation paradigm, CaDistill. While the student has full access to the training set, the CDM as teacher transfers core class knowledge only via CLAReps, which amounts to merely 10% of the training data in size. After training, the student achieves strong adversarial robustness and generalization ability, focusing more on the class signals instead of spurious background cues. Our findings suggest that CDMs can serve not just as image generators but also as compact, interpretable teachers that can drive robust representation learning.
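Conditional diffusion models (CDMs) show excellent performance on a variety of generative tasks. Their ability to model the full data distribution opens new avenues for analysis-by-synthesis in downstream discriminative learning. However, this strong modeling capacity also leads CDMs to entangle class-defining features with irrelevant context, which makes extracting robust and interpretable representations challenging. To address this, we identify Canonical LAtent Representations (CLAReps), latent codes whose internal CDM features retain key categorical information while discarding non-discriminative signals. When decoded, CLAReps produce representative samples for each class and provide a compact, interpretable summary of the core class semantics with minimal irrelevant detail. Using CLAReps, we develop a novel diffusion-based feature-distillation paradigm, CaDistill. The student model has full access to the training set, while the CDM, acting as teacher, transfers core class knowledge to the student only through CLAReps, which amount to just 10% of the training data in size. After training, the student model shows strong adversarial robustness and generalization ability, and focuses more on class signals than on spurious background cues. Our findings suggest that CDMs can serve not only as image generators but also as compact, interpretable teachers that drive robust representation learning.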
https://arxiv.org/abs/2506.09955
The scale diversity of point cloud data presents significant challenges in developing unified representation learning techniques for 3D vision. Currently, there are few unified 3D models, and no existing pre-training method is equally effective for both object- and scene-level point clouds. In this paper, we introduce UniPre3D, the first unified pre-training method that can be seamlessly applied to point clouds of any scale and 3D models of any architecture. Our approach predicts Gaussian primitives as the pre-training task and employs differentiable Gaussian splatting to render images, enabling precise pixel-level supervision and end-to-end optimization. To further regulate the complexity of the pre-training task and direct the model's focus toward geometric structures, we integrate 2D features from pre-trained image models to incorporate well-established texture knowledge. We validate the universal effectiveness of our proposed method through extensive experiments across a variety of object- and scene-level tasks, using diverse point cloud models as backbones. Code is available at this https URL.
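The scale diversity of point cloud data poses significant challenges for developing unified representation learning techniques in 3D vision. At present, few unified 3D models exist, and no existing pre-training method is equally effective for both object-level and scene-level point clouds. In this paper, we introduce UniPre3D, the first unified pre-training method that can be applied seamlessly to point clouds of any scale and 3D models of any architecture. Our method uses the prediction of Gaussian primitives as the pre-training task and employs differentiable Gaussian splatting to render images, enabling precise pixel-level supervision and end-to-end optimization. To further regulate the complexity of the pre-training task and steer the model toward geometric structures, we integrate 2D features from pre-trained image models to incorporate well-established texture knowledge. We validate the universal effectiveness of the proposed method through extensive experiments on a variety of object-level and scene-level tasks, using several point cloud models as backbones. Code is available at the provided link.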
https://arxiv.org/abs/2506.09952
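The rendering step described above can be pictured with the standard front-to-back alpha compositing rule used in splatting-style renderers, where a pixel's color is $\sum_i c_i \alpha_i \prod_{j<i} (1 - \alpha_j)$ over depth-sorted primitives. The sketch below is an illustration for a single pixel under simplifying assumptions, not UniPre3D's renderer: the function name `composite_pixel` and the sample colors and opacities are hypothetical, and the per-pixel opacities are taken as given rather than derived from projected Gaussians.

```python
def composite_pixel(colors, alphas):
    """Blend depth-sorted (near to far) RGB colors with per-primitive opacities."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed by nearer primitives
    for color, alpha in zip(colors, alphas):
        weight = alpha * transmittance
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel, transmittance


# Hypothetical pixel covered by two primitives: red in front, blue behind.
pixel, remaining = composite_pixel(
    colors=[(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
    alphas=[0.5, 0.5],
)
```

Each operation here is a product or a sum of the colors and opacities, so the blended pixel is differentiable with respect to them; that property is what allows a pixel-level image loss to supervise predicted primitives end to end.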
Hyperspectral image (HSI) clustering assigns similar pixels to the same class without any annotations, which is an important yet challenging task. For large-scale HSIs, most methods rely on superpixel segmentation and perform superpixel-level clustering based on graph neural networks (GNNs). However, existing GNNs cannot fully exploit the spectral information of the input HSI, and the inaccurate superpixel topological graph may lead to the confusion of different class semantics during information aggregation. To address these challenges, we first propose a structural-spectral graph convolutional operator (SSGCO) tailored for graph-structured HSI superpixels to improve their representation quality through the co-extraction of spatial and spectral features. Second, we propose an evidence-guided adaptive edge learning (EGAEL) module that adaptively predicts and refines edge weights in the superpixel topological graph. We integrate the proposed method into a contrastive learning framework to achieve clustering, where representation learning and clustering are simultaneously conducted. Experiments demonstrate that the proposed method improves clustering accuracy by 2.61%, 6.06%, 4.96% and 3.15% over the best compared methods on four HSI datasets. Our code is available at this https URL.
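Hyperspectral image (HSI) clustering assigns similar pixels to the same class without any annotations, an important yet challenging task. For large-scale HSIs, most methods rely on superpixel segmentation and perform superpixel-level clustering with graph neural networks (GNNs). However, existing GNNs cannot fully exploit the spectral information of the input HSI, and an inaccurate superpixel topological graph may cause confusion between the semantics of different classes during information aggregation. To address these problems, we first propose a structural-spectral graph convolutional operator (SSGCO), designed specifically for graph-structured HSI superpixels, which improves their representation quality by jointly extracting spatial and spectral features. Second, we propose an evidence-guided adaptive edge learning (EGAEL) module that adaptively predicts and refines the edge weights in the superpixel topological graph. We integrate the proposed method into a contrastive learning framework to achieve clustering, in which representation learning and clustering are carried out simultaneously. Experiments show that our method improves clustering accuracy by 2.61%, 6.06%, 4.96%, and 3.15% over the best compared methods on four HSI datasets. Our code is available at the provided URL.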
https://arxiv.org/abs/2506.09920
Traditional models of climate change use complex systems of coupled equations to simulate physical processes across the Earth system. These simulations are highly computationally expensive, limiting our predictions of climate change and analyses of its causes and effects. Machine learning has the potential to quickly emulate data from climate models, but current approaches are not able to incorporate physics-informed causal relationships. Here, we develop an interpretable climate model emulator based on causal representation learning. We derive a physics-informed approach including a Bayesian filter for stable long-term autoregressive emulation. We demonstrate that our emulator learns accurate climate dynamics, and we show the importance of each one of its components on a realistic synthetic dataset and data from two widely deployed climate models.
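Traditional climate models use complex systems of coupled equations to simulate physical processes across the Earth system. These simulations are computationally expensive, which limits our ability to predict climate change and analyze its causes and effects. Machine learning has the potential to emulate climate model data quickly, but current approaches cannot incorporate physics-informed causal relationships. Here, we develop an interpretable climate model emulator based on causal representation learning. We propose a physics-informed approach that includes a Bayesian filter for stable long-term autoregressive emulation. On a realistic synthetic dataset and on data from two widely used climate models, we demonstrate that our emulator learns accurate climate dynamics, and we show the importance of each of its components.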
https://arxiv.org/abs/2506.09891