Recent advances in deep learning have encouraged the development of large automatic speech recognition (ASR) models that achieve promising results while ignoring computational and memory constraints. However, deploying such models on low-resource devices is impractical despite their favorable performance. Existing approaches (pruning, distillation, layer skipping, etc.) either transform large models into smaller ones at the cost of significant performance degradation or require prolonged training of smaller models to achieve better performance. To address these issues, we introduce an effective two-step representation-learning-based approach capable of producing several small models from a single large model while ensuring considerably better performance within a limited number of epochs. Comprehensive experiments on ASR benchmarks reveal the efficacy of our approach, achieving a three-fold training speed-up and up to 12.54% word error rate improvement.
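The abstract does not spell out the two steps. Purely as a hedged illustration of a "match the large model's representations, then briefly fine-tune on the task" recipe (PyTorch assumed; every module, dimension, and loss choice below is invented for the sketch, not taken from the paper):

```python
import torch
import torch.nn as nn

class SmallEncoder(nn.Module):
    """Hypothetical compact ASR encoder (all sizes are placeholders)."""
    def __init__(self, feat_dim=80, hidden=256, teacher_dim=1024, vocab=32):
        super().__init__()
        self.backbone = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.to_teacher = nn.Linear(hidden, teacher_dim)  # projection used in step 1
        self.to_vocab = nn.Linear(hidden, vocab)          # CTC head used in step 2

    def forward(self, feats):                 # feats: (B, T, feat_dim)
        h, _ = self.backbone(feats)
        return h                              # (B, T, hidden)

def step1_representation_loss(student, feats, teacher_feats):
    # Step 1 (assumed): align the small model's representations with those of
    # the frozen large model.
    h = student(feats)
    return nn.functional.mse_loss(student.to_teacher(h), teacher_feats)

def step2_asr_loss(student, feats, feat_lens, targets, target_lens):
    # Step 2 (assumed): brief task fine-tuning with a CTC objective.
    h = student(feats)
    log_probs = student.to_vocab(h).log_softmax(-1).transpose(0, 1)  # (T, B, V)
    return nn.functional.ctc_loss(log_probs, targets, feat_lens, target_lens)
```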
https://arxiv.org/abs/2505.16991
Remote Sensing Image-Text Retrieval (RSITR) plays a critical role in geographic information interpretation, disaster monitoring, and urban planning by establishing semantic associations between images and textual descriptions. Existing Parameter-Efficient Fine-Tuning (PEFT) methods for Vision-and-Language Pre-training (VLP) models typically adopt symmetric adapter structures to explore cross-modal correlations. However, the strong discriminative nature of the text modality may dominate the optimization process and inhibit image representation learning. This non-negligible imbalance in cross-modal optimization remains a bottleneck to improving model performance. To address this issue, this study proposes a Representation Discrepancy Bridging (RDB) method for the RSITR task. On the one hand, a Cross-Modal Asymmetric Adapter (CMAA) is designed to enable modality-specific optimization and improve feature alignment. The CMAA comprises a Visual Enhancement Adapter (VEA) and a Text Semantic Adapter (TSA). The VEA mines fine-grained image features through a Differential Attention (DA) mechanism, while the TSA identifies key textual semantics through a Hierarchical Attention (HA) mechanism. On the other hand, this study extends the traditional single-task retrieval framework to a dual-task optimization framework and develops a Dual-Task Consistency Loss (DTCL). The DTCL improves cross-modal alignment robustness through an adaptive weighted combination of cross-modal, classification, and exponential moving average consistency constraints. Experiments on the RSICD and RSITMD datasets show that the proposed RDB method achieves a 6%-11% improvement in mR metrics compared to state-of-the-art PEFT methods and a 1.15%-2% improvement over the fully fine-tuned GeoRSCLIP model.
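The abstract describes the DTCL only as an adaptive weighted combination of cross-modal, classification, and EMA-consistency constraints. A minimal sketch of one plausible weighting scheme (PyTorch assumed; the learnable log-variance weighting and all names are assumptions, not the authors' formulation):

```python
import torch
import torch.nn as nn

class DualTaskConsistencyLoss(nn.Module):
    """Hedged sketch of a DTCL-style objective: an adaptively weighted sum of a
    cross-modal loss, a classification loss, and an EMA-consistency loss. The
    paper's actual weighting rule is not given in the abstract; a learnable
    log-variance weighting is used here as a stand-in."""
    def __init__(self):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(3))

    def forward(self, l_xmodal, l_cls, l_ema):
        losses = torch.stack([l_xmodal, l_cls, l_ema])
        weights = torch.exp(-self.log_vars)          # larger loss variance -> smaller weight
        return (weights * losses + self.log_vars).sum()

def ema_consistency(student_emb, ema_emb):
    # Consistency between current embeddings and an exponential-moving-average copy.
    return nn.functional.mse_loss(student_emb, ema_emb.detach())
```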
https://arxiv.org/abs/2505.16756
To encode point clouds containing both geometry and attributes, most learning-based compression schemes treat geometry and attribute coding separately, employing distinct encoders and decoders. This not only increases computational complexity but also fails to fully exploit shared features between geometry and attributes. To address this limitation, we propose SEDD-PCC, an end-to-end learning-based framework for lossy point cloud compression that jointly compresses geometry and attributes. SEDD-PCC employs a single encoder to extract shared geometric and attribute features into a unified latent space, followed by dual specialized decoders that sequentially reconstruct geometry and attributes. Additionally, we incorporate knowledge distillation to enhance feature representation learning from a teacher model, further improving coding efficiency. With its simple yet effective design, SEDD-PCC provides an efficient and practical solution for point cloud compression. Comparative evaluations against both rule-based and learning-based methods demonstrate its competitive performance, highlighting SEDD-PCC as a promising AI-driven compression approach.
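A minimal sketch of the single-encoder, dual-decoder layout described above (PyTorch assumed; dense 3D convolutions stand in for whatever sparse point-cloud backbone the paper actually uses, and all channel sizes are placeholders):

```python
import torch
import torch.nn as nn

class SEDDLikeCodec(nn.Module):
    """Illustrative single-encoder, dual-decoder layout (not the authors' code).
    Real point-cloud codecs operate on sparse voxel tensors; dense 3D convs are
    used here only to keep the sketch self-contained."""
    def __init__(self, ch=32, latent=16):
        super().__init__()
        # One shared encoder consumes occupancy + color channels jointly.
        self.encoder = nn.Sequential(
            nn.Conv3d(4, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(ch, latent, 3, stride=2, padding=1),
        )
        # Geometry is decoded from the unified latent ...
        self.geo_decoder = nn.Sequential(
            nn.ConvTranspose3d(latent, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(ch, 1, 4, stride=2, padding=1),
        )
        # ... and attributes are decoded from the same latent afterwards.
        self.attr_decoder = nn.Sequential(
            nn.ConvTranspose3d(latent, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(ch, 3, 4, stride=2, padding=1),
        )

    def forward(self, voxels):            # voxels: (B, 4, D, H, W) = occupancy + RGB
        z = self.encoder(voxels)          # shared latent for geometry and attributes
        geometry = self.geo_decoder(z)    # occupancy logits
        attributes = self.attr_decoder(z) # reconstructed colors
        return z, geometry, attributes
```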
https://arxiv.org/abs/2505.16709
We introduce X-ARES (eXtensive Audio Representation and Evaluation Suite), a novel open-source benchmark designed to systematically assess audio encoder performance across diverse domains. Encompassing tasks that span speech, environmental sounds, and music, X-ARES provides two approaches for evaluating audio representations: linear fine-tuning and unparameterized evaluation. The framework includes 22 distinct tasks that cover essential aspects of audio processing, from speech recognition and emotion detection to sound event classification and music genre identification. Our extensive evaluation of state-of-the-art audio encoders reveals significant performance variations across different tasks and domains, highlighting the complexity of general audio representation learning.
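A minimal sketch of the linear fine-tuning (linear-probe) protocol such a benchmark typically runs on frozen encoder embeddings (PyTorch assumed; this is a generic probe, not the X-ARES implementation):

```python
import torch
import torch.nn as nn

def linear_probe(embeddings, labels, num_classes, epochs=20, lr=1e-3):
    """Train a single linear layer on frozen encoder embeddings and report
    accuracy (a generic sketch of linear evaluation)."""
    probe = nn.Linear(embeddings.shape[1], num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(probe(embeddings), labels)
        loss.backward()
        opt.step()
    return (probe(embeddings).argmax(1) == labels).float().mean().item()

# The unparameterized track could instead score the frozen embeddings directly,
# e.g. with a k-nearest-neighbour classifier, without training any task head.
```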
https://arxiv.org/abs/2505.16369
Motion forecasting is a critical challenge in autonomous driving systems, requiring accurate prediction of surrounding agents' future trajectories. While existing approaches predict future motion states from scene context features extracted from historical agent trajectories and road layouts, they suffer from information degradation during scene feature encoding. To address this limitation, we propose HAMF, a novel motion forecasting framework that learns future motion representations jointly with the scene context encoding, coherently combining scene understanding and future motion state prediction. We first embed the observed agent states and map information into 1D token sequences, together with the target multi-modal future motion features as a set of learnable tokens. We then design a unified attention-based encoder that synergistically combines self-attention and cross-attention mechanisms to model the scene context information and aggregate future motion features jointly. Complementing the encoder, we employ a Mamba module in the decoding stage to further preserve the consistency and correlations among the learned future motion representations and to generate accurate and diverse final trajectories. Extensive experiments on the Argoverse 2 benchmark demonstrate that our hybrid Attention-Mamba model achieves state-of-the-art motion forecasting performance with a simple and lightweight architecture.
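A minimal sketch of the token setup and attention encoder described above (PyTorch assumed; dimensions are invented and the Mamba decoding stage is omitted):

```python
import torch
import torch.nn as nn

class AttentionSceneEncoder(nn.Module):
    """Illustrative HAMF-style encoder: scene tokens (agents + map) and a set of
    learnable future-motion query tokens, processed with self- and
    cross-attention. Not the authors' implementation."""
    def __init__(self, dim=128, num_modes=6, num_heads=8):
        super().__init__()
        self.motion_queries = nn.Parameter(torch.randn(num_modes, dim))
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, scene_tokens):           # (B, N, dim): agent + map tokens
        B = scene_tokens.size(0)
        # Self-attention models interactions within the scene context.
        ctx, _ = self.self_attn(scene_tokens, scene_tokens, scene_tokens)
        ctx = self.norm1(ctx + scene_tokens)
        # Cross-attention lets each motion-mode query aggregate scene features.
        q = self.motion_queries.unsqueeze(0).expand(B, -1, -1)
        motion, _ = self.cross_attn(q, ctx, ctx)
        return self.norm2(motion + q)           # (B, num_modes, dim)
```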
https://arxiv.org/abs/2505.15703
With recent breakthroughs in large-scale modeling, the Segment Anything Model (SAM) has demonstrated significant potential in a variety of visual applications. However, due to the lack of underwater domain expertise, SAM and its variants face performance limitations in end-to-end underwater instance segmentation tasks, while their higher computational requirements further hinder their application in underwater scenarios. To address this challenge, we propose a large-scale underwater instance segmentation dataset, UIIS10K, which includes 10,048 images with pixel-level annotations for 10 categories. Then, we introduce UWSAM, an efficient model designed for automatic and accurate segmentation of underwater instances. UWSAM efficiently distills knowledge from the SAM ViT-Huge image encoder into the smaller ViT-Small image encoder via the Mask GAT-based Underwater Knowledge Distillation (MG-UKD) method for effective visual representation learning. Furthermore, we design an End-to-end Underwater Prompt Generator (EUPG) for UWSAM, which automatically generates underwater prompts instead of explicitly providing foreground points or boxes as prompts, thus enabling the network to locate underwater instances accurately for efficient segmentation. Comprehensive experimental results show that our model is effective, achieving significant performance improvements over state-of-the-art methods on multiple underwater instance datasets. Datasets and codes are available at this https URL.
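A simplified sketch of the feature-distillation idea behind MG-UKD (PyTorch assumed; the mask-GAT component is not reproduced, and the widths are just typical ViT-Small/ViT-Huge values):

```python
import torch
import torch.nn as nn

class FeatureDistillation(nn.Module):
    """Stand-in for MG-UKD: align the small student encoder's patch features
    with the frozen SAM ViT-Huge teacher's features via a learned projection.
    The graph-attention and masking details of the actual method are omitted."""
    def __init__(self, student_dim=384, teacher_dim=1280):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_tokens, teacher_tokens, mask=None):
        # student_tokens: (B, N, student_dim); teacher_tokens: (B, N, teacher_dim)
        diff = (self.proj(student_tokens) - teacher_tokens.detach()) ** 2
        if mask is not None:                 # optionally restrict to masked patches
            diff = diff * mask.unsqueeze(-1)
        return diff.mean()
```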
https://arxiv.org/abs/2505.15581
Pitch manipulation is the process by which producers adjust the pitch of an audio segment to a specific key and intonation, a step that is essential in music production. Neural-network-based pitch-manipulation systems have become popular in recent years due to their superior synthesis quality compared with classical DSP methods. However, their performance is still limited by inaccurate feature disentanglement in source-filter models and by the lack of paired in-tune and out-of-tune training data. This work proposes Neurodyne to address these issues. Specifically, Neurodyne uses adversarial representation learning to learn a pitch-independent latent representation that avoids inaccurate disentanglement, and cycle-consistency training to create paired training data implicitly. Experimental results on global-key and template-based pitch manipulation demonstrate the effectiveness of the proposed system, showing improved synthesis quality while maintaining the original singer identity.
https://arxiv.org/abs/2505.15368
Omni-domain infrared small target detection (IRSTD) poses formidable challenges, as a single model must seamlessly adapt to diverse imaging systems, varying resolutions, and multiple spectral bands simultaneously. Current approaches predominantly rely on visual-only modeling paradigms that not only struggle with complex background interference and inherently scarce target features, but also exhibit limited generalization capabilities across complex omni-scene environments where significant domain shifts and appearance variations occur. In this work, we reveal a critical oversight in existing paradigms: the neglect of readily available auxiliary metadata describing imaging parameters and acquisition conditions, such as spectral bands, sensor platforms, resolution, and observation perspectives. To address this limitation, we propose the Auxiliary Metadata Driven Infrared Small Target Detector (AuxDet), a novel multi-modal framework that fundamentally reimagines the IRSTD paradigm by incorporating textual metadata for scene-aware optimization. Through a high-dimensional fusion module based on multi-layer perceptrons (MLPs), AuxDet dynamically integrates metadata semantics with visual features, guiding adaptive representation learning for each individual sample. Additionally, we design a lightweight prior-initialized enhancement module using 1D convolutional blocks to further refine fused features and recover fine-grained target cues. Extensive experiments on the challenging WideIRSTD-Full benchmark demonstrate that AuxDet consistently outperforms state-of-the-art methods, validating the critical role of auxiliary information in improving robustness and accuracy in omni-domain IRSTD tasks. Code is available at this https URL.
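A minimal sketch of an MLP-based metadata-visual fusion of the kind described above (PyTorch assumed; the FiLM-style scale/shift modulation and all dimensions are assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class MetadataFusion(nn.Module):
    """Hedged sketch of AuxDet-style fusion: an encoded metadata vector
    (spectral band, platform, resolution, viewpoint, ...) modulates per-sample
    visual features through an MLP."""
    def __init__(self, meta_dim=64, feat_channels=256, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(meta_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * feat_channels),
        )

    def forward(self, visual_feat, meta_emb):
        # visual_feat: (B, C, H, W); meta_emb: (B, meta_dim)
        scale, shift = self.mlp(meta_emb).chunk(2, dim=-1)
        scale = scale[:, :, None, None]
        shift = shift[:, :, None, None]
        return visual_feat * (1 + scale) + shift   # sample-adaptive feature modulation
```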
https://arxiv.org/abs/2505.15184
Generalized gait recognition, which aims to achieve robust performance across diverse domains, remains a challenging problem due to severe domain shifts in viewpoints, appearances, and environments. While mixed-dataset training is widely used to enhance generalization, it introduces new obstacles including inter-dataset optimization conflicts and redundant or noisy samples, both of which hinder effective representation learning. To address these challenges, we propose a unified framework that systematically improves cross-domain gait recognition. First, we design a disentangled triplet loss that isolates supervision signals across datasets, mitigating gradient conflicts during optimization. Second, we introduce a targeted dataset distillation strategy that filters out the least informative 20% of training samples based on feature redundancy and prediction uncertainty, enhancing data efficiency. Extensive experiments on CASIA-B, OU-MVLP, Gait3D, and GREW demonstrate that our method significantly improves cross-dataset recognition for both GaitBase and DeepGaitV2 backbones, without sacrificing source-domain accuracy. Code will be released at this https URL.
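A hedged sketch of one plausible reading of the disentangled triplet loss, computing triplets strictly within each source dataset so that no gradient mixes datasets (PyTorch assumed; this is an interpretation, not the authors' code):

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.2)

def disentangled_triplet_loss(anchor, positive, negative, dataset_ids):
    """Compute the triplet loss separately per source dataset and average, so
    that supervision signals from different datasets stay isolated."""
    losses = []
    for d in dataset_ids.unique():
        m = dataset_ids == d
        if m.any():
            losses.append(triplet(anchor[m], positive[m], negative[m]))
    return torch.stack(losses).mean()
```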
https://arxiv.org/abs/2505.15176
This paper proposes a single-stage training approach that semantically aligns three modalities (audio, visual, and text) using a contrastive learning framework. Contrastive training has gained prominence for multimodal alignment, utilizing large-scale unlabeled data to learn shared representations. Existing deep learning approaches for trimodal alignment involve two stages that separately align the visual-text and audio-text modalities; this two-stage approach suffers from mismatched data distributions, resulting in suboptimal alignment. Leveraging the AVCaps dataset, which provides audio, visual, and audio-visual captions for video clips, our method jointly optimizes the representations of all modalities using contrastive training. Our results demonstrate that the single-stage approach outperforms the two-stage method, achieving a two-fold improvement in audio-based visual retrieval and highlighting the advantages of unified multimodal representation learning.
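A minimal sketch of single-stage trimodal contrastive training with a symmetric InfoNCE over all three modality pairings (PyTorch assumed; the pairing scheme and temperature are illustrative simplifications of the AVCaps setup):

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings (matched by index)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def trimodal_loss(audio_emb, visual_emb, text_emb):
    # Single-stage: all three pairings are optimized jointly rather than in
    # separate visual-text and audio-text stages.
    return (info_nce(audio_emb, text_emb) +
            info_nce(visual_emb, text_emb) +
            info_nce(audio_emb, visual_emb)) / 3
```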
https://arxiv.org/abs/2505.14562
Semi-supervised learning (SSL) has achieved significant progress in medical image segmentation (SSMIS) through effective utilization of limited labeled data. While current SSL methods for medical images predominantly rely on consistency regularization and pseudo-labeling, they often overlook transferable semantic relationships across different clinical domains and imaging modalities. To address this, we propose TransMedSeg, a novel transferable semantic framework for semi-supervised medical image segmentation. Our approach introduces a Transferable Semantic Augmentation (TSA) module, which implicitly enhances feature representations by aligning domain-invariant semantics through cross-domain distribution matching and intra-domain structural preservation. Specifically, TransMedSeg constructs a unified feature space where teacher network features are adaptively augmented towards student network semantics via a lightweight memory module, enabling implicit semantic transformation without explicit data generation. Interestingly, this augmentation is implicitly realized through an expected transferable cross-entropy loss computed over the augmented teacher distribution. An upper bound of the expected loss is theoretically derived and minimized during training, incurring negligible computational overhead. Extensive experiments on medical image datasets demonstrate that TransMedSeg outperforms existing semi-supervised methods, establishing a new direction for transferable representation learning in medical image analysis.
https://arxiv.org/abs/2505.14753
Tactile perception is profoundly influenced by the surface properties of objects in contact. However, despite their crucial role in shaping tactile experiences, these material characteristics have been largely neglected in existing tactile representation learning methods. Most approaches primarily focus on aligning tactile data with visual or textual information, overlooking the richness of tactile feedback that comes from understanding the materials' inherent properties. In this work, we address this gap by revisiting the tactile representation learning framework and incorporating material-aware priors into the learning process. These priors, which represent pre-learned characteristics specific to different materials, allow tactile models to better capture and generalize the nuances of surface texture. Our method enables more accurate, contextually rich tactile feedback across diverse materials and textures, improving performance in real-world applications such as robotics, haptic feedback systems, and material editing.
https://arxiv.org/abs/2505.14319
With the rapid advancement of unmanned aerial vehicle (UAV) and missile technologies, perimeter-defense games between attackers and defenders for the protection of critical regions have become increasingly complex and strategically significant across a wide range of domains. However, existing studies predominantly focus on small-scale, simplified two-dimensional scenarios, often overlooking realistic environmental perturbations, motion dynamics, and inherent heterogeneity; these factors pose substantial challenges to real-world applicability. To bridge this gap, we investigate a large-scale heterogeneous perimeter-defense game in a three-dimensional setting, incorporating realistic elements such as motion dynamics and wind fields. We derive the Nash equilibrium strategies for both attackers and defenders, characterize the victory regions, and validate our theoretical findings through extensive simulations. To tackle the large-scale heterogeneous control challenges in defense strategies, we propose an Embedded Mean-Field Actor-Critic (EMFAC) framework. EMFAC leverages representation learning to enable high-level action aggregation in a mean-field manner, supporting scalable coordination among defenders. Furthermore, we introduce a lightweight agent-level attention mechanism based on reward representation, which selectively filters observations and mean-field information to enhance decision-making efficiency and accelerate convergence in large-scale tasks. Extensive simulations across varying scales demonstrate the effectiveness and adaptability of EMFAC, which outperforms established baselines in both convergence speed and overall performance. To further validate its practicality, we test EMFAC in small-scale real-world experiments and conduct detailed analyses, offering deeper insights into the framework's effectiveness in complex scenarios.
https://arxiv.org/abs/2505.14209
We introduce Perceptual-Initialization (PI), a paradigm shift in visual representation learning that incorporates human perceptual structure during the initialization phase rather than as a downstream fine-tuning step. By integrating human-derived triplet embeddings from the NIGHTS dataset to initialize a CLIP vision encoder, followed by self-supervised learning on YFCC15M, our approach demonstrates significant zero-shot performance improvements, without any task-specific fine-tuning, across 29 zero-shot classification and 2 retrieval benchmarks. On ImageNet-1K, zero-shot gains emerge after approximately 15 epochs of pretraining. Benefits are observed across datasets of various scales, with improvements manifesting at different stages of the pretraining process depending on dataset characteristics. Our approach consistently enhances zero-shot top-1 accuracy, top-5 accuracy, and retrieval recall (e.g., R@1, R@5) across these diverse evaluation tasks, without requiring any adaptation to target domains. These findings challenge the conventional wisdom of using human-perceptual data primarily for fine-tuning and demonstrate that embedding human perceptual structure during early representation learning yields more capable and vision-language aligned systems that generalize immediately to unseen tasks. Our work shows that "beginning with you", starting with human perception, provides a stronger foundation for general-purpose vision-language intelligence.
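A hedged sketch of what the perceptual-initialization step could look like: fitting the vision encoder to human similarity triplets before the usual image-text contrastive pretraining (PyTorch assumed; the loss form is an assumption, not the paper's exact objective):

```python
import torch
import torch.nn.functional as F

def perceptual_triplet_loss(encoder, anchor, positive, negative, margin=0.2):
    """Fit the vision encoder to human-judged triplets (anchor closer to the
    positive than to the negative, as in NIGHTS-style data) before image-text
    contrastive pretraining begins."""
    za = F.normalize(encoder(anchor), dim=-1)
    zp = F.normalize(encoder(positive), dim=-1)
    zn = F.normalize(encoder(negative), dim=-1)
    d_pos = 1 - (za * zp).sum(-1)          # cosine distance to the positive
    d_neg = 1 - (za * zn).sum(-1)          # cosine distance to the negative
    return F.relu(d_pos - d_neg + margin).mean()
```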
https://arxiv.org/abs/2505.14204
Despite widespread adoption, multimodal large language models (MLLMs) suffer performance degradation when encountering unfamiliar queries under distribution shifts. Existing methods to improve MLLM generalization typically require either more instruction data or larger advanced model architectures, both of which incur non-trivial human labor or computational costs. In this work, we take an alternative approach to enhance the robustness of MLLMs under distribution shifts, from a representation learning perspective. Inspired by the information bottleneck (IB) principle, we derive a variational lower bound of the IB for MLLMs and devise a practical implementation, Visual Instruction Bottleneck Tuning (Vittle). We then provide a theoretical justification of Vittle by revealing its connection to an information-theoretic robustness metric of MLLM. Empirical validation of three MLLMs on open-ended and closed-form question answering and object hallucination detection tasks over 45 datasets, including 30 shift scenarios, demonstrates that Vittle consistently improves the MLLM's robustness under shifts by pursuing the learning of a minimal sufficient representation.
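For orientation, the standard variational information-bottleneck objective (Alemi et al., 2017) is shown below; Vittle's MLLM-specific bound may differ in form, so this is illustrative only, with encoder p_theta, variational decoder q_phi, and prior r(z):

```latex
% Standard variational IB objective, minimized over (theta, phi):
\mathcal{L}_{\mathrm{VIB}}
  = \mathbb{E}_{p(x,y)}\,\mathbb{E}_{p_\theta(z\mid x)}\!\left[-\log q_\phi(y\mid z)\right]
  + \beta\,\mathbb{E}_{p(x)}\!\left[\mathrm{KL}\!\left(p_\theta(z\mid x)\,\|\,r(z)\right)\right]
```

Minimizing this objective trades prediction of the output against compression of the input representation, which is the sense in which an IB-style tuning method pursues a minimal sufficient representation.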
https://arxiv.org/abs/2505.13946
Existing point cloud representation learning methods tend to learn the geometric distribution of objects through data-driven approaches, emphasizing structural features while overlooking the relationship between local information and the whole structure. Local features reflect the fine-grained variations of an object, while the whole structure is determined by the interaction and combination of these local features, which collectively define the object's shape. In the real world, objects undergo elastic deformation under external forces, and this deformation gradually affects the whole structure through the propagation of forces from local regions, thereby altering the object's geometric properties. Inspired by this, we propose a physics-driven self-supervised learning method for point cloud representation, which captures the relationship between parts and the whole by constructing a local-to-whole force propagation mechanism. Specifically, we employ a dual-task encoder-decoder framework that integrates the geometric modeling capability of implicit fields with physics-driven elastic deformation. The encoder extracts features from the point cloud and its tetrahedral mesh representation, capturing both geometric and physical properties. These features are then fed into two decoders: one learns the whole geometric shape of the point cloud through an implicit field, while the other predicts local deformations using two specifically designed physics-informed loss functions, modeling the deformation relationship between local and whole shapes. Experimental results show that our method outperforms existing approaches in object classification, few-shot learning, and segmentation, demonstrating its effectiveness.
https://arxiv.org/abs/2505.13812
We present Sat2Sound, a multimodal representation learning framework for soundscape mapping, designed to predict the distribution of sounds at any location on Earth. Existing methods for this task rely on satellite image and paired geotagged audio samples, which often fail to capture the diversity of sound sources at a given location. To address this limitation, we enhance existing datasets by leveraging a Vision-Language Model (VLM) to generate semantically rich soundscape descriptions for locations depicted in satellite images. Our approach incorporates contrastive learning across audio, audio captions, satellite images, and satellite image captions. We hypothesize that there is a fixed set of soundscape concepts shared across modalities. To this end, we learn a shared codebook of soundscape concepts and represent each sample as a weighted average of these concepts. Sat2Sound achieves state-of-the-art performance in cross-modal retrieval between satellite image and audio on two datasets: GeoSound and SoundingEarth. Additionally, building on Sat2Sound's ability to retrieve detailed soundscape captions, we introduce a novel application: location-based soundscape synthesis, which enables immersive acoustic experiences. Our code and models will be publicly available.
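A minimal sketch of a shared concept codebook in which any modality embedding is re-expressed as a weighted average of learned concept vectors (PyTorch assumed; the softmax attention and sizes are assumptions, not the Sat2Sound implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptCodebook(nn.Module):
    """Sketch of a shared soundscape-concept codebook: embeddings from any
    modality are projected onto a softmax-weighted average of learned concept
    vectors, so all modalities are expressed in the same concept space."""
    def __init__(self, num_concepts=64, dim=512):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_concepts, dim) * 0.02)

    def forward(self, emb):                          # emb: (B, dim)
        scores = emb @ self.codebook.t() / emb.size(-1) ** 0.5
        weights = F.softmax(scores, dim=-1)          # (B, num_concepts)
        return weights @ self.codebook               # (B, dim): weighted concept average
```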
https://arxiv.org/abs/2505.13777
Surgical phase recognition from video is a technology that automatically classifies the progress of a surgical procedure and has a wide range of potential applications, including real-time surgical support, optimization of medical resources, training and skill assessment, and safety improvement. Recent advances in surgical phase recognition have focused primarily on Transformer-based methods, although methods that extract spatial features from individual frames with a CNN and then extract video features from the resulting time series of spatial features with temporal modeling have also shown high performance. However, there remains a paucity of research on training methods for the CNNs employed for feature extraction or representation learning in surgical phase recognition. In this study, we propose a method for representation learning in surgical workflow analysis using a vision-language model (ReSW-VL). Our proposed method fine-tunes the image encoder of a CLIP (Contrastive Language-Image Pre-training) vision-language model using prompt learning for surgical phase recognition. Experimental results on three surgical phase recognition datasets demonstrate the effectiveness of the proposed method in comparison with conventional methods.
https://arxiv.org/abs/2505.13746
Classical Chinese poetry is a vital and enduring part of Chinese literature, conveying profound emotional resonance. Existing studies analyze sentiment based on textual meaning, overlooking the unique rhythmic and visual features inherent in poetry, especially since it is often recited and accompanied by Chinese paintings. In this work, we propose a dialect-enhanced multimodal framework for classical Chinese poetry sentiment analysis. We extract sentence-level audio features from the poetry and incorporate audio from multiple dialects, which may retain regional ancient Chinese phonetic features, enriching the phonetic representation. Additionally, we generate sentence-level visual features, and the multimodal features are fused with textual features enhanced by LLM translation through multimodal contrastive representation learning. Our framework outperforms state-of-the-art methods on two public datasets, achieving at least a 2.51% improvement in accuracy and 1.63% in macro F1. We open-source the code to facilitate research in this area and to provide insights for general multimodal Chinese representation.
https://arxiv.org/abs/2505.13210
The rise of time-series pre-trained models has advanced temporal representation learning, but current state-of-the-art models are often large-scale, requiring substantial compute. We introduce TSPulse, ultra-compact time-series pre-trained models with only 1M parameters, specialized to perform strongly across classification, anomaly detection, imputation, and retrieval tasks. TSPulse introduces innovations at both the architecture and task levels. At the architecture level, it employs a dual-space masked reconstruction, learning from both time and frequency domains to capture complementary signals. This is further enhanced by a dual-embedding disentanglement, generating both detailed embeddings for fine-grained analysis and high-level semantic embeddings for broader task understanding. Notably, TSPulse's semantic embeddings are robust to shifts in time, magnitude, and noise, which is important for robust retrieval. At the task level, TSPulse incorporates TSLens, a fine-tuning component enabling task-specific feature attention. It also introduces a multi-head triangulation technique that correlates deviations from multiple prediction heads, enhancing anomaly detection by fusing complementary model outputs. Additionally, a hybrid mask pretraining is proposed to improve zero-shot imputation by reducing pre-training bias. These architecture and task innovations collectively contribute to TSPulse's significant performance gains: 5-16% on the UEA classification benchmarks, +20% on the TSB-AD anomaly detection leaderboard, +50% in zero-shot imputation, and +25% in time-series retrieval. Remarkably, these results are achieved with just 1M parameters, making TSPulse 10-100X smaller than existing pre-trained models. Its efficiency enables GPU-free inference and rapid pre-training, setting a new standard for efficient time-series pre-trained models. Models will be open-sourced soon.
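A minimal sketch of a dual-space (time + frequency) masked-reconstruction loss of the kind described above (PyTorch assumed; the masking and FFT-magnitude matching are an assumed simplification of TSPulse's scheme):

```python
import torch
import torch.nn as nn

def dual_space_reconstruction_loss(model, series, mask):
    """Reconstruct a masked series and penalize errors both in the time domain
    (on masked positions) and in the magnitude spectrum. `model` is any
    nn.Module mapping the masked series (B, T) to a reconstruction of the same
    shape; `mask` is a boolean tensor marking masked positions."""
    masked_input = series * (~mask)            # zero out masked positions
    recon = model(masked_input)                # (B, T)
    time_loss = ((recon - series) ** 2 * mask).sum() / mask.sum().clamp(min=1)
    freq_loss = nn.functional.mse_loss(
        torch.fft.rfft(recon, dim=-1).abs(),
        torch.fft.rfft(series, dim=-1).abs(),
    )
    return time_loss + freq_loss
```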
https://arxiv.org/abs/2505.13033