Modern video understanding systems excel at tasks such as scene classification, object detection, and short video retrieval. However, as video analysis becomes increasingly central to real-world applications, there is a growing need for proactive video agents: systems that not only interpret video streams but also reason about events and take informed actions. A key obstacle in this direction is temporal reasoning: while deep learning models have made remarkable progress in recognizing patterns within individual frames or short clips, they struggle to understand the sequencing and dependencies of events over time, which is critical for action-driven decision-making. Addressing this limitation demands moving beyond conventional deep learning approaches. We posit that tackling this challenge requires a neuro-symbolic perspective, where video queries are decomposed into atomic events, structured into coherent sequences, and validated against temporal constraints. Such an approach can enhance interpretability, enable structured reasoning, and provide stronger guarantees on system behavior, all key properties for advancing trustworthy video agents. To this end, we present a grand challenge to the research community: developing the next generation of intelligent video agents that integrate three core capabilities: (1) autonomous video search and analysis, (2) seamless real-world interaction, and (3) advanced content generation. By addressing these pillars, we can transition from passive perception to intelligent video agents that reason, predict, and act, pushing the boundaries of video understanding.
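To make the neuro-symbolic idea concrete, the sketch below checks a set of detected atomic events against simple "before" constraints derived from a query. It is a minimal illustration of temporal validation, not the paper's framework; the event names, timings, and constraint format are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Event:
    name: str
    start: float  # seconds
    end: float

def satisfies(events, constraints):
    """Return True if every (a, b) constraint holds: event a ends before event b starts."""
    by_name = {e.name: e for e in events}
    for a, b in constraints:
        if a not in by_name or b not in by_name:
            return False  # a required atomic event was never detected
        if by_name[a].end > by_name[b].start:
            return False  # temporal ordering violated
    return True

# Hypothetical query: "a person exits after the car stops"
detected = [Event("car_stops", 3.0, 5.0), Event("person_exits", 6.0, 8.0)]
print(satisfies(detected, [("car_stops", "person_exits")]))  # True
```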
https://arxiv.org/abs/2505.13851
Remote sensing image (RSI) interpretation typically faces challenges due to the scarcity of labeled data, which limits the performance of RSI interpretation tasks. To tackle this challenge, we propose EarthSynth, a diffusion-based generative foundation model that enables synthesizing multi-category, cross-satellite labeled Earth observation data for downstream RSI interpretation tasks. To the best of our knowledge, EarthSynth is the first to explore multi-task generation for remote sensing. EarthSynth, trained on the EarthSynth-180K dataset, employs the Counterfactual Composition training strategy to improve training data diversity and enhance category control. Furthermore, a rule-based filtering method, R-Filter, is proposed to select more informative synthetic data for downstream tasks. We evaluate EarthSynth on scene classification, object detection, and semantic segmentation in open-world scenarios, offering a practical solution for advancing RSI interpretation.
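The abstract describes R-Filter only as a rule-based filter that keeps the more informative synthetic samples. The sketch below shows what such a filter could look like under two assumed rules (a minimum labeled-area fraction and a minimum number of classes per sample); the actual rules used by EarthSynth may differ.

```python
import numpy as np

def r_filter(samples, min_labeled_frac=0.05, min_classes=2):
    """samples: list of (image, mask) pairs, where mask holds integer class ids (0 = background)."""
    kept = []
    for image, mask in samples:
        labeled_frac = np.mean(mask > 0)            # rule 1: enough labeled pixels
        n_classes = len(np.unique(mask[mask > 0]))  # rule 2: enough category diversity
        if labeled_frac >= min_labeled_frac and n_classes >= min_classes:
            kept.append((image, mask))
    return kept

rng = np.random.default_rng(0)
synthetic = [(rng.random((64, 64, 3)), rng.integers(0, 4, (64, 64))) for _ in range(8)]
print(f"kept {len(r_filter(synthetic))} of {len(synthetic)} synthetic samples")
```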
https://arxiv.org/abs/2505.12108
The distribution of child sexual abuse imagery (CSAI) is an ever-growing concern of our modern world; children who suffered from this heinous crime are revictimized, and the growing amount of illegal imagery distributed overwhelms law enforcement agents (LEAs) with the manual labor of categorization. To ease this burden, researchers have explored methods for automating data triage and detection of CSAI, but the sensitive nature of the data imposes restricted access and minimal interaction between real data and learning algorithms, avoiding leaks at all costs. In observing how these restrictions have shaped the literature, we formalize a definition of "Proxy Tasks", i.e., the substitute tasks used for training models for CSAI without making use of CSA data. Under this new terminology we review current literature and present a protocol for making conscious use of Proxy Tasks together with consistent input from LEAs to design better automation in this field. Finally, we apply this protocol to study -- for the first time -- the task of Few-shot Indoor Scene Classification on CSAI, showing a final model that achieves promising results on a real-world CSAI dataset whilst having no weights actually trained on sensitive data.
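As a rough illustration of the few-shot setting, the sketch below classifies query embeddings by their nearest class prototype computed from a handful of labeled support embeddings. This is a generic prototype-based approach, not necessarily the method used in the paper, and the backbone that produces the embeddings is omitted.

```python
import numpy as np

def class_prototypes(support_feats, support_labels):
    """Mean embedding per class from a small labeled support set."""
    labels = np.asarray(support_labels)
    classes = sorted(set(support_labels))
    return classes, np.stack([support_feats[labels == c].mean(axis=0) for c in classes])

def nearest_prototype(query_feats, classes, protos):
    dists = ((query_feats[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return [classes[i] for i in dists.argmin(axis=1)]

rng = np.random.default_rng(1)
support = rng.random((10, 128))            # embeddings of 10 labeled support images
labels = [0] * 5 + [1] * 5                 # two indoor-scene classes, five shots each
classes, protos = class_prototypes(support, labels)
print(nearest_prototype(rng.random((3, 128)), classes, protos))
```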
https://arxiv.org/abs/2505.06621
This paper presents the Low-Complexity Acoustic Scene Classification with Device Information Task of the DCASE 2025 Challenge and its baseline system. Continuing the focus on low-complexity models, data efficiency, and device mismatch from previous editions (2022--2024), this year's task introduces a key change: recording device information is now provided at inference time. This enables the development of device-specific models that leverage device characteristics -- reflecting real-world deployment scenarios in which a model is designed with awareness of the underlying hardware. The training set matches the 25% subset used in the corresponding DCASE 2024 challenge, with no restrictions on external data use, highlighting transfer learning as a central topic. The baseline achieves 50.72% accuracy on this ten-class problem with a device-general model, improving to 51.89% when using the available device information.
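One simple way to use device information at inference time, consistent with the task description above, is a shared backbone with per-device classification heads and a device-general fallback. The layer sizes, device identifiers, and head-selection scheme below are illustrative assumptions, not the baseline's architecture.

```python
import torch
import torch.nn as nn

class DeviceAwareASC(nn.Module):
    def __init__(self, n_classes=10, devices=("a", "b", "c"), in_dim=256, feat_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.device_heads = nn.ModuleDict({d: nn.Linear(feat_dim, n_classes) for d in devices})
        self.general_head = nn.Linear(feat_dim, n_classes)  # fallback for unseen devices

    def forward(self, x, device_id=None):
        h = self.backbone(x)
        if device_id is not None and device_id in self.device_heads:
            return self.device_heads[device_id](h)
        return self.general_head(h)

model = DeviceAwareASC()
print(model(torch.randn(4, 256), device_id="a").shape)  # torch.Size([4, 10])
```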
https://arxiv.org/abs/2505.01747
Streetscapes are an essential component of urban space. Their assessment is presently either limited to morphometric properties of their mass skeleton or requires labor-intensive qualitative evaluations of visually perceived qualities. This paper introduces SAGAI: Streetscape Analysis with Generative Artificial Intelligence, a modular workflow for scoring street-level urban scenes using open-access data and vision-language models. SAGAI integrates OpenStreetMap geometries, Google Street View imagery, and a lightweight version of the LLaVA model to generate structured spatial indicators from images via customizable natural language prompts. The pipeline includes an automated mapping module that aggregates visual scores at both the point and street levels, enabling direct cartographic interpretation. It operates without task-specific training or proprietary software dependencies, supporting scalable and interpretable analysis of urban environments. Two exploratory case studies in Nice and Vienna illustrate SAGAI's capacity to produce geospatial outputs from vision-language inference. The initial results show strong performance for binary urban-rural scene classification, moderate precision in commercial feature detection, and lower, though still informative, accuracy for sidewalk width estimation. Fully deployable by any user, SAGAI can be easily adapted to a wide range of urban research themes, such as walkability, safety, or urban design, through prompt modification alone.
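The scoring step of such a pipeline can be summarized as: send each street-view image to a vision-language model with a customizable prompt and parse the answer into a numeric indicator. In the sketch below, `ask_vlm` is a placeholder for whatever LLaVA wrapper is used, and the prompt, point identifiers, and parsing rule are invented for illustration.

```python
import re

def ask_vlm(image_path, prompt):
    # Placeholder: a real pipeline would call a LLaVA inference wrapper here.
    return "Score: 3"

def score_image(image_path,
                prompt="Rate the visible commercial activity in this street scene from 0 to 5."):
    answer = ask_vlm(image_path, prompt)
    match = re.search(r"\d+", answer)          # extract the numeric score from the reply
    return int(match.group()) if match else None

points = {"pt_001": "img_001.jpg", "pt_002": "img_002.jpg"}  # hypothetical sample points
print({pid: score_image(path) for pid, path in points.items()})
```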
https://arxiv.org/abs/2504.16538
In recent years, large-scale vision-language models (VLMs) like CLIP have gained attention for their zero-shot inference using instructional text prompts. While these models excel in general computer vision, their potential for domain generalization in remote sensing (RS) remains underexplored. Existing approaches enhance prompt learning by generating visual prompt tokens but rely on full-image features, introducing noise and background artifacts that vary within a class, causing misclassification. To address this, we propose FrogDogNet, a novel prompt learning framework integrating Fourier frequency filtering and self-attention to improve RS scene classification and domain generalization. FrogDogNet selectively retains invariant low-frequency components while eliminating noise and irrelevant backgrounds, ensuring robust feature representation across domains. The model first extracts significant features via projection and self-attention, then applies frequency-based filtering to preserve essential structural information for prompt learning. Extensive experiments on four RS datasets and three domain generalization tasks show that FrogDogNet consistently outperforms state-of-the-art prompt learning methods, demonstrating superior adaptability across domain shifts. Our findings highlight the effectiveness of frequency-based invariant feature retention in generalization, paving the way for broader applications. Our code is available at this https URL
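The core frequency-filtering idea can be sketched as a centered FFT low-pass mask applied to a feature map, as below. The cutoff radius and tensor shapes are arbitrary, and the projection and self-attention stages that FrogDogNet combines with this filtering are omitted.

```python
import torch

def low_pass(features: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """features: (B, C, H, W); returns a low-frequency-only reconstruction."""
    B, C, H, W = features.shape
    f = torch.fft.fftshift(torch.fft.fft2(features), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    cy, cx = H // 2, W // 2
    radius = keep_ratio * min(H, W) / 2.0
    dist = ((yy - cy) ** 2 + (xx - cx) ** 2).float().sqrt()
    mask = (dist <= radius).to(features.dtype)   # keep a centered low-frequency disk
    return torch.fft.ifft2(torch.fft.ifftshift(f * mask, dim=(-2, -1))).real

x = torch.randn(2, 8, 14, 14)
print(low_pass(x).shape)  # torch.Size([2, 8, 14, 14])
```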
https://arxiv.org/abs/2504.16433
The rapid expansion of multi-source satellite imagery drives innovation in Earth observation, opening unprecedented opportunities for Remote Sensing Foundation Models to harness diverse data. However, many existing models remain constrained by fixed spatial resolutions and patch sizes, limiting their ability to fully exploit the heterogeneous spatial characteristics inherent in satellite imagery. To address these challenges, we propose FlexiMo, a flexible remote sensing foundation model that endows the pre-trained model with the flexibility to adapt to arbitrary spatial resolutions. Central to FlexiMo is a spatial resolution-aware module that employs a parameter-free alignment embedding mechanism to dynamically recalibrate patch embeddings based on the input image's resolution and dimensions. This design not only preserves critical token characteristics and ensures multi-scale feature fidelity but also enables efficient feature extraction without requiring modifications to the underlying network architecture. In addition, FlexiMo incorporates a lightweight channel adaptation module that leverages prior spectral information from sensors. This mechanism allows the model to process images with varying numbers of channels while maintaining the data's intrinsic physical properties. Extensive experiments on diverse multimodal, multi-resolution, and multi-scale datasets demonstrate that FlexiMo significantly enhances model generalization and robustness. In particular, our method achieves outstanding performance across a range of downstream tasks, including scene classification, land cover classification, urban building segmentation, and cloud detection. By enabling parameter-efficient and physically consistent adaptation, FlexiMo paves the way for more adaptable and effective foundation models in real-world remote sensing applications.
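One standard way to adapt a pretrained patch embedding to new resolutions without adding parameters is to resize the patch-projection kernel by interpolation (FlexiViT-style methods use a more careful pseudo-inverse resize). The sketch below uses plain bilinear interpolation for brevity; FlexiMo's actual alignment mechanism may differ.

```python
import torch
import torch.nn.functional as F

def resize_patch_kernel(weight: torch.Tensor, new_patch: int) -> torch.Tensor:
    """weight: (embed_dim, in_chans, p, p) conv kernel; returns a kernel for patch size new_patch."""
    return F.interpolate(weight, size=(new_patch, new_patch), mode="bilinear", align_corners=False)

w16 = torch.randn(768, 3, 16, 16)      # pretrained 16x16 patch projection (shapes assumed)
w8 = resize_patch_kernel(w16, 8)       # reuse it for finer 8x8 patches
x = torch.randn(1, 3, 128, 128)
tokens = F.conv2d(x, w8, stride=8)     # (1, 768, 16, 16) token grid
print(tokens.shape)
```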
https://arxiv.org/abs/2503.23844
The global rise in the number of people with physical disabilities, in part due to improvements in post-trauma survivorship and longevity, has amplified the demand for advanced assistive technologies to improve mobility and independence. Autonomous assistive robots, such as smart wheelchairs, require robust capabilities in spatial segmentation and semantic recognition to navigate complex built environments effectively. Place segmentation involves delineating spatial regions like rooms or functional areas, while semantic recognition assigns semantic labels to these regions, enabling accurate localization tailored to user-specific needs. Existing approaches often utilize deep learning; however, these closed-vocabulary detection systems struggle to interpret intuitive and casual human instructions. Additionally, most existing methods ignore the uncertainty of the scene recognition problem, leading to low success rates, particularly in ambiguous and complex environments. To address these challenges, we propose an open-vocabulary scene semantic segmentation and detection pipeline leveraging Vision Language Models (VLMs) and Large Language Models (LLMs). Our approach follows a 'Segment Detect Select' framework for open-vocabulary scene classification, enabling adaptive and intuitive navigation for assistive robots in built environments.
https://arxiv.org/abs/2503.23105
Generalized metric depth understanding is critical for precise vision-guided robotics, which current state-of-the-art (SOTA) vision encoders do not support. To address this, we propose Vanishing Depth, a self-supervised training approach that extends pretrained RGB encoders to incorporate and align metric depth into their feature embeddings. Based on our novel positional depth encoding, we enable stable depth density and depth distribution invariant feature extraction. We achieve performance improvements and SOTA results across a spectrum of relevant RGBD downstream tasks - without the necessity of finetuning the encoder. Most notably, we achieve 56.05 mIoU on SUN-RGBD segmentation, 88.3 RMSE for depth completion on Void, and 83.8 Top-1 accuracy on NYUv2 scene classification. In 6D object pose estimation, we outperform the predecessor encoders DinoV2, EVA-02, and Omnivore and achieve SOTA results for non-finetuned encoders in several related RGBD downstream tasks.
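A simple way to picture a positional depth encoding is a sinusoidal embedding of per-patch metric depth added to the RGB token embeddings, as sketched below. The encoding formula, dimensions, and the way depth is pooled per patch are assumptions for illustration, not the paper's exact design.

```python
import math
import torch

def depth_encoding(depth: torch.Tensor, dim: int = 64) -> torch.Tensor:
    """depth: (N,) metric depth values in meters -> (N, dim) sinusoidal features."""
    half = dim // 2
    freqs = torch.exp(-torch.arange(half).float() * (math.log(10000.0) / half))
    angles = depth[:, None] * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

patch_depth = torch.tensor([0.8, 1.5, 3.2, 7.0])   # mean metric depth per patch, in meters
rgb_tokens = torch.randn(4, 64)                    # embeddings from a frozen RGB encoder
depth_aware = rgb_tokens + depth_encoding(patch_depth, dim=64)
print(depth_aware.shape)  # torch.Size([4, 64])
```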
https://arxiv.org/abs/2503.19947
Time becomes visible through illumination changes in what we see. Inspired by this, in this paper we explore the potential to learn time awareness from static images, trying to answer: what does time tell us? To this end, we first introduce a Time-Oriented Collection (TOC) dataset, which contains 130,906 images with reliable timestamps. Leveraging this dataset, we propose a Time-Image Contrastive Learning (TICL) approach to jointly model timestamps and related visual representations through cross-modal contrastive learning. We find that the proposed TICL 1) achieves state-of-the-art performance on the timestamp estimation task across various benchmark metrics, and 2) interestingly, despite seeing only static images, yields time-aware embeddings with strong capability in several time-aware downstream tasks such as time-based image retrieval, video scene classification, and time-aware image editing. Our findings suggest that time-related visual cues can be learned from static images and are beneficial for various vision tasks, laying a foundation for future research on understanding time-related visual context. Project page: this https URL.
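The cross-modal objective can be sketched as a symmetric CLIP-style InfoNCE loss between image embeddings and timestamp embeddings, as below. The image and timestamp encoders (e.g., how hours and months are featurized) are omitted, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb: torch.Tensor, time_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of matched (image, timestamp) pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    time_emb = F.normalize(time_emb, dim=-1)
    logits = img_emb @ time_emb.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))          # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy batch of already-encoded pairs; the encoders themselves are not shown.
loss = clip_style_loss(torch.randn(8, 256), torch.randn(8, 256))
print(float(loss))
```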
https://arxiv.org/abs/2503.17899
Acoustic scene recordings are often collected from a diverse range of cities. Most existing acoustic scene classification (ASC) approaches focus on identifying common acoustic scene patterns across cities to enhance generalization. In contrast, we hypothesize that city-specific environmental and cultural differences in acoustic features are beneficial for the ASC task. In this paper, we introduce City2Scene, a novel framework that leverages city features to improve ASC. City2Scene transfers the city-specific knowledge from city classification models to a scene classification model using knowledge distillation. We evaluated City2Scene on the DCASE Challenge Task 1 datasets, where each audio clip is annotated with both scene and city labels. Experimental results demonstrate that city features provide valuable information for classifying scenes. By distilling the city-specific knowledge, City2Scene effectively improves accuracy for various state-of-the-art ASC backbone models, including both CNNs and Transformers.
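The transfer step rests on standard logit distillation. The sketch below shows a temperature-scaled KL distillation loss from a frozen city classifier toward an assumed auxiliary city head on the scene model; how City2Scene weights this term against the scene-classification loss follows the paper and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """Soft-label KL distillation with temperature scaling (Hinton-style)."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

teacher_city_logits = torch.randn(16, 12)  # frozen city classifier (12 cities assumed)
student_city_logits = torch.randn(16, 12)  # auxiliary city head on the scene model (assumed)
print(float(distillation_loss(student_city_logits, teacher_city_logits)))
```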
https://arxiv.org/abs/2503.16862
Few-Shot Remote Sensing Scene Classification (FS-RSSC) presents the challenge of classifying remote sensing images with limited labeled samples. Existing methods typically emphasize single-modal feature learning, neglecting the potential benefits of optimizing multi-modal representations. To address this limitation, we propose a novel Optimal Transport Adapter Tuning (OTAT) framework aimed at constructing an ideal Platonic representational space through optimal transport (OT) theory. This framework seeks to harmonize rich visual information with less dense textual cues, enabling effective cross-modal information transfer and complementarity. Central to this approach is the Optimal Transport Adapter (OTA), which employs a cross-modal attention mechanism to enrich textual representations and facilitate better subsequent information interaction. By transforming the network optimization into an OT optimization problem, OTA establishes efficient pathways for balanced information exchange between modalities. Moreover, we introduce a sample-level Entropy-Aware Weighted (EAW) loss, which combines difficulty-weighted similarity scores with entropy-based regularization. This loss function provides finer control over the OT optimization process, enhancing its solvability and stability. Our framework offers a scalable and efficient solution for advancing multimodal learning in remote sensing applications. Extensive experiments on benchmark datasets demonstrate that OTAT achieves state-of-the-art performance in FS-RSSC, significantly improving model performance and generalization.
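At the heart of the OT formulation is an entropy-regularized transport plan between visual and textual tokens, which can be computed with Sinkhorn iterations as sketched below. The cost function, uniform marginals, and regularization strength are illustrative choices; the EAW loss and the adapter architecture are omitted.

```python
import torch

def sinkhorn(cost: torch.Tensor, eps: float = 0.05, n_iter: int = 50) -> torch.Tensor:
    """Entropy-regularized OT plan between two uniform marginals via Sinkhorn iterations."""
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n)
    nu = torch.full((m,), 1.0 / m)
    K = torch.exp(-cost / eps)
    u = torch.ones(n)
    for _ in range(n_iter):
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    return torch.diag(u) @ K @ torch.diag(v)

visual = torch.randn(10, 64)   # visual tokens
textual = torch.randn(5, 64)   # sparser textual tokens
plan = sinkhorn(torch.cdist(visual, textual) ** 2)
print(plan.shape, float(plan.sum()))  # (10, 5), total mass ~1
```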
https://arxiv.org/abs/2503.14938
The application of Vision-Language Models (VLMs) in remote sensing (RS) has demonstrated significant potential in traditional tasks such as scene classification, object detection, and image captioning. However, current models, which excel in Referring Expression Comprehension (REC), struggle with tasks involving complex instructions (e.g., those with multiple conditions) or pixel-level operations like segmentation and change detection. In this white paper, we provide a comprehensive hierarchical summary of vision-language tasks in RS, categorized by the varying levels of cognitive capability required. We introduce the Remote Sensing Vision-Language Task Set (RSVLTS), which includes Open-Vocabulary Tasks (OVT), Referring Expression Tasks (RET), and Described Object Tasks (DOT) in increasing order of difficulty, alongside Visual Question Answering (VQA). Moreover, we propose a novel unified data representation using a set-of-points approach for RSVLTS, along with a condition parser and a self-augmentation strategy based on cyclic referring. These features are integrated into the GeoRSMLLM model, which is designed to handle the broad range of tasks in RSVLTS, paving the way for a more generalized solution for vision-language tasks in geoscience and remote sensing.
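To illustrate the set-of-points idea, the sketch below converts a bounding box and a segmentation mask into small point sets so that heterogeneous outputs share one representation. The sampling schemes are hypothetical and not the paper's specification.

```python
import numpy as np

def box_to_points(x0, y0, x1, y1, n_per_side=4):
    """Represent a box by points sampled along its four edges."""
    xs, ys = np.linspace(x0, x1, n_per_side), np.linspace(y0, y1, n_per_side)
    top = np.stack([xs, np.full_like(xs, y0)], 1)
    bottom = np.stack([xs, np.full_like(xs, y1)], 1)
    left = np.stack([np.full_like(ys, x0), ys], 1)
    right = np.stack([np.full_like(ys, x1), ys], 1)
    return np.concatenate([top, bottom, left, right])

def mask_to_points(mask, max_points=32):
    """Represent a binary mask by a subsample of its foreground pixel coordinates."""
    ys, xs = np.nonzero(mask)
    idx = np.linspace(0, len(xs) - 1, min(max_points, len(xs))).astype(int)
    return np.stack([xs[idx], ys[idx]], 1)

mask = np.zeros((64, 64), dtype=bool); mask[10:20, 30:50] = True
print(box_to_points(5, 5, 25, 40).shape, mask_to_points(mask).shape)
```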
https://arxiv.org/abs/2503.12490
Knowledge Distillation (KD) is a widespread technique for compressing the knowledge of large models into more compact and efficient models. KD has proved to be highly effective in building well-performing low-complexity Acoustic Scene Classification (ASC) systems and was used in all the top-ranked submissions to this task of the annual DCASE challenge in the past three years. There is extensive research available on establishing the KD process, designing efficient student models, and forming well-performing teacher ensembles. However, less research has been conducted on investigating which teacher model attributes are beneficial for low-complexity students. In this work, we try to close this gap by studying the effects on the student's performance when using different teacher network architectures, varying the teacher model size, training them with different device generalization methods, and applying different ensembling strategies. The results show that teacher model sizes, device generalization methods, the ensembling strategy and the ensemble size are key factors for a well-performing student network.
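A common way to form the teacher ensemble in this setting is to average the temperature-softened probabilities of several teachers and distill the student toward that average, as sketched below. The number of teachers, the temperature, and averaging probabilities rather than logits are assumptions; the paper compares several such strategies.

```python
import torch
import torch.nn.functional as F

def ensemble_soft_targets(teacher_logits_list, T: float = 2.0) -> torch.Tensor:
    """Average temperature-softened probabilities over a teacher ensemble."""
    probs = [F.softmax(logits / T, dim=-1) for logits in teacher_logits_list]
    return torch.stack(probs).mean(dim=0)

def kd_loss(student_logits, soft_targets, T: float = 2.0):
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1), soft_targets,
                    reduction="batchmean") * (T * T)

teachers = [torch.randn(8, 10) for _ in range(3)]  # logits of three teacher models (10 scenes)
student = torch.randn(8, 10)
print(float(kd_loss(student, ensemble_soft_targets(teachers))))
```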
https://arxiv.org/abs/2503.11363
Existing adaptation techniques typically require architectural modifications or added parameters, leading to high computational costs and complexity. We introduce Attention Projection Layer Adaptation (APLA), a simple approach to adapt vision transformers (ViTs) without altering the architecture or adding parameters. Through a systematic analysis, we find that the layer immediately after the attention mechanism is crucial for adaptation. By updating only this projection layer, or even just a random subset of this layer's weights, APLA achieves state-of-the-art performance while reducing GPU memory usage by up to 52.63% and training time by up to 43.0%, with no extra cost at inference. Across 46 datasets covering a variety of tasks including scene classification, medical imaging, satellite imaging, and fine-grained classification, APLA consistently outperforms 17 other leading adaptation methods, including full fine-tuning, on classification, segmentation, and detection tasks. The code is available at this https URL.
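A minimal sketch of the APLA recipe, assuming the timm library and its attribute names for the ViT: freeze every parameter except the attention output projection in each block (plus the task head). APLA's variant that updates only a random subset of this layer's weights is omitted.

```python
import timm

# Build a ViT and freeze everything except each block's attention output projection
# and the classifier head for the new task.
model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=45)  # 45: hypothetical label count
for name, param in model.named_parameters():
    param.requires_grad = (".attn.proj." in name) or name.startswith("head.")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable / 1e6:.2f}M of {total / 1e6:.2f}M parameters")
```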
https://arxiv.org/abs/2503.11335
Accurate fine-grained geospatial scene classification using remote sensing imagery is essential for a wide range of applications. However, existing approaches often rely on manually zooming remote sensing images at different scales to create typical scene samples. This approach fails to adequately support the fixed-resolution image interpretation requirements of real-world scenarios. To address this limitation, we introduce the Million-scale finE-grained geospatial scEne classification dataseT (MEET), which contains over 1.03 million zoom-free remote sensing scene samples, manually annotated into 80 fine-grained categories. In MEET, each scene sample follows a scene-in-scene layout, where the central scene serves as the reference, and auxiliary scenes provide crucial spatial context for fine-grained classification. Moreover, to tackle the emerging challenge of scene-in-scene classification, we present the Context-Aware Transformer (CAT), a model specifically designed for this task, which adaptively fuses spatial context by learning attentional features that capture the relationships between the center and auxiliary scenes. Based on MEET, we establish a comprehensive benchmark for fine-grained geospatial scene classification, evaluating CAT against 11 competitive baselines. The results demonstrate that CAT significantly outperforms these baselines, achieving a 1.88% higher balanced accuracy (BA) with the Swin-Large backbone, and a notable 7.87% improvement with the Swin-Huge backbone. Further experiments validate the effectiveness of each module in CAT and show the practical applicability of CAT to urban functional zone mapping. The source code and dataset will be publicly available at this https URL.
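The context-fusion step can be pictured as the central scene attending over its auxiliary scenes, as in the simplified module below. The single multi-head attention layer, feature dimension, and residual combination are stand-ins for CAT's actual design; only the 80-class output matches the dataset description above.

```python
import torch
import torch.nn as nn

class ContextFusion(nn.Module):
    """Center-scene query attends over auxiliary-scene features (a simplified stand-in for CAT)."""
    def __init__(self, dim=256, n_heads=4, n_classes=80):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, center_feat, aux_feats):
        # center_feat: (B, 1, D) embedding of the central scene
        # aux_feats:   (B, K, D) embeddings of the surrounding auxiliary scenes
        context, _ = self.attn(center_feat, aux_feats, aux_feats)
        return self.classifier((center_feat + context).squeeze(1))

model = ContextFusion()
print(model(torch.randn(2, 1, 256), torch.randn(2, 8, 256)).shape)  # torch.Size([2, 80])
```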
https://arxiv.org/abs/2503.11219
Recent advances in self-supervised learning for Vision Transformers (ViTs) have fueled breakthroughs in remote sensing (RS) foundation models. However, the quadratic complexity of self-attention poses a significant barrier to scalability, particularly for large models and high-resolution images. While the linear-complexity Mamba architecture offers a promising alternative, existing RS applications of Mamba remain limited to supervised tasks on small, domain-specific datasets. To address these challenges, we propose RoMA, a framework that enables scalable self-supervised pretraining of Mamba-based RS foundation models using large-scale, diverse, unlabeled data. RoMA enhances scalability for high-resolution images through a tailored auto-regressive learning strategy, incorporating two key innovations: 1) a rotation-aware pretraining mechanism combining adaptive cropping with angular embeddings to handle sparsely distributed objects with arbitrary orientations, and 2) multi-scale token prediction objectives that address the extreme variations in object scales inherent to RS imagery. Systematic empirical studies validate that Mamba adheres to RS data and parameter scaling laws, with performance scaling reliably as model and data size increase. Furthermore, experiments across scene classification, object detection, and semantic segmentation tasks demonstrate that RoMA-pretrained Mamba models consistently outperform ViT-based counterparts in both accuracy and computational efficiency. The source code and pretrained models will be released at this https URL.
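One way to realize an angular embedding is a multi-frequency sin/cos encoding of the rotation angle applied to each crop, added to the patch tokens as sketched below. How RoMA actually injects the angle, along with its adaptive cropping and multi-scale prediction objectives, is not reproduced here.

```python
import torch

def angular_embedding(theta_deg: torch.Tensor, dim: int = 64) -> torch.Tensor:
    """Encode a rotation angle with sin/cos features at several harmonics."""
    theta = torch.deg2rad(theta_deg)[:, None]              # (B, 1)
    k = torch.arange(1, dim // 2 + 1).float()[None, :]     # harmonics 1..dim/2
    return torch.cat([torch.sin(k * theta), torch.cos(k * theta)], dim=-1)

angles = torch.tensor([0.0, 45.0, 90.0, 270.0])            # rotation applied to each crop
tokens = torch.randn(4, 196, 64)                           # patch tokens of the rotated crops
tokens = tokens + angular_embedding(angles, dim=64)[:, None, :]  # broadcast over the sequence
print(tokens.shape)  # torch.Size([4, 196, 64])
```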
https://arxiv.org/abs/2503.10392
Due to the difficulty of obtaining labeled data for hyperspectral images (HSIs), cross-scene classification has emerged as a widely adopted approach in the remote sensing community. It involves training a model using labeled data from a source domain (SD) and unlabeled data from a target domain (TD), followed by inference on the TD. However, variations in the reflectance spectrum of the same object between the SD and the TD, as well as differences in the feature distribution of the same land cover class, pose significant challenges to the performance of cross-scene classification. To address this issue, we propose a dual classification head self-training network (DHSNet). This method aligns class-wise features across domains, ensuring that the trained classifier can accurately classify TD data of different classes. We introduce a dual classification head self-training strategy for the first time in the cross-scene HSI classification field. The proposed approach mitigates the domain gap while preventing the accumulation of incorrect pseudo-labels in the model. Additionally, we incorporate a novel central feature attention mechanism to enhance the model's capacity to learn scene-invariant features across domains. Experimental results on three cross-scene HSI datasets demonstrate that the proposed DHSNet significantly outperforms other state-of-the-art approaches. The code for DHSNet will be available at this https URL.
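A sketch of how two classification heads can limit the accumulation of wrong pseudo-labels: a target-domain sample receives a pseudo-label only when both heads agree and are confident. The agreement-and-confidence rule and the threshold below are a common variant used for illustration, not necessarily DHSNet's exact criterion.

```python
import torch
import torch.nn.functional as F

def agreed_pseudo_labels(logits_a, logits_b, threshold: float = 0.9):
    """Keep a target-domain sample only when both heads agree and are confident."""
    prob_a, pred_a = F.softmax(logits_a, dim=-1).max(dim=-1)
    prob_b, pred_b = F.softmax(logits_b, dim=-1).max(dim=-1)
    keep = (pred_a == pred_b) & (prob_a > threshold) & (prob_b > threshold)
    return pred_a[keep], keep

logits_a = torch.randn(32, 9) * 3   # head 1 over 9 land-cover classes (count assumed)
logits_b = torch.randn(32, 9) * 3   # head 2 on the same target-domain batch
labels, mask = agreed_pseudo_labels(logits_a, logits_b)
print(int(mask.sum()), "of 32 target samples receive pseudo-labels")
```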
https://arxiv.org/abs/2502.17879
In this work, we propose a novel variational Bayesian adaptive learning approach for cross-domain knowledge transfer to address acoustic mismatches between training and testing conditions, such as recording devices and environmental noise. Different from the traditional Bayesian approaches that impose uncertainties on model parameters risking the curse of dimensionality due to the huge number of parameters, we focus on estimating a manageable number of latent variables in deep neural models. Knowledge learned from a source domain is thus encoded in prior distributions of deep latent variables and optimally combined, in a Bayesian sense, with a small set of adaptation data from a target domain to approximate the corresponding posterior distributions. Two different strategies are proposed and investigated to estimate the posterior distributions: Gaussian mean-field variational inference, and empirical Bayes. These strategies address the presence or absence of parallel data in the source and target domains. Furthermore, structural relationship modeling is investigated to enhance the approximation. We evaluated our proposed approaches on two acoustic adaptation tasks: 1) device adaptation for acoustic scene classification, and 2) noise adaptation for spoken command recognition. Experimental results show that the proposed variational Bayesian adaptive learning approach can obtain good improvements on target domain data, and consistently outperforms state-of-the-art knowledge transfer methods.
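The Bayesian combination can be sketched as a Gaussian mean-field posterior over deep latent variables regularized toward a source-domain prior, with reparameterized sampling, as below. The latent dimensionality and the prior/posterior parameters are placeholders; the empirical Bayes variant and the structural relationship modeling are omitted.

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over latent dimensions."""
    return 0.5 * (logvar_p - logvar_q
                  + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp() - 1.0).sum(-1)

# Source-domain knowledge encoded as a prior over deep latent variables (shapes assumed).
mu_p, logvar_p = torch.zeros(32), torch.zeros(32)
# Variational posterior parameters estimated from a small target-domain adaptation set.
mu_q, logvar_q = torch.randn(32) * 0.1, torch.full((32,), -1.0)

z = mu_q + torch.randn(32) * (0.5 * logvar_q).exp()   # reparameterized latent sample
print(z.shape, float(gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)))
```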
https://arxiv.org/abs/2501.15496
Scene understanding in remote sensing often faces challenges in generating accurate representations for complex environments such as various land use areas or coastal regions, which may also include snow, clouds, or haze. To address this, we present a vision-language framework named Spectral LLaVA, which integrates multispectral data with vision-language alignment techniques to enhance scene representation and description. Using the BigEarthNet v2 dataset from Sentinel-2, we establish a baseline with RGB-based scene descriptions and further demonstrate substantial improvements through the incorporation of multispectral information. Our framework optimizes a lightweight linear projection layer for alignment while keeping the vision backbone of SpectralGPT frozen. Our experiments encompass scene classification using linear probing and language modeling for jointly performing scene classification and description generation. Our results highlight Spectral LLaVA's ability to produce detailed and accurate descriptions, particularly for scenarios where RGB data alone proves inadequate, while also enhancing classification performance by refining SpectralGPT features into semantically meaningful representations.
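The alignment module amounts to a single trainable linear projection that maps frozen multispectral vision features into the language model's token space. The dimensions below, and the random tensor standing in for the frozen SpectralGPT output, are assumptions for illustration.

```python
import torch
import torch.nn as nn

vision_dim, llm_dim, n_tokens = 768, 4096, 196   # assumed feature and embedding sizes

projector = nn.Linear(vision_dim, llm_dim)       # the only trainable alignment module
with torch.no_grad():
    vision_tokens = torch.randn(2, n_tokens, vision_dim)  # stand-in for frozen SpectralGPT features

llm_tokens = projector(vision_tokens)            # ready to be prepended to the text embeddings
trainable = sum(p.numel() for p in projector.parameters())
print(llm_tokens.shape, f"{trainable / 1e6:.2f}M trainable parameters")
```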
https://arxiv.org/abs/2501.10144