Federated Learning (FL) enables collaborative model training while keeping training data localized, allowing us to preserve privacy in various domains including remote sensing. However, recent studies show that FL models may still leak sensitive information through their outputs, motivating the need for rigorous privacy evaluation. In this paper, we leverage membership inference attacks (MIA) as a quantitative privacy measurement framework for FL applied to remote sensing image classification. We evaluate multiple black-box MIA techniques, including entropy-based attacks, modified entropy attacks, and the likelihood ratio attack, across different FL algorithms and communication strategies. Experiments conducted on two public scene classification datasets demonstrate that MIA effectively reveals privacy leakage not captured by accuracy alone. Our results show that communication-efficient FL strategies reduce MIA success rates while maintaining competitive performance. These findings confirm MIA as a practical metric and highlight the importance of integrating privacy measurement into FL system design for remote sensing applications.
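The entropy-based attack among the evaluated MIA techniques can be sketched in a few lines: the adversary scores each sample by the Shannon entropy of the model's output distribution and flags low-entropy (confident) predictions as likely training members. The 0.5 threshold below is illustrative, not a value from the paper.

```python
import numpy as np

def entropy_mia(probs, threshold=0.5):
    """Entropy-based membership inference over a batch of softmax outputs.

    Low entropy (a confident prediction) suggests the sample was seen
    during training. `threshold` is an illustrative cutoff; in practice
    it would be calibrated, e.g. on shadow-model data.
    """
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    entropy = -np.sum(p * np.log(p), axis=1)  # Shannon entropy per sample
    return entropy, entropy < threshold
```

A confident prediction such as `[0.98, 0.01, 0.01]` falls well below the cutoff, while a near-uniform prediction does not.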
https://arxiv.org/abs/2601.06200
Diffusion-based remote sensing (RS) generative foundation models are crucial for downstream tasks. However, these models rely on large amounts of globally representative data, which often contain redundancy, noise, and class imbalance, reducing training efficiency and hindering convergence. Existing RS diffusion foundation models typically aggregate multiple classification datasets or apply simplistic deduplication, overlooking the distributional requirements of generative modeling and the heterogeneity of RS imagery. To address these limitations, we propose a training-free, two-stage data pruning approach that quickly selects a high-quality subset under high pruning ratios, enabling a preliminary foundation model to converge rapidly and serve as a versatile backbone for generation, downstream fine-tuning, and other applications. Our method jointly considers local information content with global scene-level diversity and representativeness. First, an entropy-based criterion efficiently removes low-information samples. Next, leveraging RS scene classification datasets as reference benchmarks, we perform scene-aware clustering with stratified sampling to improve clustering effectiveness while reducing computational costs on large-scale unlabeled data. Finally, by balancing cluster-level uniformity and sample representativeness, the method enables fine-grained selection under high pruning ratios while preserving overall diversity and representativeness. Experiments show that, even after pruning 85\% of the training data, our method significantly improves convergence and generation quality. Furthermore, diffusion foundation models trained with our method consistently achieve state-of-the-art performance across downstream tasks, including super-resolution and semantic image synthesis. This data pruning paradigm offers practical guidance for developing RS generative foundation models.
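The first-stage entropy criterion might look like the following sketch, which scores each image by the Shannon entropy of its intensity histogram and keeps only the most informative fraction; the 256-bin histogram and the `keep_ratio` default are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def image_entropy(img, bins=256):
    """Shannon entropy (in bits) of an image's intensity histogram."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                      # ignore empty bins (0 * log 0 := 0)
    return float(-np.sum(p * np.log2(p)))

def prune_low_information(images, keep_ratio=0.15):
    """Return the sorted indices of the top `keep_ratio` images by entropy."""
    scores = np.array([image_entropy(im) for im in images])
    k = max(1, int(len(images) * keep_ratio))
    return sorted(np.argsort(scores)[::-1][:k].tolist())
```

A flat image has zero histogram entropy and is pruned first; a textured image with many distinct intensities survives.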
https://arxiv.org/abs/2512.23239
We present a compact, quantization-ready acoustic scene classification (ASC) framework that couples an efficient student network with a learned teacher ensemble and knowledge distillation. The student backbone uses stacked depthwise-separable "expand-depthwise-project" blocks with global response normalization to stabilize training and improve robustness to device and noise variability, while a global pooling head yields class logits for efficient edge inference. To inject richer inductive bias, we assemble a diverse set of teacher models and learn two complementary fusion heads: z1, which predicts per-teacher mixture weights using a student-style backbone, and z2, a lightweight MLP that performs per-class logit fusion. The student is distilled from the ensemble via temperature-scaled soft targets combined with hard labels, enabling it to approximate the ensemble's decision geometry with a single compact model. Evaluated on the TAU Urban Acoustic Scenes 2022 Mobile benchmark, our approach achieves state-of-the-art (SOTA) results on the TAU dataset under matched edge-deployment constraints, demonstrating strong performance and practicality for mobile ASC.
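The distillation objective combines a temperature-scaled KL term against the ensemble's soft targets with a cross-entropy term on hard labels; a minimal sketch follows, where `T` and `alpha` are illustrative hyperparameters rather than the paper's settings.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.7):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(student, hard_label)."""
    p_t = softmax(teacher_logits, T)
    log_p_s = np.log(softmax(student_logits, T))
    kd = float(np.sum(p_t * (np.log(p_t) - log_p_s))) * T * T
    ce = float(-np.log(softmax(student_logits)[hard_label]))
    return alpha * kd + (1.0 - alpha) * ce
```

When the student matches the teacher exactly, the KD term vanishes and only the hard-label term remains, so the loss grows as the student's logits drift from the ensemble's.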
https://arxiv.org/abs/2512.13905
Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, diverse scale variations, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches remain fragmented between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval capabilities. We propose $\textbf{VLM2GeoVec}$, an instruction-following, single-encoder vision-language model trained contrastively to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. Our single encoder interleaves all inputs into one joint embedding trained with a contrastive loss, eliminating multi-stage pipelines and task-specific modules. To evaluate its versatility, we introduce $\textbf{RSMEB}$, a novel benchmark covering key remote-sensing embedding applications: scene classification; cross-modal search; compositional retrieval; visual-question answering; visual grounding and region-level reasoning; and semantic geospatial retrieval. On RSMEB, it achieves $\textbf{26.6\%}$ P@1 on region-caption retrieval (+25 pp vs. dual-encoder baselines), $\textbf{32.5\%}$ P@1 on referring-expression retrieval (+19 pp), and $\textbf{17.8\%}$ P@1 on semantic geo-localization retrieval (over $3\times$ prior best), while matching or exceeding specialized baselines on conventional tasks such as scene classification and cross-modal retrieval. VLM2GeoVec unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing. We will publicly release the code, checkpoints, and data upon acceptance.
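The contrastive training of the single encoder presumably optimizes an InfoNCE-style objective over matched embedding pairs; a minimal sketch is below (the cosine-similarity formulation and the 0.07 temperature are common defaults, not details confirmed by the abstract).

```python
import numpy as np

def info_nce(queries, keys, temperature=0.07):
    """InfoNCE over cosine similarities; keys[i] is the positive for queries[i]."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = (q @ k.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))          # CE against the diagonal
```

Correctly matched pairs sit on the diagonal of the similarity matrix, so an aligned batch yields a much lower loss than a shuffled one.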
https://arxiv.org/abs/2512.11490
We present our solution to the BinEgo-360 Challenge at ICCV 2025, which focuses on temporal action localization (TAL) in multi-perspective and multi-modal video settings. The challenge provides a dataset containing panoramic, third-person, and egocentric recordings, annotated with fine-grained action classes. Our approach is built on the Temporal Shift Module (TSM), which we extend to handle TAL by introducing a background class and classifying fixed-length non-overlapping intervals. We employ a multi-task learning framework that jointly optimizes for scene classification and TAL, leveraging contextual cues between actions and environments. Finally, we integrate multiple models through a weighted ensemble strategy, which improves robustness and consistency of predictions. Our method is ranked first in both the initial and extended rounds of the competition, demonstrating the effectiveness of combining multi-task learning, an efficient backbone, and ensemble learning for TAL.
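Classifying fixed-length non-overlapping intervals reduces localization, at inference time, to merging consecutive identical non-background predictions into time segments; a minimal sketch, with a hypothetical 2-second interval length.

```python
def localize_actions(interval_preds, background=0, interval_len=2.0):
    """Merge runs of identical non-background interval predictions into
    (start_sec, end_sec, class) segments."""
    segments = []
    for i, cls in enumerate(interval_preds):
        start = i * interval_len
        if cls == background:
            continue
        if segments and segments[-1][2] == cls and segments[-1][1] == start:
            prev = segments.pop()                  # extend the previous run
            segments.append((prev[0], start + interval_len, cls))
        else:
            segments.append((start, start + interval_len, cls))
    return segments
```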
https://arxiv.org/abs/2512.11189
Remote sensing scene classification plays a key role in Earth observation by enabling the automatic identification of land use and land cover (LULC) patterns from aerial and satellite imagery. Despite recent progress with convolutional neural networks (CNNs) and vision transformers (ViTs), the task remains challenging due to variations in spatial resolution, viewpoint, orientation, and background conditions, which often reduce the generalization ability of existing models. To address these challenges, this paper proposes a lightweight architecture based on the convolutional mixer paradigm. The model alternates between spatial mixing through depthwise convolutions at multiple scales and channel mixing through pointwise operations, enabling efficient extraction of both local and contextual information while keeping the number of parameters and computations low. Extensive experiments were conducted on the AID and EuroSAT benchmarks. The proposed model achieved overall accuracy, average accuracy, and Kappa values of 74.7%, 74.57%, and 73.79 on the AID dataset, and 93.90%, 93.93%, and 93.22 on EuroSAT, respectively. These results demonstrate that the proposed approach provides a good balance between accuracy and efficiency compared with widely used CNN- and transformer-based models. Code will be publicly available at: this https URL
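The efficiency of the depthwise-plus-pointwise mixing can be seen from standard parameter counts (these formulas are the textbook definitions, not figures from the paper).

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k conv per channel plus a 1 x 1 pointwise projection."""
    return c_in * k * k + c_in * c_out
```

For a 64-channel 3x3 layer this works out to 36,864 vs. 4,672 parameters, roughly an 8x saving per mixing block.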
https://arxiv.org/abs/2512.06877
The performance of deep learning models in remote sensing (RS) strongly depends on the availability of high-quality labeled data. However, collecting large-scale annotations is costly and time-consuming, while vast amounts of unlabeled imagery remain underutilized. To address this challenge, we propose a Hierarchical Semi-Supervised Active Learning (HSSAL) framework that integrates semi-supervised learning (SSL) and a novel hierarchical active learning (HAL) in a closed iterative loop. In each iteration, SSL refines the model using both labeled data through supervised learning and unlabeled data via weak-to-strong self-training, improving feature representation and uncertainty estimation. Guided by the refined representations and uncertainty cues of unlabeled samples, HAL then conducts sample querying through a progressive clustering strategy, selecting the most informative instances that jointly satisfy the criteria of scalability, diversity, and uncertainty. This hierarchical process ensures both efficiency and representativeness in sample selection. Extensive experiments on three benchmark RS scene classification datasets, including UCM, AID, and NWPU-RESISC45, demonstrate that HSSAL consistently outperforms SSL- or AL-only baselines. Remarkably, with only 8%, 4%, and 2% labeled training data on UCM, AID, and NWPU-RESISC45, respectively, HSSAL achieves over 95% of fully-supervised accuracy, highlighting its superior label efficiency through effective exploitation of the informativeness of unlabeled data. Our code will be released at this https URL.
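The weak-to-strong self-training step can be sketched in the usual FixMatch style: pseudo-labels come from predictions on weakly augmented views and are kept only where confidence is high (the 0.95 threshold is an illustrative assumption, not the paper's value).

```python
import numpy as np

def weak_to_strong_targets(weak_probs, threshold=0.95):
    """Pseudo-labels from weak-view predictions, masked by confidence.

    The returned mask selects which unlabeled samples contribute to the
    loss computed on their strongly augmented views."""
    weak_probs = np.asarray(weak_probs, dtype=float)
    labels = weak_probs.argmax(axis=1)
    mask = weak_probs.max(axis=1) >= threshold
    return labels, mask
```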
https://arxiv.org/abs/2511.18058
Cross-modal Knowledge Distillation has demonstrated promising performance on paired modalities with strong semantic connections, referred to as Symmetric Cross-modal Knowledge Distillation (SCKD). However, implementing SCKD becomes exceedingly constrained in real-world scenarios due to the limited availability of paired modalities. To this end, we investigate a general and effective knowledge learning concept under weak semantic consistency, dubbed Asymmetric Cross-modal Knowledge Distillation (ACKD), aiming to bridge modalities with limited semantic overlap. Nevertheless, the shift from strong to weak semantic consistency improves flexibility but increases knowledge transmission costs, which we rigorously verify based on optimal transport theory. To mitigate the issue, we further propose a framework, namely SemBridge, integrating a Student-Friendly Matching module and a Semantic-aware Knowledge Alignment module. The former leverages self-supervised learning to acquire semantic-based knowledge and provide personalized instruction for each student sample by dynamically selecting the relevant teacher samples. The latter seeks the optimal transport path by employing Lagrangian optimization. To facilitate the research, we curate a benchmark dataset derived from two modalities, namely Multi-Spectral (MS) and asymmetric RGB images, tailored for remote sensing scene classification. Comprehensive experiments show that our framework achieves state-of-the-art performance compared with 7 existing approaches on 6 different model architectures across various datasets.
https://arxiv.org/abs/2511.08901
Hyperspectral imaging (HSI) is a vital tool for fine-grained land-use and land-cover (LULC) mapping. However, the inherent heterogeneity of HSI data has long posed a major barrier to developing generalized models via joint training. Although HSI foundation models have shown promise for different downstream tasks, the existing approaches typically overlook the critical guiding role of sensor meta-attributes, and struggle with multi-sensor training, limiting their transferability. To address these challenges, we propose SpecAware, a novel spectral-content-aware hyperspectral foundation model that unifies multi-sensor learning for HSI mapping. To facilitate this research, we also constructed Hyper-400K, a new large-scale, high-quality benchmark dataset with over 400k image patches from diverse airborne AVIRIS sensors. The core of SpecAware is a two-step hypernetwork-driven encoding process for HSI data. Firstly, we designed a meta-content aware module to generate a unique conditional input for each HSI patch, tailored to each spectral band of every sample by fusing the sensor meta-attributes and its own image content. Secondly, we designed the HyperEmbedding module, where a sample-conditioned hypernetwork dynamically generates a pair of matrix factors for channel-wise encoding, consisting of adaptive spatial pattern extraction and latent semantic feature re-projection. Thus, SpecAware gains the ability to perceive and interpret spatial-spectral features across diverse scenes and sensors. This, in turn, allows SpecAware to adaptively process a variable number of spectral channels, establishing a unified framework for joint pre-training. Extensive experiments on six datasets demonstrate that SpecAware can learn superior feature representations, excelling in land-cover semantic segmentation, change detection, and scene classification.
https://arxiv.org/abs/2510.27219
Remote sensing image scene classification remains a challenging task, primarily due to the complex spatial structures and multi-scale characteristics of ground objects. In existing approaches, CNNs excel at modeling local textures, while Transformers excel at capturing global context. However, efficiently integrating them remains a bottleneck due to the high computational cost of Transformers. To tackle this, we propose AFM-Net, a novel Advanced Hierarchical Fusing framework that achieves effective local and global co-representation through two pathways: a CNN branch for extracting hierarchical visual priors, and a Mamba branch for efficient global sequence modeling. The core innovation of AFM-Net lies in its Hierarchical Fusion Mechanism, which progressively aggregates multi-scale features from both pathways, enabling dynamic cross-level feature interaction and contextual reconstruction to produce highly discriminative representations. These fused features are then adaptively routed through a Mixture-of-Experts classifier module, which dispatches them to the most suitable experts for fine-grained scene recognition. Experiments on AID, NWPU-RESISC45, and UC Merced show that AFM-Net obtains 93.72, 95.54, and 96.92 percent accuracy, surpassing state-of-the-art methods with balanced performance and efficiency. Code is available at this https URL.
https://arxiv.org/abs/2510.27155
Annotating time boundaries of sound events is labor-intensive, limiting the scalability of strongly supervised learning in audio detection. To reduce annotation costs, weakly-supervised learning with only clip-level labels has been widely adopted. As an alternative, partial label learning offers a cost-effective approach, where a set of possible labels is provided instead of exact weak annotations. However, partial label learning for audio analysis remains largely unexplored. Motivated by the observation that acoustic scenes provide contextual information for constructing a set of possible sound events, we utilize acoustic scene information to construct partial labels of sound events. On the basis of this idea, in this paper, we propose a multitask learning framework that jointly performs acoustic scene classification and sound event detection with partial labels of sound events. While reducing annotation costs, weakly-supervised and partial label learning often suffer from decreased detection performance due to lacking the precise event set and their temporal annotations. To better balance between annotation cost and detection performance, we also explore a semi-supervised framework that leverages both strong and partial labels. Moreover, to refine partial labels and achieve better model training, we propose a label refinement method based on self-distillation for the proposed approach with partial labels.
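Constructing a partial label set from scene context might look like the sketch below; the scene-to-event mapping and the event vocabulary are hypothetical examples for illustration, not the paper's annotation scheme.

```python
# Hypothetical mapping from acoustic scenes to plausible sound events.
SCENE_EVENTS = {
    "street": {"car_horn", "engine", "footsteps"},
    "park": {"birdsong", "dog_bark", "footsteps"},
}

def partial_label_vector(scene, event_vocab):
    """Binary candidate mask: 1 for every event plausible in the scene.

    Unknown scenes fall back to the full vocabulary (no information)."""
    candidates = SCENE_EVENTS.get(scene, set(event_vocab))
    return [1 if e in candidates else 0 for e in event_vocab]
```

For a clip recorded in a park, the detector then only needs to disambiguate among the flagged candidate events rather than the full vocabulary.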
https://arxiv.org/abs/2510.25075
Remote sensing applications increasingly rely on deep learning for scene classification. However, their performance is often constrained by the scarcity of labeled data and the high cost of annotation across diverse geographic and sensor domains. While recent vision-language models like CLIP have shown promise by learning transferable representations at scale by aligning visual and textual modalities, their direct application to remote sensing remains suboptimal due to significant domain gaps and the need for task-specific semantic adaptation. To address this critical challenge, we systematically explore prompt learning as a lightweight and efficient adaptation strategy for few-shot remote sensing image scene classification. We evaluate several representative methods, including Context Optimization, Conditional Context Optimization, Multi-modal Prompt Learning, and Prompting with Self-Regulating Constraints. These approaches reflect complementary design philosophies: from static context optimization to conditional prompts for enhanced generalization, multi-modal prompts for joint vision-language adaptation, and semantically regularized prompts for stable learning without forgetting. We benchmark these prompt-learning methods against two standard baselines: zero-shot CLIP with hand-crafted prompts and a linear probe trained on frozen CLIP features. Through extensive experiments on multiple benchmark remote sensing datasets, including cross-dataset generalization tests, we demonstrate that prompt learning consistently outperforms both baselines in few-shot scenarios. Notably, Prompting with Self-Regulating Constraints achieves the most robust cross-domain performance. Our findings underscore prompt learning as a scalable and efficient solution for bridging the domain gap in satellite and aerial imagery, providing a strong foundation for future research in this field.
https://arxiv.org/abs/2510.24321
Acoustic scene classification (ASC) suffers from device-induced domain shift, especially when labels are limited. Prior work focuses on curriculum-based training schedules that structure data presentation by ordering or reweighting training examples from easy-to-hard to facilitate learning; however, existing curricula are static, fixing the ordering or the weights before training and ignoring that example difficulty and marginal utility evolve with the learned representation. To overcome this limitation, we propose the Dynamic Dual-Signal Curriculum (DDSC), a training schedule that adapts the curriculum online by combining two signals computed each epoch: a domain-invariance signal and a learning-progress signal. A time-varying scheduler fuses these signals into per-example weights that prioritize domain-invariant examples in early epochs and progressively emphasize device-specific cases. DDSC is lightweight, architecture-agnostic, and introduces no additional inference overhead. Under the official DCASE 2024 Task 1 protocol, DDSC consistently improves cross-device performance across diverse ASC baselines and label budgets, with the largest gains on unseen-device splits.
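The fusion of the two signals into per-example weights could be sketched with a linear schedule; the schedule shape and the assumption that both signals are pre-normalized are illustrative choices, not the paper's exact scheduler.

```python
import numpy as np

def ddsc_weights(domain_invariance, learning_progress, epoch, total_epochs):
    """Per-example curriculum weights: early epochs favor the
    domain-invariance signal, later epochs shift toward the
    learning-progress signal."""
    t = epoch / max(1, total_epochs - 1)   # 0 at the start, 1 at the end
    raw = (1 - t) * np.asarray(domain_invariance, dtype=float) \
        + t * np.asarray(learning_progress, dtype=float)
    return raw / raw.sum()                 # normalize to a distribution
```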
https://arxiv.org/abs/2510.17345
Multi-label classification (MLC) offers a more comprehensive semantic understanding of Remote Sensing (RS) imagery compared to traditional single-label classification (SLC). However, obtaining complete annotations for MLC is particularly challenging due to the complexity and high cost of the labeling process. As a practical alternative, single-positive multi-label learning (SPML) has emerged, where each image is annotated with only one relevant label, and the model is expected to recover the full set of labels. While scalable, SPML introduces significant supervision ambiguity, demanding specialized solutions for model training. Although various SPML methods have been proposed in the computer vision domain, research in the RS context remains limited. To bridge this gap, we propose Adaptive Gradient Calibration (AdaGC), a novel and generalizable SPML framework tailored to RS imagery. AdaGC adopts a gradient calibration (GC) mechanism combined with Mixup and a dual exponential moving average (EMA) module for robust pseudo-label generation. To maximize AdaGC's effectiveness, we introduce a simple yet theoretically grounded indicator to adaptively trigger GC after an initial warm-up stage based on training dynamics, thereby guaranteeing the effectiveness of GC in mitigating overfitting to label noise. Extensive experiments on two benchmark RS datasets under two distinct label noise types demonstrate that AdaGC achieves state-of-the-art (SOTA) performance while maintaining strong robustness across diverse settings.
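The dual-EMA pseudo-label generator can be sketched as two moving averages at different time scales whose outputs are averaged; the decay rates and zero initialization below are illustrative assumptions, not the paper's values.

```python
import numpy as np

class DualEMA:
    """Track class-probability estimates with a fast and a slow EMA;
    averaging the two tracks gives smoother pseudo-labels than either alone."""

    def __init__(self, n_classes, fast=0.9, slow=0.99):
        self.fast_avg = np.zeros(n_classes)
        self.slow_avg = np.zeros(n_classes)
        self.fast, self.slow = fast, slow

    def update(self, probs):
        probs = np.asarray(probs, dtype=float)
        self.fast_avg = self.fast * self.fast_avg + (1 - self.fast) * probs
        self.slow_avg = self.slow * self.slow_avg + (1 - self.slow) * probs
        return (self.fast_avg + self.slow_avg) / 2
```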
https://arxiv.org/abs/2510.08269
Remote Sensing Vision-Language Models (RSVLMs) have shown remarkable potential thanks to large-scale pretraining, achieving strong zero-shot performance on various tasks. However, their ability to generalize in low-data regimes, such as few-shot learning, remains insufficiently explored. In this work, we present the first structured benchmark for evaluating few-shot adaptation methods on RSVLMs. We conduct comprehensive experiments across ten remote sensing scene classification datasets, applying five widely used few-shot adaptation strategies to three state-of-the-art RSVLMs with varying backbones. Our findings reveal that models with similar zero-shot performance can exhibit markedly different behavior under few-shot adaptation, with some RSVLMs being inherently more amenable to such adaptation than others. The variability of performance and the absence of a clear winner among existing methods highlight the need for the development of more robust methods for few-shot adaptation tailored to RS. To facilitate future research, we provide a reproducible benchmarking framework and open-source code to systematically evaluate RSVLMs under few-shot conditions. The source code is publicly available on Github: this https URL
https://arxiv.org/abs/2510.07135
Deep learning has gained broad interest in remote sensing image scene classification thanks to the effectiveness of deep neural networks in extracting the semantics from complex data. However, deep networks require large amounts of training samples to obtain good generalization capabilities and are sensitive to errors in the training labels. This is a problem in remote sensing since highly reliable labels can be obtained only at high cost and in limited amounts. However, many sources of less reliable labeled data are available, e.g., obsolete digital maps. In order to train deep networks with larger datasets, we propose both the combination of single or multiple weak sources of labeled data with a small but reliable dataset to generate multisource labeled datasets and a novel training strategy where the reliability of each source is taken into consideration. This is done by exploiting the transition matrices describing the statistics of the errors of each source. The transition matrices are embedded into the labels and used during the training process to weigh each label according to the related source. The proposed method acts as a weighting scheme at gradient level, where each instance contributes different weights to the optimization of different classes. The effectiveness of the proposed method is validated by experiments on different datasets. The results prove the robustness of the proposed method and its capability to leverage unreliable sources of labels.
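Embedding a source's transition matrix into its labels amounts to replacing each one-hot label with a posterior over the true classes; a minimal sketch, assuming a uniform class prior and a row-stochastic matrix `T[i, j] = P(observed j | true i)` (both assumptions are mine, for illustration).

```python
import numpy as np

def soft_label_from_source(observed_label, transition):
    """Posterior over true classes given a label from a noisy source.

    With a uniform prior, Bayes' rule reduces to normalizing the column
    of the transition matrix for the observed label."""
    T = np.asarray(transition, dtype=float)
    col = T[:, observed_label]       # P(observed | each true class)
    return col / col.sum()
```

During training, each such soft label weighs the instance's gradient contribution per class, so reliable sources (near-identity matrices) dominate the update.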
https://arxiv.org/abs/2510.05760
Acoustic scene classification (ASC) models on edge devices typically operate under fixed class assumptions, lacking the transferability needed for real-world applications that require adaptation to new or refined acoustic categories. We propose ContrastASC, which learns generalizable acoustic scene representations by structuring the embedding space to preserve semantic relationships between scenes, enabling adaptation to unseen categories without retraining. Our approach combines supervised contrastive fine-tuning of pre-trained models with contrastive representation distillation to transfer this structured knowledge to compact student models. Our evaluation shows that ContrastASC demonstrates improved few-shot adaptation to unseen categories while maintaining strong closed-set performance.
https://arxiv.org/abs/2510.03728
Indoor scene classification is a critical task in computer vision, with wide-ranging applications that go from robotics to sensitive content analysis, such as child sexual abuse imagery (CSAI) classification. The problem is particularly challenging due to the intricate relationships between objects and complex spatial layouts. In this work, we propose the Attention over Scene Graphs for Sensitive Content Analysis (ASGRA), a novel framework that operates on structured graph representations instead of raw pixels. By first converting images into Scene Graphs and then employing a Graph Attention Network for inference, ASGRA directly models the interactions between a scene's components. This approach offers two key benefits: (i) inherent explainability via object and relationship identification, and (ii) privacy preservation, enabling model training without direct access to sensitive images. On Places8, we achieve 81.27% balanced accuracy, surpassing image-based methods. Real-world CSAI evaluation with law enforcement yields 74.27% balanced accuracy. Our results establish structured scene representations as a robust paradigm for indoor scene classification and CSAI classification. Code is publicly available at this https URL.
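A single graph-attention aggregation step of the kind applied to scene-graph nodes can be sketched as below (simplified to one head with no LeakyReLU; `W` and `a` stand for the learned projection matrix and attention vector).

```python
import numpy as np

def graph_attention_layer(h, adj, W, a):
    """Simplified GAT-style layer: attention logits from concatenated
    projected node pairs, masked by the adjacency matrix, then
    softmax-weighted neighbor averaging."""
    z = h @ W                                        # projected node features
    n = z.shape[0]
    logits = np.array([[a @ np.concatenate([z[i], z[j]])
                        for j in range(n)] for i in range(n)])
    logits = np.where(adj > 0, logits, -1e9)         # attend only along edges
    alpha = np.exp(logits - logits.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ z
```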
https://arxiv.org/abs/2509.26457
Self-supervised learning through masked autoencoders has attracted great attention for remote sensing (RS) foundation model (FM) development, enabling improved representation learning across diverse sensors and downstream tasks. However, existing RS FMs often either suffer from substantial computational complexity during both training and inference or exhibit limited representational capacity. These issues restrict their practical applicability in RS. To address this limitation, we propose an adaptation for enhancing the efficiency of RS FMs by integrating the Soft mixture-of-experts (MoE) mechanism into the FM. The integration of Soft MoEs into the FM allows modality-specific expert specialization alongside shared cross-sensor representation learning. To demonstrate the effectiveness of our adaptation, we apply it to the Cross-Sensor Masked Autoencoder (CSMAE) model, resulting in the Cross-Sensor Mixture-of-Experts (CSMoE) model. In addition, we introduce a thematic-climatic descriptor-driven sampling strategy for the construction of a representative and diverse training set to train our CSMoE model. Extensive experiments on scene classification, semantic segmentation, and content-based image retrieval demonstrate that our adaptation yields a reduction in computational requirements while maintaining or improving representational performance. Compared to state-of-the-art RS FMs, CSMoE achieves a superior trade-off between representational capacity, accuracy, and computational efficiency. On average, CSMoE achieves more than twice the computational efficiency of existing RS FMs, while maintaining competitive performance across all experiments. These results show the effectiveness of the proposed adaptation for creating computationally efficient RS FMs. The code for the model, the training set creation, and the model weights will be available at this https URL.
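The core of a soft mixture-of-experts layer is a convex combination of all experts' outputs under a learned gate; the dense-gating sketch below is a simplification of slot-based Soft MoE routing, for illustration only.

```python
import numpy as np

def soft_moe(x, gate_logits, experts):
    """Softly mix expert outputs per token: no hard routing, every expert
    processes every token and the gate weights the results."""
    g = np.exp(gate_logits - gate_logits.max(axis=-1, keepdims=True))
    gates = g / g.sum(axis=-1, keepdims=True)          # (tokens, experts)
    outs = np.stack([f(x) for f in experts], axis=-1)  # (tokens, dim, experts)
    return np.einsum("tde,te->td", outs, gates)
```

Because the gate is a softmax rather than a top-k selection, the layer stays fully differentiable, which is what makes this variant attractive for pre-training.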
https://arxiv.org/abs/2509.14104
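The Soft MoE routing that CSMoE builds on can be sketched as follows: every expert slot receives a soft (softmax-weighted) mixture of all input tokens, each expert processes its slots, and every token receives a soft mixture of slot outputs, so no token is dropped and routing stays fully differentiable. Dimensions and the one-layer experts below are illustrative, not the CSMoE configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not the CSMoE configuration).
n_tokens, d = 16, 32           # tokens per image, embedding dimension
n_experts, n_slots = 4, 2      # experts, slots per expert
total_slots = n_experts * n_slots

X = rng.normal(size=(n_tokens, d))             # token embeddings
Phi = 0.1 * rng.normal(size=(d, total_slots))  # learnable routing matrix
W_exp = 0.1 * rng.normal(size=(n_experts, d, d))  # hypothetical one-layer experts

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

logits = X @ Phi              # (n_tokens, total_slots) routing scores
D = softmax(logits, axis=0)   # dispatch: each slot is a soft mix over tokens
C = softmax(logits, axis=1)   # combine: each token is a soft mix over slots

slot_in = D.T @ X             # (total_slots, d) slot inputs
slot_out = np.empty_like(slot_in)
for e_idx in range(n_experts):  # each expert processes only its own slots
    sl = slice(e_idx * n_slots, (e_idx + 1) * n_slots)
    slot_out[sl] = np.tanh(slot_in[sl] @ W_exp[e_idx])

Y = C @ slot_out              # (n_tokens, d) output; no token is ever dropped
print(Y.shape)                # (16, 32)
```

The efficiency gain comes from each expert running on a fixed, small number of slots rather than on all tokens, while the two softmaxes keep the whole layer differentiable end to end.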
Acoustic Scene Classification (ASC) faces challenges in generalizing across recording devices, particularly when labeled data is limited. The DCASE 2024 Challenge Task 1 highlights this issue by requiring models to learn from small labeled subsets recorded on a few devices. These models must then generalize to recordings from previously unseen devices under strict complexity constraints. While techniques such as data augmentation and the use of pre-trained models are well-established for improving model generalization, optimizing the training strategy represents a complementary yet less-explored path that introduces no additional architectural complexity or inference overhead. Among various training strategies, curriculum learning offers a promising paradigm by structuring the learning process from easier to harder examples. In this work, we propose an entropy-guided curriculum learning strategy to address the domain shift problem in data-efficient ASC. Specifically, we quantify the uncertainty of device domain predictions for each training sample by computing the Shannon entropy of the device posterior probabilities estimated by an auxiliary domain classifier. Using entropy as a proxy for domain invariance, the curriculum begins with high-entropy samples and gradually incorporates low-entropy, domain-specific ones to facilitate the learning of generalizable representations. Experimental results on multiple DCASE 2024 ASC baselines demonstrate that our strategy effectively mitigates domain shift, particularly under limited labeled data conditions. Our strategy is architecture-agnostic and introduces no additional inference cost, making it easily integrable into existing ASC baselines and offering a practical solution to domain shift.
https://arxiv.org/abs/2509.11168
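The entropy-guided ordering described above can be sketched in a few lines: score each training clip by the Shannon entropy of its device posterior, start training on the most device-ambiguous (high-entropy) clips, and pace in low-entropy, device-specific ones. The synthetic posteriors and the linear pacing function are assumptions for illustration; the paper's actual classifier and schedule may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical device posteriors from an auxiliary domain classifier:
# one row per training clip, one column per recording device.
n_samples, n_devices = 1000, 6
logits = rng.normal(size=(n_samples, n_devices)) \
    * rng.uniform(0.2, 3.0, size=(n_samples, 1))   # vary prediction sharpness
post = np.exp(logits - logits.max(axis=1, keepdims=True))
post /= post.sum(axis=1, keepdims=True)

def shannon_entropy(p, eps=1e-12):
    """H(p) = -sum_i p_i log p_i, in nats; bounded by log(n_devices)."""
    return -(p * np.log(p + eps)).sum(axis=1)

H = shannon_entropy(post)     # high H -> device-ambiguous, i.e. domain-invariant
order = np.argsort(-H)        # curriculum order: high-entropy samples first

def curriculum_subset(order, epoch, total_epochs, start_frac=0.3):
    """Linear pacing (an assumed schedule): begin with the most
    domain-invariant clips, gradually admit device-specific ones."""
    frac = min(1.0, start_frac + (1 - start_frac) * epoch / max(1, total_epochs - 1))
    return order[: int(frac * len(order))]

for epoch in (0, 5, 9):
    idx = curriculum_subset(order, epoch, total_epochs=10)
    print(epoch, len(idx))    # subset grows from 300 clips to all 1000
```

Because the schedule only reorders which samples each epoch sees, it leaves the model architecture and inference path untouched, matching the abstract's claim of zero added inference cost.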