Modern audio systems universally employ mel-scale representations derived from 1940s Western psychoacoustic studies, potentially encoding cultural biases that create systematic performance disparities. We present a comprehensive evaluation of cross-cultural bias in audio front-ends, comparing mel-scale features with learnable alternatives (LEAF, SincNet) and psychoacoustic variants (ERB, Bark, CQT) across speech recognition (11 languages), music analysis (6 collections), and acoustic scene classification (10 European cities). Our controlled experiments isolate front-end contributions while keeping architectures and training protocols minimal and constant. Results demonstrate that mel-scale features yield 31.2% WER for tonal languages compared to 18.7% for non-tonal languages (a 12.5-point gap), and show 15.7% F1 degradation between Western and non-Western music. Alternative representations significantly reduce these disparities: LEAF reduces the speech gap by 34% through adaptive frequency allocation, CQT achieves a 52% reduction in music performance gaps, and ERB-scale filtering cuts disparities by 31% with only 1% computational overhead. We also release FairAudioBench, enabling cross-cultural evaluation, and demonstrate that adaptive frequency decomposition offers practical paths toward equitable audio processing. These findings reveal how foundational signal processing choices propagate bias, providing crucial guidance for developing inclusive audio systems.
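The mel and ERB warpings compared above are closed-form, so the difference in how they allocate filters can be illustrated directly. The sketch below is a minimal numpy illustration, not the paper's FairAudioBench code; the 40-filter, 50 Hz–8 kHz setup is an arbitrary choice. It counts how many filter centers each scale places below 1 kHz, the region carrying the pitch contours that tonal languages rely on.

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style mel scale (O'Shaughnessy, 1987)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def hz_to_erb_rate(f):
    """ERB-rate scale (Glasberg & Moore, 1990)."""
    return 21.4 * np.log10(1.0 + 0.00437 * f)

def erb_rate_to_hz(e):
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

def center_freqs(n_filters, f_min, f_max, fwd, inv):
    """Equally spaced centers on a warped scale, mapped back to Hz."""
    pts = np.linspace(fwd(f_min), fwd(f_max), n_filters)
    return inv(pts)

mel_centers = center_freqs(40, 50.0, 8000.0, hz_to_mel, mel_to_hz)
erb_centers = center_freqs(40, 50.0, 8000.0, hz_to_erb_rate, erb_rate_to_hz)

# How many filters each scale places below 1 kHz, where tone/pitch cues live.
print("mel filters < 1 kHz:", int((mel_centers < 1000).sum()))
print("ERB filters < 1 kHz:", int((erb_centers < 1000).sum()))
```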
https://arxiv.org/abs/2604.10503
The adoption of vision-language models (VLMs) for wireless network management is accelerating, yet no systematic understanding exists of where these large foundation models outperform lightweight convolutional neural networks (CNNs) for spectrum-related tasks. This paper presents the first diagnostic comparison of VLMs and CNNs for spectrum heatmap understanding in non-terrestrial network and terrestrial network (NTN-TN) cooperative systems. We introduce SpectrumQA, a benchmark comprising 108K visual question-answer pairs across four granularity levels: scene classification (L1), regional reasoning (L2), spatial localization (L3), and semantic reasoning (L4). Our experiments on three NTN-TN scenarios with a frozen Qwen2-VL-7B and a trained ResNet-18 reveal a clear task-dependent complementarity: the CNN achieves 72.9% accuracy at severity classification (L1) and 0.552 IoU at spatial localization (L3), while the VLM uniquely enables semantic reasoning (L4) with F1=0.576 using only three in-context examples, a capability fundamentally absent in CNN architectures. Chain-of-thought (CoT) prompting further improves VLM reasoning by 12.6% (F1: 0.209 → 0.233) while having zero effect on spatial tasks, confirming that the complementarity is rooted in architectural differences rather than prompting limitations. A deterministic task-type router that delegates supervised tasks to the CNN and reasoning tasks to the VLM achieves a composite score of 0.616, a 39.1% improvement over the CNN alone. We further show that VLM representations exhibit stronger cross-scenario robustness, with smaller performance degradation in 5 out of 6 transfer directions. These findings provide actionable guidelines: deploy CNNs for spatial localization and VLMs for semantic spectrum reasoning, rather than treating them as substitutes.
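A deterministic task-type router of the kind described above is simple to express. The sketch below is illustrative only: the handler names, task identifiers, and return values are stand-ins for a trained ResNet-18 and a frozen, prompted Qwen2-VL-7B, and the assignment of L2 is deliberately left open since the abstract does not pin it down.

```python
# Minimal sketch of a deterministic task-type router; not the paper's code.
def cnn_infer(heatmap, task):
    return {"task": task, "backend": "ResNet-18 (supervised)"}

def vlm_infer(heatmap, task, icl_examples=()):
    return {"task": task, "backend": "Qwen2-VL-7B (frozen, few-shot)"}

CNN_TASKS = {"L1_scene_classification", "L3_spatial_localization"}   # supervised tasks
VLM_TASKS = {"L4_semantic_reasoning"}                                 # reasoning tasks
# L2 (regional reasoning) is assigned per deployment; the paper's choice is not assumed here.

def route(heatmap, task, icl_examples=()):
    """Delegate supervised tasks to the CNN and semantic reasoning to the VLM."""
    if task in CNN_TASKS:
        return cnn_infer(heatmap, task)
    if task in VLM_TASKS:
        return vlm_infer(heatmap, task, icl_examples)
    raise ValueError(f"no route configured for task {task!r}")

print(route(None, "L3_spatial_localization")["backend"])   # ResNet-18 path
print(route(None, "L4_semantic_reasoning")["backend"])     # Qwen2-VL-7B path
```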
https://arxiv.org/abs/2604.03774
YouTube Shorts have become central to news consumption on the platform, yet research on how geopolitical events are represented in this format remains limited. To address this gap, we present a multimodal pipeline that combines automatic transcription, aspect-based sentiment analysis (ABSA), and semantic scene classification. The pipeline is first assessed for feasibility and then applied to analyze short-form coverage of the Israel-Hamas war by state-funded outlets. Using over 2,300 conflict-related Shorts and more than 94,000 visual frames, we systematically examine war reporting across major international broadcasters. Our findings reveal that the sentiment expressed in transcripts regarding specific aspects differs across outlets and over time, whereas scene-type classifications reflect visual cues consistent with real-world events. Notably, smaller domain-adapted models outperform large transformers and even LLMs for sentiment analysis, underscoring the value of resource-efficient approaches for humanities research. The pipeline serves as a template for other short-form platforms, such as TikTok and Instagram, and demonstrates how multimodal methods, combined with qualitative interpretation, can characterize sentiment patterns and visual cues in algorithmically driven video environments.
https://arxiv.org/abs/2604.00994
This dataset provides a large collection of 10,915 synthetic hyperspectral image cubes paired with pixel-level vegetation trait maps, designed to support research in radiative transfer emulation, vegetation trait retrieval, and uncertainty quantification. Each hyperspectral cube contains 211 bands spanning 400–2500 nm at 10 nm resolution and a fixed spatial layout of 64 × 64 pixels, offering continuous simulated surface reflectance spectra suitable for emulator development and machine-learning tasks requiring high spectral detail. Vegetation traits were derived by inverting Sentinel-2 Level-2A surface reflectance using a PROSAIL-based lookup-table approach, followed by forward PROSAIL simulations to generate hyperspectral reflectance under physically consistent canopy and illumination conditions. The dataset covers four ecologically diverse regions -- East Africa, Northern France, Eastern India, and Southern Spain -- and includes 5th and 95th percentile uncertainty maps as well as Sentinel-2 scene classification layers. This resource enables benchmarking of inversion methods, development of fast radiative transfer emulators, and studies of spectral--biophysical relationships under controlled yet realistic environmental variability.
https://arxiv.org/abs/2603.28390
Few-shot remote sensing image scene classification (FS-RSISC) aims to classify remote sensing images with only a few labeled samples. The main challenges lie in small inter-class variances and large intra-class variances, which are inherent properties of remote sensing images. To address these challenges, we propose a transfer-based Dual Contrastive Network (DCN), which incorporates two auxiliary supervised contrastive learning branches during training. Specifically, one is a Context-guided Contrastive Learning (CCL) branch and the other is a Detail-guided Contrastive Learning (DCL) branch, which focus on inter-class discriminability and intra-class invariance, respectively. In the CCL branch, we first devise a Condenser Network to capture context features and then apply supervised contrastive learning on top of the obtained context features, helping the model learn more discriminative features. In the DCL branch, a Smelter Network is designed to highlight significant local detail information. We then apply supervised contrastive learning to the detail feature maps to fully exploit the spatial information in each map, enabling the model to concentrate on invariant detail features. Extensive experiments on four public benchmark remote sensing datasets demonstrate the competitive performance of our proposed DCN.
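Both branches build on a standard supervised contrastive objective. The sketch below is a generic implementation of that loss (in the style of Khosla et al., 2020), assuming plain (N, D) embeddings and integer class labels; the Condenser and Smelter networks that produce the context and detail features are not reproduced.

```python
import torch
import torch.nn.functional as F

def supcon_loss(features, labels, temperature=0.1):
    """Generic supervised contrastive loss.
    features: (N, D) embeddings; labels: (N,) integer class ids."""
    z = F.normalize(features, dim=1)
    sim = z @ z.T / temperature
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, -1e9)                       # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = ((labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye).float()
    counts = pos.sum(1).clamp(min=1)
    mean_log_prob_pos = (log_prob * pos).sum(1) / counts   # average over positives per anchor
    valid = pos.sum(1) > 0                                 # anchors with at least one positive
    return -(mean_log_prob_pos[valid]).mean()

# Toy usage with random embeddings and four classes.
feats = torch.randn(8, 32)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(supcon_loss(feats, labels).item())
```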
https://arxiv.org/abs/2603.23161
Remote sensing scene classification has experienced a paradigmatic transformation from traditional handcrafted feature methods to sophisticated artificial intelligence systems that now form the backbone of modern Earth observation applications. This comprehensive survey examines the complete methodological evolution, systematically tracing development from classical texture descriptors and machine learning classifiers through the deep learning revolution to current state-of-the-art foundation models and generative AI approaches. We chronicle the pivotal shift from manual feature engineering to automated hierarchical representation learning via convolutional neural networks, followed by advanced architectures including Vision Transformers, graph neural networks, and hybrid frameworks. The survey provides in-depth coverage of breakthrough developments in self-supervised foundation models and vision-language systems, highlighting exceptional performance in zero-shot and few-shot learning scenarios. Special emphasis is placed on generative AI innovations that tackle persistent challenges through synthetic data generation and advanced feature learning strategies. We analyze contemporary obstacles including annotation costs, multimodal data fusion complexities, interpretability demands, and ethical considerations, alongside current trends in edge computing deployment, federated learning frameworks, and sustainable AI practices. Based on comprehensive analysis of recent advances and gaps, we identify key future research priorities: advancing hyperspectral and multi-temporal analysis capabilities, developing robust cross-domain generalization methods, and establishing standardized evaluation protocols to accelerate scientific progress in remote sensing scene classification systems.
https://arxiv.org/abs/2603.26751
Scene understanding plays a critical role in enabling intelligence and autonomy in robotic systems. Traditional approaches often face challenges, including occlusions, ambiguous boundaries, and the inability to adapt attention based on task-specific requirements and sample variations. To address these limitations, this paper presents an efficient RGB-D scene understanding model that performs a range of tasks, including semantic segmentation, instance segmentation, orientation estimation, panoptic segmentation, and scene classification. The proposed model incorporates an enhanced fusion encoder, which effectively leverages redundant information from both RGB and depth inputs. For semantic segmentation, we introduce normalized focus channel layers and a context feature interaction layer, designed to mitigate issues such as shallow feature misguidance and insufficient local-global feature representation. The instance segmentation task benefits from a non-bottleneck 1D structure, which achieves superior contour representation with fewer parameters. Additionally, we propose a multi-task adaptive loss function that dynamically adjusts the learning strategy for different tasks based on scene variations. Extensive experiments on the NYUv2, SUN RGB-D, and Cityscapes datasets demonstrate that our approach outperforms existing methods in both segmentation accuracy and processing speed.
https://arxiv.org/abs/2603.07570
Pretraining and fine-tuning have emerged as a new paradigm in remote sensing image interpretation. Among these approaches, Masked Autoencoder (MAE)-based pretraining stands out for its strong capability to learn general feature representations by reconstructing masked image regions. However, applying MAE to multispectral remote sensing images remains challenging due to complex backgrounds, indistinct targets, and the lack of semantic guidance during masking, which hinders the learning of underlying structures and meaningful spatial-spectral features. To address this, we propose a simple yet effective approach, Spectral Index-Guided MAE (SIGMAE), for multispectral image pretraining. The core idea is to incorporate domain-specific spectral indices as prior knowledge to guide dynamic token masking toward informative regions. SIGMAE introduces Semantic Saliency-Guided Dynamic Token Masking (SSDTM), a curriculum-style strategy that quantifies each patch's semantic richness and internal heterogeneity to adaptively select the most informative tokens during training. By prioritizing semantically salient regions and progressively increasing sample difficulty, SSDTM enhances spectrally rich and structurally aware representation learning, mitigates overfitting, and reduces redundant computation compared with random masking. Extensive experiments on five widely used datasets covering various downstream tasks, including scene classification, semantic segmentation, object extraction, and change detection, demonstrate that SIGMAE outperforms other pretrained geospatial foundation models. Moreover, it exhibits strong spatial-spectral reconstruction capability even with a 90% mask ratio and improves complex target recognition under limited labeled data. The source code and model weights will be released at this https URL.
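To make the idea of index-guided, curriculum-style masking concrete, the sketch below scores patches from one spectral index (NDVI) using mean magnitude as a crude "semantic richness" proxy and standard deviation as a "heterogeneity" proxy, then mixes random and salient tokens under a curriculum knob. The scoring function, patch size, and mixing schedule are guesses at the general shape of SSDTM, not the paper's actual recipe.

```python
import numpy as np

def patch_scores(nir, red, patch=16):
    """Illustrative per-patch saliency from NDVI: |mean| + std per patch."""
    ndvi = (nir - red) / (nir + red + 1e-6)
    h, w = ndvi.shape
    scores = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            blk = ndvi[i:i + patch, j:j + patch]
            scores.append(np.abs(blk).mean() + blk.std())
    return np.array(scores)

def pick_masked_tokens(scores, mask_ratio=0.9, easy_to_hard=0.5, seed=0):
    """Curriculum-style selection: blend random tokens with the most salient
    ones; as easy_to_hard grows toward 1, masking focuses on salient tokens."""
    n = len(scores)
    n_mask = int(round(mask_ratio * n))
    n_salient = int(round(easy_to_hard * n_mask))
    salient = np.argsort(-scores)[:n_salient]
    rest = np.setdiff1d(np.arange(n), salient)
    rng = np.random.default_rng(seed)
    random_part = rng.choice(rest, size=n_mask - n_salient, replace=False)
    return np.concatenate([salient, random_part])

rng = np.random.default_rng(1)
nir, red = rng.random((64, 64)), rng.random((64, 64))      # toy bands
s = patch_scores(nir, red)
print(len(pick_masked_tokens(s, 0.9, 0.5)), "of", len(s), "tokens masked")
```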
https://arxiv.org/abs/2603.07463
This paper introduces DashengTokenizer, a continuous audio tokenizer engineered for joint use in both understanding and generation tasks. Unlike conventional approaches, which train acoustic tokenizers and subsequently integrate frozen semantic knowledge, our method inverts this paradigm: we leverage frozen semantic features and inject acoustic information. In linear evaluation across 22 diverse tasks, our method outperforms previous audio codec and audio encoder baselines by a significant margin while maintaining competitive audio reconstruction quality. Notably, we demonstrate that this acoustic injection improves performance for tasks such as speech emotion recognition, music understanding, and acoustic scene classification. We further evaluate the tokenizer's generative performance on text-to-audio (TTA), text-to-music (TTM), and speech enhancement (SE). Our approach surpasses standard variational autoencoder (VAE)-based methods on TTA and TTM tasks, while its effectiveness on SE underscores its capabilities as a general-purpose audio encoder. Finally, our results challenge the prevailing assumption that VAE-based architectures are a prerequisite for audio synthesis. Checkpoints are available at this https URL.
https://arxiv.org/abs/2602.23765
Speech Enhancement (SE) in audio devices is often supported by auxiliary modules for Voice Activity Detection (VAD), SNR estimation, or Acoustic Scene Classification to ensure robust context-aware behavior and seamless user experience. Just like SE, these tasks often employ deep learning; however, deploying additional models on-device is computationally impractical, whereas cloud-based inference would introduce additional latency and compromise privacy. Prior work on SE employed Dynamic Channel Pruning (DynCP) to reduce computation by adaptively disabling specific channels based on the current input. In this work, we investigate whether useful signal properties can be estimated from these internal pruning masks, thus removing the need for separate models. We show that simple, interpretable predictors achieve up to 93% accuracy on VAD, 84% on noise classification, and an R² of 0.86 on F0 estimation. With binary masks, predictions reduce to weighted sums, inducing negligible overhead. Our contribution is twofold: on one hand, we examine the emergent behavior of DynCP models through the lens of downstream prediction tasks, to reveal what they are learning; on the other, we repurpose and re-propose DynCP as a holistic solution for efficient SE and simultaneous estimation of signal properties.
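The "predictions reduce to weighted sums" observation is the whole trick: with binary channel masks, a linear read-out costs essentially nothing at inference time. The sketch below fits such a read-out on synthetic masks and toy VAD labels with ordinary least squares; the data, dimensions, and fitting method are illustrative only and not the paper's predictors.

```python
import numpy as np

rng = np.random.default_rng(0)
C, N = 64, 500
masks = rng.integers(0, 2, size=(N, C)).astype(float)      # per-frame binary pruning masks
true_w = rng.normal(size=C)
vad = (masks @ true_w + 0.1 * rng.normal(size=N) > 0).astype(float)  # toy VAD labels

X = np.hstack([masks, np.ones((N, 1))])                     # append a bias column
w, *_ = np.linalg.lstsq(X, vad, rcond=None)                 # linear read-out over mask bits
pred = (X @ w > 0.5).astype(float)                          # prediction = weighted sum + threshold
print("toy VAD accuracy from masks:", (pred == vad).mean())
```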
https://arxiv.org/abs/2602.10666
Recent advancements in Multimodal Large Language Models (MLLMs) have enabled complex reasoning. However, existing remote sensing (RS) benchmarks remain heavily biased toward perception tasks, such as object recognition and scene classification. This limitation hinders the development of MLLMs for cognitively demanding RS applications. To address this, we propose a Vision Language ReaSoning Benchmark (VLRS-Bench), the first benchmark exclusively dedicated to complex RS reasoning. Structured across the three core dimensions of Cognition, Decision, and Prediction, VLRS-Bench comprises 2,000 question-answer pairs with an average length of 71 words, spanning 14 tasks and up to eight temporal phases. VLRS-Bench is constructed via a specialized pipeline that integrates RS-specific priors and expert knowledge to ensure geospatial realism and reasoning complexity. Experimental results reveal significant bottlenecks in existing state-of-the-art MLLMs, providing critical insights for advancing multimodal reasoning within the remote sensing community.
https://arxiv.org/abs/2602.07045
Aerial images play a vital role in urban planning and environmental preservation, as they contain varied structures representing different types of buildings, forests, mountains, and unoccupied land. Due to this heterogeneity, developing robust models for scene classification remains a challenge. In this study, we conduct a literature review of machine learning methods for aerial image classification. Our survey covers a range of approaches, from handcrafted features (e.g., SIFT, LBP) to traditional CNNs (e.g., VGG, GoogLeNet) and advanced deep hybrid networks. In this connection, we also design Aerial-Y-Net, a spatial-attention-enhanced CNN with a multi-scale feature-fusion mechanism that helps capture the complexities of aerial images. Evaluated on the AID dataset, our model achieves 91.72% accuracy, outperforming several baseline architectures.
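A spatial-attention block of the general kind described here can be written in a few lines. The sketch below follows the common CBAM-style formulation (channel-wise average and max maps fed to a small convolution); the kernel size and its placement inside Aerial-Y-Net are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: per-pixel mask from pooled channel statistics."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)                  # (B, 1, H, W) average over channels
        mx, _ = x.max(dim=1, keepdim=True)                 # (B, 1, H, W) max over channels
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                                    # re-weight features spatially

feats = torch.randn(2, 128, 28, 28)
print(SpatialAttention()(feats).shape)                     # torch.Size([2, 128, 28, 28])
```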
https://arxiv.org/abs/2601.18263
Recent years have witnessed the remarkable success of deep learning in remote sensing image interpretation, driven by the availability of large-scale benchmark datasets. However, this reliance on massive training data also brings two major challenges: (1) high storage and computational costs, and (2) the risk of data leakage, especially when sensitive categories are involved. To address these challenges, this study introduces the concept of dataset distillation into the field of remote sensing image interpretation for the first time. Specifically, we train a text-to-image diffusion model to condense a large-scale remote sensing dataset into a compact and representative distilled dataset. To improve the discriminative quality of the synthesized samples, we propose a classifier-driven guidance by injecting a classification consistency loss from a pre-trained model into the diffusion training process. Besides, considering the rich semantic complexity of remote sensing imagery, we further perform latent space clustering on training samples to select representative and diverse prototypes as visual style guidance, while using a visual language model to provide aggregated text descriptions. Experiments on three high-resolution remote sensing scene classification benchmarks show that the proposed method can distill realistic and diverse samples for downstream model training. Code and pre-trained models are available online (this https URL).
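One way to picture the classifier-driven guidance is as an extra term added to the usual epsilon-prediction diffusion loss: the model's implied clean image is scored by a frozen, pre-trained classifier and penalized with cross-entropy. The sketch below is a toy rendering of that idea; the module shapes, the weighting lam, and the omission of text-prompt and style-prototype conditioning are all assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUNet(nn.Module):                 # stand-in denoiser
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, 3, padding=1)
    def forward(self, x, t):
        return self.net(x)

class TinyClassifier(nn.Module):           # stand-in frozen, pre-trained classifier
    def __init__(self, n_cls=10):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1),
                                 nn.AdaptiveAvgPool2d(1),
                                 nn.Flatten(), nn.Linear(8, n_cls))
    def forward(self, x):
        return self.net(x)

def guided_diffusion_loss(unet, clf, x0, labels, alpha_bar, lam=0.1):
    """Denoising loss plus classification-consistency term (illustrative)."""
    t = torch.randint(0, len(alpha_bar), (x0.size(0),))
    a = alpha_bar[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise            # forward process q(x_t | x_0)
    eps_hat = unet(xt, t)
    diff_loss = F.mse_loss(eps_hat, noise)                  # standard epsilon prediction
    x0_hat = (xt - (1 - a).sqrt() * eps_hat) / a.sqrt()     # implied clean image
    cls_loss = F.cross_entropy(clf(x0_hat), labels)         # consistency with frozen classifier
    return diff_loss + lam * cls_loss

unet, clf = TinyUNet(), TinyClassifier()
for p in clf.parameters():
    p.requires_grad_(False)                                 # classifier stays frozen
alpha_bar = torch.linspace(0.99, 0.01, 100)
loss = guided_diffusion_loss(unet, clf, torch.randn(4, 3, 32, 32),
                             torch.randint(0, 10, (4,)), alpha_bar)
print(loss.item())
```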
https://arxiv.org/abs/2601.15829
Federated Learning (FL) enables collaborative model training while keeping training data localized, allowing us to preserve privacy in various domains including remote sensing. However, recent studies show that FL models may still leak sensitive information through their outputs, motivating the need for rigorous privacy evaluation. In this paper, we leverage membership inference attacks (MIA) as a quantitative privacy measurement framework for FL applied to remote sensing image classification. We evaluate multiple black-box MIA techniques, including entropy-based attacks, modified entropy attacks, and the likelihood ratio attack, across different FL algorithms and communication strategies. Experiments conducted on two public scene classification datasets demonstrate that MIA effectively reveals privacy leakage not captured by accuracy alone. Our results show that communication-efficient FL strategies reduce MIA success rates while maintaining competitive performance. These findings confirm MIA as a practical metric and highlight the importance of integrating privacy measurement into FL system design for remote sensing applications.
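The simplest of the evaluated attacks, the entropy-based MIA, only needs black-box softmax outputs: members tend to receive low-entropy (confident) predictions, so a sample is flagged as a training member when its prediction entropy falls below a threshold. The sketch below runs this on synthetic outputs; the threshold calibration and the modified-entropy and likelihood-ratio variants are omitted, and the toy data is not from the paper.

```python
import numpy as np

def prediction_entropy(probs, eps=1e-12):
    p = np.clip(probs, eps, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def entropy_attack(probs, threshold):
    return prediction_entropy(probs) < threshold            # True -> predicted member

rng = np.random.default_rng(0)
members = rng.dirichlet(np.full(10, 0.1), size=200)          # peaked, confident outputs
nonmembers = rng.dirichlet(np.full(10, 1.0), size=200)       # flatter outputs
thr = np.median(prediction_entropy(nonmembers))              # crude threshold calibration
decisions = entropy_attack(np.vstack([members, nonmembers]), thr)
labels = np.r_[np.ones(200), np.zeros(200)].astype(bool)
print("toy attack accuracy:", (decisions == labels).mean())
```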
https://arxiv.org/abs/2601.06200
Diffusion-based remote sensing (RS) generative foundation models are crucial for downstream tasks. However, these models rely on large amounts of globally representative data, which often contain redundancy, noise, and class imbalance, reducing training efficiency and preventing convergence. Existing RS diffusion foundation models typically aggregate multiple classification datasets or apply simplistic deduplication, overlooking the distributional requirements of generative modeling and the heterogeneity of RS imagery. To address these limitations, we propose a training-free, two-stage data pruning approach that quickly selects a high-quality subset under high pruning ratios, enabling a preliminary foundation model to converge rapidly and serve as a versatile backbone for generation, downstream fine-tuning, and other applications. Our method jointly considers local information content together with global scene-level diversity and representativeness. First, an entropy-based criterion efficiently removes low-information samples. Next, leveraging RS scene classification datasets as reference benchmarks, we perform scene-aware clustering with stratified sampling to improve clustering effectiveness while reducing computational costs on large-scale unlabeled data. Finally, by balancing cluster-level uniformity and sample representativeness, the method enables fine-grained selection under high pruning ratios while preserving overall diversity and representativeness. Experiments show that, even after pruning 85% of the training data, our method significantly improves convergence and generation quality. Furthermore, diffusion foundation models trained with our method consistently achieve state-of-the-art performance across downstream tasks, including super-resolution and semantic image synthesis. This data pruning paradigm offers practical guidance for developing RS generative foundation models.
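Stage one of the pruning pipeline is easy to make concrete: rank tiles by Shannon entropy of their grey-level histogram and keep only the most informative fraction. The sketch below uses an 8-bit single-band histogram and a 15% keep ratio purely for illustration; the paper's exact entropy criterion, band handling, and threshold are not assumed here, and stage two (scene-aware clustering with stratified sampling) is left out.

```python
import numpy as np

def image_entropy(img_u8):
    """Shannon entropy (bits) of an 8-bit tile's grey-level histogram."""
    hist = np.bincount(img_u8.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def entropy_prune(images_u8, keep_ratio=0.15):
    """Keep only the highest-entropy fraction (e.g. 15% after 85% pruning)."""
    ent = np.array([image_entropy(x) for x in images_u8])
    k = max(1, int(round(keep_ratio * len(images_u8))))
    return np.argsort(-ent)[:k]

rng = np.random.default_rng(0)
tiles = [rng.integers(0, 256, size=(64, 64), dtype=np.uint8) for _ in range(20)]
tiles += [np.full((64, 64), 128, dtype=np.uint8) for _ in range(20)]   # flat, low-information tiles
print("kept indices:", entropy_prune(tiles, keep_ratio=0.15))
```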
https://arxiv.org/abs/2512.23239
We present a compact, quantization-ready acoustic scene classification (ASC) framework that couples an efficient student network with a learned teacher ensemble and knowledge distillation. The student backbone uses stacked depthwise-separable "expand-depthwise-project" blocks with global response normalization to stabilize training and improve robustness to device and noise variability, while a global pooling head yields class logits for efficient edge inference. To inject richer inductive bias, we assemble a diverse set of teacher models and learn two complementary fusion heads: z1, which predicts per-teacher mixture weights using a student-style backbone, and z2, a lightweight MLP that performs per-class logit fusion. The student is distilled from the ensemble via temperature-scaled soft targets combined with hard labels, enabling it to approximate the ensemble's decision geometry with a single compact model. Evaluated on the TAU Urban Acoustic Scenes 2022 Mobile benchmark, our approach achieves state-of-the-art (SOTA) results under matched edge-deployment constraints, demonstrating strong performance and practicality for mobile ASC.
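The distillation objective described here, temperature-scaled soft targets blended with hard labels, is the standard Hinton-style formulation. The sketch below shows that generic loss; the temperature T, mixing weight alpha, and the toy logits are illustrative, and the z1/z2 fusion heads that would supply the teacher logits are not reproduced.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.5):
    """Generic knowledge-distillation loss: KL to temperature-scaled teacher
    distribution plus cross-entropy on hard labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)        # T^2 restores gradient scale
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

s = torch.randn(8, 10, requires_grad=True)                  # student logits
t = torch.randn(8, 10)                                      # fused teacher-ensemble logits
y = torch.randint(0, 10, (8,))                              # hard labels
print(kd_loss(s, t, y).item())
```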
https://arxiv.org/abs/2512.13905
Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, diverse scale variations, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches remain fragmented between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval capabilities. We propose VLM2GeoVec, an instruction-following, single-encoder vision-language model trained contrastively to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. Our single encoder interleaves all inputs into one joint embedding trained with a contrastive loss, eliminating multi-stage pipelines and task-specific modules. To evaluate its versatility, we introduce RSMEB, a novel benchmark covering key remote-sensing embedding applications: scene classification; cross-modal search; compositional retrieval; visual question answering; visual grounding and region-level reasoning; and semantic geospatial retrieval. On RSMEB, it achieves 26.6% P@1 on region-caption retrieval (+25 pp vs. dual-encoder baselines), 32.5% P@1 on referring-expression retrieval (+19 pp), and 17.8% P@1 on semantic geo-localization retrieval (over 3× the prior best), while matching or exceeding specialized baselines on conventional tasks such as scene classification and cross-modal retrieval. VLM2GeoVec unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing. We will publicly release the code, checkpoints, and data upon acceptance.
https://arxiv.org/abs/2512.11490
We present our solution to the BinEgo-360 Challenge at ICCV 2025, which focuses on temporal action localization (TAL) in multi-perspective and multi-modal video settings. The challenge provides a dataset containing panoramic, third-person, and egocentric recordings, annotated with fine-grained action classes. Our approach is built on the Temporal Shift Module (TSM), which we extend to handle TAL by introducing a background class and classifying fixed-length non-overlapping intervals. We employ a multi-task learning framework that jointly optimizes for scene classification and TAL, leveraging contextual cues between actions and environments. Finally, we integrate multiple models through a weighted ensemble strategy, which improves robustness and consistency of predictions. Our method is ranked first in both the initial and extended rounds of the competition, demonstrating the effectiveness of combining multi-task learning, an efficient backbone, and ensemble learning for TAL.
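The interval-based localization and weighted ensembling steps can be sketched end-to-end: each fixed-length, non-overlapping clip receives per-class logits (with class 0 reserved for background), several models' softmax scores are blended with fixed weights, and consecutive non-background labels are merged into action segments. The interval length, weights, and merging rule below are illustrative, not the challenge submission's settings.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weighted_ensemble(per_model_logits, weights):
    """Blend softmax scores of several models; weights should sum to 1."""
    return sum(w * softmax(l) for w, l in zip(weights, per_model_logits))

def merge_segments(labels, interval_s=2.0):
    """Merge runs of identical non-background interval labels into
    (start_s, end_s, class) segments; class 0 is background."""
    segments, start, prev = [], None, 0
    for i, c in enumerate(list(labels) + [0]):               # sentinel background at the end
        if c != prev:
            if prev != 0:
                segments.append((start * interval_s, i * interval_s, int(prev)))
            start, prev = i, c
    return segments

rng = np.random.default_rng(0)
logits = [rng.normal(size=(10, 5)) for _ in range(3)]        # 3 models, 10 intervals, bg + 4 actions
probs = weighted_ensemble(logits, weights=[0.5, 0.3, 0.2])
print(merge_segments(probs.argmax(axis=1)))
```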
https://arxiv.org/abs/2512.11189
Remote sensing scene classification plays a key role in Earth observation by enabling the automatic identification of land use and land cover (LULC) patterns from aerial and satellite imagery. Despite recent progress with convolutional neural networks (CNNs) and vision transformers (ViTs), the task remains challenging due to variations in spatial resolution, viewpoint, orientation, and background conditions, which often reduce the generalization ability of existing models. To address these challenges, this paper proposes a lightweight architecture based on the convolutional mixer paradigm. The model alternates between spatial mixing through depthwise convolutions at multiple scales and channel mixing through pointwise operations, enabling efficient extraction of both local and contextual information while keeping the number of parameters and computations low. Extensive experiments were conducted on the AID and EuroSAT benchmarks. The proposed model achieved overall accuracy, average accuracy, and Kappa values of 74.7%, 74.57%, and 73.79 on the AID dataset, and 93.90%, 93.93%, and 93.22 on EuroSAT, respectively. These results demonstrate that the proposed approach provides a good balance between accuracy and efficiency compared with widely used CNN- and transformer-based models. Code will be publicly available on: this https URL
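The alternating spatial/channel mixing described here has a compact generic form. The block below follows the familiar ConvMixer-style pattern, with residual depthwise convolutions at two kernel scales followed by a pointwise channel-mixing convolution; the channel width, kernel sizes, and normalization are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConvMixerBlock(nn.Module):
    """One convolutional-mixer stage: multi-scale depthwise spatial mixing
    plus pointwise channel mixing."""
    def __init__(self, dim=64, kernels=(3, 7)):
        super().__init__()
        self.spatial = nn.ModuleList([
            nn.Sequential(nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim),
                          nn.GELU(), nn.BatchNorm2d(dim))
            for k in kernels])
        self.channel = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.GELU(), nn.BatchNorm2d(dim))

    def forward(self, x):
        for sp in self.spatial:
            x = x + sp(x)          # residual depthwise (spatial) mixing at each scale
        return self.channel(x)     # pointwise (channel) mixing

block = ConvMixerBlock()
print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])
```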
https://arxiv.org/abs/2512.06877
The performance of deep learning models in remote sensing (RS) strongly depends on the availability of high-quality labeled data. However, collecting large-scale annotations is costly and time-consuming, while vast amounts of unlabeled imagery remain underutilized. To address this challenge, we propose a Hierarchical Semi-Supervised Active Learning (HSSAL) framework that integrates semi-supervised learning (SSL) and a novel hierarchical active learning (HAL) strategy in a closed iterative loop. In each iteration, SSL refines the model using both labeled data through supervised learning and unlabeled data via weak-to-strong self-training, improving feature representation and uncertainty estimation. Guided by the refined representations and uncertainty cues of unlabeled samples, HAL then conducts sample querying through a progressive clustering strategy, selecting the most informative instances that jointly satisfy the criteria of scalability, diversity, and uncertainty. This hierarchical process ensures both efficiency and representativeness in sample selection. Extensive experiments on three benchmark RS scene classification datasets, including UCM, AID, and NWPU-RESISC45, demonstrate that HSSAL consistently outperforms SSL- or AL-only baselines. Remarkably, with only 8%, 4%, and 2% labeled training data on UCM, AID, and NWPU-RESISC45, respectively, HSSAL achieves over 95% of fully supervised accuracy, highlighting its superior label efficiency through exploiting the informativeness of unlabeled data. Our code will be released at this https URL.
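One active-learning query round in the spirit of HAL can be sketched as "cluster for diversity, then pick the most uncertain sample per cluster." The sketch below collapses the paper's progressive, hierarchical refinement and scalability criterion into a single flat round with k-means and predictive entropy; the clustering granularity, uncertainty measure, and budget are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def predictive_entropy(probs, eps=1e-12):
    p = np.clip(probs, eps, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def query(embeddings, probs, budget, seed=0):
    """Pick `budget` samples: one highest-entropy sample from each embedding cluster."""
    clusters = KMeans(n_clusters=budget, n_init=10, random_state=seed).fit_predict(embeddings)
    ent = predictive_entropy(probs)
    picks = []
    for c in range(budget):
        idx = np.flatnonzero(clusters == c)
        picks.append(idx[np.argmax(ent[idx])])   # most uncertain sample in this cluster
    return np.array(picks)

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 32))                 # SSL-refined embeddings of the unlabeled pool
prob = rng.dirichlet(np.ones(10), size=200)      # current model's softmax outputs
print(query(emb, prob, budget=8))
```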
https://arxiv.org/abs/2511.18058