Recent advances in camera-controlled video diffusion models have significantly improved video-camera alignment. However, camera controllability remains limited. In this work, we build upon Reward Feedback Learning (ReFL) and aim to further improve camera controllability. However, directly borrowing existing ReFL approaches faces several challenges. First, current reward models lack the capacity to assess video-camera alignment. Second, decoding latents into RGB videos for reward computation introduces substantial computational overhead. Third, 3D geometric information is typically neglected during video decoding. To address these limitations, we introduce an efficient camera-aware 3D decoder that decodes video latents into 3D representations for reward quantification. Specifically, the video latent, along with the camera pose, is decoded into 3D Gaussians. In this process, the camera pose not only acts as an input but also serves as a projection parameter. Misalignment between the video latent and the camera pose causes geometric distortions in the 3D structure, resulting in blurry renderings. Based on this property, we explicitly optimize pixel-level consistency between rendered novel views and ground-truth ones as the reward. To accommodate the stochastic nature of generation, we further introduce a visibility term that selectively supervises only the deterministic regions derived via geometric warping. Extensive experiments conducted on the RealEstate10K and WorldScore benchmarks demonstrate the effectiveness of our proposed method. Project page: \href{this https URL}{CamPilot Page}.
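The visibility-gated reward can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `masked_consistency_reward` is a hypothetical helper that scores a rendered novel view against the ground-truth view only on pixels that a geometric-warping visibility mask marks as deterministic.

```python
def masked_consistency_reward(rendered, target, visibility):
    """Negative masked MSE between a rendered novel view and the ground truth.

    Only pixels flagged visible (i.e., deterministic under geometric warping)
    contribute; stochastic/occluded regions are excluded from supervision.
    """
    se, n = 0.0, 0
    for r_row, t_row, v_row in zip(rendered, target, visibility):
        for r, t, v in zip(r_row, t_row, v_row):
            if v:  # supervise only deterministic pixels
                se += (r - t) ** 2
                n += 1
    return -se / n if n else 0.0
```

A perfectly consistent latent/pose pair yields reward 0; geometric distortion (blurry, mismatched renderings) drives the reward negative only where supervision is warranted.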
https://arxiv.org/abs/2601.16214
Keyword Spotting (KWS) systems with small-footprint models deployed on edge devices face significant accuracy and robustness challenges due to domain shifts caused by varying noise and recording conditions. To address this, we propose a comprehensive continual-learning framework designed to adapt to new domains while maintaining computational efficiency. The proposed pipeline integrates a dual-input Convolutional Neural Network, utilizing both Mel Frequency Cepstral Coefficients (MFCC) and Mel-spectrogram features, supported by a multi-stage denoising process involving discrete wavelet transform and spectral subtraction techniques, plus model and prototype update blocks. Unlike prior methods that restrict updates to specific layers, our approach updates the complete quantized model, made possible by the compact model architecture. A subset of input samples is selected at runtime using class prototypes and confidence-driven filtering; these samples are then pseudo-labeled and combined with a rehearsal buffer for incremental model retraining. Experimental results on a noisy test dataset demonstrate the framework's effectiveness, achieving 99.63\% accuracy on clean data and maintaining robust performance (exceeding 94\% accuracy) across diverse noisy environments, even at a -10 dB Signal-to-Noise Ratio. The proposed framework confirms that integrating efficient denoising with prototype-based continual learning enables KWS models to operate autonomously and robustly in resource-constrained, dynamic environments.
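The spectral-subtraction stage of the denoising process can be sketched as follows. This is a simplified illustration assuming magnitude-spectrogram frames and that the first few frames are noise-only; the function name is hypothetical, and the paper's pipeline also includes a discrete-wavelet-transform stage not shown here.

```python
def spectral_subtract(frames, noise_frames=2, floor=0.0):
    """Subtract a noise estimate (mean magnitude of the first `noise_frames`
    frames, assumed noise-only) from every frame, clamping at a spectral floor."""
    bins = len(frames[0])
    noise = [sum(f[b] for f in frames[:noise_frames]) / noise_frames
             for b in range(bins)]
    return [[max(f[b] - noise[b], floor) for b in range(bins)] for f in frames]
```

Clamping at the floor avoids the negative magnitudes that naive subtraction would otherwise produce in low-energy bins.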
https://arxiv.org/abs/2601.16158
Optical Character Recognition (OCR) for low-resource languages remains a significant challenge due to the scarcity of large-scale annotated training datasets. Languages such as Kashmiri, with approximately 7 million speakers and a complex Perso-Arabic script featuring unique diacritical marks, currently lack support in major OCR systems including Tesseract, TrOCR, and PaddleOCR. Manual dataset creation for such languages is prohibitively expensive, time-consuming, and error-prone, often requiring word-by-word transcription of printed or handwritten text. We present SynthOCR-Gen, an open-source synthetic OCR dataset generator specifically designed for low-resource languages. Our tool addresses the fundamental bottleneck in OCR development by transforming digital Unicode text corpora into ready-to-use training datasets. The system implements a comprehensive pipeline encompassing text segmentation (character, word, n-gram, sentence, and line levels), Unicode normalization with script purity enforcement, multi-font rendering with configurable distribution, and 25+ data augmentation techniques simulating real-world document degradations including rotation, blur, noise, and scanner artifacts. We demonstrate the efficacy of our approach by generating a 600,000-sample word-segmented Kashmiri OCR dataset, which we release publicly on HuggingFace. This work provides a practical pathway for bringing low-resource languages into the era of vision-language AI models, and the tool is openly available for researchers and practitioners working with underserved writing systems worldwide.
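The segmentation levels described might look roughly like the sketch below. `segment` is a hypothetical helper, not the tool's actual API, and real Kashmiri text would additionally need the Unicode-normalization and script-purity steps the paper describes.

```python
def segment(text, level="word", n=2):
    """Split a Unicode line into OCR rendering units at the requested level."""
    words = text.split()
    if level == "char":
        return [c for c in text if not c.isspace()]
    if level == "word":
        return words
    if level == "ngram":
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return [text]  # sentence/line level: one sample per line
```

Each returned unit would then be rendered in a sampled font and passed through the augmentation stack (rotation, blur, noise, scanner artifacts) to yield image-label pairs.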
https://arxiv.org/abs/2601.16113
Large Language Models enable users to access databases through natural language interfaces, using tools like Text2SQL, Text2SPARQL, and Text2Cypher that translate user questions into structured database queries. While these systems improve database accessibility, most research focuses on English, with limited multilingual support. This work investigates a scalable multilingual Text2Cypher approach, aiming to support new languages without re-running full fine-tuning, avoiding manual hyper-parameter tuning, and maintaining performance close to joint multilingual fine-tuning. We train language-specific LoRA adapters for English, Spanish, and Turkish and combine them via uniform linear merging or a learned fusion MLP with dynamic gating. Experimental results show that the fusion MLP recovers around 75\% of the accuracy gains from joint multilingual fine-tuning while requiring only a smaller subset of the data, outperforming linear merging across all three languages. This approach enables incremental expansion to new languages by requiring only one LoRA adapter and a lightweight MLP retraining. Learned adapter fusion offers a practical alternative to expensive joint fine-tuning, balancing performance, data efficiency, and scalability for the multilingual Text2Cypher task.
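Uniform linear merging of adapters reduces to averaging the per-language delta matrices. A toy sketch with plain nested lists (real adapters are low-rank `A`/`B` factor pairs per layer, and `merge_lora` is a hypothetical name):

```python
def merge_lora(adapters, weights=None):
    """Uniform (or weighted) linear merge of per-language LoRA delta matrices."""
    k = len(adapters)
    weights = weights or [1.0 / k] * k  # uniform merge by default
    rows, cols = len(adapters[0]), len(adapters[0][0])
    return [[sum(w * a[i][j] for w, a in zip(weights, adapters))
             for j in range(cols)] for i in range(rows)]
```

The learned-fusion alternative replaces the fixed `weights` with a gating MLP that produces them dynamically per input, which is why only the small MLP needs retraining when a new language's adapter is added.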
https://arxiv.org/abs/2601.16097
Language-driven dexterous grasp generation requires models to understand task semantics, 3D geometry, and complex hand-object interactions. While vision-language models have been applied to this problem, existing approaches directly map observations to grasp parameters without intermediate reasoning about physical interactions. We present DextER (Dexterous Grasp Generation with Embodied Reasoning), which introduces contact-based embodied reasoning for multi-finger manipulation. Our key insight is that predicting which hand links contact where on the object surface provides an embodiment-aware intermediate representation bridging task semantics with physical constraints. DextER autoregressively generates embodied contact tokens specifying which finger links contact where on the object surface, followed by grasp tokens encoding the hand configuration. On DexGYS, DextER achieves a 67.14% success rate, outperforming the state of the art by 3.83 percentage points with a 96.4% improvement in intention alignment. We also demonstrate steerable generation through partial contact specification, providing fine-grained control over grasp synthesis.
https://arxiv.org/abs/2601.16046
Planetary surfaces are typically analyzed using high-level semantic concepts in natural language, yet vast orbital image archives remain organized at the pixel level. This mismatch limits scalable, open-ended exploration of planetary surfaces. Here we present MarScope, a planetary-scale vision-language framework enabling natural language-driven, label-free mapping of Martian landforms. MarScope aligns planetary images and text in a shared semantic space, trained on over 200,000 curated image-text pairs. This framework transforms global geomorphic mapping on Mars by replacing pre-defined classifications with flexible semantic retrieval, enabling arbitrary user queries across the entire planet in 5 seconds with F1 scores up to 0.978. Applications further show that it extends beyond morphological classification to facilitate process-oriented analysis and similarity-based geomorphological mapping at a planetary scale. MarScope establishes a new paradigm where natural language serves as a direct interface for scientific discovery over massive geospatial datasets.
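Semantic retrieval in a shared image-text embedding space boils down to cosine-similarity ranking over tile embeddings. A minimal sketch with toy vectors (the function names are illustrative, not MarScope's API):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_emb, image_embs, top_k=2):
    """Rank image tiles by similarity to the text-query embedding."""
    ranked = sorted(range(len(image_embs)),
                    key=lambda i: cosine(query_emb, image_embs[i]),
                    reverse=True)
    return ranked[:top_k]
```

Because tile embeddings can be precomputed once, answering an arbitrary natural-language query is a single embed-and-rank pass, which is what makes planet-scale retrieval in seconds plausible.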
https://arxiv.org/abs/2601.15949
Text-Based Person Search (TBPS) holds unique value in real-world surveillance, bridging visual perception and language understanding, yet current paradigms utilizing pre-trained models often fail to transfer effectively to complex open-world scenarios. The reliance on "Passive Observation" leads to multifaceted spurious correlations and spatial semantic misalignment, causing a lack of robustness against distribution shifts. To fundamentally resolve these defects, this paper proposes ICON (Invariant Counterfactual Optimization with Neuro-symbolic priors), a framework integrating causal and topological priors. First, we introduce Rule-Guided Spatial Intervention to strictly penalize sensitivity to bounding box noise, forcibly severing location shortcuts to achieve geometric invariance. Second, Counterfactual Context Disentanglement is implemented via semantic-driven background transplantation, compelling the model to ignore background interference for environmental independence. Then, we employ Saliency-Driven Semantic Regularization with adaptive masking to resolve local saliency bias and guarantee holistic completeness. Finally, Neuro-Symbolic Topological Alignment utilizes neuro-symbolic priors to constrain feature matching, ensuring activated regions are topologically consistent with human structural logic. Experimental results demonstrate that ICON not only maintains leading performance on standard benchmarks but also exhibits exceptional robustness against occlusion, background interference, and localization noise. This approach effectively advances the field by shifting from fitting statistical co-occurrences to learning causal invariance.
https://arxiv.org/abs/2601.15931
In the realm of Virtual Reality (VR) and Human-Computer Interaction (HCI), real-time emotion recognition shows promise for supporting individuals with Autism Spectrum Disorder (ASD) in improving social skills. This task requires a strict latency-accuracy trade-off, with motion-to-photon (MTP) latency kept below 140 ms to maintain contingency. However, most off-the-shelf Deep Learning models prioritize accuracy over the strict timing constraints of commodity hardware. As a first step toward accessible VR therapy, we benchmark State-of-the-Art (SOTA) models for Zero-Shot Facial Expression Recognition (FER) on virtual characters using the UIBVFED dataset. We evaluate Medium and Nano variants of YOLO (v8, v11, and v12) for face detection, alongside general-purpose Vision Transformers including CLIP, SigLIP, and this http URL. Results on CPU-only inference demonstrate that while face detection on stylized avatars is robust (100% accuracy), a "Latency Wall" exists in the classification stage. The YOLOv11n architecture offers the optimal balance for detection (~54 ms). However, general-purpose Transformers like CLIP and SigLIP fail to achieve viable accuracy (<23%) or speed (>150 ms) for real-time loops. This study highlights the necessity for lightweight, domain-specific architectures to enable accessible, real-time AI in therapeutic settings.
https://arxiv.org/abs/2601.15914
Robots that follow natural-language instructions often either plan at a high level using hand-designed interfaces or rely on large end-to-end models that are difficult to deploy for real-time control. We propose TeNet (Text-to-Network), a framework for instantiating compact, task-specific robot policies directly from natural language descriptions. TeNet conditions a hypernetwork on text embeddings produced by a pretrained large language model (LLM) to generate a fully executable policy, which then operates solely on low-dimensional state inputs at high control frequencies. By using the language only once, at policy instantiation time, TeNet inherits the general knowledge and paraphrasing robustness of pretrained LLMs while remaining lightweight and efficient at execution time. To improve generalization, we optionally ground language in behavior during training by aligning text embeddings with demonstrated actions, while requiring no demonstrations at inference time. Experiments on MuJoCo and Meta-World benchmarks show that TeNet produces policies that are orders of magnitude smaller than sequence-based baselines, while achieving strong performance in both multi-task and meta-learning settings and supporting high-frequency control. These results show that text-conditioned hypernetworks offer a practical way to build compact, language-driven controllers for resource-constrained robot control tasks with real-time requirements.
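The core idea, generating a policy's weights once from a text embedding and then running the policy on states alone, can be sketched with a linear hypernetwork producing a linear policy. All names and shapes here are illustrative assumptions, not TeNet's architecture:

```python
def make_policy(text_emb, hyper_w, state_dim, action_dim):
    """Instantiate a linear policy whose weight matrix is generated once
    from the instruction embedding; at run time only the state is used."""
    # hyper_w: (state_dim * action_dim) rows, each of length len(text_emb)
    flat = [sum(w * e for w, e in zip(row, text_emb)) for row in hyper_w]
    W = [flat[i * state_dim:(i + 1) * state_dim] for i in range(action_dim)]

    def policy(state):
        return [sum(wi * s for wi, s in zip(row, state)) for row in W]

    return policy
```

The LLM and hypernetwork are invoked only at instantiation; the returned `policy` is a tiny state-to-action map, which is what enables high-frequency control on constrained hardware.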
https://arxiv.org/abs/2601.15912
Inference in large-scale AI models is typically performed on dense parameter matrices, leading to inference cost and system complexity that scale unsustainably with model size. This limitation does not arise from insufficient model capacity, but from treating post-training inference systems as monolithic operators while ignoring internal structures formed during learning. We show that gradient update events in large models are highly localized and selective, leaving many parameter dependencies statistically indistinguishable from their initialization distribution after training. As a result, post-training inference systems are structurally non-uniform and inherently decomposable. Based on this observation, we introduce a post-training statistical criterion and a structural annealing procedure that removes unsupported dependencies and reveals stable, independent substructures. This work establishes a post-training, model-agnostic structural view of inference systems and enables structured, parallel inference without modifying model functionality or interfaces.
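The post-training statistical criterion might be operationalized as a significance test per dependency group: a group whose mean is within sampling noise of the zero-mean initialization distribution is treated as unsupported and pruned. A toy z-test sketch under that assumption (the paper's exact criterion is not specified here):

```python
import math

def prune_mask(weight_groups, init_sigma, z_thresh=3.0):
    """Keep a dependency group only if its mean deviates from the zero-mean
    initialization distribution (std init_sigma) beyond z_thresh."""
    mask = []
    for g in weight_groups:
        mean = sum(g) / len(g)
        z = abs(mean) / (init_sigma / math.sqrt(len(g)))  # z-score of the mean
        mask.append(z > z_thresh)
    return mask
```

Groups masked `False` are statistically indistinguishable from initialization and can be dropped, exposing the independent substructures the annealing procedure then exploits for parallel inference.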
https://arxiv.org/abs/2601.15871
The rapid spread of multimodal fake news poses a serious societal threat, as its evolving nature and reliance on timely factual details challenge existing detection methods. Dynamic Retrieval-Augmented Generation provides a promising solution by triggering keyword-based retrieval and incorporating external knowledge, thus enabling both efficient and accurate evidence selection. However, it still faces challenges in addressing issues such as redundant retrieval, coarse similarity, and irrelevant evidence when applied to deceptive content. In this paper, we propose ExDR, an Explanation-driven Dynamic Retrieval-Augmented Generation framework for Multimodal Fake News Detection. Our framework systematically leverages model-generated explanations in both the retrieval triggering and evidence retrieval modules. It assesses triggering confidence from three complementary dimensions, constructs entity-aware indices by fusing deceptive entities, and retrieves contrastive evidence based on deception-specific features to challenge the initial claim and enhance the final prediction. Experiments on two benchmark datasets, AMG and MR2, demonstrate that ExDR consistently outperforms previous methods in retrieval triggering accuracy, retrieval quality, and overall detection performance, highlighting its effectiveness and generalization capability.
https://arxiv.org/abs/2601.15820
We introduce a lightweight experimentation pipeline designed to lower the barrier for applying machine learning (ML) methods to image classification in ecological research. It enables ecologists to experiment with ML models independently, so they can move beyond off-the-shelf models and generate insights tailored to local datasets, specific classification tasks, and target variables. Our tool combines a simple command-line interface for preprocessing, training, and evaluation with a graphical interface for annotation, error analysis, and model comparison. This design enables ecologists to build and iterate on compact, task-specific classifiers without requiring advanced ML expertise. As a proof of concept, we apply the pipeline to classify red deer (Cervus elaphus) by age and sex from 3392 camera trap images collected in the Veldenstein Forest, Germany. Using 4352 cropped images containing individual deer labeled by experts, we trained and evaluated multiple backbone architectures with a wide variety of parameters and data augmentation strategies. Our best-performing models achieved 90.77% accuracy for age classification and 96.15% for sex classification. These results demonstrate that reliable demographic classification is feasible even with limited data when addressing narrow, well-defined ecological questions. More broadly, the framework provides ecologists with an accessible tool for developing ML models tailored to specific research questions, paving the way for broader adoption of ML in wildlife monitoring and demographic analysis.
https://arxiv.org/abs/2601.15813
Autonomous Unmanned Underwater Vehicles (UUVs) enable military and civilian covert operations in coastal areas without relying on support vessels or Global Navigation Satellite Systems (GNSS). Such operations are critical when surface access is not possible and stealthy navigation is required in restricted environments, such as protected zones or dangerous areas under access bans. GNSS-denied navigation is then essential to maintaining concealment, as surfacing could expose the UUVs to detection. To ensure precise fleet positioning, a constellation of beacons deployed by aerial or surface drones establishes a synthetic landmark network that guides the fleet of UUVs along an optimized path from the continental shelf to the goal on the shore. These beacons, either submerged or floating, emit acoustic signals for UUV localisation and navigation. A hierarchical planner generates an adaptive route for the drones, executing primitive actions while continuously monitoring and replanning as needed to maintain trajectory accuracy.
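Localization from the acoustic beacons could, for example, use range-based trilateration. Below is a minimal 2-D sketch with three beacons and exact ranges; the abstract does not specify the actual estimator, so this is purely illustrative.

```python
def trilaterate(beacons, ranges):
    """2-D position from three beacon positions and measured ranges.

    Subtracting the first range equation from the others yields a linear
    2x2 system, solved here by Cramer's rule (exact-fit case)."""
    (x0, y0), (x1, y1), (x2, y2) = beacons
    r0, r1, r2 = ranges
    a11, a12 = 2 * (x1 - x0), 2 * (y1 - y0)
    a21, a22 = 2 * (x2 - x0), 2 * (y2 - y0)
    b1 = r0**2 - r1**2 + x1**2 - x0**2 + y1**2 - y0**2
    b2 = r0**2 - r2**2 + x2**2 - x0**2 + y2**2 - y0**2
    det = a11 * a22 - a12 * a21
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det)
```

A real system would fuse noisy acoustic ranges over time (e.g., with a filter), but the linearized form above is the geometric core of synthetic-landmark positioning.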
https://arxiv.org/abs/2601.15802
This paper presents Glove2UAV, a wearable IMU-glove interface for intuitive UAV control through hand and finger gestures, augmented with vibrotactile warnings for exceeding predefined speed thresholds. To promote safer and more predictable interaction in dynamic flight, Glove2UAV is designed as a lightweight and easily deployable wearable interface intended for real-time operation. Glove2UAV streams inertial measurements in real time and estimates palm and finger orientations using a compact processing pipeline that combines median-based outlier suppression with Madgwick-based orientation estimation. The resulting motion estimates are mapped to a small set of control primitives for directional flight (forward/backward and lateral motion) and, when supported by the platform, to object-interaction commands. Vibrotactile feedback is triggered when flight speed exceeds predefined threshold values, providing an additional alert channel during operation. We validate real-time feasibility by synchronizing glove signals with UAV telemetry in both simulation and real-world flights. The results show fast gesture-based command execution, stable coupling between gesture dynamics and platform motion, correct operation of the core command set in our trials, and timely delivery of vibrotactile warning cues.
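The median-based outlier suppression and the threshold-triggered vibrotactile alert can be sketched as follows; the Madgwick orientation filter itself is omitted, and the function names are illustrative rather than the paper's API.

```python
def median_filter(samples, k=3):
    """Sliding-median outlier suppression over a 1-D IMU sample stream."""
    half = k // 2
    out = []
    for i in range(len(samples)):
        window = sorted(samples[max(0, i - half):i + half + 1])
        out.append(window[len(window) // 2])  # median of the local window
    return out

def speed_alert(speed, threshold):
    """Trigger the vibrotactile channel when flight speed exceeds the limit."""
    return speed > threshold
```

The median window rejects single-sample spikes (a common IMU artifact) before orientation estimation, while the alert predicate is evaluated against telemetry each control cycle.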
https://arxiv.org/abs/2601.15775
Accurate and interpretable air pollution forecasting is crucial for public health, but most models face a trade-off between performance and interpretability. This study proposes a physics-guided, interpretable-by-design spatiotemporal learning framework. The model decomposes the spatiotemporal behavior of air pollutant concentrations into two transparent, additive modules. The first is a physics-guided transport kernel with directed weights conditioned on wind and geography (advection). The second is an explainable attention mechanism that learns local responses and attributes future concentrations to specific historical lags and exogenous drivers. Evaluated on a comprehensive dataset from the Stockholm region, our model consistently outperforms state-of-the-art baselines across multiple forecasting horizons. Our model's integration of high predictive performance and spatiotemporal interpretability provides a more reliable foundation for operational air-quality management in real-world applications.
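A physics-guided transport step of the kind described, directed inflow weighted by wind alignment plus local decay, can be sketched on a small station graph. The weights, decay value, and function name here are toy assumptions, not the paper's learned kernel:

```python
def advect(conc, neighbors, wind_align, decay=0.9):
    """One transport step: each node keeps a decayed fraction of its own
    concentration and receives inflow from upwind neighbors, weighted by
    how well each directed edge aligns with the wind (negative = upwind cut)."""
    nxt = []
    for i in range(len(conc)):
        inflow = sum(conc[j] * max(wind_align[j][i], 0.0)
                     for j in neighbors[i])
        nxt.append(decay * conc[i] + inflow)
    return nxt
```

Because the directed weights are explicit and additive with the local-response module, each forecast can be attributed to specific upwind sources and historical lags, which is the interpretability claim in the abstract.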
https://arxiv.org/abs/2511.20257
Although artificial intelligence (AI) has become deeply integrated into various stages of the research workflow and achieved remarkable advancements, academic rebuttal remains a significant and underexplored challenge. This is because rebuttal is a complex process of strategic communication under severe information asymmetry rather than a simple technical debate. Consequently, current approaches struggle as they largely imitate surface-level linguistics, missing the essential element of perspective-taking required for effective persuasion. In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) pipeline that models the reviewer's mental state, formulates a persuasion strategy, and generates a strategy-grounded response. To train our agent, we construct RebuttalBench, a large-scale dataset synthesized via a novel critique-and-refine approach. Our training process consists of two stages, beginning with a supervised fine-tuning phase to equip the agent with ToM-based analysis and strategic planning capabilities, followed by a reinforcement learning phase leveraging a self-reward mechanism for scalable self-improvement. For reliable and efficient automated evaluation, we further develop Rebuttal-RM, a specialized evaluator trained on over 100K samples of multi-source rebuttal data, which achieves scoring consistency with human preferences that surpasses the powerful judge GPT-4.1. Extensive experiments show RebuttalAgent significantly outperforms the base model by an average of 18.3% on automated metrics, while also outperforming advanced proprietary models across both automated and human evaluations. Disclaimer: the generated rebuttal content is for reference only to inspire authors and assist in drafting. It is not intended to replace the author's own critical analysis and response.
https://arxiv.org/abs/2601.15715
Although AI agents have demonstrated impressive capabilities in long-horizon reasoning, their reliability is severely hampered by the ``Spiral of Hallucination,'' where early epistemic errors propagate irreversibly. Existing methods face a dilemma: uncertainty quantification (UQ) methods typically act as passive sensors, only diagnosing risks without addressing them, while self-reflection mechanisms suffer from continuous or aimless corrections. To bridge this gap, we propose a unified Dual-Process Agentic UQ (AUQ) framework that transforms verbalized uncertainty into active, bi-directional control signals. Our architecture comprises two complementary mechanisms: System 1 (Uncertainty-Aware Memory, UAM), which implicitly propagates verbalized confidence and semantic explanations to prevent blind decision-making; and System 2 (Uncertainty-Aware Reflection, UAR), which utilizes these explanations as rational cues to trigger targeted inference-time resolution only when necessary. This enables the agent to balance efficient execution and deep deliberation dynamically. Extensive experiments on closed-loop benchmarks and open-ended deep research tasks demonstrate that our training-free approach achieves superior performance and trajectory-level calibration. We believe this principled AUQ framework represents a significant step towards reliable agents.
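The System-1/System-2 control loop can be sketched as a confidence-gated dispatch: verbalized confidence and its explanation are logged to memory, and a targeted reflection pass runs only when confidence falls below a threshold. A minimal illustration with hypothetical names, not the AUQ implementation:

```python
def dual_process(steps, threshold=0.6, reflect=None):
    """System 1 logs (answer, confidence); System 2 re-solves a step only
    when its verbalized confidence drops below the threshold.

    `steps` is a list of (answer, confidence, explanation) tuples;
    `reflect` is a callable (answer, explanation) -> (answer, confidence)."""
    memory, reflections = [], 0
    for answer, conf, why in steps:
        if conf < threshold and reflect is not None:
            answer, conf = reflect(answer, why)  # targeted System-2 pass
            reflections += 1
        memory.append((answer, conf))
    return memory, reflections
```

Gating reflection on confidence is what avoids both failure modes in the abstract: passive diagnosis (never acting) and continuous, aimless correction (always acting).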
https://arxiv.org/abs/2601.15703
Few-shot recognition in synthetic aperture radar (SAR) imagery remains a critical bottleneck for real-world applications due to extreme data scarcity. A promising strategy involves synthesizing a large dataset with a generative adversarial network (GAN), pre-training a model via self-supervised learning (SSL), and then fine-tuning on the few labeled samples. However, this approach faces a fundamental paradox: conventional GANs themselves require abundant data for stable training, contradicting the premise of few-shot learning. To resolve this, we propose the consistency-regularized generative adversarial network (Cr-GAN), a novel framework designed to synthesize diverse, high-fidelity samples even when trained under these severe data limitations. Cr-GAN introduces a dual-branch discriminator that decouples adversarial training from representation learning. This architecture enables a channel-wise feature interpolation strategy to create novel latent features, complemented by a dual-domain cycle consistency mechanism that ensures semantic integrity. Our Cr-GAN framework is adaptable to various GAN architectures, and its synthesized data effectively boosts multiple SSL algorithms. Extensive experiments on the MSTAR and SRSDD datasets validate our approach, with Cr-GAN achieving a highly competitive accuracy of 71.21% and 51.64%, respectively, in the 8-shot setting, significantly outperforming leading baselines, while requiring only ~5% of the parameters of state-of-the-art diffusion models. Code is available at: this https URL.
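Channel-wise feature interpolation between two real samples' features might look like the toy sketch below (per-channel swapping with a mixing probability; the actual Cr-GAN operation and names may differ):

```python
import random

def channel_mix(feat_a, feat_b, p=0.5, rng=None):
    """Create a novel latent feature by taking each channel from feat_b
    with probability p, otherwise keeping feat_a's channel."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    return [fb if rng.random() < p else fa
            for fa, fb in zip(feat_a, feat_b)]
```

Mixing at the channel level manufactures new latent features from scarce real ones, while a cycle-consistency term (not shown) keeps the mixed features semantically valid.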
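Channel-wise feature interpolation, as named in the abstract, can be sketched numerically. This is an illustrative toy, not Cr-GAN's code: the feature shapes and the per-channel uniform mixing weights are assumptions; only the idea of mixing two samples' features with one weight per channel follows the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_a = rng.normal(size=(64, 8, 8))  # (channels, H, W) features of sample A
feat_b = rng.normal(size=(64, 8, 8))  # features of sample B

# One interpolation weight per channel, broadcast over the spatial dimensions,
# yields a novel latent feature that mixes the two real samples channel-wise.
alpha = rng.uniform(size=(64, 1, 1))
feat_new = alpha * feat_a + (1.0 - alpha) * feat_b

print(feat_new.shape)  # same shape as the inputs: (64, 8, 8)
```

Because each channel gets its own weight, the synthesized feature is not a single point on the line between the two samples, which is one plausible source of the added diversity the abstract claims.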
https://arxiv.org/abs/2601.15681
Retrieval-augmented generation (RAG) systems integrate document retrieval with large language models and have been widely adopted. However, in privacy-related scenarios, RAG introduces a new privacy risk: adversaries can issue carefully crafted queries to gradually exfiltrate sensitive content from the underlying corpus. Although recent studies have demonstrated multi-turn extraction attacks, they rely on heuristics and fail to perform long-term extraction planning. To address these limitations, we formulate the RAG extraction attack as an adaptive stochastic coverage problem (ASCP). In ASCP, each query is treated as a probabilistic action that aims to maximize conditional marginal gain (CMG), enabling principled long-term planning under uncertainty. However, integrating ASCP with practical RAG attacks faces three key challenges: unobservable CMG, intractability of the action space, and feasibility constraints. To overcome these challenges, we maintain a global attacker-side state to guide the attack. Building on this idea, we introduce RAGCRAWLER, which builds a knowledge graph to represent revealed information, uses this global state to estimate CMG, and plans queries in semantic space that target unretrieved regions. In comprehensive experiments across diverse RAG architectures and datasets, our proposed method, RAGCRAWLER, consistently outperforms all baselines. It achieves up to 84.4% corpus coverage within a fixed query budget and delivers an average improvement of 20.7% over the top-performing baseline. It also maintains high semantic fidelity and strong content reconstruction accuracy at low attack cost. Crucially, RAGCRAWLER proves its robustness by maintaining effectiveness against advanced RAG systems employing query rewriting and multi-query retrieval strategies. Our work reveals significant security gaps and highlights the pressing need for stronger safeguards for RAG.
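The planning principle behind CMG-driven extraction can be shown with a toy greedy coverage loop. This is a simplified sketch, not RAGCRAWLER itself: in the real setting the per-query retrieval sets are unobservable and must be estimated from the attacker-side state, whereas here the candidate pool and document sets are given assumptions.

```python
# Toy sketch: greedily pick, under a fixed budget, the candidate query whose
# retrieval set adds the most documents not yet covered (its marginal gain).

def plan_queries(candidates, budget):
    covered, plan = set(), []
    for _ in range(budget):
        best = max(candidates, key=lambda q: len(candidates[q] - covered))
        if not candidates[best] - covered:
            break  # no candidate has remaining marginal gain
        plan.append(best)
        covered |= candidates[best]
    return plan, covered

candidates = {
    "q1": {"d1", "d2", "d3"},
    "q2": {"d3", "d4"},
    "q3": {"d1", "d2"},  # fully redundant once q1 is chosen
}
plan, covered = plan_queries(candidates, budget=3)
print(plan, sorted(covered))  # ['q1', 'q2'] ['d1', 'd2', 'd3', 'd4']
```

Note that the redundant query q3 is never issued: maximizing marginal gain rather than raw retrieval size is what lets the attacker reach high corpus coverage within a fixed query budget.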
https://arxiv.org/abs/2601.15678
Healthcare organizations are beginning to embed agentic AI into routine workflows, including clinical documentation support and early-warning monitoring. As these capabilities diffuse across departments and vendors, health systems face agent sprawl, causing duplicated agents, unclear accountability, inconsistent controls, and tool permissions that persist beyond the original use case. Existing AI governance frameworks emphasize lifecycle risk management but provide limited guidance for the day-to-day operations of agent fleets. We propose a Unified Agent Lifecycle Management (UALM) blueprint derived from a rapid, practice-oriented synthesis of governance standards, agent security literature, and healthcare compliance requirements. UALM maps recurring gaps onto five control-plane layers: (1) an identity and persona registry, (2) orchestration and cross-domain mediation, (3) PHI-bounded context and memory, (4) runtime policy enforcement with kill-switch triggers, and (5) lifecycle management and decommissioning linked to credential revocation and audit logging. A companion maturity model supports staged adoption. UALM offers healthcare CIOs, CISOs, and clinical leaders an implementable pattern for audit-ready oversight that preserves local innovation and enables safer scaling across clinical and administrative domains.
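To make the five control-plane layers concrete, here is a minimal sketch of what an entry in UALM's identity-and-persona registry (layer 1) might record, tying an agent to its PHI boundary (layer 3) and to the kill-switch and decommissioning hooks (layers 4 and 5). All field names are illustrative assumptions, not a normative schema from the blueprint.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AgentRecord:
    agent_id: str
    persona: str                 # e.g. "clinical-documentation-assistant"
    owner: str                   # accountable team or vendor
    tool_permissions: list = field(default_factory=list)
    phi_scope: str = "none"      # PHI boundary for context/memory (layer 3)
    active: bool = True
    registered_at: str = ""

    def kill_switch(self):
        """Runtime policy enforcement: disable the agent and revoke its tools."""
        self.active = False
        self.tool_permissions.clear()  # credential revocation on decommissioning

rec = AgentRecord("agt-0042", "early-warning-monitor", "cardiology-it",
                  tool_permissions=["ehr.read"], phi_scope="unit-level",
                  registered_at=datetime.now(timezone.utc).isoformat())
rec.kill_switch()
print(rec.active, rec.tool_permissions)  # False []
```

A central record like this is what counters agent sprawl: ownership is explicit, and tool permissions cannot silently outlive the use case because decommissioning and revocation are coupled in one operation.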
https://arxiv.org/abs/2601.15630