Invisible watermarking has become a critical mechanism for authenticating AI-generated image content, with major platforms deploying watermarking schemes at scale. However, evaluating the vulnerability of these schemes against sophisticated removal attacks remains essential to assess their reliability and guide robust design. In this work, we expose a fundamental vulnerability in invisible watermarks by reformulating watermark removal as a view synthesis problem. Our key insight is that generating a perceptually consistent alternative view of the same semantic content, akin to re-observing a scene from a shifted perspective, naturally removes the embedded watermark while preserving visual fidelity. This reveals a critical gap: watermarks robust to pixel-space and frequency-domain attacks remain vulnerable to semantic-preserving viewpoint transformations. We introduce a zero-shot diffusion-based framework that applies controlled geometric transformations in latent space, augmented with view-guided correspondence attention to maintain structural consistency during reconstruction. Operating on frozen pre-trained models without detector access or watermark knowledge, our method achieves state-of-the-art watermark suppression across 15 watermarking methods, outperforming 14 baseline attacks while maintaining superior perceptual quality across multiple datasets.
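The vulnerability the abstract describes can be illustrated with a toy correlation detector in raw pixel space. Everything below is a synthetic stand-in: the actual method operates on diffusion latents with correspondence attention, not a one-pixel roll of an image array.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "cover image": smooth, low-frequency content standing in for semantics.
x = np.linspace(0, 4 * np.pi, 64)
image = np.outer(np.sin(x), np.cos(x))

# Toy invisible watermark: a weak pseudo-random high-frequency pattern.
watermark = 0.05 * rng.standard_normal((64, 64))
watermarked = image + watermark

def detect(img, wm):
    """Correlation detector: values near 1 mean the watermark is present."""
    return float(np.sum(img * wm) / np.sum(wm * wm))

# "View shift": a tiny geometric transformation (a 1-pixel translation here,
# standing in for the paper's latent-space viewpoint change).
shifted = np.roll(watermarked, shift=1, axis=1)

print(detect(watermarked, watermark))  # near 1.0: watermark detected
print(detect(shifted, watermark))      # near 0.0: detection broken
print(np.corrcoef(image.ravel(), shifted.ravel())[0, 1])  # semantics survive
```

The shift destroys the pixel-aligned high-frequency watermark signal while barely perturbing the smooth content, which is the intuition behind treating removal as re-observation from another viewpoint.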
https://arxiv.org/abs/2601.08832
Accurate individual identification is essential for monitoring rare amphibians, yet invasive marking is often unsuitable for critically endangered species. We evaluate state-of-the-art computer-vision methods for photographic re-identification of the Hula painted frog (Latonia nigriventer) using 1,233 ventral images from 191 individuals collected during 2013-2020 capture-recapture surveys. We compare deep local-feature matching in a zero-shot setting with deep global-feature embedding models. The local-feature pipeline achieves 98% top-1 closed-set identification accuracy, outperforming all global-feature models; fine-tuning improves the best global-feature model to 60% top-1 (91% top-10) but remains below local matching. To combine scalability with accuracy, we implement a two-stage workflow in which a fine-tuned global-feature model retrieves a short candidate list that is re-ranked by local-feature matching, reducing end-to-end runtime from 6.5-7.8 hours to ~38 minutes while maintaining ~96% top-1 closed-set accuracy on the labeled dataset. Separation of match scores between same- and different-individual pairs supports thresholding for open-set identification, enabling practical handling of novel individuals. We deploy this pipeline as a web application for routine field use, providing rapid, standardized, non-invasive identification to support conservation monitoring and capture-recapture analyses. Overall, in this species, zero-shot deep local-feature matching outperformed global-feature embedding and provides a strong default for photo-identification.
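The two-stage workflow (global-feature shortlist, then local-feature re-rank) can be sketched as follows. The embeddings, noise model, dimensions, and the stage-2 scorer are hypothetical placeholders; the paper uses deep global features for retrieval and deep local-feature matching for re-ranking.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical gallery: one global embedding per known individual.
n_ids, dim = 191, 128
gallery = rng.standard_normal((n_ids, dim))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

# Query photo of individual 17, simulated as a noisy view of its embedding.
true_id = 17
query = gallery[true_id] + 0.05 * rng.standard_normal(dim)
query /= np.linalg.norm(query)

# Stage 1: cheap global-feature retrieval produces a top-10 shortlist,
# avoiding an expensive comparison against every individual.
sims = gallery @ query
candidates = np.argsort(sims)[::-1][:10]

# Stage 2: an expensive matcher re-ranks only the shortlist. The paper uses
# deep local-feature matching here; this placeholder reuses cosine scores.
def local_match_score(candidate_id: int) -> float:
    return float(gallery[candidate_id] @ query)

best = int(max(candidates, key=local_match_score))
print(best == true_id)
```

Only the shortlist reaches the expensive matcher, which is exactly why the end-to-end runtime drops from hours to minutes while accuracy is largely retained.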
https://arxiv.org/abs/2601.08798
Vision-Language-Action (VLA) models have shown promising potential in embodied navigation by unifying perception and planning while inheriting the strong generalization abilities of large vision-language models (VLMs). However, most existing VLA models rely on reactive mappings directly from observations to actions, lacking the explicit reasoning capabilities and persistent memory required for complex, long-horizon navigation tasks. To address these challenges, we propose VLingNav, a VLA model for embodied navigation grounded in linguistic-driven cognition. First, inspired by the dual-process theory of human cognition, we introduce an adaptive chain-of-thought mechanism, which dynamically triggers explicit reasoning only when necessary, enabling the agent to fluidly switch between fast, intuitive execution and slow, deliberate planning. Second, to handle long-horizon spatial dependencies, we develop a visual-assisted linguistic memory module that constructs a persistent, cross-modal semantic memory, enabling the agent to recall past observations to prevent repetitive exploration and infer movement trends for dynamic environments. For the training recipe, we construct Nav-AdaCoT-2.9M, the largest embodied navigation dataset with reasoning annotations to date, enriched with adaptive CoT annotations that induce a reasoning paradigm capable of adjusting both when to think and what to think about. Moreover, we incorporate an online expert-guided reinforcement learning stage, enabling the model to surpass pure imitation learning and to acquire more robust, self-explored navigation behaviors. Extensive experiments demonstrate that VLingNav achieves state-of-the-art performance across a wide range of embodied navigation benchmarks. Notably, VLingNav transfers to real-world robotic platforms in a zero-shot manner, executing various navigation tasks and demonstrating strong cross-domain and cross-task generalization.
https://arxiv.org/abs/2601.08665
Large language models (LLMs) excel at semantic understanding, yet their ability to reconstruct internal structure from scrambled inputs remains underexplored. Sentence-level restoration is ill-posed for automated evaluation because multiple valid word orders often exist. We introduce OrderProbe, a deterministic benchmark for structural reconstruction using fixed four-character expressions in Chinese, Japanese, and Korean, which have a unique canonical order and thus support exact-match scoring. We further propose a diagnostic framework that evaluates models beyond recovery accuracy, including semantic fidelity, logical validity, consistency, robustness sensitivity, and information density. Experiments on twelve widely used LLMs show that structural reconstruction remains difficult even for frontier systems: zero-shot recovery frequently falls below 35%. We also observe a consistent dissociation between semantic recall and structural planning, suggesting that structural robustness is not an automatic byproduct of semantic competence.
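Why fixed four-character expressions make restoration well-posed can be shown directly. The idiom and scoring helper below are illustrative choices of mine, not items from the benchmark:

```python
from itertools import permutations

# A four-character Chinese idiom has exactly one canonical order, so
# restoration can be scored by exact match, unlike free-order sentences.
canonical = "画蛇添足"  # "draw a snake and add feet" (a superfluous act)

def score(prediction: str, answer: str) -> int:
    """Deterministic exact-match scoring: 1 if fully restored, else 0."""
    return int(prediction == answer)

# Enumerate every ordering of the scrambled characters; only one is correct.
scrambles = {"".join(p) for p in permutations(canonical)}
valid = [s for s in scrambles if score(s, canonical)]
print(len(scrambles), len(valid))  # 24 distinct orderings, exactly 1 correct
```

A sentence-level task would typically admit several of the 24 orderings as grammatical, which is precisely the ambiguity the benchmark avoids.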
https://arxiv.org/abs/2601.08626
Recent advances in search-augmented large reasoning models (LRMs) enable the retrieval of external knowledge to reduce hallucinations in multistep reasoning. However, their ability to operate on graph-structured data, prevalent in domains such as e-commerce, social networks, and scientific citations, remains underexplored. Unlike plain text corpora, graphs encode rich topological signals that connect related entities and can serve as valuable priors for retrieval, enabling more targeted search and improved reasoning efficiency. Yet, effectively leveraging such structure poses unique challenges, including the difficulty of generating graph-expressive queries and ensuring reliable retrieval that balances structural and semantic relevance. To address this gap, we introduce GraphSearch, the first framework that extends search-augmented reasoning to graph learning, enabling zero-shot graph learning without task-specific fine-tuning. GraphSearch combines a Graph-aware Query Planner, which disentangles search space (e.g., 1-hop, multi-hop, or global neighbors) from semantic queries, with a Graph-aware Retriever, which constructs candidate sets based on topology and ranks them using a hybrid scoring function. We further instantiate two traversal modes: GraphSearch-R, which recursively expands neighborhoods hop by hop, and GraphSearch-F, which flexibly retrieves across local and global neighborhoods without hop constraints. Extensive experiments across diverse benchmarks show that GraphSearch achieves competitive or even superior performance compared to supervised graph learning methods, setting state-of-the-art results in zero-shot node classification and link prediction. These findings position GraphSearch as a flexible and generalizable paradigm for agentic reasoning over graphs.
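A minimal sketch of the planner/retriever split, assuming a toy adjacency-list graph and random node embeddings. The inverse-degree structural prior is my stand-in for the paper's topology term in the hybrid scoring function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph (adjacency list) plus a random embedding per node.
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1, 4], 4: [3]}
emb = rng.standard_normal((5, 16))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

def k_hop_neighbors(node, k):
    """Structural candidate set: every node within k hops of the anchor.
    This is the planner's 'search space', kept separate from the query."""
    frontier, seen = {node}, {node}
    for _ in range(k):
        frontier = {m for n in frontier for m in adj[n]} - seen
        seen |= frontier
    return seen - {node}

def hybrid_score(query_vec, node, alpha=0.5):
    """Blend semantic similarity with a simple structural prior."""
    semantic = float(emb[node] @ query_vec)
    structural = 1.0 / len(adj[node])  # stand-in topology signal
    return alpha * semantic + (1 - alpha) * structural

query = emb[4]                    # pretend the semantic query matches node 4
cands = k_hop_neighbors(0, k=2)   # bounded-hop space, GraphSearch-R style
ranked = sorted(cands, key=lambda n: -hybrid_score(query, n))
print(cands)  # {1, 2, 3}
```

GraphSearch-F would instead draw candidates from both local and global neighborhoods without the hop bound, but rank them with the same kind of blended score.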
https://arxiv.org/abs/2601.08621
Reliable zero-shot detection of out-of-distribution (OOD) inputs is critical for deploying vision-language models in open-world settings. However, the lack of labeled negatives in zero-shot OOD detection necessitates proxy signals that remain effective under distribution shift. Existing negative-label methods rely on a fixed set of textual proxies, which (i) sparsely sample the semantic space beyond in-distribution (ID) classes and (ii) remain static while only visual features drift, leading to cross-modal misalignment and unstable predictions. In this paper, we propose CoEvo, a training- and annotation-free test-time framework that performs bidirectional, sample-conditioned adaptation of both textual and visual proxies. Specifically, CoEvo introduces a proxy-aligned co-evolution mechanism to maintain two evolving proxy caches, which dynamically mines contextual textual negatives guided by test images and iteratively refines visual proxies, progressively realigning cross-modal similarities and enlarging local OOD margins. Finally, we dynamically re-weight the contributions of dual-modal proxies to obtain a calibrated OOD score that is robust to distribution shift. Extensive experiments on standard benchmarks demonstrate that CoEvo achieves state-of-the-art performance, improving AUROC by 1.33% and reducing FPR95 by 45.98% on ImageNet-1K compared to strong negative-label baselines.
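The negative-label scoring idea underlying this line of work can be sketched as below with hypothetical unit-norm embeddings. This static variant deliberately omits what CoEvo adds: per-sample co-evolution of both proxy caches and dynamic re-weighting.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 64

def unit(v):
    return v / np.linalg.norm(v)

# Hypothetical text proxies: in-distribution class names vs mined negatives.
id_proxies = rng.standard_normal((5, dim))
id_proxies /= np.linalg.norm(id_proxies, axis=1, keepdims=True)
neg_proxies = rng.standard_normal((20, dim))
neg_proxies /= np.linalg.norm(neg_proxies, axis=1, keepdims=True)

def ood_score(image_vec):
    """Higher = more likely OOD: nearest-negative similarity minus
    nearest-ID similarity (a simplified, static negative-label score)."""
    return float(np.max(neg_proxies @ image_vec)
                 - np.max(id_proxies @ image_vec))

id_image = unit(id_proxies[2] + 0.1 * rng.standard_normal(dim))
ood_image = unit(neg_proxies[7] + 0.1 * rng.standard_normal(dim))
print(ood_score(id_image) < ood_score(ood_image))  # True
```

The paper's point is that under distribution shift the image vectors drift while these text proxies stay fixed, so the margin above erodes; CoEvo counteracts this by adapting both sides at test time.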
https://arxiv.org/abs/2601.08476
Distracted driving is a major cause of traffic collisions, calling for robust and scalable detection methods. Vision-language models (VLMs) enable strong zero-shot image classification, but existing VLM-based distracted driver detectors often underperform in real-world conditions. We identify subject-specific appearance variations (e.g., clothing, age, and gender) as a key bottleneck: VLMs entangle these factors with behavior cues, leading to decisions driven by who the driver is rather than what the driver is doing. To address this, we propose a subject decoupling framework that extracts a driver appearance embedding and removes its influence from the image embedding prior to zero-shot classification, thereby emphasizing distraction-relevant evidence. We further orthogonalize text embeddings via metric projection onto the Stiefel manifold to improve separability while staying close to the original semantics. Experiments demonstrate consistent gains over prior baselines, indicating the promise of our approach for practical road-safety applications.
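The appearance-removal step can be sketched as an orthogonal projection, assuming an appearance embedding is already available. The vectors here are synthetic, and the paper's extraction of the appearance direction is more involved than this:

```python
import numpy as np

def remove_component(image_emb, appearance_emb):
    """Project out the appearance direction from the image embedding,
    keeping only the component orthogonal to driver appearance
    (a minimal sketch of the subject-decoupling idea)."""
    a = appearance_emb / np.linalg.norm(appearance_emb)
    return image_emb - (image_emb @ a) * a

rng = np.random.default_rng(0)
behavior = rng.standard_normal(32)    # hypothetical "what the driver does"
appearance = rng.standard_normal(32)  # hypothetical "who the driver is"

image_emb = behavior + 2.0 * appearance  # entangled embedding
decoupled = remove_component(image_emb, appearance)

# The appearance direction is gone; behavior content largely survives.
print(abs(decoupled @ appearance))               # ~0
print(np.corrcoef(decoupled, behavior)[0, 1])    # close to 1
```

Classifying from `decoupled` instead of `image_emb` is what shifts decisions from driver identity toward driver behavior.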
https://arxiv.org/abs/2601.08467
The conceptual design phase in architecture and urban planning, particularly building massing, is complex and heavily reliant on designer intuition and manual effort. To address this, we propose an automated framework for generating building massing based on functional requirements and site context. A primary obstacle to such data-driven methods has been the lack of suitable datasets. Consequently, we introduce the CoMa-20K dataset, a comprehensive collection that includes detailed massing geometries, associated economical and programmatic data, and visual representations of the development site within its existing urban context. We benchmark this dataset by formulating massing generation as a conditional task for Vision-Language Models (VLMs), evaluating both fine-tuned and large zero-shot models. Our experiments reveal the inherent complexity of the task while demonstrating the potential of VLMs to produce context-sensitive massing options. The dataset and analysis establish a foundational benchmark and highlight significant opportunities for future research in data-driven architectural design.
https://arxiv.org/abs/2601.08464
Large Multimodal Models (LMMs) have recently shown remarkable promise in low-level visual perception tasks, particularly in Image Quality Assessment (IQA), demonstrating strong zero-shot capability. However, achieving state-of-the-art performance often requires computationally expensive fine-tuning methods, which aim to align the distribution of quality-related tokens in the output with image quality levels. Inspired by recent training-free works for LMMs, we introduce IQARAG, a novel, training-free framework that enhances LMMs' IQA ability. IQARAG leverages Retrieval-Augmented Generation (RAG) to retrieve semantically similar but quality-variant reference images, with their corresponding Mean Opinion Scores (MOSs), for the input image. The retrieved images and the input image are integrated into a specific prompt, providing the LMM with a visual perception anchor for the IQA task. IQARAG comprises three key phases: Retrieval Feature Extraction, Image Retrieval, and Integration & Quality Score Generation. Extensive experiments across multiple diverse IQA datasets, including KADID, KonIQ, LIVE Challenge, and SPAQ, demonstrate that the proposed IQARAG effectively boosts the IQA performance of LMMs, offering a resource-efficient alternative to fine-tuning for quality assessment.
https://arxiv.org/abs/2601.08311
Unobtrusive sensor-based recognition of Activities of Daily Living (ADLs) in smart homes by processing data collected from IoT sensing devices supports applications such as healthcare, safety, and energy management. Recent zero-shot methods based on Large Language Models (LLMs) have the advantage of removing the reliance on labeled ADL sensor data. However, existing approaches rely on time-based segmentation, which is poorly aligned with the contextual reasoning capabilities of LLMs. Moreover, existing approaches lack methods for estimating prediction confidence. This paper proposes to improve zero-shot ADL recognition with event-based segmentation and a novel method for estimating prediction confidence. Our experimental evaluation shows that event-based segmentation consistently outperforms time-based LLM approaches on complex, realistic datasets and surpasses supervised data-driven methods, even with relatively small LLMs (e.g., Gemma 3 27B). The proposed confidence measure effectively distinguishes correct from incorrect predictions.
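The difference between event-based and time-based segmentation can be sketched on a toy sensor stream: a new segment starts whenever the gap between consecutive events exceeds a threshold, rather than slicing the timeline into fixed windows. Timestamps, sensor names, and the 60-second gap below are all made up:

```python
# Toy smart-home event stream: (timestamp_seconds, sensor_event).
events = [
    (0.0, "kitchen_motion"), (4.0, "fridge_open"), (9.0, "fridge_close"),
    (310.0, "bathroom_motion"), (312.0, "tap_on"), (340.0, "tap_off"),
]

def event_segments(stream, max_gap=60.0):
    """Split the stream at inter-event gaps larger than max_gap, so each
    segment groups the events of one plausible activity."""
    segments, current = [], [stream[0]]
    for prev, cur in zip(stream, stream[1:]):
        if cur[0] - prev[0] > max_gap:
            segments.append(current)
            current = []
        current.append(cur)
    segments.append(current)
    return segments

segs = event_segments(events)
print(len(segs))                 # 2 segments (kitchen activity, bathroom activity)
print([e[1] for e in segs[0]])   # events of the first activity
```

A fixed 5-minute window would have cut through or merged these activities arbitrarily; gap-based segments hand the LLM coherent, activity-sized evidence to reason over.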
https://arxiv.org/abs/2601.08241
Dynamic voltage and frequency scaling (DVFS) and task-to-core allocation are critical for thermal management and balancing energy and performance in embedded systems. Existing approaches either rely on utilization-based heuristics that overlook stall times, or require extensive offline profiling for table generation, preventing runtime adaptation. We propose a model-based hierarchical multi-agent reinforcement learning (MARL) framework for thermal- and energy-aware scheduling on multi-core platforms. Two collaborative agents decompose the exponential action space, achieving 358ms latency for subsequent decisions. First decisions require 3.5 to 8.0s including one-time LLM feature extraction. An accurate environment model leverages regression techniques to predict thermal dynamics and performance states. When combined with LLM-extracted semantic features, the environment model enables zero-shot deployment for new workloads on trained platforms by generating synthetic training data without requiring workload-specific profiling samples. We introduce LLM-based semantic feature extraction that characterizes OpenMP programs through 13 code-level features without execution. The Dyna-Q-inspired framework integrates direct reinforcement learning with model-based planning, achieving 20x faster convergence than model-free methods. Experiments on BOTS and PolybenchC benchmarks across NVIDIA Jetson TX2, Jetson Orin NX, RubikPi, and Intel Core i7 demonstrate 7.09x better energy efficiency and 4.0x better makespan than the Linux ondemand governor. First-decision latency is 8,300x faster than table-based profiling, enabling practical deployment in dynamic embedded systems.
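The Dyna-Q-style combination of direct reinforcement learning with model-based planning can be sketched on a toy chain MDP. The real system's agents, state space, and learned thermal/performance model are far richer; this only shows the update structure that yields the claimed convergence speedup:

```python
import random

random.seed(0)

# Toy deterministic chain MDP: states 0..4, action 0 = left, 1 = right;
# reaching state 4 gives reward 1 and ends the episode.
def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(4, s + 1)
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

Q = {(s, a): 0.0 for s in range(5) for a in (0, 1)}
model = {}  # learned environment model: (s, a) -> (s', r)
alpha, gamma, eps, n_plan = 0.5, 0.9, 0.1, 20

def act(s):
    if random.random() < eps or Q[(s, 0)] == Q[(s, 1)]:
        return random.choice((0, 1))
    return max((0, 1), key=lambda a: Q[(s, a)])

for _ in range(30):  # episodes
    s, done = 0, False
    while not done:
        a = act(s)
        s2, r, done = step(s, a)
        # Direct RL update from real experience.
        target = r + gamma * max(Q[(s2, 0)], Q[(s2, 1)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        model[(s, a)] = (s2, r)
        # Planning: replay simulated transitions from the learned model,
        # spreading value without touching the real environment.
        for _ in range(n_plan):
            (ps, pa), (ps2, pr) = random.choice(list(model.items()))
            pt = pr + gamma * max(Q[(ps2, 0)], Q[(ps2, 1)])
            Q[(ps, pa)] += alpha * (pt - Q[(ps, pa)])
        s = s2

policy = [max((0, 1), key=lambda a: Q[(s, a)]) for s in range(4)]
print(policy)  # [1, 1, 1, 1]: always move toward the goal
```

The `n_plan` simulated updates per real step are what let a Dyna-style learner converge far faster than a model-free learner given the same amount of real experience.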
https://arxiv.org/abs/2601.08166
Vision-language models (VLMs), despite their extraordinary zero-shot capabilities, are vulnerable to distribution shifts. Test-time adaptation (TTA) emerges as a predominant strategy to adapt VLMs to unlabeled test data on the fly. However, existing TTA methods heavily rely on zero-shot predictions as pseudo-labels for self-training, which can be unreliable under distribution shifts and misguide adaptation due to two fundamental limitations. First (Modality Gap), distribution shifts induce gaps between visual and textual modalities, making cross-modal relations inaccurate. Second (Visual Nuisance), visual embeddings encode rich but task-irrelevant noise that often overwhelms task-specific semantics under distribution shifts. To address these limitations, we propose SubTTA, which aligns the semantic subspaces of both modalities to enhance zero-shot predictions to better guide the TTA process. To bridge the modality gap, SubTTA extracts the principal subspaces of both modalities and aligns the visual manifold to the textual semantic anchor by minimizing their chordal distance. To eliminate visual nuisance, SubTTA projects the aligned visual features onto the task-specific textual subspace, which filters out task-irrelevant noise by constraining visual embeddings within the valid semantic span, and standard TTA is further performed on the purified space to refine the decision boundaries. Extensive experiments on various benchmarks and VLM architectures demonstrate the effectiveness of SubTTA, yielding an average improvement of 2.24% over state-of-the-art TTA methods.
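The two geometric ingredients, principal-subspace extraction and the chordal distance that SubTTA minimizes, can be sketched with synthetic embeddings that share a low-dimensional semantic structure. Dimensions and noise levels are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def principal_subspace(X, k):
    """Orthonormal basis for the top-k principal directions of rows of X."""
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return Vt[:k].T  # shape (dim, k)

def chordal_distance(A, B):
    """Chordal distance between subspaces spanned by orthonormal A and B:
    the norm of the sines of the principal angles."""
    cos = np.clip(np.linalg.svd(A.T @ B, compute_uv=False), -1.0, 1.0)
    return float(np.sqrt(np.sum(1.0 - cos ** 2)))

# Toy setup: textual and visual embeddings share a 4-dim semantic structure,
# each observed with small noise.
basis = np.linalg.qr(rng.standard_normal((16, 4)))[0]
text_emb = rng.standard_normal((50, 4)) @ basis.T \
    + 0.05 * rng.standard_normal((50, 16))
vis_emb = rng.standard_normal((50, 4)) @ basis.T \
    + 0.05 * rng.standard_normal((50, 16))

T = principal_subspace(text_emb, 4)
V = principal_subspace(vis_emb, 4)
R = principal_subspace(rng.standard_normal((50, 16)), 4)  # unrelated subspace

d_aligned, d_random = chordal_distance(T, V), chordal_distance(T, R)
print(d_aligned < d_random)  # True: shared semantics => small distance
```

SubTTA's alignment step drives the visual manifold toward the textual anchor by minimizing exactly this kind of distance; the subsequent projection onto the textual subspace is `vis_features @ T @ T.T` in this notation.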
https://arxiv.org/abs/2601.08139
Scientific compound figures combine multiple labeled panels into a single image, but captions in real pipelines are often missing or only provide figure-level summaries, making panel-level understanding difficult. In this paper, we propose FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the compound figure. To mitigate the impact of diverse phrasing in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively filters token-level features to stabilize the detection query space. Furthermore, we employ a staged optimization strategy combining supervised learning with reinforcement learning (RL), utilizing CLIP-based alignment and BERTScore-based semantic rewards to enforce strict multimodal consistency. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry. Experimental results demonstrate that FigEx2 achieves a superior 0.726 mAP@0.5:0.95 for detection and significantly outperforms Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore. Notably, FigEx2 exhibits remarkable zero-shot transferability to out-of-distribution scientific domains without any fine-tuning.
https://arxiv.org/abs/2601.08026
Zero-Shot image Anomaly Detection (ZSAD) aims to detect and localise anomalies without access to any normal training samples of the target data. While recent ZSAD approaches leverage additional modalities such as language to generate fine-grained prompts for localisation, vision-only methods remain limited to image-level classification, lacking spatial precision. In this work, we introduce a simple yet effective training-free vision-only ZSAD framework that circumvents the need for fine-grained prompts by leveraging the inversion of a pretrained Denoising Diffusion Implicit Model (DDIM). Specifically, given an input image and a generic text description (e.g., "an image of an [object class]"), we invert the image to obtain latent representations and initiate the denoising process from a fixed intermediate timestep to reconstruct the image. Since the underlying diffusion model is trained solely on normal data, this process yields a normal-looking reconstruction. The discrepancy between the input image and the reconstructed one highlights potential anomalies. Our method achieves state-of-the-art performance on VISA dataset, demonstrating strong localisation capabilities without auxiliary modalities and facilitating a shift away from prompt dependence for zero-shot anomaly detection research. Code is available at this https URL.
https://arxiv.org/abs/2601.08022
Large language models (LLMs) are increasingly evaluated on their ability to perform multi-hop reasoning, i.e., to combine multiple pieces of information into a coherent inference. We introduce KinshipQA, a benchmark designed to probe this capability through reasoning over kinship relations. The central contribution of our work is a generative pipeline that produces, on demand, large-scale, realistic, and culture-specific genealogical data: collections of interconnected family trees that satisfy explicit marriage constraints associated with different kinship systems. This allows task difficulty, cultural assumptions, and relational depth to be systematically controlled and varied. From these genealogies, we derive textual inference tasks that require reasoning over implicit relational chains. We evaluate the resulting benchmark using six state-of-the-art LLMs, spanning both open-source and closed-source models, under a uniform zero-shot protocol with deterministic decoding. Performance is measured using exact-match and set-based metrics. Our results demonstrate that KinshipQA yields a wide spread of outcomes and exposes systematic differences in multi-hop reasoning across models and cultural settings.
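The kind of relational-chain inference the benchmark derives from genealogies can be sketched on a minimal parent map. All names are invented, and KinshipQA's generator additionally enforces culture-specific marriage constraints that this sketch omits:

```python
# Minimal genealogy as parent links; deriving a grandparent question
# requires composing two parent hops (a 2-hop relational chain).
parents = {
    "Noa":  ("Avi", "Dana"),
    "Avi":  ("Moshe", "Rut"),
    "Dana": ("Yosef", "Lea"),
}

def grandparents(person):
    """Two-hop relational chain: parents of parents."""
    gps = set()
    for p in parents.get(person, ()):
        gps.update(parents.get(p, ()))
    return gps

# The benchmark-style QA pair implied by the structure above:
question = "Who are Noa's grandparents?"
answer = grandparents("Noa")
print(sorted(answer))  # ['Lea', 'Moshe', 'Rut', 'Yosef']
```

Exact-match and set-based metrics both apply naturally here: the answer is a set uniquely determined by the tree, so scoring stays deterministic even as relational depth grows.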
https://arxiv.org/abs/2601.07794
LLM agents operating over massive, dynamic tool libraries rely on effective retrieval, yet standard single-shot dense retrievers struggle with complex requests. These failures primarily stem from the disconnect between abstract user goals and technical documentation, and the limited capacity of fixed-size embeddings to model combinatorial tool compositions. To address these challenges, we propose TOOLQP, a lightweight framework that models retrieval as iterative query planning. Instead of single-shot matching, TOOLQP decomposes instructions into sub-tasks and dynamically generates queries to interact with the retriever, effectively bridging the semantic gap by targeting the specific sub-tasks required for composition. We train TOOLQP using synthetic query trajectories followed by optimization via Reinforcement Learning with Verifiable Rewards (RLVR). Experiments demonstrate that TOOLQP achieves state-of-the-art performance, exhibiting superior zero-shot generalization, robustness across diverse retrievers, and significant improvements in downstream agentic execution.
https://arxiv.org/abs/2601.07782
Autonomous 3D scanning of open-world target structures via drones remains challenging despite broad applications. Existing paradigms rely on restrictive assumptions or labor-intensive human priors, limiting practicality, efficiency, and adaptability. Recent foundation models (FMs) offer great potential to bridge this gap. This paper investigates a critical research problem: What system architecture can effectively integrate FM knowledge for this task? We answer it with FlyCo, a principled FM-empowered perception-prediction-planning loop enabling fully autonomous, prompt-driven 3D target scanning in diverse unknown open-world environments. FlyCo directly translates low-effort human prompts (text, visual annotations) into precise adaptive scanning flights via three coordinated stages: (1) perception fuses streaming sensor data with vision-language FMs for robust target grounding and tracking; (2) prediction distills FM knowledge and combines multi-modal cues to infer the partially observed target's complete geometry; (3) planning leverages predictive foresight to generate efficient and safe paths with comprehensive target coverage. Building on this, we further design key components to boost open-world target grounding efficiency and robustness, enhance prediction quality in terms of shape accuracy, zero-shot generalization, and temporal stability, and balance long-horizon flight efficiency with real-time computability and online collision avoidance. Extensive challenging real-world and simulation experiments show FlyCo delivers precise scene understanding, high efficiency, and real-time safety, outperforming existing paradigms with lower human effort and verifying the proposed architecture's practicality. Comprehensive ablations validate each component's contribution. FlyCo also serves as a flexible, extensible blueprint, readily leveraging future FM and robotics advances. Code will be released.
https://arxiv.org/abs/2601.07558
6D object pose estimation plays a crucial role in scene understanding for applications such as robotics and augmented reality. To support the needs of ever-changing object sets in such contexts, modern zero-shot object pose estimators have been developed that require no object-specific training and rely only on CAD models. Such models are hard to obtain once a system is deployed, and a continuously changing and growing set of objects makes it harder to reliably identify the instance model of interest. To address this challenge, we introduce Open-Set CAD Retrieval from a Language Prompt and a Single Image (OSCAR), a novel training-free method that retrieves a matching object model from an unlabeled 3D object database. During onboarding, OSCAR generates multi-view renderings of database models and annotates them with descriptive captions using an image captioning model. At inference, GroundedSAM detects the queried object in the input image, and multi-modal embeddings are computed for both the Region-of-Interest and the database captions. OSCAR employs a two-stage retrieval: text-based filtering using CLIP identifies candidate models, followed by image-based refinement using DINOv2 to select the most visually similar object. In our experiments, we demonstrate that OSCAR outperforms all state-of-the-art methods on the cross-domain 3D model retrieval benchmark MI3DOR. Furthermore, we demonstrate OSCAR's direct applicability in automating object model sourcing for 6D object pose estimation. We propose using the most similar object model for pose estimation if the exact instance is not available and show that OSCAR achieves an average precision of 90.48% during object retrieval on the YCB-V object dataset. Moreover, we demonstrate that the most similar object model can be utilized for pose estimation using Megapose, achieving better results than a reconstruction-based approach.
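The two-stage retrieval above can be sketched in a few lines: a text-embedding filter (CLIP in the paper) narrows the database to a shortlist, then an image-embedding step (DINOv2 in the paper) re-ranks it by visual similarity. Embeddings here are plain Python lists and all function names are illustrative assumptions, not OSCAR's actual interfaces.

```python
# Hedged sketch of OSCAR-style two-stage retrieval over toy embeddings.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_text_emb, query_img_emb, database, k=2):
    """database: list of (model_id, caption_emb, render_emb) tuples."""
    # Stage 1: text-based filtering -- keep the top-k candidates by
    # similarity between the query prompt and the database captions.
    shortlist = sorted(database,
                       key=lambda m: cosine(query_text_emb, m[1]),
                       reverse=True)[:k]
    # Stage 2: image-based refinement -- re-rank the shortlist by
    # similarity between the query crop and the candidate renderings.
    best = max(shortlist, key=lambda m: cosine(query_img_emb, m[2]))
    return best[0]
```

The design point the sketch captures is that the cheap text stage prunes the database so the more discriminative image comparison only runs on a short candidate list.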
https://arxiv.org/abs/2601.07333
Retargeting human motion to heterogeneous robots is a fundamental challenge in robotics, primarily due to the severe kinematic and dynamic discrepancies between varying embodiments. Existing solutions typically resort to training embodiment-specific models, which scales poorly and fails to exploit shared motion semantics. To address this, we present AdaMorph, a unified neural retargeting framework that enables a single model to adapt human motion to diverse robot morphologies. Our approach treats retargeting as a conditional generation task. We map human motion into a morphology-agnostic latent intent space and utilize a dual-purpose prompting mechanism to condition the generation. Instead of simple input concatenation, we leverage Adaptive Layer Normalization (AdaLN) to dynamically modulate the decoder's feature space based on embodiment constraints. Furthermore, we enforce physical plausibility through a curriculum-based training objective that ensures orientation and trajectory consistency via integration. Experimental results on 12 distinct humanoid robots demonstrate that AdaMorph effectively unifies control across heterogeneous topologies, exhibiting strong zero-shot generalization to unseen complex motions while preserving the dynamic essence of the source behaviors.
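The AdaLN conditioning described above can be illustrated with a minimal sketch: instead of concatenating the embodiment condition onto the input, the condition supplies a per-feature scale and shift that modulate the normalized features. This is a pure-Python toy with no learned weights; in AdaMorph the scale and shift would be produced by a network from the embodiment prompt, and the names below are assumptions.

```python
# Minimal sketch of Adaptive Layer Normalization (AdaLN) modulation.
import math

def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln(x, scale, shift):
    # In a real model, `scale` and `shift` would come from a small
    # network conditioned on the robot's embodiment embedding, letting
    # one decoder adapt its feature space per morphology.
    return [s * v + b for v, s, b in zip(layer_norm(x), scale, shift)]
```

Because the condition enters multiplicatively and additively after normalization, the same decoder weights can be re-modulated for each embodiment rather than retrained per robot.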
https://arxiv.org/abs/2601.07284
Document-Level Zero-Shot Relation Extraction (DocZSRE) aims to predict unseen relation labels in text documents without prior training on specific relations. Existing approaches rely on Large Language Models (LLMs) to generate synthetic data for unseen labels, which poses challenges for low-resource languages like Malaysian English. These challenges include the incorporation of local linguistic nuances and the risk of factual inaccuracies in LLM-generated data. This paper introduces Document-Level Zero-Shot Relation Extraction with Entity Side Information (DocZSRE-SI) to address limitations in the existing DocZSRE approach. The DocZSRE-SI framework leverages Entity Side Information, such as Entity Mention Descriptions and Entity Mention Hypernyms, to perform ZSRE without depending on LLM-generated synthetic data. The proposed low-complexity model achieves an average improvement of 11.6% in the macro F1-Score compared to baseline models and existing benchmarks. By utilizing Entity Side Information, DocZSRE-SI offers a robust and efficient alternative to error-prone, LLM-based methods, demonstrating significant advancements in handling low-resource languages and linguistic diversity in relation extraction tasks. This research provides a scalable and reliable solution for ZSRE, particularly in contexts like Malaysian English news articles, where traditional LLM-based approaches fall short.
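A minimal sketch of how entity side information (mention descriptions and hypernyms) can be attached to a relation-extraction input, in the spirit of DocZSRE-SI. The exact encoding in the paper may differ; the field names and the `HEAD`/`TAIL` template below are illustrative assumptions.

```python
# Toy construction of an input enriched with entity side information.

def build_input(doc, head, tail, side_info):
    """side_info maps an entity mention to {'desc': ..., 'hypernym': ...}."""
    def enrich(mention):
        info = side_info.get(mention, {})
        parts = [mention]
        if "desc" in info:
            parts.append(f"({info['desc']})")      # entity mention description
        if "hypernym" in info:
            parts.append(f"[{info['hypernym']}]")  # entity mention hypernym
        return " ".join(parts)
    # The enriched mentions supply label-independent evidence about the
    # entities, which is what lets the model score unseen relation labels
    # without LLM-generated synthetic training data.
    return f"{doc} | HEAD: {enrich(head)} | TAIL: {enrich(tail)}"
```

For a document like "Penang is in Malaysia.", enriching the head with a description and hypernym ("a Malaysian state", "state") and the tail with a hypernym ("country") gives the model the type signal a zero-shot relation such as *located-in* depends on.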
https://arxiv.org/abs/2601.07271