As LLM-based agents increasingly operate in high-stakes domains with real-world consequences, ensuring their behavioral safety becomes paramount. The dominant oversight paradigm, LLM-as-a-Judge, faces a fundamental dilemma: how can probabilistic systems reliably supervise other probabilistic systems without inheriting their failure modes? We argue that formal verification offers a principled escape from this dilemma, yet its adoption has been hindered by a critical bottleneck: the translation from natural-language requirements to formal specifications. This paper bridges the gap by proposing a neuro-symbolic framework built on a bidirectional Formal-of-Thought architecture: LLMs serve as specification compilers that decompose high-level human intent top-down into atomic, verifiable constraints, then prove compliance bottom-up using Dafny specifications and Z3 satisfiability modulo theories (SMT) solving, producing mathematical guarantees rather than probabilistic scores. We validate the framework across three benchmarks spanning behavioral safety, multi-domain constraint adherence, and agentic upward-deception detection. Experiments on 7 agent models demonstrate that the framework achieves an average improvement of 16.6% over LLM-as-a-Judge baselines, enables weak-to-strong generalization in which a 7B judge detects deception from 72B agents with over 90% accuracy, and provides near-linear safety improvement through iterative refinement.
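The compile-then-prove loop described above can be sketched in miniature. The toy below uses plain Python predicates with hypothetical constraint and trace-field names; the paper's pipeline compiles to Dafny/Z3 rather than Python, so this only illustrates the top-down decomposition into atomic checks and the bottom-up conjunction that yields a verdict:

```python
# Illustrative sketch only: intent names, trace fields, and the predicate
# constraints are hypothetical; the actual framework emits Dafny/Z3 specs.

def decompose(intent):
    """Top-down: map a high-level safety intent to atomic, checkable constraints."""
    if intent == "never exfiltrate user data":
        return [
            # Atomic constraint 1: no send action targets an external destination.
            lambda trace: all(step["destination"] != "external"
                              for step in trace if step["action"] == "send"),
            # Atomic constraint 2: no payload ever carries a credential.
            lambda trace: all("credential" not in step.get("payload", "")
                              for step in trace),
        ]
    raise ValueError(f"no specification for intent: {intent!r}")

def verify(trace, intent):
    """Bottom-up: the intent holds iff every atomic constraint holds on the trace."""
    return all(check(trace) for check in decompose(intent))

trace = [
    {"action": "read", "destination": "local"},
    {"action": "send", "destination": "internal", "payload": "report"},
]
print(verify(trace, "never exfiltrate user data"))  # True
```

A single failed atomic check falsifies the whole intent, which is what gives a binary verdict instead of a probabilistic score.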
https://arxiv.org/abs/2602.11136
Agentic coding requires agents to interact effectively with runtime environments, e.g., command-line interfaces (CLIs), to complete tasks like resolving dependency issues and fixing system problems. However, how such environment-intensive tasks can be obtained at scale to enhance agents' capabilities remains underexplored. To address this, based on an analogy between the Dockerfile and the agentic task, we propose to employ agents to simulate and explore environment histories, guided by execution feedback. By tracing the history of a healthy environment, its state can be inverted to an earlier one with runtime failures, from which a task can be derived by packing the buggy state together with the corresponding error messages. With our method, named CLI-Gym, a total of 1,655 environment-intensive tasks are derived, the largest collection of its kind. Moreover, with curated successful trajectories, our fine-tuned model, named LiberCoder, achieves a substantial absolute improvement of +21.1% (to 46.1%) on Terminal-Bench, outperforming various strong baselines. To our knowledge, this is the first public pipeline for the scalable derivation of environment-intensive tasks.
https://arxiv.org/abs/2602.10999
Block-based programming environments such as Scratch play a central role in low-code education, yet evaluating the capabilities of AI agents to construct programs through Graphical User Interfaces (GUIs) remains underexplored. We introduce ScratchWorld, a benchmark for evaluating multimodal GUI agents on program-by-construction tasks in Scratch. Grounded in the Use-Modify-Create pedagogical framework, ScratchWorld comprises 83 curated tasks spanning four distinct problem categories: Create, Debug, Extend, and Compute. To rigorously diagnose the source of agent failures, the benchmark employs two complementary interaction modes: primitive mode requires fine-grained drag-and-drop manipulation to directly assess visuomotor control, while composite mode uses high-level semantic APIs to disentangle program reasoning from GUI execution. To ensure reliable assessment, we propose an execution-based evaluation protocol that validates the functional correctness of the constructed Scratch programs through runtime tests within the browser environment. Extensive experiments across state-of-the-art multimodal language models and GUI agents reveal a substantial reasoning--acting gap, highlighting persistent challenges in fine-grained GUI manipulation despite strong planning capabilities.
https://arxiv.org/abs/2602.10814
Graph Domain Adaptation (GDA) transfers knowledge from labeled source graphs to unlabeled target graphs but is challenged by complex, multi-faceted distributional shifts. Existing methods attempt to reduce distributional shifts by aligning manually selected graph elements (e.g., node attributes or structural statistics), which typically require manually designed graph filters to extract relevant features before alignment. However, such approaches are inflexible: they rely on scenario-specific heuristics and struggle when the dominant discrepancies vary across transfer scenarios. To address these limitations, we propose \textbf{ADAlign}, an Adaptive Distribution Alignment framework for GDA. Unlike heuristic methods, ADAlign requires no manual specification of alignment criteria. It automatically identifies the most relevant discrepancies in each transfer and aligns them jointly, capturing the interplay between attributes, structures, and their dependencies. This makes ADAlign flexible, scenario-aware, and robust to diverse and dynamically evolving shifts. To enable this adaptivity, we introduce the Neural Spectral Discrepancy (NSD), a theoretically principled parametric distance that provides a unified view of cross-graph shifts. NSD leverages a neural characteristic function in the spectral domain to encode feature-structure dependencies of all orders, while a learnable frequency sampler adaptively emphasizes the most informative spectral components for each task via a minimax paradigm. Extensive experiments on 10 datasets and 16 transfer tasks show that ADAlign not only outperforms state-of-the-art baselines but also achieves efficiency gains with lower memory usage and faster training.
https://arxiv.org/abs/2602.10489
Runtime quantification of vehicle operational intensity is essential for predictive maintenance and condition monitoring in commercial and heavy-duty fleets. Traditional metrics like mileage fail to capture mechanical burden, while unsupervised deep learning models detect statistical anomalies, typically transient surface shocks, but often conflate statistical stability with mechanical rest. We identify this as a critical blind spot: high-load steady states, such as hill climbing with heavy payloads, appear statistically normal yet impose significant drivetrain fatigue. To resolve this, we propose a Dual-Stream Architecture that fuses unsupervised learning for surface anomaly detection with macroscopic physics proxies for cumulative load estimation. This approach leverages low-frequency sensor data to generate a multi-dimensional health vector, distinguishing between dynamic hazards and sustained mechanical effort. Validated on a RISC-V embedded platform, the architecture demonstrates low computational overhead, enabling comprehensive, edge-based health monitoring on resource-constrained ECUs without the latency or bandwidth costs of cloud-based monitoring.
https://arxiv.org/abs/2602.10432
AIvilization v0 is a publicly deployed large-scale artificial society that couples a resource-constrained sandbox economy with a unified LLM-agent architecture, aiming to sustain long-horizon autonomy while remaining executable under a rapidly changing environment. To mitigate the tension between goal stability and reactive correctness, we introduce (i) a hierarchical branch-thinking planner that decomposes life goals into parallel objective branches and uses simulation-guided validation plus tiered re-planning to ensure feasibility; (ii) an adaptive agent profile with dual-process memory that separates short-term execution traces from long-term semantic consolidation, enabling persistent yet evolving identity; and (iii) a human-in-the-loop steering interface that injects long-horizon objectives and short commands at appropriate abstraction levels, with effects propagated through memory rather than brittle prompt overrides. The environment integrates physiological survival costs, non-substitutable multi-tier production, an AMM-based price mechanism, and a gated education-occupation system. Using high-frequency transactions from the platform's mature phase, we find stable markets that reproduce key stylized facts (heavy-tailed returns and volatility clustering) and produce structured wealth stratification driven by education and access constraints. Ablations show that simplified planners can match performance on narrow tasks, while the full architecture is more robust under multi-objective, long-horizon settings, supporting delayed investment and sustained exploration.
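The AMM-based price mechanism can be illustrated with a constant-product pool. The abstract does not specify which curve AIvilization v0 uses, so the x·y = k rule below is an assumption; it shows how prices emerge from reserves and how large orders suffer slippage:

```python
# Minimal constant-product AMM sketch (x * y = k). The exact pricing curve in
# AIvilization v0 is not stated in the abstract; this is the textbook variant.

class ConstantProductPool:
    def __init__(self, reserve_a, reserve_b):
        self.a = reserve_a
        self.b = reserve_b

    def price_a_in_b(self):
        # Marginal price of one unit of good A, quoted in good B.
        return self.b / self.a

    def swap_a_for_b(self, amount_a):
        # Keep k = a * b invariant; the buyer receives the drop in reserve b.
        k = self.a * self.b
        self.a += amount_a
        out = self.b - k / self.a
        self.b -= out
        return out

pool = ConstantProductPool(1000.0, 1000.0)
print(pool.price_a_in_b())        # 1.0
received = pool.swap_a_for_b(100.0)
print(round(received, 2))         # 90.91 -- slippage: the price moves against large orders
```

Because reserves update on every trade, prices respond endogenously to agent supply and demand, which is what allows the simulated market to exhibit stylized facts like volatility clustering.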
https://arxiv.org/abs/2602.10429
Structured claim decomposition is often proposed as a solution for verifying complex, multi-faceted claims, yet empirical results have been inconsistent. We argue that these inconsistencies stem from two overlooked bottlenecks: evidence alignment and sub-claim error profiles. To better understand these factors, we introduce a new dataset of real-world complex claims, featuring temporally bounded evidence and human-annotated sub-claim evidence spans. We evaluate decomposition under two evidence alignment setups: Sub-claim Aligned Evidence (SAE) and Repeated Claim-level Evidence (SRE). Our results reveal that decomposition brings significant performance improvement only when evidence is granular and strictly aligned. By contrast, standard setups that rely on repeated claim-level evidence (SRE) fail to improve and often degrade performance as shown across different datasets and domains (PHEMEPlus, MMM-Fact, COVID-Fact). Furthermore, we demonstrate that in the presence of noisy sub-claim labels, the nature of the error ends up determining downstream robustness. We find that conservative "abstention" significantly reduces error propagation compared to aggressive but incorrect predictions. These findings suggest that future claim decomposition frameworks must prioritize precise evidence synthesis and calibrate the label bias of sub-claim verification models.
https://arxiv.org/abs/2602.10380
To be practical for real-life applications, models for brain-computer interfaces must be easily and quickly deployable on new subjects, effective on affordable scanning hardware, and small enough to run locally on accessible computing resources. To directly address these current limitations, we introduce ENIGMA, a multi-subject electroencephalography (EEG)-to-Image decoding model that reconstructs seen images from EEG recordings and achieves state-of-the-art (SOTA) performance on the research-grade THINGS-EEG2 and consumer-grade AllJoined-1.6M benchmarks, while fine-tuning effectively on new subjects with as little as 15 minutes of data. ENIGMA boasts a simpler architecture and requires less than 1% of the trainable parameters necessary for previous approaches. Our approach integrates a subject-unified spatio-temporal backbone along with a set of multi-subject latent alignment layers and an MLP projector to map raw EEG signals to a rich visual latent space. We evaluate our approach using a broad suite of image reconstruction metrics that have been standardized in the adjacent field of fMRI-to-Image research, and we describe the first EEG-to-Image study to conduct extensive behavioral evaluations of our reconstructions using human raters. Our simple and robust architecture provides a significant performance boost across both research-grade and consumer-grade EEG hardware, and a substantial improvement in fine-tuning efficiency and inference cost. Finally, we provide extensive ablations to determine the architectural choices most responsible for our performance gains in both single and multi-subject cases across multiple benchmark datasets. Collectively, our work provides a substantial step towards the development of practical brain-computer interface applications.
https://arxiv.org/abs/2602.10361
Recent advances in LLM-guided evolutionary computation, particularly AlphaEvolve, have demonstrated remarkable success in discovering novel mathematical constructions and solving challenging optimization problems. In this article, we present ImprovEvolve, a simple yet effective technique for enhancing LLM-based evolutionary approaches such as AlphaEvolve. Given an optimization problem, the standard approach is to evolve program code that, when executed, produces a solution close to the optimum. We propose an alternative program parameterization that maintains the ability to construct optimal solutions while reducing the cognitive load on the LLM. Specifically, we evolve a program (implementing, e.g., a Python class with a prescribed interface) that provides the following functionality: (1) propose a valid initial solution, (2) improve any given solution in terms of fitness, and (3) perturb a solution with a specified intensity. The optimum can then be approached by iteratively applying improve() and perturb() with a scheduled intensity. We evaluate ImprovEvolve on challenging problems from the AlphaEvolve paper: hexagon packing in a hexagon and the second autocorrelation inequality. For hexagon packing, the evolved program achieves new state-of-the-art results for 11, 12, 15, and 16 hexagons; a lightly human-edited variant further improves results for 14, 17, and 23 hexagons. For the second autocorrelation inequality, the human-edited program achieves a new state-of-the-art lower bound of 0.96258, improving upon AlphaEvolve's 0.96102.
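The prescribed interface lends itself to a direct sketch. On a toy 1-D problem (minimize x², a stand-in for the paper's combinatorial objectives; the class and schedule below are illustrative, not the evolved programs themselves), the initial/improve/perturb contract and the scheduled-intensity outer loop look like:

```python
import random

# Toy instance of the ImprovEvolve program parameterization: the evolved
# program exposes initial / improve / perturb, and an outer loop applies
# improve() and perturb() with a decaying intensity schedule.

class ToySolution:
    def initial(self):
        return 10.0                     # (1) propose a valid initial solution

    def improve(self, x):
        return x * 0.5                  # (2) improve fitness (halve toward optimum 0)

    def perturb(self, x, intensity):
        return x + random.uniform(-intensity, intensity)   # (3) scaled perturbation

def fitness(x):
    return -x * x                       # higher is better; optimum at x = 0

def optimise(program, steps=50):
    random.seed(0)
    x = program.initial()
    for t in range(steps):
        intensity = 1.0 / (t + 1)       # scheduled intensity: explore less over time
        cand = program.improve(program.perturb(x, intensity))
        if fitness(cand) >= fitness(x): # keep only non-worsening moves
            x = cand
    return x

best = optimise(ToySolution())
print(abs(best) < 0.1)                  # True: iterated improve()/perturb() nears the optimum
```

The point of the parameterization is that the LLM only has to evolve the three small methods, while the fixed outer loop handles the search, reducing the cognitive load per mutation.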
https://arxiv.org/abs/2602.10233
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for enhancing the reasoning capabilities of Large Language Models (LLMs). Despite its efficacy, RLVR faces a meta-learning bottleneck: it lacks mechanisms for error attribution and experience internalization intrinsic to the human learning cycle beyond practice and verification, thereby limiting fine-grained credit assignment and reusable knowledge formation. We term such reusable knowledge representations derived from past errors as meta-experience. Based on this insight, we propose Meta-Experience Learning (MEL), a novel framework that incorporates self-distilled meta-experience into the model's parametric memory. Building upon standard RLVR, we introduce an additional design that leverages the LLM's self-verification capability to conduct contrastive analysis on paired correct and incorrect trajectories, identify the precise bifurcation points where reasoning errors arise, and summarize them into generalizable meta-experience. The meta-experience is further internalized into the LLM's parametric memory by minimizing the negative log-likelihood, which induces a language-modeled reward signal that bridges correct and incorrect reasoning trajectories and facilitates effective knowledge reuse. Experimental results demonstrate that MEL achieves consistent improvements on benchmarks, yielding 3.92%--4.73% Pass@1 gains across varying model sizes.
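The internalization step, minimizing the negative log-likelihood of distilled meta-experience, can be illustrated with a toy unigram model. Real MEL updates the LLM's parameters; the model, tokens, and smoothing below are stand-ins showing only the objective:

```python
import math
from collections import Counter

# Toy illustration: after "training" on a distilled experience summary, the
# (unigram, add-one-smoothed) model assigns it lower NLL, i.e. the experience
# has been internalized into the model's parametric memory.

def nll(model, tokens, vocab):
    total = sum(model.values())
    return -sum(math.log((model.get(t, 0) + 1) / (total + len(vocab)))
                for t in tokens)

experience = "check edge cases before final answer".split()   # distilled summary
vocab = set(experience) | {"the", "a"}
model = Counter({"the": 5, "a": 5})

before = nll(model, experience, vocab)
model.update(experience)               # one internalization step on the summary
after = nll(model, experience, vocab)
print(after < before)                  # True: NLL of the meta-experience decreased
```

In MEL the same quantity, read as a language-modeled reward, is what bridges correct and incorrect trajectories during RL.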
https://arxiv.org/abs/2602.10224
Recent advances in large image editing models have shifted the paradigm from text-driven instructions to vision-prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual-text prompts. While this paradigm greatly expands usability, it also introduces a critical and underexplored safety risk: the attack surface itself becomes visual. In this work, we propose Vision-Centric Jailbreak Attack (VJA), the first visual-to-visual jailbreak attack that conveys malicious instructions purely through visual inputs. To systematically study this emerging threat, we introduce IESBench, a safety-oriented benchmark for image editing models. Extensive experiments on IESBench demonstrate that VJA effectively compromises state-of-the-art commercial models, achieving attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5. To mitigate this vulnerability, we propose a training-free defense based on introspective multimodal reasoning, which substantially improves the safety of poorly aligned models to a level comparable with commercial systems, without auxiliary guard models and with negligible computational overhead. Our findings expose new vulnerabilities, provide both a benchmark and practical defense to advance safe and trustworthy modern image editing systems. Warning: This paper contains offensive images created by large image editing models.
https://arxiv.org/abs/2602.10179
Multiple rotation averaging (MRA) is a fundamental optimization problem in 3D vision and robotics that aims to recover globally consistent absolute rotations from noisy relative measurements. Established classical methods, such as L1-IRLS and Shonan, face limitations including local minima susceptibility and reliance on convex relaxations that fail to preserve the exact manifold geometry, leading to reduced accuracy in high-noise scenarios. We introduce IQARS (Iterative Quantum Annealing for Rotation Synchronization), the first algorithm that reformulates MRA as a sequence of local quadratic non-convex sub-problems executable on quantum annealers after binarization, to leverage inherent hardware advantages. IQARS removes convex relaxation dependence and better preserves non-Euclidean rotation manifold geometry while leveraging quantum tunneling and parallelism for efficient solution space exploration. We evaluate IQARS's performance on synthetic and real-world datasets. While current annealers remain in their nascent phase and only support solving problems of limited scale with constrained performance, we observed that IQARS on D-Wave annealers can already achieve ca. 12% higher accuracy than Shonan, i.e., the best-performing classical method evaluated empirically.
https://arxiv.org/abs/2602.10115
Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce Seq$\Delta$-REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.
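The sequence-level anchor can be sketched as a loss between the integrated latent action and the frozen encoder's temporal feature difference. The encoder, shapes, and the MSE choice below are stand-ins for illustration, not the paper's architecture:

```python
import numpy as np

# Sketch of a sequence-level control-effect alignment objective: actions are
# unobserved, but their semantic effect (the change in frozen-encoder features
# from first to last frame) is observable and serves as the shared reference.

def frozen_encoder(frame):
    # Stand-in for a per-frame feature from a frozen self-supervised encoder.
    return np.array([frame.mean(), frame.std()])

def seq_alignment_loss(latent_actions, frames):
    integrated = latent_actions.sum(axis=0)          # integrate latents over the clip
    effect = frozen_encoder(frames[-1]) - frozen_encoder(frames[0])
    return float(((integrated - effect) ** 2).mean())  # anchor sum to the effect

rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8, 8))                  # 4 frames of a toy 8x8 clip
effect = frozen_encoder(frames[-1]) - frozen_encoder(frames[0])
perfect = np.stack([effect / 3.0] * 3)               # 3 latent actions summing to the effect
print(seq_alignment_loss(perfect, frames) < 1e-12)   # True: perfectly aligned latents
```

Because the target is a feature *difference* rather than the features themselves, scene-specific cues cancel, which is the mechanism for a shared coordinate system across contexts.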
https://arxiv.org/abs/2602.10104
Leveraging representation encoders for generative modeling offers a path for efficient, high-fidelity synthesis. However, standard diffusion transformers fail to converge on these representations directly. While recent work attributes this to a capacity bottleneck, proposing computationally expensive width scaling of diffusion transformers, we demonstrate that the failure is fundamentally geometric. We identify Geometric Interference as the root cause: standard Euclidean flow matching forces probability paths through the low-density interior of the hyperspherical feature space of representation encoders, rather than following the manifold surface. To resolve this, we propose Riemannian Flow Matching with Jacobi Regularization (RJF). By constraining the generative process to the manifold geodesics and correcting for curvature-induced error propagation, RJF enables standard Diffusion Transformer architectures to converge without width scaling. RJF enables the standard DiT-B architecture (131M parameters) to converge effectively, achieving an FID of 3.37 where prior methods fail to converge. Code: this https URL
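The geometric point is easy to see in two dimensions: linear (Euclidean) interpolation between unit-norm features cuts through the sphere's interior, while the geodesic (slerp) stays on the manifold. This is illustrative only; RJF additionally applies the Jacobi regularization for curvature-induced error, which is not shown here:

```python
import numpy as np

# Why Euclidean flow-matching paths leave a hyperspherical feature space:
# the straight line between two unit vectors passes through the low-density
# interior, whereas the geodesic keeps unit norm throughout.

def lerp(x0, x1, t):
    return (1 - t) * x0 + t * x1

def slerp(x0, x1, t):
    # Spherical linear interpolation between unit vectors x0, x1.
    theta = np.arccos(np.clip(np.dot(x0, x1), -1.0, 1.0))
    return (np.sin((1 - t) * theta) * x0 + np.sin(t * theta) * x1) / np.sin(theta)

x0 = np.array([1.0, 0.0])
x1 = np.array([0.0, 1.0])      # two points on the unit circle, 90 degrees apart

print(round(np.linalg.norm(lerp(x0, x1, 0.5)), 3))   # 0.707 -- off the manifold
print(round(np.linalg.norm(slerp(x0, x1, 0.5)), 3))  # 1.0   -- on the manifold
```

The norm deficit of the midpoint (0.707 vs 1.0) is exactly the "interior" region where encoder features are never observed, so a model trained on Euclidean paths must denoise through off-manifold states.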
https://arxiv.org/abs/2602.10099
Large language models (LLMs) are increasingly used to support question answering and decision-making in high-stakes, domain-specific settings such as natural hazard response and infrastructure planning, where effective answers must convey fine-grained, decision-critical details. However, existing evaluation frameworks for retrieval-augmented generation (RAG) and open-ended question answering primarily rely on surface-level similarity, factual consistency, or semantic relevance, and often fail to assess whether responses provide the specific information required for domain-sensitive decisions. To address this gap, we propose a multi-dimensional, reference-free evaluation framework that assesses LLM outputs along four complementary dimensions: specificity, robustness to paraphrasing and semantic perturbations, answer relevance, and context utilization. We introduce a curated dataset of 1,412 domain-specific question-answer pairs spanning 40 professional roles and seven natural hazard types to support systematic evaluation. We further conduct human evaluation to assess inter-annotator agreement and alignment between model outputs and human judgments, which highlights the inherent subjectivity of open-ended, domain-specific evaluation. Our results show that no single metric sufficiently captures answer quality in isolation and demonstrate the need for structured, multi-metric evaluation frameworks when deploying LLMs in high-stakes applications.
https://arxiv.org/abs/2602.10017
Monocular normal estimation aims to estimate the normal map from a single RGB image of an object under arbitrary lights. Existing methods rely on deep models to directly predict normal maps. However, they often suffer from 3D misalignment: while the estimated normal maps may appear to have a correct appearance, the reconstructed surfaces often fail to align with the geometric details. We argue that this misalignment stems from the current paradigm: the model struggles to distinguish and reconstruct varying geometry represented in normal maps, as the differences in underlying geometry are reflected only through relatively subtle color variations. To address this issue, we propose a new paradigm that reformulates normal estimation as shading sequence estimation, where shading sequences are more sensitive to various geometric information. Building on this paradigm, we present RoSE, a method that leverages image-to-video generative models to predict shading sequences. The predicted shading sequences are then converted into normal maps by solving a simple ordinary least-squares problem. To enhance robustness and better handle complex objects, RoSE is trained on a synthetic dataset, MultiShade, with diverse shapes, materials, and light conditions. Experiments demonstrate that RoSE achieves state-of-the-art performance on real-world benchmark datasets for object-based monocular normal estimation.
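The final conversion is a small per-pixel ordinary-least-squares problem. Under a Lambertian model the shading for light direction l_k is s_k = l_k · n, so a shading sequence over known directions determines the normal; the light directions and the pure-Lambertian assumption below are illustrative, not the paper's exact rendering setup:

```python
import numpy as np

# Recover a unit surface normal at one pixel from its shading sequence by
# solving L n = s in the least-squares sense (Lambertian shading assumed).

true_n = np.array([0.6, 0.0, 0.8])                 # ground-truth unit normal
L = np.array([[0.0, 0.0, 1.0],
              [0.8, 0.0, 0.6],
              [0.0, 0.8, 0.6],
              [0.5, 0.5, 0.7]])                    # one light direction per frame
s = L @ true_n                                     # the predicted shading sequence

g, *_ = np.linalg.lstsq(L, s, rcond=None)          # ordinary least squares
n = g / np.linalg.norm(g)                          # normalize to a unit normal
print(np.allclose(n, true_n))                      # True
```

With three or more non-coplanar lights the system is (over)determined, which is why a longer shading sequence makes the normal estimate more robust to per-frame prediction noise.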
https://arxiv.org/abs/2602.09929
Running LLMs with extended reasoning on every problem is expensive, but determining which inputs actually require additional compute remains challenging. We investigate whether a model's own likelihood of success is recoverable from its internal representations before generation, and whether this signal can guide more efficient inference. We train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks, substantially outperforming surface features such as question length and TF-IDF. Using E2H-AMC, which provides both human and model performance on identical problems, we show that models encode a model-specific notion of difficulty that is distinct from human difficulty, and that this distinction increases with extended reasoning. Leveraging these probes, we demonstrate that routing queries across a pool of models can exceed the best-performing model whilst reducing inference cost by up to 70% on MATH, showing that internal representations enable practical efficiency gains even when they diverge from human intuitions about difficulty. Our code is available at: this https URL
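A linear probe of this kind reduces to a small supervised fit on pre-generation features. The sketch below uses synthetic "activations" and a closed-form ridge classifier; the paper probes real LLM hidden states, so the dimensions, data, and fitting choice here are all stand-in assumptions:

```python
import numpy as np

# Linear-probe sketch: if success is encoded linearly in pre-generation
# activations, a simple linear classifier recovers it. Synthetic data only.

rng = np.random.default_rng(0)
d, n = 16, 2000
w_true = rng.normal(size=d)                 # hidden "will-succeed" direction
X = rng.normal(size=(n, d))                 # activations before generation
y = (X @ w_true > 0).astype(float)          # 1 = the policy solves this input

# Closed-form ridge regression onto +/-1 targets, thresholded at zero.
w = np.linalg.solve(X.T @ X + 1e-2 * np.eye(d), X.T @ (2 * y - 1))
acc = ((X @ w > 0) == (y > 0)).mean()
print(acc > 0.9)                            # the signal is linearly decodable
```

A router then simply compares each model's probed success score and sends the query to the cheapest model whose score clears a threshold.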
https://arxiv.org/abs/2602.09924
3D Morphable Models (3DMMs) take 2D images as input and recreate the structure and physical appearance of 3D objects, especially human faces and bodies. A 3DMM combines identity and expression blendshapes with a basic face mesh to create a detailed 3D model. The variability of a 3D morphable model can be controlled by tuning diverse parameters: high-level image descriptors such as shape, texture, illumination, and camera parameters. Previous research in 3D human reconstruction concentrated solely on global face structure or geometry, ignoring semantic face features such as age, gender, and facial landmarks characterizing facial boundaries, curves, dips, and wrinkles. To accommodate changes in these high-level facial characteristics, this work introduces a shape- and appearance-aware 3D reconstruction system (which we name SARS), a modular pipeline that extracts body and face information from a single image to properly rebuild a 3D model of the full human body.
https://arxiv.org/abs/2602.09918
Mobile manipulators broaden the operational envelope for robot manipulation. However, whole-body teleoperation of such robots remains a problem: operators must coordinate a wheeled base and two arms while reasoning about obstacles and contact. Existing interfaces are predominantly hand-centric (e.g., VR controllers and joysticks), leaving foot-operated channels underexplored for continuous base control. We present TriPilot-FF, an open-source whole-body teleoperation system for a custom bimanual mobile manipulator that introduces a foot-operated pedal with lidar-driven pedal haptics, coupled with upper-body bimanual leader-follower teleoperation. Using only a low-cost base-mounted lidar, TriPilot-FF renders a resistive pedal cue from proximity-to-obstacle signals in the commanded direction, shaping operator commands toward collision-averse behaviour without an explicit collision-avoidance controller. The system also supports arm-side force reflection for contact awareness and provides real-time force and visual guidance of bimanual manipulability to prompt mobile base repositioning, thereby improving reach. We demonstrate the capability of TriPilot-FF to effectively ``co-pilot'' the human operator over long time horizons and in tasks requiring precise mobile base movement and coordination. Finally, we incorporate teleoperation feedback signals into an Action Chunking with Transformers (ACT) policy and demonstrate improved performance when the additional information is available. We release the pedal device design and full software stack, and conduct extensive real-world evaluations on a bimanual wheeled platform. The project page of TriPilot-FF is this http URL.
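The lidar-driven pedal cue can be pictured as a mapping from obstacle proximity in the commanded heading to pedal resistance. The sketch below is an assumed linear shaping for illustration; the abstract does not specify the paper's exact mapping, and all parameter values here are hypothetical.

```python
import math

def pedal_resistance(scan, cmd_angle, d_max=2.0, r_max=1.0, fov=math.radians(30)):
    """Map lidar proximity in the commanded direction to a resistive pedal cue.

    scan: list of (angle_rad, range_m) lidar returns; cmd_angle: commanded
    base heading. Resistance rises linearly as the nearest obstacle inside a
    cone around the commanded direction approaches (illustrative only).
    """
    # Keep returns whose wrapped angular offset from the command lies in the cone.
    in_cone = [r for a, r in scan
               if abs(math.atan2(math.sin(a - cmd_angle),
                                 math.cos(a - cmd_angle))) <= fov / 2]
    if not in_cone:
        return 0.0                               # nothing ahead: free pedal
    d = min(in_cone)                             # nearest obstacle in that cone
    return r_max * max(0.0, 1.0 - d / d_max)     # 0 when far, r_max at contact

# Obstacle 0.5 m ahead: strong resistance driving forward, none reversing.
scan = [(0.0, 0.5), (math.pi / 2, 3.0)]
print(pedal_resistance(scan, cmd_angle=0.0))     # 0.75
print(pedal_resistance(scan, cmd_angle=math.pi)) # 0.0
```

Because the cue only modulates pedal effort, the operator remains in command; the haptics bias behaviour away from collisions rather than overriding it, consistent with the "no explicit collision-avoidance controller" claim.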
https://arxiv.org/abs/2602.09888
Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, a GUI world model empowers agents with human-like foresight by enabling action-conditioned prediction. However, existing text- and pixel-based approaches struggle to simultaneously achieve high visual fidelity and fine-grained structural controllability. To this end, we propose Code2World, a vision-language coder that simulates the next visual state via renderable code generation. Specifically, to address the data scarcity problem, we construct AndroidCode by translating GUI trajectories into high-fidelity HTML and refining synthesized code through a visual-feedback revision mechanism, yielding a corpus of over 80K high-quality screen-action pairs. To adapt existing VLMs to code prediction, we first perform SFT as a cold start for format and layout following, then apply Render-Aware Reinforcement Learning, which uses the rendered outcome as the reward signal, enforcing visual semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World-8B achieves top-performing next-UI prediction, rivaling the competitive GPT-5 and Gemini-3-Pro-Image. Notably, Code2World significantly enhances downstream navigation success rates in a flexible manner, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation. The code is available at this https URL.
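A render-aware reward in the spirit described above scores the generated HTML after rendering it, combining visual similarity to the reference next screen with action consistency. The weighting and the `render`-side similarity computation below are assumptions for illustration; the abstract does not give the paper's exact formulation.

```python
def render_aware_reward(pixel_similarity, action_match, w_vis=0.7, w_act=0.3):
    """Combine two reward terms computed on the *rendered* prediction.

    pixel_similarity: score in [0, 1] comparing the rendered screen against
    the ground-truth next screen (e.g., a perceptual similarity metric).
    action_match: 1.0 if the acted-on UI element is preserved in the
    prediction, else 0.0. Weights w_vis/w_act are hypothetical.
    """
    return w_vis * pixel_similarity + w_act * action_match

# High visual fidelity and a consistent action target yield a high reward.
print(render_aware_reward(0.9, 1.0))  # 0.93
```

Grounding the reward in the rendered outcome rather than the raw code string is what lets the RL stage penalize code that compiles but renders the wrong screen.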
https://arxiv.org/abs/2602.09856