Current large vision-language models (LVLMs) typically rely on text-only reasoning based on a single-pass visual encoding, which often leads to loss of fine-grained visual information. The recently proposed ''thinking with images'' paradigm attempts to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment, particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose ''chatting with images'', a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and train it with a two-stage curriculum combining supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.
https://arxiv.org/abs/2602.11073
Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this perceptual pathway is key to building natural full-duplex interactive systems. We introduce a framework that models this process as multi-level perception, and then reasons over conversational behaviors via a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a high-quality corpus that pairs controllable, event-rich dialogue data with human-annotated labels. The GoT framework structures streaming predictions as an evolving graph, enabling a transformer to forecast the next speech act, generate concise justifications for its decisions, and dynamically refine its reasoning. Experiments on both synthetic and real duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full duplex spoken dialogue systems.
https://arxiv.org/abs/2602.11065
Accurate counting of surgical instruments in Operating Rooms (OR) is a critical prerequisite for ensuring patient safety during surgery. Despite recent progress of large visual-language models and agentic AI, accurately counting such instruments remains highly challenging, particularly in dense scenarios where instruments are tightly clustered. To address this problem, we introduce Chain-of-Look, a novel visual reasoning framework that mimics the sequential human counting process by enforcing a structured visual chain, rather than relying on classic object detection, which is unordered. This visual chain guides the model to count along a coherent spatial trajectory, improving accuracy in complex scenes. To further enforce the physical plausibility of the visual chain, we introduce the neighboring loss function, which explicitly models the spatial constraints inherent to densely packed surgical instruments. We also present SurgCount-HD, a new dataset comprising 1,464 high-density surgical instrument images. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches for counting (e.g., CountGD, REC) as well as Multimodal Large Language Models (e.g., Qwen, ChatGPT) in the challenging task of dense surgical instrument counting.
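The abstract's "coherent spatial trajectory" idea can be illustrated with a toy ordering of detected instrument centers. The sketch below uses a greedy nearest-neighbor traversal starting from the top-left-most detection; this is an assumed stand-in for the paper's learned visual chain, not its actual algorithm.

```python
import math

def chain_order(centers):
    """Greedily order detection centers into a spatial chain:
    start at the top-left-most point, then repeatedly hop to the
    nearest unvisited neighbor. A toy stand-in for Chain-of-Look's
    structured visual chain (the real ordering is learned)."""
    remaining = list(centers)
    # start at the point closest to the image origin (top-left)
    current = min(remaining, key=lambda p: p[0] + p[1])
    chain = [current]
    remaining.remove(current)
    while remaining:
        current = min(remaining, key=lambda p: math.dist(p, current))
        chain.append(current)
        remaining.remove(current)
    return chain
```

Counting along such a chain gives each object a well-defined position in the count, which is the property the paper exploits to avoid double-counting in cluttered scenes.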
https://arxiv.org/abs/2602.11024
Developing efficient GPU kernels is essential for scaling modern AI systems, yet it remains a complex task due to intricate hardware architectures and the need for specialized optimization expertise. Although Large Language Models (LLMs) demonstrate strong capabilities in general sequential code generation, they face significant challenges in GPU code generation because of the scarcity of high-quality labeled training data, compiler biases when generating synthetic solutions, and limited generalization across hardware generations. This precludes supervised fine-tuning (SFT) as a scalable methodology for improving current LLMs. In contrast, reinforcement learning (RL) offers a data-efficient and adaptive alternative but requires access to relevant tools, careful selection of training problems, and a robust evaluation environment. We present Makora's environment and tools for reinforcement learning finetuning of frontier models and report our results from fine-tuning GPT-5 for Triton code generation. In the single-attempt setting, our fine-tuned model improves kernel correctness from 43.7% to 77.0% (+33.3 percentage points) and increases the fraction of problems outperforming TorchInductor from 14.8% to 21.8% (+7 percentage points) compared to baseline GPT-5, while exceeding prior state-of-the-art models on KernelBench. When integrated into a full coding agent, it is able to solve up to 97.4% of problems in an expanded KernelBench suite, outperforming the PyTorch TorchInductor compiler on 72.9% of problems with a geometric mean speedup of 2.12x. Our work demonstrates that targeted post-training with reinforcement learning can unlock LLM capabilities in highly specialized technical domains where traditional supervised learning is limited by data availability, opening new pathways for AI-assisted accelerator programming.
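A reward that gates on compilation and correctness before crediting speed is central to this kind of RL setup. The paper does not publish Makora's exact reward, so the shaping below is purely an assumption for illustration: a log-speedup term over the baseline compiler so that extreme outliers are tempered.

```python
import math

def kernel_reward(compiled, correct, baseline_ms, kernel_ms, eps=1e-6):
    """Illustrative RL reward shaping for generated GPU kernels
    (an assumption; not Makora's published reward). A kernel earns
    nothing unless it compiles and is numerically correct; beyond
    that, reward grows with log-speedup over the baseline."""
    if not compiled:
        return -1.0                      # penalize non-compiling code
    if not correct:
        return 0.0                       # compiles but wrong: no credit
    speedup = baseline_ms / max(kernel_ms, eps)
    return 1.0 + math.log(speedup)       # 1.0 at parity, more if faster
```

Gating correctness before speed keeps the policy from reward-hacking with fast-but-wrong kernels, which is the failure mode a robust evaluation environment must rule out.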
https://arxiv.org/abs/2602.11000
Recent studies have revealed that when LLMs are appropriately prompted and configured, they demonstrate mixed results, which often meet or exceed baseline performance. However, these comparisons have two primary issues: first, they mostly considered only reliability as a comparison metric; second, they selected only a few LLMs (such as Codex and ChatGPT) for comparison. This paper proposes a comprehensive code quality assessment framework called Programmatic Excellence via LLM Iteration (PELLI). PELLI is an iterative analysis-based process that upholds high-quality code changes. We extended the state of the art by performing a comprehensive evaluation that generates quantitative metrics for analyzing three primary nonfunctional requirements (maintainability, performance, and reliability) across five popular LLMs. To demonstrate PELLI's applicability, we selected three application domains while following Python coding standards. Following this framework, practitioners can ensure harmonious integration between LLMs and human developers, ensuring that their potential is fully realized. PELLI can serve as a practical guide for developers aiming to leverage LLMs while adhering to recognized quality standards. This study's outcomes are crucial for advancing LLM technologies in real-world applications, providing stakeholders with a clear understanding of where these LLMs excel and where they require further refinement. Overall, based on the three nonfunctional requirements, we found that GPT-4T and Gemini performed slightly better. We also found that prompt design can influence overall code quality. In addition, each application domain showed both high and low scores across metrics, and even within the same metric across different prompts.
https://arxiv.org/abs/2602.10808
Multimodal large language models (MLLMs) are increasingly adopted in remote sensing (RS) and have shown strong performance on tasks such as RS visual grounding (RSVG), RS visual question answering (RSVQA), and multimodal dialogue. However, hallucinations, which are responses inconsistent with the input RS images, severely hinder their deployment in high-stakes scenarios (e.g., emergency management and agricultural monitoring) and remain under-explored in RS. In this work, we present RSHallu, a systematic study with three deliverables: (1) we formalize RS hallucinations with an RS-oriented taxonomy and introduce image-level hallucination to capture RS-specific inconsistencies beyond object-centric errors (e.g., modality, resolution, and scene-level semantics); (2) we build a hallucination benchmark RSHalluEval (2,023 QA pairs) and enable dual-mode checking, supporting high-precision cloud auditing and low-cost reproducible local checking via a compact checker fine-tuned on RSHalluCheck dataset (15,396 QA pairs); and (3) we introduce a domain-tailored dataset RSHalluShield (30k QA pairs) for training-friendly mitigation and further propose training-free plug-and-play strategies, including decoding-time logit correction and RS-aware prompting. Across representative RS-MLLMs, our mitigation improves the hallucination-free rate by up to 21.63 percentage points under a unified protocol, while maintaining competitive performance on downstream RS tasks (RSVQA/RSVG). Code and datasets will be released.
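The "decoding-time logit correction" strategy can be sketched in the style of contrastive decoding for hallucination mitigation: boost the image-conditioned logits and subtract the logits the model assigns without the image, suppressing tokens favored only by the language prior. The abstract does not specify RSHallu's exact rule, so this recipe is an assumption.

```python
def corrected_logits(cond, uncond, alpha=0.5):
    """Toy decoding-time logit correction (contrastive-decoding
    style; an assumption, not RSHallu's published rule): amplify
    image-conditioned logits `cond` and penalize image-free prior
    logits `uncond`, so tokens the model would emit even without
    looking at the RS image are down-weighted."""
    return [(1 + alpha) * c - alpha * u for c, u in zip(cond, uncond)]
```

With `alpha=0`, the correction reduces to ordinary decoding; larger values trade fluency against grounding, which is why such corrections are attractive as training-free, plug-and-play mitigations.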
https://arxiv.org/abs/2602.10799
A narrated e-book combines synchronized audio with digital text, highlighting the currently spoken word or sentence during playback. This format supports early literacy and assists individuals with reading challenges, while also allowing general readers to seamlessly switch between reading and listening. With the emergence of natural-sounding neural Text-to-Speech (TTS) technology, several commercial services have been developed to leverage this technology for converting standard text e-books into high-quality narrated e-books. However, no open-source solutions currently exist to perform this task. In this paper, we present Calliope, an open-source framework designed to fill this gap. Our method leverages state-of-the-art open-source TTS to convert a text e-book into a narrated e-book in the EPUB 3 Media Overlay format. The method offers several innovative steps: audio timestamps are captured directly during TTS, ensuring exact synchronization between narration and text highlighting; the publisher's original typography, styling, and embedded media are strictly preserved; and the entire pipeline operates offline. This offline capability eliminates recurring API costs, mitigates privacy concerns, and avoids copyright compliance issues associated with cloud-based services. The framework currently supports the state-of-the-art open-source TTS systems XTTS-v2 and Chatterbox. A potential alternative approach involves first generating narration via TTS and subsequently synchronizing it with the text using forced alignment. However, while our method ensures exact synchronization, our experiments show that forced alignment introduces drift between the audio and the text highlighting that is significant enough to degrade the reading experience. Source code and usage instructions are available at this https URL.
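EPUB 3 Media Overlays express the text-audio synchronization as SMIL documents of `<par>` elements pairing a text fragment with an audio clip. A minimal sketch of turning captured TTS timestamps into such a document (file and fragment names are hypothetical placeholders, not Calliope's actual layout):

```python
def media_overlay(xhtml, audio, spans):
    """Generate a minimal EPUB 3 Media Overlay (SMIL) document from
    TTS-captured timestamps. `spans` is a list of
    (fragment_id, begin_seconds, end_seconds) tuples; names are
    illustrative placeholders."""
    pars = "\n".join(
        f'    <par id="p{i}">\n'
        f'      <text src="{xhtml}#{frag}"/>\n'
        f'      <audio src="{audio}" clipBegin="{b:.3f}s" clipEnd="{e:.3f}s"/>\n'
        f'    </par>'
        for i, (frag, b, e) in enumerate(spans, 1)
    )
    return (
        '<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">\n'
        '  <body>\n  <seq>\n' + pars + '\n  </seq>\n  </body>\n</smil>'
    )
```

Because the begin/end values come straight from synthesis rather than post-hoc forced alignment, the highlighted fragment and the audio clip cannot drift apart, which is the synchronization property the paper emphasizes.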
https://arxiv.org/abs/2602.10735
Long-term conversational memory is a core capability for LLM-based dialogue systems, yet existing benchmarks and evaluation protocols primarily focus on surface-level factual recall. In realistic interactions, appropriate responses often depend on implicit constraints such as user state, goals, or values that are not explicitly queried later. To evaluate this setting, we introduce \textbf{LoCoMo-Plus}, a benchmark for assessing cognitive memory under cue--trigger semantic disconnect, where models must retain and apply latent constraints across long conversational contexts. We further show that conventional string-matching metrics and explicit task-type prompting are misaligned with such scenarios, and propose a unified evaluation framework based on constraint consistency. Experiments across diverse backbone models, retrieval-based methods, and memory systems demonstrate that cognitive memory remains challenging and reveals failures not captured by existing benchmarks. Our code and evaluation framework are publicly available at: this https URL.
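The misalignment between string-matching metrics and latent-constraint settings can be shown with a toy example: a reply can satisfy an earlier user constraint (say, vegetarianism) while sharing no surface string with any single gold answer. The lexicon and checks below are illustrative assumptions, not LoCoMo-Plus's actual evaluation.

```python
MEAT = {"beef", "pork", "chicken", "bacon", "steak"}

def string_match(reply, gold):
    """Conventional surface metric: does the reply contain
    the single gold answer string?"""
    return gold.lower() in reply.lower()

def constraint_consistent(reply, forbidden=MEAT):
    """Toy constraint-consistency check: any phrasing is acceptable
    so long as it violates none of the user's latent constraints
    (here, a 'vegetarian' constraint stated turns earlier)."""
    words = {w.strip(".,!?") for w in reply.lower().split()}
    return not (words & forbidden)
```

For the reply "How about a mushroom risotto tonight?", string matching against a gold answer like "lentil curry" fails even though the reply respects the constraint, which is the kind of mismatch the proposed framework is designed to avoid.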
https://arxiv.org/abs/2602.10715
We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.
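The interleaved 3:1 sliding-window/full attention layout can be sketched as a per-layer schedule: three sliding-window layers followed by one full-attention layer, repeated. The exact placement in the real model is an assumption here; only the 3:1 ratio comes from the abstract.

```python
def attention_schedule(num_layers, ratio=(3, 1)):
    """Sketch of an interleaved sliding-window/full attention layout
    in the spirit of Step 3.5 Flash's 3:1 ratio (the true placement
    inside the model is assumed): each group of four layers uses
    three sliding-window layers followed by one full layer."""
    sw, full = ratio
    pattern = ["sliding_window"] * sw + ["full"] * full
    return [pattern[i % len(pattern)] for i in range(num_layers)]
```

Sliding-window layers keep per-token attention cost constant over long agentic transcripts, while the periodic full layers preserve global context, which is how such interleavings cut multi-round latency.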
https://arxiv.org/abs/2602.10604
Modern AI systems have been successfully deployed to win medals at international math competitions, assist with research workflows, and prove novel technical lemmas. However, despite their progress at advanced levels of mathematics, they remain stubbornly bad at basic arithmetic, consistently failing on the simple task of adding two numbers. We present a systematic investigation of this phenomenon. We demonstrate empirically that all frontier models suffer significantly degraded accuracy for integer addition as the number of digits increases. Furthermore, we show that most errors made by these models are highly interpretable and can be attributed to either operand misalignment or a failure to correctly carry; these two error classes explain 87.9%, 62.9%, and 92.4% of Claude Opus 4.1, GPT-5, and Gemini 2.5 Pro errors, respectively. Finally, we show that misalignment errors are frequently related to tokenization, and that carrying errors appear largely as independent random failures.
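The carry-failure class can be made concrete: if a model adds each digit column independently and discards all carries, it produces a characteristic wrong answer. The detector below is a simplified version of that idea (the paper's actual error criteria are more detailed).

```python
def drop_carry_sum(a, b):
    """Column-wise sum with all carries discarded: the answer a
    model would give if it added each digit pair independently."""
    da, db = str(a)[::-1], str(b)[::-1]
    digits = []
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        digits.append(str((x + y) % 10))
    return int("".join(reversed(digits)))

def classify_error(a, b, answer):
    """Toy classifier in the spirit of the paper's two error
    classes (simplified; misalignment detection is lumped into
    'other' here)."""
    if answer == a + b:
        return "correct"
    if answer == drop_carry_sum(a, b):
        return "carry_failure"
    return "other"  # e.g. operand misalignment
```

For example, 57 + 68 = 125, but dropping both carries yields 15; a model answering 15 exhibits exactly the carry failure pattern described above.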
https://arxiv.org/abs/2602.10416
ECHO (Evaluation of Chat, Human behavior, and Outcomes) is an open research platform designed to support reproducible, mixed-method studies of human interaction with both conversational AI systems and Web search engines. It enables researchers from varying disciplines to orchestrate end-to-end experimental workflows that integrate consent and background surveys, chat-based and search-based information-seeking sessions, writing or judgment tasks, and pre- and post-task evaluations within a unified, low-coding-load framework. ECHO logs fine-grained interaction traces and participant responses, and exports structured datasets for downstream analysis. By supporting both chat and search alongside flexible evaluation instruments, ECHO lowers technical barriers for studying learning, decision making, and user experience across different information access paradigms, empowering researchers from information retrieval, HCI, and the social sciences to conduct scalable and reproducible human-centered AI evaluations.
https://arxiv.org/abs/2602.10295
Recent advances in large image editing models have shifted the paradigm from text-driven instructions to vision-prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual-text prompts. While this paradigm greatly expands usability, it also introduces a critical and underexplored safety risk: the attack surface itself becomes visual. In this work, we propose Vision-Centric Jailbreak Attack (VJA), the first visual-to-visual jailbreak attack that conveys malicious instructions purely through visual inputs. To systematically study this emerging threat, we introduce IESBench, a safety-oriented benchmark for image editing models. Extensive experiments on IESBench demonstrate that VJA effectively compromises state-of-the-art commercial models, achieving attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5. To mitigate this vulnerability, we propose a training-free defense based on introspective multimodal reasoning, which substantially improves the safety of poorly aligned models to a level comparable with commercial systems, without auxiliary guard models and with negligible computational overhead. Our findings expose new vulnerabilities and provide both a benchmark and a practical defense to advance safe and trustworthy modern image editing systems. Warning: This paper contains offensive images created by large image editing models.
https://arxiv.org/abs/2602.10179
Neural retrieval and GPT-style generative models rely on large, high-quality supervised data, which is still scarce for low-resource languages such as Amharic. We release an Amharic data resource consisting of two datasets, supporting research on (i) neural retrieval-ranking and (ii) instruction-following text generation. The retrieval-ranking dataset contains 1,091 manually verified query-positive-negative document triplets drawn from diverse Amharic sources and constructed to support contrastive training and benchmarking of neural retrievers (e.g., DPR, ColBERT-style late interaction and SPLADE-style sparse neural retrieval). Triplets are created through a combination of expert-curated queries, web-derived queries, and LLM-assisted generation, with positive/negative documents selected from the web or synthesized by LLMs and then validated by native speakers. The instruction prompt-response dataset comprises 6,285 Amharic prompt-response pairs spanning multiple domains and instruction types, generated with several LLMs and refined through manual review and correction for grammaticality, relevance, fluency, and factual plausibility. We release both datasets with standardized splits and formats (CSV, JSON, JSONL) to enable reproducible work on Amharic retrieval, ranking, and generative modelling. These datasets also come with a methodology that can be generalized to other low-resource languages.
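Consuming the JSONL release for contrastive training amounts to parsing each line into a (query, positive, negative) triplet. The field names below are assumptions for illustration; the released files should be checked for the actual schema.

```python
import json

def load_triplets(lines):
    """Parse (query, positive, negative) triplets from JSONL lines
    for contrastive retriever training. The keys "query", "positive",
    and "negative" are assumed field names, not a confirmed schema."""
    triplets = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        triplets.append((row["query"], row["positive"], row["negative"]))
    return triplets
```

Each triplet then feeds a standard contrastive objective (query pulled toward the positive document, pushed from the negative), which is the training setup DPR-style retrievers expect.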
https://arxiv.org/abs/2602.09914
Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI world model empowers agents with human-like foresight by enabling action-conditioned prediction. However, existing text- and pixel-based approaches struggle to simultaneously achieve high visual fidelity and fine-grained structural controllability. To this end, we propose Code2World, a vision-language coder that simulates the next visual state via renderable code generation. Specifically, to address the data scarcity problem, we construct AndroidCode by translating GUI trajectories into high-fidelity HTML and refining synthesized code through a visual-feedback revision mechanism, yielding a corpus of over 80K high-quality screen-action pairs. To adapt existing VLMs to code prediction, we first perform SFT as a cold start for format and layout following, then apply Render-Aware Reinforcement Learning, which uses the rendered outcome as the reward signal by enforcing visual semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World-8B achieves the top-performing next UI prediction, rivaling the competitive GPT-5 and Gemini-3-Pro-Image. Notably, Code2World significantly enhances downstream navigation success rates in a flexible manner, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation. The code is available at this https URL.
https://arxiv.org/abs/2602.09856
Large Language Models (LLMs) are increasingly deployed to automatically label and analyze educational dialogue at scale, yet current pipelines lack reliable ways to detect when models are wrong. We investigate whether reasoning generated by LLMs can be used to predict the correctness of a model's own predictions. We analyze 30,300 teacher utterances from classroom dialogue, each labeled by multiple state-of-the-art LLMs with an instructional move construct and an accompanying reasoning. Using human-verified ground-truth labels, we frame the task as predicting whether a model's assigned label for a given utterance is correct. We encode LLM reasoning using Term Frequency-Inverse Document Frequency (TF-IDF) and evaluate five supervised classifiers. A Random Forest classifier achieves an F1 score of 0.83 (Recall = 0.854), successfully identifying most incorrect predictions and outperforming baselines. Training specialist detectors for specific instructional move constructs further improves performance on difficult constructs, indicating that error detection benefits from construct-specific linguistic cues. Using the Linguistic Inquiry and Word Count (LIWC) framework, we examine four linguistic markers of correctness: Causation, Differentiation, Tentativeness, and Insight. Correct predictions exhibit grounded causal language (e.g., because, therefore), while incorrect reasoning is substantially more likely to rely on epistemic hedging (e.g., might, could) and performative metacognition (e.g., think, realize). Syntactic complexity does not distinguish correct from incorrect reasoning, and longer reasoning is not more reliable. These findings demonstrate that reasoning-based error detection offers a practical and scalable approach to quality control in automated educational dialogue analysis.
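The paper's detector, TF-IDF features over the LLM's reasoning text feeding a Random Forest that predicts label correctness, maps directly onto a short scikit-learn pipeline. Hyperparameters below are illustrative assumptions, not the authors' settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

def build_error_detector():
    """Sketch of the paper's setup: TF-IDF over reasoning text into a
    Random Forest predicting whether the model's assigned label was
    correct (1) or not (0). Hyperparameters are illustrative."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),   # unigrams + bigrams
        RandomForestClassifier(n_estimators=300, random_state=0),
    )
```

The LIWC findings suggest why this works: hedging tokens ("might", "could") and performative metacognition ("think", "realize") in the reasoning are informative n-gram features that separate incorrect from correct predictions.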
https://arxiv.org/abs/2602.09832
In this work, we present Covo-Audio, a 7B-parameter end-to-end large audio-language model (LALM) that directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Through large-scale curated pretraining and targeted post-training, Covo-Audio achieves state-of-the-art or competitive performance among models of comparable scale across a broad spectrum of tasks, including speech-text modeling, spoken dialogue, speech understanding, audio understanding, and full-duplex voice interaction. Extensive evaluations demonstrate that the pretrained foundation model exhibits strong speech-text comprehension and semantic reasoning capabilities on multiple benchmarks, outperforming representative open-source models of comparable scale. Furthermore, Covo-Audio-Chat, the dialogue-oriented variant, demonstrates strong spoken conversational abilities, including understanding, contextual reasoning, instruction following, and generating contextually appropriate and empathetic responses, validating its applicability to real-world conversational assistant scenarios. Covo-Audio-Chat-FD, the evolved full-duplex model, achieves substantially superior performance on both spoken dialogue capabilities and full-duplex interaction behaviors, demonstrating its practical robustness. To mitigate the high cost of deploying end-to-end LALMs for natural conversational systems, we propose an intelligence-speaker decoupling strategy that separates dialogue intelligence from voice rendering, enabling flexible voice customization with minimal text-to-speech (TTS) data while preserving dialogue performance. Overall, our results highlight the strong potential of 7B-scale models to integrate sophisticated audio intelligence with high-level semantic reasoning, and suggest a scalable path toward more capable and versatile LALMs.
https://arxiv.org/abs/2602.09823
This paper introduces AnalyticsGPT, an intuitive and efficient large language model (LLM)-powered workflow for scientometric question answering. This underrepresented downstream task addresses the subcategory of meta-scientific questions concerning the "science of science." Compared to traditional scientific question answering based on papers, the task poses unique challenges in the planning phase: namely, the need for named-entity recognition of academic entities within questions, and multi-faceted data retrieval involving scientometric indices (e.g., impact factors). Beyond their exceptional capacity for treating traditional natural language processing tasks, LLMs have shown great potential in more complex applications, such as task decomposition, planning, and reasoning. In this paper, we explore the application of LLMs to scientometric question answering, and describe an end-to-end system implementing a sequential workflow with retrieval-augmented generation and agentic concepts. We also address the secondary task of effectively synthesizing the data into presentable and well-structured high-level analyses. As a database for retrieval-augmented generation, we leverage a proprietary research performance assessment platform. For evaluation, we consult experienced subject matter experts and leverage LLMs-as-judges. In doing so, we provide valuable insights on the efficacy of LLMs towards a niche downstream task. Our (skeleton) code and prompts are available at: this https URL.
https://arxiv.org/abs/2602.09817
Sustaining long-term interactions remains a bottleneck for Large Language Models (LLMs), as their limited context windows struggle to manage dialogue histories that extend over time. Existing memory systems often treat interactions as disjointed snippets, failing to capture the underlying narrative coherence of the dialogue stream. We propose TraceMem, a cognitively-inspired framework that weaves structured, narrative memory schemata from user conversational traces through a three-stage pipeline: (1) Short-term Memory Processing, which employs a deductive topic segmentation approach to demarcate episode boundaries and extract semantic representations; (2) Synaptic Memory Consolidation, a process that summarizes episodes into episodic memories before distilling them alongside semantics into user-specific traces; and (3) Systems Memory Consolidation, which utilizes two-stage hierarchical clustering to organize these traces into coherent, time-evolving narrative threads under unifying themes. These threads are encapsulated into structured user memory cards, forming narrative memory schemata. For memory utilization, we provide an agentic search mechanism to enhance the reasoning process. Evaluation on the LoCoMo benchmark shows that TraceMem achieves state-of-the-art performance with a brain-inspired architecture. Analysis shows that by constructing coherent narratives, it surpasses baselines in multi-hop and temporal reasoning, underscoring its essential role in deep narrative comprehension. Additionally, we provide an open discussion on memory systems, offering our perspectives and future outlook on the field. Our code implementation is available at: this https URL
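The three-stage pipeline can be made concrete with a toy sketch: segment turns into episodes, distill each episode into an episodic memory, then group memories into time-ordered threads per theme. The topic-change segmentation rule, the string-join "summaries," and all function names are simplifying assumptions standing in for TraceMem's LLM-driven stages.

```python
# Toy walk-through of a three-stage memory pipeline with rule-based
# stand-ins for the LLM components described in the abstract.

def segment_episodes(turns: list[tuple[str, str]]) -> list[tuple[str, list[str]]]:
    """Stage 1: open a new episode whenever the topic label changes."""
    episodes, prev = [], None
    for topic, text in turns:
        if topic != prev:
            episodes.append((topic, []))
            prev = topic
        episodes[-1][1].append(text)
    return episodes


def consolidate(episodes: list[tuple[str, list[str]]]) -> list[tuple[str, str]]:
    """Stage 2: distill each episode into a compact episodic memory."""
    return [(topic, " / ".join(texts)) for topic, texts in episodes]


def build_threads(memories: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Stage 3: organize memories into time-ordered threads under themes."""
    threads: dict[str, list[str]] = {}
    for topic, summary in memories:
        threads.setdefault(topic, []).append(summary)
    return threads
```

A "travel" thread here accumulates episodes in conversational order even when interleaved with other topics, which is the coherence property that supports multi-hop and temporal reasoning.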
https://arxiv.org/abs/2602.09712
We introduce MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) without ground-truth annotations or evaluator coordination. It leverages an ensemble of independently prompted evaluators guided by a human-aligned schema, supporting both discrete and continuous scoring judgements. With task-specific prompts spanning best-candidate selection, summarization, and image captioning through to dialogue, MILE-RefHumEval provides flexible, interpretable, and scalable assessments. Experiments show it aligns closely with human judgments, outperforms prior methods, and reduces computational overhead, offering an efficient, robust, and human-aligned solution for real-world LLM evaluation.
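The core mechanics of a reference-free ensemble judgment can be sketched as follows: several independently prompted evaluators score the same candidate against a shared schema, and their verdicts are aggregated without any ground-truth answer. The mean and majority-vote aggregators below are common conventions assumed for illustration, not necessarily the paper's exact aggregation rule.

```python
# Sketch of reference-free ensemble evaluation: independent evaluator
# verdicts are aggregated for both continuous and discrete judgments.

from collections import Counter
from statistics import mean


def aggregate_continuous(scores: list[float]) -> float:
    # Continuous judgment: average the independent 0-1 ratings.
    return mean(scores)


def aggregate_discrete(labels: list[str]) -> str:
    # Discrete judgment: majority vote across evaluators.
    return Counter(labels).most_common(1)[0][0]


# Three "evaluators": in practice each would be a separately prompted
# LLM applying the same human-aligned schema to the candidate output.
evaluator_scores = [0.8, 0.7, 0.9]
evaluator_labels = ["good", "good", "acceptable"]
```

Because the evaluators are prompted independently (no coordination), disagreement between them is informative: a wide score spread can itself flag low-confidence judgments.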
https://arxiv.org/abs/2602.09624
Conversational question answering increasingly relies on retrieval-augmented generation (RAG) to ground large language models (LLMs) in external knowledge. Yet, most existing studies evaluate RAG methods in isolation and primarily focus on single-turn settings. This paper addresses the lack of a systematic comparison of RAG methods for multi-turn conversational QA, where dialogue history, coreference, and shifting user intent substantially complicate retrieval. We present a comprehensive empirical study of vanilla and advanced RAG methods across eight diverse conversational QA datasets spanning multiple domains. Using a unified experimental setup, we evaluate retrieval quality and answer generation using generator and retrieval metrics, and analyze how performance evolves across conversation turns. Our results show that robust yet straightforward methods, such as reranking, hybrid BM25, and HyDE, consistently outperform vanilla RAG. In contrast, several advanced techniques fail to yield gains and can even degrade performance below the No-RAG baseline. We further demonstrate that dataset characteristics and dialogue length strongly influence retrieval effectiveness, explaining why no single RAG strategy dominates across settings. Overall, our findings indicate that effective conversational RAG depends less on method complexity than on alignment between the retrieval strategy and the dataset structure. We publish the code used (GitHub repository: this https URL).
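One of the straightforward methods highlighted above, hybrid retrieval combining BM25 with dense scores, boils down to per-query score normalization followed by a weighted fusion. The min-max normalization and the `alpha` weight below are common conventions assumed for illustration, not the exact recipe evaluated in the paper.

```python
# Minimal sketch of hybrid score fusion: lexical (BM25-style) and dense
# (embedding) scores are min-max normalized per query, then combined
# with a weight alpha and used to rank documents.

def minmax(scores: dict[str, float]) -> dict[str, float]:
    # Rescale scores to [0, 1] so lexical and dense scales are comparable.
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against identical scores
    return {doc: (s - lo) / span for doc, s in scores.items()}


def hybrid_rank(bm25: dict[str, float],
                dense: dict[str, float],
                alpha: float = 0.5) -> list[str]:
    # alpha=1.0 is pure lexical, alpha=0.0 pure dense retrieval.
    nb, nd = minmax(bm25), minmax(dense)
    fused = {doc: alpha * nb.get(doc, 0.0) + (1 - alpha) * nd.get(doc, 0.0)
             for doc in set(nb) | set(nd)}
    return sorted(fused, key=fused.get, reverse=True)
```

A document that is merely decent on both signals (here `d3`) can outrank one that tops a single signal, which is why simple fusion is competitive with more elaborate pipelines in the study's findings.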
https://arxiv.org/abs/2602.09552