Creating a stroke-by-stroke evolution of a visual artwork aims to bridge the emotional and educational gap between the finished static artwork and its creation process. Recent stroke-based painting systems focus on capturing stroke details by predicting and iteratively refining stroke parameters to maximize the similarity between the input image and the rendered output. However, these methods often struggle to produce stroke compositions that align with artistic principles and intent. To address this, we explore an image-to-painting method that (i) provides semantic guidance for brush strokes in targeted regions, (ii) computes the brush stroke parameters, and (iii) establishes an ordering among segments and strokes to sequentially render the final painting. Experimental results on various input image types, such as face images, paintings, and photographic images, show that our method follows a region-based painting strategy while rendering the painting with high fidelity and superior stroke quality.
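To make the ordered, region-based rendering concrete, here is a minimal sketch of sequential stroke playback, assuming Pillow and a simple elliptical stroke parameterization (position, size, angle, color); the paper's actual renderer and stroke model are not specified here and may differ.

```python
from dataclasses import dataclass
from PIL import Image, ImageDraw

@dataclass
class Stroke:
    x: float          # center position
    y: float
    w: float          # half-width of the elliptical brush
    h: float          # half-height of the elliptical brush
    angle: float      # rotation in degrees
    color: tuple      # RGBA, e.g. (30, 30, 200, 255)

def render_sequence(segments, size=(256, 256)):
    """segments: ordered list of (segment_name, [Stroke, ...]) pairs.
    Returns one canvas snapshot per stroke, i.e. the stroke-by-stroke evolution."""
    canvas = Image.new("RGB", size, "white")
    frames = []
    for _name, strokes in segments:        # segment-level ordering
        for s in strokes:                  # stroke-level ordering within a segment
            brush = Image.new("RGBA", size, (0, 0, 0, 0))
            d = ImageDraw.Draw(brush)
            d.ellipse([s.x - s.w, s.y - s.h, s.x + s.w, s.y + s.h], fill=s.color)
            brush = brush.rotate(s.angle, center=(s.x, s.y))
            canvas.paste(brush, (0, 0), brush)
            frames.append(canvas.copy())
    return frames
```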
https://arxiv.org/abs/2506.09969
How cost-effectively can we elicit strong reasoning in language models by leveraging their underlying representations? We answer this question with Resa, a family of 1.5B reasoning models trained via a novel and efficient sparse autoencoder tuning (SAE-Tuning) procedure. This method first trains an SAE to capture reasoning abilities from a source model, and then uses the trained SAE to guide a standard supervised fine-tuning process to elicit such abilities in a target model, all using verified question-answer data without any reasoning traces. Notably, when applied to certain base models before further RL post-training, SAE-Tuning retains >97% of its RL-trained counterpart's reasoning performance while reducing training costs by >2000x to roughly \$1 and training time by >450x to around 20 minutes. Furthermore, when applied to lightly RL-trained models (e.g., within 1 hour on 2 GPUs), it enables reasoning performance such as 43.33% Pass@1 on AIME24 and 90% Pass@1 on AMC23 for only around \$1 additional cost. Surprisingly, the reasoning abilities extracted via SAEs are potentially both generalizable and modular. Generality means abilities extracted from one dataset still elevate performance on a larger and overlapping corpus. Modularity means abilities extracted from Qwen or Qwen-Math can be attached to the R1-Distill model at test time, without any retraining, and yield comparable gains. Extensive ablations validate these findings and all artifacts are fully open-sourced.
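A minimal sketch of the two-stage SAE-Tuning idea follows, assuming PyTorch, toy dimensions, and a reconstruction-based guidance term; the exact layer choice, sparsity penalty, and the way the SAE guides supervised fine-tuning are illustrative assumptions rather than the released recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)

    def forward(self, h):
        z = F.relu(self.enc(h))                 # sparse latent code
        return self.dec(z), z

def train_sae(sae, source_hidden, l1_coef=1e-3, lr=1e-3, steps=1000):
    """Stage 1: fit the SAE on hidden states collected from the source model."""
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        h = source_hidden[torch.randint(len(source_hidden), (256,))]
        recon, z = sae(h)
        loss = F.mse_loss(recon, h) + l1_coef * z.abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return sae

def sae_guided_sft_loss(logits, labels, target_hidden, sae, alpha=0.1):
    """Stage 2: standard SFT cross-entropy on verified QA pairs plus a guidance
    term pulling the target model's hidden states toward the subspace the
    frozen SAE reconstructs well (illustrative choice of guidance)."""
    sft = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    recon, _ = sae(target_hidden)               # SAE parameters kept frozen
    guidance = F.mse_loss(recon, target_hidden)
    return sft + alpha * guidance
```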
https://arxiv.org/abs/2506.09967
As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods primarily approach multimodal reasoning in a straightforward, text-centric manner, where both reasoning and answer derivation are conducted purely through text, with the only difference being the presence of multimodal input. As a result, these methods often encounter fundamental limitations in spatial reasoning tasks that demand precise geometric understanding and continuous spatial tracking, capabilities that humans achieve through mental visualization and manipulation. To address these limitations, we propose drawing to reason in space, a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. By equipping models with basic drawing operations, including annotating bounding boxes and drawing auxiliary lines, we empower them to express and analyze spatial relationships through direct visual manipulation, while avoiding the performance ceiling imposed by specialized perception tools in previous tool-integrated reasoning approaches. To cultivate this capability, we develop a three-stage training framework: cold-start training with synthetic data to establish basic drawing abilities, reflective rejection sampling to enhance self-reflection behaviors, and reinforcement learning to directly optimize for target rewards. Extensive experiments demonstrate that our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks, involving maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning tasks, with an average improvement of 18.4%.
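The elementary drawing operations mentioned above can be illustrated with a small Pillow sketch; the actual tool interface exposed to VILASR may differ.

```python
from PIL import Image, ImageDraw

def annotate_bbox(image: Image.Image, box, label=None, color="red"):
    """Draw a bounding box (x0, y0, x1, y1) the model can refer back to."""
    img = image.copy()
    draw = ImageDraw.Draw(img)
    draw.rectangle(box, outline=color, width=3)
    if label:
        draw.text((box[0] + 2, box[1] + 2), label, fill=color)
    return img

def draw_auxiliary_line(image: Image.Image, p0, p1, color="blue"):
    """Draw an auxiliary line between two points, e.g. to compare positions."""
    img = image.copy()
    draw = ImageDraw.Draw(img)
    draw.line([p0, p1], fill=color, width=3)
    return img

# Usage: the model emits a drawing action, the environment renders it, and the
# updated image is fed back for the next reasoning step, e.g.:
# img = Image.open("scene.png")
# img = annotate_bbox(img, (40, 60, 180, 220), label="chair")
# img = draw_auxiliary_line(img, (110, 140), (320, 140))
```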
https://arxiv.org/abs/2506.09965
Medical Visual Question Answering (MedVQA) is a promising field for developing clinical decision support systems, yet progress is often limited by the available datasets, which can lack clinical complexity and visual diversity. To address these gaps, we introduce Kvasir-VQA-x1, a new, large-scale dataset for gastrointestinal (GI) endoscopy. Our work significantly expands upon the original Kvasir-VQA by incorporating 159,549 new question-answer pairs that are designed to test deeper clinical reasoning. We developed a systematic method using large language models to generate these questions, which are stratified by complexity to better assess a model's inference capabilities. To ensure our dataset prepares models for real-world clinical scenarios, we have also introduced a variety of visual augmentations that mimic common imaging artifacts. The dataset is structured to support two main evaluation tracks: one for standard VQA performance and another to test model robustness against these visual perturbations. By providing a more challenging and clinically relevant benchmark, Kvasir-VQA-x1 aims to accelerate the development of more reliable and effective multimodal AI systems for use in clinical settings. The dataset is fully accessible and adheres to FAIR data principles, making it a valuable resource for the wider research community. Code and data: this https URL and this https URL
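As an illustration of the robustness track, the following sketch shows perturbations that mimic common imaging artifacts (blur, sensor noise, overexposure) using Pillow and NumPy; the dataset's actual augmentation set is not reproduced here.

```python
import numpy as np
from PIL import Image, ImageFilter, ImageEnhance

def blur_like(img: Image.Image, radius: float = 3.0) -> Image.Image:
    """Approximate defocus or motion blur with a Gaussian filter."""
    return img.filter(ImageFilter.GaussianBlur(radius))

def sensor_noise(img: Image.Image, sigma: float = 10.0) -> Image.Image:
    """Add zero-mean Gaussian noise in pixel space."""
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

def overexposure(img: Image.Image, factor: float = 1.6) -> Image.Image:
    """Simulate glare or overexposed lighting."""
    return ImageEnhance.Brightness(img).enhance(factor)

# The robustness track then checks whether a model's answer stays consistent
# when the same question is asked about the clean and the perturbed image.
```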
https://arxiv.org/abs/2506.09958
Indirect Prompt Injection attacks exploit the inherent limitation of Large Language Models (LLMs) to distinguish between instructions and data in their inputs. Despite numerous defense proposals, systematic evaluation against adaptive adversaries remains limited, even though successful attacks can have wide security and privacy implications, and many real-world LLM-based applications remain vulnerable. We present the results of LLMail-Inject, a public challenge simulating a realistic scenario in which participants adaptively attempted to inject malicious instructions into emails in order to trigger unauthorized tool calls in an LLM-based email assistant. The challenge spanned multiple defense strategies, LLM architectures, and retrieval configurations, resulting in a dataset of 208,095 unique attack submissions from 839 participants. We release the challenge code, the full dataset of submissions, and our analysis demonstrating how this data can provide new insights into the instruction-data separation problem. We hope this will serve as a foundation for future research towards practical structural solutions to prompt injection.
https://arxiv.org/abs/2506.09956
Conditional diffusion models (CDMs) have shown impressive performance across a range of generative tasks. Their ability to model the full data distribution has opened new avenues for analysis-by-synthesis in downstream discriminative learning. However, this same modeling capacity causes CDMs to entangle the class-defining features with irrelevant context, posing challenges to extracting robust and interpretable representations. To address this, we identify Canonical LAtent Representations (CLAReps), latent codes whose internal CDM features preserve essential categorical information while discarding non-discriminative signals. When decoded, CLAReps produce representative samples for each class, offering an interpretable and compact summary of the core class semantics with minimal irrelevant details. Exploiting CLAReps, we develop a novel diffusion-based feature-distillation paradigm, CaDistill. While the student has full access to the training set, the CDM as teacher transfers core class knowledge only via CLAReps, which amount to merely 10% of the training data in size. After training, the student achieves strong adversarial robustness and generalization ability, focusing more on the class signals instead of spurious background cues. Our findings suggest that CDMs can serve not just as image generators but also as compact, interpretable teachers that can drive robust representation learning.
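A minimal sketch of a CaDistill-style training step follows: the student sees the full training set with ordinary cross-entropy, while the diffusion teacher contributes a feature-matching term only on CLARep-decoded class exemplars. The specific losses, feature layers, and the `student.features` accessor are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cadistill_step(student, teacher_feats_fn, batch, clarep_batch, beta=1.0):
    """One training step: cross-entropy on the full data, plus a distillation
    term computed only on CLARep-decoded exemplars (about 10% of the data).
    `student.features` and `teacher_feats_fn` are assumed feature extractors."""
    images, labels = batch                        # regular training batch
    clarep_images, _clarep_labels = clarep_batch  # decoded CLAReps per class

    ce = F.cross_entropy(student(images), labels)

    s_feat = student.features(clarep_images)      # student features on exemplars
    with torch.no_grad():
        t_feat = teacher_feats_fn(clarep_images)  # frozen CDM teacher features
    distill = F.mse_loss(s_feat, t_feat)

    return ce + beta * distill
```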
https://arxiv.org/abs/2506.09955
Recently, we have witnessed the great success of generalist models in natural language processing. A generalist model is a general framework trained on massive data that can handle various downstream tasks simultaneously. Encouraged by their impressive performance, an increasing number of researchers are venturing into applying these models to computer vision tasks. However, the inputs and outputs of vision tasks are more diverse, and it is difficult to summarize them under a unified representation. In this paper, we provide a comprehensive overview of vision generalist models, delving into their characteristics and capabilities within the field. First, we review the background, including the datasets, tasks, and benchmarks. Then, we dig into the design of frameworks that have been proposed in existing research, while also introducing the techniques employed to enhance their performance. To better help researchers comprehend the area, we take a brief excursion into related domains, shedding light on their interconnections and potential synergies. To conclude, we provide some real-world application scenarios, undertake a thorough examination of the persistent challenges, and offer insights into possible directions for future research endeavors.
https://arxiv.org/abs/2506.09954
In outside knowledge visual question answering (OK-VQA), the model must identify relevant visual information within an image and incorporate external knowledge to accurately respond to a question. Extending this task to a visually grounded dialogue setting based on videos, a conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual information. Moreover, the context of the overall conversation must be considered for the subsequent dialogue. To explore this task, we introduce a dataset comprised of $2,017$ videos with $5,986$ human-annotated dialogues consisting of $40,954$ interleaved dialogue turns. While the dialogue context is visually grounded in specific video segments, the questions further require external knowledge that is not visually present. Thus, the model not only has to identify relevant video parts but also leverage external knowledge to converse within the dialogue. We further provide several baselines evaluated on our dataset and show future challenges associated with this task. The dataset is made publicly available here: this https URL.
https://arxiv.org/abs/2506.09953
The scale diversity of point cloud data presents significant challenges in developing unified representation learning techniques for 3D vision. Currently, there are few unified 3D models, and no existing pre-training method is equally effective for both object- and scene-level point clouds. In this paper, we introduce UniPre3D, the first unified pre-training method that can be seamlessly applied to point clouds of any scale and 3D models of any architecture. Our approach predicts Gaussian primitives as the pre-training task and employs differentiable Gaussian splatting to render images, enabling precise pixel-level supervision and end-to-end optimization. To further regulate the complexity of the pre-training task and direct the model's focus toward geometric structures, we integrate 2D features from pre-trained image models to incorporate well-established texture knowledge. We validate the universal effectiveness of our proposed method through extensive experiments across a variety of object- and scene-level tasks, using diverse point cloud models as backbones. Code is available at this https URL.
https://arxiv.org/abs/2506.09952
Implicit neural representations (INRs) have emerged as a powerful tool for solving inverse problems in computer vision and computational imaging. INRs represent images as continuous domain functions realized by a neural network taking spatial coordinates as inputs. However, unlike traditional pixel representations, little is known about the sample complexity of estimating images using INRs in the context of linear inverse problems. Towards this end, we study the sampling requirements for recovery of a continuous domain image from its low-pass Fourier samples by fitting a single hidden-layer INR with ReLU activation and a Fourier features layer using a generalized form of weight decay regularization. Our key insight is to relate minimizers of this non-convex parameter space optimization problem to minimizers of a convex penalty defined over an infinite-dimensional space of measures. We identify a sufficient number of Fourier samples for which an image realized by an INR is exactly recoverable by solving the INR training problem. To validate our theory, we empirically assess the probability of achieving exact recovery of images realized by low-width single hidden-layer INRs, and illustrate the performance of INRs on super-resolution recovery of continuous domain phantom images.
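The recovery setup can be sketched as follows: a single hidden-layer ReLU INR with a Fourier-features layer is fit so that the low-pass Fourier coefficients of the rendered image match the observed samples, with the generalized weight decay shown here as a plain squared-norm penalty. Image size, mask radius, and hyperparameters are toy assumptions.

```python
import torch
import torch.nn as nn

class FourierINR(nn.Module):
    """Single hidden-layer ReLU INR with a fixed Fourier-features layer."""
    def __init__(self, n_features=64, width=128, scale=5.0):
        super().__init__()
        self.B = nn.Parameter(scale * torch.randn(2, n_features), requires_grad=False)
        self.hidden = nn.Linear(2 * n_features, width)
        self.out = nn.Linear(width, 1)

    def forward(self, coords):                    # coords: (N, 2) in [0, 1)
        proj = 2 * torch.pi * coords @ self.B
        feats = torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)
        return self.out(torch.relu(self.hidden(feats)))

H = W = 64
ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2).float() / H

target = torch.rand(H, W)                         # stand-in for the true image
target_hat = torch.fft.fft2(target)
lowpass = torch.zeros(H, W, dtype=torch.bool)     # keep only low frequencies
lowpass[:8, :8] = True
lowpass[:8, -8:] = True
lowpass[-8:, :8] = True
lowpass[-8:, -8:] = True

model = FourierINR()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    img = model(coords).reshape(H, W)
    err = torch.fft.fft2(img)[lowpass] - target_hat[lowpass]
    wd = sum(p.norm() ** 2 for p in model.parameters() if p.requires_grad)
    loss = (err.abs() ** 2).mean() + 1e-6 * wd    # data fit + weight-decay penalty
    opt.zero_grad(); loss.backward(); opt.step()
```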
https://arxiv.org/abs/2506.09949
Recent work has identified retrieval heads (Wu et al., 2025b), a subset of attention heads responsible for retrieving salient information in long-context language models (LMs), as measured by their copy-paste behavior in Needle-in-a-Haystack tasks. In this paper, we introduce QRHEAD (Query-Focused Retrieval Head), an improved set of attention heads that enhance retrieval from long context. We identify QRHEAD by aggregating attention scores with respect to the input query, using a handful of examples from real-world tasks (e.g., long-context QA). We further introduce QRRETRIEVER, an efficient and effective retriever that uses the accumulated attention mass of QRHEAD as retrieval scores. We use QRRETRIEVER for long-context reasoning by selecting the most relevant parts with the highest retrieval scores. On multi-hop reasoning tasks LongMemEval and CLIPPER, this yields over 10% performance gains over full context and outperforms strong dense retrievers. We also evaluate QRRETRIEVER as a re-ranker on the BEIR benchmark and find that it achieves strong zero-shot performance, outperforming other LLM-based re-rankers such as RankGPT. Further analysis shows that both the query-context attention scoring and task selection are crucial for identifying QRHEAD with strong downstream utility. Overall, our work contributes a general-purpose retriever and offers interpretability insights into the long-context capabilities of LMs.
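A sketch of query-focused head scoring and of using accumulated attention mass as a retrieval score is given below; the tensor shapes, the aggregation over examples, and the chunking scheme are assumptions for illustration.

```python
import numpy as np

def score_heads(attentions, query_pos, gold_pos):
    """attentions: list over examples of arrays shaped (layers, heads, seq, seq).
    Score each head by how much attention its query tokens place on the gold
    (salient) context positions, summed over a handful of real-task examples."""
    L, H = attentions[0].shape[:2]
    scores = np.zeros((L, H))
    for attn, q_idx, g_idx in zip(attentions, query_pos, gold_pos):
        scores += attn[:, :, q_idx][:, :, :, g_idx].sum(axis=(2, 3))
    return scores            # pick the top-scoring (layer, head) pairs as QRHEAD

def retrieve(attn, qr_heads, query_pos, chunk_spans, top_k=4):
    """Retrieval score of a chunk = accumulated attention mass that the selected
    QRHEAD heads place on it from the query tokens."""
    chunk_scores = []
    for start, end in chunk_spans:
        mass = sum(attn[l, h][np.ix_(query_pos, range(start, end))].sum()
                   for l, h in qr_heads)
        chunk_scores.append(mass)
    order = np.argsort(chunk_scores)[::-1][:top_k]
    return [chunk_spans[i] for i in order]
```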
https://arxiv.org/abs/2506.09944
We introduce CausalVQA, a benchmark dataset for video question answering (VQA) composed of question-answer pairs that probe models' understanding of causality in the physical world. Existing VQA benchmarks either tend to focus on surface perceptual understanding of real-world videos, or on narrow physical reasoning questions created using simulation environments. CausalVQA fills an important gap by presenting challenging questions that are grounded in real-world scenarios, while focusing on models' ability to predict the likely outcomes of different actions and events through five question types: counterfactual, hypothetical, anticipation, planning and descriptive. We designed quality control mechanisms that prevent models from exploiting trivial shortcuts, requiring models to base their answers on deep visual understanding instead of linguistic cues. We find that current frontier multimodal models fall substantially below human performance on the benchmark, especially on anticipation and hypothetical questions. This highlights a challenge for current systems to leverage spatial-temporal reasoning, understanding of physical principles, and comprehension of possible alternatives to make accurate predictions in real-world settings.
https://arxiv.org/abs/2506.09943
Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs), with verification engineering playing a central role. However, best practices for RL in instruction following remain underexplored. In this work, we explore the verification challenge in RL for instruction following and propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model (e.g., QwQ-32B). To support this approach, we construct a high-quality instruction-following dataset, VerInstruct, containing approximately 22,000 instances with associated verification signals. We apply RL training with VerIF to two models, achieving significant improvements across several representative instruction-following benchmarks. The trained models reach state-of-the-art performance among models of comparable size and generalize well to unseen constraints. We further observe that their general capabilities remain unaffected, suggesting that RL with VerIF can be integrated into existing RL recipes to enhance overall model performance. We have released our datasets, codes, and models to facilitate future research at this https URL.
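A sketch of a hybrid verifier in this spirit is shown below: hard constraints are checked in code, and remaining soft constraints are delegated to an LLM judge passed in as a callable; the constraint schema and reward shaping are illustrative assumptions.

```python
import re

def rule_checks(response: str, constraints: dict) -> bool:
    """Hard constraints that can be verified directly in code."""
    if "max_words" in constraints and len(response.split()) > constraints["max_words"]:
        return False
    if "must_include" in constraints and not all(
            kw.lower() in response.lower() for kw in constraints["must_include"]):
        return False
    if "format_regex" in constraints and not re.search(constraints["format_regex"], response):
        return False
    return True

def verif_reward(instruction, response, constraints, llm_judge=None):
    """Verification reward for RL training: code checks first, then an LLM judge
    (e.g., a large reasoning model such as QwQ-32B) for soft constraints;
    `llm_judge` is a user-supplied callable returning True/False."""
    if not rule_checks(response, constraints):
        return 0.0
    soft = constraints.get("soft_constraints", [])
    if llm_judge is None or not soft:
        return 1.0
    return 1.0 if llm_judge(instruction, response, soft) else 0.0
```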
https://arxiv.org/abs/2506.09942
Information asymmetry is a pervasive feature of multi-agent systems, especially evident in economics and social sciences. In these settings, agents tailor their actions based on private information to maximize their rewards. These strategic behaviors often introduce complexities due to confounding variables. Simultaneously, knowledge transportability poses another significant challenge, arising from the difficulty of conducting experiments in target environments; it requires transferring knowledge from environments where empirical data is more readily available. Against this backdrop, this paper explores a fundamental question in online learning: Can we employ non-i.i.d. actions to learn about confounders even when knowledge transfer is required? We present a sample-efficient algorithm designed to accurately identify system dynamics under information asymmetry and to navigate the challenges of knowledge transfer effectively in reinforcement learning, framed within an online strategic interaction model. Our method provably achieves learning of an $\epsilon$-optimal policy with a tight sample complexity of $O(1/\epsilon^2)$.
https://arxiv.org/abs/2506.09940
While vision-language-action models (VLAs) have shown promising robotic behaviors across a diverse set of manipulation tasks, they achieve limited success rates when deployed on novel tasks out of the box. To allow these policies to safely interact with their environments, we need a failure detector that gives a timely alert such that the robot can stop, backtrack, or ask for help. However, existing failure detectors are trained and tested only on one or a few specific tasks, while VLAs require the detector to generalize and detect failures in unseen tasks and novel environments as well. In this paper, we introduce the multitask failure detection problem and propose SAFE, a failure detector for generalist robot policies such as VLAs. We analyze the VLA feature space and find that VLAs have sufficient high-level knowledge about task success and failure, which is generic across different tasks. Based on this insight, we design SAFE to learn from VLA internal features and predict a single scalar indicating the likelihood of task failure. SAFE is trained on both successful and failed rollouts, and is evaluated on unseen tasks. SAFE is compatible with different policy architectures. We test it extensively on OpenVLA, $\pi_0$, and $\pi_0$-FAST in both simulated and real-world environments. We compare SAFE with diverse baselines and show that SAFE achieves state-of-the-art failure detection performance and the best trade-off between accuracy and detection time using conformal prediction. More qualitative results can be found at this https URL.
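A sketch of a SAFE-style detector follows: a small probe on VLA internal features produces a per-timestep failure score, and a split-conformal calibration on successful rollouts sets the alert threshold; the probe architecture and calibration details are assumptions, not the paper's exact design.

```python
import numpy as np
import torch
import torch.nn as nn

class FailureProbe(nn.Module):
    """Small head mapping VLA internal features to a per-timestep failure score."""
    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, feats):                     # feats: (T, feat_dim) for one rollout
        return torch.sigmoid(self.net(feats)).squeeze(-1)   # (T,) failure scores

def calibrate_threshold(success_rollout_scores, alpha=0.1):
    """Split-conformal style calibration: choose a threshold so that at most an
    alpha fraction of successful calibration rollouts would raise a false alert.
    `success_rollout_scores` is a list of per-timestep score arrays (NumPy)."""
    max_scores = np.array([float(np.max(s)) for s in success_rollout_scores])
    k = int(np.ceil((1 - alpha) * (len(max_scores) + 1))) - 1
    return float(np.sort(max_scores)[min(k, len(max_scores) - 1)])

# At deployment, the robot stops, backtracks, or asks for help as soon as the
# running failure score exceeds the calibrated threshold.
```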
https://arxiv.org/abs/2506.09937
Developing 3D-VL generalists capable of understanding 3D scenes and following natural language instructions to perform a wide range of tasks has been a long-standing goal in the 3D-VL community. Despite recent progress, 3D-VL models still lag behind their 2D counterparts in capability and robustness, falling short of the generalist standard. A key obstacle to developing 3D-VL generalists lies in data scalability, hindered by the lack of an efficient scene representation. We propose LEO-VL, a 3D-VL model built upon the condensed feature grid (CFG), an efficient scene representation that bridges 2D perception and 3D spatial structure while significantly reducing token overhead. This efficiency unlocks large-scale training towards the 3D-VL generalist, for which we curate over 700k high-quality 3D-VL data samples spanning four domains of real-world indoor scenes and five tasks such as captioning and dialogue. LEO-VL achieves state-of-the-art performance on a variety of 3D QA benchmarks, including SQA3D, MSQA, and Beacon3D. Ablation studies confirm the efficiency of our representation, the importance of task and scene diversity, and the validity of our data curation principle. Furthermore, we introduce SceneDPO, a novel post-training objective that enhances the robustness of 3D-VL models. We hope our findings contribute to the advancement of scalable and robust 3D-VL generalists.
https://arxiv.org/abs/2506.09935
Safe navigation of steerable and robotic catheters in the cerebral vasculature requires awareness of the catheter's shape and pose. Currently, a significant perception burden is placed on interventionalists to mentally reconstruct and predict catheter motions from biplane fluoroscopy images. Efforts to track these catheters are limited to planar segmentation or bulky sensing instrumentation, which are incompatible with the microcatheters used in neurointervention. In this work, a catheter is equipped with custom radiopaque markers arranged to enable simultaneous shape and pose estimation under biplane fluoroscopy. A design measure is proposed to guide the arrangement of these markers to minimize sensitivity to marker tracking uncertainty. This approach was deployed for microcatheters smaller than 2 mm OD navigating phantom vasculature, with shape tracking errors of less than 1 mm and catheter roll errors below 40 degrees. This work can enable steerable catheters to autonomously navigate under biplane imaging.
https://arxiv.org/abs/2506.09934
Diffusion models represent the cutting edge in image generation, but their high memory and computational demands hinder deployment on resource-constrained devices. Post-Training Quantization (PTQ) offers a promising solution by reducing the bitwidth of matrix operations. However, standard PTQ methods struggle with outliers, and achieving higher compression often requires transforming model weights and activations before quantization. In this work, we propose HadaNorm, a novel linear transformation that extends existing approaches and effectively mitigates outliers by normalizing activation feature channels before applying Hadamard transformations, enabling more aggressive activation quantization. We demonstrate that HadaNorm consistently reduces quantization error across the various components of transformer blocks, achieving superior efficiency-performance trade-offs when compared to state-of-the-art methods.
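A toy sketch of the HadaNorm idea is given below: activation channels are normalized first, then rotated with an orthonormal Hadamard matrix, then symmetrically quantized; where the normalization statistics and quantization scales come from in practice is simplified here.

```python
import numpy as np
from scipy.linalg import hadamard

def hadanorm_quantize(x, bits=8, eps=1e-6):
    """x: (tokens, channels) activations; channels must be a power of two."""
    n = x.shape[1]
    mu, sigma = x.mean(axis=0), x.std(axis=0) + eps
    x_norm = (x - mu) / sigma                          # per-channel normalization
    H = hadamard(n).astype(np.float64) / np.sqrt(n)    # orthonormal Hadamard matrix
    x_rot = x_norm @ H                                 # spread remaining outliers

    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x_rot).max() / qmax                 # per-tensor symmetric scale
    x_q = np.clip(np.round(x_rot / scale), -qmax - 1, qmax)
    return x_q.astype(np.int8), scale, (mu, sigma, H)

def dequantize(x_q, scale, stats):
    """Invert the rotation and normalization to recover approximate activations."""
    mu, sigma, H = stats
    return (x_q.astype(np.float64) * scale) @ H.T * sigma + mu
```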
https://arxiv.org/abs/2506.09932
One promise that Vision-Language-Action (VLA) models hold over traditional imitation learning for robotics is to leverage the broad generalization capabilities of large Vision-Language Models (VLMs) to produce versatile, "generalist" robot policies. However, current evaluations of VLAs remain insufficient. Traditional imitation learning benchmarks are unsuitable due to the lack of language instructions. Emerging benchmarks for VLAs that incorporate language often come with limited evaluation tasks and do not intend to investigate how much VLM pretraining truly contributes to the generalization capabilities of the downstream robotic policy. Meanwhile, much research relies on real-world robot setups designed in isolation by different institutions, which creates a barrier for reproducibility and accessibility. To address this gap, we introduce a unified probing suite of 50 simulation-based tasks across 10 subcategories spanning language instruction, vision, and objects. We systematically evaluate several state-of-the-art VLA architectures on this suite to understand their generalization capability. Our results show that while VLM backbones endow VLAs with robust perceptual understanding and high level planning, which we refer to as good intentions, this does not reliably translate into precise motor execution: when faced with out-of-distribution observations, policies often exhibit coherent intentions, but falter in action execution. Moreover, finetuning on action data can erode the original VLM's generalist reasoning abilities. We release our task suite and evaluation code to serve as a standardized benchmark for future VLAs and to drive research on closing the perception-to-action gap. More information, including the source code, can be found at this https URL
https://arxiv.org/abs/2506.09930
Hyperspectral image (HSI) clustering assigns similar pixels to the same class without any annotations, which is an important yet challenging task. For large-scale HSIs, most methods rely on superpixel segmentation and perform superpixel-level clustering based on graph neural networks (GNNs). However, existing GNNs cannot fully exploit the spectral information of the input HSI, and the inaccurate superpixel topological graph may lead to the confusion of different class semantics during information aggregation. To address these challenges, we first propose a structural-spectral graph convolutional operator (SSGCO) tailored for graph-structured HSI superpixels to improve their representation quality through the co-extraction of spatial and spectral features. Second, we propose an evidence-guided adaptive edge learning (EGAEL) module that adaptively predicts and refines edge weights in the superpixel topological graph. We integrate the proposed method into a contrastive learning framework to achieve clustering, where representation learning and clustering are simultaneously conducted. Experiments demonstrate that the proposed method improves clustering accuracy by 2.61%, 6.06%, 4.96% and 3.15% over the best compared methods on four HSI datasets. Our code is available at this https URL.
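The following is a minimal sketch of a structural-spectral convolution over superpixels, combining graph aggregation over the superpixel topology with a 1D convolution along the spectral axis; it illustrates the idea rather than the paper's SSGCO/EGAEL implementation.

```python
import torch
import torch.nn as nn

class StructuralSpectralConv(nn.Module):
    def __init__(self, bands: int, out_dim: int, kernel: int = 5):
        super().__init__()
        # Spectral branch: 1D convolution over the band dimension of each superpixel.
        self.spectral = nn.Conv1d(1, 4, kernel_size=kernel, padding=kernel // 2)
        self.proj = nn.Linear(4 * bands, out_dim)

    def forward(self, x, adj):
        # x: (N, bands) mean spectra of N superpixels; adj: (N, N) edge weights.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1e-6)
        agg = (adj / deg) @ x                      # structural aggregation over neighbors
        spec = self.spectral(agg.unsqueeze(1))     # (N, 4, bands) band-local patterns
        return torch.relu(self.proj(spec.flatten(1)))

# An EGAEL-style module could then re-estimate the entries of `adj` from feature
# similarity evidence before the next convolution layer.
```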
https://arxiv.org/abs/2506.09920