Generating high-quality and photorealistic 3D assets remains a longstanding challenge in 3D vision and computer graphics. Although state-of-the-art generative models, such as diffusion models, have made significant progress in 3D generation, they often fall short of human-designed content due to their limited ability to follow instructions, align with human preferences, or produce realistic textures, geometries, and physical attributes. In this paper, we introduce Nabla-R2D3, a highly effective and sample-efficient reinforcement learning alignment framework for 3D-native diffusion models using 2D rewards. Built upon the recently proposed Nabla-GFlowNet method, which matches the score function to reward gradients in a principled manner for reward finetuning, our Nabla-R2D3 enables effective adaptation of 3D diffusion models using only 2D reward signals. Extensive experiments show that, unlike vanilla finetuning baselines, which either struggle to converge or suffer from reward hacking, Nabla-R2D3 consistently achieves higher rewards and reduced prior forgetting within a few finetuning steps.
https://arxiv.org/abs/2506.15684
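To make the score-matching idea above concrete, here is a minimal sketch (in PyTorch) of finetuning a diffusion score network so that its score shifts by the gradient of a 2D log-reward. The toy "3D assets" are points in R^3, the "2D render" is a simple projection, and the network sizes, reward function, and temperature beta are illustrative assumptions, not the paper's actual setup:

    import torch
    import torch.nn as nn

    def make_score_net():
        return nn.Sequential(nn.Linear(4, 64), nn.SiLU(), nn.Linear(64, 3))

    pretrained = make_score_net()            # frozen prior score s_theta(x, t)
    finetuned = make_score_net()
    finetuned.load_state_dict(pretrained.state_dict())
    for p in pretrained.parameters():
        p.requires_grad_(False)

    def log_reward_2d(x3d):
        # "Render" by dropping the z-axis, then score the 2D view (hypothetical reward).
        view = x3d[:, :2]
        return -((view - torch.tensor([1.0, 0.0])) ** 2).sum(dim=-1)

    opt = torch.optim.Adam(finetuned.parameters(), lr=1e-3)
    beta = 0.1  # reward temperature (assumed)

    for step in range(200):
        x = torch.randn(128, 3, requires_grad=True)       # noisy samples
        t = torch.rand(128, 1)
        grad_logr = torch.autograd.grad(log_reward_2d(x).sum(), x)[0]
        inp = torch.cat([x.detach(), t], dim=-1)
        # Nabla-GFlowNet-style residual matching: the finetuned score should equal
        # the prior score plus the (scaled) reward gradient lifted through the render.
        target = pretrained(inp) + beta * grad_logr
        loss = ((finetuned(inp) - target.detach()) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
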
With the popularity of large language models (LLMs), undesirable societal problems like misinformation production and academic misconduct have become more severe, making LLM-generated text detection more important than ever. Although existing methods have made remarkable progress, a new challenge posed by text from privately tuned LLMs remains underexplored. Users can easily obtain private LLMs by fine-tuning an open-source model with private corpora, causing a significant performance drop for existing detectors in practice. To address this issue, we propose PhantomHunter, an LLM-generated text detector specialized for detecting text from unseen, privately-tuned LLMs. Its family-aware learning framework captures family-level traits shared across the base models and their derivatives, instead of memorizing individual characteristics. Experiments on data from the LLaMA, Gemma, and Mistral families show its superiority over 7 baselines and 3 industrial services, with F1 scores of over 96%.
https://arxiv.org/abs/2506.15683
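The family-aware learning framework can be pictured as a shared encoder with a family-classification head whose posterior conditions the human-vs-LLM decision, so unseen fine-tuned derivatives are handled via their family traits. The sketch below is a hypothetical PyTorch rendering of that idea; the feature extractor, dimensions, and loss weighting are assumptions, not PhantomHunter's actual architecture:

    import torch
    import torch.nn as nn

    class FamilyAwareDetector(nn.Module):
        """Toy family-aware detector: learn family-level traits (e.g., LLaMA /
        Gemma / Mistral) and reuse them for the human-vs-machine decision."""
        def __init__(self, feat_dim=768, n_families=3):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
            self.family_head = nn.Linear(256, n_families)       # which base-model family?
            self.detect_head = nn.Linear(256 + n_families, 2)   # human vs. LLM

        def forward(self, feats):
            h = self.encoder(feats)
            fam_logits = self.family_head(h)
            # Condition the detector on the soft family posterior rather than
            # on memorized per-model characteristics.
            fam_post = fam_logits.softmax(dim=-1)
            det_logits = self.detect_head(torch.cat([h, fam_post], dim=-1))
            return fam_logits, det_logits

    model = FamilyAwareDetector()
    feats = torch.randn(8, 768)     # stand-in for extracted text features
    fam_logits, det_logits = model(feats)
    fam_y, det_y = torch.randint(0, 3, (8,)), torch.randint(0, 2, (8,))
    loss = nn.functional.cross_entropy(fam_logits, fam_y) \
         + nn.functional.cross_entropy(det_logits, det_y)
    loss.backward()
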
Diffusion-based image generation models excel at producing high-quality synthetic content, but suffer from slow and computationally expensive inference. Prior work has attempted to mitigate this by caching and reusing features within diffusion transformers across inference steps. These methods, however, often rely on rigid heuristics that result in limited acceleration or poor generalization across architectures. We propose Evolutionary Caching to Accelerate Diffusion models (ECAD), a genetic algorithm that learns efficient, per-model caching schedules forming a Pareto frontier, using only a small set of calibration prompts. ECAD requires no modifications to network parameters or reference images. It offers significant inference speedups, enables fine-grained control over the quality-latency trade-off, and adapts seamlessly to different diffusion models. Notably, ECAD's learned schedules can generalize effectively to resolutions and model variants not seen during calibration. We evaluate ECAD on PixArt-alpha, PixArt-Sigma, and FLUX-1.dev using multiple metrics (FID, CLIP, Image Reward) across diverse benchmarks (COCO, MJHQ-30k, PartiPrompts), demonstrating consistent improvements over previous approaches. On PixArt-alpha, ECAD identifies a schedule that outperforms the previous state-of-the-art method by 4.47 COCO FID while increasing inference speedup from 2.35x to 2.58x. Our results establish ECAD as a scalable and generalizable approach for accelerating diffusion inference. Our project website is available at this https URL and our code is available at this https URL.
https://arxiv.org/abs/2506.15682
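A genetic search over caching schedules can be sketched in a few lines: schedules are boolean vectors over inference steps (recompute vs. reuse), fitness is a (quality, speedup) pair, and selection keeps the Pareto-optimal schedules. The NumPy toy below substitutes stand-in fitness functions for real calibration on prompts, so it only illustrates the search loop, not ECAD's actual objective:

    import numpy as np

    rng = np.random.default_rng(0)
    STEPS = 20          # diffusion inference steps; True = recompute, False = reuse cache
    POP, GENS = 24, 30  # small toy budget (the real calibration setup will differ)

    def schedule_penalty(schedule):
        # Penalize caching consecutive early steps, where features change fastest.
        return (~schedule[:5]).sum()

    def fitness(schedule):
        # Stand-ins for the real calibration metrics: "quality" decays with the
        # number of cached (skipped) steps, "speedup" grows with it.
        cached = (~schedule).sum()
        quality = -cached - 0.5 * schedule_penalty(schedule)
        speedup = 1.0 + cached / STEPS
        return quality, speedup

    def pareto_front(scores):
        front = []
        for i, (q, s) in enumerate(scores):
            dominated = any(q2 >= q and s2 >= s and (q2, s2) != (q, s) for q2, s2 in scores)
            if not dominated:
                front.append(i)
        return front

    pop = rng.random((POP, STEPS)) < 0.7
    for gen in range(GENS):
        scores = [fitness(ind) for ind in pop]
        elite = pop[pareto_front(scores)]
        kids = []
        while len(kids) < POP:
            a, b = elite[rng.integers(len(elite), size=2)]   # crossover over Pareto parents
            cut = rng.integers(1, STEPS)
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(STEPS) < 0.05                # bit-flip mutation
            kids.append(child)
        pop = np.stack(kids)

    print("Pareto schedules:", pop[pareto_front([fitness(i) for i in pop])])
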
Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge arises here from the diversity of VLM architectures, which are built on different LLMs and employ varying token types, differing in vocabulary size, token splits, and token index ordering. To avoid being limited to a specific VLM type, we present Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performances, eventually outperforming large-scale open- and closed-source VLMs.
https://arxiv.org/abs/2506.15681
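One way to picture the Recalibrator is as a projection into the teacher's feature width followed by attention that re-arranges information across token positions. The sketch below is a guess at a minimal such module in PyTorch; the dimensions, single-layer depth, and plain MSE distillation loss are assumptions for illustration, not GenRecal's actual design:

    import torch
    import torch.nn as nn

    class Recalibrator(nn.Module):
        """Toy recalibration module: map student features into the teacher's
        representation space so VLMs built on different LLMs, vocabularies,
        and token layouts can still exchange knowledge."""
        def __init__(self, d_student=512, d_teacher=1024, n_heads=8):
            super().__init__()
            self.proj = nn.Linear(d_student, d_teacher)
            self.align = nn.TransformerEncoderLayer(
                d_model=d_teacher, nhead=n_heads, batch_first=True)

        def forward(self, student_feats):
            # Project to the teacher width, then let self-attention redistribute
            # information across positions (token splits/ordering differ per VLM).
            return self.align(self.proj(student_feats))

    recal = Recalibrator()
    student = torch.randn(2, 77, 512)    # student token features (assumed shapes;
    teacher = torch.randn(2, 77, 1024)   # real sequence lengths generally differ)
    loss = nn.functional.mse_loss(recal(student), teacher)
    loss.backward()
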
Modeling the dynamics of deformable objects is challenging due to their diverse physical properties and the difficulty of estimating states from limited visual information. We address these challenges with a neural dynamics framework that combines object particles and spatial grids in a hybrid representation. Our particle-grid model captures global shape and motion information while predicting dense particle movements, enabling the modeling of objects with varied shapes and materials. Particles represent object shapes, while the spatial grid discretizes the 3D space to ensure spatial continuity and enhance learning efficiency. Coupled with Gaussian Splatting for visual rendering, our framework achieves a fully learning-based digital twin of deformable objects and generates 3D action-conditioned videos. Through experiments, we demonstrate that our model learns the dynamics of diverse objects -- such as ropes, cloths, stuffed animals, and paper bags -- from sparse-view RGB-D recordings of robot-object interactions, while also generalizing at the category level to unseen instances. Our approach outperforms state-of-the-art learning-based and physics-based simulators, particularly in scenarios with limited camera views. Furthermore, we showcase the utility of our learned models in model-based planning, enabling goal-conditioned object manipulation across a range of tasks. The project page is available at this https URL .
https://arxiv.org/abs/2506.15680
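The particle-grid coupling can be illustrated by scattering per-particle features into a voxel grid and gathering the smoothed result back onto the particles. The PyTorch toy below shows only that scatter/gather plumbing; the real model wraps it in learned networks that predict dense particle motion:

    import torch

    def particles_to_grid(pos, feat, res=16):
        """Scatter particle features into a dense voxel grid by nearest cell
        (a toy stand-in for the paper's particle-grid coupling)."""
        idx = (pos.clamp(0, 1 - 1e-6) * res).long()        # (N, 3) cell indices
        flat = idx[:, 0] * res * res + idx[:, 1] * res + idx[:, 2]
        grid = torch.zeros(res ** 3, feat.shape[1])
        count = torch.zeros(res ** 3, 1)
        grid.index_add_(0, flat, feat)
        count.index_add_(0, flat, torch.ones(len(flat), 1))
        return (grid / count.clamp(min=1)).view(res, res, res, -1), flat

    def grid_to_particles(grid, flat):
        return grid.reshape(-1, grid.shape[-1])[flat]      # gather back per particle

    # Toy rollout step: particles carry positions + velocities; the grid provides
    # the spatially continuous context a learned dynamics network would consume.
    pos = torch.rand(500, 3)
    vel = torch.randn(500, 3) * 0.01
    grid, flat = particles_to_grid(pos, vel)
    ctx = grid_to_particles(grid, flat)     # per-particle neighborhood average
    pos_next = pos + vel + 0.1 * ctx        # a trained model would predict this delta
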
Sparse autoencoders (SAEs) are designed to extract interpretable features from language models by enforcing a sparsity constraint. Ideally, training an SAE would yield latents that are both sparse and semantically meaningful. However, many SAE latents activate frequently (i.e., are \emph{dense}), raising concerns that they may be undesirable artifacts of the training procedure. In this work, we systematically investigate the geometry, function, and origin of dense latents and show that they are not only persistent but often reflect meaningful model representations. We first demonstrate that dense latents tend to form antipodal pairs that reconstruct specific directions in the residual stream, and that ablating their subspace suppresses the emergence of new dense features in retrained SAEs -- suggesting that high density features are an intrinsic property of the residual space. We then introduce a taxonomy of dense latents, identifying classes tied to position tracking, context binding, entropy regulation, letter-specific output signals, part-of-speech, and principal component reconstruction. Finally, we analyze how these features evolve across layers, revealing a shift from structural features in early layers, to semantic features in mid layers, and finally to output-oriented signals in the last layers of the model. Our findings indicate that dense latents serve functional roles in language model computation and should not be dismissed as training noise.
https://arxiv.org/abs/2506.15679
AI agents today are mostly siloed - they either retrieve and reason over vast amounts of digital information and knowledge obtained online, or interact with the physical world through embodied perception, planning and action - but rarely both. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce Embodied Web Agents, a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. To operationalize this concept, we first develop the Embodied Web Agents task environments, a unified simulation platform that tightly integrates realistic 3D indoor and outdoor environments with functional web interfaces. Building upon this platform, we construct and release the Embodied Web Agents Benchmark, which encompasses a diverse suite of tasks including cooking, navigation, shopping, tourism, and geolocation - all requiring coordinated reasoning across physical and digital realms for systematic assessment of cross-domain intelligence. Experimental results reveal significant performance gaps between state-of-the-art AI systems and human capabilities, establishing both challenges and opportunities at the intersection of embodied cognition and web-scale knowledge access. All datasets, codes and websites are publicly available at our project page this https URL.
https://arxiv.org/abs/2506.15677
Gender-inclusive machine translation (MT) should preserve gender ambiguity in the source to avoid misgendering and representational harms. While gender ambiguity often occurs naturally in notional gender languages such as English, maintaining that gender neutrality in grammatical gender languages is a challenge. Here we assess the sensitivity of 21 MT systems to the need for gender neutrality in response to gender ambiguity in three translation directions of varying difficulty. The specific gender-neutral strategies observed in practice are categorized and discussed. Additionally, we examine the effect of binary gender stereotypes on the use of gender-neutral translation. In general, we report a disappointing absence of gender-neutral translations in response to gender ambiguity. However, we observe a small handful of MT systems that switch to gender-neutral translation using specific strategies, depending on the target language.
https://arxiv.org/abs/2506.15676
Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well-suited for world exploration training as they suffer from some limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning ``world'' in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Experiments demonstrate the quality of the dataset, and we use a subset to train an interactive video world exploration model, named YUME (meaning ``dream'' in Japanese). We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications.
https://arxiv.org/abs/2506.15675
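For a sense of what the annotation toolbox attaches to each clip, the dataclass below sketches a hypothetical per-clip record covering the listed fields; the names and types are assumptions, not Sekai's released schema:

    from dataclasses import dataclass, field

    @dataclass
    class SekaiClip:
        """Hypothetical per-clip annotation record mirroring the fields the
        Sekai toolbox produces (names/types are assumptions)."""
        video_id: str
        country: str
        city: str
        view: str                       # "FPV" (walking) or "UAV" (drone)
        scene: str                      # e.g. "street", "park", "indoor mall"
        weather: str
        crowd_density: str              # e.g. "low" / "medium" / "high"
        caption: str
        camera_trajectory: list[tuple[float, float, float]] = field(default_factory=list)

    clip = SekaiClip("tokyo_0001", "Japan", "Tokyo", "FPV", "street",
                     "clear", "high", "Walking through a busy crossing at dusk.")
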
We study privacy leakage in the reasoning traces of large reasoning models used as personal agents. Unlike final outputs, reasoning traces are often assumed to be internal and safe. We challenge this assumption by showing that reasoning traces frequently contain sensitive user data, which can be extracted via prompt injections or accidentally leak into outputs. Through probing and agentic evaluations, we demonstrate that test-time compute approaches, particularly increased reasoning steps, amplify such leakage. While increasing the budget of those test-time compute approaches makes models more cautious in their final answers, it also leads them to reason more verbosely and leak more in their own thinking. This reveals a core tension: reasoning improves utility but enlarges the privacy attack surface. We argue that safety efforts must extend to the model's internal thinking, not just its outputs.
https://arxiv.org/abs/2506.15674
We address the challenge of relighting a single image or video, a task that demands precise scene intrinsic understanding and high-quality light transport synthesis. Existing end-to-end relighting models are often limited by the scarcity of paired multi-illumination data, restricting their ability to generalize across diverse scenes. Conversely, two-stage pipelines that combine inverse and forward rendering can mitigate data requirements but are susceptible to error accumulation and often fail to produce realistic outputs under complex lighting conditions or with sophisticated materials. In this work, we introduce a general-purpose approach that jointly estimates albedo and synthesizes relit outputs in a single pass, harnessing the generative capabilities of video diffusion models. This joint formulation enhances implicit scene comprehension and facilitates the creation of realistic lighting effects and intricate material interactions, such as shadows, reflections, and transparency. Trained on synthetic multi-illumination data and extensive automatically labeled real-world videos, our model demonstrates strong generalization across diverse domains and surpasses previous methods in both visual fidelity and temporal consistency.
https://arxiv.org/abs/2506.15673
The rapid progress of Large Language Models has advanced agentic systems in decision-making, coordination, and task execution. Yet, existing agentic system generation frameworks lack full autonomy, missing from-scratch agent generation, self-optimizing agent functionality, and collaboration, limiting adaptability and scalability. We propose SwarmAgentic, a framework for fully automated agentic system generation that constructs agentic systems from scratch and jointly optimizes agent functionality and collaboration as interdependent components through language-driven exploration. To enable efficient search over system-level structures, SwarmAgentic maintains a population of candidate systems and evolves them via feedback-guided updates, drawing inspiration from Particle Swarm Optimization (PSO). We evaluate our method on six real-world, open-ended, and exploratory tasks involving high-level planning, system-level coordination, and creative reasoning. Given only a task description and an objective function, SwarmAgentic outperforms all baselines, achieving a +261.8% relative improvement over ADAS on the TravelPlanner benchmark, highlighting the effectiveness of full automation in structurally unconstrained tasks. This framework marks a significant step toward scalable and autonomous agentic system design, bridging swarm intelligence with fully automated multi-agent system generation. Our code is publicly released at this https URL.
https://arxiv.org/abs/2506.15672
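The PSO analogy can be made concrete with a population of candidate system descriptions, per-candidate personal bests, a global best, and feedback-guided updates that pull candidates toward both. In the sketch below the LLM-driven generation, evaluation, and rewriting steps are all stubbed with toy functions, so only the swarm-style control flow reflects the method:

    import random

    random.seed(0)

    def propose_system(task):                # stub for LLM-based from-scratch generation
        return {"agents": random.randint(1, 4), "uses_planner": random.random() < 0.5}

    def evaluate(system, task):              # stub for the task's objective function
        return system["agents"] * 0.2 + (0.3 if system["uses_planner"] else 0.0)

    def update(system, personal_best, global_best):
        # Move a candidate toward its own best and the swarm's best (PSO analogy);
        # a real update would be an LLM rewrite conditioned on textual feedback.
        new = dict(system)
        if random.random() < 0.5:
            new["agents"] = personal_best["agents"]
        if random.random() < 0.5:
            new["uses_planner"] = global_best["uses_planner"]
        return new

    task = "plan a 3-day trip under budget"
    swarm = [propose_system(task) for _ in range(8)]
    best_per = list(swarm)
    best_global = max(swarm, key=lambda s: evaluate(s, task))
    for it in range(10):
        for i, sys_i in enumerate(swarm):
            swarm[i] = update(sys_i, best_per[i], best_global)
            if evaluate(swarm[i], task) > evaluate(best_per[i], task):
                best_per[i] = swarm[i]
        best_global = max(best_per, key=lambda s: evaluate(s, task))
    print("best system:", best_global)
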
We present Vision in Action (ViA), an active perception system for bimanual robot manipulation. ViA learns task-relevant active perceptual strategies (e.g., searching, tracking, and focusing) directly from human demonstrations. On the hardware side, ViA employs a simple yet effective 6-DoF robotic neck to enable flexible, human-like head movements. To capture human active perception strategies, we design a VR-based teleoperation interface that creates a shared observation space between the robot and the human operator. To mitigate VR motion sickness caused by latency in the robot's physical movements, the interface uses an intermediate 3D scene representation, enabling real-time view rendering on the operator side while asynchronously updating the scene with the robot's latest observations. Together, these design elements enable the learning of robust visuomotor policies for three complex, multi-stage bimanual manipulation tasks involving visual occlusions, significantly outperforming baseline systems.
https://arxiv.org/abs/2506.15666
Large language models excel at many tasks but still struggle with consistent, robust reasoning. We introduce Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. To enforce cohort-level consistency, we define a composite objective that reinforcement learning, unlike supervised fine-tuning, can directly optimize: it combines cohort accuracy, a retrieval bonus for effective problem decomposition, and a rejection penalty for trivial or invalid lookups. Optimizing this reward guides the model to adopt uniform reasoning patterns across all cohort members. Experiments on challenging reasoning benchmarks (including ARC-Challenge and StrategyQA) show that CC-Learn boosts both accuracy and reasoning stability over pretrained and SFT baselines. These results demonstrate that cohort-level RL effectively enhances reasoning consistency in LLMs.
https://arxiv.org/abs/2506.15662
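The composite objective is simple to write down: cohort accuracy plus a capped retrieval bonus minus a rejection penalty. The function below is a toy rendering with assumed weights; an RL loop would maximize its output directly, which supervised fine-tuning cannot do:

    def cc_learn_reward(cohort_correct, retrieved_subproblems, invalid_lookups,
                        w_acc=1.0, w_ret=0.2, w_rej=0.5):
        """Toy composite objective in the spirit of CC-Learn (weights assumed):
        cohort accuracy + a retrieval bonus for useful decomposition - a
        penalty for trivial or invalid lookups."""
        accuracy = sum(cohort_correct) / len(cohort_correct)
        retrieval_bonus = w_ret * min(retrieved_subproblems, 3)  # cap the bonus
        rejection_penalty = w_rej * invalid_lookups
        return w_acc * accuracy + retrieval_bonus - rejection_penalty

    # One cohort of four variants of the same underlying question:
    print(cc_learn_reward([True, True, False, True],
                          retrieved_subproblems=2, invalid_lookups=1))
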
Rule-based rewards offer a promising strategy for improving reinforcement learning from human feedback (RLHF), but current approaches often rely on manual rule engineering. We present AutoRule, a fully automated method for extracting rules from preference feedback and formulating them into rule-based rewards. AutoRule extraction operates in three stages: it leverages a reasoning model to interpret user preferences, identifies candidate rules from the reasoning chain of these interpretations, and synthesizes them into a unified rule set. Leveraging the finalized rule set, we employ language-model verifiers to compute the fraction of rules satisfied by each output, using this metric as an auxiliary reward alongside the learned reward model during policy optimization. Training a Llama-3-8B model with AutoRule results in a 28.6\% relative improvement in length-controlled win rate on AlpacaEval2.0, and a 6.1\% relative gain in second-turn performance on a held-out MT-Bench subset, compared to a GRPO baseline trained with the same learned reward model but without the rule-based auxiliary reward. Our analysis confirms that the extracted rules exhibit good agreement with dataset preference. We find that AutoRule demonstrates reduced reward hacking compared to a learned reward model when run over two episodes. Finally, our case study suggests that the extracted rules capture unique qualities valued in different datasets. The extracted rules are provided in the appendix, and the code is open-sourced at this https URL.
https://arxiv.org/abs/2506.15651
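The auxiliary reward reduces to the fraction of extracted rules an output satisfies, blended with the learned reward model. In the sketch below each rule is a plain Python predicate and the reward model is a stub, whereas AutoRule uses language-model verifiers per rule; the example rules and the blending weight alpha are assumptions:

    def rule_fraction(output: str, rules: list) -> float:
        """Fraction of extracted rules satisfied by an output (LM verifiers
        in the real method; simple predicates here)."""
        if not rules:
            return 0.0
        return sum(rule(output) for rule in rules) / len(rules)

    def combined_reward(output, rules, learned_reward, alpha=0.5):
        # Rule-based auxiliary reward alongside the learned reward model.
        return learned_reward(output) + alpha * rule_fraction(output, rules)

    rules = [
        lambda o: len(o.split()) < 120,              # "be concise"
        lambda o: not o.lower().startswith("sure"),  # "avoid filler openers"
        lambda o: "\n-" in o or "\n1." in o,         # "structure lists explicitly"
    ]
    fake_rm = lambda o: 0.1 * len(set(o.split()))    # stand-in learned reward model
    print(combined_reward("Here are the steps:\n1. plan\n2. verify", rules, fake_rm))
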
This study addresses the problem of authorship attribution for Romanian texts using the ROST corpus, a standard benchmark in the field. We systematically evaluate six machine learning techniques: Support Vector Machine (SVM), Logistic Regression (LR), k-Nearest Neighbors (k-NN), Decision Trees (DT), Random Forests (RF), and Artificial Neural Networks (ANN), employing character n-gram features for classification. Among these, the ANN model achieved the highest performance, including perfect classification in four out of fifteen runs when using 5-gram features. These results demonstrate that lightweight, interpretable character n-gram approaches can deliver state-of-the-art accuracy for Romanian authorship attribution, rivaling more complex methods. Our findings highlight the potential of simple stylometric features in resource-constrained or under-studied language settings.
https://arxiv.org/abs/2506.15650
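The character n-gram pipeline the study evaluates is close to a few lines of scikit-learn. The sketch below uses a tiny made-up two-author sample in place of the ROST corpus, and the vectorizer settings and network size are assumptions:

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neural_network import MLPClassifier

    # Toy two-author sample standing in for the ROST corpus.
    texts = [
        "Pe malul apei se plimba incet, ganditor.",
        "Se plimba incet pe malul apei, cufundat in ganduri.",
        "Strada era plina de lume si de zgomot.",
        "Zgomotul strazii acoperea orice gand.",
    ]
    authors = ["A", "A", "B", "B"]

    # Character 5-gram features + a small ANN, mirroring the best configuration.
    clf = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(5, 5)),
        MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
    )
    clf.fit(texts, authors)
    print(clf.predict(["Se plimba pe malul apei."]))
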
Despite significant advances in inference-time search for vision-language models (VLMs), existing approaches remain both computationally expensive and prone to unpenalized, low-confidence generations which often lead to persistent hallucinations. We introduce \textbf{Value-guided Inference with Margin-based Reward (ViMaR)}, a two-stage inference framework that improves both efficiency and output fidelity by combining a temporal-difference value model with a margin-aware reward adjustment. In the first stage, we perform a single pass to identify the highest-value caption among diverse candidates. In the second stage, we selectively refine only those segments that were overlooked or exhibit weak visual grounding, thereby eliminating frequently rewarded evaluations. A calibrated margin-based penalty discourages low-confidence continuations while preserving descriptive richness. Extensive experiments across multiple VLM architectures demonstrate that ViMaR generates captions that are significantly more reliable, factually accurate, detailed, and explanatory, while achieving over 4$\times$ speedup compared to existing value-guided methods. Specifically, we show that ViMaR trained solely on LLaVA Mistral-7B, \textit{generalizes effectively to guide decoding in a stronger unseen model}. To further validate this, we adapt the ViMaR to steer generation in LLaVA-OneVision-Qwen2-7B, leading to consistent improvements in caption quality and demonstrating robust cross-model guidance. This cross-model generalization highlights ViMaR's flexibility and modularity, positioning it as a scalable and transferable inference-time decoding strategy. Furthermore, when ViMaR-generated captions are used for self-training, the underlying models achieve substantial gains across a broad suite of visual comprehension benchmarks, underscoring the potential of fast, accurate, and self-improving VLM pipelines.
https://arxiv.org/abs/2506.15649
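The two-stage control flow can be sketched independently of any particular VLM: stage one picks the highest-value caption in a single pass; stage two re-generates only weakly grounded segments, with a margin penalty discouraging low-confidence continuations. Every scorer below is a stub standing in for ViMaR's temporal-difference value model and grounding checks, and the thresholds are assumptions:

    def value(caption: str) -> float:                # stub value model
        return len(set(caption.split())) / (1 + caption.count("maybe"))

    def segment_scores(caption: str):                # stub visual-grounding scores
        return [(seg, min(1.0, len(seg) / 40)) for seg in caption.split(". ")]

    def refine(segment: str) -> str:                 # stub refinement call
        return segment.replace("maybe ", "")

    def vimar(candidates, tau=0.5, margin=0.2):
        # Stage 1: single pass to pick the highest-value candidate.
        best = max(candidates, key=value)
        # Stage 2: only re-generate weakly grounded segments; a margin-based
        # penalty discourages keeping low-confidence continuations.
        out = []
        for seg, score in segment_scores(best):
            penalized = score - margin if "maybe" in seg else score
            out.append(refine(seg) if penalized < tau else seg)
        return ". ".join(out)

    caps = ["A dog runs on grass. It is maybe chasing a ball",
            "A brown dog sprints across a park lawn. A red ball lies ahead"]
    print(vimar(caps))
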
Recent advancements in large reasoning models (LRMs) have significantly enhanced language models' capabilities in complex problem-solving by emulating human-like deliberative thinking. However, these models often exhibit overthinking (i.e., the generation of unnecessarily verbose and redundant content), which hinders efficiency and inflates inference cost. In this work, we explore the representational and behavioral origins of this inefficiency, revealing that LRMs inherently possess the capacity for more concise reasoning. Empirical analyses show that correct reasoning paths vary significantly in length, and the shortest correct responses often suffice, indicating untapped efficiency potential. Exploiting these findings, we propose two lightweight methods to enhance LRM efficiency. First, we introduce Efficiency Steering, a training-free activation steering technique that modulates reasoning behavior via a single direction in the model's representation space. Second, we develop Self-Rewarded Efficiency RL, a reinforcement learning framework that dynamically balances task accuracy and brevity by rewarding concise correct solutions. Extensive experiments on seven LRM backbones across multiple mathematical reasoning benchmarks demonstrate that our methods significantly reduce reasoning length while preserving or improving task performance. Our results highlight that reasoning efficiency can be improved by leveraging and guiding the intrinsic capabilities of existing models in a self-guided manner.
https://arxiv.org/abs/2506.15647
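Training-free activation steering amounts to adding a fixed direction to a chosen layer's activations at inference time via a forward hook. The sketch below applies a random direction to a tiny MLP purely to show the mechanics; in the paper the direction is found in the LRM's representation space, and its sign and scale trade brevity against accuracy:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 16))

    d_hidden = 32
    steer_dir = torch.nn.functional.normalize(torch.randn(d_hidden), dim=0)
    alpha = 2.0   # steering strength (assumed)

    def steering_hook(module, inputs, output):
        # Shift the layer's activations along a single direction, no training.
        return output + alpha * steer_dir

    handle = model[0].register_forward_hook(steering_hook)
    x = torch.randn(4, 16)
    steered = model(x)
    handle.remove()                # restore unsteered behavior
    baseline = model(x)
    print((steered - baseline).abs().mean())
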
Recent Multimodal Large Language Models (MLLMs) excel on benchmark vision-language tasks, yet little is known about how input visual quality shapes their responses. Does higher perceptual quality of images already translate to better MLLM understanding? We conduct the first systematic study spanning leading MLLMs and a suite of vision-language benchmarks, applying controlled degradations and stylistic shifts to each image. Surprisingly, we uncover a visual-quality paradox: model, task, and even individual-instance performance can improve when images deviate from human-perceived fidelity. Off-the-shelf restoration pipelines fail to reconcile these idiosyncratic preferences. To close the gap, we introduce Visual-Quality Test-Time Tuning (VQ-TTT)-a lightweight adaptation module that: (1) inserts a learnable, low-rank kernel before the frozen vision encoder to modulate frequency content; and (2) fine-tunes only shallow vision-encoder layers via LoRA. VQ-TTT dynamically adjusts each input image in a single forward pass, aligning it with task-specific model preferences. Across the evaluated MLLMs and all datasets, VQ-TTT delivers significant average accuracy gains, with no external models, cached features, or extra training data. These findings redefine ``better'' visual inputs for MLLMs and highlight the need for adaptive, rather than universally ``clean'', imagery in the new era in which AI is the main data customer.
https://arxiv.org/abs/2506.15645
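The first ingredient of VQ-TTT can be sketched as a learnable low-rank convolution, initialized near the identity, inserted before a frozen encoder so that only the kernel (and, in the full method, LoRA on shallow layers) receives gradients. The kernel size, rank, and the toy encoder below are assumptions:

    import torch
    import torch.nn as nn

    class LowRankKernel(nn.Module):
        """Toy VQ-TTT input module: a learnable, low-rank convolution that
        modulates frequency content, initialized near the identity."""
        def __init__(self, k=7, rank=2):
            super().__init__()
            self.u = nn.Parameter(torch.randn(rank, k) * 1e-2)   # kernel ~ I + u^T v
            self.v = nn.Parameter(torch.randn(rank, k) * 1e-2)

        def forward(self, img):                                  # img: (B, 3, H, W)
            k = self.u.shape[1]
            kernel = torch.einsum("ri,rj->ij", self.u, self.v)   # rank-limited (k, k)
            kernel[k // 2, k // 2] += 1.0                        # identity pass-through
            w = kernel.unsqueeze(0).unsqueeze(0).repeat(3, 1, 1, 1)
            return nn.functional.conv2d(img, w, padding=k // 2, groups=3)

    frozen_encoder = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1),
                                   nn.AdaptiveAvgPool2d(1),
                                   nn.Flatten(), nn.Linear(8, 4))
    for p in frozen_encoder.parameters():
        p.requires_grad_(False)

    vq_ttt = LowRankKernel()
    img = torch.rand(2, 3, 64, 64)
    logits = frozen_encoder(vq_ttt(img))   # only the low-rank kernel gets gradients
    logits.sum().backward()
    print(vq_ttt.v.grad.shape)
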
As artificial intelligence (AI) further embeds itself into many settings across personal and professional contexts, increasing attention must be paid not only to AI ethics, but also to the governance and regulation of AI technologies through AI policy. However, the prevailing post-secondary computing curriculum is currently ill-equipped to prepare future AI practitioners to confront increasing demands to implement abstract ethical principles and normative policy preferences into the design and development of AI systems. We believe that familiarity with the 'AI policy landscape' and the ability to translate ethical principles to practices will in the future constitute an important responsibility for even the most technically-focused AI engineers. Toward preparing current computer science (CS) students for these new expectations, we developed an AI Policy Module to introduce discussions of AI policy into the CS curriculum. Building on a successful pilot in fall 2024, in this innovative practice full paper we present an updated and expanded version of the module, including a technical assignment on "AI regulation". We present the findings from our pilot of the AI Policy Module 2.0, evaluating student attitudes towards AI ethics and policy through pre- and post-module surveys. Following the module, students reported increased concern about the ethical impacts of AI technologies while also expressing greater confidence in their abilities to engage in discussions about AI regulation. Finally, we highlight the AI Regulation Assignment as an effective and engaging tool for exploring the limits of AI alignment and emphasizing the role of 'policy' in addressing ethical challenges.
https://arxiv.org/abs/2506.15639