Large language models (LLMs) can often accurately describe probability distributions using natural language, yet they still struggle to generate faithful samples from them. This mismatch limits their use in tasks requiring reliable stochasticity, such as Monte Carlo methods, agent-based simulations, and randomized decision-making. We investigate this gap between knowledge and sampling in the context of Bernoulli distributions. We introduce Verbalized Rejection Sampling (VRS), a natural-language adaptation of classical rejection sampling that prompts the LLM to reason about and accept or reject proposed samples. Despite relying on the same Bernoulli mechanism internally, VRS substantially reduces sampling bias across models. We provide theoretical analysis showing that, under mild assumptions, VRS improves over direct sampling, with gains attributable to both the algorithm and prompt design. More broadly, our results show how classical probabilistic tools can be verbalized and embedded into LLM workflows to improve reliability, without requiring access to model internals or heavy prompt engineering.
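For readers unfamiliar with the classical algorithm that VRS verbalizes, the sketch below shows ordinary numerical rejection sampling for a Bernoulli target; in VRS the accept/reject test is instead posed to the LLM as a natural-language prompt, so this is only a reference for the mechanism, not the paper's implementation.

```python
import random

def rejection_sample_bernoulli(p_target: float, p_proposal: float = 0.5) -> int:
    """Draw one sample from Bernoulli(p_target) using a Bernoulli(p_proposal)
    proposal and the standard accept/reject test (reference sketch only)."""
    # Envelope constant M >= target(x) / proposal(x) for x in {0, 1}.
    M = max(p_target / p_proposal, (1.0 - p_target) / (1.0 - p_proposal))
    while True:
        x = 1 if random.random() < p_proposal else 0  # propose a candidate
        target = p_target if x == 1 else 1.0 - p_target
        proposal = p_proposal if x == 1 else 1.0 - p_proposal
        if random.random() < target / (M * proposal):  # accept with prob target/(M*proposal)
            return x

# Sanity check: the empirical mean should approach p_target.
samples = [rejection_sample_bernoulli(0.3) for _ in range(10_000)]
print(sum(samples) / len(samples))  # ~0.3
```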
https://arxiv.org/abs/2506.09998
Though safety alignment has been applied to most large language models (LLMs), LLM service providers generally deploy subsequent moderation as an external safety guardrail in real-world products. Existing moderators mainly perform conventional full detection, which determines harmfulness from the complete LLM output and thus incurs high service latency. Recent works pay more attention to partial detection, where moderators oversee the generation midway and stop the output early if harmfulness is detected; however, they directly apply moderators trained under the full-detection paradigm to incomplete outputs, introducing a training-inference gap that lowers performance. In this paper, we explore how to form a data-and-model solution that natively supports partial detection. For the data, we construct FineHarm, a dataset of 29K prompt-response pairs with fine-grained annotations that provide reasonable supervision for token-level training. We then propose the streaming content monitor (SCM), which is trained with dual supervision from response- and token-level labels and follows the LLM's output stream to make a timely judgment of harmfulness. Experiments show that SCM achieves a macro F1 score of 0.95+, comparable to full detection, while seeing only the first 18% of response tokens on average. Moreover, SCM can serve as a pseudo-harmfulness annotator for improving safety alignment, leading to a higher harmlessness score than DPO.
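To make the partial-detection setting concrete, here is a minimal, hypothetical streaming-moderation loop: a scorer evaluates the growing response prefix after each token and cuts the stream once harmfulness is detected. The scorer interface and threshold are assumptions for illustration, not the SCM API.

```python
from typing import Callable, Iterable, Tuple

def stream_with_monitor(
    token_stream: Iterable[str],
    score_prefix: Callable[[str], float],  # hypothetical prefix-level harmfulness scorer
    threshold: float = 0.5,
) -> Tuple[str, bool]:
    """Deliver tokens while monitoring the growing prefix; stop early when the
    harmfulness score crosses the threshold (illustrative only)."""
    prefix = ""
    for token in token_stream:
        prefix += token
        if score_prefix(prefix) >= threshold:
            return prefix, True   # harmful: generation is cut mid-stream
    return prefix, False          # benign: the full response was delivered

# Toy usage with a keyword-based stand-in scorer.
tokens = ["How ", "to ", "bake ", "bread"]
print(stream_with_monitor(tokens, lambda s: 1.0 if "attack" in s else 0.0))
```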
https://arxiv.org/abs/2506.09996
We introduce the Deformable Gaussian Splats Large Reconstruction Model (DGS-LRM), the first feed-forward method predicting deformable 3D Gaussian splats from a monocular posed video of any dynamic scene. Feed-forward scene reconstruction has gained significant attention for its ability to rapidly create digital replicas of real-world environments. However, most existing models are limited to static scenes and fail to reconstruct the motion of moving objects. Developing a feed-forward model for dynamic scene reconstruction poses significant challenges, including the scarcity of training data and the need for appropriate 3D representations and training paradigms. To address these challenges, we introduce several key technical contributions: an enhanced large-scale synthetic dataset with ground-truth multi-view videos and dense 3D scene flow supervision; a per-pixel deformable 3D Gaussian representation that is easy to learn, supports high-quality dynamic view synthesis, and enables long-range 3D tracking; and a large transformer network that achieves real-time, generalizable dynamic scene reconstruction. Extensive qualitative and quantitative experiments demonstrate that DGS-LRM achieves dynamic scene reconstruction quality comparable to optimization-based methods, while significantly outperforming the state-of-the-art predictive dynamic reconstruction method on real-world examples. Its predicted physically grounded 3D deformation is accurate and can readily adapt for long-range 3D tracking tasks, achieving performance on par with state-of-the-art monocular video 3D tracking methods.
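As a rough illustration of the per-pixel deformable 3D Gaussian representation described above, the following data structure pairs each pixel-aligned Gaussian with a per-frame 3D displacement (scene flow); field names and shapes are assumptions for exposition, not the paper's definition.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PerPixelDeformableGaussian:
    """One pixel-aligned 3D Gaussian plus a per-frame displacement (scene flow).
    Field names and shapes are illustrative assumptions."""
    mean: np.ndarray      # (3,)  center in world space
    rotation: np.ndarray  # (4,)  quaternion
    scale: np.ndarray     # (3,)  per-axis extent
    opacity: float
    color: np.ndarray     # (3,)  RGB
    flow: np.ndarray      # (T, 3) per-frame displacement of the center

    def center_at(self, t: int) -> np.ndarray:
        # Deform the splat by translating its center with the predicted flow.
        return self.mean + self.flow[t]
```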
https://arxiv.org/abs/2506.09997
We introduce PlayerOne, the first egocentric realistic world simulator, which enables immersive and unrestricted exploration of vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that are strictly aligned with the user's real-scene human motion captured by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that first pretrains on large-scale egocentric text-video pairs for coarse-level egocentric understanding, and then finetunes on synchronous motion-video data extracted from egocentric-exocentric video datasets via our automatic construction pipeline. Moreover, considering the varying importance of different components, we design a part-disentangled motion injection scheme that enables precise control of part-level movements. In addition, we devise a joint reconstruction framework that progressively models both the 4D scene and the video frames, ensuring scene consistency in long-form video generation. Experimental results demonstrate strong generalization in the precise control of varying human movements and world-consistent modeling of diverse scenarios. This work marks the first endeavor into egocentric real-world simulation and can pave the way for the community to explore fresh frontiers of world modeling and its diverse applications.
https://arxiv.org/abs/2506.09995
If human experience is any guide, operating effectively in unstructured environments -- like homes and offices -- requires robots to sense the forces during physical interaction. Yet, the lack of a versatile, accessible, and easily customizable tactile sensor has led to fragmented, sensor-specific solutions in robotic manipulation -- and in many cases, to force-unaware, sensorless approaches. With eFlesh, we bridge this gap by introducing a magnetic tactile sensor that is low-cost, easy to fabricate, and highly customizable. Building an eFlesh sensor requires only four components: a hobbyist 3D printer, off-the-shelf magnets (<$5), a CAD model of the desired shape, and a magnetometer circuit board. The sensor is constructed from tiled, parameterized microstructures, which allow for tuning the sensor's geometry and its mechanical response. We provide an open-source design tool that converts convex OBJ/STL files into 3D-printable STLs for fabrication. This modular design framework enables users to create application-specific sensors, and to adjust sensitivity depending on the task. Our sensor characterization experiments demonstrate the capabilities of eFlesh: contact localization RMSE of 0.5 mm, and force prediction RMSE of 0.27 N for normal force and 0.12 N for shear force. We also present a learned slip detection model that generalizes to unseen objects with 95% accuracy, and visuotactile control policies that improve manipulation performance by 40% over vision-only baselines -- achieving 91% average success rate for four precise tasks that require sub-mm accuracy for successful completion. All design files, code and the CAD-to-eFlesh STL conversion tool are open-sourced and available on this https URL.
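The force-prediction numbers above come from a learned mapping from magnetometer readings to contact forces; as a deliberately simplified stand-in, the sketch below fits a linear calibration by least squares on synthetic data and reports per-axis RMSE in the same spirit. The feature layout and the linear model are assumptions, not the eFlesh pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: flattened magnetometer readings (features) and ground-truth
# normal/shear forces from a reference gauge (labels). Purely synthetic.
B = rng.normal(size=(500, 9))                        # magnetometer features
W_true = rng.normal(size=(9, 3))
F = B @ W_true + 0.01 * rng.normal(size=(500, 3))    # (Fn, Fx, Fy) labels

W, *_ = np.linalg.lstsq(B, F, rcond=None)            # linear calibration fit
rmse = np.sqrt(((B @ W - F) ** 2).mean(axis=0))      # per-axis force RMSE
print(rmse)
```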
https://arxiv.org/abs/2506.09994
Image restoration aims to recover degraded images. However, existing diffusion-based restoration methods, despite great success in natural image restoration, often struggle to faithfully reconstruct textual regions in degraded images. Those methods frequently generate plausible but incorrect text-like patterns, a phenomenon we refer to as text-image hallucination. In this paper, we introduce Text-Aware Image Restoration (TAIR), a novel restoration task that requires the simultaneous recovery of visual contents and textual fidelity. To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. Furthermore, we propose a multi-task diffusion framework, called TeReDiff, that integrates internal features from diffusion models into a text-spotting module, enabling both components to benefit from joint training. This allows for the extraction of rich text representations, which are utilized as prompts in subsequent denoising steps. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art restoration methods, achieving significant gains in text recognition accuracy. See our project page: this https URL
https://arxiv.org/abs/2506.09993
Online toxic language causes real harm, especially in regions with limited moderation tools. In this study, we evaluate how large language models handle toxic comments in Serbian, Croatian, and Bosnian, languages with limited labeled data. We built and manually labeled a dataset of 4,500 YouTube and TikTok comments drawn from videos across diverse categories, including music, politics, sports, modeling, influencer content, discussions of sexism, and general topics. Four models (GPT-3.5 Turbo, GPT-4.1, Gemini 1.5 Pro, and Claude 3 Opus) were tested in two modes: zero-shot and context-augmented. We measured precision, recall, F1 score, accuracy and false positive rates. Including a short context snippet raised recall by about 0.12 on average and improved F1 score by up to 0.10, though it sometimes increased false positives. The best balance came from Gemini in context-augmented mode, reaching an F1 score of 0.82 and accuracy of 0.82, while zero-shot GPT-4.1 led on precision and had the lowest false alarms. We show how adding minimal context can improve toxic language detection in low-resource settings and suggest practical strategies such as improved prompt design and threshold calibration. These results show that prompt design alone can yield meaningful gains in toxicity detection for underserved Balkan language communities.
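The reported scores follow the standard binary-classification definitions; for reference, here is how precision, recall, F1, accuracy, and false positive rate can be computed from labeled predictions (variable names are illustrative).

```python
def binary_metrics(y_true, y_pred):
    """Standard binary classification metrics for toxic (1) vs. non-toxic (0)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return dict(precision=precision, recall=recall, f1=f1, accuracy=accuracy, fpr=fpr)

print(binary_metrics([1, 0, 1, 0], [1, 1, 1, 0]))
```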
https://arxiv.org/abs/2506.09992
We present Chain-of-Action (CoA), a novel visuo-motor policy paradigm built upon Trajectory Autoregressive Modeling. Unlike conventional approaches that predict the next action(s) forward step by step, CoA generates an entire trajectory through explicit backward reasoning from task-specific goals via an action-level Chain-of-Thought (CoT) process. This process is unified within a single autoregressive structure: (1) the first token corresponds to a stable keyframe action that encodes the task-specific goals; and (2) subsequent action tokens are generated autoregressively, conditioned on the initial keyframe and previously predicted actions. This backward action reasoning enforces a global-to-local structure, allowing each local action to be tightly constrained by the final goal. To further realize this action reasoning structure, CoA incorporates four complementary designs: continuous action token representation; dynamic stopping for variable-length trajectory generation; a reverse temporal ensemble; and multi-token prediction to balance action-chunk modeling with global structure. As a result, CoA offers strong spatial generalization capabilities while preserving the flexibility and simplicity of a visuo-motor policy. Empirically, CoA achieves state-of-the-art performance across 60 RLBench tasks and 8 real-world manipulation tasks.
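A schematic of the decoding order described above: the first predicted token is a goal-encoding keyframe action, after which actions are generated autoregressively, conditioned on that keyframe and the prefix, until a dynamic stopping criterion fires. The `policy.*` methods are hypothetical interfaces used only to show the control flow.

```python
def chain_of_action_rollout(policy, observation, max_steps: int = 64):
    """Backward-to-forward trajectory decoding (control-flow sketch only)."""
    keyframe = policy.predict_keyframe(observation)       # global, task-specific goal
    actions = [keyframe]
    for _ in range(max_steps):
        nxt = policy.predict_next(observation, actions)   # conditioned on keyframe + prefix
        if policy.is_stop(nxt):                           # dynamic stopping
            break
        actions.append(nxt)
    return actions
```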
https://arxiv.org/abs/2506.09990
We study the problem of making 3D scene reconstructions interactive by asking the following question: can we predict the sounds of human hands physically interacting with a scene? First, we record a video of a human manipulating objects within a 3D scene using their hands. We then use these action-sound pairs to train a rectified flow model to map 3D hand trajectories to their corresponding audio. At test time, a user can query the model for other actions, parameterized as sequences of hand poses, to estimate their corresponding sounds. In our experiments, we find that our generated sounds accurately convey material properties and actions, and that they are often indistinguishable to human observers from real sounds. Project page: this https URL
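For context, a rectified flow model of the kind used here is typically trained with a linear-interpolation velocity-matching objective; the sketch below shows one such training step, where x0 is noise, x1 is the target audio representation, and `cond` encodes the 3D hand trajectory. Shapes and the model signature are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def rectified_flow_step(model, x0, x1, cond):
    """One velocity-matching training step: x_t = (1 - t) x0 + t x1, and the
    model regresses onto the constant velocity x1 - x0 (generic sketch)."""
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)
    x_t = (1 - t) * x0 + t * x1
    v_pred = model(x_t, t.flatten(), cond)   # assumed signature
    return F.mse_loss(v_pred, x1 - x0)
```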
https://arxiv.org/abs/2506.09989
Text-guided image editing, fueled by recent advancements in generative AI, is becoming increasingly widespread. This trend highlights the need for a comprehensive framework to verify text-guided edits and assess their quality. To address this need, we introduce EditInspector, a novel benchmark for evaluation of text-guided image edits, based on human annotations collected using an extensive template for edit verification. We leverage EditInspector to evaluate the performance of state-of-the-art (SoTA) vision and language models in assessing edits across various dimensions, including accuracy, artifact detection, visual quality, seamless integration with the image scene, adherence to common sense, and the ability to describe edit-induced changes. Our findings indicate that current models struggle to evaluate edits comprehensively and frequently hallucinate when describing the changes. To address these challenges, we propose two novel methods that outperform SoTA models in both artifact detection and difference caption generation.
https://arxiv.org/abs/2506.09988
Existing benchmarks for assessing the spatio-temporal understanding and reasoning abilities of video language models are susceptible to score inflation due to the presence of shortcut solutions based on superficial visual or textual cues. This paper mitigates the challenges in accurately assessing model performance by introducing the Minimal Video Pairs (MVP) benchmark, a simple shortcut-aware video QA benchmark for assessing the physical understanding of video language models. The benchmark comprises 55K high-quality multiple-choice video QA examples focused on physical-world understanding. Examples are curated from nine video data sources, spanning first-person egocentric and exocentric videos, robotic interaction data, and cognitive-science intuitive-physics benchmarks. To mitigate shortcut solutions that rely on superficial visual or textual cues and biases, each sample in MVP has a minimal-change pair -- a visually similar video accompanied by an identical question but an opposing answer. To answer a question correctly, a model must provide correct answers for both examples in the minimal-change pair; as such, models that rely solely on visual or textual biases achieve below-random performance. Human performance on MVP is 92.9%, while the best open-source state-of-the-art video-language model achieves 40.2%, compared to random performance of 25%.
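The pairwise scoring rule has a simple implementation: a pair counts only if both of its videos are answered correctly, so a model that gives the same answer to both members of a pair (for example, by keying on the question text alone) scores zero on that pair. A minimal sketch, with illustrative data:

```python
def minimal_pair_accuracy(predictions, answers):
    """Score MVP-style items: credit a pair only when BOTH examples are correct."""
    correct = sum(
        int(pred_a == ans_a and pred_b == ans_b)
        for (pred_a, pred_b), (ans_a, ans_b) in zip(predictions, answers)
    )
    return correct / len(answers)

# The two videos in a pair share a question but have opposing answers, so a
# text-only shortcut that answers both identically can never earn the pair.
print(minimal_pair_accuracy([("A", "A"), ("A", "B")], [("A", "B"), ("A", "B")]))  # 0.5
```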
https://arxiv.org/abs/2506.09987
End-to-end human animation with rich multi-modal conditions, e.g., text, image, and audio, has achieved remarkable advancements in recent years. However, most existing methods can only animate a single subject and inject conditions in a global manner, ignoring scenarios in which multiple concepts appear in the same video with rich human-human and human-object interactions. Such a global assumption prevents precise, per-identity control of multiple concepts, including humans and objects, and therefore hinders applications. In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from each modality to each identity's spatiotemporal footprint. Given reference images of multiple concepts, our method can automatically infer layout information by leveraging a mask predictor to match appearance cues between the denoised video and each reference appearance. Furthermore, we inject the local audio condition into its corresponding region to ensure layout-aligned modality matching in an iterative manner. This design enables high-quality generation of controllable, multi-concept, human-centric videos. Empirical results and ablation studies validate the effectiveness of our explicit layout control for multi-modal conditions compared to implicit counterparts and other existing methods.
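One way to picture the region-specific binding described above is as masked condition injection: each identity's audio feature is added only inside that identity's predicted mask rather than broadcast over the whole frame. The operator below is an illustrative mechanism with assumed tensor shapes, not the paper's exact module.

```python
import torch

def inject_local_audio(video_feats, audio_feats, masks):
    """Add each identity's audio condition only within its own region.
    Shapes (assumed): video_feats (B, C, H, W); audio_feats (B, K, C);
    masks (B, K, H, W) with one soft mask per identity."""
    out = video_feats
    for k in range(audio_feats.shape[1]):
        cond = audio_feats[:, k][..., None, None]   # (B, C, 1, 1)
        out = out + masks[:, k:k + 1] * cond        # broadcast inside region k only
    return out
```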
https://arxiv.org/abs/2506.09984
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
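To illustrate how an action-conditioned latent world model can be used for planning with image goals, here is a simple random-shooting sketch: encode the current and goal images, roll out sampled action sequences in latent space, and execute the first action of the lowest-cost sequence. The encoder and `world_model.step` interfaces, the action dimension, and the distance-based cost are assumptions, not the released V-JEPA 2-AC API.

```python
import torch

@torch.no_grad()
def plan_with_image_goal(encoder, world_model, obs_img, goal_img,
                         horizon: int = 5, n_samples: int = 256, action_dim: int = 7):
    """Sampling-based planning in latent space toward an image goal (sketch)."""
    z0 = encoder(obs_img)          # current latent state
    z_goal = encoder(goal_img)     # desired latent state
    candidates = torch.randn(n_samples, horizon, action_dim)
    best_cost, best_action = float("inf"), None
    for seq in candidates:
        z = z0
        for a in seq:
            z = world_model.step(z, a)             # assumed one-step latent predictor
        cost = torch.norm(z - z_goal).item()       # distance to the goal embedding
        if cost < best_cost:
            best_cost, best_action = cost, seq[0]
    return best_action                              # execute, then replan (MPC-style)
```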
https://arxiv.org/abs/2506.09985
Recent advances in large language models (LLMs) have enabled impressive performance in various tasks. However, standard prompting often struggles to produce structurally valid and accurate outputs, especially in dependency parsing. We propose a novel step-by-step instruction strategy, in which universal part-of-speech tagging precedes the prediction of syntactic heads and dependency labels, together with a simplified CoNLL-U-like output format. With this approach, our method achieves state-of-the-art accuracy on Universal Dependencies datasets across 17 languages, without hallucination or contamination. We further show that multilingual fine-tuning simultaneously improves cross-language generalization performance. Our results highlight the effectiveness of explicit reasoning steps in LLM-based parsing and offer a scalable, format-consistent alternative to bracket-based approaches.
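To show the kind of format-consistent output this strategy targets, here is a toy example of a simplified, tab-separated CoNLL-U-like layout (ID, FORM, UPOS, HEAD, DEPREL) and a parser for it; the exact column subset used in the paper may differ, so treat the schema as an assumption.

```python
# A toy LLM response in a simplified, tab-separated CoNLL-U-like layout.
llm_output = """\
1\tThe\tDET\t2\tdet
2\tcat\tNOUN\t3\tnsubj
3\tsleeps\tVERB\t0\troot"""

def parse_simplified_conllu(text: str):
    """Parse rows of ID, FORM, UPOS, HEAD, DEPREL into dictionaries."""
    rows = []
    for line in text.splitlines():
        idx, form, upos, head, deprel = line.split("\t")
        rows.append({"id": int(idx), "form": form, "upos": upos,
                     "head": int(head), "deprel": deprel})
    return rows

for token in parse_simplified_conllu(llm_output):
    print(token)
```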
https://arxiv.org/abs/2506.09983
Recent advances in 4D content generation have attracted increasing attention, yet creating high-quality animated 3D models remains challenging due to the complexity of modeling spatio-temporal distributions and the scarcity of 4D training data. In this paper, we present AnimateAnyMesh, the first feed-forward framework that enables efficient text-driven animation of arbitrary 3D meshes. Our approach leverages a novel DyMeshVAE architecture that effectively compresses and reconstructs dynamic mesh sequences by disentangling spatial and temporal features while preserving local topological structures. To enable high-quality text-conditional generation, we employ a Rectified Flow-based training strategy in the compressed latent space. Additionally, we contribute the DyMesh Dataset, containing over 4M diverse dynamic mesh sequences with text annotations. Experimental results demonstrate that our method generates semantically accurate and temporally coherent mesh animations in a few seconds, significantly outperforming existing approaches in both quality and efficiency. Our work marks a substantial step forward in making 4D content creation more accessible and practical. All the data, code, and models will be open-released.
https://arxiv.org/abs/2506.09982
How can we reliably simulate future driving scenarios under a wide range of ego driving behaviors? Recent driving world models, developed exclusively on real-world driving data composed mainly of safe expert trajectories, struggle to follow hazardous or non-expert behaviors, which are rare in such data. This limitation restricts their applicability to tasks such as policy evaluation. In this work, we address this challenge by enriching real-world human demonstrations with diverse non-expert data collected from a driving simulator (e.g., CARLA), and building a controllable world model trained on this heterogeneous corpus. Starting with a video generator featuring a diffusion transformer architecture, we devise several strategies to effectively integrate conditioning signals and improve prediction controllability and fidelity. The resulting model, ReSim, enables Reliable Simulation of diverse open-world driving scenarios under various actions, including hazardous non-expert ones. To close the gap between high-fidelity simulation and applications that require reward signals to judge different actions, we introduce a Video2Reward module that estimates a reward from ReSim's simulated future. Our ReSim paradigm achieves up to 44% higher visual fidelity, improves controllability for both expert and non-expert actions by over 50%, and boosts planning and policy selection performance on NAVSIM by 2% and 25%, respectively.
https://arxiv.org/abs/2506.09981
Recent progress in 3D object generation has greatly improved both the quality and efficiency. However, most existing methods generate a single mesh with all parts fused together, which limits the ability to edit or manipulate individual parts. A key challenge is that different objects may have a varying number of parts. To address this, we propose a new end-to-end framework for part-level 3D object generation. Given a single input image, our method generates high-quality 3D objects with an arbitrary number of complete and semantically meaningful parts. We introduce a dual volume packing strategy that organizes all parts into two complementary volumes, allowing for the creation of complete and interleaved parts that assemble into the final object. Experiments show that our model achieves better quality, diversity, and generalization than previous image-based part-level generation methods.
https://arxiv.org/abs/2506.09980
Computing stabilizing and optimal control actions for legged locomotion in real time is difficult due to the nonlinear, hybrid, and high-dimensional nature of these robots. The hybrid nature of the system introduces a combination of discrete and continuous variables, which causes issues for numerical optimal control. To address these challenges, we propose a layered architecture that separates the choice of discrete variables from a smooth Model Predictive Controller (MPC). The layered formulation allows for online flexibility and optimality without sacrificing real-time performance, through a combination of gradient-free and gradient-based methods. The architecture leverages a sampling-based method to determine the discrete variables and a classical smooth MPC formulation that uses these fixed discrete variables. We demonstrate the results on a quadrupedal robot stepping over gaps and onto terrain with varying heights, and, in simulation, we show the controller on a humanoid robot traversing gaps. The layered approach is shown to be more optimal and reliable than common heuristic-based approaches and faster to compute than pure sampling methods.
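The layered idea can be summarized in a few lines: a gradient-free outer loop samples candidate discrete variables (e.g., gait or contact sequences), an inner smooth MPC is solved with each candidate fixed, and the lowest-cost pair is executed. `solve_smooth_mpc` is an assumed interface standing in for the paper's gradient-based solver.

```python
import random

def layered_controller(state, candidate_discrete_vars, solve_smooth_mpc, n_samples: int = 16):
    """Outer sampling over discrete variables + inner smooth MPC (sketch).
    solve_smooth_mpc(state, discrete) -> (cost, first_control) is assumed."""
    sampled = random.sample(candidate_discrete_vars,
                            min(n_samples, len(candidate_discrete_vars)))
    best_cost, best_choice, best_u = float("inf"), None, None
    for discrete in sampled:                          # gradient-free layer
        cost, u = solve_smooth_mpc(state, discrete)   # gradient-based layer
        if cost < best_cost:
            best_cost, best_choice, best_u = cost, discrete, u
    return best_choice, best_u
```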
https://arxiv.org/abs/2506.09979
Understanding how humans revise their beliefs in light of new information is crucial for developing AI systems which can effectively model, and thus align with, human reasoning. While theoretical belief revision frameworks rely on a set of principles that establish how these operations are performed, empirical evidence from cognitive psychology suggests that people may follow different patterns when presented with conflicting information. In this paper, we present three comprehensive user studies showing that people consistently prefer explanation-based revisions, i.e., those which are guided by explanations, that result in changes to their belief systems that are not necessarily captured by classical belief change theory. Our experiments systematically investigate how people revise their beliefs with explanations for inconsistencies, whether they are provided with them or left to formulate them themselves, demonstrating a robust preference for what may seem non-minimal revisions across different types of scenarios. These findings have implications for AI systems designed to model human reasoning or interact with humans, suggesting that such systems should accommodate explanation-based, potentially non-minimal belief revision operators to better align with human cognitive processes.
https://arxiv.org/abs/2506.09977
Detecting AI-generated text is a difficult problem to begin with; detecting AI-generated text on social media is made even more difficult due to the short text length and informal, idiosyncratic language of the internet. It is nonetheless important to tackle this problem, as social media represents a significant attack vector in online influence campaigns, which may be bolstered through the use of mass-produced AI-generated posts supporting (or opposing) particular policies, decisions, or events. We approach this problem with the mindset and resources of a reasonably sophisticated threat actor, and create a dataset of 505,159 AI-generated social media posts from a combination of open-source, closed-source, and fine-tuned LLMs, covering 11 different controversial topics. We show that while the posts can be detected under typical research assumptions about knowledge of and access to the generating models, under the more realistic assumption that an attacker will not release their fine-tuned model to the public, detectability drops dramatically. This result is confirmed with a human study. Ablation experiments highlight the vulnerability of various detection algorithms to fine-tuned LLMs. This result has implications across all detection domains, since fine-tuning is a generally applicable and realistic LLM use case.
https://arxiv.org/abs/2506.09975