Personalization is becoming indispensable for LLMs to align with individual user preferences and needs. Yet current approaches are often computationally expensive, data-intensive, susceptible to catastrophic forgetting, and prone to performance degradation in multi-turn interactions or when handling implicit queries. To address these challenges, we conceptualize personalization as a model editing task and introduce Personalization Editing, a framework that applies localized edits guided by clustered preference representations. This design enables precise preference-aligned updates while preserving overall model capabilities. In addition, existing personalization benchmarks frequently rely on persona-based dialogs between LLMs rather than user-LLM interactions, or focus primarily on stylistic imitation while neglecting information-seeking tasks that require accurate recall of user-specific preferences. We introduce User Preference Question Answering (UPQA), a short-answer QA dataset constructed from in-situ user queries with varying levels of difficulty. Unlike prior benchmarks, UPQA directly evaluates a model's ability to recall and apply specific user preferences. Across experimental settings, Personalization Editing achieves higher editing accuracy and greater computational efficiency than fine-tuning, while outperforming prompting-based baselines in multi-turn conversations and on implicit preference questions.
https://arxiv.org/abs/2512.13676
Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% in accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.
https://arxiv.org/abs/2512.13660
Native 4K (2160$\times$3840) video generation remains a critical challenge due to the quadratic computational explosion of full-attention as spatiotemporal resolution increases, making it difficult for models to strike a balance between efficiency and quality. This paper proposes a novel Transformer retrofit strategy termed $\textbf{T3}$ ($\textbf{T}$ransform $\textbf{T}$rained $\textbf{T}$ransformer) that, without altering the core architecture of full-attention pretrained models, significantly reduces compute requirements by optimizing their forward logic. Specifically, $\textbf{T3-Video}$ introduces a multi-scale weight-sharing window attention mechanism and, via hierarchical blocking together with an axis-preserving full-attention design, can effect an "attention pattern" transformation of a pretrained model using only modest compute and data. Results on 4K-VBench show that $\textbf{T3-Video}$ substantially outperforms existing approaches: while delivering performance improvements (+4.29$\uparrow$ VQA and +0.08$\uparrow$ VTC), it accelerates native 4K video generation by more than 10$\times$. Project page at this https URL
https://arxiv.org/abs/2512.13492
Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, limiting LLM agents' adaptability to new or evolving toolsets. We present AutoTool, a framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. We first construct a 200k dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Building on this data foundation, AutoTool employs a dual-phase optimization pipeline: (i) supervised and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett-Luce ranking to refine consistent multi-step tool selection. Across ten diverse benchmarks, we train two base models, Qwen3-8B and Qwen2.5-VL-7B, with AutoTool. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool-integration methods, yielding average gains of 6.4% in math & science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.
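The KL-regularized Plackett-Luce ranking the abstract mentions builds on the standard Plackett-Luce likelihood, in which each next item is chosen with its softmax share among the items not yet picked. A minimal sketch of that likelihood (the tool scores and this exact scoring form are illustrative assumptions, not AutoTool's implementation):

```python
import math

def plackett_luce_logprob(scores, ranking):
    """Log-probability of a ranking under the Plackett-Luce model.

    At each step, the probability of selecting the next item is its
    softmax share among the items that remain unchosen.
    """
    logp = 0.0
    remaining = list(range(len(scores)))
    for item in ranking:
        z = sum(math.exp(scores[j]) for j in remaining)  # partition over remaining items
        logp += scores[item] - math.log(z)               # log softmax share of this pick
        remaining.remove(item)
    return logp
```

Maximizing this log-likelihood (plus a KL penalty to a reference policy, per the abstract) would push the model to score preferred tool sequences above alternatives at every step.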
https://arxiv.org/abs/2512.13278
Vision Language Models (VLMs) excel at visual question answering (VQA) but remain limited to snapshot vision, reasoning from static images. In contrast, embodied agents require ambulatory vision, actively moving to obtain more informative views. We introduce Visually Grounded Active View Selection (VG-AVS), a task that selects the most informative next viewpoint using only the visual information in the current image, without relying on scene memory or external knowledge. To support this task, we construct a synthetic dataset with automatically generated paired query-target views and question-answer prompts. We also propose a framework that fine-tunes pretrained VLMs through supervised fine-tuning (SFT) followed by RL-based policy optimization. Our approach achieves strong question answering performance based on viewpoint selection and generalizes robustly to unseen synthetic and real scenes. Furthermore, incorporating our learned VG-AVS framework into existing scene-exploration-based EQA systems improves downstream question-answering accuracy.
https://arxiv.org/abs/2512.13250
We present Ego-EXTRA, a video-language Egocentric Dataset for EXpert-TRAinee assistance. Ego-EXTRA features 50 hours of unscripted egocentric videos of subjects performing procedural activities (the trainees) while guided by real-world experts who provide guidance and answer specific questions using natural language. Following a ``Wizard of Oz'' data collection paradigm, the expert enacts the role of a wearable intelligent assistant, looking at the activities performed by the trainee exclusively from their egocentric point of view, answering questions when asked by the trainee, or proactively interacting with suggestions during the procedures. This unique data collection protocol enables Ego-EXTRA to capture a high-quality dialogue in which expert-level feedback is provided to the trainee. Two-way dialogues between experts and trainees are recorded, transcribed, and used to create a novel benchmark comprising more than 15k high-quality Visual Question Answer sets, which we use to evaluate Multimodal Large Language Models. The results show that Ego-EXTRA is challenging and highlight the limitations of current models when used to provide expert-level assistance to the user. The Ego-EXTRA dataset is publicly available to support the benchmark of egocentric video-language assistants: this https URL.
https://arxiv.org/abs/2512.13238
Speculative Decoding is a prominent technique for accelerating the autoregressive inference of large language models (LLMs) by employing a fast draft model to propose candidate token sequences and a large target model to verify them in parallel. However, its core component -- the rejection sampling mechanism -- relies on a fixed, context-independent random threshold. This leads to a significant "random rejection" problem in high-uncertainty generation scenarios, where plausible candidate tokens are frequently rejected due to random chance, undermining inference efficiency. This paper introduces Efficient Adaptive Rejection Sampling (EARS), a novel method that dynamically adjusts the acceptance threshold by incorporating the target model's own predictive uncertainty, measured as \(1 - \max(P_{\mathrm{target}})\). By introducing a tolerance term proportional to this uncertainty, EARS intelligently relaxes the acceptance criterion when the model is uncertain, effectively reducing random rejections while maintaining strict standards when the model is confident. Experiments on creative writing and open-domain QA tasks demonstrate that EARS significantly enhances the efficiency of speculative decoding, achieving up to an 18.12% increase in throughput with a negligible 0.84% accuracy drop on the GSM8K benchmark. The method requires no modifications to model architectures and can be seamlessly integrated into existing speculative decoding frameworks.
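The acceptance rule described above can be sketched concretely. Standard speculative decoding accepts a drafted token with probability min(1, p_target/p_draft); EARS adds a tolerance proportional to the target model's uncertainty, \(1 - \max(P_{\mathrm{target}})\). The proportionality constant `alpha` below is an assumed knob, and the paper's exact functional form may differ:

```python
def ears_acceptance_prob(p_target, p_draft, token, alpha=0.5):
    """Acceptance probability for one drafted token under an EARS-style rule.

    p_target / p_draft: dicts mapping tokens to probabilities under the
    target and draft models. The standard criterion min(1, p_t / p_d) is
    relaxed by alpha * (1 - max(P_target)), so an uncertain target model
    rejects plausible drafts less often, while a confident one stays strict.
    """
    uncertainty = 1.0 - max(p_target.values())   # 1 - max(P_target)
    tolerance = alpha * uncertainty              # tolerance term (alpha assumed)
    ratio = p_target[token] / p_draft[token]     # standard rejection-sampling ratio
    return min(1.0, min(1.0, ratio) + tolerance)
```

When the target distribution is peaked (low uncertainty), the tolerance vanishes and the rule reduces to ordinary rejection sampling; when it is flat, borderline drafts are accepted more often, which is what cuts the "random rejections" in high-entropy generation.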
https://arxiv.org/abs/2512.13194
Vision-language models enable the understanding and reasoning of complex traffic scenarios through multi-source information fusion, establishing them as a core technology for autonomous driving. However, existing vision-language models are constrained by the image understanding paradigm in the 2D plane, which restricts their capability to perceive 3D spatial information and perform deep semantic fusion, resulting in suboptimal performance in complex autonomous driving environments. This study proposes MMDrive, a multimodal vision-language model framework that extends traditional image understanding to a generalized 3D scene understanding framework. MMDrive incorporates three complementary modalities, including occupancy maps, LiDAR point clouds, and textual scene descriptions. To this end, it introduces two novel components for adaptive cross-modal fusion and key information extraction. Specifically, the Text-oriented Multimodal Modulator dynamically weights the contributions of each modality based on the semantic cues in the question, guiding context-aware feature integration. The Cross-Modal Abstractor employs learnable abstract tokens to generate compact, cross-modal summaries that highlight key regions and essential semantics. Comprehensive evaluations on the DriveLM and NuScenes-QA benchmarks demonstrate that MMDrive achieves significant performance gains over existing vision-language models for autonomous driving, with a BLEU-4 score of 54.56 and METEOR of 41.78 on DriveLM, and an accuracy score of 62.7% on NuScenes-QA. MMDrive effectively breaks the traditional image-only understanding barrier, enabling robust multimodal reasoning in complex driving environments and providing a new foundation for interpretable autonomous driving scene understanding.
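The question-conditioned modality weighting can be sketched as follows. The dot-product scoring, the pooled per-modality feature vectors, and the softmax fusion are all illustrative assumptions, not MMDrive's actual modulator design:

```python
import numpy as np

def modulate(question_emb, modality_feats):
    """Weight each modality by its softmax similarity to the question.

    question_emb: pooled embedding of the question text.
    modality_feats: dict of modality name -> pooled feature vector
    (e.g. occupancy map, LiDAR, text description), all in a shared space.
    Returns (per-modality weights, weighted-sum fused feature).
    """
    q = np.asarray(question_emb, dtype=float)
    names = list(modality_feats)
    sims = np.array([q @ np.asarray(modality_feats[n], dtype=float) for n in names])
    w = np.exp(sims - sims.max())   # numerically stable softmax
    w /= w.sum()
    fused = sum(wi * np.asarray(modality_feats[n], dtype=float)
                for wi, n in zip(w, names))
    return dict(zip(names, w)), fused
```

A geometry-heavy question would thus up-weight occupancy and LiDAR features, while a semantic question shifts mass toward the textual scene description.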
https://arxiv.org/abs/2512.13177
The development of clinical-grade artificial intelligence in pathology is limited by the scarcity of diverse, high-quality annotated datasets. Generative models offer a potential solution but suffer from semantic instability and morphological hallucinations that compromise diagnostic reliability. To address this challenge, we introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS), the first generative foundation model for pathology-specific text-to-image synthesis. By leveraging a dual-stage training strategy on approximately 2.8 million image-caption pairs, CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy. This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations. Furthermore, CRAFTS-augmented datasets enhance the performance across various clinical tasks, including classification, cross-modal retrieval, self-supervised learning, and visual question answering. In addition, coupling CRAFTS with ControlNet enables precise control over tissue architecture from inputs such as nuclear segmentation masks and fluorescence images. By overcoming the critical barriers of data scarcity and privacy concerns, CRAFTS provides a limitless source of diverse, annotated histology data, effectively unlocking the creation of robust diagnostic tools for rare and complex cancer phenotypes.
https://arxiv.org/abs/2512.13164
Large language models (LLMs) have demonstrated strong performance on a variety of natural language processing (NLP) tasks. However, they often struggle with long-text sequences due to the ``lost in the middle'' phenomenon. This issue has been shown to arise from a U-shaped attention bias, where attention is disproportionately focused on the beginning and end of a text, leaving the middle section underrepresented. While previous studies have attributed this bias to position encoding, our research first identifies an additional factor: initial saliency. This means that, in the attention computation for each token, tokens with higher attention weights relative to the initial token tend to receive more attention in the prediction of the next token. We further find that utilizing this property by scaling the attention weight between the initial token and others improves the model's ability to process long contexts, achieving a maximum improvement of 3.6\% on the MDQA dataset. Moreover, combining this approach with existing methods to reduce position encoding bias further enhances performance, achieving a maximum improvement of 3.4\% on KV-Retrieval tasks.
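The scaling idea can be sketched on a single row of attention weights: shrink the weight on the initial token (the attention sink) and renormalize, so tokens that already out-weigh it gain relative mass. The factor `gamma` is an assumed knob; the paper's exact scaling rule may differ:

```python
import numpy as np

def rescale_initial_token(attn_weights, gamma=0.8):
    """Down-weight the initial token's attention and renormalize.

    attn_weights: attention distribution over the context, with the
    initial token at index 0 along the last axis. Returns a valid
    probability distribution with the sink's share reduced by gamma.
    """
    w = np.asarray(attn_weights, dtype=float).copy()
    w[..., 0] *= gamma                          # scale the initial-token weight
    return w / w.sum(axis=-1, keepdims=True)    # renormalize to sum to 1
```

Because the operation acts on attention weights rather than on parameters, it is compatible with existing position-encoding debiasing methods, which is how the combined 3.4\% KV-Retrieval gain above is obtained.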
https://arxiv.org/abs/2512.13109
Reinforcement learning with verifiable rewards (RLVR) has proven effective in training large reasoning models (LRMs) by leveraging answer-verifiable signals to guide policy optimization, which, however, suffers from high annotation costs. To alleviate this problem, recent work has explored unsupervised RLVR methods that derive rewards solely from the model's internal consistency, such as through entropy and majority voting. While seemingly promising, these methods often suffer from model collapse in the later stages of training, which may arise from the reinforcement of incorrect reasoning patterns in the absence of external supervision. In this work, we investigate a novel semi-supervised RLVR paradigm that utilizes a small labeled set to guide RLVR training on unlabeled samples. Our key insight is that supervised rewards are essential for stabilizing consistency-based training on unlabeled samples, ensuring that only reasoning patterns verified on labeled instances are incorporated into RL training. Technically, we propose an effective policy optimization algorithm, TraPO, that identifies reliable unlabeled samples by matching their learning trajectory similarity to labeled ones. Building on this, TraPO achieves remarkable data efficiency and strong generalization on six widely used mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). With only 1K labeled and 3K unlabeled samples, TraPO reaches 42.6% average accuracy, surpassing the best unsupervised method trained on 45K unlabeled samples (38.3%). Notably, when using 4K labeled and 12K unlabeled samples, TraPO even outperforms the fully supervised model trained on the full 45K labeled samples on all benchmarks, while using only 10% of the labeled data. The code is available via this https URL.
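The trajectory-matching step can be sketched as a cosine-similarity filter: an unlabeled sample enters RL training only if its learning trajectory resembles that of some labeled sample. Representing a trajectory as a plain feature vector and the cutoff `tau` are both illustrative assumptions, not TraPO's exact criterion:

```python
import numpy as np

def select_reliable(unlabeled_trajs, labeled_trajs, tau=0.9):
    """Keep indices of unlabeled samples whose learning-trajectory vector
    is cosine-similar (>= tau) to at least one labeled trajectory."""
    def cos(a, b):
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return [i for i, u in enumerate(unlabeled_trajs)
            if any(cos(u, l) >= tau for l in labeled_trajs)]
```

Samples failing the match are excluded from the consistency-based reward, which is what prevents unverified reasoning patterns from being reinforced and collapsing the policy.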
https://arxiv.org/abs/2512.13106
The status quo for labeling text is third-party annotation, but there are many cases where information directly from the document's source would be preferable to a third-person proxy, especially for egocentric features like sentiment and belief. We introduce author labeling, an annotation technique where the writer of the document itself annotates the data at the moment of creation. We collaborate with a commercial chatbot with over 10,000 users to deploy an author labeling annotation system for subjective features related to product recommendation. This system identifies task-relevant queries, generates on-the-fly labeling questions, and records authors' answers in real time. We train and deploy an online-learning model architecture for product recommendation that continuously improves from author labeling and find it achieved a 534% increase in click-through rate compared to an industry advertising baseline running concurrently. We then compare the quality and practicality of author labeling to three traditional annotation approaches for sentiment analysis and find author labeling to be higher quality, faster to acquire, and cheaper. These findings reinforce existing literature that annotations, especially for egocentric and subjective beliefs, are significantly higher quality when labeled by the author rather than a third party. To facilitate broader scientific adoption, we release an author labeling service for the research community at this http URL.
https://arxiv.org/abs/2512.12976
Large language models (LLMs) excel on multiple-choice clinical diagnosis benchmarks, yet it is unclear how much of this performance reflects underlying probabilistic reasoning. We study this through questions from MedQA, where the task is to select the most likely diagnosis. We introduce the Frequency-Based Probabilistic Ranker (FBPR), a lightweight method that scores options with a smoothed Naive Bayes over concept-diagnosis co-occurrence statistics from a large corpus. When co-occurrence statistics were sourced from the pretraining corpora for OLMo and Llama, FBPR achieves comparable performance to the corresponding LLMs pretrained on that same corpus. Direct LLM inference and FBPR largely get different questions correct, with an overlap only slightly above random chance, indicating complementary strengths of each method. These findings highlight the continued value of explicit probabilistic baselines: they provide a meaningful performance reference point and a complementary signal for potential hybridization. While the performance of LLMs seems to be driven by a mechanism other than simple frequency aggregation, we show that an approach similar to the historically grounded, low-complexity expert systems still accounts for a substantial portion of benchmark performance.
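A smoothed Naive Bayes ranker of this kind is simple to sketch. The add-alpha smoothing, the unnormalized count-based prior, and the data layout below are assumptions for illustration; FBPR's exact smoothing and priors may differ:

```python
import math

def fbpr_rank(concepts, options, cooc, diag_count, vocab_size, alpha=1.0):
    """Rank candidate diagnoses by a smoothed Naive Bayes score.

    concepts:   clinical concepts extracted from the question.
    options:    candidate diagnoses (the multiple-choice answers).
    cooc:       dict (concept, diagnosis) -> corpus co-occurrence count.
    diag_count: dict diagnosis -> total corpus count.
    Scores log P(d) + sum_c log P(c | d) with add-alpha smoothing and
    returns the highest-scoring option.
    """
    def score(d):
        s = math.log(diag_count[d] + alpha)            # count-based prior (unnormalized)
        for c in concepts:
            num = cooc.get((c, d), 0) + alpha          # smoothed co-occurrence
            den = diag_count[d] + alpha * vocab_size
            s += math.log(num / den)                   # log P(concept | diagnosis)
        return s
    return max(options, key=score)
```

Even this frequency-only scorer, when fed co-occurrence statistics from an LLM's own pretraining corpus, reportedly approaches that LLM's MedQA accuracy, which is the paper's central observation.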
https://arxiv.org/abs/2512.12868
Although multi-modal large language models (MLLMs) have shown strong capabilities across diverse domains, their application in generating fine-grained 3D perception and prediction outputs in autonomous driving remains underexplored. In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework that is also compatible with vision-action (VA) models. Our method jointly performs spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through end-to-end optimization. To obtain both precise geometric information and rich visual appearance, our approach integrates point clouds, multi-view images, and language instructions within a unified MLLM architecture. We further develop a data engine to generate text-occupancy and text-flow QA pairs for 4D spatial understanding. Remarkably, with only a 0.5B Qwen2.5 model as MLLM backbone, DrivePI as a single unified model matches or exceeds both existing VLA models and specialized VA models. Specifically, compared to VLA models, DrivePI outperforms OpenDriveVLA-7B by 2.5% mean accuracy on nuScenes-QA and reduces collision rate by 70% over ORION (from 0.37% to 0.11%) on nuScenes. Against specialized VA models, DrivePI surpasses FB-OCC by 10.3 RayIoU for 3D occupancy on OpenOcc, reduces the mAVE from 0.591 to 0.509 for occupancy flow on OpenOcc, and achieves 32% lower L2 error than VAD (from 0.72m to 0.49m) for planning on nuScenes. Code will be available at this https URL
https://arxiv.org/abs/2512.12799
Real-world deployment of Vision-Language Models (VLMs) is hindered by high computational demands, as existing architectures inefficiently process all tokens uniformly. We introduce Adaptive Token Pruning (ATP), a dynamic inference mechanism that retains only the most informative tokens based on contextual relevance. ATP operates at the vision-language interface, assigning a hybrid importance score combining ViT CLS attention (intra-modal saliency) and CLIP text-image similarity (inter-modal relevance) to keep top-K tokens for the LLM. Unlike static compression, ATP adapts to each input without modifying the backbone. Proposed as a lightweight gating module, ATP is compatible with popular backbones like BLIP-2, LLaVA, and Flamingo. Preliminary evaluations across VQAv2, GQA, and COCO indicate that ATP reduces inference FLOPs by around 40% and achieves roughly 1.5x speedups in end-to-end latency with negligible accuracy loss (less than 1%). Qualitative analyses suggest ATP preserves visual grounding and enhances interpretability. Beyond efficiency, we investigate robustness under corruptions; observations suggest adaptive pruning suppresses spurious correlations, improving stability. These findings imply that resource-constrained inference and model reliability are not competing objectives. Finally, we discuss ATP's role in efficient multimodal edge computing pipelines.
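The hybrid scoring can be sketched in a few lines. The equal mixing weight `lam` and the simple linear combination are assumptions for illustration; the actual gating module may weight and normalize the two signals differently:

```python
import numpy as np

def atp_prune(tokens, cls_attn, text_sim, k, lam=0.5):
    """Keep the top-k visual tokens by a hybrid importance score.

    cls_attn: per-token ViT CLS attention (intra-modal saliency).
    text_sim: per-token CLIP text-image similarity (inter-modal relevance).
    The score is lam * cls_attn + (1 - lam) * text_sim; the surviving
    tokens are returned in their original order for the LLM.
    """
    score = lam * np.asarray(cls_attn, dtype=float) \
        + (1 - lam) * np.asarray(text_sim, dtype=float)
    keep = sorted(np.argsort(score)[::-1][:k].tolist())  # top-k indices, original order
    return [tokens[i] for i in keep]
```

Because the score depends on the text query, the same image yields different surviving tokens for different questions, which is the sense in which the pruning is "adaptive" rather than static.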
https://arxiv.org/abs/2512.12701
Recent advances in Generative AI (GAI) have led to new opportunities for creativity support. However, this technology has raised ethical concerns in the visual artists community. This paper explores how GAI can assist visual artists in developing original characters (OCs) while respecting their creative agency. We present ORIBA, an AI chatbot leveraging large language models (LLMs) to enable artists to role-play with their OCs, focusing on conceptualization (e.g., backstories) while leaving exposition (visual creation) to creators. Through a study with 14 artists, we found ORIBA motivated artists' imaginative engagement, developing multidimensional attributes and stronger bonds with OCs that inspire their creative process. Our contributions include design insights for AI systems developed from artists' perspectives, demonstrating how LLMs can support cross-modal creativity while preserving creative agency in OC art. This paper highlights the potential of GAI as a neutral, non-visual support that strengthens existing creative practice, without infringing artistic exposition.
https://arxiv.org/abs/2512.12630
Infographic Visual Question Answering (InfographicVQA) evaluates a model's ability to read and reason over data-rich, layout-heavy visuals that combine text, charts, icons, and design elements. Compared with scene-text or natural-image VQA, infographics require stronger integration of OCR, layout understanding, and numerical and semantic reasoning. We introduce ViInfographicVQA, the first benchmark for Vietnamese InfographicVQA, comprising over 6,747 real-world infographics and 20,409 human-verified question-answer pairs across economics, healthcare, education, and more. The benchmark includes two evaluation settings. The Single-image task follows the traditional setup in which each question is answered using a single infographic. The Multi-image task requires synthesizing evidence across multiple semantically related infographics and is, to our knowledge, the first Vietnamese evaluation of cross-image reasoning in VQA. We evaluate a range of recent vision-language models on this benchmark, revealing substantial performance disparities, with the most significant errors occurring on Multi-image questions that involve cross-image integration and non-span reasoning. ViInfographicVQA contributes benchmark results for Vietnamese InfographicVQA and sheds light on the limitations of current multimodal models in low-resource contexts, encouraging future exploration of layout-aware and cross-image reasoning methods.
https://arxiv.org/abs/2512.12424
The growing demand for long-context inference capabilities in Large Language Models (LLMs) has intensified the computational and memory bottlenecks inherent to the standard attention mechanism. To address this challenge, we introduce BLASST, a drop-in sparse attention method that dynamically prunes the attention matrix without any pre-computation or proxy scores. Our method uses a fixed threshold and existing information from online softmax to identify negligible attention scores, skipping softmax computation, Value block loading, and the subsequent matrix multiplication. This fits seamlessly into existing FlashAttention kernel designs with negligible latency overhead. The approach is applicable to both prefill and decode stages across all attention variants (MHA, GQA, MQA, and MLA), providing a unified solution for accelerating long-context inference. We develop an automated calibration procedure that reveals a simple inverse relationship between optimal threshold and context length, enabling robust deployment across diverse scenarios. Maintaining high accuracy, we demonstrate a 1.62x speedup for prefill at 74.7% sparsity and a 1.48x speedup for decode at 73.2% sparsity on modern GPUs. Furthermore, we explore sparsity-aware training as a natural extension, showing that models can be trained to be inherently more robust to sparse attention patterns, pushing the accuracy-sparsity frontier even further.
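The threshold-based pruning described in the abstract can be illustrated with a minimal NumPy sketch of online-softmax attention for a single query. This is not the paper's kernel implementation; the function name, block size, and threshold value are illustrative assumptions. The key idea it shows: a block whose maximum score is negligible relative to the running maximum (i.e., `exp(s_max - m) < threshold`) can skip the exp computation, the Value-block use, and the output matmul entirely.

```python
import numpy as np

def sparse_online_attention(q, K, V, block_size=4, threshold=1e-3):
    """Sketch of threshold-based block pruning inside online softmax.

    q: (d,) query; K, V: (n, d) key/value matrices.
    Blocks with exp(s_max - running_max) < threshold are skipped.
    """
    d = q.shape[-1]
    m = -np.inf                      # running max of scores
    l = 0.0                          # running softmax denominator
    acc = np.zeros_like(q, dtype=np.float64)
    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        s = Kb @ q / np.sqrt(d)      # block attention scores
        s_max = s.max()
        m_new = max(m, s_max)
        if np.exp(s_max - m_new) < threshold:
            # Block contributes negligibly: skip exp, V load, and matmul.
            # Only rescale the running statistics to the new max.
            scale = np.exp(m - m_new)
            l *= scale
            acc = acc * scale
            m = m_new
            continue
        p = np.exp(s - m_new)        # unnormalized block weights
        scale = np.exp(m - m_new)    # rescale old accumulator
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / l
```

With `threshold=0` no block is ever skipped and the loop reduces to exact online-softmax attention; raising the threshold trades a small amount of accuracy for skipped work, mirroring the accuracy-sparsity trade-off the paper calibrates per context length.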
https://arxiv.org/abs/2512.12087
Modern text-to-video (T2V) diffusion models can synthesize visually compelling clips, yet they remain brittle at fine-scale structure: even state-of-the-art generators often produce distorted faces and hands, warped backgrounds, and temporally inconsistent motion. Such severe structural artifacts also appear in very low-quality real-world videos. Classical video restoration and super-resolution (VR/VSR) methods, in contrast, are tuned for synthetic degradations such as blur and downsampling and tend to stabilize these artifacts rather than repair them, while diffusion-prior restorers are usually trained on photometric noise and offer little control over the trade-off between perceptual quality and fidelity. We introduce CreativeVR, a diffusion-prior-guided video restoration framework for AI-generated (AIGC) and real videos with severe structural and temporal artifacts. Our deep-adapter-based method exposes a single precision knob that controls how strongly the model follows the input, smoothly trading off between precise restoration on standard degradations and stronger structure- and motion-corrective behavior on challenging content. Our key novelty is a temporally coherent degradation module used during training, which applies carefully designed transformations that produce realistic structural failures. To evaluate AIGC-artifact restoration, we propose the AIGC54 benchmark with FIQA, semantic and perceptual metrics, and multi-aspect scoring. CreativeVR achieves state-of-the-art results on videos with severe artifacts and performs competitively on standard video restoration benchmarks, while running at practical throughput (about 13 FPS at 720p on a single 80-GB A100). Project page: this https URL.
https://arxiv.org/abs/2512.12060
In-Car Conversational Question Answering (ConvQA) systems significantly enhance user experience by enabling seamless voice interactions. However, assessing their accuracy and reliability remains a challenge. This paper explores the use of Large Language Models (LLMs) alongside advanced prompting techniques and agent-based methods to evaluate the extent to which ConvQA system responses adhere to user utterances. The focus lies on contextual understanding and the ability to provide accurate venue recommendations considering user constraints and situational context. To evaluate utterance-response coherence using an LLM, we synthetically generate user utterances accompanied by both correct system responses and modified responses containing failures. We use input-output, chain-of-thought, self-consistency prompting, and multi-agent prompting techniques with 13 reasoning and non-reasoning LLMs of varying sizes and providers, including OpenAI, DeepSeek, Mistral AI, and Meta. We evaluate our approach on a case study involving restaurant recommendations. The most substantial improvements occur for small non-reasoning models when applying advanced prompting techniques, particularly multi-agent prompting. However, reasoning models consistently outperform non-reasoning models, with the best performance achieved using single-agent prompting with self-consistency. Notably, DeepSeek-R1 reaches an F1-score of 0.99 at a cost of 0.002 USD per request. Overall, the best balance between effectiveness and cost-time efficiency is reached with the non-reasoning model DeepSeek-V3. Our findings show that LLM-based evaluation offers a scalable and accurate alternative to traditional human evaluation for benchmarking contextual understanding in ConvQA systems.
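Of the prompting techniques the abstract lists, self-consistency is the one with the simplest mechanical core: sample several responses to the same prompt and majority-vote the final answer. A minimal sketch, assuming a generic `sample_fn` stand-in for an LLM call (the function name and interface are hypothetical, not the paper's API):

```python
from collections import Counter

def self_consistency(prompt, sample_fn, n=5):
    """Sample n responses to the same prompt and return the
    majority-vote answer. `sample_fn(prompt)` is a hypothetical
    stand-in for a stochastic LLM call that returns a final answer
    string (e.g., a verdict like 'coherent' / 'incoherent')."""
    answers = [sample_fn(prompt) for _ in range(n)]
    # Counter.most_common(1) yields the single most frequent answer.
    return Counter(answers).most_common(1)[0][0]
```

The intuition, matching the paper's finding that self-consistency helps most with single-agent prompting: individual samples may err on borderline utterance-response pairs, but errors are less likely to repeat across independent samples than the correct verdict is.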
https://arxiv.org/abs/2512.12042