Spatial transcriptomics (ST) is an emerging technology that enables researchers to investigate the molecular relationships underlying tissue morphology. However, acquiring ST data remains prohibitively expensive, and traditional fixed-grid sampling strategies lead to redundant measurements of morphologically similar or biologically uninformative regions, resulting in scarce data that constrains current methods. The well-established single-cell sequencing field, by contrast, can provide rich biological data as an effective auxiliary source to mitigate this limitation. To bridge these gaps, we introduce SCR2-ST, a unified framework that leverages single-cell prior knowledge to guide efficient data acquisition and accurate expression prediction. SCR2-ST integrates single-cell-guided reinforcement-learning (SCRL) active sampling with a hybrid regression-retrieval prediction network, SCR2Net. SCRL combines single-cell foundation model embeddings with spatial density information to construct biologically grounded reward signals, enabling selective acquisition of informative tissue regions under constrained sequencing budgets. SCR2Net then leverages the actively sampled data through a hybrid architecture combining regression-based modeling with retrieval-augmented inference, where a majority cell-type filtering mechanism suppresses noisy matches and retrieved expression profiles serve as soft labels for auxiliary supervision. We evaluated SCR2-ST on three public ST datasets, demonstrating state-of-the-art performance in both sampling efficiency and prediction accuracy, particularly under low-budget scenarios. Code is publicly available at: this https URL
https://arxiv.org/abs/2512.13635
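As a rough illustration of the retrieval branch described in the abstract, the sketch below assumes a bank of sampled spots with embeddings, cell-type labels, and expression profiles (all names hypothetical): retrieved neighbors are kept only if they share the majority cell type, and their similarity-weighted average expression serves as the soft label for auxiliary supervision.

```python
import numpy as np
from collections import Counter

def retrieve_soft_label(query_emb, bank_embs, bank_types, bank_exprs, k=8):
    """Retrieve k nearest reference spots, keep only those sharing the
    majority cell type, and average their expression into a soft label.
    (Hypothetical interface; the paper's network is learned end-to-end.)"""
    # Cosine similarity between the query and every banked embedding.
    sims = bank_embs @ query_emb / (
        np.linalg.norm(bank_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    top = np.argsort(-sims)[:k]
    # Majority cell-type filtering: drop neighbors whose type disagrees.
    majority, _ = Counter(bank_types[i] for i in top).most_common(1)[0]
    kept = [i for i in top if bank_types[i] == majority]
    # Soft label: similarity-weighted average of the surviving profiles.
    w = np.exp(sims[kept])
    w /= w.sum()
    return (w[:, None] * bank_exprs[kept]).sum(axis=0)
```

The regression branch's prediction could then be supervised against both the ground-truth expression and this retrieved soft label.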
Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training-curriculum design (e.g., response-length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep-thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.
https://arxiv.org/abs/2512.13607
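A minimal sketch of the cascaded orchestration, under our reading of the abstract: one RL stage per domain, each initialized from the previous stage's checkpoint, with RLHF-based alignment run first. The stage list and the `train_stage` hook are hypothetical.

```python
# Sketch of the cascaded, domain-wise orchestration (names hypothetical).
STAGES = [
    ("alignment", "rlhf"),   # RLHF first, as the pre-step noted above
    ("math",      "rlvr"),
    ("code",      "rlvr"),
    ("science",   "rlvr"),
]

def cascade_rl(model, prompt_pools, train_stage):
    """Run one RL stage per domain, carrying the checkpoint forward,
    instead of blending heterogeneous prompts into a single run."""
    for domain, algo in STAGES:
        # Each stage sees homogeneous prompts, so response lengths and
        # verifier latency are similar, simplifying infrastructure/tuning.
        model = train_stage(model, prompt_pools[domain], algo=algo)
    return model
```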
Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach: first enhancing controllability, and then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, bridging the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.
https://arxiv.org/abs/2512.13604
The slow inference process of image diffusion models significantly degrades interactive user experiences. To address this, we introduce Diffusion Preview, a novel paradigm employing rapid, low-step sampling to generate preliminary outputs for user evaluation, deferring full-step refinement until the preview is deemed satisfactory. Existing acceleration methods, including training-free solvers and post-training distillation, struggle to deliver high-quality previews or ensure consistency between previews and final outputs. We propose ConsistencySolver, a lightweight, trainable high-order solver derived from general linear multistep methods and optimized via reinforcement learning, which enhances preview quality and consistency. Experimental results demonstrate that ConsistencySolver significantly improves generation quality and consistency in low-step scenarios, making it ideal for efficient preview-and-refine workflows. Notably, it achieves FID scores on par with Multistep DPM-Solver using 47% fewer steps, while outperforming distillation baselines. Furthermore, user studies indicate our approach reduces overall user interaction time by nearly 50% while maintaining generation quality. Code is available at this https URL.
https://arxiv.org/abs/2512.13592
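The preview-and-refine workflow itself is simple to express. The sketch below assumes a hypothetical `sample(prompt, steps, seed)` interface in which the low-step and full-step runs share a noise seed, which is the preview-to-final consistency property ConsistencySolver is trained to strengthen.

```python
import random

def preview_and_refine(sample, prompt, ask_user, preview_steps=4, full_steps=50):
    """Generate cheap low-step previews until the user accepts one, then
    rerun the same noise seed with the full-step schedule."""
    while True:
        seed = random.randrange(2**31)
        preview = sample(prompt, steps=preview_steps, seed=seed)
        if ask_user(preview):          # user judges the cheap preview
            # Sharing the seed keeps the refined output consistent
            # with the accepted preview.
            return sample(prompt, steps=full_steps, seed=seed)
```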
Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative ``plan-and-infill'' decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse within a unified causal framework and reduces the learning complexity from the token-combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only surpasses prior MDMs by a wide margin, with 34% performance gains and an over 18$\times$ speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33$\times$ average speedup.
https://arxiv.org/abs/2512.13586
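Read as pseudocode, the plan-and-infill loop might look like the sketch below; `plan_slots` and `infill_slot` stand in for the diffusion-based planner and the parallel autoregressive infiller, and are assumptions about the interface rather than the paper's code.

```python
def plan_and_infill(seq, plan_slots, infill_slot, slot_len=8, max_iters=64):
    """seq: token list with None marking still-masked positions.
    plan_slots / infill_slot are hypothetical hooks for the diffusion
    planner and the parallel autoregressive infiller."""
    for _ in range(max_iters):
        masked = [i for i in range(0, len(seq), slot_len)
                  if any(t is None for t in seq[i:i + slot_len])]
        if not masked:
            break
        # Planning: score the masked slots and pick a weakly dependent set.
        chosen = plan_slots(seq, masked)
        # Infilling: decode inside each chosen slot autoregressively, but
        # in parallel across slots, reusing the shared causal KV cache.
        for start, tokens in zip(chosen, infill_slot(seq, chosen)):
            seq[start:start + slot_len] = tokens
    return seq
```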
Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at this https URL.
https://arxiv.org/abs/2512.13507
We propose a multimodal-driven framework for high-fidelity long-term digital human animation termed $\textbf{Soul}$, which generates semantically coherent videos from a single-frame portrait image, text prompts, and audio, achieving precise lip synchronization, vivid facial expressions, and robust identity preservation. We construct Soul-1M, containing 1 million finely annotated samples with a precise automated annotation pipeline (covering portrait, upper-body, full-body, and multi-person scenes) to mitigate data scarcity, and we carefully curate Soul-Bench for comprehensive and fair evaluation of audio-/text-guided animation methods. The model is built on the Wan2.2-5B backbone, integrating audio-injection layers and multiple training strategies together with threshold-aware codebook replacement to ensure long-term generation consistency. Meanwhile, step/CFG distillation and a lightweight VAE are used to optimize inference efficiency, achieving an 11.4$\times$ speedup with negligible quality loss. Extensive experiments show that Soul significantly outperforms current leading open-source and commercial models on video quality, video-text alignment, identity preservation, and lip-synchronization accuracy, demonstrating broad applicability in real-world scenarios such as virtual anchors and film production. Project page at this https URL
https://arxiv.org/abs/2512.13495
Premature semantic collapse -- the forced early commitment to a single meaning -- remains a core architectural limitation of current language models. Softmax-driven competition and greedy decoding cause models to discard valid interpretations before sufficient context is available, resulting in brittle reasoning and context failures. We introduce Non-Resolution Reasoning (NRR), a general computational framework that preserves semantic ambiguity during inference and performs resolution only when explicitly required. NRR integrates three components: (1) Multi-Vector Embeddings that maintain multiple viable interpretations per token, (2) Non-Collapsing Attention that prevents winner-take-all dynamics across layers, and (3) Contextual Identity Tracking (CIT), which assigns context-specific identities to recurring entities (e.g., distinguishing "Dr. Smith the cardiologist" from "Dr. Smith the researcher"). These mechanisms are unified by an external Resolution Operator $\rho$ that makes semantic commitment explicit, controllable, and task-dependent. Unlike standard architectures, NRR separates representation from resolution, allowing a single model to shift between creative, factual, and ambiguity-preserving reasoning without retraining. A synthetic evaluation demonstrates NRR's ability to preserve ambiguity and track context: CIT-enhanced models achieve 90.9% accuracy on out-of-distribution identity-shift tasks, compared to 9.1% for transformer baselines. NRR provides a principled alternative to premature collapse, reframing ambiguity as an explicit representational state rather than a failure mode. The question is not whether AI should resolve ambiguity, but when, how, and under whose control.
https://arxiv.org/abs/2512.13478
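A toy illustration of the core idea, not the paper's architecture: a token keeps several interpretation vectors alive, and the resolution operator $\rho$ collapses to a single sense only when explicitly invoked with a context.

```python
import numpy as np

class MultiVectorToken:
    """Toy token that keeps several interpretation vectors alive instead
    of collapsing to one (illustrative only, not the paper's model)."""
    def __init__(self, senses):                      # senses: name -> vector
        self.senses = {k: np.asarray(v, float) for k, v in senses.items()}

def rho(token, context_vec):
    """Resolution operator: collapse only when explicitly invoked,
    choosing the sense most compatible with the supplied context."""
    scores = {k: float(v @ context_vec) for k, v in token.senses.items()}
    return max(scores, key=scores.get)

bank = MultiVectorToken({"river_bank": [1.0, 0.0], "money_bank": [0.0, 1.0]})
print(rho(bank, np.array([0.9, 0.1])))               # -> river_bank
```

Until $\rho$ is applied, both senses remain available to downstream computation, which is the ambiguity-preserving behavior the framework formalizes.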
Accurate and timely identification of plant leaf diseases is essential for resilient and sustainable agriculture, yet most deep learning approaches rely on large annotated datasets and computationally intensive models that are unsuitable for data-scarce and resource-constrained environments. To address these challenges, we present a few-shot learning approach within a lightweight yet efficient framework that combines domain-adapted MobileNetV2 and MobileNetV3 models as feature extractors, along with a feature fusion technique to generate robust feature representations. For the classification task, the fused features are passed through a Bi-LSTM classifier enhanced with attention mechanisms to capture sequential dependencies and focus on the most relevant features, thereby achieving optimal classification performance even in complex, real-world environments with noisy or cluttered backgrounds. The proposed framework was evaluated across multiple experimental setups, including both laboratory-controlled and field-captured datasets. On tomato leaf diseases from the PlantVillage dataset, it consistently improved performance across 1- to 15-shot scenarios, reaching 98.23±0.33% at 15-shot, closely approaching the 99.98% SOTA benchmark achieved by a Transductive LSTM with attention, while remaining lightweight and mobile-friendly. Under real-world conditions using field images from the Dhan Shomadhan dataset, it maintained robust performance, reaching 69.28±1.49% at 15-shot and demonstrating strong resilience to complex backgrounds. Notably, it also outperformed the previous SOTA accuracy of 96.0% on six diseases from PlantVillage, achieving 99.72% with only 15-shot learning. With a compact model size of approximately 40 MB and an inference complexity of approximately 1.12 GFLOPs, this work establishes a scalable, mobile-ready foundation for precise plant disease diagnostics in data-scarce regions.
https://arxiv.org/abs/2512.13428
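A hedged PyTorch sketch of the classification head as we read it: the concatenated MobileNetV2/V3 features are reshaped into a short sequence, passed through a Bi-LSTM, and pooled with additive attention. The feature dimensions, sequence length, and class count below are assumptions.

```python
import torch
import torch.nn as nn

class FusionBiLSTMHead(nn.Module):
    """Sketch of the fused-feature classifier: concatenated MobileNetV2/V3
    features are chunked into a short sequence, passed through a Bi-LSTM,
    and pooled with additive attention. Dimensions are assumptions."""
    def __init__(self, feat_dim=1280 + 960, seq_len=8, hidden=256, n_classes=10):
        super().__init__()
        assert feat_dim % seq_len == 0
        self.seq_len, self.step = seq_len, feat_dim // seq_len
        self.lstm = nn.LSTM(self.step, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.cls = nn.Linear(2 * hidden, n_classes)

    def forward(self, fused):                      # fused: (B, feat_dim)
        x = fused.view(-1, self.seq_len, self.step)
        h, _ = self.lstm(x)                        # (B, seq_len, 2*hidden)
        a = torch.softmax(self.attn(h), dim=1)     # attention over the sequence
        return self.cls((a * h).sum(dim=1))        # (B, n_classes)
```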
Large Language Models (LLMs) are prone to memorizing training data, which poses serious privacy risks. Two of the most prominent concerns are training data extraction and Membership Inference Attacks (MIAs). Prior research has shown that these threats are interconnected: adversaries can extract training data from an LLM by querying the model to generate a large volume of text and subsequently applying MIAs to verify whether a particular data point was included in the training set. In this study, we integrate multiple MIA techniques into the data extraction pipeline to systematically benchmark their effectiveness. We then compare their performance in this integrated setting against results from conventional MIA benchmarks, allowing us to evaluate their practical utility in real-world extraction scenarios.
https://arxiv.org/abs/2512.13352
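Schematically, the integrated pipeline generates many candidates and ranks them with an MIA score. The loss/zlib ratio used below is one classic choice (the study benchmarks several alternatives), and the `generate`/`nll` hooks are hypothetical.

```python
import zlib

def extraction_pipeline(generate, nll, prompts, per_prompt=100, top_k=100):
    """Sample candidate texts, score each with a simple MIA signal, and
    return the most member-like candidates (generate/nll are hypothetical
    hooks: a sampler and a per-text negative log-likelihood)."""
    candidates = [generate(p) for p in prompts for _ in range(per_prompt)]
    def mia_score(text):
        # Low model NLL relative to compressibility suggests memorization
        # (the classic loss/zlib ratio; lower score = more member-like).
        return nll(text) / max(len(zlib.compress(text.encode())), 1)
    return sorted(candidates, key=mia_score)[:top_k]
```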
The rapid advancement of generative models has increased the demand for generated image detectors capable of generalizing across diverse and evolving generation techniques. However, existing methods, including those leveraging pre-trained vision-language models, often produce highly entangled representations, mixing task-relevant forensic cues (causal features) with spurious or irrelevant patterns (non-causal features), thus limiting generalization. To address this issue, we propose CausalCLIP, a framework that explicitly disentangles causal from non-causal features and employs targeted filtering guided by causal inference principles to retain only the most transferable and discriminative forensic cues. By modeling the generation process with a structural causal model and enforcing statistical independence through Gumbel-Softmax-based feature masking and Hilbert-Schmidt Independence Criterion (HSIC) constraints, CausalCLIP isolates stable causal features robust to distribution shifts. When tested on unseen generative models from different series, CausalCLIP demonstrates strong generalization ability, achieving improvements of 6.83% in accuracy and 4.06% in average precision over state-of-the-art methods.
https://arxiv.org/abs/2512.13285
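The two mechanisms named in the abstract can be sketched directly: a Gumbel-Softmax gate that routes feature dimensions into causal and non-causal groups, and a (biased) HSIC estimator whose minimization pushes the two groups toward statistical independence. Shapes and the RBF bandwidth are assumptions.

```python
import torch
import torch.nn.functional as F

def rbf_gram(x, sigma=1.0):
    # RBF kernel Gram matrix over a batch of feature vectors.
    return torch.exp(-torch.cdist(x, x) ** 2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased HSIC estimator; driving it toward zero encourages statistical
    independence between the causal and non-causal feature groups."""
    n = x.shape[0]
    h = torch.eye(n) - 1.0 / n                    # centering matrix
    k, l = rbf_gram(x, sigma), rbf_gram(y, sigma)
    return torch.trace(k @ h @ l @ h) / (n - 1) ** 2

def split_features(feats, mask_logits, tau=0.5):
    """Gumbel-Softmax feature masking: a (hard, straight-through) binary
    gate per dimension routes features into the causal or non-causal group.
    Shapes: feats (B, D), mask_logits (D,)."""
    gate = F.gumbel_softmax(
        torch.stack([mask_logits, -mask_logits], dim=-1), tau=tau, hard=True)[..., 0]
    return feats * gate, feats * (1 - gate)
```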
Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, limiting LLM agents' adaptability to new or evolving toolsets. We present AutoTool, a framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. We first construct a 200k dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Building on this data foundation, AutoTool employs a dual-phase optimization pipeline: (i) supervised and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett-Luce ranking to refine consistent multi-step tool selection. We train two base models, Qwen3-8B and Qwen2.5-VL-7B, with AutoTool and evaluate them across ten diverse benchmarks. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool-integration methods, yielding average gains of 6.4% in math & science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.
https://arxiv.org/abs/2512.13278
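For the ranking phase, a KL-regularized Plackett-Luce objective over tool scores might look like the sketch below; the best-to-worst candidate ordering, the frozen reference policy, and the coefficient are assumptions about the interface rather than AutoTool's released code.

```python
import torch
import torch.nn.functional as F

def pl_ranking_loss(scores, ref_scores, kl_coef=0.1):
    """KL-regularized Plackett-Luce loss (sketch). `scores` are the policy's
    logits over candidate tools, already ordered best-to-worst by the target
    ranking; `ref_scores` come from a frozen reference policy."""
    # Plackett-Luce NLL: at each rank, the chosen tool competes with every
    # tool not yet placed; logcumsumexp over the reversed tail computes the
    # log-partition of each suffix.
    tail_lse = torch.logcumsumexp(scores.flip(-1), dim=-1).flip(-1)
    nll = (tail_lse - scores).sum(-1)
    # KL term keeps the tool-selection distribution near the reference.
    kl = F.kl_div(F.log_softmax(scores, -1), F.log_softmax(ref_scores, -1),
                  log_target=True, reduction="sum")
    return nll + kl_coef * kl
```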
This paper presents STARCaster, an identity-aware spatio-temporal video diffusion model that addresses both speech-driven portrait animation and free-viewpoint talking portrait synthesis, given an identity embedding or reference image, within a unified framework. Existing 2D speech-to-video diffusion models depend heavily on reference guidance, leading to limited motion diversity. At the same time, 3D-aware animation typically relies on inversion through pre-trained tri-plane generators, which often leads to imperfect reconstructions and identity drift. We rethink reference- and geometry-based paradigms in two ways. First, we deviate from strict reference conditioning at pre-training by introducing softer identity constraints. Second, we address 3D awareness implicitly within the 2D video domain by leveraging the inherent multi-view nature of video data. STARCaster adopts a compositional approach progressing from ID-aware motion modeling, to audio-visual synchronization via lip reading-based supervision, and finally to novel view animation through temporal-to-spatial adaptation. To overcome the scarcity of 4D audio-visual data, we propose a decoupled learning approach in which view consistency and temporal coherence are trained independently. A self-forcing training scheme enables the model to learn from longer temporal contexts than those generated at inference, mitigating the overly static animations common in existing autoregressive approaches. Comprehensive evaluations demonstrate that STARCaster generalizes effectively across tasks and identities, consistently surpassing prior approaches in different benchmarks.
https://arxiv.org/abs/2512.13247
Speculative Decoding is a prominent technique for accelerating the autoregressive inference of large language models (LLMs) by employing a fast draft model to propose candidate token sequences and a large target model to verify them in parallel. However, its core component -- the rejection sampling mechanism -- relies on a fixed, context-independent random threshold. This leads to a significant "random rejection" problem in high-uncertainty generation scenarios, where plausible candidate tokens are frequently rejected due to random chance, undermining inference efficiency. This paper introduces Efficient Adaptive Rejection Sampling (EARS), a novel method that dynamically adjusts the acceptance threshold by incorporating the target model's own predictive uncertainty, measured as \(1 - \max(P_{\mathrm{target}})\). By introducing a tolerance term proportional to this uncertainty, EARS intelligently relaxes the acceptance criterion when the model is uncertain, effectively reducing random rejections while maintaining strict standards when the model is confident. Experiments on creative writing and open-domain QA tasks demonstrate that EARS significantly enhances the efficiency of speculative decoding, achieving up to an 18.12% increase in throughput with a negligible 0.84% accuracy drop on the GSM8K benchmark. The method requires no modifications to model architectures and can be seamlessly integrated into existing speculative decoding frameworks.
https://arxiv.org/abs/2512.13194
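The acceptance rule is easy to state concretely. Standard speculative decoding accepts a drafted token with probability min(1, p/q); under our reading of the abstract, EARS relaxes this with a tolerance proportional to the target model's own uncertainty, \(1 - \max(P_{\mathrm{target}})\). The exact form of the tolerance term below is an assumption.

```python
import numpy as np

def ears_accept(p_target, p_draft, token, alpha=0.5, rng=np.random):
    """EARS acceptance test for one drafted token (sketch).
    p_target, p_draft: next-token distributions of the two models (1-D).
    The standard test accepts with prob min(1, p/q); EARS adds a tolerance
    proportional to the target's uncertainty (alpha is an assumed knob)."""
    ratio = p_target[token] / max(p_draft[token], 1e-12)
    uncertainty = 1.0 - p_target.max()      # high when the target is unsure
    threshold = min(1.0, ratio + alpha * uncertainty)
    return rng.random() < threshold
```

When the target distribution is sharply peaked, the tolerance vanishes and the rule reduces to the strict standard test, matching the behavior described above.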
High-resolution phenotyping at the level of individual leaves offers fine-grained insights into plant development and stress responses. However, the full potential of accurate leaf tracking over time remains largely unexplored due to the absence of robust tracking methods, particularly for structurally complex crops such as canola. Existing plant-specific tracking methods are typically limited to small-scale species or rely on constrained imaging conditions. In contrast, generic multi-object tracking (MOT) methods are not designed for dynamic biological scenes. Progress in the development of accurate leaf tracking models has also been hindered by a lack of large-scale datasets captured under realistic conditions. In this work, we introduce CanolaTrack, a new benchmark dataset comprising 5,704 RGB images with 31,840 annotated leaf instances spanning the early growth stages of 184 canola plants. To enable accurate leaf tracking over time, we introduce LeafTrackNet, an efficient framework that combines a YOLOv10-based leaf detector with a MobileNetV3-based embedding network. During inference, leaf identities are maintained over time through an embedding-based memory association strategy. LeafTrackNet outperforms both plant-specific trackers and state-of-the-art MOT baselines, achieving a 9% HOTA improvement on CanolaTrack. Our work sets a new standard for leaf-level tracking under realistic conditions and provides CanolaTrack, the largest dataset for leaf tracking in agricultural crops, which will support future research in plant phenotyping. Our code and dataset are publicly available at this https URL.
https://arxiv.org/abs/2512.13130
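A minimal sketch of the embedding-based memory association step, assuming L2-normalized appearance embeddings and a greedy nearest-track match; the threshold, the memory-update rule, and all names are simplifications of whatever LeafTrackNet actually does.

```python
import numpy as np

def associate(detections, memory, thresh=0.6):
    """Greedy embedding-based association (sketch): match each detected
    leaf to the stored track whose appearance embedding is most similar;
    unmatched detections open new tracks. memory: track_id -> embedding."""
    assignments, used = {}, set()
    for det_id, emb in detections.items():      # det_id -> unit-norm embedding
        best, best_sim = None, thresh
        for track_id, mem_emb in memory.items():
            sim = float(emb @ mem_emb)          # cosine sim (unit vectors)
            if sim > best_sim and track_id not in used:
                best, best_sim = track_id, sim
        if best is None:
            best = f"track_{len(memory)}"       # a new leaf appears
        used.add(best)
        memory[best] = emb                      # refresh the track memory
        assignments[det_id] = best
    return assignments
```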
Diffusion models have recently emerged as powerful tools for robot motion planning by capturing the multi-modal distribution of feasible trajectories. However, their extension to multi-robot settings with flexible, language-conditioned task specifications remains limited. Furthermore, current diffusion-based approaches incur high computational cost during inference and struggle with generalization because they require explicit construction of environment representations and lack mechanisms for reasoning about geometric reachability. To address these limitations, we present Language-Conditioned Heat-Inspired Diffusion (LCHD), an end-to-end vision-based framework that generates language-conditioned, collision-free trajectories. LCHD integrates CLIP-based semantic priors with a collision-avoiding diffusion kernel serving as a physical inductive bias that enables the planner to interpret language commands strictly within the reachable workspace. This naturally handles out-of-distribution scenarios -- in terms of reachability -- by guiding robots toward accessible alternatives that match the semantic intent, while eliminating the need for explicit obstacle information at inference time. Extensive evaluations on diverse real-world-inspired maps, along with real-robot experiments, show that LCHD consistently outperforms prior diffusion-based planners in success rate, while reducing planning latency.
https://arxiv.org/abs/2512.13090
Dense retrieval has become the industry standard in large-scale information retrieval systems due to its high efficiency and competitive accuracy. Its core relies on a coarse-to-fine hierarchical architecture that enables rapid candidate selection and precise semantic matching, achieving millisecond-level response over billion-scale corpora. This capability makes it essential not only in traditional search and recommendation scenarios but also in the emerging paradigm of generative recommendation driven by large language models, where semantic IDs -- themselves a form of coarse-to-fine representation -- play a foundational role. However, the widely adopted dual-tower encoding architecture introduces inherent challenges, primarily representational-space misalignment and retrieval-index inconsistency, which degrade matching accuracy, retrieval stability, and performance on long-tail queries. These issues are further magnified in semantic ID generation, ultimately limiting the performance ceiling of downstream generative models. To address these challenges, this paper proposes a simple and effective framework named SCI comprising two synergistic modules: a symmetric representation alignment module that employs an innovative input-swapping mechanism to unify the dual-tower representation space without adding parameters, and a consistent-indexing module with dual-tower synergy that redesigns retrieval paths using a dual-view indexing strategy to maintain consistency from training to inference. The framework is systematic, lightweight, and engineering-friendly, requiring minimal overhead while fully supporting billion-scale deployment. We provide theoretical guarantees for our approach, with its effectiveness validated by results across public datasets and real-world e-commerce datasets.
https://arxiv.org/abs/2512.13074
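Our reading of the input-swapping mechanism, as a sketch: each input is encoded by both towers, and the two views of the same input are pulled together, aligning the representation spaces without adding parameters. The cosine objective below is an assumption, not SCI's published loss.

```python
import torch
import torch.nn.functional as F

def swap_alignment_loss(q_tower, d_tower, queries, docs):
    """Sketch of symmetric representation alignment via input swapping:
    each input passes through *both* towers, and the two views of the same
    input are pulled together. No new parameters are introduced."""
    zq, zq_swap = q_tower(queries), d_tower(queries)   # query through both towers
    zd, zd_swap = d_tower(docs), q_tower(docs)         # doc through both towers
    def align(a, b):
        return 1 - F.cosine_similarity(a, b, dim=-1).mean()
    return 0.5 * (align(zq, zq_swap) + align(zd, zd_swap))
```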
Instructional video generation is an emerging task that aims to synthesize coherent demonstrations of procedural activities from textual descriptions. Such capability has broad implications for content creation, education, and human-AI interaction, yet existing video diffusion models struggle to maintain temporal consistency and controllability across long sequences of multiple action steps. We introduce a pipeline for future-driven streaming instructional video generation, dubbed SneakPeek, a diffusion-based autoregressive framework designed to generate precise, stepwise instructional videos conditioned on an initial image and structured textual prompts. Our approach introduces three key innovations to enhance consistency and controllability: (1) predictive causal adaptation, where a causal model learns to perform next-frame prediction and anticipate future keyframes; (2) future-guided self-forcing with a dual-region KV caching scheme to address the exposure bias issue at inference time; (3) multi-prompt conditioning, which provides fine-grained and procedural control over multi-step instructions. Together, these components mitigate temporal drift, preserve motion consistency, and enable interactive video generation where future prompt updates dynamically influence ongoing streaming video generation. Experimental results demonstrate that our method produces temporally coherent and semantically faithful instructional videos that accurately follow complex, multi-step task descriptions.
https://arxiv.org/abs/2512.13019
Recent advances in self-supervised learning (SSL) on Transformers have significantly improved speaker verification (SV) by providing domain-general speech representations. However, existing approaches have underutilized the multi-layered nature of SSL encoders. To address this limitation, we propose the layer-aware time-delay neural network (L-TDNN), which directly performs layer- and frame-wise processing on the layer-wise hidden state outputs of pre-trained models, extracting fixed-size speaker vectors. L-TDNN comprises a layer-aware convolutional network, frame-adaptive layer aggregation, and attentive statistics pooling, explicitly modeling the previously overlooked layer dimension. We evaluated L-TDNN across multiple speech SSL Transformers and diverse speech-speaker corpora against other approaches for leveraging pre-trained encoders. L-TDNN consistently demonstrated robust verification performance, achieving the lowest error rates throughout the experiments. Concurrently, it stood out in terms of model compactness and exhibited inference efficiency comparable to existing systems. These results highlight the advantages of the proposed layer-aware processing approach. Future work includes exploring joint training with SSL frontends and incorporating score calibration to further enhance state-of-the-art verification performance.
https://arxiv.org/abs/2409.07770
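A simplified PyTorch sketch of the two pooling ideas named above, given the stacked layer-wise hidden states of a pre-trained SSL encoder: frame-adaptive attention over the layer axis, followed by attentive statistics pooling over time. The convolutional stack is omitted and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class LayerAwarePooling(nn.Module):
    """Layer-axis attention followed by attentive statistics pooling over
    time; a simplification of L-TDNN's aggregation (dims are assumptions)."""
    def __init__(self, dim=768):
        super().__init__()
        self.layer_attn = nn.Linear(dim, 1)   # per-frame, per-layer score
        self.time_attn = nn.Linear(dim, 1)    # per-frame score

    def forward(self, hidden):                # hidden: (B, layers, T, dim)
        w = torch.softmax(self.layer_attn(hidden), dim=1)   # over layer axis
        x = (w * hidden).sum(dim=1)                         # (B, T, dim)
        a = torch.softmax(self.time_attn(x), dim=1)         # over time axis
        mu = (a * x).sum(dim=1)                             # weighted mean
        var = (a * x ** 2).sum(dim=1) - mu ** 2             # weighted variance
        return torch.cat([mu, var.clamp_min(1e-6).sqrt()], dim=-1)  # (B, 2*dim)
```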
This paper presents VLCache, a cache-reuse framework that exploits both the Key-Value (KV) cache and the encoder cache from prior multimodal inputs to eliminate costly recomputation when the same multimodal inputs recur. Unlike previous heuristic approaches, we formally identify the cumulative reuse-error effect and demonstrate how to minimize the non-prefix cache-reuse error effectively. We further analyze the varying importance of model layers and propose a dynamic, layer-aware recomputation strategy to balance accuracy and efficiency. Experimental results show that VLCache achieves accuracy on par with full recomputation while requiring only 2-5% of the tokens to be computed, yielding 1.2x-16x speedups in time-to-first-token (TTFT). The proposed VLCache pipeline has been integrated into SGLang, enabling significantly faster inference in practical deployments.
https://arxiv.org/abs/2512.12977
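The reuse mechanism can be sketched as a content-addressed lookup: caches are keyed by a hash of the multimodal input, and on a hit only the layers flagged by a layer-importance policy are recomputed. The interface below is hypothetical, not SGLang's integration.

```python
import hashlib

class MultimodalCache:
    """Sketch of the reuse idea: key the KV cache and encoder cache on a
    hash of the multimodal input; on a hit, recompute only the layers a
    layer-importance policy marks as sensitive (policy hypothetical)."""
    def __init__(self):
        self.store = {}                     # digest -> (kv_cache, enc_cache)

    @staticmethod
    def key(image_bytes, text):
        return hashlib.sha256(image_bytes + text.encode()).hexdigest()

    def lookup(self, image_bytes, text, recompute_layers):
        hit = self.store.get(self.key(image_bytes, text))
        if hit is None:
            return None                     # miss: full prefill required
        kv, enc = hit
        # Layer-aware recomputation: refresh only the most error-sensitive
        # layers instead of the whole stack, bounding cumulative reuse error.
        return recompute_layers(kv, enc)
```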