Recent advances in camera-controlled video diffusion models have significantly improved video-camera alignment. However, camera controllability remains limited. In this work, we build upon Reward Feedback Learning (ReFL) and aim to further improve camera controllability. However, directly borrowing existing ReFL approaches faces several challenges. First, current reward models lack the capacity to assess video-camera alignment. Second, decoding latents into RGB videos for reward computation introduces substantial computational overhead. Third, 3D geometric information is typically neglected during video decoding. To address these limitations, we introduce an efficient camera-aware 3D decoder that decodes video latents into 3D representations for reward quantization. Specifically, the video latent, along with the camera pose, is decoded into 3D Gaussians. In this process, the camera pose not only acts as input but also serves as a projection parameter. Misalignment between the video latent and the camera pose causes geometric distortions in the 3D structure, resulting in blurry renderings. Based on this property, we explicitly optimize pixel-level consistency between rendered novel views and ground-truth ones as the reward. To accommodate the stochastic nature of generation, we further introduce a visibility term that selectively supervises only deterministic regions derived via geometric warping. Extensive experiments on the RealEstate10K and WorldScore benchmarks demonstrate the effectiveness of our proposed method. Project page: \href{this https URL}{CamPilot Page}.
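The visibility-masked, pixel-level consistency reward described in this abstract can be sketched as below. This is a minimal illustration, not the paper's implementation: the choice of squared error and the exact masking arithmetic are assumptions, and the visibility mask is taken as given (the paper derives it via geometric warping).

```python
import numpy as np

def visibility_masked_reward(rendered, target, visible):
    """Pixel-level consistency reward over deterministic regions only.

    rendered, target: (H, W, 3) float arrays in [0, 1]
    visible: (H, W) boolean mask of regions treated as deterministic
             (in the paper, derived via geometric warping)
    Returns the negative mean per-channel squared error over visible pixels,
    so perfect agreement on visible regions yields the maximum reward of 0.
    """
    diff = (rendered - target) ** 2
    mask = visible[..., None].astype(diff.dtype)  # broadcast over channels
    denom = np.maximum(mask.sum(), 1.0)           # number of visible pixels
    return -float((diff * mask).sum() / (denom * rendered.shape[-1]))
```

Under this sketch, misalignment between latent and pose would blur the renderings, inflate the masked error, and so lower the reward, which is the signal the method optimizes.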
https://arxiv.org/abs/2601.16214
Robot foundation models are beginning to deliver on the promise of generalist robotic agents, yet progress remains constrained by the scarcity of large-scale real-world manipulation datasets. Simulation and synthetic data generation offer a scalable alternative, but their usefulness is limited by the visual domain gap between simulation and reality. In this work, we present Point Bridge, a framework that leverages unified, domain-agnostic point-based representations to unlock synthetic datasets for zero-shot sim-to-real policy transfer, without explicit visual or object-level alignment. Point Bridge combines automated point-based representation extraction via Vision-Language Models (VLMs), transformer-based policy learning, and efficient inference-time pipelines to train capable real-world manipulation agents using only synthetic data. With additional co-training on small sets of real demonstrations, Point Bridge further improves performance, substantially outperforming prior vision-based sim-and-real co-training methods. It achieves up to 44% gains in zero-shot sim-to-real transfer and up to 66% with limited real data across both single-task and multitask settings. Videos of the robot are best viewed at: this https URL
https://arxiv.org/abs/2601.16212
We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and compose them to generalize to unseen combinations. We find that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail primarily due to an overlooked failure mode: object-driven verb shortcuts. Through systematic analysis, we show that this behavior arises from two intertwined factors: severe sparsity and skewness of compositional supervision, and the asymmetric learning difficulty between verbs and objects. As training progresses, the existing ZS-CAR model increasingly ignores visual evidence and overfits to co-occurrence statistics. Consequently, the existing model does not gain the benefit of compositional recognition in unseen verb-object compositions. To address this, we propose RCORE, a simple and effective framework that enforces temporally grounded verb learning. RCORE introduces (i) a composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behaviors by explicitly modeling temporal structure. Across two benchmarks, Sth-com and our newly constructed EK100-com, RCORE significantly improves unseen composition accuracy, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps. Our findings reveal object-driven shortcuts as a critical limiting factor in ZS-CAR and demonstrate that addressing them is essential for robust compositional video understanding.
https://arxiv.org/abs/2601.16211
Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly to up to 4K/8K resolutions.
https://arxiv.org/abs/2601.16210
Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.
https://arxiv.org/abs/2601.16208
Many Vision-Language-Action (VLA) models flatten image patches into a 1D token sequence, weakening the 2D spatial cues needed for precise manipulation. We introduce IVRA, a lightweight, training-free method that improves spatial understanding by exploiting affinity hints already available in the model's built-in vision encoder, without requiring any external encoder or retraining. IVRA selectively injects these affinity signals into a language-model layer in which instance-level features reside. This inference-time intervention realigns visual-token interactions and better preserves geometric structure while keeping all model parameters fixed. We demonstrate the generality of IVRA by applying it to diverse VLA architectures (LLaRA, OpenVLA, and FLOWER) across simulated benchmarks spanning both 2D and 3D manipulation (VIMA and LIBERO) and on various real-robot tasks. On 2D VIMA, IVRA improves average success by +4.2% over the baseline LLaRA in a low-data regime. On 3D LIBERO, it yields consistent gains over the OpenVLA and FLOWER baselines, including improvements when baseline accuracy is near saturation (96.3% to 97.1%). All code and models will be released publicly. Visualizations are available at: this http URL
https://arxiv.org/abs/2601.16207
We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox's efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.
https://arxiv.org/abs/2601.16206
We propose a novel training regime termed counterfactual training that leverages counterfactual explanations to increase the explanatory capacity of models. Counterfactual explanations have emerged as a popular post-hoc explanation method for opaque machine learning models: they inform how factual inputs would need to change in order for a model to produce some desired output. To be useful in real-world decision-making systems, counterfactuals should be plausible with respect to the underlying data and actionable with respect to the feature mutability constraints. Much existing research has therefore focused on developing post-hoc methods to generate counterfactuals that meet these desiderata. In this work, we instead hold models directly accountable for the desired end goal: counterfactual training employs counterfactuals during the training phase to minimize the divergence between learned representations and plausible, actionable explanations. We demonstrate empirically and theoretically that our proposed method facilitates training models that deliver inherently desirable counterfactual explanations and additionally exhibit improved adversarial robustness.
https://arxiv.org/abs/2601.16205
Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications, yet remain vulnerable to adversarial perturbations that distort their feature representations and induce erroneous predictions. To address this vulnerability, we propose the Feature-space Smoothing (FS) and theoretically prove that FS offers certified robustness on the feature representations of MLLMs. Specifically, FS transforms any feature encoder into a smoothed variant that is guaranteed to maintain a certified lower bound on the feature cosine similarity between clean and adversarial representations under $\ell_2$-bounded attacks. Moreover, we indicate that the value of this Feature Cosine Similarity Bound (FCSB) derived from FS can be improved by enlarging the defined Gaussian robustness score on the vanilla encoder. Building upon this, we introduce the Purifier and Smoothness Mapper (PSM), a plug-and-play module that improves the Gaussian robustness score of MLLMs and thus enhances their certified robustness under FS, without requiring any retraining on MLLMs. We demonstrate that the FS with PSM not only provides a strong theoretical robustness guarantee but also exhibits superior empirical performance compared to adversarial training. Extensive experiments across diverse MLLMs and downstream tasks indicate the effectiveness of the FS-PSM, reducing the Attack Success Rate (ASR) of various white-box attacks from nearly 90\% to about 1\%.
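The Feature-space Smoothing above can be approximated in practice by Monte Carlo averaging over Gaussian-perturbed inputs, in the spirit of randomized smoothing. The sketch below shows that estimator only; the `encoder` callable, the unit-normalization step, and all parameter values are illustrative assumptions rather than the paper's construction of the certified bound.

```python
import numpy as np

def smoothed_features(encoder, x, sigma=0.25, n_samples=64, seed=0):
    """Monte Carlo estimate of a Gaussian-smoothed feature encoder.

    Averages unit-normalized features of noisy copies x + N(0, sigma^2 I),
    approximating a smoothed variant of the encoder whose clean-vs-adversarial
    feature cosine similarity admits a certified lower bound under l2 attacks.
    `encoder` maps a (d,) input to a (k,) feature vector (hypothetical API).
    """
    rng = np.random.default_rng(seed)
    feats = []
    for _ in range(n_samples):
        z = encoder(x + rng.normal(0.0, sigma, size=x.shape))
        feats.append(z / np.linalg.norm(z))  # direction only, as in cosine similarity
    mean = np.mean(feats, axis=0)
    return mean / np.linalg.norm(mean)       # re-normalize the averaged direction
```

A larger Gaussian robustness score of the base encoder (features that move little under such noise) would tighten the resulting cosine-similarity bound, which is what the PSM module is designed to improve.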
https://arxiv.org/abs/2601.16200
Lifting perspective images and videos to 360° panoramas enables immersive 3D world generation. Existing approaches often rely on explicit geometric alignment between the perspective and the equirectangular projection (ERP) space. Yet, this requires known camera metadata, obscuring the application to in-the-wild data where such calibration is typically absent or noisy. We propose 360Anything, a geometry-free framework built upon pre-trained diffusion transformers. By treating the perspective input and the panorama target simply as token sequences, 360Anything learns the perspective-to-equirectangular mapping in a purely data-driven way, eliminating the need for camera information. Our approach achieves state-of-the-art performance on both image and video perspective-to-360° generation, outperforming prior works that use ground-truth camera information. We also trace the root cause of the seam artifacts at ERP boundaries to zero-padding in the VAE encoder, and introduce Circular Latent Encoding to facilitate seamless generation. Finally, we show competitive results in zero-shot camera FoV and orientation estimation benchmarks, demonstrating 360Anything's deep geometric understanding and broader utility in computer vision tasks. Additional results are available at this https URL.
https://arxiv.org/abs/2601.16192
How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.
https://arxiv.org/abs/2601.16175
State-of-the-art neural theorem provers like DeepSeek-Prover-V1.5 combine large language models with reinforcement learning, achieving impressive results through sophisticated training. We ask: do these highly-trained models still benefit from simple structural guidance at inference time? We evaluate a lightweight intervention -- a fixed prompt schedule over 15 common tactic skeletons -- on the miniF2F benchmark. This simple approach yields 21.7% pass@16 compared to 15.2% for standard sampling from the same model, a 43% relative improvement using the same number of samples (k=16) and same maximum generation length (1024 tokens). Our results suggest that even capable RL-trained provers underutilize structural priors available in the tactic language, and that simple inference-time guidance remains a cheap, complementary boost.
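The fixed prompt schedule and the pass@16 criterion above can be sketched in a few lines. The skeleton strings and prompt format below are hypothetical placeholders, not the 15 skeletons used in the paper; the sketch only shows the scheduling logic of cycling hints across k samples and counting success if any sample verifies.

```python
def schedule_prompts(theorem, skeletons, k=16):
    """Fixed prompt schedule: cycle through tactic skeletons across k samples.

    `skeletons` is a list of tactic-skeleton hint strings (illustrative here).
    Sample i is prompted with skeleton i mod len(skeletons), so each skeleton
    is tried roughly k / len(skeletons) times.
    """
    return [f"{theorem}\n-- hint: {skeletons[i % len(skeletons)]}"
            for i in range(k)]

def pass_at_k(results):
    """pass@k: the attempt succeeds if any of the k samples verifies."""
    return any(results)
```

Because the schedule is fixed and applied only at inference, it adds no training cost and keeps the sample budget and generation length identical to the standard-sampling baseline.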
https://arxiv.org/abs/2601.16172
Recent video generation models demonstrate remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted video models for policy learning but introduce complexity by requiring multiple stages of post-training and new architectural components for action generation. In this work, we introduce Cosmos Policy, a simple approach for adapting a large pretrained video model (Cosmos-Predict2) into an effective robot policy through a single stage of post-training on the robot demonstration data collected on the target platform, with no architectural modifications. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model's latent diffusion process, harnessing the model's pretrained priors and core learning algorithm to capture complex action distributions. Additionally, Cosmos Policy generates future state images and values (expected cumulative rewards), which are similarly encoded as latent frames, enabling test-time planning of action trajectories with higher likelihood of success. In our evaluations, Cosmos Policy achieves state-of-the-art performance on the LIBERO and RoboCasa simulation benchmarks (98.5% and 67.1% average success rates, respectively) and the highest average score in challenging real-world bimanual manipulation tasks, outperforming strong diffusion policies trained from scratch, video model-based policies, and state-of-the-art vision-language-action models fine-tuned on the same robot demonstrations. Furthermore, given policy rollout data, Cosmos Policy can learn from experience to refine its world model and value function and leverage model-based planning to achieve even higher success rates in challenging tasks. We release code, models, and training data at this https URL
https://arxiv.org/abs/2601.16163
Keyword Spotting (KWS) systems with small-footprint models deployed on edge devices face significant accuracy and robustness challenges due to domain shifts caused by varying noise and recording conditions. To address this, we propose a comprehensive continual learning framework designed to adapt to new domains while maintaining computational efficiency. The proposed pipeline integrates a dual-input Convolutional Neural Network, utilizing both Mel Frequency Cepstral Coefficients (MFCC) and Mel-spectrogram features, supported by a multi-stage denoising process involving discrete wavelet transform and spectral subtraction techniques, plus model and prototype update blocks. Unlike prior methods that restrict updates to specific layers, our approach updates the complete quantized model, made possible by the compact model architecture. A subset of input samples is selected at runtime using class prototypes and confidence-driven filtering; these samples are then pseudo-labeled and combined with a rehearsal buffer for incremental model retraining. Experimental results on a noisy test dataset demonstrate the framework's effectiveness, achieving 99.63\% accuracy on clean data and maintaining robust performance (exceeding 94\% accuracy) across diverse noisy environments, even at -10 dB Signal-to-Noise Ratio. The proposed framework confirms that integrating efficient denoising with prototype-based continual learning enables KWS models to operate autonomously and robustly in resource-constrained, dynamic environments.
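Spectral subtraction, one stage of the denoising pipeline named above, is a standard technique that can be sketched as follows. This is a textbook single-frame version under assumed parameters (over-subtraction factor, spectral floor), not the paper's exact multi-stage pipeline, and the noise magnitude estimate is taken as given.

```python
import numpy as np

def spectral_subtraction(frame, noise_mag, over=1.0, floor=0.01):
    """One step of magnitude spectral subtraction (illustrative sketch).

    frame:     1-D time-domain signal frame
    noise_mag: estimated noise magnitude spectrum, length len(frame)//2 + 1
               (e.g. averaged over frames assumed to contain only noise)
    Subtracts the scaled noise magnitude from the frame's magnitude spectrum,
    clamps to a spectral floor to avoid negative magnitudes, and reconstructs
    the time-domain frame using the noisy phase.
    """
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)
    clean_mag = np.maximum(mag - over * noise_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(frame))
```

In a continual-learning loop like the one described, cleaner frames feed both the dual-input feature extraction and the confidence-driven sample selection, so denoising quality directly affects which samples get pseudo-labeled.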
https://arxiv.org/abs/2601.16158
The success of CLIP has driven substantial progress in text-video retrieval. However, current methods often suffer from "blind" feature interaction, where the model struggles to discern key visual information from background noise due to the sparsity of textual queries. To bridge this gap, we draw inspiration from human cognitive behavior and propose the Human Vision-Driven (HVD) model. Our framework establishes a coarse-to-fine alignment mechanism comprising two key components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM mimics the human macro-perception ability by selecting key frames to eliminate temporal redundancy. Subsequently, PFCM simulates micro-perception by aggregating patch features into salient visual entities through an advanced attention mechanism, enabling precise entity-level matching. Extensive experiments on five benchmarks demonstrate that HVD not only captures human-like visual focus but also achieves state-of-the-art performance.
https://arxiv.org/abs/2601.16155
Modern data systems increasingly operate under conditions of persistent legal, political, and analytic disagreement. In such settings, interoperability cannot rely on shared interpretation, negotiated semantics, or centralized authority. Instead, representations must function as neutral substrates that preserve stable reference across incompatible extensions. This paper investigates the structural constraints imposed on ontological design by this requirement. Building on a neutrality framework that treats interpretive non-commitment and stability under extension as explicit design constraints, we ask what minimal ontological structure is forced if accountability relationships are to remain referable and comparable under disagreement. Minimality here is not mere parsimony: a reduction is admissible only if it does not reintroduce stability-critical distinctions as hidden roles, flags, or contextual predicates. We establish a conditional lower-bound result: any ontology capable of supporting accountability under persistent disagreement must realize at least six distinct identity-and-persistence regimes. We further show that a construction with exactly six such regimes is sufficient to satisfy the stated requirements without embedding causal or normative commitments in the substrate. The result is not a proposal for a universal ontology, but a constraint on what is possible when neutrality and stable reference are treated as non-negotiable design goals.
https://arxiv.org/abs/2601.16152
Melodic harmonization, the task of generating harmonic accompaniments for a given melody, remains a central challenge in computational music generation. Recent single-encoder transformer approaches have framed harmonization as a masked sequence modeling problem, but existing training curricula inspired by discrete diffusion often result in weak (cross) attention between melody and harmony. This leads to limited exploitation of melodic cues, particularly in out-of-domain contexts. In this work, we introduce a training curriculum, FF (full-to-full), which keeps all harmony tokens masked for several training steps before progressively unmasking entire sequences during training to strengthen melody-harmony interactions. We systematically evaluate this approach against prior curricula across multiple experimental axes, including temporal quantization (quarter vs. sixteenth note), bar-level vs. time-signature conditioning, melody representation (full range vs. pitch class), and inference-time unmasking strategies. Models are trained on the HookTheory dataset and evaluated both in-domain and on a curated collection of jazz standards, using a comprehensive set of metrics that assess chord progression structure, harmony-melody alignment, and rhythmic coherence. Results demonstrate that the proposed FF curriculum consistently outperforms baselines in nearly all metrics, with particularly strong gains in out-of-domain evaluations where harmonic adaptability to novel melodic cues is crucial. We further find that quarter-note quantization, intertwining of bar tokens, and pitch-class melody representations are advantageous in the FF setting. Our findings highlight the importance of training curricula in enabling effective melody conditioning and suggest that full-to-full unmasking offers a robust strategy for single-encoder harmonization.
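A full-to-full curriculum of the kind described above can be expressed as a masking schedule over training steps. The linear decay and the warmup fraction below are assumptions for illustration; the abstract only specifies that all harmony tokens stay masked for several steps before entire sequences are progressively unmasked.

```python
def ff_mask_fraction(step, total_steps, full_frac=0.3):
    """FF (full-to-full) curriculum: fraction of harmony tokens kept masked.

    All harmony tokens stay masked for the first `full_frac` of training
    (the "full" phase), after which the masked fraction decays linearly to
    zero so entire sequences are progressively unmasked. The decay shape
    and `full_frac` value are illustrative, not taken from the paper.
    """
    warmup = int(full_frac * total_steps)
    if step < warmup:
        return 1.0                      # full phase: everything masked
    remaining = total_steps - warmup
    return max(0.0, 1.0 - (step - warmup) / max(remaining, 1))
```

Keeping harmony fully masked early forces the model to predict harmony from the melody alone, which is the mechanism credited with strengthening melody-harmony cross-attention.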
https://arxiv.org/abs/2601.16150
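The full-to-full curriculum described in the abstract can be illustrated with a small sketch. The schedule below is a plausible reading of the idea, not the paper's implementation: every harmony token stays masked for an initial warm-up phase, after which the masking ratio is lowered (here linearly, an assumption) so that entire sequences are progressively unmasked while the melody remains fully visible.

```python
import random

def ff_mask_ratio(step, total_steps, full_mask_steps):
    """Full-to-full schedule (illustrative): keep every harmony token
    masked for the first `full_mask_steps`, then lower the masking
    ratio toward zero over the remaining training steps."""
    if step < full_mask_steps:
        return 1.0
    progress = (step - full_mask_steps) / max(1, total_steps - full_mask_steps)
    return max(0.0, 1.0 - progress)

def mask_harmony(harmony_tokens, ratio, mask_token="[MASK]", rng=random):
    """Mask a fraction `ratio` of the harmony tokens; the melody track
    is never masked, so cross-attention must rely on melodic cues."""
    n_mask = round(ratio * len(harmony_tokens))
    masked_idx = set(rng.sample(range(len(harmony_tokens)), n_mask))
    return [mask_token if i in masked_idx else tok
            for i, tok in enumerate(harmony_tokens)]
```

Under this schedule the model first learns to harmonize from melody alone, which is one way the curriculum could strengthen melody-harmony attention before easier partially-masked examples appear.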
Generating animated 3D objects is at the heart of many applications, yet most advanced works are typically difficult to apply in practice because of their restrictive setups, long runtimes, or limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes "in action" in a feed-forward manner. Drawing inspiration from early video models, our key insight is to modify existing 3D diffusion models to include a temporal axis, resulting in a framework we dub "temporal 3D diffusion". Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying and independent 3D shapes. Second, we design a temporal 3D autoencoder that translates a sequence of independent shapes into the corresponding deformations of a pre-defined reference shape, allowing us to build an animation. Combining these two components, ActionMesh generates animated 3D meshes from different inputs such as a monocular video, a text description, or even a 3D mesh with a text prompt describing its animation. Moreover, compared to previous approaches, our method is fast and produces results that are rig-free and topology-consistent, enabling rapid iteration and seamless applications like texturing and retargeting. We evaluate our model on standard video-to-4D benchmarks (Consistent4D, Objaverse) and report state-of-the-art performance on both geometric accuracy and temporal consistency, demonstrating that our model can deliver animated 3D meshes with unprecedented speed and quality.
https://arxiv.org/abs/2601.16148
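The output representation targeted by the temporal 3D autoencoder, a single reference mesh deformed over time with fixed topology, can be sketched in a few lines. This toy example is not the paper's method; it only shows why topology consistency makes texturing and retargeting straightforward: the face list never changes across frames, so any per-vertex attribute carries over.

```python
import numpy as np

# Reference mesh: one triangle (illustrative toy geometry).
verts_ref = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
faces = np.array([[0, 1, 2]])  # fixed topology, shared by every frame

# An animation is just per-frame vertex offsets of the reference
# (here a simple translation growing over T frames).
T = 4
offsets = np.stack([np.full_like(verts_ref, 0.1 * t) for t in range(T)])
animation = verts_ref[None] + offsets  # shape (T, V, 3)

# Because `faces` is shared, a texture or rig defined once on the
# reference applies to every deformed frame unchanged.
```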
Existing approaches for watermarking AI-generated images often rely on post-hoc methods applied in pixel space, introducing computational overhead and potential visual artifacts. In this work, we explore latent-space watermarking and introduce DistSeal, a unified approach for latent watermarking that works across both diffusion and autoregressive models. Our approach trains post-hoc watermarking models in the latent space of generative models. We demonstrate that these latent watermarkers can be effectively distilled either into the generative model itself or into the latent decoder, enabling in-model watermarking. The resulting latent watermarks achieve competitive robustness while offering similar imperceptibility and up to a 20x speedup compared to pixel-space baselines. Our experiments further reveal that distilling latent watermarkers outperforms distilling pixel-space ones, providing a solution that is both more efficient and more robust.
https://arxiv.org/abs/2601.16140
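The distillation idea, folding a frozen latent watermarker into the decoder so watermarking costs nothing at generation time, can be sketched with toy linear stand-ins. All names and the MSE objective here are illustrative assumptions, not DistSeal's actual training recipe: a student decoder is regressed onto the frozen "watermark then decode" pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy latent dimensionality

# Frozen pieces (linear stand-ins for real networks):
W_dec = rng.standard_normal((d, d))         # original latent decoder
W_mark = 0.1 * rng.standard_normal((d, d))  # post-hoc latent watermarker

def teacher(z):
    """Watermark in latent space, then decode with the original decoder."""
    return (z + z @ W_mark) @ W_dec

# Student decoder, initialised from the original decoder, is distilled so
# that student(z) on a clean latent matches the watermarked pipeline.
W_student = W_dec.copy()
lr = 0.05
for _ in range(500):
    z = rng.standard_normal((32, d))
    pred, tgt = z @ W_student, teacher(z)
    grad = 2 * z.T @ (pred - tgt) / len(z)   # MSE gradient
    W_student -= lr * grad
```

After distillation the student alone produces watermarked outputs from clean latents, which is the sense in which the watermark moves "in-model" and removes the post-hoc step.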
The Arabic language has undergone notable transformations over time, including the emergence of new vocabulary, the obsolescence of other vocabulary, and shifts in word usage. This evolution is evident in the distinction between the classical and modern Arabic eras. Although historians and linguists have partitioned Arabic literature into multiple eras, relatively little research has explored the automatic classification of Arabic texts by time period, particularly beyond the domain of poetry. This paper addresses this gap by employing neural networks and deep learning techniques to automatically classify Arabic texts into distinct eras and periods. The proposed models are evaluated using two datasets derived from two publicly available corpora, covering texts from the pre-Islamic to the modern era. The study examines classification setups ranging from binary to 15-class and considers both predefined historical eras and custom periodizations. Results range from F1-scores of 0.83 and 0.79 on the binary-era classification task using the OpenITI and APCD datasets, respectively, down to 0.20 on the 15-era classification task using OpenITI and 0.18 on the 12-era classification task using APCD.
https://arxiv.org/abs/2601.16138
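The scores quoted above are F1-scores over multi-class era labels. As a reference point, here is a self-contained macro-averaged F1 (the per-class average; whether the paper uses macro or weighted averaging is an assumption), which shows why the metric drops sharply as the number of era classes grows.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute per-class precision/recall/F1 over the
    union of observed labels, then average the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_per_class = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1_per_class.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1_per_class) / len(f1_per_class)
```

With 15 era classes, a single hard-to-separate class pulls the macro average down by up to 1/15, which helps read the 0.20 and 0.18 figures.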