We show that deep neural networks trained across diverse tasks exhibit remarkably similar low-dimensional parametric subspaces. We provide the first large-scale empirical evidence that neural networks systematically converge to shared spectral subspaces regardless of initialization, task, or domain. Through mode-wise spectral analysis of over 1100 models (including 500 Mistral-7B LoRAs, 500 Vision Transformers, and 50 LLaMA-8B models), we identify universal subspaces that capture the majority of variance in just a few principal directions. By applying spectral decomposition techniques to the weight matrices of diverse architectures trained on a wide range of tasks and datasets, we identify sparse, joint subspaces that are consistently exploited within shared architectures across diverse tasks and datasets. Our findings offer new insights into the intrinsic organization of information within deep networks and raise the question of whether these universal subspaces can be discovered without extensive data and computational resources. This inherent structure also has significant implications for model reusability, multi-task learning, model merging, and the development of training- and inference-efficient algorithms, potentially reducing the carbon footprint of large-scale neural models.
https://arxiv.org/abs/2512.05117
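The core measurement behind the paper above (how much variance a few principal directions capture) reduces to an SVD of each weight matrix. A minimal numpy sketch, using a toy low-rank matrix rather than any of the paper's actual models:

```python
import numpy as np

def top_k_variance_fraction(weight: np.ndarray, k: int) -> float:
    """Fraction of spectral energy captured by the top-k singular directions."""
    s = np.linalg.svd(weight, compute_uv=False)
    energy = s ** 2  # squared singular values = variance along each direction
    return float(energy[:k].sum() / energy.sum())

# Toy weight matrix whose energy is concentrated in a few directions:
# a rank-4 signal plus small Gaussian noise.
rng = np.random.default_rng(0)
u = rng.standard_normal((64, 4))
v = rng.standard_normal((4, 64))
w = u @ v + 0.01 * rng.standard_normal((64, 64))

frac = top_k_variance_fraction(w, k=4)
print(f"top-4 directions capture {frac:.2%} of spectral energy")
```

For such a matrix, a handful of directions carry essentially all of the spectral energy, which is the kind of concentration the paper reports across trained models.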
While methods exist for aligning flow matching models (a popular and effective class of generative models) with human preferences, existing approaches fail to achieve both adaptation efficiency and probabilistically sound prior preservation. In this work, we leverage the theory of optimal control and propose VGG-Flow, a gradient-matching-based method for finetuning pretrained flow matching models. The key idea behind this algorithm is that the optimal difference between the finetuned velocity field and the pretrained one should match the gradient field of a value function. The method not only incorporates first-order information from the reward model but also benefits from heuristic initialization of the value function, enabling fast adaptation. Empirically, we show on a popular text-to-image flow matching model, Stable Diffusion 3, that our method can finetune flow matching models under limited computational budgets while achieving effective and prior-preserving alignment.
https://arxiv.org/abs/2512.05116
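The gradient-matching idea above can be sketched in a few lines. This is an illustrative stand-in (toy numpy arrays and a plain squared-error match), not the paper's actual training objective:

```python
import numpy as np

def gradient_matching_loss(v_finetuned, v_pretrained, grad_value):
    """Penalize deviation of (v_finetuned - v_pretrained) from the value-function gradient field."""
    residual = (v_finetuned - v_pretrained) - grad_value
    return float(np.mean(residual ** 2))

# Toy velocity fields evaluated at a batch of 2-D points (shapes are illustrative).
rng = np.random.default_rng(1)
v_pre = rng.standard_normal((8, 2))   # pretrained velocity field
grad_v = rng.standard_normal((8, 2))  # gradient of a value function
v_ft = v_pre + grad_v                 # a finetuned field that matches exactly

loss = gradient_matching_loss(v_ft, v_pre, grad_v)
print(loss)  # effectively zero at the optimum
```

At the optimum the finetuned field differs from the pretrained one exactly by the value-function gradient, so the loss vanishes; any other deviation is penalized.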
Recent advances in illumination control extend image-based methods to video, yet they still face a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings.
https://arxiv.org/abs/2512.05115
Segmentation of magnetic resonance images (MRI) facilitates analysis of human brain development by delineating anatomical structures. However, in infants and young children, accurate segmentation is challenging due to developmental and imaging constraints. Pediatric brain MRI is notoriously difficult to acquire, with inconsistent availability of imaging modalities, substantial non-head anatomy in the field of view, and frequent motion artifacts. This has led to specialized segmentation models that are often limited to specific image types or narrow age groups, or that are fragile on more variable images, such as those acquired clinically. We address this method fragmentation with BabySeg, a deep learning brain segmentation framework for infants and young children that supports diverse MRI protocols, including repeat scans and image types unavailable during training. Our approach builds on recent domain randomization techniques, which synthesize training images far beyond realistic bounds to promote dataset shift invariance. We also describe a mechanism that enables models to flexibly pool and interact features from any number of input scans. We demonstrate state-of-the-art performance that matches or exceeds the accuracy of several existing methods for various age cohorts and input configurations using a single model, in a fraction of the runtime required by many existing tools.
https://arxiv.org/abs/2512.05114
Synthesizing high-fidelity frozen 3D scenes from monocular Mannequin-Challenge (MC) videos is a unique problem distinct from standard dynamic scene reconstruction. Instead of focusing on modeling motion, our goal is to create a frozen scene while strategically preserving subtle dynamics to enable user-controlled instant selection. To achieve this, we introduce a novel application of dynamic Gaussian splatting: the scene is modeled dynamically, which retains nearby temporal variation, and a static scene is rendered by fixing the model's time parameter. However, under this usage, monocular capture with sparse temporal supervision introduces artifacts like ghosting and blur for Gaussians that become unobserved or occluded at weakly supervised timestamps. We propose Splannequin, an architecture-agnostic regularization that detects two states of Gaussian primitives, hidden and defective, and applies temporal anchoring. Under predominantly forward camera motion, hidden states are anchored to their recent well-observed past states, while defective states are anchored to future states with stronger supervision. Our method integrates into existing dynamic Gaussian pipelines via simple loss terms, requires no architectural changes, and adds zero inference overhead. This results in markedly improved visual quality, enabling high-fidelity, user-selectable frozen-time renderings, validated by a 96% user preference. Project page: this https URL
https://arxiv.org/abs/2512.05113
Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual content in CoT for better planning and verification. Our method first generates a low-resolution draft image as a preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and the input prompt, and to perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty of generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves substantial gains on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other CoT-empowered generation methods.
https://arxiv.org/abs/2512.05112
Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.
https://arxiv.org/abs/2512.05111
We introduce ShadowDraw, a framework that transforms ordinary 3D objects into shadow-drawing compositional art. Given a 3D object, our system predicts scene parameters, including object pose and lighting, together with a partial line drawing, such that the cast shadow completes the drawing into a recognizable image. To this end, we optimize scene configurations to reveal meaningful shadows, employ shadow strokes to guide line drawing generation, and adopt automatic evaluation to enforce shadow-drawing coherence and visual quality. Experiments show that ShadowDraw produces compelling results across diverse inputs, from real-world scans and curated datasets to generative assets, and naturally extends to multi-object scenes, animations, and physical deployments. Our work provides a practical pipeline for creating shadow-drawing art and broadens the design space of computational visual art, bridging the gap between algorithmic design and artistic storytelling. Check out our project page this https URL for more results and an end-to-end real-world demonstration of our pipeline!
https://arxiv.org/abs/2512.05110
Recent advances in Vision-Language-Action (VLA) models, powered by large language models and reinforcement learning-based fine-tuning, have shown remarkable progress in robotic manipulation. Existing methods often treat long-horizon actions as linguistic sequences and apply trajectory-level optimization methods such as Trajectory-wise Preference Optimization (TPO) or Proximal Policy Optimization (PPO), leading to coarse credit assignment and unstable training. However, unlike language, where a unified semantic meaning is preserved despite flexible sentence order, action trajectories progress through causally chained stages with different learning difficulties. This motivates progressive stage optimization. We therefore present Stage-Aware Reinforcement (STARE), a module that decomposes a long-horizon action trajectory into semantically meaningful stages and provides dense, interpretable, and stage-aligned reinforcement signals. Integrating STARE into TPO and PPO, we obtain Stage-Aware TPO (STA-TPO) and Stage-Aware PPO (STA-PPO) for offline stage-wise preference and online intra-stage interaction, respectively. Further building on supervised fine-tuning as initialization, we propose Imitation -> Preference -> Interaction (IPI), a serial fine-tuning pipeline for improving action accuracy in VLA models. Experiments on SimplerEnv and ManiSkill3 demonstrate substantial gains, achieving state-of-the-art success rates of 98.0 percent on SimplerEnv and 96.4 percent on ManiSkill3 tasks.
https://arxiv.org/abs/2512.05107
Long-context reasoning in large language models (LLMs) has been shown to enhance their cognitive capabilities via chain-of-thought (CoT) inference. Training such models is usually done via reinforcement learning with verifiable rewards (RLVR) on reasoning problems, such as math and programming. However, RLVR is limited by several bottlenecks, such as the lack of dense rewards and inadequate sample efficiency, and consequently requires significant compute resources in the post-training phase. To overcome these limitations, we propose Semantic Soft Bootstrapping (SSB), a self-distillation technique in which the same base language model plays the role of both teacher and student, but receives different semantic contexts about the correctness of its outcome at training time. The model is first prompted with a math problem and several rollouts are generated. From them, the correct and the most common incorrect responses are filtered and then provided to the model in context to produce a more robust, step-by-step explanation with a verified final answer. This pipeline automatically curates a paired teacher-student training set from raw problem-answer data, without any human intervention. The generation process also produces a sequence of logits, which the student model tries to match during training from the bare question alone. In our experiments, we fine-tune Qwen2.5-3B-Instruct on the GSM8K dataset via parameter-efficient fine-tuning and then test its accuracy on the MATH500 and AIME2024 benchmarks. Our experiments show improvements of 10.6% and 10% in accuracy, respectively, over group relative policy optimization (GRPO), a commonly used RLVR algorithm. Our code is available at this https URL, and the model and curated dataset are available at this https URL.
https://arxiv.org/abs/2512.05105
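The rollout-filtering step of the SSB pipeline (keep one correct answer plus the most common incorrect one) can be sketched as follows; the dict format and helper name are illustrative assumptions, not the paper's code:

```python
from collections import Counter

def curate_pair(rollouts, gold_answer):
    """From sampled rollouts, keep one correct answer and the most common wrong one."""
    answers = [r["answer"] for r in rollouts]
    correct = next((a for a in answers if a == gold_answer), None)
    wrong = Counter(a for a in answers if a != gold_answer)
    most_common_wrong = wrong.most_common(1)[0][0] if wrong else None
    return correct, most_common_wrong

# Five rollouts for one math problem, with gold answer "42".
rollouts = [{"answer": "42"}, {"answer": "41"}, {"answer": "41"},
            {"answer": "42"}, {"answer": "40"}]
print(curate_pair(rollouts, gold_answer="42"))  # ('42', '41')
```

Both filtered responses would then be placed in the teacher's context so it can produce the verified step-by-step explanation whose logits the student matches.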
Standard diffusion corrupts data using Gaussian noise whose Fourier coefficients have random magnitudes and random phases. While effective for unconditional or text-to-image generation, corrupting phase components destroys spatial structure, making it ill-suited for tasks requiring geometric consistency, such as re-rendering, simulation enhancement, and image-to-image translation. We introduce Phase-Preserving Diffusion (φ-PD), a model-agnostic reformulation of the diffusion process that preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes or additional parameters. We further propose Frequency-Selective Structured (FSS) noise, which provides continuous control over structural rigidity via a single frequency-cutoff parameter. φ-PD adds no inference-time cost and is compatible with any diffusion model for images or videos. Across photorealistic and stylized re-rendering, as well as sim-to-real enhancement for driving planners, φ-PD produces controllable, spatially aligned results. When applied to the CARLA simulator, φ-PD improves CARLA-to-Waymo planner performance by 50%. The method is complementary to existing conditioning approaches and broadly applicable to image-to-image and video-to-video generation. Videos, additional examples, and code are available on our project page: this https URL.
https://arxiv.org/abs/2512.05106
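The phase-preserving corruption that φ-PD builds on can be illustrated with a small numpy FFT sketch: keep the input's phase spectrum and resample only the magnitudes. The toy image and exact construction here are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def phase_preserving_noise(image: np.ndarray, rng) -> np.ndarray:
    """Randomize Fourier magnitudes while keeping the input's phase spectrum."""
    phase = np.angle(np.fft.fft2(image))
    # Magnitudes drawn from the spectrum of fresh Gaussian noise.
    random_magnitude = np.abs(np.fft.fft2(rng.standard_normal(image.shape)))
    # Conjugate symmetry of both factors keeps the inverse transform real.
    return np.real(np.fft.ifft2(random_magnitude * np.exp(1j * phase)))

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
noisy = phase_preserving_noise(img, rng)

# The corrupted image has new magnitudes but the original phase structure.
orig_phase = np.angle(np.fft.fft2(img))
new_phase = np.angle(np.fft.fft2(noisy))
```

Because phase carries the spatial layout of an image, the corrupted sample stays geometrically aligned with the input, which is what makes the process suitable for re-rendering-style tasks.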
All-in-One Image Restoration (AiOIR) tasks often involve diverse degradations that require robust and versatile strategies. However, most existing approaches lack explicit frequency modeling and rely on fixed or heuristic optimization schedules, which limits generalization across heterogeneous degradations. To address these limitations, we propose EvoIR, an AiOIR-specific framework that introduces evolutionary frequency modulation for dynamic and adaptive image restoration. Specifically, EvoIR employs a Frequency-Modulated Module (FMM) that explicitly decomposes features into high- and low-frequency branches and adaptively modulates them to enhance both structural fidelity and fine-grained detail. Central to EvoIR, an Evolutionary Optimization Strategy (EOS) iteratively adjusts frequency-aware objectives through a population-based evolutionary process, dynamically balancing structural accuracy and perceptual fidelity. Its evolutionary guidance further mitigates gradient conflicts across degradations and accelerates convergence. By synergizing FMM and EOS, EvoIR yields greater improvements than either component alone, underscoring their complementary roles. Extensive experiments on multiple benchmarks demonstrate that EvoIR outperforms state-of-the-art AiOIR methods.
https://arxiv.org/abs/2512.05104
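The explicit high/low-frequency decomposition at the heart of the FMM can be sketched with an FFT mask. The radial cutoff and toy feature map here are assumptions for illustration, not EvoIR's actual module:

```python
import numpy as np

def frequency_split(feat: np.ndarray, cutoff: int):
    """Split a 2-D feature map into low- and high-frequency branches via FFT masking."""
    spectrum = np.fft.fftshift(np.fft.fft2(feat))
    h, w = feat.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= cutoff  # disc around the (shifted) DC component
    low = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)))
    high = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)))
    return low, high

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
low, high = frequency_split(x, cutoff=4)
print(np.allclose(low + high, x))  # the two branches reconstruct the input
```

Since the two masks partition the spectrum, the branches sum back to the input exactly; a module like the FMM can then reweight each branch before recombining them.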
Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.
https://arxiv.org/abs/2512.05103
Recent works on structured text translation remain limited to the sentence level, as they struggle to effectively handle complex document-level XML or HTML structures. To address this, we propose Format Reinforcement Learning (FormatRL), which employs Group Relative Policy Optimization on top of a supervised fine-tuning model to directly optimize novel structure-aware rewards: 1) TreeSim, which measures structural similarity between predicted and reference XML trees, and 2) Node-chrF, which measures translation quality at the level of XML nodes. Additionally, we apply StrucAUC, a fine-grained metric that distinguishes minor errors from major structural failures. Experiments on the SAP software-documentation benchmark demonstrate improvements across six metrics, and an analysis further shows how different reward functions contribute to improvements in both structural and translation quality.
https://arxiv.org/abs/2512.05100
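TreeSim itself is defined in the paper; as a hypothetical stand-in, a structural-similarity score between a predicted and a reference XML tree can be computed from the Dice overlap of their multisets of root-to-node tag paths:

```python
import xml.etree.ElementTree as ET
from collections import Counter

def tag_paths(elem, prefix=""):
    """Yield every root-to-node tag path in the tree."""
    path = f"{prefix}/{elem.tag}"
    yield path
    for child in elem:
        yield from tag_paths(child, path)

def tree_similarity(xml_a: str, xml_b: str) -> float:
    """Dice overlap between the two trees' multisets of tag paths (0..1)."""
    a = Counter(tag_paths(ET.fromstring(xml_a)))
    b = Counter(tag_paths(ET.fromstring(xml_b)))
    overlap = sum((a & b).values())
    return 2 * overlap / (sum(a.values()) + sum(b.values()))

ref = "<doc><p><b>x</b></p><p>y</p></doc>"
hyp = "<doc><p><b>x</b></p></doc>"   # prediction dropped one <p> node
print(round(tree_similarity(ref, hyp), 3))
```

A score of 1.0 means identical tag structure; dropping or inventing nodes lowers it smoothly, which is the kind of dense structural signal a reward like TreeSim provides to the policy.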
In recent years, Image Quality Assessment (IQA) for AI-generated images (AIGI) has advanced rapidly; however, existing methods primarily target portraits and artistic images, lacking a systematic evaluation of interior scenes. We introduce Spatial Aesthetics, a paradigm that assesses the aesthetic quality of interior images along four dimensions: layout, harmony, lighting, and distortion. We construct SA-BENCH, the first benchmark for spatial aesthetics, comprising 18,000 images and 50,000 precise annotations. Employing SA-BENCH, we systematically evaluate current IQA methodologies and develop SA-IQA, through MLLM fine-tuning and a multidimensional fusion approach, as a comprehensive reward framework for assessing spatial aesthetics. We apply SA-IQA to two downstream tasks: (1) serving as a reward signal integrated with GRPO reinforcement learning to optimize the AIGC generation pipeline, and (2) Best-of-N selection to filter high-quality images and improve generation quality. Experiments indicate that SA-IQA significantly outperforms existing methods on SA-BENCH, setting a new standard for spatial aesthetics evaluation. Code and dataset will be open-sourced to advance research and applications in this domain.
https://arxiv.org/abs/2512.05098
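Best-of-N selection is the simpler of the two downstream uses described above: score N generated candidates with the reward model and keep the argmax. A toy sketch (the labels and scores are made up; a real system would score generated images with SA-IQA):

```python
def best_of_n(candidates, reward_fn):
    """Keep the candidate that scores highest under the reward model."""
    return max(candidates, key=reward_fn)

# Hypothetical per-candidate aesthetic scores standing in for SA-IQA outputs.
scores = {"img_a": 0.62, "img_b": 0.91, "img_c": 0.47}
print(best_of_n(scores, scores.get))  # img_b
```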
Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts, holding the potential to serve as high-level planners for contextual robot control. To realize this potential, a key research question remains open: how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video. To address this, we introduce a two-stage pipeline. First, we lift video pixels into a 4D human representation and then retarget it to the humanoid morphology. Second, we propose GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints and trained with symmetry regularization and keypoint-weighted tracking rewards. As a result, GenMimic can mimic human actions from noisy, generated videos. We curate GenMimicBench, a synthetic human-motion dataset generated using two video generation models across a spectrum of actions and contexts, establishing a benchmark for assessing zero-shot generalization and policy robustness. Extensive experiments demonstrate improvements over strong baselines in simulation and confirm coherent, physically stable motion tracking on a Unitree G1 humanoid robot without fine-tuning. This work offers a promising path to realizing the potential of video generation models as high-level policies for robot control.
https://arxiv.org/abs/2512.05094
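The abstract does not give the exact form of GenMimic's keypoint-weighted tracking reward; a common shape for such rewards (assumed here for illustration, not taken from the paper) is an exponentiated weighted keypoint error:

```python
import numpy as np

def keypoint_tracking_reward(pred, target, weights, scale=1.0):
    """1.0 at a perfect match; decays toward 0 as the weighted keypoint error grows."""
    err = np.sum(weights * np.sum((pred - target) ** 2, axis=-1))
    return float(np.exp(-scale * err))

target = np.zeros((3, 3))            # three 3-D keypoints from the retargeted video
weights = np.array([2.0, 1.0, 1.0])  # e.g. weight end-effectors more heavily
perfect = keypoint_tracking_reward(target, target, weights)
offset = keypoint_tracking_reward(target + 0.1, target, weights)
print(perfect, offset)
```

The per-keypoint weights let training emphasize joints that matter most for the action, which is the intuition behind keypoint-weighted tracking.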
Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved performance on tasks such as visual grounding and visual question answering. However, the reasoning processes of these models remain largely opaque; they typically output only final predictions without revealing the intermediate steps or fine-grained evidence (e.g., pixels, locations) that lead to the result. This contrasts with human intelligence, which naturally operates through a chain of visual reasoning. To address this limitation, we introduce the Visual Reasoning Tracer (VRT) task, which requires models to not only localize the target object but also explicitly predict the intermediate objects that form the reasoning path. To advance research in this area, we contribute: (1) VRT-Bench, a human-annotated benchmark for evaluating visual reasoning; (2) a new metric for assessing the quality of reasoning traces; and (3) VRT-80k, a large-scale dataset for reasoning model training. Our experiments reveal that while existing models often produce the correct final output, they struggle to ground their intermediate reasoning. In contrast, models trained on VRT-80k achieve substantial improvements in tracing the reasoning path.
https://arxiv.org/abs/2512.05091
Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that address this without any fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12x extrapolation (e.g. 5s-trained to 60s+ generation) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, almost maintaining overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.
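The cache policy combining persistent sinks with importance-aware pruning can be sketched as below. This is a simplified illustration of the idea, not the authors' code: the scoring signal (recent attention mass) and the budget split are assumptions.

```python
def prune_kv_cache(cache, attn_scores, num_sink, window, budget):
    """Sketch of sink-plus-importance KV-cache management.
    cache: token entries in temporal order.
    attn_scores: per-token importance, e.g. recent attention mass.
    Keeps (1) the first `num_sink` tokens as persistent sinks,
          (2) the most recent `window` tokens,
          (3) the highest-scoring middle tokens up to `budget`."""
    n = len(cache)
    sinks = list(range(min(num_sink, n)))
    recent = list(range(max(num_sink, n - window), n))
    middle = [i for i in range(n) if i not in sinks and i not in recent]
    # Actively-attended tokens survive; redundant history is discarded.
    middle.sort(key=lambda i: attn_scores[i], reverse=True)
    room = max(0, budget - len(sinks) - len(recent))
    keep = sorted(set(sinks + recent + middle[:room]))
    return [cache[i] for i in keep]
```

Because the pruning rule only reads attention statistics already produced at inference time, it needs no fine-tuning, which is what makes the mechanism training-free.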
https://arxiv.org/abs/2512.05081
Object geometry is key information for robot manipulation. Yet, object reconstruction is a challenging task because cameras only capture partial observations of objects, especially when occlusion occurs. In this paper, we leverage two extra sources of information to reduce the ambiguity of vision signals. First, generative models learn priors of the shapes of commonly seen objects, allowing us to make reasonable guesses of the unseen part of geometry. Second, contact information, which can be obtained from videos and physical interactions, provides sparse constraints on the boundary of the geometry. We combine the two sources of information through contact-guided 3D generation. The guidance formulation is inspired by drag-based editing in generative models. Experiments on synthetic and real-world data show that our approach improves the reconstruction compared to pure 3D generation and contact-based optimization.
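The drag-inspired contact guidance can be sketched in a toy form: each observed contact point pulls the nearest point of the current shape estimate toward it, imposing the sparse boundary constraint during generation. This is a minimal stand-in for the paper's guidance formulation; the nearest-point association and the step size are assumptions.

```python
def contact_guidance_step(points, contacts, step=0.5):
    """One guidance step in the spirit of drag-based editing: for each
    contact, drag the nearest point of the shape estimate toward it.
    points, contacts: lists of (x, y, z) tuples. Returns updated points."""
    updated = [list(p) for p in points]
    for c in contacts:
        # Nearest shape point to this contact constraint.
        i = min(range(len(points)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(points[j], c)))
        for d in range(3):
            updated[i][d] += step * (c[d] - updated[i][d])
    return [tuple(p) for p in updated]
```

Interleaving such steps with the generative sampler is how sparse contact evidence can correct a shape prior that would otherwise hallucinate the occluded geometry.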
https://arxiv.org/abs/2512.05079
Emerging video diffusion models achieve high visual fidelity but fundamentally couple scene dynamics with camera motion, limiting their ability to provide precise spatial and temporal control. We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained manipulation of both scene dynamics and camera viewpoint. Our framework takes continuous world-time sequences and camera trajectories as conditioning inputs, injecting them into the video diffusion model through a 4D positional encoding in the attention layer and adaptive normalizations for feature modulation. To train this model, we curate a unique dataset in which temporal and camera variations are independently parameterized; this dataset will be made public. Experiments show that our model achieves robust real-world 4D control across diverse timing patterns and camera trajectories, while preserving high generation quality and outperforming prior work in controllability. See our website for video results: this https URL
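A 4D positional encoding of the conditioning inputs can be sketched as below: sinusoidal features of the continuous world time concatenated with features of the camera-pose parameters. The frequencies follow the standard transformer recipe; the paper's exact parameterization may differ, so treat this as an assumed illustration.

```python
import math

def encode_4d(t, cam_pose, dim=8):
    """Sinusoidal encoding of world time `t` plus camera pose parameters
    (a flat list, e.g. translation + rotation). Encoding time and pose as
    separate feature groups is what lets attention treat scene dynamics
    and viewpoint as independent axes of control."""
    def sinusoid(x):
        feats = []
        for k in range(dim // 2):
            freq = 1.0 / (10000.0 ** (2 * k / dim))
            feats += [math.sin(x * freq), math.cos(x * freq)]
        return feats
    enc = sinusoid(t)
    for p in cam_pose:
        enc += sinusoid(p)
    return enc  # length: dim * (1 + len(cam_pose))
```

In the framework described above, such an encoding enters the attention layers while adaptive normalizations modulate the features, so varying `t` with a fixed `cam_pose` (or vice versa) exercises exactly one of the two decoupled controls.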
https://arxiv.org/abs/2512.05076