We show that deep neural networks trained across diverse tasks exhibit remarkably similar low-dimensional parametric subspaces. We provide the first large-scale empirical evidence that neural networks systematically converge to shared spectral subspaces regardless of initialization, task, or domain. Through mode-wise spectral analysis of over 1100 models - including 500 Mistral-7B LoRAs, 500 Vision Transformers, and 50 LLaMA-8B models - we identify universal subspaces that capture the majority of variance in just a few principal directions. By applying spectral decomposition to the weight matrices of various architectures trained on a wide range of tasks and datasets, we identify sparse, joint subspaces that are consistently exploited within shared architectures across diverse tasks and datasets. Our findings offer new insights into the intrinsic organization of information within deep networks and raise important questions about the possibility of discovering these universal subspaces without extensive data and computational resources. Furthermore, this inherent structure has significant implications for model reusability, multi-task learning, model merging, and the development of training- and inference-efficient algorithms, potentially reducing the carbon footprint of large-scale neural models.
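As a concrete illustration of the kind of mode-wise spectral analysis described above, the NumPy sketch below compares the top singular subspaces of two weight matrices through principal angles; the function names and toy data are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' code): compare the top-k spectral
# subspaces of two weight matrices via principal angles.
import numpy as np

def top_subspace(weight: np.ndarray, k: int) -> np.ndarray:
    """Top-k left singular vectors of a weight matrix, as an orthonormal basis."""
    u, _, _ = np.linalg.svd(weight, full_matrices=False)
    return u[:, :k]

def subspace_overlap(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared cosine of the principal angles between two subspaces (1.0 = identical)."""
    cosines = np.linalg.svd(a.T @ b, compute_uv=False)
    return float(np.mean(cosines ** 2))

# Toy check: two matrices sharing a low-rank structure overlap strongly,
# while a random matrix does not.
rng = np.random.default_rng(0)
shared = rng.standard_normal((256, 8))
w1 = shared @ rng.standard_normal((8, 256)) + 0.1 * rng.standard_normal((256, 256))
w2 = shared @ rng.standard_normal((8, 256)) + 0.1 * rng.standard_normal((256, 256))
w_rand = rng.standard_normal((256, 256))
print(subspace_overlap(top_subspace(w1, 8), top_subspace(w2, 8)))      # near 1
print(subspace_overlap(top_subspace(w1, 8), top_subspace(w_rand, 8)))  # near 0
```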
https://arxiv.org/abs/2512.05117
While methods exist for aligning flow matching models--a popular and effective class of generative models--with human preferences, existing approaches fail to achieve both adaptation efficiency and probabilistically sound prior preservation. In this work, we leverage the theory of optimal control and propose VGG-Flow, a gradient-matching-based method for finetuning pretrained flow matching models. The key idea behind this algorithm is that the optimal difference between the finetuned velocity field and the pretrained one should be matched with the gradient field of a value function. This method not only incorporates first-order information from the reward model but also benefits from heuristic initialization of the value function to enable fast adaptation. Empirically, we show on a popular text-to-image flow matching model, Stable Diffusion 3, that our method can finetune flow matching models under limited computational budgets while achieving effective and prior-preserving alignment.
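The gradient-matching objective lends itself to a compact formulation. The PyTorch sketch below is a minimal reading of the key idea, assuming v_finetuned, v_pretrained, and value_fn are callable networks; the exact weighting and time sampling in VGG-Flow may differ.

```python
# Minimal sketch (assumed reading, not the paper's implementation): match the
# velocity residual to the gradient field of a value function.
import torch

def gradient_matching_loss(v_finetuned, v_pretrained, value_fn, x_t, t):
    """x_t: (B, D) samples at time t; value_fn returns a scalar per sample."""
    x_t = x_t.requires_grad_(True)
    value = value_fn(x_t, t).sum()
    # Gradient field of the value function with respect to the sample.
    grad_field = torch.autograd.grad(value, x_t, create_graph=True)[0]
    # Difference between the finetuned and (frozen) pretrained velocity fields.
    residual = v_finetuned(x_t, t) - v_pretrained(x_t, t).detach()
    return ((residual - grad_field) ** 2).mean()
```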
https://arxiv.org/abs/2512.05116
Recent advances in illumination control extend image-based methods to video, yet still face a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings.
https://arxiv.org/abs/2512.05115
Segmentation of magnetic resonance images (MRI) facilitates analysis of human brain development by delineating anatomical structures. However, in infants and young children, accurate segmentation is challenging due to developmental and imaging constraints. Pediatric brain MRI is notoriously difficult to acquire, with inconsistent availability of imaging modalities, substantial non-head anatomy in the field of view, and frequent motion artifacts. This has led to specialized segmentation models that are often limited to specific image types or narrow age groups, or that are fragile on more variable images such as those acquired clinically. We address this method fragmentation with BabySeg, a deep learning brain segmentation framework for infants and young children that supports diverse MRI protocols, including repeat scans and image types unavailable during training. Our approach builds on recent domain randomization techniques, which synthesize training images far beyond realistic bounds to promote invariance to dataset shift. We also describe a mechanism that enables models to flexibly pool and exchange features across any number of input scans. We demonstrate state-of-the-art performance that matches or exceeds the accuracy of several existing methods for various age cohorts and input configurations using a single model, in a fraction of the runtime required by many existing tools.
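For the multi-scan mechanism, a minimal sketch of the pooling half is given below, assuming a single encoder shared across scans and a symmetric reduction so any number of inputs can be handled at test time; the cross-scan feature interaction in BabySeg is omitted, and all names are illustrative.

```python
# Minimal sketch (illustrative names): pool features from any number of
# input scans with a symmetric reduction.
import torch
import torch.nn as nn

class MultiScanPool(nn.Module):
    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # maps one scan (B, 1, H, W, D) -> (B, C, h, w, d)

    def forward(self, scans: list) -> torch.Tensor:
        # Encode each scan with the same weights, then average over scans;
        # the mean is invariant to the order and number of inputs.
        feats = torch.stack([self.encoder(s) for s in scans], dim=0)
        return feats.mean(dim=0)
```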
https://arxiv.org/abs/2512.05114
Synthesizing high-fidelity frozen 3D scenes from monocular Mannequin-Challenge (MC) videos is a unique problem distinct from standard dynamic scene reconstruction. Instead of focusing on modeling motion, our goal is to create a frozen scene while strategically preserving subtle dynamics to enable user-controlled instant selection. To achieve this, we introduce a novel application of dynamic Gaussian splatting: the scene is modeled dynamically, which retains nearby temporal variation, and a static scene is rendered by fixing the model's time parameter. However, under this usage, monocular capture with sparse temporal supervision introduces artifacts like ghosting and blur for Gaussians that become unobserved or occluded at weakly supervised timestamps. We propose Splannequin, an architecture-agnostic regularization that detects two states of Gaussian primitives, hidden and defective, and applies temporal anchoring. Under predominantly forward camera motion, hidden states are anchored to their recent well-observed past states, while defective states are anchored to future states with stronger supervision. Our method integrates into existing dynamic Gaussian pipelines via simple loss terms, requires no architectural changes, and adds zero inference overhead. This results in markedly improved visual quality, enabling high-fidelity, user-selectable frozen-time renderings, validated by a 96% user preference. Project page: this https URL
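The temporal anchoring described above can be read as two simple regression losses. The sketch below is a hypothetical rendering, assuming access to per-Gaussian attributes at arbitrary timestamps and precomputed hidden/defective masks; Splannequin's actual detection and weighting may differ.

```python
# Hypothetical sketch: anchor hidden Gaussians to a well-observed past state
# and defective Gaussians to a better-supervised future state.
import torch

def anchoring_loss(attrs, hidden_mask, defective_mask, t, t_past, t_future):
    """attrs(t) -> (N, A) per-Gaussian attributes (e.g., position, opacity) at time t."""
    cur = attrs(t)
    # Hidden at t: pull toward the recent, well-observed past (stop-gradient target).
    loss_hidden = ((cur - attrs(t_past).detach())[hidden_mask] ** 2).mean()
    # Defective at t: pull toward a future state with stronger supervision.
    loss_defective = ((cur - attrs(t_future).detach())[defective_mask] ** 2).mean()
    return loss_hidden + loss_defective
```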
https://arxiv.org/abs/2512.05113
Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as a preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and the input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty of generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves substantial gains on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other CoT-empowered generation methods.
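At inference time, the interleaved paradigm amounts to a draft-verify-refine loop. The pseudocode below sketches one plausible reading of that loop; every method on model is a hypothetical stand-in, not DraCo's actual API.

```python
# Hypothetical stand-in API, illustrating the interleaved loop only.
def draft_as_cot(prompt, model, max_rounds=2):
    draft = model.generate_image(prompt, resolution="low")  # concrete visual plan
    for _ in range(max_rounds):
        issues = model.verify(prompt, draft)                # semantic misalignments
        if not issues:
            break
        draft = model.correct(draft, issues)                # selective corrections
    return model.super_resolve(draft)                       # final high-resolution image
```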
https://arxiv.org/abs/2512.05112
Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.
https://arxiv.org/abs/2512.05111
We introduce ShadowDraw, a framework that transforms ordinary 3D objects into shadow-drawing compositional art. Given a 3D object, our system predicts scene parameters, including object pose and lighting, together with a partial line drawing, such that the cast shadow completes the drawing into a recognizable image. To this end, we optimize scene configurations to reveal meaningful shadows, employ shadow strokes to guide line drawing generation, and adopt automatic evaluation to enforce shadow-drawing coherence and visual quality. Experiments show that ShadowDraw produces compelling results across diverse inputs, from real-world scans and curated datasets to generative assets, and naturally extends to multi-object scenes, animations, and physical deployments. Our work provides a practical pipeline for creating shadow-drawing art and broadens the design space of computational visual art, bridging the gap between algorithmic design and artistic storytelling. Check out our project page this https URL for more results and an end-to-end real-world demonstration of our pipeline!
https://arxiv.org/abs/2512.05110
Standard diffusion corrupts data using Gaussian noise whose Fourier coefficients have random magnitudes and random phases. While effective for unconditional or text-to-image generation, corrupting phase components destroys spatial structure, making it ill-suited for tasks requiring geometric consistency, such as re-rendering, simulation enhancement, and image-to-image translation. We introduce Phase-Preserving Diffusion (φ-PD), a model-agnostic reformulation of the diffusion process that preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes or additional parameters. We further propose Frequency-Selective Structured (FSS) noise, which provides continuous control over structural rigidity via a single frequency-cutoff parameter. φ-PD adds no inference-time cost and is compatible with any diffusion model for images or videos. Across photorealistic and stylized re-rendering, as well as sim-to-real enhancement for driving planners, φ-PD produces controllable, spatially aligned results. When applied to the CARLA simulator, φ-PD improves CARLA-to-Waymo planner performance by 50%. The method is complementary to existing conditioning approaches and broadly applicable to image-to-image and video-to-video generation. Videos, additional examples, and code are available on our project page: this https URL
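The core operation is easy to state in Fourier space: keep the input's phase and randomize magnitudes. The PyTorch sketch below is a minimal 2D grayscale illustration of that idea, under that assumed reading; the paper's exact noise formulation (and the FSS cutoff) may differ.

```python
# Minimal 2D sketch (assumed reading): noise that keeps the input's Fourier
# phase while taking magnitudes from white Gaussian noise.
import torch

def phase_preserving_noise(x: torch.Tensor) -> torch.Tensor:
    """x: (B, H, W) images -> noise with x's phase and random magnitude."""
    phase = torch.angle(torch.fft.fft2(x))
    magnitude = torch.abs(torch.fft.fft2(torch.randn_like(x)))
    # Recombine: random magnitudes, preserved phase -> spatial structure survives.
    return torch.fft.ifft2(magnitude * torch.exp(1j * phase)).real
```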
https://arxiv.org/abs/2512.05106
All-in-One Image Restoration (AiOIR) tasks often involve diverse degradations that require robust and versatile strategies. However, most existing approaches typically lack explicit frequency modeling and rely on fixed or heuristic optimization schedules, which limits generalization across heterogeneous degradations. To address these limitations, we propose EvoIR, an AiOIR-specific framework that introduces evolutionary frequency modulation for dynamic and adaptive image restoration. Specifically, EvoIR employs the Frequency-Modulated Module (FMM), which explicitly decomposes features into high- and low-frequency branches and adaptively modulates them to enhance both structural fidelity and fine-grained details. Central to EvoIR, an Evolutionary Optimization Strategy (EOS) iteratively adjusts frequency-aware objectives through a population-based evolutionary process, dynamically balancing structural accuracy and perceptual fidelity. Its evolutionary guidance further mitigates gradient conflicts across degradations and accelerates convergence. By synergizing FMM and EOS, EvoIR yields greater improvements than using either component alone, underscoring their complementary roles. Extensive experiments on multiple benchmarks demonstrate that EvoIR outperforms state-of-the-art AiOIR methods.
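An explicit high/low-frequency split of features can be as simple as a low-pass filter plus a residual. The sketch below uses a box filter for illustration, assuming feat == low + high; FMM's actual decomposition and learned modulation are likely more elaborate.

```python
# Illustrative split with a box low-pass filter; feat == low + high by construction.
import torch
import torch.nn.functional as F

def split_frequencies(feat: torch.Tensor, kernel_size: int = 5):
    """feat: (B, C, H, W) -> (low, high) frequency branches."""
    pad = kernel_size // 2
    low = F.avg_pool2d(feat, kernel_size, stride=1, padding=pad,
                       count_include_pad=False)
    high = feat - low  # residual keeps edges and fine-grained detail
    return low, high
```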
https://arxiv.org/abs/2512.05104
Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.
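Conceptually, inference interleaves two generators under a learned switching decision. The pseudocode below sketches one plausible reading of that loop; every method on model is a hypothetical placeholder, not TV2TV's API.

```python
# Hypothetical placeholder API, showing only the interleaving control flow.
def tv2tv_generate(model, prompt, n_frames):
    context, frames = [prompt], []
    while len(frames) < n_frames:
        if model.next_is_text(context):       # "think in words" about what happens next
            context.append(model.generate_text(context))
        else:                                 # "act in pixels"
            frame = model.generate_frame(context)
            frames.append(frame)
            context.append(frame)
    return frames
```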
https://arxiv.org/abs/2512.05103
In recent years, Image Quality Assessment (IQA) for AI-generated images (AIGI) has advanced rapidly; however, existing methods primarily target portraits and artistic images, lacking a systematic evaluation of interior scenes. We introduce Spatial Aesthetics, a paradigm that assesses the aesthetic quality of interior images along four dimensions: layout, harmony, lighting, and distortion. We construct SA-BENCH, the first benchmark for spatial aesthetics, comprising 18,000 images and 50,000 precise annotations. Employing SA-BENCH, we systematically evaluate current IQA methodologies and develop SA-IQA, through MLLM fine-tuning and a multidimensional fusion approach, as a comprehensive reward framework for assessing spatial aesthetics. We apply SA-IQA to two downstream tasks: (1) serving as a reward signal integrated with GRPO reinforcement learning to optimize the AIGC generation pipeline, and (2) Best-of-N selection to filter high-quality images and improve generation quality. Experiments indicate that SA-IQA significantly outperforms existing methods on SA-BENCH, setting a new standard for spatial aesthetics evaluation. Code and dataset will be open-sourced to advance research and applications in this domain.
https://arxiv.org/abs/2512.05098
Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts, holding the potential to serve as high-level planners for contextual robot control. To realize this potential, a key research question remains open: how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video. To address this, we introduce a two-stage pipeline. First, we lift video pixels into a 4D human representation and then retarget to the humanoid morphology. Second, we propose GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints and trained with symmetry regularization and keypoint-weighted tracking rewards. As a result, GenMimic can mimic human actions from noisy, generated videos. We curate GenMimicBench, a synthetic human-motion dataset generated using two video generation models across a spectrum of actions and contexts, establishing a benchmark for assessing zero-shot generalization and policy robustness. Extensive experiments demonstrate improvements over strong baselines in simulation and confirm coherent, physically stable motion tracking on a Unitree G1 humanoid robot without fine-tuning. This work offers a promising path to realizing the potential of video generation models as high-level policies for robot control.
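A keypoint-weighted tracking reward can be sketched in a few lines. The NumPy snippet below assumes reference keypoints from the retargeted video and corresponding robot keypoints; the weights and exponential scale are illustrative choices, not GenMimic's exact reward.

```python
# Illustrative reward shape, not GenMimic's exact formulation.
import numpy as np

def tracking_reward(robot_kp, ref_kp, weights, sigma=0.1):
    """robot_kp, ref_kp: (K, 3) keypoints; weights: (K,), e.g. emphasizing hands/feet."""
    err = np.linalg.norm(robot_kp - ref_kp, axis=-1)  # per-keypoint position error
    w = weights / weights.sum()
    return float(np.exp(-(w * err).sum() / sigma))    # reward in (0, 1]
```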
https://arxiv.org/abs/2512.05094
Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved performance on tasks such as visual grounding and visual question answering. However, the reasoning processes of these models remain largely opaque; they typically output only final predictions without revealing the intermediate steps or fine-grained evidence (e.g., pixels, locations) that lead to the result. This contrasts with human intelligence, which naturally operates through a chain of visual reasoning. To address this limitation, we introduce the Visual Reasoning Tracer (VRT) task, which requires models to not only localize the target object but also explicitly predict the intermediate objects that form the reasoning path. To advance research in this area, we contribute: (1) VRT-Bench, a human-annotated benchmark for evaluating visual reasoning; (2) a new metric for assessing the quality of reasoning traces; and (3) VRT-80k, a large-scale dataset for reasoning model training. Our experiments reveal that while existing models often produce the correct final output, they struggle to ground their intermediate reasoning. In contrast, models trained on VRT-80k achieve substantial improvements in tracing the reasoning path.
https://arxiv.org/abs/2512.05091
Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that address this without any fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12x extrapolation (e.g., from 5s training clips to 60s+ generation) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, largely preserved overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.
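Participative Compression's pruning rule can be sketched as follows, assuming recent attention weights over the cache are available as an importance signal; the scoring and scheduling in Deep Forcing may differ.

```python
# Assumed sketch: keep sink slots plus the most-attended cached tokens.
import torch

def prune_kv(keys, values, attn, n_sink: int, budget: int):
    """keys/values: (T, D) cache; attn: (Q, T) recent attention over the cache."""
    importance = attn.mean(dim=0)            # how much each token still participates
    importance[:n_sink] = float("inf")       # sink tokens are always retained
    keep = torch.topk(importance, k=budget).indices.sort().values
    return keys[keep], values[keep]
```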
https://arxiv.org/abs/2512.05081
Object geometry is key information for robot manipulation. Yet, object reconstruction is a challenging task because cameras only capture partial observations of objects, especially when occlusion occurs. In this paper, we leverage two extra sources of information to reduce the ambiguity of vision signals. First, generative models learn priors of the shapes of commonly seen objects, allowing us to make reasonable guesses of the unseen part of geometry. Second, contact information, which can be obtained from videos and physical interactions, provides sparse constraints on the boundary of the geometry. We combine the two sources of information through contact-guided 3D generation. The guidance formulation is inspired by drag-based editing in generative models. Experiments on synthetic and real-world data show that our approach improves the reconstruction compared to pure 3D generation and contact-based optimization.
https://arxiv.org/abs/2512.05079
Emerging video diffusion models achieve high visual fidelity but fundamentally couple scene dynamics with camera motion, limiting their ability to provide precise spatial and temporal control. We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained manipulation of both scene dynamics and camera viewpoint. Our framework takes continuous world-time sequences and camera trajectories as conditioning inputs, injecting them into the video diffusion model through a 4D positional encoding in the attention layer and adaptive normalizations for feature modulation. To train this model, we curate a unique dataset in which temporal and camera variations are independently parameterized; this dataset will be made public. Experiments show that our model achieves robust real-world 4D control across diverse timing patterns and camera trajectories, while preserving high generation quality and outperforming prior work in controllability. See our website for video results: this https URL
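One plausible form of the 4D positional encoding is sinusoidal features over world time and camera position, as sketched below; the dimensions, frequency bands, and how pose orientation enters the encoding are assumptions for illustration.

```python
# Illustrative 4D encoding over (world time, camera position); orientation omitted.
import torch

def encode_4d(t: torch.Tensor, xyz: torch.Tensor, n_freq: int = 8) -> torch.Tensor:
    """t: (B, 1) world time; xyz: (B, 3) camera position -> (B, 4 * 2 * n_freq)."""
    coords = torch.cat([t, xyz], dim=-1)                     # (B, 4)
    freqs = 2.0 ** torch.arange(n_freq, dtype=coords.dtype)  # geometric frequency bands
    angles = coords[..., None] * freqs                       # (B, 4, n_freq)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)
```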
https://arxiv.org/abs/2512.05076
Constructing 4D language fields is crucial for embodied AI, augmented/virtual reality, and 4D scene understanding, as they provide enriched semantic representations of dynamic environments and enable open-vocabulary querying in complex scenarios. However, existing approaches to 4D semantic field construction primarily rely on scene-specific Gaussian splatting, which requires per-scene optimization, exhibits limited generalization, and is difficult to scale to real-world applications. To address these limitations, we propose 4DLangVGGT, the first Transformer-based feed-forward unified framework for 4D language grounding, which jointly integrates geometric perception and language alignment within a single architecture. 4DLangVGGT has two key components: the 4D Visual Geometry Transformer, StreamVGGT, which captures spatio-temporal geometric representations of dynamic scenes; and the Semantic Bridging Decoder (SBD), which projects geometry-aware features into a language-aligned semantic space, thereby enhancing semantic interpretability while preserving structural fidelity. Unlike prior methods that depend on costly per-scene optimization, 4DLangVGGT can be jointly trained across multiple dynamic scenes and directly applied during inference, achieving both deployment efficiency and strong generalization. This design significantly improves the practicality of large-scale deployment and establishes a new paradigm for open-vocabulary 4D scene understanding. Experiments on the HyperNeRF and Neu3D datasets demonstrate that our approach not only generalizes effectively but also achieves state-of-the-art performance, with up to 2% gains under per-scene training and 1% improvements under multi-scene training. Our code is released at this https URL
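Once point features live in a language-aligned space, open-vocabulary querying reduces to a similarity lookup. The sketch below assumes CLIP-style text embeddings of the query; it illustrates only the querying step, not SBD's projection itself.

```python
# Illustrative querying step against language-aligned point features.
import torch
import torch.nn.functional as F

def query_field(point_feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """point_feats: (N, D) language-aligned features; text_emb: (D,) query embedding."""
    # Cosine similarity gives a per-point relevance score for the open-vocabulary query.
    return F.cosine_similarity(point_feats, text_emb[None, :], dim=-1)
```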
https://arxiv.org/abs/2512.05060
Generating interactive and dynamic 4D scenes from a single static image remains a core challenge. Most existing generate-then-reconstruct and reconstruct-then-generate methods decouple geometry from motion, causing spatiotemporal inconsistencies and poor generalization. To address these, we extend the reconstruct-then-generate framework to jointly perform Motion generation and geometric Reconstruction for 4D Synthesis (MoRe4D). We first introduce TrajScene-60K, a large-scale dataset of 60,000 video samples with dense point trajectories, addressing the scarcity of high-quality 4D scene data. Based on this, we propose a diffusion-based 4D Scene Trajectory Generator (4D-STraG) to jointly generate geometrically consistent and motion-plausible 4D point trajectories. To leverage single-view priors, we design a depth-guided motion normalization strategy and a motion-aware module for effective geometry and dynamics integration. We then propose a 4D View Synthesis Module (4D-ViSM) to render videos with arbitrary camera trajectories from 4D point track representations. Experiments show that MoRe4D generates high-quality 4D scenes with multi-view consistency and rich dynamic details from a single image. Code: this https URL.
https://arxiv.org/abs/2512.05044
Facial image inpainting aims to restore missing or corrupted regions in face images while preserving identity, structural consistency, and photorealistic image quality, a task central to photo restoration. Despite recent advances in deep generative models, existing methods struggle with large irregular masks, often producing blurry textures at the edges of the masked region, semantic inconsistencies, or unconvincing facial structures, due to direct pixel-level synthesis and limited exploitation of facial priors. In this paper we propose a novel architecture that addresses these challenges through semantic-guided hierarchical synthesis. Our approach first organizes and synthesizes information at the semantic level, then refines texture, establishing a clear picture of the facial structure before generating detailed images. In the first stage, we blend two techniques: CNNs for local features and Vision Transformers for global features, producing clear and detailed semantic layouts. In the second stage, a Multi-Modal Texture Generator refines these layouts by aggregating information across scales, ensuring cohesive and consistent results. The architecture naturally handles arbitrary mask configurations through dynamic attention, without mask-specific training. Experiments on the CelebA-HQ and FFHQ datasets show that our model outperforms other state-of-the-art methods on metrics such as LPIPS, PSNR, and SSIM, and produces visually striking results with better semantic preservation in challenging large-area inpainting scenarios.
https://arxiv.org/abs/2512.05039