Autoregressive large language models (LLMs) have unified a vast range of language tasks, inspiring preliminary efforts in autoregressive video generation. Existing autoregressive video generators either diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency due to next-token decoding. In this paper, we introduce Lumos-1, an autoregressive video generator that retains the LLM architecture with minimal architectural modifications. To inject spatiotemporal correlations into LLMs, we identify the efficacy of incorporating 3D RoPE and diagnose its imbalanced frequency spectrum ranges. Therefore, we propose MM-RoPE, a RoPE scheme that preserves the original textual RoPE while providing comprehensive frequency spectra and scaled 3D positions for modeling multimodal spatiotemporal data. Moreover, Lumos-1 adopts a token dependency strategy that obeys intra-frame bidirectionality and inter-frame temporal causality. Based on this dependency strategy, we identify the issue of frame-wise loss imbalance caused by spatial information redundancy and solve it by proposing Autoregressive Discrete Diffusion Forcing (AR-DF). AR-DF introduces temporal tube masking during training, together with a compatible inference-time masking policy, to avoid quality degradation. Using memory-efficient training techniques, we pre-train Lumos-1 on only 48 GPUs, achieving performance comparable to EMU3 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V. Code and models are available at this https URL.
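As a rough illustration of the 3D RoPE idea the paper builds on, the sketch below assigns each video token a (t, h, w) position and rotates separate bands of the attention-head dimension by each coordinate. The band split and base frequency are illustrative assumptions, and MM-RoPE's rebalanced frequency spectra and scaled positions are not reproduced here.

```python
import torch

def rope_freqs(dim, base=10000.0):
    # Standard RoPE inverse frequencies for one positional axis.
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

def rope_3d_angles(t_idx, h_idx, w_idx, head_dim=64, split=(16, 24, 24)):
    """Minimal 3D RoPE: partition the head dimension into temporal/height/width
    bands and rotate each band by its own coordinate."""
    dt, dh, dw = split
    assert dt + dh + dw == head_dim
    ang_t = t_idx[:, None] * rope_freqs(dt)[None, :]
    ang_h = h_idx[:, None] * rope_freqs(dh)[None, :]
    ang_w = w_idx[:, None] * rope_freqs(dw)[None, :]
    return torch.cat([ang_t, ang_h, ang_w], dim=-1)   # (num_tokens, head_dim // 2)

# (t, h, w) positions for a 4-frame clip tokenized into an 8x8 grid per frame.
T, H, W = 4, 8, 8
grid = torch.stack(torch.meshgrid(torch.arange(T), torch.arange(H), torch.arange(W),
                                  indexing="ij"), dim=-1).reshape(-1, 3).float()
angles = rope_3d_angles(grid[:, 0], grid[:, 1], grid[:, 2])
print(angles.shape)   # torch.Size([256, 32]); cos/sin of these rotate q/k pairs
```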
https://arxiv.org/abs/2507.08801
We introduce NeuralOS, a neural framework that simulates graphical user interfaces (GUIs) of operating systems by directly predicting screen frames in response to user inputs such as mouse movements, clicks, and keyboard events. NeuralOS combines a recurrent neural network (RNN), which tracks computer state, with a diffusion-based neural renderer that generates screen images. The model is trained on a large-scale dataset of Ubuntu XFCE recordings, which include both randomly generated interactions and realistic interactions produced by AI agents. Experiments show that NeuralOS successfully renders realistic GUI sequences, accurately captures mouse interactions, and reliably predicts state transitions like application launches. Although modeling fine-grained keyboard interactions precisely remains challenging, NeuralOS offers a step toward creating fully adaptive, generative neural interfaces for future human-computer interaction systems.
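As a structural illustration only, the sketch below pairs a recurrent state tracker with a renderer head; the real system uses a diffusion-based renderer and learned event encodings, so the GRU cell, layer sizes, and the plain decoder here are all assumptions.

```python
import torch
import torch.nn as nn

class NeuralOSSketch(nn.Module):
    """Minimal split: an RNN carries hidden computer state across user-input
    events; a renderer (a plain decoder standing in for the diffusion renderer)
    maps that state to the next screen frame."""
    def __init__(self, event_dim=16, state_dim=256, frame_hw=64):
        super().__init__()
        self.state = nn.GRUCell(event_dim, state_dim)
        self.render = nn.Sequential(
            nn.Linear(state_dim, 3 * frame_hw * frame_hw), nn.Sigmoid())
        self.frame_hw = frame_hw

    def forward(self, events, h=None):
        frames = []
        for e in events:                       # events: (T, B, event_dim)
            h = self.state(e, h)               # update the tracked computer state
            frames.append(self.render(h).view(-1, 3, self.frame_hw, self.frame_hw))
        return torch.stack(frames), h

model = NeuralOSSketch()
frames, h = model(torch.randn(5, 2, 16))       # 5 input events, batch of 2
print(frames.shape)                            # torch.Size([5, 2, 3, 64, 64])
```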
https://arxiv.org/abs/2507.08800
Recent advances in 3D generation have transitioned from multi-view 2D rendering approaches to 3D-native latent diffusion frameworks that exploit geometric priors in ground-truth data. Despite progress, three key limitations persist: (1) single-latent representations fail to capture complex multi-part geometries, causing detail degradation; (2) holistic latent coding neglects part independence and the interrelationships critical for compositional design; (3) global conditioning mechanisms lack fine-grained controllability. Inspired by human 3D design workflows, we propose CoPart, a part-aware diffusion framework that decomposes 3D objects into contextual part latents for coherent multi-part generation. This paradigm offers three advantages: i) it reduces encoding complexity through part decomposition; ii) it enables explicit part relationship modeling; iii) it supports part-level conditioning. We further develop a mutual guidance strategy to fine-tune pre-trained diffusion models for joint part-latent denoising, ensuring geometric coherence while preserving foundation-model priors. To enable large-scale training, we construct Partverse, a novel 3D part dataset derived from Objaverse through automated mesh segmentation and human-verified annotations. Extensive experiments demonstrate CoPart's superior capabilities in part-level editing, articulated object generation, and scene composition with unprecedented controllability.
https://arxiv.org/abs/2507.08772
Multimodal Large Language Models (MLLMs) struggle to accurately capture camera-object relations, especially object orientation, camera viewpoint, and camera shots. This stems from the fact that existing MLLMs are trained on images with limited diversity in camera-object relations and corresponding textual descriptions. To address this, we propose a synthetic generation pipeline to create large-scale 3D visual instruction datasets. Our framework takes 3D assets as input and uses rendering and diffusion-based image generation models to create photorealistic images that preserve precise camera-object relations. Additionally, large language models (LLMs) are used to generate text prompts for guiding visual instruction tuning and controlling image generation. We create Ultimate3D, a dataset of 240K VQAs with precise camera-object annotations, and a corresponding benchmark. MLLMs fine-tuned on our proposed dataset outperform commercial models by a large margin, achieving an average accuracy improvement of 33.4% on camera-object relation recognition tasks. Our code, dataset, and benchmark will contribute to broad MLLM applications.
https://arxiv.org/abs/2507.08513
Diffusion transformers have emerged as an alternative to U-Net-based diffusion models for high-fidelity image and video generation, offering superior scalability. However, their heavy computation remains a major obstacle to real-world deployment. Existing acceleration methods primarily exploit the temporal dimension, such as reusing cached features across diffusion timesteps. Here, we propose Region-Adaptive Latent Upsampling (RALU), a training-free framework that accelerates inference along the spatial dimension. RALU performs mixed-resolution sampling across three stages: 1) low-resolution latent denoising to efficiently capture global semantic structure, 2) region-adaptive upsampling of specific regions prone to artifacts at full resolution, and 3) upsampling of all latents to full resolution for detail refinement. To stabilize generation across resolution transitions, we leverage noise-timestep rescheduling to adapt the noise level across varying resolutions. Our method significantly reduces computation while preserving image quality, achieving up to 7.0$\times$ speed-up on FLUX and 3.0$\times$ on Stable Diffusion 3 with minimal degradation. Furthermore, RALU is complementary to existing temporal accelerations such as caching methods, and can thus be seamlessly integrated to further reduce inference latency without compromising generation quality.
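A heavily simplified sketch of the three-stage schedule is below. The single-step `denoise` is a stub, artifact-prone regions are picked by a crude variance heuristic rather than the paper's criterion, and real RALU keeps unselected regions at low resolution instead of upsampling everything, so treat this only as an outline of the control flow.

```python
import torch
import torch.nn.functional as F

def renoise(latent, sigma):
    """Noise-timestep rescheduling (simplified): re-inject Gaussian noise so the
    upsampled latent matches the noise level of the target timestep."""
    return latent + sigma * torch.randn_like(latent)

def ralu_sample(denoise, latent_lr, steps=(20, 10, 10), sigmas=(1.0, 0.5, 0.25)):
    x = latent_lr
    # Stage 1: low-resolution denoising for global structure.
    for _ in range(steps[0]):
        x = denoise(x)
    # Stage 2: upsample and re-noise only artifact-prone regions (variance proxy).
    x_up = F.interpolate(x, scale_factor=2, mode="nearest")
    var = x.var(dim=1, keepdim=True)
    mask = (var > var.mean()).float()
    mask_up = F.interpolate(mask, scale_factor=2, mode="nearest")
    x_up = renoise(x_up, sigmas[1]) * mask_up + x_up * (1 - mask_up)
    for _ in range(steps[1]):
        x_up = denoise(x_up)
    # Stage 3: all latents at full resolution for detail refinement.
    x_full = renoise(x_up, sigmas[2])
    for _ in range(steps[2]):
        x_full = denoise(x_full)
    return x_full

# Toy usage with an identity-like "denoiser" stub.
out = ralu_sample(lambda z: 0.9 * z, torch.randn(1, 4, 32, 32))
print(out.shape)  # torch.Size([1, 4, 64, 64])
```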
https://arxiv.org/abs/2507.08422
Subject-consistent generation (SCG), which aims to maintain a consistent subject identity across diverse scenes, remains a challenge for text-to-image (T2I) models. Existing training-free SCG methods often achieve consistency at the cost of layout and pose diversity, hindering expressive visual storytelling. To address this limitation, we propose a subject-Consistent and pose-Diverse T2I framework, dubbed CoDi, that enables consistent subject generation with diverse pose and layout. Motivated by the progressive nature of diffusion, where coarse structures emerge early and fine details are refined later, CoDi adopts a two-stage strategy: Identity Transport (IT) and Identity Refinement (IR). IT operates in the early denoising steps, using optimal transport to transfer identity features to each target image in a pose-aware manner. This promotes subject consistency while preserving pose diversity. IR is applied in the later denoising steps, selecting the most salient identity features to further refine subject details. Extensive qualitative and quantitative results on subject consistency, pose diversity, and prompt fidelity demonstrate that CoDi achieves both better visual perception and stronger performance across all metrics. The code is provided at this https URL.
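To make the "identity transport" step concrete, here is a minimal entropic optimal-transport sketch that moves reference identity features onto target-image tokens. The cost is plain feature distance and the injection into denoising is omitted, so this is a generic illustration rather than CoDi's actual pose-aware formulation.

```python
import torch

def sinkhorn(cost, eps=0.1, iters=100):
    """Entropic OT plan between uniform marginals (plain Sinkhorn iterations)."""
    cost = cost / cost.mean()                       # normalize for numerical stability
    K = torch.exp(-cost / eps)
    u = torch.full((cost.size(0),), 1.0 / cost.size(0))
    v = torch.full((cost.size(1),), 1.0 / cost.size(1))
    a, b = u.clone(), v.clone()
    for _ in range(iters):
        a = u / (K @ b)
        b = v / (K.t() @ a)
    return a[:, None] * K * b[None, :]              # transport plan

def transport_identity(ref_feats, tgt_feats):
    """Barycentric projection: each target token receives a weighted mix of
    reference identity features according to the transport plan."""
    plan = sinkhorn(torch.cdist(tgt_feats, ref_feats))
    plan = plan / plan.sum(dim=1, keepdim=True)
    return plan @ ref_feats

ref = torch.randn(64, 32)   # identity features from the reference image
tgt = torch.randn(64, 32)   # target-image features at an early denoising step
print(transport_identity(ref, tgt).shape)           # torch.Size([64, 32])
```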
https://arxiv.org/abs/2507.08396
Low-level enhancement and high-level visual understanding in low-light vision have traditionally been treated separately. Low-light enhancement improves image quality for downstream tasks, but existing methods rely on physical or geometric priors, limiting generalization. Evaluation mainly focuses on visual quality rather than downstream performance. Low-light visual understanding, constrained by scarce labeled data, primarily uses task-specific domain adaptation, which lacks scalability. To address these challenges, we build a generalized bridge between low-light enhancement and low-light understanding, which we term Generalized Enhancement For Understanding (GEFU). This paradigm improves both generalization and scalability. To address the diverse causes of low-light degradation, we leverage pretrained generative diffusion models to optimize images, achieving zero-shot generalization performance. Building on this, we propose Semantically Consistent Unsupervised Fine-tuning (SCUF). Specifically, to overcome text prompt limitations, we introduce an illumination-aware image prompt to explicitly guide image generation and propose a cycle-attention adapter to maximize its semantic potential. To mitigate semantic degradation in unsupervised training, we propose caption and reflectance consistency to learn high-level semantics and image-level spatial semantics. Extensive experiments demonstrate that our proposed method outperforms current state-of-the-art methods in traditional image quality and GEFU tasks including classification, detection, and semantic segmentation.
https://arxiv.org/abs/2507.08380
Concept Bottleneck Models (CBMs) provide interpretable and controllable generative modeling by routing generation through explicit, human-understandable concepts. However, previous generative CBMs often rely on auxiliary visual cues at the bottleneck to compensate for information not captured by the concepts, which undermines interpretability and compositionality. We propose CoCo-Bot, a post-hoc, composable concept bottleneck generative model that eliminates the need for auxiliary cues by transmitting all information solely through explicit concepts. Guided by diffusion-based energy functions, CoCo-Bot supports robust post-hoc interventions, such as concept composition and negation, across arbitrary concepts. Experiments using StyleGAN2 pre-trained on CelebA-HQ show that CoCo-Bot improves concept-level controllability and interpretability while maintaining competitive visual quality.
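The kind of post-hoc intervention described above can be pictured with the standard composable-diffusion score arithmetic sketched below, where adding or subtracting concept-conditional scores composes or negates concepts. This is a generic illustration under that assumption, not CoCo-Bot's exact energy formulation.

```python
import torch

def composed_score(score_uncond, concept_scores, weights):
    """Compose concept-conditional scores (energy gradients): positive weights
    add a concept, negative weights negate it (standard composable-diffusion
    arithmetic, used here only to sketch post-hoc concept interventions)."""
    out = score_uncond.clone()
    for s_c, w in zip(concept_scores, weights):
        out = out + w * (s_c - score_uncond)
    return out

x = torch.randn(1, 512)                   # latent at some diffusion step
s_u = torch.randn_like(x)                 # unconditional score (stub)
s_smile, s_glasses = torch.randn_like(x), torch.randn_like(x)
# "smiling AND NOT glasses"
s = composed_score(s_u, [s_smile, s_glasses], weights=[1.5, -1.5])
print(s.shape)                            # torch.Size([1, 512])
```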
https://arxiv.org/abs/2507.08334
Audio inpainting refers to the task of reconstructing missing segments in corrupted audio recordings. While prior approaches, including waveform- and spectrogram-based diffusion models, have shown promising results for short gaps, they often degrade in quality when gaps exceed 100 milliseconds (ms). In this work, we introduce a novel inpainting method based on discrete diffusion modeling, which operates over tokenized audio representations produced by a pre-trained audio tokenizer. Our approach models the generative process directly in the discrete latent space, enabling stable and semantically coherent reconstruction of missing audio. We evaluate the method on the MusicNet dataset using both objective and perceptual metrics across gap durations of up to 300 ms. We further evaluate our approach on the MTG dataset, extending the gap duration to 500 ms. Experimental results demonstrate that our method achieves competitive or superior performance compared to existing baselines, particularly for longer gaps, offering a robust solution for restoring degraded musical recordings. Audio examples of our proposed method can be found at this https URL.
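For intuition, the sketch below shows a generic confidence-based fill-in over a tokenized sequence: gap tokens start masked and are committed a few at a time while the surrounding context stays fixed. The mask id, schedule, and stand-in model are assumptions; the paper's discrete diffusion sampler will differ in detail.

```python
import torch

MASK = 0  # reserved mask token id (an assumption for this sketch)

def inpaint_tokens(model, tokens, gap, steps=8):
    """Iteratively fill masked gap positions, committing the most confident
    predictions first; context tokens outside `gap` are never changed."""
    x = tokens.clone()
    x[gap] = MASK
    gap_idx = gap.nonzero(as_tuple=True)[0]
    for s in range(steps):
        probs = model(x).softmax(-1)                   # (seq_len, vocab)
        conf, pred = probs.max(-1)
        masked = gap_idx[x[gap_idx] == MASK]           # gap positions still masked
        target_done = int(len(gap_idx) * (s + 1) / steps)
        k = max(1, target_done - (len(gap_idx) - len(masked)))
        keep = masked[conf[masked].topk(min(k, len(masked))).indices]
        x[keep] = pred[keep]
    return x

vocab, seq = 1024, 256
toy_model = lambda x: torch.randn(seq, vocab)          # stand-in for the token model
tokens = torch.randint(1, vocab, (seq,))
gap = torch.zeros(seq, dtype=torch.bool)
gap[100:140] = True                                    # ~40-token gap to restore
print(inpaint_tokens(toy_model, tokens, gap)[95:145])
```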
https://arxiv.org/abs/2507.08333
We propose Adaptive Diffusion Denoised Smoothing, a method for certifying the predictions of a vision model against adversarial examples, while adapting to the input. Our key insight is to reinterpret a guided denoising diffusion model as a long sequence of adaptive Gaussian Differentially Private (GDP) mechanisms refining a pure noise sample into an image. We show that these adaptive mechanisms can be composed through a GDP privacy filter to analyze the end-to-end robustness of the guided denoising process, yielding a provable certification that extends the adaptive randomized smoothing analysis. We demonstrate that our design, under a specific guiding strategy, can improve both certified accuracy and standard accuracy on ImageNet for an $\ell_2$ threat model.
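As background for the certification being extended, here is the standard (non-adaptive) randomized-smoothing recipe: classify many Gaussian-noised copies, take the majority class, and certify an $\ell_2$ radius from the top-class probability. The paper's contribution lies in composing adaptive, guided-diffusion denoising steps through a GDP filter, which this sketch does not attempt; the toy classifier and the plain empirical estimate of $p_A$ (rather than a confidence lower bound) are assumptions.

```python
import torch
from torch.distributions import Normal

def smoothed_predict_certify(f, x, sigma=0.25, n=1000, num_classes=10):
    """Baseline randomized smoothing: majority vote over Gaussian-noised copies
    and certified l2 radius R = sigma * Phi^{-1}(p_A), valid when p_A > 1/2."""
    noisy = x.unsqueeze(0) + sigma * torch.randn(n, *x.shape)
    votes = torch.bincount(f(noisy).argmax(-1), minlength=num_classes)
    top = votes.argmax()
    p_a = (votes[top].float() / n).clamp(1e-3, 1 - 1e-3)   # empirical estimate only
    radius = sigma * Normal(0.0, 1.0).icdf(p_a)
    return top.item(), radius.item()

W = torch.randn(3 * 8 * 8, 10)
toy_classifier = lambda z: z.flatten(1) @ W                # stand-in base classifier
cls, r = smoothed_predict_certify(toy_classifier, torch.randn(3, 8, 8))
print(cls, round(r, 3))
```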
https://arxiv.org/abs/2507.08163
We study the problem of training and fine-tuning expressive policies with online reinforcement learning (RL) given an offline dataset. Training expressive policy classes with online RL presents a unique challenge of stable value maximization. Unlike the simpler Gaussian policies commonly used in online RL, expressive policies like diffusion and flow-matching policies are parameterized by a long denoising chain, which hinders stable gradient propagation from actions to policy parameters when optimizing against a value function. Our key insight is that we can address stable value maximization by avoiding direct optimization over value with the expressive policy and instead constructing an on-the-fly RL policy to maximize Q-values. We propose Expressive Policy Optimization (EXPO), a sample-efficient online RL algorithm that maximizes value with an on-the-fly policy built from two parameterized policies: a larger expressive base policy trained with a stable imitation learning objective, and a light-weight Gaussian edit policy that edits the actions sampled from the base policy toward a higher-value distribution. The on-the-fly policy optimizes the actions from the base policy with the learned edit policy and chooses the value-maximizing action from the base and edited actions for both sampling and temporal-difference (TD) backups. Our approach yields up to a 2-3x improvement in sample efficiency on average over prior methods, both when fine-tuning a pretrained policy given offline data and when leveraging offline data to train online.
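A minimal sketch of the on-the-fly action selection is below, assuming stand-in networks: the expressive base policy proposes actions, the Gaussian edit policy perturbs them toward higher value, and the Q-function picks the best candidate (the same selection the abstract says is used for sampling and TD backups). Shapes, sample counts, and the stub Q-function are illustrative.

```python
import torch
import torch.nn as nn

class OnTheFlyPolicy:
    """Select actions by editing base-policy samples and taking the Q-argmax."""
    def __init__(self, base_policy, edit_policy, q_fn, n_samples=4):
        self.base, self.edit, self.q, self.n = base_policy, edit_policy, q_fn, n_samples

    def act(self, obs):                                   # obs: (1, obs_dim)
        obs_b = obs.expand(self.n, -1)
        a_base = self.base(obs_b)                         # samples from expressive policy
        mean, log_std = self.edit(torch.cat([obs_b, a_base], -1)).chunk(2, -1)
        a_edit = a_base + mean + log_std.exp() * torch.randn_like(mean)
        cands = torch.cat([a_base, a_edit], 0)            # base and edited candidates
        q_vals = self.q(obs.expand(cands.size(0), -1), cands)
        return cands[q_vals.argmax()]                     # value-maximizing action

obs_dim, act_dim = 17, 6
base = lambda o: torch.tanh(torch.randn(o.size(0), act_dim))   # stand-in diffusion policy
edit = nn.Linear(obs_dim + act_dim, 2 * act_dim)               # light-weight Gaussian edit
q_fn = lambda o, a: -(a - 0.1).pow(2).sum(-1)                  # stub Q-function
agent = OnTheFlyPolicy(base, edit, q_fn)
print(agent.act(torch.randn(1, obs_dim)).shape)                # torch.Size([6])
```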
https://arxiv.org/abs/2507.07986
Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometry-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representations. We evaluate Geometry Forcing on both camera-view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: this https URL.
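The two objectives can be summarized in a few lines; the sketch below uses assumed projection heads, feature sizes, and an equal weighting, since the abstract does not specify them.

```python
import torch
import torch.nn.functional as F

def geometry_forcing_loss(diff_feats, geo_feats, proj, scale_head):
    """Sketch of the two alignment terms: angular alignment matches directions
    via cosine similarity; scale alignment regresses the unnormalized geometric
    features from normalized diffusion features, preserving magnitude cues."""
    z = proj(diff_feats)                                   # map to geometry space
    angular = 1.0 - F.cosine_similarity(z, geo_feats, dim=-1).mean()
    scale = F.mse_loss(scale_head(F.normalize(z, dim=-1)), geo_feats)
    return angular + scale

d, g, n = 768, 384, 256
proj = torch.nn.Linear(d, g)          # diffusion features -> geometry feature space
scale_head = torch.nn.Linear(g, g)    # regresses unnormalized geometric features
loss = geometry_forcing_loss(torch.randn(n, d), torch.randn(n, g), proj, scale_head)
print(loss.item())
```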
https://arxiv.org/abs/2507.07982
The recent advances in generative models such as diffusion models have raised several risks and concerns related to privacy, copyright infringement, and data stewardship. To better understand and control these risks, various researchers have created techniques, experiments, and attacks that reconstruct images, or parts of images, from the training set. While these techniques already establish that data from the training set can be reconstructed, they often rely on substantial resources, access to the training set, and well-engineered prompts. In this work, we devise a new attack that requires low resources, assumes little to no access to the actual training set, and identifies seemingly benign prompts that lead to potentially risky image reconstructions. This highlights the risk that images might be reconstructed unintentionally, even by an uninformed user. For example, we identified that, for one existing model, the prompt ``blue Unisex T-Shirt'' can generate the face of a real-life human model. Our method builds on intuition from previous works that leverage domain knowledge, and it identifies a fundamental vulnerability stemming from the use of data scraped from e-commerce platforms, where templated layouts and images are tied to pattern-like prompts.
https://arxiv.org/abs/2507.07947
Underwater image restoration algorithms seek to restore the color, contrast, and appearance of a scene imaged underwater. They are a critical tool in applications ranging from marine ecology and aquaculture to underwater construction and archaeology. While existing pixel-domain diffusion-based image restoration approaches are effective at restoring simple scenes with limited depth variation, they are computationally intensive and often generate unrealistic artifacts when applied to scenes with complex geometry and significant depth variation. In this work, we overcome these limitations by combining a novel network architecture (SLURPP) with an accurate synthetic data generation pipeline. SLURPP combines pretrained latent diffusion models, which encode strong priors on the geometry and depth of scenes, with an explicit scene decomposition, which allows one to model and account for the effects of light attenuation and backscattering. To train SLURPP we design a physics-based underwater image synthesis pipeline that applies varied and realistic underwater degradation effects to existing terrestrial image datasets. This approach enables the generation of diverse training data with dense medium/degradation annotations. We evaluate our method extensively on both synthetic and real-world benchmarks and demonstrate state-of-the-art performance. Notably, SLURPP is over 200X faster than existing diffusion-based methods while offering ~3 dB improvement in PSNR on synthetic benchmarks. It also offers compelling qualitative improvements on real-world data. Project website: this https URL.
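The physics the scene decomposition accounts for is commonly written as a per-channel attenuation-plus-backscatter model; the sketch below applies that standard formation model to a clean image and a depth map, which is the kind of degradation such a synthesis pipeline can use. The attenuation and veiling-light coefficients are illustrative, and the paper's actual pipeline may be richer.

```python
import torch

def degrade_underwater(image, depth, beta=(0.4, 0.12, 0.08), backlight=(0.05, 0.35, 0.45)):
    """Standard underwater image formation: I = J * T + B * (1 - T), with
    per-channel transmission T = exp(-beta * depth). Red attenuates fastest,
    hence the larger first coefficient; values here are illustrative."""
    beta = torch.tensor(beta).view(3, 1, 1)
    B = torch.tensor(backlight).view(3, 1, 1)
    T = torch.exp(-beta * depth)                   # transmission, depth in meters
    return image * T + B * (1.0 - T)

img = torch.rand(3, 128, 128)                      # clean terrestrial image in [0, 1]
depth = torch.rand(1, 128, 128) * 10.0             # synthetic depth map, 0-10 m
print(degrade_underwater(img, depth).shape)        # torch.Size([3, 128, 128])
```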
https://arxiv.org/abs/2507.07878
Neural audio codecs and autoencoders have emerged as versatile models for audio compression, transmission, feature-extraction, and latent-space generation. However, a key limitation is that most are trained to maximize reconstruction fidelity, often neglecting the specific latent structure necessary for optimal performance in diverse downstream applications. We propose a simple, post-hoc framework to address this by modifying the bottleneck of a pre-trained autoencoder. Our method introduces a "Re-Bottleneck", an inner bottleneck trained exclusively through latent space losses to instill user-defined structure. We demonstrate the framework's effectiveness in three experiments. First, we enforce an ordering on latent channels without sacrificing reconstruction quality. Second, we align latents with semantic embeddings, analyzing the impact on downstream diffusion modeling. Third, we introduce equivariance, ensuring that a filtering operation on the input waveform directly corresponds to a specific transformation in the latent space. Ultimately, our Re-Bottleneck framework offers a flexible and efficient way to tailor representations of neural audio models, enabling them to seamlessly meet the varied demands of different applications with minimal additional training.
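A minimal picture of the idea, under assumed sizes and an assumed "ordering" structure loss: a small inner encoder/decoder pair sits inside the frozen autoencoder's latent space and is trained purely with latent-space losses, so the original encoder and decoder never change.

```python
import torch
import torch.nn as nn

class ReBottleneck(nn.Module):
    """Post-hoc inner bottleneck: re-encode the frozen autoencoder's latent,
    impose structure there, and decode back to the original latent space so the
    pre-trained decoder still applies. Sizes are illustrative assumptions."""
    def __init__(self, latent_dim=128, inner_dim=64):
        super().__init__()
        self.enc = nn.Linear(latent_dim, inner_dim)
        self.dec = nn.Linear(inner_dim, latent_dim)

    def forward(self, z):
        h = self.enc(z)
        return h, self.dec(h)

def losses(z, h, z_hat):
    recon = (z - z_hat).pow(2).mean()              # stay faithful to the old latent
    # Example user-defined structure: penalize later channels more, a crude
    # "ordering" prior pushing energy into the first channels.
    weights = torch.linspace(0.0, 1.0, h.size(-1))
    order = (weights * h.pow(2)).mean()
    return recon + 0.1 * order

rb = ReBottleneck()
z = torch.randn(16, 128)                           # latents from a frozen codec
h, z_hat = rb(z)
print(losses(z, h, z_hat).item())
```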
https://arxiv.org/abs/2507.07867
Content-based puzzle solvers have been extensively studied, demonstrating significant progress in computational techniques. However, their evaluation often lacks the realistic challenges crucial for real-world applications, such as the reassembly of fragmented artefacts or shredded documents. In this work, we investigate the robustness of state-of-the-art content-based puzzle solvers by introducing three types of jigsaw puzzle corruption: missing pieces, eroded edges, and eroded contents. Evaluating both heuristic and deep learning-based solvers, we analyse their ability to handle these corruptions and identify key limitations. Our results show that solvers developed for standard puzzles suffer a rapid decline in performance as more pieces are corrupted. However, deep learning models can significantly improve their robustness through fine-tuning with augmented data. Notably, the advanced Positional Diffusion model adapts particularly well, outperforming its competitors in most experiments. Based on our findings, we highlight promising research directions for enhancing the automated reconstruction of real-world artefacts.
https://arxiv.org/abs/2507.07828
The remarkable results for denoising in computer vision using diffusion models given in \cite{SDWMG,HJA,HHG} yield a robust mathematical justification for algorithms based on crucial properties of a sequence of independent Gaussian $N(0,1)$ random variables. In particular, the derivations use the fact that a Gaussian distribution is determined by its mean and variance and that the sum of two Gaussians is another Gaussian.

The issue raised in this short note is the following: suppose we use the algorithm without any changes but alter the nature of the noise, using, for instance, uniformly distributed noise, noise with a Beta distribution, or noise that is a random superposition of two Gaussians with very different variances. One could, of course, try to modify the algorithm keeping in mind the nature of the noise, but this is not what we do. Instead, we study the performance of the algorithm when used with noise that is very far in nature from the Gaussian case, where it is designed to work well. Usually these algorithms are implemented on very powerful computers. Our experiments are all carried out on a small laptop and for the smallest possible image size. Exploring how our observations are confirmed or changed in different situations remains an interesting challenge.
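For concreteness, the noise sources in such a swap experiment might be drawn as below, each standardized to zero mean and unit variance before being fed to an otherwise unchanged Gaussian-designed sampler; the specific Beta and mixture parameters are illustrative.

```python
import torch

def sample_noise(shape, kind="gaussian"):
    """Alternative noise sources, standardized to zero mean and unit variance so
    they can be dropped into an unchanged Gaussian-designed denoising loop."""
    if kind == "gaussian":
        n = torch.randn(shape)
    elif kind == "uniform":
        n = torch.rand(shape) * 2.0 - 1.0
    elif kind == "beta":
        n = torch.distributions.Beta(2.0, 5.0).sample(shape)
    elif kind == "gauss_mixture":          # two Gaussians with very different variances
        pick = torch.rand(shape) < 0.5
        n = torch.where(pick, 0.1 * torch.randn(shape), 3.0 * torch.randn(shape))
    return (n - n.mean()) / n.std()

for kind in ["gaussian", "uniform", "beta", "gauss_mixture"]:
    eps = sample_noise((64, 3, 8, 8), kind)
    print(kind, round(eps.mean().item(), 3), round(eps.std().item(), 3))
```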
https://arxiv.org/abs/2507.08059
Unrestricted adversarial attacks aim to fool computer vision models without being constrained by $\ell_p$-norm bounds to remain imperceptible to humans, for example, by changing an object's color. This allows attackers to circumvent traditional, norm-bounded defense strategies such as adversarial training or certified defense strategies. However, due to their unrestricted nature, there are also no guarantees of norm-based imperceptibility, necessitating human evaluations to verify just how authentic these adversarial examples look. While some related work assesses this vital quality of adversarial attacks, none provide statistically significant insights. This issue necessitates a unified framework that supports and streamlines such an assessment for evaluating and comparing unrestricted attacks. To close this gap, we introduce SCOOTER, an open-source, statistically powered framework for evaluating unrestricted adversarial examples. Our contributions are: $(i)$ best-practice guidelines for crowd-study power, compensation, and Likert equivalence bounds to measure imperceptibility; $(ii)$ the first large-scale human vs. model comparison across 346 human participants showing that three color-space attacks and three diffusion-based attacks fail to produce imperceptible images. Furthermore, we found that GPT-4o can serve as a preliminary test for imperceptibility, but it only consistently detects adversarial examples for four out of six tested attacks; $(iii)$ open-source software tools, including a browser-based task template to collect annotations and analysis scripts in Python and R; $(iv)$ an ImageNet-derived benchmark dataset containing 3K real images, 7K adversarial examples, and over 34K human ratings. Our findings demonstrate that automated vision systems do not align with human perception, reinforcing the need for a ground-truth SCOOTER benchmark.
https://arxiv.org/abs/2507.07776
Recent advances in video generation techniques have given rise to an emerging paradigm of generative video coding, which aims to achieve semantically accurate reconstructions in Ultra-Low Bitrate (ULB) scenarios by leveraging strong generative priors. However, most existing methods are limited by domain specificity (e.g., facial or human videos) or an excessive dependence on high-level text guidance, which often fails to capture motion details and results in unrealistic reconstructions. To address these challenges, we propose a Trajectory-Guided Generative Video Coding framework (dubbed T-GVC). T-GVC employs a semantic-aware sparse motion sampling pipeline that bridges low-level motion tracking with high-level semantic understanding by extracting pixel-wise motion as sparse trajectory points according to their semantic importance; this not only significantly reduces the bitrate but also preserves critical temporal semantic information. In addition, by incorporating trajectory-aligned loss constraints into the diffusion process, we introduce a training-free latent space guidance mechanism that ensures physically plausible motion patterns without sacrificing the inherent capabilities of generative models. Experimental results demonstrate that our framework outperforms both traditional codecs and state-of-the-art end-to-end video compression methods under ULB conditions. Furthermore, additional experiments confirm that our approach achieves more precise motion control than existing text-guided methods, paving the way for a new direction in generative video coding guided by geometric motion modeling.
https://arxiv.org/abs/2507.07633
Capture stages are high-end sources of state-of-the-art recordings for downstream applications in movies, games, and other media. One crucial step in almost all pipelines is the matting of images to isolate the captured performances from the background. While common matting algorithms deliver remarkable performance in other applications such as teleconferencing and mobile entertainment, we found that they struggle significantly with the peculiarities of capture stage content. The goal of our work is to share insights into those challenges as a curated list of their characteristics, together with a constructive discussion of proactive interventions, and to present a guideline that gives practitioners an improved workflow for mitigating unresolved challenges. To this end, we also demonstrate an efficient pipeline for adapting state-of-the-art approaches to such custom setups without the need for extensive annotations, both offline and in real time. For an objective evaluation, we propose a validation methodology based on a leading diffusion model that highlights the benefits of our approach.
https://arxiv.org/abs/2507.07623