We tackle the problem of monocular-to-stereo video conversion and propose a novel architecture for inpainting and refinement of the warped right view obtained by depth-based reprojection of the input left view. We extend the Stable Video Diffusion (SVD) model to use the input left video, the warped right video, and the disocclusion masks as conditioning input to generate a high-quality right camera view. To effectively exploit information from neighboring frames for inpainting, we modify the attention layers in SVD to compute full attention for disoccluded pixels. Our model is trained to generate the right-view video in an end-to-end manner by minimizing image-space losses to ensure high-quality generation. Our approach outperforms previous state-of-the-art methods, obtaining an average rank of 1.43 among the 4 compared methods in a user study, while being 6x faster than the second-placed method.
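To make the conditioning inputs concrete, the sketch below shows the standard depth-based reprojection step the abstract builds on: forward-warp the left view horizontally by per-pixel disparity and record the pixels that receive no contribution as the disocclusion mask. The function name, the z-buffer resolution rule, and the toy disparity field are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def warp_left_to_right(left: np.ndarray, disparity: np.ndarray):
    """Forward-warp a left view into a right view using per-pixel disparity.

    left:      (H, W, 3) float image
    disparity: (H, W) horizontal shift in pixels (larger = closer)
    Returns the warped right view and a disocclusion mask (True = hole).
    """
    h, w, _ = left.shape
    right = np.zeros_like(left)
    depth_buf = np.full((h, w), -np.inf)   # keep the closest contributor per target pixel
    mask = np.ones((h, w), dtype=bool)     # True where nothing lands (disocclusion)

    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    xt = np.round(xs - disparity).astype(int)   # shift left-view pixels into the right view
    valid = (xt >= 0) & (xt < w)

    for y, x_src, x_dst, d in zip(ys[valid], xs[valid], xt[valid], disparity[valid]):
        if d > depth_buf[y, x_dst]:             # z-buffer on disparity
            depth_buf[y, x_dst] = d
            right[y, x_dst] = left[y, x_src]
            mask[y, x_dst] = False
    return right, mask

# Toy usage: a random frame and a smooth disparity ramp.
left = np.random.rand(64, 96, 3)
disp = np.tile(np.linspace(0, 8, 96), (64, 1))
right_warped, disocclusion = warp_left_to_right(left, disp)
print(right_warped.shape, disocclusion.mean())  # fraction of pixels the inpainting model must fill
```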
https://arxiv.org/abs/2505.16565
Acquiring detailed 3D scenes typically demands costly equipment, multi-view data, or labor-intensive modeling. Therefore, a lightweight alternative, generating complex 3D scenes from a single top-down image, plays an essential role in real-world applications. While recent 3D generative models have achieved remarkable results at the object level, their extension to full-scene generation often leads to inconsistent geometry, layout hallucinations, and low-quality meshes. In this work, we introduce 3DTown, a training-free framework designed to synthesize realistic and coherent 3D scenes from a single top-down view. Our method is grounded in two principles: region-based generation to improve image-to-3D alignment and resolution, and spatial-aware 3D inpainting to ensure global scene coherence and high-quality geometry generation. Specifically, we decompose the input image into overlapping regions and generate each using a pretrained 3D object generator, followed by a masked rectified flow inpainting process that fills in missing geometry while maintaining structural continuity. This modular design allows us to overcome resolution bottlenecks and preserve spatial structure without requiring 3D supervision or fine-tuning. Extensive experiments across diverse scenes show that 3DTown outperforms state-of-the-art baselines, including Trellis, Hunyuan3D-2, and TripoSG, in terms of geometry quality, spatial coherence, and texture fidelity. Our results demonstrate that high-quality 3D town generation is achievable from a single image using a principled, training-free approach.
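As an illustration of the region-based generation principle, a minimal sketch of decomposing the top-down image into overlapping regions is shown below; each crop would be lifted to 3D by a pretrained object generator before the spatial-aware inpainting stage fuses them. Tile size and overlap are arbitrary assumptions.

```python
import numpy as np

def overlapping_regions(image: np.ndarray, tile: int = 256, overlap: int = 64):
    """Yield (region, (y0, x0)) crops covering the image with the given overlap."""
    h, w = image.shape[:2]
    stride = tile - overlap
    for y0 in range(0, max(h - overlap, 1), stride):
        for x0 in range(0, max(w - overlap, 1), stride):
            y1, x1 = min(y0 + tile, h), min(x0 + tile, w)
            yield image[y0:y1, x0:x1], (y0, x0)

top_down = np.zeros((512, 768, 3), dtype=np.uint8)        # placeholder top-down image
regions = list(overlapping_regions(top_down))
print(len(regions), "overlapping regions")                 # each is lifted to 3D, then fused by inpainting
```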
https://arxiv.org/abs/2505.15765
As diffusion-based malicious image manipulation becomes increasingly prevalent, multiple proactive defense methods have been developed to safeguard images against unauthorized tampering. However, most proactive defense methods can only safeguard images against manipulation under known conditions and fail to protect them from manipulations guided by tampering conditions crafted by malicious users. To tackle this issue, we propose Anti-Inpainting, a proactive defense method that achieves adequate protection under unknown conditions through a triple mechanism. Specifically, we present a multi-level deep feature extractor that obtains intricate features during the diffusion denoising process to improve protective effectiveness. We design a multi-scale, semantic-preserving data augmentation that enhances the transferability of adversarial perturbations across unknown conditions through multi-scale transformations while preserving semantic integrity. In addition, we propose a selection-based distribution deviation optimization strategy to improve the protection offered by adversarial perturbations against manipulation under diverse random seeds. Extensive experiments on InpaintGuardBench and CelebA-HQ demonstrate Anti-Inpainting's proactive defense against diffusion-based inpainters guided by unknown conditions. We also demonstrate the approach's robustness under various image purification methods and its transferability across different versions of diffusion models.
https://arxiv.org/abs/2505.13023
In recent years, implicit neural representations (INRs) have gained popularity in the computer vision community, mainly due to their strong performance in many computer vision tasks. These networks can extract a continuous signal representation from a discrete signal representation. Previous studies have repeatedly shown that INR performance is strongly correlated with the activation functions used in its multilayer perceptrons. Although numerous activation functions have been proposed that are competitive with one another, they share a common set of challenges: spectral bias (a lack of sensitivity to high-frequency content in signals), limited robustness to signal noise, difficulty in simultaneously capturing both local and global features, and the need for manual parameter tuning. To address these issues, we introduce a novel activation function, Band Shifted Raised Cosine Activated Implicit Neural Networks (BandRC), tailored to further enhance signal representation capacity. We also incorporate deep prior knowledge extracted from the signal to adjust the activation functions through a task-specific model. Through mathematical analysis and a series of experiments, which include image reconstruction (a +8.93 dB PSNR improvement over the nearest counterpart), denoising (a +0.46 dB increase in PSNR), super-resolution (a +1.03 dB improvement over the nearest state-of-the-art (SOTA) method for 6x super-resolution), inpainting, and 3D shape reconstruction, we demonstrate the dominance of BandRC over existing state-of-the-art activation functions.
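The abstract does not spell out the activation's exact form, so the sketch below uses one plausible reading: a raised-cosine pulse (standard formula with roll-off `beta` and period `T`) modulated by a learnable band-shift frequency `omega`, dropped into an INR multilayer perceptron. All parameter choices and the formula itself are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class BandShiftedRaisedCosine(nn.Module):
    """Illustrative activation: raised-cosine envelope modulated by a learnable frequency.

    rc(x)  = sinc(x / T) * cos(pi * beta * x / T) / (1 - (2 * beta * x / T)^2)
    out(x) = rc(x) * cos(omega * x)
    """
    def __init__(self, beta: float = 0.5, T: float = 1.0, omega: float = 10.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(beta))
        self.T = nn.Parameter(torch.tensor(T))
        self.omega = nn.Parameter(torch.tensor(omega))

    def forward(self, x):
        t = x / self.T
        denom = 1.0 - (2.0 * self.beta * t) ** 2
        # Crude guard against the removable singularity of the raised cosine.
        denom = torch.where(denom.abs() < 1e-4, torch.ones_like(denom), denom)
        rc = torch.sinc(t) * torch.cos(torch.pi * self.beta * t) / denom
        return rc * torch.cos(self.omega * x)

# Drop-in replacement for the activation in an INR multilayer perceptron.
inr = nn.Sequential(nn.Linear(2, 256), BandShiftedRaisedCosine(),
                    nn.Linear(256, 256), BandShiftedRaisedCosine(),
                    nn.Linear(256, 3))
coords = torch.rand(1024, 2) * 2 - 1
print(inr(coords).shape)  # (1024, 3) predicted RGB values
```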
https://arxiv.org/abs/2505.11640
This work presents Prior Depth Anything, a framework that combines incomplete but precise metric information in depth measurement with relative but complete geometric structures in depth prediction, generating accurate, dense, and detailed metric depth maps for any scene. To this end, we design a coarse-to-fine pipeline to progressively integrate the two complementary depth sources. First, we introduce pixel-level metric alignment and distance-aware weighting to pre-fill diverse metric priors by explicitly using depth prediction. It effectively narrows the domain gap between prior patterns, enhancing generalization across varying scenarios. Second, we develop a conditioned monocular depth estimation (MDE) model to refine the inherent noise of depth priors. By conditioning on the normalized pre-filled prior and prediction, the model further implicitly merges the two complementary depth sources. Our model showcases impressive zero-shot generalization across depth completion, super-resolution, and inpainting over 7 real-world datasets, matching or even surpassing previous task-specific methods. More importantly, it performs well on challenging, unseen mixed priors and enables test-time improvements by switching prediction models, providing a flexible accuracy-efficiency trade-off while evolving with advancements in MDE models.
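A minimal sketch of the pre-fill step described above, assuming a simple form of it: fit a global scale and shift that aligns the relative depth prediction to the sparse metric prior (least squares over the valid pixels), then blend toward the nearest measurements with a distance-aware weight. The exponential weighting and `sigma` are stand-ins, not the paper's exact formulation.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def prefill_metric_prior(pred_rel: np.ndarray, prior: np.ndarray, valid: np.ndarray, sigma: float = 20.0):
    """pred_rel: (H, W) relative depth; prior: (H, W) sparse metric depth; valid: (H, W) bool mask."""
    # 1) Pixel-level metric alignment: least-squares scale/shift on the valid prior pixels.
    A = np.stack([pred_rel[valid], np.ones(valid.sum())], axis=1)
    scale, shift = np.linalg.lstsq(A, prior[valid], rcond=None)[0]
    aligned = scale * pred_rel + shift

    # 2) Distance-aware weighting: trust measurements near valid pixels, the aligned prediction far away.
    dist, ind = distance_transform_edt(~valid, return_distances=True, return_indices=True)
    nearest = prior[tuple(ind)]                 # value of the nearest metric measurement per pixel
    w = np.exp(-dist / sigma)
    return w * nearest + (1.0 - w) * aligned

pred = np.random.rand(120, 160)                 # relative depth from an MDE model
gt = 3.0 * pred + 0.5                           # pretend metric depth
valid = np.random.rand(120, 160) < 0.02         # ~2% sparse metric prior (e.g. LiDAR)
prior = np.where(valid, gt, 0.0)
filled = prefill_metric_prior(pred, prior, valid)
print(np.abs(filled - gt).mean())               # small error after alignment + pre-fill
```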
https://arxiv.org/abs/2505.10565
This report introduces Aquarius, a family of industry-level video generation models for marketing scenarios, designed for thousands-xPU clusters and models with hundreds of billions of parameters. Leveraging efficient engineering architecture and algorithmic innovation, Aquarius demonstrates exceptional performance in high-fidelity, multi-aspect-ratio, and long-duration video synthesis. By disclosing the framework's design details, we aim to demystify industrial-scale video generation systems and catalyze advancements in the generative video community. The Aquarius framework consists of five components: (1) a distributed graph and video data processing pipeline that manages tens of thousands of CPUs and thousands of xPUs via automated task distribution, enabling efficient video data processing (we are about to open-source the entire data processing framework as "Aquarius-Datapipe"); (2) model architectures for different scales, including a Single-DiT architecture for 2B models and a Multimodal-DiT architecture for 13.4B models, supporting multi-aspect-ratio, multi-resolution, and multi-duration video generation; (3) high-performance infrastructure designed for video generation model training, which incorporates hybrid parallelism and fine-grained memory optimization strategies and achieves 36% MFU at large scale; (4) multi-xPU parallel inference acceleration, which uses diffusion cache and attention optimization to achieve a 2.35x inference speedup; and (5) multiple marketing-scenario applications, including image-to-video, text-to-video (avatar), video inpainting, and video personalization, among others. More downstream applications and multi-dimensional evaluation metrics will be added in upcoming version updates.
https://arxiv.org/abs/2505.10584
The recent proliferation of generative AI tools for visual content creation, particularly in the context of visual artworks, has raised serious concerns about copyright infringement and forgery. The large-scale datasets used to train these models often contain a mixture of copyrighted and non-copyrighted artworks. Given the tendency of generative models to memorize training patterns, they are susceptible to varying degrees of copyright violation. Building on the recently proposed DeepfakeArt Challenge benchmark, this work introduces DFA-CON, a contrastive learning framework designed to detect copyright-infringing or forged AI-generated art. DFA-CON learns a discriminative representation space by promoting affinity between original artworks and their forged counterparts within a contrastive learning framework. The model is trained across multiple attack types, including inpainting, style transfer, adversarial perturbation, and cutmix. Evaluation results demonstrate robust detection performance across most attack types, outperforming recent pretrained foundation models. Code and model checkpoints will be released publicly upon acceptance.
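A sketch of how the affinity objective could look, assuming an InfoNCE-style formulation in which each forged sample's positive is the original it was derived from and the other artworks in the batch act as negatives; the encoder and temperature below are placeholders, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def forgery_affinity_infonce(z_orig: torch.Tensor, z_forged: torch.Tensor, tau: float = 0.07):
    """InfoNCE-style loss: each forged sample should be closest to the original it was derived from.

    z_orig, z_forged: (B, D) embeddings where z_forged[i] is a forgery of z_orig[i].
    """
    z_orig = F.normalize(z_orig, dim=1)
    z_forged = F.normalize(z_forged, dim=1)
    logits = z_forged @ z_orig.t() / tau            # (B, B) cosine similarities
    targets = torch.arange(z_orig.size(0))          # positive is the matching original
    return F.cross_entropy(logits, targets)

# Toy usage with a shared encoder (here just a linear projection of image features).
encoder = torch.nn.Linear(512, 128)
feats_orig, feats_forged = torch.randn(16, 512), torch.randn(16, 512)
loss = forgery_affinity_infonce(encoder(feats_orig), encoder(feats_forged))
loss.backward()
print(float(loss))
```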
https://arxiv.org/abs/2505.08552
Raindrop removal is a challenging task in image processing, and relying solely on a single image further increases the difficulty. Common approaches first detect raindrop regions in the image and then perform background restoration conditioned on those regions. While various methods can be applied for the detection step, the most common architecture used for background restoration is the Generative Adversarial Network (GAN). Recent advances in diffusion models have led to state-of-the-art image inpainting techniques. In this paper, we introduce a novel technique for raindrop removal from a single image using diffusion-based image inpainting.
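As a hedged illustration of the final sentence, a generic off-the-shelf diffusion inpainting pipeline (here Stable Diffusion 2 inpainting via the diffusers library) can be conditioned on a detected raindrop mask; the checkpoint, prompt, file paths, and the detector that produces `raindrop_mask.png` are assumptions rather than the paper's setup.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Load a generic diffusion inpainting model (any inpainting checkpoint would do).
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("rainy_frame.png").convert("RGB").resize((512, 512))
# White pixels mark raindrop regions produced by whatever detector precedes this step.
raindrop_mask = Image.open("raindrop_mask.png").convert("L").resize((512, 512))

restored = pipe(
    prompt="a clear photo, no raindrops on the lens",
    image=image,
    mask_image=raindrop_mask,
    num_inference_steps=30,
).images[0]
restored.save("restored_frame.png")
```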
https://arxiv.org/abs/2505.08190
Segmenting objects in an environment is a crucial task for autonomous driving and robotics, as it enables a better understanding of the surroundings of each agent. Although camera sensors provide rich visual details, they are vulnerable to adverse weather conditions. In contrast, radar sensors remain robust under such conditions, but often produce sparse and noisy data. Therefore, a promising approach is to fuse information from both sensors. In this work, we propose a novel framework to enhance camera-only baselines by integrating a diffusion model into a camera-radar fusion architecture. We leverage radar point features to create pseudo-masks using the Segment-Anything model, treating the projected radar points as point prompts. Additionally, we propose a noise reduction unit to denoise these pseudo-masks, which are further used to generate inpainted images that complete the missing information in the original images. Our method improves the camera-only segmentation baseline by 2.63% in mIoU and enhances our camera-radar fusion architecture by 1.48% in mIoU on the Waterscenes dataset. This demonstrates the effectiveness of our approach for semantic segmentation using camera-radar fusion under adverse weather conditions.
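A sketch of the pseudo-mask step using the public segment_anything API: projected radar points serve as positive point prompts, and the per-point masks are unioned before the proposed noise reduction unit would denoise them. The checkpoint path, image file, and projected coordinates are placeholders.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("camera_frame.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Radar detections already projected into pixel coordinates (u, v); all treated as foreground prompts.
radar_uv = np.array([[412, 300], [640, 355], [905, 310]])

pseudo_masks = []
for uv in radar_uv:
    masks, scores, _ = predictor.predict(point_coords=uv[None, :],
                                         point_labels=np.array([1]),
                                         multimask_output=True)
    pseudo_masks.append(masks[np.argmax(scores)])   # keep the highest-scoring mask per radar point

pseudo_mask = np.any(pseudo_masks, axis=0)          # union of per-point masks, before denoising
```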
https://arxiv.org/abs/2505.03679
We explore Generalizable Tumor Segmentation, aiming to train a single model for zero-shot tumor segmentation across diverse anatomical regions. Existing methods face limitations related to segmentation quality, scalability, and the range of applicable imaging modalities. In this paper, we uncover the potential of the internal representations within frozen medical foundation diffusion models as highly efficient zero-shot learners for tumor segmentation by introducing a novel framework named DiffuGTS. DiffuGTS creates anomaly-aware open-vocabulary attention maps based on text prompts to enable generalizable anomaly segmentation without being restricted by a predefined training category list. To further improve and refine anomaly segmentation masks, DiffuGTS leverages the diffusion model, transforming pathological regions into high-quality pseudo-healthy counterparts through latent space inpainting, and applies a novel pixel-level and feature-level residual learning approach, resulting in segmentation masks with significantly enhanced quality and generalization. Comprehensive experiments on four datasets and seven tumor categories demonstrate the superior performance of our method, surpassing current state-of-the-art models across multiple zero-shot settings. Code is available at this https URL.
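As a simplified illustration of the residual-learning idea, the pixel-level residual between the original scan and its pseudo-healthy reconstruction already localizes the anomaly; the paper additionally learns feature-level residuals, so the smoothing and quantile threshold below are only a toy stand-in.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def residual_anomaly_mask(original: np.ndarray, pseudo_healthy: np.ndarray,
                          smooth: float = 2.0, quantile: float = 0.98):
    """Pixel-level residual between a scan and its pseudo-healthy reconstruction."""
    residual = gaussian_filter(np.abs(original - pseudo_healthy), sigma=smooth)
    threshold = np.quantile(residual, quantile)          # keep the most deviating pixels
    return residual, residual > threshold

# Toy example: a synthetic "lesion" that the latent-space inpainting removed.
scan = np.random.rand(128, 128) * 0.1
scan[40:60, 50:70] += 0.8                                # bright anomalous region
reconstruction = np.random.rand(128, 128) * 0.1          # pseudo-healthy counterpart
residual, mask = residual_anomaly_mask(scan, reconstruction)
print(mask[40:60, 50:70].mean(), mask.mean())            # high inside the lesion, low overall
```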
https://arxiv.org/abs/2505.02753
Collecting demonstrations enriched with fine-grained tactile information is critical for dexterous manipulation, particularly in contact-rich tasks that require precise force control and physical interaction. While prior works primarily focus on teleoperation or video-based retargeting, they often suffer from kinematic mismatches and the absence of real-time tactile feedback, hindering the acquisition of high-fidelity tactile data. To mitigate this issue, we propose KineDex, a hand-over-hand kinesthetic teaching paradigm in which the operator's motion is directly transferred to the dexterous hand, enabling the collection of physically grounded demonstrations enriched with accurate tactile feedback. To resolve occlusions caused by the human hand, we apply an inpainting technique to preprocess the visual observations. Based on these demonstrations, we then train a visuomotor policy using tactile-augmented inputs and implement force control during deployment for precise contact-rich manipulation. We evaluate KineDex on a suite of challenging contact-rich manipulation tasks, including particularly difficult scenarios such as squeezing toothpaste onto a toothbrush, which require precise multi-finger coordination and stable force regulation. Across these tasks, KineDex achieves an average success rate of 74.4%, representing a 57.7% improvement over the variant without force control. Comparative experiments with teleoperation and user studies further validate the advantages of KineDex in data collection efficiency and operability. Specifically, KineDex collects data over twice as fast as teleoperation across two tasks of varying difficulty, while maintaining a near-100% success rate, compared to under 50% for teleoperation.
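The abstract only states that an inpainting technique preprocesses the visual observations; as a minimal stand-in, classical OpenCV inpainting over a hand mask illustrates the step (the actual system may well use a learned inpainter, and the file paths and mask source here are assumed).

```python
import cv2
import numpy as np

frame = cv2.imread("wrist_camera_frame.png")                    # observation with the operator's hand visible
hand_mask = cv2.imread("hand_mask.png", cv2.IMREAD_GRAYSCALE)   # 255 where the human hand occludes the scene

# Fill occluded pixels from the surrounding context before feeding the frame to the visuomotor policy.
clean_frame = cv2.inpaint(frame, (hand_mask > 127).astype(np.uint8), 5, cv2.INPAINT_TELEA)
cv2.imwrite("clean_frame.png", clean_frame)
```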
https://arxiv.org/abs/2505.01974
Implicit neural representations (INRs) have found successful applications across diverse domains. To employ INRs in real-life applications, it is important to speed up training. In the field of INRs for video applications, the state-of-the-art approach employs grid-type parametric encoding and achieves a faster encoding speed than its predecessors. However, the grid usage, which does not consider the video's dynamic nature, leads to redundant use of trainable parameters. As a result, it has significantly lower parameter efficiency and a higher bitrate than NeRV-style methods that do not use a parametric encoding. To address this problem, we propose Neural Video representation with Temporally coherent Modulation (NVTM), a novel framework that can capture the dynamic characteristics of video. By decomposing the spatio-temporal 3D video data into a set of 2D grids with flow information, NVTM enables rapid learning of the video representation and uses parameters efficiently. Our framework processes temporally corresponding pixels at once, resulting in the fastest encoding speed for a reasonable video quality, over 3 times faster than the NeRV-style method. It also achieves average improvements of 1.54 dB / 0.019 in PSNR/LPIPS on UVG (Dynamic) (even with 10% fewer parameters) and 1.84 dB / 0.013 in PSNR/LPIPS on MCL-JCV (Dynamic), compared to previous grid-type works. By extending this to compression tasks, we demonstrate performance comparable to video compression standards (H.264, HEVC) and recent INR approaches for video compression. Additionally, we perform extensive experiments demonstrating the superior performance of our algorithm across diverse tasks, encompassing super-resolution, frame interpolation, and video inpainting. Project page is this https URL.
https://arxiv.org/abs/2505.00335
The rising popularity of immersive visual experiences has increased interest in stereoscopic 3D video generation. Despite significant advances in video synthesis, creating 3D videos remains challenging due to the relative scarcity of 3D video data. We propose a simple approach for transforming a text-to-video generator into a video-to-stereo generator. Given an input video, our framework automatically produces the video frames from a shifted viewpoint, enabling a compelling 3D effect. Prior and concurrent approaches for this task typically operate in multiple phases, first estimating video disparity or depth, then warping the video accordingly to produce a second view, and finally inpainting the disoccluded regions. This approach inherently fails when the scene involves specular surfaces or transparent objects. In such cases, single-layer disparity estimation is insufficient, resulting in artifacts and incorrect pixel shifts during warping. Our work bypasses these restrictions by directly synthesizing the new viewpoint, avoiding any intermediate steps. This is achieved by leveraging a pre-trained video model's priors on geometry, object materials, optics, and semantics, without relying on external geometry models or manually disentangling geometry from the synthesis process. We demonstrate the advantages of our approach in complex, real-world scenarios featuring diverse object materials and compositions. See videos on this https URL
https://arxiv.org/abs/2505.00135
Existing deep learning-based image inpainting methods typically rely on convolutional networks with RGB images to reconstruct images. However, relying exclusively on RGB images may neglect important depth information, which plays a critical role in understanding the spatial and structural context of a scene. Just as human vision leverages stereo cues to perceive depth, incorporating depth maps into the inpainting process can enhance the model's ability to reconstruct images with greater accuracy and contextual awareness. In this paper, we propose a novel approach that incorporates both RGB and depth images for enhanced image inpainting. Our models employ a dual encoder architecture, where one encoder processes the RGB image and the other handles the depth image. The encoded features from both encoders are then fused in the decoder using an attention mechanism, effectively integrating the RGB and depth representations. We use two different masking strategies, line and square, to test the robustness of the model under different types of occlusions. To further analyze the effectiveness of our approach, we use Gradient-weighted Class Activation Mapping (Grad-CAM) visualizations to examine the regions of interest the model focuses on during inpainting. We show that incorporating depth information alongside the RGB image significantly improves the reconstruction quality. Through both qualitative and quantitative comparisons, we demonstrate that the depth-integrated model outperforms the baseline, with attention mechanisms further enhancing inpainting performance, as evidenced by multiple evaluation metrics and visualization.
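A compact sketch of the described architecture, under simplifying assumptions (single fusion scale, small channel counts): two convolutional encoders for the masked RGB image and the depth map, cross-attention fusion of their features, and a light decoder that reconstructs the image.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.ReLU(inplace=True))

class DualEncoderInpainter(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.enc_rgb = nn.Sequential(conv_block(4, dim), conv_block(dim, dim))    # masked RGB + mask channel
        self.enc_depth = nn.Sequential(conv_block(1, dim), conv_block(dim, dim))
        self.fuse = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(dim, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, rgb_masked, mask, depth):
        f_rgb = self.enc_rgb(torch.cat([rgb_masked, mask], dim=1))    # (B, C, H/4, W/4)
        f_dep = self.enc_depth(depth)
        b, c, h, w = f_rgb.shape
        q = f_rgb.flatten(2).transpose(1, 2)                          # queries from RGB features
        kv = f_dep.flatten(2).transpose(1, 2)                         # keys/values from depth features
        fused, _ = self.fuse(q, kv, kv)
        fused = (fused + q).transpose(1, 2).reshape(b, c, h, w)       # residual attention fusion
        return self.dec(fused)

model = DualEncoderInpainter()
rgb = torch.rand(2, 3, 64, 64)
mask = (torch.rand(2, 1, 64, 64) > 0.7).float()                       # line/square occlusions in practice
depth = torch.rand(2, 1, 64, 64)
out = model(rgb * (1 - mask), mask, depth)
print(out.shape)                                                      # (2, 3, 64, 64) reconstructed image
```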
https://arxiv.org/abs/2505.00735
Room Impulse Responses (RIRs) characterize acoustic environments and are crucial in multiple audio signal processing tasks. High-quality RIR estimates drive applications such as virtual microphones, sound source localization, augmented reality, and data augmentation. However, obtaining RIR measurements with high spatial resolution is resource-intensive, making it impractical for large spaces or when dense sampling is required. This research addresses the challenge of estimating RIRs at unmeasured locations within a room using Denoising Diffusion Probabilistic Models (DDPM). Our method leverages the analogy between RIR matrices and image inpainting, transforming RIR data into a format suitable for diffusion-based reconstruction. Using simulated RIR data based on the image method, we demonstrate our approach's effectiveness on microphone arrays of different curvatures, from linear to semi-circular. Our method successfully reconstructs missing RIRs, even across large gaps between microphones. Under these conditions, it achieves accurate reconstruction, significantly outperforming a cubic spline interpolation baseline in terms of normalized mean square error and cosine distance between the actual and interpolated RIRs. This research highlights the potential of generative models for effective RIR interpolation, paving the way for generating additional data from limited real-world measurements.
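A sketch of the problem setup, with random placeholder RIRs instead of image-method simulations: stack the array's RIRs into a (microphone x time) matrix, zero out the rows of unmeasured positions as the inpainting target, and compare against the cubic-spline baseline that interpolates across the microphone axis.

```python
import numpy as np
from scipy.interpolate import CubicSpline

n_mics, n_samples = 32, 2048
# Decaying random placeholders standing in for image-method RIR simulations.
rir_matrix = np.random.randn(n_mics, n_samples) * np.exp(-np.arange(n_samples) / 300.0)

measured = np.ones(n_mics, dtype=bool)
measured[10:16] = False                                   # a gap of unmeasured microphone positions
masked = np.where(measured[:, None], rir_matrix, 0.0)     # the "image" a diffusion model would inpaint

# Baseline: cubic spline across the microphone axis, fit per time sample.
spline = CubicSpline(np.flatnonzero(measured), rir_matrix[measured], axis=0)
baseline = spline(np.arange(n_mics))

nmse = np.sum((baseline[~measured] - rir_matrix[~measured]) ** 2) / np.sum(rir_matrix[~measured] ** 2)
print(f"spline NMSE over the gap: {nmse:.3f}")            # the DDPM inpainter is reported to do much better
```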
https://arxiv.org/abs/2504.20625
Image inpainting is a fundamental research area between image editing and image generation. Recent state-of-the-art (SOTA) methods have explored novel attention mechanisms, lightweight architectures, and context-aware modeling, demonstrating impressive performance. However, they often struggle with complex structure (e.g., texture, shape, spatial relations) and semantics (e.g., color consistency, object restoration, and logical correctness), leading to artifacts and inappropriate generation. To address this challenge, we design a simple yet effective inpainting paradigm called latent categories guidance, and further propose a diffusion-based model named PixelHacker. Specifically, we first construct a large dataset containing 14 million image-mask pairs by annotating foreground and background (116 and 21 potential categories, respectively). Then, we encode potential foreground and background representations separately through two fixed-size embeddings, and intermittently inject these features into the denoising process via linear attention. Finally, by pre-training on our dataset and fine-tuning on open-source benchmarks, we obtain PixelHacker. Extensive experiments show that PixelHacker comprehensively outperforms the SOTA on a wide range of datasets (Places2, CelebA-HQ, and FFHQ) and exhibits remarkable consistency in both structure and semantics. Project page at this https URL.
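A sketch of the injection mechanism under stated assumptions (dimensions, the elu+1 kernel feature map, and the injection point are all illustrative): fixed-size foreground/background category embeddings provide keys and values for a linear-attention layer whose queries are the denoiser's latent tokens, added back residually.

```python
import torch
import torch.nn as nn

class LinearAttentionInjection(nn.Module):
    """Inject latent-category guidance into denoiser features via linear attention."""
    def __init__(self, feat_dim=320, n_fg=116, n_bg=21, emb_dim=320):
        super().__init__()
        self.fg_emb = nn.Embedding(n_fg, emb_dim)          # fixed-size foreground category table
        self.bg_emb = nn.Embedding(n_bg, emb_dim)          # fixed-size background category table
        self.to_q = nn.Linear(feat_dim, emb_dim, bias=False)
        self.to_k = nn.Linear(emb_dim, emb_dim, bias=False)
        self.to_v = nn.Linear(emb_dim, emb_dim, bias=False)
        self.proj = nn.Linear(emb_dim, feat_dim)

    def forward(self, feats, fg_ids, bg_ids):
        """feats: (B, N, C) denoiser tokens; fg_ids/bg_ids: (B, K) category indices present in the image."""
        ctx = torch.cat([self.fg_emb(fg_ids), self.bg_emb(bg_ids)], dim=1)     # (B, M, E)
        q = torch.nn.functional.elu(self.to_q(feats)) + 1                      # kernel feature map
        k = torch.nn.functional.elu(self.to_k(ctx)) + 1
        v = self.to_v(ctx)
        kv = torch.einsum("bme,bmd->bed", k, v)                                # linear in sequence length
        z = 1.0 / (torch.einsum("bne,be->bn", q, k.sum(dim=1)) + 1e-6)
        out = torch.einsum("bne,bed,bn->bnd", q, kv, z)
        return feats + self.proj(out)                                          # residual injection

inject = LinearAttentionInjection()
feats = torch.randn(2, 64 * 64, 320)
fg_ids = torch.randint(0, 116, (2, 4))
bg_ids = torch.randint(0, 21, (2, 2))
print(inject(feats, fg_ids, bg_ids).shape)     # (2, 4096, 320), same shape as the denoiser features
```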
https://arxiv.org/abs/2504.20438
Recent methods for human image completion can reconstruct plausible body shapes but often fail to preserve unique details, such as specific clothing patterns or distinctive accessories, without explicit reference images. Even state-of-the-art reference-based inpainting approaches struggle to accurately capture and integrate fine-grained details from reference images. To address this limitation, we propose CompleteMe, a novel reference-based human image completion framework. CompleteMe employs a dual U-Net architecture combined with a Region-focused Attention (RFA) Block, which explicitly guides the model's attention toward relevant regions in reference images. This approach effectively captures fine details and ensures accurate semantic correspondence, significantly improving the fidelity and consistency of completed images. Additionally, we introduce a challenging benchmark specifically designed for evaluating reference-based human image completion tasks. Extensive experiments demonstrate that our proposed method achieves superior visual quality and semantic consistency compared to existing techniques. Project page: this https URL
https://arxiv.org/abs/2504.20042
Image deocclusion (or amodal completion) aims to recover the invisible regions (i.e., shape and appearance) of occluded instances in images. Despite recent advances, the scarcity of high-quality data that balances diversity, plausibility, and fidelity remains a major obstacle. To address this challenge, we identify three critical elements: leveraging in-the-wild image data for diversity, incorporating human expertise for plausibility, and utilizing generative priors for fidelity. We propose SynergyAmodal, a novel framework for co-synthesizing in-the-wild amodal datasets with comprehensive shape and appearance annotations, which integrates these elements through a tripartite data-human-model collaboration. First, we design an occlusion-grounded self-supervised learning algorithm to harness the diversity of in-the-wild image data, fine-tuning an inpainting diffusion model into a partial completion diffusion model. Second, we establish a co-synthesis pipeline to iteratively filter, refine, select, and annotate the initial deocclusion results of the partial completion diffusion model, ensuring plausibility and fidelity through human expert guidance and prior model constraints. This pipeline generates a high-quality paired amodal dataset with extensive category and scale diversity, comprising approximately 16K pairs. Finally, we train a full completion diffusion model on the synthesized dataset, incorporating text prompts as conditioning signals. Extensive experiments demonstrate the effectiveness of our framework in achieving zero-shot generalization and textual controllability. Our code, dataset, and models will be made publicly available at this https URL.
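A generic version of the occlusion-grounded self-supervision recipe, not the paper's exact pipeline: paste a segmented occluder over a complete instance to synthesize an (occluded image, visible mask, full amodal mask) training triplet for the partial completion model.

```python
import numpy as np

def synthesize_occlusion(instance_rgb, instance_mask, occluder_rgb, occluder_mask, offset):
    """Overlay an occluder onto a complete instance to create a self-supervised amodal training pair."""
    h, w = instance_mask.shape
    dy, dx = offset
    shifted_occ = np.zeros_like(occluder_mask)
    shifted_rgb = np.zeros_like(occluder_rgb)
    ys, xs = np.where(occluder_mask)
    keep = (ys + dy >= 0) & (ys + dy < h) & (xs + dx >= 0) & (xs + dx < w)
    shifted_occ[ys[keep] + dy, xs[keep] + dx] = True
    shifted_rgb[ys[keep] + dy, xs[keep] + dx] = occluder_rgb[ys[keep], xs[keep]]

    occluded_rgb = np.where(shifted_occ[..., None], shifted_rgb, instance_rgb)
    visible_mask = instance_mask & ~shifted_occ          # modal (visible) mask
    return occluded_rgb, visible_mask, instance_mask     # amodal target = the original full mask

img = np.random.rand(128, 128, 3)
inst = np.zeros((128, 128), bool)
inst[30:100, 30:100] = True
occ_rgb = np.random.rand(128, 128, 3)
occ = np.zeros((128, 128), bool)
occ[0:60, 0:60] = True
occluded, visible, amodal = synthesize_occlusion(img, inst, occ_rgb, occ, offset=(40, 40))
print(visible.sum(), amodal.sum())                       # visible area shrinks; amodal target is unchanged
```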
https://arxiv.org/abs/2504.19506
With the emergence of transformer-based architectures and large language models (LLMs), the accuracy of road scene perception has substantially advanced. Nonetheless, current road scene segmentation approaches are predominantly trained on closed-set data, resulting in insufficient detection capabilities for out-of-distribution (OOD) objects. To overcome this limitation, road anomaly detection methods have been proposed. However, existing methods primarily depend on image inpainting and OOD distribution detection techniques, facing two critical issues: (1) inadequate consideration of the objectiveness attributes of anomalous regions, causing incomplete segmentation when anomalous objects share similarities with known classes, and (2) insufficient attention to environmental constraints, leading to the detection of anomalies irrelevant to autonomous driving tasks. In this paper, we propose a novel framework termed Segmenting Objectiveness and Task-Awareness (SOTA) for autonomous driving scenes. Specifically, SOTA enhances the segmentation of objectiveness through a Semantic Fusion Block (SFB) and filters anomalies irrelevant to road navigation tasks using a Scene-understanding Guided Prompt-Context Adaptor (SG-PCA). Extensive empirical evaluations on multiple benchmark datasets, including Fishyscapes Lost and Found, Segment-Me-If-You-Can, and RoadAnomaly, demonstrate that the proposed SOTA consistently improves OOD detection performance across diverse detectors, achieving robust and accurate segmentation outcomes.
https://arxiv.org/abs/2504.19183
Recent advances in image manipulation have achieved unprecedented progress in generating photorealistic content, but they have also eliminated barriers to arbitrary manipulation and editing, raising concerns about multimedia authenticity and cybersecurity. However, existing Image Manipulation Detection and Localization (IMDL) methodologies predominantly focus on splicing or copy-move forgeries, lacking dedicated benchmarks for inpainting-based manipulations. To bridge this gap, we present COCOInpaint, a comprehensive benchmark specifically designed for inpainting detection, with three key contributions: 1) high-quality inpainting samples generated by six state-of-the-art inpainting models, 2) diverse generation scenarios enabled by four mask generation strategies with optional text guidance, and 3) large-scale coverage with 258,266 inpainted images with rich semantic diversity. Our benchmark is constructed to emphasize intrinsic inconsistencies between inpainted and authentic regions, rather than superficial semantic artifacts such as object shapes. We establish a rigorous evaluation protocol using three standard metrics to assess existing IMDL approaches. The dataset will be made publicly available to facilitate future research in this area.
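The four mask generation strategies are not detailed in the abstract, so the sketch below shows two generic examples of the kind of strategies such a benchmark might use: random rectangles and free-form strokes.

```python
import numpy as np
import cv2

def random_box_mask(h, w, min_frac=0.1, max_frac=0.4):
    """Mask a single random rectangle (255 = region to inpaint)."""
    mask = np.zeros((h, w), np.uint8)
    bh = np.random.randint(int(h * min_frac), int(h * max_frac))
    bw = np.random.randint(int(w * min_frac), int(w * max_frac))
    y, x = np.random.randint(0, h - bh), np.random.randint(0, w - bw)
    mask[y:y + bh, x:x + bw] = 255
    return mask

def free_form_mask(h, w, strokes=5):
    """Mask a few thick random-walk strokes, mimicking brush-style edits."""
    mask = np.zeros((h, w), np.uint8)
    for _ in range(strokes):
        x, y = np.random.randint(0, w), np.random.randint(0, h)
        for _ in range(np.random.randint(4, 12)):
            nx = int(np.clip(x + np.random.randint(-80, 80), 0, w - 1))
            ny = int(np.clip(y + np.random.randint(-80, 80), 0, h - 1))
            cv2.line(mask, (int(x), int(y)), (nx, ny), 255, thickness=np.random.randint(10, 30))
            x, y = nx, ny
    return mask

box, stroke = random_box_mask(512, 512), free_form_mask(512, 512)
print(box.mean() / 255, stroke.mean() / 255)   # fraction of pixels each strategy asks the inpainter to fill
```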
https://arxiv.org/abs/2504.18361