Pre-trained language models have achieved impressive results in various music understanding and generation tasks. However, existing pre-training methods for symbolic melody generation struggle to capture the multi-scale, multi-dimensional structural information in note sequences, owing to the domain knowledge discrepancy between text and music. Moreover, the lack of available large-scale symbolic melody datasets limits the gains from pre-training. In this paper, we propose MelodyGLM, a multi-task pre-training framework for generating melodies with long-term structure. We design melodic n-gram and long span sampling strategies to create local and global blank infilling tasks for modeling the local and global structures in melodies. Specifically, we incorporate pitch n-grams, rhythm n-grams, and their combined n-grams into the melodic n-gram blank infilling tasks for modeling the multi-dimensional structures in melodies. To this end, we have constructed a large-scale symbolic melody dataset, MelodyNet, containing more than 0.4 million melody pieces. MelodyNet is used for large-scale pre-training and domain-specific n-gram lexicon construction. Both subjective and objective evaluations demonstrate that MelodyGLM surpasses the standard and previous pre-training methods. In particular, subjective evaluations show that, on the melody continuation task, MelodyGLM achieves average improvements of 0.82, 0.87, 0.78, and 0.94 in consistency, rhythmicity, structure, and overall quality, respectively. Notably, MelodyGLM nearly matches the quality of human-composed melodies on the melody inpainting task.
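A minimal sketch of what the melodic n-gram and long-span blank-infilling corruptions might look like, assuming a melody encoded as a list of (pitch, duration) tokens; all function names, span lengths, and masking ratios here are illustrative rather than taken from the paper:

```python
import random

def sample_ngram_spans(melody, n_min=2, n_max=5, mask_ratio=0.15):
    """Pick short n-gram spans (local structure) to blank out for infilling."""
    spans, masked = [], 0
    target = int(mask_ratio * len(melody))
    while masked < target:
        n = random.randint(n_min, n_max)
        start = random.randrange(0, max(1, len(melody) - n))
        spans.append((start, start + n))
        masked += n
    return spans

def sample_long_span(melody, ratio=0.5):
    """Pick one long contiguous span (global structure) to blank out."""
    length = int(ratio * len(melody))
    start = random.randrange(0, len(melody) - length + 1)
    return [(start, start + length)]

def apply_blank_infilling(melody, spans, mask_token=("[MASK]", "[MASK]")):
    """Replace the selected spans with mask tokens; the model learns to refill them."""
    corrupted = list(melody)
    for start, end in spans:
        for i in range(start, end):
            corrupted[i] = mask_token
    return corrupted

if __name__ == "__main__":
    melody = [(60 + i % 12, 4) for i in range(32)]  # toy (pitch, duration) sequence
    print(apply_blank_infilling(melody, sample_ngram_spans(melody)))
    print(apply_blank_infilling(melody, sample_long_span(melody)))
```

In the multi-dimensional variant described above, the same span-selection logic would be applied separately to the pitch stream, the rhythm stream, and the combined tokens.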
https://arxiv.org/abs/2309.10738
In recent years, novel view synthesis from a single image has seen significant progress thanks to the rapid advancements in 3D scene representation and image inpainting techniques. While the current approaches are able to synthesize geometrically consistent novel views, they often do not handle the view-dependent effects properly. Specifically, the highlights in their synthesized images usually appear to be glued to the surfaces, making the novel views unrealistic. To address this major problem, we make a key observation that the process of synthesizing novel views requires changing the shading of the pixels based on the novel camera, and moving them to appropriate locations. Therefore, we propose to split the view synthesis process into two independent tasks of pixel reshading and relocation. During the reshading process, we take the single image as the input and adjust its shading based on the novel camera. This reshaded image is then used as the input to an existing view synthesis method to relocate the pixels and produce the final novel view image. We propose to use a neural network to perform reshading and generate a large set of synthetic input-reshaded pairs to train our network. We demonstrate that our approach produces plausible novel view images with realistic moving highlights on a variety of real world scenes.
https://arxiv.org/abs/2309.10689
Denoising Diffusion Probabilistic Models (DDPMs) have recently achieved remarkable results in conditional and unconditional image generation. The pre-trained models can be adapted without further training to different downstream tasks, by guiding their iterative denoising process at inference time to satisfy additional constraints. For the specific task of image inpainting, the current guiding mechanism relies on copying-and-pasting the known regions from the input image at each denoising step. However, diffusion models are strongly conditioned by the initial random noise, and therefore struggle to harmonize predictions inside the inpainting mask with the real parts of the input image, often producing results with unnatural artifacts. Our method, dubbed GradPaint, steers the generation towards a globally coherent image. At each step in the denoising process, we leverage the model's "denoised image estimation" by calculating a custom loss measuring its coherence with the masked input image. Our guiding mechanism uses the gradient obtained from backpropagating this loss through the diffusion model itself. GradPaint generalizes well to diffusion models trained on various datasets, improving upon current state-of-the-art supervised and unsupervised methods.
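A rough sketch of the kind of gradient-guided denoising step described above, assuming a diffusers-style DDPM scheduler and a generic noise-prediction network; the exact coherence loss and update rule used by GradPaint may differ from this illustration:

```python
import torch

def guided_denoising_step(model, scheduler, x_t, t, masked_image, mask, guidance_scale=1.0):
    """One reverse-diffusion step nudged toward coherence with the known pixels.
    `model` predicts noise; `scheduler` exposes `alphas_cumprod` and `step(...)`."""
    x_t = x_t.detach().requires_grad_(True)

    eps = model(x_t, t)                                   # predicted noise
    alpha_bar = scheduler.alphas_cumprod[t]
    # "Denoised image estimation" (x0 estimate) from the current noisy sample.
    x0_hat = (x_t - torch.sqrt(1 - alpha_bar) * eps) / torch.sqrt(alpha_bar)

    # Custom loss: coherence of the x0 estimate with the unmasked input regions.
    loss = ((x0_hat - masked_image) ** 2 * mask).sum()

    # Backpropagate the loss through the diffusion model itself.
    grad = torch.autograd.grad(loss, x_t)[0]

    # Standard DDPM transition, shifted against the coherence gradient.
    x_prev = scheduler.step(eps.detach(), t, x_t.detach()).prev_sample
    return x_prev - guidance_scale * grad
```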
https://arxiv.org/abs/2309.09614
Procedural Content Generation (PCG) and Procedural Content Generation via Machine Learning (PCGML) have been used in prior work for generating levels in various games. This paper introduces Content Augmentation and focuses on the subproblem of level inpainting, which involves reconstructing and extending video game levels. Drawing inspiration from image inpainting, we adapt two techniques from this domain to address our specific use case. We present two approaches for level inpainting: an Autoencoder and a U-net. Through a comprehensive case study, we demonstrate their superior performance compared to a baseline method and discuss their relative merits. Furthermore, we provide a practical demonstration of both approaches for the level inpainting task and offer insights into potential directions for future research.
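As a sketch of the autoencoder variant, a tiny convolutional autoencoder over a one-hot tile grid could be trained to reconstruct levels whose masked region has been zeroed out; the architecture below is illustrative and not the paper's exact model:

```python
import torch
import torch.nn as nn

class LevelAutoencoder(nn.Module):
    """Convolutional autoencoder over a one-hot tile grid (C tile types, H x W cells),
    trained to reconstruct full levels from inputs with the inpainting region blanked."""
    def __init__(self, num_tile_types=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(num_tile_types, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_tile_types, 4, stride=2, padding=1),
        )

    def forward(self, level_onehot, mask):
        corrupted = level_onehot * (1 - mask)          # blank the region to inpaint
        return self.decoder(self.encoder(corrupted))   # logits over tile types per cell

# Training would minimise cross-entropy between these logits and the original tiles,
# optionally weighted toward the masked region.
```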
https://arxiv.org/abs/2309.09472
The global pandemic caused by the spread of COVID-19 has posed challenges in a new dimension for facial recognition, as people have started to wear masks. Under such conditions, the authors consider utilizing machine learning for image inpainting to tackle the problem, by completing the part of the face originally covered by the mask. In particular, the autoencoder has great potential for retaining important, general features of the image, while the generative adversarial network (GAN) contributes its generative power. The authors implement a combination of the two models, the context encoder, explain how it combines the strengths of both, and train the model with 50,000 images of influencers' faces, yielding a solid result that still leaves room for improvement. Furthermore, the authors discuss some shortcomings of the model and their possible improvements, areas of study for future investigation from an applicative perspective, and directions to further enhance and refine the model.
https://arxiv.org/abs/2309.07293
We aim to provide a general framework for computational photography that recovers the real scene from imperfect images via Deep Nonparametric Convexified Filtering (DNCF). It consists of a nonparametric deep network that resembles the physical equations behind image formation, such as denoising, super-resolution, inpainting, and flash. DNCF has no parameterization dependent on training data, and therefore has strong generalization and robustness to adversarial image manipulation. During inference, we also encourage the network parameters to be nonnegative and create a bi-convex function of the input and parameters; this suits second-order optimization algorithms under limited running time, yielding a 10X acceleration over Deep Image Prior. With these tools, we empirically verify its capability to defend image classification deep networks against adversarial attack algorithms in real time.
https://arxiv.org/abs/2309.06724
The accuracy of learning-based optical flow estimation models heavily relies on the realism of the training datasets. Current approaches for generating such datasets either employ synthetic data or generate images with limited realism. However, the domain gap of these data with real-world scenes constrains the generalization of the trained model to real-world applications. To address this issue, we investigate generating realistic optical flow datasets from real-world images. Firstly, to generate highly realistic new images, we construct a layered depth representation, known as multiplane images (MPI), from single-view images. This allows us to generate novel view images that are highly realistic. To generate optical flow maps that correspond accurately to the new image, we calculate the optical flows of each plane using the camera matrix and plane depths. We then project these layered optical flows into the output optical flow map with volume rendering. Secondly, to ensure the realism of motion, we present an independent object motion module that can separate the camera and dynamic object motion in MPI. This module addresses the deficiency in MPI-based single-view methods, where optical flow is generated only by camera motion and does not account for any object movement. We additionally devise a depth-aware inpainting module to merge new images with dynamic objects and address unnatural motion occlusions. We show the superior performance of our method through extensive experiments on real-world datasets. Moreover, our approach achieves state-of-the-art performance in both unsupervised and supervised training of learning-based models. The code will be made publicly available at: \url{this https URL}.
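A sketch of the per-plane flow computation and the volume-rendering-style composition, assuming fronto-parallel MPI planes and the usual plane-induced homography; the sign convention for the camera motion (R, t) and the front-to-back plane ordering are assumptions of this illustration:

```python
import numpy as np

def plane_flow(K, R, t, depth, height, width):
    """Optical flow induced by camera motion (R, t) for a fronto-parallel plane at a
    given depth, via the plane-induced homography H = K (R - t n^T / d) K^{-1}."""
    n = np.array([0.0, 0.0, 1.0])                       # plane normal in camera frame
    H = K @ (R - np.outer(t, n) / depth) @ np.linalg.inv(K)

    ys, xs = np.mgrid[0:height, 0:width]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)  # (H, W, 3)
    warped = pix @ H.T
    warped = warped[..., :2] / warped[..., 2:3]
    return warped - pix[..., :2]                        # (H, W, 2) flow map

def composite_flows(plane_flows, alphas):
    """Alpha-composite per-plane flows front-to-back, as in volume rendering."""
    out = np.zeros_like(plane_flows[0])
    transmittance = np.ones(alphas[0].shape)
    for flow, alpha in zip(plane_flows, alphas):
        weight = transmittance * alpha
        out += weight[..., None] * flow
        transmittance = transmittance * (1 - alpha)
    return out
```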
https://arxiv.org/abs/2309.06714
We present a novel end-to-end identity-agnostic face reenactment system, MaskRenderer, that can generate realistic, high fidelity frames in real-time. Although recent face reenactment works have shown promising results, there are still significant challenges such as identity leakage and imitating mouth movements, especially for large pose changes and occluded faces. MaskRenderer tackles these problems by using (i) a 3DMM to model 3D face structure to better handle pose changes, occlusion, and mouth movements compared to 2D representations; (ii) a triplet loss function to embed the cross-reenactment during training for better identity preservation; and (iii) multi-scale occlusion, improving inpainting and restoring missing areas. Comprehensive quantitative and qualitative experiments conducted on the VoxCeleb1 test set, demonstrate that MaskRenderer outperforms state-of-the-art models on unseen faces, especially when the Source and Driving identities are very different.
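A minimal sketch of component (ii), a triplet loss on identity embeddings; the exact choice of anchor, positive, and negative in MaskRenderer may differ from this illustration:

```python
import torch
import torch.nn.functional as F

def identity_triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss on identity embeddings: pull the cross-reenacted face toward
    the source identity (positive) and away from the driving identity (negative)."""
    d_pos = 1 - F.cosine_similarity(anchor, positive)
    d_neg = 1 - F.cosine_similarity(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage with random 512-d embeddings standing in for a face-recognition encoder.
emb = lambda _: F.normalize(torch.randn(4, 512), dim=1)
loss = identity_triplet_loss(emb("reenacted"), emb("source"), emb("driving"))
```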
https://arxiv.org/abs/2309.05095
Flow-based propagation and spatiotemporal Transformer are two mainstream mechanisms in video inpainting (VI). Despite the effectiveness of these components, they still suffer from some limitations that affect their performance. Previous propagation-based approaches are performed separately either in the image or feature domain. Global image propagation isolated from learning may cause spatial misalignment due to inaccurate optical flow. Moreover, memory or computational constraints limit the temporal range of feature propagation and video Transformer, preventing exploration of correspondence information from distant frames. To address these issues, we propose an improved framework, called ProPainter, which involves enhanced ProPagation and an efficient Transformer. Specifically, we introduce dual-domain propagation that combines the advantages of image and feature warping, exploiting global correspondences reliably. We also propose a mask-guided sparse video Transformer, which achieves high efficiency by discarding unnecessary and redundant tokens. With these components, ProPainter outperforms prior arts by a large margin of 1.46 dB in PSNR while maintaining appealing efficiency.
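One plausible reading of the mask-guided sparsification is to keep only the tokens whose patch overlaps the inpainting mask and discard the rest; the sketch below is illustrative and not ProPainter's exact token-selection rule:

```python
import torch
import torch.nn.functional as F

def select_masked_tokens(tokens, mask, patch_size=16):
    """Discard tokens whose patch never touches the hole.
    tokens: (B, N, C) from an (H/ps, W/ps) patch grid; mask: (B, 1, H, W) float, 1 = hole."""
    patch_mask = F.max_pool2d(mask, kernel_size=patch_size)   # (B, 1, H/ps, W/ps)
    keep = patch_mask.flatten(1).bool()                       # (B, N) True where patch touches hole
    kept_tokens = [tokens[b][keep[b]] for b in range(tokens.shape[0])]
    return kept_tokens, keep
```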
https://arxiv.org/abs/2309.03897
Recent suggestions for learning-based elastic warps enable deep image stitching to align images exposed to large parallax errors. Despite the remarkable alignments, these methods struggle with occasional holes or discontinuities between overlapping and non-overlapping regions of a target image, as the applied training strategy mostly focuses on overlap-region alignment. As a result, they require additional modules such as a seam finder and image inpainting for hiding discontinuities and filling holes, respectively. In this work, we suggest Recurrent Elastic Warps (REwarp), which addresses the problem with a Dirichlet boundary condition and boosts performance via residual learning for recurrent misalignment correction. Specifically, REwarp predicts a homography and a Thin-plate Spline (TPS) under the boundary constraint for discontinuity- and hole-free image stitching. Our experiments show the favorable alignments and competitive computational costs of REwarp compared to existing stitching methods. Our source code is available at this https URL.
https://arxiv.org/abs/2309.01406
The emergence of artificial intelligence-generated content (AIGC) has raised concerns about the authenticity of multimedia content in various fields. However, existing research for forgery content detection has focused mainly on binary classification tasks of complete videos, which has limited applicability in industrial settings. To address this gap, we propose UMMAFormer, a novel universal transformer framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation. Our approach introduces a Temporal Feature Abnormal Attention (TFAA) module based on temporal feature reconstruction to enhance the detection of temporal differences. We also design a Parallel Cross-Attention Feature Pyramid Network (PCA-FPN) to optimize the Feature Pyramid Network (FPN) for subtle feature enhancement. To evaluate the proposed method, we contribute a novel Temporal Video Inpainting Localization (TVIL) dataset specifically tailored for video inpainting scenes. Our experiments show that our approach achieves state-of-the-art performance on benchmark datasets, including Lav-DF, TVIL, and Psynd, significantly outperforming previous methods. The code and data are available at this https URL.
https://arxiv.org/abs/2308.14395
Effective image restoration with large-size corruptions, such as blind image inpainting, entails precise detection of corruption region masks, which remains extremely challenging due to the diverse shapes and patterns of corruptions. In this work, we present a novel method for automatic corruption detection, which allows for blind corruption restoration without known corruption masks. Specifically, we develop a hierarchical contrastive learning framework that detects corrupted regions by capturing the intrinsic semantic distinctions between corrupted and uncorrupted regions. In particular, our model detects the corruption mask in a coarse-to-fine manner: it first predicts a coarse mask by contrastive learning in the low-resolution feature space and then refines the uncertain areas of the mask by high-resolution contrastive learning. A specialized hierarchical interaction mechanism is designed to propagate contrastive-learning knowledge across scales, boosting the modeling performance substantially. The detected multi-scale corruption masks are then leveraged to guide the corruption restoration. Because it detects corrupted regions by learning contrastive distinctions rather than the semantic patterns of corruptions, our model generalizes well across different corruption patterns. Extensive experiments demonstrate the following merits of our model: 1) superior performance over other methods on both corruption detection and various image restoration tasks, including blind inpainting and watermark removal, and 2) strong generalization across different corruption patterns such as graffiti, random noise, or other image content. Codes and trained weights are available at this https URL.
https://arxiv.org/abs/2308.14061
Pre-captured immersive environments using omnidirectional cameras provide a wide range of virtual reality applications. Previous research has shown that manipulating the eye height in egocentric virtual environments can significantly affect distance perception and immersion. However, the influence of eye height in pre-captured real environments has received less attention due to the difficulty of altering the perspective after finishing the capture process. To explore this influence, we first propose a pilot study that captures real environments with multiple eye heights and asks participants to judge the egocentric distances and immersion. If a significant influence is confirmed, an effective image-based approach to adapt pre-captured real-world environments to the user's eye height would be desirable. Motivated by the study, we propose a learning-based approach for synthesizing novel views for omnidirectional images with altered eye heights. This approach employs a multitask architecture that learns depth and semantic segmentation in two formats, and generates high-quality depth and semantic segmentation to facilitate the inpainting stage. With the improved omnidirectional-aware layered depth image, our approach synthesizes natural and realistic visuals for eye height adaptation. Quantitative and qualitative evaluation shows favorable results against state-of-the-art methods, and an extensive user study verifies improved perception and immersion for pre-captured real-world environments.
https://arxiv.org/abs/2308.13042
The regression of 3D Human Pose and Shape (HPS) from an image is becoming increasingly accurate. This makes the results useful for downstream tasks like human action recognition or 3D graphics. Yet, no regressor is perfect, and accuracy can be affected by ambiguous image evidence or by poses and appearance that are unseen during training. Most current HPS regressors, however, do not report the confidence of their outputs, meaning that downstream tasks cannot differentiate accurate estimates from inaccurate ones. To address this, we develop POCO, a novel framework for training HPS regressors to estimate not only a 3D human body, but also their confidence, in a single feed-forward pass. Specifically, POCO estimates both the 3D body pose and a per-sample variance. The key idea is to introduce a Dual Conditioning Strategy (DCS) for regressing uncertainty that is highly correlated to pose reconstruction quality. The POCO framework can be applied to any HPS regressor and here we evaluate it by modifying HMR, PARE, and CLIFF. In all cases, training the network to reason about uncertainty helps it learn to more accurately estimate 3D pose. While this was not our goal, the improvement is modest but consistent. Our main motivation is to provide uncertainty estimates for downstream tasks; we demonstrate this in two ways: (1) We use the confidence estimates to bootstrap HPS training. Given unlabelled image data, we take the confident estimates of a POCO-trained regressor as pseudo ground truth. Retraining with this automatically-curated data improves accuracy. (2) We exploit uncertainty in video pose estimation by automatically identifying uncertain frames (e.g. due to occlusion) and inpainting these from confident frames. Code and models will be available for research at this https URL.
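A generic heteroscedastic-regression sketch of predicting a pose together with a per-sample variance, plus confidence filtering for pseudo ground truth; POCO's Dual Conditioning Strategy is more involved than this baseline, and the threshold below is illustrative:

```python
import torch

def pose_nll_loss(pred_pose, pred_log_var, gt_pose):
    """Per-sample-variance regression loss: the network outputs both the pose and a
    log-variance, so low-confidence (high-variance) samples down-weight their error."""
    inv_var = torch.exp(-pred_log_var)
    return (inv_var * (pred_pose - gt_pose) ** 2 + pred_log_var).mean()

def select_pseudo_labels(pred_poses, pred_log_vars, threshold=0.5):
    """Keep only confident predictions as pseudo ground truth for retraining."""
    confidence = torch.exp(-pred_log_vars.mean(dim=1))   # crude per-sample confidence
    keep = confidence > threshold
    return pred_poses[keep], keep
```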
https://arxiv.org/abs/2308.12965
Image composition refers to inserting a foreground object into a background image to obtain a composite image. In this work, we focus on generating a plausible shadow for the inserted foreground object to make the composite image more realistic. To supplement the existing small-scale dataset DESOBA, we create a large-scale dataset called DESOBAv2 by using object-shadow detection and inpainting techniques. Specifically, we collect a large number of outdoor scene images with object-shadow pairs. Then, we use a pretrained inpainting model to inpaint the shadow regions, resulting in deshadowed images. Based on the real images and the deshadowed images, we can construct pairs of synthetic composite images and ground-truth target images. The dataset is available at this https URL.
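A sketch of the data-construction loop under the stated recipe; `shadow_detector` and `inpainter` stand in for pretrained models and are placeholders, not the released tools:

```python
def build_desobav2_pair(image, shadow_detector, inpainter):
    """Detect object-shadow pairs, inpaint each shadow to get a deshadowed image,
    and pair the shadow-free composite with the original image as ground truth."""
    detections = shadow_detector(image)                  # [(object_mask, shadow_mask), ...]
    pairs = []
    for object_mask, shadow_mask in detections:
        deshadowed = inpainter(image, shadow_mask)       # remove the shadow region
        pairs.append({
            "composite": deshadowed,                     # synthetic shadow-free composite
            "object_mask": object_mask,
            "target": image,                             # real image with its real shadow
        })
    return pairs
```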
https://arxiv.org/abs/2308.09972
Image restoration (IR) has been an indispensable and challenging task in the low-level vision field, which strives to improve the subjective quality of images distorted by various forms of degradation. Recently, the diffusion model has achieved significant advancements in the visual generation of AIGC, thereby raising an intuitive question, "whether diffusion model can boost image restoration". To answer this, some pioneering studies attempt to integrate diffusion models into the image restoration task, resulting in superior performances than previous GAN-based methods. Despite that, a comprehensive and enlightening survey on diffusion model-based image restoration remains scarce. In this paper, we are the first to present a comprehensive review of recent diffusion model-based methods on image restoration, encompassing the learning paradigm, conditional strategy, framework design, modeling strategy, and evaluation. Concretely, we first introduce the background of the diffusion model briefly and then present two prevalent workflows that exploit diffusion models in image restoration. Subsequently, we classify and emphasize the innovative designs using diffusion models for both IR and blind/real-world IR, intending to inspire future development. To evaluate existing methods thoroughly, we summarize the commonly-used dataset, implementation details, and evaluation metrics. Additionally, we present the objective comparison for open-sourced methods across three tasks, including image super-resolution, deblurring, and inpainting. Ultimately, informed by the limitations in existing works, we propose five potential and challenging directions for the future research of diffusion model-based IR, including sampling efficiency, model compression, distortion simulation and estimation, distortion invariant learning, and framework design.
https://arxiv.org/abs/2308.09388
Data augmentation has become a de facto component of deep learning-based medical image segmentation methods. Most data augmentation techniques used in medical imaging focus on spatial and intensity transformations to improve the diversity of training images. They are often designed at the image level, augmenting the full image, and do not pay attention to specific abnormalities within the image. Here, we present LesionMix, a novel and simple lesion-aware data augmentation method. It performs augmentation at the lesion level, increasing the diversity of lesion shape, location, intensity and load distribution, and allowing both lesion populating and inpainting. Experiments on different modalities and different lesion datasets, including four brain MR lesion datasets and one liver CT lesion dataset, demonstrate that LesionMix achieves promising performance in lesion image segmentation, outperforming several recent Mix-based data augmentation methods. The code will be released at this https URL.
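A minimal sketch of lesion-level augmentation in the spirit of LesionMix, populating a target scan with a lesion copied from a source scan; the shift/gain parameters and the omission of load-distribution-aware sampling and lesion inpainting (removal) are simplifications of this illustration:

```python
import numpy as np

def lesion_mix(target_image, source_image, source_lesion_mask, shift=(0, 0), gain=1.0):
    """Copy a lesion from a source scan into a target scan: translate the lesion,
    rescale its intensity, paste it in, and return the updated label map."""
    mask = np.roll(source_lesion_mask, shift, axis=(0, 1)).astype(bool)
    lesion = np.roll(source_image * source_lesion_mask, shift, axis=(0, 1))
    mixed = target_image.copy()
    mixed[mask] = gain * lesion[mask]
    new_label = mask.astype(np.uint8)
    return mixed, new_label
```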
https://arxiv.org/abs/2308.09026
The rising demand for creating lifelike avatars in the digital realm has led to an increased need for generating high-quality human videos guided by textual descriptions and poses. We propose Dancing Avatar, designed to fabricate human motion videos driven by poses and textual cues. Our approach employs a pretrained T2I diffusion model to generate each video frame in an autoregressive fashion. The crux of innovation lies in our adept utilization of the T2I diffusion model for producing video frames successively while preserving contextual relevance. We surmount the hurdles posed by maintaining human character and clothing consistency across varying poses, along with upholding the background's continuity amidst diverse human movements. To ensure consistent human appearances across the entire video, we devise an intra-frame alignment module. This module assimilates text-guided synthesized human character knowledge into the pretrained T2I diffusion model, synergizing insights from ChatGPT. For preserving background continuity, we put forth a background alignment pipeline, amalgamating insights from segment anything and image inpainting techniques. Furthermore, we propose an inter-frame alignment module that draws inspiration from an auto-regressive pipeline to augment temporal consistency between adjacent frames, where the preceding frame guides the synthesis process of the current frame. Comparisons with state-of-the-art methods demonstrate that Dancing Avatar exhibits the capacity to generate human videos with markedly superior quality, both in terms of human and background fidelity, as well as temporal coherence compared to existing state-of-the-art approaches.
https://arxiv.org/abs/2308.07749
Equipping the rototranslation group $SE(2)$ with a sub-Riemannian structure inspired by the visual cortex V1, we propose algorithms for image inpainting and enhancement based on hypoelliptic diffusion. We innovate on previous implementations of the methods by Citti, Sarti, and Boscain et al. by proposing an alternative that prevents fading and is capable of producing sharper results, in a procedure that we call WaxOn-WaxOff. We also exploit the sub-Riemannian structure to define a completely new unsharp filter using $SE(2)$, analogous to the classical unsharp filter for 2D image processing, with applications to image enhancement. We demonstrate our method on blood vessel enhancement in retinal scans.
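A compact way to state the $SE(2)$ unsharp analogue (the notation below, including the sharpening strength $\lambda$ and diffusion time $\tau$, is ours, not the paper's): with $u_\tau$ denoting the hypoelliptic (sub-Riemannian) diffusion of the lifted image $u$ at time $\tau$,

$$u_{\mathrm{sharp}} \;=\; u + \lambda\,\bigl(u - u_\tau\bigr),$$

which mirrors the classical 2D unsharp mask $u + \lambda\,(u - G_\sigma * u)$ with the Gaussian blur replaced by hypoelliptic smoothing on $SE(2)$.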
https://arxiv.org/abs/2308.07652
Virtual try-on is a critical image synthesis task that aims to transfer clothes from one image to another while preserving the details of both the human and the clothes. While many existing methods rely on Generative Adversarial Networks (GANs) to achieve this, flaws can still occur, particularly at high resolutions. Recently, the diffusion model has emerged as a promising alternative for generating high-quality images in various applications. However, simply using clothes as a condition for guiding the diffusion model to inpaint is insufficient to maintain the details of the clothes. To overcome this challenge, we propose an exemplar-based inpainting approach that leverages a warping module to guide the diffusion model's generation effectively. The warping module performs initial processing on the clothes, which helps to preserve their local details. We then combine the warped clothes with a clothes-agnostic person image and add noise to form the input of the diffusion model. Additionally, the warped clothes are used as a local condition for each denoising step to ensure that the resulting output retains as much detail as possible. Our approach, namely Diffusion-based Conditional Inpainting for Virtual Try-ON (DCI-VTON), effectively utilizes the power of the diffusion model, and the incorporation of the warping module helps to produce high-quality and realistic virtual try-on results. Experimental results on VITON-HD demonstrate the effectiveness and superiority of our method.
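A sketch of how the diffusion input and local condition could be assembled from the warped clothes and the clothes-agnostic person image; the tensor layout and the simple paste-then-noise composition are assumptions of this illustration, not the paper's exact formulation:

```python
import torch

def assemble_diffusion_input(warped_clothes, person_agnostic, timestep_noise, mask):
    """Paste the warped clothes onto the clothes-agnostic person image, add noise to form
    the diffusion input, and keep the warped clothes as a local condition for every
    denoising step. All tensors are assumed to share a (B, C, H, W) layout."""
    coarse = person_agnostic * (1 - mask) + warped_clothes * mask   # coarse try-on
    noisy_input = coarse + timestep_noise                           # noised diffusion input
    local_condition = warped_clothes                                # reused at each step
    return noisy_input, local_condition
```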
https://arxiv.org/abs/2308.06101