We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large-vocabulary instance segmentation. Our method is training-free and does not rely on any label supervision. Two key designs enable us to employ an off-the-shelf text-to-image diffusion model as a useful dataset generator for object instances and mask annotations. First, we divide an image canvas into several regions and perform a single round of the diffusion process to generate multiple instances simultaneously, conditioned on different text prompts. Second, we obtain the corresponding instance masks by aggregating the cross-attention maps associated with the object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement. Without bells and whistles, MosaicFusion can produce a significant amount of synthetic labeled data for both rare and novel categories. Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion can significantly improve the performance of existing instance segmentation models, especially for rare and novel categories. Code will be released at this https URL.
https://arxiv.org/abs/2309.13042
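As a rough illustration of the mask-extraction step described in the MosaicFusion abstract above (aggregate the cross-attention maps tied to an object prompt across layers and time steps, then threshold), here is a minimal sketch; the function name, the per-map normalization, and the threshold value are illustrative assumptions rather than the authors' released code.

```python
import numpy as np

def mask_from_cross_attention(attn_maps, threshold=0.35):
    """Aggregate the cross-attention maps for one object token into a binary mask.

    attn_maps: list of 2-D arrays, one per (layer, diffusion time step) pair.
    Each map is max-normalized, the stack is averaged, and a fixed threshold is
    applied; the edge-aware refinement mentioned in the abstract (e.g. a guided
    filter or CRF) would be applied on top of this coarse mask.
    """
    stacked = np.stack([m / (m.max() + 1e-8) for m in attn_maps], axis=0)
    aggregated = stacked.mean(axis=0)       # average across layers and time steps
    return aggregated >= threshold          # simple thresholding

# Usage with dummy data: four 64x64 attention maps for one prompt token.
maps = [np.random.rand(64, 64) for _ in range(4)]
mask = mask_from_cross_attention(maps)
print(mask.shape, mask.dtype)               # (64, 64) bool
```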
Sequential recommendation (SRS), which aims to recommend the next item based on a user's historical interactions, has recently become the technical foundation of many applications. However, sequential recommendation often faces the problem of data sparsity, which is widespread in recommender systems. Besides, most users interact with only a few items, and existing SRS models often underperform for these users. This so-called long-tail user problem remains unresolved. Data augmentation is a direct way to alleviate these two problems, but existing augmentation methods often require specially crafted training strategies or are hindered by poor-quality generated interactions. To address these problems, we propose Diffusion Augmentation for Sequential Recommendation (DiffuASR) for higher-quality generation. The dataset augmented by DiffuASR can be used to train sequential recommendation models directly, free from complex training procedures. To make the best of the generation ability of the diffusion model, we first propose a diffusion-based pseudo sequence generation framework to bridge the gap between image and sequence generation. Then, a sequential U-Net is designed to adapt the diffusion noise prediction model U-Net to the discrete sequence generation task. At last, we develop two guide strategies to assimilate the preferences of the generated and original sequences. To validate the proposed DiffuASR, we conduct extensive experiments on three real-world datasets with three sequential recommendation models. The experimental results illustrate the effectiveness of DiffuASR. To the best of our knowledge, DiffuASR is among the first to introduce the diffusion model to recommendation.
https://arxiv.org/abs/2309.12858
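To make the augmentation setup concrete, the sketch below shows how a diffusion-generated pseudo prefix could be prepended to a short user history before standard SRS training; `generate_prefix` is a hypothetical stand-in for the DiffuASR generator, not its real interface.

```python
from typing import Callable, List

def augment_sequence(seq: List[int],
                     generate_prefix: Callable[[List[int], int], List[int]],
                     num_pseudo: int = 3) -> List[int]:
    """Prepend diffusion-generated pseudo interactions to a short user history.

    `generate_prefix` is a hypothetical stand-in for the DiffuASR generator
    (pseudo-sequence diffusion model + sequential U-Net); it returns `num_pseudo`
    item ids conditioned on the observed sequence.
    """
    pseudo = generate_prefix(seq, num_pseudo)
    # The augmented sequence can then be fed to any SRS model with its usual training loop.
    return pseudo + seq

# Usage with a trivial stub generator that just repeats the first observed item.
stub = lambda seq, k: [seq[0]] * k
print(augment_sequence([42, 7, 19], stub))  # [42, 42, 42, 42, 7, 19]
```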
Accurate segmentation is essential for echocardiography-based assessment of cardiovascular diseases (CVDs). However, the variability among sonographers and the inherent challenges of ultrasound images hinder precise segmentation. By leveraging the joint representation of image and text modalities, Vision-Language Segmentation Models (VLSMs) can incorporate rich contextual information, potentially aiding in accurate and explainable segmentation. However, the lack of readily available data in echocardiography hampers the training of VLSMs. In this study, we explore using synthetic datasets from Semantic Diffusion Models (SDMs) to enhance VLSMs for echocardiography segmentation. We evaluate results for two popular VLSMs (CLIPSeg and CRIS) using seven different kinds of language prompts derived from several attributes, automatically extracted from echocardiography images, segmentation masks, and their metadata. Our results show improved metrics and faster convergence when pretraining VLSMs on SDM-generated synthetic images before finetuning on real images. The code, configs, and prompts are available at this https URL.
https://arxiv.org/abs/2309.12829
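As an example of how language prompts can be composed from automatically extracted attributes, here is a small sketch; the attribute names and templates are assumptions for illustration and do not reproduce the paper's seven prompt types.

```python
def build_prompts(attrs: dict) -> list:
    """Compose language prompts of increasing specificity from attributes extracted
    from an echocardiography image, its segmentation mask, and metadata.

    The attribute names and templates are illustrative assumptions and do not
    reproduce the paper's seven prompt types.
    """
    return [
        "left ventricle",
        "left ventricle in {view} view".format(**attrs),
        "left ventricle of a {sex} patient in {view} view".format(**attrs),
        "{shape} left ventricle covering {area_pct}% of the image".format(**attrs),
    ]

print(build_prompts({"view": "apical four-chamber", "sex": "female",
                     "shape": "oval", "area_pct": 18}))
```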
This paper introduces an improved duration-informed attention neural network (DurIAN-E) for expressive and high-fidelity text-to-speech (TTS) synthesis. Inherited from the original DurIAN model, DurIAN-E adopts an auto-regressive structure in which the alignments between the input linguistic information and the output acoustic features are inferred from a duration model. Meanwhile, the proposed DurIAN-E utilizes multiple stacked SwishRNN-based Transformer blocks as linguistic encoders. Style-Adaptive Instance Normalization (SAIN) layers are incorporated into the frame-level encoders to improve the modeling of expressiveness. A denoiser that combines a denoising diffusion probabilistic model (DDPM) for mel-spectrograms with SAIN modules is employed to further improve synthetic speech quality and expressiveness. Experimental results show that the proposed expressive TTS model achieves better performance than state-of-the-art approaches in both subjective mean opinion score (MOS) and preference tests.
https://arxiv.org/abs/2309.12792
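The following is a minimal PyTorch sketch of a Style-Adaptive Instance Normalization layer as loosely described above: instance-normalize frame-level features, then modulate them with gains and biases predicted from a style embedding. Layer sizes and the exact parameterization are assumptions.

```python
import torch
import torch.nn as nn

class SAIN(nn.Module):
    """Style-Adaptive Instance Normalization, sketched from the abstract's description:
    instance-normalize frame-level features, then rescale and shift them with gains
    and biases predicted from a style embedding. Sizes and parameterization are
    illustrative assumptions, not the paper's exact layer."""

    def __init__(self, channels: int, style_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        self.affine = nn.Linear(style_dim, 2 * channels)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames), style: (batch, style_dim)
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        return self.norm(x) * (1 + gamma.unsqueeze(-1)) + beta.unsqueeze(-1)

# Usage: 80-channel frame-level features modulated by a 128-dim style embedding.
sain = SAIN(80, 128)
print(sain(torch.randn(2, 80, 200), torch.randn(2, 128)).shape)  # torch.Size([2, 80, 200])
```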
Simulation of autonomous vehicle systems requires that simulated traffic participants exhibit diverse and realistic behaviors. The use of prerecorded real-world traffic scenarios in simulation ensures realism, but the rarity of safety-critical events makes large-scale collection of driving scenarios expensive. In this paper, we present DJINN, a diffusion-based method for generating traffic scenarios. Our approach jointly diffuses the trajectories of all agents, conditioned on a flexible set of state observations from the past, present, or future. On popular trajectory forecasting datasets, we report state-of-the-art performance on joint trajectory metrics. In addition, we demonstrate how DJINN flexibly enables direct test-time sampling from a variety of valuable conditional distributions, including goal-based sampling, behavior-class sampling, and scenario editing.
https://arxiv.org/abs/2309.12508
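One common way to realize the flexible conditioning the DJINN abstract describes is inpainting-style clamping of observed agent states during sampling; the sketch below illustrates that idea and is not DJINN's actual sampler.

```python
import numpy as np

def conditioned_denoise_step(x_t, predicted_x0, observed, observed_mask, alpha):
    """One illustrative conditioning step for joint trajectory diffusion.

    x_t:           (agents, timesteps, 2) noisy trajectories
    predicted_x0:  the denoiser's current estimate of the clean trajectories
    observed:      known agent states (past, present, or future) to condition on
    observed_mask: 1 where a state is observed, 0 elsewhere
    alpha:         interpolation weight for this diffusion step

    Observed entries are clamped back to their known values after every step, an
    inpainting-style conditioning scheme used here only to illustrate the idea;
    DJINN's actual conditioning mechanism may differ.
    """
    x_next = alpha * predicted_x0 + (1 - alpha) * x_t
    return observed_mask * observed + (1 - observed_mask) * x_next

# Dummy usage: 3 agents, 10 timesteps, with the first 4 steps of agent 0 observed.
x_t = np.random.randn(3, 10, 2); x0_hat = np.zeros_like(x_t)
obs = np.zeros_like(x_t); m = np.zeros_like(x_t); m[0, :4] = 1.0
print(conditioned_denoise_step(x_t, x0_hat, obs, m, alpha=0.5).shape)  # (3, 10, 2)
```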
In surveillance, accurate license plate recognition is often hindered by the low quality and small dimensions of the captured plates, compromising recognition precision. Despite advancements in AI-based image super-resolution, methods like Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) still fall short in enhancing license plate images. This study leverages the cutting-edge diffusion model, which has consistently outperformed other deep learning techniques in image restoration. By training this model on a curated dataset of Saudi license plates at both low and high resolutions, we demonstrate the diffusion model's superior efficacy. The method achieves a 12.55% and 37.32% improvement in Peak Signal-to-Noise Ratio (PSNR) over SwinIR and ESRGAN, respectively. Moreover, our method surpasses these techniques in terms of the Structural Similarity Index (SSIM), registering a 4.89% and 17.66% improvement over SwinIR and ESRGAN, respectively. Furthermore, 92% of human evaluators preferred our images over those from other algorithms. In essence, this research presents a pioneering solution for license plate super-resolution, with tangible potential for surveillance systems.
https://arxiv.org/abs/2309.12506
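For reference, PSNR can be computed as in the sketch below; the percent comparison in the final comment is illustrative arithmetic under the assumption that the reported gains are relative improvements, and none of the paper's images or numbers are reproduced here.

```python
import numpy as np

def psnr(reference: np.ndarray, restored: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB between two images of the same shape."""
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

# Dummy usage on random 8-bit images.
ref = np.random.randint(0, 256, (64, 128), dtype=np.uint8)
out = np.clip(ref + np.random.randint(-5, 6, ref.shape), 0, 255).astype(np.uint8)
print(round(psnr(ref, out), 2))
# If the reported 12.55% gain over SwinIR is a relative improvement, it reads as
# psnr_ours ~= 1.1255 * psnr_swinir (illustrative arithmetic only, not the paper's data).
```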
The Video and Image Processing (VIP) Cup is a student competition that takes place each year at the IEEE International Conference on Image Processing. The 2022 IEEE VIP Cup asked undergraduate students to develop a system capable of distinguishing pristine images from generated ones. The interest in this topic stems from the incredible advances in the AI-based generation of visual data, with tools that allow the synthesis of highly realistic images and videos. While this opens up a large number of new opportunities, it also undermines the trustworthiness of media content and fosters the spread of disinformation on the internet. Recently, strong concerns have been raised about the generation of extremely realistic images by editing software that incorporates recent diffusion-model technology. In this context, there is a need to develop robust and automatic tools for synthetic image detection.
https://arxiv.org/abs/2309.12428
Generating multi-instrument music from symbolic music representations is an important task in Music Information Retrieval (MIR). A central but still largely unsolved problem in this context is musically and acoustically informed control in the generation process. As the main contribution of this work, we propose enhancing control of multi-instrument synthesis by conditioning a generative model on a specific performance and recording environment, thus allowing for better guidance of timbre and style. Building on state-of-the-art diffusion-based music generative models, we introduce performance conditioning - a simple tool that instructs the generative model to synthesize music with the style and timbre of specific instruments taken from specific performances. Our prototype is evaluated on uncurated performances with diverse instrumentation and achieves state-of-the-art FAD realism scores while allowing novel timbre and style control. Our project page, including samples and demonstrations, is available at this http URL
https://arxiv.org/abs/2309.12283
Disease progression simulation is a crucial area of research that has significant implications for clinical diagnosis, prognosis, and treatment. One major challenge in this field is the lack of continuous medical imaging monitoring of individual patients over time. To address this issue, we develop a novel framework termed Progressive Image Editing (PIE) that enables controlled manipulation of disease-related image features, facilitating precise and realistic disease progression simulation. Specifically, we leverage recent advancements in text-to-image generative models to simulate disease progression accurately and personalize it for each patient. We theoretically analyze the iterative refining process in our framework as gradient descent with an exponentially decayed learning rate. To validate our framework, we conduct experiments in three medical imaging domains. Our results demonstrate the superiority of PIE over existing methods such as Stable Diffusion Walk and Style-Based Manifold Extrapolation, as measured by CLIP score (Realism) and Disease Classification Confidence (Alignment). Our user study collected feedback from 35 veteran physicians to assess the generated progressions. Remarkably, 76.2% of the feedback agreed with the fidelity of the generated progressions. To the best of our knowledge, PIE is the first of its kind to generate disease progression images meeting real-world standards. It is a promising tool for medical research and clinical practice, potentially allowing healthcare providers to model disease trajectories over time, predict future treatment responses, and improve patient outcomes.
https://arxiv.org/abs/2309.11745
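The abstract's reading of the iterative refinement as gradient descent with an exponentially decayed learning rate can be written, in generic notation chosen here for illustration, as

x_{n+1} = x_n - \eta_0 \, \gamma^{n} \, \nabla_x L(x_n), \qquad 0 < \gamma < 1,

where x_n is the image (or its latent) after n editing rounds, \eta_0 is the initial step size, \gamma the decay factor, and L the objective implied by the target clinical prompt; the paper's exact formulation may differ.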
Segment Anything (SAM), an advanced universal image segmentation model trained on an expansive visual dataset, has set a new benchmark in image segmentation and computer vision. However, it faces challenges when it comes to distinguishing shadows from their backgrounds. To address this, we developed Deshadow-Anything which, taking into account the generalization afforded by large-scale datasets, is fine-tuned on large-scale datasets to achieve image shadow removal. The diffusion model can diffuse along the edges and textures of an image, helping to remove shadows while preserving image details. Furthermore, we design Multi-Self-Attention Guidance (MSAG) and adaptive input perturbation (DDPM-AIP) to accelerate the iterative training of the diffusion model. Experiments on shadow removal tasks demonstrate that these methods can effectively improve image restoration performance.
https://arxiv.org/abs/2309.11715
In this paper, we uncover the untapped potential of the diffusion U-Net, which serves as a "free lunch" that substantially improves generation quality on the fly. We initially investigate the key contributions of the U-Net architecture to the denoising process and identify that its main backbone primarily contributes to denoising, whereas its skip connections mainly introduce high-frequency features into the decoder module, causing the network to overlook the backbone semantics. Capitalizing on this discovery, we propose a simple yet effective method, termed "FreeU", that enhances generation quality without additional training or finetuning. Our key insight is to strategically re-weight the contributions sourced from the U-Net's skip connections and backbone feature maps, to leverage the strengths of both components of the U-Net architecture. Promising results on image and video generation tasks demonstrate that FreeU can be readily integrated into existing diffusion models, e.g., Stable Diffusion, DreamBooth, ModelScope, Rerender and ReVersion, to improve generation quality with only a few lines of code. All you need is to adjust two scaling factors during inference. Project page: https://chenyangsi.top/FreeU/.
https://arxiv.org/abs/2309.11497
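Since the abstract says the method reduces to adjusting two scaling factors at inference, a minimal sketch of that re-weighting at a U-Net decoder block might look as follows; the factor values are placeholders, and the released FreeU applies the scaling in a more structured way.

```python
import torch

def freeu_fuse(backbone_feat: torch.Tensor, skip_feat: torch.Tensor,
               b: float = 1.2, s: float = 0.9) -> torch.Tensor:
    """Re-weight the two feature streams entering a U-Net decoder block.

    b scales the backbone (semantic) features and s scales the skip-connection
    (high-frequency) features before concatenation; this is the two-factor
    adjustment the abstract refers to, with placeholder values, while the released
    FreeU applies the scaling in a more structured (channel- and frequency-aware) way.
    """
    return torch.cat([backbone_feat * b, skip_feat * s], dim=1)

# Usage: typical decoder-stage tensors.
print(freeu_fuse(torch.randn(1, 320, 32, 32), torch.randn(1, 320, 32, 32)).shape)
# torch.Size([1, 640, 32, 32])
```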
We discuss the emerging new opportunity for building feedback-rich computational models of social systems using generative artificial intelligence. Referred to as Generative Agent-Based Models (GABMs), such individual-level models utilize large language models such as ChatGPT to represent human decision-making in social settings. We provide a GABM case in which human behavior can be incorporated in simulation models by coupling a mechanistic model of human interactions with a pre-trained large language model. This is achieved by introducing a simple GABM of social norm diffusion in an organization. For educational purposes, the model is intentionally kept simple. We examine a wide range of scenarios and the sensitivity of the results to several changes in the prompt. We hope the article and the model serve as a guide for building useful diffusion models that include realistic human reasoning and decision-making.
https://arxiv.org/abs/2309.11456
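A toy version of such a generative agent-based model is sketched below: the LLM call is stubbed out with a random answer so the example runs, and the prompt wording is an assumption; in practice, ask_llm would wrap a real ChatGPT-style API call.

```python
import random

def ask_llm(prompt: str) -> str:
    """Stand-in for a large language model call (e.g. ChatGPT); replace with a real
    API call in practice. It answers randomly here only so the sketch runs."""
    return random.choice(["adopt", "reject"])

def simulate_norm_diffusion(n_agents: int = 20, rounds: int = 5, seed_adopters: int = 2) -> int:
    """Toy generative agent-based model of norm diffusion in an organization: each
    round, every non-adopter 'asks' the LLM whether to adopt, given the share of
    colleagues already following the norm. The prompt wording is an assumption."""
    adopted = [i < seed_adopters for i in range(n_agents)]
    for _ in range(rounds):
        share = sum(adopted) / n_agents
        for i in range(n_agents):
            if not adopted[i]:
                prompt = (f"You are an employee. {share:.0%} of your colleagues follow "
                          f"a new workplace norm. Do you adopt it? Answer adopt or reject.")
                adopted[i] = ask_llm(prompt) == "adopt"
    return sum(adopted)

print(simulate_norm_diffusion())  # number of adopters after the simulated rounds
```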
Classical motion planning for robotic manipulation comprises a set of general algorithms that aim to minimize a scene-specific cost of executing a given plan. These algorithms offer remarkable adaptability, as they can be used off-the-shelf for any new scene without needing specific training datasets. However, without a prior understanding of what diverse valid trajectories look like, and without specially designed cost functions for a given scene, the overall solutions tend to have low success rates. While deep-learning-based algorithms tremendously improve success rates, they are much harder to adopt without specialized training datasets. We propose EDMP, an Ensemble-of-costs-guided Diffusion for Motion Planning that aims to combine the strengths of classical and deep-learning-based motion planning. Our diffusion-based network is trained on a set of diverse kinematically valid trajectories. As in classical planning, for any new scene at inference time we compute scene-specific costs such as a "collision cost" and guide the diffusion to generate valid trajectories that satisfy the scene-specific constraints. Further, instead of a single cost function that may be insufficient to capture diversity across scenes, we use an ensemble of costs to guide the diffusion process, significantly improving the success rate compared to classical planners. EDMP performs comparably to state-of-the-art deep-learning-based methods while retaining the generalization capabilities primarily associated with classical planners.
https://arxiv.org/abs/2309.11414
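The guidance idea (nudge the denoised trajectory with gradients of a weighted ensemble of scene-specific costs) can be sketched as follows; this is an illustration under assumed names and a fixed step size, not EDMP's actual guidance schedule.

```python
import torch

def guided_denoise_step(traj, denoise, costs, weights, step_size=0.05):
    """One illustrative cost-guided update for trajectory diffusion.

    traj:    (timesteps, dof) tensor of joint configurations
    denoise: callable returning the denoised trajectory estimate
    costs:   list of differentiable scene-specific costs (e.g. a collision cost)
    weights: per-cost weights forming the ensemble

    Gradients of the weighted cost ensemble nudge the denoised trajectory toward
    the scene-specific constraints; EDMP's actual guidance schedule and cost
    ensembling are more involved than this single step.
    """
    traj = traj.detach().requires_grad_(True)
    x0 = denoise(traj)
    total = sum(w * c(x0) for w, c in zip(weights, costs))
    grad, = torch.autograd.grad(total, traj)
    return (x0 - step_size * grad).detach()

# Usage with stubs: an identity denoiser and a cost penalizing large joint values.
out = guided_denoise_step(torch.randn(32, 7), lambda x: x,
                          [lambda x: (x ** 2).mean()], [1.0])
print(out.shape)  # torch.Size([32, 7])
```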
In this paper, we address the problem of face aging: generating past or future facial images by incorporating age-related changes into a given face. Previous aging methods rely solely on human facial image datasets and are thus constrained by their inherent scale and bias. This restricts them to a limited generatable age range and leaves them unable to handle large age gaps. We propose FADING, a novel approach to address Face Aging via DIffusion-based editiNG. We go beyond existing methods by leveraging the rich prior of large-scale language-image diffusion models. First, we specialize a pre-trained diffusion model for the task of face age editing by using an age-aware fine-tuning scheme. Next, we invert the input image to latent noise and obtain optimized null-text embeddings. Finally, we perform text-guided local age editing via attention control. The quantitative and qualitative analyses demonstrate that our method outperforms existing approaches with respect to aging accuracy, attribute preservation, and aging quality.
https://arxiv.org/abs/2309.11321
Speech-driven 3D facial animation synthesis has been a challenging task in both industry and research. Recent methods mostly focus on deterministic deep learning approaches, meaning that, given a speech input, the output is always the same. However, in reality, the non-verbal facial cues that reside throughout the face are non-deterministic in nature. In addition, the majority of approaches focus on 3D vertex-based datasets, and methods compatible with existing facial animation pipelines using rigged characters are scarce. To eliminate these issues, we present FaceDiffuser, a non-deterministic deep learning model for generating speech-driven facial animations that is trained with both 3D vertex- and blendshape-based datasets. Our method is based on the diffusion technique and uses the pre-trained large speech representation model HuBERT to encode the audio input. To the best of our knowledge, we are the first to employ the diffusion method for the task of speech-driven 3D facial animation synthesis. We have run extensive objective and subjective analyses and show that our approach achieves better or comparable results compared to state-of-the-art methods. We also introduce a new in-house dataset based on a blendshape-based rigged character. We recommend watching the accompanying supplementary video. The code and the dataset will be publicly available.
https://arxiv.org/abs/2309.11306
The neural radiance field is an emerging rendering method that generates high-quality multi-view consistent images from a neural scene representation and volume rendering. Although neural radiance field-based techniques are robust for scene reconstruction, their ability to add or remove objects remains limited. This paper proposes a new language-driven approach for object manipulation with neural radiance fields through dataset updates. Specifically, to insert a new foreground object represented by a set of multi-view images into a background radiance field, we use a text-to-image diffusion model to learn and generate combined images that fuse the object of interest into the given background across views. These combined images are then used to refine the background radiance field so that we can render view-consistent images containing both the object and the background. To ensure view consistency, we propose a dataset update strategy that prioritizes radiance field training with camera views close to the already-trained views before propagating the training to the remaining views. We show that under the same dataset update strategy, we can easily adapt our method to object insertion using data from text-to-3D models, as well as to object removal. Experimental results show that our method generates photorealistic images of the edited scenes, and outperforms state-of-the-art methods in 3D reconstruction and neural radiance field blending.
https://arxiv.org/abs/2309.11281
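The prioritized dataset-update schedule can be illustrated with a simple ordering of candidate camera views by distance to the nearest already-trained view; the authors' criterion may be richer than plain Euclidean distance.

```python
import numpy as np

def next_view_order(trained_positions, candidate_positions):
    """Order candidate camera views so that those closest to an already-trained view
    come first, illustrating the prioritized dataset-update schedule described in
    the abstract; the authors' criterion may be richer than Euclidean distance."""
    trained = np.asarray(trained_positions, dtype=float)
    candidates = np.asarray(candidate_positions, dtype=float)
    # Distance from each candidate view to its nearest already-trained view.
    d = np.linalg.norm(candidates[:, None, :] - trained[None, :, :], axis=-1).min(axis=1)
    return np.argsort(d)

# Usage: two trained views on the x-axis, three candidate views.
print(next_view_order([[0, 0, 0], [1, 0, 0]],
                      [[5, 0, 0], [0.2, 0, 0], [2, 0, 0]]))  # [1 2 0]
```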
Coarse architectural models are often generated at scales ranging from individual buildings to scenes for downstream applications such as Digital Twin City, Metaverse, LODs, etc. Such piece-wise planar models can be abstracted as twins from 3D dense reconstructions. However, these models typically lack realistic texture relative to the real building or scene, making them unsuitable for vivid display or direct reference. In this paper, we present TwinTex, the first automatic texture mapping framework to generate a photo-realistic texture for a piece-wise planar proxy. Our method addresses most challenges occurring in such twin texture generation. Specifically, for each primitive plane, we first select a small set of photos with greedy heuristics considering photometric quality, perspective quality and facade texture completeness. Then, different levels of line features (LoLs) are extracted from the set of selected photos to generate guidance for later steps. With LoLs, we employ optimization algorithms to align texture with geometry from local to global. Finally, we fine-tune a diffusion model with a multi-mask initialization component and a new dataset to inpaint the missing region. Experimental results on many buildings, indoor scenes and man-made objects of varying complexity demonstrate the generalization ability of our algorithm. Our approach surpasses state-of-the-art texture mapping methods in terms of high-fidelity quality and reaches a human-expert production level with much less effort. Project page: https://vcc.tech/research/2023/TwinTex.
https://arxiv.org/abs/2309.11258
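The greedy photo-selection heuristic for each primitive plane might look like the sketch below, where the combined quality/completeness score is left to the caller; both the interface and the budget k are assumptions, not TwinTex's exact heuristics.

```python
def select_photos(photos, score, k=4):
    """Greedy selection of a small photo set for one primitive plane.

    `photos` is any iterable of photo records and `score(photo, chosen)` is a
    caller-supplied function combining photometric quality, perspective quality,
    and facade completeness, which the abstract describes only qualitatively;
    both the interface and the budget k are assumptions, not TwinTex's heuristics.
    """
    remaining, chosen = list(photos), []
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda p: score(p, chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Usage with toy records scored by a single "sharpness" field.
photos = [{"id": i, "sharpness": s} for i, s in enumerate([0.4, 0.9, 0.7])]
print([p["id"] for p in select_photos(photos, lambda p, chosen: p["sharpness"], k=2)])  # [1, 2]
```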
In this work, we investigate the personalization of text-to-music diffusion models in a few-shot setting. Motivated by recent advances in the computer vision domain, we are the first to explore the combination of pre-trained text-to-audio diffusers with two established personalization methods. We experiment with the effect of audio-specific data augmentation on the overall system performance and assess different training strategies. For evaluation, we construct a novel dataset with prompts and music clips. We consider both embedding-based and music-specific metrics for quantitative evaluation, as well as a user study for qualitative evaluation. Our analysis shows that similarity metrics are in accordance with user preferences and that current personalization approaches tend to learn rhythmic music constructs more easily than melody. The code, dataset, and example material of this study are open to the research community.
https://arxiv.org/abs/2309.11140
Dominant Person Search methods aim to localize and recognize query persons in a unified network, which jointly optimizes two sub-tasks, i.e., detection and Re-IDentification (ReID). Despite significant progress, two major challenges remain: 1) Detection-prior modules in previous methods are suboptimal for the ReID task. 2) The collaboration between two sub-tasks is ignored. To alleviate these issues, we present a novel Person Search framework based on the Diffusion model, PSDiff. PSDiff formulates the person search as a dual denoising process from noisy boxes and ReID embeddings to ground truths. Unlike existing methods that follow the Detection-to-ReID paradigm, our denoising paradigm eliminates detection-prior modules to avoid the local-optimum of the ReID task. Following the new paradigm, we further design a new Collaborative Denoising Layer (CDL) to optimize detection and ReID sub-tasks in an iterative and collaborative way, which makes two sub-tasks mutually beneficial. Extensive experiments on the standard benchmarks show that PSDiff achieves state-of-the-art performance with fewer parameters and elastic computing overhead.
https://arxiv.org/abs/2309.11125
The rapid and massive diffusion of electric vehicles poses new challenges to the electric system, which must be able to supply these new loads, but at the same time it opens up new opportunities thanks to the possible provision of ancillary services. Indeed, in the so-called Vehicle-to-Grid (V2G) set-up, the charging power can be modulated throughout the day so that a fleet of vehicles can absorb an excess of power from the grid or provide extra power during a this http URL. To this end, many works in the literature focus on optimizing each vehicle's daily charging profile to offer the requested ancillary services while guaranteeing a charged battery for each vehicle at the end of the day. However, the size of the economic benefits related to the provision of ancillary services varies significantly with the modeling approach, assumptions, and considered scenarios. In this paper we propose a profitability analysis with reference to a recently proposed framework for optimal V2G operation under uncertainty. We provide necessary and sufficient conditions for profitability in a simplified case, and we show via simulation that they also hold in the general case.
https://arxiv.org/abs/2309.11118
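To make the kind of charging-profile optimization the abstract refers to concrete, here is a toy daily V2G schedule written as a linear program; prices, limits, and the end-of-day energy requirement are invented for illustration, and the paper's framework with uncertainty and ancillary-service constraints is not reproduced.

```python
import numpy as np
from scipy.optimize import linprog

# Toy daily V2G charging-profile optimization, only to illustrate the kind of problem
# the abstract refers to; prices, limits, and the energy requirement are invented, and
# the paper's framework with uncertainty and ancillary-service constraints is not reproduced.
T = 24                                                        # hourly slots
price = 0.15 + 0.10 * np.sin(np.linspace(0, 2 * np.pi, T))   # assumed energy price (EUR/kWh)
p_max = 7.0                                                   # charger limit (kW); discharge allowed
energy_needed = 30.0                                          # battery must gain 30 kWh by day's end

res = linprog(
    c=price,                                    # minimize energy cost over the day
    A_eq=np.ones((1, T)), b_eq=[energy_needed], # end-of-day energy requirement (1 h slots)
    bounds=[(-p_max, p_max)] * T,               # negative power = V2G discharge to the grid
)
print(res.status, np.round(res.x, 2))           # 0 = optimal, per-hour charging profile in kW
```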