Western music is often characterized by a homophonic texture, in which the musical content can be organized into a melody and an accompaniment. In orchestral music in particular, the composer can select specific characteristics for each instrument's part within the accompaniment, while also needing to adapt the melody to suit the capabilities of the instruments performing it. In this work, we propose METEOR, a model for Melody-aware Texture-controllable Orchestral music generation. The model performs symbolic multi-track music style transfer with a focus on melodic fidelity. We allow bar- and track-level controllability of the accompaniment with various textural attributes while keeping a homophonic texture. We show that the model achieves controllability performance comparable to that of strong baselines while greatly improving melodic fidelity.
https://arxiv.org/abs/2409.11753
The goal of style transfer is, given a content image and a style source, to generate a new image that preserves the content but adopts the artistic representation of the style source. Most state-of-the-art architectures use transformers or diffusion-based models to perform this task, despite the heavy computational burden they impose. In particular, transformers use self- and cross-attention layers, which have a large memory footprint, while diffusion models require high inference time. To overcome the above, this paper explores a novel design of Mamba, an emergent State-Space Model (SSM), called Mamba-ST, to perform style transfer. To do so, we adapt the Mamba linear equation to simulate the behavior of cross-attention layers, which are able to combine two separate embeddings into a single output, while drastically reducing memory usage and time complexity. We modify Mamba's inner equations so that they accept inputs from, and combine, two separate data streams. To the best of our knowledge, this is the first attempt to adapt the equations of SSMs to a vision task like style transfer without requiring any other module such as cross-attention or custom normalization layers. An extensive set of experiments demonstrates the superiority and efficiency of our method in performing style transfer compared to transformers and diffusion models. Results show improved quality in terms of both ArtFID and FID metrics. Code is available at this https URL.
https://arxiv.org/abs/2409.10385
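To make the two-stream idea above concrete, here is a minimal, hypothetical PyTorch sketch (not Mamba-ST's actual equations) of a diagonal linear state-space scan in which the recurrence input comes from a content stream while the per-step input/output projections are predicted from a style stream, so the scan fuses the two embeddings without an explicit cross-attention module; all module names and dimensions are assumptions.

import torch
import torch.nn as nn

class TwoStreamSSMSketch(nn.Module):
    def __init__(self, dim: int, state: int = 16):
        super().__init__()
        self.A = nn.Parameter(-torch.rand(dim, state))   # negative rates -> stable decay
        self.log_dt = nn.Parameter(torch.zeros(dim))     # per-channel step size (log scale)
        self.to_B = nn.Linear(dim, state)                # style stream -> input projection
        self.to_C = nn.Linear(dim, state)                # style stream -> output projection

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # content, style: (batch, length, dim); the two streams share the sequence length here
        B_t = self.to_B(style)                                        # (b, L, state)
        C_t = self.to_C(style)                                        # (b, L, state)
        decay = torch.exp(torch.exp(self.log_dt)[:, None] * self.A)   # (dim, state), in (0, 1)
        h = content.new_zeros(content.size(0), content.size(2), self.A.size(1))
        outputs = []
        for t in range(content.size(1)):                 # sequential scan kept for clarity
            h = decay * h + B_t[:, t, None, :] * content[:, t, :, None]
            outputs.append((h * C_t[:, t, None, :]).sum(-1))
        return torch.stack(outputs, dim=1)               # (b, L, dim)

fused = TwoStreamSSMSketch(dim=64)(torch.randn(2, 32, 64), torch.randn(2, 32, 64))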
Current mainstream audio generation methods primarily rely on simple text prompts, often failing to capture the nuanced details necessary for multi-style audio generation. To address this limitation, the Sound Event Enhanced Prompt Adapter is proposed. Unlike traditional static global style transfer, this method extracts a style embedding through cross-attention between text and reference audio for adaptive style control. Adaptive layer normalization is then utilized to enhance the model's capacity to express multiple styles. Additionally, the Sound Event Reference Style Transfer Dataset (SERST) is introduced for the proposed target-style audio generation task, enabling dual-prompt audio generation using both text and audio references. Experimental results demonstrate the robustness of the model, which achieves a state-of-the-art Fréchet Distance of 26.94 and a KL Divergence of 1.82, surpassing Tango, AudioLDM, and AudioGen. Furthermore, the generated audio shows high similarity to its corresponding audio reference. The demo, code, and dataset are publicly available.
https://arxiv.org/abs/2409.09381
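As a rough illustration of the two ingredients named above (a style embedding pooled from reference audio via text-audio cross-attention, and an adaptive layer normalization modulated by that embedding), the following hedged PyTorch sketch shows one plausible wiring; the module name, the mean pooling, and the dimensions are assumptions rather than the paper's implementation.

import torch
import torch.nn as nn

class StylePromptAdapterSketch(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(dim, 2 * dim)

    def forward(self, text_tokens, audio_tokens, hidden):
        # text_tokens: (b, Lt, d)  audio_tokens: (b, La, d)  hidden: (b, L, d)
        style, _ = self.cross_attn(text_tokens, audio_tokens, audio_tokens)
        style = style.mean(dim=1)                        # pooled style embedding, (b, d)
        scale, shift = self.to_scale_shift(style).chunk(2, dim=-1)
        # adaptive layer norm: modulate the generator's hidden states with the style
        return self.norm(hidden) * (1 + scale[:, None]) + shift[:, None]

adapter = StylePromptAdapterSketch()
out = adapter(torch.randn(2, 16, 512), torch.randn(2, 100, 512), torch.randn(2, 64, 512))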
Over the last decade, deep learning has shown great success at performing computer vision tasks, including classification, super-resolution, and style transfer. Now, we apply it to data compression to help build the next generation of multimedia codecs. This thesis provides three primary contributions to this new field of learned compression. First, we present an efficient low-complexity entropy model that dynamically adapts the encoding distribution to a specific input by compressing and transmitting the encoding distribution itself as side information. Secondly, we propose a novel lightweight low-complexity point cloud codec that is highly specialized for classification, attaining significant reductions in bitrate compared to non-specialized codecs. Lastly, we explore how motion within the input domain between consecutive video frames is manifested in the corresponding convolutionally-derived latent space.
https://arxiv.org/abs/2409.08376
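To illustrate the side-information idea in the first contribution, here is a small, assumed numpy sketch (not the thesis code) that fits a per-input histogram over a quantized latent and counts both the bits for coding the symbols under that histogram and the extra bits needed to transmit the histogram itself.

import numpy as np

def adaptive_code_length(latent: np.ndarray, n_bins: int = 64, side_bits_per_bin: int = 8):
    # quantize the latent, then build an input-specific coding distribution
    q = np.clip(np.round(latent), -n_bins // 2, n_bins // 2 - 1).astype(int)
    symbols = (q - q.min()).ravel()
    counts = np.bincount(symbols, minlength=n_bins).astype(float)
    probs = (counts + 1e-9) / (counts + 1e-9).sum()     # per-input encoding distribution
    data_bits = -np.log2(probs[symbols]).sum()          # bits to code the latent under it
    side_bits = n_bins * side_bits_per_bin              # bits to transmit the model itself
    return data_bits + side_bits

rate = adaptive_code_length(np.random.randn(32, 32) * 3)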
Motion style transfer changes the style of a motion while retaining its content and is useful in computer animations and games. Contact is an essential component of motion style transfer that should be controlled explicitly in order to express the style vividly while enhancing motion naturalness and quality. However, it is unknown how to decouple and control contact to achieve fine-grained control in motion style transfer. In this paper, we present a novel style transfer method for fine-grained control over contacts while achieving both motion naturalness and spatial-temporal variations of style. Based on our empirical evidence, we propose controlling contact indirectly through the hip velocity, which can be further decomposed into the trajectory and contact timing, respectively. To this end, we propose a new model that explicitly models the correlations between motions and trajectory/contact timing/style, allowing us to decouple and control each separately. Our approach is built around a motion manifold, where hip controls can be easily integrated into a Transformer-based decoder. It is versatile in that it can generate motions directly as well as be used as post-processing for existing methods to improve quality and contact controllability. In addition, we propose a new metric that measures a correlation pattern of motions based on our empirical evidence, aligning well with human perception in terms of motion naturalness. Based on extensive evaluation, our method outperforms existing methods in terms of style expressivity and motion quality.
https://arxiv.org/abs/2409.05387
In this paper, we introduce MRStyle, a comprehensive framework that enables color style transfer using multi-modality reference, including image and text. To achieve a unified style feature space for both modalities, we first develop a neural network called IRStyle, which generates stylized 3D lookup tables for image reference. This is accomplished by integrating an interaction dual-mapping network with a combined supervised learning pipeline, resulting in three key benefits: elimination of visual artifacts, efficient handling of high-resolution images with low memory usage, and maintenance of style consistency even in situations with significant color style variations. For text reference, we align the text feature of stable diffusion priors with the style feature of our IRStyle to perform text-guided color style transfer (TRStyle). Our TRStyle method is highly efficient in both training and inference, producing notable open-set text-guided transfer results. Extensive experiments in both image and text settings demonstrate that our proposed method outperforms the state-of-the-art in both qualitative and quantitative evaluations.
https://arxiv.org/abs/2409.05250
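Since the image branch above ultimately produces a stylized 3D lookup table, the following assumed numpy sketch shows the generic application step that makes LUT-based color transfer cheap at high resolution: each pixel's RGB value indexes the table and is replaced by a trilinearly interpolated entry (this is a standard LUT lookup, not IRStyle's network).

import numpy as np

def apply_3d_lut(image: np.ndarray, lut: np.ndarray) -> np.ndarray:
    # image: (H, W, 3) floats in [0, 1]; lut: (S, S, S, 3) mapping RGB -> stylized RGB
    size = lut.shape[0]
    idx = image * (size - 1)
    lo = np.floor(idx).astype(int)
    hi = np.minimum(lo + 1, size - 1)
    w = idx - lo
    out = np.zeros_like(image)
    for dr, dg, db in np.ndindex(2, 2, 2):           # eight corners of the LUT cell
        r = np.where(dr, hi[..., 0], lo[..., 0])
        g = np.where(dg, hi[..., 1], lo[..., 1])
        b = np.where(db, hi[..., 2], lo[..., 2])
        weight = (np.where(dr, w[..., 0], 1 - w[..., 0])
                  * np.where(dg, w[..., 1], 1 - w[..., 1])
                  * np.where(db, w[..., 2], 1 - w[..., 2]))
        out += weight[..., None] * lut[r, g, b]
    return out

identity_lut = np.stack(np.meshgrid(*[np.linspace(0, 1, 17)] * 3, indexing="ij"), -1)
stylized = apply_3d_lut(np.random.rand(64, 64, 3), identity_lut)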
A robust face recognition model must be trained using datasets that include a large number of subjects and numerous samples per subject under varying conditions (such as pose, expression, age, noise, and occlusion). Due to ethical and privacy concerns, large-scale real face datasets, such as MS1MV3, have been discontinued, and synthetic face generators utilizing GANs and Diffusion Models, such as SYNFace, SFace, DigiFace-1M, IDiff-Face, DCFace, and GANDiffFace, have been proposed to meet this demand. Some of these methods can produce high-fidelity realistic faces but with low intra-class variance, while others generate high-variance faces with low identity consistency. In this paper, we propose a Triple Condition Diffusion Model (TCDiff) to improve face style transfer from real to synthetic faces through 2D and 3D facial constraints, enhancing face identity consistency while keeping the necessary high intra-class variance. Face recognition experiments using 1k, 2k, and 5k classes of our new dataset for training outperform state-of-the-art synthetic datasets on real face benchmarks such as LFW, CFP-FP, AgeDB, and BUPT. Our source code is available at: this https URL.
https://arxiv.org/abs/2409.03600
One-shot voice conversion (VC) aims to change the timbre of any source speech to match that of an unseen target speaker with only one speech sample. Existing style transfer-based VC methods rely on speech representation disentanglement and struggle to accurately and independently encode each speech component and to recompose them effectively into converted speech. To tackle this, we propose Pureformer-VC, which utilizes Conformer blocks to build a disentangled encoder and Zipformer blocks to build a style transfer decoder as the generator. In the decoder, we use styleformer blocks to effectively integrate speaker characteristics into the generated speech. The model uses a generative VAE loss for encoding components and a triplet loss for unsupervised discriminative training. We apply the styleformer method to Zipformer's shared weights for style transfer. The experimental results show that the proposed model achieves comparable subjective scores and exhibits improvements in objective metrics compared to existing methods in a one-shot voice conversion scenario.
https://arxiv.org/abs/2409.01668
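For the training objective sketched above, one minimal, assumed combination of a VAE term for the encoder and a triplet term on speaker embeddings could look like the following PyTorch helper; the reconstruction distance, weights, and margin are illustrative, not the paper's settings.

import torch
import torch.nn.functional as F

def vc_training_loss(recon, target, mu, logvar, anchor, positive, negative,
                     beta: float = 0.01, margin: float = 0.3):
    # VAE part: reconstruction of the converted speech plus a KL term on the encoder
    recon_loss = F.l1_loss(recon, target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # triplet part: pull same-speaker embeddings together, push different speakers apart
    triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    return recon_loss + beta * kl + triplet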
This article compares two style transfer methods in image processing: the traditional method, which synthesizes new images by stitching together small patches from existing images, and a modern machine learning-based approach that uses a segmentation network to isolate foreground objects and apply style transfer solely to the background. The traditional method excels in creating artistic abstractions but can struggle with seamlessness, whereas the machine learning method preserves the integrity of foreground elements while enhancing the background, offering improved aesthetic quality and computational efficiency. Our study indicates that machine learning-based methods are more suited for real-world applications where detail preservation in foreground elements is essential.
https://arxiv.org/abs/2409.00606
Portrait sketching involves capturing identity specific attributes of a real face with abstract lines and shades. Unlike photo-realistic images, a good portrait sketch generation method needs selective attention to detail, making the problem challenging. This paper introduces \textbf{Portrait Sketching StyleGAN (PS-StyleGAN)}, a style transfer approach tailored for portrait sketch synthesis. We leverage the semantic $W+$ latent space of StyleGAN to generate portrait sketches, allowing us to make meaningful edits, like pose and expression alterations, without compromising identity. To achieve this, we propose the use of Attentive Affine transform blocks in our architecture, and a training strategy that allows us to change StyleGAN's output without finetuning it. These blocks learn to modify style latent code by paying attention to both content and style latent features, allowing us to adapt the outputs of StyleGAN in an inversion-consistent manner. Our approach uses only a few paired examples ($\sim 100$) to model a style and has a short training time. We demonstrate PS-StyleGAN's superiority over the current state-of-the-art methods on various datasets, qualitatively and quantitatively.
https://arxiv.org/abs/2409.00345
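A hypothetical PyTorch sketch of an attentive affine block in the spirit described above: attention from the $W+$ style code over content and style features yields per-channel scale and shift that edit the code before it is fed to a frozen StyleGAN. The block structure and dimensions are assumptions, not PS-StyleGAN's code.

import torch
import torch.nn as nn

class AttentiveAffineSketch(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_affine = nn.Linear(dim, 2 * dim)

    def forward(self, style_code, content_feats, style_feats):
        # style_code: (b, n_latents, d); content_feats, style_feats: (b, L, d)
        context = torch.cat([content_feats, style_feats], dim=1)
        attended, _ = self.attn(style_code, context, context)
        scale, shift = self.to_affine(attended).chunk(2, dim=-1)
        return style_code * (1 + scale) + shift      # edited W+ code for the frozen generator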
The diffusion model has shown exceptional capabilities in controlled image generation, which has further fueled interest in image style transfer. Existing works mainly focus on training-free methods (e.g., image inversion) due to the scarcity of specific data. In this study, we present a data construction pipeline for content-style-stylized image triplets that generates and automatically cleanses stylized data triplets. Based on this pipeline, we construct IMAGStyle, the first large-scale style transfer dataset, containing 210k image triplets, available for the community to explore and research. Equipped with IMAGStyle, we propose CSGO, a style transfer model based on end-to-end training, which explicitly decouples content and style features employing independent feature injection. The unified CSGO implements image-driven style transfer, text-driven stylized synthesis, and text editing-driven stylized synthesis. Extensive experiments demonstrate the effectiveness of our approach in enhancing style control capabilities in image generation. Additional visualization and access to the source code can be found on the project page: \url{this https URL}.
https://arxiv.org/abs/2408.16766
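The phrase "independent feature injection" above suggests separate conditioning paths for content and style; a hedged PyTorch sketch of that idea (names, scales, and wiring are assumptions, not CSGO's modules) might look like the following.

import torch
import torch.nn as nn

class DecoupledInjectionSketch(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.content_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.style_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.content_scale = nn.Parameter(torch.tensor(1.0))
        self.style_scale = nn.Parameter(torch.tensor(1.0))

    def forward(self, hidden, content_tokens, style_tokens):
        # hidden: (b, L, d); each condition gets its own cross-attention branch
        c, _ = self.content_attn(hidden, content_tokens, content_tokens)
        s, _ = self.style_attn(hidden, style_tokens, style_tokens)
        return hidden + self.content_scale * c + self.style_scale * s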
Hairstyle transfer is a challenging task in the image editing field that modifies the hairstyle of a given face image while preserving its other appearance and background features. The existing hairstyle transfer approaches heavily rely on StyleGAN, which is pre-trained on cropped and aligned face images. Hence, they struggle to generalize under challenging conditions such as extreme variations of head poses or focal lengths. To address this issue, we propose a one-stage hairstyle transfer diffusion model, HairFusion, that applies to real-world scenarios. Specifically, we carefully design a hair-agnostic representation as the input of the model, where the original hair information is thoroughly eliminated. Next, we introduce a hair-align cross-attention (Align-CA) to accurately align the reference hairstyle with the face image while considering the difference in their face shapes. To enhance the preservation of the face image's original features, we leverage adaptive hair blending during inference, where the output's hair regions are estimated by the cross-attention map in Align-CA and blended with the non-hair areas of the face image. Our experimental results show that our method achieves state-of-the-art performance compared to the existing methods in preserving the integrity of both the transferred hairstyle and the surrounding features. The codes are available at this https URL.
https://arxiv.org/abs/2408.16450
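The adaptive hair blending step described above can be pictured with a small, assumed PyTorch sketch: a soft hair mask derived from the Align-CA attention map keeps generated pixels in hair regions and original pixels elsewhere. The normalization and thresholding here are illustrative guesses, not the paper's exact procedure.

import torch

def adaptive_hair_blend(generated, original, attn_map, temperature: float = 10.0):
    # generated, original: (b, 3, H, W); attn_map: (b, 1, H, W), higher = more "hair"
    lo = attn_map.amin(dim=(2, 3), keepdim=True)
    hi = attn_map.amax(dim=(2, 3), keepdim=True)
    attn = (attn_map - lo) / (hi - lo + 1e-8)           # per-sample min-max normalization
    mask = torch.sigmoid(temperature * (attn - 0.5))    # soft hair mask in [0, 1]
    return mask * generated + (1 - mask) * original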
Facial analysis is a key component in a wide range of applications such as security, autonomous driving, entertainment, and healthcare. Despite the availability of various facial RGB datasets, the thermal modality, which plays a crucial role in life sciences, medicine, and biometrics, has been largely overlooked. To address this gap, we introduce the T-FAKE dataset, a new large-scale synthetic thermal dataset with sparse and dense landmarks. To facilitate the creation of the dataset, we propose a novel RGB2Thermal loss function, which enables the transfer of thermal style to RGB faces. By utilizing the Wasserstein distance between thermal and RGB patches and the statistical analysis of clinical temperature distributions on faces, we ensure that the generated thermal images closely resemble real samples. Using RGB2Thermal style transfer based on our RGB2Thermal loss function, we create the T-FAKE dataset, a large-scale synthetic thermal dataset of faces. Leveraging our novel T-FAKE dataset, probabilistic landmark prediction, and label adaptation networks, we demonstrate significant improvements in landmark detection methods on thermal images across different landmark conventions. Our models show excellent performance with both sparse 70-point landmarks and dense 478-point landmark annotations. Our code and models are available at this https URL.
https://arxiv.org/abs/2408.15127
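One commonly used closed form that could back a patch-level Wasserstein term like the one mentioned above: for equally sized flattened patches, the 1-D Wasserstein-1 distance between two empirical distributions is the mean absolute difference of their sorted values. The sketch below is a generic PyTorch helper under that assumption, not the paper's RGB2Thermal loss.

import torch

def patch_wasserstein_1d(pred_patches: torch.Tensor, ref_patches: torch.Tensor) -> torch.Tensor:
    # pred_patches, ref_patches: (n_patches, patch_pixels), e.g. flattened thermal patches
    pred_sorted, _ = torch.sort(pred_patches, dim=-1)
    ref_sorted, _ = torch.sort(ref_patches, dim=-1)
    return (pred_sorted - ref_sorted).abs().mean()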
Deep Neural Networks (DNNs) for Autonomous Driving Systems (ADS) are typically trained on real-world images and tested using synthetic simulator images. This approach results in training and test datasets with dissimilar distributions, which can potentially lead to erroneously decreased test accuracy. To address this issue, the literature suggests applying domain-to-domain translators to test datasets to bring them closer to the training datasets. However, translating images used for testing may unpredictably affect the reliability, effectiveness and efficiency of the testing process. Hence, this paper investigates the following questions in the context of ADS: Could translators reduce the effectiveness of images used for ADS-DNN testing and their ability to reveal faults in ADS-DNNs? Can translators result in excessive time overhead during simulation-based testing? To address these questions, we consider three domain-to-domain translators: CycleGAN and neural style transfer, from the literature, and SAEVAE, our proposed translator. Our results for two critical ADS tasks -- lane keeping and object detection -- indicate that translators significantly narrow the gap in ADS test accuracy caused by distribution dissimilarities between training and test data, with SAEVAE outperforming the other two translators. We show that, based on the recent diversity, coverage, and fault-revealing ability metrics for testing deep-learning systems, translators do not compromise the diversity and the coverage of test data, nor do they lead to revealing fewer faults in ADS-DNNs. Further, among the translators considered, SAEVAE incurs a negligible overhead in simulation time and can be efficiently integrated into simulation-based testing. Finally, we show that translators increase the correlation between offline and simulation-based testing results, which can help reduce the cost of simulation-based testing.
https://arxiv.org/abs/2408.13950
Text-driven diffusion models have achieved remarkable success in image editing, but a crucial component of these models, the text embeddings, has not been fully explored. The entanglement and opacity of text embeddings present significant challenges to achieving precise image editing. In this paper, we provide a comprehensive and in-depth analysis of text embeddings in Stable Diffusion XL, offering three key insights. First, while the 'aug_embedding' captures the full semantic content of the text, its contribution to the final image generation is relatively minor. Second, 'BOS' and 'Padding_embedding' do not contain any semantic information. Lastly, the 'EOS' holds the semantic information of all words and contains the most style features. Each word embedding plays a unique role without interfering with the others. Based on these insights, we propose a novel approach for controllable image editing using a free-text embedding control method called PSP (Prompt-Softbox-Prompt). PSP enables precise image editing by inserting or adding text embeddings within the cross-attention layers and using Softbox to define and control the specific area for semantic injection. This technique allows for object additions and replacements while preserving other areas of the image. Additionally, PSP can achieve style transfer by simply replacing text embeddings. Extensive experimental results show that PSP achieves significant results in tasks such as object replacement, object addition, and style transfer.
https://arxiv.org/abs/2408.13623
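To make the Softbox idea above more tangible, here is a speculative PyTorch sketch in which an injected text token participates in cross-attention but its attention weight is softly gated outside a user-chosen box before renormalization; the gating scheme and the function signature are assumptions, not PSP's implementation.

import torch

def softbox_cross_attention(q, k, v, injected_k, injected_v, box, hw, softness=0.1):
    # q: (b, HW, d); k, v: (b, Lt, d); injected_k/v: (b, 1, d); box = (y0, y1, x0, x1)
    H, W = hw
    ys = torch.arange(H).float()[:, None].expand(H, W)
    xs = torch.arange(W).float()[None, :].expand(H, W)
    inside = ((ys >= box[0]) & (ys < box[1]) & (xs >= box[2]) & (xs < box[3])).float()
    gate = (softness + (1 - softness) * inside).reshape(1, H * W, 1)   # soft spatial gate
    k_all = torch.cat([k, injected_k], dim=1)
    v_all = torch.cat([v, injected_v], dim=1)
    attn = torch.softmax(q @ k_all.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
    # softly suppress the injected token's weight outside the box, then renormalize
    attn = torch.cat([attn[..., :-1], attn[..., -1:] * gate], dim=-1)
    attn = attn / attn.sum(dim=-1, keepdim=True)
    return attn @ v_all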
Neural Radiance Fields (NeRF) have emerged as a powerful tool for creating highly detailed and photorealistic scenes. Existing methods for NeRF-based 3D style transfer need extensive per-scene optimization for single or multiple styles, limiting the applicability and efficiency of 3D style transfer. In this work, we overcome the limitations of existing methods by rendering stylized novel views from a NeRF without the need for per-scene or per-style optimization. To this end, we take advantage of a generalizable NeRF model to facilitate style transfer in 3D, thereby enabling the use of a single learned model across various scenes. By incorporating a hypernetwork into a generalizable NeRF, our approach enables on-the-fly generation of stylized novel views. Moreover, we introduce a novel flow-based multi-view consistency loss to preserve consistency across multiple views. We evaluate our method across various scenes and artistic styles and show its performance in generating high-quality and multi-view consistent stylized images without the need for a scene-specific implicit model. Our findings demonstrate that this approach not only achieves a good visual quality comparable to that of per-scene methods but also significantly enhances efficiency and applicability, marking a notable advancement in the field of 3D style transfer.
https://arxiv.org/abs/2408.13508
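The flow-based multi-view consistency loss mentioned above can be written generically as a warp-and-compare term; the PyTorch sketch below (a common formulation, assumed rather than taken from the paper) warps the stylized rendering of one view into another view with optical flow and penalizes the masked difference.

import torch
import torch.nn.functional as F

def flow_consistency_loss(stylized_a, stylized_b, flow_b_to_a, valid_mask):
    # stylized_a/b: (b, 3, H, W); flow_b_to_a: (b, 2, H, W) in pixels; valid_mask: (b, 1, H, W)
    b, _, H, W = stylized_b.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().to(stylized_b)            # (H, W, 2), x then y
    target = grid + flow_b_to_a.permute(0, 2, 3, 1)                        # (b, H, W, 2)
    target = 2 * target / torch.tensor([W - 1, H - 1]).to(stylized_b) - 1  # normalize to [-1, 1]
    warped = F.grid_sample(stylized_a, target, align_corners=True)         # view A seen from B
    return ((warped - stylized_b).abs() * valid_mask).mean()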
Pedestrian Crossing Prediction (PCP) in driving scenes plays a critical role in ensuring the safe operation of intelligent vehicles. Due to the limited observations of pedestrian crossing behaviors in typical situations, recent studies have begun to leverage synthetic data with flexible variation to boost prediction performance, employing domain adaptation frameworks. However, different kinds of domain knowledge have distinct cross-domain distribution gaps, which necessitates suitable domain knowledge adaptation methods for PCP tasks. In this work, we propose a Gated Syn-to-Real Knowledge transfer approach for PCP (Gated-S2R-PCP), which has two aims: 1) designing suitable domain adaptation methods for different kinds of cross-domain knowledge, and 2) transferring suitable knowledge for specific situations with gated knowledge fusion. Specifically, we design a framework that contains three domain adaptation methods, namely style transfer, distribution approximation, and knowledge distillation, for various kinds of information, such as visual, semantic, depth, and location. A Learnable Gated Unit (LGU) is employed to fuse suitable cross-domain knowledge to boost pedestrian crossing prediction. We construct a new synthetic benchmark, S2R-PCP-3181, with 3181 sequences (489,740 frames) that contains pedestrian locations, RGB frames, semantic images, and depth images. With the synthetic S2R-PCP-3181, we transfer the knowledge to two challenging real datasets, PIE and JAAD, and obtain PCP performance superior to state-of-the-art methods.
https://arxiv.org/abs/2409.06707
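A minimal, assumed PyTorch sketch of a learnable gated fusion unit in the spirit of the LGU above: per-sample softmax gates weight the features produced by the three adaptation branches before prediction. The gating form and dimensions are illustrative, not the paper's design.

import torch
import torch.nn as nn

class GatedFusionUnitSketch(nn.Module):
    def __init__(self, dim: int, n_branches: int = 3):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(n_branches * dim, n_branches), nn.Softmax(dim=-1))

    def forward(self, branch_feats):
        # branch_feats: list of (batch, dim) tensors, one per adaptation branch
        stacked = torch.stack(branch_feats, dim=1)              # (batch, n_branches, dim)
        weights = self.gate(stacked.flatten(1)).unsqueeze(-1)   # (batch, n_branches, 1)
        return (weights * stacked).sum(dim=1)                   # fused feature, (batch, dim)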
This study proposes a novel approach to extracting the stylistic features of Jiehua: the utilization of a Fine-tuned Stable Diffusion Model with ControlNet (FSDMC) to refine depiction techniques from artists' Jiehua. The training data for FSDMC is based on open-source Jiehua artists' works collected from the Internet, which were subsequently manually constructed in the format of (Original Image, Canny Edge Features, Text Prompt). By employing the optimal hyperparameters identified in this paper, we observed that FSDMC outperforms CycleGAN, another mainstream style transfer model. FSDMC achieves an FID of 3.27 on the dataset and also surpasses CycleGAN in terms of expert evaluation. This not only demonstrates the model's high effectiveness in extracting Jiehua's style features but also preserves the original pre-trained semantic information. The findings of this study suggest that applying FSDMC with appropriate hyperparameters can enhance the efficacy of the Stable Diffusion Model in traditional art style transfer tasks, particularly within the context of Jiehua.
https://arxiv.org/abs/2408.11744
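For the data format described above, constructing one (Original Image, Canny Edge Features, Text Prompt) training triple could be as simple as the following OpenCV sketch; the Canny thresholds and the prompt text are placeholders, and the image path is hypothetical.

import cv2

def build_training_triple(image_path: str, prompt: str = "a traditional Jiehua painting"):
    # image_path and prompt are illustrative placeholders, not the paper's actual data
    image = cv2.imread(image_path)
    edges = cv2.Canny(cv2.cvtColor(image, cv2.COLOR_BGR2GRAY), 100, 200)
    return image, edges, prompt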
The goal of image style transfer is to render an image guided by a style reference while maintaining the original content. Existing image-guided methods rely on specific style reference images, restricting their wider application and potentially compromising result quality. As a flexible alternative, text-guided methods allow users to describe the desired style using text prompts. Despite their versatility, these methods often struggle with maintaining style consistency, reflecting the described style accurately, and preserving the content of the target image. To address these challenges, we introduce FAGStyle, a zero-shot text-guided diffusion image style transfer method. Our approach enhances inter-patch information interaction by incorporating the Sliding Window Crop technique and Feature Augmentation on Geodesic Surface into our style control loss. Furthermore, we integrate a Pre-Shape self-correlation consistency loss to ensure content consistency. FAGStyle demonstrates superior performance over existing methods, consistently achieving stylization that retains the semantic content of the source image. Experimental results confirm the efficacy of FAGStyle across a diverse range of source contents and styles, both imagined and common.
https://arxiv.org/abs/2408.10533
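The Sliding Window Crop step mentioned above amounts to collecting overlapping patches so that style guidance can be computed per patch; a small PyTorch sketch is given below, with the patch size and stride chosen as assumptions.

import torch

def sliding_window_crops(image: torch.Tensor, size: int = 128, stride: int = 64):
    # image: (b, c, H, W) -> overlapping crops of shape (b * n_patches, c, size, size)
    patches = image.unfold(2, size, stride).unfold(3, size, stride)
    b, c, nh, nw, _, _ = patches.shape
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(b * nh * nw, c, size, size)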
We present Text-driven Object-Centric Style Transfer (TEXTOC), a novel method that guides style transfer at an object-centric level using textual inputs. The core of TEXTOC is our Patch-wise Co-Directional (PCD) loss, meticulously designed for precise object-centric transformations that are closely aligned with the input text. This loss combines a patch directional loss for text-guided style direction and a patch distribution consistency loss for an even CLIP embedding distribution across object regions. It ensures a seamless and harmonious style transfer across object regions. Key to our method are the Text-Matched Patch Selection (TMPS) and Pre-fixed Region Selection (PRS) modules for identifying object locations via text, eliminating the need for segmentation masks. Lastly, we introduce an Adaptive Background Preservation (ABP) loss to maintain the original style and structural essence of the image's background. This loss is applied to dynamically identified background areas. Extensive experiments underline the effectiveness of our approach in creating visually coherent and textually aligned style transfers.
https://arxiv.org/abs/2408.08461
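The patch directional loss described above is typically built on CLIP-style directions; the following generic PyTorch sketch (a common formulation, not necessarily TEXTOC's exact loss, with the image encoder passed in as an assumption) aligns the source-to-stylized image direction of each patch with the source-to-target text direction.

import torch.nn.functional as F

def patch_directional_loss(stylized_patches, source_patches,
                           target_text_feat, source_text_feat, image_encoder):
    # image_encoder maps a batch of patches to embeddings in the same space as the text features
    img_dir = image_encoder(stylized_patches) - image_encoder(source_patches)
    txt_dir = (target_text_feat - source_text_feat).expand_as(img_dir)
    return (1 - F.cosine_similarity(img_dir, txt_dir, dim=-1)).mean()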