Diffusion-based generative models have exhibited powerful generative performance in recent years. However, because many attributes exist in the data distribution and the model parameters are shared across all levels of the generation process, it remains challenging to control specific styles for each attribute. To address this problem, this paper presents decoupled denoising diffusion models (DDDMs) with disentangled representations, which can control the style for each attribute in generative models. We apply DDDMs to voice conversion (VC) tasks to address the challenges of disentangling and controlling each speech attribute (e.g., linguistic information, intonation, and timbre). First, we use a self-supervised representation to disentangle the speech representation. Subsequently, the DDDMs are applied to resynthesize the speech from the disentangled representations, with denoising performed with respect to each attribute. Moreover, we propose the prior mixup for robust voice style transfer, which uses the converted representation of the mixed style as a prior distribution for the diffusion models. The experimental results reveal that our method outperforms publicly available VC models. Furthermore, we show that our method provides robust generative performance regardless of the model size. Audio samples are available at this https URL.
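To make the idea of attribute-wise denoising concrete, here is a minimal PyTorch sketch of per-attribute denoisers driven by disentangled conditions, with their noise estimates summed into one prediction. The module names, dimensions, and combination rule are illustrative assumptions, not the authors' DDDM architecture, and the prior mixup is not shown.

```python
# Minimal sketch (not the authors' code): one small denoiser per disentangled
# attribute, all conditioned on the same diffusion timestep. Shapes, layer
# sizes, and the summation rule are illustrative assumptions.
import torch
import torch.nn as nn

class AttributeDenoiser(nn.Module):
    def __init__(self, feat_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + cond_dim + 1, hidden),  # +1 for the timestep
            nn.SiLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, x_t, cond, t):
        t = t.view(-1, 1).float()
        return self.net(torch.cat([x_t, cond, t], dim=-1))  # predicted noise

class DecoupledDenoiser(nn.Module):
    """One denoiser per attribute (e.g., content, pitch, speaker)."""
    def __init__(self, feat_dim: int, cond_dims: dict):
        super().__init__()
        self.denoisers = nn.ModuleDict(
            {name: AttributeDenoiser(feat_dim, d) for name, d in cond_dims.items()}
        )

    def forward(self, x_t, conds: dict, t):
        # Each attribute-specific denoiser sees only its own condition;
        # the per-attribute noise estimates are summed into a single prediction.
        return sum(self.denoisers[k](x_t, conds[k], t) for k in self.denoisers)

# Toy usage with made-up shapes
model = DecoupledDenoiser(feat_dim=80, cond_dims={"content": 256, "pitch": 1, "speaker": 192})
x_t = torch.randn(4, 80)
conds = {"content": torch.randn(4, 256), "pitch": torch.randn(4, 1), "speaker": torch.randn(4, 192)}
eps_hat = model(x_t, conds, torch.randint(0, 1000, (4,)))
```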
https://arxiv.org/abs/2305.15816
In this paper, we propose a novel language-guided 3D arbitrary neural style transfer method (CLIP3Dstyler). We aim at stylizing any 3D scene with an arbitrary style from a text description, and synthesizing the novel stylized view, which is more flexible than image-conditioned style transfer. Compared with the previous 2D method CLIPStyler, we are able to stylize a 3D scene and generalize to novel scenes without re-training our model. A straightforward solution is to combine previous image-conditioned 3D style transfer and text-conditioned 2D style transfer methods. However, such a solution cannot achieve our goal due to two main challenges. First, there is no multi-modal model matching point clouds and language at different feature scales (e.g., low-level, high-level). Second, we observe a style mixing issue when we stylize the content with different style conditions from text prompts. To address the first issue, we propose a 3D stylization framework to match the point cloud features with text features in local and global views. For the second issue, we propose an improved directional divergence loss to make arbitrary text styles more distinguishable as a complement to our framework. We conduct extensive experiments to show the effectiveness of our model on text-guided 3D scene style transfer.
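For reference, the sketch below shows the standard CLIP-style directional loss that such text-guided stylization methods commonly build on: the direction between two stylized image embeddings is pushed toward the direction between the two style-text embeddings, which keeps different text styles apart. The paper's improved directional divergence loss is a refinement of this idea and is not reproduced here; all tensors are stand-ins for CLIP embeddings.

```python
# Standard directional CLIP loss between two text styles and the corresponding
# stylized renders; a baseline illustration, not the paper's improved variant.
import torch
import torch.nn.functional as F

def directional_loss(img_emb_a, img_emb_b, txt_emb_a, txt_emb_b):
    """Align the image-space direction (a -> b) with the text-space
    direction (a -> b) so that different text styles stay distinguishable."""
    d_img = F.normalize(img_emb_a - img_emb_b, dim=-1)
    d_txt = F.normalize(txt_emb_a - txt_emb_b, dim=-1)
    return 1.0 - (d_img * d_txt).sum(dim=-1).mean()

# Embeddings would come from a CLIP image/text encoder; random tensors here.
loss = directional_loss(torch.randn(8, 512), torch.randn(8, 512),
                        torch.randn(1, 512), torch.randn(1, 512))
```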
https://arxiv.org/abs/2305.15732
Text style transfer is an exciting task within the field of natural language generation that is often plagued by the need for high-quality paired datasets. Furthermore, training a model for multi-attribute text style transfer requires datasets with sufficient support across all combinations of the considered stylistic attributes, adding to the challenges of training a style transfer model. This paper explores the impact of training data input diversity on the quality of the generated text from the multi-style transfer model. We construct a pseudo-parallel dataset by devising heuristics to adjust the style distribution in the training samples. We balance our training dataset using marginal and joint distributions to train our style transfer models. We observe that a balanced dataset produces more effective control effects over multiple styles than an imbalanced or skewed one. Through quantitative analysis, we explore the impact of multiple style distributions in training data on style-transferred output. These findings will better inform the design of style-transfer datasets.
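As a concrete illustration of balancing over the joint style distribution, the sketch below downsamples every attribute combination to the size of the rarest one. The attribute names and the downsampling heuristic are assumptions for illustration, not the paper's exact procedure.

```python
# Illustrative balancing heuristic: equalize the *joint* distribution of style
# attributes by downsampling every attribute combination to the rarest count.
import random
from collections import defaultdict

def balance_joint(samples, attrs=("sentiment", "formality"), seed=0):
    buckets = defaultdict(list)
    for s in samples:
        buckets[tuple(s[a] for a in attrs)].append(s)
    n = min(len(v) for v in buckets.values())
    rng = random.Random(seed)
    balanced = [s for bucket in buckets.values() for s in rng.sample(bucket, n)]
    rng.shuffle(balanced)
    return balanced

corpus = [{"text": f"t{i}", "sentiment": i % 2, "formality": (i // 2) % 2} for i in range(100)]
print(len(balance_joint(corpus)))  # 4 attribute combinations x the minimum count
```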
https://arxiv.org/abs/2305.15582
Image translation has wide applications, such as style transfer and modality conversion, usually aiming to generate images having both high degrees of realism and faithfulness. These problems remain difficult, especially when it is important to preserve semantic structures. Traditional image-level similarity metrics are of limited use, since the semantics of an image are high-level, and not strongly governed by pixel-wise faithfulness to an original image. Towards filling this gap, we introduce SAMScore, a generic semantic structural similarity metric for evaluating the faithfulness of image translation models. SAMScore is based on the recent high-performance Segment Anything Model (SAM), which can perform semantic similarity comparisons with standout accuracy. We applied SAMScore on 19 image translation tasks, and found that it is able to outperform all other competitive metrics on all of the tasks. We envision that SAMScore will prove to be a valuable tool that will help to drive the vibrant field of image translation, by allowing for more precise evaluations of new and evolving translation models. The code is available at this https URL.
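The rough sketch below conveys the flavor of a SAM-based semantic similarity score: compare the SAM image-encoder embeddings of the source and translated images with a spatially averaged cosine similarity. This is not the official SAMScore implementation, and the checkpoint path is a placeholder.

```python
# Approximate illustration of a SAM-based semantic structural similarity
# (not the released SAMScore code).
import numpy as np
import torch.nn.functional as F
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path
predictor = SamPredictor(sam)

def sam_embedding(image_rgb_uint8: np.ndarray):
    predictor.set_image(image_rgb_uint8)      # HxWx3, RGB, uint8
    return predictor.get_image_embedding()    # (1, 256, 64, 64)

def semantic_similarity(source: np.ndarray, translated: np.ndarray) -> float:
    e1, e2 = sam_embedding(source), sam_embedding(translated)
    # cosine similarity per spatial location, averaged over the feature map
    return F.cosine_similarity(e1, e2, dim=1).mean().item()
```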
https://arxiv.org/abs/2305.15367
With the increasing availability of depth sensors, multimodal frameworks that combine color information with depth data are attracting increasing interest. In the challenging task of semantic segmentation, depth maps make it possible to distinguish between similarly colored objects at different depths and provide useful geometric cues. On the other hand, ground truth data for semantic segmentation is burdensome to provide, so domain adaptation is another significant research area. Specifically, we address the challenging source-free domain adaptation setting, where adaptation is performed without reusing source data. We propose MISFIT: MultImodal Source-Free Information fusion Transformer, a depth-aware framework which injects depth information into a segmentation module based on vision transformers at multiple stages, namely at the input, feature, and output levels. Color and depth style transfer helps early-stage domain alignment, while re-wiring self-attention between modalities creates mixed features that allow the extraction of better semantic content. Furthermore, a depth-based entropy minimization strategy is also proposed to adaptively weight regions at different distances. Our framework, which is also the first approach using vision transformers for source-free semantic segmentation, shows noticeable performance improvements with respect to standard strategies.
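A minimal sketch of a depth-based entropy minimization term in the spirit described above: per-pixel prediction entropy weighted by a function of depth. The inverse-depth weighting is an assumption for illustration; MISFIT's actual weighting scheme may differ.

```python
# Depth-weighted entropy minimization (weighting function is an assumption).
import torch
import torch.nn.functional as F

def depth_weighted_entropy(logits, depth, eps=1e-8):
    """logits: (B, C, H, W) segmentation scores; depth: (B, 1, H, W) in metres."""
    p = F.softmax(logits, dim=1)
    entropy = -(p * torch.log(p + eps)).sum(dim=1, keepdim=True)  # (B, 1, H, W)
    w = 1.0 / (1.0 + depth)         # assumed: emphasize nearby regions more
    return (w * entropy).sum() / w.sum()

loss = depth_weighted_entropy(torch.randn(2, 19, 64, 64), torch.rand(2, 1, 64, 64) * 50)
```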
https://arxiv.org/abs/2305.14269
This paper presents a controllable text-to-video (T2V) diffusion model, named Video-ControlNet, that generates videos conditioned on a sequence of control signals, such as edge or depth maps. Video-ControlNet is built on a pre-trained conditional text-to-image (T2I) diffusion model by incorporating a spatial-temporal self-attention mechanism and trainable temporal layers for efficient cross-frame modeling. A first-frame conditioning strategy is proposed to enable the model to generate videos transferred from the image domain as well as arbitrary-length videos in an auto-regressive manner. Moreover, Video-ControlNet employs a novel residual-based noise initialization strategy to introduce a motion prior from an input video, producing more coherent videos. With the proposed architecture and strategies, Video-ControlNet can achieve resource-efficient convergence and generate superior-quality, consistent videos with fine-grained control. Extensive experiments demonstrate its success in various video generative tasks such as video editing and video style transfer, outperforming previous methods in terms of consistency and quality. Project Page: this https URL
https://arxiv.org/abs/2305.13840
In this study, we address the importance of modeling behavior style in virtual agents for personalized human-agent interaction. We propose a machine learning approach to synthesize gestures, driven by prosodic features and text, in the style of different speakers, even those unseen during training. Our model incorporates zero-shot multimodal style transfer using multimodal data from the PATS database, which contains videos of diverse speakers. We recognize style as a pervasive element during speech, influencing the expressivity of communicative behaviors, while content is conveyed through multimodal signals and text. By disentangling content and style, we directly infer the style embedding, even for speakers not included in the training phase, without the need for additional training or fine-tuning. Objective and subjective evaluations are conducted to validate our approach and compare it against two baseline methods.
https://arxiv.org/abs/2305.12887
We present an end-to-end diffusion-based method for editing videos with human language instructions, namely $\textbf{InstructVid2Vid}$. Our approach enables the editing of input videos based on natural language instructions without any per-example fine-tuning or inversion. The proposed InstructVid2Vid model combines a pretrained image generation model, Stable Diffusion, with a conditional 3D U-Net architecture to generate a time-dependent sequence of video frames. To obtain the training data, we incorporate the knowledge and expertise of different models, including ChatGPT, BLIP, and Tune-a-Video, to synthesize video-instruction triplets, which is a more cost-efficient alternative to collecting data in real-world scenarios. To improve the consistency between adjacent frames of generated videos, we propose the Frame Difference Loss, which is incorporated during the training process. During inference, we extend classifier-free guidance to text-video input to guide the generated results, making them more related to both the input video and the instruction. Experiments demonstrate that InstructVid2Vid is able to generate high-quality, temporally coherent videos and perform diverse edits, including attribute editing, change of background, and style transfer. These results highlight the versatility and effectiveness of our proposed method. Code is released in $\href{this https URL}{InstructVid2Vid}$.
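One plausible form of a frame-difference objective (the paper's exact definition may differ) is to match the temporal differences of generated frames to those of the reference frames, so that adjacent frames change consistently:

```python
# Hypothetical frame-difference loss: penalize mismatch between the temporal
# differences of generated and reference video frames.
import torch
import torch.nn.functional as F

def frame_difference_loss(gen: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """gen, ref: (B, T, C, H, W) video tensors with T >= 2."""
    d_gen = gen[:, 1:] - gen[:, :-1]
    d_ref = ref[:, 1:] - ref[:, :-1]
    return F.l1_loss(d_gen, d_ref)

loss = frame_difference_loss(torch.randn(1, 8, 3, 64, 64), torch.randn(1, 8, 3, 64, 64))
```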
https://arxiv.org/abs/2305.12328
Large Language Models (LLMs) have demonstrated remarkable performance in various tasks and gained significant attention. LLMs are also used for local sequence transduction tasks, including grammatical error correction (GEC) and formality style transfer, where most tokens in a source text are kept unchanged. However, it is inefficient to generate all target tokens, because a prediction error on one target token may cascade into errors on subsequent tokens and because the computational cost grows quadratically with the target sequence length. This paper proposes to predict a set of edit operations on the source text for local sequence transduction tasks. By representing an edit operation as a span of the source text and its changed tokens, we can reduce the length of the target sequence and thus the computational cost of inference. We apply instruction tuning to LLMs on supervision data of edit operations. Experiments show that the proposed method achieves comparable performance to the baseline in four tasks, paraphrasing, formality style transfer, GEC, and text simplification, despite reducing the length of the target text by as little as 21%. Furthermore, we report that instruction tuning with the proposed method achieved state-of-the-art performance in the four tasks.
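The sketch below illustrates the edit-operation representation in its simplest form: each edit replaces a span of source tokens with new tokens, so a model only has to emit a short list of edits rather than the full target sequence. The data structure and the GEC-style example are illustrative, not the paper's exact format.

```python
# Toy edit-operation representation: (span of source tokens, replacement tokens).
from dataclasses import dataclass

@dataclass
class Edit:
    start: int            # index of the first source token to replace
    end: int              # one past the last source token to replace
    replacement: list     # tokens to insert (empty list = deletion)

def apply_edits(source_tokens, edits):
    edits = sorted(edits, key=lambda e: e.start)
    out, cursor = [], 0
    for e in edits:
        out.extend(source_tokens[cursor:e.start])
        out.extend(e.replacement)
        cursor = e.end
    out.extend(source_tokens[cursor:])
    return out

src = "he go to school yesterday".split()
print(" ".join(apply_edits(src, [Edit(1, 2, ["went"])])))  # he went to school yesterday
```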
https://arxiv.org/abs/2305.11862
Every day, the human brain processes an immense volume of visual information, relying on intricate neural mechanisms to perceive and interpret these stimuli. Recent breakthroughs in functional magnetic resonance imaging (fMRI) have enabled scientists to extract visual information from human brain activity patterns. In this study, we present an innovative method for decoding brain activity into meaningful images and captions, with a specific focus on brain captioning due to its enhanced flexibility as compared to brain decoding into images. Our approach takes advantage of cutting-edge image captioning models and incorporates a unique image reconstruction pipeline that utilizes latent diffusion models and depth estimation. We utilized the Natural Scenes Dataset, a comprehensive fMRI dataset from eight subjects who viewed images from the COCO dataset. We employed the Generative Image-to-text Transformer (GIT) as our backbone for captioning and propose a new image reconstruction pipeline based on latent diffusion models. The method involves training regularized linear regression models between brain activity and extracted features. Additionally, we incorporated depth maps from the ControlNet model to further guide the reconstruction process. We evaluate our methods using quantitative metrics for both generated captions and images. Our brain captioning approach outperforms existing methods, while our image reconstruction pipeline generates plausible images with improved spatial relationships. In conclusion, we demonstrate significant progress in brain decoding, showcasing the enormous potential of integrating vision and language to better understand human cognition. Our approach provides a flexible platform for future research, with potential applications in various fields, including neural art, style transfer, and portable devices.
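The regularized linear mapping step can be pictured with a few lines of scikit-learn; the sketch below uses ridge regression on synthetic arrays standing in for fMRI voxel patterns and model features (the regularization strength is an assumed value, not the study's setting).

```python
# Ridge regression from voxel patterns to feature vectors (synthetic data).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_train = rng.standard_normal((800, 5000))   # trials x voxels
Y_train = rng.standard_normal((800, 768))    # trials x feature dimensions
X_test = rng.standard_normal((200, 5000))

reg = Ridge(alpha=1e4)                        # assumed regularization strength
reg.fit(X_train, Y_train)
Y_pred = reg.predict(X_test)                  # predicted features for the decoder/captioner
print(Y_pred.shape)                           # (200, 768)
```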
https://arxiv.org/abs/2305.11560
Robotic ultrasound (US) systems have shown great potential to make US examinations easier and more accurate. Recently, various machine learning techniques have been proposed to realize automatic US image interpretation for robotic US acquisition tasks. However, obtaining large amounts of real US imaging data for training is usually expensive or even unfeasible in some clinical applications. An alternative is to build a simulator to generate synthetic US data for training, but the differences between simulated and real US images may result in poor model performance. This work presents a Sim2Real framework to efficiently learn robotic US image analysis tasks based only on simulated data for real-world deployment. A style transfer module is proposed based on unsupervised contrastive learning and used as a preprocessing step to convert the real US images into the simulation style. Thereafter, a task-relevant model is designed to combine CNNs with vision transformers to generate the task-dependent prediction with improved generalization ability. We demonstrate the effectiveness of our method in an image regression task to predict the probe position based on US images in robotic transesophageal echocardiography (TEE). Our results show that using only simulated US data and a small amount of unlabelled real data for training, our method can achieve comparable performance to semi-supervised and fully supervised learning methods. Moreover, the effectiveness of our previously proposed CT-based US image simulation method is also indirectly confirmed.
https://arxiv.org/abs/2305.09169
Breast cancer early detection is crucial for improving patient outcomes. The Institut Català de la Salut (ICS) has launched the DigiPatICS project to develop and implement artificial intelligence algorithms to assist with the diagnosis of cancer. In this paper, we propose a new approach to the color normalization problem in HER2-stained histopathological images of breast cancer tissue, posed as a style transfer problem. We combine the Color Deconvolution technique with the Pix2Pix GAN network to present a novel approach to correct the color variations between different HER2 stain brands. Our approach focuses on maintaining the HER2 score of the cells in the transformed images, which is crucial for the HER2 analysis. Results demonstrate that our final model outperforms state-of-the-art image style transfer methods in maintaining the cell classes in the transformed images and is as effective as them in generating realistic images.
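For the color deconvolution step alone, scikit-image's built-in HED separation gives the idea; the sketch below is only an illustration of stain separation and omits the Pix2Pix GAN part of the pipeline.

```python
# Stain separation via color deconvolution (illustration only; not DigiPatICS code).
import numpy as np
from skimage.color import rgb2hed

rgb = np.random.rand(256, 256, 3)            # stand-in for a HER2-stained tile
hed = rgb2hed(rgb)                           # channels: Hematoxylin, Eosin, DAB
hematoxylin, dab = hed[..., 0], hed[..., 2]  # DAB carries the HER2 membrane stain
print(hematoxylin.shape, dab.shape)
```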
https://arxiv.org/abs/2305.07404
This research paper explores the application of style transfer in computer vision using RGB images and their corresponding depth maps. We propose a novel method that incorporates the depth map and a heatmap of the RGB image to generate more realistic style transfer results. We compare our method to the traditional neural style transfer approach and find that our method outperforms it in terms of producing more realistic color and style. The proposed method can be applied to various computer vision applications, such as image editing and virtual reality, to improve the realism of generated images. Overall, our findings demonstrate the potential of incorporating depth information and heatmaps of RGB images in style transfer for more realistic results.
https://arxiv.org/abs/2305.06565
Adapting a large language model for multiple-attribute text style transfer via fine-tuning can be challenging due to the significant amount of computational resources and labeled data required for the specific task. In this paper, we address this challenge by introducing Adapter-TST, a framework that freezes the pre-trained model's original parameters and enables the development of a multiple-attribute text style transfer model. Using BART as the backbone model, Adapter-TST utilizes different neural adapters to capture different attribute information, like plug-ins connected to BART. Our method allows control over multiple attributes, such as sentiment, tense, and voice, and configures the adapters' architecture to generate multiple outputs with respect to the attributes or to perform compositional editing on the same sentence. We evaluate the proposed model on both traditional sentiment transfer and multiple-attribute transfer tasks. The experiment results demonstrate that Adapter-TST outperforms all the state-of-the-art baselines with significantly fewer computational resources. We have also empirically shown that each adapter is able to capture specific stylistic attributes effectively and can be configured to perform compositional editing.
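A generic bottleneck adapter, sketched below, conveys the plug-in idea: a small trainable module with a residual connection inserted into a frozen backbone, one per stylistic attribute. Dimensions and placement are assumptions, not the Adapter-TST configuration.

```python
# Generic bottleneck adapter (assumed sizes), one instance per attribute.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, hidden):
        return hidden + self.up(self.act(self.down(hidden)))  # residual connection

# One adapter per attribute; the frozen backbone's hidden states pass through
# the adapter chosen for the attribute being controlled.
adapters = nn.ModuleDict({"sentiment": Adapter(), "tense": Adapter(), "voice": Adapter()})
hidden_states = torch.randn(2, 16, 768)       # (batch, sequence, d_model)
out = adapters["sentiment"](hidden_states)
```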
https://arxiv.org/abs/2305.05945
Large-scale text-to-video diffusion models have demonstrated an exceptional ability to synthesize diverse videos. However, due to the lack of extensive text-to-video datasets and the necessary computational resources for training, directly applying these models for video stylization remains difficult. Also, given that the noise addition process on the input content is random and destructive, fulfilling the style transfer task's content preservation criteria is challenging. This paper proposes a zero-shot video stylization method named Style-A-Video, which utilizes a generative pre-trained transformer with an image latent diffusion model to achieve a concise text-controlled video stylization. We improve the guidance condition in the denoising process, establishing a balance between artistic expression and structure preservation. Furthermore, to decrease inter-frame flicker and avoid the formation of additional artifacts, we employ a sampling optimization and a temporal consistency module. Extensive experiments show that we can attain superior content preservation and stylistic performance while consuming fewer resources than previous solutions. Code will be available at this https URL.
https://arxiv.org/abs/2305.05464
Automatic dubbing, which generates a corresponding version of the input speech in another language, could be widely utilized in many real-world scenarios such as video and game localization. In addition to synthesizing the translated scripts, automatic dubbing needs to further transfer the speaking style in the original language to the dubbed speech to give audiences the impression that the characters are speaking in their native tongue. However, state-of-the-art automatic dubbing systems only model the transfer of duration and speaking rate, neglecting other aspects of speaking style such as emotion, intonation, and emphasis, which are also crucial to fully portraying the characters and to speech understanding. In this paper, we propose a joint multi-scale cross-lingual speaking style transfer framework to simultaneously model the bidirectional speaking style transfer between languages at both global (i.e., utterance-level) and local (i.e., word-level) scales. The global and local speaking styles in each language are extracted and utilized to predict the global and local speaking styles in the other language, with an encoder-decoder framework for each direction and a shared bidirectional attention mechanism for both directions. A multi-scale speaking-style-enhanced FastSpeech 2 is then utilized to synthesize speech for each language from the predicted global and local speaking styles. Experiment results demonstrate the effectiveness of our proposed framework, which outperforms a baseline with only duration transfer in both objective and subjective evaluations.
https://arxiv.org/abs/2305.05203
Voice conversion (VC), as a voice style transfer technology, is becoming increasingly prevalent while raising serious concerns about its illegal use. Proactively tracing the origins of VC-generated speech, i.e., speaker traceability, can prevent the misuse of VC, but unfortunately has not been extensively studied. In this paper, we are the first to investigate speaker traceability for VC and propose a traceable VC framework named VoxTracer. Our VoxTracer is similar to but goes beyond the paradigm of audio watermarking. We first use a unique speaker embedding to represent speaker identity. Then we design a VAE-Glow structure, in which the hiding process imperceptibly integrates the source speaker identity into the VC, and the tracing process accurately recovers the source speaker identity and even the source speech in spite of severe speech quality degradation. To address the speech mismatch between the hiding and tracing processes caused by different distortions, we also adopt an asynchronous training strategy to optimize the VAE-Glow models. VoxTracer is versatile enough to be applied to arbitrary VC methods and popular audio coding standards. Extensive experiments demonstrate that VoxTracer achieves not only high imperceptibility in hiding, but also nearly 100% tracing accuracy against various types of lossy audio compression (AAC, MP3, Opus and SILK) across a broad range of bitrates (16 kbps - 128 kbps), even for very short durations (0.74 s). Our speech demo is available at https://anonymous.4open.science/w/DEMOofVoxTracer.
https://arxiv.org/abs/2305.05152
This perspective paper proposes a series of interactive scenarios that utilize Artificial Intelligence (AI) to enhance classroom teaching, such as dialogue auto-completion, knowledge and style transfer, and assessment of AI-generated content. By leveraging recent developments in Large Language Models (LLMs), we explore the potential of AI to augment and enrich teacher-student dialogues and improve the quality of teaching. Our goal is to produce innovative and meaningful conversations between teachers and students, create standards for evaluation, and improve the efficacy of AI-for-Education initiatives. In Section 3, we discuss the challenges of utilizing existing LLMs to effectively complete education tasks and present a unified framework for addressing diverse education datasets, processing lengthy conversations, and condensing information to better accomplish downstream tasks. In Section 4, we summarize the pivotal tasks, including Teacher-Student Dialogue Auto-Completion, Expert Teaching Knowledge and Style Transfer, and Assessment of AI-Generated Content (AIGC), providing a clear path for future research. In Section 5, we also explore the use of external and adjustable LLMs to improve the generated content through human-in-the-loop supervision and reinforcement learning. Ultimately, this paper seeks to highlight the potential for AI to aid the field of education and promote its further exploration.
https://arxiv.org/abs/2305.03433
We investigate the potential of ChatGPT as a multidimensional evaluator for the task of \emph{Text Style Transfer}, alongside, and in comparison to, existing automatic metrics as well as human judgments. We focus on a zero-shot setting, i.e., prompting ChatGPT with specific task instructions, and test its performance on three commonly used dimensions of text style transfer evaluation: style strength, content preservation, and fluency. We perform a comprehensive correlation analysis for two transfer directions (and overall) at different levels. Compared to existing automatic metrics, ChatGPT achieves competitive correlations with human judgments. These preliminary results are expected to provide a first glimpse into the role of large language models in the multidimensional evaluation of stylized text generation.
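The evaluation side reduces to prompting and correlation analysis; the sketch below shows an illustrative zero-shot style-strength prompt (the wording is not the paper's) and the Spearman/Pearson correlation of LLM ratings with human judgments on toy numbers.

```python
# Toy correlation analysis between LLM ratings and human judgments.
from scipy.stats import pearsonr, spearmanr

prompt = (
    "On a scale from 1 to 5, how strongly does the following rewrite exhibit "
    "an informal style? Answer with a single number.\n\nRewrite: {text}"
)

llm_scores = [4, 2, 5, 3, 1, 4, 2]      # parsed from model responses (toy values)
human_scores = [5, 2, 4, 3, 1, 4, 3]    # averaged annotator judgments (toy values)

rho, _ = spearmanr(llm_scores, human_scores)
r, _ = pearsonr(llm_scores, human_scores)
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```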
https://arxiv.org/abs/2304.13462
Inferring 3D object structures from a single image is an ill-posed task due to depth ambiguity and occlusion. Typical solutions in the literature include leveraging 2D or 3D ground truth for supervised learning, as well as imposing hand-crafted symmetry priors or using an implicit representation to hallucinate novel viewpoints for unsupervised methods. In this work, we propose a general adversarial learning framework for solving Unsupervised 2D to Explicit 3D Style Transfer (UE3DST). Specifically, we merge two architectures: the unsupervised explicit 3D reconstruction network of Wu et al. and the Generative Adversarial Network (GAN) named StarGAN-v2. We experiment across three facial datasets (Basel Face Model, 3DFAW and CelebA-HQ) and show that our solution is able to outperform well-established solutions such as DepthNet in 3D reconstruction and Pix2NeRF in conditional style transfer, while we also justify the individual contributions of our model components via ablation. In contrast to the aforementioned baselines, our scheme produces features for explicit 3D rendering, which can be manipulated and utilized in downstream tasks.
https://arxiv.org/abs/2304.12455