Recently, and under the umbrella of Responsible AI, efforts have been made to develop gender-ambiguous synthetic speech to represent with a single voice all individuals in the gender spectrum. However, research efforts have completely overlooked the speaking style despite differences found among binary and non-binary populations. In this work, we synthesise gender-ambiguous speech by combining the timbre of a male speaker with the manner of speech of a female speaker using voice morphing and pitch shifting towards the male-female boundary. Subjective evaluations indicate that the ambiguity of the morphed samples that convey the female speech style is higher than that of those that undergo pure pitch transformations, suggesting that the speaking style can be a contributing factor in creating gender-ambiguous speech. To our knowledge, this is the first study that explicitly uses the transfer of the speaking style to create gender-ambiguous voices.
https://arxiv.org/abs/2403.07661
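To make the pitch-shifting step above concrete, here is a minimal sketch that shifts a male recording's median F0 toward an assumed gender-ambiguous region around 170 Hz using librosa; the target frequency, the file name, and the omission of the voice-morphing step are illustrative assumptions rather than the authors' pipeline.

```python
import numpy as np
import librosa
import soundfile as sf

# Illustrative sketch: shift a (hypothetical) male recording's pitch toward
# an assumed gender-ambiguous region; the 170 Hz target is an assumption.
y, sr = librosa.load("male_speaker.wav", sr=None)

# Estimate F0 with pYIN and take the median over voiced frames.
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C5"), sr=sr
)
# Fall back to an assumed 120 Hz if no voiced frames were detected.
median_f0 = np.nanmedian(f0[voiced_flag]) if np.any(voiced_flag) else 120.0

target_f0 = 170.0  # assumed male-female boundary region, in Hz
n_steps = 12 * np.log2(target_f0 / median_f0)  # semitone shift toward the target

y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
sf.write("male_shifted.wav", y_shifted, sr)
```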
3D-consistent image generation from a single 2D semantic label is an important and challenging research topic in computer graphics and computer vision. Although some related works have made great progress in this field, most of the existing methods suffer from poor disentanglement performance of shape and appearance, and lack multi-modal control. In this paper, we propose a novel end-to-end 3D-aware image generation and editing model incorporating multiple types of conditional inputs, including pure noise, text and reference image. On the one hand, we dive into the latent space of 3D Generative Adversarial Networks (GANs) and propose a novel disentanglement strategy to separate appearance features from shape features during the generation process. On the other hand, we propose a unified framework for flexible image generation and editing tasks with multi-modal conditions. Our method can generate diverse images with distinct noises, edit the attribute through a text description and conduct style transfer by giving a reference RGB image. Extensive experiments demonstrate that the proposed method outperforms alternative approaches both qualitatively and quantitatively on image generation and editing.
https://arxiv.org/abs/2403.06470
While existing motion style transfer methods are effective between two motions with identical content, their performance significantly diminishes when transferring style between motions with different contents. This challenge lies in the lack of clear separation between content and style of a motion. To tackle this challenge, we propose a novel motion style transformer that effectively disentangles style from content and generates a plausible motion with transferred style from a source motion. Our distinctive approach to achieving the goal of disentanglement is twofold: (1) a new architecture for motion style transformer with 'part-attentive style modulator across body parts' and 'Siamese encoders that encode style and content features separately'; (2) style disentanglement loss. Our method outperforms existing methods and demonstrates exceptionally high quality, particularly in motion pairs with different contents, without the need for heuristic post-processing. Codes are available at this https URL.
https://arxiv.org/abs/2403.06225
Current 3D stylization methods often assume static scenes, which violates the dynamic nature of our real world. To address this limitation, we present S-DyRF, a reference-based spatio-temporal stylization method for dynamic neural radiance fields. However, stylizing dynamic 3D scenes is inherently challenging due to the limited availability of stylized reference images along the temporal axis. Our key insight lies in introducing additional temporal cues besides the provided reference. To this end, we generate temporal pseudo-references from the given stylized reference. These pseudo-references facilitate the propagation of style information from the reference to the entire dynamic 3D scene. For coarse style transfer, we enforce novel views and times to mimic the style details present in pseudo-references at the feature level. To preserve high-frequency details, we create a collection of stylized temporal pseudo-rays from temporal pseudo-references. These pseudo-rays serve as detailed and explicit stylization guidance for achieving fine style transfer. Experiments on both synthetic and real-world datasets demonstrate that our method yields plausible stylized results of space-time view synthesis on dynamic 3D scenes.
https://arxiv.org/abs/2403.06205
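The coarse transfer stage above makes novel views and times mimic the pseudo-references at the feature level; a common way to express such a feature-level style objective is a Gram-matrix loss, sketched below as a hedged stand-in (S-DyRF's exact loss may differ).

```python
import torch

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Channel-wise Gram matrix of a (B, C, H, W) feature map."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def feature_style_loss(rendered_feats, reference_feats):
    """Mean squared Gram-matrix distance across feature levels (a common
    feature-level style objective; not necessarily S-DyRF's formulation)."""
    loss = 0.0
    for fr, ff in zip(rendered_feats, reference_feats):
        loss = loss + torch.mean((gram_matrix(fr) - gram_matrix(ff)) ** 2)
    return loss

# Toy usage with random "features" from two layers.
rendered = [torch.rand(1, 64, 32, 32), torch.rand(1, 128, 16, 16)]
reference = [torch.rand(1, 64, 32, 32), torch.rand(1, 128, 16, 16)]
print(feature_style_loss(rendered, reference))
```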
Understanding how visual information is encoded in biological and artificial systems often requires vision scientists to generate appropriate stimuli to test specific hypotheses. Although deep neural network models have revolutionized the field of image generation with methods such as image style transfer, available methods for video generation are scarce. Here, we introduce the Spatiotemporal Style Transfer (STST) algorithm, a dynamic visual stimulus generation framework that allows powerful manipulation and synthesis of video stimuli for vision research. It is based on a two-stream deep neural network model that factorizes spatial and temporal features to generate dynamic visual stimuli whose model layer activations are matched to those of input videos. As an example, we show that our algorithm enables the generation of model metamers, dynamic stimuli whose layer activations within our two-stream model are matched to those of natural videos. We show that these generated stimuli match the low-level spatiotemporal features of their natural counterparts but lack their high-level semantic features, making them a powerful paradigm for studying object recognition. Late layer activations in deep vision models exhibited a lower similarity between natural and metameric stimuli compared to early layers, confirming the lack of high-level information in the generated stimuli. Finally, we use our generated stimuli to probe the representational capabilities of predictive coding deep networks. These results showcase potential applications of our algorithm as a versatile tool for dynamic stimulus generation in vision science.
https://arxiv.org/abs/2403.04940
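The metamer generation described above optimizes a stimulus so that its layer activations match those of a natural video. The sketch below shows that general activation-matching loop with a random toy 3D-conv encoder standing in for the two-stream STST model; the encoder, layer choice, and optimizer settings are assumptions.

```python
import torch
import torch.nn as nn

# Toy stand-in encoder; STST's actual two-stream model is far richer.
encoder = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
)
for p in encoder.parameters():
    p.requires_grad_(False)

natural = torch.rand(1, 3, 8, 64, 64)          # (B, C, T, H, W) toy "video"
with torch.no_grad():
    target_act = encoder(natural)              # activations to be matched

metamer = torch.rand_like(natural, requires_grad=True)
opt = torch.optim.Adam([metamer], lr=0.05)

for step in range(200):
    opt.zero_grad()
    loss = torch.mean((encoder(metamer) - target_act) ** 2)
    loss.backward()
    opt.step()
    metamer.data.clamp_(0.0, 1.0)              # keep a valid pixel range
```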
We study text-based image editing (TBIE) of a single image by counterfactual inference because it is an elegant formulation to precisely address the requirement: the edited image should retain the fidelity of the original one. Through the lens of the formulation, we find that the crux of TBIE is that existing techniques hardly achieve a good trade-off between editability and fidelity, mainly due to the overfitting of the single-image fine-tuning. To this end, we propose a Doubly Abductive Counterfactual inference framework (DAC). We first parameterize an exogenous variable as a UNet LoRA, whose abduction can encode all the image details. Second, we abduct another exogenous variable parameterized by a text encoder LoRA, which recovers the lost editability caused by the overfitted first abduction. Thanks to the second abduction, which exclusively encodes the visual transition from post-edit to pre-edit, its inversion -- subtracting the LoRA -- effectively reverts pre-edit back to post-edit, thereby accomplishing the edit. Through extensive experiments, our DAC achieves a good trade-off between editability and fidelity. Thus, we can support a wide spectrum of user editing intents, including addition, removal, manipulation, replacement, style transfer, and facial change, which are extensively validated in both qualitative and quantitative evaluations. Codes are in this https URL.
https://arxiv.org/abs/2403.02981
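The "subtracting the LoRA" inversion above can be illustrated on a single linear layer: the abducted edit is a low-rank weight update, and removing it restores the original weights. This is a toy sketch under those assumptions, not the DAC pipeline applied to a diffusion model's text encoder.

```python
import torch

d_in, d_out, rank, scale = 768, 768, 4, 1.0

W = torch.randn(d_out, d_in)            # frozen base weight
A = torch.randn(rank, d_in) * 0.01      # LoRA down-projection
B = torch.randn(d_out, rank) * 0.01     # LoRA up-projection
delta = scale * (B @ A)                 # low-rank update encoding the edit

W_edited = W + delta                    # apply the abducted edit
W_reverted = W_edited - delta           # "inversion": subtract the LoRA

print(torch.allclose(W_reverted, W))    # True: subtraction undoes the update
```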
Text Style Transfer (TST) seeks to alter the style of text while retaining its core content. Given the constraints of limited parallel datasets for TST, we propose CoTeX, a framework that leverages large language models (LLMs) alongside chain-of-thought (CoT) prompting to facilitate TST. CoTeX distills the complex rewriting and reasoning capabilities of LLMs into more streamlined models capable of working with both non-parallel and parallel data. Through experimentation across four TST datasets, CoTeX is shown to surpass traditional supervised fine-tuning and knowledge distillation methods, particularly in low-resource settings. We conduct a comprehensive evaluation, comparing CoTeX against current unsupervised, supervised, in-context learning (ICL) techniques, and instruction-tuned LLMs. Furthermore, CoTeX distinguishes itself by offering transparent explanations for its style transfer process.
https://arxiv.org/abs/2403.01106
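Since CoTeX distills chain-of-thought rewrites from an LLM, a prompt like the hypothetical one below could be used to elicit (rationale, rewrite) pairs for training the student model; the wording and the `call_llm` stub are assumptions, not the paper's prompts.

```python
def build_cot_prompt(sentence: str, source_style: str, target_style: str) -> str:
    """Assemble a hypothetical chain-of-thought prompt for style transfer."""
    return (
        f"Rewrite the sentence from {source_style} to {target_style} style.\n"
        f"Sentence: {sentence}\n"
        "First, list the words or phrases that carry the source style.\n"
        "Then explain, step by step, how to rephrase them in the target style\n"
        "while keeping the meaning unchanged.\n"
        "Finally, output the rewritten sentence on its own line prefixed with "
        "'Rewrite:'."
    )

prompt = build_cot_prompt("this place is a total dump.", "negative", "positive")
# response = call_llm(prompt)  # hypothetical LLM call; the (prompt, response)
# pair would then supervise a smaller student model.
print(prompt)
```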
This paper aims to address a common challenge in deep learning-based image transformation methods, such as image enhancement and super-resolution, which heavily rely on precisely aligned paired datasets with pixel-level alignments. However, creating precisely aligned paired images presents significant challenges and hinders the advancement of methods trained on such data. To overcome this challenge, this paper introduces a novel and simple Frequency Distribution Loss (FDL) for computing distribution distance within the frequency domain. Specifically, we transform image features into the frequency domain using the Discrete Fourier Transform (DFT). Subsequently, frequency components (amplitude and phase) are processed separately to form the FDL loss function. Our method is empirically proven effective as a training constraint due to the thoughtful utilization of global information in the frequency domain. Extensive experimental evaluations, focusing on image enhancement and super-resolution tasks, demonstrate that FDL outperforms existing misalignment-robust loss functions. Furthermore, we explore the potential of our FDL for image style transfer that relies solely on completely misaligned data. Our code is available at: this https URL
https://arxiv.org/abs/2402.18192
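To make the FDL construction above concrete: transform features with the DFT, split amplitude and phase, and compare their distributions. In the hedged sketch below, the sorted-value (1D Wasserstein-style) comparison stands in for the paper's actual distribution distance.

```python
import torch

def frequency_distribution_loss(pred_feat: torch.Tensor,
                                target_feat: torch.Tensor) -> torch.Tensor:
    """Sketch of a frequency-domain distribution loss on (B, C, H, W) features.

    The amplitude/phase split follows the abstract; comparing sorted values
    (a 1D Wasserstein-style distance) is an assumption, not the paper's exact
    formulation.
    """
    pred_fft = torch.fft.fft2(pred_feat)
    target_fft = torch.fft.fft2(target_feat)

    loss = 0.0
    for extract in (torch.abs, torch.angle):          # amplitude, then phase
        p = extract(pred_fft).flatten(1).sort(dim=1).values
        t = extract(target_fft).flatten(1).sort(dim=1).values
        loss = loss + torch.mean(torch.abs(p - t))
    return loss

# Toy usage on misaligned feature maps.
print(frequency_distribution_loss(torch.rand(2, 8, 32, 32),
                                  torch.rand(2, 8, 32, 32)))
```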
Recently, the contrastive learning paradigm has achieved remarkable success in high-level tasks such as classification, detection, and segmentation. However, contrastive learning applied in low-level tasks, like image restoration, is limited, and its effectiveness is uncertain. This raises a question: Why does the contrastive learning paradigm not yield satisfactory results in image restoration? In this paper, we conduct in-depth analyses and propose three guidelines to address the above question. In addition, inspired by style transfer and based on contrastive learning, we propose a novel module for image restoration called \textbf{ConStyle}, which can be efficiently integrated into any U-Net structure network. By leveraging the flexibility of ConStyle, we develop a \textbf{general restoration network} for image restoration. ConStyle and the general restoration network together form an image restoration framework, namely \textbf{IRConStyle}. To demonstrate the capability and compatibility of ConStyle, we replace the general restoration network with transformer-based, CNN-based, and MLP-based networks, respectively. We perform extensive experiments on various image restoration tasks, including denoising, deblurring, deraining, and dehazing. The results on 19 benchmarks demonstrate that ConStyle can be integrated with any U-Net-based network and significantly enhance performance. For instance, ConStyle NAFNet significantly outperforms the original NAFNet on the SOTS outdoor (dehazing) and Rain100H (deraining) datasets, with PSNR improvements of 4.16 dB and 3.58 dB, respectively, while using 85% fewer parameters.
https://arxiv.org/abs/2402.15784
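ConStyle builds on contrastive learning; a generic InfoNCE objective of the kind such a module could use is sketched below (this is standard contrastive-loss code, not ConStyle's exact formulation).

```python
import torch
import torch.nn.functional as F

def info_nce(query: torch.Tensor, positive: torch.Tensor,
             negatives: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Generic InfoNCE loss: pull query toward its positive, push from negatives.

    query/positive: (B, D); negatives: (B, K, D).
    """
    q = F.normalize(query, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)

    pos_logit = (q * p).sum(dim=-1, keepdim=True) / temperature        # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", q, n) / temperature        # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1)
    labels = torch.zeros(q.size(0), dtype=torch.long)                  # index 0 = positive
    return F.cross_entropy(logits, labels)

print(info_nce(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 16, 128)))
```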
With the rise of short video platforms represented by TikTok, the trend of users expressing their creativity through photos and videos has increased dramatically. However, ordinary users lack the professional skills to produce high-quality videos using professional creation software. To meet the demand for intelligent and user-friendly video creation tools, we propose the Dynamic Visual Composition (DVC) task, an interesting and challenging task that aims to automatically integrate various media elements based on user requirements and create storytelling videos. We propose an Intelligent Director framework, utilizing LENS to generate descriptions for images and video frames and combining ChatGPT to generate coherent captions while recommending appropriate music names. The best-matched music is then obtained through music retrieval, and materials such as captions, images, videos, and music are integrated to seamlessly synthesize the video. Finally, we apply AnimeGANv2 for style transfer. We construct the UCF101-DVC and Personal Album datasets and verify the effectiveness of our framework in solving DVC through qualitative and quantitative comparisons, along with user studies, demonstrating its substantial potential.
https://arxiv.org/abs/2402.15746
Counterfactual generation lies at the core of various machine learning tasks, including image translation and controllable text generation. This generation process usually requires the identification of the disentangled latent representations, such as content and style, that underlie the observed data. However, it becomes more challenging when faced with a scarcity of paired data and labeling information. Existing disentangled methods crucially rely on oversimplified assumptions, such as assuming independent content and style variables, to identify the latent variables, even though such assumptions may not hold for complex data distributions. For instance, food reviews tend to involve words like tasty, whereas movie reviews commonly contain words such as thrilling for the same positive sentiment. This problem is exacerbated when data are sampled from multiple domains since the dependence between content and style may vary significantly over domains. In this work, we tackle the domain-varying dependence between the content and the style variables inherent in the counterfactual generation task. We provide identification guarantees for such latent-variable models by leveraging the relative sparsity of the influences from different latent variables. Our theoretical insights enable the development of a doMain AdapTive counTerfactual gEneration model, called MATTE. Our theoretically grounded framework achieves state-of-the-art performance in unsupervised style transfer tasks, where neither paired data nor style labels are utilized, across four large-scale datasets. Code is available at this https URL
https://arxiv.org/abs/2402.15309
In recent years, machine learning, and in particular generative adversarial neural networks (GANs) and attention-based neural networks (transformers), have been successfully used to compose and generate music, both melodies and polyphonic pieces. Current research focuses foremost on style replication (e.g., generating a Bach-style chorale) or style transfer (e.g., classical to jazz) based on large amounts of recorded or transcribed music, which in turn also allows for fairly straightforward "performance" evaluation. However, most of these models are not suitable for human-machine co-creation through live interaction, nor is it clear how such models and the resulting creations would be evaluated. This article presents a thorough review of music representation, feature analysis, heuristic algorithms, statistical and parametric modelling, and human and automatic evaluation measures, along with a discussion of which approaches and models seem most suitable for live interaction.
https://arxiv.org/abs/2402.15294
With the development of diffusion models, text-guided image style transfer has demonstrated high-quality controllable synthesis results. However, the utilization of text for diverse music style transfer poses significant challenges, primarily due to the limited availability of matched audio-text datasets. Music, being an abstract and complex art form, exhibits variations and intricacies even within the same genre, thereby making accurate textual descriptions challenging. This paper presents a music style transfer approach that effectively captures musical attributes using minimal data. We introduce a novel time-varying textual inversion module to precisely capture mel-spectrogram features at different levels. During inference, we propose a bias-reduced stylization technique to obtain stable results. Experimental results demonstrate that our method can transfer the style of specific instruments, as well as incorporate natural sounds to compose melodies. Samples and source code are available at this https URL.
https://arxiv.org/abs/2402.13763
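One way to picture the time-varying textual inversion module above is a learnable token whose embedding changes over the clip; the bucketed, linearly interpolated toy below is purely an assumption about what "time-varying" could look like, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TimeVaryingToken(nn.Module):
    """Toy time-varying textual-inversion token: one learnable embedding per
    time bucket, linearly interpolated over the normalized clip position.
    (An illustrative assumption, not the paper's exact module.)"""

    def __init__(self, n_buckets: int = 8, dim: int = 768):
        super().__init__()
        self.embeddings = nn.Parameter(torch.randn(n_buckets, dim) * 0.02)

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        """t in [0, 1]: normalized time positions, shape (N,)."""
        pos = t * (self.embeddings.size(0) - 1)
        lo = pos.floor().long().clamp(max=self.embeddings.size(0) - 1)
        hi = (lo + 1).clamp(max=self.embeddings.size(0) - 1)
        frac = (pos - lo.float()).unsqueeze(-1)
        return (1 - frac) * self.embeddings[lo] + frac * self.embeddings[hi]

token = TimeVaryingToken()
print(token(torch.tensor([0.0, 0.5, 1.0])).shape)   # torch.Size([3, 768])
```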
Unsupervised Text Style Transfer (UTST) has emerged as a critical task within the domain of Natural Language Processing (NLP), aiming to transfer one stylistic aspect of a sentence into another style without changing its semantics, syntax, or other attributes. This task is especially challenging given the intrinsic lack of parallel text pairings. Among existing methods for UTST tasks, the attention masking approach and Large Language Models (LLMs) are regarded as two pioneering methods. However, they fall short by generating unsmooth sentences and by altering the original content, respectively. In this paper, we investigate whether these two methods can be combined effectively. We propose four ways of interaction: a pipeline framework with tuned orders; knowledge distillation from LLMs to the attention masking model; and in-context learning with constructed parallel examples. We empirically show that these multi-way interactions can improve the baselines in certain aspects of style strength, content preservation, and text fluency. Experiments also demonstrate that simply conducting prompting followed by attention masking-based revision can consistently surpass the other systems, including supervised text style transfer systems. On the Yelp-clean and Amazon-clean datasets, it improves the previously best mean metric by 0.5 and 3.0 absolute percentage points, respectively, and achieves new SOTA results.
https://arxiv.org/abs/2402.13647
The rapid advancement of diffusion models (DMs) has not only transformed various real-world industries but has also introduced negative societal concerns, including the generation of harmful content, copyright disputes, and the rise of stereotypes and biases. To mitigate these issues, machine unlearning (MU) has emerged as a potential solution, demonstrating its ability to remove undesired generative capabilities of DMs in various applications. However, by examining existing MU evaluation methods, we uncover several key challenges that can result in incomplete, inaccurate, or biased evaluations for MU in DMs. To address them, we enhance the evaluation metrics for MU, including the introduction of an often-overlooked retainability measurement for DMs post-unlearning. Additionally, we introduce UnlearnCanvas, a comprehensive high-resolution stylized image dataset that enables us to evaluate the unlearning of artistic painting styles in conjunction with associated image objects. We show that this dataset plays a pivotal role in establishing a standardized and automated evaluation framework for MU techniques on DMs, featuring 7 quantitative metrics to address various aspects of unlearning effectiveness. Through extensive experiments, we benchmark 5 state-of-the-art MU methods, revealing novel insights into their pros and cons, and the underlying unlearning mechanisms. Furthermore, we demonstrate the potential of UnlearnCanvas to benchmark other generative modeling tasks, such as style transfer. The UnlearnCanvas dataset, benchmark, and the codes to reproduce all the results in this work can be found at this https URL.
https://arxiv.org/abs/2402.11846
RL-based techniques can be used to search for prompts that, when fed into a target language model, maximize a set of user-specified reward functions. However, in many target applications, the natural reward functions are in tension with one another -- for example, content preservation vs. style matching in style transfer tasks. Current techniques focus on maximizing the average of reward functions, which does not necessarily lead to prompts that achieve balance across rewards -- an issue that has been well-studied in the multi-objective and robust optimization literature. In this paper, we adapt several techniques for multi-objective optimization to RL-based discrete prompt optimization -- two that consider the volume of the Pareto reward surface, and another that chooses an update direction that benefits all rewards simultaneously. We conduct an empirical analysis of these methods on two NLP tasks: style transfer and machine translation, each using three competing reward functions. Our experiments demonstrate that multi-objective methods that directly optimize the volume perform better and achieve a better balance of all rewards than those that attempt to find monotonic update directions.
https://arxiv.org/abs/2402.11711
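The volume-based methods above maximize the region of reward space dominated by the Pareto front. For two rewards (e.g., style strength vs. content preservation) that dominated area can be computed with a simple sweep, sketched below; the estimator used in the paper may differ.

```python
import numpy as np

def hypervolume_2d(points: np.ndarray, reference: np.ndarray) -> float:
    """Area dominated by a set of 2D reward vectors above a reference point
    (maximization). Illustrative; the paper's exact estimator may differ."""
    pts = points[np.all(points > reference, axis=1)]
    if len(pts) == 0:
        return 0.0
    # Sort by the first reward descending and sweep over the second.
    pts = pts[np.argsort(-pts[:, 0])]
    volume, best_y = 0.0, reference[1]
    for x, y in pts:
        if y > best_y:
            volume += (x - reference[0]) * (y - best_y)
            best_y = y
    return volume

rewards = np.array([[0.9, 0.2], [0.6, 0.6], [0.3, 0.8]])  # e.g. style vs. content
print(hypervolume_2d(rewards, reference=np.array([0.0, 0.0])))  # 0.48
```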
This paper focuses on text detoxification, i.e., automatically converting toxic text into non-toxic text. This task contributes to safer and more respectful online communication and can be considered a Text Style Transfer (TST) task, where the text style changes while its content is preserved. We present three approaches: knowledge transfer from a similar task; a multi-task learning approach combining sequence-to-sequence modeling with various toxicity classification tasks; and a delete-and-reconstruct approach. To support our research, we utilize a dataset provided by Dementieva et al. (2021), which contains multiple versions of detoxified texts corresponding to toxic texts. In our experiments, we selected the best variants through expert human annotators, creating a dataset where each toxic sentence is paired with a single, appropriate detoxified version. Additionally, we introduced a small Hindi parallel dataset, aligned with a part of the English dataset, suitable for evaluation purposes. Our results demonstrate that our approach effectively balances text detoxification while preserving the actual content and maintaining fluency.
https://arxiv.org/abs/2402.07767
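The delete-and-reconstruct approach above first removes toxic spans and then regenerates the sentence. The toy below shows only the delete step with a tiny illustrative lexicon; the lexicon, the mask token, and the stubbed reconstruction model are assumptions.

```python
import re

TOXIC_LEXICON = {"idiot", "stupid", "trash"}   # tiny illustrative lexicon

def delete_toxic(text: str, mask_token: str = "[MASK]") -> str:
    """Delete step: replace lexicon matches with a mask token."""
    tokens = re.findall(r"\w+|\S", text)
    masked = [mask_token if t.lower() in TOXIC_LEXICON else t for t in tokens]
    return " ".join(masked)

masked = delete_toxic("Only an idiot would post this stupid take.")
print(masked)   # "Only an [MASK] would post this [MASK] take ."
# reconstructed = seq2seq_model.fill(masked)  # hypothetical reconstruct step
```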
In MRI studies, the aggregation of imaging data from multiple acquisition sites enhances sample size but may introduce site-related variabilities that hinder consistency in subsequent analyses. Deep learning methods for image translation have emerged as a solution for harmonizing MR images across sites. In this study, we introduce IGUANe (Image Generation with Unified Adversarial Networks), an original 3D model that leverages the strengths of domain translation and straightforward application of style transfer methods for multicenter brain MR image harmonization. IGUANe extends CycleGAN architecture by integrating an arbitrary number of domains for training through a many-to-one strategy. During inference, the model can be applied to any image, even from an unknown acquisition site, making it a universal generator for harmonization. Trained on a dataset comprising T1-weighted images from 11 different scanners, IGUANe was evaluated on data from unseen sites. The assessments included the transformation of MR images with traveling subjects, the preservation of pairwise distances between MR images within domains, the evolution of volumetric patterns related to age and Alzheimer's disease (AD), and the performance in age regression and patient classification tasks. Comparisons with other harmonization and normalization methods suggest that IGUANe better preserves individual information in MR images and is more suitable for maintaining and reinforcing variabilities related to age and AD. Future studies may further assess IGUANe in other multicenter contexts, either using the same model or retraining it for applications to different image modalities.
https://arxiv.org/abs/2402.03227
Face re-aging is a prominent field in computer vision and graphics, with significant applications in photorealistic domains such as movies, advertising, and live streaming. Recently, the need to apply face re-aging to non-photorealistic (NPR) images, like comics, illustrations, and animations, has emerged as an extension in various entertainment sectors. However, the absence of a network capable of seamlessly editing the apparent age on NPR images means that these tasks have been confined to a naive approach, applying each task sequentially. This often results in unpleasant artifacts and a loss of facial attributes due to domain discrepancies. In this paper, we introduce a novel one-stage method for face re-aging combined with portrait style transfer, executed in a single generative step. We leverage existing face re-aging and style transfer networks, both trained within the same PR domain. Our method uniquely fuses distinct latent vectors, each responsible for managing aging-related attributes and NPR appearance. Adopting an exemplar-based approach, our method offers greater flexibility than domain-level fine-tuning approaches, which typically require separate training or fine-tuning for each domain. This effectively addresses the limitation of requiring paired datasets for re-aging and domain-level, data-driven approaches for stylization. Our experiments show that our model can effortlessly generate re-aged images while simultaneously transferring the style of examples, maintaining both natural appearance and controllability.
https://arxiv.org/abs/2402.02733
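The latent fusion above combines one vector that manages aging-related attributes with one that manages NPR appearance. A toy per-layer mix in a StyleGAN-like W+ space is sketched below; the layer split, dimensions, and mixing rule are assumptions, not the paper's fusion scheme.

```python
import torch

n_layers, dim = 18, 512                       # StyleGAN-like W+ shape (assumed)
w_age = torch.randn(n_layers, dim)            # latent carrying age-related edits
w_style = torch.randn(n_layers, dim)          # latent carrying NPR appearance

# Assumed split: coarse layers keep structure/age, fine layers take appearance.
mix = torch.zeros(n_layers, 1)
mix[8:] = 1.0                                 # later layers come from w_style

w_fused = (1 - mix) * w_age + mix * w_style   # fused code fed to the generator
print(w_fused.shape)                          # torch.Size([18, 512])
```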
Most of the existing works on arbitrary 3D NeRF style transfer require retraining on each single style condition. This work aims to achieve zero-shot controlled stylization in 3D scenes utilizing text or visual input as conditioning factors. We introduce ConRF, a novel method of zero-shot stylization. Specifically, due to the ambiguity of CLIP features, we employ a conversion process that maps the CLIP feature space to the style space of a pre-trained VGG network and then refine the CLIP multi-modal knowledge into a style transfer neural radiance field. Additionally, we use a 3D volumetric representation to perform local style transfer. By combining these operations, ConRF offers the capability to utilize either text or images as references, resulting in the generation of sequences with novel views enhanced by global or local stylization. Our experiment demonstrates that ConRF outperforms other existing methods for 3D scene and single-text stylization in terms of visual quality.
https://arxiv.org/abs/2402.01950
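The CLIP-to-VGG conversion above maps CLIP features into the style space of a pre-trained VGG network; the small mapper below sketches one plausible form, predicting per-channel mean/std style statistics for a single VGG layer. Dimensions, architecture, and the choice of statistics are assumptions, not ConRF's actual mapper.

```python
import torch
import torch.nn as nn

class ClipToVggStyleMapper(nn.Module):
    """Toy mapper from a CLIP embedding to VGG-style statistics (per-channel
    mean and std for one VGG layer). Dimensions/architecture are assumptions."""

    def __init__(self, clip_dim: int = 512, vgg_channels: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2 * vgg_channels),   # [mean | std] per channel
        )
        self.vgg_channels = vgg_channels

    def forward(self, clip_embedding: torch.Tensor):
        out = self.net(clip_embedding)
        mean, std = out.split(self.vgg_channels, dim=-1)
        return mean, torch.nn.functional.softplus(std)  # keep std positive

mapper = ClipToVggStyleMapper()
mean, std = mapper(torch.randn(1, 512))        # e.g. a CLIP text/image embedding
print(mean.shape, std.shape)                   # torch.Size([1, 256]) each
```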