We demonstrate that discriminative models inherently contain powerful generative capabilities, challenging the fundamental distinction between discriminative and generative architectures. Our method, Direct Ascent Synthesis (DAS), reveals these latent capabilities through multi-resolution optimization of CLIP model representations. While traditional inversion attempts produce adversarial patterns, DAS achieves high-quality image synthesis by decomposing optimization across multiple spatial scales (1x1 to 224x224), requiring no additional training. This approach not only enables diverse applications -- from text-to-image generation to style transfer -- but maintains natural image statistics ($1/f^2$ spectrum) and guides the generation away from non-robust adversarial patterns. Our results demonstrate that standard discriminative models encode substantially richer generative knowledge than previously recognized, providing new perspectives on model interpretability and the relationship between adversarial examples and natural image synthesis.
我们展示了判别模型本质上含有强大的生成能力,这挑战了判别架构和生成架构之间的基本区别。我们的方法,直接上升合成(Direct Ascent Synthesis, DAS),通过在CLIP模型表示上的多分辨率优化揭示这些潜在的能力。虽然传统的逆向尝试会产生对抗性模式,但DAS通过将优化分解到多个空间尺度(从1x1到224x224)实现了高质量的图像合成,并且无需额外训练。这种方法不仅能够支持多种应用——包括文本到图像生成和风格迁移——还能保持自然图像统计特性($1/f^2$频谱),并指导生成远离非鲁棒性的对抗性模式。我们的结果表明,标准判别模型编码了比之前认识到的更为丰富的生成知识,这为模型可解释性和对抗样本与自然图像合成之间的关系提供了新的视角。
https://arxiv.org/abs/2502.07753
Given a style-reference image as the additional image condition, text-to-image diffusion models have demonstrated impressive capabilities in generating images that possess the content of text prompts while adopting the visual style of the reference image. However, current state-of-the-art methods often struggle to disentangle content and style from style-reference images, leading to issues such as content leakages. To address this issue, we propose a masking-based method that efficiently decouples content from style without the need of tuning any model parameters. By simply masking specific elements in the style reference's image features, we uncover a critical yet under-explored principle: guiding with appropriately-selected fewer conditions (e.g., dropping several image feature elements) can efficiently avoid unwanted content flowing into the diffusion models, enhancing the style transfer performances of text-to-image diffusion models. In this paper, we validate this finding both theoretically and experimentally. Extensive experiments across various styles demonstrate the effectiveness of our masking-based method and support our theoretical results.
在给定风格参考图像作为附加条件的情况下,文本到图像的扩散模型展示出了生成同时具备文本提示内容并采用参考图像视觉风格的图片的强大能力。然而,目前最先进的方法常常难以从风格参考图中分离出内容和风格,导致诸如内容泄露等问题的发生。为解决这一问题,我们提出了一种基于掩码的方法,能够在不调整任何模型参数的情况下有效分离内容与风格。通过简单地在风格参考图像特征的特定元素上进行掩码处理,我们揭示了一个重要但尚未充分探索的原则:使用适当选择的较少条件(例如,删除几个图像特征元素)可以有效地避免不必要的内容流入扩散模型,从而提升文本到图像扩散模型的风格迁移性能。本文中,我们在理论和实验两方面验证了这一发现的有效性。跨多种风格进行的大量实验展示了我们基于掩码的方法的有效性,并支持我们的理论结果。
https://arxiv.org/abs/2502.07466
This paper introduces PFlow-VC, a conditional flow matching voice conversion model that leverages fine-grained discrete pitch tokens and target speaker prompt information for expressive voice conversion (VC). Previous VC works primarily focus on speaker conversion, with further exploration needed in enhancing expressiveness (such as prosody and emotion) for timbre conversion. Unlike previous methods, we adopt a simple and efficient approach to enhance the style expressiveness of voice conversion models. Specifically, we pretrain a self-supervised pitch VQVAE model to discretize speaker-irrelevant pitch information and leverage a masked pitch-conditioned flow matching model for Mel-spectrogram synthesis, which provides in-context pitch modeling capabilities for the speaker conversion model, effectively improving the voice style transfer capacity. Additionally, we improve timbre similarity by combining global timbre embeddings with time-varying timbre tokens. Experiments on unseen LibriTTS test-clean and emotional speech dataset ESD show the superiority of the PFlow-VC model in both timbre conversion and style transfer. Audio samples are available on the demo page this https URL.
这篇论文介绍了PFlow-VC,这是一种条件流匹配声音转换模型,利用细粒度的离散音高令牌和目标说话人的提示信息来进行富有表现力的声音转换(Voice Conversion, VC)。之前的声音转换工作主要集中在说话人转换上,在提升音色转换中的表达性方面(如韵律和情感)还有待进一步探索。与以往的方法不同,我们采用了一种简单且高效的方式来增强声音转换模型的风格表达能力。具体来说,我们预先训练了一个自我监督的音高VQVAE模型来离散化与说话人无关的音高信息,并利用一个遮蔽音高标准匹配流合成模型进行梅尔频谱图生成,这为说话人转换模型提供了上下文中的音高建模能力,有效地提升了语音风格传输的能力。此外,我们通过结合全局音色嵌入与时变音色调料来改进音色相似度。 在未见过的LibriTTS测试集(test-clean)和情感语音数据集ESD上的实验表明,PFlow-VC模型在音色转换和风格迁移方面均表现出优越性。音频样本可在该演示页面上找到:[此URL](请将方括号中的文本替换为实际提供的链接)。
https://arxiv.org/abs/2502.05471
Artistic style transfer aims to use a style image and a content image to synthesize a target image that retains the same artistic expression as the style image while preserving the basic content of the content image. Many recently proposed style transfer methods have a common problem; that is, they simply transfer the texture and color of the style image to the global structure of the content image. As a result, the content image has a local structure that is not similar to the local structure of the style image. In this paper, we present an effective method that can be used to transfer style patterns while fusing the local style structure into the local content structure. In our method, dif-ferent levels of coarse stylized features are first reconstructed at low resolution using a Coarse Network, in which style color distribution is roughly transferred, and the content structure is combined with the style structure. Then, the reconstructed features and the content features are adopted to synthesize high-quality structure-aware stylized images with high resolution using a Fine Network with three structural selective fusion (SSF) modules. The effectiveness of our method is demonstrated through the generation of appealing high-quality stylization results and a com-parison with some state-of-the-art style transfer methods.
艺术风格转换的目标是利用一幅风格图像和一幅内容图像来合成目标图像,使其保留与风格图像相同的艺术表达的同时保持内容图像的基本内容不变。然而,许多最近提出的风格转移方法存在一个共同的问题:它们只是简单地将风格图像的纹理和颜色转移到内容图像的整体结构上。因此,生成的内容图像在局部结构上往往并不类似风格图像的局部结构。 在这篇论文中,我们提出了一种有效的方法,可以用于在融合局部风格结构到局部内容结构的同时进行样式图案转移。我们的方法首先使用一个粗略网络(Coarse Network)在低分辨率下重构不同级别的粗略风格化特征,在此过程中大致传输了风格颜色分布,并结合了内容与风格的结构。然后,通过一个配备有三个结构选择性融合(SSF)模块的精细网络(Fine Network),利用重建后的特征和内容特征来生成高分辨率且具有高度感知结构意识的风格化图像。 我们的方法的有效性通过生成吸引人的高质量风格化结果以及与其他最新风格转移方法进行比较得到了证明。
https://arxiv.org/abs/2502.05387
Text Style Transfer (TST) is the task of transforming a text to reflect a particular style while preserving its original content. Evaluating TST outputs is a multidimensional challenge, requiring the assessment of style transfer accuracy, content preservation, and naturalness. Using human evaluation is ideal but costly, same as in other natural language processing (NLP) tasks, however, automatic metrics for TST have not received as much attention as metrics for, e.g., machine translation or summarization. In this paper, we examine both set of existing and novel metrics from broader NLP tasks for TST evaluation, focusing on two popular subtasks-sentiment transfer and detoxification-in a multilingual context comprising English, Hindi, and Bengali. By conducting meta-evaluation through correlation with human judgments, we demonstrate the effectiveness of these metrics when used individually and in ensembles. Additionally, we investigate the potential of Large Language Models (LLMs) as tools for TST evaluation. Our findings highlight that certain advanced NLP metrics and experimental-hybrid-techniques, provide better insights than existing TST metrics for delivering more accurate, consistent, and reproducible TST evaluations.
文本风格转换(TST)的任务是将一段文本转化为具有特定风格的版本,同时保留其原始内容。评估TST输出是一个多维度挑战,需要对风格转化准确性、内容保存和自然度进行评测。虽然人工评价是最理想的但成本高昂,与其他自然语言处理(NLP)任务类似,自动化的TST评估指标并未像机器翻译或摘要生成等领域那样受到足够重视。在本文中,我们考察了现有以及新提出的广泛应用于其他NLP任务的评估标准,并集中于两个流行的子任务:情感转换和去毒化,在包含英语、印地语和孟加拉语的多语言环境中进行研究。通过与人工判断的相关性测试作为元评价方法,我们展示了这些指标在单独使用及组合使用时的有效性。此外,我们也探讨了大型语言模型(LLMs)作为TST评估工具的潜力。我们的研究表明,某些先进的NLP评估标准和实验混合技术比现有的TST评估标准能提供更准确、一致且可重复的结果。
https://arxiv.org/abs/2502.04718
Style transfer is adopted to synthesize appealing stylized images that preserve the structure of a content image but carry the pattern of a style image. Many recently proposed style transfer methods use only western oil paintings as style images to achieve image stylization. As a result, unnatural messy artistic effects are produced in stylized images when using these methods to directly transfer the patterns of traditional Chinese paintings, which are composed of plain colors and abstract objects. Moreover, most of them work only at the original image scale and thus ignore multiscale image information during training. In this paper, we present a novel effective multiscale style transfer method based on Laplacian pyramid decomposition and reconstruction, which can transfer unique patterns of Chinese paintings by learning different image features at different scales. In the first stage, the holistic patterns are transferred at low resolution by adopting a Style Transfer Base Network. Then, the details of the content and style are gradually enhanced at higher resolutions by a Detail Enhancement Network with an edge information selection (EIS) module in the second stage. The effectiveness of our method is demonstrated through the generation of appealing high-quality stylization results and a comparison with some state-of-the-art style transfer methods. Datasets and codes are available at this https URL.
风格转换技术被用来合成具有吸引力的、保留内容图像结构但携带样式图像图案的艺术化图片。近年来,许多提出的风格转换方法仅使用西方油画作为样式图像来实现图像艺术化处理。因此,在直接应用这些方法转移中国传统绘画(由单纯颜色和抽象物体组成)中的模式时,会产生不自然且杂乱的艺术效果。此外,大多数现有方法只在原始图像尺度上工作,因而忽略了训练过程中的多尺度图像信息。 本文提出了一种基于拉普拉斯金字塔分解与重构的全新有效的多尺度风格转换方法,能够通过学习不同尺度下的不同图像特征来转移中国绘画的独特模式。首先,在第一阶段使用风格传输基础网络以低分辨率进行整体图案的传递;然后在第二阶段中,细节增强网络利用边缘信息选择(EIS)模块在更高分辨率下逐步强化内容和样式的细节。 我们方法的有效性通过生成高质量且具有吸引力的艺术化结果以及与一些最先进的风格转换方法对比来证明。数据集和代码可在以下网址获取:[此处提供实际链接]。
https://arxiv.org/abs/2502.04597
Deep learning has enabled remarkable advances in style transfer across various domains, offering new possibilities for creative content generation. However, in the realm of symbolic music, generating controllable and expressive performance-level style transfers for complete musical works remains challenging due to limited datasets, especially for genres such as jazz, and the lack of unified models that can handle multiple music generation tasks. This paper presents ImprovNet, a transformer-based architecture that generates expressive and controllable musical improvisations through a self-supervised corruption-refinement training strategy. ImprovNet unifies multiple capabilities within a single model: it can perform cross-genre and intra-genre improvisations, harmonize melodies with genre-specific styles, and execute short prompt continuation and infilling tasks. The model's iterative generation framework allows users to control the degree of style transfer and structural similarity to the original composition. Objective and subjective evaluations demonstrate ImprovNet's effectiveness in generating musically coherent improvisations while maintaining structural relationships with the original pieces. The model outperforms Anticipatory Music Transformer in short continuation and infilling tasks and successfully achieves recognizable genre conversion, with 79\% of participants correctly identifying jazz-style improvisations. Our code and demo page can be found at this https URL.
深度学习已经在多个领域内实现了风格转换的显著进展,为创意内容生成提供了新的可能性。然而,在象征性音乐(如爵士乐等)中,由于数据集有限和缺乏能够处理多种音乐生成任务的统一模型,完整作品的可控且富有表现力的演奏级风格转移仍是一个挑战。本文介绍了ImprovNet,这是一种基于Transformer架构的方法,通过自我监督的破坏-精炼训练策略生成表达性和可控制性的音乐即兴创作。ImprovNet将多项能力整合在一个单一模型中:它可以进行跨流派和同一流派内的即兴演奏、以特定风格和声旋律,并执行短提示延续和填空任务。该模型的迭代生成框架允许用户控制风格转移的程度以及与原始作品结构相似性的程度。客观和主观评估表明,ImprovNet在产生音乐连贯的即兴表演的同时,能够保持与原作结构的关系方面是有效的。在短续写和填空任务中,ImprovNet优于预测性音乐Transformer,并成功实现了可识别的流派转换,有79%的参与者正确辨识了爵士风格的即兴演奏。我们的代码和演示页面可在[此处](https://example.com)找到(注:实际链接请参见原文)。
https://arxiv.org/abs/2502.04522
This book begins with a detailed introduction to the fundamental principles and historical development of GANs, contrasting them with traditional generative models and elucidating the core adversarial mechanisms through illustrative Python examples. The text systematically addresses the mathematical and theoretical underpinnings including probability theory, statistics, and game theory providing a solid framework for understanding the objectives, loss functions, and optimisation challenges inherent to GAN training. Subsequent chapters review classic variants such as Conditional GANs, DCGANs, InfoGAN, and LAPGAN before progressing to advanced training methodologies like Wasserstein GANs, GANs with gradient penalty, least squares GANs, and spectral normalisation techniques. The book further examines architectural enhancements and task-specific adaptations in generators and discriminators, showcasing practical implementations in high resolution image generation, artistic style transfer, video synthesis, text to image generation and other multimedia applications. The concluding sections offer insights into emerging research trends, including self-attention mechanisms, transformer-based generative models, and a comparative analysis with diffusion models, thus charting promising directions for future developments in both academic and applied settings.
这本书以对生成对抗网络(GANs)的基本原理和历史发展进行了详细的介绍开始,通过与传统的生成模型对比,并利用说明性的Python示例阐述了核心的对抗机制。文本系统地探讨了数学和理论基础,包括概率论、统计学和博弈论,为理解GAN训练的目标、损失函数以及优化挑战提供了一个坚实的基础框架。 随后的章节回顾了经典的变体,如条件生成对抗网络(Conditional GANs)、深度卷积生成对抗网络(DCGANs)、信息GAN(InfoGAN)和拉普拉斯生成对抗网络(LAPGAN),然后逐步过渡到更先进的训练方法,例如瓦瑟斯坦生成对抗网络(Wasserstein GANs)、具有梯度惩罚的GAN、最小二乘GAN以及谱归一化技术。本书进一步探讨了在生成器和判别器架构上的改进措施及特定任务适应性,展示了在高分辨率图像生成、艺术风格迁移、视频合成、文本到图像生成和其他多媒体应用中的实用实现。 最后的部分提供了关于新兴研究趋势的见解,包括自我注意机制(self-attention mechanisms)、基于变换器的生成模型以及与扩散模型的比较分析,从而勾勒出未来学术和实际应用场景中GAN发展的有前景方向。
https://arxiv.org/abs/2502.04116
Attention-based arbitrary style transfer methods have gained significant attention recently due to their impressive ability to synthesize style details. However, the point-wise matching within the attention mechanism may overly focus on local patterns such that neglect the remarkable global features of style images. Additionally, when processing large images, the quadratic complexity of the attention mechanism will bring high computational load. To alleviate above problems, we propose Holistic Style Injector (HSI), a novel attention-style transformation module to deliver artistic expression of target style. Specifically, HSI performs stylization only based on global style representation that is more in line with the characteristics of style transfer, to avoid generating local disharmonious patterns in stylized images. Moreover, we propose a dual relation learning mechanism inside the HSI to dynamically render images by leveraging semantic similarity in content and style, ensuring the stylized images preserve the original content and improve style fidelity. Note that the proposed HSI achieves linear computational complexity because it establishes feature mapping through element-wise multiplication rather than matrix multiplication. Qualitative and quantitative results demonstrate that our method outperforms state-of-the-art approaches in both effectiveness and efficiency.
基于注意力机制的任意风格迁移方法由于其合成样式细节的能力而最近备受关注。然而,注意力机制中的逐点匹配可能会过分注重局部模式,从而忽视了样图中显著的整体特征。此外,在处理大图像时,注意力机制的二次复杂性会带来很高的计算负载。为了解决上述问题,我们提出了一个新颖的关注式风格转换模块——整体风格注入器(Holistic Style Injector, HSI),用于传递目标样式的表现力。具体而言,HSI仅基于全局样式的表示进行风格化,这更符合风格迁移的特性,并且可以避免生成图像中不和谐的地方模式。此外,在HSI内部我们提出了一个双关系学习机制,通过利用内容和样图之间的语义相似性来动态渲染图像,确保样式化的图像保留原始的内容并提高样式的一致性。 值得注意的是,由于HSI通过逐元素相乘而非矩阵相乘建立特征映射,其计算复杂度为线性。定性和定量的结果均表明我们提出的方法在有效性和效率上都优于现有的方法。
https://arxiv.org/abs/2502.04369
Despite the fact that large language models (LLMs) show exceptional skill in instruction following tasks, this strength can turn into a vulnerability when the models are required to disregard certain instructions. Instruction-following tasks typically involve a clear task description and input text containing the target data to be processed. However, when the input itself resembles an instruction, confusion may arise, even if there is explicit prompting to distinguish between the task instruction and the input. We refer to this phenomenon as instructional distraction. In this paper, we introduce a novel benchmark, named DIM-Bench, specifically designed to assess LLMs' performance under instructional distraction. The benchmark categorizes real-world instances of instructional distraction and evaluates LLMs across four instruction tasks: rewriting, proofreading, translation, and style transfer -- alongside five input tasks: reasoning, code generation, mathematical reasoning, bias detection, and question answering. Our experimental results reveal that even the most advanced LLMs are susceptible to instructional distraction, often failing to accurately follow user intent in such cases.
尽管大型语言模型(LLM)在遵循指令的任务中表现出色,但当这些模型需要忽略某些指令时,这种优势可能会变成弱点。通常,指令跟随任务包括一个明确的任务描述和包含要处理的目标数据的输入文本。然而,如果输入本身看起来像一条指令,即使有明确的提示来区分任务指令和输入内容,也可能产生混淆。我们将这一现象称为“指令干扰”。在本文中,我们介绍了一个新基准测试DIM-Bench,该基准专门用于评估LLM在面临指令干扰时的表现。该基准将现实世界中的指令干扰实例分类,并对四种指令任务(重写、校对、翻译和风格转换)以及五种输入任务(推理、代码生成、数学推理、偏见检测和问题回答)进行评价。我们的实验结果表明,即使是最先进的LLM也容易受到指令干扰的影响,在这种情况下往往无法准确遵循用户的意图。
https://arxiv.org/abs/2502.04362
Several studies indicate that deep learning models can learn to detect breast cancer from mammograms (X-ray images of the breasts). However, challenges with overfitting and poor generalisability prevent their routine use in the clinic. Models trained on data from one patient population may not perform well on another due to differences in their data domains, emerging due to variations in scanning technology or patient characteristics. Data augmentation techniques can be used to improve generalisability by expanding the diversity of feature representations in the training data by altering existing examples. Image-to-image translation models are one approach capable of imposing the characteristic feature representations (i.e. style) of images from one dataset onto another. However, evaluating model performance is non-trivial, particularly in the absence of ground truths (a common reality in medical imaging). Here, we describe some key aspects that should be considered when evaluating style transfer algorithms, highlighting the advantages and disadvantages of popular metrics, and important factors to be mindful of when implementing them in practice. We consider two types of generative models: a cycle-consistent generative adversarial network (CycleGAN) and a diffusion-based SynDiff model. We learn unpaired image-to-image translation across three mammography datasets. We highlight that undesirable aspects of model performance may determine the suitability of some metrics, and also provide some analysis indicating the extent to which various metrics assess unique aspects of model performance. We emphasise the need to use several metrics for a comprehensive assessment of model performance.
几项研究表明,深度学习模型可以从乳房X光片(乳腺的X射线图像)中学会检测乳腺癌。然而,由于过拟合和较差的泛化能力问题,这些模型还不能在临床上常规使用。在一个患者群体的数据上训练出的模型可能无法很好地应用于其他群体,这是因为不同数据集之间的扫描技术和患者特征差异导致的数据域不一致。通过数据增强技术可以改进模型的泛化能力,例如通过对现有样本进行修改来扩展训练数据中的特征表示多样性。图像到图像转换模型是一种能够将一个数据集中图像的独特特征表现(即风格)施加于另一个数据集的方法。然而,在缺乏地面真实值的情况下评估模型性能颇具挑战性,这是医学影像中常见的现实问题。在此,我们描述了在评估样式迁移算法时应考虑的一些关键方面,并强调了一些流行指标的优点和缺点以及在实践中实施这些指标的重要因素。我们将探讨两种生成模型:循环一致对抗网络(CycleGAN)和基于扩散的SynDiff模型,在三个乳腺X射线数据集中学习未配对图像到图像的转换。我们指出,模型性能中的某些不良方面可能会影响一些度量标准的适用性,并提供了一些分析以说明各种指标在评估模型独特性能方面的程度。我们强调需要使用多个指标来全面评估模型性能。
https://arxiv.org/abs/2502.02475
Forests function as crucial carbon reservoirs on land, and their carbon sinks can efficiently reduce atmospheric CO2 concentrations and mitigate climate change. Currently, the overall trend for monitoring and assessing forest carbon stocks is to integrate ground monitoring sample data with satellite remote sensing imagery. This style of analysis facilitates large-scale observation. However, these techniques require improvement in accuracy. We used GF-1 WFV and Landsat TM images to analyze Huize County, Qujing City, Yunnan Province in China. Using the style transfer method, we introduced Swin Transformer to extract global features through attention mechanisms, converting the carbon stock estimation into an image translation.
森林作为陆地上的关键碳储存库,其碳汇能够有效减少大气中的二氧化碳浓度并缓解气候变化。目前监测和评估森林碳储量的趋势是将地面监测样本数据与卫星遥感图像相结合。这种分析方法有助于大规模观测,但这些技术在准确性方面仍需改进。 我们使用GF-1 WFV和Landsat TM影像对中国云南省曲靖市会泽县的森林进行了分析。采用风格迁移的方法,我们将Swin Transformer引入其中,利用注意力机制提取全局特征,并将碳储量估算转化为图像翻译问题。
https://arxiv.org/abs/2502.00784
We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach, to imbue multimodal capabilities into your favorite LLM. Leveraging their innate ability to perform multi-step reasoning, MILS prompts the LLM to generate candidate outputs, each of which are scored and fed back iteratively, eventually generating a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even edit prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
我们介绍了MILS:多模态迭代LLM求解器,这是一种令人惊讶地简单且无需训练的方法,用于赋予您喜欢的大型语言模型(LLM)多模态能力。通过利用它们内在的执行多步推理的能力,MILS提示LLM生成候选输出,并对每个输出进行评分并反复反馈,最终生成任务解决方案。这使得各种通常需要在特定任务数据上训练专用模型的应用程序成为可能。特别是,我们在新兴的零样本图像、视频和音频描述方面建立了新的最先进的成果。MILS还无缝地适用于媒体生成,通过发现提示重写来改进文本到图像生成,并且甚至可以编辑提示以进行风格转换!最后,作为无梯度优化方法,MILS能够将多模态嵌入逆向映射为文本,从而使跨模态算术等应用成为可能。
https://arxiv.org/abs/2501.18096
Mammographic screening is an effective method for detecting breast cancer, facilitating early diagnosis. However, the current need to manually inspect images places a heavy burden on healthcare systems, spurring a desire for automated diagnostic protocols. Techniques based on deep neural networks have been shown effective in some studies, but their tendency to overfit leaves considerable risk for poor generalisation and misdiagnosis, preventing their widespread adoption in clinical settings. Data augmentation schemes based on unpaired neural style transfer models have been proposed that improve generalisability by diversifying the representations of training image features in the absence of paired training data (images of the same tissue in either image style). But these models are similarly prone to various pathologies, and evaluating their performance is challenging without ground truths/large datasets (as is often the case in medical imaging). Here, we consider two frameworks/architectures: a GAN-based cycleGAN, and the more recently developed diffusion-based SynDiff. We evaluate their performance when trained on image patches parsed from three open access mammography datasets and one non-medical image dataset. We consider the use of uncertainty quantification to assess model trustworthiness, and propose a scheme to evaluate calibration quality in unpaired training scenarios. This ultimately helps facilitate the trustworthy use of image-to-image translation models in domains where ground truths are not typically available.
乳腺摄影筛查是一种有效的检测乳腺癌的方法,有助于早期诊断。然而,当前需要人工检查图像给医疗系统带来了沉重负担,因此产生了对自动化诊断程序的需求。基于深度神经网络的技术在一些研究中显示出有效性,但这些技术倾向于过拟合,这使得它们存在较差的泛化能力和误诊风险,从而阻碍了其在临床环境中的广泛应用。 为了改进训练数据不足或缺乏配对样本的情况下模型的表现,已经提出了一些基于未配对神经风格转换模型的数据增强方案。这些方法通过改变训练图像特征的表示方式来提高模型的通用性,但同样容易受到各种病态的影响,并且由于缺乏地面真实值/大型数据集(这种情况在医学成像中经常发生),评估它们的表现具有挑战性。 在这里,我们考虑了两个框架/架构:基于GAN的cycleGAN和最近开发的基于扩散模型的SynDiff。我们将训练这两个模型使用从三个开放访问乳腺摄影数据集中以及一个非医疗图像数据集中解析出的图像补丁,并评估其性能。 为了衡量这些模型的信任度,我们探讨了不确定性量化方法的应用,同时提出了一种在未配对训练场景中评价校准质量的方法。这最终有助于促进在地面真实值通常不可用的情况下可信地使用图像到图像转换模型。
https://arxiv.org/abs/2501.17570
Despite significant recent advances in image generation with diffusion models, their internal latent representations remain poorly understood. Existing works focus on the bottleneck layer (h-space) of Stable Diffusion's U-Net or leverage the cross-attention, self-attention, or decoding layers. Our model, SkipInject takes advantage of U-Net's skip connections. We conduct thorough analyses on the role of the skip connections and find that the residual connections passed by the third encoder block carry most of the spatial information of the reconstructed image, splitting the content from the style. We show that injecting the representations from this block can be used for text-based editing, precise modifications, and style transfer. We compare our methods state-of-the-art style transfer and image editing methods and demonstrate that our method obtains the best content alignment and optimal structural preservation tradeoff.
尽管基于扩散模型的图像生成技术取得了重大进展,但对其内部潜在表示的理解仍然不足。现有的研究主要集中在稳定扩散模型U-Net中的瓶颈层(h-space)上,或者利用交叉注意力、自我注意或解码层。我们的模型SkipInject则充分利用了U-Net的跳跃连接。我们对跳跃连接的作用进行了深入分析,并发现由第三编码块传递的残差连接携带了重建图像中大部分的空间信息,从而将内容与风格分离。我们展示了从该区块注入表示的方法可以用于基于文本的编辑、精确修改和风格迁移。 我们将我们的方法与最先进的风格迁移和图像编辑方法进行比较,并证明我们的方法在内容对齐和结构保存的最佳权衡方面表现最佳。
https://arxiv.org/abs/2501.14524
We propose a unified framework for Singing Voice Synthesis (SVS) and Conversion (SVC), addressing the limitations of existing approaches in cross-domain SVS/SVC, poor output musicality, and scarcity of singing data. Our framework enables control over multiple aspects, including language content based on lyrics, performance attributes based on a musical score, singing style and vocal techniques based on a selector, and voice identity based on a speech sample. The proposed zero-shot learning paradigm consists of one SVS model and two SVC models, utilizing pre-trained content embeddings and a diffusion-based generator. The proposed framework is also trained on mixed datasets comprising both singing and speech audio, allowing singing voice cloning based on speech reference. Experiments show substantial improvements in timbre similarity and musicality over state-of-the-art baselines, providing insights into other low-data music tasks such as instrumental style transfer. Examples can be found at: this http URL.
我们提出了一种统一的框架,用于歌唱声音合成(SVS)和转换(SVC),以解决现有方法在跨域SVS/SVC、输出音乐性差以及缺乏歌唱数据方面的限制。我们的框架支持对多个方面进行控制,包括基于歌词的语言内容、基于乐谱的表现属性、基于选择器的演唱风格和技术手段,以及基于语音样本的声音身份。 所提出的零样本学习范式包含一个SVS模型和两个SVC模型,并使用预训练的内容嵌入和基于扩散的生成器。我们的框架还利用混合数据集进行训练,该数据集中包括歌唱和讲话音频,从而允许根据讲话参考实现歌唱声音克隆。实验表明,在音色相似度和音乐性方面,与最先进的基准相比有显著改进,这为其他低数据量的音乐任务(如乐器风格转换)提供了见解。 示例可以在以下网址找到:this http URL.
https://arxiv.org/abs/2501.13870
Large Language Models (LLMs) excel at rewriting tasks such as text style transfer and grammatical error correction. While there is considerable overlap between the inputs and outputs in these tasks, the decoding cost still increases with output length, regardless of the amount of overlap. By leveraging the overlap between the input and the output, Kaneko and Okazaki (2023) proposed model-agnostic edit span representations to compress the rewrites to save computation. They reported an output length reduction rate of nearly 80% with minimal accuracy impact in four rewriting tasks. In this paper, we propose alternative edit phrase representations inspired by phrase-based statistical machine translation. We systematically compare our phrasal representations with their span representations. We apply the LLM rewriting model to the task of Automatic Speech Recognition (ASR) post editing and show that our target-phrase-only edit representation has the best efficiency-accuracy trade-off. On the LibriSpeech test set, our method closes 50-60% of the WER gap between the edit span model and the full rewrite model while losing only 10-20% of the length reduction rate of the edit span model.
大型语言模型(LLMs)在诸如文本风格转换和语法错误纠正等重写任务中表现出色。尽管这些任务中的输入和输出之间存在一定的重叠,但解码成本仍然会随着输出长度的增加而上升,无论重叠程度如何。通过利用输入与输出之间的重叠,Kaneko 和 Okazaki(2023)提出了模型无关的编辑跨度表示方法来压缩重写内容以节省计算资源。他们在四个重写任务中报告了近 80% 的输出长度缩减率,并且对准确性的影响很小。 在本文中,我们提出了一种受基于短语的统计机器翻译启发的替代编辑短语表示方法。我们系统地比较了我们的短语表示与他们的跨度表示。我们将LLM重写模型应用于自动语音识别(ASR)后处理任务,并表明仅目标短语的编辑表示具有最佳的效率-准确性权衡。 在LibriSpeech测试集上,我们的方法缩小了编辑跨度模型和完整重写模型之间50-60%的词错误率(WER)差距,同时仅损失了编辑跨度模型长度缩减率的10-20%。
https://arxiv.org/abs/2501.13831
Stylistic text generation plays a vital role in enhancing communication by reflecting the nuances of individual expression. This paper presents a novel approach for generating text in a specific speaker's style across different languages. We show that by leveraging only 100 lines of text, an individuals unique style can be captured as a high-dimensional embedding, which can be used for both text generation and stylistic translation. This methodology breaks down the language barrier by transferring the style of a speaker between languages. The paper is structured into three main phases: augmenting the speaker's data with stylistically consistent external sources, separating style from content using machine learning and deep learning techniques, and generating an abstract style profile by mean pooling the learned embeddings. The proposed approach is shown to be topic-agnostic, with test accuracy and F1 scores of 74.9\% and 0.75, respectively. The results demonstrate the potential of the style profile for multilingual communication, paving the way for further applications in personalized content generation and cross-linguistic stylistic transfer.
风格化的文本生成在通过反映个人表达的细微差别来增强沟通方面扮演着至关重要的角色。本文提出了一种新颖的方法,用于跨不同语言以特定说话者的风格生成文本。我们展示了仅利用100行文本即可捕捉到个体独特风格的高维嵌入,并且这种嵌入可用于文本生成和风格化翻译。该方法通过在语言间转移发言人的风格来打破语言障碍。 本文按照三个主要阶段展开:使用与说话者数据风格一致的外部来源增强其数据,利用机器学习和深度学习技术将风格从内容中分离出来,以及通过对所学嵌入进行均值池化生成抽象的风格配置文件。所提出的这种方法被证明是主题无关的,在测试准确性和F1分数方面分别达到了74.9%和0.75。 实验结果展示了风格配置文件在多语言沟通中的潜力,并为个性化内容生成和个人化的跨语种风格转换应用开辟了道路。
https://arxiv.org/abs/2501.11639
Training-free diffusion-based methods have achieved remarkable success in style transfer, eliminating the need for extensive training or fine-tuning. However, due to the lack of targeted training for style information extraction and constraints on the content image layout, training-free methods often suffer from layout changes of original content and content leakage from style images. Through a series of experiments, we discovered that an effective startpoint in the sampling stage significantly enhances the style transfer process. Based on this discovery, we propose StyleSSP, which focuses on obtaining a better startpoint to address layout changes of original content and content leakage from style image. StyleSSP comprises two key components: (1) Frequency Manipulation: To improve content preservation, we reduce the low-frequency components of the DDIM latent, allowing the sampling stage to pay more attention to the layout of content images; and (2) Negative Guidance via Inversion: To mitigate the content leakage from style image, we employ negative guidance in the inversion stage to ensure that the startpoint of the sampling stage is distanced from the content of style image. Experiments show that StyleSSP surpasses previous training-free style transfer baselines, particularly in preserving original content and minimizing the content leakage from style image.
无需训练的基于扩散的方法在风格转换方面取得了显著的成功,消除了广泛训练或微调的需求。然而,由于缺乏针对风格信息提取的目标训练以及对内容图像布局的限制,无训练方法通常会遭受原始内容布局变化和从风格图像中渗入内容的问题。通过一系列实验,我们发现采样阶段的有效起点可以极大地提升风格转换过程的效果。基于这一发现,我们提出了StyleSSP,它专注于获得更好的起点来解决原始内容布局的变化以及从风格图像中渗入的内容问题。StyleSSP包含两个关键组成部分:(1)频率操控:为了提高内容保留效果,我们将DDIM潜在空间中的低频成分减少,使采样阶段可以更多地关注内容图像的布局;(2)通过反演操作实现负向指导:为了解决从风格图像中渗入的内容问题,我们在反演阶段采用负向指导来确保采样阶段的起点远离了风格图像的内容。实验表明,StyleSSP在保留原始内容和减少来自风格图像的内容渗漏方面超越了之前的无训练风格转换基线方法。
https://arxiv.org/abs/2501.11319
Throughout history, humans have created remark- able works of art, but artificial intelligence has only recently started to make strides in generating visually compelling art. Breakthroughs in the past few years have focused on using convolutional neural networks (CNNs) to separate and manipulate the content and style of images, applying texture synthesis techniques. Nevertheless, a number of current techniques continue to encounter obstacles, including lengthy processing times, restricted choices of style images, and the inability to modify the weight ratio of styles. We proposed a neural style transfer system that can add various artistic styles to a desired image to address these constraints allowing flexible adjustments to style weight ratios and reducing processing time. The system uses the VGG19 model for feature extraction, ensuring high-quality, flexible stylization without compromising content integrity.
历史上,人类创造了令人惊叹的艺术作品,而人工智能在生成视觉吸引人的艺术方面仅最近才开始取得进展。近年来的突破主要集中在使用卷积神经网络(CNN)来分离和操作图像的内容与风格,并应用纹理合成技术。然而,目前许多技术仍然面临一些障碍,包括处理时间长、可供选择的风格图片有限以及无法调整风格权重比等问题。为了解决这些问题,我们提出了一种基于VGG19模型进行特征提取的神经样式迁移系统。该系统能够向目标图像添加各种艺术风格,并允许灵活地调整风格权重比例以减少处理时间,同时确保在不损害内容完整性的情况下实现高质量和灵活性的风格化。
https://arxiv.org/abs/2501.09420