Text-guided semantic manipulation refers to semantically editing an image generated from a source prompt so that it matches a target prompt, enabling the desired semantic changes (e.g., addition, removal, and style transfer) while preserving irrelevant content. With the powerful generative capabilities of diffusion models, the task has shown the potential to produce high-fidelity visual content. Nevertheless, existing methods typically require time-consuming fine-tuning (inefficient), fail to accomplish multiple semantic manipulations (poor extensibility), and/or lack support for different modality tasks (limited generalizability). Upon further investigation, we find that the geometric properties of noise in the diffusion model are strongly correlated with semantic changes. Motivated by this, we propose $\textit{GTF}$, a novel framework for text-guided semantic manipulation with the following attractive capabilities: 1) $\textbf{Generalized}$: our $\textit{GTF}$ supports multiple semantic manipulations (e.g., addition, removal, and style transfer) and can be seamlessly integrated into all diffusion-based methods (i.e., plug-and-play) across different modalities (i.e., modality-agnostic); and 2) $\textbf{Training-free}$: $\textit{GTF}$ produces high-fidelity results by simply controlling the geometric relationship between noises, without tuning or optimization. Our extensive experiments demonstrate the efficacy of our approach, highlighting its potential to advance the state of the art in semantic manipulation.
https://arxiv.org/abs/2504.17269
3D Gaussian Splatting (3DGS) excels in photorealistic scene reconstruction but struggles with stylized scenarios (e.g., cartoons, games) due to fragmented textures, semantic misalignment, and limited adaptability to abstract aesthetics. We propose StyleMe3D, a holistic framework for 3D GS style transfer that integrates multi-modal style conditioning, multi-level semantic alignment, and perceptual quality enhancement. Our key insights include: (1) optimizing only RGB attributes preserves geometric integrity during stylization; (2) disentangling low-, medium-, and high-level semantics is critical for coherent style transfer; (3) scalability across isolated objects and complex scenes is essential for practical deployment. StyleMe3D introduces four novel components: Dynamic Style Score Distillation (DSSD), leveraging Stable Diffusion's latent space for semantic alignment; Contrastive Style Descriptor (CSD) for localized, content-aware texture transfer; Simultaneously Optimized Scale (SOS) to decouple style details and structural coherence; and 3D Gaussian Quality Assessment (3DG-QA), a differentiable aesthetic prior trained on human-rated data to suppress artifacts and enhance visual harmony. Evaluated on the NeRF synthetic (objects) and tandt db (scenes) datasets, StyleMe3D outperforms state-of-the-art methods in preserving geometric details (e.g., carvings on sculptures) and ensuring stylistic consistency across scenes (e.g., coherent lighting in landscapes), while maintaining real-time rendering. This work bridges photorealistic 3D GS and artistic stylization, unlocking applications in gaming, virtual worlds, and digital art.
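As a loose illustration of insight (1), the sketch below freezes every per-Gaussian attribute except color during stylization, so only appearance is optimized. The attribute names (`features_dc`, `features_rest`, etc.) follow common 3DGS implementations and are assumptions here, not StyleMe3D's actual code.

```python
# Minimal sketch: optimize only the RGB/SH color attributes of the Gaussians,
# leaving geometry (positions, scales, rotations, opacities) frozen.
COLOR_ATTRS = {"features_dc", "features_rest"}  # assumed attribute names

def color_only_parameters(gaussians):
    """`gaussians` is assumed to be a module exposing named per-Gaussian tensors."""
    trainable = []
    for name, param in gaussians.named_parameters():
        is_color = name in COLOR_ATTRS
        param.requires_grad_(is_color)
        if is_color:
            trainable.append(param)
    return trainable  # pass these to the style-loss optimizer
```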
https://arxiv.org/abs/2504.15281
Spine surgery is a high-risk intervention demanding precise execution, often supported by image-based navigation systems. Recently, supervised learning approaches have gained attention for reconstructing 3D spinal anatomy from sparse fluoroscopic data, significantly reducing reliance on radiation-intensive 3D imaging systems. However, these methods typically require large amounts of annotated training data and may struggle to generalize across varying patient anatomies or imaging conditions. Instance-learning approaches like Gaussian splatting could offer an alternative by avoiding extensive annotation requirements. While Gaussian splatting has shown promise for novel view synthesis, its application to sparse, arbitrarily posed real intraoperative X-rays has remained largely unexplored. This work addresses this limitation by extending the $R^2$-Gaussian splatting framework to reconstruct anatomically consistent 3D volumes under these challenging conditions. We introduce an anatomy-guided radiographic standardization step using style transfer, improving visual consistency across views, and enhancing reconstruction quality. Notably, our framework requires no pretraining, making it inherently adaptable to new patients and anatomies. We evaluated our approach using an ex-vivo dataset. Expert surgical evaluation confirmed the clinical utility of the 3D reconstructions for navigation, especially when using 20 to 30 views, and highlighted the standardization's benefit for anatomical clarity. Benchmarking via quantitative 2D metrics (PSNR/SSIM) confirmed performance trade-offs compared to idealized settings, but also validated the improvement gained from standardization over raw inputs. This work demonstrates the feasibility of instance-based volumetric reconstruction from arbitrary sparse-view X-rays, advancing intraoperative 3D imaging for surgical navigation.
https://arxiv.org/abs/2504.14699
Generating multi-subject stylized images remains a significant challenge due to the ambiguity in defining style attributes (e.g., color, texture, atmosphere, and structure) and the difficulty in consistently applying them across multiple subjects. Although recent diffusion-based text-to-image models have achieved remarkable progress, existing methods typically rely on computationally expensive inversion procedures or large-scale stylized datasets. Moreover, these methods often struggle with maintaining multi-subject semantic fidelity and are limited by high inference costs. To address these limitations, we propose ICAS (IP-Adapter and ControlNet-based Attention Structure), a novel framework for efficient and controllable multi-subject style transfer. Instead of full-model tuning, ICAS adaptively fine-tunes only the content injection branch of a pre-trained diffusion model, thereby preserving identity-specific semantics while enhancing style controllability. By combining IP-Adapter for adaptive style injection with ControlNet for structural conditioning, our framework ensures faithful global layout preservation alongside accurate local style synthesis. Furthermore, ICAS introduces a cyclic multi-subject content embedding mechanism, which enables effective style transfer under limited-data settings without the need for extensive stylized corpora. Extensive experiments show that ICAS achieves superior performance in structure preservation, style consistency, and inference efficiency, establishing a new paradigm for multi-subject style transfer in real-world applications.
https://arxiv.org/abs/2504.13224
Sign languages are dynamic visual languages that involve hand gestures in combination with non-manual elements such as facial expressions. While video recordings of sign language are commonly used for education and documentation, the dynamic nature of signs can make it challenging to study them in detail, especially for new learners and educators. This work aims to convert sign language video footage into static illustrations, which serve as an additional educational resource to complement video content. This process is usually done by an artist and is therefore quite costly. We propose a method that illustrates sign language videos by leveraging generative models' ability to understand both the semantic and geometric aspects of images. Our approach focuses on transferring a sketch-like illustration style to video footage of sign language, combining the start and end frames of a sign into a single illustration, and using arrows to highlight the hand's direction and motion. While many style transfer methods address domain adaptation at varying levels of abstraction, applying a sketch-like style to sign languages, especially for hand gestures and facial expressions, poses a significant challenge. To tackle this, we intervene in the denoising process of a diffusion model, injecting style as keys and values into high-resolution attention layers and fusing geometric information from the image and edges as queries. For the final illustration, we use the attention mechanism to combine the attention weights from both the start and end illustrations, resulting in a soft combination. Our method offers a cost-effective solution for generating sign language illustrations at inference time, addressing the lack of such resources in educational materials.
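A minimal sketch of the described attention intervention, assuming pre-extracted token features: the style image supplies keys and values, content-image and edge features are fused into the queries, and the start/end attention maps are blended softly. Tensor shapes, the fusion weight, and the blend factor are illustrative choices, not the paper's settings.

```python
import torch

def stylized_attention(q_content, q_edges, k_style, v_style, fuse=0.5):
    """Attention with K/V from the style image and Q from content + edge features.
    All inputs: (batch, tokens, dim)."""
    q = fuse * q_content + (1.0 - fuse) * q_edges            # fuse geometric cues
    scale = q.size(-1) ** -0.5
    attn = torch.softmax(q @ k_style.transpose(-2, -1) * scale, dim=-1)
    return attn @ v_style, attn

def soft_combine(attn_start, attn_end, alpha=0.5):
    """Softly combine attention weights from the start and end illustrations."""
    return alpha * attn_start + (1.0 - alpha) * attn_end
```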
https://arxiv.org/abs/2504.10822
In this paper, we introduce a novel data augmentation technique that combines the advantages of style augmentation and random erasing by selectively replacing image subregions with style-transferred patches. Our approach first applies a random style transfer to training images, then randomly substitutes selected areas of these images with patches derived from the style-transferred versions. This method is able to seamlessly accommodate a wide range of existing style transfer algorithms and can be readily integrated into diverse data augmentation pipelines. By incorporating our strategy, the training process becomes more robust and less prone to overfitting. Comparative experiments demonstrate that, relative to previous style augmentation methods, our technique achieves superior performance and faster convergence.
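A rough sketch of the augmentation (not the authors' implementation): stylize the batch with an arbitrary `style_transfer_fn`, then paste randomly sized patches from the stylized images back onto the originals. The patch count and size range are made-up hyperparameters.

```python
import random
import torch

def style_patch_augment(images, style_transfer_fn, num_patches=2, max_frac=0.4):
    """images: (B, C, H, W). Replace random regions with their stylized counterparts."""
    stylized = style_transfer_fn(images)     # random style transfer of the whole batch
    out = images.clone()
    _, _, h, w = images.shape
    for b in range(images.size(0)):
        for _ in range(num_patches):
            ph = random.randint(max(1, int(0.1 * h)), int(max_frac * h))
            pw = random.randint(max(1, int(0.1 * w)), int(max_frac * w))
            top, left = random.randint(0, h - ph), random.randint(0, w - pw)
            out[b, :, top:top + ph, left:left + pw] = \
                stylized[b, :, top:top + ph, left:left + pw]
    return out
```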
https://arxiv.org/abs/2504.10563
Recent developments in hardware, computer graphics, and AI may soon enable AR/VR head-mounted displays (HMDs) to become everyday devices like smartphones and tablets. Eye trackers within HMDs provide a special opportunity for such setups as it is possible to facilitate gaze-based research and interaction. However, estimating users' gaze information often requires raw eye images and videos that contain iris textures, which are considered a gold-standard biometric for user authentication, and this raises privacy concerns. Previous research in the eye-tracking community focused on obfuscating iris textures while keeping utility tasks such as gaze estimation accurate. Despite these attempts, there is no comprehensive benchmark that evaluates state-of-the-art approaches. Considering all of this, in this paper we benchmark blurring, noising, downsampling, the rubber sheet model, and iris style transfer to obfuscate user identity, and compare their impact on image quality, privacy, utility, and risk of imposter attack on two datasets. We use eye segmentation and gaze estimation as utility tasks, reduction in iris recognition accuracy as a measure of privacy protection, and false acceptance rate to estimate the risk of attack. Our experiments show that canonical image processing methods like blurring and noising have only a marginal impact on deep learning-based tasks. While downsampling, the rubber sheet model, and iris style transfer are all effective in hiding user identifiers, iris style transfer, at higher computation cost, outperforms the others on both utility tasks and is more resilient against spoof attacks. Our analyses indicate that there is no universal optimal approach to balance privacy, utility, and computational burden. Therefore, we recommend practitioners consider the strengths and weaknesses of each approach, and possible combinations thereof, to reach an optimal privacy-utility trade-off.
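For reference, the three canonical obfuscations benchmarked here (blurring, noising, downsampling) are easy to reproduce; a minimal PIL/NumPy sketch with illustrative parameter values follows. The rubber sheet model and iris style transfer need dedicated pipelines and are omitted.

```python
import numpy as np
from PIL import Image, ImageFilter

def blur(eye_img: Image.Image, radius: float = 4.0) -> Image.Image:
    return eye_img.filter(ImageFilter.GaussianBlur(radius))

def add_noise(eye_img: Image.Image, sigma: float = 20.0) -> Image.Image:
    arr = np.asarray(eye_img).astype(np.float32)
    arr += np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

def downsample(eye_img: Image.Image, factor: int = 4) -> Image.Image:
    w, h = eye_img.size
    small = eye_img.resize((w // factor, h // factor), Image.BILINEAR)
    return small.resize((w, h), Image.NEAREST)  # upscale back to the original size
```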
https://arxiv.org/abs/2504.10267
Recent advances in visual synthesis have leveraged diffusion models and attention mechanisms to achieve high-fidelity artistic style transfer and photorealistic text-to-image generation. However, real-time deployment on edge devices remains challenging due to computational and memory constraints. We propose Muon-AD, a co-designed framework that integrates the Muon optimizer with attention distillation for real-time edge synthesis. By eliminating gradient conflicts through orthogonal parameter updates and dynamic pruning, Muon-AD achieves 3.2 times faster convergence compared to Stable Diffusion-TensorRT, while maintaining synthesis quality (15% lower FID, 4% higher SSIM). Our framework reduces peak memory to 7GB on Jetson Orin and enables 24FPS real-time generation through mixed-precision quantization and curriculum learning. Extensive experiments on COCO-Stuff and ImageNet-Texture demonstrate Muon-AD's Pareto-optimal efficiency-quality trade-offs. Here, we show a 65% reduction in communication overhead during distributed training and real-time 10s/image generation on edge GPUs. These advancements pave the way for democratizing high-quality visual synthesis in resource-constrained environments.
https://arxiv.org/abs/2504.08451
Image fusion seeks to seamlessly integrate foreground objects with background scenes, producing realistic and harmonious fused images. Unlike existing methods that directly insert objects into the background, adaptive and interactive fusion remains a challenging yet appealing task. It requires the foreground to adjust or interact with the background context, enabling more coherent integration. To address this, we propose an iterative human-in-the-loop data generation pipeline, which leverages limited initial data with diverse textual prompts to generate fusion datasets across various scenarios and interactions, including placement, holding, wearing, and style transfer. Building on this, we introduce DreamFuse, a novel approach based on the Diffusion Transformer (DiT) model, to generate consistent and harmonious fused images with both foreground and background information. DreamFuse employs a Positional Affine mechanism to inject the size and position of the foreground into the background, enabling effective foreground-background interaction through shared attention. Furthermore, we apply Localized Direct Preference Optimization guided by human feedback to refine DreamFuse, enhancing background consistency and foreground harmony. DreamFuse achieves harmonious fusion while generalizing to text-driven attribute editing of the fused results. Experimental results demonstrate that our method outperforms state-of-the-art approaches across multiple metrics.
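One way to picture the Positional Affine idea: given a target box for the foreground inside the background canvas, build an affine sampling grid that warps the foreground (pixels or features) to that size and position so it can then share attention with the background. The box convention and the use of `affine_grid`/`grid_sample` are illustrative assumptions, not DreamFuse's exact mechanism.

```python
import torch
import torch.nn.functional as F

def place_foreground(fg, canvas_hw, box):
    """fg: (B, C, h, w); box = (x0, y0, x1, y1) in normalized [0, 1] canvas coords.
    Returns the foreground warped onto a zero-padded canvas of size canvas_hw."""
    B, C, _, _ = fg.shape
    H, W = canvas_hw
    x0, y0, x1, y1 = box
    sx, sy = x1 - x0, y1 - y0                   # target width/height as canvas fractions
    cx, cy = (x0 + x1) - 1.0, (y0 + y1) - 1.0   # target center in [-1, 1] coordinates
    # Inverse map: for each canvas pixel, where to sample inside the foreground.
    theta = torch.tensor([[1.0 / sx, 0.0, -cx / sx],
                          [0.0, 1.0 / sy, -cy / sy]], dtype=fg.dtype).repeat(B, 1, 1)
    grid = F.affine_grid(theta, (B, C, H, W), align_corners=False)
    return F.grid_sample(fg, grid, align_corners=False)  # zeros outside the box
```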
https://arxiv.org/abs/2504.08291
Prompt Recovery, reconstructing prompts from the outputs of large language models (LLMs), has grown in importance as LLMs become ubiquitous. Most users access LLMs through APIs without internal model weights, relying only on outputs and logits, which complicates recovery. This paper explores a unique prompt recovery task focused on reconstructing prompts for style transfer and rephrasing, rather than typical question-answering. We introduce a dataset created with LLM assistance, ensuring quality through multiple techniques, and test methods like zero-shot, few-shot, jailbreak, chain-of-thought, fine-tuning, and a novel canonical-prompt fallback for poor-performing cases. Our results show that one-shot and fine-tuning yield the best outcomes but highlight flaws in traditional sentence similarity metrics for evaluating prompt recovery. Contributions include (1) a benchmark dataset, (2) comprehensive experiments on prompt recovery strategies, and (3) identification of limitations in current evaluation metrics, all of which advance general prompt recovery research, where the structure of the input prompt is unrestricted.
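The canonical-prompt fallback can be read as a simple confidence gate; the sketch below uses hypothetical `recover` and `score` callables and an arbitrary threshold purely for illustration.

```python
def recover_with_fallback(output_text, recover, score, canonical_prompt, threshold=0.5):
    """Return the recovered prompt unless its confidence score is too low,
    in which case fall back to a fixed canonical prompt."""
    candidate = recover(output_text)
    return candidate if score(candidate, output_text) >= threshold else canonical_prompt
```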
https://arxiv.org/abs/2504.04373
Thermal imaging is crucial for scene understanding, particularly in low-light and nighttime conditions. However, collecting large thermal datasets is costly and labor-intensive due to the specialized equipment required for infrared image capture. To address this challenge, researchers have explored visible-to-thermal image translation. Most existing methods rely on Generative Adversarial Networks (GANs) or Diffusion Models (DMs), treating the task as a style transfer problem. As a result, these approaches attempt to learn both the modality distribution shift and underlying physical principles from limited training data. In this paper, we propose F-ViTA, a novel approach that leverages the general world knowledge embedded in foundation models to guide the diffusion process for improved translation. Specifically, we condition an InstructPix2Pix Diffusion Model with zero-shot masks and labels from foundation models such as SAM and Grounded DINO. This allows the model to learn meaningful correlations between scene objects and their thermal signatures in infrared imagery. Extensive experiments on five public datasets demonstrate that F-ViTA outperforms state-of-the-art (SOTA) methods. Furthermore, our model generalizes well to out-of-distribution (OOD) scenarios and can generate Long-Wave Infrared (LWIR), Mid-Wave Infrared (MWIR), and Near-Infrared (NIR) translations from the same visible image. Code: this https URL.
https://arxiv.org/abs/2504.02801
The limited availability of training data for low-resource languages makes applying machine learning techniques challenging. Ancient Egyptian is one such language with few resources. However, innovative applications of data augmentation methods, such as Neural Style Transfer (NST), could overcome these barriers. This paper presents a novel method for generating datasets of ancient Egyptian hieroglyphs by applying NST to a digital typeface. Experiments found that image classification models trained on NST-generated examples and on photographs demonstrate equal performance and transferability to real, unseen images of hieroglyphs.
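A minimal sketch of the data-generation loop, assuming glyphs can be rasterized from a hieroglyphic TrueType font and passed through an arbitrary NST routine `nst_fn`; the font path and image size are placeholders.

```python
from PIL import Image, ImageDraw, ImageFont

def render_glyph(char, font_path="hieroglyph_font.ttf", size=128):
    """Rasterize one typeface glyph onto a white square canvas."""
    font = ImageFont.truetype(font_path, int(size * 0.8))
    img = Image.new("L", (size, size), color=255)
    ImageDraw.Draw(img).text((size // 10, size // 10), char, fill=0, font=font)
    return img

def build_nst_dataset(chars, style_images, nst_fn):
    """Pair every rendered glyph with every style image via neural style transfer."""
    return [(c, nst_fn(render_glyph(c), s)) for c in chars for s in style_images]
```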
https://arxiv.org/abs/2504.02163
Recent advances in style and appearance transfer are impressive, but most methods isolate global style and local appearance transfer, neglecting semantic correspondence. Additionally, image and video tasks are typically handled in isolation, with little focus on integrating them for video transfer. To address these limitations, we introduce a novel task, Semantic Style Transfer, which involves transferring style and appearance features from a reference image to a target visual content based on semantic correspondence. We subsequently propose a training-free method, Semantix, an energy-guided sampler designed for Semantic Style Transfer that simultaneously guides both style and appearance transfer based on the semantic understanding capacity of pre-trained diffusion models. Additionally, as a sampler, Semantix can be seamlessly applied to both image and video models, making semantic style transfer generic across various visual media. Specifically, after inverting both reference and context images or videos into the noise space via SDEs, Semantix utilizes a meticulously crafted energy function to guide the sampling process, comprising three key components: Style Feature Guidance, Spatial Feature Guidance, and Semantic Distance as a regularisation term. Experimental results demonstrate that Semantix not only effectively accomplishes the task of semantic style transfer across images and videos, but also surpasses existing state-of-the-art solutions in both fields. The project website is available at this https URL
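Schematically, the guidance above can be written as a weighted energy whose gradient steers each sampling step; the decomposition below only mirrors the three named components, with weights $\lambda_\ast$ and feature extractors left unspecified (the paper's exact formulation may differ):

$$
\mathcal{E}(\mathbf{x}_t) = \lambda_{\mathrm{sty}}\,\mathcal{E}_{\mathrm{style}}(\mathbf{x}_t;\mathbf{x}^{\mathrm{ref}}) + \lambda_{\mathrm{spa}}\,\mathcal{E}_{\mathrm{spatial}}(\mathbf{x}_t;\mathbf{x}^{\mathrm{ctx}}) + \lambda_{\mathrm{sem}}\,\mathcal{D}_{\mathrm{sem}}(\mathbf{x}_t), \qquad \mathbf{x}_{t-1} \leftarrow \Phi(\mathbf{x}_t) - \eta\,\nabla_{\mathbf{x}_t}\mathcal{E}(\mathbf{x}_t),
$$

where $\Phi$ denotes one ordinary denoising step and $\eta$ a guidance scale.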
https://arxiv.org/abs/2503.22344
3D scene stylization approaches based on Neural Radiance Fields (NeRF) achieve promising results by optimizing with Nearest Neighbor Feature Matching (NNFM) loss. However, NNFM loss does not consider global style information. In addition, the implicit representation of NeRF limits their fine-grained control over the resulting scenes. In this paper, we introduce ABC-GS, a novel framework based on 3D Gaussian Splatting to achieve high-quality 3D style transfer. To this end, a controllable matching stage is designed to achieve precise alignment between scene content and style features through segmentation masks. Moreover, a style transfer loss function based on feature alignment is proposed to ensure that the outcomes of style transfer accurately reflect the global style of the reference image. Furthermore, the original geometric information of the scene is preserved with the depth loss and Gaussian regularization terms. Extensive experiments show that our ABC-GS provides controllability of style transfer and achieves stylization results that are more faithfully aligned with the global style of the chosen artistic reference. Our homepage is available at this https URL.
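Putting the named terms together, the overall stylization objective presumably takes a weighted form such as the one below; the symbols and weights are schematic, not taken from the paper:

$$
\mathcal{L} = \lambda_{\mathrm{sty}}\,\mathcal{L}_{\mathrm{style}} + \lambda_{d}\,\mathcal{L}_{\mathrm{depth}} + \lambda_{r}\,\mathcal{L}_{\mathrm{reg}},
$$

where $\mathcal{L}_{\mathrm{style}}$ is the feature-alignment style loss, $\mathcal{L}_{\mathrm{depth}}$ preserves the original geometry, and $\mathcal{L}_{\mathrm{reg}}$ regularizes the Gaussians.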
https://arxiv.org/abs/2503.22218
Mammography stands as the main screening method for detecting breast cancer early, enhancing treatment success rates. The segmentation of landmark structures in mammography images can aid medical assessment of cancer risk and of image acquisition adequacy. We introduce a series of data-centric strategies aimed at enriching the training data for deep learning-based segmentation of landmark structures. Our approach involves augmenting the training samples through annotation-guided image intensity manipulation and style transfer to achieve better generalization than standard training procedures. These augmentations are applied in a balanced manner to ensure the model learns to process a diverse range of images generated by different vendors' equipment while retaining its efficacy on the original data. We present extensive numerical and visual results that demonstrate the superior generalization capabilities of our methods compared to standard training. For this evaluation, we consider a large dataset that includes mammography images generated by different vendors' equipment. Further, we present complementary results that show both the strengths and limitations of our methods across various scenarios. The accuracy and robustness demonstrated in the experiments suggest that our method is well-suited for integration into clinical practice.
https://arxiv.org/abs/2503.22052
Deep generative models have been used in style transfer tasks for images. In this study, we adapt and improve the CycleGAN model to perform music style transfer between the Jazz and Classic genres. By doing so, we aim to easily generate new songs, cover existing music in different genres, and reduce the arrangement effort needed in those processes. We train and use a music genre classifier to assess the performance of the transfer models. To that end, we obtain 87.7% accuracy with a Multi-layer Perceptron. To improve our style transfer baseline, we add auxiliary discriminators and a triplet loss to our model. According to our experiments, we obtain the best accuracies of 69.4% in the Jazz-to-Classic task and 39.3% in the Classic-to-Jazz task, measured with the genre classifier we developed. We also run a subjective experiment, and its results show that the overall performance of our transfer model is good and that it manages to preserve the melody of the inputs in the transferred outputs. Our code is available at this https URL fidansamet/tune-it-up
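For reference, the triplet loss added to the transfer model is the standard margin-based form, where anchor $a$ and positive $p$ share a genre while negative $n$ does not; the distance $d$ and margin $m$ are the usual choices, not values reported by the authors:

$$
\mathcal{L}_{\mathrm{triplet}}(a, p, n) = \max\bigl(0,\; d(a, p) - d(a, n) + m\bigr).
$$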
https://arxiv.org/abs/2503.22008
We propose a novel, zero-shot image generation technique called "Visual Concept Blending" that provides fine-grained control over which features from multiple reference images are transferred to a source image. If only a single reference image is available, it is difficult to isolate which specific elements should be transferred. However, using multiple reference images, the proposed approach distinguishes between common and unique features by selectively incorporating them into a generated output. By operating within a partially disentangled Contrastive Language-Image Pre-training (CLIP) embedding space (from IP-Adapter), our method enables the flexible transfer of texture, shape, motion, style, and more abstract conceptual transformations without requiring additional training or text prompts. We demonstrate its effectiveness across a diverse range of tasks, including style transfer, form metamorphosis, and conceptual transformations, showing how subtle or abstract attributes (e.g., brushstroke style, aerodynamic lines, and dynamism) can be seamlessly combined into a new image. In a user study, participants accurately recognized which features were intended to be transferred. Its simplicity, flexibility, and high-level control make Visual Concept Blending valuable for creative fields such as art, design, and content creation, where combining specific visual qualities from multiple inspirations is crucial.
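One plausible reading of the common/unique split, sketched in a generic embedding space (the method operates in IP-Adapter's partially disentangled CLIP space, and its exact operation may differ): take the shared component as the normalized mean of the reference embeddings and treat each residual as that reference's unique component.

```python
import torch
import torch.nn.functional as F

def split_common_unique(ref_embs: torch.Tensor):
    """ref_embs: (N, D) embeddings of N reference images.
    Returns a shared 'common' direction and per-reference 'unique' residuals."""
    common = F.normalize(ref_embs.mean(dim=0), dim=0)      # (D,)
    coeffs = ref_embs @ common                             # (N,) projections
    unique = ref_embs - coeffs.unsqueeze(1) * common       # remove the shared part
    return common, unique

def blend(src_emb, common, unique, w_common=1.0, w_unique=0.5):
    """Selectively inject shared and/or unique components into a source embedding."""
    return src_emb + w_common * common + w_unique * unique.mean(dim=0)
```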
https://arxiv.org/abs/2503.21277
Urban scene reconstruction requires modeling both static infrastructure and dynamic elements while supporting diverse environmental conditions. We present \textbf{StyledStreets}, a multi-style street simulator that achieves instruction-driven scene editing with guaranteed spatial and temporal consistency. Building on a state-of-the-art Gaussian Splatting framework for street scenarios enhanced by our proposed pose optimization and multi-view training, our method enables photorealistic style transfers across seasons, weather conditions, and camera setups through three key innovations: First, a hybrid embedding scheme disentangles persistent scene geometry from transient style attributes, allowing realistic environmental edits while preserving structural integrity. Second, uncertainty-aware rendering mitigates supervision noise from diffusion priors, enabling robust training across extreme style variations. Third, a unified parametric model prevents geometric drift through regularized updates, maintaining multi-view consistency across seven vehicle-mounted cameras. Our framework preserves the original scene's motion patterns and geometric relationships. Qualitative results demonstrate plausible transitions between diverse conditions (snow, sandstorm, night), while quantitative evaluations show state-of-the-art geometric accuracy under style transfers. The approach establishes new capabilities for urban simulation, with applications in autonomous vehicle testing and augmented reality systems requiring reliable environmental consistency. Codes will be publicly available upon publication.
https://arxiv.org/abs/2503.21104
Text-driven voice conversion allows customization of speaker characteristics and prosodic elements using textual descriptions. However, most existing methods rely heavily on direct text-to-speech training, limiting their flexibility in controlling nuanced style elements or timbral features. In this paper, we propose a novel \textbf{Latent State-Space} approach for text-driven voice conversion (\textbf{LSS-VC}). Our method treats each utterance as an evolving dynamical system in a continuous latent space. Drawing inspiration from mamba, which introduced a state-space model for efficient text-driven \emph{image} style transfer, we adapt a loosely related methodology for \emph{voice} style transformation. Specifically, we learn a voice latent manifold where style and content can be manipulated independently by textual style prompts. We propose an adaptive cross-modal fusion mechanism to inject style information into the voice latent representation, enabling interpretable and fine-grained control over speaker identity, speaking rate, and emphasis. Extensive experiments show that our approach significantly outperforms recent baselines in both subjective and objective quality metrics, while offering smoother transitions between styles, reduced artifacts, and more precise text-based style control.
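For context, the generic (discretized) linear state-space recurrence that such latent-dynamics models build on is shown below; how LSS-VC parameterizes $\bar{A}, \bar{B}, C, D$ and injects the textual style condition is specific to the paper and not reproduced here:

$$
\mathbf{h}_{k} = \bar{A}\,\mathbf{h}_{k-1} + \bar{B}\,\mathbf{u}_{k}, \qquad \mathbf{y}_{k} = C\,\mathbf{h}_{k} + D\,\mathbf{u}_{k},
$$

where $\mathbf{u}_k$ is the input latent frame, $\mathbf{h}_k$ the hidden state, and $\mathbf{y}_k$ the output feature.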
https://arxiv.org/abs/2503.20999
Text-driven speech style transfer aims to mold the intonation, pace, and timbre of a spoken utterance to match stylistic cues from text descriptions. While existing methods leverage large-scale neural architectures or pre-trained language models, the computational costs often remain high. In this paper, we present \emph{ReverBERT}, an efficient framework for text-driven speech style transfer that draws inspiration from a state space model (SSM) paradigm, loosely motivated by the image-based method of Wang and Liu~\cite{wang2024stylemamba}. Unlike image domain techniques, our method operates in the speech space and integrates a discrete Fourier transform of latent speech features to enable smooth and continuous style modulation. We also propose a novel \emph{Transformer-based SSM} layer for bridging textual style descriptors with acoustic attributes, dramatically reducing inference time while preserving high-quality speech characteristics. Extensive experiments on benchmark speech corpora demonstrate that \emph{ReverBERT} significantly outperforms baselines in terms of naturalness, expressiveness, and computational efficiency. We release our model and code publicly to foster further research in text-driven speech style transfer.
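A minimal sketch of the "discrete Fourier transform of latent speech features" idea: transform the latent sequence along time, scale the spectrum with a style-conditioned gain, and invert. The gain shape and the choice of a real FFT are assumptions made for illustration.

```python
import torch

def fourier_style_modulation(latent: torch.Tensor, style_gain: torch.Tensor):
    """latent: (B, T, D); style_gain: broadcastable to (B, T // 2 + 1, D).
    Modulates the temporal spectrum of latent speech features."""
    spec = torch.fft.rfft(latent, dim=1)                  # (B, T // 2 + 1, D), complex
    spec = spec * style_gain                              # smooth, continuous modulation
    return torch.fft.irfft(spec, n=latent.size(1), dim=1)
```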
https://arxiv.org/abs/2503.20992