Diffusion models have emerged as the leading approach for style transfer, yet they struggle with photo-realistic transfers, often producing painting-like results or missing detailed stylistic elements. Current methods inadequately suppress two sources of unwanted influence: the original style of the content image and the content features of the style reference. We introduce SCAdapter, a novel technique leveraging CLIP image space to effectively separate and integrate content and style features. Our key innovation is to systematically extract pure content from content images and style elements from style references, ensuring authentic transfers. This approach is enhanced through three components: Controllable Style Adaptive Instance Normalization (CSAdaIN) for precise multi-style blending, KVS Injection for targeted style integration, and a style transfer consistency objective maintaining process coherence. Comprehensive experiments demonstrate that SCAdapter significantly outperforms state-of-the-art methods among both conventional and diffusion-based baselines. By eliminating DDIM inversion and inference-stage optimization, our method achieves at least $2\times$ faster inference than other diffusion-based approaches, making it both more effective and more efficient for practical applications.
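The CSAdaIN component builds on the standard AdaIN operation. As a point of reference, here is a minimal NumPy sketch of plain AdaIN, plus a hypothetical weighted multi-style blend (the `blend_adain` helper and its convex-combination weighting are illustrative assumptions, not the paper's CSAdaIN):

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive Instance Normalization: shift the per-channel mean/std
    of the content features to match those of the style features.
    content, style: feature maps of shape (C, H, W)."""
    c_mu = content.mean(axis=(1, 2), keepdims=True)
    c_sigma = content.std(axis=(1, 2), keepdims=True)
    s_mu = style.mean(axis=(1, 2), keepdims=True)
    s_sigma = style.std(axis=(1, 2), keepdims=True)
    return s_sigma * (content - c_mu) / (c_sigma + eps) + s_mu

def blend_adain(content, styles, weights, eps=1e-5):
    """Hypothetical multi-style blend: a convex combination of AdaIN
    outputs, one per style reference (weights should sum to 1)."""
    out = np.zeros_like(content)
    for s, w in zip(styles, weights):
        out += w * adain(content, s, eps)
    return out
```

After the swap, each output channel carries the style's first- and second-order statistics while the spatial layout of the content is untouched.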
https://arxiv.org/abs/2512.12963
Generating realistic synthetic microscopy images is critical for training deep learning models in label-scarce environments, such as cell counting with many cells per image. However, traditional domain adaptation methods often struggle to bridge the domain gap when synthetic images lack the complex textures and visual patterns of real samples. In this work, we adapt the Inversion-Based Style Transfer (InST) framework originally designed for artistic style transfer to biomedical microscopy images. Our method combines latent-space Adaptive Instance Normalization with stochastic inversion in a diffusion model to transfer the style from real fluorescence microscopy images to synthetic ones, while weakly preserving content structure. We evaluate the effectiveness of our InST-based synthetic dataset for downstream cell counting by pre-training and fine-tuning EfficientNet-B0 models on various data sources, including real data, hard-coded synthetic data, and the public Cell200-s dataset. Models trained with our InST-synthesized images achieve up to 37\% lower Mean Absolute Error (MAE) compared to models trained on hard-coded synthetic data, and a 52\% reduction in MAE compared to models trained on Cell200-s (from 53.70 to 25.95 MAE). Notably, our approach also outperforms models trained on real data alone (25.95 vs. 27.74 MAE). Further improvements are achieved when combining InST-synthesized data with lightweight domain adaptation techniques such as DACS with CutMix. These findings demonstrate that InST-based style transfer most effectively reduces the domain gap between synthetic and real microscopy data. Our approach offers a scalable path for enhancing cell counting performance while minimizing manual labeling effort. The source code and resources are publicly available at: this https URL.
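The "stochastic inversion" step can be read as the standard DDPM forward-noising of a clean latent to an intermediate timestep, after which a style-conditioned denoiser (not shown) produces the stylized result. A minimal sketch under that assumption:

```python
import numpy as np

def stochastic_invert(x0, alpha_bar_t, rng):
    """Stochastically 'invert' a clean latent x0 to timestep t via the
    DDPM forward process: x_t = sqrt(a_t)*x0 + sqrt(1-a_t)*eps.
    Denoising x_t with a style-conditioned model (omitted here) then
    yields a stylized sample that weakly preserves x0's structure."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
```

Smaller `alpha_bar_t` (a deeper inversion) injects more noise, trading content preservation for stronger stylization.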
https://arxiv.org/abs/2512.11763
Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to tackle this challenge. By predicting multiple masked tokens rather than performing holistic refinement, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during the editing process. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that MGT's cross-attention maps provide informative signals for localizing edit-relevant regions and devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained and precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct CrispEdit-2M, a high-resolution dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves comparable performance while enabling 6 times faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.
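Region-hold sampling, as described, keeps low-attention tokens fixed while letting high-attention tokens be resampled. A hedged single-step sketch (the threshold `tau` and the hard accept/hold rule are illustrative assumptions):

```python
import numpy as np

def region_hold_step(tokens, proposal, attn, tau=0.5):
    """One sketched step of 'region-hold sampling': accept newly
    sampled token ids only where the (consolidated) cross-attention
    score exceeds tau; low-attention positions keep their old ids,
    preserving non-target regions.
    tokens, proposal: int token ids, shape (N,); attn: scores in [0, 1]."""
    hold = attn < tau  # low-attention positions are held fixed
    return np.where(hold, tokens, proposal)
```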
https://arxiv.org/abs/2512.11715
We present Lang2Motion, a framework for language-guided point trajectory generation by aligning motion manifolds with joint embedding spaces. Unlike prior work focusing on human motion or video synthesis, we generate explicit trajectories for arbitrary objects using motion extracted from real-world videos via point tracking. Our transformer-based auto-encoder learns trajectory representations through dual supervision: textual motion descriptions and rendered trajectory visualizations, both mapped through CLIP's frozen encoders. Lang2Motion achieves 34.2% Recall@1 on text-to-trajectory retrieval, outperforming video-based methods by 12.5 points, and improves motion accuracy by 33-52% (12.4 ADE vs 18.3-25.3) compared to video generation baselines. We demonstrate 88.3% Top-1 accuracy on human action recognition despite training only on diverse object motions, showing effective transfer across motion domains. Lang2Motion supports style transfer, semantic interpolation, and latent-space editing through CLIP-aligned trajectory representations.
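The Recall@1 retrieval metric reported above can be computed directly; this sketch assumes row-aligned text and trajectory embeddings in a shared CLIP-like space:

```python
import numpy as np

def recall_at_1(text_emb, traj_emb):
    """Recall@1 for text-to-trajectory retrieval in a shared embedding
    space: for each text query i, check whether its nearest trajectory
    (by cosine similarity) is trajectory i. Rows are L2-normalized."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = traj_emb / np.linalg.norm(traj_emb, axis=1, keepdims=True)
    sim = t @ m.T                        # (num_texts, num_trajs)
    top1 = sim.argmax(axis=1)
    return (top1 == np.arange(len(t))).mean()
```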
https://arxiv.org/abs/2512.10617
Motion generation is fundamental to computer animation and widely used across entertainment, robotics, and virtual environments. While recent methods achieve impressive results, most rely on fixed skeletal templates, which prevent them from generalizing to skeletons with different or perturbed topologies. We address the core limitation of current motion generation methods - the combined lack of large-scale heterogeneous animal motion data and unified generative frameworks capable of jointly modeling arbitrary skeletal topologies and textual conditions. To this end, we introduce OmniZoo, a large-scale animal motion dataset spanning 140 species and 32,979 sequences, enriched with multimodal annotations. Building on OmniZoo, we propose a generalized autoregressive motion generation framework capable of producing text-driven motions for arbitrary skeletal topologies. Central to our model is a Topology-aware Skeleton Embedding Module that encodes geometric and structural properties of any skeleton into a shared token space, enabling seamless fusion with textual semantics. Given a text prompt and a target skeleton, our method generates temporally coherent, physically plausible, and semantically aligned motions, and further enables cross-species motion style transfer.
https://arxiv.org/abs/2512.10352
We introduce Stylized Meta-Album (SMA), a new image classification meta-dataset comprising 24 datasets (12 content datasets and 12 stylized datasets), designed to advance studies on out-of-distribution (OOD) generalization and related topics. Created by applying style transfer techniques to 12 subject classification datasets, SMA provides a diverse and extensive set of 4800 groups, combining various subjects (objects, plants, animals, human actions, textures) with multiple styles. While ideal data collection would capture extensive group diversity, practical constraints often make this infeasible. SMA addresses this by enabling large and configurable group structures through flexible control over styles, subject classes, and domains, allowing datasets to reflect a wide range of real-world benchmark scenarios. This design not only expands group and class diversity, but also opens new methodological directions for evaluating model performance across diverse group and domain configurations, including scenarios with many minority groups, varying group imbalance, and complex domain shifts, and for studying fairness, robustness, and adaptation under a broader range of realistic conditions. To demonstrate SMA's effectiveness, we implemented two benchmarks: (1) a novel OOD generalization and group fairness benchmark leveraging SMA's domain, class, and group diversity to evaluate existing algorithms. Our findings reveal that while simple balancing and algorithms utilizing group information remain competitive, as claimed in previous benchmarks, increasing group diversity significantly impacts fairness, altering the superiority and relative rankings of algorithms.
We also propose \textit{Top-M worst group accuracy} as a new hyperparameter tuning metric, which promotes broader fairness during optimization and delivers better final worst-group accuracy as group diversity grows. (2) An unsupervised domain adaptation (UDA) benchmark utilizing SMA's group diversity to evaluate UDA algorithms across more scenarios, offering a more comprehensive benchmark with lower error bars (reduced by 73\% and 28\% in the closed-set and UniDA settings, respectively) compared to existing efforts. These use cases highlight SMA's potential to significantly impact the outcomes of conventional benchmarks.
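The proposed Top-M worst group accuracy metric admits a direct implementation; this sketch assumes per-sample group labels and unweighted averaging over the M worst groups:

```python
import numpy as np

def top_m_worst_group_accuracy(y_true, y_pred, groups, m):
    """Mean accuracy over the M groups with the lowest per-group
    accuracy; m=1 recovers the usual worst-group accuracy."""
    accs = []
    for g in np.unique(groups):
        idx = groups == g
        accs.append((y_true[idx] == y_pred[idx]).mean())
    return float(np.mean(sorted(accs)[:m]))
```

Tuning on this quantity spreads attention over several weak groups instead of only the single worst one.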
https://arxiv.org/abs/2512.09773
Oil painting, as a high-level medium that blends human abstract thinking with artistic expression, poses substantial challenges for digital generation and editing due to its intricate brushstroke dynamics and stylized characteristics. Existing generation and editing techniques are often constrained by the distribution of training data and primarily focus on modifying real photographs. In this work, we introduce a unified multimodal framework for oil painting generation and editing. The proposed system allows users to incorporate reference images for precise semantic control, hand-drawn sketches for spatial structure alignment, and natural language prompts for high-level semantic guidance, while consistently maintaining a unified painting style across all outputs. Our method achieves interactive oil painting creation through three crucial technical advancements. First, we enhance the training stage with spatial alignment and semantic enhancement conditioning strategy, which map masks and sketches into spatial constraints, and encode contextual embedding from reference images and text into feature constraints, enabling object-level semantic alignment. Second, to overcome data scarcity, we propose a self-supervised style transfer pipeline based on Stroke-Based Rendering (SBR), which simulates the inpainting dynamics of oil painting restoration, converting real images into stylized oil paintings with preserved brushstroke textures to construct a large-scale paired training dataset. Finally, during inference, we integrate features using the AdaIN operator to ensure stylistic consistency. Extensive experiments demonstrate that our interactive system enables fine-grained editing while preserving the artistic qualities of oil paintings, achieving an unprecedented level of imagination realization in stylized oil paintings generation and editing.
https://arxiv.org/abs/2512.08534
Deep learning holds immense promise for transforming medical image analysis, yet its clinical generalization remains profoundly limited. A major barrier is data heterogeneity. This is particularly true in Magnetic Resonance Imaging, where scanner hardware differences, diverse acquisition protocols, and varying sequence parameters introduce substantial domain shifts that obscure underlying biological signals. Data harmonization methods aim to reduce this instrumental and acquisition variability, but existing approaches remain insufficient. Image-based harmonization approaches are often restricted by the need for target images, while existing text-guided methods rely on simplistic labels that fail to capture complex acquisition details, or are restricted to datasets with limited variability and thus fail to capture the heterogeneity of real-world clinical environments. To address these limitations, we propose DIST-CLIP (Disentangled Style Transfer with CLIP Guidance), a unified framework for MRI harmonization that flexibly uses either target images or DICOM metadata for guidance. Our framework explicitly disentangles anatomical content from image contrast, with the contrast representations being extracted using pre-trained CLIP encoders. These contrast embeddings are then integrated into the anatomical content via a novel Adaptive Style Transfer module. We trained and evaluated DIST-CLIP on diverse real-world clinical datasets and showed significant improvements over state-of-the-art methods in both style translation fidelity and anatomical preservation, offering a flexible solution for style transfer and for standardizing MRI data. Our code and weights will be made publicly available upon publication.
https://arxiv.org/abs/2512.07674
Generating stable and controllable character motion in real-time is a key challenge in computer animation. Existing methods often fail to provide fine-grained control or suffer from motion degradation over long sequences, limiting their use in interactive applications. We propose COMET, an autoregressive framework that runs in real time, enabling versatile character control and robust long-horizon synthesis. Our efficient Transformer-based conditional VAE allows for precise, interactive control over arbitrary user-specified joints for tasks like goal-reaching and in-betweening from a single model. To ensure long-term temporal stability, we introduce a novel reference-guided feedback mechanism that prevents error accumulation. This mechanism also serves as a plug-and-play stylization module, enabling real-time style transfer. Extensive evaluations demonstrate that COMET robustly generates high-quality motion at real-time speeds, significantly outperforming state-of-the-art approaches in complex motion control tasks and confirming its readiness for demanding interactive applications.
https://arxiv.org/abs/2512.04487
Existing stylized motion generation models have shown their remarkable ability to understand specific style information from the style motion, and insert it into the content motion. However, capturing intra-style diversity, where a single style should correspond to diverse motion variations, remains a significant challenge. In this paper, we propose a clustering-based framework, ClusterStyle, to address this limitation. Instead of learning an unstructured embedding from each style motion, we leverage a set of prototypes to effectively model diverse style patterns across motions belonging to the same style category. We consider two types of style diversity: global-level diversity among style motions of the same category, and local-level diversity within the temporal dynamics of motion sequences. These components jointly shape two structured style embedding spaces, i.e., global and local, optimized via alignment with non-learnable prototype anchors. Furthermore, we augment the pretrained text-to-motion generation model with the Stylistic Modulation Adapter (SMA) to integrate the style features. Extensive experiments demonstrate that our approach outperforms existing state-of-the-art models in stylized motion generation and motion style transfer.
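Alignment with non-learnable prototype anchors can be sketched as a nearest-prototype assignment plus a squared-distance penalty; the hard assignment and mean-squared loss below are illustrative assumptions about the objective, not the paper's exact formulation:

```python
import numpy as np

def prototype_alignment_loss(embeddings, prototypes):
    """Assign each style embedding to its nearest non-learnable
    prototype anchor and penalize the squared distance, pulling
    embeddings of the same style toward a structured set of anchors.
    embeddings: (N, D); prototypes: (K, D)."""
    # pairwise squared distances, shape (N, K)
    d2 = ((embeddings[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    loss = d2[np.arange(len(embeddings)), assign].mean()
    return loss, assign
```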
https://arxiv.org/abs/2512.02453
Domain generalization for semantic segmentation aims to mitigate the degradation in model performance caused by domain shifts. However, in many real-world scenarios, we are unable to access the model parameters and architectural details due to privacy concerns and security constraints. Traditional fine-tuning or adaptation is hindered, leading to the demand for input-level strategies that can enhance generalization without modifying model weights. To this end, we propose a \textbf{S}tyle-\textbf{A}daptive \textbf{GE}neralization framework (\textbf{SAGE}), which improves the generalization of frozen models under privacy constraints. SAGE learns to synthesize visual prompts that implicitly align feature distributions across styles instead of directly fine-tuning the backbone. Specifically, we first utilize style transfer to construct a diverse style representation of the source domain, thereby learning a set of style characteristics that can cover a wide range of visual features. Then, the model adaptively fuses these style cues according to the visual context of each input, forming a dynamic prompt that harmonizes the image appearance without touching the interior of the model. Through this closed-loop design, SAGE effectively bridges the gap between frozen model invariance and the diversity of unseen domains. Extensive experiments on five benchmark datasets demonstrate that SAGE achieves competitive or superior performance compared to state-of-the-art methods under privacy constraints and outperforms full fine-tuning baselines in all settings.
https://arxiv.org/abs/2512.02369
This paper introduces LLM2Fx-Tools, a multimodal tool-calling framework that generates executable sequences of audio effects (Fx-chain) for music post-production. LLM2Fx-Tools uses a large language model (LLM) to understand audio inputs, select audio effects types, determine their order, and estimate parameters, guided by chain-of-thought (CoT) planning. We also present LP-Fx, a new instruction-following dataset with structured CoT annotations and tool calls for audio effects modules. Experiments show that LLM2Fx-Tools can infer an Fx-chain and its parameters from pairs of unprocessed and processed audio, enabled by autoregressive sequence modeling, tool calling, and CoT reasoning. We further validate the system in a style transfer setting, where audio effects information is transferred from a reference source and applied to new content. Finally, LLM-as-a-judge evaluation demonstrates that our approach generates appropriate CoT reasoning and responses for music production queries. To our knowledge, this is the first work to apply LLM-based tool calling to audio effects modules, enabling interpretable and controllable music production.
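An executable Fx-chain is essentially an ordered list of effect calls with parameters. A toy sketch follows; the effect names, parameter schema, and `apply_fx_chain` helper are hypothetical illustrations, not the paper's tool interface:

```python
import numpy as np

def apply_fx_chain(audio, fx_chain):
    """Apply an ordered chain of simple audio effects.
    fx_chain: list of (effect_name, params) pairs, applied in order."""
    effects = {
        # gain in decibels: multiply amplitude by 10^(dB/20)
        "gain": lambda x, p: x * 10 ** (p["db"] / 20.0),
        # hard clipper bounding the signal to [-limit, limit]
        "clip": lambda x, p: np.clip(x, -p["limit"], p["limit"]),
    }
    for name, params in fx_chain:
        audio = effects[name](audio, params)
    return audio
```

Because the chain is ordered data, an LLM can emit it as structured tool calls and the host can execute it deterministically.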
https://arxiv.org/abs/2512.01559
Generative models, such as diffusion and autoregressive approaches, have demonstrated impressive capabilities in editing natural images. However, applying these tools to scientific charts rests on a flawed assumption: that a chart is merely an arrangement of pixels, when it is in fact a visual representation of structured data governed by a graphical grammar. Consequently, chart editing is not a pixel-manipulation task but a structured transformation problem. To address this fundamental mismatch, we introduce \textit{FigEdit}, a large-scale benchmark for scientific figure editing comprising over 30,000 samples. Grounded in real-world data, our benchmark is distinguished by its diversity, covering 10 distinct chart types and a rich vocabulary of complex editing instructions. The benchmark is organized into five distinct and progressively challenging tasks: single edits, multi edits, conversational edits, visual-guidance-based edits, and style transfer. Our evaluation of a range of state-of-the-art models on this benchmark reveals their poor performance on scientific figures, as they consistently fail to handle the underlying structured transformations required for valid edits. Furthermore, our analysis indicates that traditional evaluation metrics (e.g., SSIM, PSNR) have limitations in capturing the semantic correctness of chart edits. Our benchmark demonstrates the profound limitations of pixel-level manipulation and provides a robust foundation for developing and evaluating future structure-aware models. By releasing \textit{FigEdit} (this https URL), we aim to enable systematic progress in structure-aware figure editing, provide a common ground for fair comparison, and encourage future research on models that understand both the visual and semantic layers of scientific charts.
https://arxiv.org/abs/2512.00752
Style transfer, a pivotal task in image processing, synthesizes visually compelling images by seamlessly blending realistic content with artistic styles, enabling applications in photo editing and creative design. While mainstream training-free diffusion-based methods have greatly advanced style transfer in recent years, their reliance on computationally expensive inversion processes compromises efficiency and introduces visual distortions when inversion is inaccurate. To address these limitations, we propose a novel \textit{inversion-free} style transfer framework based on dual rectified flows, which tackles the challenge of finding an unknown stylized distribution from two distinct inputs (content and style images) \textit{with forward passes only}. Our approach predicts content and style trajectories in parallel, then fuses them through a dynamic midpoint interpolation that integrates velocities from both paths while adapting to the evolving stylized image. By jointly modeling the content, style, and stylized distributions, our velocity field design achieves robust fusion and avoids the shortcomings of naive overlays. Attention injection further guides style integration, enhancing visual fidelity, content preservation, and computational efficiency. Extensive experiments demonstrate generalization across diverse styles and content, providing an effective and efficient pipeline for style transfer.
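The forward-only fusion of the two predicted velocity fields can be sketched as Euler integration of a convex combination; a fixed blending weight `lam` stands in here for the paper's dynamic, state-dependent interpolation:

```python
import numpy as np

def fuse_step(x, v_content, v_style, lam, dt):
    """One Euler step along a fused velocity: a convex combination of
    the content and style velocities."""
    v = lam * v_content + (1.0 - lam) * v_style
    return x + dt * v

def integrate(x0, v_content_fn, v_style_fn, lam=0.5, steps=10):
    """Forward-only integration of the fused flow over t in [0, 1]:
    no inversion, just repeated Euler steps from the initial sample."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = fuse_step(x, v_content_fn(x, t), v_style_fn(x, t), lam, dt)
    return x
```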
https://arxiv.org/abs/2511.20986
Styled handwriting generation aims to synthesize handwritten text that looks both realistic and aligned with a specific writer's style. While recent approaches involving GAN, transformer and diffusion-based models have made progress, they often struggle to capture the full spectrum of writer-specific attributes, particularly global stylistic patterns that span long-range spatial dependencies. As a result, capturing subtle writer-specific traits such as consistent slant, curvature or stroke pressure, while keeping the generated text accurate is still an open problem. In this work, we present a unified framework designed to address these limitations. We introduce a Vision Transformer-based style encoder that learns global stylistic patterns from multiple reference images, allowing the model to better represent long-range structural characteristics of handwriting. We then integrate these style cues with the target text using a cross-attention mechanism, enabling the system to produce handwritten images that more faithfully reflect the intended style. To make the process more interpretable, we utilize Salient Stroke Attention Analysis (SSAA), which reveals the stroke-level features the model focuses on during style transfer. Together, these components lead to handwriting synthesis that is not only more stylistically coherent, but also easier to understand and analyze.
https://arxiv.org/abs/2511.18307
The appearance of ultrasound images varies across acquisition devices, causing domain shifts that degrade the performance of fixed black-box downstream inference models when reused. To mitigate this issue, it is practical to develop unpaired image translation (UIT) methods that effectively align the statistical distributions between source and target domains, particularly under the constraint of a reused inference-blackbox setting. However, existing UIT approaches often overlook class-specific semantic alignment during domain adaptation, resulting in misaligned content-class mappings that can impair diagnostic accuracy. To address this limitation, we propose UI-Styler, a novel ultrasound-specific, class-aware image style transfer framework. UI-Styler leverages a pattern-matching mechanism to transfer texture patterns embedded in the target images onto source images while preserving the source structural content. In addition, we introduce a class-aware prompting strategy guided by pseudo labels of the target domain, which enforces accurate semantic alignment with diagnostic categories. Extensive experiments on ultrasound cross-device tasks demonstrate that UI-Styler consistently outperforms existing UIT methods, achieving state-of-the-art performance in distribution distance and downstream tasks, such as classification and segmentation.
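The pattern-matching mechanism can be illustrated as nearest-neighbor matching between source and target feature vectors by cosine similarity; the feature granularity and matching rule here are illustrative assumptions:

```python
import numpy as np

def match_patches(src_feats, tgt_feats):
    """For each source feature vector, pick the most similar (cosine)
    target feature, so target texture patterns can be copied onto
    source positions while source structure fixes the layout.
    src_feats: (N, D); tgt_feats: (M, D). Returns indices into tgt."""
    s = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
    t = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    return (s @ t.T).argmax(axis=1)
```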
https://arxiv.org/abs/2511.17155
Device-guided music transfer adapts playback across unseen devices for users who lack them. Existing methods mainly focus on modifying the timbre, rhythm, harmony, or instrumentation to mimic genres or artists, overlooking the diverse hardware properties of the playback device (i.e., speaker). Therefore, we propose DeMT, which processes a speaker's frequency response curve as a line graph using a vision-language model to extract device embeddings. These embeddings then condition a hybrid transformer via feature-wise linear modulation. Fine-tuned on a self-collected dataset, DeMT enables effective speaker-style transfer and robust few-shot adaptation for unseen devices, supporting applications like device-style augmentation and quality enhancement.
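Feature-wise linear modulation (FiLM) conditions intermediate features on the device embedding via a per-channel scale and shift; a minimal sketch with hypothetical linear heads (`gamma_w`, `beta_w`, and friends are stand-ins for learned parameters):

```python
import numpy as np

def film(features, device_emb, gamma_w, gamma_b, beta_w, beta_b):
    """Feature-wise linear modulation: per-channel scale and shift
    predicted from the device embedding by two linear heads.
    features: (C, T); device_emb: (D,); *_w: (C, D); *_b: (C,)."""
    gamma = gamma_w @ device_emb + gamma_b   # per-channel scale, (C,)
    beta = beta_w @ device_emb + beta_b      # per-channel shift, (C,)
    return gamma[:, None] * features + beta[:, None]
```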
https://arxiv.org/abs/2511.17136
Recent advances in expressive text-to-speech (TTS) have introduced diverse methods based on style embeddings extracted from reference speech. However, synthesizing high-quality expressive speech remains challenging. We propose SpotlightTTS, which exclusively emphasizes style via voiced-aware style extraction and style direction adjustment. Voiced-aware style extraction focuses on voiced regions highly related to style while maintaining continuity across different speech regions to improve expressiveness. We adjust the direction of the extracted style for optimal integration into the TTS model, which improves speech quality. Experimental results demonstrate that SpotlightTTS achieves superior performance compared to baseline models in terms of expressiveness, overall speech quality, and style transfer capability.
https://arxiv.org/abs/2511.14824
Unsupervised remote sensing change detection aims to monitor and analyze changes across multi-temporal remote sensing images of the same geographic region acquired at different times, without the need for labeled training data. Previous unsupervised methods attempt to achieve style transfer across multi-temporal remote sensing images through reconstruction by a generator network, and then capture the unreconstructable areas as the changed regions. However, this often leads to poor performance due to generator overfitting. In this paper, we propose a novel Consistency Change Detection Framework (CCDF) to address this challenge. Specifically, we introduce a Cycle Consistency (CC) module to reduce the overfitting issues in the generator-based reconstruction. Additionally, we propose a Semantic Consistency (SC) module to enable detail reconstruction. Extensive experiments demonstrate that our method outperforms other state-of-the-art approaches.
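The cycle-consistency idea, translating to the other temporal domain and back, reduces to an L1 reconstruction penalty; a minimal sketch with stand-in translation functions:

```python
import numpy as np

def cycle_consistency_loss(x, g_ab, g_ba):
    """L1 cycle-consistency: translating to the other temporal domain
    and back should reconstruct the input; large residuals at test
    time then flag candidate changed regions."""
    return np.abs(g_ba(g_ab(x)) - x).mean()
```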
https://arxiv.org/abs/2511.08904
Anomaly generation has been widely explored to address the scarcity of anomaly images in real-world data. However, existing methods typically suffer from at least one of the following limitations, hindering their practical deployment: (1) lack of visual realism in generated anomalies; (2) dependence on large amounts of real images; and (3) use of memory-intensive, heavyweight model architectures. To overcome these limitations, we propose AnoStyler, a lightweight yet effective method that frames zero-shot anomaly generation as text-guided style transfer. Given a single normal image along with its category label and expected defect type, an anomaly mask indicating the localized anomaly regions and two-class text prompts representing the normal and anomaly states are generated using generalizable category-agnostic procedures. A lightweight U-Net model trained with CLIP-based loss functions is used to stylize the normal image into a visually realistic anomaly image, where anomalies are localized by the anomaly mask and semantically aligned with the text prompts. Extensive experiments on the MVTec-AD and VisA datasets show that AnoStyler outperforms existing anomaly generation methods in generating high-quality and diverse anomaly images. Furthermore, using these generated anomalies helps enhance anomaly detection performance.
https://arxiv.org/abs/2511.06687