Recent conditional image generation methods can improve controllability by generating images that are faithful to conditions such as sketches, human poses, segmentation maps, and depth. By applying these techniques to image augmentation while preserving annotations, generated images can be used as additional training data and can improve recognition performance. However, for high-level driving tasks such as traffic-rule extraction and driving-behavior understanding, simply using annotations as conditions is insufficient. Instead, images must be augmented while preserving the detailed high-level structure of the original scene. One possible solution is to use multiple conditions so that generated images retain diverse structural cues after generation. However, when multiple conditions are used, conflicts among conditions can prevent reliable structure preservation. In this work, we input semantic segmentation, depth, and edges extracted from the original image into a multi-condition image generation model, thereby providing rich structural information as conditions. We further propose a modeling approach for handling conflicts among multiple conditions and show that it enables image generation with stronger structural preservation. We also build a generation framework and evaluation protocol for driving tasks, establishing a basis for comparison with prior and future models. As a result, this work contributes to image generation research by addressing condition conflicts in multi-condition generation and provides an important step toward mitigating data scarcity in high-level autonomous-driving tasks.
https://arxiv.org/abs/2605.09425
Objective: Decoding visual information from electroencephalography (EEG) is an important problem in neuroscience and brain-computer interface (BCI) research. Existing methods are largely restricted to natural images and categorical representations, with limited capacity to capture structural features and to differentiate objective perception from subjective cognition. We propose a Structure-Guided Diffusion Model (SGDM) that incorporates explicit structural information for EEG-based visual reconstruction. Approach: SGDM is evaluated on the Kilogram abstract visual object dataset and the THINGS natural image dataset using a two-stage generative mechanism. The framework combines a structurally supervised variational autoencoder with a spatiotemporal EEG encoder aligned to a visual embedding space via contrastive learning. Structural information is integrated into a diffusion model through ControlNet to guide image generation from EEG features. Results: SGDM outperforms existing methods on both abstract and natural image datasets. Reconstructed images achieve higher fidelity in low-level visual features and semantic representations, indicating improved decoding accuracy and strong generalization across diverse visual domains. Spatiotemporal analysis of EEG signals further reveals hierarchical structural encoding patterns, consistent with the neural dynamics of visual cognition. Significance: These findings validate the effectiveness of SGDM in capturing explicit structural geometry and generating images with high fidelity to individual cognitive representations. By enabling decoding of complex visual content from EEG signals, the framework extends neural decoding beyond low-dimensional or categorical outputs. This supports BCIs with increased degrees of freedom for intention decoding and more flexible brain-to-machine communication.
https://arxiv.org/abs/2604.22649
We present Wan-Image, a unified visual generation system explicitly engineered to paradigm-shift image generation models from casual synthesizers into professional-grade productivity tools. While contemporary diffusion models excel at aesthetic generation, they frequently encounter critical bottlenecks in rigorous design workflows that demand absolute controllability, complex typography rendering, and strict identity preservation. To address these challenges, Wan-Image features a natively unified multi-modal architecture by synergizing the cognitive capabilities of large language models with the high-fidelity pixel synthesis of diffusion transformers, which seamlessly translates highly nuanced user intents into precise visual outputs. It is fundamentally powered by large-scale multi-modal data scaling, a systematic fine-grained annotation engine, and curated reinforcement learning data to surpass basic instruction following and unlock expert-level professional capabilities. These include ultra-long complex text rendering, hyper-diverse portrait generation, palette-guided generation, multi-subject identity preservation, coherent sequential visual generation, precise multi-modal interactive editing, native alpha-channel generation, and high-efficiency 4K synthesis. Across diverse human evaluations, Wan-Image exceeds Seedream 5.0 Lite and GPT Image 1.5 in overall performance, reaching parity with Nano Banana Pro in challenging tasks. Ultimately, Wan-Image revolutionizes visual content creation across e-commerce, entertainment, education, and personal productivity, redefining the boundaries of professional visual synthesis.
https://arxiv.org/abs/2604.19858
While modern text-to-image (T2I) models excel at generating images from intricate prompts, they struggle to capture the key details when the inputs are descriptive paragraphs. This limitation stems from the prevalence of concise captions that shape their training distributions. Existing methods attempt to bridge this gap by either fine-tuning T2I models on long prompts, which generalizes poorly to longer lengths; or by projecting the oversize inputs into normal-prompt space and compromising fidelity. We propose Prompt Refraction for Intricate Scene Modeling (PRISM), a compositional approach that enables pre-trained T2I models to process long sequence inputs. PRISM uses a lightweight module to extract constituent representations from the long prompts. The T2I model makes independent noise predictions for each component, and their outputs are merged into a single denoising step using energy-based conjunction. We evaluate PRISM across a wide range of model architectures, showing comparable performances to models fine-tuned on the same training data. Furthermore, PRISM demonstrates superior generalization, outperforming baseline models by 7.4% on prompts over 500 tokens in a challenging public benchmark.
https://arxiv.org/abs/2604.18258
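The merging step described for PRISM can be illustrated with the conjunction rule from composable diffusion, in which each component's conditional noise prediction adds a guidance term relative to the unconditional prediction. This is a minimal sketch under that assumption only: the abstract does not state PRISM's exact formula, and `compose_noise`, the weights, and the toy vectors are hypothetical.

```python
from typing import List, Sequence


def compose_noise(
    eps_uncond: Sequence[float],
    eps_components: List[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Conjunction-style merge (assumed form):
    eps = eps_uncond + sum_i w_i * (eps_i - eps_uncond)."""
    if len(eps_components) != len(weights):
        raise ValueError("one weight per component is required")
    merged = list(eps_uncond)
    for eps_i, w_i in zip(eps_components, weights):
        for k, (e_c, e_u) in enumerate(zip(eps_i, eps_uncond)):
            # Each component pushes the estimate along its own guidance direction.
            merged[k] += w_i * (e_c - e_u)
    return merged


# Toy example: two prompt components with flattened 3-element "noise" vectors.
eps_uncond = [0.0, 0.0, 0.0]
eps_a = [1.0, 0.0, 0.0]  # hypothetical prediction for component A
eps_b = [0.0, 2.0, 0.0]  # hypothetical prediction for component B
merged = compose_noise(eps_uncond, [eps_a, eps_b], weights=[1.0, 0.5])
```

In a real sampler the vectors would be latent tensors and the merge would run once per denoising step; the plain lists here only keep the sketch dependency-free.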
Modern text-to-image (T2I) models can now render legible, paragraph-length text, enabling a fundamentally new class of misuse. We identify and formalize the inscriptive jailbreak, where an adversary coerces a T2I system into generating images containing harmful textual payloads (e.g., fraudulent documents) embedded within visually benign scenes. Unlike traditional depictive jailbreaks that elicit visually objectionable imagery, inscriptive attacks weaponize the text-rendering capability itself. Because existing jailbreak techniques are designed for coarse visual manipulation, they struggle to bypass multi-stage safety filters while maintaining character-level fidelity. To expose this vulnerability, we propose Etch, a black-box attack framework that decomposes the adversarial prompt into three functionally orthogonal layers: semantic camouflage, visual-spatial anchoring, and typographic encoding. This decomposition reduces joint optimization over the full prompt space to tractable sub-problems, which are iteratively refined through a zero-order loop. In this process, a vision-language model critiques each generated image, localizes failures to specific layers, and prescribes targeted revisions. Extensive evaluations across 7 models on 2 benchmarks demonstrate that Etch achieves an average attack success rate of 65.57% (peaking at 91.00%), significantly outperforming existing baselines. Our results reveal a critical blind spot in current T2I safety alignments and underscore the urgent need for typography-aware multimodal defense mechanisms.
https://arxiv.org/abs/2604.05853
Diffusion models have demonstrated impressive image synthesis performance, yet many UNet-based models are trained at certain fixed resolutions. Their quality tends to degrade when generating images at out-of-training resolutions. We trace this issue to resolution-dependent parameter behaviors, where weights that function well at the default resolution can become adverse when spatial scales shift, weakening semantic alignment and causing structural instability in the UNet architecture. Based on this analysis, this paper introduces CR-Diff, a novel method that improves the cross-resolution visual consistency by pruning some parameters of the diffusion model. Specifically, CR-Diff has two stages. It first performs block-wise pruning to selectively eliminate adverse weights. Then, a pruned output amplification is conducted to further purify the pruned predictions. Empirically, extensive experiments suggest that CR-Diff can improve perceptual fidelity and semantic coherence across various diffusion backbones and unseen resolutions, while largely preserving the performance at default resolutions. Additionally, CR-Diff supports prompt-specific refinement, enabling quality enhancement on demand.
https://arxiv.org/abs/2604.05524
Humans paint images incrementally: they plan a global layout, sketch a coarse draft, inspect, and refine details, and most importantly, each step is grounded in the evolving visual states. However, can unified multimodal models trained on text-image interleaved datasets also imagine the chain of intermediate states? In this paper, we introduce process-driven image generation, a multi-step paradigm that decomposes synthesis into an interleaved reasoning trajectory of thoughts and actions. Rather than generating images in a single step, our approach unfolds across multiple iterations, each consisting of 4 stages: textual planning, visual drafting, textual reflection, and visual refinement. The textual reasoning explicitly conditions how the visual state should evolve, while the generated visual intermediate in turn constrains and grounds the next round of textual reasoning. A core challenge of process-driven generation stems from the ambiguity of intermediate states: how can models evaluate each partially complete image? We address this through dense, step-wise supervision that maintains two complementary constraints: for the visual intermediate states, we enforce spatial and semantic consistency; for the textual intermediate states, we preserve the prior visual knowledge while enabling the model to identify and correct prompt-violating elements. This makes the generation process explicit, interpretable, and directly supervisable. To validate the proposed method, we conduct experiments on various text-to-image generation benchmarks.
https://arxiv.org/abs/2604.04746
Portrait composition plays a central role in portrait aesthetics and visual communication, yet existing datasets and benchmarks mainly focus on coarse aesthetic scoring, generic image aesthetics, or unconstrained portrait generation. This limits systematic research on structured portrait composition analysis and controllable portrait generation under explicit composition requirements. In this paper, we introduce PortraitCraft, a unified benchmark for portrait composition understanding and generation. PortraitCraft is built on a dataset of approximately 50,000 curated real portrait images with structured multi-level supervision, including global composition scores, annotations over 13 composition attributes, attribute-level explanation texts, visual question answering pairs, and composition-oriented textual descriptions for generation. Based on this dataset, we establish two complementary benchmark tasks for composition understanding and composition-aware generation within a unified framework. The first evaluates portrait composition understanding through score prediction, fine-grained attribute reasoning, and image-grounded visual question answering, while the second evaluates portrait generation from structured composition descriptions under explicit composition constraints. We further define standardized evaluation protocols and provide reference baseline results with representative multimodal models. PortraitCraft provides a comprehensive benchmark for future research on fine-grained portrait understanding, interpretable aesthetic assessment, and controllable portrait generation.
https://arxiv.org/abs/2604.03611
Accurately generating images across the Tree of Life is difficult: there are over 10M distinct species on Earth, many of which differ only by subtle visual traits. Despite the remarkable progress in text-to-image synthesis, existing models often fail to capture the fine-grained visual cues that define species identity, even when their outputs appear photo-realistic. To this end, we propose TaxaAdapter, a simple and lightweight approach that incorporates Vision Taxonomy Models (VTMs) such as BioCLIP to guide fine-grained species generation. Our method injects VTM embeddings into a frozen text-to-image diffusion model, improving species-level fidelity while preserving flexible text control over attributes such as pose, style, and background. Extensive experiments demonstrate that TaxaAdapter consistently improves morphology fidelity and species-identity accuracy over strong baselines, with a cleaner architecture and training recipe. To better evaluate these improvements, we also introduce a multimodal Large Language Model-based metric that summarizes trait-level descriptions from generated and real images, providing a more interpretable measure of morphological consistency. Beyond this, we observe that TaxaAdapter exhibits strong generalization capabilities, enabling species synthesis in challenging regimes such as few-shot species with only a handful of training images and even species unseen during training. Overall, our results highlight that VTMs are a key ingredient for scalable, fine-grained species generation.
https://arxiv.org/abs/2603.26128
Generating images conditioned on multiple visual references is critical for real-world applications such as multi-subject composition, narrative illustration, and novel view synthesis, yet current models suffer from severe performance degradation as the number of input references grows. We identify the root cause as a fundamental data bottleneck: existing datasets are dominated by single- or few-reference pairs and lack the structured, long-context supervision needed to learn dense inter-reference dependencies. To address this, we introduce MacroData, a large-scale dataset of 400K samples, each containing up to 10 reference images, systematically organized across four complementary dimensions -- Customization, Illustration, Spatial reasoning, and Temporal dynamics -- to provide comprehensive coverage of the multi-reference generation space. Recognizing the concurrent absence of standardized evaluation protocols, we further propose MacroBench, a benchmark of 4,000 samples that assesses generative coherence across graded task dimensions and input scales. Extensive experiments show that fine-tuning on MacroData yields substantial improvements in multi-reference generation, and ablation studies further reveal synergistic benefits of cross-task co-training and effective strategies for handling long-context complexity. The dataset and benchmark will be publicly released.
https://arxiv.org/abs/2603.25319
While diffusion models excel at generating images with conventional dimensions, pushing them to synthesize ultra-high-resolution imagery at extreme aspect ratios (EAR) often triggers catastrophic structural failures, such as object repetition and spatial [...]. This limitation fundamentally stems from a lack of robust spatial priors, as static text-to-image models are primarily trained on image distributions with conventional dimensions. To overcome this bottleneck, we present ScrollScape, a novel framework that reformulates EAR image synthesis into a continuous video generation process through two core [...]. By mapping the spatial expansion of a massive canvas to the temporal evolution of video frames, ScrollScape leverages the inherent temporal consistency of video models as a powerful global constraint to ensure long-range structural [...]. Scanning Positional Encoding (ScanPE) distributes global coordinates across frames to act as a flexible moving camera, while Scrolling Super-Resolution (ScrollSR) leverages video super-resolution priors to circumvent memory bottlenecks, efficiently scaling outputs to an unprecedented 32K resolution. Fine-tuned on a curated 3K multi-ratio image dataset, ScrollScape effectively aligns pre-trained video priors with the EAR generation task. Extensive evaluations demonstrate that it significantly outperforms existing image-diffusion baselines by eliminating severe localized artifacts. Consequently, our method overcomes inherent structural bottlenecks to ensure exceptional global coherence and visual fidelity across diverse domains at extreme scales.
https://arxiv.org/abs/2603.24270
Multimodal sarcasm detection has recently garnered significant attention. However, existing benchmarks suffer from coarse-grained annotations and limited cultural coverage, which hinder research into fine-grained semantic understanding. To address this, we construct CFMS, the first fine-grained multimodal sarcasm dataset tailored for Chinese social media. It comprises 2,796 high-quality image-text pairs and provides a triple-level annotation framework: sarcasm identification, target recognition, and explanation generation. We find that the fine-grained explanation annotations effectively guide AI in generating images with explicit sarcastic intent. Furthermore, we curate a high-consistency parallel Chinese-English metaphor subset (200 entries each), revealing significant limitations of current models in metaphoric reasoning. To overcome the constraints of traditional retrieval methods, we propose a Reinforcement Learning-augmented In-Context Learning strategy (PGDS) to dynamically optimize exemplar selection. Extensive experiments demonstrate that CFMS provides a solid foundation for building reliable multimodal sarcasm understanding systems, and the PGDS method significantly outperforms existing baselines on key tasks. Our data and code are available online.
https://arxiv.org/abs/2604.16372
In this work, we present EchoGen, a unified framework for layout-to-image generation and image grounding, capable of generating images with accurate layouts and high fidelity to text descriptions (e.g., spatial relationships), while grounding the image robustly at the same time. We believe that image grounding possesses strong text and layout understanding abilities, which can compensate for the corresponding limitations in layout-to-image generation. At the same time, images generated from layouts exhibit high diversity in content, thereby enhancing the robustness of image grounding. Jointly training both tasks within a unified model can promote performance improvements for each. However, we identify that this joint training paradigm encounters several optimization challenges and results in restricted performance. To address these issues, we propose progressive training strategies. First, the Parallel Multi-Task Pre-training (PMTP) stage equips the model with basic abilities for both tasks, leveraging shared tokens to accelerate training. Next, the Dual Joint Optimization (DJO) stage exploits task duality to sequentially integrate the two tasks, enabling unified optimization. Finally, the Cycle RL stage eliminates reliance on visual supervision by using consistency constraints as rewards, significantly enhancing the model's unified capabilities via the GRPO strategy. Extensive experiments demonstrate state-of-the-art results on both layout-to-image generation and image grounding benchmarks, and reveal clear synergistic gains from optimizing the two tasks together.
https://arxiv.org/abs/2603.18001
Unified multimodal models share a language model backbone for both understanding and generating images. Can DPO align both capabilities simultaneously? We present the first systematic study of this question, applying DPO to Janus-Pro at 1B and 7B parameters under seven training strategies and two post-hoc methods. The central finding is negative: generation quality resists DPO alignment across all tested conditions on this architecture. No method improves generation CLIPScore at 7B (|Delta| < 0.2, p > 0.5 at n=200 per seed, 3 seeds); at 1B, all methods degrade generation, and the result holds across preference data types (real-vs-generated and model-vs-model) and the data volumes tested (150-288 pairs). Gradient analysis reveals why: understanding and generation gradients are near-orthogonal (cos ~ 0) with ~11-14x magnitude imbalance driven by VQ token count asymmetry (576 generation tokens vs. ~30-100 text tokens). This imbalance is the dominant interference mechanism in multi-task DPO; magnitude-balancing yields directionally positive understanding deltas (+0.01-0.04 VQA, though individually not significant), but the generation gap persists regardless. We identify discrete VQ tokenization as a likely structural bottleneck -- supported by the generation DPO loss converging to ln(2) -- and provide practical guidance for practitioners working with VQ-based unified models.
https://arxiv.org/abs/2603.17044
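The ln(2) observation above follows directly from the form of the DPO objective: when the policy ranks the chosen and rejected samples no differently than the reference model does, the preference margin is zero and the loss equals -log(sigmoid(0)) = ln(2). The sketch below works through that arithmetic and the gradient-conflict diagnostics (cosine similarity and norm ratio). The function names and the toy gradient vectors are hypothetical and are not taken from the paper's code.

```python
import math
from typing import Sequence


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) == log(1 + exp(-m)); fine for the moderate margins used here.
    return math.log1p(math.exp(-margin))


def norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


# A policy identical to the reference has zero margin, so the loss sits at ln(2).
stuck_loss = dpo_loss(-120.0, -130.0, -120.0, -130.0)

# Toy gradients (hypothetical values): orthogonal directions, 12x norm imbalance.
g_understanding = [1.0, 0.0]
g_generation = [0.0, 12.0]
similarity = cosine(g_understanding, g_generation)
imbalance = norm(g_generation) / norm(g_understanding)
```

A loss that stays at ln(2) throughout training therefore indicates that the policy never separates chosen from rejected generations relative to the reference.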
Text-to-image diffusion models excel at generating images from natural language descriptions, yet fail to interpret numerical colors such as hex codes (#FF5733) and RGB values (rgb(255,87,51)). This limitation stems from subword tokenization, which fragments color codes into semantically meaningless tokens that text encoders cannot map to coherent color representations. We present NumColor, a method that enables precise numerical color control across multiple diffusion architectures. NumColor comprises two components: a Color Token Aggregator that detects color specifications regardless of tokenization, and a ColorBook of 6,707 learnable embeddings that maps colors, represented in the perceptually uniform CIE Lab space, to the text encoder's embedding space. We introduce two auxiliary losses, directional alignment and interpolation consistency, to enforce geometric correspondence between the Lab and embedding spaces, enabling smooth color interpolation. To train the ColorBook, we construct NumColor-Data, a synthetic dataset of 500K rendered images with unambiguous color-to-pixel correspondence, eliminating the annotation ambiguity inherent in photographic datasets. Although trained solely on FLUX, NumColor transfers zero-shot to SD3, SD3.5, PixArt-α, and PixArt-Σ without model-specific adaptation. NumColor improves numerical color accuracy by 4-9x across five models, while simultaneously improving color harmony scores by 10-30x on the GenColorBench benchmark.
https://arxiv.org/abs/2603.13547
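To make the color-handling idea concrete, the sketch below finds hex and `rgb(...)` specifications in a prompt string and converts them to CIE Lab using the standard sRGB (D65) formulas. It is a string-level stand-in written for illustration, not NumColor's Color Token Aggregator, which the abstract describes as operating on tokens; `find_colors` and `srgb_to_lab` are hypothetical names.

```python
import re
from typing import List, Tuple

# Matches "#RRGGBB" or "rgb(r, g, b)"; three-digit hex codes are not handled.
COLOR_RE = re.compile(
    r"#([0-9a-fA-F]{6})\b|rgb\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)"
)


def find_colors(prompt: str) -> List[Tuple[int, int, int]]:
    """Return every color in the prompt as an (R, G, B) triple, in order of appearance."""
    colors = []
    for m in COLOR_RE.finditer(prompt):
        if m.group(1):
            h = m.group(1)
            colors.append((int(h[0:2], 16), int(h[2:4], 16), int(h[4:6], 16)))
        else:
            colors.append((int(m.group(2)), int(m.group(3)), int(m.group(4))))
    return colors


def srgb_to_lab(rgb: Tuple[int, int, int]) -> Tuple[float, float, float]:
    """Convert 8-bit sRGB to CIE Lab with the D65 reference white."""
    def to_linear(c: int) -> float:
        v = c / 255.0
        return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4

    r, g, b = (to_linear(c) for c in rgb)
    # Linear sRGB -> CIE XYZ (D65).
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t: float) -> float:
        delta = 6.0 / 29.0
        return t ** (1.0 / 3.0) if t > delta ** 3 else t / (3 * delta ** 2) + 4.0 / 29.0

    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return (116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz))


found = find_colors("a #FF5733 car next to an rgb(255,87,51) door")
lab = srgb_to_lab(found[0])
```

Both notations in the example prompt resolve to the same RGB triple, so they map to the same Lab point; a lookup keyed on Lab coordinates would treat them identically.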
Diffusion Transformers (DiTs) struggle to generate images at resolutions higher than their training resolution, most notably suffering structural degradation caused by attention dilution. Previous approaches attempt to mitigate this by sharpening attention distributions, but fail to preserve fine-grained semantic details and introduce obvious artifacts. In this work, we analyze the characteristics of DiTs and propose TIDE, a training-free text-to-image (T2I) extrapolation method that enables generation at arbitrary resolutions and aspect ratios without additional sampling overhead. We identify the core factor behind prompt information loss and introduce a text anchoring mechanism to correct the imbalance between text and image tokens. To further eliminate artifacts, we design a dynamic temperature control mechanism that leverages the pattern of spectral progression in the diffusion process. Extensive evaluations demonstrate that TIDE delivers high-quality resolution extrapolation and integrates seamlessly with existing state-of-the-art methods.
https://arxiv.org/abs/2603.08928
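The text-image token imbalance mentioned above can be seen with simple softmax arithmetic: if the number of image tokens grows while the text token count stays fixed, the share of attention that lands on text tokens shrinks. The sketch below is a toy illustration under strong assumptions (one query, two logit values, hypothetical token counts) and does not reproduce TIDE's text anchoring or temperature schedule.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with a temperature divisor."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def text_attention_mass(
    n_text: int,
    n_image: int,
    text_logit: float = 1.0,
    image_logit: float = 0.0,
    temperature: float = 1.0,
) -> float:
    """Fraction of one query's attention that falls on the text tokens."""
    probs = softmax([text_logit] * n_text + [image_logit] * n_image, temperature)
    return sum(probs[:n_text])


# Hypothetical counts: 77 text tokens; image tokens quadruple when each side doubles.
mass_train = text_attention_mass(77, 1024)
mass_extrapolated = text_attention_mass(77, 4096)
# A temperature below 1 sharpens the distribution and recovers some text mass.
mass_sharpened = text_attention_mass(77, 4096, temperature=0.5)
```

Uniform sharpening of this kind is the baseline strategy the abstract says falls short on fine-grained details, which is why the paper pairs a text-specific correction with a temperature that changes over the diffusion process.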
While state-of-the-art audio-video generation models like Veo3 and Sora2 demonstrate remarkable capabilities, their closed-source nature makes their architectures and training paradigms inaccessible. To bridge this gap in accessibility and performance, we introduce UniTalking, a unified, end-to-end diffusion framework for generating high-fidelity speech and lip-synchronized video. At its core, our framework employs Multi-Modal Transformer Blocks to explicitly model the fine-grained temporal correspondence between audio and video latent tokens via a shared self-attention mechanism. By leveraging powerful priors from a pre-trained video generation model, our framework ensures state-of-the-art visual fidelity while enabling efficient training. Furthermore, UniTalking incorporates a personalized voice cloning capability, allowing the generation of speech in a target style from a brief audio reference. Qualitative and quantitative results demonstrate that our method produces highly realistic talking portraits, achieving superior performance over existing open-source approaches in lip-sync accuracy, audio naturalness, and overall perceptual quality.
https://arxiv.org/abs/2603.01418
Generative models have been shown to "memorize" certain training data, producing verbatim or near-verbatim copies of training images, which may raise privacy concerns or lead to copyright infringement. We introduce Guidance Using Attractive-Repulsive Dynamics (GUARD), a novel framework for memorization mitigation in text-to-image diffusion models. GUARD adjusts the image denoising process to guide generation away from an original training image and towards one that is distinct from the training data while remaining aligned with the prompt, guarding against reproduction of training data without hurting image generation quality. We propose a concrete instantiation of this framework, in which the positive target that we steer towards is given by a novel method for cross-attention attenuation based on (i) a novel statistical mechanism that automatically identifies the prompt positions where cross-attention must be attenuated and (ii) attenuating cross-attention at these per-prompt locations. The resulting GUARD offers a surgical, dynamic, per-prompt inference-time approach that, we find, is by far the most robust method in terms of consistently producing state-of-the-art results for memorization mitigation across two architectures and for both verbatim and template memorization, while also improving upon or yielding comparable results in terms of image quality.
https://arxiv.org/abs/2603.00133
While diffusion models have shown great potential in portrait generation, generating expressive, coherent, and controllable cinematic portrait videos remains a significant challenge. Existing intermediate signals for portrait generation, such as 2D landmarks and parametric models, have limited disentanglement capabilities and cannot express personalized details due to their sparse or low-rank representation. Therefore, existing methods based on these models struggle to accurately preserve subject identity and expressions, hindering the generation of highly expressive portrait videos. To overcome these limitations, we propose a high-fidelity personalized head representation that more effectively disentangles expression and identity. This representation captures both static, subject-specific global geometry and dynamic, expression-related details. Furthermore, we introduce an expression transfer module to achieve personalized transfer of head pose and expression details between different identities. We use this sophisticated and highly expressive head model as a conditional signal to train a diffusion transformer (DiT)-based generator to synthesize richly detailed portrait videos. Extensive experiments on self- and cross-reenactment tasks demonstrate that our method outperforms previous models in terms of identity preservation, expression accuracy, and temporal stability, particularly in capturing fine-grained details of complex motion.
https://arxiv.org/abs/2602.19900
Previous studies on visual customization primarily rely on the objective alignment between various control signals (e.g., language, layout, and Canny edges) and the edited images. They largely ignore subjective emotional content and, more importantly, lack general-purpose foundation models for affective visual customization. With this in mind, this paper proposes an LLM-centric Affective Visual Customization (L-AVC) task, which focuses on generating images while modifying their subjective emotions via a multimodal LLM. Further, this paper contends that two problems are both important and challenging in the L-AVC task: how to make the model efficiently align the semantic conversion between emotions (named inter-emotion semantic conversion) and how to precisely retain emotion-agnostic content (named exter-emotion semantic retaining). To this end, this paper proposes an Efficient and Precise Emotion Manipulating (EPEM) approach for editing subjective emotions in images. Specifically, an Efficient Inter-emotion Converting (EIC) module is tailored to make the LLM efficiently align emotion semantics before and after editing, followed by a Precise Exter-emotion Retaining (PER) module that precisely retains the emotion-agnostic content. Comprehensive experimental evaluations on our constructed L-AVC dataset demonstrate a clear advantage of the proposed EPEM approach over several state-of-the-art baselines on the L-AVC task. This justifies the importance of emotion information for L-AVC and the effectiveness of EPEM in efficiently and precisely manipulating such information.
https://arxiv.org/abs/2602.18016