We present Wan-Image, a unified visual generation system engineered to turn image generation models from casual synthesizers into professional-grade productivity tools. While contemporary diffusion models excel at aesthetic generation, they frequently encounter critical bottlenecks in rigorous design workflows that demand precise controllability, complex typography rendering, and strict identity preservation. To address these challenges, Wan-Image adopts a natively unified multi-modal architecture that combines the cognitive capabilities of large language models with the high-fidelity pixel synthesis of diffusion transformers, translating nuanced user intents into precise visual outputs. It is powered by large-scale multi-modal data scaling, a systematic fine-grained annotation engine, and curated reinforcement learning data, which together take it beyond basic instruction following to expert-level professional capabilities. These include ultra-long complex text rendering, hyper-diverse portrait generation, palette-guided generation, multi-subject identity preservation, coherent sequential visual generation, precise multi-modal interactive editing, native alpha-channel generation, and high-efficiency 4K synthesis. Across diverse human evaluations, Wan-Image exceeds Seedream 5.0 Lite and GPT Image 1.5 in overall performance and reaches parity with Nano Banana Pro on challenging tasks. Wan-Image is aimed at visual content creation across e-commerce, entertainment, education, and personal productivity.
https://arxiv.org/abs/2604.19858
Portrait composition plays a central role in portrait aesthetics and visual communication, yet existing datasets and benchmarks mainly focus on coarse aesthetic scoring, generic image aesthetics, or unconstrained portrait generation. This limits systematic research on structured portrait composition analysis and controllable portrait generation under explicit composition requirements. In this paper, we introduce PortraitCraft, a unified benchmark for portrait composition understanding and generation. PortraitCraft is built on a dataset of approximately 50,000 curated real portrait images with structured multi-level supervision, including global composition scores, annotations over 13 composition attributes, attribute-level explanation texts, visual question answering pairs, and composition-oriented textual descriptions for generation. Based on this dataset, we establish two complementary benchmark tasks for composition understanding and composition-aware generation within a unified framework. The first evaluates portrait composition understanding through score prediction, fine-grained attribute reasoning, and image-grounded visual question answering, while the second evaluates portrait generation from structured composition descriptions under explicit composition constraints. We further define standardized evaluation protocols and provide reference baseline results with representative multimodal models. PortraitCraft provides a comprehensive benchmark for future research on fine-grained portrait understanding, interpretable aesthetic assessment, and controllable portrait generation.
https://arxiv.org/abs/2604.03611
While state-of-the-art audio-video generation models like Veo3 and Sora2 demonstrate remarkable capabilities, their closed-source nature makes their architectures and training paradigms inaccessible. To bridge this gap in accessibility and performance, we introduce UniTalking, a unified, end-to-end diffusion framework for generating high-fidelity speech and lip-synchronized video. At its core, our framework employs Multi-Modal Transformer Blocks to explicitly model the fine-grained temporal correspondence between audio and video latent tokens via a shared self-attention mechanism. By leveraging powerful priors from a pre-trained video generation model, our framework ensures state-of-the-art visual fidelity while enabling efficient training. Furthermore, UniTalking incorporates a personalized voice cloning capability, allowing the generation of speech in a target style from a brief audio reference. Qualitative and quantitative results demonstrate that our method produces highly realistic talking portraits, achieving superior performance over existing open-source approaches in lip-sync accuracy, audio naturalness, and overall perceptual quality.
https://arxiv.org/abs/2603.01418
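As an illustration of the shared self-attention idea in the UniTalking abstract above, the following minimal sketch concatenates audio and video latent tokens into one sequence and runs a single attention pass over it, so every token can attend across modalities. It is not the authors' implementation: the function name `joint_self_attention`, the single head, and the identity query/key/value projections are all assumptions made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(audio_tokens, video_tokens):
    """Single-head self-attention over the concatenated audio+video sequence.

    Assumption: identity projections for queries, keys, and values; a real
    block would use learned projections and multiple heads.
    """
    tokens = audio_tokens + video_tokens  # one shared sequence
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in tokens:
        # Each token scores against every token from both modalities.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    n_audio = len(audio_tokens)
    return outputs[:n_audio], outputs[n_audio:]


audio_out, video_out = joint_self_attention([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]])
```

Because each output is a weighted average over the joint sequence, the audio outputs depend on the video tokens and vice versa, which is the property the abstract attributes to the shared mechanism.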
While diffusion models have shown great potential in portrait generation, generating expressive, coherent, and controllable cinematic portrait videos remains a significant challenge. Existing intermediate signals for portrait generation, such as 2D landmarks and parametric models, have limited disentanglement capabilities and cannot express personalized details due to their sparse or low-rank representation. Therefore, existing methods based on these models struggle to accurately preserve subject identity and expressions, hindering the generation of highly expressive portrait videos. To overcome these limitations, we propose a high-fidelity personalized head representation that more effectively disentangles expression and identity. This representation captures both static, subject-specific global geometry and dynamic, expression-related details. Furthermore, we introduce an expression transfer module to achieve personalized transfer of head pose and expression details between different identities. We use this sophisticated and highly expressive head model as a conditional signal to train a diffusion transformer (DiT)-based generator to synthesize richly detailed portrait videos. Extensive experiments on self- and cross-reenactment tasks demonstrate that our method outperforms previous models in terms of identity preservation, expression accuracy, and temporal stability, particularly in capturing fine-grained details of complex motion.
https://arxiv.org/abs/2602.19900
Achieving a balance between high-fidelity visual quality and low-latency streaming remains a formidable challenge in audio-driven portrait generation. Existing large-scale models often suffer from prohibitive computational costs, while lightweight alternatives typically compromise on holistic facial representations and temporal stability. In this paper, we propose SoulX-FlashHead, a unified 1.3B-parameter framework designed for real-time, infinite-length, and high-fidelity streaming video generation. To address the instability of audio features in streaming scenarios, we introduce Streaming-Aware Spatiotemporal Pre-training equipped with a Temporal Audio Context Cache mechanism, which ensures robust feature extraction from short audio fragments. Furthermore, to mitigate the error accumulation and identity drift inherent in long-sequence autoregressive generation, we propose Oracle-Guided Bidirectional Distillation, leveraging ground-truth motion priors to provide precise physical guidance. We also present VividHead, a large-scale, high-quality dataset containing 782 hours of strictly aligned footage to support robust training. Extensive experiments demonstrate that SoulX-FlashHead achieves state-of-the-art performance on HDTF and VFHQ benchmarks. Notably, our Lite variant achieves an inference speed of 96 FPS on a single NVIDIA RTX 4090, facilitating ultra-fast interaction without sacrificing visual coherence.
https://arxiv.org/abs/2602.07449
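The SoulX-FlashHead abstract above does not spell out how its Temporal Audio Context Cache works. One plausible reading, sketched here purely as an assumption, is a rolling buffer that prepends recently seen audio frames to each short incoming chunk so that feature extraction sees a longer window. The class name `AudioContextCache` and its method are hypothetical.

```python
from collections import deque


class AudioContextCache:
    """Rolling buffer of recent audio frames (hypothetical sketch)."""

    def __init__(self, max_frames):
        # deque with maxlen silently drops the oldest frames when full.
        self._frames = deque(maxlen=max_frames)

    def extend_with_context(self, chunk):
        """Return cached context followed by the new chunk, then cache the chunk."""
        window = list(self._frames) + list(chunk)
        self._frames.extend(chunk)
        return window


cache = AudioContextCache(max_frames=3)
first_window = cache.extend_with_context([1, 2])   # no history yet
second_window = cache.extend_with_context([3, 4])  # history [1, 2] is prepended
```

A fixed `max_frames` keeps memory and latency bounded regardless of stream length, which matters for the infinite-length streaming setting the abstract describes.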
Benefiting from the significant advancements in text-to-image diffusion models, research in personalized image generation, particularly customized portrait generation, has also made great strides recently. However, existing methods either require time-consuming fine-tuning and lack generalizability or fail to achieve high fidelity in facial details. To address these issues, we propose FaceSnap, a novel method based on Stable Diffusion (SD) that requires only a single reference image and produces extremely consistent results in a single inference stage. This method is plug-and-play and can be easily extended to different SD models. Specifically, we design a new Facial Attribute Mixer that can extract comprehensive fused information from both low-level specific features and high-level abstract features, providing better guidance for image generation. We also introduce a Landmark Predictor that maintains reference identity across landmarks with different poses, providing diverse yet detailed spatial control conditions for image generation. Then we use an ID-preserving module to inject these into the UNet. Experimental results demonstrate that our approach performs remarkably in personalized and customized portrait generation, surpassing other state-of-the-art methods in this domain.
https://arxiv.org/abs/2602.00627
Recent advancements in diffusion-based technologies have made significant strides, particularly in identity-preserved portrait generation (IPG). However, when using multiple reference images from the same ID, existing methods typically produce lower-fidelity portraits and struggle to customize face attributes precisely. To address these issues, this paper presents HiFi-Portrait, a high-fidelity method for zero-shot portrait generation. Specifically, we first introduce the face refiner and landmark generator to obtain fine-grained multi-face features and 3D-aware face landmarks. The landmarks include the reference ID and the target attributes. Then, we design HiFi-Net to fuse multi-face features and align them with landmarks, which improves ID fidelity and face control. In addition, we devise an automated pipeline to construct an ID-based dataset for training HiFi-Portrait. Extensive experimental results demonstrate that our method surpasses the SOTA approaches in face similarity and controllability. Furthermore, our method is also compatible with previous SDXL-based works.
https://arxiv.org/abs/2512.14542
Personalized dual-person portrait customization has considerable potential applications, such as preserving emotional memories and facilitating wedding photography planning. However, the absence of a benchmark dataset hinders the pursuit of high-quality customization in dual-person portrait generation. In this paper, we propose the PairHuman dataset, the first large-scale benchmark dataset specifically designed for generating dual-person portraits that meet high photographic standards. The PairHuman dataset contains more than 100K images that capture a variety of scenes, attire, and dual-person interactions, along with rich metadata, including detailed image descriptions, person localization, human keypoints, and attribute tags. We also introduce DHumanDiff, a baseline specifically crafted for dual-person portrait generation that features enhanced facial consistency and balances personalized person generation with semantic-driven scene creation. Finally, the experimental results demonstrate that our dataset and method produce highly customized portraits with superior visual quality that are tailored to human preferences. Our dataset is publicly available at this https URL.
https://arxiv.org/abs/2511.16712
Unlike existing methods that rely on source images as appearance references and use source speech to generate motion, this work proposes a novel approach that directly extracts information from the speech, addressing key challenges in speech-to-talking face. Specifically, we first employ a speech-to-face portrait generation stage, utilizing a speech-conditioned diffusion model combined with statistical facial prior and a sample-adaptive weighting module to achieve high-quality portrait generation. In the subsequent speech-driven talking face generation stage, we embed expressive dynamics such as lip movement, facial expressions, and eye movements into the latent space of the diffusion model and further optimize lip synchronization using a region-enhancement module. To generate high-resolution outputs, we integrate a pre-trained Transformer-based discrete codebook with an image rendering network, enhancing video frame details in an end-to-end manner. Experimental results demonstrate that our method outperforms existing approaches on the HDTF, VoxCeleb, and AVSpeech datasets. Notably, this is the first method capable of generating high-resolution, high-quality talking face videos exclusively from a single speech input.
https://arxiv.org/abs/2510.26819
We introduce TurboPortrait3D: a method for low-latency novel-view synthesis of human portraits. Our approach builds on the observation that existing image-to-3D models for portrait generation, while capable of producing renderable 3D representations, are prone to visual artifacts, often lack detail, and tend to fail at fully preserving the identity of the subject. On the other hand, image diffusion models excel at generating high-quality images, but besides being computationally expensive, they are not grounded in 3D and thus cannot directly produce multi-view consistent outputs. In this work, we demonstrate that image-space diffusion models can be used to significantly enhance the quality of existing image-to-avatar methods, while maintaining 3D-awareness and running with low latency. Our method takes a single frontal image of a subject as input and applies a feedforward image-to-avatar generation pipeline to obtain an initial 3D representation and corresponding noisy renders. These noisy renders are then fed to a single-step diffusion model that is conditioned on the input image(s) and is specifically trained to refine the renders in a multi-view consistent way. Moreover, we introduce a novel and effective training strategy that includes pre-training on a large corpus of synthetic multi-view data, followed by fine-tuning on high-quality real images. We demonstrate that our approach both qualitatively and quantitatively outperforms the current state of the art for portrait novel-view synthesis, while remaining time-efficient.
https://arxiv.org/abs/2510.23929
Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.
https://arxiv.org/abs/2509.09595
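The second stage described in the Kling-Avatar abstract above, where sub-clips are generated in parallel between blueprint keyframes using a first-last frame strategy, can be sketched as follows. This is only a structural illustration: frames are stand-in floats, `generate_subclip` is a hypothetical placeholder that linearly interpolates where the real system would call a video generator, and a thread pool stands in for whatever parallelism the authors use.

```python
from concurrent.futures import ThreadPoolExecutor


def keyframe_pairs(keyframes):
    """Pair consecutive keyframes as (first, last) conditions for each sub-clip."""
    return list(zip(keyframes[:-1], keyframes[1:]))


def generate_subclip(pair, frames_per_clip=4):
    """Placeholder generator: linear interpolation between first and last frame."""
    first, last = pair
    step = (last - first) / (frames_per_clip - 1)
    return [first + i * step for i in range(frames_per_clip)]


def generate_video(keyframes, frames_per_clip=4):
    """Generate all sub-clips independently, then stitch them in order."""
    pairs = keyframe_pairs(keyframes)
    with ThreadPoolExecutor() as pool:
        # pool.map preserves input order, so clips stay aligned with their pairs.
        clips = list(pool.map(lambda p: generate_subclip(p, frames_per_clip), pairs))
    video = []
    for index, clip in enumerate(clips):
        # Adjacent clips share a boundary keyframe; keep it only once.
        video.extend(clip if index == 0 else clip[1:])
    return video


video = generate_video([0.0, 3.0, 6.0], frames_per_clip=4)
```

Since each sub-clip depends only on its own pair of keyframes, the clips have no sequential dependency on one another, which is what makes parallel generation of long videos possible in this scheme.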
Recently, personalized portrait generation with text-to-image diffusion models has advanced significantly with Textual Inversion, which has emerged as a promising approach for creating high-fidelity personalized images. Despite its potential, current Textual Inversion methods struggle to maintain consistent facial identity due to semantic misalignments between textual and visual embedding spaces regarding identity. We introduce ID-EA, a novel framework that guides text embeddings to align with visual identity embeddings, thereby improving identity preservation in personalized generation. ID-EA comprises two key components: the ID-driven Enhancer (ID-Enhancer) and the ID-conditioned Adapter (ID-Adapter). First, the ID-Enhancer integrates identity embeddings with a textual ID anchor, refining visual identity embeddings derived from a face recognition model using representative text embeddings. Then, the ID-Adapter leverages the identity-enhanced embedding to adapt the text condition, ensuring identity preservation by adjusting the cross-attention module in the pre-trained UNet model. This process encourages the text features to find the most related visual clues across the foreground snippets. Extensive quantitative and qualitative evaluations demonstrate that ID-EA substantially outperforms state-of-the-art methods in identity preservation metrics while achieving remarkable computational efficiency, generating personalized portraits approximately 15 times faster than existing approaches.
https://arxiv.org/abs/2507.11990
We consider the problem of disentangling 3D from large vision-language models, which we show on generative 3D portraits. This allows free-form text control of appearance attributes like age, hair style, and glasses, and 3D geometry control of face expression and camera pose. In this setting, we assume we use a pre-trained large vision-language model (LVLM; CLIP) to generate from a smaller 2D dataset with no additional paired labels and with a pre-defined 3D morphable model (FLAME). First, we disentangle using canonicalization to a 2D reference frame from a deformable neural 3D triplane representation. But another form of entanglement arises from the significant noise in the LVLM's embedding space that describes irrelevant features. This damages output quality and diversity, but we overcome this with a Jacobian regularization that can be computed efficiently with a stochastic approximator. Compared to existing methods, our approach produces portraits with added text and 3D control, where portraits remain consistent when either control is changed. Broadly, this approach lets creators control 3D generators on their own 2D face data without needing resources to label large data or train large models.
https://arxiv.org/abs/2506.14015
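The abstract above mentions a Jacobian regularization computed efficiently with a stochastic approximator, without giving the exact form. A common way to do this, shown here as an assumption rather than the authors' method, is a Hutchinson-style estimate of the squared Frobenius norm of the Jacobian: average ||J v||^2 over random sign vectors v, with each product J v approximated by a finite difference. The function name and the finite-difference step are illustrative choices.

```python
import random


def jacobian_frobenius_sq(f, x, num_samples=8, eps=1e-4, rng=None):
    """Estimate ||J_f(x)||_F^2 as the mean of ||J v||^2 over random sign vectors v.

    Assumptions: f maps a list of floats to a list of floats, and J v is
    approximated by the forward difference (f(x + eps * v) - f(x)) / eps.
    """
    rng = rng or random.Random(0)
    fx = f(x)
    total = 0.0
    for _ in range(num_samples):
        # Rademacher probe: entries are +1 or -1, so E[v v^T] is the identity.
        v = [rng.choice((-1.0, 1.0)) for _ in x]
        shifted = [xi + eps * vi for xi, vi in zip(x, v)]
        jv = [(a - b) / eps for a, b in zip(f(shifted), fx)]
        total += sum(c * c for c in jv)
    return total / num_samples


# Toy map with Jacobian diag(2, 3), whose squared Frobenius norm is 4 + 9.
estimate = jacobian_frobenius_sq(lambda x: [2.0 * x[0], 3.0 * x[1]], [0.5, -1.0])
```

The appeal of this kind of estimator is that it never forms the full Jacobian: each sample costs one extra function evaluation, independent of the input and output dimensions.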
This study presents a novel approach to enhance the cost-to-quality ratio of image generation with diffusion models. We hypothesize that differences between distilled (e.g. FLUX.1-schnell) and baseline (e.g. FLUX.1-dev) models are consistent and, therefore, learnable within a specialized domain, like portrait generation. We generate a synthetic paired dataset and train a fast image-to-image translation head. Using two sets of low- and high-quality synthetic images, our model is trained to refine the output of a distilled generator (e.g., FLUX.1-schnell) to a level comparable to a baseline model like FLUX.1-dev, which is more computationally intensive. Our results show that the pipeline, which combines a distilled version of a large generative model with our enhancement layer, delivers similar photorealistic portraits to the baseline version with up to an 82% decrease in computational cost compared to FLUX.1-dev. This study demonstrates the potential for improving the efficiency of AI solutions involving large-scale image generation.
https://arxiv.org/abs/2505.02255
Recent advances in Talking Head Generation (THG) have achieved impressive lip synchronization and visual quality through diffusion models; yet existing methods struggle to generate emotionally expressive portraits while preserving speaker identity. We identify three critical limitations in current emotional talking head generation: insufficient utilization of audio's inherent emotional cues, identity leakage in emotion representations, and isolated learning of emotion correlations. To address these challenges, we propose a novel framework dubbed DICE-Talk, which follows the idea of disentangling identity from emotion and then letting emotions with similar characteristics cooperate. First, we develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention, representing emotions as identity-agnostic Gaussian distributions. Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks that explicitly capture inter-emotion relationships through vector quantization and attention-based feature aggregation. Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process through latent-space classification. Extensive experiments on the MEAD and HDTF datasets demonstrate our method's superiority, outperforming state-of-the-art approaches in emotion accuracy while maintaining competitive lip-sync performance. Qualitative results and user studies further confirm our method's ability to generate identity-preserving portraits with rich, correlated emotional expressions that naturally adapt to unseen identities.
https://arxiv.org/abs/2504.18087
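The DICE-Talk abstract above says its Emotion Banks capture inter-emotion relationships through vector quantization. The core operation of vector quantization, mapping a continuous vector to its nearest codebook entry, can be sketched as below. The codebook contents and the function name `quantize` are hypothetical; how the authors train or structure their banks is not stated in the abstract.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry closest to `vector`.

    Closeness is squared Euclidean distance, the usual choice in
    vector quantization.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: squared_distance(vector, codebook[i]))
    return index, codebook[index]


# Hypothetical 2-D "emotion bank" with three entries.
bank = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
index, entry = quantize([0.9, 1.2], bank)
```

Snapping embeddings to a shared, finite set of entries means similar emotions land on the same or neighboring codes, which is one way related emotions can share information.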
We propose a novel framework for ID-preserving generation using a multi-modal encoding strategy rather than injecting identity features via adapters into pre-trained models. Our method treats identity and text as a unified conditioning input. To achieve this, we introduce FaceCLIP, a multi-modal encoder that learns a joint embedding space for both identity and textual semantics. Given a reference face and a text prompt, FaceCLIP produces a unified representation that encodes both identity and text, which conditions a base diffusion model to generate images that are identity-consistent and text-aligned. We also present a multi-modal alignment algorithm to train FaceCLIP, using a loss that aligns its joint representation with face, text, and image embedding spaces. We then build FaceCLIP-SDXL, an ID-preserving image synthesis pipeline by integrating FaceCLIP with Stable Diffusion XL (SDXL). Compared to prior methods, FaceCLIP-SDXL enables photorealistic portrait generation with better identity preservation and textual relevance. Extensive experiments demonstrate its quantitative and qualitative superiority.
https://arxiv.org/abs/2504.14202
Creating a realistic animatable avatar from a single static portrait remains challenging. Existing approaches often struggle to capture subtle facial expressions, the associated global body movements, and the dynamic background. To address these limitations, we propose a novel framework that leverages a pretrained video diffusion transformer model to generate high-fidelity, coherent talking portraits with controllable motion dynamics. At the core of our work is a dual-stage audio-visual alignment strategy. In the first stage, we employ a clip-level training scheme to establish coherent global motion by aligning audio-driven dynamics across the entire scene, including the reference portrait, contextual objects, and background. In the second stage, we refine lip movements at the frame level using a lip-tracing mask, ensuring precise synchronization with audio signals. To preserve identity without compromising motion flexibility, we replace the commonly used reference network with a facial-focused cross-attention module that effectively maintains facial consistency throughout the video. Furthermore, we integrate a motion intensity modulation module that explicitly controls expression and body motion intensity, enabling controllable manipulation of portrait movements beyond mere lip motion. Extensive experimental results show that our proposed approach achieves higher quality with better realism, coherence, motion intensity, and identity preservation. Our project page: this https URL.
https://arxiv.org/abs/2504.04842
Generating naturalistic and nuanced listener motions for extended interactions remains an open problem. Existing methods often rely on low-dimensional motion codes for facial behavior generation followed by photorealistic rendering, limiting both visual fidelity and expressive richness. To address these challenges, we introduce DiTaiListener, powered by a video diffusion model with multimodal conditions. Our approach first generates short segments of listener responses conditioned on the speaker's speech and facial motions with DiTaiListener-Gen. It then refines the transitional frames via DiTaiListener-Edit for a seamless transition. Specifically, DiTaiListener-Gen adapts a Diffusion Transformer (DiT) for the task of listener head portrait generation by introducing a Causal Temporal Multimodal Adapter (CTM-Adapter) to process speakers' auditory and visual cues. CTM-Adapter integrates speakers' input in a causal manner into the video generation process to ensure temporally coherent listener responses. For long-form video generation, we introduce DiTaiListener-Edit, a video-to-video diffusion model for transition refinement. The model fuses the short video segments produced by DiTaiListener-Gen into smooth and continuous videos, ensuring temporal consistency in facial expressions and image quality. Quantitatively, DiTaiListener achieves state-of-the-art performance on benchmark datasets in both photorealism (+73.8% in FID on RealTalk) and motion representation (+6.1% in FD metric on VICO). User studies confirm the superior performance of DiTaiListener, with the model being the clear preference in terms of feedback, diversity, and smoothness, outperforming competitors by a significant margin.
https://arxiv.org/abs/2504.04010
Text-to-image diffusion models excel at generating diverse portraits, but lack intuitive shadow control. Existing editing approaches, applied as post-processing, struggle to offer effective manipulation across diverse styles. Additionally, these methods either rely on expensive real-world light-stage data collection or require extensive computational resources for training. To address these limitations, we introduce Shadow Director, a method that extracts and manipulates hidden shadow attributes within well-trained diffusion models. Our approach uses a small estimation network that requires only a few thousand synthetic images and hours of training, with no costly real-world light-stage data needed. Shadow Director enables parametric and intuitive control over shadow shape, placement, and intensity during portrait generation while preserving artistic integrity and identity across diverse styles. Despite training only on synthetic data built on real-world identities, it generalizes effectively to generated portraits with diverse styles, making it a more accessible and resource-friendly solution.
https://arxiv.org/abs/2503.21943
Personalized portrait synthesis, essential in domains like social entertainment, has recently made significant progress. Methods based on per-person fine-tuning, such as LoRA and DreamBooth, can produce photorealistic outputs but need training on individual samples, which consumes time and resources and risks instability. Adapter-based techniques such as IP-Adapter freeze the foundation model parameters and employ a plug-in architecture to enable zero-shot inference, but they often lack naturalness and authenticity, which cannot be overlooked in portrait synthesis tasks. In this paper, we introduce a parameter-efficient adaptive generation method, namely HyperLoRA, that uses an adaptive plug-in network to generate LoRA weights, merging the superior performance of LoRA with the zero-shot capability of adapter schemes. Through our carefully designed network structure and training strategy, we achieve zero-shot personalized portrait generation (supporting both single and multiple image inputs) with high photorealism, fidelity, and editability.
https://arxiv.org/abs/2503.16944
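To make the LoRA side of the HyperLoRA abstract above concrete, the sketch below applies a standard low-rank update, W' = W + scale * (B A), to a frozen weight matrix. In HyperLoRA the factors would be produced by the adaptive plug-in network from reference images; here they are hard-coded, and the function names and the `scale` parameter are illustrative assumptions rather than the authors' API.

```python
def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*right))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in left]


def apply_lora(weight, lora_a, lora_b, scale=1.0):
    """Return weight + scale * (lora_b @ lora_a) without modifying `weight`.

    Shapes: weight is (out, in), lora_b is (out, rank), lora_a is (rank, in).
    """
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


frozen = [[0.0, 0.0], [0.0, 0.0]]  # stand-in for a frozen base-model weight
lora_b = [[1.0], [2.0]]            # (out=2, rank=1); would come from the plug-in network
lora_a = [[3.0, 4.0]]              # (rank=1, in=2); would come from the plug-in network
adapted = apply_lora(frozen, lora_a, lora_b, scale=0.5)
```

With rank 1, the update needs 4 numbers instead of a full 2x2 matrix; at realistic layer sizes that gap is what makes predicting LoRA factors, rather than full weights, tractable for a plug-in network.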